ISCApad Archive » 2023 » ISCApad #298 » Jobs » (2023-02-01) Master or engineer internship at Loria, Nancy, France
Friday, April 07, 2023 by Chris Wellekens
Master or engineer internship at Loria (France)
Development of language model for business agreement use cases
Duration: 6 months, starting February or March 2023
Location: Loria (Nancy) and Station F, 5 Parvis Alan Turing, 75013, Paris
Supervision: Tristan Thommen (tristan@koncile.ai), Irina Illina (illina@loria.fr) and Jean-Charles Lamirel (jean-charles.lamirel@loria.fr)
Please apply by sending your CV and a short motivation letter directly to Tristan Thommen and Irina Illina.
Motivations and context
The use of pre-trained models such as Embeddings from Language Models (ELMo) (Peters et al., 2018), Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), Robustly optimized BERT approach (RoBERTa) (Liu et al., 2019c), and Generative Pre-Trained Transformer (GPT) (Radford et al., 2018) has proved to be state-of-the-art for various Natural Language Processing (NLP) tasks. These models are trained on a huge unlabeled corpus and can easily be fine-tuned for various downstream tasks using task-specific datasets. Fine-tuning involves adding new task-specific layers to the model and updating the pre-trained model parameters while learning the new task-specific layers.
Objectives
The goal of the internship is to develop a language model specific to business agreement use cases. This model should be able to identify and extract non-trivial information from a large mass of procurement contracts, in English and in French. This information consists, on the one hand, of simple contract identification data such as the signature date, names of the parties, contract title, and signatories, and, on the other hand, of more complex information to be deduced from clauses, in particular price determination according to parameters such as date or volume, renewal or expiry, and the obligations of the parties together with their conditions. The difficulty of this task is that this information is not standardized and may be represented in different ways and in different places in an agreement. For instance, a price could be based on a formula defined in the articles of the agreement and an index defined in one of its appendices.
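As a minimal illustration of the simpler identification fields, a rule-based baseline could extract, say, the signature date before any model is trained; the patterns and field below are hypothetical examples, not taken from an actual contract corpus:

```python
import re
from typing import Optional

# Hypothetical baseline patterns for one identification field (signature date).
# Real contracts express this in many non-standard ways, which is exactly why
# a fine-tuned language model is needed beyond such rules.
DATE_PATTERNS = [
    re.compile(r"signed on (\d{1,2}\s+\w+\s+\d{4})", re.IGNORECASE),
    re.compile(r"date of signature\s*:\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
]

def extract_signature_date(text: str) -> Optional[str]:
    """Return the first matched signature date, or None if no pattern fires."""
    for pattern in DATE_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None

snippet = "This Agreement was signed on 12 January 2023 by both Parties."
print(extract_signature_date(snippet))  # prints: 12 January 2023
```

Such a baseline covers only the most regular phrasings; the information deduced from clauses (price formulas, renewal conditions) is out of reach for rules, which motivates the fine-tuned model described below.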
To develop this language model, we propose to fine-tune a pre-trained language model on a dataset of business agreements. The intern will identify a relevant pre-trained language model, prepare the data for training, and adjust the fine-tuning parameters.
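If the extraction is framed as token classification (a common choice when fine-tuning BERT-style models), the data preparation step converts annotated contracts into per-token BIO labels. A minimal sketch, assuming the dataset provides character-level span annotations `(start, end, label)` — an assumed format, not Koncile's actual one:

```python
def simple_tokenize(text):
    """Whitespace tokenizer returning (token, start, end) triples."""
    out, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        out.append((tok, start, start + len(tok)))
        pos = start + len(tok)
    return out

def spans_to_bio(tokens, spans):
    """Map character-level spans (start, end, label) onto per-token BIO tags."""
    tags = []
    for _, tok_start, tok_end in tokens:
        tag = "O"
        for span_start, span_end, label in spans:
            if tok_start >= span_start and tok_end <= span_end:
                # First token of the span gets B-, the rest get I-.
                tag = ("B-" if tok_start == span_start else "I-") + label
                break
        tags.append(tag)
    return tags

text = "Signed by Acme Corp on 12 January 2023"
spans = [(10, 19, "PARTY"), (23, 38, "DATE")]
print(spans_to_bio(simple_tokenize(text), spans))
# -> ['O', 'O', 'B-PARTY', 'I-PARTY', 'O', 'B-DATE', 'I-DATE', 'I-DATE']
```

In practice the tokenizer would be the pre-trained model's own subword tokenizer, and these label sequences would feed the token-classification head added during fine-tuning.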
A particularity of the internship is the use of real-world data from the management of business agreements. Datasets will be provided by Koncile’s clients and partners, and the models developed during this internship will be put directly into practice and tested with end users.
Koncile (link) is a start-up based in Paris, founded in 2022, that tackles the issue of mismanagement of procurement agreements by companies. It intends to leverage natural language processing techniques to analyze supplier contracts and invoicing. Koncile is incubated by Entrepreneur First and hosted at Station F in Paris.
Additional information and requirements
Good Python programming skills and basic knowledge of Natural Language Processing techniques are required. Some notions of machine learning, both theoretical and practical (e.g., using PyTorch), are a plus.
References
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019c). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI Technical Report.