
(2020-11-14) Master2 Internship, LORIA-INRIA, Nancy, France
  

Master2 Internship: Semantic information from the past in a speech recognition system: does the past help the present?

 

Supervisor:  Irina Illina, MdC, HDR

Team and Laboratory: Multispeech, LORIA-INRIA

Contact: illina@loria.fr


Co-Supervisor:  Dominique Fohr, CR CNRS

Team and Laboratory: Multispeech, LORIA-INRIA

Contact: dominique.fohr@loria.fr

 

Motivation and context

 

Semantic and thematic spaces are vector spaces used to represent words, sentences, or textual documents. The corresponding models and methods have a long history in computational linguistics and natural language processing. Almost all of them rely on the hypothesis of statistical semantics, which states that the statistical patterns of word usage (the contexts in which a word appears) can be used to describe the underlying semantics. The most common way to learn these representations is to predict a word from the context in which it appears [Mikolov et al., 2013; Pennington et al., 2014], which can be done with neural networks. These representations have proved effective for a wide range of natural language processing tasks. In particular, Mikolov's Skip-gram and CBOW models [Mikolov et al., 2013] and the BERT model [Devlin et al., 2019] have become very popular because of their ability to process large amounts of unstructured text at reduced computing cost. The efficiency and the semantic properties of these representations motivate us to explore them in our speech recognition system.
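As a concrete illustration, the short Python sketch below extracts contextual BERT representations for a sentence. It assumes the Hugging Face transformers library and the bert-base-multilingual-cased checkpoint, neither of which is prescribed by this posting; mean-pooling the token vectors is just one simple way to obtain a sentence-level vector.

    # Minimal sketch: contextual embeddings from a pretrained BERT model.
    # Library (Hugging Face transformers) and checkpoint are illustrative
    # assumptions, not choices made in this posting.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModel.from_pretrained("bert-base-multilingual-cased")

    def embed(sentence: str) -> torch.Tensor:
        """Return a mean-pooled contextual sentence vector (768 dims)."""
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        # last_hidden_state has shape (1, num_tokens, 768); average over tokens.
        return outputs.last_hidden_state[0].mean(dim=0)

    print(embed("Silvio Berlusconi, prince de Milan").shape)  # torch.Size([768])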

Robust automatic speech recognition (ASR) remains a very ambitious goal. Despite constant efforts and some dramatic advances, the ability of a machine to recognize speech is still far from equaling that of human beings. Current ASR systems see their performance decrease significantly when the conditions in which they are used differ from those under which they were trained. The causes of variability may be related to the acoustic environment, the sound capture equipment, a change of microphone, etc.

 

Objectives

Our speech recognition (ASR) system [Povey et al., 2011] is supplemented by a semantic analysis that detects the words of the processed sentence that could have been misrecognized and finds words with a similar pronunciation that better match the context [Level et al., 2020]. For example, the sentence « Silvio Berlusconi, prince de Milan » ('prince of Milan') can be recognized by the ASR system as « Silvio Berlusconi, prince de mille ans » ('prince of a thousand years'). A good semantic representation of the sentence context could help to find and correct this error. This semantic analysis re-evaluates (rescores) the N-best transcription hypotheses and can be seen as a form of dynamic adaptation in the case of noisy speech data. The analysis is performed by combining predictive representations based on continuous vectors; all our models rely on deep neural networks (DNNs), and we use the BERT model for this purpose. The approach consists of two modules: a semantic analysis module and a rescoring module that re-evaluates the sentence hypotheses taking the semantic information into account.
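To make the rescoring step concrete, here is a hedged Python sketch in which the semantic score of a hypothesis is its cosine similarity to a context vector (using the embed helper sketched above) and is linearly interpolated with the ASR score. The function names and the weight alpha are illustrative assumptions, not the actual implementation of [Level et al., 2020].

    # Hypothetical N-best rescoring: interpolate each hypothesis's ASR
    # score with a semantic score (cosine similarity to a context vector).
    # Assumes both scores are normalised to comparable ranges.
    import torch
    import torch.nn.functional as F

    def rescore_nbest(nbest, context_vec, embed, alpha=0.7):
        """nbest: list of (hypothesis_text, asr_score) pairs.
        embed: callable mapping a text to a torch vector.
        Returns the hypotheses sorted by the combined score."""
        scored = []
        for text, asr_score in nbest:
            semantic = F.cosine_similarity(embed(text), context_vec, dim=0).item()
            combined = alpha * asr_score + (1.0 - alpha) * semantic
            scored.append((combined, text))
        return sorted(scored, reverse=True)

On the example above, a context vector built from a political news story should favour « prince de Milan » over the acoustically similar « prince de mille ans », provided the interpolation weight is tuned on development data.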

The semantic module significantly improves the performance of the speech recognition system, but we would like to go beyond the semantic information of the current sentence. Indeed, the previous sentences can sometimes help to understand and to recognize the current one. The Master internship will be devoted to the innovative study of taking previously recognized sentences into account to improve the recognition of the current sentence. Research will be conducted on combining semantic information from one or several past sentences with semantic information from the current sentence, as sketched below. As deep neural networks (DNNs) can model complex functions and achieve outstanding performance, they will be used in all our modeling. The performance of the different modules will be evaluated on artificially noisy speech data.
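As a purely illustrative starting point, the history could be summarised by pooling the vectors of the last few recognized sentences and passed as the context vector of the rescoring sketch above. The history length k and the mean-pooling are assumptions; finding better ways to combine past and current semantic information is precisely the object of the internship.

    # Illustrative only: summarise the k previously recognized sentences
    # as a single context vector for rescoring the current sentence.
    import torch

    def history_context(past_sentences, embed, k=3):
        """Mean-pool the vectors of the last k recognized sentences;
        returns None when no history is available yet."""
        recent = past_sentences[-k:]
        if not recent:
            return None
        return torch.stack([embed(s) for s in recent]).mean(dim=0)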

 

Required skills: background in statistics and natural language processing, and programming skills (Perl, Python).

Candidates should email a detailed CV together with their diplomas.

 

Bibliography

[Devlin et al., 2019] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

[Level et al., 2020] Level, S., Illina, I., and Fohr, D. (2020). Introduction of semantic model to help speech recognition. In Proceedings of the International Conference on Text, Speech and Dialogue (TSD).

[Mikolov et al., 2013] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111-3119.

[Pennington et al., 2014] Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543.

[Povey et al., 2011] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlíček, P., Qian, Y., Schwarz, P., Silovský, J., Stemmer, G., and Veselý, K. (2011). The Kaldi speech recognition toolkit. In Proc. IEEE ASRU.

 

