(2020-12-16) Master 2 / PFE internship at GIPSA-lab (Grenoble)
ISCApad #272
Wednesday, February 10, 2021 by Chris Wellekens
Master's / PFE internship 2020-2021

REAL-TIME SILENT SPEECH SYNTHESIS BASED ON END-TO-END DEEP LEARNING MODELS

Context

Various pathologies affect the voice sound source, i.e. the vibration of the vocal folds, and thus prevent any sound production even though the articulators (jaw, tongue, lips, etc.) function normally: this is known as silent speech. Silent speech interfaces [Denby et al., 2010] convert inaudible cues, such as articulator movements, into an audible speech signal to rehabilitate the speaker’s voice. At GIPSA-lab, we have a system that measures articulator movements using ultrasound imaging and video, and converts these data into acoustic parameters that describe a speech signal, using machine learning [Hueber and Bailly, 2016, Tatulli and Hueber, 2017]. The speech signal is then reconstructed from the predicted acoustic parameters using a vocoder [Imai et al., 1983].

Current silent speech interfaces have two main limitations:
1) The intonation (or speech melody), normally produced by the vibration of the vocal folds, is absent in the type of pathology considered and is difficult to reconstruct from articulatory information alone.
2) The quality of the generated speech is often limited by the type of vocoder used. While the recent emergence of neural vocoders has brought a leap in the quality of speech synthesis [van den Oord et al., 2016], they have not yet been integrated into silent speech interfaces, where the constraint of real-time generation is crucial.

Objectives

In this internship, we propose to address these two problems by implementing an end-to-end silent speech synthesis system with deep learning models. In particular, the work will consist in interfacing our system for articulation measurement and acoustic parameter generation with the LPCNet neural vocoder [Valin and Skoglund, 2019]. The latter takes as input the acoustic parameters derived from articulation on the one hand, and the intonation on the other hand. This distinction makes it possible to decorrelate the two controls, for example by proposing a gestural control of the intonation [Perrotin, 2015].

Regarding the acoustic parameters, the first step will be to adapt the acoustic output of our system to match the input of LPCNet. Moreover, LPCNet is trained by default on acoustic parameters extracted from natural speech, for which large databases are available. However, the acoustic parameters predicted from silent speech are degraded and available only in small quantities. We will thus study the robustness of LPCNet to a degraded input, and explore several re-training strategies (adaptation of LPCNet to new data, end-to-end learning, etc.).

Once the system is functional, the second part of the internship will consist in implementing it in real time, so that the speech synthesis is generated synchronously with the user’s articulation. All stages of implementation (learning strategies, real-time system) will be evaluated in terms of intelligibility, sound quality, and intonation reconstruction.

Tasks

The tasks expected during this internship are:
- Implement the full silent speech synthesis pipeline by interfacing the lab's ultrasound system with LPCNet, and explore training strategies.
- Evaluate the performance of the system regarding speech quality and reconstruction errors.
- Implement and evaluate a real-time version of the system.

Required skills

- Signal processing and machine learning.
- Knowledge of Python and C is required for implementation.
- Knowledge of the Max/MSP environment would be a plus for real-time implementation.
- Strong motivation for methodology and experimentation.

Allowance

The internship allowance is fixed by ministerial decree (about 570 euros / month).

Contact

Olivier PERROTIN, +33 4 76 57 45 36, olivier.perrotin@grenoble-inp.fr
Thomas HUEBER, +33 4 76 57 49 40, thomas.hueber@grenoble-inp.fr

Grenoble Images Parole Signal Automatique (GIPSA-lab), UMR CNRS 5216, Grenoble Campus, 38400 Saint Martin d’Hères, FRANCE

References

[Denby et al., 2010] Denby, B., Schultz, T., Honda, K., Hueber, T., Gilbert, J. M., and Brumberg, J. S. (2010). Silent speech interfaces. Speech Communication, 52(4):270–287.
[Hueber and Bailly, 2016] Hueber, T. and Bailly, G. (2016). Statistical conversion of silent articulation into audible speech using full-covariance HMM. Computer Speech & Language, 36(Supplement C):274–293.
[Hueber et al., 2010] Hueber, T., Benaroya, E.-L., Chollet, G., Denby, B., Dreyfus, G., and Stone, M. (2010). Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips. Speech Communication, 52(4):288–300.
[Imai et al., 1983] Imai, S., Sumita, K., and Furuichi, C. (1983). Mel log spectrum approximation (MLSA) filter for speech synthesis. Electronics and Communications in Japan (Part I: Communications), 66(2):10–18.
[Perrotin, 2015] Perrotin, O. (2015). Chanter avec les mains: Interfaces chironomiques pour les instruments de musique numériques. PhD thesis, Université Paris-Sud, Orsay, France.
[Tatulli and Hueber, 2017] Tatulli, E. and Hueber, T. (2017). Feature extraction using multimodal convolutional neural networks for visual speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), ICASSP ’17, pages 2971–2975, New Orleans, LA, USA.
[Valin and Skoglund, 2019] Valin, J.-M. and Skoglund, J. (2019). LPCNet: Improving neural speech synthesis through linear prediction. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), ICASSP ’19, pages 5891–5895, Brighton, UK. IEEE.
[van den Oord et al., 2016] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. CoRR, abs/1609.03499.