ISCA - International Speech Communication Association

ISCApad #272

Wednesday, February 10, 2021 by Chris Wellekens

6-28 (2020-12-16) Master 2 / PFE internship at GIPSA-lab (Grenoble)

Master's / PFE internship 2020-2021

REAL-TIME SILENT SPEECH SYNTHESIS BASED ON END-TO-END DEEP LEARNING MODELS

Context

Various pathologies affect the voice sound source, i.e. the vibration of the vocal folds, and thus prevent any sound production even though the articulators (jaw, tongue, lips, etc.) function normally: this is known as silent speech. Silent speech interfaces [Denby et al., 2010] convert inaudible cues, such as articulator movements, into an audible speech signal in order to rehabilitate the speaker’s voice. At GIPSA-lab, we have a system that measures the articulators using ultrasound imaging and video and converts these data into acoustic parameters that describe a speech signal, using machine learning [Hueber and Bailly, 2016, Tatulli and Hueber, 2017]. The speech signal is then reconstructed from the predicted acoustic parameters using a vocoder [Imai et al., 1983]. Current silent speech interfaces have two main limitations: 1) the intonation (or speech melody), normally produced by the vibration of the vocal folds, is absent in this type of pathology and is difficult to reconstruct from articulatory information alone; 2) the quality of the generated speech is often limited by the type of vocoder used. Although the recent emergence of neural vocoders has brought a leap in the quality of speech synthesis [van den Oord et al., 2016], they have not yet been integrated into silent speech interfaces, where the constraint of real-time generation is crucial.

Objectives

In this internship, we propose to address these two problems by implementing an end-to-end silent speech synthesis system with deep learning models. In particular, the work will consist of interfacing our system for articulation measurement and acoustic parameter generation with the LPCNet neural vocoder [Valin and Skoglund, 2019]. LPCNet takes as input the acoustic parameters derived from articulation on the one hand, and the intonation on the other hand. This separation makes it possible to decorrelate the two controls, for example by proposing a gestural control of the intonation [Perrotin, 2015].

Regarding the acoustic parameters, the first step will be to adapt the acoustic output of our system to match the input of LPCNet. Moreover, LPCNet is trained by default on acoustic parameters extracted from natural speech, for which large databases are available. In contrast, the acoustic parameters predicted from silent speech are degraded and available only in small quantities. We will thus study the robustness of LPCNet to degraded input, and explore several re-training strategies (adaptation of LPCNet to new data, end-to-end learning, etc.). Once the system is functional, the second part of the internship will consist of implementing it in real time, so that the speech synthesis is generated synchronously with the user’s articulation. All stages of the implementation (learning strategies, real-time system) will be evaluated in terms of intelligibility, sound quality, and intonation reconstruction.
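As an illustration of the two decorrelated controls described above, the following minimal sketch assembles one vocoder feature frame per incoming articulatory frame, taking the spectral part from an articulation-driven predictor and the pitch part from an independent control. It is only a sketch under stated assumptions: the 18 + 2 feature layout follows the description in the LPCNet paper (18 Bark-scale cepstral coefficients plus pitch period and pitch correlation), and the function names `assemble_frames`, `dummy_predictor` and `flat_pitch` are hypothetical placeholders, not part of the lab system or of the LPCNet code.

```python
from typing import Callable, Iterable, Iterator, List

# Assumed feature layout, following the LPCNet paper's description;
# a given LPCNet implementation may expect a different layout.
N_CEPSTRAL = 18  # Bark-scale cepstral coefficients per frame
N_PITCH = 2      # pitch period and pitch correlation


def assemble_frames(
    articulatory_frames: Iterable[List[float]],
    predict_cepstrum: Callable[[List[float]], List[float]],
    pitch_control: Callable[[int], List[float]],
) -> Iterator[List[float]]:
    """Yield one vocoder feature frame per incoming articulatory frame.

    The generator processes frames one at a time, so it can be fed by a
    live stream as well as by a recorded sequence.
    """
    for index, frame in enumerate(articulatory_frames):
        cepstrum = predict_cepstrum(frame)  # spectral part, from articulation
        pitch = pitch_control(index)        # intonation, from a separate control
        if len(cepstrum) != N_CEPSTRAL or len(pitch) != N_PITCH:
            raise ValueError("unexpected feature dimensions")
        yield cepstrum + pitch


# Hypothetical stand-ins for the trained mapping and the gestural controller.
def dummy_predictor(frame: List[float]) -> List[float]:
    mean = sum(frame) / len(frame)
    return [mean] * N_CEPSTRAL


def flat_pitch(index: int) -> List[float]:
    return [0.5, 1.0]  # arbitrary constant pitch parameters


frames = list(assemble_frames([[0.0, 2.0], [1.0, 3.0]], dummy_predictor, flat_pitch))
```

Because the pitch values come from their own callable, the intonation source can be swapped (predicted, gestural, constant) without touching the articulatory-to-acoustic mapping.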

Tasks

The tasks expected during this internship are:

Implement the full silent speech synthesis pipeline by interfacing the lab ultrasound system with LPCNet, and explore training strategies.

Evaluate the performance of the system regarding speech quality and reconstruction errors.

Implement and evaluate a real-time version of the system.
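The evaluation tasks above could be supported by small helper functions such as the following. This is a hedged sketch, not the lab's evaluation protocol: mel-cepstral distortion is used here as one common example of a reconstruction error on cepstral parameters, the real-time factor is one usual way to check that synthesis keeps up with the input, and the 10 ms frame duration is an assumption. The function names are hypothetical.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(
    reference: Sequence[Sequence[float]],
    predicted: Sequence[Sequence[float]],
) -> float:
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    Per frame: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2).
    The caller decides which coefficients to pass (e.g. without energy).
    """
    if not reference or len(reference) != len(predicted):
        raise ValueError("sequences must be non-empty and of equal length")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, pred in zip(reference, predicted):
        if len(ref) != len(pred):
            raise ValueError("frames must have the same dimension")
        total += scale * math.sqrt(sum((r - p) ** 2 for r, p in zip(ref, pred)))
    return total / len(reference)


def real_time_factor(
    processing_seconds: float,
    n_frames: int,
    frame_seconds: float = 0.010,  # assumed 10 ms frames
) -> float:
    """Processing time divided by audio duration; below 1 means faster than real time."""
    return processing_seconds / (n_frames * frame_seconds)


mcd_identical = mel_cepstral_distortion([[0.1, 0.2]], [[0.1, 0.2]])
mcd_unit_error = mel_cepstral_distortion([[0.0, 0.0]], [[1.0, 0.0]])
rtf = real_time_factor(processing_seconds=0.5, n_frames=100)
```

Intelligibility and perceived sound quality are not captured by such objective measures and would still require listening tests.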

Required skills

Signal processing and machine learning.

Knowledge of Python and C is required for implementation.

Knowledge of the Max/MSP environment would be a plus for the real-time implementation.

Strong motivation for methodology and experimentation.

Allowance

The internship allowance is fixed by ministerial decree (about 570 euros / month).

Grenoble Images Parole Signal Automatique

UMR CNRS 5216 – Grenoble Campus

38400 Saint Martin d’Hères - FRANCE

Contact

Olivier PERROTIN, +33 4 76 57 45 36, olivier.perrotin@grenoble-inp.fr

Thomas HUEBER, +33 4 76 57 49 40, thomas.hueber@grenoble-inp.fr

References

[Denby et al., 2010] Denby, B., Schultz, T., Honda, K., Hueber, T., Gilbert, J. M., and Brumberg, J. S. (2010). Silent speech interfaces. Speech Communication, 52(4):270–287.

[Hueber and Bailly, 2016] Hueber, T. and Bailly, G. (2016). Statistical conversion of silent articulation into audible speech using full-covariance HMM. Computer Speech & Language, 36(Supplement C):274–293.

[Hueber et al., 2010] Hueber, T., Benaroya, E.-L., Chollet, G., Denby, B., Dreyfus, G., and Stone, M. (2010). Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips. Speech Communication, 52(4):288–300.

[Imai et al., 1983] Imai, S., Sumita, K., and Furuichi, C. (1983). Mel log spectrum approximation (MLSA) filter for speech synthesis. Electronics and Communications in Japan (Part I: Communications), 66(2):10–18.

[Perrotin, 2015] Perrotin, O. (2015). Chanter avec les mains: Interfaces chironomiques pour les instruments de musique numériques. PhD thesis, Université Paris-Sud, Orsay, France.

[Tatulli and Hueber, 2017] Tatulli, E. and Hueber, T. (2017). Feature extraction using multimodal convolutional neural networks for visual speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), ICASSP ’17, pages 2971–2975, New Orleans, LA, USA.

[Valin and Skoglund, 2019] Valin, J.-M. and Skoglund, J. (2019). LPCNet: Improving neural speech synthesis through linear prediction. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), ICASSP ’19, pages 5891–5895, Brighton, UK. IEEE.

[van den Oord et al., 2016] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. CoRR, abs/1609.03499.
