(2020-12-03) 6 months internship at GIPSA-Lab, Grenoble, France
ISCApad #271
Monday, January 11, 2021 by Chris Wellekens
Deep learning-based speech coding and synthesis in adverse conditions

Project: Vokkero 2023
Type: Internship, 6 months, start of 2021
Offer: vogo-bernin-pfe-2
Contact: r.vincent@vogo.fr
Keywords: neural vocoding, deep learning, speech synthesis, training dataset, normalisation

Summary: The project consists in evaluating the performance of the LPCNet neural vocoder for speech coding and decoding under adverse conditions (noisy environment, varied speech styles, etc.) and in proposing learning techniques to improve the quality of synthesis.

1 The company VOGO and Gipsa-lab

Vogo is an SME based in Montpellier, in the south of France (www.vogo-group.com). Vogo is the first Sportech listed on Euronext Growth and develops solutions that enrich the experience of fans and professionals during sporting events. Its brand Vokkero is specialised in the design and production of radio communication systems (www.vokkero.com). It offers solutions for teams working in very noisy environments and is notably a world reference in the professional sports refereeing market.

Gipsa-lab is a CNRS research unit joint with Grenoble-INP (Grenoble Institute of Technology) and Université Grenoble Alpes. With 350 people, including about 150 doctoral students, Gipsa-lab is a multidisciplinary research unit developing both basic and applied research on complex signals and systems. Gipsa-lab is internationally recognised for its research in automatic control, signal and image processing, and speech and cognition, and develops projects in the strategic areas of energy, environment, communication, intelligent systems, life and health, and language engineering.

2 The Vokkero 2023 project

Every 3 years, Vokkero renews its hardware (radio, CPU) and software (RTE, audio processing) platforms in order to design new generations of products. The project extends over several years of study, and it is within this framework that the internship is proposed. In the form of a partnership with Gipsa-lab, the project consists in the study of speech coding using neural-network approaches, in order to reach performance not yet attained by classical approaches. The student will work at Gipsa-lab, in the CRISSP team of the Speech and Cognition cluster, under the supervision of Olivier PERROTIN, research fellow at CNRS, and in the R&D department of Vogo Bernin, with Rémy VINCENT, project leader on the Vogo side.

3 Context & Objectives

The project consists in evaluating the performance of the LPCNet neural vocoder for speech coding and decoding under adverse conditions (noisy environment, varied speech styles, etc.) and in proposing learning techniques to improve the quality of synthesis.

3.1 Context

Vocoders (voice coders) are models that allow a speech signal to be first reduced to a small set of parameters (speech analysis, or coding) and then reconstructed from these parameters (speech synthesis, or decoding). This coding/decoding process is essential in telecommunication applications, where speech is coded, transmitted and then decoded at the receiver. The challenge is to minimise the quantity of information transmitted while keeping the quality of the reconstructed speech signal as high as possible. Current techniques use high-quality speech signal models, with a constraint on algorithmic complexity to ensure real-time processing in embedded systems. Examples of widely used codecs are Speex (Skype) and its little brother, Opus (Zoom). A few orders of magnitude: Opus converts a stream sampled at 16 kHz into a bitstream at 16 kbit/s (i.e. a compression ratio of 1:16); the reconstructed signal is also at 16 kHz, with 20 ms of latency.
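To make these figures concrete, the short Python sketch below recomputes the compression ratios quoted in this section. It assumes 16-bit PCM as the uncompressed reference (the offer states only the 16 kHz sampling rate, so the bit depth is an assumption); the 4 kbit/s LPCNet figure is the one discussed just below.

    # Back-of-the-envelope compression ratios for the codecs discussed here.
    # Assumption: the uncompressed reference is 16-bit PCM (the offer only
    # states the 16 kHz sampling rate, not the bit depth).
    SAMPLE_RATE_HZ = 16_000
    BITS_PER_SAMPLE = 16

    raw_bitrate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE  # 256,000 bit/s of raw PCM

    for codec, coded_bitrate in [("Opus", 16_000), ("LPCNet", 4_000)]:
        print(f"{codec}: 1:{raw_bitrate // coded_bitrate}")
    # prints: Opus: 1:16  and  LPCNet: 1:64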
Since 2016, a new type of vocoder has emerged: the neural vocoder. Based on deep neural network architectures, neural vocoders are able to generate a speech signal from the classical input parameters of a vocoder, without a priori knowledge of an explicit speech model, using machine learning instead. The first system, Google's WaveNet [1], is capable of reconstructing a signal almost identical to natural speech, but at a very high computational cost (20 seconds to generate a sample, at 16,000 samples per second). Since then, models have been simplified and are now capable of generating speech in real time (WaveRNN [2], WaveGlow [3]). In particular, the LPCNet neural vocoder [4, 5], developed by Mozilla, is able to convert a stream sampled at 16 kHz into a 4 kbit/s bitstream and to reconstruct a 16 kHz audio signal from it. This mix of super-compression and bandwidth extension yields equivalent compression ratios much higher than 1:16!

However, the ability of these systems to generate high-quality speech has only been evaluated after training on large and homogeneous databases, e.g. 24 hours of speech read by a single speaker and recorded in a quiet environment [6]. In the Vokkero application, on the other hand, speech is recorded in adverse conditions (very noisy environments) and presents significant variability (spoken voice, shouted voice, multiple referees, etc.). Is a neural vocoder trained on a read-speech database capable of decoding this type of speech? If not, is it possible to train the model on such data, even though they are only available in small quantities? The aim of this internship is to explore the limits of the LPCNet vocoder when applied to the decoding of referee speech. Various learning strategies (curriculum training, transfer learning, learning on augmented data, etc.) will then be explored to try to adapt pre-trained models to our data; a minimal illustration of the transfer-learning idea follows.
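Here is a minimal, self-contained Python sketch of the usual transfer-learning recipe: start from pre-trained weights, freeze the layers assumed to generalise across domains, and fine-tune the remainder at a low learning rate on the small in-domain set. The toy GRU model is a stand-in chosen for brevity, not the actual LPCNet architecture, and every name and file below is hypothetical.

    # Hedged sketch of fine-tuning a pre-trained neural vocoder on a small
    # in-domain dataset. The ToyVocoder is NOT the real LPCNet architecture.
    import torch
    import torch.nn as nn

    class ToyVocoder(nn.Module):
        def __init__(self, n_feats=20, hidden=64, n_quant=256):
            super().__init__()
            self.frame_net = nn.Linear(n_feats, hidden)   # conditioning network
            self.sample_net = nn.GRU(hidden, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_quant)        # quantised-sample classes

        def forward(self, feats):
            h = torch.tanh(self.frame_net(feats))
            out, _ = self.sample_net(h)
            return self.head(out)

    model = ToyVocoder()
    # In the real setting, read-speech pre-trained weights would be loaded here:
    # model.load_state_dict(torch.load("readspeech_pretrained.pt"))  # hypothetical

    for p in model.frame_net.parameters():
        p.requires_grad = False  # freeze the part assumed to transfer across domains

    optimiser = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-5)  # low LR

    feats = torch.randn(8, 100, 20)            # stand-in for in-domain features
    targets = torch.randint(0, 256, (8, 100))  # stand-in for quantised samples

    for step in range(100):
        optimiser.zero_grad()
        logits = model(feats)                  # (batch, time, classes)
        loss = nn.functional.cross_entropy(logits.transpose(1, 2), targets)
        loss.backward()
        optimiser.step()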
3.2 Tasks

The student will evaluate the performance of a pre-trained LPCNet vocoder on referee speech data and will propose learning strategies to adapt the model to these new data, in a coding/re-synthesis scenario:

1. Get familiar with the system; performance evaluation on an audio-book database (baseline);
2. Evaluation of LPCNet on the Vokkero database and identification of its limits (ambient noise, pre-processing, voice styles, etc.);
3. Study of strategies to improve system performance by data augmentation:
— Creation of synthetic, task-specific databases: noisy atmospheres, shouted voices (a minimal noise-mixing sketch follows this list);
— Recording campaigns on Vokkero systems, in anechoic rooms and/or in real conditions if the sanitary situation allows it;
— Comparison of the two approaches under various learning strategies to train a new model from these data.
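As a concrete starting point for the first augmentation item above, the sketch below mixes clean speech with ambient noise at a controlled signal-to-noise ratio. The signals here are synthetic placeholders, not real Vokkero recordings.

    # Minimal sketch of one augmentation step: mixing speech and noise at a
    # target SNR to build a synthetic noisy training set.
    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        """Scale `noise` so the speech-to-noise power ratio equals `snr_db` (dB)."""
        noise = np.resize(noise, speech.shape)  # loop or trim noise to match length
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12   # avoid division by zero
        gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
        return speech + gain * noise

    sr = 16_000
    t = np.arange(sr) / sr
    speech = 0.5 * np.sin(2 * np.pi * 220 * t)        # placeholder "speech" tone
    noise = np.random.default_rng(0).normal(size=sr)  # placeholder crowd noise

    # Progressively harder conditions, e.g. for curriculum-style training.
    noisy_sets = {snr: mix_at_snr(speech, noise, snr) for snr in (20, 10, 0)}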
3.3 Required Skills

The student is expected to have a solid background in speech signal processing and an interest in Python development. Experience in programming deep learning models in Python is a plus. The student is expected to show curiosity for research, scientific rigour in methodology and experimentation, and autonomy in technical and organisational matters. Depending on the candidate's motivation, and subject to obtaining funding, it is possible to pursue this topic as a PhD thesis.

The student will be able to subscribe to the company's insurance scheme, will receive luncheon vouchers, and will be paid a monthly gratuity of 800 €.

References

[1] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio", CoRR, vol. abs/1609.03499, 2016. arXiv: 1609.03499.
[2] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient Neural Audio Synthesis", CoRR, vol. abs/1802.08435, 2018. arXiv: 1802.08435.
[3] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A Flow-based Generative Network for Speech Synthesis", in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK: IEEE, May 2019, pp. 3617-3621.
[4] J.-M. Valin and J. Skoglund, "LPCNet: Improving Neural Speech Synthesis through Linear Prediction", in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK: IEEE, May 2019, pp. 5891-5895.
[5] J.-M. Valin and J. Skoglund, "A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet", in Proceedings of Interspeech, Graz, Austria: ISCA, Sep. 2019, pp. 3406-3410.
[6] P. Govalkar, J. Fischer, F. Zalkow, and C. Dittmar, "A Comparison of Recent Neural Vocoders for Speech Signal Reconstruction", in Proceedings of the 10th ISCA Speech Synthesis Workshop (SSW10), Vienna, Austria, Sep. 2019.