ISCA - International Speech
Communication Association



ISCApad #302

Friday, August 11, 2023 by Chris Wellekens

6-12 (2023-03-08) PhD in ML/Speech Processing @LIA, Avignon, France
  
PhD in ML/Speech Processing – Speaker recognition systems against voice attacks: voice synthesis and voice transformation

Starting date: September 1st, 2023 (flexible)

Application deadline: July 10th, 2023

Interviews (tentative): July 15th, 2023

Salary: ~2000€ gross/month (social security included)

Mission: research oriented (teaching possible but not mandatory)

Keywords: speech processing, automatic speaker recognition, anti-spoofing, deep neural network

 
CONTEXT
It is now widely accepted that automatic speaker verification (ASV) systems are vulnerable not only to speech produced artificially by text-to-speech (TTS) synthesis [1], but also to other forms of attack such as voice conversion (VC) and replay [2]. Voice conversion, which can be used to manipulate the voice identity of a speech signal, has progressed extremely rapidly in recent years [3] and has become a serious threat.

Progress in deep neural network training has enabled spectacular advances in text-to-speech (TTS) and voice conversion (VC): DeepVoice, Tacotron 1 and 2 [4], AutoVC [5,6]. Existing architectures now make it possible to produce synthesized or manipulated artificial voices with a realism close to, or equal to, that of human voices [4]. At the same time, voice conversion algorithms (from one speaker to another) have made equally spectacular advances: it is now possible to clone a voice identity using only a small amount of data, with extremely significant progress made in the space of two years [5,6,7]. The ability of these algorithms to forge voice identities capable of deceiving speaker recognition and countermeasure systems is an urgent research topic.

Progress in the fight against voice identity theft has been driven by the ASVspoof initiative, a community formed in 2013 and recognized internationally as the reference effort in this area [8]. The most significant work has concerned the acoustic parametrization (front-end), making it possible to better differentiate authentic (human) utterances from fraudulent ones. The best-performing system [9] combined Mel cepstral coefficients, cochlear filter cepstral coefficients and instantaneous frequency features with a classifier based on a Gaussian mixture model.
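The GMM-based approach described above can be sketched as follows. This is a minimal illustration, not the system of [9]: the feature arrays are synthetic stand-ins for real cepstral frames, and the model sizes are chosen only for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical stand-ins for 20-dim cepstral feature frames; in a real
# system these would come from MFCC/CQCC extraction of speech audio.
bona_fide_feats = rng.normal(0.0, 1.0, size=(500, 20))
spoofed_feats = rng.normal(0.7, 1.2, size=(500, 20))

# One GMM per class, as in classic ASVspoof GMM baselines.
gmm_bona = GaussianMixture(n_components=4, random_state=0).fit(bona_fide_feats)
gmm_spoof = GaussianMixture(n_components=4, random_state=0).fit(spoofed_feats)

def llr_score(utterance_frames):
    """Average per-frame log-likelihood ratio: positive -> bona fide."""
    return gmm_bona.score(utterance_frames) - gmm_spoof.score(utterance_frames)

# Score held-out material from each class.
test_bona = rng.normal(0.0, 1.0, size=(100, 20))
test_spoof = rng.normal(0.7, 1.2, size=(100, 20))
bona_llr = llr_score(test_bona)
spoof_llr = llr_score(test_spoof)
```

A decision is then made by thresholding the log-likelihood ratio, typically at a threshold calibrated on a development set.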

In recent years, research efforts have focused on the back-end. As in speaker recognition research, the anti-spoofing community has embraced the power of deep learning and, unsurprisingly, the neural architectures used are almost the same. Advances in anti-spoofing have followed the rapid advances in TTS and VC. The best anti-spoofing system again used traditional acoustic parameters, this time with a classifier based on ResNet-18 [10].
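The front-end/back-end split above can be illustrated with a toy neural back-end. Here a small scikit-learn MLP stands in for the deep residual classifier of [10], and the embeddings are synthetic placeholders for pooled acoustic features; none of this reflects the actual architecture or data of that system.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)

# Stand-in for utterance-level acoustic embeddings (e.g. pooled
# spectral frames); labels: 0 = bona fide, 1 = spoofed.
X_train = np.vstack([rng.normal(0.0, 1.0, size=(300, 32)),
                     rng.normal(0.8, 1.0, size=(300, 32))])
y_train = np.array([0] * 300 + [1] * 300)

# A small MLP as a stand-in for a deep back-end classifier.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                    random_state=1).fit(X_train, y_train)

X_test = np.vstack([rng.normal(0.0, 1.0, size=(50, 32)),
                    rng.normal(0.8, 1.0, size=(50, 32))])
y_test = np.array([0] * 50 + [1] * 50)
acc = clf.score(X_test, y_test)
```

The key design point is that the front-end features are fixed while only the back-end is trained, in contrast to the end-to-end approach discussed under the scientific objectives below.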

 
SCIENTIFIC OBJECTIVES
As part of this thesis, the robustness of existing countermeasures against new forms of adversarial attacks designed specifically to deceive them will be assessed. One of the expected contributions of this thesis is the design of new countermeasures to detect such emerging, increasingly adversarial attacks. Two avenues will be explored. The first is to redesign front-end feature extraction to capture cues that characterize adversarial attacks, then use them with re-trained classifiers. Since reliable hand-crafted characteristics are not always easy to identify, the second avenue is the adoption of end-to-end architectures able to learn characteristics automatically. While these advances should improve robustness to adversarial attacks, it will be important to ensure that the resulting countermeasures remain robust to previous attacks; this is known as the generalization problem. An effective anti-spoofing countermeasure must reliably detect any form of attack it encounters, not just the specific attacks it was trained to detect. Finally, improving adversarial attack detection should not come at the cost of increased false positives (genuine speech labeled as spoofed speech), which hurt usability and convenience. The outcome targeted in this thesis is therefore a set of countermeasures capable of defending speaker recognition systems against both adversarial and non-adversarial attacks.
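The trade-off between missed attacks and false positives described above is commonly summarized by the equal error rate (EER), the operating point where the two error rates coincide. A minimal sketch with made-up scores (higher score = more bona fide):

```python
import numpy as np

def equal_error_rate(bona_scores, spoof_scores):
    """EER: the operating point where the false-alarm rate (genuine
    speech rejected) equals the miss rate (spoofed speech accepted).
    Computed here by scanning candidate thresholds."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    best = 1.0
    for t in thresholds:
        far = np.mean(bona_scores < t)    # genuine flagged as spoof
        frr = np.mean(spoof_scores >= t)  # spoof accepted as genuine
        best = min(best, max(far, frr))
    return best

# Illustrative detector scores (fabricated for the example):
bona = np.array([2.0, 1.5, 1.0, 0.5])
spoof = np.array([-1.0, -0.5, 0.0, 0.8])
eer = equal_error_rate(bona, spoof)
```

Generalization can then be assessed by measuring the EER separately on attack types seen and unseen during training.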

In parallel with this competition between research teams specializing in attacks and those specializing in countermeasures, the speaker recognition community is focused on the creation and design of high-performance systems that are robust to acoustic variability. Recognition systems are trained to recognize speakers in increasingly difficult conditions (several types of noise: additive, reverberation, etc.). This robustness to difficult acoustic conditions can create weaknesses against attack recordings that were not taken into account during training. This vulnerability can of course be reduced with countermeasure (CM) systems, but doing so can impact the usability of ASV systems, since countermeasures may also reject genuine clients (authentic users). This thesis will therefore go beyond the state of the art by optimizing the ASV and CM systems jointly, so that they work together to achieve the best possible compromise between security and usability/convenience.
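The simplest way to combine the two systems is a tandem (cascade) decision, in which the countermeasure gates the verifier. The sketch below assumes uncalibrated scalar scores and illustrative thresholds; a real deployment would calibrate both thresholds jointly, e.g. against a tandem cost metric.

```python
def tandem_decision(asv_score, cm_score, asv_thr=0.0, cm_thr=0.0):
    """Cascade decision: accept a trial only if the countermeasure
    deems it bona fide AND the speaker verifier accepts it.
    Thresholds are illustrative, not calibrated values."""
    return (cm_score >= cm_thr) and (asv_score >= asv_thr)

# Illustrative trials (all scores fabricated):
genuine_target = tandem_decision(asv_score=1.2, cm_score=0.8)
spoofed_target = tandem_decision(asv_score=1.5, cm_score=-0.9)
genuine_nontarget = tandem_decision(asv_score=-0.7, cm_score=0.6)
```

Raising `cm_thr` blocks more attacks but rejects more genuine clients, which is precisely the security/usability compromise the thesis aims to optimize.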


REQUIRED SKILLS
- Master 2 in speech processing, computer science or data science
- Good command of Python programming and deep learning frameworks
- Previous experience in bias in machine learning would be a plus
- Good communication skills in English
- Good command of French would be a plus but is not mandatory


LAB AND SUPERVISION
The PhD position will be co-supervised by Nicholas Evans from EURECOM and Driss Matrouf from LIA-Avignon. Joint meetings are planned on a regular basis and the student is expected to spend time at LIA-Avignon. The student will collaborate closely with the partners: IRCAM, specialized in attack generation, and EURECOM, specialized in countermeasures.


INSTRUCTIONS FOR APPLYING
Applications must contain: CV + letter/message of motivation + Master's transcripts + readiness to provide letter(s) of recommendation; and be addressed to Driss Matrouf (driss.matrouf@univ-avignon.fr), Mickael Rouvier (mickael.rouvier@univ-avignon.fr) and Nicholas Evans (evans@eurecom.fr).


REFERENCES
[1] https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-1567157402
[2] N. Evans, T. Kinnunen and J. Yamagishi, "Spoofing and countermeasures for automatic speaker verification", in Proc. Interspeech 2013, pp. 925-929, 2013.
[3] Y. Zhao et al., "Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion", in Proc. ISCA Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020.
[4] J. Shen et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions", in Proc. ICASSP, 2018.
[5] K. Qian, Y. Zhang, S. Chang, X. Yang and M. Hasegawa-Johnson, "AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss", in Proc. ICML, pp. 5210-5219, 2019.
[6] J.-X. Zhang, Z.-H. Ling and L.-R. Dai, "Non-Parallel Sequence-to-Sequence Voice Conversion With Disentangled Linguistic and Speaker Representations", IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:540-552, 2020.
[7] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, ... and Y. Wu, "Transfer learning from speaker verification to multispeaker text-to-speech synthesis", in Advances in Neural Information Processing Systems, pp. 4480-4490, 2018.
[8] N. Evans, T. Kinnunen and J. Yamagishi, "Spoofing and countermeasures for automatic speaker verification", in Proc. Interspeech 2013, pp. 925-929, 2013.
[9] T. B. Patel and H. A. Patil, "Combining Evidences from Mel Cepstral, Cochlear Filter Cepstral and Instantaneous Frequency Features for Detection of Natural vs. Spoofed Speech", in Proc. Interspeech 2015, pp. 2062-2066, 2015.
[10] X. Cheng, M. Xu and T. F. Zheng, "Replay detection using CQT-based modified group delay feature and ResNeWt network in ASVspoof 2019", in Proc. APSIPA ASC, pp. 540-545, 2019.

 
