Personalized speech enhancement Master internship, Lille (France), 2022
Advisors:
— Damien Granger, R&D Engineer, damien.granger@steelseries.com
— Nathan Souviraà-Labastie, R&D Engineer, PhD, nathan.souviraa-labastie@steelseries.com
— Lucas Dislaire, R&D Engineer, lucas.dislaire@steelseries.com

Company description

About GN Group
GN was founded 150 years ago with a truly innovative and global mindset. Today, we honour that legacy with world-leading expertise in the human ear, sound and video processing, wireless technology, miniaturization and collaborations with leading technology partners. GN's solutions are marketed by the brands ReSound, Beltone, Interton, Jabra, BlueParrott, SteelSeries and FalCom in 100 countries. The GN Group employs 6,500 people and is listed on Nasdaq Copenhagen (GN.CO).

About SteelSeries
SteelSeries is the worldwide leader in gaming and esports peripherals focused on premium quality, innovation, and functionality. SteelSeries' family of professional and gaming enthusiasts are the driving force behind the company and help influence, design, and craft every single accessory and the brand's software ecosystem, SteelSeries GG. In 2020, SteelSeries acquired Nahimic, the leader in 3D sound solutions for gaming.

We are currently looking for a machine learning / audio signal processing intern to join the R&D team of SteelSeries' Software & Services Business Unit in our French office (the former Nahimic R&D team).

Internship subject
Audio source separation consists of extracting the different sound sources present in an audio signal, in particular by estimating their frequency distributions and/or spatial positions. Many applications are possible, from karaoke generation to speech denoising. In 2020, our separation approaches [1, 2] were on par with the state of the art [3, 4] on a music separation task. Since then, our speech denoising product has hit the market [5] and the team continues to explore many avenues of improvement (see for instance the project [6, 7]).

Speech-related audio source separation tasks
Speech enhancement [8] or speech denoising usually refers to the task where the signal of interest is a single intelligible speaker drowned in additive noise. Speech separation [9] (sometimes speaker separation) refers to the task of separately retrieving multiple unknown speakers, usually not drowned in additive noise. In the case of personalized speech enhancement (also called target voice separation or target speaker extraction), the signal of interest is a speaker, but a known speaker. This opens the way to (1) potentially better speech enhancement performance, and (2) focusing on one particular speaker where plain speech enhancement would have kept all intelligible speakers.

Personalized speech enhancement
This research area is very recent: it was added only last year to the DNS challenge [10] (note that the DNS challenge tasks are defined as real-time, with a 40 ms constraint on the latency, mainly composed of the look-ahead). VoiceFilter [11] seems to be the first article to tackle the problem of personalized speech enhancement. It uses two separately trained neural networks: a discrimination network that produces speaker-specific embeddings from reference utterances of the target speaker, and a "main" network that performs the actual speech enhancement by taking as input both the corrupted utterance and the target speaker embedding.
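To make the two-network idea concrete, here is a minimal PyTorch sketch of a VoiceFilter-style setup. All class names, layer choices and dimensions are illustrative assumptions, not the actual VoiceFilter (or SteelSeries) implementation: a speaker encoder turns enrollment features into a single embedding, which is broadcast over time and concatenated to each frame of the corrupted spectrogram before a mask is predicted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    # Stand-in for the separately trained discrimination network
    # (e.g. a d-vector model): enrollment features -> one embedding.
    def __init__(self, n_mels=80, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, emb_dim, batch_first=True)

    def forward(self, enroll):                 # (batch, frames, n_mels)
        _, (h, _) = self.lstm(enroll)
        return F.normalize(h[-1], dim=-1)      # (batch, emb_dim)

class ConditionedEnhancer(nn.Module):
    # Stand-in for the "main" network: the target-speaker embedding is
    # broadcast over time, concatenated to every spectrogram frame, and
    # a magnitude mask is predicted and applied to the noisy input.
    def __init__(self, n_freq=257, emb_dim=256, hidden=400):
        super().__init__()
        self.lstm = nn.LSTM(n_freq + emb_dim, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy, emb):             # (B, T, F), (B, E)
        cond = emb.unsqueeze(1).expand(-1, noisy.size(1), -1)
        h, _ = self.lstm(torch.cat([noisy, cond], dim=-1))
        return self.mask(h) * noisy            # masked magnitude spectrogram

# Toy usage: one enrollment utterance conditions the enhancement.
encoder, enhancer = SpeakerEncoder(), ConditionedEnhancer()
embedding = encoder(torch.randn(2, 100, 80))           # enrollment features
enhanced = enhancer(torch.randn(2, 300, 257), embedding)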
VoiceFilter has since been outperformed, e.g., by [12], and the two-step approach tends to prevail in the literature [13, 14, 15]: first, learn a target speaker representation during an enrollment phase, for instance by means of speaker embeddings such as x-vectors or d-vectors [12]; second, feed this representation to the neural network that learns to extract the target speaker's speech. Note, however, that two steps do not necessarily mean two networks: for instance, a jointly trained four-stage network is proposed in [15].

Axes of research
The objective of the internship is to address one or several of the following targets:
— First, a baseline framework needs to be set up. It will require:
  — A dataset tailored for the task: the datasets available in the scientific community do not completely fulfill the requirements of SteelSeries products (description upon request). Conversely, our current datasets partly lack speaker information. Hence, one would need to opt for the best solution or trade-off combination.
  — A speaker embedding baseline, under the assumption that the signal captured during enrollment is "clean", i.e. it only contains the signal of interest (or at least no second speaker).
  — A speech enhancement model with speaker embeddings. The intern could for instance reuse our implementation (currently without speaker embeddings) of E3Net [14].
— Second, once a first baseline has been trained, the candidate could benchmark it on different scenarios (signal-to-noise ratio during enrollment, signal ratio between speakers and effect of additional noise, various and mixed languages). The target speaker over-suppression metric could be used (described in [12]), as well as the standard DNS metrics. This could lead the candidate to work on one of the following items to improve the baseline framework on its identified weaknesses:
  — Testing various speaker embeddings, and various ways or positions of integrating them into the networks.
  — Reassessing separate versus joint training: the separate training of the speaker-encoding network has been found to work better than joint training [16, 17, 15] (multi-task learning is often hard to tune), but this would need to be re-evaluated with the final chosen architecture.
  — Choosing more effective enrollment strategies [18, 19] and adapting them to SteelSeries use cases.
  — Implementing loss functions suitable for the separation task (step 2), for instance following the ideas in [20] or adapting our internal loss; see the sketch after this list.
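As a concrete illustration of the loss-function item above, below is a sketch of the negative scale-invariant SDR (SI-SDR), a loss commonly used for separation and enhancement training. It is a generic, assumed example, not the internal SteelSeries loss nor the exact losses of [20].

import torch

def neg_si_sdr(est, ref, eps=1e-8):
    # Negative scale-invariant SDR: minimising this maximises SI-SDR.
    # est, ref: time-domain signals of shape (batch, samples).
    est = est - est.mean(dim=-1, keepdim=True)   # zero-mean both signals
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference (optimal scaling of the target).
    scale = (est * ref).sum(-1, keepdim=True) / ((ref * ref).sum(-1, keepdim=True) + eps)
    target = scale * ref
    noise = est - target
    ratio = (target * target).sum(-1) / ((noise * noise).sum(-1) + eps)
    return -10.0 * torch.log10(ratio + eps).mean()

# Toy usage on random signals (1 s of audio at 16 kHz).
loss = neg_si_sdr(torch.randn(4, 16000), torch.randn(4, 16000))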
Skills
Who are we looking for? You are preparing an engineering degree or a master's degree, and preferably have knowledge of the development and implementation of advanced machine learning algorithms. Digital audio signal processing skills are a plus. While not mandatory, notions in the following additional fields would be appreciated:
— Audio effects in general: compression, equalization, etc.
— Statistics, probabilistic approaches, optimization.
— Programming languages: Python, PyTorch, Keras, TensorFlow, Matlab.
— Voice recognition, voice command.
— Computer programming and development: Max/MSP, C/C++/C#.
— Audio editing software: Audacity, Adobe Audition, etc.
— Scientific publications and patent applications.
— Fluency in English and French.
— Intellectual curiosity.

References
[1] I. Alaoui Abdellaoui and N. Souviraà-Labastie. "Blending the attention mechanism in TasNet". Working paper or preprint, Nov. 2020.
[2] E. Pierson Lancaster and N. Souviraà-Labastie. "A frugal approach to music source separation". Working paper or preprint, Nov. 2020.
[3] F.-R. Stöter, A. Liutkus and N. Ito. "The 2018 signal separation evaluation campaign". In: International Conference on Latent Variable Analysis and Signal Separation. Springer, 2018, pp. 293-305.
[4] N. Takahashi and Y. Mitsufuji. "Multi-scale multi-band DenseNets for audio source separation". In: 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), June 2017. arXiv: 1706.09588.
[5] ClearCast AI Noise Canceling, promotion video. https://www.youtube.com/watch?v=RD4eXKEw4Lg.
[6] M. Vial and N. Souviraà-Labastie. Learning rate scheduling and gradient clipping for audio source separation. Tech. rep., SteelSeries France, Dec. 2022.
[7] The torch_custom_lr_schedulers GitHub repository. https://github.com/SteelSeries/torch_custom_lr_schedulers.
[8] DNS challenge on the paperswithcode website. https://paperswithcode.com/sota/speech-enhancement-on-deep-noise-suppression.
[9] Speech separation task referenced on the paperswithcode website. https://paperswithcode.com/task/speech-separation.
[10] H. Dubey et al. "ICASSP 2022 deep noise suppression challenge". In: ICASSP 2022, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9271-9275.
[11] Q. Wang et al. "VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking". arXiv preprint arXiv:1810.04826 (2018).
[12] S. E. Eskimez et al. "Personalized speech enhancement: New models and comprehensive evaluation". In: ICASSP 2022, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 356-360.
[13] R. Giri et al. "Personalized PercepNet: Real-time, low-complexity target voice separation and enhancement". arXiv preprint arXiv:2106.04129 (2021).
[14] M. Thakker et al. "Fast real-time personalized speech enhancement: End-to-End Enhancement Network (E3Net) and knowledge distillation". arXiv preprint arXiv:2204.00771 (2022).
[15] C. Xu et al. "SpEx: Multi-scale time domain speaker extraction network". IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), pp. 1370-1384.
[16] K. Žmolíková et al. "Learning speaker representation for neural network based multichannel speaker extraction". In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 8-15.
[17] M. Delcroix et al. "Single channel target speaker extraction and recognition with SpeakerBeam". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5554-5558.
[18] H. Sato et al. "Strategies to improve robustness of target speech extraction to enrollment variations". arXiv preprint arXiv:2206.08174 (2022).
[19] X. Liu, X. Li and J. Serrà. "Quantitative evidence on overlooked aspects of enrollment speaker embeddings for target speaker separation". arXiv preprint arXiv:2210.12635 (2022).
[20] H. Taherian, S. E. Eskimez and T. Yoshioka. "Breaking the trade-off in personalized speech enhancement with cross-task knowledge distillation". arXiv preprint arXiv:2211.02944 (2022).