ISCApad #298 |
Friday, April 07, 2023 by Chris Wellekens |
6-1 | (2022-10-12) Internships at AVA France We have two six-month internship proposals (M2/Master 2 level) on speech diarization at Ava France ( https://www.ava.me/ ) in Paris (remote work possible). Feel free to apply:
Best regards,
Alexey Ozerov AI Research Lead at Ava
| ||||
6-2 | (2022-10-25) Postdoc@Telecom Paris (France) Post-Doctoral Position on Neural Models for Dialog Analysis Matthieu Labeau, Gaël Guibon, Chloé Clavel
| ||||
6-3 | (2022-10-14) Internships @ IRIT Toulouse, France The SAMoVA team at IRIT in Toulouse is offering several internships (M1, M2, final-year engineering project) in 2023 on the following topics (non-exhaustive list):
- atypical speech processing
- modeling of swallowing
- speech transcription and understanding (spoken language understanding)
- speaker segmentation and clustering (speaker diarization)
- textual description of audio (audio captioning)
All details (topics, contacts) are available in the 'Jobs' section of the team's website:
https://www.irit.fr/SAMOVA/site/jobs/
Hervé Bredin
| ||||
6-4 | (2022-10-12) AI Engineer @ LNE, France We are actively looking for an operations engineer for the Artificial Intelligence and Cybersecurity Evaluation department of LNE:
The successful candidate will join a fast-growing team that specializes in the evaluation of AI systems and works across many fields (NLP, image processing, intelligent medical devices, autonomous mobility systems, agricultural robots, cobots, etc.).
I am available to discuss this position.
Thank you in advance for your applications and for sharing this offer. See you soon!
Guillaume AVRIN, PhD, Testing and Certification Directorate
Laboratoire national de métrologie et d'essais
| ||||
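6-5 | (2022-10-12) Positions @ Donders Institute, Radboud University, Nijmegen, The Netherlands (announcement relayed from University of Texas at El Paso, TX, USA) Two 3-year postdoc positions testing gesture-speech synchrony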
We're looking for two smart and motivated postdocs to join the Speech Perception in Audiovisual Communication lab (SPEAC; https://hrbosker.github.io) at the Donders Institute, Radboud University, Nijmegen, The Netherlands.
Keywords: multimodal prosody, audiovisual speech perception, gesture-speech synchrony, motion-tracking, MEG
>>> PD1
You will test both the production and perception of gesture-speech alignment in nine different languages, including free-stress, fixed-stress, and lexical tone languages. The production strand uses motion-tracking of 2D videos in Mediapipe and acoustic analyses in Praat to quantify gesture-speech alignment on a millisecond timescale. The perception strand involves running psychoacoustic tests with audiovisual stimuli manipulated to vary in the synchrony between hands and spoken prosody. Combining production and perception data will reveal how language-specific variability in gesture-speech alignment shapes the language-specific use of gestural timing in speech perception.
>>> PD2
You will use rapid invisible frequency tagging (RIFT) in MEG to pinpoint the neurobiological mechanisms underlying gesture-speech integration. Specifically, you will test how simple up-and-down beat gestures influence lexical stress perception in real time, using the 'manual McGurk effect' (Bosker & Peeters, 2021, Proc Roy Soc B). Furthermore, you will compare typical behavioural and neural signatures of gesture-speech integration to those in individuals with autism spectrum disorder (ASD) who are known to demonstrate impairments in prosody processing and audiovisual integration. Finally, you will run a large-scale correlational study testing whether the participants' own gestural timing behaviour is linked to their use of gestural timing in audiovisual speech perception.
3-year contract; employment for 0.8 FTE. Gross monthly salary: min €3,974 - max €5,439 (based on a 38-hour working week; scale 11). Apply by December 1, 2022. Preferred starting date: March 1, 2023.
Contact: Hans Rutger Bosker, HansRutger.Bosker@donders.ru.nl
This mail was sent through the SProSIG mailing list, which is for announcements of interest to the speech prosody research community. To subscribe/unsubscribe, send mail to list@sprosig.org.
Nigel Ward, Professor of Computer Science, University of Texas at El Paso CCSB 3.0408, +1-915-747-6827 nigel@utep.edu https://www.cs.utep.edu/nigel/
| ||||
6-6 | (2022-10-17) Research internships @ LIUM, Le Mans France We are offering two research internships (M2/Master 2 level) on speech processing at LIUM, Le Mans Université.
All details are available in the 'Recrutements' section of the laboratory's website, under the 'Stages' tab:
Please forward this to any students looking for such an opportunity.
Best regards,
Meysam Shamsi
| ||||
6-7 | (2022-10-20) Postdoc in Educational Data Mining/Learning Analytics, University of Colorado Boulder, CO, USA
Location: University of Colorado Boulder, Boulder CO, USA
Work type: Full time
Employment type: Research faculty
Anticipated Start Date: Spring 2023 (desired), Summer 2023, or Fall 2023
Salary: $70k-$100k (depending on experience and qualifications)
Position Duration: 1-3 years. The initial contract is for one year; a second-year contract is based on performance, and extension to a third year and beyond is possible.
Brief Job Summary: In this position, you will develop and apply computational techniques to analyze data from students’ log files in conjunction with other multimodal signals (e.g., speech, facial expressions, learning artifacts) during small group human tutoring, intelligent tutoring, and collaborative problem solving. You will also assist with integrating computational models into educational technologies where their performance and impact can be assessed in the real world.
Please visit the job details page below for more information and to apply:
| ||||
6-8 | (2022-10-25) OPEN POSITIONS @ ELDA, Paris (France)
| ||||
6-9 | (2022-11-08) Post doc and engineering positions @ LORIA-INRIA, Nancy, France
Automatic speech recognition for non-native speakers in a noisy environment
Post-doctoral and engineer positions
Starting date: beginning of 2023
Duration: 24 months for a post-doc position and 12 months for an engineer position
Supervisors: Irina Illina, Associate Professor, HDR Lorraine University LORIA-INRIA Multispeech Team, illina@loria.fr
Context
When a person's hands are busy with a task such as driving a car or piloting an airplane, voice is a fast and efficient means of interaction. In aeronautical communications, English is most often compulsory. Unfortunately, many pilots are not native English speakers: they speak with an accent that depends on their native language and is influenced by its pronunciation mechanisms. Inside an aircraft cockpit, the pilots' non-native speech and the surrounding noise are the most difficult challenges to overcome for efficient automatic speech recognition (ASR). The problems of non-native speech are numerous: incorrect or approximate pronunciations, errors of agreement in gender and number, use of non-existent words, missing articles, grammatically incorrect sentences, etc. The acoustic environment adds a disturbing component to the speech signal. Much of the success of speech recognition relies on the ability to account for different accents and ambient noise in the models used by ASR.
Automatic speech recognition has made great progress thanks to the spectacular development of deep learning. In recent years, end-to-end automatic speech recognition, which directly optimizes the probability of the output character sequence given the input acoustic features, has advanced considerably [Chan et al., 2016; Baevski et al., 2020; Gulati et al., 2020].
Objectives
The recruited person will have to develop methodologies and tools to obtain high-performance non-native automatic speech recognition in the aeronautical context and more specifically in a (noisy) aircraft cockpit.
This project will be based on an end-to-end automatic speech recognition system [Shi et al., 2021] using wav2vec 2.0 [Baevski et al., 2020]. This model is one of the most efficient of the current state of the art. This wav2vec 2.0 model enables self-supervised learning of representations from raw audio data (without transcription).
How to apply: Interested candidates are encouraged to contact Irina Illina (illina@loria.fr) with the required documents (CV, transcripts, motivation letter, and recommendation letters).
Requirements & skills:
- M.Sc. or Ph.D. degree in speech/audio processing, computer vision, machine learning, or in a related field,
- ability to work independently as well as in a team,
- solid programming skills (Python, PyTorch), and deep learning knowledge,
- good level of written and spoken English.
References
[Baevski et al., 2020] A. Baevski, H. Zhou, A. Mohamed, and M. Auli. Wav2vec 2.0: A framework for self-supervised learning of speech representations, 34th Conference on Neural Information Processing Systems (NeurIPS 2020), 2020.
[Chan et al., 2016] W. Chan, N. Jaitly, Q. Le and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960-4964, 2016.
[Chorowski et al., 2017] J. Chorowski, N. Jaitly. Towards better decoding and language model integration in sequence to sequence models. Interspeech, 2017.
[Houlsby et al., 2019] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly. Parameter-efficient transfer learning for NLP. International Conference on Machine Learning, PMLR, pp. 2790–2799, 2019.
[Gulati et al., 2020] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang. Conformer: Convolution-augmented transformer for speech recognition. Interspeech, 2020.
[Shi et al., 2021] X. Shi, F. Yu, Y. Lu, Y. Liang, Q. Feng, D. Wang, Y. Qian, and L. Xie. The accented english speech recognition challenge 2020: open datasets, tracks, baselines, results and methods. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6918–6922, 2021.
| ||||
6-10 | (2022-11-12) One-year position @Institut Mines Telecom Atlantique, Nantes, France In the framework of the European/Japanese e-VITA project (https://www.e-vita.coach/),
| ||||
6-11 | (2022-11-15) Two internships @ Zaion, Paris, France We are offering two internships at Zaion (M2 level).
Please forward this to any students looking for such opportunities.
- Machine learning for intelligent routing of incoming calls in call centers:
- Automatic generation of dialogue summaries:
| ||||
6-12 | (2022-11-16) Post-doc @LaBRI, Bordeaux, France The Bordeaux Computer Science Laboratory (LaBRI) is currently looking to fill a 1-year post-doctoral position in the framework of the FVLLMONTI European project (http://www.fvllmonti.eu).
The University of Bordeaux invites applications for a 1-year full-time postdoctoral researcher in Automatic Speech Recognition. The position is part of the FVLLMONTI project on efficient speech-to-text translation on embedded autonomous devices, funded by the European Community.
To apply, please send by email a single PDF file containing a full CV (including publication list), cover letter (describing your personal qualifications, research interests and motivation for applying), evidence of software development experience (active Github/Gitlab profile or similar), two of your key publications, contact information of two referees and academic certificates (PhD, Diploma/Master, Bachelor certificates).
Details on the position are given below:
Job description: Post-doctoral position in Automatic Speech Recognition
Duration: 12 months
Starting date: tentatively 03/01/2023
Project: European FETPROACT project FVLLMONTI (started January 2021)
Location: Bordeaux Computer Science Lab. (LaBRI CNRS UMR 5800), Bordeaux, France (Image and Sound team)
Salary: from 2 086,45€ to 2 304,88€/month (estimated net salary after taxes, according to experience)
Contact: jean-luc.rouas@labri.fr
Job description: The applicant will be in charge of optimizing our state-of-the-art Automatic Speech Recognition and Machine Translation systems for English and French built using the ESPNET framework (https://espnet.github.io/espnet/) for end-to-end deep neural networks. The objective is to match the specifications and constraints of the designed systems to the requirements of other partners of the project specialized in hardware (close work with EPFL https://www.epfl.ch/labs/esl/).
In particular, the applicant will carry on the work of previous postdoctoral researchers on compression techniques (e.g. pruning, quantization, etc.) applied to Transformer and Conformer based networks to reduce memory and energy consumption while keeping an eye on performance (WER and BLEU scores). When a satisfactory trade-off is reached, more exploratory work is to be carried out on using emotion/attitude/affect recognition on the speech samples to supply additional information to the translation system.
Context of the project: The aim of the FVLLMONTI project is to build a lightweight autonomous in-ear device allowing speech-to-speech translation. Today, pocket-talk devices integrate IoT products requiring internet connectivity, which is generally energy inefficient. While machine translation (MT) and Natural Language Processing (NLP) performance has greatly improved, embedded lightweight energy-efficient hardware remains elusive. Existing solutions based on artificial neural networks (NNs) are computation-intensive and energy-hungry, requiring server-based implementations, which also raises data protection and privacy concerns. Today, 2D electronic architectures suffer from 'unscalable' interconnect and are thus still far from being able to compete with biological neural systems in terms of real-time information-processing capabilities with comparable energy consumption. Recent advances in materials science, device technology and synaptic architectures have the potential to fill this gap with novel disruptive technologies that go beyond conventional CMOS technology. A promising solution comes from vertical nanowire field-effect transistors (VNWFETs) to unlock the full potential of truly unconventional 3D circuit density and performance.
Required skills:
- PhD in Automatic Speech Recognition (preferred) or Machine Translation using deep neural networks
- Knowledge of most widely used toolboxes/frameworks (pytorch, espnet)
- Good programming skills in Python
- Good communication skills (frequent interactions with hardware specialists)
- Interest in hardware design will be a plus
Selected references:
S. Karita et al., 'A Comparative Study on Transformer vs RNN in Speech Applications,' 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), SG, Singapore, 2019, pp. 449-456, doi: 10.1109/ASRU46091.2019.9003750.
Leila Ben Letaifa, Jean-Luc Rouas. Transformer Model Compression for End-to-End Speech Recognition on Mobile Devices. 2022 30th European Signal Processing Conference (EUSIPCO), Aug 2022, Belgrade, Serbia.
Leila Ben Letaifa, Jean-Luc Rouas. Fine-grained analysis of the transformer model for efficient pruning. 2022 International Conference on Machine Learning and Applications (ICMLA), Dec 2022, Nassau, Bahamas.
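For readers unfamiliar with the two compression techniques named in this job description, here is a minimal sketch of what pruning and quantization do to a weight vector. It is an illustration under simple assumptions (unstructured magnitude pruning, symmetric 8-bit uniform quantization, plain Python lists); the function names are hypothetical, and this is not the project's actual method nor an ESPnet API.

```python
from typing import List, Tuple


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by increasing absolute value; the first n_prune are zeroed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric uniform quantization to integers in [-127, 127].

    Returns the integer codes and the scale needed to map them back to floats.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


# Illustrative weights (made up for this sketch).
w = [0.02, -1.27, 0.5, -0.01, 0.9, 0.0]
pruned = magnitude_prune(w, sparsity=0.5)
codes, scale = quantize_int8(pruned)
print(pruned)
print(codes, scale)
```

In practice, each such step is followed by re-measuring WER or BLEU to check that the trade-off between model size and accuracy remains acceptable.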
| ||||
6-13 | (2022-11-17) M2 internship offers LORIA - MULTISPEECH, Nancy France
https://team.inria.fr/multispeech/fr/
To apply, please send your CV and a short motivation letter directly to the supervisors of the corresponding offer.
Offer 1: Contrastive Learning for Hate Speech Detection
General information
Supervisors: Nicolas Zampieri, Irina Illina, Dominique Fohr
Address: LORIA, Campus Scientifique - BP 239, 54506 Vandœuvre-lès-Nancy
Phone: 03 54 95 84 06
Email: firstname.lastname@loria.fr
Office: C 145
Motivation and context
The United Nations defines hate speech as 'any type of communication through speech, writing or behavior, which denigrates a person or group based on who they are, i.e. their religion, ethnicity, nationality, or other identity factor.' We are interested in hate speech posted on social networks. With the expansion of social networks (Twitter, Facebook, etc.), the number of messages posted every day has increased dramatically. It is very difficult and expensive to process the millions of messages posted every day in order to remove hateful content. Thus, automatic methods are required to moderate the influx. Automatic hate speech detection is a difficult task in the field of natural language processing (NLP) [6]. With the appearance of transformer-based language models like BERT [3], new state-of-the-art models have emerged for hate speech detection, like HateBERT [1]. Current NLP models rely strongly on efficient learning algorithms. We are particularly interested in one of them: contrastive learning. Contrastive learning is employed to learn an embedding space such that pairs of similar sentences have close representations. Rethmeier and Augenstein [5] provide a summary of different models based on contrastive learning in language processing.
Goals and Objectives
The goal of this internship is to study contrastive learning in the context of hate speech detection.
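To make the idea of 'pairs of similar sentences have close representations' concrete, the following is a minimal sketch of one common contrastive objective: an InfoNCE-style loss with in-batch negatives, of the kind used by SimCSE-like methods. It is an assumption for illustration, not the team's model. Embeddings are plain Python lists, the names `cosine` and `info_nce_loss` are hypothetical, and the temperature value is only an illustrative hyperparameter.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(
    anchors: List[Sequence[float]],
    positives: List[Sequence[float]],
    temperature: float = 0.05,
) -> float:
    """Mean InfoNCE loss over a batch.

    anchors[i] should be close to positives[i]; every other positive in the
    batch acts as a negative for anchor i.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair as the correct class.
        losses.append(log_sum - (logits[i] - m))
    return sum(losses) / len(losses)


# Toy 2-D embeddings (made up): matched pairs point in similar directions.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [aligned[1], aligned[0]]
print(info_nce_loss(anchors, aligned))  # small: pairs already agree
print(info_nce_loss(anchors, swapped))  # large: pairs disagree
```

Minimizing this loss pulls each sentence toward its paired sentence and pushes it away from the others in the batch; a supervised variant would form the pairs from sentences that share a label.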
We believe that using this methodology will make the models more effective. Our model learns to estimate whether two sentences have the same sentiment or not. Based on the first model, the intern will explore other approaches to contrastive learning, such as SimCSE [4] or Dual Contrastive Learning [2] models. The studied methods will be validated on several datasets to assess the robustness of the approach. In our team, we have several labeled corpora from social networks. The internship workplan is as follows: at the beginning, the student will conduct a state-of-the-art study on recent developments in hate speech detection and contrastive learning in NLP. The student will then implement the selected methods. Finally, the performance of the different implemented methods will be evaluated on several hate speech corpora and compared to the state of the art.
Required Skills
The candidate should have experience with deep learning, including good practice in Python and an understanding of deep learning libraries like Keras, PyTorch or TensorFlow.
Additional information
The student intern will join the MULTISPEECH team. The team provides access to the computational resources (GPU, CPU and datasets) needed to carry out the research.
References
[1] Caselli, T., Basile, V., Mitrović, J., & Granitzer, M. HateBERT: Retraining BERT for Abusive Language Detection in English. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021) (pp. 17-25). ACL. doi:10.18653/v1/2021.woah-1.3. August 2021.
[2] Chen, Q., Zhang, R., Zheng, Y., & Mao, Y. Dual Contrastive Learning: Text Classification via Label-Aware Data Augmentation. https://arxiv.org/abs/2201.08702. 2022.
[3] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), (pp. 4171-4186). 2019.
[4] Gao, T., Yao, X., & Chen, D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. ACL. doi:10.18653/v1/2021.emnlp-main.552. 2021.
[5] Rethmeier, N., & Augenstein, I. A Primer on Contrastive Pretraining in Language Processing: Methods, Lessons Learned and Perspectives. https://arxiv.org/abs/2102.12982. 2021.
[6] Zampieri, N., Ramisch, C., Illina, I., & Fohr, D. Identification of Multiword Expressions in Tweets for Hate Speech Detection. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 202-210, 2022. European Language Resources Association.
Offer 2: Diffusion-based Deep Generative Models for Audio-visual Speech Modeling
General information
Supervisors: Mostafa SADEGHI, Romain SERIZEL
Address: LORIA, Campus Scientifique - BP 239, 54506 Vandœuvre-lès-Nancy
Email: mostafa.sadeghi@inria.fr, romain.serizel@loria.fr
Motivation
Recently, diffusion models have gained much attention due to their powerful generative modeling performance, in terms of both the diversity and quality of the generated samples [1]. A diffusion model involves two phases. During the so-called forward diffusion process, input data are mapped into Gaussian noise by gradually perturbing the data. Then, during a reverse process, a denoising neural network is learned that removes the added noise at each step, starting from pure Gaussian noise, to eventually recover the original clean data. Diffusion models have found numerous successful applications, particularly in computer vision, e.g., text-conditioned image synthesis, outperforming previous generative models, including variational autoencoders (VAEs), generative adversarial networks (GANs), and normalizing flows (NFs).
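As an illustration of the forward diffusion process described above, here is a minimal sketch assuming one common formulation (a DDPM-style variance-preserving process with a linear noise schedule). In that formulation a noisy sample at step t can be drawn directly as x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, where abar_t is the cumulative product of (1 - beta_s) and eps is standard Gaussian noise. The schedule values and function names are assumptions chosen for illustration, not details taken from the internship offer.

```python
import math
import random
from typing import List


def linear_betas(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Linearly spaced per-step noise variances beta_1 .. beta_T."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas: List[float]) -> List[float]:
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def diffuse(x0: List[float], t: int, abar: List[float], rng: random.Random) -> List[float]:
    """Sample x_t given x_0: mean sqrt(abar_t) * x0, variance (1 - abar_t)."""
    signal = math.sqrt(abar[t])
    noise = math.sqrt(1.0 - abar[t])
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]


abar = alpha_bars(linear_betas(1000))
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]            # toy "clean" signal (made up)
print(diffuse(x0, 0, abar, rng))    # early step: still close to x0
print(diffuse(x0, 999, abar, rng))  # last step: almost pure Gaussian noise
```

The reverse process trains a network to undo these perturbations step by step; a conditional variant would feed that network extra inputs, such as lip images in the audio-visual setting.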
Diffusion models have also been successfully applied to audio and speech signals, e.g., for audio synthesis [2] and speech enhancement [3].
Goal and objectives
Despite their rapid progress and expanding range of applications, diffusion models have not yet been applied to audiovisual speech modeling. This task involves joint modeling of audio and visual modalities, where the latter concerns the lip movements of the speaker, as there is a correlation between what is being said and the lip movements. This joint modeling effectively incorporates the complementary information of the visual modality for speech generation. Such a framework has already been established based on VAEs [4]. Given the great potential and advantages of diffusion models, in this project we would like to develop a diffusion-based audio-visual generative modeling framework, where the generation of the audio modality, i.e., speech, is conditioned on the visual modality, i.e., lip images, similarly to text-conditioned image synthesis. This might then serve as an efficient representation learning framework for downstream tasks, e.g., audio-visual speech enhancement (AVSE) [4].
A background in statistical signal processing, computer vision, machine learning, and deep learning frameworks (Python, PyTorch) is preferred. Interested candidates should send an email to the supervisors with a detailed CV and transcripts.
Work environment
This master internship is part of the REAVISE project: 'Robust and Efficient Deep Learning based Audiovisual Speech Enhancement' (2023-2026) funded by the French National Research Agency (ANR). The general objective of REAVISE is to develop a unified AVSE framework that leverages recent methodological breakthroughs in statistical signal processing, machine learning, and deep neural networks in order to design a robust and efficient AVSE framework.
The intern will be supervised by Mostafa Sadeghi (researcher, Inria) and Romain Serizel (associate professor, University of Lorraine), as members of the MULTISPEECH team, and will benefit from the research environment, expertise, and computational resources (GPU & CPU) of the team.
References
[1] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, Y. Shao, W. Zhang, B. Cui, and M. H. Yang, Diffusion models: A comprehensive survey of methods and applications, arXiv preprint arXiv:2209.00796, 2022.
[2] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, Diffwave: A versatile diffusion model for audio synthesis, arXiv preprint arXiv:2009.09761, 2020.
[3] Y. J. Lu, Z. Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, Conditional diffusion probabilistic model for speech enhancement, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
[4] M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, Audio-visual speech enhancement using conditional variational auto-encoders, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1788-1800, 2020.
Offer 3: Efficient Attention-based Audio-visual Fusion Mechanisms for Speech Enhancement
General information
Supervisors: Mostafa SADEGHI, Romain SERIZEL
Address: LORIA, Campus Scientifique - BP 239, 54506 Vandœuvre-lès-Nancy
Email: mostafa.sadeghi@inria.fr, romain.serizel@loria.fr
Motivation
Audiovisual speech enhancement (AVSE) is defined as the task of improving the quality and intelligibility of a noisy speech signal by utilizing the complementary information provided by the visual modality, i.e., lip movements of the speaker [1]. The visual modality is especially important in high-noise situations, as it is less affected by acoustic noise. Because of that, AVSE could be exploited in several practical applications, including hearing assistive devices.
Numerous works have already studied the integration of the visual modality with the audio modality to improve the performance of speech enhancement. The majority of audiovisual speech enhancement algorithms rely on deep neural networks and supervised learning, and therefore require very large audiovisual datasets with diverse noise instances to generalize well. A recently introduced AVSE approach is based on unsupervised learning [2,3], where during a training phase, the statistical distribution of clean speech is learned from a clean audiovisual dataset. This is done using a deep generative model, e.g. a variational autoencoder (VAE) [4]. Then, at test (inference) time, the learned distribution is combined with a noise model to estimate the clean speech signal from the available noisy speech observations.
Goal and objectives
An important element of AVSE is audio-visual feature fusion, which should robustly and efficiently combine the two modalities. Current fusion mechanisms used for unsupervised AVSE are based on simple feature concatenation, which is not effective, as it treats the two feature streams on an equal basis. In fact, the audio modality usually contributes more than the visual modality, but in general, their contributions should be robustly balanced and weighted. In this project, we are going to develop efficient feature fusion modules based on attention models [5], which have proven very successful in different applications. The designed fusion module is supposed to robustly and efficiently incorporate the potentially different uncertainty (reliability) levels of the two modalities. We will then evaluate its effectiveness for AVSE.
A background in statistical signal processing, probabilistic machine learning, optimization, and programming languages & deep learning frameworks (Python, PyTorch) is preferred. Interested candidates should send an email to the supervisors with a detailed CV and transcripts.
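The contrast drawn above between plain concatenation and weighted fusion can be sketched as follows. This is a hypothetical, minimal example of attention-style fusion: each modality's feature vector is scored against a query vector with a scaled dot product, the scores are turned into weights with a softmax, and the fused feature is the weighted sum. The query, the feature values and the function names are assumptions for illustration; the fusion module to be designed in the project may look quite different.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(
    audio: Sequence[float],
    video: Sequence[float],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Fuse two same-sized feature vectors with scaled dot-product attention.

    Returns the fused vector and the [audio, video] attention weights.
    """
    d = len(query)
    scores = [
        sum(q * x for q, x in zip(query, feat)) / math.sqrt(d)
        for feat in (audio, video)
    ]
    w_audio, w_video = softmax(scores)
    fused = [w_audio * a + w_video * v for a, v in zip(audio, video)]
    return fused, [w_audio, w_video]


# Toy features (made up). Concatenation would simply return audio + video
# as one 4-D vector, giving both streams the same standing.
audio_feat = [1.0, 0.0]
video_feat = [0.0, 1.0]
query = [2.0, 0.0]  # in a real model the query would be learned or data-dependent
fused, weights = attention_fuse(audio_feat, video_feat, query)
print(weights)  # the audio stream receives the larger weight here
print(fused)
```

Because the weights depend on the inputs, such a module can in principle down-weight a modality when it is unreliable (for example, audio in heavy noise), which concatenation cannot do.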
Work environment
This master internship is part of the REAVISE project: 'Robust and Efficient Deep Learning based Audiovisual Speech Enhancement' (2023-2026) funded by the French National Research Agency (ANR). The general objective of REAVISE is to develop a unified AVSE framework that leverages recent methodological breakthroughs in statistical signal processing, machine learning, and deep neural networks in order to design a robust and efficient AVSE framework. The intern will be supervised by Mostafa Sadeghi (researcher, Inria) and Romain Serizel (associate professor, University of Lorraine), as members of the MULTISPEECH team, and will benefit from the research environment, expertise, and computational resources (GPU & CPU) of the team.
References
[1] D. Michelsanti, Z.-H. Tan, S.-X. Zhang, Y. Xu, M. Yu, D. Yu, and J. Jensen, An overview of deep-learning-based audio-visual speech enhancement and separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1368-1396, 2021.
[2] M. Sadeghi and X. Alameda-Pineda, Switching variational auto-encoders for noise-agnostic audio-visual speech enhancement, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
[3] M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, Audio-visual speech enhancement using conditional variational auto-encoders, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1788-1800, 2020.
[4] D. P. Kingma and M. Welling, An introduction to variational autoencoders, Foundations and Trends in Machine Learning, 2019.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems, 2017.
Offer 4: Multi-modal Stuttering Detection Using Self-supervised Learning
General information
Supervisors: Shakeel Ahmad Sheikh, Slim Ouni
Address: LORIA, Campus Scientifique - BP 239, 54506 Vandœuvre-lès-Nancy
Email: firstname.lastname@loria.fr
Office: C 137
Motivation
Stuttering is a neuro-developmental speech disorder that starts appearing when the neural connections supporting language, speech, and emotion are changing quickly [2]. In standard stuttering therapy sessions, speech pathologists or speech therapists manually examine and analyze either the live speech of persons who stutter (PWS) or their recordings. In order to rectify the stuttering, the speech therapists carefully observe and monitor the patterns in the speech utterances of PWS. However, this way of detecting stuttering is very time-consuming and strenuous. It is also biased towards the subjective beliefs of speech-language therapists. Thus, it is important to build interactive stuttering detection tools that provide impartial, objective assessment and can be used to tune and improve various ASR virtual assistants for stuttered speech. Deep learning has been used extensively in domains like speech recognition [5] and emotion detection [1]; in the stuttering domain, however, its application is limited. The acoustic cues embedded in the speech of PWS can be exploited by various deep learning methods for the detection of stuttering. Most of the existing stuttering detection techniques use spectral features such as spectrograms and MFCCs as an input representation of the stuttered speech [12, 11, 3]. The most common problem in the stuttering domain is the lack of data. There are few stuttering datasets, such as UCLASS, FluencyBank, and SEP28K [3], and they are small, containing only a few dozen speakers.
While deep learning methods have shown substantial gains in domains like ASR, speaker verification, emotion detection, etc., the improvement in stuttering detection is very limited, most likely due to the small size of the datasets. The common strategy for training on small datasets is to apply transfer learning, where a pre-trained model (trained first on some auxiliary task on a large dataset) is used to enhance performance on the desired task, for which data is very scarce. The deep learning model trained on the auxiliary task can be fine-tuned by re-training or replacing some of its last layers, or it can be employed as a feature extractor for the desired task. Transfer learning has been explored in various fields like ASR, emotion detection [8], etc. Recently, self-supervised learning has shown significant improvement in stuttering detection [11, 18, 17, 16].
Multimodal Stuttering Detection
Stuttering can be characterized as an audio-visual problem. Cues are present both in the visual modality (e.g., head nodding, lip tremors, quick eye blinks, and unusual lip shapes) and in the audio modality [4]. This multimodal learning paradigm could be helpful in learning robust stutter-specific hidden representations across modalities, and could also help in building robust automatic stuttering detection systems. Self-supervised learning can also be exploited to capture acoustic stutter-specific representations guided by video frames. As proposed by Shukla et al. [14], this framework could be helpful in learning stutter-specific features from audio signals guided by visual frames or vice versa. Altinkaya and Smeulders [15] recently presented the first audio-visual stuttering dataset, which consists of 25 speakers (14 male, 11 female). They trained a ResNet-based RNN (gated recurrent unit) on the audio-visual modality for the detection of the block stuttering type.
The main idea of this internship is to explore the impact of self-supervised learning on stuttering detection in combination with an audio-visual setup. The goal of the proposed study is to develop and evaluate audio-visual self-supervised stuttering detection classifiers that are able to distinguish among several stutter classes.
1. Objective 1: Literature survey of the existing work in stuttering detection.
2. Objective 2: Developing a pre-trained stuttering classifier based on self-supervised learning; some initial experiments would be carried out. We would explore self-supervised models such as wav2vec 2.0, a modified version of wav2vec [9], and variants such as UniSpeech, HuBERT, etc. We would use wav2vec 2.0 either as a feature extractor or fine-tune it by replacing the last few layers, adapting it for stuttering detection. The experiments would be carried out on the newly developed French stuttering dataset.
3. Objective 3: Carrying out the actual experiments; the impact of fine-tuning and of pre-trained features would be analyzed on the raw audio-visual samples of stuttered speech.
References
[1] Mehmet Berkehan Akçay and Kaya Oğuz, 'Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers,' Speech Communication, 116 (2020), pp. 56-76.
[2] Anne Smith and Christine Weber, 'How stuttering develops: The multifactorial dynamic pathways theory,' Journal of Speech, Language, and Hearing Research, 60 (2017), pp. 2483–2505.
[3] Shakeel A. Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni, 'Machine learning for stuttering identification: Review, challenges and future directions,' Neurocomputing, 514 (2022), pp. 385-402.
[4] Guitar, Barry. Stuttering: An integrated approach to its nature and treatment. Lippincott Williams & Wilkins, 2013.
[5] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh and K. Shaalan, 'Speech Recognition Using Deep neural networks: A systematic review,' IEEE Access, vol. 7, pp. 19143-19165, 2019.
[6] Latif, Siddique, Rajib Rana, Sara Khalifa, Raja Jurdak, Junaid Qadir, and Björn W. Schuller. 'Deep representation learning in speech processing: Challenges, recent advances, and future trends.' arXiv preprint arXiv:2001.00378 (2020).
[7] Ning, Y., He, S., Wu, Z., Xing, C. and Zhang, L.J., 2019. A review of deep learning based speech synthesis. Applied Sciences, 9(19), p.4050.
[8] Wang, Yingzhi, Abdelmoumene Boumadane, and Abdelwahab Heba. 'A fine-tuned wav2vec 2.0/hubert benchmark for speech emotion recognition, speaker verification and spoken language understanding.' arXiv preprint arXiv:2111.02735 (2021).
[9] Baevski, Alexei, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 'wav2vec 2.0: A framework for self-supervised learning of speech representations.' Advances in Neural Information Processing Systems, 33 (2020): 12449-12460.
[10] Lea, Colin, Vikramjit Mitra, Aparna Joshi, Sachin Kajarekar, and Jeffrey P. Bigham. 'Sep-28k: A dataset for stuttering event detection from podcasts with people who stutter.' In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6798-6802. IEEE, 2021.
[11] Sheikh, Shakeel A., Md Sahidullah, Slim Ouni, and Fabrice Hirsch. 'End-to-End and Self-supervised learning for ComParE 2022 stuttering sub-challenge.' In Proceedings of the 30th ACM International Conference on Multimedia, pp. 7104-7108. 2022.
[12] Sheikh, Shakeel A., Md Sahidullah, Fabrice Hirsch, and Slim Ouni. 'Robust stuttering detection via multi-task and adversarial learning.' In 2022 30th European Signal Processing Conference (EUSIPCO), pp. 190-194. IEEE, 2022.
[13] Ngiam, Jiquan, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 'Multimodal deep learning.' In ICML. 2011.
[14] Shukla, Abhinav, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, and Maja Pantic. 'Visually guided self supervised learning of speech representations.' In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6299-6303. IEEE, 2020.
[15] Altinkaya, Mehmet, and Arnold WM Smeulders. 'A dynamic, self supervised, large scale audiovisual dataset for stuttered speech.' In Proceedings of the 1st International Workshop on Multimodal Conversational AI, pp. 9-13. 2020.
[16] Mohapatra, Payal, Akash Pandey, Bashima Islam, and Qi Zhu. 'Speech disfluency detection with contextual representation and data distillation.' In Proceedings of the 1st ACM International Workshop on Intelligent Acoustic Systems and Applications, pp. 19-24. 2022.
[17] Grósz, Tamás, Dejan Porjazovski, Yaroslav Getman, Sudarsana Kadiri, and Mikko Kurimo. 'Wav2vec2-based Paralinguistic Systems to Recognise Vocalised Emotions and Stuttering.' In Proceedings of the 30th ACM International Conference on Multimedia, pp. 7026-7029. 2022.
[18] Bayerl, Sebastian P., Dominik Wagner, Elmar Nöth, and Korbinian Riedhammer. 'Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0.' arXiv preprint arXiv:2204.03417 (2022).
Offer 5: Dictionary learning for deep unsupervised speech separation
General information
Supervisors: Paul Magron, Mostafa Sadeghi
Address: LORIA, Campus Scientifique - BP 239, 54506 Vandœuvre-lès-Nancy
Email: paul.magron@inria.fr, mostafa.sadeghi@inria.fr
Office: C 141, C 136
Motivation and context
Speech separation consists in isolating the signals that correspond to each speaker from an acoustic mixture in which several persons might be speaking. This task is an important preprocessing step in many applications such as hearing aids or vocal assistants based on automatic speech recognition.
State-of-the-art separation systems rely on supervised deep learning, where a network is trained to predict the isolated speakers’ signals from their mixture [1, 2]. However, these approaches are costly in terms of training data and have a limited capacity to generalize to unseen speakers.
Goal and objectives
The goal of this internship is to design a fully unsupervised system for speech separation, which is more data-efficient than supervised approaches and applicable to any mixture of speakers. To that end, we propose to combine variational autoencoders (VAEs) with dictionary models (DMs). DMs consist in decomposing a given input matrix (usually an audio spectrogram) as the product of two interpretable factors: a dictionary of spectra and a temporal activation matrix. This family of methods was extensively researched before the era of deep learning [3], but it is limited since real-world audio spectrograms cannot be decomposed using such simple models. Therefore, we propose to leverage VAEs as a tool to learn a latent representation of the data which is regularized using DMs. Such a system can be cast as an instance of transform learning [4]: the key idea is to apply a (learned) transform to the data so that it better complies with a desirable property, here decomposition on a dictionary. A first attempt was recently proposed and has shown promising results in terms of speech modeling [5], although it used a fixed dictionary. This internship aims at extending this work by considering a system where both the VAE and the dictionary are learned jointly, and applying it to the task of speech separation.
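As a rough illustration of the dictionary model described above, the sketch below factorizes a small nonnegative matrix V (standing in for a spectrogram) as V ≈ WH, where W is a dictionary of spectra and H a temporal activation matrix. It uses nonnegative matrix factorization with multiplicative updates, one classical dictionary model. This is not the method of the internship: the choice of NMF with a squared-error cost, the toy data, and all function names are assumptions made purely for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius distance between V and the product WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize a nonnegative matrix V (freq x frames) as W (freq x rank) times H (rank x frames)."""
    rng = random.Random(seed)
    n_freq, n_frames = len(v), len(v[0])
    # Random positive initialization keeps all later updates nonnegative.
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_freq)]
    h = [[rng.random() + eps for _ in range(n_frames)] for _ in range(rank)]
    for _ in range(n_iter):
        # Activation update: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][t] * num[k][t] / (den[k][t] + eps) for t in range(n_frames)]
             for k in range(rank)]
        # Dictionary update: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[f][k] * num[f][k] / (den[f][k] + eps) for k in range(rank)]
             for f in range(n_freq)]
    return w, h


# Toy "spectrogram": two spectra (columns of w_true) active at different frames.
w_true = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
h_true = [[1.0, 0.0, 1.0, 0.5], [0.0, 1.0, 1.0, 0.5]]
v = matmul(w_true, h_true)

w, h = nmf(v, rank=2)
print("reconstruction error:", frobenius_error(v, w, h))
```

In a separation setting, each dictionary column (or group of columns) together with its activations would be attributed to one source; the internship proposes to perform this kind of decomposition in a learned VAE latent space rather than directly on the spectrogram.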
Once trained, the resulting system operates in three stages: (i) the (mixture) audio spectrogram is projected through the encoder into a latent space; (ii) this latent representation is decomposed efficiently using a DM learning algorithm, which provides a latent feature for each speaker; (iii) these latent features are passed through the decoder to retrieve a spectrogram for each speaker. Such a system is promising since it is fully unsupervised (it can be trained without knowledge of specific mixtures), it yields an interpretable decomposition of the latent representation, and it can serve as a basis for other applications (including speaker diarization, speech enhancement or voice conversion).
Good proficiency in Python and basic knowledge of deep learning, both theoretical and practical (e.g., using PyTorch), are required. Some notions of audio/speech signal processing and machine learning are a plus.
Work environment
The trainee will be supervised by Paul Magron (Chargé de Recherche Inria) and Mostafa Sadeghi (Researcher, Inria Starting Faculty Position), and will benefit from the research environment and the expertise in audio signal processing of the MULTISPEECH team. This team includes many PhD students, post-docs, trainees, and permanent staff working in this field, and offers all the necessary computational resources (GPU and CPU, speech datasets) to conduct the proposed research.
References
[1] D. Wang and J. Chen, Supervised Speech Separation Based on Deep Learning: An Overview, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702-1726, 2018.
[2] Y. Luo and N. Mesgarani, Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256-1266, 2019.
[3] T. Virtanen, Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria, IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1066-1074, 2007.
[4] D. Fagot, H. Wendt and C. Févotte, Nonnegative Matrix Factorization with Transform Learning, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[5] M. Sadeghi and P. Magron, A Sparsity-promoting Dictionary Model for Variational Autoencoders, Interspeech, 2022.
Offer 6: Semantic latent space for expressive text-to-speech
General information
Supervisors: Vincent Colotte, Slim Ouni
Address: LORIA, Campus Scientifique - BP 239, 54506 Vandœuvre-lès-Nancy
Phone: 03 54 95 20 74
Email: vincent.colotte@loria.fr, slim.ouni@inria.fr
Office: C141
Motivation
Over the last decades, text-to-speech synthesis (TTS) has reached good quality and intelligibility, and is now commonly used in information delivery services, for instance in call center automation and in navigation systems. In the past, the main goal when developing TTS systems was to achieve high intelligibility. The speech style was then typically a “reading style,” which resulted from the style of the speech data used to develop TTS systems (reading of a large set of sentences). Recent research on speech synthesis now focuses on making generated speech more expressive or spontaneous. Almost all systems are now based on neural network methods. To integrate expressiveness into a network, as in numerous recent works on neural networks, training and inference pass through several steps that use specific latent spaces, either to condition the network or to provide a latent representation for controlling expressiveness. With such stochastic processes, an explanation of these numeric representations is still difficult to extract [1].
Moreover, the use of new representations such as Word2Vec for text or Wav2Vec for audio signals shows that it is possible to find a representation carrying implicit linguistic and semantic information. The need for explanation still remains. The internship will take place in this framework.
Objectives and expected outcomes
The goal of the proposed study is to investigate the information contained in a latent representation dedicated to expressive speech. Previous work used a Variational Autoencoder (VAE) approach to explore this dimension in the audiovisual domain [2]: without any emotion tag, the latent representation recovered the emotional information. In addition, [3] used several representations of acoustic expressiveness to condition a network in order to transfer an emotion from one speaker to a sentence of another speaker. Moreover, [5] jointly used acoustic and textual expressiveness representations; the textual representation was based on the SBERT approach. The internship work will consist in analyzing the latent representations of a TTS system (for instance, a Glow approach for audio speech or a VAE for audio-visual speech). The second step will introduce semantic information through a textual latent representation derived from a simple tag [4], a description, or the text itself. The objective is to jointly learn representations and analyze them in order to understand how to control the system.
Additional information and requirements
The internship will be carried out within the framework of the European project Humane-AI-Net. A good knowledge of Python and basic knowledge of neural network learning are required.
References
[1] Tits, N., Wang, F., Haddad, K.E., Pagel, V., Dutoit, T. Visualization and interpretation of latent spaces for controlling expressive speech synthesis through audio analysis, in Proc. Interspeech, 2019.
[2] S. Dahmani, V. Colotte, V. Girard and S. Ouni, Learning emotions latent representation with CVAE for Text-Driven Expressive AudioVisual Speech Synthesis, in Neural Networks, Elsevier, 2021.
[3] A. Kulkarni, V. Colotte, D. Jouvet, Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems, in Proc. Interspeech, 2022.
[4] Kim, M., Cheon, S.J., Choi, B.J., Kim, J.J., Kim, N.S. Expressive Text-to-Speech Using Style Tag. Interspeech 2021.
[5] Shin, Y., Lee, Y., Jo, S., Hwang, Y., Kim, T., Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS. Interspeech 2022.
Offer 7: Disentanglement in Speech Data for Privacy Needs
General information
Supervisors: Emmanuel Vincent, Marc Tommasi
Address: LORIA, Campus Scientifique - BP 239, 54506 Vandœuvre-lès-Nancy
Email: emmanuel.vincent@inria.fr, marc.tommasi@inria.fr
Motivation and context
Large-scale collection, storage, and processing of speech data pose severe privacy threats [1]. Indeed, speech encapsulates a wealth of personal data (e.g., age and gender, ethnic origin, personality traits, health and socioeconomic status, etc.) which can be linked to the speaker’s identity via metadata or via automatic speaker recognition. Speech data may also be used for voice spoofing using voice cloning software. With firm backing from privacy legislation such as the European General Data Protection Regulation (GDPR), several initiatives are emerging to develop and evaluate privacy preservation solutions for speech technology. These include voice anonymization methods [2], which aim to conceal the speaker’s voice identity without degrading the utility for downstream tasks, and speaker re-identification attacks [3], which aim to assess the resulting privacy guarantees, e.g., in the scope of the VoicePrivacy challenge series [4].
Goals and objectives
The internship will tackle the objective of speech anonymization. Previous works have shown that simple adversarial approaches that aim at removing speaker identity from speech signals do not provide sufficient privacy guarantees [5]. One interpretation of this failure is that the adversaries were not strong enough. Moreover, there is no clear evidence that a transformation that removes speaker identity remains informative enough to allow the reconstruction of intelligible speech signals. These observations raise a classical trade-off between privacy and utility that is essential in many privacy preservation scenarios. Instead of trying to remove speaker information, another option is to replace it with different speaker information.
To do so, a sub-objective is to disentangle speech signals, that is, to isolate the speech features that contribute to the success of speaker identification. Disentanglement is understood in this project as the process of embedding voice data in a new representation where different types of information (speaker identity, linguistic content, or even traits like age, gender or ethnicity) are separated and associated with disjoint sets of features. Variational autoencoders are supposed to naturally support disentanglement [6]. Additionally, variational approaches can also be used to make attackers stronger by introducing more diversity. These two ways of improving adversarial approaches for learning a private representation of speech will be investigated.
References
[1] A. Nautsch, A. Jimenez, A. Treiber, J. Kolberg, C. Jasserand, E. Kindt, H. Delgado, M. Todisco, M. A. Hmani, A. Mtibaa, M. A. Abdelraheem, A. Abad, F. Teixeira, M. Gomez-Barrero, D. Petrovska, G. Chollet, N. Evans, T. Schneider, J.-F. Bonastre, B. Raj, I. Trancoso, and C. Busch. Preserving privacy in speaker and speech characterisation, in Computer Speech and Language, 2019.
[2] B. M. L. Srivastava, M. Maouche, M. Sahidullah, E. Vincent, A. Bellet, M. Tommasi, N. Tomashenko, X. Wang, and J. Yamagishi. Privacy and utility of x-vector based speaker anonymization, in IEEE/ACM Transactions on Audio, Speech and Language Processing, 30:2383–2395, 2022.
[3] B. M. L. Srivastava, N. Vauquier, M. Sahidullah, A. Bellet, M. Tommasi, and E. Vincent. Evaluating voice conversion-based privacy protection against informed attackers, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
[4] N. Tomashenko, X. Wang, E. Vincent, J. Patino, B. M. L. Srivastava, P.-G. Noé, A. Nautsch, N. Evans, J. Yamagishi, B. O’Brien, A. Chanclu, J.-F. Bonastre, M. Todisco, and M. Maouche. The VoicePrivacy 2020 Challenge: Results and findings, Computer Speech and Language, 2022.
[5] B. M. L. Srivastava, A. Bellet, M. Tommasi, and E. Vincent. Privacy-preserving adversarial representation learning in ASR: Reality or illusion?, in Proc. Interspeech, 2019.
[6] L. Girin, S. Leglaive, X. Bie, J. Diard, T. Hueber, and X. Alameda-Pineda. Dynamical Variational Autoencoders: A Comprehensive Review, in Foundations and Trends in Machine Learning, vol. 15, 2021.
| ||||
6-14 | (2022-11-18) PhD studentships @ University of Edinburgh, Scotland, UK PHD STUDENTSHIPS IN SPEECH TECHNOLOGY, COMPUTATIONAL LINGUISTICS, AND COGNITIVE SCIENCE Institute for Language, Cognition and Computation The Institute for Language, Cognition and Computation (ILCC) at the University of Edinburgh invites applications for three-year PhD studentships starting in October 2023. ILCC is dedicated to the pursuit of basic and applied research on computational approaches to language, communication and cognition. Primary research areas include:
Approximately 10 studentships from a variety of sources are available, covering both maintenance at the research council rate of GBP 17,668 (2022/23 rates) per year and tuition fees. Awards increase every year, typically with inflation. Studentships are available for UK, EU, and non-EU nationals. For a list of academic staff at ILCC with research areas, and for a list of indicative PhD topics, please consult: http://web.inf.ed.ac.uk/ilcc/people/academic-senior-research-staff Details regarding the PhD programme and the application procedure can be found at: http://www.ed.ac.uk/informatics/postgraduate/research-degrees/phd There are TWO DEADLINES for applications to receive full consideration: round 1: 25th November 2022 We strongly recommend that non-UK applicants submit their applications in round 1, to maximise their chances of funding. Please direct inquiries to the PhD admissions team at ilcc-admissions@inf.ed.ac.uk. Please note that the 3-year ILCC PhD program is distinct from the UKRI Centre for Doctoral Training in Natural Language Processing, which offers a 4-year PhD with integrated study:
| ||||
6-15 | (2022-11-18) Teaching position, IUT and ENSSAT, Lannion, France The IUT de Lannion and ENSSAT in Lannion (22) are each looking for a full-time lecturer-researcher (enseignante-chercheuse / enseignant-chercheur) in Computer Science on an LRU contract for the rest of the year (until the end of August 2023). Research integration is possible within the EXPRESSION team of IRISA, among others.
The application deadline is very close: 30 November for the IUT and 11 December for ENSSAT, with a likely start date in January 2023. The job descriptions and application procedures are available on the Université de Rennes 1 website: https://www.univ-rennes1.fr/nos-offres-demploi#p-126
| ||||
6-16 | (2022-11-20) Research position on speaker and text anonymization for medical applications @DFKI, Berlin, GE Research position on speaker and text anonymization for medical applications @DFKI, Berlin
We’re happy to announce a new research position in the field of speech and text anonymization at the German Research Center for Artificial Intelligence (DFKI), Berlin, Germany. We’re looking for a full-time researcher at Researcher or Junior Researcher level, and offer a 2-year contract with optional extension and the prospect of a PhD.
The Speech and Language Technology Lab at DFKI Berlin is involved in numerous national as well as international research programmes and networks. We offer an interesting and flexible work environment as part of an innovative, international and enthusiastic team which coordinates and participates in national as well as European projects in the wider area of Language Technology.
Your tasks
Your qualifications
Your benefits
The German Research Center for Artificial Intelligence (DFKI) is Germany's leading business-oriented research institution in the field of innovative software technologies based on artificial intelligence methods. In the international scientific community, DFKI ranks among the most recognized 'Centers of Excellence' and currently is the biggest research center worldwide in the area of Artificial Intelligence and its application in terms of number of employees and the volume of external funds. The DFKI cooperates closely with national and international companies.
DFKI encourages applications from people with disability; DFKI intends to increase the proportion of female employees in the field of science and encourages women to apply for this position.
Application deadline: Dec 23 More details and link: https://jobs.dfki.de/en/internal/vacancy/en-researcher-m-w-d-in-506968.html
| ||||
6-17 | (2022-11-21) Postdoc @GIPSALab, Grenoble, France 6-month postdoc position at GIPSA-lab, Grenoble, on real-time gestural control of intonation for voice substitution, as part of the ANR project GEPETO (Gestures and Pedagogy of Intonation).
General information
Reference: UMR5216-CHRROM-022
Missions
This postdoc is part of the ANR project GEPETO* (GEstures and PEdagogy of InTOnation), which aims to study the use of manual gestures through human-machine interfaces in order to design tools and methods for learning to control intonation (melody) in speech.
Activities
- Get to grips with the whisper-to-speech conversion system in the Max/MSP environment
Skills
Candidates who lack skills in some of the listed areas are nevertheless encouraged to apply.
Work context
Gipsa-lab is a joint research unit of CNRS, Grenoble-INP (Institut de Technologie de
| ||||
6-18 | (2022-11-21) Speech researcher @Vivoka, Metz, France
SPEECH RESEARCHER (M/F)
About Vivoka
Vivoka is a global leader in voice AI technologies founded in 2015. Thanks to its VDK (Voice Development Kit), Vivoka offers an all-in-one solution that enables any company to create its own high-performance, secure embedded/offline voice assistant in record time. Vivoka has won several innovation awards and has established leading partnerships with major players in the voice market. Vivoka has a portfolio of about 100 customers from all major industries and is pursuing its goal of bringing people closer to technology through voice.
Main mission
As part of its constant evolution, Vivoka is further developing the Voice Development Kit and its related projects. Your main mission will revolve around creating solutions for signal processing and more specifically speech processing.
Roles & Responsibilities
Job's Requirements
The advantages of the job
If you are interested in the position, send your documents to
| ||||
6-19 | (2022-11-21) NLP Researcher@Vivoka, Metz, France NLP RESEARCHER (M/F)
About Vivoka
Vivoka is a global leader in voice AI technologies founded in 2015. Thanks to its VDK (Voice Development Kit), Vivoka offers an all-in-one solution that enables any company to create its own high-performance, secure embedded/offline voice assistant in record time. Vivoka has won several innovation awards and has established leading partnerships with major players in the voice market. Vivoka has a portfolio of about 100 customers from all major industries and is pursuing its goal of bringing people closer to technology through voice.
Main mission
As part of its constant evolution, Vivoka is further developing the Voice Development Kit and its related projects. Your main mission will revolve around creating solutions for NLP and more specifically NLU.
Roles & Responsibilities
Job’s Requirements
The advantages of the job
If you are interested in the position, send your documents to
| ||||
6-20 | (2022-11-21) Machine Learning engineer@Vivoka, Metz, France MACHINE LEARNING ENGINEER (M/F)
About Vivoka
Vivoka is a global leader in voice AI technologies founded in 2015. Thanks to its VDK (Voice Development Kit), Vivoka offers an all-in-one solution that enables any company to create its own high-performance, secure embedded/offline voice assistant in record time. Vivoka has won several innovation awards and has established leading partnerships with major players in the voice market. Vivoka has a portfolio of about 100 customers from all major industries and is pursuing its goal of bringing people closer to technology through voice.
Main mission
As part of the R&D team, you will benchmark, develop, integrate and deploy our future embedded ML technologies for speech and natural language processing applications.
Roles & Responsibilities
Job's Requirements
R&D.
The advantages of the job
If you are interested in the position, send your documents to
| ||||
6-21 | (2022-11-22) Faculty position (Associate professor, tenure position) at Telecom Paris, France Faculty position (Associate professor, tenure position) at Telecom Paris in Machine-Learning for Social Computing.
Telecom Paris has a new permanent (tenure) faculty position (Associate Professor/ “Maître de conférences”) in the area of **machine learning for social computing**. Applicants from the following sub-research areas are welcome:
Important Dates
Context Social Computing team [1] - S²A (machine learning, statistics and signal processing) group [2] - LTCI (laboratoire de traitement et communication de l’information) [3] - Telecom Paris [4] .
Ecosystem Telecom Paris [4] is a founding member of the Institut Polytechnique de Paris (IP Paris), a world-class scientific and technological institution. Located at the Plateau de Saclay close to Paris-Saclay University, this institution is a partnership between Ecole Polytechnique, ENSTA Paris, ENSAE Paris, Télécom Paris, and Télécom SudParis, with HEC as a key partner. Regularly ranked as one of the best engineering schools in France, Télécom Paris is recognized for its excellent training, its very good employability rate with high salaries, its high-level research, and its very close proximity to companies. The THE (Times Higher Education) ranks Télécom Paris as the 2nd best French engineering school, the 5th best French university, and the 6th « best small university ». The newly created institution IP Paris was ranked among the top 50 universities in the QS world university ranking. In the context of the Institut Polytechnique de Paris, the team's activities in Data Science and AI benefit from the center Hi!Paris (https://www.hi-paris.fr), which offers seminars, workshops, and funding through calls for projects.
Main missions/Research activities
Main missions/Teaching activities Participate in teaching activities at Telecom Paris and its partner academic institutions (as part of joint Master programs), especially in natural language processing, speech processing, machine learning, and Data Science, including life-long training programs (e.g. the local “Mastères Spécialisés”)
Candidate profile As a minimum requirement, the successful candidate will have:
NOTE: The candidate does *not* need to speak French to apply, just to be willing to learn the language (teaching will be mostly given in English)
Other skills expected include: • Capacity to work in a team and develop good relationships with colleagues and peers • Excellent writing and pedagogical skills
More about the position • Place of work: Saclay (Paris outskirts)
How to apply? Applications must be submitted via one of the following websites: French Version: English Version:
Applicants should submit a single PDF file that includes: - cover letter, - curriculum vitae, - statements of research and teaching interests (4 pages) - three publications - contact information for two references
Contacts: == please do not hesitate to directly contact us before applying == Chloé Clavel (Coordinator of the Social Computing team) Stéphan Clémençon (Head of the S²A group) Florence d’Alché-Buc (Head of the IDS department)
[3] https://www.telecom-paris.fr/fr/lecole/departements-enseignement-recherche/image-donnees-signal [5] https://www.telecom-paris.fr/en/home
| ||||
6-22 | (2022-11-23) Postdoc for speech-based affective computing , King's college, London I am looking for a post-doc for speech-based affective computing and multimodal mHealth applications. For full details, see: https://jobs.kcl.ac.uk/gb/en/job/058426/Research-Fellow-in-Data-Science-for-Mobile-Health-mHealth Dr. Nicholas Cummins Institute of Psychiatry, Psychology & Neuroscience
| ||||
6-23 | (2022-11-24) Internship @UMRAE, Strasbourg, France UMRAE-INRIA INTERNSHIP PROPOSAL 2022-2023
Internship topic: New algorithms for automated room acoustic diagnosis
Recommended level: ☒ Master (M2) ☐ Master (M1) ☐ Engineering degree ☐ Bachelor (Licence) ☐ Bac + 2
Required skills: Room acoustics, optimization methods, signal processing, machine learning. Knowledge of Python or Matlab would be a plus. Master 2 (acoustics, computer science, signal processing…)
General introduction
Noise pollution is cited by the public as the leading source of annoyance and is a major health and social issue. In buildings, annoyance is often linked to poor room acoustics caused by excessive reverberation (canteens, swimming pools, nurseries…). In the acoustic rehabilitation of rooms, proposing a solution requires good knowledge of the geometric and acoustic characteristics of the existing room. To estimate certain unknown parameters (e.g., the absorption of an unknown ceiling), field acousticians rely on in situ measurements of the sound field and on numerical (or analytical) acoustic models, whose parameters they tune iteratively until the model reproduces the measured sound field. The complete diagnostic process is therefore long, costly, and sometimes imprecise, depending on the models used. Given this, the development of so-called inverse methods, which automatically recover the acoustic parameters of interest from the measurement, would be a major breakthrough for building acoustics, paving the way for simpler, faster, and more reliable tools for acousticians.
Topic
Our topic, the development of inverse methods in building acoustics using optimization, audio signal processing, and/or machine learning, complements the existing range of sound field prediction tools. It also breaks down the existing barrier between the audio and acoustics communities, which is reflected in separate conferences and journals. Preliminary work carried out by UMRAE and Inria, based on the room impulse response (RIR), has clearly shown that directly applying optimization approaches (and machine learning approaches) that exist in other fields is not enough to solve our problem. These approaches must be adapted to the specific case of acoustics. To date, under idealized conditions (omnidirectional microphones and sources, rectangular room, rather low wall absorption…), our work has shown that it is possible to “recover” the wall absorption from the RIR and to reconstruct the room geometry, without prior knowledge of the positions of the sound source or the microphones. The candidate, supported by his or her supervisors, will reinforce two PhD students working on these inverse methods. To do so, he or she will first get to grips with one of the optimization methods already in place (iterative RANSAC algorithms, Sliding Frank-Wolfe, method of fundamental solutions, neural networks…) as well as the theoretical model specifically chosen for this optimization work, which can be expressed in the time domain, in the Fourier domain, or on a spherical harmonic decomposition. Several lines of work are then possible, depending on the candidate's interests and after discussion with the supervisors.
The candidate may work on the reference theoretical model, for example by using it in one of the domains (time, Fourier, or spherical harmonics). He or she may also seek to improve it, for example by incorporating the directivity of sources and microphones, or wall diffusion. In parallel, the candidate may also refine the chosen optimization methods for the acoustic case, or even propose another optimization algorithm used in another field of physics. Finally, to validate this work, the candidate will have access to RIRs simulated with reference numerical tools, as well as measured RIRs, notably from a research project conducted jointly with the Institut de Recherche en Coordination Acoustique/Musique (IRCAM Paris).
Informations générales
Lieu et durée du stage Le stage aura lieu à Strasbourg au sein des locaux du Cerema (11 rue Jean Mentelin – 67200 Straabourg). Le stage est prévu pour une durée de 4-6 mois.
Encadrants
Antoine Deleforge, Chargé de Recherche Inria, Equipe Multispeech, 615 rue du jardin botanique, 54600 Villiers lès Nancy. Pour des raisons pratiques, Antoine Deleforge est physiquement au sein de l’agence de Strasbourg 2 jours par semaine. https://members.loria.fr/ADeleforge/ ; https://team.inria.fr/multispeech/ ; https://www.inria.fr/fr
Cédric Foy, Chargé de Recherche UMRAE, Cerema, Univ. Gustave Eiffel, 11 rue Jean Menteli, 67200 Strasbourg https://www.cerema.fr/fr ; https://www.umrae.fr/ ; https://twitter.com/umrae_lab
Bibliography
T. Sprunck, K. Chahdi, C. Foy, E. Franck, A. Deleforge, Reconstruction de la forme d’une pièce par super-résolution à l’aide de réponses impulsionnelles, 16ème Congrès Français d'Acoustique, Marseille, France, 11-15 April 2022.
S. Dilungana, A. Deleforge, C. Foy, S. Faisan, Estimation jointe des profils d’absorption des parois d’une salle à partir de réponses impulsionnelles, 16ème Congrès Français d'Acoustique, Marseille, France, 11-15 April 2022.
S. Dilungana, A. Deleforge, C. Foy, S. Faisan, Geometry-informed estimation of surface absorption profiles from impulse responses, EUSIPCO, 30th European Signal Processing Conference, Belgrade, Serbia, 2022.
T. Sprunck, Y. Privat, C. Foy, A. Deleforge, Gridless 3D Recovery of Image Sources from Room Impulse Responses, preprint, 2022, https://arxiv.org/abs/2208.14017
S. Dilungana, A. Deleforge, C. Foy, and S. Faisan, Learning-based estimation of individual absorption profiles from a single room impulse response with known positions of source, sensor and surfaces. In INTER-NOISE and NOISE-CON Congress and Conference Proceedings, vol. 263, No. 1, pp. 5623-5630, 2021.
https://jtav.ifsttar.fr/fileadmin/contributeurs/JTAV/2022/JTAV2022_foyetcoll.pdf
| ||||
6-24 | (2022-12-05) Permanent academic post in Speech Technology @ University of Edinburgh, Scotland, UK
The School of Informatics at the University of Edinburgh is recruiting for a permanent academic post in Speech Technology. The appointment will be at Lecturer or Reader grade (equivalent to US Assistant Professor/Associate Professor). You will contribute to research and teaching in the Centre for Speech Technology Research (CSTR) and the Institute for Language, Cognition, and Computation (ILCC). There is extensive scope to collaborate with other Institutes and Schools within the University.
The successful candidate will have (or be near to completing) a PhD, an established research agenda and the enthusiasm and ability to undertake original research, to lead a research group, and to engage with teaching and academic supervision. We are seeking current and future leaders in the field who are able to forge new collaborations both within the field and across disciplines. We are particularly looking for a candidate with potential to extend the breadth of our research beyond our traditional core strengths in speech recognition and synthesis towards emerging applications, for example in spoken dialogue systems; spoken language understanding; healthcare and assistive technology applications; explainable speech processing; human computer interaction; or autonomous systems. For more details, including how to apply, view the full advert at https://elxw.fa.em3.oraclecloud.com/hcmUI/CandidateExperience/en/job/5973 Applications close on 12 January. The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. Is e buidheann carthannais a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.
| ||||
6-25 | (2022-12-08) Ph.D. Position in Cognitive Neuroscience@ GIPSA, Grenoble, France
| ||||
6-26 | (2023-12-05) Master internship - Advanced Selective Mutual Learning for audio source separation @SteelSeries France R&D team (former Nahimic R&D team), France
Advanced Selective Mutual Learning for audio source separation
Master internship, Lille (France), 2022
Advisors
- Nathan Souviraà-Labastie, R&D Engineer, PhD, nathan.souviraa-labastie@steelseries.com
- Pierre Biret, R&D Engineer, pierre.biret@steelseries.com
Company description
About GN Group
GN was founded 150 years ago with a truly innovative and global mindset. Today, we honour that legacy with world-leading expertise in the human ear, sound and video processing, wireless technology, miniaturization and collaborations with leading technology partners. GN’s solutions are marketed by the brands ReSound, Beltone, Interton, Jabra, BlueParrott, SteelSeries and FalCom in 100 countries. The GN Group employs 6,500 people and is listed on Nasdaq Copenhagen (GN.CO).
About SteelSeries
SteelSeries is the worldwide leader in gaming and esports peripherals focused on premium quality, innovation, and functionality. SteelSeries’ family of professional and gaming enthusiasts are the driving force behind the company and help influence, design, and craft every single accessory and the brand’s software ecosystem, SteelSeries GG. In 2020, SteelSeries acquired Nahimic, the leader in 3D sound solutions for gaming. We are currently looking for a machine learning / audio signal processing intern to join the R&D team of SteelSeries’ Software & Services Business Unit in our French office (former Nahimic R&D team).
Internship subject
Audio source separation consists of extracting the different sound sources present in an audio signal, in particular by estimating their frequency distributions and/or spatial positions. Many applications are possible, from karaoke generation to speech denoising. In 2020, our separation approaches [1, 2] matched the state of the art [3, 4] on a music separation task. Since then, our speech denoising product has hit the market [5] and the team continues to explore many avenues for improvement (see for instance the following project [6, 7]).
Selective Mutual Learning (SML)
Mutual learning (ML) [8] is a general idea related to knowledge distillation (KD) [9], in which a group of untrained lightweight networks simultaneously learn and share knowledge to perform tasks together during training. The specificity of Selective Mutual Learning [10] is that high-confidence predictions are used to guide the other network, while low-confidence predictions are ignored. This helps remove poor predictions when the networks share knowledge. Note that knowledge sharing operates through loss functions that take into account the predictions of the other networks. The approach is simple and already shows benefits compared to KD and ML for boosting the performance of networks for speech separation.
Research axes
The intern will be able to use existing internal trainsets and already implemented network architectures, which will make it easier to draw unbiased comparisons with our baseline approach. Once the SML approach has been re-implemented, possible axes of improvement include:
- tune the confidence factor (hyper-parameter c in [10]) to fit our speech denoising baseline (DNN and trainset)
- extend the SML approach to more than 2 networks and test it
- adapt the SML loss formula to incorporate our internal loss (description upon request)
- additional tests with
  - new or already implemented networks: TasNet [11], E3Net [12], DPRNN [13], transformer [1]
  - various trainsets (music separation, speech separation, ...)
Skills
Who are we looking for?
You are preparing an engineering degree or a master’s degree, and you preferably have knowledge of the development and implementation of advanced machine learning algorithms. Digital audio signal processing skills are a plus.
Although not mandatory, notions in the following additional fields would be appreciated:
- Audio effects in general: compression, equalization, etc.
- Statistics, probabilistic approaches, optimization.
- Programming languages and frameworks: Python, PyTorch, Keras, TensorFlow, MATLAB.
- Voice recognition, voice command.
- Computer programming and development: Max/MSP, C/C++/C#.
- Audio editing software: Audacity, Adobe Audition, etc.
- Scientific publications and patent applications.
- Fluency in English and French.
- Intellectual curiosity.
References
[1] I. Alaoui Abdellaoui and N. Souviraà-Labastie. "Blending the attention mechanism in TasNet". Working paper or preprint. Nov. 2020.
[2] E. Pierson Lancaster and N. Souviraà-Labastie. "A frugal approach to music source separation". Working paper or preprint. Nov. 2020.
[3] F.-R. Stöter, A. Liutkus and N. Ito. "The 2018 signal separation evaluation campaign". In: International Conference on Latent Variable Analysis and Signal Separation. Springer. 2018, pp. 293-305.
[4] N. Takahashi and Y. Mitsufuji. "Multi-scale Multi-band DenseNets for Audio Source Separation". In: 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). 29 June 2017. arXiv:1706.09588.
[5] ClearCast AI Noise Canceling - Promotion video. https://www.youtube.com/watch?v=RD4eXKEw4Lg
[6] M. Vial and N. Souviraà-Labastie. Learning rate scheduling and gradient clipping for audio source separation. Technical report. SteelSeries France, Dec. 2022.
[7] The torch_custom_lr_schedulers GitHub repository. https://github.com/SteelSeries/torch_custom_lr_schedulers
[8] Y. Zhang et al. "Deep mutual learning". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 4320-4328.
[9] G. Hinton, O. Vinyals, J. Dean et al. "Distilling the knowledge in a neural network". In: arXiv preprint arXiv:1503.02531 2.7 (2015).
[10] H. M. Tan et al. "Selective Mutual Learning: An Efficient Approach for Single Channel Speech Separation". In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2022, pp. 3678-3682.
[11] Y. Luo and N. Mesgarani. "TasNet: time-domain audio separation network for real-time, single-channel speech separation". In: arXiv:1711.00541 [cs, eess] (1 Nov. 2017).
[12] M. Thakker et al. "Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation". In: arXiv preprint arXiv:2204.00771 (2022).
[13] Y. Luo, Z. Chen and T. Yoshioka. "Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation". In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 46-50.
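To make the selective sharing idea in the internship subject concrete, here is a minimal sketch of a loss for one network that imitates its peer only when the peer is confident. It is not the loss formula of [10]: the mean squared error, the function `selective_mutual_loss`, the `peer_weight` parameter and the way `peer_confidence` is obtained are all assumptions made for illustration. Only the thresholding role of the confidence factor c follows the description above.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def selective_mutual_loss(own_pred, peer_pred, target, peer_confidence,
                          c=0.5, peer_weight=1.0):
    """Supervised loss plus a peer-imitation term gated by confidence.

    peer_confidence is a hypothetical score in [0, 1]; how it is computed
    is left open here. Below the threshold c, the peer is ignored.
    """
    supervised = mse(own_pred, target)
    if peer_confidence < c:
        return supervised  # low-confidence peer prediction: not shared
    return supervised + peer_weight * mse(own_pred, peer_pred)


target = [1.0, 1.0]
own = [0.0, 1.0]
peer = [1.0, 1.0]
loss_confident = selective_mutual_loss(own, peer, target, peer_confidence=0.9)
loss_ignored = selective_mutual_loss(own, peer, target, peer_confidence=0.2)
print(loss_confident, loss_ignored)
```

Extending the sketch to more than two networks, one of the research axes listed above, would amount to summing the gated term over all peers.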
| ||||
6-27 | (2023-12-05) Personalized speech enhancement Master internship, Lille (France) @SteelSeries France R&D team (former Nahimic R&D team), France
Personalized speech enhancement
Master internship, Lille (France), 2022
Advisors
- Damien Granger, R&D Engineer, damien.granger@steelseries.com
- Nathan Souviraà-Labastie, R&D Engineer, PhD, nathan.souviraa-labastie@steelseries.com
- Lucas Dislaire, R&D Engineer, lucas.dislaire@steelseries.com
Company description
About GN Group
GN was founded 150 years ago with a truly innovative and global mindset. Today, we honour that legacy with world-leading expertise in the human ear, sound and video processing, wireless technology, miniaturization and collaborations with leading technology partners. GN’s solutions are marketed by the brands ReSound, Beltone, Interton, Jabra, BlueParrott, SteelSeries and FalCom in 100 countries. The GN Group employs 6,500 people and is listed on Nasdaq Copenhagen (GN.CO).
About SteelSeries
SteelSeries is the worldwide leader in gaming and esports peripherals focused on premium quality, innovation, and functionality. SteelSeries’ family of professional and gaming enthusiasts are the driving force behind the company and help influence, design, and craft every single accessory and the brand’s software ecosystem, SteelSeries GG. In 2020, SteelSeries acquired Nahimic, the leader in 3D sound solutions for gaming. We are currently looking for a machine learning / audio signal processing intern to join the R&D team of SteelSeries’ Software & Services Business Unit in our French office (former Nahimic R&D team).
Internship subject
Audio source separation consists of extracting the different sound sources present in an audio signal, in particular by estimating their frequency distributions and/or spatial positions. Many applications are possible, from karaoke generation to speech denoising. In 2020, our separation approaches [1, 2] matched the state of the art [3, 4] on a music separation task. Since then, our speech denoising product has hit the market [5] and the team continues to explore many avenues for improvement (see for instance the following project [6, 7]).
Speech-related audio source separation tasks
Speech enhancement [8], or speech denoising, usually refers to the task where the signal of interest is one intelligible speaker drowned in additive noise. Speech separation [9] (sometimes speaker separation) refers to the task of separately retrieving multiple unknown speakers, usually not drowned in additive noise. In the case of Personalized Speech Enhancement (also called target voice separation or target speaker extraction), the signal of interest is a speaker, but a known one. This makes it possible (1) to obtain potentially better speech enhancement performance and (2) to focus on a particular speaker where speech enhancement would have kept all intelligible speakers.
Personalized speech enhancement
The research area is very recent and was only added last year to the DNS challenge [10] (note that the DNS challenge tasks [10] are defined as real-time, with a 40 ms constraint on latency, mainly composed of the look-ahead). VoiceFilter [11] seems to be the first article to tackle the problem of personalized speech enhancement. It uses two separately trained neural networks: one discrimination network that produces speaker-specific embeddings from reference utterances of the target speaker, and one "main" network that performs the actual speech enhancement by taking as input both the corrupted utterance and the target speaker embeddings. The approach has now been outperformed, e.g. by [12], but the two-step approach tends to prevail in the literature [13, 14, 15]: first, a target speaker representation is learned during an enrollment phase, for instance by means of speaker embeddings such as x-vectors or d-vectors [12]; second, this result is incorporated into the neural network that will learn to extract the target speaker’s speech. However, two steps do not necessarily mean two networks; for instance, a jointly trained 4-stage network is proposed in [15].
Research axes
The objective of the internship is to address one or several of the following targets:
- Firstly, a baseline framework needs to be set up. It will require:
  - A dataset tailored for the task: the datasets available in the scientific community do not completely fulfill the requirements for SteelSeries products (description upon request). Conversely, our current datasets partly lack speaker information. Hence, one would need to opt for the best solution or tradeoff combination.
  - A speaker embedding baseline, with the assumption that the signal captured during enrollment is "clean", i.e. only contains the signal of interest (or at least no second speaker).
  - A speech enhancement model with speaker embeddings. The intern could for instance re-use our implementation (currently without speaker embeddings) of E3Net [14].
- Secondly, once a first baseline has been trained, the candidate could benchmark it on different scenarios (signal-to-noise ratio during enrollment, signal ratio between speakers and effect of additional noise, various and mixed languages). The Target Speaker Over-Suppression metric could potentially be used (description in [12]), as well as the standard DNS metrics. This could lead the candidate to work on one of the following items to improve the baseline framework on its identified weaknesses:
  - Testing various speaker embeddings and ways or positions of integrating them into the networks.
  - Separate training of the speaker-encoding network has been found to work better than joint training [16, 17, 15] (multi-task learning often being hard to tune). However, this would need to be reassessed with the final chosen architecture.
  - More effective enrollment strategies [18, 19] could be chosen and adapted to SteelSeries use cases.
  - Implementing loss functions suitable for the separation task (step 2) could also be of interest, for instance following ideas in [20] or by adapting our internal loss.
Skills
Who are we looking for?
You are preparing an engineering degree or a master’s degree, and you preferably have knowledge of the development and implementation of advanced machine learning algorithms. Digital audio signal processing skills are a plus.
Although not mandatory, notions in the following additional fields would be appreciated:
- Audio effects in general: compression, equalization, etc.
- Statistics, probabilistic approaches, optimization.
- Programming languages and frameworks: Python, PyTorch, Keras, TensorFlow, MATLAB.
- Voice recognition, voice command.
- Computer programming and development: Max/MSP, C/C++/C#.
- Audio editing software: Audacity, Adobe Audition, etc.
- Scientific publications and patent applications.
- Fluency in English and French.
- Intellectual curiosity.
References
[1] I. Alaoui Abdellaoui and N. Souviraà-Labastie. "Blending the attention mechanism in TasNet". Working paper or preprint. Nov. 2020.
[2] E. Pierson Lancaster and N. Souviraà-Labastie. "A frugal approach to music source separation". Working paper or preprint. Nov. 2020.
[3] F.-R. Stöter, A. Liutkus and N. Ito. "The 2018 signal separation evaluation campaign". In: International Conference on Latent Variable Analysis and Signal Separation. Springer. 2018, pp. 293-305.
[4] N. Takahashi and Y. Mitsufuji. "Multi-scale Multi-band DenseNets for Audio Source Separation". In: 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). 29 June 2017. arXiv:1706.09588.
[5] ClearCast AI Noise Canceling - Promotion video. https://www.youtube.com/watch?v=RD4eXKEw4Lg
[6] M. Vial and N. Souviraà-Labastie. Learning rate scheduling and gradient clipping for audio source separation. Technical report. SteelSeries France, Dec. 2022.
[7] The torch_custom_lr_schedulers GitHub repository. https://github.com/SteelSeries/torch_custom_lr_schedulers
[8] DNS challenge on the paperswithcode website. https://paperswithcode.com/sota/speech-enhancement-on-deep-noise-suppression
[9] Speech separation task referenced on the paperswithcode website. https://paperswithcode.com/task/speech-separation
[10] H. Dubey et al. "ICASSP 2022 deep noise suppression challenge". In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2022, pp. 9271-9275.
[11] Q. Wang et al. "VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking". In: arXiv preprint arXiv:1810.04826 (2018).
[12] S. E. Eskimez et al. "Personalized speech enhancement: New models and comprehensive evaluation". In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2022, pp. 356-360.
[13] R. Giri et al. "Personalized PercepNet: Real-time, low-complexity target voice separation and enhancement". In: arXiv preprint arXiv:2106.04129 (2021).
[14] M. Thakker et al. "Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation". In: arXiv preprint arXiv:2204.00771 (2022).
[15] C. Xu et al. "SpEx: Multi-scale time domain speaker extraction network". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), pp. 1370-1384.
[16] K. Žmolíková et al. "Learning speaker representation for neural network based multichannel speaker extraction". In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE. 2017, pp. 8-15.
[17] M. Delcroix et al. "Single channel target speaker extraction and recognition with speaker beam". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2018, pp. 5554-5558.
[18] H. Sato et al. "Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations". In: arXiv preprint arXiv:2206.08174 (2022).
[19] X. Liu, X. Li and J. Serrà. "Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation". In: arXiv preprint arXiv:2210.12635 (2022).
[20] H. Taherian, S. E. Eskimez and T. Yoshioka. "Breaking the trade-off in personalized speech enhancement with cross-task knowledge distillation". In: arXiv preprint arXiv:2211.02944 (2022).
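The two-step scheme described in this posting (enrollment, then conditioned enhancement) can be sketched in a few lines. This is only an illustration of the data flow, under loud assumptions: `enroll` averages frame features as a stand-in for a trained speaker-embedding network, and `condition` concatenates the embedding to every input frame, which is one common way of conditioning but not necessarily the one used in the cited papers or in the SteelSeries code base. Both function names are hypothetical.

```python
def enroll(reference_frames):
    """Step 1: summarize enrollment frames into one fixed-size vector.

    A real system would use a trained network (e.g. producing d-vectors);
    the plain average used here is only a placeholder.
    """
    n_frames = len(reference_frames)
    dim = len(reference_frames[0])
    return [sum(frame[d] for frame in reference_frames) / n_frames
            for d in range(dim)]


def condition(noisy_frames, speaker_embedding):
    """Step 2 input: append the embedding to every noisy frame.

    The enhancement network (not shown) would consume these longer frames,
    so that it sees both the mixture and the target speaker identity.
    """
    return [list(frame) + list(speaker_embedding) for frame in noisy_frames]


embedding = enroll([[1.0, 3.0], [3.0, 5.0]])           # 2 enrollment frames
conditioned = condition([[0.1, 0.2, 0.3]], embedding)  # 1 noisy frame
print(embedding, conditioned)
```

Where the embedding enters the network (at the input, as here, or at deeper layers) is one of the integration choices listed in the research axes.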
| ||||
6-28 | (2022-12-05) Real time speaker separation Master internship, Lille (France), 2022 @SteelSeries France R&D team (former Nahimic R&D team), France
Real time speaker separation
Master internship, Lille (France), 2022
Advisors
- Nathan Souviraà-Labastie, R&D Engineer, PhD, nathan.souviraa-labastie@steelseries.com
- Damien Granger, R&D Engineer, damien.granger@steelseries.com
Company description
About GN Group
GN was founded 150 years ago with a truly innovative and global mindset. Today, we honour that legacy with world-leading expertise in the human ear, sound and video processing, wireless technology, miniaturization and collaborations with leading technology partners. GN’s solutions are marketed by the brands ReSound, Beltone, Interton, Jabra, BlueParrott, SteelSeries and FalCom in 100 countries. The GN Group employs 6,500 people and is listed on Nasdaq Copenhagen (GN.CO).
About SteelSeries
SteelSeries is the worldwide leader in gaming and esports peripherals focused on premium quality, innovation, and functionality. SteelSeries’ family of professional and gaming enthusiasts are the driving force behind the company and help influence, design, and craft every single accessory and the brand’s software ecosystem, SteelSeries GG. In 2020, SteelSeries acquired Nahimic, the leader in 3D sound solutions for gaming. We are currently looking for a machine learning / audio signal processing intern to join the R&D team of SteelSeries’ Software & Services Business Unit in our French office (former Nahimic R&D team).
Internship subject
Audio source separation consists of extracting the different sound sources present in an audio signal, in particular by estimating their frequency distributions and/or spatial positions. Many applications are possible, from karaoke generation to speech denoising. In 2020, our separation approaches [1, 2] matched the state of the art [3, 4] on a music separation task. Since then, our speech denoising product has hit the market [5] and the team continues to explore many avenues for improvement (see for instance the following project [6, 7]).
Real time speaker separation
This internship targets speaker separation, which is formalized in the scientific community as the task of separately retrieving a given number of speech/speaker signals from a monaural mixture signal. Most of the scientific challenges [8] compare offline (not real-time) approaches. The objective of the internship is to address the following targets (more or less in order):
- Based on our current speech denoising trainsets, the candidate will create a trainset for the speaker separation task that matches the same in-house requirements. Indeed, most of the datasets available in the scientific community lack quantity, audio quality of the ground truths, a high sampling rate, or diversity of speakers and noise types. In addition, for the SteelSeries use cases, the overlap in time of the different speech sources might be lower than in the scenarios used by the scientific community, and its statistical distribution will need to be well identified and defined.
- Once our offline and online baseline algorithms have been trained on such a trainset, the candidate could benchmark them on different scenarios (number of speakers, signal ratio between speakers, effect of additional noise, various and mixed languages) to potentially address the weaknesses of the trainset.
- The first subjective listening tests could lead the candidate to design complementary metrics, for instance representing false positives in speaker attribution, or representing the statistics of the time a real-time DNN needs to attribute a signal to the correct speaker after some silence.
- While all of the above could be done using state-of-the-art loss functions, the candidate could also adapt our internal loss to be permutation invariant [9].
- The scientific community is very active in proposing new DNN architectures (offline [10, 8] and online [11, 12]). The candidate could also re-implement an existing architecture or propose her/his own. In particular, a multi-task approach where the DNN also outputs the number of active speakers would be of great interest.
Skills
Who are we looking for?
You are preparing an engineering degree or a master’s degree, and you preferably have knowledge of the development and implementation of advanced machine learning algorithms. Digital audio signal processing skills are a plus.
Although not mandatory, notions in the following additional fields would be appreciated:
- Audio effects in general: compression, equalization, etc.
- Statistics, probabilistic approaches, optimization.
- Programming languages and frameworks: Python, PyTorch, Keras, TensorFlow, MATLAB.
- Voice recognition, voice command.
- Computer programming and development: Max/MSP, C/C++/C#.
- Audio editing software: Audacity, Adobe Audition, etc.
- Scientific publications and patent applications.
- Fluency in English and French.
- Intellectual curiosity.
References
[1] I. Alaoui Abdellaoui and N. Souviraà-Labastie. "Blending the attention mechanism in TasNet". Working paper or preprint. Nov. 2020.
[2] E. Pierson Lancaster and N. Souviraà-Labastie. "A frugal approach to music source separation". Working paper or preprint. Nov. 2020.
[3] F.-R. Stöter, A. Liutkus and N. Ito. "The 2018 signal separation evaluation campaign". In: International Conference on Latent Variable Analysis and Signal Separation. Springer. 2018, pp. 293-305.
[4] N. Takahashi and Y. Mitsufuji. "Multi-scale Multi-band DenseNets for Audio Source Separation". In: 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). 29 June 2017. arXiv:1706.09588.
[5] ClearCast AI Noise Canceling - Promotion video. https://www.youtube.com/watch?v=RD4eXKEw4Lg
[6] M. Vial and N. Souviraà-Labastie. Learning rate scheduling and gradient clipping for audio source separation. Technical report. SteelSeries France, Dec. 2022.
[7] The torch_custom_lr_schedulers GitHub repository. https://github.com/SteelSeries/torch_custom_lr_schedulers
[8] Speech separation task referenced on the paperswithcode website. https://paperswithcode.com/task/speech-separation
[9] X. Liu and J. Pons. "On permutation invariant training for speech source separation". In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2021, pp. 6-10.
[10] Music separation task referenced on the paperswithcode website. https://paperswithcode.com/sota/music-source-separation-on-musdb18
[11] DNS challenge on the paperswithcode website. https://paperswithcode.com/sota/speech-enhancement-on-deep-noise-suppression
[12] H. Dubey et al. "ICASSP 2022 deep noise suppression challenge". In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2022, pp. 9271-9275.
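Since this posting mentions making a loss permutation invariant [9], here is a minimal sketch of the generic idea: the loss is evaluated for every assignment of network outputs to reference speakers, and the smallest value is kept. The mean squared error stands in for the internal loss, which is not public, and the names `pit_loss` and `pair_loss` are illustrative. The exhaustive search over permutations is only practical for a small number of speakers.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates, references, pair_loss=mse):
    """Return (best_loss, best_perm) over all output-to-speaker assignments.

    best_perm[i] is the index of the estimate matched to references[i].
    """
    n = len(references)
    if len(estimates) != n:
        raise ValueError("need as many estimates as references")
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Average pairwise loss under this particular assignment.
        loss = sum(pair_loss(estimates[p], references[i])
                   for i, p in enumerate(perm)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


refs = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
ests = [[0.0, 2.0, 0.1], [1.0, 0.0, 1.0]]  # network outputs in swapped order
loss, perm = pit_loss(ests, refs)
print(loss, perm)
```

Because `pair_loss` is a parameter, any per-source loss can be plugged in without changing the permutation search.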
| ||||
6-29 | (2022-11-05)Audio detection for gaming Master internship, Lille (France), 2022@SteelSeries France R&D team (former Nahimic R&D team), France Audio detection for gaming Master internship, Lille (France), 2022 Advisors — Damien Granger, R&D Engineer, damien.granger@steelseries.com — Nathan Souviraà-Labastie, R&D Engineer, PhD, nathan.souviraa-labastie@steelseries.com Company description About GN Group GN was founded 150 years ago with a truly innovative and global mindset. Today, we honour that legacy with world-leading expertise in the human ear, sound and video processing, wireless technology, miniaturization and collaborations with leading technology partners. GN’s solutions are marketed by the brands ReSound, Beltone, Interton, Jabra, BlueParrott, SteelSeries and FalCom in 100 countries. The GN Group employs 6,500 people and is listed on Nasdaq Copenhagen (GN.CO). About SteelSeries SteelSeries is the worldwide leader in gaming and esports peripherals focused on premium quality, innovation, and functionality. SteelSeries’ family of professional and gaming enthusiasts are the driving force behind the company and help influence, design, and craft every single accessory and the brand’s software ecosystem, SteelSeries GG. In 2020, SteelSeries acquired Nahimic, the leader in 3D sound solutions for gaming. We are currently looking for a machine learning / audio signal processing intern to join the R&D team of SteelSeries’ Software & Services Business Unit in our French office (former Nahimic R&D team). Internship subject This internship targets the detection of a known signal in an audio scene (containing multiple signals). For instance, some signs and feedbacks in video games are always the same audio signal while the rest of the audio scene is changing. 
The current internal implementation is based on a legacy state-of-the-art music identification system [1, 2]. The objective of the internship is to address one or more of the following targets:
- Being agnostic to the source type (speech, music, audio gaming event, ...); the current approach is designed for music
- Handling shorter target signals
- Robustness to various types of overlapping noise from the audio scene
- Robustness to level variation over time (in the case of moving audio sources)
- Exploring the effect of multi-channel input signals: summing the channels might help to identify moving sounds, but potentially at the cost of a lower signal-to-noise ratio
- Improving the above aspects by adapting the approach to a smaller dictionary of target signals (not millions, as in the case of music)
Machine learning is the expected track, in particular pre-trained and potentially overfitted audio representations (embeddings). Here is a short list of examples:
- Attention-based audio embeddings [3]
- Autoencoder [4]
- Contrastive learning [5, 6]
Skills
Who are we looking for? You are preparing an engineering degree or a master’s degree and preferably have knowledge of the development and implementation of advanced machine learning algorithms. Digital audio signal processing skills are a plus. Although not mandatory, notions in the following fields would be appreciated:
- Audio effects in general: compression, equalization, etc.
- Statistics, probabilistic approaches, optimization.
- Programming languages and frameworks: Python, PyTorch, Keras, TensorFlow, Matlab.
- Voice recognition, voice command.
- Computer programming and development: Max/MSP, C/C++/C#.
- Audio editing software: Audacity, Adobe Audition, etc.
- Scientific publications and patent applications.
- Fluent in English and French.
- Demonstrated intellectual curiosity.
References
[1] A. Wang. "The Shazam music recognition service". In: Communications of the ACM 49.8 (2006), pp. 44-48.
[2] A. Wang et al. "An industrial strength audio search algorithm". In: ISMIR. Vol. 2003. Washington, DC. 2003, pp. 7-13.
[3] A. Singh, K. Demuynck and V. Arora. "Attention-Based Audio Embeddings for Query-by-Example". In: arXiv preprint arXiv:2210.08624 (2022).
[4] A. Báez-Suárez et al. "SAMAF: Sequence-to-sequence Autoencoder Model for Audio Fingerprinting". In: ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16.2 (2020), pp. 1-23.
[5] X. Wu and H. Wang. "Asymmetric Contrastive Learning for Audio Fingerprinting". In: IEEE Signal Processing Letters 29 (2022), pp. 1873-1877.
[6] Z. Yu et al. "Contrastive unsupervised learning for audio fingerprinting". In: arXiv preprint arXiv:2010.13540 (2020).
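As a point of reference for the detection task described in the internship subject, the following is a minimal sketch of a classical baseline: normalized cross-correlation (a matched filter) between a known target signal and the audio scene. It is not the internal implementation, which the posting describes as based on music identification [1, 2], nor one of the embedding approaches [3-6]. The function name `detect_template`, the threshold value and the synthetic signals are hypothetical choices made for illustration only.

```python
import numpy as np


def detect_template(scene, template, threshold=0.5):
    """Locate a known template in a scene with normalized cross-correlation.

    Returns (best_offset, best_score, detected). The score is a Pearson
    correlation in [-1, 1] between the template and each scene window.
    """
    scene = np.asarray(scene, dtype=float)
    template = np.asarray(template, dtype=float)
    n = len(template)

    # Zero-mean, unit-norm template.
    t = template - template.mean()
    t /= np.linalg.norm(t) + 1e-12

    # Sliding dot product of the scene with the normalized template.
    raw = np.correlate(scene, t, mode="valid")

    # Norm of each mean-removed scene window, from cumulative sums.
    csum = np.concatenate(([0.0], np.cumsum(scene)))
    csum2 = np.concatenate(([0.0], np.cumsum(scene ** 2)))
    win_sum = csum[n:] - csum[:-n]
    win_sum2 = csum2[n:] - csum2[:-n]
    win_norm = np.sqrt(np.maximum(win_sum2 - win_sum ** 2 / n, 1e-12))

    scores = raw / win_norm
    best = int(np.argmax(scores))
    return best, float(scores[best]), bool(scores[best] >= threshold)


# Synthetic example (assumed values): a windowed 440 Hz tone hidden in noise.
rng = np.random.default_rng(0)
sr = 8000
template = np.sin(2 * np.pi * 440 * np.arange(400) / sr) * np.hanning(400)
scene = 0.1 * rng.standard_normal(4000)
scene[1500:1900] += 0.5 * template

offset, score, found = detect_template(scene, template)
print(offset, round(score, 3), found)
```

Because each window is normalized by its own energy, the score does not change when the whole scene is scaled by a constant positive gain, which relates to the level-variation target listed above. Strong overlapping sources and very short targets are where such a baseline would be expected to struggle.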
| ||||
6-30 | (2022-12-12) Master internship @ LISN Lab, Orsay, Gif sur Yvette, France
Creation of a speech synthesis model from spontaneous speech
Keywords: Speech synthesis, spontaneous speech, low-resource languages, Nigerian Pidgin
Context
Nigerian Pidgin is a large but under-resourced language that increasingly serves as the primary vernacular language of Africa’s most populous country. Once stigmatized as a “broken” variety of English spoken only by the uneducated, Nigerian Pidgin is now a source of pride for many speakers who view it as a home-grown vehicle for communication. It transcends class and ethnicity, lacking the tribal associations of indigenous languages and the colonial baggage associated with English. The language can now be seen and heard on college campuses, in houses of worship, in advertisements, in Nigerian expat communities, and even on a local branch of the British Broadcasting Corporation.
Objectives
Despite Nigerian Pidgin’s growing prestige and a pool of speakers rivaling those of major languages like Turkish or Korean, the grammatical and intonational properties of the language are comparatively understudied. This internship is the extension of an ongoing research project aimed at better understanding its linguistic properties through the development and adaptation of NLP technologies. The principal aim of this research is to produce a natural-sounding text-to-speech (TTS) model that will allow researchers to conduct perception tests to determine how intonation influences the interpretation of meaning. Thanks partly to the recent explosion of neural network-based speech technologies, researchers can now produce high-quality synthesis from relatively simple datasets using models like Tacotron 2, complementing classical approaches such as those based on Hidden Markov Models. Specifically, the intern will assist in developing a text-to-speech platform trained on an existing database of Nigerian Pidgin recordings.
In addition to producing natural-sounding speech, a central goal of this project will be to build a TTS model that will allow for the direct modification of intonational patterns via explicit parameters provided by researchers. The intern’s work will contribute to the exploration of the language’s melodic and tonal properties by allowing researchers to produce variations of novel utterances differing only by their intonational patterns.
Primary tasks
• Surveying existing TTS models and selecting the most suitable approach
• Training a model on a corpus of Nigerian Pidgin
• Optimizing and evaluating the model
Profile
A second-year master’s student with:
• A solid background in machine learning (speech synthesis is a plus)
• Good academic writing skills in English
• A strong interest in language and linguistics
LISN: www.lisn.upsaclay.fr | Twitter @LisnLab | LinkedIn LisnLab
Belvédère site: Campus Universitaire, Bâtiment 507, Rue du Belvédère, 91405 Orsay Cedex
Plaine site: Campus Universitaire, Bâtiment 650, Rue Raimond Castaing, 91190 Gif-sur-Yvette
Modalities
The internship will take place from March 2023 for 5 to 6 months at the LISN lab in the Language Sciences and Technologies department, as well as in the MoDyCo lab at Paris Nanterre University (primarily at the location with the shortest commute).
• The LISN’s Belvédère site is located on the plateau de Saclay: University campus, building 507, rue du Belvédère, 91400 Orsay.
• The MoDyCo lab is located at the Université Paris Ouest Nanterre La Défense: Bâtiment A, 200, avenue de la République, 92001 Nanterre.
The candidate will be supervised by Emmett Strickland (MoDyCo) and Marc Evrard (LISN). Allowance according to official standards (service-public.fr).
To apply
Please send a CV and a brief cover letter highlighting your interest in the project to:
• Emmett Strickland (emmett.strickland@parisnanterre.fr)
• Marc Evrard (marc.evrard@lisn.upsaclay.fr)
Further reading
1. Tan, X., Qin, T., Soong, F., & Liu, T. Y. (2021). A survey on neural speech synthesis. arXiv preprint arXiv:2106.15561. https://arxiv.org/abs/2106.15561
2. Ning, Y., He, S., Wu, Z., Xing, C., & Zhang, L. J. (2019). A Review of Deep Learning Based Speech Synthesis. Applied Sciences (2076-3417), 9(19). https://www.mdpi.com/2076-3417/9/19/4050
3. Bigi, B., Caron, B., & Abiola, O. S. (2017). Developing resources for automated speech processing of the African language Naija (Nigerian Pidgin). In 8th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (pp. 441-445). https://hal.archives-ouvertes.fr/hal-01705707/document
| ||||
6-31 | (2022-12-18) Master internship@ LISNLab, Orsay, France Study and development of a vocal force model Keywords: Machine learning, voice strength, speech processing, expressive speech Context The project aims to model the vocal force (VF) estimation on speech recordings. VF is defined as the sound pressure level (SPL in C-weighted decibels) measured in free field, one meter away in front of the speaker’s mouth (Liénard, 2019). This SPL is unfortunately lost in the vast majority of recordings, though the human ear is able to estimate this information thanks to the spectral differences produced by the variations in vocal effort induced by these VF values. A corpus presenting a pair of calibrated/uncalibrated signals will be used to build a model capable of estimating the original value of VF (in dBC). Collaborations under development will benefit and extend this effort by expanding the collected corpus and applying the resulting model to other tasks (e.g., expressive synthesis, Evrard et al., 2015). Objectives The initial aim will be to increase the variational characteristics of the uncalibrated signal from the pair provided in this corpus. In practice, it will be necessary to apply a series of degradations corresponding to the variations in distance and positioning of the speaker with respect to the microphone. Moreover, other processing will be applied, such as those typically used in post-production (compression, gate, etc.). A model will then have to be trained from this calibrated/uncalibrated pair to reproduce a reliable estimate of the original VF from any recording. Different neural architectures will be evaluated, from simple feedforward neural networks to those based on complex representations (e.g., CNN, LSTM). Different feature extraction methods will also be considered: raw, perceptually filtered (e.g., Mel) spectrums, as well as self-supervised model-based (e.g., Baevski et al., 2020). 
Tasks
• Reviewing speech corpus augmentation techniques
• Surveying learning architectures (neural and self-supervised) for processing audio pairs
• Augmenting the corpus through the application of acoustic degradations
• Building a model of voice strength restoration from the signal pairs
• Presenting an objective evaluation of the model’s performance, as well as a subjective evaluation via perceptual experiments
Profile
A second-year master’s student with:
• A solid background in machine learning
• Good academic writing skills in English
• A strong interest in expressive speech
Modalities
The internship will take place from March 2023 for 5 to 6 months in the Department of Language Sciences and Technologies at the LISN laboratory. The LISN Belvédère site is located on the plateau de Saclay: University campus, building 507, rue du Belvédère, 91400 Orsay. The candidate will be supervised by Marc Evrard and Albert Rilliard. Allowance according to official standards (service-public.fr).
How to apply
Please send a CV and a brief cover letter highlighting your interest in the project to Marc Evrard (marc.evrard@lisn.upsaclay.fr).
References
1. Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). Wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449-12460.
2. Evrard, M., Delalez, S., d’Alessandro, C., & Rilliard, A. (2015). Comparison of chironomic stylization versus statistical modeling of prosody for expressive speech synthesis. In Sixteenth Annual Conference of the International Speech Communication Association.
3. Liénard, J. S. (2019). Quantifying vocal effort from the shape of the one-third octave long-term-average spectrum of speech. The Journal of the Acoustical Society of America, 146(4), EL369-EL375.
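To illustrate why the absolute level is lost in uncalibrated recordings, here is a minimal sketch that computes an unweighted sound pressure level in dB re 20 µPa and shows that an unknown recording gain shifts the measured level by exactly that gain. It is an illustration under stated assumptions, not the project's method: the C-weighting used in the VF definition is omitted, the effect of distance is reduced to the free-field inverse-distance law, and the helper names (`spl_db`, `apply_gain_db`, `distance_gain_db`) are hypothetical.

```python
import numpy as np

P_REF = 20e-6  # standard reference sound pressure in air: 20 micropascals


def spl_db(pressure_pa):
    """Unweighted sound pressure level (dB re 20 µPa) of a signal in pascals."""
    rms = np.sqrt(np.mean(np.square(pressure_pa)))
    return 20.0 * np.log10(rms / P_REF)


def apply_gain_db(signal, gain_db):
    """Scale a signal by a gain expressed in decibels."""
    return np.asarray(signal, dtype=float) * 10.0 ** (gain_db / 20.0)


def distance_gain_db(distance_m, ref_distance_m=1.0):
    """Free-field level change when moving from ref_distance_m to distance_m."""
    return -20.0 * np.log10(distance_m / ref_distance_m)


# Assumed calibrated signal: a 200 Hz tone with 0.2 Pa RMS, "measured" at 1 m.
sr = 48000
t = np.arange(sr) / sr
calibrated = 0.2 * np.sqrt(2.0) * np.sin(2 * np.pi * 200.0 * t)
level_1m = spl_db(calibrated)  # 20 * log10(0.2 / 20e-6) = 80 dB

# Simulated degradation: speaker at 2 m, then an unknown +9 dB recording gain.
at_2m = apply_gain_db(calibrated, distance_gain_db(2.0))
level_2m = spl_db(at_2m)  # about 6 dB lower than at 1 m
recorded = apply_gain_db(at_2m, 9.0)

print(round(level_1m, 2), round(level_2m, 2), round(spl_db(recorded), 2))
```

Since any gain applied after capture shifts the computed level by the same number of decibels, the waveform amplitude alone cannot recover the original value; this is why the posting relies on spectral cues of vocal effort and on a model trained from calibrated/uncalibrated pairs.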
| ||||
6-32 | (2022-12-23) Postdoctoral Research Fellows, National University of Singapore Two full-time Postdoctoral Research Fellows in automatic lyrics generation and automatic singing voice/speech evaluation. You can find the detailed job descriptions here: https://smcnus.comp.nus.edu.sg/postdoct_job_description_2022
| ||||
6-33 | (2023-01-02) POSTDOC 21 MONTHS at GIPSA-Lab, Grenoble-France
21-month postdoctoral position at GIPSA-Lab, Grenoble, France, on the automatic evaluation of computer-assisted reading fluency of young French readers. Contact: Gerard Bailly at gerard.bailly@gipsa-lab.fr
| ||||
6-34 | (2023-01-03) Research internship, CAIBots project: Conversational AI with teams of robots, LIA, Avignon, France
RESEARCH INTERNSHIP FORM
Project title: CAIBots: Conversational AI with teams of robots
Supervisor: Prof. Fabrice Lefèvre
Internship description
The aim of the internship is to study the setup of a robotic platform for simulating voice-based conversational AI (CAI) under "real conditions". Training and then testing conversational AI (chatbots, dialogue systems) is costly and complex; we want to greatly reduce this difficulty by providing an autonomous physical robotic solution for learning and evaluating new CAI modules before using them with real human users. The first step will mainly consist of testing existing off-the-shelf solutions for the components of the spoken language processing pipeline and checking their performance in a robot-robot configuration. Embedded solutions will then be investigated. They should improve the latency of the system and also ensure better protection of personal data (by removing the need to go through proprietary clouds). Overall, the voice interaction system should allow an open discussion between a human and a machine on general topics. The intended use case therefore follows the logic of the Amazon Alexa Prize challenge (https://developer.amazon.com/alexaprize): developing a bot that can sustain a conversation for a few minutes. A simulated user will therefore also be needed to allow autonomous robot-robot interaction (multi-party human-robot conversations may also be tested, although this is not a priority objective of the internship).
The goal is to bootstrap the platform, that is, to set up the components in a basic configuration that nonetheless illustrates the capabilities that could be reached with more development time. The robotic and software solutions envisaged for this work include, for example: Pepper robot, Google Cloud ASR, SpeechBrain, RASA and/or pre-trained models (BERT, GPT, BlenderBot, ...). These are mainly open-source, fairly complete platforms. The work will consist of quickly implementing a real system so that it can be improved in a robot-robot configuration and then tested with a representative panel of potential users. While an interest in machine learning and natural language processing is essential, the intern is also expected to have good software development skills. The internship will be an opportunity to acquire skills in natural language processing in the context of embedded robotics experiments. Several avenues for continuing with a PhD are open.
Duration: 6 months
Remuneration: approximately €540 per month
Associated topics: human-machine dialogue systems, speech recognition and understanding, cognitive interfaces, robotics
| ||||
6-35 | (2023-01-05) Post-doctoral and engineer positions@ LORIA-INRIA, Nancy, France Automatic speech recognition for non-native speakers in a noisy environment
Post-doctoral and engineer positions
Starting date: beginning of 2023
Duration: 24 months for a post-doc position and 12 months for an engineer position
Supervisors: Irina Illina, Associate Professor, HDR Lorraine University LORIA-INRIA Multispeech Team, illina@loria.fr
Context
When a person's hands are busy with a task such as driving a car or piloting an airplane, voice is a fast and efficient means of interaction. In aeronautical communications, English is most often compulsory. Unfortunately, many pilots are not native English speakers: they speak with an accent that depends on their native language and are therefore influenced by its pronunciation mechanisms. Inside an aircraft cockpit, the pilots' non-native speech and the surrounding noise are the most difficult challenges to overcome for efficient automatic speech recognition (ASR). The problems of non-native speech are numerous: incorrect or approximate pronunciations, errors of agreement in gender and number, use of non-existent words, missing articles, grammatically incorrect sentences, etc. The acoustic environment adds a disturbing component to the speech signal. Much of the success of speech recognition relies on the ability to account for different accents and ambient noise in the models used by ASR.
Automatic speech recognition has made great progress thanks to the spectacular development of deep learning. In recent years, end-to-end automatic speech recognition, which directly optimizes the probability of the output character sequence given the input acoustic features, has advanced considerably [Chan et al., 2016; Baevski et al., 2020; Gulati et al., 2020].
Objectives
The recruited person will have to develop methodologies and tools to obtain high-performance non-native automatic speech recognition in the aeronautical context and more specifically in a (noisy) aircraft cockpit.
This project will be based on an end-to-end automatic speech recognition system [Shi et al., 2021] using wav2vec 2.0 [Baevski et al., 2020], one of the best-performing models in the current state of the art. The wav2vec 2.0 model enables self-supervised learning of representations from raw audio data (without transcription).
How to apply: Interested candidates are encouraged to contact Irina Illina (illina@loria.fr) with the required documents (CV, transcripts, motivation letter, and recommendation letters).
Requirements & skills:
- M.Sc. or Ph.D. degree in speech/audio processing, computer vision, machine learning, or in a related field,
- ability to work independently as well as in a team,
- solid programming skills (Python, PyTorch), and deep learning knowledge,
- good level of written and spoken English.
References
[Baevski et al., 2020] A. Baevski, H. Zhou, A. Mohamed, and M. Auli. Wav2vec 2.0: A framework for self-supervised learning of speech representations, 34th Conference on Neural Information Processing Systems (NeurIPS 2020), 2020.
[Chan et al., 2016] W. Chan, N. Jaitly, Q. Le and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960-4964, 2016.
[Chorowski et al., 2017] J. Chorowski, N. Jaitly. Towards better decoding and language model integration in sequence to sequence models. Interspeech, 2017.
[Houlsby et al., 2019] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly. Parameter-efficient transfer learning for NLP. International Conference on Machine Learning, PMLR, pp. 2790–2799, 2019.
[Gulati et al., 2020] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang. Conformer: Convolution-augmented transformer for speech recognition. Interspeech, 2020.
[Shi et al., 2021] X. Shi, F. Yu, Y. Lu, Y. Liang, Q. Feng, D. Wang, Y. Qian, and L. Xie. The accented english speech recognition challenge 2020: open datasets, tracks, baselines, results and methods. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6918–6922, 2021.
| ||||
6-36 | (2023-01-15) M2 internship, LIA, Avignon, France
M2 internship topic: Decoding EEG signals using machine learning methods
Context
EEG (electroencephalography) is a non-invasive technique that measures the electrical activity of the brain using electrodes placed on the head. These electrodes record the electrical activity produced by neurons. The collected data are recorded and can be used, with the help of machine learning methods, for various purposes such as analysis, classification or brain-computer interfaces [Cao20]. In natural language and speech processing, early work with preliminary results has recently appeared (e.g. EEG classification with a Transformer-based approach [Sun21], EEG data recorded during sentence reading [Hollenstein18], or methods combining EEG with language processing techniques to detect user preferences [Gauba17]).
Objective
The objective of this internship is to automatically recognize, from EEG signals, first characters and then isolated words spoken aloud. The steps of the internship can be summarized as follows:
1. Getting started with the Emotiv EPOC-X EEG headset (https://www.emotiv.com/epoc-x/).
2. Designing an experimental protocol and collecting a corpus to support the experiments.
3. Evaluating and selecting machine learning algorithms to recognize characters and/or isolated words from the EEG signals of the corpus collected during the internship.
Candidate profile
The student must be in the final year of an engineering degree or in the second year of a Master's degree (Master 2) in computer science. He or she must have basic programming skills and be proficient with the Linux environment and standard machine learning methods.
Duration
6 months, from February to March 2023.
Location
The internship will take place at the Laboratoire Informatique d’Avignon (LIA) at Avignon Université or at the Laboratoire des Sciences du Numérique de Nantes (LS2N) at the Université de Nantes.
Stipend
The internship will be paid at the hourly rate in force on 01/01/2023, based on an internship agreement of 35 hours per week (€4.05 per hour, i.e. about €550 per month).
How to apply
Please email the following documents to Mickael Rouvier (mickael.rouvier@univ-avignon.fr) and Richard Dufour (richard.dufour@univ-nantes.fr): 1) CV; 2) academic transcripts (bachelor's and master's); 3) cover letter.
Bibliography
[Cao20] Cao, Z. (2020). A review of artificial intelligence for EEG-based brain-computer interfaces and applications. Brain Science Advances, 6(3), 162-170.
[Gauba17] Gauba, H., Kumar, P., Roy, P. P., Singh, P., Dogra, D. P., & Raman, B. (2017). Prediction of advertisement preference by fusing EEG response and sentiment analysis. Neural Networks, 92, 77-88.
[Hollenstein18] Hollenstein, N., Rotsztejn, J., Troendle, M., Pedroni, A., Zhang, C., & Langer, N. (2018). ZuCo, a simultaneous EEG and eye-tracking resource for natural sentence reading. Scientific Data, 5(1), 1-13.
[Sun21] Sun, J., Xie, J., & Zhou, H. (2021). EEG classification with transformer-based models. In 2021 IEEE 3rd Global Conference on Life Sciences and Technologies (LifeTech) (pp. 92-93). IEEE.
| ||||
6-37 | (2023-01-15) Ref 6581/22, Postdoctoral Research Fellow, MARCS Institute for Brain, Behaviour & Development, Western Sydney University, Sydney, Australia
About Western
Western Sydney University is a modern, forward-thinking, research-led university, located at the heart of Australia’s fastest-growing and economically significant region, Western Sydney. Boasting 11 campuses (many in Western Sydney CBD locations) and more than 200,000 alumni, 49,500 students and 3,500 staff, the University has 14 Schools with an array of well-designed programs and degrees carefully structured to meet the demands of future industry. The University is ranked in the top two per cent of universities worldwide, and as a research leader, over 85 per cent of the University’s assessed research is rated at ‘World Standard’ or above.
About the Role
The MARCS Institute for Brain, Behaviour and Development is seeking to appoint a Postdoctoral Research Fellow to join the Brain Sciences research program at Western Sydney University. This two-year Postdoctoral Research Fellow position is funded by an ARC Discovery grant, “Investigating the characteristics of older adults' conversation behaviour”, awarded to Chief Investigators (CIs) Professor Chris Davis and Professor Jeesun Kim at the MARCS Institute for Brain, Behaviour and Development, Western Sydney University, in collaboration with Partner Investigator (PI) Emeritus Professor Valerie Hazan at University College London. This project will investigate the production and perception of naturalistic conversations by young and older adults. In particular, the interest is in probing individualised semantic processing. The aim of the project is to understand factors that affect the engagement of older adults in conversations.
In this role you will work with the team to develop, pilot, and apply new methods for probing individual-specific semantic processing and on the collection of sensory, perceptual, cognitive and electrophysiological (EEG) data. You will also implement procedures for analysing the data. The successful applicant will work closely with the above investigators and the rest of the research team, conducting studies and participating in supervising and/or training PhD students, Research Assistants, Honours students, and interns enlisted into the project. For further information about the role please refer to the attached Position Description. This is a full time, 2 year fixed term contract. Located in the new Westmead Innovation Quarter (WIQ) – a visionary research, health, education and business hub located in the Westmead Health Precinct. About MARCS The MARCS Institute for Brain, Behaviour and Development is an interdisciplinary research institute of Western Sydney. The vision for the Institute is to optimise human interaction and wellbeing across the lifespan. To strive to solve the problems that matter most through the themes: sensing and perceiving, interacting with each other, and technologies for humans. Researchers in MARCS come from many disciplines including cognitive science, developmental psychology, language science, music science, cognitive neuroscience, and biomedical, electrical, electronic and software engineering. Further information is available from our website - http://www.westernsydney.edu.au/marcs About You You will hold a relevant doctoral qualification, or substantial progress toward a PhD in Psychology, Cognitive Neuroscience, Computational linguistics or a related discipline. You must have experience in conducting neuroimaging and behavioural experiments. Culture Western Sydney University highly values equity and inclusiveness. 
We have a proud history of doing so and consider this an important part of our social and civic responsibilities as a University. We strive to contribute to tackling inequalities and promoting wellbeing within our own institution, the Greater Western Sydney region, nationally and internationally.
Remuneration Package:
- Academic Level A: $103,304 to $124,808 p.a. (comprising Salary of $87,293 to $105,464 p.a., plus Superannuation and Leave Loading)
- Academic Level B: $131,136 to $154,738 p.a. (comprising Salary of $110,811 to $130,846 p.a., plus Superannuation and Leave Loading)
Position Enquiries: Please contact Professor Chris Davis via email at chris.davis@westernsydney.edu.au
Closing Date: 8:30pm AEDT, Sunday, 12 February 2023
Immigration Sponsorship: Employer Visa sponsorship will be provided if required.
Western Sydney University is committed to diversity and social inclusion. Applications from people of culturally and linguistically diverse backgrounds and from equity target groups, including women, people with disabilities, people who identify as LGBTIQ, and people of Aboriginal and Torres Strait Islander descent, are encouraged.
Professor Chris Davis, PhD, The MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Westmead Innovation Quarter, Building U, Level 4, 160 Hawkesbury Road (Corner of Farmhouse Road), Westmead NSW 2145 <chris.davis@westernsydney.edu.au>
| ||||
6-38 | (2023-01-16) Research Fellow Chairs @MIAI,Grenoble Interdisciplinary Institute, France
| ||||
6-39 | (2023-01-17) Two postdoctoral positions @ University of Cambridge, UK Senior Postdoctoral Position in SLP University of Cambridge, Department of Engineering (UK) The ALTA Institute is looking for a senior postdoctoral researcher in spoken language processing to join our research team investigating L2 English speaking automated assessment and learning. Website: https://www.jobs.cam.ac.uk/job/39114/
Postdoctoral Position in SLP University of Cambridge, Department of Engineering (UK) The ALTA Institute is looking for a postdoctoral researcher in spoken language processing to join our research team investigating L2 English speaking automated assessment and learning. Website: https://www.jobs.cam.ac.uk/job/39023/
| ||||
6-40 | (2023-01-25) Master 2 internship @ LISN, Orsay, France
Creation of a speech synthesis model from spontaneous speech
Keywords: Machine learning, speech synthesis, low-resource languages, Nigerian Pidgin
Objectives
The main aim is to produce a natural-sounding text-to-speech (TTS) model that can be used for perceptual tests in experimental linguistics. Thanks partly to the recent evolution of neural network-based speech technologies, researchers can now produce high-quality synthesis from relatively simple datasets using models like Tacotron 2, complementing classical approaches such as those based on Hidden Markov Models. Specifically, the intern will assist in developing a text-to-speech platform trained on an existing database of Nigerian Pidgin recordings. In addition to producing natural-sounding speech, a central goal of this project will be to build a TTS model that will allow for the direct modification of intonational patterns via explicit parameters provided by researchers. The intern’s work will contribute to the exploration of the language’s melodic and tonal properties by allowing researchers to produce variations of novel utterances differing only by their intonational patterns.
Context
This work is part of a larger project to study Nigerian Pidgin. It is a large but under-resourced language that increasingly serves as the primary vernacular language of Africa’s most populous country. Once stigmatized as a “broken” variety of English spoken only by the uneducated, Nigerian Pidgin is now a source of pride for many speakers who view it as a home-grown vehicle for communication. It transcends class and ethnicity, lacking the tribal associations of indigenous languages and the colonial baggage associated with English. The language can now be seen and heard on college campuses, in houses of worship, in advertisements, in Nigerian expat communities, and even on a local branch of the BBC.
Primary tasks • Surveying existing TTS models and selecting the most suitable approach • Training a model on a corpus of Nigerian Pidgin • Optimizing and evaluating the model Profile
A second-year master’s student with: • A solid background in machine learning (speech synthesis is a plus) • Good academic writing skills in English • A strong interest in language and linguistics
| ||||
6-41 | (2023-01-26) Associate professor (maître de conférences) position in computer science, Nantes, France Nantes Université is opening an associate professor (maître de conférences) position in computer science for September 2023. Teaching will take place at the Faculté des Langues et Cultures Etrangères (FLCE) and research will be carried out at the LS2N (Laboratoire des Sciences du Numérique de Nantes). The job description detailing the expectations for the position is available at: https://uncloud.univ-nantes.fr/index.php/s/ERdm9t8WPNdCn8m
| ||||
6-42 | (2023-01-30) Vacancy for a university professor in computer science at Bordeaux INP, France Call for applications:
- Research in computer music in the image and sound department of the LaBRI (www.labri.fr) and at SCRIME (scrime.u-bordeaux.fr)
- Teaching at ENSEIRB-MATMECA (https://enseirb-matmeca.bordeaux-inp.fr/fr) in the computer science department
- Schedule: applications between February 23, 2023 and March 30, 2023; start in September 2023
- More information: https://enseirb-matmeca.bordeaux-inp.fr/fr/enseignants
- Contact : myriam.desainte-catherine@labri.fr
Applicants must propose a research project that fits within the image and sound department of the LaBRI to work in particular with the Sound and Music Modeling group, and create links with the Manao team and the Analysis and Indexing group. Candidates must also propose a project for the SCRIME research platform (Studio de Recherche et de Création en Informatique et Musiques Expérimentales) following the departure of the current director. The research area is computer music (sound and music computation, musical interaction). The candidates must be involved in at least one of the following themes:
- Computer processing of music and sound: analysis, transformation and generation of music and sound, including environmental sounds and soundtracks of ecological videos, by computational approaches (algorithms, signal processing, learning) in all dimensions of music and sound (timbre, pitch, dynamics and spatialization).
- Sound and music interaction: designing new interfaces between users and computers to create new means of musical expression, through the design of virtual/mixed/augmented sound reality systems, and new models of musical scores and instruments, through interaction with images and other media, and through the use of sound as a means of information.
- Understanding and modeling of sound and music: music information retrieval, computational musicology, computational approaches to music cognition, formal models and languages for music (time and space of sound and music parameters).
- Design of new tools for sound and music creation, performance and pedagogy: development of tools to assist sound design and music composition, scenarization, sonification, spatialization (includes algorithmic composition, especially by learning techniques); includes research on software architectures and languages combining micro (sound) and macro (musical form) levels, frugality of computations and transfers of sound and music data, minimization of sound transmission delay, formal specifications for tool preservation.
| ||||
6-43 | (2023-02-01) Post-doctoral and engineer positions @ LORIA-INRIA, Nancy, France Automatic speech recognition for non-native speakers in a noisy environment
Post-doctoral and engineer positions
Starting date: beginning of 2023
Duration: 24 months for a post-doc position and 12 months for an engineer position
Supervisors: Irina Illina, Associate Professor, HDR Lorraine University LORIA-INRIA Multispeech Team, illina@loria.fr
Context
When a person has their hands busy performing a task like driving a car or piloting an airplane, voice is a fast and efficient way to interact. In aeronautical communications, the English language is most often compulsory. Unfortunately, a large proportion of pilots are not native English speakers: they speak with an accent that depends on their native language and are therefore influenced by the pronunciation mechanisms of that language. Inside an aircraft cockpit, the non-native speech of the pilots and the surrounding noise are the most difficult challenges to overcome in order to achieve efficient automatic speech recognition (ASR). The problems of non-native speech are numerous: incorrect or approximate pronunciations, errors of agreement in gender and number, use of non-existent words, missing articles, grammatically incorrect sentences, etc. The acoustic environment adds a disturbing component to the speech signal. Much of the success of speech recognition relies on the ability to take different accents and ambient noises into account in the models used by ASR.
Automatic speech recognition has made great progress thanks to the spectacular development of deep learning. In recent years, end-to-end automatic speech recognition, which directly optimizes the probability of the output character sequence given the input acoustic features, has advanced considerably [Chan et al., 2016; Baevski et al., 2020; Gulati et al., 2020].
Objectives
The recruited person will have to develop methodologies and tools to obtain high-performance non-native automatic speech recognition in the aeronautical context and more specifically in a (noisy) aircraft cockpit.
This project will be based on an end-to-end automatic speech recognition system [Shi et al., 2021] using wav2vec 2.0 [Baevski et al., 2020]. This model is one of the most efficient of the current state of the art. This wav2vec 2.0 model enables self-supervised learning of representations from raw audio data (without transcription).
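As an illustration of what producing an "output character sequence" from acoustic frames can look like, here is a minimal sketch of greedy CTC decoding. It rests on an assumption: the announcement does not say which decoder the project uses, although CTC is a common choice when fine-tuning wav2vec 2.0 for recognition. The vocabulary, the blank symbol `_` and the frame scores are made-up toy values.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real vocabularies define their own


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Pick the best symbol per frame, merge consecutive repeats, drop blanks."""
    best = [vocab[max(range(len(row)), key=row.__getitem__)] for row in frame_scores]
    collapsed = [symbol for symbol, _ in groupby(best)]
    return "".join(symbol for symbol in collapsed if symbol != blank)


# Toy per-frame scores over a 4-symbol vocabulary (illustrative values only).
vocab = ["_", "a", "c", "t"]
frame_scores = [
    [0.1, 0.1, 0.7, 0.1],  # c
    [0.1, 0.1, 0.7, 0.1],  # c (repeat, merged)
    [0.6, 0.2, 0.1, 0.1],  # blank
    [0.1, 0.7, 0.1, 0.1],  # a
    [0.1, 0.1, 0.1, 0.7],  # t
    [0.1, 0.1, 0.1, 0.7],  # t (repeat, merged)
]
print(ctc_greedy_decode(frame_scores, vocab))
```

In practice the frame scores would come from the acoustic model, and a beam search with a language model could replace the greedy step.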
How to apply: Interested candidates are encouraged to contact Irina Illina (illina@loria.fr) with the required documents (CV, transcripts, motivation letter, and recommendation letters).
Requirements & skills:
- M.Sc. or Ph.D. degree in speech/audio processing, computer vision, machine learning, or in a related field,
- ability to work independently as well as in a team,
- solid programming skills (Python, PyTorch), and deep learning knowledge,
- good level of written and spoken English.
References
[Baevski et al., 2020] A. Baevski, H. Zhou, A. Mohamed, and M. Auli. Wav2vec 2.0: A framework for self-supervised learning of speech representations, 34th Conference on Neural Information Processing Systems (NeurIPS 2020), 2020.
[Chan et al., 2016] W. Chan, N. Jaitly, Q. Le and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960-4964, 2016.
[Chorowski et al., 2017] J. Chorowski, N. Jaitly. Towards better decoding and language model integration in sequence to sequence models. Interspeech, 2017.
[Houlsby et al., 2019] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly. Parameter-efficient transfer learning for NLP. International Conference on Machine Learning, PMLR, pp. 2790–2799, 2019.
[Gulati et al., 2020] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang. Conformer: Convolution-augmented transformer for speech recognition. Interspeech, 2020.
[Shi et al., 2021] X. Shi, F. Yu, Y. Lu, Y. Liang, Q. Feng, D. Wang, Y. Qian, and L. Xie. The accented English speech recognition challenge 2020: open datasets, tracks, baselines, results and methods. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6918–6922, 2021.
| ||||
6-44 | (2023-02-01) Master or engineer internship at Loria, Nancy, France Master or engineer internship at Loria (France)
Development of language model for business agreement use cases
Duration: 6 months, starting February or March 2023
Location: Loria (Nancy) and Station F, 5 Parvis Alan Turing, 75013, Paris
Supervision: Tristan Thommen (tristan@koncile.ai), Irina Illina (illina@loria.fr) and Jean-Charles Lamirel (jean-charles.lamirel@loria.fr)
Please apply by sending your CV and a short motivation letter directly to Tristan Thommen and Irina Illina.
Motivations and context
Pre-trained models such as Embeddings from Language Models (ELMo) (Peters et al., 2018), Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), Robustly optimized BERT approach (RoBERTa) (Liu et al., 2019c), Generative Pre-Trained Transformer (GPT) (Radford et al., 2018), etc. have proved to be state-of-the-art for various Natural Language Processing (NLP) tasks. These models are trained on a huge unlabeled corpus and can be easily fine-tuned to various downstream tasks using task-specific datasets. Fine-tuning involves adding new task-specific layers to the model and updating the pre-trained model parameters while learning the new task-specific layers.
Objectives
The goal of the internship is to develop a language model specific to business agreement use cases. This model should be able to identify and extract non-trivial information from a large mass of procurement contracts, in English and in French. This information consists, on the one hand, of simple contract identification data such as signature date, name of the parties, contract title, signatories, and on the other hand, of more complex information to be deduced from clauses, in particular, price determination according to parameters such as date or volume, renewal or expiry, obligations for the parties as well as the conditions. The difficulty of this task is that all this information is not standardized and may be represented in different ways and in different places in an agreement. For instance, a price could be based on a formula defined in the articles of the agreement and an index defined in one of its appendices.
To develop this language model we propose to fine-tune a pre-trained language model using a business agreement use dataset. The intern will identify the relevant pre-trained language model, prepare the data for training and adjust the parameters of fine-tuning.
The particularity of the internship is that it uses information from real cases of business agreement management. Datasets will be built from Koncile’s clients and partners, and the models developed during this internship will be directly put into practice and tested with end users.
Koncile is a start-up based in Paris, founded in 2022, that tackles the issue of mismanagement of procurement agreements by companies. It intends to leverage natural language processing techniques to analyze supplier contracts and invoicing. Koncile is incubated by Entrepreneur First and hosted at Station F in Paris.
Additional information and requirements
Good practice in Python and basic knowledge of Natural Language Processing techniques are required. Some notions of machine learning, both theoretical and practical (e.g., using PyTorch), are a plus.
References
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019c). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training.
| ||||
6-45 | (2023-02-06) Associate professor (tenure position) @Telecom Paris in Machine Learning for Social Computing
Faculty position (Associate professor, tenure position) at Telecom Paris in Machine Learning for Social Computing.
Telecom Paris has a new permanent (tenure) faculty position (Associate Professor/“Maître de conférences”) in the area of **machine learning for social computing**. Applicants from the following sub-research areas are welcome:
- Neural models for the recognition and generation of socio-emotional behaviors
- Natural language and speech processing
- Dialogue, conversational systems, and social robotics
- Reinforcement learning for dialogue
- Sentiment analysis in social interactions
- Bias and explainability in AI
- Model tractability, multi-task learning, meta-learning
Salary: between 40.58 k€ and 58.67 k€ depending on profile and experience
Important Dates
- March 20th, 2023: closing date for applications
- April 20th, 2023: hearings of the preselected candidates
Context
Social Computing team [1] - S²A (machine learning, statistics and signal processing) group [2] - LTCI (laboratoire de traitement et communication de l’information) [3] - Telecom Paris [4].
Ecosystem
Telecom Paris [4] is a founding member of the Institut Polytechnique de Paris <https://www.ip-paris.fr/en/> (IP Paris), a world-class scientific and technological institution. Located at the Plateau de Saclay close to Paris-Saclay University, this institution is a partnership between Ecole Polytechnique, ENSTA Paris, ENSAE Paris, Télécom Paris and Télécom SudParis, with HEC as a key partner. Regularly ranked as one of the best engineering schools in France, Télécom Paris is recognized for its excellent training, its very good employability rate with high salaries, its high-level research, and its very close proximity to companies.
The THE (Times Higher Education) ranks Télécom Paris 2nd best French engineering school, 5th best French university, and 6th « best small university » <https://www.telecom-paris.fr/times-higher-education-telecom-paris-6th-best-small-university>. The newly created institution IP Paris was ranked among the top 50 universities in the QS world university ranking. In the context of the Institut Polytechnique de Paris, the activities of the team in Data Science and AI benefit from the center Hi!Paris (https://www.hi-paris.fr), which offers seminars, workshops, and funding through calls for projects.
Main missions/Research activities
- Develop groundbreaking research in the field of machine learning applied to Social Computing, which includes: natural language and speech processing, dialogue, conversational systems, and social robotics, reinforcement learning for dialogue, sentiment analysis in social interactions, bias and explainability in AI, model tractability, multi-task learning, meta-learning
- Develop both academic and industrial collaborations on the same topic, including collaborative activities with other Telecom Paris research departments and teams (including social sciences researchers of the economics and social sciences department [6]), and research contracts with industrial players
- Set up research grants and take part in national and international collaborative research projects
- Publish high-quality research work in leading journals and conferences
- Be an active member of the research community (serving on scientific committees and boards, organizing seminars, workshops, and special sessions...)
Main missions/Teaching activities
Participate in teaching activities at Telecom Paris and its partner academic institutions (as part of joint Master programs), especially in natural language processing, speech processing, machine learning, and Data Science, including life-long training programs (e.g. the local “Mastères Spécialisés”)
Candidate profile
As a minimum requirement, the successful candidate will have:
- A Ph.D. degree
- A track record of research and publication in one or more of the following areas: conversational artificial intelligence, machine learning, natural language processing, speech and signal processing, human-agent interactions, social robotics
- Experience in teaching
- An international postdoctoral experience is welcome but not mandatory
- Excellent command of English
NOTE: The candidate does *not* need to speak French to apply, just to be willing to learn the language (teaching will be mostly given in English)
Other skills expected include:
- Capacity to work in a team and develop good relationships with colleagues and peers
- Excellent writing and pedagogical skills
More about the position
- Place of work: Saclay (Paris outskirts)
How to apply?
Applications must be submitted via one of the following websites:
French Version: https://institutminestelecom.recruitee.com/o/enseignantchercheur-en-machine-learning-pour-la-modelisation-des-comportements-socioemotionnels-a-telecom-paris-cdi
English Version: https://institutminestelecom.recruitee.com/l/en/o/enseignantchercheur-en-machine-learning-pour-la-modelisation-des-comportements-socioemotionnels-a-telecom-paris-cdi
Applicants should submit a single PDF file that includes:
- cover letter,
- curriculum vitae,
- statements of research and teaching interests (4 pages)
- three publications
- contact information for two references
Contacts: == please do not hesitate to directly contact us before applying ==
Chloé Clavel (Coordinator of the Social Computing team)
Stéphan Clémençon (Head of the S²A group)
Florence d’Alché-Buc (Head of the IDS department)
[1] https://www.telecom-paris.fr/en/research/laboratories/information-processing-and-communication-laboratory-ltci/research-teams/signal-statistics-learning/social-computing
[2] https://www.telecom-paris.fr/en/research/laboratories/information-processing-and-communication-laboratory-ltci/research-teams/signal-statistics-learning
[3] https://www.telecom-paris.fr/fr/lecole/departements-enseignement-recherche/image-donnees-signal
[4] https://www.telecom-paris.fr/en/research/laboratories/information-processing-and-communication-laboratory-ltci
[5] https://www.telecom-paris.fr/en/home
[6] https://www.telecom-paris.fr/en/the-school/teaching-research-departments/economics-and-social-sciences
| ||||
6-46 | (2023-02-08) Post-doctoral and engineer positions @ LORIA-INRIA Nancy, France Automatic speech recognition for non-native speakers in a noisy environment
Post-doctoral and engineer positions
Starting date: beginning of 2023
Duration: 24 months for a post-doc position and 12 months for an engineer position
Supervisors: Irina Illina, Associate Professor, HDR Lorraine University LORIA-INRIA Multispeech Team, illina@loria.fr
Context
When a person has their hands busy performing a task like driving a car or piloting an airplane, voice is a fast and efficient way to interact. In aeronautical communications, the English language is most often compulsory. Unfortunately, a large proportion of pilots are not native English speakers: they speak with an accent that depends on their native language and are therefore influenced by the pronunciation mechanisms of that language. Inside an aircraft cockpit, the non-native speech of the pilots and the surrounding noise are the most difficult challenges to overcome in order to achieve efficient automatic speech recognition (ASR). The problems of non-native speech are numerous: incorrect or approximate pronunciations, errors of agreement in gender and number, use of non-existent words, missing articles, grammatically incorrect sentences, etc. The acoustic environment adds a disturbing component to the speech signal. Much of the success of speech recognition relies on the ability to take different accents and ambient noises into account in the models used by ASR.
Automatic speech recognition has made great progress thanks to the spectacular development of deep learning. In recent years, end-to-end automatic speech recognition, which directly optimizes the probability of the output character sequence given the input acoustic features, has advanced considerably [Chan et al., 2016; Baevski et al., 2020; Gulati et al., 2020].
Objectives
The recruited person will have to develop methodologies and tools to obtain high-performance non-native automatic speech recognition in the aeronautical context and more specifically in a (noisy) aircraft cockpit.
This project will be based on an end-to-end automatic speech recognition system [Shi et al., 2021] using wav2vec 2.0 [Baevski et al., 2020]. This model is one of the most efficient of the current state of the art. This wav2vec 2.0 model enables self-supervised learning of representations from raw audio data (without transcription).
How to apply: Interested candidates are encouraged to contact Irina Illina (illina@loria.fr) with the required documents (CV, transcripts, motivation letter, and recommendation letters).
Requirements & skills:
- M.Sc. or Ph.D. degree in speech/audio processing, computer vision, machine learning, or in a related field,
- the ability to work independently as well as in a team,
- solid programming skills (Python, PyTorch), and deep learning knowledge,
- good level of written and spoken English.
References
[Baevski et al., 2020] A. Baevski, H. Zhou, A. Mohamed, and M. Auli. Wav2vec 2.0: A framework for self-supervised learning of speech representations, 34th Conference on Neural Information Processing Systems (NeurIPS 2020), 2020.
[Chan et al., 2016] W. Chan, N. Jaitly, Q. Le and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960-4964, 2016.
[Chorowski et al., 2017] J. Chorowski, N. Jaitly. Towards better decoding and language model integration in sequence to sequence models. Interspeech, 2017.
[Houlsby et al., 2019] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly. Parameter-efficient transfer learning for NLP. International Conference on Machine Learning, PMLR, pp. 2790–2799, 2019.
[Gulati et al., 2020] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang. Conformer: Convolution-augmented transformer for speech recognition. Interspeech, 2020.
[Shi et al., 2021] X. Shi, F. Yu, Y. Lu, Y. Liang, Q. Feng, D. Wang, Y. Qian, and L. Xie. The accented English speech recognition challenge 2020: open datasets, tracks, baselines, results, and methods. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6918–6922, 2021.
| ||||
6-47 | (2023-02-15) Project lead engineer, CRI Nancy, France
| ||||
6-48 | (2023-02-17) Internship: NLP research engineer, LUNII, Paris
TECH · LUNII PARIS · HYBRID REMOTE WORK
NLP research engineer - [Internship - 6 months]
Lunii is a human and entrepreneurial adventure that launched *Ma Fabrique à Histoires* in August 2016, a literary, technological and playful object for children aged 3 to 8.
Your missions
You will join the Tech team to take part in an applied research project on narrative speech synthesis. Speech synthesis has made spectacular advances thanks to deep neural networks, but the procedures for preparing and labeling training data are still very time-consuming. To address this problem, you will mainly contribute to improving automatic analysis and labeling tools as part of the preparation of a speech corpus for a speech synthesis system. Your tasks will be to:
👩🎓 Study and improve existing phonetizers and aligners
🙋 Evaluate and compare methods for the structural analysis of a story
💃 Build a corpus of narrative speech
This list is not exhaustive. Lunii recruits and recognizes all talents: we are deeply committed to gender balance and diversity, and we look forward to hearing from you!
Profile sought
Recruitment process
Send us your CV; we then take the time to study your application carefully and, if it matches the offer, our discussions (by video call or in person) continue:
Additional information
| ||||
6-49 | (2023-02-19) Two professorships @ Technische Universität Darmstadt, Germany Technische Universität Darmstadt is one of Germany’s leading technical universities with broad excellence in research, an interdisciplinary profile, and an explicit focus on engineering sciences.
The department of Electrical Engineering and Information Technology invites applications for a Full Professorship (W3) or Assistant Professorship (W2 with Tenure Track) for Signal Theory and Statistical Learning (Code No 6)
We are looking for an excellent researcher with credentials in at least one of the following areas:
• Theory and methods of statistical inference
• Robust statistical signal processing
• Statistical learning theory
• Theoretical performance analysis and interpretability of signal processing methods
• Guarantees and interpretability of statistical learning methods
• New trends in statistical signal and learning theory
Application deadline: March 15, 2023
All further information about the position and application process can be found under: https://www.tu-darmstadt.de/universitaet/karriere_an_der_tu/stellenangebote/aktuelle_stellenangebote/stellenausschreibungen_detailansichten_1_502208.en.jsp
| ||||
6-50 | (2023-02-28) Maître de conférences (associate professor), Paris-Saclay/LISN, Orsay, France Université Paris-Saclay is recruiting a Maître·sse de Conférences (associate professor) in computer science (CNU section 27) for the start of the 2023 academic year. The person recruited will work at LISN, the Laboratoire Interdisciplinaire des Sciences du Numérique of Université Paris-Saclay, in the Language Science and Technology (STL) department. The research profile concerns multimodal language processing. Teaching will take place at the Faculty of Sciences (UFR de Sciences). Feel free to contact me or a member of the Language Science and Technology department if you have any questions. Details of the research and teaching profiles are given below. Gilles Adda
Teaching
The person recruited may teach in any of the programs of the computer science department of the Faculty of Sciences of Orsay, at Bachelor's and Master's level (standard and work-study tracks). They will be expected to teach in the areas that need strengthening: databases and data science. They may of course also teach in their own areas of interest. Teaching is one of the founding missions of the university. The quality of the training provided and of student learning is more than ever at the heart of the concerns of Université Paris-Saclay. Accordingly, the teaching profile for this position includes the ability to design teaching sequences according to explicit learning objectives and skills, and possibly to experiment with innovative teaching methods. The person recruited will also be expected to become involved quickly in the life of the institution (program management, involvement in one of the university's bodies, etc.). Experience of collective responsibilities is highly desirable. Candidates must clearly set out their plans for integration in terms of teaching, within the university's course offering and in agreement with the computer science department of the Faculty of Sciences.
Research
The candidate will carry out their research at the Laboratoire Interdisciplinaire des Sciences du Numérique (LISN, UMR 9015), located at Université Paris-Saclay. The candidate will join the Language Science and Technology (STL) department and strengthen its activities in multimodal language processing. Natural language includes written, spoken and signed modalities; it can also be accompanied by social attitudes and non-verbal dimensions. Language processing then involves the joint processing of multiple information channels. Moreover, language is often used to describe concepts and refer to entities that are essentially multimodal (description of an image, an event, etc.). The topics of interest to the laboratory around this issue are:
--
Since January 1st 2021, LIMSI has merged with the LRI lab and become the LISN (Laboratoire Interdisciplinaire des Sciences du Numérique, Interdisciplinary Computer Science Laboratory)
Gilles Adda | head of the Language Science and Technology Department | http://www.limsi.fr/Individu/gadda/
| ||||
6-51 | (2023-03-02) Full Professor in Computer Sciences @Grenoble-INP Phelma/Gipsa-Lab, Grenoble, France Grenoble-INP Phelma is recruiting in 2023 a Full Professor in Computer Sciences (Section CNU 27). The host research laboratory will be GIPSA-lab (UMR 5216). The research profile is entitled 'Computer Science and Learning for Image and Signal Processing' and covers all the scientific themes of GIPSA-lab related to information processing, including automatic speech and language processing, for an affiliation to the « Speech and Cognition » group and the CRISSP (Cognitive Robotics, Interactive Systems and Speech Processing) team. The job description is available at https://phelma.grenoble-inp.fr/fr/l-ecole/concours-enseignants-chercheurs-2023
Contacts : Nicolas Marchand, Laurent Girin, Thomas Hueber (firstname.lastname@gipsa-lab.grenoble-inp.fr)
| ||||
6-52 | (2023-03-15) PhD student in Phonetics, Stockholm, Sweden PhD student in Phonetics, Stockholm. Ref. No. SU FV-0793-23
at the Department of Linguistics. Closing date: 15 April 2023. Project description Qualification requirements Selection
In selecting applicants for postgraduate education in linguistics, the department board must take into account rules and regulations of the Faculty of Humanities. In addition to the above selection criteria, the following will be of great importance in the assessment:
Admission Regulations for Doctoral Studies at Stockholm University are available at: www.su.se/rules and regulations. Terms of employment The term of the initial contract may not exceed one year. The employment may be extended for a maximum of two years at a time. However, the total period of employment may not exceed the equivalent of four years of full-time study. Doctoral students should primarily devote themselves to their own education, but may engage in teaching, research, and administration corresponding to a maximum of 20 % of a full-time position. For this particular position, the doctoral student is expected to perform departmental duties corresponding to 20 % of full time. Where applicable, the total time of the appointment is extended to correspond to a full-time doctoral programme for four years. Please note that admission decisions cannot be appealed. Stockholm University strives to be a workplace free from discrimination and with equal opportunities for all. Contact Union representatives Application Please include the following information with your application
and, in addition, please include the following documents
Note that the proposal must address the following questions: why your project is suitable to be carried out at the Department of Linguistics at Stockholm University, how you intend to contribute to the research environment at the department with your research project, what makes you particularly suitable (to carry out the proposed research project).
You are welcome to apply! Stockholm University contributes to the development of sustainable democratic society through knowledge, enlightenment and the pursuit of truth.
Closing date: 15/04/2023
| ||||
6-53 | (2023-03-06) 2 open postdoc positions at the LISN (ex-LIMSI), Paris, France We currently have 2 open postdoc positions at the LISN (ex-LIMSI). You can apply online
| ||||
6-54 | (2023-03-08) PhD in ML/Speech Processing @LIA, Avignon, France PhD in ML/Speech Processing - Speaker recognition systems against voice attacks: voice synthesis and voice transformation
Starting date: September 1st, 2023 (flexible)
Application deadline: July 10th, 2023
Interviews (tentative): July 15th, 2023
Salary: ~2000€ gross/month (social security included)
Mission: research oriented (teaching possible but not mandatory)
Keywords: speech processing, automatic speaker recognition, anti-spoofing, deep neural network
CONTEXT
It is now widely accepted that automatic speaker recognition (ASV) systems are vulnerable not only to speech produced artificially by text-to-speech (TTS) [1], but also to other forms of attacks such as voice conversion (VC) and replay [2]. Voice conversion, which can be used to manipulate the voice identity of a speech signal, has progressed extremely rapidly in recent years [3] and has indeed become a serious threat.
The progress made in recent years in training deep neural networks has enabled spectacular advances in the fields of text-to-speech (TTS) and voice conversion (VC): DeepVoice, Tacotron 1 and 2 [4], Auto-VC [5,6]. Existing architectures now make it possible to produce synthesized or manipulated artificial voices with a realism close to or equal to that of human voices [4]. At the same time, voice conversion algorithms (from one speaker to another) have also made spectacular advances. It is now possible to clone a voice identity using a small amount of data. In the space of two years, extremely significant advances have been made [5,6,7]. The ability of these algorithms to forge voice identities capable of deceiving speaker recognition and countermeasure systems is an urgent topic of research.
Progress in the fight against identity theft has been led by the ASVspoof community, formed in 2013 and recognized as competent at the international level [8]. The most significant efforts have been made at the level of the acoustic parametrization (front-end), making it possible to better differentiate authentic (human) utterances from fraudulent utterances. The best-performing system [9] combines acoustic parameters based on the Mel cepstrum, a cepstrum based on cochlear filters, and instantaneous frequencies, with a classifier based on a Gaussian mixture model.
For the past years, research efforts have focused on the back-end. As in speaker recognition research, the anti-spoofing community has embraced the power of deep learning and, unsurprisingly, the neural architectures used are almost the same. Advances in anti-spoofing have followed the rapid advances in TTS and VC. The best anti-spoofing system again used traditional acoustic parameters, with a classifier based on ResNet-18 [10].
SCIENTIFIC OBJECTIVES
As part of this thesis, the robustness of existing countermeasures against new forms of adversarial attacks designed specifically to deceive them will be assessed. One of the advances expected in this thesis will focus on the design of new countermeasures to detect such emerging, increasingly adversarial attacks. To do this, two avenues will be explored. The first is to redesign front-end feature extraction to capture cues that characterize adversarial attacks, then use them with re-trained classifiers. As it is not always easy to identify reliable characteristics, the second direction will aim at the adoption of end-to-end architectures able to learn characteristics automatically. Although these advances improve robustness to adversarial attacks, it will be important to ensure that the resulting countermeasures remain robust to previous attacks. This is known as the generalization problem. An effective anti-spoofing countermeasure must reliably detect any form of attack it encounters, not just the specific attacks it is trained to detect. Finally, improving adversarial attack detection performance should not come at the cost of increased false positives (genuine speech labeled as spoofed speech), which can hurt usability and convenience. The progress and results targeted in this thesis will therefore be countermeasures capable of defending speaker recognition systems against adversarial and non-adversarial attacks.
In parallel to this competition between research teams specializing in attacks and research teams specializing in counter-attacks, the speaker recognition community is focused on the creation and design of high-performance systems that are robust to acoustic variability. Recognition systems are trained to recognize speakers in increasingly difficult conditions (presence of several types of noise: additive, reverb, etc.). This robustness against difficult acoustic conditions can lead to weakness against recordings of attacks that were not taken into account during training. Of course this vulnerability can be reduced by using countermeasures (CM) systems. This approach can impact the usability of ASV systems since the countermeasures can also reject genuine clients (authentic users). This thesis will therefore go beyond the state of the art by optimizing both the ASV and the CM system, so that they work together to achieve the best possible compromise between security and usability/convenience.
REQUIRED SKILLS
- Master 2 in speech processing, computer science or data science
- Good command of Python programming and deep learning frameworks
- Previous experience in bias in machine learning would be a plus
- Good communication skills in English
- Good command of French would be a plus but is not mandatory
LAB AND SUPERVISION
The PhD position will be co-supervised by Nicholas Evans from EURECOM and Driss Matrouf from LIA-Avignon. Joint meetings are planned on a regular basis and the student is expected to spend time in LIA-Avignon. The students, along with the partners (IRCAM specialized in attack generation and EURECOM specialized in countermeasures) will closely collaborate.
INSTRUCTIONS FOR APPLYING
Applications must contain a CV, a letter/message of motivation and Master's grade transcripts; applicants should be ready to provide letter(s) of recommendation. Applications should be addressed to Driss Matrouf (driss.matrouf@univ-avignon.fr), Mickael Rouvier (mickael.rouvier@univ-avignon.fr) and Nicholas Evans (evans@eurecom.fr).
REFERENCES
[1] https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-1567157402
[2] N. Evans, T. Kinnunen and J. Yamagishi, "Spoofing and countermeasures for automatic speaker verification", in Proc. Interspeech 2013, pp. 925-929, 2013.
[3] Z. Yi et al., "Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion", in Proc. ISCA Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020.
[4] J. Shen et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions", in Proc. ICASSP, 2018.
[5] K. Qian, Y. Zhang, S. Chang, X. Yang and M. Hasegawa-Johnson, "AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss", in Proc. International Conference on Machine Learning (ICML), pp. 5210-5219, 2019.
[6] J.-X. Zhang, Z.-H. Ling and L.-R. Dai, "Non-Parallel Sequence-to-Sequence Voice Conversion With Disentangled Linguistic and Speaker Representations", IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 28:540-552, 2020.
[7] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, ... and Y. Wu, "Transfer learning from speaker verification to multispeaker text-to-speech synthesis", in Advances in Neural Information Processing Systems, pp. 4480-4490, 2018.
[8] N. Evans, T. Kinnunen and J. Yamagishi, "Spoofing and countermeasures for automatic speaker verification", in Proc. Interspeech 2013, pp. 925-929, 2013.
[9] T. B. Patel and H. A. Patil, "Combining Evidences from Mel Cepstral, Cochlear Filter Cepstral and Instantaneous Frequency Features for Detection of Natural vs. Spoofed Speech", in Proc. Interspeech 2015, pp. 2062-2066, 2015.
[10] X. Cheng, M. Xu and T. F. Zheng, "Replay detection using CQT-based modified group delay feature and ResNeWt network in ASVspoof 2019", in Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 540-545, 2019.
| ||||
6-55 | (2023-03-09) Doctoral position: Acoustic to Articulatory Inversion by using dynamic MRI images @LORIA, Nancy, France
Loria (Lorraine Research Laboratory in Computer Science and its Applications) is a research unit common to CNRS, the Université de Lorraine and INRIA. Loria gathers 450 scientists, and its missions mainly deal with fundamental and applied research in computer science. Its MultiSpeech team focuses on automatic speech processing, audiovisual speech and speech production. IADI is a research unit common to Inserm and the Université de Lorraine whose specialty is developing techniques and methods to improve imaging of moving organs via the acquisition of MR images.
This PhD project, funded by LUE (Lorraine Université d'Excellence), associates the MultiSpeech team and the IADI laboratory.
The start date is expected to be 1 September 2023, or as soon as possible thereafter.
Supervisors Yves Laprie, email yves.laprie@loria.fr Pierre-André Vuissoz, email pa.vuissoz@chru-nancy.fr
The project
Articulatory synthesis mimics the speech production process by first generating the shape of the vocal tract from the sequence of phonemes to be pronounced, then the acoustic signal by solving the aeroacoustic equations. Compared to other approaches to speech synthesis, which offer a very high level of quality, its main interest is to control the whole production process, beyond the acoustic signal alone. The objective of this PhD is to succeed in the inverse transformation, called acoustic-to-articulatory inversion, in order to recover the geometric shape of the vocal tract from the acoustic signal. A simple voice recording will then allow the dynamics of the different articulators to be followed during the production of the sentence. Beyond its interest as a scientific challenge, acoustic-to-articulatory inversion has many potential applications. On its own, it can be used as a diagnostic tool to evaluate articulatory gestures in an educational or medical context.
Description of work
The objective is the inversion of the acoustic signal to recover the temporal evolution of the medio-sagittal slice. Indeed, dynamic MRI provides very good quality two-dimensional images in the medio-sagittal plane at 50 Hz, and the speech signal acquired with an optical microphone can be very efficiently deconstructed with the algorithms developed in the MultiSpeech team (examples available on https://artspeech.loria.fr/resources/). We plan to use corpora already acquired or in the process of being acquired. These corpora represent a very large volume of data (several hundreds of thousands of images), and an approach for tracking the contours of articulators in MRI images, which gives very good results, was developed to process them. The automatically tracked contours can therefore be used to train the inversion. The goal is to perform the inversion using the LSTM approach on data from a small number of speakers for which sufficient data exists. This approach will have to be adapted to the nature of the data and be able to identify the contribution of each articulator. In itself, successful inversion to recover the shape of the vocal tract in the medio-sagittal plane will be a remarkable success, since current results only cover a very small part of the vocal tract (a few points on the front part of the vocal tract). However, it is important to be able to transpose this result to any subject, which raises the question of speaker adaptation; this is the second objective of the PhD.
What we offer
Application Your application including all attachments must be in English and submitted electronically by clicking APPLY NOW below. Please include:
Log into Inria's recruitment system (https://jobs.inria.fr/public/classic/en/offres/2023-05790) in order to apply to this position.
| ||||
6-56 | (2023-03-10) Instrumentation engineer (ingénieur d'étude), GIPSA-LAB, Grenoble, France The Platforms service of the GIPSA-LAB laboratory (CNRS/G-INP/UGA, Grenoble) is recruiting an instrumentation engineer (ingénieur·e d'étude). For more information, the detailed job description is available here: https://filesender.renater.fr/?s=download&token=2ae39938-121d-44c4-b6ce-385029ecf331 Do not hesitate to contact me if you need further information. Coriandre Vilain, Research Engineer, UGA, PCMD team, Pôle Parole & Cognition, GIPSA-LAB -- Head of the GIPSA Platforms service -- 04 76 82 77 80 www.gipsa-lab.fr/~coriandre.vilain
| ||||
6-57 | (2023-03-12) Associate-Assistant Professor in artificial intelligence for semantic and multi-modal multimedia analysis. Telecom Sud Paris, France Dear All, Telecom SudParis welcomes applications for a permanent position of Associate-Assistant Professor in artificial intelligence for semantic and multi-modal multimedia analysis. Telecom SudParis is a public graduate school of engineering, recognized at the highest level in the domain of digital technology. Telecom SudParis is a co-founding member of the Institut Polytechnique de Paris and part of the Institut Mines-Telecom, the number one group of engineering schools in France. The recruited assistant/associate professor will join the ARTEMIS (Advanced Research and TEchniques for Multidimensional Imaging Systems) Department of Télécom SudParis and the SAMOVAR laboratory. The targeted research theme concerns the field of artificial intelligence, applied to the semantic analysis of massive, distributed and heterogeneous multimedia data. This covers the automatic, multi-modal interpretation of complex audio-visual documents, computer vision, multimedia indexing, knowledge extraction and machine learning methodologies. The expected contributions will focus on deep neural network learning methods and target the entire multimedia content processing chain. Detailed information can be found at the following URL: https://institutminestelecom.recruitee.com/l/en/o/maitre-de-conferences-en-intelligence-artificielle-pour-lanalyse-semantique-de-donnees-multimedia-cdi The application deadline is March 31, 2023. Please do not hesitate to contact me for any further information. Best regards, Titus ZAHARIA, Professor Head of the ARTEMIS Department Télécom SudParis Institut Polytechnique de Paris titus.zaharia@telecom-sudparis.eu
| ||||
6-58 | (2023-03-16) PhD Position in Deep Cascaded Representation Learning for Speech Modelling, Univ. Sheffield, UK Title of Project: Deep Cascaded Representation Learning for Speech Modelling Supervisor: Professor Thomas Hain Deadline for Applications: 13th April 2023
The successful applicant will work under the supervision of Prof. Hain who is the Director of the
Funding Details:
| ||||
6-59 | (2023-03-15) Research Associate in Integrated Multitask Neural Speech Labelling, Univ. Sheffield, UK Job Title: Research Associate in Integrated Multitask Neural Speech Labelling We’re one of the best not-for-profit organisations to work for in the UK. The University’s Total
| ||||
6-60 | (2023-03-15) PhD Position in Adaptive Deep Learning for Speech and Language, Univ. Sheffield, UK Title of Project: PhD Position in Adaptive Deep Learning for Speech and Language Supervisor: Professor Thomas Hain
The successful applicant will work under the supervision of Prof. Hain who is the Director of the
| ||||
6-61 | (2023-03-17) Two MCF (maître de conférences) positions in phonetics open for competition at Université Paul-Valéry Montpellier 3, France Two MCF positions in phonetics are open for competition at Université Paul-Valéry Montpellier 3 this year: General phonetics and tool-assisted processing of spoken language (UMR 5267 Praxiling): https://www.galaxie.enseignementsup-recherche.gouv.fr/ensup/ListesPostesPublies/ANTEE/2023_1/0341089Z/FOPC_0341089Z_4352.pdf Phonetics and didactics of spoken French as a foreign language (FLE) (language acquisition / appropriation): https://www.galaxie.enseignementsup-recherche.gouv.fr/ensup/ListesPostesPublies/ANTEE/2023_1/0341089Z/FOPC_0341089Z_4353.pdf Deadline: 30 March 2023
| ||||
6-62 | (2023-03-20) 2 postdocs for project ASTRID DeTOX @IRCAM Paris and EURECOM Sophia Antipolis, France As part of the ASTRID DeTOX project on combating deepfake videos of French public figures,
two positions are open:
- A 15-month postdoc at IRCAM on the generation of audio-visual deep fakes
- An 18-month postdoc or a 36-month PhD at EURECOM on the detection of audio-visual deep fakes
| ||||
6-63 | (2023-03-15) PhD student in Phonetics, Stockholm University, Sweden PhD student in Phonetics, Stockholm Ref. No. SU FV-0793-23
at the Department of Linguistics. Closing date: 15 April 2023. Project description Qualification requirements Selection
In selecting applicants for postgraduate education in linguistics, the department board must take into account rules and regulations of the Faculty of Humanities. In addition to the above selection criteria, the following will be of great importance in the assessment:
Admission Regulations for Doctoral Studies at Stockholm University are available at: www.su.se/rules and regulations. Terms of employment The term of the initial contract may not exceed one year. The employment may be extended for a maximum of two years at a time. However, the total period of employment may not exceed the equivalent of four years of full-time study. Doctoral students should primarily devote themselves to their own education, but may engage in teaching, research, and administration corresponding to a maximum of 20 % of a full-time position. For this particular position, the doctoral student is expected to perform departmental duties corresponding to 20 % of full time. Where applicable, the total time of the appointment is extended to correspond to a full-time doctoral programme for four years. Please note that admission decisions cannot be appealed. Stockholm University strives to be a workplace free from discrimination and with equal opportunities for all. Contact Union representatives Application Please include the following information with your application
and, in addition, please include the following documents
Note that the proposal must address the following questions: why your project is suitable to be carried out at the Department of Linguistics at Stockholm University, how you intend to contribute to the research environment at the department with your research project, what makes you particularly suitable (to carry out the proposed research project).
You are welcome to apply! Stockholm University contributes to the development of sustainable democratic society through knowledge, enlightenment and the pursuit of truth.
Closing date: 15/04/2023
| ||||
6-64 | (2023-03-20) PhD student position in experimental phonetics @ Stockholm University, Sweden The Department of Linguistics at Stockholm University invites applications for a PhD student position in experimental phonetics, including (but not limited to) topics in prosody. For details, see: https://www.su.se/english/about-the-university/work-at-su/available-jobs/phd-student-positions-1.507588?rmpage=job&rmjob=20262&rmlang=UK .
| ||||
6-65 | (2023-03-20) PhD student, Bielefeld University, Germany The Digital Linguistics Lab at Bielefeld University (head: JProf. Dr.-Ing. Hendrik Buschmeier) is seeking to fill a research position (PhD student, E13 TV-L, 100%, fixed-term) in the area of multimodal human-robot interaction in the research project "Hybrid Living".
| ||||
6-66 | (2023-03-24) Research assistant, McGill University, Montreal, Canada We are seeking a multimodal designer to take on a central role in the Shared Reality Lab's open source IMAGE project (image.a11y.mcgill.ca), focused on making photos, charts, and maps available to people who are blind or have low vision. We are currently operating under two grants, focused on integrating haptic force feedback and pin array devices into IMAGE. You will work with a multidisciplinary team of user experience researchers, designers, and developers who will support you in designing and releasing multimodal audio and haptic experiences that will delight our end users. The primary requirement is a strong background and passion for owning both design and iterative testing of combined audio/haptic end-user experiences. Since the goal of IMAGE is to release a practical solution that can be used on a daily basis, the candidate will work directly with developers and the rest of the team to make sure that ideas and designs get translated into implementable requirements, then deployed into production.
Other useful skills (not required). If the candidate has the desire and capability, they are also welcome to participate in software architecture and implementation, for example:
Candidates applying as a research assistant must be eligible to work in Canada.
The position is available immediately, with an initial appointment of up to one year. Informal inquiries are welcome.
|