ISCA - International Speech
Communication Association


ISCApad #295

Monday, January 09, 2023 by Chris Wellekens

6 Jobs

6-1

(2022-07-07) Two internship positions @ Naver Labs Europe

Naver Labs Europe (https://europe.naverlabs.com/) is currently offering 2 internship positions related to Speech Processing.

More details on both job offers can be found here:

https://europe.naverlabs.com/job/multidomain-and-multitask-learning-for-asr-internship/

https://europe.naverlabs.com/job/fine-grained-multi-faceted-control-of-prosodic-features-for-tts-systems-internship/


6-2

(2022-07-17) Two PhD positions at Quality and Usability Lab, Technical University of Berlin, Germany

Two PhD positions at Quality and Usability Lab, Technical University of Berlin, Germany

We are looking to recruit two Doctoral Researchers to join the Quality and Usability Lab at Technical University of Berlin, Germany. Both positions are research assistant positions (TVL-E13) and, depending on follow-up funding, are to be continued until the doctoral thesis can be finished.

The Quality and Usability Lab is part of TU Berlin’s Faculty IV and deals with the design and evaluation of human-machine interaction, in which aspects of human perception, technical systems and the design of interaction are the subject of our research. We focus on self-determined work in an interdisciplinary and international team; for this we offer open and flexible working conditions that promote scientific and personal exchange and are a prerequisite for excellent results.

Research Assistant (full-time) Speech Quality

The research is in the area of the assessment of the quality of speech services using a crowdsourcing approach. The aim of the research is to analyze how crowdsourcing-based listening-only and conversational speech quality evaluation experiments can be set up in order to provide valid and reliable results, and how the characteristics of the test participants, the test environment and the playback system can be assessed in online tests. It will be assessed which differences are to be expected between crowdsourcing and laboratory-based speech quality evaluation, and how these differences influence the development of instrumental speech quality prediction models. The results are expected to influence methods for speech quality assessment in crowdsourcing, as they are summarized in ITU-T Recommendation P.808.

This project is funded by the Deutsche Forschungsgemeinschaft (DFG) and is limited to a duration until January 31, 2024 (compensation TVL E13). Subsequent ongoing employment is supported if the PhD cannot be finished within the running time of the project.

Tasks:

Interacting with and extending web platforms that are created for conducting and managing listening-only or conversational experiments (Frontend: HTML/JS/CSS, backend for conversation testing: Node.js / express.js, WebRTC)
Conducting subjective listening-only and/or conversation tests in the laboratory and via crowdsourcing; analyzing the results
Recording source speech signals in both laboratory and large-scale crowdsourcing settings and preparing speech datasets.
Enhancing test methods (that we developed for ITU-T Rec. P.808) for screening the participants’ ability, environment and set-up suitability for the speech quality assessment tasks.
Processing speech signals collected in a crowdsourcing approach, and applying relevant artificial network degradation conditions (e.g. background noise, clipping, etc.).
Benchmarking state-of-the-art instrumental models for predicting speech quality based on their performance on the collected crowdsourcing dataset.
Project communication and reporting.
Publication and presentation of project and research results in scientific journals, at conferences, at workshops and at ITU-T Study Group 12 experts’ meetings.
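
The artificial degradation conditions named in the tasks above (background noise, clipping) can be illustrated with a minimal sketch. It is not the lab's processing pipeline: the function names, the white-noise model, the test tone and all parameter values are assumptions chosen for illustration, and only the Python standard library is used.

```python
import math
import random


def rms(signal):
    """Root-mean-square level of a list of float samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def add_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise scaled to a target signal-to-noise ratio in dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Scale the noise so that 20*log10(rms(signal) / rms(scaled noise)) == snr_db.
    gain = rms(signal) / (rms(noise) * 10 ** (snr_db / 20.0))
    return [s + gain * n for s, n in zip(signal, noise)]


def clip(signal, threshold):
    """Hard-clip samples to the range [-threshold, +threshold]."""
    return [max(-threshold, min(threshold, s)) for s in signal]


# Hypothetical test signal: 0.1 s of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
clean = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1600)]
noisy = add_noise(clean, snr_db=20.0)
clipped = clip(clean, threshold=0.25)
```

A real study would more likely mix in recorded background noise and work on full recordings; the sketch only shows how a target SNR and a clipping threshold translate into sample-level operations.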

Requirements:

Successfully completed university degree (Master, Diplom or equivalent) in computer engineering/science, informatics, media informatics, digital media, or information systems (or an equivalent technical background)
Deep knowledge, and hands-on experience in one or more general purpose programming languages (recommended is Python)
Profound programming skills in front-end (HTML5/CSS3, JS, jQuery, JSON), AND one scripting language for data processing (either MATLAB, Python or R), and ideally backend development skills
Knowledge of digital signal processing; knowledge of speech or audio signal processing and acoustics is beneficial
Knowledge about empirical subjective tests and statistical data analysis is appreciated
Language skills: English fluent in writing and speaking (B2 level); willingness to learn German is expected
Joy of working in an interdisciplinary and international environment

Research Assistant (full-time) Conversation Quality – Salary grade E 13 TV-L

The initial funding is available from September 1st, 2022 and is limited until April 30th, 2023; however, the outcomes of the research should be used to support the preparation of a new project application, and may also become a foundation for a later PhD thesis. A subsequent position as a research assistant from the project funds would be possible if the funds were approved.

Tasks:

Maintaining and further developing a platform to conduct web-based voice calls
Conducting subjective conversation tests in the laboratory and via crowdsourcing

Analysis of speech signals
Creating and evaluating models for predicting quality aspects using different algorithms (including traditional signal processing methods and state-of-the-art DNNs)
Project communication and reporting
Publication and presentation of project and research results in scientific journals, at conferences, and in workshops

Requirements:

Master or diploma in electrical engineering, computer engineering, computer science, media informatics, media technology, information systems management (or an equivalent technical background)
Profound knowledge of digital signal processing; knowledge of speech or audio signal processing is beneficial
Good programming skills (e.g. MATLAB or Python) and confident handling of web development tools (e.g. HTML5/CSS3, JS, ideally also backend development skills)
Interest in running user studies with test participants to determine speech quality
Language skills: English and German fluent in writing and speaking
Knowledge about empirical studies and statistical data analysis is appreciated
Joy of working in an interdisciplinary and international environment

Application

For both positions, please send the following documents, bundled in a single PDF file, to Prof. Dr.-Ing. Sebastian Möller bewerbung@qu.tu-berlin.de: Letter of application, curriculum vitae, copies of certificates, job references. Please also specify for which position you are applying.

To ensure equal opportunities between women and men, applications by women with the required qualifications are explicitly desired. Qualified individuals with disabilities will be favored.


6-3

(2022-07-28) PhD Position : Naver Labs Europe (France) and FBK Trento (Italy)

PhD Position : Naver Labs Europe (France) and FBK Trento (Italy) start Nov 2022

Have you recently completed, or do you expect to complete very soon, an MSc or equivalent degree in computer science, artificial intelligence, computational linguistics, engineering, or a related area? Are you interested in carrying out research on Speech-to-Speech Translation during the next few years? Are you excited to spend a part of your life in two pleasant alpine cities in France (Grenoble) and Italy (Trento)?

WE ARE LOOKING FOR YOU!!!

The Machine Translation (MT) group at Fondazione Bruno Kessler (Trento, Italy), in conjunction with Naver Labs Europe (Grenoble, France), is pleased to announce the availability of the following fully-funded Ph.D. position at the Doctorate Program in Industrial Innovation of the University of Trento and Fondazione Bruno Kessler.

PhD topic: Unified Foundation models for Speech-to-Speech Translation

The deadline for application: August 23rd.

More details here: http://tinyurl.com/PhD-FBK-NLE


6-4

(2022-07-21) Research Opportunity at INESC TEC / LIAAD, Porto, Portugal

Research Opportunity at INESC TEC / LIAAD, Porto, Portugal
Natural Language Processing / Machine Learning

Funded PhD position, fees covered during the period of the grant

Objectives:
Develop Machine Learning and NLP algorithms and tools to identify, formally represent and reuse narrative structures from textual sources, with a focus on journalistic texts and medical texts in Portuguese and other languages. The focus is on NLP algorithms and tools for extracting and understanding content.

Work description
We are looking for a highly motivated Master's graduate to join the team of researchers of the Text2Story project and to do a PhD that will extend beyond the end of the project. The topic is Extraction of Narratives from text. The selected candidate will work with INESC TEC's Machine Learning and NLP team and will have the opportunity to work in a dedicated and young environment, in close interaction with researchers, doctoral and post-doctoral students working on varied Machine Learning topics, information extraction and computer science. The candidate must be motivated to collaborate on other projects on time.

Academic Qualifications
MSc. in Computer Science / Data Science / Mathematics

Minimum profile required
Programming experience, Statistics, publications in NLP/Text Mining

Preference factors:
Good background in Mathematics. Involvement in previous research projects and publications

Minimum requirements:
Knowledge in Maths, Learning / Data Mining. Knowledge of programming languages, mainly Python and C/C++. Strong will to pursue a PhD. Excellent academic background.

Application deadline

02-August-2022

Advisor
Alípio Jorge

Apply now (text in PT and ENG)
http://www.inesctec.pt/pt/oportunidade/AE2022-0227


6-5

(2022-08-11) PhD position @Gipsa Lab , Grenoble, France

The PhD position on 'The role of bone conduction in the control of voice (speech, singing) and musical expression' has still not been filled:
http://www.gipsa-lab.grenoble-inp.fr/transfert/propositions/1_2022-07-05_offretheseINCEPTION.pdf

To apply, follow this link:

emploi.cnrs.fr/Offres/Doctorant/UMR5216-CHRROM-021/Default.aspx

Best regards,

Pierre Baraduc, Coriandre Vilain and Nathalie Henrich Bernardoni

***********************************************************************************************************

PhD position on 'The role of bone conduction in the control of voice (speech, singing) and musical expression':
http://www.gipsa-lab.grenoble-inp.fr/transfert/propositions/1_2022-07-05_offretheseINCEPTION.pdf

To apply, follow this link:

https://emploi.cnrs.fr/Offres/Doctorant/UMR5216-CHRROM-018/Default.aspx


6-6

(2022-08-22) PhD in ML/NLP – Fairness and self-supervised learning for speech processing, IMAG, Grenoble, France

PhD in ML/NLP – Fairness and self-supervised learning for speech processing

Starting date: November 1st, 2022 (flexible)

Application deadline: September 5th, 2022

Interviews (tentative): September 19th, 2022

Salary: ~2000€ gross/month (social security included)

Mission: research oriented (teaching possible but not mandatory)

Keywords: speech processing, fairness, bias, self-supervised learning, evaluation metrics

CONTEXT

The ANR project E-SSL (Efficient Self-Supervised Learning for Inclusive and Innovative Speech Technologies) will start on November 1st 2022. Self-supervised learning (SSL) has recently emerged as one of the most promising artificial intelligence (AI) methods, as it is now feasible to take advantage of the colossal amounts of existing unlabeled data to significantly improve performance on various speech processing tasks.

PROJECT OBJECTIVES

Speech technologies are widely used in our daily life and are expanding the scope of our action, with decision-making systems, including in critical areas such as health or legal aspects. In these societal applications, the use of these tools raises the issue of possible discrimination against people according to criteria for which society requires equal treatment, such as gender, origin, religion or disability. Recently, the machine learning community has been confronted with the need to work on the possible biases of algorithms, and many works have shown that the search for the best performance is not the only goal to pursue [1]. For instance, recent evaluations of ASR systems have shown that performance can vary according to gender, but these variations depend both on the data used for learning and on the models [2]. Such systems are therefore increasingly scrutinized for being biased, while trustworthy speech technologies represent a crucial expectation.

Both the question of bias and the concept of fairness have now become important aspects of AI, and we now have to find the right balance between accuracy and the measure of fairness. Unfortunately, these notions of fairness and bias are challenging to define and their meanings can greatly differ [3].
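
One simple way to make such performance variations concrete is to compute the word error rate separately for each speaker group and look at the gap. The sketch below is only an illustration under stated assumptions: the group labels, the toy transcripts and the choice of "largest minus smallest WER" as a gap measure are all made up here, and this is not the metric the project commits to.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer_by_group(utterances):
    """Return {group: WER}, pooling errors and reference words per group."""
    errors, words = {}, {}
    for group, ref, hyp in utterances:
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        errors[group] = errors.get(group, 0) + edit_distance(ref_tokens, hyp_tokens)
        words[group] = words.get(group, 0) + len(ref_tokens)
    return {g: errors[g] / words[g] for g in errors}


# Made-up toy data: (group label, reference transcript, ASR hypothesis).
toy = [
    ("A", "the cat sat on the mat", "the cat sat on the mat"),
    ("A", "good morning everyone", "good morning everyone"),
    ("B", "the cat sat on the mat", "the cat sat on mat"),
    ("B", "good morning everyone", "good morning every one"),
]
scores = wer_by_group(toy)
gap = max(scores.values()) - min(scores.values())
```

On this toy data group A has no errors while group B has three errors over nine reference words, so the gap equals group B's WER of one third.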

The goals of this PhD position are threefold:

- First, survey the many definitions of robustness, fairness and bias, with the aim of arriving at definitions and metrics suited to speech SSL models

- Then, gather speech datasets with a large amount of well-described metadata

- Finally, set up an evaluation protocol for SSL models and analyze the results.

SKILLS

Master 2 in Natural Language Processing, Speech Processing, computer science or data science.

Good command of Python programming and deep learning frameworks.

Previous experience in bias in machine learning would be a plus

Very good communication skills in English

Good command of French would be a plus but is not mandatory

SCIENTIFIC ENVIRONMENT

The PhD position will be co-supervised by Alexandre Allauzen (Dauphine Université PSL, Paris) and Solange Rossato and François Portet (Université Grenoble Alpes). Joint meetings are planned on a regular basis and the student is expected to spend time in both places. Moreover, two other PhD positions are open in this project. The students, along with the partners, will collaborate closely. For instance, specific SSL models along with evaluation criteria will be developed by the other PhD students. The PhD student will also collaborate with several team members involved in the project, in particular the two other PhD candidates who will be recruited and the partners from LIA, LIG and Dauphine Université PSL, Paris. The means to carry out the PhD will be provided both in terms of missions in France and abroad and in terms of equipment. The candidate will have access to the GPU clusters of both the LIG and Dauphine Université PSL. Furthermore, access to the national supercomputer Jean-Zay will make it possible to run large-scale experiments.

INSTRUCTIONS FOR APPLYING

Applications must contain a CV, a letter/message of motivation and Master's transcripts (be ready to provide letter(s) of recommendation), and be addressed to Alexandre Allauzen (alexandre.allauzen@espci.psl.eu), Solange Rossato (Solange.Rossato@imag.fr) and François Portet (francois.Portet@imag.fr). We celebrate diversity and are committed to creating an inclusive environment for all employees.

REFERENCES:

[1] Mengesha, Z., Heldreth, C., Lahav, M., Sublewski, J. & Tuennerman, E. “I don’t Think These Devices are Very Culturally Sensitive.”—Impact of Automated Speech Recognition Errors on African Americans. Frontiers in Artificial Intelligence 4. issn: 2624-8212. https://www.frontiersin.org/article/10.3389/frai.2021.725911 (2021).
[2] Garnerin, M., Rossato, S. & Besacier, L. Investigating the Impact of Gender Representation in ASR Training Data: a Case Study on Librispeech in Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing (2021), 86–92.
[3] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. & Galstyan, A. A Survey on Bias and Fairness in Machine Learning. ACM Comput. Surv. 54. issn: 0360-0300. https://doi.org/10.1145/3457607 (July 2021).


6-7

(2022-08-22) PhD in ML/NLP – Efficient, Fair, robust and knowledge informed self-supervised learning for speech processing

PhD in ML/NLP – Efficient, Fair, robust and knowledge informed self-supervised learning for speech processing

Starting date: November 1st, 2022 (flexible)

Application deadline: September 5th, 2022

Interviews (tentative): September 19th, 2022

Salary: ~2000€ gross/month (social security included)

Mission: research oriented (teaching possible but not mandatory)

Keywords: speech processing, natural language processing, self-supervised learning, knowledge informed learning, Robustness, fairness

CONTEXT

The ANR project E-SSL (Efficient Self-Supervised Learning for Inclusive and Innovative Speech Technologies) will start on November 1st 2022. Self-supervised learning (SSL) has recently emerged as one of the most promising artificial intelligence (AI) methods, as it is now feasible to take advantage of the colossal amounts of existing unlabeled data to significantly improve performance on various speech processing tasks.

PROJECT OBJECTIVES

Recent SSL models for speech such as HuBERT or wav2vec 2.0 have shown an impressive impact on downstream tasks performance. This is mainly due to their ability to benefit from a large amount of data at the cost of a tremendous carbon footprint rather than improving the efficiency of the learning. Another question related to SSL models is their unpredictable results once applied to realistic scenarios which exhibit their lack of robustness. Furthermore, as for any pre-trained models applied in society, it is important to be able to measure the bias of such models since they can augment social unfairness.

The goals of this PhD position are threefold:

- to design new evaluation metrics for SSL of speech models;

- to develop knowledge-driven SSL algorithms;

- to propose methods for learning robust and unbiased representations.

SSL models are evaluated with downstream task-dependent metrics, e.g. word error rate for speech recognition. This couples the evaluation of the universality of SSL representations to a potentially biased and costly fine-tuning that also hides the efficiency information related to the pre-training cost. In practice, we will seek to measure the training efficiency as the ratio between the amount of data, computation and memory needed to observe a certain gain in terms of performance on a metric of interest, i.e. downstream-dependent or not. The first step will be to document standard markers that can be used to assess these values robustly at training time. Potential candidates are, for instance, floating point operations for computational intensity, number of neural parameters coupled with precision for storage, online measurement of memory consumption for training and cumulative input sequence length for data.
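
The candidate markers listed above can be gathered into a small record, with each resource normalised by the observed gain. This is only a sketch of one possible reading of "ratio between resources and gain": the class and function names are hypothetical, the per-unit-gain normalisation is an assumption, and the numbers are illustrative rather than measurements of any real model.

```python
from dataclasses import dataclass


@dataclass
class TrainingCost:
    flops: float             # total floating point operations spent in pre-training
    num_parameters: int      # number of model parameters
    bits_per_parameter: int  # numerical precision used for storage, e.g. 32 or 16
    peak_memory_bytes: int   # peak memory observed during training
    input_seconds: float     # cumulative length of the input audio seen

    def storage_bytes(self):
        """Storage footprint: parameter count times bytes per parameter."""
        return self.num_parameters * self.bits_per_parameter // 8


def cost_per_unit_gain(cost, gain):
    """Resources spent per unit of improvement on the metric of interest."""
    if gain <= 0:
        raise ValueError("no positive gain to normalise by")
    return {
        "flops": cost.flops / gain,
        "storage_bytes": cost.storage_bytes() / gain,
        "peak_memory_bytes": cost.peak_memory_bytes / gain,
        "input_seconds": cost.input_seconds / gain,
    }


# Purely illustrative numbers, not measurements of any real model.
cost = TrainingCost(
    flops=1e18,
    num_parameters=95_000_000,
    bits_per_parameter=16,
    peak_memory_bytes=16 * 1024**3,
    input_seconds=960 * 3600,
)
ratios = cost_per_unit_gain(cost, gain=2.0)  # e.g. 2.0 absolute points gained
```

Lower values mean less resource spent per unit of gain; how to combine the four ratios into a single figure, if at all, is left open here.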

Most state-of-the-art SSL models for speech rely on masked prediction, e.g. HuBERT and WavLM, or contrastive losses, e.g. wav2vec 2.0. Such prevalence in the literature is mostly linked to the size, amount of data and computational resources injected by the company producing these models. In fact, vanilla masking approaches and contrastive losses may be identified as uninformed solutions as they do not benefit from in-domain expertise. For instance, it has been demonstrated that blindly masking frames in the input signal, as in HuBERT and WavLM, results in much worse downstream performance than applying unsupervised phonetic boundaries [Yue2021] to generate informed masks. Recently, some studies have demonstrated the superiority of an informed multitask learning strategy, carefully selecting self-supervised pretext-tasks with respect to a set of downstream tasks, over the vanilla wav2vec 2.0 contrastive learning loss [Zaiem2022]. In this PhD project, our objectives are to: 1. continue to develop knowledge-driven SSL algorithms reaching higher efficiency ratios and results at the convergence, data consumption and downstream performance levels; and 2. scale these novel approaches to a point enabling the comparison with current state-of-the-art systems, thereby motivating a paradigm change in SSL for the wider speech community.
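
The difference between blind and informed masking can be pictured with a toy sketch. It does not reproduce the masking code of HuBERT, WavLM or [Yue2021]: the function names, probabilities, span length and boundary positions are all assumptions chosen for illustration, and the boundaries are simply given as frame indices rather than estimated from audio.

```python
import random


def blind_span_mask(num_frames, start_prob, span, seed=0):
    """Pick span starts at random, ignoring the content of the signal."""
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for t in range(start, min(start + span, num_frames)):
                mask[t] = True
    return mask


def boundary_informed_mask(num_frames, boundaries, segment_prob, seed=0):
    """Mask whole segments delimited by (estimated) phonetic boundaries."""
    rng = random.Random(seed)
    mask = [False] * num_frames
    edges = [0] + sorted(boundaries) + [num_frames]
    for start, end in zip(edges, edges[1:]):
        if rng.random() < segment_prob:
            for t in range(start, end):
                mask[t] = True
    return mask


# Hypothetical utterance of 50 frames with made-up boundary positions.
boundaries = [7, 15, 22, 31, 40]
blind = blind_span_mask(50, start_prob=0.065, span=10)
informed = boundary_informed_mask(50, boundaries, segment_prob=0.4)
```

In the informed variant every masked region starts and ends on a boundary, so a segment is either fully hidden or fully visible, whereas blind spans can cut through segments at arbitrary points.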

Despite remarkable performance on academic benchmarks, SSL-powered technologies, e.g. speech and speaker recognition, speech synthesis and many others, may exhibit highly unpredictable results once applied to realistic scenarios. This can translate into a global accuracy drop due to a lack of robustness to adversarial acoustic conditions, or biased and discriminatory behaviors with respect to different pools of end users. Documenting and facilitating the control of such aspects prior to the deployment of SSL models into real life is necessary for the industrial market. To evaluate such aspects, within the project, we will create novel robustness regularization and debiasing techniques along two axes: 1. debiasing and regularizing speech representations at the SSL level; 2. debiasing and regularizing downstream-adapted models (e.g. using a pre-trained model).

To ensure the creation of fair and robust SSL pre-trained models, we propose to act both at the optimization and data levels following some of our previous work on adversarial protected attribute disentanglement and the NLP literature on data sampling and augmentation [Noé2021]. Here, we wish to extend this technique to more complex SSL architectures and more realistic conditions by increasing the disentanglement complexity i.e. the sex attribute studied in [Noé2021] is particularly discriminatory. Then, and to benefit from the expert knowledge induced by the scope of the task of interest, we will build on a recent introduction of task-dependent counterfactual equal odds criteria [Sari2021] to minimize the downstream performance gap observed in between different individuals of certain protected attributes and to maximize the overall accuracy. Following this multi-objective optimization scheme, we will then inject further identified constraints as inspired by previous NLP work [Zhao2017]. Intuitively, constraints are injected so the predictions are calibrated towards a desired distribution i.e. unbiased.

SKILLS

Master 2 in Natural Language Processing, Speech Processing, computer science or data science.

Good command of Python programming and deep learning frameworks.

Previous experience in Self-Supervised Learning, acoustic modeling or ASR would be a plus

Very good communication skills in English

Good command of French would be a plus but is not mandatory

SCIENTIFIC ENVIRONMENT

The thesis will be conducted within the GETALP team of the LIG laboratory (https://lig-getalp.imag.fr/) and the LIA laboratory (https://lia.univ-avignon.fr/). The GETALP team and the LIA have strong expertise and a track record in Natural Language Processing and speech processing. The recruited person will be welcomed within the teams, which offer a stimulating, multinational and pleasant working environment.

The means to carry out the PhD will be provided both in terms of missions in France and abroad and in terms of equipment. The candidate will have access to the GPU clusters of both the LIG and LIA. Furthermore, access to the national supercomputer Jean-Zay will make it possible to run large-scale experiments.

The PhD position will be co-supervised by Mickael Rouvier (LIA, Avignon) and Benjamin Lecouteux and François Portet (Université Grenoble Alpes). Joint meetings are planned on a regular basis and the student is expected to spend time in both places. Moreover, the PhD student will collaborate with several team members involved in the project in particular the two other PhD candidates who will be recruited and the partners from LIA, LIG and Dauphine Université PSL, Paris. Furthermore, the project will involve one of the founders of SpeechBrain, Titouan Parcollet with whom the candidate will interact closely.

INSTRUCTIONS FOR APPLYING

Applications must contain a CV, a letter/message of motivation and Master's transcripts (be ready to provide letter(s) of recommendation), and be addressed to Mickael Rouvier (mickael.rouvier@univ-avignon.fr), Benjamin Lecouteux (benjamin.lecouteux@univ-grenoble-alpes.fr) and François Portet (francois.Portet@imag.fr). We celebrate diversity and are committed to creating an inclusive environment for all employees.

REFERENCES:

[Noé2021] Noé, P.- G., Mohammadamini, M., Matrouf, D., Parcollet, T., Nautsch, A. & Bonastre, J.- F. Adversarial Disentanglement of Speaker Representation for Attribute-Driven Privacy Preservation in Proc. Interspeech 2021 (2021), 1902–1906.

[Sari2021] Sarı, L., Hasegawa-Johnson, M. & Yoo, C. D. Counterfactually Fair Automatic Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3515–3525 (2021)

[Yue2021] Yue, X. & Li, H. Phonetically Motivated Self-Supervised Speech Representation Learning in Proc. Interspeech 2021 (2021), 746–750.

[Zaiem2022] Zaiem, S., Parcollet, T. & Essid, S. Pretext Tasks Selection for Multitask Self-Supervised Speech Representation in AAAI, The 2nd Workshop on Self-supervised Learning for Audio and Speech Processing, 2023 (2022).

[Zhao2017] Zhao, J., Wang, T., Yatskar, M., Ordonez, V. & Chang, K. - W. Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (2017), 2979–2989.


6-8

(2022-08-24) PostDoc position at Grenoble Alps University, Grenoble, France

PostDoc position at Grenoble Alps University, France

Summary

The Grenoble Alps University offers a PostDoc position for a highly motivated candidate to be working on the multi-disciplinary research project THERADIA, which aims to create an empathic virtual assistant that accompanies cognitively impaired patients during remediation exercises at home. The successful candidate will have the exciting opportunity to develop new machine learning techniques for the robust detection of affective and cognitive behaviours from newly collected audiovisual data. Models will be incorporated into the virtual agent to tailor the interaction with the patient, using specific interaction scenarios, and these models will be evaluated and fine-tuned in a clinical trial to demonstrate the effectiveness of the agent in supporting patients suffering from cognitive conditions during digital therapies. If successful, the system will be operated nationally and the cognitive remediation sessions will be covered by social security.

Duration: 2 years

Salary: according to experience (up to 4142€ / month)

Envisaged starting date: November 2022

Scientific environment

The person recruited will be hosted within the GETALP team of the Laboratoire d’Informatique de Grenoble (LIG), which offers a dynamic, international, and stimulating framework for conducting high-level multi-disciplinary research. The GETALP team is housed in a modern building (IMAG) located on a 175-hectare landscaped campus that was ranked as the eighth most beautiful campus in Europe by Times Higher Education magazine in 2018.

Requirements

The ideal candidate must have a PhD degree and a strong background in machine learning, and affective computing or cognitive science/neuroscience.

The successful candidate should have:

· Excellent knowledge of machine learning techniques

· Good knowledge of speech and/or image processing

· Good knowledge of experimental design and statistics

· Strong programming skills in Python

· Excellent publication record

· Willingness to work in multi-disciplinary and international teams

· Good communication skills

Application

Applications are expected to be received on an ongoing basis and the position will be open until filled. Applications should be sent to Fabien Ringeval (fabien.ringeval@imag.fr) and François Portet (francois.portet@imag.fr). The application file should contain:

· Curriculum vitae

· Recommendation letter

· One-page summary of research background and interests

· At least three publications demonstrating expertise in the aforementioned areas

· Pre-defence reports and defence minutes; or summary of the thesis with date of defence for those currently in doctoral studies


6-9

(2022-08-25) Post-doc @IMT Atlantique, France

In the framework of the European/Japanese e-VITA project (https://www.e-vita.coach/), IMT Atlantique is offering a 15-month post-doctoral position in the field of active living technologies (IoT, data fusion, AI, cloud/edge architectures, user services, coaching, NLP, etc.).

Description and link to apply:
https://institutminestelecom.recruitee.com/l/en/o/postdoctorante-ou-postdoctorant-en-fusion-de-donnees-multimodales-cdd-15-mois


6-10

(2022-08-26) Research engineer, Laboratoire Parole et Langage, Aix-en-Provence, France

The Laboratoire Parole et Langage (LPL UMR7309 CNRS AMU) is recruiting a research engineer (IR) for its EEG & Eye-Tracking Platform, whose duties will be:

Collection, processing and analysis of data acquired by means of electroencephalography and eye-movement tracking instruments,
Advice and training in the field of statistical analysis.

This is a 4-month fixed-term contract (CDD) starting in October 2022 in Aix-en-Provence.

Information and applications: https://emploi.cnrs.fr/Offres/CDD/UMR7309-STELHU-003/Default.aspx


6-11

(2022-09-02) Research Assistant, TUBerlin, Germany

Research Assistant for Conversational Speech Quality Assessment and Prediction - salary grade E 13 TV-L Berliner Hochschulen

part-time employment may be possible

Hire Date

The start date for the position is planned for November 1st, 2022, with a doctorate as the qualification goal. It is limited to a duration until September 30th, 2025, and subsequent ongoing employment is supported if sufficient third-party funding is available.

About Us

The majority of systems and services that are provided by computer science, electrical engineering and information technology are ultimately directed at their human users. To successfully build such systems and services, it is essential to investigate and understand users and their behavior when interacting with technology. From this, design principles for human-machine interfaces can be derived and requirements for the underlying technologies can be defined.

The Quality and Usability Lab is part of TU Berlin’s Faculty IV and deals with the design and evaluation of human-machine interaction, in which aspects of human perception and communication, technical systems and the design of interaction are the subject of our research. We focus on self-determined work in an interdisciplinary and international team; for this we offer open and flexible working conditions that promote scientific and personal exchange and are a prerequisite for excellent results.

Tasks

The position is open to do research in the field of speech signal analysis and the assessment of speech quality in different (mobile and fixed) networks. To this end, speech signals are to be analyzed in listening-only as well as conversational situations in order to get indications of the perceived quality. Based on these analyses, signal-based and parametric models for the estimation of speech quality can be extended and integrated. One focus of the present research may be the evaluation of new speech codecs in different network scenarios. The models are to be validated based on subjective listening and conversation tests. For this purpose, methods of crowdsourcing can be applied, i.e. real users should carry out the data collection and/or evaluation via an online platform. Scientifically interesting is the comparison of such crowdsourced data to those that can be obtained under laboratory conditions.

The outcomes of the research should be used to support the preparation of a new project application, and may also become a foundation for a later PhD thesis. A subsequent position as a research assistant from the project funds would be possible if the funds were approved.

The concrete tasks include, among other things:

Maintaining and further developing a platform to conduct web-based voice calls

Conducting subjective conversation tests in the laboratory and via crowdsourcing

Analysis of speech signals
Creating and evaluating models for predicting quality aspects using different algorithms (including traditional signal processing methods and state-of-the-art DNNs)
Project communication and reporting
Publication and presentation of project and research results in scientific journals, at conferences, and in workshops

Requirements

Master or diploma in electrical engineering, computer engineering, computer science, media informatics, media technology, information systems management (or an equivalent technical background)
Profound knowledge of digital signal processing; knowledge of speech or audio signal processing is beneficial
Good programming skills (e.g. MATLAB or Python) and confident handling of web development tools (e.g. HTML5/CSS3, JS, ideally also backend development skills)
Interest in running user studies with test participants to determine speech quality
Language skills: fluent written and spoken English and German
Knowledge about empirical studies and statistical data analysis is appreciated
Joy of working in an interdisciplinary and international environment

Compensation

“Tarifvertrag für den öffentlichen Dienst der Länder (TV-L)” (E13, 100%).

Application

Please send the following documents, bundled in a single PDF file, to
Prof. Dr.-Ing. Sebastian Möller bewerbung@qu.tu-berlin.de:

Letter of application, curriculum vitae, copies of certificates, job references.

To ensure equal opportunities between women and men, applications by women with the required qualifications are explicitly desired. Qualified individuals with disabilities will be favored. The TU Berlin values the diversity of its members and is committed to the goals of equal opportunities.

Please send electronic copies only. Original documents will not be returned.

Top

6-12

(2022-09-02) Research Assistant (2), TU Berlin, Germany

Research Assistant for Chatbot-based Support for Self-organization During Studies - salary grade E 13 TV-L Berliner Hochschulen

part-time employment may be possible

Hire Date

The start date for the position is planned for October 1st, 2022, with the qualification goal doctorate. It is limited to a duration of 26 months and a subsequent ongoing employment is supported if the PhD cannot be finished in the given time.

About Us

Most systems and services that are provided by computer science, electrical engineering and information technology are ultimately oriented towards the needs of their human users. To successfully build such systems and services, it is essential to investigate and understand users and their behavior when interacting with technology. From this, design principles for human-machine interfaces can be derived and requirements for the underlying technologies can be defined.

The Quality and Usability Lab is part of TU Berlin’s Faculty IV and deals with the design and evaluation of human-machine interaction, in which aspects of human perception, technical systems and the design of interaction are the subject of our research. We focus on self-determined work in an interdisciplinary and international team; for this we offer open and flexible working conditions that promote scientific and personal exchange and are a prerequisite for excellent results.

Tasks

Conception and development of text-based interactive dialogue systems, so-called chatbots, as part of the USOS project (chatbot-based support for self-organization during studies). Machine learning methods are used both to process text-based information and to control dialogs. The range of tasks also includes the design and implementation of graphic user interfaces, e.g., as a web app, Android app or iOS app. The quality and the user experience of the created interaction concepts are then evaluated in the context of user studies.

The specific tasks include:

Implementation of information extraction for the module transfer system and course catalog of the TU Berlin
Implementation of natural language understanding, dialog management and response generation for a chatbot
Communication with project participants on the technical requirements of the chatbot
Planning and conducting user studies
Active participation in the conception, construction, and evaluation of the overall system
Publication and presentation of project and research results in scientific journals, at conferences, and in workshops as well as standardization meetings of ITU-T

Experienced members of our team will support you as you familiarize yourself with your areas of responsibility.

Requirements

Master or diploma in electrical engineering, computer engineering/science, informatics, media informatics, media technology, information systems (or an equivalent technical background)
Ability to work independently in a team and good self-organization
Very good programming knowledge in Python and its routine use in development environments and experience with working under Linux and the command line
Experience in the use of machine learning frameworks such as TensorFlow, Keras, or PyTorch
In-depth knowledge of the principles of machine learning (supervised learning, unsupervised learning and reinforcement learning)
Previous experience in one of the following areas: Information Extraction, Natural Language Understanding, Natural Language Generation
Desired previous knowledge (not required)
- Experience in the preparation and efficient processing of training data for AI-based systems
- Experience in the development of chatbots or speech dialog systems
- Experience with transformer-based language models such as BERT or GPT
- Experience with empirical research and statistical data analysis
Interest in carrying out experiments with test subjects to determine quality and user experience
Language skills: fluent written and spoken German, confident communication in English
Joy of working in an interdisciplinary and international environment

Compensation

“Tarifvertrag für den öffentlichen Dienst der Länder (TV-L)” (E13, 100%)

Application

Please send the following documents, bundled in a single PDF file, to
Prof. Dr.-Ing. Sebastian Möller bewerbung@qu.tu-berlin.de:

Letter of application, curriculum vitae, copies of certificates, job references.

To ensure equal opportunities between women and men, applications by women with the required qualifications are explicitly desired.

Qualified individuals with disabilities will be favored.

Please send copies only. Original documents will not be returned.

Top

6-13

(2022-09-02) Research Assistant (3), TU Berlin, Germany

Research Assistant for Multimodal Interactive Assistance for the Digital Collection of Patient-Reported Outcome Measures - salary grade E 13 TV-L Berliner Hochschulen

part-time employment may be possible

Hire Date

The start date for the position is planned for December 1st, 2022, with the qualification goal doctorate. It is limited to a duration of 2.5 years and a subsequent ongoing employment is supported if the PhD cannot be finished in the given time.

About Us

Most systems and services that are provided by computer science, electrical engineering and information technology are ultimately oriented towards the needs of their human users. To successfully build such systems and services, it is essential to investigate and understand users and their behavior when interacting with technology. From this, design principles for human-machine interfaces can be derived and requirements for the underlying technologies can be defined.

The Quality and Usability Lab (https://tu.berlin/qu) is part of TU Berlin’s Faculty IV and deals with the design and evaluation of human-machine interaction, in which aspects of human perception, technical systems and the design of interaction are the subject of our research. We focus on self-determined work in an interdisciplinary and international team; for this we offer open and flexible working conditions that promote scientific and personal exchange and are a prerequisite for excellent results.

Tasks

Conception and development of an interactive natural language-based dialog system, as part of the project MIA-PROM (Multimodal interactive assistance for the digital collection of Patient-Reported Outcome Measures). The project is in the field of outpatient rehabilitation and requires cooperation with researchers (human-machine interaction, technology-sociology, and healthcare) and the intended users. In the subproject Adaptive Dialog, methods of machine learning are used both for the processing of natural language utterances and for the control and adaptation of dialogs. The range of tasks also includes research on the methods and interaction concepts used in the subproject. The quality and user experience of the created interaction concepts are then evaluated in empirical user studies.

The specific tasks include:

Implementation of components of a spoken dialog system, in particular, natural language understanding, dialog management and response generation
Communication with project partners on the technical and functional requirements of the dialog system
Planning and implementation of user studies
Active participation in the conception, implementation, and evaluation of the overall system
Publication and presentation of project and research results in scientific journals, at conferences and workshops, as well as in international standardization committees

Experienced members of our team will support you as you familiarize yourself with your areas of responsibility.

Requirements

Master or diploma in electrical engineering, computer engineering/science, informatics, media informatics, media technology, information systems (or an equivalent technical background)
Ability to work independently in a team and good self-organization
Good programming skills in Python and their routine use in development environments
Previous experience in the field of Natural Language Processing
Experience in the use of frameworks for Natural Language Processing (e.g., Rasa NLU, AllenNLP or SparkNLP)
Fundamental knowledge of machine learning principles
Optional previous knowledge (not required):

Experience in the preparation and efficient processing of training data for AI-based systems
Experience in the development of chatbots or voice dialogue systems
Experience with Transformer-based language models such as BERT or GPT

Interest in conducting empirical studies with human participants to determine quality and user experience in human-machine interaction
Language skills: fluent written and spoken German, confident communication in English
Desire to work in an interdisciplinary and international environment

Compensation

“Tarifvertrag für den öffentlichen Dienst der Länder (TV-L)” (E13, 100%)

Application

Please send the following documents, bundled in a single PDF file, to
Prof. Dr.-Ing. Sebastian Möller bewerbung@qu.tu-berlin.de:

Letter of application, curriculum vitae, copies of certificates, job references.

To ensure equal opportunities between women and men, applications by women with the required qualifications are explicitly desired.

Qualified individuals with disabilities will be favored.

Please send copies only. Original documents will not be returned.

Top

6-14

(2022-09-09) Postdoctoral Research Fellow, Tampere University and Tampere University of Applied Sciences, Finland

Postdoctoral Research Fellow
(speech and language technology, cognitive science)

Tampere University and Tampere University of Applied Sciences create
a unique environment for multidisciplinary, inspirational and high-impact
research and education. Our university community has its
competitive edges in technology, health and society. www.tuni.fi/en
The Speech and Cognition research group (SPECOG) is part of Computing Sciences at Tampere University within
the Faculty of Information Technology and Communication Sciences. SPECOG focuses on interdisciplinary
research at the intersection of speech and language technology and cognitive science. We combine advanced
signal processing and machine learning methods with empirical large-scale infant data to study child
language development. We also study how human-like perceptual learning can be applied in artificial
intelligence (AI) systems. SPECOG collaborates with several internationally leading research groups within
and across disciplinary boundaries, including joint research with psychologists, linguists, phoneticians, and
computer scientists.
More information on SPECOG: https://webpages.tuni.fi/specog/index.html
Job description
We are inviting applications for the position of a postdoctoral research fellow on the topic of computational
modelling of child language development. The position is associated with a project titled “Modeling Child
Language Development using Naturalistic Data at a Scale (L-SCALE)”, where the aim is to develop new
practices for training and evaluation of computational models of infant language learning using realistic infant
data. These data may include long-form child-centred audio recordings, infant-caregiver interaction
transcripts, and meta-analyses conducted across a range of behavioural experiments. While the job is located
in Finland, the project has a notable emphasis on international collaboration with key partners around the
world.
The work will be conducted as a member of the SPECOG research group led by Dr. Okko Räsänen. We are
looking for candidates who are interested in human and/or machine language processing, and who are willing
to contribute to our cross-disciplinary research efforts in understanding language learning in humans through
computational means.
In this position, the candidate is expected to:
1) carry out high-quality postdoctoral research on computational modelling of early language development
and contribute to the development of ecologically plausible model training and evaluation practice,
2) work in close collaboration with other members of the research group, and
3) advise undergraduate/graduate projects on topics related to your own research (with flexibility
according to personal interests and career aspirations).
Requirements
The candidate should hold a doctoral degree (e.g., PhD or D.Sc.) in language technology, psycholinguistics,
cognitive science, computer science, or other relevant area. Candidates who have already completed their
doctoral research work but have not yet received their doctoral certificate may also apply.

A successful candidate has strong expertise in a) speech and/or language technology or in b) child language
development research with quantitative methods (e.g., developmental psychology, psycholinguistics,
cognitive science). Fluent programming (at least Python, Matlab, or R) and oral and written English skills are
required. Strong motivation towards understanding the underpinnings of human language learning and
processing is a must. Experience in computational modelling or in the use of statistical models in empirical
research is considered an advantage.
Potential candidates must be capable of carrying out independent academic research at the highest
international level. Competence must be demonstrated through several existing publications in
internationally recognised peer-reviewed journals and conferences.
We offer
The position will be filled for a fixed-term period of up to 3.5 years, but the duration is negotiated according to
the applicant’s career plans. The starting date is also negotiable, but should not be later than March 2023. A trial
period of 6 months is applied to all new employees.
We offer a competitive academic salary, typically between 3500–3600€ per month for a starting postdoc,
generally depending on experience and merits achieved (the position is placed on job demand levels 5–6 in
accordance with the Finnish University Salary System). The position also includes possibilities for short-term
researcher mobility to other international research labs. Travel costs for presenting peer-reviewed work
in major international conferences are covered by default. In addition, the position comes with extensive
benefits such as occupational healthcare, on-campus sports facilities, flexible working hours, and several
restaurants and cafés on the campus with staff discounts. The job is associated with 1612 h annual working
time, which translates to approx. 6 weeks of holiday per year.
How to apply
Send the application through the online portal at
https://tuni.rekrytointi.com/paikat/?o=A_RJ&jgid=1&jid=1572.
The deadline for applications is the 9th of October 2022 at 23:59 (GMT+3). Note that we may start interviewing
applicants already before the deadline. We reserve the right not to fill the position in
case a suitable candidate is not found during the process.
The application should contain the following documents (all in .pdf format):
- A free-form letter of motivation for the position in question (max. 1 page)
- Academic CV with contact information
- A complete list of publications
- A copy of doctoral degree certificate
- Potential letters of recommendation (max. 3)
Please name all the documents as surname_CV.pdf, surname_list_of_publications.pdf... etc. Only the
applications sent through the university application portal and containing the requested attachments in the
instructed format will be considered in the recruitment process.
The most promising candidates will be interviewed in person before the final decision.
For more information about the position, please contact Associate Professor Okko Räsänen
(firstname.surname@tuni.fi; no umlauts) by email. For more information on our group activities and recent
publications, see https://webpages.tuni.fi/specog/index.html.

About the research environment
Finland is among the most stable, free and safe countries in the world, based on prominent ratings by
various agencies. It is also ranked as one of the top countries as far as social progress is concerned.
Tampere is the largest inland city of Finland, and the city is counted among the major academic hubs in the
Nordic countries, offering a dynamic living environment. Tampere region is one of the most rapidly growing
urban areas in Finland and home to a vibrant knowledge-intensive entrepreneurial community. The city is
an industrial powerhouse that enjoys a rich cultural scene and a reputation as a centre of Finland’s
information society. Despite its growth, living in Tampere is highly affordable for housing. In addition, the
excellent public transport network enables quick, easy, and cheap transportation around the city of
Tampere and university campuses. Tampere is also surrounded by vivid nature with forests and lakes,
providing countless opportunities for easy-to-access outdoor adventures and refreshment throughout the
year.

Read more about Finland and Tampere:
• https://www.visitfinland.com/about-finland/
• https://finland.fi/
• http://julkaisut.valtioneuvosto.fi/bitstream/handle/10024/161193/MEAEguide_18_2018_TervetuloaSuomeen_Eng_PDFUA.pdf
• https://visittampere.fi/en/

Top

6-15

(2022-09-12) Postdocs and Software engineering positions, Telecom Paris, France

Telecom Paris' ADASP research group (https://adasp.telecom-paris.fr) is welcoming applications for multiple postdoc and audio software engineering positions in speech processing, machine listening, MIR and audio/music DSP, starting from September 2022 onwards.

The work will be performed as part of a collaborative project whose purpose is to revolutionise hearable technologies (especially TWS/earbuds) by offering extremely efficient real-time machine listening, speech processing and MIR solutions which can run on very low consumption hardware.

Telecom Paris [2] is located on the Plateau de Saclay (Paris outskirts). Accepted candidates will join the ADASP group [4], a multidisciplinary team working at the intersection of machine learning, sound, music and signal processing.

-- Relevant topics
If any of the following keywords resonate with you, do not hesitate to apply (provided that you comply with the requirements further below):

- Model-based, few-shot, frugal, self-supervised and semi-supervised learning
- Domain-shift adaptation, knowledge distillation and deep network quantization
- Online speaker identification, diarization and separation
- Speech enhancement/denoising, target speaker extraction
- DCASE topics: acoustic scene classification, sound event detection and localization
- MIR topics: music representation learning, real-time remastering, autotagging

-- Profile of the candidates

--- Postdoc candidates
- A PhD degree
- A track record of research and publications in one or more of the following areas: speech processing, machine listening, MIR, machine learning, signal processing
- Experience in deep learning (ideally) applied to audio, speech processing, MIR or machine listening

--- Software engineering candidates
- MSc. in one of the following areas: computer science, electrical engineering or electronics
- Experience with real-time audio DSP / edge computing (especially edge machine learning) / ML ops will be a big plus
- Sufficient background in machine learning and signal processing, ideally audio/speech signal processing

--- All candidates
- Strong communication skills in English (French is not required)
- Team spirit
- Excellent coding skills

-- Important Dates
Review of applications will start as soon as possible, and continue until all posts are filled.

-- How to apply
The application shall be submitted via email (slim.essid@telecom-paris.fr) as a *single pdf file*, including:
- a letter of motivation
- a complete and detailed curriculum vitae, including email contact details of two references

-- Context
Telecom Paris [2] is a leading French engineering school and scientific research institution, founded in July 1878. It is a founding member of the Institut Polytechnique de Paris [1], a world-class scientific and technological institution.

The Information Processing and Communication Laboratory (LTCI) [3] is Telecom Paris’ in-house research laboratory. Since January 2017, it has continued the work previously carried out by the CNRS joint research unit of the same name. The LTCI was created in 1982 and is known for its extensive coverage of topics in the field of information and communication technologies. The LTCI’s core subject areas are computer science, networks, data science, signal and image processing and digital communications. The laboratory is also active in issues related to systems engineering and applied mathematics.

The open position will be hosted by Telecom Paris’ Audio Data Analysis and Signal Processing (ADASP) group [4], a subgroup of the statistics, signal processing and machine learning (S²A) team, within the Images, Data & Signals (IDS) department.

-- Contact
Slim Essid, Coordinator of the ADASP group

[1] https://www.ip-paris.fr/en
[2] https://www.telecom-paris.fr/en/home
[3] https://www.telecom-paris.fr/en/research/laboratories/information-processing-and-communication-laboratory-ltci
[4] https://adasp.telecom-paris.fr

Top

6-16

(2022-09-27) Deep learning software expert, CNRS:LSCP, Paris, France

Short summary: We are looking for someone with experience with deep learning, ideally using scikit-learn & pytorch, to join our technical team. We specialize in long-form audio-recordings, and your job will be to design, fine-tune, and evaluate neural networks on such data. Conversational French is NOT needed - we work in English!

For more details see https://emploi.cnrs.fr/Offres/CDD/UMR8554-ALECRI1-001/Default.aspx?lang=EN

---------------------------------------------------------------
Alex (Alejandrina) Cristia
Researcher, CNRS
Laboratoire de Sciences Cognitives et Psycholinguistique
29, rue d'Ulm, 75005, Paris, FRANCE
My site: www.acristia.org

Top

6-17

(2022-09-25) Postdoctoral position (M/F): Identification of gendered expressions, at LISN, St Aubin, France

LISN is recruiting a one-year postdoc within the ANR project GEM (Gender Equality Monitor) on the identification of gendered expressions using vector representations on a corpus of speech transcriptions from the media.

The offer is detailed below:

Postdoctoral position (M/F): Identification of gendered expressions using vector representations on a corpus of speech transcriptions from the media

General information

Reference: UMR9015-CYRGRO-002

Workplace: ST AUBIN

Publication date: Saturday, September 10, 2022

Contract type: Fixed-term scientific contract (CDD Scientifique)

Contract duration: 12 months

Expected start date: December 1, 2022

Working time: Full time

Salary: Between €2,889.91 and €4,082.90 gross per month, depending on experience

Desired level of education: Doctorate

Desired experience: 1 to 4 years

Missions

The GEM (Gender Equality Monitor) project aims to analyze interactions between women and men in the media (radio and television), and in particular the differences in representation depending on whether the speaker is a woman or a man, on the speaker's role (anonymous, journalist, politician, etc.), and on the topics discussed. In this interdisciplinary project, the computer science partners (including LISN) are tasked with implementing the descriptors that will allow the partners in the humanities and social sciences to quantify and qualify the differences in representation.

https://anr.fr/Projet-ANR-19-CE38-0012

Activities

The person recruited (M/F) will be in charge of developing unsupervised or semi-supervised natural language processing (NLP) techniques applied to corpora of automatic speech transcriptions, in order to identify 'gendered expressions' such as references to gender-based cultural stereotypes, traditional named entities, or any reference to private life, age, physical appearance, sexuality, skills, etc.

As a secondary objective, the analysis of biases in language models may also be carried out.

The corpora are made available by the project leader (Institut National de l'Audiovisuel) and consist of: radio morning shows and television news programmes from the GMMP (Global Monitoring Media Project) corpus, French radio programmes (cooking, economics, sports, and phone-in shows) for the study of incivilities (interruptions, insults, etc.), and reality TV shows (Loft Story 2001, Les Marseillais à Dubaï 2021). No annotation of gendered expressions is available. The person recruited will therefore have to favour unsupervised or semi-supervised methods.

This work will be co-supervised by Ms Sahar Ghannay (associate professor (MCF) in computer science at Université Paris Saclay) and Mr Cyril Grouin (research engineer (IR) in computer science at CNRS). The contract will be funded by the Agence Nationale de la Recherche (ANR GEM 2019), a project led by David Doukhan (Institut National de l'Audiovisuel).

Skills

- very good command of French

- natural language and speech processing; specific training in this discipline is a plus

- experience with word embeddings and neural networks

Work context

The Laboratoire Interdisciplinaire des Sciences du Numérique (LISN) is a research unit located on the Plateau de Saclay, created in 2021 from the merger of the LIMSI and LRI laboratories. The research carried out at LISN covers a broad scientific spectrum and is internationally recognized.

The laboratory has more than 380 members across 16 research teams and 6 support services. The premises are located entirely within a restricted-access zone (zone à régime restrictif, ZRR).

The person recruited will work within the ILES team, in close collaboration with the researchers of the ILES and TLP teams involved in the project, within the Sciences et Technologies des Langues (STL) department.

Constraints and risks

Possible travel within Ile-de-France for occasional working meetings

National and international travel to conferences when a paper is to be presented

Computer-based work

Apply here: https://emploi.cnrs.fr/Offres/CDD/UMR9015-CYRGRO-002/Default.aspx

Top

6-18

(2022-10-12) Internships at AVA France

We have two 6-month internship offers (for M2/Master 2 level) at Ava France ( https://www.ava.me/ ) in Paris (remote work possible) on speaker diarization.

Feel free to apply:

'Active Learning for Diarization' https://www.notion.so/ava/M2-Internship-Active-Learning-for-Diarization-Ava-France-5f80cb76bd8b451c8088f9c27e01f387
'On device Speaker Diarization' https://www.notion.so/ava/M2-Internship-On-device-Speaker-Diarization-Ava-France-d584fb15d73f4d1297b23969039995fc

Best regards,

Alexey Ozerov

AI Research Lead at Ava

Top

6-19

(2022-10-25) Postdoc@Telecom Paris (France)

Post-Doctoral Position on Neural Models for Dialog Analysis

Matthieu Labeau, Gaël Guibon, Chloé Clavel

Place of work: Telecom Paris, Palaiseau (Paris outskirts), France

Starting date: from February 2023

Context: The post-doctoral fellow will be integrated in the social computing theme of the Signal, Statistics and Learning (S2A) team at Telecom Paris. Research activity will be supervised by Chloé Clavel and Matthieu Labeau, members of the team, and Gaël Guibon (University of Lorraine and LORIA laboratory).

Candidate profile: As a minimum requirement, the successful candidate should have:

• A PhD degree in one or more of the following areas: machine learning, natural language processing, computational linguistics, affective computing.

• Excellent programming skills (preferably in Python)

• Excellent command of English

How to apply: The application should be formatted as **a single PDF file** and should include:

• A complete and detailed curriculum vitae

• A cover letter

• The defense and PhD reports

• The contact of two referees

The PDF file should be sent to the three supervisors: Chloé Clavel, Gaël Guibon, Matthieu Labeau:

chloe.clavel@telecom-paris.fr, gael.guibon@univ-lorraine.fr, matthieu.labeau@telecom-paris.fr

Subject: Flexible and Adaptable Learning for Dialog Analysis.

Keywords: natural language processing, semi-supervised learning, few-shot learning, multi-task learning, robust learning, dialog analysis.

Description: Current research on dialog analysis encompasses a large array of (often related) classification tasks, where an output sequence of labels corresponds to an input sequence of utterances. However, the domain of the textual data, the nature of the labels, and their granularity may vary widely among tasks, while available data is often scarce. These issues are often addressed with methods coming from semi-supervised learning (Van Engelen and Hoos, 2020), few-shot learning (Guibon et al., 2021a), and more recently meta-learning, either separately (Guibon et al., 2021b) or together (Ma et al., 2022). However, existing solutions are often specific to a particular setting, dataset, and task, and have blind spots: for example, few-shot and meta-learning approaches are not designed to deal with label imbalance, while real-world data will rarely be balanced. We plan to work towards flexible approaches to dialog analysis by following one or several of these research directions – while keeping in mind that domain adaptation is often required in a few-shot setting:

• Few-shot learning for label imbalance and structured data: FSL is mostly used in cases where it is possible to enforce a balance in labels for the training samples. However, this is difficult to do with structured data representations such as dialogues (Guibon et al., 2021a).

• Few-shot learning with insufficient data: How to better exploit new labels at inference time in an FSL setting? How to best make use of available unlabeled data (semi-supervised learning) or supplementary resources (any kind of ontology, typology of emotions, etc.)? (Ren et al., 2018).

• Few-shot joint multi-task learning: how to better integrate joint learning of different tasks in a few-shot setting? A possible lead is to exploit the structure of the data to create substitute tasks, working towards a model easily adaptable to a new set of labels with very little supervision, through short, multi-task fine-tuning (Ye, Lin, and Ren, 2021).

• Calibrated few-shot learning: labels are often uncertain, and highly dependent on the context or the bias of the annotator; this should be reflected in models, whether through calibration (Guo et al., 2022) or soft-labeling.

References

Guibon, G.; Labeau, M.; Flamein, H.; Lefeuvre, L.; and Clavel, C. 2021a. Few-Shot Emotion Recognition in Conversation with Sequential Prototypical Networks. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Punta Cana, Dominican Republic.

Guibon, G.; Labeau, M.; Flamein, H.; Lefeuvre, L.; and Clavel, C. 2021b. Meta-learning for Classifying Previously Unseen Data Sources into Previously Unseen Emotional Categories. In Proceedings of the 1st Workshop on Meta-Learning and Its Applications to Natural Language Processing, 76–89. Online: Association for Computational Linguistics.

Guo, Y.; Du, R.; Li, X.; Xie, J.; Ma, Z.; and Dong, Y. 2022. Learning Calibrated Class Centers for Few-Shot Classification by Pair-Wise Similarity. IEEE Transactions on Image Processing, 31: 4543–4555.

Ma, T.; Jiang, H.; Wu, Q.; Zhao, T.; and Lin, C.-Y. 2022. Decomposed Meta-Learning for Few-Shot Named Entity Recognition. In Findings of the Association for Computational Linguistics: ACL 2022, 1584–1596.

Ren, M.; Triantafillou, E.; Ravi, S.; Snell, J.; Swersky, K.; Tenenbaum, J. B.; Larochelle, H.; and Zemel, R. S. 2018. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676.

Van Engelen, J. E.; and Hoos, H. H. 2020. A survey on semi-supervised learning. Machine Learning, 109(2): 373–440.

Ye, Q.; Lin, B. Y.; and Ren, X. 2021. CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 7163–7189. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.

Top

6-20

(2022-10-14) Internships @ IRIT Toulouse, France

The SAMoVA team at IRIT in Toulouse is offering several internships (M1, M2, final-year engineering project) in 2023 on the following topics (non-exhaustive list):

- atypical speech processing

- swallowing modeling

- speech transcription and understanding (spoken language understanding)

- speaker segmentation and clustering (speaker diarization)

- textual description of audio (audio captioning)

All details (topics, contacts) are available in the 'Jobs' section of the team's website:
https://www.irit.fr/SAMOVA/site/jobs/

Hervé Bredin
CNRS / IRIT / SAMoVA
Researcher (Chargé de recherche)
herve.bredin@irit.fr

Top

6-21

(2022-10-12) AI engineer @ LNE, France

We are actively looking for an operations engineer for the Artificial Intelligence and Cybersecurity Evaluation Department at LNE:

https://www.lne.fr/fr/offre-emploi/ingenieur-en-charge-operations-departement-evaluation-intelligence-artificielle-0

The successful candidate will join a fast-growing team specialized in the evaluation of AI systems and working in many fields (NLP, image processing, intelligent medical devices, autonomous mobility systems, agricultural robots, cobots, etc.).

I am available to discuss this offer.

Thank you in advance for your applications and for sharing this offer. See you soon!

Guillaume AVRIN, PhD
Head of the Artificial Intelligence Evaluation Department
Head of cybersecurity testing activities

Testing and Certification Directorate

Laboratoire national de métrologie et d'essais
29 avenue Roger Hennequin 78197 Trappes Cedex - lne.fr

Top

6-22

(2022-10-12) Two postdoc positions @ Donders Institute, Radboud University, Nijmegen, The Netherlands

Two 3yr postdoc positions testing gesture-speech synchrony

We're looking for two smart and motivated postdocs to join the Speech Perception in Audiovisual Communication lab (SPEAC; https://hrbosker.github.io) at the Donders Institute, Radboud University, Nijmegen, The Netherlands.

Keywords: multimodal prosody, audiovisual speech perception, gesture-speech synchrony, motion-tracking, MEG

>>> PD1

https://www.ru.nl/en/working-at/job-opportunities/postdoctoral-researcher-testing-cross-linguistic-gesture-speech-alignment-with-motion-tracking-at-donders-centre-for-cognition

You will test both the production and perception of gesture-speech alignment in nine different languages, including free-stress, fixed-stress, and lexical tone languages. The production strand uses motion-tracking of 2D videos in Mediapipe and acoustic analyses in Praat to quantify gesture-speech alignment on a millisecond timescale. The perception strand involves running psychoacoustic tests with audiovisual stimuli manipulated to vary in the synchrony between hands and spoken prosody. Combining production and perception data will reveal how language-specific variability in gesture-speech alignment shapes the language-specific use of gestural timing in speech perception.

>>> PD2

https://www.ru.nl/en/working-at/job-opportunities/postdoctoral-researcher-testing-audiovisual-gesture-speech-integration-in-meg

You will use rapid invisible frequency tagging (RIFT) in MEG to pinpoint the neurobiological mechanisms underlying gesture-speech integration. Specifically, you will test how simple up-and-down beat gestures influence lexical stress perception in real time, using the 'manual McGurk effect' (Bosker & Peeters, 2021, Proc Roy Soc B). Furthermore, you will compare typical behavioural and neural signatures of gesture-speech integration to those in individuals with autism spectrum disorder (ASD) who are known to demonstrate impairments in prosody processing and audiovisual integration. Finally, you will run a large-scale correlational study testing whether the participants' own gestural timing behaviour is linked to their use of gestural timing in audiovisual speech perception.

3 year contract; employment for 0.8 FTE. Gross monthly salary: min €3,974 - max €5,439 (based on 38-hour working week; scale 11). Apply by December 1, 2022. Preferred starting date: March 1, 2023.

Contact: Hans Rutger Bosker, HansRutger.Bosker@donders.ru.nl

Nigel Ward, Professor of Computer Science, University of Texas at El Paso

CCSB 3.0408, +1-915-747-6827

nigel@utep.edu https://www.cs.utep.edu/nigel/

Top

6-23

(2022-10-17) Research internships @ LIUM, Le Mans France

We are offering two research internships (M2/Master 2 level) on speech processing at LIUM, Le Mans Université.

All details are available in the 'Recrutements' section of the laboratory website, under the 'Stages' tab:

https://lium.univ-lemans.fr/stages/

Please forward this to any students who are looking for such an opportunity.

Best regards,

Meysam Shamsi

Top

6-24

(2022-10-20) Postdoc in Educational Data Mining/Learning Analytics, University of Colorado Boulder, CO, USA

Postdoc in Educational Data Mining/Learning Analytics

Location: University of Colorado Boulder, Boulder CO, USA

Work type: Full time

Employment type: Research faculty

Anticipated Start Date: Spring 2023 (desired), Summer 2023, or Fall 2023

Salary: $70k-$100k (depending on experience and qualifications)

Position Duration: 1-3 years. The initial contract is for one year; a second-year contract is based on performance, and extension to a third year and beyond is possible.

Brief Job Summary: In this position, you will develop and apply computational techniques to analyze data from students’ log files in conjunction with other multimodal signals (e.g., speech, facial expressions, learning artifacts) during small-group human tutoring, intelligent tutoring, and collaborative problem solving. You will also assist with integrating computational models into educational technologies where their performance and impact can be assessed in the real world.

Please visit the job details page below for more information and to apply:

https://jobs.colorado.edu/jobs/JobDetail/?jobId=43747

Top

6-25

(2022-10-25) OPEN POSITIONS @ ELDA, Paris (France)

OPEN POSITIONS in Paris (France)

The European Language resources Distribution Agency (ELDA), a company specialized in
Human Language Technologies within an international context, acting as the distribution
agency of the European Language Resources Association (ELRA), is currently seeking to
fill the following two positions:

* Programme Manager (m/f)
* Programme Manager in Speech Technologies (m/f)

Both positions are permanent and available immediately.

All details are available @ https://bit.ly/3G3PG1Z <https://t.co/7bwSMWookR>

Top

6-26

(2022-11-08) Post doc and engineering positions @ LORIA-INRIA, Nancy, France

Automatic speech recognition for non-native speakers in a noisy environment

Post-doctoral and engineer positions

Starting date: beginning of 2023

Duration: 24 months for a post-doc position and 12 months for an engineer position

Supervisor: Irina Illina, Associate Professor (HDR), Lorraine University, LORIA-INRIA Multispeech Team, illina@loria.fr

Context

When a person's hands are busy with a task such as driving a car or piloting an airplane, voice is a fast and efficient means of interaction. In aeronautical communications, English is most often compulsory. Unfortunately, many pilots are not native English speakers: they speak with an accent that depends on their native language and are influenced by its pronunciation mechanisms. Inside an aircraft cockpit, the non-native speech of the pilots and the surrounding noise are the most difficult challenges to overcome for efficient automatic speech recognition (ASR). The problems of non-native speech are numerous: incorrect or approximate pronunciations, errors of agreement in gender and number, use of non-existent words, missing articles, grammatically incorrect sentences, etc. The acoustic environment adds a disturbing component to the speech signal. Much of the success of speech recognition relies on the ability to account for different accents and ambient noise in the models used by the ASR system.

Automatic speech recognition has made great progress thanks to the spectacular development of deep learning. In recent years, end-to-end automatic speech recognition, which directly optimizes the probability of the output character sequence given the input acoustic features, has advanced considerably [Chan et al., 2016; Baevski et al., 2020; Gulati et al., 2020].

Objectives

The recruited person will have to develop methodologies and tools to obtain high-performance non-native automatic speech recognition in the aeronautical context and more specifically in a (noisy) aircraft cockpit.

This project will be based on an end-to-end automatic speech recognition system [Shi et al., 2021] using wav2vec 2.0 [Baevski et al., 2020], one of the best-performing models in the current state of the art. The wav2vec 2.0 model enables self-supervised learning of representations from raw audio data (without transcriptions).

How to apply: Interested candidates are encouraged to contact Irina Illina (illina@loria.fr) with the required documents (CV, transcripts, motivation letter, and recommendation letters).

Requirements & skills:

- M.Sc. or Ph.D. degree in speech/audio processing, computer vision, machine learning, or in a related field,

- ability to work independently as well as in a team,

- solid programming skills (Python, PyTorch), and deep learning knowledge,

- good level of written and spoken English.

References

[Baevski et al., 2020] A. Baevski, H. Zhou, A. Mohamed, and M. Auli. Wav2vec 2.0: A framework for self-supervised learning of speech representations, 34th Conference on Neural Information Processing Systems (NeurIPS 2020), 2020.

[Chan et al., 2016] W. Chan, N. Jaitly, Q. Le and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960-4964, 2016.

[Chorowski et al., 2017] J. Chorowski, N. Jaitly. Towards better decoding and language model integration in sequence to sequence models. Interspeech, 2017.

[Houlsby et al., 2019] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly. Parameter-efficient transfer learning for NLP. International Conference on Machine Learning, PMLR, pp. 2790–2799, 2019.

[Gulati et al., 2020] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang. Conformer: Convolution-augmented transformer for speech recognition. Interspeech, 2020.

[Shi et al., 2021] X. Shi, F. Yu, Y. Lu, Y. Liang, Q. Feng, D. Wang, Y. Qian, and L. Xie. The Accented English Speech Recognition Challenge 2020: open datasets, tracks, baselines, results and methods. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6918–6922, 2021.

Top

6-27

(2022-11-12) One-year position @Institut Mines Telecom Atlantique, Nantes, France

In the framework of the European/Japanese e-VITA project (https://www.e-vita.coach/),
IMT Atlantique is offering a one-year position in the field of active living technologies
(IoT, data fusion, AI, cloud/edge architectures, user services, coaching, NLP, etc.).

Description and link to apply:
https://institutminestelecom.recruitee.com/l/en/o/ingenieure-ou-ingenieur-en-fusion-multimodale-cdd-12-mois

Top

6-28

(2022-11-15) Two internships @ Zaion, Paris, France

We are offering two internships at Zaion (M2 level).

Please forward this to any students who are looking for such opportunities.

- Machine learning for intelligent routing of incoming calls in call centers:

https://www.welcometothejungle.com/fr/companies/zaion/jobs/stage-construction-d-un-framework-base-sur-l-apprentissage-automatique-pour-router-intelligemment-les-appels-entrants-dans-les-centres-d-appel_paris?q=24bcc6977132b41481c4bdc677196843&o=1427380

- Automatic dialogue summarization:

https://www.welcometothejungle.com/fr/companies/zaion/jobs/stage-developpement-d-un-modele-de-generation-automatique-de-resume-de-dialogue_paris?q=24bcc6977132b41481c4bdc677196843&o=1427375

Top

6-29

(2022-11-16) Post-doc @LaBRI, Bordeaux, France

The Bordeaux Computer Science Laboratory (LaBRI) is currently looking to fill a 1 year post-doctoral position in the framework of the FVLLMONTI European project (http://www.fvllmonti.eu)

Details on the position are given below.

—
The University of Bordeaux invites applications for a 1 year full-time postdoctoral researcher in Automatic Speech Recognition. The position is part of the FVLLMONTI project on efficient speech-to-text translation on embedded autonomous devices, funded by the European Community.

To apply, please send by email a single PDF file containing a full CV (including publication list), cover letter (describing your personal qualifications, research interests and motivation for applying), evidence for software development experience (active Github/Gitlab profile or similar), two of your key publications, contact information of two referees and academic certificates (PhD, Diploma/Master, Bachelor certificates).

Job description: Post-doctoral position in Automatic Speech Recognition
Duration: 12 months
Starting date: tentatively 03/01/2023
Project: European FETPROACT project FVLLMONTI (started January 2021)
Location: Bordeaux Computer Science Lab. (LaBRI CNRS UMR 5800), Bordeaux, France (Image and Sound team)
Salary: from 2 086,45€ to 2 304,88€/month (estimated net salary after taxes, according to experience)

Contact: jean-luc.rouas@labri.fr

Job description:
The applicant will be in charge of optimizing our state-of-the-art Automatic Speech Recognition and Machine Translation systems for English and French, built using the ESPNET framework (https://espnet.github.io/espnet/) for end-to-end deep neural networks. The objective is to match the specifications and constraints of the designed systems to the requirements of the project partners specialized in hardware (close work with EPFL https://www.epfl.ch/labs/esl/). In particular, the applicant will continue the work of previous postdoctoral researchers on compression techniques (e.g. pruning, quantization, etc.) applied to Transformer- and Conformer-based networks, to reduce memory and energy consumption while monitoring performance (WER and BLEU scores). Once a satisfactory trade-off is reached, more exploratory work will be carried out on using emotion/attitude/affect recognition on the speech samples to supply additional information to the translation system.

Context of the project:
The aim of the FVLLMONTI project is to build a lightweight autonomous in-ear device allowing speech-to-speech translation. Today, pocket-talk devices integrate IoT products requiring internet connectivity, which has generally proven to be energy inefficient. While machine translation (MT) and Natural Language Processing (NLP) performance has greatly improved, embedded, lightweight, energy-efficient hardware remains elusive. Existing solutions based on artificial neural networks (NNs) are computation-intensive and energy-hungry, requiring server-based implementations, which also raises data protection and privacy concerns. Today, 2D electronic architectures suffer from 'unscalable' interconnect and are thus still far from being able to compete with biological neural systems in terms of real-time information-processing capabilities with comparable energy consumption. Recent advances in materials science, device technology and synaptic architectures have the potential to fill this gap with novel disruptive technologies that go beyond conventional CMOS technology. A promising solution comes from vertical nanowire field-effect transistors (VNWFETs) to unlock the full potential of truly unconventional 3D circuit density and performance.

Required skills:
- PhD in Automatic Speech Recognition (preferred) or Machine Translation using deep neural networks
- Knowledge of most widely used toolboxes/frameworks (pytorch, espnet)
- Good programming skills in Python
- Good communication skills (frequent interactions with hardware specialists)
- Interest in hardware design will be a plus

Selected references:

S. Karita et al., 'A Comparative Study on Transformer vs RNN in Speech Applications,' 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), SG, Singapore, 2019, pp. 449-456, doi: 10.1109/ASRU46091.2019.9003750.

Gulati, Anmol, et al. 'Conformer: Convolution-augmented Transformer for Speech Recognition.' arXiv preprint arXiv:2005.08100 (2020).

Leila Ben Letaifa, Jean-Luc Rouas. Transformer Model Compression for End-to-End Speech Recognition on Mobile Devices. 2022 30th European Signal Processing Conference (EUSIPCO), Aug 2022, Belgrade, Serbia.

Leila Ben Letaifa, Jean-Luc Rouas. Fine-grained analysis of the transformer model for efficient pruning. 2022 International Conference on Machine Learning and Applications (ICMLA), Dec 2022, Nassau, Bahamas.

Top

6-30

(2022-11-17) M2 internship offers LORIA - MULTISPEECH, Nancy France

M2 internship offers LORIA - MULTISPEECH https://team.inria.fr/multispeech/fr/

To apply, please send your CV and a short motivation letter directly to the supervisors of the corresponding offer.

Offer 1 Contrastive Learning for Hate Speech Detection

General information
Supervisors: Nicolas Zampieri, Irina Illina, Dominique Fohr
Address: LORIA, Campus Scientifique - BP 239, 54506 Vandœuvre-lès-Nancy
Phone: 03 54 95 84 06
Email: firstname.lastname@loria.fr
Office: C 145

Motivation and context

The United Nations defines hate speech as 'any type of communication through speech, writing or behavior, which denigrates a person or group based on who they are, i.e. their religion, ethnicity, nationality, or other identity factor.' We are interested in hate speech posted on social networks. With the expansion of social networks (Twitter, Facebook, etc.), the number of messages posted every day has increased dramatically. It is very difficult and expensive to process the millions of messages posted every day in order to remove hateful content. Thus, automatic methods are required to moderate the influx. Automatic hate speech detection is a difficult task in the field of natural language processing (NLP) [6]. With the appearance of transformer-based language models like BERT [3], new state-of-the-art models have emerged for hate speech detection, such as HateBERT [1]. Current NLP models rely strongly on efficient learning algorithms. We are particularly interested in one of them: contrastive learning. Contrastive learning is employed to learn an embedding space such that pairs of similar sentences have close representations. [5] provides a summary of different models based on contrastive learning in language processing.

Goals and Objectives

The goal of this internship is to study contrastive learning in the context of hate speech detection. We believe that using this methodology will make the models more effective. Our current model learns to estimate whether two sentences have the same sentiment or not. Building on this first model, the intern will explore other contrastive learning approaches, such as SimCSE [4] or Dual Contrastive Learning [2]. The studied methods will be validated on several datasets to assess the robustness of the approach. Our team has several labeled corpora from social networks.

The internship workplan is as follows: first, the student will conduct a state-of-the-art study on recent developments in hate speech detection and contrastive learning in NLP. The student will then implement the selected methods. Finally, the performance of the implemented methods will be evaluated on several hate speech corpora and compared to the state of the art.

Required Skills

The candidate should have experience with deep learning, including good practice in Python and an understanding of deep learning libraries such as Keras, PyTorch or TensorFlow.

Additional information

The student intern will join the MULTISPEECH team. The team provides access to computational resources (GPU, CPU and datasets) in order to carry out the research.
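The contrastive learning idea described in Offer 1 (similar sentences should have close representations) can be illustrated with a minimal sketch. It uses the classic margin-based pairwise contrastive loss on toy embedding vectors; it is not the team's model, nor SimCSE or Dual Contrastive Learning, and the function names, the Euclidean distance, and the margin value are assumptions chosen for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_contrastive_loss(u, v, same_label, margin=1.0):
    """Margin-based contrastive loss for one pair of sentence embeddings.

    Pairs with the same label are pulled together (loss = d^2).
    Pairs with different labels are pushed apart until their distance
    reaches the margin (loss = max(0, margin - d)^2).
    """
    d = euclidean(u, v)
    if same_label:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy 2-D "sentence embeddings" (hypothetical values).
hateful_a = [0.0, 0.0]
hateful_b = [0.5, 0.0]
neutral = [3.0, 4.0]

print(pairwise_contrastive_loss(hateful_a, hateful_b, same_label=True))
print(pairwise_contrastive_loss(hateful_a, neutral, same_label=False))
```

In a real system the embeddings would come from a trained encoder and the loss would be averaged over many pairs and minimized by gradient descent.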

References

[1] Caselli, T., Basile, V., Mitrović, J., & Granitzer, M. HateBERT: Retraining BERT for Abusive Language Detection in English. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021) (pp. 17-25). ACL. doi:10.18653/v1/2021.woah-1.3. August 2021.

[2] Chen, Q., Zhang, R., Zheng, Y., & Mao, Y. Dual Contrastive Learning: Text Classification via Label-Aware Data Augmentation. https://arxiv.org/abs/2201.08702. 2022.

[3] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171-4186). 2019.

[4] Gao, T., Yao, X., & Chen, D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. ACL. doi:10.18653/v1/2021.emnlp-main.552. 2021.

[5] Rethmeier, N., & Augenstein, I. A Primer on Contrastive Pretraining in Language Processing: Methods, Lessons Learned and Perspectives. https://arxiv.org/abs/2102.12982. 2021.

[6] Zampieri, N., Ramisch, C., Illina, I., & Fohr, D. Identification of Multiword Expressions in Tweets for Hate Speech Detection. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 202-210, 2022. European Language Resources Association.

Offer 2 Diffusion-based Deep Generative Models for Audio-visual Speech Modeling

General information
Supervisors: Mostafa SADEGHI, Romain SERIZEL
Address: LORIA, Campus Scientifique - BP 239, 54506 Vandœuvre-lès-Nancy
Email: mostafa.sadeghi@inria.fr, romain.serizel@loria.fr

Motivation

Recently, diffusion models have gained much attention due to their powerful generative modeling performance, in terms of both the diversity and quality of the generated samples [1]. A diffusion model consists of two phases. During the so-called forward diffusion process, input data are mapped into Gaussian noise by gradually perturbing the data. Then, during a reverse process, a denoising neural network is learned that removes the added noise at each step, starting from pure Gaussian noise, to eventually recover the original clean data. Diffusion models have found numerous successful applications, particularly in computer vision, e.g., text-conditioned image synthesis, outperforming previous generative models, including variational autoencoders (VAEs), generative adversarial networks (GANs), and normalizing flows (NFs). Diffusion models have also been successfully applied to audio and speech signals, e.g., for audio synthesis [2] and speech enhancement [3].

Goal and objectives

Despite their rapid progress and expanding range of applications, diffusion models have not yet been applied to audiovisual speech modeling. This task involves joint modeling of the audio and visual modalities, where the latter concerns the lip movements of the speaker, as there is a correlation between what is being said and the lip movements. This joint modeling effectively incorporates the complementary information of the visual modality for speech generation. Such a framework has already been established based on VAEs [4]. Given the great potential and advantages of diffusion models, in this project we would like to develop a diffusion-based audio-visual generative modeling framework, where the generation of the audio modality, i.e., speech, is conditioned on the visual modality, i.e., lip images, similarly to text-conditioned image synthesis. This might then serve as an efficient representation learning framework for downstream tasks, e.g., audio-visual speech enhancement (AVSE) [4].

A background in statistical signal processing, computer vision, machine learning, and deep learning frameworks (Python, PyTorch) is preferred. Interested candidates should send an email to the supervisors with a detailed CV and transcripts.

Work environment

This master internship is part of the REAVISE project: 'Robust and Efficient Deep Learning based Audiovisual Speech Enhancement' (2023-2026), funded by the French National Research Agency (ANR). The general objective of REAVISE is to develop a unified, robust and efficient AVSE framework that leverages recent methodological breakthroughs in statistical signal processing, machine learning, and deep neural networks. The intern will be supervised by Mostafa Sadeghi (researcher, Inria) and Romain Serizel (associate professor, University of Lorraine), as members of the MULTISPEECH team, and will benefit from the research environment, expertise, and computational resources (GPU & CPU) of the team.
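The forward diffusion process mentioned in Offer 2 can be illustrated with a short sketch. It assumes the variance-preserving formulation commonly used in denoising diffusion probabilistic models, in which a noisy sample at step t is available in closed form as x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with abar_t the cumulative product of (1 - beta_s). The linear noise schedule, its endpoints, and all function names are illustrative assumptions, not part of the internship description.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) over the first t steps."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def forward_diffuse(x0, betas, t, rng):
    """Sample x_t from x_0 in one shot using the closed-form expression."""
    a = alpha_bar(betas, t)
    return [
        math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for x in x0
    ]


betas = linear_betas(1000)
rng = random.Random(0)
clean = [0.5, -1.0, 2.0]  # toy "signal" (hypothetical values)

# At t = 0 nothing is added; at t = T the signal is almost pure noise.
print(forward_diffuse(clean, betas, 0, rng))
print(alpha_bar(betas, 1000))
```

The reverse process, which is the part actually learned, would train a neural network to predict the added noise at each step; it is omitted here.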

References

[1] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, Y. Shao, W. Zhang, B. Cui, and M. H. Yang, Diffusion models: A comprehensive survey of methods and applications, arXiv preprint arXiv:2209.00796, 2022.

[2] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, Diffwave: A versatile diffusion model for audio synthesis, arXiv preprint arXiv:2009.09761, 2020.

[3] Y. J. Lu, Z. Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, Conditional diffusion probabilistic model for speech enhancement, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.

[4] M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, Audio-visual speech enhancement using conditional variational auto-encoders, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1788–1800, 2020.

Offer 3 Efficient Attention-based Audio-visual Fusion Mechanisms for Speech Enhancement

General information
Supervisors: Mostafa SADEGHI, Romain SERIZEL
Address: LORIA, Campus Scientifique - BP 239, 54506 Vandœuvre-lès-Nancy
Email: mostafa.sadeghi@inria.fr, romain.serizel@loria.fr

Motivation

Audiovisual speech enhancement (AVSE) is defined as the task of improving the quality and intelligibility of a noisy speech signal by utilizing the complementary information provided by the visual modality, i.e., lip movements of the speaker [1]. The visual modality is especially important in high-noise situations, as it is less affected by acoustic noise. Because of that, AVSE could be exploited in several practical applications, including hearing assistive devices. Numerous works have already studied the integration of the visual modality with the audio modality to improve the performance of speech enhancement. The majority of audiovisual speech enhancement algorithms rely on deep neural networks and supervised learning, and they require very large audiovisual datasets with diverse noise instances to generalize well. A recently introduced AVSE approach is based on unsupervised learning [2,3]: during a training phase, the statistical distribution of clean speech is learned from a clean audiovisual dataset. This is done using a deep generative model, e.g. a variational autoencoder (VAE) [4]. Then, at test (inference) time, the learned distribution is combined with a noise model to estimate the clean speech signal from the available noisy speech observations.

Goal and objectives

An important element of AVSE is audio-visual feature fusion, which should robustly and efficiently combine the two modalities. Current fusion mechanisms used for unsupervised AVSE are based on simple feature concatenation, which is not effective, as it treats the two feature streams on an equal basis. In fact, the audio modality usually contributes more than the visual modality, but in general, their contributions should be robustly balanced and weighted. In this project, we are going to develop efficient feature fusion modules based on attention models [5], which have proven very successful in different applications. The designed fusion module should robustly and efficiently take into account the potentially different uncertainty (reliability) levels of the two modalities. We will then evaluate its effectiveness for AVSE.

A background in statistical signal processing, probabilistic machine learning, optimization, and programming languages & deep learning frameworks (Python, PyTorch) is preferred. Interested candidates should send an email to the supervisors with a detailed CV and transcripts.

Work environment

This master internship is part of the REAVISE project: 'Robust and Efficient Deep Learning based Audiovisual Speech Enhancement' (2023-2026), funded by the French National Research Agency (ANR). The general objective of REAVISE is to develop a unified, robust and efficient AVSE framework that leverages recent methodological breakthroughs in statistical signal processing, machine learning, and deep neural networks. The intern will be supervised by Mostafa Sadeghi (researcher, Inria) and Romain Serizel (associate professor, University of Lorraine), as members of the MULTISPEECH team, and will benefit from the research environment, expertise, and computational resources (GPU & CPU) of the team.
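To make the contrast between concatenation and attention-based fusion in Offer 3 concrete, here is a minimal sketch of scaled dot-product attention over two modality vectors. Instead of treating the audio and visual features equally, a query vector scores each modality and a softmax turns the scores into fusion weights. The query vector, the feature values, and the function names are hypothetical; the fusion module to be designed in the internship is not specified in the offer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(audio_feat, visual_feat, query):
    """Fuse two same-sized feature vectors with attention weights.

    Each modality is scored by its scaled dot product with the query;
    the fused vector is the weighted sum of the two modality vectors.
    """
    dim = len(query)
    feats = [audio_feat, visual_feat]
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
        for feat in feats
    ]
    weights = softmax(scores)
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, feats))
        for i in range(dim)
    ]
    return fused, weights


# Toy features: the query is aligned with the audio vector, so audio
# receives the larger weight.
fused, weights = attention_fuse([1.0, 0.0], [0.0, 1.0], query=[2.0, 0.0])
print(weights)
print(fused)
```

In a learned module the query (and usually key and value projections) would be trained, so the weights could adapt to the reliability of each modality, for example by down-weighting audio in heavy noise.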

References

[1] D. Michelsanti, Z.-H. Tan, S.-X. Zhang, Y. Xu, M. Yu, D. Yu, and J. Jensen, "An overview of deep-learning-based audio-visual speech enhancement and separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1368–1396, 2021.
[2] M. Sadeghi and X. Alameda-Pineda, "Switching variational auto-encoders for noise-agnostic audio-visual speech enhancement," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
[3] M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, "Audio-visual speech enhancement using conditional variational auto-encoders," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1788–1800, 2020.
[4] D. P. Kingma and M. Welling, "An introduction to variational autoencoders," Foundations and Trends in Machine Learning, 2019.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, 2017.

Offer 4 Multi-modal Stuttering Detection Using Self-supervised Learning

General information

Supervisors: Shakeel Ahmad Sheikh, Slim Ouni
Address: LORIA, Campus Scientifique - BP 239, 54506 Vandœuvre-lès-Nancy
Email: firstname.lastname@loria.fr
Office: C 137

Motivation

Stuttering is a neuro-developmental speech disorder that starts to appear when the neural connections supporting language, speech, and emotion are changing quickly [2]. In standard stuttering therapy sessions, speech pathologists or speech therapists manually examine and analyze either the live speech of the person who stutters (PWS) or their recordings. To treat the stuttering, the therapists carefully observe and monitor the patterns in the speech utterances of PWS. However, this manual approach to stuttering detection is very time-consuming and strenuous, and it is biased towards the subjective judgment of the speech-language therapist. It is therefore important to build interactive stuttering detection tools that provide an impartial, objective assessment and that can be used to tune and improve ASR-based virtual assistants for stuttered speech. Deep learning has been used extensively in domains such as speech recognition [5] and emotion detection [1]; in the stuttering domain, however, its application is limited. The acoustic cues embedded in the speech of PWS can be exploited by various deep learning methods to detect stuttering. Most existing stuttering detection techniques use spectral features such as spectrograms and MFCCs as the input representation of stuttered speech [12, 11, 3]. The most common problem in the stuttering domain is the lack of data. The few available stuttering datasets, such as UCLASS, FluencyBank, and SEP28K [3], are small and contain only a few dozen speakers. While deep learning methods have shown substantial gains in domains such as ASR, speaker verification, and emotion detection, the improvement in stuttering detection is very limited, most likely because of the small size of the datasets.
The common strategy for training on small datasets is transfer learning, where a pre-trained model (trained first on some auxiliary task on a large dataset) is used to improve performance on the target task, for which data is scarce. The pre-trained model can be fine-tuned by re-training or replacing some of its last layers, or it can be used as a feature extractor for the target task. Transfer learning has been explored in various fields such as ASR and emotion detection [8]. Recently, self-supervised learning has shown significant improvement in stuttering detection [11, 18, 17, 16].

Multimodal Stuttering Detection

Stuttering can be characterized as an audio-visual problem. Cues are present both in the visual modality (e.g., head nodding, lip tremors, quick eye blinks, and unusual lip shapes) and in the audio modality [4]. Multimodal learning could help to learn robust stutter-specific hidden representations across modalities, and could also help to build robust automatic stuttering detection systems. Self-supervised learning can also be exploited to capture acoustic stutter-specific representations guided by video frames. The framework proposed by Shukla et al. [14] could help to learn stutter-specific features from audio signals guided by visual frames, or vice versa. Altinkaya and Smeulders [15] recently presented the first audio-visual stuttered speech dataset, which consists of 25 speakers (14 male, 11 female). They trained a ResNet-based RNN (gated recurrent unit) on the audio-visual modality to detect the block type of stuttering. The main idea of this internship is to explore further the impact of self-supervised learning on stuttering detection in combination with an audio-visual setup.
The goal of the proposed study is to develop and evaluate audio-visual, self-supervised stuttering detection classifiers that are able to distinguish among several stutter classes.

1. Objective 1: Literature survey of the existing work in stuttering detection.
2. Objective 2: Development of a pre-trained stuttering classifier based on self-supervised learning; some initial experiments would be carried out. We would explore self-supervised models such as wav2vec 2.0, a modified version of wav2vec [9], and variants such as Unispeech, HuBERT, etc. We would use wav2vec 2.0 either as a feature extractor or fine-tune it by replacing the last few layers and adapting it for stuttering detection.
3. The experiments would be carried out on the newly developed French stuttering dataset.
4. Objective 3: Carrying out the actual experiments and analyzing the impact of fine-tuning and pre-trained features on the raw audio-visual stuttered speech samples.

References

[1] Mehmet Berkehan Akçay and Kaya Oğuz. 'Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers.' Speech Communication, 116 (2020), pp. 56-76.
[2] Smith, Anne, and Christine Weber. 'How stuttering develops: The multifactorial dynamic pathways theory.' Journal of Speech, Language, and Hearing Research, 60 (2017), pp. 2483–2505.
[3] Shakeel A. Sheikh, Md Sahidullah, Fabrice Hirsch, and Slim Ouni. 'Machine learning for stuttering identification: Review, challenges and future directions.' Neurocomputing, 514 (2022), pp. 385-402.
[4] Guitar, Barry. Stuttering: An integrated approach to its nature and treatment. Lippincott Williams & Wilkins, 2013.
[5] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan. 'Speech recognition using deep neural networks: A systematic review.' IEEE Access, vol. 7, pp. 19143-19165, 2019.
[6] Latif, Siddique, Rajib Rana, Sara Khalifa, Raja Jurdak, Junaid Qadir, and Björn W. Schuller. 'Deep representation learning in speech processing: Challenges, recent advances, and future trends.' arXiv preprint arXiv:2001.00378 (2020).
[7] Ning, Y., He, S., Wu, Z., Xing, C., and Zhang, L.J. 'A review of deep learning based speech synthesis.' Applied Sciences, 9(19), p. 4050, 2019.
[8] Wang, Yingzhi, Abdelmoumene Boumadane, and Abdelwahab Heba. 'A fine-tuned wav2vec 2.0/hubert benchmark for speech emotion recognition, speaker verification and spoken language understanding.' arXiv preprint arXiv:2111.02735 (2021).
[9] Baevski, Alexei, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 'wav2vec 2.0: A framework for self-supervised learning of speech representations.' Advances in Neural Information Processing Systems, 33 (2020), pp. 12449-12460.
[10] Lea, Colin, Vikramjit Mitra, Aparna Joshi, Sachin Kajarekar, and Jeffrey P. Bigham. 'Sep-28k: A dataset for stuttering event detection from podcasts with people who stutter.' In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6798-6802. IEEE, 2021.
[11] Sheikh, Shakeel A., Md Sahidullah, Slim Ouni, and Fabrice Hirsch. 'End-to-End and Self-supervised learning for ComParE 2022 stuttering sub-challenge.' In Proceedings of the 30th ACM International Conference on Multimedia, pp. 7104-7108. 2022.
[12] Sheikh, Shakeel A., Md Sahidullah, Fabrice Hirsch, and Slim Ouni. 'Robust stuttering detection via multi-task and adversarial learning.' In 2022 30th European Signal Processing Conference (EUSIPCO), pp. 190-194. IEEE, 2022.
[13] Ngiam, Jiquan, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 'Multimodal deep learning.' In ICML. 2011.
[14] Shukla, Abhinav, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, and Maja Pantic. 'Visually guided self supervised learning of speech representations.' In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6299-6303. IEEE, 2020.
[15] Altinkaya, Mehmet, and Arnold WM Smeulders. 'A dynamic, self supervised, large scale audiovisual dataset for stuttered speech.' In Proceedings of the 1st International Workshop on Multimodal Conversational AI, pp. 9-13. 2020.
[16] Mohapatra, Payal, Akash Pandey, Bashima Islam, and Qi Zhu. 'Speech disfluency detection with contextual representation and data distillation.' In Proceedings of the 1st ACM International Workshop on Intelligent Acoustic Systems and Applications, pp. 19-24. 2022.
[17] Grósz, Tamás, Dejan Porjazovski, Yaroslav Getman, Sudarsana Kadiri, and Mikko Kurimo. 'Wav2vec2-based Paralinguistic Systems to Recognise Vocalised Emotions and Stuttering.' In Proceedings of the 30th ACM International Conference on Multimedia, pp. 7026-7029. 2022.
[18] Bayerl, Sebastian P., Dominik Wagner, Elmar Nöth, and Korbinian Riedhammer. 'Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0.' arXiv preprint arXiv:2204.03417 (2022).

Offer 5 Dictionary learning for deep unsupervised speech separation

General information

Supervisors: Paul Magron, Mostafa Sadeghi
Address: LORIA, Campus Scientifique - BP 239, 54506 Vandœuvre-lès-Nancy
Email: paul.magron@inria.fr, mostafa.sadeghi@inria.fr
Office: C 141, C 136

Motivation and context

Speech separation consists in isolating the signal of each speaker from an acoustic mixture in which several people may be speaking. This task is an important preprocessing step in many applications such as hearing aids or voice assistants based on automatic speech recognition. State-of-the-art separation systems rely on supervised deep learning, where a network is trained to predict the isolated speakers' signals from their mixture [1, 2]. However, these approaches are costly in terms of training data and have a limited capacity to generalize to unseen speakers.

Goal and objectives

The goal of this internship is to design a fully unsupervised system for speech separation, which is more data-efficient than supervised approaches and applicable to any mixture of speakers. To that end, we propose to combine variational autoencoders (VAEs) with dictionary models (DMs). DMs decompose a given input matrix (usually an audio spectrogram) as the product of two interpretable factors: a dictionary of spectra and a temporal activation matrix. This family of methods was extensively researched before the era of deep learning [3], but it is limited because real-world audio spectrograms cannot be decomposed using such simple models. Therefore, we propose to leverage VAEs as a tool to learn a latent representation of the data which is regularized using DMs. Such a system can be cast as an instance of transform learning [4]: the key idea is to apply a (learned) transform to the data so that it better complies with a desirable property, here decomposition on a dictionary. A first attempt was recently proposed and has shown promising results in terms of speech modeling [5], although it used a fixed dictionary. This internship aims at extending this work by considering a system where both the VAE and the dictionary are learned jointly, and applying it to the task of speech separation.
Once trained, the resulting system operates in three stages: (i) the (mixture) audio spectrogram is projected through the encoder into some latent space; (ii) this latent representation is decomposed efficiently using a DM learning algorithm, which provides a latent feature for each speaker; (iii) these latent features are passed through the decoder to retrieve a spectrogram for each speaker. Such a system is promising since it is fully unsupervised (it can be trained without knowledge of specific mixtures), it yields an interpretable decomposition of the latent representation, and it can serve as a basis for other applications (including speaker diarization, speech enhancement or voice conversion). Good Python practice and basic knowledge of deep learning, both theoretical and practical (e.g., using PyTorch), are required. Some notions of audio/speech signal processing and machine learning are a plus.

Work environment

The trainee will be supervised by Paul Magron (Chargé de Recherche Inria) and Mostafa Sadeghi (Researcher, Inria Starting Faculty Position), and will benefit from the research environment and the expertise in audio signal processing of the MULTISPEECH team. This team includes many PhD students, post-docs, trainees, and permanent staff working in this field, and offers all the necessary computational resources (GPU and CPU, speech datasets) to conduct the proposed research.
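As a rough illustration of what a dictionary model computes, the sketch below factorizes a small nonnegative matrix V into a dictionary W and an activation matrix H using the classical multiplicative updates of nonnegative matrix factorization (the pre-deep-learning family of methods referred to above). This is an assumption-laden toy, not the method of [5] and not the system proposed for the internship: the matrix values, the rank, and the iteration count are all made up, and the VAE encoder and decoder stages are left out entirely.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frob_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Euclidean NMF with multiplicative updates; returns W, H and the error history."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(len(V))]
    H = [[rng.random() + 0.1 for _ in range(len(V[0]))] for _ in range(rank)]
    errors = [frob_error(V, W, H)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
        errors.append(frob_error(V, W, H))
    return W, H, errors


def component(W, H, k):
    """Rank-one part of the model contributed by dictionary atom k."""
    return [[row[k] * h for h in H[k]] for row in W]


# Toy "spectrogram": 4 frequency bins x 5 frames, built from two known atoms.
W_true = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
H_true = [[1.0, 2.0, 0.0, 1.0, 0.0], [0.0, 1.0, 3.0, 0.0, 2.0]]
V = matmul(W_true, H_true)

W, H, errors = nmf(V, rank=2)
parts = [component(W, H, k) for k in range(2)]
print("initial error:", errors[0])
print("final error:", errors[-1])
```

Each entry of `parts` is the contribution of one dictionary atom; in a separation setting, groups of atoms would be assigned to individual speakers, which is the role stage (ii) plays in the latent space of the proposed system.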

References

[1] D. Wang and J. Chen, "Supervised Speech Separation Based on Deep Learning: An Overview," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702-1726, 2018.
[2] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256-1266, 2019.
[3] T. Virtanen, "Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1066-1074, 2007.
[4] D. Fagot, H. Wendt, and C. Févotte, "Nonnegative Matrix Factorization with Transform Learning," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[5] M. Sadeghi and P. Magron, "A Sparsity-promoting Dictionary Model for Variational Autoencoders," Interspeech, 2022.

Offer 6 Semantic latent space for expressive text-to-speech

General information

Supervisors: Vincent Colotte, Slim Ouni
Address: LORIA, Campus Scientifique - BP 239, 54506 Vandœuvre-lès-Nancy
Phone: 03 54 95 20 74
Email: vincent.colotte@loria.fr, slim.ouni@inria.fr
Office: C141

Motivation

Over the last decades, text-to-speech synthesis (TTS) has reached good quality and intelligibility, and is now commonly used in information delivery services, for instance in call center automation and in navigation systems. In the past, the main goal when developing TTS systems was to achieve high intelligibility. The speech style was then typically a "reading style," which resulted from the style of the speech data used to develop TTS systems (reading of a large set of sentences). Recent research on speech synthesis now focuses on expressive speech, with the aim of making generated speech more expressive or spontaneous. Almost all systems are now based on neural network methods. To integrate expressiveness into a network, as in numerous recent works on neural networks, training and testing pass through several steps with specific latent spaces that condition the network or provide a latent representation to control the expressiveness. In such stochastic processes, an explanation of these numeric representations is still difficult to extract [1]. Moreover, the use of newer representations such as Word2Vec for textual material or Wav2Vec for audio signals shows that a representation can carry implicit linguistic and semantic information. The need for explanation nevertheless remains. The internship will take place in this framework.

Objectives and expected outcomes

The goal of the proposed study is to investigate the information contained in a latent representation dedicated to expressive speech. Previous work used a Variational Autoencoder (VAE) approach to explore this dimension in the audiovisual domain [2]: without any emotion tag, the latent representation recovered the emotional information. In addition, [3] used several representations of acoustic expressiveness to condition a network to transfer an emotion from one speaker to a sentence of another speaker. Moreover, [5] jointly used acoustic and textual expressiveness representations; the textual representation was based on the SBERT approach. The internship work will consist in analyzing the latent representations of a TTS system (for instance a Glow approach for audio speech or a VAE for audio-visual speech). The second step will introduce semantic information through a textual latent representation derived from a simple tag [4], a description, or the text itself. The objective is to jointly learn representations and analyze them in order to understand how to control the system.

Additional information and requirements

The internship will be carried out within the framework of the European project Humane-AI-Net. A good knowledge of Python and basic knowledge of neural network learning are required.

References

[1] Tits, N., Wang, F., Haddad, K.E., Pagel, V., Dutoit, T. Visualization and interpretation of latent spaces for controlling expressive speech synthesis through audio analysis, in Proc. Interspeech, 2019.
[2] S. Dahmani, V. Colotte, V. Girard, and S. Ouni. Learning emotions latent representation with CVAE for Text-Driven Expressive AudioVisual Speech Synthesis, in Neural Networks, Elsevier, 2021.
[3] A. Kulkarni, V. Colotte, D. Jouvet. Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems, in Proc. Interspeech, 2022.
[4] Kim, M., Cheon, S.J., Choi, B.J., Kim, J.J., Kim, N.S. Expressive Text-to-Speech Using Style Tag, in Proc. Interspeech, 2021.
[5] Shin, Y., Lee, Y., Jo, S., Hwang, Y., Kim, T. Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS, in Proc. Interspeech, 2022.

Offer 7 Disentanglement in Speech Data for Privacy Needs

General information

Supervisors: Emmanuel Vincent, Marc Tommasi
Address: LORIA, Campus Scientifique - BP 239, 54506 Vandœuvre-lès-Nancy
Email: emmanuel.vincent@inria.fr, marc.tommasi@inria.fr

Motivation and context

Large-scale collection, storage, and processing of speech data pose severe privacy threats [1]. Indeed, speech encapsulates a wealth of personal data (e.g., age and gender, ethnic origin, personality traits, health and socioeconomic status, etc.) which can be linked to the speaker's identity via metadata or via automatic speaker recognition. Speech data may also be used for voice spoofing using voice cloning software. With firm backing from privacy legislation such as the European General Data Protection Regulation (GDPR), several initiatives are emerging to develop and evaluate privacy preservation solutions for speech technology. These include voice anonymization methods [2], which aim to conceal the speaker's voice identity without degrading the utility for downstream tasks, and speaker re-identification attacks [3], which aim to assess the resulting privacy guarantees, e.g., in the scope of the VoicePrivacy challenge series [4].

Goals and objectives

The internship will tackle the objective of speech anonymization. Previous works have shown that simple adversarial approaches that aim to remove speaker identity from speech signals do not provide sufficient privacy guarantees [5]. One interpretation of this failure is that the adversaries were not strong enough. Moreover, there is no clear evidence that a transformation that removes speaker identity is informative enough to allow the reconstruction of intelligible speech signals. These observations raise a classical trade-off between privacy and utility that is essential in many privacy preservation scenarios. Instead of trying to remove speaker information, another option is to replace it with a different one. To do so, a sub-objective is to disentangle speech signals, that is, to isolate the speech features that contribute to the success of speaker identification.
Disentanglement is understood in this project as the process of embedding voice data in a new representation where different types of information (speaker identity, linguistic content, or even traits like age, gender or ethnicity) are separated and associated with disjoint sets of features. Variational autoencoders are supposed to naturally support disentanglement [6]. Additionally, variational approaches can also be used to make attackers stronger by introducing more diversity. Those two ways of improving adversarial approaches for learning a private representation of speech will be investigated.

References

[1] A. Nautsch, A. Jimenez, A. Treiber, J. Kolberg, C. Jasserand, E. Kindt, H. Delgado, M. Todisco, M. A. Hmani, A. Mtibaa, M. A. Abdelraheem, A. Abad, F. Teixeira, M. Gomez-Barrero, D. Petrovska, G. Chollet, N. Evans, T. Schneider, J.-F. Bonastre, B. Raj, I. Trancoso, and C. Busch. Preserving privacy in speaker and speech characterisation, in Computer Speech and Language, 2019.
[2] B. M. L. Srivastava, M. Maouche, M. Sahidullah, E. Vincent, A. Bellet, M. Tommasi, N. Tomashenko, X. Wang, and J. Yamagishi. Privacy and utility of x-vector based speaker anonymization, in IEEE/ACM Transactions on Audio, Speech and Language Processing, 30:2383–2395, 2022.
[3] B. M. L. Srivastava, N. Vauquier, M. Sahidullah, A. Bellet, M. Tommasi, and E. Vincent. Evaluating voice conversion-based privacy protection against informed attackers, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
[4] N. Tomashenko, X. Wang, E. Vincent, J. Patino, B. M. L. Srivastava, P.-G. Noé, A. Nautsch, N. Evans, J. Yamagishi, B. O'Brien, A. Chanclu, J.-F. Bonastre, M. Todisco, and M. Maouche. The VoicePrivacy 2020 Challenge: Results and findings, in Computer Speech and Language, 2022.
[5] B. M. L. Srivastava, A. Bellet, M. Tommasi, and E. Vincent. Privacy-preserving adversarial representation learning in ASR: Reality or illusion?, in Proc. Interspeech, 2019.
[6] L. Girin, S. Leglaive, X. Bie, J. Diard, T. Hueber, and X. Alameda-Pineda. Dynamical Variational Autoencoders: A Comprehensive Review, in Foundations and Trends in Machine Learning, vol. 15, 2021.

Top

6-31

(2022-11-18) PhD studentships @ University of Edinburgh, Scotland, UK

PHD STUDENTSHIPS IN SPEECH TECHNOLOGY, COMPUTATIONAL LINGUISTICS, AND COGNITIVE SCIENCE

Institute for Language, Cognition and Computation
School of Informatics
University of Edinburgh

The Institute for Language, Cognition and Computation (ILCC) at the University of Edinburgh invites applications for three-year PhD studentships starting in October 2023. ILCC is dedicated to the pursuit of basic and applied research on computational approaches to language, communication and cognition.

Primary research areas include:

Natural language processing and computational linguistics
Machine Translation
Speech technology
Dialogue, multimodal interaction, language and vision
Computational Cognitive Science, including language and speech, decision-making, learning and generalization
Social Media and Computational Social Science
Human-Computer interaction, design informatics, assistive and educational technology
Information retrieval and visualization

Approximately 10 studentships from a variety of sources are available, covering both maintenance at the research council rate of GBP 17,668 (2022/23 rates) per year and tuition fees. Awards increase every year, typically with inflation. Studentships are available for UK, EU, and non-EU nationals.

For a list of academic staff at ILCC with research areas, and for a list of indicative PhD topics, please consult:

http://web.inf.ed.ac.uk/ilcc/people/academic-senior-research-staff
http://www.ilcc.inf.ed.ac.uk/study/possible-phd-topics-in-ilcc

Details regarding the PhD programme and the application procedure can be found at:

http://www.ed.ac.uk/informatics/postgraduate/research-degrees/phd

There are TWO DEADLINES for applications to receive full consideration:

round 1: 25th November 2022
round 2: 27th January 2023

We strongly recommend that non-UK applicants submit their applications in round 1, to maximise their chances of funding. Please direct inquiries to the PhD admissions team at ilcc-admissions@inf.ed.ac.uk.

Please note that the 3-year ILCC PhD program is distinct from the UKRI Centre for Doctoral Training in Natural Language Processing, which offers a 4-year PhD with integrated study:

http://nlp-cdt.ac.uk/

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. Is e buidheann carthannais a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.

Top

6-32

(2022-11-18) Teaching position, IUT and ENSSAT, Lannion, France

The IUT de Lannion and ENSSAT, both in Lannion (22), are each looking for a full-time teacher-researcher (enseignante-chercheuse / enseignant-chercheur) in Computer Science on an LRU contract for the rest of the academic year (until the end of August 2023).
The duties of this position are similar to those of an ATER position, and it can be a way to finish a thesis or to prepare for the recruitment competitions for researchers and teacher-researchers.

Research can be carried out within the EXPRESSION team at IRISA, among others.

The application deadline is very close: 30/11 for the IUT and 11/12 for ENSSAT, with a likely start date in January 2023.

The job descriptions and application procedures are available on the website of the Université de Rennes 1:
https://www.univ-rennes1.fr/nos-offres-demploi#p-126

Top

6-33

(2022-11-20) Research position on speaker and text anonymization for medical applications @DFKI, Berlin, GE

Research position on speaker and text anonymization for medical applications @DFKI, Berlin

We’re happy to announce a new research position in the field of speech and text anonymization at the German Research Center for Artificial Intelligence, Berlin, Germany. We’re looking for a full-time Researcher or Junior Researcher, and offer a 2-year contract with optional extension and PhD perspective.

The Speech and Language Technology Lab at DFKI Berlin is involved in numerous national as well as international research programmes and networks. We offer an interesting and flexible work environment as part of an innovative, international and enthusiastic team which coordinates and participates in national as well as European projects in the wider area of Language Technology.

Your tasks

Research and development of methods for anonymization and fake-detection using text, speech, and semi-structural data from medical and fake-news application domains
Research and development of methods for explainability, fairness, trustworthiness for above methods
Support of industrial and public scientific, research and development projects at national and international level (BMBF, EU, etc.)
Exchange and networking with relevant stakeholders from science, industry, politics and civil society.

Your qualifications

Master's degree (or equivalent qualification) in Computer Science, Computational Linguistics, Signal Processing or a related discipline
MA degree in the area of speech, NLP, AI, ML, xAI
Good English or German language skills
Confident appearance in an international working environment and high self-motivation
Ability to work in a team, problem-solving and results-oriented

Your benefits

A wide and interesting range of tasks and extensive development opportunities in a young and international team
Contribution to trend-setting, innovative projects in the field of Language Technology and AI
An innovative, agile and professional work environment
An attractive workplace in the heart of Berlin with PhD perspective

The German Research Center for Artificial Intelligence (DFKI) is Germany's leading business-oriented research institution in the field of innovative software technologies based on artificial intelligence methods. In the international scientific community, DFKI ranks among the most recognized 'Centers of Excellence' and currently is the biggest research center worldwide in the area of Artificial Intelligence and its application in terms of number of employees and the volume of external funds. The DFKI cooperates closely with national and international companies.

DFKI encourages applications from people with disability; DFKI intends to increase the proportion of female employees in the field of science and encourages women to apply for this position.

Application deadline: Dec 23

More details and link: https://jobs.dfki.de/en/internal/vacancy/en-researcher-m-w-d-in-506968.html

Top

6-34

(2022-11-21) Postdoc @GIPSALab, Grenoble, France

6-month postdoc position at GIPSA-lab, Grenoble, on real-time gestural control of intonation for voice substitution, within the ANR project GEPETO (Gestures and Pedagogy of Intonation).

Application: https://emploi.cnrs.fr/Offres/CDD/UMR5216-CHRROM-022/Default.aspx

General information

Reference: UMR5216-CHRROM-022
Workplace: ST MARTIN D HERES
Publication date: Tuesday, November 22, 2022
Contract type: Fixed-term scientific contract (CDD Scientifique)
Contract duration: 6 months
Expected start date: February 1, 2023
Working time: Full time
Salary: 2,805.35 € to 3,963.98 €
Desired level of education: Doctorate
Desired experience: 1 to 4 years

Missions

This postdoc is part of the ANR project GEPETO* (GEstures and PEdagogy of InTOnation), which aims to study the use of manual gestures, through human-machine interfaces, for designing tools and methods for learning to control intonation (melody) in speech.
In particular, this position is set in the context of voice rehabilitation, in cases of degraded or absent vocal fold vibration in patients with laryngeal disorders. Current medical solutions to replace this vibration consist in injecting an artificial sound source into the vocal tract, directly through the mouth or by transmission through the tissues of the neck, using an electrolarynx. This vibrator generates a substitute voice source over which the user can articulate speech normally. An alternative is to capture with a microphone the unvoiced speech produced by a person in the absence of vocal fold vibration (for example, a whisper), and to reintroduce voicing in real time through speech synthesis. The reconstructed voice is then played in real time through a loudspeaker. Today, all of these systems generate relatively constant intonation (melody) signals, leading to very robotic voices.
The goal of the GEPETO project at GIPSA-lab is to add to these two solutions real-time control of intonation by hand gesture, and to study the use of such systems in situations of spoken interaction.

Objectives:
In the first part of the project, we developed a real-time whisper-to-speech conversion solution to which various interfaces can be connected in order to capture manual gestures in different spaces (trajectory on a surface, in space, pressure, etc.) for the control of intonation. The aim of the postdoc is to evaluate such a system in situations of spoken production and interaction, in a voice substitution application.
We will first address the question of voicing control (whether or not the voiced source is activated). We have implemented a semi-automatic method based on the spectral centroid of the whisper signal, which requires the user to adjust the production of their whisper so that a correct voicing decision is made (Ardaillon et al., Interspeech 2022)**. We will then seek to evaluate to what extent this adaptation to the system is possible.
We will next study the prosodic control, by manual gesture, of typical intonation patterns (for example: sentence modalities, accentuation), depending on the degrees of freedom offered by the available interfaces (Sensel Morph touch tablet, accelerometer, etc.). This step will be evaluated both on simple sentence imitation tasks and in communication situations where the user must produce intelligible and expressive sentences for their interlocutor, without a reference to imitate.
These two research questions therefore concern how a user adapts their speech production and their manual gestures for the control of voicing and intonation, respectively. For each study, the task will be to propose experimental protocols that allow both the evaluation of subjects on the control tasks and the assessment of their ability to learn to use such a system.

* http://gepeto.dalembert.upmc.fr/project_fr.html
** http://www.gipsa-lab.grenoble-inp.fr/~olivier.perrotin/media/papers/10675_file_Paper.pdf

Activities

- Get to grips with the whisper-to-speech conversion system in the Max/MSP environment
- Propose a learning protocol for voicing control
- Propose an evaluation protocol for intonation control on imitation and production tasks
- Propose a learning protocol for intonation control
- Evaluate the whole system with groups of users

Skills

Applicants who lack skills in some of the listed areas are nevertheless encouraged to apply.
- Speech perception and production
- Experimental methodology in Speech Sciences / Phonetics
- Tools for analyzing results (e.g. R/Python/Matlab)
- Human-machine interaction
- Max/MSP programming (getting to grips with the existing system and improving it)
- Understanding of French (the language used for developing and evaluating the system)

Work context

Gipsa-lab is a joint research unit of CNRS, Grenoble INP (Grenoble Institute of Technology) and Université de Grenoble, under agreement with Inria and the Observatoire des Sciences de l'Univers de Grenoble.
With 350 people, including about 150 PhD students, Gipsa-lab is a multidisciplinary research unit conducting both fundamental and applied research on complex signals and systems.
Gipsa-lab develops projects in the strategic fields of energy, environment, communication, intelligent systems, life and health, and language engineering.
Through its research activities, Gipsa-lab maintains a constant link with the economic environment thanks to strong partnerships with companies.
Gipsa-lab staff are involved in teaching and training in the various universities and engineering schools of the Grenoble area (Université Grenoble Alpes).
Gipsa-lab is internationally recognized for its research in Automatic Control and Diagnostics, Signal and Image Information Sciences, and Speech and Cognition. The unit conducts its research through 16 teams organized into 4 research poles:
- Automatic Control and Diagnostics
- Data Science
- Geometry, Learning, Information and Algorithms
- Speech and Cognition
Gipsa-lab brings together 150 permanent staff and about 250 non-permanent staff (PhD students, postdocs, visiting researchers, administrative and technical staff, Master's interns…).
The postdoc will be attached to the CRISSP team (Cognitive Robotics, Interactive Systems, Speech Processing) of the Speech and Cognition pole of GIPSA-lab. He/she will work with Olivier Perrotin of the CRISSP team and Nathalie Henrich Bernardoni of the MOVE team (Analysis and Modeling of Humans in Motion: Biomechanics, Cognition, Vocology) of the Data Science pole.


6-35

(2022-11-21) Speech researcher @Vivoka, Metz, France

SPEECH RESEARCHER (M/F)

About Vivoka

Vivoka is a global leader in voice AI technologies founded in 2015. Thanks to its VDK (Voice Development Kit), Vivoka offers an all-in-one solution that enables any company to create its own high-performance, secure embedded/offline voice assistant in record time. Vivoka has won several innovation awards and has established leading partnerships with major players in the voice market. Vivoka has a portfolio of about 100 customers from all major industries and is pursuing its goal of bringing people closer to technology through voice.

Main mission

As part of its constant evolution, Vivoka is further developing the Voice Development Kit and its related projects. Your main mission will revolve around creating solutions for signal processing and more specifically speech processing.

Roles & Responsibilities

You will create the next generation of speech processing engines.
You will work as part of a passionate and collaborative team specialized in machine learning and speech processing technologies.
You will have the opportunity to publish research articles and contribute to international patents.
You will participate in all aspects of projects including proposing ideas, data collection/analysis, literature review, prototyping and development.

Job's Requirements

You are passionate, creative, and have a collaborative mindset.
You have a Ph.D. in speech processing or machine learning or a related area. Or you are an exceptional candidate holding a Master's degree and have strong experience.
You are familiar with recent research in Deep Learning and have the knowledge and skills to implement algorithms from research papers.
You have hands-on experience building ML systems for signal processing or related fields.
You have a strong programming background with 2+ years of experience in any of the following languages: Python or C++.
You have the ability to quickly learn new technologies and successfully implement them.
You have good analytical and synthesis skills.
You have good writing skills.
You are fluent in English.
You have experience with one or more of the following libraries: PyTorch, TensorFlow.
You have scientific publications in leading speech or signal processing conferences (like Interspeech, ICASSP, ASRU, etc.) and journals.
You may have experience with research or development for embedded/edge devices.

The advantages of the job

You will be part of a growing project, and be one of its pillars.
At Vivoka, autonomy and trust are important values.
Every day is a new challenge, you will not experience monotony.
Each project is an innovation, you evolve in a crazy and stimulating environment.
At Vivoka, we make every effort for our employees and their well-being.
You will be located in Metz, which is strategically located near the borders of France, Luxembourg and Germany.
You will have complementary health insurance benefits.
You will have complementary monthly meal vouchers.

If you are interested in the position,

send your documents to


6-36

(2022-11-21) NLP Researcher @Vivoka, Metz, France

NLP RESEARCHER (M/F)

About Vivoka

Vivoka is a global leader in voice AI technologies founded in 2015. Thanks to its VDK (Voice Development Kit), Vivoka offers an all-in-one solution that enables any company to create its own high-performance, secure embedded/offline voice assistant in record time. Vivoka has won several innovation awards and has established leading partnerships with major players in the voice market. Vivoka has a portfolio of about 100 customers from all major industries and is pursuing its goal of bringing people closer to technology through voice.

Main mission

As part of its constant evolution, Vivoka is further developing the Voice Development Kit and its related projects. Your main mission will revolve around creating solutions for NLP and more specifically NLU.

Roles & Responsibilities

You will create the next generation of NLP engines.
You will work as part of a passionate and collaborative R&D team specialized in NLP and ML technologies.
You will publish scientific papers and contribute to international patents.
You will participate in all aspects of projects including proposing ideas, data collection/analysis, literature review, prototyping and development.

Job’s Requirements

You are passionate, creative, and have a collaborative mindset.
You have a willingness to learn new technologies and successfully implement them.
You have a PhD (or M.Sc.) in NLP, Machine Learning, Data Science, Computer Science or a related area, and have strong experience in NLP/NLU/NLG.
You have strong scientific publications in NLP conferences such as ACL, EMNLP, COLING, etc.
You have an advanced understanding of NLP/NLU/NLG techniques and practical experience in building ML models for NLP/NLU/NLG and/or conversational agents.
You have experience in both rapid prototyping and experimentation.
You have a strong programming background with 2+ years of experience in any of the following languages: Python (PyTorch or TensorFlow) or C++.
You have good analytical and synthesis skills.
You have good writing skills.
You have good communication skills in English (French is not mandatory).
You are able to clearly communicate your technical findings to a non-technical team.
You may have experience with research or development for embedded/edge devices.

The advantages of the job

You will be part of a growing project, and be one of its pillars.
At Vivoka, autonomy and trust are important values.
Every day is a new challenge, you will not experience monotony.
Each project is an innovation, you evolve in a crazy and stimulating environment.
At Vivoka, we make every effort for our employees and their well-being.
You will be located in Metz, which is strategically located near the borders of France, Luxembourg and Germany.
You will have complementary health insurance benefits.
You will have complementary monthly meal vouchers.

If you are interested in the position,

send your documents to


6-37

(2022-11-21) Machine Learning engineer @Vivoka, Metz, France

MACHINE LEARNING ENGINEER (M/F)

About Vivoka

Vivoka is a global leader in voice AI technologies founded in 2015. Thanks to its VDK (Voice Development Kit), Vivoka offers an all-in-one solution that enables any company to create its own high-performance, secure embedded/offline voice assistant in record time. Vivoka has won several innovation awards and has established leading partnerships with major players in the voice market. Vivoka has a portfolio of about 100 customers from all major industries and is pursuing its goal of bringing people closer to technology through voice.

Main mission

As part of the R&D team, you will benchmark, develop, integrate and deploy our future embedded ML technologies for speech and natural language processing applications.

Roles & Responsibilities

You will work as part of a passionate and collaborative R&D team specialized in embedded ML technologies.
You will have the opportunity to publish scientific papers and contribute to international patents.
You will attend world-renowned conferences.
You will participate in all aspects of projects including proposing ideas, data collection/analysis, literature review, prototyping and development.
You will have to present your work to the company in an accessible way.

Job's Requirements

You hold a Master/Engineer degree or PhD in Machine Learning with interest in R&D.
You keep yourself up-to-date with the recent methods and techniques in applied deep learning and machine learning.
You are comfortable with reading scientific research papers as well as discussing and implementing them.
You have expertise with Machine Learning frameworks such as PyTorch and TensorFlow.
You have excellent skills in at least Python or C++.
You've already created production-ready complex models.
You may have experience with research or development for embedded/edge devices.
You have good analytical and synthesis skills.
You are independent in your work.
You are capable of explaining complex topics in accessible terms and presenting them to a team.
You have good communication skills in English (French is not mandatory).

The advantages of the job

You will be part of a growing project, and be one of its pillars.
At Vivoka, autonomy and trust are important values.
Every day is a new challenge, you will not experience monotony.
Each project is an innovation, you evolve in a crazy and stimulating environment.
At Vivoka, we make every effort for our employees and their well-being.
You will be located in Metz, which is strategically located near the borders of France, Luxembourg and Germany.
You will have complementary health insurance benefits.
You will have complementary monthly meal vouchers.

If you are interested in the position,

send your documents to


6-38

(2022-11-22) Faculty position (Associate professor, tenure position) at Telecom Paris, France

Faculty position (Associate professor, tenure position) at Telecom Paris in Machine Learning for Social Computing.

Telecom Paris has a new permanent (tenure) faculty position (Associate Professor/ “Maître de conférences”) in the area of **machine learning for social computing**. Applicants from the following sub-research areas are welcome:

Neural models for the recognition and generation of socio-emotional behaviors
Natural language and speech processing
Dialogue, conversational systems, and social robotics
Reinforcement learning for dialogue
Sentiment analysis in social interactions
Bias and explainability in AI
Model tractability, multi-task learning, meta-learning

Important Dates

March 20th, 2023: closing date for applications
April 20th, 2023: hearings of the preselected candidates

Context

Social Computing team [1] - S²A (machine learning, statistics and signal processing) group [2] - LTCI (laboratoire de traitement et communication de l’information) [4] - Telecom Paris [5].

Ecosystem

Telecom Paris [5] is a founding member of the Institut Polytechnique de Paris (IP Paris), a world-class scientific and technological institution. Located on the Plateau de Saclay close to Paris-Saclay University, this institution is a partnership between Ecole Polytechnique, ENSTA Paris, ENSAE Paris, Télécom Paris and Télécom SudParis, with HEC as a key partner.

Regularly ranked as one of the best engineering schools in France, Télécom Paris is recognized for its excellent training, its very good employability rate with high salaries, its high-level research, and its very close proximity to companies. Times Higher Education (THE) ranks Télécom Paris as the 2nd best French engineering school, the 5th best French university, and the 6th “best small university”. The newly created institution IP Paris was ranked among the top 50 universities in the QS World University Rankings.

In the context of the Institut Polytechnique de Paris, the team's activities in Data Science and AI benefit from the Hi!Paris center (https://www.hi-paris.fr), which offers seminars, workshops, and funding through calls for projects.

Main missions/Research activities

Develop groundbreaking research in the field of machine learning applied to Social Computing, which includes: natural language and speech processing, dialogue, conversational systems, and social robotics, reinforcement learning for dialogue, sentiment analysis in social interactions, bias and explainability in AI, model tractability, multi-task learning, meta-learning
Develop both academic and industrial collaborations on the same topic, including collaborative activities with other Telecom Paris research departments and teams (including social sciences researchers of economics and social sciences department [6]), and research contracts with industrial players
Set up research grants and take part in national and international collaborative research projects
Publish high-quality research work in leading journals and conferences
Be an active member of the research community (serving on scientific committees and boards, organizing seminars, workshops, and special sessions...)

Main missions/Teaching activities

Participate in teaching activities at Telecom Paris and its partner academic institutions (as part of joint Master programs), especially in natural language processing, speech processing, machine learning, and Data Science, including life-long training programs (e.g. the local “Mastères Spécialisés”)

Candidate profile

As a minimum requirement, the successful candidate will have:

A Ph.D. degree
A track record of research and publication in one or more of the following areas: conversational artificial intelligence, machine learning, natural language processing, speech and signal processing, human-agent interactions, social robotics
Experience in teaching
An international postdoctoral experience is welcome but not mandatory
Excellent command of English

NOTE:

The candidate does *not* need to speak French to apply, just to be willing to learn the language (teaching will be mostly given in English)

Other skills expected include:

• Capacity to work in a team and develop good relationships with colleagues and peers

• Excellent writing and pedagogical skills

More about the position

• Place of work: Saclay (Paris outskirts)

How to apply?

Applications must be submitted via one of the following websites:

French Version:

https://institutminestelecom.recruitee.com/o/enseignantchercheur-en-machine-learning-pour-la-modelisation-des-comportements-socioemotionnels-a-telecom-paris-cdi

English Version:

https://institutminestelecom.recruitee.com/l/en/o/enseignantchercheur-en-machine-learning-pour-la-modelisation-des-comportements-socioemotionnels-a-telecom-paris-cdi

Applicants should submit a single PDF file that includes:

- cover letter,

- curriculum vitae,

- statements of research and teaching interests (4 pages)

- three publications

- contact information for two references

Contacts: == please do not hesitate to directly contact us before applying ==

Chloé Clavel (Coordinator of the Social Computing team)

Stéphan Clémençon (Head of the S²A group)

Florence d’Alché-Buc (Head of the IDS department)

[1] https://www.telecom-paris.fr/en/research/laboratories/information-processing-and-communication-laboratory-ltci/research-teams/signal-statistics-learning/social-computing

[2] https://www.telecom-paris.fr/en/research/laboratories/information-processing-and-communication-laboratory-ltci/research-teams/signal-statistics-learning

[3] https://www.telecom-paris.fr/fr/lecole/departements-enseignement-recherche/image-donnees-signal

[4] https://www.telecom-paris.fr/en/research/laboratories/information-processing-and-communication-laboratory-ltci

[5] https://www.telecom-paris.fr/en/home

[6] https://www.telecom-paris.fr/en/the-school/teaching-research-departments/economics-and-social-sciences


6-39

(2022-11-23) Postdoc for speech-based affective computing, King's College London

I am looking for a post-doc for speech-based affective computing and multimodal mHealth applications. For full details, see: https://jobs.kcl.ac.uk/gb/en/job/058426/Research-Fellow-in-Data-Science-for-Mobile-Health-mHealth

Dr. Nicholas Cummins
Lecturer in AI for Speech Analysis for Healthcare
Department of Biostatistics & Health Informatics

Institute of Psychiatry, Psychology & Neuroscience
King's College London


6-40

(2022-11-24) Internship @UMRAE, Strasbourg, France

UMRAE-INRIA

INTERNSHIP PROPOSAL 2022-2023

Internship topic

New algorithms for automated room acoustic diagnosis

Recommended level

☒ Master (M2) ☐ Master (M1) ☐ Engineering degree ☐ Bachelor's degree (Licence) ☐ Bac + 2

Required skills

Room acoustics, optimization methods, signal processing, machine learning

Knowledge of Python or Matlab would be a plus.

Master 2 (acoustics, computer science, signal processing…)

General introduction

Noise pollution is cited by the public as the leading source of annoyance and is a major health and social issue. In buildings, the annoyance is often linked to poor room acoustics caused by excessive reverberation (canteens, swimming pools, nurseries…). In the acoustic rehabilitation of rooms, proposing a solution requires good knowledge of the geometric and acoustic characteristics of the existing space. To estimate certain unknown parameters (e.g. the absorption of an unknown ceiling), field acousticians rely on in situ measurements of the sound field and on numerical (or analytical) acoustic models whose parameters they tune iteratively until the model reproduces the measured sound field. The complete diagnostic process is therefore long, costly and sometimes imprecise, depending on the models used. In light of this, developing so-called inverse methods, which automatically recover the acoustic parameters of interest from the measurement, would be a major breakthrough for building acoustics, paving the way for simpler, faster and more reliable tools for acousticians.

Topic

Our topic, the development of inverse methods in building acoustics via optimization, audio signal processing and/or machine learning, complements the existing range of sound field prediction tools. It also breaks down the existing barrier between the audio and acoustics communities, which is reflected in separate conferences and journals. Preliminary work conducted by UMRAE and Inria, based on the room impulse response (RIR), has clearly shown that directly applying optimization approaches (as well as machine learning approaches) from other fields is not sufficient to solve our problem. These approaches must be adapted to the specific case of acoustics. To date, under idealized conditions (omnidirectional microphones and sources, rectangular room, rather low wall absorption…), our work has shown that it is possible to “recover” the wall absorption from the RIR and to reconstruct the geometry of the room, without prior knowledge of the positions of the sound source and microphones.

The candidate, supported by his/her supervisors, will reinforce two PhD students working on these inverse methods. To do so, he/she will have to get to grips with one of the optimization methods already in place (iterative RANSAC algorithms, Sliding Frank-Wolfe, the method of fundamental solutions, neural networks…) as well as the theoretical model specifically chosen for this optimization work, which can be expressed in the time domain, in the Fourier domain, or on a spherical harmonic decomposition.

Several lines of work are then possible, depending on the candidate's interests and after discussion with the supervisors. The candidate may focus on the reference theoretical model, for example by using it in one of the domains (time, Fourier or spherical harmonics). He/she may also seek to improve it, for example by incorporating the directivity of sources and microphones, or wall scattering. In parallel, the candidate may also refine the chosen optimization methods for the acoustic case, or even propose another optimization algorithm used in another field of physics. Finally, to validate this work, the candidate will have access to RIRs simulated with reference numerical tools, as well as measured RIRs, notably from a research project conducted jointly with the Institut de Recherche en Coordination Acoustique/Musique (IRCAM Paris).

General information

Location and duration of the internship

The internship will take place in Strasbourg on the premises of Cerema (11 rue Jean Mentelin, 67200 Strasbourg). The planned duration is 4-6 months.

Supervisors

Antoine Deleforge, Research Scientist (Chargé de Recherche) at Inria, Multispeech team, 615 rue du jardin botanique, 54600 Villers-lès-Nancy.

For practical reasons, Antoine Deleforge is physically present at the Strasbourg office 2 days a week.

https://members.loria.fr/ADeleforge/ ; https://team.inria.fr/multispeech/ ; https://www.inria.fr/fr

Cédric Foy, Research Scientist (Chargé de Recherche) at UMRAE, Cerema, Univ. Gustave Eiffel, 11 rue Jean Mentelin, 67200 Strasbourg

https://www.cerema.fr/fr ; https://www.umrae.fr/ ; https://twitter.com/umrae_lab

Bibliography

T. Sprunck, K. Chahdi, C. Foy, E. Franck, A. Deleforge, Reconstruction de la forme d’une pièce par super-résolution à l’aide de réponses impulsionnelles, 16ème Congrès Français d'Acoustique, Marseille, France, April 11-15, 2022.

S. Dilungana, A. Deleforge, C. Foy, S. Faisan, Estimation jointe des profils d’absorption des parois d’une salle à partir de réponses impulsionnelles, 16ème Congrès Français d'Acoustique, Marseille, France, April 11-15, 2022.

S. Dilungana, A. Deleforge, C. Foy, S. Faisan, Geometry-informed estimation of surface absorption profiles from impulse responses, EUSIPCO, 30th European Signal Processing Conference, Belgrade, Serbia, 2022.

T. Sprunck, Y. Privat, C. Foy, A. Deleforge, Gridless 3D Recovery of Image Sources from Room Impulse Responses, preprint, 2022, https://arxiv.org/abs/2208.14017

S. Dilungana, A. Deleforge, C. Foy, S. Faisan, Learning-based estimation of individual absorption profiles from a single room impulse response with known positions of source, sensor and surfaces, INTER-NOISE and NOISE-CON Congress and Conference Proceedings, vol. 263, no. 1, pp. 5623-5630, 2021.

https://jtav.ifsttar.fr/fileadmin/contributeurs/JTAV/2022/JTAV2022_foyetcoll.pdf


6-41

(2022-12-05) Permanent academic post in Speech Technology @ University of Edinburgh, Scotland, UK

The School of Informatics at the University of Edinburgh is recruiting for a permanent academic post in Speech Technology. The appointment will be at Lecturer or Reader grade (equivalent to US Assistant Professor/Associate Professor). You will contribute to research and teaching in the Centre for Speech Technology Research (CSTR) and the Institute for Language, Cognition, and Computation (ILCC). There is extensive scope to collaborate with other Institutes and Schools within the University.

The successful candidate will have (or be near to completing) a PhD, an established research agenda and the enthusiasm and ability to undertake original research, to lead a research group, and to engage with teaching and academic supervision. We are seeking current and future leaders in the field who are able to forge new collaborations both within the field and across disciplines. We are particularly looking for a candidate with potential to extend the breadth of our research beyond our traditional core strengths in speech recognition and synthesis towards emerging applications, for example in spoken dialogue systems; spoken language understanding; healthcare and assistive technology applications; explainable speech processing; human computer interaction; or autonomous systems.

For more details, including how to apply, view the full advert at https://elxw.fa.em3.oraclecloud.com/hcmUI/CandidateExperience/en/job/5973

Applications close on 12 January.


6-42

(2022-12-08) Ph.D. Position in Cognitive Neuroscience @ GIPSA, Grenoble, France

The GIPSA-lab, Grenoble, is offering a

Ph.D. Position in Cognitive Neuroscience

Senses of confidence and effort in sensorimotor adaptation (speech and reaching)

Application deadline: 31/12/22; Starting date: 1/04/23 at the latest

Context

Numerous studies have explored sensorimotor learning in hand movements and speech production. They showed how individuals adapt their gestures in a way that compensates partially for the perturbation induced on the visual, auditory or somatosensory feedback. Varying degrees of compensation were observed across individuals (1–3) – in particular for pathological populations (4,5), for different sensorimotor perturbations (e.g. pitch or formant shifted feedback (6,7)), different languages (8) or different tasks (e.g. including linguistic confusion or not) (9,10).

Scientific objectives

In complement to this existing literature (7,11), the current project aims at exploring in more detail the factors influencing this varying degree of compensation to a sensorimotor perturbation. We will explore, in particular, the hypothesis that it may be influenced by the relative attention and confidence given to our different sensory feedbacks (visual and proprioceptive feedbacks for hand movements, auditory and proprioceptive feedbacks for speech), and to the related sense of effort felt in the task.

To that goal, several experiments of both visuo-motor and audio-motor perturbation will be conducted (see Figure 1). In a first behavioral step, we will explore how the degree of arm or speech compensation varies with an increasing rotation of the visual feedback or an increasing pitch shift of the auditory feedback, how it is influenced by the location or the pitch level of the target, and how it may be affected by an increasing degree of visual blurring or auditory masking. We will pay attention, in particular, to possible reorganizations of the compensatory behavior, detected from discontinuities in the compensation/perturbation relationship.

Depending on the candidate’s interests and/or funding opportunities, the second step will follow one of two directions. It may further explore the neural correlates of these compensatory mechanisms (12–16), and of their possible reorganization with an increasing level of perturbation, using fMRI neuroimaging. Alternatively, it may explore the possible impairment of these mechanisms in people who stutter, who demonstrate reduced degrees of compensation to an auditory perturbation (17–20), reduced tactile sensitivity of the oral cavity (21–23), and an increased sense of effort (24).

Required skills

We are searching for a highly motivated candidate with:

- a Master's degree (M.Sc., M.Eng. or equivalent) in (neuro)cognitive sciences, computer science, or signal processing

- knowledge of and interest in motor control, neuroscience and speech
- good programming skills in Matlab, Python or R

- experimental skills and interests

Lab and supervision

The PhD candidate will be supervised by Maëva Garnier, Fabien Cignetti and Pascal Perrier, in a collaboration between GIPSA-lab and TIMC-IMAG in Grenoble. He/she will join the PCMD team of GIPSA-lab in Grenoble, composed of six PhD students and 12 researchers and engineers (http://www.gipsa-lab.grenoble-inp.fr/en/pcmd.php).

Application instructions

The application consists of a motivation letter, CV (with detailed list of courses related to computer science, signal processing, and neuro-cognitive science), names and contact details of two references, and transcripts of grades from under-graduate and graduate programs.

Contact

Maëva Garnier, Email: maeva.garnier@gipsa-lab.fr, Phone: (+33) 4 76 57 50 61

Fabien Cignetti, Email: fabien.cignetti@univ-grenoble-alpes.fr, Phone: (+33) 4 76 63 71 10

Pascal Perrier, Email: pascal.perrier@gipsa-lab.fr, Phone: (+33) 4 76 57 48 25

References

1.    Ghosh SS, Matthies ML, Maas E, Hanson A, Tiede M, Ménard L, et al. An investigation of the relation between sibilant production and somatosensory and auditory acuity. J Acoust Soc Am. 2010;128(5):3079–87.

2.    Villacorta VM, Perkell JS, Guenther FH. Sensorimotor adaptation to feedback perturbations of vowel acoustics and its relation to perception. J Acoust Soc Am. 2007;122(4):2306–19.

3.    Savariaux C, Perrier P. Compensation strategies for the perturbation of the rounded vowel [u] using a lip tube: A study of the control space in speech production. J Acoust Soc Am. 1995;98(5):2428–42.

4.    Loucks T, Chon H, Han W. Audiovocal integration in adults who stutter. Int J Lang Commun Disord. 2012 Jul;47(4):451–6.

5.    Mollaei F, Shiller DM, Baum SR, Gracco VL. Sensorimotor control of vocal pitch and formant frequencies in Parkinson’s disease. Brain Res. 2016;1646:269–77.

6.    Jones JA, Munhall KG. Perceptual calibration of F0 production: evidence from feedback perturbation. J Acoust Soc Am. 2000 Sep;108(3 Pt 1):1246–51.

7.    MacDonald EN, Goldberg R, Munhall KG. Compensations in response to real-time formant perturbations of different magnitudes. J Acoust Soc Am. 2010 Feb 1;127(2):1059–68.

8.    Mitsuya T, MacDonald EN, Purcell DW, Munhall KG. A cross-language study of compensation in response to real-time formant perturbation. J Acoust Soc Am. 2011;130(5):2978–86.

9.    Bourguignon NJ, Baum SR, Shiller DM. Lexical-perceptual integration influences sensorimotor adaptation in speech. Front Hum Neurosci. 2014;8:208.

10. Frank AF. Integrating linguistic, motor, and perceptual information in language production. University of Rochester; 2011.

11. Liu H, Larson CR. Effects of perturbation magnitude and voice F0 level on the pitch-shift reflex. :8.

12. Behroozmand R, Korzyukov O, Sattler L, Larson CR. Opposing and following vocal responses to pitch-shifted auditory feedback: Evidence for different mechanisms of voice pitch control. J Acoust Soc Am. 2012 Oct;132(4):2468–77.

13. Parkinson AL, Flagmeier SG, Manes JL, Larson CR, Rogers B, Robin DA. Understanding the neural mechanisms involved in sensory control of voice production. Neuroimage. 2012;61(1):314–22.

14. Toyomura A, Fujii T, Kuriki S. Effect of external auditory pacing on the neural activity of stuttering speakers. NeuroImage. 2011 Aug 15;57(4):1507–16.

15. Zarate JM, Wood S, Zatorre RJ. Neural networks involved in voluntary and involuntary vocal pitch regulation in experienced singers. Neuropsychologia. 2010;48(2):607–18.

16. Zarate JM, Zatorre RJ. Experience-dependent neural substrates involved in vocal pitch regulation during singing. Neuroimage. 2008;40(4):1871–87.

17. Kim KS, Daliri A, Flanagan JR, Max L. Dissociated Development of Speech and Limb Sensorimotor Learning in Stuttering: Speech Auditory-motor Learning is Impaired in Both Children and Adults Who Stutter. Neuroscience. 2020 Dec 15;451:1–21.

18. Daliri A, Wieland EA, Cai S, Guenther FH, Chang SE. Auditory-motor adaptation is reduced in adults who stutter but not in children who stutter. Dev Sci. 2018;21(2):e12521.

19. Cai S, Beal DS, Ghosh SS, Tiede MK, Guenther FH, Perkell JS. Weak Responses to Auditory Feedback Perturbation during Articulation in Persons Who Stutter: Evidence for Abnormal Auditory-Motor Transformation. Larson CR, editor. PLoS ONE. 2012 Jul 23;7(7):e41830.

20. Sengupta R, Shah S, Gore K, Loucks T, Nasir SM. Anomaly in neural phase coherence accompanies reduced sensorimotor integration in adults who stutter. Neuropsychologia. 2016;93:242–50.

21. De Nil LF, Abbs JH. Kinaesthetic acuity of stutterers for oral and non-oral movements. Brain. 1991;114(5):2145–58.

22. Loucks TM, De Nil LF. Oral kinesthetic deficit in adults who stutter: a target-accuracy study. J Mot Behav. 2006;38(3):238–47.

23. Loucks TMJ, De Nil LF. Oral kinesthetic deficit in stuttering evaluated by movement accuracy and tendon vibration. Speech Mot Control Norm Disord Speech. 2001;307–10.

24. Ingham RJ, Warner A, Byrd A, Cotton J. Speech effort measurement and stuttering: Investigating the chorus reading effect. 2006;

25. Caudrelier T, Rochet-Capellan A. Changes in speech production in response to formant perturbations: An overview of two decades of research. 2019.


6-43

(2023-12-05) Master internship - Advanced Selective Mutual Learning for audio source separation @SteelSeries France R&D team (former Nahimic R&D team), France

Advanced Selective Mutual Learning for audio source separation

Master internship, Lille (France), 2022

Advisors

- Nathan Souviraà-Labastie, R&D Engineer, PhD, nathan.souviraa-labastie@steelseries.com
- Pierre Biret, R&D Engineer, pierre.biret@steelseries.com

Company description

About GN Group

GN was founded 150 years ago with a truly innovative and global mindset. Today, we honour that legacy with world-leading expertise in the human ear, sound and video processing, wireless technology, miniaturization and collaborations with leading technology partners. GN’s solutions are marketed by the brands ReSound, Beltone, Interton, Jabra, BlueParrott, SteelSeries and FalCom in 100 countries. The GN Group employs 6,500 people and is listed on Nasdaq Copenhagen (GN.CO).

About SteelSeries

SteelSeries is the worldwide leader in gaming and esports peripherals focused on premium quality, innovation, and functionality. SteelSeries’ family of professional and gaming enthusiasts are the driving force behind the company and help influence, design, and craft every single accessory and the brand’s software ecosystem, SteelSeries GG. In 2020, SteelSeries acquired Nahimic, the leader in 3D sound solutions for gaming.

We are currently looking for a machine learning / audio signal processing intern to join the R&D team of SteelSeries’ Software & Services Business Unit in our French office (former Nahimic R&D team).

Internship subject

Audio source separation consists of extracting the different sound sources present in an audio signal, in particular by estimating their frequency distributions and/or spatial positions. Many applications are possible, from karaoke generation to speech denoising. In 2020, our separation approaches [1, 2] matched the state of the art [3, 4] on a music separation task. Since then, our speech denoising product has hit the market [5] and the team continues to explore many avenues for improvement (see, for instance, the project in [6, 7]).

Selective Mutual Learning (SML)

Mutual learning (ML) [8] is a general idea related to knowledge distillation (KD) [9], in which a group of untrained lightweight networks simultaneously learn and share knowledge to perform tasks together during training. The specificity of Selective Mutual Learning [10] is that high-confidence predictions are used to guide the other network, while low-confidence predictions are ignored. This helps filter out the networks’ poor predictions during knowledge sharing. Note that knowledge is shared through loss functions that take the predictions of the other networks into account. The approach is simple and already shows benefits compared to KD and ML in boosting the performance of networks for speech separation.

Research axes

The intern will be able to use existing internal training sets and already implemented network architectures, which will make it easier to draw unbiased comparisons with our baseline approach. After re-implementing the SML approach, possible axes of improvement include:

- tuning the confidence factor (hyper-parameter c in [10]) to fit our speech denoising baseline (DNN and training set)
- extending the SML approach to more than two networks and testing it
- adapting the SML loss formula to incorporate our internal loss (description upon request)
- running additional tests with:
  - new or already implemented networks: TasNet [11], E3Net [12], DPRNN [13], transformer [1]
  - various training sets (music separation, speech separation, ...)

Skills

Who are we looking for?

You are preparing an engineering degree or a master’s degree and preferably have knowledge of the development and implementation of advanced machine learning algorithms. Digital audio signal processing skills are a plus.

Although not mandatory, knowledge of the following additional fields would be appreciated:

- Audio effects in general: compression, equalization, etc.
- Statistics, probabilistic approaches, optimization.
- Programming languages: Python, PyTorch, Keras, TensorFlow, Matlab.
- Voice recognition, voice command.
- Computer programming and development: Max/MSP, C/C++/C#.
- Audio editing software: Audacity, Adobe Audition, etc.
- Scientific publications and patent applications.
- Fluency in English and French.
- Demonstrated intellectual curiosity.

References

[1] I. Alaoui Abdellaoui and N. Souviraà-Labastie. "Blending the attention mechanism in TasNet". Working paper or preprint. Nov. 2020.

[2] E. Pierson Lancaster and N. Souviraà-Labastie. "A frugal approach to music source separation". Working paper or preprint. Nov. 2020.

[3] F.-R. Stöter, A. Liutkus and N. Ito. "The 2018 signal separation evaluation campaign". In: International Conference on Latent Variable Analysis and Signal Separation. Springer. 2018, pp. 293-305.

[4] N. Takahashi and Y. Mitsufuji. "Multi-scale Multi-band DenseNets for Audio Source Separation". In: 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). June 29, 2017. arXiv: 1706.09588.

[5] ClearCast AI Noise Canceling - Promotion video. https://www.youtube.com/watch?v=RD4eXKEw4Lg.

[6] M. Vial and N. Souviraà-Labastie. Learning rate scheduling and gradient clipping for audio source separation. Tech. report. SteelSeries France, Dec. 2022.

[7] The torch_custom_lr_schedulers GitHub repository. https://github.com/SteelSeries/torch_custom_lr_schedulers.

[8] Y. Zhang et al. "Deep mutual learning". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 4320-4328.

[9] G. Hinton, O. Vinyals, J. Dean et al. "Distilling the knowledge in a neural network". In: arXiv preprint arXiv:1503.02531 2.7 (2015).

[10] H. M. Tan et al. "Selective Mutual Learning: An Efficient Approach for Single Channel Speech Separation". In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2022, pp. 3678-3682.

[11] Y. Luo and N. Mesgarani. "TasNet: time-domain audio separation network for real-time, single-channel speech separation". In: arXiv:1711.00541 [cs, eess] (Nov. 1, 2017).

[12] M. Thakker et al. "Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation". In: arXiv preprint arXiv:2204.00771 (2022).

[13] Y. Luo, Z. Chen and T. Yoshioka. "Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation". In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 46-50.
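
As a rough illustration of the selective sharing idea described in the internship subject above, the following Python sketch gates a mutual-learning term with a confidence test. It is not the loss of [10]: the use of the peer's per-sample error as a confidence proxy, the threshold `c`, the weight `alpha`, and all function names are assumptions made for this example only.

```python
from typing import List, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def selective_mutual_loss(
    own_preds: List[Sequence[float]],
    peer_preds: List[Sequence[float]],
    targets: List[Sequence[float]],
    c: float = 0.1,
    alpha: float = 0.5,
) -> float:
    """Loss for one network of a two-network group (hypothetical form).

    Supervised term: MSE between the network's prediction and the target.
    Mutual term: MSE between the network's prediction and the peer's
    prediction, counted only for samples where the peer looks confident,
    i.e. where the peer's own error against the target is below c
    (this confidence proxy is an assumption of the sketch).
    """
    total = 0.0
    for own, peer, target in zip(own_preds, peer_preds, targets):
        loss = mse(own, target)
        peer_is_confident = mse(peer, target) < c
        if peer_is_confident:
            # Only confident peer predictions are allowed to guide this network.
            loss += alpha * mse(own, peer)
        total += loss
    return total / len(targets)
```

In an actual training loop the same gating would be applied to tensors, with the peer's prediction treated as a constant when computing gradients for the current network. The second network would use the symmetric loss.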


6-44

(2023-12-05) Personalized speech enhancement Master internship, Lille (France) @SteelSeries France R&D team (former Nahimic R&D team), France

Personalized speech enhancement Master internship, Lille (France), 2022

Advisors

- Damien Granger, R&D Engineer, damien.granger@steelseries.com
- Nathan Souviraà-Labastie, R&D Engineer, PhD, nathan.souviraa-labastie@steelseries.com
- Lucas Dislaire, R&D Engineer, lucas.dislaire@steelseries.com

Company description

About GN Group

GN was founded 150 years ago with a truly innovative and global mindset. Today, we honour that legacy with world-leading expertise in the human ear, sound and video processing, wireless technology, miniaturization and collaborations with leading technology partners. GN’s solutions are marketed by the brands ReSound, Beltone, Interton, Jabra, BlueParrott, SteelSeries and FalCom in 100 countries. The GN Group employs 6,500 people and is listed on Nasdaq Copenhagen (GN.CO).

About SteelSeries

SteelSeries is the worldwide leader in gaming and esports peripherals focused on premium quality, innovation, and functionality. SteelSeries’ family of professional and gaming enthusiasts are the driving force behind the company and help influence, design, and craft every single accessory and the brand’s software ecosystem, SteelSeries GG. In 2020, SteelSeries acquired Nahimic, the leader in 3D sound solutions for gaming.

We are currently looking for a machine learning / audio signal processing intern to join the R&D team of SteelSeries’ Software & Services Business Unit in our French office (former Nahimic R&D team).

Internship subject

Audio source separation consists of extracting the different sound sources present in an audio signal, in particular by estimating their frequency distributions and/or spatial positions. Many applications are possible, from karaoke generation to speech denoising. In 2020, our separation approaches [1, 2] matched the state of the art [3, 4] on a music separation task. Since then, our speech denoising product has hit the market [5] and the team continues to explore many avenues for improvement (see, for instance, the project in [6, 7]).

Speech-related audio source separation tasks

Speech enhancement [8], or speech denoising, usually refers to the task where the signal of interest is one intelligible speaker drowned in additive noise. Speech separation [9] (sometimes called speaker separation) refers to the task of separately retrieving multiple unknown speakers, usually not drowned in additive noise. In the case of Personalized Speech Enhancement (also called target voice separation or target speaker extraction), the signal of interest is a speaker, but a known one. This opens the door to (1) potentially better speech enhancement performance and (2) a focus on one particular speaker, where plain speech enhancement would have kept all intelligible speakers.

Personalized speech enhancement

The research area is very recent and was only added to the DNS challenge last year [10]. Note that the DNS challenge tasks [10] are defined as real-time, with a 40 ms constraint on the latency (mainly made up of the look-ahead).

VoiceFilter [11] seems to be the first article to tackle the problem of personalized speech enhancement. It uses two separately trained neural networks: one discrimination network that produces speaker-specific embeddings from reference utterances of the target speaker, and one “main” network that performs the actual speech enhancement by taking as input both the corrupted utterance and the target speaker embeddings. The approach has since been outperformed (e.g., [12]), while the two-step approach tends to prevail in the literature [13, 14, 15]: first, a target speaker representation is learned during an enrollment phase, for instance by means of speaker embeddings such as x-vectors or d-vectors [12]; second, this representation is fed into the neural network that learns to extract the target speaker’s speech. Note, however, that two steps do not necessarily mean two networks; for instance, a jointly trained 4-stage network is proposed in [15].

Research axes

The objective of the internship is to address one or several of the following targets:

- First, a baseline framework needs to be set up. It will require:
  - A dataset tailored to the task: the datasets available in the scientific community do not completely fulfill the requirements of SteelSeries products (description upon request). Conversely, our current datasets partly lack speaker information. Hence, one would need to opt for the best solution or trade-off combination.
  - A speaker embedding baseline, under the assumption that the signal captured during enrollment is “clean”, i.e. contains only the signal of interest (or at least no second speaker).
  - A speech enhancement model with speaker embeddings. The intern could, for instance, re-use our implementation (currently without speaker embeddings) of E3Net [14].
- Second, once a first baseline has been trained, the candidate could benchmark it on different scenarios (signal-to-noise ratio during enrollment, signal ratio between speakers and effect of additional noise, various and mixed languages). The Target Speaker Over-Suppression metric could potentially be used (description in [12]), as well as the standard DNS metrics.

This could lead the candidate to work on one of the following items to improve the baseline framework where weaknesses have been identified:

- Testing various speaker embeddings and various ways or positions of integrating them into the networks.
- Separate training of the speaker-encoding network has been found to work better than joint training [16, 17, 15] (multi-task learning often being hard to tune). However, this would need to be reassessed with the final chosen architecture.
- More effective enrollment strategies [18, 19] could be chosen and adapted to SteelSeries use cases.
- Implementing loss functions suitable for the separation task (Step 2) could also be of interest, for instance by following ideas in [20] or by adapting our internal loss.

Skills

Who are we looking for?

You are preparing an engineering degree or a master’s degree and preferably have knowledge of the development and implementation of advanced machine learning algorithms. Digital audio signal processing skills are a plus.

Although not mandatory, knowledge of the following additional fields would be appreciated:

- Audio effects in general: compression, equalization, etc.
- Statistics, probabilistic approaches, optimization.
- Programming languages: Python, PyTorch, Keras, TensorFlow, Matlab.
- Voice recognition, voice command.
- Computer programming and development: Max/MSP, C/C++/C#.
- Audio editing software: Audacity, Adobe Audition, etc.
- Scientific publications and patent applications.
- Fluency in English and French.
- Demonstrated intellectual curiosity.

References

[1] I. Alaoui Abdellaoui and N. Souviraà-Labastie. "Blending the attention mechanism in TasNet". Working paper or preprint. Nov. 2020.

[2] E. Pierson Lancaster and N. Souviraà-Labastie. "A frugal approach to music source separation". Working paper or preprint. Nov. 2020.

[3] F.-R. Stöter, A. Liutkus and N. Ito. "The 2018 signal separation evaluation campaign". In: International Conference on Latent Variable Analysis and Signal Separation. Springer. 2018, pp. 293-305.

[4] N. Takahashi and Y. Mitsufuji. "Multi-scale Multi-band DenseNets for Audio Source Separation". In: 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). June 29, 2017. arXiv: 1706.09588.

[5] ClearCast AI Noise Canceling - Promotion video. https://www.youtube.com/watch?v=RD4eXKEw4Lg.

[6] M. Vial and N. Souviraà-Labastie. Learning rate scheduling and gradient clipping for audio source separation. Tech. report. SteelSeries France, Dec. 2022.

[7] The torch_custom_lr_schedulers GitHub repository. https://github.com/SteelSeries/torch_custom_lr_schedulers.

[8] DNS challenge on the paperswithcode website. https://paperswithcode.com/sota/speechenhancement-on-deep-noise-suppression.

[9] Speech separation task referenced on the paperswithcode website. https://paperswithcode.com/task/speech-separation.

[10] H. Dubey et al. "ICASSP 2022 deep noise suppression challenge". In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2022, pp. 9271-9275.

[11] Q. Wang et al. "VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking". In: arXiv preprint arXiv:1810.04826 (2018).

[12] S. E. Eskimez et al. "Personalized speech enhancement: New models and comprehensive evaluation". In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2022, pp. 356-360.

[13] R. Giri et al. "Personalized PercepNet: Real-time, low-complexity target voice separation and enhancement". In: arXiv preprint arXiv:2106.04129 (2021).

[14] M. Thakker et al. "Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation". In: arXiv preprint arXiv:2204.00771 (2022).

[15] C. Xu et al. "SpEx: Multi-scale time domain speaker extraction network". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), pp. 1370-1384.

[16] K. Žmolíková et al. "Learning speaker representation for neural network based multichannel speaker extraction". In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE. 2017, pp. 8-15.

[17] M. Delcroix et al. "Single channel target speaker extraction and recognition with speaker beam". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2018, pp. 5554-5558.

[18] H. Sato et al. "Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations". In: arXiv preprint arXiv:2206.08174 (2022).

[19] X. Liu, X. Li and J. Serrà. "Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation". In: arXiv preprint arXiv:2210.12635 (2022).

[20] H. Taherian, S. E. Eskimez and T. Yoshioka. "Breaking the trade-off in personalized speech enhancement with cross-task knowledge distillation". In: arXiv preprint arXiv:2211.02944 (2022).
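
The two-step scheme described in the internship subject above (enrollment, then conditioned enhancement) can be sketched as follows. This is a toy illustration only: the frame-level vectors stand in for the outputs of a real speaker encoder, averaging followed by normalization is just one common way of pooling them, and concatenating the embedding to every input frame is only one of several possible integration points. All function names are hypothetical.

```python
import math
from typing import List


def enroll(frame_vectors: List[List[float]]) -> List[float]:
    """Step 1: pool frame-level speaker vectors into one unit-norm embedding."""
    dim = len(frame_vectors[0])
    count = len(frame_vectors)
    mean = [sum(vec[i] for vec in frame_vectors) / count for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # avoid division by zero
    return [x / norm for x in mean]


def condition(noisy_frames: List[List[float]],
              embedding: List[float]) -> List[List[float]]:
    """Step 2 input: append the speaker embedding to every noisy frame.

    The enhancement network (not shown) would consume these extended
    frames and learn to keep only the enrolled speaker.
    """
    return [list(frame) + list(embedding) for frame in noisy_frames]
```

A network conditioned this way sees the same speaker embedding at every time step, so the choice of where to inject it (input, intermediate layer, or several layers) remains a design question of the kind listed in the research axes.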


6-45

(2022-12-05) Real time speaker separation Master internship, Lille (France), 2022 @SteelSeries France R&D team (former Nahimic R&D team), France

Real time speaker separation

Master internship, Lille (France), 2022

Advisors

- Nathan Souviraà-Labastie, R&D Engineer, PhD, nathan.souviraa-labastie@steelseries.com
- Damien Granger, R&D Engineer, damien.granger@steelseries.com

Company description

About GN Group

GN was founded 150 years ago with a truly innovative and global mindset. Today, we honour that legacy with world-leading expertise in the human ear, sound and video processing, wireless technology, miniaturization and collaborations with leading technology partners. GN’s solutions are marketed by the brands ReSound, Beltone, Interton, Jabra, BlueParrott, SteelSeries and FalCom in 100 countries. The GN Group employs 6,500 people and is listed on Nasdaq Copenhagen (GN.CO).

About SteelSeries

SteelSeries is the worldwide leader in gaming and esports peripherals focused on premium quality, innovation, and functionality. SteelSeries’ family of professional and gaming enthusiasts are the driving force behind the company and help influence, design, and craft every single accessory and the brand’s software ecosystem, SteelSeries GG. In 2020, SteelSeries acquired Nahimic, the leader in 3D sound solutions for gaming.

We are currently looking for a machine learning / audio signal processing intern to join the R&D team of SteelSeries’ Software & Services Business Unit in our French office (former Nahimic R&D team).

Internship subject

Audio source separation consists of extracting the different sound sources present in an audio signal, in particular by estimating their frequency distributions and/or spatial positions. Many applications are possible, from karaoke generation to speech denoising. In 2020, our separation approaches [1, 2] matched the state of the art [3, 4] on a music separation task. Since then, our speech denoising product has hit the market [5] and the team continues to explore many avenues for improvement (see, for instance, the project in [6, 7]).

Real time speaker separation

This internship targets speaker separation, which is formalized in the scientific community as the task of separately retrieving a given number of speech/speaker signals from a monaural mixture signal. Most scientific challenges [8] compare offline (not real-time) approaches. The objective of the internship is to address the following targets (more or less in order):

- Based on our current speech denoising training sets, the candidate will create a training set for the speaker separation task that meets the same in-house requirements. Indeed, most of the datasets available in the scientific community fall short in quantity, audio quality of the ground truths, sampling rate, or diversity of speakers and noise types. In addition, for the SteelSeries use cases, the overlap in time between the different speech sources might be lower than in the scenarios used by the scientific community, and its statistical distribution will need to be clearly identified and defined.
- Once our offline and online baseline algorithms have been trained on such a training set, the candidate could benchmark them on different scenarios (number of speakers, signal ratio between speakers, effect of additional noise, various and mixed languages) in order to identify and potentially remedy the weaknesses of the training set.
- The first subjective listening sessions could lead the candidate to design complementary metrics, for instance one representing false positives in speaker attribution, or one representing the statistics of the time a real-time DNN needs to attribute a signal to the correct speaker after some silence.
- While all of the above could be done using state-of-the-art loss functions, the candidate could also adapt our internal loss to be permutation invariant [9].
- The scientific community is very active in proposing new DNN architectures (offline [10, 8] and online [11, 12]). The candidate could also re-implement one of them or propose her/his own architecture. In particular, a multi-task approach where the DNN also outputs the number of active speakers would be of great interest.

Skills

Who are we looking for?

You are preparing an engineering degree or a master’s degree and preferably have knowledge of the development and implementation of advanced machine learning algorithms. Digital audio signal processing skills are a plus.

Although not mandatory, knowledge of the following additional fields would be appreciated:

- Audio effects in general: compression, equalization, etc.
- Statistics, probabilistic approaches, optimization.
- Programming languages: Python, PyTorch, Keras, TensorFlow, Matlab.
- Voice recognition, voice command.
- Computer programming and development: Max/MSP, C/C++/C#.
- Audio editing software: Audacity, Adobe Audition, etc.
- Scientific publications and patent applications.
- Fluency in English and French.
- Demonstrated intellectual curiosity.

References

[1] I. Alaoui Abdellaoui and N. Souviraà-Labastie. "Blending the attention mechanism in TasNet". Working paper or preprint. Nov. 2020.

[2] E. Pierson Lancaster and N. Souviraà-Labastie. "A frugal approach to music source separation". Working paper or preprint. Nov. 2020.

[3] F.-R. Stöter, A. Liutkus and N. Ito. "The 2018 signal separation evaluation campaign". In: International Conference on Latent Variable Analysis and Signal Separation. Springer. 2018, pp. 293-305.

[4] N. Takahashi and Y. Mitsufuji. "Multi-scale Multi-band DenseNets for Audio Source Separation". In: 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). June 29, 2017. arXiv: 1706.09588.

[5] ClearCast AI Noise Canceling - Promotion video. https://www.youtube.com/watch?v=RD4eXKEw4Lg.

[6] M. Vial and N. Souviraà-Labastie. Learning rate scheduling and gradient clipping for audio source separation. Tech. report. SteelSeries France, Dec. 2022.

[7] The torch_custom_lr_schedulers GitHub repository. https://github.com/SteelSeries/torch_custom_lr_schedulers.

[8] Speech separation task referenced on the paperswithcode website. https://paperswithcode.com/task/speech-separation.

[9] X. Liu and J. Pons. "On permutation invariant training for speech source separation". In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2021, pp. 6-10.

[10] Music separation task referenced on the paperswithcode website. https://paperswithcode.com/sota/music-source-separation-on-musdb18.

[11] DNS challenge on the paperswithcode website. https://paperswithcode.com/sota/speechenhancement-on-deep-noise-suppression.

[12] H. Dubey et al. "ICASSP 2022 deep noise suppression challenge". In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2022, pp. 9271-9275.
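
Making a loss permutation invariant, as mentioned in the internship subject above [9], can be illustrated with a minimal utterance-level sketch. The mean squared error used here is only a placeholder for whatever pairwise loss is actually chosen (the internal loss is not described in this posting), and the function names are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(
    estimates: List[Sequence[float]],
    references: List[Sequence[float]],
) -> Tuple[float, Tuple[int, ...]]:
    """Utterance-level permutation invariant loss.

    Tries every assignment of estimated signals to reference speakers and
    returns the lowest average pairwise loss, together with the winning
    permutation (perm[i] is the reference matched to estimate i).
    """
    n = len(references)
    best_loss = float("inf")
    best_perm: Tuple[int, ...] = ()
    for perm in permutations(range(n)):
        loss = sum(mse(estimates[i], references[p])
                   for i, p in enumerate(perm)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

The exhaustive search costs n! loss evaluations, which is fine for two or three speakers. The returned permutation can also be logged over time to study speaker attribution errors of the kind mentioned in the targets above.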


6-46

(2022-11-05) Audio detection for gaming Master internship, Lille (France), 2022 @SteelSeries France R&D team (former Nahimic R&D team), France

Audio detection for gaming

Master internship, Lille (France), 2022

Advisors

- Damien Granger, R&D Engineer, damien.granger@steelseries.com
- Nathan Souviraà-Labastie, R&D Engineer, PhD, nathan.souviraa-labastie@steelseries.com

Company description

About GN Group

GN was founded 150 years ago with a truly innovative and global mindset. Today, we honour that legacy with world-leading expertise in the human ear, sound and video processing, wireless technology, miniaturization and collaborations with leading technology partners. GN’s solutions are marketed by the brands ReSound, Beltone, Interton, Jabra, BlueParrott, SteelSeries and FalCom in 100 countries. The GN Group employs 6,500 people and is listed on Nasdaq Copenhagen (GN.CO).

About SteelSeries

SteelSeries is the worldwide leader in gaming and esports peripherals focused on premium quality, innovation, and functionality. SteelSeries’ family of professional and gaming enthusiasts are the driving force behind the company and help influence, design, and craft every single accessory and the brand’s software ecosystem, SteelSeries GG. In 2020, SteelSeries acquired Nahimic, the leader in 3D sound solutions for gaming.

We are currently looking for a machine learning / audio signal processing intern to join the R&D team of SteelSeries’ Software & Services Business Unit in our French office (former Nahimic R&D team).

Internship subject

This internship targets the detection of a known signal in an audio scene (containing multiple signals). For instance, some cues and feedback sounds in video games always use the same audio signal, while the rest of the audio scene changes. The current internal implementation is based on a legacy state-of-the-art music identification system [1, 2].

The objective of the internship is to address one or several of the following targets:

- Being agnostic to the source type (speech, music, audio gaming event, ...), since the current approach is designed for music
- Handling shorter target signals
- Robustness to various types of overlapping audio noise from the audio scene
- Robustness to level variation over time (in the case of moving audio sources)
- Exploring the effect of multi-channel input signals: summing the channels might help to identify moving sounds, but potentially at the cost of a lower signal-to-noise ratio
- Improving the above-mentioned aspects by adapting the approach to a smaller dictionary of target signals (not millions, as in the case of music)

Machine learning is the expected track, in particular pre-trained and potentially overfitted audio representations (embeddings). Here is a short list of examples:

- Attention-Based Audio Embeddings [3]
- Autoencoder [4]
- Contrastive learning [5, 6]

Skills

Who are we looking for?

You are preparing an engineering degree or a master’s degree and preferably have knowledge of the development and implementation of advanced machine learning algorithms. Digital audio signal processing skills are a plus.

Although not mandatory, knowledge of the following additional fields would be appreciated:

- Audio effects in general: compression, equalization, etc.
- Statistics, probabilistic approaches, optimization.
- Programming languages: Python, PyTorch, Keras, TensorFlow, Matlab.
- Voice recognition, voice command.
- Computer programming and development: Max/MSP, C/C++/C#.
- Audio editing software: Audacity, Adobe Audition, etc.
- Scientific publications and patent applications.
- Fluency in English and French.
- Demonstrated intellectual curiosity.

References

[1] A. Wang. "The Shazam music recognition service". In: Communications of the ACM 49.8 (2006), pp. 44-48.

[2] A. Wang et al. "An industrial strength audio search algorithm". In: ISMIR. Vol. 2003. Washington, DC. 2003, pp. 7-13.

[3] A. Singh, K. Demuynck and V. Arora. "Attention-Based Audio Embeddings for Query-by-Example". In: arXiv preprint arXiv:2210.08624 (2022).

[4] A. Báez-Suárez et al. "SAMAF: Sequence-to-sequence Autoencoder Model for Audio Fingerprinting". In: ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16.2 (2020), pp. 1-23.

[5] X. Wu and H. Wang. "Asymmetric Contrastive Learning for Audio Fingerprinting". In: IEEE Signal Processing Letters 29 (2022), pp. 1873-1877.

[6] Z. Yu et al. "Contrastive unsupervised learning for audio fingerprinting". In: arXiv preprint arXiv:2010.13540 (2020).
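
To make the detection task described in the internship subject above concrete, here is a deliberately naive time-domain baseline: a sliding cosine similarity between a known target signal and the audio scene. It is neither the fingerprinting system of [1, 2] nor one of the embedding approaches of [3-6]; the function name and the threshold value are assumptions of this sketch. Because the score is normalized, it is unaffected by a positive gain applied to the target, which relates to the level-variation target listed above.

```python
import math
from typing import List, Sequence


def detect_template(
    scene: Sequence[float],
    template: Sequence[float],
    threshold: float = 0.9,
) -> List[int]:
    """Return the start indices at which the template is found in the scene.

    Each scene window of the template's length is compared with the
    template using cosine similarity; windows scoring at or above the
    threshold are reported. Silent windows are skipped.
    """
    m = len(template)
    t_norm = math.sqrt(sum(x * x for x in template))
    hits: List[int] = []
    if t_norm == 0.0:
        return hits
    for start in range(len(scene) - m + 1):
        window = scene[start:start + m]
        w_norm = math.sqrt(sum(x * x for x in window))
        if w_norm == 0.0:
            continue
        score = sum(a * b for a, b in zip(window, template)) / (t_norm * w_norm)
        if score >= threshold:
            hits.append(start)
    return hits
```

With learned embeddings, the same comparison would be carried out between an embedding of the target and embeddings of successive scene windows instead of raw samples, which is where robustness to overlapping sounds would have to come from.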


6-47

(2022-12-12) Master internship @ LISN Lab, Orsay, Gif sur Yvette, France

Creation of a speech synthesis model from spontaneous speech

Keywords:

Speech synthesis, spontaneous speech, low-resource languages, Nigerian Pidgin

Context

Nigerian Pidgin is a large but under-resourced language that increasingly serves as the primary vernacular language of Africa’s most populous country. Once stigmatized as a “broken” variety of English spoken only by the uneducated, Nigerian Pidgin is now a source of pride for many speakers who view it as a home-grown vehicle for communication. It transcends class and ethnicity, lacking the tribal associations of indigenous languages and the colonial baggage associated with English. The language can now be seen and heard on college campuses, in houses of worship, in advertisements, in Nigerian expat communities, and even on a local branch of the British Broadcasting Corporation.

Objectives

Despite Nigerian Pidgin’s growing prestige and a pool of speakers rivaling those of major languages like Turkish or Korean, the grammatical and intonational properties of the language are comparatively understudied. This internship extends an ongoing research project aimed at better understanding its linguistic properties through the development and adaptation of NLP technologies. The principal aim of this research is to produce a natural-sounding text-to-speech (TTS) model that will allow researchers to conduct perception tests to determine how intonation influences the interpretation of meaning. Thanks partly to the recent explosion of neural network-based speech technologies, researchers can now produce high-quality synthesis from relatively simple datasets using models like Tacotron 2, complementing classical approaches such as those based on Hidden Markov Models. Specifically, the intern will assist in developing a text-to-speech platform trained on an existing database of Nigerian Pidgin recordings. In addition to producing natural-sounding speech, a central goal of this project will be to build a TTS model that allows the direct modification of intonational patterns via explicit parameters provided by researchers. The intern’s work will contribute to the exploration of the language’s melodic and tonal properties by allowing researchers to produce variations of novel utterances differing only by their intonational patterns.
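To give a rough idea of what "explicit parameters" for intonation could look like, here is a minimal sketch that shifts and rescales a fundamental frequency (F0) contour. It is only an illustration: the function name `modify_f0`, its parameters, and the convention of marking unvoiced frames with 0.0 are assumptions made for this example and are not part of the project description.

```python
import math


def modify_f0(f0_hz, shift_semitones=0.0, range_scale=1.0):
    """Shift and expand/compress an F0 contour on a logarithmic scale.

    f0_hz: list of F0 values in Hz, one per frame (hypothetical input format).
    Unvoiced frames are assumed to be marked with 0.0 and are left untouched.
    """
    voiced = [f for f in f0_hz if f > 0.0]
    if not voiced:
        return list(f0_hz)
    # Work in log2 so that equal steps correspond to equal musical intervals.
    mean_log = sum(math.log2(f) for f in voiced) / len(voiced)
    out = []
    for f in f0_hz:
        if f <= 0.0:
            out.append(0.0)
            continue
        # Scale the excursion around the mean, then shift the whole contour.
        log_f = mean_log + range_scale * (math.log2(f) - mean_log)
        out.append(2.0 ** (log_f + shift_semitones / 12.0))
    return out


contour = [0.0, 100.0, 200.0, 0.0]
raised = modify_f0(contour, shift_semitones=12.0)   # one octave higher
flattened = modify_f0(contour, range_scale=0.0)     # monotone at the mean pitch
```

With such a transformation, two versions of the same utterance can differ only in pitch level or pitch range, which is the kind of controlled contrast a perception test needs.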

Primary tasks

• Surveying existing TTS models and selecting the most suitable approach

• Training a model on a corpus of Nigerian Pidgin

• Optimizing and evaluating the model

Profile

A second-year master’s student with:

• A solid background in machine learning (speech synthesis is a plus)

• Good academic writing skills in English

• A strong interest in language and linguistics

Modalities

The internship will start in March 2023 and last 5 to 6 months. It will take place at the LISN lab, in the Language Sciences and Technologies department, and at the MoDyCo lab at Paris Nanterre University (primarily at whichever location offers the shorter commute).

• The LISN’s Belvédère site is located on the plateau de Saclay: University campus building 507, rue du Belvédère, 91400 Orsay.

• The MoDyCo lab is located at the Université Paris Ouest Nanterre La Défense: Bâtiment A, 200, avenue de la République – 92001 Nanterre.

The candidate will be supervised by Emmett Strickland (MoDyCo) and Marc Evrard (LISN). The allowance follows official standards (service-public.fr).

To apply

Please send a CV and brief cover letter highlighting your interest in the project to the following:

• Emmett Strickland (emmett.strickland@parisnanterre.fr)

• Marc Evrard (marc.evrard@lisn.upsaclay.fr)

Further reading

1. Tan, X., Qin, T., Soong, F., & Liu, T. Y. (2021). A survey on neural speech synthesis. arXiv preprint arXiv:2106.15561. https://arxiv.org/abs/2106.15561

2. Ning, Y., He, S., Wu, Z., Xing, C., & Zhang, L. J. (2019). A Review of Deep Learning Based Speech Synthesis. Applied Sciences (2076-3417), 9(19). https://www.mdpi.com/2076-3417/9/19/4050

3. Bigi, B., Caron, B., & Abiola, O. S. (2017). Developing resources for automated speech processing of the African language Naija (Nigerian Pidgin). In 8th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (pp. 441-445). https://hal.archives-ouvertes.fr/hal-01705707/document

6-48

(2022-12-18) Master internship @ LISN Lab, Orsay, France

Study and development of a vocal force model

Keywords:

Machine learning, voice strength, speech processing, expressive speech

Context

The project aims to build a model that estimates vocal force (VF) from speech recordings. VF is defined as the sound pressure level (SPL, in C-weighted decibels) measured in free field, one meter in front of the speaker’s mouth (Liénard, 2019). This SPL is unfortunately lost in the vast majority of recordings, though the human ear is able to estimate it thanks to the spectral differences produced by the variations in vocal effort associated with these VF values. A corpus of paired calibrated/uncalibrated signals will be used to build a model capable of estimating the original value of VF (in dBC). Collaborations under development will benefit from and extend this effort by expanding the collected corpus and applying the resulting model to other tasks (e.g., expressive synthesis, Evrard et al., 2015).
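As a reminder of why the level is lost in ordinary recordings, the sketch below computes an SPL from a calibrated pressure signal and derives the calibration factor that an uncalibrated recording lacks. It is a simplified illustration under stated assumptions: C-weighting is omitted, so the result is an unweighted dB SPL rather than dBC, and the helper names (`rms`, `spl_db`, `pascals_per_unit`) are hypothetical.

```python
import math

P_REF = 20e-6  # standard reference pressure for SPL in air: 20 micropascals


def rms(samples):
    """Root-mean-square value of a non-empty list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def spl_db(pressure_pa):
    """Unweighted sound pressure level (dB SPL) of a signal given in pascals."""
    return 20.0 * math.log10(rms(pressure_pa) / P_REF)


def pascals_per_unit(digital_samples, known_spl_db):
    """Calibration factor mapping digital sample values to pascals.

    Requires a recording whose true SPL is known, which is exactly the
    information missing from an uncalibrated recording.
    """
    target_rms_pa = P_REF * 10.0 ** (known_spl_db / 20.0)
    return target_rms_pa / rms(digital_samples)


level = spl_db([1.0] * 100)  # a constant 1 Pa signal is about 94 dB SPL
```

Without the calibration factor, the same digital waveform is compatible with any SPL, which is why the model has to rely on spectral cues of vocal effort instead of signal amplitude.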

Objectives

The initial aim will be to increase the variability of the uncalibrated signal in each pair provided in this corpus. In practice, it will be necessary to apply a series of degradations corresponding to variations in the distance and position of the speaker with respect to the microphone. Other processing will also be applied, such as that typically used in post-production (compression, gate, etc.). A model will then be trained on these calibrated/uncalibrated pairs to produce a reliable estimate of the original VF from any recording. Different neural architectures will be evaluated, from simple feedforward neural networks to more complex architectures (e.g., CNN, LSTM). Different feature extraction methods will also be considered: raw spectra, perceptually filtered spectra (e.g., Mel), and representations from self-supervised models (e.g., Baevski et al., 2020).
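The kinds of degradation mentioned above can be sketched in a few lines. The following is a deliberately simplified illustration, not the project's pipeline: the distance effect is reduced to a free-field 1/r gain (no room acoustics or directivity), the compressor and gate operate sample by sample without attack or release times, and all function names and default parameters are assumptions made for this example.

```python
import math


def simulate_distance(samples, distance_m, ref_distance_m=1.0):
    """Scale a signal as if the source moved from ref_distance_m to distance_m.

    Uses the free-field 1/r law: doubling the distance halves the amplitude.
    """
    if distance_m <= 0.0:
        raise ValueError("distance_m must be positive")
    gain = ref_distance_m / distance_m
    return [s * gain for s in samples]


def compress(samples, threshold=0.5, ratio=4.0):
    """Static compressor: the part of each magnitude above threshold is divided by ratio."""
    out = []
    for s in samples:
        mag = abs(s)
        if mag > threshold:
            mag = threshold + (mag - threshold) / ratio
        out.append(math.copysign(mag, s))
    return out


def noise_gate(samples, threshold=0.01):
    """Static gate: samples whose magnitude is below threshold are silenced."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


signal = [0.9, -0.9, 0.2, 0.005]
degraded = noise_gate(compress(simulate_distance(signal, distance_m=2.0)))
```

Each of these operations changes the recorded level without changing the vocal effort that produced the speech, which is what makes them useful for building uncalibrated variants of a calibrated signal.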

Tasks

• Reviewing speech corpus augmentation techniques

• Surveying neural and self-supervised learning architectures for processing audio pairs

• Augmenting the corpus through the application of acoustic degradations

• Building a model of voice strength restoration from the signal pairs

• Presenting an objective evaluation of the model’s performance, as well as a subjective evaluation via perceptual experiments

Profile

A second-year master’s student with:

• A solid background in machine learning

• Good academic writing skills in English

• A strong interest in expressive speech

Modalities

The internship will start in March 2023 and last 5 to 6 months, in the Department of Language Sciences and Technologies at the LISN laboratory. The LISN’s Belvédère site is located on the plateau de Saclay: University campus, building 507, rue du Belvédère, 91400 Orsay. The candidate will be supervised by Marc Evrard and Albert Rilliard. The allowance follows official standards (service-public.fr).

How to apply

Please send a CV and brief cover letter highlighting your interest in the project to Marc Evrard (marc.evrard@lisn.upsaclay.fr).

References

1. Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). Wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449-12460.

2. Evrard, M., Delalez, S., d’Alessandro, C., & Rilliard, A. (2015). Comparison of chironomic stylization versus statistical modeling of prosody for expressive speech synthesis. In Sixteenth Annual Conference of the International Speech Communication Association.

3. Liénard, J. S. (2019). Quantifying vocal effort from the shape of the one-third octave long-term-average spectrum of speech. The Journal of the Acoustical Society of America, 146(4), EL369-EL375.

6-49

(2022-12-23) Postdoctoral Research Fellows, National University of Singapore

Two full-time Postdoctoral Research Fellows in automatic lyrics generation and automatic singing voice/speech evaluation.

You can find the detailed job descriptions here:

https://smcnus.comp.nus.edu.sg/postdoct_job_description_2022

6-50

(2023-01-02) POSTDOC 21 MONTHS at GIPSA-Lab, Grenoble-France

A 21-month postdoc position at GIPSA-Lab, Grenoble, France, on the automatic evaluation of computer-assisted reading fluency of young French readers.

For more information and to apply: https://emploi.univ-grenoble-alpes.fr/offres/chercheur-chercheuse-post-doctoral-evaluation-automatique-de-la-fluence-pour-l-apprentissage-de-la-lecture-1164624.kjsp?RH=1135797159702996

Contact: Gerard Bailly at gerard.bailly@gipsa-lab.fr

6-51

(2023-01-03) Research internship, CAIBots project: Conversational AI with teams of robots, LIA, Avignon, France

Research internship form

Project title: CAIBots: Conversational AI with teams of robots

Supervisor: Prof. Fabrice Lefèvre

Internship description

The aim of the internship is to study the setup of a robotic system for simulating voice-based conversational AI (CAI) in "real-world conditions". Training and then testing conversational AI (chatbots, dialogue systems) is costly and complex. We aim to greatly reduce this difficulty by providing an autonomous physical robotic solution for training and evaluating new CAI modules before using them with real human users.

Initially, the work will mainly consist of testing existing turnkey solutions for the components of the spoken language processing pipeline and checking their performance in a robot-robot configuration. Research into embedded solutions will then be carried out. It should improve the latency of the system and also ensure better protection of personal data (by removing the need to go through proprietary clouds).

Overall, the voice interaction system to be set up should allow an open discussion between a human and a machine on general topics. The intended use case is therefore in line with the Amazon Alexa challenge (https://developer.amazon.com/alexaprize): developing a bot that can sustain a conversation for a few minutes. It will therefore also be necessary to provide a simulated user to enable autonomous robot-robot interaction (multi-party human-robot conversations may also be tested, though this is not a priority objective of the internship). The goal is to bootstrap the system, that is, to set up the components in a basic configuration that nonetheless illustrates the capabilities that could be reached with more development time.

The robotic and software solutions considered for this work include, for example: the Pepper robot, Google Cloud ASR, SpeechBrain, RASA and/or pre-trained models (BERT, GPT, BlenderBot…). These are mainly open-source, fairly complete platforms. The work will consist of quickly implementing a real system so that it can be improved in a robot-robot configuration and then tested with a representative panel of potential users. While an interest in machine learning and natural language processing is essential, the intern is also expected to have good software development skills. The internship will be an opportunity to acquire skills in natural language processing in the context of experimentation in embedded robotics. Several avenues for continuing into a PhD are open.

Internship duration: 6 months

Remuneration: approximately €540 / month

Associated research themes: human-machine dialogue systems, speech recognition and understanding, cognitive interfaces, robotics

6-52

(2023-01-05) Post-doctoral and engineer positions @ LORIA-INRIA, Nancy, France

Automatic speech recognition for non-native speakers in a noisy environment

Post-doctoral and engineer positions

Starting date: beginning of 2023

Duration: 24 months for the post-doc position and 12 months for the engineer position

Supervisor: Irina Illina, Associate Professor, HDR, Lorraine University, LORIA-INRIA, Multispeech Team, illina@loria.fr

Context

Objectives

How to apply: Interested candidates are encouraged to contact Irina Illina (illina@loria.fr) with the required documents (CV, transcripts, motivation letter, and recommendation letters).

Requirements & skills:

- M.Sc. or Ph.D. degree in speech/audio processing, computer vision, machine learning, or in a related field,

- ability to work independently as well as in a team,

- solid programming skills (Python, PyTorch), and deep learning knowledge,

- good level of written and spoken English.
