
ISCApad #258

Tuesday, December 10, 2019 by Chris Wellekens

6 Jobs
6-1(2019-06-01) PhD position in NLP at LORIA, Nancy, France
Automatic classification using deep learning of hate speech posted on the Internet
 
Supervisors: Irina Illina, MdC, HDR, Dominique Fohr, CR CNRS
Team: Multispeech, LORIA-INRIA, France
Contact: illina@loria.fr, dominique.fohr@loria.fr
Duration of PhD thesis: 3 years
Deadline to apply: June 30th, 2019
Required skills: background in statistics and natural language processing, and programming skills (Perl, Python). Candidates should email a detailed CV with diplomas.
Keywords: hate speech, social media, natural language processing.
 
The rapid development of the Internet and social networks has brought great benefits to people in their daily lives. Unfortunately, it has also led to an increase in hate speech and terrorism, which are among the most common and powerful threats on a global scale. Hate speech is a type of offensive communication that expresses an ideology of hate, often using stereotypes. It can target different societal characteristics such as gender, religion, race or disability, and it is the subject of various national and international legal frameworks. Hate speech is a form of terrorism and often follows a terrorist incident or event.
 
Social networks are incredibly popular today. Twitter, LinkedIn, Facebook and YouTube are used as standard tools for communicating ideas, beliefs and feelings. Only a small percentage of people use these networks for harmful activities such as hate speech and terrorism, but the impact of this small group of users is extremely damaging. For years, social media companies such as Twitter, Facebook and YouTube have invested hundreds of millions of dollars each year in detecting, classifying and moderating hate. But these efforts mainly rely on manually reviewing content to identify and remove offensive material, which is extremely expensive.
 
This thesis aims at designing automatic and evolving methods for the classification of hate speech in social media. Despite the studies already published on this subject, the results show that the task remains very difficult. We will use semantic content analysis methodologies from natural language processing (NLP) and methodologies based on deep neural networks (DNN), which have revolutionized the field of artificial intelligence. During this thesis, we will develop a research protocol to classify hate speech in text as hateful, aggressive, insulting, ironic, neutral, etc. This problem is set in the context of multi-label classification.
 
In addition, the problem of word obfuscation in hate messages will need to be addressed. People who want to post hate speech on the Internet know that they risk being censored by rudimentary automatic moderation systems, so they try to disguise their words by altering their spelling or written form.
 
Among the crucial points of this thesis are the choice of the DNN architecture and the relevant representation of the data, i.e., the text of the Internet message. The system designed will be validated on real social network streams.
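To make the multi-label setting concrete, here is a minimal Python/PyTorch sketch (an illustrative baseline only, not the architecture the thesis will develop; the label set, vocabulary size and dimensions are assumptions). Unlike multi-class classification, each message gets an independent probability per label, so a message can for instance be both hateful and ironic:

import torch
import torch.nn as nn

LABELS = ['hateful', 'aggressive', 'insulting', 'ironic', 'neutral']

class MultiLabelHateClassifier(nn.Module):
    # Toy multi-label classifier: embeddings -> BiGRU -> one logit per label.
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, len(LABELS))

    def forward(self, token_ids):
        h, _ = self.rnn(self.emb(token_ids))
        return self.out(h.mean(dim=1))  # mean-pool over time

model = MultiLabelHateClassifier()
# BCEWithLogitsLoss treats each label independently, which is exactly
# what distinguishes multi-label from single-label classification.
criterion = nn.BCEWithLogitsLoss()
tokens = torch.randint(1, 10000, (8, 40))              # 8 messages, 40 tokens each
targets = torch.randint(0, 2, (8, len(LABELS))).float()
loss = criterion(model(tokens), targets)
loss.backward()

In practice, the obfuscation issue mentioned above would also argue for subword- or character-level inputs rather than a fixed word vocabulary.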
 
Skills
Strong background in mathematics, machine learning (DNN), and statistics.
Profiles with strong experience in natural language processing are welcome.
 
Excellent English writing and speaking skills are required in any case.
 
References:
Gröndahl, T., Pajola, L., Juuti, M., Conti, M., Asokan, N. (2018). All You Need is 'Love': Evading Hate-speech Detection. arXiv preprint arXiv:1808.09115.
Wiegand, M., Klakow, D. (2008). Optimizing Language Models for Polarity Classification. In Proceedings of ECIR, pp. 612-616.
Wiegand, M., Ruppenhofer, J. (2015). Opinion Holder and Target Extraction based on the Induction of Verbal Categories. In Proceedings of CoNLL, pp. 215-225.
Wiegand, M., Ruppenhofer, J., Schmidt, A., Greenberg, C. (2018). Inducing a Lexicon of Abusive Words - A Feature-Based Approach. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Wiegand, M., Wolf, M., Ruppenhofer, J. (2017) Negation Modeling for German Polarity Classification. In Proceedings of GSCL.
Zhang, Z., Luo, L. (2018). Hate speech detection: a solved problem? The Challenging Case of Long Tail on Twitter. arxiv.org/pdf/1803.03662

6-2(2019-06-07) PhD grant at ISIR and STMS, Paris, France

 

Multimodal modeling of expressivity and alignment for human-machine interaction

Thesis director: Catherine Pelachaud (ISIR)

Co-supervisor: Nicolas Obin (STMS)

Context

This thesis takes place in a context of particularly rich development of communication interfaces between humans and machines. For example, the emergence and democratization of personal assistants (smartphones, home assistants, chatbots) are making interaction with machines a daily reality for more and more people. This practice tends to grow and to spread to a large number of human uses and practices: from reception agents (today, a few Pepper robots, more for demonstration purposes than for real use), to remote consultation, or agents embedded in autonomous vehicles. The expansion of these uses calls for extending the modalities of interaction and improving the quality of the interaction with the machine: today, voice is the preferred modality of interaction, and interaction scenarios remain very limited (information requests, question answering, no real sustained interaction). The main limitations are, on the one hand, low expressivity: the agent's behavior is still often monomodal (voice only, as with the Alexa or Google Home assistants) and remains very monotonous, which greatly limits the acceptability, duration and quality of the interaction; and, on the other hand, the agent's behavior is little or not at all adapted to the interlocutor, which reduces the human's engagement in the interaction. In human-human interaction, alignment phenomena (e.g., tone of voice, speed of body movement) are cues of mutual understanding and engagement in the interaction (Pickering and Garrod, 1999; Castellano et al., 2012). Engagement is marked by nonverbal social behaviors at specific moments of the interaction: these can be feedback signals (indicating that one is in phase with the interactant), a form of imitation (for example, a smile calls for another smile, the tone of voice takes up elements of the interactant's), or signals synchronized with those of the interactant (turn-taking management). This thesis aims at modeling the agent's behavior as a function of the user's, so that the agent can display its attentional engagement in order to maintain the interaction and make its messages more comprehensible. The adaptation of the agent's behavior will take place at different behavioral levels (prosodic, lexical, behavioral, imitation, turn-taking...). Human-machine interaction, with strong application potential in many domains, is an example of the necessary interdisciplinarity between digital humanities, robotics, and artificial intelligence.

Objective

The objective of the thesis is to better understand and model the mechanisms that govern multimodal (voice and gesture) interaction between a human and a machine, in order to overcome technological barriers and design a conversational agent able to adapt naturally and coherently to a human interactant. The agent should be:

1) Expressive (Léon, 1993): able to display varied and coherent expression in order to maintain the interlocutor's attention, emphasize important points, improve the quality of the interaction and extend its duration (beyond one or two speaking turns);

2) Aligned with the interlocutor's multimodal behavior (Pickering and Garrod, 1999; Castellano et al., 2012; Clavel et al., 2016): that is, able to adapt its behavior according to the interlocutor's, in order to reinforce the latter's engagement in the interaction.

As a first step, the thesis will design a unified neural architecture for the generative modeling of the agent's multimodal behavior. The agent's expressivity, both prosodic (Obin, 2011; Obin, 2015) and gestural (Pelachaud, 2009), will be modeled with recurrent neural architectures that are now commonly used for voice and gesture (Bahdanau et al., 2014; Wang, 2017; Robinson & Obin, 2019). The thesis will focus on two essential aspects of modeling the agent's behavior: the development of architectures structured over several temporal scales, to improve the modeling of prosodic and gestural variability at the sentence level and at the discourse level (Le Moine & Obin, 2019), and the learning of coherent multimodal behavior by developing shared multimodal attention mechanisms applied to the synchrony of the generated prosodic and gestural profiles (He, 2018).
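To make the idea of a shared multimodal attention mechanism more concrete, here is a minimal, hypothetical PyTorch sketch; it is not the architecture the thesis will develop, and all feature dimensions and sequence lengths are illustrative assumptions. Two recurrent encoders process prosodic and gestural feature streams, and a single attention module with shared weights aligns both streams against the same queries, one simple way to encourage synchronized prosodic and gestural contexts:

import torch
import torch.nn as nn

class SharedMultimodalAttention(nn.Module):
    # Two GRU encoders (prosody, gesture) + one attention module whose
    # weights are shared across modalities.
    def __init__(self, prosody_dim=16, gesture_dim=32, hidden=64):
        super().__init__()
        self.prosody_enc = nn.GRU(prosody_dim, hidden, batch_first=True)
        self.gesture_enc = nn.GRU(gesture_dim, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

    def forward(self, prosody, gesture, queries):
        # prosody: (B, Tp, prosody_dim); gesture: (B, Tg, gesture_dim)
        # queries: (B, Tq, hidden), e.g. decoder states driving generation
        hp, _ = self.prosody_enc(prosody)
        hg, _ = self.gesture_enc(gesture)
        ctx_p, _ = self.attn(queries, hp, hp)  # same attention weights ...
        ctx_g, _ = self.attn(queries, hg, hg)  # ... applied to each modality
        return torch.cat([ctx_p, ctx_g], dim=-1)

model = SharedMultimodalAttention()
ctx = model(torch.randn(2, 50, 16), torch.randn(2, 80, 32), torch.randn(2, 10, 64))
print(ctx.shape)  # torch.Size([2, 10, 128])

Sharing the attention weights is only one possible coupling; modality-specific attention with an explicit synchrony loss would be another.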

As a second step, the thesis will tackle the alignment of the agent's behavior with the human's. The thesis will focus in particular on interactive learning and learning by imitation to coherently adapt the agent's multimodal behavior to the human (Weber, 2018; Mancini, 2019), using available dialogue databases (such as NoXi (collected at ISIR and annotated in terms of engagement), IEMOCAP (USC, Carlos Busso), and Gest-IS (Edinburgh University, Katya Saint-Amard)) to learn the relation between the interlocutors' prosodic and behavioral profiles, as well as their adaptation over the course of the interaction.

The thesis will be co-supervised by Catherine Pelachaud, from the PIRoS team at ISIR, which specializes in human-machine interaction and conversational agents, and by Nicolas Obin, from the Sound Analysis and Synthesis (AS) team at STMS, which specializes in generative modeling of speech signals. The PhD student will also benefit from the knowledge, know-how and tools available at STMS and ISIR (for example, the ircamTTS speech synthesizer developed at STMS and the GRETA platform developed at ISIR) and from the computing facilities of STMS (computing server, GPUs).

Bibliography

(Bevacqua et al., 2012) Elisabetta Bevacqua, Etienne de Sevin, Sylwia Julia Hyniewska, Catherine Pelachaud, A listener model: Introducing personality traits, Journal on Multimodal User Interfaces, special issue Interacting ECAs, Elisabeth André, Marc Cavazza and Catherine Pelachaud (Guest Editors), July 2012, 6(1-2), pp 27-38.

(Castellano et al., 2012) G. Castellano, M. Mancini, C. Peters, P. W. McOwan. Expressive copying behavior for social agents: a perceptual analysis. IEEE Trans Syst, Man Cybern, Part A: Syst Hum 42(3), 2012.

(Clavel et al., 2016) Chloé Clavel, Angelo Cafaro, Sabrina Campano, and Catherine Pelachaud, Fostering user engagement in face-to-face human-agent interactions, in A. Esposito and L. Jain (Eds), Toward Robotic Socially Believable Behaving Systems - Volume I: Modeling Social Signals, Springer Series on Intelligent Systems Reference Library (ISRL), 2016

(Glas and Pelachaud, 2015) N. Glas, C. Pelachaud, Definitions of Engagement in Human-Agent Interaction, workshop ENHANCE, in International Conference on Affective Computing and Intelligent Interaction (ACII), 2015.

(Hall et al., 2005) L. Hall, S. Woods, R. Aylett, L. Newall, A. Paiva. Achieving empathic engagement through affective interaction with synthetic characters. Affective computing and intelligent interaction, 2005.

(He, 2018) Xiaodong He, Deep Attention Mechanism for Multimodal Intelligence: Perception, Reasoning, & Expression across Language & Vision, Microsoft Research, AI NEXTCon, 2018.

(Le Moine & Obin, 2019) Clément Le Moine, Modélisation neuronale de l'expressivité pour la transformation de la voix, Master's internship report, 2019.

(Léon, 1993) P. Léon. Précis de phonostylistique : Parole et expressivité. Paris:Nathan, 1993.

(Obin, 2011) N. Obin. MeLos: Analysis and Modelling of Speech Prosody and Speaking Style, PhD. Thesis, Ircam-Upmc, 2011.

(Obin, 2015) N. Obin, C. Veaux, P. Lanchantin. Exploiting Alternatives for Text-To-Speech Synthesis: From Machine to Human, in Speech Prosody in Speech Synthesis: Modeling and generation of prosody for high quality and flexible speech synthesis. Chapter 3: Control of Prosody in Speech Synthesis, p.189-202, Springer Verlag, February, 2015.

(Ochs et al., 2008) M. Ochs, C. Pelachaud, D. Sadek, An Empathic Virtual Dialog Agent to Improve Human-Machine Interaction, Seventh International Joint Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), Estoril Portugal, May 2008.

(Paiva et al., 2017) A. Paiva, I. Leite, H. Boukricha, I. Wachsmuth, 'Empathy in Virtual Agents and Robots: A Survey', ACM Trans. Interact. Intell. Syst. (2017), 7(3), 11:1-11:40.

(Pelachaud, 2009) C. Pelachaud, Studies on Gesture Expressivity for a Virtual Agent, Speech Communication, special issue in honor of Björn Granstrom and Rolf Carlson, 51 (2009) 630-639.

(Poggi, 2007) I. Poggi. Mind, hands, face and body: a goal and belief view of multimodal communication. Weidler, Berlin, 2007.

(Robinson & Obin, 2019) C. Robinson, N. Obin, A. Roebel. Sequence-to-sequence modelling of F0 for speech emotion conversion, in IEEE International Conference on Audio, Signal, and Speech Processing (ICASSP), 2019.

(Sadoughi et al., 2017) Najmeh Sadoughi, Yang Liu, and Carlos Busso, 'Meaningful head movements driven by emotional synthetic speech,' Speech Communication, vol. 95, pp. 87-99, December 2017.

(Sidner and Dzikovska, 2002) C. L. Sidner, M. Dzikovska. Human-robot interaction: engagement between humans and robots for hosting activities. In: IEEE int conf on multimodal interfaces, 2002.

(Wang, 2017) Xin Wang, Shinji Takaki, Junichi Yamagishi. An RNN-Based Quantized F0 Model with Multi-Tier Feedback Links for Text-to-Speech Synthesis, Interspeech, 2017

(Wang, 2018) Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous. « Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis », 2018.

(Weber, 2018) K. Weber, H. Ritschel, I. Aslan, F. Lingenfelser, E. André, How to Shape the Humor of a Robot - Social Behavior Adaptation Based on Reinforcement Learning, ACM International Conference on Multimodal Interaction, 2018.

(Mancini, 2019) M. Mancini, B. Biancardi, S. Dermouche, P. Lerner, C. Pelachaud, Managing Agent’s Impression Based on User’s Engagement Detection, Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, 2019.


6-3(2019-06-14) PhD position: Privacy preserving and personalized transformations for speech recognition, Inria Nancy and Univ. Le Mans, France

 

Thesis title

Privacy preserving and personalized transformations for speech recognition

This PhD thesis fits within the scope of a collaborative project (funded by the French National Research Agency) involving several French teams, including the MULTISPEECH team of Inria Nancy - Grand Est and the LIUM (Laboratoire d'Informatique de l'Université du Mans).

This PhD position is in collaboration between the Multispeech team of the LORIA laboratory (Nancy) and Le Mans University. The thesis will be co-supervised by Denis Jouvet (https://members.loria.fr/DJouvet/) and Anthony Larcher (https://lium.univlemans.fr/team/anthony-larcher/). The selected candidate is expected to spend time in both teams over the course of the PhD.

Scientific Context

Over the last decade, great progress has been made in automatic speech recognition [Saon et al., 2017; Xiong et al., 2017]. This is due to the maturity of machine learning techniques (e.g., advanced forms of deep learning), to the availability of very large datasets, and to the increase in computational power. Consequently, the use of speech recognition is now spreading in many applications, such as virtual assistants (for instance Apple's Siri, Google Now, Microsoft's Cortana, or Amazon's Alexa), which collect, process and store personal speech data in centralized servers, raising serious concerns regarding the privacy of their users' data. Embedded speech recognition frameworks have recently been introduced to address privacy issues during the recognition phase: in this case, a (pretrained) speech recognition model is shipped to the user's device so that the processing can be done locally without the user sharing their data. However, speech recognition technology still has limited performance in adverse conditions (e.g., noisy environments, reverberated speech, strong accents, etc.) and thus there is a need for performance improvement. This can only be achieved by using large speech corpora that are representative of the actual users and of the various usage conditions. There is therefore a strong need to share speech data for improved training that is beneficial to all users, while preserving the privacy of the users, which means at least keeping the speaker identity and voice characteristics private¹.

¹ Note that when sharing data, users may also want not to share data conveying private information at the linguistic level (e.g., phone numbers, person names, ...). Such privacy aspects also need to be taken into account, but they are out of the scope of this thesis.

Missions: (objectives, approach, etc.)

Within this context, the objective of the proposed thesis is twofold. First, it aims at finding a privacy preserving transform of the speech data; second, it will investigate the use of additional personalized transforms that can be applied on the user's terminal to increase speech recognition performance.

In the proposed approach, the device of each user will not share its raw speech data, but a privacy preserving transformation of the user's speech data. In such an approach, some private computations will be handled locally, while some cross-user computations may be carried out on a server using the transformed speech data, which protects the speaker identity and some of his/her features (gender, sentiment, emotions...). More specifically, this will rely on representation learning to separate the features of the user data that can expose private information from generic ones useful for the task of interest, i.e., here, the recognition of the linguistic content. We will build upon ideas of Generative Adversarial Networks (GANs) for proposing such a privacy preserving transform. In recent years, GANs have become more and more widely used in deep learning. They typically rely on both a generative network and a discriminative network, where the generator aims to output samples that the discriminator cannot distinguish from the true samples [Goodfellow et al., 2014; Creswell et al., 2018]. They have also been used as autoencoders [Makhzani et al., 2015], which are made of three main blocks: encoder, generator and discriminator. In our case, the discriminators shall focus on discriminating between speakers and/or between voice-related classes (defined according to gender, emotions, etc.). The training objective will be to maximize the speech recognition performance (using the privacy preserving transformed signal) while minimizing the available speaker or voice-related information measured by the discriminator.
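Purely as an illustration of this adversarial objective (all module names and dimensions are assumptions, not the project's implementation), here is a minimal PyTorch sketch: an encoder produces the transformed representation, an ASR head is trained to recover the linguistic content from it, a speaker head is trained to recover the speaker identity, and a gradient reversal layer makes the encoder minimize the ASR loss while maximizing the speaker loss:

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negated gradient in the backward pass.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

encoder = nn.GRU(40, 256, batch_first=True)   # 40-dim acoustic features (assumed)
asr_head = nn.Linear(256, 50)                 # e.g. 50 phone classes (assumed)
spk_head = nn.Linear(256, 100)                # e.g. 100 training speakers (assumed)

feats = torch.randn(4, 200, 40)               # (batch, frames, features)
phones = torch.randint(0, 50, (4, 200))
speakers = torch.randint(0, 100, (4,))

h, _ = encoder(feats)                         # candidate privacy-preserving representation
asr_loss = nn.functional.cross_entropy(asr_head(h).transpose(1, 2), phones)
spk_logits = spk_head(GradReverse.apply(h, 1.0).mean(dim=1))
spk_loss = nn.functional.cross_entropy(spk_logits, speakers)

# Through the reversal layer, one backward pass trains the speaker head to
# extract speaker information while training the encoder to remove it.
(asr_loss + spk_loss).backward()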

As devices are getting more and more personal, there are opportunities to make speech recognition more personalized. This includes two aspects: adapting the model parameters to the speaker (and to the device), and introducing personalized transforms that help hide the speaker's voice identity. Both aspects will be investigated. Voice conversion approaches provide examples of transforms aiming at modifying the voice of a speaker so that it sounds like the voice of another target speaker [e.g., Chen et al., 2014; Mohammadi & Kain, 2014]. Similar approaches can thus be applied to map speaker-specific features to those of a standard (or average) speaker, which would help conceal the speaker's identity. To take advantage of the increasingly personal usage of terminals, speaker- and environment-specific adaptation will be investigated to improve speech recognition performance. Collaborative learning mixing speech and speaker recognition has been shown to benefit both tasks [Liu et al. 2018; Garimella et al. 2015] and provides a way to combine both types of information in a single framework. This approach will be compared to the adaptation of deep neural network-based models [e.g., Abdel-Hamid & Jiang, 2013] to best handle different amounts of adaptation data.

Skills and profile:

Master in machine learning or in computer science

Background in statistics, and in deep learning

Experience with deep learning tools is a plus

Good computer skills (preferably in Python)

Experience in speech and/or speaker recognition is a plus

Bibliography:

[Abdel-Hamid & Jiang, 2013] Abdel-Hamid, O., & Jiang, H. Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code. In ICASSP-2013, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7942-7946, 2013.

[Chen et al., 2014] Chen, L. H., Ling, Z. H., Liu, L. J., & Dai, L. R. Voice conversion using deep neural networks with layer-wise generative training. TASLP-2014, IEEE/ACM Transactions on Audio, Speech and Language Processing, 22(12), pp. 1859-1872, 2014.

[Creswell et al., 2018] Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., and Bharath, A. A. Generative adversarial networks: An overview. IEEE Signal Processing Magazine 35, 1, 53-65, 2018.

[Garimella et al. 2015] Garimella, S., Mandal, A., Strom, N., Hoffmeister, B., Matsoukas, S., & Parthasarathi, S. H. K., Robust i-vector based adaptation of DNN acoustic model for speech recognition. In INTERSPEECH, 2015.

[Goodfellow et al., 2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672-2680, 2014.

[Liu et al. 2018] Y. Liu, L. He, J. Liu, and M. Johnson, Speaker Embedding Extraction with Phonetic Information, in INTERSPEECH, pp. 2247-2251, 2018.

[Makhzani et al., 2015] Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

[Mohammadi & Kain, 2014] Mohammadi, S. H., & Kain, A. Voice conversion using deep neural networks with speaker-independent pre-training. In SLT-2014, Spoken Language Technology Workshop , pp. 19-23, 2014.

[Saon et al., 2017] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, B. Roomi, and P. Hall. English conversational telephone speech recognition by humans and machines. Technical report, arXiv:1703.02136, 2017.

[Xiong et al., 2017] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. Achieving human parity in conversational speech recognition. Technical report, arXiv:1610.05256, 2017.

Additional information:

Supervision and contact:

Denis Jouvet (denis.jouvet@loria.fr; https://members.loria.fr/DJouvet/)

Anthony Larcher (anthony.larcher@univ-lemans.fr; https://lium.univlemans.fr/team/anthony-larcher/)

Additional link:

Ecole Doctorale IAEM Lorraine (http://iaem.univ-lorraine.fr/)

Duration: 3 years

Starting date: autumn 2019

The candidates are required to provide the following documents in a single PDF or ZIP file:

- CV

- A cover/motivation letter describing their interest in the topic

- Degree certificates and transcripts for Bachelor and Master (or the last 5 years)

- Master thesis (or equivalent) if it is already completed, or a description of the work in progress otherwise

- The publications (or web links) of the candidate, if any (it is not expected that they have any)

In addition, one recommendation letter from the person who supervises (or supervised) the Master thesis (or research project or internship) should be sent directly by its author to the prospective PhD advisor.


6-4(2019-06-16) PhD position: Hybrid Bayesian and deep neural modeling for weakly supervised learning of sensory-motor speech representations, University of Grenoble-Alpes, France

Open fully-funded PhD position: “Hybrid Bayesian and deep neural modeling for weakly supervised learning of sensory-motor speech representations”

The Deep-COSMO project, part of the new AI institute in Grenoble, is welcoming applications for a 3-year, fully funded PhD scholarship starting October 1st, 2019 at GIPSA-lab (Grenoble, France)

TOPIC: Representation learning, speech production and perception, Bayesian cognitive models, generative neural networks

RESEARCH FIELD: Computer Science, Cognitive Science, Machine Learning, Artificial Intelligence, Speech Processing

SUPERVISION: J. Diard (LPNC); T. Hueber, L. Girin, J.-L. Schwartz (GIPSA-Lab)

IDEX PROJECT TITLE: Multidisciplinary Institute for Artificial Intelligence – Speech chair (P. Perrier)

SCIENTIFIC DEPARTMENT (LABORATORY’S NAME): GIPSA-lab

DOCTORAL SCHOOL: MSTII (maths and computer science) or EEATS (signal processing) or EDISCE (cognitive science), depending on the candidate’s profile and career plan

TYPE of CONTRACT: 3-year doctoral contract

JOB STATUS: Full time

HOURS PER WEEK: 35

SALARY: between 1770 € and 2100 € gross per month (depending on complementary activity or not)

OFFER STARTING DATE: October 1st, 2019

SUBJECT DESCRIPTION:

General objective

How can a child learn to speak from hearing sounds, without any motor instruction provided by his/her environment? The general objective of this PhD project is to develop a computational agent able to learn speech representations from raw speech data in a weakly supervised configuration. This agent will involve an articulatory model of the human vocal tract, an articulatory-to-acoustic synthesis system, and a learning architecture combining deep learning algorithms and developmental principles inspired from cognitive sciences. This PhD will be part of the “Speech communication” chair of the Multidisciplinary Institute for Artificial Intelligence in Grenoble (MIAI).

Method

This work will capitalize on two bricks of research recently developed in Grenoble. First, a Bayesian computational model of speech communication called COSMO (Communicating about Objects using SensoriMotor Operations) (Moulin-Frier et al., 2012, 2015; Laurent et al., 2017; Barnaud et al., 2019) was jointly developed by GIPSA and LPNC. This model associates speech production and speech perception models in a single architecture. The random variables in COSMO represent the signals and the sensori-motor processes involved in the speech production/perception loop. COSMO learns their probability distributions from speech examples provided by the environment, and is then able to perceive and produce speech sounds associated to speech categories. So far, COSMO was mostly tested on synthetic data. One of the main challenges is now to confront COSMO to real-world data.

Second, we will also capitalize on a set of computational models for automatic processing and learning of sensory-motor distributions in speech developed at GIPSA. This comprises a set of transfer-learning algorithms (Hueber et al., 2015; Girin et al., 2017) aiming at adapting acoustic-articulatory knowledge from one speaker to another speaker, using a limited amount of data, possibly incomplete and noisy, together with a set of deep neural networks able to process raw articulatory data (Hueber et al., 2016; Tatulli & Hueber, 2017).

The first step will consist in designing, implementing and testing a “deep” version of COSMO, in which some of the probability distributions are implemented by generative neural models (e.g., VAE, GAN). This choice is motivated by the ability of such techniques to deal with raw, noisy and complex data, as well as their flexibility in terms of transfer learning. The second stage will consist in reformulating the speech communication agent entirely in an end-to-end neural architecture.
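As a small illustration of the kind of generative neural building block mentioned above, here is a minimal VAE sketch in PyTorch; the idea of modeling 40-dimensional acoustic frames and all dimensions are assumptions for illustration, not the actual Deep-COSMO design. A module like this could implement one of the learned probability distributions of a COSMO-like agent:

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=40, z_dim=8, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar

def elbo_loss(x, x_hat, mu, logvar):
    # Negative evidence lower bound: reconstruction + KL regularization
    recon = nn.functional.mse_loss(x_hat, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

vae = VAE()
x = torch.randn(16, 40)                  # toy batch of acoustic frames
x_hat, mu, logvar = vae(x)
elbo_loss(x, x_hat, mu, logvar).backward()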

Outputs

The system will be tested in terms of both efficiency of the learning process – hence its ability to generate realistic speech sequences after convergence – and coherence of the motor strategies discovered by the computational agent, in spite of the fact that no motor data will be provided for learning. The outputs are (1) theoretical – for better understanding the cognitive processes at hand in speech development and speech communication; (2) technical – for integrating knowledge about speech production and cognitive processes in a machine learning architecture; and (3) technological – for proposing a new generation of autonomous speech technologies.

ELIGIBILITY CRITERIA

Applicants must have:

- A Master's degree (or be about to earn one) or a university degree equivalent to a European Master's (5-year duration), in Computer Science, Cognitive Science, Signal Processing or Applied Mathematics.

- Solid skills in machine learning or probabilistic modeling + general knowledge in natural language processing and/or speech processing (an affinity for cognitive sciences and speech sciences is welcome).

- Very good programming skills (mostly in Python).

- Good oral and written communication in English.

- Ability to work autonomously and in collaboration with supervisors and other team members.

SELECTION PROCESS

Applicants will have to send their CV + an application letter in English + a copy of their last diploma to: Jean-Luc.Schwartz@gipsa-lab.fr, Thomas.Hueber@gipsa-lab.fr.

Letters of recommendation are welcome. Contacting the supervisors before preparing a complete application is welcome too.

Applications will be evaluated as they are received: the position is open until it is filled, with a deadline on July 10th, 2019.

 


6-5(2019-06-16) PhD thesis proposal Incremental sequence-to-sequence mapping for speech generation using deep neural networks, GIPSALab, Grenoble, France

PhD thesis proposal

Incremental sequence-to-sequence mapping for speech generation using deep neural networks

June 17, 2019

1 Context and objectives

In recent years, deep neural networks have been widely used to address sequence-to-sequence (S2S) learning. S2S models can solve many tasks where source and target sequences have different lengths, such as automatic speech recognition, machine translation, speech translation, text-to-speech synthesis, etc. Recurrent, convolutional and transformer architectures, coupled with attention models, have shown their ability to capture and model complex temporal dependencies between a source and a target sequence of multidimensional discrete and/or continuous data. Importantly, end-to-end training alleviates the need to previously extract handcrafted features from the data, by learning hierarchical representations directly from raw data (e.g. character strings, video, speech waveforms, etc.).

The most common models are composed of an encoder that reads the full input sequence (i.e. from its beginning to its end) before the decoder produces the corresponding output sequence. This implies a latency equal to the length of the input sequence. In particular, for a text-to-speech (TTS) system, the speech waveform is usually synthesized from a complete text utterance (e.g. a sequence of words with explicit begin/end-of-utterance markers). Such an approach cannot be used in a truly interactive scenario, in particular by a speech-handicapped person to communicate orally. Indeed, the interlocutor has to wait for the complete utterance to be typed before being able to listen to the synthetic voice, hence limiting the dynamics and naturalness of the interaction.

The goal of this project is to develop a general methodology for incremental sequence-to-sequence mapping, with application to interactive speech technologies. It will require the development of end-to-end classification and regression neural models able to deliver chunks of output data on-the-fly, from only a partial observation of input data. The goal is to learn an efficient policy that leads to an optimal trade-off between (variable) latency and accuracy of the decoding process. Possible strategies to decode the output data as soon as possible include: (i) predicting online 'the future' of the output sequence from 'the past and present' of the input sequence, with an acceptable tolerance to possible errors, or (ii) learning automatically from the data an optimal 'waiting policy' that prevents the model from outputting data when the uncertainty is too high. The developed methodology will be applied to address two speech processing problems: (i) incremental text-to-speech synthesis, in which speech is synthesized while the user is typing the text (possibly with a variable latency), and (ii) incremental speech enhancement/inpainting, in which portions of the speech signal are unintelligible because of sudden noise or speech production disorders, and must be replaced on-the-fly with reconstructed portions.
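To make the 'waiting policy' idea concrete, here is a minimal Python sketch of a fixed wait-k policy, the simplest deterministic baseline known from the simultaneous machine translation literature; it is used here purely as an illustration, since the thesis targets learned, uncertainty-dependent policies. The decoder starts emitting after k input symbols have been read and then alternates one read with one write (the toy one-output-per-input assumption keeps the example short):

def wait_k_decode(input_stream, k, decode_step):
    # Fixed wait-k policy: read k source symbols, then alternate READ/WRITE.
    # decode_step(prefix, n_out) returns output chunk number n_out given the
    # source prefix observed so far.
    prefix, outputs = [], []
    for symbol in input_stream:
        prefix.append(symbol)                                  # READ
        if len(prefix) >= k:
            outputs.append(decode_step(prefix, len(outputs)))  # WRITE
    while len(outputs) < len(prefix):                          # source done: flush
        outputs.append(decode_step(prefix, len(outputs)))
    return outputs

# Toy 'model': upper-case the n-th observed character (stands in for synthesis)
print(wait_k_decode('incremental', k=3,
                    decode_step=lambda prefix, n: prefix[n].upper()))

A learned policy would replace the fixed test len(prefix) >= k with a decision based on the model's uncertainty about the next output chunk.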

2 Work plan

The proposed work plan is the following:

- Bibliographic work on S2S neural models in the context of speech recognition, speech synthesis and machine translation, as well as their incremental (low-latency) variations.

- Investigating new architectures, losses, and training strategies toward incremental S2S models.

- Implementing and evaluating the proposed techniques in the context of end-to-end neural TTS systems (the baseline system may be a neural TTS trained with past information/left-context only).

- Implementing and evaluating the proposed techniques in the context of speech enhancement/inpainting, first on simulated noisy speech and then on pathological speech.

3 Requirements

We are looking for an outstanding and highly motivated PhD candidate to work on this subject. The following requirements are mandatory:

- Engineering degree and/or a Master's degree in Computer Science, Signal Processing or Applied Mathematics.

- Solid skills in Machine Learning. General knowledge in natural language processing and/or speech processing.

- Excellent programming skills (mostly in Python and deep learning frameworks).

- Good oral and written communication in English.

- Ability to work autonomously and in collaboration with supervisors and other team members.

4 Work context

Grenoble Alpes Univ. offers an excellent research environment with ample computing facilities, as well as remarkable surroundings to explore over the weekends. The PhD project will be funded by the Grenoble Artificial Intelligence Institute (MIAI). The PhD candidate will work both at GIPSA-lab (CRISSP team) and LIG-lab (GETALP team). The duration of the PhD is 3 years. The salary is between 1770 and 2100 euros gross per month (depending on complementary activity or not).

5 How to apply?

Applications should include a detailed CV; a copy of the last diploma; at least two references (people likely to be contacted); a cover letter of one page; a one-page summary of the Master thesis; and the two last transcripts of grades (Master or engineering school). Applications should be sent to thomas.hueber@gipsa-lab.fr, laurent.girin@gipsa-lab.fr and laurent.besacier@imag.fr. Applications will be evaluated as they are received: the position is open until it is filled, with a deadline on July 10th, 2019.


6-6(2019-06-20) Post-doc position, CNRS and Unv.Aix-Marseille, Aix-en-Provence, France


 POST-DOC POSITION (18 months) - Forensic Voice Comparison (VoxCrim project): ability, limitations and specificities of listeners in speaker identification tasks
Laboratoire Parole et Langage (CNRS and Aix-Marseille Université), Aix-en-Provence, France


CONTEXT
The post-doc will be carried out within the framework of the VoxCrim project, funded by an ANR (Agence nationale de la recherche) grant (2017-2021, https://anr.fr/Project-ANR-17-CE39-0016). VoxCrim focuses on national security and legal/justice applications and aims to provide a validated scientific objective framework for all types of forensic voice comparison methods (automatic and phonetic). The goal is to develop certified standards to determine the specific areas in which voice comparison methods are applicable. The project includes two complementary subject areas: 1. the proposal of methodological standards to homogenize the expertise of voice comparison in a judicial environment, 2. the development of basic research in the fields of automatic speech processing and phonetics (speaker characteristics in the production and perception of speech).
The post-doc will participate in the second subject area (speaker characteristics in the production and perception of speech). Two questions need to be answered: What are the abilities and limits of listeners in speaker identification tasks? Which cues do listeners use to identify speakers?

TASKS
The main objective will be to conduct perception experiments aimed at assessing the ability of listeners in several speaker identification tasks.
The post-doctoral fellow will:
-    design experimental protocols
-    create and manipulate acoustic stimuli
-    run experiments and collect data
-    process data and perform statistical analysis
Finally, results will be presented at conferences and published in international journals.

WORK ENVIRONMENT
The postdoctoral fellow will work at the Laboratoire Parole et Langage (http://www.lpl-aix.fr/), a laboratory whose research interests are extremely varied (including linguistics, phonetics, neuroscience, psycholinguistics, sociolinguistics, and computer science). He or she will benefit from this stimulating environment and interact with all the members of the laboratory (faculty members, other post-docs, engineers, doctoral students, etc.). He or she will have the opportunity to discover all the projects of the laboratory.

EXPECTED PROFILE

The postdoctoral fellow will have a PhD in the speech sciences and/or in psychoacoustics (auditory measurements, audio signal processing). A strong background in data processing and statistics is also required. A good command of French and English will also be appreciated.

18 months. Beginning in autumn 2019
Monthly salary: ~€1900 net (depending on experience)
Location: Laboratoire Parole et Langage (http://www.lpl-aix.fr/), Aix Marseille Université, CNRS UMR 7309, Aix-en-Provence, France
For additional information: christine.meunier@univ-amu.fr

- Application: CV and cover letter
- Send application as soon as possible to christine.meunier@univ-amu.fr


Supervisors: Dr. Christine Meunier, CNRS Researcher and Dr. Alain Ghio, Research Engineer - Laboratoire Parole et Langage, Aix-en-Provence, France.


6-7(2019-06-21) Research engineer, LIG, Univ. de Grenoble-Alpes, France

RECRUITMENT OF A RESEARCH ENGINEER IN NATURAL LANGUAGE PROCESSING AND IN THE DEVELOPMENT OF A WEB-BASED HCI INTERFACE

Start of contract: October 2019
Duration: 7 months

Salary: €2000 gross/month


Profile:
- Master's degree or PhD in natural language processing (NLP)
- Training in language sciences will be appreciated
- Operational skills in software engineering (version control, testing, code quality) and Python

- Skills in C/C++ would be a plus

- Experience in automatic speech processing is required, as well as a good level of French

- Experience with Meteor, Firepad, Node.js, MongoDB or Firebase would be a plus

This position requires the ability to work both within a team and autonomously.

Knowledge of the linguistic context of deafness would be an additional asset.

*Description of the project and missions*
Within the framework of the MANES project (Médiation et Accessibilité Numérique pour les Étudiants Sourds: digital mediation and accessibility for deaf students), led by François Portet (LIG), Isabelle Estève (LIDILEM) and Marion Fabre (ECP) and partly funded by PULSALYS, IDEX Lyon-Saint Etienne, we are recruiting a research engineer on a 7-month fixed-term contract.

The general objective of the project is to develop a real-time captioning system that makes the teacher's spoken discourse accessible to deaf students, so as to foster the individual appropriation of knowledge through note-taking. The technological realization and the written-language processing abilities of deaf audiences will be the two axes of this project.

The engineer's mission consists, on the one hand, in developing, evaluating and improving prototypes based on the latest scientific advances, and in merging them into a real-time automatic captioning prototype built on the Kaldi platform.

On the other hand, it consists in designing an HCI interface for the real-time automatic transcription of the teacher's speech and the projection of the ongoing captions, integrating the above-mentioned prototype.

The candidate will be in charge of preparing the prototype for classroom experiments and of the interface adjustments arising from these experiments.

Missions in natural language processing:
- Review of the state of the art of automatic captioning systems
- Semi-automatic transcription tests and verification of oral excerpts from lectures
- Adaptation of the real-time neural-network-based automatic transcription system (Kaldi)
- Real-time processing of the transcriptions for captioning adapted to deaf audiences

The functional requirements envisioned for the implementation on the Kaldi platform are: real-time spotting of keywords and synonyms (already available in Kaldi) and the development of new functionalities: segmentation and simplification. Adaptations for offline use will also have to be considered.


Missions in development:
- Implementation of a web application allowing the real-time transcription of the teacher's speech and the projection of the resulting transcription (video projector, with a possible extension to a mobile interface)
- Development of the student interface: storage and retrieval of the written trace for later rework and modification
- Development of the teacher interface: parameterization of the key elements of the course
- Documentation: description and user manual of the HCI interface

*Work environment*

The project is led by the Education, Cultures et Politiques laboratory (ECP, EA 4571), Université Lumière Lyon 2, in collaboration with the Laboratoire de Linguistique et Didactique des Langues Étrangères et Maternelles (LIDILEM), Université Grenoble-Alpes, and the Laboratoire d'Informatique de Grenoble (LIG).

The position will be physically hosted at the Laboratoire d'Informatique de Grenoble, UMR CNRS, within the GETALP team. The GETALP team (https://lig-getalp.imag.fr) gathers more than 40 researchers, engineers and students in the field of automated multilingual language and speech processing.

*Application*
Send a CV, a cover letter and 1 to 3 recommendation letters to Marion.Fabre@univ-lyon2.fr, François.portet@imag.fr, Benjamin.Lecouteux@imag.fr and isabelle.esteve@univ-grenoble-alpes.fr.
Applications will be reviewed as they are received; candidates selected for a job interview will be contacted in early July. Please apply as soon as possible, and before July 1st, 2019, midnight.

 

 


6-8(2019-06-21) Post doc at LIUM, Univ. du Mans, Le Mans, France

Post-doc position open
------------------------------------
The Speech and Language Technology Group in Le Mans University is looking for
a post-doc scientist to develop autonomous systems

Keywords: Deep Learning, lifelong, autonomous systems, unsupervised learning,
active-learning, interactive-learning

Context
------------------------------------
The LST team from LIUM (Le Mans University) is focusing on autonomous systems'
behavior for the tasks of speaker diarization and machine translation. The
ALLIES project (a European CHIST-ERA collaborative project) aims at developing
evaluation protocols, metrics and scenarios for lifelong learning autonomous
systems. The goal is to enable auto-adaptable systems that can also
auto-evaluate in order to sustain their performance across time. Autonomous
systems can rely on human domain experts via active and interactive learning
processes to be defined within the ALLIES project.

Missions
------------------------------------
Develop an autonomous system for speaker diarization by integrating lifelong
learning, active and interactive learning components. The research work will be
related to some of the following topics:
    - unsupervised adaptation
    - unsupervised evaluation
    - active learning (based on the unsupervised evaluation process, the
        autonomous system is free to request additional knowledge from the
        human domain expert; a toy sketch follows this list)
    - interactive learning (a human domain expert provides specific knowledge
        to the autonomous system; this information must be taken into account
        by the system)
    - Performance will be analyzed using the protocols, metrics and scenarios
        developed for the ALLIES project.
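As a toy illustration of the active-learning component mentioned above (a
hedged sketch: the confidence measure, the threshold and all component names
are placeholder assumptions, not the ALLIES protocol), the Python loop below
processes each incoming recording, queries the human expert only when its own
unsupervised confidence estimate is low, and otherwise adapts on its own
hypothesis:

import random

def lifelong_diarization(recordings, diarize, confidence, ask_expert,
                         adapt, threshold=0.7):
    # Toy lifelong loop: self-adapt when confident, query the expert when not.
    # diarize / confidence / ask_expert / adapt stand in for real components.
    for rec in recordings:
        hypothesis = diarize(rec)
        if confidence(rec, hypothesis) < threshold:
            labels = ask_expert(rec)      # active learning: costly but reliable
        else:
            labels = hypothesis           # unsupervised self-training
        adapt(rec, labels)                # lifelong model update

# Toy run with stand-in components
lifelong_diarization(
    recordings=range(5),
    diarize=lambda r: f'hyp-{r}',
    confidence=lambda r, h: random.random(),
    ask_expert=lambda r: f'expert-{r}',
    adapt=lambda r, labels: print(f'adapting on {labels}'),
)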

Participation in the ALLIES benchmarking evaluation for speaker diarization:
during the ALLIES project, LIUM is organizing two international evaluation
campaigns (one for speaker diarization, jointly organized with Albayzin, and
the second one for machine translation, jointly with WMT). The benchmarking
evaluation will serve to validate the approaches developed during the
post-doc.

Dissemination
The research will be published in the major conferences and journals

------------------------------------
Duration: 12 months
Salary: €2,365.14 per month (after taxes)

Start: as soon as possible, January 2020 at the latest

Location: LIUM, Le Mans University

Supervisors: Anthony Larcher (anthony.larcher@univ-lemans.fr) and
Loïc Barrault (loic.barrault@univ-lemans.fr)

Expected competences:
    - PhD in machine learning and deep learning
    - Experience in speech processing is a plus
    - Fluent in Python
    - Familiar with a deep learning toolkit (PyTorch, TensorFlow)

ALLIES website: https://projets-lium.univ-lemans.fr/allies/


6-9(2019-06-22) Head of AI (M/F), R&D team manager, Zaion, Paris, France

 

ZAION is a fast-growing, innovative company specialized in conversational robot technology: callbots and chatbots integrating Artificial Intelligence.

ZAION has developed a solution that builds on more than 20 years of experience in Customer Relations. This technologically disruptive solution has been received very favorably internationally, and we already count 18 active clients (GENERALI, MNH, APRIL, CROUS, EUROP ASSISTANCE, PRO BTP...).

We are currently among the only companies in the world to offer this type of solution entirely geared towards performance. Joining us means taking part in an exciting adventure within an ambitious team aiming to become the reference on the conversational robot market.

As part of its growth, ZAION is looking for its Head of AI (M/F). As manager of the R&D team, your role is strategic for the development and expansion of the company. You will develop a solution that detects emotions in conversations. We want to extend the cognitive capabilities of our callbots so that they can detect the emotions of their interlocutors (joy, stress, anger, sadness...) and adapt their answers accordingly.

Your main missions:

- Participate in the creation of ZAION's R&D unit and, upon arrival, lead your first project on emotion recognition in the voice

- Build, adapt and evolve our services for detecting emotion in the voice

- Analyze large databases of conversations to extract the emotionally relevant ones

- Build a database of conversations labeled with emotional tags

- Train and evaluate machine learning models for emotion classification

- Deploy your models in production

- Continuously improve the system for detecting emotions in the voice

Required qualifications and prior experience:

- You have at least 5 years of experience as a Data Scientist / Machine Learning engineer applied to audio, and an appetite for team management

- Engineering school or Master's degree in computer science, or a PhD in computer science or mathematics, with solid skills in signal processing (preferably audio)

- Solid theoretical background in machine learning and in the relevant mathematical fields (clustering, classification, matrix factorization, Bayesian inference, deep learning...)

- Experience deploying machine learning models in a production environment would be a plus

- You master one or more of the following: Python, machine learning / deep learning frameworks (PyTorch, TensorFlow, scikit-learn, Keras) and JavaScript

- You master audio signal processing techniques

- Proven experience in labeling large databases (preferably audio) is essential

- Your personality: a leader, autonomous and passionate about your work, you know how to lead a team in project mode

- You speak English fluently

Please send your application to: alegentil@zaion.ai


6-10(2019-06-23) 3 open roles at Speechmatics, Cambridge, UK

1. SPEECH RECOGNITION INTERN

Location: Cambridge, UK

Contact: careers@speechmatics.com

 

“As an intern at Speechmatics I have worked on projects that use real machine learning to deliver real value to people across the world. There are few places where the machine learning being used is at the bleeding edge of the field, but Speechmatics is one of them. The company has an amazing culture that allows you to grow as a programmer and as a person. If you want to be a part of a fast-growing machine learning company where you, personally, will make a difference, then Speechmatics could well be the place for you!”

 

  • Sam Ringer, Machine Learning Engineer (previously R&D Intern), Speechmatics 



Background

Speech technology is one of the most popular discussion items at the moment, yet speech interaction is limited to “Alexa, turn on the light”, or “Siri, where is the nearest coffee shop?” We are taking speech technology to the next level using our expertise in machine learning and speech-to-text technology to enable our customers to use conversational speech recognition. Our solutions power subtitling on TV, content discovery for videos, compliance solutions in banks, improve efficiency of meetings, and many other use-cases. Our mission is to improve human communication with a global speech engine, that works and put speech back at the heart of communication.

At Speechmatics you’ll be working with some of the smartest minds in the industry, working on cutting-edge projects and deploying the latest machine learning techniques to disrupt the market, providing customers with the best speech technology available, all whilst immersed in a progressive and great company culture. You can enjoy benefits including share options, healthcare, life assurance, Bike Doctor, massages, regular BBQs, Brew Dogs in the fridge, no red tape, a top-end laptop and much more. We’re building a company that truly strives to be world-leading and we’re looking for people who wholeheartedly believe they can be additive to our culture, bring new ideas to the table and get stuff done. If that’s you, carry on reading.



The Opportunity

The Speechmatics Engineering team develops and maintains speech-oriented products and services that will be used by businesses worldwide and is responsible for the complete product development cycle for these products. In this internship, you’ll help to support fundamental speech and language processing research to improve our performance and language coverage as well as helping to build products and features to delight our users.

Because you will be joining a rapidly expanding team, you will need to be a team player who thrives in a fast-paced environment, with a focus on investigating ideas and rapidly moving research developments into products. We strongly encourage versatility and knowledge transfer within and across teams. You will be expected to learn fast and feel emboldened to ask for support as you need it.

Prior experience of speech recognition is desirable, although Speechmatics has a team of speech recognition engineers who will collaborate and share any specialised knowledge required. If you are enthusiastic about speech recognition and machine learning in general, with the drive to deliver the best possible technology solutions, then we want to hear from you!

Our internships are not time constrained to specific dates – we can work out mutually agreeable start and end dates as part of the application process.



Key Responsibilities

  • Exploring and evaluating research ideas

  • Increasing and improving our language coverage

  • Prototyping new and improved features

  • Helping the company to take your R&D through to production

  • Communicating your work internally



Requirements

Essential

  • Team player

  • Enthusiasm for speech recognition and machine learning

  • Technical understanding of speech recognition or related discipline

  • Ability to rapidly deliver on ideas

  • Competent in Python and/or C/C++

  • Have or be studying towards a degree involving speech recognition, machine learning / computer science or related field



Desirable

  • Practical experience of ASR and ML packages such as Kaldi, HTK or TensorFlow

  • Commercial experience of speech recognition

  • Software development experience



Salary

Competitive salary (dependent on experience), flexible working and some awesome benefits & perks.



Interested?

Get in touch! Send your CV and covering letter to careers@speechmatics.com.

 

2. SPEECH RECOGNITION ENGINEER

Location: Cambridge, UK

Contact: careers@speechmatics.com

 

'As a Speech Recognition Engineer at Speechmatics, I work on solving a multitude of problems related to improving the accuracy and delivering new features for a global automatic speech recognition engine. As a member of the speech team, I work across every aspect of speech and implement the latest research in acoustic and language modelling. The team is supportive and also rich in terms of skills and backgrounds. Speechmatics offer progressive and rewarding opportunities in one of the best speech technology companies in the world.' 

 

  • André Mansikkaniemi, Speech Recognition Engineer at Speechmatics 



Background

Speech technology is one of the most popular discussion items at the moment, yet speech interaction is limited to “Alexa, turn on the light”, or “Siri, where is the nearest coffee shop?” We are taking speech technology to the next level using our expertise in machine learning and speech-to-text technology to enable our customers to use conversational speech recognition. Our solutions power subtitling on TV, content discovery for videos, compliance solutions in banks, improve efficiency of meetings, and many other use-cases. Our mission is to improve human communication with a global speech engine, that works and put speech back at the heart of communication.

At Speechmatics you’ll be working with some of the smartest minds in the industry, working on cutting-edge projects and deploying the latest machine learning techniques to disrupt the market, providing customers with the best speech technology available, all whilst immersed in a progressive and great company culture. You can enjoy benefits including share options, healthcare, life assurance, Bike Doctor, massages, regular BBQs, Brew Dogs in the fridge, no red tape, a top-end laptop and much more. We’re building a company that truly strives to be world-leading and we’re looking for people who wholeheartedly believe they can be additive to our culture, bring new ideas to the table and get stuff done. If that’s you, carry on reading.



The Opportunity 

We are looking for a talented speech engineer to help us build the best speech technology for anybody, anywhere, in any language. You will be part of a team that is working on our core ASR capabilities to improve our speed and accuracy and develop novel features that we can support in all languages. Your work will feed into our ground-breaking framework to support the building of ASR models in every language pack published by the company. You will be responsible for keeping our system the most accurate and useful commercial speech recognition system available. 

As you will be joining a small team, you will need to be a team player who thrives in a fast-paced environment, with a focus on rapidly moving research developments into products. Bringing skills into the team is as important as a can-do attitude. We strongly encourage versatility and knowledge transfer within the team, so that we can share efficiently what needs to be done to meet our commitments to the rest of the company. 



Key Responsibilities 

  • Research and development of improved speed and accuracy across our range of world-leading ASR products and related features

  • Delivering the software that provides an easy-to-use, feature-rich ASR product for our customers

  • Enhancing our machine learning framework that robustly builds any language with the best possible performance

  • Taking data all the way from its raw form through to a finished model

  • Working within a team in an agile environment

  • Working closely with other technical teams and product team to deliver on the company’s technical vision

 

Requirements 

Essential 

  • Graduate degree in Statistics, Engineering, Mathematics, or Computer Science 

  • Knowledge of key natural language processing or related technologies, such as speech recognition, text-to-speech or natural language understanding 

  • Experience working with standard speech and/or ML toolkits, e.g. Kaldi, KenLM, TensorFlow, etc. 

  • Solid Python programming skills

  • Experience using Unix/Linux 

  • Quick and enthusiastic learner

  • Excellent teamwork and communication skills

  • Analytical mind-set with a data-driven approach to making decisions and attention to detail 

 

Desirable 

  • Postgraduate degree in related discipline 

  • Commercial work experience in ASR or a related field 

  • Experience of working in an Agile framework 

  • Expertise in modern speech recognition, including WFSTs, lattice processing, neural net (RNN / DNN / LSTM), acoustic and language models, decoding 

  • Comprehensive knowledge of machine learning and statistical modelling 

  • Experience in deep machine learning and related toolkits, e.g. Theano, Torch, etc. 

  • Deep expertise in Python and/or C++ software development 

  • Experience working effectively with software engineering teams or as a Software Engineer 



Salary

Competitive salary (dependent on experience), flexible working and some awesome benefits & perks.



Interested?

Get in touch! Send your CV and covering letter to careers@speechmatics.com.

 

3. SENIOR SPEECH RECOGNITION ENGINEER

Location: Cambridge, UK

Contact: careers@speechmatics.com

 

'As a Speech Recognition Engineer at Speechmatics, I work on solving a multitude of problems related to improving the accuracy and delivering new features for a global Automatic Speech Recognition engine. As a member of the speech team, I work across every aspect of speech and implement the latest research in acoustic and language modelling. The team is supportive and also rich in terms of skills and backgrounds. Speechmatics offer progressive and rewarding opportunities in one of the best speech technology companies in the world.'

 

  • André Mansikkaniemi, Speech Recognition Engineer, Speechmatics



Background

Speech technology is one of the most popular discussion items at the moment, yet speech interaction is limited to “Alexa, turn on the light”, or “Siri, where is the nearest coffee shop?” We are taking speech technology to the next level, using our expertise in machine learning and speech-to-text technology to enable our customers to use conversational speech recognition. Our solutions power subtitling on TV, content discovery for videos, and compliance solutions in banks, improve the efficiency of meetings, and serve many other use-cases. Our mission is to improve human communication with a global speech engine that works, and to put speech back at the heart of communication.

At Speechmatics you’ll be working with some of the smartest minds in the industry, working on cutting-edge projects and deploying the latest machine learning techniques to disrupt the market, providing customers with the best speech technology available, all whilst immersed in a progressive and great company culture. You can enjoy benefits including share options, healthcare, life assurance, Bike Doctor, massages, regular BBQs, BrewDogs in the fridge, no red tape, a top-end laptop and much more. We’re building a company that truly strives to be world-leading and we’re looking for people who wholeheartedly believe they can be additive to our culture, bring new ideas to the table and get stuff done. If that’s you, carry on reading.



The Opportunity

We are looking for a talented speech engineer to help us build the best speech technology for anybody, anywhere, in any language. You will be part of a team that is working on our core ASR capabilities to improve our speed and accuracy and develop novel features that we can support in all languages. Your work will feed into our ground-breaking framework to support the building of ASR models in every language pack published by the company. You will be responsible for keeping our system the most accurate and useful commercial speech recognition system available.

As you will be joining a small team, you will need to be a team player who thrives in a fast-paced environment, with a focus on rapidly moving research developments into products. Bringing skills into the team is as important as a can-do attitude. We strongly encourage versatility and knowledge transfer within the team, so that we can share efficiently what needs to be done to meet our commitments to the rest of the company.

 

Key Responsibilities

  • Research and development of improved speed and accuracy across our range of world-leading ASR products and related features

  • Delivering the software that provides an easy-to-use, feature-rich ASR product for our customers

  • Enhancing our machine learning framework that robustly builds any language with the best possible performance

  • Taking data all the way from its raw form through to a finished model

  • Working within a team in an agile environment

  • Working closely with other technical teams and product team to deliver on the company’s technical vision



Requirements

Essential

  • Commercial experience in ASR or a related field

  • Graduate degree in Statistics, Engineering, Mathematics, or Computer Science

  • Expertise in modern speech recognition, including WFSTs, lattice processing, neural net (RNN / DNN / LSTM), acoustic and language models, decoding

  • Experience working with standard speech and/or ML toolkits, e.g. Kaldi, KenLM, TensorFlow, etc.

  • Solid Python programming skills

  • Experience using Unix/Linux

  • Drive to help those around you learn and improve every day

  • Excellent teamwork and communication skills

  • Analytical mind-set with a data-driven approach to making decisions and attention to detail

 

Desirable

  • Postgraduate degree in related discipline

  • Experience of working in an Agile framework

  • Comprehensive knowledge of machine learning and statistical modelling

  • Experience in deep machine learning and related toolkits, e.g. Theano, Torch, etc.

  • Deep expertise in Python and/or C++ software development

  • Experience working effectively with software engineering teams or as a Software Engineer



Salary

Competitive salary (dependent on experience), flexible working and some awesome benefits & perks.



Interested?

Get in touch! Send your CV and covering letter to careers@speechmatics.com.

 

More about Speechmatics’ culture



Live for the wow | Build authentic relationships | Be the adventure



Innovation is what we do. We build, we iterate, we develop the next thing that delivers that wow moment. We see value in building long-term, authentic relationships that last and are based on trust and honesty: with our customers, our colleagues, our leaders, our suppliers, and within our local community. Our journey should be fun and exciting. We will celebrate our successes and learn from our mistakes together along the way. We embrace learning and change to grow naturally and organically as a company and as individuals. We trust; we're honest, kind and respectful.

 


6-11(2019-07-11) Three-year Early Stage Researcher PhD position

Applications are invited for a three-year Early Stage Researcher PhD position in speech technology for pathological speech.

Description

The thesis focuses on studying the link between the internal representations of Deep Neural Networks (DNNs) and the subjective representation of speech intelligibility. We propose to explore the saliency detection capabilities of DNNs when used in a regression task for predicting speech intelligibility scores as given by human experts. By saliency, we mean retrieving which frequency bands are important and used by a DNN to make its predictions. The final expectation is to identify regions of interest in the speech signal, both in time and frequency, that characterise the level of speech impairment.

The experiments will be carried out on various samples of speech produced by 150 people (100 patients and 50 healthy controls). This database was recorded within the INCA C2SI project and contains speech from patients treated for cancer of the oral cavity or pharynx. It also contains various metadata such as the location of the tumor, the impairment in terms of severity and intelligibility as assessed by human experts, self-evaluation questionnaires on the patient's quality of life, etc. Various tasks were recorded, such as a sustained vowel, read speech, nonsense words, prosodic exercises, picture description, etc. There will also be the possibility to extend the work to another corpus composed of the voices of patients suffering from Parkinson's disease.

At first, the PhD student will build on the various analyses and descriptions produced during the C2SI project, which sought to correlate the impact of the tumor with communication ability. Those results will help establish the human representation of the impact of the disease. Then, a DNN will be modeled to fit the data, taking care of the data sparsity. The last part of the work will explore the internal representation of the DNN, investigating what parts of the signal help it make a decision about the impact of the disease; this is the final goal of the thesis: studying the automatic representation that lies in the model the student will propose.
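
As a rough illustration of the saliency idea described above (the posting does not prescribe a method), one common approach is to differentiate the predicted score with respect to the input. The minimal sketch below assumes a hypothetical trained PyTorch regression model and a (frames x bands) spectrogram tensor; neither exists in the posting itself.

import torch

def band_saliency(model, spectrogram):
    """Gradient-based saliency: one value per frequency band.

    `model` and `spectrogram` are hypothetical placeholders for a trained
    intelligibility-regression network and a (frames, n_bands) input.
    """
    x = spectrogram.detach().clone().requires_grad_(True)
    score = model(x.unsqueeze(0)).squeeze()  # scalar intelligibility score
    score.backward()                         # gradient of score w.r.t. input
    # Average absolute gradients over time: bands with large values are the
    # ones the network relies on most for its prediction.
    return x.grad.abs().mean(dim=0)          # shape: (n_bands,)

Bands with consistently large gradient magnitude are candidates for the regions of interest the thesis aims to identify; gradient saliency is only one of several possible techniques.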

This work is funded by the TAPAS project (https://www.tapas-etn-eu.org), a Horizon 2020 Marie Skłodowska-Curie Actions Innovative Training Network / European Training Network (MSCA-ITN-ETN) project that aims to transform the well-being of people across Europe with debilitating speech pathologies (e.g., due to stroke, Parkinson's, etc.). These groups face communication problems that can lead to social exclusion. They are now being further marginalised by a new wave of speech technology that is increasingly woven into everyday life but which is not robust to atypical speech.

The supervision of the PhD will take place at the IRIT laboratory, within the SAMoVA team in Toulouse. SAMoVA does research in the domain of 'analysis, modeling and structuring of audiovisual content'. The application areas are diverse: speech processing, language identification, speaker verification, and speech and music indexing. The researchers' expertise covers novel machine learning and audio processing technologies and is now focused on deep learning methods, leading to several publications in international conferences.

Eligibility Criteria

Early Stage Researchers (ESRs) shall, at the time of recruitment by the host organization, be in the first four years (full-time equivalent research experience) of their research careers.

- The ESR may be a national of a Member State, of an Associated Country or of any Third Country.

- The ESR must not have resided or carried out her/his main activity (work, studies, etc.) in the country of her/his host organization for more than 12 months in the 3 years immediately prior to her/his recruitment.

- Holds a Master's degree or equivalent, which formally entitles the holder to embark on a Doctorate.

- Does not hold a PhD degree.

Duration of recruitment: 36 months

Contact: Julie Mauclair (mauclair@irit.fr)

 

 


6-12(2019-07-17) Chief Technical Officer (CTO) at ELDA

Chief Technical Officer (CTO)

Under the supervision of the CEO, the responsibilities of the Chief Technical Officer (CTO) include planning and supervising technical development of tools, software components or applications for language resource production and management.
He/she will be in charge of managing the current language resources production workflows and co-ordinating ELDA's participation in R&D projects, while also being hands-on whenever required by the language resource production and management team. He/she will liaise with external partners at all phases of the projects (submission to calls for proposals, building and management of project teams) within the framework of international, publicly- or privately-funded projects.

This position offers excellent opportunities for creative and motivated candidates wishing to participate actively in the Language Engineering field.

Profile:
•    PhD in Computer Science, Natural Language Processing, or equivalent
•    Experience in Natural Language Processing (speech processing, data mining, machine translation, etc.)
•    Familiarity with open source and free software
•    Knowledge of a statically typed functional programming language (OCaml preferred) is a plus
•    Good level in English, with strong writing and documentation skills in English
•    Dynamic and communicative, flexible to work on different tasks in parallel
•    Ability to work independently and as part of a multidisciplinary team
•    Citizenship (or residency papers) of a European Union country
•    Good level in Python, knowledge of Django would be a plus
•    Proficiency in classic shell scripting in a Linux environment (POSIX tools, Bash, awk)

Salary: Commensurate with qualifications and experience (between 45K€ and 55K€).
Other benefits: complementary health insurance and meal vouchers

Applicants should email a cover letter addressing the points listed above together with a curriculum vitae to: job@elda.org

ELDA is acting as the distribution agency of the European Language Resources Association (ELRA). ELRA was established in February 1995, with the support of the European Commission, to promote the development and exploitation of Language Resources (LRs). Language Resources include all data necessary for language engineering, such as monolingual and multilingual lexica, text corpora, speech databases and terminology. The role of this non-profit membership Association is to promote the production of LRs, to collect and to validate them and, foremost, make them available to users. The association also gathers information on market needs and trends.

For further information about ELDA/ELRA, visit: www.elra.info


6-13(2019-07-19) Two Post-doctoral positions at Le Mans University, France

 2 Post-doctoral positions at Le Mans University on deep learning approaches for speech processing

*Place of work* Le Mans University, Le Mans – France

*Starting date* From now to June 2020

*Salary* between €2,300 and €2,600/month

*Duration* 12 months and 24 months (can be combined into a 36-month position)

****************************************
1st position
****************************************

* Context *
The LST team from LIUM (Le Mans University) is focusing on autonomous systems' behavior
for the task of speaker diarization and machine translation.
The ALLIES project (European Chist-ERA collaborative project) aims at developing
evaluation protocols, metrics and scenarios for lifelong learning autonomous systems.
The goal is to enable auto-adaptable systems that can also auto-evaluate in order to
sustain their performance across time. Autonomous systems can rely on human domain
experts via active and interactive learning processes to be defined within the ALLIES project.

* Missions *
Develop an autonomous system for speaker diarization by integrating lifelong learning,
active and interactive learning components. The research work will be related to some of the following topics:
- unsupervised adaptation
- unsupervised evaluation
- active learning (based on the unsupervised evaluation process, the autonomous
   system is free to request additional knowledge from the human domain expert)
- interactive learning (a human domain expert provides specific knowledge to
   the autonomous system; this information must be taken into account by the system)
Performance will be analyzed using protocols, metrics and scenarios developed for the ALLIES project.
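
As a toy illustration of the active-learning component listed above (our own sketch, not the ALLIES protocol), the function below scores the system's confidence from speaker-embedding margins and selects only the most ambiguous segments for the human expert; `embed` and `centroids` are hypothetical per-segment embeddings and current speaker-cluster centroids.

import numpy as np

def query_for_labels(embed, centroids, budget=10):
    """Pick the `budget` segments the diarization system is least sure about.

    embed: (n_segments, dim) speaker embeddings.
    centroids: (k, dim) current speaker-cluster centroids, with k >= 2.
    """
    # Distance from every segment to every speaker cluster
    d = np.linalg.norm(embed[:, None, :] - centroids[None, :, :], axis=-1)
    d.sort(axis=1)
    margin = d[:, 1] - d[:, 0]          # small margin = ambiguous speaker
    return np.argsort(margin)[:budget]  # segment indices to show the expert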

Participation in the ALLIES benchmarking evaluation for speaker diarization.
During the ALLIES project, LIUM is organizing two international evaluation
campaigns (one for speaker diarization, jointly organized with Albayzin, and the
second one for machine translation, jointly with WMT).
The benchmarking evaluation will serve to validate the approaches developed during the post-doc.

* Dissemination *
The research will be published in the major conferences and journals

* Duration * 12 months
* Salary * €2,365.14 (after taxes)

* Start * as soon as possible, latest January 2020

* Supervisors * Anthony Larcher (anthony.larcher@univ-lemans.fr) and Loïc Barrault (loic.barrault@univ-lemans.fr)

Expected competences:
    - PhD in Machine Learning and Deep Learning
    - Experience in speech processing is a plus
    - Fluent in Python
    - Familiar with a deep learning toolkit (PyTorch, TensorFlow)

ALLIES website: https://projets-lium.univ-lemans.fr/allies/

****************************************
2nd position
****************************************

* Context *
The LST team from LIUM (Le Mans University) is focusing on evolutive end-to-end
neural networks for speaker recognition. The Extensor project (French ANR funded)
aims at developing novel architectures for end-to-end speaker recognition as well as
explaining the behavior of those networks. The focus of Extensor is threefold:
get rid of the legacy of Bayesian system architectures and explore the wider opportunities offered by deep learning;
explore real end-to-end architectures exploiting the raw signal instead of classical features (such as MFCCs or filterbanks);
develop tools for explainability in speaker recognition.
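
To give a concrete flavour of the second point, a SincNet-style front end replaces fixed features with band-pass filters whose two cut-off frequencies are the only learnable parameters. The sketch below is our illustration (frequencies in cycles/sample are arbitrary), not the project's code:

import numpy as np

def sinc_bandpass(f1, f2, length=251):
    """Band-pass kernel built as the difference of two low-pass sinc filters."""
    n = np.arange(length) - (length - 1) / 2
    h = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    return h * np.hamming(length)  # window to reduce spectral ripple

# A small filterbank applied directly to the raw waveform; in SincNet the
# cut-offs f1/f2 would be trained by backpropagation rather than fixed.
filters = [sinc_bandpass(f, f + 0.02) for f in np.arange(0.01, 0.4, 0.05)]
waveform = np.random.randn(16000)  # stand-in for one second of raw speech
features = np.stack([np.convolve(waveform, h, mode='same') for h in filters])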

* Missions *
Develop end-to-end speaker recognition systems based on state-of-the-art approaches (x-vectors, SincNet, etc.)
Develop evolutive architectures making use of existing genetic algorithms and study their behavior.
Participate in the three hackathons organized by the Extensor project in order to develop
tools for evolutive neural network architectures and explainability for speaker recognition.
Dissemination: the research will be published in the major conferences and journals

* Duration * 24 months
* Salary * €2,600 (after taxes)

* Start * as soon as possible, latest June 2020

* Location * LIUM, Le Mans University

* Supervisor * Anthony Larcher (anthony.larcher@univ-lemans.fr)

Expected competences:
    - PhD in Machine Learning and Deep Learning
    - Experience in speech processing is a plus
    - Fluent in Python
    - Familiar with a deep learning toolkit (PyTorch, TensorFlow)

 

--

Anthony Larcher, Maître de Conférences, HDR / Associate Professor
Director of the Institut Informatique Claude Chappe
Co-head of the Computer Science specialty
Head of the Human-System Interfaces option. Tel. +33 (0)2 43 83 38 30
Avenue Olivier Messiaen, 72085 LE MANS Cedex 09, univ-lemans.fr


6-14(2019-07-20) Three-year Early Stage Researcher PhD position, IRIT, Toulouse, France

Applications are invited for a three-year Early Stage Researcher PhD position in speech technology for pathological speech.

Description

The thesis focuses on studying the link between the internal representations of Deep Neural Networks (DNNs) and the subjective representation of speech intelligibility. We propose to explore the saliency detection capabilities of DNNs when used in a regression task for predicting speech intelligibility scores as given by human experts. By saliency, we mean retrieving which frequency bands are important and used by a DNN to make its predictions. The final expectation is to identify regions of interest in the speech signal, both in time and frequency, that characterise the level of speech impairment.

The experiments will be carried out on various samples of speech produced by 150 people (100 patients and 50 healthy controls). This database was recorded within the INCA C2SI project and contains speech from patients treated for cancer of the oral cavity or pharynx. It also contains various metadata such as the location of the tumor, the impairment in terms of severity and intelligibility as assessed by human experts, self-evaluation questionnaires on the patient's quality of life, etc. Various tasks were recorded, such as a sustained vowel, read speech, nonsense words, prosodic exercises, picture description, etc. There will also be the possibility to extend the work to another corpus composed of the voices of patients suffering from Parkinson's disease.

At first, the PhD student will build on the various analyses and descriptions produced during the C2SI project, which sought to correlate the impact of the tumor with communication ability. Those results will help establish the human representation of the impact of the disease. Then, a DNN will be modeled to fit the data, taking care of the data sparsity. The last part of the work will explore the internal representation of the DNN, investigating what parts of the signal help it make a decision about the impact of the disease; this is the final goal of the thesis: studying the automatic representation that lies in the model the student will propose.

This work is funded by the TAPAS project (https://www.tapas-etn-eu.org), a Horizon 2020 Marie Skłodowska-Curie Actions Innovative Training Network / European Training Network (MSCA-ITN-ETN) project that aims to transform the well-being of people across Europe with debilitating speech pathologies (e.g., due to stroke, Parkinson's, etc.). These groups face communication problems that can lead to social exclusion. They are now being further marginalised by a new wave of speech technology that is increasingly woven into everyday life but which is not robust to atypical speech.

The supervision of the PhD will take place at the IRIT laboratory, within the SAMoVA team in Toulouse. SAMoVA does research in the domain of 'analysis, modeling and structuring of audiovisual content'. The application areas are diverse: speech processing, language identification, speaker verification, and speech and music indexing. The researchers' expertise covers novel machine learning and audio processing technologies and is now focused on deep learning methods, leading to several publications in international conferences.

Eligibility Criteria

Early Stage Researchers (ESRs) shall, at the time of recruitment by the host organization, be in the first four years (full-time equivalent research experience) of their research careers.

- The ESR may be a national of a Member State, of an Associated Country or of any Third Country.

- The ESR must not have resided or carried out her/his main activity (work, studies, etc.) in the country of her/his host organization for more than 12 months in the 3 years immediately prior to her/his recruitment.

- Holds a Master's degree or equivalent, which formally entitles the holder to embark on a Doctorate.

- Does not hold a PhD degree.

Duration of recruitment: 36 months.

Contact: Julie Mauclair (mauclair@irit.fr)


6-15(2019-07-23) PhD position at LORIA-INRIA, Nancy, France
Automatic classification using deep learning of hate speech posted on the Internet


Supervisors: Irina Illina, MdC, HDR, Dominique Fohr, CR CNRS
Team: Multispeech, LORIA-INRIA, France
Contact: illina@loria.fr, dominique.fohr@loria.fr
Duration of PhD Thesis : 3 years
Deadline to apply : August  15th, 2019
Required skills: background in statistics, natural language processing and computer programming skills (Perl, Python). Candidates should email a detailed CV with diploma.

Keywords: hate speech, social media, natural language processing.

The rapid development of the Internet and social networks has brought great benefits to women and men in their daily lives. Unfortunately, the dark side of these benefits has led to an increase in hate speech and terrorism as the most common and powerful threats on a global scale. Hate speech is a type of offensive communication mechanism that expresses an ideology of hatred often using stereotypes. Hate speech can target different societal characteristics such as gender, religion, race, disability, etc. Hate speech is the subject of different national and international legal frameworks. Hate speech is a type of terrorism and often follows a terrorist incident or event.


Social networks are incredibly popular today. Nowadays, Twitter, LinkedIn, Facebook and YouTube are used as a standard tool for communicating ideas, beliefs and feelings. Only a small percentage of people use part of the network for unhealthy activities such as hate speech and terrorism. But the impact of this low percentage of users is extremely damaging. For years, social media companies such as Twitter, Facebook and YouTube have invested hundreds of millions of dollars each year in the task of detecting, classifying and moderating hate. But these efforts are mainly based on manually revising the content to identify and remove offensive content, which is extremely expensive.

This thesis aims at designing automatic and evolving methods for the classification of hate speech in the field of social media. Despite the studies already published on this subject, the results show that the task remains very difficult. We will use semantic content analysis methodologies from natural language processing (NLP) and methodologies based on deep learning (DNN), which is the revolution in the field of artificial intelligence. During this thesis, we will develop a research protocol to classify hate speech in text as hateful, aggressive, insulting, ironic, neutral, etc. This type of problem falls in the context of multi-label classification.

In addition, the problem of obfuscation of words in hate messages will need to be addressed. People who want to write hate speech on the Internet know that they risk being censored by rudimentary automatic systems of moderation. So, users try to obscure their words by deliberately altering or disguising the spelling of words.

Among the crucial points of this thesis are the choice of the DNN architecture and the relevant representation of the data, i.e. the text of the internet message. The system designed will be validated on real flows of social networks.
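
For concreteness, a minimal multi-label setup of the kind described above might look as follows in Keras (our sketch; the label set and sizes are illustrative assumptions, not the thesis design). Each message receives one independent sigmoid score per label, so several labels can be active at once:

import tensorflow as tf

NUM_LABELS = 5      # e.g. hateful, aggressive, insulting, ironic, neutral
VOCAB_SIZE = 20000  # illustrative vocabulary size
MAX_LEN = 100       # tokens per message after padding/truncation

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128, input_length=MAX_LEN),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(NUM_LABELS, activation='sigmoid'),  # one score per label
])
# Binary cross-entropy treats each label as its own yes/no decision,
# which is what distinguishes multi-label from single-label classification.
model.compile(optimizer='adam', loss='binary_crossentropy')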

Skills

Strong background in mathematics, machine learning (DNN), statistics

The following profiles are welcome: strong experience with natural language processing.

Excellent English writing and speaking skills are required in any case.

References :
Gröndahl, T., Pajola, L., Juuti, M., Conti, M., Asokan, N. (2018). All You Need is 'Love': Evading Hate-speech Detection. arXiv preprint arXiv:1808.09115.
Wiegand, M., Klakow, D. (2008). Optimizing Language Models for Polarity Classification. In Proceedings of ECIR, pp. 612-616.
Wiegand, M., Ruppenhofer, J. (2015). Opinion Holder and Target Extraction based on the Induction of Verbal Categories. In Proceedings of CoNLL, pp. 215-225.
Wiegand, M., Ruppenhofer, J., Schmidt, A., Greenberg, C. (2018). Inducing a Lexicon of Abusive Words – A Feature-Based Approach. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Wiegand, M., Wolf, M., Ruppenhofer, J. (2017) Negation Modeling for German Polarity Classification. In Proceedings of GSCL.
Zhang Z., Luo L. (2018). Hate speech detection: a solved problem? The Challenging Case of Long Tail on Twitter. arxiv.org/pdf/1803.03662

6-16(2019-07-29) PhD position, Vrije Universiteit Brussel, Belgium

PhD position in

Agent Based Modeling of

Cognitively Plausible Emergent Behavior

 

In the context of seed funding for AI research in Flanders, prof. Bart de Boer is looking for a PhD student for the origins of language group of the AI-lab of the Vrije Universiteit Brussel.

 

PhD position offered

We offer a four year PhD position funded by a scholarship with a yearly bench fee. The PhD work will consist of building an agent-based simulation in which we can investigate emergence of behavior in a cognitively realistic setting. This means that the agents are not fully rational and that they show behavior similar to that of humans, and that interests of agents are not necessarily always aligned. The modeling will primarily focus on emergence of speech, but the simulation should be general enough that it can be easily adapted to other areas, such as traffic or economic interactions.

 

What we are looking for

We are looking for an enthusiastic student with a degree in artificial intelligence, cognitive science, linguistics or equivalent and who has experience programming agent-based or cognitive models, preferably in Python or C++. Knowledge of speech and speech processing is a bonus. The starting date is negotiable, but preferably no later than September 2019.

 

How to apply

Send a recent CV detailing your academic record and your programming experience, as well as a letter of motivation, to prof. Bart de Boer. At this stage we ask you not to send copies of your diplomas or letters of reference; these we will request directly if we decide to further pursue your application. If you have any questions, please email prof. Bart de Boer.

 

Links

Context: https://ai.vub.ac.be/node/1687

Email Bart de Boer: bart@ai.vub.ac.be


6-17(2019-07-29) Visiting postdoc at Vrije Universiteit Brussel, Belgium

Visiting postdoc in

Cognitively Plausible Emergent Behavior

 

In the context of seed funding for AI research in Flanders, prof. Bart de Boer is looking for a short-term (three to six months) visiting postdoc for the origins of language group of the AI-lab of the Vrije Universiteit Brussel.

 

Position offered

We offer a three-six months visiting postdoc position funded by a scholarship and with a bench fee. The work should consist of agent-based simulation, or of experiments to investigate emergence of behavior in a cognitively realistic setting. This means that in a computer simulation, the agents are not fully rational and that they show behavior similar to that of humans, and that interests of agents are not necessarily always aligned. Experiments should focus on factors that are typical for human settings, but that are generally idealized away, such as altruism, conflicts of interests and other 'non-rational' behaviors. We are most interested in modeling emergence of speech, but we welcome applications proposing other areas, such as traffic or economic interactions.

 

What we are looking for

We are looking for an enthusiastic postdoc with a track record in artificial intelligence, cognitive science, linguistics or equivalent and who has either experience programming agent-based or cognitive models, or who has experience with the interaction between computer models and experiments. The starting date is negotiable, but preferably no later than September 2019.

 

How to apply

Send a recent CV detailing your academic record and your programming experience, as well as a letter of motivation, to prof. Bart de Boer. Be sure to include a short (1-page) outline of your proposed project in the letter of motivation, as well as a short planning. At this stage we ask you not to send copies of your diplomas or letters of reference; these we will request directly if we decide to further pursue your application. If you have any questions, please email prof. Bart de Boer.

 

Links

Context: https://ai.vub.ac.be/node/1687

Email Bart de Boer: bart@ai.vub.ac.be

 

 


6-18(2019-08-02) Research engineer or Post-doc at Eurecom, Inria, LIA, France

EURECOM (Nice, France), Inria (Nancy, France) and LIA (Avignon, France) are opening an
18-month Research Engineer or Postdoc position on speaker de-identification and voice
privacy.

For more information and to apply:
https://jobs.inria.fr/public/classic/en/offres/2019-01937


6-19(2019-08-02) Ph.D. position in Softbank robotics and Telecom-Paris, France

Ph.D. position in Softbank robotics and Telecom-Paris
 
Subject: Automatic multimodal recognition of users' social behaviors
in human-robot interactions (HRI)

*Places of work* Softbank Robotics [SB] (Paris 15e) & Telecom Paris [TP], Palaiseau (Paris outskirts)

*Starting date* December 2019

*Funding* CIFRE http://www.anrt.asso.fr/fr/cifre-7843

*Context*
The research activity of the Ph.D. candidate will contribute to:
- Softbank Robotics' robot software NAOqi, within the Expressivity team, which is responsible for ensuring an expressive, natural and fun interaction with our robots.
- the Social Computing topic [SocComp.] of the S2a team [SSA] at Telecom-ParisTech, in close collaboration with other researchers and Ph.D. students of the team.

* Candidate profile*
As a minimum requirement, the successful candidate should have:
•    A master's degree in one or more of the following areas: human-agent interaction, deep learning, computational linguistics, cognitive sciences, affective computing, reinforcement learning, natural language processing, speech processing
•    Excellent programming skills (preferably in Python)
•    Excellent command of English
•    Very good communication skills, commitment, an independent working style, as well as initiative and team spirit

Given the multidisciplinary aspect of the subject, priority will be given to multidisciplinary profiles. The applicant's interest in social robotics is required.

*Keywords* Human-Machine Interaction, Social Robotics, Deep Learning, Social Computing, Natural Language Processing, Speech Processing, Computer Vision, Multimodality

*Supervision*:
Industrial: Marine Chamoux (Softbank Robotics)
Academic: Chloé Clavel [Clavel], Giovanna Varni [Varni] (Telecom-Paris)

*How to apply*
Applications should be sent as soon as possible (the first review of applications will be made in early September). The application should be formatted as **a single pdf file** and should include:
•    A complete and detailed curriculum vitae
•    A letter of motivation
•    The academic credentials and the transcript of grades
•    The contact details of two referees

The pdf file should be sent to the three supervisors: mchamoux@softbankrobotics.com, chloe.clavel@telecom-paristech.fr, giovanna.varni@telecom-paristech.fr



*Description*
Social robotics, and more broadly human-agent interaction, is a field of human-machine interaction for which the integration of social behaviors is expected to have great potential. 'Socio-emotional behaviors' (emotions, social stances) thus include the role and the reactions of the user towards the robot during an interaction. These behaviors can be expressed differently depending:
- on the user (age, emotional state, ...): some users may have a dominant behavior with the robot, considering it a tool to achieve a goal. Others are more cooperative with the robot and can be more friendly with it. Still others try to trap or 'troll' the robot.
- on the interaction context (users do not behave in the same way when interacting with a Pepper robot selling toys or with a Pepper acting as a bank secretary). Besides, in each of these situations, the robot must be able to adapt its behavior and to provide a coherent interaction between the user and the robot, avoiding confusion and frustration.

This Ph.D. will focus on multimodal modeling for the prediction of the user's socio-emotional behaviors during interactions with a robot, and on building an engine that is robust to real-life scenarios and different contexts. In particular, the Ph.D. candidate will address the following points:
- the encoding of contextual multimodal representations relevant for the modeling of socio-emotional behavior. Thanks to the robot, we have access to a lot of information on context (market, robot intention, demographics, multi- or mono-user interaction, etc.) that could be combined with our multimodal representation.
- the development and evaluation of models that take advantage of the complementarity of modalities in order to monitor the evolution of the user's socio-emotional behaviors during the interaction (e.g., taking into account the inherent sequentiality of the interaction structure).
The models will be based on sequential neural approaches (recurrent networks) that integrate attention models, as a continuation of the work done in [Hemamou] and [Ben-Youssef19].

Selected references of the team:
[Hemamou] L. Hemamou, G. Felhi, V. Vandenbussche, J.-C. Martin, C. Clavel. HireNet: a Hierarchical Attention Model for the Automatic Analysis of Asynchronous Video Job Interviews. In AAAI 2019.
[Garcia] Alexandre Garcia, Chloé Clavel, Slim Essid, Florence d'Alché-Buc. Structured Output Learning with Abstention: Application to Accurate Opinion Prediction. ICML 2018.
[Clavel&Callejas] Clavel, C., Callejas, Z. Sentiment analysis: from opinion mining to human-agent interaction. IEEE Transactions on Affective Computing, 7.1 (2016) 74-93.
[Langlet] C. Langlet and C. Clavel. Improving social relationships in face-to-face human-agent interactions: when the agent wants to know user's likes and dislikes. In ACL 2015.
[Maslowski] Irina Maslowski, Delphine Lagarde, and Chloé Clavel. In-the-wild chatbot corpus: from opinion analysis to interaction problem detection. ICNLSSP 2017.
[Ben-Youssef17] Atef Ben-Youssef, Chloé Clavel, Slim Essid, Miriam Bilac, Marine Chamoux, and Angelica Lim. UE-HRI: a new dataset for the study of user engagement in spontaneous human-robot interactions. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 464-472. ACM, 2017.
[Ben-Youssef19] Atef Ben-Youssef, Chloé Clavel, Slim Essid. Early Detection of User Engagement Breakdown in Spontaneous Human-Humanoid Interaction. IEEE Transactions on Affective Computing, 2019.
[Varni] Varni, G., Hupont, I., Clavel, C., Chetouani, M. Computational Study of Primitive Emotional Contagion in Dyadic Interactions. IEEE Transactions on Affective Computing, 2017.

[SB] https://www.softbankrobotics.com/emea/fr
[TP] https://www.telecom-paristech.fr/eng/  
[SocComp.] https://www.tsi.telecom-paristech.fr/recherche/themes-de-recherche/analyse-automatique-des-donnees-sociales-social-computing/
[SSA] http://www.tsi.telecom-paristech.fr/ssa/#
[Clavel] https://clavel.wp.imt.fr/publications/
[Varni] https://sites.google.com/site/gvarnisite/




6-20(2019-08-03) Speech scientist at ETS Research

Speech scientist at ETS Research:

 

https://etscareers.pereless.com/index.cfm?fuseaction=83080.viewjobdetail&CID=83080&JID=290092


6-21(2019-08-12) Several positions in Forensic Speech Science or Forensic Data Science: Aston University, Birmingham, UK

Positions in Forensic Speech Science or Forensic Data Science:

- One Lecturer or Senior Lecturer

- Two Postdoctoral Researchers

 

Aston University, Birmingham, UK

 


Aston University has recently been awarded GBP 5.4 M from Research England's Expanding Excellence in England (E3) Fund. The money is being used to expand the existing Centre for Forensic Linguistics into the substantially larger Aston Institute for Forensic Linguistics (AIFL). As part of the expansion, we are building a research team with expertise in forensic speech science and in forensic data science. In addition to conducting research in forensic speech science, members of the team will work on forensic inference and statistics more broadly, and on quantitative-measurement and statistical-model based approaches in other branches of forensic science. The latter potentially include but are not limited to: fingerprints, face, gait, ballistics, blood pattern analysis, and linguistics. The Forensic Speech Science Laboratory and the Centre for Forensic Data Science will be headed by Dr Geoffrey Stewart Morrison, and, in addition to the affiliation with AIFL, will be affiliated with the Computer Science Department in the School of Engineering and Applied Science.

 

We are looking to recruit the following positions:

 

Lecturer or Senior Lecturer in Forensic Speech Science or Forensic Data Science

Reference:    R190354


Salary:    Grade 9 £40,792 – £48,677 or Grade 10 £50,132 – £58,089

Contract Type:    Continuing 

Basis:    Full time

Closing Date:    23.59 hours BST on September 30, 2019

Interview Date:    To be confirmed

 

Two Postdoctoral Researchers in Forensic Speech Science or Forensic Data Science

Reference:    R190353


Salary:   Grade 8 £33,199 – £39,609 or Grade 9 £40,792 – £48,677

Contract Type:    Fixed term (3 years) 

Basis:    Full time or part time

Closing Date:    23.59 hours BST on September 30, 2019

Interview Date:    To be confirmed

 

The Lecturer or Senior Lecturer position will be a full-time permanent position and will include teaching and administrative responsibilities. The position is costed as a Grade 9 Lecturer, but an exceptionally well qualified and experienced successful applicant could potentially be appointed as a Grade 10 Senior Lecturer. Note: 'Lecturer' is equivalent to North American 'Assistant Professor', 'Senior Lecturer' is equivalent to North American 'Associate Professor', and 'Reader / Associate Professor' is an occasionally used additional rank between Senior Lecturer and Professor.

 

The Postdoctoral Researcher positions may be filled as full-time appointments (preferred) or via a combination of part-time appointments. The Postdoctoral Researcher positions will be fixed-term, but the plan is to build a team that will be successful in obtaining additional research funding that will sustain these positions.

 

All new team members must have a commitment to solving forensic problems. Previous experience working on forensic problems would be advantageous, but not essential. A background in forensic speech science, in other branches of forensic science, and/or in forensic inference and statistics would be advantageous, but not essential. At least one of the new team members must have a strong background in state-of-the-art automatic speaker recognition, with an ability to implement systems. Other useful backgrounds for members of the team would include biometrics, machine learning, natural language processing, and acoustic phonetics.

 

Candidates may apply for both the Lecturer / Senior Lecturer and the Research Associate positions. If positions are not filled after this round of recruitment, we will initiate another round of recruitment.

 

We also welcome enquiries from individuals who have obtained or are applying for their own postdoctoral fellowships, e.g., Marie Skłodowska-Curie Fellowships. For suitable candidates we would assist with the application process.

 

Potential candidates are encouraged to contact Dr Geoffrey Stewart Morrison to seek more information about these positions.

Tel:    +44 121 204 3901

e-mail:    g.s.morrison@aston.ac.uk

 

Dr Morrison will be attending Interspeech in September and would be happy to meet informally with potential applicants there.

 

Please visit our website http://www.aston.ac.uk/jobs for further information and to apply online.

 

Aston University is an equal opportunities employer and welcomes applications from all sections of the community.


6-22(2019-08-14) Postdoc at KTH, Stockholm, Sweden
We are looking for a postdoc to conduct research in a multidisciplinary expedition project funded by Wallenberg AI, Autonomous Systems and Software Program (WASP), Sweden's largest individual research program, addressing compelling research topics that promise disruptive innovations in AI, autonomous systems and software for several years to come.
 
The project combines Formal Methods and Human-Robot Interaction with the goal of moving from conventional correct-by-design control with simple, static human models towards the synthesis of correct-by-design and socially acceptable controllers that consider complex human models based on empirical data. Two demonstrators, an autonomous driving scenario and a mobile robot navigation scenario in crowded social spaces, are planned to showcase the advances made in the project.
 
The focus of this position is on the development of data-driven models of human behavior that can be integrated with formal methods-based systems to better reflect real-world situations, as well as in the evaluation of the social acceptability of such systems. 
 
The candidate will work under the supervision of Assistant Prof. Iolanda Leite (https://iolandaleite.com/) and in close collaboration with another postdoctoral researcher working in the field of formal synthesis.
 
This is a two-year position. The starting date is open for discussion, but ideally, we would like the selected candidate to start ASAP.
 
 
QUALIFICATIONS
 
Candidates should have completed, or be near completion of, a Doctoral degree with a strong international publication record in areas such as (but not limited to) human-robot interaction, social robotics, multimodal perception, and artificial intelligence. Familiarity with formal methods, game theory, and control theory is an advantage.
 
Documented written and spoken English and programming skills are required. Experience with experimental design and statistical analysis is an important asset. Applicants must be strongly motivated, be able to work independently and possess good levels of cooperative and communicative abilities.
 
We look for candidates who are excited about being a part of a multidisciplinary team.
 
 
HOW TO APPLY
 
The application should include:
 
1. Curriculum vitae.
2. Transcripts from University/ University College.
3. A brief description of the candidate's research interests, including previous research and future goals (max 2 pages).
4. Contact of two references. We will contact the references only for selected candidates.
 
The application documents should be uploaded using KTH's recruitment system:
 
 
The application deadline is ** September 13, 2019 ** 
 
-----------------
Iolanda Leite
Assistant Professor
KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science
Division of Robotics, Perception and Learning (RPL)

Teknikringen 33, 4th floor, room 3424, SE-100 44 Stockholm, Sweden
Phone: +46-8 790 67 34
https://iolandaleite.com
 

6-23(2019-08-17) Fully funded PhD position at IDIAP, Martigny, Valais, Switzerland.

There is a fully funded PhD position open at Idiap Research Institute on spiking neural
architectures for speech prosody.

The research will build on work done recently at Idiap on creating tools for
physiologically plausible modelling of speech. The current 'toolbox' contains rudimentary
muscle models and means to drive these using conventional (deep) neural networks. The
main focus of the work will involve use of spiking neural networks such as the 'integrate
and fire' type that is broadly representative of those found in biological systems.
Whilst we have focused so far on prosody (actually intonation), the application is
open-ended; the focus is on the neural modelling. A key problem to be solved will be that of
training of the spiking networks, especially with the recurrence that is usual in such
networks. We hope to be able to train and use spiking networks as easily as conventional
backpropagation networks, and to shed light on current understanding of how biological
spiking networks learn (e.g., via spike timing-dependent plasticity).
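
For readers unfamiliar with the 'integrate and fire' model mentioned above, a minimal leaky integrate-and-fire simulation looks like this (our sketch; the constants are arbitrary demonstration values, not Idiap's):

import numpy as np

dt, tau = 1e-3, 20e-3                 # time step and membrane time constant (s)
v_rest, v_thresh, v_reset = 0.0, 1.0, 0.0
v, spike_times = v_rest, []
drive = np.abs(np.random.randn(1000)) * 2.0  # stand-in input drive (R*I, in volts)

for t, i_in in enumerate(drive):
    # tau * dv/dt = -(v - v_rest) + R*I, integrated with forward Euler
    v += (dt / tau) * ((v_rest - v) + i_in)
    if v >= v_thresh:                 # threshold crossing emits a spike...
        spike_times.append(t * dt)
        v = v_reset                   # ...and the membrane potential resets

The hard threshold is non-differentiable, which is precisely why training such networks is the key problem the position aims to address.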

For more information, and to apply, please follow this link:
 http://www.idiap.ch/education-and-jobs/job-10263

Idiap is located in Martigny in French speaking Switzerland, but functions in English and
hosts many nationalities.  PhD students are registered at EPFL. All positions offer quite
generous salaries.  Martigny has a distillery and a micro-brewery and is close to all
manner of skiing, hiking and mountain life.

There are other open positions on Idiap's main page
 https://www.idiap.ch/en/join-us/job-opportunities


6-24(2019-08-18) PhD positions at IRIT, Toulouse, France

Applications are invited for a three-year Early Stage Researcher PhD position in speech technology for pathological speech.

Description
The thesis focuses on studying the link between the internal representations of Deep Neural Networks (DNNs) and the subjective representation of speech intelligibility. We propose to explore the saliency detection capabilities of DNNs when used in a regression task for predicting speech intelligibility scores as given by human experts. By saliency, we mean retrieving which frequency bands are important and used by a DNN to make its predictions.
 
The final expectation is to identify regions of interest in the speech signal, both in time and frequency, that characterise the level of speech impairment.
 
The experiments will be carried out on various samples of speech produced by 150 people (100 patients and 50 healthy controls). This database was recorded within the INCA C2SI project and contains speech from patients treated for cancer of the oral cavity or pharynx. It also contains various metadata such as the location of the tumor, the impairment in terms of severity and intelligibility as assessed by human experts, self-evaluation questionnaires on the patient's quality of life, etc. Various tasks were recorded, such as a sustained vowel, read speech, nonsense words, prosodic exercises, picture description, etc.
There will also be the possibility to extend the work to another corpus composed of the voices of patients suffering from Parkinson's disease.
 
At first, the PhD student will build on the various analyses and descriptions produced during the C2SI project, which sought to correlate the impact of the tumor with communication ability. Those results will help establish the human representation of the impact of the disease. Then, a DNN will be modeled to fit the data, taking care of the data sparsity. The last part of the work will explore the internal representation of the DNN, investigating what parts of the signal help it make a decision about the impact of the disease; this is the final goal of the thesis: studying the automatic representation that lies in the model the student will propose.
 
This work is funded by the TAPAS project (https://www.tapas-etn-eu.org), a Horizon 2020 Marie Skłodowska-Curie Actions Innovative Training Network / European Training Network (MSCA-ITN-ETN) project that aims to transform the well-being of people across Europe with debilitating speech pathologies (e.g., due to stroke, Parkinson's, etc.). These groups face communication problems that can lead to social exclusion. They are now being further marginalised by a new wave of speech technology that is increasingly woven into everyday life but which is not robust to atypical speech.
 
 
The supervision of the PhD will take place at the IRIT laboratory, within the SAMoVA team in Toulouse. SAMoVA does research in the domain of 'analysis, modeling and structuring of audiovisual content'. The application areas are diverse: speech processing, language identification, speaker verification, and speech and music indexing. The researchers' expertise covers novel machine learning and audio processing technologies and is now focused on deep learning methods, leading to several publications in international conferences.
 

Eligibility Criteria:

 

Early Stage Researchers (ESRs) shall, at the time of recruitment by the host organization, be in the first four years (full-time equivalent research experience) of their research careers.

- The ESR may be a national of a Member State, of an Associated Country or of any Third Country.
- The ESR must not have resided or carried out her/his main activity (work, studies, etc.) in the country of her/his host organization for more than 12 months in the 3 years immediately prior to her/his recruitment.
- Holds a Master's degree or equivalent, which formally entitles the holder to embark on a Doctorate.
- Does not hold a PhD degree.


Duration of recruitment: 36 months.

 
Applications can be made through the website: https://www.tapas-etn-eu.org/positions/recruitment
Contact: Julie Mauclair (mauclair@irit.fr)

6-25(2019-08-25) Post-doc position at INRIA Rennes, France

Post-doc position: Pattern mining for Neural Networks debugging: application to speech recognition

Advisors: Elisa Fromont & Alexandre Termier, IRISA/INRIA RBA – Lacodam team (Rennes)

Irina Illina & Emmanuel Vincent, LORIA/INRIA – Multispeech team (Nancy)
firstname.lastname@inria.fr

Location: INRIA RBA, team Lacodam (Rennes)

Keywords: discriminative pattern mining, neural networks analysis, explainability of black
box models, speech recognition.

Deadline to apply: September 30th, 2019

Context:

Understanding the inner workings of deep neural networks (DNNs) has attracted a lot of attention in the past years [1, 2], and most problems were detected and analyzed using visualization techniques [3, 4]. Those techniques help to understand what an individual neuron or a layer of neurons is computing. We would like to go beyond this by focusing on groups of neurons which are commonly highly activated when a network is making wrong predictions on a set of examples. In the same line as [1], where the authors theoretically link how a training example affects the predictions for a test example using so-called 'influence functions', we would like to design a tool to 'debug' neural networks by identifying, using symbolic data mining methods, (connected) parts of the neural network architecture associated with erroneous or uncertain outputs.

In the context of speech recognition, this is especially important. A speech recognition system contains two main parts: an acoustic model and a language model. Nowadays both models are trained with deep neural network (DNN) based algorithms and use very large learning corpora to train a large number of DNN hyperparameters. There are many works on automatically tuning these hyperparameters. However, this induces a huge computational cost and does not empower the human designers. It would be much more efficient to provide human designers with understandable clues about the reasons for the bad performance of the system, in order to benefit from their creativity to quickly reach more promising regions of the hyperparameter search space.

Description of the position:

This position is funded in the context of the HyAIAI 'Hybrid Approaches for Interpretable AI' INRIA project lab (https://www.inria.fr/en/research/researchteams/inria-project-labs). With this position, we would like to go beyond the current common visualization techniques that help to understand what an individual neuron or a layer of neurons is computing, by focusing on groups of neurons that are commonly highly activated when a network is making wrong predictions on a set of examples. Tools such as activation maximization [8] can be used to identify such neurons. We propose to use discriminative pattern mining and, to begin with, the DiffNorm algorithm [6] in conjunction with the LCM algorithm [7] to identify the discriminative activation patterns among the identified neurons.

The data will be provided by the MULTISPEECH team and will consist of two deep architectures as representatives of acoustic and language models [9, 10]. Furthermore, the training data from which the model parameters ultimately derive will be provided. We will also extend our results by performing experiments with supervised and unsupervised learning to compare the features learned by these networks and to perform qualitative comparisons of the solutions learned by various deep architectures. Identifying 'faulty' groups of neurons could lead to the decomposition of the DL network into 'blocks' encompassing several layers. 'Faulty' blocks may be the first to be modified in the search for a better design.
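
As a toy version of this idea (not DiffNorm or LCM themselves, which the project would build on), one can binarize hidden activations and keep the small neuron sets that co-activate far more often on wrongly predicted examples than on correct ones; `acts` and `wrong` below are hypothetical inputs.

from collections import Counter
from itertools import combinations

def discriminative_patterns(acts, wrong, thresh=0.5, size=2, min_ratio=3.0):
    """Find neuron sets over-represented in erroneous predictions.

    acts: n_examples x n_neurons activation matrix (any nested sequence).
    wrong: booleans, True where the network's prediction is wrong.
    """
    active = [frozenset(j for j, a in enumerate(row) if a > thresh) for row in acts]
    bad, good = Counter(), Counter()
    for neurons, is_wrong in zip(active, wrong):
        for pattern in combinations(sorted(neurons), size):
            (bad if is_wrong else good)[pattern] += 1
    n_bad = max(sum(wrong), 1)
    n_good = max(len(wrong) - sum(wrong), 1)
    # Keep patterns whose relative support among errors dwarfs their
    # relative support among correct predictions.
    return [p for p, c in bad.items()
            if c / n_bad > min_ratio * (good[p] / n_good + 1e-9)]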

The recruited person will benefit from the expertise of the LACODAM team in pattern mining and deep learning (https://team.inria.fr/lacodam/) and of the expertise of the MULTISPEECH team (https://team.inria.fr/multispeech/) in speech analysis, language processing and deep learning. We would ideally like to recruit a post-doc for 1 year (with possibly one additional year) with the following preferred skills:
• Some knowledge of (or interest in) speech recognition
• Knowledge of pattern mining (discriminative pattern mining is a plus)
• Knowledge of machine learning in general and deep learning in particular
• Good programming skills in Python (for Keras and/or TensorFlow)
• Very good English (understanding and writing)

See the INRIA web site for details on the post-doc position.

The candidates should send a CV, the names of 2 referees and a cover letter to the four researchers (firstname.lastname@inria.fr) mentioned above. Please indicate whether you are applying for the post-doc or the PhD position. The selected candidates will be interviewed in September for an expected start in October-November 2019.

Bibliography:

[1] Pang Wei Koh, Percy Liang: Understanding Black-box Predictions via Influence Functions. ICML 2017: pp 1885-1894 (best paper).

[2] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals: Understanding deep learning requires rethinking generalization. ICLR 2017.

[3] Anh Mai Nguyen, Jason Yosinski, Jeff Clune: Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. CVPR 2015: pp 427-436.

[4] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, Rob Fergus: Intriguing properties of neural networks. ICLR 2014.

[5] Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, Wenchang Shi: Deep Text Classification Can be Fooled. IJCAI 2018: pp 4208-4215.

[6] Kailash Budhathoki and Jilles Vreeken. The difference and the norm: characterising similarities and differences between databases. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 206-223. Springer, 2015.

[7] Takeaki Uno, Tatsuya Asai, Yuzo Uchida, and Hiroki Arimura. LCM: An efficient algorithm for enumerating frequent closed item sets. In FIMI, volume 90. Citeseer, 2003.

[8] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.

[9] G. Saon, H.-K. J. Kuo, S. Rennie, M. Picheny: The IBM 2015 English conversational telephone speech recognition system, Proc. Interspeech, pp. 3140-3144, 2015.

[10] W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, A. Stolcke: The Microsoft 2017 Conversational Speech Recognition System, IEEE ICASSP, 2018.

Back  Top

6-26(2019-08-28) Speech technologist/linguist at Cobaltspeech.

Cobalt Speech & Language (http://www.cobaltspeech.com/ ) is looking for a speech technologist/linguist to help find and create language resources for a project in French Canadian.

The project is short-term (<2 months) and part-time (~5-8h a week), which makes it ideal for a student to gain experience with the speech industry.

The following skills are required:
  - Native French (does not have to be Canadian French, though that is desirable)
  - Able to communicate in English
  - Basic understanding of speech technology and linguistics
  - Ability to run a python script.

For more information, please contact Rasmus Dall (rasmus@cobaltspeech.com)

Back  Top

6-27(2019-09-04)PhD thesis proposal, GIPSA Lab Grenoble France

PhD thesis proposal

Incremental sequence-to-sequence mapping for speech generation using deep neural networks

September 4, 2019

1 Context and objectives

In recent years, deep neural networks have been widely used to address sequence-to-sequence (S2S) learning. S2S models can solve many tasks where source and target sequences have different lengths, such as automatic speech recognition, machine translation, speech translation, text-to-speech synthesis, etc. Recurrent, convolutional and transformer architectures, coupled with attention models, have shown their ability to capture and model complex temporal dependencies between a source and a target sequence of multidimensional discrete and/or continuous data. Importantly, end-to-end training alleviates the need to previously extract handcrafted features from the data by learning hierarchical representations directly from raw data (e.g. character string, video, speech waveform, etc.).

The most common models are composed of an encoder that reads the full input sequence (i.e. from its beginning to its end) before the decoder produces the corresponding output sequence. This implies a latency equal to the length of the input sequence. In particular, for a text-to-speech (TTS) system, the speech waveform is usually synthesized from a complete text utterance (e.g. a sequence of words with explicit begin/end-of-utterance markers). Such an approach cannot be used in a truly interactive scenario, in particular by a speech-handicapped person to communicate orally. Indeed, the interlocutor has to wait for the complete utterance to be typed before being able to listen to the synthetic voice, hence limiting the dynamics and naturalness of the interaction.

The goal of this project is to develop a general methodology for incremental sequence-to-sequence mapping, with application to interactive speech technologies. It will require the development of end-to-end classification and regression neural models able to deliver chunks of output data on-the-fly, from only a partial observation of input data. The goal is to learn an efficient policy that leads to an optimal trade-off between (variable) latency and accuracy of the decoding process. Possible strategies to decode the output data as soon as possible include: (i) predicting online 'the future' of the output sequence from 'the past and present' of the input sequence, with an acceptable tolerance to possible errors, or (ii) learning automatically from the data an optimal 'waiting policy' that prevents the model from outputting data when the uncertainty is too high. The developed methodology will be applied to address two speech processing problems: (i) incremental text-to-speech synthesis, in which speech is synthesized while the user is typing the text (possibly with a variable latency), and (ii) incremental speech enhancement/inpainting, in which portions of the speech signal are unintelligible because of sudden noise or speech production disorders, and must be replaced on-the-fly with reconstructed portions.
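As an illustration of the second family of strategies, here is a minimal Python sketch of a fixed-latency 'wait-k' policy, borrowed from simultaneous machine translation: the decoder waits until k source tokens have arrived, then emits one output token per new input token. The toy character-level 'model' is a stand-in for a real S2S network; all names are illustrative.

# Minimal sketch of a wait-k incremental decoding loop.
def toy_model(source_prefix, n_outputs):
    # Pretend S2S model: map the first n_outputs source tokens.
    return [c.upper() for c in source_prefix[:n_outputs]]

def wait_k_decode(source_stream, k=3):
    seen, emitted = [], []
    for token in source_stream:          # tokens arrive one by one
        seen.append(token)
        if len(seen) >= k:               # latency budget reached
            target = toy_model(seen, len(seen) - k + 1)
            while len(emitted) < len(target):
                emitted.append(target[len(emitted)])
                yield emitted[-1]        # emit on-the-fly
    # flush the remaining outputs once the source has ended
    for tok in toy_model(seen, len(seen))[len(emitted):]:
        yield tok

print(''.join(wait_k_decode("bonjour", k=3)))  # -> 'BONJOUR'

A learned waiting policy, as targeted by the project, would replace the fixed k with a data-driven decision on when the model is confident enough to emit.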

2 Work plan

The proposed work plan is the following:

- Bibliographic work on S2S neural models, in the context of speech recognition, speech synthesis, and machine translation, as well as their incremental (low-latency) variations.
- Investigating new architectures, losses, and training strategies toward incremental S2S models.
- Implementing and evaluating the proposed techniques in the context of end-to-end neural TTS systems (the baseline system may be a neural TTS trained with past information/left-context only).
- Implementing and evaluating the proposed techniques in the context of speech enhancement/inpainting, first on simulated noisy speech and then on pathological speech.

3 Requirements

We are looking for an outstanding and highly motivated PhD candidate to work on this subject. The following requirements are mandatory:

- Engineering degree and/or a Master's degree in Computer Science, Signal Processing or Applied Mathematics.
- Solid skills in machine learning. General knowledge in natural language processing and/or speech processing.
- Excellent programming skills (mostly in Python and deep learning frameworks).
- Good oral and written communication in English.
- Ability to work autonomously and in collaboration with supervisors and other team members.


4 Work context

Grenoble Alpes Univ. offers an excellent research environment with ample computing facilities, as well as remarkable surroundings to explore over the weekends. The PhD project will be funded by the Grenoble Artificial Intelligence Institute (MIAI). The PhD candidate will work both at GIPSA-lab (CRISSP team) and LIG-lab (GETALP team). The duration of the PhD is 3 years. The salary is between 1770 and 2100 euros gross per month (depending on complementary activity or not).

5 How to apply?

Applications should include a detailed CV; a copy of the last diploma; at least two references (people likely to be contacted); a cover letter of one page; a one-page summary of the Master thesis; and the two last transcripts of marks (Master or engineering school). Applications should be sent to thomas.hueber@gipsa-lab.fr, laurent.girin@gipsa-lab.fr and laurent.besacier@imag.fr. Applications will be evaluated as they are received: the position is open until it is filled.

Back  Top

6-28(2019-09-04) Postdoc proposal, GIPSA Lab Grenoble, France

Postdoc proposal

Spontaneous Speech Recognition. Application to Situated Corpora in French.

September 4, 2019

1 Postdoc Subject

The goal of the project is to advance the state of the art in spontaneous automatic speech recognition (ASR). Recent advances in ASR show excellent performance on tasks such as read speech ASR (Librispeech) and TV shows (MGB challenge), but what about spontaneous communicative speech?

This postdoc project would leverage existing transcribed corpora in French (more than 300 hours) recorded in everyday communication (speech recordings inside a family, in a shop, during an interview, etc.). One impact of the project would be the automation of transcription on very challenging data in order to feed linguistic and phonetic studies at scale.

Research topics:

- End-to-end ASR models
- Spontaneous speech ASR
- Colloquial speech transcription
- Data augmentation for spontaneous and colloquial language modelling (see the sketch below)
- Transcribing situated corpora
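As a toy illustration of what such augmentation could look like, here is a minimal Python sketch that injects fillers and word repetitions into written French so that a language model sees data closer to spontaneous speech; the filler inventory and probabilities are made-up illustrative choices, not project specifications.

# Illustrative sketch: augment written text with disfluencies.
import random

FILLERS = ["euh", "ben", "quoi", "hein", "tu vois"]

def spontaneify(sentence, p_filler=0.15, p_repeat=0.1, seed=None):
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        if rng.random() < p_filler:      # insert a filler before the word
            out.append(rng.choice(FILLERS))
        out.append(word)
        if rng.random() < p_repeat:      # simulate a word repetition
            out.append(word)
    return " ".join(out)

print(spontaneify("je vais au marché demain matin", seed=1))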

2 Requirements

We are looking for an outstanding and highly motivated postdoc candidate to work on this subject. The following requirements are mandatory:

- PhD degree in natural language processing or speech processing.
- Excellent programming skills (mostly in Python and deep learning frameworks).
- Interest in pluridisciplinary research (speech technology and speech science).
- Good oral and written communication in English (French is a plus but not mandatory).
- Ability to work autonomously and in collaboration with other team members.

3 Work context

Grenoble Alpes Univ. offers an excellent research environment with ample computing facilities, as well as remarkable surroundings to explore over the weekends. The postdoc project will be funded by the Grenoble Artificial Intelligence Institute (MIAI). The candidate will work both at LIG-lab (GETALP team) and LIDILEM-lab. The duration of the postdoc is 18 months.

4 How to apply?

Applications should include a detailed CV; a copy of the last diploma; at least two references (people likely to be contacted); a cover letter of one page; and a one-page summary of the PhD thesis. Applications should be sent to laurent.besacier@imag.fr. Applications will be evaluated as they are received: the position is open until it is filled.

Back  Top

6-29(2019-09-24) VOXCRIM 2019, Ecully France

VOXCRIM 2019

TUESDAY, SEPTEMBER 24, 2019, from 9:30 am to 5:00 pm

Talks and round table: cross-disciplinary perspectives on forensic voice comparison.

Registration before September 13:
voxcrim@interieur.gouv.fr
04 72 86 85 22

Service Central de la Police Technique et Scientifique
31 avenue Franklin Roosevelt
69130 ECULLY

 

Back  Top

6-30(2019-09-05) Post-doctoral position at IDIAP, Martigny, Switzerland

The Social Computing Group at Idiap is seeking a creative and motivated postdoctoral
researcher to work on deep learning methods for behavioral analysis from video and audio
data. This is an opening for a researcher with experience in deep learning applied to
dynamic human behavior (from voice, body, or face), in the context of a project funded by
Innosuisse, the Swiss funding agency for promotion of innovation.

The position offers the opportunity to do exciting work on deep learning and social
behavior. The researcher will work with Prof. Daniel Gatica-Perez and his research group.
Candidates should hold a PhD degree in computer science or engineering, with proven
experience in deep learning and a strong publication record.

Salaries are competitive and the starting date is immediate. Interviews will start upon
reception of applications until the positions are filled.

Interested candidates are invited to submit a cover letter, a detailed CV, and the names
of three references through Idiap's online recruitment system:

https://www.idiap.ch/en/join-us/job-opportunities
Position: Postdoctoral researcher in deep learning for social behavior analysis

Interested candidates can also contact Prof. Daniel Gatica-Perez (gatica@idiap.ch).

About Idiap Research Institute

Idiap is an independent, not-for-profit, research institute recognized and funded by the
Swiss Federal Government, the State of Valais, and the City of Martigny. Idiap is an
equal opportunity employer, and offers competitive salaries and excellent working
conditions in a dynamic and multicultural environment.

Idiap is located in the town of Martigny in Valais, a scenic region in the south of
Switzerland, surrounded by the highest mountains of Europe, and offering exceptional
quality of life, exciting recreational activities, including hiking, climbing and skiing,
as well as varied cultural activities, all within close proximity to Lausanne and Geneva.
English is the official working language.

Back  Top

6-31(2019-09-09) Postdoctoral Research Fellow/Senior Research Fellow, University of Tampere, Finland

Postdoctoral Research Fellow/Senior Research Fellow
(speech and language technology, cognitive science)

Tampere University and Tampere University of Applied Sciences create a unique environment for multidisciplinary, inspirational and high-impact research and education. Our university community has its competitive edges in technology, health and society. www.tuni.fi/en

The Speech and Cognition research group (SPECOG) is part of the Computing Sciences Unit of Tampere University within the Faculty of Information Technology and Communication Sciences. SPECOG focuses on interdisciplinary research at the intersection of speech technology and cognitive sciences. We apply advanced signal processing and machine learning methods to computational modeling of human language learning and perception, and study how human-like information processing principles can be applied in autonomous next-generation artificial intelligence (AI) systems. The group also conducts research and development in speech and language technology and in medical signal processing and machine learning. SPECOG collaborates with several internationally leading research groups within and across disciplinary boundaries, including joint research with computer scientists, psychologists, brain researchers, and linguists. The group is also closely affiliated with the audio and machine vision research groups of Tampere University.

More information on SPECOG: http://www.cs.tut.fi/sgn/specog/index.html

Job description

We are inviting applications for the position of a postdoctoral research fellow or senior research fellow in the areas of speech and language technology and cognitive science. The work will be conducted as a member of the SPECOG research group led by Asst. Prof. Okko Räsänen. We are looking for candidates who are interested in human and/or machine language processing, and who are willing to contribute to our highly cross-disciplinary research efforts in understanding language learning in humans and autonomous computational systems. Our current focus is on machine learning algorithms for unsupervised language learning from purely acoustic or audiovisual data (sometimes also known as zero-resource speech processing). However, we also consider candidates with a strong independent research agenda in complementary areas of speech and language technology.

In this position, the candidate is expected to:

1) carry out world-class research on a topic related to SPECOG focus areas,
2) work in close collaboration with other members of the research group, and
3) help to advise undergraduate and/or PhD projects on the relevant topics (with flexibility according to personal interests and career aspirations).

Requirements

The candidate should hold a doctoral degree (e.g., PhD or D.Sc. (Tech.)) in language technology, computer science, electrical engineering, cognitive science, or another relevant area. Candidates who have already completed their doctoral research work but have not yet received their doctoral certificate may also apply.

A successful candidate has strong expertise in signal processing and machine learning (e.g., deep learning), ideally from the context of speech technology. Applicants with a background in natural language processing (NLP) or cognitive science are also considered. Experience or interest in linguistics, neuroscience, or statistics is considered an advantage. Fluent programming (Python, Matlab, R, C++ or similar) and English skills are required.

Potential candidates must be capable of carrying out independent research at the highest international level. Competence must be demonstrated through several existing publications in internationally recognized peer-reviewed journals and conferences.

We offer

The position will be filled for a fixed-term period of two years, starting as soon as possible (but not extending the contract beyond the end of December 2021). A trial period of 6 months is applied to all new employees. The exact starting date is negotiable.

We offer a competitive academic salary, typically between 3500–4000 € for a starting postdoc depending on the experience of the candidate, and 4000–4500 € for a senior research fellow with several years of existing postdoctoral research experience in academia or industry. In addition, the position comes with extensive benefits such as occupational healthcare, excellent sports facilities, flexible working hours, and several restaurants and cafés on the campus with staff discounts. Traveling costs and daily allowances related to presenting peer-reviewed work at major international conferences are also normally covered.

How to apply

Send the application through the online portal at https://tuni.rekrytointi.com/paikat/?o=A_A&jid=301

We will accept applications until the position has been filled, but no later than 30th of November 2019 at 23.59 (GMT+3). Note that we will start evaluating the applicants already on 1st of October 2019, and the position may be filled as soon as a suitable candidate is found. We reserve the opportunity to recruit the candidate through other channels or to decide not to fill the position in case a suitable candidate is not found during the process.

The application should contain the following documents (all in .pdf format):

- A free-form letter of motivation for the position in question (max. 1 page)
- Academic CV with contact information
- A list of publications
- A copy of the doctoral degree certificate
- A letter or letters of recommendation (max. 3)

Please name all the documents as surname_CV.pdf, surname_list_of_publications.pdf, etc. Only the applications sent through the university application portal and containing the requested attachments in the instructed format will be considered in the recruitment process.

The most promising candidates will be interviewed in person or via Skype before the final decision.

For more information about the position, please contact Assistant Professor Okko Räsänen (firstname.surname@tuni.fi; no umlauts) by email.

About the research environment

Finland is among the most stable, free and safe countries in the world, based on prominent ratings by various agencies. It is also ranked as one of the top countries as far as social progress is concerned. Tampere is counted among the major academic hubs in the Nordic countries and offers a dynamic living environment. The Tampere region is one of the most rapidly growing urban areas in Finland and home to a vibrant knowledge-intensive entrepreneurial community. The city is an industrial powerhouse that enjoys a rich cultural scene and a reputation as a centre of Finland’s information society. Despite its growth, living in Tampere is highly affordable, with two-room apartment rents starting from approx. 550 €. In addition, the excellent public transport network enables quick, easy and cheap transportation around the city of Tampere and the university campuses.

Read more about Finland and Tampere:

https://www.visitfinland.com/about-finland/
https://finland.fi/
http://julkaisut.valtioneuvosto.fi/bitstream/handle/10024/161193/MEAEguide_18_2018_TervetuloaSuomeen_Eng_PDFUA.pdf
https://visittampere.fi/en/

Back  Top

6-32(2019-09-09) Doctoral Researcher, University of Tampere, Finland

Doctoral Researcher
(speech and language technology, cognitive science)

Tampere University and Tampere University of Applied Sciences create a unique environment for multidisciplinary, inspirational and high-impact research and education. Our university community has its competitive edges in technology, health and society. www.tuni.fi/en

The Speech and Cognition research group (SPECOG) is part of the Computing Sciences Unit of Tampere University within the Faculty of Information Technology and Communication Sciences. SPECOG focuses on interdisciplinary research at the intersection of speech technology and cognitive sciences. We apply advanced signal processing and machine learning methods to computational modeling of human language learning and perception, and study how human-like information processing principles can be applied in autonomous next-generation artificial intelligence (AI) systems. The group also conducts research and development in speech and language technology and in medical signal processing and machine learning. SPECOG collaborates with several internationally leading research groups within and across disciplinary boundaries, including joint research with computer scientists, psychologists, brain researchers, and linguists. The group is also closely affiliated with the audio and machine vision research groups of Tampere University.

More information on SPECOG: http://www.cs.tut.fi/sgn/specog/index.html

Job description

We are inviting applications for the position of a doctoral researcher (doctoral student) in the areas of speech and language technology and cognitive science. The work will be conducted as a member of the SPECOG research group led by Asst. Prof. Okko Räsänen. We are looking for candidates who are interested in human and/or machine language processing, and who are willing to contribute to our highly cross-disciplinary research efforts in understanding language learning in humans and autonomous computational systems. Our current focus is on machine learning algorithms for unsupervised language learning from purely acoustic or audiovisual data (sometimes also known as zero-resource speech processing). However, we also consider candidates with interest in complementary areas of speech and language technology.

In this position, the candidate is expected to:

1) carry out research on a mutually agreed topic,
2) complete a doctoral degree, including the mandatory course studies for a D.Sc. (Tech.) degree,
3) participate in the doctoral program, and
4) be available for assisting tasks in teaching and research group activities (max. 15% of working time).

Requirements

The candidate should hold a master’s degree in language technology, computer science, electrical engineering, mathematics, cognitive science, or another relevant technical area. Candidates who have already completed their master’s studies but are graduating during 2019 may also apply. Exceptional master’s students of Tampere University who are close to graduation can also be considered for the position. In this case, the candidate is first employed as a Research Assistant to carry out a master’s thesis project (6 months) on the topic, with the possibility, upon a successful thesis project, to continue to doctoral studies.

A successful candidate has experience in signal processing and/or machine learning (e.g., deep learning), ideally from the context of speech technology. Applicants with a background in natural language processing (NLP) or cognitive science are also considered. Experience or interest in linguistics, neuroscience, or statistics is considered an advantage but not required. Good command of programming (Python, Matlab, R, C++ or similar) and English skills are required.

Potential candidates must be capable of carrying out independent research work but are also good team players. Previous research experience, such as research internships or other research projects, is considered a significant advantage.

We offer

The position will be filled for a fixed-term period of two years with a view to extension. A trial period of 6 months is applied to all new employees. The position starts in January 2020 or as soon as possible, with a negotiable exact starting date. The target completion time for doctoral studies is 4 years.

We offer a starting salary of 2300 € for a starting doctoral researcher, with later increases based on demonstrated progress through scientific publications and acquired study credits. In addition, the position comes with extensive benefits such as occupational healthcare, excellent sports facilities, flexible working hours, and several restaurants and cafés on the campus with staff discounts. Traveling costs and daily allowances related to presenting peer-reviewed work at major international conferences are also normally covered.

How to apply

Send your application through the online portal at https://tuni.rekrytointi.com/paikat/?o=A_A&jid=299

We will accept applications until 15th of November 2019 at 23.59 (GMT+3). We reserve the opportunity to recruit the candidate through other channels or to decide not to fill the position in case a suitable candidate is not found during the process.

The application should contain the following documents (all in .pdf format):

- A free-form letter of motivation for the position in question (max. 1 page)
- Complete CV with contact information and a list of publications (if any)
- A copy of the master’s degree certificate
- An English language certificate of proficiency (for non-native and non-Finnish applicants)

Please name all the documents as surname_CV.pdf, surname_list_of_publications.pdf, etc. Only the applications sent through the university application portal and containing the requested attachments in the instructed format will be considered in the recruitment process.

The most promising candidates will be interviewed in person or via Skype before the final decision.

For more information about the position, please contact Assistant Professor Okko Räsänen (firstname.surname@tuni.fi; no umlauts) by email.

About the research environment

Finland is among the most stable, free and safe countries in the world, based on prominent ratings by various agencies. It is also ranked as one of the top countries as far as social progress is concerned. Tampere is counted among the major academic hubs in the Nordic countries and offers a dynamic living environment. The Tampere region is one of the most rapidly growing urban areas in Finland and home to a vibrant knowledge-intensive entrepreneurial community. The city is an industrial powerhouse that enjoys a rich cultural scene and a reputation as a centre of Finland’s information society. Despite its growth, living in Tampere is highly affordable, with private-market two-room apartment rents starting from approx. 550 €. In addition, the excellent public transport network enables quick, easy and cheap transportation around the city of Tampere and the university campuses.

Read more about Finland and Tampere:

https://www.visitfinland.com/about-finland/
https://finland.fi/
http://julkaisut.valtioneuvosto.fi/bitstream/handle/10024/161193/MEAEguide_18_2018_TervetuloaSuomeen_Eng_PDFUA.pdf
https://visittampere.fi/en/

Back  Top

6-33(2019-09-09) Postdoc position at IRIT, Toulouse, France

READYNOV project: AUDIOCAP

Hearing and disability in noise: towards restoring speech intelligibility

Type of position: POSTDOC

Research context: restoring speech intelligibility in noise for elderly people via hearing aids.

Keywords: speech, noise, intelligibility

Missions:

Prediction of speech intelligibility in noise:
- getting familiar with an automatic speech recognition system for French,
- acoustic modeling in noise.

Development of a tool for separating speech from noise, based on the application of time-frequency filters. This tool will be 'tuned' so as to favor speech intelligibility (a sketch of the underlying idea is given below).
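As a rough illustration of the time-frequency filtering idea, here is a minimal Python sketch using an 'oracle' binary mask computed from known clean signals; a deployed system would of course have to estimate the mask from the noisy mixture alone, e.g. with a DNN. Signals and parameters are toy choices.

# Minimal sketch of speech/noise separation with a time-frequency mask.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 220 * t)             # toy 'speech'
noise = 0.5 * np.random.default_rng(0).standard_normal(fs)
mix = speech + noise

_, _, S = stft(speech, fs, nperseg=512)
_, _, N = stft(noise, fs, nperseg=512)
_, _, M = stft(mix, fs, nperseg=512)

mask = (np.abs(S) > np.abs(N)).astype(float)     # ideal binary mask
_, enhanced = istft(mask * M, fs, nperseg=512)   # keep speech-dominated bins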

Skills:
- software development
- signal processing
- machine learning (deep learning)

Location: IRIT, 118 route de Narbonne, 31062 TOULOUSE

Dates and duration: 12 to 18 months, starting October 1st, 2019

Salary: between 1900 and 2400 € net per month, depending on experience

Documents to provide:
- detailed CV
- cover letter
- one-page summary of the PhD thesis

Contact: Julien PINQUIER, pinquier@irit.fr

Back  Top

6-34(2019-09-05) R/D position at Zaion, Paris France

ZAION is a fast-growing innovative company specialized in conversational robot technology: callbots and chatbots integrating artificial intelligence.

ZAION has developed a solution building on more than 20 years of experience in customer relations. This technologically disruptive solution has been very well received internationally and we already count 18 active clients (GENERALI, MNH, APRIL, CROUS, EUROP ASSISTANCE, PRO BTP, etc.). We are currently among the only companies in the world to offer this type of solution entirely focused on performance. Joining us means taking part in an exciting and innovative adventure within an ambitious team, with the goal of becoming the reference on the conversational robot market.

As part of its development, ZAION is recruiting a Data Scientist / Machine Learning engineer applied to audio (M/F). Within the R&D team, your role is strategic for the development and expansion of the company. You will develop a solution that detects emotions in conversations. We want to extend the cognitive abilities of our callbots so that they can detect the emotions of their interlocutors (joy, stress, anger, sadness, etc.) and adapt their answers accordingly.

Your main missions:

- Take part in the creation of ZAION's R&D department and lead, upon arrival, your first project on emotion recognition in the voice
- Build, adapt and evolve our services for detecting emotion in the voice
- Analyze large databases of conversations to extract the emotionally relevant ones
- Build a database of conversations labeled with emotional tags
- Train and evaluate machine learning models for emotion classification (a sketch of a baseline pipeline is given below)
- Deploy your models to production
- Continuously improve the voice emotion detection system
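As a minimal illustration of such a classification pipeline, here is a classical baseline sketch (mean MFCC features plus an SVM) in Python; a production system would rather rely on deep models trained on a large labeled corpus, and the 'data' below is synthetic stand-in material, not real emotional speech.

# Minimal sketch: mean MFCC features + SVM for utterance classification.
import numpy as np
import librosa
from sklearn.svm import SVC

def utterance_features(y, sr=16000):
    # Mean MFCC vector: one fixed-size descriptor per utterance.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
# Toy stand-ins: a tone and noise play the role of two emotion classes.
class_a = [np.sin(2 * np.pi * 200 * t) + 0.1 * rng.standard_normal(16000)
           for _ in range(10)]
class_b = [rng.standard_normal(16000) for _ in range(10)]
X = np.array([utterance_features(y) for y in class_a + class_b])
labels = [0] * 10 + [1] * 10

clf = SVC().fit(X, labels)
print(clf.predict(X[[0, 12]]))  # expected: [0 1]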

Required qualifications and prior experience:

- At least 2 years of experience as a Data Scientist / Machine Learning engineer applied to audio
- Engineering school or Master's degree in computer science, or a PhD in computer science/mathematics, with solid skills in signal processing (preferably audio)
- Solid theoretical background in machine learning and in the relevant mathematical areas (clustering, classification, matrix factorization, Bayesian inference, deep learning, etc.)
- Experience deploying machine learning models in a production environment is a plus
- Proficiency in one or more of the following: Python, machine learning / deep learning frameworks (PyTorch, TensorFlow, scikit-learn, Keras) and JavaScript
- Mastery of audio signal processing techniques
- Proven experience in labeling large databases (preferably audio) is essential
- Personality: a leader, autonomous, passionate about your work, able to lead a team in project mode
- Fluent English

Please send your application to: alegentil@zaion.ai

Back  Top

6-35(2019-09-15) Post-doc and research engineer at INSA, Rouen, Normandy, France

Post-doctoral position (1 year): Perception for interaction and social navigation

Research Engineer (1 year): Social Human-Robot Interactions

Laboratory: LITIS, INSA Rouen Normandy, France

Project: INCA (Natural Interactions with Artificial Companions)

Summary:

The emergence of interactive robots and connected objects has led to the appearance of symbiotic systems made up of human users, virtual agents and robots in social interactions. However, two major scientific difficulties remain unsolved: on the one hand, the recognition of human activity remains inaccurate, both at the operational level (location, mapping and identification of objects and users) and at the cognitive level (recognition and tracking of users' intentions); on the other hand, interaction involves different modalities that must be adapted according to the context, the user and the situation. The INCA project aims at developing artificial companions (interactive robots and virtual agents) with a particular focus on social interactions. Our goal is to develop new models and algorithms for intelligent companions capable of (1) perceiving and representing an environment (real, virtual or mixed) consisting of objects, robots and users; (2) interacting with users in a natural way to assess their needs, preferences, and engagement; (3) learning models of user behavior and (4) generating semantically adequate and socially appropriate responses.

 

Post-doctoral position in perception for interaction and social navigation (1 position)

The candidate will work to ensure that a robot can recognize the physical content of the scene surrounding it: recognize itself, static and dynamic objects (users and other robots), and finally predict the movement of dynamic elements. The integration of data from different sensors should allow the mapping of an unknown environment and the estimation of the position of the robot. First, VSLAM (Visual Simultaneous Localization And Mapping) techniques (Saputra 2018) will be used to map the scene. The regions (or points) of interest detected could then be used to detect obstacles. In order to distinguish between static and dynamic objects, methods separating the background from the foreground of the scene (Kajo et al., 2018) will be used. Finally, recent techniques of the FlowNet 2.0 type (Eddy et al., 2017) for motion prediction on a video sequence should make it possible to predict the next movement of a dynamic object and to apprehend its behavior.
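As a small illustration of the motion-estimation building block, here is a minimal Python sketch computing dense optical flow with OpenCV's classical Farneback method, a non-learned baseline for FlowNet-style approaches; the two frames are synthetic and all parameters are illustrative.

# Minimal sketch: dense optical flow between two synthetic frames.
import numpy as np
import cv2

# A bright square that moves 5 pixels to the right between frames.
prev = np.zeros((64, 64), dtype=np.uint8)
prev[20:30, 20:30] = 255
curr = np.roll(prev, 5, axis=1)

# Farneback dense optical flow: flow[y, x] = (dx, dy) per pixel.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

dx = flow[20:30, 20:35, 0].mean()  # average motion over the object area
print(f"estimated horizontal shift: {dx:.1f} px (ground truth: 5)")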

Profile: the candidate must have strong skills in mobile robotics and navigation techniques (VSLAM, ORB-SLAM, optical flow, stereovision, etc.) and strong programming abilities under ROS or any other programming language compatible with robotics. Machine learning and deep learning skills will be highly appreciated.

Research Engineer in Social Human-Robot Interactions (1 position)

The hired research engineer will work closely with the INCA research staff (permanent, PhD and post-doctoral members) and other project partners. This will mainly involve administering the project's Pepper robots, developing the necessary tools, integrating the algorithms developed with the AgentSlang platform (https://agentslang.github.io/), and joining the team created to participate in RoboCup 2020 in Bordeaux, @Home league.

Profile: Computer Sciences / Robotics Engineer

  • Good level in programming (ROS, Python, possibly Java)

  • Strong knowledge in robotics

  • Experiences in some of the following areas would be a plus (non-exhaustive list): machine learning, human-machine social interactions, scene perception, spatio-temporal and semantic representation, natural language dialogue.

Duration and remuneration: 1 year, 2480 euros/month (gross salary)

Applications should be sent to: alexandre.pauchet@insa-rouen.fr

  • Curriculum vitae

  • Cover letter

  • Recommendation letters

  • Recently graduated students: transcripts

Back  Top

6-36(2019-09-20) Poste ATER, Paris Sorbonne, France

An ATER (temporary teaching and research associate) position in Computer Science is available at the Faculty of Letters of Sorbonne Université. Applications can be submitted at http://lettres.sorbonne-universite.fr/ater

Back  Top

6-37(2019-09-21) Post-doc/PhD position, LORIA, Nancy, France
Post-doc/PhD position: Pattern mining for neural network debugging, with application to speech recognition
 
Advisors: Elisa Fromont & Alexandre Termier, IRISA/INRIA RBA, Lacodam team (Rennes);
Irina Illina & Emmanuel Vincent, LORIA/INRIA, Multispeech team (Nancy)

firstname.lastname@inria.fr

Location: INRIA RBA, team Lacodam (Rennes)

Deadline to apply: October 30th, 2019
 
Starting date: December 2019 - January 2020
 
Keywords: discriminative pattern mining, neural networks analysis, explainability of blackbox models, speech recognition.

Context:
Understanding the inner workings of deep neural networks (DNN) has attracted a lot of attention in the past years [1, 2] and most problems were detected and analyzed using visualization techniques [3, 4]. Those techniques help to understand what an individual neuron or a layer of neurons is computing. We would like to go beyond this by focusing on groups of neurons which are commonly highly activated when a network is making wrong predictions on a set of examples. In the same line as [1], where the authors theoretically link how a training example affects the predictions for a test example using so-called 'influence functions', we would like to design a tool to 'debug' neural networks by identifying, using symbolic data mining methods, (connected) parts of the neural network architecture associated with erroneous or uncertain outputs.

In the context of speech recognition, this is especially important. A speech recognition system contains two main parts: an acoustic model and a language model. Nowadays both models are trained with deep neural network (DNN) based algorithms and use very large learning corpora to tune a large number of DNN hyperparameters. Many works aim to tune these hyperparameters automatically. However, this induces a huge computational cost, and does not empower the human designers. It would be much more efficient to provide human designers with understandable clues about the reasons for the bad performance of the system, in order to benefit from their creativity to quickly reach more promising regions of the hyperparameter search space.

Description of the position:

This position is funded in the context of the HyAIAI ('Hybrid Approaches for Interpretable AI') INRIA project lab (https://www.inria.fr/en/research/researchteams/inria-project-labs). With this position, we would like to go beyond the current common visualization techniques that help to understand what an individual neuron or a layer of neurons is computing, by focusing on groups of neurons that are commonly highly activated when a network is making wrong predictions on a set of examples. Tools such as activation maximization [8] can be used to identify such neurons. We propose to use discriminative pattern mining, and, to begin with, the DiffNorm algorithm [6] in conjunction with the LCM algorithm [7] to identify the discriminative activation patterns among the identified neurons.
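As a rough illustration of the neuron-identification step, here is a minimal PyTorch sketch of activation maximization [8] on a toy fully-connected network: gradient ascent on the input finds a pattern that maximally excites one chosen hidden neuron. The network and all names are illustrative stand-ins, not the project's actual models.

# Minimal sketch of activation maximization on a toy network.
import torch

torch.manual_seed(0)
# Toy two-layer network standing in for an acoustic or language model.
net = torch.nn.Sequential(
    torch.nn.Linear(20, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 10))
hidden = torch.nn.Sequential(*list(net.children())[:2])  # up to the ReLU

def maximize_activation(layer, neuron, steps=200, lr=0.1):
    # Gradient ascent on the input to excite one chosen hidden neuron.
    x = torch.zeros(1, 20, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -layer(x)[0, neuron] + 1e-3 * x.norm()  # maximize activation
        loss.backward()
        opt.step()
    return x.detach()

pattern = maximize_activation(hidden, neuron=5)
print(hidden(pattern)[0, 5].item())  # activation reached by the found input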

The data will be provided by the MULTISPEECH team and will consist of two deep architectures, representative of acoustic and language models [9, 10]. Furthermore, the training data from which the model parameters ultimately derive will be provided. We will also extend our results by performing experiments with supervised and unsupervised learning to compare the features learned by these networks and to perform qualitative comparisons of the solutions learned by various deep architectures. Identifying 'faulty' groups of neurons could lead to the decomposition of the DL network into 'blocks' encompassing several layers. 'Faulty' blocks may be the first to be modified in the search for a better design.

The recruited person will benefit from the expertise of the LACODAM team in pattern mining and deep learning (https://team.inria.fr/lacodam/) and of the expertise of the MULTISPEECH team (https://team.inria.fr/multispeech/) in speech analysis, language processing and deep learning. We would ideally like to recruit a 1-year (with possibly one additional year) post-doc with the following preferred skills:

- Some knowledge of (or interest in) speech recognition
- Knowledgeable in pattern mining (discriminative pattern mining is a plus)
- Knowledgeable in machine learning in general and deep learning in particular
- Good programming skills in Python (for Keras and/or TensorFlow)
- Very good English (understanding and writing)

However, good PhD applications will also be considered and, in this case, the position will last 3 years. The position will be funded by INRIA (https://www.inria.fr/en/). See the INRIA web site for the post-doc and PhD salaries.

The candidates should send a CV, the names of 2 referees and a cover letter to the four researchers (firstname.lastname@inria.fr) mentioned above. Please indicate whether you are applying for the post-doc or the PhD position. The selected candidates will be interviewed for an expected start in December 2019 - January 2020.

Bibliography:
[1] Pang Wei Koh, Percy Liang: Understanding Black-box Predictions via Influence Functions. ICML 2017: pp 1885-1894 (best paper).
[2] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals: Understanding deep learning requires rethinking generalization. ICLR 2017.
[3] Anh Mai Nguyen, Jason Yosinski, Jeff Clune: Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. CVPR 2015: pp 427-436.
[4] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, Rob Fergus: Intriguing properties of neural networks. ICLR 2014.
[5] Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, Wenchang Shi: Deep Text Classification Can be Fooled. IJCAI 2018: pp 4208-4215.
[6] Kailash Budhathoki and Jilles Vreeken. The difference and the norm: characterising similarities and differences between databases. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 206-223. Springer, 2015.
[7] Takeaki Uno, Tatsuya Asai, Yuzo Uchida, and Hiroki Arimura. LCM: An efficient algorithm for enumerating frequent closed item sets. In FIMI, volume 90. Citeseer, 2003.
[8] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
[9] G. Saon, H.-K. J. Kuo, S. Rennie, M. Picheny: The IBM 2015 English conversational telephone speech recognition system?, Proc. Interspeech, pp. 3140-3144, 2015.
[10] W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, A. Stolcke : The Microsoft 2017 Conversational Speech Recognition System, IEEE ICASSP, 2018.
 
Back  Top

6-38(2019-09-22) Postdoc position at Grenoble Alps University, Grenoble, France

Postdoc proposal

Spontaneous Speech Recognition. Application to Situated Corpora in French.

September 4, 2019

1 Postdoc Subject

The goal of the project is to advance the state of the art in spontaneous automatic speech recognition (ASR). Recent advances in ASR show excellent performance on tasks such as read speech ASR (Librispeech) and TV shows (MGB challenge), but what about spontaneous communicative speech?

This postdoc project would leverage existing transcribed corpora in French (more than 300 hours) recorded in everyday communication (speech recordings inside a family, in a shop, during an interview, etc.). One impact of the project would be the automation of transcription on very challenging data in order to feed linguistic and phonetic studies at scale.

Research topics:

- End-to-end ASR models
- Spontaneous speech ASR
- Colloquial speech transcription (see the evaluation sketch below)
- Data augmentation for spontaneous and colloquial language modelling
- Transcribing situated corpora
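Since progress on all of these topics is typically measured with the word error rate (WER), here is a minimal self-contained Python sketch of its computation via word-level Levenshtein alignment; the reference and hypothesis strings below are made up.

# Minimal sketch: word error rate (WER), the standard ASR metric.
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

ref = "il y a des enregistrements dans la famille"
hyp = "il y a des enregistrement dans la des famille"
print(f"WER = {wer(ref, hyp):.2f}")  # 1 substitution + 1 insertion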

2 Requirements

We are looking for an outstanding and highly motivated postdoc candidate to work on this subject. The following requirements are mandatory:

- PhD degree in natural language processing or speech processing.
- Excellent programming skills (mostly in Python and deep learning frameworks).
- Interest in pluridisciplinary research (speech technology and speech science).
- Good oral and written communication in English (French is a plus but not mandatory).
- Ability to work autonomously and in collaboration with other team members.

3 Work context

Grenoble Alpes Univ. offers an excellent research environment with ample computing facilities, as well as remarkable surroundings to explore over the weekends. The postdoc project will be funded by the Grenoble Artificial Intelligence Institute (MIAI). The candidate will work both at LIG-lab (GETALP team) and LIDILEM-lab. The duration of the postdoc is 18 months.

4 How to apply?

Applications should include a detailed CV; a copy of the last diploma; at least two references (people likely to be contacted); a cover letter of one page; and a one-page summary of the PhD thesis. Applications should be sent to laurent.besacier@imag.fr. Applications will be evaluated as they are received: the position is open until it is filled.

Back  Top

6-39(2019-09-22) PhD thesis proposal at Grenble Alps University, Grenoble, France

PhD thesis proposal

Incremental sequence-to-sequence mapping for speech generation using deep neural networks

September 4, 2019

1 Context and objectives

In recent years, deep neural networks have been widely used to address sequence-to-sequence (S2S) learning. S2S models can solve many tasks where source and target sequences have different lengths, such as automatic speech recognition, machine translation, speech translation, text-to-speech synthesis, etc. Recurrent, convolutional and transformer architectures, coupled with attention models, have shown their ability to capture and model complex temporal dependencies between a source and a target sequence of multidimensional discrete and/or continuous data. Importantly, end-to-end training alleviates the need to previously extract handcrafted features from the data by learning hierarchical representations directly from raw data (e.g. character string, video, speech waveform, etc.).

The most common models are composed of an encoder that reads the full input sequence (i.e. from its beginning to its end) before the decoder produces the corresponding output sequence. This implies a latency equal to the length of the input sequence. In particular, for a text-to-speech (TTS) system, the speech waveform is usually synthesized from a complete text utterance (e.g. a sequence of words with explicit begin/end-of-utterance markers). Such an approach cannot be used in a truly interactive scenario, in particular by a speech-handicapped person to communicate orally. Indeed, the interlocutor has to wait for the complete utterance to be typed before being able to listen to the synthetic voice, hence limiting the dynamics and naturalness of the interaction.

The goal of this project is to develop a general methodology for incremental sequence-to-sequence mapping, with application to interactive speech technologies. It will require the development of end-to-end classification and regression neural models able to deliver chunks of output data on-the-fly, from only a partial observation of input data. The goal is to learn an efficient policy that leads to an optimal trade-off between (variable) latency and accuracy of the decoding process. Possible strategies to decode the output data as soon as possible include: (i) predicting online 'the future' of the output sequence from 'the past and present' of the input sequence, with an acceptable tolerance to possible errors, or (ii) learning automatically from the data an optimal 'waiting policy' that prevents the model from outputting data when the uncertainty is too high. The developed methodology will be applied to address two speech processing problems: (i) incremental text-to-speech synthesis, in which speech is synthesized while the user is typing the text (possibly with a variable latency), and (ii) incremental speech enhancement/inpainting, in which portions of the speech signal are unintelligible because of sudden noise or speech production disorders, and must be replaced on-the-fly with reconstructed portions.
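As a rough, deliberately non-incremental illustration of the inpainting idea, here is a minimal Python sketch that fills a corrupted span of a magnitude spectrogram by linear interpolation between its intact neighbors; a real system would instead predict the missing frames with a neural model, on the fly. Signal, frame indices and parameters are toy choices.

# Minimal (non-incremental) sketch of spectrogram-domain inpainting.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 300 * t)              # toy 'speech'
x[6000:8000] = 0.0                           # corrupted, unintelligible span

_, _, X = stft(x, fs, nperseg=512)           # hop = 256 samples
mag, phase = np.abs(X), np.angle(X)

lo, hi = 22, 33                              # frames covering the gap
w = np.linspace(0.0, 1.0, hi - lo)[None, :]  # interpolation weights
mag[:, lo:hi] = (1 - w) * mag[:, [lo - 1]] + w * mag[:, [hi]]

_, y = istft(mag * np.exp(1j * phase), fs, nperseg=512)
# y now contains the signal with the gap smoothly (if crudely) filled.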

2 Work plan

The proposed work plan is the following:

- Bibliographic work on S2S neural models, in the context of speech recognition, speech synthesis, and machine translation, as well as their incremental (low-latency) variations.
- Investigating new architectures, losses, and training strategies toward incremental S2S models.
- Implementing and evaluating the proposed techniques in the context of end-to-end neural TTS systems (the baseline system may be a neural TTS trained with past information/left-context only).
- Implementing and evaluating the proposed techniques in the context of speech enhancement/inpainting, first on simulated noisy speech and then on pathological speech.

3 Requirements

We are looking for an outstanding and highly motivated PhD candidate to work on this subject. The following requirements are mandatory:

- Engineering degree and/or a Master's degree in Computer Science, Signal Processing or Applied Mathematics.
- Solid skills in machine learning. General knowledge in natural language processing and/or speech processing.
- Excellent programming skills (mostly in Python and deep learning frameworks).
- Good oral and written communication in English.
- Ability to work autonomously and in collaboration with supervisors and other team members.

4 Work context

Grenoble Alpes Univ. offers an excellent research environment with ample computing facilities, as well as remarkable surroundings to explore over the weekends. The PhD project will be funded by the Grenoble Artificial Intelligence Institute (MIAI). The PhD candidate will work both at GIPSA-lab (CRISSP team) and LIG-lab (GETALP team). The duration of the PhD is 3 years. The salary is between 1770 and 2100 euros gross per month (depending on complementary activity or not).

5 How to apply?

Applications should include a detailed CV; a copy of the last diploma; at least two references (people likely to be contacted); a cover letter of one page; a one-page summary of the Master thesis; and the two last transcripts of marks (Master or engineering school). Applications should be sent to thomas.hueber@gipsa-lab.fr, laurent.girin@gipsa-lab.fr and laurent.besacier@imag.fr. Applications will be evaluated as they are received: the position is open until it is filled.

Back  Top

6-40(2019-10-18) Journées d’étude sur la convergence, LPL, Aix en Provence, France

Workshop on convergence (Journées d'étude sur la convergence)

October 18-19, 2019
Laboratoire Parole et Langage, Aix-en-Provence

Organized by:
Le Cercle Linguistique d'Aix-en-Provence (CLAIX)
The Systèmes & Usages team of the Laboratoire Parole et Langage (LPL)

Contacts: Sibylle Kriegel and Sophie Herment (LPL/AMU)

Program

Friday, October 18, 2019

9:30-10:15 Welcome
10:15-11:15 Debra Ziegeler, invited speaker (U. Sorbonne Nouvelle): The future of already in Singapore English: a matter of selective convergence
11:15-11:45 Coffee break
11:45-12:30 Diana Lewis (AMU, LPL): Grammaticalisation de lexème, de construction : deux cas de développement adverbial en anglais
12:30-14:15 Lunch
14:15-15:00 James German (AMU, LPL): Linguistic adaptation as an automatic response to socio-indexical cues
15:00-15:45 Daniel Véronique (AMU, LPL): L'« agglutination nominale » dans les langues créoles françaises : un exemple de convergence ?
15:45-16:15 Coffee break
16:15-17:00 Chady Shimeen-Khan (U. Paris Descartes, CEPED): Convergences et divergences à des fins discursives à travers l'usage des marqueurs discursifs chez les jeunes Mauriciens plurilingues
17:00-17:45 Sibylle Kriegel (AMU, LPL): Créolisation et convergence : l'expression du corps comme marque du réfléchi
17:45-18:15 Charles Zaremba: Le CLAIX, Cercle Linguistique d'Aix-en-Provence, a retrospective

Saturday, October 19

10:00-10:45 Akissi Béatrice Boutin (ILA-UFHB, Abidjan-Cocody): Réanalyses avec et sans convergence dans le plurilinguisme ivoirien
10:45-11:30 Massinissa Garaoun (AMU): Convergence linguistique et cycles : le cas de la négation en arabe maghrébin et en berbère
11:30-12:00 Coffee break
12:00-12:45 Nicolas Tournadre (AMU, LACITO): Phénomènes de copie et de convergence dans les langues du Tibet et de l'Himalaya
12:45-13:30 Cyril Aslanov (AMU, LPL): Convergence and secondary entropy in a macrodiachronic perspective

Back  Top

6-41(2019-10-05) Post-doc position at the Laboratoire national de métrologie et d'essais (LNE), Trappes, France

A post-doctoral position is open within the 'Evaluation of artificial intelligence systems' activity of LNE:

https://www.lne.fr/fr/offre-emploi/post-doc-evaluation-systemes-evolutifs-locuteur-traduction

The successful candidate will join a fast-growing team specialized in the evaluation of AI systems, as well as an ambitious European project on continuously evolving language processing systems (for translation and diarization). Characterizing the performance of intelligent systems that can improve themselves during use, both on their own and through interaction with human users, is a real challenge that this post-doctorate proposes to take up.

Back  Top

6-42(2019-10-15) Post-doc position, Portugal

An open full-time post-doc position for 30 months in the context of our research project DyNaVoiceR, which is supported by FCT (the Portuguese Foundation for Science and Technology).

The official announcement can be found at:

English version:
http://www.eracareers.pt/opportunities/index.aspx?task=showAnuncioOportunities&jobId=118038&idc=1e

Portuguese version:
http://www.eracareers.pt/opportunities/index.aspx?task=showAnuncioOportunities&jobId=118038&lang=pt&idc=1e

The salary level is quite attractive: 2,128.34 Euros per month (14 salaries per year).

Application deadline is October 15.

Back  Top

6-43(2019-10-10) Research engineer, Laboratoire national de métrologie et d'essais (LNE), Trappes, France

 

Research Engineer in Natural Language Processing (M/F)

Permanent position (CDI)
Location: Laboratoire national de métrologie et d'essais (LNE), Trappes
Reference: ML/ITAL/DEC

A leader in the world of measurement and reference standards, with a strong reputation in France and internationally, LNE supports industrial innovation and positions itself as an important player for a more competitive economy and a safer society.

At the crossroads of science and industry since its creation in 1901, LNE offers its expertise to all economic players involved in product quality and safety. As the pilot of French metrology, our research is at the heart of this public service mission and one of the keys to the success of companies. We strive to meet the industrial and academic need for ever more accurate measurements, under increasingly extreme conditions or on the most emerging concepts such as autonomous vehicles, nanotechnologies or additive manufacturing.

Missions:

You will join a team of six engineer-doctors, regularly supported by post-docs, PhD students and interns, specialized in the evaluation and qualification of artificial intelligence systems. This team is historically recognized for its expertise in the evaluation of natural language processing systems, and the proposed position is intended to strengthen this expertise in a context of strong growth.

In recent years, the team has diversified the application domains of its evaluation expertise, addressing subjects such as medical devices, collaborative industrial robots, autonomous vehicles, etc. The team builds on the diverse yet focused know-how of its experts (NLP, imaging, robotics, etc.) to jointly provide a satisfactory answer to the question of the evaluation and certification of intelligent systems, an imperative condition for their acceptability and today a priority for public authorities.

It is within the framework of the progressive establishment of an evaluation center for intelligent systems, with a national and international scope, that the team seeks to attract the best profiles in each AI specialty. The main missions of this future center are the development of new evaluation protocols, the qualification and certification of intelligent systems, the organization of challenges (benchmarking campaigns), the provision of experimental resources, the development and organization of the sector, and the definition of principles, policies, doctrines and standards to this end.

As a research engineer-doctor in NLP, your primary field of intervention will be language processing (text and speech). You may also be asked to intervene in other areas of information processing (for example image processing, including optical character recognition), and beyond, depending on priorities and your own skills and affinities.

The position can evolve over the medium and long term, as it aims at training technical experts of at least national stature who will eventually lead the growth and oversight of their specialty, subject to the regulatory framework and the general orientations of LNE or its principals.

Dans un premier temps, vous couvrirez les missions suivantes :

- Contribution à la R&D et aux actions structurantes (60%) :

 

Inventaire technique et commercial du besoin et de l’offre, priorisation des marchés et champs techniques à investir

Identification et définition des grandeurs à mesurer, des métriques afférentes, des protocoles d’évaluation et des moyens d’essais nécessaires

Structuration des données de la discipline (TAL) au sein de référentiels et selon des nomenclatures à bâtir

Programmation et conduite d’essais à des fins expérimentales, de recherche itérative et d’étalonnage

Constitution et animation d’un réseau de chercheurs des secteurs public et privé, national et étranger, en appui aux présentes missions

Contribution au montage et à l’exécution de projets de recherche nationaux et européens et de coopérations internationales

Participation aux travaux de planification du LNE : investissements, RH, budgets annuels, perspectives pluriannuelles

Publication et présentation des résultats scientifiques

Encadrement éventuel de doctorants, post-doctorants, stagiaires

- Contribution aux prestations commerciales en TAL (40%) :

Ingénierie linguistique générale (manipulation des données, analyse statistique, etc.)

Prise en charge du besoin client et reformulation dans le cadre d’une offre technique et commerciale

Organisation des tâches pour la réalisation de la prestation, estimation des ressources nécessaires, négociation

Réalisation de ces tâches en coordination avec l’équipe

Production/rédaction des livrables

Présentation des résultats au client

Profil :

Vous êtes titulaire soit d’un doctorat, soit d’un diplôme d’ingénieur avec un minimum de trois ans d’expérience professionnelle, en informatique ou en sciences du langage, avec une spécialisation en traitement automatique de la langue (TAL) et plus généralement en intelligence artificielle. Les expériences professionnelles ou académiques passées en développement et/ou test logiciel, en analyse statistique, ainsi qu’en traitement de la parole ou de l’image seront particulièrement appréciées.

Vous disposez également d’un bon niveau d’anglais et de programmation (C++ et/ou Python), ainsi que d’une expérience en utilisation de Linux.

Dans le cadre de votre prise de poste, vous pourriez être amené.e à suivre des formations complémentaires (par exemple en intelligence artificielle et en cybersécurité).

Vous saurez être à l’initiative, en disposant d’une large autonomie et d’un potentiel de créativité vous permettant d’occuper pleinement votre espace de responsabilité dans un objectif d’excellence. Vous êtes capable de défendre un leadership de par la qualité et la clarté de vos argumentaires.

Déplacements fréquents en région parisienne (une fois par semaine), en province (une à deux fois par mois) et occasionnels dans le monde (une fois par trimestre) dans le cadre de prestations, réunions ou conférences.


6-44(2019-10-13) Postdoctoral Researcher , IRISA, Rennes, France


Postdoctoral Researcher in Multilingual Speech Processing

CONTEXT: The Expression research team focuses on expressiveness in human-centered data. In this context, the team has a strong activity in the field of speech processing, especially text-to-speech (TTS). This activity is reflected in regular publications in top international conferences and journals, with contributions in topics like machine learning (including deep learning), natural language processing, and speech processing. The Expression team takes part in multiple collaborative projects. Among these, the current position is part of a large European H2020 project focusing on the social integration of migrants in Europe. Team website: https://www-expression.irisa.fr/

PROFILE

Main tasks:
1. Design multilingual TTS models (acoustic models, grapheme-to-phoneme, prosody, text normalization...)
2. Take part in porting the team's TTS system to embedded environments
3. Develop spoken language skill assessment methods

Secondary tasks:
1. Collect speech data
2. Define use cases with the project partners

Environment: The successful candidate will join a team of researchers and engineers working on the same topics.

Required qualification: PhD in computer science or signal processing

Skills:
- Statistical machine learning and deep learning
- Speech processing and/or natural language processing
- Strong object-oriented programming skills
- Android and/or iOS programming is a strong plus

CONTRACT
Duration: 22 months, full time
Salary: competitive, depending on experience
Starting date: January 1st, 2020

APPLICATION & CONTACTS
Send a cover letter, a resume, and references by email to: Arnaud Delhay, arnaud.delhay@irisa.fr; Gwénolé Lecorvé, gwenole.lecorve@irisa.fr; Damien Lolive, damien.lolive@irisa.fr.
Application deadline: November 15th, 2019. Applications will be processed as they arrive.


6-45(2019-10-16) Position in Machine Learning/AI at ReadSpeaker, The Netherlands

ReadSpeaker has a job opening for a Machine Learning / AI person working on text-to-speech research and development. Job ad can be found here:


https://www.readspeaker.com/careers/machine-learning-ai-for-text-to-speech-synthesis-research-and-development-to-deliver-business-solutions/


6-46(2019-10-18) FULLY FUNDED FOUR-YEAR PHD STUDENTSHIPS, University of Edinburgh, Scotland

FULLY FUNDED FOUR-YEAR PHD STUDENTSHIPS

UKRI CENTRE FOR DOCTORAL TRAINING IN NATURAL LANGUAGE PROCESSING

at the University of Edinburgh's School of Informatics and School of Philosophy, Psychology and Language Sciences.

Applications are now sought for the CDT's second cohort of students to start in September 2020

Deadlines:
* Non EU/UK : 29th November 2019
* EU/UK : 31st January 2020.

The CDT in NLP offers unique, tailored doctoral training comprising both taught courses and a doctoral dissertation. Both components run concurrently over four years.

Each student will take a set of courses designed to complement their existing expertise and give them an interdisciplinary perspective on NLP. They will receive full funding for four years, plus a generous allowance for travel, equipment and research costs.

The CDT brings together researchers in NLP, speech, linguistics, cognitive science and design informatics from across the University of Edinburgh. Students will be supervised by a team of over 40 world-class faculty and will benefit from cutting edge computing and experimental facilities, including a large GPU cluster and eye-tracking, speech, virtual reality and visualisation labs.

The CDT involves over 20 industrial partners, including Amazon, Facebook, Huawei, Microsoft, Mozilla, Reuters, Toshiba, and the BBC. Close links also exist with the Alan Turing Institute and the Bayes Centre.

A wide range of research topics fall within the remit of the CDT:

  • Natural language processing and computational linguistics

  • Speech technology

  • Dialogue, multimodal interaction, language and vision

  • Information retrieval and visualization, computational social science

  • Computational models of human cognition and behaviour, including language and speech processing

  • Human-Computer interaction, design informatics, assistive and educational technology

  • Psycholinguistics, language acquisition, language evolution, language variation and change

  • Linguistic foundations of language and speech processing

The second cohort of CDT students will start in September 2020 and is now open to applications.

Around 12 studentships are available, covering maintenance at the research council rate (https://www.ukri.org/skills/funding-for-research-training, currently £15,009 per year) and tuition fees. Studentships are available for UK, EU and non-EU nationals. Individuals in possession of other funding scholarships or industry funding are also welcome to apply - please provide details of your funding source on your application.

Applicants should have an undergraduate or master's degree in computer science, linguistics, cognitive science, AI, or a related discipline. We particularly encourage applications from women, minority groups and members of other groups that are underrepresented in technology.

Further details including the application procedure can be found at: https://edin.ac/cdt-in-nlp

Application Deadlines
In order to ensure full consideration for funding, completed applications (including all supporting documents) need to be received by:

29th November 2019 (non EU/UK) or 31st January 2020 (EU/UK).

CDT in NLP Open Days
Find out more about the programme by attending the PG Open Day at the School of Informatics or by joining one of the CDT in NLP Virtual Open Days:

Enquiries can be made to the CDT admissions team at cdt-nlp-info@inf.ed.ac.uk


6-47(2019-10-19) Postdoctoral Scholar, University of Southern California, USA
Open Position - Postdoctoral Scholar for Multimodal Machine Learning and Natural Language Processing
 

The University of Southern California's Institute for Creative Technologies (ICT) is an off-campus research facility, located on a creative business campus in the 'Silicon Beach' neighborhood of Playa Vista. We are world leaders in innovative training and education solutions, computer graphics, computer simulations, and immersive experiences for decision-making, cultural awareness, leadership and health. ICT employees are encouraged to develop themselves both professionally and personally, through workshops, invited guest talks, movie nights, social events, various sports teams, a private gym and a personal trainer. The atmosphere at ICT is informal and flexible, while encouraging initiative, personal responsibility and a high work ethic.

We are looking for an accomplished recent PhD graduate to work on a challenging yet exciting NIH-funded 4-year research project.  The project seeks to understand the process and success of Motivational Interviewing (MI). Specifically, our project will address shortcomings of current MI coding systems by introducing a novel computational framework that leverages our recent advances in automatic verbal and nonverbal behavior analyses as well as multimodal machine learning. Our framework aims to jointly analyze verbal (i.e., what is being said), nonverbal (i.e., how something is said), and dyadic (i.e., in what interpersonal context something is said) behavior to better identify in-session patient behavior that is predictive of post-session alcohol use. The project is heavily focused on machine learning, NLP, and data mining; it requires no data collection as all data has already been collected.

We are looking to add a talented machine learning (NLP, CV, or signal processing focus) Postdoctoral Research Associate to our interdisciplinary team of machine learning scientists, affective computing experts, and psychiatrists.  Join our team's mission to better understand therapy processes and predict outcomes!

Responsibilities include:

• Design and implement state-of-the-art NLP machine learning algorithms to automatically code dyadic MI therapy sessions and predict behavior change in patients.
• Push the envelope on current NLP and multimodal machine learning algorithms to better understand the MI process and outcome.
• Conduct statistical analysis on verbal, nonverbal and dyadic behavioral patterns to describe their relationship with the MI process and outcome.
• Write and lead authorship of high impact conference (ACL, EMNLP, ICMI, CVPR, ICASSP, and Interspeech) and journal papers (PAMI, TAFFC, and TASLP).
• Support and lead graduate, undergraduate students, and summer interns to preprocess and annotate multimodal MI data.
 

Work collaboratively with:               

• Domain experts in Motivational Interviewing research, to automatically derive meaningful insights for MI experts.
• Computer scientists across departments at the highly accomplished and interdisciplinary USC Institute for Creative Technologies.


  Have fun & learn while working at ICT with a great team and an incredible mission!


Minimum Education: PhD in computer science or engineering with a focus on NLP, CV signal processing or multimodal machine learning. 

Minimum Experience: At least 1 year of experience working with data comprising human verbal and/or nonverbal behavior. 
Minimum Field of Expertise: Directly related education in research specialization with advanced knowledge of equipment, procedures, and analysis methods. 
Skills:
• Comfortable with machine learning frameworks such as PyTorch or TensorFlow
• Excellent programming skills in Python or C++
• Analysis, assessment and evaluation
• Written and oral communication skills
• Organization and planning
• Problem identification and resolution
• Project management
• Research
 
 
 

6-48(2019-11-03) Research engineer, IRIT, Toulouse, France

Within the framework of the ALAIA joint laboratory, IRIT (SAMoVA team, https://www.irit.fr/SAMOVA/site/) is recruiting a research engineer on a fixed-term contract to join its research team, work in the field of AI applied to foreign language learning, and collaborate with the company Archean Technologie (http://www.archean.tech/archean-labs-en.html).

Position: Research engineer
Duration: 12 to 18 months
Starting date: possible from December 1st, 2019
Field: speech processing, machine learning, automatic pronunciation analysis
Location: Institut de Recherche en Informatique de Toulouse (Université Paul Sabatier) - SAMoVA team
Profile: PhD in computer science, machine learning or audio processing
Contact: Isabelle Ferrané (isabelle.ferrane@irit.fr)
Application: CV, abstract of the PhD thesis, cover letter, references/contacts
Full details: https://www.irit.fr/SAMOVA/site/assets/files/engineer/ALAIA_ResearchEngineerPosition(1).pdf
Salary: depending on experience


6-49(2019-11-05) Annotator/Transcriber (M/F) at ZAION, Paris, France

ZAION (https://www.zaion.ai) is a fast-growing innovative company specialized in conversational bot technology: callbots and chatbots embedding Artificial Intelligence.

ZAION has developed a solution that builds on more than 20 years of experience in Customer Relations. This technologically disruptive solution has been very well received internationally, and we already count 12 active clients (GENERALI, MNH, APRIL, CROUS, EUROP ASSISTANCE, PRO BTP, ...).

We are currently among the only companies in the world to offer a solution of this kind entirely geared towards performance. Joining us means taking part in a great adventure within a dynamic team that aims to become the reference on the conversational robot market.

Within our Artificial Intelligence activity, to support its ongoing innovations in the automatic identification of sentiments and emotions in conversational telephone interactions, we are recruiting an Annotator/Transcriber (M/F).

Main missions:

  • ANNOTATE accurately the exchanges between a customer and their advisor according to tags explained in a guide,
  • work meticulously from audio and text documents in French,
  • quickly become familiar with a dedicated annotation tool,
  • know collaborative work tools,
  • use your cultural, linguistic and grammatical knowledge to report with great precision not only the conversation between two speakers on a given subject, but also the segmentation of what they say.

Candidate profile:
  • be a native speaker with impeccable spelling,
  • have a very good command of Mac OR Windows OR Linux environments, and demonstrate rigour, listening skills and discretion.

Fixed-term contract (full time), based in Paris (75017).

If interested, please contact Anne le Gentil (HR) at the following address: alegentil@zaion.ai, attaching a CV to your email.

6-50(2019-11-05) Data Scientist / Machine Learning applied to Audio (M/F) at Zaion, Paris, France

ZAION is a fast-growing innovative company specialized in conversational bot technology: callbots and chatbots embedding Artificial Intelligence.

ZAION has developed a solution that builds on more than 20 years of experience in Customer Relations. This technologically disruptive solution has been very well received internationally, and we already count 18 active clients (GENERALI, MNH, APRIL, CROUS, EUROP ASSISTANCE, PRO BTP, ...).

We are currently among the only companies in the world to offer a solution of this kind entirely geared towards performance. Joining us means taking part in an exciting adventure within an ambitious team aiming to become the reference on the conversational robot market.

As part of its development, ZAION is recruiting its Data Scientist / Machine Learning engineer applied to Audio (M/F). Within the R&D team, your role is strategic for the development and expansion of the company. You will develop a solution for detecting emotions in conversations. We want to extend the cognitive capabilities of our callbots so that they can detect the emotions of their interlocutors (joy, stress, anger, sadness, ...) and adapt their answers accordingly.

Main missions:

- Take part in the creation of ZAION's R&D unit and, on arrival, lead your first project on emotion recognition in the voice

- Build, adapt and evolve our services for detecting emotion in the voice

- Analyse large databases of conversations to extract the emotionally relevant ones

- Build a database of conversations labelled with emotional tags

- Train and evaluate machine learning models for emotion classification

- Deploy your models in production

- Continuously improve the system for detecting emotions in the voice

Required qualifications and previous experience:

- You have at least 2 years of experience as a Data Scientist / Machine Learning engineer applied to audio

- Engineering school or Master's degree in computer science, or a PhD in computer science/mathematics, with solid skills in signal processing (preferably audio)

- Solid theoretical background in machine learning and in the relevant mathematical fields (clustering, classification, matrix factorization, Bayesian inference, deep learning...)

- Experience deploying machine learning models in a production environment would be a plus

- You master one or more of the following languages/frameworks: Python, machine learning / deep learning frameworks (PyTorch, TensorFlow, scikit-learn, Keras) and JavaScript

- You master audio signal processing techniques

- Proven experience in labelling large databases (preferably audio) is essential

- Your personality: a leader, autonomous, passionate about your work, you know how to lead a team in project mode

- You speak English fluently

Please send your application to: alegentil@zaion.ai

Best regards,

Anne le Gentil, HR

alegentil@zaion.ai / 06 62 33 98 64

https://www.linkedin.com/company/zaion-callbot/


6-51(2019-11-25) Internship offer, INRIA Bordeaux, France

M2 internship offer (Computer science / signal processing)

Deep learning for the classification between Parkinson's disease and multiple system atrophy from voice signal analysis

Parkinson's disease (PD) and multiple system atrophy (MSA) are neurodegenerative diseases. MSA belongs to the group of atypical parkinsonian disorders. In the early stages of the disease, the symptoms of PD and MSA are very similar, especially for MSA-P, where the parkinsonian syndrome predominates. The differential diagnosis between MSA-P and PD can be very difficult in the early stages of the disease, while early diagnostic certainty is important for the patient because of the diverging prognoses. Despite recent efforts, no valid objective marker is currently available to guide the clinician in this differential diagnosis. The need for such markers is therefore very high in the neurology community, especially given the severity of the MSA prognosis.

It is established that speech disorders, commonly called dysarthria, are an early symptom common to both diseases, with a different origin in each. We are therefore conducting research that uses dysarthria, through digital processing of patients' voice recordings, as a vector to distinguish between PD and MSA-P. We are currently coordinating a research project on this topic with clinical partners, neurologists and ENT specialists, from the university hospitals of Bordeaux and Toulouse. Within this project, we have a database of voice recordings of PD and MSA-P patients (and of healthy subjects).

The goal of this internship is to explore recent deep learning techniques to perform the classification between PD and MSA-P. The first step of the internship will consist in implementing a baseline system using standard tools, following the methodology described in [1]. The latter addresses the classification between PD and healthy subjects and uses spectrogram 'chunks' as input to a convolutional neural network (CNN). This methodology will be applied to the PD vs. MSA-P task using our database. The CNN will be implemented with Keras-TensorFlow (https://www.tensorflow.org/guide/keras). The extraction of voice signal parameters will be carried out with Matlab and the Praat software (http://www.fon.hum.uva.nl/praat/). This step will allow the intern to assimilate the building blocks of deep learning and of pathological voice analysis.
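As a minimal illustration of this first step, the following sketch shows the kind of small Keras CNN involved, assuming fixed-size mel-spectrogram chunks (128 bands x 126 frames) and one binary PD vs. MSA-P label per chunk; these dimensions and the aggregation rule are assumptions made for the example, not specifications of [1].

from tensorflow.keras import layers, models

def build_chunk_cnn(input_shape=(128, 126, 1)):
    # Small 2-D CNN classifying one spectrogram chunk at a time.
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),   # P(MSA-P | chunk)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# A recording-level decision can then average the per-chunk probabilities,
# e.g. model.predict(chunks).mean() > 0.5 (illustrative aggregation rule).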

 

The second step of the internship will consist in developing a deep neural network (DNN) that takes as input acoustic representations dedicated to the PD vs. MSA-P task and developed by our team. This will involve:

  • building the right dataset

  • defining the right class of DNN to use

  • building the right DNN architecture

  • choosing the right objective function to optimize

  • analysing and comparing the classification performances

This step will require a deeper understanding of the theoretical and algorithmic aspects of deep learning.

Prerequisites: A good knowledge of standard machine learning techniques and of their conceptualization is necessary. A good level in Python programming is also necessary. Knowledge of signal/image processing and/or deep learning would be an advantage. A test will be carried out to check these prerequisites.

Internship supervisor: Khalid Daoudi (khalid.daoudi@inria.fr)

Location: GeoStat team (https://geostat.bordeaux.inria.fr)

INRIA Bordeaux Sud-Ouest (https://www.inria.fr/centre/bordeaux)

Duration: 4 to 6 months, starting from February 2020

Compensation: standard internship allowance (~580 euros/month)

Candidates should send a detailed CV as well as the name and contact details of at least one reference to khalid.daoudi@inria.fr.

The internship could lead to a PhD offer.

[1] Convolutional neural network to model articulation impairments in patients with Parkinson's disease, J. C. Vásquez-Correa, J. R. Orozco-Arroyave, and E. Nöth, in Proceedings of INTERSPEECH 2017.


6-52(2019-11-15) 13 PhD studentships at UKRI Centre for Doctoral Training (CDT), University of Sheffield, UK

UKRI Centre for Doctoral Training (CDT) in Speech and Language Technologies (SLT) and their Applications 

 

Department of Computer Science

Faculty of Engineering 

University of Sheffield

 

Fully-funded 4-year PhD studentships for research in Speech and Language Technologies (SLT) and their Applications

** Apply now for September 2020 intake. Up to 13 studentships available **

Deadline for applications: 31 January 2020. 

What makes the SLT CDT different:

  • Unique Doctor of Philosophy (PhD) with Integrated Postgraduate Diploma (PGDip) in SLT Leadership. 

  • Bespoke cohort-based training programme running over the entire four years providing the necessary skills for academic and industrial leadership in the field, based on elements covering core SLT skills, research software engineering (RSE), ethics, innovation, entrepreneurship, management, and societal responsibility.  

  • The centre is a world-leading hub for training scientists and engineers in SLT - two core areas within artificial intelligence (AI) which are experiencing unprecedented growth and will continue to do so over the next decade.

  • Setting that fosters interdisciplinary approaches, innovation and engagement with real world users and awareness of the social and ethical consequences of work in this area.

 

The benefits:

  • Four-year fully-funded studentship covering all fees and an enhanced stipend (£17,000 pa)

  • Generous personal allowance for research-related travel, conference attendance, specialist equipment, etc.

  • A full-time PhD with integrated PGDip incorporating 6 months of foundational SLT training prior to starting your research project 

  • Supervision from a team of over 20 internationally leading SLT researchers, covering all core areas of modern SLT research, and a broader pool of over 50 academics in cognate disciplines with interests in SLTs and their application

  • Every PhD project underpinned by a real-world application, directly supported by one of over 30 industry partners. Partners include Google, Amazon, Microsoft, Nuance, NHS Digital and many more

  • A dedicated CDT workspace within a collaborative and inclusive research environment hosted by the Department of Computer Science

  • Work and live in Sheffield - a cultural centre on the edge of the Peak District National Park which is in the top 10 most affordable and safest UK university cities.

 

About you:

We are looking for students from a wide range of backgrounds interested in Speech and Language Technologies. 

  • High-quality (ideally first class) undergraduate or masters (ideally distinction) degree in a relevant discipline. Suitable backgrounds include (but not limited to) computer science, informatics, engineering, linguistics, speech and language processing, mathematics, cognitive science, AI, physics, or a related discipline. 

  • Regardless of background, you must be able to demonstrate mathematical aptitude (minimally to A-Level standard or equivalent) and experience of programming.

  • We particularly encourage applications from groups that are underrepresented in technology.

  • Candidates must satisfy the UKRI funding eligibility criteria. Students must have settled status in the UK and have been 'ordinarily resident' in the UK for at least 3 years prior to the start of the studentship. Full details of eligibility criteria can be found on our website.

 

Applying:

Applications are now sought for the September 2020 intake. Up to 13 studentships available.

 

We operate a staged admissions process, with application deadlines throughout the year. 

The first deadline for applications is 31 January 2020. The second deadline is 31 May 2020. 

Applications will be reviewed within 4 weeks of each deadline and short-listed applicants will be invited to interview. Interviews will be held in Sheffield.

In some cases, because of the high volume of applications we receive, we may need more time to assess your application. If this is the case, we will let you know if we intend to do this.

We may be able to consider applications received after 31 May 2020 if places are still available. Equally, all places may be allocated after the first deadline therefore we encourage you to apply early.

 

See our website for full details and guidance on how to apply: slt-cdt.ac.uk 

For an informal discussion about your application please contact us by email at: sltcdt-enquiries@sheffield.ac.uk

 

By replying to this email or contacting sltcdt-enquiries@sheffield.ac.uk you consent to being contacted by the University of Sheffield in relation to the CDT. You are free to withdraw your permission in writing at any time.


6-53(2019-11-21) Scholarships in French Studies (MA and PhD) at Western University, Canada

 


Scholarships in French Studies (MA and PhD) at Western University

The Department of French Studies at Western University (London, Canada) is now accepting applications for admission to its Master's and PhD programs for the 2020-2021 academic year, in the fields of linguistics and literature. Western University is recognized as one of the major research universities in Ontario, and the Department of French Studies has actively contributed to this reputation for more than 50 years.

The faculty and the graduate student body form a diverse international community. We offer the possibility of pursuing a research program in formal linguistics (syntax, morphology, phonology and semantics) as well as in sociolinguistics. We also offer training in literature across all centuries and all areas of French and Francophone literature, the fields in which our students conduct their research.

Information about the faculty's research interests is available here: https://www.uwo.ca/french/people/faculty/index.html.

The list of theses and dissertations completed since 2003 is available here: https://www.uwo.ca/french/graduate/thesis/index.html.

Deadline for the first call, giving access to funding from September 2020: February 1st, 2020.

Canadian and international candidates accepted into the PhD program receive a four-year scholarship covering tuition fees, as well as an annual teaching assistantship worth at least $13,000. The same funding is offered for one year to Canadian students accepted into the Master's program. International students accepted into the Master's program receive a lump sum of $3,000 for the duration of the program.

In addition to graduate scholarships, the Department of French Studies offers students who maintain a high-quality academic record financial support for research trips or conference participation, as well as the possibility of replacing the teaching assistantship with a research scholarship of equivalent value. Several students in our PhD program also benefit from a joint supervision (cotutelle) agreement with a French university.

For more information about the financial support offered by our institution, please contact the Department of French Studies directly or see the following link: http://www.uwo.ca/french/graduate/finances/index.html.

We also offer an excellent teaching assistant training program as well as several professional development activities.

Graduate program director: François Poiré (fpoire@uwo.ca)

Graduate program assistant: Chrisanthi Ballas (frgrpr@uwo.ca)

Contact: http://www.uwo.ca/french/graduate/programs/index.html

Reference URL: http://www.uwo.ca/french/graduate




6-54(2019-11-22) Master R2 Internship, Loria-Inria, Nancy, France

Master R2 Internship in Natural Language Processing: weakly supervised learning for hate speech detection

Supervisors: Irina Illina, MdC, Dominique Fohr, CR CNRS

Team: Multispeech, LORIA-INRIA

Contact: illina@loria.fr, dominique.fohr@loria.fr

Duration: 5-6 months

Deadline to apply: March 1st, 2020

Required skills: background in statistics, natural language processing and computer program skills (Perl, Python). Candidates should email a detailed CV with diploma

Motivations and context

Recent years have seen a tremendous development of Internet and social networks. Unfortunately, the dark side of this growth is an increase in hate speech. Only a small percentage of people use the Internet for unhealthy activities such as hate speech. However, the impact of this low percentage of users is extremely damaging.

Hate speech is the subject of different national and international legal frameworks. Manual monitoring and moderating the Internet and the social media content to identify and remove hate speech is extremely expensive. This internship aims at designing methods for automatic learning of hate speech detection systems on the Internet and social media data. Despite the studies already published on this subject, the results show that the task remains very difficult (Schmidt et al., 2017; Zhang et al., 2018).

In text classification, text documents are usually represented in a so-called vector space and then assigned to predefined classes through supervised machine learning. Each document is represented as a numerical vector, computed from the words of the document. How to numerically represent the terms in an appropriate way is a basic problem in text classification tasks and directly affects the classification accuracy. Developments in neural networks led to a renewed interest in the field of distributional semantics, more specifically in learning word embeddings (representations of words in a continuous space). Computational efficiency was one big factor that popularized word embeddings. Word embeddings capture syntactic as well as semantic properties of words (Mikolov et al., 2013). As a result, they outperformed several other word vector representations on different tasks (Baroni et al., 2014).

Our methodology for hate speech detection builds on recent approaches to text classification with neural networks and word embeddings. In this context, fully connected feed-forward networks, Convolutional Neural Networks (CNN) and also Recurrent/Recursive Neural Networks (RNN) have been applied. On the one hand, the approaches based on CNN and RNN capture rich compositional information and have outperformed the state-of-the-art results in text classification; on the other hand, they are computationally intensive and require a huge corpus of training data.
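To make this concrete, here is a minimal sketch of such a classifier, assuming pre-trained (e.g., fastText-style) word vectors feeding a 1-D CNN that scores each comment as hate or not hate; the vocabulary size, sequence length and the random stand-in embedding matrix are placeholders for the example, not the team's actual configuration.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB, DIM, MAXLEN = 50000, 300, 100   # vocabulary size, vector size, comment length

# Stand-in for a matrix of pre-trained word vectors (one row per vocabulary word).
embedding_matrix = np.random.rand(VOCAB, DIM).astype("float32")

model = models.Sequential([
    layers.Input(shape=(MAXLEN,)),      # a comment as a sequence of word ids
    layers.Embedding(VOCAB, DIM,
                     embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                     trainable=False),  # frozen pre-trained embeddings
    layers.Conv1D(128, 5, activation="relu"),   # composes local word contexts
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),      # P(hate | comment)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])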

To train these DNN hate speech detection systems, it is necessary to have a very large corpus of training data. This training data must contain several thousands of social media comments, and each comment should be labeled as hate or not hate. It is easy to automatically collect social media and Internet comments. However, it is time-consuming and very costly to label a huge corpus. Of course, for several hundred comments this work can be performed manually by human annotators, but it is not feasible for a huge corpus of comments. In this case, weakly supervised learning can be used: the idea is to train a deep neural network with a limited amount of labelled data.

The goal of this master internship is to develop a methodology for weakly supervised learning of a hate speech detection system using social network data (Twitter, YouTube, etc.).

Objectives

In our Multispeech team, we have developed a baseline system for automatic hate speech detection. This system is based on fastText and BERT embeddings (Bojanowski et al., 2017; Devlin et al., 2018) and on CNN/RNN models. During this internship, the master student will work on this system in the following directions:

  • Study of the state-of-the-art approaches in the field of weakly supervised learning;
  • Implementation of a baseline method of weakly supervised learning for our system;
  • Development of a new methodology for weakly supervised learning. Two cases will be studied. In the first case, we train the hate speech detection system using a small labeled corpus and then proceed incrementally: we use this first system to label more data, retrain the system and use it to label new data, and so on (a minimal sketch of this scheme is given after this list). In the second case, we consider learning with noisy labels (labels that may be incorrect or given by several annotators who disagree).
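For the first case, a minimal sketch of the incremental (self-training) loop could look as follows; model, X_labeled, y_labeled and X_pool are illustrative placeholders (any classifier exposing scikit-learn-style fit/predict_proba would do), and the confidence threshold is an assumption made for the example.

import numpy as np

def self_train(model, X_labeled, y_labeled, X_pool, rounds=5, threshold=0.9):
    # Iteratively grow the labeled set with the model's own confident predictions.
    for _ in range(rounds):
        model.fit(X_labeled, y_labeled)
        probs = model.predict_proba(X_pool)          # shape: (n_pool, n_classes)
        confident = probs.max(axis=1) >= threshold
        if not confident.any():                      # nothing trustworthy left
            break
        X_labeled = np.concatenate([X_labeled, X_pool[confident]])
        y_labeled = np.concatenate([y_labeled, probs[confident].argmax(axis=1)])
        X_pool = X_pool[~confident]                  # keep the rest unlabeled
    return model

The threshold trades the precision of the pseudo-labels against the amount of new data added at each round.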

References

Baroni, M., Dinu, G., and Kruszewski, G. "Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors". In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Volume 1, pages 238-247, 2014.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. "Enriching word vectors with subword information". Transactions of the Association for Computational Linguistics, 5:135-146, 2017.

Dai, A. M. and Le, Q. V. "Semi-supervised sequence learning". In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 3061-3069. Curran Associates, Inc., 2015.

Devlin, J., Chang, M.-W., Lee, K., Toutanova, K. "BERT: Pre-training of deep bidirectional transformers for language understanding", arXiv:1810.04805v1, 2018.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. "Distributed representations of words and phrases and their compositionality". In Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc., 2013.

Schmidt, A., Wiegand, M. "A survey on hate speech detection using natural language processing", Workshop on Natural Language Processing for Social Media, 2017.

Zhang, Z., Luo, L. "Hate speech detection: a solved problem? The challenging case of long tail on Twitter". arxiv.org/pdf/1803.03662, 2018.

 


6-55(2019-11-25) Annotator/Transcriber, ZAION, Paris, France

ZAION (https://www.zaion.ai) is a fast-growing innovative company specialized in conversational bot technology: callbots and chatbots embedding Artificial Intelligence.

ZAION has developed a solution that builds on more than 20 years of experience in Customer Relations. This technologically disruptive solution has been very well received internationally, and we already count 12 active clients (GENERALI, MNH, APRIL, CROUS, EUROP ASSISTANCE, PRO BTP, ...).

We are currently among the only companies in the world to offer a solution of this kind entirely geared towards performance. Joining us means taking part in a great adventure within a dynamic team that aims to become the reference on the conversational robot market.

Within our Artificial Intelligence activity, to support its ongoing innovations in the automatic identification of sentiments and emotions in conversational telephone interactions, we are recruiting an Annotator/Transcriber (M/F).

Main missions:

  • ANNOTATE accurately the exchanges between a customer and their advisor according to tags explained in a guide,
  • work meticulously from audio and text documents in French,
  • quickly become familiar with a dedicated annotation tool,
  • know collaborative work tools,
  • use your cultural, linguistic and grammatical knowledge to report with great precision not only the conversation between two speakers on a given subject, but also the segmentation of what they say.

Candidate profile:
  • be a native speaker with impeccable spelling,
  • have a very good command of Mac OR Windows OR Linux environments, and demonstrate rigour, listening skills and discretion.

Fixed-term contract (full or part time), based in Paris (75017).

If interested, please contact Anne le Gentil (HR) at the following address: alegentil@zaion.ai, attaching a CV to your email.

6-56(2019-12-02) 2 faculty positions, Université Paris-Saclay, France

Two faculty positions (one full professor, one associate professor) will be opened for competition by Université Paris-Saclay in section 27 (computer science) in the 2020 round, with profiles in Language Processing, Speech being the priority, and with research carried out at LIMSI.

The two profiles are detailed here:

https://www.limsi.fr/fr/limsi-emplois/offres-de-postes-chercheurs-et-enseignants-chercheurs

Do not hesitate to get in touch if one of the positions interests you (dir@limsi.fr), and to let people around you know about these positions.


6-57(2019-12-03) Ph studentships, University of Glasgow, UK

The School of Computing Science at the University of Glasgow is offering studentships and excellence bursaries for PhD study. The following sources of funding are available:

 

* EPSRC DTA awards: open to UK or EU applicants who have lived in the UK for at least 3 years (see https://epsrc.ukri.org/skills/students/help/eligibility/) - covers fees and living expenses

* College of Science and Engineering Scholarship: open to all applicants (UK, EU and International) - covers fees and living expenses

* Centre for Doctoral Training in Socially Intelligent Artificial Agents: open to UK or EU applicants who have lived in the UK for at least 3 years through a national competition – see https://socialcdt.org

* China Scholarship Council Scholarship nominations: open to Chinese applicants – covers fees and living expenses

* Excellence Bursaries: full fee discount for UK/EU applicants; partial discount for international applicants

* Further scholarships (contact potential supervisor for details): open to UK or EU applicants

 

Whilst the above funding is open to students in all areas of computing science, applications in the area of Human-Computer Interaction are welcomed. 

 

Please find below a list of Available supervisors in HCI and their research areas.

 

Available supervisors and their research topics:  

* Prof Stephen Brewster (http://mig.dcs.gla.ac.uk/): Multimodal Interaction, MR/AR/VR, Haptic feedback. Email: Stephen.Brewster@glasgow.ac.uk

* Prof Matthew Chalmers (https://www.gla.ac.uk/schools/computing/staff/matthewchalmers/): mobile and ubiquitous computing, focusing on ethical systems design and healthcare applications. Email: Matthew.Chalmers@glasgow.ac.uk

* Prof Alessandro Vinciarelli (http://www.dcs.gla.ac.uk/vincia/): Social Signal Processing. Email: Alessandro.Vinciarelli@glasgow.ac.uk
* Dr Mary Ellen Foster (http://www.dcs.gla.ac.uk/~mefoster/): Social Robotics, Conversational Interaction, Natural Language Generation. Email: MaryEllen.Foster@glasgow.ac.uk
* Dr Euan Freeman (http://euanfreeman.co.uk/): Interaction Techniques, Haptics, Gestures, Pervasive Displays. Email: Euan.Freeman@glasgow.ac.uk

* Dr Fani Deligianni (http://fdeligianni.site/): Characterising uncertainty, eye-tracking, EEG, bimanual teleoperations. Email: fadelgr@gmail.com

* Dr Helen C. Purchase (http://www.dcs.gla.ac.uk/~hcp/): Visual Communication, Information Visualisation, Visual Aesthetics. Email: Helen.Purchase@glasgow.ac.uk

* Dr Mohamed Khamis (http://mkhamis.com/): Human-centered Security and Privacy, Eye Tracking and Gaze-based Interaction, Interactive Displays. Email: Mohamed.Khamis@glasgow.ac.uk

 

The closing date for applications is 31 January 2020.  For more information about how to apply, see https://www.gla.ac.uk/schools/computing/postgraduateresearch/prospectivestudents.  This web page includes information about the research proposal, which is required as part of your application.

 

Applicants are strongly encouraged to contact a potential supervisor and discuss an application before the submission deadline.

 


6-58(2019-12-03) Researcher position at LIMSI, Orsay, France

LIMSI is recruiting a researcher (fixed-term contract) in natural language processing and machine translation (M/F). All the details of the offer are here:

https://emploi.cnrs.fr/Offres/CDD/UPR3251-FRAYVO-002/Default.aspx


6-59(2019-12-06) Final-year Engineering or Master 2 internship, INA, Bry-sur-Marne, France

Automatic segmentation and detection of conflict situations in political interviews

Final-year Engineering or Master 2 internship - Academic year 2019-2020

Keywords: machine learning, diarization, digital humanities, political speech, expressivity

Context

The Institut national de l'audiovisuel (INA) is a French public industrial and commercial institution (EPIC) whose main mission is to archive and promote the French audiovisual heritage (radio, television and web media). INA also carries out missions in scientific research, training and production.

This internship is part of the OOPAIP project (ontology and tools for the annotation of political interventions), a cross-disciplinary project carried out by INA and the CESSP (Centre européen de sociologie et de science politique) of Université Paris 1 Panthéon-Sorbonne. The objective is to design new approaches for detailed, qualitative and quantitative analyses of mediatized political interventions in France. Part of the project concerns the study of the dynamics of conflictual interactions in political interviews and debates, which requires a fine-grained description and a large corpus in order to generalize the models. The technological challenges concern the performance of algorithms for segmentation into speakers and into speaking styles. Improving their precision, and adding the detection of overlapped speech, measures of vocal effort and expressive elements, will make it possible to optimize the manual annotation work.

Internship objectives

The internship mainly aims at improving the automatic segmentation of political interviews in order to support research in political science. The corresponding research theme we will focus on is the identification of conflict situations. In this context, we will be particularly interested in the detection of hubbub (overlapped speech). At a finer level, we would like to extract descriptors of the speech signal correlated with the level of conflict in the exchanges, based, for example, on the activation level (an intermediate level between the signal and expressivity [Rilliard et al., 2018]) or on vocal effort [Liénard, 2019].

The internship will initially rely on two corpora totalling 30 political interviews finely annotated in speaking turns, produced within the OOPAIP project. It will start with a state of the art of diarization (segmentation and clustering into speakers [Broux et al., 2019]) and of overlapped speech detection [Chowdhury et al., 2019]. The next step will be to propose solutions based on recent frameworks to improve the localization of speaking-turn boundaries, in particular when the frequency of speaker changes is high, the limiting case being the hubbub situation.

The second part of the internship will address a finer measurement of the conflict level of the exchanges, through the search for the most relevant descriptors and the design of learning architectures to model it.
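As an illustration of one candidate descriptor, the sketch below computes a one-third-octave long-term average spectrum (LTAS) and summarizes it by its spectral slope, in the spirit of the vocal-effort measure of [Liénard, 2019]; the band layout, the 1-4 kHz slope summary and the 16 kHz mono input are assumptions made for the example, not the paper's exact procedure.

import numpy as np
from scipy.signal import welch

def third_octave_ltas(signal, sr, f_min=100.0, n_bands=18):
    """Return the centre frequencies and band levels (dB) of the LTAS."""
    freqs, psd = welch(signal, fs=sr, nperseg=2048)
    centres = f_min * 2.0 ** (np.arange(n_bands) / 3.0)   # one-third-octave spacing
    levels = []
    for fc in centres:
        lo, hi = fc / 2 ** (1 / 6), fc * 2 ** (1 / 6)     # band edges
        band = psd[(freqs >= lo) & (freqs < hi)]
        levels.append(10 * np.log10(band.sum() + 1e-12))  # band power in dB
    return centres, np.asarray(levels)

# Example: slope of the LTAS between 1 and 4 kHz as a crude vocal-effort proxy.
sr = 16000
x = np.random.randn(sr * 5)          # stand-in for a 5-second speech segment
fc, lv = third_octave_ltas(x, sr)
sel = (fc >= 1000) & (fc <= 4000)
slope = np.polyfit(np.log2(fc[sel]), lv[sel], 1)[0]
print(f"LTAS slope in 1-4 kHz: {slope:.2f} dB/octave")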

The programming language used for this internship will be Python. The intern will have access to INA's computing resources (servers and clusters), as well as to a powerful desktop machine with 2 recent-generation GPUs.

Valorisation of the internship

Different strategies for promoting the intern's work will be considered, depending on the maturity of the work carried out:

- Release of the analysis tools under an open-source licence via INA's GitHub repository: https://github.com/ina-foss

- Writing of scientific publications

Internship conditions

The internship will take place over a period of 4 to 6 months, within INA's Research department. It will be located on the Bry 2 site, 18 Avenue des frères Lumière, 94360 Bry-sur-Marne. The intern will be supervised by Marc Evrard (mevrard@ina.fr).

Allowance: about 550 euros per month.

Candidate profile

- Final-year student of a five-year degree (bac+5) in computer science and AI.

- Proficiency in Python and experience with ML libraries (scikit-learn, TensorFlow, PyTorch).

- Keen interest in the humanities and social sciences, digital humanities, and political science in particular.

- Ability to carry out a literature review based on scientific articles written in English.

Bibliography

Broux, P. A., Desnous, F., Larcher, A., Petitrenaud, S., Carrive, J., & Meignier, S. (2018). "S4D: Speaker Diarization Toolkit in Python". In Interspeech 2018.

Chowdhury, S. A., Stepanov, E. A., Danieli, M., Riccardi, G. (2019). "Automatic classification of speech overlaps: Feature representation and algorithms", Computer Speech & Language, vol. 55, pp. 145-167.

Liénard, J.-S. (2019). "Quantifying vocal effort from the shape of the one-third octave long-term-average spectrum of speech", J. Acoust. Soc. Am. 146 (4), October 2019.

Rilliard, A., d'Alessandro, C., & Evrard, M. (2018). "Paradigmatic variation of vowels in expressive speech: Acoustic description and dimensional analysis". The Journal of the Acoustical Society of America, 143(1), 109-122.


6-60(2019-12-07) Internship at IRCAM, Paris, France

Deep Disentanglement of Speaker Identity and Phonetic Content for Voice Conversion

Dates: 01/02/2020 to 30/06/2020

Laboratory: STMS Lab (IRCAM / CNRS / Sorbonne Université)

Location: IRCAM - Sound Analysis and Synthesis team

Supervisors: Nicolas Obin, Axel Roebel

Contact: Nicolas.Obin@ircam.fr, Axel.Roebel@ircam.fr

Context:

Voice identity conversion consists in modifying the characteristics of a 'source' voice to reproduce the characteristics of a 'target' voice to be imitated, given a collection of examples of the 'target' voice. The voice identity conversion task has become widely popular in recent years with the appearance of 'deep fakes', the objective being to transpose to the speech domain the successes achieved in the image domain. Current research lines thus rely on neural architectures such as sequence-to-sequence models, generative adversarial networks (GANs, [Goodfellow et al., 2014]) and their variants for learning from non-parallel data (Cycle-GAN [Kaneko and Kameoka, 2017] or AttGAN [He et al., 2019]). The major challenges of identity conversion include the ability to learn identity transformations efficiently from small databases (a few minutes of speech) and to separate the factors of variability of speech, so as to modify only the identity of a speaker without modifying or degrading the linguistic and expressive content of the voice.
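To fix ideas on the Cycle-GAN family mentioned above, here is a minimal sketch of the cycle-consistency term that makes learning from non-parallel data possible: features converted from the source to the target voice and back should recover the input. gen_s2t and gen_t2s stand for the two generators and are illustrative placeholders; the adversarial (discriminator) terms of the full objective are omitted.

import tensorflow as tf

def cycle_consistency_loss(gen_s2t, gen_t2s, mel_source, lambda_cyc=10.0):
    fake_target = gen_s2t(mel_source)       # source -> target voice
    reconstructed = gen_t2s(fake_target)    # back to the source domain
    # L1 reconstruction error: content must survive the round trip,
    # while the adversarial terms push fake_target towards the target identity.
    return lambda_cyc * tf.reduce_mean(tf.abs(mel_source - reconstructed))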

Objectives:

The work carried out in this internship will concern the extension of the neural voice identity conversion system currently developed within the ANR project TheVoice (https://www.ircam.fr/projects/pages/thevoice/). The main focus of the internship will be to efficiently integrate linguistic content information into the existing neural conversion system. This objective involves the following tasks:

- Development of a representation of phonetic information (e.g., in the form of Phonetic PosteriorGrams [Sun et al., 2016]) and its integration into the current conversion system.

- Application and further development of techniques for disentangling speaker identity and phonetic content when learning the conversion [Mathieu et al., 2016; Hamidreza et al., 2019].

- Evaluation of the results by comparison with state-of-the-art conversion systems, on reference databases such as VCC2018 or LibriSpeech.

The problems addressed during the internship will be selected at its beginning, after an orientation phase and a literature review. The solutions developed during the internship will be integrated into IRCAM's voice identity conversion system, with possible industrial and professional exploitation. For example, the identity conversion system developed at IRCAM has been used in professional production projects to recreate the voices of historical figures: Marshal Pétain in the documentary 'Juger Pétain' in 2012, and Louis de Funès in the film 'Pourquoi j'ai pas mangé mon père' by Jamel Debbouze in 2015.

The internship will build on the expertise of the Sound Analysis and Synthesis team of the STMS laboratory (IRCAM/CNRS/Sorbonne Université) in speech signal processing and neural network learning, and on extensive experience in voice identity conversion [Villavicencio et al., 2009; Huber, 2015].

Expected skills:

- Command of machine learning, in particular learning with neural networks;

- Command of digital audio signal processing (time-frequency analysis, parametric analysis of audio signals, etc.);

- Good command of Python programming and of the TensorFlow environment;

- Autonomy, teamwork, productivity, rigour and methodology.

Remuneration:

Internship allowance according to the law in force, plus social benefits

Application deadline:

20/12/2019

Bibliography:

[Goodfellow et al., 2014] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative Adversarial Networks," arXiv:1406.2661 [cs, stat], 2014.

[Hamidreza et al., 2019] Seyed Hamidreza Mohammadi, Taehwan Kim, "One-shot Voice Conversion with Disentangled Representations by Leveraging Phonetic Posteriorgrams," Interspeech 2019.

[He et al., 2019] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, "AttGAN: Facial attribute editing by only changing what you want," IEEE Transactions on Image Processing, vol. 28, no. 11, 2019.

[Huber, 2015] S. Huber, "Voice Conversion by modelling and transformation of extended voice characteristics," PhD thesis, Université Pierre et Marie Curie (Paris VI), 2015.

[Kaneko and Kameoka, 2017] Takuhiro Kaneko and Hirokazu Kameoka, "Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks," arXiv:1711.11293 [cs, eess, stat], 2017.

[Mathieu et al., 2016] Michael Mathieu, Junbo Zhao, Pablo Sprechmann, Aditya Ramesh, Yann LeCun, "Disentangling factors of variation in deep representations using adversarial training," NIPS 2016.

[Sun et al., 2016] Lifa Sun, Kun Li, Hao Wang, Shiyin Kang, and Helen Meng, "Phonetic posteriorgrams for many-to-one voice conversion without parallel data training," in 2016 IEEE International Conference on Multimedia and Expo (ICME), 2016, pp. 1-6.

[Villavicencio et al., 2009] Villavicencio, F., Röbel, A., and Rodet, X. (2009). "Applying improved spectral modelling for high quality voice conversion." In Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 4285-4288.


6-61(2019-12-07) Assistant engineer in production, LPL, Aix-en-Provence, France

Job category:

Assistant engineer in production, data processing and surveys, BAP D (Data in the humanities and social sciences):

Mission:

Within the experimental platform of the Laboratoire Parole et Langage (LPL), the successful candidate will be responsible for technical coordination and for welcoming and supporting experiments, in collaboration with the sector heads (audio-video, articulography/physiology, neurophysiology/eye-tracking).

Activities:

Welcome participants and collect their personal data in compliance with current legislation (GDPR)
Recruit participants for experiments
Liaise with external researchers
Monitor and reorder consumables
Book experimental rooms and equipment, draw up testing schedules, arrange appointments
Support the set-up of the experimental apparatus together with the sector head
Keep laboratory notebooks up to date
Run the ongoing campaigns to recruit volunteers
Help write methodological notes on the operations carried out
Keep disciplinary and methodological knowledge up to date and maintain the bibliography of a field of study

Skills:

Command of experimental techniques, methods and protocols in the humanities and social sciences
Knowledge of measurement and statistics
Collaboration with researchers in the design, set-up and running of experiments
Teamwork with the other technical staff (ITA) working on the platform
Strong interpersonal skills when dealing with investigators of varied profiles (from Master's students to foreign researchers, including the laboratory's own researchers and PhD students) and with all categories of participants, from school-age children to adults and elderly people, some of whom may present various pathologies
Knowledge of, and compliance with, the legislation on research involving human subjects, as well as health and safety rules
A good command of spoken English (level B2 of the Common European Framework of Reference for Languages) is essential
Long-term archiving of research data (basic notions)

The call is open until 17 January, but applications will be reviewed as they arrive. Please feel free to circulate this information to anyone who may be interested.


6-62(2019-12-07) 1-year post-doc/engineer position at LIA, Avignon, France

A 1-year post-doc/engineer position at LIA (Avignon, France), in the Vocal Interaction Group

Multimodal man-robot interface for social spaces

keywords: AI, ML, DNN, RL, NLP, dialogue, vision, robotics

Desired starting date: March 2020.
==================================================================
## Work description

### Project Summary

Automation and optimisation of *verbal interactions of a socially-competent robot*,
guided by its *multimodal perceptions*

Facing a steady increase in the ageing population and in the prevalence of chronic diseases,
social robots are promising tools to include in the health care system. Yet existing
assistive robots are not well suited to such contexts: their communication abilities
cannot handle social spaces (several metres, groups of people), only face-to-face
individual interactions in quiet environments. To overcome these limitations, and
ultimately to achieve natural man-robot interaction, the work will pursue several
objectives.

First and foremost, we intend to leverage the rich information available in the audio and
visual data flows coming from humans to extract verbal and non-verbal features. These
features will be used to enhance the robot's decision-making ability so that it can
smoothly take speech turns and switch from interaction with a group of people to
face-to-face dialogue and back. Secondly, online and continual learning of the resulting
system will be investigated.

Outcomes of the project will be implemented on a commercially available social robot
(most likely a Pepper) and validated in several in-situ use cases. A large-scale data
collection will complement the in-situ tests to fuel further research. The essential
competencies for our overall objectives lie in dialogue systems / NLP, yet knowledge
of vision and robotics will also be necessary. In any case, a good command of
deep-learning techniques and tools is mandatory, including reinforcement learning for
dialogue strategy training (a toy sketch of which follows).
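
As a toy illustration of the reinforcement-learning side of the work (an assumed, simplified stand-in for the project's actual system), the following tabular Q-learning sketch learns a turn-taking policy from simulated multimodal observations; states, actions and rewards are invented for the example:

import random
from collections import defaultdict

ACTIONS = ["keep_silent", "take_turn", "address_group"]
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.2
Q = defaultdict(float)  # Q[(state, action)]

def observe():
    # Stand-in for multimodal perception: (someone_speaking, group_size)
    return (random.random() < 0.5, random.choice(["one", "several"]))

def reward(state, action):
    speaking, group = state
    if speaking and action == "take_turn":
        return -1.0  # interrupted a speaker
    if not speaking and action == "take_turn" and group == "one":
        return 1.0   # smooth face-to-face turn exchange
    if not speaking and action == "address_group" and group == "several":
        return 1.0   # appropriate group address
    return 0.0

state = observe()
for _ in range(5000):
    # Epsilon-greedy action selection, then a standard Q-learning update.
    action = (random.choice(ACTIONS) if random.random() < EPS
              else max(ACTIONS, key=lambda a: Q[(state, a)]))
    r = reward(state, action)
    nxt = observe()
    best_next = max(Q[(nxt, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])
    state = nxt

# Learned policy for each of the four toy states.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)])
       for s in {(sp, g) for sp in (True, False) for g in ("one", "several")}})

In the actual project, the tabular state would be replaced by learned representations of the audio-visual features and the policy by a DNN, but the decision-making loop has the same shape.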

### Requirements

- Master or PhD in Computer Science, Machine Learning, Computational Linguistics,
Mathematics, Engineering or related fields
- Expertise in NLP / dialogue systems. Strong knowledge of current NLP / interactive /
speech techniques is expected. Previous experience with dialogue and interaction and/or
vision data is a strong plus.
- Knowledge of vision and/or robotics is a plus.
- Strong programming skills: a Python/C++ programmer of DNN models (preferably with PyTorch)
- Expertise in Unix environments
- Good spoken and written command of English is required. *French is optional.*
- Good writing skills, as evidenced by publications at top venues (e.g., ACL, EMNLP,
SIGDIAL, etc.), are a plus for the post-doc.

## Place

Bordered by the left bank of the Rhône, Avignon is one of the most beautiful cities in
Provence, and was for a time the capital of Christendom in the Middle Ages. The important
remains of this rich past give the city its unique atmosphere: dozens of churches and
chapels, the « Palais des Papes » (palace of the popes, the most important Gothic palace
in Europe), the Saint-Bénezet bridge, the « pont d'Avignon » of worldwide fame through
the song, the ramparts that still encircle the entire city, and ten museums ranging from
ancient times to contemporary art.

Of the city's 94,787 inhabitants, about 12,000 live in the old town centre surrounded by
its medieval ramparts. Avignon is not only the birthplace of the most prestigious
festival of contemporary theatre and a European Capital of Culture in 2000, but also the
largest city and capital of the département of Vaucluse. The region offers a high quality
of urban life at still comparatively modest cost, and numerous monuments and natural
beauty spots are within easy reach: Avignon is the ideal base for visiting Provence.

LIA is the computer science lab of Avignon University: http://lia.univ-avignon.fr.

## Conditions

Net monthly salary: €1,500-2,100 (depending on the candidate's experience). Basic
healthcare coverage included (https://en.wikipedia.org/wiki/Health_care_in_France).

The position carries no direct teaching load but, if desired, teaching BSc- or MSc-level
courses is a possibility (paid extra hours), as is supervising student dissertation
projects.

Initial employment is for 12 months; an extension is possible. For an engineer, a shift
to a PhD position is possible.

## Applications

No deadline: applications are possible until the position is filled.

To apply, send the following documents *as a single PDF* to
fabrice.lefevre@univ-avignon.fr:

* Statement of research interests that motivates your application
* CV, including the list of publications if any
* Scans of transcripts and academic degree certificates
* MSc/PhD dissertation and/or any other writing samples
* Coding samples or links to your contributions to public code repositories, if any
* Names, affiliations, and contact details of up to three people who can provide
reference letters for you


6-63 Postdoctoral Fellowship, University of Connecticut Health, Farmington, CT, USA

Postdoctoral Fellowship, Speech Processing in Noise

University of Connecticut Health

Location: Farmington, CT

Start Date: January 2020, or thereafter

Duration: Initially 1 year, with potential for extension

Salary: Depends on experience, based on the NIH range; benefits include health care, retirement contributions, and paid leave for vacation, personal days, holidays and sickness.

Application Process: Please send your résumé, a one-page cover letter describing your research interests and experience, a list of publications (copies of the most relevant, optional), and contact information for three references to Dr Insoo Kim (ikim@uchc.edu).

A Postdoctoral Fellowship is available in the Division of Occupational Medicine, Department of Medicine, at the University of Connecticut Health to investigate algorithms for improving speech intelligibility in environmental noise. The work will involve simulating the noise of machines from known frequency spectra and creating speech-in-noise test files in MATLAB for replay to subjects in listening tests; the test files may be processed electronically to improve intelligibility before the psychoacoustic testing (a rough sketch of this pipeline follows). The position requires knowledge of, and practical experience with, speech or audio digital signal processing; proficiency with MATLAB and Simulink simulations; and familiarity with psychoacoustic testing of speech intelligibility in noise and with the development of embedded systems or digital signal processors.
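
As a rough illustration of this pipeline (a sketch in Python/NumPy rather than the MATLAB used in the project; the machine spectrum, test signal and SNR below are placeholders): white noise is spectrally shaped to a known machine spectrum, then mixed with speech at a chosen SNR to build a speech-in-noise test file.

import numpy as np
from scipy.io import wavfile

def shaped_noise(target_mag, n_samples, rng):
    """White noise spectrally shaped to a known machine magnitude spectrum."""
    spec = np.fft.rfft(rng.standard_normal(n_samples))
    # Interpolate the coarse machine spectrum onto the FFT bin grid.
    mag = np.interp(np.linspace(0, 1, spec.size),
                    np.linspace(0, 1, target_mag.size), target_mag)
    return np.fft.irfft(spec * mag, n=n_samples)

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db."""
    gain = np.sqrt(np.mean(speech**2) / (np.mean(noise**2) * 10**(snr_db / 10)))
    return speech + gain * noise

fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 220 * t)     # stand-in; a real test would read a speech WAV
machine_mag = np.linspace(1.0, 0.1, 64)  # placeholder low-pass-like machine spectrum
noise = shaped_noise(machine_mag, speech.size, np.random.default_rng(0))
mixed = mix_at_snr(speech, noise, snr_db=0.0)
wavfile.write("speech_in_noise.wav", fs,
              (mixed / np.max(np.abs(mixed)) * 32767).astype(np.int16))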

The Fellow will participate in ongoing research projects involving speech processing. He/she will be responsible for implementing the algorithms for improving speech communication in noise, conducting all psychoacoustic tests used to establish proof of concept, and analysing and interpreting the data. The Fellow will also have opportunities to supervise graduate and undergraduate students.

Candidates should have good oral and written English communication skills, be capable of independent work as part of a multi-disciplinary team, be able to work on multiple projects at the same time, publish results in academic journals, and participate in grant proposal preparation. They should have a Ph.D. degree in Acoustics, Electrical, Computer, or Biomedical Engineering, or a related field, with appropriate experience. The initial appointment is for a period of one year with potential for further extension. The review of applications will start immediately and will continue until the position is filled.


6-64(2019-12-08) PhD studentship, Utrecht University, The Netherlands

The Social and Affective Computing group at the Utrecht University Department of Information and Computing Sciences is looking for a PhD candidate to conduct research on explainable and accountable affective computing for mental healthcare scenarios. The five-year position includes 70% research time and 30% teaching time. The post presents an excellent opportunity to develop an academic profile as a competent researcher and able teacher.

Affective computing has great potential for clinician support systems, but it needs to produce insightful, explainable, and accountable results. Cross-corpus and cross-task generalization of approaches, as well as efficient and effective ways of leveraging multimodality, are among the main challenges in the field. Furthermore, data are scarce and class imbalance is to be expected. While addressing these issues, precision needs to be complemented by interpretability. Potential investigation areas include, for example, depression, bipolar disorder, and dementia.

The PhD candidate is expected to bridge the research efforts in cross-corpus, cross-task multimodal affect recognition with explainable/accountable machine learning, aiming at efficient, effective and interpretable predictions on a data-scarce and sensitive target problem. The candidate is also expected to be involved in teaching activities within the Department of Information and Computing Sciences. Teaching activities may include supporting senior teaching staff, conducting tutorials, and supervising student projects and theses. These activities will contribute to the development of the candidate's didactic skills.

We are looking for candidates with:

  • a Master's degree in computer science/engineering, mathematics, and/or fields related to the project focus;
  • interest or experience in processing audio/acoustics, vision/video or natural language;
  • interest or experience in machine learning, affective computing, information fusion, multimodal interaction;
  • demonstrable coding skills in high-level scripting languages such as MATLAB, Python or R;
  • excellent English oral and written skills.

The ideal candidate should express a strong interest in research in affective computing and teaching within the ICS department. The Department finds gender balance specifically and diversity in a broader sense very important; therefore women are especially encouraged to apply. Applicants are encouraged to mention any personal circumstances that need to be taken into account in their evaluation, for example parental leave or military service.

 

We offer an exciting opportunity to contribute to an ambitious and international education programme with highly motivated students and to conduct your own research project at a renowned research university. You will receive appropriate training, personal supervision, and guidance for both your research and teaching tasks, which will provide an excellent start to an academic career.

The candidate is offered a position for five years (1.0 FTE). The gross salary starts at €2,325 and increases to €2,972 per month for full-time employment (scale P according to the Collective Labour Agreement of Dutch Universities). Salaries are supplemented with a holiday bonus of 8% and a year-end bonus of 8.3% per year. In addition, Utrecht University offers excellent secondary conditions, including an attractive retirement scheme, (partly paid) parental leave and flexible employment conditions (multiple choice model). More information about working at Utrecht University can be found here.

Application deadline is 01.01.2020.

 Further information and application procedure can be found here.
 
 


6-65(2019-12-09) Postdoc , IRISA, Rennes, France
IRISA (France) is looking for a 30-month postdoctoral researcher on the topic of Natural Language Processing for Kids, starting in spring 2020.

 



