ISCA - International Speech Communication Association



ISCApad #270

Friday, December 11, 2020 by Chris Wellekens

6 Jobs
6-1(2020-06-10) 2 post-docs positions at UTDallas, Texas, USA

POST-DOCTORAL POSITION #1

Center for Robust Speech Systems: Robust Speech Technologies Lab 

 

Developing robust speech and language technologies (SLT) for naturalistic audio is among the most challenging problems in machine learning. CRSS-RSTL stands at the forefront of this initiative by releasing the largest publicly available naturalistic corpus in the world (150,000 hours). The FEARLESS STEPS corpus is a collection of multi-speaker, time-synchronized, multi-channel audio from all of NASA’s 12 manned Apollo missions. Deployment of such ambitious corpora requires the development of state-of-the-art support infrastructure, with multiple technologies working in concert to provide meaningful information to researchers from the science, technology, historical-archive, and educational communities. To this end, we are seeking a post-doctoral researcher in the area of speech and language processing and machine learning. The researcher will collaboratively aid in the development of speech, natural language, and spoken dialog systems for noisy multi-channel audio streams. Overseeing the digitization of analog tapes, community outreach and engagement, and assisting in cutting-edge SLT research are also important tasks for the project.

Those interested should send an email with their resume and areas of interest to John.Hansen@utdallas.edu. More information can be found on our website: CRSS–RSTLab (Robust Speech Technologies Lab) at https://crss.utdallas.edu/

 

 

POST-DOCTORAL POSITION #2

Center for Robust Speech Systems: Cochlear Implant Processing Lab 

 

Cochlear implants are one of the most successful solutions for restoring hearing sensation via an electronic device. However, the search for better sound coding and electrical stimulation strategies could be significantly accelerated by developing a flexible, powerful, portable speech processor for cochlear implants that is compatible with current smartphones/tablets. We are developing CCi-MOBILE, the next generation of such a research platform: more flexible and computationally more powerful than clinical research devices, it will enable the implementation and long-term evaluation of advanced signal processing algorithms in naturalistic and diverse acoustic environments. To this end, we are seeking a post-doctoral researcher in the area of cochlear implant signal processing and embedded hardware/systems design. The researcher will collaboratively aid in the development of embedded (FPGA-based) hardware (PCBs) for speech processing applications. Firmware development in Verilog and Java (Android) for the implementation of DSP algorithms is also an important task for the project.

Those interested should send an email with their resume and areas of interest to John.Hansen@utdallas.edu. More information can be found on our website: CRSS–CILab (Cochlear Implant Processing Lab) at https://crss.utdallas.edu/CILab/

 


6-2(2020-06-13) Tenure track researcher at CWI, Amsterdam, The Netherlands

Do you want to work with us at CWI in Amsterdam?

We have an open position for a tenure-track researcher at CWI (https://www.cwi.nl/) within our Distributed & Interactive Systems (DIS) group (https://www.dis.cwi.nl/). 
The focus is on Human-Centered Multimedia Systems (https://www.dis.cwi.nl/research-areas/human-centered-multimedia-systems/) and/or Quality of Experience (QoE) in immersive media (https://www.dis.cwi.nl/research-areas/qoe/).

You can find details about the position and application procedure here: 
https://www.cwi.nl/jobs/vacancies/tenure-track-position-in-multimedia-systems-and-human-computer-interaction (application deadline: July 15, 2020)

If you know of any candidates interested in such a position, please share this announcement with them. They are welcome to get in touch with me <p.s.cesar@cwi.nl> with any questions prior to any formal application process.


6-3(2020-06-18) Assistant/Postdoc positions at JKU and TU Wien, Austria

Are you interested in joining a vibrant research environment in the center of Europe, as a
postdoc researcher or PhD student researcher, and in working on exciting topics related to
machine learning, recommender systems, or (multimedia) information retrieval?

If so, please have a look at the following two announcements with upcoming deadlines:

* University Assistant/Postdoc level at JKU (40h/week, Aug 2020 - Jan 2023; application
deadline: June 24, 2020):
https://www.jku.at/fileadmin/gruppen/80/Stellenausschreibungen_E/4176_Homepage_E_20.05.2020.pdf

* PhD student researcher at JKU and TU Wien (40h/week, 3 years; application deadline:
July 9, 2020): https://tiss.tuwien.ac.at/mbl/blatt_struktur/anzeigen/10410#p242.3

For more details or informal inquiries, please contact me via markus.schedl@jku.at.


6-4(2020-06-20) Two fully-funded PhD studentships in automatic speech recognition - University of Sheffield

We are delighted to be able to offer two fully-funded PhD studentships in Automatic Speech Recognition at the Voicebase Centre of the University of Sheffield, to start in October 2020. The Voicebase studentship covers all fees and maintenance for 3 years at standard UK rates.

Topic 1: Semi-supervised Learning for Automatic Speech Recognition

To apply: https://www.jobs.ac.uk/job/CAJ398/phd-studentship-in-semi-supervised-learning-for-automatic-speech-recognition

Deadline: July 24, 2020

Topic 2: Multilingual Speech Recognition

To apply: https://www.jobs.ac.uk/job/CAJ394/phd-studentship-in-multilingual-speech-recognition

Deadline: July 24, 2020

For further information please contact Prof. Thomas Hain (t.hain@sheffield.ac.uk). 


6-5(2020-06-22) PhD grant, Université de Toulouse, France

Subject: “MOTRYLANG – The role of motor rhythm in language development and language disorders”

Supervisors: Corine Astésano, Jessica Tallet

Host Laboratories:

U.R.I Octogone-Lordat (EA 4156), Université de Toulouse II

Laboratoire ToNIC (UMR 1214), Université Paul Sabatier - Toulouse III

Discipline: Linguistics

Doctoral School: Comportement, Langage, Education, Socialisation, Cognition (CLESCO)

Scientific description of the research project:

The project aims to address a series of scientific and clinical questions regarding the place of motor activity in child language development and its rehabilitation.

Typical development comes with the implementation of rhythm in speech production (prosodic accenting) and also in movement production (tapping, walking, sensorimotor synchronisation…). Interestingly, the tempo of linguistic and motor rhythms is similar in healthy adults (around 700 ms, or 1.4 Hz).

The present project aims to (1) investigate the existence of a link between motor and linguistic rhythms and their associated neural correlates (electroencephalography, EEG) in children with and without linguistic disorders; (2) evaluate the impact of motor training on linguistic performance; and (3) create, computerize and test a language rehabilitation program based on the use of motor rhythm in children with language acquisition disorders.

This project will have scientific repercussions in linguistic and movement sciences as well as in the field of rehabilitation.

The selected candidate will benefit from a stimulating scientific environment: (s)he will join the Interdisciplinary Research Unit Octogone-Lordat (Toulouse II: http://octogone.univ-tlse2.fr/) and will be co-supervised by Corine Astésano, linguist-phonetician specializing in prosody, and by Jessica Tallet, specialist in rhythmic motor skills and learning at the ToNIC laboratory, Toulouse NeuroImaging Center (Toulouse III: https://tonic.inserm.fr/). The research will take place within a work group on Language, Rhythm and Motor skills, which brings together PhD students, rehabilitation professionals and collaborators from other universities.

Required skills:

- Master in linguistics, human movement sciences, cognitive sciences, health sciences or equivalent
- A speech therapist’s profile would be a plus
- Experience in experimental phonetics and/or linguistics, neuro-psycho-linguistics (speech disorders)
- Skills in linguistic data processing and analysis
- Skills in evaluating neurological speech disorders and in running linguistic remediation programs
- Autonomy and motivation for learning new skills (e.g. EEG)
- Good knowledge of French; good writing and oral skills in both French and English

Salary:

- 1768.55 gross per month, 3-year contract

Calendar:

- Application deadline: 6 July 2020
- Interviews of shortlisted candidates: 15 July 2020
- Start of contract: 1 October 2020

Applications must be sent to Corine Astésano (corine.astesano at univ-tlse2.fr) and must include:

- A detailed CV, with a list of publications if applicable
- A copy of the grades for the Master’s degree
- A summary of the Master’s dissertation and a pdf file of the Master’s dissertation
- A cover letter / letter of interest and/or scientific project (1 page max)
- A letter of recommendation from a scientific referee/supervisor


6-6(2020-06-30) PhD at Université Grenoble Alpes, France

Université Grenoble Alpes is recruiting a PhD student on a fully funded 3-year contract starting in October 2020, within the framework of the THERADIA project.

*The THERADIA project*

The THERADIA project aims to develop a virtual therapeutic assistant that acts as a relay and interface between the patient and the therapist, as well as the caregivers around the patient. Through extensive use of various artificial intelligence technologies, this empathetic assistant will be in charge of adapting the treatment to the patient's needs under the therapist's supervision, and of ensuring follow-up by interacting with the different actors (therapist, patient, caregivers) in a conversational mode. Compared with a classical bot, the objective is to endow the assistant with affective artificial intelligence based on the analysis of verbal and non-verbal exchanges (speech, prosody and facial expressions). The assistant will also have to be able to synthesize different behavioural styles to interact effectively, and to summarize the thread of interactions with the patient so as to provide the therapist or the caregivers with a summary of the exchanges and of the progress made. Once validated from a medico-economic point of view in the THERADIA context, these technologies will find many other applications for assisting humans who delegate the affective dimensions of their interactions to an artificial intelligence.


*Thesis topic*: Deep and hybrid learning for the automatic generation of multimedia adherence summaries of cognitive remediation, for caregivers and clinicians.

The objective of this PhD thesis is to design methods for generating reports, relevant to caregivers and speech therapists, that summarize the care of patients suffering from cognitive disorders during cognitive remediation sessions carried out at home. The generation task therefore consists in designing a system able to:
- identify, aggregate, select and structure the relevant information to be communicated to the recipient
- transform this structured information into a coherent multimedia document
- adapt the realization to the generation criteria (type of recipient, period to be summarized, text length)

In order to handle the high complexity of the task, two complementary approaches will be studied: an expert approach (generation guided by medical experts) [Portet et al. 2009] and an end-to-end neural approach trained by reinforcement learning (generation guided by the readers) [Brenon et al. 2018]. A hybrid approach will also be studied [Li et al. 2019]. These approaches depend on a set of examples being available to learn these different choices. Unfortunately, such data are not available today. Building on our latest work, we will address this problem with a neural model in a weakly supervised learning framework, which is able to take advantage of a small amount of supervised data and a large amount of unsupervised data. In our case, the supervised data will be those acquired within the project, and the unsupervised data will be obtained from the partners (interview reports) and from the web (texts conveying emotions). The state of the art shows that important advances have been made in automatic text generation with neural models. However, these advances rely on large corpora and on isolated, well-defined tasks. In our case, the difficulty mainly lies in the selection of information according to the recipient's profile, and in the lack of an adequate corpus and of performance measures. One of the objectives of this thesis will be to implement effective methods for weakly supervised learning [Qader et al. 2019], reinforcement learning, and data collection from the target users of the project.

Finally, the evaluation of automatic text generation is a complex task. Indeed, several linguistic, semantic and application-related dimensions must be taken into account (grammar, coherence, emotion, relevance, usefulness). The quality of the textual outputs will be evaluated, on the one hand, with automatic corpus-based measures during the development of the approach and, on the other hand, with subjective human evaluations (patients, caregivers, experts) at each major iteration of the generation system.


*Scientific environment*
The thesis will be carried out within the Getalp team of the LIG laboratory (https://lig-getalp.imag.fr/). The recruited person will join a team that offers a stimulating, multinational and pleasant working environment. The means to carry out the PhD will be provided, both for missions in France and abroad and for equipment (personal computer, access to the LIG GPU servers).
The person will also collaborate with several teams involved in the THERADIA project, in particular researchers from GIPSA-lab, also located in Grenoble, and collaborators from the company SBT HumanMatter(s) based in Lyon.


*How to apply?*
Candidates must hold a Master's degree in computer science or natural language processing (or be about to obtain one). They must have a good knowledge of machine learning methods and, ideally, experience in data collection and evaluation involving humans. They must also have a very good command of French and English, in order to process textual data in both languages and to conduct interviews mainly in French. Experience in automatic text generation would be a plus.

Applications are reviewed as they arrive, and the position will remain open until filled. Applications must include: a CV + a cover letter + Master's transcripts + letter(s) of recommendation, and must be sent to François Portet (Francois.Portet@imag.fr) and Fabien Ringeval (Fabien.Ringeval@imag.fr).


*References*
Brenon A., Portet F., Vacher M. (2018). Arcades: A deep model for adaptive decision making in voice controlled smart-homes. Pervasive and Mobile Computing, 49, pp. 92-110.
Li Y., Liang X., Hu Z. et al. (2018). Hybrid retrieval-generation reinforced agent for medical image report generation. Advances in Neural Information Processing Systems, pp. 1530-1540.
Portet F., Reiter E., Gatt A., Hunter J., Sripada S., Freer Y., Sykes C. (2009). Automatic generation of textual summaries from neonatal intensive care data. Artificial Intelligence, 173(7-8), pp. 789-816.
Qader R., Portet F., Labbé C. (2019). Neural Text Generation from Unannotated Data by Joint Learning of Natural Language Generation and Natural Language Understanding Models. INLG 2019.


--
François PORTET
Maître de conférences - Grenoble Institute of Technology
Laboratoire d'Informatique de Grenoble - Équipe GETALP
Bâtiment IMAG - Office 331
700 avenue Centrale
Domaine Universitaire - 38401 St Martin d'Hères
FRANCE

Phone:  +33 (0)4 57 42 15 44
Email:  francois.portet@imag.fr
www:    http://membres-liglab.imag.fr/portet/


6-7(2020-07-09) Speech Research Scientist at ETS R&D

Speech Research Scientist at ETS R&D:

 

https://etscareers.pereless.com/index.cfm?fuseaction=83080.viewjobdetail&CID=83080&JID=302819

 


6-8(2020-07-16) Early Stage Researcher / PhD student in an EU Marie Sklodowska-Curie Action (H2020-MSCA-ITN), Romania

We are hiring an Early Stage Researcher / PhD student in an EU Marie Sklodowska-Curie Action (H2020-MSCA-ITN) on the topic 'Designing and Engineering Multimodal Feedback to Augment the User Experience of Touch Input' under the supervision of Prof. Radu-Daniel Vatavu (http://www.eed.usv.ro/~vatavu)

The job will take place in the Machine Intelligence and Information Visualization Laboratory at Stefan cel Mare University of Suceava, Romania. The position is full-time (40 hours/week) for a fixed term of 3 years starting September 28, 2020.

Salary starts at 2,849.76 EUR per month (gross amount) and the research will be conducted in a network of EU partner universities. Full details here: http://www.eed.usv.ro/mintviz/jobs/Job-Announcement-ESR860114.pdf

Other information:
https://cordis.europa.eu/project/rcn/224423/en
https://euraxess.ec.europa.eu/jobs/491894
http://multitouch.fgiraud.ovh/wpjb-jobs/phd-multimodal-feedback/  

**** Deadline: September 1, 2020 ****

Any inquiries about this position are welcome at any time at the email address radu.vatavu@usm.ro with the subject 'ESR MULTITOUCH-860114'


6-9(2020-07-20) Technical project manager, INRIA, Nancy, France

Inria is seeking a Technical Project Manager for a European (H2020 ICT) collaborative
project called COMPRISE (https://www.compriseh2020.eu/).

COMPRISE is a 3-year Research and Innovation Action (RIA) aiming at new cost-effective,
multilingual, privacy-driven voice interaction technology. This will be achieved through
research advances in privacy-driven deep learning, personalized training, automatic data
labeling, and tighter integration of speech and dialog processing with machine
translation. The consortium includes academic and industrial partners in France, Germany,
Latvia, and Spain. The project has been ongoing for 1.5 years, and it has received very
positive feedback at its first review.

The successful candidate will be part of the Multispeech team at Inria Nancy (France). As
the Technical Project Manager of H2020 COMPRISE, he/she will be responsible for coordinating
the consortium in daily collaboration with the project lead. This includes orchestrating
scientific and technical collaborations as well as reporting, disseminating, and
communicating the results. He/she will also lead Inria's software development and
demonstration tasks.

Besides the management of COMPRISE, the successful candidate will devote half of his/her
time to other activities relevant to Inria. Depending on his/her expertise and wishes,
these may include: management of R&D projects in other fields of computer science,
involvement in software and technology development and demonstration tasks, building of
industry relationships, participation in the setup of academic-industry collaborations,
support with drafting and proofreading new project proposals, etc.

Ideal profile:
- MSc or PhD in speech and language processing, machine learning, or a related field
- at least 5 years' experience after MSc/PhD, ideally in the private sector
- excellent software engineering, project management, and communication skills

Application deadline: August 30, 2020

Starting date: October 1, 2020
Duration: 14 months
Location: Nancy, France
Salary: from 2,300 to 3,700 EUR net/month, according to experience

For more details and to apply:
https://jobs.inria.fr/public/classic/en/offres/2020-02908


6-10(2020-09-10) Experts in speech recognition and synthesis at Reykjavík University's Language and Voice Lab, Iceland
Reykjavík University's Language and Voice Lab (https://lvl.ru.is) is looking for experts in speech recognition and in speech synthesis. At the LVL you will be joining a research team working on exciting developments in language technology as part of the Icelandic Language Technology Programme (https://arxiv.org/pdf/2003.09244.pdf).

Job Duties:
- Conduct independent research in the fields of speech processing, machine learning, speech recognition/synthesis and human-computer interaction.
- Work with a team of other experts in carrying out the Speech Recognition/Synthesis part of the Icelandic Language Technology Programme.
- Publish and disseminate research findings in journals and present at conferences.
- Actively take part in scientific and industrial cooperation projects.
- Assist in supervising Bachelor's/Master's students.

Skills:
- MSc/PhD degree in engineering, computer science, statistics, mathematics or similar
- Good programming skills (e.g. C++ and Python) and knowledge of Linux (necessary)
- Good knowledge of a deep learning library such as PyTorch or TensorFlow (necessary)
- Good knowledge of Kaldi (preferable)
- Background in language technology (preferable)
- Good skills in writing and understanding shell scripts (preferable)

All applications must be accompanied by a good CV with information about previous jobs, education, references, etc. Optionally, applicants may attach a cover letter explaining why they are the right person for the job. Here is the link to apply: https://jobs.50skills.com/ru/is/5484

The application deadline is October 4th, 2020. Applications are only accepted through RU's recruitment system. All inquiries and applications will be treated as confidential. Further information about the job is provided by Jón Guðnason, Associate Professor, jg@ru.is, and Ester Gústavsdóttir, Director of Human Resources, esterg@ru.is.

The role of Reykjavik University is to create and disseminate knowledge to enhance the competitiveness and quality of life for individuals and society, guided by good ethics, sustainability and responsibility. Education and research at RU are based on strong ties with industry and society. We emphasize interdisciplinary collaboration, international relations and entrepreneurship.
 

6-11(2020-09-17) Proposition de contrat doctoral, Sorbonne University (Jussieu), Paris, France

Doctoral contract offer

Title:

Speech rhythm and manual gestures in performative synthesis

Summary of the topic:

The aim of this thesis is to develop a theoretical framework and experiments on the use of manual gestures for prosodic control via human-machine interfaces, in performative synthesis. Performative voice synthesis is a new research paradigm in human-machine interaction, in which a synthetic voice is played like an instrument in real time using the limbs (hands, feet). Controlling speech rhythm with the hands is a problem that involves rhythmic units, rhythmic control points, the perceptual centres of syllables and tapping gestures, and even gestural scores inspired by autosegmental or articulatory phonologies. Rhythmic units vary with the phonology of the language studied, here French, English and Mandarin Chinese. The challenges of the thesis therefore concern the modelling of the perception-action schemes involved in rhythmic control, the modelling of temporal units, and the implementation and evaluation of a rhythm control system. The targeted applications are:

1. learning natural control of intonation contours by means of chironomy for the acquisition of foreign languages (English, French, Mandarin);

2. learning chironomic control of the intonation contours of the native language, for voice replacement (artificial larynx).

Context:

The voice is not a musical 'instrument' in the sense of an artefact set into vibration by the limbs or the breath. The vocal organs are internal, largely invisible, and controlled in a complex way by several muscle groups (respiration, phonation, articulation). Vocal control is therefore interoceptive by nature, whereas it is more kinaesthetic and exteroceptive for musical instruments.

The advent of digital synthesis allows, for the first time, an unmistakably vocal sound to be rendered by an external instrumental device, set at a distance from the vocal apparatus. These 'vocal instruments' are 'operated' by the hands or feet, using sensors or human-machine interfaces. This distancing raises the question of vocal control in terms quite different from those of controlling an acoustic instrument or the voice itself. Vocal instruments currently allow musical control of phonation for the singing voice: intonation, rhythmic sequencing, voice quality. Very precise control of articulation and rhythm in speech remains problematic. The purpose of this thesis is to address the question of gestural control of prosodic rhythm and articulatory sequencing.

Objectives and expected results:

This thesis is part of the line of research on vocal instruments. A vocal instrument is a real-time voice synthesizer under gestural control. Synthesis is carried out by a program that produces the samples. Gestural control uses interfaces to capture the gestures. Since the movements of the articulators are very fast, it is difficult to control them directly with manual gestures, and methodologies based on the phonological representation of prosodic rhythm must be put in place.

Rhythm is realized by gestures of the limbs, hands or feet, in place of the articulatory gestures corresponding to syllables. The perception-action loops are no longer the same, nor are the velocities of the organs set in motion. Controlling prosodic rhythm in performative synthesis is therefore a problem that involves the definition of rhythmic units, rhythmic control points, perceptual centres of syllables, tapping gestures, and even gestural scores inspired by autosegmental or articulatory phonologies.

Rhythmic control points must enrich the speech signal so that its temporal unfolding can be manipulated. These points must be meaningful with respect to the phonology of the language being played and to its phonotactics. The perception of the syllabic flow, with its perceptual centres, is therefore involved. The control gestures, by pressing or tapping, involve motor processes that are both analogous to and different from those of the articulators. Rhythmic units vary with the phonology of the languages studied, here French, English and Mandarin Chinese. The challenges of the thesis therefore concern the modelling of the perception-action schemes involved in rhythmic control, the modelling of temporal units, and the implementation and evaluation of a rhythm control system.

The expected results are both theoretical and practical:

- perceptual experiments will relate the different temporal units to one another;

- phonological theories of the organization of the phonatory gesture will be put to the test with a new experimental paradigm;

- a new synthesizer will be built;

- a set of methods for the gestural control of synthesis, new gestures and adapted interfaces will be developed and tested in the targeted application tasks, namely learning natural control of intonation contours by means of chironomy for the acquisition of foreign languages (English, French, Mandarin), and learning chironomic control of the intonation contours of the native language for voice replacement (artificial larynx).

Methodology:

The phonological and phonetic theories of the temporal organization of the languages studied will be considered within the paradigm of performative synthesis. The study of the relations between rhythmic control points, perceptual centres, tapping gestures and phonological units involves modelling and experimentation, with subjects performing perception-action tasks. The methodology here draws on experimental psychology and phonetics: corpus definition, implementation of test protocols, testing, statistical analyses.

A synthesizer using the new rhythmic control paradigms will be developed. The methodology here draws on audio and speech signal processing as well as computer science, from design through to programming.

In this way, a set of methods for the gestural control of prosodic rhythm and timing will be developed and tested in the targeted application tasks. These methods include both the control gestures and the control interfaces, and fall within computer science in the field of human-machine interfaces.

Prerequisites:

This topic lies at the interface of speech synthesis and human-machine interfaces, prosody, perception and musical performance. It requires general knowledge of digital audio signal processing and of computer music or human-machine interfaces. Part of the work will involve software development. Knowledge of voice and speech, phonetics and phonology, as well as experimental psychology or cognitive science, will be necessary.

Applications with an initial background in computer science and signal processing, as well as those with an initial background in linguistics, phonetics or cognitive science, will be considered. The initial training will be complemented, where needed, in the less familiar domains.

Supervision:

Christophe d'Alessandro, DR CNRS, head of the LAM team

Institut Jean Le Rond d'Alembert, Sorbonne Université

christophe.dalessandro@sorbonne-universite.fr

This doctoral project is part of the ANR Gepeto project, in collaboration with the LPP (Sorbonne Nouvelle) and GIPSA-Lab, Université de Grenoble.

Start of the contract as soon as possible (from October 2020).

References:

Delalez, S. and d'Alessandro, C. (2017). “Vokinesis: syllabic control points for performative singing synthesis”, NIME'17, pp. 198-203.

Xiao, X., Locqueville, G., d'Alessandro, C. and Doval, B. (2019). “T-Voks: Controlling Singing and Speaking Synthesis with the Theremin”, Proceedings of the International Conference on New Interfaces for Musical Expression, NIME'19, June 3-6, 2019, Porto Alegre, Brazil, pp. 110-115.

Delalez, S. and d'Alessandro, C. (2017). “Adjusting the Frame: Biphasic Performative Control of Speech Rhythm”, Proc. INTERSPEECH 2017, Stockholm, Sweden, pp. 864-868.

d'Alessandro, C., Rilliard, A. and Le Beux, S. (2011). “Chironomic stylization of intonation”, J. Acoust. Soc. Am., 129(3), pp. 1594-1604.

d'Alessandro, C., Feugère, L., Le Beux, S., Perrotin, O. and Rilliard, A. (2014). “Drawing melodies: evaluation of chironomic singing synthesis”, J. Acoust. Soc. Am., 135(6), pp. 3601-3612.

Chow, I., Belyk, M., Tran, V. and Brown, S. (2015). Syllable synchronization and the P-center in Cantonese, 49:55-66.

Fowler, C. (1979). “Perceptual centers” in speech production and perception. Perception & Psychophysics, 25:375-388.

Howell, P. (1988). Prediction of P-center location from the distribution of energy in the amplitude envelope. Perception & Psychophysics, 43:90-93.

MacNeilage, P. F. (1998). The frame/content theory of evolution of speech production. Behavioral and Brain Sciences, 21:499-546.

Marcus, S. M. (1981). Acoustic determinants of perceptual center (P-center). Perception & Psychophysics, 30:247-256.

Morton, J., Marcus, S. and Frankish, C. (1976). Perceptual centers (P-centers). Psychological Review, 83(5):405-408.

Pompino-Marshall, B. (1989). On the psycho-acoustic nature of the P-center phenomenon. Journal of Phonetics, 17:175-192.

Rapp-Holmgren, K. (1971). A study of syllable timing. STL-QPSR, 12(1):014-019.

Repp, B. H. (2005). Sensorimotor synchronization: A review of the tapping literature. Psychon. Bull. Rev., 12(6):969-992.

Repp, B. H. and Su, Y. H. (2013). Sensorimotor synchronization: A review of recent research. Psychon. Bull. Rev., 20:403-452.

Wagner, P. (2008). The Rhythm of Language and Speech: Constraining Factors, Models, Metrics and Applications. Habilitation thesis, Rheinische Friedrich-Wilhelms-Universität Bonn.


6-12(2020-09-24) Offre de thèse: Modèles profonds pour la reconnaissance et l'analyse de la parole spontanée,Grenoble, France

*Topic*
Deep models for the recognition and analysis of spontaneous speech

The processing of spontaneous speech is one of the major challenges facing automatic speech recognition (ASR). Spontaneous speech differs significantly from prepared speech (reading, films, radio speech, voice commands, etc.), notably because of disfluencies (pauses, repetitions, repairs, false starts). These characteristics defeat traditional ASR systems, because the structure of spontaneous speech is much more difficult to model than that of prepared speech.

In this PhD project we will focus on lexicon-free methods based on sequence-to-sequence architectures in order, in a first part, to improve ASR performance on spontaneous speech and, in a second part, to study the internal structures of the neural models so as to bring out new hypotheses about spontaneous speech.
In the first part, sequence-to-sequence models will be designed and trained on the existing transcribed corpora (more than 300 hours) recorded in everyday communication (recordings of speech within a family, in a shop, during an interview, etc.) in order to learn deep models of spontaneous speech.

The second part of the thesis will consist in analysing the representations learned by the models and confronting them with linguistic theories and models of spontaneous speech at the prosodic, phonetic and grammatical levels. The contributions of the thesis will be to produce speech recognition systems adapted to spontaneous speech, to make it possible to explain these models with respect to current knowledge about spontaneous speech, and to bring out linguistic characteristics intrinsic to spontaneous speech.

The thesis will be carried out in collaboration with faculty members of the Lidilem laboratory, with applications in sociolinguistics and field linguistics.

*Scientific environment*
The thesis will be conducted within the Getalp team of the LIG laboratory (http://www.liglab.fr/en/research/research-areas-and-teams/getalp) in collaboration with the LIDILEM laboratory (https://lidilem.univ-grenoble-alpes.fr/). The recruited PhD student will also benefit from the support of the Artificial Intelligence & Language chair of the MIAI institute (https://miai.univ-grenoble-alpes.fr).

*How to apply?*
Candidates must hold a Master's degree in computer science or natural language processing. They should ideally have solid expertise in automatic speech processing. They must also have a very good command of French and English. Applications are expected before October 9, 2020 and must be sent to François Portet (Francois.Portet@imag.fr) and Solange Rossato (Solange.Rossato@imag.fr).
Candidates must attach:
- a cover letter explaining why they consider themselves able to carry out this thesis project,
- their most recent degree,
- a CV.
They may also add letters of recommendation.

The selection committee will inform candidates of its decision before October 15, 2020. If you have further questions about the position and the project, please contact François Portet and/or Solange Rossato.


6-13(2020-10-05) Research Assistant/Associate in Spoken Language Processing, Cambridge University, UK
Research Assistant/Associate in Spoken Language Processing x 2 (Fixed Term)
Speech Research Group, Cambridge University Engineering Department, UK

6-14(2020-10-05) 2 positions at Radboud University, Nijmegen, The Netherlands

 

We have two vacancies for speech technology employees:

 

 

 


6-15(2020-10-08) Job offer: 1-year postdoc position at LSP (ENS Paris), France

Job offer: 1-year postdoc position at LSP (with a possibility of 1-year extension)

The Laboratoire des Systèmes Perceptifs (LSP, ENS Paris / CNRS, https://lsp.dec.ens.fr/en) is offering a postdoc position for the ANR project fastACI ('Exploring phoneme representations and their adaptability using fast Auditory Classification Images') supervised by Léo Varnet (leo.varnet@ens.psl.eu).

The fastACI project aims to develop a fast and robust experimental method to visualize and characterize the auditory mechanisms involved in phoneme recognition. The fast-ACI method relies on a stimulus-response model, combined with a reverse correlation (revcorr) experimental paradigm. This approach, producing an 'instant picture' of a participant’s listening strategy in a given context, has already yielded conclusive results, but remains very time-consuming. Therefore, the main objectives of this postdoc contract will be (1) to improve the efficiency of the process using advanced supervised learning techniques (e.g. sparse priors on a smooth basis) and an online adaptive protocol (e.g. Bayesian optimisation); then (2) to use this technique to map the phonemic representations used by normal-hearing listeners. For this purpose, a large number of experiments on phoneme-in-noise categorization tasks (e.g. /aba/ vs. /ada/) will be carried out, in order to ensure the broadest possible mapping of the French phonological inventory. As a second step, the new tool will be used to explore the adaptability of speech comprehension in the case of sensorineural hearing loss and noisy backgrounds.

The post-doc will be involved in all activities in line with the project, including data collection, coding and statistical analyses. The post-doc will also coordinate trainees’ and students’ work involved in the project, and contribute significantly to publication of the findings.

Required profile:

- Background in psychoacoustics and/or machine learning (the candidate should preferably hold a PhD in one of these fields)

- High skills in statistical data processing (in particular supervised learning algorithms) and practical knowledge in psychophysics

- Basic understanding of psychoacoustics and psycholinguistics

- Good knowledge of Matlab programming (other languages such as R can also be useful)

- Strong communication skills in English (working language in the lab) and French (interactions with participants, etc.)

Duration: 12 months or 24 months

Start: Early 2021

Net salary: ~2,100 €/month

Application Procedure:

Applications must include a detailed CV and a motivation letter, a link to (or copy of) the PhD thesis, the PhD viva report, plus the email contacts of 2 referees. Applications are to be sent to Léo Varnet (leo.varnet@ens.psl.eu) before November 8, 2020 (interviews should take place on November 18 by videoconference).


6-16(2020-10-20) PhD grant at INRIA Nancy, France

Privacy preserving and personalized transformations for speech recognition

 

This research position fits within the scope of a collaborative project (funded by the French National Research Agency) involving several French teams, among which the MULTISPEECH team of Inria Nancy - Grand-Est.

One objective of the project is to transform speech data so as to hide some speaker characteristics (such as voice identity, gender information, emotion, ...), in order to safely share the transformed data while preserving speaker privacy. The shared data is to be used to train and optimize models for speech recognition.

The selected candidate will collaborate with other members of the project, and will participate to the project meetings.

 

Scientific Context

Over the last decade, great progress has been made in automatic speech recognition [Saon et al., 2017; Xiong et al., 2017]. This is due to the maturity of machine learning techniques (e.g., advanced forms of deep learning), to the availability of very large datasets, and to the increase in computational power. Consequently, the use of speech recognition is now spreading in many applications, such as virtual assistants (as for instance Apple’s Siri, Google Now, Microsoft’s Cortana, or Amazon’s Alexa) which collect, process and store personal speech data in centralized servers, raising serious concerns regarding the privacy of the data of their users. Embedded speech recognition frameworks have recently been introduced to address privacy issues during the recognition phase: in this case, a (pretrained) speech recognition model is shipped to the user's device so that the processing can be done locally without the user sharing its data. However, speech recognition technology still has limited performance in adverse conditions (e.g., noisy environments, reverberated speech, strong accents, etc.) and thus, there is a need for performance improvement. This can only be achieved by using large speech corpora that are representative of the actual users and of the various usage conditions. There is therefore a strong need to share speech data for improved training that is beneficial to all users, while preserving the privacy of the users, which means at least keeping the speaker identity and voice characteristics private [1].

Missions: (objectives, approach, etc.)

Within this context, the objective is twofold: first, to improve privacy-preserving transforms of the speech data, and, second, to investigate the use of additional personalized transforms, applied on the user’s terminal, to increase speech recognition performance.

In the proposed approach, the device of each user will not share its raw speech data, but a privacy-preserving transformation of the user's speech data. In such an approach, some private computations are handled locally, while some cross-user computations may be carried out on a server using the transformed speech data, which protects the speaker identity and some of his/her features (gender, sentiment, emotions...). More specifically, this relies on representation learning to separate the features of the user data that can expose private information from generic ones useful for the task of interest, i.e., here, the recognition of the linguistic content. On this topic, recent experiments have relied on Generative Adversarial Networks (GANs) to propose a privacy-preserving transform [Srivastava et al., 2019], and on voice conversion approaches [Srivastava et al., 2020].

In addition, as devices are getting more and more personal, this creates opportunities to make speech recognition more personalized. Some recent studies have investigated approaches that take advantage of speaker information [Turan et al., 2020].

The candidate will investigate further approaches along these lines. Other topics, such as the impact and benefit of adding random noise to the transforms, will also be part of the studies, as well as dealing with (hiding) some paralinguistic characteristics. Research directions and priorities will take into account new state-of-the-art results and ongoing activities in the project.

 

Skills and profile:

PhD or Master in machine learning or in computer science

Background in statistics, and in deep learning

Experience with deep learning tools is a plus

Good computer skills (preferably in Python)

Experience in speech and/or speaker recognition is a plus

 

Bibliography:

[Saon et al., 2017] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, B. Roomi, and P. Hall: English conversational telephone speech recognition by humans and machines. Technical report, arXiv:1703.02136, 2017.

[Srivastava et al., 2019] B. Srivastava, A. Bellet, M. Tommasi, and E. Vincent: Privacy preserving adversarial representation learning in ASR: reality or illusion? INTERSPEECH 2019 - 20th Annual Conference of the International Speech Communication Association , Sep 2019, Graz, Austria.

[Srivastava et al., 2020] B. Srivastava, N. Tomashenko, X. Wang, E. Vincent, J. Yamagishi, M. Maouche, A. Bellet, and M. Tommasi: Design choices for x-vector based speaker anonymization. INTERSPEECH 2020, 21th Annual Conference of the International Speech Communication Association, Oct 2020, Shanghai, China.

[Turan et al., 2020] T. Turan, E. Vincent, and D. Jouvet: Achieving multi-accent ASR via unsupervised acoustic model adaptation. INTERSPEECH 2020, 21th Annual Conference of the International Speech Communication Association, Oct 2020, Shanghai, China.

[Xiong et al., 2017] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. Achieving human parity in conversational speech recognition. Technical report, arXiv:1610.05256, 2017.

 

Additional information:

Supervision and contact:

Denis Jouvet (denis.jouvet@inria.fr; https://members.loria.fr/DJouvet/)

Duration: 2 years

Starting date: autumn 2020

Location: Inria Nancy – Grand Est, 54600 Villers-lès-Nancy

 

 

footnote [1] : Note that when sharing data, users may want not to share data conveying private information at the linguistic level (e.g., phone number, person name, …). Such privacy aspects also need to be taken into account, but they are out-of-the scope of this project.

 


6-17(2020-10-21) Fully funded PhD position at >IDIAP, Martigny, Switzerland

There is a fully funded PhD position open at Idiap Research Institute on 'Speech
recognition and natural language processing for digital interviews'.

At a high level, we are interested in how candidates for jobs respond in structured
selection interviews; in particular how they are able to tell stories about past work
situations. More concretely, we will investigate speech recognition architectures
suitable for such interviews, and natural language processing solutions that are able to
infer the higher level semantics required by our collaborators in the social sciences.

A priori, given that it is the higher-level semantics that are of interest, we expect to
make use of recent lexicon-free approaches to speech recognition. We also expect to draw
from the rapidly advancing language modelling field with tools such as BERT and its
successors. There will be ample opportunity for research in the component technologies,
which are all currently pertinent in the general machine learning landscape. In
particular, we expect to make advances in the technological interfaces between component
technologies, and in the humanist interfaces between the machine learning practitioners
and social scientists.

For more information, and to apply, please follow this link:
 http://www.idiap.ch/education-and-jobs/job-10312

Idiap is located in Martigny in French-speaking Switzerland, but functions in English and
hosts many nationalities.  PhD students are registered at EPFL. All positions offer quite
generous salaries. Martigny has a distillery and a micro-brewery and is close to all
manner of skiing, hiking and mountain life.

There are other open positions on Idiap's main page
 https://www.idiap.ch/en/join-us/job-opportunities


6-18(2020-10-23) PhD grant at Université Grenoble-Alpes, France.

Within the framework of the 'Bayesian Cognition and Machine Learning for Speech Communication' chair (http://www.gipsa-lab.fr/projet/MIAI-Speech/), funded by the Multidisciplinary Institute in Artificial Intelligence (MIAI) of Université Grenoble Alpes (https://miai.univ-grenoble-alpes.fr/), we are offering a PhD thesis on the control of speech production. This thesis will study how motor planning and execution interact to reach, under highly varied conditions that may change during speech, the auditory and somatosensory goals that enable efficient spoken communication. To this end, the work will consist in implementing and deepening the hypothesis that the brain develops and exploits internal representations or models of the dynamics of the articulators and of their influence on the acoustics, in order to predict at every instant the state of the articulatory system and the auditory correlates of the acoustic signal. Particular attention will be paid to modelling how the brain can integrate, in real time, these predictions with the actual somatosensory and auditory feedback, which is noisy and delayed by the transmission times of the physical and physiological systems, so as to ensure correct pronunciation of the sounds in all conditions while minimizing the effort produced.

Keywords: speech motor control; motor control; control theory; cognitive science; machine learning

For more information see:
http://www.gipsa-lab.fr/projet/MIAI-Speech/thesis/DEEPTONGUE_PhD_Proposal_Baraduc_Perrier.pdf

Best regards,

Pierre Baraduc and Pascal Perrier


6-19(2020-10-26) Postdoctoral position at the Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, USA

Postdoctoral position in Mobile Sensing and Child Mental Health

Beckman Institute for Advanced Science & Technology

University of Illinois at Urbana-Champaign

 

Our interdisciplinary research team at the Beckman Institute for Advanced Science and Technology is developing and applying innovative tools and methods from mobile sensing, signal processing and machine learning to gain insight into the dynamic processes underlying the emergence of disturbances in child mental health. We have engineered a wearable sensing platform that captures speech, motion, and physiological signals of infants and young children in their natural environments, and we are applying data-driven machine-learning approaches and dynamic statistical modeling techniques to large-scale, naturalistic, and longitudinal data sets to characterize dynamic child-parent transactions and children’s developing stress regulatory capacities and to ultimately capture reliable biomarkers of child mental health disturbance.

 

We seek outstanding candidates for a postdoctoral scholar position that combines multimodal sensing and signal processing, dynamic systems modeling, and child mental health. The ideal candidate would have expertise in one or more of the following domains related to wearable sensors:

 

  • signal processing of audio, motion or physiological data

  • statistical modeling of multivariate time series data

  • mobile health interventions including wearable sensors

  • activity recognition

  • machine learning

  • digital phenotyping

 

In addition to joining a highly interdisciplinary team and making contributions to high impact research on mobile sensing and child mental health, this position provides strong opportunities for professional development and mentorship by faculty team leaders, including Drs. Mark Hasegawa-Johnson, Romit Roy Choudhury, and Nancy McElwain. In collaboration with the larger team, the postdoctoral scholar will play a central role in preparing conference papers and manuscripts for publication, contributing to the preparation of future grant proposals, and assisting with further development of our mobile sensing platform for use with infants and young children.

 

Applicants should have a doctoral degree in computer engineering, computer science, or a field related to data analytics of wearable sensors, as well as excellent skills in programming, communication, and writing. Appointment is for at least two years, contingent on first-year performance. The position start date is negotiable.

 

Please send a cover letter and CV to Drs. Mark Hasegawa-Johnson (jhasegaw@illinois.edu) and Nancy McElwain (mcelwn@illinois.edu). Applications will be considered until the position is filled, with priority given to applications submitted by November 15th.

 

The University of Illinois is an Equal Opportunity, Affirmative Action employer. Minorities, women, veterans and individuals with disabilities are encouraged to apply. For more information, visit http://go.illinois.edu/EEO.


6-20(2020-10-28) TWO positions in Trinity College Dublin, Ireland

We have openings for TWO positions in Trinity College Dublin, Ireland, available from Dec 1st, 2020 for 14 months. We are seeking:

 

A Research Assistant (qualified to Masters level)

A Research Fellow (holds a PhD)

 

The Project:

RoomReader is a project led by Prof. Naomi Harte in TCD and Prof. Ben Cowan in UCD, Ireland. The research is exploring and modelling online interactions, and is funded by the Science Foundation Ireland Covid 19 Rapid Response Call. The candidate will be working with a team to drive research into multimodal cues of engagement in online teaching scenarios. The work involves a collaboration with Microsoft Research Cambridge, and Microsoft Ireland.

The Research Assistant will have a psychology/linguistics/engineering background (we are flexible) and will be tasked with researching and designing a new online task to elicit speech-based interactions relevant to online teaching scenarios (think multi-party MapTask or Diapix, but different). They will also be responsible for the capture of that dataset and its subsequent editing/labelling for deployment in the project and eventual sharing with the wider research community. Annual gross salary up to €34,930, depending on experience.

The Research Fellow needs a background, including a PhD, in deep learning and the modelling of multimodal cues in speech. Their previous experience might be in conversational analysis, multimodal speech recognition or other areas. They should have a proven track record with publications commensurate with career stage. Annual gross salary up to €50,030, depending on experience.

 

The project starts on Dec 1st, and the positions can start from that date and continue for 14 months. Please email nharte@tcd.ie for a more detailed description of either role, or to discuss. I am open to a person remote-working for the remainder of 2020, but the ideal candidate will be in a position to move to Ireland for Jan 2021 and work with the team in TCD.

 

Sigmedia Research Group @ Trinity College Dublin, Ireland

The Signal Processing and Media Applications (aka Sigmedia) Group was founded in 1998 in Trinity College Dublin, Ireland. Originally focused on video and image processing, the group today spans research across all aspects of media: video, images, speech and audio. Prof. Naomi Harte leads the Sigmedia research endeavours in human speech communication. The group has active research in audio-visual speech recognition, evaluation of speech synthesis, multimodal cues in human conversation, and birdsong analysis. The group is interested in all aspects of human interaction, centred on speech. Much of our work is underpinned by signal processing and machine learning, but we also have researchers with a background in the linguistic and psychological aspects of speech processing to keep us all grounded.

Back  Top

6-21(2020-10-30) 3 Tenure Track Professors (W2) at Saarland University, Germany

3 Tenure Track Professors (W2) at Saarland University

Saarland University is seeking to hire up to

3 Tenure Track Professors (W2)

in computer science and related areas with six-year tenure track to a permanent
professorship (W3). We are looking for highly motivated young researchers in any modern
area of Computer Science, especially in one or more of the following research areas:

. Artificial Intelligence, Machine Learning
. Natural Language Processing
. Data Science, Big Data
. Graphics, Visualization, Computer Vision
. Human-Computer Interaction
. Programming Languages and Software Engineering
. Computer Architecture and High-Performance Computing
. Networked, Distributed, Embedded, Real-Time Systems
. Bioinformatics
. Computational Logic and Verification
. Theory and Algorithms
. Societal Aspects of Computing
. Robotics
. Quantum Computing

The position will be established in the Department for Computer Science or in the
Department for Language Science and Technology of Saarland University, which is part of
the Saarland Informatics Campus (SIC), located in Saarbrücken. With its 800 researchers
and more than 2,000 students from 81 countries, the SIC is one of the leading
locations for Computer Science in Germany and in Europe. All areas of computer science
are covered at five globally renowned research institutes and three collaborating
university departments, as well as in a total of 21 academic programs.

Our Offer: Tenure track professors (W2) have faculty status at Saarland University,
including the right to supervise bachelor's, master's and PhD students. They focus on
world-class research, will lead their own research group and will have a significantly
reduced teaching load. In case of outstanding performance, the position will be tenured
as full professor (W3). The tenure decision is made after at most 6 years.

The position provides excellent working conditions in a lively scientific community,
embedded in the vibrant research environment of the Saarland Informatics Campus. Saarland
University has leading departments in Computer Science and Computational Linguistics,
with more than 350 PhD students working on timely topics (see
https://saarland-informatics-campus.de/ for additional information).

Qualifications:

Applicants must hold an outstanding PhD degree, typically have completed a postdoc stay
and have teaching experience. They must have demonstrated outstanding research abilities
and the potential to successfully lead their own research group. Active contribution to
research and teaching, including basic lectures in computer science, is expected.
Teaching languages are English or German (basic courses). We expect sufficient German
language skills after an appropriate period.

Your Application: Candidates should submit their application online at:
https://applications.saarland-informatics-campus.de
No additional paper copy is required. The application must contain:

. a cover letter and curriculum vitae
. a full list of publications
. a short prospective research plan (2-5 pages)
. copies of degree certificates
. full text copies of the five most important publications
. a list of references: 3-5 (including email addresses), at least one of whom must be
  a person who is outside the group of your current or former supervisors or
  colleagues.

Applications will be accepted until December 11th, 2020. Application talks will take
place between Feb. 01 and Feb. 26, 2021. Please refer to reference W1786 in your
application. Please contact apply@saarland-informatics-campus.de if you have any
questions.

Saarland University is an equal opportunity employer. In accordance with its policy of
increasing the proportion of women in this type of employment, the University actively
encourages applications from women. For candidates with equal qualifications, preference
will be given to people with disabilities. In addition, Saarland University regards
internationalization as a university-wide cross-sectional task and therefore encourages
applications that align with activities to further internationalize the university. When
you submit a job application to Saarland University you will be transmitting personal
data. Please refer to our privacy notice (https://www.uni-
saarland.de/verwaltung/datenschutz/) for information on how we collect and process
personal data in accordance with Art. 13 of the General Data Protection Regulation
(GDPR). By submitting your application you confirm that you have taken note of the
information in the Saarland University privacy notice.

Links:

https://www.uni-saarland.de/fileadmin/user_upload/verwaltung/stellen/wissenschaftler/W1786_W2TT-Informatik_Ausschreibung_final.pdf

Saarland University, https://www.uni-saarland.de/en/home.html

Back  Top

6-22(2020-10-30) 2 Research Assistant/Associate Posts at Cambridge University, Great Britain
2 Research Assistant/Associate Posts in Spoken Language Processing at Cambridge University (Fixed Term)
 
Applications are invited for two Research Assistants/Research Associates in the Department of Engineering, Cambridge University, to work on an EPSRC funded project Multimodal Video Search by Examples (MVSE). The project is a collaboration between three Universities, Ulster, Surrey and Cambridge, and the BBC as an industrial partner. The overall aim of the project is to enable effective and efficient multimodal video search of large archives, such as BBC TV programmes.

The research associated with these positions will focus on deriving representations for all the information contained within the video's speech signal, and integrating them with other modalities. The forms of representation will include both voice analytics (e.g. speaker and emotion) and topic and audio content analytics (e.g. word-sequence and topic classification and tracking). The position will involve close collaboration with Surrey and Ulster Universities to integrate other information sources, video and the audio scene, to yield a flexible and efficient video search index.

Fixed-term: The funds for this post are available until 31 January 2024 in the first instance
Closing date: 1st December 2020

Full information can be found at: http://www.jobs.cam.ac.uk/job/27458/
Back  Top

6-23(2020-11-02) Fully-funded PhD studentships at the University of Sheffield, Great Britain

 Fully-funded PhD studentships in Speech and NLP at the University of Sheffield

*******************************************************************************************************

 

UKRI Centre for Doctoral Training (CDT) in Speech and Language Technologies (SLT) and their Applications 

 

Department of Computer Science

Faculty of Engineering 

University of Sheffield, UK

 

Fully-funded 4-year PhD studentships for research in speech technologies and NLP

 

** Applications now open for September 2021 intake **

 

Deadline for applications: 31 January 2021. 

 

Speech and Language Technologies (SLTs) are a range of Artificial Intelligence (AI) approaches which allow computer programs or electronic devices to analyse, produce, modify or respond to human texts and speech. SLTs are underpinned by a number of fundamental research fields including natural language processing (NLP / NLProc), speech processing, computational linguistics, mathematics, machine learning, physics, psychology, computer science, and acoustics. SLTs are now established as core scientific/engineering disciplines within AI and have grown into a world-wide multi-billion dollar industry.

 

Located in the Department of Computer Science at the University of Sheffield – a world-leading research institution in the SLT field – the UKRI Centre for Doctoral Training (CDT) in Speech and Language Technologies and their Applications is a vibrant research centre that also provides training in engineering skills, leadership, ethics, innovation, entrepreneurship, and responsibility to society.

 

Apply now: https://slt-cdt.ac.uk/apply/ 

 

The benefits:

  • Four-year fully-funded studentship covering fees and an enhanced stipend (£17,000 pa)

  • Generous personal allowance for research-related travel, conference attendance, specialist equipment, etc.

  • A full-time PhD with Integrated Postgraduate Diploma (PGDip) incorporating 6 months of foundational speech and NLP training prior to starting your research project 

  • Bespoke cohort-based training programme running over the entire four years providing the necessary skills for academic and industrial leadership in the field.

  • Supervision from a team of over 20 internationally leading SLT researchers, covering all core areas of modern SLT research, and a broader pool of over 50 academics in cognate disciplines with interests in SLTs and their application

  • Every PhD project is underpinned by a real-world application, directly supported by one of our industry partners. 

  • A dedicated CDT workspace within a collaborative and inclusive research environment hosted by the Department of Computer Science

  • Work and live in Sheffield - a cultural centre on the edge of the Peak District National Park which is in the top 10 most affordable UK university cities (WhatUni? 2019).

 

About you:

We are looking for students from a wide range of backgrounds interested in speech and NLP. 

  • High-quality (ideally first class) undergraduate or masters (ideally distinction) degree in a relevant discipline. Suitable backgrounds include (but are not limited to) computer science/software engineering; informatics; AI; speech and language processing; mathematics; physics; linguistics; cognitive science; and engineering.

  • Regardless of background, you must be able to demonstrate strong mathematical aptitude (minimally to A-Level standard or equivalent) and experience of programming.

  • We particularly encourage applications from groups that are underrepresented in technology.

  • The majority of candidates must satisfy the UKRI funding eligibility criteria for 'home' students. Full details can be found on our website.

 

Applying:

Applications are now sought for the September 2021 intake. The deadline is 31 January 2021. 

 

Applications will be reviewed within 6 weeks of the deadline and short-listed applicants will be invited to interview. Interviews will be held in Sheffield or via videoconference. 

 

See our website for full details and guidance on how to apply: https://slt-cdt.ac.uk 

 

For an informal discussion about your application please contact us by email at: sltcdt-enquiries@sheffield.ac.uk


By replying to this email or contacting sltcdt-enquiries@sheffield.ac.uk you consent to being contacted by the University of Sheffield in relation to the CDT. You are free to withdraw your permission in writing at any time. 

Back  Top

6-24(2020-11-14) Master2 Internship, LORIA-INRIA, Nancy, France

Master2 Internship: Semantic information from the past in a speech recognition system: does the past help the present?

 

Supervisor:  Irina Illina, MdC, HDR

Team and Laboratory: Multispeech, LORIA-INRIA

Contact: illina@loria.fr

               

Co-Supervisor:  Dominique Fohr, CR CNRS

Team and Laboratory: Multispeech, LORIA-INRIA

Contact : dominique.fohr@loria.fr

 

Motivation and context

 

Semantic and thematic spaces are vector spaces used for the representation of words, sentences or textual documents. The corresponding models and methods have a long history in the field of computational linguistics and natural language processing. Almost all models rely on the statistical semantics hypothesis, which states that statistical patterns of word occurrence (the context of a word) can be used to describe the underlying semantics. The most widely used method to learn these representations is to predict a word using the context in which this word appears [Mikolov et al., 2013; Pennington et al., 2014], and this can be realized with neural networks. These representations have proved their effectiveness for a range of natural language processing tasks. In particular, Mikolov's Skip-gram and CBOW models [Mikolov et al., 2013] and the BERT model [Devlin et al., 2019] have become very popular because of their ability to process large amounts of unstructured text data with reduced computing costs. The efficiency and the semantic properties of these representations motivate us to explore them for our speech recognition system.

Robust automatic speech recognition (ASR) remains a very ambitious goal. Despite constant efforts and some dramatic advances, the ability of a machine to recognize speech is still far from equaling that of a human being. Current ASR systems see their performance decrease significantly when the conditions under which they were trained and those in which they are used differ. The causes of variability may be related to the acoustic environment, sound capture equipment, microphone change, etc.

 

Objectives

Our speech recognition (ASR) system [Povey et al., 2011] is supplemented by a semantic analysis for detecting the words of the processed sentence that could have been misrecognized and for finding words having a similar pronunciation and better matching the context [Level et al., 2020]. For example, the sentence « Silvio Berlusconi, prince de Milan » can be recognized by the speech recognition system as « Silvio Berlusconi, prince de mille ans ». A good semantic representation of the sentence context could help to find and correct this error. This semantic analysis re-evaluates (rescores) the N-best transcription hypotheses and can be seen as a form of dynamic adaptation in the case of noisy speech data. The semantic analysis is performed by combining predictive representations using continuous vectors. All our models are based on deep neural networks (DNNs); for this we use the BERT model. The semantic analysis contains the following two modules: the semantic analysis module and the module for re-evaluating (rescoring) sentence hypotheses taking the semantic information into account.
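To make the rescoring step concrete, here is a minimal sketch of how N-best hypotheses could be re-ranked by interpolating the ASR score with a semantic coherence score derived from BERT embeddings. It is only an illustration under stated assumptions, not the Multispeech system itself: the pretrained model name, the interpolation weight alpha and the example ASR scores are all illustrative choices.

import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative sketch: re-rank N-best ASR hypotheses with a BERT-based
# semantic coherence score (cosine similarity to the surrounding context).
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")  # assumed model
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, T, H)
    return hidden.mean(dim=1).squeeze(0)             # mean-pooled sentence vector

def rescore(nbest, context, alpha=0.7):
    """nbest: list of (hypothesis, asr_score) pairs; returns hypotheses re-ranked."""
    ctx_vec = embed(context)
    scored = []
    for hyp, asr_score in nbest:
        semantic = torch.cosine_similarity(embed(hyp), ctx_vec, dim=0).item()
        scored.append((alpha * asr_score + (1 - alpha) * semantic, hyp))
    return [hyp for _, hyp in sorted(scored, reverse=True)]

# Toy example with made-up ASR scores.
nbest = [("Silvio Berlusconi, prince de mille ans", 0.61),
         ("Silvio Berlusconi, prince de Milan", 0.58)]
print(rescore(nbest, "Il a longtemps dirigé le club de football de Milan."))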

The semantic module significantly improves the performance of the speech recognition system, but we would like to go beyond the semantic information of the current sentence. Indeed, the previous sentences could sometimes help to understand and to recognize the current sentence. The Master's internship will be devoted to the innovative study of taking previously recognized sentences into account to improve the recognition of the current sentence. Research will be conducted on the combination of semantic information from one or several past sentences with semantic information from the current sentence to improve speech recognition. As deep neural networks (DNNs) can model complex functions and reach outstanding performance, they will be used in all our modeling. The performance of the different modules will be evaluated on artificially noisy speech data.

 

Required skills: background in statistics and natural language processing, and computer programming skills (Perl, Python).

Candidates should email a detailed CV with diploma

 

Bibliography

[Devlin  et al., 2019] Devlin, J., Chang, M.-W., Lee, K.  and Toutanova K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

[Level et al., 2020] Level S., Illina I., Fohr D. (2020).  Introduction of semantic model to help speech recognition. International Conference on TEXT, SPEECH and DIALOGUE.

[Mikolov et al., 2013] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality, in Advances in Neural Information Processing Systems, pp. 3111-3119.

[Pennington et al., 2014] Pennington, J., Socher, R., and Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

[Povey et al., 2011] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K. (2011). The Kaldi Speech Recognition Toolkit, Proc. ASRU.

 

Back  Top

6-25(2020-11-15) PhD at LORIA-INRIA, Nancy, France
Title: Multi-task learning for hate speech classification
 
Research Lab: MULTISPEECH team, LORIA-INRIA, Nancy, France

Supervisors:

Irina Illina, Associate Professor, HDR (illina@loria.fr)
Ashwin Geet D?Sa, PhD Thesis student (ashwin-geet.dsa@loria.fr)
Dominique Fohr, Research scientist CNRS (dominique.fohr@loria.fr)

Motivation and context:

During the last years, online communication through social media has skyrocketed. Although most people use social media for constructive purposes, a few misuse these platforms to spread hate speech. Hate speech is an anti-social communicative behavior that targets a minority section of society based on religion, gender, race, etc. (Delgado and Stefancic, 2014). It often leads to threats, fear, and violence towards an individual or a group. Online hate speech is an offense punishable by law, and the owners of a platform can be held accountable for the hate speech posted by its users. Manual moderation of hate speech by humans is expensive and time-consuming, so automatic classification techniques have been employed for the detection of hate speech. Recently, deep learning techniques have outperformed classical machine learning techniques and have become the state-of-the-art methodology for hate speech classification (Badjatiya et al., 2017). These methodologies need a large quantity of annotated data for training, and the task of annotation is very expensive. To train a powerful deep neural network based classification system, several corpora can be used.

Multi-task learning is a deep learning based methodology (MT-DNN) which has proven to outperform single-task deep learning models, especially in low-resource settings (Worsham and Jugal, 2020; Liu et al., 2019). This methodology jointly learns a model on multiple tasks such as classification, question-answering, etc. Thus, the information learned in one task can benefit the other tasks and improves the performance of all tasks. Existing hate speech corpora are collected from various sources such as Wikipedia, Twitter, etc., and their labeling can vary greatly. Some corpus creators combine various forms of hate, such as 'abuse' and 'threat', and collectively annotate the samples as 'hate speech' or 'toxic speech', whereas other authors create more challenging corpora using fine-grained annotations such as 'hate speech', 'abusive speech', 'racism', etc. (Davidson et al., 2017). Furthermore, the definition of hate speech remains unclear, and corpus creators can use different definitions. Thus, a model trained on one corpus cannot easily be used to classify comments from another corpus. To take advantage of the different available hate speech corpora and to improve the performance of hate speech classification, we would like to explore multi-corpus learning using the methodology of multi-task learning.

Objectives:

The objective of this work is to improve the existing deep learning hate speech classifier by developing a multi-task learning system that uses several hate speech corpora during training. In the MT-DNN model of (Liu et al., 2019), the multi-task learning model consists of a set of task-specific layers on top of shared layers. The shared layers are the bottom layers of the model and are jointly trained on several corpora. The task-specific layers are built on top of the shared layers, and each of these layers is trained on a single task. We want to explore this setup for hate speech detection. In this case, the shared layers will be jointly trained on several corpora of hate speech, and each task-specific layer will be used to learn a specific hate speech task from a specific corpus. For example, one task-specific layer can use the very toxic / toxic / neither / healthy / very healthy classification task and the Wikipedia detox corpus (Wulczyn et al.), while another task-specific layer can use the hateful / abusive / normal classification task and the Founta corpus (Founta et al.). Since the shared layers are jointly trained using multiple corpora, this will improve the performance of the task-specific layers, especially for tasks with a small amount of data.

The work plan for the internship is as follows. At the beginning, the intern will conduct a literature survey on hate speech classification using deep neural networks. Using the hate speech classification baseline system (CNN-based or Bi-LSTM-based) existing in our team, the student will evaluate the performance of this system on several available hate speech corpora. After this, the student will develop a new methodology based on the MT-DNN model for efficient learning; a pre-trained BERT model (Devlin et al., 2019) can be used to initialize the shared layers of the MT-DNN. The performance of the proposed MT-DNN model will be evaluated and compared to single-corpus learning and multi-corpora learning (grouping all corpora together).
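To make the shared-layers / task-specific-layers idea concrete, here is a minimal PyTorch sketch (an illustration under assumptions, not the actual MT-DNN code or the team's baseline): a shared encoder is trained jointly while each corpus has its own classification head. The vocabulary size, hidden size and label counts (5 classes for a detox-style task, 3 for a Founta-style task) are illustrative.

import torch
import torch.nn as nn

# Minimal multi-task sketch: one shared encoder, one classification head per corpus.
class MultiTaskHateClassifier(nn.Module):
    def __init__(self, vocab_size=30000, hidden=256, task_classes=(5, 3)):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, hidden)            # shared layers
        self.shared = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, n) for n in task_classes])

    def forward(self, token_ids, task_id):
        h = self.shared(self.embedding(token_ids))
        return self.heads[task_id](h)                                   # task-specific layer

model = MultiTaskHateClassifier()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One joint training step: each mini-batch comes from a different corpus/task.
batches = [(torch.randint(0, 30000, (8, 20)), torch.randint(0, 5, (8,))),   # detox-style labels
           (torch.randint(0, 30000, (8, 20)), torch.randint(0, 3, (8,)))]   # Founta-style labels
for task_id, (tokens, labels) in enumerate(batches):
    loss = loss_fn(model(tokens, task_id), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()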

Required Skills:

Background in statistics and natural language processing, and computer programming skills (Python).
Candidates should email a detailed CV with diploma.

Bibliography:

Badjatiya P., Gupta S., Gupta M., and Varma V. "Deep learning for hate speech detection in tweets." In Proceedings of the 26th International Conference on World Wide Web Companion, pp. 759-760, 2017.

Davidson T., Warmsley D., Macy M., and Weber I. "Automated hate speech detection and the problem of offensive language." arXiv preprint arXiv:1703.04009, 2017.

Delgado R., and Stefancic J. "Hate speech in cyberspace." Wake Forest L. Rev. 49: 319, 2014.

Devlin J., Chang M., Lee K., and Toutanova K. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171-4186, 2019.

Founta A.M., Djouvas C., Chatzakou D., Leontiadis I., Blackburn J., Stringhini G., Vakali A., Sirivianos M., Kourtellis N. "Large scale crowdsourcing and characterization of Twitter abusive behavior." ICWSM, 2018.

Liu X., He P., Chen W., and Gao J. "Multi-Task Deep Neural Networks for Natural Language Understanding." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4487-4496, 2019.

Worsham J., and Jugal K. "Multi-task learning for natural language processing in the 2020s: where are we going?" Pattern Recognition Letters, 2020.

Wulczyn E., Thain N., Dixon L. "Ex machina: Personal attacks seen at scale." In Proceedings of the 26th International Conference on World Wide Web, 2017.

Back  Top

6-26(2020-11-17) 2 year full-time postdoctoral researcher ar the University of Bordeaux, France
The University of Bordeaux invites applications for a 2-year full-time postdoctoral researcher position in Automatic Speech Recognition. The position is part of the FVLLMONTI project on efficient speech-to-speech translation on embedded autonomous devices, funded by the European Community.
 
To apply, please send by email a single PDF file containing a full CV (including publication list), cover letter (describing your personal qualifications, research interests and motivation for applying), evidence for software development experience (active Github/Gitlab profile or similar), two of your key publications, contact information of two referees and academic certificates (PhD, Diploma/Master, Bachelor certificates).
 
Details on the position are given below:
 
Job description: Post-doctoral position in Automatic Speech Recognition 
Duration: 24 months
Starting date: as early as possible (from March 1st 2021)
Project: European FETPROACT project FVLLMONTI (starts January 2021)
Location: Bordeaux Computer Science Lab. (LaBRI CNRS UMR 5800), Bordeaux, France (Image and Sound team)
Salary: from €2,086.45 to €2,304.88/month (estimated net salary after taxes, according to experience)
 
 
Short description:
The applicant will be in charge of developing state-of-the-art Automatic Speech Recognition systems for English and French as well as related Machine Translation systems using Deep Neural Networks. The objective is to provide the exact specifications of the designed systems to the other partners of the project specialized in hardware. Adjustments will have to be made to take into account the hardware constraints (i.e. memory and energy consumption impacting the number of parameters, computation time, ...) while keeping an eye on performance (WER and BLEU scores). When a satisfactory trade-off is reached, more exploratory work is to be carried out on using emotion/attitude/affect recognition on the speech samples to supply additional information to the translation system. 
 
 
Context of the project:
The aim of the FVLLMONTI project is to build a lightweight autonomous in-ear device allowing speech-to-speech translation. Today, pocket-talk devices integrate IoT products requiring internet connectivity which, in general, is proven to be energy inefficient. While machine translation (MT) and Natural Language Processing (NLP) performance has greatly improved, embedded lightweight energy-efficient hardware remains elusive. Existing solutions based on artificial neural networks (NNs) are computation-intensive and energy-hungry, requiring server-based implementations, which also raises data protection and privacy concerns. Today, 2D electronic architectures suffer from 'unscalable' interconnect and are thus still far from being able to compete with biological neural systems in terms of real-time information-processing capabilities with comparable energy consumption. Recent advances in materials science, device technology and synaptic architectures have the potential to fill this gap with novel disruptive technologies that go beyond conventional CMOS technology. A promising solution comes from vertical nanowire field-effect transistors (VNWFETs) to unlock the full potential of truly unconventional 3D circuit density and performance.
 
Role:
The tasks assigned to the Computer Science lab are the design of the Automatic Speech Recognition systems (for French and English) and the Machine Translation (English to French and French to English). Speech synthesis will not be explored in the project, but an open-source implementation will be used for demonstration purposes. Both ASR and MT tasks benefit from the use of Transformer architectures over Convolutional (CNN) or Recurrent (RNN) neural network architectures. Thus, the role of the applicant will be to design and implement state-of-the-art systems for ASR using Transformer networks (e.g. with the ESPNET toolkit) and to assist another post-doctoral researcher for the MT systems. Once the performance reached by these baseline systems is satisfactory, details on the network will be given to our hardware design partners (e.g. number of layers, value of the parameters, etc.). With the feedback of these partners, adjustments will be made to the network considering the hardware constraints while trying not to degrade performance too much.
 
The second part of the project will focus on keeping up with the latest innovations and translating them into hardware specifications. For example, recent research suggests that adding convolutional layers to the transformer architecture (i.e. the 'conformer' network) can help reduce the number of parameters of the model, which is critical regarding the memory usage of the hardware system.
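As a hedged illustration of the kind of specification exchanged with the hardware partners (number of layers, parameter count), the short PyTorch sketch below builds a plain Transformer encoder for two illustrative model widths and reports its parameter count; the sizes and depth are assumptions, not the FVLLMONTI specification.

import torch.nn as nn

# Illustrative only: count the parameters of a plain Transformer encoder
# for two assumed model widths, the kind of figure passed to hardware designers.
def count_parameters(module):
    return sum(p.numel() for p in module.parameters())

for d_model, ff_dim in [(256, 1024), (512, 2048)]:            # assumed widths
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, dim_feedforward=ff_dim)
    encoder = nn.TransformerEncoder(layer, num_layers=12)     # assumed depth
    print(f"d_model={d_model}: {count_parameters(encoder) / 1e6:.1f} M parameters")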
 
Finally, more exploratory work on the detection of social affects (i.e. the vocal expression of the intent of the speaker: 'politeness', 'irony', etc) will be carried out. The additional information gathered using this detection will be added to the translation system for potential usage in the future speech synthesis system. 
 
 
Required skills:
- PhD in Automatic Speech Recognition (preferred) or Machine Translation using deep neural networks
- Knowledge of most widely used toolboxes/frameworks (tensorflow, pytorch, espnet for example)
- Good programming skills (python)
- Good communication skills (frequent interactions with hardware specialists)
- Interest in hardware design will be a plus
 
Selected references:
S. Karita et al., 'A Comparative Study on Transformer vs RNN in Speech Applications,' 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), SG, Singapore, 2019, pp. 449-456, doi: 10.1109/ASRU46091.2019.9003750.
Gulati, Anmol, et al. 'Conformer: Convolution-augmented Transformer for Speech Recognition.' arXiv preprint arXiv:2005.08100 (2020).
Rouas, Jean-Luc, et al. 'Categorisation of spoken social affects in Japanese: human vs. machine.' ICPhS. 2019.
 
 
---
 
Université de Bordeaux CNRS

Jean-Luc Rouas
CNRS Researcher
Bordeaux Computer Science Research Laboratory (LaBRI)
351 Cours de la libération - 33405 Talence Cedex - France
T. +33 (0) 5 40 00 35 28
www.labri.fr/~rouas

Back  Top

6-27(2020-11-19) Research engineer (fixed-term contract) - ALAIA Laboratory, Toulouse

The ALAIA joint laboratory (LabCom), dedicated to language learning assisted by artificial intelligence, is recruiting a research engineer on a fixed-term contract (12 months, with a possible extension).

The work will be carried out in coordination with the two partners involved in the LabCom: IRIT (Institut de Recherche en Informatique de Toulouse) and the company Archean Technologies, more particularly its R&D unit Archean Labs (Montauban, 82).

ALAIA focuses on oral expression and comprehension in a foreign language (L2). The missions will consist in designing, developing and integrating innovative services based on the analysis of L2 learners' productions and on the detection and characterisation of errors, from the phonetic to the linguistic level.

The expected skills cover automatic speech and language processing, machine learning and web application development.

Applications should be sent to Isabelle Ferrané (isabelle.ferrane@irit.fr) and Lionel Fontan (lfontan@archean.tech). Do not hesitate to contact us for further information.

Back  Top

6-28(2020-11-26) Full Professor at Dublin City University, Ireland

Dublin City University is seeking to make a strategic appointment to support existing academic and research leadership in the area of Multimodal Information Systems within the DCU School of Computing, home of the ADAPT Centre for Digital Content Technology.

Applications are invited from suitably qualified candidates for a new post of Full Professor of Computing (Multimodal Information Systems) at DCU under the Senior Academic Leadership Initiative (SALI) Call 2019, in line with the requirements set out in the Higher Education Authority (HEA) Call document:

https://hea.ie/assets/uploads/2019/06/FINAL-Call-document-2019-06-21.pdf

The full job description is available at:

https://www.dcu.ie/sites/default/files/inline-files/nr029a-full-professor-of-computing-multimodal-information-processing-ad_final.pdf

The Appointee will have the benefit of interaction and collaboration with the wider ADAPT research teams -- a significant body of researchers, including PhD students, postdoctoral researchers, Research Fellows and ADAPT staff -- working on aspects of Language Technology, Machine Translation, Personalisation, Multimodal Processing, VR/AR and Privacy and Ethics across the eight ADAPT partner universities.  This is an exciting time for the ADAPT Centre as it transitions from the first phase of SFI funding to the phase II extension from 2021-26. Accordingly, the successful candidate has the perfect opportunity to shape their future research goals by contributing to the envisaged programme of work during the next cycle of ADAPT.

For further information, please contact Prof. Brian Corcoran, Acting Executive Dean, Faculty of Engineering and Computing (brian.corcoran@dcu.ie) at DCU, or Prof. Andy Way, Deputy Director of the ADAPT Centre at DCU (andy.way@adaptcentre.ie), who will be delighted to engage with you further about this role.

The application form is available at:

https://www.dcu.ie/sites/default/files/inline-files/dcu_application_sali-3.doc

The closing date for applications is Dec 2nd, 2020.

Back  Top

6-29(2020-11-28) Master2 Internship, INA, Bry sur Marne, France

Active speaker detection in television streams

Internship topic - M2 in computer science or an engineering school

Keywords: Active speaker detection, multimodal analysis, speech processing, face detection, machine learning, deep learning, audiovisual, digital humanities, women in media, automatic indexing, gender equality monitor

Context

The Institut national de l'audiovisuel (INA) is in charge of the legal deposit of television, radio and web media. As such, INA continuously captures 170 television channels and stores more than 20 million hours of audiovisual content.

 

An indexing process, usually carried out by archivists, is necessary to describe audiovisual content and retrieve documents within these large collections. This work consists, among other things, in referencing the people appearing in the programmes and the topics discussed, or in producing summaries of the documents. The activities of INA's research department aim to automate some of these indexing processes: either by automating tasks with no human added value (segmentation, spotting of proper names in the image, etc.), or by carrying out tasks that are not done by archivists (exhaustive counting of speaking time). 

 

The proposed topic is part of the Gender Equality Monitor (GEM) project, funded by the Agence nationale de la recherche, which aims to describe the differences in representation between women and men in the media. In this context, massive automatic indexing campaigns over INA's collections have made it possible to create new knowledge in the human sciences based on speaking time, visual exposure time, or the content of on-screen text overlays [Dou19a, Dou19b, Dou20].

 

Improving automatic indexing systems requires building example datasets that are representative of the diversity of the processed material, used for training and evaluation. Building such datasets is a strategic issue for the design of systems based on machine learning, and strategies for automating dataset construction can be considered [Sal14].

Objectives

Active speaker detection (ASD) is a multimodal analysis task that consists in analysing a video and determining whether the movements of one of the faces appearing on screen correspond to the speech signal contained in the audio track. ASD systems can be designed using unsupervised [Chun16] or supervised [Rot20] approaches. ASD addresses several practical problems encountered by INA.

  • Audio/video synchronisation: the video and audio streams of digitised documents are frequently out of sync. ASD makes it possible to estimate the offset between the audio and video tracks and to synchronise the two streams automatically (a minimal sketch follows this list).

  • Improving voice- and face-based gender detection systems: the open-source tools inaSpeechSegmenter and inaFaceGender detect a person's gender from their voice or their face, respectively. These systems were designed from a necessarily finite number of examples, which does not perfectly reflect the diversity of INA's content. Active speaker detection is envisaged to automatically build datasets for which the gender predictions differ between the audio and video modalities, and thus increase the robustness of INA's tools. The resulting borderline cases can also be analysed further to describe the limits of binary approaches for describing people.

  • Building multimodal datasets: ASD systems, combined with INA's documentary records and a limited number of examples, can be used to design new face and speech datasets [Nag20].
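As a purely illustrative sketch of the synchronisation use case (an assumption, not INA's pipeline), the following estimates an audio/video offset by cross-correlating an audio activity curve with a lip-motion curve sampled at the video frame rate; both curves are assumed to be precomputed by earlier stages (e.g. frame-level speech energy and mouth-opening variation for the detected face).

import numpy as np

# Illustrative sketch: find the lag (in frames, converted to seconds)
# that best aligns an audio activity curve with a lip-motion curve.
def estimate_av_offset(audio_activity, lip_motion, fps=25.0, max_offset_s=2.0):
    """Return the lag (seconds) maximising the correlation between the two curves."""
    a = (audio_activity - audio_activity.mean()) / (audio_activity.std() + 1e-8)
    v = (lip_motion - lip_motion.mean()) / (lip_motion.std() + 1e-8)
    max_lag = int(max_offset_s * fps)
    lags = range(-max_lag, max_lag + 1)
    scores = [np.sum(a[max(0, -k):len(a) - max(0, k)] *
                     v[max(0, k):len(v) - max(0, -k)]) for k in lags]
    return lags[int(np.argmax(scores))] / fps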

 

The overall objective of the internship is to develop an ASD system, evaluate it against existing open-source implementations, and deploy it on INA's collections to build example datasets for improving the inaSpeechSegmenter and inaFaceGender tools. Depending on the results obtained, the internship may lead to an open-source release of the resulting system and/or a scientific publication.

Required skills

  • Machine learning

  • Computer vision and image processing

  • Audio signal processing

  • Fluency in Python

  • Software engineering

  • Ability to carry out literature reviews

  • Rigour, synthesis, autonomy, ability to work in a team

  • Interest in academic and industrial research

Internship conditions

The internship will run for 4 to 6 months within INA's research department and can start from January 2021. It will take place at the Bry2 site, 18 Avenue des frères Lumière, 94366 Bry-sur-Marne. The intern will be supervised by David Doukhan, R&D engineer in the research department and coordinator of the GEM project.

Contact

Interested candidates can contact David Doukhan (ddoukhan@ina.fr) for more information, or directly send an application letter including a Curriculum Vitae by email.

Bibliography

 

[Chun16] Chung, J. S., & Zisserman, A. (2016). Out of time: automated lip sync in the wild. In Asian conference on computer vision (pp. 251-263). Springer, Cham.

 

[Dou18] Doukhan, D., Carrive, J., Vallet, F., Larcher, A., & Meignier, S. (2018). An open-source speaker gender detection framework for monitoring gender equality. In ICASSP (pp. 5214-5218). 

 

[Dou19a] Doukhan, D. (2019) À la radio et à la télé, les femmes parlent deux fois moins que les hommes. La revue des médias

 

[Dou19b] Doukhan, D., Rezgui, Z., Poels, G., & Carrive, J. (2019). Estimer automatiquement les différences de représentation existant entre les femmes et les hommes dans les médias.

 

[Dou20] Doukhan, D., Méadel, C., Coulomb-Gully, M. (2020) En période de coronavirus, la parole d'autorité dans l'info télé reste largement masculine. La revue des médias

 

[Nag20] Nagrani, A., Chung, J. S., Xie, W., & Zisserman, A. (2020). Voxceleb: Large-scale speaker verification in the wild. Computer Speech & Language, 60, 101027.

 

[Rot20] Roth, J., Chaudhuri, S., Klejch, O., Marvin, R., Gallagher, A., Kaver, L., & Pantofaru, C. (2020). Ava Active Speaker: An Audio-Visual Dataset for Active Speaker Detection. In ICASSP IEEE.

 

[Sal14] Salmon, F., & Vallet, F. (2014). An Effortless Way To Create Large-Scale Datasets For Famous Speakers. In LREC (pp. 348-352).

Back  Top

6-30(2020-12-01) Funded PhD Position at University of Edinburgh, Scotland, UK

Funded PhD Position at University of Edinburgh

 

PhD Position: Automatic Affective Behaviour Monitoring through speech and/or multimodal means while preserving user’s privacy

 

For details please visit:

https://www.findaphd.com/phds/project/automatic-affective-behaviour-monitoring-through-speech-while-preserving-user-s-privacy/?p125956

……………………………………………………………………………………………………………………………………………………………………….

About the Project

The Advanced Care Research Centre at the University of Edinburgh is a new £20m interdisciplinary research collaboration aiming to transform later life with person-centred integrated care.

The vision of the ACRC is to play a vital role in addressing the Grand Challenge of ageing by transformational research that will support the functional ability of people in later life so they can contribute to their own welfare for longer. With fresh and diverse thinking across interdisciplinary perspectives our academy students will work to creatively embed deep understanding, data science, artificial intelligence, assistive technologies and robotics into systems of health and social care supporting the independence, dignity and quality-of-life of people living in their own homes and in supported care environments.

The ACRC Academy will equip future leaders to drive society’s response to the challenges of later life care provision, a problem which is growing in scale, complexity and urgency. Our alumni will become leaders across a diverse range of pioneering and influential roles in the public, private and third sectors.

Automatic affect recognition technologies can monitor a person’s mood and mental health by processing verbal and non-verbal cues extracted from the person’s speech. However, the speech signal contains biometric and other personal information which can, if improperly handled, threaten the speaker’s privacy. Hence there is a need for automatic inference and monitoring methods that preserve privacy for speech data in terms of collection, training of machine learning models and use of such models in prediction. This project will focus on research, implementation and assessment of solutions for handling of speech data in the user’s own environment while protecting their privacy. We are currently studying the use of speech in healthy ageing and care in combination with IoT/Ambient Intelligence technologies in a large research project. This project will build on our research in this area.

 

The goals of this PhD project are:

  • to establish and assess user privacy requirements,
  • to devise privacy-preserving automatic affect recognition methods,
  • to develop speech data collection methods and tools for privacy-sensitive contexts, and
  • to evaluate these methods with respect to performance and privacy preservation requirements.

 

Training outcomes include machine learning methods for inference of mental health status, privacy-preserving machine learning and signal processing, and applications of such methods in elderly care.

Back  Top

6-31(2020-12-03) 6 months internship at GIPSA-Lab, Grenoble, France

Deep learning-based speech coding and synthesis in adverse conditions.

Project: Vokkero 2023

Type: Internship, 6 months, start of 2021

Offer: vogo-bernin-pfe-2

Contact: r.vincent@vogo.fr

Keywords: Neural vocoding, deep learning, speech synthesis, training dataset, normalisation.

Summary: The project consists in evaluating the performance of the LPCNet neural vocoder for speech coding and decoding under adverse conditions (noisy environment, varied speech style, etc.) and in proposing learning techniques to improve the quality of synthesis.

1 The company VOGO and Gipsa-lab

Vogo is an SME based in Montpellier, in the south of France: www.vogo-group.com. Vogo is the first Sportech listed on Euronext Growth and develops solutions that enrich the experience of fans and professionals during sporting events. Its brand Vokkero is specialized in the design and production of radio communication systems: www.vokkero.com. It offers solutions for teams working in very noisy environments and is notably a world reference in the professional sports refereeing market.

Gipsa-lab is a CNRS research unit joint with Grenoble-INP (Grenoble Institute of Technology) and Université Grenoble Alpes. With 350 people, including about 150 doctoral students, Gipsa-lab is a multidisciplinary research unit developing both basic and applied research on complex signals and systems. Gipsa-lab is internationally recognised for its research in Automatic Control, Signal and Image Processing, Speech and Cognition, and develops projects in the strategic areas of energy, environment, communication, intelligent systems, life and health, and language engineering.

2 The Vokkero 2023 project

Every 3 years, Vokkero renews its hardware (radio, CPU) and software (rte, audio processing) platforms in order to design new generations of products. The project extends over several years of study, and it is within this framework that the internship is proposed. In the form of a partnership with Gipsa-lab, the project consists in the study of speech coding using neural network approaches, in order to obtain performance not yet reached by classical approaches. The student will work at Gipsa-lab in the CRISSP team of the Speech and Cognition cluster under the supervision of Olivier PERROTIN, research fellow at CNRS, and at the R&D department of Vogo Bernin, with Rémy VINCENT, project leader on the Vogo side.

3 Context & Objectives

The project consists in evaluating the performance of the LPCNet neural vocoder for speech coding and decoding under adverse conditions (noisy environment, varied speech style, etc.) and in proposing learning techniques to improve the quality of synthesis.

3.1 Context

Vocoders (voice coders) are models that allow a speech signal to be first reduced to a small set of parameters (this is speech analysis or coding) and then reconstructed from these parameters (this is speech synthesis or decoding). This coding/decoding process is essential in telecommunication applications, where speech is coded, transmitted and then decoded at the receiver. The challenge is to minimise the quantity of information transmitted, while keeping the quality of the reconstructed speech signal as high as possible. Current techniques use high-quality speech signal models, with a constraint on algorithmic complexity to ensure real-time processing in embedded systems. Examples of widely used codecs are Speex (Skype) and its little brother, Opus (Zoom). A few orders of magnitude: Opus converts a stream sampled at 16 kHz (i.e. 256 kbit/s for 16-bit samples) into a bitstream at 16 kbit/s, a compression ratio of 1:16; the reconstructed signal is also at 16 kHz and has 20 ms of latency.

Since 2016 a new type of vocoder has emerged, called the neural vocoder. Based on deep neural network architectures, these are able to generate a speech signal from the classical input parameters of a vocoder, without a priori knowledge of an explicit speech model, but using machine learning. The first system, Google's WaveNet [1], is capable of reconstructing a signal almost identical to natural speech, but at a very high computation cost (20 seconds to generate a sample, with 16,000 samples per second of speech). Since then, models have been simplified and are capable of generating speech in real time (WaveRNN [2], WaveGlow [3]). In particular, the LPCNet neural vocoder [4, 5], also developed by Mozilla, is able to convert a 16 kHz sampled stream into a 4 kbit/s bitstream and reconstruct a 16 kHz audio signal. This mix of super-compression combined with bandwidth extension leads to equivalent compression ratios much higher than 1:16 (here 256/4, i.e. about 1:64)!

However, the ability of these systems to generate high-quality speech has only been evaluated following training on large and homogeneous databases, i.e. 24 hours of speech read by a single speaker and recorded in a quiet environment [6]. On the other hand, in the Vokkero application, speech is recorded in adverse conditions (very noisy environments) and presents significant variability (spoken voice, shouted voice, multiplicity of referees, etc.). Is a neural vocoder trained on a read-speech database capable of decoding speech of this type? If not, is it possible to train the model on such data, while it is only available in small quantities? The aim of this internship is to explore the limits of the LPCNet vocoder applied to the decoding of referee speech. Various learning strategies (curriculum training, transfer learning, learning on augmented data, etc.) will then be explored to try to adapt pre-trained models to our data.
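As a minimal illustration of the 'learning on augmented data' strategy mentioned above (an assumption for illustration, not the project's actual augmentation pipeline), the sketch below mixes clean speech with ambient noise at a chosen signal-to-noise ratio to create synthetic noisy training material:

import numpy as np

# Illustrative sketch: mix clean speech with noise at a target SNR (in dB).
# `speech` and `noise` are 1-D float arrays at the same sampling rate.
def mix_at_snr(speech, noise, snr_db):
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(p_speech / (gain**2 * p_noise)) == snr_db
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise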

3.2 Tasks

The student will evaluate the performance of a pre-trained LPCNet vocoder on referee speech data, and will propose learning strategies to adapt the model to this new data, in a coding/re-synthesis scenario:

1. Get familiar with the system, and evaluate its performance on an audio-book database (baseline);

2. Evaluate LPCNet on the Vokkero database and identify its limits (ambient noise, pre-processing, voice styles, etc.);

3. Study strategies to improve system performance by data augmentation:
- Creation of synthetic and specific databases: noisy atmospheres, shouted voices;
- Recording campaigns on Vokkero systems, in anechoic rooms and/or in real conditions if the health situation allows it;
- Comparison of the two approaches according to various learning strategies to learn a new model from this data.

3.3 Required Skills

The student is expected to have a solid background in speech signal processing and an interest in Python development. Experience in programming deep learning models in Python is a plus. The student is expected to show curiosity for research, scientific rigour in methodology and experimentation, and autonomy in technical and organisational aspects. Depending on the candidate's motivation, and subject to obtaining funding, it is possible to pursue this topic as a PhD thesis.

The student will be able to subscribe to the company's insurance scheme, will receive luncheon vouchers, and will receive a monthly gratuity of €800.

References

[1] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior and K. Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio", CoRR, vol. abs/1609.03499, 2016. arXiv: 1609.03499.

[2] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman and K. Kavukcuoglu, "Efficient Neural Audio Synthesis", CoRR, vol. abs/1802.08435, 2018. arXiv: 1802.08435.

[3] R. Prenger, R. Valle and B. Catanzaro, "Waveglow: A Flow-based Generative Network for Speech Synthesis", in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK: IEEE, May 2019, pp. 3617-3621.

[4] J.-M. Valin and J. Skoglund, "LPCNet: Improving Neural Speech Synthesis through Linear Prediction", in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, UK: IEEE, May 2019, pp. 5891-5895.

[5] J.-M. Valin and J. Skoglund, "A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet", in Proceedings of Interspeech, Graz, Austria: ISCA, Sep. 2019, pp. 3406-3410.

[6] P. Govalkar, J. Fischer, F. Zalkow and C. Dittmar, "A Comparison of Recent Neural Vocoders for Speech Signal Reconstruction".


 

Back  Top

6-32(2020-12-04) 6-month post-doctoral position, IRIT, Toulouse, France

Within the framework of the interdisciplinary INGPRO project, a Région Occitanie project on the influence of gestures on pronunciation (Incidence des Gestes sur la PROnonciation), IRIT (SAMoVA team, https://www.irit.fr/SAMOVA/site/) offers a 6-month post-doctoral position to work on the processing of the speech data (manual and automatic evaluation) collected during an experiment that will take place in spring 2021. This experiment involves the collection of oral data under different experimental conditions as well as the analysis of the collected data.

This work will be carried out in collaboration with the company Archean Technologies (http://www.archean.tech/archean-labs-en.html) and the Octogone laboratory of UT2J (https://octogone.univ-tlse2.fr/), partners of the project. If you are interested, you will find the details of the offer below.

Offer: https://www.irit.fr/SAMOVA/site/wp-content/uploads/2020/12/Ficheposte_PostDoc_INGPRO_2021.pdf

Further information on the INGPRO project is available here: https://www.irit.fr/SAMOVA/site/projects/current/ingpro/

 

Position: fixed-term contract (post-doc, category A)
Duration: 6 months
Location: IRIT, 118 route de Narbonne, 31062 Toulouse; occasional travel to Montauban (ARCHEAN LABS, 20 place Prax-Paris, 82000 Montauban)
Contacts: Isabelle Ferrané (isabelle.ferrane@irit.fr), Charlotte Alazard (charlotte.alazard@univ-tlse2.fr), project leader
Salary: according to experience
Application: to be sent before 15 January 2021, for a start date no later than 1 April 2021
Degree: PhD in linguistics with a specialisation in phonetics (an L2 acquisition dimension would be a plus)

Back  Top

6-33(2020-12-05) 5/6 months Internship, LIS-Lab, Université Aix-Marseille, France

Deep learning for speech perception
(Apprentissage profond pour la perception de la parole)

Length of internship: 5-6 months
Start date: between January and March
Contact: Ricard Marxer

Context
---
Recent deep learning (DL) developments have been key to breakthroughs in many artificial
intelligence (AI) tasks such as automatic speech recognition (ASR) [1] and speech
enhancement [2]. In the past decade the performance of such systems on reference corpora
has consistently increased driven by improvements in data-modeling and representation
learning techniques. However our understanding of human speech perception has not
benefited from such advancements. This internship sets the ground for a project that
proposes to gain knowledge about our perception of speech by means of large-scale
data-driven modeling and statistical methods. By leveraging modern deep learning
techniques and exploiting large corpora of data we aim to build models capable of
predicting human comprehension of speech at a higher level of detail than any other
existing approaches [3].

This internship is funded by the ANR JCJC project MIM (Microscopic Intelligibility
Modeling). It aims at predicting and describing speech perception at the stimuli,
listener and sub-word level. The project will also fund a PhD position, the call for
applications will be published in the coming months. A potential followup in PhD could be
foreseen for the successful candidate of this internship.

Subject
---
In an attempt to use DL methods for speech perception tasks, this internship aims at
participating in the first Clarity challenge. This challenge tackles the difficult task
of performing speech enhancement for optimising intelligibility of a speech signal in
noisy conditions. The challenge opens in January 2021, it is the first of its kind with
the objective of advancing hearing-aid signal processing and the modelling of
speech-in-noise perception.

Several research directions will be explored, including but not limited to:
- perceptual-based loss functions
- advanced speech representation learning pipelines
- DL-based multichannel processing techniques

Given that the baseline and data of the challenge are to be published in January 2021 and
the difficulty of the task remains uncertain, a backup plan is foreseen for this
internship that is more tightly related to the context of the ANR project.

In the MIM project, we focus on corpora of consistent confusions: speech-in-noise stimuli that evoke the same misrecognition among multiple listeners. In order to simplify this first approach to microscopic intelligibility prediction, we will restrict to single-word data. This should reduce the lexical factors to aspects such as usage frequency and neighborhood density, significantly limiting the complexity of the required language model. Consistent confusions are valuable experimental data about the human speech perception process. They provide targets for how intelligibility models should differentiate from automatic speech recognition (ASR) systems. While ASR models are optimised to recognise what has been uttered, the proposed models should output what has been perceived by a set of listeners. A sub-task encompasses creating baseline models that predict listeners' responses to the noisy speech stimuli. We will target predictions at different levels of granularity such as predicting the type of confusion, which phones are misperceived or how a particular phone is confused.

Several models regularly used in speech recognition tasks will be trained and evaluated
on predicting the misperceptions in the consistent confusion corpora. We will first focus
on well-established models such as GMM-HMMs and/or simple deep learning architectures.
Advanced neural topologies such as TDNNs, CTC-based or attention-based models will also
be explored, even though the relatively small amount of training data in the corpora is
likely to be a limiting factor. As a starting point we envisage solving the three tasks
described in [3], consisting of 1) predicting the probability of occurrence of
misrecognitions at each position of the word, 2) given the position, predicting a
distribution over particular phone misperceptions, and 3) predicting the words that have
been perceived and the number of times each was reported among a set of listeners.
Predictions will be evaluated using the metrics also defined in [3], with random and
oracle predictions used as references. These baseline models will be trained using only
in-domain data and optimized on word recognition tasks.
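
As an illustration of task 2 only, the hypothetical sketch below predicts a distribution
over perceived phones at a given position from acoustic features. It is not one of the
MIM baselines: the feature dimension, the phone inventory size and the assumption that
the position of interest is known from a forced alignment are arbitrary choices made for
the example.

import torch
import torch.nn as nn

N_PHONES = 40   # hypothetical phone inventory size
FEAT_DIM = 80   # e.g. log-mel features per frame

class ConfusionPredictor(nn.Module):
    def __init__(self, feat_dim=FEAT_DIM, hidden=128, n_phones=N_PHONES):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_phones)

    def forward(self, feats, position):
        # feats: (batch, frames, feat_dim); position: (batch,) frame index of
        # the phone of interest, assumed known from a forced alignment
        out, _ = self.rnn(feats)
        picked = out[torch.arange(out.size(0)), position]   # (batch, 2*hidden)
        return self.head(picked)                             # phone logits

model = ConfusionPredictor()
feats = torch.randn(8, 120, FEAT_DIM)         # dummy batch of 8 words
position = torch.randint(0, 120, (8,))
target = torch.randint(0, N_PHONES, (8,))     # perceived phone labels
loss = nn.functional.cross_entropy(model(feats, position), target)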


Profile
---
The candidate shall have the following profile:
- Master 2 level or equivalent in one of the following fields: machine learning, computer
science, applied mathematics, statistics, signal processing
- Good written and spoken English skills
- Programming skills, preferably in Python

Furthermore, the ideal candidate would have:
- Experience with one of the main DL frameworks (e.g. PyTorch, TensorFlow)
- Some notions of speech or audio processing

Application procedure
---
In order to apply, send the following to the contact address ricard.marxer@lis-lab.fr:
- CV
- Motivation letter
- Latest grades transcript (M1 and the 1st semester of M2 if available)

References
---
[1] Barker, J., Marxer, R., Vincent, E., & Watanabe, S. (2017). The third 'CHiME' speech
separation and recognition challenge: Analysis and outcomes. Computer Speech & Language,
46, 605–626.
[2] Marxer, R., & Barker, J. (2017). Binary Mask Estimation Strategies for Constrained
Imputation-Based Speech Enhancement. In Proc. Interspeech 2017 (pp. 1988–1992).
[3] Marxer, R., Cooke, M., & Barker, J. (2015). A framework for the evaluation of
microscopic intelligibility models. In Proc. Interspeech 2015 (pp. 2558–2562).


6-34(2020-12-06) Part-time researcher, Université de Mons, Belgium

The phonetics laboratory of UMONS (https://sharepoint1.umons.ac.be/FR/universite/facultes/fpse/serviceseetr/sc_langage/Pages/default.aspx), attached to the Institut de Recherches en Sciences du Langage (https://langage.be/), is a humanities and social sciences lab partnering in an R&D project led by the company https://roomfourzero.com/ and funded by DG06 Wallonie (https://recherche-technologie.wallonie.be/): the 'SALV' project, for 'sens à la voix' ('meaning to the voice').

 

To carry out this project, we are recruiting a researcher, either half-time for an initial period of 6 months or full-time for an initial period of 3 months, starting in January 2021 (with a possible extension to 1 or 2 years).

 

The mission is twofold:

  • The candidate will be in charge of the tasks assigned to UMONS within the SALV project (more information on the project below); as such, he/she will be the person interfacing with Room 40 and will coordinate the activities of any other project participants in the lab (e.g. students, interns, researchers).
  • The candidate will also develop a general strategy (and possibly specific tools) enabling the lab's researchers (most of whom have a humanities and social sciences background) to optimise their procedures for collecting, transcribing and analysing speech data using automatic speech processing techniques.

 

The practical arrangements of the position are still very flexible at this stage and will be discussed according to the possibilities and wishes of the successful candidate. The salary is set according to the degree (ideally a PhD in computer science; other backgrounds can be discussed) and experience.

 

If you are interested, or simply want to know more: veronique.delvaux@umons.ac.be

 

 

****************

SALV

 

Room 40 (https://roomfourzero.com/) is a young and dynamic company providing a range of products and services, notably including real-time analysis and anomaly detection in audio and video streams, as well as fine-grained, ontology-based contextual analysis of the explicit and implicit meanings of texts and text fragments such as those exchanged on social networks or via SMS.

 

The SALV project aims to develop a technology for transcribing and analysing vocal content that includes a set of contextual paralinguistic information (emotions, stress, attitudes). It builds partly on the real-time text analysis technologies and specific ontologies that Room 40 already licenses and commercialises. For this project, new speech-to-text transcription tools will have to be developed in order to integrate this paralinguistic information into the transcription text as metadata. Combining these two types of information should greatly improve the quality of the analysis results.

 

By the end of the project, the goal is a speech analysis solution comprising: (i) an approach for transcribing speech into text; (ii) a method for annotating the text with metadata capturing paralinguistic elements; (iii) a system for meaningful analysis of the audio content combining the transcription and the paralinguistic elements; and finally (iv) an application layer integrating the elements above and including content analysis algorithms, graphical interfaces targeted at specific market segments, as well as a number of APIs ensuring interoperability of the system with the partners' existing infrastructures.
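
Purely as an illustration of point (ii) above, the transcription and its paralinguistic metadata could be carried together in a structure such as the hypothetical Python sketch below; the field names and label sets are assumptions made for the example, not the format adopted in the project.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TranscriptSegment:
    start: float                    # segment start time (seconds)
    end: float                      # segment end time (seconds)
    text: str                       # speech-to-text output for the segment
    paralinguistic: Dict[str, float] = field(default_factory=dict)
    # e.g. {"stress": 0.8, "arousal": 0.4}, scores from separate classifiers

@dataclass
class AnalysedRecording:
    audio_id: str
    segments: List[TranscriptSegment]

rec = AnalysedRecording(
    audio_id="call_0001",
    segments=[TranscriptSegment(0.0, 2.3, "hello, I need some help",
                                {"stress": 0.8})],
)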


6-35(2020-12-07) Internships at IRIT, Toulouse, France

The SAMoVA team at IRIT in Toulouse is offering several final-year internships (M2, engineering) in 2021:

- Deep learning of audio representations
- Deep learning for streaming speaker diarization
- Characterisation and modelling of pathological voices
- Intelligibility measures for pathological speech

Full details (topics, contacts) are available in the team's 'Jobs' section:
https://www.irit.fr/SAMOVA/site/jobs/
