ISCApad #221

Friday, November 11, 2016 by Chris Wellekens


Two Master research internships (with follow-up PhD scholarship) at LIMSI - CNRS, Orsay, France
Unsupervised Multimodal Character Identification in TV Series and Movies

Keywords : deep learning, speech processing, natural language processing, computer vision

Automatic character identification in multimedia videos is a challenging problem. Person identities can serve as a foundation for many higher-level video analysis tasks, such as semantic indexing, search and retrieval, interaction analysis, and video summarization. The goal of this project is to exploit textual, audio and video information to automatically identify characters in TV series and movies without requiring any manual annotation for training character models. A fully automatic and unsupervised approach is especially appealing given the huge amount of available multimedia data (and its growth rate). Text, audio and video provide complementary cues to a person's identity, and therefore allow a person to be identified more reliably than from any single modality alone.

In this context, LIMSI (www.limsi.fr) proposes two projects, focusing on two different aspects of this multimodal problem. Depending on the outcome of the internship, both projects may lead to a PhD scholarship (one funding is already secured).

Project 1: natural language processing + speech processing

speaker A: 'Nice to meet you, I am Leonard, and this is Sheldon. We live across the hall.'
speaker B: 'Oh. Hi. I'm Penny.'

speaker A: 'Sheldon, what the hell are you doing?'
speaker C: 'I am not quite sure yet. I think I am on to something…'

From these two short conversations alone, a human can easily infer that 'speaker A' is actually 'Leonard', 'speaker B' is 'Penny' and 'speaker C' is 'Sheldon'. The objective of this project is to combine natural language processing and speech processing to do the same automatically. Building blocks include automatic speech transcription, named entity detection, classification of names (first, second or third person) and speaker diarization. Preliminary work in this direction has already been published in [Bredin 2014] and [Haurilet 2016].

[Bredin 2014] Hervé Bredin, Antoine Laurent, Achintya Sarkar, Viet-Bac Le, Sophie Rosset, Claude Barras. Person Instance Graphs for Named Speaker Identification in TV Broadcast. Odyssey 2014, The Speaker and Language Recognition Workshop.
[Haurilet 2016] Monica-Laura Haurilet, Makarand Tapaswi, Ziad Al-Halah, Rainer Stiefelhagen. Naming TV Characters by Watching and Analyzing Dialogs. WACV 2016. IEEE Winter Conference on Applications of Computer Vision.
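The name-propagation idea behind the example above can be sketched in a few lines. This is a deliberately simplified toy, not the proposed system: in practice the speaker labels would come from diarization, the detected names from ASR plus named entity detection, and the first/second-person classification from a trained classifier. The data structures and the "addressee is the next speaker" assumption are hypothetical simplifications.

```python
# Toy sketch of propagating names from dialogue to anonymous speaker labels.
# Hypothetical input format: each utterance is (speaker_label, detected_name,
# name_type), where name_type is 'first' (the speaker names themself),
# 'second' (the speaker addresses the named person), or None.

def propagate_names(utterances):
    """Map anonymous speaker labels to character names."""
    names = {}
    for i, (speaker, name, name_type) in enumerate(utterances):
        if name is None:
            continue
        if name_type == 'first':
            # "I am Leonard" -> the current speaker is Leonard
            names.setdefault(speaker, name)
        elif name_type == 'second' and i + 1 < len(utterances):
            # "Sheldon, what are you doing?" -> simplifying assumption:
            # the addressee is whoever speaks next
            names.setdefault(utterances[i + 1][0], name)
    return names

dialogue = [
    ('A', 'Leonard', 'first'),   # "I am Leonard"
    ('B', 'Penny',   'first'),   # "I'm Penny"
    ('A', 'Sheldon', 'second'),  # "Sheldon, what the hell are you doing?"
    ('C', None,      None),      # "I am not quite sure yet."
]
print(propagate_names(dialogue))
# {'A': 'Leonard', 'B': 'Penny', 'C': 'Sheldon'}
```

Real dialogues are far noisier (third-person mentions, overlapping speech, ASR errors), which is why the cited work formulates the problem over a person instance graph rather than with local rules like these.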

Project 2: speech processing + computer vision

This project aims at improving (acoustic) speaker diarization using the visual modality. Indeed, a recent paper [Bredin 2016] showed that advances in deep learning for computer vision lead to very reliable face clustering performance, whereas speaker diarization performs poorly on TV series and movies (mostly because the current state of the art was not designed for this kind of content).

The first task is to design deep learning approaches (based on recurrent neural networks) for talking-face detection (i.e. deciding which of the visible people, if any, is currently speaking) by combining the audio and visual (e.g. lip motion) streams. The second task is to combine talking-face detection and face clustering to guide and improve speaker diarization (i.e. who speaks when?). See [Bredin 2016] for more information on this kind of approach.

[Bredin 2016] Hervé Bredin, Grégory Gelly. Improving speaker diarization of TV series using talking-face detection and clustering. ACM Multimedia 2016, 24th ACM International Conference on Multimedia.
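To make the talking-face detection task concrete, here is a minimal sketch of the audiovisual fusion step. It replaces the trained recurrent network with a hand-written rule (a face is "talking" when the audio voice-activity detector fires and that face's lips move the most), and the per-frame score arrays are hypothetical stand-ins for real VAD and lip-motion features.

```python
import numpy as np

def talking_face(audio_vad, lip_motion):
    """Pick, for each video frame, which visible face (if any) is talking.

    audio_vad  : shape (T,), 1 if speech is detected on the soundtrack
    lip_motion : shape (T, F), a lip-motion score per frame for F faces
    Returns a shape (T,) array of face indices, or -1 when nobody speaks.

    Toy fusion rule standing in for a trained audiovisual RNN: the talking
    face is the one whose lips move the most, gated by audio activity.
    """
    talker = lip_motion.argmax(axis=1)
    talker[audio_vad == 0] = -1
    return talker

vad = np.array([1, 1, 0, 1])          # speech, speech, silence, speech
lips = np.array([[0.9, 0.1],          # face 0 moves its lips
                 [0.2, 0.8],          # face 1 moves its lips
                 [0.0, 0.0],          # nobody moves (and silence)
                 [0.1, 0.7]])         # face 1 again
print(talking_face(vad, lips))
# [ 0  1 -1  1]
```

The per-frame decisions produced this way can then be aggregated over face clusters to constrain speaker diarization, which is the direction explored in [Bredin 2016].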

Profile: Master student in machine learning (experience in natural language processing, computer vision and/or speech processing is appreciated)
Location: LIMSI - CNRS, Orsay, France
Duration: 5 to 6 months
Salary: according to current regulations
Contact: Hervé Bredin (bredin@limsi.fr) with CV + cover letter + reference letter(s)

