(2011-06-24) Workshop ADVANCES IN SPEECH TECHNOLOGIES at IRCAM Paris
Tuesday, June 21, 2011 by Chris Wellekens |
Workshop
ADVANCES IN SPEECH TECHNOLOGIES
Friday, June 24, 2011
in Stravinsky conference room, IRCAM, Paris.
IRCAM, Music and ... Speech.
'Through its expressive power, through its permanence alongside the instrumental universe, through its ability to blend with a text, through its capacity to reproduce sounds that no grammar can classify (the grammar of language as much as the grammar of music), the voice can submit to the hierarchy, integrate itself into it, or break free of it entirely. An immediate medium, not inescapably bound by cultural constraint in order to communicate and to express, the voice can be, as much as a cultivated instrument, a ''wild'', irreducible tool.'
Pierre Boulez, 'Automatisme et décision', in Jalons (pour une décennie): dix ans d'enseignement au Collège de France (1978-1988), Paris, Christian Bourgois, 1989.
--------------------------------------------------------------------------------------------------------------
On Friday, June 24, 2011, this workshop will feature leading figures in speech processing presenting work in progress in speech technologies, from recognition and synthesis to interaction.
Free admission
* - * - *
9:30am - 10:00am
Axel Roebel and Xavier Rodet, IRCAM - Analysis and Synthesis Team.
'Speech analysis, synthesis and transformation in the Analysis/Synthesis team at IRCAM'
For about seven years, the interest of composers and musical assistants at IRCAM in speech synthesis and transformation techniques has grown steadily. As a result, speech processing has become one of the central research objectives of the Analysis/Synthesis team at IRCAM. This introduction will present some of the key results of these research efforts, with examples notably related to spectral envelope estimation, estimation of the LF glottal pulse model parameters, text-to-speech synthesis, shape-invariant signal transformation in the phase vocoder, speaker transformation, voice conversion, and the transformation of emotional states.
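As an illustration of one item on this list, here is a minimal sketch of spectral envelope estimation by cepstral smoothing, a classic starting point for the kind of envelope estimators studied by the team (their published 'True Envelope' method iteratively refines this basic step). Function and parameter names are illustrative, not the team's actual code.

# Minimal sketch of cepstral spectral-envelope estimation (illustrative only;
# an iterative refinement of this smoothing step underlies "True Envelope").
import numpy as np

def cepstral_envelope(frame, order=40, n_fft=1024):
    """Estimate a smooth spectral envelope of one windowed speech frame.

    frame : 1-D array of audio samples (already windowed)
    order : number of low-quefrency cepstral coefficients to keep; roughly
            controls how closely the envelope hugs the harmonic peaks
    """
    spectrum = np.fft.rfft(frame, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)      # avoid log(0)
    cepstrum = np.fft.irfft(log_mag, n_fft)         # real cepstrum
    lifter = np.zeros(n_fft)
    lifter[:order] = 1.0                            # keep low quefrencies...
    lifter[-order + 1:] = 1.0                       # ...and their mirror half
    smoothed = np.fft.rfft(cepstrum * lifter, n_fft)
    return np.exp(smoothed.real)                    # linear-amplitude envelope

# Example: envelope of a synthetic vowel-like frame (harmonics at 200/400/600 Hz)
t = np.arange(1024) / 16000.0
frame = np.hanning(1024) * sum(np.sin(2 * np.pi * f * t) for f in (200, 400, 600))
env = cepstral_envelope(frame)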
* - * - *
10:00am - 11:00am
Jean-François Bonastre, Laboratoire d'Informatique d'Avignon - Université d'Avignon.
'Speaker Recognition: a New Binary Representation'
Mainstream speaker recognition approaches are based on statistical modelling of the acoustic space. This modelling usually relies on a Gaussian Mixture Model (GMM), denoted the Universal Background Model (UBM), with a large number of components, trained on a large set of speech data gathered from hundreds of speakers. Each target model is derived from the UBM by MAP adaptation of the Gaussian mean parameters only. An important evolution of the UBM/GMM paradigm was to treat the UBM as defining a new data representation space, obtained by concatenating the Gaussian mean parameters. This space, denoted the 'supervector' space, made it possible to use Support Vector Machine (SVM) classifiers fed with supervectors. A second step in this evolution was the direct modelling of session variability in the supervector space using the Joint Factor Analysis (JFA) approach. More recently, the Total Variability space was introduced as an evolution of JFA: it models the total variability in the supervector space in order to build a smaller space that concentrates the information and in which session and speaker variability are easier to model jointly. Looking at this evolution, three remarks can be made: the evolution is always tied to large models with thousands of parameters; the new approaches are largely unable to work at the frame level; and, finally, these approaches rely on the general statistical paradigm in which a piece of information is considered strong when it occurs frequently.
This talk analyses the consequences of these remarks and presents a new paradigm for speaker recognition, based on a discrete binary representation, which is able to overcome the limitations of the previous approaches.
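To make the UBM/GMM pipeline described above concrete, here is a toy sketch of relevance-MAP adaptation of the Gaussian means and the resulting supervector. The UBM here is tiny and trained on random stand-in data with scikit-learn, purely for illustration; real systems use hundreds to thousands of components over MFCC features.

# Sketch of the classic UBM/GMM pipeline: train a (toy) UBM, MAP-adapt only
# the Gaussian means to a target speaker, and stack the adapted means into a
# "supervector" usable, e.g., by an SVM classifier.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
background = rng.normal(size=(5000, 12))      # stand-in for pooled MFCC frames
target = rng.normal(loc=0.3, size=(300, 12))  # stand-in for one speaker's frames

# 1) Universal Background Model: GMM trained on many speakers' data.
ubm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0)
ubm.fit(background)

# 2) Relevance-MAP adaptation of the means only.
relevance = 16.0
gamma = ubm.predict_proba(target)             # (n_frames, n_components) posteriors
n_k = gamma.sum(axis=0)                       # soft frame counts per component
e_k = (gamma.T @ target) / np.maximum(n_k, 1e-10)[:, None]   # posterior means
alpha = (n_k / (n_k + relevance))[:, None]    # adaptation weight per component
adapted_means = alpha * e_k + (1.0 - alpha) * ubm.means_

# 3) Supervector: concatenation of the adapted means.
supervector = adapted_means.ravel()           # length = n_components * dim

* - * - *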
11:00am - 12:00pm
Nick Campbell, Centre for Language & Communications Studies - Trinity College, Dublin.
'Talking with Robots'
This talk describes a robot interface for gathering conversational data currently on exhibition in the Science Gallery of Trinity College Dublin.
We use a small LEGO NXT Mindstorms device as a platform for a high-definition webcam and microphones, in conjunction with a finite-state dialogue machine and recordings of several human utterances that are played back through a sound-warping device so that the robot appears to be speaking them. Visual processing using OpenCV forms the core of the device, interacting with the discourse model to engage passers-by in a brief conversation; we record each exchange in order to learn more about such discourse strategies for advanced human-computer interaction.
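The exhibited system's code is not published in the abstract; the following generic sketch only illustrates the pattern it describes, OpenCV face detection driving a small finite-state dialogue loop. The states, prompts and thresholds are invented for illustration.

# Generic sketch: OpenCV face detection driving a finite-state dialogue loop.
# States, prompts and thresholds are illustrative assumptions only.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
cam = cv2.VideoCapture(0)

state = 'IDLE'                     # IDLE -> GREET -> CHAT -> FAREWELL -> IDLE
absent_frames = 0

while True:
    ok, frame = cam.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    present = len(faces) > 0
    absent_frames = 0 if present else absent_frames + 1

    if state == 'IDLE' and present:
        state = 'GREET'            # a passer-by appeared: start the exchange
        print('robot: Hello there!')   # stand-in for playing a recorded utterance
    elif state == 'GREET' and present:
        state = 'CHAT'
        print('robot: What brings you to the gallery?')
    elif state in ('GREET', 'CHAT') and absent_frames > 30:
        state = 'FAREWELL'         # visitor walked away for ~1 s of frames
        print('robot: Goodbye!')
    elif state == 'FAREWELL':
        state = 'IDLE'

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cam.release()

* - * - *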
12:00pm - 1:00pm
Simon King, Centre for Speech Technology Research - The University of Edinburgh.
'Synthetic Speech: Beyond Mere Intelligibility'
Some text-to-speech synthesisers are now as intelligible as human speech. This is a remarkable achievement, but the next big challenge is to approach human-like naturalness, which will be even harder. I will describe several lines of research which are attempting to imbue speech synthesisers with the properties they need to sound more 'natural' - whatever that means.
The starting point is personalised speech synthesis, which allows the synthesiser to sound like an individual person without requiring substantial amounts of their recorded speech. I will then describe how we can work from imperfect recordings or achieve personalised speech synthesis across languages, with a few diversions to consider what it means to sound like the same person in two different languages and how vocal attractiveness plays a role. Since the voice is not only our preferred means of communication but also a central part of our identity, losing it can be distressing. Current voice-output communication aids offer a very poor selection of voices, but recent research means that it will soon be possible to provide people who are losing the ability to speak, perhaps due to conditions such as Motor Neurone Disease, with personalised communication aids that sound just like they used to, even if we do not have a recording of their original voice. There will be plenty of examples, including synthetic child speech, personalised synthesis across the language barrier, and the reconstruction of voices from recordings of disordered speech. This work was done with Junichi Yamagishi, Sandra Andraszewicz, Oliver Watts, Mirjam Wester and many others.
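Personalised synthesis of this kind builds on adapting a well-trained 'average voice' model with small amounts of target speech, as in Yamagishi's HMM-based adaptation work. The following toy sketch shows only the core numerical idea of MLLR-style mean adaptation, estimating one global affine transform from a few aligned frames; all data, dimensions and alignments are invented stand-ins, not the actual system.

# Toy sketch of MLLR-style mean adaptation: keep the shared "average voice"
# model and estimate only a global affine transform of its Gaussian means
# from a small adaptation set (hard alignments assumed for simplicity).
import numpy as np

rng = np.random.default_rng(1)
dim = 6
avg_means = rng.normal(size=(40, dim))          # means of an average-voice model

# Pretend each adaptation frame is a noisy observation of one state's mean
# under the target speaker's (unknown) affine transform.
true_W = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
true_b = 0.5 * rng.normal(size=dim)
states = rng.integers(0, 40, size=200)          # which state each frame aligns to
frames = avg_means[states] @ true_W.T + true_b + 0.05 * rng.normal(size=(200, dim))

# Least-squares estimate of the global affine transform [W, b] from the
# aligned (mean, frame) pairs -- the essence of MLLR mean adaptation.
X = np.hstack([avg_means[states], np.ones((200, 1))])    # augmented means
Wb, *_ = np.linalg.lstsq(X, frames, rcond=None)          # (dim+1, dim)
adapted_means = np.hstack([avg_means, np.ones((40, 1))]) @ Wb

# 'adapted_means' now approximates the target speaker's means; a synthesiser
# using them "sounds like" the target while the rest of the model is shared.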