ISCA - International Speech Communication Association

ISCApad #178

Wednesday, April 10, 2013 by Chris Wellekens

6-17 (2012-12-16) Master project IRISA Rennes France

Computer Science Internship

CORDIAL group

Title :

Voice Conversion from non-parallel corpora

Description :

The main goal of a voice conversion system (VCS) is to transform the speech signal uttered by one speaker (the source speaker) so that it sounds as if it had been uttered by another person (the target speaker). Such techniques have many applications. For example, a VCS can be combined with a Text-To-Speech system to produce multiple high-quality synthetic voices. In the entertainment domain, a VCS can be used to dub an actor with his or her own voice.

State-of-the-art VCS use Gaussian Mixture Models (GMM) to capture the transformation from the acoustic space of the source to the acoustic space of the target. Most of these models are source-target joint models trained on paired source-target observations. These paired observations are often gathered from parallel corpora, that is, speech signals obtained from the two speakers uttering the same set of sentences. Parallel corpora are hard to come by. Moreover, they do not guarantee that the pairing of vectors is accurate. Indeed, the pairing process is unsupervised and uses Dynamic Time Warping (DTW) under the strong (and unrealistic) hypothesis that the two speakers truly uttered the same sentence, with the same speaking style. This assumption is often wrong and results in non-discriminant models that tend to over-smooth the speakers' distinctive characteristics.
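As an illustration of the pairing step described above, here is a minimal sketch of Dynamic Time Warping between two sequences of feature vectors. It assumes a Euclidean frame distance and the basic three-way step pattern (match, insertion, deletion). The function name `dtw_path` and the toy one-dimensional frames are hypothetical, and nothing here is taken from the CORDIAL group's actual system.

```python
import math


def dtw_path(source, target):
    """Align two sequences of equal-dimension feature vectors.

    Returns (total_cost, path), where path is a list of
    (source_index, target_index) pairs, i.e. the paired observations.
    """
    n, m = len(source), len(target)
    # cost[i][j]: minimal accumulated distance aligning source[:i] with target[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(source[i - 1], target[j - 1])  # Euclidean frame distance (assumption)
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end of both sequences to recover the frame pairing.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy example: the target speaker holds the first sound one frame longer.
source_frames = [(0.0,), (1.0,), (2.0,)]
target_frames = [(0.0,), (0.0,), (1.0,), (2.0,)]
total_cost, pairs = dtw_path(source_frames, target_frames)
```

Note that the algorithm always returns a pairing, even when the two utterances differ in content or speaking style. This is why alignment errors pass silently into the training data of the joint model.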

The goal of this Master's project is to eliminate the need for parallel corpora when training joint GMMs for voice conversion. We propose to pair speech segments based on high-level speech descriptors such as those used in Unit Selection Text-To-Speech. These descriptors contain not only segmental information (the acoustic class, for example) but also supra-segmental information such as phoneme context, speed, prosody, power, etc. In a first step, both source and target corpora are segmented and tagged with descriptors. In a second step, each class from one corpus is paired with the equivalent class from the other corpus. Finally, a classical DTW algorithm can be applied to each paired class. The expected result is a set of transform models that both account for speaker variability and are more robust to pairing errors.
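The first two steps of the proposed approach could be sketched as follows. This is only an illustration under strong assumptions: each segment is represented as a `(descriptor, frames)` pair, the descriptor is a hypothetical tuple `(phoneme, left context, right context)`, and two classes are treated as "equivalent" only when their descriptors match exactly. The actual descriptor set and equivalence criterion are precisely what the project would have to define.

```python
from collections import defaultdict


def group_by_descriptor(segments):
    """Group tagged segments into classes keyed by their descriptor."""
    classes = defaultdict(list)
    for descriptor, frames in segments:
        classes[descriptor].append(frames)
    return dict(classes)


def pair_classes(source_segments, target_segments):
    """Pair each source class with the target class sharing its descriptor.

    Classes present in only one corpus are left unpaired (assumption:
    exact descriptor match stands in for class equivalence).
    """
    source_classes = group_by_descriptor(source_segments)
    target_classes = group_by_descriptor(target_segments)
    shared = sorted(source_classes.keys() & target_classes.keys())
    return {d: (source_classes[d], target_classes[d]) for d in shared}


# Toy corpora: the speakers did not utter the same sentences,
# but both produced an /a/ between /s/ and /t/.
source_corpus = [
    (("a", "s", "t"), [(1.0,), (1.1,)]),
    (("e", "p", "l"), [(2.0,)]),
]
target_corpus = [
    (("o", "m", "n"), [(3.0,)]),
    (("a", "s", "t"), [(1.5,), (1.4,), (1.6,)]),
]
paired = pair_classes(source_corpus, target_corpus)
```

Each entry of `paired` holds the source and target segments of one class. A frame-level alignment would then be run within each class rather than across whole sentences.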

Keywords :

Voice Conversion, Gaussian Mixture Models

Contacts :

Vincent Barreaud (vincent.barreaud@irisa.fr)

Bibliography :

[1] H. Benisty and D. Malah. Voice conversion using GMM with enhanced global variance. In Conference of the International Speech Communication Association (Interspeech), pages 669-672, 2011.

[2] L. Mesbahi, V. Barreaud, and O. Boeffard. Non-parallel hierarchical training for voice conversion. In Proceedings of the 16th European Signal Processing Conference, Lausanne, Switzerland, 2008.

[3] Y. Stylianou, O. Cappe, and E. Moulines. Continuous probabilistic transform for voice conversion. IEEE Transactions on Speech and Audio Processing, 6(2):131-142, 1998.
