ISCApad #178
Wednesday, April 10, 2013 by Chris Wellekens
Computer Science Internship
CORDIAL group

Title : Voice Conversion from non-parallel corpora

Description : The main goal of a voice conversion system (VCS) is to transform the speech signal uttered by one speaker (the source speaker) so that it sounds as if it had been uttered by another person (the target speaker). Such techniques have many applications. For example, a VCS can be combined with a Text-To-Speech system in order to produce multiple high-quality synthetic voices. In the entertainment domain, a VCS can be used to dub an actor with his or her own voice.

State-of-the-art VCS use Gaussian Mixture Models (GMM) to capture the transformation from the acoustic space of the source to the acoustic space of the target. Most of these models are source-target joint models trained on paired source-target observations. Those paired observations are often gathered from parallel corpora, that is, speech signals obtained from the two speakers uttering the same set of sentences. Parallel corpora are hard to come by. Moreover, they do not guarantee that the pairing of vectors is accurate. Indeed, the pairing process is unsupervised and uses Dynamic Time Warping (DTW) under the strong (and unrealistic) hypothesis that the two speakers truly uttered the same sentence, with the same speaking style. This assumption is often wrong and results in non-discriminant models that tend to over-smooth the speakers' distinctive characteristics.

The goal of this Master's project is to eliminate the need for parallel corpora when training joint GMMs for voice conversion. We propose pairing speech segments on high-level speech descriptors such as those used in Unit Selection Text-To-Speech. Those descriptors contain not only segmental information (acoustic class, for example) but also supra-segmental information such as phoneme context, speed, prosody, power, ... In a first step, both source and target corpora are segmented and tagged with descriptors.
In a second step, each class from one corpus is paired with the equivalent class from the other corpus. Finally, a classical DTW algorithm can be applied to each paired class. The expected outcome is a set of transform models that both take speaker variability into account and are more robust to pairing errors.

Keywords : Voice Conversion, Gaussian Mixture Models

Contacts : Vincent Barreaud (vincent.barreaud@irisa.fr)

Bibliography :
[1] H. Benisty and D. Malah. Voice conversion using GMM with enhanced global variance. In Conference of the International Speech Communication Association (Interspeech), pages 669-672, 2011.
[2] L. Mesbahi, V. Barreaud, and O. Boeffard. Non-parallel hierarchical training for voice conversion. In Proceedings of the 16th European Signal Processing Conference, Lausanne, Switzerland, 2008.
[3] Y. Stylianou, O. Cappé, and E. Moulines. Continuous probabilistic transform for voice conversion. IEEE Transactions on Speech and Audio Processing, 6(2):131-142, 1998.
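
The pairing procedure outlined in the description (tag the segments, pair equivalent classes, then align each paired class with DTW) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the group's method: the function names `dtw_path` and `pair_by_class`, the dictionary layout mapping a descriptor tag to a list of feature frames, the Euclidean frame distance, and the simplification of one frame sequence per class are all hypothetical.

```python
import math


def dtw_path(x, y):
    """Align two frame sequences with classical DTW.

    x, y: lists of equal-dimension feature frames (tuples of floats).
    Returns a list of (i, j) index pairs from (0, 0) to (len(x)-1, len(y)-1).
    """
    if not x or not y:
        return []
    n, m = len(x), len(y)
    # cost[i][j] = cheapest cumulative distance aligning x[:i] with y[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(x[i - 1], y[j - 1])  # Euclidean distance (assumption)
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda s: cost[s[0]][s[1]])
        path.append((i - 1, j - 1))
    path.reverse()
    return path


def pair_by_class(source, target):
    """Build joint source-target vectors from two tagged, non-parallel corpora.

    source, target: dicts mapping a descriptor tag (hypothetical, e.g. a
    phoneme-in-context label) to that class's frame sequence.
    Only classes present in both corpora are paired.
    """
    pairs = []
    for tag in sorted(source.keys() & target.keys()):
        src, tgt = source[tag], target[tag]
        for i, j in dtw_path(src, tgt):
            # Concatenate the aligned frames into one joint observation.
            pairs.append(tuple(src[i]) + tuple(tgt[j]))
    return pairs
```

The joint vectors returned by `pair_by_class` are the kind of paired observations on which a joint source-target GMM would then be trained.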