ISCApad #176
Saturday, February 09, 2013 by Chris Wellekens
Computer Science Internship, CORDIAL group

Title: Unit-selection speech synthesis guided by a stochastic model of spectral and prosodic parameters.

A Text-To-Speech (TTS) system produces a speech signal corresponding to the vocalization of a given text. Such a system is composed of a linguistic processing stage followed by an acoustic stage that complies as closely as possible with the linguistic directives. For the second stage, the most widely used approaches are:

- The corpus-based synthesis approach, which relies on the selection and concatenation of unit sequences extracted from a large continuous speech corpus. It has been popular for 20 years, yielding unmatched sound quality but still exhibiting some artefacts due to spectral discontinuities.
- The statistical approach. This new generation of TTS systems has emerged in recent years, reintroducing rule-based systems. The rules are no longer deterministic, as in the first systems of the 1950s, but are replaced by stochastic models. HTS, an HMM-based speech synthesis system, is currently the most widely used statistical system. HTS-type systems yield a good acoustic continuum, but with a sound quality that depends strongly on the underlying acoustic model.

Recently, some hybrid synthesis systems have been proposed, combining the statistical approach with unit selection. They either use the acoustic descriptions and melodic contours generated by a statistical system to drive the cost function during the natural speech unit selection phase, or replace poor-quality natural speech units with units derived from a statistical system.

The framework of this subject is corpus-based TTS. Given the combinatorial problem posed by searching for an optimal unit sequence with blind sequencing, the work consists in determining heuristics that reduce the search space and meet a real-time objective.
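As an illustration of the search problem described above, here is a minimal Python sketch of unit selection with a pre-selection step. It is not the Cordial system or HTS: the `Target` and `Unit` classes, the scalar features, the cost formulas, and the parameters `k` and `w_join` are all assumptions chosen for the example. Targets stand in for parameters predicted by a statistical model; for each position, only the `k` candidates with the lowest target cost are kept, and a Viterbi search then minimizes the sum of target and join costs over the pruned lattice.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Target:
    """Hypothetical model-predicted parameters for one position."""
    f0: float    # predicted pitch (Hz)
    spec: float  # scalar stand-in for a spectral vector


@dataclass(frozen=True)
class Unit:
    """Hypothetical corpus unit with toy scalar features."""
    uid: str
    f0: float
    spec: float
    left: float   # spectral value at the left boundary
    right: float  # spectral value at the right boundary


def target_cost(t: Target, u: Unit) -> float:
    # Assumed cost: distance between predicted and observed parameters.
    return abs(t.f0 - u.f0) / 10.0 + abs(t.spec - u.spec)


def join_cost(a: Unit, b: Unit) -> float:
    # Assumed cost: spectral discontinuity at the concatenation point.
    return abs(a.right - b.left)


def select_units(targets: Sequence[Target],
                 candidates: Sequence[Sequence[Unit]],
                 k: int = 2,
                 w_join: float = 1.0) -> Tuple[List[Unit], float]:
    # Pre-selection filter: keep the k best candidates per position.
    pruned = [sorted(c, key=lambda u, t=t: target_cost(t, u))[:k]
              for t, c in zip(targets, candidates)]
    # Viterbi search over the pruned lattice.
    best = [target_cost(targets[0], u) for u in pruned[0]]
    back: List[List[int]] = []
    for i in range(1, len(targets)):
        new_best, ptr = [], []
        for u in pruned[i]:
            costs = [best[j] + w_join * join_cost(p, u)
                     for j, p in enumerate(pruned[i - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_min] + target_cost(targets[i], u))
            ptr.append(j_min)
        best = new_best
        back.append(ptr)
    # Backtrack from the cheapest final state.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [pruned[i][j] for i, j in enumerate(path)], total


# Toy data: two positions, three candidates each.
targets = [Target(100, 1.0), Target(110, 2.0)]
candidates = [
    [Unit("a1", 100, 1.0, 0.0, 5.0),
     Unit("a2", 105, 1.0, 0.0, 2.0),
     Unit("a3", 150, 3.0, 0.0, 2.0)],
    [Unit("b1", 110, 2.0, 2.0, 3.0),
     Unit("b2", 120, 2.0, 5.0, 3.0),
     Unit("b3", 200, 9.0, 2.0, 3.0)],
]
chosen, total = select_units(targets, candidates, k=2)
print([u.uid for u in chosen], total)  # ['a2', 'b1'] 0.5
```

In this toy example, pruning with `k=1` would force the sequence a1, b1 at a total cost of 3.0, whereas `k=2` recovers the cheaper sequence a2, b1. This shows the trade-off a pre-selection heuristic has to manage: a smaller search space against the risk of discarding the optimal path.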
These heuristics, based on spectral and prosodic parameters generated by HTS, will make it possible to implement pre-selection filters or to propose new cost functions within the corpus-based system developed by the Cordial group. The output of the hybrid system will be evaluated and compared, via listening tests, with standard systems such as HTS and a corpus-based system.

Keywords: TTS, corpus-based speech synthesis, statistical learning, experiments.

Contacts: Olivier Boe

Bibliography:
[1] A. W. Black and K. A. Lenzo, Optimal data selection for unit selection synthesis, 4th ISCA Tutorial and Research Workshop on Speech Synthesis, 2001.
[2] H. Kawai, T. Toda, J. Ni, M. Tsuzaki and K. Tokuda, Ximera: a new TTS from ATR based on corpus-based technologies, ISCA Tutorial and Research Workshop on Speech Synthesis, 2004.
[3] S. Rouibia and O. Rosec, Unit selection for speech synthesis based on a new acoustic target cost, Interspeech, 2005.
[4] H. Zen, K. Tokuda and A. W. Black, Statistical parametric speech synthesis, Speech Communication, v.51, n.11, pages 1039-1064, 2009.
[5] H. Silen, E. Helander, J. Nurminen, K. Koppinen and M. Gabbouj, Using Robust Viterbi Algorithm and HMM-Modeling in Unit Selection TTS to Replace Units of Poor Quality, Interspeech, 2010.