ISCApad #147
Sunday, September 12, 2010, by Chris Wellekens
Title: Bayesian networks for modeling and handling variability sources in speech recognition

Location: INRIA Nancy Grand Est research center, LORIA Laboratory, Nancy, France

In state-of-the-art speech recognition systems, Hidden Markov Models (HMMs) are used to model the acoustic realization of the sounds. The decoding process compares the unknown speech signal with sequences of these acoustic models to find the best-matching sequence, which determines the recognized words. Lexical and grammatical constraints are taken into account during decoding; they limit the number of model sequences considered in the comparisons, which nevertheless remains very large. Precise acoustic models are therefore necessary for good speech recognition performance, and to obtain reliable parameters the HMM-based acoustic models are trained on very large speech corpora.

However, speech recognition performance depends strongly on the acoustic environment: performance is good when the acoustic environment matches that of the training data, and it degrades as the acoustic environment diverges from it. The acoustic environment depends on many variability sources that affect the acoustic signal, including the speaker's gender (male/female), individual speaker characteristics, speech loudness, speaking rate, the microphone, the transmission channel and, of course, noise, to name only a few [Benzeghiba et al., 2007]. Using a training corpus that exhibits too many different variability sources (for example, many different noise levels or very different channel and speech coding schemes) makes the acoustic models less discriminative and thus lowers recognition performance. Conversely, having many sets of acoustic models, each dedicated to a specific environment condition, raises training problems: because each training subset is restricted to a specific environment condition, its size is much smaller, and it may therefore be impossible to reliably train some parameters of the acoustic models associated with that condition.

In recent years, Dynamic Bayesian Networks (DBNs) have been applied to speech recognition. In such an approach, certain model parameters are made dependent on auxiliary features, such as articulatory information [Stephenson et al., 2000], pitch and energy [Stephenson et al., 2004], speaking rate [Shinozaki & Furui, 2003], or a hidden factor related to a clustering of the training speech data [Korkmazsky et al., 2004]. The approach has also been investigated for multiband speech recognition, for non-native speech recognition, and for taking estimates of speaker classes into account in continuous speech recognition [Cloarec & Jouvet, 2008]. Although these experiments were conducted on limited-vocabulary tasks, they showed that Dynamic Bayesian Networks provide a way of handling some variability sources in the acoustic modeling.

The objective of this work is to further investigate the application of Dynamic Bayesian Networks to continuous speech recognition with large vocabularies. The aim is to estimate the current acoustic environment condition dynamically and to constrain accordingly the acoustic space used during decoding. The underlying idea is to handle a varying range of acoustic-space constraints during decoding: when the estimate of the acoustic environment condition is reliable, constraints specific to that condition can be used (leading, for example, to model parameters associated with a class of very similar speakers in a given environment); conversely, when the estimate is less reliable, more tolerant constraints should be used (leading, for example, to model parameters associated with a broader class of speakers or with several environment conditions).
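As a toy illustration of this idea (not part of the announcement), the following minimal Python sketch conditions each HMM state's emission density on a hidden environment/speaker class and marginalizes over an assumed class posterior: a sharp posterior effectively restricts decoding to one class, while a flat posterior keeps the broader acoustic space. All names, dimensions and values are illustrative assumptions.

# Minimal sketch of class-conditioned emission scoring (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

N_STATES = 3    # HMM states
N_CLASSES = 2   # hidden environment / speaker classes
DIM = 2         # toy feature dimension

# One diagonal-covariance Gaussian per (state, class) pair.
means = rng.normal(size=(N_STATES, N_CLASSES, DIM))
var = np.ones(DIM)

def log_gauss(x, mu, var):
    # Log density of a diagonal-covariance Gaussian.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def emission_loglik(x, class_posterior):
    # Log-likelihood of frame x for every HMM state, marginalizing the
    # hidden class with the current class posterior P(c | evidence):
    # log p(x | s) = log sum_c P(c) p(x | s, c)
    scores = np.empty(N_STATES)
    for s in range(N_STATES):
        per_class = np.array([log_gauss(x, means[s, c], var)
                              for c in range(N_CLASSES)])
        scores[s] = np.logaddexp.reduce(np.log(class_posterior) + per_class)
    return scores

frame = rng.normal(size=DIM)
# Reliable class estimate: sharp posterior, decoding constrained to one class.
print(emission_loglik(frame, np.array([0.95, 0.05])))
# Unreliable class estimate: flat posterior, all classes contribute.
print(emission_loglik(frame, np.array([0.5, 0.5])))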
Within the formalism of Dynamic Bayesian Networks, the work to be carried out is as follows. The first aspect concerns the optimization of the classification of the training data, together with the associated methods for automatically estimating the classes that best match unknown test data. The second aspect concerns the development of confidence measures associated with the classification of test sentences, and the integration of these confidence measures into the DBN modeling, so as to constrain the acoustic space used for decoding more or less tightly according to the reliability of the environment condition estimate.

References:

[Benzeghiba et al., 2007] M. Benzeghiba, R. de Mori, O. Deroo, S. Dupont, T. Erbes, D. Jouvet, L. Fissore, P. Laface, A. Mertins, C. Ris, R. Rose, V. Tyagi & C. Wellekens: 'Automatic speech recognition and speech variability: A review'; Speech Communication, Vol. 49, 2007, pp. 763-786.

[Cloarec & Jouvet, 2008] G. Cloarec & D. Jouvet: 'Modeling inter-speaker variability in speech recognition'; Proc. ICASSP'2008, IEEE International Conference on Acoustics, Speech, and Signal Processing, 30 March - 4 April 2008, Las Vegas, Nevada, USA, pp. 4529-4532.

[Korkmazsky et al., 2004] F. Korkmazsky, M. Deviren, D. Fohr & I. Illina: 'Hidden factor dynamic Bayesian networks for speech recognition'; Proc. ICSLP'2004, International Conference on Spoken Language Processing, 4-8 October 2004, Jeju Island, Korea, pp. 1134-1137.

[Shinozaki & Furui, 2003] T. Shinozaki & S. Furui: 'Hidden mode HMM using Bayesian network for modeling speaking rate fluctuation'; Proc. ASRU'2003, IEEE Workshop on Automatic Speech Recognition and Understanding, 30 November - 4 December 2003, US Virgin Islands, pp. 417-422.

[Stephenson et al., 2000] T.A. Stephenson, H. Bourlard, S. Bengio & A.C. Morris: 'Automatic speech recognition using dynamic Bayesian networks with both acoustic and articulatory variables'; Proc. ICSLP'2000, International Conference on Spoken Language Processing, 2000, Beijing, China, vol. 2, pp. 951-954.

[Stephenson et al., 2004] T.A. Stephenson, M.M. Doss & H. Bourlard: 'Speech recognition with auxiliary information'; IEEE Transactions on Speech and Audio Processing, SAP-12 (3), 2004, pp. 189-203.