ISCApad #270
Friday, December 11, 2020 by Chris Wellekens
Privacy preserving and personalized transformations for speech recognition
This research position is part of a collaborative project (funded by the French National Research Agency) involving several French teams, including the MULTISPEECH team of Inria Nancy - Grand-Est. One objective of the project is to transform speech data so as to hide some speaker characteristics (such as voice identity, gender information, emotion, ...), so that the transformed data can be shared safely while preserving speaker privacy. The shared data is to be used to train and optimize models for speech recognition. The selected candidate will collaborate with other members of the project and will participate in the project meetings.
Scientific Context
Over the last decade, great progress has been made in automatic speech recognition [Saon et al., 2017; Xiong et al., 2017]. This is due to the maturity of machine learning techniques (e.g., advanced forms of deep learning), to the availability of very large datasets, and to the increase in computational power. Consequently, the use of speech recognition is now spreading in many applications, such as virtual assistants (for instance Apple’s Siri, Google Now, Microsoft’s Cortana, or Amazon’s Alexa), which collect, process and store personal speech data on centralized servers, raising serious concerns regarding the privacy of their users' data. Embedded speech recognition frameworks have recently been introduced to address privacy issues during the recognition phase: in this case, a (pretrained) speech recognition model is shipped to the user's device so that the processing can be done locally without users sharing their data. However, speech recognition technology still has limited performance in adverse conditions (e.g., noisy environments, reverberated speech, strong accents, etc.), and thus there is a need for performance improvement. This can only be achieved by using large speech corpora that are representative of the actual users and of the various usage conditions. There is therefore a strong need to share speech data for improved training that is beneficial to all users, while preserving the privacy of the users, which means at least keeping the speaker identity and voice characteristics private [1].
Missions:
Within this context, the objective is twofold. The first aim is to improve privacy preserving transforms of the speech data; the second is to investigate the use of additional personalized transforms, which can be applied on the user’s terminal, to increase speech recognition performance.
In the proposed approach, the device of each user will not share its raw speech data, but a privacy preserving transformation of the user's speech data. In such an approach, some private computations will be handled locally, while some cross-user computations may be carried out on a server using the transformed speech data, which protects the speaker identity and some of the speaker's attributes (gender, sentiment, emotions...). More specifically, this relies on representation learning to separate the features of the user data that can expose private information from the generic ones useful for the task of interest, i.e., here, the recognition of the linguistic content. On this topic, recent experiments have relied on Generative Adversarial Networks (GANs) for proposing a privacy preserving transform [Srivastava et al., 2019], and on voice conversion approaches [Srivastava et al., 2020]. In addition, as devices are getting more and more personal, this creates opportunities to make speech recognition more personalized. Some recent studies have investigated approaches that take advantage of speaker information [Turan et al., 2020]. The candidate will investigate further approaches along these lines. Other topics will also be part of the studies, such as investigating the impact and benefit of adding some random noise in the transforms, and dealing with (hiding) some paralinguistic characteristics. Research directions and priorities will take into account new state-of-the-art results and on-going activities in the project.
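As a purely illustrative sketch of the idea of transforming data on the device and adding random noise before sharing it, the toy Python function below removes the per-utterance mean of each feature dimension and then adds Gaussian noise. This is not the project's method: the function name `local_transform`, the `noise_std` parameter, and the use of mean removal as a stand-in for a learned privacy preserving transform are all assumptions made for illustration, and the sketch offers no privacy guarantee.

```python
import random
from typing import List, Optional

Frame = List[float]  # one acoustic feature vector (hypothetical representation)


def local_transform(frames: List[Frame], noise_std: float = 0.1,
                    seed: Optional[int] = None) -> List[Frame]:
    """Toy on-device transform: per-utterance mean removal plus Gaussian noise.

    Mean removal is only a crude stand-in for a learned transform;
    it is an assumption of this sketch, not a method from the project.
    """
    if not frames:
        return []
    rng = random.Random(seed)
    dim = len(frames[0])
    # Per-dimension mean over the whole utterance.
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    # Subtract the mean and add zero-mean Gaussian noise to every coefficient.
    return [
        [f[d] - means[d] + rng.gauss(0.0, noise_std) for d in range(dim)]
        for f in frames
    ]


# Only the transformed frames would leave the device, never the raw ones.
utterance = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
shared = local_transform(utterance, noise_std=0.0, seed=0)
noisy = local_transform(utterance, noise_std=0.1, seed=1)
```

In such a sketch, `noise_std` exposes the trade-off mentioned above: more noise hides more of the original signal but is likely to degrade recognition performance on the shared data.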
Skills and profile:
PhD or Master in machine learning or in computer science
Background in statistics and in deep learning
Experience with deep learning tools is a plus
Good computer skills (preferably in Python)
Experience in speech and/or speaker recognition is a plus
Bibliography:
[Saon et al., 2017] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, B. Roomi, and P. Hall: English conversational telephone speech recognition by humans and machines. Technical report, arXiv:1703.02136, 2017.
[Srivastava et al., 2019] B. Srivastava, A. Bellet, M. Tommasi, and E. Vincent: Privacy preserving adversarial representation learning in ASR: reality or illusion? INTERSPEECH 2019, 20th Annual Conference of the International Speech Communication Association, Sep 2019, Graz, Austria.
[Srivastava et al., 2020] B. Srivastava, N. Tomashenko, X. Wang, E. Vincent, J. Yamagishi, M. Maouche, A. Bellet, and M. Tommasi: Design choices for x-vector based speaker anonymization. INTERSPEECH 2020, 21st Annual Conference of the International Speech Communication Association, Oct 2020, Shanghai, China.
[Turan et al., 2020] T. Turan, E. Vincent, and D. Jouvet: Achieving multi-accent ASR via unsupervised acoustic model adaptation. INTERSPEECH 2020, 21st Annual Conference of the International Speech Communication Association, Oct 2020, Shanghai, China.
[Xiong et al., 2017] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig: Achieving human parity in conversational speech recognition. Technical report, arXiv:1610.05256, 2017.
Additional information:
Supervision and contact: Denis Jouvet (denis.jouvet@inria.fr; https://members.loria.fr/DJouvet/)
Duration: 2 years
Starting date: autumn 2020
Location: Inria Nancy – Grand Est, 54600 Villers-lès-Nancy
Footnote [1]: Note that when sharing data, users may not want to share data conveying private information at the linguistic level (e.g., phone number, person name, …). Such privacy aspects also need to be taken into account, but they are out of the scope of this project.