Multimodal behavior generation and style transfer for virtual agent animation
Catherine Pelachaud, Nicolas Obin catherine.pelachaud@isir.upmc.fr, nicolas.obin@ircam.fr
Humans communicate through speech but also through a wide range of multimodal signals: hand gestures, body posture, facial expressions, gaze, touch, speech prosody, etc. Verbal and nonverbal behaviors play a crucial role in conveying and perceiving new information in human-human interaction. Depending on the context of communication and the audience, a person continuously adapts their style during interaction. This stylistic adaptation involves verbal and nonverbal modalities, such as language, speech prosody, facial expressions, hand gestures, and body posture. Virtual agents, also called Embodied Conversational Agents (ECAs, see [B] for an overview), are entities that can communicate verbally and nonverbally with human interlocutors. Their roles vary depending on the application: they can act as a tutor, an assistant, or even a companion. Matching the agent's behavior style to its interaction context ensures better engagement and adherence of human users. A large number of generative models have been proposed in the past few years for synthesizing gestures of ECAs. Lately, style modeling and transfer have received increasing attention as a way to adapt the behavior of the ECA to its context and audience. The latest research proposes neural architectures comprising a content encoder, a style encoder, and a decoder conditioned on both, so as to generate the ECA gestural behavior that corresponds to the content and is expressed in the desired style. While the first attempts focused on modeling the style of a single speaker [4, 5, 7], there is a rapidly increasing effort towards multi-speaker and multi-style modeling and transfer [1, 2]. In particular, few-shot style transfer architectures attempt to generate gestural behavior in a given style from a minimal amount of data in that style and with minimal further training or fine-tuning.
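As an illustration only (not the architecture of [1, 2, 4, 5, 7]), the following minimal PyTorch sketch shows the general shape of such a system: a content encoder driven by speech/text features, a style encoder computing a style vector from reference gestures of the target speaker, and a decoder conditioned on both. All module names and dimensions are assumptions.

# Minimal sketch of a content/style encoder-decoder for co-speech gesture
# generation (illustrative only; names, dimensions, and the fusion scheme
# are assumptions, not the architectures cited above).
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Encodes the 'what' (speech/text features) into a frame-level sequence."""
    def __init__(self, in_dim=128, hid_dim=256):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hid_dim, batch_first=True, bidirectional=True)

    def forward(self, speech_text_feats):          # (B, T, in_dim)
        out, _ = self.rnn(speech_text_feats)       # (B, T, 2*hid_dim)
        return out

class StyleEncoder(nn.Module):
    """Summarizes reference gestures of the target speaker into one style vector."""
    def __init__(self, pose_dim=57, style_dim=64):
        super().__init__()
        self.rnn = nn.GRU(pose_dim, style_dim, batch_first=True)

    def forward(self, reference_poses):            # (B, T_ref, pose_dim)
        _, h = self.rnn(reference_poses)           # (1, B, style_dim)
        return h.squeeze(0)                        # (B, style_dim)

class GestureDecoder(nn.Module):
    """Generates a pose sequence conditioned on content frames and a style vector."""
    def __init__(self, content_dim=512, style_dim=64, pose_dim=57):
        super().__init__()
        self.rnn = nn.GRU(content_dim + style_dim, 256, batch_first=True)
        self.head = nn.Linear(256, pose_dim)

    def forward(self, content, style):             # content: (B, T, content_dim)
        style_seq = style.unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.rnn(torch.cat([content, style_seq], dim=-1))
        return self.head(out)                      # (B, T, pose_dim)

# Few-shot transfer in this setting: the style vector comes from a short clip of
# the target speaker, while the content comes from new speech/text input.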
Objectives and methodology:
The aim of this PhD is to generate human-like gestural behavior in order to empower virtual agents to communicate verbally and nonverbally with different styles, extending the previous thesis by Mireille Fares [A]. We view behavioral style as pervasive while speaking: it colors the communicative behaviors, whereas content is carried by multimodal signals but mainly expressed through text semantics. The objective is to generate ultra-realistic verbal and nonverbal behaviors (text style, prosody, facial expressions, body gestures and poses) corresponding to a given content (mostly driven by text and speech), and to adapt it to a specific style. This raises methodological and fundamental challenges in the fields of machine learning and human-computer interaction: 1) How to define content and style; which modalities are involved, and in which proportion, in the gestural expression of content and style? 2) How to implement efficient neural architectures to disentangle content and style information from multimodal human behavior (text, speech, gestures)? The proposed directions will leverage cutting-edge research in neural networks, such as multimodal modeling and generation [8], information disentanglement [6], and text-prompt-based generation as popularized by DALL-E or ChatGPT [9].
The research questions can be summarized as follows:
· What is a multimodal style? What are the style cues in each modality (verbal, prosody, and nonverbal behavior)? How to fuse the cues from each modality to build a multimodal style?
· How to control the generation of verbal and nonverbal cues using a multimodal style? How to transfer a multimodal style into generative models? How to integrate style-oriented prompts/instructions into multimodal generative models while preserving the underlying intentions to be conveyed by the agent?
· How to evaluate the generation? How to measure content preservation and style transfer? How to design evaluation protocols with real users?
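As a hedged illustration of possible automatic measures (not a prescribed protocol), the sketch below computes content preservation as a mean per-joint position error against reference motion, and style transfer as the accuracy of a pretrained style/speaker classifier on the generated motion; style_classifier is a hypothetical model, not an existing tool.

# Two automatic measures commonly used for generated gestures (sketch only).
import torch

def content_preservation(generated, reference):
    """Mean per-joint position error between (B, T, J, 3) pose tensors."""
    return (generated - reference).norm(dim=-1).mean().item()

@torch.no_grad()
def style_transfer_accuracy(generated, target_style_ids, style_classifier):
    """Fraction of generated clips classified as the requested style."""
    logits = style_classifier(generated)            # (B, n_styles)
    return (logits.argmax(dim=-1) == target_style_ids).float().mean().item()

Such automatic scores would complement, not replace, the user studies mentioned above.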
The PhD candidate will develop contributions in the field of neural multimodal behavior generation for virtual agents, with a particular focus on multimodal style generation and control:
· Learning disentangled content and style encodings from multimodal human behavior using adversarial learning, bottleneck learning, and cross-entropy / mutual information formalisms (a minimal sketch of the adversarial approach is given after the steps below).
· Generating expressive multimodal behavior using prompt-tuning, VAE-GAN, and diffusion-based algorithms.
To accomplish these objectives, we propose the following steps:
· Analyzing corpora to identify style and content cues in different modalities.
· Proposing generative models for multimodal style transfer according to different levels of control (human mimicking or prompts/instructions).
· Evaluating the proposed models on dedicated corpora (e.g., PATS) and with real users. Several criteria will be evaluated: content preservation, style transfer, and the coherence of the ECA across modalities. When evaluating with human users, we envision measuring the users' engagement, knowledge memorization, and preferences.
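The sketch announced in the first contribution above illustrates one possible adversarial disentanglement scheme: a gradient reversal layer trains the content encoding to discard style/speaker information while a discriminator still tries to detect it. Module names and dimensions are assumptions, not the project's final design.

# Illustrative adversarial disentanglement via gradient reversal (sketch only).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, reversed (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class StyleDiscriminator(nn.Module):
    """Tries to predict the speaker/style identity from the content encoding."""
    def __init__(self, content_dim=512, n_styles=25):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(content_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_styles))

    def forward(self, content_embedding):           # (B, content_dim)
        # Reversed gradients push the content encoder to remove style cues,
        # while the discriminator itself still learns to detect them.
        return self.net(grad_reverse(content_embedding))

# Training (sketch): total loss = reconstruction loss on generated gestures
# + cross-entropy of the discriminator above on the content embeddings.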
Supervision team
Catherine Pelachaud is CNRS director of research at ISIR, working on embodied conversational agents, affective computing, and human-machine interaction.
[A] M. Fares, C. Pelachaud, N. Obin (2022). Transformer Network for Semantically-Aware and Speech-Driven Upper-Face Generation. In EUSIPCO.
[B] C. Pelachaud, C. Busso, D. Heylen (2021). Multimodal behavior modeling for socially interactive agents. In The Handbook on Socially Interactive Agents: 20 Years of Research on Embodied Conversational Agents, Intelligent Virtual Agents, and Social Robotics, Volume 1: Methods, Behavior, Cognition.
Nicolas Obin is associate professor at Sorbonne Université and research scientist at Ircam, working on human speech generation, vocal deep fakes, and multimodal generation.
[C] L. Benaroya, N. Obin, A. Roebel (2023). Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations. In Entropy 25 (2), 375
[D] F. Bous, L. Benaroya, N. Obin, A. Roebel (2022). Voice Reenactment with F0 and timing constraints and adversarial learning of conversions.
The supervision team regularly publishes in top-tier conferences and journals in machine learning (e.g., AAAI, ICLR, DMKD), natural language processing and information access (e.g., EMNLP, SIGIR), agents (e.g., AAMAS), and speech (e.g., Interspeech).
Required Experiences and Skills
· Master's or engineering degree in Computer Science or Applied Mathematics, with knowledge of deep learning
· Highly proficient in Python (NumPy, SciPy), the TensorFlow/PyTorch environment, and distributed computation (GPU)
· High productivity, capacity for methodical and autonomous work, good communication skills.
Environment
The PhD will be hosted by two laboratories, ISIR and IRCAM, both experts in the fields of machine learning, natural language/speech/human behavior processing, and virtual agents, with the support of the Sorbonne Center for Artificial Intelligence (SCAI). The PhD candidate is expected to publish in the most prominent conferences and journals in the domain (such as ICML, EMNLP, AAAI, IEEE TAC, AAMAS, IVA, ICMI, etc.). SCAI is equipped with a cluster of 30 nodes comprising 100 GPU cards, with a total compute capacity of 1800 TFLOPS (FP32). The candidate can also use the Jean Zay cluster hosted by CNRS-IDRIS.
References
[1] C. Ahuja, D. Won Lee, and L.-P. Morency (2022). Low-Resource Adaptation for Personalized Co-Speech Gesture Generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[2] S. Alexanderson, G. Eje Henter, T. Kucherenko, and J. Beskow (2020). Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows. In Computer Graphics Forum, Vol. 39, 487–496.
[3] P. Bordes, E. Zablocki, L. Soulier, B. Piwowarski, and P. Gallinari (2019). Incorporating Visual Semantics into Sentence Representations within a Grounded Space. In EMNLP-IJCNLP.
[4] D. Cudeiro, T. Bolkart, C. Laidlaw, A. Ranjan, and M.J. Black (2019). Capture, learning, and synthesis of 3D speaking styles. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10101–10111.
[5] S. Ginosar, A. Bar, G. Kohavi, C. Chan, A. Owens, and J. Malik (2019). Learning individual styles of conversational gesture. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[6] S. Subramanian, G. Lample, E.M. Smith, L. Denoyer, M.A. Ranzato, and Y.-L. Boureau (2018). Multiple-Attribute Text Style Transfer. CoRR abs/1811.00552.
[7] T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen (2017). Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG) 36, 1–12.
[8] C. Rebuffel, M. Roberti, L. Soulier, G. Scoutheeten, R. Cancelliere, and P. Gallinari (2022). Controlling hallucinations at word level in data-to-text generation. Data Min. Knowl. Discov. 36(1), 318–354.
[9] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, Y. Shao, W. Zhang, M.-H. Yang, and B. Cui (2022). Diffusion Models: A Comprehensive Survey of Methods and Applications. CoRR abs/2209.00796.