Internship 1 – M2 – Deep Tongue

Dynamic modeling of an avatar’s 3D tongue using deep networks

Laboratory: Inria Nancy Grand Est – LORIA
City: Nancy, France.
Team: Multispeech
Topic: Artificial intelligence / Interaction / Multimodal speech processing
Contact: Slim Ouni (Slim.Ouni@loria.fr)

(For English Version, see below)

Presentation

The tongue plays an important role in speech production. It takes part in the articulation of many sounds, and its position is critical for certain phonemes. Studying articulatory gestures helps us better understand the mechanisms of speech production, with direct implications for language learning and speech therapy.

As part of our work on a 3D talking head (a talking avatar), we have developed a speech-driven avatar animation system that animates the mouth in fine detail. We would like to extend this system with a 3D tongue model, which considerably increases the overall intelligibility of the visual articulation.

The tongue is a complex organ: highly flexible, extensible and compressible, it can be curved and achieves very fine degrees of articulation. The dynamic aspects of tongue articulation (including coarticulation, that is, the interaction between phonemes and their mutual influence) are also important. Several approaches to tongue modeling exist. They are either purely geometric or based on MRI images and electromagnetic articulography (EMA) data. These techniques make it possible to observe the deformation of the tongue and to measure how it evolves over time, and they are used in several speech production studies to acquire corpora of 3D tongue data.

Internship objectives

The objective of this work is to coordinate the movements of the tongue with the speech signal, that is, to control the movement of a 3D tongue from speech. The 3D tongue model must strike a compromise between a very flexible structure capable of complex gestures and a simple representation controlled by a small number of parameters. The starting point is a generic 3D tongue model that will be controlled by 2D or 3D data acquired with an articulograph or by MRI. A corpus of articulatory data is available and will be used in this study to train a neural network system that uses deep learning techniques to estimate tongue movements from speech (Biasutto–Lervat and Ouni, 2018). The tongue control system will be evaluated and integrated into an animated talking head.

Feel free to contact the internship supervisor for any further information.

Expected skills

Good knowledge of computer science and machine learning is required. Prior experience with a neural network library (such as PyTorch or TensorFlow) is appreciated.

Excellence scholarship (Bourse d’excellence)

The laboratory offers a limited number of excellence scholarships for outstanding candidates (with a strong academic record) who are French students (from outside the Grand-Est region) or foreign students and who wish to pursue a PhD afterwards. The scholarship covers mobility expenses up to 1000€ and provides an allowance of 1000€ per month. To apply for this funding, respond to this internship offer by sending me your CV before 26/11/2020.

Bibliography

T. Biasutto–Lervat and S. Ouni (2018). "Phoneme-to-Articulatory mapping using bidirectional gated RNN." Interspeech 2018.
Y. Jun, C. Jiang, R. Li, C.W. Luo, and Z.F. Wang (2016). "Real-Time 3-D Facial Animation: From Appearance to Internal Articulators." IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 4, pp. 920-932.
R. Li and J. Yu (2017). "An audio-visual 3D virtual articulation system for visual speech synthesis." 2017 IEEE International Symposium on Haptic, Audio and Visual Environments and Games (HAVE), pp. 1-6.
J. Bian, S. Li, Y. Wang, J. Chen, and H. Xiao (2017). "A survey of tongue modeling methods in speech visualization." 2017 2nd International Conference on Robotics and Automation Engineering (ICRAE), pp. 431-435.
O. Engwall (2003). "Combining MRI, EMA & EPG measurements in a three-dimensional tongue model." Speech Communication, vol. 41, no. 2-3, pp. 303-329.
W. Fernandez, P. Mithraratne, S. F. Thrupp, M. H. Tawhai, and P. J. Hunter (2004). "Anatomically based geometric modelling of the musculo-skeletal system and other organs." Biomechanics and Modeling in Mechanobiology, vol. 2, no. 3, pp. 139-155.
S.A. King and R.E. Parent (2001). "A 3D Parametric Tongue Model for Animated Speech." JVCA, vol. 12, no. 3, pp. 107-115.
X.B. Lu, C.W. Thorpe, K. Foster, and P. Hunter (2009). "From experiments to articulatory motion: a three-dimensional talking head model." Interspeech 2009, Brighton.
Articulograph AG501, Carstens, http://www.articulograph.de

Dynamic modeling of an avatar’s 3D tongue by deep networks

Presentation

The tongue plays an important role in speech production. It participates in the articulation of several sounds, and its position is critical for certain phonemes. The study of articulatory gestures makes it possible to better understand the mechanisms of speech production, with direct implications for language learning and speech therapy.

As part of our work on a 3D talking head (a talking avatar), we have developed a system that animates an avatar from speech, with fine animation of the mouth. We would like to augment this system with a 3D tongue model, which considerably increases the overall intelligibility of the visual articulation.

The tongue is a complex, highly flexible, extensible and compressible organ that can be curved and allows very fine degrees of articulation to be achieved. The dynamic aspects of tongue articulation (including coarticulation) are also important. Several approaches to tongue modeling exist. They are either purely geometric or based on MRI images and electromagnetic articulography (EMA) data. These techniques make it possible to observe the deformation of the tongue and to measure its evolution over time, and they are used in several speech production studies to acquire 3D data sets of the tongue.

Objective of the Internship

The objective of this work is to coordinate the movements of the tongue with the speech signal, that is, to control the movement of a 3D tongue from speech. The 3D model of the tongue must strike a compromise between a very flexible structure that allows complex gestures and a simple 3D representation controlled by a small number of parameters. The starting point is a generic 3D tongue model that will be controlled by 3D data acquired with an articulograph or by MRI. A corpus of articulatory data is available and will be used in this study to train a neural network system using deep learning techniques to estimate tongue movements from speech. The tongue control system will be evaluated and integrated into an animated talking head.
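
To make the idea of "a simple representation controlled by a small number of parameters" concrete, here is a minimal sketch of one possible formulation, a linear deformation model in which each control parameter scales a fixed displacement field added to a neutral shape. This is only an illustration under stated assumptions: the function `deform`, the toy three-point contour `NEUTRAL`, and the two displacement fields in `BASES` are hypothetical and are not taken from the project or from the cited papers.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def deform(
    neutral: Sequence[Vec3],
    bases: Sequence[Sequence[Vec3]],
    params: Sequence[float],
) -> List[Vec3]:
    """Return neutral + sum_k params[k] * bases[k], vertex by vertex.

    Hypothetical linear model: each basis is a per-vertex displacement
    field, and each parameter says how strongly to apply it.
    """
    if len(bases) != len(params):
        raise ValueError("need exactly one parameter per displacement basis")
    deformed = []
    for i, (x, y, z) in enumerate(neutral):
        for basis, weight in zip(bases, params):
            dx, dy, dz = basis[i]
            x += weight * dx
            y += weight * dy
            z += weight * dz
        deformed.append((x, y, z))
    return deformed


# Toy data (assumed for illustration): a three-point contour from back to tip.
NEUTRAL = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0), (2.0, 0.0, 0.0)]
BASES = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 0.0)],  # raises the middle point
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # moves the tip forward
]

if __name__ == "__main__":
    # Two parameters are enough to drive all three vertices of the toy shape.
    print(deform(NEUTRAL, BASES, [0.5, 0.25]))
```

In a setup like this, a network that estimates tongue movements from speech would only need to output one small parameter vector per frame, while the number of mesh vertices can grow without changing the size of the control space.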

Feel free to contact the internship supervisor for any further information.

Skills and profile

Candidates should have a strong background in computer science and machine learning. Prior experience with a neural network library (such as PyTorch or TensorFlow) is appreciated.

Context

The work will be carried out within the Multispeech research team at the Inria Nancy Grand Est research center (LORIA). You will join a dynamic team of experienced and young researchers (PhD students, postdocs and engineers) and will be closely supervised by a senior researcher. The laboratory has motion capture facilities and an articulograph that can be used to acquire data for this project. Several speech processing tools are available in the team.

This internship is a great opportunity to discover research in the field of spoken communication and 3D avatar animation using machine learning techniques.

Excellence Internship (Bourse d’excellence)

It is possible to apply for highly competitive internship funding, intended for outstanding candidates (with a strong academic background) who are French students (from outside the Grand-Est region) or foreign students and who are interested in pursuing a doctoral thesis in the lab. This funding covers mobility expenses (up to 1000€) and 1000€ per month. To apply for this funding, contact me and send me your CV before 26/11/2020.
