Audiovisual Speech – Slim OUNI

Audiovisual Speech Synthesis

Our work on audiovisual speech focuses on the audiovisual synthesis of highly intelligible speech. We are investigating methods for synthesizing the acoustic and visual components simultaneously. To that end, we treat audiovisual speech as a bimodal signal with two channels: acoustic and visual. One of our goals is to develop highly realistic facial animation, mainly of the lips.
When dealing with audiovisual synthesis, we consider it important to study audiovisual intelligibility, that is, the ability of the synthesis to convey an intelligible message to the human receiver. Indeed, the intelligibility of audiovisual synthesis can be critical in applications aimed at hard-of-hearing people or learners of new languages. Our research on audiovisual speech intelligibility covers experimental evaluation and the development of metrics to measure intelligibility.

Example of automatic lipsync using Dynalips technology

Multimodal data acquisition

Our work in audiovisual speech relies on acquiring and processing data. Both the acquisition and the processing can be time-consuming and costly in terms of effort. This effort is unavoidable if we are to make further progress in modeling the processes involved in human communication.
We are continuously working to improve acquisition techniques and investigating methods to make the process easier. In the past, we used stereovision to acquire 3D facial data. More recently, we have been using more advanced motion capture techniques based on VICON, and we are also testing cheaper hardware based on the Kinect.