Articulatory copy synthesis



The objective is to synthesize speech from X-ray films of the vocal tract (from the database of the Institut de Phonétique de Strasbourg). The articulators, the tongue in particular, have been delineated by hand or semi-automatically with the Xarticulator software. The following film shows the X-ray images together with the contours.


Then, the vocal tract of each image (one every 25 ms) is transformed into an area function, which is fed into the acoustic simulation.

Figure: center line of the vocal tract and its decomposition into elementary tubes.
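
The page does not spell out how the areas are derived from the contours. A common choice in the articulatory-synthesis literature is the alpha-beta power model, A = alpha * d^beta, applied to the sagittal width measured perpendicular to the center line; the sketch below assumes that model with constant, purely illustrative coefficients (real coefficients vary along the tract), and all names in it are hypothetical rather than taken from the project's code.

    import numpy as np

    def area_function(sagittal_widths_cm, tract_length_cm, alpha=1.5, beta=1.4):
        """Convert midsagittal widths sampled along the center line into an
        area function: one cross-sectional area per elementary tube.

        Assumes the classic alpha-beta model A = alpha * d**beta, where d is
        the distance between the tongue contour and the opposite wall,
        measured perpendicular to the center line.
        """
        d = np.asarray(sagittal_widths_cm, dtype=float)
        areas_cm2 = alpha * d ** beta              # one area per elementary tube
        tube_length_cm = tract_length_cm / len(d)  # uniform tube lengths
        return areas_cm2, tube_length_cm

    # Hypothetical frame: 40 tubes along a 17 cm tract.
    widths = np.linspace(0.3, 1.2, 40)
    areas, dx = area_function(widths, tract_length_cm=17.0)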


The acoustic signal is synthesized following the strategy proposed by Maeda (S. Maeda, "Phonemes as concatenable units: VCV synthesis using a vocal-tract synthesizer", in Sound Patterns of Connected Speech: Description, Models and Explanation, Proceedings of the symposium held at Kiel University, 1996).

The key point is to correctly synchronize the excitation source with the images that give the geometry of the vocal tract. Indeed, the images are sparse (only one every 25 ms). The following figure shows the schematic timing scenario used to control the acoustic simulation.

Figure: timing scenario controlling the acoustic simulation.

F0 and the time points have been determined from the original speech signal. Since an image of the vocal tract is not available at each time point (t0, t1, ..., t6), some images have to be duplicated.
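
Here is a minimal sketch of that duplication step, under an assumed data layout (one area function per image, one image every 25 ms): each control time point is assigned the nearest available image, so a frame is duplicated whenever several time points fall within the same 25 ms interval. Names are hypothetical; linearly interpolating area functions between the two surrounding frames would be a natural refinement.

    import numpy as np

    def frames_for_time_points(frame_areas, frame_period_s, time_points_s):
        """Assign an area function to each control time point.

        frame_areas    : (n_frames, n_tubes) array, one area function per image
        frame_period_s : interval between images (0.025 s here)
        time_points_s  : t0, t1, ... measured on the original speech signal
        """
        frame_areas = np.asarray(frame_areas, dtype=float)
        t = np.asarray(time_points_s, dtype=float)
        idx = np.rint(t / frame_period_s).astype(int)  # nearest frame index
        idx = np.clip(idx, 0, len(frame_areas) - 1)    # duplicates when needed
        return frame_areas[idx]

    # Hypothetical use: 12 images at 25 ms intervals, 7 control points t0..t6.
    frames = np.random.rand(12, 40) + 1.0
    t_points = [0.00, 0.03, 0.05, 0.09, 0.14, 0.20, 0.27]
    geometry = frames_for_time_points(frames, 0.025, t_points)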

Here is the synthetic speech signal for the video at the top of the page:

Here is a second example with VCVs (video and synthesized speech signal).




This project was funded by LORIA in 2012 and carried out by Yves Laprie and Matthieu Loosvelt with the help of Shinji Maeda.