The act of speech production has both auditory and visual consequences, and it is well known that humans can use both auditory and visual cues when perceiving speech. Current speech recognition systems use powerful statistical models of the audio components of spoken language, but they can fail unpredictably in non-ideal acoustic conditions. It is in exactly such conditions that humans also draw on visual cues to understand what is being said. The main visual information used is the movement of the lips, although the lips are not the only part of the face that carries speech information.
The close integration of audio and video evidence of speech articulation in automatic speech recognition systems has been shown to improve the recognition rate, particularly under non-ideal conditions. Research groups around the world have proposed a number of ways of integrating the two modalities (see, for example, the proceedings of the AVSP conferences). However, it remains unclear which way is best.
This PhD project aims at enhancing our knowledge of how to do the integration. To this end, we examine the statistical relationship between audio and video speech parameters for the sounds (phonemes) of Australian English.
The lip-tracking algorithm extracts the lip corners as well as the mid-points of the upper and lower lip. Unlike many other groups, we extract these points on the inner lip contour, so that personal lip characteristics ("thick lips, thin lips, ...") do not interfere with the results. The algorithm uses a combination of colour information in the image data and knowledge about the structure of the mouth area, and it builds on top of a stereo-vision face tracking system developed here at the Robotic Systems Laboratory. We exploit the available stereo data to compute the 3D positions of the extracted feature points, i.e. our measurements are real measurements in millimetres rather than measurements in image pixels. Thus, differing distances between the cameras and the speakers are not a factor in the analysis. To our knowledge, we are the first group to apply stereo vision to lip-tracking for AVSP.
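For illustration, the sketch below shows how metric 3D positions can be recovered from a feature point matched in a calibrated, rectified stereo pair. The calibration values and function name are placeholders chosen for this sketch, not the actual calibration or code of our system.

```python
import numpy as np

# Placeholder calibration for a rectified stereo pair -- illustrative values
# only, not the actual calibration of our rig.
FOCAL_PX = 700.0        # focal length in pixels
BASELINE_MM = 100.0     # horizontal distance between the two cameras
CX, CY = 320.0, 240.0   # principal point of the rectified images

def triangulate(left_pt, right_pt):
    """3D position (x, y, z) in mm of a lip feature point located in both
    rectified images (e.g. a lip corner or a lip mid-point)."""
    xl, yl = left_pt
    xr, _ = right_pt
    disparity = xl - xr                      # horizontal shift between the views
    z = FOCAL_PX * BASELINE_MM / disparity   # depth from disparity
    x = (xl - CX) * z / FOCAL_PX
    y = (yl - CY) * z / FOCAL_PX
    return np.array([x, y, z])

# Example: a lip corner seen at x=300 in the left image and x=184 in the
# right image lies roughly 0.6 m from the cameras.
corner_3d = triangulate((300.0, 262.0), (184.0, 262.0))
```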
Because the cameras are calibrated, we can calculate parameters such as mouth width, mouth height, and the protrusion of the upper and lower lip from the 3D coordinates of the lip feature points. In addition, our lip-tracking algorithm checks the visibility of the upper and lower teeth in the images. Measurements are made at the NTSC frame rate of 30 Hz.
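As a rough illustration, these parameters could be derived from the four tracked 3D points along the following lines. The protrusion reference used here (the average depth of the lip corners) and the coordinate convention are assumptions made for this sketch, not necessarily the definitions used in the project.

```python
import numpy as np

def lip_parameters(left_corner, right_corner, upper_mid, lower_mid):
    """Visual speech parameters (in mm) from the four 3D lip feature points.
    Coordinate convention assumed here: x horizontal, y vertical, z depth,
    with z increasing away from the cameras."""
    width = np.linalg.norm(right_corner - left_corner)
    height = np.linalg.norm(upper_mid - lower_mid)
    # Protrusion measured relative to the average depth of the lip corners
    # (an assumption for this sketch); a lip that sits closer to the cameras
    # than the corners gives a positive value.
    corner_depth = 0.5 * (left_corner[2] + right_corner[2])
    return {
        "mouth_width": width,
        "mouth_height": height,
        "upper_protrusion": corner_depth - upper_mid[2],
        "lower_protrusion": corner_depth - lower_mid[2],
    }

# Applied to every frame at 30 Hz, this yields one parameter vector per 1/30 s.
```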
Download video 1
Download video 2
These video clips show our stereo lip-tracking algorithm in action (in slow motion, so you can follow it better). Windows Media Player or RealPlayer is required to play the files, which are in AVI format.
Our choice of these visual speech parameters was inspired by a lip model developed at the Institut de la Communication Parlee (ICP) in Grenoble, France, which we were able to use thanks to the contacts of Dr. Jordi Robert-Ribes. The 3D lip model is driven by only 5 physical parameters: the mouth width, the mouth height, and the horizontal distances of the contact point of the upper and lower lip, of the tip of the upper lip, and of the tip of the lower lip, respectively, to an imaginary vertical plane "behind" the lips. See the reference paper 3D Models of the Lips for Realistic Speech Animation (Proceedings of Computer Graphic 96, Geneve, 1996). In other words, the values of the parameters we measure on the speaker's face could be used to drive an animation with this artificial lip model.
[Figure: the ICP 3D lip model, shown shaded (left) and as a wireframe (right).]
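To make the mapping concrete, a minimal sketch of the five control parameters is given below. The field names and the mapping from our measured values are illustrative only and do not reflect the actual interface of the ICP model.

```python
from dataclasses import dataclass

@dataclass
class LipModelParameters:
    """The five physical parameters driving the ICP 3D lip model, as
    described above (names are ours, not the ICP interface)."""
    mouth_width: float            # mm
    mouth_height: float           # mm
    contact_protrusion: float     # contact point of upper and lower lip
    upper_lip_protrusion: float   # tip of the upper lip
    lower_lip_protrusion: float   # tip of the lower lip
    # The three protrusions are horizontal distances to an imaginary
    # vertical plane "behind" the lips.

def frame_to_model(measured: dict) -> LipModelParameters:
    """Map one frame of measured values (assumed dictionary keys) onto the
    model parameters, so the measurements could drive an animation."""
    return LipModelParameters(
        mouth_width=measured["mouth_width"],
        mouth_height=measured["mouth_height"],
        contact_protrusion=measured["contact_protrusion"],
        upper_lip_protrusion=measured["upper_protrusion"],
        lower_lip_protrusion=measured["lower_protrusion"],
    )
```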
Recording setup
AVOZES contains a variety of utterances from each speaker, with a total recording length of about 5 minutes per speaker. The core part of the corpus consists of 40 sequences per speaker, which cover (almost) all phonemes and visemes of Australian English in CVC or VCV contexts. Each sequence consists of a carrier phrase and a nonsense word with the phoneme embedded, e.g. "You grab BAB beer." We chose this carrier phrase because the bi-labial closures before and after the word of interest provide a simple means for extracting the part of the sequence that is relevant for our AV analysis. Coarticulation is, of course, an important aspect, and this project explores just this form of it. Other sequences provide data for building a 3D model of the speaker's face, as well as examples of continuous speech, such as counting ("You grab ONE beer. You grab TWO beer. ...") and carefully designed sentences containing all phonemes.
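To illustrate why the bilabial closures are convenient, the following sketch segments the word of interest from the inner-lip mouth-height track: the height drops to (near) zero during each /b/ closure. The closure threshold and the simple first-to-last-closure rule are assumptions made for this sketch.

```python
import numpy as np

FRAME_RATE_HZ = 30.0        # NTSC frame rate of the measurements
CLOSURE_THRESHOLD_MM = 1.0  # mouth heights below this count as closed (assumed)

def closure_frames(mouth_height_mm):
    """Indices of frames where the inner lips are (almost) closed."""
    return np.flatnonzero(np.asarray(mouth_height_mm) < CLOSURE_THRESHOLD_MM)

def target_segment(mouth_height_mm):
    """Frame range spanning the nonsense word, taken here simply as the span
    from the first to the last bilabial closure in the utterance."""
    closed = closure_frames(mouth_height_mm)
    if closed.size < 2:
        raise ValueError("expected at least two bilabial closures")
    return int(closed[0]), int(closed[-1])

# Example: frame indices converted to seconds at 30 Hz.
start_f, end_f = target_segment([12.0, 9.5, 0.6, 0.4, 8.0, 11.0, 0.7, 10.5])
start_s, end_s = start_f / FRAME_RATE_HZ, end_f / FRAME_RATE_HZ
```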
We are currently in the process of analysing these sequences.
Download video (5.5MB)
This is a demo clip of the short-vowel sequence by one of the speakers. Note that lossy compression was used to reduce the file size for this webpage; for the analysis itself we use lossless storage on DV tape.
We intend to make the AVOZES data corpus available to the research community.