Roland Goecke's PhD Project

"A Stereo Vision Lip Tracking Algorithm and Subsequent Statistical Analyses of the Audio-Video Correlation in Australian English"

Supervisory Panel

Dr. Bruce Millar (supervisor), Computer Sciences Lab (RSISE, ANU)
Prof. Alex Zelinsky (co-supervisor), Robotic Systems Lab (RSISE, ANU)
Dr. Jordi Robert-Ribes (advisor), Optus, Sydney


In August 1998, I started my PhD on Audio-Visual Speech Processing. This involves automatic speech-reading, commonly known as lip-reading. Our basic interest in this project is to advance knowledge of how to enhance the reliability of speech recognition systems, particularly in noise-degraded environments, by using the additional visual information on speech articulation available in video data. There are quite a number of application areas for such improved speech recognition technology. For example, one can think of voice-activated controls (radio, A/C, ...) in a car (lots of non-stationary noise), or of automatic transcription systems, e.g. during a video conference.

The act of speech production has both auditory and visual consequences. It is well known that humans can use both auditory and visual cues when perceiving speech. Current speech recognition systems use powerful statistical models of the audio components of spoken language but can fail unpredictably in non-ideal acoustic conditions. It is in such conditions that most humans will attempt to use visual cues as well to achieve understanding of what is being said. The major source of visual information is the movement of the lips, although the lips are not the only part of a face carrying speech information.

The close integration of audio and video evidence of speech articulation in automatic speech recognition systems has been shown to improve the recognition rate, particularly under non-ideal conditions. Various research groups around the world have come up with a number of ways of integrating the two modalities (see, for example, the proceedings of the AVSP conferences). However, it still remains uncertain what the best way is.

This PhD project aims at enhancing our knowledge of how to do the integration. In order to do so, we look at the statistical relationship of audio and video speech parameters for the sounds (phonemes) of Australian English.


In order to look at the integration of audio and video, one needs to have some parameters describing the visible speech effects. This can be done either implicitly, by taking the lip/mouth region and applying a dimensionality reduction technique (PCA, LDA, ...), or explicitly, by extracting geometric features such as the width and height of the mouth opening, for example. Our lip-tracking algorithm follows the second approach.
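To illustrate the implicit alternative mentioned above, the following is a minimal sketch of PCA applied to flattened mouth-region image patches. It uses only NumPy; the patch size, component count, and function name are illustrative assumptions, not part of our system.

```python
import numpy as np

def pca_lip_features(patches, n_components=10):
    """Reduce flattened mouth-region patches to a few implicit visual
    speech parameters via PCA (illustrative sketch only).

    patches: array of shape (n_frames, height*width), one row per frame.
    Returns the per-frame coefficients of the top principal components.
    """
    X = patches - patches.mean(axis=0)           # centre the data
    # SVD of the centred data matrix yields the principal directions
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T               # project onto top components

# Example: 100 frames of hypothetical 32x32 mouth patches
frames = np.random.rand(100, 32 * 32)
coeffs = pca_lip_features(frames, n_components=10)
print(coeffs.shape)  # (100, 10)
```

Each frame is thus summarised by a handful of coefficients instead of thousands of pixel values, at the cost of the parameters having no direct geometric interpretation.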

The lip-tracking algorithm extracts the lip corners as well as the mid-points of the upper and lower lip. Unlike many other groups, we extract these points on the inner lip contour, so that personal lip characteristics ("thick lips, thin lips, ...") don't interfere with the results. The algorithm uses a combination of colour information in the image data and knowledge about the structure of the mouth area. It builds on top of a stereo-vision face tracking system developed here at the Robotic Systems Laboratory. We exploit the available stereo data to compute 3D distances between the extracted feature points, i.e. our measurements are real measurements in millimetres rather than measurements in image pixels. Thus, different distances from the cameras to the speakers are not a factor in the analysis. To our knowledge, we are the first group to apply stereo vision to lip-tracking for AVSP.
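The principle behind recovering metric 3D coordinates from a calibrated stereo pair can be sketched as follows. This is a simplified illustration assuming an ideal parallel (rectified) rig with a hypothetical focal length and baseline; the real system's calibration and geometry differ.

```python
import numpy as np

def triangulate(xl, yl, xr, focal_px, baseline_mm):
    """Recover the 3D position (in mm) of a feature point from its pixel
    coordinates in a rectified stereo pair. Sketch only: assumes a
    parallel stereo rig with known focal length (pixels) and baseline (mm).
    """
    disparity = xl - xr                       # horizontal pixel offset
    Z = focal_px * baseline_mm / disparity    # depth from disparity
    X = xl * Z / focal_px                     # back-project image coords
    Y = yl * Z / focal_px
    return np.array([X, Y, Z])

# A lip corner seen at (120, 40) in the left image and (100, 40) in the
# right image, with a 700-pixel focal length and 100 mm baseline:
corner = triangulate(120, 40, 100, focal_px=700, baseline_mm=100)
print(corner)  # [ 600.  200. 3500.] -> the point lies 3.5 m from the rig
```

Because depth drops out of the disparity, distances between two triangulated points come out directly in millimetres, independent of how far the speaker sits from the cameras.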

Because the cameras are calibrated, we can calculate parameters such as mouth width, mouth height, and the protrusion of the upper and lower lip from the 3D coordinates of the lip feature points. In addition, our lip-tracking algorithm checks the visibility of the upper and lower teeth in the images. Measurements are made at the NTSC frame rate of 30Hz.
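Given the four tracked lip points in 3D, the geometric parameters reduce to simple vector arithmetic. The sketch below is illustrative: in particular, measuring protrusion as each mid-lip point's depth offset from the lip-corner plane along the camera axis is our assumption here, not necessarily the project's exact definition.

```python
import numpy as np

def mouth_parameters(left, right, upper, lower):
    """Compute geometric visual speech parameters (in mm) from the 3D
    coordinates (x, y, z) of the four tracked inner-lip points.
    Illustrative sketch; protrusion here is the depth offset of each
    mid-lip point relative to the average lip-corner depth.
    """
    left, right = np.asarray(left), np.asarray(right)
    upper, lower = np.asarray(upper), np.asarray(lower)
    width  = np.linalg.norm(right - left)      # corner-to-corner distance
    height = np.linalg.norm(upper - lower)     # inner-lip opening
    corner_depth = (left[2] + right[2]) / 2.0
    upper_protrusion = corner_depth - upper[2]  # towards camera = positive
    lower_protrusion = corner_depth - lower[2]
    return width, height, upper_protrusion, lower_protrusion

# Hypothetical measurements for one frame (millimetres):
w, h, up, lp = mouth_parameters((-25, 0, 300), (25, 0, 300),
                                (0, 8, 295), (0, -4, 295))
print(w, h, up, lp)  # 50.0 12.0 5.0 5.0
```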

Download video 1
Download video 2
These video clips are examples of our stereo lip-tracking algorithm in action (shown in slow motion so you can follow it better). (Windows Media Player or RealPlayer is required to play the files, which are in AVI format.)

Our choice of visual speech parameters was inspired by a lip model developed at the Institut de la Communication Parlee (ICP) in Grenoble, France, which we were able to use thanks to the contacts of Dr. Jordi Robert-Ribes. The 3D lip model is driven by only 5 physical parameters: the mouth width and height, and the horizontal distances to an imaginary vertical plane "behind" the lips of the contact point of the upper and lower lip, the lip tip of the upper lip, and the lip tip of the lower lip, respectively. See the reference paper 3D Models of the Lips for Realistic Speech Animation (Proceedings of Computer Animation 96, Geneva, 1996). In other words, the values of the parameters we measure on the speaker's face could be used to drive an animation with this artificial lip model.
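The mapping from measured values to the model's control parameters can be thought of as filling in five numbers per frame. The class and field names below are ours, chosen for illustration; they are not the ICP model's actual interface.

```python
from dataclasses import dataclass

@dataclass
class LipModelParameters:
    """The five physical control parameters of the ICP 3D lip model as
    described in the text. Names are illustrative; all values in mm."""
    mouth_width: float
    mouth_height: float
    contact_protrusion: float     # contact point of upper and lower lip
    upper_lip_protrusion: float   # lip tip of the upper lip
    lower_lip_protrusion: float   # lip tip of the lower lip

# One frame's measurements could drive one frame of the animation:
frame_params = LipModelParameters(50.0, 12.0, 8.0, 10.0, 9.0)
print(frame_params.mouth_width)  # 50.0
```

A sequence of such parameter sets, one per video frame, would replay the measured articulation on the synthetic lips.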

Shaded lip model Wireframe lip model

AVOZES - Audio-Visual Australian English Speech Data Corpus

We designed and recorded an audio-visual data corpus for Australian English. This is the first such data corpus using stereo cameras, thus allowing actual 3D measurements of distances on the speaker's face (e.g. the distance from one lip corner to the other = mouth width). The corpus comprises 20 native speakers of Australian English, 10 female and 10 male. Recordings were made in a controlled environment (an almost sound-proof audio laboratory) with very little external noise but some internal noise due to the recording equipment. Prompts were presented on a screen above the cameras. Speakers sat facing the cameras, with their face illuminated by a light source below the cameras.

Recording setup

AVOZES contains a variety of utterances from each speaker, with a total recording length of about 5 minutes per speaker. The core part of the corpus consists of 40 sequences per speaker containing (almost) all phonemes and visemes of Australian English in CVC- or VCV-contexts. All sequences consist of a carrier phrase with the nonsense word embedded, like: You grab BAB beer. We used this carrier phrase because the bi-labial closures before and after the word of interest provide a simple means of extracting the part of the sequence that is relevant to our AV analysis. Coarticulation is obviously an important aspect, and this project explores just this form of it. Other sequences provide data to build a 3D model of the speaker's face, as well as examples of continuous speech such as counting (You grab ONE beer. You grab TWO beer. ...) and carefully designed sentences containing all phonemes.
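The bilabial-closure trick described above can be sketched with the tracked mouth height alone: during a /b/ closure the inner-lip opening drops to (near) zero, so the embedded word lies between the end of the first closure run and the start of the last one. The threshold value and function name are illustrative assumptions; the real extraction criterion may differ.

```python
def segment_of_interest(mouth_heights, closure_mm=2.0):
    """Locate the embedded word between the two bilabial closures of the
    carrier phrase ("You graB ... Beer"), given per-frame inner-lip
    mouth heights in mm. Sketch only: a frame counts as a closure when
    the height drops below closure_mm. Returns (start, end) frame
    indices of the open-mouth span between the closures.
    """
    closed = [h < closure_mm for h in mouth_heights]
    # end of the first closure run = start of the embedded word
    start = next(i for i in range(1, len(closed))
                 if closed[i - 1] and not closed[i])
    # start of the last closure run = end of the embedded word
    end = next(i for i in range(len(closed) - 1, 0, -1)
               if closed[i] and not closed[i - 1])
    return start, end

# Hypothetical height track: open, closed ("grab"), open (word), closed ("beer"), open
heights = [8, 8, 1, 1, 9, 10, 9, 1, 1, 8]
print(segment_of_interest(heights))  # (4, 7)
```

In a real sequence the heights would come from the 30 Hz lip-tracking output, and the aligned audio between the same two closures would be cut out for the acoustic analysis.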

We are currently in the process of analysing these sequences.

Download video (5.5MB)
This is a demo clip of the short-vowel sequence by one of the speakers. Note that lossy compression was used to reduce the file size for this webpage; otherwise, i.e. for analysis, we use lossless storage on DV tape.

We intend to make the AVOZES data corpus available to the research community.

[Back to Homepage] [Back to Research]

(c) Roland Göcke
Last modified: Tue Jan 11 17:58:33 AUS Eastern Daylight Time 2005