PhD Thesis, Australian National University, Canberra, Australia. Published May 2004.
This study aims to extend our knowledge of the relationships between audio and video speech parameters. It investigates ways of describing such relationships using statistical analyses, applied to the example of Australian English (AuE). The work described in this thesis is multi-disciplinary: besides the statistical analyses, it required algorithms to extract the speech parameters as well as a corpus of audio-video (AV) speech sequences, neither of which was readily available.
A novel, non-intrusive, automatic lip tracking algorithm is presented, which uses a stereo camera system to enable accurate 3D measurements of facial feature points without the need for artificial markers on the face. Because no AV speech corpus for AuE existed, a new modular framework for AV speech corpora was developed, and a corpus for AuE was created following it.
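The thesis's tracking code is not reproduced on this page. As a rough illustration of the underlying stereo geometry only, a standard linear (DLT) triangulation of one matched feature point from a calibrated camera pair might look like the sketch below; the projection matrices P1 and P2 and the pixel coordinates are assumed inputs, not values from the thesis.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from a calibrated
    stereo pair.  P1, P2: 3x4 camera projection matrices.  x1, x2:
    (u, v) pixel coordinates of the same facial feature point
    (e.g. a lip corner) in the left and right images."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous 3D point is the null vector of A, i.e. the
    # right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenise to (x, y, z)
```

Geometric video parameters such as mouth width then follow as Euclidean distances between triangulated points, for example between the two lip corners.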
Thus equipped, it was possible to test the hypothesis that combinations of audio and video speech parameters, rather than single parameters, are related, and that these combinations are phoneme-specific. Articulatory theory makes it clear that the audio and video domains are related in some way and to some extent, because the visible speech articulators form part of the whole set of articulators. However, it also follows that not all of the information contained in the audio modality has equivalent information in the video modality. The set of audio speech parameters comprised the voice source excitation frequency F0, the formant frequencies F1, F2, and F3, and the RMS energy. Mouth width, mouth height, protrusion of the upper and lower lip, and the novel teeth visibility measure "relative teeth count" formed the video speech parameter set.
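For concreteness, two of the listed parameters, the acoustic RMS energy and the mouth opening geometry, might be computed per frame along the following lines. The frame length and the point labels are illustrative assumptions, not the thesis's actual definitions, and the novel relative teeth count measure is deliberately not sketched here.

```python
import numpy as np

def rms_energy(frame):
    """Root-mean-square energy of one audio frame
    (e.g. ~20 ms of samples; the frame length is an assumption)."""
    frame = np.asarray(frame, dtype=float)
    return np.sqrt(np.mean(frame ** 2))

def mouth_geometry(pts3d):
    """Mouth width and height as Euclidean distances between
    triangulated 3D lip feature points.  The dictionary keys are
    hypothetical placeholder labels."""
    width = np.linalg.norm(pts3d["left_corner"] - pts3d["right_corner"])
    height = np.linalg.norm(pts3d["upper_lip_mid"] - pts3d["lower_lip_mid"])
    return width, height
```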
Extensive univariate and multivariate statistical analyses, such as pairwise linear correlation analysis, principal component analysis, statistical shape analysis, canonical correlation analysis, and co-inertia analysis, were performed to explore the AV relationships in AuE. The relationships found support the hypothesis that linear combinations of parameters, rather than single parameters, correlate well (r = 0.5-0.8) across the two modalities, and that their composition is phoneme-specific. The results show that, with the given parameter sets, between one fifth and one third of the variance in either modality can be recovered from the other modality. For visible speech information based purely on the lips, this agrees with studies on human speech perception in the current literature. Further investigations are required to test the stability of the relationships found and their suitability for a rule-based AV automatic speech recognition (ASR) system.
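As an illustration of the multivariate analysis style only (not the thesis's actual code or data), canonical correlation analysis between per-frame audio and video parameter matrices can be run with scikit-learn; the correlation of the first pair of canonical variates plays the role of the reported r values. The data below are random toy stand-ins.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Toy stand-ins: n frames x 5 audio parameters (F0, F1-F3, RMS energy)
# and n frames x 5 video parameters (width, height, two protrusions,
# relative teeth count).  Real analyses would use measured data.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))
V = 0.6 * A @ rng.normal(size=(5, 5)) + rng.normal(size=(200, 5))

cca = CCA(n_components=2)
Aa, Vv = cca.fit_transform(A, V)

# Correlation of the first pair of canonical variates:
r1 = np.corrcoef(Aa[:, 0], Vv[:, 0])[0, 1]
print(f"first canonical correlation: {r1:.2f}")
```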
Download PhD Thesis (without data appendices on CD-ROM) (2.6MB, PDF)
Download Thesis Appendix C (on CD-ROM) (2.0MB, PDF)
Download Thesis Appendix D (on CD-ROM) (97kB, PDF)
Download Thesis Appendix E (on CD-ROM) (2.7MB, PDF)
Download Thesis Appendix F (on CD-ROM) (231kB, PDF)
Download Thesis Appendix G (on CD-ROM) (99kB, PDF)
Download Thesis Appendix H (on CD-ROM) (104kB, PDF)
Download Thesis Appendix I (on CD-ROM) (3.4MB, PDF)
Download Thesis Appendix J (on CD-ROM) (141kB, PDF)