PhD Thesis, Australian National University, Canberra, Australia. Published May 2004.
This study aims to extend our knowledge of the relationships between audio and video speech parameters. It investigates ways of describing such relationships using statistical analyses, applied to the example of Australian English (AuE). The work described in this thesis is multi-disciplinary: besides the statistical analyses, it required algorithms to extract the speech parameters as well as a corpus of audio-video (AV) speech sequences, neither of which was readily available.
A novel, non-intrusive, automatic lip tracking algorithm is presented, which uses a stereo camera system to enable accurate 3D measurements of facial feature points without the need for artificial markers on the face. Because no AV speech corpus for AuE existed, a new modular framework for AV speech corpora was developed, and a corpus for AuE was created following it.
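The thesis's tracking code is not reproduced on this page. As a rough illustration of the underlying stereo geometry only, a standard linear (DLT) triangulation of one matched feature point from a calibrated camera pair might look like the sketch below; the projection matrices P1 and P2 and the pixel coordinates are assumed inputs, not values from the thesis.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from a calibrated
    stereo pair.  P1, P2: 3x4 camera projection matrices.  x1, x2:
    (u, v) pixel coordinates of the same facial feature point
    (e.g. a lip corner) in the left and right images."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous 3D point is the null vector of A, i.e. the
    # right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenise to (x, y, z)
```

Geometric video parameters such as mouth width then follow as Euclidean distances between triangulated points, for example between the two lip corners.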
Thus equipped, it was possible to test the hypothesis that combinations of audio and video speech parameters, rather than single parameters, are related, and that these combinations are phoneme-specific. Articulatory theory makes it clear that the audio and video domains are related in some way and to some extent, because the visible speech articulators form part of the whole set of articulators. However, it also follows that not all of the information contained in the audio modality has equivalent information in the video modality. The set of audio speech parameters comprised the voice source excitation frequency F0, the formant frequencies F1, F2, and F3, and the RMS energy. Mouth width, mouth height, protrusion of the upper and lower lip, and the novel teeth visibility measure "relative teeth count" formed the video speech parameter set.
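For concreteness, two of the listed parameters, the acoustic RMS energy and the mouth opening geometry, might be computed per frame along the following lines. The frame length and the point labels are illustrative assumptions, not the thesis's actual definitions, and the novel relative teeth count measure is deliberately not sketched here.

```python
import numpy as np

def rms_energy(frame):
    """Root-mean-square energy of one audio frame
    (e.g. ~20 ms of samples; the frame length is an assumption)."""
    frame = np.asarray(frame, dtype=float)
    return np.sqrt(np.mean(frame ** 2))

def mouth_geometry(pts3d):
    """Mouth width and height as Euclidean distances between
    triangulated 3D lip feature points.  The dictionary keys are
    hypothetical placeholder labels."""
    width = np.linalg.norm(pts3d["left_corner"] - pts3d["right_corner"])
    height = np.linalg.norm(pts3d["upper_lip_mid"] - pts3d["lower_lip_mid"])
    return width, height
```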
Extensive univariate and multivariate statistical analyses, such as pairwise linear correlation analysis, principal component analysis, statistical shape analysis, canonical correlation analysis, and co-inertia analysis, were performed to explore the AV relationships in AuE. The relationships found support the hypothesis that linear combinations of parameters, rather than single parameters, correlate well (r = 0.5-0.8) across the two modalities, and that their composition is phoneme-specific. The results show that, with the given parameter sets, between one fifth and one third of the variance in either modality can be recovered from the other modality. For visible speech information based purely on the lips, this agrees with studies on human speech perception in the current literature. Further investigations are required to test the stability of the relationships found and their suitability for a rule-based AV automatic speech recognition (ASR) system.
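As an illustration of the multivariate analysis style only (not the thesis's actual code or data), canonical correlation analysis between per-frame audio and video parameter matrices can be run with scikit-learn; the correlation of the first pair of canonical variates plays the role of the reported r values. The data below are random toy stand-ins.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Toy stand-ins: n frames x 5 audio parameters (F0, F1-F3, RMS energy)
# and n frames x 5 video parameters (width, height, two protrusions,
# relative teeth count).  Real analyses would use measured data.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))
V = 0.6 * A @ rng.normal(size=(5, 5)) + rng.normal(size=(200, 5))

cca = CCA(n_components=2)
Aa, Vv = cca.fit_transform(A, V)

# Correlation of the first pair of canonical variates:
r1 = np.corrcoef(Aa[:, 0], Vv[:, 0])[0, 1]
print(f"first canonical correlation: {r1:.2f}")
```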
Download PhD Thesis (without data appendices on CD-ROM) (2.6MB, PDF)
Download Thesis Appendix C (on CD-ROM) (2.0MB, PDF)
Download Thesis Appendix D (on CD-ROM) (97kB, PDF)
Download Thesis Appendix E (on CD-ROM) (2.7MB, PDF)
Download Thesis Appendix F (on CD-ROM) (231kB, PDF)
Download Thesis Appendix G (on CD-ROM) (99kB, PDF)
Download Thesis Appendix H (on CD-ROM) (104kB, PDF)
Download Thesis Appendix I (on CD-ROM) (3.4MB, PDF)
Download Thesis Appendix J (on CD-ROM) (141kB, PDF)