3D Lip Tracking and Co-Inertia Analysis for Improved Robustness of Audio-Video Automatic Speech Recognition
Author: Roland Göcke
Presented by Roland Göcke at the International Conference on
Auditory-Visual Speech Processing AVSP2005,
Vancouver Island, Canada, 24-27 July 2005
Abstract
Multimodality is a key issue in robust human-computer interaction. The
joint use of audio and video speech variables has been shown to improve
the performance of automatic speech recognition (ASR) systems. However,
robust methods, in particular for the real-time extraction of video
speech features, remain an open research area. This paper addresses
the robustness issue of audio-video (AV) ASR systems by exploring a
real-time 3D lip tracking algorithm based on stereo vision and by
investigating how learned statistical relationships between the sets
of audio and video speech variables can be employed in AV ASR systems.
The 3D lip tracking algorithm combines colour information from each
camera's images with knowledge about the structure of the mouth region
for different degrees of mouth openness. By using a calibrated stereo
camera system, 3D coordinates of facial features can be recovered, so
that the visual speech variable measurements become independent of
the head pose. Multivariate statistical methods enable the analysis of
relationships between sets of variables. Co-inertia analysis is a
relatively new method and has not yet been widely used in AVSP research.
Its advantage is its superior numerical stability compared to other
multivariate methods for small sample sizes. Initial results
are presented, which show how 3D video speech information and learned
statistical relationships between audio and video speech variables can
improve the performance of AV ASR systems.
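As an illustration of the stereo reconstruction step described in the abstract, the sketch below shows standard linear (DLT) triangulation of a single lip feature from a calibrated stereo pair. It is not the paper's implementation: the projection matrices, point coordinates and function names are illustrative assumptions, and the colour- and structure-based localisation of the feature in each image is taken as given.

```python
import numpy as np

def triangulate_point(P_left, P_right, xy_left, xy_right):
    """Linear (DLT) triangulation of one lip feature point.

    P_left, P_right : 3x4 camera projection matrices from stereo calibration.
    xy_left, xy_right : image coordinates of the same feature in the left
                        and right views.
    Returns the 3D point in the calibration frame, so the measurement no
    longer depends on the head pose relative to the cameras.
    """
    u1, v1 = xy_left
    u2, v2 = xy_right
    # Each view contributes two linear constraints on the homogeneous point X.
    A = np.vstack([
        u1 * P_left[2] - P_left[0],
        v1 * P_left[2] - P_left[1],
        u2 * P_right[2] - P_right[0],
        v2 * P_right[2] - P_right[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Hypothetical toy example: identity intrinsics (normalised image coordinates)
# and a 10 cm horizontal baseline between the two cameras.
if __name__ == "__main__":
    P_left = np.hstack([np.eye(3), np.zeros((3, 1))])
    P_right = np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])
    print(triangulate_point(P_left, P_right, (0.02, 0.01), (-0.03, 0.01)))
    # expected: approximately [0.04, 0.02, 2.0]
```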
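Co-inertia analysis couples two tables measured on the same samples by finding pairs of axes that maximise the covariance between the projected audio and video scores. Numerically it only requires an SVD of the p x q cross-covariance matrix and never inverts a within-set covariance matrix, which is where its stability for small sample sizes comes from. The following is a minimal sketch under the assumption of equally weighted, column-centred feature tables; the variable names and feature dimensions are hypothetical, not the paper's implementation.

```python
import numpy as np

def coinertia(X, Y, n_axes=2):
    """Minimal co-inertia analysis of two feature tables.

    X : (n_samples, p) audio speech variables, one row per frame.
    Y : (n_samples, q) video speech variables for the same frames.
    Returns audio axes U (p, k), video axes V (q, k), the co-inertia per
    axis (squared singular values) and the RV coefficient measuring the
    overall strength of the audio-video relationship.
    """
    n = X.shape[0]
    # Column-centre both tables (uniform row weights 1/n).
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Cross-covariance matrix between the two sets of variables.
    C = Xc.T @ Yc / n                      # shape (p, q)
    # Co-inertia axes come from the SVD of C alone; no within-set
    # covariance matrix has to be inverted.
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    # RV coefficient: global correlation between the two tables.
    Sxx = Xc.T @ Xc / n
    Syy = Yc.T @ Yc / n
    rv = np.trace(C @ C.T) / np.sqrt(np.trace(Sxx @ Sxx) * np.trace(Syy @ Syy))
    return U[:, :n_axes], Vt.T[:, :n_axes], s[:n_axes] ** 2, rv

# Hypothetical usage with random stand-ins for audio and 3D lip features.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(200, 12))     # e.g. 12 acoustic features per frame
    video = rng.normal(size=(200, 6))      # e.g. 6 3D lip-shape features per frame
    U, V, coin, rv = coinertia(audio, video)
    print("co-inertia per axis:", coin, "RV:", rv)
```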