Noisy Audio Feature Enhancement Using Audio-Visual Speech Data
Authors: Roland Göcke, Gerasimos Potamianos, and Chalapathy Neti
Presented by Gerasimos Potamianos at the 2002 IEEE International
Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002),
Orlando, USA, 12-17 May 2002
Abstract
We investigate improving automatic speech recognition (ASR) in noisy
conditions by enhancing noisy audio features using visual speech captured
from the speaker's face. The enhancement is achieved by applying a linear
filter to the concatenated vector of noisy audio and visual features,
obtained by mean square error estimation of the clean audio features in
a training stage. The performance of the enhanced audio features is
evaluated on two ASR tasks: a connected-digits task and
speaker-independent, large-vocabulary, continuous speech recognition. In
both cases, and at sufficiently low signal-to-noise ratios (SNRs), ASR
trained on the enhanced audio features significantly outperforms ASR
trained on noisy audio, achieving, for example, a 46% relative reduction
in word error rate on the digits task at -3.5 dB SNR. However, the method
fails to capture the full benefit of the visual modality to ASR, as
demonstrated by its comparison to discriminant audio-visual feature fusion
introduced in previous work.
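
The enhancement step described above can be sketched roughly in code. This is a minimal illustration, not the authors' implementation: the feature dimensions, variable names, and the ordinary-least-squares formulation of the mean-square-error estimate are assumptions made for the example.

    import numpy as np

    def train_enhancement_filter(noisy_audio, visual, clean_audio):
        # Estimate a linear filter W (with bias) mapping the concatenated
        # [noisy audio; visual] feature vector of each frame to the clean
        # audio features, in the mean-square-error sense (least squares).
        x = np.hstack([noisy_audio, visual])            # (T, Da + Dv)
        x = np.hstack([x, np.ones((x.shape[0], 1))])    # append bias term
        w, *_ = np.linalg.lstsq(x, clean_audio, rcond=None)
        return w                                        # (Da + Dv + 1, Da)

    def enhance(noisy_audio, visual, w):
        # Apply the trained filter to new noisy audio-visual features.
        x = np.hstack([noisy_audio, visual])
        x = np.hstack([x, np.ones((x.shape[0], 1))])
        return x @ w                                    # enhanced audio features

    # Toy usage with random data standing in for real features
    # (dimensions chosen arbitrarily for illustration).
    rng = np.random.default_rng(0)
    T, Da, Dv = 500, 60, 41
    clean = rng.standard_normal((T, Da))
    visual = clean[:, :Dv] + 0.1 * rng.standard_normal((T, Dv))
    noisy = clean + rng.standard_normal((T, Da))        # additive noise
    w = train_enhancement_filter(noisy, visual, clean)
    enhanced = enhance(noisy, visual, w)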
Download (166k, PDF)