Audiovisual Speech Recognition: Data Collection and Feature Extraction in Automotive Environment

Faculty: Mark Hasegawa-Johnson, Camille Goudeseune, Thomas S. Huang, Stephen E. Levinson, and Michael McLaughlin

Students: Bowon Lee, Sarah Borys, Ming Liu, Suketu Kamdar

Sponsor: Motorola Center for Communication

Performance Dates: 2002-2006

Speech recognition in an automobile is typically performed using a single microphone, often mounted in the sun visor in front of the driver. Typical acoustic background noise levels range from approximately 15 dB SNR down to -5 dB SNR. At these noise levels, even recognizers with a very small vocabulary may generate too many recognition errors for practical use.
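For reference, SNR in decibels is ten times the base-10 logarithm of the ratio of signal power to noise power, so at -5 dB SNR the noise carries roughly three times the power of the speech. The sketch below only illustrates that computation; the snr_db helper and the synthetic signals are hypothetical and are not part of the project's tools.

```python
import numpy as np

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    return 10.0 * np.log10(p_signal / p_noise)

# Synthetic example: noise scaled so its power sits ~10 dB below the "speech".
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)            # stand-in for 1 s of speech at 16 kHz
noise = 10 ** (-10 / 20) * rng.standard_normal(16000)
print(f"{snr_db(speech, noise):.1f} dB")       # prints approximately 10.0 dB
```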

This research project has developed audiovisual speech recognition using a multisensory visor-mounted array composed of eight microphones and four video cameras. We have acquired data from 86 talkers in realistic driving environments (the AVICAR corpus), developed and applied robust audiovisual feature extraction algorithms, and evaluated the extracted features by training and testing small-vocabulary speech recognition models.
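At the feature level, audiovisual recognizers of this kind commonly concatenate per-frame acoustic features (for example, MFCCs) with per-frame lip features before training. The sketch below illustrates only that fusion step under assumed frame rates and feature dimensions; it is not the project's implementation, and the linear interpolation used to align the video stream to the audio frame rate is an assumption.

```python
import numpy as np

def fuse_audio_visual(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
    """Concatenate per-frame audio and visual features.

    audio_feats  : (n_audio_frames, d_audio), e.g., MFCCs at 100 frames/s.
    visual_feats : (n_video_frames, d_visual), e.g., lip features at 30 frames/s.
    The visual stream is linearly interpolated up to the audio frame rate so
    the two streams can be concatenated frame by frame.
    """
    n_audio = audio_feats.shape[0]
    n_video, d_visual = visual_feats.shape
    audio_t = np.linspace(0.0, 1.0, n_audio)
    video_t = np.linspace(0.0, 1.0, n_video)
    upsampled = np.column_stack(
        [np.interp(audio_t, video_t, visual_feats[:, j]) for j in range(d_visual)]
    )
    return np.hstack([audio_feats, upsampled])

# Hypothetical shapes: 13 MFCCs at 100 frames/s, 6 lip features at 30 frames/s.
fused = fuse_audio_visual(np.random.randn(300, 13), np.random.randn(90, 6))
print(fused.shape)  # (300, 19)
```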

Audio-video recordings of speech were acquired in realistic noise conditions: engine idling, windows closed at 35 mph, windows open at 35 mph, windows closed at 65 mph, and windows open at 65 mph. The acquired data were used to develop and apply algorithms for robust audiovisual feature extraction. In particular, graduate research assistants working on this project have focused on two problems: (1) accurate visual tracking of the face and extraction of lip features; and (2) extraction of an accurate audio speech recognition feature stream from the multi-microphone array. The extracted audiovisual features have been used to train and test four small-vocabulary speech recognizers: two binaural (two-microphone) audiovisual recognizers with different recognition architectures, one binaural audio-only recognizer, and one monaural audio-only recognizer.
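For the multi-microphone problem, a standard baseline is delay-and-sum beamforming: each channel is time-shifted toward the talker and the channels are averaged, which suppresses noise that is uncorrelated across microphones. The sketch below is a minimal illustration under an assumed sampling rate and assumed steering delays; it is not the algorithm developed in this project.

```python
import numpy as np

def delay_and_sum(channels: np.ndarray, delays_s: np.ndarray, fs: int = 16000) -> np.ndarray:
    """Delay-and-sum beamformer.

    channels : (n_mics, n_samples) array of microphone recordings.
    delays_s : per-microphone steering delays in seconds (e.g., computed from
               the array geometry and an assumed talker position).
    Returns the beamformed single-channel signal.
    """
    n_mics, n_samples = channels.shape
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples // 2 + 1, dtype=complex)
    for mic in range(n_mics):
        spectrum = np.fft.rfft(channels[mic])
        # Apply a fractional delay as a linear phase shift in the frequency domain.
        out += spectrum * np.exp(-2j * np.pi * freqs * delays_s[mic])
    return np.fft.irfft(out / n_mics, n=n_samples)

# Hypothetical usage with an eight-microphone array and zero steering delays
# (i.e., a talker assumed equidistant from all microphones).
mics = np.random.randn(8, 16000)
enhanced = delay_and_sum(mics, np.zeros(8))
```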