CSAIL Research Abstracts - 2005 (http://publications.csail.mit.edu/abstracts/abstracts05/index.html)

A Segment-Based Audio-Visual Speech Recognition System

Timothy J. Hazen, Kate Saenko & James R. Glass


Visual information has been shown to be useful for improving the accuracy of speech recognition in both humans and machines. These improvements are the result of the complementary nature of the audio and visual modalities. For example, many sounds that are confusable by ear are easily distinguishable by eye, such as n and m. Additionally, the improvements from adding the visual modality are often more pronounced in noisy conditions where the audio signal-to-noise ratio (SNR) is reduced.

We have recently developed our own audio-visual speech recognition (AVSR) system. We hope that this technology can eventually be deployed in potentially noisy environments where visual monitoring of the user is possible, such as automobiles, public kiosks, and offices. The AVSR system combines information collected from observations of the speaker's lip movements with knowledge obtained from the acoustic speech signal in order to improve speech recognition accuracy and robustness.

Our AVSR system is built upon our existing segment-based speech recognizer [1]. It incorporates information collected from visual measurements of the speaker's lip region using an audio-visual integration mechanism that we call a segment-constrained HMM [2]. In this approach, the audio and visual streams are independently processed and classified, and integration occurs at a higher level within the search mechanism of the recognizer. One advantage of this late integration is that it enables a variety of desirable modeling techniques, such as adaptive weighting of the audio and visual classification scores or asynchronous processing of the audio and visual streams.
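As a rough illustration of what adaptive weighting of classification scores can look like, the sketch below combines per-class log scores from two independently run classifiers using a single stream weight. It is not the segment-constrained HMM of [2]: the function names, the linear weighting scheme, and the example scores are all assumptions made for illustration.

```python
def fuse_scores(audio_logp, visual_logp, audio_weight):
    """Linearly combine per-class log scores from two classifiers.

    audio_weight is a hypothetical stream weight in [0, 1]; the visual
    stream receives the remaining weight (1 - audio_weight).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if audio_logp.keys() != visual_logp.keys():
        raise ValueError("both classifiers must score the same classes")
    visual_weight = 1.0 - audio_weight
    return {
        label: audio_weight * audio_logp[label] + visual_weight * visual_logp[label]
        for label in audio_logp
    }


def best_label(scores):
    """Return the class with the highest combined score."""
    return max(scores, key=scores.get)


# Made-up log scores for two classes that sound alike but look different.
audio = {"m": -1.0, "n": -0.9}
visual = {"m": -0.2, "n": -2.5}

print(best_label(fuse_scores(audio, visual, audio_weight=1.0)))  # audio only
print(best_label(fuse_scores(audio, visual, audio_weight=0.5)))  # equal weights
```

In a scheme like this, the weight could be lowered as the acoustic SNR drops so that the visual scores carry more influence in noise; how the weight is actually chosen is left open here.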

The AV-TIMIT Corpus

To evaluate the performance of our system, we have collected a corpus of video recordings called the Audio-Visual TIMIT (AV-TIMIT) corpus. It contains read speech and was recorded in a relatively quiet office with controlled lighting, background, and audio noise level. The main design goals for this corpus were:

  • continuous, phonetically balanced speech (based on 450 TIMIT-SX sentences)
  • multiple speakers (223 in total with 117 males and 106 females)
  • controlled office environment
  • high resolution digital video

The audio portion of the corpus was collected using a far-field array microphone. The resulting audio had an average signal-to-noise ratio (SNR) of 25 dB. To simulate noisier conditions, we artificially added babble noise from the NOISEX database [3] at SNRs ranging from -10 dB to 20 dB.
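Adding noise at a chosen SNR amounts to scaling the noise so that the ratio of speech power to noise power matches the target. The following is a minimal sketch of that scaling, assuming both signals are plain lists of samples at the same sampling rate and treating the recorded speech as if it were noise-free (the recordings here already carry some background noise, so this is only an approximation). The function names are hypothetical and this is not necessarily the procedure used to build the noisy test conditions.

```python
import math


def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    speech_power = mean_power(speech)
    noise_power = mean_power(noise)
    if speech_power == 0.0 or noise_power == 0.0:
        raise ValueError("speech and noise must both have non-zero power")
    # SNR(dB) = 10 * log10(speech_power / noise_power), solved for noise power.
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for a speech waveform and a babble-noise waveform.
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [math.cos(0.37 * i) for i in range(1000)]

# One mixture per condition, from -10 dB to 20 dB in 5 dB steps.
mixtures = {snr: mix_at_snr(speech, noise, snr) for snr in range(-10, 25, 5)}
```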

The corpus was subdivided into three subsets:

  • a training set containing 3608 utterances from 185 speakers
  • a development test set containing 284 utterances from 19 speakers
  • a final test set containing 285 utterances from 19 speakers

The training set was used to train the audio and visual models. The development test set was used to tune system parameters for optimal performance. All of our final results were generated using the final test set.
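The speaker counts of the three subsets (185, 19 and 19) sum to the 223 speakers in the corpus, which suggests that no speaker appears in more than one subset. A minimal sketch of such a speaker-disjoint partition is shown below. Random assignment of speakers, the function name, and the (speaker, utterance) tuple layout are assumptions for illustration; the text above does not state how speakers were assigned.

```python
import random


def split_by_speaker(utterances, n_dev, n_test, seed=0):
    """Partition (speaker_id, utterance_id) pairs so that subsets share no speaker."""
    speakers = sorted({speaker for speaker, _ in utterances})
    if n_dev + n_test >= len(speakers):
        raise ValueError("not enough speakers left for a training set")
    rng = random.Random(seed)
    rng.shuffle(speakers)
    dev_speakers = set(speakers[:n_dev])
    test_speakers = set(speakers[n_dev:n_dev + n_test])
    splits = {"train": [], "dev": [], "test": []}
    for speaker, utterance in utterances:
        if speaker in dev_speakers:
            splits["dev"].append((speaker, utterance))
        elif speaker in test_speakers:
            splits["test"].append((speaker, utterance))
        else:
            splits["train"].append((speaker, utterance))
    return splits


# Toy data: 10 hypothetical speakers with 3 utterances each.
toy = [(f"spk{s:02d}", f"utt{u}") for s in range(10) for u in range(3)]
splits = split_by_speaker(toy, n_dev=2, n_test=2)
print({name: len(items) for name, items in splits.items()})
```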


For evaluation, our AVSR system was configured with a 1793-word vocabulary to match the vocabulary of the AV-TIMIT corpus. A simple word-pair grammar was used to provide language model constraints. The recognizer's acoustic models were trained for each SNR level (in 5 dB increments) between -10 dB and 20 dB. The AVSR system was then tested using the matched-SNR acoustic models for each SNR condition, as well as for the "clean" condition where no additional noise was added. Table 1 shows the performance of the full AVSR system in comparison to a system using only the audio information. The results show that incorporating the visual information into the recognition process reduces the recognition error rate by between 14% and 60%, depending on the noise level. Full details of this experiment and its results will be published in [4].
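The error rate reductions quoted here are relative reductions with respect to the audio-only system. The sketch below recomputes them from the word error rates reported in Table 1. Because those error rates are already rounded, a recomputed value can differ by about a point from the published figure, which was presumably derived from unrounded numbers. The function name is hypothetical.

```python
def relative_error_reduction(baseline_wer, improved_wer):
    """Relative reduction in word error rate, in percent of the baseline."""
    if baseline_wer <= 0.0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer


# (condition, audio-only WER %, audio-visual WER %) as reported in Table 1.
table_1 = [
    ("clean", 2.27, 0.91),
    ("20", 1.90, 1.04),
    ("15", 1.99, 1.00),
    ("10", 2.49, 1.41),
    ("5", 5.26, 4.26),
    ("0", 16.5, 12.3),
    ("-5", 62.4, 46.4),
    ("-10", 90.9, 78.3),
]

for condition, audio_only, audio_visual in table_1:
    reduction = relative_error_reduction(audio_only, audio_visual)
    print(f"{condition:>5}  {reduction:5.1f}%")
```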

Table 1: AVSR in babble noise with matched acoustic models

SNR (dB)   Audio-Only WER   Audio-Visual WER   Error Rate Reduction
clean      2.27%            0.91%              60%
20         1.90%            1.04%              44%
15         1.99%            1.00%              50%
10         2.49%            1.41%              44%
5          5.26%            4.26%              19%
0          16.5%            12.3%              26%
-5         62.4%            46.4%              26%
-10        90.9%            78.3%              14%

Future Plans

Now that we have verified the effectiveness of our baseline system on studio-quality video, we plan to investigate the application of this technology to the more realistic conditions likely to be encountered by deployed applications. Toward this end, we have compiled a corpus of audio-visual data collected in a moving automobile. This corpus increases the complexity of the visual task by introducing variable lighting conditions and head movements into visual head/lip tracking and feature extraction.


This work has been supported in part by the MIT-Ford Alliance and in part by ITRI.


[1] James R. Glass. A probabilistic framework for segment-based speech recognition. Computer Speech and Language, vol. 17, no. 2-3, pp. 137-152, April/July 2003.

[2] Timothy J. Hazen, Kate Saenko, Chia-Hao La and James R. Glass. A segment-based audio-visual speech recognizer: Data collection, development and initial experiments. In Proceedings of the International Conference on Multimodal Interfaces, pp. 235-242, State College, Pennsylvania, October 2004.

[3] A. Varga and H. Steeneken. Assessment for automatic speech recognition II: NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, vol. 12, no. 3, pp. 247-251, July 1993.

[4] Timothy J. Hazen. Visual model structures and synchrony constraints for audio-visual speech recognition. To appear in IEEE Transactions on Speech and Audio Processing, 2005.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu
(Note: On July 1, 2003, the AI Lab and LCS merged to form CSAIL.)