Research Abstracts - 2007
Co-Adaptation of Audio-Visual Speech and Gesture Classifiers

C. Mario Christoudias, Kate Saenko, Louis-Philippe Morency & Trevor Darrell


Human interaction relies on multiple redundant modalities to robustly convey information. Similarly, many human-computer interface (HCI) systems use multiple modes of input and output to increase robustness in the presence of noise (e.g., by performing audio-visual speech recognition) and to improve the naturalness of the interaction (e.g., by allowing gesture input in addition to speech). Such systems often employ classifiers based on supervised learning methods which require manually labeled data. However, obtaining large amounts of labeled data is costly, especially for systems that must handle multiple users and realistic (noisy) environments. In this work, we address the issue of learning multi-modal classifiers in a semi-supervised manner. We present a method that improves the performance of existing classifiers on new users and noise conditions without requiring any additional labeled data.

There has been much interest recently in developing semi-supervised learning algorithms for problems with multiple views of the data. One such algorithm, co-training [1], improves weak classifiers learned on separate views of the labeled data by maximizing their agreement on the unlabeled data. Co-training was originally proposed for the scenario in which labeled data is scarce but unlabeled data is easy to collect. In multimodal HCI development, it may be feasible to collect enough labeled data from a set of users in a certain environment, but the resulting system may not generalize well to new users or environments. For example, a new user may gesture differently, or the room may become noisy when a fan is turned on. The semi-supervised learning problem then becomes one of adapting existing models to the particular condition. To solve this problem, we investigate a variant of co-training, which we call co-adaptation. Co-adaptation uses a generic supervised classifier to produce an initial labeled training set for the new condition, from which a data-specific classifier is built. The algorithm then improves the resulting data-specific classifier with co-training, using the remaining unlabeled samples.

We evaluate our co-adaptation algorithm on two audio-visual HCI tasks: speech unit classification and user agreement detection. The first task is to identify a sequence of acoustic and lip image features as a particular word or phoneme. The second task is to determine whether a user has expressed agreement or disagreement during a conversation, given a sequence of head gesture and acoustic features. In our experiments, we adapt hidden Markov models (HMMs), which are commonly used for speech and gesture sequence classification.

A more detailed treatment of this work and additional experiments is available in [2].

Co-Adaptation Algorithm

We propose an adaptive version of the co-training algorithm that bootstraps a data-dependent model from a data-independent model trained on a large labeled dataset. Suppose we obtain unlabeled data from a new condition, such as a new user. We first use the user-independent model to specify a small seed set of labeled examples using its most confident predictions. A user-dependent model is then trained on this initial seed set and improved with cross-modal learning on the rest of the unlabeled data. The resulting co-adaptation algorithm is summarized as Algorithm 1.

Algorithm 1 Co-Adaptation Algorithm
Given user-independent classifiers f_i^UI, i=1,...,k, a user-dependent unlabeled set U and parameters N, M and T:
Set S=∅
for i=1 to k do
     Use f_i^UI to label the M highest-confidence samples in U and move them to S
end for
Set t=1
repeat
     for i=1 to k do
          Train user-dependent classifier f_i on view i of S
          Use f_i to label the N highest-confidence samples in U and move them to S
     end for
     Set t=t+1
until t=T or |U|=0

The intuition behind the co-adaptation algorithm is that, while the overall performance of the generic model may be poor on new users or under new noise conditions, it can still be used to accurately label a small seed set of examples. The initial seed classifier can then be improved via co-training. Since the new classifier is trained using samples from the new working condition (i.e., new user and environment), it has the potential to out-perform the original generic classifier in the new setting, especially when user variation or difference in environment is large.
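The control flow of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: `CentroidClassifier` is a hypothetical toy stand-in for the HMMs (a nearest-class-mean rule on one scalar feature per view, with a distance margin as its confidence), and the names `move_most_confident` and `co_adapt` are likewise assumptions introduced here. The sketch assumes M is at least 1, so that the seed set is non-empty.

```python
class CentroidClassifier:
    """Toy stand-in for an HMM: nearest class mean on a scalar feature."""

    def __init__(self, means=None):
        self.means = dict(means or {})

    def fit(self, xs, ys):
        groups = {}
        for x, y in zip(xs, ys):
            groups.setdefault(y, []).append(x)
        self.means = {y: sum(v) / len(v) for y, v in groups.items()}
        return self

    def predict(self, x):
        """Return (label, confidence); confidence is the distance margin."""
        ranked = sorted(self.means, key=lambda y: abs(x - self.means[y]))
        best = ranked[0]
        if len(ranked) == 1:
            return best, 0.0  # only one class seen so far: no margin
        runner_up = ranked[1]
        return best, abs(x - self.means[runner_up]) - abs(x - self.means[best])


def move_most_confident(clf, view, U, S, n):
    """Label the n highest-confidence samples of U with clf and move them to S."""
    scored = [(clf.predict(x[view]), idx) for idx, x in enumerate(U)]
    scored.sort(key=lambda item: item[0][1], reverse=True)
    chosen = scored[:n]
    S.extend((U[idx], label) for (label, _), idx in chosen)
    taken = {idx for _, idx in chosen}
    U[:] = [x for idx, x in enumerate(U) if idx not in taken]


def co_adapt(ui_classifiers, unlabeled, N, M, T):
    """Follow Algorithm 1; each sample is a tuple with one feature per view."""
    U, S = list(unlabeled), []
    k = len(ui_classifiers)
    # Seed S with the user-independent classifiers' most confident labels.
    for i in range(k):
        move_most_confident(ui_classifiers[i], i, U, S, M)
    adapted = [None] * k
    t = 1
    while True:  # repeat ... until t = T or |U| = 0
        for i in range(k):
            xs = [x[i] for x, _ in S]
            ys = [y for _, y in S]
            # S is shared, so view i learns from labels assigned by other views.
            adapted[i] = CentroidClassifier().fit(xs, ys)
            move_most_confident(adapted[i], i, U, S, N)
        t += 1
        if t >= T or not U:
            break
    return adapted, S
```

Because every view trains on the same set S, a label assigned confidently in one modality becomes training data for the other, which is the cross-modal step that distinguishes this from bootstrapping each modality alone.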


Co-adaptation was applied to perform unsupervised adaptation of audio and visual HMM classifiers. The HMMs used were left-to-right HMMs, and a confidence measure based on the HMM posterior probability was used to learn from unlabeled data [2]. Co-adaptation experiments were performed on two human-computer interaction tasks: audio-visual agreement recognition and speech unit classification. In the former task, agreement is recognized from either the person's head gesture (head nod or head shake) or speech ('yes' or 'no'). For this task, a dataset was collected of 15 subjects interacting with an avatar that asked each subject a set of 103 yes/no questions. The subjects were prompted to reply using simultaneous head gesture and speech. In speech unit classification, visemes are recognized from either the person's sequence of mouth images or their speech waveform. For this task, a subset of the AVTIMIT corpus was used [5]. To map recognized phonemes in the audio to viseme labels, phonemes were grouped into four visually distinct viseme classes [2].
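One plausible way to turn per-class HMM scores into a posterior-based confidence is sketched below. The exact measure is given in [2] and is not reproduced here; this sketch simply assumes that each class has its own HMM producing a sequence log-likelihood, and that the confidence is the posterior probability of the winning class under Bayes' rule. The function name `posterior_confidence` and the uniform default prior are assumptions made for illustration.

```python
import math


def posterior_confidence(log_likelihoods, log_priors=None):
    """Return (best_label, posterior) from per-class sequence log-likelihoods.

    log_likelihoods maps each class label to log p(sequence | class HMM).
    """
    labels = list(log_likelihoods)
    if log_priors is None:
        log_priors = {c: 0.0 for c in labels}  # assumed uniform prior
    joint = {c: log_likelihoods[c] + log_priors[c] for c in labels}
    # Log-sum-exp keeps the normalizer stable for very negative log-likelihoods.
    m = max(joint.values())
    log_evidence = m + math.log(sum(math.exp(v - m) for v in joint.values()))
    best = max(joint, key=joint.get)
    return best, math.exp(joint[best] - log_evidence)
```

Under this assumption, two classes with equal log-likelihoods give a confidence of 0.5, while a gap of 10 nats between them gives a confidence above 0.9999, so only clearly separated sequences would rank among the highest-confidence samples.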

Classifier | User Independent | Co-Adaptation          | Single-Modality Bootstrap
Audio      | 89.8 ± 8.8       | 94.2 ± 5.6 (p=0.023)   | 91.3 ± 8.7 (p=0.414)
Visual     | 99.0 ± 2.0       | 98.5 ± 1.8 (p=0.332)   | 98.5 ± 2.3 (p=0.411)

Figure 1: Co-adaptation of multimodal agreement classifiers. Each column shows the mean correct classification rate (CCR) over the 15 test subjects, ± the standard deviation. The p-value comparing the performance of each method to that of the user-independent model is also shown.

Figure 1 summarizes the results of the agreement recognition experiments. For these experiments, a user-independent classifier was trained on 14 of the 15 subjects and co-adaptation was evaluated on the left-out subject. The user-independent visual classifiers already perform at 99 percent on this dataset, so it is difficult to improve them through adaptation. Comparing the user-independent and user-specific audio classifiers, however, co-adaptation achieves a significant increase in CCR of 4.4 percentage points (p=0.023). As a baseline comparison, Figure 1 also displays the results of performing single-modality bootstrapping [3], which does not use cross-modal learning, but rather learns a classifier separately in each modality. It is similar to co-adaptation (Algorithm 1), except that each classifier operates on its own copy of U and S, and classification labels are not shared across modalities. Unlike with the co-adaptation approach, the difference in performance between the user-independent and user-dependent audio HMM classifiers obtained with single-modality bootstrapping is not significant (p=0.414). This is because co-adaptation, unlike single-modality bootstrapping, was able to leverage the good performance of the visual classifiers to significantly improve the performance of the audio agreement classifier.
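The single-modality bootstrapping baseline described above can be sketched as follows, to make the contrast with Algorithm 1 concrete. This is a hedged illustration rather than the code used in the experiments: the names `single_modality_bootstrap`, `move_top`, `train_centroid`, and `centroid_predictor` are hypothetical, and a nearest-class-mean rule on a scalar feature again stands in for the HMMs. The only structural difference from co-adaptation is that each view keeps a private pool U_i and labeled set S_i.

```python
def centroid_predictor(means):
    """Toy classifier: returns predict(x) -> (label, distance-margin confidence)."""
    def predict(x):
        ranked = sorted(means, key=lambda y: abs(x - means[y]))
        best = ranked[0]
        if len(ranked) == 1:
            return best, 0.0
        return best, abs(x - means[ranked[1]]) - abs(x - means[best])
    return predict


def train_centroid(xs, ys):
    """Fit one mean per class and return the resulting predictor."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return centroid_predictor({y: sum(v) / len(v) for y, v in groups.items()})


def move_top(predict, pool, labeled, n):
    """Label the n highest-confidence items of pool and move them to labeled."""
    scored = [(predict(x), idx) for idx, x in enumerate(pool)]
    scored.sort(key=lambda item: item[0][1], reverse=True)
    chosen = scored[:n]
    labeled.extend((pool[idx], label) for (label, _), idx in chosen)
    taken = {idx for _, idx in chosen}
    pool[:] = [x for idx, x in enumerate(pool) if idx not in taken]


def single_modality_bootstrap(ui_predictors, train, U, N, M, T):
    """Adapt each view on its own copy of U and S; labels are never shared."""
    adapted = []
    for i, ui_predict in enumerate(ui_predictors):
        U_i = [x[i] for x in U]  # private unlabeled pool (view i only)
        S_i = []                 # private labeled set
        move_top(ui_predict, U_i, S_i, M)
        t = 1
        while True:
            predict = train([x for x, _ in S_i], [y for _, y in S_i])
            move_top(predict, U_i, S_i, N)
            t += 1
            if t >= T or not U_i:
                break
        adapted.append(predict)
    return adapted
```

In this sketch a view whose user-independent classifier seeds S_i poorly has no way to recover, since no other modality ever contributes labels to its training set.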

Classifier | User Independent | Co-Adaptation          | Audio-Bootstrap | Video-Bootstrap | Single-Modality Bootstrap
Audio      | 52.8 ± 4.8       | 69.9 ± 7.4 (p<<.01)    | 55.4 ± 4.5      | 63.3 ± 11.8     | 58.6 ± 4.4 (p<<.01)
Video      | 59.8 ± 11.3      | 69.0 ± 8.6 (p<<.01)    | 51.5 ± 7.9      | 62.4 ± 13.2     | 60.7 ± 12.1 (p=.03)

Figure 2: Co-adaptation results on the audio-visual speech data. Each column shows the mean CCR over 39 test speakers, ± the standard deviation. p-values are relative to the user-independent classifier.

The experiments on audio-visual speech unit classification were performed using 50 subjects to train the user-independent audio and visual classifiers and 39 test subjects. For this experiment, babble noise was added to the test subjects' audio to simulate a noisy environment [2]. Figure 2 summarizes the results of the speech unit classification experiments. The user-specific classifiers found with co-adaptation show a significant improvement in CCR over the user-independent classifiers, and they outperform the user-specific classifiers found using the single-modality bootstrapping baseline. Figure 2 also displays the result of labeling the adaptation data using the user-independent classifier from one modality. This kind of approach is beneficial when the user-independent classifier in one modality is more reliable than the other, as is the case for the visual modality in this experiment. Labeling with the user-independent classifier of the "weaker" modality, however, can degrade performance, as is seen when bootstrapping from the audio classifier. Co-adaptation is able to improve performance without a priori knowledge of which modality is more reliable.


We investigated the multi-view semi-supervised co-training algorithm as a means of utilizing unlabeled data in multimodal HCI learning problems. We proposed an adaptive co-training algorithm, co-adaptation, and showed that it can be used to improve upon existing models trained on a large amount of labeled data when a small amount of unlabeled data from new users or noise conditions becomes available. Interesting avenues of future work include the use of co-adaptation to perform high-level adaptation of audio-visual classifiers (e.g., adapting their language model), the use of user-dependent observations, and the use of HMM adaptation techniques (MLLR [6], MAP [4]) in our algorithm.


[1] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory, pp. 92-100, 1998.

[2] C. Mario Christoudias, Kate Saenko, Louis-Philippe Morency and Trevor Darrell. Co-Adaptation of Audio-Visual Speech and Gesture Classifiers. In The Proceedings of the 8th International Conference on Multimodal Interfaces, pp. 84-91, Banff, Alberta, Canada, November 2006.

[3] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.

[4] J.-L. Gauvain and C.-H. Lee. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Processing, pp. 291-298, April 1994.

[5] T. J. Hazen, K. Saenko, C. H. La, and J. Glass. A segment-based audio-visual speech recognizer: Data collection, development, and initial experiments. In The Proceedings of the 7th International Conference on Multimodal Interfaces, Trento, Italy, October 2005.

[6] C. J. Leggetter and P. C. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech Lang., pp. 171-185, 1995.



Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu