Abstracts - 2007
Audiovisual Communication Error Detection
Sy Bor Wang, David Demirdjian, Hedvig Kjellstrom & Trevor Darrell
We propose a framework for communication error detection in conversational systems using audio-visual observations of users.
Many spoken dialogue systems have difficulty detecting the occurrence of communication errors (e.g. errors made by the speech recognizer). By our definition, a communication error occurs when the automatic speech recognition system misinterprets the user and the system makes an erroneous reply. Considerable research has been invested in monitoring audio cues to detect such communication errors. Various researchers have shown that human users change their speaking style when they encounter a problem with a conversational system. For example, users tend to speak slower or louder when speech recognition errors occur. These observations motivated the monitoring of prosodic aspects of a speaker's utterances, and several studies have shown that automatically extracted prosodic features can help in communication error detection [1,2,3]. However, the performance of prosody-based features varies across studies; moreover, since human-to-human communication combines the audio and visual channels in a complementary manner, visual cues may further improve error detection performance.
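The paper does not specify how its prosodic statistics are computed, but the idea of capturing "slower or louder" repetitions can be sketched as follows. This is a minimal illustration, not the authors' feature set: the function names, the choice of RMS energy for loudness, and words-per-second for speaking rate are all assumptions introduced here.

```python
import numpy as np

def prosody_stats(samples, rate, n_words):
    """Simple utterance-level prosodic statistics of the kind used for
    error detection: loudness (RMS energy) and speaking rate.
    `samples` is a mono waveform as a float array; `n_words` would
    come from the recognizer's hypothesis. Illustrative only."""
    duration = len(samples) / rate                   # seconds
    energy = float(np.sqrt(np.mean(samples ** 2)))   # RMS loudness
    words_per_sec = n_words / duration               # speaking rate
    return {"energy": energy, "rate": words_per_sec}

def relative_change(prev, curr):
    """Relative change in each statistic from one utterance to the
    next. Users repeating a query after an error tend to speak
    slower and louder, so these deltas are informative features."""
    return {k: (curr[k] - prev[k]) / prev[k] for k in prev}
```

For a user who repeats the same four-word query louder and at half the speed, `relative_change` would yield a positive energy delta and a negative rate delta.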
Recent perceptual studies indicate that communication error detection can be improved significantly if both acoustic and visual modalities are taken into account. A study conducted in [4] explored the human perception of audio-visual cues in dialogue systems and showed that, given visual footage of the speaker, human observers performed better at recognizing communication errors than when only audio recordings were provided: speech and non-verbal/verbal facial expressions of the users were shown to be good indicators of the communication state. This insight motivates us to detect communication errors in two modes: using visual features only when the user is listening to the system response, and using audio-visual features when the user is speaking. Figure 1 illustrates the reaction of a user experiencing a speech recognition error from a conversational system. This additional visual modality has not been used in the existing error detection literature.
Figure 1. Illustration of communication errors. In a., the subject is making a restaurant query for the first time. In b., the subject is listening to the response of the system. In c., the subject repeats his query. The visual cues in b. and the audio-visual cues in c. are indicative of communication errors.
We varied the types of visual and audio features extracted in order to compare them and find discriminative ones for training and testing. Our initial results have shown that, for the visual features, facial motion estimates perform better than head pose estimates in detecting errors. For the audio features, lexical-based prosody features perform better than affect-based prosody features. A separate experiment has also shown that computing the relative changes in prosody statistics from one utterance to the next is very useful for error detection, and that fusion by voting achieved an 83.3% accuracy in error detection. We are in the process of collecting more data and conducting new experiments.
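The abstract does not detail the fusion scheme beyond calling it "fusion by voting"; a common reading is majority voting over the decisions of several per-feature classifiers. The sketch below, with hypothetical names and inputs, shows that scheme under that assumption.

```python
from collections import Counter

def vote_fusion(predictions):
    """Fuse binary error/no-error decisions from several classifiers
    by majority vote. `predictions` holds one label per classifier,
    e.g. one trained on facial motion, one on lexical prosody, and
    one on relative prosody changes. Illustrative, not the paper's
    exact method."""
    counts = Counter(predictions)
    # Most common label wins; use an odd number of voters so that
    # ties cannot occur.
    return counts.most_common(1)[0][0]

# Hypothetical per-classifier decisions for one utterance:
# 1 = communication error detected, 0 = no error.
decisions = [1, 0, 1]
print(vote_fusion(decisions))  # → 1
```

With three voters, a single disagreeing classifier is outvoted, which is why odd ensemble sizes are the usual choice for this kind of late fusion.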
[1] J. Ang, R. Dhillon, A. Krupski, E. Shriberg, and A. Stolcke. Prosody-based automatic detection of annoyance and frustration in human-computer dialog. In Int'l Conf. on Speech and Language Processing (ICSLP), 2002.
[2] D. Litman, J. Hirschberg, and M. Swerts. Predicting user reactions to system error. In ACL, 2001.
[3] S. L. Oviatt and R. VanGent. Error resolution during multimodal human-computer interaction. In Speech Communication, 1998.
[4] P. Barkhuysen, E. Krahmer, and M. Swerts. Audiovisual perception of communication problems. In Speech Prosody, 2004.