CSAIL Research Abstract

Dynamically Incorporating Contextual Cues into Language Models for Spoken Dialogue Systems

Alexander Gruenstein, Chao Wang & Stephanie Seneff

Problem Statement

Humans effortlessly use conversational context to understand and anticipate what others are saying to them. For instance, if we ask a companion at a noisy cocktail party "What's your favorite city in Texas?" and hear a response that could be either "Austin" or "Boston", we easily prefer the former. In spoken dialogue systems, language models are designed to prefer word sequences that are likely in the supported domain. In most systems, the approach to language modeling lies close to one of two extremes: either a single large model is used to recognize all user utterances, or a series of smaller (often context-free) models is used, one for each dialogue state in the system. Typically, state-based systems demonstrate strong system initiative, in which the system guides an information-seeking dialogue step by step, eliciting each user response from a narrowly constrained set of alternatives. In contrast, single-language-model systems typically arise as part of an information-state update approach to dialogue, in which the aim is to let the user feel that she can say anything that naturally comes to mind at any point in the dialogue. Because such a system does not guide the conversation as strongly, it is difficult for a dialogue system designer to partition the potential information states into dialogue states and enumerate their transitions. This makes it less straightforward to craft state-specific language models in advance, and often results in the use of a single large model trained on all in-domain data.

We seek to develop language models that place no constraints on what the user can say at any given point in the dialogue, yet leverage conversational context to assign greater weight to utterances the user is likely to produce. In particular, we are pursuing an approach that biases a single language model on the fly, rather than pre-compiling state-specific models. We believe this approach will better allow dialogue systems to adapt to expectations inferred from the discourse.

Our Approach

The Spoken Language Systems group has developed a speech recognizer (SUMMIT) that can efficiently modify, in real time, the expansions of particular classes of an n-gram language model [1]. We are currently developing techniques that use this capability to bias a dialogue system's language model dynamically, based on context as represented by the system's information state. We do so by creating context-sensitive dynamic classes, which are updated at every turn as the dialogue with the user unfolds. We have developed techniques for creating two kinds of class: strong context-sensitive dynamic classes, which capture highly specific expectations about phrases the user is likely to utter, and weak context-sensitive dynamic classes, which capture more general expectations about the types of phrases a user might utter.

Implementation

We have implemented our approach within the MERCURY flight reservation dialogue system [2]. Here we describe how we train both strong and weak context-sensitive dynamic classes.

Strong context-sensitive classes expand to a small list of phrases that are salient in the current context. For instance, if the system has just offered the user a short list of available flight times, then phrases that refer to these times become expansions of a particular strong context-sensitive dynamic class called $dyntime. We created the following strong dynamic classes in this domain:

Figure 1 shows a sample dialogue snippet after the user's utterances have been tagged with context-sensitive dynamic classes (as well as with normal classes). Figure 2 shows how the dialogue information state following utterance S3 gives rise to a particular set of expansions for each dynamic class, which in turn allows the word two in utterance U3 to be tagged with the context-dependent class $dyntime.

S1:	How may I help you?
U1:	I'd like to fly from Austin/*$city* to Oakland/*$city* on the third/*$digit*.
S2:	Okay, from Boston [misrecognized] to Oakland on March third. Can you provide an approximate departure time or airline?
U2:	Not Boston/*$dynsource*, Austin/*$city*.
S3:	Okay, from Austin to Oakland on March third. I've got a flight on American at two o'clock, would that work? Or I've got one on United at four thirty.
U3:	How about the flight at two/*$dyntime*.

Figure 1: A dialogue snippet tagged with strong context-sensitive dynamic classes and normal static classes

Figure 2: A dialogue system information-state snippet with the corresponding strong dynamic class expansions
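
As a rough illustration of how an information state could give rise to strong dynamic class expansions, the following Python sketch recomputes $dynsource and $dyntime after a system turn. Only the class names $dynsource and $dyntime come from the dialogue in Figure 1. The InformationState fields, the time_phrases helper, and the rule that the system's currently believed source city populates $dynsource are assumptions made for illustration, not a description of MERCURY or SUMMIT.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class InformationState:
    """Hypothetical, much-simplified stand-in for a dialogue information state."""
    source: Optional[str] = None  # departure city the system currently believes
    # Flight times the system has just offered, as (hour, minute) word pairs.
    offered_times: List[Tuple[str, Optional[str]]] = field(default_factory=list)


def time_phrases(hour: str, minute: Optional[str]) -> List[str]:
    """Surface phrases a user might use to refer back to an offered time (assumed)."""
    if minute is None:
        return [hour, f"{hour} o'clock"]
    return [f"{hour} {minute}"]


def strong_class_expansions(state: InformationState) -> Dict[str, List[str]]:
    """Recompute the expansions of each strong dynamic class for the next user turn."""
    expansions: Dict[str, List[str]] = {"$dynsource": [], "$dyntime": []}
    if state.source is not None:
        expansions["$dynsource"].append(state.source.lower())
    for hour, minute in state.offered_times:
        expansions["$dyntime"].extend(time_phrases(hour, minute))
    return expansions


# State after S2 (system believes the source is Boston, no times offered yet).
after_s2 = strong_class_expansions(InformationState(source="Boston"))
# State after S3 (source corrected to Austin, two flight times offered).
after_s3 = strong_class_expansions(
    InformationState(source="Austin", offered_times=[("two", None), ("four", "thirty")])
)
print(after_s2)
print(after_s3)
```

Under these assumptions, "boston" is available as a $dynsource expansion when the user utters U2, and "two" is available as a $dyntime expansion when the user utters U3, which matches the tagging shown in Figure 1.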

We define weak context-sensitive classes as those that arise when the dialogue context indicates the type of information the user might soon supply, but with less specificity than in the strong case above. Specifically, we have experimented with expectations regarding airlines and locations. Using the same procedure as above, we have tagged occurrences of locations and airlines according to which of three conditions the utterance occurred in:

Unlike strong classes, the class expansions for the weak classes are not manipulated at run time. Instead, particular pre-trained weak classes are enabled or disabled at run time depending on which of C1, C2, or C3 applies. Thus, search paths through the language model are toggled on or off depending on the dialogue context.
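
The toggling described above can be sketched as a lookup from the active condition to the set of weak classes that should be enabled. This is a minimal sketch under stated assumptions: the class names below are hypothetical placeholders, the pairing of one location class and one airline class with each condition is assumed, the condition labels are treated as opaque, and the case in which no condition applies (all weak classes off) is also an assumption.

```python
from typing import Dict, FrozenSet, Optional

# Hypothetical class names: one pre-trained location class and one airline
# class per condition. The real class inventory is not given in the text.
WEAK_CLASSES: Dict[str, FrozenSet[str]] = {
    "C1": frozenset({"$city_c1", "$airline_c1"}),
    "C2": frozenset({"$city_c2", "$airline_c2"}),
    "C3": frozenset({"$city_c3", "$airline_c3"}),
}


def weak_class_switches(condition: Optional[str]) -> Dict[str, bool]:
    """Return an on/off switch for every weak class given the active condition."""
    if condition is not None and condition not in WEAK_CLASSES:
        raise ValueError(f"unknown condition: {condition!r}")
    active = WEAK_CLASSES.get(condition, frozenset())
    # Expansions are never edited here; classes are only enabled or disabled.
    return {
        name: name in active
        for names in WEAK_CLASSES.values()
        for name in sorted(names)
    }


switches = weak_class_switches("C2")
print([name for name, enabled in switches.items() if enabled])
```

Because only the switches change from turn to turn, the weak class expansions themselves can be trained once in advance, in contrast to the strong classes, whose expansions are rebuilt at every turn.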

Evaluation

We evaluated the strong and weak techniques independently using a corpus of 26,886 utterances collected over several years of real user interaction with various versions of the MERCURY flight reservation system [2]. We trained with 24,815 utterances and tested on 2,071, creating trigram language models with a base vocabulary size of 1,586 (excluding class expansions noted below). We tested under four different conditions produced by manipulating how the expansion weights for several of the classes were calculated and by controlling the class vocabulary sizes as follows:

Figure 3 shows the improvements in overall word error rates obtained in each condition by the context-sensitive language models.

	baseline	strong	weak
CM	17.8	17.7	17.6
UM	25.2	24.4	20.8
PL	27.1	26.7	26.0
UL	46.7	45.0	42.1

Figure 3: Comparison of word error rates of baseline, strong, and weak language models.
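
To make the size of these improvements easier to compare across conditions, the short Python sketch below computes the absolute and relative reduction in word error rate for each row of Figure 3. The numbers are copied from the figure; treating them as percentages, and the function and variable names, are assumptions of this sketch.

```python
from typing import Dict, Tuple

# Word error rates from Figure 3, assumed to be percentages.
WER: Dict[str, Dict[str, float]] = {
    "CM": {"baseline": 17.8, "strong": 17.7, "weak": 17.6},
    "UM": {"baseline": 25.2, "strong": 24.4, "weak": 20.8},
    "PL": {"baseline": 27.1, "strong": 26.7, "weak": 26.0},
    "UL": {"baseline": 46.7, "strong": 45.0, "weak": 42.1},
}


def reductions(baseline: float, model: float) -> Tuple[float, float]:
    """Return (absolute reduction in points, relative reduction in percent)."""
    absolute = baseline - model
    relative = 100.0 * absolute / baseline
    return absolute, relative


for condition, row in WER.items():
    for model in ("strong", "weak"):
        absolute, relative = reductions(row["baseline"], row[model])
        print(f"{condition} {model:<6} absolute {absolute:4.1f}  relative {relative:4.1f}%")
```

In every condition the reported error rate is lower for the weak models than for the strong models, and lower for both than for the baseline.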

Acknowledgements

This research is funded in part by an industrial consortium supporting the MIT Oxygen Alliance.

References

[1] G. Chung, S. Seneff, C. Wang, and L. Hetherington. A dynamic vocabulary spoken dialogue interface. In Proceedings of Interspeech, 2004.

[2] S. Seneff. Response planning and generation in the MERCURY flight reservation system. Computer Speech and Language, vol. 16, pp. 283-312, 2002.