CSAIL Research Abstracts - 2005

Dynamically Incorporating Contextual Cues into Language Models for Spoken Dialogue Systems

Alexander Gruenstein, Chao Wang & Stephanie Seneff

Problem Statement

Humans effortlessly use conversational context to understand and anticipate what others are saying to them. For instance, if we ask a companion at a noisy cocktail party "What's your favorite city in Texas?" and hear a response that sounds like it could be either "Austin" or "Boston", we easily prefer the former. In spoken dialogue systems, language models are designed to prefer word sequences that are likely in the supported domain. In most systems, the approach to language modeling lies close to one of two extremes: either a single large model is used to recognize all user utterances, or a series of smaller (often context-free) models is used, one for each dialogue state in the system. Typically, state-based systems demonstrate strong system initiative, in which the system guides an information-seeking dialogue in a step-by-step fashion, eliciting the user's response from a narrowly constrained set of alternatives. In contrast, single-language-model systems typically arise as part of an information-state update approach to dialogue, in which the aim is to allow the user to feel as though she could say anything that naturally comes to mind at a particular point in the dialogue. Because such a system does not guide the conversation so strongly, it is typically difficult for a dialogue system designer to partition potential information states into all possible dialogue states and enumerate their transitions. This makes it less straightforward to craft state-specific language models in advance, and often results in the use of a single large model trained on all in-domain data.

We seek to develop language models that place no constraints on what the user can say at any given point in the dialogue, yet leverage conversational context to give greater weight to utterances the user is likely to produce. In particular, we are pursuing an approach that biases a single language model on the fly, rather than pre-compiling state-specific models. We believe this approach will better allow dialogue systems to adapt to expectations inferred from the discourse.

Our Approach

The Spoken Language Systems group has developed a speech recognizer (SUMMIT) that can efficiently modify, in real time, the expansions of particular classes of an n-gram language model [1]. We are currently developing techniques that use this capability to dynamically bias a dialogue system's language model based on context, as represented by the system's information state. We do so by creating context-sensitive dynamic classes that are updated at every turn as the dialogue with the user unfolds. We have developed techniques for creating two kinds of class: strong context-sensitive dynamic classes, which capture highly specific expectations about phrases the user is likely to utter, and weak context-sensitive dynamic classes, which capture more general expectations about the types of phrases a user might utter.


We have implemented our approach within the MERCURY flight reservation dialogue system [2]. Here we describe how we train both strong and weak context-sensitive dynamic classes.

Strong context-sensitive dynamic classes

Strong context-sensitive classes expand to a small list of phrases that are salient in the current context. For instance, if the system has just offered the user a short list of available flight times, then phrases that refer to these times become expansions of a strong context-sensitive dynamic class called $dyntime. We created the following strong dynamic classes in this domain:

  • $dynsource: the city the user is departing from
  • $dyndestination: the city to which the user wants to go
  • $dynairline: the possible airlines the system has offered as available for that route
  • $dyntime: the times of the flights the system has offered
  • $dynnthflight: phrases such as "the first one" or "the second flight", depending on the number of flights offered

Figure 1 shows a sample dialogue snippet after the user's utterances have been tagged with context-sensitive dynamic classes (as well as with normal classes). Figure 2 shows how the dialogue information state following utterance S3 gives rise to a particular set of expansions for each dynamic class, which in turn allows the word "two" in utterance U3 to be tagged with the context-dependent class $dyntime.

S1: How may I help you?
U1: I'd like to fly from Austin/$city to Oakland/$city on the third/$digit.
S2: Okay, from Boston [misrecognized] to Oakland on March third. Can you provide an approximate departure time or airline?
U2: Not Boston/$dynsource, Austin/$city.
S3: Okay, from Austin to Oakland on March third. I've got a flight on American at two o'clock, would that work? Or I've got one on United at four thirty.
U3: How about the flight at two/$dyntime.

Figure 1: A dialogue snippet tagged with strong context-sensitive dynamic classes and normal static classes
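
The tagging in Figure 1 can be pictured as a phrase lookup against the class expansions that are active at a given turn. The following is a minimal sketch of that idea, not the tagger used in MERCURY: the function names `lookup` and `tag_utterance`, the dictionary layout, the greedy longest-match strategy, and the rule that dynamic classes take precedence over static ones are all assumptions made for illustration.

```python
def lookup(phrase, dynamic, static):
    """Return the first class containing `phrase`, checking dynamic classes first."""
    for classes in (dynamic, static):
        for name, phrases in classes.items():
            if phrase in phrases:
                return name
    return None


def tag_utterance(words, dynamic, static):
    """Tag words with class names using greedy longest-phrase matching."""
    tagged, i = [], 0
    while i < len(words):
        for length in range(len(words) - i, 0, -1):
            phrase = " ".join(words[i:i + length]).lower()
            name = lookup(phrase, dynamic, static)
            if name is not None:
                tagged.append(f"{phrase}/{name}")
                i += length
                break
        else:
            # No class matched any phrase starting here; keep the word untagged.
            tagged.append(words[i].lower())
            i += 1
    return tagged


# Hypothetical expansions after system turn S2 in Figure 1, where the
# source city was misrecognized as Boston.
dynamic = {"$dynsource": {"boston"}, "$dyndestination": {"oakland"}}
static = {"$city": {"austin", "boston", "oakland"}}
print(tag_utterance("Not Boston Austin".split(), dynamic, static))
```

With these assumed expansions, "Boston" receives the dynamic tag because it is the source city currently in the information state, while "Austin" falls back to the static $city class, mirroring utterance U2.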


:date "March3"
:source "AUS"
:destination "OAK"
:reply_frame { :best_departure { :departure_time "2:00pm" :airline "AA" }
               :second_departure { :departure_time "4:30pm" :airline "UA" } }

$dynsource: "austin"
$dyndestination: "oakland"
$dyntime: "two", "two o clock", "two p m", "two o clock p m", "four thirty", "four thirty p m"
$dynairline: "united", "united airlines", "american", "american airlines"
$dynnthflight: "the first one", "the first flight", "the second one", "the second flight"

Figure 2: A dialogue system information-state snippet with the corresponding strong dynamic class expansions
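
As a rough illustration of how an information state like the one in Figure 2 could be mapped to strong dynamic class expansions, consider the sketch below. It is written under stated assumptions and does not describe the actual MERCURY code: the lookup tables `CITIES`, `AIRLINES`, `MINUTES` and `ORDINALS`, the function names, and the representation of the information state as a Python dictionary are all hypothetical.

```python
import re

# Hypothetical lookup tables; a real system would draw these from its database.
HOURS = ["one", "two", "three", "four", "five", "six",
         "seven", "eight", "nine", "ten", "eleven", "twelve"]
MINUTES = {15: "fifteen", 30: "thirty", 45: "forty five"}  # assumed coverage only
CITIES = {"AUS": "austin", "OAK": "oakland"}
AIRLINES = {"AA": ["american", "american airlines"],
            "UA": ["united", "united airlines"]}
ORDINALS = ["first", "second", "third", "fourth"]  # supports up to four flights


def spoken_times(clock):
    """Expand a time such as '2:00pm' into the spoken variants listed in Figure 2."""
    m = re.fullmatch(r"(\d{1,2}):(\d{2})(am|pm)", clock)
    hour, minute = HOURS[int(m.group(1)) - 1], int(m.group(2))
    half = " ".join(m.group(3))  # 'pm' -> 'p m'
    # Minutes outside the MINUTES table raise KeyError in this sketch.
    base = [hour, f"{hour} o clock"] if minute == 0 else [f"{hour} {MINUTES[minute]}"]
    return base + [f"{b} {half}" for b in base]


def strong_expansions(state):
    """Build the expansions for each strong dynamic class from the information state."""
    flights = list(state["reply_frame"].values())
    return {
        "$dynsource": [CITIES[state["source"]]],
        "$dyndestination": [CITIES[state["destination"]]],
        "$dyntime": [t for f in flights for t in spoken_times(f["departure_time"])],
        "$dynairline": [n for f in flights for n in AIRLINES[f["airline"]]],
        "$dynnthflight": [f"the {ORDINALS[i]} {noun}"
                          for i in range(len(flights)) for noun in ("one", "flight")],
    }


state = {
    "date": "March3", "source": "AUS", "destination": "OAK",
    "reply_frame": {
        "best_departure": {"departure_time": "2:00pm", "airline": "AA"},
        "second_departure": {"departure_time": "4:30pm", "airline": "UA"},
    },
}
for name, phrases in strong_expansions(state).items():
    print(name, phrases)
```

Recomputing this mapping after every system turn would give the recognizer a fresh set of expansions for each class, which is the per-turn update the approach relies on.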

Weak context-sensitive dynamic classes

We define weak context-sensitive classes as applying when the dialogue context provides knowledge of the type of information the user might soon supply, but with less specificity than in the strong case above. Specifically, we have experimented with expectations regarding airlines and locations. Using the same procedure as above, we have tagged occurrences of locations and airlines according to which of three conditions held when the utterance occurred:

  • C1: The current dialogue context is such that the user has just been prompted for one of the class members: for example, the system has prompted the user for an airline.
  • C2: The system has just prompted with "How may I help you?"
  • C3: All other contexts.

Unlike strong classes, the class expansions for the weak classes are not manipulated at run time. Instead, particular pre-trained weak classes are enabled or disabled at run time depending on which of C1, C2, or C3 applies. Thus, search paths through the language model are toggled on or off depending on the dialogue context.
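
The toggling described above can be sketched as follows. This is only an illustration under assumptions: the source does not say how the pre-trained weak classes are named or how the system encodes its last prompt, so the class-name scheme (`$airline_c1` and so on), the prompt labels, and both function names are hypothetical.

```python
def weak_condition(last_prompt, class_name):
    """Map the dialogue context to one of the three conditions C1, C2, C3."""
    if last_prompt == class_name:
        return "C1"  # the user was just prompted for a member of this class
    if last_prompt == "greeting":
        return "C2"  # the system just said "How may I help you?"
    return "C3"      # all other contexts


def enabled_weak_classes(last_prompt, classes=("airline", "location")):
    """Return the set of pre-trained weak classes to enable for this turn."""
    return {f"${c}_{weak_condition(last_prompt, c).lower()}" for c in classes}


print(sorted(enabled_weak_classes("airline")))
print(sorted(enabled_weak_classes("greeting")))
```

Only one variant of each class is active at a time, so the expansions themselves never change at run time; only the choice of which pre-trained variant the search may enter does.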


We evaluated the strong and weak techniques independently using a corpus of 26,886 utterances collected over several years of real user interaction with various versions of the MERCURY flight reservation system [2]. We trained with 24,815 utterances and tested on 2,071, creating trigram language models with a base vocabulary size of 1,586 (excluding class expansions noted below). We tested under four different conditions produced by manipulating how the expansion weights for several of the classes were calculated and by controlling the class vocabulary sizes as follows:

  • CM: Expansion weights for $city, $city_state, and $airline based on corpus statistics; medium vocabulary: 516 city names, 329 city_states, 68 airlines.
  • UM: Expansion weights uniform; medium vocabulary.
  • PL: Expansion weights for $city and $city_state based on population, $airline uniform; large vocabulary: 16,956 city names, 25,334 city_states, 68 airlines.
  • UL: Expansion weights uniform; large vocabulary.
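
The difference between uniform and data-driven expansion weights can be illustrated with a small sketch. The source does not give the formulas used, so this assumes simple normalization: uniform weights are 1/N, and corpus- or population-based weights are proportional to a per-member score. The `floor` parameter, which keeps unseen members from receiving zero weight, is likewise an assumption, and the counts in the example are invented for illustration.

```python
def uniform_weights(members):
    """Give every class member the same expansion weight."""
    return {m: 1.0 / len(members) for m in members}


def proportional_weights(scores, floor=1.0):
    """Weight members in proportion to a score such as corpus count or population.

    `floor` is added to every score so that members with a score of zero
    still receive a nonzero weight (an assumption, not from the source).
    """
    total = sum(s + floor for s in scores.values())
    return {m: (s + floor) / total for m, s in scores.items()}


# Illustrative counts only; these are not statistics from the MERCURY corpus.
print(uniform_weights(["boston", "austin", "oakland", "denver"]))
print(proportional_weights({"boston": 3, "austin": 0}))
```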

Figure 3 shows the improvements in overall word error rates obtained in each condition by the context-sensitive language models.

Figure 3: Comparison of word error rates of baseline, strong, and weak language models.
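
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the recognizer's hypothesis and the reference transcript, divided by the number of reference words. The sketch below shows that standard definition; it is not the scoring tool used in these experiments, and the function name is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("from austin to oakland", "from boston to oakland"))  # 0.25
```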


This research is funded in part by an industrial consortium supporting the MIT Oxygen Alliance.


[1] G. Chung, S. Seneff, C. Wang, and L. Hetherington. A dynamic vocabulary spoken dialogue interface. In Proceedings of Interspeech, 2004.

[2] S. Seneff. Response planning and generation in the MERCURY flight reservation system. Computer Speech and Language, vol. 16, pp. 283-312, 2002.

