CSAIL Research Abstract

Developing Error Recovery Strategies Via Simulated Utterances from Existing Corpora

Ed Filisko & Stephanie Seneff

Introduction and Motivation

Spoken dialogue systems are emerging as an intuitive interface for providing conversational access to online information sources. While the effectiveness of such systems has improved significantly over the past several years, a critical barrier to widespread deployment remains in the form of communication breakdown at strategic points in the dialogue, often when the user is trying to convey a critical piece of information that the system repeatedly misunderstands.

This research addresses the problems of error detection and recovery in a situation where the user is attempting to provide a named entity (e.g., a city name) that the system fails to understand. Detecting errors via tedious confirmation sub-dialogues for every attribute provided would annoy users and could make them unwilling to use the system at all. Hence, the system needs to be selective, invoking confirmation sub-dialogues only when it perceives a communication breakdown. Error recovery is also problematic, in that the system may persist in misunderstanding repeated spoken renditions of the same named entity. In this case, a substantially different tactic must be adopted to ensure higher communicative success.

Obtaining appropriate user data for developing and evaluating different error handling strategies is costly and difficult. Specifically, it is difficult to influence a user's behavior so that error-prone scenarios arise naturally and frequently.

User simulation is employed here as an approach to this problem. This technique has increasingly been exploited by researchers in spoken dialogue systems to aid in system development, as well as to facilitate highly-controlled evaluation experiments [1,2,3]. These simulations, however, have either required a data collection effort with real subjects, or they produced realistic but artificial utterances.

In previous work, we were able to locate points of communication breakdown in existing user dialogues, and simulate a continuation of those conversations with a different error-recovery approach, using the DECtalk synthesizer [4] to generate waveforms for a solicited speak-and-spell sub-dialogue [5].

The solution presented here furthers this idea of using existing corpora for user simulation. In this novel technique, an existing corpus of in-domain utterances is recombined with synthetic waveforms to produce a wide array of new utterances based on real-user input. This method allows us to retain the rich variety of ways in which users pose utterances, while introducing, in a controlled manner, rare cities that require an error-recovery sub-dialogue to be resolved. We can then assess our error-recovery strategies in the context of these simulated user dialogues.

Creative Utterance Generation

Figure 1 shows how the system creates an utterance. The simulated user passes the generator a list of the attribute-value pairs it wants to speak. The generator selects an utterance template based on the user's attributes, and plugs in the user's values. For example, suppose the simulated user wanted to go from Boston to Avon Lake. The generator would select a template such as:

The synthesizer [6] converts the string to a waveform, combining as few speech segments as possible from a corpus of real-user utterances and synthesized words. The synthesized words are created using DECtalk. They represent attribute values we wish to test, but they do not exist in the corpus as part of a real-user utterance. Examples of such values include foreign or multiple-word city names. The resulting waveform is then passed to the recognizer as a normal user utterance.
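The template step described above can be illustrated with a minimal Python sketch. It covers only the text-generation stage, not the waveform synthesis. The template strings, the `<name>` placeholder syntax, and the names `TEMPLATES` and `generate_utterance` are all hypothetical and invented for illustration; they are not taken from the actual system, whose templates are derived from real-user utterances.

```python
from typing import Dict, FrozenSet, List

# Hypothetical templates keyed by the set of attributes they express.
# These strings are invented for illustration only.
TEMPLATES: Dict[FrozenSet[str], List[str]] = {
    frozenset({"source", "destination"}): [
        "i would like to fly from <source> to <destination>",
    ],
    frozenset({"source", "destination", "date"}): [
        "i need a flight from <source> to <destination> on <date>",
    ],
    frozenset({"destination"}): [
        "i want to go to <destination>",
    ],
}


def generate_utterance(attributes: Dict[str, str], choice: int = 0) -> str:
    """Select a template matching the attribute names and plug in the values."""
    key = frozenset(attributes)
    candidates = TEMPLATES.get(key)
    if not candidates:
        raise KeyError(f"no template for attributes {sorted(key)}")
    # 'choice' picks among templates that express the same attributes.
    text = candidates[choice % len(candidates)]
    for name, value in attributes.items():
        text = text.replace(f"<{name}>", value)
    return text


utterance = generate_utterance({"source": "Boston", "destination": "Avon Lake"})
```

In this sketch the resulting string would then be handed to the synthesizer, which is outside the scope of the example.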

Figure 1: The simulation cycle showing how a simulated user utterance is generated.

Experiments

These experiments were conducted in a flight reservations domain. The system's goal for each dialogue was to confidently acquire the source, destination, and date intended by the simulated user. Once the system had either acquired (or abandoned the acquisition of) each attribute, it thanked the user and moved on to a new dialogue. Various configurations of the simulated user's behavior were explored. Table 1 describes the five different scenarios used in the experiments.

Configuration	A	B	C	D	E
Source	known	known	known	known	unknown
Destination	known	known	known	known	unknown
Initial Utterance Contents	city & state	city	city	city	city
Speak-and-Spell Requested	yes	yes	yes	no	yes
Compliant User	yes	yes	no	n/a	yes

Table 1: The five simulation configurations used in the experiments.

To begin each dialogue, the simulated user randomly selects values for source, destination, and date, and maintains these for the remainder of the dialogue. The user then provides either a city name or both a city and state name in its initial utterance to the system. The system recognizes and parses the utterance, and searches for any required attributes, in this case source, destination, or date. The system bases its next response on the confidence scores of the city names and on whether they have been detected as out-of-vocabulary (OOV) words. If a word is OOV or has a low confidence score, the system asks the user to speak and spell the intended city name (e.g., "Boston. B O S T O N."). A moderate confidence score may trigger the system to simply request an explicit confirmation from the user, as in, "Did you say Boston?" If a city name is ambiguous, the system must request a state name, whereas if a city name is unique, its state name is implicitly confirmed without input from the user. If the system cannot identify a required attribute, it explicitly requests it. The system determines which action to take by consulting an external table specified by the system developer. It continues this process until it has confidently obtained all required attributes.
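The decision logic in the preceding paragraph can be sketched as follows. This is an illustrative reading, not the system's actual table: the numeric thresholds, the order in which the conditions are checked, and all names (`CityHypothesis`, `next_action`, the action labels) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; the source gives no numeric values.
LOW_CONFIDENCE = 0.3
HIGH_CONFIDENCE = 0.7


@dataclass
class CityHypothesis:
    name: str
    confidence: float   # assumed to lie in [0, 1]
    is_oov: bool        # flagged as out-of-vocabulary by the recognizer
    is_ambiguous: bool  # the same city name exists in more than one state


def next_action(hyp: Optional[CityHypothesis]) -> str:
    """Choose the system's next move for one required city attribute."""
    if hyp is None:
        # The attribute could not be identified in the utterance at all.
        return "request_attribute"
    if hyp.is_oov or hyp.confidence < LOW_CONFIDENCE:
        return "request_speak_and_spell"
    if hyp.confidence < HIGH_CONFIDENCE:
        return "request_explicit_confirmation"
    if hyp.is_ambiguous:
        return "request_state"
    # Confident and unique: the state name is implicitly confirmed.
    return "accept_with_implicit_state"
```

Keeping the thresholds and action labels in one place loosely mirrors the idea of an external, developer-specified table, although the real table format is not described in the source.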

Through experience, we have learned that user behavior affects the length of each dialogue, as well as the system's ability to acquire all of the user's intended attributes. The speak-and-spell mechanism in a live weather information dialogue system revealed that users commonly do not provide a speak-and-spell utterance when one is requested of them. Often, they just repeat their previous utterance. A major goal of this research is to simulate this type of behavior to observe whether a given dialogue system is robust enough to recover from such situations.
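A simulated user's reply to a speak-and-spell request could be modeled along these lines. The function names and the boolean `compliant` flag are hypothetical; the reply format follows the "Boston. B O S T O N." example given earlier.

```python
def speak_and_spell(city: str) -> str:
    """Build a speak-and-spell string: the city name, then its letters."""
    letters = " ".join(ch.upper() for ch in city if ch.isalpha())
    return f"{city}. {letters}."


def respond_to_spell_request(city: str, previous_utterance: str, compliant: bool) -> str:
    """A compliant user speaks and spells; a non-compliant one repeats itself."""
    if compliant:
        return speak_and_spell(city)
    return previous_utterance


reply = respond_to_spell_request("Boston", "i want to go to Boston", compliant=True)
```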

Results and Discussion

Five experiments were run according to the configurations in Table 1. Each experiment consisted of 50 dialogues. A single dialogue is considered successful if all acquired attributes (source city, source state, destination city, destination state, and date) were correct, that is, 100% accurate. The largest number of successful dialogues is observed in configuration A, in which the simulated user provided a known city and the associated state name. While only 36 of 50 dialogues were successful in this configuration, more than 88% of the attributes were correctly obtained. As expected, configuration A also shows the lowest average number of turns, 6.6 per successful dialogue. Configuration B differs from A only in that just the city was provided. The increase to an average of 7.5 turns most likely comes from the system's additional requests for the state name.
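The two measures used here, per-dialogue attribute accuracy and dialogue success, can be computed as in the sketch below. It assumes accuracy is taken over the five intended attributes and that an attribute the system abandoned counts as incorrect; the source does not spell out either point, and the function and key names are hypothetical.

```python
from typing import Dict

ATTRIBUTES = ("source_city", "source_state", "dest_city", "dest_state", "date")


def attribute_accuracy(intended: Dict[str, str], acquired: Dict[str, str]) -> float:
    """Fraction of intended attributes that the system acquired correctly."""
    # Assumption: a missing (abandoned) attribute is scored as incorrect.
    correct = sum(1 for name in ATTRIBUTES if acquired.get(name) == intended[name])
    return correct / len(ATTRIBUTES)


def is_successful(intended: Dict[str, str], acquired: Dict[str, str]) -> bool:
    """A dialogue succeeds only if every attribute is correct."""
    return all(acquired.get(name) == intended[name] for name in ATTRIBUTES)


intended = {
    "source_city": "Boston", "source_state": "Massachusetts",
    "dest_city": "Avon Lake", "dest_state": "Ohio", "date": "May 3",
}
# One attribute wrong: four of five correct, so the dialogue is not successful.
acquired = dict(intended, dest_city="Avon")
```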

Configuration C models a non-compliant user, that is, one that did not provide a speak-and-spell utterance when the system requested it. Instead, the simulated user was programmed to repeat its last utterance. To handle this situation, the system runs both a main recognizer and a special speak-and-spell recognizer while awaiting the user's response. If the user chooses not to comply, the main recognizer can catch what was spoken, and the hypothesis from the speak-and-spell recognizer can be ignored. Since nearly 89% of the acquired attributes were correct, this strategy appears able to handle such non-compliant behavior when displayed by real users. As real-user behavior can never be completely anticipated, however, this strategy must be tested on a live dialogue system in order to determine its true utility.
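One simple way to arbitrate between the two recognizers is sketched below. The source does not state how the system decides which hypothesis to keep; choosing the higher-confidence hypothesis, and breaking ties in favor of the speak-and-spell recognizer, are assumptions made purely for illustration, as are the names used.

```python
from typing import NamedTuple


class Hypothesis(NamedTuple):
    text: str
    confidence: float  # assumed comparable across the two recognizers


def arbitrate(main: Hypothesis, spell: Hypothesis) -> Hypothesis:
    """Keep the more confident hypothesis; the other one is ignored."""
    # Assumption: ties go to the speak-and-spell recognizer, since a
    # speak-and-spell utterance is what the system asked for.
    return spell if spell.confidence >= main.confidence else main


# A non-compliant user repeats a full sentence, so the main recognizer
# would be expected to score higher than the speak-and-spell recognizer.
chosen = arbitrate(
    Hypothesis("i want to go to boston", 0.8),
    Hypothesis("b o s t", 0.2),
)
```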

In configuration D, the system was prohibited from requesting a speak-and-spell utterance from the user. It could only request explicit confirmation of its own hypotheses. The results were better than expected, with 33 successful dialogues and an average attribute accuracy of more than 91%. This performance is comparable to that of the configurations using the speak-and-spell mode of acquisition. However, it must be noted that these results reflect a simulated user that only provides city names known to the system.

In configuration E, the cities provided by the simulated user were unknown to the system. This means the system would have to rely entirely on the speak-and-spell mechanism in order to recover. A system set up in configuration D, for example, would fail completely since speak-and-spell mode would never be requested. Configuration E displays the worst performance; nevertheless, the fact that nearly 86% of the acquired attributes were correct is very encouraging. We will aim to improve this performance through further development of the system.

Configuration	A	B	C	D	E
Successful Dialogues (out of 50)	36	31	32	33	30
Average Turns Per Successful Dialogue	6.6	7.5	7.5	7.7	9.0
Average Attribute Accuracy Per Dialogue	88.4%	87.2%	88.8%	91.4%	85.9%

Table 2: Results of running 50 dialogues on each of the five simulation configurations shown in Table 1.

Future Work

In the future, we would like to continue such simulations to improve the speak-and-spell recovery mechanism, as well as other dialogue strategies. We intend to apply the simulation technique in other domains such as weather information and a personnel directory, where speak-and-spell mode should greatly assist in the acquisition of people's names. It would be beneficial to generalize this technique so that it can serve as a tool for developing and testing dialogue strategies, uncovering problems before a system is deployed to real users.

Research Support

This research was supported by an industrial consortium supporting the MIT Oxygen Alliance.

References

[1] Konrad Scheffler and Steve Young. Corpus-Based Dialogue Simulation for Automatic Strategy Learning and Evaluation. In Proc. NAACL Workshop on Adaptation in Dialogue Systems, Pittsburgh, Pennsylvania, June 2001.

[2] Grace Chung. Developing a Flexible Spoken Dialogue System Using Simulation. In Proc. ACL 2004, pp. 63-70, Barcelona, Spain, July 2004.

[3] Wieland Eckert, Esther Levin, and Roberto Pieraccini. User Modeling for Spoken Dialogue System Evaluation. In Proc. IEEE ASRU Workshop, Santa Barbara, California, December 1997.

[5] Ed Filisko and Stephanie Seneff. Error Detection and Recovery in Spoken Dialogue Systems. In Proc. HLT-NAACL Workshop on Spoken Language Understanding for Conversational Systems, pp. 31-38, Boston, Massachusetts, May 2004.

[6] Jon Yi and James Glass. Information-Theoretic Criteria for Unit Selection Synthesis. In Proc. ICSLP, pp. 2617-2620, Denver, Colorado, September 2002.