Research Abstracts - 2006

Learning Decision Models in Spoken Dialogue Systems Via User Simulation

Ed Filisko & Stephanie Seneff

Introduction and Motivation

Spoken dialogue systems are a natural means to interact with online information sources. Although the effectiveness of such systems has improved significantly over the past few years, a critical barrier to widespread deployment remains in the form of communication breakdown at strategic points in the dialogue. Such breakdown often occurs when the user is trying to convey a named entity from a large or open vocabulary set. For example, there is no easy way to inform a user about which cities are known in a weather or flight domain.

If a user speaks a word that is unknown to the recognizer, an endless cycle of confusion could ensue, unless the system can acquire the user's intended value. We have explored the utility of requesting a speak-and-spell utterance for city names (e.g., "Boston B O S T O N") from the user, and have developed this strategy within a simulation framework [1].

In developing such a recovery strategy, we discovered a problem that was difficult to solve using our standard technique of manually specifying rules to control the system's side of the dialogue. This conflict problem occurs when a newly hypothesized value conflicts with the value currently believed by the dialogue system. It arises because a speech recognizer provides only a best hypothesis of what the user said: the recognizer may spuriously hypothesize a city name, or it may hypothesize a known city name when an unknown city name was actually spoken. What should the system do in such a case? It could ignore the new value entirely, implicitly confirm the new value (e.g., "Okay, to Reno"), explicitly confirm the new value (e.g., "Did you say to Reno?"), or request a spelling from the user (e.g., "Please spell your destination city").
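
As a concrete illustration, the four candidate responses can be viewed as a small set of dialogue actions; the following sketch is our own and its names are not taken from the deployed system.

    // Illustrative sketch only: the four candidate system responses to the
    // conflict problem, named after the options described above.
    public enum ConflictAction {
        IGNORE,            // discard the new value and keep the current belief
        IMPLICIT_CONFIRM,  // e.g., "Okay, to Reno."
        EXPLICIT_CONFIRM,  // e.g., "Did you say to Reno?"
        REQUEST_SPELLING   // e.g., "Please spell your destination city."
    }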

We decided to try a learning approach, incorporating user simulation, to determine how the system should respond given the conflict problem.

Related Work

Many researchers have explored the utility of learning approaches to dialogue management. Learning rules is much more efficient than manual specification, and it often yields rules that are unlikely to have been devised by a human. Reinforcement learning has typically been applied to determine optimal dialogue strategies for a given task [2,3], while supervised learning has often been used to predict problems in a dialogue [4,5,6]. Such learning, however, is typically based on static corpora of real users, as opposed to simulated dialogues. We intended to learn from a corpus of simulated dialogues, created using a novel method of utterance generation [1] based on an existing corpus as well as speech from the DECtalk synthesizer [7]. Simulation would enable us to generate a large number of dialogues and to exercise much greater control over the user's behavior, so that strategies for handling very specific problematic situations could be evaluated.

Learning the Ideal Response

Experiments were performed in a flight reservation domain. The simulated user's goal was to get the system to correctly understand five attributes: source city and state, destination city and state, and a date. A large set of simulated dialogues was generated, and each time the conflict problem was encountered, a set of features from the speech recognizer and the dialogue history was recorded, along with the "ideal" system response at that point. For example, if the system already believed the user's intended city name, the ideal action would be to ignore the newly hypothesized conflicting value.

The recorded data was then fed into JRip, an implementation of the RIPPER rule-induction algorithm [8] provided through WEKA, a publicly available, Java-based machine learning toolkit [9]. The result was a set of learned rules that predict the ideal response to the conflict problem.
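
For concreteness, the following is a minimal sketch, under our own assumptions, of how a JRip model can be trained with the WEKA API; the file name conflict_features.arff and the convention that the ideal action is the last attribute are illustrative, not part of the actual experiment code.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.rules.JRip;
    import weka.core.Instances;

    public class TrainConflictModel {
        public static void main(String[] args) throws Exception {
            // Hypothetical ARFF file: one instance per conflict event, holding
            // the recognizer/dialogue-history features plus the ideal action.
            Instances data = new Instances(new BufferedReader(
                    new FileReader("conflict_features.arff")));
            data.setClassIndex(data.numAttributes() - 1);  // class = ideal action

            JRip model = new JRip();       // WEKA's RIPPER implementation
            model.buildClassifier(data);   // induce the rule set

            System.out.println(model);     // print the learned rules
        }
    }

At runtime, the learned classifier's classifyInstance method can then map the features of a new conflict event to one of the four actions.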

Testing with a Simulated User

The learned model was incorporated into a system and a set of simulations was performed for testing. A set of 50 dialogues was run ten times and the results were averaged. A baseline system was also tested for comparison. Instead of using a learned model in response to the conflict problem, the baseline system used a set of simple, manually specified rules.
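
The abstract does not spell out the baseline rules; purely for illustration, and with invented feature names and thresholds, a manually specified policy of this kind might look like the following sketch.

    // Purely illustrative: a hand-written baseline policy for the conflict
    // problem. The feature names and thresholds here are invented for this
    // sketch; the actual baseline rules are not given in the abstract.
    public class ManualConflictPolicy {

        public static String decide(double newValueConfidence,
                                    boolean currentValueConfirmed) {
            if (currentValueConfirmed) {
                return "ignore";            // keep the already-confirmed value
            } else if (newValueConfidence > 0.8) {
                return "implicit_confirm";  // accept and echo the new value
            } else if (newValueConfidence > 0.5) {
                return "explicit_confirm";  // ask the user to verify
            } else {
                return "request_spelling";  // fall back to speak-and-spell
            }
        }
    }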

The results in Table 1 show that the learned model yielded a greater number of successful dialogues (i.e., dialogues in which all five attributes were correctly obtained), a slightly lower average number of turns per successful dialogue, and a higher average attribute accuracy than the manual model.

           No. Successful   Avg. No. Turns per    Average Attribute
           Dialogues        Successful Dialogue   Accuracy
Manual     33.2             7.8                   86.4%
Learned    35.9             7.7                   89.0%

Table 1: Overall results of the manual vs. learned model in response to the conflict problem with a simulated user.

A more detailed analysis showed how well each model performed in predicting the ideal action. Tables 2 and 3 show confusion matrices for the decisions made by the manual and learned models, respectively.

                      Reference
Decision              Ignore (64.0%)   Implicit (10.7%)   Explicit (3.4%)   Spelling (21.9%)
Ignore (77.4%)        38.6             3.9                1.8               10.0
Implicit (14.1%)      6.2              1.8                0.2               1.7
Explicit (2.1%)       0.0              0.7                0.0               0.8
Spelling (6.4%)       0.1              1.1                0.4               2.9

Table 2: Confusion matrix for the decisions made using the manual decision model. An overall accuracy of 61.7% was obtained. The values represent the average number of decisions over ten runs.

                      Reference
Decision              Ignore (71.7%)   Implicit (9.5%)   Explicit (2.7%)   Spelling (16.1%)
Ignore (77.3%)        43.6             2.3               0.8               4.6
Implicit (8.6%)       2.7              1.9               0.0               1.1
Spelling (14.1%)      1.3              2.1               1.0               5.0

Table 3: Confusion matrix for the decisions made using the learned decision model. An overall accuracy of 76.1% was obtained. The values represent the average number of decisions over ten runs.

The learned model achieved 76.1% overall accuracy, compared to the manual model's 61.7%, indicating that the learned model is better able to help the system acquire a user's intentions. Correct predictions of the ideal action lie on the diagonal from the upper left corner to the lower right corner; overall accuracy is thus the sum of the diagonal divided by the total number of decisions (for the manual model, (38.6 + 1.8 + 0.0 + 2.9) / 70.2 ≈ 61.7%). The Ignore-Ignore cells show that both models were most accurate when predicting the ignore action: the manual model was correct just over 71% of the time (38.6/54.3), while the learned model was nearly 85% accurate (43.6/51.3).

It can also be seen that the learned model was more aggressive in predicting a spelling request, doing so on 14.1% of its total predictions versus 6.4% for the manual model. This may explain why the manual model is more accurate at predicting a truly ideal spelling request, at 64% (2.9/4.5), versus the learned model's 43% (4.6/10.7). The overall results, however, show that this confusion is not necessarily detrimental: it tends to increase the number of turns through excessive spelling requests, but the final attribute accuracy should not be affected.

One detrimental confusion is predicting Ignore when a spelling request is the ideal action. When the ideal action is a spelling request, the user's intended city is unknown to the system, so a spelling request must be initiated to acquire the correct value. Repeatedly ignoring the conflicting value keeps the system from requesting a spelling, and this can prevent it from ever acquiring the user's intended city.

Testing with Real Users

Simulated user utterances are, admittedly, not as varied and unpredictable as real user utterances, so we deployed a telephone-based spoken dialogue system to test the learned model with real users. The user's goal was the same as the simulated user's: get the system to understand a source city and state, a destination city and state, and a date. Each user visited a specified webpage and was presented with a unique scenario to be used in each of two dialogues. In one dialogue, the system used the manual model in response to the conflict problem, while in the other, the learned model was used. Data was collected from 40 user sessions involving 30 unique users.

The results indicate that the learned model outperformed the manual model overall, and significantly outperformed the manual model at the level of predicting the ideal response to the conflict problem. Table 4 shows the overall results and Tables 5 and 6 show the confusion matrices for the decision made by the manual and learned model, respectively.

           No. Successful   Avg. No. Turns per    Average Attribute
           Dialogues        Successful Dialogue   Accuracy
Manual     31/40            8.97                  93.0%
Learned    34/40            8.50                  96.0%

Table 4: Overall results of the manual vs. learned model in response to the conflict problem with real users.

                      Reference
Decision              Ignore (52.1%)   Implicit (12.7%)   Explicit (5.6%)   Spelling (29.6%)
Ignore (53.5%)        25               4                  2                 7
Implicit (19.7%)      7                5                  0                 2
Explicit (7.1%)       0                0                  0                 5
Spelling (19.7%)      5                0                  2                 7

Table 5: Confusion matrix for the decisions made using the manual decision model. An overall accuracy of 52.1% was obtained.

                      Reference
Decision              Ignore (52.3%)   Implicit (0.0%)   Explicit (0.0%)   Spelling (47.7%)
Ignore (50.0%)        19               0                 0                 3
Spelling (50.0%)      4                0                 0                 18

Table 6: Confusion matrix for the decisions made using the learned decision model. An overall accuracy of 84.1% was obtained.

The learned model performed significantly better than the manual model, with an accuracy of 84.1% versus 52.1%. As with the simulated user, performance is best when predicting the Ignore action. The learned model never predicted Implicit or Explicit, so no conclusions can be drawn about those actions, except that the model never found them to be appropriate. As for predicting a spelling request, the learned model is much more accurate (18/22 = 82%) than the manual model (7/14 = 50%).

The results also show that the manual model had a tendency to predict implicit confirm when ignore was the ideal action. This confusion would certainly be punishing since the system consequently believes a spuriously hypothesized value, placing a burden on the user to initiate correction of the error. Ameliorating such an error can be complicated by the fact that the system believes the incorrect value with relatively high confidence, which could ultimately prove much too frustrating for the user.

The Compliance Problem

This simulation/learning framework was also applied to another problem, which we call the compliance problem. In this case, the system requests a spelling from the user, who may not provide a spelling utterance. This can be problematic, since both the main recognizer and a special spelling recognizer are active when the system initiates a spelling request; a non-spelling utterance will result in strange hypotheses from the spelling recognizer. Therefore, the system must determine which recognizer to trust; in other words, it must predict whether the user was compliant or not. We used JRip to learn a model from data generated by a simulated user. When tested on a simulated user, the model predicted both compliant and noncompliant user responses with 100% accuracy. In the test with real users, the model obtained an accuracy of 96.3%.
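
As a sketch of how such a compliance prediction might be used at runtime, the following illustrates selecting which recognizer's hypothesis to trust; the class and method names are our own, not the system's.

    // Illustrative sketch: once the learned compliance model has classified
    // the user's response to a spelling request, the system picks which
    // recognizer's hypothesis to trust. The names below are hypothetical.
    public class ComplianceArbiter {

        public static String chooseHypothesis(boolean predictedCompliant,
                                               String spellingRecognizerHyp,
                                               String mainRecognizerHyp) {
            // Trust the spelling recognizer only when the user is predicted
            // to have actually spelled; otherwise use the main recognizer.
            return predictedCompliant ? spellingRecognizerHyp : mainRecognizerHyp;
        }

        public static void main(String[] args) {
            String spelled = "m y r t l e";   // spelling recognizer hypothesis
            String spoken  = "myrtle point";  // main recognizer hypothesis
            System.out.println(chooseHypothesis(true, spelled, spoken));
        }
    }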

Most real users were compliant with a spelling request. However, about 17% of responses to spelling requests did not contain a spelling, but often contained a repetition of the city name instead. Some responses contained a spelling embedded among other words, while other users provided spellings when they were not requested. From this latter behavior we learned that once the system reveals to a user that it can accept a spelling, it should be prepared for a spelling at any point in the dialogue.

In another case, a user would spell one word of a multi-word city name (e.g., Myrtle Point) and then be cut off by the recognizer. The resulting partial spelling happened to be a legitimate city name, and when asked, for example, "Did you say to Myrtle?", most users replied, "Yes." A more creative user instead continued to spell the second word of the city name, producing a spelling that spanned two utterances. Such behavior was unanticipated, but the simulation/learning framework developed in this work could be used to devise a strategy for handling it.

Summary and Future Work

We presented here a framework that can be used to learn decision models from simulated data. We showed that a learned model outperformed the manual model in responding to the conflict problem, and we learned a model to predict when a user was being noncompliant, which achieved 100% accuracy on simulated users and 96.3% accuracy with real users. Possible future work includes applying these learned models to the same problems in other domains. The majority of the features used to build the models were domain-independent, suggesting that the models could be applied in a generic manner. Additional work may include applying this framework to learn decision models for other problems, such as predicting whether a user has spoken Mandarin or English in a multilingual weather domain.

Research Support

This research is sponsored by the T-Party Project, a joint research program between MIT and Quanta Computer Inc., Taiwan, and by Nokia.

References:

[1] E. Filisko and S. Seneff. Developing City Name Acquisition Strategies in Spoken Dialogue Systems Via User Simulation. In Proceedings 6th SIGdial Workshop, pp. 144--155, Lisbon, Portugal, September 2005.

[2] S. Singh, D. Litman, M. Kearns, and M. Walker. Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System. Journal of Artificial Intelligence Research, 16:105--133, 2002.

[3] J. Henderson, O. Lemon, and K. Georgila. Hybrid Reinforcement/Supervised Learning for Dialogue Policies from COMMUNICATOR Data. In Proceedings 4th IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, pp. 68--75, Edinburgh, Scotland, 2005.

[4] D. Litman and S. Pan. Predicting and Adapting to Poor Speech Recognition in a Spoken Dialogue System. In Proceedings National Conference on Artificial Intelligence, pp. 722--728, Austin, Texas, 2000.

[5] A. van den Bosch, E. Krahmer, and M. Swerts. Detecting Problematic Turns in Human-Machine Interactions: Rule-Induction Versus Memory-Based Learning Approaches. In Proceedings Association for Computational Linguistics, Toulouse, France, 2001.

[6] M. Walker, I. Langkilde-Geary, H. Hastie, J. Wright, and A. Gorin. Automatically Training a Problematic Dialogue Predictor for a Spoken Dialogue System. Journal of Artificial Intelligence Research, 16:293--319, 2002.

[7] W. Hallahan. DECtalk Software: Text-to-speech Technology and Implementation. Digital Technical Journal, 7(4):5--19, 1995. Available at http://www.hpl.hp.com/hpjournal/dtj/vol7num4/vol7num4art1.pdf. Accessed March 16, 2006.

[8] W. Cohen. Fast Effective Rule Induction. In Proceedings 12th International Conference on Machine Learning, pp. 115--123, Tahoe City, California, 1995.

[9] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. Morgan Kaufmann, San Francisco, 2005.
