Research Abstracts - 2007

Efficient Model Learning for Dialog Management

Finale Doshi & Nicholas Roy

Figure 1 - We equipped a standard power wheelchair with a computer and additional sensors to enable it to navigate autonomously as well as process user input. See robotic wheelchair for hardware details.


Spoken language interfaces provide a natural way for humans to interact with robots, but noisy speech recognition and linguistic ambiguities often make it difficult for the robot to decipher the user's intent. Intelligent planning algorithms, such as the Partially Observable Markov Decision Process (POMDP), have succeeded in dialog management applications ([1],[2],[3]) because of their robustness to the inherent uncertainty of human interaction. Like all dialog planning systems, however, POMDPs require an accurate user model.

POMDPs consist of large probabilistic models with many parameters that govern what the user may want, how the user may express themselves, and how the user will react to the robot's actions. These parameters are difficult to specify from domain knowledge: how can we quantify the likelihood that the robot will hear "coffee machine" when the user asks about the copy machine? Moreover, gathering enough data to estimate the parameters accurately a priori is expensive in terms of the training time required from human operators.
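
To make the role of these parameters concrete, the following is a minimal sketch of how a dialog manager might track its belief about the user's goal after one noisy observation. The function name, the goal and keyword labels, and all probabilities are illustrative assumptions, not values from our system.

```python
def update_belief(belief, obs_model, heard):
    """Bayes update of a belief over user goals after hearing one keyword.

    belief[goal] is P(goal); obs_model[goal][word] is P(word heard | goal).
    Both tables are hypothetical stand-ins for the POMDP parameters.
    """
    unnormalized = {
        goal: p * obs_model[goal].get(heard, 0.0) for goal, p in belief.items()
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        # The keyword is impossible under the model; leave the belief unchanged.
        return dict(belief)
    return {goal: p / total for goal, p in unnormalized.items()}


# Illustrative numbers only: the recognizer sometimes confuses the two goals.
belief = {"copy machine": 0.5, "coffee machine": 0.5}
obs_model = {
    "copy machine": {"copy": 0.7, "coffee": 0.3},
    "coffee machine": {"copy": 0.2, "coffee": 0.8},
}
posterior = update_belief(belief, obs_model, "coffee")
```

With these assumed numbers, hearing "coffee" shifts the belief toward the coffee machine without ruling out the copy machine, which is why the quality of the observation probabilities matters so much.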

In our work on a dialog manager for a robotic wheelchair (Figure 1), we take a Bayesian approach to learning the POMDP parameters through user interactions. We capitalize on the fact that while we may not know the true parameters, we can often guess a reasonable starting point. For example, we know that a user will be quite frustrated if the wheelchair drives to the wrong location. By incorporating this kind of basic domain knowledge into our prior over parameters, our system can act robustly even as it adapts itself to a specific user and speech recognition system.
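
One common way to encode such a prior is with Dirichlet pseudo-counts that are incremented as dialogs are observed. The sketch below assumes this representation and assumes that the user's goal becomes known once the user confirms it; the class name, goal labels, and counts are hypothetical.

```python
class DirichletObservationModel:
    """Pseudo-counts over keywords for each goal (a Dirichlet prior per goal)."""

    def __init__(self, prior_counts):
        # prior_counts[goal][word] encodes domain knowledge as pseudo-counts.
        self.counts = {goal: dict(words) for goal, words in prior_counts.items()}

    def record(self, goal, word):
        """Add one observation of `word` for a goal the user has confirmed."""
        self.counts[goal][word] = self.counts[goal].get(word, 0.0) + 1.0

    def expected(self, goal):
        """Posterior mean of P(word | goal): the normalized counts."""
        total = sum(self.counts[goal].values())
        return {word: c / total for word, c in self.counts[goal].items()}


# Assumed prior: we expect "gates" to dominate, and "elevator" to be rare.
model = DirichletObservationModel({"Gates Tower": {"gates": 4.0, "elevator": 1.0}})
for _ in range(5):
    model.record("Gates Tower", "elevator")  # user keeps saying "elevator"
learned = model.expected("Gates Tower")
```

A larger prior count expresses more confidence in the initial guess and makes the model adapt more slowly; a smaller one lets a few dialogs reshape the model quickly.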

Our Approach

Solving even small POMDPs is computationally hard, and in our work we explore several approaches for learning the dialog manager's parameters in a robust and tractable way:

  • Using Expected Values of Uncertain Parameters. Updating our priors over the uncertain parameters is not difficult, but what is the correct way to behave when the model is uncertain? We show that in the absence of learning, simply using the expected values of the parameters will maximize the expected performance.
  • Model as Hidden State. Planning with only the expected values of the parameters is fast, but to be truly robust, we must consider the uncertainty of the parameters when interacting with the user. Thus we also consider treating the model parameters (such as those that govern the user's preferences) as hidden states that we must discover, in addition to discovering the user's intent. By making the system explicitly aware of its uncertainty, we show that it makes smarter decisions. We also develop methods for efficiently solving these large but structured POMDPs.
  • Policy Queries. In most Bayesian reinforcement learning contexts ([4], [5]), the only way for an agent to discover the consequences of a poor decision is to try it. Thus, even if the agent is aware of the uncertainty in its model, it may still try a rash action to "check" what the consequences really are. An important innovation in our work is to let the wheelchair ask the user about actions that it should have taken or should take, thereby allowing the wheelchair to learn about the user's preferences in a "safe" way.
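
The last two ideas can be illustrated with a deliberately simplified sketch: the system keeps weights over a small set of candidate reward models (the hidden state) and reweights them using the user's answer to a policy query. This is not our POMDP solver. It assumes a myopic action choice based on immediate reward, a fixed probability `agree` that the user's answer matches their true model, and hypothetical action names and reward values.

```python
def best_action(belief, reward):
    """Action with the highest expected immediate reward under `belief`.

    reward[action][goal] is the (assumed) reward for taking `action`
    when the user's true goal is `goal`.
    """
    def expected(action):
        return sum(p * reward[action][goal] for goal, p in belief.items())

    return max(reward, key=expected)


def policy_query_update(model_weights, models, belief, user_choice, agree=0.9):
    """Reweight candidate reward models after the user answers a policy query.

    A model that would have picked the user's stated action gets likelihood
    `agree`; any other model gets 1 - agree. Both are assumptions.
    """
    updated = {}
    for name, weight in model_weights.items():
        matches = best_action(belief, models[name]) == user_choice
        updated[name] = weight * (agree if matches else 1.0 - agree)
    total = sum(updated.values())
    return {name: w / total for name, w in updated.items()}


belief = {"Gates": 0.7, "Dreyfoos": 0.3}
models = {
    # Wrong trips are very costly; questions are cheap.
    "patient": {
        "drive_gates": {"Gates": 10.0, "Dreyfoos": -100.0},
        "confirm_gates": {"Gates": -1.0, "Dreyfoos": -1.0},
    },
    # Questions are annoying; wrong trips are only mildly costly.
    "impatient": {
        "drive_gates": {"Gates": 10.0, "Dreyfoos": -10.0},
        "confirm_gates": {"Gates": -5.0, "Dreyfoos": -5.0},
    },
}
weights = {"patient": 0.5, "impatient": 0.5}
# The wheelchair asks what it should do here; the user says "confirm first".
weights = policy_query_update(weights, models, belief, "confirm_gates")
```

The point of the sketch is that the weights shift toward the "patient" model without the wheelchair ever having to drive to the wrong location to find out how costly that is.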

Current Results

Table 1 shows an example dialog of how our learning dialog manager adapts to its user (using the expected value approach described above). In the first dialog, the system does not know that the word "elevator" is a location. In the later dialog, the dialog manager has discovered (1) the word "elevator" refers to two locations in the building, and (2) usually the user means the Gates elevator when she says "elevator." Thus, the dialog manager attempts to confirm the Gates location first.

Table 1 - The pair of dialogs shows how the dialog manager has adapted to the user. In the first scenario, the dialog manager does not know that the word "elevator" refers to a location and is forced to wait for a keyword that it recognizes. In later scenarios, it learns a mapping for the new word.

An example dialog from a user test before learning

  User:   Take me to the elevator.
  Robot:  Where do you want to go?
          [The system does not know that "elevator" is a location.]
  User:   Take me to the Gates elevator.
          [Gates is a keyword.]
  Robot:  Do you want to go to the Gates Tower?
  User:   Yes.
  Robot:  ... going to Gates.

An example dialog from a user test after learning

  User:   Take me to the elevator.
  Robot:  Do you want to go to the Gates Tower?
          [The robot knows there are two elevators, and the user usually wants to go to Gates.]
  User:   No.
          [This time the user does not want to go to the preferred location.]
  Robot:  Do you want to go to the Dreyfoos Tower?
          [The robot suggests the second elevator location.]
  User:   Yes.
  Robot:  ... going to Dreyfoos.
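
The ordering of confirmation questions in the second dialog can be sketched as ranking locations by their posterior probability given the keyword heard. The counts below are invented for illustration and the function is a hypothetical helper, not part of our system.

```python
def rank_goals(word, goal_counts, word_counts):
    """Order goals by P(goal | word), proportional to P(goal) * P(word | goal).

    goal_counts[goal] counts how often each goal was requested;
    word_counts[goal][word] counts keywords heard for that goal.
    Goals that have never been associated with `word` are dropped.
    """
    total_goals = sum(goal_counts.values())
    scores = {}
    for goal, n in goal_counts.items():
        words = word_counts.get(goal, {})
        total_words = sum(words.values())
        p_word = words.get(word, 0.0) / total_words if total_words else 0.0
        scores[goal] = (n / total_goals) * p_word
    candidates = [goal for goal in scores if scores[goal] > 0.0]
    return sorted(candidates, key=lambda goal: scores[goal], reverse=True)


# Assumed counts after some dialogs: "elevator" maps to two locations.
goal_counts = {"Gates Tower": 6, "Dreyfoos Tower": 3, "Copy machine": 3}
word_counts = {
    "Gates Tower": {"gates": 4, "elevator": 4},
    "Dreyfoos Tower": {"dreyfoos": 4, "elevator": 2},
    "Copy machine": {"copy": 5},
}
order = rank_goals("elevator", goal_counts, word_counts)
```

Under these assumed counts the Gates Tower is confirmed first and the Dreyfoos Tower second, matching the order of questions in the dialog above.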

Figure 2 shows the total reward achieved over 100 simulated user trials of 30 dialogs each, with and without policy queries (both systems could adapt to the user's preferences). The system without policy queries (left) had no choice but to either make mistakes or act conservatively to avoid them; policy queries (right) allowed the system to discover the user's true preferences and thus follow a policy closer to optimal.

Figure 2 - Without meta-actions (policy queries), the dialog manager has no way to learn about the user's preferences. Thus, it is forced to act more conservatively than it would if it knew the true user model. For over half of the tests in this simulated example, the dialog manager with meta-actions scored higher than the maximum reward achieved by the non-learner.


[1] N. Roy, J. Pineau, and S. Thrun. Spoken dialog management using probabilistic reasoning. In Proceedings of the 38th Annual Meeting of the ACL, Hong Kong, China, 2000.

[2] J. Williams and S. Young. Scaling up POMDPs for dialog management: the "summary POMDP" method. In Proceedings of SIGdial Workshop on Discourse and Dialog, 2005.

[3] J. Hoey, P. Poupart, C. Boutilier, and A. Mihailidis. POMDP models for assistive technology, 2005.

[4] R. Jaulmes, J. Pineau, and D. Precup. Learning in non-stationary partially observable Markov decision processes. In Workshop on Non-Stationary Reinforcement Learning at the ECML, 2005.

[5] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 697--704, New York, NY, USA, 2006.



Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu