Abstracts - 2006
Web-Based Multimodal Dialogue Systems
Alexander Gruenstein & Stephanie Seneff
We are developing a generic infrastructure for rapidly building web-based multimodal dialogue systems for database access. We focus on the portability and scalability of such systems in the context of databases containing geographical content; as our example application in this domain, we build a map-based spoken dialogue system for accessing restaurant information. In previous research, we have shown progress with regard to the portability of dialogue management, and we have shown how cross-domain fertilization and simulation can create portable language models [2,3]. In this abstract, we focus on techniques for building a generic web interface, and for dealing with "messy" databases which must be adapted for use with speech.
Preparing the Database
We harvest the data for the system by crawling web-based restaurant databases. Currently, we have data for seven major metropolitan areas in the United States: Austin, Boston, Chicago, Los Angeles, San Francisco, Seattle, and Washington D.C. We are in the process of acquiring more data using a custom-built web spider. After the data are downloaded, processing produces a vocabulary of proper nouns for the speech recognizer and a structured database which can be used to filter user queries. This processing applies contextualized rules from a configuration file to convert abbreviations such as "pzzr" into "pizzeria" and to create aliases such as "steves" for "steves restaurant." Most of the clean-up is generic and can be applied to all cities; the configuration file can be adjusted to handle any exceptions that arise. Both steps are necessary in order for the system to be naturally accessible via speech and for the speech synthesizer to function properly.
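The cleanup step can be sketched as a small rule-based normalizer. The rule tables and function names below are illustrative assumptions, not the system's actual configuration format:

```python
# Sketch of rule-based database cleanup: expand abbreviations and
# generate spoken aliases. The rule tables here are assumptions,
# not the actual configuration file contents.

# Rewrite rules mapping raw database tokens to their spoken forms.
ABBREVIATIONS = {
    "pzzr": "pizzeria",
    "rstrnt": "restaurant",
}

# Generic suffixes stripped to form aliases, so "steves restaurant"
# is also reachable as just "steves pizzeria" or "steves".
GENERIC_SUFFIXES = ["restaurant", "cafe", "grill"]

def normalize_name(raw: str) -> str:
    """Expand abbreviations in a raw restaurant name."""
    tokens = [ABBREVIATIONS.get(t, t) for t in raw.lower().split()]
    return " ".join(tokens)

def aliases(name: str):
    """Generate spoken aliases for a normalized name."""
    out = [name]
    for suffix in GENERIC_SUFFIXES:
        if name.endswith(" " + suffix):
            out.append(name[: -len(suffix) - 1])
    return out

normalize_name("Steves Pzzr Rstrnt")   # "steves pizzeria restaurant"
aliases("steves pizzeria restaurant")  # includes "steves pizzeria"
```

The expanded names feed both the recognizer vocabulary and the synthesizer, which is why both the abbreviation expansion and the aliasing matter.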
Finally, a more sophisticated processing step converts any free-form natural language in the database into a structured representation. In this application, this involves processing the information about the opening hours of each restaurant gleaned from the web. A context-free grammar is used to turn this free-form text into a structured meaning representation, allowing the system to answer questions such as "Are they open for dinner on Monday?". Using natural language generation rules, answers to these questions may be fluently formulated.
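A minimal sketch of the kind of structured representation such parsing might produce, and how it supports a question like "Are they open for dinner on Monday?" (the field names and the dinner interval are assumptions, not the system's actual meaning representation):

```python
# Toy structured hours, as might be produced from free-form text such
# as "Mon-Tue 11:30am-10pm, closed Sunday". Times are minutes past
# midnight; the representation and meal window are assumptions.
DINNER = (17 * 60, 21 * 60)  # assume dinner means 5pm-9pm

hours = {
    "mon": [(11 * 60 + 30, 22 * 60)],
    "tue": [(11 * 60 + 30, 22 * 60)],
    "sun": [],  # closed
}

def open_for(day: str, meal) -> bool:
    """True if some open interval fully covers the meal period."""
    return any(start <= meal[0] and meal[1] <= end
               for start, end in hours.get(day, []))

open_for("mon", DINNER)  # True: 11:30am-10pm covers 5pm-9pm
open_for("sun", DINNER)  # False: closed on Sunday
```

Once the hours are in this form, the generation rules can verbalize the answer ("Yes, they are open until ten p.m.") rather than echoing the raw web text.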
The overall system architecture is depicted in figure 3. The user interface is presented on a dynamic web page. The client sends speech, typed, and gesture inputs to the server, where dialogue processing occurs. The client plays synthesized speech and displays database query results sent to it from the backend. Dialogue processing on the server side is configured via the Galaxy architecture.
GUI and Audio Front End
A screenshot of the GUI appears in figure 1. It is implemented as a dynamic web page, whose interaction with the web server is mediated by the GUI Controller Servlet. A Java applet embedded in the page provides a push-to-talk button and endpointing, and streams recorded speech and synthesized audio across the network. The GUI can display any type of geographically situated entity, and as such serves as a generic map-based front end. The user can draw on the map while speaking, enabling multimodal commands such as those shown in figure 2. Visible entities may be circled, and strokes that form points or lines are also recognized. Arbitrary regions may be circled as well, allowing commands such as "Show cheap restaurants here." Gesture recognition is performed using previously published algorithms.
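The distinction among point, line, and circle strokes can be sketched with a simple geometric heuristic. The actual system uses published gesture-recognition algorithms; the thresholds and logic below are purely illustrative:

```python
import math

# Illustrative heuristic for classifying a pen stroke as a point,
# line, or circle. Thresholds are assumptions for the sketch only.
def classify_stroke(points):
    """points: list of (x, y) pixel coordinates along the stroke."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Diagonal of the stroke's bounding box: overall stroke extent.
    diag = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    if diag < 5:                 # barely moved: treat as a point/tap
        return "point"
    # Distance between first and last sample: does the stroke close?
    closure = math.hypot(points[0][0] - points[-1][0],
                         points[0][1] - points[-1][1])
    if closure < 0.2 * diag:     # ends near where it began: a loop
        return "circle"
    return "line"

classify_stroke([(0.0, 0.0), (1.0, 1.0)])      # "point"
classify_stroke([(0.0, 0.0), (100.0, 0.0)])    # "line"
```

A circling gesture over the map would then be intersected with the visible entities to decide which restaurants were selected.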
We use the SUMMIT speech recognition system, specifically a version which allows for the dynamic manipulation of language models. The n-gram language model contains several dynamic classes, whose contents may change over the course of dialogue interaction, reflecting dialogue context: $restaurant, $street, $city, $neighborhood, and $landmark. Each class is a finite state transducer (FST) which can be sewn into the parent recognizer's main FST at any time, expanding into the appropriate vocabulary whenever that class appears in the upper layer of the FST. Each class is automatically tagged in the output string to support linguistic analysis.
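The dynamic-class idea can be illustrated without the FST machinery: the top-level model contains class tokens such as $restaurant, and each token expands into whatever vocabulary is current for the dialogue context. This is a toy stand-in for the splicing SUMMIT performs, not its implementation:

```python
# Toy stand-in for dynamic language-model classes: class tokens in a
# sentence template expand into the currently loaded vocabulary.
dynamic_classes = {
    "$restaurant": {"mary chung", "pizzeria uno"},
    "$city": {"boston", "seattle"},
}

def update_class(name, vocab):
    """Swap in new class contents, e.g. when the map view changes city."""
    dynamic_classes[name] = set(vocab)

def expansions(template):
    """Expand class tokens in a word-sequence template."""
    results = [[]]
    for tok in template:
        choices = dynamic_classes.get(tok, {tok})
        results = [r + [c] for r in results for c in sorted(choices)]
    return results

expansions(["show", "me", "$restaurant"])
# [['show', 'me', 'mary chung'], ['show', 'me', 'pizzeria uno']]
```

In the real recognizer the expansion happens inside the search via FST composition, so the class contents can change between utterances without rebuilding the whole language model.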
Language Understanding and Language Modeling:
For language understanding we utilize a lexicalized stochastic syntax-based context-free grammar, specialized for database query applications. The recognizer's class n-gram language model is automatically derived from the grammar. Both static and dynamic classes are specified. While the vocabulary populating the dynamic classes changes constantly during a dialogue, the NL grammar remains static: it uses the tags provided by the recognizer to aid in parsing and in assigning proper roles to the (unknown) proper nouns. The n-gram weights have been trained on a small corpus of transcribed developer interactions with the system.
Multimodal Context Resolution:
All entities introduced into the discourse, either verbally (by the user or by the system) or through gesture, are added to a common discourse entity list. A heuristic algorithm resolves plausible antecedents for deictic, pronominal, and definite NP references.
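The shared entity list and a recency-based resolution heuristic can be sketched as follows. The actual algorithm is more elaborate; this only shows the bookkeeping, and all names here are assumptions:

```python
# Toy discourse entity list shared by speech and gesture, with a
# recency-based resolution heuristic (a sketch, not the real algorithm).
discourse_entities = []

def add_entity(name, etype, source):
    """Record an entity mentioned verbally or selected by gesture."""
    discourse_entities.append({"name": name, "type": etype, "source": source})

def resolve(referring_type, prefer_gesture=False):
    """Return the most recent type-compatible entity, or None.

    Deictic references ("this one") prefer gestured entities."""
    for ent in reversed(discourse_entities):
        if ent["type"] == referring_type:
            if not prefer_gesture or ent["source"] == "gesture":
                return ent
    return None

add_entity("pizzeria uno", "restaurant", "gesture")
add_entity("mary chung", "restaurant", "speech")
resolve("restaurant")                       # "mary chung": most recent
resolve("restaurant", prefer_gesture=True)  # "pizzeria uno": gestured
```

Keeping gestured and spoken mentions in one list is what lets a later pronoun refer back to something the user only ever circled on the map.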
Response Planning and Generation:
A generic dialogue manager filters the database based on user-supplied constraints and prepares a reply frame encoding the system's intended response, which is then converted to a surface-form string using the Genesis generation system and finally to a synthesized waveform by a commercial speech synthesizer. A separate multimodal reply frame is converted to XML by Genesis to update the set of entities shown on the GUI.
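The reply-frame-to-surface-string step can be sketched with a couple of toy generation rules standing in for Genesis; the frame keys and clause names below are assumptions for illustration:

```python
# Toy generation rules mapping a semantic reply frame to a surface
# string (a stand-in for Genesis; frame structure is an assumption).
def generate(frame):
    if frame["clause"] == "enumerate":
        n = len(frame["entities"])
        names = ", ".join(frame["entities"])
        return f"I found {n} {frame['topic']}s: {names}."
    if frame["clause"] == "confirm_open":
        return (f"Yes, {frame['name']} is open for "
                f"{frame['meal']} on {frame['day']}.")
    return "Sorry, I didn't understand."

reply = {"clause": "enumerate", "topic": "restaurant",
         "entities": ["Mary Chung", "Pizzeria Uno"]}
generate(reply)
# "I found 2 restaurants: Mary Chung, Pizzeria Uno."
```

Because the reply frame is language-neutral, the same frame can drive both the spoken response and the XML that refreshes the entities shown on the map.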