Abstracts - 2006
Web-Based Multimodal Dialogue Systems
Alexander Gruenstein & Stephanie Seneff
We are developing a generic infrastructure for rapidly building web-based multimodal dialogue systems for database access. We focus on the portability and scalability of such systems in the context of databases containing geographical content; as our example application in this domain, we build a map-based spoken dialogue system for accessing restaurant information. In previous research, we have shown progress with regard to the portability of dialogue management, and we have shown how cross-domain fertilization and simulation can create portable language models [2,3]. In this abstract, we focus on techniques for building a generic web interface, and for dealing with "messy" databases which must be adapted for use with speech.
Preparing the Database
We harvest the data for the system by crawling web-based restaurant databases. Currently, we have data for seven major metropolitan areas in the United States: Austin, Boston, Chicago, Los Angeles, San Francisco, Seattle, and Washington D.C. We are in the process of acquiring more data using a custom-built web spider. After the data are downloaded, processing produces a vocabulary of proper nouns for the speech recognizer and a structured database which can be used to filter user queries. This processing applies contextualized rules from a configuration file to convert abbreviations such as "pzzr" into "pizzeria" and to create aliases such as "steves" for "steves restaurant." Most of the clean-up is generic and can be applied to all cities; the configuration file can be adjusted to handle any exceptions that arise. Both steps are necessary in order for the system to be naturally accessible via speech and for the speech synthesizer to function properly.
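The cleanup step can be sketched as a small rule-based normalizer. The rule tables and function names below are illustrative assumptions, not the system's actual configuration format:

```python
# Sketch of rule-based database cleanup: expand abbreviations and
# generate spoken aliases. The rule tables here are assumptions,
# not the actual configuration file contents.

# Rewrite rules mapping raw database tokens to their spoken forms.
ABBREVIATIONS = {
    "pzzr": "pizzeria",
    "rstrnt": "restaurant",
}

# Generic suffixes stripped to form aliases, so "steves restaurant"
# is also reachable as just "steves pizzeria" or "steves".
GENERIC_SUFFIXES = ["restaurant", "cafe", "grill"]

def normalize_name(raw: str) -> str:
    """Expand abbreviations in a raw restaurant name."""
    tokens = [ABBREVIATIONS.get(t, t) for t in raw.lower().split()]
    return " ".join(tokens)

def aliases(name: str):
    """Generate spoken aliases for a normalized name."""
    out = [name]
    for suffix in GENERIC_SUFFIXES:
        if name.endswith(" " + suffix):
            out.append(name[: -len(suffix) - 1])
    return out

normalize_name("Steves Pzzr Rstrnt")   # "steves pizzeria restaurant"
aliases("steves pizzeria restaurant")  # includes "steves pizzeria"
```

The expanded names feed both the recognizer vocabulary and the synthesizer, which is why both the abbreviation expansion and the aliasing matter.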
Finally, a more sophisticated processing step converts any free-form natural language in the database into a structured representation. In this application, this involves processing the information about the opening hours of each restaurant gleaned from the web. A context-free grammar is used to turn this free-form text into a structured meaning representation, allowing the system to answer questions such as "Are they open for dinner on Monday?". Using natural language generation rules, answers to these questions may be fluently formulated.
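A minimal sketch of the kind of structured representation such parsing might produce, and how it supports a question like "Are they open for dinner on Monday?" (the field names and the dinner interval are assumptions, not the system's actual meaning representation):

```python
# Toy structured hours, as might be produced from free-form text such
# as "Mon-Tue 11:30am-10pm, closed Sunday". Times are minutes past
# midnight; the representation and meal window are assumptions.
DINNER = (17 * 60, 21 * 60)  # assume dinner means 5pm-9pm

hours = {
    "mon": [(11 * 60 + 30, 22 * 60)],
    "tue": [(11 * 60 + 30, 22 * 60)],
    "sun": [],  # closed
}

def open_for(day: str, meal) -> bool:
    """True if some open interval fully covers the meal period."""
    return any(start <= meal[0] and meal[1] <= end
               for start, end in hours.get(day, []))

open_for("mon", DINNER)  # True: 11:30am-10pm covers 5pm-9pm
open_for("sun", DINNER)  # False: closed on Sunday
```

Once the hours are in this form, the generation rules can verbalize the answer ("Yes, they are open until ten p.m.") rather than echoing the raw web text.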
The overall system architecture is depicted in figure 3. The user interface is presented on a dynamic web page. The client sends speech, typed, and gesture inputs to the server, where dialogue processing occurs. The client plays synthesized speech and displays database query results sent to it from the backend. Dialogue processing on the server side is configured via the Galaxy architecture.
GUI and Audio Front End
A screenshot of the GUI appears in figure 1. It is implemented as a dynamic web page, whose interaction with the web server is mediated by the GUI Controller Servlet. A Java applet embedded in the page provides a push-to-talk button and endpointing, and streams recorded speech and synthesized audio across the network. The GUI can display any type of geographically situated entity, and as such serves as a generic map-based front end. The user can draw on the map while speaking, enabling multimodal commands such as those shown in figure 2. Visible entities may be circled, and strokes that form points or lines are also recognized. Arbitrary regions may be circled as well, allowing commands such as "Show cheap restaurants here." Gesture recognition is performed using previously published algorithms.
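The distinction among point, line, and circle strokes can be sketched with a simple geometric heuristic. The actual system uses published gesture-recognition algorithms; the thresholds and logic below are purely illustrative:

```python
import math

# Illustrative heuristic for classifying a pen stroke as a point,
# line, or circle. Thresholds are assumptions for the sketch only.
def classify_stroke(points):
    """points: list of (x, y) pixel coordinates along the stroke."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Diagonal of the stroke's bounding box: overall stroke extent.
    diag = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    if diag < 5:                 # barely moved: treat as a point/tap
        return "point"
    # Distance between first and last sample: does the stroke close?
    closure = math.hypot(points[0][0] - points[-1][0],
                         points[0][1] - points[-1][1])
    if closure < 0.2 * diag:     # ends near where it began: a loop
        return "circle"
    return "line"

classify_stroke([(0.0, 0.0), (1.0, 1.0)])      # "point"
classify_stroke([(0.0, 0.0), (100.0, 0.0)])    # "line"
```

A circling gesture over the map would then be intersected with the visible entities to decide which restaurants were selected.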
We use the SUMMIT speech recognition system, specifically a version which allows for the dynamic manipulation of language models. The n-gram language model contains several dynamic classes, whose contents may change over the course of dialogue interaction, reflecting dialogue context: $restaurant, $street, $city, $neighborhood, and $landmark. Each class is a finite state transducer (FST) which can be sewn into the parent recognizer's main FST at any time, expanding into the appropriate vocabulary whenever that class appears in the upper layer of the FST. Each class is automatically tagged in the output string to support linguistic analysis.
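The dynamic-class idea can be illustrated without the FST machinery: the top-level model contains class tokens such as $restaurant, and each token expands into whatever vocabulary is current for the dialogue context. This is a toy stand-in for the splicing SUMMIT performs, not its implementation:

```python
# Toy stand-in for dynamic language-model classes: class tokens in a
# sentence template expand into the currently loaded vocabulary.
dynamic_classes = {
    "$restaurant": {"mary chung", "pizzeria uno"},
    "$city": {"boston", "seattle"},
}

def update_class(name, vocab):
    """Swap in new class contents, e.g. when the map view changes city."""
    dynamic_classes[name] = set(vocab)

def expansions(template):
    """Expand class tokens in a word-sequence template."""
    results = [[]]
    for tok in template:
        choices = dynamic_classes.get(tok, {tok})
        results = [r + [c] for r in results for c in sorted(choices)]
    return results

expansions(["show", "me", "$restaurant"])
# [['show', 'me', 'mary chung'], ['show', 'me', 'pizzeria uno']]
```

In the real recognizer the expansion happens inside the search via FST composition, so the class contents can change between utterances without rebuilding the whole language model.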
Language Understanding and Language Modeling:
For language understanding we utilize a lexicalized stochastic syntax-based context-free grammar, specialized for database query applications. The recognizer's class n-gram language model is automatically derived from the grammar. Both static and dynamic classes are specified. While the vocabulary populating the dynamic classes changes constantly during a dialogue, the NL grammar remains static: it uses the tags provided by the recognizer to aid in parsing and in assigning proper roles to the (unknown) proper nouns. The n-gram weights have been trained on a small corpus of transcribed developer interactions with the system.
Multimodal Context Resolution:
All entities introduced into the discourse, either verbally (by the user or by the system) or through gesture, are added to a common discourse entity list. A heuristic algorithm resolves plausible antecedents for deictic, pronominal, and definite NP references.
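The shared entity list and a recency-based resolution heuristic can be sketched as follows. The actual algorithm is more elaborate; this only shows the bookkeeping, and all names here are assumptions:

```python
# Toy discourse entity list shared by speech and gesture, with a
# recency-based resolution heuristic (a sketch, not the real algorithm).
discourse_entities = []

def add_entity(name, etype, source):
    """Record an entity mentioned verbally or selected by gesture."""
    discourse_entities.append({"name": name, "type": etype, "source": source})

def resolve(referring_type, prefer_gesture=False):
    """Return the most recent type-compatible entity, or None.

    Deictic references ("this one") prefer gestured entities."""
    for ent in reversed(discourse_entities):
        if ent["type"] == referring_type:
            if not prefer_gesture or ent["source"] == "gesture":
                return ent
    return None

add_entity("pizzeria uno", "restaurant", "gesture")
add_entity("mary chung", "restaurant", "speech")
resolve("restaurant")                       # "mary chung": most recent
resolve("restaurant", prefer_gesture=True)  # "pizzeria uno": gestured
```

Keeping gestured and spoken mentions in one list is what lets a later pronoun refer back to something the user only ever circled on the map.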
Response Planning and Generation:
A generic dialogue manager filters the database based on user-supplied constraints and prepares a reply frame encoding the system's intended response, which is then converted to a surface-form string using the Genesis generation system and finally to a synthesized waveform by a commercial speech synthesizer. A separate multimodal reply frame is converted to XML by Genesis to update the set of entities shown on the GUI.
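The reply-frame-to-surface-string step can be sketched with a couple of toy generation rules standing in for Genesis; the frame keys and clause names below are assumptions for illustration:

```python
# Toy generation rules mapping a semantic reply frame to a surface
# string (a stand-in for Genesis; frame structure is an assumption).
def generate(frame):
    if frame["clause"] == "enumerate":
        n = len(frame["entities"])
        names = ", ".join(frame["entities"])
        return f"I found {n} {frame['topic']}s: {names}."
    if frame["clause"] == "confirm_open":
        return (f"Yes, {frame['name']} is open for "
                f"{frame['meal']} on {frame['day']}.")
    return "Sorry, I didn't understand."

reply = {"clause": "enumerate", "topic": "restaurant",
         "entities": ["Mary Chung", "Pizzeria Uno"]}
generate(reply)
# "I found 2 restaurants: Mary Chung, Pizzeria Uno."
```

Because the reply frame is language-neutral, the same frame can drive both the spoken response and the XML that refreshes the entities shown on the map.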