Research Abstracts - 2007
Multimodal Interactive Dialogue

Aaron D. Adler & Randall Davis


Sketches are used in the early stages of design in many domains, including electrical circuits, mechanical engineering, and chemistry. Sketches of such devices often contain numerous ambiguities and under-constrained parameters. We envision a multimodal interactive dialogue system in which both the user and the system use speech and sketching to discuss and refine a design description. In the mechanical engineering domain, this multimodal conversation gives the system a way to resolve ambiguities by asking the user questions. The system can then create an accurate causal model of the device, which can in turn be used to generate a simulation of the functioning device.


We conducted an empirical study of multimodal interaction in which users were asked to sketch and verbally describe six mechanical devices at a whiteboard. By transcribing and analyzing videotapes of these sessions, we developed a set of approximately 50 rules based on cues in the sketching and in the highly informal, unstructured speech. The rules segment the speech and sketching events, then align the corresponding events. Some rules group sketched objects based on timing data, while others look for key words in the speech input such as "and" and "there are," or disfluencies such as "ahh" and "umm." The system has a grammar framework that recognizes certain nouns such as "pendulum" and adjectives such as "identical." We combined ASSIST [1], our previous system for recognizing mechanical components (e.g., springs, pulleys, and pin joints), with a speech recognition engine and these rules to create a system that integrates sketching and speech.
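To make the rule-based approach concrete, here is a minimal Python sketch of two kinds of rules described above: splitting a transcript at a conjunction cue while discarding disfluencies, and aligning each speech segment with the sketch event nearest in time. All names, data, and timing values are illustrative assumptions, not the actual system's rules.

```python
CUE_WORDS = {"and"}            # conjunctions that often start a new item
DISFLUENCIES = {"ahh", "umm"}  # fillers that mark hesitation, not content

def segment_speech(words):
    """Split a list of (word, start_time) pairs into segments,
    starting a new segment at each cue word and dropping fillers."""
    segments, current = [], []
    for word, t in words:
        if word in DISFLUENCIES:
            continue                      # ignore fillers entirely
        if word in CUE_WORDS and current:
            segments.append(current)      # cue word opens a new segment
            current = []
        current.append((word, t))
    if current:
        segments.append(current)
    return segments

def align(segments, sketch_events):
    """Pair each speech segment with the sketch event whose timestamp
    is closest to the segment's first word (a crude timing rule)."""
    pairs = []
    for seg in segments:
        onset = seg[0][1]
        nearest = min(sketch_events, key=lambda ev: abs(ev[1] - onset))
        pairs.append(([w for w, _ in seg], nearest[0]))
    return pairs

words = [("there", 0.0), ("are", 0.2), ("two", 0.4), ("umm", 0.6),
         ("pulleys", 0.8), ("and", 1.2), ("a", 1.3), ("spring", 1.5)]
events = [("pulley-strokes", 0.5), ("spring-strokes", 1.6)]
print(align(segment_speech(words), events))
```

The real rule set is far richer (roughly 50 rules, including phrase cues such as "there are" and grouping by drawing timing), but the overall structure, segmenting first and then aligning by time, is the same.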

This initial system makes it easier for users to sketch some mechanical systems. For example, identical parts are critical to some devices, yet sketching identical components is difficult; the speech input allows this information to be specified easily. The system lets users interact with it multimodally, but it cannot ask the user about an unclear or ambiguous part of the sketch. For example, if it notices a discrepancy between the sketch and the user's spoken description, it has no way to ask for clarification. Rather than having the system guess, we suggest that, when confused or uncertain, it should be able to query the user to resolve the discrepancy or ambiguity.

The function of some devices is not apparent from the sketch alone. Figure 1 shows a mechanical device in which the initial state of several components is undetermined. For example, the platforms in the sketch could start out balanced, or the forces on one side could be greater than those on the other. Additionally, the movements and motion paths of objects are not specified. The system needs these unknown values to build a causal model of the device and, from that model, an accurate simulation; without them, the simulation will not behave as the user intended.

Figure 1: A mechanical device.

Engaging the user in a multimodal conversation about the sketch, much like a conversation the user might have with another person, will allow the system to determine values for these unknown variables. The multimodal dialogue enables the system to pose questions by highlighting or circling parts of the sketch while asking about them verbally. For example, the system might inquire about a platform that has forces pulling it in opposite directions: "The platform has a pivot and torques in both directions. Is it balanced to start?" The user's response enables the system to determine how the platform moves.

Previous Work

To examine the nature of a multimodal dialogue, we conducted a user study of natural bidirectional communication between two people. We set up an extensive software architecture that allowed two users with Tablet PCs to sketch simultaneously while sharing the same drawing surface. The intent was to answer questions such as: what are the characteristics of bidirectional interaction; what questions are asked; how is the sketching surface used to ask questions; how are disfluencies handled; how are conversations structured; and how often, and when, is it acceptable to interrupt the user?

The software allowed the users to sketch and annotate with multi-colored pens and highlighters, and to erase with a pixel-based eraser. It ensured that the timestamps for the audio, video, and sketching data were all synchronized. Participants created several sketches, focusing on electrical circuit diagrams and a circuit project they had designed for a class. A sketch from the study is shown in Figure 2.
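One way such synchronization might be implemented (an assumed design; the study software's internals are not described here) is to record a per-stream offset from a shared session clock and merge events in true time order:

```python
def to_session_time(local_time, stream_offset):
    """Map a stream's local timestamp onto the shared session clock."""
    return local_time - stream_offset

def merge(streams):
    """streams: list of (offset, [(local_time, label), ...]) tuples.
    Returns all events sorted on the shared session clock."""
    events = []
    for offset, records in streams:
        for t, label in records:
            events.append((to_session_time(t, offset), label))
    return sorted(events)

# Invented data: the audio recording started 2.0 s after the sketch
# stream, so its local timestamps must be shifted back by that offset.
audio = (2.0, [(2.5, "speech: 'two pulleys'")])
sketch = (0.0, [(0.3, "stroke: pulley outline")])
print(merge([audio, sketch]))
```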

Figure 2: A sketch of a full adder from the user study. Notice the highlighter annotations, which were used in the multimodal dialogue.

The study provided many insights into multimodal interaction. For example, different colors of ink were used to differentiate objects and identify components of the sketch. The users' speech was disfluent and repetitious. The users coordinated the modalities: they referenced lists of items in the same order in their speech and sketch, and they slowed down or sped up to keep the speech and sketching synchronized. Asking users questions about the sketch prompted them to revise the sketch or explain the design in more depth. Our quantitative analysis of the data produced an interesting observation about the relative timing of speech and sketching. Previous work [2] found that the onset of sketch input often preceded the onset of speech input. This holds for our data when individual words are paired with the sketched objects to which they refer; at the phrase level, however, speech more frequently precedes sketching.
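The word-level versus phrase-level onset comparison can be expressed as a simple statistic. The sketch below, with invented example data, computes the fraction of aligned pairs in which sketching begins before speech:

```python
def fraction_sketch_first(pairs):
    """pairs: list of (speech_onset, sketch_onset) times in seconds.
    Returns the fraction of pairs in which sketching began first."""
    sketch_first = sum(1 for speech, sketch in pairs if sketch < speech)
    return sketch_first / len(pairs)

# Invented onset times: word-level pairs match a word against the
# sketched object it refers to; phrase-level pairs match a whole
# phrase against its corresponding strokes.
word_level = [(1.2, 0.9), (2.5, 2.1), (4.0, 3.8), (5.1, 5.3)]
phrase_level = [(0.8, 1.1), (2.0, 2.4), (3.6, 3.5), (4.9, 5.2)]

print(fraction_sketch_first(word_level))    # sketch usually leads
print(fraction_sketch_first(phrase_level))  # speech usually leads
```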

Current Work

Our current research focuses on creating a multimodal interactive dialogue system with several components. The speech component will have a language model and grammar so that it can recognize common phrases and understand responses to the questions it asks. The physics component will perform a degree-of-freedom analysis and find potential collisions between objects. The dialogue component will form intelligent questions based on information from the physics component and on the user's responses. The system will also interact with the sketch, using various techniques to identify different objects in it, and will need to recognize the annotations the user makes to refer to objects. Combining the information from these sources, the system will build a causal model of the device, which will be used to generate a simulation.
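As a hypothetical illustration of the dialogue component, the sketch below maps an under-constrained parameter reported by the physics analysis to a clarification question. The parameter names and question templates are invented for this example, not taken from the system:

```python
# Templates keyed by the kind of missing information the physics
# analysis might report (all keys and wording are invented).
QUESTION_TEMPLATES = {
    "initial_balance": ("The {name} has a pivot and torques in both "
                        "directions. Is it balanced to start?"),
    "motion_path": "Which way does the {name} move when released?",
}

def questions_for(unknowns):
    """unknowns: list of (component_name, missing_parameter) pairs.
    Returns one spoken question per parameter we have a template for."""
    questions = []
    for name, parameter in unknowns:
        template = QUESTION_TEMPLATES.get(parameter)
        if template:
            questions.append(template.format(name=name))
    return questions

print(questions_for([("platform", "initial_balance"),
                     ("ball", "motion_path")]))
```

In the envisioned system each generated question would also be grounded in the sketch, for example by circling the component being asked about while the question is spoken.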

Funding Sources

This work is supported by MIT's Project Oxygen and Pfizer.


References

[1] Christine Alvarado and Randall Davis. Resolving ambiguities to create a natural computer-based sketching environment. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pp. 1365-1374, August 2001.

[2] Sharon Oviatt, Antonella DeAngeli, and Karen Kuhn. Integration and synchronization of input modes during multimodal human-computer interaction. In Conference Proceedings on Human Factors in Computing Systems, pp. 415-422. ACM Press, 1997.


Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu