Interacting with Speech and Sketching in a Design Environment
Aaron D. Adler & Randall Davis
Sketches are used in the early stages of design in many domains, including electrical circuit diagrams, mechanical engineering diagrams, and military course of action diagrams. Some aspects of a design are more easily expressed in a sketch, while others are more easily communicated verbally. By combining speech and sketching, we will create a more natural interaction that allows the user to work in both modalities. Speech input will also enable the system to engage the user in a conversation about the design. For example, if the user sketches an object that the system does not recognize, the system could start a conversation to clarify the user's intent.
Newton's Cradle (see Figure 1) is a good example of a mechanical device with some components that are more easily drawn and others that are more easily described verbally. Newton's Cradle is a set of pendulums consisting of a row of metal balls on strings. When you pull back and release a number of balls at one end, the same number of balls swing outward at the opposite end after a nearly elastic collision. Although the device appears easy to sketch, the precision required for correct operation makes it nearly impossible to draw by hand (see Figure 2). If the user could supply information verbally by saying that "there are five identical, evenly spaced and touching pendulums," creating the device would be easy.
Figure 1: A sequence of images illustrating Newton's Cradle
Figure 2: Screenshot of the user study drawing surface
We conducted an empirical study of multimodal interaction in which users were asked to sketch and verbally describe six mechanical devices at a whiteboard. By transcribing and analyzing videotapes of these sessions, we developed a set of approximately 50 rules based on clues in the sketching and in the highly informal and unstructured speech. The rules segment the speech and sketching events and then align the corresponding events. Some rules group sketched objects based on timing data, while others look for keywords in the speech input such as "and" and "there are," or disfluencies such as "ahh" and "umm." The system has a grammar framework that recognizes certain nouns like "pendulum" and adjectives like "identical." We combined ASSIST, our previous system that recognizes mechanical components (e.g., springs, pulleys, pin joints), a speech recognition engine, and the rules into a single multimodal system. The user can then draw Newton's Cradle and describe it verbally so that the resulting device behaves properly.
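As a rough illustration of two of these rule types (not the actual implementation; the function names, thresholds, and cue lists are our own simplifications), the sketch below groups strokes whose inter-stroke pause is short, and starts a new speech segment at each keyword or disfluency:

```python
from dataclasses import dataclass

# Illustrative cue lists; multiword cues such as "there are" would be
# handled by matching over token windows rather than single words.
KEYWORDS = {"and"}
DISFLUENCIES = {"ahh", "umm"}

@dataclass
class Stroke:
    start: float  # time the pen went down, in seconds
    end: float    # time the pen lifted, in seconds

def group_strokes(strokes, max_gap=1.0):
    """Group strokes separated by pauses shorter than max_gap seconds."""
    groups = []
    for stroke in sorted(strokes, key=lambda s: s.start):
        if groups and stroke.start - groups[-1][-1].end <= max_gap:
            groups[-1].append(stroke)
        else:
            groups.append([stroke])
    return groups

def segment_speech(words):
    """Start a new speech segment at each keyword or disfluency.

    Keywords begin the new segment; disfluencies are dropped entirely.
    """
    segments, current = [], []
    for word in words:
        if word in KEYWORDS or word in DISFLUENCIES:
            if current:
                segments.append(current)
            current = [word] if word in KEYWORDS else []
        else:
            current.append(word)
    if current:
        segments.append(current)
    return segments
```

The aligned output of rules like these is what lets the system pair a spoken phrase such as "five identical pendulums" with the group of strokes drawn at roughly the same time.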
The system could be improved by allowing the user to add new objects to the vocabulary and by adding a conversational interface that would allow the system to talk to the user about the design. Adding new vocabulary to the system would allow the system to adapt to new shapes or new words in a particular domain. For example, this would be useful in the military course of action diagram domain where new symbols and acronyms are added frequently. Adding a conversational interface to the system would allow the computer to interact with the user and ask appropriate questions to clarify parts of the drawing. This would create a more natural interaction and enable the computer to be more of a partner in the design process.
To better understand how users converse about a design, we are conducting a user study to examine how two people use speech and sketching to talk about several circuit diagrams. The study will also give us a better idea of how new vocabulary could be added naturally to the system. To conduct the study, we have created a real-time shared sketching surface so that the two participants can see a shared drawing surface as they sketch (one sketching surface is shown in Figure 2). Users can draw in different colors, with a pen or a highlighter, and erase strokes or parts of strokes that have already been drawn.
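To illustrate one piece of this functionality, the sketch below shows one plausible way partial-stroke erasing could work, assuming a stroke is stored as a list of (x, y) points; the function name and radius-based test are our own assumptions, not the study software's actual code.

```python
def erase_near(stroke, center, radius):
    """Remove points of `stroke` within `radius` of `center`,
    splitting the stroke into the surviving sub-strokes."""
    cx, cy = center
    pieces, current = [], []
    for (x, y) in stroke:
        if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
            # Point is under the eraser: close off the current piece.
            if current:
                pieces.append(current)
            current = []
        else:
            current.append((x, y))
    if current:
        pieces.append(current)
    return pieces
```

Splitting a stroke into sub-strokes, rather than deleting it whole, is what allows erasing the middle of a line while keeping both ends.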
As we work toward building the design studio of the future, we will continue to work on integrating the information that is conveyed through speech with the information that is conveyed through sketching. The speech channel should allow the system to learn new objects and vocabulary, as well as provide a more natural interface for the users. Speech will also provide a more conversational interaction that will allow the system to interact with the user when part of the sketch is unclear. Another direction that we will investigate is integrating speech with the language description files that are used by the sketch recognition system.
There are several similar multimodal systems that incorporate speech and sketching. QuickSet is a collaborative, multimodal, command-based system targeted toward improving efficiency in a military environment. The user can create and position items on an existing map using voice and pen-based gestures. QuickSet differs from our system in that it is command-based and users start with a map, which provides context, whereas our system accepts natural speech and users start with an empty screen. AT&T Labs has developed MATCH, which provides a speech and pen interface to restaurant and subway information. Users can make simple queries using some multimodal dialogue capabilities. However, it uses command-based speech rather than natural speech, and it has only basic circling and pointing gestures for the graphical input modality, not full sketching capabilities.
This work is supported in part by the MIT/Microsoft iCampus initiative and in part by MIT's Project Oxygen.
 Aaron Adler, Jacob Eisenstein, Michael Oltmans, Lisa Guttentag, and Randall Davis. Building the design studio of the future. In Proceedings of Making Pen-Based Interaction Intelligent and Natural, pp. 1–7, Menlo Park, California, October 2004.
 Christine Alvarado and Randall Davis. Resolving ambiguities to create a natural computer-based sketching environment. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pp. 1365–1374, August 2001.
 Michael Johnston, Srinivas Bangalore, Gunaranjan Vasireddy, Amanda Stent, Patrick Ehlen, Marilyn Walker, Steve Whittaker, and Preetam Maloor. MATCH: An architecture for multimodal dialogue systems. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 376–383, 2002.
 Sharon Oviatt, Phil Cohen, Lizhong Wu, John Vergo, Lisbeth Duncan, Bernhard Suhm, Josh Bers, Thomas Holzman, Terry Winograd, James Landay, Jim Larson, and David Ferro. Designing the user interface for multimodal speech and pen-based gesture applications: State-of-the-art systems and future research directions. In Human Computer Interaction, pp. 263–322, August 2000.