CSAIL Research Abstracts - 2005

Multimodal Natural Language Processing

Jacob Eisenstein & Randall Davis


Spoken natural language occurs in parallel across a number of different modalities: speech, gesture, prosody, sketch, facial expressions, etc. However, the field of natural language processing has focused nearly exclusively on the spoken word. We believe that computers can benefit from attending to other modalities of natural language, just as people do. Moreover, we hope that this research will allow us to extend some of the applications of unimodal natural language processing -- for example, information retrieval and summarization -- to multimodal documents such as videos.


We are developing a corpus of dialogues in which people explain the function of simple mechanical devices to each other [1]. This corpus allows us to study a variety of non-verbal modalities: free-hand gestures, gestures at a printed diagram, and sketching with a tracked marker on a whiteboard. Since we have multiple speakers for each device, we can also examine which features of gestures are consistent across speakers and which are idiosyncratic.

Low-level Modality Processing

Participants in this corpus wore colored gloves to facilitate hand tracking. This lets us study the relationship between gesture and speech without first solving a hand-tracking computer vision problem that others are studying in depth. In addition, we have performed a forced alignment of the audio to a hand-made transcription of the speech, yielding precise word-level timestamps.
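The idea behind glove-based tracking can be illustrated with a minimal sketch: threshold each video frame on the glove's color and take the centroid of the matching pixels as the hand position. The color bounds and the function itself are illustrative assumptions, not the tracker actually used for the corpus.

```python
import numpy as np

def track_glove(frame, lo, hi):
    """Locate a colored glove in an RGB frame by color thresholding.

    frame: H x W x 3 uint8 array; lo/hi: per-channel RGB bounds
    (hypothetical values -- the corpus's actual glove colors and
    tracker are not specified here). Returns the (row, col)
    centroid of matching pixels, or None if nothing matches.
    """
    mask = np.all((frame >= lo) & (frame <= hi), axis=-1)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    return (ys.mean(), xs.mean())

# Synthetic 100x100 frame with a bright-green "glove" patch.
frame = np.zeros((100, 100, 3), dtype=np.uint8)
frame[40:60, 70:90] = (30, 200, 40)
print(track_glove(frame, lo=(0, 150, 0), hi=(80, 255, 80)))  # → (49.5, 79.5)
```

Run per frame, this yields a hand trajectory over time, which can then be aligned against the word timestamps from the forced alignment.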

Current Status

Using this corpus, we are currently investigating a range of problems within the domain of multimodal natural language processing: sentence segmentation, disfluency detection and repair, and coreference resolution. A number of interesting questions must be answered to address these problems:

  • What are the features of gestures that are relevant to communication?
  • How can we represent non-static features, such as repetitions or specific trajectories?
  • What is the best way to combine models of gesture phenomena with existing techniques that handle the verbal modality?
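One simple way to combine modalities, sketched below for sentence segmentation: score each candidate word boundary with a weighted sum of a verbal cue (pause length from the forced alignment) and a gestural cue (hands returning to rest). The features, weights, and threshold here are hypothetical illustrations, not the models developed in this work.

```python
def boundary_score(pause_sec, gesture_rest, w_pause=2.0, w_rest=1.0, bias=-1.5):
    """Combine a verbal and a gestural cue into one boundary score.

    pause_sec: silence following the word (seconds, from alignment)
    gesture_rest: 1 if the hands return to a rest position, else 0
    Weights and bias are illustrative, hand-picked values.
    """
    return w_pause * pause_sec + w_rest * gesture_rest + bias

def segment(words, pauses, rests, threshold=0.0):
    """Return indices of words predicted to end a sentence."""
    return [i for i, _ in enumerate(words)
            if boundary_score(pauses[i], rests[i]) > threshold]

words  = ["the", "gear", "turns", "then", "it", "stops"]
pauses = [0.05, 0.02, 0.90, 0.03, 0.04, 1.10]
rests  = [0,    0,    1,    0,    0,    1]
print(segment(words, pauses, rests))  # → [2, 5]
```

In practice the weights would be learned from the annotated corpus rather than set by hand, and richer gesture features (holds, repetitions, trajectories) would replace the single rest-position flag.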

[1] Aaron Adler, Jacob Eisenstein, Michael Oltmans, Lisa Guttentag, and Randall Davis. Building the Design Studio of the Future. In Proceedings of the 2004 AAAI Fall Symposium on Making Pen-Based Interaction Intelligent and Natural. 2004.

[2] Jacob Eisenstein and Randall Davis. Visual and Linguistic Information in Gesture Classification. In International Conference on Multimodal Interfaces (ICMI'04). 2004.

[3] Jacob Eisenstein and Randall Davis. Gestural Cues for Sentence Segmentation. MIT AI Memo AIM-2005-014. 2005.
