Multimodal Natural Language Processing

Jacob Eisenstein & Randall Davis

Motivation

Spoken natural language occurs in parallel across a number of modalities: speech, gesture, prosody, sketch, facial expression, and more. However, the field of natural language processing has focused almost exclusively on the spoken word. We believe that computers can benefit from attending to these other modalities, just as people do. Moreover, we hope that this research will allow us to extend applications of unimodal natural language processing, such as information retrieval and summarization, to multimodal documents such as videos.

Corpus

We are developing a corpus of dialogues in which people explain the function of simple mechanical devices to each other [1]. This corpus allows us to study a variety of non-verbal modalities: free-hand gestures, gestures at a printed diagram, and sketching with a tracked marker on a whiteboard. Since we have multiple speakers for each device, we can also examine which features of gesture are consistent across speakers and which are idiosyncratic.

Low-level Modality Processing

Participants in this corpus wore colored gloves to facilitate hand tracking. This lets us study the relationship between gesture and speech without dwelling on a computer vision problem that is being studied more carefully by others. In addition, we have force-aligned the audio to a manual transcription of the speech, yielding precise word-level timestamps.

Current Status

Using this corpus, we are currently investigating a range of problems in multimodal natural language processing: sentence segmentation [3], disfluency detection and repair, and coreference resolution. A number of interesting questions must be answered to address these problems.
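To make the low-level processing concrete: the forced alignment yields per-word timestamps, and the glove tracker yields per-frame hand positions, so the two streams can be joined on time. The Python sketch below shows one such join, computing the mean hand speed over each word's time span. The Word and Frame records are illustrative assumptions, not the corpus's actual data format.

from dataclasses import dataclass
from math import hypot

@dataclass
class Word:
    text: str
    start: float  # seconds, from forced alignment
    end: float

@dataclass
class Frame:
    t: float  # seconds
    x: float  # tracked glove centroid (pixels); format is assumed
    y: float

def mean_hand_speed(word: Word, frames: list[Frame]) -> float:
    """Average glove speed (pixels/sec) over the frames spanning a word."""
    span = [f for f in frames if word.start <= f.t <= word.end]
    if len(span) < 2:
        return 0.0
    dist = sum(hypot(b.x - a.x, b.y - a.y) for a, b in zip(span, span[1:]))
    return dist / (span[-1].t - span[0].t)

# Toy example: a word spoken while the hand sweeps rightward.
words = [Word("rotates", 1.0, 1.5)]
frames = [Frame(1.0 + 0.1 * i, 100 + 20 * i, 200) for i in range(6)]
for w in words:
    print(w.text, mean_hand_speed(w, frames))

Per-word measurements like this one are the natural input to the higher-level problems listed above, since each gestural feature is anchored to a span of the transcript.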
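One question of current interest is how gestural cues should combine with lexical cues for sentence segmentation [3]. Below is a minimal sketch of such a combination, a logistic regression over toy data; the features (pause duration, hand speed, whether the hands have retracted to rest) and the numbers are assumptions chosen for illustration, not the feature set of [3].

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per inter-word boundary; feature choices are illustrative:
# [pause_sec, hand_speed_px_per_sec, hands_at_rest (0/1)]
X = np.array([
    [0.05, 180.0, 0],  # mid-sentence, hands still moving
    [0.10, 150.0, 0],
    [0.80,  10.0, 1],  # long pause, hands retracted
    [0.60,   5.0, 1],
    [0.02, 200.0, 0],
    [0.90,   8.0, 1],
])
y = np.array([0, 0, 1, 1, 0, 1])  # 1 = sentence boundary

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.70, 12.0, 1]]))  # a likely boundary on this toy data

On real data, such a classifier would be trained on boundaries labeled in the transcript, with the gestural features extracted from the glove track as in the previous sketch.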
References

[1] Aaron Adler, Jacob Eisenstein, Michael Oltmans, Lisa Guttentag, and Randall Davis. Building the Design Studio of the Future. In Proceedings of the 2004 AAAI Fall Symposium on Making Pen-Based Interaction Intelligent and Natural, 2004.

[2] Jacob Eisenstein and Randall Davis. Visual and Linguistic Information in Gesture Classification. In Proceedings of the International Conference on Multimodal Interfaces (ICMI '04), 2004.

[3] Jacob Eisenstein and Randall Davis. Gestural Cues for Sentence Segmentation. MIT AI Memo AIM-2005-014, 2005.