Research Abstracts - 2006
Gesture and Natural Language Semantics

Jacob Eisenstein & Randall Davis


Although the natural-language processing community has dedicated much of its focus to text, face-to-face spoken language is ubiquitous, and offers the potential for breakthrough applications in domains such as meetings, lectures, and presentations. Because spontaneous spoken language is typically more disfluent and less structured than written text, it may be critical to identify features from additional modalities that can aid in language understanding. However, due to the long-standing emphasis on text datasets, there has been relatively little work on non-textual features in unconstrained natural language.

In this project we explore the possibility of applying hand gesture features to the problem of coreference resolution: identifying which noun phrases refer to the same entity. Coreference resolution is thought to be fundamental to more ambitious applications such as automatic summarization, segmentation, and question answering. To motivate the need for multimodal features in coreference resolution, consider the following transcript, in which noun phrases are set off by brackets, and entities are indexed in parentheses.

``[This circle (1)] is rotating clockwise and [this piece of wood (2)] is attached at [this point (3)] and [this point (4)] but [it (5)] can rotate. So as [the circle (6)] rotates, [this (7)] moves in and out. So [this whole thing (8)] is just going back and forth.''

Even given a high degree of domain knowledge (e.g., that ``circles'' often ``rotate'' but ``points'' rarely do), determining the coreference in this excerpt seems difficult. The word ``this'' accompanied by a gesture is frequently used to introduce a new entity, so it is difficult to determine from the text alone whether ``[this (7)]'' refers to ``[this piece of wood (2)],'' or to an entirely different part of the diagram. In addition, ``[this whole thing (8)]'' could be anaphoric, or it might refer to a new entity, perhaps some superset of predefined parts.

The example text was drawn from a small corpus of dialogues, which has been annotated for coreference (for more details on the corpus, and on this project in general, please see this paper). Participants in the study had little difficulty understanding what was communicated. While this does not prove that human listeners are using gesture or other multimodal features, it suggests that these features merit further investigation. We extracted hand positions from the videos in the corpus, using computer vision. From the raw hand positions, we derived gesture features that were used to supplement traditional textual features for coreference resolution. We present results showing that these features yield a significant improvement in performance.
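The pipeline described above -- traditional textual features supplemented by gesture-derived features, combined into pairwise classification instances -- could be sketched as follows. All feature names and the dictionary representation of a noun phrase are hypothetical, chosen only to illustrate the idea; they are not the project's actual feature set.

```python
from math import hypot

def pairwise_instance(np_i: dict, np_j: dict) -> dict:
    """One classification instance for a candidate NP pair: textual
    features supplemented by gesture features (names are illustrative)."""
    feats = {
        # Textual features of the kind commonly used for coreference:
        "exact_match": np_i["text"].lower() == np_j["text"].lower(),
        "np_distance": np_j["index"] - np_i["index"],
        "j_is_pronoun": np_j["is_pronoun"],
    }
    # Gesture features, sampled at each NP's temporal midpoint;
    # only defined when a focus hand was found for both NPs.
    pos_i, pos_j = np_i.get("focus_pos"), np_j.get("focus_pos")
    if pos_i is not None and pos_j is not None:
        feats["focus_distance_px"] = hypot(pos_i[0] - pos_j[0],
                                           pos_i[1] - pos_j[1])
        feats["same_focus_hand"] = np_i["focus_hand"] == np_j["focus_hand"]
    return feats
```

Each such instance would then be labeled coreferent or not and passed to the classifier described below.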

Coreference Resolution

A set of twenty commonly used linguistic features was selected, describing syntactic and lexical properties of the noun phrases that were candidates for coreference. In addition, gesture features derived from the raw hand positions were used. First, at most one hand is designated the ``focus hand,'' using the following heuristic: select the hand farthest from the body in the x-dimension, as long as that hand is not occluded and its y-position is not below the speaker's waist; if neither hand meets these criteria, no hand is in focus. Occluded hands are excluded because the listener's perspective was very similar to that of the camera, so it seemed unlikely that the speaker would occlude a meaningful gesture; moreover, position estimates for an occluded hand are unlikely to be accurate. Feature values are computed at the temporal midpoint of each candidate noun phrase. Two gesture features were used: the Euclidean distance, in pixels, between the focus hand's position during each of the two noun phrases; and whether the same hand was in focus during both noun phrases.
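The focus-hand heuristic and the two gesture features could be implemented roughly as follows. This is a minimal sketch under assumed conventions: the `Hand` record and its field names are hypothetical, and image y-coordinates are taken to grow downward, so ``not below the waist'' means y at or above the waist line.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional

@dataclass
class Hand:
    """Hypothetical per-frame hand track from the vision system."""
    label: str        # "left" or "right"
    x: float          # horizontal offset from the body midline, in pixels
    y: float          # vertical image position, in pixels (larger = lower)
    occluded: bool

def focus_hand(left: Hand, right: Hand, waist_y: float) -> Optional[Hand]:
    """Select at most one focus hand: the hand farthest from the body in
    the x-dimension, provided it is not occluded and not below the waist.
    Returns None if neither hand qualifies."""
    candidates = [h for h in (left, right)
                  if not h.occluded and h.y <= waist_y]
    if not candidates:
        return None
    return max(candidates, key=lambda h: abs(h.x))

def gesture_features(focus_i: Optional[Hand], focus_j: Optional[Hand]) -> dict:
    """The two pairwise gesture features, each hand sampled at its NP's
    temporal midpoint: Euclidean distance between focus-hand positions,
    and whether the same hand was in focus for both NPs."""
    if focus_i is None or focus_j is None:
        return {"focus_distance": None, "same_hand": False}
    return {
        "focus_distance": hypot(focus_i.x - focus_j.x, focus_i.y - focus_j.y),
        "same_hand": focus_i.label == focus_j.label,
    }
```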

Preliminary Results

Using a boosted decision tree classifier, we found that gesture features improved performance by a small but significant margin. The F-measure -- a combination of recall and precision -- was 54.9% with gesture features and 52.8% without, compared to a most-common-class baseline of 41.5%. In addition, we observe that gesture features correlate well with coreference phenomena: a chi-squared analysis shows that the relationship between each gesture feature and coreference was significant (χ² = 727.8, dof = 4, p < .01 for the Euclidean distance between gestures; χ² = 57.2, dof = 2, p < .01 for which hand was gesturing). The feature measuring distance between gestures ranked fifth overall when compared to the twenty linguistic features.
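For concreteness, the two statistics reported above can be computed as follows. These are generic textbook definitions, not the project's analysis code: the balanced F-measure is the harmonic mean of precision and recall, and the Pearson chi-squared statistic is computed from a contingency table of feature value versus coreference outcome.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a
    contingency table (list of rows of observed counts)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_sums) - 1) * (len(col_sums) - 1)
    return stat, dof
```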

Future Work

These preliminary results suggest several avenues of future research. The gesture features currently used are quite simple; in fact, they relate to only one type of gesture, deixis, where space is used to convey meaning. Additional features that describe motion trajectory or hand-shape might improve performance by capturing the semantics of a wider variety of gestures. We are also interested in the possibility that "meta-features" of hand motion might tell us when gesture is likely to be relevant.


Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu