We select keyframes that our model judges highly likely to contain an informative gesture. Again, this judgment is based on both linguistic and visual features. To evaluate the method, we compare the selected keyframes against ground-truth annotations from a human rater. Keyframes selected by conditional modality fusion agree with the ground truth more closely than keyframes selected unimodally from commonly used linguistic or visual features. More detail on this application will appear in a forthcoming publication [1].
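As a rough illustration of this pipeline, the sketch below assumes the fused model exposes a per-frame posterior probability that an informative gesture is present; the function names, threshold, and F1 agreement measure are assumptions made for illustration, not the selection rule or evaluation protocol reported in [1].

    # Hypothetical sketch: threshold a per-frame gesture posterior to pick
    # keyframes, then score agreement against a human rater's annotations.
    from typing import List

    def select_keyframes(posteriors: List[float], threshold: float = 0.5) -> List[int]:
        """Return indices of frames whose gesture posterior exceeds the threshold."""
        return [i for i, p in enumerate(posteriors) if p > threshold]

    def frame_f1(selected: List[int], ground_truth: List[int]) -> float:
        """F1 agreement between selected keyframes and human-annotated keyframes."""
        sel, gt = set(selected), set(ground_truth)
        if not sel or not gt:
            return 0.0
        precision = len(sel & gt) / len(sel)
        recall = len(sel & gt) / len(gt)
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # Toy example: posteriors from a fused (linguistic + visual) model.
    posteriors = [0.1, 0.85, 0.4, 0.9, 0.2]
    human_keyframes = [1, 3]
    selected = select_keyframes(posteriors)
    print(frame_f1(selected, human_keyframes))  # 1.0 on this toy input

The same agreement score can then be computed for keyframes selected from a single modality, giving the unimodal baselines described above.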
The linguistics literature is rich with syntactic models of verbal language, but little is known about how gesture is organized. Thus, the supervised approaches that have been so successful with verbal language are inapplicable to gesture. We consider conditional modality fusion to be a first step towards characterizing the structure of gestural communication using hidden variable models.
[1] Jacob Eisenstein, Regina Barzilay, and Randall Davis. Extracting Keyframes Using Linguistically Salient Gestures. Forthcoming.
[2] Jacob Eisenstein and Randall Davis. Conditional Modality Fusion for Coreference Resolution. Submitted to the Association for Computational Linguistics, 2007.
[3] Alissa Melinger. Gesture and Communicative Intention of the Speaker. Gesture, 4(2):119–141, 2002.