CSAIL Publications and Digital Archive

Research Abstracts - 2006

Information Extraction From Images and Captions

A. Quattoni, M. Collins & T. Darrell


Our goal is to develop a system that automatically annotates images with information that can be used for semantic retrieval and browsing. For example, we would like to be able to answer queries such as "get me all images that contain people doing sports and are indoors." In other words, we wish to build a model that maps images to information templates. For the query above, the template would have the form: t = [contains people: true, activity: sports, location: indoor].
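
To make the template idea concrete, the following is a minimal sketch of how such a template and a query against it might be represented. The field names mirror the example above; the class name Template, the helper functions matches and retrieve, and the image identifiers are hypothetical and are not part of the proposed system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Template:
    """Hypothetical information template; None means 'unspecified'."""
    contains_people: Optional[bool] = None
    activity: Optional[str] = None
    location: Optional[str] = None


def matches(annotation: Template, query: Template) -> bool:
    """Return True when every slot the query specifies agrees with the annotation."""
    for slot in ("contains_people", "activity", "location"):
        wanted = getattr(query, slot)
        if wanted is not None and getattr(annotation, slot) != wanted:
            return False
    return True


def retrieve(annotations: Dict[str, Template], query: Template) -> List[str]:
    """Return the sorted ids of all images whose template satisfies the query."""
    return sorted(
        image_id for image_id, template in annotations.items()
        if matches(template, query)
    )
```

Under this representation, the example query corresponds to Template(contains_people=True, activity="sports", location="indoor"), and slots left unspecified in a query act as wildcards.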

To build such a system, we propose to use the thousands of images paired with descriptive captions that are currently available online. In particular, we are interested in the domain of news images.

Previous Work

Previous work in the area of automatic image annotation (Barnard et al. [1]) has considered images paired with keywords and developed probabilistic translation models that translate from images, represented as sets of regions, to keywords. However, in the news datasets that we consider, not all words in the captions can be directly aligned with image regions, and therefore the translation models perform poorly. Furthermore, the words in the captions offer a significant amount of structure that we would like to leverage in our model.

First Approach

One of the main challenges in building a model that extracts semantic information from an image is that it might require a significant amount of labeled data. Extracting templates from captions, on the other hand, might be an easier task that requires less training data. As a first step, we propose to use a small training set of labeled captions to train a model that maps captions to templates. We will then use that model to label a larger set of images through their paired captions, and use this larger, automatically labeled dataset to train a visual classifier that maps images to templates.
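
The two-stage procedure described above can be sketched as follows. This is only an illustration under strong assumptions: the abstract does not specify either model, so a word-count scorer stands in for the caption model and a nearest-centroid rule stands in for the visual classifier. Images are assumed to be given as small numeric feature vectors, only a single template slot (for example, location) is predicted, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_caption_model(labeled_captions):
    """Count word/label co-occurrences (a crude stand-in for a real text classifier)."""
    word_counts = defaultdict(Counter)
    for caption, label in labeled_captions:
        for word in caption.lower().split():
            word_counts[label][word] += 1
    return word_counts


def label_caption(model, caption):
    """Pick the label whose training captions share the most word occurrences."""
    words = caption.lower().split()
    scores = {label: sum(counts[w] for w in words) for label, counts in model.items()}
    # Iterating labels in sorted order makes tie-breaking deterministic.
    return max(sorted(scores), key=scores.get)


def train_visual_model(features_with_labels):
    """Compute one mean feature vector (centroid) per label."""
    sums, counts = {}, Counter()
    for features, label in features_with_labels:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify_image(centroids, features):
    """Assign the label of the nearest centroid (squared Euclidean distance)."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(sorted(centroids), key=sq_dist)


def bootstrap(labeled_captions, captioned_images):
    """Stage 1: learn from a few labeled captions. Stage 2: auto-label images
    through their captions and train the visual model on those labels."""
    caption_model = train_caption_model(labeled_captions)
    auto_labeled = [
        (features, label_caption(caption_model, caption))
        for features, caption in captioned_images
    ]
    return train_visual_model(auto_labeled)
```

Here bootstrap takes a small list of (caption, label) pairs and a larger list of (image features, caption) pairs, and returns a visual model that classify_image can apply to images that have no caption at all. Any errors made by the caption model carry over into the training labels of the visual model, so the quality of the first stage bounds the second.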


[1] Barnard, K., P. Duygulu, D. Forsyth, N. de Freitas, D. Blei and M. Jordan. "Matching Words and Pictures." Special Issue on Text and Images, JMLR, 2002.


Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu