Research Abstracts - 2007
Photo-Oriented Question: A Multi-Modal Approach to Information Retrieval

Tom Yeh & Trevor Darrell


A photo-oriented question is a question someone asks about the content of a photo. Its use is motivated by the observation that asking a verbal question about something that is difficult to name can be very challenging. For instance, suppose a tourist who has seen the Stata Center wants to ask an MIT student who designed it. Not knowing what the building is called, the tourist cannot ask the obvious question, “Who designed the Stata Center?”; instead, the tourist must rely on a verbal description of the building’s strange but unique visual features. One can only hope the MIT student does not mistake it for Simmons Hall.

An easy solution to this problem is simply to show a photo of the Stata Center. Any MIT student can easily identify the building in the photo and provide the correct answer. The question can be adequately expressed by “Who designed this?” together with a photo of the Stata Center. We refer to such a multi-modal question, combining text and image, as a photo-oriented question (POQ).

Several enabling technologies make an automatic photo-oriented question answering system feasible. Consumer digital cameras make it virtually free to capture images anywhere, in any quantity. Wireless technology allows us to send photographs anywhere we want. The World Wide Web contains an enormous amount of knowledge contributed by millions of people all over the world. With these technologies, it becomes conceivable to ask a question immediately upon seeing something interesting: take a photo of it with a digital camera and send the photo over a wireless connection to a server, which searches the World Wide Web for clues and composes a reasonable answer.

System Overview

The POQ prototype system can be divided into two components: the client-side interface and the server.

The client-side interface lets users specify both a photo and a question, which together form a photo-oriented question. To make the system usable by most people, we purposely avoid high-end or futuristic devices and focus on technology widely available to the general public today. We describe three input methods in increasing order of hardware requirements.

The first method needs only a web browser, with which one can browse a list of existing photos and pick one to ask a question about. The second method requires a digital camera. Upon seeing something interesting, one can simply take a picture of it and submit the photo and the text question at the nearest web terminal. The third method requires a mobile phone with a built-in digital camera. Its wireless connectivity makes it possible to submit the photo-oriented question on the spot via the Multimedia Messaging Service (MMS) and receive the answer immediately.

The role of the server is to process incoming photo-oriented questions and produce the answer that best addresses each question. The server has three major components: the photo-matching engine, the question-matching engine, and the expert-knowledge aggregator.

Each photo-oriented question contains a query photo and a query text. The photo-matching engine analyzes the query photo and finds the K most similar photos in the photo database. It then collects the questions previously asked about those similar photos to obtain a set of candidate questions. The question-matching engine examines the set of candidate questions and identifies the question most relevant to the current query text. We compute question similarity with a word-frequency-based approach. The question with the highest score is considered the most relevant, both visually and verbally, to the photo-oriented question submitted by the user.
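The matching steps above can be sketched in a few lines. This is a minimal illustration, not the prototype's code: the text only says the similarity is word-frequency based, so the cosine measure over word counts is an assumption, and the names `question_similarity`, `best_candidate`, and `questions_by_photo` are hypothetical. The photo-matching engine is treated as a black box that has already returned the ids of the K most similar photos.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def question_similarity(a, b):
    """Cosine similarity between word-frequency vectors (an assumed measure)."""
    fa, fb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(fa[w] * fb[w] for w in fa.keys() & fb.keys())
    norm_a = math.sqrt(sum(c * c for c in fa.values()))
    norm_b = math.sqrt(sum(c * c for c in fb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_candidate(query_text, similar_photos, questions_by_photo):
    """Pick the previously asked question closest to the query text.

    similar_photos: photo ids returned by the photo-matching engine (top K).
    questions_by_photo: hypothetical mapping of photo id -> past questions.
    Returns (question, score), or (None, 0.0) when there are no candidates.
    """
    candidates = [q for p in similar_photos for q in questions_by_photo.get(p, [])]
    if not candidates:
        return None, 0.0
    scored = [(q, question_similarity(query_text, q)) for q in candidates]
    return max(scored, key=lambda pair: pair[1])


# Illustrative data only; the photo ids and questions are made up.
questions_by_photo = {
    "photo_a": ["Who designed this building?", "When was it built?"],
    "photo_b": ["What is this building called?"],
}
question, score = best_candidate(
    "who designed this?", ["photo_a", "photo_b"], questions_by_photo
)
print(question, round(score, 3))
```

Raw word counts give common words such as "this" as much weight as informative ones; a weighting scheme such as TF-IDF would be a natural refinement, though the source does not say whether one was used.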

The system admits its inability to find a good answer if neither the photo similarity score nor the question similarity score is above a certain threshold. We augment the “I don’t know” reply with a random fact about MIT so that the user can learn at least something.
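A minimal sketch of this fallback rule follows. The threshold values, the function name `compose_reply`, and the fact list are all assumptions made for illustration; the source does not give the thresholds or the facts the prototype used. The rule is implemented as literally stated: the system gives up only when neither score exceeds its threshold.

```python
import random

PHOTO_THRESHOLD = 0.5     # hypothetical value; the source does not state one
QUESTION_THRESHOLD = 0.5  # hypothetical value; the source does not state one

# Illustrative entries only, not the facts used by the prototype.
MIT_FACTS = [
    "MIT was founded in 1861.",
    "MIT's founder was William Barton Rogers.",
]


def compose_reply(answer, photo_score, question_score, rng=random):
    """Return the stored answer, or an "I don't know" reply with a random fact.

    The system is treated as confident when at least one of the two
    similarity scores is above its threshold.
    """
    confident = (photo_score > PHOTO_THRESHOLD
                 or question_score > QUESTION_THRESHOLD)
    if answer is not None and confident:
        return answer
    return "I don't know. Did you know? " + rng.choice(MIT_FACTS)


print(compose_reply("stored answer", 0.9, 0.1))  # confident: photo score is high
print(compose_reply("stored answer", 0.2, 0.3))  # falls back to a random fact
```

A stricter design would require both scores to pass before answering; which variant works better is an empirical question the abstract does not address.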

Photo-oriented questions that the system cannot answer confidently are reviewed by MIT students, who enter their answers through the expert-knowledge aggregator. As more questions are asked and more answers are provided by the experts, the overall effectiveness of the system will increase.
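One way to picture the aggregator is as a review queue paired with an answer store. The class below is a hypothetical in-memory sketch under that assumption; the class and method names are invented here, and the actual prototype kept its data in MySQL.

```python
from dataclasses import dataclass, field


@dataclass
class ExpertAggregator:
    """Holds unanswered questions until a human expert supplies an answer."""

    pending: list = field(default_factory=list)  # (photo_id, question) awaiting review
    answers: dict = field(default_factory=dict)  # (photo_id, question) -> answer text

    def defer(self, photo_id, question):
        """Queue a question the system could not answer confidently."""
        key = (photo_id, question)
        if key not in self.answers and key not in self.pending:
            self.pending.append(key)

    def record_answer(self, photo_id, question, answer):
        """Store an expert's answer and remove the question from the queue."""
        key = (photo_id, question)
        self.answers[key] = answer
        if key in self.pending:
            self.pending.remove(key)

    def lookup(self, photo_id, question):
        """Return the stored answer, or None if no expert has answered yet."""
        return self.answers.get((photo_id, question))


aggregator = ExpertAggregator()
aggregator.defer("photo_a", "What is this for?")
aggregator.record_answer("photo_a", "What is this for?", "answer entered by an expert")
print(aggregator.lookup("photo_a", "What is this for?"))
```

Each recorded answer becomes available to later lookups, which is the mechanism by which the system's coverage would grow as experts contribute.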

Prototype Study and Preliminary Results

To shed light on the types of questions people would ask about the photographs they take, we designed and implemented a semi-functional prototype system. We specifically targeted people visiting the MIT campus because they have real motivations to ask questions about various things around the campus.

The prototype is implemented in Ruby, uses MySQL for data storage, and runs on a single Linux machine. The image-matching system currently clusters similar photos in the database offline. However, to simulate interactivity in our study, we employ a Wizard-of-Oz approach in which an MIT student manually assigns each incoming photo to a known cluster during an online session. The question-matching engine takes the cluster of photos as input and finds the most relevant question.

During a three-week period from late July to early August, we carried out a field study at two locations (the Stata Center and Lobby 10). We set up two computer kiosks where visitors could interact with the prototype system through the web interface. We also prepared five camera phones that visitors could borrow and use to ask photo-oriented questions as they toured the campus.

Our first analysis focused only on the question text. We employed the question taxonomy introduced by Li and Roth [3], which has been adopted by several recent works on automatic question classification. This taxonomy comprises six categories. The following are examples of questions in each category:

  • Entity: What is that building? What kind of animal is the sculpture? What is this?
  • Human: Who is William Barton Rogers? Who is the person? Who is the architect? Who lives here?
  • Description: What is this for? Why is it snowing? How to open the window? Do I have to pay?
  • Location: Where am I? Where is the building? Is it in the east or west part of the campus?
  • Number: How much does it cost? When was it built? How tall is that building? How many people live here?
  • Abbreviation: What does MIT mean? What does RLE stand for?
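To make the six categories concrete, the sketch below assigns a category from surface cues such as the leading question word. It is a rough heuristic written for illustration only: it is not the learned classifier of Li and Roth, the source does not say how the collected questions were categorized, and the rule table and function name are assumptions. Questions without an obvious cue (for example, “Do I have to pay?”) are left unclassified.

```python
import re

# Ordered rules: the first matching pattern wins. The patterns are
# hand-written guesses, not taken from the study or from Li and Roth.
RULES = [
    ("Abbreviation", r"\bstand for\b|\bwhat does \w+ mean\b"),
    ("Human", r"^who\b"),
    ("Location", r"^where\b"),
    ("Number", r"^when\b|^how (much|many|tall|old|long|far)\b"),
    ("Description", r"^(why|how)\b|\bfor\?$"),
    ("Entity", r"^what\b"),
]


def classify(question):
    """Return the first category whose pattern matches, or None if none does."""
    q = question.strip().lower()
    for label, pattern in RULES:
        if re.search(pattern, q):
            return label
    return None


for text in ["What is that building?", "Who is the architect?",
             "How tall is that building?", "What does RLE stand for?"]:
    print(text, "->", classify(text))
```

Rule order matters: the Abbreviation and Description patterns are checked before the generic “what” rule so that “What does MIT mean?” and “What is this for?” are not labeled Entity.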

[1] KATZ, B. (1997). Annotating the World Wide Web using natural language. In Proceedings of the 5th RIAO Conference on Computer Assisted Information Searching on the Internet.

[2] YEH, T., GRAUMAN, K., TOLLMAR, K. & DARRELL, T. (2005). A picture is worth a thousand keywords: image-based object search on a mobile platform. In CHI Extended Abstracts, 2025-2028.

[3] LI, X. & ROTH, D. (2002). Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, 556-562.


Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu