MIT CSAIL Research Abstracts

Research Abstracts - 2006

Information Access Using Natural Language

Boris Katz, Gary Borchardt, Sue Felshin & Gregory Marton

The Problem

With recent advances in computer and Internet technology, people have access to more information than ever before. As the amount of information grows, so does the problem of finding what one is looking for.

Motivation

A natural language system is the most intuitive interface for humans seeking information. It can produce high-precision responses that decrease search time and increase productivity. Such a system requires less training, is accessible to a wider audience, and can be deployed more quickly; it can also serve as a rigorous testbed for research in language understanding. Regrettably, current natural language processing techniques are not yet capable of unrestricted full-text understanding. Furthermore, not all information is text; sounds, images, video, and other multimedia can all be valuable sources of knowledge. In response to this situation, we believe the best approach is to focus on teaching computers where and how to find the right pieces of knowledge: that is, we need to give our systems knowledge about knowledge.

Approach

To address the problem of information overload in today's world, we have developed START, a natural language question answering system that provides users with high-precision multimedia information access through the use of natural language annotations. To address the difficulty of accessing large amounts of heterogeneous data, we have developed Omnibase, which assists START by providing uniform access to structured and semistructured Web resources. Our ultimate goal is to develop a computer system that acts like a “smart reference librarian.” START has been used by researchers at MIT and other universities and research laboratories to construct and query knowledge bases using English. To date, START can access a broad range of information in a number of topic areas, including cities, countries, weather reports, U.S. colleges and universities, U.S. presidents, movies, and more. Figure 1 shows a screenshot of START responding to a user question.

To give our systems knowledge about knowledge, we use natural language annotations [1], which are machine-parseable sentences and phrases that describe the content of various information segments. They serve as metadata describing the types of questions that a particular piece of knowledge is capable of answering. An important feature of the annotation concept is that any information segment can be annotated: not only text, but also images, multimedia, and even procedures.
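
As a rough illustration of how annotations can act as metadata, the following Python sketch attaches an annotation to each information segment and answers a question by returning the segment whose annotation matches it best. Everything in it is an assumption made for illustration: the `Segment` class, the `best_segment` function, the stop-word list, and the sample segments are hypothetical, and the matching step uses plain word overlap as a stand-in for the parsing and matching a real system would perform.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Set

# Hypothetical stop-word list; a crude substitute for real linguistic analysis.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "was", "for",
              "what", "who", "how", "does", "do", "to"}


def content_words(text: str) -> Set[str]:
    """Lowercase the text and keep only words that are not stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS}


@dataclass
class Segment:
    annotation: str  # sentence or phrase describing what the segment can answer
    content: str     # the segment itself: text, an image path, a procedure name


def best_segment(question: str, segments: List[Segment]) -> Optional[Segment]:
    """Return the segment whose annotation shares the most content words
    with the question, or None if no annotation shares any."""
    question_words = content_words(question)
    best, best_score = None, 0
    for segment in segments:
        score = len(question_words & content_words(segment.annotation))
        if score > best_score:
            best, best_score = segment, score
    return best


# Illustrative segments: one image and one procedure, both annotated in English.
SEGMENTS = [
    Segment("Mars has two moons, Phobos and Deimos", "images/mars_moons.jpg"),
    Segment("weather forecast for a city", "procedure: look up forecast"),
]

match = best_segment("How many moons does Mars have?", SEGMENTS)
print(match.content if match else "no annotated segment matches")
```

Note that the image and the procedure are retrieved in exactly the same way as text would be: only the annotation is matched, never the segment itself.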

The ability to respond to natural language questions with textual and multimedia content crucially depends on natural language annotations. The knowledge coverage of the START system is thus dependent on the amount of annotated material. To increase the effectiveness of our technology, we have adapted natural language annotations to work with structured and semistructured data through Omnibase [2], which allows heterogeneous resources on the World Wide Web to be treated in a consistent manner for purposes of retrieving properties of objects.
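
The idea of treating heterogeneous resources in a consistent manner can be sketched as a single lookup function placed in front of per-source wrappers. The Python below is a minimal sketch under stated assumptions, not Omnibase's actual interface: the names `register` and `get`, the source labels, and the in-memory tables (which stand in for real Web resources) are all hypothetical.

```python
from typing import Callable, Dict

# A wrapper takes (object, property) and returns the property's value.
Wrapper = Callable[[str, str], str]

_WRAPPERS: Dict[str, Wrapper] = {}


def register(source: str) -> Callable[[Wrapper], Wrapper]:
    """Decorator that files a wrapper function under a source name."""
    def decorator(fn: Wrapper) -> Wrapper:
        _WRAPPERS[source] = fn
        return fn
    return decorator


@register("countries")
def countries_wrapper(obj: str, prop: str) -> str:
    # Stand-in for a structured resource: a nested table keyed by object.
    table = {"France": {"capital": "Paris", "continent": "Europe"}}
    return table[obj][prop]


@register("movies")
def movies_wrapper(obj: str, prop: str) -> str:
    # Stand-in for a differently shaped resource: a flat list of records.
    records = [
        ("Gone with the Wind", "director", "Victor Fleming"),
        ("Gone with the Wind", "year", "1939"),
    ]
    for title, key, value in records:
        if title == obj and key == prop:
            return value
    raise KeyError((obj, prop))


def get(source: str, obj: str, prop: str) -> str:
    """Uniform entry point: callers never see how each source is laid out."""
    return _WRAPPERS[source](obj, prop)


print(get("countries", "France", "capital"))
print(get("movies", "Gone with the Wind", "director"))
```

The caller's query has the same shape for both sources; all source-specific detail is confined to the wrapper registered for that source.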

Figure 1: START's answer to the question “Who was president of the United States in 1881?” with knowledge from the World Wide Web.

Despite the effectiveness of START and Omnibase in meeting users' information needs, several major challenges remain unsolved.

The Scaling Problem: The sheer amount of information available in the world today places a practical limit on the amount of knowledge that can be incorporated into a system by a single research group. Although natural language annotations are easy and intuitive to write, there is simply too much content to annotate.
The Knowledge Engineering Bottleneck: Manual knowledge engineering is required to expand our system's knowledge coverage; integrating Web sources under Omnibase requires site-specific wrapper scripts. Consequently, only trained individuals can add knowledge to START and Omnibase.
The Implicit Knowledge Problem: Many components of knowledge do not appear explicitly within resources, but require the application of domain knowledge or rules of inference to extract them. In other cases, it is necessary to combine fragments of knowledge from multiple resources in order to derive sought-after components of knowledge.
The Fickle Web Problem: An undesirable side effect of the Web's dynamic nature is the instability of site layout and page content. This poses a serious problem for wrapper scripts custom-tailored to specific formats: major changes to page content or layout structure often require significant modification of the associated scripts.
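
To make the Implicit Knowledge Problem above concrete, consider the question from Figure 1. A resource may list each president's years in office without ever stating who was president in a given year, so the answer has to be derived with a rule of inference. The Python below is a minimal sketch under stated assumptions: the `Term` class, the `presidents_in` function, and the year-overlap rule are hypothetical, and the sketch makes no claim about how START derives its own answer.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Term:
    president: str
    start_year: int
    end_year: int


# Explicit knowledge: who served, and from which year to which year.
TERMS = [
    Term("Rutherford B. Hayes", 1877, 1881),
    Term("James A. Garfield", 1881, 1881),
    Term("Chester A. Arthur", 1881, 1885),
]


def presidents_in(year: int, terms: List[Term] = TERMS) -> List[str]:
    """Inference rule (an assumption): a person was president in `year`
    if the year falls within the span of their term, endpoints included."""
    return [t.president for t in terms
            if t.start_year <= year <= t.end_year]


print(presidents_in(1881))
```

No single record answers the question directly; the answer only appears once the rule is applied across all of the records.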

To address these challenges, we have pursued many different solutions, which other abstracts in this volume describe in greater detail.

Research Support

This work is supported in part by the Advanced Research and Development Activity as part of the AQUAINT Phase II research program.

References:

[1] Boris Katz. Annotating the World Wide Web Using Natural Language. In Proceedings of the 5th RIAO Conference on Computer Assisted Information Searching on the Internet (RIAO '97), Montreal, Canada, 1997.

[2] Boris Katz, Sue Felshin, Deniz Yuret, Ali Ibrahim, Jimmy Lin, Gregory Marton, Alton Jerome McFarland, and Baris Temelkuran. Omnibase: Uniform Access to Heterogeneous Data for Question Answering. In Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB '02), Stockholm, Sweden, June 2002.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu