MIT CSAIL Research Abstracts


Research Abstracts - 2007

Omnibase: A Uniform Interface to Heterogeneous Data

Boris Katz & Sue Felshin

The Problem

Although the World Wide Web contains a tremendous amount of information, the lack of uniform structure makes finding the right knowledge difficult. Traditional keyword-based Web search engines have "solved" this problem by treating all data as raw text and performing literal character matching only; these searches often miss much data they should find and find much data they should miss. Other information retrieval systems have addressed the problem by accessing only a single resource or Web site. To utilize all the available information, a system must be able to access heterogeneous data on the Web and in other resources. Our solution is to apply a uniform model of data to resources on the Web and access the information in these resources through natural language.

Motivation

While it is possible to handle the problem of disparate resource representation formats by integrating each format separately into a document retriever, it is far more efficient to design a modular, uniform interface that serves as a front end to those resources. We have implemented Omnibase, a program that provides uniform access to resources that store information about objects. Because resources vary so widely in their capabilities and functions, it is impossible for Omnibase to support all capabilities for all resources. However, essentially every resource supports the simple abilities to store and retrieve entities with properties, so Omnibase provides a uniform interface for this retrieval. This restricted type of query is more versatile than it appears on the surface, as many natural language questions can be mapped into object–property pairs: not only the obvious 'x of y' → {y, x}, but also many verbal forms, for example, 'who wrote the music for Star Wars' → {Star Wars, composer}, 'how big is Costa Rica' → {Costa Rica, area}, 'show me paintings by Monet' → {Monet, works}. Our experiments reveal that in practice object–property questions occur quite frequently. For example, just ten Web sources accessed in this way turned out to be sufficient for handling 47% of TREC-2001 questions from the QA track [5]. Therefore, even a system that can retrieve values only for object–property queries will provide a very significant benefit for question answering.
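The mapping from questions to object–property pairs can be illustrated with a small sketch. It assumes a hand-written pattern table; the names `PATTERNS` and `to_object_property` and the regular expressions are hypothetical and are not how START analyzes questions, which relies on full natural language parsing.

```python
import re

# Hypothetical pattern table: (regex with an "obj" group, property name).
# A property of None means the property is read from the "prop" group.
PATTERNS = [
    (re.compile(r"^who wrote the music for (?P<obj>.+)$", re.IGNORECASE), "composer"),
    (re.compile(r"^how big is (?P<obj>.+)$", re.IGNORECASE), "area"),
    (re.compile(r"^show me paintings by (?P<obj>.+)$", re.IGNORECASE), "works"),
    # The generic 'x of y' form maps to {y, x}.
    (re.compile(r"^(?:what is )?the (?P<prop>\w+) of (?P<obj>.+)$", re.IGNORECASE), None),
]


def to_object_property(question):
    """Return an (object, property) pair, or None if no pattern fits."""
    text = question.strip().rstrip("?").strip()
    for pattern, prop in PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group("obj"), prop or match.group("prop")
    return None


print(to_object_property("Who wrote the music for Star Wars?"))
print(to_object_property("What is the population of Costa Rica?"))
```

A fixed pattern table like this covers only the phrasings it lists; it is meant to show the shape of the object–property query, not a realistic question analyzer.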

Related Work

Several systems, such as Ariadne [4] and Tsimmis [1], have attempted to integrate heterogeneous Web sources under a common interface, but queries to these systems must be formulated in a formal language, which makes them inaccessible to a regular user. Omnibase stands apart from these systems in its use of the object–property–value data model, which naturally corresponds to both users' English questions and on-line content.

Approach

Omnibase uses two main procedures for storing data: 1) It can add objects to its internal database. Objects are identified by their class and name. 2) It can add scripts to its internal database for retrieving properties of objects. Scripts are associated with classes.
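The two storage procedures can be pictured with a toy in-memory store. This is a minimal sketch under stated assumptions: the `ObjectStore` class, its method names, the class name "example-movie", and the placeholder script are all hypothetical and do not reflect Omnibase's actual interface or database.

```python
class ObjectStore:
    """Toy stand-in for the internal database (hypothetical API)."""

    def __init__(self):
        self.objects = set()  # (class_name, object_name) pairs
        self.scripts = {}     # (class_name, property_name) -> callable

    def add_object(self, class_name, name):
        """Register an object, identified by its class and name."""
        self.objects.add((class_name, name))

    def add_script(self, class_name, property_name, script):
        """Register a script that retrieves one property for a class."""
        self.scripts[(class_name, property_name)] = script


store = ObjectStore()
store.add_object("example-movie", "Gone with the Wind")
# Placeholder script: a real script would consult the data source.
store.add_script("example-movie", "director", lambda name: f"<director of {name}>")
```

Keying scripts by class rather than by individual object means one script serves every object from the same data source.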

Omnibase has two main procedures for retrieving data:

1) Given a query string, it can detect objects within the string, using "fuzzy matching" (that is, allowing for mismatches in case, spacing, accent marks, and punctuation) and recognizing synonyms. This allows the caller (the START natural language system) to recognize queryable objects, and it also aids parsing by allowing multi-word tokens to be chunked: "who hopped flown over the fence" is unparseable, but the parallel "who directed gone with the wind" can be easily analyzed as "who directed x" once Omnibase has recognized "gone with the wind" as the name of an object.

2) Given an object designator and a property name, Omnibase can retrieve the object's value for that property by running the class script associated with the property. Classes are specific to data sources, and therefore scripts are customized to particular data sources. If a data source is a search URL, scripts for classes in that data source construct appropriate URLs, submit them to the Web, and return the result, possibly parsing it to return only part of it. If a data source is one or more Web pages with data for all entries expressed in similarly formatted HTML, scripts for classes in that data source retrieve a Web page and parse the HTML to find the correct segment of the page.

For a more detailed description of Omnibase, see [3].
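Both retrieval procedures can be sketched in a few lines. This is an illustration only, assuming that fuzzy matching is approximated by normalizing case, accents, punctuation, and runs of whitespace before comparing; every function name, class name, synonym entry, and the `example.org` URL below is hypothetical, and the script merely builds the URL that a real script would go on to fetch and parse.

```python
import re
import unicodedata
from urllib.parse import urlencode


def normalize(text):
    """Fold case, strip accent marks, and collapse punctuation and spacing."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


def build_index(objects, synonyms):
    """Map normalized names and synonyms to (class, canonical name) pairs."""
    index = {normalize(name): (cls, name) for cls, name in objects}
    for alias, target in synonyms.items():
        index[normalize(alias)] = target
    return index


def detect_objects(query, index):
    """Return every known object whose name occurs in the query on word boundaries."""
    padded = f" {normalize(query)} "
    return [entry for key, entry in index.items() if f" {key} " in padded]


def search_url_script(name):
    """Hypothetical class script: build the search URL a real script would fetch."""
    return "https://example.org/search?" + urlencode({"q": name})


def get_property(scripts, cls, name, prop):
    """Run the script registered for (class, property) on the named object."""
    return scripts[(cls, prop)](name)


objects = [("example-movie", "Gone with the Wind"), ("example-country", "Côte d'Ivoire")]
synonyms = {"Ivory Coast": ("example-country", "Côte d'Ivoire")}
index = build_index(objects, synonyms)
scripts = {("example-movie", "director"): search_url_script}

print(detect_objects("Who directed GONE  WITH THE WIND?", index))
print(get_property(scripts, "example-movie", "Gone with the Wind", "director"))
```

Once "gone with the wind" is detected as a single object, the remainder of the question reduces to "who directed x", which is the chunking benefit described above. Matching on normalized tokens is a simplification; it does not handle names written with no spaces at all or misspellings.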

Figure 1: START responding to the question "How much does it cost to attend MIT?" with information provided by Omnibase.

Progress

We have integrated Omnibase into the START natural language system (see [2] and related abstracts in this collection). Omnibase allows us to quickly and conveniently augment START's knowledge base with Web and other data sources. It is no longer necessary to compromise START's modularity with large amounts of resource-specific code. Students can learn in hours how to write Omnibase scripts, substantially reducing the time it takes to integrate a new data source. Omnibase has significantly increased the quantity and diversity of data which START can access and queries which it can answer.

Future

Property scripts are easy to write, and novice programmers can learn to write simple ones very quickly. Unfortunately, Web sites often change their formatting over time, and rewriting scripts to keep up with those changes is time-consuming. Our Hap-Shu system (see related abstract) makes it possible to write generalized scripts that work at a conceptual level rather than directly at the HTML level.

Research Support

This work is supported in part by the Disruptive Technology Office as part of the AQUAINT Phase 3 research program.

References:

[1] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo. Extracting Semistructured Information from the Web. In Workshop on Management of Semistructured Data at PODS/SIGMOD'97, Tucson, Arizona, 1997.

[2] Boris Katz. Annotating the World Wide Web Using Natural Language. In Proceedings of the 5th RIAO Conference on Computer Assisted Information Searching on the Internet (RIAO '97), Montreal, Canada, 1997.

[3] Boris Katz, Sue Felshin, Deniz Yuret, Ali Ibrahim, Jimmy Lin, Gregory Marton, Alton Jerome McFarland, and Baris Temelkuran. Omnibase: Uniform Access to Heterogeneous Data for Question Answering. In Proc. of the 7th Int. Workshop on Applications of Natural Language to Information Systems (NLDB '02), Stockholm, Sweden, June 2002.

[4] C. Knoblock, S. Minton, J. Ambite, N. Ashish, I. Muslea, A. Philpot, and S. Tejada. The Ariadne Approach to Web-based Information Integration. In International Journal on Cooperative Information Systems v. 10; 1/2, pp. 145–169, 1999.

[5] Jimmy Lin. The Web as a Resource for Question Answering: Perspectives and Challenges. In Proceedings of LREC-2002, 2002.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu