MIT CSAIL Research Abstracts

Research Abstracts - 2007

Question Answering from Foreign-Language, Semi-Structured Sources

Boris Katz, Gary Borchardt, Sue Felshin, Gregory Marton, Yuan Shen & Gabriel Zaccak

Problem

The Web contains valuable information in many languages. Although software tools can produce approximate English translations of foreign-language text, users who cannot read a particular language and who are unfamiliar with the relevant foreign-language sites have no way to find the specific pieces of information that are relevant to their needs.

Approach

We have recently implemented a capability to answer English questions on the basis of foreign-language material in semi-structured websites. This work extends the approach demonstrated by our START [1] and Omnibase [2] systems as used to answer English questions on the basis of English-language material in semi-structured websites.

At present, our foreign-language capability operates in connection with several Chinese and Arabic websites. Figure 1 illustrates the capability used to answer a question about the education of people in Shanghai, based on material in a Chinese-language site.

Figure 1: START and Omnibase answering a question by retrieving material from a Chinese-language website.

Several steps were required to create this new functionality. Once we had selected a number of Chinese and Arabic sites to be supported, we focused our attention first on question analysis. Foreign-language words can potentially be included within English questions in any of three ways. For example, the 1985 Japanese movie "Tampopo" can be represented

in the original foreign language as タンポポ or 蒲公英,
in transcription as "Tampopo", or
in translation as "Dandelion".

For our initial implementation of question answering from foreign-language sources, we chose to support foreign-language terms in transcription. This required us to extend the lexicon available to START and Omnibase so that it contains a body of pertinent, transcribed foreign-language terms. To accomplish this, we developed an automatic transcription tool that generates English transcriptions of Chinese and Arabic words, and we used this tool in combination with manual entry of alternate transcriptions where appropriate.
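The lexicon extension described above can be sketched as follows. This is a minimal illustration, not the authors' transcription tool: the character table, the `transcribe` and `build_lexicon` functions, and the manually entered alternate are all assumptions made for the example, and a real tool would need a full pronunciation dictionary and handling for characters with more than one reading.

```python
# Hypothetical, hand-written character table used only for this sketch.
CHAR_TO_PINYIN = {
    "郑": "zheng", "州": "zhou",
    "河": "he", "南": "nan",
    "上": "shang", "海": "hai",
}

# Manually entered alternate transcriptions (illustrative entry).
MANUAL_ALTERNATES = {"郑州": ["Chengchow"]}


def transcribe(term):
    """Return a capitalized transcription, or None if any character is unknown."""
    syllables = []
    for ch in term:
        if ch not in CHAR_TO_PINYIN:
            return None
        syllables.append(CHAR_TO_PINYIN[ch])
    return "".join(syllables).capitalize()


def build_lexicon(terms):
    """Map each lower-cased transcription to the original foreign-language term."""
    lexicon = {}
    for term in terms:
        names = []
        automatic = transcribe(term)
        if automatic is not None:
            names.append(automatic)
        # Alternates are added by hand where the automatic form is not enough.
        names.extend(MANUAL_ALTERNATES.get(term, []))
        for name in names:
            lexicon[name.lower()] = term
    return lexicon


lexicon = build_lexicon(["郑州", "河南", "上海"])
```

Keying the lexicon on the lower-cased transcription lets a term typed in an English question be mapped back to the native-script term that the foreign-language site uses.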

Next, we applied Omnibase's object–property–value model to the chosen Chinese and Arabic sites. This model depicts suitable data sources as collections of objects, with each object having one or more properties that have particular values. Previously, we had applied this model only to English semi-structured sources; however, the model itself is not language dependent, and thus it provides an appropriate characterization of many foreign-language sources as well.
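As a rough illustration of the object–property–value model, the sketch below represents a source as a collection of named objects, each with a dictionary of property values. The class names, field names, and sample entry are assumptions chosen for the example and are not Omnibase's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SourceObject:
    name: str          # transcribed name, as it would appear in an English question
    native_name: str   # name as written on the foreign-language site
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Source:
    title: str
    language: str
    objects: Dict[str, SourceObject] = field(default_factory=dict)

    def add(self, obj: SourceObject) -> None:
        # Index objects by lower-cased transcription for case-insensitive lookup.
        self.objects[obj.name.lower()] = obj

    def value(self, object_name: str, prop: str) -> Optional[str]:
        """Return the value of a property, or None if object or property is unknown."""
        obj = self.objects.get(object_name.lower())
        return None if obj is None else obj.properties.get(prop)


# Illustrative source with a single object; the value stays in the source language.
source = Source(title="Example Chinese-language site", language="zh")
source.add(SourceObject("Henan", "河南", {"capital": "郑州"}))
```

Nothing in this structure depends on the language of the stored values, which is the point the paragraph above makes about the model.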

Application of the object–property–value model requires the creation of a set of associated natural language annotations in START, plus the composition of a set of information access scripts for the properties covered by the model. With these elements in place, when the user enters a relevant request, START and Omnibase execute the appropriate access script to retrieve a suitable fragment of foreign-language text. We present this text to the user both in its original form and in English translation, produced by translation engines offered by BBN [3] and by Google.
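The flow from request to access script to translated answer might look roughly like the sketch below. It makes several simplifying assumptions: a regular expression stands in for START's natural language annotations, a small dictionary stands in for a fetched web page, and `stub_translate` is a hypothetical placeholder for a call to an external translation engine. None of these names come from START or Omnibase.

```python
import re

# Stand-in for page content; a real access script would fetch and parse the site.
SAMPLE_PAGE = {"河南": {"省会": "郑州"}}

# Transcription lexicon mapping question terms to native-script object names.
LEXICON = {"henan": "河南"}


def capital_script(native_name):
    """Hypothetical access script: return the capital field for one object."""
    return SAMPLE_PAGE.get(native_name, {}).get("省会")


# Each annotation pairs a question pattern with the access script for a property.
ANNOTATIONS = [
    (re.compile(r"^what is the capital of (?P<obj>[\w' -]+?)\??$", re.IGNORECASE),
     capital_script),
]


def stub_translate(text):
    """Placeholder for a translation engine; falls back to the original text."""
    glossary = {"郑州": "Zhengzhou"}
    return glossary.get(text, text)


def answer(question):
    """Return the retrieved fragment in original and translated form, or None."""
    for pattern, script in ANNOTATIONS:
        match = pattern.match(question.strip())
        if not match:
            continue
        native = LEXICON.get(match.group("obj").strip().lower())
        if native is None:
            continue
        fragment = script(native)
        if fragment is None:
            continue
        return {"original": fragment, "translation": stub_translate(fragment)}
    return None
```

Calling `answer("What is the capital of Henan?")` returns both the original fragment and its translation, mirroring the two-form presentation described above.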

Syntactic decomposition (for answering complex questions) works equally well with foreign-language sources. For example, if the answer to "the capital of Henan" has the synonym Zhengzhou (the transcription of 郑州), and the question answering support for another source accommodates a city by that same name, then we can answer:

⇒ How far is the capital of Henan from Seoul?
I know that Henan's capital is Zhengzhou (source: The government of The People's Republic of China).
Using this information, I determined how far Zhengzhou is from Seoul:
The distance between Seoul, South Korea and Zhengzhou, China is 769 miles (1,240 kilometers).
Source: START KB
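The two-step resolution in this example can be sketched as below. This is only an approximation under stated assumptions: regular expressions stand in for START's syntactic analysis, and the two lookup tables are hypothetical stand-ins for the separate sources, populated with the values quoted in the example.

```python
import re

# Stand-in for the first source: transcribed capital names by province.
CAPITALS = {"henan": "Zhengzhou"}

# Stand-in for the second source: the distance figure quoted in the example.
DISTANCES_MILES = {frozenset(["zhengzhou", "seoul"]): 769}

INNER = re.compile(r"the capital of (?P<obj>\w+)", re.IGNORECASE)
OUTER = re.compile(r"^how far is (?P<a>[\w ]+?) from (?P<b>[\w ]+?)\??$",
                   re.IGNORECASE)


def decompose(question):
    """Resolve an embedded 'the capital of X' phrase.

    Returns (rewritten_question, steps); rewritten_question is None if the
    embedded phrase cannot be resolved.
    """
    steps = []
    match = INNER.search(question)
    if match:
        capital = CAPITALS.get(match.group("obj").lower())
        if capital is None:
            return None, steps
        steps.append((match.group(0), capital))
        question = question[:match.start()] + capital + question[match.end():]
    return question, steps


def answer(question):
    """Answer a distance question after resolving any embedded sub-question."""
    rewritten, steps = decompose(question)
    if rewritten is None:
        return None
    match = OUTER.match(rewritten)
    if not match:
        return None
    key = frozenset([match.group("a").lower(), match.group("b").lower()])
    miles = DISTANCES_MILES.get(key)
    if miles is None:
        return None
    return {"steps": steps, "rewritten": rewritten, "miles": miles}
```

The sub-question is answered first, and its transcribed answer is substituted into the outer question so that the second source can be queried by a name it recognizes.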

Progress

We are integrating additional Chinese and Arabic sources into our implementation, and we are considering sources in other languages as well. In addition, we plan to extend our support for specifying foreign terms so that users can enter those terms in English translation or in their native character sets, in addition to transcription.

Support

This work is supported in part by the Disruptive Technology Office as part of the AQUAINT Phase 3 research program.

References

[1] Boris Katz. Annotating the World Wide Web Using Natural Language. In Proceedings of the 5th RIAO Conference on Computer Assisted Information Searching on the Internet (RIAO '97), Montreal, Canada, 1997.

[2] Boris Katz, Sue Felshin, Deniz Yuret, Ali Ibrahim, Jimmy Lin, Gregory Marton, Alton Jerome McFarland, and Baris Temelkuran. Omnibase: Uniform Access to Heterogeneous Data for Question Answering. In Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB '02), Stockholm, Sweden, June 2002.

[3] Bing Xiang, Jinxi Xu, Roger Bock, Ivan Bulyko, Jared Maguire, Spyros Matsoukas, Antti-Veikko Rosti, Richard Schwartz, Ralph Weischedel, and John Makhoul. The BBN Machine Translation System for the NIST 2006 MT Evaluation. In Proceedings of the NIST 2006 MT Workshop, September 2006.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu