Research Abstracts - 2006

Nuggeteer: Complex Information Retrieval Evaluation

Gregory Marton & Alexey Radul

The Problem

Evaluation is a critical component of any development effort, and automatic evaluation lets developers experiment with their systems in many ways to find out which components contribute the most. Nuggeteer is an open-source automatic evaluation tool for systems that address a new type of complex information retrieval need. It uses existing human judgements to approximate official evaluations, and solicits additional judgements from developers for increased accuracy.

Motivation

While many information retrieval systems focus on returning relevant documents, and question answering has focused on finding exact named-entity answers to simple factoid questions, many interesting questions are too complex to be evaluated either with document lists or answer patterns. A complex question will have multi-part answers, and each part may describe a fact about the world that can be stated in a wide variety of ways.

For example, for a TREC definitional question "Who is Jar Jar Binks?", one might want the answer to contain four critical facts:

  • fictional character
  • in the Phantom Menace
  • computer-generated
  • annoying

However, these may be stated or implied in any number of ways, for example: "One grating hero in this first Star Wars prequel is the CG creation Jar Jar Binks."
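
As a concrete illustration, such an answer key and a previously judged response could be represented along the following lines. This is a hypothetical sketch for exposition only; the field names are not Nuggeteer's actual data format.

    # Hypothetical answer key for "Who is Jar Jar Binks?" (illustrative only).
    answer_key = {
        "question": "Who is Jar Jar Binks?",
        "nuggets": [
            {"id": 1, "description": "fictional character"},
            {"id": 2, "description": "in the Phantom Menace"},
            {"id": 3, "description": "computer-generated"},
            {"id": 4, "description": "annoying"},
        ],
        # Judged system responses are attached to the nuggets an assessor found
        # them to convey; the example sentence above implies all four
        # ("hero" / "first Star Wars prequel" / "CG creation" / "grating").
        "judged_responses": [
            {"text": "One grating hero in this first Star Wars prequel "
                     "is the CG creation Jar Jar Binks.",
             "contains": [1, 2, 3, 4]},
        ],
    }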

Evaluations of the TREC definition, "other", and relationship questions [4] have used expensive human assessors to judge whether each system response (like the sentence above) contains the facts that make up a complex answer. Nuggeteer uses these judgements to approximate the human evaluations automatically, allowing developers to track their systems' successes and failures and to compare different approaches or parameter settings.

Previous Work

The Qaviar system [1] is similar in spirit and implementation to Nuggeteer, but it has limitations stemming from the different task it was built for, and it is not generally available. Pourpre [2] was the first freely available automatic evaluation system for the same set of target tasks; like Qaviar and Nuggeteer, it relies on weighted keyword overlap matching, but its scores do not facilitate comparisons between evaluated systems. Nuggeteer offers three important improvements:

  • interpretability of the scores, as compared to official scores,
  • expandability of the answer key using human judgements of system responses, and
  • information about individual facts and responses, for detailed error analysis.

Approach

Like previous evaluation systems, Nuggeteer [3] relies on keyword overlap with known-correct answers to identify probably-correct responses. We optimize the matching over all combinations of the parameter settings that influence it, including stemming, n-gram length, term weighting, stopword removal, and acceptance thresholds. As our predecessors did, we measure the ranking agreement between Nuggeteer and official scores, and we additionally report absolute agreement and confidence intervals.
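
The following sketch conveys the flavour of such keyword-overlap matching. It is an illustration under assumed parameter choices (unigrams, a toy stopword list, a 0.5 acceptance threshold), not Nuggeteer's actual implementation.

    import re

    # Toy stopword list; stopword removal is one of the tunable options.
    STOPWORDS = {"a", "an", "and", "in", "is", "of", "the"}

    def ngrams(text, n=1, remove_stopwords=True):
        """Set of word n-grams from lowercased text."""
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        if remove_stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def overlap(response, reference, n=1):
        """Fraction of the reference's n-grams that also occur in the response."""
        ref_grams = ngrams(reference, n)
        if not ref_grams:
            return 0.0
        return len(ref_grams & ngrams(response, n)) / len(ref_grams)

    def contains_nugget(response, nugget_texts, threshold=0.5, n=1):
        """Credit the response with a nugget if it overlaps sufficiently with the
        nugget's description or with any previously judged response for it."""
        return any(overlap(response, text, n) >= threshold for text in nugget_texts)

The n-gram length, term weighting, stemming, stopword removal, and acceptance threshold fixed here are exactly the kinds of parameters that are tuned against the official judgements; the values above are placeholders.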

Research Support

This work is supported in part by the Advanced Research and Development Activity as part of the AQUAINT Phase II research program.

References:

[1] Eric J. Breck, John D. Burger, Lisa Ferro, Lynette Hirschman, David House, Marc Light, and Inderjeet Mani. How to Evaluate Your Question Answering System Every Day ... and Still Get Real Work Done. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), June 2000.

[2] Jimmy Lin and Dina Demner-Fushman. Automatically Evaluating Answers to Definition Questions. Technical Report 119, February 2005.

[3] Gregory A. Marton. Nuggeteer: Automatic Nugget-Based Evaluation using Descriptions and Judgements. In Proceedings of NAACL/HLT, 2006.

[4] Ellen Voorhees. Overview of the TREC 2005 question answering track. NIST publication, 2005.
