CSAIL Research Abstract

Learning Wrappers Efficiently Using Unlabeled Examples

ChongMeng Chow & Robert Miller

Motivation

A wrapper is a set of procedures that automatically extracts information from a particular resource. Current techniques for automatic wrapper generation require multiple labeled examples from multiple pages as training data.

Our goal is to put the power of wrappers within reach of the end user. In particular, wrappers are a necessary part of a web automation system, which lets a user develop scripts that interact with the web automatically.

Consider a user shopping for books at an online bookstore. Before buying, the user wants to check whether the books are available from the local library. Without automation, the user has to manually extract the title of a book from the bookstore's web page and enter it into the search page on the library's web site. This extraction has to be repeated for every book the user is interested in.

We want to enable users to automate this kind of interaction easily, and in particular to do so by demonstration: the user gives one example of what to do, and a script does the rest automatically. To make such a system as easy to use as possible, we want to drive the number of user-supplied examples as low as possible, ideally to one.

Approach

As a first step, we have implemented wrapper learning techniques that require few user-supplied labels and that automatically acquire unlabeled pages to guide the selection of appropriate features for the wrapper.
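To make the idea concrete, here is a minimal sketch of how unlabeled pages might guide feature selection. It is not the authors' algorithm: it assumes a wrapper is just a (prefix, suffix) pair of literal delimiters, and it assumes the target field occurs exactly once on every page. The function names, the sample pages, and the "exactly one match" criterion are all hypothetical choices made for illustration.

```python
import re

def candidate_wrappers(page, target, max_context=20):
    """Build (prefix, suffix) delimiter pairs of growing length
    from a single labeled example (hypothetical feature space)."""
    start = page.index(target)
    end = start + len(target)
    return [(page[max(0, start - n):start], page[end:end + n])
            for n in range(1, max_context + 1)]

def extract(page, wrapper):
    """Return every string on the page enclosed by the wrapper's delimiters."""
    prefix, suffix = wrapper
    pattern = re.escape(prefix) + r"(.*?)" + re.escape(suffix)
    return re.findall(pattern, page, flags=re.DOTALL)

def select_wrapper(labeled_page, target, unlabeled_pages):
    """Pick the shortest wrapper that recovers the label and also
    extracts exactly one value from every unlabeled page (assumed criterion)."""
    for wrapper in candidate_wrappers(labeled_page, target):
        if extract(labeled_page, wrapper) != [target]:
            continue  # does not reproduce the user's single label
        if all(len(extract(p, wrapper)) == 1 for p in unlabeled_pages):
            return wrapper
    return None

# One user-labeled example: the title "Dune" on a bookstore page (made-up data).
labeled = "Title: Dune | Price: $9.99"
# Pages fetched without labels; they carry an extra field the labeled page lacks.
unlabeled = [
    "Title: Emma | Price: $5.00 | Format: Paperback",
    "Title: Walden | Price: $7.25 | Format: Hardcover",
]

naive = select_wrapper(labeled, "Dune", [])           # uses the label only
guided = select_wrapper(labeled, "Dune", unlabeled)   # uses unlabeled pages too
```

In this toy data, the wrapper chosen from the label alone is consistent with the labeled page but also picks up the price on the unlabeled pages, because they contain a field the labeled page does not. Checking candidates against the unlabeled pages rejects it in favor of a longer, more specific delimiter pair, without asking the user for a second label.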

References

[1] ChongMeng Chow and Robert Miller. "Learning Wrappers Efficiently for Web Information Extraction Using Unlabeled Examples." Submitted to AAAI 2005.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu
(Note: On July 1, 2003, the AI Lab and LCS merged to form CSAIL.)