Learning Wrappers Efficiently Using Unlabeled Examples

ChongMeng Chow & Robert Miller

Motivation

A wrapper is a set of procedures for automatically extracting information from a particular resource. Current techniques for generating wrappers automatically require multiple labeled examples on multiple pages as training data. Our goal is to make wrappers practical for end users. In particular, wrappers are a necessary part of a web automation system, which lets a user develop scripts that interact automatically with the web.

Consider a user shopping for books at an online bookstore. Before buying, the user wants to check whether the books are available from the local library. Without automation, the user must manually extract the title of a book from the bookstore's web page and enter it into the search page on the library's web site, and must repeat this extraction for every book of interest.

We want to enable users to automate this kind of interaction easily, and in particular to do so by demonstration: the user gives one example of what to do, and a script does the rest automatically. To make such a system as easy to use as possible, we want to drive the number of user-supplied examples as low as possible, ideally to one.

Approach

As a first step, we have implemented wrapper learning techniques that require few user-supplied labels and that automatically acquire unlabeled pages to guide the selection of appropriate features for the wrapper.

References

[1] ChongMeng Chow and Robert Miller. "Learning Wrappers Efficiently for Web Information Extraction Using Unlabeled Examples." Submitted to AAAI 2005.