MIT CSAIL Research Abstracts

CSAIL Publications and Digital Archive header

Technical Reports

Work Products

Research Abstracts

Historical Collections

horizontal line

Research Abstracts - 2006
horizontal line

horizontal line

Unsupervised Record Extraction for the World Wide Web

Yuan Kui Shen & David Karger

Introduction

A large portion of pages on the web today are dynamically generated. Many of these dynamic pages are database-backed, whereby the results of database queries are combined with HTML templating to generate the pages returned to the browser. Web pages present data in a human understandable form but in a form difficult for algorithmic processing. As a result, specialized data extraction programs or wrappers need to be written to facilitate programmatic access to HTML embedded data. Much of the previous research in web data extraction has been focused on bettering tools and algorithms for automatically generating wrappers. Work in wrapper induction [3][4][5] aim to automatically learn or generalize an extraction program from human labelled examples. However, standard wrapper induction techniques produce wrappers specific to the trained web page type. What is learned is not generalizable or transferable to new or unseen websites. We are developing a general technique for inducing extraction wrappers that does not require per page-type training. Our system when complete can extract data from previously unseen web page-types and can learn wrappers without supervision.

Data structures can come in many forms, therefore we constrain our focus and assumptions to only flat records. Flat records correspond to tuples or rows in a relational database. An example of a flat record instance can be an address, a search result, or an auction item. Record instances of the same record class (or schema) contain the same number of fields; fields in the same record class are presented in a consistent order. We assume that record-pages contain more than one record instance from a given record class. Under these assumptions, our system has to determine the exact locations of all salient record classes and their instances as accurately as possible.

Motivation and applications

Our system is suited for applications where there is need for a large quantity of data extraction. An ideal application would one where scalability is constrained by the number of wrappers required; in such cases, manual labelling per web page type under standard wrapper induction would be too work-intensive.

One such application would be web search. Modern web search engines index web pages with the implicit assumption that each page represents a single concept; results returned are at the granularity of pages. However, in a record-page each record instance represents a single concept and hence should be index separately. A user interested in finding a particular address, person or item should naturally be presented with results at the desired granularity. With automatic record extraction we can extract record instances, and index information at record level thereby providing a more precise result than the state-of-the-art.

A more ambitious application would be to use automatic record extraction as a shortcut to semanticize the web. Current solutions involving standardization (i.e. conversion of sites to XML, RSS or other standards) would take many years before full implementation and compliance. With an accurate automated record extraction system, we can build the envisioned network of objects or concepts (the semantic web[1]) in a much shorter time span.

Approach

In our system, web pages are represented as tag trees. The input to the system is a full tag tree and the output is a set of subtrees (of the input) grouped into record classes with each class's members being individual record instances. The difficulty of our task lies in determining which subtrees to group together (class discovery), and identifying which classes are true record classes. We model record instances as sub-tag-trees; we assume that if a record class exists in a page then its member instances will repeat. Because we assume repetition, subtrees from the same record class should be similar to one another under some metric. This similarity metric we determine by training a binary decision classifier that learns a function over pairs of subtrees. Our classifier combines multiple pairwise features ranging from tree-to-tree edit distance and vector space distances. We also experimented with numerous optimization methods (SVM, Logistic regression, Adaboost, and decision trees) and feature selection methods to come up with an optimal clustering metric. We used hierarchical clustering with complete-link to determine the optimal clusters.

Figure 1: The CSAIL PI page contain examples of three types of clusterable instances, record-collections (or rows), records, and fields. Successful record-cluster detection should be able to discriminate the record collection and field clusters as non-record clusters.

Even with optimized clustering, not all clusters are record clusters. For instance, trees that represent a row of record instances (record collection) could form a cluster under our clustering metric (figure 1). Also, subtrees of record instances (i.e. the fields inside records) also form non-record clusters. In short, additional record cluster detection processing is needed to differentiate non-record clusters from record ones. To solve this problem, we trained a classifier that used a combination of several cluster-level features such as the contiguity of cluster members, and the periodicity of internal features. Using this record-cluster detection classifier we can identify the clusters that are true record clusters. Using the combination of clustering followed by record cluster detection, we are able to build a simple system that with acceptable accuracy can extract records from web pages.

Results

The success of our system is evaluated by how well the extracted record instances matched human labelled instances. This evaluation is determined by the quality of the clustering (class discovery) and the quality of the record detection subsystems. The quality of clustering is measured by both the cluster purity (whether clusters only containing the record instances of the same type), and cluster recall (whether record instances are spread over many clusters or inside one cluster). For our clustering task, we were able to achieve an average cluster-purity of 96.3% and cluster-recall of 89% (f-score of 91.4%). For our record-detection task, we were able to achieve an optimal classification recall of 94% but a precision of only 40%. Our system can cluster fairly accurately but more work is still required in record-cluster detection. We also compared our work with a previous automatic record extraction system [2] and found an accuracy improvement of about 15%.

Future work

We plan to incorporated location and layout-specific cues into our system. For instance, navigational menu items are often misidentified as record clusters; but if we incorporated layout information then they can easily identified by their location on the page (often near the top and margins of the page). We are also looking at leveraging information across web pages and integrating known data sources as priors to improve the overall record extraction process.

Research Funding

This work is supported by the HP-MIT alliance and MIT's Project Oxygen.

References:

[1] Tim Berners-Lee, J. Hendler, and O. Lassila. The semantic web, May 2001.

[2] David Buttler, Ling Liu, and Calton Pu. A fully automated extraction system for the world wide web. In IEEE ICDCS-21, April 2001.

[3] Andrew Hogue and David Karger. Tree pattern inference and matching for wrapper induction on the world wide web. Master?s thesis, Massachusetts Institute of Technology, May 2004.

[4] Nickolas Kushmerick, Daniel S. Weld, and Robert B. Doorenbos. Wrapper induction for information extraction. In Intl. Joint Conference on Artificial Intelligence (IJCAI), pages 729?737, 1997.

[5] Ion Muslea, Steve Minton, and Craig Knoblock. A hierarchical approach to wrapper induction. In Proceedings of the Third International Conference on Autonomous Agents (Agents?99), pages 190?197, Seattle, WA, USA, 1999. ACM Press. 3

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu