Even with optimized clustering, not all clusters are record clusters. For instance, trees that each represent a row of record instances (a record collection) could form a cluster under our clustering metric (figure 1). Subtrees of record instances (i.e., the fields inside records) also form non-record clusters. In short, an additional record-cluster detection step is needed to distinguish record clusters from non-record ones. To solve this problem, we trained a classifier on a combination of cluster-level features, such as the contiguity of cluster members and the periodicity of internal features. Using this classifier, we can identify the clusters that are true record clusters. By combining clustering with record-cluster detection, we are able to build a simple system that extracts records from web pages with acceptable accuracy.
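As an illustration of what a cluster-level feature such as contiguity might look like, the following is a minimal sketch. It is not the feature definition used in our classifier. It assumes each cluster member is identified by a hypothetical integer position among its siblings in page order, and it scores the fraction of neighbouring members that sit directly next to each other.

```python
def contiguity(positions):
    """Illustrative contiguity score for one cluster (assumed definition).

    `positions` holds the hypothetical sibling index of each cluster member
    in page order. The score is the fraction of consecutive member pairs
    whose positions differ by exactly one.
    """
    ordered = sorted(positions)
    if len(ordered) < 2:
        # A single member has no neighbours to be contiguous with.
        return 0.0
    adjacent = sum(1 for a, b in zip(ordered, ordered[1:]) if b - a == 1)
    return adjacent / (len(ordered) - 1)


# Members laid out back to back score high; scattered members score low.
tight = contiguity([3, 4, 5, 6])
scattered = contiguity([1, 5, 9])
```

Under this assumed definition, a list of search results laid out one after another would score close to 1, while members scattered across the page would score close to 0.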
The success of our system is evaluated by how well the extracted record instances match human-labelled instances. This depends on the quality of both the clustering (class discovery) and the record-detection subsystems. The quality of clustering is measured by cluster purity (whether each cluster contains only record instances of the same type) and cluster recall (whether record instances of a type are concentrated in one cluster rather than spread over many). For our clustering task, we achieved an average cluster purity of 96.3% and cluster recall of 89% (f-score of 91.4%). For our record-detection task, we achieved an optimal classification recall of 94% but a precision of only 40%. Our system clusters fairly accurately, but more work is still required in record-cluster detection. We also compared our work with a previous automatic record extraction system [2] and found an accuracy improvement of about 15%.
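The clustering measures described above can be sketched as follows. This is a minimal illustration under assumed definitions, not our exact evaluation code: each cluster is represented as a list of hypothetical human-assigned type labels, purity is the share of a cluster taken by its most common label, recall for a type is the share of its instances that land in the single best cluster, and the f-score is the usual harmonic mean.

```python
from collections import Counter


def cluster_purity(labels):
    """Share of a cluster's members that carry its most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def cluster_recall(clusters, record_type):
    """Share of a type's instances that fall in its best single cluster."""
    total = sum(labels.count(record_type) for labels in clusters)
    if total == 0:
        return 0.0
    best = max(labels.count(record_type) for labels in clusters)
    return best / total


def f_score(precision, recall):
    """Harmonic mean of two scores in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labelled clusters: "a" and "b" are two record types.
clusters = [["a", "a", "a", "b"], ["a", "b", "b"]]
purity = cluster_purity(clusters[0])
recall = cluster_recall(clusters, "a")
score = f_score(purity, recall)
```

In this toy example, three of the four "a" instances share the first cluster, so both the purity of that cluster and the recall for "a" are 0.75.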
We plan to incorporate location and layout-specific cues into our system. For instance, navigational menu items are often misidentified as record clusters; with layout information, they could easily be identified by their location on the page (often near the top and margins of the page). We are also looking at leveraging information across web pages and integrating known data sources as priors to improve the overall record extraction process.
This work is supported by the HP-MIT alliance and MIT's Project Oxygen.
[1] Tim Berners-Lee, J. Hendler, and O. Lassila. The semantic web, May 2001.
[2] David Buttler, Ling Liu, and Calton Pu. A fully automated extraction system for the world wide web. In IEEE ICDCS-21, April 2001.
[3] Andrew Hogue and David Karger. Tree pattern inference and matching for wrapper induction on the world wide web. Master's thesis, Massachusetts Institute of Technology, May 2004.
[4] Nickolas Kushmerick, Daniel S. Weld, and Robert B. Doorenbos. Wrapper induction for information extraction. In Intl. Joint Conference on Artificial Intelligence (IJCAI), pages 729–737, 1997.
[5] Ion Muslea, Steve Minton, and Craig Knoblock. A hierarchical approach to wrapper induction. In Proceedings of the Third International Conference on Autonomous Agents (Agents'99), pages 190–197, Seattle, WA, USA, 1999. ACM Press.
Computer Science and Artificial Intelligence Laboratory (CSAIL) The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA tel:+1-617-253-0073 - publications@csail.mit.edu