U-REST: An Unsupervised Record Extraction SysTem

Yuan Kui Shen & David R. Karger

We built a system that extracts record structures from web pages with no direct human supervision. Records are commonly occurring HTML-embedded data tuples that describe people, offered courses, products, company profiles, etc. We present a simplified framework for studying the problem of unsupervised record extraction, one that separates the algorithms from the feature engineering. Our system, U-REST, formalizes an approach to unsupervised record extraction as a simple two-stage machine learning framework. The first stage is clustering, where structurally similar regions are discovered; the second stage is classification, where the discovered groupings (clusters of regions) are ranked by their likelihood of being records. We describe and summarize the results of an extensive survey of features for both stages, and we conclude by comparing U-REST to related systems. The results of our empirical evaluation show encouraging improvements in extraction accuracy.

1. INTRODUCTION

Records are data tuples wrapped in HTML. On the web, records are presented as collections of consistently formatted HTML snippets. They appear in list pages such as search results, course catalogues, or directory listings. While there are many types of records (nested records, tree records, etc.), our focus is on flat records: records characterized by an ordered set of fields corresponding to the columns of an underlying data source (such as a database table).

Given a list page P, the task of record extraction is to return the set of regions C = {r1, ..., rn} that best matches a human-labelled reference set L. When this task is accomplished without any human supervision, we call it unsupervised record extraction (URE). The supervised variant of this problem has been studied extensively; most recently, Hogue et al. [2] demonstrated the use of a tree model to represent the extraction pattern. In general, supervised methods use learning techniques to induce an extraction pattern from a set of labelled examples.

Because records repeat, it is intuitive to think that simply searching for repetitions should suffice to find record instances. However, not all repetitions are records. Some repetitions are composed of parts of records (fields), others constitute collections of records, and yet others come from formatting regularities that serve functions such as navigation or advertising. The URE task is therefore difficult because there are different types of repetitions, and repetitions are often noisy.

Several recent works have tried to tackle the URE problem. Omini [1] searched for record-separator tags between contiguous record instances, posing the problem as one of segmentation. MDR/DEPTA [5] compared successive groups of subtrees and returned similar groupings as record regions. ViNT [6] used visual hints (content lines) as record-identification features. However, little attention has been devoted to how individual features affect system accuracy. Our work applies known machine learning techniques and surveys a set of simple features to gain clearer insight into the importance of features in the URE task.

2. SYSTEM OVERVIEW

The input to U-REST is a record-containing list page; the output is a set of potential record instances (record sets). List pages are first converted into a tag tree. A tag tree (or DOM) has a node for each open/close HTML tag pair.
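As a concrete illustration of this preprocessing step, the following is a minimal sketch of building such a tag tree with Python's standard-library html.parser; the class and field names are illustrative and do not come from U-REST's actual implementation.

    from html.parser import HTMLParser

    class TagNode:
        """One node per open/close HTML tag pair."""
        def __init__(self, tag, parent=None):
            self.tag = tag
            self.parent = parent
            self.children = []
            self.text = ""  # visible content directly under this node

    class TagTreeBuilder(HTMLParser):
        """Converts a list page into a tag tree (simplified sketch:
        ignores unclosed/void tags such as <br>)."""
        def __init__(self):
            super().__init__()
            self.root = TagNode("#root")
            self.stack = [self.root]

        def handle_starttag(self, tag, attrs):
            node = TagNode(tag, parent=self.stack[-1])
            self.stack[-1].children.append(node)
            self.stack.append(node)

        def handle_endtag(self, tag):
            if len(self.stack) > 1:
                self.stack.pop()

        def handle_data(self, data):
            self.stack[-1].text += data.strip()

    builder = TagTreeBuilder()
    builder.feed("<ul><li>Alice, EECS</li><li>Bob, Math</li></ul>")
    page_tree = builder.root  # root of the tag tree used by later stages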
Each subtree of the page tag tree represents a distinct contiguous (visible) region (or block) on the web page. Any single subtree or set of adjacent (sibling) subtrees can represent a potential record instance. Our task is to return the sets of subtrees (or sibling subtrees) that best correspond to records.

During U-REST's pattern-discovery phase, the page DOM is decomposed into its constituent subtrees. Subtrees that are clearly non-records, such as the root of the page or non-content-bearing leaf nodes, are removed. The resulting subtrees undergo hierarchical agglomerative clustering (HAC), whose goal is to find salient structural repetitions. HAC iteratively merges the closest two points (trees or clusters of trees). The HAC clustering metric determines the threshold at which the merging process terminates. We designed this metric as a pairwise tree-similarity function Φ: Φ is 1 if a pair of trees is similar enough to be in the same cluster and 0 otherwise. Φ is a trained classifier defined by its vector of features over tree pairs. In our experiments, a linear-kernel SVM was found to be the most effective. We also surveyed an extensive set of features [4] (some top-performing ones are noted in Table 1); using feature selection, we found an optimal three-feature combination.
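To make the clustering stage concrete, here is a minimal sketch of HAC driven by a pairwise classifier rather than a fixed numeric cutoff. It assumes a scikit-learn-style classifier with a predict method, and featurize_pair stands in for the tree-pair features of Table 1; none of these names are taken from U-REST's code.

    import itertools

    def make_phi(svm, featurize_pair):
        """Wrap a trained binary classifier (e.g. a linear-kernel SVM with a
        scikit-learn-style predict method) into the merge predicate Phi."""
        def phi(t1, t2):
            # featurize_pair is a placeholder for the tree-pair features.
            return svm.predict([featurize_pair(t1, t2)])[0] == 1
        return phi

    def hac_cluster(subtrees, phi):
        """Agglomerative clustering where the decision to merge is delegated
        to phi(t1, t2) -> True/False instead of a distance threshold.
        Two clusters merge if phi accepts any cross-cluster pair (single link)."""
        clusters = [[t] for t in subtrees]
        merged = True
        while merged and len(clusters) > 1:
            merged = False
            for i, j in itertools.combinations(range(len(clusters)), 2):
                if any(phi(a, b) for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
        return clusters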
Our results show that preserving the structural order of trees is critical when comparing them: tree-edit distance outperforms all n-gram-based metrics.
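A small example of why ordered-tree comparison matters, assuming the third-party zss package (a Zhang-Shasha tree-edit-distance implementation) purely for illustration; this is not the distance code used in U-REST:

    # pip install zss  (third-party Zhang-Shasha ordered-tree edit distance)
    from zss import Node, simple_distance

    # Two table rows built from the same multiset of tags, differing only
    # in which cell contains the <b> element.
    row_a = Node("tr").addkid(Node("td").addkid(Node("b"))).addkid(Node("td"))
    row_b = Node("tr").addkid(Node("td")).addkid(Node("td").addkid(Node("b")))

    # A bag-of-tags or unordered n-gram feature sees identical content,
    # while the ordered tree-edit distance still registers the difference.
    print(simple_distance(row_a, row_b))  # non-zero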
After clustering, the discovered clusters contain non-record as well as record repetitions; non-record types include fields, collections of records, decorative blocks, and navigational content. To differentiate record clusters from these non-record clusters, we designed a record-cluster classifier, modeled as a scoring function Ψ over clusters: Ψ assigns the most record-like cluster the highest score. Our feature survey and feature-selection work [4] yielded three classes of features that performed well in this subtask. An SVM trained on features from these three classes produces an optimal Ψ; the system returns the highest-scoring cluster as the record cluster.
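A sketch of how such a cluster scorer could be applied, assuming a scikit-learn-style classifier exposing decision_function and an illustrative featurize_cluster helper; the example feature names are assumptions, not U-REST's exact feature set.

    def rank_clusters(clusters, psi_svm, featurize_cluster):
        """Score each cluster with the record-cluster classifier Psi and
        return clusters ordered from most to least record-like.

        psi_svm: trained classifier exposing decision_function (assumed).
        featurize_cluster: placeholder for cluster-level features such as
        cluster size, content variation, or positional contiguity."""
        feats = [featurize_cluster(c) for c in clusters]
        scores = psi_svm.decision_function(feats)
        return sorted(zip(scores, clusters), key=lambda sc: sc[0], reverse=True)

    # The extraction result is the member set of the top-ranked cluster:
    # best_score, record_cluster = rank_clusters(clusters, psi_svm, featurize_cluster)[0]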
3. EVALUATION AND RESULTS

We compared the record sets extracted by U-REST, Omini, and MDR [3] against a human-labelled reference record set. The evaluation metric is the number of pages correctly labelled at or above a fixed F-score, where F-score = 2pr/(p+r), with precision p = |C ∩ L|/|C| and recall r = |C ∩ L|/|L|.
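For clarity, a sketch of this per-page metric; the region identifiers and the 0.9 cutoff below are illustrative assumptions, not the paper's exact settings.

    def page_fscore(extracted, labelled):
        """F-score between extracted regions C and human-labelled regions L.
        Regions are assumed to be hashable identifiers of tag-tree subtrees."""
        C, L = set(extracted), set(labelled)
        if not C or not L:
            return 0.0
        p = len(C & L) / len(C)  # precision
        r = len(C & L) / len(L)  # recall
        return 2 * p * r / (p + r) if (p + r) else 0.0

    def pages_correct(per_page_results, threshold=0.9):
        """Number of pages correctly labelled at or above a fixed F-score.
        per_page_results: list of (extracted, labelled) pairs, one per page."""
        return sum(page_fscore(c, l) >= threshold for c, l in per_page_results)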
Our results (Table 2) show that U-REST performs better than MDR or Omini. Manual analysis of the Omini/MDR results shows that mis-identification of record collections as records is one of the key causes of system errors. The variation feature introduced in our work helped differentiate many of the record-collection cases.

4. REFERENCES

[1] D. Buttler, L. Liu, and C. Pu. A fully automated extraction system for the World Wide Web. In IEEE ICDCS-21, April 2001.
[2] A. Hogue and D. Karger. Thresher: Automating the unwrapping of semantic content from the World Wide Web. In WWW 2005, 2005.
[3] B. Liu, R. Grossman, and Y. Zhai. Mining data records in web pages. UIC Technical Report, 2003.
[4] Y. K. Shen. Automatic record extraction from the world wide web. Master's thesis, MIT, 2005.
[5] Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, pages 76-85, New York, NY, USA, 2005. ACM Press.
[6] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu. Fully automatic wrapper generation for search engines, 2005.