Research Abstracts - 2006

Hap-Shu: Conceptual Segmentation of Web Pages

Boris Katz, Baris Temelkuran, Sue Felshin, & Gabriel Zaccak

The Problem

An undesirable side-effect of the Web's dynamic nature is the instability of site layout and page content. This poses a serious problem for wrapper scripts tailored to specific formats: even modest changes to a page's content or layout structure often require significant modification of the associated scripts.

Hap-Shu [5] is a tool that conceptually segments Web pages, recognizing the layout of a page much as a human reader does, and automatically creates wrapper scripts for extracting relevant information segments. We have integrated Hap-Shu into our Omnibase system (see related abstract and [3]) to speed data integration and to increase the reliability of the system.

Motivation

The unstructured representation and dynamic nature of knowledge on the Web make retrieving knowledge from a Web page difficult. The traditional ways to extract information from HTML documents, which hold the largest portion of the knowledge on the Web, involve manually written wrappers containing low-level regular expressions [2] or high-level tag paths [4], as well as automatically or semi-automatically generated wrappers [1]. These wrappers are difficult to build and costly to maintain. They are often vulnerable to minor format changes, such as the capitalization of tags or the addition of spaces in the HTML document, and to major changes, such as the introduction of tables.
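To make this fragility concrete, here is a minimal sketch in Python of a low-level regular-expression wrapper (our illustration, not code from any of the cited systems; the HTML fragment and pattern are hypothetical):

    import re

    # Hypothetical wrapper for a source that renders facts as
    # "<b>Population:</b> 68,893,918"; the pattern is typical of
    # hand-written, low-level wrappers.
    PATTERN = re.compile(r"<b>Population:</b>\s*([\d,]+)")

    def extract_population(html):
        match = PATTERN.search(html)
        return match.group(1) if match else None

    print(extract_population("<b>Population:</b> 68,893,918"))      # 68,893,918
    print(extract_population("<B>Population:</B> 68,893,918"))      # None
    print(extract_population('<b class="x">Population:</b> 1,000')) # None

The last two calls fail only because of a capitalization change and an added attribute, precisely the minor format changes described above.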

Mirroring remote data locally, either in its original form or as parsed data, would solve the script maintenance problem, but is not practical because there is too much data to mirror (the Web is vast) and because Web pages are often dynamic (cached weather reports, for example, would be out of date). Furthermore, while locally mirrored data has no maintenance problems, scripts to access local data are no easier to write than those that access live, remote data.

A human being examining a Web page can often easily "extract" a desired segment of information from the page, identifying it by visual characteristics (boldface, font size, layout on the page) and/or textual clues (target appears next to or below heading, or contains text that matches a certain pattern). We have developed a system to identify coherent data segments in Web pages in the same manner, so that a human "script writer" can write scripts to extract information by selecting relevant segments, which is far easier and significantly more robust than writing regular-expression-matching scripts.

Approach

Figure 1: Fragment of a page from the World Factbook.

If a human user is looking for, say, the population of the country of Turkey, he can go to the World Factbook, find the page on Turkey, and scroll down the page until he sees the word 'Population'. He will reach the fragment shown in Figure 1. Here he has to choose between two different rows, both of which seem to contain population information; these rows are labeled 'Population' and 'Population growth rate'. The user will select the first row, since he is looking for 'Population' and not 'Population growth rate'. One heuristic that captures this choice is to pick the row in whose title the word 'Population' occupies the largest share. This process can be briefly described as finding an object with a title that contains 'Population' and as few other characters as possible.
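A minimal Python sketch of this title-selection heuristic (our illustration of the rule just described, not Hap-Shu's actual implementation):

    def best_title(titles, keyword):
        # Among titles containing the keyword, prefer the shortest one,
        # i.e. the one with the fewest characters beyond the keyword.
        candidates = [t for t in titles if keyword.lower() in t.lower()]
        return min(candidates, key=len) if candidates else None

    print(best_title(["Population", "Population growth rate"], "Population"))
    # -> 'Population'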

We must then decide where the information about population ends. Again, a human user will notice that 'Age Structure' is formatted the same way as 'Population', so any information up to that point should be of interest. Briefly, this process can be described as: find a title containing the word 'Population' and collect all the information until the next title of the same format. Of course, if the page ends, or the table in which the information resides ends, the user stops looking for the next title.

The example above illustrates the two main ideas embodied in Hap-Shu: 1) looking for some segment of document structure that contains the relevant keywords, and 2) looking for a marker in the underlying HTML code that signals the end of the current information segment and the beginning of the next.
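A simplified Python sketch combining these two ideas (again our illustration; it operates on an assumed pre-parsed list of (format, text) pairs rather than on the raw HTML that Hap-Shu itself processes):

    def extract_segment(elements, keyword):
        # elements: (format, text) pairs in document order, where 'format'
        # stands in for how an element is rendered (e.g. its tag path).
        for i, (fmt, text) in enumerate(elements):
            if keyword.lower() in text.lower():
                segment = []
                for next_fmt, next_text in elements[i + 1:]:
                    if next_fmt == fmt:  # next same-format title: segment ends
                        break
                    segment.append(next_text)
                return segment
        return None

    page = [("b", "Population"), ("td", "68,893,918 (July 2005 est.)"),
            ("b", "Age Structure"), ("td", "0-14 years: 26.6% ...")]
    print(extract_segment(page, "Population"))
    # -> ['68,893,918 (July 2005 est.)']

Reaching the end of the list (the end of the page or of the enclosing table) likewise ends the segment, since the inner loop simply runs out of elements.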

Hap-Shu includes a Web-based graphical editor that simplifies creating scripts. The script writer loads a representative page from a source into the editor and provides one or more keywords with which to identify the relevant content segment. Hap-Shu finds the most natural element for which the keywords serve as a title (or appear in a title) and returns it to the user as the most likely result.

Progress

Hap-Shu greatly speeds script writing and yields more robust, more easily maintained scripts in cases where a content element can be identified via a heading, which is usually the case.

Future

Hap-Shu will be extended to include a repair system which will periodically test scripts, detect failure, and, where possible, repair scripts on the fly.

Research Support

This work is supported in part by the Advanced Research and Development Activity as part of the AQUAINT Phase II research program.

References:

[1] N. Ashish and C. Knoblock. Wrapper Generation for Semi-structured Internet Sources. In Workshop on Management of Semistructured Data, 1997.

[2] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo. Extracting Semistructured Information from the Web. In Workshop on Management of Semistructured Data at PODS/SIGMOD '97, 1997.

[3] Boris Katz, Sue Felshin, Deniz Yuret, Ali Ibrahim, Jimmy Lin, Gregory Marton, Alton Jerome McFarland, and Baris Temelkuran. Omnibase: Uniform Access to Heterogeneous Data for Question Answering. In Proc. of the 7th Int. Workshop on Applications of Natural Language to Information Systems (NLDB '02), Stockholm, Sweden, June 2002.

[4] K. Taniguchi, H. Sakamoto, H. Arimura, S. Shimozono, and S. Arikawa. Mining Semi-Structured Data by Path Expressions. In Proceedings of the 4th International Conference on Discovery Science, LNAI, 2001.

[5] Baris Temelkuran. Hap-Shu: A Language for Locating Information in HTML Documents. Master's Thesis, Massachusetts Institute of Technology, 2003.
