If a human user is looking for, say, the population of Turkey, he can go to the World Factbook, find the page on Turkey, and scroll down until he sees the word 'Population'. He will reach the fragment shown in Figure 1. Here he has to choose between two rows, both of which seem to contain population information: one labeled 'Population' and the other 'Population growth rate'. The user will select the first row, since he is looking for 'Population' and not 'Population growth rate'. A heuristic that reproduces this choice is to pick the row in which the word 'Population' makes up the largest share of the title. This process can be briefly described as finding an object whose title contains 'Population' and as few other characters as possible.
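The title-selection heuristic described above can be sketched in a few lines of Python. This is only an illustration, not Hap-Shu's actual implementation: the function name `best_title` and the idea of passing candidate titles as a plain list of strings are assumptions made for the example.

```python
def best_title(titles, keyword):
    """Return the title that contains `keyword` with the fewest other characters.

    Hypothetical helper for illustration; not part of Hap-Shu.
    """
    keyword_lc = keyword.lower()
    # Keep only the titles that mention the keyword at all (case-insensitive).
    matches = [t for t in titles if keyword_lc in t.lower()]
    if not matches:
        return None
    # The fewest characters beyond the keyword wins.
    # min() returns the earliest candidate when two titles tie.
    return min(matches, key=lambda t: len(t.strip()) - len(keyword))


rows = ["Population", "Population growth rate", "Age structure"]
print(best_title(rows, "Population"))  # prints "Population"
```

Ranking by the number of extra characters, rather than taking the first match, keeps the choice stable even if 'Population growth rate' happens to appear before 'Population' on the page.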
We must then decide where the information about population ends. Again, a human user will notice that 'Age Structure' is formatted the same way as 'Population'. So any information up to that point should be of interest. Briefly, one can explain this process as: find a title with the word 'Population' and get all the information till the next title of the same format. Of course, if the page ends or a table where the information resides ends, the user will cease to look for the next title.
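The following Python sketch illustrates the "read until the next title of the same format" idea using only the standard library. It rests on several assumptions that are not taken from Hap-Shu itself: a title's "format" is approximated by the tuple of tags enclosing its text, the names `TextRuns` and `extract_segment` are hypothetical, and the sample page and its figures are made up. For brevity it stops only at the next matching title or at the end of the page, not at the end of an enclosing table.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "col", "wbr"}


class TextRuns(HTMLParser):
    """Flatten HTML into (format, text) pairs; format = tuple of open tags."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray close tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((tuple(self.stack), text))


def extract_segment(html, keyword):
    """Return the text between the best-matching title and the next title."""
    parser = TextRuns()
    parser.feed(html)
    parser.close()
    runs = parser.runs
    keyword_lc = keyword.lower()
    hits = [i for i, (_, text) in enumerate(runs) if keyword_lc in text.lower()]
    if not hits:
        return None
    # Prefer the shortest matching text: the keyword fills most of the title.
    start = min(hits, key=lambda i: len(runs[i][1]))
    title_format = runs[start][0]
    segment = []
    for fmt, text in runs[start + 1:]:
        if fmt == title_format:  # next title formatted like ours: stop here
            break
        segment.append(text)
    return segment


# Made-up page fragment; the figures are placeholders, not Factbook data.
PAGE = """
<table>
<tr><td><b>Population:</b></td><td>12,345 (made-up figure)</td></tr>
<tr><td><b>Age structure:</b></td><td>0-14 years: 30% (made-up figure)</td></tr>
<tr><td><b>Population growth rate:</b></td><td>0.5% (made-up figure)</td></tr>
</table>
"""

print(extract_segment(PAGE, "Population"))  # prints ['12,345 (made-up figure)']
```

Because the end marker is found by comparing formats rather than by naming 'Age structure' explicitly, the same call keeps working if the source inserts or renames the row that follows 'Population'.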
The example above illustrates the two main ideas embodied in Hap-Shu: 1) looking for some segment of document structure that contains the relevant keywords, and 2) looking for a marker in the underlying HTML code that signals the end of the current information segment and the beginning of the next.
Hap-Shu includes a Web-based graphical editor which simplifies creating scripts. The script writer loads a representative page from a source into the editor and provides a keyword or keywords with which to identify the relevant content segment. Hap-Shu finds the most natural element where the keywords serve as a title (or in a title) and returns this to the user as the most likely result.
Hap-Shu speeds script-writing considerably and yields more robust, more easily maintained scripts whenever a content element can be identified by its heading, which is usually the case.
Hap-Shu will be extended to include a repair system which will periodically test scripts, detect failure, and, where possible, repair scripts on the fly.
This work is supported in part by the Advanced Research and Development Activity as part of the AQUAINT Phase II research program.
[1] N. Ashish and C. Knoblock. Wrapper Generation for Semi-structured Internet Sources. In Workshop on Management of Semistructured Data, 1997.
[2] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo. Extracting Semistructured Information from the Web. In Workshop on Management of Semistructured Data at PODS/SIGMOD'97, Tsimmis, 1997.
[3] Boris Katz, Sue Felshin, Deniz Yuret, Ali Ibrahim, Jimmy Lin, Gregory Marton, Alton Jerome McFarland, and Baris Temelkuran. Omnibase: Uniform Access to Heterogeneous Data for Question Answering. In Proc. of the 7th Int. Workshop on Applications of Natural Language to Information Systems (NLDB '02), Stockholm, Sweden, June 2002.
[4] K. Taniguchi, H. Sakamoto, H. Arimura, S. Shimozono, and S. Arikawa. Mining SemiStructured Data by Path Expressions. In Proceedings of the 4th International Conference on Discovery Science, LNAI, 2001.
[5] Baris Temelkuran. Hap-Shu: A Language for Locating Information in HTML Documents. Master's Thesis, Massachusetts Institute of Technology, 2003.