CSAIL Research Abstracts - 2005

Automatic Web Page Concatenation

Matthew Webber & Robert Miller

Motivation

Many web sites divide articles and search-result listings into multiple small pages, conventionally using numbered hyperlinks and Next and Previous links for navigation through the pages. Although this practice makes each page smaller and faster to retrieve over a slow connection, multi-page articles can also frustrate users: they interfere with skimming and printing, prevent the web browser's Find command from searching the entire content, and force the user to wait for each page to load. Some sites provide a "Show All" option that displays the complete article or listing, but this practice is not widespread, and it usually offers only a simplified version of the page designed for printing, without navigation bars or other useful features.

Approach

We are developing a web browser extension that detects, concatenates, and displays web content spanning multiple pages. We are using LAPIS [1] to find patterns in web pages that are characteristic of multi-page content, such as a sequence of consecutive numbered links and links labeled "Next" and "Previous."
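The detection idea above can be illustrated with a minimal JavaScript sketch. It is not the LAPIS pattern language and does not use the Chickenfoot API. It assumes the link labels of a page have already been extracted, in document order, into an array of strings. The function names (longestNumberedRun, detectMultiPage), the minRun threshold, and the rule that either signal alone is enough are all assumptions made for this example, not details taken from the project.

```javascript
// Illustrative sketch only: NOT the LAPIS pattern language or the Chickenfoot API.
// All names and thresholds below are hypothetical choices for this example.

// Length of the longest run of increasing page numbers among the link labels.
// One step of +2 is tolerated per run, on the assumption that the current
// page's own number is often shown as plain text rather than as a link.
function longestNumberedRun(labels) {
  let best = 0;
  let run = 0;
  let prev = null;
  let gapUsed = false;
  for (const raw of labels) {
    const label = raw.trim();
    if (!/^\d+$/.test(label)) {
      // A non-numeric link ends the current run.
      run = 0;
      prev = null;
      gapUsed = false;
      continue;
    }
    const n = Number(label);
    if (prev !== null && n === prev + 1) {
      run += 1;
    } else if (prev !== null && n === prev + 2 && !gapUsed) {
      run += 1;
      gapUsed = true;
    } else {
      // Start a new run at this number.
      run = 1;
      gapUsed = false;
    }
    prev = n;
    best = Math.max(best, run);
  }
  return best;
}

// Classify a page from its link labels. Either a run of numbered links
// or a Next/Previous link counts as evidence (an assumed, simplistic rule).
function detectMultiPage(labels, minRun = 2) {
  const numberedRun = longestNumberedRun(labels);
  const hasNext = labels.some((l) => /\bnext\b/i.test(l));
  const hasPrevious = labels.some((l) => /\bprev(ious)?\b/i.test(l));
  return {
    numberedRun,
    hasNext,
    hasPrevious,
    isMultiPage: numberedRun >= minRun || hasNext || hasPrevious,
  };
}
```

For instance, detectMultiPage(["Home", "2", "3", "4", "Next"]) reports a numbered run of 3 and a Next link, so isMultiPage is true. A rule this loose would also flag a lone "Next article" link, so a real classifier would need stricter patterns.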

Figure: Example of multi-page content with a "Show All" link inserted.

Once a page has been classified as having multi-page content, a "Show All" link is inserted after its list of page links. When the user clicks this link, subsequent pages in the series are fetched and analyzed. Repeated non-content features (headers, sidebars, advertising) are distinguished from material unique to each page, and a new page is created that includes content from all pages, but only one copy of their non-content features. Because the created page has only one header section, one sidebar, and so forth, it has a more natural look and blends in better with the original website.
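The merging step described above can be sketched in JavaScript under strong simplifying assumptions. Each fetched page is modeled as an array of block strings, and a block counts as a repeated non-content feature only if an identical string appears on every page. The function name concatenatePages and this exact-match rule are hypothetical; the project's actual analysis of headers, sidebars, and advertising is not specified here.

```javascript
// Illustrative sketch only: pages are modeled as arrays of block strings,
// and "repeated" means "identical string present on every page".
// concatenatePages is a hypothetical name, not part of Chickenfoot or LAPIS.
function concatenatePages(pages) {
  if (pages.length === 0) return [];

  // Blocks of the first page that also appear on every other page.
  const shared = new Set(
    pages[0].filter((block) => pages.every((p) => p.includes(block)))
  );

  // Unique (content) blocks from all pages, in page order.
  const content = [];
  for (const page of pages) {
    content.push(...page.filter((block) => !shared.has(block)));
  }

  // Walk the first page's layout: keep one copy of each shared block in
  // place, and splice all content in where the first page's content began.
  const merged = [];
  let contentInserted = false;
  for (const block of pages[0]) {
    if (shared.has(block)) {
      merged.push(block);
    } else if (!contentInserted) {
      merged.push(...content);
      contentInserted = true;
    }
  }
  // If the first page had no unique block, append the content at the end.
  if (!contentInserted) merged.push(...content);
  return merged;
}

// Example: three pages sharing a header, sidebar, and footer.
const examplePages = [
  ["HEADER", "Part one", "SIDEBAR", "FOOTER"],
  ["HEADER", "Part two", "SIDEBAR", "FOOTER"],
  ["HEADER", "Part three", "SIDEBAR", "FOOTER"],
];
const mergedExample = concatenatePages(examplePages);
// mergedExample is:
// ["HEADER", "Part one", "Part two", "Part three", "SIDEBAR", "FOOTER"]
```

Exact string matching is the weakest part of this sketch. Real pages vary their shared regions slightly from page to page (for example, the list of page links itself, or rotating advertisements), so a practical version would need a more tolerant comparison.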

This project is implemented using Chickenfoot [2], a browser extension for Mozilla Firefox that allows the user to write JavaScript scripts that automate and customize web pages.

Applications for this work include end-user customization of multi-page search results, articles, and product listings.

References:

[1] Robert C. Miller and Brad A. Myers. "Interactive Simultaneous Editing of Multiple Text Regions." Proceedings of the USENIX 2001 Annual Technical Conference, Boston, MA, June 2001, pp. 161-174.

[2] Michael Bolin, Matthew Webber, Philip Rha, Tom Wilson, and Robert C. Miller. "Automation and Customization of Rendered Web Pages." Submitted to UIST 2005.

[3] Matthew Webber. "Automatic Web Page Concatenation." AUP project proposal, December 2004.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu
(Note: On July 1, 2003, the AI Lab and LCS merged to form CSAIL.)