CSAIL Research Abstract


Detecting and Parsing Embedded Lightweight Structures

Philip Rha & Robert Miller

Motivation

Text documents, web pages, and source code all contain language structures that can be parsed with corresponding parsers. Some documents, such as JSP pages, Java tutorial pages, and Java source code, often contain language structures nested within another language structure. Such embedded-language documents pose a difficult parsing challenge for text processing systems. Some existing methods rely on a custom parser and on documents in which the boundaries of embedded structures are explicitly marked. One example is the tag <script language=Javascript>, which marks regions of embedded JavaScript in HTML web pages. However, such methods fail for documents where the boundaries of embedded structures are not explicitly marked, such as a Java tutorial page, in which unmarked Java structures are embedded in HTML. Although parsers exist for both the outer and the inner language, neither is suited to parsing the embedded structures in the context of the document: the standard HTML parser cannot make sense of the Java structures, and the standard Java parser cannot locate the Java code embedded within the HTML. Our goal is to handle embedded-language documents that contain unmarked embedded structures using standard existing parsers.

Approach

We are developing a new technique for selectively applying existing parsers on intelligently transformed document content.

The task of parsing these embedded structures can be broken into two phases: detecting the embedded structures and parsing them. To detect embedded structures, we take advantage of the fact that any given language has natural boundaries at which embedded structures can appear. We use these natural boundaries to narrow the search space for embedded structures. We reduce the search space further with a statistical analysis of token frequency for different language types. By combining natural boundaries with token frequency analysis, we can generate, for any given document, a set of regions that have a high probability of being embedded structures.
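The detection phase described above can be illustrated with a minimal sketch. It is not the LAPIS implementation. It assumes that the natural boundaries are simply the text runs between HTML tags, and it replaces the statistical analysis with a crude score: the fraction of tokens in a region that belong to a small, hand-picked set of Java tokens. The names JAVA_TOKENS, java_score, and candidate_regions, and the threshold of 0.3, are hypothetical.

```python
import re

# Hypothetical token set: a few Java keywords and punctuation marks.
# A real system would estimate token frequencies from sample code.
JAVA_TOKENS = {
    "public", "private", "class", "static", "void", "int", "new",
    "return", "import", "{", "}", "(", ")", ";",
}

# A token is a run of word characters or a single punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def java_score(text):
    """Return the fraction of tokens in text that belong to JAVA_TOKENS."""
    tokens = TOKEN_RE.findall(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in JAVA_TOKENS)
    return hits / len(tokens)


def candidate_regions(html, threshold=0.3):
    """Return the text runs between tags whose score meets the threshold."""
    # Assumed natural boundaries: the positions of HTML tags.
    regions = [region.strip() for region in re.split(r"<[^>]+>", html)]
    return [r for r in regions if r and java_score(r) >= threshold]
```

Given a page with one prose paragraph and one code listing, candidate_regions would be expected to keep only the listing, because prose rarely contains braces, semicolons, or Java keywords at a comparable rate.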

To parse the embedded structures, the text of the region must often be transformed into a form that is readable by the intended parser. In our example Java tutorial page, the quotation marks in the Java structure are represented by the HTML entity &quot;. In order for the standard Java parser to process that text, the HTML entity &quot; must first be transformed to the character ". Our approach provides a systematic way to transform the document content into a form that is appropriate for the embedded structure parser using simple replacement rules.
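As an illustration of simple replacement rules, the sketch below applies an ordered list of literal substitutions to a region of text. The rule set and the names REPLACEMENT_RULES and apply_rules are assumptions made for this example; the actual rules used in LAPIS are not given here.

```python
# Ordered (source, replacement) rules. The rule for "&amp;" comes last so
# that text such as "&amp;quot;" decodes to the literal "&quot;" and not
# to a quotation mark.
REPLACEMENT_RULES = [
    ("&quot;", '"'),
    ("&lt;", "<"),
    ("&gt;", ">"),
    ("&amp;", "&"),
]


def apply_rules(text, rules=REPLACEMENT_RULES):
    """Apply each literal replacement rule to text, in order."""
    for source, replacement in rules:
        text = text.replace(source, replacement)
    return text


escaped = "System.out.println(&quot;a &lt; b&quot;);"
print(apply_rules(escaped))
```

Keeping the rules as plain data makes it possible to swap in a different rule list for a different pair of outer and inner languages without changing the function.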

Using natural boundaries and statistical analysis of token frequency, we locate regions that are likely to contain embedded structures. Combined with replacement rules that transform document content into a parsable form, this lets us parse a wide range of documents with embedded structures using existing parsers.
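The sketch below shows how the two phases might fit together around an unmodified, existing parser. It rests on several assumptions: Python's built-in ast.parse stands in for the standard Java parser, html.unescape stands in for the replacement rules, and candidate regions are simply the text runs between HTML tags. The function name parse_embedded is hypothetical.

```python
import ast
import html
import re


def parse_embedded(document):
    """Return (source, tree) pairs for regions the existing parser accepts."""
    parsed = []
    # Assumed candidate regions: the text runs between HTML tags.
    for region in re.split(r"<[^>]+>", document):
        # Stand-in for replacement rules: decode HTML entities.
        source = html.unescape(region).strip()
        if not source:
            continue
        try:
            tree = ast.parse(source)  # the existing parser, used unchanged
        except SyntaxError:
            continue  # the region is not a structure this parser accepts
        parsed.append((source, tree))
    return parsed
```

Trying the parser on every region is only workable for small documents, and a short prose fragment can happen to be valid in the embedded language (a single word is a valid Python expression, for instance). This is one reason to filter regions by token frequency before parsing.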

Applications for this work include syntax coloring, indexing and information retrieval, and error checking. This work is being implemented in the LAPIS text processing system.

Figure: LAPIS browser highlighting a Java class automatically discovered in a web page.

References:

[1] Philip Rha. "Detecting and Parsing Embedded Lightweight Structures." MEng thesis proposal, May 2004.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu
(Note: On July 1, 2003, the AI Lab and LCS merged to form CSAIL.)