CSAIL Research Abstract

Introduction

Architecture, Systems
& Networks

Language, Learning,
Vision & Graphics

Physical, Biological
& Social Systems

Theory

horizontal line

Duplicate Detection in Semi-Structured Data

Nicholas E. Matsakis & David Karger

Motivation

Managers of large data repositories have long been familiar with the problem of duplicate detection orrecord linkage. Database software is typically designed with the assumption that there is a single record for each entity of interest. The presence of duplicate records violates this assumption and can result in problems when the data is used, including stale values and incomplete results when searching or browsing.

With the advent of powerful personal computing, these problems are being faced anew by computer users without training in database management. A typical personal computer contains databases of addresses, email messages, website bookmarks, photographs, and music, in addition to all the files on the filesystem. These database are typically managed locally by a number of software applications without consulting an authority service. The result is that different personal databases use different identifiers for the same object, forming an impediment to sharing data across applications and systems.

Many prior advances in duplicate detection have been developed and evaluated using large, structured databases such as those maintained by libraries, census bureaus, and corporations. The standard model in these domains [1] considers a database as a collection of records describing objects of some type, with each record consisting of a prespecified set of named fields. For example, a library database may contain a set of records describing publications, with fields for the author, publisher, page count, and so forth. Duplicate detection in this model is a batch process performed on collection of records, resulting in a set of automatically matched records and a set of records held out for clerical review.

We are considering instead a semi-structured data model that relaxes some of these assumptions by not restricting records to conform to a fixed schema: records may have any number of fields and may have multiple fields with the same name. Additionally, this model allows for relational fields which refer to other records rather than holding a primitive value. Finally, we are interested in replacing the batch model of duplicate detection with an online model, where the computer is allowed to ask the user questions and then use the results in further processing. We believe that this model is more appropriate for personal databases and could significantly improve the user experience of creating, managing, and sharing personal information.

Approach

Formally we treat this problem as that of finding a partition of a set of records into a set of disjoint blocks, where the records in each block represent the same real-world entity. Following the work of [2], we are using a Conditional Random Field model to estimate the maximum-likelihood partition of the records, conditioned on the values of their associated fields. We are considering extensions which model the ideosyncracies of semi-structured data, including missing and multiple fields and relational structures. Additionally we are considering the problem of finding questions to ask the user whose answers are expected to decrease the resulting error the most.

To validate this approach we believe it is necessary to test in multiple data domains, since each may use fields that may make this problem easier or more challenging. We have identified the test domains of personal bibliographic records, personal music libraries, and email archives. We feel these form a good testbed because the identity relationships between the entities are well-defined but the data itself is often entered by hand with little to no effort devoted to maintaining global consistency. As such, it is subject to the variations, omissions, and errors that make this problem challenging.

One other important aspect these data domains share is that each describes only a single type of object: either publications, songs, or email messages. Only these primary record types are allowed to have fields while all other types of objects exist only as string literals. For example, in most music databases only tracks are allowed to have properties, while albums, artists, and genres exist only as string properties of tracks. Lifting these string properties into full records allows for increased flexibility, but requires discovering which strings represent the same object.

Matching these lifted records can be difficult; traditional duplicate detection algorithms are designed to work with flat records with a fixed number of properties. In contrast, lifted records have only the single field of the object name, but perhaps many relations to other objects. In a sense, much of what is known about a lifted object is contained in the fields in related objects. It is reasonable to expect that techniques that take into account these relations, perhaps performing simultaneous matching over a set of records of different types, will perform better than those that only do string matching on object names.

Progress

We have collected and labeled a dataset of personal bibliographic records from members of an MIT research group. We are using this dataset for an initial round of experiments to test our models for online duplicate detection and relational duplicate detection by simultaneously matching records of publications, authors, and venues.

Support

This research was supported by NTT, the Packard Foundation, and funding from the Oxygen project.

References:

[1] I. P. Fellegi and A. B. Sunter. A Theory for Record Linkage. In Journal of the American Statistical Association, vol. 64 pp. 1183--1210, 1969.

[2] Andrew McCallum and Ben Wellner. Toward Conditional Models of Identity Uncertainty with Application to Proper Noun Coreference. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, 2003.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu
(Note: On July 1, 2003, the AI Lab and LCS merged to form CSAIL.)