CSAIL Publications and Digital Archive header
bullet Technical Reports bullet Work Products bullet Research Abstracts bullet Historical Collections bullet

link to publications.csail.mit.edu link to www.csail.mit.edu horizontal line


Research Abstracts - 2006
horizontal line

horizontal line

vertical line
vertical line

Duplicate Detection in Semi-Structured Data

Nicholas E. Matsakis & David Karger


Managers of large data repositories have long been familiar with the problem of duplicate detection or record linkage. Database software is typically designed with the assumption that there is a single record for each entity of interest. The presence of duplicate records violates this assumption and can result in problems when the data is used, including stale values and incomplete results when searching or browsing.

With the advent of powerful personal computing, these problems are being faced anew by computer users without training in database management. A typical personal computer contains databases of addresses, email messages, website bookmarks, photographs, and music, as well as all the files on the filesystem. These database are typically managed locally by a number of software applications without consulting an authority service. The result is that different personal databases use different identifiers for the same object, forming an impediment to sharing data across applications and systems.

Many prior advances in duplicate detection have been developed and evaluated using large, structured databases such as those maintained by libraries, census bureaus, and corporations. The standard model in these domains [1] considers a database as a collection of records describing objects of some type, with each record consisting of a prespecified set of named fields. For example, a library database may contain a set of records describing publications, with fields for the author, publisher, page count, and so forth. Duplicate detection in this model is a batch process performed on collection of records, resulting in a set of automatically matched records and a set of records held out for clerical review.

We are considering instead a semi-structured data model that does not restrict records to conform to a fixed schema: records may have any number of fields and may have multiple fields with the same name. Additionally, this model allows for relational fields which refer to other records. Finally, we are interested in replacing the batch model of duplicate detection with an online model, where the computer is allowed to ask the user questions and then use the results in further processing. We believe that this model is more appropriate for personal databases and could significantly improve the user experience of creating, managing, and sharing personal information.


Formally, we treat this problem as that of finding a partition of a set of records into a set of blocks, where the records in each block represent the same real-world entity. Following the work of [2], we are using a Conditional Random Field model to estimate the maximum-likelihood partition of the records, conditioned on the values of their associated fields. We are considering extensions which model the ideosyncracies of semi-structured data, including missing and multiple fields and relational structures. Additionally we are considering the problem of finding questions to ask the user whose answers are expected to decrease the resulting error the most.

To validate this approach we believe it necessary to test in multiple data domains, since each may use fields that may make this problem easier or more challenging. We have identified the test domains of personal bibliographic records, personal music libraries, and email archives. We feel these form a good test bed because the identity relationships between the entities are well-defined but the data itself is often entered by hand with little to no effort devoted to maintaining global consistency. As such, it is subject to the variations, omissions, and errors that make this problem challenging.

Another important aspect of these data domains is that the tend to be expressed in flat schemas describing only a single type of object: either publications, songs, or email messages. For example, in most music databases only songs are stored in records, while albums, artists, and genres exist only as properties of songs. We believe allowing artists to be records allows for increased flexibility, but to lift the artists from a database of songs requires discovering which artist string properties represent the same artist.

Matching these lifted records can be difficult; traditional duplicate detection algorithms are designed to work with flat records with a fixed number of properties. In contrast, lifted records have only the single field of the object name, but perhaps many relations to other objects. In a sense, much of what is known about a lifted object is contained in the fields in related objects. It is reasonable to expect that techniques that take into account these relations, perhaps performing simultaneous matching over a set of records of different types, will perform better than those that only do string matching on object names.


We have collected and labeled a dataset of personal bibliographic records from members of an MIT research group and have developed an initial algorithm for finding a partition of publications and improving it through question answering. We are currently working on extending this algorithm to different data sets (music and email) and simultaneously matching records of different types (e.g. publications, authors, and venues).


This research was supported by NTT, the Packard Foundation, and funding from the Oxygen project.


[1] I. P. Fellegi and A. B. Sunter. A Theory for Record Linkage. In Journal of the American Statistical Association, vol. 64 pp. 1183~~1210, 1969.

[2] Andrew McCallum and Ben Wellner. Toward Conditional Models of Identity Uncertainty with Application to Proper Noun Coreference. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, 2003.

vertical line
vertical line
horizontal line

MIT logo Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu