Abstracts - 2007
Duplicate Detection in Semi-Structured Data
Nicholas E. Matsakis & David Karger
An essential step in data cleaning is identifying sets of records in a database that represent the same entity, a task often called duplicate detection. Much prior work on this problem has focused on detecting duplicate records of a single type in a structured domain: identifying duplicates in a list of publications, for example. These techniques often embed a fair amount of domain knowledge, either in preprocessing or in the handling of record properties, and are designed to operate as batch processes, labeling duplicates with no human interaction.
We approach this problem from a different angle, working with semi-structured data sets containing multiple types of entities expressed relationally: for example, identifying duplicates not only in a list of publications but also in the authors and publication venues associated with them. We are also interested in tuning the duplicate-detection model to a particular data set by interactively querying the user about particular records and using the answers to adjust the model.
We believe this is a sweet spot in the division of labor between human and machine: the human is neither required to invest the effort of developing reliable general heuristics nor to hand-label trivial examples of duplicates. This technique should be more suitable for situations where little is known about the data set schema a priori, or where the cost of encoding domain knowledge as rules or heuristics cannot be amortized across large collections of data.
Typical database records have a fixed number of strongly-typed properties. We instead use a semi-structured data model that does not restrict records to conform to a fixed schema: records are not required to have particular properties, and properties may have multiple values. Property values are either string literals or references to other records; in this way it is possible to express relations between records, such as that between a publication record and its author records.
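A minimal sketch of this data model, assuming a simple in-memory representation (the class and property names here are illustrative, not from the authors' system): every property maps to a list of values, values may be strings or references to other records, and no property is required.

```python
class Record:
    """A schema-free record: properties are optional and multi-valued."""

    def __init__(self, **properties):
        # Normalize every property to a list of values; a value is either
        # a string literal or another Record (a relational reference).
        self.properties = {
            name: list(value) if isinstance(value, (list, tuple)) else [value]
            for name, value in properties.items()
        }

    def get(self, name):
        # Missing properties are simply absent, not an error.
        return self.properties.get(name, [])


# A publication referring to an author record (illustrative data):
author = Record(name="D. Karger")
paper = Record(title="Duplicate Detection in Semi-Structured Data",
               author=[author], year="2007")

print(paper.get("author")[0].get("name")[0])  # follow the reference
print(paper.get("venue"))                     # absent property -> []
```

Representing references as first-class values is what lets duplicate decisions propagate across types: merging two author records can change the evidence for merging the publications that point to them.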
To validate this approach, we believe it is necessary to test in multiple data domains, since each domain's properties may make identifying duplicates easier or more difficult. Consequently, we have identified the test domains of personal bibliographic records, personal music libraries, and email archives. These are a good test bed because the identity relationships between the entities are well-defined, but the data itself is often entered by hand with little to no effort devoted to maintaining global consistency. As such, it is subject to the variations, omissions, and errors that make this problem challenging.
In the absence of prior information about the schema of a data set, one reasonable approach is to identify duplicate records based on some measure of their similarity: across a wide range of schemas, records with similar properties are more likely to be duplicates than records with dissimilar properties.
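One plausible schema-agnostic similarity measure, sketched here purely for illustration (the abstract does not commit to a particular metric), is the Jaccard overlap of token sets, averaged over the properties two records share:

```python
def tokens(values):
    """Lowercased word tokens across all string values of a property."""
    return {t.lower() for v in values for t in v.split()}

def jaccard(a, b):
    """Set overlap in [0, 1]; empty sets count as dissimilar."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def record_similarity(r1, r2):
    """Average per-property Jaccard similarity over shared properties."""
    shared = set(r1) & set(r2)
    if not shared:
        return 0.0
    return sum(jaccard(tokens(r1[p]), tokens(r2[p])) for p in shared) / len(shared)

# Two hand-entered variants of the same publication (illustrative):
r1 = {"title": ["Duplicate Detection in Semi-Structured Data"], "year": ["2007"]}
r2 = {"title": ["Duplicate detection in semistructured data"], "year": ["2007"]}
print(record_similarity(r1, r2))  # high, despite the spelling variation
```

Because the measure only looks at whatever properties happen to be present in both records, it needs no advance knowledge of the schema.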
It is clear, however, that not all properties are equally distinguishing: similarity in the title or author property of a publication is a greater indication of a possible identity match than similarity in the year published or venue. Thus it is necessary to actively learn the relative importance of different properties in indicating identity (or non-identity).
We propose to do this by using a parameterized similarity model and interactively querying about marginally similar groups of records. The results of this query serve two purposes. First, they are used to update the parameters of the model in order to better distinguish duplicates and second, they allow the human to hand label the most difficult cases.
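The parameterized model and the query-driven update might be sketched as follows. This is an assumed, perceptron-style update rule chosen for concreteness; the abstract does not specify how query answers adjust the parameters, and all names here are hypothetical.

```python
def weighted_similarity(per_prop_sims, weights):
    """Combine per-property similarities using learned per-property weights.

    per_prop_sims: {property: similarity in [0, 1]}
    """
    total = sum(weights.get(p, 0.0) for p in per_prop_sims)
    if total == 0:
        return 0.0
    return sum(weights.get(p, 0.0) * s for p, s in per_prop_sims.items()) / total

def update_weights(weights, per_prop_sims, is_duplicate, lr=0.1):
    """Nudge weights based on a user's answer to a query.

    If the user says "duplicate", boost the properties that agreed;
    if "not a duplicate", boost the properties that disagreed.
    """
    for p, s in per_prop_sims.items():
        signal = s if is_duplicate else (1.0 - s)
        weights[p] = weights.get(p, 1.0) + lr * signal
    return weights

w = {"title": 1.0, "year": 1.0}
sims = {"title": 0.9, "year": 1.0}     # e.g. near-identical titles, same year
score = weighted_similarity(sims, w)   # 0.95 with equal weights
w = update_weights(w, sims, is_duplicate=True)
```

The same query answer thus serves both purposes named above: it labels the hard case directly, and it shifts weight toward the properties (such as title) that turned out to be the distinguishing ones.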
Labeling a data set is an iterative process similar in spirit to the commonly used Expectation-Maximization (EM) framework. After each query, the model is updated based on the results of that query; the best partition of the data set into blocks of duplicates is found; and if desired a new query can be generated to refine the resulting labeling and model further.
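A runnable toy version of this loop, under heavy simplifying assumptions: here the "model" is just a similarity threshold, the query picks the most marginal pair (the one closest to the threshold), the answer moves the threshold, and the partition is a union-find clustering. This is illustrative scaffolding, not the authors' actual algorithm.

```python
from itertools import combinations

def similarity(a, b):
    """Toy token-overlap (Jaccard) similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def cluster(records, threshold):
    """Partition records into blocks via union-find on similar pairs."""
    parent = list(range(len(records)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)
    blocks = {}
    for i in range(len(records)):
        blocks.setdefault(find(i), []).append(records[i])
    return list(blocks.values())

def most_marginal_pair(records, threshold):
    """Query selection: the pair whose similarity is closest to the threshold."""
    pairs = combinations(range(len(records)), 2)
    return min(pairs, key=lambda p: abs(similarity(records[p[0]], records[p[1]]) - threshold))

def label_dataset(records, ask_user, threshold=0.5, rounds=2, step=0.1):
    """Iterate: query a marginal pair, update the model, re-partition."""
    for _ in range(rounds):
        i, j = most_marginal_pair(records, threshold)
        if ask_user(records[i], records[j]):  # user: are these duplicates?
            threshold -= step                 # accept more merges
        else:
            threshold += step                 # be stricter
    return cluster(records, threshold)

recs = ["Duplicate Detection in Semi-Structured Data",
        "Duplicate detection in semistructured data",
        "Snowflake schemas in data warehouses"]
# Stand-in for the human: answer based on the toy similarity itself.
blocks = label_dataset(recs, ask_user=lambda a, b: similarity(a, b) > 0.5)
print(len(blocks))  # the two title variants merge; the third stays apart
```

Even in this toy form, the loop exhibits the intended division of labor: the machine proposes the most ambiguous comparisons, and each human answer both labels a hard case and refines the model used for all subsequent comparisons.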
We are currently developing the theoretical framework for this approach, including the overall structure of the algorithm and the properties required of the similarity metric. Our current focus is on the joint problems of determining which queries are the most informative and how best to incorporate the query responses into the similarity model.
This research has been supported by NTT, the Packard Foundation, and funding from the Oxygen project.