MIT CSAIL Research Abstracts

Research Abstracts - 2006

Spectral Anonymization of Data

Thomas A. Lasko, Staal A. Vinterbo & Peter Szolovits

The Problem

Progress in medical artificial intelligence, computational diagnosis, and clinical research depends on the availability of high-quality datasets. Although we have the ability to distribute large electronic datasets to any interested researcher, we also have ethical and legal responsibilities to protect the privacy of the patients from whom the data came. Moreover, patients are beginning to refuse to participate in studies and disease registries for fear that their information will be released to a third party and used against them, which in turn biases the datasets [1, 2, 3]. We can use contractual agreements to reduce unwanted dissemination, but these agreements apparently do not reassure some patients, and they provide little protection in cases of theft, accidental disclosure, or malicious release.

Anonymization is the process of modifying the data in such a way that the identity of the source patients cannot be inferred, even when an attacker has access to parts of the original database. The first step is to remove all direct identifiers, such as names and social security numbers. But even after these direct identifiers are removed, patient identity can still be inferred using pattern or linkage attacks on the remaining data. The challenge in anonymization is to perturb the data in a way that prevents these attacks but also maintains the data's research value. High-dimensional datasets are particularly challenging because the amount of perturbation needed to prevent these attacks seems to be exponential in the number of dimensions.
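
To make the idea of a linkage attack concrete, the following is a minimal sketch in Python. All records, field names, and the helper `link` are hypothetical and chosen only for illustration; they do not come from any dataset discussed here. The sketch assumes an attacker who holds a few non-medical attributes for known individuals and joins them against a release from which direct identifiers were already removed.

```python
# De-identified release: direct identifiers removed (all values hypothetical).
released = [
    {"zip": "02139", "birth_year": 1961, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1975, "sex": "M", "diagnosis": "asthma"},
    {"zip": "02142", "birth_year": 1961, "sex": "F", "diagnosis": "hypertension"},
]

# Outside information the attacker already holds (hypothetical).
known = [
    {"name": "Patient A", "zip": "02139", "birth_year": 1961, "sex": "F"},
    {"name": "Patient B", "zip": "02142", "birth_year": 1961, "sex": "F"},
]

# Attributes that are not identifiers alone but may be identifying in combination.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(known, released):
    """Return {name: diagnosis} for every known person who matches exactly one released record."""
    matches = {}
    for person in known:
        key = tuple(person[q] for q in QUASI_IDENTIFIERS)
        hits = [r for r in released
                if tuple(r[q] for q in QUASI_IDENTIFIERS) == key]
        if len(hits) == 1:  # a unique match re-identifies the record
            matches[person["name"]] = hits[0]["diagnosis"]
    return matches


reidentified = link(known, released)
print(reidentified)
```

In this toy release every combination of the three attributes is unique, so both known individuals are re-identified even though no names appear in the released data.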

We can perturb a dataset in many ways to achieve anonymity [4]. We can add univariate or multivariate noise. We can suppress individual cells, generalize coding schemes, or cluster the records into rough equivalence classes and represent each class with its centroid. We can infer a distribution on either the dataset as a whole or on individual clusters and draw synthetic records from these distributions. The aim of all these methods is to cause each record of the anonymized dataset to be mappable with roughly equal probability to at least k records of the original dataset, so an attacker is never able to infer a patient's identity with probability greater than 1/k.
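
As an illustration of one of the options listed above (clustering records into rough equivalence classes and releasing each class's centroid), here is a minimal Python sketch. The grouping rule, a lexicographic sort followed by cuts into consecutive groups, is a deliberately crude assumption made for brevity, and the function name `microaggregate` and the sample records are hypothetical.

```python
from collections import Counter


def microaggregate(records, k):
    """Replace each record with the centroid of a group of at least k records."""
    if k < 1 or len(records) < k:
        raise ValueError("need at least k records")
    dim = len(records[0])
    # Crude similarity ordering (assumption): sort records lexicographically.
    order = sorted(range(len(records)), key=lambda i: records[i])
    # Cut into consecutive groups of size k; the last group absorbs the remainder.
    n_groups = len(records) // k
    released = [None] * len(records)
    for g in range(n_groups):
        start = g * k
        stop = (g + 1) * k if g < n_groups - 1 else len(records)
        members = order[start:stop]
        centroid = tuple(
            sum(records[i][d] for i in members) / len(members)
            for d in range(dim)
        )
        for i in members:
            released[i] = centroid
    return released


# Hypothetical (age, systolic blood pressure) records.
records = [(34, 120.0), (36, 118.0), (35, 125.0),
           (61, 150.0), (63, 148.0), (60, 155.0), (62, 151.0)]
released = microaggregate(records, k=3)

# Each released value is shared by at least k original records, so a released
# record alone singles out a patient with probability at most 1/k.
group_sizes = Counter(released)
worst_case = 1 / min(group_sizes.values())
print(group_sizes, worst_case)
```

The price of this guarantee is the distortion introduced by replacing each record with its centroid, which grows as the groups must span more dimensions.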

Our Approach

Our approach to anonymization in high dimensions comes from the observation that we are not required to make the perturbations in the original basis of the data. Most real data contains some kind of internal structure, and we can transform our representation of the data to a basis that is better aligned with that structure. This reduces the number of dimensions we must handle in our anonymization, and therefore reduces the amount of perturbation necessary. We use spectral decomposition methods such as Singular Value Decomposition (SVD) to find a better basis for anonymization. Standard SVD produces a new basis that is a linear transformation of the original, but it is also possible to use nonlinear kernels [5] or a spectrum of principal curves [6] to more closely match nonlinear structure in the data.
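
The general idea of perturbing in a spectral basis rather than the original one can be sketched in Python with NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify which perturbation is applied in the new basis, so the sketch assumes one simple choice (shuffling each spectral coordinate independently across records). The function name `spectral_perturb` and the synthetic data are hypothetical.

```python
import numpy as np


def spectral_perturb(data, rng):
    """Perturb data in its SVD basis, then map back to the original basis."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Thin SVD: centered = U @ diag(s) @ Vt; the rows of Vt form the new basis.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    # Assumed perturbation: shuffle each spectral coordinate across records,
    # breaking the link between a released row and any single original row.
    U_perturbed = np.column_stack(
        [rng.permutation(U[:, j]) for j in range(U.shape[1])]
    )
    return U_perturbed @ np.diag(s) @ Vt + mean, s


rng = np.random.default_rng(0)
# Synthetic data with internal structure: three columns driven by one latent factor.
latent = rng.normal(size=(200, 1))
data = latent @ np.array([[1.0, 2.0, -1.0]]) + 0.1 * rng.normal(size=(200, 3))

released, s = spectral_perturb(data, rng)

# Share of total variance carried by each spectral axis.
explained = s**2 / np.sum(s**2)
print(explained)
```

Because the synthetic columns share one latent factor, nearly all of the variance falls on the first spectral axis, which is the sense in which a better-aligned basis reduces the number of dimensions that matter. The shuffle preserves the column means and the distribution of values along each spectral axis, while the cross-axis relationships are preserved only approximately.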

Research Support

This research is supported in part by the National Library of Medicine (NIH National Library of Medicine Training Grant 2T15LM07092-1).

References:

[1] Jelke G. Bethlehem, Wouter J. Keller, and Jeroen Pannekoek. Disclosure control of microdata. Journal of the American Statistical Association, 85(409):38--45, Mar 1990.

[2] Julie R. Ingelfinger and Jeffrey M. Drazen. Registry research and medical privacy. New England Journal of Medicine, 350(14):1452--1453, Apr 2004.

[3] C. A. Welch. Sacred secrets - the privacy of medical records. New England Journal of Medicine, 345(5):371--372, Aug 2001.

[4] William E. Winkler. Masking and re-identification methods for public use microdata: Overview and research problems. Research Report Series, Statistics 2004-06, Statistical Research Division, U.S. Bureau of the Census, Washington DC, October 2004.

[5] Christopher J. C. Burges. Geometric methods for feature extraction and dimensional reduction - a guided tour. In Oded Maimon and Lior Rokach, editors, The Data Mining and Knowledge Discovery Handbook, 59--92. Springer, 2005.

[6] T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84:502--516, 1989.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu