Abstracts - 2007
Spectral Anonymization of Data
Thomas A. Lasko, Staal A. Vinterbo & Peter Szolovits
Progress in medical artificial intelligence, computational diagnosis, and clinical research depends on the availability of high-quality datasets. But although we have the ability to distribute large electronic datasets to any interested researcher, we also have ethical and legal responsibilities to protect the privacy of the patients who generated that data. Moreover, patients are beginning to refuse to participate in studies and disease registries for fear of their information being released to a third party and used against them, and this is in turn causing bias in the datasets [1, 2, 3]. We can use contractual agreements to reduce unwanted dissemination, but these agreements are apparently insufficiently persuasive for some patients, and they don't provide much protection in cases of theft, accidental disclosure or malicious release.
Anonymization is the process of perturbing the data in such a way that the identity of the source individuals cannot be inferred, even when an attacker has access to parts of the original database. The first step is to remove all identifying information such as name, social security number, etc. But even after these direct identifiers are removed, patient identity can still be inferred using pattern or linkage attacks on the remaining data. The great challenge is to perturb the data in a way that prevents these attacks but nevertheless preserves the data's research value. High-dimensional datasets are particularly challenging because the amount of perturbation needed to prevent these attacks can be exponential in the number of dimensions.
Dozens of anonymization methods exist today, but none provides both perfect privacy protection and perfect analytic utility. These methods include adding univariate or multivariate noise, suppressing individual cells, generalizing coding schemes, swapping cells within columns, replacing groups of similar records with their centroid, and releasing only synthetic records drawn from an inferred distribution.
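Two of these perturbation families can be sketched in a few lines of NumPy. This is an illustrative sketch only, not any method evaluated in this work; the function names and the noise scale are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_multivariate_noise(X, scale=0.1):
    """Perturb records with Gaussian noise whose covariance is a scaled
    copy of the data's own covariance (multivariate noise addition)."""
    cov = np.cov(X, rowvar=False)
    noise = rng.multivariate_normal(np.zeros(X.shape[1]), scale * cov, size=X.shape[0])
    return X + noise

def swap_within_columns(X):
    """Independently permute each column (cell swapping). Marginal
    distributions are preserved exactly, but record-level linkage,
    and with it cross-column correlation, is destroyed."""
    Y = X.copy()
    for j in range(Y.shape[1]):
        rng.shuffle(Y[:, j])
    return Y
```

The two functions sit at opposite ends of the privacy-utility trade-off: mild noise addition keeps records recognizable, while full within-column swapping gives strong protection at the cost of all multivariate structure.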
We observe that the anonymizer is not required to operate in the original basis of the data. Most real data contains some kind of internal structure, and we can transform our representation of the data to a basis that is better aligned with that structure. Transforming to such a basis can make the anonymizer's job easier, allow for better anonymizations, or skirt the curse of dimensionality. The spectral basis provided by the Singular Value Decomposition (SVD) rotates the axes of anonymization to better align with the structure of the data, and this can be optimal for data with linear structure. For data with nonlinear structure, this transformation still improves the anonymization, but it is not optimal. We are currently investigating the properties of linear spectral anonymization and developing better methods for anonymizing data with nonlinear structure.
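One way such a spectral-basis anonymizer can be instantiated is to apply a simple perturbation, here within-column swapping, in the SVD basis rather than the original one. This is a hedged sketch under our own assumptions, not necessarily the authors' exact algorithm:

```python
import numpy as np

def spectral_swap(X, seed=0):
    """Anonymize by independently permuting the columns of the left
    singular vectors (the spectral basis), then projecting back.

    Because the columns of U are mutually orthogonal, independent
    within-column swaps in this basis approximately preserve the
    data's covariance structure, which the same swaps applied in the
    original basis would destroy. The column means and the total
    variance are preserved exactly.
    """
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)  # center first
    for j in range(U.shape[1]):
        rng.shuffle(U[:, j])          # permute each spectral column independently
    return (U * s) @ Vt + mu          # rotate back to the original basis
```

The contrast with plain cell swapping is the point: permutations that would wipe out all cross-column correlation in the original basis become far less damaging once the axes are aligned with the data's linear structure.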
This research is supported in part by the National Library of Medicine (Grant 2R01LM007273-04A1).
[1] Jelke G. Bethlehem, Wouter J. Keller, and Jeroen Pannekoek. Disclosure control of microdata. Journal of the American Statistical Association, 85(409):38--45, Mar 1990.
[2] Julie R. Ingelfinger and Jeffrey M. Drazen. Registry research and medical privacy. New England Journal of Medicine, 350(14):1452--1453, Apr 2004.
[3] C. A. Welch. Sacred secrets--the privacy of medical records. New England Journal of Medicine, 345(5):371--372, Aug 2001.
[4] William E. Winkler. Masking and re-identification methods for public use microdata: Overview and research problems. Research Report Series, Statistics 2004-06, Statistical Research Division, U.S. Bureau of the Census, Washington DC, October 2004.