CSAIL Research Abstract

Introduction

Architecture, Systems
& Networks

Language, Learning,
Vision & Graphics

Physical, Biological
& Social Systems

Theory

horizontal line

Approximate Algorithms for High-Dimensional Geometric Problems

Alex Andoni, Nicole Immorlica, Piotr Indyk, Vahab Mirrokni & Nitin Thaper

What

The goal of this project is to design efficient algorithms for high-dimensional geometric problems. A prototypical problem in this area is the nearest neighbor problem: given a database of points in a d-dimensional space, devise a data structure which, given any query point, quickly finds its nearest neighbor in the database. Other problem include: finding the closest pair of points, computing a tree spanning all the points with minimum cost (MST), various clustering problems, etc.

Why

The nearest neighbor problem is of key interest in many areas. For example, in order to find a pair of similar images in a large data set, the researchers typically represent each image by a feature vector, and then find the closest pair of vectors. Since the number of features is typically very large, this approach requires solving high-dimensional geometric problems. In a similar fashion, high-dimensional geometric problems naturally occur in text and visual databases, machine learning, computational statistics, vector quantization, etc.

Unfortunately, the known algorithms for this problem suffer from "the curse of dimensionality": they either require storage that is exponential in the dimension, or have query time that quickly becomes linear as the dimension grows.It is widely conjectured that this phenomenon is inevitable, as long as one insists on exact answers.

How

Our approach to this problem is to design algorithms that are allowed to report approximate answers. The approximation is defined with respect to the distance: a c-nearest neighbor is a point whose distance to the query point is at most c times the distance from the query point to its nearest neighbor. In our earlier work (cf. [5]) we show that it is possible to design a c-approximate nearest neighbor data structure (based on Locality-Sensitive Hashing) with query time (roughly) O(dn^(1/c) and space O(dn^(1+1/c)). The algorithm can be modified to report all approximate nearest neighbors. In this way, the algorithm can be used to find the exact nearest neighbor as well, with the query time depending on the number of approximate nearest neighbors (which is often small). The algorithm, and its variants, has found a substantial number of applications, e.g., to efficient comparison of genomic sequences and motif finding, example-based pose estimation, image and shape matching with distributions of local features, etc.

The Locality-Sensitive Hashing algorithm works only for data sets in which the distance is measured using L1 (Manhattan) or L2 (Euclidean) norm. In many applications the distance functions are more complex, and include Hausdorff distance, Earth-Mover Distance and Edit Distance. Extending the nearest neighbor algorithm to such metrics can be, in principle, achieved by embedding (cf. [6]) those metrics into L1 or L2 norms.

This leads to the following two questions:

Is it possible to design more efficient algorithms for the approximate near neighbor ?
How to to extend those algorithms to more complex metrics ?

Progress

In [3], we proposed an improved version of the Locality-Sensitive Hashing algorithm, together with an efficient implementation. Preliminary experiments [7] show that on some benchmark data sets, such as MNIST handwritten digit database, the new algorithm offers an order of magnitude speed-up of the running time over known algorithms, even for the exact nearest neighbor problem. From the theoretical perspective, we were able to show that the exponent in the query time of the new algorithm is (slightly) smaller than 1/c (for the L2 norm). The new algorithm works directly in the L2 norm, and thus is easy to implement for this case.

In [1], we presented a method for embedding Earth-Mover Distance into the L1 norm. The theoretical bound on the distance distortion caused by the embedding is logarithmic. Nevertheless, we show that for natural data sets, the actual distortion in practice is much lower, and so the embedding could be still quite useful. The method has been used since then for fast countour matching and image and shape matching with distributions of local features.

For the approximate nearest neighbor under the edit distance (for strings of length up to d), we designed [2] a c-approximation algorithm, with constant c, O(d) query time and "sub-exponential" space, i.e., exponential only in d^a for arbitrarily small a>0. Although not (yet) practical, the algorithm provides hope that a better algorithm (with polynomial space) can be found in the future. Interestingly, the algorithm does not use embeddings and instead relies on a product-metric approach. A recent paper [Khot-Naor'05] shows that the edit distance cannot be embedded into L1 with constant distortion. This indicates that the product metric method provably outperforms the embedding method for this problem.

Finally, in [4], we investigated the complexity of the exact closest pair problem in very high dimensions, that is when the dimension is as large as the number of points. We show that, for a few Lp norms, the problems is reducible to matrix multiplication.

Research support

This research is supported by NSF Career Award.

References

[1] Indyk, P. , and N. Thaper, "Fast Image Retrieval via Embeddings", Proceedings of the 3rd International Workshop on Statistical and Computational Theories of Vision (at ICCV), Nice, France, October 2003.

[2] Indyk, P., "Approximate Nearest Neighbor under Edit Distance via Product Metrics", Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LO, January 2004, pp. 646-650.

[3] Datar, M., and N. Immorlica and P. Indyk and V. Mirrokni, "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions", Proceedings of the 20th Symposium on Computational Geometry , New York, NY, June 2004, pp. 253-262.

[4] Indyk, P., and M. Lewenstein, O. Lipsky and E. Porat, "Closest Pair Problems in Very High Dimensions", Proceedings of the 31st International Colloquium on Automata, Languages and Programming , Turku, Finland, July 2004, pp. 782-792.

[5] Indyk, P., "Nearest Neighbor in High-Dimensional Spaces", in CRC Handbook of Discrete and Computational Geometry, 2nd edition, Goodman and O'Rourke, eds., CRC Press, 2004, pp. 877-892.

[6] Indyk, P., and J. Matousek, "Low-Distortion Embeddings of Finite Metric Spaces", in CRC Handbook of Discrete and Computational Geometry, 2nd edition, Goodman and O’Rourke, eds., CRC Press, 2004., pp. 177-196.

[7] Andoni, A. , and N. Immorlica and P. Indyk and V. Mirrokni, "Locality-sensitive hashing using stable distributions", in "Nearest Neighbor Methods in Learning and Vision: Theory and Practice", Darrell, T. and P. Indyk and G. Shakhnarovich, eds., MIT Press, Cambridge, MA, in preparation.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu
(Note: On July 1, 2003, the AI Lab and LCS merged to form CSAIL.)