CSAIL Research Abstracts - 2005 link to http://publications.csail.mit.edu/abstracts/abstracts05/index.html link to http://www.csail.mit.edu
bullet Introduction bullet Architecture, Systems
& Networks
bullet Language, Learning,
Vision & Graphics
bullet Physical, Biological
& Social Systems
bullet Theory bullet

horizontal line

Algorithms for Data Streams / Sublinear-space Computation

Piotr Indyk & David Woodruff

What

The goal of this project is to design efficient algorithms for data streams. Such algorithms have very limited storage, and are required to process the information using only one pass over the data.

Why

Massive data sets are everywhere, e.g., the web, data feeds from the satellites or the network traffic data. The data size often exceeds the available memory, and thus the information has to be processed "on the fly". What can be computed in such a restricted model?

Progress

Distinct elements and frequency moments. In [1], we investigated lower bounds for the space needed to estimate the number of distinct elements in the input stream. This is a problem that frequently occurs in the applications. For example, one might want to compute the total number of distinct destinations of packets flowing through a network router (note that this is very different from simply counting the number of packets, which is easy to keep track of). The problem has been studied since the eighties, and several low-space algorithms for this problem are known. However, to estimate the number of distinct elements up to a factor of (1+ε), all known algorithm require space which is quadratic in 1/ε. Since in practice ε could often be as small as 0.01, improving the quadratic bound would have been of considerable importance. However, in our paper we show that the quadratic dependence of the required space on 1/ε is tight. Thus, there is no hope for obtaining better algorithm for this problem. Interestingly, the lower bound uses a combinatorial gadget which is constructed using a dimensionality reduction theorem.

Our lower bound made certain assumptions about the relation between 1/ε and the number of distinct elements in the stream. This assumption was removed in [2]. The latter paper also extended the lower bound to a more general problem of estimating the k-th frequency moments (Fk) of the stream. This quantity is defined as a sum, over of all distinct elements in the stream, of the number of times the element occurs raised to the kth power. E.g., F0 is equal to the number of distinct elements in the stream.

The problem of computing Fk in small space has been investigated in at least half a dozen papers, which steadily improved the upper and lower bounds for the amount of space needed to compute Fk . Finally, in [4], we gave a tight (modulo polylogarithmic factors) upper bound for the space complexity of this problem. That is, we designed an algorithm which estimates Fk up to (an arbitrarily small) constant factor, using space (roughly) O(m^(1-2/k)), for any k>2. Even though the space bound is polynomial in m, the size of the stream's domain, the algorithm needs only polylogarithmic time to process each new item that it sees.

Geometric data streams. In another line of research, we investigated dynamic algorithms for geometric data streams. In this setting, the input data consists of a sequence operations; each operation either adds or deletes a point to/from the current data set. The goal is to compute certain statistics of the data set, such as: the cost of the minimum tree spanning the points (MST), the minimum cost of a one-to-one matching among the points (assuming the number of points is even), the cost of clustering of the data with respect to various measures, etc. See this talk for an overview of geometric data streams.

In [3], we proposed a method for approximating the point-set statistics in the aforementioned setting. Our algorithms use only polylogarithmic space, which is exponentially smaller than than the space needed to store the whole input. The algorithms are obtained by representing the set of input points as a high-dimensional vector (using embeddings), and applying streaming algorithms designed for such vectors. For MST and matching, the algorithms achieve logarithmic approximation factors; for clustering problems, the approximation factor is arbitrarily close to 1. In another paper [5], we improve the approximation factor for the MST problem, by showing that it can be made arbitrarily close to 1 as well.

Research Support

This research is supported by NSF ITR grant CCR-0220280, NDSEG Fellowship, Alfred P. Sloan Foundation and Packard Foundation.

References

[1] Indyk, P. , and D. Woodruff, “Tight Lower Bounds for the Distinct Elements Problem”, Proceedings of the 44th IEEE Symposium on Foundations of Computer Science, Boston, MA, October 2003, pp. 283-290.

[2] Woodruff, D., "Optimal Space Lower Bounds for all Frequency Moments", Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LO, January 2004.

[3] Indyk, P., "Algorithms for Dynamic Geometric Problems over Data Streams", Proceedings of the 36th Symposium on Theory of Computing, Chicago, IL, May 2004, pp. 373-380.

[4] Indyk, P., and D. Woodruff, "Optimal Approximations of the Frequency Moments", Proceedings of the 37th ACM Symposium on Theory of Computing , Baltimore, MD, May 2005, to appear.

[5] Frahling, G., and P. Indyk and C. Sohler, "Sampling in Dynamic Data Streams and Applications", Proceedings of the 21st Annual ACM Symposium on Computational Geometry, Pisa, Italy, June 2005, to appear.

horizontal line

MIT logo Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu
(Note: On July 1, 2003, the AI Lab and LCS merged to form CSAIL.)