Research Abstracts - 2007
Streaming Cache-Oblivious B-Trees

Michael A. Bender, Martin Farach-Colton, Jeremy T. Fineman, Yonatan Fogel, Bradley C. Kuszmaul, Jelani Nelson & Chris Wright

Introduction

The B-tree [3,12] is the classic external-memory dictionary data structure, and it has held its preeminent position for over three decades. Most B-tree implementations are, in fact, B$^{+}$-trees [12,14], in which the full keys are all stored in the leaves, but for convenience we refer to all variations as "B-trees." The B-tree is typically analyzed in a two-level memory model, called the Disk Access Machine (DAM) model [1]. The DAM model assumes an internal memory of size $M$ organized into blocks of size $B$ and an arbitrarily large external memory; the cost in the model is the number of transfers of blocks between the internal and external memory.

An $N$-node B-tree supports searches, insertions, and deletions in $O(\log_{B+1}N)$ transfers and supports scans of $L$ contiguous elements in $O(1+L/B)$ transfers. An important characteristic of the B-tree is that it is provably optimal for searching: a simple information-theoretic argument shows that a search requires $\Omega(\log_{B+1}N)$ transfers.
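For concreteness, these bounds can be evaluated numerically. The sketch below is our own illustration (function names are ours, and all hidden constants are taken to be 1); it counts block transfers in the DAM model for a B-tree search and a contiguous scan:

```python
import math

def btree_search_transfers(n: int, b: int) -> int:
    """Transfers for a search in an N-element B-tree in the DAM model:
    O(log_{B+1} N), here with the hidden constant taken to be 1."""
    return math.ceil(math.log(n, b + 1))

def scan_transfers(l: int, b: int) -> int:
    """Transfers for scanning L contiguous elements: O(1 + L/B)."""
    return 1 + math.ceil(l / b)

# With a billion keys and 1024-key blocks, a search touches only a
# handful of blocks, while a scan's cost is dominated by L/B.
print(btree_search_transfers(10**9, 1024))  # → 3
print(scan_transfers(5000, 1024))           # → 6
```

The tiny search count is exactly the fan-out effect that makes the B-tree hard to beat for lookups; the tradeoffs below attack the insertion cost instead.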

In fact, there is a tradeoff between the cost of searching and inserting in external-memory dictionaries [8], and B-trees represent only one point on this trade-off space. Another point in the trade-off space is represented by the buffered-repository tree (BRT) [10]. The BRT supports the same operations as a B-tree, but searches use $O(\log N)$ transfers and insertions use amortized $O((\log N)/B)$ transfers. Thus, searches are slower in the BRT than in the B-tree, whereas insertions are significantly faster.

More generally, Brodal and Fagerberg's data structure [8], which we call the B$^\epsilon$-tree, spans a large range of this tradeoff. The B$^\epsilon$-tree is parameterized by $\epsilon$ ($0\leq\epsilon\leq 1$), and supports insertions in $O((\log_{B^\epsilon+1} N)/ B^{1-\epsilon})$ amortized transfers and searches in $O(\log_{B^\epsilon+1} N)$ transfers. Thus, when $\epsilon=1$ it matches the performance of a B-tree, and when $\epsilon=0$ it matches the performance of a buffered-repository tree. An interesting intermediate point is $\epsilon=1/2$: searches are then slower by a factor of roughly 2, but insertions are faster by a factor of roughly $\sqrt{B}/2$.
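The shape of this tradeoff can be checked numerically. The following sketch is our own illustration (ignoring hidden constants) of the two B$^\epsilon$-tree bounds; evaluating it at $\epsilon=1$ versus $\epsilon=1/2$ recovers the "searches roughly 2x slower, insertions roughly $\sqrt{B}/2$ faster" comparison above:

```python
import math

def beps_search(n: float, b: float, eps: float) -> float:
    # O(log_{B^eps + 1} N) transfers per search (constants ignored)
    return math.log(n, b**eps + 1)

def beps_insert(n: float, b: float, eps: float) -> float:
    # O((log_{B^eps + 1} N) / B^{1-eps}) amortized transfers per insert
    return beps_search(n, b, eps) / b**(1 - eps)

n, b = 10**9, 1024
# eps = 1 behaves like a B-tree; eps = 1/2 trades a 2x slower search
# for an insert that is about sqrt(B)/2 = 16x cheaper.
print(beps_search(n, b, 0.5) / beps_search(n, b, 1.0))  # ≈ 2
print(beps_insert(n, b, 1.0) / beps_insert(n, b, 0.5))  # ≈ 16
```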

This work explores this insert/search tradeoff in the cache-oblivious (CO) model [13]. The cache-oblivious model is similar to the DAM model, except that the block size $B$ is unknown to the programmer and to the algorithm, and therefore cannot be used as a tuning parameter. The B-tree, buffered-repository tree, and B$^\epsilon$-tree are not cache-oblivious; all three are parameterized by $B$.

There already exist several cache-oblivious dictionaries. The most well studied is the cache-oblivious B-tree [4,9,5]. The cache-oblivious B-tree supports searches in $O(\log_{B+1}N)$ transfers, insertions in $O(\log_{B+1}N+(\log^2N)/B)$ transfers, and range queries returning $L$ elements in $O(\log_{B+1}N+L/B)$ transfers. Another cache-oblivious dictionary is a cache-oblivious alternative to the buffered-repository tree, which we call here the lazy-search BRT [2]. Although it is useful in some contexts (such as cache-oblivious graph traversal), the lazy-search BRT is unsatisfactory in one crucial way: searches are heavily amortized, and the whole cost of searching is charged to the cost of previous insertions. Indeed, any given search might involve scanning the entire data structure.

Results

This work introduces several cache-oblivious dictionaries that illustrate different points in the insertion/search tradeoff. Specifically, we present the following cache-oblivious data structures and results:

Shuttle Tree

The shuttle tree, our main result, retains the same asymptotic search cost as the cache-oblivious B-tree while improving the insert cost. Specifically, searches still take $O(\log_{B+1}N)$ transfers, whereas insertions are reduced to $O\!\left((\log_{B+1}N)/B^{\Theta(1/(\log\log B)^2)}+(\log^2N)/B\right)$ amortized transfers. This bound represents a speedup as long as $B\geq (\log N)^{1+c/\log\log\log^2 N}$ for any constant $c>1$; this inequality typically holds for external-memory applications. Range queries returning $L$ elements take $O(\log_{B+1}N+L/B)$ transfers, which is asymptotically optimal.

This relatively complex expression for the cost of inserts can be understood as follows: when the dominant term in the insertion cost is $O\!\left((\log_{B+1}N)/B^{\Theta(1/(\log\log B)^2)}\right)$, insertions run a factor of $O\!\left(B^{1/(\log\log B)^2}\right)$ faster in the shuttle tree than in a B-tree or cache-oblivious B-tree. Observe that this speedup of $B^{\Theta(1/(\log\log B)^2)}$ is superpolylogarithmic but subpolynomial in $B$. This speedup, while nontrivial, is not as large as the speedup in the B$^\epsilon$-tree.
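To get a feel for how large this speedup factor is, here is a rough numerical illustration of our own (the hidden $\Theta$ constant is taken to be 1 and logarithms are base 2; these choices are assumptions, not the paper's):

```python
import math

def shuttle_speedup(b) -> float:
    """Illustrative value of B^{1/(log log B)^2}, the shuttle tree's
    insertion speedup factor, with base-2 logarithms and the hidden
    constant taken to be 1."""
    llb = math.log2(math.log2(b))
    return b ** (1.0 / llb**2)

# The factor grows with B, but very slowly -- faster than any
# polylogarithm of B, yet slower than any fixed power of B.
print(shuttle_speedup(2**16))   # → 2.0
print(shuttle_speedup(2**256))  # → 16.0
```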

Lookahead Array

We give another data structure that we call a lookahead array. The lookahead array is reminiscent of static-to-dynamic transformations [7] and fractional cascading [11]. This data structure is parameterized by a growth factor. If the growth factor is chosen to be $B$, then the lookahead array is cache-aware and achieves the same amortized bounds as the B$^\epsilon$-tree. If the growth factor is chosen to be a constant such as 2, then the lookahead array is cache-oblivious and matches the performance of the BRT. We call this version the cache-oblivious lookahead array (COLA). Unlike the BRT, the COLA is amortized, and any given insertion may trigger a rearrangement of the entire data structure.
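To make the COLA's mechanics concrete, here is a minimal Python sketch under the growth factor 2 described above. The class name and method signatures are our own, and the lookahead (fractional-cascading) pointers that give the structure its name and its search bound are omitted, so this toy's searches cost $O(\log^2 N)$ comparisons:

```python
from bisect import bisect_left

class COLA:
    """Toy cache-oblivious lookahead array, growth factor 2.
    Level k holds either 0 or 2**k sorted keys; lookahead pointers
    are omitted for brevity."""

    def __init__(self):
        self.levels = []  # levels[k] is [] or a sorted list of 2**k keys

    def insert(self, key):
        carry = [key]
        for k, level in enumerate(self.levels):
            if not level:                # empty slot: deposit the carry here
                self.levels[k] = carry
                return
            # both runs are sorted; sorted() stands in for a linear merge
            carry = sorted(carry + level)
            self.levels[k] = []
        self.levels.append(carry)        # all levels full: grow by one level

    def search(self, key):
        for level in self.levels:        # binary-search every nonempty level
            i = bisect_left(level, key)
            if i < len(level) and level[i] == key:
                return True
        return False
```

An insertion that finds the first $k$ levels full merges all of them into level $k+1$; this is exactly why a single insertion can rearrange a large fraction of the structure even though the amortized cost per insertion is low, matching the amortization caveat above.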

For disk-based storage systems, range queries are likely to be faster for a lookahead array than for a BRT because the data is kept contiguously in arrays, taking advantage of inter-block locality, rather than scattered in blocks across the disk. This is the same reason that the cache-oblivious B-tree can support range queries nearly an order of magnitude faster than a traditional B-tree; see, e.g., [6].

Deamortized Lookahead Array

We show how to deamortize the lookahead array and the cache-oblivious lookahead array. Thus, we obtain the first cache-oblivious alternative to the BRT in which there is no amortization on searches and the worst-case cost of an insert is no more than the cost of a search.

Experiments

We next measure how efficiently the COLA performs in practice. We implemented a COLA and compared it with a B-tree. For databases in external memory, the COLA was 90 times faster than the B-tree for random inserts, 2.5 times slower for sorted inserts, and 1.7 times slower for searches.

We also found that design decisions as simple as merging elements largest first or smallest first had a greater impact on performance than, for example, the difference in cost of inserting in random or sorted order. We believe factors such as disk scheduling and prefetching are responsible. We examined several other interesting factors that had an impact on performance.

Research Support

This research was supported in part by NSF Grants CCF-0621511 and CCF-0541209, and a Google research award.

References:

[1] Alok Aggarwal and Jeffrey Scott Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116-1127, September 1988.

[2] Lars Arge, Michael A. Bender, Erik D. Demaine, Bryan Holland-Minkley, and J. Ian Munro. Cache-oblivious priority queue and graph algorithm applications. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), pages 268-276, 2002.

[3] Rudolf Bayer and Edward M. McCreight. Organization and maintenance of large ordered indexes. Acta Informatica, 1(3):173-189, February 1972.

[4] M. A. Bender, E. Demaine, and M. Farach-Colton. Cache-oblivious B-trees. In Proc. 41st Annual Symp. on Foundations of Computer Science (FOCS), pages 399-409, Redondo Beach, California, 2000.

[5] M. A. Bender, Z. Duan, J. Iacono, and J. Wu. A locality-preserving cache-oblivious dynamic dictionary. J. Algorithms, 53(2):115-136, 2004.

[6] Michael A. Bender, Martin Farach-Colton, and Bradley Kuszmaul. Cache-oblivious string B-trees. In Proc. 25th Symposium on Principles of Database Systems (PODS), 2006.

[7] Jon Louis Bentley and James B. Saxe. Decomposable searching problems I: Static-to-dynamic transformation. J. Algorithms, 1(4):301-358, 1980.

[8] Gerth Stølting Brodal and Rolf Fagerberg. Lower bounds for external memory dictionaries. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 546-554, 2003.

[9] Gerth Stølting Brodal, Rolf Fagerberg, and Riko Jacob. Cache oblivious search trees via binary trees of small height. In Proc. 13th Annual Symposium on Discrete Algorithms (SODA), pages 39-48, San Francisco, California, January 2002.

[10] Adam L. Buchsbaum, Michael Goldwasser, Suresh Venkatasubramanian, and Jeffery R. Westbrook. On external memory graph traversal. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 859-860, 2000.

[11] Bernard Chazelle and Leonidas J. Guibas. Fractional cascading: I. A data structuring technique. Algorithmica, 1(2):133-162, 1986.

[12] Douglas Comer. The ubiquitous B-tree. ACM Computing Surveys, 11(2):121-137, June 1979.

[13] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-oblivious algorithms. In Proc. 40th Annual Symp. on Foundations of Computer Science (FOCS), pages 285-297, New York, New York, October 1999.

[14] Donald E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, Massachusetts, 1973.

MIT logo Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu