Abstracts - 2006
Concurrent Cache-Oblivious B-Trees
Michael A. Bender, Jeremy T. Fineman, Seth Gilbert & Bradley C. Kuszmaul
For over three decades, the B-tree has been the data structure of choice for maintaining searchable, ordered data on disk. Traditional B-trees are effective in large part because they minimize the number of disk blocks accessed during a search. Specifically, for block size B, a B-tree containing N elements performs O(log_B N) block transfers per operation. B-trees are designed to achieve good data locality at only one level of the memory hierarchy and for one fixed block size.
In contrast, cache-oblivious (CO) B-trees attain near-optimal memory performance at all levels of the memory hierarchy and for all block sizes (e.g., [3,4,5,10]). A CO B-tree performs a search operation with O(log_B N) cache misses for all possible block sizes simultaneously, even when the block size is unknown. In a complex memory hierarchy consisting of many levels of cache, the tree minimizes the number of memory transfers between each adjacent pair of cache levels. Thus CO B-trees perform near optimally in theory, and in recent experiments they have shown promise of outperforming traditional B-trees [8,13].
One shortcoming of previously described CO B-trees is that they do not support concurrent access by different processes, whereas typical applications, such as databases and file systems, need to access and modify their data structures concurrently. This work describes three concurrent CO B-tree data structures, proves them correct, and presents performance analysis for a few interesting special cases. The rest of this abstract reviews CO B-trees and explains the concurrency problem.
External-memory data structures, such as B-trees, are traditionally analyzed in the disk-access model (DAM), in which internal memory has size M and is divided into blocks of size B, and external memory (disk) is arbitrarily large. Performance in the DAM model is measured in terms of the number of block transfers. Thus B-trees implement searches asymptotically optimally in the DAM model.
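As an illustrative sanity check of this bound (the numbers below are our own, not from the paper), a search in the DAM model touches about log_B N blocks, which is the height of a B-tree with fan-out B:

```python
import math

# Illustrative numbers (not from the paper): a B-tree holding
# N = 10**9 keys, with B = 1024 keys per disk block. A search costs
# about log_B N block transfers -- the height of the tree.
N, B = 10**9, 1024
print(math.ceil(math.log(N, B)))   # -> 3
```

Three block transfers suffice here because each block read narrows the search by a factor of B = 1024.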
The cache-oblivious model [11,16] is like the DAM model in that the objective is to minimize the number of data transfers between two levels. Unlike the DAM model, however, the parameters B, the block size, and M, the main-memory size, are unknown to the coder or to the algorithm. If an algorithm performs a nearly optimal number of memory transfers in a two-level model with unknown parameters, then the algorithm also performs a nearly optimal number of memory transfers on any unknown, multilevel memory hierarchy.
A CO B-tree achieves nearly optimal locality of reference at every level of the memory hierarchy. Although the first CO B-tree is complicated, subsequent designs [6,10,13,17] rival B-trees both in simplicity and performance [6,9,10,13,14,17]. Indeed, preliminary experiments have shown, surprisingly, that CO B-trees can outperform traditional B-trees, sometimes by factors of more than 2 [8,13].
There are two main approaches to implementing serial CO B-trees. One approach is based on packed-memory arrays and the other on exponential search trees. Both approaches employ a static CO search tree as a building block.
A static CO search tree stores N ordered elements in a complete binary tree. It executes searches using O(log_B N) memory transfers. The elements are laid out in an array using the van Emde Boas layout: each node in the binary tree is assigned to a position in a length-N array. To perform the layout, split the tree at roughly half its height to obtain roughly √N subtrees, each with roughly √N nodes. Each subtree is assigned to a contiguous portion of the array, within which the subtree is recursively laid out. The array is stored contiguously in memory or on disk.
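The recursive layout can be sketched as follows (a minimal illustration of the half-height split; the helper name is ours, and nodes are identified by (depth, index) pairs rather than by keys):

```python
def veb_layout(depth, index, height):
    """Return the nodes of a complete binary tree of the given height,
    rooted at node (depth, index), in van Emde Boas (array) order."""
    if height == 1:
        return [(depth, index)]
    top = height // 2                      # split at roughly half the height
    bottom = height - top
    order = veb_layout(depth, index, top)  # top subtree first
    for leaf in range(2 ** top):           # then each bottom subtree in turn
        order += veb_layout(depth + top,
                            index * 2 ** top + leaf, bottom)
    return order

# Height-3 tree (7 nodes): root, then the left child with its two
# leaves, then the right child with its two leaves.
print(veb_layout(0, 0, 3))
# -> [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (2, 2), (2, 3)]
```

Note how each child subtree occupies a contiguous run of the array, which is what gives the layout its locality at every block size.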
A static CO search tree executes searches in O(log_B N) memory transfers. To understand this bound, consider the decomposition of the tree into subtrees of size between √B and B. The depth of each subtree is at least (lg B)/2, so any root-to-leaf path, which has length lg N, encounters at most 2 log_B N subtrees. Each of these subtrees can cross at most one block boundary, leading to at most 4 log_B N memory transfers in the worst case. This analysis is not tight.
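The subtree-counting step can be checked numerically. The sketch below (our own illustration, not the paper's code) follows the vEB half-height split until a subtree fits in a block of B nodes and counts how many such subtrees one root-to-leaf path crosses, comparing against the 2 log_B N bound:

```python
import math

def subtrees_on_path(height, B):
    """Number of recursive subtrees of size at most B that a single
    root-to-leaf path crosses, following the vEB half-height split."""
    if 2 ** height - 1 <= B:       # this subtree fits within B nodes
        return 1
    top = height // 2
    # the path passes through the top subtree, then one bottom subtree
    return subtrees_on_path(top, B) + subtrees_on_path(height - top, B)

for height, B in [(20, 15), (30, 255), (40, 1023)]:
    n = 2 ** height - 1                      # tree of n nodes
    crossings = subtrees_on_path(height, B)
    assert crossings <= 2 * math.log(n, B)   # at most 2 log_B N subtrees
    print(height, B, crossings)
```

In each case the simulated count sits comfortably under the bound, consistent with the observation that the analysis is not tight.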
We leave descriptions of the dynamic CO B-trees to the full paper.
The concurrency problem
Concurrency introduces a number of challenges. A naïve approach would lock segments of the data structure during updates. For example, in a traditional B-tree, each block is locked before being updated. Unfortunately, in the CO model it is difficult to determine the correct granularity at which to acquire locks because the block size is unknown. Locking too small a region may result in deadlock; locking too large a region may result in poor concurrency.
Second, in most database and file-system applications, searches are more common than inserts or deletes, so it is desirable that searches be non-blocking, meaning that each search can continue to make progress, even if other operations are stalled. Our solutions in this paper do not require search operations to acquire locks.
Another problem arises in maintaining the ``balance'' of the data structure. Both main approaches to designing serial CO B-trees require careful weight balance of the trees to ensure efficient operations. With concurrent updates, this balance can be difficult to maintain. For example, the amortized analysis of a packed-memory array allows a process to rewrite the data structure completely once in a while. But if the rewrite acquires locks that prevent other processes from accessing the data structure, then the average concurrency drops: on average only O(1) processes can access the data structure concurrently.
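For intuition, a toy packed-memory array (our own simplified sketch, not the paper's data structure) keeps sorted keys in an array with gaps. Most inserts shift only a few neighbors, but a density-triggered rebuild rewrites the whole array, and under naïve locking that O(N) rewrite would block every concurrent operation:

```python
class PackedMemoryArray:
    """Toy packed-memory array: sorted keys in an array with gaps (None)."""

    def __init__(self, capacity=8):
        self.slots = [None] * capacity
        self.count = 0

    def _rebuild(self):
        """Spread all keys evenly over a doubled array: the occasional
        O(N) 'rewrite everything' step discussed above."""
        keys = [k for k in self.slots if k is not None]
        self.slots = [None] * (2 * len(self.slots))
        stride = len(self.slots) // max(len(keys), 1)
        for j, k in enumerate(keys):
            self.slots[j * stride] = k

    def insert(self, key):
        # Rebuild when more than half full, or when a shift could run
        # off the right end of the array.
        if 2 * (self.count + 1) > len(self.slots) or self.slots[-1] is not None:
            self._rebuild()
        i = 0                           # insert after the last smaller key
        for j, k in enumerate(self.slots):
            if k is not None and k < key:
                i = j + 1
        while key is not None:          # shift right until a gap absorbs it
            self.slots[i], key = key, self.slots[i]
            i += 1
        self.count += 1

pma = PackedMemoryArray()
for x in [5, 1, 9, 3, 7]:
    pma.insert(x)
print([k for k in pma.slots if k is not None])   # -> [1, 3, 5, 7, 9]
```

The gaps are what make typical inserts cheap; the rebuild is what makes the amortized analysis, and the concurrency story, delicate.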
We respond to these challenges by developing new serial data structures that are more amenable to parallel accesses, and then we apply concurrency techniques to develop our concurrent CO B-trees.
This work includes three concurrent CO B-tree data structures: (1) an exponential CO B-tree (lock-based), (2) a packed-memory CO B-tree (lock-based), and (3) a packed-memory CO B-tree (non-blocking). We show that each data structure is linearizable, meaning that each completed operation (and a subset of the incomplete operations) can be assigned a serialization point; each operation appears, from an external viewer's perspective, as if it occurs exactly at the serialization point (see, e.g., [12,15]). We also show that the lock-based data structures are deadlock free, i.e., that some operation always eventually completes. We show that the non-blocking B-tree is lock-free; that is, even if some processes fail, some operation always completes.
When only one process is executing, the trees gain all the performance advantages of optimal cache-obliviousness; for example, they execute operations with the same asymptotic performance as the non-concurrent versions and behave well on a multi-level cache hierarchy without any explicit coding for the cache parameters. These data structures also perform nearly as well as the traditional cache-aware B-tree, with small additive insert overheads for the exponential and packed-memory structures. When more than one process is executing, the CO B-trees still operate correctly and with little interference between disparate operations.
 Alok Aggarwal and Jeffrey S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116--1127, September 1988.
 Rudolf Bayer and Edward M. McCreight. Organization and maintenance of large ordered indexes. Acta Informatica, 1(3):173--189, February 1972.
 Michael A. Bender, Richard Cole, and Rajeev Raman. Exponential structures for efficient cache-oblivious algorithms. In ICALP 2002, pp. 195--207, 2002.
 Michael A. Bender, Erik Demaine, and Martin Farach-Colton. Cache-oblivious B-trees. In FOCS 2000, pp. 399--409, 2000.
 Michael A. Bender, Erik Demaine, and Martin Farach-Colton. Cache-oblivious B-trees. SIAM Journal on Computing, 2005.
 Michael A. Bender, Ziyang Duan, John Iacono, and Jing Wu. A locality-preserving cache-oblivious dynamic dictionary. In SODA 2002, pp. 29--38, 2002.
 Michael A. Bender, Martin Farach-Colton, Bradley C. Kuszmaul, and Jim Sukha. Cache-oblivious B-trees for optimizing disk performance. Manuscript, 2005.
 Michael A. Bender, Gerth Stølting Brodal, Rolf Fagerberg, Dongdong Ge, Simai He, Haodong Hu, John Iacono, and Alejandro López-Ortiz. The cost of cache-oblivious sorting. In FOCS 2003, pp. 271--282, Cambridge, Massachusetts, October 2003.
 Gerth Stølting Brodal, Rolf Fagerberg, and Riko Jacob. Cache oblivious search trees via binary trees of small height. In SODA 2002, pp. 39--48, San Francisco, California, January 2002.
 Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-oblivious algorithms. In FOCS 1999, pp. 285--297, New York, New York, October 1999.
 Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3): 468--492, 1990.
 Zardosht Kasheff. Cache-oblivious dynamic search trees. Master's thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts, June 2004.
 Richard E. Ladner, Ray Fortna, and Bao-Hoang Nguyen. A comparison of cache aware and cache oblivious static search trees using program instrumentation. In Experimental Algorithms: From Algorithm Design to Robust and Efficient Software, volume 2547 of LNCS, pp. 78--92, 2002.
 Nancy Lynch. Distributed Algorithms. Morgan Kaufmann, 1996.
 Harald Prokop. Cache-oblivious algorithms. Master's thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts, June 1999.
 Naila Rahman, Richard Cole, and Rajeev Raman. Optimised predecessor data structures for internal memory. In Proceedings of the 5th International Workshop on Algorithm Engineering, volume 2141 of LNCS, pp. 67--78, Aarhus, Denmark, August 2001.