MIT CSAIL Research Abstracts

CSAIL Publications and Digital Archive header

Technical Reports

Work Products

Research Abstracts

Historical Collections

horizontal line

Research Abstracts - 2006
horizontal line

horizontal line

END: The Expandable Network Disk

Athicha Muthitacharoen, Robert T. Morris & M. Frans Kaashoek

Introduction

Distributed Hash Tables (DHTs) are often viewed as addressing the unreliability of unmanaged hosts and the high latency and limited capacity of wide-area networks, at the expense of loose consistency guarantees. In consequence, DHTs have not been used for cluster applications, since clusters are usually built from dedicated high-reliability servers and fast low-latency LANs, and have strict consistency requirements.

We argue that DHT techniques are relevant to cluster applications. One, clusters often store a large quantity of data relative to the LAN speed, so DHT techniques that minimize replica creation traffic during failures are important for a cluster application. Two, clusters are increasingly built out of commodity PCs, which fail more frequently than purpose-built servers---a large cluster of PCs is likely to experience continuous churn[1], which DHTs handle well. Three, in an environment with low-latency links it is possible for DHT-like systems to provide strong consistency.

To explore the applicability of DHT techniques to cluster applications we present a case study: the Expandable Network Disk (END), which is a distributed virtual disk in the style of Petal[2], FAB[3], etc. END consists of an expandable number of storage ``bricks'', each of which can either be a server or a workstation. END appears to a file system as a giant physical disk, accessible through a standard block-level I/O interface; as a result, END supports existing disk file systems without modification. A system administrator would create an enormous file system on an END system. END only commits physical storage as the file system writes, so if the file system needs more storage, a system administrator can just plug in new bricks---END makes the storage available to the file system without requiring that the file system be re-formatted for the larger size.

Internally, END uses a one-hop DHT to spread data across the bricks, to find data despite changes in the set of bricks, and to replicate data for availability and durability. Because the DHT stores immutable ``chunks'' of data under content-hash keys, END maintains a separate mapping from each block address to the DHT key of its current content. END uses traditional primary-backup replication to keep this address mapping consistent in the face of failures, so that a read of a block reflects the last write to that block. END stores the address mappings separately from the data chunks; the mapping for a given block address is not in general stored on the same brick as the corresponding data chunk.

There are two main benefits from END's separation of mutable address mappings from DHT-like storage of immutable data. First, the separation allows fast recovery from failures. Only the address mappings need to be moved among bricks during reconfiguration, not the data; since the mappings are small compared to the data, reconfiguration is quick. The second benefit is that END can handle temporary brick failures efficiently. If a brick fails but soon re-joins the system with disk contents intact, END can immediately take advantage of the data chunks stored on the brick's disk. Since the chunks are immutable, they cannot get out of date even if the corresponding blocks were written while the brick was off-line.

END's two-layer design poses some challenges. First, reads and writes require two RPCs, one for the address mapping and one to fetch the data chunk. We expect the address mapping RPC to have little effect on overall performance because the address map fits in main memory and LAN latencies are low compared to disk accesses. Second, a simple implementation of content-hash storage is likely to turn sequential workloads into seek-intensive random-access workloads; we expect a log-structured chunk database will fix this problem. Finally, modifying a block will usually result in the old chunk becoming unreferenced; periodic garbage collection will be required to reclaim these chunks.

Progress

We have implemented the design described above except that there is no chunk garbage collector and no background process that copies chunks to their proper successors. We have implemented the log-structured chunk database and batching of writes to reduce the overhead of disk seeks and increase performance. We are conducting experiments to verify that the system recovers correctly from power failures and network partition.

Further development of END will include background garbage collection and movement of chunks to their proper successor, the ability to recover from cluster-wide power failures, and investigation of the creation of snapshots of consistent address mappings for archival.

References:

[1] S. Ghemawat, H. Gobioff and S.T. Leung. The Google File System. In The Proceedings of the Symposium on Operating Systems Principles, Bolton Landing, NY, USA, October 2003.

[2] E. K. Lee and C. A. Thekkath. Petal: Distributed Virtual Disks. In The Proceedings of the Architectural Support for Programming Languages and Operating Systems, pp. 84--92, Cambridge, MA, USA, 1996.

[3] Y. Saito, S. Frolund, A. Veitch, A. Merchant and S. Spence. FAB: building distributed enterprise disk arrays from commodity components. In The Proceedings of the Architectural Support for Programming Languages and Operating Systems, pp. 48--58, Boston, MA, USA, 2004.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu