MIT CSAIL Research Abstracts

CSAIL Publications and Digital Archive header

Research Abstracts Home

CSAIL Digital Archive

Research Activities

CSAIL Home

horizontal line

Research Abstracts - 2007
horizontal line

horizontal line

END: The Expandable Network Disk

Athicha Muthitacharoen, Robert T. Morris & M. Frans Kaashoek

Introduction

Many organizations posses large data sets that need to be on-line but do not require the highest possible performance. It often makes sense to store this data on low-end commodity servers or on idle disk space of servers primarily used for something else (for example, compute servers). Once the amount of storage required grows beyond a few disks' worth, a cluster storage system starts to be attractive. An ideal cluster storage system would present a disk-like interface compatible with existing file systems (perhaps via iSCSI), keep running despite the failure of a few bricks, recover quickly after correction of a temporary failure that does not disturb disk contents, and expand transparently to file systems.

Efficient handling of temporary failures is particularly important when the cluster is composed of non-dedicated bricks that may be rebooted due to the needs of their primary function, or when large collections of inexpensive bricks have frequent failures of parts such as power supplies and network connectivity. A temporary brick failure poses two questions. First, how will redundancy be maintained for data written during the failure that would ordinarily be replicated on the failed brick? Second, when the failed brick recovers, how will the system know which of the blocks it stores are out of date, and which can safely be used? Petal[2] and FAB[3], two existing cluster storage systems, do not maintain full redundancy during temporary failures, and restore redundancy by copying newly written blocks to a recovering brick. A primary contribution of this paper is to maintain full redundancy during temporary failures and to avoid the expense of having to bring a recovered brick up to date.

This abstract presents END (the Expandable Network Disk). END appears to a file system as a giant physical disk, accessible through a standard block-level I/O interface; as a result, END supports existing disk file systems without modification. END only commits physical storage as blocks are written, so that its apparent address space can be much larger than the amount of physical storage. Each file system stored in END would also be created with a size large enough to accommodate future expansion. Thus, as storage needs increase over time, an administrator can add new bricks and existing file systems will automatically have access to more storage.

END uses a two-layer design, in which storage ``bricks'' hold two kinds of information. The lower layer stores replicated immutable ``chunks'' of data, each indexed by a key equal to a hash of its content. The upper layer maps each block address to the key of its current content; each mapping is held on two bricks using primary-backup replication. A brick that stores the address mapping of a given block typically does not also store the chunk holding the block's current content. This separation gives END the flexibility to store chunks on whatever brick is convenient; the immutability of chunks means that END can retrieve the chunk for a key from any brick without worrying about whether it has been kept up to date.

Several benefits spring from END's separation of the mutable address mappings from the content-addressed storage of immutable data. First, the separation allows fast recovery from failures. END needs to move only the address mappings among bricks during reconfiguration, not the data; and since the mappings are small compared to the data, reconfiguration is quick. Second, END can handle temporary brick failures efficiently. If a brick fails but soon recovers with disk contents intact, END can immediately take advantage of the data chunks stored on the brick's disk. Since the chunks are immutable, they cannot get out of date even if the corresponding blocks were written while the brick was off-line. Third, END can store content-addressed data chunks on any brick, so it can always fully replicate written data, even while bricks are off-line. Fourth, this flexibility allows END to spread the load away from bricks with full disks. Finally, a newly added brick can gradually pick up its share of data chunks. It does not need to copy all the chunks in an address range to start serving requests.

Experiments with our END prototype show that bulk write operations on a file system stored on an END system of 4 bricks has slightly better performance than on a single-disk file system. As expected, END's performance is worse for small file system operations. END's bulk write throughput scales linearly with the number of bricks, and reaches 31MB/s with 10 bricks. END recovers from a single brick failure in less than two minutes; writes completed afterwards are fully replicated. END reincorporates a rebooting brick in approximately one minute and is able to use the chunks already stored on the brick prior to its reboot. During a bulk write workload, adding a new brick to END completes in about one minute.

Progress

We have implemented the design described above except that there is no chunk garbage collector and no background process that copies chunks to their proper successors. We have implemented the log-structured chunk database and batching of writes to reduce the overhead of disk seeks and increase performance. We have been working to improve END's buld write performance so that it is competitive with systems that do not store a separate address to block hash mapping. We are also working to improve END's small writes performance by reducing to zero the disk seeks required to write to the log file. We are conducting experiments to verify that the system recovers correctly from power failures and network partition.

Further development of END will include background garbage collection and movement of chunks to their proper successor, the ability to recover from cluster-wide power failures, and investigation of the creation of snapshots of consistent address mappings for archival.

References:

[1] S. Ghemawat, H. Gobioff and S.T. Leung. The Google File System. In The Proceedings of the Symposium on Operating Systems Principles, Bolton Landing, NY, USA, October 2003.

[2] E. K. Lee and C. A. Thekkath. Petal: Distributed Virtual Disks. In The Proceedings of the Architectural Support for Programming Languages and Operating Systems, pp. 84--92, Cambridge, MA, USA, 1996.

[3] Y. Saito, S. Frolund, A. Veitch, A. Merchant and S. Spence. FAB: building distributed enterprise disk arrays from commodity components. In The Proceedings of the Architectural Support for Programming Languages and Operating Systems, pp. 48--58, Boston, MA, USA, 2004.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu