CSAIL Research Abstracts - 2005

High Availability in DHTs: Erasure Coding vs. Replication

Rodrigo Rodrigues & Barbara Liskov

Introduction

Peer-to-peer distributed hash tables (DHTs) provide a logically centralized, physically distributed hash-table abstraction that can be shared simultaneously by many applications [1, 2, 3, 4]. Ensuring that data objects in the DHT remain highly available when the nodes storing them are not themselves 100% available requires some form of data redundancy. Peer-to-peer DHTs have proposed two different redundancy schemes: replication [2, 3] and erasure coding [1, 4], but the relative benefits of the two schemes are not well understood. While comparisons exist [4, 5, 6], they mostly argue that erasure coding is the clear victor, due to huge storage savings for the same availability levels (or, conversely, huge availability gains for the same storage levels).

Our research aims to provide a comprehensive discussion of the advantages of each scheme. Our work shows that while gains from coding exist, they are highly dependent on the characteristics of the nodes that comprise the overlay. In fact, the benefits of coding are so limited in some cases that they can easily be outweighed by some disadvantages and the extra complexity of erasure codes. We summarize our work below; a full discussion can be found in [7].

Approach

We performed an analytic comparison of replication and coding that delineates the relative gains of using coding vs. replication as a function of the server availability and the desired DHT object availability. The analysis highlights the main reason for using coding: it achieves the same level of availability with less storage. Storage size matters because it translates into bandwidth when nodes fail and their contents must be moved to other nodes to maintain required replication levels. Thus erasure coding matters more when data is stored on unreliable servers (lower server availability levels).
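The availability arithmetic behind this comparison can be sketched as follows. This is an illustrative back-of-envelope calculation, not the paper's exact model: it assumes independent server failures and an (n, m) erasure code that reconstructs an object from any m of its n fragments (the choice m = 7 below is an arbitrary example), and it searches for the smallest storage overhead that meets a target object availability.

```python
from math import comb

def repl_avail(a, n):
    # An object with n full replicas is unavailable only if all n are down.
    return 1.0 - (1.0 - a) ** n

def code_avail(a, n, m):
    # An (n, m)-coded object is available if at least m fragments are up.
    return sum(comb(n, k) * a**k * (1.0 - a)**(n - k) for k in range(m, n + 1))

def repl_factor(a, target):
    # Smallest replication factor meeting the target; storage overhead is n.
    n = 1
    while repl_avail(a, n) < target:
        n += 1
    return n

def code_factor(a, target, m=7):
    # Smallest coding stretch meeting the target; each of the n fragments
    # is 1/m of the object, so the storage overhead is n/m.
    n = m
    while code_avail(a, n, m) < target:
        n += 1
    return n / m

# Example: unreliable servers (50% available), four-nines object availability.
print(repl_factor(0.5, 0.9999))       # replicas needed
print(code_factor(0.5, 0.9999, m=7))  # coding stretch: much smaller overhead
```

At low server availability the coding stretch is roughly a third of the replication factor, which is exactly the regime where the analysis predicts coding pays off.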

We then developed a model that allows us to understand server availability. The model distinguishes between nodes leaving the system forever, and nodes becoming temporarily unavailable. This distinction matters because when a node leaves forever, the state it stores must be moved to other nodes to retain required replication levels. However, this state transfer can be avoided during a temporary disconnection.
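This distinction can be sketched with a simple repair-timeout rule: absences shorter than the timeout are treated as transient and trigger no data transfer, while longer ones are treated as permanent departures and trigger repair. The 4-hour threshold and the downtime durations below are made-up illustrations, not values from the model or the traces.

```python
REPAIR_TIMEOUT = 4 * 3600  # seconds; illustrative threshold, not from the traces

def repairs_triggered(downtimes, timeout=REPAIR_TIMEOUT):
    """downtimes: durations (seconds) of node absences observed in a trace.
    Returns (repairs when every departure triggers one,
             repairs when absences shorter than the timeout are ignored)."""
    eager = len(downtimes)                           # repair on every departure
    lazy = sum(1 for d in downtimes if d > timeout)  # repair only past the timeout
    return eager, lazy

# Five hypothetical absences; only two last long enough to look permanent.
print(repairs_triggered([600, 30_000, 90_000, 120, 7_200]))
```

The gap between the two counts is the state transfer saved by tolerating temporary disconnections.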

Progress

We used the model to measure how membership dynamics and node availability affect the bandwidth needed to maintain redundancy under the two schemes. Our measurements used three traces that correspond to distinct likely deployments of a peer-to-peer storage system:

  1. Overnet. This trace [8] represents a volunteer-based system in which nodes leave the system relatively quickly.
  2. Farsite. This trace [9] represents a system in which the nodes were corporate desktop PCs.
  3. PlanetLab. This trace [10] represents a server infrastructure.

Node availability is, as one would expect, extremely high for PlanetLab, slightly lower for Farsite, and low for the Overnet trace.

We measured the bandwidth gains of using erasure coding vs. replication in the three deployments. Several conclusions can be drawn from our studies.

For the Overnet trace, coding is a win since server availability is low. However, the maintenance bandwidth for a scalable and highly available storage system with Overnet-like membership dynamics can be unsustainable for home users [11] (around 100 kbps on average for a modest per-node contribution of a few gigabytes). Therefore, cooperative storage systems should target more stable environments like Farsite or PlanetLab.
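The 100 kbps figure can be reproduced with back-of-envelope arithmetic: on average, each node's stored data (inflated by the redundancy overhead) must be re-uploaded elsewhere once per membership lifetime. The parameter values below (5 GB stored, a 10-day average lifetime, a 2x redundancy overhead) are illustrative assumptions, not values taken from the Overnet trace.

```python
def maintenance_kbps(stored_gb, lifetime_days, redundancy):
    # Each node's stored data, times the redundancy overhead, must be
    # re-uploaded elsewhere once per average membership lifetime.
    bits = stored_gb * 8e9 * redundancy
    return bits / (lifetime_days * 86400) / 1e3  # kilobits per second

print(maintenance_kbps(stored_gb=5, lifetime_days=10, redundancy=2.0))
# roughly 93 kbps, consistent with the ~100 kbps figure above
```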

For the PlanetLab trace, coding is not a win, since server availability is extremely high. So the most interesting deployment for erasure codes is Farsite, where intermediate server availability of 80-90% yields visible redundancy savings.

However, the redundancy savings from using coding instead of full replication come at a price. First, the download latency can be higher for coding than for replication in a heterogeneous environment like the Internet. When using replication, the data object can be downloaded from the replica closest to the client, whereas with coding the download latency is bounded by the distance to the $m$th closest fragment holder, since any $m$ fragments are needed to reconstruct the object. Second, downloading only a particular subset of the object (a sub-block) is also complicated by coding, since the entire object must be reconstructed first; with full replicas, sub-blocks can be downloaded trivially. Third, coding is not adequate for system designs where operations are performed at the server side, such as keyword searching.
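The latency point above can be made concrete with a small sketch (the round-trip times are hypothetical): a replicated object downloads at the speed of the closest replica, while a coded download completes only once $m$ fragments have arrived, so it is gated by the $m$th closest fragment holder.

```python
def repl_latency(rtts):
    # Replication: fetch the whole object from the closest replica.
    return min(rtts)

def code_latency(rtts, m):
    # Coding: the download finishes only when m fragments have arrived,
    # so it is gated by the m-th smallest round-trip time.
    return sorted(rtts)[m - 1]

# Hypothetical RTTs in milliseconds.
print(repl_latency([12, 45, 80]))                           # 3 full replicas -> 12
print(code_latency([12, 45, 80, 150, 200, 310, 500], m=4))  # 4-of-7 code -> 150
```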

A final point is that in our analysis we considered only immutable data. The impact of mutability on the redundancy choice is unclear, since we have to consider how a node determines whether its state is accurate, and what it does if it isn't. A study of redundancy techniques in the presence of mutability is an area for future work.

Research Support

This research is supported by the NSF under grants ANI-0082503 (http://project-iris.net) and 6896772.

References

[1] John Kubiatowicz et al. OceanStore: An Architecture for Global-Scale Persistent Storage. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX), Cambridge, Massachusetts, 2000.

[2] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp and Scott Shenker. A Scalable Content-Addressable Network. In Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications, San Diego, California, August 2001.

[3] Antony Rowstron and Peter Druschel. Storage Management and Caching in PAST, A Large-scale, Persistent Peer-to-peer Storage Utility. In Proceedings of the 18th ACM Symposium on Operating System Principles, Banff, Canada, October 2001.

[4] F. Dabek et al. Designing a DHT for Low Latency and High Throughput. In Proceedings of the First ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI), San Francisco, California, March 2004.

[5] Ranjita Bhagwan et al. Total Recall: System Support for Automated Availability Management. In Proceedings of the First ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI), San Francisco, California, March 2004.

[6] Hakim Weatherspoon and John D. Kubiatowicz. Erasure Coding vs. Replication: A Quantitative Comparison. In Proceedings of the 1st International Workshop on Peer-to-Peer Systems (IPTPS'02), Cambridge, Massachusetts, March 2002.

[7] Rodrigo Rodrigues and Barbara Liskov. High Availability in DHTs: Erasure Coding vs. Replication. In Proceedings of the 4th International Workshop on Peer-to-Peer Systems (IPTPS'05), Ithaca, New York, February 2005.

[8] Ranjita Bhagwan, Stefan Savage and Geoffrey Voelker. Understanding Availability. In Proceedings of the 2nd International Workshop on Peer-to-Peer Systems (IPTPS'03), Berkeley, CA, February 2003.

[9] William J. Bolosky, John R. Douceur, David Ely and Marvin Theimer. Feasibility of a Serverless Distributed File System Deployed on an Existing Set of Desktop PCs. In Proceedings of the international conference on measurement and modeling of computer systems (SIGMETRICS), 2000.

[10] Jeremy Stribling. PlanetLab - All Pairs Pings. http://pdos.lcs.mit.edu/~strib/pl_app. 2005.

[11] C. Blake and R. Rodrigues. High Availability, Scalable Storage, Dynamic Peer Networks: Pick Two. In Proceedings of the Ninth Workshop on Hot Topics in Operating Systems (HotOS IX), Lihue, Hawaii, May 2003.
