MIT CSAIL Research Abstracts

CSAIL Publications and Digital Archive header

Technical Reports

Work Products

Research Abstracts

Historical Collections

horizontal line

Research Abstracts - 2006
horizontal line

horizontal line

The IRIS Project: Scalable, Robust, Peer-to-Peer Systems

M. Frans Kaashoek, David Karger, Robert Morris, Emil Sit & Jeremy Stribling

Introduction

The Internet has demonstrated the power of a global naming and communications infrastructure: each new application need not implement its own network, but can simply assume a shared global communication infrastructure. The Infrastructure for Resilient Internet Systems (IRIS) project is developing a decentralized infrastructure, based on distributed hash tables (DHTs), that will enable a new generation of large-scale distributed applications. DHTs are scalable, achieving large system sizes without incurring undue overhead. They are self-configuring, automatically incorporating new nodes without manual intervention or oversight. They provide a simple and flexible interface and are simultaneously usable by many applications.

Distributed hash tables provide two key functions. First, they allow nodes to agree upon and efficiently locate a rendezvous point for a given object without central coordination — this is done using protocols such as Chord [1] and Accordion [5]. Our recent research has sought to build real implementations of these protocols and to ensure that they efficiently use available (but constrained) resources. Second, DHTs provide a simple storage interface, allowing applications to put a data object and then get the object at a later time using a key provided by the DHT. Our implementation of this is called DHash.

We are building applications that take advantage of this infrastructure to provide benefits and features that would otherwise be difficult to achieve. Among these are a low-overhead Usenet server and a distributed and cooperative research library.

Maintaining data under churn in DHash

DHash is a system for providing block storage, built using the Chord lookup protocol to organize nodes and data. DHash aims to provide efficient reads and writes, data integrity and data availability. That is, it seeks to be usable on a day-to-day basis. Early work [1] defined the basic interface and addressed the problem of data integrity by relating the lookup key to the stored data using a cryptographic hash function or public key signature. Subsequent work [3] focused on reducing lookup latency and achieving high throughput.

Our current work tackles the problem of maintaining the durability and availability of data [6, 8]. There is great potential for this, due to the large number of distributed participants and their contributed storage and bandwidth resources. If these resources could be efficiently harnessed, DHash could provide robustness and a high storage capacity. The challenge is to deal with failures: nodes may go offline, leave the system permanently, and occasionally lose the contents of their disks due to disk failure.

Replication is traditionally used to ensure that data remains both available (reachable on the network) and durable (stored intact, but perhaps not reachable). To manage failures, DHash must have an intelligent strategy for managing replicas. Availability is a harder goal than durability: availability may be threatened by any network, processor, software, or disk failure, while durability is usually only threatened by disk failures. Local fault-tolerant storage systems such as RAID provide availability and durability relatively inexpensively since disk failures are rare and plenty of local network or bus bandwidth is available to create new replicas. In a large distributed system, failures become more common and available bandwidth becomes more expensive. Further, it is difficult to distinguish between temporary failures (that do not affect durability) and permanent failures (that do).

To handle failures efficiently, we have modeled the failure and repair processes to better understand the parameters that affect data durability. The model captures the concept that repairs and failures occur at a given rate and shows the relationship between the number of existing replicas and these rates. We have shown that threshold based maintenance schemes can be tuned to essentially respond only to permanent failures without explicitly distinguishing permanent and transient failures: by creating a new replica on any observed failure but always tracking the location of all replicas, a system will create a set of extra replicas in response to transient failures that can mask the impact of these transient failures. That it is, sufficiently many copies are made so that the number of available copies is usually greater than the repair threshold [8].

We are also investigating the utility of performing maintenance continuously instead of simply in response to observed failures. A continuous maintenance scheme might operate at a fixed but low rate over long periods --- the ideal would be to produce at a rate that results in at least the same degree of durability as a reactive system but spread out the usage over time. This will reduce burstiness in bandwidth usage which can translate into reduced costs [7].

We are using DHash as the back-end for two applications with high data storage requirements. As we scale up these applications, the amount of data stored in DHash and the number of nodes participating in DHash increases as well. The result of this is that our implementation is being constantly adjusted, improved and evaluated in the context of our applications.

Application: UsenetDHT

UsenetDHT [2] is a DHT-based replacement for the Usenet messaging system. UsenetDHT presents the existing Usenet interface to clients but reduces storage and bandwidth costs by storing articles in a DHT rather than replicating them at each server. In UsenetDHT, the bandwidth required to support read and writes for the full Usenet feed is around 200KB/s; additionally, simulations suggest that to maintain the availability of this data in a relatively stable environment is on the order of 1-2Mbps. By contrast, a traditional Usenet server must dedicate around 20MB/s to carry a full feed.

Application: OverCite

OverCite [4, 7] is a new architecture for a distributed and cooperative research library based on a DHT. OverCite provides the same services as CiteSeer on a distributed set of nodes, eliminating the need for a single institution to contribute all of the resources necessary to run CiteSeer; the resource and bandwidth requirement of each participating node is significantly reduced. DHash serves as a distributed storage layer which, because of its robust and scalable models for data management and peer communication, allows the decentralization of the CiteSeer infrastructure and the inclusion of additional CPU and storage resources. Besides serving as a distributed, robust archive of data, DHash simplifies the coordination of distributed activities, such as crawling. Finally, DHash acts as a rendezvous point for producers and consumers of meta-data and documents.

References:

[1] Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica. Wide-area cooperative storage with CFS. In Proc. of the 18th ACM Symposium on Operating System Principles (SOSP '01), October 2001.

[2] Emil Sit, Frank Dabek, and James Robertson. UsenetDHT: A Low Overhead Usenet Server. In Proc. of the 3rd International Workshop on Peer-to-Peer Systems (IPTPS '04), February 2004.

[3] Frank Dabek, Jinyang Li, Emil Sit, James Robertson, Frans Kaashoek, and Robert Morris. Designing a DHT for low latency and high throughput. In Proc. of the 1st Symposium on Networked System Design and Implementation (NSDI '04), March 2004.

[4] Jeremy Stribling, Isaac G. Councill, Jinyang Li, M. Frans Kaashoek, David R. Karger, Robert Morris, and Scott Shenker. OverCite: A Cooperative Digital Research Library. In Proc. of the 4th International Workshop on Peer-to-Peer Systems (IPTPS '05), February 2005.

[5] Jinyang Li, Jeremy Stribling, Robert Morris and M. Frans Kaashoek. Bandwidth-efficient management of DHT routing tables. In Proc. of the 2nd Symposium on Networked System Design and Implementation (NSDI'05), May 2005.

[6] Emil Sit, Andreas Haeberlen, Frank Dabek, Byung-Gon Chun, Hakim Weatherspoon, Robert Morris, M. Frans Kaashoek and John Kubiatowicz. Proactive replication for data durability. In Proc. of the 5th International Workshop on Peer-to-Peer Systems (IPTPS '06), February 2006.

[7] Jeremy Stribling, Jinyang Li, Isaac G. Councill, M. Frans Kaashoek, and Robert Morris. Exploring the design of multi-site Web services using the OverCite digital libary. In Proc. of the 3rd Symposium on Networked System Design and Implementation (NSDI'06), May 2006.

[8] Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, M. Frans Kaashoek, John Kubiatowicz and Robert Morris. Efficient Replica Maintenance for Distributed Storage Systems. In Proc. of the 3rd Symposium on Networked System Design and Implementation (NSDI'06), May 2006.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu