MIT CSAIL Research Abstracts

CSAIL Publications and Digital Archive header

Research Abstracts Home

CSAIL Digital Archive

Research Activities

CSAIL Home

horizontal line

Research Abstracts - 2007
horizontal line

horizontal line

The IRIS Project: Scalable, Robust, Peer-to-Peer Systems

M. Frans Kaashoek, Robert Morris, Emil Sit & Jeremy Stribling

Introduction

The Internet has demonstrated the power of a global naming and communications infrastructure: each new application need not implement its own network, but can simply assume a shared global communication infrastructure. The Infrastructure for Resilient Internet Systems (IRIS) project is developing a decentralized storage infrastructure, based on distributed hash tables (DHTs), that will enable a new generation of large-scale distributed applications. The goal of DHTs is to provide a simple and flexible interface for reliable data location and storage. DHTs are designed to decentralized, able to tolerate failures of individual components as well as self-configuring, automatically incorporating new nodes without manual intervention or oversight.

Distributed hash tables provide two key functions. First, they allow nodes to agree upon and efficiently locate a rendezvous point for a given object without central coordination — this is done using protocols such as Chord [1] and Accordion [5]. Second, DHTs provide a simple storage interface, allowing applications to put a data object and then get the object at a later time using a key provided by the DHT. Our implementation of this is called DHash. Our recent research has focused on enhancing this implementation to the point where it can support multiple terabytes of data stored and accessed over the wide-area.

To demonstrate the applicability of this infrastructure, we are building applications that take advantage of this infrastructure to provide benefits and features that would otherwise be difficult to achieve. Among these are a low-overhead Usenet server and a distributed and cooperative research library.

Maintaining data under churn in DHash

DHash is a system for providing block storage, built using the Chord lookup protocol to organize nodes and data. DHash aims to provide efficient reads and writes, data integrity and data availability. That is, it seeks to be usable on a day-to-day basis. Early work [1] defined the basic interface and addressed the problem of data integrity by relating the lookup key to the stored data using a cryptographic hash function or public key signature. Subsequent work [3] focused on reducing lookup latency and achieving high throughput.

Our current work tackles the problem of maintaining the durability and availability of data [6, 8]. There is great potential for this, due to the large number of distributed participants and their contributed storage and bandwidth resources. If these resources could be efficiently harnessed, DHash could provide robustness and a high storage capacity. The challenge is to deal with failures: nodes may go offline, leave the system permanently, and occasionally lose the contents of their disks due to disk failure.

Replication is traditionally used to ensure that data remains both available (reachable on the network) and durable (stored intact, but perhaps not reachable). To manage failures, DHash must have an intelligent strategy for managing replicas. Availability is a harder goal than durability: availability may be threatened by any network, processor, software, or disk failure, while durability is usually only threatened by disk failures. Local fault-tolerant storage systems such as RAID provide availability and durability relatively inexpensively since disk failures are rare and plenty of local network or bus bandwidth is available to create new replicas. In a large distributed system, failures become more common and available bandwidth becomes more expensive. Further, it is difficult to distinguish between temporary failures (that do not affect durability) and permanent failures (that do).

Responding to failures can be a significant drain on resources if nodes are required to move or replicate objects after any transient failure; on the other hand, waiting until after an object has been definitively lost runs the risk of losing the last copy of an object. Using a simple model based on the concept that repairs and failures occur at a given rate, we have shown that maintenance schemes that trigger maintenance when too few replicas exist will tend to ignore transient failures if they are engineered to track the location of all replicas, whenever created. In this case, extra replicas created in response to transient failures will keep the available number of copies about the repair threshold, making repairs due to subsequent transient failures unlikely [8].

One difficult area of implementation is scaling object monitoring. Threshold based schemes must track the number of replicas currently available: this information must be kept up-to-date as nodes join and leave the system but remain available for re-use to avoid constant re-querying. We are investigating techniques for implementing synchronization as well as the impact of different object placement policies on the amount of synchronization and repair traffic. By using DHash as the back-end for two applications with high data storage requirements, we can directly observe the impact of any changes as we scale up these applications. The result of this is that our implementation is being constantly adjusted, improved and evaluated in the context of our applications.

Application: UsenetDHT

UsenetDHT [2] is a DHT-based replacement for the Usenet messaging system. UsenetDHT presents the existing Usenet interface to clients but reduces storage and bandwidth costs by storing articles in a DHT rather than replicating them at each server. In UsenetDHT, the bandwidth required to support read and writes for the full Usenet feed is around 200KB/s; additionally, simulations suggest that to maintain the availability of this data in a relatively stable environment is on the order of 1-2Mbps. By contrast, a traditional Usenet server must dedicate around 20MB/s to carry a full feed.

Application: OverCite

OverCite [4, 7] is a new architecture for a distributed and cooperative research library based on a DHT. OverCite provides the same services as CiteSeer on a distributed set of nodes, eliminating the need for a single institution to contribute all of the resources necessary to run CiteSeer; the resource and bandwidth requirement of each participating node is significantly reduced. DHash serves as a distributed storage layer which, because of its robust and scalable models for data management and peer communication, allows the decentralization of the CiteSeer infrastructure and the inclusion of additional CPU and storage resources. Besides serving as a distributed, robust archive of data, DHash simplifies the coordination of distributed activities, such as crawling. Finally, DHash acts as a rendezvous point for producers and consumers of meta-data and documents.

References:

[1] Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica. Wide-area cooperative storage with CFS. In Proc. of the 18th ACM Symposium on Operating System Principles (SOSP '01), October 2001.

[2] Emil Sit, Frank Dabek, and James Robertson. UsenetDHT: A Low Overhead Usenet Server. In Proc. of the 3rd International Workshop on Peer-to-Peer Systems (IPTPS '04), February 2004.

[3] Frank Dabek, Jinyang Li, Emil Sit, James Robertson, Frans Kaashoek, and Robert Morris. Designing a DHT for low latency and high throughput. In Proc. of the 1st Symposium on Networked System Design and Implementation (NSDI '04), March 2004.

[4] Jeremy Stribling, Isaac G. Councill, Jinyang Li, M. Frans Kaashoek, David R. Karger, Robert Morris, and Scott Shenker. OverCite: A Cooperative Digital Research Library. In Proc. of the 4th International Workshop on Peer-to-Peer Systems (IPTPS '05), February 2005.

[5] Jinyang Li, Jeremy Stribling, Robert Morris and M. Frans Kaashoek. Bandwidth-efficient management of DHT routing tables. In Proc. of the 2nd Symposium on Networked System Design and Implementation (NSDI'05), May 2005.

[6] Emil Sit, Andreas Haeberlen, Frank Dabek, Byung-Gon Chun, Hakim Weatherspoon, Robert Morris, M. Frans Kaashoek and John Kubiatowicz. Proactive replication for data durability. In Proc. of the 5th International Workshop on Peer-to-Peer Systems (IPTPS '06), February 2006.

[7] Jeremy Stribling, Jinyang Li, Isaac G. Councill, M. Frans Kaashoek, and Robert Morris. Exploring the design of multi-site Web services using the OverCite digital libary. In Proc. of the 3rd Symposium on Networked System Design and Implementation (NSDI'06), May 2006.

[8] Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, M. Frans Kaashoek, John Kubiatowicz and Robert Morris. Efficient Replica Maintenance for Distributed Storage Systems. In Proc. of the 3rd Symposium on Networked System Design and Implementation (NSDI'06), May 2006.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu