CSAIL Research Abstracts - 2005

Reconfigurable Byzantine-Fault-Tolerant Atomic Memory

Rodrigo Rodrigues & Barbara Liskov

Introduction

Quorum systems [2,8] are valuable tools for building highly available replicated data services. A quorum system can be defined as a set of sets (called quorums) with certain intersection properties. These properties allow each read or write operation to be performed at only a quorum of the servers, rather than at all of them, because any read quorum is guaranteed to contain at least one server that holds the most recent value written.
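
As a concrete illustration of the intersection property (a minimal sketch, not code from the paper), the following Python fragment builds a simple majority quorum system in the style of [8] and checks that every pair of quorums shares at least one server:

    from itertools import combinations

    def majority_quorums(servers):
        # Simple majority quorum system: every subset of floor(n/2)+1 servers.
        size = len(servers) // 2 + 1
        return [set(q) for q in combinations(servers, size)]

    def intersection_property_holds(quorums):
        # The defining property: every pair of quorums shares a server.
        return all(q1 & q2 for q1, q2 in combinations(quorums, 2))

    servers = ["s1", "s2", "s3", "s4", "s5"]
    assert intersection_property_holds(majority_quorums(servers))

Byzantine-tolerant quorum systems strengthen this property so that quorums overlap in enough servers to outvote the faulty ones, as discussed next.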

Traditionally, quorum systems assumed that servers fail benignly, i.e., by crashing or omitting some steps. Malkhi and Reiter extended quorum systems to provide data availability even in the presence of arbitrary (Byzantine) faults [5], but this work assumed a fixed set of servers that provide the data service throughout the entire system lifetime, and in fact almost all work on Byzantine fault tolerance has made this assumption.

Assuming a fixed set of servers is unrealistic for two reasons. First, future storage systems will need to provide much more storage than can be accommodated by a single server group; instead, we need to pool the storage of many servers and partition the load among them. Second, we must expect system membership to change over time. Machines will break, become compromised, or be decommissioned; such machines must be removed from the system and their responsibilities assigned to non-failed servers. Machines may also need to be added, either to replace failed servers or to meet growing demands for storage or throughput.

Approach

This research extends the work on static Byzantine quorum systems to a large-scale, dynamic environment. Our approach is designed to work in a very general setting. We allow large numbers of servers, e.g., hundreds of thousands, that together are able to store vast quantities of data on behalf of a large number of clients. Responsibility for handling different data items is partitioned among the servers: each item is stored at a group of 3f+1 servers (a subset of the servers that we call the item's replica group), and different items can be stored in different groups.
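
The 3f+1 group size supports quorums of 2f+1 servers, which is what makes Byzantine fault tolerance work within a group: any two such quorums overlap in at least f+1 servers, so at least one non-faulty server lies in the overlap. The sketch below illustrates this arithmetic together with one plausible way to map items to replica groups by hashing their identifiers; the partitioning function and names are our own assumptions, not necessarily the paper's mechanism:

    import hashlib

    F = 1                    # assumed per-group fault threshold (illustrative)
    GROUP_SIZE = 3 * F + 1   # replica group size from the paper
    QUORUM_SIZE = 2 * F + 1  # two such quorums overlap in >= f+1 servers

    def replica_group(item_id, groups):
        # Illustrative assumption: pick the item's group by hashing its id.
        digest = hashlib.sha1(item_id.encode()).digest()
        return groups[int.from_bytes(digest[:4], "big") % len(groups)]

    # Overlap of two quorums of 2f+1 drawn from one group of 3f+1 servers:
    assert 2 * QUORUM_SIZE - GROUP_SIZE == F + 1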

Our algorithm ensures strong semantics (atomicity) despite Byzantine failures of servers, crash failures of clients, and reconfigurations.
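
To make the read side concrete, here is a hedged sketch of how a reader can pick the correct latest value despite Byzantine replies, assuming values are self-verifying (e.g., signed by the writer); the names are invented, and the paper's protocol contains more machinery, including a write-back step that full atomicity requires:

    def read_latest(replies, verify):
        # replies: (timestamp, value, proof) triples from a quorum of servers.
        # Discard replies whose proof fails verification (forged by Byzantine
        # servers), then return the valid reply with the highest timestamp.
        # Quorum intersection guarantees that the most recently written value
        # appears among the valid replies.
        valid = [(ts, v) for ts, v, proof in replies if verify(ts, v, proof)]
        return max(valid, key=lambda reply: reply[0])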

Our system makes use of a logically centralized reconfigurer (implemented as a Byzantine-fault-tolerant replica group [1] that runs on the server nodes). The reconfigurer periodically generates new configurations, each with an associated epoch number. We tag all messages with epoch numbers and use these numbers to ensure that nodes agree on which epoch they are in and that servers carry out reads and writes correctly. As in recent work on reconfigurable quorums for crash failures [3,4], we allow new replica groups to be completely disjoint from old ones, so quorums from two different epochs may not intersect. Epoch numbers allow us to ensure correct behavior despite this lack of intersection.
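
The following sketch (with assumed message format and names, not the actual protocol) shows the essential use of epoch tags: a server refuses any request tagged with an epoch other than its own, so quorums assembled in different configurations cannot be silently mixed:

    class StaleEpoch(Exception):
        # Tells the client that it disagrees with this server about the
        # current configuration.
        pass

    class Server:
        def __init__(self, epoch):
            self.epoch = epoch   # epoch of the configuration this server is in
            self.store = {}      # item_id -> (timestamp, value)

        def handle_write(self, msg_epoch, item_id, timestamp, value):
            if msg_epoch != self.epoch:
                # Refuse requests tagged with another epoch, so a quorum
                # assembled in an old configuration cannot take effect here.
                raise StaleEpoch(self.epoch)
            old_ts, _ = self.store.get(item_id, (0, None))
            if timestamp > old_ts:
                self.store[item_id] = (timestamp, value)

On receiving such a refusal, a client would fetch the current configuration from the reconfigurer and retry.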

When the system reconfigures, responsibility for storage of data items can shift: servers that used to be in the replica group for an item are no longer responsible for it; instead, a new group of servers takes over. Old replicas need to store state for old objects until that state has been transferred to the new group, but they must not retain this state forever; we have developed a way to garbage collect it.
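
The garbage-collection rule can be summarized in a few lines (an illustrative sketch with assumed names): an old replica discards its copy of an item only once the new replica group has acknowledged receiving it.

    def garbage_collect(old_store, acked_items):
        # Sketch: an old replica discards an item only after the item's new
        # replica group has acknowledged holding it (acked_items), so no
        # state is lost across a reconfiguration.
        for item_id in list(old_store):
            if item_id in acked_items:
                del old_store[item_id]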

Our system requires that replica groups contain no more than f faulty nodes while they are "needed." Intuitively, a node is needed while it is a member of the current configuration, while state is being transferred from it to its successors, and until client leases expire. We have specified a correctness (safety) condition that captures this notion of being needed precisely; the condition is expressed in terms of a window of vulnerability, a time interval whose endpoints depend on the occurrence of certain events (e.g., the end of state transfer). We have proved the safety of our system under this condition.
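
As a rough illustration (event names are assumed for exposition, not taken from the paper), the window of vulnerability for one configuration can be described as an interval that opens when the configuration is installed and closes only after both state transfer and lease expiration:

    def vulnerability_window(installed_at, state_transfer_done, last_lease_expiry):
        # Sketch of the interval during which a replica group must contain at
        # most f faulty nodes: it opens when the configuration is installed
        # and closes only after state transfer to the next configuration has
        # finished AND all client leases granted in this epoch have expired.
        return (installed_at, max(state_transfer_done, last_lease_expiry))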

To our knowledge, ours is the first implementation of a reconfigurable Byzantine-fault-tolerant quorum system designed to work at very large scale, with responsibility for objects partitioned among many different replica groups. We have published a sketch of our approach [7]. Martin and Alvisi [6] present an algorithm for a reconfigurable Byzantine quorum system, but their approach assumes a single replica group, does not allow implementing the reconfigurer as a Byzantine-fault-tolerant replica group, and does not appear to have been implemented.

Progress

We have implemented our system. To evaluate its performance, we developed an analytic performance model for a heterogeneous deployment (with variable inter-node latency) and validated the model against experiments with our implementation. Our performance results show that our algorithms are efficient: support for reconfiguration does not affect normal-case performance (when no reconfigurations occur), and the cost of a reconfiguration is modest.

Research Support

This research is supported by the NSF under grants ANI-0082503 (http://project-iris.net) and 6896772.

References:

[1] M. Castro and B. Liskov. Practical Byzantine fault tolerance. In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation (OSDI '99), pages 173--186, New Orleans, Louisiana, February 1999.

[2] D. K. Gifford. Weighted voting for replicated data. In Proceedings of the 7th ACM Symposium on Operating Systems Principles (SOSP '79), December 1979.

[3] S. Gilbert, N. Lynch, and A. Shvartsman. RAMBO II: Rapidly reconfigurable atomic memory for dynamic networks. In Proceedings of the International Conference on Dependable Systems and Networks (DSN 2003), June 2003.

[4] N. Lynch and A. Shvartsman. RAMBO: A reconfigurable atomic memory service for dynamic networks. In Proceedings of the 16th International Symposium on Distributed Computing (DISC 2002), October 2002.

[5] D. Malkhi and M. Reiter. Byzantine quorum systems. Distributed Computing, 11(4):203--213, 1998.

[6] J.-P. Martin and L. Alvisi. A framework for dynamic Byzantine storage. In Proceedings of the International Conference on Dependable Systems and Networks (DSN 2004), June 2004.

[7] R. Rodrigues, B. Liskov, and L. Shrira. The design of a robust peer-to-peer system. In Proceedings of the 10th ACM SIGOPS European Workshop, Saint-Émilion, France, September 2002.

[8] R. H. Thomas. A majority consensus approach to concurrency control for multiple copy databases. ACM Transactions on Database Systems, 4(2):180--209, June 1979.
