Abstracts - 2006
Hybrid Byzantine Fault Tolerance
James Cowling, Daniel Myers & Barbara Liskov
The increasing importance of highly available online services mandates protocols to ensure correctness and permanence of service data. Availability and correctness in the face of node failures and malicious attacks, so called Byzantine faults, introduces additional challenges to reliability. These challenges are addressed by Byzantine fault-tolerant replicated state machines, notably the Castro and Liskov BFT system .
Recent criticism has highlighted scalability limitations in the base BFT protocol, where a network may be saturated by quadratic-order communication between server replicas. Additional delay is also introduced by the three-phase read/write protocol required by all operations under BFT. The Query/Update protocol  introduces an optimistic quorum approach to scalable Byzantine fault-tolerance, requiring only single-phase reads and writes, without requiring all-to-all server communication. While achieving high levels of performance on read loads and isolated writes, throughput and response time degrade drastically in the Q/U scheme in the presence of contending writes. An additional artifact of the quorum approach is the requirement for 5f+1 servers, for a given failure tolerance f, in contrast with the theoretical minimum of 3f+1 servers in BFT.
We propose a new Hybrid BFT system, operating under optimistic concurrency control with low levels of write contention, while using BFT to resolve ordering of contending writes. Hybrid BFT provides two-phase write operations, and one- or two-phase reads, while requiring the same 3f+1 number of servers as BFT. Our expectation is that Hybrid BFT will achieve levels of performance close to or better than a purely optimistic scheme such as Q/U, while providing the resilience of BFT under highly contending write loads.
Requirements for a scalable approach for Byzantine fault-tolerance are as follows:
We devise an optimistic Byzantine fault tolerant replicated state machine protocol that satisfies the first four requirements above, leveraging the client to ensure consistency without server-to-server communication in the normal case. Concurrent writes may result in conflicting tentative states however, requiring the use of a contention resolution protocol in satisfying the latter requirement listed above. Here we use BFT as a submodule to resolve these conflicts, contributing the Hybrid aspect of the protocol. BFT contention resolution allows for continued high availability under high levels of write contention, without the excessive timeouts of barrier-based resolution .
The basic Hybrid protocol has been formulated and is undergoing extensions to efficiently bring slow or temporarily failed nodes to the current global system state.
Initial theoretical evaluation of Hybrid BFT in the absence of contention and failures has been performed, as is exhibited in Figure 1 and Figure 2, with results that look very promising in comparison to existing protocols. It is important to note here that while the Q/U protocol of  indeed has very similar network overhead without contention, we believe that the Hybrid protocol will greatly eclipse Q/U in performance in the more realistic case of concurrent writes. We expect that under typical read/write workloads Hybrid BFT will offer far greater throughput than traditional BFT, degrading to performance equal to BFT (and better than Q/U) under heavy write load.
The theoretical results above do not take into consideration the use of batching in BFT, a performance-enhancing technique not directly transferable to either the Hybrid or Query/Update schemes. Moreover, evaluation has not yet been conducted of the relative merits of Hybrid and pure-BFT under various levels of write contention.
A comprehensive implementation of Hybrid BFT is underway in a test framework and we hope to have experimental performance figures available in the near future.
This research is supported by Quanta Computer, Inc and the National Science Foundation ITR.
 M. Castro and B. Liskov. Practical Byzantine Fault Tolerance. In The Proceedings of the 3rd Conference on Operating Systems Design and Implementation (OSDI '99), pp. 173--186, New Orleans, Louisiana, February 1999.
 M. Abd-El-Malek, G. R. Ganger, G. R. Goodson, M. K. Reiter and J. J. Wylie. Fault-Scalable Byzantine Fault-Tolerant Services. In The Proceedings of the 20th ACM Symposium on Operating System Principles (SOSP '05), pp. 59--74, Brighton, United Kingdom, October 2005.