Efficient Crash Recovery of Non-Identical Replicas in a Column Store Database Management System

Edmond Lau, Samuel Madden, and Michael Stonebraker

Problem Statement

The standard solution to crash recovery in database management systems (DBMSs) requires write-ahead logging of UNDO and REDO records in conjunction with an algorithm such as the ARIES recovery protocol [1] to redo committed transactions and to undo uncommitted ones. In C-Store [2], a distributed, column-oriented DBMS geared toward high ad-hoc query performance on terabytes of data, we take a different approach to recovery: we log only UNDO records to support transactional aborts, and we rely on redundantly stored data at other, non-identical sites for recovery. The motivation for this approach stems partly from the fact that standard log-based recovery of large datasets on crashed machines can take days or weeks to complete, which is prohibitive. In a typical C-Store installation of 1,000 machines, each with a 100-GB disk rated for a 200,000-hour life, a disk failure will occur approximately every 200 hours (200,000 hours / 1,000 disks); moreover, software problems will further decrease the mean time to failure at a given site. Crash recovery is therefore a frequent occurrence, and efficient recovery procedures that complete in significantly less time than the mean time to failure are crucial for correct functionality.

The concept of recovering crashed sites from replicas is not new; Walmart's data warehouse [3], for instance, mirrors its data on identical replicas to reduce log-based recovery costs and to maintain availability in the face of system crashes. The challenge in C-Store's recovery problem arises because C-Store does not employ identical replicas; instead, it stores relational tables as sets of columns, called projections, that are each stored in different sort orders, horizontally partitioned among multiple sites, and linked together by one or more join indices. The differently sorted projections enable the query optimizer to select the best set of projections to answer a given query efficiently, but they make crash recovery more difficult: we must rebuild the projections of a crashed site from redundant data that is distributed among non-identical replicas in potentially different sort orders and different compression schemes.

C-Store Architecture

In data warehouse applications, transaction systems periodically load new data into historical storage, and analysts execute ad-hoc queries over possibly tens of terabytes of accumulated data to extract patterns and gain insight. In these environments, DBMSs that employ row-store architectures, storing the fields of a given record, or row, contiguously on disk, have been shown to perform over an order of magnitude more slowly than DBMSs with column-store architectures, which store related fields, or columns, of different records together. Current column-store designs, however, suffer from poor transactional update performance because updates and deletes require multiple disk operations to access the non-contiguously stored fields of specific records.
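The following toy sketch (our own illustration in Python, not C-Store's on-disk format) shows the two layouts side by side; a query over a single attribute scans one contiguous array in the columnar layout instead of every field of every record:

```python
# Toy illustration of row-store vs. column-store layouts
# (our own sketch; not C-Store's actual storage format).
rows = [
    (1, "2006-01-02", 99.50),  # (id, date, price)
    (2, "2006-01-02", 12.25),
    (3, "2006-01-03", 47.00),
]

# Row store: the fields of each record are stored contiguously.
row_store = list(rows)

# Column store: the values of each attribute are stored contiguously.
column_store = {
    "id":    [r[0] for r in rows],
    "date":  [r[1] for r in rows],
    "price": [r[2] for r in rows],
}

# An aggregate over one attribute reads a single column here, whereas
# the row layout forces a scan over every field of every record.
print(sum(column_store["price"]))    # columnar scan: one array
print(sum(r[2] for r in row_store))  # row scan: all fields of all rows
```

Conversely, deleting or updating a single record in the columnar layout touches every column array, which is the update weakness noted above.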
C-Store is a distributed column-store DBMS designed to execute ad-hoc queries over an order of magnitude faster than major commercial DBMSs while simultaneously supporting reasonable performance on update queries. To do so, it employs a hybrid column-oriented architecture consisting of a write-optimized store (WS), in which projections are sorted in entry sequence for fast inserts; a read-optimized store (RS), in which projections are stored in commonly used sort orders for fast query performance; and a tuple mover that periodically migrates batches of tuples from the WS to the RS.

Recovery Design Goals

A spectrum of failure scenarios may follow a site failure under the C-Store architecture: a site may fail but suffer no data loss; individual tuples in the WS or the RS may become damaged; the in-memory WS may be destroyed while the on-disk RS remains intact; or both the WS and the RS may undergo catastrophic failure and lose all data. Our research involves the design of a toolkit of recovery protocols for these scenarios that are simple enough to guarantee correctness and fast enough to recover the crashed site before another site fails. A crashed site recovers by copying state from other replicas. Although recovery may seem conceptually simple, we face two major challenges. First, we must synchronize a column on a recovering site r with a column on a live site s without copying the entire column from s to r. Second, we must guarantee that r ends up with the most recent copy of the column at s if we do not quiesce the system while r is recovering.

Main Recovery Strategies

At this point, we have a basic high-level understanding of some possible recovery techniques for C-Store. For the common case in which a site r fails but suffers no data loss, the only recovery necessary is the application of updates that r may have missed; these updates can be queued for it at other sites and subsequently rolled forward at r to bring it up to date. If both the WS and the RS have suffered catastrophic failure, r can first execute a snapshot-isolated query as of some time in the past and then employ some type of handshaking algorithm to synchronize the remaining differences. A more challenging case arises when individual tuples in the WS or RS have been damaged, or when the entire WS has been destroyed. For this case, we have been exploring the use of Merkle trees [4] to minimize the number and size of the messages needed to synchronize the state of two columns. The core idea behind a Merkle tree is to hash each record of a column into a set of buckets, to hash the combined contents of each of those buckets into a smaller set of buckets, and to repeat until only one bucket remains; the resulting data structure is a tree of hashes whose leaves contain the individual records. Merkle trees can identify which records need to be copied from the live site to the recovering site through a top-down comparison of hashes: if two corresponding hashes match, the subtrees beneath them are identical and need not be examined further (see the sketch below). A fundamental deficiency of a Merkle tree-based approach, however, is that Merkle trees synchronize only values, whereas recovery in C-Store requires the values in the recovering column to be both correct and stored in the same sort order as the column's associated projection. We are therefore working on augmenting the Merkle tree approach to synchronize both values and positions in two columns that may be in different sort orders. In the future, we plan to flesh out the details of these high-level recovery procedures, to prove them algorithmically correct, to implement them on our developing C-Store prototype, and to evaluate the performance of the different recovery procedures.
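The sketch below illustrates the comparison on which this approach rests. It is a minimal illustration under our own simplifying assumptions (one record per leaf bucket, a fixed fan-out of 4, SHA-256 as the hash function, and equal-length columns), not the protocol we are designing:

```python
import hashlib

BRANCH = 4  # fan-out per internal node; an assumed parameter


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_merkle(column):
    """Return the tree as a list of levels: level 0 holds one hash per
    record, and the last level holds the single root hash."""
    level = [_h(repr(value).encode()) for value in column]
    levels = [level]
    while len(level) > 1:
        # Hash each group of BRANCH child hashes into one parent hash.
        level = [_h(b"".join(level[i:i + BRANCH]))
                 for i in range(0, len(level), BRANCH)]
        levels.append(level)
    return levels


def diff_leaves(tree_a, tree_b):
    """Top-down comparison of two equal-shape trees: matching hashes
    prune whole subtrees; return the leaf indices whose records differ."""
    suspects = [0]  # start at the root
    for lvl in range(len(tree_a) - 1, 0, -1):
        children = []
        for idx in suspects:
            if tree_a[lvl][idx] != tree_b[lvl][idx]:
                # Descend only into the children of mismatched nodes.
                children.extend(
                    c for c in range(idx * BRANCH, (idx + 1) * BRANCH)
                    if c < len(tree_a[lvl - 1]))
        suspects = children
    return [i for i in suspects if tree_a[0][i] != tree_b[0][i]]


# Example: a single damaged record in a 64-record column is located
# after comparing only a handful of hashes per level.
live = list(range(64))
recovering = list(range(64))
recovering[17] = None  # a corrupted tuple on the recovering site
print(diff_leaves(build_merkle(live), build_merkle(recovering)))  # [17]
```

Because each level of the walk exchanges only the hashes of the currently suspect subtrees, the communication cost grows with the number of damaged records rather than with the size of the column. As noted above, however, such a comparison locates differing values only; it says nothing about their positions, which is why a position-aware extension is needed for differently sorted projections.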
Funding

This research is partially funded by a grant from the Deshpande Center.

References

[1] C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging." ACM Transactions on Database Systems, pp. 94--162, March 1992.

[2] Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran, and Stan Zdonik. "C-Store: A Column-oriented DBMS." In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), 2005.

[3] Paul Westerman. Data Warehousing: Using the Wal-Mart Model. Morgan Kaufmann Publishers, 2000.

[4] Ralph C. Merkle. "A Digital Signature Based on a Conventional Encryption Function." In Advances in Cryptology -- CRYPTO '87, pp. 369--378, 1987.