An Examination of Encoding Schemes in Column Stores

Daniel Abadi, Miguel Ferreira, Samuel Madden & Michael Stonebraker

Project Overview

Compression in traditional database systems is known to improve performance significantly: it reduces the size of the data and improves I/O performance by reducing seek times (related data is stored closer together on disk), reducing transfer times (there is less data to transfer), and increasing the buffer hit rate (a larger fraction of the database fits in the buffer pool). For queries that are I/O limited, the CPU overhead of decompression is often compensated for by these I/O improvements.

Our research revisits this literature on compression in the context of column-oriented database systems (e.g., [1][2]). A column-oriented database system (or "column-store") is one in which each attribute is stored in a separate column, such that successive values of that attribute are stored consecutively on disk. This is in contrast to most common database systems (e.g., Oracle, IBM DB2, Microsoft SQL Server), which store relations in rows ("row stores"), so that values of different attributes from the same tuple are stored consecutively. Since there has been significant recent interest in column-oriented databases in both the research community [3][4][5][6] and the commercial arena [2][7][8], we believe the time is right to systematically revisit the topic of compression in the context of these systems, particularly given that one of the oft-cited advantages of column-stores is their compressibility.
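The difference between the two layouts can be illustrated with a small example. The following sketch is plain Python, purely for illustration, and is not taken from C-Store's storage code; the relation and attribute names are our own.

    # A small relation with attributes (id, color, qty).
    rows = [
        (1, "blue", 10),
        (2, "blue", 12),
        (3, "green", 7),
    ]

    # Row store: values of different attributes from the same tuple are adjacent.
    row_layout = [value for tuple_ in rows for value in tuple_]

    # Column store: successive values of the same attribute are adjacent.
    ids, colors, qtys = (list(col) for col in zip(*rows))

    print(row_layout)         # [1, 'blue', 10, 2, 'blue', 12, 3, 'green', 7]
    print(ids, colors, qtys)  # [1, 2, 3] ['blue', 'blue', 'green'] [10, 12, 7]

Note that in the column layout the repeated "blue" values become adjacent, which is exactly what makes the compression schemes discussed below effective.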

Storing data in columns presents a number of opportunities for improved performance from compression when compared to row-oriented architectures. In a column-oriented database, compression schemes that encode the differences between successive entries in a column, or that encode multiple values at once, are natural. In a row-oriented database, such schemes typically do not work well because adjacent rows of multi-attribute tables are likely to vary significantly in at least some attributes. Still, although attribute-level compression is a less natural fit for row stores, several successful schemes have been proposed (e.g., [9][10][11]); none of these schemes, however, attempts to compress attributes from more than one tuple at once.
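As an illustration of the first kind of scheme, the following sketch delta-encodes a sorted integer column by storing each entry as its difference from the previous one. The function names and the example column are hypothetical and not part of C-Store.

    def delta_encode(column):
        """Keep the first value, then store each value as a difference from its predecessor."""
        if not column:
            return []
        return [column[0]] + [curr - prev for prev, curr in zip(column, column[1:])]

    def delta_decode(encoded):
        """Rebuild the original column by accumulating the differences."""
        column, running = [], 0
        for delta in encoded:
            running += delta
            column.append(running)
        return column

    sorted_keys = [1000, 1001, 1001, 1004, 1010]
    print(delta_encode(sorted_keys))                          # [1000, 1, 0, 3, 6]
    assert delta_decode(delta_encode(sorted_keys)) == sorted_keys

The deltas over a sorted column are small non-negative integers, which can then be stored in far fewer bits than the original values.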

Instead, compression techniques for row-stores often employ dictionary schemes, where a dictionary is used to code wide values in the attribute domain into smaller codes. For example, a simple dictionary for a string-typed column of colors might map "blue" to 0, "yellow" to 1, "green" to 2, and so on [9][12][13]. Sometimes these schemes employ prefix-coding based on symbol frequencies (e.g., Huffman encoding [14]) or express values as small differences from some frame of reference and remove leading nulls from them (e.g., [15][11][12]). In addition to these traditional techniques, column-stores are also well-suited to compression schemes that compress values from more than one row at a time. This allows for a larger variety of viable compression algorithms. For example, run-length encoding, where repeats of the same element are expressed as (run-length, value) pairs, is an attractive approach for compressing sorted data in a column-store. Similarly, improvements to traditional compression algorithms that allow basic "symbols" to span more than one column entry are also possible in a column-store. Finally, we expect compression ratios to be generally much higher in column-stores because consecutive entries in a column are often quite similar to each other, whereas adjacent attributes in a tuple are not. Column-stores can also store different columns in different sort orders, further increasing the potential for compression, since sorted data is usually quite compressible.
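The following sketch shows minimal versions of the two schemes just described, dictionary encoding and run-length encoding. Again, this is illustrative Python with names of our own choosing, not C-Store's implementation.

    def dictionary_encode(column):
        """Map each distinct wide value to a small integer code; return (dictionary, codes)."""
        dictionary, codes = {}, []
        for value in column:
            code = dictionary.setdefault(value, len(dictionary))
            codes.append(code)
        return dictionary, codes

    def run_length_encode(column):
        """Collapse runs of identical values into (run_length, value) pairs."""
        runs = []
        for value in column:
            if runs and runs[-1][1] == value:
                runs[-1][0] += 1          # extend the current run
            else:
                runs.append([1, value])   # start a new run
        return [(length, value) for length, value in runs]

    colors = ["blue", "blue", "blue", "yellow", "green", "green"]
    print(dictionary_encode(colors))  # ({'blue': 0, 'yellow': 1, 'green': 2}, [0, 0, 0, 1, 2, 2])
    print(run_length_encode(colors))  # [(3, 'blue'), (1, 'yellow'), (2, 'green')]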

In addition to improving the compression ratio, column-oriented compression schemes also improve CPU performance by allowing database operators to operate directly on compressed data. This is particularly true for compression schemes like run-length encoding that refer to multiple entries with the same value in a single record. For example, if a run-length-encoded column indicates that the value 42 appears 1000 times consecutively in a column for which we are computing a SUM aggregate, the operator can simply take the product of the value and the run length as the SUM, without having to decompress.
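Continuing this example, the sketch below computes a SUM aggregate directly over the (run-length, value) pairs produced by the run_length_encode sketch above, without materializing the decompressed column. It mirrors the example in the text but is not C-Store's operator code.

    def sum_over_rle(runs):
        """SUM over a run-length-encoded column: multiply each value by its run length."""
        return sum(length * value for length, value in runs)

    encoded = [(1000, 42), (500, 7)]   # 1000 consecutive 42s, then 500 consecutive 7s
    assert sum_over_rle(encoded) == 1000 * 42 + 500 * 7   # 45500

The aggregate touches one record per run rather than one record per tuple, which is where the CPU savings come from.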

We are currently implementing a number of alternative compression schemes that are particularly well-suited to column stores, and building a framework through which these schemes can be seamlessly integrated into the architecture of C-Store, a column-oriented database we are building.

The purpose of our work is not to propose fundamental new compression schemes. Many of the approaches we employ have been investigated in isolation in the context of row-oriented databases, and all are known in the data compression literature. Our purpose is to explore the performance and architectural implications of integrating a wide range of compression schemes into a column-oriented database.

References

[1] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik. C-Store: A Column-oriented DBMS.

[2] Sybase IQ. http://www.sybase.com/products/informationmanagement/sybaseiq.

[3] R. Ramamurthy, D. J. DeWitt, and Q. Su. A Case for Fractured Mirrors. In VLDB, 2002.

[4] P. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-Pipelining Query Execution. In CIDR, 2005.

[5] M. Stonebraker et al. C-Store: A Column-oriented DBMS.

[6] P. Boncz, A. N. Wilschut, and M. L. Kersten. Flattening an Object Algebra to Provide Performance. In ICDE, 1998.

[7] Addamark Technologies. http://www.addamark.com/products/sls.htm.

[8] Kx Systems, Inc. Faster Database Platforms for the Real-time Enterprise: How to Get the Speed You Need to Break through Business Intelligence Bottlenecks in Financial Institutions. 2003.

[9] G. Graefe and L. Shapiro. Data Compression and Database Performance. April 1991.

[10] G. Ray, J. R. Haritsa, and S. Seshadri. Database Compression: A Performance Enhancement Tool. In COMAD, 1995.

[11] J. Goldstein, R. Ramakrishnan, and U. Shaft. Compressing Relations and Indexes. In ICDE '98: Proceedings of the Fourteenth International Conference on Data Engineering, pp. 370-379, 1998.

[12] M. A. Roth and S. J. Van Horn. Database Compression. pp. 31-39, 1993.

[13] Z. Chen, J. Gehrke, and F. Korn. Query Optimization in Compressed Database Systems. In SIGMOD '01: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pp. 271-282, 2001.

[14] D. Huffman. A Method for the Construction of Minimum-Redundancy Codes. September 1952.

[15] T. Westmann, D. Kossmann, S. Helmer, and G. Moerkotte. The Implementation and Performance of Compressed Databases. pp. 55-67, 2000.
