# Entropy Compression of Relations and Compressed Relations

We present a method to compress relations close to their entropy while still allowing efficient queries. Column values are encoded into variable length codes to exploit skew in their frequencies. The codes in each tuple are concatenated and the resulting tuplecodes are sorted and delta-coded to exploit the lack of ordering in a relation. Correlation is exploited either by co-coding correlated columns, or by using a sort order that leverages the correlation. We prove that this method leads to near-optimal compression (within 4.3 bits/tuple of entropy), and in practice, we obtain up to a 40 fold compression ratio on vertical partitions tuned for TPC-H queries.

2.either value concatenation and co-coding, or careful column 1.1 Background and Related Work ordering within the tuplecode. We also allow domain-specific transformations to be applied to the column before the Huffman 1.1.1 Information Theory coding, e.g. text compression for strings. The theoretical foundation for much of data compression is We define a notion of entropy for relations, and prove that information theory . In the simplest model, it studies the our method is near-optimal under this notion, in that it compression of sequences emitted by 0th-order information asymptotically compresses a relation to within 4.3 bits/tuple of sources – ones that generate values i.i.d (independent and its entropy. identically distributed) from a probability distribution D. We have prototyped this method in a system called csvzip Shannon’s celebrated source coding theorem  says that one for compression and querying of relational data. We report on cannot code a sequence of values in less than H(D) bits per experiments with csvzip over data sets from TPC-H, TPC-E value on average, where H(D) = ΣicD pi lg (1/pi) is the entropy and SAP/R3. We obtain compression factors from 7 to 40, of the distribution D with probabilities pi. substantially better than what is obtained with gzip or with Several well studied codes like the Huffman and Shannon- domain coding, and also much better than what others have Fano codes achieve 1 + H(D) bits/tuple asymptotically, using a reported earlier. When we don’t use co-coding, the level of dictionary that maps values in D to codewords. A value with compression obtained is sensitive to the position of correlated occurrence probability pi is coded in roughly lg (1/pi) bits, so columns within the tuplecode. We discuss some heuristics for that more frequent values are coded in fewer bits. choosing this order, though this needs further study. Most coding schemes are prefix codes – codes where no codeword is a prefix of another codeword. A prefix code Efficient Operations over Compressed Data dictionary is often implemented as a prefix tree where each In addition to compression, we also investigate scans and joins edge is labelled 0 or 1 and each leaf maps to a codeword (the over compressed data. We currently do not support incremental string of labels on its path from the root). By walking the tree updates to compressed tables. one can tokenize a string of codewords without using delimiters Operating on the compressed data reduces our – every time we hit a leaf we output a codeword and jump back decompression cost; we only need to decode fields that need to to the root. be returned to the user or are used in aggregations. It also The primary distinction between relations and the sources reduces our memory throughput and capacity requirements. considered in information theory is the lack of order: relations The latter allows for larger effective buffer sizes, greatly are multi-sets, and not sequences. Secondarily, we want the increasing the speed of external data operations like sort. compressed relation to be directly queryable, whereas it is more Querying compressed data involves parsing the encoded bit common in the information theory literature for the sequence to stream into records and fields, evaluating predicates on the be decompressed and then pipelined to the application. encoded fields, and computing joins and aggregations. Prior researchers have suggested order-preserving codes [2,15,18] 1.1.2 Related work on database compression that allow range queries on encoded fields. However this DBMSs have long used compression to help alleviate their data method does not extend efficiently to variable length codes. movement problems. The literature has proceeded along two Just tokenizing a record into fields, if done naively, involves strands: field wise compression, and row wise compression. navigating through several Huffman dictionaries. Huffman Field Wise Compression: Graefe and Shapiro  were dictionaries can be large and may not fit in the L2 cache, among the first to propose field-wise compression, because making the basic operation of scanning a tuple and applying only fields in the projection list need to be decoded. They also selection or projection operations expensive. We introduce two observed that operations like join that involve only equality optimizations to speed up querying: comparisons can be done without decoding. Goldstein and Segregated Coding: Our first optimization is a novel scheme Ramakrishnan  propose to store a separate reference for the for assigning codes to prefix trees. Unlike a true order values in each page, resulting in smaller codes. Many column- preserving code, we preserve order only within codes of the wise storage schemes (e.g., [5,6,12]) also compress values same length. This allows us to evaluate many common within a column. Some researchers have investigated order- predicates directly on the coded data, and also to find the preserving codes in order to allow predicate evaluation on length of each codeword without accessing the Huffman compressed data [2,15,18].  study order preserving dictionary, thereby reducing the memory working set of compression of multi-column keys. An important practical tokenization and predicate evaluation. issue is that field compression can make fixed-length fields into Short Circuited Evaluation: Our delta coding scheme sorts variable-length ones. Parsing and tokenizing variable length the tuplecodes, and represents each tuplecode by its delta fields increases the CPU cost of query processing, and field from the previous tuplecode. A side-effect of this process is delimiters, if used, undo some of the compression. So almost to cluster tuples with identical prefixes together, which all of these systems use fixed length codes, mostly byte- means identical values for columns that are early in the aligned. This approximation can lead to substantial loss of concatenation order. This allows us to avoid decoding, compression as we argue in Section 2.1.1. Moreover, it is not selecting and projecting columns that are unchanged from clear how to exploit correlation with a column store. the previous tuple. Row Wise Compression: Commercial DBMS Before diving into details of our method, we recap some implementations have mostly followed the row or page level information theory basics and discuss some related work. compression approach where data is read from disk, 859

4. m est. H(delta(R)) in bits This bound is far from tight. Table 2 shows results from a 10000 1.897577 m Monte-Carlo simulation where we pick m numbers i.i.d from 100,000 1.897808 m [1,m], calculate the distribution of deltas, and estimate their 1,000,000 1.897952 m entropy. Notice that the entropy is always less than 2 bits. 10,000,000 1.89801 m Thus, Delta coding compresses R from m lg m bits to lg m + 40,000,000 1.898038 m 2(m–1) bits, saving (m–1)(lg m – 2) bits. For large databases, lg m can be about 30 (e.g., 100GB at 100B/tuple). As experiments Table 2: Entropy of delta(R) for a multi-set R of m in Section 4.1 show, when a relation has only a few columns, values picked uniformly, i.i.d. from [1,m] (100 trials) such delta coding alone can give up to a 10 fold compression. each Di from the actual value distribution in COLi, optionally This analysis applies to a relation with one column, chosen extended with some domain knowledge. For example, if COLi uniformly from [1, m]. We generalize this to a method that has {Apple, Apple, Banana, Mango, Mango, Mango}, then Di works on arbitrary relations in Section 2.1.4. is the distribution {pApple = 0.333, pBanana = 0.167, pMango =0.5}. Schemes like Huffman and Shannon-Fano code such a Optimality of Delta Coding sequence of i.i.d values by assigning shorter codes to frequent Such delta coding is also very close to optimal – the following values . On average, they can code each value in COLi with lemma shows we cannot reduce the size of a sequence by more at most 1 + H(Di) bits, where H(X) is the entropy of distribution than lg m! just by viewing it as a multi-set. X – hence these codes are also called “entropy codes.” Using an Lemma 2: Given a vector R of m tuples chosen i.i.d. from a entropy coding scheme, we can code the relation R distribution D and the multi-set R of values in R, (R and R are with ∑1≤i ≤ k R (1 + H( Di ) + DictSize( COLi ) bits, where both random variables), H(R) >= m H(D) – lg m! DictSize(COLi) is the size of the dictionary mapping code Proof Sketch: Since the elements R are chosen i.i.d., H(R) = m words to values of COLi. H(D). Now, augment the tuples t1, t2, …, tm of R with a “serial- If a relation were a sequence of tuples, assuming that the number” column SNO, where ti.SNO = i. Ignoring the ordering domains Di are independent (we relax this in 2.1.3), this coding of tuples in this augmented vector, we get a set, call it R1. is optimal, by Shannon’s source coding theorem . But Clearly there is a bijection from R1 to R, so H(R1) = m H(D). relations are not sequences, they are multi-sets of tuples, and But R is just a projection of R1, on all columns except SNO. For permit further compression. each relation R, there are at most m! relations R1 whose projection is R. So H(R1) <= H(R) + lg m! ± 2.1.2 Order: Delta Coding Multi-Sets So, with delta coding we are off by at most lg m! – m(lg m Consider a relation R with just one column, COL1, containing – 2) j m (lg m – lg e ) – m (lg m – 2) = m (2 – lg e) j 0.6 m numbers chosen uniformly and i.i.d from the integers in bits/tuple from the best possible compression. This loss occurs [1,m]. Traditional databases would store R in a way that because the deltas are in fact mildly correlated (e.g., sum of encodes both the content of R and some incidental ordering of deltas = m), but we do not exploit this correlation – we code its tuples. Denote this order-significant view of the relation as each delta separately to allow pipelined decoding. R (we use bold font to indicate a sequence). The number of possible instances of R is mm, each of which 2.1.3 Correlation has equal likelihood, so H(R) is m lg m. But R needs much less Consider a pair of columns (partKey, price), where each space because we don’t care about the ordering. Each distinct partKey largely has a unique price. Storing both partKey and instance of R corresponds to a distinct outcome of throwing of price separately is wasteful; once the partKey is known, the m balls into m equal probability bins. So, by standard range of possible values for price is limited. Such inter-field correlation is quite common and is a valuable opportunity for combinatorial arguments (see , ), there are  2 m −1   ∫ relation compression.  m  4m 4 π m different choices for R, which is much less than In Section 2.1.1, we coded each tuple in Σj H(Dj) bits. This m is optimal only if the column domains are independent, that is, m . A simple way to encode R is as a coded delta sequence: if the tuples are generated with an independent joint 1) Sort the entries of R, forming sequence v = v1, …, vm distribution (D1, D2, … Dk). For any joint distribution, H(D1, 2) Form a sequence delta(R) = v1, v2–v1, v3–v2, …, vm–vm-1 …, Dk) ≤ Σj H(Dj), with equality if and only if the Dj’s are 3) Entropy code the differences in delta to form a new independent . Thus any correlation strictly reduces the sequence code(R) = v1, code(v2–v1), ... , code(vm – vm–1) entropy of relations over this set of domains. Space savings by delta coding We have three methods to exploit correlation: co-coding, Intuitively, sorting and delta coding compresses R because the dependent coding and column ordering. distribution of deltas is tighter than that of the original integers Co-coding concatenates correlated columns, and encodes – small deltas are much more likely than large ones. Formally: them using a single dictionary. If there is correlation, this Lemma 1: If R is a multi-set of m values picked uniformly with combined code is more compact than the sum of the individual repetition from [1,m], and m > 100, then each delta in delta(R) field codes. has entropy < 2.67 bits. A variant approach we call dependent coding builds a Proof Sketch: See Appendix 7. ± Markov model of the column probability distributions, and uses Corollary 1.1: code(R) occupies < 2.67 m bits on average. it to assign Huffman codes. E.g, consider columns partKey, 861

8.tuplecodes; we are investigating an alternative XOR-based as they are entered into the hash buckets (a sort is not needed delta coding that doesn’t generate any carries. here because the input stream is sorted). The advantage is that hash buckets are now compressed more tightly so even larger 3.2 Layering other Query Plan operators relations can be joined using in-memory hash tables (the effect The previous section described how we can scan compressed of delta coding will be reduced because of the smaller number tuples from a compressed table, while pushing down selections of rows in each bucket). and projections. To integrate this scan into a query plan, we Grouping tuples by a column value can be done directly expose it using the typical iterator API, with one difference: using the code words, because checking whether a tuple falls getNext() returns not a tuple of values but a tuplecode – i.e., a into a group is simply an equality comparison. However tuple of coded column values. Most other operators, except aggregations are harder. aggregations, can be changed to operate directly on these COUNT, COUNT DISTINCT, can be computed directly on tuplecodes. We discuss four such operators next: index-scan, code words: to check for distinctness of values we check hash join, sort-merge join, and group-by with aggregation. distinctness of the corresponding codewords. MAX and MIN computation involves comparison between code words. Since 3.2.1 Index Scan: Access Row by RID our coding scheme preserves order only within code words of Index scans take a bounding predicate, search through an index the same length, we need to maintain the current maximum or structure for matching row ids (RIDs), load the corresponding minimum separately on code words of each length. After data pages, and extract matching records. The process of scanning through the entire input, we evaluate the overall max mapping predicates to RIDs occurs as usual. But extracting the or min by decoding the current code words of each codeword matching records is trickier because it involves random access length and computing the maximum of those values. within the table. Since we delta-code tuples, the natural way to SUM, AVG, STDEV, cannot be computed on the code tokenize a table into tuples is to scan them sequentially, as in words directly; we need to decode first. Decoding domain- Section 3.1. coded integer columns is just a bit-shift. Decoding Huffman Our solution is to punctuate a compressed table with codes from small domains or ones with large skew is also periodic non-delta-coded tuples (the fields in these tuples are cheap, because the Huffman tree is shallow. We also place still Huffman coded). This divides the table into separately columns that need to be decoded early in the column ordering, decodable pieces, called compression blocks (cblocks). We to improve the chance that the scanner will see runs of identical make each rid be a pair of cblock-id and index within cblock, codes, and benefit from short circuited evaluation. so that index-based access involves sequential scan within the cblock only. Thus, short cblocks mean fast index access. 3.2.3 Sort Merge Join However, since the first tuple is not delta-coded, short cblocks The principal comparisons operations that a sort merge join also mean less compression. In practice this is not a problem: A performs on its inputs are < and =. Superficially, it would Huffman-coded tuple takes only 10-20 bytes for typical appear that we cannot do sort merge join without decoding the schemas (Section 4.1), so even with a cblock size of 1KB, the join column, because we do not preserve order across code loss in compression is only about 1%. 1KB fits in L1-cache on words of different lengths. But in fact, sort merge join does not many processors, so sequential scan in a cblock is fine. need to compare tuples on the traditional ‘<’ operator – any total ordering will do. In particular, the ordering we have 3.2.2 Hash Join & Group By with Aggregation chosen for codewords – ordered by codeword length first and Huffman coding assigns a distinct field code to each value. So then within each length by the natural ordering of the values is we can compute hash values on the field codes themselves a total order. So we can do sort merge join directly on the without decoding. If two tuples have matching join column coded join columns, without decoding them first. values, they must hash to the same bucket. Within the hash bucket, the equi-join predicate can also be evaluated directly on 4. EXPERIMENTAL STUDY the field codes. We have implemented our compression method in a prototype One important optimization is to delta-code the input tuples system called csvzip. csvzip currently supports the compression 40 35 30 Domain Coding csvzip gzip csvzip+cocode 25 20 y 15 10 5 0 P1 P2 P3 P4 P5 P6 Figure 7: Comparing the compression ratios of four compression methods on 6 datasets 865

9. DATASET SCHEMA Original DC-1 DC-8 Huffman csvzip Delta code Huffman + Correlation csvzip+ Loss by not Gzip size (1) (2) savings Cocode saving cocode Cocoding (1)-(2) (3) (1)-(3) (5) (2)-(5) P1. LPK LPR LSK LQTY 192 76 88 76 7.17 68.83 36 40 4.74 2.43 73.56 P2. LOK LQTY 96 37 40 37 5.64 31.36 37 0 5.64 0 33.92 P3. LOK LQTY LODATE 160 62 80 48.97 17.60 31.37 48.65 0.32 17.60 0 58.24 P4. LPK SNAT LODATE CNAT 160 65 80 49.54 17.77 31.77 49.15 0.39 17.77 0 65.53 P5. LODATE LSDATE LRDATE LQTY LOK 288 86 112 72.97 24.67 48.3 54.65 18.32 23.60 1.07 70.50 P6. OCK CNAT LODATE 128 59 72 44.69 8.13 36.56 39.65 5.04 7.76 0.37 49.66 P7. SAP SEOCOMPODF 548 165 392 79 47 32 58 21 33 14 52 P8. TPC-E CUSTOMER 198 54 96 47 30 17 44 3 23 7 69 Table 6: Overall compression results on various datasets (all sizes are in # bits/tuple). DC-1 and DC-8 are bit and byte aligned domain coding. Csvzip is Huffman followed by delta coding; csvzip+cocode is cocode followed by Huffman and then delta-coding. Underscore indicates skew and italics denotes correlation. Column abbreviations are: LPK: partkey, LPR: extendedprice, LSK suppkey, LQTY: quantity, LOK: orderkey, LODATE: orderdate, SNAT: suppNation, CNAT: custNation, LSDATE: shipDate, LRDATE: receiptDate. TPC-E schema is tier, country_1, country_2, country_3, area_1, first name, gender, middle initial, last name. of relations loaded from comma separated value (csv) files, and Arithmetic Correlation: We made l_receiptdate and table scan with selection, projection, and aggregation. Updates l_shipdate be uniformly distributed in the 7 days after the and joins are not implemented at this time. We do not currently corresponding o_orderdate. have a SQL parser – we execute queries by writing C programs There are two other correlations inherent in the schema: that compose select, project, and aggregate primitives. For a given l_partkey, l_suppkey is restricted to be one of 4 We have used this prototype to perform an experimental possible values. study of both compression efficiency and scan efficiency. Our De-Normalization: Dataset P6 in Table 6 is on lineitem x goal is to quantify the following: order x customer x nation, and contains a non-key 1) How much can we compress, as compared to row coding or dependency o_custkey c_nationkey domain coding? What is the relative payoff of skew, order- We use a 1 TB scale TPC-H dataset (j6B rows in each of our freeness, and correlation (Section 4.1)? datasets). To make our experiments manageable, we did not 2) How efficiently can we run scan queries directly on these actually generate, sort, and delta-code this full dataset – rather compressed tables (Section 4.2)? we tuned the data generator to only generate 1M row slices of it, and compressed these. For example, P2 of Table 6 is delta- We use three datasets in our experiments: coded by sorting on <l_orderkey,l_qty>. We generate 1M row TPC-H views: We choose a variety of projections of Lineitem slices of P2 by modifying the generator to filter rows where x Orders x Part x Customer, that are appropriate for answering l_orderkey is not in the desired 1M row range. TPC-H queries (Table 6). Our physical design philosophy, like C-Store, is to have a number of highly compressed materialized 4.1 Extent of Compression views appropriate for the query workload. Table 6 lists the detailed compression results on each dataset. TPC-E Customer: We tested using 648,721 records of Figure 7 plots the overall compression ratio obtained by csvzip randomly generated data produced per the TPC-E specification. (with and without co-coding) vs that obtained by: This file contains many skewed data columns but little a plain gzip (representing the ideal performance of row and correlation other than gender being predicted by first name. page level coders), SAP/R3 SEOCOMPODF: We tested using projections of a a fixed length domain coder aligned at bit boundaries table from SAP having 50 columns and 236,213 rows. There is (representing the performance of column coders). Table 6 a lot of correlation between the columns, causing the delta code also lists the numbers for domain coding at byte boundaries; savings to be much larger than usual. it is significantly worse. One drawback with TPC-H is that uses uniform, Note that all ratios are with respect to the size of the vertical independent value distributions, which is utterly unrealistic partition, not size of the original tables – any savings obtained (and prevents us from showing off our segregated Huffman by not storing unused columns is orthogonal to compression. Coding ☺). So we altered the data generator to generate two Even without co-coding, csvzip consistently gets 10x or skewed columns and 2 kinds of correlation: more compression, in contrast to 2-3x obtained by gzip and Dates: We chose 99% of dates to be in 1995-2005, with 99% domain coding. On some datasets such as LPK LPR LSK of that on weekdays, 40% of that on two weeks each before LQTY, our compression ratio without co-coding is as high as New Year & Mothers’ Day (20 days / yr). 192/7.17j27; with co-coding it is 192/4.74j41. From Table 6, c_nationkey, s_nationkey: We chose nation distributions notice that the absolute per tuple sizes after compression are from WTO statistics on international trade typically 10-20bits. (www.wto.org/english/res_e/statis_e/wt_overview_e.htm). We next analyze from where our substantial gains arise. Soft Functional Dependency: We made l_extendedprice be functionally dependent on l_partkey Exploitation of Skew: The first source of compression is skew. Observe the bars labelled Huffman and Domain Coding 866

10. columns. Notice that the savings is now higher – the delta 12 coding is able to exploit correlations also, as we discussed in 10 DELTA Section 2.2.2. Delta w cocode 8 The plot on the opposite column illustrates the compression ratios obtained with the two forms of delta coding. The ratio is 6 as high as 10 times for small schemas like P1. The highest 4 overall compression ratios result when the length of a tuplecode and bits per tuple saved by delta coding are similar. 2 0 Exploiting Correlation via co-coding vs. delta-coding: P1 P2 P3 P4 P5 P6 The extent to which it can exploit correlations is indicated by the column labelled (2)-(5) in Table 6 – notice that we are in the chart below. They show the compression ratio achieved often close to the co-coding performance. just by Huffman coding column values, in contrast to the savings obtained by domain coding. All columns except This strategy is powerful, because by not co-coding range nationkeys and dates are uniform, so Huffman and domain queries on the dependent become much easier as we discussed coding are identical for P1 and P2. But for the skewed domains in Section 2.2.2. But a drawback is that the correlated columns the savings is significant (e.g., 44.7 bits vs. 59 bits for P6). have to be placed early in the sort order – for example, Table 6 lists the full results, including compression results on LODATE, LRDATE, LSDATE in dataset P5. We have SAP and TPC-E datasets. These have significant skew (several experimented with a pathological sort order – where the char columns with few distinct values), so Huffman coding correlated columns are placed at the end. When we sort P5 by does very well. On SAP, Huffman coding compresses to 79 (LOK, LQTY, LODATE, …), the average compressed tuple bits/tuple vs. 165 bits/tuple for domain coding. size increases by 16.9 bits. The total savings from correlation is only 18.32 bits, so we lose most of it. This suggests that we 6 need to do further investigation on efficient range queries over Huffman co-coded or dependent coded columns. 5 Domain Coding Huffman+CoCode 4 4.2 Querying Compressed Relations 3 We now investigate efficient querying of compressed relations. 2 We focus on scans, with selection, projections, and 1 aggregations. Our queries test three aspects of operating over 0 compressed data: 1. How efficiently can we undo the delta-coding to retrieve P1 P2 P3 P4 P5 P6 tuplecodes? 2. How efficiently can we tokenize tuples that contain Correlation: The best way to exploit correlation is by co- Huffman-coded columns? coding. Table 6 lists the extra benefit obtained by correlation, 3. How well can we apply equality and range predicates on over that obtained by just Huffman coding individual columns. Huffman coded and domain coded columns? This is also illustrated in the bar labelled “Huffman+CoCode” The first is the basic penalty for dealing with compressed data, in the above plot. For example, we go down from 72.97 that every query has to pay. The second measures the bits/tuple to 54.65 bits per tuple for P5. In terms of effectiveness of our segregated coding. The third measures the compression ratios, as a function of the original table size, we ability to apply range predicates using literal frontiers. compress 2.6x to 5.3x by exploiting correlation and skew. We run scans against 3 TPC-H schemas: (S1: LPR LPK Delta-Coding: The last source of redundancy is from lack-of- LSK LQTY) has only domain coded columns, (S2: LPR LPK ordering. In Table 6, column(3) – column (5) gives the savings LSK LQTY OSTATUS OCLK) has 1 Huffman coded column from delta coding when we co-code. Observe that it is almost (OSTATUS), and (S3: LPR LPK LSK LQTY OSTATUS always about 30 bits/tuple for all the TPC-H datasets. This OPRIO OCLK) has 2 Huffman coded columns observation is consistent with Theorem 3 and Lemma 2; lg m = (OSTATUS,OPRIO). OSTATUS has a Huffman dictionary lg (6.5B) j 32.5, and we save a little fewer bits per tuple by with 2 distinct codeword lengths, and OPRIO has a dictionary delta-coding. with 3 distinct codeword lengths. The table below plots the S1 S2 S3 scan bandwidth (in nanoseconds/tuple) for various kinds of scans against these schemas. All experiments were run on a Q1: select sum(lpr) from S1/2/3 8.4 10.1 15.4 1.2GHz Power 4, on data that fit entirely in memory. Q2: Q1 where lsk>? 8.1-10.2 8.7-11.5 17.7-19.6 Q1 is a straightforward scan plus aggregation on a domain Q3: Q1 where oprio>? 10.2-18.3 17.8-20.2 coded column. On S1, which has no Huffman columns, Q1 just tests the speed of undoing the delta code – we get 8.4ns/tuple. Q4: Q1 where oprio=? 11.7-15.6 20.6-22.7 On S2 and S3, it tests the ability to tokenize Huffman Again in Table 6, column (1) – column (2) gives the coded columns using the micro-dictionary (in order to skip over savings from delta coding when we don’t co-code correlated them). Tokenizing the first Huffman coded column is relatively 867

11.cheap because its offset is fixed; subsequent ones incur an  G. Copeland and S. Khoshafian. A decomposition storage overhead of about 5ns/tuple to navigate the micro-dictionary. If model. In SIGMOD 1985. a tuple has several Huffman coded columns, this suggests that  Sybase IQ. www.sybase.com/products/informationmanagement it will pay to encode the tuple length in front.  D. Knuth. The Art of Computer Programming. Addison Q2, Q3, and Q4 are queries with predicates plus Wesley, 1998. aggregation. A range of numbers is reported for each schema  S. Barnard and J. M. Child. Higher Algebra, Macmillan India because the run time depends on the predicate selectivity, due Ltd., 1994. to short-circuiting. For small selectivities, the predicate adds at  T. C. Hu, A. C. Tucker, Optimal computer search trees and variable-length alphabetic cods, SIAM J. Appl. Math, 1971. most a couple of ns/tuple beyond the time to tokenize. This is  Huffman, D. A method for the construction of minimum in line with our expectation, because once tokenized, the range redundancy codes. Proc. I.R.E. 40. 9. 1952 predicate is evaluated directly on the codeword.  Jim Gray. Commentary on 2005 Datamation benchmark. These numbers indicate we can process 50-100M tuples/s http://research.microsoft.com/barc/SortBenchmark using a single processor (equivalent to 1-2GB/s), and are quite  A. Zandi et al. Sort Order Preserving Data Compresion for promising. Obviously, a full query processor needs much more Extended Alphabets. Data Compression Conference 1993. than scan. Nevertheless, as we discussed in Section 3.2, hash  K. Bharat et al. The Connectivity server: Fast access to joins and merge joins can be done on the compressed tuples Linkage information on the web. In WWW 1998. using these basic equality and range predicate operations.  G. Graefe and L. Shapiro. Data Compression and Database Performance. In Symp on Applied Computing, 1991. 5. CONCLUSIONS AND FUTURE WORK  M. Poess et al. Data compression in Oracle. In VLDB 2003. Compression is a promising approach to deal with the data movement bottleneck. We have developed a notion of entropy 7. ENTROPY OF MULTI-SET DELTAS for relations and described a method to compress relations to Lemma 1: For a multi-set R of m values picked uniformly and i.i.d within 4.3 bits/tuple of the entropy. This results in up to a from [1..m], for m >100, the value distribution of code(R) has compression factor of up to 40 on TPC-H data, much better entropy ≤ 2.67 m bits. than reported in prior literature. We have also developed Proof: Recall the pseudo-code for computing the coded delta: techniques for efficient scans with selection and projection over 1) Sort the entries of R, forming sequence v = v1, …, vm relations compressed with this method. 2) Form a delta sequence v2 − v1, v3 − v2, … , vm − vm-1. Much work remains to exploit the promise of this idea. 3) Entropy code the values in delta to form a new sequence Although we have shown the basic feasibility, we need to code(R) = code(v2 − v1), ... , code(vm − vm-1) further investigate the engineering issues involved in doing Now consider this situation: we have thrown m balls numbered 1..m uniformly and i.i.d onto m numbered bins arranged in a circle. Define query operators both as a part of a commercial DBMS and as a a random variable Dj for ball j that landed in bin i utility, and figuring out how to utilize the 128 bit registers and Dj h 0 if j is not the highest numbered ball in bin i hardware threading available on modern processors. A second h k if j is the highest numbered ball bin i and direction is lossy compression, which we believe is vital for and k is the clockwise distance from bin i to the next occupied bin. In efficient aggregates over compressed data. Finally, we need to effect, Dj is the random variable for the bin gap between adjacent support incremental updates. We believe that many of the balls. The distribution of each code in code(R) is that of Dj. By standard warehousing ideas like keeping change logs and symmetry, this distribution is independent of j, so we shall call it periodic merging will work here as well. D. Our goal is to compute H(D). Now, let pd h Pr (D = d), 0 ≤ d < m and define 6. REFERENCES λd h Pr (D = d | D g 0), 1 ≤ d < m  Bala Iyer,, David Wilhite. Data Compression Support in Data So, pd = (1 − p0) λd, and Bases. In VLDB 1994.  G. Antoshenkov, D. Lomet, J. Murray. Order Preserving H(D)= Σ0≤d<m pd lg(1/pd) = −p0 lg p0 −Σ1≤d≤m–1 (1−p0)λdlg(1−p0)λd. String Compression. In ICDE 1996. Since Σ0≤d<m pd = 1 = p0 + Σ1≤d<m (1−p0)λd, we have,  T. Cover, J. Thomas. Elements of Information Theory. John H(D) = −p0 lg p0 − (1−p0)lg(1−p0) + (1−p0)Σ1≤d<m λd lg(1/λd) (1) Wiley. 1991 Consider αd h Pr(D > d | Dg0), 1 ≤ d < m. This is the probability  Eric W. Weisstein et al. "Catalan Number." In MathWorld. that, given that a ball is the highest numbered ball in a bin (D g 0), http://mathworld.wolfram.com/CatalanNumber.html. that there is a clockwise gap greater than d to the next occupied  A. Ailamaki, D.J. DeWitt, M. D. Hill, M. Skounakis Weaving bin, i.e., that all m – 1 other balls fall into the other m – d bins. So, relations for cache performance. In VLDB 2001. αd = (1 – d / m)m−1, for 1 ≤ d < m,  M. Stonebraker et al. C-Store: A Column Oriented DBMS. In λd = Pr (D = d | Dg0) = αd−1 − αd, VLDB 2005. αd = (1 – d/m)m−1 = [m/(m − d)][(1 − d / m)m] < [m/(m − d)] e−d,  S. Babu et al. SPARTAN: A model-based semantic by standard exponential inequalities , and compression system for massive tables. SIGMOD 2001. λd = αd−1 – αd < αd−1 < [m/(m – d + 1)] e1− d (2)  J. Goldstein, R. Ramakrishnan, U. Shaft. Compressing But, λd = αd−1 – αd = (1 − (d − 1)/m)m−1 − (1−d/m)m−1 Relations and Indexes. In ICDE 1998. λd (m/(m−d))m−1 = [(m − d + 1)/(m − d)]m−1 − 1 > 1 for d ≥ 1  G.V. Cormack and R.N. Horspool. Data Compression using So 1/λd < [m/(m − d)]m−1 (3) Dynamic Markov Modelling, Computer Journal 1987. Now, each occurrence of a zero delta corresponds to an empty bin,  G. V. Cormack, Data Compression In a Database System, so expected number of empty bins is mp0. But empty bins form a Comm. of the ACM 28(12), 1985 binomial distribution with probability (1−1/m)m. So p0 = (1−1/m)m 868

12.< 1/e. Now, by simple calculus, −x lg x has its maximum value Theorem 3: Algorithm 3 compresses a relation R of tuples when x=1/e, and substituting into equation (1) we have chosen i.i.d from a finite distribution D to at most H(R) + 4.3|R| −p0lg p0 − (1−p0)lg(1−p0) < (lg e)/e + (1−1/e) lg(1/(1−1/e)) (4) bits, assuming |R| > 100. Combining (1), (2), (3), (4) we get, Proof: Consider m items chosen i.i.d. from D. Let R be a random H(D) < (lg e)/e + (1−1/e) lg(e/(e−1)) + (1−p0) λ1 lg 1/λ1 + variable for these m items viewed as a sequence and R be a random (1−p0)Σ2≤d≤m–1[m/(m−d+1)] e1−d (m−1)lg(1+d/(m−d)) variable for these m items viewed as a multi-set. Without loss of As ln(1+x) < x, lg(1+x) = ln(1+x) / ln 2 < x / ln 2 we have generality let the distribution D be given by a set of n values v1, … H(D) < (lg e)/e + (1−1/e) lg(e/(e−1)) + (1−p0)λ1 lg 1/λ1 + ,vn and associated probabilities p1, …, pn, so that a sample drawn ((1−p0)/ln 2) Σ2≤d≤m–1 [dm(m−1)/((m−d)(m−d +1))] e1 − d (5) from D has value vi with probability pi. Let Ds be the subset of Let Ym ≡ ∑ md=−12 [dm( m − 1) (m − d )(m − d + 1)]e1−d Ym − Ym+1 = ∑ md=−21[2dm(d − 1) (m − d )(m − d +1)(m − d +2)]e1−d those values that code to less than ≈lg m∆ bits after step 1d. Let − m 2 ( m + 1)e − m+1 / 2 MUL be a random variable for the vector of multiplicities of = ∑ md=−22 [2dm(d − 1) (m − d )(m − d + 1)(m − d + 2)]e1−d elements in Ds in the random multi-set R, Dom(MUL) = N |Ds|). + me1−m [m 2 (e / 3 − 0.5) + m(e + 0.5) + 2e / 3] Use µ to represent a particular multiplicity in Dom(MUL), µi to > 0 for m>7 (the last quadratic is >0 for m>7) denote the i’th element of µ, the multiplicity of the associated vi, Hence Ym >Ym+1 for m > 7. and pµ to denote the probability that MUL=µ µ. But Y100 < 1.57, by calculation. So, for m>100, Ym < 1.57. (6) H(R) = H(MUL) + H(R | MUL) Also, λ1 = 1 − (1 − 1/ m)m−1 is increasing in m. Since x lg(1/ x) = H(MUL) + ΣµcDom(MUL) pµ H(R | MUL=µ µ) is maximum at 1/e, λ1 lg(1/ λ )1 < 0.63lg 0.63 < 0.42 for m>100. (7) We now employ an argument similar to Lemma 3. For any value Plugging (6), (7) into (5) and doing some algebra, of the multi-set R having multiplicity µ = (µ1, µ2, ... µ|Ds|), there are H ( D ) < lg e − (1 − 1/ e)lg(e − 1) + 0.42(1 − p0 ) + 1.57(1 − p0 ) / ln 2. at most m! / (µ1!µ2! ... µ|Ds|!) instances of R. So, By calculus, p0 = (1-1/ m)m > 0.99100 >0.36 for m>100. H(R | MUL = µ) ≥ H(R | MUL = µ) – lg (m!/(µ1!... µ|Ds|!)) Substituting, we get H(D) < 2.67 bits. H(R) ≥ H(MUL) + H(R | MUL = µ) – Σµ pµ lg (m!/(µ1!... µ|Ds|!)) = H(R) – ΣµcDom(MUL) pµ lg (m!/(µ1!... µ|Ds|!)). 8. OPTIMALITY OF ALGORITHM 3 Lemma 3: Let X be a random variable over a distribution D, and Let k = µ1 +...+ µ|Ds|. Then, k! µ1µ1... µ|Ds|µ_|Ds|/(µ1!... µ|Ds|!) is just let code(X) be a random variable representing the codeword for X, one term in the multinomial expansion of kk. So k!/(µ1!... µ|Ds|!) according to some compression scheme. Let codeα(X) be the < kk / µ1µ_1... µ|Ds|µ_|Ds|. Clearly k < m, so random variable for the α-bit prefix of code(X). Then, if the lg (m!/(µ1!... µ|Ds|!) < m lg m −Σ1≤i≤Ds µi lg µi. compression scheme is optimal and given that code(X) has ≥ α So, H(R) ≥ H(R) – ΣµcDom(MUL) pµ (m lg m −Σi µilg µi) bits, codeα(X) is uniformly distributed over all bit-vectors of length But, the last term can be simplified and replaced with an α. expectation giving us Proof: If the distribution of bits were not uniform, we could further H(R) ≥ H(R) – m lg m + E(Σi µilg µi) compress each α-bit prefix, contradicting the assumption that the = H(R) – m lg m + Σi E(µilg µi). compression scheme is optimal. ± Since x lg x is convex, we can apply Jensen’s inequality: E(f(X) We now show the algorithm of Section 2.1.4 is near optimal. ≥ f(E(X)) for convex functions f, yielding H(R) ≥ H(R) – m lg m + Σi E(µi) lg Ε(µi) (1) Lemma 4: If tuples of R are generated i.i.d. over a distribution D, By Lemma 4, Algorithm A encodes R in space the algorithm of Figure 3 encodes R in at most m (1 + H ( D )) − ( m − 1)(lg m − 2.67) + ∑ i∈Ds mpi (  lg m  − bi ) (2) m(1 + H ( D)) − (m − 1)(lg m − 2.67) + ∑ i∈D mpi ( lg m  − bi ) bits on Since tuples are chosen i.i.d, H ( R ) = mH ( D ). Recall that s average, where m = |R| > 100, Ds is the subset of tuples that get pi is the probability of the i'th element of Ds and bi is the coded in less than ≈lg m∆ bits at the end of step 1d, pi is the length of the code for the i'th element. So E ( µi ) = mpi . Thus, occurrence probability of tuple i c Ds, and bi is the length to which the space overhead of Algorithm A over entropy = (2) - (1) tuple i is coded at end of step 1d. ≤ 3.67 m + lg m + m ∑ i∈Ds mpi ( lg m  − bi − lg mpi ) Proof: By standard results on Huffman coding , the ≤ 3.67 m + lg m + m ∑ i∈Ds mpi (lg(1/ pi ) − bi ). expected size of the tuples at the end of Step 1d is no more than Let p = ∑ i∈Ds pi and qi = pi / p. Then the overhead ≤ m(1+H(D)) bits. The padding of Step 1e adds (≈lg m∆ – bi) random 3.67 m + lg m + mp ∑ i∈Ds qi (lg(2 − bi / qi ) − lg p ). Since bits for each tuple that codes to bi < ≈lg m∆ bits. So the overall padding has expected length of ∑ i∈D mpi ( lg m  − bi ) bits. By s ∑ i∈Ds qi = 1, and lg( x ) is concave, by Jensen's inequality, Lemma 3, after Step 1e, the ≈lg m∆-bit prefix of the coded tuple is ∑ i∈Ds qi lg(2 −bi / qi ) ≥ lg ∑ i∈Ds qi (2 −bi / qi ). So overhead ≤ uniformly distributed over all bit vectors of length ≈lg m∆. So, by 3.67 m + lg m + m lg ∑ i∈Ds 2− bi + mp lg (1 p ). Lemma 1, average length of the coded delta is at most 2.67 bits. So Now, for any unambiguous code, Kraft's inequality  says the delta coding of Steps 2 and 3 saves on average ≈lg m∆ − 2.67 that ∑ i∈D 2− bi ≤ 1. So overhead ≤ 3.67 m + lg m + mp lg (1 p ). bits for each pair of adjacent tuples: (m−1)(≈lg m∆ − 2.67) bits By simple calculus, x lg(1/ x ) ≤ (1/ e) lg e 0.53. So, overall. And by summing these quantities we get our result. ± space taken by Algorithm A - H ( R ) is at most 4.2 m + lg m bits < 4.3m bits (lg x/x < 0.07 for m>100) 869