## 展开查看详情

1. How to Wring a Table Dry: Entropy Compression of Relations and Querying of Compressed Relations Vijayshankar Raman Garret Swart IBM Almaden Research Center, San Jose, CA 95120 ABSTRACT increases CPU cost, especially for large compression dictionaries that don’t fit in cache1. Worse, since querying is We present a method to compress relations close to their entropy done on uncompressed data, the in-memory query execution is while still allowing efficient queries. Column values are not sped up at all. Furthermore, as we see later, a vanilla gzip encoded into variable length codes to exploit skew in their or dictionary coder give suboptimal compression – we can do frequencies. The codes in each tuple are concatenated and the much better by exploiting semantics of relations. resulting tuplecodes are sorted and delta-coded to exploit the Another popular way to compress data is a method that can lack of ordering in a relation. Correlation is exploited either by be termed domain coding [6, 8]. In this approach values from a co-coding correlated columns, or by using a sort order that domain are coded into a tighter representation, and queries run leverages the correlation. We prove that this method leads to directly against the coded representation. For example, values near-optimal compression (within 4.3 bits/tuple of entropy), and in practice, we obtain up to a 40 fold compression ratio on in a CHAR(20) column that takes on only 5 distinct values can vertical partitions tuned for TPC-H queries. be coded with 3 bits. Often such coding is combined with a We also describe initial investigations into efficient querying layout where all values from a column are stored together [5, 6, over compressed data. We present a novel Huffman coding 11, 12]. Operations like scan, select, project, etc. then become scheme, called segregated coding, that allows range and equality array operations that can be done with bit vectors. predicates on compressed data, without accessing the full Although it is very useful, domain coding alone is dictionary. We also exploit the delta coding to speed up scans, insufficient, because it poorly exploits three sources of by reusing computations performed on nearly identical records. redundancy in a relation: Initial results from a prototype suggest that with these Skew: Real-world data sets tend to have highly skewed value optimizations, we can efficiently scan, tokenize and apply distributions. Domain coding assigns fixed length (often predicates on compressed relations. byte aligned) codes to allow fast array access. But it is inefficient in space utilization because it codes infrequent 1. INTRODUCTION values in the same number of bits as frequent values. Data movement is a major bottleneck in data processing. In a Correlation: Correlation between columns within the same database management system (DBMS), data is generally row is common. Consider an order ship date and an order moved from a disk, though an I/O network, and into a main receipt date; taken separately, both dates may have the same memory buffer pool. After that it must be transferred up value distribution and may code to the same number of bits. through two or three levels of processor caches until finally it is However, the receipt date is most likely to be within a one loaded into processor registers. Even taking advantage of multi- week period of ship date. So, for a given ship date, the task parallelism, hardware threading, and fast memory probability distribution of receipt date is highly skewed, and protocols, processors are often stalled waiting for data: the the receipt date can be coded in fewer bits. price of a computer system is often determined by the quality Lack of Tuple Order: Relations are multi-sets of tuples, not of its I/O and memory system, not the speed of its processors. sequences of tuples. A physical representation of a relation Parallel and distributed DBMSs are even more likely to have is free to choose its own order – or no order at all. We shall processors that stall waiting for data from another node. Many see that this representation flexibility can be used for DBMS “utility” operations such as replication/backup, ETL additional, often substantial compression. (extract-transform and load), and internal and external sorting are also limited by the cost of data movement1. Column and Row Coding DBMSs have traditionally used compression to alleviate This paper presents a new compression method based on a mix this data movement bottleneck. For example, in IBM’s DB2 of column and tuple coding. DBMS, an administrator can mark a table as compressed, in Our method has three components: We encode column which case individual records are compressed using a values with Huffman codes [16] in order to exploit the skew in dictionary scheme [1]. While this approach reduces I/Os, the the value frequencies – this results in variable length field data still needs to be decompressed, typically a page or record codes. We then concatenate the field codes within each tuple to at a time, before it can be queried. This decompression form tuplecodes and sort and then delta code these tuplecodes, Permission to copy without fee all or part of this material is granted taking advantage of the lack of order within a relation. We provided that the copies are not made or distributed for direct commercial exploit correlation between columns within a tuple by using advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the 1 Very Large Data Base Endowment. To copy otherwise, or to republish, to For typical distributions, radix sort runs in linear time and thus post on servers or to redistribute to lists, requires a fee and/or special in-memory sort is dominated by the time to move data between permission from the publisher, ACM. memory and cache (e.g., see Jim Gray’s commentary on the VLDB ‘06, September 12-15, 2006, Seoul, Korea. 2005 Datamation benchmark [17]). External sorting is of course Copyright 2006 VLDB Endowment, ACM 1-59593-385-9/06/09 almost entirely movement-bound. 858

2.either value concatenation and co-coding, or careful column 1.1 Background and Related Work ordering within the tuplecode. We also allow domain-specific transformations to be applied to the column before the Huffman 1.1.1 Information Theory coding, e.g. text compression for strings. The theoretical foundation for much of data compression is We define a notion of entropy for relations, and prove that information theory [3]. In the simplest model, it studies the our method is near-optimal under this notion, in that it compression of sequences emitted by 0th-order information asymptotically compresses a relation to within 4.3 bits/tuple of sources – ones that generate values i.i.d (independent and its entropy. identically distributed) from a probability distribution D. We have prototyped this method in a system called csvzip Shannon’s celebrated source coding theorem [3] says that one for compression and querying of relational data. We report on cannot code a sequence of values in less than H(D) bits per experiments with csvzip over data sets from TPC-H, TPC-E value on average, where H(D) = ΣicD pi lg (1/pi) is the entropy and SAP/R3. We obtain compression factors from 7 to 40, of the distribution D with probabilities pi. substantially better than what is obtained with gzip or with Several well studied codes like the Huffman and Shannon- domain coding, and also much better than what others have Fano codes achieve 1 + H(D) bits/tuple asymptotically, using a reported earlier. When we don’t use co-coding, the level of dictionary that maps values in D to codewords. A value with compression obtained is sensitive to the position of correlated occurrence probability pi is coded in roughly lg (1/pi) bits, so columns within the tuplecode. We discuss some heuristics for that more frequent values are coded in fewer bits. choosing this order, though this needs further study. Most coding schemes are prefix codes – codes where no codeword is a prefix of another codeword. A prefix code Efficient Operations over Compressed Data dictionary is often implemented as a prefix tree where each In addition to compression, we also investigate scans and joins edge is labelled 0 or 1 and each leaf maps to a codeword (the over compressed data. We currently do not support incremental string of labels on its path from the root). By walking the tree updates to compressed tables. one can tokenize a string of codewords without using delimiters Operating on the compressed data reduces our – every time we hit a leaf we output a codeword and jump back decompression cost; we only need to decode fields that need to to the root. be returned to the user or are used in aggregations. It also The primary distinction between relations and the sources reduces our memory throughput and capacity requirements. considered in information theory is the lack of order: relations The latter allows for larger effective buffer sizes, greatly are multi-sets, and not sequences. Secondarily, we want the increasing the speed of external data operations like sort. compressed relation to be directly queryable, whereas it is more Querying compressed data involves parsing the encoded bit common in the information theory literature for the sequence to stream into records and fields, evaluating predicates on the be decompressed and then pipelined to the application. encoded fields, and computing joins and aggregations. Prior researchers have suggested order-preserving codes [2,15,18] 1.1.2 Related work on database compression that allow range queries on encoded fields. However this DBMSs have long used compression to help alleviate their data method does not extend efficiently to variable length codes. movement problems. The literature has proceeded along two Just tokenizing a record into fields, if done naively, involves strands: field wise compression, and row wise compression. navigating through several Huffman dictionaries. Huffman Field Wise Compression: Graefe and Shapiro [20] were dictionaries can be large and may not fit in the L2 cache, among the first to propose field-wise compression, because making the basic operation of scanning a tuple and applying only fields in the projection list need to be decoded. They also selection or projection operations expensive. We introduce two observed that operations like join that involve only equality optimizations to speed up querying: comparisons can be done without decoding. Goldstein and Segregated Coding: Our first optimization is a novel scheme Ramakrishnan [8] propose to store a separate reference for the for assigning codes to prefix trees. Unlike a true order values in each page, resulting in smaller codes. Many column- preserving code, we preserve order only within codes of the wise storage schemes (e.g., [5,6,12]) also compress values same length. This allows us to evaluate many common within a column. Some researchers have investigated order- predicates directly on the coded data, and also to find the preserving codes in order to allow predicate evaluation on length of each codeword without accessing the Huffman compressed data [2,15,18]. [2] study order preserving dictionary, thereby reducing the memory working set of compression of multi-column keys. An important practical tokenization and predicate evaluation. issue is that field compression can make fixed-length fields into Short Circuited Evaluation: Our delta coding scheme sorts variable-length ones. Parsing and tokenizing variable length the tuplecodes, and represents each tuplecode by its delta fields increases the CPU cost of query processing, and field from the previous tuplecode. A side-effect of this process is delimiters, if used, undo some of the compression. So almost to cluster tuples with identical prefixes together, which all of these systems use fixed length codes, mostly byte- means identical values for columns that are early in the aligned. This approximation can lead to substantial loss of concatenation order. This allows us to avoid decoding, compression as we argue in Section 2.1.1. Moreover, it is not selecting and projecting columns that are unchanged from clear how to exploit correlation with a column store. the previous tuple. Row Wise Compression: Commercial DBMS Before diving into details of our method, we recap some implementations have mostly followed the row or page level information theory basics and discuss some related work. compression approach where data is read from disk, 859

3. Domain Num. possible Num. Likely vals Entropy Comments values (in top 90 percentile) (bits/value) We assume that the db must support all dates till 10000 A.D., but 99% Ship Date 3650000 1547.5 9.92 of dates will be in 1995-2005, 99% of those are weekdays, 40% of those are in the 10 days each before New Year and Mother’s day. Last We use exact frequencies for all U.S. names ranking in the top 90 2160 (char (20)) j80000 26.81 Names percentile (from census.gov), and extrapolate, assuming that all ∫2160 Male first 160 names below 10th percentile are equally likely. This over-estimates 2 (char (20)) 1219 22.98 entropy. names We use country distribution from import statistics for Canada (from Customer 215 ∫ 27.75 2 1.82 www.wto.org) – the entropy is lesser if we factor in poor countries, Nation which trade much less and mainly with their immediate neighbours. Table 1: Skew and Entropy in some common domains decompressed, and then queried. IBM DB2 [1] and IMS[10] This then leads into a discussion of methods to query use a non-adaptive dictionary scheme, with a dictionary compressed relations, in Section 3. mapping frequent symbols and phrases to short code words. Some experiments on DB2 [1] indicate a factor of 2 2.1 Squeezing redundancy out of a relation compression. Oracle uses a dictionary of frequently used 2.1.1 Exploiting Skew by Entropy Coding symbols to do page-level compression and report a factor of 2 Many domains have highly skewed data distributions. One to 4 compression [21]. The main advantage of row or page form of skew is not inherent in the data itself but is part of the compression is that it is simpler to implement in an existing representation – a schema may model values from a domain DBMS, because the code changes are contained within the with a data type that is much larger. E.g., in TPC-H, the page access layer. But it has a huge disadvantage in that, while C_MKTSEGMENT column has only 5 distinct values but is it reduces I/O, the memory and cache behaviour is worsened modelled as CHAR(10) – out of 25610 distinct 10-byte strings, due to decompression costs. Some studies suggest that the CPU only 5 have non-zero probability of occurring! Likewise, post cost of decompression is also quite high [8]. Y2K, a date is often stored as eight 4-bit digits (mmddyyyy), Delta Coding: C-Store [6] is a recent system that does column- but the vast majority of the 168 possible values map to illegal wise storage and compression. One of its techniques is to delta dates. code the sort column of each table. This allows some Prior work (e.g, [6, 12]) exploits this using a technique exploitation of the relation’s lack-of-order. [6] does not state we’ll call domain coding: legal values from a large domain are how the deltas are encoded, so it is hard to gauge the extent to mapped to values from a more densely packed domain – e.g., which this is exploited. In a different context, inverted lists in C_MKTSEGMENT can be coded as a 3 bit number. To permit search engines are often compressed by computing deltas array based access to columns, each column is coded to a fixed among the URLs, and using heuristics to assign short codes to number of bits. common deltas (e.g, [19]). We are not aware of any rigorous While useful, this method does not address skew within the work showing that delta coding can compress relations close to value distribution. Many domains have long-tailed frequency their entropy. distributions where the number of possible values is much more Lossy Compression: There is a vast literature on lossy than the number of likely values. Table 1 lists a few such compression for images, audio, etc, and some methods for domains. E.g., 90% of male first names fall within 1219 values, relational data, e.g., see Spartan [7]. These methods are but to catch all corner cases we would need to code it as a complementary to our work – any domain-specific compression CHAR(20), using 160 bits, when the entropy is only 22.98 scheme can be plugged in as we show in Section 2.1.4. We bits. We can exploit such skew fully through entropy coding. believe lossy compression can be useful for measure attributes Probabilistic Model of a Relation: Consider a relation R with that are used only for aggregation. column domains COL1, … COLk. For purposes of compression, 2. COMPRESSION METHOD we view the values in COLi as being generated by an i.i.d. Three factors lead to redundancy in relation storage formats: (independent and identically distributed) information source skew, tuple ordering, and correlation. In Section 2.1 we discuss over a probability distribution Di. Tuples of R are viewed as each of these in turn, before presenting a composite being generated by an i.i.d information source with joint compression algorithm that exploits all three factors to achieve probability distribution: D = (D1, D2, … Dk).2 We can estimate near-optimal compression. Such extreme compression is ideal for pure data movement 2 By modeling the tuple sources as i.i.d., we lose the ability to tasks like backup or replication. But it is at odds with efficient exploit inter-tuple correlations. To our knowledge, no one has querying. Section 2.2 describes some relaxations that sacrifice studied such correlations in databases – all the work on some compression efficiency in return for simplified querying. correlations has been among fields within a tuple. If inter-tuple correlations are significant, the information theory literature on compression of non zero-order sources might be applicable. 860

4. m est. H(delta(R)) in bits This bound is far from tight. Table 2 shows results from a 10000 1.897577 m Monte-Carlo simulation where we pick m numbers i.i.d from 100,000 1.897808 m [1,m], calculate the distribution of deltas, and estimate their 1,000,000 1.897952 m entropy. Notice that the entropy is always less than 2 bits. 10,000,000 1.89801 m Thus, Delta coding compresses R from m lg m bits to lg m + 40,000,000 1.898038 m 2(m–1) bits, saving (m–1)(lg m – 2) bits. For large databases, lg m can be about 30 (e.g., 100GB at 100B/tuple). As experiments Table 2: Entropy of delta(R) for a multi-set R of m in Section 4.1 show, when a relation has only a few columns, values picked uniformly, i.i.d. from [1,m] (100 trials) such delta coding alone can give up to a 10 fold compression. each Di from the actual value distribution in COLi, optionally This analysis applies to a relation with one column, chosen extended with some domain knowledge. For example, if COLi uniformly from [1, m]. We generalize this to a method that has {Apple, Apple, Banana, Mango, Mango, Mango}, then Di works on arbitrary relations in Section 2.1.4. is the distribution {pApple = 0.333, pBanana = 0.167, pMango =0.5}. Schemes like Huffman and Shannon-Fano code such a Optimality of Delta Coding sequence of i.i.d values by assigning shorter codes to frequent Such delta coding is also very close to optimal – the following values [3]. On average, they can code each value in COLi with lemma shows we cannot reduce the size of a sequence by more at most 1 + H(Di) bits, where H(X) is the entropy of distribution than lg m! just by viewing it as a multi-set. X – hence these codes are also called “entropy codes.” Using an Lemma 2: Given a vector R of m tuples chosen i.i.d. from a entropy coding scheme, we can code the relation R distribution D and the multi-set R of values in R, (R and R are with ∑1≤i ≤ k R (1 + H( Di ) + DictSize( COLi ) bits, where both random variables), H(R) >= m H(D) – lg m! DictSize(COLi) is the size of the dictionary mapping code Proof Sketch: Since the elements R are chosen i.i.d., H(R) = m words to values of COLi. H(D). Now, augment the tuples t1, t2, …, tm of R with a “serial- If a relation were a sequence of tuples, assuming that the number” column SNO, where ti.SNO = i. Ignoring the ordering domains Di are independent (we relax this in 2.1.3), this coding of tuples in this augmented vector, we get a set, call it R1. is optimal, by Shannon’s source coding theorem [3]. But Clearly there is a bijection from R1 to R, so H(R1) = m H(D). relations are not sequences, they are multi-sets of tuples, and But R is just a projection of R1, on all columns except SNO. For permit further compression. each relation R, there are at most m! relations R1 whose projection is R. So H(R1) <= H(R) + lg m! ± 2.1.2 Order: Delta Coding Multi-Sets So, with delta coding we are off by at most lg m! – m(lg m Consider a relation R with just one column, COL1, containing – 2) j m (lg m – lg e ) – m (lg m – 2) = m (2 – lg e) j 0.6 m numbers chosen uniformly and i.i.d from the integers in bits/tuple from the best possible compression. This loss occurs [1,m]. Traditional databases would store R in a way that because the deltas are in fact mildly correlated (e.g., sum of encodes both the content of R and some incidental ordering of deltas = m), but we do not exploit this correlation – we code its tuples. Denote this order-significant view of the relation as each delta separately to allow pipelined decoding. R (we use bold font to indicate a sequence). The number of possible instances of R is mm, each of which 2.1.3 Correlation has equal likelihood, so H(R) is m lg m. But R needs much less Consider a pair of columns (partKey, price), where each space because we don’t care about the ordering. Each distinct partKey largely has a unique price. Storing both partKey and instance of R corresponds to a distinct outcome of throwing of price separately is wasteful; once the partKey is known, the m balls into m equal probability bins. So, by standard range of possible values for price is limited. Such inter-field correlation is quite common and is a valuable opportunity for combinatorial arguments (see [14], [4]), there are 2 m −1 ∫ relation compression. m 4m 4 π m different choices for R, which is much less than In Section 2.1.1, we coded each tuple in Σj H(Dj) bits. This m is optimal only if the column domains are independent, that is, m . A simple way to encode R is as a coded delta sequence: if the tuples are generated with an independent joint 1) Sort the entries of R, forming sequence v = v1, …, vm distribution (D1, D2, … Dk). For any joint distribution, H(D1, 2) Form a sequence delta(R) = v1, v2–v1, v3–v2, …, vm–vm-1 …, Dk) ≤ Σj H(Dj), with equality if and only if the Dj’s are 3) Entropy code the differences in delta to form a new independent [3]. Thus any correlation strictly reduces the sequence code(R) = v1, code(v2–v1), ... , code(vm – vm–1) entropy of relations over this set of domains. Space savings by delta coding We have three methods to exploit correlation: co-coding, Intuitively, sorting and delta coding compresses R because the dependent coding and column ordering. distribution of deltas is tighter than that of the original integers Co-coding concatenates correlated columns, and encodes – small deltas are much more likely than large ones. Formally: them using a single dictionary. If there is correlation, this Lemma 1: If R is a multi-set of m values picked uniformly with combined code is more compact than the sum of the individual repetition from [1,m], and m > 100, then each delta in delta(R) field codes. has entropy < 2.67 bits. A variant approach we call dependent coding builds a Proof Sketch: See Appendix 7. ± Markov model of the column probability distributions, and uses Corollary 1.1: code(R) occupies < 2.67 m bits on average. it to assign Huffman codes. E.g, consider columns partKey, 861

5. gives an example of how the data is transformed. Figure 4 shows a process flow chart. The algorithm has two main pieces: Column Coding: For each tuple, we first perform any type specific transforms (supplied by the user) on columns that need special handling (1a). For example, we can apply a text compressor on a long VARCHAR column, or split a date into week of year and day of week (to more easily capture skew towards weekdays). Next we co-code correlated columns (1b), and then replace each column value with a Huffman code (1c). We use Huffman codes as a default because they are asymptotically optimal, and we have developed a method to efficiently run selections and projections on concatenated Huffman codes (Section 3). We currently compute the codes using a statically built dictionary rather than a Ziv-Lempel style adaptive dictionary because the data is typically compressed 1. for each tuple (t.c1, t.c2, … t.ck) of |R| do once and queried many times, so the work done to develop a 1a. for each col ci that needs type specific transform, do: better dictionary pays off. t.ci type_specific_transform (t.ci) Tuple Coding: We then concatenate all the field codes to form 1b. concatenate correlated columns together a bit-vector for each tuple, pad them on the right to a given 1c. for each col i = 1 to k do: length and sort the bit-vectors lexicographically. We call these t.ci Huffman_Code(t.ci) bit vectors tuplecodes because each represents a tuple. After 1d. tupleCode concat(t.c1, t.c2, … t.ck) sorting, adjacent tuplecodes are subtracted to obtain a vector of 1e. If size(tupleCode) < ≈lg(|R|)∆ bits deltas and each delta is further Huffman coded. By Lemma 2, pad it with random bits to make it ≈lg(|R|)∆ bits long. we cannot save more than lg|R| bits/tuple by delta-coding, so 2. sort the coded tuples lexicographically. our algorithm needs to pad tuples only to lg |R| bits (in Section 3. for each pair of adjacent tuples p, q in sorted order do: 2.2.2 we describe a variation that pads tuples to more than lg 3a. let p = p1p2, q = q1q2, |R| bits; this is needed when we don’t co-code correlated where p1,q1 are ≈lg(|R|)∆-bit prefixes of p,q columns). 3b. deltaCode Huffman_Code(q1-p1). The expensive step in this compression process is the sort. 3c replace p with deltaCode.p2 But it need not be perfect, as any imperfections only reduce the quality of compression. E.g., if the data is too large for an in- Algorithm 3: Pseudo-Code for compressing a relation memory sort, we can create memory-sized sorted runs and not price, and brand, where (partKey,price) and (partKey,brand) do a final merge; by an analysis similar to Theorem 3, we lose are pair wise correlated, but price and brand are independent about lg x bits/tuple, if we have x similar sized runs. given the partKey. Instead of co-coding all three columns, we Analysis of compression efficiency can assign a Huffman code to partKey and then choose the Huffman dictionary for coding price and brand based on the Lemma 2 gives a lower bound on the compressibility of a code for partKey. Both co-coding and dependent coding will general relation: H(R) ≥ m H(D) – lg m!, where m = |R|, and code this relation to the same number of bits but when the tuples of R are chosen i.i.d from a joint distribution D. The correlation is only pair wise, dependent coding results in Huffman coding of the column values reduces the relation size smaller Huffman dictionaries, which can mean faster decoding. to m H(D) asymptotically. Lemma 1 shows that, for a multi-set Both co-coding and dependent coding exploit correlation of m numbers chosen uniformly from [1, m], delta coding saves maximally, but cause problems when we want to run range almost lg m! bits. But the analysis of Algorithm 3 is queries on the dependent column. In Section 2.2.2, we present a complicated because (a) our relation R is not such a multi-set, different technique that keeps correlated columns separate (to and (b) because of the padding we have to do in Step (1e). Still, allow fast queries), and instead exploits correlation by tuning we can show that we are within 4.3 bits/tuple of optimal the column ordering within a tuple. compression: Currently, csvzip implements co-coding and column Theorem 3: Our algorithm compresses a relation R of tuples ordering. The column pairs to be co-coded and the column chosen i.i.d from a distribution D to an expected bit size of no order are specified manually as arguments to csvzip. An more than H(R) + 4.3|R| bits, if |R |> 100. important future challenge is to automate this process. Proof: See Appendix 8. 2.1.4 Composite Compression Algorithm 2.2 Relaxing Compression for Query Speed Having seen the basic kinds of compression possible, we now csvzip implements the composite compression algorithm of proceed to design a composite compression algorithm that Section 2.1.4 in order to maximally compress its input exploits all three forms of redundancy and allows users to plug relations. But it also performs two relaxations that sacrifice in custom compressors for idiosyncratic data types (images, some compression in return for faster querying. text, dates, etc). Algorithm 3 describes this in pseudo-code and 862

6. Input tuple Consider the example (partKey, price) again. We can evaluate a predicate partKey=? AND price=? on the co-coded values if the co-coding scheme preserves the ordering on (partKey, price). We can also evaluate standalone predicates on Column 1 Column 2 Column 3 … Column k partKey. But we cannot evaluate a predicate on price without decoding. Co-coding also increases the dictionary sizes which Co-code Type specific Type specific can slow down decompression if the dictionaries no longer fit transform transform transform in cache. Transform Transform Transform Transform We avoid co-coding such column pairs by tuning the order Column 1 Column 2 Column 3 Column j of the columns in the tuplecode. Notice that in Step 1d of Dict Dict Dict Dict Algorithm 3, there is no particular order in which the fields of Huffman Huffman Huffman Huffman Encode Encode Encode Encode the tuple t should be concatenated – we can choose any concatenation order, as long as we follow the same for all the Field Field Field Filed Code Code Code … Code tuples. Say we code partKey and price separately, but place partKey followed by price early in the concatenation order in Step 1d. After sorting, identical values of partKey will mostly TupleCode appear together. Since partKey largely determines price, identical values of price will also appear close together. So the contribution of price to the delta (Step 3b) is a string of 0s most First b bits Any additional bits of the time. This 0-string compresses very well during the 1 Sorted Huffman coding of the tuplecode deltas. We present Tuplecodes experiments in Section 4 that quantify this trade-off. Previous — Key Tuplecode 3. QUERYING COMPRESSED DATA Delta Data Item We now turn our focus from compressing relations to running queries on the compressed relation. Our goals are to: Huffman Dict Encode Process Design query operators that work on compressed data, Delay Determine as soon as possible that the selection criteria is Delta Code not met, avoiding additional work for a tuple, Evaluate the operators using small working memory, by Append minimizing access to the full Huffman dictionaries. Compression 3.1 Scan with Selection and Projection Block Scans are the most basic operation over compressed relations, Figure 4: The Compression Process and the hardest to implement efficiently. In a regular DBMS, scan is a simple operator: it reads data pages, parses them into 2.2.1 Huffman coding vs. Domain coding tuples and fields, and sends parsed tuples to other operators in As we have discussed, Huffman coding can be substantially the plan. Projections are usually done implicitly as part of more space efficient than domain coding for skewed domains. parsing. Selections are applied just after parsing, to filter tuples But it does create variable length codes, which are harder to early. But parsing a compressed table is more compute tokenize (Section 3). For some numerical domains, domain intensive because all tuples are concatenated together into a coding also allows for efficient decoding (see below). So single bit stream. Tokenizing this stream involves: (a) undoing domain coding is useful in cases where it does not lose much in the delta coding to extract tuplecodes, and (b) identifying field space efficiency. boundaries within each tuplecode. We also want to apply Consider a domain like “supplierKey integer” in a table predicates during the parsing itself. with a few hundred suppliers. Using a 4 byte integer is obviously over-kill. But if the distribution of suppliers is Undoing the delta coding. The first tuplecode in the stream is roughly uniform, a 10-bit domain code may compress the always stored as-is. So we extract it directly, determining its column close to its entropy. Another example is “salary end by knowing its schema and navigating the Huffman tree for integer”. If salary ranges from 1000 to 500000, storing it as a each of its columns in order, as we read bits off the input 22 bit integer may be fine. The fixed length makes tokenization stream. Subsequent tuples are delta-coded on their prefix bits easy. Moreover, decoding is just a bit-shift (to go from 20 bits (Algorithm 3). For each of these tuples, we first extract its to a uint32). Decoding speed is crucial for aggregation delta-code by navigating the Huffman tree for the delta-codes. columns. We use domain coding as default for key columns We then add the decoded delta to the running tuple prefix of (like supplierKey) as well as for numerical columns on which the previous tuplecode to obtain the prefix of the current the workload performs aggregations (salary, price, …). tuplecode. We then push this prefix back into the input stream, so that the head of the input bit stream contains the full 2.2.2 Tuning the sort order to obviate co-coding tuplecode for the current tuple. We repeat this process till the In Section 2.1.3, we co-coded correlated columns to achieve stream is exhausted. greater compression. But co-coding can make querying harder. 863

7. We make one optimization to speed decompression. Rather as follows. We first rearrange the tree so that leaves at smaller than coding each delta by a Huffman code based on its depth are to the left of leaves at greater depth. We then permute frequency, we Huffman code only the number of leading 0s in the values at each depth so that leaves at each depth are in the delta, followed by the rest of the delta in plain-text. This increasing order of value, when viewed from left to right. “number-of-leading-0s” dictionary is often much smaller (and Finally, we label each node’s left-edge as 0 and right-edge as 1. hence faster to lookup) than the full delta dictionary, while Figure 5 shows an example. It is easy to see that a segregated enabling almost the same compression, as we see coding has two properties: experimentally (Section 4.1). Moreover, the addition of the • within values of a given depth, greater values have greater decoded delta is faster when we code the number of leading 0s, codewords (e.g., encode(‘tue’) < encode (‘thu’), in Figure 5) because it can be done with a bit-shift and a 64-bit addition • Longer codewords are numerically greater than shorter most of the time avoiding the use of multi-precision arithmetic. codewords (e.g., encode(‘sat’) < encode (‘mon’), in Figure 5) A second optimization we are investigating is to compute deltas on the full tuplecode itself. Experimentally we have A Micro-Dictionary to tokenize Codewords observed that this avoids the expensive push back of the Using property 2), we can find the length of a codeword in time decoded prefix, but it does increase the entropy (and thus proportional to the log of the code length. We don’t need the Huffman code length) of the deltas, by about 1bit/tuple. full dictionary; we just search the value ranges used for each code length. We can represent this efficiently by storing the Identifying field boundaries. Once delta coding has been smallest codeword at each length in an array we’ll call undone and we have reconstructed the tuplecode, we need to mincode. Given a bit-vector b, the length of the only codeword parse the tuplecode into field codes. This is challenging that is contained in a prefix of b is given by max{len : because there are no explicit delimiters between the field codes. mincode[len] ≤ b}, which can be evaluated efficiently using a The standard approach mentioned in Section 1.1 (walking the binary or linear search, depending on the length of the array. Huffman tree and exploiting the prefix code property) is too This array mincode is very small. Even if there are 15 expensive because the Huffman trees are typically too large to distinct code lengths and a code can be up to 32 bits long, the fit in cache (number of leaf entries = number of distinct values mincode array consumes only 60 bytes, and easily fits in the L1 in the column). Instead we use a new segregrated coding cache. We call it the micro-dictionary. We can tokenize and scheme (Section 3.1.1). extract the field code using mincode alone. Selecting without decompressing. We next want to evaluate Evaluating Range Queries using Literal Frontiers selection predicates on the field codes without decoding. Property 1) is weaker than full order-preservation, e.g., Equality predicates are easily applied, because the coding encode(wed) < encode(mon) in Figure 5. So, to evaluate < function is 1-to-1. But range predicates need order-preserving col on a literal , we cannot simply compare encode( ) with codes: e.g., to apply a predicate c1 ≤ c2, we want: code(c1) ≤ the field code. Instead we pre-compute for each literal a list of code(c2) iff c1 ≤ c2. However, it is well known [13] that prefix codewords, one at each length: codes cannot be order-preserving without sacrificing φ(λ) [d] = max {c a code word of length d| decode(c) ≤ λ} compression efficiency. The Hu-Tucker scheme [15] is known to be the optimal order-preserving code, but even it loses about To evaluate a predicate λ < col, we first find the length l of 1 bit (vs optimal) for each compressed value. Segregated the current field code using mincode. Then we check if φ(λ)[l] coding solves this problem as well. < field code. We call φ(λ) the frontier of λ. φ(λ) is calculated by binary search for encode(λ) within the leaves at each depth 3.1.1 Segregated coding of the Huffman tree. Although this is expensive, it is done only For fast tokenization with order-preservation, we propose a once per query. Notice that this only works for range predicates new scheme for assigning code words in a Huffman tree (our involving literals. Other predicates, such as col1 < col2 can scheme applies more generally, to any prefix code). only be evaluated on decoded values, but are less common. The standard method for constructing Huffman codes takes a list of values and their frequencies, and produces a binary tree 3.1.2 Short circuited evaluation [16]. Each value corresponds to a leaf, and codewords are Adjacent tuples processed in sorted order are very likely to assigned by labelling edges 0 or 1. have the same values for many of the initial columns. We take The compression efficiency is determined by the depth of each advantage of this during the scan by keeping track of the value – any tree that places values at the same depths has the current value of sub-expressions used in the computations to same compression efficiency. Segregated coding exploits this evaluate selections, projections, and aggregations. When processing a new tuple, we first analyze its delta code to determine the largest prefix of columns that is identical to the previous tuple. This is easy to do with our optimization of Huffman coding not the actual delta but the number of leading 0s in the delta. We do need to verify if carry-bits from the rest of the delta will propagate into the leading 0s, during the addition of the delta with the previous tuplecode. This verification is just a right-shift followed by a compare with the Figure 5: Segregated Huffman Coding Example previous tuplecode. The shift does become expensive for large 864

8.tuplecodes; we are investigating an alternative XOR-based as they are entered into the hash buckets (a sort is not needed delta coding that doesn’t generate any carries. here because the input stream is sorted). The advantage is that hash buckets are now compressed more tightly so even larger 3.2 Layering other Query Plan operators relations can be joined using in-memory hash tables (the effect The previous section described how we can scan compressed of delta coding will be reduced because of the smaller number tuples from a compressed table, while pushing down selections of rows in each bucket). and projections. To integrate this scan into a query plan, we Grouping tuples by a column value can be done directly expose it using the typical iterator API, with one difference: using the code words, because checking whether a tuple falls getNext() returns not a tuple of values but a tuplecode – i.e., a into a group is simply an equality comparison. However tuple of coded column values. Most other operators, except aggregations are harder. aggregations, can be changed to operate directly on these COUNT, COUNT DISTINCT, can be computed directly on tuplecodes. We discuss four such operators next: index-scan, code words: to check for distinctness of values we check hash join, sort-merge join, and group-by with aggregation. distinctness of the corresponding codewords. MAX and MIN computation involves comparison between code words. Since 3.2.1 Index Scan: Access Row by RID our coding scheme preserves order only within code words of Index scans take a bounding predicate, search through an index the same length, we need to maintain the current maximum or structure for matching row ids (RIDs), load the corresponding minimum separately on code words of each length. After data pages, and extract matching records. The process of scanning through the entire input, we evaluate the overall max mapping predicates to RIDs occurs as usual. But extracting the or min by decoding the current code words of each codeword matching records is trickier because it involves random access length and computing the maximum of those values. within the table. Since we delta-code tuples, the natural way to SUM, AVG, STDEV, cannot be computed on the code tokenize a table into tuples is to scan them sequentially, as in words directly; we need to decode first. Decoding domain- Section 3.1. coded integer columns is just a bit-shift. Decoding Huffman Our solution is to punctuate a compressed table with codes from small domains or ones with large skew is also periodic non-delta-coded tuples (the fields in these tuples are cheap, because the Huffman tree is shallow. We also place still Huffman coded). This divides the table into separately columns that need to be decoded early in the column ordering, decodable pieces, called compression blocks (cblocks). We to improve the chance that the scanner will see runs of identical make each rid be a pair of cblock-id and index within cblock, codes, and benefit from short circuited evaluation. so that index-based access involves sequential scan within the cblock only. Thus, short cblocks mean fast index access. 3.2.3 Sort Merge Join However, since the first tuple is not delta-coded, short cblocks The principal comparisons operations that a sort merge join also mean less compression. In practice this is not a problem: A performs on its inputs are < and =. Superficially, it would Huffman-coded tuple takes only 10-20 bytes for typical appear that we cannot do sort merge join without decoding the schemas (Section 4.1), so even with a cblock size of 1KB, the join column, because we do not preserve order across code loss in compression is only about 1%. 1KB fits in L1-cache on words of different lengths. But in fact, sort merge join does not many processors, so sequential scan in a cblock is fine. need to compare tuples on the traditional ‘<’ operator – any total ordering will do. In particular, the ordering we have 3.2.2 Hash Join & Group By with Aggregation chosen for codewords – ordered by codeword length first and Huffman coding assigns a distinct field code to each value. So then within each length by the natural ordering of the values is we can compute hash values on the field codes themselves a total order. So we can do sort merge join directly on the without decoding. If two tuples have matching join column coded join columns, without decoding them first. values, they must hash to the same bucket. Within the hash bucket, the equi-join predicate can also be evaluated directly on 4. EXPERIMENTAL STUDY the field codes. We have implemented our compression method in a prototype One important optimization is to delta-code the input tuples system called csvzip. csvzip currently supports the compression 40 35 30 Domain Coding csvzip gzip csvzip+cocode 25 20 y 15 10 5 0 P1 P2 P3 P4 P5 P6 Figure 7: Comparing the compression ratios of four compression methods on 6 datasets 865

9. DATASET SCHEMA Original DC-1 DC-8 Huffman csvzip Delta code Huffman + Correlation csvzip+ Loss by not Gzip size (1) (2) savings Cocode saving cocode Cocoding (1)-(2) (3) (1)-(3) (5) (2)-(5) P1. LPK LPR LSK LQTY 192 76 88 76 7.17 68.83 36 40 4.74 2.43 73.56 P2. LOK LQTY 96 37 40 37 5.64 31.36 37 0 5.64 0 33.92 P3. LOK LQTY LODATE 160 62 80 48.97 17.60 31.37 48.65 0.32 17.60 0 58.24 P4. LPK SNAT LODATE CNAT 160 65 80 49.54 17.77 31.77 49.15 0.39 17.77 0 65.53 P5. LODATE LSDATE LRDATE LQTY LOK 288 86 112 72.97 24.67 48.3 54.65 18.32 23.60 1.07 70.50 P6. OCK CNAT LODATE 128 59 72 44.69 8.13 36.56 39.65 5.04 7.76 0.37 49.66 P7. SAP SEOCOMPODF 548 165 392 79 47 32 58 21 33 14 52 P8. TPC-E CUSTOMER 198 54 96 47 30 17 44 3 23 7 69 Table 6: Overall compression results on various datasets (all sizes are in # bits/tuple). DC-1 and DC-8 are bit and byte aligned domain coding. Csvzip is Huffman followed by delta coding; csvzip+cocode is cocode followed by Huffman and then delta-coding. Underscore indicates skew and italics denotes correlation. Column abbreviations are: LPK: partkey, LPR: extendedprice, LSK suppkey, LQTY: quantity, LOK: orderkey, LODATE: orderdate, SNAT: suppNation, CNAT: custNation, LSDATE: shipDate, LRDATE: receiptDate. TPC-E schema is tier, country_1, country_2, country_3, area_1, first name, gender, middle initial, last name. of relations loaded from comma separated value (csv) files, and Arithmetic Correlation: We made l_receiptdate and table scan with selection, projection, and aggregation. Updates l_shipdate be uniformly distributed in the 7 days after the and joins are not implemented at this time. We do not currently corresponding o_orderdate. have a SQL parser – we execute queries by writing C programs There are two other correlations inherent in the schema: that compose select, project, and aggregate primitives. For a given l_partkey, l_suppkey is restricted to be one of 4 We have used this prototype to perform an experimental possible values. study of both compression efficiency and scan efficiency. Our De-Normalization: Dataset P6 in Table 6 is on lineitem x goal is to quantify the following: order x customer x nation, and contains a non-key 1) How much can we compress, as compared to row coding or dependency o_custkey c_nationkey domain coding? What is the relative payoff of skew, order- We use a 1 TB scale TPC-H dataset (j6B rows in each of our freeness, and correlation (Section 4.1)? datasets). To make our experiments manageable, we did not 2) How efficiently can we run scan queries directly on these actually generate, sort, and delta-code this full dataset – rather compressed tables (Section 4.2)? we tuned the data generator to only generate 1M row slices of it, and compressed these. For example, P2 of Table 6 is delta- We use three datasets in our experiments: coded by sorting on <l_orderkey,l_qty>. We generate 1M row TPC-H views: We choose a variety of projections of Lineitem slices of P2 by modifying the generator to filter rows where x Orders x Part x Customer, that are appropriate for answering l_orderkey is not in the desired 1M row range. TPC-H queries (Table 6). Our physical design philosophy, like C-Store, is to have a number of highly compressed materialized 4.1 Extent of Compression views appropriate for the query workload. Table 6 lists the detailed compression results on each dataset. TPC-E Customer: We tested using 648,721 records of Figure 7 plots the overall compression ratio obtained by csvzip randomly generated data produced per the TPC-E specification. (with and without co-coding) vs that obtained by: This file contains many skewed data columns but little a plain gzip (representing the ideal performance of row and correlation other than gender being predicted by first name. page level coders), SAP/R3 SEOCOMPODF: We tested using projections of a a fixed length domain coder aligned at bit boundaries table from SAP having 50 columns and 236,213 rows. There is (representing the performance of column coders). Table 6 a lot of correlation between the columns, causing the delta code also lists the numbers for domain coding at byte boundaries; savings to be much larger than usual. it is significantly worse. One drawback with TPC-H is that uses uniform, Note that all ratios are with respect to the size of the vertical independent value distributions, which is utterly unrealistic partition, not size of the original tables – any savings obtained (and prevents us from showing off our segregated Huffman by not storing unused columns is orthogonal to compression. Coding ☺). So we altered the data generator to generate two Even without co-coding, csvzip consistently gets 10x or skewed columns and 2 kinds of correlation: more compression, in contrast to 2-3x obtained by gzip and Dates: We chose 99% of dates to be in 1995-2005, with 99% domain coding. On some datasets such as LPK LPR LSK of that on weekdays, 40% of that on two weeks each before LQTY, our compression ratio without co-coding is as high as New Year & Mothers’ Day (20 days / yr). 192/7.17j27; with co-coding it is 192/4.74j41. From Table 6, c_nationkey, s_nationkey: We chose nation distributions notice that the absolute per tuple sizes after compression are from WTO statistics on international trade typically 10-20bits. (www.wto.org/english/res_e/statis_e/wt_overview_e.htm). We next analyze from where our substantial gains arise. Soft Functional Dependency: We made l_extendedprice be functionally dependent on l_partkey Exploitation of Skew: The first source of compression is skew. Observe the bars labelled Huffman and Domain Coding 866

10. columns. Notice that the savings is now higher – the delta 12 coding is able to exploit correlations also, as we discussed in 10 DELTA Section 2.2.2. Delta w cocode 8 The plot on the opposite column illustrates the compression ratios obtained with the two forms of delta coding. The ratio is 6 as high as 10 times for small schemas like P1. The highest 4 overall compression ratios result when the length of a tuplecode and bits per tuple saved by delta coding are similar. 2 0 Exploiting Correlation via co-coding vs. delta-coding: P1 P2 P3 P4 P5 P6 The extent to which it can exploit correlations is indicated by the column labelled (2)-(5) in Table 6 – notice that we are in the chart below. They show the compression ratio achieved often close to the co-coding performance. just by Huffman coding column values, in contrast to the savings obtained by domain coding. All columns except This strategy is powerful, because by not co-coding range nationkeys and dates are uniform, so Huffman and domain queries on the dependent become much easier as we discussed coding are identical for P1 and P2. But for the skewed domains in Section 2.2.2. But a drawback is that the correlated columns the savings is significant (e.g., 44.7 bits vs. 59 bits for P6). have to be placed early in the sort order – for example, Table 6 lists the full results, including compression results on LODATE, LRDATE, LSDATE in dataset P5. We have SAP and TPC-E datasets. These have significant skew (several experimented with a pathological sort order – where the char columns with few distinct values), so Huffman coding correlated columns are placed at the end. When we sort P5 by does very well. On SAP, Huffman coding compresses to 79 (LOK, LQTY, LODATE, …), the average compressed tuple bits/tuple vs. 165 bits/tuple for domain coding. size increases by 16.9 bits. The total savings from correlation is only 18.32 bits, so we lose most of it. This suggests that we 6 need to do further investigation on efficient range queries over Huffman co-coded or dependent coded columns. 5 Domain Coding Huffman+CoCode 4 4.2 Querying Compressed Relations 3 We now investigate efficient querying of compressed relations. 2 We focus on scans, with selection, projections, and 1 aggregations. Our queries test three aspects of operating over 0 compressed data: 1. How efficiently can we undo the delta-coding to retrieve P1 P2 P3 P4 P5 P6 tuplecodes? 2. How efficiently can we tokenize tuples that contain Correlation: The best way to exploit correlation is by co- Huffman-coded columns? coding. Table 6 lists the extra benefit obtained by correlation, 3. How well can we apply equality and range predicates on over that obtained by just Huffman coding individual columns. Huffman coded and domain coded columns? This is also illustrated in the bar labelled “Huffman+CoCode” The first is the basic penalty for dealing with compressed data, in the above plot. For example, we go down from 72.97 that every query has to pay. The second measures the bits/tuple to 54.65 bits per tuple for P5. In terms of effectiveness of our segregated coding. The third measures the compression ratios, as a function of the original table size, we ability to apply range predicates using literal frontiers. compress 2.6x to 5.3x by exploiting correlation and skew. We run scans against 3 TPC-H schemas: (S1: LPR LPK Delta-Coding: The last source of redundancy is from lack-of- LSK LQTY) has only domain coded columns, (S2: LPR LPK ordering. In Table 6, column(3) – column (5) gives the savings LSK LQTY OSTATUS OCLK) has 1 Huffman coded column from delta coding when we co-code. Observe that it is almost (OSTATUS), and (S3: LPR LPK LSK LQTY OSTATUS always about 30 bits/tuple for all the TPC-H datasets. This OPRIO OCLK) has 2 Huffman coded columns observation is consistent with Theorem 3 and Lemma 2; lg m = (OSTATUS,OPRIO). OSTATUS has a Huffman dictionary lg (6.5B) j 32.5, and we save a little fewer bits per tuple by with 2 distinct codeword lengths, and OPRIO has a dictionary delta-coding. with 3 distinct codeword lengths. The table below plots the S1 S2 S3 scan bandwidth (in nanoseconds/tuple) for various kinds of scans against these schemas. All experiments were run on a Q1: select sum(lpr) from S1/2/3 8.4 10.1 15.4 1.2GHz Power 4, on data that fit entirely in memory. Q2: Q1 where lsk>? 8.1-10.2 8.7-11.5 17.7-19.6 Q1 is a straightforward scan plus aggregation on a domain Q3: Q1 where oprio>? 10.2-18.3 17.8-20.2 coded column. On S1, which has no Huffman columns, Q1 just tests the speed of undoing the delta code – we get 8.4ns/tuple. Q4: Q1 where oprio=? 11.7-15.6 20.6-22.7 On S2 and S3, it tests the ability to tokenize Huffman Again in Table 6, column (1) – column (2) gives the coded columns using the micro-dictionary (in order to skip over savings from delta coding when we don’t co-code correlated them). Tokenizing the first Huffman coded column is relatively 867

11.cheap because its offset is fixed; subsequent ones incur an [11] G. Copeland and S. Khoshafian. A decomposition storage overhead of about 5ns/tuple to navigate the micro-dictionary. If model. In SIGMOD 1985. a tuple has several Huffman coded columns, this suggests that [12] Sybase IQ. www.sybase.com/products/informationmanagement it will pay to encode the tuple length in front. [13] D. Knuth. The Art of Computer Programming. Addison Q2, Q3, and Q4 are queries with predicates plus Wesley, 1998. aggregation. A range of numbers is reported for each schema [14] S. Barnard and J. M. Child. Higher Algebra, Macmillan India because the run time depends on the predicate selectivity, due Ltd., 1994. to short-circuiting. For small selectivities, the predicate adds at [15] T. C. Hu, A. C. Tucker, Optimal computer search trees and variable-length alphabetic cods, SIAM J. Appl. Math, 1971. most a couple of ns/tuple beyond the time to tokenize. This is [16] Huffman, D. A method for the construction of minimum in line with our expectation, because once tokenized, the range redundancy codes. Proc. I.R.E. 40. 9. 1952 predicate is evaluated directly on the codeword. [17] Jim Gray. Commentary on 2005 Datamation benchmark. These numbers indicate we can process 50-100M tuples/s http://research.microsoft.com/barc/SortBenchmark using a single processor (equivalent to 1-2GB/s), and are quite [18] A. Zandi et al. Sort Order Preserving Data Compresion for promising. Obviously, a full query processor needs much more Extended Alphabets. Data Compression Conference 1993. than scan. Nevertheless, as we discussed in Section 3.2, hash [19] K. Bharat et al. The Connectivity server: Fast access to joins and merge joins can be done on the compressed tuples Linkage information on the web. In WWW 1998. using these basic equality and range predicate operations. [20] G. Graefe and L. Shapiro. Data Compression and Database Performance. In Symp on Applied Computing, 1991. 5. CONCLUSIONS AND FUTURE WORK [21] M. Poess et al. Data compression in Oracle. In VLDB 2003. Compression is a promising approach to deal with the data movement bottleneck. We have developed a notion of entropy 7. ENTROPY OF MULTI-SET DELTAS for relations and described a method to compress relations to Lemma 1: For a multi-set R of m values picked uniformly and i.i.d within 4.3 bits/tuple of the entropy. This results in up to a from [1..m], for m >100, the value distribution of code(R) has compression factor of up to 40 on TPC-H data, much better entropy ≤ 2.67 m bits. than reported in prior literature. We have also developed Proof: Recall the pseudo-code for computing the coded delta: techniques for efficient scans with selection and projection over 1) Sort the entries of R, forming sequence v = v1, …, vm relations compressed with this method. 2) Form a delta sequence v2 − v1, v3 − v2, … , vm − vm-1. Much work remains to exploit the promise of this idea. 3) Entropy code the values in delta to form a new sequence Although we have shown the basic feasibility, we need to code(R) = code(v2 − v1), ... , code(vm − vm-1) further investigate the engineering issues involved in doing Now consider this situation: we have thrown m balls numbered 1..m uniformly and i.i.d onto m numbered bins arranged in a circle. Define query operators both as a part of a commercial DBMS and as a a random variable Dj for ball j that landed in bin i utility, and figuring out how to utilize the 128 bit registers and Dj h 0 if j is not the highest numbered ball in bin i hardware threading available on modern processors. A second h k if j is the highest numbered ball bin i and direction is lossy compression, which we believe is vital for and k is the clockwise distance from bin i to the next occupied bin. In efficient aggregates over compressed data. Finally, we need to effect, Dj is the random variable for the bin gap between adjacent support incremental updates. We believe that many of the balls. The distribution of each code in code(R) is that of Dj. By standard warehousing ideas like keeping change logs and symmetry, this distribution is independent of j, so we shall call it periodic merging will work here as well. D. Our goal is to compute H(D). Now, let pd h Pr (D = d), 0 ≤ d < m and define 6. REFERENCES λd h Pr (D = d | D g 0), 1 ≤ d < m [1] Bala Iyer,, David Wilhite. Data Compression Support in Data So, pd = (1 − p0) λd, and Bases. In VLDB 1994. [2] G. Antoshenkov, D. Lomet, J. Murray. Order Preserving H(D)= Σ0≤d<m pd lg(1/pd) = −p0 lg p0 −Σ1≤d≤m–1 (1−p0)λdlg(1−p0)λd. String Compression. In ICDE 1996. Since Σ0≤d<m pd = 1 = p0 + Σ1≤d<m (1−p0)λd, we have, [3] T. Cover, J. Thomas. Elements of Information Theory. John H(D) = −p0 lg p0 − (1−p0)lg(1−p0) + (1−p0)Σ1≤d<m λd lg(1/λd) (1) Wiley. 1991 Consider αd h Pr(D > d | Dg0), 1 ≤ d < m. This is the probability [4] Eric W. Weisstein et al. "Catalan Number." In MathWorld. that, given that a ball is the highest numbered ball in a bin (D g 0), http://mathworld.wolfram.com/CatalanNumber.html. that there is a clockwise gap greater than d to the next occupied [5] A. Ailamaki, D.J. DeWitt, M. D. Hill, M. Skounakis Weaving bin, i.e., that all m – 1 other balls fall into the other m – d bins. So, relations for cache performance. In VLDB 2001. αd = (1 – d / m)m−1, for 1 ≤ d < m, [6] M. Stonebraker et al. C-Store: A Column Oriented DBMS. In λd = Pr (D = d | Dg0) = αd−1 − αd, VLDB 2005. αd = (1 – d/m)m−1 = [m/(m − d)][(1 − d / m)m] < [m/(m − d)] e−d, [7] S. Babu et al. SPARTAN: A model-based semantic by standard exponential inequalities [14], and compression system for massive tables. SIGMOD 2001. λd = αd−1 – αd < αd−1 < [m/(m – d + 1)] e1− d (2) [8] J. Goldstein, R. Ramakrishnan, U. Shaft. Compressing But, λd = αd−1 – αd = (1 − (d − 1)/m)m−1 − (1−d/m)m−1 Relations and Indexes. In ICDE 1998. λd (m/(m−d))m−1 = [(m − d + 1)/(m − d)]m−1 − 1 > 1 for d ≥ 1 [9] G.V. Cormack and R.N. Horspool. Data Compression using So 1/λd < [m/(m − d)]m−1 (3) Dynamic Markov Modelling, Computer Journal 1987. Now, each occurrence of a zero delta corresponds to an empty bin, [10] G. V. Cormack, Data Compression In a Database System, so expected number of empty bins is mp0. But empty bins form a Comm. of the ACM 28(12), 1985 binomial distribution with probability (1−1/m)m. So p0 = (1−1/m)m 868

12.< 1/e. Now, by simple calculus, −x lg x has its maximum value Theorem 3: Algorithm 3 compresses a relation R of tuples when x=1/e, and substituting into equation (1) we have chosen i.i.d from a finite distribution D to at most H(R) + 4.3|R| −p0lg p0 − (1−p0)lg(1−p0) < (lg e)/e + (1−1/e) lg(1/(1−1/e)) (4) bits, assuming |R| > 100. Combining (1), (2), (3), (4) we get, Proof: Consider m items chosen i.i.d. from D. Let R be a random H(D) < (lg e)/e + (1−1/e) lg(e/(e−1)) + (1−p0) λ1 lg 1/λ1 + variable for these m items viewed as a sequence and R be a random (1−p0)Σ2≤d≤m–1[m/(m−d+1)] e1−d (m−1)lg(1+d/(m−d)) variable for these m items viewed as a multi-set. Without loss of As ln(1+x) < x, lg(1+x) = ln(1+x) / ln 2 < x / ln 2 we have generality let the distribution D be given by a set of n values v1, … H(D) < (lg e)/e + (1−1/e) lg(e/(e−1)) + (1−p0)λ1 lg 1/λ1 + ,vn and associated probabilities p1, …, pn, so that a sample drawn ((1−p0)/ln 2) Σ2≤d≤m–1 [dm(m−1)/((m−d)(m−d +1))] e1 − d (5) from D has value vi with probability pi. Let Ds be the subset of Let Ym ≡ ∑ md=−12 [dm( m − 1) (m − d )(m − d + 1)]e1−d Ym − Ym+1 = ∑ md=−21[2dm(d − 1) (m − d )(m − d +1)(m − d +2)]e1−d those values that code to less than ≈lg m∆ bits after step 1d. Let − m 2 ( m + 1)e − m+1 / 2 MUL be a random variable for the vector of multiplicities of = ∑ md=−22 [2dm(d − 1) (m − d )(m − d + 1)(m − d + 2)]e1−d elements in Ds in the random multi-set R, Dom(MUL) = N |Ds|). + me1−m [m 2 (e / 3 − 0.5) + m(e + 0.5) + 2e / 3] Use µ to represent a particular multiplicity in Dom(MUL), µi to > 0 for m>7 (the last quadratic is >0 for m>7) denote the i’th element of µ, the multiplicity of the associated vi, Hence Ym >Ym+1 for m > 7. and pµ to denote the probability that MUL=µ µ. But Y100 < 1.57, by calculation. So, for m>100, Ym < 1.57. (6) H(R) = H(MUL) + H(R | MUL) Also, λ1 = 1 − (1 − 1/ m)m−1 is increasing in m. Since x lg(1/ x) = H(MUL) + ΣµcDom(MUL) pµ H(R | MUL=µ µ) is maximum at 1/e, λ1 lg(1/ λ )1 < 0.63lg 0.63 < 0.42 for m>100. (7) We now employ an argument similar to Lemma 3. For any value Plugging (6), (7) into (5) and doing some algebra, of the multi-set R having multiplicity µ = (µ1, µ2, ... µ|Ds|), there are H ( D ) < lg e − (1 − 1/ e)lg(e − 1) + 0.42(1 − p0 ) + 1.57(1 − p0 ) / ln 2. at most m! / (µ1!µ2! ... µ|Ds|!) instances of R. So, By calculus, p0 = (1-1/ m)m > 0.99100 >0.36 for m>100. H(R | MUL = µ) ≥ H(R | MUL = µ) – lg (m!/(µ1!... µ|Ds|!)) Substituting, we get H(D) < 2.67 bits. H(R) ≥ H(MUL) + H(R | MUL = µ) – Σµ pµ lg (m!/(µ1!... µ|Ds|!)) = H(R) – ΣµcDom(MUL) pµ lg (m!/(µ1!... µ|Ds|!)). 8. OPTIMALITY OF ALGORITHM 3 Lemma 3: Let X be a random variable over a distribution D, and Let k = µ1 +...+ µ|Ds|. Then, k! µ1µ1... µ|Ds|µ_|Ds|/(µ1!... µ|Ds|!) is just let code(X) be a random variable representing the codeword for X, one term in the multinomial expansion of kk. So k!/(µ1!... µ|Ds|!) according to some compression scheme. Let codeα(X) be the < kk / µ1µ_1... µ|Ds|µ_|Ds|. Clearly k < m, so random variable for the α-bit prefix of code(X). Then, if the lg (m!/(µ1!... µ|Ds|!) < m lg m −Σ1≤i≤Ds µi lg µi. compression scheme is optimal and given that code(X) has ≥ α So, H(R) ≥ H(R) – ΣµcDom(MUL) pµ (m lg m −Σi µilg µi) bits, codeα(X) is uniformly distributed over all bit-vectors of length But, the last term can be simplified and replaced with an α. expectation giving us Proof: If the distribution of bits were not uniform, we could further H(R) ≥ H(R) – m lg m + E(Σi µilg µi) compress each α-bit prefix, contradicting the assumption that the = H(R) – m lg m + Σi E(µilg µi). compression scheme is optimal. ± Since x lg x is convex, we can apply Jensen’s inequality: E(f(X) We now show the algorithm of Section 2.1.4 is near optimal. ≥ f(E(X)) for convex functions f, yielding H(R) ≥ H(R) – m lg m + Σi E(µi) lg Ε(µi) (1) Lemma 4: If tuples of R are generated i.i.d. over a distribution D, By Lemma 4, Algorithm A encodes R in space the algorithm of Figure 3 encodes R in at most m (1 + H ( D )) − ( m − 1)(lg m − 2.67) + ∑ i∈Ds mpi ( lg m − bi ) (2) m(1 + H ( D)) − (m − 1)(lg m − 2.67) + ∑ i∈D mpi ( lg m − bi ) bits on Since tuples are chosen i.i.d, H ( R ) = mH ( D ). Recall that s average, where m = |R| > 100, Ds is the subset of tuples that get pi is the probability of the i'th element of Ds and bi is the coded in less than ≈lg m∆ bits at the end of step 1d, pi is the length of the code for the i'th element. So E ( µi ) = mpi . Thus, occurrence probability of tuple i c Ds, and bi is the length to which the space overhead of Algorithm A over entropy = (2) - (1) tuple i is coded at end of step 1d. ≤ 3.67 m + lg m + m ∑ i∈Ds mpi ( lg m − bi − lg mpi ) Proof: By standard results on Huffman coding [3], the ≤ 3.67 m + lg m + m ∑ i∈Ds mpi (lg(1/ pi ) − bi ). expected size of the tuples at the end of Step 1d is no more than Let p = ∑ i∈Ds pi and qi = pi / p. Then the overhead ≤ m(1+H(D)) bits. The padding of Step 1e adds (≈lg m∆ – bi) random 3.67 m + lg m + mp ∑ i∈Ds qi (lg(2 − bi / qi ) − lg p ). Since bits for each tuple that codes to bi < ≈lg m∆ bits. So the overall padding has expected length of ∑ i∈D mpi ( lg m − bi ) bits. By s ∑ i∈Ds qi = 1, and lg( x ) is concave, by Jensen's inequality, Lemma 3, after Step 1e, the ≈lg m∆-bit prefix of the coded tuple is ∑ i∈Ds qi lg(2 −bi / qi ) ≥ lg ∑ i∈Ds qi (2 −bi / qi ). So overhead ≤ uniformly distributed over all bit vectors of length ≈lg m∆. So, by 3.67 m + lg m + m lg ∑ i∈Ds 2− bi + mp lg (1 p ). Lemma 1, average length of the coded delta is at most 2.67 bits. So Now, for any unambiguous code, Kraft's inequality [4] says the delta coding of Steps 2 and 3 saves on average ≈lg m∆ − 2.67 that ∑ i∈D 2− bi ≤ 1. So overhead ≤ 3.67 m + lg m + mp lg (1 p ). bits for each pair of adjacent tuples: (m−1)(≈lg m∆ − 2.67) bits By simple calculus, x lg(1/ x ) ≤ (1/ e) lg e 0.53. So, overall. And by summing these quantities we get our result. ± space taken by Algorithm A - H ( R ) is at most 4.2 m + lg m bits < 4.3m bits (lg x/x < 0.07 for m>100) 869