# Bitmap Index Design and Evaluation

We study indexing techniques for main memory, including hash indexes, binary search trees, T-trees, B+-trees, interpolation search, and binary search on arrays. In a decision-support context, our primary concerns are the lookup time, and the space occupied by the index structure. Our goal is to provide faster lookup times than binary search by paying attention to reference locality and cache behavior, without using substantial extra space. We propose a new indexing technique called “Cache-Sensitive Search Trees” (CSS-trees).

3.arithmetic), this digit being encoded in bits by turning exactly one (2) Bitmap Encoding Scheme Consider the ith component of out of C bits on. The arithmetic we choose for the value repre- an index with attribute cardinality bi. There are essentially two sentation, i.e., the decomposition of the value in digits according to major schemes to directly encode the corresponding values v; (0 5 some base, and the encoding scheme of each digit in bits are the 5 bi - 1) in bits’: two dimensions of the space and are analyzed below. (1) Attribute Value Decomposition Consider an attribute value l Equality Encoding: There are bi bits, one for each possible vandasequenceof(n-l)numbers< b,-l,b,-2, . . ..bl >. Fur- value. The representation of value ‘ui has all bits set to 0, ex- cept for the bit corresponding to v;, which is set to 1. Clearly, thermore, define b, = an equality-encoded component consists of bi bitmaps. l Range Encoding: There are b; bits again, one for each possi- =Vl ?J = fib1 fvl ble value. The representation of value U; has the vi rightmost bits set to 0 and the remaining bits (starting from the one cor- responding to o; and to the left) set to 1 Intuitively, each bitmap ByI has 1 in all the records whose ith component = &(bzbl) + vzbl + v~ value is less than or equal to ‘ui. Since the bitmap Bf’-’ has = K(bxbzbl) + vs(bzbl) + vzbl + vl all bits set to 1, it does not need to be stored, so a range- = encoded component consists of (bi - 1) bitmaps. /n-l \ /z-l \ Figures 4(b) and (c) show the range-encoded indexes correspond- = w, (Ebj) +...+v, (gb3) +...+vzbl+vl, ing to the equality-encoded indexes in Figure 1 and Figure 3, re- spectively. where v; = V; mod bi, V; = 2 , 1 < i 5 n, and each digit L 1 ~a(@ 8’ B6 B5 B4 B3 8’ B’ B” vz is in the range 0 5 wi < b,. I I 1 I I 1 0 0 0 Based on the above, each choice of n and sequence < b,, b,-1 , 2 I I I I I I 0 0 3 1 I I I I I 1 0 bl > gives a different representation of attribute values, and 4 I I I I I I 0 0 the;efore a different index. An index is well-dejned if bi 2 2, 5 0 0 0 0 0 0 0 0 1 <i <TX. Thesequence< b,,b,-l,...,bl >isthebaseof 6 I I 1 I 1 I 0 0 the-index, which is in turn called a base-< b,, b,,-1, . . . , bl > 7 I 1 I I 1 I 0 0 index, If b, = b,-1 = = bl z b, then the base is called 8 1 I I I I I 1 I 9 I 0 0 0 0 0 0 0 uniform and the index is called base-b for short. The index consists IO I I I 0 0 0 0 0 of n components, i.e., one component per digit. Each component II I 1 0 0 0 0 0 0 individually is now a collection of bitmaps, constituting essentially 12 I I I I 0 0 0 0 a base-bj index. (b) Let ni denote the number of bitmaps in the ith component of an index and {B” - ‘, BrZ -‘, . , BP} denote the collection of ni bitmaps that form the ith component. Figure 3 shows a base- < 3,3 > Value-List index (based on the same 12-record relation in Figure 1). By decomposing a single-component index into a 2- component index. the number of bitmaps has been reduced from 9 to 6. na(R) B; B; B!j Bf B; By 1x3+0 I 3 - 0 I 0 0 0 I 0X3+2 2 2 - 0 0 I I 0 0 0x3+1 3 I - 0 0 I 0 I 0 Figure 4: Examples of Range-Encoded Bitmap Indexes. (a) Projec- 0x3+2 tion of indexed attribute values with duplicates preserved. (b) Sin- 4 2 - 0 0 I I 0 0 2x3+2 gle Component, Base-9 Range-Encoded Bitmap Index. (c) Base- 5 8 I 0 0 I 0 0 < 3,3 > Range-Encoded Bitmap Index. 0x3+2 6 2 - 0 0 I I 0 0 0x3+2 7 2 - 0 0 I I 0 0 Figure 5 shows the two-dimensional design space of indexes 0x3+0 for selection queries, with the two existing alternatives, namely, 8 0 - 0 0 I 0 0 I 2x3+1 the Value-List index and the Bit-Sliced index [lo], classified as 9 I I 0 0 0 I 0 shown, The Value-List index, which is the simplest index and is 1.X3+2 IO 5 - 0 I 0 I 0 0 the most commonly implemented, has a single component and is 2x3+0 equality-encoded. The Bit-Sliced index [lo] has a uniform base II 6 I 0 0 0 0 1 1x3+1 and is range-encoded. A base-10 Bit-Sliced index has been used 12 0 I 0 0 I 0 in MODEL 204 for evaluating range predicates . The Bit-Sliced index is also used in Sybase IQ for evaluating range predicates and 1 We have excluded the third “basic” encoding scheme (the binary encoding Figure 3: Example of a 2-Component Value-List Index (a) Pro- scheme) discussed in [13, 141, as it can be characterized in our framework asa base-2 jection of indexed attribute values with duplicates preserved. (b) decomposition with one of the two encodings we do present. ‘Clearly, a symmetric scheme exists as well, where the roles of right and left are Base-< 3,3 > Value-List Index. exchanged. 357

4. BITMAP ENCODING SCHEME 4 bitmap operations and 5 bitmap scans, compared to Algorithm RangeEval, which incurs 10 bitmap operations and 6 bitmap scans. Moreover, while efficient processing of Algorithm BSRange- Opt requires buffering for only one bitmap (the intermediate/final result), efficient processing of Algorithm RangeEval requires buffer- ing for two bitmaps (for BEQ and BGT/BLT). VALUE DECOMPOSITION c + ’ h, ’ Figure 5: Design Space of Bitmap Indexes for Selection Queries. performing aggregation [lo]. In this paper, we consider all well- defined decompositions, and just the two encoding schemes pre- sented above, which are used in practice. 3 An Improved Evaluation Algorithm for Range-Encoded Bitmap Indexes An evaluation algorithm for selection queries based on range-encoded indexes has been proposed by O’Neil and Quass (Algorithm 4.3 in [IO]), which we referred to as Algorithm RangeEval. In this section, we present an improved evaluation algorithm (Algorithm RangeEval-Opt) that reduces the number of bitmap operations by about 50% and requires one less bitmap scan for a range pred- Figure 7: Comparison of Evaluation Strategies for Selection Predi- icate evaluation, Given this, in the rest of the paper, the improved cate “A 5 864” using a 3-Component, Base-10 Range-Encoded evaluation algorithm is used as the basis for the time analysis of Bitmap Index. (a) Algorithm RangeEval. (b) Algorithm range-encoded indexes. Both evaluation algorithms are shown in RangeEval-Opt. Figure 6. Let Bc and Di denote bitmaps with all bits set to 0 and 1, re- spectively. Also let A, V, and @ denote the logical AND, OR, and 3.1 Analytical Comparison of Evaluation Algorithms for XOR operations, respectively, and B denote the complement of a Range-Encoded Bitmap Index bitmap B. Depending on the input predicate operator (op), Algo- rithm RangeEval evaluates each range predicate by computing two bitmaps: the BEQ bitmap and either the &T or BLT bitmap. Number of Bitmap Operations For example, if the predicate is “<‘I, then the result bitmap BLE is obtained by computing the bitmaps BEQ and BLT; steps that involved BGT, BGE, or BNE are not required. The final result bitmap that is returned is either BLT, BLE. BGT, BGE, BEQ, or BNE, corresponding to the predicate operator “<‘I, “I”, ‘I>“, “2”. “=“, or “#“, respectively. As we show below, such an evaluation strategy not only costs more bitmap operations to incrementally build the two intermediate bitmaps, but also incurs more bitmap RanaeEval-Oot scans since the evaluation of the bitmap BEQ, which corresponds to an equality predicate evaluation, is the most expensive predicate in terms of bitmap scans. Algorithm RangeEval-Opt avoids the intermediate equal- ity predicate evaluation by evaluating each range query in terms Aft 11 n 1 0 ( 1 1 n I2n+l 11 2; of only the “5” predicate based on the following three identities: (1) A < v z A 2 v - 1, (2) A > v = A 5 v, (3) A 2 v E A 5 v - 1. Furthermore, the evaluation of the “I” predi- Table 1: Worst Case Analysis of Evaluation Algorithms; n is the cate itself does not require an equality predicate evaluation. Con- number of index components. sequently, Algorithm RangeEval-Opt requires the computation of only one bitmap B for any predicate evaluation, as opposed to Table 1 shows the worst case analysis of the performance of the two bitmaps (BEQ and either BLT or BGT) required in Algorithm evaluation algorithms in terms of the number of bitmap operations RangeEval. and scans. For Algorithm RangeEval, since each range pred- Figure 7 compares the two algorithms for evaluating the pred- icate evaluation entails an equality predicate evaluation (for BEQ icate “A 2 864” using a 3-component base-10 index. Clearly, bitmap), the number of bitmap operations of a range predicate eval- Algorithm RangeEval-Opt is more efficient as it requires only uation is at least twice that of the “=” predicate evaluation. By us- ing a more direct evaluation strategy, Algorithm RangeEval-Opt 358

5. Evaluation Algorithms for Selection Queries Using Range-Encoded Bitmap Indexes. Input: n is the number of components in the range-encoded index. <bn,bn-I,..., bl > is the base of the index. op is the predicate operator, op E {<, >, <,>, =, #}. II is the predicate value. B,, is a bitmap representing the set of records with non-null values for the indexed attribute. Output: A bitmap representation of the set of records that satisfies the predicate “A op v”. Algorithm RangeEval Algorithm RangeEval -Opt I) BGT = BLT = 80; 1) B=B,, 2) BEQ = B,,; 2) if(opE{<,>})thenv=v-1; 3) letv=v,v,-l...vt; 3) let v = 1)~2)~-1 .211; 4) for i = R.downto 1 do 4) if(opE {<,>,<,>})then 5) if (v, > 0) then 5) if (VI < bl - 1) then B = By’ ; BL~ = BLT v (BEG A B;‘-‘); 6) for i = 2 to 71do I;; if (w, < b, - 1) then 7) if (v; # bi - 1) then B = B A By’; 8) BGT = BGT V (BEQ A B1y’); 8) if(ui#O)thenB=BVBY*-‘; BEQ = BEQ A (By’ \$ BF’-l); 9) else 9) else 10) for i = 1 to n do 10) 11) if(u;=O)thenB=BABF; 11) BEQ = BEQ A Bf’-‘; 12) else 12) elseif(v;=b;-1)thenB=BAB~‘c2; BGT = BGT V (BEQ A @); 13) elseB=BA(BY’@By’-I); 13) BEQ = BEQ A BP; 14)if(opE {>,>,#})then 14) 15) return % A B,,; 15) BNE = BEQ A B,,; 16) BLE = BLT V BEQ; BGE = BGT V BEQ; 16) else 17) return B,,; 17) return B A B,,,; Figure 6: Algorithms RangeEval and RangeEval-Opt. reduces the number of bitmap operations for range predicate eval- 4 Cost Model for Space-Time Tradeoff Analysis uations by about half; the number of bitmap scans for range pred- icates is reduced by one. Both algorithms have the same cost for Our cost model for the space-time tradeoff analysis uses the fol- an equality predicate evaluation. For both evaluation algorithms, lowing two metrics: the worst case occurs when 0 < w2 < b; - 1, 1 < i 5 n, which corresponds also to the most probable case. l The space metric is in terms of the number of bitmaps stored. l The time metric is in terms of the expected number of bitmap 3.2 Experimental Comparison of Evaluation Algorithms scans for a selection query evaluation, and is based on the as- for Range-Encoded Bitmap Index sumption that the queries in the query space & are uniformly distributed, where Q = {A op IJ : op E (5, 2, <, >, =, # In this section, we compare the performance of the evaluation al- },O 5 w < C}. gorithms in terms of the average number of bitmap scans and the average number of bitmap operations. Indexes with uniform base Note that our time metric incorporates only the I/O cost (i.e., are generated by varying the two main parameters: the attribute number of bitmap scans) and does not include the CPU cost (i.e., cardinality C and the base number b. We experimented with C = number of bitmap operations) as the number of bitmap scans and 50, 100, and 1000; for each value of C, the entire space of base- the number of bitmap operations for a query evaluadon are cor- b range-encoded indexes are generated by varying b from 2 to C. related. We denote the space and time metric of an index I by For each index, we generated a total of 6C selection queries of the Space(l) and Time(l), respectively. form “A op c” by varying the predicate op E { <, 5, >, 2, =, #} and the predicate constant 0 <_ c < C. Each query is evaluated 5 Comparison of Bitmap Encoding Schemes using Algorithm RangeEval and Algorithm RangeEval-Opt. For a given range-encoded index and an evaluation algorithm, the This section compares the performance of the two basic encoding average number of bitmap scans (operations) is computed by tak- schemes, namely, equality encoding and range encoding, for selec- ing the average of the number of bitmap scans (operations) over all tion queries. Our result shows that range encoding provides better 6C queries. space-time tradeoff than equality encoding in most cases. Let I Figure 8 compares the performance of the two evaluation algo- denote an n-component index with base < b,, b,- 1, . . . , bl >. rithms as a function of the base number with C = 100. The graphs Based on the cost model, we have the following result. clearly demonstrate that Algorithm RangeEval-Opt is more ef- ficient than Algorithm RangeEval. Similar trends are obtained for other values of C. Theorem 5.1 If I is equality-encoded, then bi if b; > 2, Space(l) = 25-i, where s; = (1) 1 otherwise. i=l 359

6. Given the overall superiority of range-encoding, in the remain- der of this paper, we focus on the space-time tradeoff of range- encoded indexes. Henceforth, we use the term index as an abbrevi- ation for a range-encoded index. 6 Space-Optimal and Time-Optimal Bitmap Indexes In this section, we identify both the space-optimal and time-optimal indexes (i.e., points (A) and (D) in Figure 2). We refer to the most space-efficient (respectively, time-efficient) index among all indexes with n components as the n-component space-optimal in- dex (respectively, n-component time-optimal index). Note that the n-component space-optimal index might not be unique; for exam- (a) Average Number of Bitmap Scans as a Function of ple, if C = 1000, then the base-< 32,32 > and base-< 31,33 > Base Number indexes are both 2-component space-optimal indexes. Based on Equations (3) and (4). we have the following result. Theorem 6.1 (1) The number of bitmaps in an n-component space-optimal index is given by n(b - 2) + T, where b = 1q and T is the smallest positive integer such that V(b- l)n-’ 2 C. The index with base < ,b - 1,. ,. , b - l,, lx--T b, . . , b > is an n-component space-optimal index. T (2) The space-efficiency of space-optimal indexes is a non-decreasing 4 Oo I IO I1I.I. 20 30 49 Bale E”,b.. 60 70 I1 80 w I ICC function of the number of components. (3) The base of the n-component time-optimal index is < w (b) Average Number of Bitmap Operations as a Function of Base Number n-1 Figure 8: RangeEval vs RangeEval-Opt for Uniform pi-1 >. Base Range-Encoded Bitmap Indexes, C = 100. (4) The time-efficiency of time-optimal indexes is a non-increasing function of the number of components. Time(l) = 2t, + l), where (2) By Theorem 6.1, it follows that the space-optimal index corre- sponds to the index with the maximum number of components (i.e, n = [logz(C)l with each base number equal to 2), and the time- optimal index corresponds to the index with the minimal number ([+j’+(b,-l)([+l-4)) ifbi>2, of components (i.e. n = 1). These are probably immediate corol- otherwise. laries of information theory and/or number theory results as well, but they are easy enough to prove that doing so seemed easier than If 1 is range-encoded, then tracking down the existing literature. n Space(l) = C(bi - 1) 7 Bitmap Index with Optimal Space-Time Tradeoff (Knee) z=l In this section, we characterize the knee of the space-time tradeoff Time(l) = 2(n - 2 \$ + Jj(k - 1)) (4) graph, referred to as the knee index (point (C) in Figure 2), which i=l ’ corresponds to the index with the best space-time tradeoff. Note that our characterization is an approximate one as finding a precise analytical characterization is generally a difficult problem. How- The time analysis for range-encoded indexes is based on Algo- ever, it turns out that our approximate characterization has very rithm RangeEval-Opt. The evaluation algorithm for equality- good accuracy. encoded indexes is not shown here due to space constraint; details We first give a definition of the knee index that will serve as can be found elsewhere . a basis for comparing with our approximate characterization. Let A range-encoded index requires one or two bitmap scans per S = {II, 12, , I,,} denote the set of optimal indexes3 corre- component for a selection predicate evaluation. On the other hand, sponding to the points in the space-time tradeoff graph such that an equality-encoded index requires only one bitmap scan per com- Space(1,) < Space(l,+l), 1 5 j < p. For 1 < j < p, let LGj ponent for an equality predicate evaluation, but for a range predi- and RGj denote the gradients of the line segments formed by I3 cate evaluation, the number of bitmap scans required per compo- 3The set of optimal indexes S is the maximal subset of all possible indexes S’ such nent is between two and half the number of bitmaps in that com- that for each I E S, there does not exist another index in S’ that is more efficient ponent. Figure 9 compares the space-time tradeoff of range- and than I in both time and space. equality-encoded indexes for C = 20, 100, and 1000. The re- sults show that range-encoding in general provides better space- time tradeoff than equality-encoding in most cases. 360

7. (c) c (c) c = = 1000 1000 Figure 9: Comparison of Space-Time Tradeoff of Range- and Equality-Encoded Bitmap Indexes. with Ij-r and I,+,, respectively. LG, and RG, are defined as time tradeoff graphs for three classes of indexes: the class of space- follows: optimal indexes, the class of time-optimal indexes, and the en- TiVLe(lj) - ?‘i?TK(lj+l) x F and tire class of indexes, for C = 1000; similar results are obtained RG, = for other values of C. The graph for space-optimal (respectively, SpUCe(lj+l) - Space(l,) time-optimal) indexes consist of at most [logz(C)l points, where Ti77E(lj-1) - Time(lj) x F each point corresponds to an n-component space-optimal (respec- LGj = SpUCe(lj) - Space(l,-1) tively, time-optimal) index, 1 5 n 5 [Zogz(C)l. Note that since the space-optimal index is generally not unique, each point in the where F = Space(l,,)/Time(lr) is a normalizing factor. The space-optimal graph shown corresponds to the most time-efficient index Ik E {I3 E S : LG, > 1, RGj < 1) with the maximum index among all equally space-efficient indexes with the same num- ratio LGk/RGk is the knee index. ber of components, Figure 10 shows that the tradeoff graph for space-optimal indexes provides a good approximation to that for all indexes. In particular, the set of points on the graph for space- optimal indexes is a subset of the set of points on the graph for all indexes. Figure 11 shows the same space-optimal tradeoff graph as in Figure 10 but with each point labelled with the number of compo- nents of the corresponding space-optimal index. We observe that the knee of the space-time tradeoff graph for the space-optimal in- dex corresponds to a 2-component index, something that was con- sistent throughout our experimentation. Hence, we characterize the knee index as the most time-efficient 2-component space-optimal index, which is obtained from the following result. Theorem 7.1 The base of the most time-efficient 2-component space- Figure 10: Space-Time Tradeoff of Bitmap Indexes, C = optimal index is given by < bz - 6, bl + 6 >, where bl = [q, 1000. bz = I\$], and 6 = max{O, b2mb1+J(~+b1)2-4C 1). L We have compared the knee index based on our approximate char- acterization with that based on the definition for various values of attribute cardinality; the results show that both knee indexes match exactly for all the cases that we compared. 8 Time-Optimal Bitmap Index Under Space Constraint \2 In this section, we consider the following practical optimization =~--~~---~~-~-~------~-..--...---- .............._.._.__..............~~ ~~ ....-_____-___----_- problem (point (B) in Figure 2): Given a constraint on the available * disk space to store an index, say at most M bitmaps (or equiva- / lently, at most MN bits, where N is the number of records), deter- mine the most time-efficient index. We first present an algorithm that finds the optimal solution, and then present a more efficient heuristic approach, which is near-optimal. Both algorithms are Figure II: Space-Time Tradeoff of Space-Optimal Bitmap shown in Figure 12. In the,following, let I, denote an n-component Indexes, C = 1000. index; and IApace and I:‘,’ denote the n-component space- and time-optimal indexes, respectively. We now motivate our approximate characterization, which is based on the results of Theorem 6.1, Figure 10 compares the space- 361

8. Algorithms to Find Time-Optimal Bitmap Index Under Space Constraint. Input: M is the space constraint in terms of the maximum number of bitmaps. C is the attribute cardinality. Output: The time-optimal bitmap index with no more than M bitmaps. Algorithm TimeOptAlg Algorithm TimeOptHeur 1) let n be the smallest number such that Sp~ce(l~~“~“) 5 M; 1) (r&,1)= FindSmallestN(M,C); 2) if (Space(l:lim”) 5 M) then return 12”“; 2) if (Space(lkvL”) 5 ill) then return 1:‘,‘; 3) let n’ be the smallest number such that n’ > n and Spuce(l\$“‘) 5 M; 3) return Ref ineIndex(1); 4) let Z = {II, : n < k < n’, Space(lh) 5 M} u {I,!J”“}; 5) return I’ E Z such that Time(1’) 5 Ti,me(l), V I E 1; Figure 12: Algorithms TimeOptAlgandTimeOptHeur 8.1 Optimal Approach Algorithm TimeOptAlg finds the actual optimal solution. By re- sult (2) of Theorem 6.1, the solution must have at least n compo- nents, where n is the smallest number of components such that Spuce(l~Pace ) 5 M. If the n-component time-optimal index satisfies the space constraint (step 2), then by result (4) of Theo- rem 6.1, it is the solution; otherwise, by result (4) of Theorem 6.1, the solution index must have no more than n’ components where n’ 2 n is the smallest number of components such that the n’- component time-optimal index satisfies the space constraint. Fig- ure 13 shows two cases to illustrate the bounds on the number of components of the solution index; each point on the space-time tradeoff graphs shown is labelled with the number of components of the corresponding index. The time complexity of the algorithm is determined by the size of the set of candidate indexes defined by Z (step 4). This search space is large as we need to enumerate for each value of k, n 5 k < n’, all possible /c-component indexes such that n,“=, bi 2 C and Cfzl(bi - 1) < M. F’lgure 14 shows the size of Z as a function of M for C = 1000. 8.2 Heuristic Approach To avoid the costly exhaustive search in the optimal approach (steps 4 and 5), we present in this section a heuristic approach (Algorithm TimeOptHeur in Figure 12) to find an approximate optimal solu- tion. Our experimental results show that the heuristic approach is near-optimal with the optimal index being selected at least 97% of the time. The main idea is to first select a “seed” index that satis- fies the space constraint, and then to improve the time-efficiency of the seed index by adjusting its base numbers. For the seed index, Figure 13: Bounds on Number of Components of Solution Index in our approach uses an n-component index I with Space(l) = M, Algorithm TimeOptAlg. where n is the least number of components such that Space(l~Pace) 5 M. Both n and I are determined by Algorithm FindSmallestN satisfies the space cqnstraint (step Let I:, be another n-component index with base < b’,, b/,-l, .., (Figure 15). Note that if I:,’ b’l > as defined above. Then Time(lL) < Time(1,). 2),then,as in step 2 of Algorithm TimeOptAlg,I\$me is the op- timal solution index. Algorithm Ref ineIndex (Figure 15) im- proves the time-efficiency of an index; its correctness is based on Based on Theorem 8.1, Algorithm Ref ineIndex essentially the following result. tries to maximize the number of components with small base num- bers; in particular, step 8 determines the largest possible 6 value for Theorem 8.1 Let I,, be an n-component index with base < 0,) b,- 1, adjusting a pair of base numbers. To illustrate Algorithm Refine- Index,consider a3-component index I withbase < 21,21,22 > . , bl >. Suppose there exists two base numbers b, and b,, and C = 1000. In the first iteration, with b, = b, = 21, S = 19 2 < b, 5 b,, and some integer constant 0 < 6 5 b, - 2, such that n b,--6 ifi=p, and the base is refined to < 2,22,40 >. In the second iteration, b’i 2 C, where b’; = b,+b ifi=q, with b, = 22 and b, = 40, 6 = 12 and the base is refined to II < 2, lo,52 >. After the final refinement step 3, the base of the re- i=l b, otherwise. fined index I’ is < 2,10,50 >. By Equation (3), Space(l) = 61 362

9. Algorithm FindSmal les tN: Algorithm to Find the Bitmap Index with Least Number of Components Under Space Constraint. 1 Input: M is the space constraint in terms of the maximum number of bitmaps. C is the attribute cardinality. Output: n, the smallest number of components such that Space(l~Pace) < M, and An n-component index I, with Space(I,,) = M. 1) n=o; // n is the number of components 2) repeat 3) n=n+l; 4) b= +j; 5) r= i M+n)modr~; //r is the number of components with base (b + 1) 6) until ((b + l)Tbn-’ > C); 7) let I, be the n-component bitmap index with base < b, . , b, b + 1,. . . , b + 1 >; -- n--r 7 8) return (n, I,); Algorithm Ref ineIndex: Algorithm to Improve the Time-Efficiency of a Bitmap Index. Input: I, is a ,n-component bitmap index. C is the attribute cardinality. Output: A n-component bitmap index IL such that Time 5 Time(I,) and Space(I:,) I: Space(I,). I) letseqOfBase=<b,,b,-I,...,bl >bethebaseofI,; 2) prodOfBase = fi bi; i=l 3) for i = n downto 2 do 4) let b, be the smallest base number from seqOf Base; 5) delete b, from seqOf Base; 6) if (bp > 2) then 7) let b, be the smallest base number from seq0 f Buse; bb b-b,+&,+bd2-p?,#& 8) let 6 = 2 L J1 9) if (0 < 6 5 b, - 2) then 10) prodOfBase = prodOjBasbep~~-6)(b1+6); 11) b, = b, - 6; 12) b, = b, + 6; 13) b’, = b,; 14) b; = 15) return the n-component bitmap index with base < b’,, b’,- 1, . . , b; >; Figure 15: Algorithms FindSmallestN and Ref ineIndex. and Space(I’) = 59. By Equation (4), Time(I) = 5.08 and Table 2 shows the effectiveness of the heuristic approach for Time(l’) = 4.10. various values of attribute cardinality. The second column shows The time complexities of Algorithms FindSmal les tN and the percentage of solutions selected by the heuristic approach that Ref ineIndex are O(log:,(C)) and O(nlogz(n)), respectively, are optimal for the attribute cardinality value indicated in the first where n is at most [logz(C)l. Therefore, the time complexity of column. For those cases where the optimal and heuristic approaches Algorithm TimeOptHeur is o(logz(C)logz(log:!(C))). differ, the third column shows the maximum difference in the ex- pected number of bitmap scans of their selected indexes. The result indicates that the proposed heuristic approach is near-optimal. Max. Difference in Expected Number 9 Effect of Bitmap Compression on Space-Time Tradeoff 50 500 This section examines the effect of bitmap compression on the 1000 space-time tradeoff of bitmap indexes. [_-I 2000 9.1 Bitmap Index Storage Schemes Table 2: Effectiveness of Heuristic Approach to Select Time- We first investigate different physical organizations for bitmap in- Optimal Bitmap Index under Space Constraint. dexes to identify storage schemes that have good space-time trade- 363

10. DataSet 1 1 DataSet 2 ‘~~ Table 3: Characteristics of TPC-D Benchmark Data used in Exper- iments to 6; this is motivated by our results in Section 7 which show that the space-time tradeoff of space-optimal indexes provides a good Figure 14: Size of Set of Candidate Bitmap Indexes, 2, as a approximation to the space-time tradeoff of all indexes. The data Function of the Space Constraint A4 for C = 1000. compression code used is from the zlib library5. Table 4 compares the compressibility of the three storage schemes for both data sets. For each n-component index I and each com- off. Consider an k-component bitmap index for a N-record re- pressed storage scheme S, where 1 5 n 5 6 and S E {CBS, cCS, lation, where the z.” index component comprises of 72; bitmaps, clS}, we compute the percentage of the size of I under S to the size of I under BS. For both data sets, the results show that CS- 1 < i 5 lc. Let n = 2 nl. The entire bitmap index is essen- E=l Base Size of I Compressibility of tially a (N x n) bit-matrix, where the ith component is a (N x n;) bit-matrix, and each bitmap is a (N x 1) bit-vector. Based on the above view of a bitmap index, we compare the following three stor- age schemes: Bitmap-Level Storage (BS) Each bitmap is stored individually in a single bitmap file of N bits. Thus, the bitmap index is stored in n N-bit bitmap files. Component-Level Storage (CS) Each index component is stored Base Size of I Compressibility of individually in a row-major order in a single bitmap file of of under BS Storage Scheme (%) (N x ni) bits, 1 < i < Ic. Thus, the bitmap index is stored Index I (in bytes) CBS cCS cIS in k bitmap files. < 2406 > 450,937,500 76.2 2.2 2.2 < 43,56 > 18,187,500 77.6 26.3 28.8 Index-Level Storage (IS) The entire index is stored in a row-major < 11,13,17 > 7,125,OOO 80.7 40.9 48.8 order in a single bitmap file of Nn bits. < 5,7,7,10 > 4,687,500 84.2 60.5 76.8 < 4,5,5,5,5 > 3,562,500 87.7 67.2 87.6 We refer to a bitmap index that is organized using storage scheme < 3,3,3,3,5,6 > 3,187,500 89.7 75.1 93.0 X (without any compression) as a X-index, and refer to a X-index (b) Data Set 2 with all its bitmaps compressed as a cX-index (i.e., c for com- pressed). For a l-component bitmap index, both a CS-index and an IS-index are equivalent. For a bitmap index with the maximum Table 4: Comparison of Compressibility of Different Storage number of components; i.e., each component is of base 2, (1) both Schemes. a BS-index and a CS-index are equivalent, and (2) an IS-index is a projection index4. indexes give the best compression. In a CS-index, each row of a component bit-matrix has either In terms of query evaluation cost, a BS-index requires at most a consecutive stream of 1 bits followed by a consecutive stream of 2 bitmap file scans per index component. On the other hand, both a 0 bits if the index is range-encoded, or exactly one bit set to 1 if CS-index and an IS-index require all the bitmap files to be scanned; the index is equality-encoded. In contrast, in a BS-index, the dis- moreover, they also incur additional CPU overhead to extract the tribution of the bits in each bitmap is dependent on the distribution bits of each relevant bitmap from the component bitmap files (which of the attribute values. Thus, we expect a CS-index to be more are stored in row-major order). Thus, we expect CBS-indexes to compressible than a BS-index. be more time-efficient than cCS- and cIS-indexes; this is indeed supported by our experimental results. For example, consider the 9.2 Experimental Setup base-50 CBS-index; the average size of each compressed bitmap is 36Y7571448xo.776 = 582,118 bytes. Even when two bitmaps (an Our experimental study uses two data sets extracted from the TPC- averag:if 1,164,236 bytes) are scanned to evaluate an equality D Benchmark : data set 1 is for small attribute cardinality, and query, using the CBS-index is still significantly cheaper than us- data set 2 is for large attribute cardinality. Table 3 shows the key ing the cCS-index (or cIS-index), which requires (36,757,448 x characteristics of our experimental data. To limit the number of ex- 0.272) = 9,998,026 bytes to be scanned. periments for each data set, our comparison of the space-time trade- Since CBS-indexes are the most time-efficient and cCS-indexes off is restricted to the class of space-optimal indexes (as a function are the most space-efficient, we shall compare the space-time trade- of the number of index components, n) with n being varied from 1 ‘The zlib library is written by Jean-loup Gdly 4A projection index [IO] on an attribute A in B relation R is simply the projection and Mark Adler (http //www.cdrom,com/pub/infozip/zlib); the compression method of A on R with duplicates preserved and stored in RlD-order. is based on a LZ77 variantcalleddeflation. 364