Fast Scans for Main Memory Data Processing

This paper focuses on running scans in a main memory data processing system at “bare metal” speed.Essentially, this means that the system must aim to process data at or near the speed of the processor (the fastest component in most system configurations).Scans are common in main memory data processing environments,and with the state-of-the-art techniques it still takes many cycles per input tuple to apply simple predicates on a single column of a table.

1.BitWeaving: Fast Scans for Main Memory Data Processing Yinan Li Jignesh M. Patel University of Wisconsin–Madison University of Wisconsin–Madison ABSTRACT Naive BitWeaving/V SIMD-scan BitWeaving/H This paper focuses on running scans in a main memory data pro- Cycles / value, log scale cessing system at “bare metal” speed. Essentially, this means that 10 the system must aim to process data at or near the speed of the processor (the fastest component in most system configurations). Scans are common in main memory data processing environments, 1 and with the state-of-the-art techniques it still takes many cycles per input tuple to apply simple predicates on a single column of a table. In this paper, we propose a technique called BitWeaving that 0.1 exploits the parallelism available at the bit level in modern pro- 4 8 12 16 20 24 28 32 cessors. BitWeaving operates on multiple bits of data in a single Number of bits used to represent each column value cycle, processing bits from different columns in each cycle. Thus, bits from a batch of tuples are processed in each cycle, allowing Figure 1: Performance Comparison BitWeaving to drop the cycles per column to below one in some A key operation in a main memory DBMS is the full table scan case. BitWeaving comes in two flavors: BitWeaving/V which looks primitive, since ad hoc business intelligence queries frequently use like a columnar organization but at the bit level, and BitWeaving/H scans over tabular data as base operations. An important goal for a which packs bits horizontally. In this paper we also develop the main memory data processing system is to run scans at the speed of arithmetic framework that is needed to evaluate predicates using the processing units, and exploit all the functionality that is avail- these BitWeaving organizations. Our experimental results show able inside modern processors. For example, a recent proposal that both these methods produce significant performance benefits for a fast scan [18] packs (dictionary) compressed column values over the existing state-of-the-art methods, and in some cases pro- into four 32-bit slots in a 128-bit SIMD word. Unfortunately, this duce over an order of magnitude in performance improvement. method has two main limitations. First, it does not fully utilize the width of a word. For example, if the compressed value of a particu- Categories and Subject Descriptors lar attribute is encoded by 9 bits, then we must pad each 9-bit value H.2.4 [Information Systems]: Database Management—systems to a 32-bit boundary (or what ever is the boundary for the SIMD instruction), wasting 32 − 9 = 23 bits every 32 bits. The second Keywords limitation is that it imposes extra processing to align tightly packed values to the four 32-bit slots in a 128-bit SIMD word. Bit-parallel, intra-cycle parallelism, storage organization, indexing, In this paper, we propose a set of techniques, which are collec- analytics tively called BitWeaving, to aggressively exploit “intra-cycle” par- allelism. The insight behind our intra-cycle parallelism paradigm is 1. INTRODUCTION recognizing that in a single processor clock cycle there is “abundant There is a resurgence of interest in main memory database man- parallelism” as the circuits in the processor core are simultaneously agement systems (DBMSs), due to the increasing demand for real- computing on multiple bits of information, even when working on time analytics platforms. Continual drop in DRAM prices and in- simple ALU operations. We believe that thinking of how to fully creasing memory densities have made it economical to build and exploit such intra-cycle parallelism is critical in making data pro- deploy “in memory” database solutions. Many systems have been cessing software run at the speed of the “bare metal”, which in this developed to meet this growing requirement [2, 5–7, 9, 14, 21]. study means the speed of the processor core. The BitWeaving methods that are proposed in this paper target intra-cycle parallelism for higher performance. BitWeaving does not rely on the hardware-implemented SIMD capability, and can Permission to make digital or hard copies of all or part of this work for be implemented with full-word instructions. (Though, it can also personal or classroom use is granted without fee provided that copies are leverage SIMD capabilities if that is available.) BitWeaving comes not made or distributed for profit or commercial advantage and that copies in two flavors: BitWeaving/V and BitWeaving/H, corresponding to bear this notice and the full citation on the first page. To copy otherwise, to two underlying storage formats. Both methods produce as output republish, to post on servers or to redistribute to lists, requires prior specific a result bit vector, with one bit per input tuple that indicates if the permission and/or a fee. SIGMOD’13, June 22–27, 2013, New York, New York, USA. input tuple matches the predicate on the column. Copyright 2013 ACM 978-1-4503-2037-5/13/06 ...$15.00. The first method, BitWeaving/V, uses a bit-level columnar data

2.organization, packed into processor words. It then organizes the sented using these codes, and these codes only use as many bits as words into a layout that results in largely sequential memory ad- are needed for the fixed-length encoding. dress lookups when performing a scan. Predicate evaluation in the In these compression methods, all value types, including nu- scan operation is converted to logical computation on these “words meric and string types, are encoded as an unsigned integer code. of bits” using the arithmetic framework proposed in this paper. In For example, an order-preserving dictionary can map strings to un- this organization, storage space is not wasted padding bits to fit signed integers [3, 10]. A scale scheme can convert floating point boundaries that are set by the hardware. More importantly, in many numbers to unsigned integers by multiplying by a certain factor [4]. cases, an early pruning technique allows the scan computation to be These compression methods maintain an order-preserving one-to- safely terminated, even before all the bits in the underlying data are one mapping between the column values to the codes. As a result, examined. Thus, predicates can often be computed by only looking column scans can usually be directly evaluated on the codes. at some of most significant bits in each column. This scheme also For predicates involving arithmetic or similarity predicates (e.g. naturally produces compact result bit vectors that can be used to the LIKE predicates on strings), scans cannot be performed directly evaluate the next stage of a complex predicate efficiently. on the encoded codes. These codes have to be decoded, and then The second method, BitWeaving/H, uses a bit organization that are evaluated in a conventional way. is a dual of BitWeaving/V. Unlike the BitWeaving/V format, all the bits of a column value are stored together in BitWeaving/H, provid- 2.1 Problem Statement ing high performance when fetching the entire column value. Un- A column-scalar scan takes as input a list of n k-bit codes and a like previous horizontal bit packing methods, BitWeaving/H stag- predicate with a basic comparison, e.g. =, =, <, >, ≤, ≥, BETWEEN, gers the codes across processor words in a way that produces com- on a single column. Constants in the predicate are also in the do- pact result bit vectors that are easily reusable when evaluating the main of the compressed codes. The column-scalar scan finds all next stage of a complex predicate. matching codes that satisfy the predicate, and outputs an n-bit vec- Both BitWeaving methods can be used as a native storage orga- tor, called the result bit vector, to indicate the matching codes. nization technique in a column store database, or as an indexing A processor word is a data block that can be processed as a unit method to index specific column(s) in row stores or column stores. by the processor. For ease of explanation, we initially assume that a Figure 1 illustrates the performance of a scan operation on a sin- processor word is an Arithmetic Logic Unit (ALU) word, i.e. a 64- gle column, when varying the width of the column from 1 bit to 32 bit word for modern CPUs, and in Appendix C of the extended ver- bits (Section 6 has more details about this experiment). This fig- sion of this paper [12] we generalize our method for wider words ure shows the SIMD-scan method proposed in [18], and a simple (e.g. SIMD). The instructions that process the processor word as method (labeled Naive) that scans each column in a traditional scan a unit of computation are called full-word instructions. Next, we loop and interprets each column value one by one. As can be seen define when a scan is a bit-parallel method. in the figure, both BitWeaving/V and BitWeaving/H outperform D EFINITION 1. If w is the width of a processor word, and k is the other methods across all the column widths. Both BitWeaving the number of bits that are needed to encode a code in the column methods achieve higher speedups over other methods when the col- C, then a column-scalar scan on column C is a bit-parallel method umn representation has fewer number of bits, because this allows if it runs in O( nk w ) full-word instructions to scan over n codes. more column predicates to be computed in parallel (i.e. the intra- A bit-parallel method needs to run in O( nk ) instructions to make w cycle parallelism per input column value is higher). For example, full use of the “parallelism” that is offered by the bits in the entire when each column is coded using 4 bits, the BitWeaving methods width of a processor word. Since processing nk bits with w-bit are 20X faster than the SIMD-scan method. Even for columns that processor words requires at least O( nk ) instructions, intuitively a w are wider than 12 bits, both BitWeaving methods are often more method that matches the O( nk ) bound has the potential to run at than 4X faster than the SIMD-scan method. Note that as described w the speed of the underlying processor hardware. in [18], real world data tends to use 8 to 16 bits to encode a column; BitWeaving is one order of magnitude faster than the SIMD-scan 2.2 Framework method within this range of code widths. The focus of this paper is on speeding up scan queries on colum- The contribution of this paper is the presentation of the BitWeav- nar data in main memory data processing engines. Our frame- ing methods that push our intra-cycle parallelism paradigm to its work targets the single-table predicates in the WHERE clause of natural limit – i.e. to the bit level for each column. We also develop SQL. More specifically, the framework allows conjunctions, dis- an arithmetic framework for predicate evaluation on BitWeaved junctions, or arbitrary boolean combinations of the following basic data, and present results from an actual implementation. comparison operators: =, =, <, >, ≤, ≥, BETWEEN. The remainder of this paper is organized as follows: Section 2 For the methods proposed in this paper, we evaluate the complex contains background information. The BitWeaving methods and predicate by first evaluating basic comparisons on each column, the related arithmetic framework is described in Sections 3 through 5. using a column-scalar scan. Each column-scalar scan produces a Section 6 contains our experimental results. Related work is cov- result bit vector, with one bit for each input column value that in- ered in Section 7, and Section 8 contains our concluding remarks. dicates if the corresponding column value was selected to be in the result. Conjunctions and disjunctions are implemented as log- 2. OVERVIEW ical AND and OR operations on these result bit vectors. Once the Main memory analytic DBMSs often store data in a compressed column-scalar scans are complete, the result bit vector is converted form [2, 4, 5, 10]. The techniques presented in this paper apply to to a list of record numbers, which is then used to retrieve other commonly used column compression methods, including null sup- columns of interest for this query. (See Appendix A in [12] for pression, prefix suppression, frame of reference, and order-preserving more details.) NULL values and three-valued boolean logic can dictionary encoding [2,4,5,10]. Such a scheme compresses columns be implemented in our framework using the techniques proposed using a fixed-length order-preserving scheme, and converts the na- in [13], and, in the interest of space, this discussion is omitted here. tive column value to a code. In this paper, we use the term “code” We represent the predicates in the SQL WHERE clause as a bi- to mean an encoded column value. The data for a column is repre- nary predicate tree. A leaf node encapsulates a basic comparison

3.operation on a single column. The internal nodes represent logical (111)2 , 4 = (100)2 , 3 = (011)2 }, denoted as c1 – c10 respec- operation, e.g. AND, OR, NOT, on one or two nodes. To evalu- tively. Each value can be encoded by 3 bits (k = 3). For ease of ate a predicate consisting of arbitrary boolean combinations of ba- illustration, we assume 8-bit processor words (i.e. w = 8). sic comparisons, we traverse the predicate tree in depth-first order, performing the column-scalar comparison on each leaf node, and 3.2 The Horizontal bit-parallel (HBP) method merging result bit vectors at each internal node based on the logical The HBP method compactly packs codes into processor words, operator that is represented by the internal node. Figure 8 illustrates and implements the functionality of hardware-implemented SIMD an example predicate tree. In Section 3, we focus on single-column instructions based on ordinary full-word instructions. The HBP scans, and we discuss complex predicates in Section 4.3. method solves a general problem for hardware-implemented SIMD that the natural bit width of a column often does not match any of the bank widths of the SIMD processor, which leads to an under- 3. BIT-PARALLEL METHODS utilization of the available bit-level parallelism. In this section, we propose two bit-parallel methods that are de- We first present the storage layout of HBP in Section 3.2.1, and signed to fully utilize the entire width of the processor words to then describe the algorithm to perform a basic scan on a single reduce the number of instructions that are needed to process data. column over the proposed storage layout in Section 3.2.2. These two bit-parallel methods are called Horizontal Bit-Parallel (HBP) and Vertical Bit-Parallel (VBP) methods. Each method has 3.2.1 Storage layout a storage format and an associated method to perform a column- In the HBP method, each code is stored in a (k + 1)-bit section scalar scan on that storage method. In Section 4, we describe an whose leftmost bit is used as a delimiter between adjacent codes (k early pruning technique to improve on the column-scalar scan for denote the number of bits needed to encode a code). A method that both HBP and VBP. Then, in Section 5 we describe the BitWeaving does not require the extra delimiter bit is feasible, but is much more method, which combines the bit-parallel methods that are described complicated than the method with delimiter bits, and also requires below with the early pruning technique. BitWeaving comes in two executing more instructions per code [11]. Thus, our HBP method flavors: BitWeaving/H and BitWeaving/V corresponding to the un- uses this extra bit for storage. derlying bit-parallel method (i.e. HBP or VBP) that it builds on. HBP tightly packs and pads a group of (k + 1)-bit sections into a processor word. Let w denote the width of a processor word. 3.1 Overview of the two bit-parallel methods Then, inside the processor word, k+1 w sections are concatenated As their names indicate, the two bit-parallel methods, HBP and together and padded to the right with 0s up to the word boundary. VBP, organize the column codes horizontally and vertically, respec- tively. If we thought of a code as a tuple consisting of multiple Column Codes HBP Storage Layout fields (bits), HBP and VBP can be viewed as row-oriented storage Segment 1 c1 c5 and column-oriented storage at the bit level, respectively. Figure 2 c1 0 0 1 v1 0 0 0 1 0 1 1 0 demonstrates the basic idea behind HBP and VBP storage layouts. c2 1 0 1 c2 c6 Both HBP and VBP only require the following full-word opera- c3 1 1 0 v2 0 1 0 1 0 1 0 0 tions, which are common in all modern CPU architectures (includ- c4 0 0 1 c3 c7 ing at the SIMD register level in most architectures): logical and c5 1 1 0 v3 0 1 1 0 0 0 0 0 (∧), logical or (∨), exclusive or (⊕), binary addition (+), negation c6 1 0 0 c4 c8 (¬), and k-bit left or right shift (←k or →k , respectively). c7 0 0 0 v4 0 0 0 1 0 1 1 1 c8 1 1 1 Segment 2 c9 c10 c9 1 0 0 v5 0 1 0 0 0 0 1 1 c10 0 1 1 code code Processor Word Figure 3: Example of the HBP storage layout (k = 3, w = 8). Delimiter bits are marked in gray. Processor Word Processor Word (a) Horizontal Bit-Parallel (b) Vertical Bit-Parallel In the HBP method, the codes are organized in a storage lay- out that simplifies the process of producing a result bit vector with Figure 2: HBP and VBP layouts for a column with 3-bit codes. one bit per input code (described below in Section 3.2.2). The col- The shaded boxes represent the bits for the first column value. umn is divided into fixed-length segments, each of which contains w (k + 1) · k+1 codes. Each code represents k + 1 bits values, Since the primary access pattern for scan operations is the se- with k bits for the actual code and the leading bit set to the delim- w quential access pattern, both the CPU cost and the memory access iter value of 0. Since a processor word fits k+1 codes, a segment cost are significant components that contribute to the overall exe- occupies k + 1 contiguous processor words in memory space. In- w cution time for that operation. Consequently, our methods are op- side a segment, the layout of the (k + 1) · k+1 codes, denoted timized for both the number of CPU instructions that are needed to as c1 ∼ c(k+1)· k+1w , is shown below. We use vi to denote the ith process the data, as well as the number of CPU cache lines that are processor word in the segment. occupied by the underlying (HBP or VBP) data representations. v1 : c1 ck+2 c2k+3 ··· ck· w k+1 +1 3.1.1 Running Example v2 : c2 ck+3 c2k+4 ··· ck· w k+1 +2 To illustrate the techniques, we use the following example through- .. .. .. .. .. out this section. The data set has 10 tuples, and the column of inter- . . . . . est contains the following codes: {1 = (001)2 , 5 = (101)2 , 6 = vk : ck c2k+1 c3k+2 ··· c(k+1)· k+1 w −1 (110)2 , 1 = (001)2 , 6 = (110)2 , 4 = (100)2 , 0 = (000)2 , 7 = vk+1 : ck+1 c2k+2 c3k+3 ··· c(k+1)· k+1 w

4. v1 (c1 , c5 ) v2 (c2 , c6 ) v3 (c3 , c7 ) v4 (c4 , c8 ) v5 (c9 , c10 ) X= (0001 0110)2 (0101 0100)2 (0110 0000)2 (0001 0111)2 (0100 0011)2 Y = (0101 0101)2 (0101 0101)2 (0101 0101)2 (0101 0101)2 (0101 0101)2 mask = (0111 0111)2 (0111 0111)2 (0111 0111)2 (0111 0111)2 (0111 0111)2 X ⊕ mask = (0110 0001)2 (0010 0011)2 (0001 0111)2 (0110 0000)2 (0011 0100)2 Y + (X ⊕ mask) = (1011 0110)2 (0111 1000)2 (0110 1100)2 (1011 0101)2 (1001 1001)2 Z = (Y + (X ⊕ mask)) ∧ ¬mask = (1000 0000)2 (0000 1000)2 (0000 1000)2 (1000 0000)2 (1000 1000)2 Figure 4: Evaluating a predicate c < 5 on the example column c Figure 3 demonstrates the storage layout for the example col- (yi + (xi ⊕ 01k )) ∧ 10k = 10k iff xi < yi . We also know that umn. Since each code in the example column is encoded by 3 bits yi + (xi ⊕ 01k ) is always less than 2k+1 , so overflow is impossible (k = 3), we use 4 = 3 + 1 bits to store each code and fit two codes for each (k + 1)-bit section. Thus, we have Z = (Y + (X ⊕ into a 8-bit word (w = 8). As shown in the figure, the 10 values 01k 01k · · · 01k )) ∧ 10k 10k · · · 10k for the comparison operator <. are divided into 2 segments. In segment 1, eight codes are packed LESS THAN OR EQUAL TO (≤). Since xi ≤ yi iff xi < yi + 1, into four 8-bit words. More specifically, word 1 contains code 1 we have Z = (Y + (X ⊕ 01k ) + 0k 1) ∧ 10k 10k · · · 10k for the and 5. Word 2 contains code 2 and 6. Word 3 contains code 3 and comparison operator ≤. 7. Word 4 contains code 4 and 8. Segment 2 is only partially filled, and contains code 9 and code 10 that are packed into word 5. Then, GREATER THAN (>) and GREATER THAN OR EQUAL TO (≥) can be implemented by swapping X and Y for LESS THAN (<) 3.2.2 Column-scalar scans and LESS THAN OR EQUAL TO (≤) operators, respectively. The HBP column-scalar scan compares each code with a con- Thus, the function f◦ (X, C) computes the predicates listed above stant C, and outputs a bit vector to indicate whether or not the cor- on k+1w codes using 3–4 instructions. responding code satisfies the comparison condition. Figure 4 illustrates an example when applying f< (vi , 5) on the w In HBP, k+1 codes are packed into a processor word. Thus, words v1 ∼ v5 shown in Figure 3. The ith column in the fig- we first introduce a function f◦ (X, C) that performs simultaneous ure demonstrates the steps when calculating Z = f< (vi , 5) on the w comparisons on k+1 packed codes in a processor word. The word vi . The last row represents the results of the function. Each w outcome of the function is a vector of k+1 results, each of which result word contains two comparison results. The value (1000)2 occupies a (k + 1)-bit section. The delimiter (leftmost) bit of each indicates that the corresponding code is less than the constant 5, section indicates the comparison results. whereas the value (0000)2 indicates that the corresponding code Formally, a function f◦ (X, C) takes as input a comparison oper- does not satisfy the comparison condition. ator ◦, a comparison constant C, and a processor word X that con- w tains a vector of k+1 codes in the form X = (x1 , x2 , · · · , x k+1 w ), Algorithm 1 HBP column-scalar scan and outputs a vector Z = (z1 , z2 , · · · , z w k+1 ), where zi = 10k Input: a comparison operator ◦ if xi ◦ C = true, or zi = 0k+1 if xi ◦ C = false. Note that a comparison constant C in the notation above for zi , we use exponentiation to denote bit Output: BVout : result bit vector repetition, e.g. 14 02 = 111100, 10k = 1 00 · · · 00. 1: for each segment s in column c do k 2: ms := 0 Since the codes are packed into processor words, the ALU in- 3: for i := 1 . . . k + 1 do struction set can not be directly used to process these packed codes. 4: mw := f◦ ( , C) In HBP, the functionality of vector processing is implemented using 5: ms := ms ∨ →i−1 (mw ) w 6: append ms to BVout full-word instructions. Let Y denote a vector of k+1 instances of constant C, i.e. Y = (y1 , y2 , · · · , y k+1 ), where yi = C. Then, w 7: return BVout ; the task is to calculate the vector Z in parallel, where each (k+1)- bit section in this vector, zi = xi ◦ yi ; here, ◦ is one of comparison Next, we present the HBP column-scalar scan algorithm based operators described as follows. Note that most of these functions on the function f◦ (X, C). Algorithm 1 shows the pseudocode for are adapted from [11]. the scan method. The basic idea behind this algorithm is to re- INEQUALITY (=). For the INEQUALITY , observe that xi = yi organize the comparison results in an appropriate order, matching iff xi ⊕ yi = 0k+1 . Thus, we know that xi = yi iff (xi ⊕ yi ) + the order of the original codes. As shown in the algorithm, for 01k = 1∗k (we use ∗ to represent an arbitrary bit), which is true iff each segment in the column, we iterate over the k + 1 words. In ((xi ⊕ yi ) + 01k ) ∧ 10k = 10k . We know that (xi ⊕ yi ) + 01k is the inner loop over the k + 1 words, we combine the results of always less than 2k+1 , so overflow is impossible for each (k+1)-bit f◦ (v1 , C) ∼ f◦ (vk+1 , C) together to obtain the result bit vector section. As a result, these computation can be done simultaneously on segment s. This procedure is illustrated below: on all xi and yi within a processor word. It is straightforward to f◦ (v1 , C) : R(c1 ) 0 ··· 0 0 R(ck+2 ) ··· see that Z = ((X ⊕ Y ) + 01k 01k · · · 01k ) ∧ 10k 10k · · · 10k . →1 (f◦ (v2 , C)) : 0 R(c2 ) ··· 0 0 0 ··· .. .. .. .. EQUALITY (=). EQUALITY operator is implemented by the . . . . complement of the INEQUALITY operator, i.e. Z = ¬((X ⊕ Y ) + →k−1 (f◦ (vk , C)) : 0 0 ··· R(ck ) 0 0 ··· 01k 01k · · · 01k ) ∧10k 10k · · · 10k . →k (f◦ (vk+1 , C)) : 0 0 ··· 0 R(ck+1 ) 0 ··· ∨ : R(c1 ) R(c2 ) ··· R(ck ) R(ck+1 ) R(ck+2 ) ··· LESS THAN (<). Since both xi and yi are integers, we know that xi < yi iff xi ≤ yi − 1, which is true iff 2k ≤ yi + 2k − xi − 1. In the tabular representation above, each column represents one Observe that 2k − xi − 1 is just the k-bit logical complement of xi , bit in the outcome of f◦ (vi , C). Let R(ci ) denote the binary re- which can be calculated as xi ⊕ 01k . It is then easy to show that sult of the comparison on ci ◦ C. Since R(ci ) is always placed in

5. Column Codes VBP Storage Layout Algorithm 2 VBP column-scalar comparison Segment 1 Segment 1 c1 c2 c3 c4 c5 c6 c7 c8 Input: a predicate C1 < c < C2 on column c c1 0 0 1 Output: BVout : result bit vector Code Width v1 0 1 1 0 1 1 0 1 c2 1 0 1 Transpose 1: for i := 1 . . . k do v2 0 0 1 0 1 0 0 1 2: if i-th bit in C1 is on then c3 1 1 0 v3 1 1 0 1 0 0 0 1 3: C1i := 1w c4 0 0 1 4: else c5 1 1 0 Processor Word 5: C1i := 0w c6 1 0 0 Segment 2 6: for i := 1 . . . k do c7 0 0 0 c9 c10 7: if i-th bit in C2 is on then 8: C2i := 1w Code Width c8 1 1 1 v4 1 0 0 0 0 0 0 0 9: else Segment 2 Transpose v5 0 1 0 0 0 0 0 0 10: C2i := 0w c9 1 0 0 v6 0 1 0 0 0 0 0 0 11: for each segment s in column c do 12: mlt , mgt := 0 c10 0 1 1 Processor Word 13: meq1 , meq2 := 1w 14: for i := 1 . . . k do Figure 5: Example of the VBP storage layout. The middle bits 15: mgt := mgt ∨ (meq1 ∧ ¬C1i ∧ ) of codes are marked in light gray, whereas the least significant 16: mlt := mlt ∨ (meq2 ∧ C2i ∧ ¬ ) 17: meq1 := meq1 ∧ ¬( ⊕ C1i ) bits are marked in dark gray. 18: meq2 := meq2 ∧ ¬( ⊕ C2i ) 19: append mgt ∧ mlt to BVout 20: return BVout ; the delimiter (leftmost) bit in a (k + 1)-bit section, the output of f◦ (vi , C) is in the form: R(ci )0k R(ck+1+i )0k · · · . By right shift- ing the output of f◦ (vi , C), we move the result bits R(ci ) to the k w-bit words, denoted as v1 , v2 , · · · , vk , such that the j-th bit in appropriate bit positions. The OR (∨) summation over the k + 1 vi equals to the i-th bit in the original code cj . result words is then in the form of R(c1 )R(c2 )R(c3 ) · · · , repre- w Inside a segment, the k words, i.e. v1 , v2 , · · · , vk , are physically senting the comparison results on the k+1 codes of segment s, in stored in a continuous memory space. The layout of the k words ex- the desired result bit vector format. actly matches the access pattern of column-scalar scans (presented For instance, to compute the result bit vector on segment 1 below in Section 3.3.2), which leads to a sequential access pattern (v1 , v2 , v3 , and v4 ) shown in Figure 3 and Figure 4, we perform on these words, making it amenable for hardware prefetching. (1000 0000)2 ∨ →1 (0000 1000)2 ∨ →2 (0000 1000)2 ∨ →3 Figure 5 illustrates the VBP storage layout for the running ex- (1000 0000)2 = (1001 0110)2 . The result bit vector (1001 0110)2 ample shown in Section 3.1.1. The ten codes are broken into two means that c1 , c4 , c6 , and c7 satisfy the comparison condition. segments with eight and two codes, respectively. The two segments Note that the steps above, which are carried out to produce a are separately transposed into three 8-bit words. The word v1 in result bit vector with one bit per input code, are essential when segment 1 holds the most significant (leftmost) bits of the codes using the result bit vector in a subsequent operation (e.g. the next c1 ∼ c8, the word v2 holds the middle bits of the codes c1 ∼ c8, step of a complex predicate evaluation in which the other attributes and the word v3 holds the least significant (rightmost) bits of the in the predicate have different code widths). codes c1 ∼ c8. In segment 2, only the leftmost two bits of the The HBP storage layout is designed to make it easy to assemble three words are used, and the remaining bits are filled with zeros. the result bit vector with one bit per input code. Taking Figure 3 as an example again, imagine that we lay out all the codes in sequence, 3.3.2 Column-scalar scans i.e. put c1 and c2 in v1, put c3 and c4 in v2, and so forth. Now, the result words from the predicate evaluation function f◦ (vi , C) The VBP column-scalar scan evaluates a comparison condition on v1, v2, · · · are f◦ (v1 , C) = R(c1 )000R(c2 )000, f◦ (v2 , C) = over all the codes in a single column and outputs a bit vector, where R(c3 )000R(c4 )000, · · · . Then, these result words must be con- each bit indicates whether or not the corresponding code satisfies verted to a bit vector of the form R(c1 )R(c2 )R(c3 )R(c4 ) · · · , by the comparison condition. extracting all the delimiter bits R(ci ) and omitting all other bits. The VBP column-scalar scan follows the natural way to compare Unfortunately, this conversion is relatively expensive compared to two integers in the form of bit strings: we compare each pair of the computation of the function f◦ (vi , C) (See Appendix B in [12] bits at the same position of the two bit strings, starting from the for more details). In contrast, the storage layout used by the HBP most significant bits to the least significant bits. The VBP method method does not need to execute this conversion to produce the re- essentially performs this process on a vector of w codes in parallel, sult bit vector. In Section 6.1.1, we empirically compare the HBP inside each segment. method with a method that needs this conversion. Algorithm 2 shows the pseudocode to evaluate the comparison predicate BETWEEN C1 AND C2. 3.3 The Vertical bit-parallel (VBP) method At the beginning of the algorithm (Lines 1–5), we create a list of words C11 ∼ C1k to represent w instances of C1 in the VBP The Vertical Bit-Parallel (VBP) method is like a bit-level column storage format. If the i-th bit of C1 is 1, C1i is set to be all 1s, as store, with data being packed at word boundaries. VBP is inspired all the i-th bits of the w instances of C1 are all 1s. Otherwise, C1i by the bit-sliced method [13], but as described below, is different is set to be all 0s. Similarly, we create C21 ∼ C2k to represent C2 in the way it organizes data around word boundaries. in the VBP storage format (in Line 6–10). In the next step, we iterate over all segments of the column, and 3.3.1 Storage layout simultaneously evaluate the range on all the w codes in each seg- In VBP, the column of codes is broken down to fixed-length seg- ment. The bit vector mgt is used to indicate the codes such that ments, each of which contains w codes (w is the width of a proces- they are greater than the constant C1, i.e. if the i-th bit of mgt is sor word). The w k-bit codes in a segment are then transposed into on, then the i-th code in the segment is greater than the constant

6.C1. Likewise, mlt is used to indicate the codes that are less than 1 the constant C2. meq1 and meq2 are used to indicate the codes that Early Pruning Probability are equivalent to the constant C1 and C2, respectively. 0.8 In the inner loop (Line 14–18), we compare the codes with the constants C1 and C2 from the most significant bits to the least sig- 0.6 nificant bits, and update the bit vector mgt , mlt , meq1 , and meq2 , correspondingly. The k words in the segment s are denoted as 0.4 s.v1 ∼ s.vk . At the i-th bit position, for a code with the i-th bit on, the code must be greater than the constant C1 iff the i-th bit of 0.2 Fill Factor:100% Fill Factor: 10% C1 is off and all bits to the left of this position between the code Fill Factor: 1% and C1 are all equal (meq1 ∧ ¬C1i ∧ ). The corresponding 0 0 4 8 12 16 20 24 28 32 bits in mgt are then updated to be 1s (Line 15). Similarly, mlt is updated if the i-th bit of a code is 0, the i-th bit of C2 is 1, and Bit Position (b) all the bits to the left of this position are all equal (Line 16). We Figure 7: Early Pruning Probability P (b) also update meq1 (meq2 ) for the codes that are different from the constant C1(C2) at the i-th bit position (Line 17 & 18). 0s to reflect this situation. Next, we expand the comparison to the After the inner loop, we perform a logical AND between the bit second bit between the constant and the codes. Now, we know that vector mgt and mlt to obtain the result bit vector on the segment the 1st, 4th, and 7th codes are smaller than the constant because (Line 19). This bit vector is then appended to the result bit vector. their first two bits are less than the first two bits of the constant Algorithm 2 can be easily extended for other comparison condi- (01). We also know that the 2nd, 3rd, 5th, 6th, and 8th codes are tions. For example, we can modify Line 19 to “append mgt ∧mlt ∨ greater than the constant, as their first two bits are greater than the meq1 ∨ meq2 to BVout ” to evaluate the condition C1 ≤ c ≤ C2. first two bits of the constant (01). At this point, all the codes have a For certain comparison conditions, some steps can be eliminated. definite answer w.r.t. this predicate, and we can terminate the VBP For instance, Line 15 and 17 can be skipped for a LESS THAN (<) column-scalar scan on this segment. The bit vector mlt is updated comparison, as we do not need to evaluate mgt and meq1 . to be 10010010, and it is also the final result bit vector. 4. EARLY PRUNING 4.2 Estimating the early pruning probability We first introduce the fill factor f of a segment, defined as the The early pruning technique aims to avoid accesses on unneces- number of codes that are present over the maximum number of sary data at the bit level. This technique is orthogonal to the two codes in the segment, i.e. the width of processor word w. For bit-parallel methods described in Section 3, and hence can be ap- instance, the fill factor of the segment 1 in Figure 5 is 8/8 = 100%, plied to both the HBP and the VBP methods. However, as the early whereas the fill factor of the segment 2 is 2/8 = 25%. According pruning technique is more naturally described within the context of to this definition, a segment contains wf codes. VBP, we first describe this technique as applied to VBP. Then, in The early pruning probability P (b) is defined as the probability Section 5.2 we discuss how to apply this technique to HBP. that the wf codes in a segment are all different from the constant 4.1 Basic idea behind early pruning in the most significant b bits, i.e. it is the probability that we can terminate the computation at the bit position b. It is often not necessary to access all the bits of a code to compute We analyze the early pruning probability P (b) on a segment con- the final result. For instance, to compare the code (11010101)2 to taining wf k-bit codes. We assume that a code and the constant a constant (11001010)2 , we compare the pair of bits at the same have the same value at a certain bit position with a probability of position, starting from the most significant bit to the least signifi- 1/2. Thus, the probability that all of the leading b bits between a cant bit, until we find two bits that are different. At the 4th position code and the constant are identical is given by ( 21 )b . Since a seg- (underlined above), the two bits are different, and thus we know ment contains wf codes, the probability that these codes are all that the code is greater than the constant. We can now ignore the different from the constant in the leading b bits, i.e. the early prun- remaining bits. ing probability P (b), is: Constant VBP words mlt 1 P (b) = (1 − ( )b )w·f 1st bit 0 01101101 00000000 2 2nd bit 1 00101001 10010010 Figure 7 plots the early pruning probability P (b) with a 64-bit 3rd bit 1 11010001 - processor word (w = 64) by varying the bit position b. We first Figure 6: Evaluating c < 3 with the early pruning technique look at the curve with a 100% fill factor. The early pruning prob- ability increases as the bit position number increases. At the bit It is easy to apply the early pruning technique on VBP, which position 12, the early pruning probability is already very close to performs comparisons on a vector of w codes in parallel. Figure 6 100%, which indicates that in many cases we can terminate the illustrates the process of evaluating the eight codes in segment 1 of scan after looking at the first 12 bits. If a code is a 32-bit integer, the example column c with a comparison condition c < 3. The VBP with early pruning potentially only uses 12/32 of the memory constant 3 is represented in the binary form (011)2 as shown in the bandwidth and the processing time that is needed by the base VBP second column in the figure. The first eight codes (1 = (001)2 , method (without early pruning). 5 = (101)2 , 6 = (110)2 , 1 = (001)2 , 6 = (110)2 , 4 = (100)2 , In Figure 7, at the lower fill factors, segments are often “cut- 0 = (000)2 , 7 = (111)2 ) of column c are stored in three 8-bit VBP off” early. For example, for segments with fill factor 10%, we can words, as shown in the third column in the figure. prune the computation at bit position 8 in most (i.e. 97.5%) cases. By comparing the first bit of the constant (0) with the first bits of This cut-off mechanism allows for efficient evaluation of conjunc- the eight codes (01101101), we notice that no code is guaranteed to tion/disjunction predicates in BitWeaving, as we will see next in be less than the constant at this point. Thus, the bit vector mlt is all Section 4.3.

7. OR the total number of bits that are accessed in a column-scalar scan operation; 2) The storage format is not a pure VBP format, but a AND weaving of the VBP format with horizontal packing into bit groups AND to further exploit the benefits of early pruning, by making access to the underlying bits more sequential (and hence more amenable R.a < 10 R.b > 5 R.c < 20 R.d = 3 for hardware prefetching); 3) It can be implemented with SIMD instructions allowing it to make full use of the entire width of the Figure 8: An example predicate tree for the expression (wider) SIMD words in modern processors. R.a < 10 AND R.b > 5 AND R.c < 20 OR R.d = 3 5.1.1 Storage layout 4.3 Filter bit vectors on complex predicates In this section, we describe how the VBP storage layout is adapted The early pruning technique can also be used when evaluating in BitWeaving/V to further exploit the benefits of the early pruning predicate clauses on multiple columns. Predicate evaluation on a technique. In addition to the core VBP technique of vertical par- single column can be pruned as outlined above in Section 4.1. But, titioning the codes at the bit level, in BitWeaving/V the codes are early pruning can also be used when evaluating a series of predicate also partitioned in a horizontal fashion to provide better CPU cache clauses with the result vector from the first clause being used to performance when using the early pruning technique. This combi- “initialize” the pruning bit vector for the second clause. nation of vertical and horizontal partitioning is the reason why the The result bit vector that is produced from a previous step is proposed solution is called BitWeaving. called the filter bit vector of the current column-scalar scan. This filter bit vector is used to filter out the tuples that do not match the Processor Cut-off Cut-off predicate clauses that were examined in the previous steps, lead- Word ing to a lower fill factor on the current column. Thus, the filter bit vector further reduces the computation on the current column- scalar scan (note that at the lower fill factors, predicate evaluation Segment 1 Segment 2 are often “cut-off” early, as shown in Figure 7). (a) VBP As an example, consider the complex predicate: R.a < 10 AND R.b > 5 AND R.c < 20 OR R.d = 3. Figure 8 il- Processor Cut-off Cut-off lustrates the predicate tree for this expression. First, we evaluate the Word predicate clause on column R.a, using early pruning. This evalu- Bit Group 1 ation produces a result bit vector. Next, we start evaluating the predicate clause on the column R.b, using early pruning. How- Bit Group 2 ever, in this step we use the result bit vector produced from the Bit Group 3 previous step to seed the early pruning. Thus, tuples that did not Segment 1 Segment 2 match the predicate clause R.a < 10 become candidates for early (b) BitWeaving/V pruning when evaluating the predicate clause on R.b, regardless of the value of their b column. As a result, the predicate evaluation Figure 9: Early pruning on VBP and BitWeaving/V on column b is often “cut-off” even earlier. Similarly, the result bit vector produced at the end of evaluating the AND node (the white The VBP storage format potentially wastes memory bandwidth AND node in the figure) is fed into the scan on column R.c. Fi- if we apply the early pruning technique on the base VBP storage nally, since the root node is an OR node, the complement of the format. (See Section 3.3 for details.) Figure 9(a) illustrates a scan result bit vector on the AND node (the gray one) is fed into the on a column stored in the VBP format. Suppose that with the early final scan on column R.d. pruning technique (described in Section 4.1), the outcome of com- parisons on all the codes in a segment is determined after accessing 5. BIT WEAVING the first three words. Thus, the 4th to the 9th words in segment 1 In this section, we combine the techniques proposed above, and can be skipped, and the processing can move to the next segment extend them, into the overall method called BitWeaving. BitWeav- (as shown by the dashed arrow). Suppose that a CPU cache line ing comes in two flavors: BitWeaving/H and BitWeaving/V corre- contains 8 words. Thus, the six words that are skipped occupy the sponding to the underlying bit-parallel storage format (i.e. HBP or same CPU cache line as the first three words. Skipping over the VBP described in Section 3) that it builds on. As described below, content that has already been loaded into the CPU cache results BitWeaving/V also employs an adapted form of the early pruning in wasted memory bandwidth, which is often a critical resource in technique described in Section 4. main memory data processing environments. We note that the BitWeaving methods can be used as a base stor- We solve this problem by dividing the k words in a segment into age organization format in column-oriented data stores, and/or as fixed sized bit groups. Let B denote the size of each bit group. indices to speedup the scans over some attributes. For ease of pre- The words in the same bit group are physically stored in continu- sentation, below we assume that BitWeaving is used as a storage ous space. Figure 9(b) illustrates how the storage layout with the format. It is straightforward to employ the BitWeaving method as bit grouping reduces the amount of data that is loaded into the CPU indices, and in Section 6.2 we empirically evaluate the performance cache. As shown in this figure, the nine words in each segment are of the BitWeaving methods when used in both these ways. divided into three bit groups, each containing three words per seg- ment. Suppose that with the early pruning technique, the outcome 5.1 BitWeaving/V of the comparisons on all the codes in a segment is determined after BitWeaving/V is a method that applies the early pruning tech- accessing the first three words. In this case, we only need to access nique on VBP. BitWeaving/V has three key features: 1) The early the first three words of each segment, which are all laid out contin- pruning technique skips over pruned column data, thereby reducing uously and compactly in the first bit group. Consequently, the bit

8.Algorithm 3 BitWeaving/V column-scalar scan in one of every B iterations reduces the number of checks at the Input: a predicate C1 < c < C2 on column c positions where the cut-off probability is in the middle range. We BVin : filter bit vector have observed that without this attention to branch prediction in the Output: BVout : result bit vector algorithm, the scans generally run slower by up to 40%. 1: initialize C1 and C2 (same as Lines 1-10 in Algorithm 2) The second modification is to feed a filter bit vector into the 2: for each segment s in column c do 3: mlt , mgt := 0 column-scalar comparisons. In a filter bit vector, the bits associated 4: meq1 , meq2 := BVin .s ✁ with the filtered codes are turned off. Filter bit vectors are typically 5: for g := 1 . . . B k do ✷ the result bit vectors on other predicates in a complex WHERE 6: if meq1 == 0 and meq2 == 0 then ✷ clause (see Section 4.3 for more details). 7: break ✷ To implement this feature, the bit masks meq1 and meq2 are ini- 8: for i := gB + 1 . . . min(gB + B, k) do ✷ tialized to the corresponding segment in the filter bit vector (marked 9: mgt := mgt ∨ (meq1 ∧ ¬C1i ∧ s.wi ) with ✁ at the end of the line). During the evaluation on a segment, 10: mlt := mlt ∨ (meq2 ∧ C2i ∧ ¬s.wi ) the bit masks meq1 and meq2 are updated by meq1 := meq1 ∧ 11: meq1 := meq1 ∧ ¬(s.wi ⊕ C1i ) 12: meq2 := meq2 ∧ ¬(s.wi ⊕ C2i ) ¬(s.wi ⊕ C1i ) and meq2 := meq2 ∧ ¬(s.wi ⊕ C2i ), respectively. 13: append mgt ∧ mlt to BVout Thus, the filtered codes remain 0s in meq1 and meq2 during the 14: return BVout ; evaluation. Once the bits associated with the unfiltered codes are all updated to 0s, we terminate the comparisons on this segment following the early pruning technique. The filter bit vector poten- grouping technique uses memory bandwidth more judiciously, and tially speedups the cut-off process. results in a more sequential access pattern. 5.2 BitWeaving/H In the example above, if the early pruning triggers at two bits (instead of three bits), then we still save on memory accesses over It is also feasible to apply early pruning technique on data stored a method that does not use bit groups. However, we will likely in the HBP format. The key difference is that we store each bit waste memory bandwidth bringing in data for the third bit. Picking group in the HBP storage layout (described in Section 3.2). For an optimal bit group size is an interesting direction for future work. a column-scalar scan, we evaluate the comparison condition on bit In Appendix D.2 of the full-length version of this paper [12], we groups starting from the one containing the most significant bits. In empirically demonstrate the impact of the bit group size. addition to the result bit vector on the input comparison condition, we also need to compute a bit vector for the inequality condition in 5.1.2 Column-scalar scans order to detect if the outcome of the scan is fully determined. Once the outcome is fully determined, we skip the remaining bit groups We apply the early pruning technique on the VBP column-scalar (using early pruning). scan, in two ways. First, when evaluating a comparison condition However, the effect of early pruning technique on HBP is offset on a vector of codes, we skip over the least significant bits as soon by the high overhead of computing the additional bit vector, and as the outcome of the scan is fully determined. Second, a filter bit has an overall negative impact on performance (see Section 6.1.2). vector is fed into the scan to further speedup comparisons. This bit Therefore, the BitWeaving/H method is simply the HBP method. vector is used to filter out unmatched tuples even before the scan starts. This technique reduces the number of available codes in 5.3 BitWeaving/H and BitWeaving/V each segment, and thus speedups the scan (recall that early pruning In this section, we compare the two BitWeaving methods, in technique often runs faster on segments with a lower fill factor as terms of performance, applicability, as well as ease of implementa- shown in Section 4.2). tion. The summary of this comparison is shown in Table 1. The pseudocode for a VBP column-scalar scan with early prun- ing technique is shown in Algorithm 3. The algorithm is based on BitWeaving/H BitWeaving/V the VBP column-scalar scan shown in Algorithm 2. The modified n(k+1) lines are marked with ✁ and ✷ at the end of lines. Scan Complexity O( w ) O( nk w ) The first modification over the VBP scan method is to skip over SIMD Implementation Limited Good the least significant bits once the outcome of the scan is fully de- Early Pruning No Yes termined (marked with ✷ at the end of lines). In the BitWeaving/V Lookup Performance Good Poor storage layout, k words representing a segment are divided into fixed-size bit groups. Each bit group contains B words in the seg- Table 1: Comparing BitWeaving/H and BitWeaving/V ment. Predicate evaluation is also broken into a group of small loops to adapt to the design of bit groups. Before working on each Scan Complexity. BitWeaving/H uses k + 1 bits of processor bit group, we check the values of the bit masks meq1 and meq2 . If word to store a k-bit code, while BitWeaving/V requires only k both bit masks are all 0s, then the leading bits between the codes bits. As both methods simultaneously process multiple codes, the and the constant are all different. Thus, the outcome of the scan CPU cost of BitWeaving/H and BitWeaving/V are O( n(k+1) w ) on the segment is fully determined. As a result, we terminate the and O( nkw ), respectively; i.e., both are bit-parallel methods as per evaluation on this segment, and move to the next one. Definition 1. Both BitWeaving methods are generally competitive We check the cut-off condition (in Line 6) in one of every B to other methods. However, in the extreme cases, BitWeaving/V iterations of processing the k words of a segment. The purpose of could be close to 2X faster than BitWeaving/H due to the overhead this design is to reduce the cost of checking the condition as well of the delimiter bits (in BitWeaving/H). For instance, BitWeav- as the cost of CPU branch mispredictions that this step triggers. If ing/H fits only one 32-bit code (with an addition delimiter bit) in a the cut-off probability at a bit position is neither close to 0% nor 64-bit process word, whereas BitWeaving/V fits two codes. 100%, it is difficult for the CPU to predict the branch. Such branch SIMD Implementation. The implementation of BitWeaving/H misprediction can significantly slows down the overall execution. method relies on arithmetic and shift operations, which is generally With the early pruning technique, checking the cut-off condition not supported on an entire SIMD word today. Thus, BitWeaving/H

9. Naive SIMD Bit-sliced BL BitWeaving/V BitWeaving/H 10 0.06 30 10 9 L3 cache misses / code 9 0.05 25 8 Instructions / code 8 Cycles / code Cycles / code 7 7 0.04 20 6 6 5 5 0.03 15 4 4 3 3 0.02 10 2 2 1 0.01 5 1 0 0 0 0 4 48 12 16 20 8 24 28 32 12 0 0 4 8 16 12 16 20 24 20 28 32 0 0 244 8 12 16 2820 24 28 3232 Size of code (number of bits) Size ofSize codes (number of bits) of code (number of bits) Size of code (number of bits) (a) Query Execution Time (b) Memory Performance (c) CPU Performance Figure 10: Performance on query Q1. has to pad codes to the width of banks in the SIMD registers, rather proposed in [13]. This method was originally proposed to index ta- than the SIMD word width. This leads to underutilization of the full bles with low number of distinct values; it shares similarities to width of the SIMD registers. In contrast, BitWeaving/V method the VBP method, but does not explore the storage layout and early achieves the full parallelism that is offered by SIMD instructions. pruning technique. Surprisingly, previous recent work on main Appendix C in [12] describes how our methods can be extended to memory scans have largely ignored the bit-sliced method. work with larger SIMD words. In the graphs below, we use the tag BL (Blink-Like) to repre- Early Pruning. Applying early pruning technique on HBP re- sent the method that adapts the Blink method [8] for column stores quires extra processing that hurts the performance of HBP. As a (since we focus on column stores for this evaluation). Thus, the result, BitWeaving/H does not employ the early pruning technique. tag BL refers to tightly (horizontally) packed columns with a sim- In contrast, in BitWeaving/V, the early pruning technique works ple linear layout, and without the extra bit that is used by HBP naturally with the underlying VBP-like format with no extra cost, (see Section 3.2). The BL method differs from the BitWeaving/H and usually improves the scan performance. method as it does not have the extra bit, and it lays out the codes in Lookup Performance. With the BitWeaving/H layout, it is easy order (w.r.t. the discussion in the last paragraph in Section 3.2, the to fetch a code as all the bits of the code are stored contiguously. In layout of the codes in BL is c1 and c2 in v1, c3 and c4 in v2, and contrast, for BitWeaving/V, all the bits of a code are spread across so on in Figure 3). various bit groups, distributed over different words. Consequently, Below, the tags BitWeaving/H (or BW/H) and BitWeaving/V (or looking up a code potentially incurs many CPU cache misses, and BW/V) refer to the methods proposed in this paper. The size of the can thus hurts performance. bit group is 4 for all experiments. The effect of the other bit group To summarize, in general, both BitWeaving/H and BitWeaving/V sizes on scan performance is shown in Appendix D.2 in [12]. are competitive methods. BitWeaving/V outperforms BitWeaving/H We implemented each method in C++, and compiled the code for scan performance whereas BitWeaving/H achieves better lookup using g++ 3.4.6 with optimization flags (O3). performance. Empirical evaluation comparing these two methods In all the results below, we ran experiments using a single pro- is presented in the next section. cess with a single thread. We have also experimented using mul- tiple threads working on independent data partitions. Since the re- sults are similar to that of a single thread (all the methods parallelize 6. EVALUATION well assuming that each thread works on a separate partition), in the We ran our experiments on a machine with dual 2.67GHz In- interest of space, we omit these results. tel Xeon 6-core CPUs, and 24GB of DDR3 main memory. Each processor has 12MB of L3 cache shared by all the cores on that 6.1 Micro-Benchmark Evaluation processor. The processors support a 64-bit ALU instruction set as For this experiment, we created a table R with a single column well as a 128-bit Intel SIMD instruction set. The operating system and one billion uniformly distributed integer values in this column. is Linux 2.6.9. The domain of the values are [0, 2d ), where d is the width of the In the evaluation below, we compare BitWeaving to the SIMD- column that is varied in the experiments. The query (Q1), shown scan method proposed in [18], the Bit-sliced method [13], and a below, is used to evaluate a column-scalar scan with a simple LESS method based on Blink [8]. Collectively, these three methods rep- THAN predicate. The performance on other predicates is similar resent the current state-of-the-art main memory scan methods. to that on the LESS THAN predicate. (See Appendix D.1 in [12] To serve as a yardstick, we also include comparison with a naive for more details.) The constants in the WHERE clause are used method that simply extracts, loads, and then evaluates each code to control the selectivity. By default, the selectivity on each predi- with the comparison condition in series, without exploiting any cate is set to 10%, i.e. 10% of the input tuples match the predicate. word-level parallelism. In the graphs below the tag Naive refers Note, we also evaluate the impact of different selectivity (see Ap- to this simple scan method. pendix D.3 in [12]), but by default use a value of 10%. Below, the tag SIMD-scan refers to the technique in [18] that uses SIMD instructions to align multiple tightly packed codes to Q1: SELECT COUNT(*) FROM R WHERE R.a < C1 SIMD banks in parallel, and then simultaneously processes multi- ple codes using SIMD instructions. 6.1.1 BitWeaving v/s the Other Methods Below, the tag Bit-sliced refers to the traditional bit-sliced method In the evaluation below, we first compare BitWeaving to the

10. 3.5 3.5 1400 VBP HBP HBP 3 VBP+Pruning 3 HBP+Pruning 1200 HBP+Pruning VBP+Pruning+SIMD HBP+Pruning+SIMD VBP 2.5 2.5 1000 Cycles / code Cycles / code Cycles / code VBP+Pruning 2 2 800 1.5 1.5 600 1 1 400 0.5 0.5 200 0 0 0 0 4 8 12 16 20 24 28 32 0 4 8 12 16 20 24 28 32 0 4 8 12 16 20 24 28 32 Size of code (number of bits) Size of code (number of bits) Size of code (number of bits) (a) VBP-related methods: Execution Time (b) HBP-related methods: Execution Time (c) Lookup Performance Figure 11: Performance comparison between the HBP and the VBP related methods on Query Q1. Naive, the SIMD-scan [18], the Bit-sliced [13], and the BL meth- 12 bits, both BitWeaving methods are often more than 3X faster ods. Figure 10(a), Figure 10(b), and Figure 10(c) illustrate the than the SIMD-scan, the Bit-sliced and the BL methods. number of cycles, cache misses, and CPU instructions for the six methods for Q1 respectively, when varying the width of the column 6.1.2 Individual BitWeaving Components code from 1 bit to 32 bits. The total number of cycles for the query In this experiment, we compare the effect of the various tech- is measured by using the RDTSC instruction. We divide this total niques (VBP v/s HBP, early pruning, and SIMD optimizations) that number of cycles by the number of codes to compute the cycles per have been proposed in this paper. Figure 11(a) and 11(b) plot the code, which is shown in Figure 10(a). performance of these techniques for VBP and HBP for query Q1, As can be observed in Figure 10(a), not surprisingly, the Naive respectively. method is the slowest. The Naive method shifts and applies a mask First, we compare the scan performance of the HBP and the VBP to extract and align each packed code to the processor word. Since methods for query Q1. From the results shown in Figure 11(a) and each code is much smaller than a processor word (64-bit), it burns 11(b), we observe that at certain points, VBP is up to 2X faster than many more instructions than the other methods (see Figure 10(c)) HBP. For example, VBP is 2X faster than HBP for 32-bit codes, be- on every word of data that is fetched from the underlying memory cause HBP has to pad 32-bit code to a entire 64-bit word to fit both subsystem (with L1/L2/L3 caches buffering data fetched from main the code and the delimiter bit. In spite of this, HBP and VBP gener- memory). Even when most of the data is served from the L1 cache, ally show a similar performance trend as the code width increases. its CPU cost dominates the overall query execution time. This empirical results matches our analysis that both methods sat- The SIMD-scan achieves 50%–75% performance improvement isfy the cost-bound for bit-parallel methods. over the Naive method (see Figure 10(a)), but it is still worse com- Next, we examine the effects of the early pruning technique on pared to the other methods. Even though a SIMD instruction can both the VBP and the HBP methods. As can be seen in Figure 11(a), process four 32-bit banks in parallel, the number of instructions for wider codes, the early pruning technique quickly reduces the drops by only 2.2-2.6X (over the Naive method), because it im- query execution time for VBP, and beyond 12 bits, the query ex- poses extra instructions to align packed codes into the four banks ecution time with early pruning is nearly constant. Essentially, as before any computation can be run on that data. Furthermore, we described in Section 4.2, for wider codes early pruning has a high observe that with SIMD instructions, the CPI (Cycles Per Instruc- chance of terminating after examining the first few bits. tions) increases from 0.37 to 0.56 (see Figure 10(c)), which means In contrast, as can be seen in Figure 11(b), the effect of the early that a single SIMD instructions takes more cycles to executed than pruning technique on the HBP method is offset by the high over- a ordinary ALU instruction. This effect further dampens the benefit head of computing the additional masks (see Section 5.2). Conse- of this SIMD implementation. quently, the HBP method (which is the same as BitWeaving/H, as As can be seen in Figure 10(a), the Bit-sliced and the BL meth- discussed in Section 5.2) is uniformly faster than “HBP + Pruning”. ods shows a near linear increase in run time as the code width Applying SIMD parallelism achieves marginal speedups for both increases. Surprisingly, both these methods are almost uniformly the HBP and the VBP methods (see Figures 11(a) and 11(b)). Ide- faster than the SIMD-scan method. However, the storage layout ally, the implementation with a 128-bit SIMD word should be 2X of the Bit-sliced method occupies many CPU cache lines for wider faster than that with 64-bit ALU word. However, by measuring the codes. As a result, as can be seen in Figure 10(b), the number of L3 number of instructions, we observed that the SIMD implementa- cache misses quickly increases and hinders overall performance. tion reduces the number of instructions by 40%, but also increase In this experiment, the BitWeaving methods outperform all the the CPI by 1.5X. Consequently, the net effect is that the SIMD other methods across all the code widths (see Figure 10(a)). Un- implementation is only 20% and 10% faster than the ALU imple- like the Naive and the SIMD-scan methods, they do not need to mentation, for VBP and HBP respectively. move data to appropriate positions before the predicate evaluation Next, we evaluate the performance of a lookup operation. A computation. In addition, as shown in Figure 10(b) and 10(c), the lookup operation is important to produce the attributes in the pro- BitWeaving methods are optimized for both cache misses and in- jection list after the predicates in the WHERE clause have been structions due to their storage layouts and scan algorithms. Finally, applied. In this experiment, we randomly pick 10 million positions with the early pruning technique, the execution time of BitWeav- in the column, and measure the average number of cycles that are ing/V (see Figure 10(a)) does not increase for codes that are wider needed to fetch (and assemble) a code at each position. The results than 12 bits. As can be seen in Figure 10(a), for codes wider than for this experiment are shown in Figure 11(c). As can be seen in Figure 11(c), amongst the four methods, the

11. Speedup over the Naive method 35 30 Scan on l_shipdate Naive BL BW/H BW/H-idx 30 Scan on l_quantity SIMD Bit-sliced BW/V BW/V-idx 25 25 Scan on l_discount Cycles/row Aggregation 20 20 15 15 10 10 5 0 5 BW/H-idx BW/V-idx Bit-sliced 0 BW/H Naive BW/V SIMD Q4 Q5 Q12 Q14 Q17 Q19 BL TPC-H Queries (a) Q6: Execution Time Breakdown (b) Performance on TPC-H Queries with Materialized Join Inputs Figure 12: Performance comparison with the TPC-H Queries (BW=BitWeaving). lookup performance of the HBP method is the best, and its per- is fairly wide (24 bits), BitWeaving/V spends more cycles extract- formance is stable across the code widths. The reason for this ing the matching values from the aggregation columns. As a result, behavior is because all the bits of a code in the HBP method are for this particular query, BitWeaving/H is faster than BitWeaving/V. located together. For the VBP method, all the bits of a code are We also evaluated the effects of using the BitWeaving methods stored in continuous space, and thus it is relatively fast to access as indices. In this method, the entire WHERE clause is evaluated all the bits and assemble the code. For the methods with the early using the corresponding BitWeaving methods on the columns of pruning technique, the bits of a code are distributed into various bit interest for this WHERE clause. Then, using the method described groups. Assembling a code requires access to data across multi- in Appendix A in [12] , the columns involved in the aggregation ple bit groups at different locations, which incurs many CPU cache (in the SELECT clause of the query) are fetched from the associ- misses, and thus significantly hurts the lookup performance. ated column store(s) for these attributes. These column stores use a Naive storage organization. 6.2 TPC-H Evaluation In Figure 12, these BitWeaving index-based methods are de- In this experiment, we use seven queries from the TPC-H bench- noted as BW/H-idx and BW/V-idx. As can be seen in Figure 12(a), mark [17]. These experiments were run against a TPC-H dataset BW/H-idx and BW/H have similar performance. The key differ- at scale factor 10. The total size of the database is approximately ence between these methods is whether the aggregation columns 10GB. First, we compare the performance of the various methods are accessed from either the BW/H format or from the Naive col- on the TPC-H scan query (Q6). This query is shown below: umn store. However, since using BW/H always results in accessing SELECT sum(l_extendedprice * l_discount) one cache line per lookup, its performance is similar to the lookup FROM lineitem with the Naive column store organization (i.e the BW/H-idx case). WHERE l_shipdate BETWEEN Date and Date + 1 year On the other hand, the BW/V-idx method is about 30% faster than and l_discount BETWEEN Discount - 0.01 the BW/V method. The reason for this behavior is that the verti- and Discount + 0.01 and l_quantity < Quantity cal bit layout in BW/V results in looking up data across multiple As per the TPC-H specifications for the domain size for each of cache lines for each aggregate column value, whereas the BW/V- the columns/attributes in this query, the column l_shipdate, idx method fetches these attribute values from the Naive column l_discount, l_quantity, l_extendedprice are en- store, which requires accessing only one cache line for each aggre- coded with 12 bits, 4 bits, 6 bits, and 24 bits, respectively. The gate column value. selectivity of this query is approximately 2%. Next, we selected six TPC-H join queries (Q4, Q5, Q12, Q14, Figure 12(a) shows the time breakdown for the scan and the ag- Q17, Q19), and materialized the join component in these queries. gregation operations for the BitWeaving and the other methods. Then, we ran scan operations on the pre-joined materialized tables. Not surprisingly, the Naive method is the slowest. The SIMD- Here, we report the results of these scan operations on these ma- scan method only achieves about 20% performance improvement terialized tables. The widths of the columns involved in the se- over the Naive method, mainly because the SIMD-scan method per- lection operations (i.e. the WHERE clause in the SQL query on forms relatively poorly when evaluating the BETWEEN predicates the pre-joined materialized tables) ranges from 2 bits to 12 bits. (see Appendix D.1 in [12]). Evaluating a BETWEEN predicate is All these queries, except for query Q19, contain a predicate clause complicated/expensive with the SIMD-scan method since the re- that is a conjunction of one to four predicates. Query Q19 has sults of the SIMD computations are always stored in the original a more complex predicate clause, which includes a disjunction of input registers. Consequently, we have to make two copies for each three predicate clauses, each of which is a conjunction of six pred- attribute value, and compare each copy with the lower and upper icates. These queries contain a variety of predicates, including bound constants in the BETWEEN predicate, respectively. <, >, =, <>, BETWEEN, and IN. Some queries also involve The BL method runs at a much higher speed compared to the predicates that perform comparisons between two columns. The Naive and the SIMD methods. However, compared to BitWeav- projection clauses of these six queries contain one to three columns ing/H, the BL method uses more instructions to implement its func- with widths that vary from 3 to 24 bits. tionality of parallel processing on packed data and the conversion Figure 12(b) plots the speedup of all the methods over the Naive process to produce the result bit vector, which hinders its scan per- method for the six TPC-H queries. For most queries, the BitWeav- formance. ing methods are over one order of magnitude faster than the Naive Note that both the BitWeaving methods (BW/H and BW/V) out- method. perform all existing methods. As the column l_extendedprice By comparing the performance of the BW/H and the BW/V meth-

12.ods, we observe that the answer to the question of which BitWeav- common access patterns, and both methods match the complex- ing method has higher performance depends on many query char- ity bound for bit-parallel scans. Our experimental studies show acteristics. For Q4 and Q14, the BW/H method is slightly faster that the BitWeaving techniques are faster than the state-of-the-art than the BW/V method, because the BW/H method performs bet- scan methods, and in some cases by over an order of magnitude. ter on the BETWEEN predicate on relative narrow columns (see Ap- For future work, we plan to explore methods to use BitWeaving pendix D.1 in [12]). Query Q4 contains two predicate clauses, one effectively in other database operations, such as joins and aggre- of which is a BETWEEN predicate. In contrast, Query Q14 contains gates. We also plan to study how to apply the BitWeaving technique only one predicate clause, which is a BETWEEN predicate. For within the context of broader automatic physical database design queries Q5, Q12, Q17, and Q19, the BW/V method outperforms issues (e.g. replication, and forming column groups [8] automat- the BW/H method as these four queries contain more than three ically), multi-threaded scans, concurrent scans, other compression predicate clauses. Although some of these queries also contain schemes, and considering the impact of BitWeaving on query opti- the BETWEEN predicate(s), the early pruning technique (of BW/V) mization. speedups the scans with on BETWEEN predicates when performed at the end of a series of column-scalar scans. In general, we ob- Acknowledgments serve that the BW/V method has higher performance for queries with predicates that involve many columns, involve wider columns, We would like to thank Jeffrey Naughton, Karu Sankaralingam, and have highly selective predicates. Craig Chasseur, Qiang Zeng, and the reviewers of this paper for Using the BW/V method as an index improves the performance their insightful feedback on an earlier draft of this paper. This by about 15% for Q5 and Q17 (both queries contain wider columns work was supported in part by the National Science Foundation un- in their projection lists), and has no significant gain for the other der grants IIS-1110948 and IIS-1250886, and a gift donation from queries. In general, we observe that it is not very productive to use Google. the BW/H method as an index, since it already has a low lookup cost as a base storage format. For the BW/V method, using it as an 9. REFERENCES index improves the performance for some queries by avoiding the [1] D. J. Abadi, S. Madden, and M. Ferreira. Integrating compression and execution in column-oriented database systems. In SIGMOD, pages 671–682, 2006. slow lookups that are associated with using the BW/V method as [2] R. Barber, P. Bendel, M. Czech, O. Draese, F. Ho, N. Hrle, S. Idreos, M.-S. the base storage format. Kim, O. Koeth, J.-G. Lee, T. T. Li, G. M. Lohman, K. Morfonios, R. Müller, We note that there are interesting issues here in terms of how to K. Murthy, I. Pandis, L. Qiao, V. Raman, R. Sidle, K. Stolze, and S. Szabo. Business analytics in (a) Blink. IEEE Data Eng. Bull., 35(1):9–14, 2012. pick between BitWeaving/H vs. BitWeaving/V, and whether to use [3] C. Binnig, S. Hildenbrand, and F. Färber. Dictionary-based order-preserving the BitWeaving methods as an index of for base storage. Build- string compression for main memory column stores. In SIGMOD Conference, ing an accurate cost model that can guide these choices based on pages 283–296, 2009. workload characteristics is an interesting direction for future work. [4] W. Fang, B. He, and Q. Luo. Database compression on graphics processors. PVLDB, 3(1):670–680, 2010. [5] F. Färber, N. May, W. Lehner, P. Große, I. Müller, H. Rauhe, and J. Dees. The 7. RELATED WORK SAP HANA database – an architecture overview. IEEE Data Eng. Bull., 35(1):28–33, 2012. The techniques present in this paper are applicable to main-memory [6] M. Grund, J. Krüger, H. Plattner, A. Zeier, P. Cudré-Mauroux, and S. Madden. analytics DBMSs. In such DBMSs, data is often stored in com- Hyrise - a main memory hybrid storage engine. PVLDB, 4(2):105–116, 2010. pressed form. SAP HANA [5], IBM Blink [2,15], and HYRISE [10] [7] S. Idreos, F. Groffen, N. Nes, S. Manegold, K. S. Mullender, and M. L. Kersten. MonetDB: Two decades of research in column-oriented database architectures. use sorted dictionaries to encode values. Dynamic order-preserving IEEE Data Eng. Bull., 35(1):40–45, 2012. dictionary was proposed to encode strings [3]. Other light-weight [8] R. Johnson, V. Raman, R. Sidle, and G. Swart. Row-wise parallel predicate compression schemes can also be used for main-memory column- evaluation. PVLDB, 1(1):622–634, 2008. wise databases, such as [1, 4, 20]. [9] A. Kemper, T. Neumann, F. Funke, V. Leis, and H. Mühe. Hyper: Adapting columnar main-memory data management for transactional and query SIMD instructions can be used to speed up database operations [19], processing. IEEE Data Eng. Bull., 35(1):46–51, 2012. and the SIMD-scan [18] method is the state-of-the-art scan method [10] J. Krüger, C. Kim, M. Grund, N. Satish, D. Schwalb, J. Chhugani, H. Plattner, that uses SIMD. In this paper we compare BitWeaving with this P. Dubey, and A. Zeier. Fast updates on read-optimized databases using multi-core CPUs. PVLDB, 5(1):61–72, 2011. scan method. We note that BitWeaving can also be used in pro- [11] L. Lamport. Multiple byte processing with full-word instructions. Commun. cessing environments that don’t support SIMD instructions. ACM, 18(8):471–475, 1975. The BitWeaving/V methods shares similarity to the bit-sliced in- [12] Y. Li and J. M. Patel. BitWeaving: Fast scans for main memory data processing. dex [13], but the storage layout of a bit-sliced index is not opti- mized for memory access, as well as our proposed early pruning [13] P. E. O’Neil and D. Quass. Improved query performance with variant indexes. In SIGMOD, pages 38–49, 1997. technique. A follow-up work presented the algorithms that perform [14] Oracle Exalytics In-Memory Machine: A Brief Introduction, October 2011. arithmetic on bit-sliced indices [16]. Some techniques described in Oracle White Paper. that paper are also applicable to our BitWeaving/V method. [15] V. Raman, G. Swart, L. Qiao, F. Reiss, V. Dialani, D. Kossmann, I. Narang, and The BitWeaving/H method relies on the capability to process R. Sidle. Constant-time query processing. In ICDE, pages 60–69, 2008. [16] D. Rinfret, P. E. O’Neil, and E. J. O’Neil. Bit-sliced index arithmetic. In packed data in parallel. This technique was first proposed by Lam- SIGMOD, pages 47–57, 2001. port [11]. Recently, a similar technique [8] was used to evaluate [17] Transaction Processing Performance Council. TPC Benchmark H. Revision complex predicates in IBM’s Blink System [2]. 2.14.3. November 2011. [18] T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, and J. Schaffner. SIMD-scan: Ultra fast in-memory table scan using on-chip vector processing 8. CONCLUSIONS AND FUTURE WORK units. PVLDB, 2(1):385–394, 2009. [19] J. Zhou and K. A. Ross. Implementing database operations using SIMD With the increasing demand for main memory analytics data pro- instructions. In SIGMOD, pages 145–156, 2002. cessing, there is an critical need for fast scan primitives. This pa- [20] M. Zukowski, S. Héman, N. Nes, and P. A. Boncz. Super-scalar RAM-CPU per proposes a method called BitWeaving that addresses this need cache compression. In ICDE, page 59, 2006. by exploiting the parallelism available at the bit level in modern [21] M. Zukowski, M. van de Wiel, and P. A. Boncz. Vectorwise: A vectorized analytical DBMS. In ICDE, pages 1349–1350, 2012. processors. The two flavors of BitWeaving are optimized for two