Vectorization vs. Compilation in Query Execution

One of the findings is that compilation should always be combined with block-wise query execution. Another contribution is identifying three cases where “loop-compilation” strategies are inferior to vectorized execution. As such, a careful merging of these two strategies is proposed for optimal performance: either by incorporating vectorized execution principles into compiled query plans or using query compilation to create building blocks for vectorized processing.

1. Vectorization vs. Compilation in Query Execution Juliusz Sompolski1 Marcin Zukowski Peter Boncz2 VectorWise B.V. VectorWise B.V. Vrije Universiteit Amsterdam ABSTRACT conditions used in Joins and Select, calculations used to in- Compiling database queries into executable (sub-) programs troduce new columns in Project, and functions like MIN, provides substantial benefits comparing to traditional inter- MAX and SUM used in Aggregation. Most query inter- preted execution. Many of these benefits, such as reduced preters follow the so-called iterator-model (as described in interpretation overhead, better instruction code locality, and Volcano [5]), in which each operator implements an API that providing opportunities to use SIMD instructions, have pre- consists of open(), next() and close() methods. Each next() viously been provided by redesigning query processors to call produces one new tuple, and query evaluation follows a use a vectorized execution model. In this paper, we try to “pull” model in which next() is called recursively to traverse shed light on the question of how state-of-the-art compila- the operator tree from the root downwards, with the result tion strategies relate to vectorized execution for analytical tuples being pulled upwards. database workloads on modern CPUs. For this purpose, we It has been observed that the tuple-at-a-time model leads carefully investigate the behavior of vectorized and compiled to interpretation overhead: the situation that much more strategies inside the Ingres VectorWise database system in time is spent in evaluating the query plan than in actually three use cases: Project, Select and Hash Join. One of the calculating the query result. Additionally, this tuple-at-a- findings is that compilation should always be combined with time interpretation model particularly affects high perfor- block-wise query execution. Another contribution is iden- mance features introduced in modern CPUs [13]. For in- tifying three cases where “loop-compilation” strategies are stance, the fact that units of actual work are hidden in the inferior to vectorized execution. As such, a careful merging stream of interpreting code and function calls, prevents com- of these two strategies is proposed for optimal performance: pilers and modern CPUs from getting the benefits of deep either by incorporating vectorized execution principles into CPU pipelining and SIMD instructions, because for these compiled query plans or using query compilation to create the work instructions should be adjacent in the instruction building blocks for vectorized processing. stream and independent of each other. Related Work: Vectorized execution. MonetDB [2] reduced interpretation overhead by using bulk processing, 1. INTRODUCTION where each operator would fully process its input, and only Database systems provide many useful abstractions such then invoking the next execution stage. This idea has been as data independence, ACID properties, and the possibil- further improved in the X100 project [1], later evolving into ity to pose declarative complex ad-hoc queries over large VectorWise, with vectorized execution. It is a form of block- amounts of data. This flexibility implies that a database oriented query processing [8], where the next() method rather server has no advance knowledge of the queries until run- than a single tuple produces a block (typically 100-10000) time, which has traditionally led most systems to implement of tuples. In the vectorized model, data is represented as their query evaluators using an interpretation engine. Such small single-dimensional arrays (vectors), easily accessible an engine evaluates plans consisting of algebraic operators, for CPUs. The effect is (i) that the percentage of instruc- such as Scan, Join, Project, Aggregation and Select. The op- tions spent in interpretation logic is reduced by a factor erators internally include expressions, which can be boolean equal to the vector-size, and (ii) that the functions that per- 1 form work now typically process an array of values in a tight This work is part of a MSc thesis being written at Vrije loop. Such tight loops can be optimized well by compilers, Universiteit Amsterdam. e.g. unrolled when beneficial, and enable compilers to gener- 2 The author also remains affiliated with CWI Amsterdam. ate SIMD instructions automatically. Modern CPUs also do well on such loops, as function calls are eliminated, branches get more predictable, and out-of-order execution in CPUs Permission to make digital or hard copies of all or part of this work for often takes multiple loop iterations into execution concur- personal or classroom use is granted without fee provided that copies are rently, exploiting the deeply pipelined resources of modern not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to CPUs. It was shown that vectorized execution can improve republish, to post on servers or to redistribute to lists, requires prior specific data-intensive (OLAP) queries by a factor 50. permission and/or a fee. Related Work: Loop-compilation. An alternative strat- Proceedings of the Seventh International Workshop on Data Management on New Hardware (DaMoN 2011), June 13, 2011, Athens, Greece. egy for eliminating the ill effects of interpretation is using Copyright 2011 ACM 978-1-4503-0658-4 ...$10.00. Just-In-Time (JIT) query compilation. On receiving a query

2.for the first time, the query processor compiles (part of) the Algorithm 1 Implementation of an example query using query into a routine that gets subsequently executed. In vectorized and compiled modes. Map-primitives are stat- Java engines, this can be done through the generation of ically compiled functions for combinations of operations new Java classes that are loaded using reflection (and JIT (OP), types (T) and input formats (col/val). Dynamically compiled by the virtual machine) [10]. In C or C++, source compiled primitives, such as c000(), follow the same pat- code text is generated, compiled, dynamically loaded, and tern as pre-generated vectorized primitives, but may take executed. System R originally skipped compilation by gen- arbitrarily complex expressions as OP. erating assembly directly, but the non-portability of that // General v e c t o r i z e d p r i m i t i v e p a t t e r n approach led to its abandonment [4]. Depending on the m a p O P T c o l T c o l ( i d x n , T∗ r e s , T∗ c o l 1 , T∗ c o l 2 ) { compilation strategy, the generated code may either solve f o r ( i n t i =0; i <n ; i ++) the whole query (“holistic” compilation [7]) or only certain r e s [ i ]=OP( c o l 1 [ i ] , c o l 2 [ i ] ) ; } performance-critical pieces. Other systems that are known to use compilation are ParAccel [9] and the recently an- // The micro−benchmark u s e s d a t a s t o r e d i n : nounced Hyper system [6]. We will generalise the current const i d x LEN=1024; state-of-the-art using the term “loop-compilation” strategies, c h r tmp1 [ LEN ] , tmp2 [ LEN ] , one = 1 0 0 ; s h t tmp3 [ LEN ] ; as these typically try to compile the core of the query into a i n t tmp4 [ LEN ] ; // f i n a l r e s u l t single loop that iterates over tuples. This can be contrasted with vectorized execution, which decomposes operators in // V e c t o r i z e d code : map add chr val chr c o l (LEN, tmp1 ,& one , l d i s c o u n t ) ; multiple basic steps, and executes a separate loop for each map sub chr val chr c o l (LEN, tmp2 ,& one , l t a x ) ; basic step (“multi-loop”). map mul chr col chr c o l (LEN, tmp3 , tmp1 , tmp2 ) ; Compilation removes interpretation overhead and can lead map mul int col sht c o l (LEN, tmp4 , l e x t p r i c e , tmp3 ) ; to very concise and CPU-friendly code. In this paper, we // Compiled e q u i v a l e n t o f t h i s e x p r e s s i o n : put compilation in its most favourable light by assuming c000 ( idx n , int ∗ res , int ∗ col1 , chr ∗ col2 , chr ∗ c o l 3 ){ that compilation-time is negligible. This is often true in f o r ( i d x i =0; i <n ; i ++) r e s [ i ]= c o l 1 [ i ] ∗ ( ( 1 0 0 − c o l 2 [ i ] ) ∗ ( 1 0 0 + c o l 3 [ i ] ) ) ; OLAP queries which tend do be rather long-running, and } technologies such as JIT in Java and the LLVM framework for C/C++ [12] nowadays provide low (milliseconds) laten- cies for compiling and linking. 2. CASE STUDY: PROJECT Roadmap: vectorization vs. compilation. Vectorized Inspired by the expressions in Q1 of TPC-H we focus on expressions process one or more input arrays and store the the following simple Scan-Project query as micro-benchmark: result in an output array. Even though systems like Vec- torWise go through lengths to ensure that these arrays are SELECT l_extprice*(1-l_discount)*(1+l_tax) FROM lineitem CPU cache-resident, this materialization constitutes extra The scanned columns are all decimals with precision two. load/store work. Compilation can avoid this work by keep- VectorWise represents these internally as integers, using the ing results in CPU registers as they flow from one expression value multiplied by 100 in this case. After scanning and de- to the other. Also, compilation as a general technique is or- compression it chooses the smallest integer type that, given thogonal to any execution strategy, and can only improve the actual value domain, can represent the numbers. The performance. We used the VectorWise DBMS3 to inves- same happens for calculation expressions, where the desti- tigate three interesting use cases that highlight the issues nation type is chosen to be the minimal-width integer type, around the relationship between compilation and vectoriza- such that overflow is prevented. In the TPC-H case, the tion. price column is a 4-byte integer and the other two are single- As our first case, Section 2 shows how in Project expres- byte columns. The addition and subtraction produce again sion calculations loop-compilation tends to provide the best single bytes, their multiplication a 2-byte integer. The fi- results, but that this hinges on using block-oriented pro- nal multiplication multiplies a 4-byte with a 2-byte integer, cessing. Thus, compiling expressions in a tuple-at-a-time creating a 4-byte result. engine may improve some performance, but falls short of the gains that are possible. In Section 3, our second case Vectorized Execution. VectorWise executes functions is Select, where we show that branch mispredictions hurt inside a Project as so-called map-primitives. Algorithm 1 loop-compilation when evaluating conjunctive predicates. In shows the example code for a binary primitive. In this, chr, contrast, the vectorized approach avoids this problem as it sht, int and lng represent internal types for 1-, 2-, 4- and can transform control-dependencies into data-dependencies 8-byte integers and idx represents an internal type for rep- for evaluating booleans (along [11]). The third case in Sec- resenting sizes, indexes or offsets within columns of data tion 4 concerns probing large hash-tables, using a HashJoin (implemented as integer of required width). A val suffix in as an example. Here, loop-compilation gets CPU cache miss the primitive name indicates a constant (non-columnar) pa- stalled while processing linked lists (i.e., hash bucket chains). rameter. VectorWise pre-generates primitives for all needed We show that a mixed approach that uses vectorization for combinations of operations, types and parameter modes. All chain-following is fastest, and robust to the various parame- functions to support SQL fit in ca. 9000 lines of macro- ters of the key lookup problem. These findings lead to a set code, which expands into roughly 3000 different primitive of conclusions which we present in Section 5. functions producing 200K LOC and a 5MB binary. It is reasonable to expect that a compiler that claims sup- port for SIMD should be able to vectorize the trivial loop in 3 See Data storage and query the map_ functions. On x86 systems, gcc (we used 4.5.1) usu- evaluation in VectorWise is based on the X100 project [1]. ally does so and the Intel compiler icc never fails to. With

3. Vectorized, SIMD Wise can now generate on-the-fly: it combines vectorization Compiled, SIMD with compilation. Such a combination in itself is not new Vectorized, no−SIMD 50 (“compound primitives” [1]), and the end result is similar Compiled, no−SIMD Vectorized, SIMD−sht to what a holistic query compiler like HIQUE [7] generates Compiled, SIMD−sht for this Scan-Project plan, though it would also add Scan code. However, if we assume a HIQUE with a simple main- Interpreted memory storage system and take l_tax, etc. to be pointers 20 Cycles per tuple (log scale) into a column-wise stored table, then c000() would be the exact product of a “loop-compilation” strategy. The main benefit of the compiled approach is the absence 10 of load/stores. The vectorized approach needs 22 load/s- Comp. per tuple tores, but only the bottom three loads and top-level store are needed by the compiled strategy. Comparing vectorized 5 with compiled, we are surprised to see that the vectorized version is significantly faster (4 vs. 2 cycles/tuple). Close investigation of the generated code revealed that icc chooses in its SIMD generation to align all calculations on the widest 2 unit (here: 4-byte integers). Hence, the opportunities for 1- byte and 2-byte SIMD operations are lost. Arguably, this is a compiler bug or sub-optimality. In order to show what compilation could achieve, we re- 1 tried the same, now assuming that l_extprice would fit into 1 2 4 8 32 128 512 2K 8K 32K 128K 1M 4M a 2-byte integer; which are the “SIMD-sht” lines in in Fig- Vector size ure 1. Here, we see compilation beating vectorized execu- tion, as one would normally expect in Project tasks. A final Figure 1: Project micro-benchmark: with and observation is that compiled map-primitives are less sensi- without {compilation, vectorization, SIMD}. The tive to cache size (only to L2, not L1), such that a hybrid “SIMD-sht” lines work around the alignment sub- vectorized/compiled engine can use large vector-sizes. optimality in icc SIMD code generation. Tuple-at-a-time compilation. The black star and dia- mond in Figure 1, correspond to situations where primitive a single SSE instruction, modern x86 systems can add and functions work tuple-at-a-time. The non-compiled strategy subtract 16 single-byte values, multiply 8 single-byte inte- is called “interpreted”, here. An engine like MySQL, whose gers into a 2-byte result, or multiply four 4-byte integers. whole iterator interface is tuple-at-a-time, can only use such Thus, 16 tuples in this expression could be processed with 8 functions as it has just one tuple to operate on at any mo- SIMD instructions: one 8-bit addition, one 8-bit subtraction, ment in time. Tuple-at-a-time primitives are conceptually two 8-bit multiplications with 16-bit results, and four 32-bit very close to the functions in Algorithm 1 with vector-size n=1, but lack the for-loop. We implemented them separately multiplications. All of these instructions store one result and the first two operations load one input (with the other for fairness, because these for-loops introduce loop-overhead. parameter being a constant) while the other two load two This experiment shows that if one would contemplate intro- inputs. With these 22 (2*2+6*3) load/stores, we roughly ducing compilation in an engine like MySQL without break- need 30 instructions – in reality some additional instruc- ing its tuple-at-a-time operator API, the gain in expression tions for casts, padding and looping are required, such that calculation performance could be a factor 3 (23 vs 7 cycle/tu- the total for processing 16 tuples is around 60. In compari- ple). The absolute performance is clearly below what block- son, without SIMD we would need 4 instructions (2 loads, 1 wise query processing offers (7 vs 1.2cycle/tuple), mainly calculation, 1 store) per calculation such that a single tuple due to missed SIMD opportunities, but also because the requires 16 instructions, > 4 times more than with SIMD. virtual method call for every tuple inhibits speculative exe- The vectorized “SIMD” and “no-SIMD” lines in Figure 1, cution across tuples in the CPU. Worse, in tuple-at-a-time show an experiment in which expression results are calcu- query evaluation function primitives in OLAP queries only lated, using different vector-sizes. We used a 2.67GHz Ne- make up a small percentage (<5%) of overall time [1], be- halem core, on a 64-bits Linux machine with 12GB of RAM. cause most effort goes into the tuple-at-a-time operator APIs. The no-SIMD vectorized code, produced by explicitly dis- The overall gain of using compilation without changing the abling SIMD generation in the compiler (icc 11.04 , here), is tuple-at-a-time nature of a query engine can therefore at indeed 4 times slower than SIMD. The general trend of de- most be a few percent, making such an endeavour question- creasing interpretation overhead with increasing vector-size able. until around one thousand, and performance deteriorating due to cache misses if vectors start exceeding the L1 and L2 3. CASE STUDY: SELECT caches, has been described already in detail in [13, 1]. We now turn our attention to a micro-benchmark that tests conjunctive selections: Compiled Execution. The lower part of Algorithm 1 shows the compiled code that a modified version of Vector- WHERE col1 < v1 AND col2 < v2 AND col3 < v3 4 Compiler options are -O3 for gcc, supplemented for icc with Selection primitives shown in Algorithm 2 create vectors -xSSE4.2 -mp1 -unroll of indexes for which the condition evaluates to true, called

4.Algorithm 2 Implementations of < selection primitives. Algorithm 3 Four compiled implementations of a con- All algorithms return the number of selected items (re- junctive selection. Branching cannot be avoided in loop- turn j). For mid selectivities, branching instructions lead compilation, which combines selection with other opera- to branch mispredictions. In a vectorized implementation tions, without executing these operations eagerly. The four such branching can be avoided. VectorWise dynamically se- implementations balance between branching and eager com- lects the best method depending on the observed selectivity, putation. but in the micro-benchmark we show the results for both methods. // ( 1 . ) a l l p r e d i c a t e s b r a n c h i n g ( ” l a z y ”) i d x c 0 0 0 1 ( i d x n , T∗ r e s , T∗ c o l 1 , T∗ c o l 2 , T∗ c o l 3 , // Two v e c t o r i z e d i m p l e m e n t a t i o n s T∗ v1 , T∗ v2 , T∗ v3 ) { // ( 1 . ) medium s e l e c t i v i t y : non−b r a n c h i n g code i d x i , j =0; i d x s e l l t T c o l T v a l ( i d x n , T∗ r e s , T∗ c o l 1 , T∗ v a l 2 , f o r ( i =0; i <n ; i ++) idx ∗ s e l ){ i f ( c o l 1 [ i ] <∗ v1 && c o l 2 [ i ]<∗ v2 && c o l 3 [ i ]<∗ v3 ) i f ( s e l== NULL) { r e s [ j ++] = i ; f o r ( i d x i =0 , i d x j =0; i <n ; i ++) { return j ; // r e t u r n number o f s e l e c t e d i t e m s . r e s [ j ] = i ; j += ( c o l 1 [ i ] < v a l 2 [ 0 ] ) ; } } } else { // ( 2 . ) b r a n c h i n g 1 , 2 , non−b r . 3 f o r ( i d x i =0 , i d x j =0; i <n ; i ++) { i d x c 0 0 0 2 ( i d x n , T∗ r e s , T∗ c o l 1 , T∗ c o l 2 , T∗ c o l 3 , r e s [ j ] = s e l [ i ] ; j += ( c o l 1 [ s e l [ i ] ] < ∗ v a l 2 ) ; T∗ v1 , T∗ v2 , T∗ v3 ) { } i d x i , j =0; } f o r ( j =0; i <n ; i ++) return j ; i f ( c o l 1 [ i ] <∗ v1 && c o l 2 [ i ] < ∗ v2 ) { } r e s [ j ] = i ; j += c o l 3 [ i ] < ∗ v3 ; // ( 2 . ) e l s e : b r a n c h i n g s e l e c t i o n } i d x s e l l t T c o l T v a l ( i d x n , T∗ r e s , T∗ c o l 1 , T∗ v a l 2 , return j ; idx ∗ s e l ){ } i f ( s e l==NULL) { f o r ( i d x i =0 , i d x j =0; i <n ; i ++) // ( 3 . ) b r a n c h i n g 1 , non−b r . 2 ,3 i f ( c o l 1 [ i ] < ∗ v a l 2 ) r e s [ j ++] = i ; i d x c 0 0 0 3 ( i d x n , T∗ r e s , T∗ c o l 1 , T∗ c o l 2 , T∗ c o l 3 , } else { T∗ v1 , T∗ v2 , T∗ v3 ) { f o r ( i d x i =0 , i d x j =0; i <n ; i ++) i d x i , j =0; i f ( c o l 1 [ s e l [ i ] ] < ∗ v a l 2 ) r e s [ j ++] = s e l [ i ] ; f o r ( i =0; i <n ; i ++) } i f ( c o l 1 [ i ]< v1 ) { return j ; r e s [ j ] = i ; j += c o l 2 [ i ]<∗ v2 & c o l 3 [ i ]<∗ v3 } } return j ; // V e c t o r i z e d c o n j u n c t i o n i m p l e m e n t a t i o n : } const i d x LEN=1024; i d x s e l 1 [ LEN ] , s e l 2 [ LEN ] , r e s [ LEN ] , r e t 1 , r e t 2 , r e t 3 ; // ( 4 . ) non−b r a n c h i n g 1 , 2 , 3 , ( ”compute− a l l ”) r e t 1 = s e l l t T c o l t v a l (LEN, s e l 1 , c o l 1 ,& v1 , NULL ) ; i d x c 0 0 0 4 ( i d x n , T∗ r e s , T∗ c o l 1 , T∗ c o l 2 , T∗ c o l 3 , r e t 2 = s e l l t T c o l t v a l ( r e t 1 , s e l 2 , c o l 1 ,& v1 , s e l 1 ) ; T∗ v1 , T∗ v2 , T∗ v3 ) { r e t 3 = s e l l t T c o l t v a l ( r e t 2 , r e s , c o l 1 ,& v1 , s e l 2 ) ; i d x i , j =0; f o r ( i =0; i <n ; i ++) { res [ j ] = i ; j += ( c o l 1 [ i ]<∗ v1 & c o l 2 [ i ]<∗ v2 & c o l 3 [ i ]<∗ v3 ) } return j ; selection vectors. Selection primitives can also take a selec- } tion vector as parameter, to evaluate the condition only on elements of the vectors from the positions pointed to by the selection vector 5 . A vectorized conjunction is implemented by chaining selection primitives with the output selection strategy depending on the observed selectivity 6 . As such, its vector of the previous one being the input selection vector performance achieves the minimum of the vectorized branch- of the next one, working on a tightening subset of the origi- ing and non-branching lines in Figure 2. nal vectors, evaluating this conjunction lazily only on those In this experiment, each of the columns col1, col2, col3 elements for which the previous conditions passed. is an integer column, and the values v1, v2 and v3 are con- Each condition may be evaluated with one of two imple- stants, adjusted to control the selectivity of each condition. mentations of selection primitive. The naive “branching” Here, we keep the selectivity of each branch equal, hence to implementation of selection evaluates conditions lazily and the cube root of the overall selectivity, which we vary from branches out if any of the predicates fails. If the selectivity 0 to 1. We performed the experiment on 1K input tuples. of conditions is neither very low or high, CPU branch predic- Figure 2 shows that compilation of conjunctive Select is tors are unable to correctly guess the branch outcome. This inferior to the pure vectorized approach. The lazy compiled prevents the CPU from filling its pipelines with useful future program does slightly outperform vectorized branching, but code and hinders performance. In [11] it was shown that a for the medium selectivities branching is by far not the best branch (control-dependency) in the selection code can be option. The gist of the problem is that the trick of (i) con- transformed into a data dependency for better performance. verting all control dependencies in data dependencies while The sel_lt functions in Algorithm 2 contain both ap- still (ii) avoiding unnecessary evaluation, cannot be achieved proaches. The VectorWise implementation of selections uses in a single loop. If one avoids all branches (the “compute-all” a mechanism that chooses either the branch or non-branch approach in Algorithm 3), all conditions always get evalu- ated, wasting resources if a prior condition already failed. 5 In fact, other primitives are also able to work with selection 6 vectors, but it was removed from code snippets where not It even re-orders dynamically the conjunctive predicates necessary for the discussed experiments. such that the most selective is evaluated first.

5. 25 Algorithm 4 Vectorized implementation of hash probing. m a p h a s h T c o l ( i d x n , i d x ∗ r e s , T∗ c o l 1 ) { 20 f o r ( i d x i =0; i <n ; i ++) r e s [ i ] = HASH( c o l 1 [ i ] ) ; } Cycles per tuple 15 map rehash idx col T col ( idx n , idx ∗ res , i d x ∗ c o l 1 , T∗ c o l 2 ) { f o r ( i d x i =0; i <n ; i ++) r e s [ i ] = c o l 1 [ i ] ˆ HASH( c o l 2 [ i ] ) ; 10 } m a p f e t c h i d x c o l T c o l ( i d x n , T∗ r e s , Vectorized, branching i d x ∗ c o l 1 , T∗ base , i d x ∗ s e l ) { Vectorized, non−branching if ( sel ) { 5 Comp., if(1 && 2 && 3){} (lazy) f o r ( i d x i =0; i <n ; i ++) Comp., if(1 && 2) { 3 non−br } r e s [ s e l [ i ] ] = base [ c o l 1 [ s e l [ i ] ] ] ; Comp., if(1) { 2&3 non−br } } e l s e { /∗ s e l == NULL, o m i t t e d ∗/ } } Comp., 1&2&3 non−br. (compute all) 0 m a p c h e c k T c o l i d x c o l c o l T c o l ( idx n , chr ∗ res , T∗ keys , T∗ base , i d x ∗ pos , i d x ∗ s e l ) { if ( sel ) { 800 f o r ( i d x i =0; i <n ; i ++) res [ sel [ i ] ] = ( k e y s [ s e l [ i ] ] ! = b a s e [ pos [ s e l [ i ] ] ] ) ; 600 } e l s e { /∗ s e l == NULL, o m i t t e d ∗/ } Total br. misp. } m a p r e c h e c k c h r c o l T c o l i d x c o l T c o l ( idx n , chr ∗ res , chr ∗ col1 , 400 T∗ keys , T∗ base , i d x ∗ pos , i d x ∗ s e l ) { if ( sel ) { f o r ( i d x i =0; i <n ; i ++) 200 res [ sel [ i ] ] = col1 [ sel [ i ] ] | | ( k e y s [ s e l [ i ] ] ! = b a s e [ pos [ s e l [ i ] ] ] ) ; } e l s e { /∗ s e l == NULL, o m i t t e d ∗/ } } 0 h t l o o k u p i n i t i a l ( i d x n , i d x ∗ pos , i d x ∗ match , 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 i d x ∗ H, i d x ∗ B) { f o r ( i d x i =0 ,k =0; i <n ; i ++) { Selectivity of each condition // s a v i n g found c h a i n head p o s i t i o n i n HT pos [ i ] = B [H [ i ] ] ; // s a v i n g t o a s e l . v e c t o r i f non−z e r o Figure 2: Conjunction of selection conditions: total i f ( pos [ i ] ) { match [ k++] = i ; } } cycles and branch mispredictions vs. selectivity } h t l o o k u p n e x t ( i d x n , i d x ∗ pos , i d x ∗ match , idx ∗ next ) { f o r ( i d x i =0 ,k =0; i <n ; i ++) { One can try mixed approaches, branching on the first pred- // a d v a nc i n g t o n e x t i n c h a i n icates and using data dependency on the remaining ones. pos [ match [ i ] ] = n e x t [ pos [ match [ i ] ] ] ; They perform better in some selectivity ranges, but main- // s a v i n g t o a s e l . v e c . i f non−empty i f ( pos [ match [ i ] ] ) { match [ k++] = match [ i ] ; } tain the basic problems – their worst behavior is when the } selectivity after branching predicates is around 50%. } procedure HTprobe(V, B[0..N − 1], K1..k (in), R1..v (out)) 4. CASE STUDY: HASH JOIN // Iterative hash-number computation Our last micro-benchmark concerns Hash Joins: H ← map hash(K1 ) for each remaining key vectors Ki do SELECT build.col1, build.col2, build.col3 H ← map rehash(H, Ki ) WHERE probe.key1 = build.key1 AND probe.key2 = build.key2 H ← map bitwiseand(H, N − 1) // Initial lookup of candidate matches FROM probe, build P os, M atch ← ht lookup initial(H, B) We focus on an equi-join condition involving keys consist- while M atch not empty do // Candidate value verification ing of two (integer) columns, because such composite keys Check ← map check(K1 , Vkey1 , P os, M atch) are more challenging for vectorized executors. This discus- for each remaining key vector Ki do sion assumes simple bucket-chaining, such as used in Vec- Check ← map recheck(Check, Ki , Vkeyi , P os, M atch) torWise, presented in Figure 3. This means that keys are M atch ← sel nonzero(Check, M atch) hashed on buckets in an array B with size N which is a power // Chain following of two. Each bucket contains the offset of a tuple in a value P os, M atch ← ht lookup next(P os, M atch, Vnext ) space V . This space can either be organized using DSM or Hits ← sel nonzero(P os) // Fetching the non-key attributes NSM layout; VectorWise supports both [14]. It contains the for each result vector Ri do values of the build relation, as well as a next-offset, which Ri ← map f etch(P os, Vvaluei , Hits) implements the bucket chain. A bucket may have a chain of length > 1 either due to hash collisions, or because there are multiple tuples in the build relation with the same key.

6. hash bucket Algorithm 5 Fully loop-compiled hash probing: for each values heads V (DSM) NSM tuple, read hash bucket from B, loop through a chain H B next key1 key2 val1 val2 val3 in V, fetching results when the check produces a match 0 2 0 x x x x x x 3 1 0 1 0 103 1003 46 a May f o r ( i d x i =0 , j =0; i <n ; i ++) { i d x pos , hash = HASH( key1 [ i ] ) ˆHASH( key2 [ i ] ) ; 0 2 3 2 0 100 1000 124 x Jan i f ( pos = B [ hash &(N− 1 ) ] ) do { 2 3 4 3 0 102 1002 73 v Apr i f ( key1 [ i ]==V [ pos ] . key1 && 3 1 203 2003 736 w Oct key2 [ i ]==V [ pos ] . key2 ) { 4 r e s 1 [ i ] = V[ pos ] . c o l 1 ; hash value computation r e s 2 [ i ] = V[ pos ] . c o l 2 ; r e s 3 [ i ] = V[ pos ] . c o l 3 ; break ; // found match Figure 3: Bucket-chain hash table as used in Vec- } torWise. The value space V presented in the figure } while ( pos = V. n e x t [ pos ] ) ; // n e x t is in DSM format, with separate array for each at- } tribute. It can also be implemented in NSM, with data stored tuple-wise. 700 Vectorized DSM 600 Vectorized NSM Vectorized Hash Probing. For space reasons we only Compiled NSM discuss the probe phase in Algorithm 4, we show code for Compiled DSM 500 the DSM data representation and we focus on the scenario Cycles per tuple 400 when there is at most one hit for each probe tuple (as is common with relations joined with a foreign-key referen- 300 tial constraint). Probing starts by vectorized computation of a hash number from a key in a column-by-column fash- 200 ion using map-primitives. A map_hash_T_col primitive first hashes each key of type T onto a lng long integer. If the key is composite, we iteratively refine the hash value using 100 a map_rehash_lng_col_T_col primitive, passing in the previ- 0 ously computed hash values and the next key column. A 10 TLB misses per tuple bitwise-and map-primitive is used to compute a bucket num- 8 ber from the hash values: H&(N -1). 6 To read the positions of heads of chains for the calculated buckets we use a special primitive ht_lookup_initial. It be- 4 haves like a selection primitive, creating a selection vector 2 M atch of positions in the bucket number vector H for which 0 a match was found. Also, it fills the P os vector with posi- 1 2 3 4 5 6 7 8 9 10 tions of the candidate matching tuples in the hash table. If the value (offset) in the bucket is 0, there is no key in the Number of fetched columns hash table – these tuples store 0 in P os and are not part of M atch. Figure 4: Fetching columns of data from a hash ta- Having identified the indices of possible matching tuples, ble: cycles per tuple and total TLB misses the next task is to “check” if the key values actually match. This is implemented using a specialized map primitive that combines fetching a value by offset with testing for non- P os with selection vector Hits becomes a pivot vector for equality: map_check. Similar to hashing, composite keys fetching. This pivot is subsequently used to fetch (non-key) are supported using a map_recheck primitive which gets the result values from the build value area into the hash join boolean output of the previous check as an extra first pa- result vectors; using one fetch primitive for each result col- rameter. The resulting booleans mark positions for which umn. the check failed. The positions can then be selected using a select sel_nonzero primitive, overwriting the selection vector Partial Compilation. There are three opportunities to M atch with positions for which probing should advance to apply compilation to vectorized hashing. The first is to com- the “next” position in the chain. Advancing is done by a spe- pile the full sequence of hash/rehash/bitwise-and and bucket cial primitive ht_lookup_next, which for each probe tuple in fetch into a single primitive. The second combines the check M atch fills P os with the next position in the bucket-chain of and iterative re-check (in case of composite keys) and the select > 0 into a single select-primitive. Since the test for a V . It also guards for ends of chain by reducing M atch to its key in a well-tuned hash table has a selectivity around 75%, subset for which the resulting position in P os was non-zero. we can restrict ourselves to a non-branching implementation. The loop finishes when the M atch selection vector be- These two opportunities re-iterate the compilation benefits comes empty, either because of reaching end of chain (ele- of Project and Select primitives, as discussed in the previous ment in P os equals 0, a miss) or because checking succeeded sections, so we omit the code. (element in P os pointing to a position in V , a hit). The third opportunity is in the fetch code. Here, we can Hits can be found by selecting the elements of P os which generate a composite fetch primitive that, given a vector of ultimately produced a match with a sel_nonzero primitive. positions, fetches multiple columns. The main benefit of this

7. Varying hash table size Varying selectivity on HT with 16M tuples Varying number of buckets on HT with 16M tuples 600 600 Vectorized, DSM 890 2500 Vectorized, NSM 828 5293 Compiled, DSM 4419 Compiled, NSM Partial Compilation 500 500 2000 400 400 1500 Cycles per tuple 300 300 1000 200 200 500 100 100 0 0 0 4K 16K 64K 256K 1M 4M 16M 64M 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 2x 1x 1/2x 1/4x 1/8x 1/16x 1/32x Hash table size (tuples) Selectivity of matching (fraction of matched tuples) Number of buckets / hash table size Figure 5: Hash table probing. Left: different sizes of the hash table value space V . Middle: different match rates of the probe. Right: different sizes of the bucket array B w.r.t. value space V (different chain lengths). is obtained in case of NSM organisation of the value space For random access, this is hard to achieve, but the deeply V . Vectorization fetches values column-at-a-time, hence pipelined out-of-order nature of modern CPUs does help. passes over the NSM value space as many times as there That is, if a load stalls, the CPU might be able to speculate are result columns (here: 3), accessing random positions ahead into upstream instructions and reach more loads. The from the value space on each pass. On efficient vector-sizes, Intel Nehalem architecture can have four outstanding loads, the amount of random accesses is surely larger than TLB improving bandwidth by a factor four7 . Success is not guar- cache size and may even exceed the amount of cache lines, anteed, since the instruction speculation window of a CPU such that TLB and cache trashing occurs, with pages and is limited, depends on branch prediction, and only indepen- cache lines from the previous pass being already evicted from dent upstream instructions can be taken into execution. the caches before the next. The compiled fetch fetches all columns from the same position at once, achieving more data The Hard-To-Understand Part. The crux here is that locality. Figure 4 shows that in normal vectorized hashing the vectorized fetch primitive trivially achieves whatever performance of NSM and DSM is similar, but compilation maximum outstanding loads the CPU supports, as it is a makes NSM clearly superior. tight loop of independent loads. The same holds for the partially compiled variants. The fully compiled hash probe, Full Compilation. It is possible to create a loop-compiled however, can run aground while following the bucket chain. version of the full join, like e.g. proposed in HIQUE [7]. Al- Its performance is only good if the CPU can speculate ahead gorithm 5 shows the hash probe section of such an algorithm, across multiple probe tuples (execute concurrently instruc- which we also tested. It loops over probe keys, and for each tions from multiple for-loop iterations on i ). That depends probe key fetches the corresponding bucket, then iterates on the branch predictor predicting the while(pos..) to be over the bucket-chain, checking the key for equality, and if false, which will happen in join key distributions where there equal, fetches the needed result columns. We try to explain are no collisions. If, however, there are collisions or if the in the following why this fully compiled algorithm is inferior build relation has multiple identical keys, the CPU will stall to the vectorized alternative with partial compilation. with a single outstanding memory request (V[pos]), because the branch predictor will make it stay in the while-loop, and Parallel Memory Access. Because memory latency is it will be unable to proceed as the value of pos =[pos] not improving much (∼100ns), and cache line granularities is unknown because it depends on the current cache/TLB must remain limited (64bytes), memory bandwidth on mod- miss. A similar effect has been described in the context of ern hardware is no longer simply the division between these using explicit prefetching instructions in hash-joins [3]. This two. A single Nehalem core can achieve a factor 10 more than this 0.64GB/s, thanks to automatic CPU prefetching 7 Speculation-friendly code is thus more effective than man- on sequential access. Performance thus crucially depends on ual prefetching, which tends to give only minor improve- having multiple outstanding memory requests at all times. ment, and is hard to tune/maintain for multiple platforms.

8.effect causes fully compiled hashing to be four times slower while maintaining clear modularization, simplified testing than vectorized hashing in the worst case. and easy performance- and quality-tracking, which are key properties of a software product. Experiments. Figure 5 shows experiments for hash prob- ing using the vectorized, fully and partially compiled ap- proaches, using both DSM and NSM as the hash table (V ) 6. REFERENCES representation. We vary hash table size, selectivity (frac- [1] P. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: tion of probe keys that match something), and bucket chain Hyper-Pipelining Query Execution. In Proc. CIDR, length; which have default values resp. 16M, 1.0 and 1. The Asilomar, CA, USA, 2005. left part shows that when the hash table size grows, per- [2] P. A. Boncz. Monet: A Next-Generation DBMS formance deteriorates; it is well understood that cache and Kernel For Query-Intensive Applications. Ph.d. thesis, TLB misses are to blame, and DSM achieves less locality Universiteit van Amsterdam, Amsterdam, The than NSM. The middle graph shows that with increasing hit Netherlands, May 2002. rate, the cost goes up, which mainly depends on increasing [3] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. fetch work for tuple generation. The compiled NSM fetch Mowry. Improving hash join performance through alternatives perform best, as explained. The right graph prefetching. In Proc. ICDE, Boston, MA, USA, 2004. shows what happens with increasing chain length. As dis- [4] D. Chamberlin et al. A history and evaluation of cussed above, the fully compiled (NSM) variant suffers most, System R. Commun. ACM, 24(10):632–646, 1981. as it gets no parallel memory access. The overall best solu- [5] G. Graefe. Volcano - an extensible and parallel query tion is partially compiled NSM, thanks to its efficient com- evaluation system. IEEE TKDE, 6(1):120–135, 1994. piled multi-column fetching (and to a lesser extent efficient [6] A. Kemper and T. Neumann. HyPer: Hybrid OLTP checking/hashing, in case of composite keys) and its parallel and OLAP High Performance Database System. memory access, during lookup, fetching and chain-following. Technical report, Technical Univ. Munich, TUM-I1010, May 2010. 5. CONCLUSIONS [7] K. Krikellas, S. Viglas, and M. Cintra. Generating For database architects seeking a way to increase the com- code for holistic query evaluation. In ICDE, pages putational performance of a database engine, there might 613–624, 2010. seem to be a choice between vectorizing the expression en- [8] S. Padmanabhan, T. Malkemus, R. Agarwal, and gine versus introducing expression compilation. Vectoriza- A. Jhingran. Block Oriented Processing of Relational tion is a form of block-oriented processing, and if a system Database Operations in Modern Computer already has an operator API that is tuple-at-a-time, there Architectures. In Proc. ICDE, Heidelberg, Germany, will be many changes needed beyond expression calculation, 2001. notably in all query operators as well as in the storage layer. [9] ParAccel Inc. Whitepaper. The ParAcel Analytical If high computational performance is the goal, such deep Database: A Technical Overview, Feb 2010. changes cannot be avoided, as we have shown that if one would keep adhering to a tuple-a-time operator API, expres- [10] J. Rao, H. Pirahesh, C. Mohan, and G. M. Lohman. sion compilation alone only provides marginal improvement. Compiled Query Execution Engine using JVM. In Our main message is that one does not need to choose be- Proc. ICDE, Atlanta, GA, USA, 2006. tween compilation and vectorization, as we show that best [11] K. A. Ross. Conjunctive selection conditions in main results are obtained if the two are combined. As to what this memory. In Proc. PODS, Washington, DC, USA, 2002. combining entails, we have shown that ”loop-compilation” [12] The LLVM Compiler Infrastructure . techniques as have been proposed recently can be inferior to plain vectorization, due to better (i) SIMD alignment, [13] M. Zukowski. Balancing Vectorized Query Execution (ii) ability to avoid branch mispredictions and (iii) parallel with Bandwidth-Optimized Storage. Ph.D. Thesis, memory accesses. Thus, in such cases, compilation should Universiteit van Amsterdam, Sep 2009. better be split in multiple loops, materializing intermediate [14] M. Zukowski, N. Nes, and P. Boncz. DSM vs. NSM: vectorized results. Also, we have signaled cases where an in- CPU Performance Tradeoffs in Block-Oriented Query terpreted (but vectorized) evaluation strategy provides op- Processing. 2008. timization opportunities which are very hard with compila- tion, like dynamic selection of a predicate evaluation method or predicate evaluation order. Thus, a simple compilation strategy is not enough; state- of-the art algorithmic methods may use certain complex transformations of the problem at hand, sometimes require run-time adaptivity, and always benefit from careful tun- ing. To reach the same level of sophistication, compilation- based query engines would require significant added com- plexity, possibly even higher than that of interpreted en- gines. Also, it shows that vectorized execution, which is an evolution of the iterator model, thanks to enhancing it with compilation further evolves into an even more efficient and more flexible solution without making dramatic changes to the DBMS architecture. It obtains very good performance