Generating Code for Holistic Query Evaluation

We present the application of customized code generation to database query evaluation. The idea is to use a collection of highly efficient code templates and dynamically instantiate them to create query- and hardware-specific source code. The source code is compiled and dynamically linked to the database server for processing. Code generation diminishes the bloat of higher-level programming abstractions necessary for implementing generic, interpreted, SQL query engines.

1. Generating code for holistic query evaluation Konstantinos Krikellas 1, Stratis D. Viglas 2, Marcelo Cintra 3 School of Informatics, University of Edinburgh, UK 1 2 3 Abstract— We present the application of customized code during query execution. generation to database query evaluation. The idea is to use Existing work has identified the data layout as the main a collection of highly efficient code templates and dynamically bottleneck that prevents contemporary processor designs with instantiate them to create query- and hardware-specific source code. The source code is compiled and dynamically linked to multiple levels of cache memories from exploiting their full the database server for processing. Code generation diminishes potential in database workloads. We argue that changing the the bloat of higher-level programming abstractions necessary for storage layer is a radical departure from existing designs. implementing generic, interpreted, SQL query engines. At the We identify the biggest problem with the design of a query same time, the generated code is customized for the hardware it engine to be the compilation of SQL queries in operator plans will run on. We term this approach holistic query evaluation. We present the design and development of a prototype system called and the generality of the common operator interface, namely HIQUE, the Holistic Integrated Query Engine, which incorporates the iterator model. The latter results in a poor utilization of our proposals. We undertake a detailed experimental study of the CPU resources. Its abstract implementation and the frequent system’s performance. The results show that HIQUE satisfies its use of function calls inflate the number of instructions and design objectives, while its efficiency surpasses that of both well- memory accesses required for query evaluation. The use of established and currently-emerging query processing techniques. generic code does not permit its customization according to the characteristics of both the executed queries and the I. I NTRODUCTION hardware platform. SQL and query processing in main memory, This paper presents the application of customized code gen- however, exhibit a strong potential for exploiting just-in-time eration for the purpose of efficient database query processing. compilation. We take this idea to the extreme. Our approach stems from template-based programming. The Code generation for query evaluation. Ideally, query pro- idea is to use code templates for the various query processing cessing code should optimally use the cache hierarchy and algorithms and then dynamically instantiate them and compose reduce the number of instructions needed for query evaluation. them in a single piece of source code that evaluates the query. At the same time, one would want to keep the compositional Dynamic template instantiation removes the deficiencies of all aspects of the iterator model and not affect separate system high-level abstractions that are necessary for implementing modules. To that end, we introduce a novel query evaluation generic query evaluators in current query engine designs. technique that we term holistic query evaluation. The idea Moreover, since the code is dynamically generated, it can be is to inject a source code generation step in the traditional customized to exploit the architectural characteristics of the query evaluation process. The system should look at the entire hardware it will execute on. We term this approach holistic query and optimize it holistically, by generating query- and query evaluation, as the key premise is that one should take hardware-specific source code, compiling it, and executing it. a holistic view of both the query being evaluated and the Our approach has multiple benefits: (a) the number of host hardware. The resulting performance advantage in main- function calls during query evaluation is minimized; (b) the memory execution is substantial; for instance, it reaches a generated code exhibits increased data locality, therefore mak- factor of 167 over established database technology in TPC-H ing optimal use of cache-resident data; (c) code generation and Query 1. The novelty we claim is that template-based code compilation allow the use of compiler optimization techniques generation can be generalized to efficiently process any type targeting each individual query, an extra optimization level on of query without affecting orthogonal aspects of the database top of conventional query optimization; and (d) the generated system. code approaches the performance of hard-coded evaluation Motivation. Traditionally, query processing algorithms have plans. The model is flexible and does not affect other or- focused on minimizing disk I/O while their in-memory effi- thogonal system aspects, such as storage management and ciency has been considered to be a secondary priority. For concurrency control. contemporary servers with large amounts of memory, it is Using this framework, we have developed a prototype holis- conceivable for a large portion of the on-disk data –or even tic query engine and compared its performance to both iterator- the entire database– to fit in main memory. In such cases, the based solutions and existing database systems. The results difference in access latency between the processor’s registers (a) quantify the advantage of per-query code generation over and main memory becomes the performance bottleneck [1]. generic query operator implementations, and (b) demonstrate To optimize such workloads, one needs to carefully “craft” a superiority of the holistic approach over both iterator-based the executed code so as to minimize the processor stall time and hardware-conscious systems in a subset of the TPC-H 978-1-4244-5446-4/10/$26.00 © 2010 IEEE 613 ICDE Conference 2010

2. serve most of the processor’s instruction and data requests, the remaining ones (cache misses) are expensive and can become a performance bottleneck. To aid the caches, latest processors incorporate hardware prefetchers that identify the instructions and data likely to be accessed shortly and prefetch them into the appropriate cache level [10]. Current CPUs, as shown in Figure 1, employ multiple prefetching units tightly coupled with the cache hierarchy. The simplest ones are capable of detecting sequen- tial patterns. The more advanced ones closely monitor the addresses touched by the processor to identify more complex access patterns by (a) keeping the history of accesses for a small number of the most frequently accessed addresses, and Fig. 1. The architecture of modern CPUs (b) tracking the distance (stride) between successive fetches. To quantify the impact of hardware prefetching, we mea- benchmark, therefore proving its viability as an alternative sured the data access latency for sequential and random access query engine design. inside the memory hierarchy of a system using an Intel Core 2 The rest of this paper is organized as follows: in Section II Duo 6300 processor, clocked at 1.86GHz.1 The results showed we describe current processor and compiler technologies and that, while all accesses to the D1-cache have a uniform cost of conventional query engine design. Related work is presented 3 CPU cycles, there is a significant difference when switching in Section III. In Section IV we present the design of a system from sequential to random access in the L2-cache: the former based on holistic evaluation. Our approach to code generation takes 9 cycles and the latter 14 cycles. The gap grows when a and our query evaluation algorithms are given in Section V, data access cannot be served from caches, as sequential access while in Section VI we experimentally evaluate the perfor- in main memory costs 28 cycles, while random access costs mance of the proposed model. Finally, we draw conclusions 77 cycles or more. and identify future research directions in Section VII. B. Drawbacks of iterators II. BACKGROUND Most query engines are based on the iterator model [11]. A. Hardware primer This model provides an abstract interface used for streaming tuples across query operators, in the form of three functions: Modern CPUs process multiple instructions simultaneously (a) open(), designating the start of information exchange through pipelining and superscalar execution [14]. If the CPU and initialization of internal operator state, (b) get next(), supports out-of-order execution, instructions waiting for data for propagating tuples between operators, and (c) close(), transfer or other instructions to execute first yield for ready- denoting the end of processing and allowing the operators to-execute instructions following in the pipeline. Out-of-order to free up their resources. Query plans are then organized execution hides stalls, but offers limited advantages when the as pipelined binary trees of operators communicating through executed code includes consecutive memory requests and long iterator calls. data and control dependency chains, which are common in Though generic, the iterator model exhibits a large number database workloads. of function calls. For each in-flight tuple the system makes at The substantial latency for transferring data from main least two calls: one for the caller to request it and one for the memory to the processor’s registers is countered with multiple callee to propagate it. The number of function calls further levels of cache memory. The closer a level is to the processor, grows as iterators need to be generic. Their functions may be the smaller and faster it is to access it. Cached data is organized virtual to be dynamically bound to the data types they process, in fixed-length chunks, each termed a cache line: the data which implies that all field accesses and comparisons may exchange unit with the main memory. Modern processors require a function call. Each function call updates the stack incorporate a very fast to access level-1 (L1) cache, divided register and saves/restores the contents of the CPU registers in the Instruction (I1) cache and the Data (D1) cache; a larger to/from the stack. With tens of registers in current CPUs, but moderately fast to access level-2 (L2) cache; and, in some frequent function calls may lead to significant overhead, as a models, an even larger but much slower to access level-3 (L3) substantial percentage of CPU time is spent without any actual cache. Caches exploit both temporal locality (data tend to be contribution to result computation. Moreover, since a function repeatedly accessed within a short period) and spatial locality call is a jump in the executed code, it forces a new instruction (contiguously allocated data tend to be accessed in unison). stream to be loaded in the pipeline, thus limiting superscalar Non-blocking operation and superscalar execution allow for execution. multiple pending memory operations, thus overlapping fetch In addition to stack interaction, there is also overhead at latencies. Data-intensive workloads restrict this operation: the the data level. Each iterator maintains internal state; iterator cache controller can only serve a limited number of concurrent requests and therefore becomes saturated. While caches can 1 The results were extracted using the RightMark Memory Analyzer [22]. 614

3.calls require several memory operations for accessing and however, implied revisiting all query evaluation algorithms. updating the iterator state, each call potentially triggering It also affected not only the design of the query engine, but cache misses. Moreover, iterator state manipulation interferes also other orthogonal aspects of a DBMS, e.g., concurrency with data stream accesses. Even if the data access pattern control. In [2] the Partition Attributes Across, or PAX, storage is sequential it will be frequently interrupted, thus reducing model was introduced. The idea is that although pages still the efficiency of hardware prefetching. Note that the iterator provide a tuple-level interface, the tuples within a page are interface does not control the data flow of pipelined operators, vertically partitioned, thus greatly enhancing cache locality. as each operator implementation is independent. Consequently, This hybrid approach combines the benefits of NSM and DSM pipelined iterator calls may introduce cache contention and while requiring only moderate changes to the database system. evict cache lines from each other’s data-set, leading to cache In the context of the iterator model, a buffering operator thrashing. was proposed in [25] to increase the tuple granularity in inter- operator communication. This resulted in a measurable reduc- C. Compiler optimization tion in the number of iterator calls across the operator pipeline, Developers rely on the compiler to transform the code in but had no effect on the number of evaluation function calls in ways that reduce processor stalls during execution. Since the the body of the operator. In [20] it was proposed that multiple compiler generates the executable code, it can optimize it aggregation operations can be combined in a single blocking for the target architecture and hardware platform. The code operator that executes these operations over a sequence of is transformed in ways that (a) keep the execution pipeline tuples through array computations. Common computations full of independent instructions, (b) distribute variables to across the aggregation functions are performed only once and registers in ways that encourage their reuse, and (c) group stored as intermediate results; array computations are used together accesses to the same data [16]. These optimizations to evaluate aggregates, a technique more in line with the result in increased parallelism, reduced memory accesses and superscalar design of modern processors. maximized cache locality, thus limiting processor stalls. The state-of-the-art in main-memory query execution is The iterator model, however, prevents the compiler from MonetDB [3], [18] where, in addition to vertical decompo- applying such optimizations. Each iterator call triggers a chain sition, the entire query engine is built on the notion of array of function calls that will eventually produce a single tuple. manipulation, with sophisticated query processing techniques The compiler cannot factor this out and identify the (possibly (e.g., radix-cluster hash join) having been developed in that iterative) access pattern over the input, as interprocedural context. Though MonetDB’s engine employs a different data analysis and optimizations are much more limited than in- flow than that of traditional DBMSs, it still is an operator- traprocedural ones. Moreover, conditions and jumps in the based approach, tightly connected to the DSM. It also re- code due to function calls disrupt the instruction sequence quires materializing all intermediate results, thus reducing and reduce the range of code the compiler can examine in the opportunities for exploiting cache locality across separate unison for optimization opportunities. This is aggravated by query operators. These restrictions led to the introduction of certain parameters necessary for iterator instantiation (e.g., MonetDB/X100 [4], [26], where the idea of the blocking predicate value types and offsets inside tuples) being only operator [20] was coupled with a column-wise storage layout. specified at run-time. These ambiguities refrain the compiler The use of compound vectorized primitives for performing from applying a substantial part of its code optimization all computations achieved performance comparable to that of techniques on iterator implementations. hard-coded implementations. Prefetching is another area that has received attention, III. R ELATED W ORK with [7], [8] presenting ways of employing software prefetch- It has long been known that processors are designed for ing in hash join evaluation. Though this approach may improve complex numerical workloads over primitive data types and response times, it introduces the need for dynamically calculat- are not well-tailored for database workloads. The authors ing the prefetching distance according to the CPU’s frequency, of [23] measured performance not only in terms of response cache latencies, and the run-time load. Inaccuracies result in time, but also in terms of hardware performance metrics. They failing to prefetch the required data on time, or polluting the also proposed various modifications to improve the behavior cache with not immediately needed data. Besides, the cache of join algorithms on contemporary processors. Regarding the controller treats software prefetching instructions as hints and storage layer, it was soon realized that the established N-ary may ignore them if there are pending fetch requests. We have Storage Model (NSM) penalized execution for the common therefore chosen not to employ software prefetching in our case of only a small number of fields in each tuple being implementation. necessary during query evaluation. This lead to the intro- Although a primitive form of code generation was used duction of vertical partitioning and the Decomposed Storage even in System-R [5], the adoption of the iterator model [11] Model (DSM) [9], where each tuple field is separately stored. has dominated query engine design. Code generation was This layout reduces the amount of data touched during query revisited in the Daytona data management system [13], which evaluation and allows for the use of array computations when was capable of on-the-fly generation of query-specific code. implementing the operators. This change of storage layout, It relied, however, on the operating system to perform most 615

4. Fig. 2. Holistic query engine overview of the functionality a DBMS would traditionally provide (e.g., the size of intermediate results. It also chooses the optimal buffering, concurrency control). Similarly, the authors of [21] evaluation algorithm for each operator and sets the parameters presented a Java prototype system employing dynamic query used for the instantiation of the code generator’s templates (see compilation. Still, this system employed iterators for operator Section V for more details). communication and execution, using the code generator only The output of the optimizer is a topologically sorted list to remove virtual functions from the body of iterators. More- O of operator descriptors oi . Each oi has as input either over, it did not present any novel query processing options, primary table(s), or the output of oj , j < i. The descriptor e.g., joins were exclusively evaluated through preconstructed contains the algorithm to be used in the implementation of join indexes. each operator and additional information for initializing the code template of this algorithm. Effectively, this list describes IV. S YSTEM OVERVIEW a scheduled tree of physical operators since there is only In this section we present the design of a query engine one root operator. It is organised so that the first elements employing code generation as the underlying principle for ef- describe the joins of the query, followed by any aggregation ficient query evaluation. Our system is named HIQUE, standing and sorting operations (unary operators, at most one descriptor for Holistic Integrated Query Engine. It has been implemented for each). The optimizer keeps track of interesting orders [5] in C/C++ and compiled using the GNU gcc compiler, over the and join teams [12], grouping together join operations and GNU Linux operating system. It adopts the traditional client- avoiding re-sorting where possible. The code generator will server model, i.e., multiple clients communicate with the query then traverse the topologically sorted list and emit a set of engine. functions containing the source code for each operator. This Storage layer. We have adopted the N-ary Storage Model is done in two steps per operator: (NSM) as a storage layout, with tuples consecutively stored 1) Data staging: all input tables are scanned, all selec- in pages of 4096 bytes. The system, however, is not tied to tion predicates are applied, and any unnecessary fields the NSM in any way; any other storage model, such as the DSM are dropped from the input to reduce tuple size and or the PAX models, can be used and our proposals will still increase cache locality on subsequent processing. Any be applicable. Each table resides in its own file on disk, and pre-processing needed by the following operator, e.g., the system’s storage manager is responsible for maintaining sorting or partitioning, is performed by interleaving the information on table/file associations and schemata. A buffer pre-processing code with the scanning code. The output manager is responsible for buffering disk pages and providing is then materialized (though not on disk, if the staged concurrency control; it uses the LRU replacement policy. In input is small enough to fit in main memory). addition to standard files, the system uses memory-efficient 2) Holistic algorithm instantiation: the source code that indexes, in the form of fractal B+ -trees [6], with each physical implements each operator is generated. This code is page divided in four tree nodes of 1024 bytes each. an instantiation of the appropriate holistic algorithm Query processing. The route of a query through the system template, as described in Section V-B. is shown in Figure 2. The first module is the SQL parser. By looking at the inputs of each operator the code generator Our SQL grammar supports conjunctive queries with equi-joins composes the operator implementations to generate a final and arbitrary groupings and sort orders. It does not support function. This function evaluates the entire query and is to (a) statistical functions in aggregate values, and (b) nested be called by the query engine. The final step of the code queries. We believe, however, these to be straightforward generation process is to insert all generated functions into a extensions that do not restrict the generality of the holistic new C source file. evaluation model. Once the query-specific source code has been generated, a The SQL parser checks the query for validity against the system call invokes the compiler to compile the source file system catalogue and outputs an internal query representation into a shared library file. This step allows the application for the optimizer. The latter chooses the optimal evaluation of aggressive compiler optimizations that target the code of plan using a greedy approach, with the objective of minimizing the specific query. The shared library file is then dynamically 616

5. Input: 1. Topologically sorted list of operators O, in the correct order, ensures the correct (de)allocation of 2. Code templates for data staging (T S), resources and sends the output to the client (Line 19). Finally, join evaluation (T J) and aggregation (T A) Output: Query-specific C source file all generated functions are put into a new C source file, in the same order as they have been generated. 1. for each join operator jm ∈ O 2. retrieve code template tsm ∈ T S to stage jm ’s inputs B. Algorithms and code templates 3. for each input in of jm 4. instantiate tsm for in The goal of holistic algorithms is to use code generation to 5. generate C function csmn for staging in customize well-known data processing algorithms into more 6. retrieve code template tjm ∈ T J for jm ’s algorithm hardware-efficient implementations. Per-query code generation 7. instantiate tjm for jm 8. generate C function cjm to evaluate join allows the following query-level optimizations: (a) attribute 9. if ∃ aggregate operator a ∈ O types are known a priori, which means we can revert separate 10. retrieve code template tsa ∈ T S to stage a’s input function calls for data accessing and predicate evaluation to 11. instantiate tsa for a pointer casts and primitive data comparisons, respectively; and 12. generate C function csa for staging a (b) fixed-length tuples inside each page can be accessed as an 13. retrieve code template ta ∈ T A for a’s algorithm 14. instantiate ta for a array through pointer arithmetic and direct referencing. The 15. generate C function ca to compute aggregate values system is aware of the differences in latency for accessing 16. if ∃ ordering operator s ∈ O each level of the memory hierarchy. Recall from Section II 17. retrieve code template ts ∈ T S for sorting that switching from sequential to random access may even 18. instantiate ts and generate sorting C function cs double the latency on accesses outside the D1-cache; moving 19. traverse O to compose the function cm calling all functions 20. write all generated functions to a new source file F one layer down the memory hierarchy increases latency by one 21. return F order of magnitude. The generated code therefore (a) examines the input in blocks that fit inside the D1- or the L2-cache, Fig. 3. The code generation algorithm (b) maximizes reuse by performing multiple operations over cache-resident data, and (c) strives for random access patterns linked and loaded by the query executor. The latter calls appearing only inside the D1-cache, as this is the only level the dynamically loaded function to evaluate the query and where the fetch cost is the same for both sequential and redirects the output to the client. random access. V. C ODE GENERATION As an example of generated code, Listing 1 shows the C In this section we present the implementation of the code code for filtering the tuples of a table. By employing type generator. The code generator uses a template-based approach. information (int in this case) and using array accesses, we Each algorithm is represented as an abstract template, which can eliminate all function calls (but the unavoidable for loading is instantiated according to the execution plan. pages and generating the output), saving a large number of CPU cycles. We also reduce the number of instructions A. Implementation executed, as we evaluate predicates over primitive data types. The code generator accepts as input the output of the Moreover, the use of array computations allows the code to optimizer (i.e., a topologically sorted list O of operator de- exploit the processor’s superscalar design. The lack of function scriptors) and produces a C source file of query-specific code. calls in the inner loop, in combination with directly accessing The generation algorithm is shown in Figure 3. As mentioned, tuples and their fields by reference, further aids the compiler each descriptor contains the algorithm to be implemented, in optimizing the generated code in ways that efficiently along with the necessary parameters for instantiating code distribute data to registers and favor cache reuse. templates. These parameters include the predicate data type(s), Input staging. The staging algorithms include sorting, parti- information about the operator’s inputs, be they primary tables tioning, and a hybrid approach. Sorting is performed by using or intermediate results, and the output schema. an optimized version of quicksort over L2-cache-fitting input Code generation progresses as follows: the generator tra- partitions and then merging them. Partitioning is either fine, verses the operator descriptor list processing the join operators through mapping attribute values to partitions, or coarse, by first (Lines 1 to 8 in Figure 3) and moving on to any aggrega- using hash and modulo calculations to direct tuples to parti- tion (Lines 9 to 15) and ordering operators (Lines 16 to 18). tions. Fine-grained partitioning is used when the partitioning For each operator the generator emits functions that (a) stage attribute has a small enough number of distinct values, so the input (one function per input table), and (b) execute the that a value-partition map comfortably fits inside the cache operator’s algorithm. These functions are built by retrieving hierarchy. For each input tuple, the map is looked up for the the appropriate code template (e.g., Lines 2 and 6 for joins) and corresponding partition. We maintain a sorted array of attribute instantiating it according to the parameters of the operator’s values and use binary search for lookups. In case the directory descriptor (e.g., Lines 4 and 7). Given that operator descriptors spans the lower cache level, searching it may trigger expensive in O contain information about how operators are connected, cache misses, so coarse-grained partitioning proves more the last bit of code generation is to traverse O and generate a efficient. However, since the generated partitions in the latter main (composing) function that calls all evaluation functions case contain multiple attribute values, increasing reuse through 617

6. 1 // l o o p o v e r p a g e s tuples. If each tuple of the outer loop matches at most once 2 f o r ( i n t p = start_page ; p <= end_page ; p++) { with tuples from the inner loop, the access pattern is linear 3 page_str ∗ page = read_page ( p , table ) ; 4 // l o o p o v e r t u p l e s for both inputs; backtracking to the beginning of a group of 5 f o r ( i n t t = 0 ; t < page−>num_tuples ; t++) { matching inner tuples is quite likely to result in cache hits, 6 v o i d ∗ tuple = page−>data + t ∗ tuple_size ; 7 i n t ∗ value = tuple + predicate_offset ; since small groups will tend to be resident in the L2-, or even 8 i f (∗ value != p redicate_value ) c o n t i n u e ; the D1-cache. 9 add_to_result ( tuple ) ; // i n l i n e d 10 }} Partition join builds upon Grace hash join [17]. The input tables are first finely or coarsely partitioned in M partitions Listing 1. Optimized table scan-select each (Line 1); the corresponding partitions (Lines 3 to 5) are 1 /∗ I n l i n e d c o d e t o hash−p a r t i t i o n o r s o r t i n p u t s ∗/ then joined using nested-loops join. For fine-grained parti- 2 tioning, all tuples inside corresponding partitions will match. 3 hash join : // e x a m i n e c o r r e s p o n d i n g p a r t i t i o n s t o g e t h e r 4 f o r ( k = 0 ; k < M ; k++) { For coarse-grained partitioning, each tuple will match with 5 // u p d a t e page bounds f o r b o t h k−t h p a r t i t i o n s none, some, or all tuples of the corresponding partition. To 6 hybrid hash−sort−merge join : // s o r t p a r t i t i o n s 7 that end, we do not use any hash table as it would lead 8 f o r ( p_1 = start_page_1 ; p_1 <= end_page_1 ; p_1++) { to uncontrollable random access patterns. Instead, we prefer 9 page_str ∗ page_1 = read_page ( p_1 , partition_1 [ k ] ) ; 10 f o r ( p_2 = start_page_2 ; p_2 <= end_page_2 ; p_2++) { to first sort the partitions (Line 6) and then use merge join 11 page_str ∗ page_2 = read_page ( p_2 , partition_2 [ k ] ) ; for each pair of corresponding partitions, an algorithm we 12 13 f o r ( t_1 = 0 ; t_1 < page_1−>num_tuples ; t_1++) { term hybrid hash-sort-merge join. Note that if the size of 14 v o i d ∗ tuple_1 = page_1−>data + t_1 ∗ tuple_size_1 ; the partitions is smaller than half that of the L2-cache, by 15 f o r ( t_2 = 0 ; t_2 < page_2−>num_tuples ; t_2++) { 16 v o i d ∗ tuple_2 = page_2−>data + t_2 ∗ tuple_size_2 ; sorting pairs of corresponding partitions just before joining 17 them (instead of during data staging), we ensure that they are 18 i n t ∗ value_1 = tuple_1 + offset_1 ; 19 i n t ∗ value_2 = tuple_2 + offset_2 ; L2-cache-resident during join evaluation. 20 i f (∗ value_1 != ∗ value_2 ) { Furthermore, the nested-loops template enables pipelined 21 merge join : // u p d a t e bounds f o r a l l l o o p s 22 continue ; evaluation of multiple joins without materializing intermediate 23 } results, thus radically reducing memory operations and pro- 24 add_to_result ( tuple_1 , tuple_2 ) ; // i n l i n e d 25 }}}}} cessor stalls. This is applicable in multi-way join queries with join teams, i.e., sets of tables joined on a common key. Such Listing 2. Nested-loops template for join evaluation queries are quite common, e.g., in star-schemata or key-foreign key joins. Notions like hash teams and interesting orders, sorting the partitions can help subsequent processing. This are handled by our model by increasing loop nesting. The leads to a class of what we term hybrid hash-sort algorithms, template of Listing 2 needs to be slightly modified to support which are applicable in certain join evaluation and aggregation join teams. First, all input tables are properly staged (sorted scenarios. These algorithms prove efficient if the number of or partitioned). Then, for each input table the code generator partitions is big enough so that the largest partition fits in the emits one loop over its pages and one over its tuples, with the L2-cache. page loops preceding the tuple loops and following the same Join evaluation. All join evaluation algorithms use the com- table order. The code layout resembles the layout suggested mon nested-loops template shown in Listing 2; the difference by the loop-blocking code optimization technique [16], which across algorithms lies in how their inputs have been staged. enhances cache locality. The template uses an array-like sequential access pattern, Aggregation. The aggregation algorithms depend on input which favors the utilization of the hardware prefetcher on staging. Sort aggregation implies that the input has already the first iteration over each page’s tuples. Subsequent tuple been sorted on the grouping attributes. The input is then iterations will be performed over cache-resident pages without linearly scanned to identify the different groups and evaluate any cache misses. the aggregate results for each group on-the-fly. For hybrid Merge join assumes the input has been previously sorted hash-sort aggregation, the input is first hash-partitioned on (Line 1). Join evaluation then progresses by linearly examining the first grouping attribute and then each partition is sorted the input tables (M is set to 1 in Line 4). The bounds of the on all grouping attributes. Aggregation then progresses in a loops (both in terms of starting and ending pages per table, and single scan of each sorted partition. in terms of starting and ending tuples per page) are constantly Another option is to use value directories for each grouping updated as the merging process progresses (Line 21). This is attribute. This is applicable if the total size of the directories controlled by a condition variable that can take one of three for all grouping attributes is small enough to fit inside the values: the first means that there is no match between the cache hierarchy, so as to avoid directory lookups triggering current tuples; the second means that at least one match has cache misses. In this case, map aggregation is evaluated been found and we should continue scanning inner tuples for in a single linear scan of the input without requiring any matches; the last means that the group of inner matching tuples prior staging. To do so, we maintain one value directory has been exhausted, so we need to advance the outer tuple and per grouping attribute, as shown in Figure 4(a) for three backtrack to the beginning of the group of matching inner attributes, and one array per aggregate function holding ag- 618

7. R.c R.a R.b TABLE I value id value id value id Europe 0 I NTEL C ORE 2 D UO 6300 S PECIFICATIONS 100 0 A 0 Asia 1 200 1 B 1 Number of cores 2 Africa 2 300 2 C 2 Frequency 1.86GHz America 3 (a) Multiple mapping directories Cache line size 64B I1-cache 32KB (per core) Offset(R.a = 200, R.b = C, R.c = Asia) D1-cache 32KB (per core) = R.a[200] · |R.b| · |R.c| + R.b[C] · |R.c| + R.c[Asia] L2-cache 2MB (shared) = 1 · 3 · 4 + 2 · 4 + 1 = 21 L1-cache miss latency (sequential) 9 cycles (b) Offset of aggregate value L1-cache miss latency (random) 14 cycles Fig. 4. Group directories for aggregation L2-cache miss latency (sequential) 28 cycles L2-cache miss latency (random) 77 cycles RAM type 2x1GB DDR2 667MHz gregate values. Assuming that Mi is the map for attribute i and Mi [v] gives the identifier for value v of attribute i, one can then reduce the multi-dimensional mapping of tuple The experience of developing HIQUE has verified these (v1 , v2 , . . . , vn ), for grouping across n attributes, to the scalar claims. The introduction of new algorithms or even new n n operators required more effort to extend the parser and the i=1 (Mi [vi ] j=i+1 |Mj |), where |Mi | is the size of the mapping table for attribute i. The previous formula maps each optimizer than to extend the generator. As a general methodol- combination of values to a unique offset in the aggregate ogy of introducing algorithms, we would first create a model n arrays, the latter holding ( i=1 |Mi |) values each. An example implementation of the new algorithm and compare it to the of applying the formula is shown in Figure 4(b). Aggregate existing templates. In most cases, the new algorithm resulted evaluation then progresses as follows: for each input tuple, the in a few different lines of code when compared to the existing grouping attribute maps are used to identify the offset of the evaluation algorithms. We would then extend the templates and group the tuple belongs to. Then, the corresponding variables the code generator to support the new algorithm. This process for this group in each aggregate array are updated with the was further aided by the output of the code generator being a current values of the aggregate functions. C source code file: the compiler helped the developer to easily In all cases, the code generator inlines the code that identi- identify errors in the generated code and reduce the number fies the groups and applies the aggregate functions. The lack of iterations needed to fully support the new algorithm. of function calls is particularly important in aggregation: it VI. E XPERIMENTAL S TUDY allows the compiler to generate executable code that widely reuses registers in a computationally-intensive operation. The To test the viability of code generation as a general solution optimized code minimizes the frequency of interruptions to the to query evaluation we experimented with different aspects of execution sequence due to stack interaction, avoids multiple the system. Our aim was to measure (a) the superiority of evaluation of common arithmetic expressions, and reduces the the holistic model over the traditional iterator-based approach, number of data accesses per tuple. (b) the effect of compiler optimizations on the code generated by HIQUE, (c) the competitiveness of the system in comparison C. Development to other approaches, both research and commercial ones, on The main challenges in engineering a code generator for established benchmark queries, and (d) the penalty for gener- query evaluation were (a) the identification of common code ating, compiling, and linking query-specific code at runtime. templates across different algorithms, (b) the interconnection We report results on the currently dominant x86-64 proces- of different operators, since no common interface is present sor architecture. Our system had an Intel Core 2 Duo 6300 dual any more, and (c) the verification of correctness of the core processor, clocked at 1.86GHz. The system had a physical generated code for all supported operations. memory of 2GB and was running Ubuntu 8.10 (64 bit version, The holistic evaluation model eases those problems. The kernel 2.6.27); HIQUE’s generated code was compiled using the main advantage is that its algorithms exploit generic code GNU compiler (version 4.3.2) and with the -O2 compilation templates for all operations. Data staging employs the template flag. More detailed information about the testing platform can of Listing 1; sorting and partitioning operations can be inter- be found in Table I. The cache latencies were measured using leaved inside the code. For join evaluation, the nested-loops the RightMark Memory Analyser [22]. template of Listing 2 is used in each case, with differences Metrics and methodology. All queries were run in isolation between algorithms either being taken care of through staging, and were repeated ten times each. Each query ran in its own or through extra steps inside the loops. For instance, for hash thread, using a single processor core. We did not materialize join, the segments corresponding to Lines 3 to 5 are included the output in any case, as the penalty of materialization is and the ones for Lines 6 and 21 are excluded; including the last similar for all systems and configurations. We report average two code segments will turn the algorithm into hybrid hash- response times for each system, with the deviation being less sort-merge join. Aggregation extends the template of Listing 1 than 3% in all cases. We also used hardware performance by injecting code for tracking different groups and computing events as metrics. We obtained the latter with the OPro- the aggregate functions. Furthermore, operators are connected file [19] tool, which collects sampling data from the CPU’s by materializing intermediate results as temporary tables inside performance event counters. We broke down the execution the buffer pool and streaming them to subsequent operators. time into instruction execution, D1-cache miss stalls, L2-cache 619

8. 0.25 1.20 L1‐cache misses L1‐cache misses L2‐cache misses L2‐cache misses 1.00 0.20 Resource stalls Resource stalls Instruction execution Instruction execution 0.80 0.15 Time (s) Time (s) 0.60 0.10 0.40 0.05 0.20 0.00 0.00 Generic  Optimized  Generic  Optimized  HIQUE Generic  Optimized  Generic  Optimized  HIQUE Iterators Iterators hard‐coded hard‐coded Iterators Iterators hard‐coded hard‐coded (a) Execution time breakdown for Join Query #1 (b) Execution time breakdown for Join Query #2 Retired Function D1-cache D1-cache prefetch L2-cache prefetch CPI instructions (%) calls (%) accesses (%) efficiency (%) efficiency (%) Generic iterators 0.613 100.00 100.00 100.00 8.33 43.28 Optimized iterators 0.628 91.81 66.99 94.20 10.64 68.35 Generic hard-coded 0.569 53.47 33.87 51.85 27.78 86.84 Optimized hard-coded 0.498 27.63 1.29 39.31 25.00 89.47 HIQUE 0.475 26.22 1.08 36.67 25.00 92.11 (c) Hardware performance metrics for Join Query #1 Retired Function D1-cache D1-cache prefetch L2-cache prefetch CPI instructions (%) calls (%) accesses (%) efficiency (%) efficiency (%) Generic iterators 0.697 100.00 100.00 100.00 30.67 87.27 Optimized iterators 0.729 95.65 86.86 97.49 30.31 92.20 Generic hard-coded 0.720 67.32 49.56 61.95 60.55 86.38 Optimized hard-coded 0.750 56.80 32.75 56.13 60.95 89.93 HIQUE 0.769 56.62 32.37 54.03 61.07 89.97 (d) Hardware performance metrics for Join Query #2 Fig. 5. Join profiling miss stalls and other pipeline resource stalls.2 To account for by separately compiling the code for each query (including all hardware prefetching, we assumed sequential access latencies parameters for instantiating the statically pipelined iterators), for prefetched cache lines and random access latencies for all thus allowing their extended optimization by the compiler. For other cache misses. This allows for approximate calculation of join evaluation, we experimented with (a) two tables of 10,000 the cost of cache misses, as the non-blocking design of cache tuples of 72 bytes each using merge-join, with each outer memory allows the CPU to continue executing instructions tuple matching with 1,000 inner tuples, and (b) two tables of while fetching data. Still, this methodology provides a good 1,000,000 tuples of 72 bytes each using hybrid-join, with each approximation of actual cache miss stall times. In addition to outer tuple matching with 10 inner tuples. For aggregation, the execution time breakdown, we also calculate the Cycles we used a table of 1,000,000 tuples of 72 bytes each, two Per Instruction (CPI) ratio, the minimum value being 0.25 for sum functions, and we selected as the grouping attribute one Intel Core 2 Duo processors (i.e., four instructions executed in field with either (a) 100,000 distinct values, or (b) 10 distinct parallel per CPU cycle). We also measured samples for retired values. We employed hybrid aggregation in the first case and instructions, function calls and D1-cache accesses, normalized map aggregation in the second. All join and grouping attributes to the highest value among the compared configurations for were integers. We used both response times and hardware each query. Finally, we report the prefetching efficiency ratio, performance events as metrics. We present the results for join defined as the number of prefetched cache lines over the total evaluation in Figure 5 and for aggregation in Figure 6. number of missed cache lines. The first join query is inflationary, as it produces 10,000,000 A. Iterators versus holistic code tuples when joining two tables of 10,000 tuples each. In this case, the nested-loops-based template for join evaluation To quantify the iterator model’s deficiency compared to proves very efficient, as HIQUE is almost five times faster the proposed holistic model, we compared the following than the iterator implementations. The time breakdown in implementations: (a) an iterator-based one using generic func- Figure 5(a) shows that all versions exhibit minimal memory tions for predicate evaluation, (b) a type-specific version stalls, so the difference in execution time is exclusively due to of iterators with inlined predicate evaluation, (c) a hard- the lack of function calls, the reduction in retired instructions, coded implementation using generic functions for predicate and the elimination of resource stalls. Note that the generated evaluation and tuple accesses, (d) an improved hard-coded code requires 26.22% of the instructions, 36.67% of the data version with direct tuple accesses using pointer arithmetic, accesses and 1.08% of the function calls when compared to and (e) the code generated by HIQUE, that further inlines the generic iterator version, as shown in Figure 5(c). Besides, predicate evaluation. We favored the generic implementations the CPI ratio improves by 22.5% and closes in to the ideal 2 Other pipeline resource stalls are defined as resource stalls that are not value of 0.25. One can also observe that the efficiency of due to D1- or L2-cache misses – see also [15]. hardware prefetching more than doubles as the code becomes 620

9. 0.60 0.07 L1‐cache misses L1‐cache misses L2‐cache misses 0.06 L2‐cache misses 0.50 Resource stalls Resource stalls Instruction execution 0.05 Instruction execution 0.40 Time (s) Time (s) 0.04 0.30 0.03 0.20 0.02 0.10 0.01 0.00 0.00 Generic  Optimized  Generic  Optimized  HIQUE Generic  Optimized  Generic  Optimized  HIQUE Iterators Iterators hard‐coded hard‐coded Iterators Iterators hard‐coded hard‐coded (a) Execution time breakdown for Aggregation Query #1 (b) Execution time breakdown for Aggregation Query #2 Retired Function D1-cache D1-cache prefetch L2-cache prefetch CPI instructions (%) calls (%) accesses (%) efficiency (%) efficiency (%) Generic iterators 0.796 100.00 100.00 100.00 19.16 94.76 Optimized iterators 0.798 95.35 92.48 99.88 21.73 91.95 Generic hard-coded 0.872 59.85 86.83 91.19 56.79 85.59 Optimized hard-coded 0.875 54.99 77.74 89.32 56.82 86.12 HIQUE 0.919 53.86 68.65 81.63 56.90 88.95 (c) Hardware performance metrics for Aggregation Query #1 Retired Function D1-cache D1-cache prefetch L2-cache prefetch CPI instructions (%) calls (%) accesses (%) efficiency (%) efficiency (%) Generic iterators 0.791 100.00 100.00 100.00 75.71 95.05 Optimized iterators 0.881 81.85 94.06 74.79 93.18 93.17 Generic hard-coded 0.936 67.62 65.35 60.21 78.93 93.44 Optimized hard-coded 0.904 53.13 32.67 52.72 78.37 95.57 HIQUE 0.899 41.89 4.95 46.13 70.39 95.86 (d) Hardware performance metrics for Aggregation Query #2 Fig. 6. Aggregation profiling more query-specific, both for the D1- and the L2-cache. gation is evaluated in a single pass of the input without any The second join query uses two larger tables as inputs and need for intermediate staging. This allows the code generator has much lower join selectivity. In this case, the majority of to inline all group tracking and aggregate calculations in a the execution time is spent on staging the input, i.e., hash- single code segment. As shown in Figure 6(b), the code partitioning it and sorting the partitions. Since all versions generated by HIQUE outperforms generic iterators by almost implement the same algorithm, use the same type-specific im- a factor of two. Memory stalls dominate execution time for plementation of quicksort, and display similar access patterns, the HIQUE version (though their effect might be alleviated the differences in execution times are narrowed. As shown from the operation of non-blocking caches), as the aggregate in Figure 5(b) HIQUE is almost twice faster than the iterator- calculations require only a few instructions per tuple. Also based versions. The penalty for memory stalls is similar in all shown in Figure 6(b), the reduction in function calls is gradual cases, as expected. The reduction in retired instructions, data as the code becomes more query-specific and reaches 4.95% accesses and function calls is still significant, according to for the most optimized hard-coded version. Furthermore, the Figure 5(d), but does not reach the levels of the previous query. linear scan of the input helps the hardware prefetchers achieve Note that the CPI ratio increases for hard-coded versions. This high levels of efficiency, over 70% for the D1-cache and near is due to the retirement of fewer instructions in total, so the 95% for the L2-cache in all cases. contribution of costly memory operations to the CPI is more We next examined the efficiency of compiler optimizations substantial. Prefetching efficiency doubles for the D1-cache on the iterator-based and the hard-coded implementations. We and is approximately 90% for the L2-cache in all cases. compiled the various implementations with compiler optimiza- In terms of aggregation, the first benchmark query was tions disabled (by setting the optimization flag to -O0 for the evaluated using the hybrid hash-sort algorithm. In this case GNU compiler) and ran the same join and aggregation queries. staging dominates execution time, as aggregation is evaluated The results are presented in Table II. Naturally, the differences in a single scan of the sorted partitions. Still, as shown in between the various code versions are more tangible when Figure 6(a), HIQUE maintains an advantage of a factor of 1.61 there are no compiler optimizations, since the compiler can over iterators. The use of the same partitioning and sorting apply some of the optimizations that are included in the code implementations leads to similar memory stall costs for all generation process. For example, the compiler may inline the code versions. The difference in execution times mainly stems functions for predicate evaluation, so the differences between from the reduction in instructions, data accesses and function the last two implementations are narrowed in all queries, but calls, according to Figure 6(c). Observe that the efficiency of become apparent when the -O0 optimization flag is used. the D1-cache prefetcher increases three times, while that of The results show that compiler optimizations are most the L2-cache reaches almost 90% for all implementations. efficient in the first join query, resulting in speedups between In the case of the proposed map-based algorithm, aggre- 2.67 and 4.85, as the loop-oriented code transformations can 621

10. TABLE II E FFECT OF COMPILER OPTIMIZATION ( RESPONSE TIMES IN SECONDS ) Join Query #1 Join Query #2 Aggregation Query #1 Aggregation Query #2 -O0 -O2 -O0 -O2 -O0 -O2 -O0 -O2 Generic iterators 0.802 0.235 1.953 0.995 1.225 0.527 0.136 0.060 Optimized iterators 0.618 0.231 1.850 0.990 1.199 0.509 0.113 0.055 Generic hard-coded 0.430 0.118 1.421 0.688 0.586 0.344 0.095 0.051 Optimized hard-coded 0.267 0.055 1.225 0.622 0.554 0.333 0.080 0.038 HIQUE 0.178 0.054 1.138 0.613 0.543 0.326 0.070 0.033 improve performance on iterative tuple processing. For the rest staging; the latter cost is similar for all implementations. This of the queries the speedup is almost a factor of two. Since we is shown in Figure 7(c) for joining two tables of 1,000,000 compile the code for each query and for all implementations, tuples each. Each input tuple was 72 bytes wide, while the speedup is significant even for the iterator-based ones. the number of inner matching tuples per outer tuple varied Moreover, the compiler is less efficient on the hard-coded between 1 and 1,000. The results show that the gap between implementations: the source code is already minimalistic and the iterator-based and the holistic implementations widens contains various optimizations (e.g., loop blocking, function quickly as join selectivity increases and reaches a factor of inlining). Still, the simplicity of the code and the lack of five for 1,000 matches per outer tuple. function calls allows the compiler to further improve the hard- The salient factor in aggregation performance is the domain coded versions resulting in significant speedups. of the grouping attribute(s). If this domain allows the value directories and the aggregate arrays (see also Section V-B) to B. Performance of holistic algorithms simultaneously fit in the lowest cache level, map aggregation We now move on to examine the performance of the is expected to outperform the algorithms that require input proposed algorithms while varying the characteristics of the staging. We show the effect of the grouping attribute’s range in input and the predicates to be applied. We compared the Figure 7(d). The input table had 1,000,000 tuples of 72 bytes optimized iterator-based versions of the proposed algorithms each. We used two sum functions and one grouping attribute with the code HIQUE generates for each query. In Figure 7(a) as we varied the number of distinct values between 10 and we examine scalability in join evaluation. We used two tables 100,000. The results verify that, for small numbers of groups, with a tuple size of 72 bytes. Each outer tuple matched with map aggregation is highly efficient, both in its iterator-based ten inner tuples on integer join attributes. The cardinality of and its holistic form. However, sort and hybrid aggregation the outer table was set to 1,000,000, while the inner one’s are only moderately affected by the number of groups. They varied between 1,000,000 and 10,000,000. The results show perform better than map aggregation when the auxiliary data that all algorithms scale linearly, with iterator-based hash-sort- structures of the latter (i.e., the value directory for the grouping merge-join having similar performance to HIQUE’s merge-join. attribute and the aggregate arrays) span the L2-cache, the As expected, the generated version of the hash-sort-merge join difference approaching a factor of two for 100,000 groups. outperforms all other versions by a substantial margin, proving its efficiency in a wide range of input cardinalities. C. TPC-H benchmark In multi-way queries, the evaluation of multiple joins using The last set of experiments is over the standard and more re- a single segment of deeply-nested loops improves performance alistic TPC-H benchmark [24]. We benchmarked HIQUE against as the generated code does not require materialization of three database systems: (a) PostgreSQL (version 8.2.7), a intermediate results. To verify this, we joined one table of widely-used and high-performance open-source DBMS over 1,000,000 tuples with a varying number of tables of 100,000 NSM that employs the iterator model, (b) a commercial system, tuples each, on a single join attribute. All tables had 72-bytes- which we refer to as System X for anonymity, also using NSM sized tuples, while the output cardinality was 1,000,000 in all and iterators but employing software prefetching instructions cases. We compared the binary iterator-based merge-join, its to reduce cache miss stalls, and (c) MonetDB (version 5.8.2), equivalent when generated by HIQUE, and the code versions an architecture-conscious DBMS that uses a DSM-based storage when join teams where enabled in HIQUE, using either merge- layer and column-wise evaluation algorithms. This choice or hybrid-join. The results of Figure 7(b) show that although allowed the comparison of different storage systems and query iterator-based merge-join takes advantage of sorted orders, it engine designs, with PostgreSQL representing the traditional is widely outperformed by its holistic equivalent. Furthermore, I/O-optimized design, System X bridging the gap between I/O- the adoption of join teams radically reduces execution time, and CPU-bound execution with software prefetching, and Mon- with the difference between HIQUE and iterators reaching a etDB being a design optimized for main-memory execution. factor of 3.32 when joining eight tables. The extension of the We used the benchmark’s generator to generate a data-set nested-loops join template to support join teams therefore pays with a scaling factor of one. The tables were not altered in off in the case of multi-way join queries. any way (e.g., sorted) before being imported to the systems. Highly-selective join predicates are expected to increase the The “raw” data-set size was approximately 1.3GB, without difference in performance between the iterator and the holistic indexes, thus fitting in our system’s main memory. We built model. This is due to the number of iterator calls growing indexes in all systems, gathered statistics at the highest level larger and the join evaluation cost surpassing that of input of detail, and set the memory parameters to allow in-memory 622

11. 7 1.8 Merge - Iterators Merge - Iterators Hybrid - Iterators Merge - HIQUE (binary) 6 Merge - HIQUE 1.6 Merge - HIQUE (team) Hybrid - HIQUE Hybrid - HIQUE (team) 1.4 5 1.2 Time (s) Time (s) 4 1 3 0.8 2 0.6 1 0.4 0 0.2 1 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 Input cardinality (Millions) Number of joined tables (a) Join scalability (b) Multi-way joins 30 0.8 Merge - Iterators Sort - Iterators Hybrid - Iterators 0.7 Hybrid - Iterators 25 Merge - HIQUE Map - Iterators Hybrid - HIQUE Sort - HIQUE 0.6 Hybrid - HIQUE 20 Map - HIQUE 0.5 Time (s) Time (s) 15 0.4 0.3 10 0.2 5 0.1 0 0 0 0.5 1 1.5 2 2.5 3 1 1.5 2 2.5 3 3.5 4 4.5 5 log10(matching tuples) log10(group cardinality) (c) Join predicate selectivity (d) Grouping attribute cardinality Fig. 7. Join and aggregation performance 70 5 6 PostgreSQL 4.549 PostgreSQL PostgreSQL 59.353 System X System X 5.091 System X 60 5 MonetDB 4 MonetDB MonetDB 50 HIQUE HIQUE HIQUE 4 Time (s) Time (s) 3 Time (s) 40 37.185 2.477 3 30 2 2.091 2 20 1 0.795 0.971 10 0.628 1 0.411 1.376 0.356 0 0 0 (a) Query #1 (b) Query #3 (c) Query #10 Fig. 8. TPC-H Queries execution. We experimented with TPC-H Queries 1, 3 and 10. 662.16 millions of CPU cycles, which is comparable to that These include highly selective join predicates that cannot be of MonetDB/X100’s DSM-based approach and 30% faster evaluated as join teams, as well as aggregation operations than MonetDB/X100’s NSM-based approach [26]. Hence, we of a varying number of grouping attributes and aggregate posit that HIQUE generates code that is identical to a hard- functions. TPC-H tables have wide tuples spanning multiple coded implementation, thus achieving maximum efficiency in cache lines, with only a few fields actually needed by any aggregation (at least for NSM-based systems). query. Thus, the expectation is for MonetDB to benefit from The remaining queries test join evaluation, aggregation, and vertical partitioning and outperform all NSM-based systems. sorting. The holistic optimizer stages all inputs before further TPC-H Query 1 is an aggregation over almost the entire operations. This is expensive over the benchmark tables, as lineitem table (about 5,900,000 tuples) and produces four only a small portion of each tuple is used by the query output groups. As the two aggregation attributes have a product operators. The queries are a perfect match for DSM systems, of distinct value cardinalities equal to six, the most appropriate like MonetDB: through vertical partitioning only the required holistic aggregation algorithm is map aggregation. The results fields for each operator are fetched. As a result HIQUE is 34.5% in Figure 8(a) show HIQUE outperforming MonetDB by a factor faster and 18.1% slower in Queries 3 and 10 respectively, when of four and the other NSM-based systems by two orders of mag- compared to MonetDB. Compared to the NSM systems, HIQUE nitude, reaching a 167-fold advantage over PostgreSQL. This outperforms PostgreSQL and System X by a substantial factor, is due to the holistically generated code: it inlines all selection, ranging between 2.2 and 11.1. grouping, and aggregation operations in a single succinct code The TPC-H results prove the viability of holistic evaluation in block that lacks function calls and is tailored towards efficient a realistic query workload. The holistic model provides code register utilization. The measured performance translates to simplicity and enhances cache locality during execution, there- 623

12. TABLE III Q UERY PREPARATION COST SQL processing (ms) Compilation (ms) File sizes (bytes) TPC-H Query Parse Optimize Generate with -O0 with -O2 Source Shared library #1 21 1 1 121 274 17,733 16,858 #3 11 1 2 160 403 33,795 24,941 #10 15 1 4 213 619 50,718 33,510 fore reducing the number of instructions and data accesses The next step is to extend our approach for multithreaded required to evaluate queries. That way, both the processor and processing. Current processor designs integrate multiple cores the memory subsystem are stressed to a lower extent, leading sharing the lowest on-chip cache level. Though this design to a significant speedup of query evaluation. This allowed widens opportunities for parallelism, it introduces resource our NSM-based system to achieve performance that was so far contention. We believe that code generation is advantageous conceivable only for systems employing vertical partitioning. for such designs: one can accurately specify the code segments that can be executed in parallel, thus reducing synchronization D. Code generation cost overhead and memory bandwidth requirements. The drawback of per-query code generation is the overhead R EFERENCES for emitting and compiling query-specific source code. To [1] Anastassia Ailamaki et al. DBMSs on a Modern Processor: Where Does quantify this cost we report in Table III the preparation times Time Go? In The VLDB Journal, 1999. for the TPC-H queries. We separately show the query parsing, [2] Anastassia Ailamaki et al. Weaving Relations for Cache Performance. In The VLDB Journal, 2001. optimization, code generation and compilation times, as well [3] P. A. Boncz. Monet: A Next-Generation DBMS Kernel For Query- as the sizes of the generated source and shared-library files. Intensive Applications. PhD thesis, Universiteit van Amsterdam, 2002. The time for parsing, optimizing, and generating code is [4] P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper- Pipelining Query Execution. In CIDR, 2005. trivial (less than 25ms). Compiling the generated code takes [5] Donald D. Chamberlin et al. A history and evaluation of System R. longer and compilation time depends on the optimization level. Commun. ACM, 24(10), 1981. Compilation takes 121–213ms with no optimizations (-O0 [6] Shimin Chen et al. Fractal prefetching B+-Trees: optimizing both cache and disk performance. In SIGMOD, 2002. compiler flag), but needs 274–619ms when the optimization [7] Shimin Chen et al. Improving hash join performance through prefetch- level is increased (-O2 compiler flag). The generated source ing. In ICDE, 2004. and shared library file sizes are less than 50 kilobytes. [8] Shimin Chen et al. Inspector Joins. In VLDB, 2005. [9] George P. Copeland and Setrag Khoshafian. A Decomposition Storage Preparation time is not negligible and can be a significant Model. In SIGMOD, 1985. portion of execution time for short-running queries. In such [10] Jack Doweck. Inside Intel Core Microarchitecture and Smart Memory cases it is preferable to avoid applying compiler optimizations Access, 2005. White paper. [11] Goetz Graefe. Query Evaluation Techniques for Large Databases. ACM that increase compilation time; the gain in execution time Comput. Surv., 25(2), 1993. will be intangible. Additionally, it is common for systems to [12] Goetz Graefe et al. Hash Joins and Hash Teams in Microsoft SQL store pre-compiled and pre-optimized versions of frequently Server. In VLDB, 1998. [13] Rick Greer. Daytona And The Fourth-Generation Language Cymbal. In or recently issued queries. This is certainly applicable in SIGMOD, 1999. HIQUE, especially if we take into account the small size of the [14] John Hennessy and David Patterson. Computer architecture: a quanti- generated binary files. Besides, in most cases the performance tative approach. Morgan Kaufmann, 2006. [15] Intel Corporation. Intel 64 and IA-32 Architectures Software Devel- benefits outweigh the generation cost. oper’s Manual, 2008. [16] Ken Kennedy and John R. Allen. Optimizing compilers for modern VII. C ONCLUSIONS AND FUTURE WORK architectures: a dependence-based approach. Morgan Kaufmann Pub- lishers Inc., 2002. We have presented the case for holistic query evaluation. [17] Masaru Kitsuregawa et al. Application of Hash to Data Base Machine and Its Architecture. New Generation Comput., 1(1), 1983. The proposed technique is based on generating query-specific [18] S. Manegold et al. What happens during a Join? - Dissecting CPU and code that integrates multiple query operations in succinct code Memory Optimization Effects. In VLDB, 2000. constructs. The generation process uses code templates for [19] OProfile. A System Profiler for Linux, 2008. http://oprofile. each query operator and builds query-specific code with the [20] Sriram Padmanabhan et al. Block Oriented Processing of Relational following objectives: (a) minimum function calls, (b) reduced Database Operations in Modern Computer Architectures. In ICDE, 2001. instructions and memory accesses, and (c) enhanced cache lo- [21] Jun Rao et al. Compiled Query Execution Engine using JVM. In ICDE, 2006. cality. The proposed model exhibits a substantial performance [22] RightMark. RightMark Memory Analyser, 2008. http://cpu. advantage when implemented over the NSM-based storage layer. It also does not affect any orthogonal aspects of a [23] Ambuj Shatdal et al. Cache Conscious Algorithms for Relational Query Processing. In VLDB, 1994. DBMS like concurrency control and recovery. To verify the [24] Transaction Processing Performance Council. The TPC-H benchmark, advantages of the proposed holistic model, we implemented 2009. HIQUE — the Holistic Integrated Query Engine. Extensive [25] Jingren Zhou and Kenneth A. Ross. Buffering database operations for enhanced instruction cache performance. In SIGMOD, 2004. experiments with a variety of data-sets and query workloads [26] Marcin Zukowski et al. DSM vs. NSM: CPU performance tradeoffs in proved HIQUE’s potential for per-query code generation, verify- block-oriented query processing. In DaMoN, 2008. ing the efficiency of our approach in main-memory execution. 624