Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs


Changkyu Kim† Eric Sedlar⋆ Jatin Chhugani† Tim Kaldewey⋆ Anthony D. Nguyen† Andrea Di Blas⋆ Victor W. Lee† Nadathur Satish† Pradeep Dubey†
† Throughput Computing Lab, Intel Corporation ⋆ Special Projects Group, Oracle Corporation

ABSTRACT

Join is an important database operation. As computer architectures evolve, the best join algorithm may change hands. This paper re-examines two popular join algorithms – hash join and sort-merge join – to determine if the latest computer architecture trends shift the tide that has favored hash join for many years. For a fair comparison, we implemented the most optimized parallel version of both algorithms on the latest Intel Core i7 platform. Both implementations scale well with the number of cores in the system and take advantage of the latest processor features for performance. Our hash-based implementation achieves more than 100M tuples per second, which is 17X faster than the best reported performance on CPUs and 8X faster than that reported for GPUs. Moreover, the performance of our hash join implementation is consistent over a wide range of input data sizes from 64K to 128M tuples and is not affected by data skew. We compare this implementation to our highly optimized sort-based implementation that achieves 47M to 80M tuples per second. We developed analytical models to study how both algorithms would scale with upcoming processor architecture trends. Our analysis projects that current architectural trends of wider SIMD, more cores, and smaller memory bandwidth per core imply better scalability potential for sort-merge join. Consequently, sort-merge join is likely to outperform hash join on upcoming chip multiprocessors. In summary, we offer multicore implementations of hash join and sort-merge join which consistently outperform all previously reported results. We further conclude that the tide that favors the hash join algorithm has not changed yet, but the change is just around the corner.

1. INTRODUCTION

Join is a key operation in relational databases that facilitates the combination of two relations based on a common key. Join is an expensive operation, and an efficient implementation will improve the performance of many database queries. There are two common join algorithms: sort-merge join and hash join. The debate over which is the best join algorithm has been going on for decades. Currently, for in-memory database operations, the hash-join algorithm has been shown to outperform sort-merge join in many cases.

Today's commodity hardware already provides large degrees of parallelism, with multiple cores on a single chip, multiple hardware threads on each core (SMT), and vector instructions (SIMD) operating on 128-bit vectors, whose capability will increase in the near future. Coupled with the growing compute density of chip multiprocessors (CMP) and memory bandwidth challenges, it is not clear if hash join will continue to outperform sort-merge join. In this paper, we re-examine both algorithms in the context of CMPs that offer thread-level parallelism (TLP), data-level parallelism (DLP), large on-die caches, and high memory bandwidth.

For a fair comparison, we optimize the implementations of both algorithms for the latest multi-core platform. Our hash-based implementation can join 100 million tuples per second on a 3.2GHz quad-core Intel Core i7 965 platform, which is faster than any reported CPU implementation thus far. We implemented the sort-merge join algorithm by exploiting all salient features of CMPs such as TLP, DLP, blocking for on-die caches, and utilizing high memory bandwidth judiciously. Moreover, our implementations are tolerant of data skew without sacrificing performance.

In our study, we observe a number of interesting features of join implementations. First, both join algorithms benefit greatly from multi-threading. Second, sort-merge join benefits greatly from exploiting the SIMD architecture offered by today's processors. Its performance will continue to improve with the trend of wider SIMD [21, 34]. Based on our analytical model, we project that sort-merge join will surpass hash join with 512-bit SIMD. For hash join to make use of SIMD execution, we believe that hardware support for efficient scatter and atomic vector operations is necessary (Section 7). Finally, by efficiently managing memory bandwidth usage, we show that both hash join and sort-merge join are compute bound on today's CMP systems. However, our analytical model shows that hash join demands at a minimum 1.5X more memory bandwidth than sort-merge join. If the gap between compute and bandwidth continues to grow in future computer systems, sort-merge join will be more efficient than hash join.

Our contributions include: first, we implement the most efficient hash join and sort-merge join on the latest computer platform. The performance of hash join is 17X faster than the best published CPU numbers and 8X faster than that reported for GPUs [18]. Second, our join performance is constant for a wide range of input sizes and data skews. Third, we compare the performance of sort-merge and hash join for current architectures, and conclude that hash join is superior in performance. Fourth, by constructing an analytical model for both hash join and sort-merge join, we conclude that future architectural trends towards increasing SIMD width and limited per-core memory bandwidth will favor sort-merge join over hash join.

The rest of the paper is organized as follows: Section 2 discusses related work. Section 3 examines modern computer architectures and their implications on join performance. Section 4 presents our hash join algorithm and the considerations for parallelization. Section 5 describes our sort-merge join implementation and the considerations for parallelization. Section 6 presents the results on the two join implementations. Section 7 discusses architecture improvements that are beneficial to both join algorithms and future architecture trends that would influence the performance of hash join and sort-merge join. Section 8 concludes.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '09, August 24-28, 2009, Lyon, France
Copyright 2009 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

2. RELATED WORK

Over the past few decades, significant efforts have been made to develop efficient join algorithms. Among the algorithms developed, sort-merge join and hash join are the two most popular algorithms for computing the equi-join of two relations. The sort-merge join algorithm [3] was dominantly used in early relational database systems. Later, the hash join algorithm became popular and was shown to outperform sort-merge join in many situations. The Grace hash join [23] and the hybrid hash join [38] algorithms were first proposed to overcome the disk I/O overhead of general hash-based join algorithms. As the capacity of main memories increased over the years, researchers have focused on main-memory join operations [35, 5, 27]. Shatdal et al. [35] propose a cache partitioning algorithm, where they partition the hash table to fit in cache memory so as to reduce the latency of access to the hash table. Manegold et al. [27, 28] observe that the partitioning stage itself incurs a lot of TLB and cache misses and becomes the performance bottleneck when the size of the relations is too large and there are too many partitions required for each to fit in cache. They propose a radix-clustering algorithm that fixes the number of partitions based on the number of TLB entries and performs a partial radix sorting. When the total number of partitions is greater than some fixed number, they perform multi-pass partitioning. Our implementation also relies on a similar multi-pass partitioning scheme. In contrast to algorithms based on cache partitioning, Chen et al. [6] argue that the trend towards concurrent databases will create more cache conflicts, thus reducing the effectiveness of caches. Instead of exploiting caches, they propose to use software prefetch schemes to hide the long latency of accessing hash tables. However, as memory bandwidth becomes an important consideration for performance, partitioning-based schemes that attempt to maintain the working set in cache will still retain an advantage. In this work, we use a partitioning-based scheme to reduce memory bandwidth use.

While the above join implementations are based on sequential algorithms, there has been considerable research on parallel partitioning [9] and join algorithms [10, 26]. One key issue in parallel joins is to achieve good load-balancing, especially when data are skewed. Different schemes to handle data skew in parallel joins have been proposed for both sort-merge join and hash join algorithms [20, 37, 36]. Unlike prior algorithms, our implementation does not require an extra tuning and scheduling phase to address the problem of data skew.

Recently, researchers have explored new architectures to improve join performance. He et al. [18] present GPU-based implementations of both sort-merge join and hash join. Gedik et al. [12] optimize join code for the Cell processors. Both papers try to exploit the parallel nature of these devices with associated high compute density and bandwidth and show significant performance benefits (from 2-7X on the GPU and 8X on Cell) over optimized CPU-based counterparts.

Current trends in general purpose CPUs have also been in the direction of increasing parallelism, both in terms of the number of cores on a chip and with respect to the SIMD width on each core. Chip multiprocessors are different from conventional multiprocessor systems in that inter-thread communication is much faster with shared on-chip caches [17] and the cost of thread synchronization and atomic operations is much lower. Cieslewicz et al. [8] examine aggregation operations on 8-core chip multiprocessors and exploit thread-level parallelism (TLP) and the shared on-chip caches for high-performance aggregation. Zhou et al. [39] implement various database operations to exploit data-level parallelism (DLP) with SIMD instructions. In this work, we show that by efficiently exploiting the capabilities of modern CPUs, we can obtain significant performance advantages over previous join implementations.

Sort-merge join is highly dependent on a good sort implementation. Quicksort is one of the fastest algorithms in practice, but it is unclear whether it can be mapped efficiently to the SIMD architecture. In contrast, bitonic sort [2] uses a sorting network that predefines all comparisons without unpredictable branches and permits multiple comparisons in the same cycle. These characteristics make it well suited for SIMD processors. Bitonic sort has also been implemented on GPUs [16, 13, 32] since it maps well to the high compute density and bandwidth of GPUs. Chhugani et al. [7] show an efficient implementation of a merge sort algorithm by exploiting both TLP and DLP on recent multi-core processors. Our paper adopts the fastest CPU sorting implementation by Chhugani et al. [7] and extends it to sort tuples of (key, rid).

With regards to the choice of the join algorithm, Graefe et al. [14] compare sort-merge join and hash join and recommend that both algorithms be included in a DBMS and be chosen by a query optimizer based on the input data. The hash join algorithm is a natural choice when the sizes of the two relations differ markedly. They also show that data skew hurts the performance of hash join; they thus recommend sort-merge join in the presence of significant skew in the input data. Our paper revisits this comparison of both algorithms focusing on in-memory join operations. Our hash join implementation is not affected by data skew and is optimized for modern multi-core CPUs.

3. JOIN FOR MODERN PROCESSORS

The performance of computer systems has improved steadily over the years, mainly from advances in semiconductor manufacturing and processor architecture. In this section, we will examine how the architectural improvements have impacted the performance of the join operation.

3.1 Main Memory Databases

With the increase in capacity of main memory, a large number of database tables reside completely in main memory. Typical databases consist of tables with numerous columns, with each column having a different width (in bytes). User queries performing a join on more than two such tables are decomposed into pairwise table join operations. Performing the join on the original tables is not an efficient utilization of memory bandwidth and computation, and therefore the tables are stored using 2 columns, key and rid, with key being the join key, and rid storing the address of the tuple [5]. For main-memory databases, the number of entries in a table is typically less than the range of 32-bit numbers (2^32), and hence rid can be represented using 32 bits.
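This two-column (key, rid) layout can be sketched in a few lines of Python; the table contents and field layout here are invented for illustration, with `rid` taken to be the tuple's index in the table:

```python
from array import array

# Hypothetical row-store table: each row is (order_id, customer, amount).
# Only the join key and the row's address (rid) are touched during the join.
rows = [(103, "ann", 9.5), (101, "bob", 12.0), (103, "eve", 3.25)]

# Two-column (key, rid) representation: 32-bit unsigned arrays ('I').
keys = array("I", (row[0] for row in rows))
rids = array("I", range(len(rows)))  # rid = index of the tuple here

# After the join produces matching rids, an epilogue dereferences the
# rids to materialize the full result tuples.
matches = [rid for key, rid in zip(keys, rids) if key == 103]
full_rows = [rows[rid] for rid in matches]
```

The join itself scans only the two narrow 32-bit columns; the wide rows are touched once, at the end, through the rids.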
Although the key can be of any variable width, since the number of records is less than 2^32, the number of distinct keys cannot be more than 2^32, and they should also be represented using 32 bits. Of course, representing a variable-length key using 32 bits without changing the information content is computationally hard; schemes like key-prefix [31] and a 32-bit XOR-and-shift hash function [6] have been used to represent keys using 32 bits. Therefore, we focus on, analyze, and provide results for tables consisting of a 32-bit key and a 32-bit rid. We propose our join computation pipeline to include a prologue phase that converts the inputs to the aforementioned representation, and an epilogue that operates on the generated output and removes the false positive results while gathering the actual keys.

3.2 Optimizing for Memory Subsystem

Join is a memory-intensive operation and consequently is directly affected by the performance of the memory subsystem. As compute performance improves at a much faster rate than memory subsystem performance, memory access latency continues to worsen. To address this performance gap, a number of architectural features have been devised to improve average memory access latency.

Cache: A cache is a module of small but fast SRAM cells that provides low-latency access to recently used data. It is used to bridge the performance gap between the processor and the main memory. Not only can caches reduce memory access latency, they also serve as memory bandwidth amplifiers by filtering out external memory requests. Bandwidth between processor cores and caches is orders of magnitude higher than external memory bandwidth. Thus, blocking data into caches is critical for data-intensive operations such as join.

TLB: Virtual memory was developed to relieve programmers from having to deal with memory management. It allows a program to use memory that is larger than the amount of physical memory in the system. However, every memory access must go through a virtual-to-physical address translation that is often in the critical path of memory access. To improve translation speed, a translation lookaside buffer (TLB) is used to cache the virtual-to-physical translations of the most frequently accessed pages. Ideally, the number of TLB entries should match the number of pages touched by an application. With memory size in the gigabyte range, the number of TLB entries would be in the thousands. However, to make translation fast, the TLB is typically designed as a fully associative or highly associative cache. A TLB larger than a certain size (e.g., 64 entries) is very complex and consumes a lot of power. Recent processors use multiple levels of TLBs, with the lowest level caching the most frequently used pages.

Prefetch: Another mechanism that has been employed to reduce memory access latency is prefetching. Modern processors often include hardware prefetchers that track memory access patterns and automatically prefetch data into caches [11]. However, prefetchers only work well when there is a regular access pattern to begin with. For the join operation, the memory access pattern is fairly random, which reduces much of the benefit of the hardware prefetcher.

Processor-Memory Bandwidth: Besides memory access latency, memory bandwidth is another critical component of the memory subsystem. Over the years, improvements to increase bandwidth include faster data transfer rates and wider interfaces. Despite these improvements, memory bandwidth still grows at a much lower rate than transistor count [33]. Chip multiprocessors exacerbate the bandwidth problem as compute grows faster than bandwidth.

3.3 Optimizing for TLP

The number of cores and thread contexts will continue to grow in future processors. To obtain performance from such architectures, applications should be threaded to exploit thread-level parallelism (TLP). For best performance, data accessed by threads must be partitioned to minimize concurrent updates to shared data structures.

3.4 Optimizing for DLP

Single-Instruction-Multiple-Data (SIMD) execution is an effective way to increase compute density by performing the same operation on multiple data simultaneously. 128-bit wide SIMD (e.g., SSE) is common in processors today. Future processors will have 256-bit or wider SIMD support [21, 34]. SIMD execution requires contiguous data in registers or in memory. If data accesses are not contiguous, gather and scatter overheads are incurred. Another issue with SIMD execution is the requirement of fixed-width data structures. In today's databases, tuples are often compressed with light compression schemes such as prefix compression, resulting in variable-length tuples. Use of SIMD instructions causes the size of the data to increase by a factor of 2x to 10x due to decompression.

4. HASH JOIN

As explained in Section 3.1, we focus on equi-join queries on two tables, with each tuple consisting of two fields (key, rid), each being a 32-bit number.

Q: SELECT ... FROM R, S WHERE R.key = S.key

In addition, the tuples completely reside in main memory. For the remainder of the paper, we use the following notation:
N_R : Number of tuples in the outer relation (R).
N_S : Number of tuples in the inner relation (S).
T : Number of hardware threads (including SMT).
K : SIMD width.
C : Cache size (L2) in bytes.
P : Size of the 1st level TLB (in number of entries).

The basic idea behind a hash join implementation is to create a hash table with the keys of the inner relation (S), and reorder its tuples. This partitioning phase is followed by the actual join phase: iterating through the tuples in R, and for each key, searching for matching keys in the hash table and appending the matching tuples from S to the output table. The expected O(1) search cost of hash tables makes this an attractive option for database implementations. However, with an increasing number of entries in the tables, this approach suffers from the following performance limiters on current CPU architectures:

Size of hash table: In order to avoid wasteful comparisons during the join, it is imperative to avoid collisions during hash lookups. Theoretically, this requires a hash table with a size around two times larger than the number of input elements, or the cardinality of the input keys (if known a priori). We also need a hash function belonging to the class of strongly 2-universal functions [30].

A fixed (small) number of TLB entries: With large table sizes, it is possible to access regions of memory whose page entries are not cached in the TLB – thereby incurring TLB misses, and a large increase in latency. Therefore, it is critical to perform memory accesses in an order that avoids significant TLB misses – even for large input sizes.

Duplicate keys in S: Each duplicate key necessarily leads to a collision in the hash table. Both direct chaining and open addressing methods [25] lead to poor cache behavior, leading to increased memory latency.
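The basic build/probe scheme described at the start of this section can be sketched in plain Python, with a dict standing in for the histogram-based hash table; this shows only the control flow, not the partitioning, prefetching, or collision issues discussed here:

```python
from collections import defaultdict

def hash_join(R, S):
    """Equi-join of two (key, rid) relations: build on inner S, probe with outer R."""
    table = defaultdict(list)          # build phase: key -> list of rids from S
    for key, rid in S:
        table[key].append(rid)         # duplicate keys in S chain here
    out = []
    for key, r_rid in R:               # probe phase: one lookup per R tuple
        for s_rid in table.get(key, ()):
            out.append((key, r_rid, s_rid))
    return out

# Example: key 1 matches twice, key 2 has no partner in S.
result = hash_join(R=[(1, 0), (2, 1)], S=[(1, 10), (1, 11), (3, 12)])
# result == [(1, 0, 10), (1, 0, 11)]
```

A Python dict hides exactly the costs the paper is concerned with (collisions, cache and TLB misses), which is why the remainder of this section rebuilds the hash table explicitly from a histogram.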
4. 12345 log(P ′ ) bits of the key, i.e., bit positions [l*log(P ′ ) .. l*log(P ′ ) + 61 71 log(P ′ /2)] from the right to compute the hash index. Step P2: Perform the prefix sum of the histogram (Hist) to com- 62 72 pute the starting addresses of the elements mapping to the respec- tive indices of the histogram. For example, after computing the 63 73 prefix sum, Hist[j] stores the address where the first tuple from the table whose key maps to index j needs to be scattered to. 62 Step P3: Reorder the tuples of the table by iterating over them, 72 and scattering a tuple to the address stored at the corresponding 689 32456 3 3245256 3 3245257 689 32457 hash location. The address at the corresponding location is incre- mented by the size of the tuple to correctly reflect the scatter ad- Figure 1: Partitioning relations R and S to speedup the hash dress for the next tuple that maps to that index. join operation We perform the above three steps for each level of subdivision memory latency. Array Hashes [1] lead to increased memory con- for each of the sub-tables. Note that the final size of each sub-table sumption and do not scale with multiple cores. depends on the distribution of the key’s in the relation. Further- more, we read each tuple twice (once during Step P1 and later dur- Having described the potential performance bottlenecks, we now ing step P3). This helps in computing the starting address for each describe our hash join implementation that addresses these issues, sub-table within a pre-allocated memory chunk and avoids allocat- along with a corresponding analytical performance model. This is ing separate buffers for each sub-table and maintaining them. We followed by the detailed algorithm description, and its extension to later show that the partitioning phase is compute bound on current exploit the multiple cores and the 128-bit SIMD (SSE). CPUs, and not affected by the two trips to main memory. 
4.1 Algorithm Overview We now describe the analytical model — 1) the amount of data To overcome the dependence on memory latency and memory that needs to be accessed from/to the main memory and 2) the num- bandwidth, we need to partition the input table1 into smaller dis- ber of operations that need to be executed — for the partitioning joint tables (referred to as sub-tables), such that each sub-table can phase. We assume that the input table is too big to fit into the cache. reside in cache. This is followed by the actual join between the Also, since the histogram fits in the cache, the reads/writes from/to corresponding sub-tables in the two relations. We now describe the the histogram are not dependent on the memory bandwidth. For two phases in detail: each tuple, Step P1 reads 8 bytes and Step P3 reads 8 bytes and writes 8 bytes (scattered tuple). Note that the scattered write in 4.1.1 Partitioning Phase Step P3 causes the cache subsystem to first bring in the cache line Our partitioning phase is based on the radix-cluster algorithm of the destination into the cache before the write is performed. So, proposed by Manegold et al. [28]. We partition the data based on this scattered write indirectly reads 8 bytes into the cache and then the rightmost B bits of the two input tables to obtain 2B sub-tables, overwrite that location without using it. Hence, a total of 16 bytes denoted by R1 , .., R2B and S1 , .., S2B (Figure 1). Note that we now are read, and 8 bytes are written in Step P3. In short, a total of 24 need to perform 2B independent join operations – between Ri and Si bytes are read and 8 bytes are written per tuple during the parti- ∀ i 1..2B . The parameter B is chosen in a way that the average size tioning phase. Note that both the reads and writes are performed in of the resultant sub-tables in the inner relation (N S /2B ) fits in the a sequential fashion, hence the bandwidth is effectively utilized. L2 cache. 
Furthermore, in order to avoid TLB misses, we do not want to have more than P open pages at any point of time, where P To compute the number of operations, let costhash denote the is the size of the 1st level TLB. In practice, 2P pages seem to work cost of hash computation on the key as well as loading input data. well when backed up by the next level TLB. However, having more During Step P1, we compute the hash value and also increment than 2P pages seems to expose the TLB miss latency, affecting the the histogram index (effectively loading, adding one, and storing run-time performance on CPUs. Hence, the maximum number of back). Let costincr denote the cost of incrementing the histogram sub-tables that can be generated at any one point of time is fixed to index. Furthermore, the counter is incremented and compared to be P ′ (= 2P ). check for the end of the table. Let costepil denote the cost of this Since we wish to partition the table into 2B sub-tables, we per- epilogue operation. We denote costP1 as the number of operations form a hierarchical partitioning, with each level subdividing a given executed per tuple during the execution of P1. Hence, table into P ′ disjoint sub-tables. We start with the input table, and costP1 = costhash + costincr + costepil (per tuple). subdivide it into P ′ sub-tables. Each of these sub-tables is further subdivided into P ′ sub-tables to obtain a total of P ′2 sub-tables. Step P2 operates on the histogram, and for each entry, reads it, This operation is carried out until the total number of sub-tables modifies it and writes it back. The cost is the same as costincr . This equals 2B . A total of ⌈ B / log(P ′ ) ⌉ levels are required.2 step has the same epilogue operations to obtain a resultant of costP2 operations per hash entry. For level(l) ← 1 ... ⌈ B / log(P ′ ) ⌉: costP2 = costincr + costepil (per hash entry). 
Step P1: Iterate over the tuples of the table and build a histogram Step P3 again computes the hash index, increments it, and scat- (Hist), with the jth entry storing the number of input keys that ters the input tuple to its permuted location. Let the cost of writing hash to index j. Note that the hash function used simply considers be costwrite per tuple. Hence, 1 The terms table and relation are used interchangeably throughout costP3 = costhash + costincr + costwrite + costepil (per tuple). the paper. 2 Unless otherwise stated, log refers to logarithm with base 2 (log ). We denote the cost of partitioning for each tuple (for every level) 2 4 costPartition . Note that we partition both R and S tables into the number of elements (or cardinality) for reducing collisions. Hence, same number of sub-tables using the above algorithm before mov- N Hist is chosen to be min(232−B , 2N Si ). ing to the join phase, described next. Finally, we derive how to choose B for a given cache size C . During the join phase, the original table (Si ), the permuted table (S′i ) 4.1.2 Join Phase and the Histogram Hist need to be cache resident together. Hence for a table with N Si entries, (8+8+4)N Si bytes of cache are required. The join phase performs 2B independent joins – one for each Thus N Si is around ⌊(C /20)⌋. Thus, we need to create around partition generated in the partitioning phase. Let Ri and Si denote the two relations that are being joined in one such partition, and N Ri N S /⌊(C /20)⌋ partitions. Therefore, B equals ⌈log(N S /⌊(C /20)⌋)⌉. and N Si be the number of tuples in the two relations respectively. We now derive a cost model for the amount of data that needs to We again build a histogram using a hash fuction for Si and reorder be read from the main memory and the number of operations that the tuples to obtain S′i . The histogram together with S′i comprises need to be executed and for the Join phase. Although we set our the hash table. 
The size of the histogram is chosen differently, and we will derive it at the end of the subsection. In the meanwhile, we refer to the size of the histogram as N_Hist. We perform the 3 steps (P1, P2 and P3) described above, with a different hash function. Note that after reordering the tuples in S_i, all the tuples with keys that map to the same hash index are stored contiguously in S'_i.

The build phase is followed by the probe phase, where we iterate over the tuples of R_i, hash each key, and go over the corresponding tuples in S'_i to find matching keys and output the result. Note that for each tuple in R_i, all the potential matches in S'_i are stored in consecutive locations. Hence, to reduce the dependency on memory latency, we issue software prefetches by computing the starting address of the matching tuples in S'_i for keys at a certain distance from the current tuple being considered in R_i. We now describe the build and probe phases in detail.

Step J1: Similar to Step P1, iterate over the tuples of S_i and build a histogram, Hist. The hash function uses log(N_Hist) bits, i.e., bit positions [(B+1) .. (B+log(N_Hist))] from the right, to compute the hash index.

Step J2: Similar to Step P2, compute the prefix sum of Hist.

Step J3: Similar to Step P3, permute the tuples in S_i using Hist to obtain S'_i.

Having obtained S'_i, we now perform the probe phase -- Step J4.

Step J4: In order to issue prefetches, we implement the probe phase as follows. We keep a buffer (Buffer) of a small number (say b) of elements. We iterate over the tuples in R_i in batches of b tuples. For each tuple (say with key R_i[k].key), we store the tuple and its computed hash index (j_k) in the Buffer. Furthermore, we also issue a prefetch with the appropriate address of the first potentially matching tuple in S'_i. By construction, the offset of this element is Hist[j_k] within S'_i.

After filling up Buffer with b elements, we iterate over the tuples stored in Buffer, and now for each key Buffer[k'].key and the corresponding hash value j', compare all tuples in S'_i between indices Hist[j'] and Hist[j'+1]. Since we had already issued a prefetch for the first element at address Hist[j'], we expect the first few tuples to be in the L1 cache. Furthermore, as we sequentially search for the matching keys, the hardware prefetcher would fetch the subsequent cache lines into the L1 cache. This reduces the latency requirements of our probe phase. Note, however, that we still incur branch mispredictions (while comparing the keys in S'_i), and our performance should improve with the support of simultaneous multi-threading (SMT) on one core.

We now derive the value of N_Hist. Since we have already considered B bits during the partitioning phase, the maximum number of unique elements cannot exceed 2^(32-B). In addition, as described in Section 4, we need a hash table of size around two times the

We choose the partition parameters such that S_i, S'_i and Hist can reside together in the cache in the average case, but it is indeed possible to obtain some partitions where this is not true. Therefore, to compute the memory requirements, we distinguish between the cases where the above three entities fit together, and when they do not. For the former case, Step J1 reads 8 bytes and Step J4 reads 8 bytes. Hence a total of 16 bytes are read per tuple.

If the three entities do not fit into the cache, 8 bytes are read during J1, and 8 bytes are read and 8 bytes are written during J3. Note that our partitioning scheme ensures that Hist always fits in the cache, and hence does not stress the memory bandwidth. The scattered write (during J3) also reads 8 bytes to fetch the data for the write, and hence a total of 24 bytes are read and 8 bytes written. For Step J4, 8 bytes of R_i are read, and the probe phase now may need to bring in a complete cache line (64 bytes) of S'_i in the worst case to compare the first 8 bytes. Hence, in the worst case, a total of 96 bytes are read and 8 bytes are written. In the best case, each cache line that is read is completely utilized by different inputs, thus requiring only 8 bytes to be read in the probe phase, for a total of 40 bytes read and 8 bytes written during the whole join phase. Of course, for each output tuple, 12 bytes are written (and correspondingly 12 bytes read).

As far as the number of operations is concerned, we borrow the expressions for Steps J1, J2 and J3 from Section 4.1.1:

cost_J1 = cost_hash + cost_incr + cost_epil (per tuple).
cost_J2 = cost_incr + cost_epil (per hash entry).
cost_J3 = cost_hash + cost_incr + cost_write + cost_epil (per tuple).

Step J4 computes the hash index, and stores the tuple and the computed hash index locally, followed by issuing the prefetch instruction. As far as the probe phase is concerned, it reads the two consecutive addresses stored in the histogram, and compares the tuples in that range in S'_i. We represent the cost of locally storing and issuing the prefetch as cost_pref. Furthermore, let h denote the average number of tuples used for comparison from S'_i per tuple in R_i, and let cost_comp denote the cost of one comparison.

cost_J4 = cost_hash + cost_pref + h*cost_comp + cost_epil (per tuple in R_i).

We denote the cost of join as cost_Join, which is the sum of the above four expressions.

4.2 Exploiting Thread-Level Parallelism

In order to exploit the multiple cores and the simultaneous multi-threading within one core, we need to parallelize both the partitioning and the join phases.

4.2.1 Parallelized Partition Phase

During the first level of partitioning (for R or S), all the T threads need to simultaneously perform Steps P1, P2 and P3. In addition, there needs to be an explicit barrier at the end of each step.
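The build steps (J1-J3) and the batched, prefetching probe (J4) can be sketched in scalar C as follows. This is an illustrative reconstruction, not the paper's tuned code: it masks the low bits of the key as the hash (the paper hashes the bits above the B partition bits), recomputes the hash when draining the batch rather than buffering it, and uses GCC/Clang's __builtin_prefetch for the software prefetch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { uint32_t key; uint32_t rid; } Tuple;

/* Illustrative hash: keep log2(nhist) low bits; nhist is a power of two. */
static inline uint32_t hash_idx(uint32_t key, uint32_t nhist) {
    return key & (nhist - 1);
}

/* Steps J1-J3: histogram, prefix sum, permute S into S_prime so that
   tuples with the same hash index become contiguous. hist must have
   nhist+1 zero-initialized entries; afterwards, index j occupies
   S_prime[hist[j] .. hist[j+1]-1]. */
void build(const Tuple *S, size_t n, Tuple *S_prime,
           uint32_t *hist, uint32_t nhist) {
    for (size_t i = 0; i < n; i++)                 /* J1: histogram  */
        hist[hash_idx(S[i].key, nhist) + 1]++;
    for (uint32_t j = 0; j < nhist; j++)           /* J2: prefix sum */
        hist[j + 1] += hist[j];
    uint32_t *cursor = malloc(nhist * sizeof *cursor);
    memcpy(cursor, hist, nhist * sizeof *cursor);
    for (size_t i = 0; i < n; i++) {               /* J3: permute    */
        uint32_t j = hash_idx(S[i].key, nhist);
        S_prime[cursor[j]++] = S[i];
    }
    free(cursor);
}

/* Step J4: probe R in batches of b tuples, prefetching the first
   potential match of every buffered key before the compare loop. */
size_t probe(const Tuple *R, size_t nR, const Tuple *S_prime,
             const uint32_t *hist, uint32_t nhist,
             Tuple *out, size_t b) {
    size_t nout = 0;
    for (size_t base = 0; base < nR; base += b) {
        size_t end = base + b < nR ? base + b : nR;
        for (size_t k = base; k < end; k++)        /* fill the batch  */
            __builtin_prefetch(&S_prime[hist[hash_idx(R[k].key, nhist)]]);
        for (size_t k = base; k < end; k++) {      /* drain the batch */
            uint32_t j = hash_idx(R[k].key, nhist);
            for (uint32_t t = hist[j]; t < hist[j + 1]; t++)
                if (S_prime[t].key == R[k].key) {
                    out[nout].key = R[k].key;
                    out[nout].rid = S_prime[t].rid;
                    nout++;
                }
        }
    }
    return nout;
}
```

Because the potential matches for a key are contiguous in S_prime, one prefetch per key pulls in the start of the range, and the hardware prefetcher covers the sequential scan that follows, as described above.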

After the first level of partitioning, there are enough partitions and each thread can operate on a single partition without any explicit communication with the remaining threads. This is especially true on current multi-core architectures, with T being small (<=16). We now describe in detail the algorithm for parallelizing the first level of partitioning and the issues that impact scalability. We refer to the parallelized Steps P1, P2 and P3 as Steps P1_p, P2_p and P3_p respectively. To reduce the impact of load imbalance, we use a scheme based on dynamic partitioning of the tuples. Specifically, we use the Task Queueing [29] model, and decompose the execution into parallel tasks, each executing a fraction of the total work (described below). This allows the runtime system to schedule the tasks on different hardware threads. For the discussion below, we assume T' (>= T) tasks, and later explain the relation between T' and T.

Step P1_p: Equally divide the input tuples amongst the T' tasks. Each task T'_i maintains its local histogram (Hist_i) and updates it by iterating over its share of tuples. The hash function used is the same as the one used in the serial P1 step.

Step P2_p: Having computed their local histograms, the tasks compute the prefix sum in a parallel fashion. Consider the jth index. At the end of Step P1_p, each task stores the number of tuples whose keys map to the jth index, and hence the total number of tuples mapping to the jth index is the sum of Hist_i[j] over all tasks. For the ith task, the starting address for each index j can be computed by adding up the histogram values of all indices less than j (for all the tasks), and of the jth index for all tasks chronologically before the ith task. This is the same as the prefix sum operation, and we use the algorithm by Hillis et al. [19] to parallelize it.

Step P3_p: Each task again iterates over its share of tuples and uses its local histogram to scatter each tuple to its final location (similar to P3).

In practice, we set T' = 4T. As a result, the dynamic load balancing using task queues improved the scaling by 5% - 10% over a static partitioning of tasks. This may be attributed to the reduction in the latency of the writes, since different tasks are at different execution stages and are not simultaneously writing to the main memory.

4.2.2 Parallelized Join Phase

The partitioning phase creates 2^B partitions. Statically dividing the sub-tables amongst the T threads may lead to severe load imbalance amongst the threads, since the partitions are not guaranteed to be equi-sized. In addition, for skewed distributions, it is possible to have some partitions that hold a large percentage of the input elements, and hence only a few threads will be effectively utilized. We devised a three-phase parallelization scheme that accounts for all kinds of data skew and efficiently utilizes all the computing resources. As in the partitioning phase, we use the task queueing model. Note that we maintain T output tables, which are merged at the end of the complete join algorithm.

Phase-I: We create T' tasks, and evenly distribute the sub-tables amongst the tasks. If the size of both the inner and outer sub-table is less than a pre-defined threshold (Thresh1), the task performs the join operation (Steps J1 .. J4) and appends the output to the relevant output table. Note that there is no contention for writing between threads. In case the size of any of the sub-tables is greater than Thresh1, the task simply appends that sub-table's pair id to a list of pairs to be addressed in the next phase. In addition, the use of task queues ensures that we can achieve a better load balance between threads and address the variability between individual task execution times. There is an explicit barrier at the end of Phase-I, and now we describe Phase-II, where all the threads work simultaneously to join a pair of sub-tables. All the sub-tables that were not joined during Phase-I now go through Phase-II.

Phase-II: Let R_i and S_i denote the two relations to be joined by multiple threads. As in the partitioning phase, the first three steps (J1, J2 and J3) are parallelized in a similar fashion to Steps P1, P2 and P3 respectively. After the build phase, we execute the probe phase (J4_p) as described below.

Step J4_p: Evenly divide the input tuples in R_i amongst the T' tasks. Each task maintains a separate buffer (Buffer_i) and operates in batches of b tuples. While searching for potential matches for any key in S'_i, it is possible to find a lot (>= Thresh2, a pre-defined threshold) of potential matches (for skewed distributions like Zipf [15]). To avoid load imbalance in such cases, the task does not perform the search, appends that key to the list of unprobed keys, and also stores the starting and ending probing addresses.

At the end of the above phase, we have a list of unprobed keys. We now consider each of the unprobed keys, and perform the search in a parallel fashion.

Phase-III: All the threads work simultaneously on each of the probes. We evenly divide the search range amongst the keys, and each thread searches for matching keys and appends the relevant tuples to its respective output table.

The above three-phase parallelization scheme incurs low overhead (for large input sizes), and aims at efficiently utilizing the memory bandwidth and the computation cores. The thresholds used above, Thresh1 and Thresh2, are set to T*C1 and T^2*C1 respectively, where C1 is the number of tuples that fit in the L1 cache. These cutoffs are chosen to reduce the overhead of parallelization.

Prior attempts have been made to solve the load imbalance problem in parallelizing the join phase in the presence of data skew [20, 36]. However, these were in the context of parallelizing across clusters of computers, and not chip multiprocessors. Consequently, network communication latency and synchronization overhead did not allow for such fine-grained task-level parallelization. The scheme described in this section creates fine-grained tasks (Phases II and III) that can divide up the work evenly across threads. Any remaining imbalance due to arbitration and latency effects is handled through a task queue mechanism that performs work stealing to balance the load across threads. We shall see in Section 6 that our scheme results in scalable performance even for heavily skewed data.

4.3 Exploiting Data-Level Parallelism

There is inherently a lot of data-level parallelism in the hash join implementation. We describe below a data parallel algorithm assuming a K-element-wide SIMD, with K being equal to 4 for the current SSE architecture.

During the partitioning phase, we operate on each tuple by computing the hash index, followed by updating the histogram, followed by the scatter operation. The corresponding steps P1_dp and P3_dp are as follows:

Step P1_dp: Iterate over the input tuples by operating on K tuples simultaneously. We need to extract the K keys and compute the hash function. Since the same hash function is applied on all the keys, this maps to SIMD in a straightforward way. This is followed by updating the K histogram bins simultaneously.
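The offset computation of Step P2_p can be made concrete with a small sketch (serial here for clarity; the paper computes the same prefix sum in parallel with the Hillis-Steele algorithm, and the function name is illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* hist[t*nidx + j] is task t's count for hash index j.
   base[t*nidx + j] becomes the scatter offset where task t writes its
   first tuple of index j: the counts of all tasks at indices < j, plus
   the counts of tasks 0..t-1 at index j itself. */
void compute_offsets(size_t ntasks, size_t nidx,
                     const uint32_t *hist, uint32_t *base) {
    uint32_t running = 0;
    for (size_t j = 0; j < nidx; j++)          /* all indices before j */
        for (size_t t = 0; t < ntasks; t++) {  /* earlier tasks at j   */
            base[t * nidx + j] = running;
            running += hist[t * nidx + j];
        }
}
```

With the base offsets in hand, Step P3_p lets every task scatter its own share of tuples without any synchronization, since the write ranges of different tasks are disjoint by construction.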

Step P2_dp: As described in Section 4.1.1, Step P2 involves a prefix sum over a histogram table. Data parallel prefix-sum algorithms have been proposed in the literature [4]. However, such algorithms do not seem to give much benefit at K = 4.

Step P3_dp: Operating on K tuples simultaneously, compute the hash indices, followed by gathering the scatter addresses from the histogram bins, and scattering the K tuples to their respective permuted locations.

For the join phase, Steps J1_dp through J3_dp can exploit data-level parallelism in a similar fashion to Steps P1_dp through P3_dp respectively. Step J4_dp is modified as follows:

Step J4_dp: Operating on K tuples simultaneously, perform the search by comparing one element for each of the tuples.

In order to achieve SIMD scaling, we need efficient hardware implementations of the following features:

- Data Gather: In Step J4_dp, we need to pack together the elements from distinct search locations in a SIMD register to perform comparisons with the respective keys.

- Data Scatter: In Step P3_dp, we need to write consecutive elements in the SIMD register to non-contiguous locations.

- SIMD update collision: In P1_dp and J1_dp, the simultaneous update of the histogram bins needs to be handled correctly in SIMD. In case the K bins are all distinct, it reduces to a gather operation, followed by an increment, followed by a scatter back. However, if two or more bins are the same, the update for all the distinct bins needs to be done in a first pass, followed by the next set of distinct bins. Implementing this in software is prohibitively expensive, and we need hardware support for such atomic vector updates. The benefit of such hardware support can be more significant especially when few conflicts are expected within the SIMD lanes (e.g., Step J1_dp).

However, the current CPU SSE architecture lacks support for efficient implementations of all the above features. Hence, we do not see any appreciable speedup in the data-level parallel implementation of the hash join algorithm.

5. SORT-MERGE JOIN

Sort-merge join sorts the rows in both input tables by the join key and then merges these tables. The most expensive part (>=98%) of sort-merge join is the sorting of the two tables. Therefore, it is essential to use the most efficient sorting implementation to achieve the best sort-merge join performance.

5.1 Scalar Implementation

For a scalar sort implementation, we adopt an efficient implementation of merge sort by Chhugani et al. [7]. Merge sort essentially merges two lists of length L to produce a list of length 2L. In the next step, it merges two lists of length 2L to produce one of length 4L, and so on until there is a single sorted list. Chhugani et al. optimized their implementation by (1) replacing branches with conditional moves that do not suffer branch misprediction, (2) blocking for cache to make efficient use of memory bandwidth, and (3) using multi-way merging to merge cache-sized blocks to produce a single sorted list. However, their implementation sorts only keys. For this work, we extend their implementation to sort tuples of (key, rid).

There are two ways to sort the tuples. One way is to treat (key, rid) as a single entity. For example, if both the key and the rid are 32-bit values, they can be treated as a single 64-bit entity. The benefit of this approach is that no extra instructions are needed on a 64-bit architecture. A comparison can be performed only on the 32-bit keys, but the actual sort moves the entire 64-bit entity that includes both key and rid.

The snippet of x86 assembly instructions below depicts the innermost loop that merges two lists, A and B. It loads pointers to A and B (lines 1-2), assumes that B is less than A and loads B's content into register rdx (line 3), and speculatively advances both A and B pointers (lines 4-5). A comparison of A's and B's keys (line 6) sets a conditional flag. Conditional move instructions (cmov) use this conditional flag to fix rdx (line 7) and roll back B's pointer (line 8) if A is less than B. If A is greater than B, A's pointer is rolled back to the old value (line 9). Finally, the content of rdx (whether containing A or B) is stored into a destination (line 10). Although this code executes more instructions than a simple if-then-else block, it eliminates the branch based on the comparison of A and B, and improves the runtime performance.

Note that the number of instructions below is the same as the number of instructions for sorting 32-bit keys only, except that the quadword keyword is used to actually move 8 bytes to and from memory.

1.  mov rsi, rax              ; save old ptr_A
2.  mov rdi, rbx              ; save old ptr_B
3.  mov rdx, qword ptr [rbx]  ; load B's key&rid
4.  add rax, 8                ; ptr_A+=2
5.  add rbx, 8                ; ptr_B+=2
6.  cmp dword ptr [rsi], edx  ; compare keys only
7.  cmovc rdx, [rsi]          ; A<B, load A's key&rid
8.  cmovc rbx, rdi            ; A<B, roll back ptr_B
9.  cmovnc rax, rsi           ; A>=B, roll back ptr_A
10. mov qword ptr [rcx], rdx  ; store both key&rid

The second way is to treat them separately, therefore incurring extra instructions to move the associated rid as well. This results in a slowdown but is more general, as it does not assume that the key and rid are kept together.

When the size of the key or the rid is greater than 32 bits, a (key, rid) tuple cannot co-locate in a single 64-bit scalar register. In this case, extra instructions are needed to sort them. The code above needs three new instructions to explicitly move the rid with the key: a load (load B's rid), a conditional move (load A's rid), and a store (store rid).

5.2 Exploiting Data-Level Parallelism

We use a bitonic merge network [7] to exploit data-level parallelism. Figure 2 shows a 4x4 bitonic merge network that merges two sorted sequences of length 4 and produces a single sorted sequence of length 8. A 4x4 merge network has three levels, each of which comprises comparisons of four pairs of elements on four lanes (e.g., four boxes at each level). Within a lane i, it assigns the smaller element to L_i and the larger element to H_i. Between the levels is a shuffle network that routes the L's and H's to the desired lanes for the next level.

Initially, sequences A and B are sorted in the same ascending order. Bitonic merge needs one sequence to be sorted in ascending order (A), and the other in descending order (B). In the figure, A0, A1, A2, A3 are four contiguous elements of A that are already loaded in a SIMD register. B is shown after being loaded into a SIMD register and permuted into descending order (B3, B2, B1, B0), called B'. At level 1, a SIMD comparison on A and B' assigns the smaller values of the pairs (A0, B3), (A1, B2), (A2, B1), (A3, B0) to one SIMD register containing the L's (L0 L1 L2 L3) and the larger values to another SIMD register containing the H's (H0 H1 H2 H3). For level 2, these L's and H's need to be routed to the desired lanes for another comparison.

8. 11 22 13 24 14 23 12 21 20. xmm3 = sse shuffle(xmm18, xmm19, dir4); 1232456 2 2 4 4 3 3 1 1 Each level contains a SIMD min and a SIMD max instruction. We use generic names such as sse min and sse max to simplify 1232457 2 2 4 4 3 3 1 1 discussion as SSE has a variety of min and max for different data types and sizes. Moreover, SSE uses different SIMD instructions to shuffle elements, depending on the shuffle patterns. Here we use 1232458 2 2 4 4 3 3 1 1 a generic sse shuffle(A,B,direction) to represent all these instruc- tions, where direction tells the shuffle instructions how to route 444 84449 64447 34445 the elements of A and B. Note that for sorting keys only, after min/max instructions at Figure 2: A bitonic merge network that merges two sequences lines 3-4, a pair of shuffle instructions is needed because H0 , H1 of 4 elements each (A and B) to produce a single sorted se- need to go to the lanes 3 and 4 while L2 , L3 need to go to lanes 0 quence of 8 elements. and 1, respectively. Due to the way the shuffle is implemented in SSE, the pair of shuffles at lines 7-8 actually require three SSE in- structions to route 2 4-wide SIMD registers. This peculiarity disap- L’s (or H’s) within a SIMD register need to be routed in different pears when routing 2-wide SIMD registers. Thus, the total number directions, then shuffles are need. In practice, one shuffle is needed of SIMD instructions for sorting keys only is 13. for each direction. As the bitonic merge network becomes larger, The same 4x4 network can be applied to tuples of a 32-bit key the top levels do not need shuffles because all L’s or H’s within a and a 32-bit rid, treating a (key, rid) tuple as a single 64-bit entity. SIMD register move in the same direction. The same operation is Since SSE4 operates on two 64-bit values at a time, the number of repeated for level 3. 
At the end of level 3, two resulting SIMD reg- comparisons at each level doubles (i.e., 2 SIMD min and 2 SIMD isters containing L0 L1 L2 L3 and H0 H1 H2 H3 are interleaved (via a max instructions). In the code above, all lines (1-20) belong to the pair of shuffle instructions) to get a sorted sequence of L0 , H0 , L1 , merge network. The number of shuffle instructions at level 2 and H1 , L2 , H2 , L3 , and H3 . level 3 doubles. However, no shuffle is needed at level 1 because Mapping this bitonic merge network to SSE4 [22] produces the the entire SIMD registers (xmm4, xmm5, xmm6, xmm7) remain sequence of instructions below. The instructions in black text (lines intact going to the next level. The number of instructions is 20, 1-2, 5-12) are for sorting keys only, as four keys can fit into a single and increase of 1.54X over 32-bit keys only (13 instructions). The SIMD register and can be processed by a SSE instruction concur- performance is expected to be less than 2X slower than keys only. rently. Ignore the instructions in blue text for now (lines 3-4, 13-20) as they are for (key, rid) tuples (describe below). // xmm2 and xmm3 in descending order 5.3 Exploiting Thread-Level Parallelism // level 1 Intel Core i7 provides two aspects of TLP: two hardware con- 1. xmm4 = sse min(xmm0,xmm3); texts on each core (SMT) and four cores on the same CPU pack- 2. xmm5 = sse max(xmm0,xmm3); age. Merge sort can take advantage of SMT by running two merg- 3. xmm6 = sse min(xmm1,xmm2); ing threads on the same core to hide instruction and memory la- 4. xmm7 = sse max(xmm1,xmm2); tency. Without SMT, a wider network should be used to increase // two shuffles for keys only parallelism and overlap SIMD instruction latency with computa- // no shuffle for (key, rid) tion. Consider the merge network in Figure 2, a 4x4 network is // 1st 2x2 network begins composed of two independent 2x2 networks at level 2 and level 3. 
// level 2 However, a 4x4 network has one extra level (level 1 in Figure 2) 5. xmm8 = sse min(xmm4,xmm6); that includes extra min/max (two for keys only, four for key-rid) 6. xmm9 = sse max(xmm4,xmm6); and two shuffles (keys only). Going to a 8x8 network gives two // shuffle independent 4x4 networks but requires yet another level (four extra 7. xmm12 = sse shuffle(xmm8, xmm9, dir1); min/max). In short, as the network becomes wider, more levels and 8. xmm13 = sse shuffle(xmm8, xmm9, dir2); thus instructions are required. Note that when the network is wider // level 3 than SIMD lanes, no shuffles are needed at the upper levels. 9. xmm16 = sse min(xmm12,xmm13); SMT obviates the need to go to wider networks by overlapping 10. xmm17 = sse max(xmm12,xmm13); instruction and memory latency with instructions from the other // interleave result thread. Using smaller network results in fewer instructions and as 11. xmm0 = sse shuffle(xmm16, xmm17, dir3); long as all pipeline stalls are overlapped with useful work, it will 12. xmm1 = sse shuffle(xmm16, xmm17, dir4); result in shorter execution time. // 2nd 2x2 network begins The second aspect of TLP is parallel merge. Tuples are first par- // level 2 titioned among T threads, which sort their own partitions. Then 13. xmm10 = sse min(xmm5,xmm7); they cooperate in merging T sorted list into a single sorted list. 14. xmm11 = sse max(xmm5,xmm7); When intermediate lists are larger than caches, merging may be- // shuffle come bandwidth bound, as these lists streams from/to memory mul- 15. xmm14 = sse shuffle(xmm10, xmm11, dir1); tiple times. We address the bandwidth issue in the next section. 16. xmm15 = sse shuffle(xmm10, xmm11, dir2); // level 3 5.4 Bandwidth-Oblivious Sort 17. xmm18 = sse min(xmm14,xmm15); Multiway merging [7] is used to address the bandwidth bottle- 18. 
xmm19 = sse max(xmm14,xmm15); neck by forming a tree of threads that incrementally merge the // interleave four results heads of partially sorted lists simultaneously. As tuples are merged, 19. xmm2 = sse shuffle(xmm18, xmm19, dir3); they are pushed to a “parent” thread up the tree. The parent thread 8

merges these tuples with tuples from other child threads and pushes the merged tuples up to its own parent. This goes on until tuples reach the root thread, which merges and stores them to memory. By merging all lists in parallel in this fashion and limiting the number of "in-flight" tuples within the last level cache, multiway merging reads each tuple once from main memory and writes it once to main memory. In short, multiway merging turns a bandwidth-bound merge sort into a compute-bound merge sort by judiciously managing its usage of memory bandwidth.

Figure 3: Time spent in partitioning and join phases with varying number of partitions for 128M tuples with uniformly distributed keys.

Figure 4: Computation time measured in cycles per tuple with various inputs: (a) varying the number of tuples from 64K to 128M (b) varying the join selectivity (c) varying the input cardinalities (d) varying the theta value of Zipf distributions.

6. PERFORMANCE EVALUATION

In this section, we show the performance of both the hash-based join implementation and the sort-based join implementation. We run our experiments on a single-socket Intel Core i7 965 system with 6GB of DDR3-1333 memory. The processor runs at 3.2GHz and has four out-of-order superscalar processor cores that support simultaneous multi-threading (SMT) with two hardware threads per core. Each core has a 32KB L1 instruction cache, a 32KB L1 data cache and a 256KB combined L2 instruction and data cache. The four cores share a 6MB L3 cache. For fast virtual-to-physical address translations, each core maintains a 64-entry fully-associative L1 TLB and a 512-entry four-way set-associative L2 TLB.

6.1 Hash Join Performance

We evaluate the performance of the hash join in three aspects: 1) the benefit of partitioning; 2) the handling of input variations such as the table size, the join selectivity and the number of distinct keys; 3) the handling of heavily skewed data such as the Zipf distribution [15].

6.1.1 Partitioned Hash Join

To study the benefit of partitioning for the hash join algorithm, we join two tables of 128 million tuples with uniformly distributed keys [9, 18, 28]. Figure 3 shows the time (in cycles) spent to process each tuple on the quad-core processor when the number of partitions varies from 64 to 1M (1M refers to 1 million).

To study the performance trade-offs, we separate the time spent in the partitioning phase and the join phase. When the number of partitions is small (64, 128 and 256 partitions), the partition size is too big to fit in the caches. The join phase is memory bound and becomes the performance bottleneck. As the number of partitions increases, the partition size reduces and eventually fits into the caches. When the number of partitions is greater than 4K, the time required for the join phase stabilizes. Further increasing the number of partitions does not improve performance any more, because the time spent in the partitioning phase becomes dominant.

Experimentally, we found the optimal number of partitions to be 16K, when each partition just fits in the on-die caches. Multi-pass partitioning is necessary when the number of partitions per pass is greater than 128, which is twice the size of the L1 TLB.

6.1.2 Uniform Distribution

Next, we study how the hash join algorithm handles input variations. In this study, we do not include the time to generate the output, which is less than 3 cycles per output tuple. Even when the output size is similar to the input size, this adds less than 10% to the actual runtime.

Figure 4(a) shows the effect of changing the input data size from 64K to 128M tuples. The time per tuple varies from 25 cycles to 32 cycles as the input data size increases; the variation is small and very stable. This corresponds to 100M to 128M tuples per second. This result is better than any published performance in the literature. For example, He et al. [18] report a runtime of 2.5 seconds on a 2.4GHz Intel Core2 quad-core processor with tables of 16M tuples that have uniformly distributed 32-bit keys. In comparison, our Core i7 performance is around 0.15 seconds (30 cpe * 16M / 3.2G), which is 16.6X faster. We also measured our performance on the same 2.4GHz Core2 quad-core processor used by He et al. and found our implementation to be 6.5X faster. This illustrates the efficiency of our implementation. Furthermore, contrary to their claim that the hash join implementation on a Core2 quad-core CPU is 1.9X slower than a GPU (Nvidia 8800GTX), our Core2 implementation is in fact 3.4X faster than the same GPU platform.

Figure 4(b) shows the effect of the join selectivity, obtained by changing the percentage of matching tuples for a table size of 128M tuples. 0% means that there are no matching tuples, which results in no output data (the graph does not include the time for writing the output). We notice that the overall join time improves slightly as there are fewer tuples with matches. This is because the branch prediction in the probing phase improves, as the prediction accuracy increases by always predicting a non-match.

Figure 4(c) shows the effect of varying the number of distinct keys from 1M to 128M for a table size of 128M tuples.
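The CPU-comparison arithmetic quoted in Section 6.1.2 can be reproduced directly; a small sketch (function names are illustrative, and 16M is taken as 16*10^6 — using 16*2^20 shifts the result only slightly):

```c
/* Time to join at a given per-tuple cost: cycles/tuple * tuples / Hz. */
double hash_join_seconds(double cycles_per_tuple, double tuples,
                         double clock_hz) {
    return cycles_per_tuple * tuples / clock_hz;
}

/* Section 6.1.2: ~30 cycles per tuple at 3.2 GHz for 16M tuples,
   versus the 2.5 s reported by He et al. on a 2.4 GHz Core2. */
double speedup_vs_he_et_al(void) {
    double ours = hash_join_seconds(30.0, 16e6, 3.2e9); /* ~0.15 s */
    return 2.5 / ours;                                  /* ~16.7X  */
}
```

This recovers the approximately 0.15 s and 16.6X figures quoted above.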

10.the probe phase, thus achieving better performance by around 5%- Time Partitioning Join 10% as compared to the higher cardinality data. Note that we do P1 P3 Total Build Probe Total not exploit the cardinality information during execution of our al- SMT OFF 1.2 4.4 5.6 4.7 5.7 10.4 gorithm. In case the cardinality is low and known a priori, we can SMT ON 1.2 6.0 7.2 4.7 4.7 9.4 reduce the number of partitioning phases or completely eliminate it, thereby further speeding up the runtime. Table 1: Computation time (in cycles per tuple in the inner relation) for each step in a hash join implementation and the TLP Scaling: The performance results reported earlier in this effect of SMT support. For the partitioning phase, we report section correspond to the parallel implementation of the hash join the time taken for every pass. algorithm as described in Section 4.2. The partitioning step is par- allelized by dividing the input tables evenly among the threads. In described in Section 4. The numbers are reported for joining two the join phase, each thread is responsible for joining one or more 128M relations with uniformly distributed keys. Since the cache independent partitions. size (L2) is 256 KB, the number of partition bits (B ) should equal For the uniform distribution, both the partitioning and the join ⌈log(128M/12.8K)⌉ (= 14). This is validated by Figure 3, where the phases scale very well with respect to the number of cores. There join time is minimized with 14 bits of partitioning. We now com- is no load imbalance among the different threads executing in par- pare the runtimes with our derived analytical model for the current allel. Load imbalance usually arises in the join phase due to the platform. The cost symbols below are defined in Sections 4.1.1 variation of the partition sizes when input keys are skewed. For and 4.1.2. 
The primitive costs were determined by counting instructions in the binary. All cycle and bandwidth numbers, unless stated otherwise, are given per tuple of data operated on.

Step P1 (Section 4.1.1) reads 8 bytes of data. The peak bandwidth of our platform is around 7.2 bytes per cycle (measured using an in-house bandwidth calibrator), and hence step P1 is bandwidth bound and should take around 1.1 cycles, which is close to the actual performance (1.2 cycles). Since the performance is limited by memory bandwidth, SMT does not improve the runtime any further. Step P2 has negligible runtime (less than 0.01 cycles) and is not reported. On our current system, cost_hash = 4 ops (1 op denotes one executed instruction), cost_incr = 3 ops, cost_write = 5 ops and cost_epil = 3 ops. Hence step P3 should take around 15 ops of computation on a single core. The total memory bandwidth requirement is around 24 bytes. Assuming a throughput of 1 op per cycle, step P3 should be compute bound and take around 3.75 cycles on our system (with linear scalability). This is within 20% of the actual measured time (4.4 cycles). Note that SMT threads degrade performance here: two threads sharing the same TLB incur additional TLB misses and hence additional latency.

For the build phase during join (Section 4.1.2), the total cost = cost_J1 + cost_J2 + cost_J3 = 27 ops. With a throughput of 1 op/cycle and linear scaling, this amounts to 6.75 cycles. Since our current system can issue multiple instructions in one cycle, the throughput of steps J1 and J2 increases, and we measure a runtime of 4.7 cycles per tuple. Note that multiple issue does not benefit the partitioning phase, since step P1 has an explicit barrier at the end of its execution and is bandwidth bound. During the build phase, however, there is no barrier after individual steps and the entire phase is compute bound. As expected, SMT does not provide any further benefit.

For the probe phase during join (Section 4.1.2), cost_pref = 7 ops and cost_comp = 10 ops (accounting for the average case of branch misprediction). Thus, the total time evaluates to 24 cycles on one core, and around 6 cycles on 4 cores, which is within 6% of the actual measured data. SMT further improves performance, since stalls during branch misprediction can be overlapped with computation performed by the other thread executing on the core. Hence the effective value of cost_comp should further reduce to around 3 ops (assuming complete overlap) to obtain the measured speedup. Thus, SMT substantially benefits the join phase of the hash join algorithm.

With uniformly distributed keys, partition sizes do not vary much and load imbalance is not an issue. Consequently, our parallelization of the join step only needs to go through Phase I of Section 4.2. We see a scaling of 4.4X over scalar code using four cores. The scaling is over 4X because SMT threads hide memory latency and improve core efficiency.

6.1.3 Handling Skewed Data

While the uniform distribution offers the best case for parallel scalability, skewed input distributions such as the Zipf distribution [15] test the ability of our parallel hash join implementation to handle load imbalance. Under this circumstance, the 3-phase parallelization algorithm of Section 4.2 is exercised; we discuss the results in this section. The serial performance of our hash-join algorithm for skewed data is comparable to its serial performance for uniform data, in accordance with the results of our analysis in Section 4.1.

TLP Scaling: When partitions are not uniformly sized, the amount of time to join each partition varies. In the extreme case, when all tuples fall into a single partition, a naive implementation shows no parallel scalability. The Zipf distribution is one such skewed distribution [15]. Consequently, we see only a 2.8X scaling on the 4-core processor when only Phase I of our parallel join scheme is employed.

We address this load imbalance through our 3-phase join parallelization algorithm. In Phase II, threads cooperatively work on joining large partitions by dividing the tuples among themselves. For the Zipf distribution with θ = 100 (which is heavily skewed), 30–40% of the inputs are in partitions greater than 32K tuples, which was selected as the threshold. However, there may still be load imbalance in the probe phase when certain tuples have a large set of potential matches. To handle this, we separate out such probes in Phase III and cooperatively probe each such tuple using all threads. For the Zipf distribution with θ = 100, less than 0.1% of the tuples went through this phase. Using these optimizations, we improved the parallel scalability of the join phase from 2.8X to 3.9X.

Figure 4(d) shows the stability of our algorithm with skewed data. We control data skew by varying the θ value from 0 to 100. With our load-balancing optimization, the figure shows that our hash join implementation is stable across different degrees of skew.

6.1.4 Analytical Model

Table 1 shows the breakdown of the time spent in each step. To summarize, our analytical model predicts runtimes within 6%–20% of the observed times for most of the cases. However, the model cannot evaluate the effect of multiple instruction issue, and hence is an upper bound for such phases (like the build phase of the join operation).
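The per-phase estimates above follow from simple arithmetic over the measured op counts. Below is a minimal sketch of that model — our own reconstruction for illustration, not the paper's code — assuming a throughput of 1 op per cycle and linear scaling across the four cores, as stated in the text:

```python
# Sketch of the analytical cost model of Section 6.1.4 (reconstruction,
# not the paper's code). Op counts and peak bandwidth are the values
# quoted in the text; 1 op = 1 executed instruction, 1 op/cycle assumed.

CORES = 4
PEAK_BW_BYTES_PER_CYCLE = 7.2  # measured with the in-house calibrator

def partition_p1_cycles(bytes_read=8):
    # Step P1 is bandwidth bound: cycles/tuple = bytes moved / peak bandwidth.
    return bytes_read / PEAK_BW_BYTES_PER_CYCLE

def partition_p3_cycles(cost_hash=4, cost_incr=3, cost_write=5, cost_epil=3):
    # Step P3 is compute bound and scales linearly with cores: 15 ops / 4 cores.
    ops = cost_hash + cost_incr + cost_write + cost_epil
    return ops / CORES

def build_cycles(total_ops=27):
    # Build phase: cost_J1 + cost_J2 + cost_J3 = 27 ops, linear scaling.
    return total_ops / CORES

print(round(partition_p1_cycles(), 2))  # 1.11 cycles/tuple (measured: 1.2)
print(partition_p3_cycles())            # 3.75 cycles/tuple (measured: 4.4)
print(build_cycles())                   # 6.75 cycles/tuple (measured: 4.7)
```

The gap between the predicted 6.75 and measured 4.7 cycles for the build phase is exactly the multiple-issue effect the model cannot capture, which is why it is an upper bound there.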

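The Phase II rule described in Section 6.1.3 amounts to splitting partitions by size against the 32K-tuple threshold. A small illustrative sketch (the helper name `classify_partitions` is our invention, not the paper's code):

```python
# Illustrative sketch of the 3-phase load-balancing rule (not the paper's
# implementation): partitions at or below the threshold are joined
# one-per-thread (Phase I); larger partitions are split among all threads
# (Phase II); probe tuples with very large match sets are handled
# cooperatively in a separate Phase III.
THRESHOLD = 32 * 1024  # 32K tuples, the threshold quoted in Section 6.1.3

def classify_partitions(partition_sizes, threshold=THRESHOLD):
    phase1 = [i for i, size in enumerate(partition_sizes) if size <= threshold]
    phase2 = [i for i, size in enumerate(partition_sizes) if size > threshold]
    return phase1, phase2

# A heavily skewed layout: one giant partition dominates the input.
p1, p2 = classify_partitions([100, 5_000, 2_000_000, 30_000])
print(p1, p2)  # -> [0, 1, 3] [2]
```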
Figure 5: Comparison between sort-merge join and hash join performance (cycles per tuple) with varying number of tuples (64K to 128M) in the inner and outer relations.

Figure 6: Comparison between sort-merge join and hash join. For sort-merge join, we add projected performance with 256-bit-wide and 512-bit-wide SIMD.

6.2 Sort-merge Join Performance

For the sort-merge join implementation, most of the execution time is spent in sorting the two tables. For the scalar version, sorting 32-bit or 64-bit keys takes 11 clock cycles per element per iteration (cepi) on our system; the execution time is cepi*N*logN cycles. Sorting two 128M-key tables takes 79 billion cycles (24.9 seconds). Sorting two tables, each with 128M 64-bit (key, rid) tuples, takes 11.4 cycles per element per iteration (25.8 seconds), a negligible increase over sorting keys only. The small increase is likely due to moving more data (both key and rid) from memory. Another factor that impacts sorting performance is the size of keys and rids: when a (key, rid) tuple cannot fit in a single 64-bit scalar register, extra instructions are needed to sort it. On our test system, it takes 14.2 cycles per tuple per iteration to sort 128-bit (key, rid) tuples, which translates to 32 seconds for two 128M-tuple tables.

For the SIMD implementation, sorting keys only takes 3 cepi (6.8 seconds for two 128M-key tables), while sorting 64-bit (key, rid) tuples takes 4.5 cepi (10.2 seconds). These numbers match the analytical model proposed by Chhugani et al. [7]. The slowdown of sorting (key, rid) tuples over keys only is 1.5X, which matches the increase in the number of instructions over keys only (1.53X). Parallel scaling of the SIMD implementation is nearly linear: 3.6X on four cores. The cepi for keys only is 0.83 (1.8 seconds for two tables of 128M keys) and for (key, rid) tuples is 1.25 (2.8 seconds for two tables of 128M tuples).

6.3 Comparison between Hash Join and Sort-merge Join

Figure 5 shows the computation time of hash join and sort-merge join with varying numbers of tuples in both relations. For sort-merge join, we show both non-SSE and SSE implementation numbers. The SSE implementation of sort-merge join improves performance by 1.9X over the non-SSE implementation. The theoretical maximum improvement with 128-bit SSE is 2X, because each tuple consists of a 32-bit (key, rid) pair and we can therefore accommodate two tuples in one 128-bit word. With 128 million tuples, our hash join implementation is 2X faster than even this optimized SSE sort-merge join implementation. Sort-merge join becomes relatively faster with smaller tables because the number of sort levels decreases proportionally to log N (N: number of tuples); the gap between hash join and sort-merge join decreases to 1.6X with 64K elements.

7. FUTURE ARCHITECTURE TRENDS

In this section, we discuss future architectural trends from both near-term and longer-term perspectives, and how these trends affect the choice of join algorithm.

Wider SIMD Execution: In Section 5.2, we showed that sort-merge join can fully exploit DLP using SIMD execution. In sort-merge join, the efficiency of SIMD execution is affected only by the size of the (key, rid) tuple. For example, with a 32-bit (key, rid) pair, each tuple is already 64 bits, and a 128-bit-wide SIMD implementation (such as SSE) can only operate on two tuples simultaneously. In the near term, future processors will adopt wider SIMD execution (such as 256-bit for AVX [21] and 512-bit for Larrabee [34]). This wider SIMD support would strongly benefit sort-merge join.

Figure 6 shows the effect of wider SIMD execution on sort-merge join and hash join. We project the performance of sort-merge join with 256-bit and 512-bit SIMD based on the work by Chhugani et al. [7]. With 256-bit SIMD, sort-merge join starts performing better than hash join for small numbers of tuples, and 512-bit SIMD execution of sort-merge join is projected to be 1.35X–1.65X faster than hash join.

For hash join, the scattered update to the partitions (step P1) or the hashed bucket (step J1) is the primary limiter in exploiting DLP. Exploiting DLP in this step requires efficient hardware scatter support: an efficient scatter operation writes multiple elements to different memory locations in the most bandwidth-efficient manner with minimal latency. More importantly, further performance benefit can be achieved with atomic vector support. In steps P1 and J1, multiple tuples can potentially hash into the same partition, and these steps require the targeted hash entry to be updated accordingly. Under SIMD execution, multiple elements may then update the same memory location. Current SIMD architectures cannot handle this collision case and require reverting to the scalar implementation. As a result, SIMD execution of steps P1 and J1 is slower than serial execution due to the instruction overhead of conflict detection. Efficient support for atomic vector operations, such as that proposed by Kumar et al. [24], would be beneficial.

Limited Per-Core Bandwidth: As described in Section 3.2, external memory bandwidth is becoming a scarce resource with the advent of many-core processors. Once the memory bandwidth requirement reaches the peak external bandwidth, integrating additional processor cores provides no performance benefit and only increases power consumption. Therefore, parallel algorithms need to reuse the data in the cache as many times as possible before data are written back to main memory.
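The scalar sorting numbers in Section 6.2 can be reproduced directly from the cepi*N*log2(N) model. A quick arithmetic check (our sketch; the 3.2 GHz clock rate is an assumption about the test machine, since the text quotes only cycles and seconds):

```python
# Sketch: checking the sort-cost model of Section 6.2,
#   cycles = cepi * N * log2(N),
# against the quoted totals. The 3.2 GHz clock is our assumption.
import math

def sort_cycles(n_tuples, cepi):
    return cepi * n_tuples * math.log2(n_tuples)

N = 128 * 2**20  # 128M tuples per table, so log2(N) = 27
scalar = 2 * sort_cycles(N, cepi=11.0)  # two tables, scalar code
simd = 2 * sort_cycles(N, cepi=3.0)     # two tables, SIMD, keys only

print(round(scalar / 1e9, 1))    # ~79.7 billion cycles ("79 billion")
print(round(scalar / 3.2e9, 1))  # 24.9 seconds
print(round(simd / 3.2e9, 1))    # 6.8 seconds
```

Both the 24.9 s scalar and 6.8 s SIMD figures quoted in the text fall out of the model at this assumed clock rate.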

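The per-core bandwidth comparison below reduces to counting round trips to main memory. As minimal arithmetic (our sketch, using the trip counts given in the text):

```python
# Sketch: the memory-traffic comparison of Section 7, in numbers.
# Sort-merge join streams the data to/from DRAM twice; hash join needs
# two partitioning passes plus the join pass, i.e. three round trips.
SORT_MERGE_TRIPS = 2  # two passes over main memory (Section 5)
HASH_JOIN_TRIPS = 3   # two TLB-friendly partitioning levels + join

ratio = HASH_JOIN_TRIPS / SORT_MERGE_TRIPS
print(ratio)  # 1.5 -> hash join needs 1.5X the memory bandwidth
```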
As far as sort-merge join is concerned, we only need to access data from/to main memory two times (Section 5). For hash join, on the other hand, we need to partition the data before the actual (cache-friendly) join phase. As we argue in Section 4, the restricted size of the TLB forces multiple levels of partitioning for an efficient runtime – at least two levels for large database sizes. In addition, the actual join requires one more main-memory read/write, for a total of three trips to main memory. Compared to sort-merge join, hash join would therefore require 1.5X more bandwidth. For future scenarios with limited per-core bandwidth, the join runtime would be proportional to the number of external memory reads/writes of the data, and hash join is projected to be 1.5X slower than sort-merge join for large datasets with high or unknown cardinality.

8. CONCLUSIONS

In this paper, we re-examined the two popular join algorithms – hash join and sort-merge join – and provided efficient implementations along with a detailed analysis and an analytical model of the runtime performance. Our join implementations efficiently utilize modern processor features: cache blocking to minimize access latency, vectorization for SIMD to increase compute density, and load balancing amongst cores, even for heavily skewed input datasets. Our hash-based implementation achieves more than 100M tuples per second on the latest quad-core processor, which is 17X faster than the best reported numbers on quad-core processors and 8X faster than the best reported GPU numbers. Furthermore, our sort-merge join algorithm achieves more than 50M tuples per second – an order of magnitude faster than the best reported numbers.

We developed analytical models to project the performance of the two algorithms under future architectural trends towards increasing SIMD width and limited per-core memory bandwidth. The lack of appropriate hardware features to exploit SIMD limits the scalability of hash join, while sort-based join scales near-linearly with SIMD and is projected to be faster at a SIMD width of 512 bits or higher. In addition, the higher inherent memory bandwidth requirement of hash join as compared to sort-merge join further points towards sort-merge join executing faster than hash join.

9. REFERENCES

[1] N. Askitis and J. Zobel. Cache-conscious collision resolution in string hash tables. In Proc. String Processing and Information Retrieval Symposium (SPIRE), pages 92–104, 2005.
[2] K. E. Batcher. Sorting networks and their applications. In Spring Joint Computer Conference, pages 307–314, 1968.
[3] M. W. Blasgen and K. P. Eswaran. Storage and access in relational data bases. IBM Systems Journal, 16(4):362–377, 1977.
[4] G. E. Blelloch. Synthesis of Parallel Algorithms, chapter Prefix Sums and Their Applications, pages 35–60. Morgan Kaufmann, 1993.
[5] P. A. Boncz, S. Manegold, and M. L. Kersten. Database architecture optimized for the new bottleneck: Memory access. In VLDB, pages 54–65, 1999.
[6] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry. Improving hash join performance through prefetching. In ICDE, pages 116–127, 2004.
[7] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey. Efficient implementation of sorting on multi-core SIMD CPU architecture. In VLDB, pages 1313–1324, 2008.
[8] J. Cieslewicz and K. A. Ross. Adaptive aggregation on chip multiprocessors. In VLDB, pages 339–350, 2007.
[9] J. Cieslewicz and K. A. Ross. Data partitioning on chip multiprocessors. In DaMoN, pages 25–34, 2008.
[10] D. J. DeWitt and R. H. Gerber. Multiprocessor hash-based join algorithms. In VLDB, pages 151–164, 1985.
[11] J. Doweck. Inside Intel Core microarchitecture and smart memory access. White paper, Intel Corporation, July 2006.
[12] B. Gedik, P. S. Yu, and R. Bordawekar. Executing stream joins on the Cell processor. In VLDB, pages 363–374, 2007.
[13] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High performance graphics co-processor sorting for large database management. In SIGMOD Conference, pages 325–336, 2006.
[14] G. Graefe, A. Linville, and L. D. Shapiro. Sort versus hash revisited. IEEE Trans. Knowl. Data Eng., 6(6):934–944, 1994.
[15] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly generating billion-record synthetic databases. In SIGMOD Conference, pages 243–252, 1994.
[16] A. Greß and G. Zachmann. GPU-ABiSort: Optimal parallel sorting on stream architectures. In Proceedings of the 20th IEEE International Parallel and Distributed Processing Symposium, page 45, Apr. 2006.
[17] N. Hardavellas, I. Pandis, R. Johnson, N. Mancheril, A. Ailamaki, and B. Falsafi. Database servers on chip multiprocessors: Limitations and opportunities. In CIDR, pages 79–87, 2007.
[18] B. He, K. Yang, R. Fang, M. Lu, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational joins on graphics processors. In SIGMOD Conference, pages 511–524, 2008.
[19] W. D. Hillis and G. L. Steele, Jr. Data parallel algorithms. Commun. ACM, 29(12):1170–1183, 1986.
[20] K. A. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In VLDB, pages 525–535, 1991.
[21] Intel Advanced Vector Extensions Programming Reference, 2008. Programming-Reference-31943302.pdf.
[22] Intel SSE4 Programming Reference, 2007.
[23] M. Kitsuregawa, H. Tanaka, and T. Moto-Oka. Application of hash to data base machine and its architecture. New Generation Comput., 1(1), 1983.
[24] S. Kumar, D. Kim, M. Smelyanskiy, Y.-K. Chen, J. Chhugani, C. Hughes, C. Kim, V. Lee, and A. Nguyen. Atomic vector operations on chip multiprocessors. In 35th International Symposium on Computer Architecture, pages 441–452, June 2008.
[25] T. J. Lehman and M. J. Carey. A study of index structures for main memory database management systems. In VLDB '86: Proceedings of the 12th International Conference on Very Large Data Bases, pages 294–303, 1986.
[26] H. Lu, K.-L. Tan, and M.-C. Shan. Hash-based join algorithms for multiprocessor computers. In D. McLeod, R. Sacks-Davis, and H.-J. Schek, editors, VLDB, pages 198–209, 1990.
[27] S. Manegold, P. A. Boncz, and M. L. Kersten. What happens during a join? Dissecting CPU and memory optimization effects. In VLDB, pages 339–350, 2000.
[28] S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing main-memory join on modern hardware. IEEE Trans. Knowl. Data Eng., 14(4):709–730, 2002.
[29] E. Mohr, D. A. Kranz, and R. H. Halstead. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, 2:185–197, 1991.
[30] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[31] C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D. Lomet. AlphaSort: A RISC machine sort. SIGMOD Rec., 23(2):233–242, 1994.
[32] T. J. Purcell, C. Donner, M. Cammarano, H. W. Jensen, and P. Hanrahan. Photon mapping on programmable graphics hardware. In Graphics Hardware 2003, pages 41–50, July 2003.
[33] M. Reilly. When multicore isn't enough: Trends and the future for multi-multicore systems. In HPEC, 2008.
[34] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: A many-core x86 architecture for visual computing. Proceedings of SIGGRAPH, 27(3), 2008.
[35] A. Shatdal, C. Kant, and J. F. Naughton. Cache conscious algorithms for relational query processing. In VLDB, pages 510–521, 1994.
[36] J. L. Wolf, D. M. Dias, and P. S. Yu. A parallel sort merge join algorithm for managing data skew. IEEE Trans. Parallel Distrib. Syst., 4(1):70–86, 1993.
[37] J. L. Wolf, P. S. Yu, J. Turek, and D. M. Dias. A parallel hash join algorithm for managing data skew. IEEE Trans. Parallel Distrib. Syst., 4(12):1355–1371, 1993.
[38] H. Zeller and J. Gray. An adaptive hash join algorithm for multiuser environments. In VLDB, pages 186–197, 1990.
[39] J. Zhou and K. A. Ross. Implementing database operations using SIMD instructions. In SIGMOD Conference, pages 145–156, 2002.