Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited

In this paper we experimentally study the performance of main-memory, parallel, multi-core join algorithms, focusing on sort-merge and (radix-)hash join. The relative performance of these two join approaches have been a topic of discussion for a long time. With the advent of modern multicore architectures, it has been argued that sort-merge join is now a better choice than radix-hash join. This claim is justified based on the width of SIMD instructions (sort-merge outperforms radix-hash join once SIMD is sufficiently wide)

1. Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited Cagri Balkesen, Gustavo Alonso Jens Teubner ¨ M. Tamer Ozsu Systems Group, ETH Zurich TU Dortmund University University of Waterloo Switzerland Germany Canada {name.surname} ABSTRACT provides support for vector instructions that are sufficiently In this paper we experimentally study the performance of wide (SIMD with 256-bit AVX and wider), sort-merge joins main-memory, parallel, multi-core join algorithms, focusing would easily outperform radix-hash joins. This claim was on sort-merge and (radix-)hash join. The relative perfor- reinforced by recent results by Albutiu et al. [2] who report mance of these two join approaches have been a topic of that their NUMA-aware implementation of sort-merge join discussion for a long time. With the advent of modern multi- is already superior to hash joins even without using SIMD core architectures, it has been argued that sort-merge join is instructions. Furthermore, there are a number of new opti- now a better choice than radix-hash join. This claim is jus- mizations for parallel radix join [3, 4, 5] that have not been tified based on the width of SIMD instructions (sort-merge considered in these studies, but that should be part of any outperforms radix-hash join once SIMD is sufficiently wide), analysis of the relative performance of the two options. and NUMA awareness (sort-merge is superior to hash join In this paper, we approach the question experimentally. in NUMA architectures). We conduct extensive experiments We bring carefully-tuned implementations of all relevant, on the original and optimized versions of these algorithms. state-of-the-art join strategies (including radix join [4, 15, The experiments show that, contrary to these claims, radix- 19], no-partitioning join [5], sort-merge join [15], and mas- hash join is still clearly superior, and sort-merge approaches sively parallel sort-merge (MPSM) join [2]) to a common to performance of radix only when very large amounts of and up-to-date hardware platform. We then compare the data are involved. The paper also provides the fastest im- relative performance of all these algorithms under a wide plementations of these algorithms, and covers many aspects range of experimental factors and parameters: algorithm of modern hardware architectures relevant not only for joins design, data sizes, relative table sizes, degree of parallelism, but for any parallel data processing operator. use of SIMD instructions, effect of NUMA, data skew, and different workloads. Many of these parameters and combi- nations thereof were not foreseeable in earlier studies, and 1. INTRODUCTION our experiments show that they play a crucial role in deter- Modern processor architectures introduce many possibili- mining the overall performance of join algorithms. ties as well as challenges for the implementation of parallel Through an extensive experimental analysis, this paper data operators. Advanced instruction sets, vector opera- makes several contributions: (1) we show that radix-hash tions, multi-core architectures, and NUMA constraints cre- join is still superior to sort-merge join in most cases; (2) we ate a very rich design space where the effects of given design provide several insights on the implementation of data oper- decisions on the performance of data operators are not al- ators on modern processors; and (3) we present the fastest ways easy to determine. There is a need for developing a algorithms available to date for both sort-merge—2-3 times better understanding of the performance of parallel data op- faster than available results— and radix-hash join, demon- erators on new hardware. strating how to use modern processors to improve the per- In this paper, we explore the relative performance of radix- formance of data operators. hash vs. sort-merge join algorithms in main-memory, multi- In addition, the paper sheds light on a number of relevant core settings. Our main goal is to analyze the hypothesis issues involving the processing of “big data” and the factors raised by recent work claiming that sort-merge joins over that affect the choice of the algorithm. These include: new hardware are now a better option than radix-hash joins, Input Sizes. Our results show that the relative and ab- the algorithm traditionally considered to be the fastest [2, solute input table sizes have a big effect on performance. 15]. Kim et al. [15] have suggested that, once hardware Moreover, as the data size grows, the duality between hash- ing and sorting becomes more pronounced, changing the as- Permission to make digital or hard copies of all or part of this work for sumption that only hashing involves several passes over the personal or classroom use is granted without fee provided that copies are data when it is sufficiently large. We show that sort-merge not made or distributed for profit or commercial advantage and that copies joins also have to do multiple passes over the data, with bear this notice and the full citation on the first page. To copy otherwise, to their performance suffering accordingly. republish, to post on servers or to redistribute to lists, requires prior specific Degree of Parallelism. Existing work has studied algo- permission and/or a fee. Articles from this volume were invited to present rithms using a small degree of parallelism. As the number their results at The 40th International Conference on Very Large Data Bases, September 1st - 5th 2014, Hangzhou, China. of available hardware contexts increases, contention in large Proceedings of the VLDB Endowment, Vol. 7, No. 1 merge trees for sorting also increases. This contention is not Copyright 2013 VLDB Endowment 2150-8097/13/09... $ 10.00. 85

2.visible in the four-core configurations used in earlier studies Albutiu et al. [2] presented a “massively parallel sort- but it becomes a dominant factor in larger systems. merge join” (MPSM) tailored for modern multi-core and Cache Contention. Cache-conscious approaches make multi-socket NUMA processors. MPSM is claimed to ob- a significant difference in both hash and sort-merge joins. serve NUMA-friendly access patterns and avoids full sort- Cache consciousness becomes the decisive factor in choosing ing of the outer relation. In their experiments on a 32- the best hash join algorithm, favoring the radix join ap- core, four socket machine, they report that sort-merge join proach of [4, 15] over the no-partitioning approach of [5]. is faster than the “no-partitioning” hash join implementa- Similarly, for efficient merging, Kim et al. [15] assume that tion of Blanas et al. [5] (see below). Unlike the projections entire merge trees can fit into a single last-level cache. As of Kim et al. [15], the claim is that sort-merge join is faster the degree of parallelism, data sizes, and merge fan-ins in- than hash join even without using SIMD parallelism. crease, this assumption may no longer hold, which calls for a On the hash join side, cache-aware, partitioning-based al- closer look at the implementation of multi-way merge trees. gorithms such as “radix join” provide the best performance SIMD Performance. There are many ways to exploit [19, 24]. More recently, Blanas et al. [5] introduced the “no- SIMD in all algorithms. Our results show, however, that partitioning” idea and advocated its simplicity, hardware hardware peculiarities play a big role, and it turns out that obliviousness, and efficiency. Recent work has studied these the width of the SIMD registers is not the only relevant hash join algorithms and showed that hardware-conscious, factor. The complex hardware logic and signal propagation parallel radix join has the best overall performance [3, 4]. As delays inherent to more complex SIMD designs may result the code for the no-partitioning and radix join algorithms in latencies larger than one cycle, limiting the advantages of is available, we use these algorithms in this study. We also SIMD. refer to the literature for detailed descriptions of these algo- The paper shows that hash joins still have an edge over rithms and focus here mainly on sort-merge joins. sort-merge alternatives unless the amount of data involved is very large. Some advances in hardware will eventually fa- 2.3 Hardware-Assisted Sorting vor sort-merge joins, e.g., wider SIMD registers and higher Recent work on sorting has explored the use of SIMD data memory bandwidth, but our results show that exploiting parallelism [7, 10, 13, 22]. Inoue et al. [13] introduced AA- such advances will also benefit radix join algorithms. Fur- Sort which utilized both SIMD and thread-level parallelism. thermore, new processor features such as memory gather AA-Sort eliminates unaligned loads for maximum utiliza- support in Intel’s upcoming Haswell series may play a big- tion of SIMD where unaligned accesses cause performance ger role in improving hash joins than the factors considered bottlenecks in architectures such as PowerPC and Cell pro- so far. cessors. Gedik et al. [10] also investigated parallel sorting for the Cell processor and implemented an efficient sorting algo- 2. BACKGROUND AND RELATED WORK rithm with bitonic sorting and merging using SIMD paral- lelism. Chhugani et al. [7] provided a multi-core SIMD sort- 2.1 Sort vs. Hash—Early Work ing implementation over commodity x86 processors. Their “Sort or hash” has long been a point of discussion in algorithm, based on merge sort, extensively used bitonic databases. Initially, sort-merge join was the preferred op- sort and merge networks for SIMD parallelism, following the tion [20]. Later, the invention of hashing-based techniques ideas introduced by Inoue et al. [13] for in-register sorting. [6, 16] changed the balance. Schneider et al. [23] compared However, the machine they used had only four cores and fur- hash-based with sort-merge joins and concluded that hash- ther scalability was based on projections. Satish et al. [22] based joins were superior unless memory was limited. Hash have analyzed comparison and non-comparison-based sort- join was also the main choice in most of the early parallel ing algorithms on modern CPU/GPU architectures. Their database systems [8, 9, 23]. study provided some of the fastest sorting implementations Changes in hardware, memory, and data volumes prompt- and found that non-comparison-based scalar sorting such ed researchers to revisit the “sort or hash” question regularly as radix sort is faster with smaller keys. Moreover, they over the years. Graefe et al. [12] provided an extensive study showed that SIMD-based merge sort is more competitive of sort- and hash-based query processing algorithms. They with larger keys and will be even more favorable with the outlined the dualities and similarities among the approaches future hardware trends such as larger SIMD width. Kim et and concluded that performance only differs by percentages al. [14] implemented distributed in-memory sorting over a rather than factors if both algorithms are implemented in a cluster of multi-core machines. While their main solution careful and equally-optimized manner. Moreover, the study focuses on overlapping computation and inter-node commu- pointed out that there are cases when one of the algorithms nication, their approach also makes extensive use of parallel would be preferable over the other and therefore both algo- SIMD sorting on each of the machines. rithms would be needed in a database system. Subsequently, Graefe et al. [11] used histogram-based partitioning to im- 2.4 The Role of NUMA prove hash-based joins and finally concluded that sort-merge For better performance in NUMA systems, algorithms joins only win in a number of corner cases. must be hardware conscious by taking the increasingly more complex NUMA topologies into consideration [2, 18]. Li 2.2 Sort vs. Hash—Multi-core Era et al. [18], for instance, showed that a hardware-conscious Multi-core has changed things once again. Kim et al. [15] “ring-based” data shuffling approach across NUMA regions compared parallel radix-hash join with a sorting-based join achieves a much better interconnect bandwidth and improves exploiting both SIMD data parallelism and multiple threads. the performance of sort-merge join algorithm of Albutiu et They concluded that wider SIMD registers will soon make al. [2]. Therefore, we followed a similar approach and made sort-merge a better option. our algorithms NUMA aware. 86

3. 3. PARALLELIZING SORT WITH SIMD a1 out 1 sorted The dominant cost in sort-merge joins is sorting the input a2 out 2 relations. We thus now discuss strategies to implement sort- a3 out 3 sorted a4 out 4 ing in a hardware-conscious manner. Typically, sort-merge b4 out 5 sorted joins use merge sort—a tribute to the latency/bandwidth b3 out 6 gap in modern system architectures. Both building blocks b2 out 7 of merge sort, (a) initial run generation and (b) the merging b1 out 8 of pre-sorted runs, benefit from SIMD. 3.1 Run Generation Figure 2: Bitonic merge network. For initial run generation, many chunks with a small num- ber of tuples need to be sorted. This favors sorting al- four items each requires 10 min/max instructions, 8 shuffles, gorithms that can process multiple chunks in parallel over 4 loads, and 4 stores. Shuffle operations significantly reduce ones that have a good asymptotic complexity with respect the effective SIMD speedup for run generation from optimal to the tuple count. Sorting networks provide these char- κ = 4 to about 2.7. acteristics and fit well with the SIMD execution model of modern CPUs [7, 10, 21]. 3.2 Merging Sorted Runs 3.1.1 Sorting Networks 3.2.1 Bitonic Merge Networks 5 3 Figure 1 on the left illustrates, in the Although sequential in nature, merging also benefits from 9 3 notation of Knuth [17, Section 5.3.4], a 9 6 5 SIMD acceleration. The basic idea comes from Inoue et 5 5 sorting network for four input items. A al. [13] and has been used for sorting [7] and joins [15]. 3 5 6 3 6 set of four items 9, 5, 3, 6 enters the net- Looking back to the idea of sorting networks, larger net- 6 9 6 9 work on the left and travels toward the works can be built with help of merging networks that com- right through a series of comparators . bine two pre-sorted inputs into an overall sorted output. Figure 1: Even- Every comparator emits the smaller of Figure 2 shows a network that combines two input lists of odd network for its two input values at the top, the larger size four. The network in Figure 2 is a sequence of three four inputs. on the bottom. After traversing the five stages, each consisting of four comparator elements . Each comparators, the data set is sorted. stage can thus be implemented using one max and one min The beauty of sorting networks is that comparators can be SIMD instruction (assuming κ = 4). Shuffle instructions implemented with help of min/max operators only. Specif- in-between stages bring vector elements into their proper ically, the five comparators in Figure 1 com- e = min (a, b) positions (for instance, if a and b are provided as one SIMD pile into a sequence of ten min/max operations f = max (a, b) register each, b must be reversed using shuffles to prepare as illustrated here on the right (input vari- g = min (c, d) for the first min/max instruction pair). ables a, . . . , d and output variables w, . . . , z). h = max (c, d) On current Intel hardware, for κ = 4, implementing a Limited data dependencies and the absence i = max (e, g) bitonic merge network for 2 × 4 input items requires 6 SIMD of branching instructions make such code run j = min (f, h) min/max instructions and 7–10 shuffles. The exact number of very efficiently on modern hardware. w = min (e, g) shuffles depends on the bit width of the input items and the Sorting networks are also appealing be- x = min (i, j) instruction set offered by the hardware (SSE, AVX, AVX2). cause they can be accelerated through SIMD y = max (i, j) instructions. When all variables in the code z = max (f, h) 3.2.2 Merging Larger Lists using Bitonic Merge on the right are instantiated with SIMD vec- For larger input sizes, merge networks scale poorly [21]: tors of κ items and all min/max calls are replaced by SIMD sorting networks for N input items require O N log2 N calls, κ sets of items can be sorted in approximately the comparators—clearly inferior to alternative algorithms. But same time that a single set would require in scalar mode small merge networks can be used as a kernel within a merg- (suggesting a κ-fold speedup through SIMD). ing algorithm for larger lists [13]. The resulting merging al- gorithm (Algorithm 1) uses a working set of 2×k data items 3.1.2 Speedup Through SIMD (variables a and b, both implemented as SIMD registers). However, the strategy illustrated above will sort input In each iteration of the algorithm’s loop body, that working items across SIMD registers. That is, for each vector po- set is sorted (using the merge kernel bitonic_merge4 () and sition i, the sequence wi , xi , yi , zi will be sorted, but not knowing that a and b themselves are sorted already) and the the sequence of items within one vector (i.e., wi , . . . , wκ is smaller k items are emitted to the merge result. in undefined order). Only full SIMD vectors can be read or The emitted SIMD vector is then replaced by fresh data written to memory consecutively. Before writing back initial from the input. As in the classical scalar merge algorithm, runs to main-memory, SIMD register contents must thus be the two head elements of the input runs are used to decide transposed, so items within each vector become sorted (i.e., which new data to load (line 5 in Algorithm 1). Unlike in w2 must be swapped with x1 , w3 with y1 , etc.). the classical algorithm, however, the decision is used to load Transposition can be achieved through SIMD shuffle in- an entire vector into the working set. The rationale is that structions that can be used to move individual values within the resulting working set still contains at least k items that and across SIMD registers. A common configuration in the are smaller than the larger of the two head items, and only context of join processing is to generate runs of four items k items will be emitted in the next loop iteration. with κ = 4. Eight shuffle instructions are then needed In terms of performance, the separation between control to transpose registers. That is, generating four runs of flow and merge kernel operations in Algorithm 1 fits well 87

4. Algorithm 1: Merging larger lists with help of bitonic mentioned in Section 3.2.2, an 8-wide bitonic merge imple- merge kernel bitonic_merge4 () (k = 4). mentation requires 36 assembly instructions per eight tuples being merged—or per 64 bytes of input data. 1 a ← fetch4 (in 1 ); b ← fetch4 (in 2 ); We analyzed the 36-instructions loop of our code using the 2 repeat Intel Architecture Code Analyzer [1]. The tool considers the 3 a, b ← bitonic_merge4 (a, b); super-scalar instruction pipeline of modern Intel processors 4 emit a to output; and infers the expected execution time for a given instruc- 5 if head (in 1 ) < head (in 2 ) then tion sequence. For our code, the tool reported a total of 6 a ← fetch4 (in 1 ); 29 CPU cycles per 64 bytes of input data. The tool does 7 else not consider potential pipeline stalls due to memory accesses 8 a ← fetch4 (in 2 ); (bandwidth and/or latency). 9 until eof (in 1 ) or eof (in 2 ); With a clock frequency of 2.4 GHz, this corresponds to a 10 a, b ← bitonic_merge4 (a, b); memory bandwidth of 2 × 5.3 GB/s (read+write) or 10.6 GB/s 11 emit4 (a); emit4 (b); for a single merging thread. This is more than existing in- 12 if eof (in 1 ) then terfaces support, considering also that typical CPUs contain 13 emit rest of in 2 to output; eight or more cores per chip. Out-of-cache merging is thus severely bound by the memory bandwidth. 14 else Memory bandwidth demand can be reduced by merging 15 emit rest of in 1 to output; more than two runs at once. Multi-way merging saves round- trips to memory and thus precious memory bandwidth. To still benefit from CPU-efficient, SIMD- with the execution model of modern CPUs. In particular, optimized bitonic merg- merge merge merge merge no values have to be moved between the scalar and vector ing, we implement multi- execution units of the CPU (a costly operation in many ar- way merging using mul- merge merge chitectures). Branch mispredictions will still occur, but their tiple two-way merge effect will now be amortized over k input elements. Also, op- merge units, as illustrated one thread timizations such as predication or conditional moves can be in Figure 3. Two- applied in the same way as in scalar merge implementation. way merge units are The vector size used for merging is a configuration param- connected using FIFO Figure 3: Multi-way merging. eter, typically k = 2p × κ (where κ is the hardware SIMD queues; FIFO queues width). For the hardware that we used in our experimental are sized such that all queues together still fit into the CPU study, we found k = 8 to be a sweet spot that amortizes cache. Thus, external memory bandwidth is needed only at branching cost well, while not suffering too much from the the front of the multi-way merging tree. complexity of large merge networks. Each loop iteration re- As indicated in Figure 3, a complete multi-way merging quires 36 assembly instructions to produce 8 output items. is realized by a single operating system thread. This thread will treat two-way merges as individual tasks and switch 4. CACHE CONSCIOUS SORT JOINS between tasks whenever a task is blocked by a FIFO over- or underrun. 4.1 Sorting and the Memory Hierarchy Compute vs. Bandwidth. Task switching will cause new The cache hierarchies in modern hardware require sep- CPU overhead that is not necessary in an implementation arating the overall sorting into several phases to optimize that merges two runs from memory until completion. This cache access: (i) in-register sorting, with runs that fit into overhead will increase when the FIFO queues between merg- (SIMD) CPU registers; (ii) in-cache sorting, where runs can ing tasks are small, because more task switches are needed still be held in a CPU-local cache; and (iii) out-of-cache then to perform a full multi-way merge. Since we size FIFO sorting, once runs exceed cache sizes. buffers such that all buffers in one thread fill a CPU cache, In-Register Sorting. Phase (i) corresponds to run-genera- CPU overhead increases together with the merge fan-in. tion as discussed in Section 3.1. This offers us a handle to balance bandwidth and com- In-Cache Sorting. In Phase (ii), runs are then merged pute. Merging only two runs is bound by memory band- until runs can no longer be contained within CPU caches. width, with plenty of stalled CPU cycles that could be spent In-cache sorting corresponds to Algorithm 1 (bitonic merg- on additional CPU instructions. As we increase merge fan- ing) in Section 3.2.2. It is backed up by a bitonic merge in, memory pressure becomes reduced until the system be- kernel such as bitonic_merge4 (). Using bitonic merging, comes CPU-bound. At that point, larger fan-in will degrade runs are repeatedly combined until runs have reached 1/2 performance again because of increasing CPU load. cache size (since in- and output runs must fit into cache). Impact of NUMA. In practice, at least some merging passes Out-of-Cache Sorting. Phase (iii) continues merging until will inevitably cross NUMA boundaries (if not, NUMA cross- the data is fully sorted. Once runs have exceeded the size ing has to be performed in the later join phase, which in this of the cache, however, all memory references will have to be regard behaves like a two-way merge). As pointed out by fetched from off-chip memory. Li et al. [18]—and confirmed by our measurements—multi- socket systems show an increasing asymmetry, where the 4.2 Balancing Computation and Bandwidth NUMA interconnect bandwidth stays further and further behind the aggregate memory bandwidth that the individual Accesses to off-chip memory make out-of-cache sorting memory controllers could provide. With multi-way merging, sensitive to the characteristics of the memory interface. As 88

5.we can combat this problem in an effective and scalable way. In the experimental part of this work we study how merg- R ing can be tuned to avoid memory bottlenecks even across local sort local sort local sort local sort 1 NUMA boundaries. 5. HASH-BASED JOINS merge 2 Having covered the optimized sort-based join implemen- tations, we now look at hash-based joins. While efficient, hashing results in random access to memory, which can lead R to cache misses. Shatdal et al. [24] identified that when the ✶ ✶ ✶ ✶ 5 hash table is larger than the cache size, almost every ac- S cess to the hash table results in a cache miss. As a result, 3/4 a partitioning phase to the hash joins is introduced to re- duce cache misses. The performance of the resulting join is S largely dictated by this partitioning phase. 5.1 Radix Partitioning Figure 4: m-way : NUMA-aware sort-merge join Manegold et al. [19] refined the partitioning idea by con- with multi-way merge and SIMD. sidering as well the effects of translation look-aside buffers (TLBs) during the partitioning phase, leading to the multi- pass radix partitioning join. Conceptually, radix partitioning will exactly fill one cache line (64 bytes). This in turn allows takes all input tuples one-by-one and writes them to their for another low-level optimization. Since we are now always corresponding destination partition (pos[·] keeps the write writing a full cache line at once to global memory, the CPU location in each partition): can take advantage of its write combining facilities together 1 foreach input tuple t do with non-temporal writes, thus avoiding to read the cache 2 k ← hash(t); line before writing it back. 3 p[k][pos[k]] = t; // copy t to target partition k In practice, the advantage of software-managed buffers is 4 pos[k]++; two-fold: (i) for many situations, software-managed buffers offer better absolute performance, since fewer passes can Generally, partitions are far apart and on separate VM pages. usually achieve the same overall fan-out; (ii) it is possible If the fan-out of a partitioning stage is larger than the num- to partition even larger data sizes in a single pass, which has ber of TLB entries in the system, copying each input tuple not been considered previously. will cause another TLB miss. The number of TLB entries is thus treated as an upper bound to the partitioning fan-out. 6. JOIN ALGORITHMS ANALYZED 5.2 Software-Managed Buffers 6.1 Sort-Merge Join Algorithm – m-way The TLB miss limitations on maximum fan-out can be re- duced, when writes are buffered inside the cache first. The The m-way algorithm is a highly parallel sort-merge join idea is to allocate a set of buffers, one for each output par- that relies on both data and thread parallelism and is care- tition and each with room for up to N input tuples. Buffers fully optimized toward NUMA. The general idea and the are copied to final destinations only when full: individual phases of the algorithm are presented in Figure 4 assuming a hypothetical machine with four NUMA regions 1 foreach input tuple t do and one thread per region. 2 k ← hash(t); Initially, input relations R and S are stored such that they 3 buf[k][pos[k] mod N ] = t; // copy t to buffer are equally distributed across NUMA regions. In the first 4 pos[k]++; phase, each thread is assigned its NUMA-local chunk and 5 if pos[k] mod N = 0 then all the threads range-partition their local chunks in parallel 6 copy buf[k] to p[k]; // copy buffer to part. k using the software-managed buffers technique (Section 5.2). The main intuition behind partitioning in the first place Buffering leads to additional copy overhead. However, for is allowing threads in the subsequent phases to work inde- sufficiently small N , all buffers will fit into a single memory pendently without any synchronization. In this phase, the page and into L1 cache. Thus, a single TLB entry will suffice partitioning fan-out is usually on the order of the number of unless a buffer becomes full and the code enters the copy- threads (64–128) and can be done efficiently using a single ing routine in line 6. Beyond the TLB entry for the buffer pass at the speed of total memory bandwidth of the ma- page, an address translation is required only for every N th chine. Then, each local partition is sorted using the AVX input tuple, significantly reducing the pressure on the TLB sorting algorithm. In this phase, different threads can sort system. And as soon as TLB misses become infrequent, it different partitions independently, again just reading from is likely that the CPU can hide their latency through out- and writing to the NUMA-local memory. Partitioning and of-order execution mechanisms. This optimization follows local sorting are indicated in Figure 4 as 1 . the idea of Satish et al. [22] who used it to reduce the TLB Phase 2 in Figure 4 is the only phase that requires shuf- pressure of radix sort. fling data between different NUMA regions. Therefore, it In our implementation of radix join we utilize such soft- is likely to be limited by the memory/interconnect band- ware-managed buffers and configure N such that one buffer width. Hence, we employ multi-way merging here as de- 89

6.scribed in Section 4.2. Multi-way merging successfully over- short notation algorithm laps the data transfer and merging and brings computation m-way Sort-merge join with multi-way merging and bandwidth into a balance. Outcome of this phase is a m-pass Sort-merge join with multi-pass na¨ıve merging globally sorted copy of R, indicated as R in Figure 4. mpsm Our impl. of massively parallel sort-merge [2] The same steps are also applied to relation S (indicated radix Parallel radix hash join [15, 4] as Phases 3/4 in Figure 4). R and S are stored in the n-part No-partitioning hash join [5, 4] NUMA-local memory of each thread. Finally, each thread concurrently evaluates the join between NUMA-local sorted Table 1: Algorithms analyzed. runs using a single-pass merge join (Phase 5 ). This join amounts to an extremely fast linear scan of both sorted runs A (adapted from [2]) B (from [15, 4]) where matching pairs constitute the join result. size of key / payload 4 / 4 bytes 4 / 4 bytes size of R 1600 · 106 tuples 128 · 106 tuples 6.2 Sort-Merge Join Algorithm – m-pass size of S m · 1600 · 106 tuples, m = 1,..,8 128 · 106 tuples The second variant for sort-merge join is m-pass. The al- total size R 11.92 GiB 977 MiB total size S m · 11.92 GiB 977 MiB gorithm differs from m-way only in Phase 2 in Figure 4. Instead of applying a multi-way merge for merging NUMA- Table 2: Workload characteristics. remote runs, m-pass applies successive two-way bitonic merg- ing. The first iteration of merging of sorted runs is done as the data is transferred to the local memory. As a result of and Balkesen et al. [4]. They all assume a column-oriented the first iteration, the number of runs reduces to 1/2 of the storage model and joins are assumed to be evaluated over initial total number of runs. The rest of the merging contin- narrow (8- or 16-byte) key, payload tuple configurations. ues in local memory, using the multi-pass merging technique To understand the value of data parallelism using vectorized (cf. Section 3.2.2) in an iterative manner. instructions, we assume key and payload are four bytes each. The workloads used are listed in Table 2. All attributes are 6.3 Massively Parallel Sort-Merge Join – mpsm integers, but AVX currently only supports floating point; The mpsm algorithm first globally range-partitions rela- therefore we treat integer keys as floats when operating with tion R (again as discussed in Section 5.2). This step ensures AVX.1 There is a foreign key relationship from S to R. That that different ranges of R are assigned to different NUMA- is, every tuple in S is guaranteed to find exactly one join regions/threads. Next, each thread independently sorts its partner in R. Most experiments (unless noted otherwise) partition, resulting in a globally-sorted R . In contrast, S assume a uniform distribution of key values from R in S. is sorted only partially. Each thread sorts its NUMA-local chunk of S without a prior partitioning. Therefore, during 7.2 System the last phase, a run of R must be merge-joined with all the Experiments are run on an Intel Sandy Bridge with a 256- NUMA-remote runs of relation S. For cases where relation bit AVX instruction set. It has a four-socket configuration, S is substantially larger than R, avoiding the global parti- with each CPU socket containing 8 physical cores and 16 tioning/sorting may pay off and the overall join may become thread contexts by the help of the hyper-threading. Cache more efficient. For further details, we refer to [2]. sizes are 32 KiB for L1, 256 KiB for L2, and 20 MiB L3 (the latter shared by the 16 threads within the socket). The cache 6.4 Radix Hash Join – radix line size of the system is 64 bytes. TLB1 contains 64/32 For parallel radix-hash join, we partition both input re- entries when using 4 KiB/2 MiB pages (respectively) and 512 lations as discussed in Section 5.2. The goal is to break at TLB2 entries (page size 4 KiB). Total memory available is least the smaller input into pieces that fit into caches. Then, 512 GiB (DDR3 at 1600 MHz). The system runs Debian we run a cache-local hash join on individual partition pairs. Linux 7.0, kernel version 3.4.4-U5 compiled with transparent For a detailed discussion, refer to [4, 5, 15, 19]. huge page support and it uses 2 MiB VM pages for memory allocations transparently. This has been shown to improve 6.5 No-Partitioning Hash Join – n-part join performance up to ≈ 15% under certain circumstances The no-partitioning join is a direct parallel version of the [3, 4]. Therefore, we assume the availability of large page canonical hash join. Both input relations are divided into support in the system. The code is compiled with gcc 4.7.2 equi-sized portions that are assigned to a number of worker using -O3. Experiments using Intel’s icc compiler did not threads. In the build phase, all worker threads populate a show any notable differences, qualitatively or quantitatively. shared hash table with all tuples of R. After synchronization via a barrier, all worker threads enter the probe phase and concurrently find matching join partners for their assigned 8. ANALYSIS OF THE SORT PHASE S portions. For further details, we refer to [4, 5]. In a first set of experiments, we make sure that our single- thread sorting performs well compared to alternatives. Table 1 summarizes the algorithms considered in the ex- Figure 5 shows the performance of our SIMD sorting algo- periments and the shorthand notation used in the graphs. rithm implemented using AVX instructions. As a baseline, we include the C++ STL sort algorithm. Overall, AVX sort 7. EXPERIMENTAL SETUP runs between 2.5 and 3 times faster than the C++ sort. One expected result is also visible: whenever the size of the in- 7.1 Workloads put increases, both algorithms suffer due to the increasing To facilitate comparisons with existing results, we use sim- 1 AVX2 will support vectorized integer operations, thus there ilar workloads to those of Kim et al. [15], Albutiu et al. [2] will be no longer semantical differences for our code. 90

7.sort throughput [M. tuples/sec] AVX sort C++ STL sort m-way merge radix part. sw-managed buf. 35 180 throughput [M. tuples/sec] 30 160 25 140 120 20 speedup 100 15 80 10 60 3 5 2 40 1 0 20 1 2 4 8 16 32 64 128 256 0 number of tuples in R (in 220 ) 4 8 16 32 64 128 256 512 1024 2048 merge fan-in/partitioning fan-out Figure 5: Single-threaded sorting performance where input table size varies from 8 MiB to 2 GiB. Figure 6: Impact of fan-in/fan-out on multi-way merging/partitioning (1-pass and single-thread). number of trips to the main-memory. AVX sort mainly suf- fers from the multi-pass, pair-wise merging of cache-sized 9.2 Performance of the Merge Phase sorted runs. As the input size increases, the number of runs The model we have outlined essentially suggests that in- double at each data point. Accordingly, the number of trips creasing fan-in, and hence input sizes, will deteriorate the to the memory also increases logarithmically in the input merge performance. To verify this finding, we performed the runs. Whenever used alone, this performance degradation following experiment: We increased the number of fixed-size of single-threaded AVX sort might be acceptable since it still sorted runs from 4 to 2,048 and ran multi-way merging with runs 2.5 times faster than a serial of-the-shelf algorithm. this number as fan-in. To isolate the effect of concurrent On the other hand, the excessive memory bandwidth con- threads, we ran our AVX multi-way merge on a single core sumption indicates that there will be a problem when multi- using 1/16 of our 20 MiB L3 cache exclusively assigned to the ple threads are active and contend for the same bandwidth. merging buffer. The results are shown in Figure 6 denoted We look into this problem when discussing the merge phase. with . The results confirm the model: merge performance We have also compared our sorting algorithm with those decreases steeply with increasing fan-in. The main reason is of Chhugani et al. [7] and Kim et al. [15]. For reasons of the decreasing size of the L3 buffer per merge node in the space, we omit the results. In summary, the sort algorithm merge tree. For instance, with fan-in of 256, the L3-resident behaves well in comparison to existing ones and it does not FIFO buffer of each merge node holds just 321 tuples and affect the relative performance of the join algorithms. decreases to 160 tuples with fan-in of 512. At higher fan-in values, multi-way merge becomes less effective as the num- 9. ANALYSIS OF THE MERGE PHASE ber of tuples processed at once drops significantly. Increasing input sizes require multiple passes over the en- Another reason for performance degradation can be ob- tire data if sub-lists are pair-wise merged. Multi-way merg- served through performance counters: The increasing depth ing remedies this problem by doing the merge in single pass. of the merge tree means a logarithmically increasing number For efficiency, Kim et al. [15] assume that entire merge trees of executed instructions per tuple. At that point, multi-way fit into a single last-level cache. However, as the degree of merging becomes CPU bound as shown in Table 3. The parallelism, data sizes, and merge fan-ins increase, this as- instruction count per tuple increases significantly with in- sumption no longer holds. creasing fan-in along with the instructions-per-cycle (IPC). On the Intel Sandy Bridge, a maximum of 2 IPC for AVX in- 9.1 Modeling the Merge Phase structions is barely possible. Consequently, the thread is far As described in Section 4.1, overall sorting starts with in- from being memory bound (cf. column “Total Bytes/Cycle”). cache sort which generates 1/2-L2-cache-sized sorted runs. Other profiling metrics are also in line with the expectations: Therefore, the fan-in (F ) of the multi-way merge operation each tuple is read and written once where the write also equals to the number of 1/2-L2-sized sorted runs: Given an causes another read for a total of 2 reads and 1 write of the 8- input size of N tuples, F = N ×tuple-size/1/2-L2-size. Secondly, byte tuple (compare with “Read/Tup.” and “Write/Tup.”). efficient multi-way merging requires the merge tree to reside In terms of the cache efficiency, the expected 1/8 (0.125) L3 in the last level (L3) cache. A multi-way merge tree is es- misses per tuple is also in line with the measured L3 misses. sentially a binary tree and the number of total nodes (M ) The cache contention problem in multi-way merging raises a question against the assumption that multi-way merge can in a merge tree with fan-in of F equals to M = 2 log2 F +1 − be done efficiently. This is possible only for certain data 1. Moreover, the shared L3 must be divided by the num- sizes/fan-ins, however increasing data sizes will require mul- ber of threads (T ; i.e., T = 16 for our system) in a CPU tiple passes over the data, affecting the overall performance socket. Accordingly, in order to be L3 resident, each node of the join operator. in the merge tree must have at most B tuples as follows B = L3-size/(M ×tuple-size)×T . Finally, and most importantly, B must be on the order of hundreds of tuples to ensure that 10. OPTIMIZING THE MERGE PHASE multi-way merge will not degrade to single or a few item-at- The previous results show the importance of the input a-time (scalar) merging. size on the performance of sort-merge operations. The same 91

8. Merge IPC Instr. Tot.-Bytes. Read Write L3-Miss partitioning multi-way merge throughput [M tuples/sec] 4000 Fan-in /Core /Tup. Cyc. /Tup. /Tup. /Tup. 3500 8 1.04 19.70 1.38 16.53 9.50 0.12 16 1.13 26.74 1.08 16.47 8.96 0.12 3000 32 1.17 34.00 0.87 16.45 8.73 0.12 2500 64 1.23 43.68 0.71 16.47 8.73 0.12 128 1.26 56.81 0.57 16.52 8.92 0.13 2000 256 1.35 79.55 0.44 16.63 9.41 0.13 1500 512 1.46 126.87 0.31 16.83 10.32 0.13 1000 1024 1.64 245.51 0.20 17.32 12.16 0.14 500 0 Table 3: Performance counter monitoring results for 64M 128M 256M 512M 768M 1024M multi-way merge with increasing merge fan-in. input size [M tuples] Partition-Then-Sort Sort-Then-Merge throughput [M. tuples/sec] 700 Figure 8: Trade-off between partitioning and merg- 600 ing (using 64 threads). 500 400 300 the first phase, local sorting, can choose either partitioning 200 or merging: 100 Partition-then-Sort approach first range-partitions the 0 input with efficient software-managed radix partitioning (us- 64M 128M 256M 512M 768M 1024M ing most-significant radix bits). Each of the partitions are input size [M tuples] individually sorted using the AVX sort routine, and concate- nation of these naturally creates a sorted output. Sort-then-Merge approach, instead, first creates cache- Figure 7: Impact of input size on different multi- sized sorted runs using the AVX sort routine. These sorted threaded sorting approaches (using 64 threads). runs are then merged to form a complete sorted output using the AVX multi-way merge routine. In Figure 7, we compare the performance of the two ap- problem occurs for the partitioning phase of hash-based joins. proaches. The workload consists of 64 M to 1024 M 8-byte Interestingly, the solution for that problem can also be used tuples. First, the Partition-then-Merge approach achieves a for sort-merge joins as we show in this section. throughput of up to 680 million tuples per second. More importantly, it shows a stable performance with increasing 10.1 Performance of the Partitioning Phase input sizes, sorting 8 GB in less than 2 seconds. On the other Figure 6 shows the performance of partitioning with the hand, the performance of Sort-then-Merge approach drops same amount of total data as in the analysis for merging. significantly beyond table sizes of 256 M; mainly due to in- This randomly generated data is partitioned with given fan- creasing fan-in of the multi-way merge. For instance with out. The results denoted with show that radix partition- table size of 512 M, multi-way merge runs with a fan-in of ing is sensitive to the TLB size, which is 32 in our system 512. The main performance difference stems from partition- for 2 MiB pages. Therefore, partitioning throughput signifi- ing vs. merging performance. Figure 8 shows the throughput cantly decreases after a fan-out of 32 due to TLB misses. achieved for merging and partitioning phases in isolation. However, the software-managed buffers idea described in Partitioning achieves up to 4 billion tuples per second by ef- Section 5.2 is much better (cf. ). The robust performance fectively utilizing all the available bandwidth in our machine of the software-managed buffers strategy for high fan-outs is (4 billion × 8 bytes × 4 ≈ 128 GB/s; i.e., 3 reads/1 write for clearly a big improvement over the normal radix partitioning radix partitioning), whereas merge performance drop signif- previous authors have considered so far. icantly from 1.5 to 0.5 billion tuples per second. Partitioning and merging are duals of each other. How- ever, the cost of these two operations differ as shown in Fig- 10.3 Alternative Implementations for Merge ure 6. When compared with the software-managed buffer- The new way of doing the merge seems promising but ing technique, multi-way merging is significantly slower than needs to be evaluated against alternative proposals from the partitioning. This raises the question of why not using par- literature. Chhugani et al. [7] claim that multi-way merging titioning with the software-managed buffers technique for accommodates parallelism in a natural way and works well sorting. If we also use this technique for hash-based joins, for multi-core processors. However, as the available degree previous assumptions about the number of passes over the of hardware parallelism increases, contention in large merge data for the different join approaches no longer hold. With trees also increases. This contention is possibly not visible the optimized technique, a significant amount of data can in four-core configurations considered by Chhugani et al. [7]. be partitioned in a single pass as opposed to the two or more But it critically affects performance in modern systems with passes previously assumed by Kim et al. [15]. This in turn a few ten hardware threads. It turns out that the new ap- calls for a more careful comparison of join algorithms under proach we propose to merging, also addresses this problem. a wider range of data sizes. We have confirmed this experimentally by implementing the two different approaches to multi-way merging: 10.2 Using Partitioning with Sort The cooperative m-way approach follows the original idea The overall NUMA-aware sorting-based join algorithm, by Chhugani et al. [7] where there is a single multi-way m-way is described in Section 6.1. The implementation of merge tree and multiple threads cooperatively merge the 92

9. independent sort cooperative m-way 51.425 50.507 sort throughput [M. tuples/sec] 180 m-way 160 50 m-pass execution time [secs] 140 mpsm (scalar) 30.645 29.504 120 40 26.167 19.803 100 30 15.514 15.280 14.802 80 20 9.067 7.929 60 5.045 40 10 20 0 0 1.6B 3.2B 6.4B 12.8B 2 4 6 8 10 12 14 16 S relation size in billion tuples number of threads Figure 9: Scalability of cooperative multi-way merg- Figure 10: Execution time comparison of sort-merge ing vs. synchronization-free independent sorting. join algorithms. Workload A, 64 threads. Using only one CPU socket. partition sort merge mjoin tput 315M/s cycles per output tuple data. Once the children of a node have enough data, the 175M/s 25 22.9cy node becomes available for merging by any of the threads. This approach has a potential advantage: It increases the 20 105M/s buffer space per merge node as there is a single merge tree 15 13.6cy resident in last-level cache (in Section 9 we showed that buffer space is important for merge performance). 10 7.6cy The independent sorting approach follows the Partition- 5 then-Sort idea discussed in Section 10.2. Each thread locally 0 partitions its data and then sorts smaller individual parti- tions. In this case, the thread can independently merge the mpsm m-pass m-way sorted runs without further interaction with other threads. Figure 9 shows the throughput of the two approaches Figure 11: Performance breakdown for sort-merge when sorting an input relation of 256 M tuples. We inten- join algorithms. Workload A. Throughput metric is tionally restrict to a single socket to ensure that all threads output tuples per second, i.e. |S|/execution time. use the same last-level cache. As the figure shows, coop- erative multi-way merging does not scale for a variety of reasons: contention for upper-level nodes in the merge tree, whereas the larger one is only partially sorted. Yet, mpsm idle threads due to lack of enough work, and synchronization can match the performance of m-pass only when the S rela- overhead. The independent sorting approach, in contrast, tion is significantly large ( 12.8 billion tuples (100 GiB)). scales linearly up to the physical number of cores as shown Nonetheless, mpsm is a scalar algorithm applicable to wider in Figure 9. However, the scalability in the hyper threads re- keys as well. For reasons of space, we omit the results of gion remains limited. This is the common case for hardware- scalar sorting-based joins that are also applicable to wider conscious algorithms where most of the physical hardware keys. In general, the scalar m-way has a good performance resources are shared among hyper-threads. As a conclusion, despite not using SIMD and performs better than mpsm even though all the threads have a fraction of the last level even with 8-byte tuples and large S relations. cache for their multi-way merge, the synchronization-free Figure 11 shows execution time breakdown and through- nature of this approach shows its benefit and independent put for the equal-sized table case in Workload A using 64 sorting proves to be better than the originally proposed co- threads. First, the merge phase in m-way is 3 times faster operative multi-way merging. than m-pass with bandwidth-aware multi-way merging. Sec- ond, in contrast to mpsm, the “mjoin” phase is a linear 11. SORT-MERGE JOINS merge-join operation on NUMA-local sorted runs in the other After identifying the factors affecting the performance of algorithms and overhead of that phase becomes negligible. the components of a sort-merge join algorithm and choosing the best-possible implementations for the different phases, 11.2 Dissecting the Speedup of m-way we now compare the performance of the resulting sort-merge In order to understand the efficiency of m-way, we calcu- join operator (m-way) with that of mpsm and m-pass. lated the speedup ratios of m-way over the other algorithms (Figure 12). The bars denoted with “speedup from merge” 11.1 Comparison with Different Table Sizes shows the speedup of m-way attained over m-pass. This We run first an experiment for different sizes of table S us- metric reflects the actual benefit of multi-way merging alone. ing Workload A shown in Figure 10. The results show that As seen in the figure, up to 16 threads the speedup from m-way runs significantly faster than the other options and multi-way merge is ≈ 1.5X in which case there is enough is robust across different relation size ratios while producing aggregate memory bandwidth for that number of threads. fully sorted output. Algorithm mpsm draws its true benefit However, once the number of threads go beyond 16, mem- from only sorting the smaller of the input tables completely ory bandwidth per thread becomes limited and multi-way 93

10. speedup from merge speedup from AVX speedup to mpsm partition sort merge mjoin build-probe cycles per output tuple 4 411ms 5062ms 3.5 8 3 4255ms speedup 2.5 6 15514ms 2 1018ms 1.5 196ms 9972ms 1 4 0.5 460ms 0 2 1 2 4 8 16 32 64 number of threads 0 mway rdx mway rdx mway rdx mway rdx Figure 12: Speedup of m-way due to parallelism 128M ✶ 128M 1.6B ✶ 1.6B 128M ✶ 512M 1.6B ✶ 6.4B from AVX and efficiency from multi-way merge. join workloads in number of tuples m-way m-pass mpsm “315M/sec” throughput [M. output tuples/sec] Figure 14: Comparison of best sort vs. best hash 288 join algorithms with cycles per output tuple metric 144 under different workloads. Using 64 threads. 72 36 Despite all the optimizations discussed in this paper and the performance boost through 256-bit vector instructions, sort- 18 merge joins cannot match the performance of hash joins on 9 existing recent hardware and for these workloads. For significantly larger workloads (as suggested by Albutiu 4.5 et al. [2]) the picture becomes more favorable for sort-merge joins. Input table sizes in the range of billions of tuples make 1 2 4 8 16 32 64 m-way more competitive with radix. Since m-way produces number of threads fully sorted output, it could be a good choice for large data volumes that must be further processed. Figure 13: Scalability of sorting-based joins. Work- load A, (11.92 GiB ✶ 11.92 GiB). Throughput metric 12.2 Effect of Input Size is output tuples per second, i.e. |S|/execution time. The effect of the input relation size can be better captured by the following experiment: we vary the size of each equi- sized R and S tables from 128M tuples (≈ 1 GB) to 1,920M merge benefit goes up to a factor of 2. The bars denoted tuples (≈ 15 GB) and run m-way and radix at each data with “speedup from AVX” show the speedup of m-way at- point using 64 threads. The results are shown in Figure 15. tained over the same algorithm’s scalar variant. Generally, The conclusion is clear: sort-merge join approaches the speedup from AVX is between 2X and 2.3X. Lastly, the over- performance of radix only when the input tables become sig- all speedup of m-way over mpsm is ≈ 3X. nificantly large. Radix join performance degrades with the increasing input size due to the increasing number of passes 11.3 Scalability of Sort-based Join Algorithms for the partitioning phase. To illustrate, radix configura- Figure 13 shows the scalability of sorting-based join algo- tions vary from 1 pass/12 bits up to 1 billion tuples and from rithms with increasing number of threads where both axes that point on resorts back to 2 pass/18 bits optimal config- are in logarithmic scale. All the algorithms exhibit linear urations. The optimized radix partitioning with software- scaling behavior up to 16 physical CPU cores. However, managed buffers is very efficient up to 9-bits/512-way parti- as all of these algorithms are cache- and CPU resource- tioning since the entire partitioning buffer can reside in L1 sensitive, the scaling with hyper threads is rather limited. cache (32 KiB = 512 × 64-bytes-cache-lines). Even higher radix-bits/fan-outs such as a maximum of 12/4,096 can be tolerated when backed up by the L2 cache (256 KiB = 4,096 × 12. SORT OR HASH? 64-bytes). Consequently, the partitioning performance de- In this section, we present the best sort and hash join grades gracefully up to L2 cache size. Once the partition- algorithms side-by-side under a wide range of parameters. ing requirement goes above 16,384, 2-pass partitioning, each pass with 9-bits, and fully L1-resident buffers become the 12.1 Comparison with Different Workloads best option. Further, sort-merge join demonstrates robust The results of best sorting-based join algorithm m-way performance due to the optimizations previously discussed and best hash join algorithm radix in our study are summa- in this paper. Finally, sorted output in query plans is an rized in Figure 14 over various workloads. attractive argument to make sort-merge joins even more fa- In Figure 14 we observe that hash-based join algorithms vorable in this range of the spectrum of input sizes. still maintain an edge over sort-based counterparts. When We have also analyzed the potential impact of relative input table sizes are in the hundred millions, radix hash join table sizes (partly shown in Figure 14). The experiments is more than 2 times faster than m-way sort-merge join. The show that both algorithms are insensitive to the variance in speed difference is maintained even when the outer table relation size ratios, being affected only by the size of the size becomes significantly larger than the primary key table. output as the amount of data grows. We omit the graphs 94

11.throughput [M. output tuples/sec] 700 1200 exec. time [M. cycles] 1000 600 800 500 600 400 400 300 200 m-way – sort join radix – hash join 200 0 0 0.25 0.5 0.75 1.0 100 m-way – sort join radix – hash join Zipf factor z 0 128 384 640 896 1152 1408 1664 1920 Figure 16: Join performance when foreign key ref- million tuples in R and S erences follow a Zipfian distribution. Workload B. Figure 15: Sort vs. hash with increasing input table sizes (|R| = |S|). Throughput metric is total output achieves approximately 650 million whereas m-way sort-mer- tuples per second, i.e. |S|/execution time. ge join achieves approximately 315 million output tuples per second. As mentioned earlier, performance gap between al- gorithms closes only with significant input sizes as in Fig- and more detailed discussions for reasons of space. ure 17(b), where the number of passes for radix partitioning increases to two. On the other hand, only within the SMT 12.3 Effect of Skew region, algorithms scale poorly. This is an inevitable, well- In this section, we take a look at the effect when the for- known situation for hardware-conscious algorithms: SMT eign key column in table S follows a Zipf distribution. provides the illusion of running 2 threads on a single core For handling skew in parallel radix hash join, we previ- where almost all of the resources are shared. ously proposed a fine-granular task decomposition method. Finally, SMT scalability for radix join are different in the The key idea is to parallelize the processing of larger parti- two workloads mainly because of the optimal partitioning tions through a refined partitioning. We refer to [4, 15] for configuration: In Figure 17(a), radix join runs with single- details. pass partitioning with fan-out of 4,096 that fully stresses the The first and second phases in m-way are not vulnerable L2 cache. Whenever the other thread starts to contend for to skew since all the threads have an equal share of the data. the L2 cache, SMT shows no benefit apart from hiding the Phase 2 in Figure 4, multi-way merging, is prone to skew. self-caused misses. However, in Figure 17(b), the partition- We handle the skew in two steps: 1 When creating a ing buffer is L1 resident with a fan-out of 512 in each of multi-way merge task for the thread, if the total size of the the two passes. Therefore, a limited benefit can be observed merge exceeds an expected average size, we create multiple where certain amount of L1 misses are hidden by SMT. merge tasks by finding boundaries within each of the sorted runs with binary search. These tasks are then inserted into a 12.5 Sort vs. Hash with All Algorithms NUMA-local task queue shared by the threads in the same NUMA region. 2 For extremely large tasks, we identify In this section, we bring all of the state-of-the-art join heavy hitters by computing an equi-depth histogram over strategies together for a side-by-side comparison using com- sorted runs (which is a fast operation) in a similar approach mon workloads. The results are shown in Figure 18. Radix to Albutiu et al. [2]. Then heavy hitters are directly copied hash join comes out as the fastest algorithm in this compar- to their output targets without a need for merging. ison. Albutiu et al. [2] previously compared their massively- Figure 16 illustrates the behavior of m-way and radix with parallel sort-merge join (mpsm) algorithm to no-partitioning increasing Zipf skew factor. The enhancements to m-way, join (n-part) implementation of Blanas et al. [5] and found explicit skew and heavy-hitter handling mechanisms, result out that sort-merge joins are already faster than hash joins. in a robust performance against skewed distribution of keys However, in our recent work [4], we have optimized the no- while showing less than 10 % overhead. On the other hand, partitioning join idea further (nevertheless it is still not fully radix join is also robust against skew with the fine-granular NUMA-aware as the hash table is spread over NUMA re- task decomposition technique. Overall, the results in Fig- gions). Therefore, we extend the comparison of Albutiu ure 16 show that the comparison of sort vs. hash joins does et al. [2] with our optimized implementations of the algo- not significantly change due to skew. rithms. The results in Figure 18 indicate that mpsm and n-part algorithms in fact achieve similar performance. Op- 12.4 Scalability Comparison timized n-part is only faster by ≈ 10-15 % while lacking the We compare the algorithms in terms of scalability by vary- partially sorted output benefit of mpsm. Nevertheless, all ing the number of threads up to the available 64 threads the results clearly indicate that hash joins are still faster in our system. Threads are assigned to NUMA regions in than sort-merge join counterparts. a round-robin fashion. Between 32 and 64 threads, algo- rithms use SMT (hyper-)threads. Results for two different workloads are shown in Figure 17. 13. CONCLUSIONS First, both algorithms show almost linear scalability up to As hardware changes, the “Sort vs. Hash” question needs the physical number of cores. Consequently, radix hash join to be revisited regularly over time. In this paper, we look 95

12. throughput [M. output tuples/sec] radix – hash join m-way – sort join radix – hash join m-way – sort join throughput [M. output tuples/sec] 600 350 500 300 250 400 200 300 150 200 100 100 50 0 0 1 4 8 16 24 32 40 48 56 64 1 4 8 16 24 32 40 48 56 64 number of threads number of threads (a) 977 MiB ✶ 977 MiB (128 million 8-byte tuples) (b) 11.92 GiB ✶ 11.92 GiB (1.6 billion 8-byte tuples) Figure 17: Scalability of sort vs. hash join. Throughput is in output tuples per second, i.e. |S|/execution time. partition sort merge mjoin build probe ¨ [3] C. Balkesen, J. Teubner, G. Alonso, and M. T. Ozsu. 15236ms Main-memory hash joins on multi-core CPUs: Tuning to the 22 cycles per output tuple 1145ms 12992ms underlying hardware. Technical report, ETH Zurich, Nov. 2012. 1016ms ¨ [4] C. Balkesen, J. Teubner, G. Alonso, and M. T. Ozsu. 18 Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. In ICDE, 2013. 14 [5] S. Blanas, Y. Li, and J. M. Patel. Design and evaluation of main memory hash join algorithms for multi-core CPUs. In 10 SIGMOD Conference, pages 37–48, 2011. 411ms 5062ms 4255ms [6] K. Bratbergsengen. Hashing methods and relational algebra 6 operations. In VLDB, pages 323–333, 1984. 196ms [7] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, 2 Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey. Efficient implementation of sorting on multi-core simd cpu architecture. mway mpsm n-part rdx mway mpsm n-part rdx PVLDB, 1(2):1313–1324, 2008. 128M ✶ 128M 1.6B ✶ 1.6B [8] D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. Gamma - a high performance algorithms / workloads in number of tuples dataflow database machine. In VLDB, pages 228–237, 1986. [9] S. Fushimi et al. An overview of the system software of a parallel relational database machine grace. In VLDB, 1986. Figure 18: Sort vs. hash join comparison with ex- [10] B. Gedik, R. Bordawekar, and P. S. Yu. Cellsort: High tended set of algorithms. All using 64 threads. performance sorting on the cell processor. In VLDB, 2007. [11] G. Graefe. Sort-merge-join: An idea whose time has(h) passed? In ICDE, pages 406–417, 1994. [12] G. Graefe, A. Linville, and L. D. Shapiro. Sort versus hash at it in the context of in-memory data processing for re- revisited. IEEE Trans. Knowl. Data Eng., 6(6):934–944, 1994. cent multi-core machines. Our results provide the fastest in- [13] H. Inoue et al. Aa-sort: A new parallel sorting algorithm for memory join processing algorithms using sorting (2–3 times multi-core simd processors. In PACT, pages 189–198, 2007. faster than available results) and hashing. Moreover, we [14] C. Kim, J. Park, N. Satish, H. Lee, P. Dubey, and J. Chhugani. Cloudramsort: fast and efficient large-scale distributed ram sort show that hash-based join algorithms still have an edge over on shared-nothing cluster. In SIGMOD, pages 841–850, 2012. sort-merge joins despite the advances on the hardware side [15] C. Kim, E. Sedlar, J. Chhugani, T. Kaldewey, A. D. Nguyen, such as 256-bit SIMD. Finally, sort-merge join turns out to A. D. Blas, V. W. Lee, N. Satish, and P. Dubey. Sort vs. hash be more comparable in performance to radix-hash join with revisited: Fast join implementation on modern multi-core CPUs. PVLDB, 2(2):1378–1389, 2009. very large input sizes. [16] M. Kitsuregawa et al. Application of hash to data base machine All the code used to obtain results in this paper is available and its architecture. New Generation Comput., 1(1), 1983. at [17] D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, 2nd edition, 1998. [18] Y. Li, I. Pandis, R. M¨ uller, V. Raman, and G. M. Lohman. Acknowledgements Numa-aware algorithms: the case of data shuffling. In CIDR, 2013. This work was supported by the Swiss National Science [19] S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing Foundation (Ambizione grant; project Avalanche), by the main-memory join on modern hardware. IEEE Trans. Knowl. Data Eng., 14(4):709–730, 2002. Enterprise Computing Center (ECC) of ETH Zurich, and [20] T. H. Merrett. Why sort-merge gives the best implementation by Deutsche Forschungsgemeinschaft (DFG; SFB 876 “Pro- of the natural join. SIGMOD Rec., 13(2):39–51, Jan. 1983. viding Information by Resource-Constrained Analysis”) [21] R. M¨ uller, J. Teubner, and G. Alonso. Sorting networks on fpgas. VLDB J., 21(1), 2012. [22] N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, 14. REFERENCES D. Kim, and P. Dubey. Fast sort on CPUs and GPUs: A case [1] Intel architecture code analyzer. for bandwidth oblivious SIMD sort. In SIGMOD, 2010. en-us/articles/intel-architecture-code-analyzer. Online, [23] D. A. Schneider and D. J. DeWitt. A performance evaluation of accessed February 2013. four parallel join algorithms in a shared-nothing multiprocessor [2] M.-C. Albutiu, A. Kemper, and T. Neumann. Massively environment. SIGMOD ’89, pages 110–121, 1989. parallel sort-merge joins in main memory multi-core database [24] A. Shatdal, C. Kant, and J. F. Naughton. Cache conscious systems. PVLDB, 5(10):1064–1075, 2012. algorithms for relational query processing. In VLDB, 1994. 96