Massively Parallel Sort-Merge Joins

Two emerging hardware trends will dominate the database system technology in the near future: increasing main memory capacities of several TB per server and massively parallel multi-core processing. Many algorithmic and control techniques in current database technology were devised for diskbased systems where I/O dominated the performance. In this work we take a new look at the well-known sort-merge join which, so far, has not been in the focus of research in scalable massively parallel multi-core data processing as it was deemed inferior to hash joins.

1. Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems Martina-Cezara Albutiu Alfons Kemper Thomas Neumann Technische Universitat ¨ Munchen ¨ Boltzmannstr. 3 85748 Garching, Germany ABSTRACT and core numbers. So far, main memory database systems Two emerging hardware trends will dominate the database were either designed for transaction processing applications, system technology in the near future: increasing main mem- e.g., VoltDB [25], or for pure OLAP query processing [4]. ory capacities of several TB per server and massively parallel However, the upcoming requirements for so-called real-time multi-core processing. Many algorithmic and control tech- or operational business intelligence demand complex query niques in current database technology were devised for disk- processing in “real time” on main memory resident data. based systems where I/O dominated the performance. In SAP’s Hana [10] and our hybrid OLTP&OLAP database this work we take a new look at the well-known sort-merge system HyPer [16], for which the presented massively paral- join which, so far, has not been in the focus of research in lel join algorithms were developed, are two such databases. scalable massively parallel multi-core data processing as it The query processing of in-memory DBMSs is no longer I/O was deemed inferior to hash joins. We devise a suite of new bound and, therefore, it makes sense to investigate massive massively parallel sort-merge (MPSM) join algorithms that intra-operator parallelism in order to exploit the multi-core are based on partial partition-based sorting. Contrary to hardware effectively. Only query engines relying on intra- classical sort-merge joins, our MPSM algorithms do not rely query and intra-operator parallelism will be able to meet on a hard to parallelize final merge step to create one com- the instantaneous response time expectations of operational plete sort order. Rather they work on the independently business intelligence users if large main memory databases created runs in parallel. This way our MPSM algorithms are to be explored. Single-threaded query execution is not are NUMA-affine as all the sorting is carried out on local promising to meet the high expectations of these database memory partitions. An extensive experimental evaluation users as the hardware developers are no longer concerned on a modern 32-core machine with one TB of main memory with speeding up individual CPUs but rather concentrate proves the competitive performance of MPSM on large main on multi-core parallelization. memory databases with billions of objects. It scales (al- Consequently, in this paper we develop a new sort-based most) linearly in the number of employed cores and clearly parallel join method that scales (almost) linearly with the outperforms competing hash join proposals – in particular number of cores. Thereby, on modern multi-core servers our it outperforms the “cutting-edge” Vectorwise parallel query sort-based join outperforms hash-based parallel join algo- engine by a factor of four. rithms which formed the basis for multi-core optimization in recent proposals. The well-known radix join algorithm of MonetDB [19] pioneered the new focus on cache locality by 1. INTRODUCTION repeatedly partitioning the arguments into ever smaller par- Increasing main memory capacities of up to several TB titions. The recursive sub-partitioning, rather than directly per server and highly parallel processing exploiting multi- partitioning into small fragments, preserves TLB cache lo- core architectures dominate today’s hardware environments cality by restricting the random write of the partitioning and will shape database system technology in the near fu- phase to a small number of pages whose addresses fit into ture. New database software has to be carefully targeted the TLB cache. The join is carried out on small cache-sized against the upcoming hardware developments. This is par- fragments of the build input in order to avoid cache misses ticularly true for main memory database systems that try during the probe phase. Because of this cache-affine be- to exploit the two main trends – increasing RAM capacity havior the radix join became the basis for most work on multi-core parallel join implementations, e.g., [17, 14]. In addition to the cache locality, He et al. [14] and Kim et Permission to make digital or hard copies of all or part of this work for al. [17] also focussed on low synchronization overhead and personal or classroom use is granted without fee provided that copies are avoidance of dynamic memory allocation. Both aspects were not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to achieved by computing histograms of the data to be parti- republish, to post on servers or to redistribute to lists, requires prior specific tioned and then deriving the prefix sums to determine the permission and/or a fee. Articles from this volume were invited to present exact array positions into which parallel threads write their their results at The 38th International Conference on Very Large Data Bases, partitioned data. Unfortunately, merely relying on straight- August 27th - 31st 2012, Istanbul, Turkey. forward partitioning techniques to maintain cache locality Proceedings of the VLDB Endowment, Vol. 5, No. 10 and to keep all cores busy will not suffice for the modern Copyright 2012 VLDB Endowment 2150-8097/12/06... $ 10.00. 1064

2. 100% We thus conclude that sequential scans of remote memory 22756 ms 417344 ms 1000 ms are acceptable from a performance perspective. 837 ms relative execution This observation and further micro-benchmarks led us to state the following three rather simple and obvious rules duration (called “commandments”) for NUMA-affine scalable multi- core parallelization: 12946 ms 7440 ms C1 Thou shalt not write thy neighbor’s memory randomly – chunk the data, redistribute, and then sort/work on ote ocal ed tia l ote ocal your data locally. rem l o niz quen rem l r h e C2 Thou shalt read thy neighbor’s memory only sequen- nc s (2) (1) sy (3) tially – let the prefetcher hide the remote access la- sort partitioning merge join tency. (sequential read) C3 Thou shalt not wait for thy neighbors – don’t use fine- Figure 1: Impact of NUMA-affine versus NUMA-agnostic grained latching or locking and avoid synchronization data processing points of parallel threads. By design, the massively parallel sort-merge join algo- hardware that scales main memory via non-uniform mem- rithms (called MPSM) we present in this paper obey all ory access (NUMA). Besides the multi-core parallelization three commandments whereas the previously proposed hash also the RAM and cache hierarchies have to be taken into join variants violate at least one of the commandments and, account. In particular the NUMA division of the RAM has therefore, exhibit scalability problems of various forms. to be considered carefully. The whole NUMA system log- We will show that the carefully engineered NUMA-friendly ically divides into multiple nodes, which can access both MPSM exhibits an outstanding performance when compared local and remote memory resources. However, a processor to the Wisconsin hash join [1] and Vectorwise [15]. Our per- can access its own local memory faster than non-local mem- formance evaluation proves the scalability of MPSM for very ory, that is, memory local to another processor or memory large main memory databases with hundreds of GB data shared between processors. The key to scalable, high perfor- volume. For large numbers of cores (up to 32) MPSM out- mance is data placement and data movement such that performs the recently proposed hash-based Wisconsin join threads/cores work mostly on local data – called NUMA- by up to an order of magnitude. MPSM scales (almost) friendly data processing. linearly in the number of cores and compared to the TPC- To back up this claim, Figure 1 shows the results of a few H endorsed “world champion” query processor Vectorwise micro-benchmarks we ran on a one TB main memory ma- even achieves a factor of four. chine with 32 cores (cf. Section 5, Figure 11). We therefore The remainder of the paper is structured as follows: In instantiated 32 threads to work on one relation with a total Section 2 we depict the basic idea of MPSM in comparison of 1600M (throughout the paper we use M = 220 ) tuples, to the radix join and the Wisconsin hash join. In Section 3 each consisting of a 64-bit sort key and a 64-bit payload, we address the concrete implementations of the MPSM con- in parallel. (1) We first chunked the relation and sorted cept in detail and in Section 4 we discuss the skew resilience the chunks of 50M tuples each as runs in parallel. In the of MPSM. We evaluate MPSM in Section 5 and cover re- “green” NUMA-affine benchmark, the sorting of each core lated work in Section 6. Finally, we conclude our work in was performed in the local NUMA RAM partition whereas Section 7. in the unfavorable “red” case the sort was performed on a globally allocated array. We observe a severe performance 2. THE BASIC IDEA OF MPSM penalty of a factor of three if NUMA boundaries are ig- We will first present the very basic idea of the NUMA- nored. (2) We then analyzed the performance penalty of affine MPSM in comparison to radix join and Wisconsin fine-grained synchronization. For this the 32 threads parti- hash join. Later, we will discuss important refinements re- tioned the global relation into 32 chunks each being stored garding performance improvement and skew resilience. as an array. In the “red” experiment the next write position The recently proposed Wisconsin hash join [2] is based was individually read from a (test-and-set) synchronized in- on a global shared hash table which has to be built across dex variable of the corresponding partition array. In the the NUMA partitions by a large number of threads. These “green” experiment all threads were allocated precomputed concurrent accesses to a single hash table need synchroniza- sub-partitions that could be written sequentially without tion via latches. Therefore, during the parallel build phase synchronization. This experiment proves that fine-grained “commandments” C2 and C3 are violated. During the probe synchronization (even with wait-free test-and-set variables) phase random reads to the hash table are performed across is a “no-go” for scalable data processing. (3) Finally, in the the NUMA memory partitions, which again violates C2 as last microbenchmark we analyzed the tolerable performance the hardware prefetcher cannot hide the access latency. In penalty of sequentially scanning remote memory in compar- Figure 2a we illustrate the random writes and reads within ison to local memory. Each of the 32 parallel threads merge the NUMA partitions using different-colored arrows and the joins two chunks of 50M tuples each. Thereby, each thread required synchronization with locks. works on one local run. The second run is either in remote The radix join of MonetDB [19] and Oracle/Intel [17] (“yellow”) or local (“green”) NUMA partitions. The neg- writes across NUMA partitions during the initial partition- ative impact of the second chunk being accessed remotely ing phase as illustrated in Figure 2b. The radix join repeat- compared to the second chunk being local, too, is mitigated edly partitions the arguments in order to achieve cache lo- by the hardware prefetcher as the accesses are sequential. cality of the hash table probes despite their random nature. 1065

3. NUMA partitions NUMA partitions NUMA partitions R S R S R S sort sort run run sort sort Hash Table run run sort sort run run build HT probe HT non-local partitioning phase local partitioning phase sequential read (a) Wisconsin hash join processing (b) Radix join processing (c) MPSM join processing Figure 2: Comparison of basic join processing of Wisconsin hash join, radix join, and MPSM Unfortunately, the price for this locality is the partitioning S the public input. After the sorting phase is finished each of both join arguments across the NUMA memory during worker processes only its own chunk of the private input the first partitioning step. but sequentially scans the complete public input. We will Our massively parallel sort-merge (MPSM) join is de- later devise the range partitioned variant where this com- signed to take NUMA architectures into account which were plete scanning is avoided to speed up the join phase even not yet in the focus of prior work on parallel join process- more beyond parallelization. During run generation (phase ing for main memory systems. Though, we emphasize that 1 and phase 2), each worker thread handles an equal share MPSM is oblivious to specific NUMA architectures as it of both the public and the private input. These phases do only assumes the locality of a RAM partition for a single not require any synchronization between the workers and core – without relying on multiple cores sharing the local- are performed in local memory which we have shown to be ity of RAM partitions or caches. As illustrated in Figure 2c advantageous for the sort operator (cf. Figure 1). Even each data chunk is processed, i.e., sorted, locally. Unlike tra- if data has to be copied from remote to local chunks this ditional sort-merge joins we refrain from merging the sorted can be amortized by carrying out the first partitioning step runs to obtain a global sort order and rather join them all in of sorting while copying. In phase 3, each worker joins its a brute-force but highly parallel manner. We opt to invest sorted private input run with the sorted public input runs more into scanning in order to avoid the hard to parallelize using merge join. The join phase requires reading non-local merge phase. Obviously, this decision does not result in a memory, however, only sequentially. As we have shown be- globally sorted join output but exhibits a partial sort order fore, sequential scans heavily profit from (implicit processor) that allows for sort order based subsequent operations, e.g, prefetching and cache locality and therefore do not affect early aggregation. During the subsequent join phase, data performance significantly. accesses across NUMA partitions are sequential, so that the The B-MPSM algorithm is absolutely skew resistant and prefetcher mostly hides the access overhead. We do not obeys the three “commandments” for NUMA-affine design employ shared data structures so that no expensive syn- we stated above: During the run generation phases for pub- chronization is required. Therefore, MPSM obeys all three lic and private input, only local memory is written. In the NUMA-commandments by design. join phase, all runs (local and remote) are scanned sequen- tially. Furthermore, B-MPSM requires only one synchro- 2.1 The B-MPSM Algorithm nization point as we need to make sure that the public input The basic MPSM algorithm starts by generating sorted runs Si are ready before we start the join phase. Note that runs in parallel. These runs are not merged as doing so the sort phase of the private data R need not be finished would heavily reduce the “parallelization power” of mod- before other threads start their join phase. Thus, the syn- ern multi-core machines. Instead, the sorted runs are sim- chronization is limited to ensure that all other workers have ply joined in parallel. In the following, we first describe finished their sorting of the public input chunk before phase the MPSM algorithm in its basic form (B-MPSM) which is 3 (join) is entered. The fact that the output of each worker absolutely insensitive to any kind of skew. It bears some is a sorted run may be leveraged by subsequent operators similarity to fragment and replicate distributed join algo- like sort-based aggregation. Also, presorted relations can rithms. However, it only replicates merge join scans of the obviously be exploited to omit one or both sorting phases. threads/cores but does not duplicate any data. Then we present an improved MPSM version based on range parti- W1 W2 W3 W4 tioning of the input by join keys (P-MPSM). Further, the larger R1 R2 R3 R4 MPSM can effectively be adapted to non-main memory sce- Phase 2 sort R data narios, i.e., scenarios in which intermediate data must be smaller written to disk. We call this the disk-enabled MPSM algo- Phase 3 W1 … … W4 rithm (D-MPSM). MJ MJ The B-MPSM algorithm is sketched in Figure 3 for a sce- smaller nario with four worker threads. The input data is chunked Phase 1 sort S data into equally sized chunks among the workers, so that for in- larger S1 S2 S3 S4 stance worker W1 is assigned a chunk R1 of input R and another chunk S1 of input S. Each worker sorts its data chunks, thereby generating sorted runs of the input data in Figure 3: B-MPSM join with four workers Wi parallel. In the following, we call R the private input and 1066

4.2.2 Complexity of B-MPSM RAM larger R1 R2 R3 R4 B-MPSM basically executes T sort-merge joins in parallel, where T is the number of worker threads. In each of these page index sorted RAM sort-merge joins, 1/T th of the input relations is processed. A W1 S1 S4 crude complexity approximation per worker Wi results in: smaller S2 W4 S1 W1 … … W4 |S|/T · log(|S|/T ) MJ MJ sort chunk Si of size |S|/T W3 S3 smaller S4 + |R|/T · log(|R|/T ) sort chunk Ri of size |R|/T W2 S3 S1 S3 + T · |R|/T process run Ri for all S runs S4 sorted S2 S4 RAM + T · |S|/T process all S runs ... larger S1 S2 S3 S4 = |S|/T · log(|S|/T ) + |R|/T · log(|R|/T ) + |R| + |S| On the bottom line, each thread sorts “its” chunks of R still to be processed active already released to be prefetched and S and processes all sorted S runs. Thereby, the own R from RAM run is read several (T ) times as each of the S runs possibly joins with the local run. Figure 4: Disk-enabled MPSM join: the four workers Wi The formula above reveals that the sort phases of B- progress synchronously through their Ri run and all S runs, MPSM scale well with the number of worker threads T . thereby only active parts of the runs are in RAM The join phase, however, requires each worker to process the complete public input regardless of the processing par- P-MPSM is a pure main memory version that range parti- allelism given. For I/O bound disk-based processing in D- tions the input data thereby providing scalability with re- MPSM this is hidden by the I/O latency. However, for pure spect to processing cores. D-MPSM is a RAM-constrained in-core processing we address this issue by including a pro- version that spools runs to disk. Both scenarios are com- logue phase to range partition and assign the private input mon in main memory DBMSs and require attention when data to the workers in a way that allows saving much of the database operators are designed. We carefully consider both work during the join phase. This variant is called (parti- variants detailed enough to allow for an implementation and tioned) P-MPSM and is explained in detail in this paper. considerations about performance. 2.3 Sorting 3.1 Memory-Constrained Disk MPSM Efficient sorting is decisive for the performance of MPSM. The presented MPSM can effectively be adapted to sce- As we deal with (realistic) large join keys and payloads that narios in which the intermediate result data is too large to be need to be sorted we cannot utilize the specialized bitonic kept in main memory. Even in main memory database sys- sorting routines that exploit the SIMD registers [6], as these tems like HyPer that retain the entire transactional database are limited to 32-bit data types. Instead we developed our in RAM, the query processor spools intermediate results to own three-phase sorting algorithm that operates as follows: disk to preserve the precious RAM capacity for the trans- 1. in-place Radix sort [18] that generates 28 = 256 parti- actional working set. Therefore, it is important to support tions according to the 8 most significant bits (MSD). This both pure main memory algorithms and a disk-based pro- works by computing a 256 bucket histogram and deter- cessing mode with a very small RAM footprint. mining the boundaries of each partition. Then the data The disk-enabled MPSM (D-MPSM) processes the left elements are swapped into their partition. and right input runs by synchronously moving through the key domain which is sorted. The resulting data locality al- 2. IntroSort (Introspection Sort) [20] lows to spill already processed data to disk and to prefetch 2.1 Use Quicksort to at most 2 · log(N ) recursion levels. data that is to be processed soon. Figure 4 illustrates the If this does not suffice, resort to heapsort. approach: both R and S runs are stored on disk, only the currently processed pages (white) need to be main memory 2.2 As soon as a quicksort partition contains less than resident. Already processed data is not touched again and 16 elements stop and leave it to a final insertion sort thus can be released from RAM (green) and soon to be pro- pass to obtain the total ordering. cessed data is prefetched from disk asynchronously (yellow). We analyzed that this sorting routine is about 30% faster For this purpose, we maintain a page index which is or- than, for example, the STL sort method – even when up to dered page-wise by key value. The index is built during run 32 workers sort their local runs in parallel. Note that we generation and contains pairs vij , Si where vij is the first do not employ synchronization-heavy parallel sorting – each (minimal) join key value on the j th page of run Si . Figure 4 worker sorts a separate chunk of data into a run. depicts a simplified page index (only run identifiers) on the left. It actually contains the following information: 3. THE RANGE-PARTITIONED MPSM AND sorted by vij −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−→ A DISK-BASED VARIANT v11 v41 v21 v12 v31 v42 v32 v43 . . . So far, we presented the basic concept of MPSM which S1 S4 S2 S1 S3 S4 S3 S4 . . . (in object-oriented terminology) is only an abstract class for several algorithmic specializations. We present two derived where v11 ≤ v41 ≤ . . . ≤ v43 . Both the prefetcher and the implementations: workers process the S input data in the order specified by the index, thereby synchronously moving through the key B-MPSM domain and allowing to keep only a small part of the data P-MPSM D-MPSM in memory during join processing. All page index entries 1067

5. W1 W2 W3 W4 Compared to the complexity approximation of B-MPSM, range partitioning pays off if the cost of range-partitioning R is smaller than the savings in join processing, i.e., if C1 C2 C3 C4 |R|/T ≤ |S| − |S|/T range partition Phase 2 larger For a parallelism greater than or equal two and |R| ≤ |S| it R1 R2 R3 R4 pays off. The performance of P-MPSM thus scales almost Phase 3 sort R data linearly with the number of parallel threads T which is de- smaller cisive for the effective multi-core scalability of P-MPSM, as Phase 4 … … our experimental evaluation will also prove. MJ MJ In general, the two input relations to a join operation are smaller not equally sized but usually consist of a larger (fact) table and smaller (dimension) tables. Assigning the private in- Phase 1 sort S data put role R to the smaller of the input relations and thus the larger public input role S to the larger yields the best performance. S1 S2 S3 S4 Thereby, only a small fraction (depending on the number of worker threads T ) of the remote public input needs to be Figure 5: P-MPSM join with four workers Wi processed while the smaller private input is scanned several already processed by the “slowest” worker, e.g., W1 in the times with almost no performance penalties. We will present illustration, point to run pages that may be released from evaluation results quantifying the performance impact of re- RAM (green). The prefetcher is supposed to pre-load run versed public/private input roles in Section 5.4. pages according to the index before they are accessed by any worker (yellow). Implicitly, the workers’ private input 3.2.1 Partitioning the Private Input (Phase 2) runs Ri are read from disk (red), processed, and released We design the re-distribution of the private input chunks from RAM in ascending order of join keys. Please note Ci to be very efficient, i.e., branch-free, comparison-free, and that the common page index structure does not require any synchronization-free. (1) branch-freeness and comparison- synchronization as it is accessed read-only. freeness are achieved by using radix-clustering [19] on the Obviously, the performance of D-MPSM is determined by highest B bits of the join key where log(T ) ≤ B. For the time to write (run generation) and read (join phase) both log(T ) = B, radix-clustering results in exactly T clusters. inputs. Therefore, in order to exploit the power of multiple By increasing B, we can account for skew in both R and S as cores a sufficiently large I/O bandwidth (i.e., a very large we will discuss in Section 4. (2) We then range partition the number of disks) is required. private input chunks, thereby guaranteeing synchronization- freeness by letting each worker write sequentially to precom- 3.2 Range-partitioned MPSM puted sub-partitions within all runs. For this purpose, each The range partitioned MPSM (P-MPSM) extends the B- thread builds a histogram on its chunk of the global rela- MPSM algorithm by a prologue phase to range partition and tion R. The local histograms are combined to obtain a set assign the private input data to the workers in a way that of prefix sums where each prefix sum represents the start allows saving much of the work during the join phase. The positions of each worker’s partitions within the target runs. different phases of the algorithm are sketched in Figure 5 Each worker then scatters its input chunk to the partitions for a scenario with four workers, choosing R as private in- using the prefix sums and updating them accordingly. This put and S as public input. In phase 1, the public input is approach was adapted from the radix join of [14]. chunked and sorted locally, resulting in runs S1 to S4 . Sub- We demonstrate the partitioning of R in Figure 6 for two sequently, in phase 2, the private input is chunked into C1 to workers, B = 1 and a join key range of [0, 32). Each worker C4 and those chunks are range partitioned. We always em- thread Wi scans its own chunk Ci and probes for each tuple ploy a histogram-based technique to ensure that the range into a histogram array depending on its highest bit (which partitions are balanced (cf. Section 4) even for skewed data we show underlined), i.e., join key values < 16 are assigned distributions. Thereby, the private input data is partitioned to the first position and join key values ≥ 16 are assigned to into disjoint key ranges as indicated by the different shades the second. According to h1 , chunk C1 contains four entries in Figure 5 ranging from white over light and dark gray to for the first and three for the second partition. The two black. In phase 3, each worker then sorts its private input partitions are shown as white and black entries. From the chunk and in phase 4, merge joins its own private run Ri combined histograms prefix sums are computed that point to with all public input runs Sj . the subpartition into which the workers scatter their chunk’s By refining the MPSM to use range partitioning, each tuples. For example, the prefix sum ps1 denotes that W1 thread conducts only the join between 1/T th of the join key scatters its entries for the first and second partition starting domain of R and S. This reduces the complexity per worker at position 0. According to ps2 , W2 scatters tuples belonging Wi to to the first partition beginning at position 4 (as W1 writes |S|/T · log(|S|/T ) sort chunk Si of size |S|/T to positions 0 to 3), and those belonging to the second par- + |R|/T range-partition chunk Ri tition beginning at position 3. In general, the ps-entries are + |R|/T · log(|R|/T ) sort chunk Ri of size |R|/T computed as + T · |R|/T process run Ri for all S runs + T · |S|/T 2 process 1/T th of each S run 0, if i = 1 psi [j] = i−1 k=1 hk [j], else = |S|/T · log(|S|/T ) + |R|/T + |R|/T · log(|R|/T ) + |R| + |S|/T 1068

6. Phase 2 19 W1 9 9 W1 scatters key 19 to R2 at position 0 7 W1 sort run R1 chunk C1 7 7=00111 h1 ps1 3 ++ proceed 3 4 <16 0 1 with 21 3 •16 0 1 sequential write 2 17=10001 join 1 ++ histogram prefix W2 4 17 sum 8 2 W2 sequential write 19 23 W2 scatters key 2 to R1 at position 4 21 W1 chunk C2 sort run R2 4 4=00100 h2 ps2 17 proceed 31 ++ 3 <16 4 5 23 with 8 20=10100 4 •16 3 31 join 20 ++ W2 20 histogram prefix 26 sum 26 R Figure 6: Phase 2 of P-MPSM: 5 bit join keys in the range [0,32), 1 bit histogram for 2 partitions Actually, the psi contain pointers to the positions, not key domain 1. index values, as shown by the dotted arrows in Figure 6, i.e., psi [j] = &Rj [( i−1 k=1 hk [j])]. The prefix sums psi per ƔƔƔƔƔƔƔƔƔƔƔƔƔƔƔ Sj ƔƔƔƔƔƔƔƔƔƔƔƔƔƔƔƔ worker Wi , which are computed from the combined local sj1 2. sjk sjn histograms, are essential for the synchronization-free parallel ri1 scattering of the tuples into their range partition. Every ƔƔƔ Ri ƔƔƔ worker has a dedicated index range in each array Ri into which it can write sequentially. This is orders of magnitude Figure 7: Interpolation search for ri1 in Sj more efficient than synchronized writing into the array – as shown in Figure 1 (2) and makes MPSM immune against cache coherency overhead. space, as well as the difference of the searched key value and Note that depending on the actual join key value distri- the search space minimum key value. The computed index bution, in particular the minimum and maximum join key per iteration is always relative to the search space starting values, it might be necessary to preprocess the join keys index. In the illustration in Figure 7, only two steps are before applying radix-clustering. This can usually be done required to find the starting point for merge join: efficiently using bitwise shift operations. Although we use radix-clustering for partitioning the pri- 1. the search space is [sj1 , sjn ], i.e. from indexes 1 to n, vate input, the approach is not restricted to integer join thus we compute 1 + (n − 1) · (ri1 − sj1 )/(sjn − sj1 ) = k keys. However, if long strings are used as join keys, MPSM 2. the search space is narrowed to [sj1 , sjk ], i.e. from indexes should work on the hash codes of those strings, thereby giv- 1 to k, so we compute 1 + (k − 1) · (ri1 − sj1 )/(sjk − sj1 ) ing up the meaningful sorting of the output. Furthermore, main memory DBMSs usually employ dictionary encoding and find the start index of the light gray partition. so that joins on strings are usually internally executed as joins on integers anyway. 4. SKEW RESILIENCE OF P-MPSM The basic B-MPSM as well as the disk variant D-MPSM 3.2.2 Join Phase (Phase 4) are completely skew immune as they do not range partition. Due to partitioning, the private input data chunks contain So far, we discussed P-MPSM using statically determined only a fraction of the key value domain and thus probably partition bounds. In case of uniform data distribution the join only with a fraction of each public input data chunk. As presented algorithm assigns balanced workloads (i.e., equally indicated in Figure 5, the public input runs are therefore im- sized chunks) to the workers. It is important to note that plicitly partitioned – by the sorting order. Sequentially the location of the data – e.g., if by time of creation cluster- searching for the starting point of merge join within each ing small values appear mostly before large values – within public data chunk would incur numerous expensive compar- the relations R and S has no negative effect. The location isons. Thus, we determine the first public input tuple of skew among the R and S runs is implicitly handled by range run Sj to be joined with the private input run Ri using partitioning the R data and thereby limiting the S data each interpolation search as sketched in Figure 7. worker has to process. Of course, location skew may cause Depending on which of the first values of each run – sj1 slight NUMA effects that cannot be controlled lastly. As our and ri1 – is larger (in general this will be ri1 because the evaluation in Section 5.5 shows, these effects usually have a key range of R runs is limited while the key range of S runs positive impact on performance as the join partners of a is not), we search for it within the other run by iteratively partition Ri are better clustered in S. narrowing the search space. The most probable position in We now present a more elaborate version of P-MPSM that each iteration is computed by applying the rule of proportion can handle distribution skew while incurring only very lit- using the minimum and maximum index positions and the tle overhead to the overall performance. Skew resilience is minimum and maximum key values of the current search achieved by not determining the partition bounds statically 1069

7. Phase 2.1 are of size four. In the second phase, the local partition W1 W1 W1 W1 bounds of all workers are collected as input to a global cu- 1 2 4 5 b11=7 7 b21=12 12 b31=9 9 b41=13 13 mulative distribution function (CDF). 10 17 13 28 key b12=15 15 b22=25 25 b32=30 30 b42=44 44 Using the local equi-height histograms we can only es- 22 33 37 49 values 31 42 48 56 timate the gradient of the step function by approximating b13=31 b23=42 78 b33=48 54 b43=56 77 66 81 90 75 100 each step to be equally high. Of course, the real global distri- b14=81 b24=90 b34=75 b44=100 bution deviates (slightly) from this as the different workers’ S1 S2 S3 S4 equi-height partitions have overlapping key ranges. In the # tuples key value example in Figure 8, each worker thread determines T = 4 total=32 distribution of S local bounds, in general we propose to compute f · T local bounds for better precision. By increasing f and thus the ¾·total=24 number of local bounds determined by each worker, more fine grained information about the global data distribution can be collected at negligible costs. ½·total=16 probe for Note that the CDF allows for configuration changes con- high of Ri cerning the number of workers. Appropriate partition limits probe for ¼·total=8 are then found using interpolation as denoted in Figure 8 by low of Ri the diagonal connections between steps. This also allows to key value combine the global S data distribution represented by the 50 100 CDF with the R data distribution in order to handle uncor- related or even negatively correlated skew in R and S as we Figure 8: P-MPSM CDF computation: example with will show below. skewed input (mostly small key values) 4.2 Global R Distribution Histogram (Phase but computing them based on dynamically obtained infor- 2.2) mation about the key value distributions in R and S. We In phase 2.2, each worker scans its private input chunk Ci exploit the sort order of the public input S to compute arbi- and computes a local histogram on it using radix-histogram- trarily precise histograms representing the key value distri- ming. Thereby, the number of leading bits B determines bution of S en passant, i.e., in almost no time. Further, we the precision of the histogram, i.e., using B bits we obtain increase the number B of bits used for the histogram compu- a histogram of size 2B . Building a more fine-grained his- tation for radix-clustering of the private input R and thereby togram does only incur little overhead but allows for a much also obtain very precise histograms representing the private more precise computation of global R bounds. By merging input join key value distribution. We then determine global some of the clusters to form T partitions with a balanced load-balancing partition bounds based on the computed dis- workload (cost(sort(Ri ))+cost(Ri ⊲⊳S)) we obtain the global tributions. We show that the presented approach adds only partition bounds. On the left hand side of Figure 9 we see very little overhead to the overall join processing. that higher precision of radix-histogramming comes at no For better illustration, we split the range partition phase 2 additional cost. On the right hand side we see the inferior into the following sub-phases: The histogram on S is deter- performance of comparison-based partitioning. mined in phase 2.1 using a cumulative distribution function (CDF). The histogram on R is determined in phase 2.2 us- 14000 histogram ing probing as described above but increasing the number prefix sum of leading bits B used for fine-grained histogram bound- 12000 partitioning aries. In phase 2.3 we combine the information about the execution time [ms] 10000 key value distributions in R and S to find global partition bounds, called splitters, balancing the costs for sorting R 8000 chunks and joining. This way, we ensure that each worker thread is assigned a balanced workload to make sure that 6000 they all finish at the same time which is very important for 4000 subsequent query operators. 2000 4.1 Global S Data Distribution (Phase 2.1) We gain insight in the global S data distribution in two 0 32 64 128 256 512 1024 2048 32 steps: First, each worker thread Wi computes an equi-height radix explicit bounds histogram for its local input run Si . Building the equi-height granularity of histograms histograms comes at almost no costs as the data is already sorted. Then, the local histograms are merged to provide Figure 9: Fine-grained histograms at little overhead a global distribution view. The procedure is exemplified in Figure 8 for four runs S1 to S4 with skewed data, i.e., small In Figure 10 the proceeding is exemplified for a scenario join key values occur much more often than large join key with two workers clustering two initial chunks and redis- values. The local equi-height histogram bounds bij for each tributing them to two target partitions. They first build a worker Wi computed during the first phase are marked as local histogram of size 4 (B = 2) each, dividing the skewed red dotted lines within the input runs. In the example, each input data with key domain [0, 32) into four partitions: < 8, worker collects four local bounds, i.e., the local histograms [8, 16), [16, 24), ≥ 24. The histograms reveal that the chunks 1070

8. Phase 2.1 Phase 2.2 Phase 2.3 5 W1 9=01001 5 9 h1 7 compute splitters: W1 sort run R1 7=00111 chunk C1 7 ++ 4 <8 analyze ps1 3 proceed 56 3 1 [8,16) histograms to 0 1 with 21 2 [16,24) determine skew sp 0 1 2 28 17=10001 join 1 0 •24 of R and 0 prefix W2 4 ++ size the 17 histogram 1 sum 6 8 32 partitions CDF 2 W2 2=00010 according to 1 9 of S 13 h2 cost formula: 1 W1 21 chunk C2 4=00100 sort run R2 distribution 4 3 <8 cost(sort Ri) + splitter ps2 ++ 17 proceed 31 cost(Ri S) 4 5 |S|=56 2 [8,16) 13 with 8 20=10100 is balanced 3 1 [16,24) 31 join 20 ++ for all i [23] prefix W2 1 •24 8 6 sum histogram 20 R Figure 10: Load balanced partitioning of the private input R: join keys in the range [0, 32) are skewed (mostly small) contain many more small key values than large key values, in h2 and the CDF of S, the first cluster < 8 becomes the first particular, there are a total of seven values in the first par- partition and the other three clusters ≥ 8 form the second tition, three values in the second, three values in the third, partition. and one value in the fourth partition. We then partition the private input chunks using the global partition bounds. Thereby, we avoid synchronization by let- 4.3 Partitioning the Private Input R (Phase ting each worker write sequentially to precomputed parti- 2.3) tions. For this purpose, the local histograms are combined We use the global CDF for S determined in phase 2.1 to a set of prefix sums where each prefix sum represents and the global R distribution histogram from phase 2.2 to the workers’ partitions within the target runs. In phase 2.3, heuristically determine global partition bounds using a com- each worker scatters its input chunk to the partitions using plexity approximation that takes into account both the sort the prefix sums via the indirection of the splitter vector sp, costs of the R chunks and the join costs per worker Wi : i.e., worker Wi scatters its next tuple t as follows: split-relevant-costi = memcpy(psi [sp[t.key ≫ (64 − B)]]++, t, t.size) |Ri | · log(|Ri |) sort chunk Ri + T · |Ri | process run Ri psi contains pointers, not indexes because each worker scat- + CDF(Ri .high) − CDF(Ri .low) process relev. S data ters to different arrays. According to the global R partition bounds b1 = 8 and b2 = 32, there are four values of chunk where Ri .low and Ri .high denote the radix boundaries for C1 falling into the first partition and three (1+2+0) falling which we probe in the CDF. Note that because of the sorting into the second. From chunk C2 , three values belong to the S can be partitioned at any position. The boundaries are first and four (2+1+1) to the second partition. The local determined at the radix granularity of R’s histograms. histograms (which are computed per chunk) are combined As shown in Figure 8 and on the left of Figure 10 using to global prefix sums. The values in ps1 , for instance, denote blue dashed lines, the tentative R histogram bounds are used that worker W1 should scatter its data falling into the first to probe into the CDF to determine the anticipated S costs partition to run R1 beginning at position 0, whereas worker for the currently considered R partition [low, high). If the W2 should write its data for the first partition to run R1 be- key values in R and S are uniformly distributed or skewed ginning at position 4. Thereby, psi is incremented for each in a correlated way, the global R partition bounds will be tuple scattered. Please note that – depending on the key similar to the global S partition bounds and thus all Ri will distribution in R – the resulting runs might not be of equal be approximately equally sized. If the key value distribution size. It is more important that the cost is balanced rather is uncorrelated, they may be very different so that we need than the size (cf. Section 5.6). Unlike radix join, MPSM to weight their effect on the overall performance to find the can partition the private input R completely independently final global partition bounds. of S. The public input S is partitioned implicitly via the We opt to partition R and S such that each worker is sorting and thus does not incur any partitioning overhead. assigned the same amount of work, i.e., we determine the partition bounds such that they minimize the biggest cost split-relevant-costi over all 1 ≤ i ≤ T . We refer to Ross and 5. EXPERIMENTAL EVALUATION Cieslewicz [23] who present elaborate techniques for finding We implemented the MPSM join variants in C++, and optimal partition bounds for two table partitioning prob- the join query plans are compiled as employed in our HyPer lems. query processor [21]. All experiments are such that the data In the example in Figure 10, to simplify matters we as- is completely in main memory. For the disk-enabled Vector- sume the key value distribution of S to be correlated to that wise this is achieved by executing the query several times of R. Therefore, when probing into the CDF using 8 and 32 and reporting only the execution times of runs after the data as [low,high) values for R2 , those bounds divide S in equally was fully resident in RAM. In order to cover the most impor- sized partitions. Thus, according to the histograms h1 and tant scenarios, we report benchmark results using datasets 1071

9. 1080205 Core 0 Core 1 phase 1 HyperThread 12 1.2e+006 phase 2 (0,32) (4,36) phase 3 Core 2 Core 3 HyperThread 44 phase 4 (8,40) (12,44) 1e+006 812937 execution time [ms] L1I 32KB L1D 32KB Core 4 Core 5 build 675132 (16,48) (20,52) 800000 probe 621983 L2 256KB 581196 Core 6 Core 7 (24,56) (28,60) 600000 24MB L3 355280 400000 223369 16x16 GB 16x16 GB 16x16 GB 169267 123498 CPU 0 CPU 1 97027 59202 200000 33482 QPI 0 16x16 GB 1 4 8 16 1 4 8 16 1 4 8 16 CPU 2 CPU 3 MPSM VW Wisconsin algorithm / multiplicity Figure 12: MPSM, Vectorwise (VW), and Wisconsin hash Figure 11: Intel 32 core, 1 TB server (HyPer1) join on uniform data representing join input relations of different sizes, different multiplicities, and different data distributions. We consider derline tuples – a size which is covered by our experiments. the common case that two relations R and S are scanned, It is interesting to note that the transactional sales data of a selection is applied, and then the results are joined. So, this largest merchandiser, if properly normalized and possi- no referential integrity (foreign keys) or indexes could be bly compressed, fits into the RAM of our one TB machine exploited. which makes operational BI on main memory resident data Due to space limitations we will concentrate our experi- a reality – if the parallelization power of these machines can mental evaluation on the in-memory range-partitioned vari- be effectively exploited. ant P-MPSM and leave the analysis of the disk variant for Each tuple consists of a 64-bit key within the domain future work. [0, 232 ) and a 64-bit payload: 5.1 Platform and Benchmark Scenarios {[joinkey: 64-bit, payload : 64-bit]} We conduct the experiments on a Linux server (kernel We execute an equi-join on the tables: 3.0.0) with one TB main memory and four Intel(R) Xeon(R) X7560 CPUs clocked at 2.27GHz with 8 physical cores (16 SELECT max(R.payload + S.payload) hardware contexts) each, resulting in a total of 32 cores FROM R, S (and due to hyperthreading 64 hardware contexts) as in WHERE R.joinkey = S.joinkey Figure 11. This machine has a list price of approximately e40000 which makes it a good candidate for the real time This query is designed to ensure that the payload data is business intelligence scenario on transactional data for which fed through the join while only one output tuple is gener- our HyPer main memory DBMS is intended (therefore the ated in order to concentrate on join processing cost only. machine is called HyPer1 in our lab). Further, we made sure that early aggregation was not used As “contenders” we chose the most recent research sys- by any system. We chose the data format both for scal- tem, published by the Wisconsin database group [1] and ing reasons (payload may represent a record ID or a data the “cutting-edge” Vectorwise query engine which holds the pointer) as well as for ease of comparison reasons to the world record in single-server TPC-H power test. Accord- experiments presented in [1]. Our datasets of cardinality ing to our tests it currently has the best-performing paral- 1600M × (1+multiplicity) have sizes ranging from 50 GB lel join processing engine which is based on the pioneering to 400 GB which is representative for large main memory MonetDB work on cache-friendly radix joins [19]. This is operational BI workloads. The multiplicities between the also testified by Vectorwise’s record TPC-H powertest per- relations R and S further cover a wide range, including not formance on “small” main memory fitting databases up to only the common cases (4, as specified for instance in TPC- one TB on a single machine. (Actually, the TPC-H record H and 8 to approximate the TPC-C specification) but also numbers were obtained on a similar machine as our HyPer1.) extreme cases (1 and 16). We further experimented with For the SIGMOD2011 Wisconsin hash join benchmarks we skewed datasets. use the original code [1]. The Vectorwise benchmarks were conducted on Vectorwise Enterprise Edition 2.0. 5.2 Comparison of MPSM, Vectorwise, and We chose the datasets to be representative for a few re- Wisconsin Join on Uniform Data alistic data warehouse scenarios. Each dataset consists of We compare MPSM, Vectorwise, and Wisconsin join on two relations R and S. The cardinality of R is 1600M , the uniform data for different multiplicities (in the extreme case cardinality of S is scaled to be 1 · |R|, 4 · |R|, 8 · |R|, and S is 16 times as large as R). The results are shown in Fig- 16 · |R|. These database sizes are one order of magnitude ure 12. MPSM outperforms Vectorwise by a factor of four. larger than in prior related studies [17, 1, 2] to account for Wisconsin is not adapted to efficiently work for NUMA ar- recent hardware improvements in RAM capacity and real- chitectures as it builds and probes a global hash table across world requirements in operational business intelligence. For NUMA partitions, which results in poor performance for example, Amazon has a yearly revenue of $40 billion, for such large data volumes and numbers of cores. Therefore, which an estimated item price of $10 results in 4 billion or- we don’t consider it in further experiments. 1072

10. phase 1 455114 1.4e+006 500000 phase 1 2346427 phase 2 phase 2 phase 3 1.2e+006 phase 3 phase 4 join execution time [ms] phase 4 400000 execution time [ms] 1e+006 773809 697966 300000 221183 800000 169267 433362 600000 200000 396322 110822 359012 97027 233694 222969 59202 400000 201971 100000 33482 32790 103580 67278 200000 59202 0 0 1 4 8 16 1 4 8 16 2 4 8 16 32 64 2 4 8 16 32 64 R private S private MPSM VW private relation / multiplicity algorithm / parallelism Figure 14: Effect of role reversal on join execution time Figure 13: Scalability of MPSM and Vectorwise 100000 phase 1 phase 2 phase 3 5.3 Scalability in Number of Cores 80000 phase 4 execution time [ms] We compare the scalability with respect to the number of cores for MPSM and Vectorwise and report the results in 60000 Figure 13. MPSM scales almost linearly in the number of parallel executing worker threads. As depicted in Figure 11, 40000 our server has 32 physical cores and a total of 64 hardware contexts. When exceeding the number 32 of physical cores and using hyperthreading (parallelism level 64), the perfor- 20000 mance of MPSM remains stable but does not improve as all cores are already fully utilized at parallelism 32. From 0 these results, we are confident, that MPSM will scale well n ons tion rtitio partiti l join parti in pa T join c a ote jo on future hardware with even hundreds of cores. 1 lo 1 r e m location skew in S 5.4 Role Reversal Figure 15: Location skew (32 workers, multiplicity 4) We mentioned in Section 3.2 that for P-MPSM it is advis- able to consider role reversal for performance improvements. reduces the complexity from In Figure 14 we compare the execution time for two relations |S|/T · log(|S|/T ) + |R|/T + |R|/T · log(|R|/T ) + |R| + |S|/T R and S where we vary the size of S to be multiplicity times that of R. Thereby, phase 1 (sorting the public input) and to phase 3 (sorting the private input) are interchanged and |S|/T · log(|S|/T ) + |R|/T + |R|/T · log(|R|/T ) + |R|/T + |S|/T have the same execution time when summed up. However the effect of role reversal is clearly visible for the range par- as the private Ri is only scanned once to produce all join tition phase 2 and the join phase 4. For the multiplicity results. If there is less pronounced location skew in S, the al- 1, role reversal obviously has no effect on the join execu- gorithm performance lies between those two extremes shown tion time (as both inputs have the same size). However, the in Figure 15. Note that in all other experiments location larger S grows, the more considerable is the effect that di- skew was not present/exploited. rectly follows from |R| < |S| and the complexity estimate in Section 3.2 (ignoring the equal sort costs) as 5.6 Skewed Data with Negative Correlation |R|/T + |R| + |S|/T < |S|/T + |S| + |R|/T In this sort of experiments we analyze the quality of the splitter computation (cf. Figure 10) to balance the load evenly across all workers. For this purpose we generated a 5.5 Location Skew dataset with the worst possible skew for our join algorithm: We introduced location skew by arranging S in small to negatively correlated skew in R and S. (Positively corre- large join key order – no total order, so sorting the clus- lated skew does not affect MPSM either due to the dynamic ters was still necessary. Location skew on R has no effect splitter computation.) Our data set again contained 1600M at all as R is redistributed anyway. Extreme location skew tuples in R with an 80:20 distribution of the join keys: 80% of S means that all join partners of Ri are found in only of the join keys were generated at the 20% high end of the one Sj . This results in each worker Wi effectively producing domain. The S data of cardinality 4 · 1600M was generated join results only with one local Si , respectively one remote with opposite skew: 80% of the join keys at the low 20% Sj where i = j. This is the extreme case as only/no local end of the domain. Let us refer to Figure 16a to intuitively S data contributes to the join result and only one remote explain the necessity of balancing the load according to the memory area has to be accessed. Of course, all runs are data distribution of R and S. On the left-hand side we show still accessed using interpolation search, however, no rele- the effects of partitioning R into equal-cardinality partitions vant data is found in (T − 1) of the S runs. This effectively thereby having wider ranges on the left and narrower ranges 1073

11.on the right. Because of the negative correlation the corre- sponding S partitions are very unbalanced – so the combined cardinality of the two partitions |Ri | + |SRi | is much higher # tuples SR1 SR2 SR3 # tuples SR3 SR4 SR4 SR2 at the low end than at the high end. SRi denotes the rele- SR1 vant join range of S for the range of Ri . Note that SRi is R4 R1 R2 R3 R4 R1 R2 R3 composed of sub-partitions across all S1 , · · · , ST but its size join key range join key range can effectively be estimated from the CDF. For 32 workers operating on this equi-height R partitioning we obtain the (a) Equi-height R (left) vs. equi-cost R-and-S split- ter partitioning (right) response times shown in Figure 16b. We see that the “blue” sort costs are balanced but the “green” join processing takes 120000 phase 1 much longer for the workers on the left that process the low phase 2 phase 3 join keys. The correct splitter-based partitioning balances 100000 phase 4 the load across all servers – as shown in Figure 16c. The barrier synchronization point execution time [ms] figure is idealized in balancing the cardinality of the two 80000 corresponding partitions; in reality the sort+join costs are balanced. This is achieved by considering the cardinality of 60000 each Ri in combination with its corresponding SRi partition which is obtained from the CDF. This balanced R-and-S 40000 partitioning is visualized in Figure 16a on the right hand side. For this experiment we computed the R histograms at a granularity of 1024 (B=10) to give the splitter computa- 20000 tion sufficient opportunity to find best possible splitters. 0 5.7 Experiments’ Summary workers We showed that MPSM scales to very large data volumes (b) Equi-height R partitioning (multiplicity 4) and scales almost linearly to increasing number of cores. This indicates MPSM’s future potential as the upcoming 120000 phase 1 servers will have even more RAM and more cores – probably phase 2 phase 3 a few hundred soon to come. 100000 phase 4 The superior performance characteristics of MPSM are barrier synchronization point execution time [ms] corroborated by the observation that, in our experiments 80000 we neither 1) exploited any possibly existing sort order nor 60000 2) exploited the quasi-sorted’ness of the result In complete QEPs both aspects would favor the MPSM join 40000 even more in comparison to hash-based variants. 20000 6. RELATED WORK Parallel join processing originates from the early work on 0 database machines, e.g., Gamma [9], where hash-based par- workers titioning was used to distribute the join argument to mul- (c) Equi-cost R-and-S splitter partitioning (multiplicity 4) tiple machines in a compute cluster. In this context some heuristics for skew handling were developed [8]. Teubner Figure 16: Balancing splitters et al. [11, 24] present parallel joins for modern distributed databases. In multi-core parallel processing the distribu- cores, and smaller memory bandwidth per core sort-merge tion of the data is much more efficient as we can exploit the join is likely to outperform hash join on upcoming chip mul- shared memory, albeit regarding the consequences of the tiprocessors. Blanas et al. [1, 2] presented even better perfor- NUMA architecture [22]. Our MPSM join is, to the best mance results for their parallel hash join variants. We com- of our knowledge, the first work that consequently takes pare the sort-based MPSM to their best-performing variant NUMA into consideration, which is decisive for large scale which we called Wisconsin hash join here and thereby pick in-core databases. Most previous approaches to in-core par- the competition between sort-merge join and hash join up allel join processing were based on the radix join prioneered once again [12]. As a second contender we chose Vector- by the MonetDB group [19, 3]. This join method achieves wise [15] that builds on the pioneering radix join work of cache locality by continuously partitioning into ever smaller MonetDB [4] in addition to vector-based processing of X100. chunks that ultimately fit into the cache. Ailamaki et al. [5] He et al. [14] develop parallel nested-loop, sort-merge, and improve cache locality during the probing phase of the hash hash joins on GPUs. The algorithms take advantage of mas- join using software controlled prefetching. Our sort-based sive thread parallelism, fast inter-processor communication MPSM algorithm has high cache locality and hardware pre- through local memory, and histograms-based radix parti- fetcher affinity by its very own merge join behavior that tioning. We adapted the histogram approach for synchroni- sequentially scans a pair of runs. zation-free partitioning of MPSM’s private input. For sort- An Intel/Oracle team [17] adapted hash join to multi-core ing in MPSM we developed our own Radix/IntroSort. In the CPUs. They also investigated sort-merge join and hypothe- future however, wider SIMD registers will allow to explore sized that due to architectural trends of wider SIMD, more bitonic SIMD sorting [6]. 1074

12. MPSM does not produce completely sorted output. How- [5] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. ever, each worker’s partition is subdivided into sorted runs. Mowry. Improving Hash Join Performance through This interesting physical property might be exploited in fur- Prefetching. ACM TODS, 32(3):17, 2007. ther operations [7]. Note that the algorithms we compare [6] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and MPSM to do not exhibit any interesting physical property in P. Dubey. Efficient Implementation of Sorting on their output and we did not exploit any possibly pre-existing Multi-Core SIMD CPU Architecture. PVLDB, sorting in our comparative performance experiments. 1(2):1313–1324, 2008. Our disk-based D-MPSM was partly inspired by G-join [13] [7] J. Claussen, A. Kemper, D. Kossmann, and which also operates on sorted runs instead of hash parti- C. Wiesner. Exploiting Early Sorting and Early tions [12]. However, G-join lacks the parallelism which is in Partitioning for Decision Support Query Processing. the focus of this paper. The VLDB Journal, 9:190–213, 2000. [8] D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, 7. CONCLUSIONS AND FUTURE WORK A. Bricker, H.-I. Hsiao, and R. Rasmussen. The Gamma Database Machine Project. IEEE TKDE, The two dominating hardware trends are increasing RAM 2(1):44–62, 1990. sizes and ever more (soon hundreds of) cores. Both facilitate [9] D. J. DeWitt, J. F. Naughton, D. A. Schneider, and the development of main-memory databases that are essen- S. Seshadri. Practical Skew Handling in Parallel Joins. tial to propel the operational/real-time business intelligence In VLDB, pages 27–40, 1992. applications. To minimize query response times (in our main [10] F. F¨arber, S. K. Cha, J. Primsch, C. Bornh¨ ovd, memory database system HyPer) we devised a massively S. Sigg, and W. Lehner. SAP HANA Database: Data Management for Modern Business Applications. ACM parallel algorithm for the most important query processing SIGMOD Record, 40(4):45–51, 2011. operator, the equi-join. MPSM merge joins in parallel sorted [11] P. W. Frey, R. Goncalves, M. Kersten, and runs, which themselves were sorted by parallel threads. The J. Teubner. Spinning Relations: High-speed Networks performance analysis revealed that MPSM can effectively for Distributed Join Processing. In DaMoN, pages join very large main memory data of billions of tuples as it 27–33, 2009. scales almost linearly with the number of cores. The scal- [12] G. Graefe. Sort-Merge-Join: An Idea Whose Time able performance of MPSM is due to carefully exploiting the Has(h) Passed? In ICDE, pages 406–417, 1994. NUMA characteristics of the modern high-capacity servers. [13] G. Graefe. New algorithms for join and grouping operations. Computer Science - We avoided fine-grained synchronization and random access Research&Development, 27(1):3–27, 2012. to remote NUMA memory partitions. The linear scalability [14] B. He, K. Yang, R. Fang, M. Lu, N. K. Govindaraju, in the number of cores promises MPSM to scale even beyond Q. Luo, and P. V. Sander. Relational Joins on our tested 32 core, 1TB server – which is currently the top Graphics Processors. In SIGMOD, pages 511–524, of the line main memory server but will soon be surpasseed 2008. by the next generations of servers with several TB capacity [15] D. Inkster, M. Zukowski, and P. Boncz. Integration of and hundreds of cores. VectorWise with Ingres. ACM SIGMOD Record, 40:45–53, 2011. In future work we will develop the algorithmic details of [16] A. Kemper and T. Neumann. HyPer: A Hybrid MPSM for other join variants, e.g., outer, semi, and non- OLTP&OLAP Main Memory Database System based equi joins. Also, we are working on exploiting the “rough” on Virtual Memory Snapshots. In ICDE, pages sort order that MPSM inherently generates due to its range- 195–206, 2011. partitioned run processing. This allows to optimize sub- [17] C. Kim, E. Sedlar, J. Chhugani, T. Kaldewey, A. D. sequent query plan operations analogously to traditional Nguyen, A. D. Blas, V. W. Lee, N. Satish, and merge joins. In this paper we concentrated on the response P. Dubey. Sort vs. Hash Revisited: Fast Join time optimal range partitioned in-core variant of MPSM. In Implementation on Modern Multi-Core CPUs. PVLDB, 2(2):1378–1389, 2009. a follow-up paper we will also analyze in detail the memory [18] D. E. Knuth. The Art of Computer Programming, Vol. constrained disk-based processing of D-MPSM that, due to III: Sorting and Searching. Addison-Wesley, page 80, space limitations, we could only sketch here. This variant is Ex. 13, 1973. particularly promising for large batch query processing tasks [19] S. Manegold, P. A. Boncz, and M. L. Kersten. that take place in parallel with transactions and real-time Optimizing Main-Memory Join on Modern Hardware. BI analytics. IEEE TKDE, 14(4):709–730, 2002. [20] D. R. Musser. Introspective Sorting and Selection Algorithms. Software Practice&Experience, 8. REFERENCES 27(8):983–993, 1997. [21] T. Neumann. Efficiently Compiling Efficient Query [1] S. Blanas, Y. Li, and J. M. Patel. Design and Plans for Modern Hardware. In VLDB, 2011. Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. In SIGMOD, pages 37–48, 2011. [22] D. E. Ott. Optimizing Software Applications for NUMA, Whitepaper (Intel), 2009. [2] S. Blanas and J. M. Patel. How Efficient is our Radix Join Implementation? optimizing-software-applications-for-numa/. ~sblanas/files/comparison.pdf, 2011. [23] K. A. Ross and J. Cieslewicz. Optimal Splitters for [3] P. A. Boncz, S. Manegold, and M. L. Kersten. Database Partitioning with Size Bounds. In ICDT, Database Architecture Optimized for the New pages 98–110, 2009. Bottleneck: Memory Access. In VLDB, pages 54–65, 1999. [24] J. Teubner and R. M¨ uller. How soccer players would do stream joins. In SIGMOD, pages 625–636, 2011. [4] P. A. Boncz, S. Manegold, and M. L. Kersten. Database Architecture Evolution: Mammals [25] VoltDB LLC. VoltDB Technical Overview, Flourished long before Dinosaurs became Extinct. Whitepaper, 2010. PVLDB, 2(2):1648–1653, 2009. VoltDBTechnicalOverviewWhitePaper.pdf. 1075