Scaling Up Concurrent Main-Memory Column-Store Scans

Main-memory column-stores are called to efficiently use modern non-uniform memory access (NUMA) architectures to service concurrent clients on big data. The efficient usage of NUMA architectures depends on the data placement and scheduling strategy of the column-store. Most column-stores choose a static strategy that involves partitioning all data across the NUMA architecture, and employing a stealingbased task scheduler.

1.Scaling Up Concurrent Main-Memory Column-Store Scans: Towards Adaptive NUMA-aware Data and Task Placement Iraklis Psaroudakis ‡ Tobias Scheuer‡ Norman May‡ Abdelkader Sellami‡ Anastasia Ailamaki ‡ EPFL, Lausanne, Switzerland SAP SE, Walldorf, Germany {first-name.last-name} {first-name.last-name} 8 NUMA-agnostic 250 Throughput (GiB/s) ABSTRACT 7 NUMA-aware 200 S4 (x104 q/min) 6 Throughput Main-memory column-stores are called to efficiently use mod- 5 Memory 150 S3 ern non-uniform memory access (NUMA) architectures to 4 3 5× 100 S2 service concurrent clients on big data. The efficient usage 2 50 S1 of NUMA architectures depends on the data placement and 1 0 0 scheduling strategy of the column-store. Most column-stores 1 4 16 64 256 1024 NUMA- agnostic NUMA- aware choose a static strategy that involves partitioning all data across the NUMA architecture, and employing a stealing- (a) # of concurrent clients (b) based task scheduler. In this paper, we implement different strategies for data placement and task scheduling for the Figure 1: (a) Impact of NUMA. (b) Memory case of concurrent scans. We compare these strategies with throughput of the sockets for the case of 1024 clients. an extensive sensitivity analysis. Our most significant find- ings include that unnecessary partitioning can hurt through- NUMA introduces new performance challenges, as com- put by up to 70%, and that stealing memory-intensive tasks munication costs vary across sockets [10, 27], and the band- can hurt throughput by up to 58%. Based on our analysis, width of the interconnect links is an additional bottleneck we envision a design that adapts the data placement and to be considered (see Section 2). The column-store needs to task scheduling strategy to the workload. become NUMA-aware by handling the placement of its data structures across sockets, and scheduling the execution of 1. INTRODUCTION queries onto the sockets. Figure 1a shows the performance Contemporary analytical workloads are characterized by mas- difference between a NUMA-agnostic and a NUMA-aware sive data and high concurrency, with hundreds of clients column-store as they evaluate an increasing number of an- [31]. The key comparative criterion for analytical relational alytical clients on a machine with four sockets (see Section database management systems (DBMS) is their efficiency 6 for more details). In this scenario, NUMA-awareness sig- in servicing numerous clients on big data. For this reason, nificantly improves throughput, by up to 5×. By avoiding many analytical DBMS, e.g., SAP HANA [14] or Oracle inter-socket communication, the memory bandwidth of the [18] (see Section 3 for additional examples), opt for a main- sockets can be fully utilized, as shown in Figure 1b. memory column-store instead of a disk-based row-store typ- In the literature, there has been a recent wave of related ically employed for OLTP workloads [32]. The column-wise work for NUMA-aware analytical DBMS. Their majority in-memory representation minimizes the amount of data to employs a static strategy for data placement and scheduling. read, as only the necessary columns are accessed. Data is For example, HyPer [23] and ERIS [17] partition all data typically compressed, e.g., using dictionary encoding, and across sockets and parallelize queries with a task scheduler. processed in parallel using SIMD and multiple CPU cores [14]. The task scheduler consists of a pool of worker threads. Each Main-memory column-stores need to efficiently exploit the worker processes local tasks or steals tasks from other work- increasing amount of DRAM and multi-core processors. Pro- ers. The trade-offs between different data placement and cessor vendors are scaling up hardware by interconnecting task scheduling strategies have not been yet fully analyzed. sockets of multi-core processors, each with its own memory controller and memory [5]. Memory is decentralized, form- Contributions. In this paper, we describe and implement ing a non-uniform memory access (NUMA) architecture. data placement and task scheduling strategies for concurrent scans in main-memory column-stores. Through a sensitivity analysis, based on a commercial column-store (SAP HANA), This work is licensed under the Creative Commons Attribution- we identify the trade-offs for each strategy under various NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this li- cense, visit Obtain per- workload parameters. Our main contributions are: mission prior to any use beyond those covered by the license. Contact copyright holder by emailing Articles from this volume • We present a novel partitioning scheme for dictionary- were invited to present their results at the 41st International Conference on encoded columns, supporting quick repartitioning, for Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, skewed scan-intensive workloads (see Section 4). Hawaii. Proceedings of the VLDB Endowment, Vol. 8, No. 12 • We show that unnecessary partitioning for highly con- Copyright 2015 VLDB Endowment 2150-8097/15/08. current memory-intensive workloads can hurt through- 1442

2. put by up to 70% in comparison to not partitioning (see Section 6.1). Hot data in skewed workloads should Table 1: Local and inter-socket idle latencies, and be partitioned until socket utilization is balanced. peak memory bandwidths of three different servers. Statistic 4×Ivybridge-EX 32×Ivybridge-EX 8×Westmere-EX • We show that stealing memory-intensive tasks can hurt Local latency 150 ns 112 ns 163 ns throughput by up to 58% (see Section 6.2). The task 1 hop latency 240 ns 193 ns 195 ns scheduler needs to support tasks that can prevent be- Max hops latency 240 ns 500 ns 245 ns Local B/W 65 GiB/s 47.5 GiB/s 19.3 GiB/s ing stolen by another socket (see Section 5). 1 hop B/W 8.8 GiB/s 11.8 GiB/s 10.3 GiB/s • Based on the implications of our sensitivity analysis, Max hops B/W 8.8 GiB/s 9.8 GiB/s 4.6 GiB/s Total local B/W 260 GiB/s 1530 GiB/s 96.2 GiB/s we envision a design that can adapt the task scheduling and data placement (by moving and partitioning hot data) strategy to the workload at hand (see Section 7). The first is the one of Figure 2, the second is a rack-scale SGI UV 300 system with 32 sockets of Intel Xeon E7-8890 2. BACKGROUND v2 2.80GHz (Ivybridge-EX) processors, and the third con- sists of 2 IBM x3950 X5 boxes with a total of 8 sockets of In this section, we give a quick overview of different NUMA Intel Xeon E7-8870 2.40GHz (Westmere-EX) processors. architectures (which we use in our experiments), and mem- While the 4-socket machine is fully interconnected, the ory allocation facilities in the operating system (OS). topologies of the 8-socket and 32-socket machines have mul- NUMA architectures. Figure 2 shows a NUMA server tiple hops. The max hop latency on the 4-socket and 8- with 4 sockets of 15-core Intel Xeon E7-4880 v2 2.50GHz socket machines is more than 30% slower than the local la- (Ivybridge-EX) processors. Each core has two hardware tency, and around 5× slower on the 32-socket machine. The threads, a 32KiB L1, and a 256 KiB L2 cache. The cores inter-socket bandwidth decreases by an order of magnitude in a socket share a 37.5MB L3 cache. Each socket shares 2 with multiple hops. Also, the total local bandwidth of the memory controllers (MC) [1], configured in “independent” 8-socket machine is not the aggregated local bandwidth of mode for the highest throughput [2]. Each MC supports 2 each socket, due to its broadcast-based snooping cache co- Intel SMI interfaces [2]. Each SMI supports 2 memory chan- herence protocol, that stresses the interconnects. Ivybridge- nels (buses), with up to 3 DDR3 DIMM attached on each EX processors use directory-based cache coherence. channel. Our configuration has one 16GiB DDR3 1600MHz The lack of knowledge and control about inter-socket rout- DIMM per channel. The sockets are interconnected to en- ing or the cache coherence, hinders the ability to handle per- able accessing remote memory of another socket. Each socket formance optimization similarly to a distributed network. has 3 Intel QPI (QuickPath Interconnect). Each QPI has NUMA-awareness is typically achieved in a simple way: op- a 16GiB/s bandwidth that supports data requests and the timizing for faster local accesses instead of remote accesses, cache coherence protocol [1]. The majority of NUMA ar- and avoiding unnecessary centralized bandwidth bottlenecks. chitectures today are cache coherent (ccNUMA). The inter- OS memory allocation facilities. The OS organizes connect topology, the interconnect protocol, and the cache physical memory into fixed-sized (typically 4KiB) pages [19]. coherence protocol are specific to each system and vendor. When an application requests memory, the OS allocates new NUMA introduces new considerations for software perfor- virtual memory pages. Application libraries are responsible mance, in comparison to UMA [10]: (a) accesses to remote for organizing smaller allocations. Typically, virtual mem- memory are slower than local memory, (b) the bandwidth of ory is not immediately backed by physical memory. On the a MC can be separately saturated, and (c) the bandwidth of first page fault, the OS actually allocates physical memory an interconnect can be separately saturated. To understand for that page. In a UMA system, performance is not affected how these factors vary, we show in Table 1 local and inter- by the physical location of virtual memory. In a NUMA socket latencies, and peak memory bandwidths for 3 ma- system, however, the performance of a NUMA-agnostic ap- chines (measured with Intel Memory Latency Checker v2). plication can degrade substantially (see Figure 1). A moderate solution is to use interleaving, distributing C1 C2 C3 C15 C1 C2 C3 C15 pages in a round-robin fashion across all sockets. This avoids L1 L1 L1 ... L1 L1 L1 L1 ... L1 worst-case performance scenarios with unnecessary central- L2 L2 L2 L2 L2 L2 L2 L2 ized bandwidth bottlenecks, and averages memory latencies. L3 L3 It involves, however, a lot of remote memory accesses. MC0 MC1 MC0 MC1 A NUMA-aware application should control and track the physical location of its virtual memory [19]. A simple strat- 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB Socket 1 Socket 2 egy is to use the default first-touch policy: on the first page fault, the OS allocates physical memory from the local Socket 3 C1 C2 C3 C15 C1 C2 C3 C15 Socket 4 socket (unless it is exhausted). For stronger control, addi- L1 L1 L1 ... L1 L1 L1 L1 ... L1 tional facilities are provided. In Linux, e.g., move pages can L2 L2 L2 L2 L2 L2 L2 L2 be used to query and move already touched virtual memory. L3 L3 We use these facilities in our data placement strategies. MC0 MC1 MC0 MC1 3. RELATED WORK 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB 16GB NUMA-aware DBMS need to address two major dimensions: (a) data placement, and (b) scheduling tasks across sockets. Figure 2: 4-socket server with Ivybridge-EX CPU. We begin by mentioning how related work handles these di- 1443

3.mensions, continue with NUMA-aware database operators, prefetching can hide the latency of remote accesses, con- and finish with black-box approaches for NUMA-awareness. structing a competitive sort-merge join. Hash-joins, how- ever, are shown to be superior [8, 20]. Yinan et al [25] opti- Data placement. Many DBMS that do not mention ad- mize data shuffling on a fully-interconnected NUMA topol- vanced NUMA capabilities, indirectly rely on the first-touch ogy. Most related work, however, optimize for low concur- policy, e.g., Vectorwise [36], IBM DB2 BLU [30], and the rency, using the whole machine. In this work, we handle column-store of Microsoft SQL Server [21]. numerous concurrent analytical operations such as scans. A few research prototypes explicitly control data place- ment. HyPer [23] follows a static strategy: it chunks all Black-box approaches. DINO [10], Carrefour [13], or the objects, and distributes the chunks uniformly over the sock- new automatic NUMA balancing of Linux, monitor perfor- ets. ERIS [17] employs range partitioning and assigns each mance metrics to predict an application’s behavior. They at- partition to a worker thread. ERIS dynamically adapts the tempt to improve performance by moving threads and data sizes of the partitions to battle workload skewness. This is to balance cache load, opt for local memory accesses, and similar to what we propose. Moreover, we show that unnec- avoid bandwidth bottlenecks. Results for DBMS, however, essary partitioning has a significant overhead on large-scale are typically sub-optimal. A more promising approach is NUMA topologies. Partitioning should be used only for hot presented by Giceva et al [15] for DBMS evaluating a pre- data, and the granularity should be adjusted judiciously. defined global query plan. In the end, the DBMS needs to ATraPos [27] also uses dynamic repartitioning for OLTP become NUMA-aware itself to optimize performance. workloads, to avoid transactions crossing partitions and av- oid inter-partition synchronization. ATraPos optimizes for 4. DATA PLACEMENT STRATEGIES the latency of transactions, while we focus on the memory We start this section by describing the basic data structures bandwidth and CPU utilization of all sockets. In contrast in a main-memory column-store. We continue by outlining to HyPer, ERIS, and ATraPos, we analyze data placement data placement strategies for them. Finally, we describe how strategies for dictionary-encoded columns as well. we keep metadata information about the data placements. Task scheduling. Task scheduling is an indirection layer 4.1 Main-memory column-stores between the OS and the execution engine. Operations are The data of an uncompressed column can be stored sequen- encapsulated in tasks, and a pool of worker threads is used tially in a vector in main-memory [14, 18, 21, 23]. Compres- to process them. In our previous work, we showed how task sion techniques are typically employed to reduce the amount scheduling can be used in a NUMA-agnostic main-memory of consumed memory, and potentially speed up processing. column-store to efficiently process mixed OLTP and OLAP The simplest and most common compression is dictionary workloads [28, 29]. We showed how stealing and a flexible encoding [24]. In Figure 3, we show the data structures that concurrency level can help to saturate CPU resources with- compose a column in a generic column-store (naming can be out oversubscribing them, and how a concurrency hint can different), along with an optional index. be used to adjust the task granularity of analytical partition- The indexvector (IV) is an integer vector of dictionary- able operations to avoid unnecessary scheduling overhead. encoded values, called value identifiers (vid). The po- In this work, we make our task scheduler NUMA-aware. sition (pos) relates to the row in the relation/table. The Since our previous work, task scheduling has been promi- dictionary stores the sorted real values of vid. Fixed-width nent in NUMA-related work. In HyPer, each worker pro- values are stored inline, while variable-length ones may re- cesses local data chunks through a whole pipeline of opera- quire, e.g., using pointers or prefix-compressed storage for tors [23]. HyPer uses task stealing. We show that stealing strings. A value lookup in the dictionary can be done with should be avoided for memory-intensive tasks such as scans. binary search, but one can also implement predicates like ERIS uses a worker per core, which processes tasks tar- less-or-equal directly on the vid. The IV can be further geting a particular data partition [17]. ERIS uses the whole compressed using bit-compression, i.e., using the least num- system, and it is not clear how this design can handle full ber of bits (called bitcase) to store the vid. Vectorization query execution plans, intra- and inter-operator parallelism. enables the efficient implementation of scans, including their Our NUMA-aware task scheduler handles all the workload’s predicates, using SIMD. In this paper, we use scans imple- generic tasks. ATraPos’s data-oriented execution model uses mented with SSE instructions on bit-compressed IV [33]. a worker per partition, similar to ERIS [27]. An optional index (IX) can speed up low selectivity In the realm of commercial DBMS, IBM DB2 BLU pro- lookups without scanning the IV. The simplest IX consists cesses data in chunks so that they fit in the CPU cache [30]. of two vectors. Each position of the first correlates to a vid, Each worker processes one chunk at a time and can steal chunks. Since NUMA-aware data placement is not specif- pos vid vid value vid ix pos ically mentioned, chunks may not be local to the worker 0 3 0 Anna 0 0 15 thread. The vanilla version of Vectorwise [36] uses static 1 3 1 Bree 1 1 3 Index- Dictionary Index parallelism during query execution. A recent thesis [16] de- 2 6 2 Carl 2 3 6 vector (IV) (IX) scribes how to break query plans into stages and parallelize 3 1 3 Emma 3 4 7 sorted 4 4 4 Evie 4 6 0 them using a task scheduler. Stealing from remote sockets 1 ... ... ... is allowed based on the priority of unscheduled tasks and 4 the contention of the sockets. In this paper, we show that 8 ... stealing should be prevented for memory-intensive tasks. NUMA-aware operators. There is related work on stan- dalone operators as well. E.g., Albutiu et al [6] show that Figure 3: Example of a dictionary-encoded column. 1444

4.and holds an index towards the second. The second vector holds the, possibly multiple, positions of a vid in the IV. Table 2: Workload properties best fitted for each data placement, and a few key characteristics. 4.2 Data placement of columns Data Concur- Selecti- Workload Latency Memory Readjust- Large-scale In this section, we propose and analyze the trade-offs of three placement rency vities distrib. distrib. consumed ment overhead data placement strategies, shown in Figure 4. In Table 2, RR High All Uniform Unfair Normal Quick Low we summarize which workload properties best fit each data Low (w/o Uniform placement and a few of their key characteristics. IVP All index) & Fair Normal Quick High & skewed medium Round-robin (RR). The simplest data placement is plac- Uniform PP All All Fair High Slow High & skewed ing a whole column on a single socket. This means that queries wishing to scan this column, or do index lookups, should run and parallelize within that socket, to keep mem- operation needs to be split into all partitions. As we show ory accesses local. A simple way to exploit this data place- in our experiments, unnecessary partitioning can have a sig- ment for multiple columns, is to place each column on a nificant overhead. It should be used only for hot columns different socket in a round-robin way. when the workload is skewed. As we show in our experiments, RR is not performant for low concurrency, because a query cannot utilize the whole Physical partitioning (PP). To overcome the limitations machine, or for skewed workloads, because placing more of IVP, we can opt for an explicit physical partitioning of than one hot column on a socket creates a hotspot on that the table. PP can use a hash function on a set of columns, socket. Additionally, our evaluation shows that for high a range partitioning on a column, or a simple round-robin concurrency, query latencies suffer a high variation, in com- scheme [4, 18]. All columns are split into the parts de- parison to the following partitioning strategies. fined by the PP. PP is useful for improving the performance through pruning, i.e., skipping a part if it is excluded by the Indexvector partitioning (IVP). To overcome the afore- query predicate, and for moving parts in a distributed envi- mentioned negative implications of RR, we present a novel ronment (see, e.g., SAP HANA [4] and Oracle [18]). In this data placement that partitions the IV across the sockets. paper, we use PP to place each part on a different socket. This can happen quickly and transparently, by using, e.g., Since we wish to evaluate NUMA aspects, we avoid exploit- move pages in Linux, to change the physical location of the ing the pruning capability in our sensitivity analysis. involved pages without affecting virtual memory addresses. The advantage of PP is that each part of a column can be A scan can be parallelized within the socket of each part, allocated on a single socket. A scan is split into each part. potentially using all sockets. The materialization phase for each part takes the qualifying The disadvantage of IVP is that there is no clear choice vid of the scan and uses the local dictionary to materialize how to place the dictionary or the IX. Unless the column the real values. In contrast to IVP, PP is thus performant has sorted values, the ordering of the vid in the IV does for high-selectivity queries and index lookups as well. not follow the same ordering as the vid of the dictionary The disadvantages of PP are two-fold. First, PP is heavy- and the IX. Thus, we interleave them across the sockets, in weight and time-consuming to perform or repartition. The order to average out the latency of memory accesses during DBMS needs to recreate the components of all parts of the materialization (converting qualifying vid to real values from columns. For this reason, IVP is preferable for workloads the dictionary – see Section 5.2) and during index lookups. that are intensive on scanning the IV, since IVP is faster As we show in the experiments, the disadvantage of IVP to perform, and can be applied to a subset of the table’s results in high selectivity scans and index lookups suffer- columns. The second disadvantage of PP is its potentially ing decreased performance in comparison to the other data increased memory consumption. Although it results in non- placements. Although a high selectivity scan can scan the intersecting IV across the parts of a column, the dictionaries parts of the IV locally, the dominating materialization phase and the IX of multiple parts may have common values. For involves potentially numerous remote accesses to the inter- large data types, e.g., strings, this can be expensive. The leaved memory of the dictionary. Similarly, index lookups only case where this does not occur is if the column has suffer from remote accesses to the interleaved index. values sorted according to the PP partitioning scheme. Furthermore, both IVP and the following physical parti- tioning suffer from overhead when a column is partitioned Other data placements. We note that the aforemen- across a high number of sockets. The reason is that each tioned data placements are not exhaustive. For example, one can interleave columns across a subset of sockets. Or, one can replicate some or all components of a column on a Round-robin (RR) IV partitioned (IVP) Physically partitioned (PP) few sockets, at the expense of memory. Replication is an S1 P1 P2 P3 P4 orthogonal issue. The three basic data placements we de- IX Dict IV IX Dict IV IX Dict IV scribe are a set of realistic choices. More importantly, our S2 experiments show how their basic differences affect the per- formance of different workloads, and motivate a design that S3 adapts the data placement to the workload at hand. S4 COLi COLi COLi 4.3 Tracking and moving memory We need a way to expose a column’s data placement. When Figure 4: Different data placements of a dictionary- scheduling scans, e.g., we need to be able to query the phys- encoded column, with an index, on 4 sockets. ical location of a column, the location of a component of a 1445

5. S1 S3 ... 0x1000 0x2000 0x3000 0x4000 0x5000 a range can contain a maximum of 232 pages (or 16TiB for ... 4KiB pages), and that a machine can have up to 256 sockets. S2 S4 0x6000 0x7000 0x8000 0x9000 0xa000 The size of a PSM is 360·r+8192 bits, where r is the number of stored ranges. Let us examine the size of the metadata Vector of ranges #1 #2 #3 for a column on a 32-socket machine. We intend to attach a First page address (64 bits) 0x2000 0x4000 0x8000 Number of pages (32 bits) 2 2 3 PSM to the IV, dictionary, and IX of a column, so that the Socket (8 bits) 1 2 2 scheduler can query the physical location of any component. Interleaving pattern (256 bits) 0000... 0000... 1100... If a column is placed wholly on a socket, then r = 1 for ... the IV and the dictionary, and r = 2 for the IX (contains 2 Summary: pages per socket (256 · 32 bits) 3 4 0 0 vectors). The metadata is 26016 bits, or 3KiB. If a column is placed with IVP across all sockets, then r = 32 for the Figure 5: Example of a PSM after adding the virtual IV, r = 1 for the interleaved dictionary, and r = 2 for the memory ranges we wish to track (bold lines). interleaved IX. The metadata is 37176 bits, or 5KiB. If a column is physically partitioned, with 32 parts, each part is column, and the location of a specific position in the vector wholly placed on a socket. The metadata is around 102KiB. of a component. To satisfy these requirements, we design a The size of the metadata is not large, compared to the typi- novel class, Page Socket Mapping (PSM), that summa- cal sizes of columns (at least several MiB). We note that one rizes the physical location of virtual address ranges. can decrease the space substantially by losing some accuracy In Figure 5, we show an exemplary PSM. The figure de- and the capability of querying specific virtual addresses: not picts a piece of virtual memory consisting of ten 4KiB pages. storing ranges, and keeping only the summary vector. Each box includes the base address of each page.1 The color signifies the socket where the page is physically allocated. 5. NUMA-AWARE TASK SCHEDULING Assume that we wish to keep metadata about the physical location of virtual address ranges [0x2000, 0x6000) and We begin this section by describing our NUMA-aware task [0x8000, 0xb000). This example can represent a tiny col- scheduler. We then continue with outlining the different umn, without an index, placed with IVP, where the first task scheduling strategies for concurrent scans, considering range holds the IV, partitioned across sockets S1 and S2, also the data placements of Section 4. and the second range holds the interleaved dictionary. 5.1 Task scheduler infrastructure The PSM maintains an internal vector of ranges. Each range consists of a virtual page address, the number of sub- For a summary of task scheduling and our previous work, sequent pages, and the socket where they are physically al- see Section 3. In this paper, we extend our task scheduler for located. If the range is interleaved, the interleaving pattern NUMA-aware workloads. Tasks need to be able to choose signifies the participating sockets, and then the socket num- the socket on which to run, and the task scheduler to reflect ber denotes the starting socket. The ranges are sorted by the topology of the machine. The design of our task sched- the virtual address of their base page. We choose a vector of uler is shown in Figure 6. Upon initialization, the sched- ranges to optimize for reading the PSM instead of changing uler divides each socket into one or more thread groups it. Looking up the physical location of a pointer includes a (TG). Small topologies are assigned one TG per socket, quick binary search on the ranges’ first pages, and, in case while larger topologies are assigned a couple TG per socket. the range is interleaved, following the interleaving pattern. Hyperthreads (HT) are grouped in the same TG. Figure 6 Furthermore, we maintain another vector, which keeps a depicts the TG for a socket of the 4-socket machine we de- summary of the number of pages on each socket. scribed in Section 1. The main purpose of multiple TG per When we add virtual address ranges to the PSM, it maps socket is to decrease potential synchronization contention them to page boundaries, checks which pages are not already for the contained task priority queues and worker threads. included, and calls move pages on Linux, not to move them Inside a thread group. Each TG contains two priority but to find out their physical location. The algorithm goes queues for tasks. The first has tasks that can be stolen by through their physical locations, collapsing contiguous pages other sockets. The second has tasks that have a hard affin- on the same socket into a new range for the internal vector. ity and can only be stolen by worker threads of TG of the It detects an interleaving pattern when every other page is same socket. As our experimental evaluation shows, sup- allocated on a different socket, following the same recur- porting bound tasks is essential for memory-intensive work- ring pattern. When the pattern breaks, the new interleaved loads. The priority queues are protected by a lock. Lock-free range is inserted in the internal vector, and the algorithm continues. The summary vector is updated accordingly. PSM objects support additional useful functionality. We C1 C2 C7 C8 C15 group 1 group 2 Socket 1 Thread Thread can remove memory ranges, add or subtract another PSM, HT1 HT1 HT1 HT1 ... HT1 ... ask for a subset of the metadata in a new PSM, and get HT2 HT2 HT2 HT2 HT2 the socket where the majority of the pages are. More im- portantly, we can also move a range to another socket or Priority Hard Priority Working Inactive Free Parked interleave it. The PSM uses move pages to move the range, Thread Queue Queue threads threads threads threads group and update the internal information appropriately. t t The space used by a PSM depends on the number of stored t t ranges. For the indicative sizes of Figure 5, we assume that t 1 For simplicity, we display the last 4 hexadecimal characters of the 64-bit addresses instead of all 16 characters. Figure 6: NUMA-aware task scheduler design. 1446

6.implementations for approximate priority queues [7] can be By default, for NUMA-agnostic workloads, we do not bind employed for cases of numerous short-lived tasks where syn- worker threads to the H/W contexts of their TG. This is chronization becomes a bottleneck. This is not the case for because the OS scheduler is sometimes better in balanc- our experiments, mainly due to the concurrency hint (of our ing threads to improve the performance. We bind a worker previous work [28]) that dynamically adjusts the task gran- thread to the H/W contexts of their TG only when it is ularity of analytical operations, including scans, depending about to handle a task with an affinity. And while the next on the workload concurrency. task also has an affinity, the thread continues to be bound, Each TG maintains a number of worker threads, which are otherwise it unbinds itself before running a task without distinguished to working, inactive, free, and parked. Work- an affinity. This gives us the flexibility to compare the ing threads are those currently handling a task. Inactive OS scheduler, by not assigning affinities to tasks, against threads are those currently sleeping in the kernel due to NUMA-aware scheduling by assigning affinities to tasks. a synchronization primitive, while handling a task. Free threads wait on a semaphore to be woken up by the sched- 5.2 NUMA-aware scheduling of scans uler to get a newly inserted task. Parked threads, similarly To achieve NUMA-awareness, tasks need to consult the PSM to free threads, wait on a semaphore to be woken up in cases of the data they intend to process to define their affinity. In when there are no free threads. The difference between free Figure 7, we show the execution phases of a query selecting and parked threads is that free threads wake up periodically data from a single column, assuming the column is placed to check for tasks, and the number of free threads cannot using IVP: (a) finding the qualifying matches, and (b) ma- pass the number of H/W contexts of the TG. terializing the output. Next, we describe these phases. The task scheduler strives to keep the number of working threads equal to the number of H/W contexts of each TG. Finding the qualifying matches. Depending on the In cases where tasks may block and become inactive, the predicate’s estimated selectivity, the optimizer may either scheduler wakes up free or parked threads to correct the scan the IV, or perform a few lookups in the index (if avail- number of working threads [28]. able). For both cases, the query first needs to encode its predicate with vid. For a simple range predicate, the bound- Main loop of a worker thread. The worker firstly checks aries are replaced with the corresponding vid. If the pred- that it is allowed to run, by checking that the number of icate is more complex, a list of qualifying vid is built and working threads is not larger than the number of the H/W used during the scan or the index lookups. contexts of the TG. If it is, it goes to free mode, if the In the case of a scan, it is parallelized by splitting the IV number of free threads is less than the H/W contexts of the into a number of ranges and issuing a task per range. The TG, else it goes to park. If it is allowed to run, it peeks in task granularity is defined by the concurrency hint [28], to the two priority queues to get the element with the highest avoid too many tasks under cases of high concurrency, but priority. If there are no tasks in the TG, it attempts to steal also opt for maximum parallelism under low concurrency. In a task from the priority queues of the other TG of the same the case of IVP, as in the example, we round up the number socket. If there are no tasks, it goes around the TG of all of tasks to a multiple of the partitions, so that tasks have a sockets, stealing tasks (not from the hard priority queues). range wholly in one partition. We define a task’s affinity by If the worker thread finally has a task, it executes it, and consulting the PSM of the IV for the task’s range. loops again. If no task is finally found, it goes to free mode. In the case of index lookups, the operation is not paral- lelized. For each qualifying vid, the IX is looked up to find Watchdog. We use a watchdog thread to speed up task the qualifying positions of the IV. The affinity of the single processing [28]. It wakes up periodically to gather statistics task is defined as the location of the IX. If it is interleaved, and goes around all TG to check their number of working as in the example with IVP, we do not define an affinity. threads. If a TG is not saturated, but has tasks, it signals The qualifying matches can be stored in two potential for- more free or unparked worker threads, or even creates new mats. For high selectivities, a bitvector format is preferred worker threads. If a TG is saturated, but has more tasks, it where each bit signifies if the relevant position is selected. also monitors that, in order to wake up free threads in other For low selectivities, a vector of qualifying IV positions is TG that can potentially steal these tasks. built. Both formats consume little space, and we do not track their location on memory with a PSM. Task priorities. In this work, we do not set the user- defined priority of tasks [35]. Internally, however, the sched- Output materialization. Since we know the number of uler augments the user-defined priority of a task with more the qualifying matches, we allocate the whole output vector. information. One crucial part is the time the related SQL statement was issued. The older the timestamp, the higher the priority of related tasks. For our experimental evalua- (a) (b) S1 tion, this means that the tasks generated during the execu- ① lookups scan tion of a query are handled more or less at the same time. vid S2 IV -or- Task affinities. A task can have an affinity for a socket, in ③ Output S3 ① ② ② which case it is inserted in the priority queue of one of the IX Dict S4 TG of the socket. Additionally, the task can specify a flag for hard affinity so that it is inserted into the hard priority queue. In case of no affinity, the task is inserted into the Figure 7: Task scheduling (a) for a scan or index priority queue of the TG where the caller thread is running lookups to find qualifying matches, and (b) for the (for potentially better cache affinity). output materialization, for an IVP-placed column. 1447

7.Similar to the scan, we parallelize materialization by split- of random integers generated with a uniform distribution. ting the output into ranges and issuing a task per range. We use bitcases 17 to 26 in a round-robin fashion for the A task, for each qualifying position of its range, finds the 160 columns, to avoid scans with the same speed [33]. relevant vid through the IV. Then it finds the real value by We use a Java application on a different machine to gener- consulting the dictionary, and finally writes it to the output. ate the workload. The clients connect and build a prepared Because different partitions of the IV may produce more statement for each column: SELECT COLx FROM TBL WHERE or less qualifying matches, the output may have unbalanced COLx >= ? AND COLx <= ?. We note that we experiment partitions. To define task affinities, we need a short prepro- with queries selecting a single column for simplicity. The cessing. Going through all qualifying matches to figure out implications of our evaluation are relevant for queries that the exact boundaries is costly. Thus, we divide the output have a predicate on multiple columns or project multiple vector length by the number of H/W contexts of the ma- columns also. In the former case, the first phase of Figure 7 chine, to make fixed-sized regions, and find the location of is repeated (in parallel) for each column, to find the quali- their boundaries by consulting the PSM of the IV. We coa- fying positions. In the latter case, the materialization phase lesce contiguous regions on the same socket, to make up the of Figure 7 is repeated (in parallel) for each column. final list of partitions (visualized in Figure 7). For each par- After the statements are prepared, each client continu- tition, we issue a correspondingly weighted number of tasks ously picks a prepared statement to execute. There are no with the affinity of that partition’s socket, taking care that thinking times. The client does not fetch the results, other- the number of tasks does not surpass the concurrency hint, wise the network transfer would dominate. We focus on the and that each partition has at least one task. time required to create the results (that could potentially Figure 7 hints that we place the partitions to their cor- be used as intermediate results for higher-level operators). responding socket. Unfortunately, allocating a new output We measure the total throughput (TP) over 2’ and report vector in order to specify its location turns out to have a the average throughput per minute. 2’ are sufficient, since bad performance. Especially for high-selectivity concurrent all TP results we present are at least one order of magni- scans, it involves numerous page faults with heavy synchro- tude more than the number of clients. The additional per- nization in the OS. This is one reason why SAP HANA formance metrics presented are averaged over the whole 2’ implements its own allocators [4] to re-use virtual memory. period. Every data point and metric presented is an average Furthermore, we note that using move pages to move the of 3 iterations with a standard deviation of less than 10%. partitions also runs into a similar problem in the OS. Thus, The data placement strategies we compare are: for concurrent workloads, re-using virtual memory for the • Round-robin (RR). Each column is allocated on one output vectors, even if writes are remote accesses, is better socket, in a round-robin fashion. than explicitly placing the pages of the output vectors. • Indexvector partitioning (IVP). The IV of each col- Remaining data placements. Figure 7 describes how a umn is partitioned equally across the sockets. scan is scheduled when the selected column is placed with IVP. In the case of RR, where a column is allocated on one • Physical partitioning (PP). The table is physically socket, the same scheduling is involved, but without figuring partitioned according to ranges of the ID column. The out the boundaries of the partitions of the output vector. number of equally-sized ranges is the number of the In the case of PP, the phase of finding qualifying matches sockets. Each part is placed on a different socket. occurs once per part, concurrently. There is a single output The task scheduling strategies we compare are: vector, with a length equal to the sum of the lengths of • OS. We do not define task affinities, and we do not the results of each part. The preprocessing phase of the bind worker threads, leaving the scheduling to the OS. materialization happens once, in a similar way as in the case of IVP, by considering that partitions of the IV are • Target. We define task affinities. Tasks may be stolen. now separate IV. The materialization phase occurs once per • Bound. We define task affinities, and set the hard part, concurrently, and each one knows the region of the affinity flag for tasks. Inter-socket stealing is prevented. single output vector where to write the real values. The workload parameters we vary are: 6. EXPERIMENTAL EVALUATION • Concurrency. The number of clients in the workload. In this section, we present a sensitivity analysis of concur- • Indexes. Whether indexes can be used or not. In the rent scans for different data placement and task scheduling majority of the experiments, we do not use indexes. strategies, under various workload parameters. We use a • Selectivity. The selectivity of the range predicates. prototype built on SAP HANA (SP9), a commercial main- memory column-store DBMS. We extend the execution en- • Column selection. The probability that a column gine of SAP HANA with our new NUMA-aware data place- may be selected. Can either be uniform or skewed. ments (see Section 4) and task scheduling (see Section 5). For all experiments, we warm up the DBMS first. We • Parallelism. We can disable intra-query parallelism, make sure that all clients are admitted, and we disable query to execute each query in one task. In the majority of caching. In several cases, we present additional performance the experiments, intra-query parallelism is enabled. metrics gathered from Linux, SAP HANA, and H/W coun- The main server we use is the 4-socket Ivybridge-EX of ters (using the Intel Performance Counter Monitor tool). Table 1. In certain cases, we present results from the other We generate a dataset with a large table, taking up 100GiB two servers, to show how some implications change due to of a flat CSV file. It consists of 100 million rows, an ID inte- different hardware. On all servers, the OS is a 64-bit SMP ger column as the primary key, and 160 additional columns Linux 3.0.101 (SUSE Enterprise Server 11 SP3). 1448

8.6.1 Uniformly distributed workload Due to the saturation of the QPI, we cannot fully exploit In this section, we evaluate a uniform workload, i.e., clients the memory bandwidth of all sockets simultaneously. Thus, pick a column to query, randomly with uniform distribution. Bound improves performance only by 2× compared to OS. Performance metrics for the case of 1024 clients 6.1.1 Impact of scheduling OS Target Bound 100 S8 30 10 Memory TP (GiB/s) QPI data traffic (TB) This experiment aims to show the largest performance differ- 30 Bound 80100 TP (x103 q/min) S7 25 QPI traffic (TB) 25 8 ence between NUMA-agnostic and NUMA-aware execution. 60 80 CPU load (%) 20 S6 20 We use RR to place the columns on the 4-socket server. 15 40 6 10 20 60 S5 15 Intra-query parallelism is enabled, and the selectivity of the 5 4 0 0 40 S4 10 queries is low (0.001%). Indexes are not used, thus scans 20 S3 2 16Bound OS 1Target 5 1 4 16 64 256 1024 are used, and the workload is memory-intensive. The TP # Clients 0 S2 0 0 4 256 1024 64 and relevant performance metrics are shown in Figure 8. OS 1 Target 1 Bound Performance metrics include the CPU load, the number # Clients of tasks, and the number of tasks stolen across sockets. For Figure 9: As Figure 8, on the 8-socket server. the case of 1024 clients, performance metrics include the last-level cache (LLC) load misses (local and remote), the Implications. H/W characteristics, such as the cache co- memory throughput of each socket, the instructions per cy- herence protocol, can affect the NUMA impact we observe. cle (IPC), the total traffic through the QPI, and the total data (without the cache coherence) traffic through the QPI. 6.1.3 Impact of parallelism and data placement We continue with the previous experiment, using Bound, but OS Target Bound with different data placements, on the 4-socket machine. We 60 100 16 16 TP (x103 q/min) Stolen tasks (x 105) 50 80 trigger intra-query parallelism in this experiment, to show Tasks (x 106) CPU load (%) 40 12 12 60 the effect of issuing a single task for each query versus mul- 30 8 8 20 40 tiple tasks that can potentially be distributed across the 10 20 4 4 0 0 0 0 partitions of a column. The results are shown in Figure 10. 1 4 16 64 256 1024 1 4 16 64 256 1024 1 4 16 64 256 1024 1 4 16 64 256 1024 # Clients # Clients # Clients # Clients RR IVP PP Remote Local 60 60 LLC load misses (x1010) LLC load misses (x1010) 50 50 TP (x103 q/min) TP (x103 q/min) Performance metrics for the case of 1024 concurrent clients 40 40 40 40 50 250 2 10 6 30 30 LLC load misses (x1010) Memory TP (GiB/s) 40 200 20 20 20 20 QPI data traffic (TB) 8 5 QPI traffic (TB) 30 150 10 10 4 0 0 20 100 6 0 0 1 4 16 64 256 1024 1 4 16 64 256 1024 IPC 10 50 1 3 RR IVP PP # Clients # Clients RR IVP PP 0 0 4 2 w/o parallelism w/ parallelism w/o parallelism w/ parallelism Bound OS Target Bound OS Target 2 1 Performance metrics for 1024 clients Remote Local 0 0 0 S4 S3 S2 S1 1OS Target1 Bound 1 Figure 10: The effect of intra-query parallelism on the RR, IVP, and PP data placement strategies. Figure 8: Evaluating the OS, Target, and Bound scheduling strategies, with RR-placed columns. Intra-query parallelism is useful for low concurrency, since it can use more CPU resources and achieve better through- There is a 5× TP improvement with the Target and Bound put. Also, parallelism is required when columns are parti- strategies, over the OS, mainly due to the improved mem- tioned. Otherwise, the single task has to access remotely the ory throughput. The LLC load misses, most of which are sockets of the remaining partitions. Parallelism, however, is prefetched, are almost 5× more, and mostly local compared not required for RR and high concurrency, since the single to the mostly remote misses of OS. The number of processed task is wholly local to the socket of a column. All parallel tasks is 5× higher, and IPC is also 5× higher due to faster versions of the three data placements reach the same TP for memory accesses. QPI data traffic is reduced analogously, high concurrency. IVP has slightly more remote accesses but cache coherence traffic is generated indirectly, even with than PP, since the dictionary is interleaved. local accesses, and cannot be avoided. Although the same throughput is reached with parallelism, Overall, Bound achieves better throughput than Target. there is a difference between the data placements. In Figure Although stealing improves CPU load, it hurts throughput 11, we show violin plots of the query latency distributions. for memory-intensive workloads due to the incurred remote All have the same average latency. RR, however, is unfair, accesses and stress on the QPI. We see this effect again later. with more variance. IVP and PP have most latencies closer Implications. NUMA-awareness can significantly improve Latency (ms) Latency (ms) Latency (ms) RR IVP PP the performance of memory-intensive workloads. Memory- 1500 intensive tasks should be bound to the socket of their data. 3000 1500 6.1.2 Impact of the cache coherence protocol 0 0 0 Figure 9 shows the results of the previous experiment on the 256 1024 256 1024 256 1024 8-socket Westmere-EX machine. Bound decreases the QPI Clients Clients Clients data traffic, but the total traffic is increased, due to the broadcast-based cache coherence protocol (see Section 2). Figure 11: Violin plots of the latency distributions. 1449 the average. This is because in RR, queries queue up and Implications. Unnecessarily increasing the number of par- execute on the socket level. With IVP and PP, each query titions has an overhead that depends on the scale of the parallelizes across all sockets, and because the tasks are pri- NUMA topology and the concurrency of the workload. oritized according to the query’s timestamp, they complete approximately in the order they were received. 6.1.5 Impact of selectivity In this experiment, we vary the selectivity from 0.001% up to Implications. Intra-query parallelism is required for par- 10%. We enable indexes, evaluate 1024 clients, and use RR titioned data. Partitioning has a fair latency distribution. and Bound, on the 4-socket machine. We note that Target 6.1.4 Impact of scale on data placement achieves similar results since the workload is uniform and the concurrency is high. The results are shown in Figure 14. Although partitioning can achieve the same performance as RR on the 4-socket machine, this is not the case for large- 1E+7 107 35 Remote 200 S4 LLC load misses (x1010) scale machines. We evaluate the previous experiment with 30 Memory TP (GiB/s) 1E+6 106 0.001% 25 Local 150 S3 different partitioning granularities on the 32-socket machine. 1E+5 105 20 S2 TP (q/min) 0.01% 15 100 We increase the number of partitions up to 32, when each 1E+4 104 10 S1 0.1% 50 column is partitioned across all sockets. We take care to 1E+3 103 5 1% 0 0 distribute the partitions in a round-robin manner around 1E+2 102 10% 1% 0.001% 0.01% 0.1% 10% 0.01% 0.1% 1% 10% 0.001% 1E+1 10 the sockets. For this reason, the case of one partition per 1E+01 column degenerates to RR. We show the results for IVP in 1 Figure 12, for 1024 concurrent clients, and for all scheduling strategies. We note that the implications are similar for PP. Figure 14: Evaluating different selectivities. 20 20 20 RR As expected, throughput drops as we increase the selec- TP (x104 q/min) TP (x104 q/min) TP (x104 q/min) 15 15 15 IVP2 tivity. The optimizer chooses to perform index lookups for 10 10 10 IVP4 selectivities 0.001%-0.1%, as implied by the low memory IVP8 throughput and the number of LLC misses. For larger selec- 5 5 5 IVP16 0 0 0 tivities, it chooses scans. For selectivity 1%, scans dominate Target IVP32 OS Bound the execution, as is shown by the high memory through- 1 1 1 put and the large number of LLC misses. The workload Figure 12: Combinations of scheduling strategies is more memory-intensive. For selectivity 10%, however, and IVP granularities, on the 32-socket machine. the materialization phase dominates the execution. Since the materialization consists of random accesses due to the Bound is the best, while OS is the worst and is not af- dictionary, it is more CPU-intensive, and we observe less fected by the placement. Target achieves significantly less memory throughput and a lower number of LLC misses. throughput than Bound, because on this machine there are much higher chances of stealing a memory-intensive task. Implications. For a dictionary-encoded column, with an Stealing stresses the remote memory controller and the in- index, the selectivity changes the critical path of execution. terconnects due to increased traffic. For RR, Target has It consists of CPU-intensive index lookups for low selectiv- around 58% worse throughput than Bound. We revisit this ities, memory-intensive scans for intermediate selectivities, effect later in Section 6.2.1. and CPU-intensive materializations for high selectivities. An important implication is that as the number of par- titions increases, the performance of Target and Bound de- 6.2 Skewed workload creases significantly. This is due to the overhead of par- In this section, we use a skewed workload. Clients have a allelizing queries across numerous sockets. This overhead 20% probability of choosing a random column from the first is unnecessary, since the workload is uniformly distributed 80 columns of the dataset, and a 80% probability of choosing and RR can already saturate the machine. As shown, RR one from the remaining 80 columns. and IVP2 are comparable. Partitioning across more sock- ets, however, incurs overhead. Partitioning across all sockets 6.2.1 Impact of stealing memory-intensive tasks decreases the throughput by around 70% compared to RR. We perform the same experiment of Section 6.1.1. We use Interestingly, this implication is not true for all cases of RR, and we intend to see the effect of scheduling strategies concurrency. As shown in Figure 13, for low concurrency, on performance. The results are shown in Figure 15. partitioning either matches or surpasses RR. For high con- currency, however, partitioning proves worse than RR. OS Target Bound Performance metrics for 1024 clients 40 100 35 200 LLC load misses (x1010) Memory TP (GiB/s) 20 20 30 TP (x103 q/min) TP (x104 q/min) Target Bound TP (x104 q/min) 30 80 25 150 15 15 20 CPU load (%) 60 100 RR 20 15 10 10 IVP8 40 10 50 5 10 5 5 IVP32 20 0 0 0 0 0 0 Bound OS Target OS Target Bound 1 4 1 4 16 64 16 64 256 1024 256 1024 1 4 256 1 4 256 1024 16 64 1024 16 64 # Clients # Clients # Clients # Clients Remote Local S4 S3 S2 S1 Figure 13: Scaling up the number of concurrent Figure 15: Evaluating the OS, Target, and Bound clients with different partitioning granularities. scheduling strategies, with RR-placed columns. 1450

10. Performance metrics for 1024 clients Bound still achieves the best throughput, even though RR IVP PP 20 120 LLC load misses (x1010) Memory TP (GiB/s) it underutilizes the machine. As implied by the memory 8 100 100 TP (x103 q/min) throughput, only two sockets contain the hot set of columns. 6 80 15 CPU load (%) 80 60 10 60 One would expect that Target achieves better through- 4 40 40 put, since it utilizes more CPU resources. It decreases the 2 5 20 20 throughput, however, by around 15%. This is because the 0 0 0 0 RR RR IVP IVP PP PP 1 4 16 64 1 4 16 64 256 256 1024 1024 two hot sockets are already saturated. Remote accesses to # Clients # Clients Remote Local S4 S3 S2 S1 these sockets prevent some local accesses from queuing in the memory controllers fast. Also, increased traffic stresses the interconnects. We observe a similar effect in Section Figure 17: As Figure 16, with a high selectivity. 6.1.4, where Target hurts throughput by around 58%. Implications. To battle the skewness of IV-intensive work- Implications. Allowing stealing for memory-intensive tasks loads, IVP is a quick solution. PP is slower to perform, but can decrease throughput by up to 58%. best for battling skewness in all workloads. 6.2.2 Impact of partitioning 6.2.4 Impact of stealing CPU-intensive tasks To battle skewness, apart from collocating hot and cold Stealing can be helpful, as long as tasks are CPU-intensive columns, one can partition hot columns. Figure 16 shows so that the incurred remote traffic does not stress the inter- the results of the previous experiment using Bound, but eval- connects. Figure 18 shows the results of the previous exper- uating the different data placements and partitioning types. iment with high selectivities, but with the Target strategy. IVP and PP achieve the best throughput as RR in the case of Stealing now does not hurt as in the case of Section 6.2.1. the uniform workload of Section 6.1.1. Skewness is smoothed Performance metrics for 1024 clients out since queries are parallelized across all sockets. RR IVP PP 20 120 LLC load misses (x1010) Memory TP (GiB/s) 8 100 Performance metrics for 1024 clients 100 TP (x103 q/min) RR IVP PP 6 80 15 CPU load (%) 80 60 10 60 60 100 50 250 LLC load misses (x1010) 4 Memory TP (GiB/s) 40 TP (x103 q/min) 50 80 40 40 200 5 CPU load (%) 2 20 20 40 60 30 150 30 0 0 0 0 40 20 100 RR RR IVP IVP PP PP 1 4 16 64 1 4 16 64 256 256 1024 1024 20 20 10 50 # Clients # Clients 10 Remote Local S4 S3 S2 S1 0 0 0 0 RRIVP PP RR IVP PP 1 4 16 64 256 1 4 16 64 256 1024 1024 # Clients # Clients Remote Local S4 S3 S2 S1 Figure 18: As Figure 17, with Target. Stealing does not improve IVP and PP since they were Figure 16: Evaluating the RR, IVP, and PP data already saturating CPU resources, but improves RR, which placements, with the Bound scheduling strategy. now has full CPU load and achieves the same throughput as IVP. Stealing incurs remote accesses, and both RR and IVP PP has less CPU load due to the concurrency hint being are still worse than PP which results in more local accesses. considered for every part, resulting in a smaller number of tasks than IVP. Even with more CPU load, however, PP Implications. Inter-socket stealing should be allowed for cannot achieve a better throughput since the memory band- CPU-intensive tasks. width is already saturated. Due to space limitations, we do not include Target. We note that it achieves a better 6.3 TPC-H and SAP BW-EML benchmarks throughput for low concurrency, and a similar throughput The implications of our analysis largely apply to the data for high concurrency, since worker threads find local tasks placement of whole tables and to the scheduling of aggrega- and steal fewer tasks than in the case of RR. tions as well. We parallelize aggregations similarly to scans, Implications. Partitioning can significantly help in smooth- and we define task affinities similarly (see Section 5.2). In ing skewed memory-intensive workloads. this section, we show the impact of different strategies for placing tables and scheduling queries on the throughput of 6.2.3 Impact of partitioning type two benchmarks dominated by scans and aggregations. For the first benchmark, we use TPC-H [12] with a scaling One needs to consider two things for choosing between IVP factor 100 (100GiB flat files). We measure the throughput and PP. Firstly, PP is expensive to perform as it recreates (queries per hour) of TPC-H Q1 instances, with random columns. PP on this dataset takes around 18’, compared to parameters (see clause [12]), which are continuously 4’ for IVP, and consumes around 8% more memory because issued by 32 concurrent clients (who can saturate resources dictionaries contain recurrent values. Secondly, IVP inter- due to intra-query parallelism). The evaluation of TPC-H leaves the IX and the dictionary, which may be inefficient Q1 is dominated by aggregations on a single table (lineitem). for index lookups and intensive materialization phases. For the second benchmark, we measure the throughput As a practical example, Figure 17 shows the results of of the reporting load of SAP BW-EML [3, 9].2 BW-EML the previous experiment with a high selectivity of 10%. Ex- is representative of realistic SAP BW (business warehouse) ecution is dominated by the CPU-intensive materialization industrial workloads. It uses an application server to host phase, which involves random accesses to the dictionary. PP is better, since it involves more local accesses. The through- 2 We refer the reader to an IBM redbook [11] for a more put of IVP ultimately decreases due to remote accesses. detailed introduction to BW-EML and sample statements. 1451

11.BW users who query a database server. Throughput is mea- to the workload. Considering the implications of our anal- sured in navigation steps (or queries) per hour. At the core ysis, we envision an adaptive design. The design, shown in of the data model, there are 3 InfoCubes, each modeling Figure 20, comprises of three components: (a) the catalog, multidimensional data with an extended star schema [22]. (b) the task scheduler, and the (c) data placer. The major part of the execution consists of queries which are dominated by scans and aggregations on the InfoCubes. Catalog. The catalog holds information about the tables, Every presented measurement uses the maximum number of their columns, and whether a table is physically partitioned users that can be serviced without timeouts. Our dataset (see Section 4). The second table of the figure, e.g., has has 1 billion records (around 800GiB of flat files). two physical partitions. Through the catalog, the PSM of To evaluate BW-EML, we split the 32-socket rack-scale the components (indexvector, dictionary, index) of any col- machine into two 16-socket ones, each hosting the applica- umn can be accessed. Task creators can consult the PSM to tion and the database server (running our prototype of SAP define a socket affinity for their tasks. HANA) respectively. We use the database server for TPC-H as well. We evaluate the impact of different PP granularities, Task scheduler. The task scheduler needs to be NUMA- and the impact of Target and Bound, on the throughput. aware by supporting task affinities (see Section 5). We en- For each granularity, we distribute the table partitions in a vision that task creators assign estimated performance met- round-robin manner around the sockets. Similar to Section rics to tasks, e.g., their memory bandwidth consumption. A 6.1.4, the case of one partition per table degenerates to RR. black-box approach using H/W counters can be employed Due to legal reasons, throughput results are normalized to to find performance metrics (see Section 3). Task creators undisclosed constants c1 and c2 , corresponding to the maxi- can set the hard affinity flag of a task based on whether its mum observed throughput for TPC-H and BW-EML respec- performance metrics indicate a memory-intensive task. tively. This does not hinder us from comparing the impact Data placer. The data placer initially places data items of the different data placement and scheduling strategies on (tables or columns) across the sockets with RR. Afterwards, the throughput. The results are shown in Figure 19. it continuously runs the workflow of Figure 20 to balance TPC-H Q1 instances (SF = 100) SAP BW-EML (SF = 1 billion rows) the CPU and memory bandwidth utilization of all sockets 1 1 by moving or repartitioning data items. The utilization can (c2 · nav. steps/h) Throughput 0.8 0.8 be measured with H/W counters. Throughput (c1 · QphH) 0.6 0.6 Target If the utilization across sockets is unbalanced, the data 0.4 0.4 0.2 0.2 Bound placer finds the hottest, most utilized, socket. By examining 0 0 its active tasks, and their performance metrics, it discerns PP2 PP4 PP8 PP2 PP4 PP8 RR RR PP16 PP16 the hottest data item. If the hottest data item is not dom- inating the utilization of the socket, the data placer moves Figure 19: TPC-H and SAP BW-EML with different it to the coldest socket. If it is dominating the utilization of PP granularities, and different scheduling strategies. the socket, it increases its number of partitions, either with IVP, if the data item services tasks that mostly scan the IV TPC-H is severely skewed, since Q1 queries one table. In- of the columns, or PP otherwise (see Section 4). The new creasing the number of partitions improves performance as partition is moved to the coldest socket. queries are gradually executed locally on more sockets. Our In case the socket utilization is balanced, the data placer measurements show that Q1 is CPU-intensive as its execu- iterates over the catalog to find partitioned data. For each tion is dominated by the multiplications of its aggregations. partitioned data item, it decides if it is cold by examining if Due to this, Target is better than Bound. For Bound, in- any active tasks are processing it. If a partitioned data item creasing the number of partitions results in utilizing more is cold, the data placer decreases its number of partitions. sockets, finally matching the throughput of Target. BW-EML, in contrast, has simpler aggregation expres- TBL1 PART1 COLa IV PSM Priority Hard Priority sions and is memory-intensive. Thus, Bound is better than COLb Dict PSM Queue Queue run() Target, even if it underutilizes the machine (as in the case TBL2 PART1 COLx IX PSM t t Socket = 1 COLy t t Hard aff. = of RR), as stolen tasks incur remote accesses and stress the PART2 COLx Socket 1 TG 1 Performance QPI. This is why the performance of Target remains low. metrics Increasing the number of partitions up to 4 improves the COLy Catalog Task scheduler performance of Bound, since the 3 InfoCubes are fully con- Data placer (flowchart) Place data using RR suming the memory bandwidth of 12 sockets. Further par- titioning, however, is unnecessary since the machine is satu- Yes Find most Yes Unbalanced? Partition? IV-intensive? rated, and creates an overhead (as in Section 6.1.4 for IVP). hot socket No Yes Find most Implications. The implications are in line with our scan No Partition Partition hot data benchmarks. First, memory-intensive tasks should be bound, No with PP with IVP while CPU-intensive tasks not. Second, only hot data should Move to coldest socket Place new partition be partitioned, up to the point that the CPU and memory Continue Yes bandwidth utilization is balanced across all sockets. Iterate catalog Partitioned? Cold data? No No End Yes Decrease number of partitions 7. TOWARDS AN ADAPTIVE DESIGN Our analysis shows that a main-memory column-store needs a task scheduling and data placement strategy that adapts Figure 20: Our envisioned adaptive design. 1452

12.8. DISCUSSION AND OUTLOOK [8] C. Balkesen et al. Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited. VLDB, 7(1):85–96, 2013. Applicability. To support our strategies for data place- [9] T. Becker et al. Searching for the Best System ment and task scheduling, a main-memory column-store can Configuration to Fit Your BW?, Nov. 2012. be augmented with two functionalities. First, adopt PSM in order to realize the RR, IVP, and PP data placements (see [10] S. Blagodurov et al. A Case for NUMA-aware Section 4). Second, the task scheduler needs to support a Contention Management on Multicore Systems. In socket affinity and a hard affinity for tasks (see Section 5). USENIXATC, pp. 557–558, 2011. [11] W. Chen et al. Architecting and Deploying DB2 with Compression. Most column-stores use dictionary encod- BLU Acceleration. IBM Redbooks, 2014. ing such as Microsoft SQL Server [21], IBM DB2 BLU [30], [12] TPC-H Benchmark Rev. 2.17.1. Oracle [18], and MonetDB [26]. Our analysis can also extend [13] M. Dashti et al. Traffic Management: A Holistic to columns without a dictionary. In this case, the output of Approach to Memory Placement on NUMA Systems. In ASPLOS, pp. 381–394, 2013. a scan is written directly without a dictionary. Our novel [14] F. F¨ arber et al. The SAP HANA Database – An IVP can be as efficient as PP for high selectivities as well. Architecture Overview. IEEE Data Eng. Bull., Moreover, our scans are implemented using SSE over bit- 35(1):28–33, 2012. compressed IV [33]. IV can be further compressed using, [15] J. Giceva et al. Deployment of Query Plans on e.g., run-length or prefix encoding, and scans can use the Multicores. VLDB, 8(3):233–244, 2014. new AVX2 instructions [34]. Different compression forms, [16] T. Gubner. Achieving Many-Core Scalability in however, do not change the basic implications for placing Vectorwise. Master’s thesis, TU Ilmenau, 2014. data and scheduling tasks. Decompression may modify the [17] T. Kissinger et al. ERIS: A NUMA-Aware In-Memory Storage Engine for Analytical Workloads. In ADMS, CPU- and memory-intensity of tasks. In this case, our en- pp. 74–85, 2014. visioned adaptive design can accommodate them. [18] T. Lahiri et al. Oracle Database In-Memory: A Dual Additional operators. We are working on extending our Format In-Memory Database. In ICDE, 2015. analysis and our envisioned design to incorporate more com- [19] C. Lameter et al. NUMA (Non-Uniform Memory Access): An Overview. ACM Queue, 11(7):40, 2013. plex operators, such as joins. We intend to use similar data [20] H. Lang et al. Massively Parallel NUMA-aware Hash placement and task scheduling strategies. What we need to Joins. In IMDM, 2013. consider additionally for, e.g., joins, is the placement of the [21] P.-A. Larson et al. Enhancements to SQL server data structures used internally in the operator, and placing column stores. In SIGMOD, pp. 1159–1168, 2013. correlated data on the same socket or on nearby sockets. [22] T. Legler et al. Data Mining with the SAP NetWeaver BI Accelerator. In VLDB, pp. 1059–1068, 2006. 9. CONCLUSIONS [23] V. Leis et al. Morsel-Driven Parallelism: A In this paper, we show that main-memory column-stores NUMA-Aware Query Evaluation Framework for the Many-Core Age. In SIGMOD, pp. 743–754, 2014. should depart from a static data placement and task schedul- [24] C. Lemke et al. Speeding up queries in column stores: ing strategy on NUMA machines, towards a strategy that a case for compression. In DaWaK, pp. 117–129, 2010. adapts to the workload. We describe and implement various [25] Y. Li et al. NUMA-aware algorithms: the case of data strategies for concurrent scans. For task scheduling, we dis- shuffling. In CIDR, 2013. tinguish between Target, and Bound. For data placement, [26] S. Idreos et al. MonetDB: Two decades of research in we distinguish between RR, our novel IVP, and PP. Our column-oriented database architectures. Data extensive sensitivity analysis of the strategies, under vari- Engineering, page 40, 2012. ous workload parameters, shows that (a) stealing memory- [27] D. Porobic et al. ATraPos: Adaptive transaction processing on hardware Islands. In ICDE, pp. intensive tasks can hurt throughput by up to 58%, and that 688–699, 2014. (b) unnecessary partitioning can hurt throughput by up to [28] I. Psaroudakis et al. Task Scheduling for Highly 70%. Based on our analysis, we envision an adaptive de- Concurrent Analytical and Transactional sign that balances the utilization of all sockets. Partitioning Main-Memory Workloads. In ADMS, pp. 36–45, 2013. should be used only for hot data in case of skewed work- [29] I. Psaroudakis et al. Scaling Up Mixed Workloads: A loads, and the number of partitions should be increased up Battle of Data Freshness, Flexibility, and Scheduling. to the point that utilization across sockets is balanced. In TPCTC, pp. 97–112, 2015. [30] V. Raman et al. DB2 with BLU Acceleration: So 10. REFERENCES much more than just a column store. VLDB, 6(11):1080–1091, 2013. [1] Intel Xeon Processor E7 Family Uncore Performance [31] P. Russom. Best Practices Report: High-Performance Monitoring, 2011. Data Warehousing, 2012. [2] Intel Xeon Processor E7 v2 2800/4800/8800 - [32] M. Stonebraker et al. The end of an architectural era Datasheet - Vol. 2, 2014. (it’s time for a complete rewrite). In VLDB, pp. [3] SAP BW Enhanced Mixed Load benchmark, 2015. 1150–1160, 2007. [33] T. Willhalm et al. SIMD-Scan: Ultra Fast in-Memory [4] SAP HANA Platform SQL and System Views Table Scan using on-Chip Vector Processing Units. In Reference, 2015. VLDB, volume 2, pp. 385–394, 2009. [5] A. Ailamaki et al. How to stop under-utilization and [34] T. Willhalm et al. Vectorizing database column scans love multicores. In SIGMOD, pp. 189–192, 2014. with complex predicates. In ADMS, pp. 1–12, 2013. [6] M.-C. Albutiu et al. Massively Parallel Sort-Merge [35] F. Wolf et al. Extending Database Task Schedulers for Joins in Main Memory Multi-Core Database Systems. Multi-threaded Application Code. In SSDBM, 2015, VLDB, 5(10):1064–1075, 2012. to appear. [7] D. Alistarh et al. The SprayList: A Scalable Relaxed [36] M. Zukowski et al. Vectorwise: Beyond Column Priority Queue. In PPoPP, pp. 11–20, 2015. Stores. IEEE Data Eng. Bull., 35(1):21–27, 2012. 1453