Main-memory Scan Sharing for Multi-Core CPUs

We then generalize this approach to a BatchSharing scheme that avoids thrashing on ”agg-tables”—hash tables that are used for aggregation processing—caused by execution of too many queries on a core. This scheme partitions queries into batches such that the working-set of agg-table entries for each batch can fit into a cache; an efficient sampling technique is used to estimate selectivities and working-set sizes for purposes of query partitioning. Finally, we use lottery-scheduling techniques to ensure fairness and impose a hard upper bound on staging time to avoid starvation. On our 8- core testbed, we were able to completely remove the memory I/O bottleneck, increasing throughput by a factor of 2 to 2.5, while also maintaining fairness and avoiding starvation.

1. Main-Memory Scan Sharing For Multi-Core CPUs Lin Qiao Vijayshankar Raman Frederick Reiss Peter J. Haas Guy M. Lohman IBM Almaden Research Center, San Jose, CA 95120, U.S.A. {lsqiao,ravijay,frreiss,phaas,lohmang} 5.0E+09 ABSTRACT Total CPU cycles of a query Computer architectures are increasingly based on multi-core CPUs 4.0E+09 and large memories. Memory bandwidth, which has not kept pace with the increasing number of cores, has become the primary pro- Memory cessing bottleneck, replacing disk I/O as the limiting factor. To 3.0E+09 DTLB miss address this challenge, we provide novel algorithms for increas- L2 hit ing the throughput of Business Intelligence (BI) queries, as well Branch Misprediction 2.0E+09 Computation as for ensuring fairness and avoiding starvation among a concur- rent set of such queries. To maximize throughput, we propose a novel FullSharing scheme that allows all concurrent queries, when 1.0E+09 performing base-table I/O, to share the cache belonging to a given core. We then generalize this approach to a BatchSharing scheme 0.0E+00 that avoids thrashing on ”agg-tables”—hash tables that are used for 1 core 8 cores aggregation processing—caused by execution of too many queries on a core. This scheme partitions queries into batches such that the Figure 1: Breakdown of total CPU cycles consumed for the working-set of agg-table entries for each batch can fit into a cache; query from Figure 2, when running on an 8-core server. The an efficient sampling technique is used to estimate selectivities and stacked bar on the left is generated by pinning all threads to a working-set sizes for purposes of query partitioning. Finally, we single core. When all cores are used, main memory bandwidth use lottery-scheduling techniques to ensure fairness and impose a becomes the performance bottleneck. hard upper bound on staging time to avoid starvation. On our 8- core testbed, we were able to completely remove the memory I/O bottleneck, increasing throughput by a factor of 2 to 2.5, while also disk I/O bottleneck. For the first time, BI has become CPU-bound. maintaining fairness and avoiding starvation. However, recent trends in hardware are bringing this new era quickly to an end. Processor manufacturers are putting ever-increas- 1. INTRODUCTION ing numbers of cores onto a CPU die, and main memory bandwidth Historically, business intelligence, or BI, has been an I/O-bound is not keeping pace. workload. Business data is stored on the disks of a data warehouse, A simple experiment demonstrates the performance effects of the and retrieving data from these disks is the main cost in query exe- recent trend towards multicore CPUs. We take a typical BI query cution. The state of the art in BI is defined by this I/O bottleneck: (Figure 2) and run it on the Blink query processor [18], on an In- Low-end systems spend most of their time waiting for disk I/O; tel Xeon server with 2 quad-core CPUs. Blink is a compressed while high-end systems use large numbers of disks to achieve high main-memory database targeted at BI applications; we describe throughput at great financial cost. the system in more detail in Section 2. We used the oprofile Researchers have developed several techniques to alleviate this whole-system profiler and the Xeon’s hardware performance coun- bottleneck by reducing amount of data a query processor needs to ters to break the cycles spent on this query (across all 8 cores) into touch. These techniques include aggressive compression [2, 17], five components: cycles spent in computation, cycles wasted on column stores [11], and materialized views [19]. With the advent pipeline stalls due to branch mispredictions, cycles spent on level- of large main memories, these techniques often allow the entire 1 (L1) cache misses, translation lookaside buffer (TLB) misses, and working set of a BI system to fit in RAM, bypassing the traditional main memory access (L2 cache misses). As Figure 1 shows, when the query is bound to a single core, Permission to make digital or hard copies of portions of this work for the system is CPU-bound, with the majority of time going to com- personal or classroom use is granted without fee provided that copies putation. But when we allow the query to use all 8 cores on the Permission to copy without fee all or part of this material is granted provided are not made or distributed for profit or commercial advantage and machine, accessing main memory becomes the bottleneck. With that the copies are not made or distributed for direct commercial advantage, that copies bear this notice and the full citation on the first page. manufacturers soon to put 6 and 8 cores on a single chip, this prob- the VLDB copyright notice and the title of the publication and its date appear, Copyright for components of this work owned by others than VLDB and notice is given that copying is by permission of the Very Large Data lem will only become worse. Endowment Base must beTohonored. Endowment. copy otherwise, or to republish, to post on servers Abstracting with credit In this paper, we address the memory bandwidth bottleneck with or to redistribute to lists,isrequires permitted. a feeToand/or copy special otherwise, to republish, permission from the a novel scan sharing technique for main-memory query processors. to post onACM. publisher, servers or to redistribute to lists requires prior specific permission and/or a fee. Request permission VLDB ‘08, August 24-30, 2008, Auckland, New Zealand to republish from: We observe that much of the memory bandwidth used in BI queries Publications Copyright 2008Dept., VLDB ACM,Endowment,Inc. ACM Fax 000-0-00000-000-0/00/00. +1 (212) 869-0481 or goes towards scanning tables and indexes, and that these scans are PVLDB '08, August 23-28, 2008, Auckland, New Zealand Copyright 2008 VLDB Endowment, ACM 978-1-60558-305-1/08/08 610

2.often repeated across multiple queries. We attach multiple queries select X2.MATL GROUP, X1.INDUSTRY, DU.SID BASE UOM, to a single scan, amortizing the overhead of reading the data from DU.SID STAT CURR, DT.SID CALMONTH sum(F.RTNSQTY), sum(F.RTNSVAL), sum(F.INVCD CST3), sum(F.INVCD QTY3), sum(F.INVCD VAL3), main memory into the processor’s cache. sum(F.OPORDQTYBM3), sum(F.OPORDVALSC3), sum(F.ORD ITEMS3), sum(F.RTNSCST3), Shared scans have been used in the past to overcome disk I/O sum(F.RTNSQTY3), sum(F.RTNSVAL3), sum(F.RTNS ITEMS3), sum(F.CRMEM CST4), sum(F.CRMEM QTY4), sum(F.CRMEM VAL4), sum(F.CST4), sum(F.QTY4), bottlenecks [16, 21], but bringing this technique to main-memory sum(F.VAL4), sum(F.INVCD CST4) from DBMS’s poses significant challenges. Disk-based systems use pro- ”/BIC/FLARGE” F, ”/BIC/DLARGE1” D1 ”/BIC/XCUSTOMER” X1, ”/BIC/DLARGE2” D2, ”/BI0/XMATERIAL” X2, ”/BIC/DLARGEP” DP, ”/BIC/DLARGET” DT, ”/BIC/DLARGEU” DU grammable buffer pools and dedicated I/O threads to implement where scan sharing. Different queries share data via the buffer pool, and a F.FKEY1 = D1.DIMID and D1.SID 0SOLD TO = X1.SID and F.FKEY2 = D2.DIMID and D2.SID 0MATERIAL = X2.SID buffer manager choreographs the reading of data into the pool. and F.FKEYP = DP.DIMID and F.FKEYU = DU.DIMID and F.FKEYT = DT.DIMID In a main-memory DBMS, however, the processor cache takes and DP.SID 0CHNGID = 0 and X2.MATL GROUP IN (9, 8, 7, 6, 5, 4, 3, 2) and DP.SID 0RECORDTP = 0 and DP.SID 0REQUID ≤536 the place of the buffer pool, with the cache controller hardware de- and X1.INDUSTRY IN (9, 8, 7, 6, 5, 4, 3, 2) and D5.SID 0VTYPE IN (2, 3, 4, 5, 6) termining the data that resides in cache. In such an environment, and X2.OBJVERS = ’A’ and X1.OBJVERS = ’A’ group by scan sharing requires careful scheduling of low-level query opera- X2.MATL GROUP, X1.INDUSTRY, DU.SID BASE UOM, DU.SID STAT CURR, DT.SID CALMONTH; tions, ensuring that data is resident in the cache when it is needed. This scheduling is complicated by the fact that processor caches Figure 2: An example BI query, reporting various sales statis- are significantly smaller than buffer pools. The working set (aux- tics, broken down by industry and time. iliary data structures like hash tables and dimension tables) of a small group of queries can easily exceed the size of cache, leading to thrashing. An implementation of scan sharing needs to estimate Vendor Intel AMD Sun the working set sizes of queries and to avoid grouping too many CPU Name Xeon X5355 Opteron 8347 UltraSPARC T2 queries together. Efficiently predicting the working set size of a Num. Cores 4 4 8 query, e.g, by sampling, is a non-trivial problem. For example, L1 Cache 4 x 64 KB 4 x 128 KB 8 x 24 KB with a group-by query, if we adopt a simple bound on the working L2 Cache 2 x 4 MB 4 x 512 KB 1 x 4 MB L3 Cache N/A 1 x 2 MB N/A set – the number of distinct group-by values, we have to solve the Main Memory Up to 32 GB Up to 32 GB Up to 256 GB infamous sample-based distinct count estimation problem [10]. (4GB modules) 1.1 Contributions Table 1: Architectural characteristics of three recent CPUs. In this paper, we present a query scheduling algorithm that im- The experiments in this paper were conducted on a machine plements shared scans on a main-memory DBMS. We show that with two Intel Xeon X5355 processors. the naive application of this algorithm can lead to thrashing as the working set size of queries exceeds cache. We then develop ex- tensions to our algorithm to group queries in a way that avoids with 4 separate processing cores, with 6- and 8- core processors in thrashing. As part of this extended algorithm, we provide a new the pipeline. Table 1 shows some statistics of processors from the technique for efficiently and robustly estimating the working-set current architectural generation. size of group-by queries. This technique rests on the observation Each core in a multi-core processor is an independent CPU; this that the working-set size is less than the number of distinct values CPU sits at the top of a memory hierarchy consisting of 2-3 lev- due to skew in access to the hash table buckets, so that we can ex- els of cache and a relatively slow main memory. Each core has a ploit the well-studied statistical notion of coverage [9]; coverage is private level-1 (L1) cache that is very fast but very small. Larger a more tractable quantity to estimate than the distinct value count. level-2 (L2) and, often, level-3 (L3) caches provide slower access We also develop techniques for guaranteeing a fair allocation of to larger amounts of memory Typically, the largest cache is shared shared resources. across all cores on the processor die, while each processor main- We evaluate our techniques using a thorough set of experiments tains its own private cache at the higher caching layers. For exam- on a real-world dataset. We demonstate that our techniques signif- ple, the AMD Opteron processor in Table 1 has a shared L3 cache icantly improve the scaling of multi-query workloads on multicore and private L1 and L2 caches. processors. On our 8-core testbed, we obtain near linear scaling At each level of the hierarchy, performance drops by one to two of throughput with cores, a performance improvement of up to 2.5 orders of magnitude. Storage capacity follows a different trajectory, times of that which is attained without these techniques. increasing by a factor of 2-4 at each cache layer, with a dramatic Paper Outline jump in capacity at the main memory layer. Even the largest pro- Section 2 provides architectual and system background. Section 3 cessor caches represent less than half of one percent of a modern describes related work. In section 4, we propose a novel approach computer’s memory. to achieve appropriate I/O sharing inside a cache. In section 5, we This cache/memory hierarchy is somewhat similar to the mem- provide a robust technique to estimate query parameters. Section 6 ory/disk hierarchy for which mainstream database systems were presents experimental results, and we conclude in Section 7. designed, with cache taking the place of the buffer pool and main memory taking the place of disk. However, there are are two im- portant differences. 2. BACKGROUND First of all, control of this memory hierarchy is implemented This section presents some material that is useful background for mostly in hardware, with the cache and memory controllers making this paper. We give an overview of modern hardware architectures, most low-level decisions about which regions of memory reside in and then describe Blink, the main memory database in which our which level of the hierarchy. Modern CPUs provide a few instruc- techniques are implemented and our experiments are run. tions to “suggest” policy changes to the hardware (e.g., the x86-64 prefetch instructions [1]), but these mechanisms do not provide the 2.1 Modern Hardware Architectures flexibility and control that a typical database buffer pool enjoys. In Today, major processor vendors are shipping processors equipped addition, many of the low-level synchronization primitives needed 611 implement a buffer pool within the L2 cache are themselves as sor is heavily multi-threaded. Blink maintains a pool of worker expensive as a cache miss. threads, one thread per core. Each thread “picks up” one query at The second difference is one of scale. Even large L2 and L3 a time and runs the query, scanning blocks of compressed tuples. caches are typically less than 10 MB in size, which is smaller than When there are more threads than queries, the idle threads “steal database buffer pools have been for many years. Business intelli- work”, attaching themselves to queries that are already executing. gence (BI) queries are highly complex, and running them efficiently These additional worker threads split the compressed table among requires keeping a large “working set” in cache, including indexes, themselves, executing the query in parallel. To avoid locking over- intermediate data structures, and executable code. head, each thread maintains a private copy of the agg-table for each query that it executes. At query completion time, these agg-tables 2.2 The Blink Query Processor are merged, if necessary, to produce the final query result. Blink is a query processor that operates on compressed main- The query processing steps that Blink uses have memory ac- memory tables. We now briefly describe the data format and query cess characteristics similar to the low-level operations in a conven- processing strategy used within Blink, focusing on the aspects that tional relational database. From the memory subsystem’s perspec- are relevant to the work in this paper. A more complete description tive, reading compressed tuples from the denormalized fact table of Blink can be found in [18]. is similar to a conventional table or index scan. Likewise, dictio- nary lookups are analogous to index joins on primary key - foreign 2.2.1 Data Organization key relationships. Finally, Blink’s agg-table is similar to the data Most columns are encoded with a dictionary code, where indi- structures most DBMSs use for hash-based aggregation. Because vidual column values are replaced with codewords: offsets into a of these low-level similarities, we expect that our results on Blink column-specific array of distinct values (the column dictionary). should also apply to other main-memory query processors. These offsets are fixed-length and bit-aligned, for extreme com- pression. Blink also exploits skew in data distribution by horizon- 3. RELATED WORK tally partitioning tables such that values with similar (marginal) fre- DBMSs have always aimed to share the results of I/O among quencies go to the same partition, and using shorter offset-lengths concurrent tasks via the buffer manager. Many recent systems ex- for the partitions with more frequent values. plicitly synchronize concurrent queries to improve the amount of This dictionary compression scheme is further optimized in two I/O that can be shared at the buffer pool, by grouping together ways to ensure that we do not have to decode the codewords when queries that run at similar speeds [16, 21]. Unlike previous sys- processing each tuple. First, the column dictionaries are stored in tems, the sharing in main-memory DBMSs must be done in L2 order-preserving fashion so that equality, range, and in-list predi- cache and not in memory. As we have discussed, this buffer pool cates can be applied directly on the codewords. Second, numerical model does not lend itself well to the implementation within the L2 columns with high cardinality are encoded by simply subtracting cache. If large shared L3 caches become common, this approach is each value from a base and representing the difference as a com- more promising. The much smaller cache sizes (when compared to pact integer of appropriate bit-size. memory) means that the combined working set of the queries often From the standpoint of this paper, the result of these optimiza- fails to fit. The thrashing of the working set leads to significant I/O tions is that accesses to the dictionary are done once at the start of that competes with the table I/O that we are trying to share. a query, and introduce almost no I/O after that. An alternative approach to sharing is a data-driven approach in Blink stores its input tables in de-normalized fashion. During which a single pipeline of tuples feeds multiple concurrent queries. load, Blink joins tables of an input schema along primary key - for- This is typically done using an operator that is shared among mul- eign key relationships to form a de-normalized table, so that query tiple queries, such as in a staged database system [12] or a contin- execution just involves scans and aggregation over this table. uous query processor [15, 5]. Our FullSharing scheme is similar to this approach. Recently, Johnson et al. [13] discuss the tradeoffs 2.2.2 Query Processing of work sharing in a multi-core processor using shared operators. The query processing component of Blink operates over fixed- In particular, they obtain the same observation as we do from a dif- size blocks of tuples from the denormalized fact table. Query pro- ferent perspective. That is, work sharing does not always improve cessing proceeds in three stages. throughput in a multicore system. This is similar to our statement In the first stage, the query processor scans the denormalized that FullSharing is not always beneficial. However, we focus on table, accessing the columns referenced in the query’s WHERE solving issues related to hardware resource constraints, e.g. cache clause and applies selection predicates that can be applied directly contention, while they focus on solving throughput degradation due over the codewords, without decoding. to serialization. Our results on estimating the working set size and In the second stage, the query processor applies any remaining batching queries appropriately are applicable to any shared opera- predicates that need decoding (such as shipDate − receiptDate > tor in a staged database system. 30), only over tuples that passed the first stage. For each tuple that Another popular technique for avoiding the I/O bottleneck is to passes this second stage, Blink computes a unique group code from lay out records of a table in a column-major order. Harizopou- the compressed values of the columns referenced in the query’s los et. al [11] quantify the extent of I/O savings in this layout for GROUP BY clause. queries that access various fractions of a table. Such a layout is The third stage of processing uses each group code as a key for an complementary to the techniques described in this paper; for ex- in-memory hash table called the agg-table. Each entry in the agg- ample, our BatchSharing method can cluster queries according to table holds a pointer to an array of running aggregates, the array the vertical partitions of the data they access. holding one value for each clause in the query’s SELECT list. The Finally, another approach to scaling on a multicore system is agg-table is implemented via open addressing with linear probing, intra-query parallelism. Recent papers on this topic have proposed because this provide good cache locality and avoids the overhead specialized hash-based aggregation algorithms to avoid lock con- of pointers that happens with chaining. tention across cores [7]. This paper focuses on throughput and To take advantage of multi-core CPUs, Blink’s query proces- inter-query parallelism. Each thread uses a private hash table to 612

4. A B C D E F G Query A B C D E F G Query FullSharing NaiveSharing 1 3 1 2 2 Throughput speedup 3 2.5 3 4 4 5 2 5 6 6 7 1.5 7 8 8 9 1 9 10 10 0.5 Data block Data block Core 1 Core 2 Core 1 Core 2 0 1 2 4 8 16 32 64 (a) NaiveSharing (b) FullSharing Number of concurrent queries Figure 3: Illustration of the data access patterns of NaiveShar- ing and FullSharing on a two-core system. FullSharing inverts Figure 4: NaiveSharing vs FullSharing the traditional division of work among cores. inverts the traditional division of work within the database: instead store its running aggregates, so lock contention is avoided. of processing all blocks of an entire query at a time, each thread Recently, the OS community has begun to consider the problem “processes” a block of data at a time across all queries. of sharing a cache among multiple threads on chip multi-processors. The benefits of FullSharing over NaiveSharing are demonstrated Chang et al. [4] proposed a cooperative cache partitioning scheme in Figure 4. We constructed a batch query workload consisting of to achieve high throughput using multiple threads. Kim et al. [14] multiple copies of the query from Figure 1 and ran it on an 8-core addressed the issue of fairness in cache sharing. Chen et al. [6] pre- server, first using NaiveSharing to schedule the 8 cores, and then sented a constructive cache sharing scheduler that allows threads to using FullSharing. We repeated the experiment multiple times, share an overlapping working set. These techniques are orthogo- varying the number of queries in the workload. For each run of the nal to our work, and applying them in a shared cache can further experiment, we compared overall throughput against the through- improve throughput and fairness of our system. put of the one-query workload. As the number of queries in the system increases, FullSharing is able to amortize memory I/O across the entire group of queries, 4. SCAN-SHARING more than doubling its query throughput. Beyond 4 concurrent Query processors that run concurrent queries usually operate in queries, NaiveSharing achieves some speedup through I/O sharing. a multi-threaded fashion, where each thread handles a query at a However, the speedup is negligible compared to that of FullShar- time. When this model is applied to a main-memory, multicore ing. Even though all the queries in the workload are identical and system, each thread runs on a core and scans data from memory. start at the same time, the convoy effect is not sufficient to induce The challenge of I/O sharing is to optimize the memory access so effective sharing of memory I/O. that the threads are always busy doing work, and are not bound by memory bandwidth. As we discussed in Section 2, main-memory 4.2 Implementing FullSharing DBMSs lack buffer pools, instead relying on hardware to read data In Blink, FullSharing is implemented easily, with only modest into the processor’s caches. code changes. Recall from Section 2.2 that Blink queries scan Even in the absence of a buffer pool, main-memory DBMS’s a compressed denormalized fact table. Blink divides this table can attain some speedup through “incidental” I/O sharing, which horizontally into blocks, so that multiple cores can work on the occurs because of the convoy phenomenon [3]. Consider the case same query simultanesouly. To support this intra-query parallelism, when multiple queries, running on different cores, start scanning a Blink has a global scheduler that tells each thread which queries table at approximately the same time. The first query will incur a to run over which blocks. Our implementation of FullSharing in cache miss to read each tuple from main memory. The remaining Blink replaces this scheduler component, leaving all other portions queries, however, can take advantage of the data that the “trail- of the system unchanged. blazer” query has read into the processor’s shared L2 or L3 cache. Our new scheduler works as follows: When we want to run a The queries form a “convoy” behind whichever query is furthest workload Q of queries, we create a pool of work-units, where each along in scanning the table; slower queries can catch up while faster work-unit corresponds to a block. Each thread steals work from queries wait for the memory controller to respond. Throughout the this pool as follows: rest of this paper, we use the term NaiveSharing to describe the Repeat until the pool is empty: traditional multithreaded approach to scheduling query execution, • Pick a block from the pool of work-units. which achieves limited I/O sharing via the convoy phenomenon. • Scan this block. • For every query q ∈ Q, apply q on this block. 4.1 FullSharing In the rest of Section 4, we develop techniques that obtain signif- 4.3 Agg-Table Thrashing icantly more I/O sharing — and hence better performance — than The overall goal of scan-sharing in a main-memory DBMS is NaiveSharing. Our first technique is called FullSharing. Here, each to reduce the number of cache misses. The FullSharing technique processing thread executes a separate table scan. A given thread achieves this goal by loading tuples into cache once, then sharing feeds each block of tuples through every query before moving on them among multiple queries. However, applying FullSharing too to the next block. Figures 3(a) and 3(b) show how FullSharing’s aggressively can lead to more cache misses, due to an effect we call query scheduling contrasts with that of NaiveSharing. FullSharing agg-table thrashing. In this subsection, we explain why agg-table 613

5. Total agg-table size (MB) across all queries. Our test machine has two 4MB L2 caches, each 0.2 0.4 0.8 1.6 3.2 6.4 12.8 2.5 split between two cores. Effectively, each core has 2MB of cache. The block size was 400K, leaving 1.6MB of space per core for the Throughput speedup 2 queries’ working sets When the total agg-table size exceeds 1.6MB, the queries start to thrash. Our experiments have verified this result 1.5 across queries with selectivities from 1 to 100 percent and agg-table sizes ranging from 30KB to 3.2MB. 1 To summarize, a scan-sharing technique must avoid agg-table thrashing in order to achieve the benefits of shared scans. The two 0.5 selectivity=100% selectivity=0.1% factors that determine whether thrashing will occur are query se- lectivity and working set size. In the rest of Section 4, we use this 0 1 2 4 8 16 32 64 knowledge to develop techniques to avoid thrashing. Number of queries in a static workload 4.4 BatchSharing To achieve the benefits of scan-sharing without inducing agg- Figure 5: Performance improvement due to the FullSharing table thrashing, we have developed a technique that we call Batch- technique. Unless selectivity is below 0.1%, throughput de- Sharing. The intuition behind BatchSharing is simple: Prevent grades when the working set of the queries, combined with the thrashing by grouping together smaller numbers of queries into current block of tuples, exceeds L2 cache size. We call this phe- batches. However, making this intuition work in practice is dif- nomenon agg-table thrashing. ficult, because it is hard to determine whether a given set of queries can share a scan without thrashing. In this section, we describe the components of the BatchShar- ing technique. We start by discussing the problem of determining thrashing occurs. In the sections that follow, we use this knowledge which queries can safely share a scan. Then we describe the query to develop scan-sharing techniques that are immune to the problem. parametrization algorithm that we use to solve this problem. Fi- A query that scans a table typically streams the results of the scan nally, we explain how we implement BatchSharing in Blink. into another operation, such as index nested-loops join, or grouped For ease of exposition, this section describes a “static” version aggregation. To run efficiently, these operations require fast access of BatchSharing. That is, we assume that the system is executing to a “working set” of data structures, such as indexes or hash tables. a single workload of queries (as in a report-generation scenario) If too many queries share a scan, their working sets can overflow all at once. Further, we assume that the goal of the system is to the cache. Once this situation occurs, the queries start to thrash, finish this entire workload as quickly as possible without regard for incurring frequent cache misses to fetch portions of their working the relative running times of individual queries. This scenario is sets. The resulting accesses to main memory can easily negate the analogous to running daily reporting queries over a data warehouse. benefits of scan-sharing. The working set of a Blink query consists Later, in Section 4.5, we will relax these assumptions and extend primarily of the agg-table data structure (see Section 2.2); hence, BatchSharing to handle dynamic query arrival while ensuring a fair we use the name “agg-table thrashing” to describe this effect. division of system resources among queries. We have conducted detailed experiments to determine the con- ditions in which agg-table thrashing can occur; full results can be 4.4.1 Query Parameter Estimation found in Section 6. To aid the discussion in the current section, we have created Figure 5, which shows a small subset of our re- In Section 4.3, we examined the phenomenon of agg-table thrash- sults. The experiments behind Figure 5 used FullSharing to share ing. Our analysis identified two factors that, taken together, can be a single scan between multiple copies of a given query. We var- used to predict whether a batch of queries will thrash. These factors ied the number of simultaneous queries from 1 to 64 and measured are query selectivity and query working-set size. the resulting throughput improvement. The two lines in the graph For queries in Blink, the working set is dominated by its agg- show the performance improvement for two variants of the query in table. In general, there is no known efficient (i.e., sampling-based) Figure 2. We produced these variants by modifying the WHERE method to estimate a priori the number of rows in an agg-table— clause of the query, changing the query selectivies to 100% and i.e., the number of groups that the query’s GROUP BY clause 0.1%, respectively. The high-selectivity query experiences agg- produces—with guaranteed error bounds [10]. However, by care- table thrashing, suffering a performance reduction when more than fully defining the estimation problem, we can sidestep these issues 8 queries run simultaneously. so that a sampling-based technique will meet our needs. Our experiments identified two factors that determine whether Our algorithm hinges on three key observations: agg-table thrashing will occur: query selectivity and working set O BSERVATION 4.1: Based on our characterization of agg-table size. The results in Figure 5 illustrate these two factors. The effects thrashing ( Section 4.3), we can classify queries into 3 categories: of selectivity are most readily apparent: The high-selectivity query • Always share: If a query has low selectivity (<0.1% as shown in thrashes, while the low-selectivity query does not. Overall, we have our experiments), it can be grouped with any other query without found that queries with selectivities of 0.1% or less do not exhibit thrashing. agg-table thrashing. • Never share: If a query’s working set size exceeds the size of The effects of working set size can also be seen by focusing on cache, adding that query to any batch will lead to thrashing. points at which thrashing occurs: in the case of Figure 5, at all • Could share: If a query does not fit into the previous two cate- points beyond 8 queries. The agg-tables for the queries shown gories, then the system needs to estimate the query’s working set here take up 200KB of memory each. (Recall that, in the Blink size to know whether it can be safely added to a given batch. query processor, the agg-table data structure is the dominant part O BSERVATION 4.2: Some parts of a query’s agg-table are accessed of a query’s working set.) Note the secondary scale across the very rarely, while others are accessed frequently; thus the working top of the graph; this scale shows the total size of the agg-tables set can be viewed, approximately, as the set of groups that are nec- 614

6.essary to account for almost every access to the query’s agg-table A B C D E F G Query (here “almost every” is a tunable parameter). If this working set 1 resides in cache, thrashing will not occur. 2 3 Core 1 O BSERVATION 4.3: It is easier to estimate a query’s working set 4 size from a sample than it is to estimate the size of its agg-table, 5 Core 2 because hard-to-capture rare values impact the distinct-value count 6 but not the working-set size. Working-set size is closely related to 7 the classical statistical notion of “sample coverage,” and techniques 8 for estimating sample coverage are applicable. 9 These observations allow us to convert a potentially hard estimation 10 problem into a tractable one: Data block First, identify queries with selectivities of less than Figure 6: Query execution with BatchSharing on a two-core 0.1%, as well as queries with working sets that clearly processor, running two batches; each core shares a single scan exceed the size of cache. Then, for the remaining queries, among the queries from one batch. If one core finishes its batch estimate the working-set size. before the other, the idle core will steal work from the remain- ing batch. In Section 5, we describe our query-parameter estimation algo- rithm in detail. For now, we give a synopsis of the algorithm and an intuition for how it works. This constraint is based on a conservative model of cache behav- Our algorithm operates using two phases of sampling. Each ior. Let γ denote the fraction of memory accesses covered by each phase operates over preallocated random samples of the table be- query’s working set. We assume that the cache controller will keep ing scanned. The first phase identifies queries in the “always share” the most popular γth percentile of memory in cache for each query. category. This phase proceeds by running the query over a sample As long as this invariant holds, the overall cache miss rate across of the table. If very few tuples pass the query’s selection predicate, the queries in the batch is bounded from above by 1 − γ. In reality, the query is marked as “always share.” This phase works well be- the controller will use a global replacement policy to allocate cache cause it is relatively easy to estimate predicate selectivities on the lines across all queries in a batch; this actual policy will achieve a order of 0.1% or higher from a sample. lower miss rate than the simplified policy we assume. During its second phase, the algorithm feeds a sample of the More formally, our packing problem is: Given a set Q of queries table through the query while monitoring the number of distinct and corresponding working set sizes wq , find a partitioning: groups encountered thus far. The algorithm stops either when the groups encountered thus far account for almost every access to the Q = Q1 ⊎ Q2 ⊎ · · · ⊎ Qp , (1) agg-table (as measured by sample coverage) or when the groups encountered thus far would not fit into cache. In the latter case, the that minimizes p, subject to the constraint: query is classified as “never share,” whereas in the former case, the X wq ≤ C − B, ∀ 1 ≤ i ≤ p, (2) algorithm returns the number of groups encountered thus far as its q∈Qi estimate of the working-set size. This phase works well because the coverage estimator that we employ is accurate as long as the actual where C is the size of the cache and B is the size of a block of number of groups in the working set is sufficiently small relative tuples (Agg-table thrashing occurs when the total working set of to the number of tuples in the sample. By definition, every “could the queries in a batch is greater than C − B bytes). share” query meets this criterion, because a processor cache can This problem is identical to the standard bin-packing problem, only hold roughly 10,000 agg-table entries. with bin size C − B. We use the well known first-fit decreas- After the two phases of sampling, the algorithm has collected ing heuristic, which sorts queries by decreasing wq and repeatedly enough information to decide which queries can be safely batched packs a query into the first batch with sufficient space, starting a together. In practice, we can obtain sufficiently accurate results for new batch if none is found. This heuristic is known to pack no both phases with sample sizes of less than 100,000 tuples. Even worse than a factor of 11/9 times the optimal solution [8]. when running a highly complex query, the Blink query processor can scan such a small sample in less than 5 msec. 4.4.3 Execution We have implemented BatchSharing on the Blink query proces- 4.4.2 Packing queries into batches sor. As with our implementation of FullSharing, all of our changes The result of the above estimation procedure is a quantification to Blink were concentrated in the query scheduler. The general of the working set size wq for each query q that the system needs scheduling algorithm that we use is illustrated in Figure 6. The to assign to a batch. For “always share” queries, this working set scheduler assigns a separate batch of queries to each core in the size is effectively zero; for “never share” queries, the working set processor. Each core scans the table, feeding each block of tuples size is effectively infinite. The next stage of BatchSharing uses this through its batch of queries. working set information to pack the queries into batches. Since BatchSharing assigns queries to batches purely according The goal of the packing algorithm is to minimize per-batch over- to their agg-table sizes, the batches can be very heterogeneous. As a heads by packing the queries into as few batches as possible; while result, the running times of different batches can vary significantly, avoiding agg-table thrashing. To prevent thrashing, we ensure that and the division of work among cores can be uneven. To prevent there is enough space in the cache for the working set of every cores from being idle, our implementation of BatchSharing uses query in a given batch. That is, if C is the size of the cache and B the work-stealing features already present in Blink. is the size of a block of data, then we guarantee that the queries in Recall from Section 2.2 that Blink’s conventional query sched- a batch have a total working set size of less than C − B. uler detects idle cores and assigns them to “help out” other cores 615

7.that have been assigned expensive queries. Our implementation such inbalances, we ensure that each batch receives an amount of of FullSharing disables this feature, since FullSharing naturally CPU time proportional to its size. levels the load across threads. With BatchSharing, we re-enable We use lottery scheduling [20] to implement this allottment of work-stealing. If a thread finishes its batch of queries before the CPU time. Each running batch receives a number of lottery tick- other threads, the thread can steal work (table blocks) from another ets proportional to the number of queries in the batch. We store batch. Thus, multiple threads work concurrently on the expensive the mapping from tickets to batches in an array, where each entry batches, automatically achieving a balanced load. represents a single ticket. We divide time into slices that are suffi- ciently large to amortize the overhead of flushing the processor’s L2 4.5 Dynamic Query Grouping cache. At the start of each time slice, every core chooses a lottery The description of BatchSharing in the previous section assumed ticket uniformly at random and executes the corresponding batch a single static workload of queries. In this section, we extend the for the remainder of the time slice. Overall, the expected amount technique to handle an online environment with dynamic query ar- of CPU time that each batch receives is proportional to its number rival, as in a data warehouse supporting a stream of analyst queries. of tickets, and hence to its number of queries. We still want to run queries in batches, with the combined work- ing set of each batch fitting in the L2 cache to avoid agg-table 4.5.2 Avoiding Starvation thrashing. The basic methods from the previous section on esti- To prevent starvation, our implementation of BatchSharing en- mating the agg-table size of each query and on packing queries forces an upper bound tw on the amount of time a query can be in into batches still apply. But we need to form and maintain batches the unassigned state. At the same time, we want to keep queries in for a dynamic query stream. We have two choices in how we form the staging area as long as possible, so as to maximize the opportu- and maintain batches of queries for a dynamic query stream: nities for effective bin-packing. We achieve a compromise between • If a batch X of queries is running and a new query q arrives, add these two factors by tracking the original arrival time of each unas- q to X if the working sets of X and q together fit in cache. signed query. • Once a batch of queries has started running, treat it as immutable. During query processing, the staging area is left untouched until The first option could potentially give higher throughput, but re- one of the following occurs: quires additional book-keeping to track which queries in a batch • No more active queries remain, or have operated on which blocks. To keep our design simple, we • A query has spent more than tw time in the staging area. choose the second option. When either of these events happens, it triggers the following Our dynamic approach to BatchSharing works as follows: At sequence of actions: any point in time, the queries in the system fall into two categories: 1. Pack all the unassigned queries into batches. active and unassigned. Active queries have been assigned to query 2. Activate any batch containing a query that has spent more than batches; these active batches are in the process of being executed tw time in the pool. over shared scans. Unassigned queries are not yet part of a batch; 3. Activate a batch if there are still no active batches. these queries reside in a special staging area until they are assigned 4. Return the remaining queries to the staging area. to a batch. Dynamic workloads arise primarily in interactive applications, with concurrent users submitting queries from their individual con- 5. WORKING-SET SIZE ESTIMATION soles. It is important for these users to see consistent query re- As discussed in Section 4.4.1, we need to classify queries as sponse times. To function correctly in such an environment, a query always-share, never-share, or could-share and, for the could-share processor must schedule queries fairly and avoid starvation. Our queries, estimate the working-set (WS) size. We now describe our dynamic BatchSharing implementation targets two kinds of fair- sampling-based classification and estimation algorithm. ness: We first introduce some notation. For a specified query q and real • Fair scheduling: On average, every active query receives an number γ ∈ [0, 1], a working set Wγ (q) is defined as a minimal set equal fraction of CPU time to within a constant multiplicative of rows in the agg-table—not necessarily unique—that accounts factor d. for 100γ% of rows in the answer to q after predicates have been • No starvation: As long as the system is not overloaded, the applied but prior to grouping. I.e., if the cache comprises the rows amount of time that a query can be in the unassigned state is in Wγ (q), then the cache-hit rate for query q (in isolation) will be strictly bounded. 100γ%. (We use a value of γ = 0.8 in our prototype.) Given a value of γ, we wish to (1) classify a query as always-share if 4.5.1 Scheduling Queries Fairly its selectivity σ is less than a threshold σ ∗ , (2) classify a query as Since the queries in a given batch share a scan, it follows that never-share if the WS size will clearly exceed the space threshold every query in the batch must complete at the same time. Should a d∗ = B − C allotted for the agg-tables, and (3) otherwise compute batch contain both fast and slow queries, the faster queries will re- |Wγ (q)| for purposes of bin packing. ceive a smaller slice of the CPU, violating our fair scheduling goal. To avoid expensive table scans, we sample the table T of interest, To avoid this problem, we incorporate constraints on query running and merely estimate σ and |Wγ (q)|. The idea is to execute the time into our bin-packing algorithm. A given pair of queries are al- classification steps (1)–(3) above, but modify each step to take into lowed to share a batch only if their running times differ by a factor account the uncertainty due to sampling, using an “indifference- of less than d. We choose d experimentally in Section 6.4. Since zone” approach, which we now describe. Set xi = 1 if the ith row we do table scans, query running times can be easily estimated from of T satisfies the predicates in q, and xi = 0 otherwise, so that P|T | P|T | running the query on a sample. σ = (1/|T |) i=1 xi . Also set α2 = (1/|T |) i=1 (xi − σ)2 . ∗ Another obstacle to fairness is the relative weight of different To determine whether σ < σ , we take a simple random sample batches in scheduling the activities of the CPU’s cores. If two of n rows from table T and apply the predicates in q to the sam- batches of unequal size receive equal slices of CPU time, the queries pled rows. Set Xi = 1 if the ith sampled row satisfies the predi- in the smaller batch will receive a greater share of CPU. To avoid cates in q, and Xi = 0 otherwise. We then estimate σ by σ ˆn = 616

8.(1/n) n ˆn < σ ∗ − ǫn . P i=1 Xi , and classify q as always-share if σ V ≥ γ, we stop the sampling process and use the number of rows The formulas for n and ǫn are given below, and are chosen so that in W as the estimate of the working-set size. The idea is that the the probability of a “type-1” or “type-2” error is less than a user- most frequent grouping values will appear in W , so that W will be specified threshold p. A type-1 error occurs if σ > σ ∗ + δ2 but approximately minimal; more elaborate approaches are possible, ˆ < σ ∗ − ǫn , where δ2 is an “indifference” constant. That is, a σ but experiments indicate that our proposed technique is adequate type-1 error occurs if σ lies “significantly” above σ ∗ , as measured for our task. As with query selectivity, the test of whether or not by δ2 , but our procedure, which uses σ ˆ , incorrectly classifies query V ≥ γ is modified to take into account the uncertainty introduced q as always-share. Similarly, a type-2 error occurs if σ < σ ∗ − δ1 by sampling, using an indifference-zone approach. ˆ > σ ∗ − ǫn . If σ lies in the interval [σ ∗ − δ1 , σ ∗ + δ2 ], then but σ In more detail, when W contains n elements, we estimate the we can tolerate a misclassification. In general, the repercussions of coverage V by Vˆn = 1 − f1 /n, where fj [1 ≤ j ≤ |D(W )|] is a type-1 error are much more serious than those of a type-2 error. the number of distinct grouping values that appear exactly j times Based on preliminary experiments, we found that suitable values in W ; see [9] for a discussion of this estimator, which is originally of the foregoing constants are given by σ ∗ = 0.001, δ1 = σ ∗ , and credited to Turing. Choose an indifference zone of the form [γ − δ2 = 0.099. δ1′ , γ + δ2′ ] and set Specifically, we set „ «2 2βn z1−p „ «2 n′ = ∨ nmin , 2αz1−p δ2′ n= ∨ nmin , δ1 and and „ βn z1−p «+ „ «+ ǫ′n = √ − δ1′ , αz1−p n ǫn = √ − δ2 n where βn = (f1 /n) + 2(f2 /n) − (f1 /n)2 . Then, provided that where nmin ≈ 500, zx is the 100x% quantile of the standard (mean |W | ≥ n′ , declare that V ≥ γ if and only if Vˆ|W | > γ + ǫ′|W | . 0, variance 1) normal distribution, x ∨ y = max(x, y), and x+ = An argument similar to the one given above shows that, to a good max(x, 0). Note that the constant α appearing in the above for- approximation, the probability of a type-1 or type-2 error will be at mulas is unknown; in practice we use a small pilot P sample of size most p. (The key difference from the prior argument is that we use m = nmin to estimate α by α ˆ 2m = (m − 1)−1 m i=1 (Xi − σˆ m )2 . a CLT for the coverage estimator due to Esty [9], rather than the To see that use of the foregoing values achieves (approximately) standard CLT.) In our prototype, we use indifference-zone values the desired error control, observe that of δ1′ = 0.05 and δ2′ = 0.10. The overall technique is given as Algorithm 1. In the algorithm, ˆn < σ ∗ − ǫn } P { type-1 error } = P { σ the function D ISTINCT(W ) computes the number of distinct el- ∗  ff ˆn − σ σ σ −σ ǫn ements in W , and N UM W ITH F REQ(W, i) computes the quantity =P √ < √ − √ α/ n α/ n α/ n fi defined previously. The function S AMPLE(T, n) takes a simple ( „ «+ ) random sample of n rows from table T , without replacement. The ˆn − σ σ −δ2 δ2 function I NCREMENT S AMPLE(W, T, i) repeatedly samples from ≤P √ < √ − z1−p − √ α/ n α/ n α/ n T until a sampled tuple survives the predicates in q. This tuple is  ˆn − σ σ ff then projected onto the grouping attributes and added to W . The ≤P √ < −z1−p ≈ p, sampling from T is incremental within and between function calls; α/ n the variable i records the cumulative number of tuples that have where the last ≈ follows from the central limit theorem (CLT) and been sampled from T over all calls to I NCREMENT S AMPLE. the definition of z1−p . (Our choice of nmin attempts to ensure the We achieve efficiency by precomputing a sample T ′ of 100k accuracy of the CLT approximation.) Similarly, rows, and storing them in random order, so that incremental sam- pling of T corresponds to a simple scan of T ′ . We set nmax = |T ′ |, P { type-2 error } = P { σˆn > σ ∗ − ǫn } so that if the sample becomes exhausted at any point (lines 14 and σ∗ − σ  ff ˆn − σ σ ǫn 25), the algorithm terminates and conservatively categorizes query =P √ > √ − √ α/ n α/ n α/ n q as never-share. In practice, the same sample T ′ can be used for  ff both the selectivity test (pilot and regular samples) and the WS size- ˆn − σ σ δ1 ≤P √ > √ − z1−p estimation phase, without much adverse impact on the effectiveness α/ n α/ n of BatchSharing. Finally, note that, in line 26, D ISTINCT(W ) is es-  ff ˆn − σ σ sentially a lower bound on the size of the working set, so that the ≤P √ > z1−p ≈ p, α/ n test in line 26 indeed identifies whether q is a never-share query. To obtain a reasonable estimate of the working-set size for a query q, we incrementally maintain a uniform multiset sample W 6. EXPERIMENTS of grouping values by incrementally sampling table T ; for each We now present a detailed performance evalution of our scan- sampled tuple, we apply all of the predicates in q and, if the tuple sharing algorithms. The evaluation proceeds in roughly the same survives, project it onto the grouping attributes before adding it to order that we have described the algorithms: W . After each incremental sampling step, we estimate the cover- Extent of Thrashing: Early in this paper we introduced a fairly age V of the set D(W ) of distinct grouping values in W . Denoting simple algorithm, FullSharing, and then we argued that it leads by T ∗ the reduced version of T obtained byPapplying the selection to thrashing of agg-tables. Does it? How bad is the thrashing predicates in q, we define the coverage as i∈D(W ) πi , where πi (Section 6.1)? Can we avoid this thrashing just by adding a sim- is the fraction of rows in T ∗ whose grouping values match the ith ple admission-control scheme to FullSharing (Section 6.2)? value in D(W ); see [9] for a discussion of coverage. As soon as Mixed Workloads: BatchSharing is much more fancy than Full- 617

9.Algorithm 1 Query classification and WS size estimation The queries used in our experiments have the following template 1: T, q: table and query under consideration (all column names are anonymized): 2: σ ∗ , γ: selectivity and WS-size cutoff values SELECT SUM(col1 ), . . . SUM(col12 ) FROM table 3: d∗ = C − B: agg-table threshold for never-share WHERE conjunction of single column predicates 4: δ1 , δ2 , δ1′ , δ2′ : indifference-zone values GROUP BY grouping cols 5: p: maximum allowed error probability 6: nmin , nmax : minimum and maximum sample sizes 6.1 Tackling Agg-Table Thrashing 7: For our first experiment, we ran workloads with varying numbers 8: // test selectivity of queries but homogeneous agg-table sizes. All the queries in a 9: m ← nmin given workload have the same GROUP-BY clause, chosen so as to 10: T ′ ← S AMPLE P (T, m) // take pilot sample achieve a specific size for the query agg-table. The predicates were ˆ ← m−1 m 11: σ i=1 Xi chosen to be non-selective (about 98% selectivity), so that almost ´1/2 ˆ ← (m − 1)−1 m ˆ )2 ` all tuples participate in aggregation. P 12: α i=1 (Xi − σ 13: n ← (2αz ˆ 1−p /δ1 )2 ∨ nmin Figure 7(a) plots the throughput of each workload: increasing the 14: if n > nmax then return “never-share” number of queries on the x-axis, with a separate curve for each agg- 15: T ′ ← S AMPLE table size. BS refers to BatchSharing and FS to FullSharing. For P (T, n) // take actual sample instance, ”BS-50KB” is BatchSharing for queries with 50KB agg- ˆ ← n−1 n 16: σ i=1 Xi ´1/2 table size. The total agg-table size for each workload is (number of ˆ ← (n − 1)−1 n ˆ )2 ` P 17: α i=1 (Xi − σ queries) × (query agg-table size). + ˆ 1−p n−1/2 ) − δ2 ` ´ 18: ǫ ← (αz Observe that the throughput using FullSharing (the dotted curves) ∗ 19: if σˆ < σ − ǫ then return “always-share” // selectivity test starts to drop when the total working-set size exceeds 1.6MB, which 20: maps well to the 2MB L2 cache available per core (the remainder 21: // estimate WS size holds one block of the table being scanned). BatchSharing, on the 22: W ← ∅; i ← 0 // initialize other hand, does not exhibit any such thrashing. We have noted 23: while true do from logs that it behaves like FullSharing up to the thrashing point, 24: I NCREMENT S AMPLE(W, T, q, i) and starts partitioning queries into batches thereafter. 25: if i > nmax then return “never-share” Observe also that the throughput of BatchSharing does plateau, 26: if D ISTINCT(W ) > d∗ then return “never-share” but at different numbers of queries for different agg-table sizes: 27: f1 ← N UM W ITH F REQ(W, 1) about 16 queries for 800KB agg-tables, about 64 for 200KB agg- 28: f2 ← N UM W ITH F REQ(W, 2) tables. By turning on logging, we have found that this corresponds ´1/2 β ← (f1 /n) + 2(f2 /n) − (f1 /n)2 ` 29: to the point when the queries are partitioned into 8 batches: at this 30: n′ ← (2βz1−p /δ2′ )2 ∨ nmin point each core gets its own batch and we cannot improve perfor- 31: if |W | ≥ n′ then // is |W | big enough for testing? mance any further. 32: Vˆ ← 1 − (f1 /|W |) Figure 7(b) plots a more detailed version of the results for the ´+ same experiment, showing more query agg-table sizes. The x-axis ǫ′ ← (βz1−p n−1/2 ) − δ1′ ` 33: shows the total agg-table size of the workload: notice that through- 34: if Vˆ > γ + ǫ′ then return D ISTINCT(W ) put keeps increasing well beyond the 1.6MB point at which Full- 35: end if Sharing peaked. 36: end while 6.2 Mixed workloads The homogeneous workload used in the last experiment is be- Sharing. Focusing purely on throughput, how good is it? Es- coming less and less common as users consolidate different appli- pecially, what happens with a mixed workload of queries with cations against the same DBMS. For example, queries whose re- small and large working sets? Is BatchSharing able to correctly sults are to be shown to a human being usually have a small number batch the queries (Section 6.2)? of groups and hence small agg-table sizes, since they must fit on a Interaction of Selectivity with Working Set: How does selectiv- screen, whereas drill-down OLAP queries involve fine-granularity ity impact the thrashing of agg-tables? Were we right to model grouping clauses that lead to large agg-table sizes. In real-world queries with low selectivity as not contributing to the working scenarios, the ratio of small-result queries to large-result queries set of a batch (Section 6.3)? varies across systems and over time. So a natural question is whether BatchSharing is able to handle Fairness: How well does BatchSharing maintain fairness among such mixed workloads. To test BatchSharing’s robustness, we gen- concurrent queries (Section 6.4)? erate mixed workloads with queries of two agg-table sizes: 1.6MB Putting it all together: We started this paper with a problem of and 50KB. Each workload has 100 queries, with the ratio of large scaling on a multicore system. Have we solved this problem to small varying from 1:1 to 1:32. (Section 6.5)? This experiment also tests another hypothesis. Based on the Setup: Our test machine has two Intel Xeon quad-core CPUs (2.66 results in Figure 7(a), one might think that FullSharing can be GHz/core). The memory hierarchy is 16GB of RAM, 4MB L2 fixed just by adding an admission control scheme that permits n cache (shared between two cores), and 64KB each of L1 data cache concurrent queries at a time. Call such a scheme F ullSharing- and instruction cache per core. n. We show a full comparison among NaiveSharing, FullSharing, We use an actual customer dataset, CUST1 for our experiments. FullSharing-2, FullSharing-4, and BatchSharing in Figure 8. Ob- The denormalized table has 28M rows, and takes up 25GB uncom- serve that none of the FullSharing schemes with admission con- pressed in a traditional commerical DBMS. Blink’s compression trol is always better than the others. So there is no magic admis- reduces this to 2.8GB, an amount that comfortably fits in memory. sion control value for FullSharing. Moreover, FullSharing with 618

10. 2.5 be ignored while estimating the working set size of a batch. We run queries with selectivity varying from 100% from 0.01%, using 2 FullSharing. FullSharing does no batching, so this shows us the Throughput speedup BS-50KB extent of agg-table thrashing at various selectivities. 1.5 BS-200KB BS-800KB FS-50KB Query agg-table size agg-table size 1 FS-200KB selectivity in Fig. 9(a) in Fig. 9(b) FS-800KB 100% 0.9MB 0.22MB 0.5 10% 0.9MB 0.22MB 1% 0.84MB 0.22MB 0 0.1% 0.61MB 0.21MB 1 2 4 8 16 32 64 128 0.01% 0.19MB 0.12MB Number of queries in a static workload (a) Comparing FullSharing with BatchSharing Table 2: Working-set sizes in Figure 9 7 Throughput (queries/sec) 6 query 3 agg-table 5 size query Throughput speedup 2.5 selectivity 30KB 4 50KB 2 0.01% 100KB 0.10% 3 200KB 1.5 1% 400KB 10% 2 800KB 1 100% 1.6MB 1 3.2MB 0.5 C-B=1.6MB 0 0 0.03 0.05 0.1 0.22 0.45 0.8 1.6 3.2 6.4 12.8 25.6 1 2 4 8 16 32 64 Total Agg-table Size (MB) # of queries in a static workload (b) No Agg-table Thrashing using BatchSharing (a) 0.9MB working-set 3 Figure 7: BatchSharing Throughput query Throughput speedup 2.5 selectivity 2 0.01% admission control is not a replacement for BatchSharing, because 0.10% BatchSharing always outperforms such a scheme. 1.5 1% 10% The other important point from Figure 8 is that the benefit of 1 100% BatchSharing varies when the query ratio changes. The throughput 0.5 improvement of BatchSharing decreases when there is a larger pro- portion of queries with a large working set. In practice, a BI system 0 usually runs many more analytical queries than reporting queries; 1 2 4 8 16 32 64 such a workload favors BatchSharing. Overall, BatchSharing out- # of queries in a static workload performs other sharing schemes by up to a factor of 2.5. (b) 0.2MB working-set NaiveSharing Full-Sharing-2 Full-Sharing-4 Full-sharing BatchSharing 2.5 Figure 9: FullSharing vs query selectivity Throughput speedup 2 Figure 9 shows the throughput speedup for queries under two 1.5 situations: in Figure 9(a), the agg-table size before predicates are applied is 0.9MB, and in Figure 9(b) it is 0.2MB. But this size only 1 applies to the 100% selectivity queries. When the WHERE clause is changed to reduce the selectivity, the agg-table size reduces be- 0.5 cause some groups get no tuples. Table 2 lists the agg-table sizes 0 for various selectivities. 1:1 1:2 1:4 1:8 1:16 1:32 Observe that for the queries with 1% to 100% selectivity, the Ratio of queries of large WS to small WS agg-table size does not differ much. But the peak throughputs are very different: e.g, in Figure 9(a), throughput peaks with 4 queries Figure 8: Throughput Speedup in a Mixed Query Workload at 10% selectivity and with 2 queries with 100% selectivity. This phenomenon arises because queries with low selectivity do very few agg-table IOs for each base table I/O: most tuples fail the pred- 6.3 Impact of Selectivity on Thrashing icate and never reach the aggregation stage. So the extent of agg- Our next goal is to validate the assumption in Section 5 that table thrashing is low for such queries; at selectivities below 0.1%, queries with low selectivity do not contribute to thrashing and can the agg-table thrashing is negligible. 619

11.6.4 Fairness and Starvation Avoidance Using of the regression line is displayed in each figure, along with the R2 Dynamic Grouping Algorithm goodness-of-fit statistic. (Values of R2 between 0.8 and 1 indicate We now turn our attention from throughput to fairness and star- a good fit.) vation avoidance. Instead of a static workload, we generate queries Figure 10(a) shows the result with d = 2. Notice that the data on the fly at an arrival rate of 6 queries/second and feed this stream points are naturally clustered into two groups. Queries from this into the dynamic BatchSharing scheduler. Queries are randomly workload form two types of batches: one for short running queries generated, varying two parameters: the predicates are varied to and the other for long running queries. Across batches, the lottery achieve selectivities ranging from 0.01% to 98%, and the group-by scheduler ensures that the batches with short queries get the same columns are varied to achieve agg-table sizes ranging from 0.2MB share of the CPU as the batches with long queries. Figure 10(b) to 0.9MB. This ensures a mix of: short-running queries (with low shows the result with a smaller d = 1.25. The data points form selectivity and small agg-table sizes), long-running queries (with more clusters when compared to Figure 10(a). Moreover, the data high selectivity and large agg-table sizes), and medium queries (the points are more closely clustered around the trendline, with a R2 rest). The individual query execution times vary from 81 millisec- of 0.9, compared with R2 = 0.8 for d = 2. onds to 554 milliseconds. At the same time, for both d = 2 and d = 1.25, all the queries As described in Section 4.5, to achieve fair scheduling, we use a are present in the plot, so no starvation occurs. multiplicative factor d to bound the (estimated) the execution times Overall, the results are as we would expect. Smaller values of d of queries within a batch. We use this experiment to study the sen- result in more fairness (values tightly clustered around a line), but sitivity to the value of d and pick a good number. We report both both values of d avoid starvation. fairness and throughput in the following. 6.4.2 Throughput 6.4.1 Fairness and Starvation Avoidance 8 Throughput (queries/second) Concurrent query execution time 25000 6 20000 4 15000 (ms) 2 10000 0 5000 y=30x 2 1.5 1.25 1 NaiveSharing R2=0.8 Factor d 0 0 100 200 300 400 500 600 Figure 11: Dynamic Grouping Throughput vs Query Range Individual query execution time (ms) (1/d, d) (a) d = 2 25000 We choose a good value of d based on throughput. Figure 11 concurrent query execution time reports throughput for dynamic grouping algorithm with d varying between 1 and 2. For the dynamic grouping algorithm, the sys- 20000 tem throughput decreases with smaller d, because that leads to a smaller pool of queries that are available to form a batch. However, 15000 throughput only differs 12% between d = 2 and d = 1.25, whereas (ms) it differs 50% between d = 2 and d = 1. So, we pick d = 1.25 as 10000 a sweet spot to reach a balance between throughput and fairness. y=29x The same plot also shows a variant of NaiveSharing which takes 5000 R2=0.9 a query stream as input. Comparing dynamic grouping with d = 1.25 to NaiveSharing, we see that the system throughput of the 0 former is 2x of that of the latter. 0 100 200 300 400 500 600 Individual query execution time (ms) 6.5 Multi-core Scaling (b) d = 1.25 Our last experiment repeats the query from Figure 2 of the in- troduction, but using BatchSharing. Our goal is to see if the I/O bottleneck has been removed. Figure 10: Fairness Using Dynamic Scan-sharing Using the oprofile system profiler and Xeon hardware counters, we break the total CPU cycles into 5 components (Figure 12): com- To measure fairness, we look at the ratio of execution times for a putation, pipeline stall due to branch misprediction, L2 cache hit, query, when running concurrently vs when running standalone: if DTLB miss, and resource stall due to memory loads (using Xeon these ratios are identical for all queries we have perfect justice for counters: CPU CLK UNHALTED, RS UOPS DISPATCHED NONE, all. Figure 10 is a scatter plot of individual (x-axis) and concurrent MEM LOAD RETIRED (with mask 0x01 and 0x04), DTLB MISSES, (y-axis) query execution time. Each point represents a query. To and RESOURCE STALLS (with mask 0x10 and 0xf)). help visualize the degree of fairness, we also add a trendline, based The plot compares NaiveSharing and BatchSharing, run using 1 on a linear regression fit passing through the origin. The equation core and using 8 cores. With NaiveSharing on 8 cores much of 620

12. 5.00E+09 Total CPU Cycles of a query 8. REFERENCES 4.00E+09 [1] Intel Architecture Software Developers Manual, volume 2. Memory [2] D. J. Abadi, S. Madden, and M. Ferreira. Integrating 3.00E+09 DTLB miss compression and execution in column-oriented database L2 hit Branch Misprediction systems. In SIGMOD, pages 671–682, 2006. 2.00E+09 Computation [3] M. Blasgen, J. Gray, M. Mitoma, and T. Price. The convoy phenomenon. SIGOPS Oper. Syst. Rev., 13(2):20–25, 1979. 1.00E+09 [4] J. Chang and G. S. Sohi. Cooperative Cache Partitioning for 0.00E+00 Chip Multiprocessors. In Proceedings of Supercomputing 1 core 8 cores 1 core 8 cores (SC), pages 242–252, 2007. Naïve Naïve Scan- Scan- sharing sharing [5] J. Chen, D. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. Figure 12: CPU breakdown of BatchSharing In SIGMOD, 2000. [6] S. Chen, P. B. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. E. Blelloch, B. Falsafi, L. Fix, the total CPU cycles go to memory access, because the bandwidth N. Hardavellas, T. C. Mowry, and C. Wilkerson. Scheduling is saturated with all 8 cores issuing loads. For BatchSharing, the Threads for Constructive Cache Sharing on CMPs. In portion spent on memory access remains almost the same in going Proceedings of SPAA, pages 105–115, 2007. from 1 to 8 cores. BatchSharing also has more L2 cache hits than [7] J. Cieslewicz and K. A. Ross. Adaptive Aggregation on Chip NaiveSharing, due to sharing of base table IOs. Multiprocessors. In VLDB, pages 339–350, 2007. NaïveSharing BatchSharing [8] G. Dosa. The Tight Bound of First Fit Decreasing 8 Bin-Packing Algorithm Is FFD(I)=(11/9)OPT(I)+6/9. In ESCAPE, 2007. Throughput speedup 6 [9] W. W. Esty. A normal limit law for a nonparametric estimator of the coverage of a random sample. Ann. Statist., 11(8):905–911, 1983. 4 [10] P. J. Haas and L. Stokes. Estimating the number of classes in a finite population. J. Amer. Statist. Assoc., 93, 1998. 2 [11] S. Harizopoulos, V. Liang, D. J. Abadi, and S. Madden. Performance tradeoffs in read-optimized databases. In 0 VLDB, pages 487–498, 2006. 1 2 3 4 5 6 7 8 [12] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. QPipe: a Number of cores used simultaneously pipelined relational query engine. In SIGMOD, pages 383–394, 2005. Figure 13: Performance scaling on multi-cores [13] R. Johnson, N. Hardavellas, I. Pandis, N. Mancheril, S. Harizopoulos, K. Sabirli, A. Ailamaki, and B. Falsafi. To Figure 13 plots the speedup in throughput as a function of the share or not to share? In VLDB, pages 351–362, 2007. number of cores. NaiveSharing increases sub-linearly with increas- [14] S. Kim, D. Chandra, and Y. Solihin. Fair Cache Sharing and ing number of cores, saturating at 2.9. BatchSharing scales almost Partitioning in a Chip Multiprocessor Architecture. In PACT, linearly, to a factor of 7 on 8 cores. Certain phases of query pro- pages 111–122, 2004. cessing are not parallelizable in our system, such as query parsing and compilation, and the merging of partial agg-tables into a global [15] S. Krishnamurthy, M. J. Franklin, J. M. Hellerstein, and agg-table (as described in Section 2.2). This explains why use of G. Jacobson. The case for precision sharing. In VLDB, pages BatchSharing results in a speedup factor of 7, and not 8, for 8 cores. 972–984, 2004. [16] C. A. Lang, B. Bhattacharjee, T. Malkemus, S. Padmanabhan, and K. Wong. Increasing buffer-locality for multiple relational table scans through grouping and 7. CONCLUSION throttling. In ICDE, pages 1136–1145, 2007. The stories we have been hearing about the multicore trend have [17] V. Raman and G. Swart. How to wring a table dry: Entropy so far been mostly negative: clock speeds are decelerating, we have compression of relations and querying of compressed to write parallel programs, parallel programs do not scale easily, relations. In VLDB, pages 858–869, 2006. and enterprise software will perform poorly. The results of this [18] V. Raman, G. Swart, L. Qiao, F. Reiss, V. Dialani, paper suggest that the situation may not be so dire in all cases. In D. Kossmann, I. Narang, and R. Sidle. Constant-time query the context of a compessed database, we have shown a solution that processing. In ICDE, pages 60–69, 2008. achieves near-linear speedup of query throughput when running an [19] N. Roussopoulos. View indexing in relational databases. 8-query workload on a server with 8 cores. ACM Trans. Database Syst., 7(2):258–290, 1982. Looking forward, a board with two quad-core processors is a far cry from the GPUs with 100s of cores that are available today, or [20] C. A. Waldspurger and W. E. Weihl. Lottery scheduling: the CPUs with dozens of cores that are on the horizon. With such Flexible proportional-share resource management. In OSDI, aggressively multicore architectures, the amount of cache available pages 1–11, 1994. per core will decrease. An interesting direction for future work is to [21] M. Zukowski, S. H´eman, N. Nes, and P. A. Boncz. design aggregation tables that are tightly compressed, so that they Cooperative Scans: Dynamic Bandwidth Sharing in a fit in small caches, yet are efficiently updateable. DBMS. In VLDB, pages 723–734, 2007. 621