Sharing Data and Work Across Concurrent Analytical Queries

Today’s data deluge enables organizations to collect massive data, and analyze it with an ever-increasing number of concurrent queries.Traditional data warehouses (DW) face a challenging problem inexecuting this task, due to their query-centric model: each query is optimized and executed independently. This model results in high contention for resources

1. Sharing Data and Work Across Concurrent Analytical Queries Iraklis Psaroudakis Manos Athanassoulis Anastasia Ailamaki ´ Ecole Polytechnique Fed ´ erale ´ ´ Ecole Polytechnique Fed ´ erale ´ ´ Ecole Polytechnique Fed ´ erale ´ de Lausanne de Lausanne de Lausanne ABSTRACT analysis. The increase of processing power and the growing appli- Today’s data deluge enables organizations to collect massive data, cations’ needs have led to increased requirements for both through- and analyze it with an ever-increasing number of concurrent queries. put and latency of analytical queries over ever-growing datasets. Traditional data warehouses (DW) face a challenging problem in According to a recent study of the DW market [26], more than executing this task, due to their query-centric model: each query is half of DW have less than 50 concurrent users. Almost 40%, how- optimized and executed independently. This model results in high ever, of the companies using DW project that in 3 years they will contention for resources. Thus, modern DW depart from the query- have 200-1000 concurrent users. General-purpose DW, however, centric model to execution models involving sharing of common cannot easily handle analytical workloads over big data with such data and work. Our goal is to show when and how a DW should concurrency [3]. A limiting factor is their typical query-centric employ sharing. We evaluate experimentally two sharing method- model: DW optimize and execute each query independently. Con- ologies, based on their original prototype systems, that exploit work current queries, however, often exhibit overlapping data accesses or sharing opportunities among concurrent queries at run-time: Si- computations. The query-centric model misses the opportunities of multaneous Pipelining (SP), which shares intermediate results of sharing work and data, and results in performance degradation due common sub-plans, and Global Query Plans (GQP), which build to the contention of concurrent queries for I/O, CPU and RAM. and evaluate a single query plan with shared operators. First, after a short review of sharing methodologies, we show 1.1 Methodologies for sharing data and work that SP and GQP are orthogonal techniques. SP can be applied to A variety of ideas have been proposed to exploit sharing, in- shared operators of a GQP, reducing response times by 20%-48% in cluding buffer pool management techniques, materialized views, workloads with numerous common sub-plans. Second, we corrob- caching and multi-query optimization (see Section 2). More re- orate previous results on the negative impact of SP on performance cently, DW started sharing data at the I/O layer using shared scans for cases of low concurrency. We attribute this behavior to a bot- (with variants also known as circular scans, cooperative scans or tleneck caused by the push-based communication model of SP. We clock scan) [7, 8, 23, 29, 30]. In this paper, we evaluate work shar- show that pull-based communication for SP eliminates the over- ing techniques at the level of the execution engine. We distinguish head of sharing altogether for low concurrency, and scales better on two predominant methodologies: (a) Simultaneous pipelining (SP) multi-core machines than push-based SP, further reducing response [13], and (b) Global query plans (GQP) [2, 3, 4, 11]. times by 82%-86% for high concurrency. Third, we perform an ex- Simultaneous pipelining (SP) is introduced in QPipe [13], an perimental analysis of SP, GQP and their combination, and show operator-centric execution engine, where each relational operator when each one is beneficial. We identify a trade-off between low is encapsulated into a self-contained module called a stage. Each and high concurrency. In the former case, traditional query-centric stage detects common sub-plans among concurrent queries, evalu- operators with SP perform better, while in the latter case, GQP with ates only one and pipelines the results to the rest when possible (see shared operators enhanced by SP give the best results. Sections 2.2 and 2.3). Global query plans (GQP) with shared op- erators are introduced in the CJOIN operator [3, 4]. A single shared operator is able to evaluate multiple concurrent queries. CJOIN 1. INTRODUCTION uses a GQP, consisting of shared hash-join operators that evaluate Data warehouses (DW) are databases specialized for servicing the joins of multiple concurrent queries simultaneously. More re- on-line analytical processing (OLAP) workloads. OLAP work- cent research prototypes extend the logic to additional operators loads consist mostly of ad-hoc, long running, scan-heavy queries and to more general cases [2, 11] (see Sections 2.4 and 2.5). over relatively static data (new data is periodically loaded). Today, Figure 1 illustrates how a query-centric model, shared scans, SP, in the era of data deluge, organizations collect massive data for and a GQP operate through a simple example of three concurrent queries which perform natural joins without selection predicates Permission to make digital or hard copies of all or part of this work for and are submitted at the same time. The last two queries have a personal or classroom use is granted without fee provided that copies are common plan, which subsumes the plan of the first. We note that not made or distributed for profit or commercial advantage and that copies shared scans are typically used with both SP and GQP. bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present 1.2 Integrating Simultaneous Pipelining and their results at The 39th International Conference on Very Large Data Bases, Global Query Plans August 26th - 30th 2013, Riva del Garda, Trento, Italy. Proceedings of the VLDB Endowment, Vol. 6, No. 9 In order to perform our analysis and experimental evaluation of Copyright 2013 VLDB Endowment 2150-8097/13/07... $ 10.00. SP vs. GQP, we integrate the original research prototypes that in- 637

2. Q2 Q3 Q2 Q3 Q1 ‫ڇ‬ ‫ڇ‬ Q1 ‫ڇ‬ Q2 Q3 Q1 ‫ڇ‬ Q2, Q3 Q2 Q1 ‫ڇ‬ ‫ڇ‬ Q1: A ‫ ڇ‬B Q3 Q2: A ‫ ڇ‬B ‫ ڇ‬C ‫ڇ‬ ‫ڇ‬ ‫ڇ‬ ‫ڇ‬ ‫ڇ‬ Q1, Q2, Q3 ‫ڇ‬ ‫ڇ‬ ‫ڇ‬ Q3: A ‫ ڇ‬B ‫ ڇ‬C A B C A B C A B C A B A B C A B C Concurrent Queries (a) Query-centric model (b) Shared Scans (c) Simult. Pipelining (d) Global Query Plan Figure 1: Evaluation of three concurrent queries using (a) a query-centric model, (b) shared scans, (c) SP, and (d) a GQP. troduced them into one system: we integrate the CJOIN operator as With respect to SP, we corroborate previous related work [14, an additional stage of the QPipe execution engine (see Section 3). 23], that if SP entails a serialization point, then enforcing aggres- Thus, we can dynamically decide whether to evaluate multiple con- sive sharing does not always improve performance in cases of low current queries with the standard query-centric relational operators concurrency. Our newly optimized SP with SPL, however, elimi- of QPipe, with or without SP, or the GQP offered by CJOIN. nates the serialization point, making SP beneficial in cases of both Furthermore, this integration allows us to combine the two shar- low and high concurrency. ing techniques, showing that they are in fact orthogonal. As shown With respect to GQP, we corroborate previous work [2, 3, 4, 11] in Figure 1d, the GQP misses the opportunity of sharing common that shared operators are efficient in reducing contention for re- sub-plans, and redundantly evaluates both Q2 and Q3. SP can be sources and in improving performance for high concurrency (see applied to shared operators to complement a GQP with the addi- Section 5.2.1). The design of a shared operator, however, inher- tional capability of sharing common sub-plans (see Section 3). ently increases bookkeeping in comparison to the typical operators of a query-centric model. Thus, for low concurrency, we show that 1.3 Optimizing Simultaneous Pipelining shared operators result in worse performance than the traditional For the specific case of SP, it is shown in the literature [14, 23] query-centric operators (see Section 5.2.2). that if there is a serialization point, enforcing aggressive sharing Moreover, we show that SP can be applied to shared operators does not always improve performance. In cases of low concurrency of a GQP, in order to get the best out of the two worlds. SP can and sufficient available resources, it is shown that the system should reduce the response time of a GQP by 20%-48% for workloads first parallelize with a query-centric model before sharing. with common sub-plans (see Section 5.2.3). To calculate the turning point where sharing becomes beneficial, Sharing in the I/O layer. Though our work primarily studies shar- a prediction model is proposed [14] for determining at run-time ing inside the execution engine, our experimental results also cor- whether SP is beneficial. In this paper, however, we show that roborate previous work relating to shared scans. The simple case of the serialization point is due to the push-based communication em- a circular scan per table is able to improve performance of typical ployed by SP [13, 14]. We show that pull-based communication analytical workloads both in cases of low and high concurrency. In can drastically minimize the impact of the serialization point, and highly concurrent cases, response times are reduced by 80%-97% is better suited for sharing common results on machines with multi- in comparison to independent table scans (see Section 5.2.1). core processors. We introduce Shared Pages Lists (SPL), a pull-based sharing ap- Rules of thumb. Putting all our observations together, we deduce proach that eliminates the serialization point caused by push-based a few rules of thumb for sharing, presented in Table 1. Our rules sharing during SP. SPL are data structures that store the intermedi- of thumb apply for the case of typical OLAP workloads involving ate results of relational operators, and allow for a single producer ad-hoc, long running, scan-heavy queries over relatively static data. and multiple consumers. SPL make sharing with SP always ben- eficial and reduce response times by 82%-86% in cases of high How to share in the When concurrency, compared to the original push-based SP design and Execution Engine I/O Layer implementation [13, 14] (see Section 4). Low concurrency Query-centric operators + SP Shared Scans High concurrency GQP (shared operators) + SP 1.4 Simultaneous Pipelining vs. Global Query Plans Table 1: Rules of thumb for when and how to share data and Having optimized SP, and having integrated the CJOIN operator work across typical concurrent analytical queries in DW. in the QPipe execution engine, we proceed to perform an exten- sive analysis and experimental evaluation of SP vs. GQP (see Sec- 1.5 Contributions tion 5). Our work answers two fundamental questions: when and We perform an experimental analysis of two work sharing method- how an execution engine should share in order to improve perfor- ologies, (a) Simultaneous Pipelining (SP), and (b) Global Query mance of analytical workloads. Plans (GQP), based on the original research prototypes that intro- Sharing in the execution engine. We identify a performance trade- duce them. Our analysis answers two fundamental questions: when off between using a query-centric model and sharing. For a high and how an execution engine should employ sharing in order to im- number of concurrent queries, the execution engine should share, prove performance of typical analytical workloads. We categorize as it reduces contention for resources and improves performance in different sharing techniques for relational databases, and identify comparison to a query-centric model. For low concurrency, how- SP and GQP as two state-of-the-art sharing methodologies (Sec- ever, sharing is not always beneficial. tion 2). Next, our work makes the following main contributions: 638

3. • Integration of SP and GQP: We show that SP and GQP are hashing pages to the disks at random. During execution, it reads orthogonal, and can be combined to take the best of the two pages from the disks asynchronously but sequentially, thus aggre- worlds (Section 3). In our experiments, we show that SP can gating the throughput of a sequential scan on every disk of the array. further improve the performance of a GQP by 20%-48% for Sharing in the execution engine. By sharing work among queries, workloads that expose common sub-plans. we refer to techniques that avoid redundant computations inside the • Pull-based SP: We introduce Shared Pages Lists (SPL), a execution engine. A traditional query-centric DBMS typically uses pull-based approach for SP that eliminates the sharing over- query caching [28] and materialized views [24]. Both, however, do head of push-based SP. Pull-based SP is better suited for mul- not exploit sharing opportunities among in-progress queries. ticores than push-based SP, is beneficial for cases of both low Multi-Query Optimization (MQO) techniques [25, 27] are an and high concurrency, and further reduces response times by important step towards more sophisticated sharing methodologies. 82%-86% for high concurrency (Section 4). MQO detects and re-uses common sub-expressions among queries. There are two main disadvantages of classic MQO: (i) it operates • Evaluation of SP vs. GQP: We analyze the trade-offs of on batches of queries only during the optimization phase, and (ii) SP, GQP and their combination, and we detail through an ex- it depends on materializing shared intermediate results, at the ex- tensive sensitivity analysis when each one is beneficial (Sec- pense of memory. This cost can be alleviated by using pipelin- tion 5). We show that query-centric operators combined with ing [9], which additionally exploits the parallelization provided by SP result in better performance for cases of low concurrency, multi-core processors. The query plan is divided into sub-plans and while GQP with shared operators enhanced by SP are better operators are evaluated in parallel. suited for cases of high concurrency. Both SP and GQP leverage forms of pipelined execution and sharing methodologies which bear some superficial similarities with Paper Organization. This paper is organized as follows. Section MQO. These techniques, however, provide deeper and more dy- 2 consists of a review of sharing methodologies, SP, and GQP. Sec- namic forms of sharing at run-time. In the rest of this section, we tion 3 describes our implementation for integrating SP and GQP. provide an overview of SP and GQP, the systems that introduce Section 4 presents shared pages lists, our pull-based solution for them, and more recent research prototypes that advanced GQP. sharing common results during SP. Section 5 includes our exper- imental evaluation. Section 6 includes a short discussion. We 2.2 Simultaneous Pipelining present our conclusions in Section 7. SP identifies identical sub-plans among concurrent queries at run-time, evaluates only one and pipelines the results to the rest 2. WORK SHARING TECHNIQUES simultaneously [13]. Figure 2a depicts an example of two queries In this section, we provide the necessary background of sharing that share a common sub-plan below the join operator (along with techniques in the literature. We start by shortly reviewing related any selection and join predicates), but have a different aggrega- work, and continue to extensively review work related to SP and tion operator above the join. SP evaluates only one of them, and GQP, which compose our main area of interest. In Table 2, we pipelines the results to the other aggregation operator. summarize the sharing methodologies used by traditional query- Fully sharing common sub-plans is possible if the queries arrive centric systems and the research prototypes we examine. at the same time. Else, sharing opportunities may be restricted. The amount of results that a newly submitted Q2 can re-use from 2.1 Related Work the pivot operator (the top operator of the common sub-plan) of the Sharing in the I/O layer. By sharing data, we refer to techniques in-progress Q1, depends on the type of the pivot operator and the that coordinate and share the accesses of queries in the I/O layer. arrival of Q2 during Q1’s execution. This relation is expressed as The typical query-centric database management system (DBMS) a Window of Opportunity (WoP) for each relational operator (fol- incorporates a buffer pool and employs eviction policies [5, 16, 19, lowing the original acronym [13]). Figure 2b depicts two common 22]. Queries, however, communicate with the buffer pool manager WoP, a step and a linear WoP. on a per-page basis, thus it is difficult to analyze their access pat- A step WoP expresses that Q2 can re-use the full results of Q1 terns. Additionally, if multiple queries start scanning the same table if it arrives before the first output tuple of the pivot operator. Joins at different times, scanned pages may not be re-used. and aggregations have a step WoP. A linear WoP signifies that Q2 For this reason, shared scans have been proposed. Circular scans can re-use the results of Q1 from the moment it arrives up until the [7, 8, 13] are a form of shared scans. They can handle numerous pivot operator finishes. Then, Q2 needs to re-issue the operation in concurrent scan-heavy analytical queries as they reduce buffer pool order to compute the results that it missed before it arrived. Sorts contention and they avoid unnecessary I/O accesses to the underly- and table scans have a linear WoP. In fact, the linear WoP of the ing storage devices. Furthermore, more elaborate shared scans can table scan operator is translated into a circular scan of each table. be developed for servicing different fragments of the same table or different groups of queries depending on their speed [18, 30], and Q2 gain Q2 gain Q1 Q2 (a) (b) for main-memory scans [23]. Shared scans can be used to handle a large number of concurrent Σ Σ 100% 100% updates as well. The Crescando [29] storage manager performs Re-use a circular scan over memory-resident table partitions, interleaving ‫ ڇ‬results ‫ڇ‬ Cancel the reads and the updates of a batch of queries along the way. The identical 0% 100% 0% 100% scan first executes the update requests of the batch for a scanned Α Β Α Β sub-plan Q1 progress Q1 progress tuple in their arrival order, and then the read requests. Shared scans, however, are not immediately translated to a fast linear scan of a disk. The DataPath system [2], which uses a disk Figure 2: (a) SP example with two queries having a common array as secondary storage, stores relations column-by-column by sub-plan below the join operator. (b) A step and a linear WoP. 639

4. Traditional System QPipe CJOIN DataPath SharedDB query-centric model Sharing in the Query Caching, Simultaneous Global Query Plan Global Query Plan Global Query Plan execution engine Materialized Views, MQO Pipelining (joins of Star Queries) (with Batched Execution) Sharing in the Buffer pool Circular scan of Circular scan of Asynchronous linear Circular scan of in-memory I/O layer management techniques each table the fact table scan of each disk table partitions Special I/O Subsystem Crescando Storage Manager Any Any Any (read-only requests) (read and update requests) Table 2: Sharing methodologies employed by a query-centric model and the research prototypes we examine. 2.3 The QPipe execution engine uate two queries. It starts with the build phase by receiving tuples QPipe [13] is a relational execution engine that supports SP at ex- from the shared selection operator of the inner relation. Then, the ecution time. QPipe is based on the paradigms of staged databases probe phase begins by receiving tuples from the shared selection [12]. Each relational operator is encapsulated into a self-contained operator of the outer relation. The hash-join proceeds as normal, module called a stage. Each stage has a queue for work requests by additionally performing a bitwise AND operation between the and employs a local thread pool for processing the requests. bitmaps of the joined tuples. An incoming query execution plan is converted to a series of The most significant advantage is that a single shared operator inter-dependent packets. Each packet is dispatched to the relevant can evaluate many similar queries. For example, a shared hash- stage for evaluation. Data flow between packets is implemented join can evaluate queries having the same equi-join predicate, and through FIFO (first-in, first-out) buffers and page-based exchange, possibly different selection predicates. In the worst case, the union following a push-only model with pipelined execution. The buffers of the selection predicates may force it to join the whole two re- also regulate differently-paced actors: a parent packet may need to lations. The disadvantage of a shared operator in comparison to a wait for incoming pages of a child and, conversely, a child packet query-centric one is that it entails increased bookkeeping. For ex- may wait for a parent packet to consume its pages. ample, a shared hash-join maintains a hash table for the union of the This design allows each stage to monitor only its packets for tuples of the inner relation selected by all queries, and performs bit- detecting sharing opportunities efficiently. If it finds an identical wise operations between the bitmaps of the joined tuples. For low packet, and their interarrival delay is inside the WoP of the pivot concurrency, as shown by our experiments (see Section 5), query- operator, it attaches the new packet (satellite packet) to it (host centric operators outperform shared operators. A similar tradeoff is packet). While it evaluates the host packet, SP copies the results found for the specific case of shared aggregations on CMP [6]. of the host packet to the output FIFO buffer of the satellite packet. By using shared scans and shared operators, a GQP can be built for evaluating all concurrent queries. GQP are introduced by CJOIN 2.4 Global Query Plans with shared operators [3, 4], an operator based on shared selections and shared hash-joins for evaluating the joins of star queries [17]. GQP are advanced by SP is limited to common sub-plans. If two queries have similar the DataPath system [2] for more general schemas, by tackling the sub-plans but with different selection predicates for the involved issues of routing and optimizing the GQP for a newly incoming tables, SP is not able to share them. Nevertheless, the two queries query. DataPath also adds support for a shared aggregate operator, still share a similar plan that exposes sharing opportunities. It is that calculates a running sum for each group and query. possible to employ shared operators, where a single shared opera- Both CJOIN and DataPath handle new queries immediately when tor can evaluate both queries simultaneously. The basic technique they arrive. This is feasible due to the nature of the supported for enabling them is sharing tuples among queries and correlating shared operators: selections, hash-joins and aggregates. Some op- each tuple to the queries, e.g. by annotating tuples with a bitmap, erators, however, cannot be easily shared. For example, a sort op- whose bits signify if the tuple is relevant to one of the queries. erator cannot easily handle new queries that select more tuples than The simplest shared operator is a shared selection, that can eval- the ones being sorted [2]. To overcome this limitation, SharedDB uate multiple queries that select tuples from the same relation. For [11] batches queries for every shared operator. Batching allows each received tuple, it toggles the bits of its attached bitmap accord- standard algorithms to be easily extended to support shared opera- ing to the selection predicates of the queries. A hash-join can also tors, as they work on a fixed set of tuples and queries. SharedDB be easily shared by queries that share the same equi-join predicate supports shared sorts and various shared join algorithms, not be- (more relaxed requirements are also possible [2]). Figure 3 shows a ing restricted only to equi-joins. Nevertheless, batched execution conceptual example of how a single shared hash-join is able to eval- has drawbacks: a new query may suffer increased latency, and the latency of a batch is dominated by the longest-running query. Q1: SELECT A.c2, B.c2 FROM A, B Q2: SELECT Β.c3 FROM A, B WHERE A.c1 = B.c1 WHERE A.c1 = B.c1 AND A.c2 > 10 AND B.c2 < 5 AND A.c1 < 12 AND B.c2 > 3 2.5 The CJOIN operator The selection of CJOIN [3, 4] for our analysis, is based on the 8 16 8 23 “a” 01 bitmap facts that it introduced GQP, and that it is optimized for the sim- A.c1 A.c2 B.c1 B.c2 B.c3 ple case of star queries. Without loss of generality, we restrict our probe build evaluation to star schemas, and correlate our observations to more 8 16 11 ‫ڇ‬ 8 23 “a” 01 general schemas (used, e.g., by DataPath [2] or SharedDB [11]). σ Α + bitwise AND Β σ Star schemas are very common for organizing data in relational DW. They allow for numerous performance enhancements [17]. A star schema consists of a large fact table, that stores the mea- Figure 3: Example of shared selection and hash-join operators. sured information, and is linked through foreign-key constraints to 640

5. Q1: SELECT A, B, … FROM F, D1 Q2: SELECT A, B, … FROM F, D1, D2 going query for the new query. If we assume that the top-most WHERE F ‫ ڇ‬D1 AND σ(D1) WHERE F ‫ ڇ‬D1 ‫ ڇ‬D2 AND σ’(D1) AND σ’(D2) operators in a query plan have a full step WoP (e.g. when final re- GQP of Q1: F ‫ ڇ‬D1 Q1 sults are buffered and given wholly to the client instead of being CJOIN F ‫ڇ‬ ‫ ڇ‬Q2: (F ‫ ڇ‬D1) ‫ ڇ‬D2 Q2: F ‫ ڇ‬D1 Q2 pipelined), the new query does not need to participate at all in the GQP, independent of its time of arrival during the ongoing query’s Q1: σ(D1) σ σ Q2: σ’(D2) evaluation. This is the case where the integration of SP and GQP Evaluation of Q2: σ’(D1) GQP with a offers the maximum performance benefits. Additionally, admission single pipeline D1 D2 costs are completely avoided, the tuples’ bitmaps do not need to be Q1 Q2 extended to accommodate the new query (translating to fewer bit- wise operations), and the latency of the new query is decreased to F Preprocessor Filter Filter Distributor the latency of the remaining part of the ongoing query. Figure 4: CJOIN evaluates a GQP for star queries. Shared selections. If a new query has the same selection predicate as an ongoing query, SP allows to avoid the redundant evaluation of the same selection predicate from the moment the new query smaller dimension tables. A star query is an analytical query over arrives until the end of evaluation of the ongoing query (a selection a star schema. It typically joins the fact table with several dimen- operator has a linear WoP). For each tuple, SP copies the resulting sion tables and performs operations such as aggregations or sorts. bit of the shared selection operator for the ongoing query, to the CJOIN evaluates the joins of all concurrent star queries, using position in the tuple’s bitmap that corresponds to the new query. a GQP with shared scans, shared selections and shared hash-joins. Figure 4 shows the GQP that CJOIN evaluates for two star queries. Shared joins. If a new query has a common sub-plan with an ongo- CJOIN adapts the GQP with every new star query. If a new star ing query under a shared join operator, and arrives within the step query references already existing dimension tables, the existing WoP, SP can avoid extending tuples’ bitmaps with one more bit for GQP can evaluate it. If a new star query joins the fact table with a the new query for the sub-plan. The join still needs to be evaluated, new dimension table, the GQP is extended with a new shared se- but the number of bitwise operations can be decreased. lection and hash-join. Due to the semantics of star schemas, the Shared aggregations. If a new query has a common sub-plan with directed acyclic graph of the GQP takes the form of a chain. an ongoing query under a shared aggregation operator, and arrives CJOIN exploits this form to facilitate the evaluation of the GQP. within the step WoP, SP avoids calculating a redundant sum. It It materializes the small dimension tables and stores in-memory the copies the final result from the ongoing query. selected dimension tuples in the hash tables of the corresponding shared hash-joins. Practically, for each dimension table, it groups Admission costs. For every new query submitted to a GQP, an the shared scan, selection and hash-join operators into an entity admission phase is required that possibly re-adjusts the GQP to ac- called filter. When a new star query is admitted, CJOIN pauses, commodate it. In case of common sub-plans, SP can avoid part of adds newly referenced filters, updates already existing filters, aug- the admission costs. The cost depends on the implementation. ments the bitmaps of dimension tuples according to the selection For CJOIN [3, 4], the admission cost of a new query includes (a) predicates of the new star query, and then continues. Parts of the scanning all involved dimension tables, (b) evaluating its selection admission phase, such as the scan of the involved dimension tables, predicates, (c) extending the bitmaps attached to tuples, (d) increas- can be done asynchronously while CJOIN is running [4]. ing the size of hash tables of the shared hash-joins to accommodate Consequently, CJOIN is able to evaluate the GQP using a sin- newly selected dimension tuples (if needed), and (e) stalling the gle pipeline: the preprocessor uses a circular scan of the fact table, pipeline to re-adjust filters [3, 4]. For identical queries, SP can and flows fact tuples through the pipeline. The data flow in the avoid these costs completely. For queries with common sub-plans, pipeline is regulated by intermediate buffers, similar to QPipe. The SP can avoid parts of these costs, such as avoiding scanning dimen- filters in-between are actually the shared hash-joins that join the sion tables for which selection predicates are identical. fact tuples with the corresponding dimension tuples and addition- For DataPath [2], SP can decrease the optimization time of the ally perform a bitwise AND between their bitmaps. At the end GQP if it assumes that the common sub-plan of a new query can of the pipeline, the distributor examines the bitmaps of the joined use the same part of the current GQP as the ongoing query. For tuples and forwards them to the relevant queries. For every new SharedDB [11], SP can help start a new query before the next batch query, the preprocessor admits it, marking its point of entry on the at any operator, if it has a common sub-plan with an ongoing query circular scan of the fact table and signifies its completion when it and has arrived within the corresponding WoP of the operator. wraps around to its point of entry on the circular scan. 3.2 CJOIN as a QPipe stage 3. INTEGRATING SP AND GQP We integrate the original CJOIN operator into the QPipe execu- By integrating SP and GQP, we can exploit the advantages of tion engine as a new stage, using Shore-MT [15] as the underlying both forms of sharing. In Section 3.1, we describe how SP can storage manager. In Figure 5, we depict the new stage that encap- conceptually improve the performance of shared operators in the sulates the CJOIN pipeline. presence of common sub-plans, using several examples. These ob- The CJOIN stage accepts incoming QPipe packets that contain servations apply to general GQP, and are applicable to the research the necessary information to formulate a star query: (a) the pro- prototypes we mention in Section 2.4. We continue in Sections 3.2 jections for the fact table and the dimension tables to be joined, and 3.3 to describe our implementation based on CJOIN and QPipe. and (b) the selection predicates. The CJOIN operator does not sup- port selection predicates for the fact table [3], as these would slow 3.1 Benefits of applying SP to GQP the preprocessor significantly. We have ran experiments with the preprocessor evaluating fact table selection predicates, but in most Identical queries. If a new query is completely identical with an cases the cost of a slower pipeline defeated the purpose of poten- ongoing query, SP takes care to re-use the final results of the on- tially flowing fewer fact tuples in the pipeline. To respect space 641

6. QPipe Satellite Satellite Satellite F D1 ... Shore-MT packets Satellite Satellite Satellite packet packet packet Q1: CJOIN F ‫ ڇ‬D1 ‫ ڇ‬D 2 packet packet packet forward results CJOIN stage Distributor FIFO buffer Filters Part Q2: CJOIN ... F ‫ ڇ‬D2 Host packet (a) (b) Host packet Preprocessor Distributor Distributor Part Q3: CJOIN Figure 7: Sharing identical results during SP with: (a) push- F ‫ ڇ‬D1 ‫ ڇ‬D 3 only model and (b) a SPL. Figure 5: Integration scheme of CJOIN as a QPipe stage. and parallelize as much as possible with a query-centric model, be- fore sharing. A prediction model is proposed [14] for determining limitations, we do not include these experiments. Fact table predi- at run-time whether sharing is beneficial. In this section, however, cates are evaluated on the output tuples of CJOIN. we show that SP is possible without a serialization point, thus ren- To improve admission costs we use batching, following the origi- dering SP always beneficial. nal CJOIN proposal [4]. In one pause of the pipeline, the admission The serialization point is caused by strictly employing push-only phase adapts the filters for all queries in the batch. During the ex- communication. Pipelined execution with push-only communica- ecution of each batch, additional new queries form a new batch to tion typically uses FIFO buffers to exchange results between oper- be subsequently admitted. ators [13]. This allows to decouple query plans and have a distinct With respect to threads, there is a number of threads assigned separation between queries, similar to a query-centric design. Dur- to filters (we assume the horizontal configuration of CJOIN [3, ing SP, this forces the single thread of the pivot operator of the host 4]), each one taking a fact tuple from the preprocessor, passing it packet to forward results to all satellite packets sequentially (see through the filters up to the distributor. The original CJOIN uses Figure 7a), which creates a serialization point. a single-threaded distributor which slows the pipeline significantly. This serialization point is reflected in the prediction model [14], To address this bottleneck, we augment the distributor with sev- where the total work of the pivot operator includes a cost for for- eral distributor parts. Every distributor part takes a tuple from the warding results to all satellite packets. By using copying to forward distributor, examines its bitmap, and determines relevant CJOIN results [14], the serialization point becomes significant and delays packets. For each relevant packet, it performs the projection of the subsequent operators in the plans of the host and satellite packets. star query and forwards the tuple to the output buffer of the packet. This creates a trade-off between sharing and parallelism, where in CJOIN supports only shared hash-joins. Subsequent operators the latter case a query-centric model without sharing is used. in a query plan, e.g. aggregations or sorts, are query-centric. Nev- ertheless, our evaluation gives us insight on the general behavior Sharing vs. Parallelism. We demonstrate this trade-off with the of shared operators in a GQP, as joins typically comprise the most following experiment, similar to the experiment of [14], which eval- expensive part of a star query. uates SP for the table scan stage with a memory-resident database. Though the trade-off applies for disk-resident databases and other 3.3 SP for the CJOIN stage stages as well, it is more pronounced in this case. Our experimental We enable SP for the CJOIN stage with a step WoP. Evaluat- configuration can be found in Section 5. We evaluate two configu- ing the identical queries Q2 and Q3 of Figure 1d employing SP, rations of the QPipe execution engine: (a) No SP (FIFO), which requires only one packet entering the CJOIN stage. The second evaluates query plans independently without any sharing, and (b) satellite packet re-uses the results. CS (FIFO), with SP enabled only for the table scan stage, thus CJOIN is itself an operator, and we integrate it as a new stage in supporting circular scans (CS). FIFO buffers are used for pipelined QPipe. As with any other QPipe stage, SP is applied on the overall execution and copying is used to forward pages during SP, follow- CJOIN stage. Conceptually, our implementation applies SP for the ing the original push-only design [13, 14]. We evaluate identical whole series of shared hash-joins in the GQP. Our analysis, how- TPC-H [1] Q1 queries, submitted at the same time, with a database ever, gives insight on the benefits of applying SP to fine-grained of scale factor 1. Figure 6a shows the response times of the config- shared hash-joins as well. This is due to the fact that a redundant urations, while varying the number of concurrent queries. CJOIN packet involves all redundant costs we mentioned in Sec- For low concurrency, No SP (FIFO) efficiently uses available tion 3.1 for admission, shared selections operators and shared hash- CPU resources. Starting with as few as 32 queries, however, there joins. Our experiments show that the cost of a redundant CJOIN is contention for CPU resources and response times grow quickly, packet is significant, and SP decreases it considerably. due to the fact that our server has 24 available cores and the query- centric model evaluates queries independently. For 64 queries, it uses all cores at their maximum, resulting in excessive and unpre- 4. SHARED PAGES LISTS FOR SP dictable response times, with a standard deviation up to 30%. In this section, we present design and implementation issues of CS (FIFO) suffers from low utilization of CPU resources, due sharing using SP, and how to address them. Contrary to intuition, it to the aforementioned serialization point of SP. The critical path is shown in the literature that work sharing is not always beneficial: increases with the number of concurrent queries. For 64 queries, it if there is a serialization point during SP, then sharing common re- uses an average of 3.1 of available cores. In this experiment, the sults aggressively can lead to worse performance, compared to a proposed prediction model [14] would not share in cases of low query-centric model that implicitly exploits parallelism [14, 23]. concurrency, essentially falling back to the line of No SP (FIFO), When the producer (host packet) forwards results to consumers and would share in cases of high concurrency. (satellite packets), it is in the critical path of the evaluation of the Nevertheless, the impact of the serialization point of SP can be remaining nodes of the query plans of all involved queries. For- minimized. Simply copying tuples in a multi-threaded way would warding results can cause a significant serialization point. In this not solve the problem, due to synchronization overhead and in- case, the DBMS should first attempt to exploit available resources creased required CPU resources. A solution would be to forward 642

7. 120 No SP (FIFO) 120 2 (No SP / CS) with FIFO No SP (SPL) 100 CS (FIFO) 100 CS (SPL) (No SP / CS) with SPL Response time (sec) Response time (sec) 3.1 CPU 1.5 80 80 60sec Speedup 60 60 1 40 40 0 2 4 6 8 10 12 14 16 18 19.1 CPU 20 20 8sec 0.5 0 0 1 2 4 8 16 32 64 1 2 4 8 16 32 64 0 Number of concurrent queries Number of concurrent queries Number of concurrent queries Figure 6: Evaluating multiple identical TPC-H Q1 queries (a) with a push-only model during SP (FIFO), and (b) with a pull-based model during SP (SPL). In (c), we show the corresponding speedups of the two methods of SP, over not sharing, for low concurrency. tuples via pointers, a possibility not considered by the original sys- sharing, for both models. We depict only values for low concur- tem. We can, however, avoid unnecessary pointer chasing; by em- rency, as sharing is beneficial for both models in cases of high con- ploying pull-based communication, we can share the results and currency. We corroborate previous results on the negative impact of eliminate forwarding altogether. In essence, we transfer the respon- sharing with push-only communication [14] for low concurrency, sibility of sharing the results from the producer to the consumers. and show that pull-based sharing is always beneficial. Thus, the total work of the producer does not include any forward- ing cost. Our pull-based communication model is adapted for SP 4.1 Design of a Shared Pages List for any stage with a step or linear WoP. Figure 8 depicts a SPL. It points to the head and tail of the linked The serialization point of push-based SP was not a visible bottle- list. The host packet adds pages at the head. Satellite packets read neck in the original implementation of QPipe, due to experiments pages from the SPL independently. Due to different concurrent ac- being ran on a uni-processor [13]. On machines with multi-core tors accessing the SPL, we associate a lock with it. Contention for processors its impact grows as the available parallelism increases, locking is minimal in all our experiments, mainly due to the gran- and the aforementioned prediction model [14] was proposed to de- ularity of pages we use (32KB). A lock-free linked list, however, cide at run-time whether (push-based) sharing should be employed. can also be used to address any scalability problems. Our pull-based communication model for SP, however, eliminates Theoretically, if we allow the SPL to be unbounded, we can the serialization point, leading to better scaling on machines with achieve the maximum parallelism possible, even if the producer modern multi-core processors with virtually no sharing overhead. and the consumers move at different speeds. There are practical To achieve this, we create an intermediate data structure, the reasons, however, why the SPL should not be unbounded, similar Shared Pages Lists (SPL). SPL have the same usage as the FIFO to the reasons why a FIFO buffer should not be unbounded, includ- buffers of the push-only model. A SPL, however, allows a single ing: saving RAM, and regulating differently paced actors. producer and multiple consumers. A SPL is a linked list of pages, To investigate the effect of the maximum size, we ran the experi- depicted in Figure 7b. The producer adds pages at the head, and the ment of Figure 6b, for the case of 8 concurrent queries, varying the consumers read the list from the tail up to the head independently. maximum size of SPL up to 512MB. We observed that changing the In order to show the benefits of SPL, we run the experiment of maximum size of the SPL does not heavily affect performance. Due Figure 6a, by employing SPL instead of FIFO buffers. When SP to space limitations, we do not present the relevant graph. Hence, does not take place, a SPL has the same role as a FIFO buffer, we chose a maximum size of 256KB for our experiments in Section used by one producer and one consumer. Thus, the No SP (SPL) 5 in order to minimize the memory footprint of SPL. line has similar behavior with the No SP (FIFO) line. During SP, In order to decrease the size of the SPL, the last consumer is however, a single SPL is used to share the results of one producer responsible for deleting the last page. Each page has an atomic with all consumers. Figure 6b shows the response times of the counter with the number of consumers that will read this page. configurations, while varying the number of concurrent queries. When a consumer finishes processing a page, he decrements its With SPL, sharing has the same or better performance than not counter, deleting the page if he sees a zero counter. In order to sharing, for all cases of concurrency. We avoid using a prediction know how many consumers will read a page, the SPL stores a list model altogether, for deciding whether to share or not. Parallelism of active satellite packets. The producer assigns their number as the is achieved due to the minimization of the serialization point. For initial value of the atomic counter of each emitted page. high concurrency, CS (SPL) uses more CPU resources and reduces response times by 82%-86% in comparison to CS (FIFO). 4.2 Linear Window of Opportunity Additionally, Figure 6c shows the speedup of sharing over not In order to handle a linear WoP, such as circular scans, the SPL stores the point of entry of every consumer. When the host packet Satellite packet Satellite packet finishes processing, the SPL is passed to the next host packet that - List of finishing packets handles the processing for re-producing missed results. - Atomic counter of reads When the host packet emits a page, it checks for consumers - Data whose point of entry is this page, and will need to finish when SPL - Lock they reach it. The emitted page has attached to it a list of these - List of satellite packets and their points of entry Host packet finishing packets, which are removed from the active packets of the - Maximum size SPL (they do not participate in the atomic counter of subsequently emitted pages). When a consumer (packet) reads a page, it checks Figure 8: Design of a shared pages list. whether it is a finishing packet, in which case, it exits the SPL. 643

8.5. EXPERIMENTAL EVALUATION SELECT c_city, s_city, d_year, S SUM(lo_revenue) as revenue FROM customer, lineorder, supplier, date 5.1 Experimental Methodology WHERE lo_custkey = c_custkey A We compare five configurations of the QPipe execution engine: AND lo_suppkey = s_suppkey ‫ڇ‬ AND lo_orderdate = d_datekey • QPipe, without SP, which is similar to a typical query-centric AND c_nation = [NationCustomer] ‫ ڇ‬D3 date model that evaluates query plans separately with pipelining, AND s_nation = [NationSupplier] AND d_year >= [YearLow] without any sharing. This serves as our baseline. AND d_year <= [YearHigh] ‫ ڇ‬D2 customer GROUP BY c_city, s_city, d_year • QPipe-CS, supporting SP only for the table scan stage, i.e. ORDER BY d_year ASC, revenue DESC F D1 supplier circular scans (CS). It improves performance over QPipe by reducing contention for CPU resources, the buffer pool and the underlying storage device. Figure 9: The SSB Q3.2 SQL template and the query plan. • QPipe-SP, supporting SP additionally for the join stage. It query plans. Queries are submitted at the same time, and are all improves performance over QPipe-CS, in cases of high sim- evaluated concurrently. This single batch for all queries allows us ilarity, i.e. common sub-plans. In cases of low similarity, it to minimize query admission overheads for CJOIN, and addition- behaves similar to QPipe-CS. ally allows us to show the effects of SP, as all queries with common sub-plans arrive surely inside the WoP of their pivot operators. We • CJOIN, without SP, which is the result of our integration of note that variable interarrival delays can decrease sharing opportu- CJOIN into QPipe, hence the joins in star queries are evalu- nities for SP, and refer the interested reader to the original QPipe ated with a GQP of shared hash-joins. We remind that CJOIN paper [13] to review the effects of interarrival delays for different only supports shared hash-joins, thus subsequent operators cases of pivot operators and WoP. are query-centric. Nevertheless, this configuration allows Finally, in Section 5.3, we evaluate QPipe-SP, CJOIN-SP, and us to compare shared hash-joins with the query-centric ones Postgres with a mix of SSB queries and a scale factor 30. We use used by the previous configurations, giving us insight on the PostgreSQL 9.1.4 as another example of a query-centric execution performance characteristics of general shared operators. engine that does not share among concurrent queries. We configure • CJOIN-SP, which additionally supports SP for the CJOIN PostgreSQL to make the comparison with QPipe as fair as possible. stage (see Section 3.3). We use this configuration to evaluate We use 32KB pages, large shared buffers that fit the database, en- the benefits of combining SP with a GQP. It behaves similar sure that it never spills to the disk and that the query execution plans to CJOIN in cases of low similarity in the query mix. are the same. We disable query caching, which does not execute a previously seen query at all. We do not want to compare caching In all our experiments, SP for the aggregation and sorting stages of previously executed queries, but the efficiency of sharing among is off. This is done on purpose to isolate the benefits of SP for joins in-progress queries. only, so as to better compare QPipe-SP and CJOIN-SP. We use the Star Schema Benchmark [21] and Shore-MT [15] 5.2 Sensitivity Analysis as the storage manager. SSB is a simplified version of TPC-H [1] In this section, we measure performance by evaluating multiple where the tables lineitem and order have been merged into lineorder concurrent instances of SSB Q3.2. It is a typical star query that and there are four dimension tables: date, supplier, customer and joins three of the four dimension tables with the fact table. The part. Shore-MT is an open-source multi-threaded storage manager SQL template and the execution plan are shown in Figure 9. We developed to achieve scalability on multi-core platforms. select a single query template for our sensitivity analysis because Our server is a Sun Fire X4470 server with four hexa-core pro- we can adjust the similarity of the query mix to gain insight on cessors Intel Xeon E7530 at 1.86 Ghz, with hyper-threading dis- the benefits of SP, and also, the GQP of CJOIN is the same for all abled and 64 GB of RAM. Each core has a 32KB L1 instructions experiments, with the same 3 shared hash-joins for all star queries. cache, a 32KB L1 data cache, and a 256KB L2 cache. Each proces- sor has a 12MB L3 cache, shared by all its cores. For storage, we 5.2.1 Impact of concurrency use two 146 GB 10kRPM SAS 2.5” disks, configured as a RAID-0. We start with an experiment that does not involve I/O accesses to The O/S is a 64-bit SMP Linux (Red Hat), with a 2.6.32 kernel. study the computational behavior of the configurations. We store We clear the file system caches before every measurement. All our database in a RAM drive. We evaluate multiple concurrent SSB configurations use a large buffer pool that fits datasets of scale fac- Q3.2 instances for a scale factor 1. The predicates of the queries are tors up to 30 (scanning all tables reads 21GB of data from disk). chosen randomly, keeping a low similarity factor among queries SPL are used for exchanging results among packets. We use 32KB and the selectivity of fact tuples varies from 0.02% to 0.16% per pages and a maximum size of 256KB for a SPL (see Section 4). query. Figure 10 (left) shows the response times of the configura- Unless stated otherwise, every data point is the average of mul- tions, while varying the number of concurrent queries. tiple iterations with standard deviation less or equal to 10%. In For low concurrency, QPipe successfully uses available CPU some cases, contention for resources results in higher deviations. resources. Starting with as few as 32 concurrent queries, there is Furthermore, we mention the average CPU usage and I/O through- contention for CPU resources, due to the fact that our server has put of representative iterations (averaged only over their activity 24 cores and QPipe evaluates queries separately. Response times period), to gain insight on the performance of the configurations. grow quickly and unpredictability results in standard deviations up Our sensitivity analysis is presented in Section 5.2. We vary to 50%. For 256 queries it uses all cores at their maximum. (a) the number of concurrent queries, (b) whether the database is The circular scans of QPipe-CS reduce contention for CPU re- memory-resident or disk-resident, (c) the selectivity of fact tuples, sources and the buffer pool, improving performance. For high con- (e) the scale factor, and (d) the query similarity which is modeled currency, however, there are more threads than available hardware in our experiments by the number of possible different submitted contexts, thus increasing response time. 644

9. 10000 10000 140 CJOIN Admission 10000 Misc (#6) Locks (#5) Memory-resident database Disk-resident database CJOIN Scans (#4) Aggreg. (#3) QPipe 120 QPipe-SP Joins (#2) Hashing (#1) Response time (sec) 1000 QPipe-CS 1000 1000 Response time (sec) 100 CPU time (sec) QPipe-SP CJOIN 80 100 100 100 60 10 10 40 10 20 QPipe-SP CJOIN 1 1 1 0 0.1 1.0 10.0 20.0 30.0 30.0 0.1 1.0 10.0 20.0 1 2 4 8 16 32 64 128 256 1 2 4 8 16 32 64 128 256 0 10 20 30 Number of concurrent queries Number of concurrent queries Selectivity (%) Selectivity (%) Measurement \Configur. QPipe QPipe-CS QPipe-SP CJOIN Measurement \Configuration QPipe-SP CJOIN Experiment with memory-resident database Avg. # Cores Used 17.79 18.86 Avg. # Cores Used 23.91 19.72 18.75 3.47 Experiment with disk-resident database Figure 11: 8 queries with a memory-resident database of Avg. # Cores Used 23.86 19.84 17.06 3.49 SF=10. The table includes measurements for 30% selectivity. Avg. Read Rate (MB/s) 1.88 74.47 97.67 156.11 5.2.2 Impact of data size Figure 10: Experiment with memory-resident (left) and disk- In this section, we study the behavior of the configurations by resident (right) database of SF=1. The table includes measure- varying the amount of data they handle. We perform two experi- ments for the case of 256 concurrent queries. ments: In the first, we vary the selectivity of fact tuples of queries, and in the second, the scale factor. QPipe-CS misses several sharing opportunities at higher opera- Impact of selectivity. We use a memory-resident database with tors in the query plans. QPipe-SP can exploit them. Even though scale factor 10. The query mix consists of 8 concurrent queries we use random predicates, the ranges of variables of the SSB Q3.2 which are different instances of a modified SSB Q3.2 template. For template allows QPipe-SP to share the first hash-join 126 times, the modified template, we select the maximum possible range for the second hash-join 17 times, and the third hash-join 1 time, on the year. Moreover, we extend the WHERE clause of the query average for 256 queries. Thus, it saves more CPU resources, and template by adding more options for both customer and supplier results in lower response time than the circular scans alone. nation attributes. For example, if we use a disjunction of 2 nations The shared operators of CJOIN offer the best performance, as for customers and 3 nations for suppliers, we achieve a selectivity they are the most efficient in saving resources. CJOIN has an ini- 2 3 of 25 25 ≈ 1% of fact tuples. Nations are selected randomly over tialization overhead in comparison to the other configurations, at- all 25 possible values and are unique in every disjunction, keeping tributed to its admission phase, which has a large part that pauses a minimal similarity factor. The results are shown in Figure 11. In the pipeline (see Section 3.1). The shared hash-joins in the GQP this experiment, there is no contention for resources and no com- can effortlessly evaluate many instances of SSB Q3.2. Neverthe- mon sub-plans. Thus, we do not depict QPipe and QPipe-CS, as less, admission and evaluation costs accumulate for an increasing they have the same behavior as QPipe-SP, and we do not depict number of queries, thus the CJOIN line also starts to degrade. CJOIN-SP, as it has the same behavior as CJOIN. We do not depict CJOIN-SP, as it has the same behavior as Our selectivity experiments provide more insight on the general CJOIN. As we noted in Section 3.3, our implementation of CJOIN- behavior of the configurations. For this reason, we also include SP supports sharing CJOIN packets with all predicates identical. the time of the admission phase of CJOIN, and performance break- This is rare due to this experiment’s random selection predicates. down graphs. The latter show the CPU time of all cores, as mea- Our observations apply also for the same experiment with the sured with Intel VTune Amplifier 2011, for different parts of the database on disk, shown in Figure 10 (right). QPipe suffers from query evaluation. We compare the effect of sharing on the CPU CPU contention, which de-schedules scanner threads regularly re- time of hash-joins rather than analyze the bottlenecks of QPipe sulting in low I/O throughput. Additionally, scanner threads com- and CJOIN, which are largely dependent on implementation de- pete for bringing pages into the buffer pool. Response times for tails. We further break down the CPU time of hash-joins to two low concurrency have increased, but not significantly for high con- categories. The first, shown as “Hashing”, includes the total CPU currency because the workload becomes CPU-bound. QPipe-CS time of the hash() and equal() functions, which are the heart improves performance (by 80%-97% for high concurrency) by re- of the building and probing phases, and allow us to compare the ducing contention for resources and the buffer pool. QPipe-SP effect of sharing between the configurations, without strong side- further improves performance by eliminating common sub-plans. effects from implementation details. The remaining CPU time of The shared operators of CJOIN still prevail for high concurrency. the hash-joins is shown as “Joins”. Furthermore, the overhead of the admission phase of CJOIN, that Both QPipe-SP and CJOIN show a degradation in performance we observed for a memory-resident database, is masked by file sys- as selectivity increases, due to the increasing amount of data they tem caches for disk-resident databases. We explore this effect in a need to handle. CJOIN, however, is always worse than QPipe- next experiment, where we vary the scale factor. SP. This is due to three reasons mainly. Firstly, the cost of the Implications. Shared scans improve performance by reducing con- admission phase of CJOIN is increased, as more tuples are selected tention for resources, the buffer pool and the underlying storage de- for referencing in the hash tables of the filters. vices. SP is able to eliminate common sub-plans. Shared operators Secondly, the shared operators inherently entail a bookkeeping in a GQP are more efficient in evaluating a high number of queries, overhead, in comparison to standard query-centric operators. In our in comparison to standard query-centric operators. case, the additional cost of shared hash-joins includes the mainte- 645

10. 2000 CJOIN Admission 100000 Misc Locks 1000 QPipe-SP (Direct I/O) Scans Aggreg. CJOIN Joins Hashing CJOIN (Direct I/O) 10000 800 Response time (sec) 1500 QPipe-SP QPipe-SP Response time (sec) CPU time (sec) 1000 600 CJOIN 1000 SF Data (GB) 100 400 1 0.8 10 7.4 500 10 200 30 21.0 QPipe-SP CJOIN 50 37.5 1 100 66.8 0 0 16 32 64 128 16 32 64 128 16 32 64 128 256 0 10 20 30 40 50 60 70 80 90 100 Number of Concurrent Queries Number of Concurrent Queries Scale Factor Measurement \Configuration QPipe-SP CJOIN Measur.\Configur. QPipe-SP CJOIN QPipe-SP CJOIN Avg. # Cores Used 22.86 17.73 (Direct I/O) (Direct I/O) # Cores Used 5.96 1.68 5.38 2.47 Figure 12: Memory-resident database of SF=10 and 30% selec- Read Rate (MB/s) 97.16 70.01 215.58 204.71 tivity. The table includes measurements for 256 queries. Figure 13: 8 concurrent queries with disk-resident databases. nance of larger hash tables for the union of the selected dimension The table includes measurements for the case of SF=100. tuples of all concurrent queries, and bitwise AND operations be- tween the bitmaps of tuples. Query-centric operators do not entail tween 0.02% and 0.16%. The results are shown in Figure 13. The these costs, and maintain a hash table for one query. The increased response times of QPipe-SP and CJOIN increase linearly. Their bookkeeping costs are reflected in the CPU time of the area un- slopes, however, are different. The reasons are the same as in our der Joins of CJOIN, which is more expensive than QPipe-SP for selectivity experiment. all cases of selectivity. As the selectivity increases, the hashing We also show the response time of the two configurations by CPU time of QPipe-SP increases faster than CJOIN, as it does using direct I/O for accessing the database on disk, to bypass file not share parts of the hash-joins of the concurrent queries. We note system caches. This allows us to isolate the overhead of CJOIN’s that the bookkeeping overhead can be decreased significantly with preprocessor. As we have mentioned, the preprocessor is in charge careful implementation choices. DataPath [2] uses a single large of the circular scan of the fact table, the admission phase of new hash table for all shared hash-joins, and techniques to decrease the queries, and finalizing queries when they wrap around to their point maintenance and access costs for the hash table. of entry. These responsibilities slow down the circular scan signif- Thirdly, the horizontal configuration of CJOIN (all threads in icantly. Without direct I/O, file system caches coalesce contiguous one “stage” [3]) results in synchronization costs, as threads con- I/O accesses and read-ahead, achieving high I/O read throughput in tend while passing tuples through the pipeline. The synchroniza- sequential scans, masking the preprocessor’s overhead. tion costs are a significant reason for the worse trend of CJOIN. Nevertheless, synchronization costs are highly dependent on imple- Implications. For low concurrency, a GQP with shared operators mentation. For example, in CJOIN, the synchronization costs can entails a bookkeeping overhead in comparison to query-centric op- be minimized with the vertical (one thread per filter) or hybrid con- erators. For high concurrency, however, the overhead of shared figuration of CJOIN. These configurations, however, do not neces- operators is amortized. sarily provide better performance [3]. In DataPath or SharedDB, 5.2.3 Impact of Similarity a shared operator in a GQP does not necessarily require multiple threads. Nevertheless, for low concurrency, the synchronization In this experiment we use a disk-resident database of scale factor costs for a query are higher in a GQP than in the query-centric 1. We limit the randomness of the predicates of queries to a small model, as a GQP tends to be much larger than the constituent query set of values. Specifically, there are 16 possible query plans for in- plans. For one query in a GQP, tuples not selected by it, but se- stances of Q3.2. The selectivity of fact tuples ranges from 0.02% to lected by other queries, need to pass through shared operators, and 0.05%. In Figure 14, we show the response times of the configura- tuples selected by the query may need to pass by additional shared tions, varying the number of concurrent queries. We do not depict operators to accommodate other concurrent queries. This is also a 50 QPipe-CS 50sec reason why GQP achieve better throughput for high concurrency, QPipe-SP Response time (sec) but may hurt the latency of queries, especially for low concurrency. 40 We used 8 queries to avoid CPU contention. For higher concur- CJOIN 30 CJOIN-SP rency, shared operators still prevail, due to their efficiency in saving resources. Figure 12 shows the response times for the case of 30% 20 14sec selectivity. For high concurrency, the query-centric operators of 13sec 10 12sec QPipe-SP contend for resources. This is also shown in the CPU times of the break-down graph, which all scale (superlinearly) with 0 the number of queries. CJOIN is able to save more resources and 1 2 4 8 16 32 64 128 256 outperform the query-centric operators. This is best reflected by Number of concurrent queries the hashing CPU time, which stays at the same level, irrespective Measur. \Configur. QPipe-CS QPipe-SP CJOIN CJOIN-SP of the number of queries, as the hashing is shared. Avg. # Cores Used 20.32 9.60 2.50 2.34 Impact of scale factor. The same trade-off between shared oper- Avg. Read Rate (MB/s) 74.37 120.20 130.18 130.15 ators and query-centric operators is observed by varying the scale factor between 1 to 100. We use disk-resident databases and 8 con- Figure 14: Disk-resident database of SF=1 and 16 possible current queries with randomly varied predicates and selectivity be- plans. The table includes measurements for 256 queries. 646

11. 2500 QPipe-SP 25 QPipe-SP 10 9 Throughput (x100 queries/h) Response time (x100 sec) CJOIN CJOIN-SP 2000 20 8 Postgres Response time (sec) CJOIN-SP 7 15 6 1500 5 10 4 1000 3 5 2 500 1 0 0 0 1 2 4 8 16 32 64 128 256 1 2 4 8 16 32 64 128 256 Number of concurrent queries Number of concurrent clients 1 128 256 512 Random Number of possible different plans Measur. \Configuration Postgres QPipe-SP CJOIN-SP Differ. plans 1 128 256 512 Random Response time experiment (256 concurrent queries) QPipe-SP 1/0/51018/94/38149/156/287 106/196/188 362/82/5 Avg. # Cores Used 18.56 19.07 19.11 CJOIN-SP 510 384 287 190 12 Avg. Read Rate (MB/s) 15.93 84.98 110.03 Throughput experiment (256 concurrent clients) Figure 15: Evaluating 512 concurrent queries with a vary- Avg. # Cores Used 18.29 19.59 13.70 ing similarity factor, for a SF=100. The table includes the SP Avg. Read Rate (MB/s) 15.94 67.42 79.98 sharing opportunities (average of all iterations), in the format 1st/2nd/3rd hash-join for QPipe-SP. Figure 16: Disk-resident database of SF=30. Response time (left) and throughput experiment (right), varying the number of concurrent queries and clients respectively. The table in- QPipe, as it does not exploit any sharing and results in increased clude measurements for both experiments. contention and high response times for high concurrency. QPipe-SP evaluates a maximum of 16 different plans and re- uses results for the rest of similar queries. It shares the second efficiency in sharing among a high number of concurrent queries. hash-join 1 time, and the third hash-join 238 times, on average, for Postgres follows a traditional query-centric model of execution, 256 queries. This leads to high sharing and minimal contention and does not share among in-progress queries. For this reason, it for computations. On the other hand, QPipe-CS does not share results in contention for resources. QPipe-SP results in a better operators other than the table scan, resulting in high contention. performance due to circular scans and the elimination of common Similarly, CJOIN misses exploiting these sharing opportunities sub-plans. CJOIN-SP attains the best performance, as shared op- and evaluates identical queries redundantly. In fact, QPipe-SP erators are the most efficient in sharing among concurrent queries. outperforms CJOIN. CJOIN-SP, however, is able to exploit them. Figure 16 also shows the throughput of the three configurations, For a group of identical star queries, only one is evaluated by the by varying the number of concurrent clients. Each client initially GQP. CJOIN-SP shares CJOIN packets 239 times on average for submits a query, and when it finishes, the next one is submitted. 256 queries. Thus, CJOIN-SP outperforms all configurations. The shared operators of a GQP are able to handle new queries with To further magnify the impact of SP, we perform another experi- minimal additional resources. Thus, the throughput of CJOIN-SP ment for 512 concurrent queries, a scale factor of 100 (with a buffer continues to increase. The throughput of the query-centric oper- pool fitting 10% of the database), and varying the number of pos- ators of Postgres and QPipe-SP, however, ultimately degrades sible different query plans. Figure 15 shows the results. CJOIN is with an increasing number of clients, due to resources contention. not heavily affected by the number of different plans. For the ex- treme cases of high similarity, QPipe-SP prevails. For lower sim- ilarity, the number of different plans it needs to evaluate is larger 6. DISCUSSION and performance is deteriorated due to contention. CJOIN-SP is Shared scans and SPL. Pull-based models, similar to SPL, have able to exploit identical CJOIN packets and improve performance been proposed for shared scans that are specialized for efficient of CJOIN by 20%-48% for cases with common sub-plans. buffer pool management and are based on the fact that all data is Implications. We can combine SP with a GQP to eliminate redun- available for accessing (see Section 2.1). SPL differ because they dant computations and improve performance of shared operators are generic and can be used during SP at any operator which may be for a query mix that exposes common sub-plans. producing results at run-time. It is possible, as well, to use shared scans for table scans, and use SPL during SP for other operators. 5.3 SSB query mix evaluation Prediction model for sharing with a GQP. Shared operators of a In this section we evaluate QPipe-SP, CJOIN-SP, and Post- GQP are not beneficial for low concurrency, in comparison to the gres using a mix of three SSB queries (namely Q1.1, Q2.1 and query-centric model, because they entail an increased bookkeeping Q3.2), with a disk-resident database and a scale factor of 30. The overhead (see Section 5.2.2). The turning point, however, when predicates for the queries are selected randomly and the selectivity shared operators become beneficial needs to be pinpointed. A sim- of fact tuples is less than 1%. Each query is instantiated from the ple heuristic is the point when resources become saturated (see Ta- three query templates in a round-robin fashion, so all configurations ble 1). An exact solution would be a prediction model, similar to contain the same number of instances for each query type. [14]. This model, however, targets only sharing identical results Figure 16 shows the response times of the configurations, while during SP (see Section 4). Shared operators in a GQP do not share varying the number of concurrent queries. As Postgres is a more identical results, but part of their evaluation among possibly dif- mature system than the two research prototypes, it attains a bet- ferent queries. A potential prediction model for a GQP needs to ter performance for low concurrency. Our aim, however, is not to consider the bookkeeping overhead, and the cost of optimizing the compare the per-query performance of the configurations, but their GQP, for the current query mix and resources. 647

12.Distributed environments. This work focuses on scaling up rather [7] L. Colby et al. Red brick vistaT M : aggregate computation than out. Following prior work [2, 3, 11, 13], we consider scal- and management. In Proc. of the 14th Int’l Conf. on Data ing up as a base case, because it is a standard means of increasing Engineering, pages 174–177, 1998. throughput in DBMS. Further research in parallel DBMS [20] and [8] C. Cook. Database Architecture: The Storage Engine, 2001. other distributed data systems [10] will have interesting implica- [9] N. N. Dalvi et al. Pipelining in multi-query optimization. In tions. For example, we can improve global scheduling in parallel Proc. of the 20th ACM SIGMOD-SIGACT-SIGART DBMS by considering sharing: each replica node can employ a Symposium on Principles of Databases, pages 59–70, 2001. separate GQP, and a new query should be dispatched to the node [10] J. Dean et al. MapReduce: Simplified data processing on which incurs the minimum estimated marginal cost for evaluation. large clusters. Communications ACM, 51(1):107–113, 2008. [11] G. Giannikis et al. SharedDB: killing one thousand queries 7. CONCLUSIONS with one stone. Proc. of the VLDB Endowment, 5(6):526–537, 2012. In this paper we perform an experimental study to answer when [12] S. Harizopoulos et al. A case for staged database systems. In and how an execution engine should share data and work across Proc. of the 2003 Conf. on Innovative Data Systems concurrent analytical queries. We review work sharing methodolo- Research, 2003. gies and we study Simultaneous Pipelining (SP) and Global Query [13] S. Harizopoulos et al. QPipe: a simultaneously pipelined Plans with shared operators (GQP) as two state-of-the-art sharing relational query engine. In Proc. of the 2005 ACM SIGMOD Int’l Conf. on Management of Data, pages 383–394, 2005. techniques. We perform an extensive evaluation of SP and GQP, [14] R. Johnson et al. To share or not to share? In Proc. of the based on their original research prototype systems. 33rd Int’l Conf. on Very Large Data Bases, pages 351–362, Work sharing is typically beneficial for high concurrency be- 2007. cause the opportunities for common work increase, and it reduces [15] R. Johnson et al. Shore-MT: a scalable storage manager for contention for resources. For low concurrency, however, there is the multicore era. In Proc. of the 12th Int’l Conf. on a trade-off between sharing and parallelism, particularly when the Extending Database Technology: Advances in Database sharing overhead is significant. We show that GQP are not ben- Technology, pages 24–35, 2009. eficial for low concurrency as shared operators inherently involve [16] T. Johnson et al. 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm. In Proc. of the a bookkeeping overhead compared to query-centric ones. For SP, 20th Int’l Conf. on Very Large Data Bases, pages 439–450, however, we show that it can be beneficial for low concurrency as 1994. well, if the appropriate communication model is employed: we in- [17] R. Kimball et al. The Data Warehouse Toolkit: The Complete troduce SPL, a pull-based approach that scales better on machines Guide to Dimensional Modeling. John Wiley & Sons, Inc., with modern multi-core processors than push-based SP. SPL is a 2nd edition, 2002. data structure that promotes parallelism by shifting the responsibil- [18] C. Lang et al. Increasing Buffer-Locality for Multiple ity of sharing common results from the producer to the consumers. Relational Table Scans through Grouping and Throttling. In Furthermore, we show that SP and GQP are two orthogonal shar- Proc. of the 23rd Int’l Conf. on Data Engineering, pages 1136–1145, 2007. ing techniques and their integration allows to share operators and [19] N. Megiddo et al. ARC: A Self-Tuning, Low Overhead handle a high number of concurrent queries, while also sharing any Replacement Cache. In Proc. of the 2nd USENIX Conf. on common sub-plans presented in the query mix. In conclusion, an- File and Storage Technologies, pages 115–130, 2003. alytical query engines should dynamically choose between query- [20] M. Mehta et al. Batch Scheduling in Parallel Database centric operators with SP for low concurrency and GQP with shared Systems. In Proc. of the 9th Int’l Conf. on Data Engineering, operators enhanced by SP for high concurrency. pages 400–410, 1993. [21] P. O. Neil et al. Star Schema Benchmark. 2009. Acknowledgments. The authors would like to thank the anony- [22] E. J. O’Neil et al. The LRU-K page replacement algorithm mous reviewers for their helpful comments, Alkis Polyzotis and for database disk buffering. In Proc. of the 1993 ACM George Candea for their insights and providing access to the orig- SIGMOD Int’l Conf. on Management of Data, pages inal source code of the CJOIN operator, and Ryan Johnson for our 297–306, 1993. helpful discussions about the QPipe engine. This work was sup- [23] L. Qiao et al. Main-memory scan sharing for multi-core cpus. Proc. of the VLDB Endowment, 1(1):610–621, 2008. ported by the FP7 project BIGFOOT (grant n. 317858). [24] N. Roussopoulos. View indexing in relational databases. ACM Trans. Database Syst., 7(2):258–290, 1982. 8. REFERENCES [25] P. Roy et al. Efficient and extensible algorithms for multi query optimization. In Proc. of the 2000 ACM SIGMOD Int’l [1] TPC-H Benchmark: Standard Specification, Revision 2.14.3. Conf. on Management of Data, pages 249–260, 2000. [2] S. Arumugam et al. The DataPath system: a data-centric [26] P. Russom. High-Performance Data Warehousing. TDWI, analytic processing engine for large data warehouses. In 2012. Proc. of the 2010 ACM SIGMOD Int’l Conf. on Management report-high-performance-data-warehousing.aspx. of Data, pages 519–530, 2010. [27] T. K. Sellis. Multiple-query optimization. ACM Trans. [3] G. Candea et al. A scalable, predictable join operator for Database Syst., 13(1):23–52, 1988. highly concurrent data warehouses. Proc. of the VLDB Endowment, 2(1):277–288, 2009. [28] J. Shim et al. Dynamic Caching of Query Results for Decision Support Systems. In Proc. of the 11th Int’l Conf. on [4] G. Candea et al. Predictable performance and high query Scientific and Statistical Database Management, pages concurrency for data analytics. The Int’l Journal on Very 254–263, 1999. Large Data Bases, 20(2):227–248, 2011. [29] P. Unterbrunner et al. Predictable performance for [5] H.-T. Chou et al. An evaluation of buffer management unpredictable workloads. Proc. of the VLDB Endowment, strategies for relational database systems. In Proc. of the 11th 2(1):706–717, 2009. Int’l Conf. on Very Large Data Bases, pages 127–141, 1985. [30] M. Zukowski et al. Cooperative scans: dynamic bandwidth [6] J. Cieslewicz et al. Adaptive aggregation on chip sharing in a DBMS. In Proc. of the 33rd Int’l Conf. on Very multiprocessors. In Proc. of the 33rd Int’l Conf. on Very Large Data Bases, pages 723–734, 2007. Large Data Bases, pages 339–350, 2007. 648