SharedDB: Killing One Thousand Queries with One Stone

Traditional database systems are built around the query-at-a-time model. This approach tries to optimize performance in a best-effort way. Unfortunately, best effort is not good enough for many modern applications. These applications require response time guarantees in high load situations. This paper describes the design of a new database architecture that is based on batching queries and shared computation across possibly hundreds of concurrent queries and updates. Performance experiments with the TPC-W benchmark show that the performance of our implementation, SharedDB,is indeed robust across a wide range of dynamic workloads.

1. SharedDB: Killing One Thousand Queries With One Stone Georgios Giannikis Gustavo Alonso Donald Kossmann Systems Group, Department of Computer Science, ETH Zurich, Switzerland {giannikg,alonso,kossmann} ABSTRACT joining all orders with the union of German and Swiss customers Traditional database systems are built around the query-at-a-time and routing the results of the big join to the corresponding queries. model. This approach tries to optimize performance in a best-effort At a first glance, SharedDB looks like a bad idea. In the exam- way. Unfortunately, best effort is not good enough for many mod- ple, SharedDB performs extra work when compared to a traditional ern applications. These applications require response time guar- query-at-a-time approach. Specifically, SharedDB compares Swiss antees in high load situations. This paper describes the design of customers and orders of the Year 2010 as part of its join; such a a new database architecture that is based on batching queries and combination is never considered when processing the two queries shared computation across possibly hundreds of concurrent queries individually and aggressively pushing down predicates below the and updates. Performance experiments with the TPC-W bench- joins. Because of this, SharedDB is likely to perform poorly in low- mark show that the performance of our implementation, SharedDB, throughput situations. SharedDB, however, was designed to han- is indeed robust across a wide range of dynamic workloads. dle high throughput with response time guarantees. If hundreds of concurrent queries involving a customer-order join need to be pro- cessed, then there will be a significant overlap between the sets of 1. INTRODUCTION customers and orders relevant for these queries. The more queries, Over the last decades, tremendous efforts have been invested into the higher the overlap, even if the queries involve totally different query optimization with the goal of achieving the best possible per- kinds of predicates on the Customer and Order tables. Further- formance for each individual query in a query-at-a-time processing more, SharedDB defines an upper bound for the amount of work model. To this end, sophisticated query compile-time techniques that needs to be carried out for a set of concurrent customer-order (e.g., cost-based optimization [24]) and many kinds of database join queries. In the worst case, the whole customer table needs to be operator implementations (e.g., [26]) have been developed. While joined with the whole order table, independently of the number of highly effective, unfortunately, all these approaches are not suffi- concurrent queries processed. In contrast, the effort of a traditional, cient to meet common requirements of modern database applica- query-at-a-time system grows linearly with the number of queries. tions if used in a query-at-a-time model. Often, modern applica- This way, SharedDB is able to maintain response time guarantees tions need to meet service-level agreements (SLAs) that involve a in high-throughput situations. maximum response time for, say, 99 percent of the queries. Further- SharedDB adopts many ideas that were developed in the context more, such SLAs may specify isolation levels and/or data freshness of multi-query optimization and data stream processing. In particu- guarantees for workloads that involve queries and updates. The lar, SharedDB adopts some of the ideas developed as part of QPipe query-at-a-time processing model is not good to meet such SLAs [15], CJoin [3, 4] and DataPath [1]. These systems, however, are because it may result in resource contention and interference in only effective for certain kinds of queries and, thus, their applica- high load situations. tion is limited to OLAP workloads with complex queries. In con- This paper describes a system, SharedDB, specifically designed trast, this paper will show that the design principles of SharedDB to meet SLAs in high load situations for complex and highly dy- are general and can be applied to any kind of query and update. As namic workloads. SharedDB is based on a new processing model a result, SharedDB is able to process OLTP workloads in addition that batches queries and updates in order to share computation across to OLAP and mixed workloads. It is this generality and its ability these queries and updates. The key idea of SharedDB can be best to meet SLAs in high load situations that distinguishes SharedDB described using an example involving two queries. The first query from its closest competitors. Technically, it is the batch-oriented asks for all orders of German customers. The second query asks query processing model with a new way to share computation that for all orders of Swiss customers in the Year 2011. SharedDB ex- makes SharedDB unique. ecutes these two queries in a single customer-order join operation We have carried out comprehensive experiments using the TPC- W benchmark, thereby comparing the performance of SharedDB Permission to make digital or hard copies of all or part of this work for with that of MySQL and a top-of-the-line commercial relational personal or classroom use is granted without fee provided that copies are database system. The results show that SharedDB is able to sustain not made or distributed for profit or commercial advantage and that copies twice the throughput of the top-of-the-line database system and al- bear this notice and the full citation on the first page. To copy otherwise, to most eight times the throughput of MySQL. These results are sur- republish, to post on servers or to redistribute to lists, requires prior specific prising since the traditional database systems are much more ma- permission and/or a fee. Articles from this volume were invited to present ture and were specifically tuned to achieve highest possible through- their results at The 38th International Conference on Very Large Data Bases, August 27th - 31st 2012, Istanbul, Turkey. put for benchmarks such as TPC-W. Furthermore, SharedDB does Proceedings of the VLDB Endowment, Vol. 5, No. 6 not apply any of the recently developed optimizations to achieve Copyright 2012 VLDB Endowment 2150-8097/12/02... $ 10.00. 526

2.particularly good performance for OLTP workloads (e.g., [27]). We aPath are limited to shared computation of joins and to cases in could not carry out experiments with QPipe, CJoin, or DataPath which the particular CJoin and DataPath join methods show good because these systems are not publicly available. Furthermore, it performance. Again, this constrained applicability of the CJoin is not clear if and how the techniques used in these systems can be and DataPath techniques has limited these systems to process only applied to transactional workloads such as TPC-W. Nevertheless, OLAP workloads so far, whereas SharedDB can be applied to trans- this paper contains a careful discussion of the differences and the actional (i.e., OLTP), OLAP, and mixed workloads such as those advantages of SharedDB as compared to these systems. modeled in the TPC-W benchmark. Furthermore, SharedDB can The remainder of this paper is organized as follows: Section 2 re- give response time guarantees which QPipe and DataPath cannot visits related work. Section 3 presents the main ideas of SharedDB make. CJoin can only give response time guarantees under certain and discusses in more detail the differences to other related ap- circumstances; i.e., if the bulk of the work is carried out as part proaches to share computation. Section 4 gives implementation de- of the Star join pipeline and not as part of other joins or as part tails. Section 5 discusses the results of performance experiments. of grouping, sorting, or aggregation. Section 3 discusses the com- Section 6 contains conclusions and avenues for future work. monalities and differences between SharedDB and QPipe, CJoin, and DataPath in more detail when the main ideas of SharedDB are 2. RELATED WORK described. In addition, Section 3.5 contains a detailed discussion of the advantages and disadvantages of SharedDB as compared to As mentioned in the introduction, SharedDB is based on shared these related systems. computation that has recently been studied in several studies to op- A specific problem of shared computation is that it may result timize OLAP workloads. The idea of shared computation was first in deadlocks in a pull-oriented query processor [6]. This problem devised in the Eighties in the context of multi-query optimization can be alleviated by a push-oriented query processing approach (MQO) [11, 25]. The key idea of MQO is to detect common subex- which is the approach adopted by SharedDB. This push-based pressions in concurrently executed queries and to evaluate these processing model has its roots in data stream processing technol- common subexpressions only once. One problem of MQO is its ogy; e.g., YFilter [8]. Even though it is applied to a totally dif- limited applicability: In many workloads (in particular, transac- ferent query processing paradigm, SharedDB has a number of ad- tional workloads such as TPC-W), there are not many opportunities ditional commonalities with the YFilter approach: Just as YFil- to factor out common subexpressions. Another problem of classic ter, SharedDB compiles a set of queries into a single query plan MQO is that it is quite costly to detect common subexpressions. and continuously uses the same plan for different generations of For these two reasons, MQO has been mostly applied to complex queries. In most applications, this approach becomes possible be- OLAP queries that involve the processing of large amounts of data cause the kinds of queries are known in advance and running an ap- and involve large common subexpressions (e.g., the scan of a fact plication over time involves executing the same queries with differ- table in a data warehouse) so that the investment of detecting com- ent parameter settings. The implementation of the TPC-W bench- mon subexpressions is offset by the benefits of shared computation. mark, for instance, involves about thirty different JDBC Prepared- As explained in the example of the introduction, SharedDB is able Statements that are executed with different parameter settings. The to carry out shared computation without common subexpression idea of such an always-on-query plan for short-lived queries was detection and is, thus, applicable to OLTP and mixed workloads. also adopted in the CJoin work [3] and is one of the commonalities Another problem of MQO is the synchronization of the execution between SharedDB, CJoin, and DataPath. of queries with common subexpressions when queries are submit- A push-based, dataflow architecture for query processing has ted at different moments in time. To this end, the QPipe system de- also been adopted in the Eddies project [2]. While SharedDB and veloped a new query processing model that allows to exploit MQO Eddies are also similar in a number of other technical details, the if the queries were executed within a certain time frame [15]. The two systems were designed for totally different purposes. Eddies big advantage of this approach is that sharing does not slow down were designed for run-time adaptivity in situations in which the queries if the queries arrive at different points in time. Another query optimizer cannot make good decisions at compile-time. Ed- contribution of the QPipe project is to differentiate between data dies, however, cannot provide any response time guarantees. In sharing and work sharing and to provide a comprehensive taxon- contrast, SharedDB is static and does not adapt the query plan at omy on the different forms of sharing for different kinds of database runtime. As a result, SharedDB will not always achieve the best operators. Opportunities for data sharing arise in scan operations possible response time for any given query, but SharedDB is able through base data; i.e., shared scans. Opportunities for work shar- to provide response time guarantees which is the primary goal of ing in QPipe, however, only arise in the presence of common sub- this work. expressions. In order to unfold its full potential, therefore, QPipe Of course, there is a great deal of other related work on optimiz- also relies on common subexpression detection and has, thus, only ing databases to meet SLAs in high-throughput situations. Exam- been studied for OLAP workloads, just as classic MQO techniques. ples include indexing, materialized views, and the design of advi- Based on the QPipe work, two recent systems have exploited sors to get the best physical database design for a given workload work sharing without the detection of common subexpressions: [5]. Other examples include caching (e.g., [7]), reuse of query re- CJoin [3, 4] and DataPath [1]. SharedDB falls into this same class sults (e.g., [18]), or optimization techniques that control the data of systems so that CJoin and DataPath can be viewed as SharedDB’s placement in a distributed system (e.g., [20]). Various aspects of closest competitors. Indeed, there are a number of similarities in shared scans, one particular building block exploited in SharedDB the design, but there are also fundamental differences. At the core (and QPipe, CJoin, and DataPath), have been studied in [10, 29, of CJoin and DataPath are special, dedicated join algorithms in 21, 28]. Finally, the idea of bounded computation for constant-time order to facilitate shared computation in a pipelined query exe- query processing was pioneered in the Blink system [22]. cution model a` la QPipe. In contrast, SharedDB is based on a batched query execution model. Furthermore, SharedDB uses stan- dard query processing techniques such as index nested-loops, hash- 3. SYSTEM OVERVIEW ing and sorting for any kind of operator of the relational algebra This section presents the key ideas that combined provide the (e.g., joins, grouping, ranking, and sorting) whereas CJoin and Dat- unique performance and predictability characteristics of SharedDB 527

3. Result Set with Redundancy (First Normal Form) Queries Q1 Q2 Q3 Q4 Q5 Client Row Id Name Other Attr. query id 143 John Smith ... 1 148 Kate Johnson ... 3 143 John Smith ... 2 148 Kate Johnson ... 2 143 John Smith ... 3 Γ ⋈1 Sort Compact Result Set (NF2) Processing Row Id Name Other Attr. query id 143 John Smith ... 1, 2, 3 148 Kate Johnson ... 2, 3 ⋈2 Figure 1: Query-Data Model Variants for a large range of different workloads: the data-query model, the global query plan, batched query execution, and shared oper- ators. Furthermore, this section contains a detailed comparison of Storage SharedDB versus its closest competitors. Implementation details USERS ORDERS ITEMS of SharedDB are described in Section 4. 3.1 Data-Query Model SharedDB features a novel data-query model to represent (shared) Query SQL intermediary query results. This data-query model extends the re- lational data model by adding an additional column which keeps Q1 SELECT COUNTRY, SUM(USER ID) track of the identifiers of queries that are potentially interested in a FROM USERS GROUP BY COUNTRY tuple. Specifically, the schema of an (intermediary) Relation R is Q2 SELECT * FROM USERS U, ORDERS O represented as follows in this data-query model: WHERE U.USER ID = O.USER ID AND U.USERNAME = ? AND O.STATUS = ’OK’ {Ra , Rb , ..., Rn , query id} Q3 SELECT * FROM USERS U, ORDERS O, ITEMS I Here, Ri are the (normal) attributes of R. The query id attribute WHERE U.USER ID = O.USER ID uniquely identifies each query that is currently active in SharedDB. AND O.ITEM ID = I.ITEM ID AND I.AVAILABLE < ? As will be shown in the following sub-sections, the query id can be used in relational operators just as any other attribute; for Q4 SELECT * FROM ORDERS O, ITEMS I instance, it could be part of the join predicate between two relations WHERE O.ITEM ID = I.ITEM ID AND O.DATE > ? (Section 3.3). ORDER BY I.PRICE In a traditional, “first normal form” relational database system, the implementation of the data-query model would result in a great Q5 SELECT * FROM ITEMS I deal of redundancy. If a tuple is relevant for multiple queries, mul- WHERE I.CATEGORY = ? ORDER BY I.PRICE tiple copies of this tuple would have to be generated and processed; one for each relevant query. As a result, the complexity of query processing and the main memory footprint would grow linearly Figure 2: Example of a Global Query Plan with the number of queries. The goal of SharedDB is to do bet- ter. Therefore, SharedDB implements this column internally as a reused over a long period of time, possibly for the entire lifetime set-valued attribute (i.e., using the NF2 model). As a result, an of the system. As stated in Section 2, this approach was pioneered operator (e.g., a predicate) must be applied to a tuple only once in the context of continuous query processing in data stream pro- independent of the number of concurrent queries and updates that cessing systems (e.g., YFilter [8]) and first applied to traditional, may have subscribed to that tuple. Furthermore, the main memory short-lived queries as part of the CJoin system [3]. footprint is reduced significantly. For illustration, Figure 1 shows Figure 2 shows an example of such a global query plan. In this examples of both formats. example, four database operators are executed on three tables to There is a question of how to implement set-valued attributes evaluate five different query types. Figure 2 shows how different most efficiently. In the literature, two data structures have been kinds of queries can share operators in different ways. For instance, proposed: (a) bitmaps and (b) lists. For SharedDB, we chose to use both Q2 and Q3 involve a join between the Users and Orders table; a list-based implementation because that turned out to be the more consequently, this join is shared between queries of these two types space and time efficient option in all our experiments. (denoted as ✶1 in Figure 2). Likewise, queries of types Q3 and Q4 share the join between the Orders and Lineitems table. The sort on 3.2 Global Query Plan price can be shared between queries of types Q4 and Q5 . Instead of compiling every query into a separate query plan, Just as important as sharing across different types of queries is SharedDB compiles the whole workload of the system into a sin- sharing within the same type of query. For instance, the plan shown gle global query plan. The global query plan may serve hundreds in Figure 2 could be used to execute hundreds of concurrent queries or thousands of concurrent SQL queries and updates and may be of type Q4 (in addition to hundreds of concurrent queries of the 528

4.other types), all with different parameter settings for the O.DATE Set of Queries predicate. All these concurrent Q4 queries would share the same SELECT * SELECT * SELECT * join and sort operators so that only a single big join and sort would FROM R,S FROM R,S FROM R,S be carried out for all these Q4 queries. WHERE WHERE WHERE To understand the execution model of SharedDB, it is impor- = = = AND = ? AND = ? AND R.addr = ? tant to define what concurrent means. As mentioned in Section AND = ? AND S.price < ? AND > ? 2, SharedDB facilitates the sharing of operators in a very differ- ent way than QPipe, CJoin, DataPath, and other related systems that support operator-level sharing of computation. These systems Traditional Query Processing start executing each query as soon as it arrives. Unfortunately, this Q1 Q2 Q3 approach limits the opportunities to share computation for many database operators as observed in [15]. In order to overcome these limitations, CJoin and DataPath devise specific join methods. In Q1 = Q1 Q2 = Q2 Q3 = Q3 contrast, SharedDB batches queries and updates; it is this batching σ σ σ σ σ σ that enables SharedDB to exploit shared computation in a scalable and generic way, thereby making use of traditional, best-of-breed R S R S R S algorithms to implement joins, sorting, and grouping. SharedDB batches queries and updates in the following way: While one batch of queries and updates is processed, newly ar- Shared Query Processing riving queries and updates are queued. When the current batch Q1, Q2, Q3 of queries and updates has been processed, then the queues are Γquery_id emptied in order to form the next batch of queries and updates. Metaphorically, SharedDB works like the blood circulation: With every “heartbeat”, tuples are pushed through the global query plan in order to process the next generation of queries and updates. For = && OLTP workloads, these heartbeats can be frequent in the order of U R.query_id = U S.query_id one second or even less. Q1 Q2 Q3 Q1 Q2 Q3 The queueing of queries is fine-grained per input relation and σ σ σ σ σ σ query operator: In the example of Figure 2, queries of Type Q1 , Q2 , and Q3 would queue for reading the Users relation, queries of Type Q2 , Q3 , and Q4 would queue for reading the Order relation and so R S on. Then User tuples (generated for all three query types) would queue for the Γ and ✶1 operators which in turn would process Figure 3: Shared Join these tuples in batches so that all User tuples belonging to a spe- cific query are processed within a single batch. Pipelines are only timization [13]. The result of this logical optimization, one plan created for certain operators; for instance, an operator can stream for each query, is shown in the middle of Figure 3. In the sec- its output into the build phase of a hash join and even then, tuples ond step, the three individual query plans are merged into a single are processed in batches following a vector model of execution for global plan. That is, rather than processing three small joins (one better instruction cache locality [14, 17] (Section 4). for each query), one big join is executed that meets the require- SharedDB works particularly well if most query types are known ments of all three queries. Technically, the union of all R and S in advance; e.g., as part of JDBC PreparedStatements. In this case, tuples that the three queries are interested in are considered as part SharedDB can generate a global query plan for all the query types of the join. Furthermore, the join predicate is amended, thereby known in advance and exploit sharing in the best possible way for considering the query id. This way, an R tuple that is only rele- these queries. Ad-hoc queries need to be processed individually. vant for Query Q1 does not match an S tuple that is only relevant Nevertheless, even ad-hoc queries can take advantage of sharing. for Query Q2 . Finally, the routing of the join results to the relevant For instance, an ad-hoc query that asks for the ten users that have queries is carried out using a grouping operator (Γ) by query id. placed the most orders could share ✶1 with all queries of type Q2 As stated in the introduction, this way to process joins sounds and Q3 in the global plan of Figure 2. The Top 10 operation of that like a bad idea at first glance: It is usually better to process a few ad-hoc query, however, would have to be compiled and executed small joins than to process one big join. This approach only be- separately just as in any other traditional database system. In some comes advantageous if there are many (possibly hundreds) of con- sense, all operators of the global plan can be regarded by the query current queries and there is overlap in the tuples that need to be compiler as materialized views which are available to speed-up the processed for a set of queries. One particularly nice way in which processing of ad-hoc queries. This observation has also been made the global join plan of Figure 3 supports scalability with the num- in the QPipe project [15]. ber of concurrent queries is by making the query id part of the join predicate. This way, the query id can be indexed; for in- 3.3 Shared Join Plans stance, a hash join could be used that builds a hash table on the One of the key innovations of SharedDB is the way it processes query id of S (see below). This observation shows nicely how joins, sorts, and group-bys. Figure 3 shows how SharedDB pro- SharedDB achieves scalability by turning queries into data that can cesses joins for three different queries (Q1 , Q2 , Q3 ). All three be processed using traditional query processing techniques. This queries involve a join between tables R and S, but each query has is another key idea that SharedDB has adopted from data stream separate predicates on the S and R tables. In the first step, each processing systems. query is parsed and compiled individually, thereby pushing down Another crucial advantage of the global join plan of Figure 3 predicates. This step is typically referred to as logical query op- is that any join method can be used; e.g., hashing, sorting, index- 529

5.based, and nested-loops. In particular, any parallel or cache-aware Shared Sort Queries join methods can be used (e.g., [19]). Furthermore, as observed Query A: SELECT * FROM USERS in the previous paragraph, either = or R.query id = WHERE BIRTHDATE > 1980.01.01 S.query id can be used as primary join predicates. If the latter, a ORDER BY NAME set-based join is carried out as studied in [16]. In our implemen- Query B: SELECT * FROM USERS tation, we use a simple hash table that maps a query id to a set WHERE ACCOUNT > 1000 of pointers that reference the corresponding tuples. Again, this set ORDER BY NAME of pointers is implemented using a list data structure because this particular join method is only beneficial if these sets are small. It is even possible to use multi-dimensional join methods which are Relation: USERS commonly used in spatial and temporal database systems; e.g., [9]. Name Account Birthdate Query Ids In summary, SharedDB can always make use of the best-of-breed John Smith 3,000 1980.03.05 A, B algorithms. In contrast, the processing model of CJoin and DataP- Kate Johnson 800 1976.04.11 ath constrains the use of join methods used: In some cases, using Bill Harisson 1,230 1978.03.02 B the dedicated join method of DataPath might show good perfor- mance; in other cases, however, its performance may be terrible. Nick Lee 540 1982.02.09 A SharedDB makes use of multiple join methods for the same rea- James Meyer 2,300 1981.03.09 A, B sons as traditional database systems. In addition to supporting multiple join methods, SharedDB does not constrain join ordering thanks to its flexible data-query model. Sorted Output In Figure 2, for instance, the Orders / Items join (✶2) of Q3 could Name Account Birthdate Query Ids be carried out before or after the join with Users (✶1). As a result, the following join odering for Q3 would also be possible: Bill Harisson 1,230 1978.03.02 B John Smith 3,000 1980.03.05 A, B (Users ✶ Orders) ✶ Items James Meyer 2,300 1981.03.09 A, B This join order for Q3 would also enable sharing ✶1 with Q2 and Nick Lee 540 1982.02.09 A ✶2 with Q4 , in the same way as the original join order for Q3 depicted in Figure 2. To share a join across queries, SharedDB Figure 4: Shared Sort only fixes the join method and the inner and outer relation of the join for all queries that are share the join. Studying the details of the SharedDB query compiler and opti- grouping is the most expensive part so that sharing can be applied mizer is beyond the scope of this paper. Developing a sophisticated to reduce the cost of the most performance critical operation. cost-based optimizer is an important avenue for future work. We The storage manager of SharedDB provides two operators for believe that the two-step optimization approach shown in Figure accessing tables: shared table scans and shared index probes. We 3 (i.e., determine join orders for each query individually and then do not claim any novelty here. Shared table scans have been studied merge the plans for the individual queries into a single global plan) extensively in previous work ([10, 29, 21, 28]) and SharedDB uses will perform well for many workloads. However, it is conceivable the ClockScan algorithm proposed in [28]. Shared index probes that an approach that optimizes the global plan for all queries in a have also been studied in the past (e.g., [12]) and SharedDB simply single pass results in better performance in certain cases. adopts the well established techniques in that domain, too. Both op- erators generate tuples in the data-query model such as those shown in Figure 1. 3.4 Other Operators: Scan, Sort, Group-by The idea shown in Figure 3 to process shared joins can also be applied to process any other operator of the relational algebra. Fig- 3.5 Discussion ure 4 illustrates the principle for a shared sort using two example This section summarizes the main advantages and disadvantages queries and a few tuples of a Users table. Again, in theory, it is of SharedDB in comparison to traditional, query-at-a-time systems better to have a few small sorts than one big sort, but sharing may and other systems that exploit shared computation. more than offset this effect. In this example, it is more efficient to do one sort with four tuples than to do two sorts with three tuples SharedDB vs. “query-at-a-time”. The main disadvantage each. Obviously, the overlap increases with the number of queries. of SharedDB is that it adds latency to each query due to its batch- The Top-N operator is an extension of the sort operator and can, based execution model. In contrast, traditional database system thus, benefit from sharing in a similar way as the sort operator. start processing queries as soon as they are submitted by an appli- In the SharedDB implementation, the shared Top-N operator first cation. In the worst case, batching increases latency by a factor of sorts all the tuples that are relevant for all the active queries; thus, 2: one cycle of queuing and one cycle of actual query processing. the sorting is shared. Then, it filters the Top N results for each The biggest advantage of SharedDB is that it is able to bound query individually. computation and scales with the number of concurrent queries and Like Top N, the Group-By operator is carried out in two phases. updates. In the worst case if there are many concurrent queries that In the first phase, the input tuples are grouped. Again, this phase involve all the tuples of the whole database, SharedDB joins and can be shared so that all the tuples that are relevant for all active sorts the whole relations as part of the global query plan, indepen- queries are grouped in one big batch. Also, any kind of grouping dent of the number of concurrent queries. This way, SharedDB can algorithm (e.g., hashing or sorting) can be used for this purpose. In give response time guarantees which is critical for many modern the second phase, HAVING predicates and aggregation functions are applications to meet SLAs. For instance, if SLAs specify that all applied to the tuples of each group. Just as for Top N, this second queries must be processed within 3 seconds, then SharedDB would phase must be carried out for each query individually. Fortunately, provision enough CPU cores such that a batch of queries can be 530

6.processed in at most 1.5 seconds in the worst case. All this is pre- 4. IMPLEMENTATION DETAILS dictable and can be planned upfront: There is no interference and This section describes a number of implementation details of resource contention between concurrent queries because SharedDB SharedDB that will help the reader get a better understanding of schedules the data flow and the utilization of cores at compile-time the system. as part of its global plan. In contrast, the work carried out by tradi- tional database systems grows linearly with the number of concur- 4.1 Query Model rent queries. Furthermore, traditional database systems allocate a separate thread for each query and these threads might compete for As described in Section 3, SharedDB evaluates queries using a shared resources (e.g., the main memory bus or processor caches) data flow network of always-on database operators. There are no in an unpredictable and uncontrollable way. individual query plans and instead a global plan is always active. As mentioned in Section 3.3, SharedDB might result in extra Every SharedDB query describes an acyclic path in the data flow work if the load on the system is light and there is little or no over- network. As an example, Figure 5 shows a possible representation lap in the data processed by the queries that share a common op- of Q1 of Figure 2. In this example, the GroupBy operator receives erator. With an increasing load, however, the overlap increases. In a query from the Output operator. In order to execute it, another theory, if each Query Qi needs to process ni tuples, there are k query is issued to the TableScan operator. Result tuples are gener- concurrent queries, n = Σki=1 ni , and o is the number of tuples that ated by the TableScan, passed to the GroupBy operator and finally at least one query needs to process (o ≤ n), then SharedDB will sent to the clients. save work for an operator with complexity O(f (n)) if: Operator Configuration f (o) < Σi=1 kf (ni ) 1. TableScan USERS WHERE LAST LOGIN > 2011.01.01 For operators with linear complexity (e.g., table scans and joins un- 2. GroupBy GROUP(USERS.COUNTRY) der certain circumstances, this equation is always fulfilled (unless SUM(USERS.ACCOUNT) o = n which is the worst case for SharedDB). For operators with 3. Output Network, TCP Port 5843 a complexity of f (n) = n ∗ logn (e.g., sorts and certain joins), the advantage of SharedDB depends on o and n: In such cases, Figure 5: An Example of a Query in SharedDB SharedDB will result in extra work in the worst case (o = n). But, even in these bad cases, SharedDB maintains its property of bounded computation and predictable performance. 4.2 Operators Shared operators are designed to evaluate a number of queries SharedDB vs. “pipelined sharing”. The most significant concurrently by processing them in cycles. Every cycle evaluates disadvantage of SharedDB as compared to other, recent approaches the set of active queries or subqueries. Any additional queries that to effect shared computation (e.g., QPipe, CJoin, and DataPath) arrive after the cycle has started, are queued and will be evaluated is again that SharedDB adds latency due its batched processing during the next cycle. model. In contrast, QPipe, CJoin, and DataPath have a continu- Algorithm 1 shows an abstract SharedDB operator that processes ous query processing model. The big advantage of this batched queries in cycles. At the beginning of the cycle, the operator de- processing model in concert with the special way in which joins queues the pending queries and activates them by issuing their sub- and other operators are processed, is its generality to any kind of queries to the respective operators. Then it receives the generated operator of the relational algebra, any kind of algorithm (e.g., join result tuples they generated, processes them and forwards the pro- method), and to the processing of updates. This generality enables cessed output to the issuers of the queries. For example, in the case SharedDB to process any kind of workload (OLTP, OLAP, and of a filter operator, like SQL LIKE, the ProcessTuple function mixed) without any special tuning, whereas QPipe, CJoin, and Dat- tests the tuple against the LIKE expression. If it matches, the result aPath have so far only been shown to work well for OLAP work- tuple is pushed to the next operator in the pipeline. loads with queries that each involve processing a large portion of Blocking operators, such as the SORT operator, can use the func- the entire database. As observed in the QPipe work [15], the kinds tion ProcessTuple to append the tuple to a buffer structure (i.e., of sharing are limited depending on the query operator in a contin- a vector). The same buffer structure is used for all the queries that uous query processing model. As a result, CJoin and DataPath rely belong to the same batch. In this case, no results are produced from on specific, dedicated join methods in order to carry out shared join ProcessTuple. Once all the result tuples have been received, the computation; other operators (e.g., sorting and grouping) need to be buffer structure is sorted and it is pushed to the consumers as part carried out for each query individually in these systems. Further- of the SendEndOfStream function. more, these dedicated join methods only show good performance for specific kinds of workloads. Just as a traditional database sys- 4.3 Runtime Configuration tem, SharedDB supports multiple join methods in order to adapt to All database operators are executed in a separate hardware con- different kinds of workloads. text. If there are enough CPU cores in the system, each database A specific advantage of SharedDB as compared to QPipe and operator is assigned to a different CPU core, using hard processor DataPath is its ability to meet SLAs and bound the response time affinity. This guarantees that the threads do not migrate between of queries. Predictable performance is also supported in CJoin, but processors, allowing for optimal instruction cache locality. Addi- only for star-join queries. Another advantage of the batched pro- tionally, the binding of operators to physical CPU cores allows for cessing model of SharedDB is that it supports concurrent updates optimal use of NUMA (non-uniform memory access) hardware ar- and strong consistency (e.g., Snapshot Isolation). Section 4 gives chitectures. The local stack and heap of every operator is allocated more details on how updates are processed in SharedDB. Again, on memory that has the minimum NUMA distance. As a result updates and transaction processing have not been studied in the CPUs never access each other’s local memory except for passing context of QPipe and DataPath; CJoin does support concurrent up- operation results, giving maximum bandwidth and minimum ac- dates, but the model is more complicated than in SharedDB. cess latency where it matters: during query processing. 531

7. Algorithm 1: Skeleton of a SharedDB Operator stream processing. Updates are executed in arrival order as part of Data: SyncedQueue iqq; // incoming queries queue the same scan that executes the queries. At this level, Crescando Data: SyncedQueue irq; // incoming result tuples queue guarantees that all select queries read a consistent snapshot of data. while true do The original Crescando storage manager presented in [28] only Array aq ← ∅ ; //Array of active queries supports full table scans using the ClockScan algorithm. For this //Operators that will receive the produced results work, we extended Crescando and implemented B-Tree indexes Array consumers ← ∅ ; and index probe operators as an additional access path. These index //Activate all queries in the incoming queue probe operators are used to implement regular scans (with predi- while ¬IsEmpty(iqq) do Put(aq, Get(iqq)); cates) on base tables and to implement index nested-loops joins. //Enqueue subqueries to underlying operators The logic of the index probe operator is similar to Algorithm 1. foreach Query q ∈ ag do Look-ups are enqueued in the pending query queue which is emp- Query subQuery ← GetSubQuery(q); tied at the beginning of each cycle. During the cycle, the updates Operator op ← GetOperator(subQuery); are executed in the arrival order and multiple B-Tree look-ups are EnqueueQuery(op, subQuery); used to evaluate all the select queries. Executing multiple look-ups Operator consumer ← GetConsumer(q); in one cycle allows for better instruction and data cache locality Put(consumers, consumer); [12]. Just as the (shared) full table scan, the index probe operator guarantees that all select queries will read a consistent snapshot. //Loop until all active queries have finished In general, transactions can be implemented in SharedDB in al- Number queriesLef t ← Size(aq); most the same way as in any other database system. In partic- while queriesLeft do ular, atomicity, consistency (i.e., checking integrity constraints), //Receive a tuple from the underlying operators and durability (i.e., recovery) are completely orthogonal to shared Tuple t ← Get(irq); query and update processing. With regard to isolation, the design of //Process the incoming tuple. Depending on the SharedDB favors optimistic and multi-version concurrency control operator, this function might generate results because any kind of locking would result in unpredictable response Tuple resultT uple ← ProcessTuple(t); times due to lock contention and blocking. In particular, Snapshot if ¬IsNull(resultTuple) then SendResult(consumers, resultTuple); Isolation, as supported by the Crescando storage manager, comple- ments nicely the batch-oriented shared query processing model of if IsEndOfStream(t) then SharedDB. With regard to durability, Crescando keeps all data in queriesLef t ← queriesLef t − 1; main memory, but it also supports full recovery by checkpointing //Notify the consumers that processing has finished and logging all data to disk. SendEndOfStream(consumers); Crescando supports horizontal partitioning of data and process- ing several partitions with different cores in parallel. This feature of Crescando, however, was not used in the performance experiments presented in Section 5. Moreover, careful deployment of database operators across CPU cores can further exploit modern multi core hardware architectures. 4.5 Replication In these systems, CPU cores that are located on the same chip, share The design of our system allows replication in SharedDB. In components of the memory architecture, like the L3 cache and the fact, not only storage operators but also database operators can be NUMA main-memory region. Database operators that access simi- replicated. Replicating a storage operator does not hurt consistency, lar sets of records can be assigned to “adjacent” CPU cores in order because updates are always executed in the same order as they were to benefit from the sharing of these memory components. received by the system. Moreover, data processing operators can Our existing implementation supports all these optimizations. be also replicated in order to reduce the effects of bottlenecks and Currently the assignment of operators to CPU cores has been per- hotspots in the data flow network. For instance, if a specific oper- formed manually by examining the data access paths of every op- ator becomes a bottleneck, SharedDB can partition the load across erator. Future versions of SharedDB will support an optimizer to two replicas of the same physical operators. Similar to the deploy- automatically deploy operators to the proper CPU cores. ment of database operators to CPU cores, replicating a specific op- erator is a task that needs to be performed by an optimizer, based 4.4 Storage Manager and Transactions on the global query plan. The current implementation of SharedDB is based on the Cres- cando storage manager which has been successfully deployed to 5. PERFORMANCE EXPERIMENTS AND serve demanding update-intensive operational business intelligence workloads of the travel industry [28]. The Crescando storage man- RESULTS ager is currently in production at Amadeus, the market leader for In order to evaluate the performance of SharedDB, we carried airline reservation. out a series of performance experiments. As baselines, we used In the version used for SharedDB, Crescando supports two ac- MySQL and a high-end commercial database product. We present cess methods for reading base tables: (a) (shared) table scan, and results of comprehensive workloads using the TPC-W benchmark (b) index probes. Table scans are carried out using the ClockScan and the results of micro benchmarks with individual queries in or- algorithm, described in detail in [28]. ClockScan batches queries der to demonstrate specific effects. and updates in the same way as SharedDB and processes a whole batch of queries and updates. As a result, the ClockScan algo- 5.1 Experimental Environment rithm fits nicely into the overall SharedDB design. Performance The TPC-W benchmark models an online bookstore. It assesses is increased by indexing the query predicates instead of the data the performance of multi-tier information systems which contain a and performing query-data joins, a technique widely used in data- database layer, a web server layer and a client layer. Every client 532

8. Search Item Search Item Search Item Best Sellers By Subject By Author View Cart Products Products Confirm By Title Request Request Display Refresh Admin Admin Home Detail Order New Cart Buy Top-N Sort Distinct Group By Group By Top-N Top-N (by Date) (by Title) NL ⋈ Distinct * Hash ⋈ HashJoin NL ⋈ Hash ⋈ Hash ⋈ Filter Top N NL ⋈ Filter Disjunction Disjunction NL ⋈ Hash ⋈ (By Date) Like Like NL ⋈ Hash ⋈ Expression Expression SHOPPING SHOPPING COUNTRY ADDRESS CUSTOMER AUTHOR ITEM ORDER_LINE ORDERS CARTLINE CART Figure 6: Global Plan for the TPC-W Benchmark (Updates have been omitted for simplicity) of the system is an emulated browser that issues http requests (web fied because we were interested in the performance of the database interactions in TPC-W terminology) to the web server layer. Be- system under high load. The client machines were connected to the tween two web interactions, there is a “thinktime” which is defined database server machine using a 1 Gbps ethernet. by a negative exponential distribution and has an average of 7 sec- We implemented the full TPC-W workload in SharedDB. Fig- onds. The web servers accept the clients’ requests and issue queries ure 6 shows the global query plan that was generated. It con- to the database layer in order to retrieve the requested data. Each sists of 26 database operators in addition to shared scans and in- client interaction is translated to a number of database queries, de- dex probe operators to access the nine base tables of the TPC-W pending on the type of the interaction. For instance, to execute the benchmark. The current version of SharedDB does not feature any “Home” web interaction (i.e., a user visits the application’s home parallel join or sort algorithms and does not support partitioning or page), two queries have to be evaluated: The first query fetches a replication of base data. As a result, we used at most 32 cores for set of promotion items, and the second query retrieves the profile SharedDB, one core for each operator and nine cores for the shared of the user. table scans. (Filter operators do not require a separate CPU core In order to be compliant with the TPC-W specification, every and the Distinct* operator is evaluated as part of the underlying web interaction needs to be answered in a predefined amount of Hash✶ operator.) SharedDB is able to make use of the additional time. The timeout depends on the type of the web interaction, rang- CPU cores by replicating operators, as explained in Section 4.5. ing from 2 seconds for small, point queries, up to 20 seconds for However, we avoided using replication of data and operators in all long running analytical queries. Any web interaction that exceeds our experiments as the goal is to study the benefits of sharing rather this timeout, is not valid. than the benefits of this particular technique. In order to demon- TPC-W contains a total of 14 different web interactions. Ev- strate the scalability of SharedDB, we varied the number of cores ery web interaction has a different probability of appearing in the between 1 and 32 in some experiments. We used the kernel param- workload. The probabilities of all the web interactions are given eter maxcpus to limit the number of available CPU cores. If not by the “workload mix”. TPC-W defines three different workload stated otherwise, 24 cores were used. mixes: Browsing, Shopping, and Ordering. The Browsing mix is a In all experiments reported in this paper, SharedDB held all the read-mostly, search intensive workload with few updates and many data in main memory. Disk I/O was only required to log updates as analytical queries. The Ordering mix is a write-intensive workload part of the Crescando storage manager (Section 4.4). As a result, with only a few analytical queries. The Shopping mix is somewhere running the TPC-W benchmark on SharedDB was CPU-bound. It in between with some updates and some analytical queries. was also CPU-bound for the two baseline systems that we describe In all experiments reported in this paper, we used a 48 core ma- in the next sub-section. chine as a database server. This machine features four twelve- core AMD Opteron 6174 (“Magny-Cours”) sockets and is equipped 5.2 Baselines with 128 GB of DDR3 1333 RAM. Each core has a 2.2 GHz clock To put the performance of SharedDB into perspective, we com- frequency, 128 KB L1 cache, 512 KB L2 cache, and is connected pared it against two existing database systems. The first one is a to a shared 12 MB L3 cache. The operating system used in all ex- popular commercial system that will be referred to as SystemX. periments was a 64-bit SMP Linux. The emulated browsers were The second one is a widely used database system, MySQL 5.1 executed on up to eight client machines, each having 16 CPU cores using the InnoDB storage engine. Just like SharedDB, MySQL and 24 GB of DDR3 1066 RAM. The clients also ran the applica- and SystemX are general-purpose database systems, designed to tion logic; that is, the clients issued queries directly to the database perform well for any kind of workload. Other existing solutions server. This is a slight simplification of the TPC-W set-up and justi- may reach better performance for specific workloads; e.g., column stores for OLAP workloads and lock-free, single-threaded systems 533

9.for OLTP workloads. We do not compare SharedDB to these sys- TPC-W Browsing Mix tems because we wanted to understand specifically the performance 2000 GeneratedLoad tradeoffs of SharedDB’s shared execution model as opposed to the Throughput (Web Interactions/second) MySQL traditional, query-at-a-time model. SystemX Clearly, the comparison of SharedDB with MySQL and Sys- SharedDB 1500 temX is not an apples-to-apples comparison, but we tried to make the comparison as fair as possible by fine-tuning the two baseline systems. We built all the necessary indexes on the two systems and 1000 we used an in-main-memory filesystem for the data files of both systems. Furthermore, we provided a big memory buffer pool for MySQL and SystemX, big enough to hold the database and all in- 500 dexes. Finally, we filled this buffer pool by carrying out full table scans and a warm-up phase of the benchmark. As a result, neither MySQL nor SystemX performed any disk I/O to carry out queries. 0 Disk I/O was only performed by SystemX and SharedDB as part of 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Emulated Browsers (In Thousands) logging in order to persist updates. For MySQL, all recovery op- tions were turned off so that MySQL did not even perform disk I/O TPC-W Ordering Mix for logging. Overall, however, the workload was CPU-bound for 2000 all three systems. The isolation level used was “read committed”, GeneratedLoad Throughput (Web Interactions/second) MySQL as TPC-W requires only session consistency. SystemX We disabled query and result caching in both MySQL and Sys- 1500 SharedDB temX. Query caching allows a database system to skip execution of a query as long as it has been executed before and no change in the dataset has occurred. Result caching performs the same opti- 1000 mization on each individual subquery. By disabling any caching, we guarantee that we measure the performance of evaluating every query, rather than fetching pre-calculated results from a cache. 500 In summary, we did everything to make a fair comparison be- tween the three systems. Overall, however, our goal is not to show that SharedDB is better than MySQL, SystemX, or any other ex- 0 isting database system. Our goal was to show that batch-oriented 1 2 3 4 5 6 7 8 9 10 11 12 13 14 sharing can result in predictable performance under high load. To Emulated Browsers (In Thousands) this end, we also present the results of micro-benchmarks that study the behavior of all systems under growing load and isolates the ef- TPC-W Shopping Mix 2000 fects of “query-at-a-time” vs. “shared” query processing. GeneratedLoad Throughput (Web Interactions/second) Unfortunately, we could not carry out performance experiments MySQL SystemX on QPipe, CJoin, or DataPath, the closest competitors of SharedDB SharedDB in terms of shared computation. These systems are research pro- 1500 totypes and were not available for experimentation. Furthermore, we believe that significant research is necessary before the special, continuous (as opposed to batch-oriented) sharing featured by these 1000 systems can be applied to comprehensive workloads such as the TPC-W benchmark. So far, these systems have only been applied to OLAP queries of the TPC-H and SSB benchmarks. 500 5.3 Performance under Varying Load 0 In the first set of experiments, we compared the performance of 1 2 3 4 5 6 7 8 9 10 11 12 13 14 SharedDB, MySQL and SystemX on the three workload mixes of Emulated Browsers (In Thousands) TPC-W. We varied the load of the system by increasing the number Figure 7: Throughput: Varying Load, All Mixes of emulated browsers and measured the web interactions that were successfully answered by the system in the response time limit that The Ordering mix involves few searches for items, and instead is defined by the TPC-W specification. All web interactions that places many new orders. For this mix, MySQL is able to execute as exceeded this limit were not accounted as successful. For this ex- many web interactions as SystemX. Again, MySQL had a bit of an periment, we configured the database system to use 24 CPU cores. unfair advantage as compared to SystemX and SharedDB here be- Figure 7 shows the results. SharedDB is able to achieve higher cause all recovery options were turned off for MySQL. SharedDB throughput in all three TPC-W workload mixes. For instance, in the still wins in this experiment, but the margins are lower. In the Or- Browsing mix, we see that SharedDB is able to sustain twice the dering mix, most queries are point queries that can be executed throughput of SystemX and eight times the throughput of MySQL. highly efficiently with an index look-up in a traditional, query-at- The Browsing mix involves use cases of customers searching for a-time fashion. Furthermore, there is a little benefit for sharing for items. Most of the search queries are heavy queries that involve such point queries. a number of joins and sorts. This result confirms previous results Finally, the bottom diagram of Figure 7 shows the throughput with MQO on TPC-H queries (e.g., [15]) that shared computation results with varying load of the three systems for the Shopping mix. is beneficial if heavy queries need to be executed. Again, SharedDB is the clear winner and again the reason is that 534

10. TPC-W Browsing Mix Performance of Single TPC-W Web Interactions 1000 MySQL Throughput (Web Interactions/second) 2000 SystemX SharedDB Max Throughput (WIPS) 800 1500 600 400 1000 200 500 0 Home NewProducts BestSellers ProductDetail SearchRequest SearchResults ShoppingCart CustomerRegistration BuyRequest BuyConfirmation OrderInquiry OrderDisplay AdminRequest AdminConfirm 0 0 8 16 24 32 40 48 Number of Cpu Cores TPC-W Ordering Mix MySQL Throughput (Web Interactions/second) 2000 SystemX SharedDB MySQL SystemX SharedDB 1500 Figure 9: Analysis of Individual Web Interactions 5.4 Scaling with the Number of Cores 1000 In the second set of experiments, we explored the impact of hav- ing additional CPU cores on the database server. For this reason, 500 we varied the number of available CPU cores of the database server machine from 1 to 48 and repeated the experiments of Section 5.3 for each configuration. As mentioned in Section 5.1, we varied 0 the number of cores for SharedDB only between 1 to 32 because 0 8 16 24 32 40 48 SharedDB cannot take advantage of additional cores for the TPC- Number of Cpu Cores W benchmark in the configuration used for these experiments. We measured the maximum number of successful web interactions per TPC-W Shopping Mix second each system could achieve. MySQL Figure 8 shows that SharedDB again is the clear winner for all Throughput (Web Interactions/second) 2000 SystemX SharedDB three workload mixes and almost independent of the number of cores. SharedDB is only outperformed by MySQL for the Ordering 1500 mix if the database server is constrained to using only a single core. Again, the magic ingredient is sharing and the special architecture of SharedDB. Both SharedDB and SystemX scale nicely with the 1000 number of cores for the Browsing and Shopping mixes. Scalability is limited for the update-intensive Ordering mix because at some point concurrency control and transaction management limit the 500 throughput of the system. MySQL does not scale beyond twelve cores, independent of the workload. This observation was also made in a recent study by Salomie et al. [23]. 0 0 8 16 24 32 40 48 Number of Cpu Cores 5.5 Analysis of Individual Web Interactions The TPC-W benchmark involves a variety of different web in- Figure 8: Max. Throughput: Vary # Cores, All Mixes teractions, each involving a different set of queries. For instance, the home web interaction involves two simple point queries (fetch- ing promotion articles and a user’s profile). Other web interactions SharedDB takes advantage of shared computation for heavy and involve point queries and several updates. Finally, there are also medium-sized queries. web interactions that involve heavy, analytical queries with mul- We would like to reiterate that SharedDB used the same global tiple joins, grouping, and sorting. Figure 9 shows the maximum plan for all three mixes. SharedDB was not adapted or tuned in throughput that each of the three systems can achieve if the clients any way to meet the specific requirements of these workloads. Ob- are configured to issue only queries that correspond to a single web viously, a “query-at-a-time” system can adapt much better to the interaction. These experiments were carried out in a configuration workload and, for instance, optimize a query based on the currently with 24 cores for the database server. available resources. We do not know how, for example, SystemX SharedDB wins in this experiment for many kinds of web inter- does so. It is clear, however, that shared computation is more criti- actions (e.g., BestSellers and CustomerRegistration). Again, shar- cal to sustain high throughputs with response time guarantees than ing is the main reason for SharedDB’s success. Keep in mind that any adaptation technique implemented in MySQL and SystemX. SharedDB does not only support sharing across queries of different 535

11. Response Time of Batches of the Search Item By Title Query Load Interaction between Light and Heavy Queries 3200 800 Small queries Throughput (Web Interactions/second) 700 Ideal 2800 TPC-W timeout MySQL Batch Response Time (msec) MySQL SystemX 2400 SystemX 600 SharedDB SharedDB 2000 500 1600 400 1200 300 800 200 400 100 0 0 0 500 1000 1500 2000 0 10 20 30 40 50 Batch Size: Number of Concurrent Queries Percentage of Heavy Queries in the Workload Figure 11: Load Interaction Response Time of Batches of the Best Sellers Query 25000 TPC-W timeout tems follows the same trend. SystemX is able to execute the batches MySQL SystemX of queries faster than SharedDB, which is expected as explained in Batch Response Time (msec) 20000 SharedDB Section 5.5. This query is executed so fast that the overhead of batching queries and updates is greater than the gains. 15000 For best sellers queries, the performance of the three systems differs significantly. MySQL’s performance is almost linear with the number of queries in the batch. SystemX outperforms MySQL 10000 because it is simply the better and more mature system. The best performance, however, was again achieved with SharedDB. Even 5000 though SharedDB has a less mature query processor, it outperforms SystemX in this experiment because of sharing. 0 0 500 1000 1500 2000 5.7 Load Interaction Batch Size: Number of Concurrent Queries In our last experiment, we explored how heavy queries compete Figure 10: Heavy Queries vs. Light Queries with lighter queries for resources and how this resource contention may impact performance. For this reason we used a synthetic work- types but also sharing between concurrent queries of the same type, load that is a mixture of the two queries that were analyzed in Sec- executed with different parameter settings. Figure 9, however, also tion 5.6. A constant load of 400 “search item by title” queries per shows that SharedDB loses for several web interactions as com- second was sent to the systems under test. The load of these “search pared to SystemX (e.g., NewProducts and ShoppingCart). These item” queries can easily be sustained by all three systems. In ad- web interactions involve mostly point queries and/or updates for dition to these “search item by title”, we submitted an increasing which sharing does not help much. SystemX wins because it is the number of “best sellers” queries in this experiment. This way, we more mature system and carries out the same work more efficiently were able to study precisely how mixing light and heavy queries than SharedDB. In other words, SharedDB can only beat SystemX affected the performance of the three different systems. if it carries out less work as a result of shared computation. The choice of “search item by title” and “best sellers” queries was not random. In fact, as shown in Figure 6, these two queries 5.6 Heavy Queries vs. Light Queries have a shared join on Items and Authors in SharedDB. As a re- Next, we analyzed the performance of the three test systems un- sult, these two queries compete for resources in SharedDB, too. der two very different queries of the TPC-W benchmark. The first Of course, an experiment on resource contention between different one uses a key identifier to select an item and its author. This query kinds of queries would not be fair if SharedDB would run the two is part of the ProductDetail web interaction. It is a lightweight different kinds of queries on different cores. query that performs a join of two relations and fetches one record The throughput results for all three systems are shown in Figure from each of them. The second query is the “best sellers” query that 11. At the beginning (the left part of the figure), the load consists of is part of the BestSellers web interaction. This heavy query in- only the 400 “search item” queries and only a few concurrent “best volves the analysis of the latest 3,333 orders that have been placed sellers” queries. Such a workload can be sustained by all three sys- by customers in order to retrieve the most ordered items that match tems. Moving to the right (i.e., increasing the number of concur- a selection predicate provided by the client. It performs three joins rent “best sellers” queries), MySQL and SystemsX are not longer over four relations which are followed by a group-by operator and able to sustain the throughput. In fact, the throughput of MySQL additional sorting of the results. and SystemX drops below 400 so that the presence of “best sell- We used batches of an increasing number of such queries and ers” queries hurts the execution of the “search item” queries. Ob- issued a stream of them to the three systems while measuring the viously, this problem could be fixed by introducing sophisticated time needed in order to complete the whole batch. For SharedDB, load control mechanisms. Figure 11 shows that such load control the measured time includes the queueing time of the batch, as de- mechanisms are not needed for SharedDB. The overall through- scribed in Section 3.2. The results are shown in Figure 10. With put increases monotonically; the more concurrent queries, the more regard to the search item query, the performance of the three sys- sharing and the merrier. Unfortunately, SharedDB is not a perfor- 536

12.mance panacea either. Starting at about 250 “best sellers” queries [5] S. Chaudhuri and V. R. Narasayya. An Efficient Cost-Driven Index per second, SharedDB is not able to handle the full workload either Selection Tool for Microsoft SQL Server. In Proc. VLDB, pages and its throughput diverges from the ideal throughput, depicted by 146–155, 1997. the top-most dotted line in Figure 11. Since the concurrent “best [6] N. N. Dalvi, S. K. Sanghai, P. Roy, and S. Sudarshan. Pipelining in Multi-Query Optimization. In Proc. PODS, pages 59–70, 2001. sellers” queries have different parameter settings, perfect sharing is [7] S. Dar, M. J. Franklin, B. T. J´onsson, D. Srivastava, and M. Tan. not possible and there is a per-query overhead in this experiment Semantic Data Caching and Replacement. In Proc. VLDB, pages which limits the scalability of SharedDB with the number of con- 330–341, 1996. current queries. Nevertheless, SharedDB scales much better than [8] Y. Diao, M. Altinel, M. J. Franklin, H. Zhang, and P. Fischer. Path the other systems, beating SystemX by a factor of 3 in throughput in Sharing and Predicate Evaluation for High-Performance XML the extreme case of this experiment. More importantly, SharedDB Filtering. ACM Trans. Database Systems, 28(4):467–516, 2003. is robust and makes sure that the processing of heavy queries does [9] J.-P. Dittrich, B. Seeger, D. S. Taylor, and P. Widmayer. Progressive not have an impact on the performance of light queries, even if no Merge Join: a Generic and Non-Blocking Sort-Based Join Algorithm. In Proc. VLDB, pages 299–310, 2002. special load control techniques are applied. [10] P. M. Fernandez. Red Brick Warehouse: A Read-Mostly RDBMS for Open SMP Platforms. In Proc. SIGMOD, page 492, 1994. 6. CONCLUSION [11] S. Finkelstein. Common Expression Analysis in Database This paper presented SharedDB, a general-purpose relational da- Applications. In Proc. SIGMOD, pages 235–245, 1982. tabase system. At the core of SharedDB is a novel query process- [12] P. M. Fischer and D. Kossmann. Batched Processing for Information Filters. In Proc. ICDE, pages 902–913, 2005. ing model that is based on batching queries and shared computa- [13] L. M. Haas, J. C. Freytag, G. M. Lohman, and H. Pirahesh. tion. SharedDB does not always outperform traditional database Extensible Query Processing in Starburst. In Proc. SIGMOD, pages techniques that rely on the “query-at-a-time” processing model. 377–388, 1989. The advantages of SharedDB become apparent for high loads with [14] S. Harizopoulos and A. Ailamaki. StagedDB: Designing Database unpredictable mixes of heavy and light queries and updates. In Servers for Modern Hardware. IEEE Data Eng. Bull., 28(2):11–16, these situations, the performance (i.e., query latency and sustained 2005. throughput) of SharedDB is extremely robust without requiring any [15] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. QPipe: A special tuning knobs, load control, or other adaptive techniques. Simultaneously Pipelined Relational Query Engine. In Proc. SIGMOD, pages 383–394, 2005. Our experimental study using the TPC-W benchmark confirmed [16] S. Helmer and G. Moerkotte. Evaluation of Main Memory Join this robustness along a number of different dimensions: query load, Algorithms for Joins with Set Comparison Join Predicates. In Proc. hardware configuration, and query/update diversity. Compared to VLDB, pages 386–395, 1997. other related systems that are also based on shared computation, [17] S. H´eman, N. Nes, M. Zukowski, and P. Boncz. Vectorized Data SharedDB wins in terms of generality. For instance, these systems Processing on the Cell Broadband Engine. In Proc. DaMoN, pages are typically not suitable to process transactional workloads with 4:1–4:6, 2007. many small queries and updates. [18] M. G. Ivanova, M. L. Kersten, N. J. Nes, and R. A. Gonc¸alves. An The SharedDB project is only at its beginning and this paper only Architecture for Recycling Intermediates in a Column-Store. In Proc. SIGMOD, pages 309–320, 2009. presented the main design principles and architecture of SharedDB. [19] C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish, The next logical step is to develop a comprehensive query opti- J. Chhugani, A. Di Blas, and P. Dubey. Sort vs. Hash Revisited: Fast mizer that automatically generates good global query plans. As Join Implementation on Modern Multi-Core CPUs. In Proc. VLDB, part of this work, we will develop a cost model for shared execu- pages 1378–1389, 2009. tion in SharedDB. Furthermore, we will extend the SharedDB run- [20] D. Kossmann, M. J. Franklin, G. Drasch, and W. Ag. Cache time system in order to make better use of future NUMA machines Investment: Integrating Query Optimization and Distributed Data with possibly hundreds of cores; e.g., integrate parallel joins and Placement. ACM Trans. Database Systems, 25(4):517–558, 2000. advanced partitioning and replication of base data. Another impor- [21] L. Qiao, V. Raman, F. Reiss, P. J. Haas, and G. M. Lohman. Main-Memory Scan Sharing for Multi-Core CPUs. In Proc. VLDB, tant consideration is to optimize for processor cache locality and pages 610–621, 2008. integrate, e.g., cache-aware join methods. Currently, SharedDB [22] V. Raman, G. Swart, L. Qiao, F. Reiss, V. Dialani, D. Kossmann, is based on standard, traditional join methods. Finally, we would I. Narang, and R. Sidle. Constant-Time Query Processing. In Proc. like to investigate the benefits of distributing the global query plan ICDE, pages 60–69, 2008. across different machines. [23] T.-I. Salomie, I. E. Subasu, J. Giceva, and G. Alonso. Database Engines on Multicores, Why Parallelize when you can Distribute? In Acknowledgments. This work was funded by the Enterprise Proc. EuroSys, pages 17–30, 2011. Computing Center of ETH Zurich ( Furthermore, [24] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and the work of G. Giannikis was supported by a Swiss National Sci- T. G. Price. Access Path Selection in a Relational Database ence Foundation grant as part of its Pro-Doc program. Management System. In Proc. SIGMOD, pages 23–34, 1979. [25] T. K. Sellis. Multiple-Query Optimization. ACM Trans. Database 7. REFERENCES Systems, 13(1):23–52, 1988. [1] S. Arumugam, A. Dobra, C. M. Jermaine, N. Pansare, and L. Perez. [26] L. D. Shapiro. Join Processing in Database Systems with Large Main The DataPath System: A Data-Centric Analytic Processing Engine Memories. ACM Trans. Database Systems, 11(3):239–264, 1986. for Large Data Warehouses. In Proc. SIGMOD, pages 519–530, [27] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, 2010. N. Hachem, and P. Helland. The End of an Architectural Era: (It’s [2] R. Avnur and J. M. Hellerstein. Eddies: Continuously Adaptive Time for a Complete Rewrite). In Proc. VLDB, pages 1150–1160, Query Processing. In Proc. SIGMOD, pages 261–272, 2000. 2007. [3] G. Candea, N. Polyzotis, and R. Vingralek. A Scalable, Predictable [28] P. Unterbrunner, G. Giannikis, G. Alonso, D. Fauser, and Join Operator for Highly Concurrent Data Warehouses. In Proc. D. Kossmann. Predictable Performance for Unpredictable VLDB, pages 277–288, 2009. Workloads. In Proc. VLDB, pages 706–717, 2009. [4] G. Candea, N. Polyzotis, and R. Vingralek. Predictable Performance [29] M. Zukowski, S. H´eman, N. Nes, and P. Boncz. Cooperative Scans: and High Query Concurrency for Data Analytics. VLDB Journal, Dynamic Bandwidth Sharing in a DBMS. In Proc VLDB, pages 20(2):227–248, 2011. 723–734, 2007. 537