An Evaluation of Concurrency Control with One Thousand Cores

Computer architectures are moving towards an era dominated by many-core machines with dozens or even hundreds of cores on a single chip. In particular, as the number of cores increases, the problem of concurrency control becomes extremely challenging. With hundreds of threads running in parallel, the complexity of coordinating competing accesses to data will likely diminish the gains from increased core counts.

1. Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores Xiangyao Yu George Bezerra MIT CSAIL MIT CSAIL Andrew Pavlo Srinivas Devadas Michael Stonebraker Carnegie Mellon University MIT CSAIL MIT CSAIL ABSTRACT that instruction-level parallelism and single-threaded performance Computer architectures are moving towards an era dominated by will give way to massive thread-level parallelism. many-core machines with dozens or even hundreds of cores on a As Moore’s law continues, the number of cores on a single chip single chip. This unprecedented level of on-chip parallelism intro- is expected to keep growing exponentially. Soon we will have hun- duces a new dimension to scalability that current database manage- dreds or perhaps a thousand cores on a single chip. The scalability ment systems (DBMSs) were not designed for. In particular, as the of single-node, shared-memory DBMSs is even more important in number of cores increases, the problem of concurrency control be- the many-core era. But if the current DBMS technology does not comes extremely challenging. With hundreds of threads running in adapt to this reality, all this computational power will be wasted on parallel, the complexity of coordinating competing accesses to data bottlenecks, and the extra cores will be rendered useless. will likely diminish the gains from increased core counts. In this paper, we take a peek at this dire future and examine what To better understand just how unprepared current DBMSs are for happens with transaction processing at one thousand cores. Rather future CPU architectures, we performed an evaluation of concur- than looking at all possible scalability challenges, we limit our rency control for on-line transaction processing (OLTP) workloads scope to concurrency control. With hundreds of threads running in on many-core chips. We implemented seven concurrency control parallel, the complexity of coordinating competing accesses to data algorithms on a main-memory DBMS and using computer simula- will become a major bottleneck to scalability, and will likely dwin- tions scaled our system to 1024 cores. Our analysis shows that all dle the gains from increased core counts. Thus, we seek to com- algorithms fail to scale to this magnitude but for different reasons. prehensively study the scalability of OLTP DBMSs through one of In each case, we identify fundamental bottlenecks that are indepen- their most important components. dent of the particular database implementation and argue that even We implemented seven concurrency control algorithms in a main state-of-the-art DBMSs suffer from these limitations. We conclude memory DBMS and used a high-performance, distributed CPU sim- that rather than pursuing incremental solutions, many-core chips ulator to scale the system to 1000 cores. Implementing a system may require a completely redesigned DBMS architecture that is from scratch allows us to avoid any artificial bottlenecks in existing built from ground up and is tightly coupled with the hardware. DBMSs and instead understand the more fundamental issues in the algorithms. Previous scalability studies used existing DBMSs [24, 26, 32], but many of the legacy components of these systems do not 1. INTRODUCTION target many-core CPUs. To the best of our knowledge, there has not The era of exponential single-threaded performance improve- been an evaluation of multiple concurrency control algorithms on a ment is over. Hard power constraints and complexity issues have single DBMS at such large scale. forced chip designers to move from single- to multi-core designs. Our analysis shows that all algorithms fail to scale as the number Clock frequencies have increased for decades, but now the growth of cores increases. In each case, we identify the primary bottle- has stopped. Aggressive, out-of-order, super-scalar processors are necks that are independent of the DBMS implementation and ar- now being replaced with simple, in-order, single issue cores [1]. gue that even state-of-the-art systems suffer from these limitations. We are entering the era of the many-core machines that are pow- We conclude that to tackle this scalability problem, new concur- ered by a large number of these smaller, low-power cores on a sin- rency control approaches are needed that are tightly co-designed gle chip. Given the current power limits and the inefficiency of with many-core architectures. Rather than adding more cores, com- single-threaded processing, unless a disruptive technology comes puter architects will have the responsibility of providing hardware along, increasing the number of cores is currently the only way that solutions to DBMS bottlenecks that cannot be solved in software. architects are able to increase computational power. This means This paper makes the following contributions: This work is licensed under the Creative Commons Attribution- • A comprehensive evaluation of the scalability of seven con- NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this li- currency control schemes. cense, visit Obtain per- mission prior to any use beyond those covered by the license. Contact • The first evaluation of an OLTP DBMS on 1000 cores. copyright holder by emailing Articles from this volume • Identification of bottlenecks in concurrency control schemes were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, that are not implementation-specific. Hawaii. The remainder of this paper is organized as follows. We begin Proceedings of the VLDB Endowment, Vol. 8, No. 3 in Section 2 with an overview of the concurrency control schemes Copyright 2014 VLDB Endowment 2150-8097/14/11. 209

2.used in our evaluation. Section 3 describes the components of our DL_DETECT 2PL with deadlock detection. 2PL study. We present our analysis in Sections 4 and 5, followed by a NO_WAIT 2PL with non-waiting deadlock prevention. discussion of results in Section 6. Finally, we survey related work WAIT_DIE 2PL with wait-and-die deadlock prevention. in Section 7 and discuss future research directions in Section 8. TIMESTAMP Basic T/O algorithm. MVCC Multi-version T/O. T/O 2. CONCURRENCY CONTROL SCHEMES OCC Optimistic concurrency control. OLTP database systems support the part of an application that in- H-STORE T/O with partition-level locking. teracts with end-users. End-users interact with the front-end appli- Table 1: The concurrency control schemes evaluated in this paper cation by sending it requests to perform some function (e.g., reserve a seat on a flight). The application processes these requests and then avoid this problem. If a transaction is unable to acquire a lock for an executes transactions in the DBMS. Such users could be humans on element, then it is forced to wait until the lock becomes available. their personal computer or mobile device, or another computer pro- If this waiting is uncontrolled (i.e., indefinite), then the DBMS can gram potentially running somewhere else in the world. incur deadlocks [3]. Thus, a major difference among the different A transaction in the context of one of these systems is the exe- variants of 2PL is in how they handle deadlocks and the actions cution of a sequence of one or more operations (e.g., SQL queries) that they take when a deadlock is detected. We now describe the on a shared database to perform some higher-level function [17]. It different versions of 2PL that we have implemented in our evalua- is the basic unit of change in a DBMS: partial transactions are not tion framework, contrasting them based on these two details: allowed, and the effect of a group of transactions on the database’s 2PL with Deadlock Detection (DL_DETECT): The DBMS mon- state is equivalent to any serial execution of all transactions. The itors a waits-for graph of transactions and checks for cycles (i.e., transactions in modern OLTP workloads have three salient charac- deadlocks) [19]. When a deadlock is found, the system must choose teristics: (1) they are short-lived (i.e., no user stalls), (2) they touch a transaction to abort and restart in order to break the cycle. In prac- a small subset of data using index look-ups (i.e., no full table scans tice, a centralized deadlock detector is used for cycle detection. The or large joins), and (3) they are repetitive (i.e., executing the same detector chooses which transaction to abort based on the amount of queries with different inputs) [38]. resources it has already used (e.g., the number of locks it holds) to An OLTP DBMS is expected to maintain four properties for each minimize the cost of restarting a transaction [3]. transaction that it executes: (1) atomicity, (2) consistency, (3) iso- lation, and (4) durability. These unifying concepts are collectively 2PL with Non-waiting Deadlock Prevention (NO_WAIT): Un- referred to with the ACID acronym [20]. Concurrency control per- like deadlock detection where the DBMS waits to find deadlocks mits end-users to access a database in a multi-programmed fashion after they occur, this approach is more cautious in that a transac- while preserving the illusion that each of them is executing their tion is aborted when the system suspects that a deadlock might oc- transaction alone on a dedicated system [3]. It essentially provides cur [3]. When a lock request is denied, the scheduler immediately the atomicity and isolation guarantees in the system. aborts the requesting transaction (i.e., it is not allowed to wait to We now describe the different concurrency control schemes that acquire the lock). we explored in our many-core evaluation. For this discussion, we 2PL with Waiting Deadlock Prevention (WAIT_DIE): This is a follow the canonical categorization that all concurrency schemes non-preemptive variation of the NO_WAIT scheme technique where are either a variant of two-phase locking or timestamp ordering pro- a transaction is allowed to wait for the transaction that holds the tocols [3]. Table 1 provides a summary of these different schemes. lock that it needs if that transaction is older than the one that holds the lock. If the requesting transaction is younger, then it is aborted 2.1 Two-Phase Locking (hence the term “dies”) and is forced to restart [3]. Each trans- Two-phase locking (2PL) was the first provably correct method action needs to acquire a timestamp before its execution and the of ensuring the correct execution of concurrent transactions in a timestamp ordering guarantees that no deadlocks can occur. database system [6, 12]. Under this scheme, transactions have to acquire locks for a particular element in the database before they 2.2 Timestamp Ordering are allowed to execute a read or write operation on that element. Timestamp ordering (T/O) concurrency control schemes gener- The transaction must acquire a read lock before it is allowed to read ate a serialization order of transactions a priori and then the DBMS that element, and similarly it must acquire a write lock in order to enforces this order. A transaction is assigned a unique, monotoni- modify that element. The DBMS maintains locks for either each cally increasing timestamp before it is executed; this timestamp is tuple or at a higher logical level (e.g., tables, partitions) [14]. used by the DBMS to process conflicting operations in the proper The ownership of locks is governed by two rules: (1) different order (e.g., read and write operations on the same element, or two transactions cannot simultaneously own conflicting locks, and (2) separate write operations on the same element) [3]. once a transaction surrenders ownership of a lock, it may never We now describe the T/O schemes implemented in our test-bed. obtain additional locks [3]. A read lock on an element conflicts The key differences between the schemes are (1) the granularity with a write lock on that same element. Likewise, a write lock on that the DBMS checks for conflicts (i.e., tuples vs. partitions) and an element conflicts with a write lock on the same element. (2) when the DBMS checks for these conflicts (i.e., while the trans- In the first phase of 2PL, known as the growing phase, the trans- action is running or at the end). action is allowed to acquire as many locks as it needs without re- Basic T/O (TIMESTAMP): Every time a transaction reads or mod- leasing locks [12]. When the transaction releases a lock, it enters ifies a tuple in the database, the DBMS compares the timestamp the second phase, known as the shrinking phase; it is prohibited of the transaction with the timestamp of the last transaction that from obtaining additional locks at this point. When the transac- reads or writes the same tuple. For any read or write operation, tion terminates (either by committing or aborting), all the remain- the DBMS rejects the request if the transaction’s timestamp is less ing locks are automatically released back to the coordinator. than the timestamp of the last write to that tuple. Likewise, for a 2PL is considered a pessimistic approach in that it assumes that write operation, the DBMS rejects it if the transaction’s timestamp transactions will conflict and thus they need to acquire locks to is less than the timestamp of the last read to that tuple. In TIMES- 210

3.TAMP, a read query makes a local copy of the tuple to ensure re- Application Target  Mul2core   Host  Machines   peatable reads since it is not protected by locks. When a transaction is aborted, it is assigned a new timestamp and then restarted. This core   core   corresponds to the “basic T/O” algorithm as described in [3], but core   core   our implementation uses a decentralized scheduler. core   core   core   core   Multi-version Concurrency Control (MVCC): Under MVCC, core   core   every write operation creates a new version of a tuple in the database [4, core   core   5]. Each version is tagged with the timestamp of the transaction that created it. The DBMS maintains an internal list of the versions of an element. For a read operation, the DBMS determines which Figure 1: Graphite Simulator Infrastructure – Application threads are version in this list the transaction will access. Thus, it ensures a mapped to simulated core threads deployed on multiple host machines. serializable ordering of all operations. One benefit of MVCC is that the DBMS does not reject operations that arrive late. That is, the DBMS does not reject a read operation because the element that it targets has already been overwritten by another transaction [5]. Optimistic Concurrency Control (OCC): The DBMS tracks the read/write sets of each transaction and stores all of their write operations in their private workspace [28]. When a transaction commits, the system determines whether that transaction’s read set overlaps with the write set of any concurrent transactions. If no overlap exists, then the DBMS applies the changes from the trans- Figure 2: Target Architecture – Tiled chip multi-processor with 64 tiles action’s workspace into the database; otherwise, the transaction is and a 2D-mesh network-on-chip. Each tile contains a processing core, L1 aborted and restarted. The advantage of this approach for main and L2 caches, and a network switch (SW). memory DBMSs is that transactions write their updates to shared ing a separate thread for each core in the architecture. As shown memory only at commit time, and thus the contention period is in Fig. 1, each application thread is attached to a simulated core short [42]. Modern implementations of OCC include Silo [42] and thread that can then be mapped to different processes on separate Microsoft’s Hekaton [11, 29]. In this paper, our algorithm is simi- host machines. For additional performance, Graphite relaxes cy- lar to Hekaton in that we parallelize the validation phase and thus cle accuracy, using periodic synchronization mechanisms to model is more scalable than the original algorithm [28]. instruction-level granularity. As with other similar CPU simulators, T/O with Partition-level Locking (H-STORE): The database is it only executes the application and does not model the operating divided into disjoint subsets of memory called partitions. Each system. For this paper, we deployed Graphite on a 22-node cluster, partition is protected by a lock and is assigned a single-threaded each with dual-socket Intel Xeon E5-2670 and 64GB of DRAM. execution engine that has exclusive access to that partition. Each The target architecture is a tiled multi-core CPU, where each tile transaction must acquire the locks for all of the partitions that it contains a low-power, in-order processing core, 32KB L1 instruc- needs to access before it is allowed to start running. This requires tion/data cache, a 512KB L2 cache slice, and a network router. the DBMS to know what partitions that each individual transac- This is similar to other commercial CPUs, such as Tilera’s Tile64 tion will access before it begins [34]. When a transaction request (64 cores), Intel’s SCC (48 cores), and Intel’s Knights Landing (72 arrives, the DBMS assigns it a timestamp and then adds it to all cores) [1]. Tiles are interconnected using a high-bandwidth, 2D- of the lock acquisition queues for its target partitions. The execu- mesh on-chip network, where each hop takes two cycles. Both the tion engine for a partition removes a transaction from the queue tiles and network are clocked at 1GHz frequency. A schematic of and grants it access to that partition if the transaction has the oldest the architecture for a 64-core machine is depicted in Fig. 2. timestamp in the queue [38]. Smallbase was an early proponent of We use a shared L2-cache configuration because it is the most this approach [22], and more recent examples include H-Store [27]. common last-level cache design for commercial multi-cores. In a comparison experiment between shared and private L2-caches, we 3. MANY-CORE DBMS TEST-BED observe that shared caches lead to significantly less memory traffic Since many-core chips do not yet exist, we performed our anal- and higher performance for OLTP workloads due to its increased ysis through Graphite [30], a CPU simulator that can scale up to aggregate cache capacity (results not shown). Since L2 slices are 1024 cores. For the DBMS, we implemented a main memory OLTP distributed among the different tiles, the simulated multi-core sys- engine that only contains the functionality needed for our experi- tem is a NUCA (Non-Uniform Cache Access) architecture, where ments. The motivation for using a custom DBMS is two fold. First, L2-cache latency increases with distance in the 2D-mesh. we can ensure that no other bottlenecks exist other than concur- rency control. This allows us to study the fundamentals of each 3.2 DBMS scheme in isolation without interference from unrelated features. We implemented our own lightweight main memory DBMS based Second, using a full-featured DBMS is impractical due to the con- on pthreads to run in Graphite. It executes as a single process with siderable slowdown of simulators (e.g., Graphite has an average the number of worker threads equal to the number of cores, where slowdown of 10,000×). Our engine allows us to limit the experi- each thread is mapped to a different core. All data in our DBMS is ments to reasonable times. We now describe the simulation infras- stored in memory in a row-oriented manner. The system supports tructure, the DBMS engine, and the workloads used in this study. basic hash table indexes and a pluggable lock manager that allows us swap in the different implementations of the concurrency con- 3.1 Simulator and Target Architecture trol schemes described in Section 2. It also allows the indexes and Graphite [30] is a fast CPU simulator for large-scale multi-core lock manager to be partitioned (as in the case with the H-STORE systems. Graphite runs off-the-shelf Linux applications by creat- scheme) or run in a centralized mode. 211

4. Client threads are not simulated in our system; instead, each 0.35 0.7 DL_DETECT Throughput (Million txn/s) DL_DETECT Throughput (Million txn/s) worker contains a fixed-length queue of transactions that are served 0.30 NO_WAIT 0.6 NO_WAIT WAIT_DIE WAIT_DIE in order. This reduces the overhead of network protocols, which are 0.25 TIMESTAMP 0.5 TIMESTAMP MVCC MVCC inherently difficult to model in the simulator. Each transaction con- 0.20 OCC 0.4 OCC tains program logic intermixed with query invocations. The queries 0.15 0.3 are executed serially by the transaction’s worker thread as they are 0.10 0.2 encountered in the program logic. Transaction statistics, such as 0.05 0.1 throughput, latency, and abort rates, are collected after a warm-up 0.000 5 10 15 20 25 30 35 0.00 5 10 15 20 25 30 35 period that is long enough for the system to achieve a steady state. Number of Cores Number of Cores (a) Graphite Simulation (b) Real Hardware In addition to runtime statistics, our DBMS also reports how much time each transaction spends in the different components of Figure 3: Simulator vs. Real Hardware – Comparison of the concurrency the system [21]. We group these measurements into six categories: control schemes running in Graphite and a real multi-core CPU using the YCSB workload with medium contention (theta=0.6). USEFUL WORK: The time that the transaction is actually exe- cuting application logic and operating on tuples in the system. in TPC-C are modeled in our simulation. Since these two comprise ABORT: The overhead incurred when the DBMS rolls back all 88% of the total TPC-C workload, this is a good approximation. of the changes made by a transaction that aborts. Our version of TPC-C is a “good faith” implementation, although we omit the “thinking time” for worker threads. Each worker issues TS ALLOCATION: The time that it takes for the system to ac- transactions without pausing; this mitigates the need to increase the quire a unique timestamp from the centralized allocator. For those size of the database with the number of concurrent transactions. concurrency control schemes that require a timestamp, the alloca- tion overhead happens only once per transaction. 3.4 Simulator vs. Real Hardware INDEX: The time that the transaction spends in the hash indexes To show that using the Graphite simulator generates results that for tables, including the overhead of low-level latching of the buck- are comparable to existing hardware, we deployed our DBMS on an ets in the hash tables. Intel Xeon E7-4830 and executed a read-intensive YCSB workload WAIT: The total amount of time that a transaction has to wait. with medium contention (theta=0.6). We then executed the same A transaction may either wait for a lock (e.g., 2PL) or for a tuple workload in Graphite with the same number of cores. whose value is not ready yet (e.g., T/O). The results in Fig. 3 show that all of the concurrency control schemes exhibit the same performance trends on Graphite and the MANAGER: The time that the transaction spends in the lock real CPU. We note, however, that the relative performance differ- manager or the timestamp manager. This excludes any time that ence between MVCC, TIMESTAMP, and OCC is different in Fig. 3b. it has to wait. This is because MVCC accesses memory more than the other two 3.3 Workloads schemes and those accesses are more expensive on a two-socket system. Graphite models a single CPU socket and thus there is We next describe the two benchmarks that we implemented in no inter-socket traffic. In addition to this, the throughput of the our test-bed for this analysis. T/O-based and WAIT_DIE schemes drops on 32 cores due to the YCSB: The Yahoo! Cloud Serving Benchmark is a collection overhead of cross-core communication during timestamp alloca- of workloads that are representative of large-scale services created tion. We address this issue in Section 4.3. by Internet-based companies [8]. For all of the YCSB experiments in this paper, we used a ∼20GB YCSB database containing a sin- 4. DESIGN CHOICES & OPTIMIZATIONS gle table with 20 million records. Each YCSB tuple has a single One of the main challenges of this study was designing a DBMS primary key column and then 10 additional columns each with 100 and concurrency control schemes that are as scalable as possible. bytes of randomly generated string data. The DBMS creates a sin- When deploying a DBMS on 1000 cores, many secondary aspects gle hash index for the primary key. of the implementation become a hindrance to performance. We did Each transaction in the YCSB workload by default accesses 16 our best to optimize each algorithm, removing all possible scalabil- records in the database. Each access can be either a read or an up- ity bottlenecks while preserving their essential functionality. Most date. The transactions do not perform any computation in their pro- of this work was to eliminate shared data structures and devise dis- gram logic. All of the queries are independent from each other; that tributed versions of the classical algorithms [3]. is, the input of one query does not depend on the output of a pre- In this section, we discuss our experience with developing a vious query. The records accessed in YCSB follows a Zipfian dis- many-core OLTP DBMS and highlight the design choices we made tribution that is controlled by a parameter called theta that affects to achieve a scalable system. Additionally, we identify fundamental the level of contention in the benchmark [18]. When theta=0, all bottlenecks of both the 2PL and T/O schemes and show how hard- tuples are accessed with the same frequency. But when theta=0.6 ware support mitigates these problems. We present our detailed or theta=0.8, a hotspot of 10% of the tuples in the database are ac- analysis of the individual schemes in Section 5. cessed by ∼40% and ∼60% of all transactions, respectively. TPC-C: This benchmark is the current industry standard for 4.1 General Optimizations evaluating the performance of OLTP systems [40]. It consists of We first discuss the optimizations that we added to improve the nine tables that simulate a warehouse-centric order processing ap- DBMS’s performance across all concurrency control schemes. plication. All of the transactions in TPC-C provide a WAREHOUSE Memory Allocation: One of the first limitations we encountered id as an input parameter for the transaction, which is the ancestral when trying to scale our DBMS to large core counts was the malloc foreign key for all tables except ITEM. For a concurrency control function. When using the default Linux version of malloc, we algorithm that requires data partitioning (i.e., H-STORE), TPC-C is found that the DBMS spends most of the time waiting for memory partitioned based on this warehouse id. allocation. This is a problem even for read-only workloads, since Only two (Payment and NewOrder) out of the five transactions the DBMS still needs to copy records for reads in TIMESTAMP 212

5. Throughput (Million txn/s) Throughput (Million txn/s) 0.22 2200 101 0.21 2000 1800 0.20 Abort Rate 100 theta=0 1600 theta=0.6 0.19 1400 theta=0.8 1200 0.18 10-1 1000 0.17 800 0.160 1us 10us 100us 1ms 10ms 600 100ms 100 101 102 103 Timeout Threshold Number of Cores Figure 5: Waiting vs. Aborting – Results for DL_DETECT with varying Figure 4: Lock Thrashing – Results for a write-intensive YCSB workload timeout threshold running high contention YCSB (theta=0.8) at 64 cores. using the DL_DETECT scheme without deadlock detection. Each transac- tion acquires locks in their primary key order. To demonstrate the impact of thrashing, we executed a write- intensive YCSB workload (i.e., 50/50% read-write mixture) using a and to create internal meta-data handles for access tracking data variant of DL_DETECT where transactions acquire locks in primary structures. We tried running optimized versions (TCMalloc [15], key order. Although this approach is not practical for all work- jemalloc [13]), but both yielded similar disappointing results. loads, it removes the need for deadlock detection and allows us to We overcame this by writing a custom malloc implementation. better observe the effects of thrashing. Fig. 4 shows the transaction Similar to TCMalloc and jemalloc, each thread is assigned its own throughput as a function of the number of cores for different lev- memory pool. But the difference is that our allocator automatically els of contention. When there is no skew in the workload (theta=0), resizes the pools based on the workload. For example, if a bench- the contention for locks is low and the throughput scales almost lin- mark frequently allocates large chunks of contiguous memory, the early. As the contention level increases, however, thrashing starts to pool size increases to amortize the cost for each allocation. occur. With medium contention (theta=0.6), the throughput peaks Lock Table: As identified in previous work [26, 36], the lock at several hundred cores and then decreases due to thrashing. At table is another key contention point in DBMSs. Instead of having the highest contention level (theta=0.8), the DBMS’s throughput a centralized lock table or timestamp manager, we implemented peaks at 16 cores and cannot scale beyond that. Simulation re- these data structures in a per-tuple fashion where each transaction sults show that almost all the execution time is spent on waiting only latches the tuples that it needs. This improves scalability, but for locks. Thus, lock thrashing is the key bottleneck of lock-based increases the memory overhead because the DBMS maintains addi- approaches that limits scalability in high-contention scenarios. tional meta-data for the lock sharer/waiter information. In practice, Waiting vs. Aborting: The thrashing problem can be solved in this meta-data (several bytes) is negligible for large tuples. DL_DETECT by aborting some transactions to reduce the number Mutexes: Accessing a mutex lock is an expensive operation that of active transactions at any point in time. Ideally, this keeps the requires multiple messages to be sent across the chip. A central system running at the highest throughput achieved in Fig. 4. We critical section protected by a mutex will limit the scalability of any added a timeout threshold in the DBMS that causes the system to system (cf. Section 4.3). Therefore, it is important to avoid using abort and restart any transaction that has been waiting for a lock for mutexes on the critical path. For 2PL, the mutex that protects the an amount of time greater than the threshold. We note that when centralized deadlock detector is the main bottleneck, while for T/O timeout is zero, this algorithm is equivalent to NO_WAIT. algorithms it is the mutex used for allocating unique timestamps. We ran the same YCSB workload with high contention using In the subsequent sections, we describe the optimizations that we different timeout thresholds on a 64-core CPU. We measure both developed to obviate the need for these mutexes. the throughput and abort rate in the DBMS for the DL_DETECT scheme sweeping the timeout from 0–100 ms. 4.2 Scalable Two-Phase Locking The results in Fig. 5 indicate when the CPU has a small number We next discuss the optimizations for the 2PL algorithms. of cores, it is better to use a shorter timeout threshold. This high- lights the trade-off between performance and the transaction abort Deadlock Detection: For DL_DETECT, we found that the dead- rate. With a small timeout, the abort rate is high, which reduces the lock detection algorithm is a bottleneck when multiple threads com- number of running transactions and alleviates the thrashing prob- pete to update their waits-for graph in a centralized data structure. lem. Using a longer timeout reduces the abort rate at the cost of We solved this by partitioning the data structure across cores and more thrashing. Therefore, in this paper, we evaluate DL_DETECT making the deadlock detector completely lock-free. Now when a with its timeout threshold set to 100µs. In practice, the threshold transaction updates its waits-for graph, its thread updates its queue should be based on an application’s workload characteristics. with the transactions that it is waiting for without any locks. This step is local (i.e., no cross-core communication), as the thread does not write to the queues of other transactions. 4.3 Scalable Timestamp Ordering In the deadlock detection process, a thread searches for cycles in Finally, we discuss the optimizations that we developed to im- a partial waits-for graph constructed by only reading the queues of prove the scalability of the T/O-based algorithms. related threads without locking the queues. Although this approach Timestamp Allocation: All T/O-based algorithms make order- may not discover a deadlock immediately after it forms, the thread ing decisions based on transactions’ assigned timestamps. The is guaranteed to find it on subsequent passes [5]. DBMS must therefore guarantee that each timestamp is allocated Lock Thrashing: Even with improved detection, DL_DETECT to only one transaction. A naïve approach to ensure this is to use a still does not scale due to lock thrashing. This occurs when a trans- mutex in the allocator’s critical section, but this leads to poor perfor- action holds its locks until it commits, blocking all the other concur- mance. Another common solution is to use an atomic addition op- rent transactions that attempt to acquire those locks. This becomes eration to advance a global logical timestamp. This requires fewer a problem with high contention and a large number of concurrent instructions and thus the DBMS’s critical section is locked for a transactions, and thus is the main bottleneck of all 2PL schemes. smaller period of time than with a mutex. But as we will show, this 213

6.approach is still insufficient for a 1000-core CPU. We now discuss 10000 Clock Hardware Throughput (Million ts/s) three timestamp allocation alternatives: (1) atomic addition with Atomic batch=16 batching [42], (2) CPU clocks, and (3) hardware counters. 1000 Atomic batch=8 With the batched atomic addition approach, the DBMS uses the Atomic 100 Mutex same atomic instruction to allocate timestamps, but the timestamp manager returns multiple timestamps together in a batch for each 10 request. This method was first proposed in the Silo DBMS [42]. To generate a timestamp using clock-based allocation, each worker 11 thread reads a logical clock from its local core and then concate- 10 100 1000 Number of Cores nates it with its thread id. This provides good scalability as long as all the clocks are synchronized. In distributed systems, synchro- Figure 6: Timestamp Allocation Micro-benchmark – Throughput mea- nization is accomplished using software protocols [31] or external surements for different timestamp allocation methods. clocks [9]. On a many-core CPU, however, this imposes large over- 90 Clock 3.0 Throughput (Million txn/s) 80 HW Counter Throughput (Million txn/s) head and thus requires hardware support. As of July 2014, only Atomic batch=16 2.5 70 Atomic batch=8 Intel CPUs support synchronized clocks across cores. 60 Atomic 2.0 50 Mutex Lastly, the third approach is to use an efficient, built-in hardware 1.5 40 counter. The counter is physically located at the center of the CPU 30 1.0 such that the average distance to each cores is minimized. No exist- 20 0.5 10 ing CPU currently supports this. Thus, we implemented a counter 00 200 400 600 800 1000 0.00 200 400 600 800 1000 in Graphite where a timestamp request is sent through the on-chip Number of Cores Number of Cores network to increment it atomically in a single cycle. (a) No Contention (b) Medium Contention To determine the maximum rate that the DBMS can allocate Figure 7: Timestamp Allocation – Throughput of the YCSB workload timestamps for each method, we ran a micro-benchmark where using TIMESTAMP with different timestamp allocation methods. threads continually acquire new timestamps. The throughput as tion’s read set is compared to previous transactions’ write sets to a function of the number of cores is shown in Fig. 6. We first note detect conflicts. Although this step is short, as mentioned above, that mutex-based allocation has the lowest performance, with ∼1 any mutex-protected critical section severely hurts scalability. We million timestamps per second (ts/s) on 1024 cores. The atomic solve this problem by using per-tuple validation that breaks up this addition method reaches a maximum of 30 million ts/s with a check into smaller operations. This is similar to the approach used small number of cores, but throughput decreases with the number in Hekaton [29] but it is simpler, since we only support a single of cores down to 8 million ts/s. This is due to the cache coherence version of each tuple. traffic from writing back and invalidating the last copy of the cor- responding cache line for every timestamp. This takes one round Local Partitions: We optimized the original H-STORE proto- trip of communication across the chip or ∼100 cycles for a 1024- col to take advantage of shared memory. Because the DBMS’s core CPU, which translates to a maximum throughput of 10 million worker threads run in a single process, we allow multi-partition ts/s at 1GHz frequency. Batching these allocations does help, but transactions to access tuples at remote partitions directly instead of it causes performance issues when there is contention (see below). sending query requests that are executed by the remote partitions’ The hardware-based solutions are able to scale with the number of worker threads. This allows for a simpler implementation that is cores. Because incrementing the timestamp takes only one cycle faster than using intra-process communication. With this approach, with the hardware counter-based approach, this method achieves the data is not physically partitioned since on-chip communication a maximum throughput of 1 billion ts/s. The performance gain latency is low. Read-only tables are accessed by all threads without comes from removing the coherence traffic by executing the addi- replication, thus reducing the memory footprint. Finally, we use tion operation remotely. The clock-based approach has ideal (i.e., the same timestamp allocation optimizations from above to avoid linear) scaling, since this solution is completely decentralized. the mandatory wait time to account for clock skew [38]. We also tested the different allocation schemes in the DBMS to see how they perform for real workloads. For this experiment, we 5. EXPERIMENTAL ANALYSIS executed a write-intensive YCSB workload with two different con- We now present the results from our analysis of the different tention levels using the TIMESTAMP scheme. The results in Fig. 7a concurrency control schemes. Our experiments are grouped into show that with no contention, the relative performance of the al- two categories: (1) scalability and (2) sensitivity evaluations. For location methods are the same as in Fig. 6. When there is con- the former, we want to determine how well the schemes perform as tention, however, the trends in Fig. 7b are much different. First, we increase the number of cores. We scale the number of cores up the DBMS’s throughput with the batched atomic addition method to 1024 while fixing the workload parameters. With the sensitivity is much worse. This is because when a transaction is restarted due experiments, we vary a single workload parameter (e.g., transaction to a conflict, it gets restarted in the same worker thread and is as- access skew). We report the DBMS’s total simulated throughput as signed the next timestamp in the last batch. But this new timestamp well as a breakdown of the amount of time that each worker thread will also be less than the one for the other transaction that caused spends in the different parts of the system listed in Section 3.2. the abort, and thus it will continually restart until the thread fetches We begin with an extensive analysis of the YCSB workload. a new batch. The non-batched atomic addition method performs as The nature of this workload allows us to change its parameters well as the clock and hardware counter approaches. Hence, for this and create a variety of scenarios that stress the concurrency con- paper the DBMS uses atomic addition without batching to allocate trol schemes in different ways. Next, we analyze the TPC-C work- timestamps because the other approaches require specialized hard- load, where we vary the number of warehouses and observe the ware support that is currently not available on all CPUs. impact on the throughput of the algorithms. The H-STORE scheme Distributed Validation: The original OCC algorithm contains is excluded from our initial experiments and is only introduced in a critical section at the end of the read phase, where the transac- Section 5.5 when we analyze database partitioning. 214

7.Throughput (Million txn/s) 14 4.5 DL_DETECT TIMESTAMP Throughput (Million txn/s) DL_DETECT TIMESTAMP 4.0 NO_WAIT MVCC 12 NO_WAIT MVCC WAIT_DIE OCC 10 WAIT_DIE OCC 3.5 3.0 8 2.5 6 2.0 4 1.5 2 1.0 0.5 00 200 400 600 800 1000 0.00 Number of Cores 200 400 600 800 1000 Number of Cores (a) Total Throughput (a) Total Throughput 1.0 1.0 0.8 0.8 Useful Work Useful Work 0.6 Abort 0.6 Abort Ts Alloc. Ts Alloc. 0.4 Index 0.4 Index 0.2 Wait Wait Manager 0.2 Manager 0.0 0.0 T IT IE MP CC OCC TEC O_WA AIT_D STA MV CT AIT ETE NO_W WAIT_D IE MP CC OCC DL_ DE N W M E D STA MV TI DL_ IT ME (b) Runtime Breakdown (1024 cores) (b) Runtime Breakdown (512 cores) Figure 8: Read-only Workload – Results for a read-only YCSB workload. Figure 9: Write-Intensive Workload (Medium Contention) – Results for YCSB workload with medium contention (theta=0.6). 5.1 Read-Only Workload In this first scalability analysis experiment, we executed a YCSB down in Fig. 9b indicates that the DBMS spends a larger percentage workload comprising read-only transactions with a uniform access of its time waiting in these schemes. DL_DETECT is inhibited by distribution. Each transaction executes 16 separate tuple reads at a lock thrashing at 256 cores. NO_WAIT is the most scalable because time. This provides a baseline for each concurrency control scheme it eliminates this waiting. We note, however, that both NO_WAIT before we explore more complex workload arrangements. and WAIT_DIE have a high transaction abort rate. This is not an In a perfectly scalable DBMS, the throughput should increase issue in our experiments because restarting an aborted transaction linearly with the number of cores. This is not the case, however, has low overhead; the time it takes to undo a transaction is slightly for the T/O schemes in Fig. 8a. The time breakdown in Fig. 8b less than the time it takes to re-execute the transactions queries. But indicates that timestamp allocation becomes the bottleneck with a in reality, the overhead may be larger for workloads where trans- large core count. OCC hits the bottleneck even earlier since it needs actions have to rollback changes to multiple tables, indexes, and to allocate timestamps twice per transaction (i.e., at transaction start materialized views. and before the validation phase). Both OCC and TIMESTAMP have The results in Fig. 9a also show that the T/O algorithms perform significantly worse performance than the other algorithms regard- well in general. Both TIMESTAMP and MVCC are able to overlap less of the number of cores. These algorithms waste cycles because operations and reduce the waiting time. MVCC performs slightly they copy tuples to perform a read, whereas the other algorithms better since it keeps multiple versions of a tuple and thus can serve read tuples in place. read requests even if they have older timestamps. OCC does not perform as well because it spends a large portion of its time abort- 5.2 Write-Intensive Workload ing transactions; the overhead is worse since each transaction has A read-only workload represents an optimistic (and unrealistic) to finish before the conflict is resolved. scenario, as it generates no data contention. But even if we intro- With higher contention, the results in Fig. 10 show that perfor- duce writes in the workload, the large size of the dataset means that mance of all of the algorithms is much worse. Fig. 10a shows the probability that any two transactions access the same tuples at that almost all of the schemes are unable to scale to more than 64 the same time is small. In reality, the access distribution of an OLTP cores. Beyond this point, the DBMS’s throughput stops increas- application is rarely uniform. Instead, it tends to follow a Zipfian ing and there is no performance benefit to the increased core count. skew, where certain tuples are more likely to be accessed than oth- NO_WAIT initially outperforms all the others, but then succumbs ers. This can be from either skew in the popularity of elements in to lock thrashing (cf. Fig. 4). Surprisingly, OCC performs the the database or skew based on temporal locality (i.e., newer tuples best on 1024 cores. This is because although a large number of are accessed more frequently). As a result, this increases contention transactions conflict and have to abort during the validation phase, because transactions compete to access the same data. one transaction is always allowed to commit. The time breakdown We executed a write-intensive YCSB workload comprising trans- in Fig. 10b shows that the DBMS spends a larger amount of time actions that access 16 tuples at time. Within each transaction, each aborting transactions in every scheme. of these accesses will modify the tuple with a 50% probability. The To better understand when each scheme begins to falter with in- amount of skew in the workload is determined by the parameter creased contention, we fixed the number of cores to 64 and per- theta (cf. Section 3.3). We use the medium and high contention formed a sensitivity analysis on the skew parameter (theta). The levels for the transactions’ access patterns. results in Fig. 11 indicate that for theta values less than 0.6, the con- The medium contention results in Fig. 9 show that NO_WAIT tention has little effect on the performance. But for higher settings, and WAIT_DIE are the only 2PL schemes that scales past 512 cores. there is a sudden drop in throughput that renders all algorithms non- NO_WAIT scales better than WAIT_DIE. For DL_DETECT, the break- scalable and approaches zero for values greater than 0.8. 215

8. 0.25 DL_DETECT TIMESTAMP 70 Throughput (Million tuple/s) Throughput (Million txn/s) NO_WAIT MVCC 0.20 WAIT_DIE OCC 60 50 0.15 40 0.10 30 20 DL_DETECT TIMESTAMP 0.05 NO_WAIT MVCC 10 WAIT_DIE OCC 0.000 200 400 600 800 1000 00 2 4 6 8 10 12 14 16 Number of Cores Number of Rows Accessed per Transaction (a) Total Throughput (a) Total Throughput 1.0 1.0 0.8 0.8 Useful Work Useful Work 0.6 Abort 0.6 Abort Ts Alloc. Ts Alloc. 0.4 Index 0.4 Index 0.2 Wait Wait Manager 0.2 Manager 0.0 0.0 ECT _WAIT IT_DIE STAMP MVCC OCC CT AIT IE MP CC OCC DET NO WA E ETE NO_W WAIT_D STA MV DL_ TIM DL_ D TI M E (b) Runtime Breakdown (64 cores) (b) Runtime Breakdown (transaction length = 1) Figure 10: Write-Intensive Workload (High Contention) – Results for Figure 12: Working Set Size – The number of tuples accessed per core on YCSB workload with high contention (theta=0.8). 512 cores for transactions with a varying number of queries (theta=0.6). 0.7 to one. Again, we see that the T/O schemes spend most of their Throughput (Million txn/s) 0.6 execution time allocating timestamps. As the transactions become 0.5 longer, Figs. 8b and 9b shows that the allocation is no longer the 0.4 main bottleneck. The results in Fig. 12 also show that the T/O- DL_DETECT based algorithms are more tolerant to contention than DL_DETECT. 0.3 NO_WAIT WAIT_DIE 0.2 TIMESTAMP 5.4 Read/Write Mixture 0.1 MVCC Another important factor for concurrency control is the read- OCC 0.00.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 /write mixtures of transactions. More writes leads to more con- tention that affect the algorithms in different ways. For this exper- Theta iment, we use YCSB on a 64 core configuration and vary the per- Figure 11: Write-Intensive Workload (Variable Contention) – Results centage of read queries executed by each transaction. Each trans- for YCSB workload with varying level of contention on 64 cores. action executes 16 queries using the high skew setting (theta=0.8). The results in Fig. 13 indicate that all of the algorithms achieve 5.3 Working Set Size better throughput when there are more read transactions. At 100% The number of tuples accessed by a transaction is another factor reads, the results match the previous read-only results in Fig. 8a. that impacts scalability. When a transaction’s working set is large, TIMESTAMP and OCC do not perform well because they copy tu- it increases the likelihood that the same data is accessed by concur- ples for reading. MVCC stand out as having the best performance rent transactions. For 2PL algorithms, this increases the length of when there are small number of write transactions. This is also an time that locks are held by a transaction. With T/O, however, longer example of where supporting non-blocking reads through multiple transactions may reduce timestamp allocation contention. In this versions is most effective; read queries access the correct version experiment, we vary the number of tuples accessed per transaction of a tuple based on timestamps and do not need to wait for a writing in a write-intensive YCSB workload. Because short transactions transaction. This is a key difference from TIMESTAMP, where late leads to higher throughput, we measure the number of tuples ac- arriving queries are rejected and their transactions are aborted. cessed per second, rather than transactions completed. We use the medium skew setting (theta=0.6) and fix the core count to 512. 5.5 Database Partitioning The results in Fig. 12 show that when transactions are short, the Up to this point in our analysis, we assumed that the database is lock contention is low. DL_DETECT and NO_WAIT have the best stored as a single partition in memory and that all worker threads performance in this scenario, since there are few deadlocks and can access any tuple. With the H-STORE scheme, however, the the number of aborts is low. But as the transactions’ working set DBMS splits the database into disjoint subsets to increase scala- size increases, the performance of DL_DETECT degrades due to bility [38]. This approach achieves good performance only if the the overhead of thrashing. For the T/O algorithms and WAIT_DIE, database is partitioned in such a way that enables a majority of the throughput is low when the transactions are short because the transactions to only need to access data at a single partition [34]. DBMS spends a majority of its time allocating timestamps. But H-STORE does not work well when the workload contains multi- as the transactions become longer, the timestamp allocation cost is partition transactions because of its coarse-grained locking scheme. amortized. OCC performs the worst because it allocates double the It also matters how many partitions each transaction accesses; for number of timestamps as the other schemes for each transaction. example, H-STORE will still perform poorly even with a small Fig. 12b shows the time breakdown for transaction length equals number of multi-partition transactions if they access all partitions. 216

9. 1.4 10 Throughput (Million txn/s) Throughput (Million txn/s) 1.0 Throughput (Million txn/s) 1.2 DL_DETECT readonly readwrite 8 NO_WAIT 0.8 1.0 WAIT_DIE 0.6 6 part=1 part=2 0.8 TIMESTAMP 0.4 4 part=4 MVCC part=8 0.6 OCC 0.2 2 part=16 0.4 0.00.0 0.2 0.4 0.6 0.8 1.0 0 200 400 600 800 1000 0.2 Percentage of multi-partition transactions Number of Cores (a) Multi-Partition Percentage (b) Partitions per Transaction 0.00.0 0.2 0.4 0.6 0.8 1.0 Percentage of Read Requests Figure 15: Multi-Partition Transactions – Sensitivity analysis of the H- STORE scheme for YCSB workloads with multi-partition transactions. Figure 13: Read/Write Mixture – Results for YCSB with a varying per- centage of read-only transactions with high contention (theta=0.8). throughput for the single-partition workload in Fig. 15b exhibits the same degradation due to timestamp allocation as H-STORE in 10 Fig. 14. This is also why the throughputs for the one- and two- Throughput (Million txn/s) partition workloads converge at 1000 cores. The DBMS does not 8 scale with transactions accessing four or more partitions because of DL_DETECT MVCC NO_WAIT OCC the reduced parallelism and increased cross-core communication. 6 WAIT_DIE HSTORE TIMESTAMP 4 5.6 TPC-C Finally, we analyze the performance of all the concurrency con- 2 trol algorithms when running the TPC-C benchmark. The trans- actions in TPC-C are more complex than those in YCSB and is 00 200 400 600 800 1000 representative of a large class of OLTP applications. For example, Number of Cores they access multiple tables with a read-modify-write access pattern Figure 14: Database Partitioning – Results for a read-only workload on a and the output of some queries are used as the input for subsequent partitioned YCSB database. The transactions access the database based on queries in the same transaction. TPC-C transactions can also abort a uniform distribution (theta=0.0). because of certain conditions in their program logic, as opposed to only because the DBMS detected a conflict. To explore these issues in a many-core setting, we first compare The workload in each trial comprises 50% NewOrder and 50% H-STORE to the six other schemes under ideal conditions. We then Payment transactions. These two make up 88% of the default TPC- analyze its performance with multi-partition transactions. C mix and are the most interesting in terms of complexity. Support- We divide the YCSB database into the same number of parti- ing the other transactions would require additional DBMS features, tions as the number of cores in each trial. Since YCSB only has such as B-tree latching for concurrent updates. This would add ad- one table, we use a simple hashing strategy to assign tuples to par- ditional overhead to the system, and thus we defer the problem of titions based on their primary keys so that each partition stores ap- scaling indexes for many-core CPUs as future work. proximately the same number of records. These tests use a write- The size of TPC-C databases are typically measured by the num- intensive workload where each transaction executes 16 queries that ber of warehouses. The warehouse is the root entity for almost all all use index look-ups without any skew (theta=0.0). We also as- tables in the database. We follow the TPC-C specification where sume that the DBMS knows what partition to assign each transac- ∼10% of the NewOrder transactions and ∼15% of the Payment tion to at runtime before it starts [34]. transactions access a “remote” warehouse. For partitioned-based In the first experiment, we executed a workload comprised only schemes, such as H-STORE, each partition consists of all the data of single-partition transactions. The results in Fig. 14 show that H- for a single warehouse [38]. This means that the remote warehouse STORE outperforms all other schemes up to 800 cores. Since it is transactions will access multiple partitions. especially designed to take advantage of partitioning, it has a much We first execute the TPC-C workload on a 4-warehouse database lower overhead for locking than the other schemes. But because with 100MB of data per warehouse (0.4GB in total). This allows H-STORE also depends on timestamp allocation for scheduling, it us to evaluate the algorithms when there are more worker threads suffers from the same bottleneck as the other T/O-based schemes. than warehouses. We then execute the same workload again on a As a result, the performance degrades at higher core counts. For 1024-warehouse database. Due to memory constraints of running the other schemes, partitioning does not have a significant impact in the Graphite simulator, we reduced the size of this database to on throughput. It would be possible, however, to adapt their imple- 26MB of data per warehouse (26GB in total). This does not affect mentation to take advantage of partitioning [36]. our measurements because the number of tuples accessed by each We next modified the YCSB driver to vary the percentage of transaction is independent of the database size. multi-partition transactions in the workload and deployed the DBMS on a 64-core CPU. The results in Fig. 15a illustrate two important 5.6.1 4 Warehouses aspects of the H-STORE scheme. First, there is no difference in The results in Fig. 16 show that all of the schemes fail to scale performance whether or not the workload contains transactions that for TPC-C when there are fewer warehouses than cores. With modify the database; this is because of H-STORE’s locking scheme. H-STORE, the DBMS is unable to utilize extra cores because of Second, the DBMS’s throughput degrades as the number of multi- its partitioning scheme; the additional worker threads are essen- partition transactions in the workload increases because they reduce tially idle. For the other schemes, the results in Fig. 16c show that the amount of parallelism in the system [34, 42]. they are able to scale up to 64 cores for the NewOrder transaction. Lastly, we executed YCSB with 10% multi-partition transactions TIMESTAMP, MVCC, and OCC have worse scalability due to high and varied the number of partitions that they access. The DBMS’s abort rates. DL_DETECT does not scale due to thrashing and dead- 217

10. Throughput (Million txn/s) Throughput (Million txn/s) Throughput (Million txn/s) 0.18 0.25 0.16 0.20 0.14 0.20 0.12 0.15 0.15 0.10 0.08 0.10 DL_DETECT MVCC 0.10 0.06 WAIT_DIE OCC 0.04 0.05 NO_WAIT HSTORE 0.05 0.02 TIMESTAMP 0.000 50 100 150 200 250 300 0.000 50 100 150 200 250 300 0.000 50 100 150 200 250 300 Number of Cores Number of Cores Number of Cores (a) Payment + NewOrder (b) Payment only (c) NewOrder only Figure 16: TPC-C (4 warehouses) – Results for the TPC-C workload running up to 256 cores. Throughput (Million txn/s) Throughput (Million txn/s) Throughput (Million txn/s) 10 25 DL_DETECT MVCC 10 WAIT_DIE OCC 8 20 NO_WAIT HSTORE 8 TIMESTAMP 6 15 6 4 10 4 2 5 2 00 200 400 600 800 1000 00 200 400 600 800 1000 00 200 400 600 800 1000 Number of Cores Number of Cores Number of Cores (a) Payment + NewOrder (b) Payment only (c) NewOrder only Figure 17: TPC-C (1024 warehouses) – Results for the TPC-C workload running up to 1024 cores. locks. But the results in Fig. 16b show that no scheme scales for to achieve higher than ∼10 million txn/s. This is the same scenario the Payment transaction. The reason for this is that every Payment as Fig. 12a where 2PL outperforms T/O for short transactions. transaction updates a single field in the warehouse (W_YTD). This H-STORE performs the best overall due to its ability to exploit means that either the transaction (1) must acquire an exclusive lock partitioning even with ∼12% multi-partition transactions in the work- on the corresponding tuple (i.e., DL_DETECT) or (2) issue a pre- load. This corroborates results from previous studies that show that write on that field (i.e., T/O-based algorithms). If the number of H-STORE outperforms other approaches when less than 20% work- threads is greater than the number of warehouses, then updating load comprises multi-partition transactions [34, 42]. At 1024 cores, the warehouse table becomes a bottleneck. however, it is limited by the DBMS’s timestamp allocation. In general, the main problem for these two transactions is the contention on updating the WAREHOUSE table. Each Payment trans- action updates its corresponding warehouse entry and each NewOrder 6. DISCUSSION will read it. For the 2PL-based algorithms, these read and write We now discuss the results of the previous sections and propose operations block each other. Two of the T/O-based algorithms, solutions to avoid these scalability issues for many-core DBMSs. TIMESTAMP and MVCC, outperform the other schemes because their write operations are not blocked by reads. This eliminates 6.1 DBMS Bottlenecks the lock blocking problem in 2PL. As a result, the NewOrder trans- Our evaluation shows that all seven concurrency control schemes actions can execute in parallel with Payment transactions. fail to scale to a large number of cores, but for different reasons and conditions. Table 2 summarizes the findings for each of the 5.6.2 1024 Warehouses schemes. In particular, we identified several bottlenecks to scala- We next execute the TPC-C workload with 1024 warehouses bility: (1) lock thrashing, (2) preemptive aborts, (3) deadlocks, (4) with up to 1024 cores. Once again, we see in Fig. 17 that no scheme timestamp allocation, and (5) memory-to-memory copying. is able to scale. The results indicate that unlike in Section 5.6.1, the Thrashing happens in any waiting-based algorithm. As discussed DBMS’s throughput is limited by NewOrder transactions. This is in Section 4.2, thrashing is alleviated by proactively aborting. This due to different reasons for each scheme. leads to the trade-off between aborts and performance. In general, With almost all the schemes, the main bottleneck is the overhead the results in Section 5.2 showed that for high-contention work- of maintaining locks and latches, which occurs even if there is no loads, a non-waiting deadlock prevention scheme (NO_WAIT) per- contention. For example, the NewOrder transaction reads tuples forms much better than deadlock detection (DL_DETECT). from the read-only ITEM table, which means for the 2PL schemes Although no single concurrency control scheme performed the that each access creates a shared-lock entry in the DBMS. With best for all workloads, one may outperform the others under cer- a large number of concurrent transactions, the lock meta-data be- tain conditions. Thus, it may be possible to combine two or more comes large and thus it takes longer to update them. OCC does not classes of algorithms into a single DBMS and switch between them use such locks while a transaction runs, but it does use latches for based on the workload. For example, a DBMS could use DL_DETECT each tuple accessed during the validation phase. Acquiring these for workloads with little contention, but switch to NO_WAIT or a latches becomes an issue for transactions with large footprints, like T/O-based algorithm when transactions are taking too long to fin- NewOrder. Although MVCC also does not have locks, each read ish due to thrashing. One could also employ a hybrid approach, request generates a new history record, which increases memory such as MySQL’s DL_DETECT + MVCC scheme, where read-only traffic. We note, however, that all of this is technically unnecessary transactions use multi-versioning and all others use 2PL. work because the ITEM table is never modified. These results also make it clear that new hardware support is The results in Fig. 17b indicate that when the number of ware- needed to overcome some of these bottlenecks. For example, all of houses is the same or greater than the number of worker threads, the T/O schemes suffer from the timestamp allocation bottleneck the bottleneck in the Payment transaction is eliminated. This im- when the throughput is high. Using the atomic addition method proves the performance of all schemes. For T/O schemes, however, when the core count is large causes the worker threads to send many the throughput becomes too high at larger core counts and thus they messages across the chip to modify the timestamp. We showed are inhibited by timestamp allocation. As a result, they are unable in Section 4.3 how the clock and hardware counter methods per- 218

11. DL_DETECT Scales under low-contention. Suffers from lock are derived from the seminal surveys by Bernstein et al. [3, 5]. In thrashing. recent years, there have been several efforts towards improving the 2PL NO_WAIT Has no centralized point of contention. Highly scal- shortcomings of these classical implementations [11, 24, 32, 42]. able. Very high abort rate. Other work includes standalone lock managers that are designed to WAIT_DIE Suffers from lock thrashing and timestamp bottle- be more scalable on multi-core CPUs [36, 26]. We now describe neck. these systems in further detail and discuss why they are still un- TIMESTAMP High overhead from copying data locally. Non- likely to scale on future many-core architectures. blocking writes. Suffers from timestamp bottleneck. Shore-MT [24] is a multi-threaded version of Shore [7] that em- MVCC Performs well w/ read-intensive workload. Non- ploys a deadlock detection scheme similar to DL_DETECT. Much T/O blocking reads and writes. Suffers from timestamp bottleneck. of the improvements in Shore-MT come from optimizing bottle- OCC High overhead for copying data locally. High abort necks in the system other than concurrency control, such as log- cost. Suffers from timestamp bottleneck. ging [25]. The system still suffers from the same thrashing bottle- H-STORE The best algorithm for partitioned workloads. Suf- neck as DL_DETECT on high contention workloads. fers from multi-partition transactions and timestamp DORA is an OLTP execution engine built on Shore-MT [32]. In- bottleneck. stead of assigning transactions to threads, as in a traditional DBMS architecture, DORA assigns threads to partitions. When a transac- Table 2: A summary of the bottlenecks for each concurrency control scheme evaluated in Section 5. tion needs to access data at a specific partition, its handle is sent to the corresponding thread for that partition where it then waits formed the best without the drawbacks of batching. Thus, we be- in a queue for its turn. This is similar to H-STORE’s partitioning lieve that they should be included in future CPU architectures. model, except that DORA supports multiple record-level locks per We also saw that memory issues cause slowdown in some of the partition (instead of one lock per partition) [33]. We investigated schemes. One way to alleviate this problem is to add a hardware implementing DORA in our DBMS but found that it could not be accelerator on the CPU to do memory copying in the background. easily adapted and requires a separate system implementation. This would eliminate the need to load all data through the CPU’s The authors of Silo [42] also observed that global critical sec- pipeline. We also showed in Section 4.1 how malloc was another tions are the main bottlenecks in OCC. To overcome this, they use bottleneck and that we were able to overcome it by developing our a decentralized validation phase based on batched atomic addition own implementation that supports dynamic pool resizing. But with timestamps. But as we showed in Section 4.3, the DBMS must a large number of cores, these pools become too unwieldy to man- use large batches when deployed on a large number of cores to age in a global memory controller. We believe that future CPUs amortize the cost of centralized allocation. This batching in turn will need to switch to decentralized or hierarchical memory con- increases the system’s latency under contention. trollers to provide faster memory allocation. Hekaton [11] is a main memory table extension for Microsoft’s SQL Server that uses a variant of MVCC with lock-free data struc- 6.2 Multi-core vs. Multi-node Systems tures [29]. The administrator designates certain tables as in-memory Distributed DBMSs are touted for being able to scale beyond tables that are then accessed together with regular, disk-resident ta- what a single-node DBMS can support [38]. This is especially bles. The main limitation of Hekaton is that timestamp allocation true when the number of CPU cores and the amount of memory suffers from the same bottleneck as the other T/O-based algorithms available on a node is small. But moving to a multi-node architec- evaluated in this paper. ture introduces a new performance bottleneck: distributed trans- The VLL centralized lock manager uses per-tuple 2PL to re- actions [3]. Since these transactions access data that may not be move contention bottlenecks [36]. It is an optimized version of on the same node, the DBMS must use an atomic commit proto- DL_DETECT that requires much smaller storage and computation col, such as two-phase commit [16]. The coordination overhead overhead than our implementation when the contention is low. VLL of such protocols inhibits the scalability of distributed DBMSs be- achieves this by partitioning the database into disjoint subsets. Like cause the communication between nodes over the network is slow. H-STORE, this technique only works when the workload is parti- In contrast, communication between threads in a shared-memory tionable. Internally, each partition still has a critical section that environment is much faster. This means that a single many-core will limit scalability at high contention workloads. node with a large amount of DRAM might outperform a distributed The work in [26] identified latch contention as the main scala- DBMS for all but the largest OLTP applications [42]. bility bottleneck in MySQL. They removed this contention by re- It may be that for multi-node DBMSs two levels of abstraction placing the atomic-write-after-read synchronization pattern with a are required: a shared-nothing implementation between nodes and read-after-write scheme. They also proposed to pre-allocate and a multi-threaded shared-memory DBMS within a single chip. This deallocate locks in bulk to improve scalability. This system, how- hierarchy is common in high-performance computing applications. ever, is still based on centralized deadlock detection and thus will More work is therefore needed to study the viability and challenges perform poorly when there is contention in the database. In addi- of such hierarchical concurrency control in an OLTP DBMS. tion, their implementation requires the usage of global barriers that will be problematic at higher core counts. 7. RELATED WORK Others have looked into using the software-hardware co-design approach for improving DBMS performance. The “bionic database” The work in [39] is one of the first hardware analysis of a DBMS project [23] is similar to our proposal, but it focuses on implement- running an OLTP workload. Their evaluation focused on multi- ing OLTP DBMS operations in FPGAs instead of new hardware processor systems, such as how to assign processes to processors directly on the CPU. Other work is focused on OLAP DBMSs and to avoid bandwidth bottlenecks. Another study [37] measured CPU thus is not applicable to our problem domain. For example, an stall times due to cache misses in OLTP workloads. This work was FPGA-based SQL accelerator proposed in [10] filters in-flight data later expanded in [2] and more recently by [41, 35]. moving from a data source to a data sink. It targets OLAP applica- With the exception of H-STORE [14, 22, 38, 43] and OCC [28], tions by using the FPGA to accelerate the projection and restriction all other concurrency control schemes implemented in our test-bed 219

12.operations. The Q100 project is a special hardware co-processor for [11] C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, OLAP queries [44]. It assumes a column-oriented database storage N. Verma, and M. Zwilling. Hekaton: SQL Server’s memory-optimized OLTP engine. In SIGMOD, pages 1243–1254, 2013. and provides special hardware modules for each SQL operator. [12] K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger. The notions of consistency and predicate locks in a database system. Commun. ACM, 19(11):624–633, Nov. 1976. 8. FUTURE WORK [13] J. Evans. jemalloc. This work uncovered fundamental bottlenecks of concurrency [14] H. Garcia-Molina and K. Salem. Main memory database systems: An overview. control algorithms that limit their scalability as the number of cores IEEE Trans. on Knowl. and Data Eng., 4(6):509–516, Dec. 1992. increases. Because these limitations are inherent to these algo- [15] S. Ghemawat and P. Menage. TCMalloc: Thread-caching malloc. rithms, it is possible that no workaround exists in software. In this [16] J. Gray. Concurrency Control and Recovery in Database Systems, chapter case, software-hardware co-design is the only solution to address Notes on data base operating systems, pages 393–481. Springer-Verlag, 1978. these issues. For certain functionalities, specialized hardware can [17] J. Gray. The transaction concept: Virtues and limitations. In VLDB, pages significantly improve performance while reducing power consump- 144–154, 1981. [18] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly tion. We plan to study possible hardware modifications that can generating billion-record synthetic databases. SIGMOD, pages 243–252, 1994. bring the most performance gain for OLTP DBMSs. [19] J. N. Gray, R. A. Lorie, G. R. Putzolu, and I. L. Traiger. Modelling in data base Concurrency control is only one of the several aspects of a DBMS management systems. chapter Granularity of locks and degrees of consistency that affects scalability. To build a truly scalable DBMS, other com- in a shared data base, pages 365–393. 1976. [20] T. Haerder and A. Reuter. Principles of transaction-oriented database recovery. ponents also need to be studied. We plan to investigate logging and ACM Comput. Surv., 15(4):287–317, Dec. 1983. index implementations, and then analyze possible optimizations for [21] S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker. OLTP through these components. We will also expand our work to include multi- the looking glass, and what we found there. In SIGMOD, pages 981–992, 2008. socket systems with more than one many-core CPU. [22] M. Heytens, S. Listgarten, M.-A. Neimat, and K. Wilkinson. Smallbase: A main-memory dbms for high-performance applications. Technical report, Hewlett-Packard Laboratories, 1995. 9. ACKNOWLEDGEMENTS [23] R. Johnson and I. Pandis. The bionic dbms is coming, but what will it look like? In CIDR, 2013. This research was funded (in part) by the Intel Science and Tech- [24] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, and B. Falsafi. Shore-MT: a nology Center for Big Data. We also pour out some lean in grati- scalable storage manager for the multicore era. EDBT, pages 24–35, 2009. tude to the great Phil Bernstein for his sapient feedback. [25] R. Johnson, I. Pandis, R. Stoica, M. Athanassoulis, and A. Ailamaki. Aether: a scalable approach to logging. Proc. VLDB Endow., 3(1-2):681–692, 2010. [26] H. Jung, H. Han, A. D. Fekete, G. Heiser, and H. Y. Yeom. A scalable lock 10. CONCLUSION manager for multicores. In SIGMOD, pages 73–84, 2013. [27] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. C. This paper studied the scalability bottlenecks in concurrency con- Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi. trol algorithms for many-core CPUs. We implemented a lightweight H-Store: A High-Performance, Distributed Main Memory Transaction main memory DBMS with a pluggable architecture that supports Processing System. Proc. VLDB Endow., 1(2):1496–1499, 2008. [28] H. T. Kung and J. T. Robinson. On optimistic methods for concurrency control. seven concurrency control schemes. We ran our DBMS in a dis- ACM Trans. Database Syst., 6(2):213–226, June 1981. tributed CPU simulator that provides a virtual environment of 1000 [29] P.-A. Larson, S. Blanas, C. Diaconu, C. Freedman, J. M. Patel, and M. Zwilling. cores. Our results show that none of the algorithms are able to get High-performance concurrency control mechanisms for main-memory good performance at such a high core count in all situations. For databases. VLDB, 5(4):298–309, Dec. 2011. [30] J. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, lower core configurations, we found that 2PL-based schemes are J. Eastep, and A. Agarwal. Graphite: A distributed parallel simulator for good at handling short transactions with low contention that are multicores. In HPCA, pages 1–12, 2010. common in key-value workloads. Whereas T/O-based algorithms [31] D. L. Mills. Internet time synchronization: the network time protocol. are good at handling higher contention with longer transactions that Communications, IEEE Transactions on, 39(10):1482–1493, 1991. [32] I. Pandis, R. Johnson, N. Hardavellas, and A. Ailamaki. Data-oriented are more common in complex OLTP workloads. Although it may transaction execution. Proc. VLDB Endow., 3:928–939, September 2010. seem like all hope is lost, we proposed several research directions [33] I. Pandis, P. Tözün, R. Johnson, and A. Ailamaki. PLP: Page Latch-free that we plan to explore to rectify these scaling issues. Shared-everything OLTP. Proc. VLDB Endow., 4(10):610–621, July 2011. [34] A. Pavlo, C. Curino, and S. Zdonik. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In SIGMOD, pages 11. REFERENCES 61–72, 2012. [1] Intel brings supercomputing horsepower to big data analytics. [35] D. Porobic, I. Pandis, M. Branco, P. Tözün, and A. Ailamaki. OLTP on, November 2013. Hardware Islands. Proc. VLDB Endow., 5:1447–1458, July 2012. [2] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a modern [36] K. Ren, A. Thomson, and D. J. Abadi. Lightweight locking for main memory processor: Where does time go? In VLDB, pages 266–277, 1999. database systems. In VLDB, pages 145–156, 2013. [3] P. A. Bernstein and N. Goodman. Concurrency control in distributed database [37] M. Rosenblum, E. Bugnion, S. A. Herrod, E. Witchel, and A. Gupta. The systems. ACM Comput. Surv., 13(2):185–221, 1981. impact of architectural trends on operating system performance. In SOSP, pages [4] P. A. Bernstein and N. Goodman. Multiversion concurrency control - theory and 285–298, 1995. algorithms. ACM Trans. Database Syst., 8(4):465–483, Dec. 1983. [38] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and [5] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and P. Helland. The end of an architectural era: (it’s time for a complete rewrite). In Recovery in Database Systems, chapter 5. 1987. VLDB, pages 1150–1160, 2007. [6] P. A. Bernstein, D. Shipman, and W. Wong. Formal aspects of serializability in [39] S. S. Thakkar and M. Sweiger. Performance of an OLTP application on database concurrency control. IEEE Transactions on Software Engineering, symmetry multiprocessor system. In ISCA, pages 228–238, 1990. 5(3):203–216, 1979. [40] The Transaction Processing Council. TPC-C Benchmark (Revision 5.9.0). [7] M. J. Carey, D. J. DeWitt, M. J. Franklin, N. E. Hall, M. L. McAuliffe, J. F., June 2007. Naughton, D. T. Schuh, M. H. Solomon, C. Tan, O. G. Tsatalos, et al. Shoring [41] P. Tözün, B. Gold, and A. Ailamaki. OLTP in wonderland: where do cache up persistent applications, volume 23. ACM, 1994. misses come from in major OLTP components? In DaMoN, 2013. [8] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. [42] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden. Speedy transactions in Benchmarking cloud serving systems with YCSB. In SoCC’10, pages 143–154. multicore in-memory databases. In SOSP, 2013. [9] J. C. Corbett and et al. Spanner: Google’s Globally-Distributed Database. In [43] A. Whitney, D. Shasha, and S. Apter. High Volume Transaction Processing OSDI, pages 251–264, 2012. Without Concurrency Control, Two Phase Commit, SQL or C++. In HPTS’97. [10] C. Dennl, D. Ziener, and J. Teich. On-the-fly composition of fpga-based sql [44] L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross. Q100: the query accelerators using a partially reconfigurable module library. In FCCM, architecture and design of a database processing unit. In ASPLOS, 2014. pages 45–52, 2012. 220