TicToc: Time-Traveling Optimistic Concurrency Control


Xiangyao Yu (CSAIL MIT, yxy@csail.mit.edu), Andrew Pavlo (Carnegie Mellon University, pavlo@cs.cmu.edu), Daniel Sanchez (CSAIL MIT, sanchez@csail.mit.edu), Srinivas Devadas (CSAIL MIT, devadas@csail.mit.edu)

ABSTRACT

Concurrency control for on-line transaction processing (OLTP) database management systems (DBMSs) is a nasty game. Achieving higher performance on emerging many-core systems is difficult. Previous research has shown that timestamp management is the key scalability bottleneck in concurrency control algorithms. This prevents the system from scaling to large numbers of cores.

In this paper we present TicToc, a new optimistic concurrency control algorithm that avoids the scalability and concurrency bottlenecks of prior T/O schemes. TicToc relies on a novel and provably correct data-driven timestamp management protocol. Instead of assigning timestamps to transactions, this protocol assigns read and write timestamps to data items and uses them to lazily compute a valid commit timestamp for each transaction. TicToc removes the need for centralized timestamp allocation, and commits transactions that would be aborted by conventional T/O schemes. We implemented TicToc along with four other concurrency control algorithms in an in-memory, shared-everything OLTP DBMS and compared their performance on different workloads. Our results show that TicToc achieves up to 92% better throughput while reducing the abort rate by 3.3× over these previous algorithms.

SIGMOD'16, June 26-July 01, 2016, San Francisco, CA, USA. © 2016 ACM. ISBN 978-1-4503-3531-7/16/06. DOI: http://dx.doi.org/10.1145/2882903.2882935

1. INTRODUCTION

Multi-core systems are now pervasive, as parallelism has become the main approach to increase system performance. Systems with tens to hundreds of CPU cores are already on the market [2, 12], and thousand-core chips will be available in the near future [10]. Conventional DBMSs, however, still scale poorly beyond a few cores. The key bottleneck of on-line transaction processing (OLTP) DBMSs is their concurrency control algorithm. Prior work has shown that common concurrency control algorithms suffer from both fundamental and artificial scalability bottlenecks [34, 37, 29]. Although recent work ameliorates some artificial bottlenecks [20, 21, 24, 28, 32, 35], fundamental bottlenecks remain.

Ideally, concurrency control schemes should restrict the inherent parallelism in transactional workloads as little as possible, while incurring small management overhead that scales well with the number of cores. Most of the recently-proposed concurrency control schemes are based on timestamp ordering (T/O) [5]. T/O schemes assign a unique, monotonically-increasing timestamp to each transaction. The DBMS then uses these timestamps to process conflicting operations in the proper order. The two most common variants of T/O are multi-version concurrency control (MVCC) [30] and optimistic concurrency control (OCC) [22].

T/O algorithms are popular because they allow significant concurrency, but suffer from a fundamental scalability bottleneck: timestamp allocation. Using a shared-memory counter to produce timestamps limits T/O schemes to a few million transactions per second, orders of magnitude lower than modern OLTP workload requirements [35, 37]. Prior work has proposed hardware and software techniques to increase timestamp allocation throughput, but both approaches have serious limitations. On the hardware side, centralized asynchronous counters [37], remote atomic memory operations [2, 18], and fully-synchronized clocks [19] alleviate the timestamp allocation bottleneck, but they are challenging to implement and are not available in current systems. On the software side, coarse-grained timestamp epochs with group commit [35] reduce the frequency of timestamp allocations, but still limit concurrency in common scenarios, as we show later.

In this paper we present TicToc, a new concurrency control algorithm that achieves higher concurrency than state-of-the-art T/O schemes and completely eliminates the timestamp allocation bottleneck. The key contribution of TicToc is a technique that we call data-driven timestamp management: instead of assigning timestamps to each transaction independently of the data it accesses, TicToc embeds the necessary timestamp information in each tuple to enable each transaction to compute a valid commit timestamp after it has run, right before it commits. This approach has two benefits. First, each transaction infers its timestamp from metadata associated with each tuple it reads or writes. No centralized timestamp allocator exists, and concurrent transactions accessing disjoint data do not communicate, eliminating the timestamp allocation bottleneck. Second, by determining timestamps lazily at commit time, TicToc finds a logical-time order that enforces serializability even among transactions that overlap in physical time and would cause aborts in other T/O-based protocols. In essence, TicToc allows commit timestamps to move forward in time to uncover more concurrency than existing schemes without violating serializability.

We present a high-performance, OCC-based implementation of TicToc, and prove that it enforces serializability. We also design several optimizations that further improve TicToc's scalability. Finally, we compare TicToc with four other modern concurrency control schemes in the DBx1000 main-memory DBMS [1], using two different OLTP workloads on a multi-socket, 40-core system. Our results show that TicToc achieves up to 92% higher throughput than prior algorithms under a variety of workload conditions.
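To make the data-driven idea concrete, here is a minimal sketch (ours, not the paper's code; the `Version` class and `commit_timestamp` function names are our own illustration) of the per-tuple metadata TicToc maintains and the rule a transaction applies at commit time, using the access pattern from the paper's later example:

```python
from dataclasses import dataclass

@dataclass
class Version:
    """A data version carrying TicToc's valid timestamp range [wts, rts]."""
    value: object
    wts: int  # write timestamp: logical time at which this version was created
    rts: int  # read timestamp: version is known valid through this logical time

def commit_timestamp(reads, writes):
    """Smallest logical time consistent with every access: at least the wts of
    each version read, and strictly past the rts of each version overwritten."""
    ts = 0
    for v in reads:
        ts = max(ts, v.wts)
    for v in writes:
        ts = max(ts, v.rts + 1)
    return ts

# A read x (version valid [2, 3]) and overwrote y (previous version valid [1, 2]).
x = Version("x0", wts=2, rts=3)
y = Version("y0", wts=1, rts=2)
print(commit_timestamp(reads=[x], writes=[y]))  # 3: inside x's range, past y's rts
```

Each transaction derives this value purely from the tuples it touched, which is why no central allocator is needed.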

2. BACKGROUND

OLTP DBMSs support the part of an application that interacts with end users. End users send requests to the application to perform some function (e.g., post a comment, purchase an item). The application processes these requests and then executes transactions in the DBMS to read or write to the database. A transaction in the context of one of these systems is the execution of a sequence of one or more operations (e.g., SQL queries) on a shared database to perform some higher-level function [15].

A concurrency control scheme is the protocol that a DBMS uses to interleave the operations of simultaneous transactions in such a way as to provide the illusion that each transaction is running exclusively on the database. There are two classes of concurrency control algorithms [5]: two-phase locking and timestamp ordering.

Two-phase locking (2PL) was the first method proven to ensure correct execution of concurrent DBMS transactions [6, 13]. Under this scheme, transactions have to acquire locks for a particular element in the database before they are allowed to execute a read or write operation on that element. 2PL is considered a pessimistic approach because it assumes that transactions will conflict and thus they need to acquire locks to avoid this problem. If a transaction is unable to acquire a lock for an element, then it is forced to wait until the lock becomes available. If this waiting is uncontrolled, then the DBMS can incur deadlocks. Thus, a major difference among 2PL variants is their deadlock-avoidance strategy.

Timestamp ordering (T/O) concurrency control schemes generate a serialization order of transactions a priori based on monotonically increasing timestamps. The DBMS uses these timestamps to process conflicting operations in the proper order (e.g., read and write operations on the same element, or two separate write operations on the same element) [5].

Although 2PL has been widely adopted in traditional DBMSs (e.g., IBM DB2, Microsoft SQL Server, MySQL), the contention introduced by locks severely hurts performance in today's many-core systems [37]. Almost every OLTP DBMS released in the last decade that we are aware of, including all but a few of the NewSQL DBMSs [3], uses a T/O-based concurrency control scheme.

We next discuss T/O algorithms in further detail to understand their key bottlenecks. We then discuss state-of-the-art algorithms in Section 2.2. This will provide the necessary background for our presentation of the TicToc algorithm in Section 3.

2.1 Timestamp Allocation

The execution of transactions must obey certain ordering constraints. In the strictest isolation level (i.e., serializable), the execution schedule must be equivalent to a schedule where all the transactions are executed sequentially. In a T/O-based concurrency control algorithm, this serial order is expressed using timestamps. Each transaction is assigned a unique and monotonically increasing timestamp as the serial order that is used for conflict detection. Multi-versioning (MVCC) and optimistic (OCC) concurrency control algorithms are both timestamp based.

In traditional T/O-based algorithms, a centralized timestamp allocator assigns a unique timestamp to each transaction. A common way to implement the allocator is through an atomic add instruction that increments a global counter for each new transaction. This approach, however, is only able to generate less than 5 million timestamps per second on a modern multi-core system due to the long latency incurred by the CPU's cache coherence protocol [11, 35]. As such, most state-of-the-art T/O-based algorithms suffer from the timestamp allocation bottleneck [37].

Hardware support can alleviate the timestamp allocation bottleneck. For example, Tilera processors support remote atomic operations [14] that can increment the timestamp counter without incurring extra cache coherence traffic [2, 18]. In practice, this achieves 100 million timestamps per second [11]. A second option is to add a special hardware timestamp counter on the multi-core chip (which does not exist in any CPUs today). This approach is able to allocate one billion timestamps per second according to simulations [37]. The third option is to produce timestamps using a clock that is synchronized across all cores. In fact, small-scale Intel systems have a synchronized clock [19]. However, fully-synchronized clocks are challenging to maintain in large-scale, multi-socket systems, which use different clock sources that drift over time. Keeping these clocks completely synchronized would require either an unreasonable amount of communication for adjustments, or a global clock source with an impractically low frequency.

All the hardware solutions described here require some hardware support that either does not exist in any CPUs or only exists in a subset of the CPUs today. Even for those CPU architectures that do have this support, it is not guaranteed that the support will still exist in the future, considering that its cost increases with the number of cores. Moreover, even if this hardware support exists in all processors, a T/O-based concurrency control algorithm may still achieve suboptimal performance. In these algorithms, timestamps are statically assigned to transactions and the assignment does not depend on the data access pattern. At runtime, the actual dependency between two transactions may not agree with the assigned timestamp order and thus transactions may be unnecessarily aborted, which hurts performance (see example in Section 3.1).

2.2 Optimistic Concurrency Control

The original OCC algorithm was proposed in 1981 [22], but it has only recently been adopted in high-performance OLTP DBMSs. By contrast, MVCC, the other T/O-based algorithm, has been used in DBMSs for several decades (e.g., Oracle, Postgres).

Under OCC, the DBMS executes a transaction in three phases: read, validation, and write. In the read phase, the transaction performs read and write operations to tuples without blocking. The DBMS maintains a separate private workspace for each transaction that contains its read set and write set. All of a transaction's modifications are written to this workspace and are only visible to the transaction itself. When the transaction finishes execution, it enters the validation phase, where the OCC scheme checks whether the transaction conflicts with any other active transaction. If there are no conflicts, the transaction enters the write phase, where the DBMS propagates the changes in the transaction's write set to the database and makes them visible to other transactions.

Many algorithms have been proposed to refine and improve the original OCC algorithm [8, 17, 26, 31]. The first OCC derivatives from the 1980s dealt with improving transaction validation for single-threaded systems with limited memory [17, 31].

One OCC-based protocol that bears some similarity to our proposed TicToc algorithm is the dynamic timestamp allocation (DTA) approach [4, 7]. Instead of assigning a specific timestamp to a transaction, DTA allows the transaction to take a range of timestamps and adjusts the commit timestamp during the validation phase. This approach was revisited in the 1990s for "real-time" DBMSs to give higher priority in the validation phase to certain transactions [23, 25]. Similar to TicToc, DTA-based OCC can reduce the number of aborts compared to a traditional OCC. The key difference is that DTA only assigns timestamps to transactions and not to tuples. As a result, DTA-based OCC requires the DBMS to use a global critical section for coordination among concurrently-validating transactions. This is a major scalability bottleneck on multi-core and multi-socket systems [37].
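The centralized allocation scheme from Section 2.1 amounts to one shared counter that every transaction must bump. A toy sketch (ours, not the paper's; Python has no atomic-add instruction, so a mutex stands in for it) shows the structure that makes every core contend on a single cache line:

```python
import itertools
import threading

class CentralAllocator:
    """Global timestamp counter: every transaction, on every core, must
    atomically increment this one shared variable before it can run."""
    def __init__(self):
        self._counter = itertools.count(1)
        self._lock = threading.Lock()  # stand-in for the hardware atomic add

    def next_ts(self):
        with self._lock:               # all threads serialize here
            return next(self._counter)

alloc = CentralAllocator()
# Timestamps are unique and monotonically increasing -- that is the point --
# but producing them is inherently sequential.
print([alloc.next_ts() for _ in range(4)])  # [1, 2, 3, 4]
```

TicToc's data-driven approach removes exactly this shared point of coordination.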

Silo is a state-of-the-art OCC algorithm that achieves high throughput by avoiding bottlenecks caused by global locks or timestamp allocation [35]. In Silo, a global timestamp (called an epoch) is allocated at coarse time granularity (every 40 ms) and is used to indicate the serial order among transactions. Within an epoch, transaction IDs are used to identify data versions as well as to detect conflicts. These IDs, however, do not reflect the relative order among transactions. This is because they only capture read-after-write dependencies, but not write-after-read dependencies (anti-dependencies). Silo is still able to enforce serializable execution, but is only able to exploit a limited amount of parallelism.

To tackle the above issues, we now present a new OCC variant that uses decentralized, data-driven timestamp management.

3. THE TICTOC ALGORITHM

Like other T/O-based algorithms, TicToc uses timestamps to indicate the serial order of the transactions. But unlike these previous approaches, it does not assign timestamps to transactions using a centralized allocator. Instead, a transaction's timestamp is calculated lazily at its commit time in a distributed manner based on the tuples it accesses. There are two key advantages of this timestamp management policy. First, its distributed nature avoids all of the bottlenecks inherent in timestamp allocation [37], making the algorithm highly scalable. Second, laziness makes it possible for the DBMS to exploit more parallelism in the workload, thereby reducing aborts and improving performance.

3.1 Lazy Timestamp Management

To see why lazy timestamp management can reduce conflicts and improve performance, we consider the following example involving two concurrent transactions, A and B, and two tuples, x and y. The transactions invoke the following sequence of operations:

1. A read(x)
2. B write(x)
3. B commits
4. A write(y)

This interleaving of operations does not violate serializability because transaction A can be ordered before B in the serial order. But A cannot commit after B in the serial order because the version of x read by A has already been modified by B.

Traditional OCC algorithms assign timestamps to transactions statically, essentially agreeing on a fixed sequential schedule for concurrent transactions. This eases conflict detection, but limits concurrency. In this example, if transaction A is assigned a lower timestamp than transaction B, then A can commit since the interleaving of operations is consistent with timestamp order. However, if transaction A is assigned a higher timestamp than transaction B, A must eventually abort since committing it would violate the schedule imposed by timestamp order.

By contrast, TicToc does not allocate timestamps statically, so it does not restrict the set of potential orderings. It instead calculates the timestamp of each transaction lazily at the transaction's commit time by inspecting the tuples it accessed. In our example, when transaction A reaches its commit point, TicToc calculates the commit timestamp using the versions of tuples x and y it actually read/wrote rather than the latest versions in the database right now. And since the version of tuple x read by A is older than the one written by B, A will be ordered before B and can commit.

To encode the serialization information in the tuples, each data version in TicToc has a valid range [4] of timestamps bounded by the write timestamp (wts) and the read timestamp (rts). Specifically, a particular version is created at timestamp wts and is valid until timestamp rts. A version read by a transaction is valid if and only if that transaction's commit timestamp is in between the version's wts and rts. And a write by a transaction is valid if and only if the transaction's commit timestamp is greater than the rts of the previous version. Formally, the following invariant must hold for transaction T to commit:

∃ commit_ts,
  (∀v ∈ {versions read by T}, v.wts ≤ commit_ts ≤ v.rts)
  ∧ (∀v ∈ {versions written by T}, v.rts < commit_ts)   (1)

This policy leads to serializable execution because all the reads and writes within a transaction occur at the same timestamp. A read always returns the version valid at that timestamp and a write is ordered after all the reads to older versions of the same tuple.

3.2 Protocol Specification

Like standard OCC algorithms, each transaction in TicToc accesses the database without acquiring locks during normal operation. This is known as the read phase. When the transaction invokes the commit operation, the protocol then takes the transaction through the validation phase to check whether it should be allowed to commit. If it does, then it enters the write phase where the transaction's changes are applied to the shared database. We now discuss these phases in further detail.

3.2.1 Read Phase

The DBMS maintains a separate read set and write set of tuples for each transaction. During this phase, accessed tuples are copied to the read set and modified tuples are written to the write set, which is only visible to the current transaction. Each entry in the read or write set is encoded as {tuple, data, wts, rts}, where tuple is a pointer to the tuple in the database, data is the data value of the tuple, and wts and rts are the timestamps copied from the tuple when it was accessed by the transaction. For a read set entry, TicToc maintains the invariant that the version is valid from wts to rts in timestamp order.

Algorithm 1 shows the procedure for a tuple access request in the read phase:

Algorithm 1: Read Phase
Data: read set RS, tuple t
1 r = RS.get_new_entry()
2 r.tuple = t
  # Atomically load wts, rts, and value
3 <r.value = t.value, r.wts = t.wts, r.rts = t.rts>

The pointer to the tuple is stored in the read or write set depending on the request type. The data value and read and write timestamps are also recorded. Note that the value and timestamps must be loaded atomically to guarantee that the value matches the timestamps. We explain in Section 3.6 how to efficiently perform this operation in a lock-free manner.

3.2.2 Validation Phase

In the validation phase, TicToc uses the timestamps stored in the transaction's read and write sets to compute its commit timestamp. Then, the algorithm checks whether the tuples in the transaction's read set are valid based on this commit timestamp.

The first step of this validation, shown in Algorithm 2, is to lock all the tuples in the transaction's write set in their primary key order to prevent other transactions from updating the rows concurrently. Using this fixed locking order guarantees that there are no deadlocks with other transactions committing at the same time. This technique is also used in other OCC algorithms (e.g., Silo [35]).
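The write-set locking step just described (acquire per-tuple locks in primary-key order so concurrent committers cannot deadlock) can be sketched as follows; the `Tuple` class and helper names are our own illustration, not DBx1000 code:

```python
import threading

class Tuple:
    def __init__(self, key):
        self.key = key
        self.latch = threading.Lock()

def lock_write_set(write_set):
    """Acquire per-tuple locks in a global (primary-key) order. Any two
    transactions with overlapping write sets acquire the common tuples in the
    same order, so a circular wait -- and hence deadlock -- is impossible."""
    for t in sorted(write_set, key=lambda t: t.key):
        t.latch.acquire()

def unlock_write_set(write_set):
    for t in write_set:
        t.latch.release()

a, b = Tuple(1), Tuple(2)
lock_write_set([b, a])  # acquired as [a, b] regardless of the input order
print(a.latch.locked(), b.latch.locked())  # True True
unlock_write_set([a, b])
```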

Algorithm 2: Validation Phase
Data: read set RS, write set WS
  # Step 1 – Lock Write Set
1 for w in sorted(WS) do
2   lock(w.tuple)
3 end
  # Step 2 – Compute the Commit Timestamp
4 commit_ts = 0
5 for e in WS ∪ RS do
6   if e in WS then
7     commit_ts = max(commit_ts, e.tuple.rts + 1)
8   else
9     commit_ts = max(commit_ts, e.wts)
10  end
11 end
  # Step 3 – Validate the Read Set
12 for r in RS do
13   if r.rts < commit_ts then
       # Begin atomic section
14     if r.wts ≠ r.tuple.wts or (r.tuple.rts ≤ commit_ts and isLocked(r.tuple) and r.tuple not in WS) then
15       abort()
16     else
17       r.tuple.rts = max(commit_ts, r.tuple.rts)
18     end
       # End atomic section
19   end
20 end

The second step in the validation phase is to compute the transaction's commit timestamp from the timestamps stored within each tuple entry in its read/write sets. As discussed in Section 3.1, for a tuple in the read set but not in the write set, the commit timestamp should be no less than its wts, since the tuple would have a different version before this timestamp. For a tuple in the transaction's write set, however, the commit timestamp needs to be no less than its current rts + 1, since the previous version was valid till rts.

In the last step, the algorithm validates the tuples in the transaction's read set. If the transaction's commit_ts is less than or equal to the rts of the read set entry, then the invariant wts ≤ commit_ts ≤ rts holds. This means that the tuple version read by the transaction is valid at commit_ts, and thus no further action is required. If the entry's rts is less than commit_ts, however, it is not clear whether the local value is still valid or not at commit_ts. It is possible that another transaction has modified the tuple at a logical time between the local rts and commit_ts, which means the transaction has to abort. Otherwise, if no other transaction has modified the tuple, rts can be extended to be greater than or equal to commit_ts, making the version valid at commit_ts.

Specifically, the local wts is first compared to the latest wts. If they are different, the tuple has already been modified by another transaction and thus it is not possible to extend the rts of the local version. If wts matches, but the tuple is already locked by a different transaction (i.e., the tuple is locked but it is not in the transaction's write set), it is not possible to extend the rts either. If the rts is extensible or if the version is already valid at commit_ts, the rts of the tuple can be extended to at least commit_ts. Note that the whole process must be done atomically to prevent interference from other transactions. The DBMS also does not need to validate tuples that are only in the write set, since they are already protected by the locks acquired at the beginning of the validation phase.

In TicToc, there is no centralized contention point during transaction execution. The locks and atomic sections protect only the tuples that a transaction touches. In Section 5, we present optimizations to further reduce the contention caused by these operations.

3.2.3 Write Phase

Finally, if all of the tuples that the transaction accessed pass validation, then the transaction enters the write phase. As shown in Algorithm 3, in this phase the transaction's write set is written to the database. For each tuple in the transaction's write set, the DBMS sets its wts and rts to commit_ts, indicating that it is a new version. All locks that were acquired during the validation phase are then released, making the changes visible to all other transactions.

Algorithm 3: Write Phase
Data: write set WS, commit timestamp commit_ts
1 for w in WS do
2   write(w.tuple.value, w.value)
3   w.tuple.wts = w.tuple.rts = commit_ts
4   unlock(w.tuple)
5 end

[Figure 1: An example of two transactions executing using TicToc. Four steps in physical time (A.read(x); B.write(x); A.write(y); A commits @ 3, B commits @ 4), with each version of tuples x and y drawn as a vertical band spanning its wts to its rts.]

3.3 Example

We now revisit the example in Section 3.1 and explain how TicToc is able to commit both transactions A and B even though previous OCC algorithms could not. Fig. 1 shows a step-by-step diagram. In this example, one operation occurs in each physical step. The wts and rts for tuples x and y are encoded as the start and end points of the vertical bands.

Step 1: Transaction A reads tuple x. The current version of x and its timestamps (wts = 2 and rts = 3) are stored in A's read set.

Step 2: Transaction B writes to tuple x and commits at timestamp 4. The version of x will be overwritten and both the wts and rts of the new version will become 4.

Step 3: Transaction A writes to tuple y. Since the previous version of y has rts = 2, the new version can be written at timestamp 3. At this point, the new version of y is only stored in the write set of transaction A and is not visible to other transactions yet.

Step 4: Transaction A enters the validation phase. According to Algorithm 2, the commit timestamp should be the maximum of the read tuple's wts and the written tuple's rts + 1, which is timestamp 3 in this example. Then, transaction A validates the read set by checking whether the version of tuple x it read is valid at timestamp 3. Since transaction A's version of tuple x is valid at timestamps 2 and 3, it passes the validation phase and commits.

Note that when the DBMS validates transaction A, it is not even aware that tuple x has been modified by another transaction. This is different from existing OCC algorithms (including Hekaton [24] and Silo [35]), which always recheck the tuples in the read set. These algorithms would abort transaction A in this example because tuple x has already been modified since transaction A last read it.
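Putting Algorithm 2's steps together, a single-threaded sketch of the validation logic (ours; the real implementation also needs the write-set locks and per-tuple atomic sections described above) reproduces the example: A commits at timestamp 3 even though x now holds B's later version.

```python
def validate(read_set, write_set):
    """Simplified Algorithm 2. Entries are dicts holding the transaction's
    local snapshot (wts, rts) plus a reference to the live tuple."""
    # Step 2: commit_ts must be >= wts of every version read,
    # and > rts of every version being overwritten.
    commit_ts = 0
    for e in write_set:
        commit_ts = max(commit_ts, e["tuple"]["rts"] + 1)
    for e in read_set:
        commit_ts = max(commit_ts, e["wts"])
    # Step 3: validate the read set at commit_ts.
    for r in read_set:
        if r["rts"] < commit_ts:
            if r["wts"] != r["tuple"]["wts"]:
                return None  # version changed under us -> abort
            r["tuple"]["rts"] = max(commit_ts, r["tuple"]["rts"])  # extend rts
    return commit_ts

# Fig. 1 scenario: A read x at [2, 3]; B then overwrote x (now [4, 4]);
# A overwrites y, whose previous version was valid [1, 2].
x = {"wts": 4, "rts": 4}                       # live tuple, already updated by B
y = {"wts": 1, "rts": 2}
A_reads = [{"tuple": x, "wts": 2, "rts": 3}]   # A's snapshot of x
A_writes = [{"tuple": y}]
print(validate(A_reads, A_writes))  # 3: A commits logically before B
```

Note that the read-set check on x never even inspects the live tuple here, because A's snapshot of x is already valid at timestamp 3; this is exactly why TicToc commits a transaction that Silo or Hekaton would abort.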

5.3.4 Spurious Aborts Algorithm 4: Atomically Load Tuple Data and Timestamps A transaction may not always be able to validate its read set dur- Data: read set entry r, tuple t ing the validation phase, which leads to aborts that may seem spu- 1 do rious. For example, if tuple y in Fig. 1 was originally valid from 2 v1 = t.read_ts_word() timestamps 1 to 4, then transaction A’s commit_ts has to be 5. And 3 read(r.data, t.data) since x’s rts cannot be extended to 5, A has to abort. In this case, 4 v2 = t.read_ts_word() A aborts not because of the timestamps and not data values. 5 while v1 = v2 or v1.lock_bit == 1; In general, the reason that these aborts occur is because other 6 r.wts = v1.wts transactions violate serializability. For the example above, imagine 7 r.rts = v1.wts + v1.delta that there exists a transaction C reading tuple x and y after B com- mits but before A commits. C is able to commit at timestamp 4 as it observes B’s write to x and the original value of y. C will extend TS_word [63]: Lock bit (1 bit). the rts of y to 4. This means that A cannot commit now without vi- TS_word [62:48]: delta = rts − wts (15 bits). olating serializability because there is a dependency cycle between TS_word [47:0]: wts (48 bits). A, B and C 1 . Note that when A enters the validation phase, it does not know that C exists or that A would form a dependency cycle The highest-order bit is used as the lock bit. wts is stored as a 48- with other committed transactions. bit counter. To handle wts overflows, which happens at most once Note that in TicToc a transaction aborts only if a tuple it reads every several weeks for the most active workloads, the tuples in the or writes is overwritten by another transaction that enters the vali- database are periodically loaded in the background to reset their dation phase first. So only concurrent transactions (i.e., one starts wts. This process is infrequent and can be performed concurrently before the other commits) can cause aborts. 
If a transaction commits, then all transactions that start after it will observe its changes.

3.5 Discussion

Beyond scalability and increased concurrency, TicToc's protocol has two other distinguishing features. Foremost is that the transactions' logical commit timestamp order may not agree with their physical commit time order. In the example shown in Fig. 1, transaction A commits physically after transaction B, but its commit timestamp is less than transaction B's commit timestamp. This means that A precedes B in the serial schedule. This also indicates that TicToc is not order-preserving serializable, since the serial order may not match the commit order.

Another feature of TicToc is that logical timestamps grow more slowly than the number of committed transactions. This is because different transactions may commit with the same logical timestamp. Such a scenario is possible if two transactions have no conflicts with each other, or if one transaction reads a version modified by the other transaction. At one extreme, if all transactions are read-only and thus there is no contention, all transactions will have the same commit timestamp. At the other extreme, if all transactions write to the same tuple, each commit increases the tuple's wts by one, and the logical timestamp increases at the same rate as the number of committed transactions. Since most OLTP workloads have some contention, the DBMS's logical timestamps will increase more slowly than the number of committed transactions; the higher the contention, the faster logical timestamps advance. The rate at which the logical timestamp advances is therefore an indicator of the contention level in the workload. We will show this in Section 6.5.

3.6 Implementation

As shown in Algorithms 1 and 2, both the read and validation phases require the DBMS to atomically read or write tuples' timestamps. Implementing these atomic sections using locks would degrade performance. To avoid this problem, TicToc adopts an optimization from Silo [35] to encode a lock bit and a tuple's wts and rts into a single 64-bit word (TS_word) of the following form: a 1-bit lock flag, the wts, and a 15-bit delta, where rts = wts + delta.

… with normal transactions, so its overhead is negligible.

Algorithm 4 shows the lock-free implementation of atomically loading the data and timestamps for the tuple read operation from Algorithm 1. TS_word is loaded twice, before and after loading the data. If these two TS_word instances are the same and both have the lock bit unset, then the data value must not have changed and is still consistent with the timestamps. Otherwise, the process is repeated until both timestamps are consistent. There are no writes to shared memory during this process. To avoid starvation, one could revert to more heavyweight latching if this check repeatedly fails.

Similarly, Algorithm 5 shows the steps to atomically extend a tuple's rts in TicToc's validation phase (Algorithm 2). Recall that this operation is called if commit_ts is greater than the local rts; the DBMS makes the local version valid at commit_ts by extending the rts of the tuple. The first part of the algorithm is the same as explained in Section 3.2.2; validation fails if the tuple's rts cannot possibly be extended to commit_ts.

Since we only encode delta in 15 bits in TS_word, it may overflow if rts and wts grow far apart. If an overflow occurs, we also increase wts to keep delta within 15 bits without affecting the correctness of TicToc. Intuitively, this can be considered a dummy write to the tuple at the new wts with the same data. Inserting such a dummy write does not affect serializability. Increasing wts, however, may increase the number of aborts, since another transaction may consider the version as changed when it has not actually changed. This effect becomes more problematic the fewer bits delta uses. Although not shown in the paper, our experiments indicate that 15 bits is enough for the overflow effect to be negligible.

Finally, the new wts and delta are written to a new TS_word and atomically applied to the tuple. The DBMS uses an atomic compare-and-swap instruction to make sure that the TS_word has not been modified by another transaction in the meantime.

Scanning tuples in a database may miss tuples being inserted because they are not observed by the scanning transaction. Standard techniques for solving this problem include using locks in indexes [31] or rescanning the tuples during the validation phase [24]. Both techniques incur significant performance overhead. This can be avoided by running the transactions at lower isolation levels (e.g., snapshot isolation) if the application allows it. We believe that it is possible to apply the data-driven timestamp management concept used in TicToc to order-preserving indexes to avoid this phantom anomaly for serializable transactions. This exploration is outside the scope of this paper and is left for future work.

1 A < B due to write-after-read on x, B < C due to read-after-write on x, and C < A due to write-after-read on y.
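The TS_word encoding and the double-read protocol of Algorithm 4 can be sketched as follows. This is our illustrative Python model, not the paper's code: the 48-bit wts width is inferred from the 64-bit word minus the lock bit and the 15-bit delta, and an atomic 64-bit load is modeled by a plain field read.

```python
# Assumed TS_word layout (only the 15-bit delta is stated in the text):
#   bit 63     : lock bit
#   bits 62-15 : wts (48 bits)
#   bits 14-0  : delta, where rts = wts + delta
LOCK_BIT = 1 << 63
WTS_MASK = (1 << 48) - 1
DELTA_MASK = 0x7FFF

def pack(locked, wts, rts):
    delta = rts - wts
    assert 0 <= delta <= DELTA_MASK and 0 <= wts <= WTS_MASK
    return (LOCK_BIT if locked else 0) | (wts << 15) | delta

def unpack(ts_word):
    locked = bool(ts_word & LOCK_BIT)
    wts = (ts_word >> 15) & WTS_MASK
    return locked, wts, wts + (ts_word & DELTA_MASK)  # (locked, wts, rts)

def atomic_read(tup):
    """Lock-free consistent read in the style of Algorithm 4: load TS_word
    before and after reading the data, and retry until both copies match
    and the lock bit is unset."""
    while True:
        v1 = tup.ts_word   # first load of TS_word
        data = tup.data    # read the tuple's value
        v2 = tup.ts_word   # second load of TS_word
        if v1 == v2 and not (v1 & LOCK_BIT):
            _, wts, rts = unpack(v1)
            return data, wts, rts
```

Because writers set the lock bit and bump wts before publishing new data, two identical unlocked TS_word loads bracket a consistent (data, wts, rts) snapshot without any write to shared memory.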

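The delta-overflow handling used when extending a tuple's rts (lines 8-11 of Algorithm 5) can be sketched as below; the function name is ours, and the word repacking and compare-and-swap are omitted.

```python
DELTA_MASK = 0x7FFF  # delta is stored in 15 bits

def extend_rts(wts, commit_ts):
    """Extend a tuple's rts to commit_ts. If the new delta = commit_ts - wts
    does not fit in 15 bits, wts is shifted forward (a logical dummy write of
    the same data) so the stored delta fits. Returns the new (wts, delta)."""
    delta = commit_ts - wts
    shift = delta - (delta & DELTA_MASK)  # the part that overflows 15 bits
    return wts + shift, delta - shift
```

Note that the resulting rts is unchanged by the shift: new_wts + new_delta = wts + delta = commit_ts, which is exactly the invariant the dummy-write argument relies on.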
Algorithm 5: Read-Set Validation
Data: read set entry r, write set W, commit timestamp commit_ts
 1: do
 2:   success = true
 3:   v2 = v1 = r.tuple.read_ts_word()
      # Abort if the local version cannot be valid at commit_ts
 4:   if r.wts ≠ v1.wts or (v1.rts ≤ commit_ts and isLocked(r.tuple) and r.tuple not in W) then
 5:     Abort()
 6:   end
      # Extend the rts of the tuple
 7:   if v1.rts ≤ commit_ts then
        # Handle delta overflow
 8:     delta = commit_ts − v1.wts
 9:     shift = delta − delta ∧ 0x7fff
10:     v2.wts = v2.wts + shift
11:     v2.delta = delta − shift
        # Set the new TS_word
12:     success = compare_and_swap(r.tuple.ts_word, v1, v2)
13:   end
14: while not success

3.7 Logging and Durability

The TicToc algorithm can support logging and crash recovery in a similar way as traditional concurrency control algorithms. The DBMS can use the canonical ARIES approach if there is only a single log file [27]. But ARIES cannot provide the bandwidth required in today's multi-core systems [39]. Implementing arbitrarily scalable logging is out of the scope of this paper and is left for future work. In this section, we briefly discuss one idea for implementing parallel logging with multiple log files in TicToc.

Parallel logging has been studied in other DBMSs [36, 39]. The basic idea is to perform logging in batches. All transactions in a previous batch must be ordered before any transaction in a later batch, but the relative ordering among transactions within the same batch can be arbitrary. For each batch, logs are written to multiple files in parallel. A batch is considered durable only after all the logs within that batch have been written to files.

In TicToc, the batching scheme requires that transactions in a later batch have commit timestamps greater than all transactions in the previous batch. This can be achieved by setting a minimum commit timestamp for transactions belonging to the new batch. To start a new batch, the worker threads coordinate to compute the minimum commit timestamp that is greater than the commit timestamps of all transactions in previous batches. Each transaction in the new batch then receives a commit timestamp greater than this minimum. Setting a minimum timestamp does not affect the correctness of TicToc, since timestamps only increase in this process and transactions remain properly ordered by their timestamps. The performance of this parallel logging scheme should be the same as with other concurrency control algorithms [36, 39].

4. PROOF OF CORRECTNESS

In this section, we prove that the TicToc algorithm correctly enforces serializability.

4.1 Proof Idea

To prove that a schedule is serializable in TicToc, we need to show that the schedule is equivalent to another schedule where all the transactions are executed serially. Previous T/O concurrency control algorithms use the transactions' unique timestamps to determine the serial order. In TicToc, however, the commit timestamp is derived from the accessed tuples and no global coordination takes place, and thus two transactions may commit with the same timestamp. Therefore, transactions cannot be fully ordered based on their commit timestamps alone.

Our proof instead uses a combination of the timestamp and physical time orders [38]. A transaction's logical commit time is its commit timestamp; its physical commit time is the physical time between the transaction's validation phase and write phase. We define the following specific serial order.

DEFINITION 1 (SERIAL ORDER). Using <s, <ts, and <pt to indicate serial order, commit timestamp order, and physical commit time order, respectively, the serial order between transactions A and B is defined as follows:

    A <s B  ⟺  A <ts B ∨ (A =ts B ∧ A ≤pt B)

The serial order defined in Definition 1 is a total order among all the transactions. Transaction A is ordered before transaction B if A has a smaller commit timestamp, or if they have the same commit timestamp but A commits before B in physical time. If A and B have both the same logical and the same physical commit time, then they may take either serial order. The goal of the correctness proof is summarized as follows:

THEOREM 1. Any schedule in TicToc is equivalent to the serial schedule defined in Definition 1.

To prove this, we show that the dependencies in the actual schedule are maintained as the dependencies in the equivalent serial schedule. Specifically, a read in the actual schedule always returns the value of the last store in the serial schedule. We also prove that transactions having the same commit timestamp and physical commit time do not conflict.

4.2 Formal Proofs

We first prove a useful lemma that will be used to prove the subsequent Lemmas 2 and 3.

LEMMA 1. Transactions writing to the same tuple must have different commit timestamps.

PROOF. According to Algorithms 2 and 3, a tuple is locked while being written, so only one transaction can write to that tuple at any time. According to line 3 in Algorithm 3, both the wts and rts of the modified tuple become its commit timestamp. According to line 7 in Algorithm 2, if another transaction writes to the same tuple at a later time, its commit timestamp must be strictly greater than the tuple's current rts. Since rts never decreases in the TicToc algorithm, the commit timestamp of the later transaction must be greater than the commit timestamp of the earlier transaction. Therefore, transactions writing to the same tuple must have different commit timestamps.

As discussed in the previous section, two requirements are needed to prove Theorem 1. First, transactions that have the same commit timestamp and physical commit time must not conflict. Second, a read always returns the latest write in the serial order. We now prove these two requirements for TicToc:

LEMMA 2. Transactions that commit at the same timestamp and physical time do not conflict with each other.

PROOF. According to Lemma 1, write-write conflicting transactions must have different commit timestamps.

Therefore, we only need to show that all read-write or write-read conflicting transactions commit at different logical or physical times. Consider a pair of transactions committing at the same physical time, where one reads a tuple and the other writes to the same tuple. Then, the commit timestamp of the reading transaction must be less than or equal to the tuple's current rts, and the commit timestamp of the writing transaction must be greater than the tuple's current rts. Therefore, they have different commit timestamps.

LEMMA 3. A read operation from a committed transaction returns the value of the latest write to the tuple in the serial schedule.

PROOF. We first prove that if a committed transaction's read observes another transaction's write, then the reading transaction must be ordered after the writing transaction in the serial schedule. A tuple's wts is always updated together with its value, and the wts is always the commit timestamp of the transaction that wrote the value. Line 9 in Algorithm 2 states that if another transaction reads the value, then its commit timestamp must be greater than or equal to wts. If the commit timestamp equals wts, then the reading transaction still commits after the writing transaction in physical time, because the writing transaction only makes its writes globally visible after its physical commit time. By Definition 1, the reading transaction is always ordered after the writing transaction in the serial schedule.

We next show that the write observed by the following read is the latest write in the serial schedule. In other words, if the writing transaction has timestamp t1 and the reading transaction has timestamp t2, no other write to the same tuple happens at a timestamp t such that t1 ≤ t ≤ t2. According to Algorithm 2, when the reading transaction commits, it observes a consistent view of the tuple's TS_word with wts and rts, where t1 = wts and t2 ≤ rts. This implies that so far in physical time, no write to the same tuple has happened between t1 and t2 in logical time, because otherwise the wts of the tuple would be greater than t1. No such write can happen in the future either, because all future writes will have timestamps greater than the tuple's rts and thus greater than t2.

PROOF OF THEOREM 1. According to Lemma 2, transactions with the same commit timestamp and physical commit time do not conflict, so all serial orders among them are equivalent. According to Lemma 3, for transactions with different commit timestamps or physical commit times, a read in a transaction always returns the latest write in the serial schedule. According to Lemma 1, only one such latest write can exist, so there is no ambiguity. Then, for each transaction executed in the actual schedule, all the values it observes are identical to the values it would observe in the serial schedule. Hence, the two schedules are equivalent.

5. OPTIMIZATIONS

The TicToc algorithm as presented so far achieves good performance when tested on a multi-socket system (Section 6). There are, however, still places in the validation phase that may create unnecessary contention and thus hurt performance. For example, locking the transaction's write set during the validation phase may cause thrashing for write-intensive benchmarks. In this section, we discuss several optimizations that we developed for TicToc to minimize contention. We also discuss how TicToc works at weaker isolation levels for those applications that do not need serializability.

5.1 No-Wait Locking in Validation Phase

In TicToc's validation phase (Algorithm 2), tuples in a transaction's write set are locked following the primary key order. This is essentially the same process as a 2PL protocol, where a transaction may wait if the next lock that it needs to acquire is not immediately available. Waiting for locks, however, may create thrashing problems at high core counts, even if locks are acquired in primary key order [37]. Thrashing happens when a transaction already holds locks and waits for the next lock. The locks it already holds, however, may block other transactions.

Figure 2: An example of lock thrashing in a 2PL protocol. (Transactions A-D lock or wait on tuples x, y, z, u, and v.)

Consider the pathological case shown in Fig. 2, where each transaction tries to lock two tuples. Transaction D has already acquired both of the locks that it needs, while transactions A, B, and C are waiting for locks held by other transactions. When transaction D commits, C is able to acquire the lock and make forward progress. Transactions A and B, however, still need to wait. The end result is that the four transactions are validated sequentially.

Note that, in this particular example, it is actually possible to abort transactions A and C so that B and D can acquire the locks and run in parallel. After they finish, A and C can also run in parallel. This schedule takes only half the execution time of the pathological schedule, but it requires an additional deadlock detection thread to quickly identify these scenarios.

A better approach to avoid the thrashing problem is to use a 2PL variant based on non-waiting deadlock prevention in TicToc's validation phase [5]. This protocol optimization, which we refer to as no-wait, is like running a mini concurrency control algorithm inside the TicToc algorithm. With no-wait, if a transaction fails to acquire a lock for a tuple in its write set during the validation phase, the validation is immediately aborted (releasing any locks) and TicToc restarts the validation phase. The transaction sleeps for a short period (1 µs) before retrying to reduce restarts. Our experiments show that the algorithm's performance is not overly sensitive to the length of this sleep time, as long as it is not too large. The no-wait optimization minimizes blocking and allows more transactions to validate simultaneously. In the example in Fig. 2, if no-wait is used, then A or B may run in parallel with D.

5.2 Preemptive Aborts

The first step of the validation phase (Algorithm 2) locks the tuples in the transaction's write set before it examines the read set. If a transaction ends up aborting because read set validation fails, then this locking potentially blocked other transactions unnecessarily. We observe that for some transactions the decision to abort can actually be made before locking the write set tuples. We call this optimization preemptive abort. Since all the serialization information is already stored in the tuples' timestamps, it can be used to make early abort decisions, thereby reducing contention.

To better understand this, consider a transaction with one tuple in its read set. This transaction will fail the validation phase if this tuple's local rts is less than the transaction's commit timestamp and its local wts does not match the tuple's latest wts. A tuple's latest wts can be atomically read from its TS_word. The transaction's commit timestamp, however, cannot be accurately determined before the write set is locked, because a tuple's rts in the write set might be changed by another transaction. The key observation here is that we just need to find an approximate commit timestamp.
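The no-wait write-set locking from Section 5.1 can be sketched as follows. The tuple methods `try_lock`/`unlock` and the `key` field are our hypothetical names, and in the full protocol the entire validation phase restarts on failure; this sketch retries only the locking step.

```python
import time

def lock_write_set_no_wait(write_set, sleep_s=1e-6):
    """No-wait locking of the write set: try-lock the tuples in
    primary-key order; if any try-lock fails, release everything,
    sleep briefly (the paper uses ~1 us), and retry."""
    while True:
        acquired = []
        for t in sorted(write_set, key=lambda t: t.key):
            if t.try_lock():
                acquired.append(t)
            else:
                break                   # would have to wait: give up instead
        if len(acquired) == len(write_set):
            return acquired             # all locks held; proceed to validate
        for t in acquired:              # back off: release and retry later
            t.unlock()
        time.sleep(sleep_s)
```

Releasing all held locks instead of waiting is what prevents the convoy in Fig. 2: a transaction that cannot make progress stops blocking everyone behind it.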

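The approximate commit timestamp used for preemptive aborts (Section 5.2) can be sketched as below. The entry fields are our naming: each read/write set entry carries the transaction's locally cached wts and rts.

```python
def approx_commit_ts(read_set, write_set):
    """Lower bound on the final commit timestamp, computed only from the
    timestamps cached in the transaction's read and write sets, i.e.,
    before any write-set tuple is locked."""
    ts = 0
    for r in read_set:
        ts = max(ts, r.wts)        # a read is valid no earlier than its wts
    for w in write_set:
        ts = max(ts, w.rts + 1)    # a write must be after the local rts
    return ts
```

A transaction can then be preemptively aborted if some read-set entry has a local rts below this bound while its local wts no longer matches the tuple's latest wts; since the bound only underestimates the real commit timestamp, such an abort decision is always safe.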
Thus, this optimization achieves some early aborts, but it does not catch all the transactions that will fail read set validation. We compute the approximate commit timestamp using the local wts and rts in the read and write sets. For each tuple in the read set, the approximate commit timestamp is no less than the tuple's local wts; for each tuple in the write set, the approximate commit timestamp is no less than the tuple's local rts + 1. Note that the actual commit timestamp is no less than this approximation, because the latest timestamps in the tuples cannot be less than the local timestamps. Once an approximate commit timestamp is determined, it is used to decide whether the transaction should be preemptively aborted.

5.3 Timestamp History

TicToc always aborts a transaction if the local rts of a tuple is less than commit_ts and the local wts does not match the latest wts. There are some cases, however, where the latest wts is greater than commit_ts and the local version is still valid at commit_ts. Such transactions can actually commit without violating serializability.

Figure 3: Using a tuple's timestamp history to avoid aborting. (Four steps in physical time: A.read(x); B extends x's rts to 3; C.write(x); A validates x with commit_ts = 3.)

Fig. 3 shows such an example. Transaction A first reads tuple x with wts = 2 and rts = 2. Later, tuple x's rts is extended to timestamp 3 due to the validation of transaction B. Then, tuple x is modified by transaction C, and thus the latest wts and rts both become 4. Finally, transaction A enters the validation phase and validates tuple x at commit_ts = 3 (not 2, because transaction A accessed other tuples not shown in the figure). At this point, transaction A only has the local timestamps of x (wts = 2 and rts = 2) and knows that the local version is valid at timestamp 2, but does not know whether it is still valid at timestamp 3. From transaction A's perspective, it is possible that the local version has been extended to timestamp 3 by some other transaction; it is also possible, however, that some other transaction did a write that is only valid at timestamp 3. Based on all the information transaction A has, these two situations are indistinguishable.

To prevent these unnecessary aborts, we can extend TicToc to maintain a history of each tuple's wts rather than just one scalar value. When a new version is created for a tuple, the wts of the old version is stored in a history buffer. The value of the old version does not need to be stored, since transactions in TicToc always read the latest data version. Therefore, the storage overhead of this optimization in TicToc is smaller than that of MVCC. In our implementation, the history buffer is a per-tuple array that keeps a fixed number of the most recent wts's, and thus the DBMS does not have to perform garbage collection.

During a transaction's validation phase, if a read tuple's local rts is less than commit_ts and its wts does not match the latest wts, then the DBMS checks whether the wts matches any version in the tuple's history buffer. If so, the valid range of that version is from the local wts to the next wts in the history buffer. If commit_ts falls within that range, the tuple can still be validated.

5.4 Lower Isolation Levels

The TicToc algorithm described in Section 3 provides serializable isolation, which is the strictest isolation level in ANSI SQL. With minimal changes, TicToc can also support lower isolation levels for those applications that are willing to sacrifice isolation guarantees in favor of better performance and a lower abort rate.

Snapshot Isolation: This level mandates that all of a transaction's reads see a consistent snapshot of the database, and that the transaction commits only if it does not conflict with any concurrent updates made since that snapshot. In other words, all the read operations should happen at the same timestamp (commit_rts), all the write operations should happen at a potentially later timestamp (commit_wts), and the written tuples must not be modified between commit_rts and commit_wts. To support snapshots, instead of using a single commit_ts and verifying that all reads and writes are valid at this timestamp, two commit timestamps are used, one for reads (commit_rts) and one for writes (commit_wts). The algorithm verifies that all reads are valid at commit_rts and all writes are valid at commit_wts. It also guarantees that, before the transaction writes to a tuple, the tuple's previous wts is less than or equal to commit_rts. All of this can be implemented with minor changes to Algorithm 2.

Repeatable Reads: With this weaker isolation level, a transaction's reads do not need to happen at the same timestamp, even though writes should still have the same commit timestamp. This means there is no need to verify the read set in the validation phase. For a tuple read and updated by the same transaction, however, the DBMS still needs to guarantee that no other updates happened to that tuple since the transaction last read its value.

6. EXPERIMENTAL EVALUATION

We now present our evaluation of the TicToc algorithm. For these experiments, we use the DBx1000 OLTP DBMS [1]. This is a multi-threaded, shared-everything system that stores all data in DRAM in a row-oriented manner with hash table indexes.

DBx1000 uses worker threads (one per core) that invoke transactions from a fixed-length queue. Each transaction contains program logic intermixed with query invocations. Queries are executed serially by the transaction's worker thread as they are encountered in the program logic. Transaction statistics, such as throughput and abort rates, are collected after the system reaches a steady state during the warm-up period. The abort rate is calculated as the total number of aborts divided by the total number of transaction attempts (both committed and aborted transactions).

DBx1000 includes a pluggable lock manager that supports different concurrency control schemes. This allows us to compare five approaches all within the same system:

TICTOC: Time-traveling OCC with all optimizations
SILO: Silo OCC [35]
HEKATON: Hekaton MVCC [24]
DL_DETECT: 2PL with deadlock detection
NO_WAIT: 2PL with non-waiting deadlock prevention

We deployed DBx1000 on a 40-core machine with four Intel Xeon E7-4850 CPUs and 128 GB of DRAM. Each core supports two hardware threads, for a total of 80 threads. The experiments with more than 40 threads (shaded areas in the throughput graphs) use multiple threads per core, and thus may scale sub-linearly due to contention. To minimize memory latency, we use numactl to ensure each thread allocates memory from its own socket.

6.1 Workloads

We next describe the two benchmarks that we implemented in the DBx1000 DBMS for this analysis.
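Returning to the timestamp-history optimization of Section 5.3, the validation-time check can be sketched as below. The buffer layout and names are our assumptions: the history holds the tuple's most recent wts values in increasing order, so a version with wts = history[i] was overwritten at history[i+1] and is valid on the half-open interval [history[i], history[i+1]).

```python
def valid_in_history(local_wts, commit_ts, wts_history):
    """Timestamp-history check: even if the tuple's latest wts no longer
    matches the locally cached wts, the local version can still be
    validated if commit_ts falls inside that version's valid range."""
    for i in range(len(wts_history) - 1):
        if wts_history[i] == local_wts:
            return local_wts <= commit_ts < wts_history[i + 1]
    return False
```

In the Fig. 3 example the history is [2, 4], transaction A's local wts is 2, and commit_ts is 3, so the check succeeds and A commits instead of aborting.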

Figure 4: TPC-C (4 Warehouses) – Scalability of the different concurrency control algorithms on the TPC-C workload with 4 warehouses. (a) Throughput. (b) Abort Rate.

Figure 5: TPC-C (Variable Warehouses) – Scalability of the different concurrency control algorithms on TPC-C when sweeping the number of warehouses. The number of worker threads in DBx1000 is fixed at 80. (a) Throughput. (b) Abort Rate.

Figure 6: YCSB (Read-Only) – Results for a read-only YCSB workload for the different concurrency control schemes and the atomic add timestamp allocator.

TPC-C: This workload is the current industry standard for evaluating OLTP systems [33]. It consists of nine tables that simulate a warehouse-centric order processing application. Only two (Payment and NewOrder) of the five transactions in TPC-C are modeled in our simulation, with the workload comprised of 50% of each type. These two make up 88% of the default TPC-C mix and are the most interesting in terms of complexity for our evaluation.

YCSB: The Yahoo! Cloud Serving Benchmark is representative of large-scale on-line services [9]. Each query accesses a single random tuple based on a Zipfian distribution with a parameter (theta) that controls the contention level in the benchmark [16]. We evaluate three different variations of this workload:

1. Read-Only: Two read queries per transaction and a uniform access distribution (theta=0).
2. Medium Contention: 16 queries per transaction (90% reads and 10% writes) with a hotspot of 10% of the tuples that is accessed by ∼60% of all queries (theta=0.8).
3. High Contention: 16 queries per transaction (50% reads and 50% writes) with a hotspot of 10% of the tuples that is accessed by ∼75% of all queries (theta=0.9).

For all of the YCSB experiments in this paper, we used a ∼10 GB database containing a single table with 10 million records. Each tuple has a single primary key column and 10 additional columns, each with 100 bytes of randomly generated string data.

6.2 TPC-C Results

We first analyze the performance of all the concurrency control algorithms on the TPC-C benchmark. The number of warehouses in TPC-C determines both the size of the database and the amount of concurrency. Each warehouse adds ∼100 MB to the database. The warehouse is the root entity for almost all of the tables in the database. We follow the TPC-C specification, where ∼10% of the NewOrder transactions and ∼15% of the Payment transactions access a "remote" warehouse.

We first run TPC-C with four warehouses, as this is an example of a database that has a lot of contention. We then run an experiment where we fix the number of threads and scale the number of warehouses in the database. This measures how well the algorithms scale when the workload has more parallelism opportunities.

6.2.1 4 Warehouses

The results in Fig. 4 show that the performance improvements from additional threads are limited by contention on the WAREHOUSE table. Each Payment transaction updates a per-warehouse tuple in this table, and each NewOrder transaction reads that tuple. Since there are only four such tuples in the entire database, they become the bottleneck of the whole system.

The Payment transaction is simpler and faster than the NewOrder transaction. In SILO, when a NewOrder transaction enters the validation phase, it is likely that a Payment transaction has already modified the tuple in the WAREHOUSE table. Therefore, SILO (like other traditional OCCs) frequently aborts these NewOrder transactions.

Figure 7: YCSB (Medium Contention) – Results for a read-write YCSB workload with medium contention. Note that DL_DETECT is only measured up to 40 threads. (a) Throughput. (b) Abort Rate.

Figure 8: YCSB (High Contention) – Results for a read-write YCSB workload with high contention. Note that DL_DETECT is only measured up to 40 threads. (a) Throughput. (b) Abort Rate.

In TICTOC, the NewOrder transaction would also see that the WAREHOUSE tuple has been modified. But most of the time the transaction can find a common timestamp that satisfies all the tuples it accesses and thus is able to commit. As shown in Fig. 4, TICTOC achieves 1.8× better throughput than SILO while reducing its abort rate by 27%. We attribute this to TICTOC's ability to achieve better parallelism by dynamically selecting the commit timestamp.

The figure also shows that DL_DETECT has the worst scalability of all the algorithms. This is because DL_DETECT suffers from the thrashing problem discussed in Section 5.1. Thrashing occurs because a transaction waits to acquire new locks while holding other locks, which causes other transactions to block and form a convoy. NO_WAIT performs better than DL_DETECT, as it avoids this thrashing problem by not waiting for locks. HEKATON also supports non-blocking reads, since a transaction can always access a previous version, and transactions that perform conflicting writes are immediately aborted. Since rolling back an aborted transaction in an in-memory DBMS is a relatively fast operation, these increased aborts do not significantly hurt performance. But NO_WAIT still performs worse than TICTOC and SILO due to its use of locks. Similarly, HEKATON is slower because of the overhead of maintaining multiple versions.

6.2.2 Variable Warehouses

As we increase the number of warehouses while fixing the number of worker threads, the contention in the system decreases. In Fig. 5 the number of warehouses is swept from 4 to 80, while the number of worker threads is fixed at 80.

When the number of warehouses is small and contention is high, TICTOC performs consistently better than SILO, for the same reason as in Section 6.2.1. As the number of warehouses grows, parallelism in TPC-C becomes plentiful, so the advantage of TICTOC over SILO decreases and eventually disappears at 80 warehouses. With respect to the schemes' measured abort rates, shown in Fig. 5b, TICTOC has consistently fewer aborts than SILO below 80 warehouses, because it is able to adjust timestamps to commit transactions that SILO aborts.

6.3 YCSB Results

We now compare TicToc to the other concurrency control schemes under different YCSB scenarios.

6.3.1 Read-Only

We executed a YCSB workload comprising read-only transactions with a uniform access distribution. This provides a baseline for each concurrency control scheme before we explore more complex workload arrangements.

The results in Fig. 6 show that all of the algorithms except HEKATON scale almost linearly up to 40 threads. Beyond that point, scaling is sub-linear, as the threads executing on the same physical core contend for pipeline resources. TICTOC and SILO achieve better absolute performance than the other algorithms because they do not have locking overheads. HEKATON is limited by its centralized timestamp allocation component. It uses a single atomic add instruction on a global counter, which causes threads accessing the counter from different cores to incur cache coherence traffic on the chip. In our 4-socket system, this limits HEKATON to ∼5 million timestamps per second.

6.3.2 Medium Contention

In a read-only workload, transactions do not conflict with each other, and thus any algorithm without artificial bottlenecks should scale. For workloads with some contention, however, the ways that the algorithms handle conflicts affect the DBMS's performance. Fig. 7 shows the throughput and abort rate of the medium-contention YCSB workload.

[Figure 9: TicToc Optimizations (TPC-C) – Throughput measurements of TicToc using the different optimizations from Section 5 for TPC-C with 4 warehouses. Panels: (a) Throughput, (b) Abort Rate, (c) Execution Time Breakdown (80 threads).]

[Figure 10: TicToc Optimizations (YCSB) – Throughput measurements of TicToc using the different optimizations from Section 5 for a high contention read-write YCSB workload. Panels: (a) Throughput, (b) Abort Rate, (c) Execution Time Breakdown (80 threads).]

SILO and TICTOC both scale well and achieve similar throughput. But the graph in Fig. 7b shows that TICTOC has a ∼3.3× lower abort rate than SILO. This is due to TICTOC's data-driven timestamp management, as transactions can commit at the proper timestamp, which is not necessarily the largest timestamp so far.

The throughput measurements show that DL_DETECT again has the worst scalability of all the algorithms due to lock thrashing. NO_WAIT does better since transactions can be immediately restarted when there is a deadlock. HEKATON performs better than the 2PL schemes since multiple versions allow more read operations to succeed (they can access older versions), which leads to fewer transaction aborts. But this adds overhead that causes HEKATON to perform worse than TICTOC and SILO.

6.3.3 High Contention

We now compare the algorithms on a YCSB workload with high contention. Here, conflicts are more frequent and the workload has lower inherent parallelism, which stresses the DBMS and allows us to more easily identify the main bottlenecks in each algorithm.

As expected, the results in Fig. 8 show that all algorithms are less scalable than in the medium contention workload. We note, however, that the performance difference between TICTOC and SILO is more prominent. As we discuss next, this performance gain comes from the optimizations we presented in Section 5.

Both TICTOC and SILO have similar abort rates in Fig. 8b under high contention. The timestamp management policy in TICTOC does not reduce the abort rate here because the workload is so write-intensive and the contention level is so high that both algorithms behave similarly in terms of aborting transactions.

6.4 TicToc Optimizations

We now evaluate the optimizations we presented in Section 5 to determine their individual effect on TicToc's overall performance. To do this, we run DBx1000 multiple times using TicToc but enable the optimizations one at a time. We use four different configurations in this experiment:

1. No Opts: TicToc without any optimizations.
2. NoWait: TicToc with no-wait locking, described in Section 5.1.
3. NoWait + PreAbort: TicToc with no-wait locking and preemptive aborts from Section 5.2.
4. All Opts: TicToc with no-wait locking, preemptive aborts, and timestamp history from Section 5.3.

Recall that this last configuration is the default setting for TicToc in all of the other experiments in this paper. We also include the performance measurements for SILO from Fig. 8a for comparison.

The results in Fig. 9 show the performance, abort rate, and time breakdown (at 80 cores) for the TPC-C workload with four warehouses. At 80 threads, TICTOC without optimizations achieves 35.6% higher throughput than SILO and has a 32% lower abort rate. This gain comes from the greater parallelism exploited by TICTOC's timestamp management policy. Using the no-wait optimization for locking transactions' write sets provides another 38% performance gain at 80 threads while not affecting the abort rate. In Fig. 9c, we see that the gains of basic TICTOC over SILO mainly come from reducing the abort rate. The optimizations do not reduce TICTOC's abort rate further, but they do reduce the amount of time wasted on aborting transactions; they effectively make each abort take a shorter amount of time.

Fig. 10 shows the same experiments on the high contention YCSB workload. Here we see a similar performance improvement as in TPC-C, but it comes from different aspects of TICTOC. TICTOC without any optimizations performs only 10% better than SILO, and most of the performance improvement comes from the no-wait and preemptive abort optimizations. In contrast to Fig. 9, using preemptive aborts provides a larger performance gain in YCSB. This is partly because in YCSB each transaction locks more tuples during the validation phase, and preemptive aborts alleviate the contention caused by these locks.

A key finding is that the timestamp history optimization does not provide any measurable performance gain in either workload. This was initially surprising to us, but upon further investigation we are convinced that it is indeed correct. In a way, TICTOC without this optimization already stores multiple versions of a tuple in each transaction's private workspace. This means that each transaction can commit in parallel using its own version. Although in theory timestamp history can enable more concurrency, in practice there is no clear performance benefit for the workloads we evaluated.
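The no-wait locking policy evaluated in these configurations can be illustrated with a minimal sketch. This is our own simplification, not code from DBx1000, and the names (`Tuple`, `validate_write_set`) are hypothetical: during validation, each write-set lock is acquired with a non-blocking attempt, and any failure releases the locks already held and aborts the transaction immediately instead of waiting.

```python
import threading

class Tuple:
    """A record with a per-tuple validation lock (hypothetical sketch)."""
    def __init__(self):
        self.lock = threading.Lock()

def validate_write_set(write_set):
    """No-wait locking: try to lock every tuple in the write set;
    abort immediately if any lock is already held by someone else."""
    acquired = []
    for t in write_set:
        if t.lock.acquire(blocking=False):   # no-wait: never block
            acquired.append(t)
        else:
            for a in acquired:               # back out and abort now
                a.lock.release()
            return False   # caller aborts and restarts the transaction
    return True            # caller proceeds to commit, then releases
```

Because no thread ever blocks on a validation lock, deadlocks among validating transactions are impossible by construction; the cost is that a transaction may abort on a lock it would have obtained shortly after.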

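As background for the logical-time analysis in Section 6.5, the rule for deriving a commit timestamp from data rather than from a global counter can be sketched as follows. This is our simplification of the idea, not the full TicToc protocol (which also validates the read set and extends per-tuple read timestamps): each tuple carries a write timestamp (wts) and a read timestamp (rts), and a transaction commits at the smallest timestamp consistent with everything it read and wrote.

```python
def commit_timestamp(read_set, write_set):
    """Smallest commit timestamp consistent with the (wts, rts) pairs
    observed per tuple. Simplified sketch of data-driven timestamp
    management; not the full TicToc validation protocol."""
    ts = 0
    # A read of a version written at wts must commit at or after wts.
    for wts, _rts in read_set:
        ts = max(ts, wts)
    # A write must commit after every read of the old version (rts).
    for _wts, rts in write_set:
        ts = max(ts, rts + 1)
    return ts
```

Under a TS_ALLOC-style scheme, by contrast, every commit performs an atomic fetch-and-add on a shared counter, so logical timestamps grow one-for-one with committed transactions; the data-driven rule above advances logical time only as far as conflicts require.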
[Figure 11: Logical Time Analysis – Comparison of the growth rate of the timestamps in TICTOC versus TS_ALLOC (commit timestamp vs. number of committed transactions, 0–10000, under high and medium contention).]

Table 1: Isolation Levels (Medium Contention) – Performance measurements for the concurrency control schemes running YCSB under different isolation levels with 40 threads.

(a) Throughput (Million txn/s)
     DL_DETECT  HEKATON  NO_WAIT  SILO   TICTOC
SR   0.43       1.55     0.63     2.32   2.57
SI   –          1.78     –        –      2.69
RR   0.72       1.88     1.89     2.45   2.69

(b) Abort Rate
     DL_DETECT  HEKATON  NO_WAIT  SILO   TICTOC
SR   0.35%      11.6%    63.2%    6.47%  1.76%
SI   –          1.96%    –        –      1.54%
RR   0.10%      1.94%    9.9%     0.71%  0.72%

Table 2: Isolation Levels (High Contention) – Performance measurements for the concurrency control schemes running YCSB under different isolation levels with 40 threads.

(a) Throughput (Million txn/s)
     DL_DETECT  HEKATON  NO_WAIT  SILO   TICTOC
SR   0.005      0.18     0.30     0.52   0.82
SI   –          0.23     –        –      0.90
RR   0.010      0.23     0.35     0.80   1.04

(b) Abort Rate
     DL_DETECT  HEKATON  NO_WAIT  SILO   TICTOC
SR   74.0%      34.4%    69.9%    46.8%  44.3%
SI   –          30.9%    –        –      40.1%
RR   74.3%      30.4%    71.3%    42.3%  39.7%

6.5 Logical Time Analysis

We analyze how TicToc's commit timestamps grow over time under different levels of contention. Ideally, we would like timestamps to grow slowly relative to the total number of committed transactions, indicating that synchronization among transactions is relatively infrequent. For this experiment, we execute the medium and high contention YCSB workloads from Section 6.3 and track the values of the transactions' commit timestamps over time. We also include TS_ALLOC as a baseline, where each transaction is assigned a unique timestamp using an atomic add instruction. This is representative of the timestamp growth rate in other T/O-based algorithms, such as HEKATON.

Fig. 11 shows the relationship between logical timestamps and the number of committed transactions for the three configurations. With the TS_ALLOC protocol, the number of committed transactions and logical timestamps increase at the same rate. In TicToc, however, logical timestamps increase at a slower rate: 64× and 10× slower for low and high contention levels in YCSB, respectively. What is interesting about these measurements is that the rate of growth of logical timestamps indicates the inherent level of parallelism in a workload that can be exploited by TicToc. In the high contention workload, for example, this ratio is 10×. This corroborates our results in Fig. 8a, which show that the DBMS was only able to achieve 7.7× better throughput from running multiple threads in the high contention YCSB workload.

6.6 Isolation Levels

All of the experiments so far have used serializable isolation. Serializable is the strictest isolation level and thus usually has fewer concurrency opportunities than lower isolation levels. We now compare the DBMS's performance when transactions execute under snapshot isolation (SI) and repeatable read (RR) isolation levels versus the default serializable isolation (SR). All five algorithms support the RR level. For SI, we are only able to test the TICTOC and HEKATON algorithms. This is because supporting SI requires the DBMS to maintain multiple versions for each tuple, which would require significant changes to the other algorithms. We use the medium- and high-contention YCSB workloads from Section 6.3.

The medium-contention YCSB results are shown in Table 1. For this setting, the workload has enough parallelism and thus all the optimistic T/O-based algorithms only see small improvements when running at a lower isolation level (4.7% for TICTOC and 5.6% for SILO), whereas for the pessimistic 2PL algorithms the improvement is more pronounced (67.4% for DL_DETECT and 200.0% for NO_WAIT). HEKATON only has a 21.3% improvement from SR to RR. The abort rate measurements in Table 1b show that the lower isolation levels achieve lower abort rates because there are fewer conflicts between transactions. As expected, all the algorithms have the fewest aborted transactions under RR, since it is the most relaxed isolation level.

The high-contention YCSB results are shown in Table 2. Lower isolation levels have better performance than serializable isolation. Again, the throughput of the RR isolation level is slightly better than SI's. In general, for this workload setting we found that different isolation levels do not cause large reductions in abort rates due to the significant amount of contention on hotspot tuples.

7. CONCLUSION

In this paper we have presented TicToc, a new OCC-based concurrency control algorithm that eliminates the need for centralized timestamp allocation. TicToc instead uses a novel data-driven timestamp management approach that decouples logical timestamps from physical time by deriving transaction commit timestamps from data items. This enables an OLTP DBMS to exploit more parallelism than other OCC-based algorithms. We have also presented several optimizations that leverage this timestamp management policy to further improve TicToc's performance. Our evaluation results show that, compared to another state-of-the-art OCC algorithm, TicToc achieves up to 92% higher throughput while reducing transaction abort rates by up to 3.3× under different workload conditions.

ACKNOWLEDGMENTS

This research was funded (in part) by the Intel Science and Technology Center for Big Data and the U.S. National Science Foundation (CCF-1438955, CCF-1438967).

For questions or comments about this paper, please call the CMU Database Hotline at +1-844-88-CMUDB.

8. REFERENCES

[1] DBx1000. https://github.com/yxymit/DBx1000.
[2] Tile-Gx family of multicore processors. http://www.tilera.com.
[3] M. Aslett. How will the database incumbents respond to NoSQL and NewSQL? The 451 Group, April 2011.
[4] R. Bayer, K. Elhardt, J. Heigert, and A. Reiser. Dynamic timestamp allocation for transactions in database systems. In 2nd Int. Symp. on Distributed Databases, pages 9–20, 1982.
[5] P. A. Bernstein and N. Goodman. Concurrency control in distributed database systems. ACM Comput. Surv., 13(2):185–221, 1981.
[6] P. A. Bernstein, D. Shipman, and W. Wong. Formal aspects of serializability in database concurrency control. IEEE Transactions on Software Engineering, 5(3):203–216, 1979.
[7] C. Boksenbaum, M. Cart, J. Ferrié, and J.-F. Pons. Concurrent certifications by intervals of timestamps in distributed database systems. IEEE Transactions on Software Engineering, (4):409–419, 1987.
[8] M. J. Carey. Improving the performance of an optimistic concurrency control algorithm through timestamps and versions. IEEE Transactions on Software Engineering, SE-13(6):746–751, June 1987.
[9] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In SoCC '10, pages 143–154.
[10] W. J. Dally. GPU Computing: To Exascale and Beyond. In Supercomputing '10, Plenary Talk, 2010.
[11] T. David, R. Guerraoui, and V. Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In Symposium on Operating Systems Principles, pages 33–48, 2013.
[12] B. D. de Dinechin, R. Ayrignac, P.-E. Beaucamps, P. Couvert, B. Ganne, P. G. de Massas, F. Jacquet, S. Jones, N. M. Chaisemartin, F. Riss, and T. Strudel. A clustered manycore processor architecture for embedded and accelerated applications. In Proc. of the High Performance Extreme Computing Conference, 2013.
[13] K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger. The notions of consistency and predicate locks in a database system. CACM, 19(11):624–633, 1976.
[14] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer: Designing an MIMD Shared Memory Parallel Computer. IEEE Trans. Comput., 100(2), 1983.
[15] J. Gray. The transaction concept: Virtues and limitations. In VLDB, pages 144–154, 1981.
[16] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly generating billion-record synthetic databases. SIGMOD, pages 243–252, 1994.
[17] T. Härder. Observations on optimistic concurrency control schemes. Inf. Syst., 9(2):111–120, Nov. 1984.
[18] H. Hoffmann, D. Wentzlaff, and A. Agarwal. Remote store programming. In High Performance Embedded Architectures and Compilers, pages 3–17. Springer, 2010.
[19] Intel. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B, 17.14.1 Invariant TSC, 2015.
[20] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, and B. Falsafi. Shore-MT: a scalable storage manager for the multicore era. EDBT, pages 24–35, 2009.
[21] H. Kimura. FOEDUS: OLTP engine for a thousand cores and NVRAM. SIGMOD '15, pages 691–706, 2015.
[22] H. T. Kung and J. T. Robinson. On optimistic methods for concurrency control. ACM Trans. Database Syst., 6(2):213–226, June 1981.
[23] K.-W. Lam, K.-Y. Lam, and S.-L. Hung. Real-time optimistic concurrency control protocol with dynamic adjustment of serialization order. In Real-Time Technology and Applications Symposium, pages 174–179. IEEE, 1995.
[24] P.-A. Larson, S. Blanas, C. Diaconu, C. Freedman, J. M. Patel, and M. Zwilling. High-performance concurrency control mechanisms for main-memory databases. VLDB, 5(4):298–309, Dec. 2011.
[25] J. Lee and S. H. Son. Using dynamic adjustment of serialization order for real-time database systems. In Real-Time Systems Symposium, pages 66–75. IEEE, 1993.
[26] D. A. Menascé and T. Nakanishi. Optimistic versus pessimistic concurrency control mechanisms in database management systems. Information Systems, 7(1):13–27, 1982.
[27] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz. ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. TODS, 17(1):94–162, 1992.
[28] I. Pandis, R. Johnson, N. Hardavellas, and A. Ailamaki. Data-oriented transaction execution. Proc. VLDB Endow., 3:928–939, September 2010.
[29] D. Porobic, I. Pandis, M. Branco, P. Tözün, and A. Ailamaki. OLTP on Hardware Islands. Proc. VLDB Endow., 5:1447–1458, July 2012.
[30] D. P. Reed. Naming and synchronization in a decentralized computer system. PhD thesis, Massachusetts Institute of Technology, 1978.
[31] M. Reimer. Solving the phantom problem by predicative optimistic concurrency control. VLDB '83, pages 81–88, 1983.
[32] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (it's time for a complete rewrite). In VLDB, pages 1150–1160, 2007.
[33] The Transaction Processing Council. TPC-C Benchmark (Revision 5.9.0), June 2007.
[34] A. Thomasian. Concurrency control: Methods, performance, and analysis. ACM Comput. Surv., 30(1):70–119, Mar. 1998.
[35] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden. Speedy transactions in multicore in-memory databases. In SOSP, 2013.
[36] T. Wang and R. Johnson. Scalable logging through emerging non-volatile memory. Proc. VLDB Endow., 7(10):865–876, 2014.
[37] X. Yu, G. Bezerra, A. Pavlo, S. Devadas, and M. Stonebraker. Staring into the abyss: An evaluation of concurrency control with one thousand cores. Proc. VLDB Endow., 8:209–220, 2014.
[38] X. Yu and S. Devadas. TARDIS: timestamp based coherence algorithm for distributed shared memory. In International Conference on Parallel Architectures and Compilation Techniques, 2015.
[39] W. Zheng, S. Tu, E. Kohler, and B. Liskov. Fast databases with fast durability and recovery through multicore parallelism. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI '14, pages 465–477. USENIX Association, 2014.

APPENDIX: DYNAMIC TIMESTAMP ALLOCATION

As discussed in Section 2.2, the original DTA technique [4] was designed to resolve and detect deadlocks using dependency graph analysis. It was first used in deadlock detection and prevention for 2PL algorithms. In DTA, timestamps are associated with each transaction and are compared when two transactions conflict with each other. Timestamps are not associated with tuples in DTA like they are in TicToc.

DTA was then adapted to OCC algorithms [7] and was later studied in the context of real-time DBMSs [23, 25]. In these designs, DTA was used to detect conflicts and determine the relative ordering among transactions. Similar to TicToc, DTA OCC also reorders transactions in logical time, which may not agree with physical time. However, since timestamps are only associated with transactions but not tuples, the DBMS has to maintain a dependency graph of all active transactions. This dependency graph is updated whenever there is a conflicting access. All the DTA OCC algorithms proposed in the literature have this critical section, which becomes a serious performance bottleneck as thread count increases.

We modeled DTA OCC in our DBx1000 system and compared it against TicToc. Our implementation of DTA OCC is idealized since we modeled an empty critical section, whereas a full-blown DTA OCC requires considerable logic in the critical section. As a result, our implementation provides an upper bound on the performance of DTA OCC. For this experiment, we used the same medium-contention YCSB workload that we use in Section 6.3.2. The results in Fig. 12 show that, as expected, DTA OCC fails to scale beyond 20 cores due to this critical section bottleneck. This matches previous studies that have shown that algorithms with shared global resources are unable to scale beyond 16–32 active threads [37].

[Figure 12: Dynamic Timestamp Allocation – Performance comparison of the DTA OCC and TicToc algorithms for the YCSB workload with medium contention (throughput in million txn/s vs. thread count, up to 80 threads).]
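The serialization point described above can be made concrete with a toy model. This is our own illustration, not an implementation of any published DTA variant: a single lock protects the global dependency graph, so conflict bookkeeping from every worker thread funnels through one critical section.

```python
import threading

class DTADependencyGraph:
    """Toy model of DTA OCC's shared structure: one global lock guards
    the dependency graph, so all conflict bookkeeping serializes here."""
    def __init__(self):
        self._lock = threading.Lock()
        self._edges = set()   # (txn, conflicting_txn) pairs

    def record_conflict(self, t1, t2):
        with self._lock:      # the global critical section
            self._edges.add((t1, t2))

    def edge_count(self):
        with self._lock:
            return len(self._edges)
```

Even with an empty critical section, the experiment above shows this shared structure caps scalability near 20 cores; TicToc avoids it entirely because its timestamps live on individual tuples rather than in a shared graph.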