High-Performance Concurrency Control Mechanisms

A database system optimized for in-memory storage can support much higher transaction rates than current systems. However, standard concurrency control methods used today do not scale to the high transaction rates achievable by such systems. In this paper we introduce two efficient concurrency control methods specifically designed for main-memory databases.

1. High-Performance Concurrency Control Mechanisms for Main-Memory Databases Per-Åke Larson1, Spyros Blanas2, Cristian Diaconu1, Craig Freedman1, Jignesh M. Patel2, Mike Zwilling1 1 2 Microsoft , University of Wisconsin – Madison {palarson, cdiaconu, craigfr,mikezw}@microsoft.com, {sblanas, jignesh}@cs.wisc.edu ABSTRACT found that traditional single-version locking is “fragile”. It works A database system optimized for in-memory storage can support well when all transactions are short and there are no hotspots but much higher transaction rates than current systems. However, performance degrades rapidly under high contention or when the standard concurrency control methods used today do not scale to workload includes even a single long transaction. the high transaction rates achievable by such systems. In this pa- Decades of research has shown that multiversion concurrency per we introduce two efficient concurrency control methods spe- control (MVCC) methods are more robust and perform well for a cifically designed for main-memory databases. Both use multiver- broad range of workloads. This led us to investigate how to con- sioning to isolate read-only transactions from updates but differ in struct MVCC mechanisms optimized for main memory settings. how atomicity is ensured: one is optimistic and one is pessimistic. We designed two MVCC mechanisms: the first is optimistic and To avoid expensive context switching, transactions never block relies on validation, while the second one is pessimistic and relies during normal processing but they may have to wait before com- on locking. The two schemes are mutually compatible in the sense mit to ensure correct serialization ordering. We also implemented that optimistic and pessimistic transactions can be mixed and a main-memory optimized version of single-version locking. Ex- access the same database concurrently. We systematically ex- perimental results show that while single-version locking works plored and evaluated these methods, providing an extensive ex- well when transactions are short and contention is low perfor- perimental evaluation of the pros and cons of each approach. The mance degrades under more demanding conditions. The multiver- experiments confirmed that MVCC methods are indeed more sion schemes have higher overhead but are much less sensitive to robust than single-version locking. hotspots and the presence of long-running transactions. This paper makes three contributions. First, we propose an opti- 1. INTRODUCTION mistic MVCC method designed specifically for memory resident data. Second, we redesign two locking-based concurrency control Current database management systems were designed assuming methods, one single-version and one multiversion, to fully exploit that data would reside on disk. However, memory prices continue a main-memory setting. Third, we evaluate the effectiveness of to decline; over the last 30 years they have been dropping by a these three different concurrency control methods for different factor of 10 every 5 years. The latest Oracle Exadata X2-8 system workloads. The insights from this study are directly applicable to ships with 2TB of main memory and it is likely that we will see high-performance main memory databases: single-version locking commodity servers with multiple terabytes of main memory with- performs well only when transactions are short and contention is in a few years. On such systems the majority of OLTP databases low; higher contention or workloads including some long transac- will fit entirely in memory, and even the largest OLTP databases tions favor the multiversion methods; and the optimistic method will keep the active working set in memory, leaving only cold, performs better than the pessimistic method. infrequently accessed data on external storage. The rest of the paper is organized as follows. Section 2 covers A DBMS optimized for in-memory storage and running on a preliminaries of multiversioning and describes how version visi- many-core processor can support very high transaction rates. bility and updatability are determined based on version Efficiently ensuring isolation between concurrently executing timestamps. The optimistic scheme and the pessimistic scheme transactions becomes challenging in such an environment. Current are described in Section 3 and Section 4, respectively. Section 5 DBMSs typically rely on locking but in a traditional implementa- reports performance results. Related work is discussed in Section tion with a separate lock manager the lock manager becomes a 6, and Section 7 offers concluding remarks. Proofs of correctness bottleneck at high transaction rates as shown in experiments by are provided in an online addendum to this paper and at [27]. Johnson et al [15]. Long read-only transactions are also problem- atic as readers may block writers. 2. MV STORAGE ENGINE This paper investigates high-performance concurrency control A transaction is by definition serializable if its reads and writes mechanisms for OLTP workloads in main-memory databases. We logically occur as of the same time. The simplest and most widely used MVCC method is snapshot isolation (SI). SI does not guar- Permission to make digital or hard copies of all or part of this work for antee serializability because reads and writes logically occur at personal or classroom use is granted without fee provided that copies are different times, reads at the beginning of the transaction and not made or distributed for profit or commercial advantage and that copies writes at the end. However, a transaction is serializable if we can bear this notice and the full citation on the first page. To copy otherwise, to guarantee that it would see exactly the same data if all its reads republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present were repeated at the end of the transaction. their results at The 38th International Conference on Very Large Data Bases, To ensure that a transaction T is serializable we must guarantee August 27th - 31st 2012, Istanbul, Turkey. that the following two properties hold: Proceedings of the VLDB Endowment, Vol. 5, No. 4 Copyright 2011 VLDB Endowment 2150-8097/11/12... $ 10.00. 298

2.1. Read stability. If T reads some version V1 of a record during Hash bucket J contains four records: three versions for John and its processing, we must guarantee that V1 is still the version one version for Jane. The order in which the records are linked visible to T as of the end of the transaction, that is, V1 has not together is immaterial. Jane’s single version (Jane, 150) has a been replaced by another committed version V2. This can be valid time from 15 to infinity meaning that it was created by a implemented either by read locking V1 to prevent updates or transaction that committed at time 15 and it is still valid. John’s by validating that V1 has not been updated before commit. oldest version (John, 100) was valid from time 10 to time 20 when This ensures that nothing has disappeared from the view. it was updated. The update created a new version (John, 110) that 2. Phantom avoidance. We must also guarantee that the transac- initially had a valid time of 20 to infinity. We will discuss John’s tion’s scans would not return additional new versions. This can last version (John, 130) in a moment. be implemented in two ways: by locking the scanned part of an index/table or by rescanning to check for new versions before 2.2 Reads commit. This ensures that nothing has been added to the view. Every read specifies a logical (as-of) read time and only versions whose valid time overlaps the read time are visible to the read; all Lower isolation levels are easier to support. other versions are ignored. Different versions of a record have x For repeatable read, we only need to guarantee read stability. non-overlapping valid times so at most one version of a record is x For read committed, no locking or validation is required; visible to a read. A lookup for John, for example, would be han- always read the latest committed version. dled by a scan of bucket J that checks every version in the bucket x For snapshot isolation, no locking or validation is required; but returns only the one whose valid time overlaps the read time. always read as of the beginning of the transaction. We have implemented a prototype main-memory storage engine. 2.3 Updates We begin with a high-level overview of how data is stored, how Bucket L contains two records which both belong to Larry. Trans- reads and updates are handled, and how it is determined what action 75 is in the process of transferring $20 from Larry’s ac- versions are visible to a reader, and that a version can be updated. count to John’s account. It has created the new versions for Larry (Larry, 150) and for John (John, 130) and inserted them into the Record format appropriate buckets in the index. Header Payload Note that transaction 75 has stored its transaction ID in the Begin Begin End Name Amount Hash ptr and End fields of the new and old versions, respectively. (One bit in the field indicates the field’s current content.) A transaction ID Hash index stored in the End field serves as a write lock and prevents other on Name 10 20 John 100 transactions from updating the same version and it identifies which transaction has updated it. A transaction Id stored in the 15 inf Jane 150 Begin field informs readers that the version may not yet be com- J mitted and it identifies which transaction owns the version. 20 Tx75 John 110 Old 100 Now suppose transaction 75 commits with end timestamp 100. It then returns to the old and new versions and sets the Begin and Tx75 Inf John 130 New End fields, respectively, to 100. The final values are shown in red 100 below the old and new versions. The old version (John, 110) now L 30 Tx75 Larry 170 Old has the valid time 20 to 100 and the new version (John, 130) has a 100 valid time from 100 to infinity. Tx75 inf Larry 150 New Every update creates a new version so we need to discard old 100 versions that are no longer needed to avoid filling up memory. A version can be discarded when it is no longer visible to any trans- Figure 1: Example account table with one hash index. action. In our prototype, once a version has been identified as Transaction 75 has transferred $20 from Larry’s account to John’s account but has not yet committed. garbage, collection is handled cooperatively by all threads. Alt- hough garbage collection is efficient and fully parallelizable, 2.1 Storage and Indexing keeping track of the old versions does consume some processor Our prototype currently supports only hash indexes which are resources. Details of our garbage collection algorithm are beyond implemented using lock-free hash tables. A table can have many the scope of this paper. indexes, and records are always accessed via an index lookup - there is no direct access to a record without going through an in- 2.4 Transaction Phases dex. (To scan a table, one simply scans all buckets of any index A transaction can be in one of four states: Active, Preparing, on the table.) The techniques presented here are general and can Committed, or Aborted. Fig. 2 shows the possible transitions be- be applied to ordered indexes implemented by trees or skip lists. tween these states. Figure 1 shows a simple bank account table containing six ver- Log updates and wait for I/O Committed sions. Ignore the numbers (100) in red for now. The table has two (user) columns: Name and Amount. Each version has a valid time Transaction gets Active Transaction gets Preparing begin timestamp end timestamp defined by timestamps stored in the Begin and End fields. A table can have several indexes, each one with a separate (hash) pointer Aborted field. The example table has one index on the Name column. For simplicity we assume that the hash function just picks the first Figure 2: State transitions for each transaction letter of the name. Versions that hash to the same bucket are linked together using the Hash ptr field. 299

3.A transaction goes through three different phases. We outline the Begin field contains a transaction ID processing in each phase only briefly here; it is fleshed out in Suppose transaction T reads version V and finds that V’s Begin more detail in connection with each concurrency control method field contains the ID of a transaction TB. Version V may still be (Sections 3 and 4). visible; it depends on transaction TB’s state and TB’s end 1. The transaction is created; it acquires a begin timestamp and timestamp. TB may, for example, have committed already but not sets its state to Active. yet finalized the Begin fields of its new versions. If so, V is a 2. Normal processing phase. The transaction does all its normal committed version with a well-defined begin time. Table 1 sum- processing during this phase. A transaction never blocks during marizes the various cases that may occur and the action to take this phase. For update operations, the transaction copies its depending on the state of the transaction TB. transaction ID into the Begin field of the new versions and into Table 1: Case analysis of action to take when version V’s the End field of the old or deleted versions. If it aborts, it Begin field contains the ID of transaction TB changes its state to Aborted and skips directly to step 4. When the transaction has completed its normal processing and re- TB’s state TB’s end Action to take when transac- quests to commit, it acquires an end timestamp and switches to timestamp tion T checks visibility of ver- the Preparing state. sion V. 3. Preparation phase. During this phase the transaction deter- Active Not set V is visible only if TB=T and V’s mines whether it can commit or is forced to abort. If it has to end timestamp equals infinity. abort, it switches its state to Aborted and continues to the next phase. If it is ready to commit, it writes all its new versions and Preparing TS V’s begin timestamp will be TS information about deleted versions to a redo log record and but V is not yet committed. Use waits for the log record to reach non-volatile storage. The TS as V’s begin time when test- transaction then switches its state to Committed. ing visibility. If the test is true, 4. Postprocessing phase. If the transaction has committed, it allow T to speculatively read V. proceeds to replace its transaction ID with its end timestamp Committed TS V’s begin timestamp will be TS from the Begin field of the new versions and from the End and V is committed. Use TS as field of the old or deleted versions. If the transaction has abort- V’s begin time to test visibility. ed, it marks all its new versions as garbage and sets their Begin and End timestamps to infinity to make them invisible. Aborted Irrelevant Ignore V; it’s a garbage version. 5. The transaction is terminated. The old versions are assigned to Terminated Irrelevant Reread V’s Begin field. TB has the garbage collector, which is responsible for discarding them or not terminated so it must have final- when they are no longer needed. found ized the timestamp. Timestamps are drawn from a global, monotonically increasing If TB is still in the Active state, the version is uncommitted and counter. A transaction’s gets a unique timestamp by atomically thus not visible to any other transactions than TB. If TB has up- reading and incrementing the counter. dated a record multiple times, only the latest version is visible to it. V is the latest version if its End field contains infinity. 2.5 Version Visibility Under multiversioning a read must specify a logical read time. If transaction TB is in the Preparing state, it has acquired an end Only versions whose valid time overlaps the logical read time are timestamp TS which will be V’s begin timestamp if TB commits. visible to the read. The read time can be any value between the A safe approach in this situation would be to have transaction T transaction’s begin time and the current time. Which read time is wait until transaction TB commits. However, we want to avoid all chosen depends on the concurrency control method used and the blocking during normal processing so instead we continue with transaction’s isolation level; more about this in Sections 3 and 4. the visibility test and, if the test returns true, allow T to specula- tively read V. Transaction T acquires a commit dependency on While determining the visibility of a version is straightforward in TB, restricting the serialization order of the two transactions. That principle, it is more complicated in practice as we do not want a is, T is allowed to commit only if TB commits. Commit depend- transaction to block (wait) during normal processing. Recall that a encies are discussed in more detail in Section 2.7. version’s Begin or End fields can temporarily store a transaction ID, if the version is being updated. If a reader encounters such a The last three cases in Table 1 are straightforward. version, determining visibility without blocking requires checking End field contains a transaction ID another transaction’s state and end timestamp and potentially even restricting the serialization order of transactions. Once it has been determined that version V’s valid time begins before transaction T’s read time RT, we proceed to check V’s End We now examine each case in turn, beginning from the easiest field. If it contains a timestamp, determining visibility is straight- and most common case where both fields contain a timestamp, forward: V is visible to T if and only if RT is less than the and then discussing the cases where the Begin or End fields con- timestamp. However, if the field contains the ID of transaction tain transaction IDs. TE, we have to check the state and end timestamp of TE. Table 2 Begin and End fields contain timestamps summarizes the various cases and the actions to take, assuming that we have already determined that V’s begin timestamp is, or Let RT denote the logical read time being used by a transaction T. will be, less than RT. To determine whether a version V is visible to T, we check V’s Begin and End fields. If both fields contain timestamps, we simp- The first case (TE is Active) was discussed earlier. If TE’s state is ly check whether RT falls between the two timestamps. If it does, Preparing, it has an end timestamp TS that will become the end V is visible to T, otherwise not. timestamp of V if TE does commit. If TS is greater than the read 300

4.time RT, it is obvious that V will be visible if TE commits. If TE tion has sneaked in and updated V before T managed to install its aborts, V will still be visible, because any transaction that updates update. If the store succeeds, T then connects the new version into V after TE has aborted will obtain an end timestamp greater than all indexes it participates in. T also saves pointers to the old and TS. If TS is less than RT, we have a more complicated situation: new versions; they will be needed during postprocessing. if TE commits, V will not be visible to T but if TE aborts, it will be visible. We could handle this by forcing T to wait until TE 2.7 Commit Dependencies commits or aborts but we want to avoid all blocking during nor- A transaction T1 has a commit dependency on another transaction mal processing. Instead we allow T to speculatively ignore V and T2, if T1 is allowed to commit only if T2 commits. If T2 aborts, proceed with its processing. Transaction T acquires a commit T1 must also abort, so cascading aborts are possible. T1 acquires a dependency (see Section 2.7) on TE, that is, T is allowed to com- commit dependency either by speculatively reading or specula- mit only if TE commits. tively ignoring a version, instead of waiting for T2 to commit.. Table 2: Case analysis of action to take when V's End field We implement commit dependencies by a register-and-report contains a transaction ID TE. approach: T1 registers its dependency with T2 and T2 informs T1 when it has committed or aborted. Each transaction T contains a TE’s state TE’s end Action to take when transaction counter, CommitDepCounter, that counts how many unresolved timestamp T checks visibility of a version V commit dependencies it still has. A transaction cannot commit as of read time RT. until this counter is zero. In addition, T has a Boolean variable AbortNow that other transactions can set to tell T to abort. Each Active Not set V is visible only if V was created transaction T also has a set, CommitDepSet, that stores transac- by transaction T and V’s end tion IDs of the transactions that depend on T. timestamp equals infinity. To take a commit dependency on a transaction T2, T1 increments Preparing TS V’s end timestamp will be TS pro- its CommitDepCounter and adds its transaction ID to T2’s Com- vided that TE commits. If TS > mitDepSet. When T2 has committed, it locates each transaction in RT, V is visible to T. If TS < RT, T its CommitDepSet and decrements their CommitDepCounter. If speculatively ignores V. T2 aborted, it tells the dependent transactions to also abort by Committed TS V’s end timestamp will be TS and setting their AbortNow flags. If a dependent transaction is not V is committed. Use TS as V’s end found, this means that it has already aborted. timestamp when testing visibility Note that a transaction with commit dependencies may not have to Aborted Irrelevant V is visible. wait at all - the dependencies may have been resolved before it is ready to commit. Commit dependencies consolidate all waits into Terminated Irrelevant Reread V’s End field. TE has ter- a single wait and postpone the wait to just before commit. or not found minated so it must have finalized the timestamp. Some transactions may have to wait before commit. Waiting rais- es a concern of deadlocks. However, deadlocks cannot occur be- The case when TE’s state is Committed is obvious but the Abort- cause an older transaction never waits on a younger transaction. In ed case warrants some explanation. If TE has aborted, some other a wait-for graph the direction of edges would always be from a transaction TO may have sneaked in after T read V’s End field, younger transaction (higher end timestamp) to an older transaction discovered that TE has aborted and updated V. However, TO must (lower end timestamp) so cycles are impossible. have updated V’s end field after T read it and TO must have been in the Active state. This implies that TO’s end timestamp was 3. OPTIMISTIC TRANSACTIONS assigned after T read V and thus it must be later than T’s logical This section describes in more detail the processing performed in read time, It follows that it doesn’t matter if a transaction TO the different phases for optimistic transactions. We first consider sneaked in; V is always visible to T if TE is in the Aborted state. serializable transactions and then discuss lower isolation levels. If TE has terminated or is not found, TE must have either commit- The original paper by Kung and Robinson [17] introduced two ted or aborted and finalized V’s end timestamp since we read the validation methods: backward validation and forward validation. field. So we reread the End field and try again. We use backward validation but optimize it for in-memory stor- age. Instead of validating a read set against the write sets of all 2.6 Updating a Version other transactions, we simply check whether a version that was Suppose transaction T wants to update a version V. V is updatable read is still visible as of the end of the transaction. A separate only if it is the latest version, that is, it has an end timestamp equal write phase is not needed; a transaction’s updates become visible to infinity or its End field contains the ID of a transaction TE and to other transactions when the transaction changes its state to TE’s state is Aborted. If the state of transaction TE is Active or Committed. Preparing, V is the latest committed version but there is a later uncommitted version. This is a write-write conflict. We follow the A serializable optimistic transaction keeps track of its reads, scans first-writer-wins rule and force transaction T to abort. and writes. To this end, a transaction object contains three sets: Suppose T finds that V is updatable. It then creates the new ver- 1. ReadSet contains pointers to every version read; sion and proceeds to install it in the database. The first step is to 2. ScanSet stores information needed to repeat every scan; atomically store T’s transaction ID in V’s End field to prevent 3. WriteSet contains pointers to versions updated (old and other transactions from updating V. If the store fails because the new), versions deleted (old) and versions inserted (new). End field has changed, T must abort because some other transac- 301

5. Validation outcome 3.1 Normal Processing Phase Normal processing consists of scanning indexes (see Section 2.1) Reads Phantoms to locate versions to read, update, or delete. Insertion of an entire- V1 Pass Pass ly new record or updating an existing record creates a new version V2 Fail Pass that is added to all indexes for records of that type. V3 NA Pass To do an index scan, a transaction T specifies an index I, a predi- cate ܲ, and a logical read time RT. The predicate is a conjunction V4 NA Fail ܲ ൌ ܲ௦ ‫ܲ ר‬௥ where ܲ௦ is a search predicate that determines what part of the index to scan and ܲ௥ is an optional residual predicate. T’s lifetime For a hash index, ܲ௦ is an equality predicate on columns of the hash key. For an ordered index, ܲ௦ is a range predicate on the Figure 3: Possible validation outcomes. ordering key of the index. When transaction T reaches the end of normal processing, it pre- We now outline the processing during a scan when T runs at seri- commits and begins its preparation phase. Precommit simply con- alizable isolation level. All reads specify T’s begin time as the sists of acquiring the transaction’s end timestamp and setting the logical read time. transaction state to Preparing. Start scan. When a scan starts, it is registered in T’s ScanSet so T 3.2 Preparation Phase can check for phantoms during validation. Sufficient information The preparation phase of an optimistic transaction consists of must be recorded so that the scan can be repeated. This includes at three steps: read validation, waiting for commit dependencies, and least the following: index I and predicates ܲ௦ and ܲ௥ . Exactly how logging. We discuss each step in turn. the predicates are specified depends on the implementation. Validation consists of two steps: checking visibility of the ver- Check predicate. If a version V doesn’t satisfy P, it is ignored sions read and checking for phantoms. To check visibility transac- and the scan proceeds. If the scan is a range scan and the index tion T scans its ReadSet and for each version read, checks whether key exceeds the upper bound of the range, the scan is terminated. the version is still visible as of the end of the transaction. To Check visibility. Next we check whether version V is visible to check for phantoms, T walks its ScanSet and repeats each scan transaction T as of time RT (section 2.5). The result of the test looking for versions that came into existence during T’s lifetime may be conditional on another transaction T2 committing. If so, T and are visible as of the end of the transaction. (T may acquire registers a commit dependency with T2 (section 2.7). If the visi- additional commit dependencies during validation but only if it bility test returns false, the scan proceeds to the next version. speculatively ignores a version.) Read version. If T intends to just read V, no further checks are Figure 3 illustrates the different cases that can occur. It shows the required. T records the read by adding a pointer to V to its lifetime of a transaction T, the valid times of versions V1 to V4 of ReadSet. The pointer will be used during validation. four different records, and the expected outcome of read valida- tion and phantom detection. We assume that all four versions Check updatability. If transaction T intends to update or delete satisfy the search predicate used by T and that they were all creat- V, we must check whether the version is updatable. A visible ed and terminated by transactions other than T. version is updatable if its End field equals infinity or it contains a transaction ID and the referenced transaction has aborted. Note Version V1 is visible to T both at its start and end. If V1 is includ- that speculative updates are allowed, that is, an uncommitted ver- ed in T’s ReadSet, it passes read validation and also phantom sion can be updated but the transaction that created it must have detection. completed normal processing. Version V2 is visible to T as of its start timestamp but not at the Update version. To update V, T first creates a new version VN end of the transaction. If V2 is included in T’s ReadSet, it fails and then atomically sets V’s End field to T’s transaction ID. This read validation. V2 is not a phantom. fails if some other transaction T2 has already set V’s End field. Version V3 both began and ended during T’s lifetime, so it is not This is a write-write conflict and T must abort. visible to T at the start or at the end of the transaction. It is not If T succeeds in setting V’s End field (the most likely outcome), included in T’s ReadSet so it won’t be subject to read validation. this serves as an exclusive write lock on V because it prevents V3 is not visible at the end of T, so V3 is not a phantom. further updates of V. T records the update by adding two pointers Version V4 was created during T’s lifetime and is visible at the to its WriteSet: a pointer to V (old version) and a pointer to VN end of T, so V4 is a phantom. It is not included in T’s ReadSet (new version). These pointers are used later for multiple purposes: because it was not visible as of T’s start time. for logging new versions during commit, for postprocessing after If T fails validation, it is not serializable and must abort. If T commit or abort, and for locating old versions when they are no passes validation, it must wait for outstanding commit dependen- longer needed and can be garbage collected. cies to be resolved, if it has any. More specifically, T can proceed The new version VN is not visible to any other transaction until T to the postprocessing phase if either its CommitDepCounter is precommits, therefore T can proceed to include VN in all indexes zero or its AbortNow flag is set. that the table participates in. To complete the commit, T scans its WriteSet and writes the new Delete version. A delete is an update of V that doesn’t create a versions it created to a persistent log. Commit ordering is deter- new version. The end timestamp of V is first checked and then set mined by transaction end timestamps, which are included in the in the same way as for updates. If this succeeds, a pointer to V log records, so multiple log streams on different devices can be (old version) is added to the write set and the delete is complete. used. Deletes are logged by writing a unique key or, in the worst 302

6.case, all columns of the deleted version. After the log writes have 4.1 Lock Types been completed, T sets its transaction state to Committed, thereby We use two types of locks: record locks and bucket locks. Record signaling that the commit is done. locks are placed on versions to ensure read stability. Bucket locks 3.3 Postprocessing are placed on (hash) buckets to prevent phantoms. The name re- During this phase a committed transaction TC propagates its end flects their use for hash indexes in our prototype but range locks timestamp to the Begin and End fields of new and old versions, for ordered indexes can be implemented in the same way. respectively, listed in its WriteSet. An aborted transaction TA sets 4.1.1 Record Locks the Begin field of its new versions to infinity, thereby making Updates or deletes can only be applied to the latest version of a them invisible to all transactions, and attempts to reset the End record; older versions cannot be further updated. Thus, locks are fields of its old versions to infinity. However, another transaction required only for the latest version of a record; never for older may already have detected the abort, created another new version versions. So what’s needed is an efficient many-readers-single- and reset the End field of the old version. If so, TA leaves the End writer lock for this case. field value unchanged. We do not want to store record locks in a separate table – it’s too The transaction then processes all outgoing commit dependencies slow. Instead we embed record locks in the End field of versions listed in its CommitDepSet. If it aborted, it forces the dependent so no extra space is required. In our prototype, the End field of a transactions to also abort by setting their AbortNow flags. If it version is 64 bits. As described earlier, this field stores either a committed, it decrements the target transaction’s CommitDep- timestamp or a transaction ID with one bit indicating what the Counter and wakes up the transaction if the count becomes zero. field contains. We change how we use this field to make room for Once postprocessing is done, other transactions no longer need to a record lock. refer to the transaction object. It can be removed from the transac- 1. ContentType (1 bit): indicates the content type of the remain- tion table but it will not be discarded entirely; the pointers to old ing 63 bits versions in its WriteSet are needed for garbage collection. 2. Timestamp (63 bits): when ContentType is zero. 3. RecordLock (63 bits): when ContentType is one. 3.4 Lower Isolation Levels 3.1. NoMoreReadLocks (1 bit): a flag set when no further Enforcing lower isolation levels is cheaper. A transaction requir- read locks are accepted. Used to prevent starvation. ing a higher isolation level bears the full cost of enforcing the 3.2. ReadLockCount (8 bits): number of read locks. higher level and does not penalize “innocent bystanders”. 3.3. WriteLock (54 bits): ID of the transaction holding a Repeatable read: Repeatable read is required to enforce read write lock on this version or infinity (max value). stability but not to prevent phantoms. We implement repeatable read simply by validating a transaction’s ReadSet before commit. We do not explicitly keep track of which transactions have a ver- As phantom detection is not required, a transaction’s scans are not sion read locked. Each transaction records its ReadSet so we can recorded. Transaction begin time is used as the logical read time. find out by checking the ReadSets of all current transactions. This is only needed for deadlock detection which occurs infrequently. Read committed: Read committed only guarantees that all ver- sions read are committed. We implement read committed by al- A transaction acquires a read lock on a version V by atomically ways using the current time as the logical read time. No validation incrementing V’s ReadLockCount. No further read locks can be is required so a transaction’s reads and scans are not recorded. acquired if the counter has reached its max value (255) or the NoMoreReadlocks flag is set. If so, the transaction aborts. Snapshot isolation: Implementing snapshot isolation is straight- forward in our case: always read as of the beginning of the trans- A transaction write locks a version V by atomically copying its action. No validation is needed so scans and reads are not tracked. transaction ID into the WriteLock field. This action both write locks the version and identifies who holds the write lock. Read-only transactions: If a transaction is known to be read- only, the best performance is obtained by running it under snap- 4.1.2 Bucket Locks (Range Locks) shot isolation or read committed depending on whether it needs a Bucket locks are used only by serializable transactions to prevent transaction-consistent view or not. phantoms. When a transaction TS begins a scan of a hash bucket, it locks the bucket. Multiple transactions can have a bucket 4. PESSIMISTIC TRANSACTIONS locked. A bucket lock consists of the following fields. A pessimistic transaction prevents its reads from being invalidated by acquiring read locks. This section describes our design of mul- 1. LockCount: number of locks on this bucket. tiversion locking optimized for main-memory databases 2. LockList: list of (serializable) transactions holding a lock on this bucket. A serializable pessimistic transaction must keep track of which versions it read, which hash buckets it scanned, and its new and The current implementation stores the LockCount in the hash old versions. To this end, the transaction maintains three sets: bucket to be able to check quickly whether the bucket is locked. LockLists are implemented as arrays stored in a separate hash 1. ReadSet contains pointers to versions read locked by the table with the bucket address as the key. transaction; 2. BucketLockSet contains pointers to hash buckets visited and To acquire a lock on a bucket B, a transaction TS increments B’s locked by the transaction; LockCount, locates B’s LockList, and adds its transaction Id to 3. WriteSet contains references to versions updated (old and the list. To release the lock it deletes its transaction ID from B’s new), versions deleted (old) and versions inserted (new). LockList and decrements the LockCount. Range locks in an ordered index can be implemented in the same way. If the index is implemented by a tree structure, a lock on a 303

7.node locks the subtree rooted at that node. If the index is imple- incrementing V’s ReadLockCount. If V’s NoMoreReadLocks flag mented by skip lists, locking a tower locks the range from that is set or ReadLockCount is at max already, lock acquisition fails tower to the next tower of the same height. and TR aborts. Otherwise, if V is not write locked or V’s Read- LockCount is greater than zero, TR increments V’s Read- 4.2 Eager Updates, Wait-For Dependencies LockCount and proceeds. However, if V is write locked by a In a traditional implementation of multiversion locking, an update transaction TU and this is the first read lock on V (V’s Read- transaction TU would block if it attempts to update or delete a LockCount is zero), TR must force TU to wait on V. TR checks read locked version or attempts to insert or update a version in a TU’s NoMoreWaitFors flag. If it is set, TU cannot install the locked bucket. This may lead to frequent blocking and thread wait-for dependency and aborts. Otherwise everything is in order switching. A thread switch is expensive, costing several thousand and TR acquires the read lock by incrementing Vs’ Read- instructions. In a main-memory system, just a few thread switches LockCounter and installs the wait-for dependency by increment- can add significantly to the cost of executing a transaction. ing TU’s WaitForCounter. To avoid blocking we allow a transaction TU to eagerly update or When a transaction TR releases a read lock on a version V, it may delete a read locked version V but, to ensure correction serializa- also need to release a wait-for dependency. If V is not write tion order, TU cannot precommit until all read locks on V have locked, TR simply decrements V’s ReadLockCounter and pro- been released. Similarly, a transaction TR can acquire a read lock ceeds. The same applies if V is write locked and V’s Read- on a version that is already write locked by another transaction LockCounter is greater than one. However, if V is write locked by TU. If so, TU cannot precommit until TR has released its lock. a transaction TU and V’s ReadLockCounter is one, TR is about to Note that an eager update or delete is not speculative because it release the last read lock on V and therefore must also release doesn’t matter whether TR commits or aborts; it just has to com- TU’s wait-for dependency on V. TR atomically sets V’s Read- plete and release its read lock. LockCounter to zero and V’s NoMoreReadLocks to true. If this succeeds, TR locates TU and decrements TU’s WaitForCounter. The same applies to locked buckets. Suppose a bucket B is locked by two (serializable) transactions TS1 and TS2. An update Setting the NoMoreReadLocks flag before releasing the wait-for transaction TU is allowed to insert a new version into B but it is dependency is necessary because this may be TU’s last wait-for not allowed to precommit before TS1 and TS2 have completed dependency. If so, TU is free to acquire an end timestamp and and released their bucket locks. being its commit processing. In that case, TU’s commit cannot be further postponed by taking out a read lock on V. In other words, We enforce correct serialization order by wait-for dependencies. further read locks on V would have no effect. A wait-for dependency forces an update transaction TU to wait before it can acquire an end timestamp and begin commit pro- 4.2.2 Bucket Lock Dependencies. cessing. There are two flavors of wait-for dependencies, read A serializable transaction TS acquires a lock on a bucket B by lock dependencies and bucket lock dependencies that differ in incrementing B’s LockCounter and adding its transaction ID to what event they wait on. B’s LockList. The purpose of TR’s bucket lock is not to disallow A transaction T needs to keep track of both incoming and out- new versions from being added to B but to prevent them from going wait-for dependencies. T has an incoming dependency if it becoming visible to TR. That is, another transaction TU can add a waits on some other transaction and an outgoing dependency if version to B but, if it does, then TU cannot precommit until TS some other transaction waits on it. To track wait-for dependen- has completed its processing and released its lock on B. This is cies, the following fields are included in each transaction object. enforced by TU obtaining a wait-for dependency on TS. For incoming wait-for dependencies: TU can acquire this type of dependency either by acquiring one x WaitForCounter: indicates how many incoming dependen- itself or by having one imposed by TS. We discuss each case. cies the transaction is waiting for. Suppose that, as a result of an update or insert, TU is about to add x NoMoreWaitFors: when set the transaction does not allow a new version V to a bucket B. TU checks whether B has any additional incoming dependencies. Used to prevent starva- bucket locks. If it does, TU takes out a wait-for dependency on tion by incoming dependencies continuously being added. every transaction TS in B’s LockList by adding its own transac- For outgoing wait-for dependencies: tion ID to TS’s WaitForList and incrementing its own WaitFor- Counter. If TU’s NoMoreWaitFors flag is set, TU can’t take out x WaitingTxnList: IDs of transactions waiting on this transac- the dependency and aborts. tion to complete. Suppose a serializable transaction TS is scanning a bucket B and 4.2.1 Read Lock Dependencies. encounters a version V that satisfies TS’s search predicate but the A transaction TU that updated or deleted a version V has a wait- version is not visible to TS, that is, V is write locked by a transac- for dependency on V as long as V is read locked. TU is not al- tion TU that is still active. If TU commits before TS, V becomes lowed to acquire an end timestamp and begin commit processing a phantom to TS. To prevent this from happening, TS registers a unless V’s ReadLockCount is zero. wait-for dependency on TU’s behalf by adding TU’s transaction When a transaction TU updates or deletes a version V, it acquires ID to its own WaitingTxnList and incrementing TU’s WaitFor- a write lock on V by copying its transaction ID into the WriteLock Counter. If TU’s NoMoreWaitFors flag is set, TS can’t impose the field. If V’s ReadLockCount is greater than zero, TU takes a wait- dependency and aborts. for dependency on V simply by incrementing its WaitForCounter. When a serializable transaction TS has precommitted and ac- TU may also acquire a wait-for dependency on V by another quired its end timestamp, it releases its outgoing wait-for depend- transaction TR taking a read lock on V. A transaction TR that encies. It scans its WatingTxnList and, for each transaction T wants to read a version V must first acquire a read lock on V by found, decrements T’s WaitForCounter. 304

8.4.3 Processing Phases 4.4 Deadlock Detection This section describes how locking affects the processing done in Commit dependencies are only taken on transactions that have the different phases of a transaction. already precommitted and are completing validation. As discussed earlier (Section 2.7) commit dependencies cannot cause or be 4.3.1 Normal Processing Phase involved in a deadlock. Recall that normal processing consists of scanning indexes to select record versions to read, update, or delete. An insertion or Wait-for dependencies, however, can cause deadlocks. To detect update creates a new version that has to be added to all indexes deadlocks we build a standard wait-for graph by analyzing the for records of that type. wait-for dependencies of all transactions that are currently blocked. Once the wait-for graph has been built, any algorithm for We now outline what a pessimistic transaction T does differently finding cycles in graphs can be used. Our prototype uses Tarjan’s than an optimistic transaction during a scan and how this depends algorithm [25] for finding strongly connected components. on T’s isolation level. For snapshot isolation, the logical read time is always the transaction begin time. For all other isolation levels, A wait-for graph is a directed graph with transactions as nodes it is the current time which has the effect that the read sees the and waits-for relationships as edges. There is an edge from trans- latest version of a record. action T2 to transaction T1 if T2 is waiting for T1 to complete. The graph is constructed in three steps. Start scan. If T is a serializable transaction, it takes out a bucket lock on B to prevent phantoms and records the lock in its Bucket- 1. Create nodes. Scan the transaction table and for each transac- LockSet. Other isolation levels do not take out a bucket lock. tion T found, create a node N(T) if T has completed its normal processing and is blocked because of wait-for dependencies Check predicate. Same as for optimistic transactions. 2. Create edges from explicit dependencies. Wait-for depend- Check visibility. This is done in the same way as for optimistic encies caused by bucket locks are represented explicitly in transaction, including taking out commit dependencies as needed. WaitingTxnLists. For each transaction T1 in the graph and If a version V is not visible, it is ignored and the scan continues each transaction T2 in T1’s WaitingTxnList, add an edge from for all isolations levels except serializable. If T is serializable and T2 to T1. V is write locked by a transaction TU that is still active, V is a 3. Create edges from implicit dependencies. A wait-for de- potential phantom so T forces TU to delay its precommit by im- pendency on a read-locked version V is an implicit representa- posing a wait-for dependency on TU. tion of wait-for dependencies on all transaction holding read Read version. If T runs under serializable or repeatable read and locks on V. We can find out which transactions hold read V is a latest version, T attempts to acquire a read lock on V. If T locks on V by checking transaction read sets. Edges from im- can’t acquire the read lock, it aborts. If T runs under a lower iso- plicit dependencies can be constructed as follows. For each lation level or V is not a latest version, no read lock is required. transaction T1 in the graph and each version V in T1’s Read- LockSet: if V is write locked by a transaction T2 and T2 is in Check updatability. The same as for optimistic transactions. the graph, add an edge from T2 to T1. Update version. As for optimistic transactions, T creates a new While the graph is being constructed normal processing continues version N, sets V’s WriteLock and, if V was read locked, takes so wait-for dependencies may be created and resolved and trans- out a wait-for dependency on V by incrementing its own Wait- actions may become blocked or unblocked. Hence, the final graph ForCounter. T then proceeds to add N to all indexes it participates obtained may be imprecise, that is, it may differ from the graph in. If T adds N to a locked index bucket B, it takes out wait-for that would be obtained if normal processing stopped. But this dependencies on all (serializable) transactions holding locks on B. doesn’t matter because if there truly is a deadlock neither the Delete version. A delete is essentially an update of V that doesn’t nodes nor the edges involved in the deadlock can disappear while create a new version. T sets V’s WriteLock and if V was read the graph is being constructed. There is a small chance of detect- locked, takes out a wait-for dependency on V by incrementing its ing a false deadlock but this is handled by verifying that the trans- own WaitForCounter. actions participating in the deadlock are still blocked and the edg- When transaction T reaches the end of normal processing, it re- es are still covered by unresolved wait-for dependencies. leases its read locks and its bucket locks, if any. If it has outstand- 4.5 Peaceful Coexistence ing wait-for dependencies (its WaitForCounter is greater than An interesting property of our design is that optimistic and pessi- zero), it waits. Once its WaitForCounter is zero, T precommits, mistic transactions can be mixed. The change required to allow that is, acquires an end timestamp and sets its state to Validating. optimistic transactions to coexist with pessimistic transactions is 4.3.2 Preparation Phase straightforward: optimistic update transactions must honor read Pessimistic transactions require no validation – that’s taken care locks and bucket locks. Making an optimistic transaction T honor of by locks. However, a pessimistic transaction T may still have read locks and bucket locks requires the following changes: outstanding commit dependencies when reaching this point. If so, 1. When T write locks a version V, it uses only a 54-bit transac- T waits until they have been resolved and then proceeds to write tion ID and doesn’t overwrite read locks. to the log and commit. If a commit dependency fails, T aborts. 2. When T updates or deletes a version V that is read locked, it 4.3.3 Postprocessing Phase takes a wait-for dependency on V. 3. When T inserts a new version into a bucket B, it checks for Postprocessing is the same as for optimistic transactions. Note that bucket locks and takes out wait-for dependencies as needed. there is no need to explicitly release write locks; this is automati- cally done when the transaction updates Begin and End fields. 305

9.5. EXPERIMENTAL RESULTS the two MV schemes, MV/L has the biggest memory footprint, Our prototype storage engine implements both the optimistic and due to the additional overhead of maintaining a bucket lock table. the pessimistic scheme. We also implemented a single-version 5.1.1 Scalability (Read Committed) storage engine with locking for concurrency control. The imple- We first show how transaction throughput scales with increasing mentation is optimized for main-memory databases and does not multiprogramming level. For this experiment each transaction use a central lock manager, as this can become a bottleneck [19]. performs 10 reads and 2 writes (R=10 and W=2) against a table Instead, we embed a lock table in every index and assign each with N=10,000,000 rows. The isolation level is Read Committed. hash key to a lock in this partitioned lock table. A lock covers all records with the same hash key which automatically protects against phantoms. We use timeouts to detect and break deadlocks. The experiments were run on a two-socket Intel Xeon X5650 @ 2.67 GHz (Nehalem) that has six cores per socket. Hyper- Threading was enabled. The system has 48 GB of memory, 12 MB L3 cache per socket, 256 KB L2 cache per core, and two separate 32 KB L1-I and L1-D caches per core. This is a NUMA system and memory latency is asymmetric: accessing memory on the remote socket is approximately 30% slower than accessing local memory. We size hash tables appropriately so there are no collisions. The operating system is Windows Server 2008 R2. Each transaction generates log records but these are asynchro- nously written to durable storage; transactions do not wait for log I/O to complete. Asynchronous logging allows us to submit log Figure 4: Scalability under low contention records in batches (group commit), greatly reducing the number of I/O operations. The I/O bandwidth required is also moderate: Figure 4 plots transaction throughput (y-axis) as the multipro- Each update produces a log record that stores the difference be- gramming level increases (x-axis). Under low contention, tween the old and new versions, plus 8 bytes of metadata. Even throughput for all three schemes scales linearly up to six threads. with millions of updates per second, the I/O bandwidth required is After six threads, we see the effect of the higher access latency as within what even a single desktop hard drive can deliver. This the data is spread among two NUMA nodes, and beyond twelve choice allows us to focus on the effect of concurrency control. threads we see the effect of HyperThreading. Traditional disk-based transaction processing systems require For the 1V scheme, HyperThreading causes the speed-up rate to hundreds of concurrently active transactions to achieve maximum drop but the system still continues to scale linearly, reaching a throughput. This is to give the system useful work to do while maximum of over 2M transactions/sec. The multiversion schemes waiting for disk I/O. Our main-memory engine does not wait for have lower throughput because of the overhead of version man- disk I/O, so there is no need to overprovision threads. We ob- agement and garbage collection. Creating a new version for every served that the CPU is fully utilized when the multi-programming update and cleaning out old versions that are no longer needed is level equals the number of hardware threads; allowing more con- obviously more expensive than updating in place. current transactions causes throughput to drop. We therefore lim- Comparing the two multiversion schemes, MV/L has 30% lower ited the number of concurrently active transactions to be at most performance than MV/O. This is caused by extra writes for track- 24, which matches the number of threads our machine supports. ing dependencies and locks, which cause increased memory traf- We experiment with three CC schemes: the single-version locking fic. It takes MV/L 20% more cycles to execute the same number engine (labeled “1V”), the multi-version engine when transactions of instructions and the additional control logic translates into 10% run optimistically (“MV/O”) and the multi-version engine where more instructions per transaction. transactions run pessimistically (“MV/L”). 5.1.2 Scaling under Contention (Read Committed) Records that are updated very frequently (hotspots) pose a prob- 5.1 Homogeneous Workload lem for all CC schemes. In locking schemes, high contention We first show results from a parameterized artificial workload. By causes transactions to wait because of lock conflicts and dead- varying the workload parameters we can systematically explore locks. In optimistic schemes, hotspots result in validation failures how sensitive the different schemes are to workload characteris- and write-write conflicts, causing high abort rates and wasted tics. We focus on short update transactions which are common for work. At the hardware level, some data items are accessed so OLTP workloads. The workload consists of a single transaction frequently that they practically reside in the private L1 or L2 type that performs R reads and W writes against a table of N rec- caches of each core. This stresses the hardware to the limits, as it ords with a unique key. Each row is 24 bytes, and reads and writes triggers very high core-to-core traffic to keep the caches coherent. are uniformly and randomly scattered over the N records. We simulate a hotspot by running the same R=10 and W=2 trans- The memory footprint of the database differs for each concurren- action workload from Section 5.1.1 on a very small table with just cy control scheme. In 1V, each row consumes 24 bytes (payload) N=1,000 rows. Transactions run under Read Committed. Figure 5 plus 8 bytes for the “next” pointer of the hash table. Both MV shows the throughput under high contention. Even in this admit- schemes additionally use 16 bytes to store the Begin and End tedly extreme scenario, all schemes achieve a throughput of over a fields (cf. Figure 1), but the total consumption depends on the million transactions per second, with MV/O slightly ahead of both average number of versions required by the workload. Comparing locking schemes. 1V achieves its highest throughput at six threads, then drops and stays flat after 8 threads. 306

10. Figure 5: Scalability under high contention Figure 6: Impact of read-only transactions (low contention) 5.1.3 Higher IsolationLlevels The experiments in the previous section ran under Read Commit- ted isolation level, which is the default isolation level in many commercial database engines [23], as it prevents dirty reads and offers high concurrency. Higher isolation levels prevent more anomalies but reduce throughput. In this experiment, we use the same workload from Section 5.1.1, we fix the multiprogramming level to 24 and we change the isolation level. Read Repeatable Read Serializable Committed % drop % drop tx/sec tx/sec tx/sec vs RC vs RC Figure 7: Impact of read-only transactions (high contention) 1V 2,080,492 2,042,540 1.8% 2,042,571 1.8% 5.2.1 Impact of Short Read Transactions MV/L 974,512 963,042 1.2% 877,338 10.0% In this experiment we change the ratio of read and update transac- MV/O 1,387,140 1,272,289 8.3% 1,120,722 19.2% tions. There are two transaction types running under Read Com- mitted isolation: the update transaction performs 10 reads and 2 Table 3: Throughput at higher isolation levels, and writes (R=10 and W=2), while the read-only transaction performs percentage drop compared to Read Committed (RC) 10 reads (R=10 and W=0). In Table 3, we report the transaction throughput from each Figure 6 shows throughput (y-axis) as the ratio of read-only trans- scheme and isolation level. We also report the throughput drop as actions varies in the workload (x-axis) in a table with 10,000,000 a percentage of the throughput when running under the Read rows. The leftmost point (x=0%) reflects the performance of the Committed isolation level. update-only workload of Section 5.1.1 at 24 threads. As we add The overhead for Repeatable Read for both locking schemes is read-only transactions to the mix, the performance gap between very small, less than 2%. MV/O needs to repeat the reads at the all schemes closes. This is primarily because the update activity is end of the transaction, and this causes an 8% drop in throughput. reduced, reducing the overhead of garbage collection. For Serializable, the 1V scheme protects the hash key with a lock, The MV schemes outperform 1V when most transactions are and this guarantees phantom protection with very low overhead read-only. When a transaction is read-only, the two MV schemes (2%). Both MV schemes achieve serializability at a cost of 10%– behave identically: transactions read a consistent snapshot and do 19% lower throughput: MV/L acquires read locks and bucket not need to do any locking or validation. In comparison, 1V has to locks, while MV/O has to repeat each scan during validation. acquire and release short read locks for cursor stability even for Under MV/O, however, a transaction requesting a higher isolation read-only transactions which impacts performance. level bears the full cost of enforcing the higher isolation. This is In Figure 7 we repeat the same experiment but simulate a hotspot not the case for locking schemes. by using a table of 1,000 rows. The leftmost point (x=0%) again 5.2 Heterogeneous Workload reflects the performance of the update-only workload of Section The workload used in the previous section represents an extreme 5.1.12 under high contention at 24 threads. The MVCC schemes update-heavy scenario. In this section we fix the multiprogram- have a clear advantage at high contention, as snapshot isolation ming level to 24 and we explore the impact of adding read-only prevents read-only transactions from interfering with writers. transactions in the workload mix. When 80% of the transactions are read-only, the MVCC schemes achieve 63% and 73% higher throughput than 1V. 307 Impact of Long Read Transactions are long readers, at x=12, MV has 80X higher update throughput Not all transactions are short in OLTP systems. Users often need than 1V. In terms of read throughput (Figure 8), both MV to run operational reporting queries on the live system. These are schemes consistently outperform 1V. long read-only transactions that may touch a substantial part of the database. The presence of a few long-running queries should not 5.3 TATP Results The workloads used in the previous sections allowed us to high- severely affect the throughput of “normal” OLTP transactions. light fundamental differences between the three concurrency con- This experiment investigates how the three concurrency control trol mechanisms. However, real applications have higher demands methods perform in this scenario. We use a table with 10,000,000 than randomly reading and updating values in a single index. We rows and fix the number of concurrently active transactions to be conclude our experimental evaluation by running a benchmark 24. The workload consists of two transaction types: (a) Long, that models a simple but realistic OLTP application. transactionally consistent (Serializable), read-only queries that The TATP benchmark [24] simulates a telecommunications appli- touch 10% of the table (R=1,000,000 and W=0) and (b) Short cation. The database consists of four tables with two indexes on update transactions with R=10 and W=2. each table to speed up lookups. The benchmark runs a random mix of seven short transactions; each transaction performs less than 5 operations on average. 80% of the transactions executed only query the database, while 16% update, 2% insert, and 2% delete items. We sized the database for 20 million subscribers and generated keys using the non-uniform distribution that is specified in the benchmark. All transactions run under Read Committed. 1V MV/L MV/O Transactions per second 4,220,119 3,129,816 3,121,494 Table 4: TATP results Table 4 shows the number of committed transactions per second for each scheme. Our concurrency control mechanisms can sus- tain a throughput of several millions of transaction per second on a low-end server machine. This is an order of magnitude higher Figure 8: Update throughput with long read transactions than previously published TATP numbers for disk-based systems [19] or main memory systems [14]. 6. RELATED WORK Concurrency control has a long and rich history going back to the beginning of database systems. Several excellent surveys and books on concurrency control are available [4], [16], [26]. Multiversion concurrency control methods also have a long histo- ry. Chapter 5 in [4] describes three multiversioning methods: mul- tiversion timestamp ordering (MVTO), two-version two-phase locking (2V2PL), and a multiversion mixed method. 2V2PL uses at most two versions: last committed and updated uncommitted. They also sketch a generalization that allows multiple uncommit- ted versions and readers are allowed to read uncommitted ver- sions. The mixed method uses MVTO for read-only transactions and Strict 2PL for update transactions. Figure 9: Read throughput with long read transactions The optimistic approach to concurrency control originated with Figures 8 and 9 show update and read throughput (y-axis) as we Kung and Robinson, but they only considered single-version da- vary the number of concurrent long read-only transactions in the tabases [17]. Many multiversion concurrency control schemes system (x-axis). At the leftmost point (x=0) all transactions are have been proposed [2], [5], [6], [8], [9], [13], [18], [20], but we short update transactions, while at the rightmost point (x=24) all are aware of only two that take an optimistic approach: Multiver- transactions are read-only. At x=6, for example, there are 6 read- sion Serial Validation (MVSV) by Carey [11], [12] and Multiver- only and 24-6=18 short update transactions running concurrently. sion Parallel Validation (MVPV) by Agrawal et al [1]. While the Looking at the update throughput in Figure 7, we can see that 1V two schemes are optimistic and multiversion, they differ signifi- is twice as fast as the MV schemes when all transactions are short cant from our scheme. Their isolation level is repeatable read; update transactions, at x=0. (This is consistent with our findings other isolation levels are not discussed. MVSV does validation from the experiments with the homogeneous workload in Section serially so validation quickly becomes a bottleneck. MVPV does 5.1.) However the picture changes completely once a single long validation in parallel but installing updates after validation is done read-only transaction is present in the system. At x=1, update serially. In comparison, the only critical section in our method is throughput drops 75% for the single version engine. In contrast, acquiring timestamps; everything else is done in parallel. Acquir- update throughput drops only 5% for the MV schemes, making ing a timestamp is a single instruction (an atomic increment) so MV twice as fast as 1V. The performance gap grows as we allow the critical section is extremely short. more read-only transactions. When 50% of the active transactions 308

12.Snapshot isolation (SI) [3] is a multiversioning scheme used by [4] Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman: many database systems. Several commercial database systems Concurrency Control and Recovery in Database Systems. support snapshot isolation to isolate read-only transactions from Addison-Wesley 1987, ISBN 0-201-10715-5. updaters: Oracle [22], Postgres [21] and SQL Server [23] and [5] Paul M. Bober, Michael J. Carey: Multiversion Query Lock- possibly others. However, SI is not serializable and many papers ing, VLDB, 1992. have considered under what circumstances SI is serializable or [6] Paul M. Bober, Michael J. Carey: On Mixing Queries and how to make it serializable. Cahill et al published a complete and Transactions via Multiversion Locking, ICDE, 1992. practical solution in 2008 [9]. Their technique requires that trans- [7] Mihaela A. Bornea, Orion Hodson, Sameh Elnikety, Alan actions check for read-write dependencies. Their implementation Fekete: One-Copy Serializability with Snapshot Isolation un- uses a standard lock manager and transactions acquire “locks” and der the Hood. ICDE, 2011. check for read-write dependencies on every read and write. The [8] Albert Burger, Vijay Kumar, Mary Lou Hines: Performance “locks” are non-blocking and used only to detect read-write de- of Multiversion and Distributed Two-Phase Locking Concur- pendencies. Whether their approach can be implemented efficient- rency Control Mechanisms in Distributed Databases, Infor- ly for a main-memory DBMS is an open question. Techniques mation Sciences, 96(1), 1997. such as validating by checking repeatability of reads and predi- [9] Michael J. Cahill, Uwe Röhm, Alan David Fekete: Serializa- cates have already been used in the past [7]. ble Isolation for Snapshot Databases. TODS, 34(4), 2009. Oracle TimesTen [22] and IBM’s solidDB [10] are two commer- [10] IBM solidDB, information available at http://www.ibm.com/. cially available main-memory DBMSs. TimesTen uses single- [11] Michael J. Carey: Multiple Versions and the Performance of version locking with multiple lock types (shared, exclusive, up- Optimistic Concurrency Control. Tech. Rep. 517, Computer date) and multiple granularities (row, table, database). For main- Sciences Dept., Univ. of Wisconsin-Madison, 1983. memory tables, solidDB also uses single-version locking with [12] Michael J. Carey, Waleed A. Muhanna: The Performance of multiple lock types (shared, exclusive, update) and two granulari- Multiversion Concurrency Algorithms, TODS, 4(4), 1985. ties (row, table). For disk-based tables, solidDB supports both [13] Theo Härder, Erwin Petry: Evaluation of a Multiple Version optimistic and pessimistic concurrency control. . Scheme for Concurrency Control, Information Systems, 12(1), 1987. 7. CONCLUDING REMARKS [14] IBM SolidDb Performance on Intel Systems, white paper at In this paper we investigated concurrency control mechanisms http://download.intel.com/business/software/testimonials/do optimized for main memory databases. The known shortcomings wnloads/xeon5500/ibm.pdf. of traditional locking led us to consider solutions based on multi- [15] Ryan Johnson, Ippokratis Pandis, Anastasia Ailamaki: Im- versioning. We designed and implemented two MVCC methods, proving OLTP Scalability using Speculative Lock Inher- one optimistic using validation and one pessimistic using locking. itance. PVLDB, 2(1), 2009. For comparison purposes we also implemented a variant of single- [16] Vijay Kumar (Ed.): Performance of Concurrency Control version locking optimized for main memory databases. We then Mechanisms in Centralized Database Systems. Prentice-Hall experimentally evaluated the performance of the three methods. 1996, ISBN 0-13-065442-6. Several conclusions can be drawn from the experimental results. [17] H. T. Kung, John T. Robinson: On Optimistic Methods for Concurrency Control. TODS, 6(2), 1981. x Single-version locking can be implemented efficiently and [18] Sanjay Kumar Madria, Mohammed Baseer, Vijay Kumar, without lock acquisition becoming a bottleneck. Sourav S. Bhowmick: A Transaction Model and Multiver- x Single-version locking is fragile; it performs well when sion Concurrency Control for Mobile Database Systems, Dis- transactions are short and contention is low but suffers under tributed and Parallel Databases, 22(2-3), 2007. more demanding conditions. [19] Ippokratis Pandis, Ryan Johnson, Nikos Hardavellas, Ana- x The MVCC schemes have higher overhead but are more stasia Ailamaki: Data-Oriented Transaction Execution. resilient, retaining good throughput even in the presence of PVLDB, 3(1), 2010. hotspots and long read-only transactions. [20] Christos H. Papadimitriou, Paris C. Kanellakis: On Concur- x The optimistic MVCC scheme consistently achieves higher rency Control by Multiple Versions, TODS, 9(1), 1984. throughput than the pessimistic scheme. [21] PostgreSQL 8.4.2 Documentation, Ch. 13 Concurrency Con- trol, available from www.postgresql.org. 8. ACKNOWLEDGMENTS [22] Oracle TimesTen In-Memory Database, information availa- We thank Phil Bernstein, Marcel van der Holst and Dimitris Tsi- ble at http://www.oracle.com. rogiannis for their contributions. This work was supported in part [23] SQL Server 2008 Books Online: Isolation Levels in the Da- by a grant from the Microsoft Jim Gray Systems Lab. tabase Engine, available at http://msdn.microsoft.com/en- us/library/ms189122.aspx. 9. REFERENCES [24] Telecommunication Application Transaction Processing [1] D. Agrawal, A. J. Bernstein, P. Gupta, S. Sengupta: Distrib- (TATP) Benchmark Description, available at uted Multi-Version Optimistic Concurrency Control with http://tatpbenchmark.sourceforge.net/. Reduced Rollback, Distributed Computing, 1987. [25] Robert Tarjan: Depth-First Search and Linear Graph Algo- [2] D. Agrawal, S. Sengupta: Modular Synchronization in Mul- rithms, SIAM J. of Computing, 1(2), 1972. tiversion Databases: Version Control and Concurrency Con- [26] Alexander Thomasian: Concurrecy control: Methods, Per- trol, SIGMOD, 1989. formance and Analysis, ACM Computing Surveys, 30(1), [3] Hal Berenson, Philip A. Bernstein, Jim Gray, Jim Melton, 1998. Elizabeth J. O'Neil, Patrick E. O'Neil: A Critique of ANSI [27] http://www.cs.wisc.edu/~sblanas/proofsketch.pdf SQL Isolation Levels. SIGMOD, 1995. 309