Aether: A Scalable Approach to Logging

The shift to multi-core hardware brings new challenges to database systems, as the software parallelism determines performance. Even though database systems traditionally accommodate simultaneous requests, a multitude of synchronization barriers serialize execution. Write-ahead logging is a fundamental, omnipresent component in ARIES-style concurrency and recovery, and one of the most important yet-to-be addressed potential bottlenecks, especially in OLTP workloads making frequent small changes to data.

1. Aether: A Scalable Approach to Logging Ryan Johnson†‡ Ippokratis Pandis†‡ Radu Stoica‡ Manos Athanassoulis‡ Anastasia Ailamaki†‡ †Carnegie Mellon University ‡École Polytechnique Fédérale de Lausanne 5000 Forbes Ave. Pittsburgh, PA 15213, USA EPFL, CH-1015 Lausanne, Switzerland {ryanjohn, ipandis} {radu.stoica, manos.athanassoulis, natassa} ABSTRACT Done! The shift to multi-core hardware brings new challenges to database WAL Commit systems, as the software parallelism determines performance. Even Xct 1 though database systems traditionally accommodate simultaneous requests, a multitude of synchronization barriers serialize execu- A C tion. Write-ahead logging is a fundamental, omnipresent compo- nent in ARIES-style concurrency and recovery, and one of the WAL most important yet-to-be addressed potential bottlenecks, espe- Xct 2 cially in OLTP workloads making frequent small changes to data. In this paper, we identify four logging-related impediments to D B database system scalability. Each issue challenges different level in Time the software architecture: (a) the high volume of small-sized I/O requests may saturate the disk, (b) transactions hold locks while Lock Mgr. Log Mgr. Working Waiting waiting for the log flush, (c) extensive context switching over- whelms the OS scheduler with threads executing log I/Os, and (d) Figure 1. A timeline of two transactions illustrating four kinds of contention appears as transactions serialize accesses to in-memory log-imposed delay: (A) I/O-related delays, (B) increased lock con- log data structures. We demonstrate these problems and address tention, (C) scheduler overload, and (D) log buffer contention. them with techniques that, when combined, comprise a holistic, scalable approach to logging. Our solution achieves a 20%-69% speedup over a modern database system when running log-inten- uniprocessors with high latency I/O subsystems. Database engines sive workloads, such as the TPC-B and TATP benchmarks. More- therefore excel at exploiting concurrency –support for multiple in- over, it achieves log insert throughput over 1.8GB/s for small log progress operations– to interleave the execution of a large number records on a single socket server, an order of magnitude higher of transactions, most of which are idle at any given moment. How- than the traditional way of accessing the log using a single mutex. ever, as the number of cores per chip increases in step with Moore’s law, software must exploit parallelism –support for con- CATEGORIES AND SUBJECT DESCRIPTORS current operations to proceed simultaneously– to benefit from new H.2.4 [Database Management]: Systems - Transaction process- hardware. Although database workloads exhibit high concurrency, ing, Concurrency; H.2.7 [Database Management]: Database internal bottlenecks [11] often mean that database engines cannot Administration - Logging and recovery. extract enough parallelism to meet multicore hardware demands. GENERAL TERMS The log manager is a key service of modern DBMSs, espe- cially prone to bottlenecks due to its centralized design and depen- Design, Performance, Experimentation, Measurement. dence on I/O. Long flush times, log-induced lock contention, and KEYWORDS contention for log buffers in main memory all impact scalability, Log manager. Early lock release. Flush pipelining. Log buffer and no single bottleneck is solely responsible for suboptimal per- contention. Consolidation array. formance. Modern systems can achieve transaction rates of 100ktps or higher, exacerbating the log bottleneck.1 Research to 1. INTRODUCTION date offers piecewise or partial solutions to the various bottlenecks, Recent changes in computer microarchitecture have led to multi- which do not lead to a fully scalable log manager for today’s multi- core systems, which in turn have several implications in database core hardware. management systems (DBMS) software design [6]. DBMS soft- ware was designed in an era during which most computers were 1.1 Write-ahead Logging and Log Bottlenecks Nearly all database systems use centralized write-ahead logging Permission to make digital or hard copies of all or part of this work (WAL) [14] to protect against data corruption and lost committed for personal or classroom use is granted without fee provided that work after crashes. WAL allows transactions to execute and com- copies ar not made or distributed for profit or commercial advantage mit without requiring that all data pages they update be written to and that copies bear this notice and the full citation on the first page. disk first. However, as Figure 1 illustrates, there are four main To copy otherwise, to republish, to post on servers or to redistribute types of delays which logging can impose on transactions: to lists, requires prior specific permission and/or a fee. Articles from this volume were presented at The 36th International Conference on Very Large Data Bases, September 13-17, 2010, Singapore. 1. See, e.g. top published TPC-C results or performance figures reported Proceedings of the VLDB Endowment, Vol. 3, No. 1 by main-memory databases like H-Store [22]. Copyright 2010 VLDB Endowment 21508097/10/09... $ 10.00.

2.I/O-related delays (A). The system must ensure that a transac- Machine utilization tion’s log records reach non-volatile storage before committing. 100% With access times in the order of milliseconds, a log flush to mag- idle netic media can easily become the longest part of a transaction. 80% Further, log flush delays become serial if the log device is over- Log mgr. contention loaded by multiple small requests. Fortunately, log flush I/O times 60% become less important as fast solid-state drives gain popular- Log mgr. work ity[1][12], and when using techniques such as group commit [8]. 40% Other contention Log-induced lock contention (B). Under traditional WAL, each Other work transaction which requests a commit must first flush its log records 20% to disk, retaining all write locks until the operation completes. Holding locks during this final (and often only) I/O significantly 0% increases lock contention in the system and creates an artificial Log I/O OS Log buffer bottleneck in many workloads. For example, the left-most bar in latency scheduler contention Figure 2 shows CPU utilization as 60 clients run for 95 seconds the Figure 2. Breakdown of CPU time showing work and contention TPC-B [24] benchmark in a modern storage manager [11] on a Sun due to the log vs other parts of the system, when 60 clients run the Niagara II chip with 64 hardware contexts (see Section 6.1 for the TPC-B benchmark, as we remove log-related bottlenecks. detailed experimental setup). Due to the increased lock contention the system is idle 75% of the time. Section 3 shows that even though reduced I/O times help, the problem remains even when mize or eliminate the log bottleneck. We highlight new contribu- logging to a ramdisk with minimal latency. tions below. Excessive context switching (C). Log flushes incur additional First, we evaluate Early Lock Release (ELR), a promising costs beyond I/O latency because the transaction cannot continue technique for eliminating log-induced lock contention. ELR has and must be descheduled until the I/O completes. Unlike I/O been proposed several times in the past but, to our knowledge, has latency, context switching and scheduling decisions consume CPU never been evaluated in the literature and is not used today by any time and thus cannot overlap with other work. In addition, the mainstream database engine. We show that, particularly for abundance of hardware contexts in multicore hardware can make skewed accesses common to real workloads, ELR increases scheduling a bottleneck in its own right as runnable threads begin throughput by 15%-164% even when logging to fast flash disks. to accumulate faster than the OS can dispatch them. The second Second, we propose and evaluate Flush Pipelining, a tech- bar in Figure 2 shows for the same workload the processing time nique which allows most transactions to commit without triggering of a system which suffers from the problem of OS scheduler over- context switches. In synergy with ELR it achieves the same perfor- load. The system remains 20% idle even with transactions ready to mance with asynchronous commit without sacrificing durability. run. We analyze excessive context switching problem in Section 4. Finally, we propose and evaluate three improvements to log buffer design, including a new “consolidation-based backoff” Log buffer contention (D). Another log bottleneck arises as the scheme which allows threads to aggregate their requests to the log multicore trend continues to demand exponential increases in par- when they encounter contention. As a result, maximum log conten- allelism; where current hardware trends generally reduce the other tion is decoupled from thread counts and log record sizes. Our bottlenecks (e.g. solid state drives reduce I/O latencies), each suc- evaluation shows that contention is minimized and identifies mem- cessive processor generation aggravates contention with an ory bandwidth as the most likely bottleneck to arise next. increase in hardware contexts. The third bar in Figure 2 shows that if we remove the problems of logical lock contention and exces- 2. RELATED WORK sive context switching, the system utilizes fully the available hard- As a core database service, logging has been the focus of extensive ware. But, as a large number of threads attempt simultaneous log research. Virtually all database engines employ some variant of inserts, the contention for the centralized log buffer contributes a ARIES [14], a sophisticated write-ahead logging system which significant (and growing) fraction of total transaction time. We integrates concurrency control with transaction rollback and disas- therefore consider this bottleneck as the most dangerous to future ter recovery, and allows the system to recover fully even if recov- scalability, in spite of its modest performance impact on today’s ery is interrupted repeatedly by new crashes. To achieve its high hardware. Section 5 focuses on this problem. robustness with good performance, ARIES couples tightly with the In summary, log bottlenecks arise for several reasons, and no rest of the system, particularly the lock and buffer pool managers, single approach addresses them all. A technique known as “asyn- and has a strong influence on the design of access methods such as chronous commit” is perhaps the clearest symptom of the continu- B+Tree indexes [13]. The log is typically implemented as a single ing log bottleneck. Available in most DBMSs (including Oracle global structure shared by every transaction, making it a potential [16] and PostgreSQL [17]) asynchronous commit allows transac- bottleneck in highly parallel systems. Even in a single-threaded tions to complete and return results without waiting for their log database engine the overhead of logging accounts for roughly 12% entries to become durable. Skipping the log flush step sidesteps of the total time in a typical OLTP workload [7]. problems A-C listed above, but at the cost of unsafe operation: the Several recent studies [12][3] evaluate solid state flash drives system can lose committed work after a crash. To date no existing in the context of logging, and demonstrate significant speedups proposal addresses all the bottlenecks associated with log flush, due to both better response times and also better handling of the and the looming problem of log buffer contention. small I/O sizes common to logging. However, even the fastest flash drives do not eliminate all overhead because synchronous log 1.2 A Holistic Approach to Scalable Logging flush requests still block and therefore cause OS scheduling. This paper presents Aether, a complete approach towards log scal- Log group commit strategies [8][18] reduce pressure on mag- ability, and demonstrates how the proposed solutions address all netic log disks by aggregating multiple requests for log flush into a log bottlenecks on modern hardware, even for the most demanding single I/O operation; fewer and larger disk accesses translate into workloads. Aether combines new and existing solutions to mini-

3.significantly better disk performance by avoiding unnecessary 100 10000 us (slow disk) 100 us (flash) head seeks. Unfortunately, group commit does not eliminate 1000 us (fast disk) 0 us (memory) unwanted context switches because transactions merely block pending notification from the log rather than blocking on I/O Speedup requests directly. Asynchronous commit [16][17] extends group commit by not 10 only aggregating I/O requests together, but also allowing transac- tions to complete without waiting for those requests to complete. This optimization moves log flush times completely off the critical path but at the expense of durability. That is, committed work can be lost if a crash prevents the corresponding log records to become 1 durable. Despite being unsafe, asynchronous commit is used 0.0 1.0 2.0 3.0 4.0 5.0 widely in commercial and open source database systems because it provides a significant performance boost. In contrast, Aether Data access skew (zipfian s parameter) achieves this performance boost without sacrificing durability. Figure 3. Speedup due to ELR when running the TPC-B bench- DeWitt et al. [4] observe that a transaction can safely release mark and varying I/O latency and skew in data accesses. locks before flushing its log records to disk provided certain condi- tions are met. IVS [5] implemented this optimization but its cor- rectness was proven more recently [21]. We refer to this technique Early Lock Release (ELR) removes log flush latency from the crit- as early lock release (ELR) and evaluate it further in Section 3. ical path by ensuring that only the committing transaction must Main-memory database engines impose a special challenge wait for its commit operation to complete; having released all held for log implementations because the log is the only I/O operation database locks, others can acquire these locks immediately and of a given transaction. Not only is the I/O time responsible for a continue executing. In spite of its potential benefits modern data- large fraction of total response time, but short transactions also base engines do not implement ELR and to our knowledge this is lead to high concurrency and contention for the log buffer. Some the first paper to analyze empirically ELR’s performance. We proposals go so far as to eliminate the log (and its overheads) hypothesize that this is largely due to the effectiveness of asyn- altogether [22], replicating each transaction to multiple database chronous commit [16][17], which obviates ELR and which nearly instances and relying on hot fail-over to maintain durability. all major systems do provide. However, systems which do not sac- Aether is especially well-suited to in-memory databases because it rifice durability can benefit strongly from ELR under workloads addresses both log flush delays and contention at the log buffer. which exhibit lock contention and/or long log flush times. 3. MOVING LOG I/O LATENCY OFF THE 3.2 Evaluation of ELR CRITICAL PATH We use the TPC-B benchmark [24] to evaluate ELR. TPC-B was designed as a database stress test and also exhibits significant lock During its lifecycle a transaction acquires database locks to ensure contention. The benchmark executes on a 64-context Niagara II consistency and logs all actions before performing them. At com- server running the Shore-MT storage manager [11] (further details pletion time –after writing a commit record to non-volatile stor- about the platform and experimental methodology can be found in age– the transaction finally releases the locks it has accumulated. Section 6.1). Figure 3 shows the benefit of ELR over a baseline Releasing the locks only after the commit record has reached disk system as we vary the two major factors which impact its effec- (or been flushed) ensures that other transactions do not encounter tiveness: lock contention and I/O latency. The y-axis shows uncommitted data, but also increases lock hold time significantly, speedup due to ELR as the skew of zipfian-distributed data especially for in-memory workloads where the log commit is the accesses increases along the x-axis. Lower skew leads to more uni- longest part of many transactions. form accesses and lower lock contention. Different log device 3.1 Early Lock Release latencies are given as data series ranging from 0 to 10ms. The first DeWitt et al. [4] observe that a transaction’s locks can be released series (0ms) is measured using a ramdisk which imposes almost no before its commit record is written to disk, as long as it does not additional delay beyond a round trip through the OS kernel (40- return results to the client before becoming durable. Other transac- 80µs). The remaining series are created by using a combination of tions which read data updated by a pre-committed transaction asynchronous I/O and high resolution timers to impose additional become dependant on it and must not be allowed to return results response times of 100µs (fast flash drive), 1ms (fast magnetic to the user until both their own and their predecessor’s log records drive), and 10ms (slow magnetic drive). have reached the disk. Serial log implementations preserve this As shown in the graph, ELR’s speedup is maximized (35x) property naturally, because the dependant transaction’s log records for slower devices, but remains substantial (2x) even with flash must always reach the log later than those of the pre-committed drives if contention is present. This effect occurs because transac- transaction and will therefore become durable later also. Formally, tions are short even compared to 100µs I/O times, and ELR eases as shown in [21], the system must meet two conditions for early contention by removing that delay from the critical path. As write lock release to preserve recoverability: performance of most flash drives remains unpredictable (and usu- ally slower than desired) ELR remains an important optimization 1. Every dependant transaction’s commit log record is written to even as systems move away from magnetic media. the disk after the corresponding log record of pre-committed Varying lock contention impacts performance in three phases. transaction. For very low contention, the probability of a transaction to request 2. When a pre-committed transaction is aborted all dependant an already-held lock is low. Thus, holding that lock through the log transactions must also be aborted. Most systems meet this con- flush does not stall other transactions and ELR has no opportunity dition trivially; they do no work after inserting the commit to improve performance. At the other extreme, very high skew record, except to release locks, and therefore can only abort leads to such high contention that transactions encounter held during recovery when all uncommitted transactions roll back. locks even with no log flush time. In the middle range, however,

4. #CPU utilized #CPU utilized 70 64 #CPU 300 64 60 Scheduler activity Throughput (Ktps) 56 # System CPU 250 56 48 48 50 40 200 40 40 32 150 32 30 24 100 24 Flush Pipelining 20 16 16 AsynchCommit 8 50 8 10 Baseline 0 0 0 0 0 8 16 24 32 40 48 56 64 0 8 16 24 32 40 48 56 64 0 20 40 60 Client count Clients Figure 4. Total and system CPU utilization and number of context switches without (left) Figure 5. Performance of flush pipelining and and with (right) flush pipelining. asynchronous commit vs. baseline. ELR significantly improves performance because holding locks steadily with the number of client threads.2 The CPU utilization through log flush causes stalls which would not have arisen other- curve illustrates that the OS is unable to handle this load, as 12 of wise. The sweet spot becomes wider as longer I/O times stretch out the 64 hardware contexts are idle. Further, as load increases an the total transaction length in the baseline case. Finally, by way of increasing fraction of total load is due to system time rather than comparison, the intuitive rule that 80% of accesses are to 20% of the application, further reducing the effective utilization. the data corresponds roughly to a skew of 0.85. In other words, Excessive context switching explains why group commit workloads are likely to exhibit exactly the contention levels which alone is not fully scalable and why asynchronous commit is popu- ELR is well-equipped to reduce. lar despite being unsafe. The latter eliminates context switching In conclusion, we find that ELR is a straightforward optimi- associated with transaction commit while the former does not. zation which can benefit even modern database engines. Further, as the next section demonstrates, it will become an important com- 4.1 Flush Pipelining ponent in a safe replacement for asynchronous commit. To eliminate the scheduling bottleneck (and thereby increase CPU utilization and throughput), the database engine must decouple the 4. DECOUPLING OS SCHEDULING transaction commit from thread scheduling. We propose Flush Pipelining, a technique which allows agent threads to detach from FROM LOG FLUSH OPERATIONS transactions during log flush in order to execute other work, The latency of a log flush arises from two sources: the actual I/O resuming the transaction once the flush is completed. wait time and the context switches required to block and unblock Flush pipelining operates as follows. First, agent threads com- the thread at either end of the wait. Existing log flush optimiza- mit transactions asynchronously (without waiting for the log flush tions, such as group commit, focus on improving I/O wait time to complete). However, unlike asynchronous commit they do not without addressing thread scheduling. Similarly, while ELR return immediately to the client but instead detach from the trans- removes log flush from the critical path of other transactions action, enqueue its state at the log and continue executing other (shown as (B) in Figure 1) the requesting transaction must still transactions. A daemon thread triggers log flushes using policies block for its log flush I/O and be rescheduled as the I/O completes similar to those used in group commit (e.g. “flush every X transac- (shown as (A) in Figure 1). Unlike I/O wait time, which the OS tions, L bytes logged, or T time elapsed, whichever comes first”). can overlap with other work, each scheduling decision consumes After each I/O completion, the daemon notifies the agent threads several microseconds of CPU time which cannot be overlapped. of newly-hardened transactions, which eventually reattach to each The cost of scheduling and context switching is increasingly transaction, finish the commit process and return results to the cli- important for several reasons. First, high-performance solid state ent. Transactions which abort after generating log records must storage provides access times measured in tens of microseconds, also be hardened before rolling back. The agent threads handle this leaving the accompanying scheduling decisions as a significant case as relatively rare under traditional (non-optimistic) concur- fraction of the total delay. Second, exponentially growing core rency control and do not pass the transaction to the daemon. counts make scheduler overload an increasing concern as the OS When combined with ELR (see previous section), flush pipe- must dispatch threads for every transaction completion. The sched- lining provides the same throughput3 as asynchronous commit uler must coordinate these scheduling decisions (at least loosely) without sacrificing any safety. Only the log’s daemon thread suf- across all cores. The excessive context switching triggers a sched- fers wait time and scheduling due to log flush requests, with agent uling bottleneck which manifests as a combination of high load threads pipelining multiple requests to hide even long delays. (e.g. many runnable threads) but low CPU utilization and signifi- cant system time. Figure 4 (left) shows an example of the scheduler overload induced when the Shore-MT storage manager [11] runs the TPC-B 2. Daemon threads contribute a secondary effect. As load increases these benchmark [24] on a 64-context Sun Niagara II machine. As the threads wake more and more frequently at first, then sleep less and number of client threads increases along the x-axis, we plot the less, and finally revert to polling as the system becomes saturated. rate of context switches (in thousands/s), as well as the CPU utili- 3. Asynchronous commit does deliver superior response times for indi- zation achieved and the number of CPUs running inside the OS vidual transactions (they do not wait for the log flush to complete), but kernel (system time). The number of context switches increases the delays overlap perfectly and overall throughput is not impacted.

5.4.2 Evaluation of Flush Pipelining (B) Baseline Consolidation array (C) To evaluate flush pipelining we run the same experiment as in Figure 4 (left), but this time with flush pipelining active. Figure 4 (right) shows the result. As before we vary the number of client threads and measure the number of context switches (in mil- lions), utilization achieved, and the OS system time contribution. In contrast to the baseline case, the number of context switches after an initial increase, remains almost steady for the entire load spectrum. The utilization reaches the maximum possible (64) indi- cating that the scheduling bottleneck has been resolved. Further confirmation comes from the system time contribution, which remains nearly constant regardless of how many threads enter the (D) Decoupled buffer insert Hybrid design (CD) system. This behavior is expected because only one thread issues I/ Start/finish Mutex held O requests regardless of thread counts, and the group commit pol- Waiting Copy into buffer icy ensures that requests become larger rather than more frequent. Figure 5 compares the performance of baseline Shore-MT to Figure 6. Illustrations of several log buffer designs. The baseline asynchronous commit and flush pipelining when running the TPC- system can be optimized for shorter critical path (D), fewer threads B. The x-axis varies the number of clients as we plot throughput on attempting log inserts (C), or both (CD) the y-axis. Even with a fast log disk, the baseline system begins to lag almost immediately as scheduling overheads increase reducing its scalability. In contrast, the other two scale better achieving up to strong peaks at 40B and 264B (a 6x difference) and the largest log 22% higher performance. As Section 6.4 will show, for even more records can occupy several kB each. log-intensive workloads the benefits of flush pipelining are larger. To permanently eliminate contention for the log buffer, we In summary, flush pipelining successfully and safely removes seek to make the cost of accessing the log independent of both the the log from the system's critical path of execution by breaking the sizes of the log records being inserted and the number of threads correlation between transaction commits and scheduling. inserting them. The following subsections explore both approaches and propose a hybrid solution which combines them. 5. SCALABLE LOG BUFFER DESIGN Most database engines use some variant of ARIES, which assigns 5.1 Consolidating Buffer Allocation A log record consists of a standard header followed by an arbitrary each log record a unique log sequence number (LSN). The LSN payload. Log buffer allocation is composable in the sense that two encodes a record’s disk address, acts as a timestamp for data pages successive requests also begin with a log header and end with an written to disk, and serves as a pointer to log records both in mem- arbitrary payload. We exploit this composability by allowing ory and on disk. It is also convenient for LSN to serve as addresses threads to combine their requests into groups, carve up and fill the in the log buffer, so that generating an LSN also reserves buffer group’s buffer space off the critical path, and finally release it back space. In order to keep the database consistent in spite of repeated to the log as a unit. To this end we extend the idea of elimination- failures, ARIES imposes strict ordering constraints on LSN gener- based backoff [9][15], a hybrid approach combining elimination ation. While a total ordering is not technically required for correct- trees [19] with backoff. Threads which encounter contention back ness, valid partial orders tend to be too complex and off, but instead of sleeping or counting cycles they congregate at interdependent to be worth pursuing as a performance optimization an elimination array, a set of auxiliary locations where they (see Section A.5 of the Appendix for further discussion). Inserting attempt to combine their requests with those of others. a record into the log buffer consists of three distinct phases: When elimination is successful threads satisfy their requests 1. LSN generation and log buffer acquire. The thread must first without returning to the shared resource at all, making the backoff claim the space it will eventually fill with the intended log very effective. For example, stacks are amenable to elimination record. Though serial, LSN generation is short and predictable because push and pop requests which encounter each other while barring exceptional situations such as buffer wraparound or full backing off can cancel each other directly via the elimination array log buffer and leave. Similarly, threads which encounter contention for log 2. Log record insertion. The thread copies the log record in the inserts back off to a consolidation array and combine their buffer space it has claimed. requests before reattempting the log buffer. We use the term “con- solidation” instead of “elimination” because, unlike with a stack or 3. Log buffer release. The transaction releases the buffer space, counter, threads must still cooperate after combining their requests which allows the log manager to write the record to disk. so that the last to finish can release the group’s buffer space. Like A straightforward log insert implementation acquires a central an elimination array, any number of threads can consolidate into a mutex before performing all three phases and the mutex is released single request, effectively bounding contention at the log buffer to at the same time as the buffer (pseudocode in Algorithm 1, Appen- the number of array entries protecting the log buffer, rather than dix). This approach is attractive for its simplicity: log inserts are the number of threads in the system. Algorithm 2 (Appendix) pro- relatively inexpensive, and in the monolithic case buffer release is vides a sketch of the consolidation array-based buffer allocation. simplified to a mutex release. The net effect of consolidation is that only the first thread The weakness of a monolithic log insert is that it serializes from each group competes to acquire buffer space from the log, buffer fill operations –even though buffer regions never overlap– and only the last thread to leave must wait to release it. Figure 6(C) which adds their cost directly to the critical path. In addition, log depicts the effect of consolidation; the first thread to arrive is record sizes vary significantly, making copying costs unpredict- joined by two others while it waits on the log mutex and all three able. Figure 6(B) illustrates how a single large log record can proceed in parallel once the mutex acquire succeeds. However, as impose long delays on later threads; this situation arises frequently the figure also shows, consolidation leaves significant wait times in our system because the distribution of log records has two because only buffer fill operations within a group proceed in paral-

6. 100% pling buffer fill operations allows pipelining between groups and Log mgr.  reduces the log critical section length by moving buffer outside, Time breakdown 80% contention thus making performance relatively insensitive to log record sizes. The resulting design, shown in Figure 6(CD), achieves bounded 60% Other  contention for threads in the buffer acquire stage and maximum contention pipelining of all operations. 40% Log mgr.  work 6. PERFORMANCE EVALUATION 20% Other work We implement the techniques described in sections 3, 4, and 5 into a logging subsystem called Aether. To enhance readability, most of 0% the performance evaluation of ELR and flush pipelining is shown in sections 3 and 4, respectively. Unless otherwise stated in this section we assume those optimizations are already in place. This System load section details the sensitivity of the consolidation array based tech- Figure 7. Breakdown of the execution time of Shore-MT with niques to various parameters, and finally evaluates performance of ELR and flush pipelining, running TATP-UpdateLocation transac- Aether in a prototype database system. tions, as load increases. The log buffer becomes the bottleneck. 6.1 Experimental Setup All experiments were performed on a Sun T5220 (Niagara II) server with 64GB of main memory running Solaris 10. The Niag- lel; operations between groups are still serialized. Given enough ara II chip contains sixteen processing pipelines, each capable of threads in the system, at least one thread of each group is likely to supporting four hardware contexts, for a total of 64 OS-visible insert a large log record, delaying later groups. “CPUs.” The high degree of hardware parallelism makes it a good 5.2 Decoupling Buffer Fill indicator of the challenges all platforms will face as on-chip core Because buffer fill operations are not inherently serial (records counts continue to double. We use Shore-MT [11], an open-source never overlap) and have variable costs, they are highly attractive multi-threaded storage manager. We developed Shore-MT using as targets to move off the critical path. All threads which have basis the SHORE storage manager [2], to achieve scalability on acquired buffer regions can safely fill those regions in any order as multicore platforms. To eliminate contention in the lock manager long as they release their regions in LSN order. We therefore mod- and focus on logging, we use a version of Shore-MT with Specula- ify the original algorithm so that threads release the mutex imme- tive Lock Inheritance [10]. We run the following benchmarks: diately after acquiring buffer space. Buffer fill operations thus TATP. TATP (aka TM1) [23] models a cell phone provider data- become pipelined, with a new buffer fill starting as soon as the base. It consists of seven very small transactions, both update and next thread can acquire its own buffer region. read-only. The application exhibits little logical contention, but the Decoupling log inserts from holding locks results in a non- small transaction sizes stress database services, especially logging trivial buffer release operation which becomes a second critical and locking. We use a database of 100K Subscribers. section. Like LSN generation, buffer release must be serialized to TPC-B. This benchmark [24] models a banking workload and it is avoid creating gaps in the log. Log records must be written to disk intended as a database stress test. It consists of a single small in LSN order because recovery must stop at the first gap it encoun- update transaction and exhibits moderate lock contention. Our ters; in the event of a crash any committed transactions beyond a experiments utilize a 100-teller dataset. gap would be lost. No mutex is required, but before releasing its own buffer region, each thread must wait until the previous buffer Log insert microbenchmark. We extract a subset of Shore-MT’s has been released (Algorithm 3 of the appendix gives pseudocode). log manager as an executable which supports only log insertions With pipelining in place, arriving threads can overlap their without flushes to disk or performing other work, thereby isolating buffer fills with that of a large log record, without waiting for it to the log buffer performance. We then vary the number of threads, finish first. Figure 6(D) illustrates the improved concurrency that the log record size and distribution, and the timing of inserts. results, with significantly reduced wait times at the buffer acquire For each component of Aether, we first quantify existing bottle- phase. Though skew in the record size distribution could limit scal- necks, then implement our solution in Shore-MT and evaluate the ability because of the requirement to release buffers in order, we resulting impact on performance. Because our focus is on the log- find that this is not a problem in practice because realistic log ging subsystem, and because modern transaction processing work- record sizes do not vary enough to justify the additional complex- loads are largely memory resident [22], we use memory-resident ity. We consider this matter further in Section A.3 of the appendix datasets, while disk still provides durability. and propose a solution which provides robustness in the face of All results report the average of 10 30-second runs unless skewed log record sizes with a 10% performance penalty. stated otherwise; we do not report variance because all measure- ments were within 2% of the mean. Measurements come from tim- 5.3 Putting it all Together: Hybrid Log Buffer ers in the benchmark driver as well as Sun’s profiling tools. In the previous two sections we outlined (a) a consolidation array based approach to reduce the number of threads entering the criti- Profiling is highly effective at identifying software bottlenecks cal section, and (b) a decoupled buffer fill which allows threads to even in the early stages before they begin to impact performance. pipeline buffer fills outside the critical section. Neither approach The hardware limits scalability artificially by multiplexing many eliminates all contention by itself. The two are orthogonal, how- hardware contexts over each processor pipeline; we verify that this ever, and can be combined easily. Consolidating groups of threads is the case by running independent copies of Shore-MT in parallel limits log contention to a constant that does not depend on the (where the effect remains in spite of a total lack of software con- number threads in the system, while providing a degree of buffer tention), and on multi-socket machines (where the effect is shifted insert pipelining (within groups but not between them). Decou- to the right by a factor proportional to the number of sockets).

7.Throughput (GB/s) Throughput (ktps) 100 200 Aether CD in L1 FlushPipelining + ELR CD 150 Baseline 10 C 1 D 100 Baseline 0.1 50 0.01 0 1 4 16 64 12 120 1200 12000 0 16 32 48 64 Thread count Log record size (bytes) Clients Figure 8. Log buffer scalability with respect to thread counts (left, 120B log records) and Figure 9. Overall performance improve- log record size (right, 64 threads) ment provided by each component of Aether 6.2 Log Buffer Contention the baseline. But once contention increases, the threads combine First, to set the stage, we measure the contention on the log buffer their requests and performance scales linearly. In contrast, decou- once the Early Lock Release and the flush pipelining have been pled insertions avoid the initial performance penalty and perform applied. Figure 7 shows the time breakdown for Shore-MT with better, but eventually the growing contention degrades perfor- ELR and flush pipelining active using its baseline log buffer mance and perform worst than the consolidation array. implementation as an increasing number of clients submit the Finally, the hybrid approach combines the best properties of UpdateLocation transaction from TATP. As the load in the system both optimizations, eliminating most of the startup cost from (C) increases, the time each transaction spends contenting for the log while limiting the contention which (D) suffers. The drop in scal- buffer increases, at a point which the log buffer contention ability near the end is a hardware limitation, as described in becomes the bottleneck taking more than 35% of the execution Section 6.1. Overall, we see that while both consolidation and time. This problem will only grow as processor vendors release decoupling are effective at reducing contention, both have limita- more parallel multi-core hardware. tions which we overcome by combining the two, achieving near- linear scalability. 6.3 Impact of log buffer optimizations (micro- benchmarks) 6.3.2 Scalability With Respect to Log Record Size A database log manager should be able to sustain any number of In addition to thread counts, log record sizes also have a strong threads regardless of the size of the log records they insert, limited influence on the performance of the log buffer. In the case of the only by memory and compute bandwidth. Next, through a series of baseline and consolidated variants, larger record sizes increase the microbenchmarks we determine how well the log buffer designs critical section length; in all cases, however, larger record sizes proposed in Section 5 meet these goals. In each experiment we decrease the number of log inserts one thread can perform because compare the baseline implementation with the consolidation array it must copy an increasing amount of data per insertion. (C), decoupled buffer insert (D), and the hybrid solution combin- Figure 8(right) shows the impact of these two factors, plotting ing the two optimizations (CD). We examine scalability with sustained bandwidth achieved by 64 threads as they insert log respect to both thread counts and log record sizes and we analyze records ranging between 48B and 12kB (the largest record size in how the consolidation array’s size impacts its performance. Further Shore-MT). As log records grow the baseline performs better, but experiments in Sections A.3 and A.4 (appendix) explore the there is always enough contention that makes all other approaches impact of skew in the record size distribution and of changing the more attractive. The consolidated variant (C) performs better at number of slots in the slot array. small records sizes as it can handle contention much better than the decoupled record insert (D). But once the records size is over 1kB 6.3.1 Scalability With Respect to Thread Count contention becomes low and the decoupled insert variant fares bet- The most important metric of a log buffer is how many insertions it ter as more log inserts can be pipelined at the same time. The can sustain per unit time, or the bandwidth which the log can sus- hybrid variant again significantly outperforms its base components tain at a given average log insert size. It is important because core across the whole range, but in the end all three become bandwidth- counts grow exponentially while log record sizes are application- limited as they saturate the machine’s memory system. and DBMS-dependent and are fixed. The average record size in Finally, we modify the microbenchmark so that threads insert our workloads is about 120 bytes and a high-performance applica- their log records repeatedly into the same thread-local storage, tion generates between 100 and 200MBps of log, or between 800K which is L1 cache resident. With the memory bandwidth limitation and 1.6M log insertions per second. removed, the hybrid variant continues to scale linearly with record Figure 8(left) shows the performance of the log insertion sizes until it becomes CPU-limited at roughly 21GBps (nearly 20x microbenchmark for records of an average size of 120B as the higher throughput than today’s systems can reach). number of threads varies along the x-axis. Each data series shows one of the log variants. We can see that the baseline implementa- 6.4 Overall Impact of Aether tion quickly becomes saturated, peaking at roughly 140MB/s and To complete the experimental analysis, we successively add each falling slowly as contention increases further. Due to its complex- of the components of Aether to the baseline log system and mea- ity, the consolidation array starts out with lower throughput than sure the impact. With all components active we avoid the bottle-

8.necks summarized in Figure 1 and can identify optimizations [5] D. Gawlick and D. Kinkade. “Varieties of Concurrency Con- which are likely to have highest impact now and in the future. trol in IMS/VS Fast Path.” IEEE Database Eng. Bull. 1985. Figure 9 captures the scalability of Shore-MT running the [6] N. Hardavellas, et al. “Database servers on chip multiproces- UpdateLocation transaction from TATP. We plot throughput as the sors: limitations and opportunities.” In Proc. CIDR, 2007. number of client threads varies along the x-axis. For systems today, flush pipelining provides the largest single performance [7] S. Harizopoulos, D. J. Abadi,. S. Madden, and M. Stone- boost, 68% higher than the baseline. The scalable log buffer adds a braker. “OLTP through the looking glass, and what we found modest 7% further speedup by eliminating log contention. there.” In Proc. SIGMOD, 2008. Based on these results we conclude that the most pressing [8] P. Helland, H. Sammer, J. Lyon, R. Carr, and P. Garrett. bottleneck is scheduler overload induced by high transaction “Group Commit Timers and High-Volume Transaction Sys- throughput and the associated context switching. However, flush tems.” In Proc. HPTS, 1987. pipelining depends on ELR to prevent log-induced lock contention [9] D. Hendler, N. Shavit, and L. Yerushalmi. “A Scalable Lock- which would otherwise limit scalability. free Stack Algorithm.” In Proc. SPAA, 2004. As core counts continue to increase, we also predict that in the future log buffer contention will become a serious bottleneck [10] R. Johnson, I. Pandis, and A. Ailamaki. “Improving OLTP unless techniques such as the hybrid implementation presented in Scalability using Speculative Lock Inheritance.” In Proc. Section 5 are used. Even today, contention at the log buffer VLDB, 2009. impacts scalability to a small degree. In addition, the profiling [11] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, and B. results from Figure 7 indicate that this bottleneck is growing rap- Falsafi. “Shore-MT: a scalable storage manager for the multi- idly with core counts and will soon dominate. This indication is core era.” In Proc. EDBT, 2009. further strengthened by the fact that Shore-MT running on today’s hardware achieves almost exactly the peak log throughput we mea- [12] S.-W. Lee, B. Moon, J.-M. Kim, and S.-W. Kim. “A Case for sure in the microbenchmark for the baseline log. In other words, Flash Memory SSD in Enterprise Database Applications.” In even a slight increase in throughput (with corresponding log inser- Proc. SIGMOD, 2008. tions) will likely push the log bottleneck to the forefront. Fortu- [13] C. Mohan. “ARIES/KVL: A key-value locking method for nately, the hybrid log buffer displays no such lurking bottleneck concurrency control of multiaction transactions operating on and our microbenchmarks suggest that it has significant headroom B-tree indexes.” In Proc. VLDB, 1990. to accept additional log traffic as systems scale in the future. [14] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. 7. CONCLUSIONS Schwarz. “ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write- Log manager performance becomes increasingly important as ahead logging.” ACM TODS, 17(1), 1992. database engines continue to increase performance by exploiting hardware parallelism. However, the serial nature of the log, as well [15] M. Moir, D. Nussbaum, O. Shalev, and N. Shavit. “Using as long I/O times, threatens to turn the log into a growing bottle- Elimination to Implement Scalable FIFO Queues.” In Proc. neck. As available hardware parallelism grows exponentially, con- SPAA, 2005. tention for the central log buffer threatens to halt scalability. A new [16] Oracle. Oracle Asynchronous Commit. Oracle Database algorithm, consolidation array-based backoff, incorporates con- Advanced Application Developer's Guide. Available at: http:/ cepts from distributed systems to convert the previously serial log / insert operation into a parallel one which scales well even under b14251/adfns_sqlproc.htm. much higher contention than current systems can generate. We [17] PostgreSQL Asynchronous Commit. PostgreSQL 8.4.2 Docu- address more immediate concerns of excessive log-induced con- mentation. Available at: text switching using a combination of early lock release and log umentation/pdf/8.4/postgresql-8.4.2-A4.pdf. flush pipelining which allow transactions to commit without trig- gering scheduler activity, and without sacrificing safety or durabil- [18] A. Rafii, and D. DuBois. “Performance Tradeoffs of Group ity. Taken together, these techniques allow the database log Commit Logging.” In Proc. CMG Conference, 1989. manager to stay off the critical path of the system for maximum [19] N. Shavit and D. Touitou. “Elimination Trees and the Con- performance even as available parallelism continues to increase. struction of Pools and Stacks.” In Theory of Computing Sys- tems, 30(6), pp 645-670, 1997. ACKNOWLEDGMENTS This work was partially supported by a Sloan research fellowship, [20] M. L. Scott. “Non-Blocking Timeout in Scalable Queue- NSF grants CCR-0205544, IIS-0133686, and IIS-0713409, an ESF Based Spin Locks.” In Proc. PODC, 2002. EurYI award, and Swiss National Foundation funds. [21] E. Soisalon-Soininen, and T. Ylonen. “Partial Strictness in Two-Phase Locking.” In Proc. ICDT, 1995. REFERENCES [22] M. Stonebraker, et al. “The end of an Architectural Era (It’s [1] L. Bouganim, B. Jonsson, and P. Bonnet. “uFlip: Understand- Time for a Complete Rewrite).” In Proc. VLDB, 2007. ing Flash I/O Patterns.” In Proc. CIDR, 2009. [23] Telecom Application Transaction Processing Benchmark [2] M. Carey, et al. “Shoring up persistent applications.” In Proc. (TATP). TATP Benchmark Description. Available at: http:// SIGMOD, 1994. [3] S. Chen. “FlashLogging: exploiting flash devices for synchro- [24] Transaction Processing Performance Council (TPC). TPC nous logging performance.” In Proc. SIGMOD, 2009. Benchmark B: Standard Specification. Available at http:// [4] D. DeWitt, et al. “Implementation Techniques for Main Memory Database Systems.” ACM TODS, 14(2), 1984.

9.A APPENDIX Algorithm 1 – Baseline log insertion algorithm This appendix consists of four subsections. First, we present in 1 log_insert(size, data): detail the log buffer designs, presented in Section 5, using code 2 lock_acquire(L) sketches for the various algorithms (Section A.1). Second, we 3 lsn = buffer_acquire(size) describe in detail the consolidation array used by Algorithm 2 4 buffer_fill(lsn, size, data) 5 buffer_release(lsn, size) (Section A.2). Third, we discuss a further modification of the log 6 end buffer design to address a potential source of delays coming from 7 8 buffer_acquire(size): the requirement that all threads need to release their buffer in-order 9 /* ensure buffer space available */ (Section A.3). Fourth, we discuss about distributing the log and 10 lsn = /* update lsn and buffer state */ 11 return lsn why it is very difficult to have an efficient and scalable distributed 12 end log implementation (Section A.5). 13 14 buffer_fill(lsn, size, data): 15 /* set record’s LSN */ A.1 Details of the Log Buffer Algorithms 16 /* copy data to buffer (may wrap) */ In this subsection we explain the implementation of the algo- 17 end 18 rithms, presented in Section 5, with pseudocode sketches. 19 buffer_release(lsn, size): 20 /* release buffer up to lsn+size */ Baseline. In a straightforward implementation, a single mutex 21 lock_release(L) protects the log’s buffer, LSN generator, and other structures. 22 end Algorithm 1 presents such an approach, which the later designs Algorithm 2 – Log insertion with consolidated buffer acquire build on. In the baseline case a log insert always begins with acquiring the global mutex (L2) and finishes with its release (L21). 1 log_insert(size, data): Inside the critical section there are three operations: (i) A thread 2 if (lock_attempt(L)== SUCCESS) 3 lsn = buffer_acquire(size) first allocates log buffer space (L8-12); (ii) It then performs the 4 buffer_fill(lsn, size, data) record insert (L14-17); (iii) Finally, it releases the buffer space 5 buffer_release(lsn, size) 6 return /* no contention */ making the record insert visible to the flush daemon by increment- 7 end ing a dedicated pointer (L20). As discussed, the baseline algorithm 8 {s, offset} = slot_join(size) 9 if (0 == offset) /* slot owner */ suffers two weaknesses. First, contention is proportional to the 10 lock_acquire(L) number of threads in the system; second, the critical section length 11 group_size = slot_close(s) is proportional to the amount of work performed by each thread. 12 replace_slot(s) 13 lsn = buffer_acquire(group_size) Consolidation array. Consolidation-based backoff aims to 14 slot_notify(s, lsn, group_size) 15 else /* wait for owner */ reduce contention and, more importantly, make it independent of 16 {lsn, group_size} = slot_wait(s) the number of threads accessing the log. A sketch of the code is 17 end 18 buffer_fill(lsn+offset, size, data) presented in Algorithm 2. The primary data structure consists of an 19 if (slot_release(s) == SLOT_DONE) array with a fixed number of slots where threads can aggregate 20 buffer_release(lsn, group_size) 21 end their requests. Rather than acquiring the lock unconditionally, 22 end threads begin with a non-blocking lock attempt. If the attempt suc- ceeds, they perform the log insert directly, as before (L2-6). Algorithm 3 – Log insertion with decoupled buffer fill Threads which encounter contention back off to the consolidation 1 buffer_acquire(size, data): array and attempt to join one of its slots at random (L8). The first 2 /* wait for buffer space */ thread to claim a slot becomes the slot’s owner and is responsible 3 lsn = /* update lsn and buffer state */ 4 lock_release(L) to acquire buffer space on behalf of the group which forms while it 5 return lsn waits for the mutex. Once inside the critical section, the group 6 end leader reads the current group size and marks the group as closed 7 8 buffer_release(lsn, size): using an atomic swap instruction (L11); once a slot closes threads 9 while (lsn != next_release_lsn) can no longer join the group. The group leader then acquires buffer 10 /* wait my turn */ 11 end space and notifies the other group members before beginning its 12 /* release buffer up to lsn+size */ own buffer fill (L13-14). Meanwhile, threads which join the group 13 next_release_lsn = lsn+size 14 end infer their relative position in the meta-request from the group size; once the group leader reports the LSN and buffer location each thread can compute the exact LSN and buffer location which ber of consolidation structures at startup, which we treat as a circu- belongs to it (L16 and L18). As each thread leaves (leader lar buffer when allocating new slots. At any given moment of time included), it decrements the slot’s reference count and the last arriving threads access only the combination structures present in thread to leave releases the buffer (L19-20). the slots of the consolidation array, and those slots are returned to Once a consolidation array slot closes, it remains inaccessible the free pool after the buffer release stage. In the common case the while the threads in the group perform the consolidated log insert, next slot to be allocated was freed long ago and each “allocation” with time proportional to the log record insert size plus the over- requires only an index increment. Section A.3 describes the imple- head of releasing the buffer space. To prevent newly-arrived mentation details of the consolidation array. threads from finding all slots closed and being forced to wait, each slot owner removes the consolidation structure from the consolida- Decoupled buffer fill. Decoupling the log insert from holding the tion array when it closes, replacing it with a fresh slot that can mutex reduces the critical section length and in addition contention accommodate new threads (L12). The effect is that the array slot cannot increase with the size of the log record size. Algorithm 3 reopens even though the threads that consolidated their request are shows the changes over the baseline implementation (Algorithm 1) still working on the previous (now-private) version of that slot. We needed to decouple buffer fills from the serial LSN generation avoid memory management overheads by allocating a large num- phase. First, a thread acquires the log mutex, generates the LSN, and allocates buffer space. Then, it releases the central mutex

10.Algorithm 4 – Log insertion with delegated buffer release owner: pos = join_slot(s) PENDING SET(DONE‐total) 1 buffer_acquire(size, data): 2 /* wait for buffer space */ 3 lsn = /* update lsn and buffer state */ owner & mutex holder: 4 qnode = queue_join(Q, lsn, size) OPEN total = SWAP(PENDING) COPYING 5 lock_release(L) 6 return qnode (READY+n) (DONE‐n) 7 end 8 mutex holder: ADD(size) != DONE 9 buffer_release(qnode): SET(READY) 10 if (queue_delegate(Q, qnode) == DELEGATED) 11 return /* someone else will release*/ ADD(size) == DONE 12 end 13 FREE last one: DONE 14 do_release: 15 /* release qnode’s buffer region */ SET(FREE) 16 next = queue_handoff(Q, qnode) 17 if (next && is_delegated(next)) 18 qnode = next 19 goto do_release – DONE FREE PENDING READY + 20 end 21 end COPYING OPEN Figure 10. Life cycle and state space of a c-array slot, to accom- immediately (L4) and performs its buffer fill concurrently with pany Algorithm 5.The OPEN (COPYING) state covers all values other threads. Once the buffer fill is completed, the thread waits for at least as large (small) as READY (DONE). all other threads before it to finish their inserts (L9) and the last to finish releases the log buffer space (L13). The release stage uses tively rare because slots are swapped out of the array immediately the implicit queuing of the release_lsn to avoid expensive atomic whenever they become PENDING. Threads attempt to claim operations or mutex acquisitions. OPEN slots using atomic compare-and-swap to increment the state A.2 Consolidation-based Backoff by the insert size. In the common case the CAS fails only if another thread also incremented the slot’s size. However, the slot The consolidated log buffer acquire described in Algorithm 2 uses may also close, forcing the thread to start probing again. Eventu- a new algorithm, the consolidation array to divert contention away ally the thread succeeds in joining a slot and returns a (slot, offset) from the log buffer. We base our design on the elimination-based pair. The offset serves two purposes: the thread at position zero backoff algorithm [9], extending it to allow the extra cooperation becomes the “group leader” and must acquire space in the log buf- needed to free the buffer after threads consolidate their requests. fer on behalf of the group, and follower threads use their offset to Elimination backoff turns “opposing” operations (e.g. stack partition the resulting allocation with no further communication. push and pop) into a particularly effective form of backoff: threads which encounter contention at the main data structure probe ran- Slot close operation (lines 21-33). After the group leader acquires domly an array of N “slots” while they wait. Threads which arrive the log buffer mutex, it closes the group in order to determine the at a slot together serve each others’ requests and thereby cancel amount of log space to request (and to prevent new threads from each other out. When such eliminations occur, the participating arriving after allocation has occurred). It does so using an atomic threads return to their caller without ever entering the main data swap, which returns the current state and assigns a state of PEND- structure, slashing contention. With an appropriately-sized elimi- ING. The state change forces all further slot_join operations to fail nation array, an unbounded number of threads can use the shared (line 7), but most threads will never see this because the calling data structure without causing undue contention. thread first swaps a fresh slot into the array. To do so, it probes Consolidation backoff operates on a similar principle to elim- through the pool of available slots, searching for a FREE one. The ination, but with the complication that log inserts do not cancel pool is sized large enough to ensure the first probe nearly always each other out entirely: At least one thread from each group (the succeeds. The pool allocation need not be atomic because the “leader”) must still acquire space from the log buffer on behalf of caller already holds the log mutex. Once the slot is closed the func- the group. In this sense consolidation is more similar to a shared tion returns the group size to the caller so it can request the appro- counter than a stack, but with the further requirement that the last priate quantity of log buffer space. thread of each group to complete its buffer fill operation must Slot notify and wait operations (lines 35-46). After the slot release the group’s buffer back to the log. These additional com- owner acquires buffer space, it stores the base LSN and buffer munication points require two major differences between the con- address into the slot, then sets the slot’s state to DONE-group_size solidation array and an elimination array. First, the slot protocol as a signal to the rest of the group. Meanwhile, waiting threads which threads use to combine requests is significantly more com- spin until the state changes, then retrieve the starting LSN and size plex. Second, slots spend a significant fraction of their lifecycle of the group (the latter is necessary because any thread could be unavailable for consolidation and it becomes important to replace the one to release the group’s buffer space). busy slots with fresh ones for consolidation to remain effective under load. Algorithm 5 gives pseudocode for the consolidation Slot release and free operations (lines 48-55). As each thread array implementation, which the following paragraphs describe in completes its buffer insert, it decrements the slot’s count by its further detail, while Figure 10 supplements the pseudocode with a contribution. The last thread to release will detect that the slot summary of a slot’s life cycle and state space. became DONE must free the slot; all others may leave immedi- ately. The slot does not immediately become free, however, Slot join operation (lines 1-19). The consolidation array consists because the calling thread may still use it. This is particularly of ARRAY_SIZE pointers to active slots. Threads which enter the important for the delegated buffer release optimization described slot array start probing for slots in the OPEN state, starting at a in Section A.3, because the to-be-freed slot becomes part of a random location. Probing repeats as necessary, but should be rela- queue to be processed by some other thread. Once the slot is truly

11.Algorithm 5 – Consolidation array implementation fer acquire, each thread joins a release queue (L4), storing in the queue node all information needed to release its buffer region. The 1 slot_join(size): decoupled buffer fill proceeds as before. At buffer release time, the 2 probe_slot: 3 idx = randn(ARRAY_SIZE) thread first attempts to abandon its queue node, delegating the cor- 4 s = slot_array[idx]; responding buffer release to a (presumably slow) predecessor 5 old_state = s->state 6 join_slot: which has not yet completed its own buffer fill. The delegation 7 if(old_state < SLOT_READY) protocol is lock-free and non-blocking, and is based on the abort- 8 /* new threads not welcome */ able MCS queue lock by Scott [20] and the critical section-com- 9 goto probe_slot; 10 end bining approach suggested by Oyama et al. [A5] 11 new_state = old_state + size To summarize the protocol, a thread with at least one prede- 12 cur_state = cas_state(s, old_state, new_state) 13 if(cur_state != old_state) cessor attempts to change its queue node status atomically from 14 old_state = cur_state waiting to delegated (L10, corresponding to the aborted state in 15 goto join_slot 16 end Scott’s work). On success, a predecessor will be responsible for the 17 /* return my position within the group */ buffer release and the thread returns immediately (L11). Other- 18 return {s, old_state-SLOT_READY} wise, or if no predecessor exists, the thread releases its own buffer 19 end 20 region and attempts to leave before its successor can delegate more 21 slot_close(s): work (L16). A successful CAS from waiting to released prevents 22 retry: 23 s2 = slot_pool[pool_idx % POOL_SiZE]; the successor from abandoning its node; on failure, the thread con- 24 pool_idx = pool_idx+1 tinues to release delegated nodes until it reaches the end of the 25 if(s2->state != SLOT_FREE) 26 goto retry; queue or successfully hands off (L17-20). Threads randomly 27 end choose not to abandon their nodes with probability 1/32 to prevent 28 /* new arrivals will no longer see s */ a “treadmill” effect where one thread becomes stuck performing 29 s2->state = SLOT_OPEN 30 slot_array[s->idx] = s2 endless delegated buffer releases. This breaks long delegation 31 old_state = swap_state(s, SLOT_PENDING) chains (which are relatively rare) without impeding pipelining in 32 return old_state-SLOT_READY 33 end the common case. As with Oyama’s proposal [A5], the grouping 34 actually improves performance because a single thread does all the 35 slot_notify(s, lsn, group_size): 36 s->lsn = lsn work without incurring coherence misses. 37 s->group_size = group_size In Figure 11 we test the stability of the new algorithm (named 38 set_state(s, SLOT_DONE-group_size) CDME) and compare it with the hybrid variant from Section 5.3 39 end 40 (CD). We use the same microbenchmark setup from Section 6 but 41 slot_wait(s): modify it to present the worst-case scenario for the CD algorithm: 42 while(info->state > SLOT_DONE) 43 /* wait for notify */ a strongly bi-modal distribution of log record sizes. We fix one 44 end peak at 48 bytes (the smallest log record in Shore-MT) and we 45 return {s->lsn, s->group_size} 46 end vary the second peak (called the outlier). For every 60 small 47 records a large record is inserted in the log. CD performs poorly 48 slot_release(s, size): with such a workload because the rare, large, record can block 49 new_state = state_atomic_add(s, size) 50 return new_state many smaller ones and disrupt the pipelining effect. We present 51 end along the y-axis the throughput as we increase the outlier record 52 53 slot_free(s): size along the x-axis. CD and CDME perform similarly until an 54 set_state(s, SLOT_FREE) outlier size of around 8kiB, when CD stops scaling and its perfor- 55 end mance levels off. CDME, which is immune to record size variabil- ity, achieves up to double the performance of the CD for outlier finished, the owning thread sets its state to FREE; the operation records larger than 65kiB. need not be atomic because other threads ignore closed slots. The CDME algorithm is more robust than the CD variant but, In conclusion, the consolidation array provides a way for for the database workloads we examined, it is unnecessary in prac- threads to communicate in a much more distributed fashion than tice because nearly all records are small and the frequency of the original (serial) log buffer operation which it protects. The larger outliers is orders magnitude smaller than examined here. For overhead is small, in the common case two or three atomic opera- example, in Shore-MT the largest log record is 12kiB with a fre- tions per participating thread, and occurs entirely off the critical quency of 0.01% of the total log inserts. In addition, CDME path (other threads continue to access the log unimpeded). achieves around 10% lower throughput than the CD variant under normal circumstance, making it unattractive. Nevertheless, for A.3 Delegated Log Buffer Release and Skew other configurations which encounter significant skew, the CDME The requirement that all threads release their buffers in order algorithm might be attractive given its stability guarantee. remains a potential source of delays. Many smaller insertions might execute entirely in the shadow of a large one but must still A.4 Sensitivity to consolidation array size wait for the large insert to complete before releasing their buffer Our last microbenchmark analyzes whether (and by how much) the space. Buffer and log file wraparounds complicate matters further, consolidation array’s performance is affected by the number of because they prevent threads from consolidating buffer releases. available slots. Ideally the performance should depend only on the Such wrapping buffer releases must be identified and processed hardware and be stable as thread counts vary. Figure 12 shows a separately from normal ones because they impose extra work at contour map of the space of slot sizes and thread counts, where the log flush time, such as closing and opening log files. height of each data point is its sustained bandwidth. Lighter colors We remove this extra dependency between transactions by indicate higher bandwidth, with contour lines marking specific turning the implied LSN queue into a physical data structure, as throughput levels. We achieve peak performance with 3-4 slots, shown in Algorithm 4. Before releasing the mutex, during the buf- with lower thread counts peaking with fewer and high thread

12. Throughput (GB/s) 60 8 CDME 50 1700 6 CD 40 1600 Thread# 1400 4 30 1200 1000 2 20 800 0 10 400 16 512 16384 0 Outlier record size (bytes) 1 2 3 4 5 6 7 8 9 10 Slot# Figure 11. Performance impact of log Figure 12. Sensitivity to the number of slots Figure 13. Inter-log dependencies for 1ms record size skew. in the consolidation array of TPC-C (8 logs, ~100kB, ~30 commits). counts requiring a somewhat larger array. The optimal slot number recently inserted records for its log at the time. The entire graph corresponds closely with the number of threads required to saturate covers roughly 100kB of log records, which corresponds to less the baseline log which the consolidation array protects. Based on than 1ms wall time and dozens of transaction commits. these results we fix the consolidation array size at four slots to Because dependencies are so widespread and frequent, it is favor high thread counts; at low thread counts the log is not on the almost infeasible to track them, and even if tracked efficiently the critical path of the system and its peak performance therefore mat- dependencies would still require most transactions to flush multi- ters much less than at high thread counts. ple logs at commit time. In Figure 13 there is no obvious way of assigning log records to different partitions so that the dependency A.5 A Case Against Distributed Logging lines between partitions would be significantly reduced. The This subsection presents qualitative and quantitative analysis authors are unaware of any DBMS which distributes the log within showing that our improved log buffer design is likely to outper- a single node, and even distributed DBMS often opt for a shared form distributed logging as a contention-reducing approach, both log (including Rdb/VMS [A4]). Distributed DBMS which utilize from a performance and implementation perspective. distributed logging either force transactions to span multiple nodes A distributed log has the potential to ease bottlenecks by (with well-known consequences for performance and spreading load over N logs instead of just one. ARIES-style recov- scalability [A1]) or else migrate dirty pages between nodes ery only requires a partial order between the transactions accessing through a shared storage or network interconnect rather than the same data. Intuitively, it should be possible to parallelize the accepting the high overhead of having a distributed transaction that log, given that most transactions execute in parallel without con- needs to flush multiple logs in a specific sequence [A2]. flicts. However, a distributed log must track transaction dependen- Using physical-only logging and having an almost-perfectly cies and make sure that logs become durable in a coherent order, as partitionable workload makes the implementation of a distributed discussed by DeWitt et al. [4]. log feasible [A3]. However, if physiological logging is used and as Write-ahead logging allows transactions to release page Figure 13 shows, distributed logs are both highly complex and latches immediately after use, minimizing data contention and potentially very slow under many workloads. We conclude that allowing database pages to accumulate many changes in the buffer adding a distributed log manager within a database instance is nei- pool before being written back to disk. Further, serial logging ther attractive nor feasible for reducing log buffer contention. allows transactions to not track physical dependencies, especially those that arise with physiological logging,4 as a transaction’s APPENDIX REFERENCES commit never reaches disk before its dependencies. A distributed [A1] P. Helland. “Life Beyond Distributed Transactions: an Apos- implementation must instead track or eliminate these physical tate's Opinion.” In Proc. CIDR, 2007. dependencies without requiring multiple log flushes per transac- [A2] T. Lahiri, V. Srihari, W. Chan, and N. MacNaughton. “Cache tion. Otherwise, the serial implementation will actually be faster. Fusion: Extending shared-disk clusters with shared caches.” Unfortunately, this challenge is difficult to address efficiently In Proc. VLDB, 2001. because physical dependencies can be very tight, especially due to hot database pages. For example, Figure 13 shows the dependen- [A3] D. Lomet. “Recovery for Shared Disk Systems Using Multi- cies which would arise in an 8-way distributed log for a system ple Redo Logs.” CRL 90/4, 1990. running the TPC-C benchmark [A6]. Each node in the graph repre- [A4] D. Lomet, R. Anderson, T. K. Rengarajan, and P. Spiro. sents a log record, with horizontal edges connecting records from “How the Rdb/VMS Data Sharing System Became Fast.” the same log. Diagonal edges mark physical dependencies which CRL 92/4, 1992. arise when a page moves between logs. Dark edges mark tight [A5] Y. Oyama, K. Taura, and A. Yonezawa. “Executing Parallel dependencies where the older record is one of the five most Programs with Synchronization Bottlenecks Efficiently.” In Proc. PDSIA, 1999, pp. 182--204. 4. For example, if transaction A inserts a record in slot 13 of a page, and then B inserts a record in slot 14, A’s log record must become durable [A6] Transaction Processing Performance Council. “TPC - C v5.5: first or recovery could encounter an inconsistent page and fail. On-Line Transaction Processing (OLTP) Benchmark.”