A Scalable Storage Manager for the Multicore Era

Database storage managers have long been able to efficiently handle multiple concurrent requests. Until recently, however, a computer contained only a few single-core CPUs, and therefore only a few transactions could simultaneously access the storage manager's internal structures. This allowed storage managers to use non-scalable approaches without any penalty. With the arrival of multi-core chips, however, this situation is rapidly changing. More and more threads can run in parallel, stressing the internal scalability of the storage manager. Systems optimized for high performance at a limited number of cores are not assured similarly high performance at a higher core count, because unanticipated scalability obstacles arise.

1. Shore-MT: A Scalable Storage Manager for the Multicore Era Ryan Johnson,12 Ippokratis Pandis,1 Nikos Hardavellas,1 Anastasia Ailamaki,12 and Babak Falsafi2 1Carnegie Mellon University, USA 2École Polytechnique Fédérale de Lausanne, Switzerland ABSTRACT 10 Postgres Database storage managers have long been able to efficiently MySql handle multiple concurrent requests. Until recently, however, a Norm. Throughput 8 Shore computer contained only a few single-core CPUs, and therefore only a few transactions could simultaneously access the storage BDB 6 manager's internal structures. This allowed storage managers to use non-scalable approaches without any penalty. With the arrival 4 of multicore chips, however, this situation is rapidly changing. More and more threads can run in parallel, stressing the internal 2 scalability of the storage manager. Systems optimized for high 0 performance at a limited number of cores are not assured simi- larly high performance at a higher core count, because unantici- 0 4 8 12 16 20 24 28 32 pated scalability obstacles arise. Concurrent Threads We benchmark four popular open-source storage managers (Shore, BerkeleyDB, MySQL, and PostgreSQL) on a modern Figure 1.Scalability as a function of available hardware contexts multicore machine, and find that they all suffer in terms of scal- single-thread performance to remain the same or increase slowly ability. We briefly examine the bottlenecks in the various storage while the number of available hardware contexts grows exponen- engines. We then present Shore-MT, a multithreaded and highly tially. Though shared-nothing query processing is very effective, scalable version of Shore which we developed by identifying and distributed transactions do not scale well and have not seen wide- successively removing internal bottlenecks. When compared to spread adoption [18][11]. As a result, the storage manager for a other DBMS, Shore-MT exhibits superior scalability and 2-4 transaction engine must be able to utilize the dozens of hardware times higher absolute throughput than its peers. We also show contexts that will soon be available to it. However, the internal that designers should favor scalability to single-thread perfor- scalability of storage managers has not been tested under such mance, and highlight important principles for writing scalable rigorous demands before. storage engines, illustrated with real examples from the develop- ment of Shore-MT. 1.1 How do existing storage managers scale? To determine how well existing storage managers scale, we 1. INTRODUCTION experiment with four popular open-source storage managers: Most database storage manager designs date back to the 1980's, Shore [7], BerkeleyDB [1], MySQL [2], and PostgreSQL [30]. when disk I/O was the predominant bottleneck. Machines typi- The latter three engines are all widely deployed in commercial cally featured 1-8 processors (up to 64 at the very high end) and systems. We ran our experiments on a Sun T2000 (Niagara) limited RAM. Single-thread speed, minimal RAM footprint, and server, which features eight cores and four hardware thread con- I/O subsystem efficiency determined overall performance of a texts per core for a total of 32 OS-visible “processors.” storage manager. Efficiently multiplexing concurrent transactions to hide disk latency was the key to high throughput. Research Our first experiment consists of a microbenchmark where each focused on efficient buffer pool management, fine-grain concur- client in the system creates a private table and repeatedly inserts rency control, and sophisticated caching and logging schemes. records into it (see Section 3 for details). This setup ensures there is no contention for database locks or latches, and that there is no Today's database systems face a different environment. Main I/O on the critical path. Figure 1, shows the results of executing memories are in the order of several tens of gigabytes, and play this microbenchmark on each of the four storage managers. The the role of disk for many applications whose working set fits in number of concurrent threads varies along the x-axis with the cor- memory [15]. Modern CPU designs all feature multiple proces- responding throughput for each engine on the y-axis. As the num- sor cores per chip, often with each core providing some flavor of ber of concurrent threads grows from 1 to 32, throughput in a hardware multithreading. For the forseeable future we can expect perfectly scalable system should increase linearly. However, none of the four systems scales well, and their behavior varies from Permission to copy without fee all or part of this material is granted pro- arriving at a plateau (PostgreSQL and Shore) to a significant drop vided that the copies are not made or distributed for direct commercial in throughput (BerkeleyDB and MySQL). These results suggest advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the ACM. that, with core counts doubling every two years, none of these To copy otherwise, or to republish, to post on servers or to redistribute to systems is ready for the multicore era. lists, requires a fee and/or special permissions from the publisher, ACM. In retrospect these results are understandable because, when the EDBT'09, March 24-26, 2009, Saint Petersburg, Russia. engines were developed, internal scalability was not a bottleneck Copyright 2009 ACM 978-1-60558-422-5/09/0003 ...$5.00. 24

2.and designers did not foresee the coming shift to multicore hard- ware. It would have been difficult to justify spending consider- 16 Pentium Hardware Contexts able effort in this area. Today, however, internal scalability of Itanium these engines is key to performance as core counts continue 12 Intel Core2 increasing. UltraSparc 8 IBM Power 1.2 Creating a scalable storage manager AMD Because none of the existing storage managers provide the kind of scalability required for modern multicore hardware, we set out 4 to create a scalable storage manager based on the Shore storage manager. This exercise resulted in Shore-MT, which scales far 0 better than its open source peers while also achieving superior single-thread performance (as shown in the experimental section). 1990 1995 2000 2005 2010 We find that the remaining gap between Shore-MT’s scalability Year and the ideal is due to hardware multithreading in the Niagara processor — threads contend for hardware resources within the Figure 2.Number of HW contexts per chip as a function of time. processor itself, limiting the scalability software can achieve [29]. a commercial system and 2-4 times as fast as the fastest open- We use the lessons learned while creating Shore-MT to distill a source system, across all of our experiments. set of important principles for developing scalable storage manag- • Using lessons learned from the development of Shore-MT we ers, many of which also apply to parallel software in general: show that designers must focus primarily on scalability rather • Efficient synchronization primitives are critical. Poorly- than on single-thread performance. This focus will allow stor- designed or -applied primitives destroy scalability. age managers to exploit the parallelism future multicore chips • Every component must be explicitly designed for scalability make available, and our experience with Shore-MT shows that or it will become a bottleneck. Common operations must be single-thread performance can actually improve as a side able to proceed in parallel unless they truly conflict. effect as well. • Hotspots must be eliminated, even when the hot data is read- This paper is organized into two main parts. The first (Sections 2- mostly. The serial overhead imposed by acquiring a shared- 4) details the trends and internal components that lead to bottle- mode latch can grow quickly into a bottleneck, for example. necks in current systems, and examines the bottlenecks in each of • Abstraction and encapsulations do not mix well with critical the four storage managers. The second part (Sections 5-7) pre- sections. We observed large gains by merging data structures sents Shore-MT, measuring its performance and scalability and with the synchronization primitives that protect them. detailing lessons and principles we learned that can be applied to We present these principles as both a corroboration of prior work other engines. We conclude in Section 8. and an example of their practical application in a software system as large and complex as a storage manager. In particular, we dem- 2. BACKGROUND AND RELATED WORK onstrate how systematically applying these principles can trans- This section explains the recent architectural trends that make form an existing, poorly performing storage manager to a fully parallelism and scalability the most critical aspects of modern competitive and scalable one. software design, as well as the main components in a database storage manager that determine the system’s behavior. 1.3 Contributions and Paper Organization To the best of our knowledge, this is the first paper to study and 2.1 DB Computing in the Multicore Era compare the internal scalability of open-source database storage Despite Moore’s law doubling transistor density every two years, managers on a modern, many-core machine. The contributions of by the early 2000s it became clear to chip designers that power- this paper are: related issues[21][10] prohibit commensurate increases in unipro- • We identify the bottlenecks which prevent all of the open- cessor speed. Instead, each new generation of processors incorpo- source database storage managers we tested from scaling well rates an ever-increasing number of processors (cores) on the same on a modern multi-core platform. Our experiments show that chip, leading the computing world into the so-called “multi-core performance flattens or even decreases after 4-12 hardware era.” Section 2 illustrates how core counts of all popular architec- contexts. tures are growing exponentially; uniprocessor systems are • We develop Shore-MT — a multithreaded and scalable ver- increasingly difficult to find due to stagnant single-thread perfor- sion of the Shore storage manager [7] — and make it available mance. As a result, all software must be able to exploit parallel- to the research community.1 ism for performance rather than depending on succeeding • Shore-MT achieves superior scalability and excellent single- processor generations to provide significant speedup to individ- thread performance compared to its peers. It is twice as fast as ual threads. Software designers should focus first on scalability, rather than single-thread performance, to ensure maximum sys- tem throughput as core counts grow exponentially. 1. See http://www.cs.cmu.edu/~stageddb/systems/shoremt.html 25

3.This shift in processor design has jump-started a wide effort to extract parallelism from all types of existing software. While Transaction Lock Manager most database applications are inherently parallel, current data- Management base storage managers are designed under the assumption that only a limited number of threads will access their internal data structures during any given instant. Even when the application is executing parallelizable tasks and even if disk I/O is off the criti- Log Metadata Free Space cal path, serializing accesses to these internal data structures Manager Manager Management impedes scalability. An orthogonal way of scaling up a system to handle multiple requests in parallel is to use a distributed database. In fact, most Buffer Pool Manager Latching of today’s large database installations run on clusters of machines using either shared-nothing [13][12] or shared-disk [4] configura- Figure 3.Main components of a storage manager tions. Recent proliferation of virtualization technology [6] allows system administrators to even go one step further and create vir- memory the buffer pool manager must fetch it from disk (evicting tual nodes within a single multicore machine. This is a desirable some other page) while the application waits. Applications “pin” way of utilizing multicore systems in many applications, but in-use pages in the buffer pool to prevent their being evicted too transactional workloads favor a DBMS deployment with a single soon, and unpin them when finished. Finally, the buffer pool instance in the (multicore) node. Distributed transactions are manager and log manager (below) are responsible for ensuring notoriously difficult to implement correctly and efficiently that modified pages are flushed to disk (preferably in the back- [18][11], leading to severe performance degradation when multi- ground) so that changes to in-memory data become durable. ple storage managers must participate in a transaction. Given the Buffer pools are typically implemented as large hash tables in need for high performance inside a single multicore node, it is order to find quickly any page requested by an application. Oper- important to understand which factors may hinder scalability and ations within the hash table must be protected from concurrent discover ways to overcome them. structural changes caused by evictions, usually with per-bucket As we study in this paper, building a scalable database storage mutex locks. Hash collisions and hot pages can cause contention engine is challenging. One reason for that is the wide range of among threads for the hash buckets; growing memory capacities operations the storage engine is expected to support. However, if and hardware context counts increase the frequency of page the domain is specific, or the requirements are relaxed the goal requests, and hence the pressure, which the buffer pool must deal for scalability is feasible. For example, recent work proposes sys- with. Finally, the buffer pool must flush dirty pages and identify tems that provide relaxed consistency semantics, such as Ama- suitable candidates for eviction without impacting requests from zon's Dynamo [11], and systems that operate exclusively in main- applications. memory, such as H-Store [31]. These kinds of systems achieve significant gains in performance and/or scalability by giving up 2.2.2 Page latches features traditionally provided by database storage engines. In Pinning a page in the buffer pool ensures that it will remain in specialized domains these systems perform admirably, but gen- memory but does not protect its contents from concurrent modifi- eral-purpose database storage managers are still necessary for cation. To protect its integrity, each page has a reader-writer lock many transactional workloads, and their scalability is crucial. called a latch associated with it; each operation acquires the latch in either read or write mode before accessing the page.2 2.2 Inside a Storage Manager Latches are a potential bottleneck for two reasons. First, very hot To provide a base for the discussions in the rest of the paper, this data such as metadata and high-level index pages are accessed so section briefly describes each component and briefly highlights frequently that even compatible read operations can serialize some of the potential scalability challenges it presents. Figure 3 attempting to acquire the latch [20]. Second, latch operations are illustrates the major components of a storage manager and how typically very short, but when an transaction blocks on I/O it can they communicate. The storage manager forms the heart of a hold a latch for several milliseconds and serialize many transac- database management system. It is responsible for maintaining tions behind it. durability and consistency in the system by coordinating active transactions and their access to data, both in memory and on disk. 2.2.3 Lock manager Database locks enforce logical consistency at the transaction 2.2.1 Buffer pool manager level, ensuring that transactions do not interfere with the correct- The buffer pool manager fills two critical functions for the data- ness of other concurrent transactions. One of the most intuitive base. First, it presents the rest of the system with the illusion that (and restrictive) consistency models is “two phase locking” the entire database resides in main memory, similar to an operat- ing system’s virtual memory manager. The buffer pool is a set of 2. Multi-versioned buffer pools avoid latches entirely by providing “frames,” each of which can hold one page of data from disk. copy-on-write semantics for pages in memory. Transactional When an application requests a database page not currently in workloads typically access only a few bytes per page per trans- action and quickly fill the buffer pool with near-identical copies of hot pages at the expense of other data. 26

4.(2PL), which dictates that a transaction may not acquire any new 3. EXPERIMENTAL ENVIRONMENT locks once it has released any. This scheme is sufficient to ensure Because the basis of this work lies in evaluating the performance that all transactions appear to execute in some serial order, though of database engines, we begin by describing our experimental it can also restrict concurrency. In order to balance the overhead environment. All experiments were conducted using a Sun T2000 of locking with concurrency, the database provides hierarchical (Niagara) server [21][10] running Solaris 10. The Niagara chip locks. In order to modify a row, for example, a transaction has an aggressive multi-core architecture with 8 cores clocked at acquires a database lock, table lock, and row lock; meanwhile 1GHz; each core supports 4 thread contexts, for a total of 32 OS- transactions which access a large fraction of a table may reduce visible "processors." The 8 cores share a common 3MB L2 cache overhead by “escalating” to coarser-grained locking at the table and each of them is clocked at 1GHz. The machine is configured level. with 16GB of RAM and its I/O subsystem consists of a RAID-0 Because each row in the database has a logical lock associated disk array with 11 15kRPM disks. with it, the lock manager maintains a pool of locks and lock We relied heavily on the Sun Studio development suite, which requests similar to a buffer pool; however, as the number of integrates compiler, debugger, and performance analysis tools. active locks can change drastically over time, the hash table is unless otherwise stated every system is compiled using version likely to have longer chains than the buffer pool, leading to more 5.9 of Sun’s CC. All profiler results were obtained using the ‘col- contention for buckets. Additionally, hierarchical locking results lect’ utility, which performs sample-based profiling on unmodi- in extremely hot locks which most or all transactions acquire; fied executables and imposes very low overhead (<5%). again, this contention can serialize lock requests that will eventu- ally turn out to be compatible. 3.1 Storage Engines Tested 2.2.4 Log manager We evaluate four open-source storage managers: PostgreSQL [30], MySQL [2], BerkeleyDB [1], and Shore [7]. For compari- Database operations log all operations in order to ensure that they son and validation of the results, we also present measurements are not lost if the system database fails before the buffer pool from a commercial database manager (DBMS "X").3 All database flushes those changes to disk. The log also allows the database to data resides on the RAID-0 array, with log files sent to an in- roll back modifications in the event of a transaction abort. Most memory file system. The goal of our experiments is to exercise storage managers follow the ARIES scheme [25][26], which all the components of the database engine (including I/O, locking combines the log, buffer pool manager, and concurrency control and logging), but without imposing I/O bottlenecks. Unless other- schemes into a comprehensive recovery scheme. wise noted, all storage managers were configured with 4GB The database log is a serial record of all modifications to the buffer pools. database, and therefore forms a potential scalability bottleneck as In order to extract the highest possible performance from each the number of concurrent operations increases. storage manager, we customized our benchmarks to interface with each storage manager directly through its respective C API. Cli- 2.2.5 Transaction management ent code executed on the same machine as the database server, The storage manager must maintain information about all active but we found the overhead of clients to be negligible (<5%). transactions, especially the newest and oldest in the system, in order to coordinate services such as checkpointing and recovery. PostgreSQL v8.1.4: PostgreSQL is an open source database Checkpointing allows the log manager to discard old log entries, management system providing a powerful optimizer and many saving space and shortening recovery time. However, no transac- advanced features. We used a Sun distribution of PostgreSQL tions may begin or end during checkpoint generation, producing a optimized specifically for the T2000. We configured PostgreSQL potential bottleneck unless checkpoints are very fast. with a 3.5GB buffer pool, the largest allowed for a 32-bit binary.4 The client drivers make extensive use of SQL prepared state- 2.2.6 Free space and metadata management ments. The storage manager must manage disk space efficiently across MySQL v5.1.22-rc: MySQL is a very popular open-source data- many insertions and deletions, in the same way that malloc() and base server recently acquired by Sun. We configured and com- free() manage memory. It is especially important that pages which piled MySQL from sources using InnoDB as the underlying are scanned regularly by transactions be allocated sequentially in transactional storage engine. InnoDB is a full transactional stor- order to improve disk access times; table reorganizations are age manager (unlike the default, MyISAM). Client drivers use occasionally necessary in order to improve data layout on disk. In dynamic SQL syntax calling stored procedures because we found addition, databases store information about the data they store, they provided significantly better performance than prepared and applications make heavy use of this metadata. statements. The storage manager must ensure that changes to metadata and BerkeleyDB v4.6.21: BerkeleyDB is an open source, embedded free space do not corrupt running transactions, while also servic- database engine currently developed by Oracle and optimized for ing a high volume of requests, especially for metadata. C/C++ applications running known workloads. It provides full storage manager capabilities but client drivers link against the 3. Licensing restrictions prevent us from disclosing the vendor. 4. The release notes mention subpar 64-bit performance 27

5.database library and make calls directly into it through the C++ 10 shore API, avoiding the overhead of a SQL front end. BerkeleyDB is Throughput (tps/thread) fully reentrant but depends on the client application to provide bdb multithreaded execution. We note that BerkeleyDB is the only mysql storage engine without row-level locking; its page-level locks can postgres severely limit concurrency in transactional workloads. 1 dbms "X" Shore: Shore was developed at the University of Wisconsin in the early 1990’s and provides features that all modern DBMS use: shore-mt full concurrency control and recovery with two-phase row-level locking and write-ahead logging, along with a robust implementa- tion of B+Tree indexes. The Shore storage manager is designed to 0.1 be either an embedded database or the back end for a "value- 0 8 16 24 32 added server" implementing more advanced operations. Client Concurrent Threads driver code links directly to the storage manager and calls into it Figure 4.Scalability and performance comparison of Shore-MT vs using the API provided for value-added servers.The client code several open-source engines and one commercial engine must use the threading library that Shore provides. formance of the storage managers for more realistic workloads; as the following sections show, the overall scalability trends are the 3.2 Benchmarks same for all the benchmarks. We evaluate storage managers using a small suite of microbench- marks which each stresses the engine in different ways. We are 4. EVALUATION OF EXISTING ENGINES interested in two metrics: throughput (e.g. transactions per sec- We begin by benchmarking each database storage manager under ond) and scalability (how throughput varies with the number of test and highlight the most significant factors that limit its scal- active threads). Ideally an engine would be both fast and scalable, ability. Figure 4 compares the scalability of the various engines but as we will see, storage managers tend to be either fast or scal- when we run the insert-only micro-benchmark. Due to lock con- able, but not both. tention in the transactional benchmarks, the internals of the Record Insertion: The first microbenchmark repeatedly inserts engines do not face the kind of pressure they do on the insert- records into a database table backed by a B-Tree index. Each cli- only benchmark. Thus we use the latter to expose the scalability ent uses a private table; there is no logical contention and no I/O bottlenecks at high core counts and to highlight the expected on the critical path.5 Transactions commit every 1000 records, behavior of the transactional benchmarks as the number of hard- with one exception. We observed a severe bottleneck in log ware contexts per chip continues to increase. flushes for MySQL/InnoDB and modified its version of the To have a better insight on what is going on we profile the runs benchmark to commit every 10000 records in order to allow a with multiple concurrent clients (16 or 24) stressing up the stor- meaningful comparison against the other engines. Record inser- age engine. Then we collect the results and interpret call stacks to tion stresses primarily the free space manager, buffer pool, and identify the operations where each system spends its time. log manager. We use this benchmark for the primary scalability study in the next section because it is entirely free from logical PostgreSQL: PostgreSQL suffers a loss of parallelism due to contention, unlike the other two. three main factors. First, contention for log inserts causes threads to block (XLogInsert). Second, calls to malloc add more TPC-C Payment: The TPC-C benchmark models a workload of serialization during transaction creation and deletion (Create- short transactions arriving at high frequency [33]. The Payment ExecutorState and ExecutorEnd). Finally, transactions transaction updates the customer's balance and corresponding dis- block while trying to lock index metadata (ExecOpenIndi- trict and warehouse sales statistics. It is the smallest transaction ces), even though no two transactions ever access the same of the TPC-C transaction mix, reading 1-3 rows and updating 4 table. Together these bottlenecks only account for 10-15% of total others. One of the updates made by Payment is to a contended thread time, but that is enough to limit scalability. table, WAREHOUSE. Payment stresses the lock manager, log manager, and B-Tree probes. MySQL: MySQL/InnoDB is bottlenecked on two spots. The first one is the interface to InnoDB; in a function called TPC-C New Order: The New Order is a medium-weight trans- srv_conc_enter_innodb threads remain blocked as long as action which enters an order and its line items into the system, as around the 39% of the total execution time. The second one are well as updating customer and stock information to reflect the the log flushes. In another function labeled log_- change. It inserts roughly a dozen records in addition to reading preflush_pool_modified_pages the system again expe- and updating existing rows. This transaction stresses B-Tree riences large blocking time equal to the 20% of the total execu- indexes (probes and insertions) and the lock manager. tion time (even after increasing transaction length to 10K inserts). Payment and New Order together comprise 88% of transactions We also observe that mysql spends a non-trivial fraction of its executed by the TPC-C benchmark. We use them to measure per- time on two malloc-related functions, take_deferred_- signal and mutex_lock_internal. This suggests a 5. All the engines use asynchronous page cleaning and generated potential for improvement by avoiding excessive use of malloc more than 40MB/sec of disk traffic during the tests. (trash stacks, object re-use, thread-local malloc libraries…) 28

6. 1000 Postgres 1000 Throughput(tps)/client Throughput (tps/client) Dbm s "X" Shore-MT 100 100 10 10 0 8 16 24 32 0 8 16 24 32 Clients Clients Figure 5.Per-client throughput of Shore-MT, DBMS “X” and PostgreSQL for the New Order (left) and Payment (right) mircrobenchmarks BerkeleyDB: BDB spends the majority of its time on either test- than it really is. In contrast, a log-y graph gives the same slope to ing for availability or trying to acquire a mutex — the system curves with the same scalability. spends over 80% of its processing time in two functions with Shore-MT scales commensurately with the hardware we make names _db_tas_lock and _lock_try. Presumably the available to it, setting the absolute example for other systems to former is a spinning test-and-set lock while the former is the test follow. Figure 4 shows the scalability achieved by the different of the same lock. Together these likely form a test-and-test-and- engines running the insert microbenchmark as the number of con- set mutex primitive, which is supposed to scale better than the current threads varies along the x-axis. While single-threaded simple test-and-set. The excessive use of test-and-test-and-set Shore did not scale at all, Shore-MT exhibits excellent scaling. (TATAS) locking justifies the high performance of BDB on low Moreover, at 32 clients it scales better than DBMS X, a popular contended cases, since the TATAS locks impose very little over- commercial DBMS. Therefore, the key ideas upon which a soft- head on low contention, but fail miserably on high contention. ware designer should rely to build scalable software (detailed in As we know BerkeleyDB employs page-level locking. The Section 7) lead to very good scaling performance. While our orig- coarse-grained locking by itself imposes scalability problems. inal goal was only to achieve high scalability, we also achieved The two callers for lock acquisition have names _bam_search nearly 3x speedup in single-thread performance over the original and _bam_get_root. So, in high contention BDB spends most Shore, leading to a healthy performance lead over the other of its time trying to acquire the latches for tree probes. Addition- engines.6 We attribute the performance improvement to the fact ally, we see that the system spends significant amount of time that database engines spend so much time in critical sections. The blocked waiting on a pthread_mutex_lock and cond_- process of shortening and eliminating critical sections, and reduc- wait, most probably because the pthread mutexen are used as a ing synchronization overhead, had the side effect of also shorten- fallback plan to acquire the highly contended locks. ing the total code path for a single thread. This section highlights the bottlenecks we observed in existing As a further comparison, Figure 5 shows the performance of the storage managers, and which developers cannot ignore if the goal three fastest storage managers running the New Order (left) and is a true scalable system in emerging many-core hardware. As the Payment (right) microbenchmarks. Again Shore-MT achieves the PostgreSQL case illustrates, what appears to be a small bottle- highest performance7 while scaling as well as the commercial neck can still hamper scalability as the number of concurrent cli- system for New Order. All three systems encounter high conten- ents increases tion in the STOCK and ITEM tables, causing a significant dip in scalability for all three around 16 clients. Payment, in contrast, 5. PERFORMANCE OF SHORE-MT imposes no application-level contention, allowing Shore-MT to scale all the way to 32 threads. We originally set out to make Shore a scalable storage manager. In the quest to achieve scalability in Shore we completely ignored We note that, while both Shore-MT and DBMS “X” scale well up single-thread performance and focused only on removing bottle- to 32 threads, profiling indicates that both face looming bottle- necks. This section presents the final outcome, Shore-MT, com- necks (both in log inserts, as it happens). Our experience tuning paring it with the other storage managers. Shore-MT suggests that synchronization bottlenecks are an ongo- ing battle as the number of threads continues to increase. How- Figure 4 and those that follow display performance as the number of transactions per second per thread, plotted on a log-y axis. We use log-y scale on the graphs because it shows scalability clearly 6. BerkeleyDB outperforms the other systems at first, but its per- without masking absolute performance. Linear y-axis is mislead- formance drops precipitously for more than four clients. 7. Some of Shore’s performance advantage is likely due to its cli- ing because two systems with the same scalability will have dif- ents being directly embedded in the engine while the other two ferently-sloped lines, making the faster one appear less scalable engines communicate with clients using local socket connec- tions. We would be surprised, however, if a local socket connec- tion imposed 100% overhead per transaction! 29

7.ever, Shore-MT proves that even the largest bottlenecks can be 12 addressed using the principles detailed in Section 6. Storage engine designs can make use of them in the future to ensure scal- 10 ability and take full advantage of the available hardware. Throughput (ktps) 8 6. ACHIEVING SCALABILITY We chose the Shore storage manager as our target for optimiza- 6 tion for two reasons. First, Shore supports all the major features of modern database engines: full transaction isolation, hierarchi- cal locking, a CLOCK buffer pool with replacement and prefetch 4 Re factor hints, B-Tree indexes, and ARIES-style logging and recovery. M CS m ute x Additionally, Shore has previously shown to behave like commer- 2 T&T&S m ute x cial engines at the instruction level [3], making it a good open- bpool 1 source platform for comparing against closed-source engines. 0 The optimization process from Shore to Shore-MT was straight- 0 8 16 24 32 forward. We began each step by profiling Shore in order to iden- Concurrent Threads tify the dominant bottleneck(s). After addressing each bottleneck, Figure 6.Examples of the kinds of impact optimizations can have scalability would either increase, or a previously minor bottleneck on performance and scalability would take its place. We then repeated the profiling and optimiza- tion steps until Shore became compute-bound for 32 threads. The graphs in Figure 6 give examples of the kinds of impact opti- In the absence of external bottlenecks such as I/O or unscalable mizations can have on overall performance. Each line in the database applications, virtually all bottlenecks centered on the graph depicts the performance and scalability of optimizations to various critical sections in Shore’s code. Every major component Shore-MT made between the “bpool 1” and “caching” versions of the storage manager must communicate with multiple threads, shown in Figure 7 (c.f. Section 7.3). The y-axis of Figure 6 plots and key data structures must be protected from concurrent system throughput as the number of threads in the system varies accesses that would corrupt them. along the x-axis. At this stage in Shore’s development the profiler identified a contended pthread mutex in the free space manager as the primary bottleneck in the system. 6.1 Scalability or Performance First? Traditionally, software optimization has focused on improving Our first optimization attempt replaced the pthread mutex with a single-thread performance, while addressing scalability almost as test-and-test-and-set (T&T&S) mutex having much lower over- an afterthought. Algorithms and data structures within the stor- head, in hopes of quickly relieving pressure on the critical section age manager are often designed for performance, then modified [20]. The reduced overhead improved single-thread performance after the fact to remove any scalability problems that might arise. by 90% but did not improve scalability; in fact, scalability We find that the “performance first” approach quickly leads to dropped by 50% because single-thread performance doubled with problems as the number of contexts in the system continues to no change for 32 threads. The next optimization attempt replaced climb exponentially. Optimizations that improve performance sig- the non-scalable T&T&S mutex with a scalable MCS mutex. This nificantly for one or a few threads often have a limited impact on time scalability improved noticeably but the critical section performance — or even reduce it — as more threads enter the remained contended. system. H-Store [31] is an extreme example of this effect, achiev- Having exhausted the “easy” approaches for eliminating the bot- ing very high single-thread performance but with no possibility tleneck, we examined the code more closely and determined that for scalability within a single instance of the storage manager. the free space manager acquired a page latch while holding the Three factors lead to tension between speed and scalability. First, mutex; the page latch acquire was the biggest part of the critical optimizations which impact only single-thread performance have section even in the best case; in the worst case the latch was the effect of allowing threads to reach bottleneck critical sections taken or the page was not in memory, leading to long delays faster, increasing contention. The critical sections will then limit which serialized other threads. By refactoring the code we were throughput or even reduce it if protected with unscalable synchro- able to move the latch acquire outside the critical section, remov- nization primitives. BerkeleyDB, with its TATAS locks, is a good ing the pressure and vastly improving scalability. The overhead example of this effect (see Section 4). Second, the most scalable we introduced reduced single-thread performance by about 30%, synchronization primitives tend to also have the highest overhead. but there was a net gain of about 200% for 32 threads. As a result, single-thread performance can potentially drop when This sequence of optimizations illustrate how focusing only on the software is modified to be more scalable. Finally, improving performance for a few threads can be misleading. As we saw in single-thread performance raises the bar for scalability. Even if Section 4, several open-source storage managers fell into this the throughput for many threads does not drop, the scalability trap, performing well at first but failing to scale past 4-8 threads. will have been reduced when the (now faster) single thread case This observation supports our conclusion that one should focus is used as a baseline. first on scalability, with performance as a secondary goal. 30

8.6.2 Principles for Scalable Storage Managers fill one extent completely before moving on to the next, this opti- We next introduce four fundamental principles for achieving scal- mization cut the number of page checks by over 95%. ability in a database storage manager. Problem 2: The Shore log manager originally used a non-circular • Efficient synchronization primitives are critical. Poorly- buffer and synchronous flushes. Log inserts would fill the buffer designed or -applied primitives destroy scalability. until a thread requested a flush or it became full (triggering a • Every component must be explicitly designed for scalability flush). The flush operation would drain the buffer to file before or it will become a bottleneck. Common operations must be allowing inserts to continue. able to proceed in parallel unless they truly conflict. Observation: Log inserts have nothing to do with flushes as long • Hotspots must be eliminated, even when the hot data is read- as the buffer is not full, and flushing should almost never inter- mostly. The serial overhead imposed by acquiring a shared- fere with inserts. Further, using a circular buffer means the buffer mode latch can grow quickly into a bottleneck, for example. never fills as long as flushes can keep up with inserts (on aver- • Abstraction and encapsulations do not mix well with critical age). sections. We observed large gains by merging data structures Solution: We converted Shore to use a circular buffer and pro- with the synchronization primitives that protect them. tected each operation (insert, compensate, flush) with a different As can be seen in Section 7, we applied each principle many mutex. Insert and compensate each use a light-weight queueing times to different parts of the code; the rest of this section gives mutex, while the slower flush operation uses a blocking mutex. examples from the development of Shore-MT where each princi- Inserts own the buffer head, flushes own the tail, and compensa- ple had a significant impact on scalability. tions own a marker somewhere in between (everything between tail and marker is currently being flushed). Inserts also maintain a 6.2.1 Use the right synchronization primitives cached copy of the tail pointer. If an insert would pass the cached Problem: Conceptually, every page search that hits in the buffer tail, the thread must update the tail to a more recent value and pool requires at least three critical sections - one to lock the hash potentially block until the buffer drains. Flush operations acquire bucket (temporarily preventing other thread from moving the the compensation mutex (while holding the flush mutex) just long page to other buckets during the search), one to pin it (preventing enough to update the flush marker. Log compensations acquire evictions while the thread is using the page), and a final critical only the compensation mutex, knowing that anything between the section to request a latch on the page (preventing concurrent flush marker and the insertion point is safe from flushing. Distrib- modification of the page's data). For hot pages, especially, the uting critical sections over the different operations allows unre- critical sections can become a bottleneck as many threads com- lated operations to proceed in parallel and prevents fast inserts pete for them — even if they all latch the page in read-mode. and compensations from waiting on slow flushes. Observation: Pinned pages cannot be evicted. 6.2.3 Eliminate hotspots Solution: When a buffer pool search finds a promising page it Problem: Shore's buffer pool was implemented as a open chain- applies an atomic "pin-if-pinned" operation to the page, which ing hash table protected by a single global mutex. In-transit pages increments the page's pin-count only if it is non-zero (easily (those being flushed to or read from disk) resided in a single implemented using an atomic compare-and-swap). If the page is linked list. Virtually every database operation involves multiple not already pinned the thread must lock the bucket in order to pin accesses to the buffer pool, and the global mutex became a crip- it, but if the conditional pin succeeds, the page cannot have been pling bottleneck for more than about four threads. evicted because the pin count was non-zero (though it may have been moved to a different bucket). The thread then verifies that it Observation: pinned the correct page and returns it. In the common case this • Every thread pins one buffer pool page per critical section. eliminates the critical section on the bucket. • Most buffer pool searches (80-90%) hit. • Most in-transit pages are reads thanks to page cleaning. 6.2.2 Shorten or remove critical sections Solution: We distributed locks in the buffer pool in three stages. Problem 1: Shore uses logical logging for many low-level opera- The first stage added per-bucket locks to protect the different tions such as page allocations. Because of this, Shore must verify hash buckets, leaving the global lock to protect in-transit and free that a given page belongs to the correct database table before lists and the clock hand. However, each page search had to lock a inserting new records into it. The original page allocation code bucket in order to traverse the linked list. This remained a serious entered a global critical section for every record insertion in order bottleneck for hot pages. to search page allocation tables. The second stage replaced the open chaining hash table with a 3- Observation: Page allocations do not change often. Also, The ary cuckoo hash table [27][14][32]. Cuckoo hashes use N hash information immediately becomes stale upon exit from the critical functions to identify N locations a value may legally reside in. A section. Once the page has been brought into the buffer pool, collision only occurs if all N locations for a value are full, and is Shore must check the page itself anyway. resolved by evicting some other entry from the table. Evicted entries are then re-inserted into one of their other (N-1) slots, Solution: We added a small thread-local cache to the record allo- potentially causing a cascade of evictions. Cuckoo hashes have cation routine which remembers the result of the most recent two extremely desirable properties. Like open chaining hash lookup. Because Shore allocates extents of 8 pages and tends to tables, deletions are straight-forward. Like closed hash tables, 31

9.updates and searches only interfere with each other when they 10 actually touch the same value at the same time. Because the final buffer pool is merely a cache, we can also evict particularly trou- blesome pages in order to end cascades. Cuckoo hashing does bpool 2 Performance (tps/client) face one major drawback, however: operations cost more because lock m gr 1 they must evaluate multiple high-quality hashes.8 In our case, the log improvement in concurrency more than offsets the increased cost caching of hashing by eliminating hot spots on bucket lists. bpool 1 Unfortunately, increasing the number of threads still further also bas e line 0.1 increased the number of misses per unit time (though the rate remained the same), leading to contention for in-transit lists. On every miss the buffer pool must ensure that the desired page is not currently in-transit. An in-transit-out page cannot be read back in until the dirty data has been written out, and requests for 0.01 an in-transit-in page must block until the read completes. The 0 8 16 24 32 large number of transits in progress at any moment led to long Concurrent Threads linked list traversals and slowed down miss handling. The third stage broke the in-transit lists into several smaller ones (128 in Figure 7.Performance and scalability improvements due to opti- mizations detailed in Section 7. our case) to cut the length of each list. In addition, the buffer pool immediately places in-transit-in pages in the hash table, making it • Notifications to daemon threads need not occur immediately - visible to searches and bypassing the transit lists. The thread per- the thread can safely release the insert mutex first, moving the forming the page read simply holds the page latch in EX mode to operation off the critical path of other threads. block other accesses until the I/O completes. As a result, only • Encapsulation is a means, not an end. true misses search the in-transit-list, which contains only dirty pages currently being evicted. Because effective asynchronous Solution: We rewrote the log manager to make encapsulation page cleaning makes dirty page evictions very rare, each "in-tran- boundaries surround, rather than cut through, critical sections. sit” list is currently a fixed-length array of only one element. The refactoring moved the log buffer above the file handling to finish decoupling inserts from log flushes. Threads perform only Finally, page misses release the clock hand before evicting the the minimum work required to allow them to safely insert a log selected replacement frame or attempting to read in the new data. record later, copying data and notifying daemon threads after This allows the thread to release the clock hand mutex before releasing the mutex. Finally, because log insert performance is so attempting expensive I/O. As a result, the buffer pool can service critical, the we redesigned log buffer as an extended queuing many page misses simultaneously even though an individual page lock, combining the functionality of the critical section with the miss is very slow. mechanism for protecting it. As each thread enters the log buffer's queue it computes which portion of the log buffer it will eventu- 6.2.4 Beware of over-abstraction and encapsulation ally write to. It then hands off this information to its successor in Problem: Log inserts remained a major bottleneck in Shore, even the queue at the same time as it exits the critical section. Finally, after decoupling inserts from flushes. In Shore, log inserts occa- the log flush daemon follows behind, dequeuing all threads' left- sionally acquire a blocking mutex in order to wake checkpoint over nodes as it determines which regions of the log buffer to and flush threads at regular intervals. In addition, the log manager flush. This scheme has three advantages: First, contended state of was highly encapsulated and abstracted in ways that cut through the log buffer (insert offset) is passed from thread to thread in an critical sections. For example, the log buffer was embedded deep orderly fashion, with no contention at hand-off. By combining the in the file-handling portion of the log manager in a way that made functionality of an MCS queuing lock [24] with the log buffer, we a circular log difficult to implement. Overall, the insert critical also consolidate the code and eliminate overhead. Finally, section was far too long. because a thread only needs to determine where it will eventually Observations: insert (not when), it only serializes threads behind it for a short time. • The log insert critical section only needs to determine where in the buffer the write will (eventually) occur, with what log sequence number (LSN), and whether the buffer currently has 7. FROM SHORE TO SHORE-MT space. Once the location of the write is settled, the thread is This section details the process of converting Shore into Shore- free to insert the log record at its leisure, perhaps even after MT. We began with the version 5.01 release9 and methodically waiting for a flush to complete. removed scalability impediments identified by Sun’s sampling profiler, collect. The improvements clustered into several stages; many optimizations removed a bottleneck, only to have another replace it with no change in scalability. Figure 7 shows the per- formance of Shore-MT running the insert microbenchmark after 8. Cuckoo hashing is extremely prone to clustering with weak hash functions. Our implementation combines three universal hash functions to make one high-quality hash [27]. 9. Available at http://ftp.cs.wisc.edu/paradise/sm5.0/ 32

10.each major stage of the optimization process. The rest of this sec- ID of the oldest transaction in the system. We added a local vari- tion describes the major changes for each phase. able to store this ID; callers could read it atomically because IDs are 64-bit integers, and committing transactions would update the 7.1 Baseline ID when they removed themselves from the list. Shore implements a user-level thread library (smthreads) on top Finally, we made one more optimization to the buffer pool; we of a single operating system thread; auxiliary “diskrw” processes already eliminated one of the three critical sections required to service blocking I/O requests and communicate with Shore latch a page (the pin operation), and here we added a small array through shared memory regions. In order to begin making Shore for especially hot pages; instead of protecting the array with a scalable we first had to replace the thread package with standard mutex, we changed the search to pin the page, then check its ID POSIX threads (already partly supported) and modify the Shore before acquiring the latch; if a page eviction occurs before the pin scheduler to allow them to run in parallel. We also removed the completes the IDs would not match; once the pin succeeds the external I/O engines as they were no longer necessary. Because caller is assured that the page will not move. Shore was written for cooperative multithreading on a single OS The “caching” line in Figure 7 shows the resulting performance. thread, there were many low-level data races in the code due to Single-thread speed did not change, but scalability nearly dou- unprotected critical sections. We identified these critical sections bled. using code inspection (many were identified by existing com- ments) as well as Sun’s race detection tool. We then inserted Principles applied: POSIX mutex locks to protect each critical section. Higher-level • shorten or remove critical sections critical sections were already properly protected because lock • eliminate hotspots contention and I/O could cause inconvenient context switches • eliminate counterproductive abstraction even under cooperative multithreading. At this stage we also fixed several portability bugs due to our sparc64/solaris environ- 7.4 Log Manager ment (Shore only officially supports x86/linux). Shore’s log manager presents a single API to the rest of the sys- The resulting system (“baseline” in Figure 7) was correct but tem, implemented as a virtual class backed by a fairly complex completely unscalable; throughput ranged from 2.4tps at one hierarchy of subclasses. The entire component was protected by a thread to about 1.2tps for four or more threads. single mutex, even though there are four major log operations, each with varying cost. In order to allow these (usually compati- 7.2 Bufferpool Manager ble) operations to proceed in parallel, we split the log critical sec- tion into four, as described in Section 6.2.2. We further optimized The bufferpool manager was the first major scalability challenge. the log insert operation, as described in Section 6.2.4, so that Shore protected it using a single, global mutex that very quickly insertions serialize just long enough to claim buffer space; the became contended. We replaced the global mutex with one mutex transactions then copy log entries into the space in parallel, per hash bucket. At this point we also applied the atomic pin blocking on a full buffer if necessary, then notify the log flush count update optimization described in Section 6.2.1. Finally, we daemon when they complete. As a result, the most common log replaced several key pthread mutex instances with test-and-set operation — insert — requires an extremely short critical section. spinlocks that acquire a pthread mutex and cond var only under contention. This reduced the overhead of the common (uncon- At this point the profiler identified a significant bottleneck in tended) case significantly. calls to malloc() and free(). Fortunately, Solaris provides a thread- local version of these functions which trades higher memory utili- These changes (“bpool 1”) boosted scalability significantly, dou- zation for zero contention between threads. bling single-thread performance and increasing 32-thread throughput by nearly 5x. The free space manager also became a problem again at this point as well — this time in the metadata check to determine which Principles applied: table a given page ID belongs to. We alleviated the bottleneck by • shorten or remove critical sections creating a small cache of most recently-used extent ids (an extent • eliminate hotspots is 8 consecutive pages), which allowed the hottest page accesses • use the right synchronization primitive to avoid accessing metadata pages at all. Last of all, we changed the buffer pool from an open chained hash table to a cuckoo hash 7.3 Free Space and Transaction Management at this point (see Section 6.2.3) After tuning the buffer pool the next bottlenecks were in the free The impact of these optimizations is captured by the “log” line in space and transaction management components Shore. These bot- Figure 7. Again, scalability nearly doubled. tlenecks were all straightforward to address, however. An exami- nation of the code in the free space manager showed that it held a Principles applied: contended mutex while acquiring a page latch (which could be • shorten or remove critical sections contended or block on I/O). We refactored the operation so that it • eliminate hotspots was safe to release the mutex before acquiring the latch, reducing significantly the pressure on the former. We also discovered a 7.5 Lock Manager bottleneck on the mutex protecting the transaction list: the most Profiling pointed to the lock manager as the next bottleneck to common operations checked the head of the list to determine the tackle. Like the bufferpool, the lock manager’s hash table was 33

11.protected by a single mutex. However, the lock manager code lets the system skip large portions of the log during recovery. We included support for a mutex per bucket, statically disabled by a modified the dirty page cleaner threads — which already traverse single #define. In addition, we modified several critical sections whole bufferpool, but asynchronously — to track the newest LSN in order to shorten them and avoid acquiring nested locks. The they encounter during each sweep; because they write out the last optimization in the lock manager dealt with the lock request dirty pages they encounter, at the end of a sweep the “newest” pool. The lock manager maintains a pool of pre-allocated lock LSN is now the oldest, perfect for inclusion in the log checkpoint. requests, which it populates and inserts into lock lists as needed; We therefore modified the checkpoint thread to simply read this the pool’s mutex became a contention point, so we reimple- value rather than computing it during the critical section. mented it as a lock-free stack where threads can push or pop The “final” line in the graph shows the performance of the com- requests using a single compare-and-swap operation. pleted Shore-MT. The profiler reported no more significant bot- Principles applied: tlenecks for this benchmark, which we verified by running • use the right synchronization primitive multiple copies of Shore-MT in parallel; scalability was the same • shorten or remove critical sections in both cases, indicating that it utilizes fully the hardware. • eliminate hotspots Principles applied: • shorten or remove critical sections 7.6 Bufferpool Manager Revisited • eliminate hotspots After fixing the log and lock managers, the bufferpool manager again became a bottleneck. This time mutex protecting the clock 7.8 Lessons Learned replacement algorithm, and the mutex protecting “in-transit” This section and the previous one show clearly how a few funda- pages became contended. On every page miss, the bufferpool mental principles, applied systematically, can transform a storage “clock” sweeps through the pages searching for suitable candi- engine from virtually single-threaded to fully scalable. As we dates to evict. For efficiency, the clock hand does not pin or latch explained in the beginning of Section 6, we expect that designers pages unless they appear promising; we modified the algorithm can apply these principles to any storage manager to improve its slightly so it could release the clock hand mutex before pinning scalability in similar fashion. Though we distilled these principles and latching the victim; if another thread pinned the page first and lessons from our experience with Shore-MT, these results (making it non-evictable), the clock simply resumes its search. generalize to other storage engines for the following reasons: Once a page has been selected for eviction, it enters an in-transit 1. Shore uses state of the art algorithms and behaves like a list while the old page is flushed to disk (if dirty) and the new one commercial dbms at the microarchitectural level. This sug- read in. This list can become quite large, especially given the ran- gests that Shore is a typical case does roughly the same dom-accesses common to transaction processing. We applied the kinds of things as the commercial application. optimization detailed in Section 6.2.3 in order to shorten the list 2. Shore started out with the worst scalability of the open and distribute it into many smaller lists. source engines, indicating that other engines which scale Finally, we added another cache in the free space manager to poorly can be improved using with a similar approach. bypass an O(n2) algorithm in page allocation routine: searching a 3. Shore ended up scaling as well as, or better than, the com- linked list of pages to find the last. These changes brought a sig- mercial engine. This indicates that the implementation of the nificant boost in scalability, as indicated by the “bpool 2” line of different components, rather than a fundamental design flaw the figure. that might be specific to Shore. Principles applied: 4. Our optimizations were guided by a small set of very generic approaches for improving scalability. These approaches are • shorten or remove critical sections therefore likely to be effective in other storage managers as • eliminate hotspots well as other domains. Finally, we note that achieving scalability is not a one-time event. 7.7 Final Optimizations Fixing one bottleneck can expose others, and adding more hard- The final optimizations to Shore-MT were spread throughout the ware contexts to the system will eventually cause contention for code base. We created several more caches inside the free space some critical section. With the number of hardware contexts dou- manager to avoid expensive critical sections, including the one bling each processor generation it is never safe to assume a criti- described in Section 6.2.2. We also removed an unnecessary cal section or algorithm will not become a bottleneck. search of the lock table initiated by B+Tree probes. The most important optimization involved log checkpoint generation. The 8. CONCLUSION soft checkpointing algorithm in Shore builds a list of active trans- As we enter the multicore era, database storage managers must actions and dirty pages contained in the buffer pool at the time of provide scalability in order to achieve the high performance users the checkpoint. Unfortunately, traversing a large bufferpool takes demand.Though modern open source storage managers are not a single thread several seconds due to page faults and cache currently up to the task, our experience converting Shore to the misses; during that time no transaction can begin or complete. We scalable Shore-MT suggests that much progress is possible. With observed that recovery rebuilds the list of dirty pages using infor- careful benchmarking and analysis, we identify bottlenecks that mation from the log. The only important piece of information is inhibit scalability; repairing them creates Shore-MT, a multi- the oldest log sequence number found during the traversal, which 34

12.threaded storage manager which exhibits excellent scalability and [15] J. Gray. “Tape is Dead, Disk is Tape, Flash is Disk, RAM superior performance when compared to both its peers as well as Locality is King.” Gong Show Presentation at CIDR, 2007. to a popular commercial database system. Most importantly, the [16] J. Gray, and A. Reuter. “Transaction Processing: Concepts lessons we identified provide a useful reference as new bottle- and Techniques.” Morgan Kaufmann Publishers, Inc., 1993. necks appear in the future. [17] B. He, W. N. Scherer III, and M. L. Scott. “Preemption Adaptivity in Time-Published Queue-Based Spin Locks.” In 9. ACKNOWLEGEMENTS Proc. HiPC, 2005. The authors thank the Database Research Group at Columbia [18] P. Helland. “Life beyond Distributed Transactions: an University for providing access to their T2000 Server on short Apostate's Opinion.” In Proc. CIDR, 2007. notice following a hardware failure. We also thank the anony- mous reviewers for their helpful comments. This work was par- [19] J. M. Hellerstein, and M. Stonebraker. “Anatomy of a Database System.” In Readings in Database Systems, 4th ed. tially supported by grants and equipment from Intel; a Sloan research fellowship; an IBM faculty partnership award; NSF [20] R. Johnson, I. Pandis, A. Ailamaki. “Critical sections: re- grants CCR-0205544, CCR-0509356, IIS-0133686, and IIS- emerging scalability concerns for database storage engines.” 0713409; and an ESF EurYI grant. In Proc. DaMoN, 2008. [21] P. Kongetira, K. Aingaran, and K. Olukotun. "Niagara: A 32- 10. REFERENCES Way Multithreaded SPARC Processor". IEEE MICRO, 2005. [1] Oracle BerkeleyDB. http://www.oracle.com/technology/ [22] P. Lehman, and B. Yao. “Efficient locking for concurrent products/berkeley-db/index.html operations on B-trees.” ACM TODS, 6(4), 1981. [2] MySQL. http://www.mysql.com. [23] P. Magnussen, A. Landin, and E. Hagersten. “Queue locks on cache coherent multiprocessors.” In Proc. IPPS, 1994. [3] A. Ailamaki, D. J. DeWitt, and M. D. Hill. “Walking Four Machines By The Shore.” In Proc. CAECW, 2001. [24] J. Mellor-Crummey, and M. Scot. “Algorithms for scalable synchronization on shared-memory multiprocessors.” ACM [4] R. Bamford, D. Butler, B. Klots, and N. MacNaughton. TOCS, 9(1), 1991. “Architecture of Oracle Parallel Server.” In Proc. VLDB, 1998. [25] C. Mohan, “ARIES/KVL: a key-value locking method for concurrency control of multiaction transactions operating on [5] P. A. Bernstein, and N. Goodman. “Multiversion Concurrency B-tree indexes.” In Proc. VLDB, 1990. Control - Theory and Algorithms.” ACM TODS, 8(4), 1983. [26] C. Mohan, and F. Levine. “ARIES/IM: an efficient and high [6] E. Bugnion, S. Devine, and M. Rosenblum. “Disco: running concurrency index management method using write-ahead commodity operating systems on scalable multiprocessors.” logging”. In Proc. SIGMOD, 1992. In Proc. SOSP, 1997. [27] R. Pagh, and F. F. Rodler. “Cuckoo Hashing”. In Proc. ESA, [7] M. Carey, D. J. DeWitt, M. Franklin, N. Hall, M. McAuliffe, 2001. J. Naughton, D. Schuh, M. Solomon, C. K. Tan, O. Tsatalos, S. White, M. Zwilling. “Shoring up persistent applications.” [28] P. Ranganathan, K. Gharachorloo, S. V. Adve, and L. A. In Proc. SIGMOD, 1994. Barroso. “Performance of Database Workloads on Shared- Memory Systems with Out-of-Order Processors.” In Proc. [8] H. T. Chou, and D. J. DeWitt. "An Evaluation of Buffer ASPLOS, 1998. Management Strategies for Relational Database Systems". In Proc. VLDB, 1985. [29] Y. Ruan, V. Pai, E. Nahum, J. Tracey. Evaluating the impact of simultaneous multithreading on network servers using real [9] T. Craig. “Building FIFO and priority-queueing spin locks hardware. ACM SIGMETRICS, 33(1), 2005. from atomic swap.” Technical Report TR 93-02-02, University of Washington, Dept. of Computer Science, 1993. [30] M. Stonebraker, and L. A. Rowe. “The Design of Postgres.” In Proc. SIGMOD, 1986. [10] J. D. Davis, J. Laudon and K. Olukotun. "Maximizing CMP Throughput with Mediocre Cores". In Proc. PACT, 2005. [31] M. Stonebraker, S. Madden, D. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. “The End of an Architectural Era [11] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. (It's Time for a Complete Rewrite).” In Proc. VLDB, 2007. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. “Dynamo: Amazon's highly available key-value [32] M. Thorup. “Even strongly universal hashing is pretty fast.” store.” In Proc. SOSP, 2007. In Proc. SODA, 2000. [12] D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. [33] TPC (Transaction Processing Performance Council). TPC Bricker, H. Hsiao, and R. Rasmussen. “The Gamma Database Benchmark C (OLTP) Standard Specification, Revision 5.9. Machine Project.” IEEE TKDE 2(1), 1990. [13] D. J. DeWitt, and J. Gray. “Parallel Database Systems: The Future of High Performance Database Systems.” Commun. ACM, 35(6), 1992. [14] D. Fotakis, R. Pagh, P. Sanders, and P. G. Spirakis. “Space efficient hash tables with worst case constant access time.” In STACS, 2003. 35