In-Memory Performance for Big Data

We introduce here a novel buffer pool design that adapts pointer swizzling for references between system objects (as opposed to application objects), and uses it to practically eliminate buffer pool overheads for memoryresident data. Our implementation and experimental evaluation demonstrate that we achieve graceful performance degradation when the working set grows to exceed the buffer pool size, and graceful improvement when the working set shrinks towards and below the memory and buffer pool sizes.

Goetz Graefe, Haris Volos, Hideaki Kimura, Harumi Kuno, Joseph Tucek, and Mark Lillibridge (HP Labs Palo Alto); Alistair Veitch (Google; work done while at HP Labs)
goetz.graefe@hp.com, haris.volos@hp.com, hideaki.kimura@hp.com, harumi.kuno@hp.com, joseph.tucek@hp.com, mark.lillibridge@hp.com, alistair.veitch@gmail.com

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/). Obtain permission prior to any use beyond those covered by the license. Contact the copyright holder by emailing info@vldb.org. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii. Proceedings of the VLDB Endowment, Vol. 8, No. 1. Copyright 2014 VLDB Endowment 2150-8097/14/09.

1. INTRODUCTION

Database systems that use a buffer pool to hold in-memory copies of the working data enjoy a number of benefits. They can very efficiently manage working sets that far exceed available memory. They offer natural support for write-ahead logging. They are somewhat insulated from cache coherence issues. However, a buffer pool imposes a layer of indirection that incurs a significant performance cost. In particular, page identifiers are required to locate a page regardless of whether it is in memory or on disk. This in turn calls for a mapping between the in-memory addresses of pages and their identifiers.

[Figure 1: Index lookup performance with in-memory vs. traditional database. Throughput (qps, log scale from 100 to 10,000,000) against working set size (0-20 GB), comparing "OS Virtual Memory" with "Traditional Buffer Pool".]

Figure 1 illustrates the impact of the buffer pool's overheads, comparing the performance of an in-memory database using direct pointers between B-tree nodes (marked "OS Virtual Memory" in the figure) and a database system using a traditional buffer pool with disk pages (marked "Traditional Buffer Pool"). The available memory is 10 GB; further details of this experiment are provided in Section 5. In the left half, the entire working set fits into the available physical memory and buffer pool. In the right half, the working set exceeds the available memory and spills to disk. The in-memory database relies on the OS virtual memory mechanism to support working sets larger than the available physical memory. The in-memory database permits substantially faster index look-ups than the traditional buffer pool when the working set fits in physical memory.

In fact, our experiments show that an in-memory database improves throughput by 41% to 68% compared to one that uses a traditional buffer pool. This correlates well with earlier results that the buffer pool takes about 30% of the execution cycles in traditional transaction processing databases [14].

We borrow the technique of pointer swizzling and adapt it to the context of page-oriented data stores, selectively replacing identifiers of pages persisted in storage with direct references to the memory locations of the page images. Our techniques enable a data store that uses a buffer pool to achieve practically the same performance as an in-memory database. The research contributions here thus include:

1. Introduction of pointer swizzling in database buffer pool management.
2. Specification and analysis of swizzling and un-swizzling techniques for the specific case of B-tree indexes.
3. Victim selection (replacement policy) guiding eviction and un-swizzling of B-tree pages from the buffer pool.
4. Experimental proof that these simple techniques enable a database with a buffer pool to achieve in-memory database performance when the working set fits in memory, and furthermore achieve a continuum with graceful degradation as the working set size exceeds the memory size.

In the remainder of this paper, we first discuss prior efforts to improve the performance of in-memory database systems with regard to memory limitations, as well as prior work in pointer swizzling (Section 2). We next describe the problem being solved (Section 3), and then present details of our techniques for swizzling and un-swizzling within a database buffer pool (Section 4). We evaluate the performance of our solution using an implementation (Section 5) and finally present our conclusions from the work to date (Section 6).

2. RELATED WORK

This paper describes our efforts to adapt the known technique of pointer swizzling to the new context of making disk-based systems competitive with in-memory database systems. This section begins by discussing prior work that addresses the problem of how to preserve the performance of in-memory database systems when the working set does not quite fit in the available amount of memory. These prior systems do not consider how to match the performance of a traditional database when disk I/O is the bottleneck. Our design, on the other hand, enables a traditional database system that has been optimized for disk I/O to compete with in-memory systems when the working data set fits in memory. Thus, the second half of this section discusses prior work on pointer swizzling, which is the technique we leverage to eliminate buffer pool overheads.

2.1 In-Memory Database Systems

A number of in-memory database systems, including Oracle's TimesTen, Microsoft's SQL Server Hekaton, CWI's MonetDB, SAP HANA, MIT's Silo, and VoltDB, avoid the overheads of disk-based systems by focusing on providing the best possible performance for in-memory workloads [5, 13, 28, 31, 38, 39]. However, both Oracle's and Microsoft's in-memory database systems force a database administrator to make explicit tuning decisions and requests, whereas our approach is entirely automatic and adapts to growing and shrinking working sets. As the raison d'être of such systems is to provide superior performance for in-memory workloads, most of them do not support data sets that do not fit in memory. For instance, Microsoft's SQL Server Hekaton stores raw memory addresses as pointers, eliminating the need for data pages and indirection by page IDs. However, once the working set size exceeds the available amount of memory, Hekaton simply stops inserting rows [1].

One possible solution is to rely on the VM layer to deal with any spills out of memory. For instance, some databases (such as Tokyo Cabinet) simply mmap their backing store [6]. This leads to several problems. The first is that the OS VM layer is particularly poor at making eviction decisions for transactional workloads (as we show later in our experiments). The second is that the operating system considers itself free to flush dirty pages opportunistically, without notifying the application. This can cause data integrity problems when a page dirtied by an in-flight transaction is written back without the matching log records.

Failure-atomic msync [35] provides a mechanism by which 1) the kernel is prevented from lazily flushing pages, and 2) a modification to the msync call allows users to select a subset of pages to be written out atomically. Although this certainly helps, it is insufficient for a transaction processing system. First, failure-atomic msync does not provide parallelism. From contact with the authors, failure-atomic msync "only gives A and D . . . we provide no support for isolation. If you want safe, concurrent, database-style transactions, you need more" [34]. Second, as implemented, failure-atomic msync hijacks the file system journaling mechanism, duplicating each page write. Unlike a traditional buffer-pool-based system where changing a single row causes a few bytes of log writes and a page-sized in-place update, a system based on failure-atomic msync incurs a minimum of two page-sized writes, a significant performance penalty.

There are a small number of recent efforts that make VM-based in-memory databases performant when the working data set almost fits in memory. These systems address the problem of swapping either by explicitly moving some data records to virtual memory pages that the OS can then swap to disk more intelligently, or by explicitly compressing less-frequently accessed data records so as to reduce the space consumed, at the cost of access speed.

Stoica and Ailamaki profiled the performance of the state-of-the-art in-memory VoltDB database, and found that performance dropped precipitously when the working data set exceeded available memory [37]. Their work demonstrated that by intelligently controlling which tuples are swapped out by the operating system and which are kept in memory, it is possible to make much more effective use of available memory. To achieve this, two regions of memory are created – one hot and one cold. Pages in the hot region are pinned in memory so that the virtual memory manager will not evict them; pages in the cold region are not pinned. Offline log analysis identifies individual data records that are likely no longer needed, after which a special process explicitly moves those records from the hot area to the cold area. For this offline analysis, they leveraged the work of Levandoski et al., who investigated how to efficiently identify hot and cold data in the context of a database system optimized for main memory [25].

After finding that maintaining access statistics incurred a 25% overhead, Levandoski et al. developed efficient algorithms by which they could analyze 1M log records in under a second.

DeBrabant et al. also consider how to evict individual cold tuples from memory to disk [4]. Evicted tuples are stored in an on-disk hash table (called a Block table) and tracked by a separate in-memory hash table. Because they assume that main memory is the primary storage device (data is initially created directly in memory), they call their technique anti-caching. Like Stoica and Ailamaki, DeBrabant et al. implemented their technique using VoltDB. However, considering offline analysis to be a prohibitive expense yet wanting to make eviction decisions on a tuple-by-tuple basis, DeBrabant et al. use a doubly-linked LRU chain of tuples to identify "hot" vs. "cold" tuples. In order to reduce the overhead of this tracking, they select only a fraction of their transactions to monitor at runtime.

Funke et al. compact memory-resident data, identifying and compressing immutable "frozen" data so as to dedicate as much memory as possible to the hot mutable working data set [7]. This distinction between mutable and immutable data enables a single main-memory database system, HyPer, to support both online transactional processing (OLTP) and online analytical processing (OLAP).

In summary, all of these prior systems focus on how to enable in-memory databases to handle the case where the working data set does not quite fit in memory; they do not consider how to match the performance of a traditional database when disk I/O is the bottleneck. We, on the other hand, enable a traditional database system that has been optimized for disk I/O to compete with in-memory systems when the working data set fits, or almost fits, in memory.

2.2 Pointer Swizzling

Pointer swizzling refers to replacing a reference to a persistent unique object identifier with a direct reference to the in-memory image of that object. Following a swizzled pointer avoids the need to access a mapping between the identifiers and memory addresses of memory-resident objects [29]. Our approach, detailed in Section 4, eliminates the buffer pool overhead by swizzling a parent page's reference to a child page, replacing the persistent identifier with the in-memory address of the child page.

Some relational databases, notably in-memory databases, use in-memory pointers. Most relational databases separate their persistent storage and their volatile buffer pool. Those systems have pages point to one another by means of addresses in persistent storage, typically page identifiers on disk storage or other page-access devices. What is used here for the first time, as far as we know, is a dynamic translation between addresses for database containers, i.e., pages, using a memory-optimized address system (virtual memory addresses) while navigating in memory and a disk-address system (page identifiers) on persistent storage.

Pointer swizzling is well known in the context of application objects in specialized databases, but we are not aware of any relational database management system employing it. E.g., Garcia-Molina et al. [8] anticipate the opportunity for pointer swizzling when they observe that "index structures are composed of blocks that usually have pointers within them. Thus, we need to study the management of pointers as blocks are moved between main and secondary memory." However, the book's discussion focuses on pointers between application objects, not storage containers such as pages. Similarly, White and DeWitt's [40] "QuickStore provides fast access to in-memory objects by allowing application programs to access objects via normal virtual memory pointers." But again, pointers between system objects such as frames in the buffer pool are not considered.

The general technique of pointer swizzling first emerged about 30 years ago [2, 19]. We are informed by the thoughtful and in-depth discussion and evaluation of pointer swizzling techniques by Kemper and Kossman [20, 21, 22]. They characterize pointer swizzling techniques according to two key design decisions: (1) when to swizzle references, balancing the cost of swizzling references that might never actually be used against the performance benefit of having references swizzled in advance of being read, and (2) how to handle references to objects that are not resident in main memory, balancing the cost and complexity of maintaining reference relationship dependencies (e.g., in a reverse reference table) against the cost of managing swizzled references to non-memory-resident objects.

A significant, yet subtle, difference between our work and prior work in swizzling is that although prior work has considered swizzling at the granularity of B-tree pages and segments, the swizzling was only performed on object IDs (references between application objects). In contrast, we restrict swizzling to references between system objects — the in-memory images of pages in the buffer pool.

For example, Kemper and Kossman briefly mention swizzling pointers to pages [22]: "In some systems, address translation is carried out on a per-page basis. . . . a page table records the in-memory address of every resident page." However, they consider this only in the context of object-oriented databases, as opposed to relational database management systems or key-value stores, i.e., for any storage layer. Moreover, their purpose is application performance (application logic running on top of some in-memory store or buffer pool) whereas our purpose is indexing performance. Finally, our system still employs indirection (via the descriptor data structures in the buffer pool) such that dirty bits and other temporary metadata can be supported, but this is indirection using in-memory addresses (pointers) rather than persistent page identifiers that require a lookup in the buffer pool's table of contents. Their notion of "indirect swizzling" is very different – its purpose is to permit evicting an object from memory without un-swizzling all pointers to that object. In other words, the design requires that some object remain in memory (and consume memory) long after an object has been evicted due to memory contention. They suggest reference counting to control the lifetime of the indirection object, as maintenance of a "reverse reference list" is too expensive.

3. PROBLEM DESCRIPTION

The context of our work is a transactional storage manager that adapts traditional tools to provide ACID guarantees on modern hardware. Data resides in B-tree structures that store all data items in their leaves, with separator keys in the upper B-tree levels serving only to guide searches [10]. A buffer pool holds in-memory copies of pages used by the working data set. Latching and locking respectively ensure mutual isolation of concurrent operations on physical data structures and logical database contents [9].

Write-ahead logging provides "all or nothing" failure atomicity, database consistency, and transactional durability [11, 27].

[Figure 2: Module costs in transaction processing, copied from Harizopoulos et al. [14]. Instructions (left) vs. cycles (right) for the TPC-C New Order transaction, broken down into B-tree keys, logging, locking, latching, and buffer manager; the buffer manager accounts for roughly 35% of instructions and 30% of cycles.]

The challenge is that when the working data set fits in memory, these traditional tools themselves become the new bottlenecks, accounting for about 80% of the CPU cycles used by a single transaction [14]. Figure 2 (copied from [14]) summarizes the execution costs of a TPC-C "new order" transaction in a traditional database system with the entire database in memory, expressed as a fraction of the total execution effort and calculated both in terms of instructions and of cycles (time). The buffer manager is the most significant of the five bottlenecks shown, consuming about 30% of cycles and 35% of instructions. A naïve calculation indicates that total (or near-total) elimination of the buffer pool from the execution path should improve throughput (performance, transaction processing scalability) by a factor of about 1.5. Figure 1 suggests performance opportunities of the same magnitude. This is the principal motivation of our present work.

Note that Harizopoulos et al. advocate a complete redesign of the transactional storage manager, as opposed to improvements in any one of the big overheads, with the reasoning that eliminating any one of the overheads would improve performance only a little because the others remain. We, on the other hand, are working to eliminate all of these overheads. In particular, this paper describes our effort to eliminate buffer pool overheads without limiting our system to small databases that fit in memory.

3.1 Buffer pool and page identifiers

The buffer pool caches images of pages in memory while they are being read or modified. Each formatted database page has a unique persistent identifier which the storage manager can use to locate that page and bring it into the buffer pool if it is not already there. For example, when a B-tree parent page associates a given separator key with a child page, that reference is stored in persistent storage using the persistent page identifier.

Regardless of whether or not the working data set fits in available memory, the buffer pool is a useful tool for ensuring correctness and managing page consistency. For example, write-ahead logging requires that a modified database page must not be written (in place) until the modifications are logged on stable storage. The buffer pool allows page modifications to be made in memory as opposed to directly on storage. The buffer pool also enables check-pointing all dirty (modified but not persisted) pages to a backing store.

However, managing and protecting the mappings between persistent page identifiers and in-memory page images can cause a significant amount of latch contention as the number of pages that can be stored in memory and the number of processes that access those mappings increase. For example, consider a typical buffer manager that uses a hash table to implement the mapping. Simply traversing a parent page's pointer to a child page requires calculating the child page's hash key, protecting the mapping hash table, performing the hash table look-up, searching the hash bucket's linked list, etc. Furthermore, this operation is likely to incur CPU cache misses.

3.2 Virtual memory

Virtual memory might seem like a natural alternative approach, i.e., mapping the entire database into virtual memory and using each page's virtual memory address as a page identifier. Unmapping and closing the file would persist the database to the backing store. This would bypass the indirection of purely logical page identifiers and furthermore delegate the problem of controlling which pages to swap out of memory to the hardware and the operating system in the case that the working data set grows larger than would fit in memory. Unfortunately, such an approach is unacceptable in terms of correctness and performance.

With respect to correctness, the problem is durability and control over writes to the backing store. Write-ahead logging requires that a modified database page must not be written until the modifications (relevant log records) have been logged on stable storage, and virtual memory might write a database page too early. Similarly, virtual memory may write a database page too late – e.g., in many implementations of database check-pointing, a checkpoint is not complete until all dirty pages have been written to the backing store. Finally, if the latency-optimized logging space is limited and requires periodic recycling, log records must not be recycled until all database changes have been persisted.

Some operating systems provide mechanisms to control physical I/O to memory-mapped files, e.g., POSIX msync, mlock, and related system calls. Unfortunately, there are no mechanisms for asynchronous read-ahead and for writing multiple pages concurrently, i.e., multiple msync calls execute serially. As observed by [37], without extensive explicit control, the virtual memory manager may swap out hot data along with the cold. Our experiments in Section 5 demonstrate this performance penalty.

4. NEW TECHNIQUES

With the goal of eliminating the overhead of the buffer pool, we look to pointer swizzling, a technique used by object stores to avoid the cost of repeatedly translating between distinct address spaces when referencing in-memory objects [29, 41]. That is, we can remove the layer of indirection by translating page identifiers into memory addresses when a page is brought into the buffer pool.

For example, consider a conventional B-tree index that completely fits in memory, and a parent page that contains a pointer to a child page. The images of both pages are cached in memory as frames in the buffer pool. A search operation reads the contents of the parent page and finds the identifier of the child page. Locating the in-memory image of the child page requires a look-up in a data structure that maps between page identifiers and in-memory page addresses. Furthermore, since pages will potentially move into and out of memory if the working set exceeds the buffer pool size, this data structure must be protected against concurrent accesses (e.g., using one or more latches).

Swizzling in the context of an object storage system swizzles references between application-level container objects, while our design applies swizzling to internal data structures (system objects) independent of any choices made for application-level objects and data structures. Swizzling page references, as opposed to arbitrary application object references, greatly limits the number of swizzled references that must be tracked and maintained, for example confining targets for swizzling to the interior nodes of a B-tree as opposed to potentially impacting every object in an object base. Although confining swizzling to system objects simplifies problems such as how to keep track of which references have been swizzled so that they can be un-swizzled, it also raises new issues. In order to adapt the concept of swizzling for use within a database buffer pool, the database storage manager must coordinate swizzling, un-swizzling, and page eviction from the buffer pool.

4.1 Swizzling child pointers

Since most B-tree operations rely on root-to-leaf passes, parent-to-child pointers between B-tree pages are the most important pointers to swizzle (as those are the most frequently used). Ideally, the pointer to the root node, e.g., in the database catalogs, can also be swizzled, in particular if an in-memory cache holds the catalogs while the database is open and the index root page is in the buffer pool.

In a traditional database buffer pool, multiple cursors may read from the same page at the same time. A pin count, similar to a reference count, tracks the number of concurrent users of a buffer pool frame as well as swizzled parent-to-child pointers. A page in the buffer pool can be evicted and replaced only when its pin count is zero.

[Figure 3: Parent-to-child pointers in a B-tree. The root (page ID 42) holds separator key 200 over the range −∞ to +∞; its child page 90 covers −∞ to 200 and holds keys 30, 140, and 200; page 90's children include page 75 (keys 120, 140) and page 88 (keys 140, 160).]

Figure 3 shows a simple B-tree with four nodes that we will use as a running example in the remainder of this section; we focus on the two nodes labeled in red – one representing the root node (page ID 42) and the other representing a child node (page ID 90). Page 90 is associated with the key range from −∞ to 200.

[Figure 4: Parent-child relationship in a traditional buffer pool. A hash table in the buffer pool maps hash(90) and hash(42) to frame descriptors (frame ID, page ID, latch info, dirty bit), which point to the page images in buffer pool frames 1 and 2; the image of page 42 contains the entry "Key 200: Page ID 90".]

Figure 4 shows in-memory page images corresponding to the pages with IDs 42 and 90, realized using a traditional buffer pool. The buffer pool has a limited number of frames in which to store in-memory images of pages. Each frame has a descriptor that contains metadata about the page frame. A hash table maps from page identifiers to frame identifiers (e.g., offsets into arrays of frames and frame descriptors). In a traditional implementation, a page-to-page pointer is realized using the target page's identifier.

Figure 5 is a flow chart that sketches the process of searching a given in-memory page image (the contents of buffer pool frame ID 1 from our example) for a search key, finding a parent-to-child pointer (pointing to page 90), and then using the child page's identifier to retrieve its in-memory page image. Following the process shown in Figure 5, the traversal from the root node (page ID 42) to the node with page ID 90 requires a hash calculation and traversal of a linked list representing a hash bucket to reach a descriptor for a buffer frame, which has a direct pointer to the buffer frame (and its contents).

In contrast, Figure 6 sketches the process of searching a page in an in-memory database that does not have a buffer pool. Because all pages are in memory, traversing parent-child relationships can be done without indirection.

When a pointer from a B-tree parent page to a child page is swizzled, we modify the in-memory page image of the parent page, replacing the page ID of the child page with a pointer to the buffer pool frame holding the in-memory image of the child page. This direct pointer can be realized using a virtual memory address, but in our reference implementation (and the illustration) we simply use the identifier of the buffer pool frame.

[Figure 5: Following a page pointer in a traditional buffer pool.]

[Figure 6: Following a page pointer in an in-memory database.]

For example, Figure 7 shows the contents of the page image in the buffer frame with ID 2 after the reference to page 90 has been swizzled.

Figure 8, in contrast to Figure 5 and Figure 6, sketches the process of traversing a swizzled pointer. In our design, pointer swizzling in the buffer pool is a side effect of normal B-tree operations and as such is transparent to applications. For example, when a root-to-leaf pass encounters a fault in the buffer pool, it loads the missing page and swizzles the pointer to the page. When a root-to-leaf pass encounters a hit in the buffer pool with an un-swizzled pointer, i.e., a traditional child reference using a page identifier, the pointer is swizzled to speed up future root-to-leaf passes within the pertinent key range.

In our current design, pointers are swizzled one at a time. Within a single parent page, some pointers may be swizzled while some are not; it is possible that some child pointers are swizzled while other child pages are not even present in the buffer pool. An earlier design that swizzled either all or none of the child pointers in a parent page simplified some bookkeeping and encouraged larger I/O operations (assuming good clustering of child pages) but either prevented speeding up search in specific key ranges or incurred excessive space requirements in the buffer pool.

By swizzling during root-to-leaf traversals, swizzling proceeds from the root towards the (active) leaves. Un-swizzling proceeds in the opposite direction. Since a node's parent is at least as active as the node itself, swizzled pointers occur only in parent nodes that are themselves swizzled, i.e., the pointer from grandparent to parent node is swizzled.

4.2 Buffer pool eviction

Pagewise-eager swizzling in object-oriented databases scans through a page and swizzles all the references to objects contained in the page at page-fault time [15, 22]. However, handling replacement in the buffer pool posed a major challenge, requiring expensive tracking of swizzled object references using data structures such as reverse reference lists [22], persisted reverse pointers [21], swizzle tables [26], and indirect swizzling descriptors [22].

Our design simplifies the challenge of how to handle swizzled references when objects are evicted from memory. When the buffer pool needs to evict a page, our implementation uses a fairly standard implementation of a generalized clock counter [30, 36].

Swizzling pins the affected pages and thus protects them from replacement by the clock mechanism. Thus, another mechanism is needed that un-swizzles pages when the clock mechanism cannot find possible replacement victims. Our current design sweeps B-tree structures in the buffer pool using depth-first search. Each sweep resumes where the last one ended (a position retained as a key value) and frees some pages. This design requires detailed knowledge of the B-tree data structure within the buffer pool manager or appropriate callbacks into the B-tree module. Pages with no recent usage, as indicated by the g-clock counter, are un-swizzled unless the page itself is a branch page containing swizzled parent-to-child pointers. Thus, just as swizzling proceeds root-to-leaf, un-swizzling proceeds leaf-to-root.
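The eviction path just described can be pictured with a toy generalized-clock sweep (an assumed simplification with invented names, not the actual implementation): each frame carries a usage counter that the clock hand decrements, frames pinned by swizzled pointers are skipped, and the first frame whose counter has reached zero becomes the victim.

```python
class GFrame:
    def __init__(self, page_id, weight):
        self.page_id = page_id
        self.count = weight      # bumped on each page hit, decayed by the clock
        self.swizzle_pins = 0    # swizzled pointers targeting this frame

def gclock_victim(frames, hand=0, max_rounds=4):
    """Return the index of the first evictable frame, or None."""
    n = len(frames)
    for step in range(n * max_rounds):
        f = frames[(hand + step) % n]
        if f.swizzle_pins > 0:
            continue                  # protected; un-swizzling handles these
        if f.count == 0:
            return (hand + step) % n  # victim found
        f.count -= 1                  # give it another chance next round
    return None                       # only pinned pages left: must un-swizzle

frames = [GFrame(42, weight=3), GFrame(90, weight=1), GFrame(75, weight=0)]
frames[2].swizzle_pins = 1            # e.g., a parent holds a swizzled pointer
victim = gclock_victim(frames)        # selects the frame holding page 90
```

The final `return None` branch corresponds to the case described above in which the clock cannot find a victim and the leaf-to-root un-swizzling sweep must run first.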

Current work strives for a cleaner separation of buffer pool and B-tree modules; future work may produce a tighter integration of the two mechanisms, i.e., g-clock for eviction and hierarchical sweep for un-swizzling.

[Figure 7: Swizzled parent-child relationship in a buffer pool.]

[Figure 8: Swizzled pointer traversal.]

4.3 Child-to-parent pointers

Child-to-parent pointers are generally considered a bad idea due to the complexity and effort required when splitting non-leaf nodes. Thus, to the best of our knowledge, no database system has used persistent parent pointers since one cited by Küspert [23] in 1985.

While this is true for disk-based databases, in-memory databases require reconsideration, as does a database with swizzled pointers in the buffer pool. Specifically, the child-to-parent pointers can reside in and link descriptors in the buffer pool rather than page images. Moreover, if child-to-parent pointers exist only where the matching parent-to-child pointer is swizzled, maintenance while splitting a non-leaf node is quite efficient.

[Figure 9: Child-to-parent pointers in the frame descriptors.]

Figure 9 sketches a child-to-parent pointer added to the frame descriptors from Figure 7. The child-to-parent pointer speeds un-swizzling when a page is chosen for replacement (eviction) in the buffer pool. In that case, it is necessary to determine efficiently whether that page is a B-tree node and whether the parent node holds a swizzled pointer to the evicted page. For that reason alone, it seems that a parent pointer may be a good idea.

Note that the code used for the performance evaluation in Section 5 does not include child-to-parent pointers. Nonetheless, as seen in Figure 10, it performs similarly to both the traditional B-tree as well as the one in which all pointers are virtual memory addresses, even when the workload spills from memory and the swizzling solution needs to pay the overhead of un-swizzling.

5. EVALUATION

We have implemented a prototype of the swizzling and un-swizzling of page identifiers in the context of the Shore-MT experimental database system [3, 16, 32, 33]. Our hypothesis is that swizzling B-tree page identifiers alleviates the performance overhead imposed by the buffer pool and allows a traditional database to match the performance of an in-memory system that has no buffer pool. In this section we present experiments that compare the performance of three system configurations: a baseline configuration with a traditional buffer pool, an in-memory database (without a buffer pool), and a buffer pool that swizzles page identifiers. Our experiments consider (1) whether swizzling eliminates the performance overhead imposed by a buffer pool, compared to an in-memory database; (2) whether performance degrades gracefully as the working data set exceeds the size of memory; and (3) whether the swizzling approach can adapt to a drifting workload without requiring offline log analysis or explicit tracking of record accesses.
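Before turning to the experiments, the child-to-parent bookkeeping of Section 4.3 can be illustrated with a small Python sketch (hypothetical names; as noted above, the evaluated prototype omits child-to-parent pointers): the parent reference lives in the frame descriptor rather than the page image, exists only while the matching parent-to-child pointer is swizzled, and turns un-swizzling into a direct lookup instead of a search through potential parents.

```python
class Descriptor:
    """Frame descriptor: this metadata never touches the page image itself."""
    def __init__(self, page_id):
        self.page_id = page_id
        self.parent_frame = None  # set only while the incoming pointer is swizzled

def swizzle(descs, pages, parent, key, child):
    """Replace the page ID under `key` in the parent image with a frame ID."""
    pages[parent][key] = ("frame", child)
    descs[child].parent_frame = parent

def unswizzle(descs, pages, child):
    """Eviction-time inverse: the descriptor names the parent directly."""
    parent = descs[child].parent_frame
    key = next(k for k, v in pages[parent].items() if v == ("frame", child))
    pages[parent][key] = ("page_id", descs[child].page_id)
    descs[child].parent_frame = None

# Frame 2 holds page 42 (the parent); frame 1 holds page 90 (the child).
descs = {1: Descriptor(90), 2: Descriptor(42)}
pages = {2: {200: ("page_id", 90)}, 1: {}}
swizzle(descs, pages, parent=2, key=200, child=1)
unswizzle(descs, pages, child=1)
```

Because the parent reference exists only for swizzled children, a node split only has to update descriptors of children whose pointers are currently swizzled, which is the efficiency argument made above.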

5.1 Prototype Implementation

Aside from the swizzling techniques in the buffer pool module, we also applied several modern optimizations for many-cores to make our prototype a more appropriate test bed to evaluate the effect of swizzling. Overall, we observed that the following optimizations were highly effective.

• Foster B-tree [12] to make B-tree operations and the latching module more efficient and to ensure that every node in the B-tree has at most a single incoming pointer at all times (which simplified book-keeping for un-swizzling).

• Consolidation Array [17] to speed up the logging module.

• Read-After-Write style lock manager [18] to improve the scalability of the locking module.

For the second and third optimizations, we thank the authors of the original papers for their generous guidance in applying the techniques in our code base. We also emphasize that our experiments confirm their observations in the aforementioned papers, reproducing significant speed-ups in a different code base. The throughput of our prototype in the standard TPC-C benchmark has more than doubled with these techniques.

5.2 Experimental set-up

The experiments were performed on a 4-socket Intel Xeon X7542 NUMA machine running at 2.7 GHz with 24 cores, and equipped with 256 GB of RAM. We used two RAID-0 10K rpm disk drives for storing the data files. We ran CentOS 5.5 with a 2.6.18-194 64-bit Linux kernel. All configurations were compiled using GCC 4.1.2, and used the same compile options, experimental codes, and parameters. Unless otherwise noted, we use a 10 GB buffer pool configured to use direct I/O (O_DIRECT) when reading from and writing to the disk to prevent the OS file system cache from caching disk pages. This is the standard method to prevent OS caching and buffering, used even by commercial databases such as Oracle and DB2. Our database has a size of 100 GB (i.e., 10 times the buffer pool size). It consists of 1.73x10^9 records, with each record having a key size of 20 bytes and a value size of 20 bytes. For the main-memory configuration we restrict the database size to the buffer pool size (10 GB).

In order to isolate the effects of the buffer pool implementation (or lack of a buffer pool for the in-memory case), we use the same base implementation of the database; that is, the lock manager, logger, query engine, etc. are all identical across configurations. We do not evaluate the benefits of various data layouts (e.g., we do not consider T-trees [24]), and so all configurations use Foster B-trees with an 8 kilobyte page size.

5.3 Buffer pool performance

We first evaluate the impact of swizzling on the overhead imposed by the buffer pool through a set of experiments based on micro-benchmarks. We compare pointer swizzling in the buffer pool against two baselines, i.e., an in-memory database using direct pointers between B-tree nodes and a traditional buffer pool. As our micro-benchmark study focuses solely on buffer pool performance, we disable other modules, including logging and transactional locking, in order to isolate the effects of swizzling on buffer pool performance. We discuss overall database system performance later in Section 5.4.

5.3.1 Query performance

Our first set of experiments evaluates the impact of swizzling in the context of a read-only workload. To this end, we compare the performance of an in-memory database that uses direct pointers between B-tree nodes to a database system that uses a traditional buffer pool.

[Figure 10: Read-only query throughput: Main-Memory (no buffer pool) vs. Swizzling buffer pool vs. Traditional buffer pool. The magnified sub-figure (inset) focuses on Main-Memory vs. Swizzling performance for an in-memory working set.]

Figure 10 shows the performance of read-only queries, with pointer swizzling in the buffer pool compared against two baselines, i.e., an in-memory database and a traditional buffer pool. Error bars are one standard error. The inset sub-figure of Figure 10 magnifies the upper left-hand portion of the main graph so as to make it easier to compare the performance of the Main-Memory and Swizzling systems when the working set fits in memory. This figure is based upon the same data as Figure 1, with the addition of a buffer pool with pointer swizzling. Each data point represents at least five runs of a complete experiment with a fixed buffer pool size (10 GB) and a fixed working set size. 24 threads search the index for randomly chosen key values. The performance metric shows the number of look-ups completed per second.

These results confirm the principal goal and expectation of our design: when the working set is smaller than the buffer pool, query performance of a database with a buffer pool that supports pointer swizzling is comparable (practically equal) to that of an in-memory database that enjoys the performance advantage of in-memory pointers but is restricted to data sets smaller than memory. As a secondary effect, the performance of the database with the traditional buffer pool can be seen to improve significantly as the working set size increases to fill available memory (at the left-most edge of Figure 10) because threads work on a larger set of pages and hence suffer less latch contention. The traditional buffer pool is more sensitive to contention due to additional use of atomic instructions to pin and unpin pages.
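The pin counts behind that contention can be sketched as follows (an illustration only; a real buffer pool would use hardware atomic instructions where this sketch uses a lock):

```python
import threading

class PinnableFrame:
    def __init__(self, page_id):
        self.page_id = page_id
        self._pins = 0
        self._lock = threading.Lock()

    def pin(self):
        with self._lock:              # atomic fetch-and-add in a real system
            self._pins += 1

    def unpin(self):
        with self._lock:              # atomic fetch-and-sub in a real system
            self._pins -= 1

    def evictable(self):
        with self._lock:              # a frame may be replaced only at zero pins
            return self._pins == 0

frame = PinnableFrame(42)
workers = [threading.Thread(target=lambda: (frame.pin(), frame.unpin()))
           for _ in range(24)]        # 24 concurrent cursors, as in Section 5.2
for t in workers: t.start()
for t in workers: t.join()
```

Every pin and unpin touches the same shared counter, which is why a hotter (smaller) set of pages increases contention in the traditional design.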

The inset sub-figure in Figure 10 zooms into the range where the working set size is less than 10 GB, that is, where in-memory performance is feasible. The inset of Figure 10 provides a close-up view of the purple shaded square in the upper left of the main figure, where the Swizzling and Main-Memory curves overlap. As the inset figure illustrates, the performance difference between an in-memory database that uses only in-memory pointers and one that employs swizzling in an otherwise traditional buffer pool is very small. The differences are around 1-3% and quite comparable to the experimental standard error, indicated with vertical bars. At the left edge of the diagram, the working sets even fit into the CPU caches. Due to multiple processors and CPU caches, however, the additional performance gain is minimal.

[Figure 11: Graceful degradation. Throughput (qps) vs. working set size (GB) for Main Memory, Traditional Buffer Pool, and Swizzling.]

Figure 11 also zooms into the data of Figure 10, specifically the range of working set sizes just larger than the available buffer pool. In all three systems, even a small fraction of buffer faults and the required disk I/O cause performance to degrade significantly compared to in-memory operation. More interesting here is that the performance of the system with a traditional buffer pool and the one that employs swizzling deteriorate quite similarly, as they use similar page replacement policies. The buffer pool with swizzling performs slightly worse than the traditional buffer pool as it has to un-swizzle the pointer to a victim page before evicting the page. In other words, the buffer pool with pointer swizzling does not introduce any disadvantages in operating regions outside its strengths. In contrast to the buffer pool designs that degrade gracefully, the main-memory design suffers a sudden performance drop when the OS virtual memory mechanism starts swapping pages to disk. This happens because the OS is agnostic with regard to the relative importance of inner-node pages versus leaf-node pages.

5.3.2 Insertion performance

Figure 12 evaluates insertion performance for a B-tree index. The entire index fits into memory throughout the entire experiment. In the experiment, 24 threads add 50 million records to an initial 10 million records, with key values chosen at random and in random order.

[Figure 12: Insertion performance. Throughput (qps) for Main Memory, Swizzling, and No Swizzling.]

The performance differences follow the same pattern as in prior experiments using read-only index queries. The number of operations per second is smaller compared to the read-only experiments because of extra time spent in index maintenance, including leaf insertions and tree re-balancing. As shown in the figure, the performance of swizzling is equivalent to that of an in-memory database, and both out-perform a traditional buffer pool with page identifiers serving as child pointers.

5.3.3 Drifting working set

An additional experiment focuses on changes in the working set, i.e., the set of hot database pages changes over time. In the following, the working set is defined by a key range within a B-tree index, chosen for a working set size of 0.1 GB throughout. Every 60 seconds, 25% of the key range (and thus of the current working set) is dropped and an equal key range (and working set size) is added. Thus, for a short transition period, performance is expected to drop. Ideally, the buffer pool loads the new 25% of the working set and then again exhibits performance comparable to an in-memory database. This experiment models scenarios where people want to read the latest data, such as user status updates. As the interesting behavior happens when evicting pages from the buffer pool, we perform this experiment using a smaller buffer pool size of 1 GB to reduce the buffer-pool warm-up time and to artificially create high pressure on the buffer pool.

Figure 13(a) shows the effects in a traditional buffer pool. Performance is steady except for the brief transition periods. After 1,920 seconds, the experiment has touched 1 GB of data and the buffer pool must start evicting pages; that is why the transition period at that time is slightly longer than the other transition periods.

Figure 13(b) shows the corresponding performance, but for a buffer pool that employs pointer swizzling.

As expected from the experiments reported above, when the working data set is in memory, throughput with swizzling is approximately twice as high as with the traditional system shown in Figure 13(a), and the lengths of the transition periods are equivalent between the traditional and swizzling systems.

[Figure 13: Drifting working set. (a) Traditional buffer pool; (b) buffer pool with swizzling. Both plot throughput (qps) against time elapsed (s).]

5.4 TPC-C Benchmark

The last experiment runs the standard TPC-C benchmark (http://www.tpc.org/tpcc/). Figure 14 compares the throughput of our database engine 1) with a traditional buffer pool (No-Swizzling), 2) with a swizzling buffer pool (Swizzling), and 3) without a buffer pool (MainMemory). As TPC-C is a write-intensive workload, we also turn the locking module and the logging module on and off in order to isolate the performance impact of swizzling in the buffer pool compared to the locking and logging bottlenecks.

[Figure 14: TPC-C benchmark result. 100 warehouses, 12 clients, warmed-up buffer pool larger than the data set. The logging and locking modules are turned ON/OFF to analyze the contribution of swizzling to improvement in the entire database engine.]

When both locking and logging modules are on, swizzling improves throughput over the traditional buffer pool by about 10%. When both modules are off, on the other hand, swizzling improves throughput by as much as 50%. The result is consistent with earlier observations (Figure 2) that the buffer pool is one of the bottlenecks in databases. A significant improvement in one module does not necessarily result in an equally significant improvement in overall performance; the removal of one bottleneck can expose new bottlenecks.

Nonetheless, this result demonstrates both the significance of the buffer pool bottleneck and that swizzling techniques virtually eliminate that bottleneck. The first is evidenced by the fact that the throughput of No-Swizzling did not significantly improve by turning off both locking and logging modules. Even supposing perfectly scalable locking and logging modules, a database using a buffer pool without swizzling would not achieve many-core performance competitive with a database without a buffer pool. The second is evidenced by the fact that in all cases, performance of a database with a swizzling buffer pool is statistically equivalent to that of an in-memory database.

Finally, Figure 14 motivates our eventual goal to overhaul all modules for the many-core era. Our current, ongoing work is to significantly reduce the locking and logging bottlenecks with drastically different architectures optimized for many cores, such as [38]. We anticipate that as we make the locking and the logging modules as scalable and efficient as swizzling makes the buffer pool, overall throughput will dramatically increase; the potential improvement can be seen in the rightmost bars of Figure 14.

6. SUMMARY AND CONCLUSIONS

In summary, the buffer pool in a traditional database management system enables efficient data access and data manipulation for databases larger than memory as well as transactional updates with write-ahead logging.

However, when the working set fits in memory, the performance overhead of a traditional buffer pool is so large that it motivates in-memory databases, whose principal value is that they do not use a buffer pool.

Some recent work has addressed in-memory database systems with working sets slightly larger than memory. These approaches have relied on virtual memory, compression, and offline analysis of hot and cold data items in order to improve the ability of the working data set to fit in memory. In contrast, our pointer swizzling approach simply eliminates most buffer pool overheads.

Unlike prior work on pointer swizzling between application objects, with difficulties arising from shared objects (multiple pointers to the same object) and from shared containers (multiple objects per database page), the proposed approach swizzles pointers between page frames in the buffer pool, i.e., between a finite number of system objects.

Our prototype implementation adds metadata to buffer pool frame descriptors without impacting actual data page contents, which simplifies the task of un-swizzling. An index structure with only a single (incoming) pointer per node simplifies both swizzling a pointer to an in-memory (virtual memory) address and un-swizzling a pointer, and also greatly simplifies the mechanisms involved in selecting a page for replacement in the buffer pool.

An experimental evaluation of this prototype demonstrates that swizzling parent-to-child pointers between the buffer pool's page frames practically eliminates the buffer pool bottleneck. Experiments with shifting working sets also show that un-swizzling overheads (tracking page usage, etc.) are minimal. Further experiments illustrate graceful performance degradation when the working set size grows and exceeds memory as well as quick performance improvements when the working set shrinks below memory size.

Judicious swizzling and un-swizzling of pointers within the buffer pool enables the performance of in-memory databases for memory-resident data, even when the total data set is much larger than the available buffer pool. In those cases where modern hardware with a large enough memory encompasses the application's entire active working set, database performance matches that of special-purpose in-memory databases. In other cases, graceful degradation and graceful improvement enable fluid transitions between in-memory mode and traditional buffer pool operation. For big data and working sets far exceeding memory capacity (which may well become the common case rather than an exception), the design enables scalability with the storage costs of traditional disk drives. Graceful degradation and fluid transitions between those cases ensure optimal performance throughout.

In conclusion, the proposed design turns the contradiction between in-memory computing and big data into a continuum with competitive performance across the entire spectrum, eliminating the need to divide memory between an in-memory database and a buffer pool for a persistent database. While our research context is in databases, the design applies directly to key-value stores and it should apply with little change to file systems and their allocation trees (indirection pages) and directory trees. Thus, we hope that turning a contradiction into a continuum, with competitive performance throughout, will prove useful for many future storage systems.

Acknowledgments

We thank with deepest appreciation all the developers and researchers of the Shore-MT team at EPFL, CMU and UW-Madison for making the Shore-MT code base (http://diaswww.epfl.ch/shore-mt/) available. Our development was greatly aided by their clean and efficient code as well as its extensive and insightful comments. We particularly thank Ryan Johnson and Ippokratis Pandis for their advice and suggestions regarding our implementation of the Consolidation Array they described in [17], which was used in order to improve our logging module, as well as Hyungsoo Jung and the other authors of [18] for their clarification and advice with regard to our implementation of the Read-After-Write style lock manager that they proposed in their paper.

7. REFERENCES

[1] T. Anderson. Microsoft SQL Server 14 man: 'Nothing stops a Hekaton transaction'. http://www.theregister.co.uk/2013/06/03/microsoft_sql_server_14_teched/, 2013.
[2] M. P. Atkinson, K. Chisholm, W. P. Cockshott, and R. Marshall. Algorithms for a Persistent Heap. Softw., Pract. Exper., 13(3):259-271, 1983.
[3] M. J. Carey, D. J. DeWitt, M. J. Franklin, N. E. Hall, M. L. McAuliffe, J. F. Naughton, D. T. Schuh, M. H. Solomon, C. K. Tan, O. G. Tsatalos, S. J. White, and M. J. Zwilling. Shoring up persistent applications. In SIGMOD, pages 383-394, 1994.
[4] J. DeBrabant, A. Pavlo, S. Tu, M. Stonebraker, and S. B. Zdonik. Anti-Caching: A New Approach to Database Management System Architecture. PVLDB, 6(14):1942-1953, 2013.
[5] C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. Hekaton: SQL Server's memory-optimized OLTP engine. In SIGMOD, 2013.
[6] FAL Labs. Tokyo Cabinet: a modern implementation of DBM. http://fallabs.com/tokyocabinet/.
[7] F. Funke, A. Kemper, and T. Neumann. Compacting Transactional Data in Hybrid OLTP & OLAP Databases. PVLDB, 5(11):1424-1435, 2012.
[8] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database System Implementation. Prentice Hall, Upper Saddle River, NJ, 2000.
[9] G. Graefe. A Survey of B-tree Locking Techniques. ACM TODS, 35(2):16:1-16:26, 2010.
[10] G. Graefe. Modern B-tree techniques. Foundations and Trends in Databases, 3(4):203-402, 2011.
[11] G. Graefe. A Survey of B-tree Logging and Recovery Techniques. ACM TODS, 37(1):1:1-1:35, 2012.
[12] G. Graefe, H. Kimura, and H. Kuno. Foster B-Trees. ACM Transactions on Database Systems (TODS), 2012.
[13] SAP HANA. http://www.saphana.com/.
[14] S. Harizopoulos, D. Abadi, S. Madden, and M. Stonebraker. OLTP Through the Looking Glass, and What We Found There. In SIGMOD, 2008.

[15] A. L. Hosking and J. E. B. Moss. Object Fault Handling for Persistent Programming Languages: A Performance Evaluation. In OOPSLA, pages 288-303, 1993.
[16] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, and B. Falsafi. Shore-MT: a scalable storage manager for the multicore era. In EDBT, pages 24-35, 2009.
[17] R. Johnson, I. Pandis, R. Stoica, M. Athanassoulis, and A. Ailamaki. Aether: a scalable approach to logging. PVLDB, 3(1-2):681-692, 2010.
[18] H. Jung, H. Han, A. D. Fekete, G. Heiser, and H. Y. Yeom. A scalable lock manager for multicores. In SIGMOD, pages 73-84, 2013.
[19] T. Kaehler and G. Krasner. LOOM: Large Object-Oriented Memory for Smalltalk-80 Systems. In S. B. Zdonik and D. Maier, editors, Readings in Object-Oriented Database Systems, pages 298-307. Kaufmann, San Mateo, CA, 1990.
[20] A. Kemper and D. Kossmann. Adaptable Pointer Swizzling Strategies in Object Bases. In ICDE, pages 155-162, 1993.
[21] A. Kemper and D. Kossmann. Dual-Buffering Strategies in Object Bases. In VLDB, pages 427-438, 1994.
[22] A. Kemper and D. Kossmann. Adaptable Pointer Swizzling Strategies in Object Bases: Design, Realization, and Quantitative Analysis. VLDB J., 4(3):519-566, 1995.
[23] K. Küspert. Fehlererkennung und Fehlerbehandlung in Speicherungsstrukturen von Datenbanksystemen. Informatik-Fachberichte. Springer-Verlag, 1985.
[24] T. J. Lehman and M. J. Carey. A study of index structures for main memory database management systems. In VLDB, pages 294-303. Morgan Kaufmann, San Francisco, CA, USA, 1986.
[25] J. J. Levandoski, P.-A. Larson, and R. Stoica. Identifying hot and cold data in main-memory databases. In ICDE, pages 26-37, 2013.
[26] M. L. McAuliffe and M. H. Solomon. A trace-based simulation of pointer swizzling techniques. In ICDE, pages 52-61, 1995.
[27] C. Mohan. Disk read-write optimizations and data integrity in transaction systems using write-ahead logging. In ICDE, pages 324-331, 1995.
[28] MonetDB. http://www.monetdb.org/.
[29] J. E. B. Moss. Working with persistent objects: To swizzle or not to swizzle. IEEE Trans. Software Eng., 18(8):657-673, 1992.
[30] V. F. Nicola, A. Dan, and D. M. Dias. Analysis of the generalized clock buffer replacement scheme for database transaction processing. SIGMETRICS Perform. Eval. Rev., 20(1):35-46, June 1992.
[31] Oracle TimesTen In-Memory Database. http://www.oracle.com/technetwork/products/timesten/overview/index.html.
[32] I. Pandis, R. Johnson, N. Hardavellas, and A. Ailamaki. Data-oriented transaction execution. PVLDB, 3(1):928-939, 2010.
[33] I. Pandis, P. Tozun, R. Johnson, and A. Ailamaki. PLP: Page latch-free shared-everything OLTP. PVLDB, 2011.
[34] S. Park. Personal communication, 2013.
[35] S. Park, T. Kelly, and K. Shen. Failure-atomic msync(): A simple and efficient mechanism for preserving the integrity of durable data. In EuroSys '13, 2013.
[36] A. J. Smith. Sequentiality and Prefetching in Database Systems. ACM TODS, 3(3):223-247, Sept. 1978.
[37] R. Stoica and A. Ailamaki. Enabling efficient OS paging for main-memory OLTP databases. In DaMoN, page 7, 2013.
[38] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden. Speedy transactions in multicore in-memory databases. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 18-32, 2013.
[39] VoltDB. http://www.voltdb.com.
[40] S. J. White and D. J. DeWitt. QuickStore: A high performance mapped object store. In SIGMOD, 1994.
[41] P. Wilson and S. V. Kakkad. Pointer swizzling at page fault time: Efficiently and compatibly supporting huge address spaces on standard hardware. In Computer Architecture News, pages 364-377, 1992.