
FOEDUS: OLTP Engine for a Thousand Cores and NVRAM
Hideaki Kimura, HP Labs, Palo Alto, CA. hideaki.kimura@hp.com

ABSTRACT

Server hardware is about to drastically change. As typified by emerging hardware such as UC Berkeley's Firebox project and by Intel's Rack-Scale Architecture (RSA), next-generation servers will have thousands of cores, large DRAM, and huge NVRAM. We analyze the characteristics of these machines and find that no existing database is appropriate. Hence, we are developing FOEDUS, an open-source, from-scratch database engine whose architecture is drastically different from traditional databases. It extends in-memory database technologies to further scale up and also allows transactions to efficiently manipulate data pages in both DRAM and NVRAM. We evaluate the performance of FOEDUS in a large NUMA machine (16 sockets and 240 physical cores) and find that FOEDUS achieves multiple orders of magnitude higher TPC-C throughput compared to H-Store with anti-caching.

1. DATABASES ON FUTURE SERVERS

Future server computers will be equipped with a large number of cores, large DRAM, and low-power, non-volatile random-access memory (NVRAM) with huge capacity. This disruptive difference in hardware also demands drastic changes to software, such as databases. Traditional OLTP databases do not scale up to a large number of cores. Recent advances in main-memory optimized databases have achieved far better scalability, but they cannot efficiently handle NVRAM as secondary storage. To fully exploit future hardware, we need a redesign of databases.

We are building FOEDUS (Fast Optimistic Engine for Data Unification Services; foedus is Latin for contract or compact), an open-source (http://github.com/hkimura/foedus), ground-up database engine for the next generation of server hardware. FOEDUS is a full-ACID database engine whose architecture is completely different from traditional databases. It is designed to scale up to a thousand cores and make the best use of DRAM and NVRAM, allowing a mix of write-intensive OLTP transactions and big-data OLAP queries.

FOEDUS employs a lightweight optimistic concurrency control (OCC) similar to those used in in-memory databases [30]. Unlike in-memory databases, however, FOEDUS maintains data pages in both NVRAM and DRAM and bridges the two in a unique manner for high scalability.

The key idea in FOEDUS is to maintain a physically independent but logically equivalent dual of each data page. One side of the dual is a mutable Volatile Page in DRAM and the other side is an immutable Snapshot Page in NVRAM. FOEDUS constructs a set of snapshot pages from logical transaction logs, not from volatile pages, so that transaction execution and construction of snapshot pages are completely independent and run in parallel.

FOEDUS sequentially writes snapshot pages to NVRAM to maximize the I/O performance and endurance of NVRAM, similarly to LSM-Trees [23]. Unlike LSM-Trees, however, FOEDUS's Stratified Snapshots mirror each volatile page by a single snapshot page in a hierarchical fashion. When the volatile page is dropped to save DRAM, serializable transactions have to read only one snapshot page to retrieve the searched keys or the non-existence thereof.

Another novel technique in this paper is the Master-Tree (a portmanteau of Masstree [22] and Foster B-tree [13]). It advances the state of the art of OCC for simplicity and higher performance on future hardware, providing strong invariants to simplify OCC and reduce its aborts/retries.

These techniques together achieve excellent scalability. In our experiments on hundreds of CPU cores, we observed 100x higher throughput than H-Store [9], the state-of-the-art database, with its anti-caching feature. The key contributions of this paper are as follows:
• Analysis of next-generation server hardware trends and the resulting challenges for databases (§ 2-3).
• A new approach to scale beyond DRAM yet keep the performance of in-memory databases: duality of volatile pages and stratified snapshot pages (§ 4-7).
• A highly scalable and simple tree data structure for a thousand cores and NVRAM: Master-Tree (§ 8).
• OLTP scalability evaluation at an unprecedented scale: 240 physical cores in a single machine (§ 9).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author. Copyright is held by the author/owner. SIGMOD'15, May 31-June 4, 2015, Melbourne, Victoria, Australia. ACM 978-1-4503-2758-9/15/05. http://dx.doi.org/10.1145/2723372.2746480

2. NEXT GENERATION SERVERS

In this section, we characterize next-generation servers. Several universities and companies have recently started to design servers based on emerging hardware, such as Firebox [4], Intel's Rack-Scale Architecture (RSA), and Hewlett-Packard's The Machine [1]. These designs widely vary in some components. For instance, they differ in the technology to use for the interconnect. However, they have some common features, described below.

2.1 Thousands of CPU Cores

Virtually all hardware vendors agree that many-core systems are the future. Typical servers in current data centers are equipped with up to one hundred CPU cores. Next-generation servers are expected to have thousands of CPU cores for an even higher density of computational power [35].

Non-uniform/coherent memory: However, with great (computational) power comes great responsibility. The cost of maintaining coherent memory-caches and other issues limits the number of CPU cores that can be placed in a uniform memory-access region. Most many-core servers today have two to eight CPU sockets that are connected to each other via QPI, which exhibits latency and bandwidth limitations as the number of sockets increases. Future machines will have further non-uniformity or potentially cache-incoherence in memory accesses [4].

2.2 Large DRAM and Huge NVRAM

Over decades, the capacity of DRAM has increased exponentially. Future servers will have a large amount of main memory; hundreds of TB or more. Nevertheless, it is also a widely accepted observation that DRAM is becoming increasingly expensive to scale to smaller feature sizes and beyond [3].

NVRAM: Major hardware vendors are developing innovative memory devices to overcome the limitations of DRAM. Some of the front runners among them include Phase-Change Memory (PCM) [18], STT-MRAM, and Memristors [29]. In addition to lowering the cost per bit (compared to DRAM), the retention time of these technologies is also many years, making them non-volatile. With the widening performance gap between main memory and storage, the next big leap in system performance can be achieved by using the emerging NVRAM technologies as the primary data store [7].

Performance: NVRAM is expected to perform orders of magnitude faster than state-of-the-art non-volatile devices, such as SSDs. However, the expected bandwidth and latency of NVRAM have a wide variation across vendors and even within each vendor because producing a mass-marketed storage device involves a wide range of challenges. The currently dominant expectation for the near future is that NVRAM will have higher latency than DRAM. For example, a current PCM product has 5 to 30 µs read latency and 100 µs write latency [17].

Endurance and thermal issues: A key limitation of emerging NVRAM technologies is their limited endurance. Depending on the type of cell (single-level vs. multi-level cell) and material used, NVRAM endurance is many orders of magnitude lower than DRAM. For example, PCM's endurance varies from 10^6 to 10^8 writes, and Memristor has an endurance of 10^11 writes [19], whereas DRAM has an endurance of more than 10^15 writes. In addition, technologies such as PCM and Memristor are also thermally sensitive: PCM retention time goes down with an increase in temperature [6] and Memristor suffers from increased sneak current with temperature [28]. Prior work on NVRAM has shown that simple wear-leveling and wear-out tolerance techniques can alleviate the endurance problem and provide more than five years of server lifetime [26, 34].

2.3 Desiderata for Databases

These characteristics pose several requirements to databases running on future servers.

SOC/NUMA-aware memory accesses: Software in future servers must take into account that memory-access costs will be highly non-uniform. System-on-chip (SOC) [4] is one direction that can deal with even cache-incoherent architectures. Be it incoherent or not, a DBMS must carefully place data so that most accesses to DRAM and NVRAM are local. In this paper, we interchangeably use the words SOC/NUMA-node and SOC-friendly/NUMA-aware to mean this requirement, targeting both SOC and NUMA systems.

Avoid contentious communications: The massive number of cores demands avoiding contentious communications as much as possible. For instance, many databases employ LSN-based concurrency control, which fundamentally requires at least one atomic compare-and-swap (CAS) on a single address (the tail LSN) from every thread. We observed via experiments that even one contentious atomic CAS per transaction severely limits performance when there are hundreds of cores, which is consistent with observations in prior work [30]. Databases inherently have synchronous communications, but all of them should be non-contentious.

NVRAM for big data, DRAM for hot data: A database must make use of NVRAM for big data that do not fit in DRAM. However, DRAM still has an advantage in latency. Frequently-accessed data must reside in DRAM while cold data are moved in/out to NVRAM. When data are written to NVRAM, it is desirable that writes are converted into a small number of sequential writes so that the performance and the endurance of NVRAM are maximized.

3. RELATED WORK

The individual desiderata above are not new. We discuss the key challenges to address all of them by reviewing prior work in the database literature categorized into three types: traditional, in-memory, and hybrid.

3.1 Traditional Databases

Traditional databases are optimized for disks. They can use NVRAM as a faster secondary storage device, but they are significantly slower and less scalable compared to more recent alternatives in many-core servers [9]. Our empirical evaluation observes that even a well-optimized disk-based database does not benefit from replacing SSDs with NVRAM due to bottlenecks other than I/O, which is the key issue in traditional databases.

Shadow Paging: One noteworthy approach in traditional databases is Shadow Paging, which uses copy-on-write to modify each data page and atomically switches to the new versions during transaction commit [33]. Although Shadowing can avoid random in-place updates, it causes severe contention when the transaction runs in serializable isolation level and consists of multiple data accesses. Hence, its use in the real world is limited to filesystems that need only local consistency and to non-serializable databases.

3.2 In-memory Databases

In-memory databases perform significantly faster and scale much better in many-core environments [10, 30], but they cannot utilize NVRAM.

SILO/Masstree: Nevertheless, recent efforts in in-memory databases have devised several breakthroughs to scale up to a large number of cores equipped with large NUMA memories. FOEDUS inherits the key enablers from these efforts: the lightweight optimistic concurrency control (OCC) protocol of SILO [30] and cache-sensitive data structures, such as Masstree [22]. Because these are essential pieces of FOEDUS, we recap SILO's OCC protocol and Masstree in later sections along with the design of FOEDUS.

Figure 1: Various architectures to go beyond DRAM. a) transactions go through a bufferpool that pages in/out, b) only non-key data are evicted, c) global Bloom Filters tell what is evicted, d) dual of volatile/snapshot pages.

Hekaton/Bw-Tree: SQL Server Hekaton [10] employs a commit verification protocol similar to SILO's. One difference is that Hekaton uses the Bw-tree [20] for range-accessed indexes, which has a read-lock to prevent writers from accessing a tuple until the transaction that read the tuple commits. This protocol can reduce aborts, but instead even a read incurs an atomic write to shared memory. In other words, although both SILO and Hekaton employ OCC, SILO is more optimistic. FOEDUS employs SILO-like OCC to minimize synchronous communications for reads because read operations happen more often than writes even in OLTP databases. Instead, FOEDUS's Master-Tree ameliorates the issue of aborts.

Key Issues: In contrast with traditional databases, in-memory databases scale well, but the entire data set must fit in DRAM. The key challenge arises when one tries to provide both scalability and the ability to go beyond DRAM.

3.3 Hybrid Databases

An emerging hybrid approach combines in-memory and on-disk DBMSs to address the issues together. FOEDUS, H-Store with anti-caching [8, 9], and Siberia [12] fall into this category. Figure 1 summarizes these hybrid approaches for handling data sets larger than DRAM. Traditional databases use a bufferpool to handle disk-resident pages, which is tightly coupled with the logging and locking/latching modules. All hybrid databases avoid this major source of contention and overhead so that they can keep the high performance of in-memory databases.

Anti-caching: H-Store with anti-caching keeps all essential data in DRAM and evicts only cold data to secondary storage so that a transaction can speculatively identify the set of records to retrieve from secondary storage in one pass. One limitation of this approach is that it has to keep all indexes and key attributes in memory. Another limitation is that it has to evict/retrieve/repack each tuple and maintain per-tuple metadata to track what is evicted, to where, and the last access time. Hence, it has substantial overheads and memory footprints.

Siberia: Siberia uses Hekaton as the primary hot-store in DRAM while it moves infrequently used tuples to a separate cold-store on disk, tracking which tuple is in the cold-store with global Bloom Filters. An interesting benefit of this approach is that the two data stores can be structured differently (e.g., a column-store for the cold-store). Although Siberia can move entire tuples to the cold-store, it still has out-of-page Bloom Filters. A system that maintains such global metadata out-of-page must either hold a huge amount of metadata proportional to the size of the entire database and/or suffer from false positives.

Key Issues: Prior approaches have either physical dependency or logical inequivalence between corresponding hot and cold data. Anti-caching, like traditional bufferpools, keeps the two data physically dependent on each other, thus the tuple/page must be evicted/retrieved in a synchronized fashion. On the other hand, Siberia does not maintain logical equivalence at any granularity of data between hot-store and cold-store. Hence, to retrieve a piece of information, such as the value of a tuple or the existence of keys in a certain range, it needs to maintain a global data structure and look it up for each tuple to coordinate the two stores.

4. KEY IDEA: DUAL PAGES

This paper suggests an alternative approach for hybrid databases to bridge hot and cold data: duality of pages. FOEDUS stores all data in fixed-size pages and maintains all metadata in-page in a hierarchical fashion, using dual page pointers, depicted in Figure 2.

Figure 2: States/Transitions of Dual Page Pointers (Volatile Pointer / Snapshot Pointer; transitions: Install-SP, Modify/Add, Drop Volatile if X ≡ Y).
- NULL / NULL: nullptr.
- X / NULL: Volatile-Page is just made. No Snapshot yet.
- NULL / Y: Snapshot-Page is the latest. No modification.
- X / Y: X is the latest truth, but Y may be equivalent.

A dual page pointer points to a pair of logically equivalent pages: a mutable volatile page in DRAM for the latest version and an immutable snapshot page in NVRAM for the previous version (which might be the latest if there was no modification). The two pages are physically independent, thus a transaction that modifies the volatile page does not interfere with a process that updates the snapshot page and vice versa. Yet, the two pages are logically equivalent, thus we do not need additional information or synchronization to figure out the location of any data represented by the page. When a volatile page exists, it is guaranteed to represent all data that exist in the snapshot page. When it is null, on the other hand, the corresponding snapshot page is guaranteed to represent all data that exist in the database.

No out-of-page information: FOEDUS maintains no out-of-page information, such as a separate memory region for record bodies [30], mapping tables [20], or a central lock manager. This invariant is essential for highly scalable data management where contentious communications are restricted to each page and all footprints are proportional only to the size of hot (DRAM-resident) data, not cold (NVRAM-resident) data. In an extreme-but-possible case where there are TBs of all-cold data, FOEDUS does not need any information in DRAM except a single dual pointer to the root page, whereas anti-caching and Siberia both require huge amounts of metadata.

By storing all data purely in-page, FOEDUS also eliminates data compaction and migration, which often cause severe scalability issues for garbage collection (GC).

Figure 3: FOEDUS Architecture Overview. All page pointers are dual pointers. Volatile Pages in DRAM are physically independent but logically equivalent duals of Snapshot Pages in NVRAM, which are separately and concurrently constructed by the Log Gleaner from logical transaction logs.

FOEDUS reclaims pages when they are no longer needed, but the reclaimed page can be immediately reused in another context without compaction or migration because we use a fixed-size page everywhere. This design also eliminates an additional CPU cache miss and potentially a remote SOC access because the record data are always in the page itself.

5. SYSTEM OVERVIEW OF FOEDUS

Figure 3 shows the architectural overview of FOEDUS. Conceptually, FOEDUS is a multi-version DBMS with lightweight OCC to coordinate transactions, which accesses two sets of data pages lazily synced via logical transaction logs.

Storages: A storage is the unit of data structure in FOEDUS. One table consists of one or more storages, such as a primary index and secondary indexes. FOEDUS maintains a tiny amount of static information for each storage, such as name, type, and the pointer to the root page.

Dual Pointers: As described in the previous section, most page pointers in FOEDUS are dual page pointers, except a few temporary pointers such as the foster-twins described later. A transaction follows or atomically installs the volatile pointer to modify tuples in the page or its descendants. A transaction might follow the snapshot pointer when the volatile pointer is null (i.e., the snapshot page is the latest information) or when it is not a serializable transaction. The main body of this paper focuses on serializable execution. Appendix B discusses SI transactions in FOEDUS.

NVRAM Interface: FOEDUS manipulates NVRAM devices only via standard filesystem APIs. FOEDUS does not require vendor-specific APIs, so it supports a wide range of NVRAMs, even including high-end flash devices.

Page Pools and Data Sharing: We have two DRAM-resident page pools: one for volatile pages and another for caching snapshot pages. In traditional databases, all CPU cores would share all pages in such page pools. In FOEDUS, the snapshot cache is completely SOC-local to eliminate inter-SOC communications, utilizing the fact that snapshot pages are immutable and thus safe to replicate. Although the volatile pool inherently requires remote access to make transactions serializable, FOEDUS places a volatile page in the node that first modifies the page to exploit locality of access.

Commit Protocol and Logging: In order to guarantee ACID properties, each transaction in FOEDUS maintains read- and write-sets as well as log buffers. This book-keeping information is completely thread-private to avoid contention and inter-SOC communications. Section 6 explains the detailed protocols to verify and apply them when the transaction commits.

Stratified Snapshots: Stratified Snapshots are the central mechanism of FOEDUS to store snapshot pages in NVRAM and maintain the duality with volatile pages. They consist of an arbitrary number of snapshots, each of which is a complete, durable, and immutable image of the entire database as of a certain epoch (snapshot-epoch), which serves both transactional processing and crash recovery. Section 7 describes stratified snapshots in detail.

Master-Trees: FOEDUS provides a few storage types, such as Heap and Hash-index. Its most general storage type is the Master-Tree, a variant of B-trees designed for NVRAM and many cores. Section 8 explains the Master-Tree in detail.

6. LOGGING AND COMMITTING

FOEDUS employs SILO's decentralized logging and optimistic committing protocols [30, 37] and extends them to snapshot pages stored in NVRAM. Section 6.1 gives a detailed review of SILO's protocols while Section 6.2 explains FOEDUS's extension.

Algorithm 1: SILO precommit protocol [30]
  Input: R: Read set, W: Write set, N: Node set
  /* Precommit-lock-phase */
  Sort W by unique order;
  foreach w ∈ W do Lock w;
  Fences, get commit epoch;
  /* Precommit-verify-phase */
  foreach ⟨r, observed⟩ ∈ R do
    if r.tid ≠ observed and r ∉ W then abort;
  foreach ⟨n, observed⟩ ∈ N do
    if n.version ≠ observed then abort;
  Generate TID, apply W, and publish log;

Algorithm 2: FOEDUS precommit protocol
  Input: R: Read set, W: Write set, P: Pointer set
  /* Precommit-lock-phase */
  while until all locks are acquired do
    foreach w ∈ W do
      if w.tid.is-moved() then w.tid ← track-moved(w)
    Sort W by unique order;
    foreach w ∈ W do
      Try lock w. If we fail and find that w.tid.is-moved(), release all locks and retry
  end
  Fences, get commit epoch;
  /* Precommit-verify-phase */
  foreach ⟨r, observed⟩ ∈ R do
    if r.tid.is-moved() then r.tid ← track-moved(r)
    if r.tid ≠ observed and r ∉ W then abort;
  end
  foreach p ∈ P do
    if p.volatile-ptr ≠ null then abort;
  Generate TID, apply W, and publish log;

6.1 Logs and Read/Write Sets

Transactional Logging: Each worker thread (transaction) holds a circular, private log buffer as shown at the left-bottom of Figure 3. The log buffer is written only by the worker thread and read by the log writer to write out to log files stored in NVRAM. Each log entry contains the ID of the storage, the key string, and the payload of the record inserted/modified/etc. by the logical action.

The log buffer maintains a few markers that point to positions in the buffer: 1) durable: up to where the log writer has written out to files, 2) committed: up to where the thread has completed pre-commit, and 3) current: where the thread is currently writing to. The job of a log writer is to dump out logs between durable and committed, then advance durable.

Read/Write Sets: Each transaction maintains a read-set and a write-set, which record the addresses of tuples the transaction accessed. The main differences from traditional databases are that 1) it only remembers the observed version number (Transaction-ID, or TID) of the tuple as of reading instead of taking a read-lock, and 2) it merely pushes the planned modification to the transaction's private log buffer and remembers the log position in the corresponding write-set entry instead of immediately locking the tuple and applying the modification.

SILO's Commit Protocol: The core mechanism of our concurrency control lies in its pre-commit procedure, which concludes a transaction with verification for serializability but not for durability. We provide durability for a group of transactions by occasionally flushing transaction logs with fsync to log files in NVRAM for each epoch, a coarse-grained timestamp. In other words, pre-commit guarantees ACI out of ACID, whereas D is separately provided by group-commit. Algorithm 1 summarizes the original version of the pre-commit protocol proposed in SILO.

It first locks all records in the write set. SILO uses concurrency control that places an in-page lock mechanism (e.g., an 8-byte TID for each record) that can be locked and unlocked via atomic operations without a central lock manager. Placing the lock mechanism in-page avoids the high overheads and physical contention of central lock managers [27], and potentially scales orders of magnitude better in main-memory databases. It then verifies the read set by checking the current version numbers after finalizing the epoch of the transaction, taking memory fences. If other transactions have not changed the TIDs since the reading, the transaction is guaranteed to be serializable. For example, write-skews are detected from the discrepancy between the observed TID and the current TID, which is either locked or updated by a concurrent transaction. The transaction then applies the planned changes in the private log buffer to the locked tuples, overwriting their TIDs with a newly generated TID of the transaction that is larger than all observed TIDs. The committed transaction logs are then published by moving committed to current so that log writers can write them to log files for durability.

Advancing Epochs: The current epoch and durable epoch of the system are periodically advanced by background threads that check the progress of each logger and worker thread at some interval (e.g., 20 ms). Pre-committed transactions are deemed truly committed when the new durable epoch is the same as or larger than their TID's epoch. These decentralized logging/commit protocols based on coarse-grained epochs eliminate the contentious communications of traditional LSN-based databases.

6.2 Extension for NVRAM

Issues: SILO is a purely in-memory DBMS. It cannot handle the case where a data page is evicted from DRAM. On the other hand, FOEDUS might drop a volatile page when it is physically identical to the snapshot page. After that, a transaction only sees the read-only snapshot page unless it has to modify the data page, in which case the transaction installs a volatile version of the page based on the latest snapshot page. However, this can violate serializability when other concurrent transactions have already read the snapshot pages.

Pointer Set: To detect the installation of new volatile pages, each transaction in FOEDUS maintains a pointer set in addition to the read-set/write-set. Whenever a serializable transaction follows a pointer to a snapshot page because there was no volatile page, it adds the address of the volatile pointer to the pointer set so that it can verify it at the pre-commit phase and abort if there has been a change, as shown in Algorithm 2.

The pointer set is analogous to the node-set (page-version set) in SILO. The difference is that the purpose of the SILO node-set is to validate page contents, whereas FOEDUS uses the pointer set to verify page existence.

SILO cannot verify the existence of new volatile pages, which SILO did not need because it is a pure in-memory database. FOEDUS protects the contents of pages with a different mechanism, called foster-twins. We revisit FOEDUS's commit protocol with the foster-twin technique in Section 8.

FOEDUS's pointer set usually contains only a few entries for each transaction because we do not have to add pointer-set entries once we follow a snapshot pointer. By definition, everything under a snapshot page is stable. We have to verify only the first pointer we followed to jump from the volatile world to the snapshot world. There is no path from the snapshot world back to the volatile world. In an extreme (but not uncommon) case, we have only one pointer-set entry for reading millions of records and several thousands of pages in a serializable OLAP transaction. In fact, we empirically observe that placing data in snapshot pages significantly (3x) speeds up the version verification during pre-commit.

7. STRATIFIED SNAPSHOTS

Stratified Snapshots are FOEDUS's main data repository placed in NVRAM. As the name suggests, stratified snapshots are layers of snapshots, each of which is constructed by the Log Gleaner from log files. The log gleaner does not dump out the entire image of the database for each execution, which would be prohibitively expensive for large databases. Instead, it replaces only the modified parts of the database. The LSM-tree [23] also writes out only changed data, but the difference is that the log gleaner always outputs a snapshot that is a single image of the entire storage. Multiple snapshots together form a stratified snapshot where newer snapshots overwrite parts of older snapshots. Each snapshot contains a complete path for every record up to the epoch as of the snapshot. For example, the root page of a modified storage is always included in the snapshot, but in many cases the only change from the previous snapshot version is just one pointer to a lower level. All other pointers to lower levels point to the previous snapshot's pages.

The benefit of this design is that a transaction has to read only one version of the stratified snapshots to read a record or a range of records. This is essential in OLTP databases where checking the existence of a key must be quick (e.g., inserting into a table that has a primary key, or reading a range of keys as a more problematic case). LSM-Tree approaches would have to traverse several trees or maintain various Bloom Filters for serializability [12], whose footprints are proportional to the size of cold data, not hot data.

7.1 Log Gleaner Overview

The log gleaner occasionally (e.g., every 10 minutes) collects transactional logs in each SOC's log files and processes them in a map-reduce fashion, sequentially writing the snapshot pages to NVRAM. This guarantees that FOEDUS always, including the log files, writes to NVRAM in a sequential fashion without in-place updates. This maximizes the I/O performance and endurance of NVRAM. The new snapshot pages then replace volatile pages in DRAM to reduce DRAM consumption.

Figure 4: Log Gleaner. Stages: 1. Assign Partitions, 2. Run Mappers/Reducers, 3. Combine Root/Metadata, 4. Install/Drop Pointers.

Each execution of the log gleaner consists of the four stages shown in Figure 4, detailed below.

Figure 5: Log Mapper partitions logs. Log Reducer constructs snapshot pages in batches.

7.2 Assign Partitions

The first stage of the log gleaner is to determine the partitioning keys and their assigned nodes (SOCs). To minimize inter-SOC communications, FOEDUS tries to store snapshot pages in the node that will most frequently use the page. For this purpose, FOEDUS maintains statistics in volatile pages to record which node has recently modified each page. FOEDUS uses this information to determine the boundary keys defining the partitions and their assigned nodes. These statistics are maintained without transactional guarantees to avoid unnecessary overheads.

7.3 Log Mappers and Reducers

The second stage of the log gleaner is to run mappers/reducers in each node, as Figure 5 illustrates.

Mappers: First, each mapper reads the log files containing log entries in the target epochs for the snapshot. Then, it buckets log entries by storage, buffering several log entries per storage, usually in MBs. Once a bucket for a storage becomes full, it sorts and partitions the logs in the bucket based on the boundary keys for the storage decided in the previous stage. The partitioned log entries are sent to each partition (reducer) per bucket. As far as the partitioning captures the locality, mappers send most logs to a reducer in the same node.

The mapper ships the log entries to the reducer's buffer in a three-step concurrent copying mechanism. It first reserves space to copy into the reducer's buffer by atomically modifying the state of the reducer's buffer. It then copies the entire bucket (usually MBs) into the reserved space with a single write, which is much more efficient than issuing many writes, especially to a remote node. This copying is the most expensive operation in this protocol, but it happens in parallel with copying from other mappers. Finally, the mapper atomically modifies the state of the reducer's buffer to announce the completion of the copying.

Reducers: A reducer maintains two buffers, one for the current batch and another for the previous batch. Mappers write to the current batch until it becomes full. When it becomes full, the reducer atomically swaps the current and previous batches and then waits until all mappers complete their copying. While mappers copy into the new current buffer, the reducer dumps the ex-current buffer to a file (a sorted run) after sorting its entries by storage, key, and then serialization order (epoch and in-epoch ordinal).

Once all mappers are finished, each reducer merge-sorts the current buffer in memory, the dumped sorted runs, and previous snapshot pages if the key ranges overlap. In sum, this produces streams of logs sorted by storage, key, and then serialization order, which can be applied efficiently.

Batching and Optimization: Separating the construction of snapshot pages from transaction execution enables several optimizations in mappers and reducers.

One optimization is compaction of log entries. In a few places where FOEDUS performs a batch-sort of logs, some log entries can be eliminated without affecting correctness. For example, repeated overwrites to one tuple in a storage can be safely represented by just the last update log for the tuple, as long as the updated region (e.g., column) is the same. A deletion nullifies all previous inserts and modifications. Increment operations can be combined into one operation, too. This often happens in summary tables (e.g., warehouse in TPC-C), where several thousand log entries are compacted into just one before the mapper sends them out to reducers.

Another optimization in reducers is to batch-apply a set of logs to one data page. Reducers construct snapshot pages from fully sorted inputs of log entries and previous snapshot pages. Hence, reducers simply append a large number of records in tight loops without any binary search or insert-in-between, which makes processing the same amount of data substantially more efficient than in transaction execution. Also, some tasks are much easier to do using log records after a transaction has committed than with live pages in DRAM during transaction processing. For example, it is very efficient for reducers to physically delete an empty page or a logically-deleted record, whereas FOEDUS never physically deletes a record in volatile pages, to simplify its concurrency control.

In Section 9, we observe that a single reducer instance can keep up with a huge influx of logs emitted by several worker threads because of these optimizations.

7.4 Combining Root Pages and Metadata

When all mappers and reducers are done, the log gleaner collects root pages from the reducers and combines them into a single new root page for each storage that had any modification. The log gleaner then writes out a new snapshot metadata file that contains updated metadata for all storages (e.g., the page IDs of their latest root pages in the stratified snapshots).

Combining root pages is a trivial job because the partitions are non-overlapping. The tricky case happens only when there is skew, that is, when the choice of partitioning keys failed to capture the log distribution accurately. For example, one node might receive few logs, resulting in a B-tree of two levels, while another node receives many more logs, resulting in a B-tree of four levels. FOEDUS so far leaves the tree imbalanced in this case. It is possible to re-balance the tree, but doing so might cause unnecessary remote accesses for the node that has the smaller sub-trees. In the above example, transactions on the first node can access local records in three reads (the combined root page and its sub-tree) while, if the re-balanced tree has five levels, they might have to read two more pages at higher levels that are likely in remote nodes.

7.5 Install/Drop Pointers

Once the metadata file and the snapshot files are written, the log gleaner installs new snapshot pointers and drops pointers to volatile pages that had no modifications during the execution of the log gleaner. FOEDUS keeps volatile versions of frequently modified pages and their ancestors.

Installing Snapshot Pointers: The log gleaner installs snapshot pointers into the corresponding dual pointers based on the key ranges of the pages. A volatile page corresponds to a snapshot page if and only if their key ranges are exactly the same. Reducers try to minimize non-matching boundaries by peeking at the key ranges of volatile pages while they construct snapshot pages, but it is possible that the volatile page has had another page split because the log gleaner runs concurrently with transactions. In such a case, the log gleaner leaves the snapshot pointer of the dual pointer NULL, and thus cannot drop the volatile page until the next execution.

Dropping Volatile Pointers: The last step of the log gleaner drops volatile pointers to save DRAM. We currently pause transaction execution during this step by locking out new transactions. We emphasize that this does not mean the entire log gleaner is a stop-the-world operation. Dropping volatile pointers based on an already-constructed snapshot is an instantaneous in-memory operation that finishes in milliseconds; in the TPC-C experiments in Section 9, it takes only 70 milliseconds even after gleaning all data (initial data population). The log gleaner spends the vast majority of its time in mappers and reducers, which work fully in parallel with transaction execution. By pausing transaction execution while dropping volatile pointers, we avoid expensive atomic operations for each pointer at the cost of only a sub-second transaction latency once every several minutes.

7.6 Snapshot Cache

Transactions read snapshot pages via the snapshot cache to avoid repeated I/O, like traditional buffer pools. However, because snapshot pages are immutable, the snapshot cache has several unique properties that distinguish it from typical buffer pools. First, the snapshot cache is SOC-local. One snapshot page might be cached in more than one SOC to avoid remote memory accesses; such replication of mutable pages in traditional databases would cause expensive synchronization. Second, when a thread requests a page that has already been buffered, it is acceptable if, occasionally, the page is re-read and a duplicate image of the page is added to the buffer pool. This does not violate correctness, nor does it impact performance insofar as it occurs only occasionally. The occasional extra copies waste only a negligible amount of DRAM, and the performance benefits FOEDUS gains by exploiting these relaxed requirements are significant: the snapshot cache does not use any locks or even atomic operations. It needs only a few memory fences and an epoch-based grace period before reusing evicted pages.

7.7 Crash Recovery

In traditional databases, crash recovery is a complex procedure because of the interaction between transaction logs and data pages, which may or may not have the modifications made right before the crash. In fact, recent research [36] finds that most databases, including commercial ones, may fail to recover from a power failure, causing data loss or even an inconsistent state.

The key principle of FOEDUS, the physical separation between volatile pages and snapshot pages, simplifies this problem. The only additional thing FOEDUS does after restart is to truncate the transactional log files up to the place the previous execution durably flushed. As there are no volatile pages after restart, FOEDUS then simply invokes a log gleaner just like a periodic execution, which is well optimized as described above. There is no code path specific to restart, which makes FOEDUS's crash recovery significantly more robust and efficient.

The interval between log gleaner executions (usually a few minutes) is also the upper limit of the recovery time, because log gleaning during restart cannot be slower than during normal execution: there are no live transactions to put pressure on CPU cores and caches, nor volatile pages to install/drop pointers from. Hence, stratified snapshots also serve as efficient checkpoints from which to restart quickly.

8. MASTER-TREE

Master-Tree combines techniques from Masstree [22] and Foster B-Tree [13] to build B-trees appropriate for NVRAM and OCC. Beyond combining the two, Master-Tree has a key distinction that provides strong invariants to drastically simplify our OCC protocol and reduce aborts and retries. Section 8.1 briefly reviews the two prior works and describes how FOEDUS combines them. Section 8.2 then details our foster-twin technique.

[Figure 6: Master-Tree: Foster-Twin provides strong invariants to simplify OCC and reduce retries.]

8.1 Basic Data Structures

Master-Tree leverages Masstree's cache-craftiness and employs a variant of its OCC protocol. In short, Masstree is a 64-bit B-trie where each layer is a B-tree optimized for 64-bit integer keys. Most key comparisons are done as efficient 64-bit integer comparisons with only a few cacheline fetches per page, digging down to the next layer when keys are longer than 64 bits. When Masstree splits a full page, it does RCU (read-copy-update) to create a new version of the page and atomically switches the pointer, as shown in Figure 6. A transaction in SILO guarantees serializability by taking a node-set (page-version set) on the page and aborting if the page version tells it that the page has split.

Masstree is a pure main-memory data structure where a page stays in memory forever. To allow page-in/out, we extend it with Foster B-Tree techniques. The key issue for page-in/out in Masstree is that it has multiple incoming pointers per page, such as next/prev/parent pointers in addition to the pointer from the parent page. In a database with page-in/out, such multiple incoming pointers cause many issues in concurrency control. Foster B-Tree solves this with the concept of a foster-child, a tentative parent-child relationship that is de-linked when the real parent page adopts the foster-child. Master-Tree guarantees a single incoming pointer per page with this approach, except for the retired pages explained later.

Master-Tree also extensively uses system transactions for various physical operations. For example, inserting a new record usually consists of a system transaction that physically inserts a logically-deleted record of the key with sufficient body length and a user transaction that logically flips the deleted flag and installs the record. It is worth noting that system transactions are especially useful when used with logical logging, not physiological logging. Because a system transaction does nothing logically, it does not have to write out any log or touch the log manager, which was one drawback of inserting with system transactions [13]. A system transaction in FOEDUS merely takes read-sets/write-sets and follows the same commit protocol as usual transactions.

Master-Tree makes several contributions in extending this prior work for FOEDUS. However, this paper focuses on a key invention: the foster-twins.

8.2 Foster-Twins

Page Split Issue: As described earlier, SILO uses an in-page locking mechanism based on TIDs. One issue with this protocol is that an in-page lock state (TID) can cause false aborts due to page splits, or else requires per-tuple GC.

When a page splits, we must keep at least the lock states of all existing records in the old page, because concurrent transactions are pointing to the physical addresses in the old page. This means we cannot move any records to make room for new insertions, defeating the purpose of the split. One trivial alternative is to abort such transactions (they can at least see that something happened by seeing the updated TID in the old page or by checking the split-counter used in SILO), but this causes a transaction to abort even when the tuple in its read-set was not modified. In an extreme but not uncommon case, even a single-threaded execution might have unnecessary aborts: the transaction inserts many pseudo-deleted records via system transactions, adding the records to its write-set to logically insert them at the pre-commit phase. Once it triggers splits of the page, the transaction has to abort at pre-commit even though there is no concurrent transaction. Another trivial alternative is to re-search all keys at pre-commit from the root of the tree, but this is significantly more expensive and sometimes no better than aborting and retrying the whole transaction.

Yet another trivial solution is to place pointers, rather than data, in pages. The pointers point to dynamically-sized tuple data in a centrally allocated memory pool, as delta-records in Bw-trees [20] do. This avoids the problem of a lock state disappearing at a split, but instead requires per-tuple GC: when the tuples are no longer needed, we have to reclaim their memory. Per-tuple GC is significantly more costly and complicated than per-page GC of fixed-size data pages. It inherently causes fragmentation due to the various sizes of tuples and thus must do compaction, like general GCs in managed languages, which is notoriously expensive. Further, the pointer approach requires one CPU cache miss to access each tuple because the tuple data are not physically placed in the data page.

Foster-Twins: To solve the problem, we introduce a foster-child variation that we call foster-twins. When a page splits, we mark the TIDs of all records in the page as moved and also create two foster children, or foster-twins. Foster-twins consist of a minor (left) foster child and a major (right) foster child. The minor foster child is responsible for the first half of the key region after the split, while the major foster child is responsible for the second half. In other words, the major foster child is the foster child of a Foster B-Tree, while the minor foster child is a fresh new mirror of the old page (but after compaction). At the beginning of the split, we mark the old page as moved, meaning that the page must no longer receive any modifications. In the next tree traversal, the parent page of the old page finds this and adopts the major foster child as in a Foster B-Tree, but it also redirects the pointer to the old page to the minor foster child, marking the old page retired. Retired pages are reclaimed based on epochs to guarantee that no transactions are referencing them at the time of reclamation. Like a Foster B-Tree, Master-Tree has only one incoming pointer per page, so there is no other reference to the retired page except from concurrent transactions that took the addresses of its TIDs in their read-sets and write-sets. Those concurrent transactions, during their pre-commit phase, become aware of the moved mark on the records and track the re-located records into the foster-minor or foster-major children.

Commit Protocol: Now, let us revisit Algorithm 2 in the earlier section. The difference from Algorithm 1 is that we locate the new location of a TID when we see the moved bit, using the foster-twin chain. In the case of frequent splits, the foster-twins might form a binary tree of arbitrary depth (although rarely more than one level), hence this tracking might recurse. We do this tracking without locking to avoid deadlocks. We then sort the tracked TIDs by address and take locks. However, this might cause staleness: concurrent transactions might have split the page again, moving the TIDs. In that case, we release all locks and retry the locking protocol.

Benefits: The main benefit of foster-twins is that every page has a stable key-region for its entire life. Regardless of splits, moves, or retirement, a page is a valid page pointing to precisely the same set of records via its foster-twins. Thus, even if concurrent transactions see moved or even retired pages, they do not have to retry from the root of the tree (which [22] and [30] do). This property drastically simplifies our OCC algorithm. Especially in interior pages, we never have to do the hand-over-hand verification protocol or the split-counter protocol of the original Masstree. A tree search simply reads a probably-correct page pointer and follows it without even memory fences; it just checks the key range, an immutable piece of page metadata, and only locally retries in the page if the range does not match.

When a transaction contains cursors or a miss-search, FOEDUS verifies the number of physical tuples in the border page to avoid phantoms. However, unlike the page-version verification in SILO, FOEDUS does not need to check any information in intermediate pages, or anything other than the number of tuples, such as split-counters.

The simplification not only improves scalability by eliminating retries and fences but also makes Master-Tree a more maintainable non-blocking data structure. While non-blocking algorithms are the key to higher scalability on many cores, a complex non-blocking algorithm that consists of various atomic operations and memory fences is error-prone and hard to implement, debug, test, and even reason about [31]. Even a non-blocking algorithm published in a well-thought-out paper, with an implementation in a widely used library, is sometimes found to contain a bug after a few years [5]. Making the algorithm trivially simple and robust thus has tremendous benefits for real database systems.

Cursors do a similar trick, described in Appendix C. Finally, we point out that the idea of foster-twins also applies to other dynamic tree data structures.
9. EXPERIMENTS

This section provides an empirical evaluation of FOEDUS to verify its efficiency and scalability. We have two sets of experiments: in-memory and NVRAM. The in-memory experiments evaluate the performance of FOEDUS as an in-memory database, while the NVRAM experiments evaluate its performance when a part of the data resides in NVRAM.

9.1 Setup

We run our experiments on the two servers detailed in Table 1. The two servers have identical CPUs and OS; the only difference is the number of CPU sockets, of which DragonHawk has 4 times more.

We have implemented FOEDUS as a general embedded database library written in C++. All experiments presented in this section use the standard TPC-C benchmark. In each evaluated system, we assign every worker thread its own home warehouse, as done in the systems we compare with [9, 25, 30]. This also means that the scale factor (the number of warehouses) is the same as the number of threads.

TPC-C has two kinds of transactions that might access remote warehouses: payment and neworder. Some experiments vary the fraction of these transactions that touch remote warehouses.

Table 1: Experimental Environments

  Model         HP DL580                    HP DragonHawk
  Sockets       4                           16
  Cores (w/HT)  60 (120)                    240 (480)
  CPU           Intel Xeon E7-4890 v2 @ 2.80GHz (both)
  CPU Cache     L2 256KB/core, L3 38MB/socket (both)
  DRAM          3 TB                        12 TB
  OS (Kernel)   Fedora 19 (3.16.6)          RHEL 7.0z (3.16.6)

We retrieved the latest versions of the systems we compare against and compiled them with the latest compilers at the highest optimization levels. We also carefully tuned each system for the environments, varying several parameters and choosing the best-performing ones. In fact, we observe throughput for most, though not all, of the systems evaluated in this section that is even better than the numbers in prior publications [14, 30, 32].

Above all, we configured H-Store, the academic branch of VoltDB, with help from the H-Store team. H-Store runs independent sites, each of which consists of a native execution engine written in C++ and a Java VM that hosts users' stored procedures. We allocate 8-12 sites and set the number of partitions in each site to 8 in order to avoid the scalability issue of Java GC. All clients are non-blocking and configured to maximize H-Store's throughput. All experiments use the serializable isolation level.

9.2 In-Memory Experiments

The first set of experiments places all data in DRAM to compare FOEDUS with the best performance of in-memory databases. We evaluate three systems: FOEDUS, H-Store, and SILO. In these experiments, we turn transactional logging in each system on and off. When logging is on, we use the emulated NVRAM device described in the next section with 5 µs latency. Both SILO and FOEDUS assign 12 worker threads per socket.

[Figure 7: In-memory TPC-C with varied remote transaction fractions on DragonHawk. Remote-fraction=1 is the regular TPC-C. '-Log' writes out transactional logs to NVRAM (5 µs latency).]

Figure 7 shows the results of the experiments on DragonHawk. The x-axis is the fraction of each orderline stored in remote warehouses, which we call the remote-ratio below. We also vary the fraction of remote payments accordingly (remote-ratio * 15%, up to 100%). Prior work [30] did an equivalent evaluation. We denote remoteness as a fraction of remote orderlines, not remote transactions; as each neworder transaction issues 5 to 15 line-items, remote-ratio=1 corresponds to about 10% remote transactions.

In the regular TPC-C setting (remote-ratio=1), FOEDUS achieves 400 times higher throughput than H-Store. The throughput of H-Store drops significantly when some transactions touch remote warehouses, because H-Store triggers distributed transactions with global locks for such transactions. In fact, more than 90% of transactions in H-Store abort at higher remote ratios. This makes H-Store perform even slower on DragonHawk, except at remote-ratio=0. On the other hand, FOEDUS and SILO suffer only modest slowdowns because of their lightweight concurrency control and cache-conscious data structures, resulting in several orders of magnitude faster performance at larger remote ratios.

FOEDUS also performs consistently faster than SILO. FOEDUS's throughput is twice that of SILO because Master-Tree eliminates most transaction aborts and all global search retries, as well as the hand-over-hand verification steps in interior pages.

Table 2: In-memory TPC-C Scale-up (Remote=1). Total (per-core) Throughput [kTPS]. Log ON.

  Server       FOEDUS         SILO          H-Store
  DL580        3659 (61.0)    1868 (31.1)   55.3 (0.92)
  DragonHawk   13897 (57.9)   5757 (24.0)   35.2 (0.15)

Table 2 compares the total and per-core throughput of the systems on DL580 (60 cores) and DragonHawk (240 cores). Although both FOEDUS and SILO scale well, the relative performance difference modestly but consistently grows as the number of cores grows. In all remoteness settings, FOEDUS performs 2.4x faster than SILO on 240 cores and 2x faster on 60 cores. FOEDUS is also less affected by transactional logging: whether FOEDUS writes out logs to tmpfs or NVRAM, it issues only a few large sequential writes per epoch, so the latency has almost no impact. These results verify the scalability of FOEDUS on a large number of cores with highly non-uniform memory accesses.

9.3 NVRAM Experiments

The second set of experiments places the data files and transactional log files on an emulated NVRAM device. We again evaluate three systems, but this time we do not evaluate SILO because it is a pure in-memory database. Instead, we compare FOEDUS with H-Store with anti-caching (75% eviction threshold) and a variant of Shore-MT [14, 16, 25, 32].

9.3.1 NVRAM Emulator

As discussed in Section 2, the latency of future NVRAM devices varies widely. We therefore developed a simple emulator to emulate NVRAM devices with an arbitrary latency. Our emulator leverages the fact that all three systems we evaluate use standard filesystem APIs to access and manipulate log and data files. It emulates NVRAM using DRAM, extending a Linux memory file system (i.e., tmpfs) to inject software-created delays into each file read and write I/O operation. It creates delays using a software spin loop that uses the x86 RDTSCP instruction to read the processor timestamp counter and spins until the counter reaches the intended delay. It also allocates and places data in emulated NVRAM in a NUMA-aware fashion.

Our NVRAM emulator is completely transparent and requires no source code changes to applications as long as they access NVRAM via the filesystem API. Compared to previous hardware-based approaches to emulating NVRAM [11], our approach requires no special hardware or firmware, and most importantly it is not limited to single-socket systems. Thus, it enables us to evaluate larger-scale systems comprising thousands of cores and huge NVRAM.

9.3.2 Transaction Throughput

[Figure 8: NVRAM TPC-C with varied NVRAM latency on DragonHawk, Remote=1 except H-Store (Remote=0). Transactional Logging on.]

Figure 8 shows the TPC-C throughput of the three systems on DragonHawk with various NVRAM latencies that cover all expectations and types of NVRAM (100 ns to 50 µs).

FOEDUS: Because most of its writes are large and sequential, FOEDUS is nearly completely unimpacted by higher NVRAM latency. Even though the size of the snapshot cache was only 50% of the entire data size, the page-based caching effectively reduces the number of read accesses to NVRAM.

FOEDUS-NoSC: To confirm the effects of the snapshot cache, FOEDUS-NoSC always reads snapshot pages from NVRAM. As the result shows, its throughput visibly deteriorates around 10 µs and hits major I/O bottlenecks around 30 µs. In other words, when the snapshot cache is too small or its cache replacement algorithm (so far a simple CLOCK) does not capture the access pattern, the performance of FOEDUS highly depends on whether the read latency of the NVRAM device is ≤10 µs. Another interesting result is that the throughput of FOEDUS-NoSC is lower than that of FOEDUS even when the latency is negligibly low (e.g., 100 ns). This is because the snapshot cache also avoids the overhead of invoking the filesystem API and of copying the page (4 KB) itself, which is expensive especially when the snapshot page exists in another SOC. Therefore, the SOC-local snapshot cache is beneficial regardless of the NVRAM device's latency.

H-Store: On the other hand, anti-caching in H-Store incurs high overheads due to its need to maintain eviction status for each tuple and its frequent I/Os to NVRAM. Java stored procedures with query optimization help developers but have overheads, too. Furthermore, anti-caching requires an additional JNI call when a transaction retrieves evicted tuples (dependencies). As a result, FOEDUS's throughput is more than 100 times higher than that of H-Store. Note: the current implementation of anti-caching in H-Store has some instability when a transaction retrieves a tuple from the anti-cache in a remote partition. Hence, this experiment runs H-Store without remote transactions (remote=0), which is guaranteed to only improve H-Store's performance (according to Figure 7, by about 2x) rather than degrade it.

Shore-MT: Shore-MT keeps about the same performance as H-Store at all latencies. Its throughput drops only when the NVRAM latency is beyond the expected range: we observed a 10% slowdown with 0.1 ms latency, 70% with 0.5 ms, and 150% with 1 ms. However, even an SSD device on the current market has less than 0.1 ms latency. In other words, Shore-MT benefits from replacing HDDs with SSDs but not from replacing SSDs with NVRAM, because its disk-based architecture incurs high overheads that overshadow differences on the order of microseconds. In fact, we did not observe performance differences when varying the size of its bufferpool: the figure shows the throughput of Shore-MT with a bufferpool sized at 75% of the data file, but a 50% bufferpool had almost identical throughput in all realistic NVRAM latency settings. Shore-MT does not scale up to the large number of cores, either. Due to contention in its locking and logging modules, we had to limit the number of worker threads to 26 to get its peak performance, which is consistent with the numbers [32] observes.

9.3.3 Log Gleaner Throughput

Next, we measure the performance of the log gleaner. Although the log gleaner runs concurrently with transactions without any interference, it must keep up with the rate at which transactions emit logs; otherwise, too many volatile pages must be kept in memory, eventually causing transactions to pause until the log gleaner catches up. The same applies to log writers: if loggers cannot write out logs to files fast enough, the workers' circular log buffers become full and pause transaction execution.

Table 3: FOEDUS Log Gleaner Throughput.

  Module    Throughput ±stdev [10^6 logs/sec]
  Mapper    3.39 ± 0.13 per instance
  Reducer   3.89 ± 0.25 per instance
  Logger    12.2 ± 1.3 per instance

Table 3 shows the throughput of each module in the TPC-C experiments on NVRAM. Assuming 50k transactions per core per second, which emit about half a million log entries per core per second, this result indicates that 1 or 2 log mappers and 1 log reducer can keep up with 8 worker threads. For example, in the environments used in our experiments, which have 15 cores per CPU, we assign 12 worker threads, 2 log mappers, and 1 log reducer per socket.

One logger in FOEDUS keeps up with 24 worker threads. Despite a similar design for transactional logging, SILO reports a different balance: 1 logger thread can handle only up to 7 worker threads [30]. The improvement is due to FOEDUS's design of directly writing out worker buffers to log files. A logger in FOEDUS has no buffer of its own aside from a small 4 KB buffer to align boundary cases for direct I/O, whereas SILO copies logs from worker buffers to logger buffers, doubling the overhead.

These results on NVRAM verify that 1) FOEDUS achieves performance on NVRAM equivalent to or better than in-memory systems, and 2) FOEDUS's logging and snapshotting framework can handle a large influx of transaction logs with a sizable number of cores per socket.

9.4 Where the time goes

This section analyzes the current bottlenecks in FOEDUS. Figure 9 shows the results of CPU profiling in DragonHawk during the TPC-C executions in the in-memory setting and in the NVRAM setting with a latency of 5 µs. In this figure, Lock-Acquire contains the wait time of conflicting locks.

[Figure 9: CPU Profile of FOEDUS during TPC-C. Breakdown of CPU cycles (%) into Lock-Acquire, Lock-Release, PreCommit-Verify, NVRAM-IO, and Useful-Work/Other for the in-memory and NVRAM settings.]

Unlike the famous analysis in Shore [15], the vast majority of CPU cycles are spent in actual transactional processing in both settings. This is a significant improvement, especially considering that the prior analysis was about overheads on single-threaded execution. FOEDUS effectively keeps locking and logging contentions small even on the unprecedentedly large number of cores. This is exactly one of the goals of completely separating transactional modifications in volatile pages from the log writers and the log gleaner: they can thus construct snapshot pages in parallel to transaction executions without any races or frequent flushes.

The CPU profile also shows an interesting benefit of the stratified snapshots. Although the NVRAM execution adds the cost to access NVRAM (NVRAM-IO), it reduces lock contention (25.5% → 15.2%) and the overheads of verification (1.8% → 0.6%). As Section 7 discusses, snapshot pages are immutable. Once the transaction follows a snapshot pointer, it never has to check record TIDs, page versions, or page locks except for the single pointer to jump from the volatile world to the snapshot world. This property dramatically reduces the number of read-sets to verify in the pre-commit phase and speeds up reading static pages. This is especially beneficial for OLAP workloads, discussed next.

9.5 OLAP Experiments

TPC-C is an insert-heavy OLTP workload. In order to evaluate the efficiency of FOEDUS for OLAP workloads, we also ran an altered TPC-C that executes analysis-style queries on orders of magnitude larger data. We found that FOEDUS achieves orders of magnitude larger throughput compared to H-Store. Further, we observed an interesting aspect of the dual page architecture: FOEDUS performs faster, not slower, if it drops all volatile pages (hot data) from DRAM and has only snapshot pages (cold data). This happens because verification costs are more significant in OLAP settings and snapshot pages are more efficient to read than equivalent volatile pages thanks to their immutability and density. Appendix A describes this experiment in detail.

9.6 Discussions

Finally, we discuss a few issues that are not explicitly visible in the performance numbers.

First, we observed, and also confirmed with the original authors, that some of the bottlenecks in H-Store are attributed to the Java runtime. H-Store/VoltDB allows users to write their transactions in Java stored procedures that communicate with the native execution engine via JNI. This causes expensive cross-boundary costs as well as the need for the JavaVM's GC. Although garbage collection could be expensive in any language, Java GC is especially known for its poor scalability for small and frequent transactions like OLTP on large NUMA machines. On the other hand, FOEDUS supports only a C/C++ interface, which does not have this issue but is not as productive as Java in terms of developers' productivity. Providing both developer productivity and system efficiency requires further research.

Second, there is another approach to bring the performance of purely in-memory databases to NVRAM. One can run a purely in-memory database, such as SILO, on a virtual memory backed by NVRAM. This approach does not require any code change in the database, but it has a few fundamental limitations. The most important limitation is that it cannot recover from a system crash without redoing all transactional logs. A modification to a virtual memory right before the crash might or might not be made durable. Unless the NVRAM device and the OS provide additional guarantees, such as write-order-dependency and flush-on-failure, the database needs an extra mechanism to handle recovery. Further, the OS must wisely choose memory regions to swap out and modify mapping tables (e.g., hash-tables) in a thread-safe fashion just like database bufferpools, which poses challenges on scalability and complexity. In fact, [14] observes that virtual memory performs poorly as a bufferpool mechanism. One reason is that an OS has no application-domain knowledge. Nevertheless, it is still an interesting direction to overcome these limitations without completely changing the architecture of existing DBMSs for the sake of software compatibility.

10. CONCLUSIONS

The advent of NVRAM and thousand CPU cores demands new database management systems. Neither disk-based nor in-memory databases will scale. We presented the design of FOEDUS, our from-scratch OLTP engine that scales up to a thousand cores and fully exploits NVRAM. Our experiments on a 240-core server revealed that FOEDUS performs orders of magnitude faster than state-of-the-art databases and is substantially more resilient to contention.

There are several topics in FOEDUS we will further investigate. Although the performance results on the 240-core server imply that FOEDUS will not peak out its performance at least for high hundreds of cores, we will run experiments on a thousand cores as soon as possible. We plan to improve the resilience to even higher contention, the main drawback of OCC, by autonomously combining pessimistic approaches depending on workloads.

Another issue we plan to work on is to improve the automatic partitioning algorithm in the log gleaner. The current algorithm simply checks which node has modified the page recently. This sometimes misses the best partitioning, for example a page that is occasionally modified in node-1 but very frequently read in node-2. However, we must avoid requiring each read operation to write something to a global place, even just statistics. We have a preliminary idea on efficient collection of statistics for partitioning and will evaluate its accuracy and overheads.

APPENDIX

A. OLAP EXPERIMENTS

Section 9 used the standard TPC-C, an insert-heavy OLTP workload, to evaluate the performance of FOEDUS. This section evaluates the efficiency of FOEDUS for OLAP workloads, using an altered TPC-C workload as a micro-benchmark. The OLAP workload uses order-status, a read-only transaction in TPC-C that consists of three cursor accesses: 1) retrieving the customer-ID from the customer name, 2) retrieving the last order-ID from the customer-ID, and 3) retrieving the line items of the order.

We also increased the average number of tuples this query reads. Each order in TPC-C contains 5 (MIN_OL_CNT) to 15 (MAX_OL_CNT) line items. Analytic queries usually read hundreds or thousands of tuples. We thus varied MAX_OL_CNT from 15 to 500. The initial size of the database is increased accordingly, turning it into an OLAP-style database of 0.5 TB.

Throughput: Figure 10 shows the throughputs of the OLAP workload. In this experiment, we disabled H-Store's anti-caching to evaluate its best performance, assuming that DRAM is larger than the data set. The workload has no distributed transactions, either. We tested two configurations of FOEDUS to evaluate different aspects of its architecture. FOEDUS-Hot keeps all volatile pages, thus no snapshot pages on NVRAM are accessed. FOEDUS-Cold, on the other hand, drops all volatile pages, thus the transactions read snapshot pages from NVRAM (5 µs latency) with a large snapshot cache.

[Figure 10: OLAP Workload (order-status with many more records). Throughput (10^3 TPS, log scale) of FOEDUS-Hot, FOEDUS-Cold, and H-Store for MAX_OL_CNT = 15, 100, and 500. FOEDUS-Hot keeps all volatile pages while -Cold has only snapshot pages.]

FOEDUS achieved orders of magnitude higher throughputs than H-Store. This is not a surprising result because H-Store is not designed for OLAP workloads. A far more interesting observation is that FOEDUS-Cold is faster than FOEDUS-Hot by about 30-70% in this cursor-heavy OLAP workload. We consistently observed the speed-up unless the snapshot cache is small or the workload consists of a large number of random seeks (e.g., stock-level, which is dominated by uniformly random lookups on the stock table). This is surprising because the general assumption in hybrid databases is that cold data are slower to manipulate. To analyze how this happens, we took a CPU profile of FOEDUS.

Profile: Figure 11 shows the CPU profile in this experiment. A cold execution adds the I/O cost on NVRAM, but there are a few costs that are substantially lower instead.

[Figure 11: CPU Profile of FOEDUS during a cursor-heavy OLAP workload on 0.5 TB data. Breakdown of CPU cycles (%) into Move-Border, Move-Inter, std::sort, Verify, NVRAM-IO, and Others for the Hot and Cold executions.]

First, as reported in the NVRAM experiment, reading from snapshot pages reduces the verification cost (5.0% → 0.5%), which is much more significant in this OLAP setting. Second, the snapshot versions of Master-Tree's border pages are always fully sorted. Hence, a cold execution skips the sorting step to scan a border page. Appendix C explains how cursors work in Master-Trees. Third, stratified snapshots contain more dense intermediate pages than the volatile pages. The fill factor is almost 100% and they do not have foster twins. There are simply fewer intermediate pages to read. Finally, the snapshot cache serves as a local replica, thus reading from it causes only NUMA-local accesses. On the other hand, mutable volatile pages cannot be replicated easily. Especially on a larger database (MAX_OL_CNT=500), this causes a significant speedup (70%) due to the memory bandwidth limitation across sockets.

In sum, the complete separation of volatile and snapshot pages has a few additional performance benefits for OLAP workloads. These observations verify the design of dual pages, whose main goal is to scale beyond the size of DRAM and also efficiently process big-data OLAP applications.

B. NON-SERIALIZABLE TRANSACTIONS

FOEDUS provides a few weaker isolation levels for use-cases where the user prefers even higher efficiency and scalability at the cost of non-serializable results. This appendix section discusses a popular isolation level, snapshot isolation (SI).

B.1 Snapshot Isolation (SI)

SI guarantees that all reads in the transaction will see a consistent and complete image of the database as of some time (time-stamp). FOEDUS requires that the time-stamp of an SI transaction be the epoch of the latest (or an older) snapshot. Unlike a database that assigns a transaction's time-stamp at its beginning (e.g., Oracle), the granularity of timestamps in FOEDUS is much coarser.

B.2 How FOEDUS Executes SI

The execution of SI transactions in FOEDUS is fairly simple because a stratified snapshot by itself is a complete, consistent, and immutable image of the entire database as of a single point of time (snapshot epoch). For reads, a transaction merely follows the root snapshot pointer for all storages as of the latest snapshot. It does not have to remember read-
sets nor pointer sets. For writes, it does the same as serializable execution.

FOEDUS's multi-versioning is based on pages. Unlike tuple-based multi-versioning databases, where a new key insertion might be physically modifying the page, reading and traversing immutable snapshot pages in FOEDUS does not need any additional logic (e.g., skipping a tuple with a newer time-stamp) even for cursor accesses. The transaction merely reads locally-cached snapshot pages and processes them as if the system were single-threaded. Further, because all data as of the exact epoch reside in the page, reading each tuple does not need to access UNDO logs, which are a major source of I/O overheads and CPU cache misses in traditional MVCC databases. In fact, FOEDUS does not have any UNDO logs because FOEDUS applies changes to data only after the transaction is committed.

The drawback, however, is the coarser granularity of time-stamps, which is more likely to cause write skews. The reason we chose this design of SI is that, because FOEDUS provides a highly scalable and efficient serializable execution, a user application of FOEDUS would choose SI probably because of a significantly higher priority on performance rather than on recency of reads or full serializability. Hence, the drastically simple and efficient protocol above would best serve these use-cases.

B.3 Instant Recovery for SI

As described in Section 7.7, FOEDUS runs the log gleaner after restart, taking up to a few minutes when the previous execution did not run the log gleaner during the shutdown, probably because of a system crash. When the user needs to run only read-only SI transactions, FOEDUS can even skip this step because stratified snapshots are immediately queriable databases by themselves, unlike traditional checkpoints. This is a restricted form of instant recovery [24] that provides additional fault tolerance at no cost.

C. CURSORS IN MASTER-TREE

Master-Tree provides a cursor interface for range access, which is the raison d'être of B-trees. A cursor in Master-Tree is designed with the same principles as the point queries discussed in the main body, namely no global retries or hand-over-hand verifications. Algorithm 3 summarizes how Master-Tree scans records forward. The key idea is again the foster-twin and the stable key-region of pages.

Algorithm 3: Summary of Master-Tree's cursor scan
  Procedure cursor_next()
    move_border(last_route);
  Procedure move_inter(route)
    pos = route.pos + 1;
    if pos > page.count then wind up and move;
    if page.fences[pos] ≠ route.last then relocate pos;
    next = page.pointers[pos];
    if next.low ≠ route.last then local retry;
    route.pos = pos, route.last = next.high;
    follow(next);
  Procedure move_border(route)
    pos = ++route.pos;
    if pos ≥ page.count then wind up and move;
    if points to next layer then follow(next_layer);
    Optimistic-Read(page.records[route.order[pos]]);
  Procedure follow(page)
    route ← new route(pos = 0, last = page.low);
    if page is border and volatile then
      if page is moved then follow(foster twins);
      remember page.count;  /* verified during precommit */
      route.order = sort(page.keys);
    end
    move_border/inter(route);

In intermediate pages, the cursor remembers up to which key it has scanned so far (last). The next page to follow is correct if and only if the page's low fence key is equal to the last key. If it does not match, there was some reorganization in the page, but it is guaranteed that the separator still exists in the page. Thus, the cursor locally re-locates the position rather than aborting the transaction or globally retrying from the root. Observe that the algorithm is correct even if the intermediate page is now moved or retired. The cursor eventually reaches the exact same set of border pages without locks or hand-over-hand verifications.

In border pages, however, we have to check the possibility of phantoms. Hence, the transaction remembers the key count of the page as of initially opening the page in addition to per-tuple read-sets. Also, like Masstree [22], Master-Tree never changes the location of existing records to simplify concurrency control. It always appends a new record at the end of the page. Thus, keys are not sorted in border pages. To efficiently scan the page, a cursor calculates the order of keys in the page when it follows a pointer to the page. Bw-Tree has a similar step [20]. This sorting is skipped in snapshot pages because the log gleaner always constructs pages that are fully sorted.

D. DETAILED DATA LAYOUTS

This appendix section provides a concrete layout of a few data structures explained in the main body. Although the main idea and mechanism will stay the same, the exact data layout might change over time. We thus recommend checking the latest source code of FOEDUS.

D.1 TID

TID (Transaction-ID) is the metadata placed next to every tuple. Our TID layout is slightly different from SILO's to accommodate more information and to employ MCS-locking [21], which avoids remote spinning for higher scalability. FOEDUS's TID consists of two parts, each of which is 64 bits, as shown in Figure 12.

[Figure 12: FOEDUS's TID Layout. A 128-bit TID: the first 64 bits are the MCS-Lock part (reserved bits, tail-waiter Thread-ID, and tail MCS-Node), and the second 64 bits are the VID part (flags, epoch number, and in-epoch ordinal).]

The first part, MCS-Lock, contains the ID of the tail waiter of the tuple and the address of its lock-node. A transaction atomically swaps (e.g., x86
lock xchg) this portion to install its lock request and then spins locally on its own node, as done in typical MCS-locking. The second part, VID, is the version number protected by the lock. All logical status bits of the record are stored in this part, including the moved flag, deleted flag, next-layer flag, and being-written flag that indicates whether the record is half-written. Most code in FOEDUS deals only with this part. For instance, an optimistic read just reads the VID, takes a consume or acquire fence, then reads the tuple. Logical log records, described next, also contain only VIDs.

D.2 Log Record

Figure 13 exemplifies a logical log record in FOEDUS. Unlike traditional LSN-based databases, it does not contain any physical information, such as page-IDs and undo-chains. All log records are 64-bit aligned with appropriate paddings. In addition to performance benefits, such as efficiently slicing a 64-bit key for Master-Tree with aligned pointers, the alignment is essential to contiguously fill an arbitrary part of unused log regions with a special filler log for each direct-I/O operation.

[Figure 13: Example Log Record: Master-Tree Log. A common header (Log-Type, Length, Storage-ID, VID) followed by type-specific data (Key-Len, Payl.-Off., Payl.-Len, Reserved, then the key and payload, 64-bit aligned).]

D.3 Read-Write Set

Each transaction maintains an array of the following read- and write-sets:

    struct Read {          struct Write {
      VID observed_;         MCSNode locked_;
      StrID storage_;        StrID storage_;
      void* address_;        void* address_;
      Write* related_;       Read* related_;
    };                       Log* log_;
                           };

Figure 14: Read- and Write-Set Structure.

The related_ field is set when an operation both reads and writes a tuple, which is a very common case. The bi-directional chain is useful to avoid some redundant work during the pre-commit phase, such as tracking moved records. A write-set additionally contains the MCS lock node of the transaction, which is needed to unlock. Both the log pointer and the MCS node always point to the transaction's own memory.

E. H-STORE CONFIGURATION

This appendix section provides the full details of the H-Store configuration for verifying and reproducing our performance comparison with H-Store.

Versions and Patches: Following recommendations from the H-Store team, we used the latest version of H-Store as of Dec 2014 with a patch provided by the H-Store team to partially fix the instability issue of anti-caching mentioned in Section 9. The H-Store team also recommended that we disable remote transactions to measure the best performance of anti-caching, thus we disabled remote transactions in all experiments that use H-Store's anti-caching.

General Configuration: In both the in-memory and NVM experiments, we used the parameters listed in Table 4.

Table 4: H-Store General Configuration.
    site.jvm_asserts         false
    site.cpu_affinity        true
    client.blocking          false
    client.memory            20480
    site.memory              30720
    global.memory            2048
    client.txnrate           10000
    client.threads_per_host  [Adjusted to the throughput]
    hosts                    localhost:0:0-7;localhost:1:8-15;...

Based on recommendations by the H-Store team, we ran the experiments with CPU affinity turned ON/OFF and observed that enabling CPU affinity gives better performance. We always kept the JVM heap size within 32 GB so that the JVM can exploit Compressed-Oops [2] for its best performance. The JVM we used is 64-bit OpenJDK 24.51-b03 (1.7.0_51), Server VM.

For the TPC-C benchmark, we modified the tpcc.properties file. The only changes from the default setting are the fractions of remote transactions. We vary neworder_multip (multip stands for multi-partition) and payment_multip as described in Section 9. When remote=0, we also set neworder_multip_mix and payment_multip_mix to false so that H-Store completely avoids the overhead of distributed transactions. For the OLAP experiment in this appendix, we varied the transaction weights and MAX_OL_CNT in TPCCConstants.java. We also slightly modified the data loader to flush more frequently to handle an increased number of records.

Command Logging Configuration: In the NVM experiment and the in-memory experiment with logging, we enabled H-Store's command-logging feature, placing the log files in our emulated NVRAM device as listed in Table 5.

Table 5: H-Store Command Logging Configuration.
    site.commandlog_enable   true
    site.commandlog_dir      [The NVRAM filesystem]
    site.commandlog_timeout  500

Anti-caching Configuration: In the NVM experiment, we used H-Store's default implementation of anti-caching backed by BerkeleyDB on history, customer, orders, and orderline with a 1 GB eviction threshold (or 75%) per site as listed in Table 6.

Table 6: H-Store Anti-caching Configuration.
    evictable                    HISTORY,CUSTOMER,ORDERS,ORDER_LINE
    site.anticache_dir           [The NVRAM filesystem]
    site.anticache_enable        true
    site.anticache_threshold_mb  1000
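The optimistic read described in Appendix D.1 — read the VID, fence, copy the tuple, then confirm the VID is unchanged — can be sketched as follows. The function name and the flag position are illustrative assumptions, not FOEDUS's actual layout or API:

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Illustrative sketch of an optimistic read in the spirit of D.1 (not
// FOEDUS's code; the flag position is assumed). The reader takes no lock:
// it samples the VID, copies the tuple under an acquire fence, and
// re-checks the VID. A concurrent writer would have set the being-written
// flag or changed the version, causing the reader to retry.
constexpr uint64_t kBeingWritten = 1ULL << 63;  // assumed flag position

bool optimistic_read(const std::atomic<uint64_t>& vid,
                     const char* tuple, char* out, size_t len) {
  uint64_t before = vid.load(std::memory_order_acquire);
  if (before & kBeingWritten) return false;  // half-written record: retry
  std::memcpy(out, tuple, len);
  std::atomic_thread_fence(std::memory_order_acquire);
  // Unchanged VID means the copy observed a consistent tuple image.
  return vid.load(std::memory_order_relaxed) == before;
}
```

On snapshot pages this check is unnecessary altogether, which is the source of the verification savings discussed in Sections 9.4 and 9.5.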

Acknowledgments

We appreciate kind help from the H-Store team, especially Joy Arulraj, Michael Giardino, and Andy Pavlo, for configuring, debugging, and tuning H-Store as well as suggestions on the draft of this paper. We thank Harumi Kuno and numerous colleagues in HP for editorial help, Haris Volos for building the NVRAM emulator, and William Kuszmaul for studying Cuckoo Hashing. We owe Paolo Faraboschi, Gary Gostin, and Naveen Muralimanohar for their insights on hardware characteristics of NVRAM and future servers. Davidlohr Bueso, Aswin Chandramouleeswaran, and Jason Low tuned the Linux kernel for DragonHawk and gave us insightful inputs to make FOEDUS more scalable. We thank the DragonHawk hardware team for setting up the system and the anonymous reviewers for insightful comments.

References
[1] http://www8.hp.com/hpnext/posts/discover-day-two-future-now-machine-hp.
[2] http://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html.
[3] Process integration, devices, and structures. In International Technology Roadmap for Semiconductors, 2007.
[4] K. Asanovic. FireBox: A hardware building block for 2020 warehouse-scale computers (keynote). In FAST, 2014.
[5] S. Burckhardt, R. Alur, and M. M. Martin. CheckFence: Checking consistency of concurrent data types on relaxed memory models. In SIGPLAN Notices, 2007.
[6] G. W. Burr, M. J. Breitwisch, M. Franceschini, D. Garetto, K. Gopalakrishnan, B. Jackson, B. Kurdi, C. Lam, L. A. Lastras, A. Padilla, et al. Phase change memory technology. Journal of Vacuum Science & Technology B, 28(2):223-262, 2010.
[7] D. Chakrabarti, H. Boehm, and K. Bhandari. Atlas: Leveraging locks for non-volatile memory consistency. In OOPSLA, 2014.
[8] J. DeBrabant, J. Arulraj, A. Pavlo, M. Stonebraker, S. Zdonik, and S. R. Dulloor. A prolegomenon on OLTP database systems for non-volatile memory. In ADMS, 2014.
[9] J. DeBrabant, A. Pavlo, S. Tu, M. Stonebraker, and S. Zdonik. Anti-caching: A new approach to database management system architecture. PVLDB, 6(14), 2013.
[10] C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. Hekaton: SQL Server's memory-optimized OLTP engine. In SIGMOD, 2013.
[11] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jackson. System software for persistent memory. In EuroSys, 2014.
[12] A. Eldawy, J. Levandoski, and P. Larson. Trekking through Siberia: Managing cold data in a memory-optimized database. PVLDB, 7(11), 2014.
[13] G. Graefe, H. Kimura, and H. Kuno. Foster B-trees. TODS, 37(3):17, 2012.
[14] G. Graefe, H. Volos, H. Kimura, H. Kuno, J. Tucek, M. Lillibridge, and A. Veitch. In-memory performance for big data. PVLDB, 8(1), 2014.
[15] S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker. OLTP through the looking glass, and what we found there. In SIGMOD, pages 981-992, 2008.
[16] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, and B. Falsafi. Shore-MT: A scalable storage manager for the multicore era. In ICDT, pages 24-35, 2009.
[17] H. Kim, S. Seshadri, C. L. Dickey, and L. Chiu. Evaluating phase change memory for enterprise storage systems: A study of caching and tiering approaches. In FAST, 2014.
[18] B. Lee, E. Ipek, O. Mutlu, and D. Burger. Architecting phase change memory as a scalable DRAM alternative. SIGARCH Computer Architecture News, 37(3), 2009.
[19] M.-J. Lee, C. B. Lee, D. Lee, S. R. Lee, M. Chang, J. H. Hur, Y.-B. Kim, C.-J. Kim, D. H. Seo, S. Seo, et al. A fast, high-endurance and scalable non-volatile memory device made from asymmetric Ta2O5-x/TaO2-x bilayer structures. Nature Materials, 10(8), 2011.
[20] J. J. Levandoski, D. B. Lomet, and S. Sengupta. The Bw-Tree: A B-tree for new hardware platforms. In ICDE, 2013.
[21] P. Magnusson, A. Landin, and E. Hagersten. Queue locks on cache coherent multiprocessors. In Parallel Processing Symposium, 1994.
[22] Y. Mao, E. Kohler, and R. T. Morris. Cache craftiness for fast multicore key-value storage. In EuroSys, 2012.
[23] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4):351-385, 1996.
[24] I. Oukid, W. Lehner, T. Kissinger, T. Willhalm, and P. Bumbulis. Instant recovery for main-memory databases.
[25] I. Pandis, P. Tözün, M. Branco, D. Karampinas, D. Porobic, R. Johnson, and A. Ailamaki. A data-oriented transaction execution engine and supporting tools. In SIGMOD, 2011.
[26] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali. Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling. In MICRO, pages 14-23, 2009.
[27] K. Ren, A. Thomson, and D. J. Abadi. Lightweight locking for main memory database systems. PVLDB, 6(2), 2012.
[28] T. Sakata, K. Sakata, G. Höfer, and T. Horiuchi. Preparation of NbO2 single crystals by chemical transport reaction. Journal of Crystal Growth, 12(2):88-92, 1972.
[29] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams. The missing memristor found. Nature, 453, 2008.
[30] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden. Speedy transactions in multicore in-memory databases. In SOSP, 2013.
[31] J. van den Hooff. Fast Bug Finding in Lock-Free Data Structures with CB-DPOR. PhD thesis, MIT, 2014.
[32] T. Wang and R. Johnson. Scalable logging through emerging non-volatile memory. PVLDB, 7(10), 2014.
[33] G. Weikum and G. Vossen. Transactional Information Systems. 2002.
[34] D. H. Yoon, N. Muralimanohar, J. Chang, P. Ranganathan, N. P. Jouppi, and M. Erez. FREE-p: Protecting non-volatile memory against both hard and soft errors. In HPCA, 2011.
[35] X. Yu, G. Bezerra, A. Pavlo, S. Devadas, and M. Stonebraker. Staring into the abyss: An evaluation of concurrency control with one thousand cores. PVLDB, 8(3), 2014.
[36] M. Zheng, J. Tucek, D. Huang, F. Qin, M. Lillibridge, E. S. Yang, B. W. Zhao, and S. Singh. Torturing databases for fun and profit. In OSDI, 2014.
[37] W. Zheng, S. Tu, E. Kohler, and B. Liskov. Fast databases with fast durability and recovery through multicore parallelism. In SOSP, 2014.