Fast Databases with Fast Durability and Recovery

Multicore in-memory databases for modern machines can support extraordinarily high transaction rates for online transaction processing workloads. A potential weakness, however, is recovery from crash failures. Can classical techniques, such as checkpoints, be made both efficient enough to keep up with current systems’ memory sizes and transaction rates, and smart enough to avoid additional contention?

1. Fast Databases with Fast Durability and Recovery Through Multicore Parallelism Wenting Zheng and Stephen Tu, Massachusetts Institute of Technology; Eddie Kohler, Harvard University; Barbara Liskov, Massachusetts Institute of Technology This paper is included in the Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation. October 6–8, 2014 • Broomfield, CO 978-1-931971-16-4 Open access to the Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX.

2. Fast Databases with Fast Durability and Recovery Through Multicore Parallelism Wenting Zheng, MIT* Stephen Tu, MIT* Eddie Kohler, Harvard University Barbara Liskov, MIT Abstract walk over the entire database, which can cause data Multicore in-memory databases for modern machines movement and cache pollution that reduce concurrent can support extraordinarily high transaction rates for on- transaction performance. Recovery of a multi-gigabyte line transaction processing workloads. A potential weak- database using a single core could take more than 90 ness, however, is recovery from crash failures. Can clas- minutes on today’s machines, which is a long time even sical techniques, such as checkpoints, be made both ef- in a replicated system. ficient enough to keep up with current systems’ mem- Our goal in this work was to develop an in-memory ory sizes and transaction rates, and smart enough to database with full persistence at relatively low cost to avoid additional contention? Starting from an efficient transaction throughput, and with fast recovery, mean- multicore database system, we show that naive logging ing we hoped to be able to recover a large database to and checkpoints make normal-case execution slower, but a transactionally-consistent state in just a few minutes that frequent disk synchronization allows us to keep without replication. Starting from Silo [27], a very fast up with many workloads with only a modest reduction in-memory database system, we built SiloR, which adds in throughput. We design throughout for parallelism: logging, checkpointing, and recovery. Using a combina- during logging, during checkpointing, and during re- tion of logging and checkpointing, we are able to re- covery. The result is fast. Given appropriate hardware cover a 43.2 GB YCSB key-value-style database to a (three SSDs and a RAID), a 32-core system can recover transactionally-consistent snapshot in 106 seconds, and a 43.2 GB key-value database in 106 seconds, and a a more complex > 70 GB TPC-C database with many > 70 GB TPC-C database in 211 seconds. tables and secondary indexes in 211 seconds. Perhaps more interesting than our raw performance is 1 Introduction the way that performance was achieved. We used con- currency in all parts of the system. The log is written In-memory databases on modern multicore ma- concurrently to several disks, and a checkpoint is taken chines [10] can handle complex, large transactions at by several concurrent threads that also write to multi- millions to tens of millions of transactions per second, ple disks. Concurrency was crucial for recovery, and we depending on transaction size. A potential weakness found that the needs of recovery drove many of our de- of such databases is robustness to crashes and power sign decisions. The key to fast recovery is using all of failures. Replication can allow one site to step in for the machine’s resources, which, on a modern machine, another, but even replicated databases must write data means using all cores. But some designs tempting on the to persistent storage to survive correlated failures, and logging side, such as operation logging (that is, logging performance matters for both persistence and recovery. transaction types and arguments rather than logging val- Crash resistance mechanisms, such as logging and ues), are difficult to recover in parallel. This drive for checkpointing, can enormously slow transaction execu- fast parallel recovery affected many aspects of our log- tion if implemented naively. Modern fast in-memory ging and checkpointing designs. databases running tens of millions of small transactions Starting with an extremely fast in-memory database, per second can generate more than 50 GB of log data we show: per minute when logging either values or operations. In terms of both transaction rates and log sizes, this is up • All the important durability mechanisms can and to several orders of magnitude more than the values re- should be made parallel. ported in previous studies of in-memory-database dura- bility [2, 14, 24]. Logging to disk or flash is at least theo- • Checkpointing can be fast without hurting normal retically fast, since log writes are sequential, but sequen- transaction execution. The fastest checkpoints in- tial log replay is not fast on a modern multicore machine. troduce undesired spikes and crashes into concur- Checkpoints are also required, since without them, logs rent throughput, but through good engineering and would grow without bound, but checkpoints require a by pacing checkpoint production, this variability *Currently at University of California, Berkeley. can be reduced enormously. USENIX Association 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14)  465

3. • Even when checkpoints are taken frequently, a obtains the TID for a committing transaction by effec- high-throughput database will have to recover from tively incrementing a global counter. On modern multi- a very large log. In our experiments, log recovery core hardware, though, any global counter can become is the bottleneck; for example, to recover a 35 GB a source of performance-limiting contention. Silo elim- TPC-C database, we recover 16 GB from a check- inates this contention using time periods called epochs point and 180 GB from the log, and log recovery ac- that are embedded in TIDs. A global epoch number E counts for 90% of recovery time. Our design allows is visible to all threads. A designated thread advances it us to accomplish log replay at roughly the maxi- periodically (every 40 ms). Worker threads use E during mum speed of I/O. the commit procedure to compute the new TID. Specif- ically, the new TID is (a) greater than any TID in the • The system built on these ideas can recover a rela- read-set, (b) greater than the last TID committed by this tively large database quite quickly. worker, and (c) in epoch E. This avoids false contention on a global TID, but 2 Silo overview fundamentally changes the relationship between TIDs and the serial order. Consider concurrent transactions We build on Silo, a fast in-memory relational database T1 and T2 where T1 reads a key that T2 then over- that provides tables of typed records. Clients issue one- writes. The relationship between T1 and T2 is called shot requests: all parameters are available when a re- an anti-dependency: T1 must be ordered before T2 be- quest begins, and the request does not interact with its cause T1 depends on the absence of T2. In conventional caller until it completes. A request is dispatched to a sin- OCC, whose TIDs capture anti-dependencies, our ex- gle database worker thread, which carries it out to com- ample would always have TID(T1) < TID(T2). But in pletion (commit or abort) without blocking. Each worker Silo, there is no communication whatsoever from T1 to thread is pinned to a physical core of the server machine. T2, and we could find TID(T1) > TID(T2)! This means Most cores run workers, but SiloR reserves several cores that replaying a Silo database’s committed transactions for logging and checkpointing tasks. in TID order might recover the wrong database. Silo tables are stored in efficient, cache-friendly con- Epochs provide the key to correct replay. On total- current B-trees [15]. Each table uses one primary tree store-order (TSO) architectures like x86-64, the desig- and zero or more secondary trees for secondary indexes. nated thread’s update of E becomes visible at all workers Key data is embedded in tree structures, and values are simultaneously. Because workers read the current epoch stored in separately-allocated records. All structures are at the serialization point, the ordering of TIDs with dif- stored in shared memory, so any worker can access the ferent epochs is always compatible with the serial or- entire database. der, even in the case of anti-dependencies. Epochs allow Silo uses a variant of optimistic concurrency control for a form of group commit: SiloR persists and recovers (OCC) [11] to serialize transactions. Concurrency con- in units of epochs. We describe below how this impacts trol centers on transaction IDs (TIDs). Each record con- logging, checkpointing, and recovery. tains the TID of the transaction that most recently mod- ified it. As a worker runs a transaction, it maintains a 3 Logging read-set containing the old TID of each read or written This section explains how SiloR logs transaction modifi- record, and a write-set containing the new state of each cations for persistence. Our design builds on Silo, which written record. On transaction completion, a worker de- included logging but did not consider recovery, log trun- termines whether the transaction can commit. First it cation, or checkpoints. The SiloR logging subsystem locks the records in the write-set (in a global order to adds log truncation, makes changes related to liveness, avoid deadlock). Then it computes the transaction’s TID; and allows more parallelism on replay. this is the serialization point. Next it compares the TIDs of records in the read-set with those records’ current 3.1 Basic logging TIDs, and aborts if any TIDs have changed or any record The responsibility for logging in SiloR is split between is locked by a different transaction. Otherwise it com- workers, which run transactions, and separate logging mits and overwrites the write-set records with their new threads (“loggers”), which handle only logging, check- values and the new TID. pointing, and other housekeeping tasks. Workers gener- ate log records as they commit transactions; they pass 2.1 Epochs these records to loggers, which commit the logs to disk. Silo transaction IDs differ in an important way from When a set of logs is committed to disk via fsync, the those in other systems, and this difference impacts the loggers inform the workers. This allows workers to send way SiloR does logging and recovery. Classical OCC transaction results to clients. 466  11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14) USENIX Association

4. A log record comprises a committed transaction’s TID more work to analyze transactions and split their up- plus the table, key, and value information for all records dates appropriately. More fundamentally, every worker modified by that transaction. Each worker constructs might have to communicate with every logger. Though log records in disk format and stores them in a mem- log records are written in batches (so the communica- ory buffer taken from a per-worker buffer pool. When a tion would not likely introduce contention), this design buffer fills, or at an epoch boundary, the worker passes would inevitably introduce remote writes or reads: phys- the buffer to the logger over a shared-memory queue. ical memory located on one socket would be accessed, either for writes or reads, by a thread running on a dif- 3.2 Value logging vs. operation logging ferent socket. Remote accesses are expensive and should SiloR uses value logging, not operation or transaction be avoided when possible. logging. This means that SiloR logs contain each trans- Our final design divides workers into disjoint subsets, action’s output keys and values, rather than the identity and assigns each subset to exactly one logger. Core pin- of the executed operation and its parameters. ning is used to ensure that a logger and its workers run The choice of value logging is an example of re- on the same socket, making it likely that log buffers al- covery parallelism driving the normal-case logging de- located on a socket are only accessed by that socket. sign. Value logging has an apparent disadvantage rela- tive to operation logging: for many workloads (such as 3.4 Buffer management TPC-C) it logs more data, and therefore might unnec- Although loggers should not normally limit transaction essarily slow transaction execution. However, from the execution, loggers must be able to apply backpressure point of view of recovery parallelism, the advantages of to workers, so that workers don’t generate indefinite value logging outweigh its disadvantages. Value logging amounts of log data. This backpressure is implemented is easy to replay in parallel—the largest TID per value by buffer management. Loggers allocate a maximum wins. This works in SiloR because TIDs reflect depen- number of log buffers per worker core. Buffers circu- dencies, i.e., the order of writes, and because we recover late between loggers and workers as transactions exe- in units of epochs, ensuring that anti-dependencies are cute, and a worker blocks when it needs a new log buffer not a problem. Operation logging, in contrast, requires and one is not available. A worker flushes a buffer to that transactions be replayed in their original serial or- its logger when either the buffer is full or a new epoch der. This is always hard to parallelize, but in Silo, it begins, whichever comes first. It is important to flush would additionally require logging read-sets (keys and buffers on epoch changes, whether or not those buffers TIDs) to ensure anti-dependencies were obeyed. Op- are full, because SiloR cannot mark an epoch as persis- eration logging also requires that the initial pre-replay tent until it has durably logged all transactions that hap- database state be a transactionally consistent snapshot, pened in that epoch. Each log buffer is 512 KB. This which value logging does not; and for small transactions is big enough to obtain some benefit from batching, but value and operation logs are about the same size. These small enough to avoid wasting much space when a par- considerations led us to prefer value logging in SiloR. tial buffer is flushed. We solve the problem of value logging I/O by adding We found that log-buffer backpressure in Silo trig- hardware until logging is not a bottleneck, and then us- gered unnecessarily often because it was linked with ing that hardware wisely. fsync times. Loggers amplified file system hiccups, such as those caused by concurrent checkpoints, into major 3.3 Workers and loggers dips in transaction rates. SiloR’s loggers instead recircu- Loggers have little CPU work to do. They collect logs late log buffers back to workers as soon as possible— from workers, write them to disk, and await durability after a write, rather than after the following epoch notification from the kernel via the fsync/fdatasync sys- change and fsync. We also increased the number of log tem call. Workers, of course, have a lot of CPU work to buffers available to workers, setting this to about 10% of do. A SiloR deployment therefore contains many worker the machine’s memory. The result was much less noise threads and few logger threads. We allocate enough log- in transaction execution rates. ger threads per disk to keep that disk busy, one per disk in our evaluation system. 3.5 File management But how should worker threads map to logger Each SiloR logger stores its log in a collection of files threads? One possibility is to assign each logger a par- in a single directory. New entries are written to a file tition of the database. This might reduce the data writ- called data.log, the current log file. Periodically (cur- ten by loggers (for example, it could improve the ef- rently every 100 epochs) the logger renames this file to ficacy of compression), and it might speed up replay. old_data.e, where e is the largest epoch the file contains, We rejected this design because of its effect on normal- then starts a new data.log. Using multiple files simpli- case transaction execution. Workers would have to do fies the process of log truncation and, in our measure- USENIX Association 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14)  467

5.ments, didn’t slow logging relative to Silo’s more prim- actions in epochs ≤ e p , and that no results from trans- itive single-file design. actions with epoch > e p were released to clients. There- Log files do not contain transactions in serial order. A fore it is safe for recovery to recover all transactions with log file contains concatenated log buffers from several epochs ≤ e p , and also necessary since those results may workers. These buffers are copied into the log without have been released to clients. It has one important dis- rearrangement; in fact, to reduce data movement, SiloR advantage, namely that the critical path for transaction logger threads don’t examine log data at all. A log file commit contains two fsyncs (one for the log file and one can even contain epochs out of order: a worker that de- for pepoch) rather than one. This somewhat increases lays its release of the previous epoch’s buffer will not latency. prevent other workers from producing buffers in the new epoch. All we know is that a file old_data.e contains no 4 Checkpoints records with epochs > e. And, of course, a full log com- Although logs suffice to recover a database, they do prises multiple log directories stored independently by not suffice to recover a database in bounded time. In- multiple loggers writing to distinct disks. Thus, no single memory databases must take periodic checkpoints of log contains enough information for recovery to produce their state to allow recovery to complete quickly, and to a correct database state. It would be possible to extract support log truncation. This section describes how SiloR this information from all logs, but instead SiloR uses takes checkpoints. a distinguished logger thread to maintain another file, pepoch, that contains the current persistent epoch. The 4.1 Overview logger system guarantees that all transactions in epochs Our main goal in checkpoint production is to produce ≤ pepoch are durably stored in some log. This epoch is checkpoints as quickly as possible without disrupting calculated as follows: worker throughput. Checkpoint speed matters because 1. Each worker w advertises its current epoch, ew , and it limits the amount of log data that will need to be re- guarantees that all future transactions it sends to its played at recovery. The smaller the distance between logger will have epoch ≥ ew . It updates ew by set- checkpoints, the less log data needs to be replayed, and ting ew ← E after flushing its current log buffer to we found the size of the log to be the major recovery ex- its logger. pense. Thus, as with log production, checkpointing uses multiple threads and multiple disks. 2. Each logger l reads log buffers from workers and Checkpoints are written by checkpointer threads, one writes them to log files. per checkpoint disk. In our current implementation checkpoints are stored on the same disks as logs, and 3. Each logger regularly decides to make its writes loggers and checkpointers execute on the same cores durable. At that point, it calculates the minimum of (which are separate from the worker cores that exe- the ew for each of its workers and the epoch number cute transactions). Different checkpointers are responsi- of any log buffer it owns that remains to be written. ble for different slices of the database; a distinguished This is the logger’s current epoch, el . The logger checkpoint manager assigns slices to checkpointers. then synchronizes all its writes to disk. Each checkpointer’s slices amount to roughly 1/n th of 4. After this synchronization completes, the logger the database, where n is the number of disks. A check- publishes el . This guarantees that all associated point is associated with a range of epochs [el , eh ], where transactions with epoch < el have been durably each checkpointer started its work during or after el and stored for this logger’s workers. finished its work during or before eh . Each checkpointer walks over its assigned database 5. The distinguished logger thread periodically com- slices in key order, writing records as it goes. Since putes a persistence epoch e p as min{el } − 1 over OCC installs modifications at commit time, all records all loggers. It writes e p to the pepoch file and then seen by checkpointers are committed. This means that synchronizes that write to disk. full ARIES-style undo and redo logging is unnecessary; 6. Once pepoch is durably stored, the distinguished the log can continue to contain only “redo” records for logger thread publishes e p to a global variable. At committed transactions. However, concurrent transac- that point all transactions with epochs ≤ e p have be- tions continue to execute during the checkpoint period, come durable and workers can release their results and they do not coordinate with checkpointers except to clients. via per-record locks. If a concurrent transaction commits multiple modifications, there is no guarantee the check- This protocol provides a form of group commit. It en- pointers will see them all. SiloR checkpoints are thus in- sures that the logs contain all information about trans- consistent or “fuzzy”: the checkpoint is not necessarily 468  11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14) USENIX Association

6.a consistent snapshot of the database as of a particular The checkpoint is organized to enable efficient recov- point in the serial order. To recover a consistent snap- ery. During recovery, all cores are available, so we de- shot, it is always necessary both to restore a checkpoint signed the checkpoint to facilitate using those cores. and to replay at least a portion of the log. For each table, each checkpointer divides its assigned We chose to produce an inconsistent checkpoint be- key range into m files, where m is the number of cores cause it’s less costly in terms of memory usage than that would be used during recovery for that key range. a consistent checkpoint. Silo could produce consis- Each of a checkpointer’s m files are stored on the same tent checkpoints using its support for snapshot trans- disk. As the checkpointer walks over its range of the ta- actions [27]. However, checkpoints of large databases ble, it writes blocks of keys to these m files. Each block take a long time to write (multiple tens of seconds), contains a contiguous range of records, but blocks are which is enough time for all database records to be over- assigned to files in round-robin order. There is a tension written. The memory expense associated with preserv- here between two aspects of fast recovery. On the one ing the snapshot for this period, and especially the al- hand, recovery is more efficient when a recovery worker location expense associated with storing new updates is given a continuous range of records, but on the other in newly-allocated records (rather than overwriting old hand, recovery resources are more effectively used when records), reduces normal-case transaction throughput by the recovery workload is evenly distributed (each of the 10% or so. We prefer better normal-case throughput. m files contain about the same amount of work). Calcu- Our choice of inconsistent checkpoints further neces- lating a perfect partition of an index range into equal- sitates our choice of value logging; it is impossible to size subranges is somewhat expensive, since to do this recover from an inconsistent checkpoint without either requires tree walks. We chose a point on this tradeoff value logging or some sort of ARIES-style undo log- where indexes are coarsely divided among checkpoint- ging. ers into roughly-equal subranges, but round-robin as- Another possible design for checkpoints is to avoid signment of blocks to files evens the workload at the file writing information about keys whose records haven’t level. changed since the previous checkpoint, for example, de- The checkpoint manager thread starts a new check- signing a disk format that would allow a new check- point every C seconds. It picks the partition for each ta- point to elide unmodified key ranges. We rejected this ble and writes this information into a shared array. It then approach because ours is simpler, and also because chal- records el , the checkpoint’s starting epoch, and starts up lenging workloads, such as uniform updates, can cause n checkpointer threads, one per disk. For each table, each any design to effectively write a complete checkpoint ev- thread creates the corresponding checkpoint files and ery time a checkpoint is required. We wanted to under- walks over its assigned partition using a range scan on stand the performance limits caused by these workloads. the index tree. As it walks, it constructs a block of record In an important optimization, checkpointer threads data, where each record is stored as a key/TID/value tu- skip any records with current epoch ≥ el . Thus, the ple. When its block fills up, the checkpointer writes that checkpoint contains those keys written in epochs < el block to one of the checkpoint files and continues. The that were not overwritten in epochs ≥ el . It is not nec- next full block is written to the next file in round-robin essary to write such records because, given any incon- order. sistent checkpoint started in el , it is always necessary to Each time a checkpointer’s outstanding writes exceed replay the log starting at epoch el . Specifically, the log 32 MB, it syncs them to disk. These intermediate syncs must be complete over a range of epochs [el , ex ], where turned out to be important for performance, as we dis- ex ≥ eh , for recovery of a consistent snapshot to be pos- cuss in §6.2. sible. There’s no need to store a record in the checkpoint When a checkpointer has processed all tables, it does that will be replayed by the log. This optimization re- a final sync to disk. It then reads the current epoch E duces our checkpoint sizes by 20% or more. and reports this information to the manager. When all checkpointers have reported, the manager computes eh ; 4.2 Writing the checkpoint this is the maximum epoch reported by the checkpoint- Checkpointers walk over index trees to produce the ers, and thus is the largest epoch that might have updates checkpoint. Since we want each checkpointer to be re- reflected in the checkpoint. Although, thanks to our re- sponsible for approximately the same amount of work, duced checkpoint strategy, new tuples created during eh yet tables differ in size, we have all checkpointers walk are not stored in the checkpoint, tuples removed or over- over all tables. To make the walk efficient, we partition written during eh are also not stored in the checkpoint, the keys of each table into n subranges, one per check- so the checkpoint can’t be recovered correctly without pointer. This way each checkpointer can take advantage complete logs up to and including eh . Thus, the man- of the locality for keys in the tree. ager waits until eh ≤ e p , where e p is the persistence USENIX Association 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14)  469

7.epoch computed by the loggers (§3.5). Once this point is 5.2 Log recovery reached, the manager installs the checkpoint on disk by After all threads have finished their assigned checkpoint writing a final record to a special checkpoint file. This recovery tasks, the system moves on to log recovery. As file records el and eh , as well as checkpoint metadata, mentioned in §3, there was no attempt at organizing the such as the names of the database tables and the names log records at runtime (e.g. partitioning the log records of the checkpoint files. based on what tables were being modified). Instead it is likely that each log file is a jumble of modifications to 4.3 Cleanup various index trees. This situation is quite different than After the checkpoint is complete, SiloR removes old it was for the checkpoint, which was organized so that files that are no longer needed. This includes any previ- concurrent threads could work on disjoint partitions of ous checkpoints and any log files that contain only trans- the database. However, SiloR uses value logging, which actions with epochs < el . Recall that each log comprises has the property that the logs can be processed in any a current file and a number of earlier files with names order. All we require is that at the end of processing, ev- like old_data.e. Any file with e < el can be deleted. ery key has an associated value corresponding to the last The next checkpoint is begun roughly 10 seconds af- modification made up through the most recent persistent ter the previous checkpoint completed. Log replay is far epoch prior to the failure. If there are several modifi- more expensive than checkpoint recovery, so we aim to cations to a particular key k, these will have associated minimize log replay by taking frequent checkpoints. In TIDs T1, T2, and so on. Only the entry with the largest future work, we would like to investigate a more flexi- of these TIDs matters; whether we happen to find this ble scheme that, for example, could delay a checkpoint entry early or late in the log recovery step does not. if the log isn’t growing too fast. We take advantage of this property to process the log in parallel, and to avoid unnecessary allocations, copies, and work. First the manager thread reads the pepoch 5 Recovery file to obtain e p , the number of the most recent persis- tent epoch. All log records for transactions with TIDs SiloR performs recovery by loading the most recent for later epochs are ignored during recovery. This is im- checkpoint, then correcting it using information in the portant for correctness since group commit has not fin- log. In both cases we use many concurrent threads to ished for those later epochs; if we processed records for process the data and we overlap processing and I/O. epochs after e p we could not guarantee that the resulting database corresponded to a prefix of the serial order. 5.1 Checkpoint recovery The manager reads the directory for each disk, and To start recovery, a recovery manager thread reads the creates a variable per disk, Ld , that is used to track which latest checkpoint metadata file. This file contains infor- log files from that disk have been processed. Initially mation about what tables are in the system and el , the this variable is set to the number of relevant log files for epoch in which the checkpoint started. The manager cre- that disk, which, in our experiments, is in the hundreds. ates an in-memory representation for each of the T index Then the manager starts up g log processor threads for trees mentioned in the checkpoint metadata. In addition each disk. We use all threads during log recovery. For in- it deletes any checkpoint files from earlier or later check- stance, on a machine with N cores and n disks, we have points and removes all log files from epochs before el . g = N/n . This can produce more recovery threads than The checkpoint is recovered concurrently by many there are cores. We experimented with the alternative threads. Recall that the checkpoint consists of many files m = N/n , but this leaves some cores idle during re- per database table. Each table is recorded on all n disks, covery, and we observed worse recovery times than with partitioned so that on each disk there are m files for oversubscription. each table. Recovery is carried out by n × m threads. A log processor thread proceeds as follows. First it Each thread reads from one disk, and is responsible for reads, decrements, and updates Ld for its disk. This up- reading and processing T files from that disk (one file date is done atomically: this way it learns what file it per index tree). Processing is straightforward: for each should process, and updates the variable so that the next key/value/TID in the file, the key is inserted in the index log processor for its disk will process a different file. tree identified by the file name, with the given value and If the value it reads from Ld is ≤ 0, the log processor TID. Since the files contain different key ranges, check- thread has no more work to do. It communicates this to point recovery threads are able to reconstruct the tree in the manager and stops. Otherwise the processor thread parallel with little interference; additionally they benefit reads the next file, which is the newest file that has not from locality when processing a subrange of keys in a yet been processed. In other words, we process the files particular table. in the opposite order than they were written. The proces- 470  11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14) USENIX Association

8.sor thread on disk d that first reads Ld processes the cur- 6 Evaluation rent log file data.log; after this files are read in reverse In this section, we evaluate the effectiveness of the tech- order by the epoch numbers contained in their names. niques in SiloR, confirming the following performance The files are large enough that, when reading them, we hypotheses: get good throughput from the disk; there’s little harm in reading the files out of order (i.e., in an order different • SiloR’s checkpointer has only a modest effect on from the order they were written). both the latency and throughput of transactions on The processor thread reads the entries in the file se- a challenging write-heavy key-value workload and a quentially. Recall that each entry contains a TID t and a typical online transaction processing workload. set of table/key/value tuples. If t contains an epoch num- • SiloR recovers 40–70 GB databases within minutes, ber that is < el or > e p , the thread skips the entry. Oth- even when crashes are timed to maximize log replay. erwise, the thread inserts a record into the table if its key 6.1 Experimental setup isn’t there yet; when a version of the record is already in All of our experiments were run on a single machine the table, the thread overwrites only if the log record has with four 8-core Intel Xeon E7-4830 processors clocked a larger TID. at 2.1 GHz, yielding a total of 32 physical cores. Each Value logging replay has the same result no matter core has a private 32 KB L1 cache and a private 256 KB what order files are processed. We use reverse order for L2 cache. The eight cores on a single processor share a reading log files because it uses the CPU more efficiently 24 MB L3 cache. The machine has 256 GB of DRAM than forward order when keys are written multiple times. with 64 GB of DRAM attached to each socket, and runs When files are processed in strictly forward order, ev- 64-bit Linux 3.2.0. We run our experiments without net- ery log record will likely require overwriting some value worked clients; each database worker thread runs with in the tree. When files are processed in roughly reverse an integrated workload generator. We do not take advan- order, and keys are modified multiple times, then many tage of our machine’s NUMA-aware memory allocator, log records don’t require overwriting: the tree’s current a decision discussed in §6.5. value for the key, which came from a later log file, is We use three separate Fusion ioDrive2 flash drives often newer than the log record. and one RAID-5 disk array. Each disk is used for both 5.3 Correctness logging and checkpointing. Each drive has a dedicated logger thread and checkpointer thread, both of which Our recovery strategy is correct because it restores the run on the same core. Within a drive, the log and check- database to the state it had at the end of the last persis- point information reside in separate files. Each logger or tent epoch e p . The state of the database after processing checkpointer writes to a series of files on a single disk. the checkpoint is definitely not correct: it is inconsis- We measure three related databases, SiloR, LogSilo, tent, and it is also missing modifications of persistent and MemSilo. These systems have identical in-memory transactions that ran after it finished. All these problems database structures. SiloR is the full system described are corrected by processing the log. The log contains all here, including logging and checkpointing. LogSilo is modifications made by transactions that ran in epochs in a version of SiloR that only logs data: there are no el up through e p . Therefore it contains what is needed to checkpointer threads or checkpoints. MemSilo is Silo rectify the checkpoint. Furthermore, the logic used to do run without persistence, and is a later version of the sys- the rectification leads to each record holding the modifi- tem of Tu et al. [27] Unless otherwise noted, we run cation of the last transaction to modify it through epoch SiloR and LogSilo with 28 worker threads and MemSilo e p , because we make this decision based on TIDs. And, with 32 worker threads. importantly, we ignore log entries for transactions from epochs after e p . 6.2 Key-value workload It’s interesting to note that value logging works with- To demonstrate that SiloR can log and checkpoint with out having to know the exact serial order. All that is re- low overhead, we run SiloR on a variant of YCSB quired is enough information so that we can figure out workload mix A. YCSB is a popular key-value bench- the most recent modification. That is, log record “version mark from Yahoo [4]. We modified YCSB-A to have numbers” must capture dependencies, but need not cap- a read/write (get/put) ratio of 70/30 (not 50/50), and a ture anti-dependencies. Silo TIDs meet this requirement. record size of 100 bytes (not 1000). This workload mix And because TID comparison is a simple commutative was originally designed for MemSilo to stress database test, log processing can take place in any order. In addi- internals rather than memory allocation; though the tion, of course, we require the group commit mechanism read/write ratio is somewhat less than standard YCSB- provided by epochs to ensure that anti-dependencies are A, it is still quite high compared to most workloads. also preserved. Our read and write transactions sample keys uniformly. USENIX Association 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14)  471

9.Figure 1: Throughput and latency of SiloR, and Figure 3: Throughput and latency of SiloR on throughput of LogSilo and MemSilo, on our mod- YCSB with 32 workers. Average throughput was ified YCSB benchmark. Average throughput was 9.14 Mtxn/s and average latency 153 ms. 8.76 Mtxn/s, 9.01 Mtxn/s, and 10.83 Mtxn/s, respec- tively. Average SiloR latency was 90 ms/txn. Database 8.76 Mtxn/s, 80% the average throughput of MemSilo size was 43.2 GB. Grey regions show those times (10.83 Mtxn/s). Average latency is affected by logging when the SiloR experiment was writing a checkpoint. and checkpointing somewhat more significantly; it is 90 ms/transaction.1 Some of this latency is inherent in Silo’s epoch design. Since the epoch advances every 40 ms, average latency cannot be less than 20 ms. The rest is due to a combination of accumulated batching de- lays (workers batch transactions in log buffers, loggers batch updates to synchronizations) and delays in the per- sistent storage itself (i.e., the two fsyncs in the critical path each take 10–20 ms, and sometimes more). Never- theless, we believe this latency is not high for a system involving persistent storage. Figure 2: Throughput of MemSilo on YCSB with During the experiment, SiloR generates approxi- 32 and 28 workers. Average throughput was mately 298 MB/s of IO per disk. The raw bandwidth of 10.83 Mtxn/s and 9.77 Mtxn/s, respectively. our Fusion IO drives is reported as 590 MB/s/disk; we are achieving roughly half of this. There are 400M keys for a total database size of roughly SiloR and LogSilo’s throughput is less than Mem- 43.2 GB (3.2 GB of key data, 40 GB of value data). Silo’s for several reasons, but as Figure 2 shows, an im- Figure 1 shows the results over a 10-minute exper- portant factor is simply that MemSilo has more workers iment. Checkpointing can be done concurrently with available to run transactions. SiloR and LogSilo require logging without greatly affecting transaction through- extra threads to act as loggers and checkpointers; we put. The graph shows, over the length of the experi- run four fewer workers to leave cores available for those ment, rolling averages of throughput and latency with threads. If we run MemSilo with 28 workers, its through- a 0.5-second averaging window. For SiloR and LogSilo, put is reduced by roughly 10% to 9.77 Mtxn/s, making throughput and latency are measured to transaction per- up more than half the gap with SiloR. We also ran SiloR sistence (i.e., latency is from the time a transaction is with 32 workers. This bettered the average throughput submitted to the time SiloR learns the transaction’s ef- to 9.13 Mtxn/s, but CPU oversubscription caused wide fects are persistent). Intervals during which the check- variability in throughput and latency (Figure 3). pointer is running are shown in gray. Figure 1’s results As we expect, the extensive use of group commit in are typical of our experimental runs; Figure 6 in the ap- LogSilo and SiloR make throughput, and particularly la- pendix shows two more runs. tency, more variable than in MemSilo. Relative to Mem- SiloR is able to run multiple checkpoints and al- 1 Due to a technical limitation in SiloR’s logger implementation, the most match LogSilo’s throughput. Its throughput is also latency shown in the figure is the (running) average latency for write close to that of MemSilo, although MemSilo does no transactions only; we believe these numbers to be a converative upper logging or checkpointing whatsoever: SiloR achieves bound on the actual latency of the system. 472  11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14) USENIX Association

10. (a) 1 fsync (b) 1 fsync, 600ms sleeps (c) Regular fsyncs (SiloR) Figure 4: Importance of regular disk synchronization. In (a), one fsync call synchronizes the checkpoint; throughput and latency are extremely bursty (note the latency axis tops out at 2 sec). In (b), regular sleep calls in the checkpoint threads reduce burstiness, but do not eliminate it. In (c), SiloR, regular calls to fsync almost entirely eliminate burstiness. Here we run the modified YCSB benchmark. Silo with 28 cores, LogSilo’s performance is more vari- tomers assigned to a set of districts within a local ware- able, and SiloR’s more variable still. The spike in latency house, placing orders in those districts. There are ten pri- visible in Figure 1, which happened at one time or an- mary tables plus two secondary indexes (SiloR treats pri- other in most of our runs, is discussed below in §6.5. mary tables and secondary indexes identically). We do not model client “think” time, and we run the standard Importance of regular synchronization. A check- workload mix. This contains 45% “new-order” transac- point is useless until it is complete, so the obvious dura- tions, which contain 8–18 inserts and 5–15 overwrites bility strategy for a checkpointer thread is to call fsync each. Also write-heavy are “delivery” transactions (4% once, after writing all checkpoint data and before report- of the mix), which contain up to 150 overwrites and 10 ing completion to the manager. But SiloR checkpoint- removes each.2 Unmodified TPC-C is not a great fit for ers call fsync far more frequently—once per 32 MB of an in-memory database: very few records are removed, data written. Figure 4 shows why this matters: the naive so the database grows without bound. During our 10- strategy, (a), is very unstable on our Linux system, in- minute experiments, database record size (not includ- ducing wild throughput swings and extremely high la- ing keys) grows from 2 GB to 94 GB. Nevertheless, the tency. Slowing down checkpointer threads through the workload is well understood and challenging for our sys- occasional introduction of sleep() calls, (b), reduces the tem. problem, but does not eliminate it. We believe that, with Figure 5 shows the results. TPC-C transactions are the single fsync, the kernel flushed old checkpoint pages challenging enough for Silo’s in-memory structures that only when it had to—when the buffer cache became the addition of persistence has little effect on throughput: full—placing undue stress on the rest of the system. Fre- SiloR’s throughput is about 93% that of MemSilo. The quent synchronization, (c), produces far more stable re- MemSilo graph also shows that this workload is more sults; it also can produce a checkpoint more quickly than inherently variable than YCSB-A. We use 28 workers can the version with occasional sleeps. for MemSilo, rather than 32, because 32-worker runs Compression. We also experimented with compress- actually have lower average throughput, as well as far ing the database checkpoints via lz4 before writing to more variability (see Figure 7 in the appendix: our 28- disk. This didn’t help either latency or throughput, and it worker runs achieved 587–596 Mtxn/s, our 32-worker actually slowed down the time it took to checkpoint. Our runs 565–583 Mtxn/s). As with YCSB-A, the addition storage is fast enough that the cost of checkpoint com- of persistence increases this variability, both by batching pression outweighed the benefits of writing less data. transactions and by further stressing the machine. (Fig- ure 7 in the appendix shows that, for example, check- 6.3 On-line transaction processing workload points can happen at quite different times.) Throughput YCSB-A, though challenging, is a well-behaved work- degrades over time in the same way for all configura- load: all records are in one table, there are no secondary tions. This is because the database grows over time, and indexes, accesses are uniform, all writes are overwrites Silo tables are stored in trees with height proportional (no inserts or deletes), all transactions are small. In this to the log of the table size. The time to take a check- section, we evaluate SiloR on a more complex work- 2 It is common in the literature to report TPC-C results for the stan- load, the popular TPC-C benchmark for online trans- dard mix as “new order transactions per minute.” Following Silo, we action processing [26]. TPC-C transactions involve cus- report transactions per second for all transactions. USENIX Association 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14)  473

11. (a) SiloR (b) LogSilo (c) MemSilo Figure 5: Throughput and latency of SiloR and LogSilo, and throughput of MemSilo, on a modified TPC-C benchmark. Average throughput is 548 Ktxn/s, 575 Ktxn/s, and 592 Ktxn/s, respectively. Average SiloR latency is 110 ms/txn; average LogSilo latency is 97 ms/txn. The database initially contains 2 GB of record data, and grows to 94 GB by the end of the experiment. All experiments run 28 workers. point also grows with database size (3.5 s or so per GB other limits that SiloR would encounter in a real deploy- of record data). Latency, which is 110 ms/txn average ment. At the rates we are writing, our expensive flash for SiloR, is higher than in YCSB-A, but not by much, drives would reach their maximum endurance in a bit even though TPC-C transactions are far more complex. more than a year! In summary, SiloR can handle more complex workloads In contrast with the evaluation of Silo, we disable the with larger transactions as well as it can handle simple NUMA-aware allocator in our tests. When enabled, this workloads with small transactions. allocator improves average throughput by around 25% on YCSB (to 10.91 Mtxn/s for SiloR) and 20% on TPC- 6.4 Recovery C (to 644 Ktxn/s for SiloR). The cost—which we de- We now show that SiloR checkpoints allow for fast re- cided was not worth paying, at least for TPC-C—was covery. We run YCSB-A and TPC-C benchmarks, and performance instability and dramatically worse latency. in each case, crash the database immediately before a Our TPC-C runs saw sustained latencies of over a second checkpoint completes. This maximizes the length of the in their initial 40 s so, and frequent latency spikes later log that must be recovered to restore a transactionally- on, caused by fsync calls and writes that took more than correct state. We use 6 threads per disk (24 threads to- 1 s to complete. These slow file system operations ap- tal) to restore the checkpoint, and 8 threads per disk (32 pear unrelated to our storage hardware: they occur only threads total) to recover the log. when two or more disks are being written simultane- For YCSB-A, SiloR must recover 36 GB of check- ously; they occur at medium write rates as well as high point and 64 GB of log to recreate a 43.2 GB database. rates; they occur whether or not our log and checkpoint Recovery takes 106 s, or about 1.06 s/GB of recovery files are preallocated; and they occur occasionally on data. 31% of this time (33 s) is spent on the check- each of our disks (both Fusion and RAID). Turning off point and the rest (73 s) on the log. The TPC-C database NUMA-aware allocation greatly reduces the problem, grows over time, so checkpoints have different sizes. but traces of it remain: the occasional latency spikes visi- We stop a SiloR run of TPC-C immediately before its ble in our figures have the same cause. NUMA-aware al- fourth checkpoint completes, at about 465 s into the ex- location is fragile, particularly in older versions of Linux periment, when the database contains about 72.2 GB like ours;3 it is possible that a newer kernel would miti- of record data (not including keys). SiloR must recover gate this problem. 15.7 GB of checkpoint and 180 GB of log to recreate this database. Recovery takes 211 s, or about 1.08 s/GB of re- 7 Related work covery data. 8% of this time (17 s) is spent on the check- point and the rest (194 s) on the log. Thus, recovery time SiloR is based on Silo, a very fast in-memory database is proportional to the amount of data that must be read to for multicore machines [27]. We began with the publicly recover, and log replay is the limiting factor in recovery, available Silo distribution, but significantly adapted the justifying our decision to checkpoint frequently. logging implementation and added checkpointing and recovery. Silo draws from a range of work in databases 6.5 Discussion 3 For instance, to get good results with the NUMA allocator, we This work’s motivation was to explore the performance had to pre-fault our memory pools to skirt kernel scalability issues; limits afforded by modern hardware. However, there are this step could take up to 30 minutes per run! 474  11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14) USENIX Association

12.and in multicore and transactional memory systems course, VoltDB is more full-featured than SiloR. more generally [1, 3, 6–9, 11, 12, 15, 18, 19, 21]. Checkpointing and recovery for in-memory databases Cao et al. [2] describe a design for frequent consis- has long been an active area of research [5, 20, 22–24]. tent checkpoints in an in-memory database. Their re- Salem et al. [24] survey many checkpointing and recov- quirements align with ours—fast recovery without slow- ery techniques, covering the range from fuzzy check- ing normal transaction execution or introducing latency points (that is, inconsistent partial checkpoints) with spikes—but for much smaller databases. Like Malviya et value logging to variants of consistent checkpoints with al., they use “logical logging” (command/operation log- operation logging. In those terms, SiloR combines an ging) to avoid the expense of value logging. The focus action-consistent checkpoint (the transaction might con- of Cao et al.’s work is two clever algorithms for preserv- tain some, but not all, of an overlapping transaction’s ing the in-memory state required for a consistent check- effects) with value logging. Salem et al. report this as a point. These algorithms, Wait-Free ZigZag and Wait- relatively slow combination. However, the details of our Free Ping-Pong, effectively preserve 2 copies of the logging and checkpointing differ from any of the sys- database in memory, a current version and a snapshot tems they describe, and in our measurements we found version; but they use a bitvector to mark on a per-record that those details matter. In Salem et al. action-consistent basis which version is current. During a checkpoint, up- checkpoints either write to all records (to paint them), or dates are directed to the noncurrent version, leaving the copy concurrently modified records; our checkpointers snapshot version untouched. This requires enough mem- avoid all writes to global data. More fundamentally, we ory for at least 2, and possibly 3, copies of the database, are dealing with database sizes and speeds many orders which for the system’s target databases is realistic (they of magnitude higher, and technology tradeoffs may have measure a maximum of 1.6 GB). As we also observe, the changed. slowest part of recovery is log replay, so Cao et al. aim H-Store and its successor, VoltDB, are good represen- to shorten recovery by checkpointing every couple sec- tatives of modern fast in-memory databases [10, 13, 25]. onds. This is only possible for relatively small databases. Like SiloR, VoltDB achieves durability by a combina- Writing as fast as spec sheets promise, it would take at tion of checkpointing and logging [14], but it makes least 10 seconds for us to write a 43 GB checkpoint in different design choices. First, VoltDB uses command parallel to 3 fast disks, and that is assuming there is no logging (a variant of operation logging), in contrast to concurrent log activity, and thus that normal transaction SiloR’s value logging. Since VoltDB, unlike Silo, par- processing has halted. titions data among cores, it can recover command logs somewhat in parallel (different partitions’ logs can pro- The gold standard for database logging and check- ceed in parallel). Command logging in turn requires that pointing is agreed to be ARIES [16], which combines VoltDB’s checkpoints be transactionally consistent; it undo and redo logging to recover inconsistent check- takes a checkpoint by marking every database record points. Undo logging is necessary because ARIES might as copy-on-write, an expense we deem unacceptable. flush uncommitted data to the equivalent of a check- Malviya et al. also evaluate a variant of VoltDB that point; since SiloR uses OCC, uncommitted data never does “physiological logging” (value logging). Although occurs in a checkpoint, and redo logging suffices. their command logging recovers transactions not much faster than it can execute them—whereas physiological The fastest recovery times possible can be obtained logging can recover transactions 5x faster—during nor- through hot backups and replication [14, 17]. RAM- mal execution command logging performs much better Cloud, in particular, replicates a key-value store node’s than value logging, achieving 1.5x higher throughput memory across nearby disks, and can recover more than on TPC-C. This differs from the results we observed, 64 GB of data to service in just 1 or 2 seconds. However, where value logging was just 10% slower than a sys- RAMCloud is not a database: it does not support trans- tem with persistence entirely turned off. Our raw per- actions that involve multiple keys. Furthermore, RAM- formance results also differ from those of Malviya et al. Cloud achieves its fast recovery by fragmenting failed For command logging on 8 cores, they report roughly partitions across many machines. This fragmentation is 1.3 Ktxn/s/core for new-order transactions, using a vari- undesirable in a database context because increased par- ant of TPC-C that entirely lacks cross-warehouse trans- titioning requires more cross-machine coordination to actions. (Cross-warehouse transactions are particularly run transactions (e.g., some form of two-phase commit). expensive in the partitioned VoltDB architecture.) Our Nevertheless, 1 or 2 seconds is far faster than SiloR can TPC-C throughput with value logging, on a mix includ- provide. Replication is orthogonal to our system and an ing cross-warehouse transactions and similar hardware, interesting design point we hope to explore in future is roughly 8.8 Ktxn/s/core for new-order transactions. Of work. USENIX Association 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14)  475

13.8 Conclusions [8] K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger. The notions of consistency and predicate locks in a We have presented SiloR, a logging, checkpointing, and database system. Commun. ACM, 19(11), 1976. recovery subsystem for a very fast in-memory database. What distinguishes SiloR is its focus on performance [9] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, and for extremely challenging workloads. SiloR writes logs B. Falsafi. Shore-MT: a scalable storage manager for and checkpoints at gigabytes-per-second rates without the multicore era. In Proc. 12th Conf. on Extending greatly affecting normal transaction throughput, and can Database Tech., Mar. 2009. recover > 70 GB databases in minutes. [10] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, For future work, we would like to investigate check- S. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, pointers that cycle through logical partitions of the Y. Zhang, J. Hugg, and D. J. Abadi. H-Store: a high- database. We believe this approach will allow us to sub- performance, distributed main memory transaction pro- stantially reduce the amount of log data that needs to be cessing system. Proc. VLDB Endow., 1:1496–1499, replayed after a crash. Another possibility is to investi- 2008. gate a RAMCloud-like recovery approach in which data [11] H. T. Kung and J. T. Robinson. On optimistic methods is fragmented during recovery, allowing quick resump- for concurrency control. ACM TODS, 6(2), 1981. tion of service at degraded rates, but then reassembled at a single server to recover good performance. [12] P.-Å. Larson, S. Blanas, C. Diaconu, C. Freedman, J. M. Patel, and M. Zwilling. High-performance concurrency Acknowledgements control mechanisms for main-memory databases. Proc. We thank the anonymous reviewers and our shepherd, VLDB Endow., 5(4), 2011. Allen Clement, for helpful comments and patience. This [13] A.-P. Liedes and A. Wolski. SIREN: a memory- work was supported by the NSF under grants 1302359, conserving, snapshot-consistent checkpoint algorithm 1065219, and 0704424, and by Google and a Microsoft for in-memory databases. In Proc. ICDE ’06, Apr. 2006. Research New Faculty Fellowship. [14] N. Malviya, A. Weisberg, S. Madden, and M. Stone- References braker. Rethinking main memory OLTP recovery. In Proc. ICDE ’14, Mar. 2014. [1] N. G. Bronson, J. Casper, H. Chafi, and K. Olukotun. A practical concurrent binary search tree. In Proc. PPoPP [15] Y. Mao, E. Kohler, and R. Morris. Cache craftiness for ’10, Jan. 2010. fast multicore key-value storage. In Proc. EuroSys ’12, Apr. 2012. [2] T. Cao, M. Vaz Salles, B. Sowell, Y. Yue, A. Demers, J. Gehrke, and W. White. Fast checkpoint recovery al- [16] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and gorithms for frequently consistent applications. In Proc. P. Schwarz. ARIES: A transaction recovery method sup- ACM SIGMOD 2011, June 2011. porting fine-granularity locking and partial rollbacks us- ing write-ahead logging. ACM Trans. on Database Sys., [3] S. K. Cha, S. Hwang, K. Kim, and K. Kwon. Cache- 17(1):94–162, 1992. conscious concurrency control of main-memory indexes on shared-memory multiprocessor systems. In Proc. [17] D. Ongaro, S. M. Rumble, R. Stutsman, J. Ousterhout, VLDB ’01, Sept. 2001. and M. Rosenblum. Fast crash recovery in RAMCloud. In Proc. SOSP 2011, Oct. 2011. [4] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with [18] I. Pandis, R. Johnson, N. Hardavellas, and A. Ailamaki. YCSB. In Proc. ACM Symp. on Cloud Computing, June Data-oriented transaction execution. Proc. VLDB En- 2010. dow., 3(1-2), 2010. [5] D. J. DeWitt, R. Katz, F. Olken, L. Shapiro, M. Stone- [19] I. Pandis, P. Tözün, R. Johnson, and A. Ailamaki. PLP: braker, and D. Wood. Implementation techniques for Page latch-free shared-everything OLTP. Proc. VLDB main memory database systems. In Proc. SIGMOD ’84, Endow., 4(10), 2011. June 1984. [20] C. Pu. On-the-fly, incremental, consistent reading of en- [6] C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mit- tire databases. Algorithmica, 1:271–287, 1986. tal, R. Stonecipher, N. Verma, and M. Zwilling. Hekaton: SQL Server’s memory-optimized OLTP engine. In Proc. [21] K. Ren, A. Thomson, and D. J. Abadi. Lightweight lock- SIGMOD 2013, June 2013. ing for main memory database systems. Proc. VLDB En- dow., 6(2), 2012. [7] D. Dice, O. Shalev, and N. Shavit. Transactional locking II. In Proc. DISC ’06, Sept. 2006. [22] D. Rosenkrantz. Dynamic database dumping. In Proc. SIGMOD ’78, May 1978. 476  11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14) USENIX Association

14.Appendix Figure 6: Performance of SiloR, LogSilo, and MemSilo on our modified YCSB benchmark: additional runs. Figure 7: Performance of SiloR, LogSilo, and MemSilo (with 32 and 28 workers) on our modified TPC-C bench- mark: additional runs. [23] K. Salem and H. Garcia-Molina. Checkpointing tectural era: (it’s time for a complete rewrite). In Proc. memory-resident databases. In Proc. ICDE ’89, Feb. VLDB ’07, Sept. 2007. 1989. [26] The Transaction Processing Council. TPC-C Benchmark [24] K. Salem and H. Garcia-Molina. System M: A transac- (Revision 5.9.0)., June 2007. tion processing testbed for memory resident data. IEEE Trans. Knowledge and Data Eng., 2(1), Mar. 1990. [27] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden. Speedy transactions in multicore in-memory databases. [25] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopou- In Proc. SOSP ’13, Nov. 2013. los, N. Hachem, and P. Helland. The end of an archi- USENIX Association 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14)  477