Rethinking Main Memory OLTP Recovery

Fine-grained, record-oriented write-ahead logging, as exemplified by systems like ARIES, has been the gold standard for relational database recovery. In this paper, we show that in modern high-throughput transaction processing systems, this is no longer the optimal way to recover a database system. In particular, as transaction throughputs get higher, ARIEs-style logging starts to represent a non-trivial fraction of the overall transaction execution time.
展开查看详情

1. Rethinking Main Memory OLTP Recovery Nirmesh Malviya #1, Ariel Weisberg .2, Samuel Madden #3, Michael Stonebraker #4 # MIT CSAIL * VoltDB Inc. 1nirmesh@csail.mit.edu 2aweisberg@voltdb.com 3madden@csail.mit.edu 4stonebraker@csail.mit.edu Abstract-Fine-grained, record-oriented write-ahead logging, procedure's name) along with the query parameters; doing so as exemplified by systems like ARIES, has been the gold standard also keeps the log entries small. Such a command log captures for relational database recovery. In this paper, we show that updates performed on the database implicitly in the commands in modern high-throughput transaction processing systems, this is no longer the optimal way to recover a database system. In (or transactions) themselves, with only one log record entry per particular, as transaction throughputs get higher, ARIEs-style command. After a crash, if we can bring up the database using logging starts to represent a non-trivial fraction of the overall a pre-crash transactionally-consistent snapshot (which may or transaction execution time. may not reflect all of the committed transactions from before We propose a lighter weight, coarse-grained command logging the crash), the database can recover by simply re-executing the technique which only records the transactions that were executed on the database. It then does recovery by starting from a trans­ transactions stored in the command log in serial order instead actionally consistent checkpoint and replaying the commands of replaying individual writes as in ARIES-style physiological in the log as if they were new transactions. By avoiding the logging. overhead of fine-grained logging of before and after images (both Compared to physiological logging, command logging oper­ CPU complexity as well as substantial associated 110), command ates at a much coarser granularity, and this leads to important logging can yield significantly higher throughput at run-time. Recovery times for command logging are higher compared to performance differences between the two approaches. Gener­ an ARIEs-style physiological logging approach, but with the ally, command logging will write substantially fewer bytes advent of high-availability techniques that can mask the outage per transaction than physiological logging, which needs to of a recovering node, recovery speeds have become secondary in write the data affected by each update. Command logging importance to run-time performance for most applications. simply logs the incoming transaction text or name, while We evaluated our approach on an implementation of TPC­ physiological logging needs to spend many CPU cycles to C in a main memory database system (VoltDB), and found that command logging can offer 1.5 x higher throughput than a main­ construct before and after images of pages, which may require memory optimized implementation of ARIEs-style physiological differencing with the existing pages in order to keep log logging. records compact. These differences mean that physiological logging will impose a significant run-time overhead in a I. INTRODUCTION high throughput transaction processing (OLTP) system. For Database systems typically rely on a recovery subsystem to example, as shown by [7] , a typical transaction processing ensure the durability of committed transactions. If the database system (Shore) spends 10-20% of the time executing TPC-C crashes while some transactions are in-flight, a recovery phase transactions (at only a few hundred transactions per second) on ensures that updates made by transactions that committed ARIES-style physiological logging. As transaction processing pre-crash are reflected in the database after recovery, and systems become faster, and more memory resident, this will that updates performed by uncommitted transactions are not start to represent an increasingly larger fraction of the total reflected in the database state. query processing time. For example, in high throughput data The gold standard for recovery is write-ahead logging processing systems like H-Store [28] , RAMCloud [21] and during transaction execution, with crash recovery using a com­ Redis [27] , the goal is to process many thousands of trans­ bination of logical UNDO and physical REDO, exemplified by actions per second per node. To achieve such performance, systems like ARIES [20] . In a conventional logging system like it is important that logging be done efficiently. Up to this ARIES, before a modification to a database page is propagated point, it has been an open question as to whether disk-based to disk, a log entry containing an image of the modified logging can even be used in such a system without sacrificing data in the page before and after the operation is logged. throughput. In this paper we show definitively that it is quite Additionally, the system ensures that the tail of the log is feasible! on disk before a commit returns. This makes it possible to It is also relevant to look at recovery times for the two provide the durability guarantees described above. logging approaches. One would expect physiological logging There is an alternative to ARIES-style logging, however. to perform recovery faster than command logging; because in Suppose during transaction execution, instead of logging command logging, transactions need to be re-executed com­ modifications, the transaction's logic (such as SQL query pletely at recovery time whereas in ARIES-style physiological statements) is written to the log. For transactional applications logging, only data updates need to be re-applied. However, that run the same query templates over and over, it may in fact given that failures are infrequent (once a week or less), be possible to simply log a transaction identifier (e. g. , a stored recovery times are generally much less important than run-time 978-1-4799-2555-1114/$31. 00 © 2014 IEEE 604 ICDE Conference 2014

2.performance. Additionally, any production OLTP deployment along with the partition. Every node in the cluster runs multiple will likely employ some form of high-availability (e. g. , based execution sites (e. g. , one per CPU core), with each partition on on replication) that will mask single-node failures. Thus, the node assigned to one such site. Each node has an initiator failures that require recovery to ensure system availability are component which sends out transactions to the appropriate much less frequent. partitions/replicas. By employing careful, workload-aware par­ In this paper, our goal is to study these performance trade­ titioning, most transactions can be made single-sited (run on offs between physiological logging and command logging just a single partition) [24] . in detail. We describe how command logging works, and discuss our implementation of both command logging and a B. Transactions main-memory optimized version of physiological logging in Transactions in VoltDB are issued as stored procedures that the VoltDB main memory open-source database system [30] run inside of the database system. Rather than sending SQL (VoltDB is based on the design of H-Store [28]). We compare commands at run-time, applications register a set of SQL­ the performance of both the logging modes on two transac­ based procedures (the workload) with the database system, tional benchmarks, Voter and TPC-C. with each transaction being a single stored procedure. This Our experimental results show that command logging has scheme requires all transactions to be known in advance, but a much lower run-time overhead compared to physiological for OLTP applications that back websites and other online logging when (a) the transactions are not very complex and systems, such an assumption is reasonable. Encapsulating only a small fraction of all transactions are distributed, so that all transaction logic in a single stored procedure prevents CPU cycles spent constructing the differential physiological application stalls mid-transaction and also allows VoltDB to log record and the disk 110 due to physiological logging avoid the overhead of transaction parsing at run-time. At represent a substantial fraction of transaction execution time; run-time, client applications invoke these stored procedures, and (b) the size of a command log record written for a passing in just the procedure names and parameters. transaction is small and the transaction updates a large number of data tuples, because physiological logging does much more All transactions in VoltDB are run serially at the appropriate work in this case. In our experiments, we found that for TPC­ execution site(s). Because OLTP transactions are short, typi­ C, which has has short transactions that update a moderate cally touch only a small number of database records, and do number of records, the maximum overall throughput achieved not experience application or disk stalls, this is actually more by our system when command logging is used is about 1. 5 x efficient than using locking or other concurrency control mech­ higher than the throughput when physiological logging is anisms which themselves introduce significant overhead [11] . employed instead, a result in line with the plot's prediction. VoltDB supports distributed and replicated transactions by Also for TPC-C, we found that recovery times, as expected, running all transactions in a globally agreed upon order. Since are better for physiological logging than command logging, only one transaction can run at a time at each execution site, a by a factor of about l. 5. transaction that involves all partitions will be isolated from all Given this high level overview, the rest of this paper is other transactions. If a transaction operates at a single partition, organized as follows. We begin with a short discussion of it is isolated from other transactions because the execution VoltDB's system architecture in Section II. We then describe site owning the partition is single threaded and all replicas our approach to command logging in Section III, followed by a of the partition/site run transactions in the globally agreed detailed description of our main-memory adaptation of ARIES­ upon order. Even if one or more sites do not respond to a style physiological logging in Section IV . Subsequently, we transaction request (e. g. , because of a crash), the transaction report extensive performance experiments in Section V and will be executed as long as a one replica of each partition discuss possible approaches to generalize command logging involved in the transaction is available. in Section V I. Section V II provides an overview of relevant Below, we briefly explain how transaction ordering works past work in this area, and Section V III summarizes the paper. in VoltDB. 1) Global Transaction Ordering and Replication: In II. VOLTDB OVERVIEW VoltDB, a database component called an initiator receives VoltDB is an open source main memory database system client requests and dispatches transactions to the appropriate whose design is based on that of the H-Store system [28] , execution sites; the pool of initiators consists of one initiator with some differences. Below, we give a brief overview of for each node. Each initiator generates unique timestamp­ VoltDB's system architecture. based transaction-ids that are roughly synchronized with those generated by other initiators using NTP. At each execution site, A. Partitions and Execution Sites transactions received from an initiator are placed in a special VoltDB is a distributed in-memory database which runs on a priority queue, which ensures that only tasks that are globally cluster of nodes. In VoltDB, a table is horizontally partitioned ordered and safely replicated are executed. on keys; each partition resides in the main memory of a For global ordering, this is done by checking if the id cluster node and can be replicated across several nodes for of a transaction in the queue is the minimum across prior high availability. All indexes are also kept in main memory transactions received from all initiators. Since initiators uses 605

3.timestamps to generate monotonically increasing transaction­ can read the shadow version (as with deletes, the shadow ids and messages to the priority queue are TCP ordered, the version is removed after the checkpoint process scans it). A minimum transaction-id can be used to determine when the background process serializes the snapshot to disk, so there position of a transaction in the global order is known. If a is no need to quiesce the system and the checkpoint can be transaction is not globally ordered, it is held in the queue written asynchronously. When one such sweep is done, and until its position in the global order is known. the background process has completed its scan, the executor Replication follows a similar process, but in reverse: ini­ returns to regular mode. tiators inform the transaction execution sites of the minimum This checkpointing mechanism, although somewhat VoltDB safely replicated transaction-id for transactions from that ini­ specific, can be easily generalized to any database system that tiator. If a transaction-id is greater than the safely replicated uses snapshots for isolation, since the copy-on-write mode is transaction-id for that initiator, the site holds the transaction very similar to the way transaction isolation is implemented in the queue until it is replicated. in snapshot-isolation based systems. This global ordering and replication information propagates Having a transaction-consistent checkpoint is crucial for via new transaction requests and their responses and not as the correctness of the command logging recovery protocol, separate messages. In the event of light transaction load, as we shall discuss in the next section which details how our heartbeats (no-op transactions) are used to prevent stalls. implementation of command logging in VoltDB works. We note that a recently released version of VoltDB does ordering differently, above we have described how global III. COMMAND LOGGING transaction ordering works in the system we have used to The idea behind command logging is to simply log what implement our recovery approaches. command was issued to the database before the command C. Durability (a transaction for example) is actually executed. Command logging is thus a write-ahead logging approach meant to persist The VoltDB mechanisms discussed above result in very database actions and allow a node to recover from a crash. high transaction throughputs (about 4K transactions per second Note that command logging is an extreme form of logical per core on TPC-C), but durability is still a problem. In the logging, and is distinct from both physical logging and record­ event of a single node failure, replicas ensure availability level logical logging. As noted in Section I, the advantage of of database contents. Additionally, VoltDB uses command command logging is that it is extremely lightweight, requiring logging (described in Section III), along with a non-blocking just a single log record to be written per transaction. transaction-consistent checkpointing mechanism to avoid loss For the rest of this paper, we assume that each command is of database contents in the event of power failure or other a stored procedure and that the terms command logging and cluster-wide outage. transaction logging are equivalent. The commands written to To deal with scenarios where a transaction must be rolled the log record in a command logging approach thus consist back mid-execution, VoltDB maintains an in-memory undo log of the name of a stored procedure and the parameters to of compensating actions for a transaction. Thus, transaction be passed to the procedure. For stored procedures that must savepoints (partial rollback) are also supported; this is possible generate random numbers or call non-deterministic functions because any partial rollback for a deterministic transaction such as date ()/time (), a timestamp based transaction-id will also be deterministic. The undo log is separate from the (see Section II) can be used as the seed for the generator and command log and is never written to disk. It is discarded on for extracting the date/time. transaction commit/abort because it can be regenerated when the transaction is replayed. Generally speaking, stored procedure names are likely to be substantially smaller than entire SQL-queries, so this 1) Asynchronous Checkpointing: VoltDB's checkpoint serves to reduce the amount of data logged by command mechanism periodically writes all committed database state logging. Specifically, an entry in the log is of the form to disk (index updates are not propagated to disk). Before starting the snapshot, a distributed transaction is used to start (transaction-name, parameter-values). the snapshot by putting all of the sites into a copy-on-write A. Writing to the Log (COW) mode. The snapshot process then begins scanning every row in the table, while queries continue to execute Writing a command log record for a single-partition trans­ on the database. All updates from this point until the end action is relatively simple. For a distributed transaction, only of the snapshot are COW if they are performed on a row the coordinator site specific to the transaction writes the that hasn't been scanned for the snapshot yet. Specifically, transaction to its command log; all other sites participating three bits per row track whether the row was added, deleted, in the transaction do not log the transaction. The coordinator or modified since the snapshot began (these bits are not a for a distributed transaction is the site with the lowest id on the part of the snapshot). Newly added rows are skipped by the node where the transaction was initiated. Multiple execution snapshot. Deleted rows are removed by the snapshot after they sites on the same node write to a shared command log. For have been scanned. Updates cause the row to be copied to a both single and multi-sited transactions, if replicas are present, shadow table before the update is applied, so the snapshot the transaction is also logged at all replicas of the site. The 606

4. check-sum LSN record-type xaction-id partition-id xaction-type params I Fig. 1 . Command logging record structure. command log for each node also records transaction ordering initiator replaying the log can simply send the transaction to messages sent/received by the node's initiator at runtime. the new site for a given partition-id. Transactions are written to the command log right away Given that each log record corresponds to a single trans­ after they have been received, thus a transaction need not have action and that the initiator has access to ordering messages been globally ordered and replicated before it's written to the written to the command log at run-time, global ordering at log. This in turn requires that at recovery time, transactions be replay time is identical to the pre-crash execution ordering. sorted again in a manner that agrees with the global transaction If a command log record for a transaction is written to the ordering at runtime. This is easy to accomplish because the log but the database system crashes before the transaction logged ordering messages are also replayed at recovery time. completes executing (so that the client isn't notified), com­ In our VoltDB implementation of command logging, log mand log replay will recover this transaction and bring the records written out for each transaction have the structure database to a state as if this transaction had been committed, shown in Figure 1. even though the client won't be notified after replay (a similar 1) Optimizations: Command log records can be flushed to situation can happen in a conventional DBMS as well). disk either synchronously or asynchronously. ACID semantics We further discuss how command logging could be ex­ can be guaranteed only with synchronous logging, because in tended to other database systems in Section V I. this case a log record for a transaction is forced to disk before the transaction is acknowledged as committed. For this reason, IV. PHYSIOLOGICAL LOGGING even though our implementation permits either mode, we Traditional database systems have typically employed report results only for the synchronous mode and throughout ARIES [20] or ARIES-like techniques for their recovery sub­ the rest of this paper, use the term command logging to mean system. ARIES does physiological logging; each operation synchronous command logging. (insert/delete/update) performed by a transaction is written to a To improve the performance of command logging, we log record table before the update is actually performed on the employ group-commit: the system batches log records for data. Each such log entry contains the before and after images multiple transactions (more than a fixed threshold or few of modified data. Recovery using ARIES happens in several milliseconds worth) and flushes them to the disk together. passes, which include an analysis pass, a physical REDO pass After the disk write has successfully completed, a commit and a logical UNDO pass. confirmation is sent for all transactions in the batch. This While the core idea behind ARIES can be used in a main­ batching of writes to the command log reduces the number memory database, substantial changes are required for the of writes to the disk and helps improve synchronous com­ technique to work in a main memory context. In addition, the mand logging performance, at the cost of a small amount of main-memory environment can be exploited to make logging additional latency per-transaction. more efficient. Given the differences, throughout the rest of this paper, we simply refer to our main memory optimized B. Recovery ARIES-style logging technique as physiological logging. Recovery processing for the command logging approach Below, we discuss in detail the changes required and opti­ works as follows. mizations that must be made for main-memory physiological First, using the latest database snapshot on disk, database logging to work well. contents are initialized in memory. Because the disk snapshot does not contain indexes, all indexes are then rebuilt at start­ A. Supporting Main Memory up; this can be done in parallel with the snapshot restore In a disk-based database, inserts, updates and deletes to as index reconstruction for the part of the database that has tables are reflected on disk as updates to the appropriate disk already been restored can begin while the rest of the database page(s) storing the data. For each modified page, ARIES writes finishes loading. a separate log record with a unique logical sequence number Next, the shared command log for each node is read by a (LSN) (a write to a page is assumed to be atomic [26] ). These dedicated thread which reads the log into memory in chunks. log records contain disk specific fields such as the page-id Starting from the log record for the first transaction not of the modified page along with length and offset of change. reflected in the database, log entries are processed by the This is stored as a RID, or record ID, of the form (page node's initiator and the corresponding transaction is dispatched #, slot #). A dirty page table, capturing the earliest log to the appropriate sites (which may be on a different node in record that modified a dirty page in the buffer pool is also case of a distributed transaction). maintained. In addition, a transaction table keeps track of the This recovery approach works even if the number of ex­ state of active transactions, including the LSN of the last log ecution sites at replay time is different from the number of record written out by each transaction. The dirty page and sites at run-time, as long as the number of database partitions transaction tables are written out to disk along with periodic remains the same. In the event of a site topology change, the checkpoints. 607

5. Record-type Insert/Update/ Transaction-id Modified Column Delete List Fig. 2. Physiological logging record structure. In a main-memory database like VoltDB, a data tuple can be grow to a fairly large size. Alternatively, we could eliminate accessed directly by probing its main-memory location without the separate dirty record table and instead simply associate a any indirection through a page-oriented buffer pool. Thus, dirty bit with each database tuple in memory. This dirty bit is the ARIES logging structures can be simplified when adapted subsequently unset when the dirty record is written out to disk to main-memory physiological logging; specifically, all disk as a part of a snapshot. Not storing the dirty record table results related fields in the log record structure can be omitted. in space savings, but depending on the checkpoint mechanism For each modification to a database tuple, our physiological in use, doing so can have significant performance impacts, as logging approach simply writes a unique entry to the log we discuss next. with serialized before and after images of the tuple. Instead Checkpointing. Disk-based ARIES assumes fuzzy checkpoint­ of referencing a tuple through a (page #, slot #) RID, ing [18] to be the database snapshot mechanism. Fuzzy check­ the tuple is referenced via a (table-id, primary-key) points happen concurrently with regular transaction activity, pair that uniquely identifies the modified data tuple. If a table and thus updates made by uncommitted transactions can also does not have a unique primary key and the modification be written to disk as a part of the checkpoint. In disk-based operation is not an insert, the entire before-image of a tuple ARIES, both the dirty page and transaction tables are flushed to must be used to identify the tuple's location in the table either disk along with each checkpoint. The main memory equivalent via a sequential scan or a non-unique index lookup. For the of this would be to write out the dirty record and transaction both the Voter and TPC-C benchmarks we use in our study, all tables with a checkpoint. Not having an explicit dirty record tables written to have primary keys except the TPC-C History table in such a scenario is inefficient, because prior to each table which only has tuples inserted into it (see Section V-A checkpoint, we would need to scan the in-memory database for schema details). to construct the dirty record table so it could be written along Use of persistent virtual memory addresses instead of with the checkpoint. (table-id, primary-key) for in-memory tuple iden­ Alternatively, we could use transaction-consistent check­ tity is also an option [13] [8] , but we believe that it is not a good pointing [25] instead of fuzzy checkpointing. VoltDB already choice as a unique identifier because a database in general uses non-blocking transaction-consistent checkpointing (see is not guaranteed to load a table and all its records at the Section II), so we leveraged it for our implementation. With same virtual address on restart after a crash unless specifically transaction consistent checkpointing, only updates from com­ engineered to do so. Moreover, doing so limits potential mitted transactions are made persistent, so that we can simply compaction over the table memory layout to minimize data keep track of the oldest LSN whose updates have not yet been fragmentation. reflected on disk. Thus, the dirty record table is not needed at 1) Optimizations: In this section we describe a number of checkpoint time. optimizations to our physiological logging scheme. Moreover, as explained in Section II-Cl, transaction­ Differential Logging. For tables with wide rows, a large consistent checkpointing is also used by our system to ensure amount of log space can be saved by additionally recording correctness of command logging. which attributes in the tuple were updated by a transaction, Log-per-node. In our VoltDB implementation of physiological with before and after images recorded for only those columns logging, execution sites on the same node write to a shared log instead of the entire tuple (this optimization does not apply to with arbitrary interleaving of log records from different sites. inserts). The logging scheme is thus physiological physical - The ordering of log records per site is still preserved. A field in with respect to the changes for a particular column and logical the log record identifies the partition the update corresponds to with respect to which columns have been modified (similar to (site-id to partition-id mapping is one-to-many as each site can the way ARIES records physical changes to logical page slots). be host to more than one database partition). Having a shared We found that that this dijferential logging optimization led to log for all sites as opposed to a log per-execution site makes a significant reduction in a log record's size for the TPC-C recovery much simpler, since the database is not constrained benchmark (nearly 5 x for TPC-C). However, we noticed that to restarting with an identical partition-to-site mapping on a this reduction came at the cost of increased CPU complexity to given node. This is important, because if a node crash requires construct log records, an overhead which becomes significant a reconfiguration, or the database must be recovered on a at in-memory OLTP execution throughputs (we discuss the different machine from the one it was previously running implications of this observation in Section V ). on, we may have a different number of sites and a different Dirty Page Tracking. An in-memory database has no concept partition-to-site mapping. However, the shared nature of the of disk pages and so unlike ARIES, we do not need to log requires that all partitions previously residing together on maintain a dirty page table. One option is to create a dirty a node must still be on the same node for replay, even though record table to keep track of all dirty (updated or deleted) the number of sites on the node can be changed. database records. For a write-heavy workload, though regular Batched writes. Because OLTP transactions are short, the snapshotting would keep the size of this table bounded, it can amount of log data produced per update in the transaction 608

6.is not enough to justify an early disk write given that the final during crash recovery, no rollbacks are required and the undo update's log record must also be flushed before the transaction pass can be eliminated altogether (employing this optimization can commit. For this reason, its best to buffer all log records reduced log record sizes by nearly a factor of two, as the before for a single transaction and write them all to the log together. image in update records could now be omitted). Also, with no Similar to ARIES, our physiological logging is synchronous, undo pass, the transaction table can be done away with. so that log writes of a committed transaction are forced Note that in databases other than VoltDB which use to disk before we report back the transaction's status as transaction-consistent checkpointing but run multiple concur­ committed. Similar to command logging, our physiological rent transactions per execution site, the idea of simply not logging implementation uses group commit; writes from dif­ reapplying the last transaction's updates for each site does not ferent transactions are batched together to achieve better disk work and an undo pass is required. This is because there could throughput and to reduce the logging overhead per transaction. be a mixture of operations from committed and uncommitted The log record structure for our physiological logging transactions in the log. implementation in VoltDB is shown in Figure 2. V. PERFORMANCE EVALUATION B. Recovery We implemented both command logging and ARIES-style Recovery using disk-based ARIES happens in three phases: physiological logging inside VoltDB, with group-commit en­ an analysis phase, a redo phase and an undo phase. The redo abled for both techniques. The logging is synchronous in pass is physical and the undo pass is logical. each case with both techniques issuing an fsync to ensure Our physiological logging scheme for main-memory also that a transaction's log records are written to disk before the begins with an analysis phase, the goal of which is to identify results are returned. All performance optimizations discussed the LSN from which log replay should start. The redo pass in Sections III and IV were implemented and enabled for all then reads every log entry starting from this LSN and reapplies experiments except where we explicitly study the performance updates in the order the log entries appear. For each log entry, impact of turning off a specific optimization. We implemented the data tuple that needs to be updated is identified using the an additional optimization for physiological logging, in which (table-name, primary-key) pair and the serialized all the log records for each transaction are first logged to after-image bytes in the log record are used to modify the tuple a local buffer, and then at commit time, written to the disk appropriately (this covers insert and delete operations as well). in a batch along with records of other already completed For the primary-key lookup identifying a tuple's location to be transactions. For OLTP workloads, this optimization adds a fast, an index on the primary key is used at replay time. In small amount of latency but amortizes the cost of synchronous VoltDB, index modifications are not logged to disk at run-time, log writes and substantially improves throughput. Also, we so all indexes are reconstructed at recovery time in parallel ensured that the physiological logging implementation group­ with snapshot reloading prior to log replay (see Section III). commits with the same frequency as command logging. Because log records corresponding to different partitions of In this section, we show experimental results comparing the database can be replayed in any order, the redo phase is command logging against physiological logging. We study highly parallelizable. This optimization yielded linear scale up several different performance dimensions to characterize the in recovery speeds with the number of cores used for replay circumstances under which one approach is preferable over the (see Section V for performance numbers). Next comes the other: run-time overhead (throughput and latency), recovery undo pass. For transactions which had not committed at the time and bytes logged per transaction. time of the crash, the undo phase simply scans the log in We also look at the effect of distributed transactions and reverse order using the transaction table and uses the before replication on performance of the two techniques. image of the data record to undo the update (or deletes the In Section V-A, we briefly discuss the benchmarks we used record in case it was an insert). in our study. Then we describe our experimental setup in Recovery can be simplified for an in-memory database Section V-B followed by performance results in Section V-C such as VoltDB that uses transaction consistent checkpoint­ Finally, we summarize our results and discuss their high level ing and only permits serial execution of transactions over implications in Section V-D. each database partition. In such a system, no uncommitted writes will be present on disk. Also, because transactions A. Benchmarks are executed in serial order by the run-time system, log We use two different OLTP benchmarks in our study, Voter' records for a single transaction writing to some partition on and TPC-C These two benchmarks differ considerably in their an execution site S are never interleaved with log records for transaction complexity. The work done by each transaction in other transactions executed by S. Hence for each partition, the Voter benchmark is minimal compared to TPC-C TPC-C only the transaction executing at the time of crash will need database tables are also much wider and exhibit more complex to be rolled back (at most one per partition). Even this single relationships as compared to Voter. rollback can be avoided by simply not replaying the tail of the The two benchmarks are described below. log corresponding to this transaction; doing so necessitates a one transaction look-ahead per partition at replay time. Then I https:llcommunity.voltdb.com/node/47 609

7. 1) Voter: The Voter benchmark simulates a phone based ensuring that data durability was not compromised; such a election process. The database schema is extremely simple careful setup is necessary to optimize either recovery system. and is as follows: The client was run on a separate machine with a system contestants (contestant_name STRING, configuration identical to that of the server. We simulated contestant_number INTEGER) several clients requesting transactions from the server by area_code_state (areacode INTEGER, state STRING) running a single client issuing requests asynchronously at a votes (vote_id INTEGER, phone_number INTEGER, fixed rate. state STRING, contestant_number INTEGER) For our distributed transactions experiments in Sec­ Given a fixed set of contestants, each voter can cast multiple tion V-C6, we used a cluster of four identical machines, with votes up to a set maximum. During a run of the benchmark, one machine used to run an asynchronous client and the other votes from valid telephone numbers randomly generated by three used as database servers. Each machine was an Intel the client are cast and reflected in the votes table. At the Xeon dual-socket 12-core server box with a processor speed end of the run, the contestant with the maximum number of of 2. 4GHz, 48GB of RAM, 4 TB of hard disk space with a votes is declared the winner. battery backed cache and running Ubuntu Server. This benchmark is open-loop and has only one kind of trans­ Because the VoltDB server process runs on a multi-core ma­ action, the stored procedure vote. This transaction inserts one chine, we can partition the database and run several execution row into the votes table and commits. There are no reads to sites concurrently, with each site accessing its own partition. the votes table until the end of the client's run, the other two For an 8-core machine, we experimentally determined that tables in the database are read as a part of the vote transaction running six sites works best for the Voter benchmark and but not written to. In addition, the width of all the tables is more sites did not lead to increased throughput. For the TPC­ very small (less than 20 bytes each). C benchmark, we found that best performance is achieved by The number of contestants as well as the number of votes using all possible sites (one per core). Each site corresponds each voter is allowed to cast can be varied. For our experi­ to one warehouse, so that the results to follow are for a TPC­ ments, these are set to default values of 6 and 2 respectively. C 8-warehouse configuration (except for Section V-C6, where 2) TP C-C: TPC-C [29] is a standard OLTP system bench­ 12 warehouses are used). While it's possible to fit a much mark simulating an order-entry environment. larger database (e. g. , 64 warehouses) given the server memory, The TPC-C database consists of nine different tables: Cus­ we found that the system throughput for a 64-warehouse tomer, District, History, Item, New-Order, Order, Order-Line, configuration was nearly the same as for the 8-warehouse Stock and Warehouse. These tables are between 3 and 21 one (which is expected given that the entire database is in columns wide and are related to each other via foreign key memory in both cases). Given that the TPC-C database grows relationships. The Item table is read-only. in size over time as new transactions are issued, we chose a The benchmark is a mix of five concurrent transactions of smaller database to facilitate long running experiments without varying complexity, namely New-Order, Payment, Delivery, running out of memory. Order-Status and Stock-Level. Of these, Order-Status and C. Results Stock-Level are read-only and do not update the contents of the database. The number of New-Order transactions executed All the experimental results we present below were obtained per minute (tpmC) is the metric used to measure system by running our benchmarks against three different modes of throughput. VoltDB: (a) command logging on and physiological logging The TPC-C implementation used to report numbers in turned off, (b) physiological logging turned on and command Section V-C differs from the standard benchmark in that (a) it's logging turned off, and (c) both command logging and phys­ open-loop, (b) New-order transactions do not read items from iological logging turned off. a remote warehouse, so that the transactions are always single­ For most direct performance comparisons between the sited. Performance numbers for multi-sited TPC-C New-order above three modes, we show plots with the performance metric transactions however are reported in Section V-C6. on the y-axis and the client rate on the x-axis. Doing so allows As TPC-C simulates a real order-entry environment, the us to compare performance of the three logging modes by benchmark description also mimics a human terminal operator asking a simple question: what throughputllatency/recovery­ by adding keying times and think times for each transaction. rate do each of the logging modes have for a given amount of Our implementation of TPC-C does not take these into ac­ work to be done per unit time (in this case a client attempting count. to execute transactions at a certain rate)? For all experiments, we set the system snapshot frequency B. Experimental setup to 180 seconds. Increasing or lowering this value affects All our experiments in Sections V-C1-V-C5 were run on performance of each logging mode equally as the system does a single Intel Xeon dual-socket 2. 4 GHz 8-core server with extra work in the background at runtime in all cases. The 24GB of RAM, 12 TB of hard disk with a battery backed write rationale for setting the snapshotting frequency to the order of cache and running Ubuntu Server. To improve disk throughput, a few minutes instead of seconds (or continuous) is that there the disk was mounted with appropriate additional flags while is substantial data on the log that must be replayed, which 610

8. 120 50 700 Command-logging - Command-logging - Command-logging - � Physiological-logging ------­ Physiological-logging ------_. � 600 Physiological-logging ------_. '0 100 No-logging No-logging '" 40 15 D '" 500 1" -- 80 C .. . � -,/ /, i 30 // � 400 � // 2 60 / '" � � 20 / '" 300 5 / i! i 40 // <:- 200 � 10 / " 20 � .--- '" 100 15 a: > 0 0 0 0 � W 00 00 100 1� 1W 100 0 20 40 60 80 100 120 140 160 0 � W 00 00 100 1� 1W 100 Client rate (thousands of Ips) Client Rate (thousands of Ips) Client rate during run before crash (thousands of tps) Fig. 3 . Voter throughput vs. client rate (both Fig. 4. Voter latency in milliseconds vs. client Fig. 5. Voter log replay rates (tps). tps). rate (tps). G 1600 Command-logging - 120 Command-logging - Command-logging - E 1000 G .9- Physiological-logging ------­ Physiological-logging ------_. E 1400 Physiological-logging -------- '0 No-logging No-logging .9- 100 '" '0 1200 D 800 '" 80 C 1000 D � 600 � '" C '" � 800 � '5 g- 60 � 400 3 600 i 40 � '" 400 � 200 20 () () § '" 200 a: 0 0 0 a. 0 10 20 30 40 50 60 0 10 40 50 60 f- 0 10 20 30 40 50 60 Client rate (thousands of Ips) Client Rate (thousands of Ips) Client rate during run before crash (thousands of tps) Fig. 6. TPC-C throughput (tpmC) vs. client Fig. 7. TPC-C latency in milliseconds vs. Fig. 8. TPC-C log replay rates (tpmC). rate (tps). client rate (tps). makes measured recovery rates more reliable and offsets any transaction, and is favored more heavily by command logging. dominating replay startup costs that would affect the measured 2) Latency: The variation of transaction latency with client numbers. rates for the voter benchmark is shown in Figure 4. For 1) Throughput: Figure 3 shows the variation in system client rates less than 50K tps, the system runs well under its throughput for the voter benchmark as the client rate is capacity and all logging methods result in a 5-7ms latency. varied from 25,000 transactions per second up to 150,000 Note that this latency is dependent on the group commit transactions per second. All three logging modes (no-logging, frequency, which was fixed at 5ms for this experiment (ob­ physiological-logging and command-logging) are able to tained by varying group commit frequencies is an independent match the client rate until 80K tps at which physiological experiment, elided due to space constraints). The latencies logging tops out while the other two saturate at 95K tps. We for all methods gradually increase as the database server observe that the overhead of command logging is nearly zero. approaches saturation load. Command-logging has almost the Due to the extra CPU overhead of creating a log record based same latency as no-logging whereas physiological-logging has on the insert row's serialized bytes during the transaction, a 15% higher latency. The higher transaction latencies for physiological logging suffers about 15% drop in maximum client rates greater than the saturation load result from each throughput at run time. For more complex transactions, phys­ transaction waiting in a queue before it can execute. The iological logging has a higher performance penalty, as we see queue itself only allows a maximum of 5,000 outstanding next. transactions, and the admission control mechanism in VoltDB Figure 6 shows throughput measured in tpmC achieved by refuses to accept new transactions if the queue is full. the three logging modes for the TPC-C benchmark, as the In Figure 7, we see that TPC-C performs similarly, except client rate varies from lOK up to 60K tps. Similar to the results that physiological logging reaches saturation at about 21K tps, for the voter benchmark, command logging achieves nearly so that its latency goes up much earlier. The other two logging the same throughput as the no logging scenario. However, here modes hit saturation latencies at client rates higher than 30K physiological logging caps out at about 66% of the throughput tps and both have about the same latency. Due to extra logging achieved by the other two. In other words, command logging overhead, physiological logging suffers from latencies that are provides about 1.5 x more throughput than physiological log­ at least 45% higher for all client rates. ging for the TPC-C benchmark. This is expected behavior be­ 3) Number of Bytes Logged: As noted earlier, the voter cause TPC-C transactions are much more complex than voter benchmark only has one transaction (the stored procedure transactions, and each one potentially updates many database vote). For each transaction initiated by the client, command records. Extra CPU overhead is incurred in constructing log logging writes a log record containing the name of this stored record for each of these inserts/updates, and the amount of procedure and necessary parameters (phone number and state) logged data also increases (see Section V-C3 for numbers). along with a log header. We found that the size of this log The penalty on Voter is lower because the number of log writes record is always 55 bytes. On the other hand, physiological for the vote transaction is small (just one). logging directly records a new after-image (insert to the votes Both approaches have short transactions, do better with table) to the log along with a header, and writes 81 bytes command logging, but TPC-C performs more updates per per invocation of vote. This transaction only inserts data, so 611

9.that the before-image does not exist. Moreover, as discussed take the same amount of time irrespective of the logging mode in Section IV , before images can be done away with in any being used. If no logging was done at run-time, all transactions case. For voter, both the logging techniques only write one executed after the last snapshot was written to disk will be log record per transaction. permanently lost. Hence, our recovery performance numbers The TPC-C benchmark has three different transaction types are for command logging and physiological logging only. Our which update the database: delivery, neworder and payment. implementations for both the logging modes are optimized The above mentioned three different transaction types for to do parallel log replay, each execution site reads from the TPC-C together modify 8 out of 9 tables in the TPC-C shared recovery log and replays all log records corresponding database (the item table is read-only). Modifications include to its site. insert, update as well as delete operations on tables. In many Figure 5 shows the log replay times for the two logging cases, only 1 record is modified per transaction for each table, modes for voter. During recovery, the system replays the log but the neworder , orders, order-line and stock tables have at maximum speed but does not serve new client transactions either 10 or 100 records modified per transaction for certain simultaneously, naturally this way recovery rate is not a operations. function of previous load. Command logging must actually re­ For command logging, the three transactions write between execute each transaction, and we see that its L OOK tps recovery 50 (delivery) and 170 (neworder) bytes per transaction (there is rate is about the same as the maximum throughput it can only one log record per transaction). The neworder transaction achieve at run-time (seen earlier in Figure 3). On the other logs the highest number of bytes, which is not surprising hand, for voter, physiological logging is able to replay the log given that neworder is the backbone of the TPC-C workload. almost 5 x faster at about 500K tps. This difference is due Depending on the table that is updated, log record sizes to the fact that physiological logging directly records each for physiological logging vary from 70 bytes (New-Order transaction's modifications to the log at run-time. It does not table) to 240 bytes (Customer table) per record, with most have to repeat its reads or transaction logic during recovery log records less than 115 bytes in size. Overall, for TPC­ and is able to recover much faster. The simplicity of voter C, physiological logging writes about lO x more data per transactions ensures that the physiological logging overhead transaction in comparison to command logging (averaged over of parsing each log record and reapplying the relevant updates the three transaction types). is small. 4) Log Record Size vs Peiformance: Because physiological In Figure 8, we see that even for the TPC-C benchmark, logging writes so much more data than command logging physiological logging also replays at a faster rate compared on TPC-C, we wanted to test if the run-time performance to command logging. Command logging can only recover difference between the two systems on this benchmark was at about 865K tpmC, which is also its maximum run-time completely attributable to I/O time. We ran an experiment throughput on an average (Figure 6). However owing to the in which we truncated the size of physiological logging increased complexity of TPC-C transactions, physiological records written out per transaction to L OO bytes, which is logging replay is only about l. 5 x faster than command approximately what command logging writes on an average for logging for TPC-C as opposed to the 5 x speedup for the much a TPC-C transaction. The resulting recovery log is unrecover­ simpler voter benchmark. able/corrupt, but this is not important for the purposes of this While command logging has a longer recovery time, it's im­ experiment. We found that physiological logging throughput pact on availability is minimal because all modern production slightly increases by a mere 1 %, and command logging wins OLTP systems are engineered to employ replication for high by nearly the same factor. availability, so that the higher throughput of command logging Thus, the performance gap at run-time between command at run-time is a good tradeoff for it's slower background logging and physiological logging is a result of not only the recovery while the other replica nodes continue to serve live extra disk I/O that physiological logging needs to do to write traffic. larger records to disk, but also of the higher CPU overhead Recovery numbers in the two plots just discussed are for incurred in logging activities during transaction execution. As log replay only and do not include log read times. Once a discussed in Section IV , this overhead incurred by physiologi­ database snapshot has been restored from disk, the log is read cal logging is due to CPU cycles spent generating I/O efficient in chunks by a single execution site and thereafter shared by all differential log records. While the CPU complexity of creating sites on the node during replay; this applies for both command log records is not a new phenomenon, it becomes significant at logging and physiological logging. For both Voter and TPC-C, main-memory OLTP speeds, where the actual work performed the log read in case of command logging added less than 1% by each transaction is small and completes in tens to hundreds extra overhead to the replay time, due to small log records of microseconds. and relatively high per-transaction replay times. Log reads in 5) Recovery Times: After a server node crashes and is physiological logging, in contrast, add a 30% overhead to voter brought up again, it must recover to its initial state by first replay times and about 8% overhead to TPC-C, due to larger reading the latest database snapshot into memory with indexes log records and faster re-execution times. rebuilt in parallel and then replaying log records. For both 6) Distributed Transactions and Replicated Partitions: As voter and TPC-C, snapshot restore and index reconstruction we noted in Section I, OLTP transactions have become shorter 612

10. ID ID � 1 0000 �������---� Command-logging - � 1 0000 �-�-��--�---� C o m m and-logging - � Physiological-logging � Physiological-logging No-logging .. N o-logging .. u o Command 1 000 - - .- � . . _ _ :;; Logg ing - . <n , - ' Physiological "2 Preferred - '" � 8 , ' Logging Q) Preferred 0:: ,-,,- x TPC-C ' -� x TPC-C � ,- / wi added latency (aka TPC-C i n 10 � ,-' 1 989) z , ,)(" Voter 0% 1% 1 0% 1 00% 0% 1% 10% 1 00% Transaction Length % of distributed N ew-order transactions - log scale % of distributed New-o rder transactions - log scale Fig. 9. TPC-C New-order run-time throughput Fig. 10. TPC-C New-order run-time through­ Fig. 1 1 . Illustration of when command log­ (tpmC) vs. % of distributed transactions for a put (tpmC) vs. % of distributed transactions for ging is preferred over write-ahead physiologi­ single-node multi-site setup. a multi-node 3 -way replicated multi-site setup. cal logging, with experimental results overlaid. as processors have gotten faster and RAM sizes of tens of 2 AM tpmC (slightly below that of no-logging at 2 . 6M tpmC) gigabytes have become routine . In this section, we increase and 1 .5 x that of physiological logging 1 .5 tpmC throughput. the average transaction length by varying the fraction of This 1 . 5 x throughput gap between command logging and distributed transactions in the workload and see how run-time physiological logging remains even as distributed transactions performance of each of the three logging approaches changes are introduced. This gap slowly drops down, and remains about as we do so. Our hypothesis is that a longer transaction l A x even at 50% distributed transactions, until at about 1 00% length should make physiological logging look better, because distributed transactions, transaction latencies are so high that logging will represent a small fraction of the total work the all logging approaches provide identical results. transaction does. Figure 1 0 shows performance numbers for the different For all our experiments in this section, we use a modified logging approaches when we have a cluster configuration of TPC-C benchmark consisting of 1 00% New-Order transactions 3 server nodes running 12 execution sites each, with each site and vary the fraction of multi-partition New-Order transac­ replicated three-ways. We still have a TPC-C workload with tions. The methodology behind doing so is as follows. In 1 2 warehouses, with the difference that now each warehouse TPC-C New-Order, an order has between 5 to 15 items, partition is stored by three different execution sites. We see that for an average of 1 0. Each item can come from a remote again, at 0%, the performance gap is as expected, command warehouse with x% probability (default is 1 %, we vary this) . logging throughput wins by a factor of almost 2 x , with a Our TPC-C New-order table is partitioned on warehouse-id, penalty of less than 5% compared to when no logging is so for a New-order transaction to be multi-partition, at least done. For this configuration however, the performance offered one of the items must come from a remote warehouse, and by all three approaches drops quickly as we increase the thus the probability that a transaction is distributed can be fraction of distributed transactions, with a gap of 1 . 2 x in 10 approximated as 1 - ( 1 - 1�0 ) . favor of command logging at 5% distributed transactions, As mentioned in Section V-B, we use a slightly dif­ which closes down to nearly identical throughput for all three ferent cluster setup for running our distributed transac­ approaches beyond 1 0% . These results are in agreement with tionsireplication experiments. As servers we use for these the hypothesis: progressively higher transaction lengths lead experiments have 1 2 cores each, we employ a configuration to smaller run-time performance gaps between the different with 1 2 execution sites per server node . Our New-order table logging approaches. is partitioned on warehouse id, and we let each warehouse Another interesting point to note is that the gap between the partition be owned by an execution site, so that New-order different logging approaches closes slower with no replication transactions with an item from a remote warehouse are always and faster with replication (Figures 9 vs. 1 0) : this is expected multi-partition. because a 3-way replication setup makes each transaction in We start with results for a system setup similar to re­ the workload, distributed or not, multi -sited. sults presented previously: an asynchronous client issuing We do not show recovery rates due to lack of space here, but transactions at maximum speed to a single server node with we found that physiological logging is much more efficient at multiple execution sites with no replication. Figure 9 shows recovery compared to command logging if the workload has throughput numbers for this case as we increase the num­ a very high fraction of distributed transactions. ber of multi-partition transactions (latency plots are omitted due to space constraints, but latency is inversely related to D. Discussion throughput as shown in previous results.) Because throughput Our results shows that command logging has a much lower drops dramatically with even a small fraction of distributed run-time overhead than physiological logging (nearly zero transactions, both axes on the plot are in log scale with in fact) . This is due to the fact that it does less work at the 0% x-label (only single-sited transactions) approximated run-time to generate log records, and also because it writes as 0. 1 % on the plot. Here the numbers for no distributed less data to disk. In the two benchmarks we evaluated, transactions (x 0) are in agreement with those seen in = command logging was able to achieve as much as a 1 .5 x earlier sections, with command logging having a throughput of performance improvement over our main-memory optimized 613

11.implementation of physiological logging on TPC-C, and about a read-only transaction that reads the entire database and 1.2 x on Voter. This improved performance comes at the cost writes its pages to disk. If the database employs some form of of an increased recovery time for command logging, since snapshot-isolation (which most databases, including Postgres, it has to redo all of the work of a transaction, whereas Oracle, and SQL Server do), such read-only transactions will physiological logging only has to re-apply updates to data not block any other transactions in the system. However, this tuples. Recovery times for command logging range from 1. 5 x requires two copies of the database to be on disk, which may slower on TPC-C to 5 x slower on Voter. In reality, system not be feasible. Exploring the best method for transactionally­ failures are infrequent, and can be masked via high-availability consistent snapshotting of conventional databases, such as through replication; this makes recovery speed secondary in those in [25] , is an interesting area for future work. importance to system performance for most systems. Hence, For the second property, assuming a transactionally­ in modern high-throughput settings, command logging, with consistent checkpoint is available, serial replay from a com­ its near-zero overhead at run-time and modest reduction in mand log will result in a correct recovery as long as the recovery times, is the best choice. transactions in the log represent the serial equivalent commit In our experiments with high fraction of distributed trans­ order in which transactions were executed pre-crash. This will actions, physiological logging does better, since the overheads be the case assuming: (a) the use of strict two-phase locking represent a small fraction of overall run-time, and recovery (S2PL) for isolation, (b) no writes of dirty pages of uncom­ times for physiological logging become much better than mitted transactions, obviating the need for undo logging, so for command logging. Hence, for applications with complex that correctness is ensured despite potential non-deterministic or mostly distributed transactions that update few records transaction aborts resulting from deadlocks. Other transac­ (which is not true of most OLTP applications), ARIES-style tional isolation protocols, like serializable snapshot isolation physiological logging is probably a better choice. This is also (SSI) [1] , unfortunately do not guarantee that commit order the reason why ARIES has traditionally been considered the is the same as the serial equivalent execution order. Further­ gold-standard method of recovery: in the 1980's when initial more, it's unclear what the semantics of command log-based research on recovery was done, OLTP throughputs were much recovery are in the face of non-serializable isolation levels like lower, and the relative overheads of ARIES-style logging likely snapshot isolation (which is widely used in practice). Hence, represented a much smaller fraction of the total work done per another interesting area for future work involves investigating transaction. These results are summarized Figure 11. this relationship. Our conclusion is that for modern OLTP database systems V II. RELATED WORK that need to process many thousands of transactions per second, command logging should be the recovery method of ARIES [20] is considered the gold standard method for choice, unless for some reason, recovery times are unusually recovery in traditional databases.Most main memory database important for the OLTP system. recovery techniques proposed in the past [8] [9] [4] [ 15] are similar in spirit to ARIES; we briefly go over them here, a V I. GENERALIZING COMMAND LOGGING detailed discussion can be found in [5] [6] . A natural question about the command-logging approach Dewitt et al [3] suggest compressing the log size by writing described in this paper is how it would generalize to traditional only new values to disk but require the presence of stable disk-based systems and to other main-memory OLTP systems memory large enough to hold the write-ahead log for active that use locking. We believe it should generalize well. To make transactions. In absence of such storage, they flush log records it work, we need to ensure two properties: first, command log­ in batches (group commit). Both logging modes in our system based recovery needs to start from a transactionally-consistent (command logging and physiological logging) implement the snapshot, and second, replaying transactions in the command group commit optimization. log in serial order must result in a re-execution that is Li et al [17] also suggest run-time optimizations for reduc­ equivalent to the original execution order of the committed ing log size by using shadow pages for updates but also require transactions pre-crash. all shadow updates as well as the log buffer to reside in non­ To ensure the first property, if transactions are short-lived, volatile memory. Lehman and Carey's recovery algorithm [14] there should be no need to write dirty (uncommitted) data to also requires presence of non-volatile RAM to be able to store disk. However, this alone isn't sufficient to ensure that the state log tails. We do not make such an assumption in our system, of the database on disk when recovery begins is transactionally which is impractical on commodity machines, the entire main consistent, since a crash may occur while data is being flushed memory contents are considered lost after a crash. back, resulting in only part of a transaction's state being on Levy and Silberschatz [16] describe an incremental recovery disk at recovery time. We may be able to atomically flush a set algorithm for main memory databases, which does not require of pages to disk by relying on batteries in enterprise class disks recovery to be performed in a quiescent state, allowing trans­ to ensure that a set of flushed writes actually make it to disk action processing in parallel. This is achieved by recovering even in the event of a power outage or crash. Alternatively, the database pages individually. VoltDB does not have a concept same transactionally-consistent snapshotting approach used in of pages; we implement a similar idea by employing parallel VoltDB could be employed in a disk-based database by issuing recovery at a per partition level for physiological logging. 614

12.Achieving the same is harder with command logging owing to [3] D . J . DeWitt, R . H . Katz, F. Olken, L . D . Shapiro, M . R . Stonebraker, uncertainty about what pages a stored procedure would touch. and D. A. Wood. Implementation techniques for main memory database systems. SIGMOD ' 84, pages 1-8, New York, NY, USA, 1 984. ACM. Purely logical logging has also been proposed recently [19] . [4] M. H. Eich. Main memory database recovery. In Proceedings of i 986 Our work in this paper applies the logical logging idea in its ACM Fall joint computer conference, ACM ' 86, pages 1 226- 1 232, 1 986. extreme to an in-memory database similar in spirit to [12] , and [5] H. Garcia-Molina and K. Salem. Main memory database systems: An overview. iEEE Transactions on Knowledge and Data Engineering, quantifies via extensive experiments the trade-oft's between a 4: 509-5 1 6 , 1 992. highly logical command logging vs. a more traditional ARlES­ [6] L. Gruenwald, J. Huang, M. H. Dunham, J.-L. Lin, and A. C. Peltier. style physiological logging approach. Recovery in main memory databases, 1 996. [7] S . Harizopoulos, D . J. Abadi, S . Madden, and M. Stonebraker. Oltp Recent work by Cao et al describes main-memory check­ through the looking glass, and what we found there. SIGMOD '08, point recovery algorithms for frequently consistent applica­ pages 9 8 1 -992, New York, NY, USA, 2008. ACM. tionsl [2] , we believe their efficient checkpointing techniques [8] H. V. Jagadish, D. F. Lieuwen, R. Rastogi, A. Silberschatz, and S. Su­ darshan. Dab: A high performance main memory storage manager. In can be used in combination with our recovery algorithms for 1. B. Bocca, M. Jarke, and C. Zaniolo, editors, VLDB '94, pages 48-59. better system performance. [9] H. V. Jagadish, A. Silberschatz, and S . Sudarshan. Recovering from Related work such as [lO] [22] [23] has focused on making main-memory lapses. VLDB ' 9 3 , pages 3 9 1 -404. [ l 0] R. Johnson, 1. Pandis, R. Stoica, M. Athanassoulis, and A. Ailamaki. logging more efficient in general by employing ideas such Aether: a scalable approach to logging. Proc. VLDB Endow. , 3 : 6 8 1-692, as reducing log related lock contention. They emphasize that September 20 1 0 . a separation of transactions from detailed knowledge about [ l 1 ] E. P. Jones, D . J. Abadi, and S . Madden. Low overhead concurrency control for partitioned main memory databases. SIGMOD ' 1 0 , pages data placement naturally requires logical recovery. Our system 603-6 14, New York, NY, USA, 20 1 0 . ACM. architecture does not employ locking, so these techniques do [ l 2] A. Kemper and T. Neumann. Hyper: A hybrid oltp&olap main memory not apply. database system based on virtual memory snapshots. ICDE ' 1 1 , pages 1 9 5-206, Washington, DC, USA, 201 1 . IEEE Computer Society. V III. C ONCL U S I O N [ l 3] C. Lamb, G. Landis, J. Orenstein, and D. Weinreb. The objectstore database system. Commun. ACM, 34( 10): 50-63, Oct. 1 99 1 . In this paper, we compared the run-time and recovery [ l 4] T. 1 . Lehman and M . J . Carey. A recovery algorithm for a high­ performance of command logging to ARIES-style physiolog­ performance memory-resident database system. SIGMOD ' 87, pages 1 04- 1 1 7 , New York, NY, USA, 1 9 87 . ACM. ical logging in high-throughput OLTP settings. Command [ l 5] T. 1. Lehman and M. J. Carey. A concurrency control algorithm for logging recovers by re-running committed transactions from a memory-resident database systems. FOFO ' 89, pages 490-504, London, transactionally-consistent checkpoint, whereas for physiologi­ UK, 1989. Springer-Verlag. [ l 6] E. Levy and A. Silberschatz. Incremental recovery in main memory cal logging, fine-grained updates are recorded at run-time and database systems. iEEE Trans. on Knowl. and Data Eng. , 4: 529-540, the same updates applied at recovery time. We implemented December 1 992. these techniques in the VoltDB main-memory database system [ 1 7] X. Li and M. H. Eich. Post-crash log processing for fuzzy checkpointing main memory databases. In ICDE, pages 1 1 7- 1 24, Washington, DC, and found that on a modern machine running two OLTP USA, 1993. IEEE Computer Society. benchmarks at high throughputs (in excess of 4K tps per [ l 8] J.-L. Lin and M. H. Dunham. Segmented fuzzy checkpointing for main core), physiological logging imposes significantly higher run­ memory databases. SAC '96, pages 1 58- 165, New York, NY, USA, 1 996. ACM. time overheads than command logging, yielding l. 2 x to l. 5 x [ l 9] D. Lomet, K. Tzoumas, and M. Zwilling. Implementing performance lower throughput. It does, however, recover more quickly, with competitive logical recovery. Proc. VLDB Endow. , 4:430-439, April recovery times ranging from l. 5 x to 5 x faster. Our conclusion 201 1 . [20] C . Mohan, D . Haderle, B . Lindsay, H . Pirahesh, and P. Schwarz. Aries: from these experiments is that, since most systems invoke a transaction recovery method supporting fine-granularity locking and recovery infrequently, databases focused on high-throughput partial rollbacks using write-ahead logging. ACM Trans. Database Syst. , transaction processing should implement command logging as 1 7 : 94-1 62, March 1 992. [2 1 ] J. K. Ousterhout, P. Agrawal, D . Erickson, C. Kozyrakis, J. Leverich, the recovery system of choice. D. Mazieres, S. Mitra, A. Narayanan, M. Rosenblum, S. M. Rumble, We believe that these results should also apply to disk­ E. Stratmann, and R. Stutsman. The case for ramclouds: Scalable resident databases, since logging represents a significant over­ high-performance storage entirely in dram. In SIGOPS OSR. Stanford InfoLab, 2009 . head in these systems as well (hundreds of microseconds per [22] I. Pandis, R. Johnson, N. Hardavellas, and A. Ailamaki. Data-oriented transaction, according to prior research [7] ). Hence, generaliz­ transaction execution. Proc. VLDB Endow. , 3 : 928-939, September 2 0 1 0 . ing command logging to a disk-based system is an interesting [ 2 3 ] I. Pandis, P. Toziin, R. Johnson, and A. Ailamaki. Pip: page latch-free shared-everything oltp. Proc. VLDB Endow. , 4:6 1 0-62 1 , July 201 1 . area of future work. Doing so is non-trivial as our current [24] A . Pavlo, C . Curino, and Z . Stan. Skew-aware automatic database implementation of command logging relies on the fact that our partitioning in shared-nothing, parallel oltp systems. SIGMOD ' 1 2, system recovers from a transactionally-consistent checkpoint 20 1 2 . [25] S . Pilarski and T . Kameda. Checkpointing for distributed databases: (which does not include any uncommitted data) and that Starting from the basics. IEEE Trans. Parallel Distrib. Syst. , 3(5):602- the command log is written in an equivalent serial order of 6 1 0 , Sept. 1 992. execution of the committed transactions in the database. [26] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hili, Inc., New York, NY, USA, 3 edition, 2003 . REFERENCES [27] Redis. http://redis.io. [ 1 ] M. J. Cahill, U. Rohm, and A. Fekete. Serializable isolation for snapshot [28] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, databases. ACM Trans. Database Syst. , 34(4) :20 : 1 -20:42, 2009 . and P. Helland. The end of an architectural era: (it's time for a complete [2] T. Cao, M. Vaz Salles, B . Sowell, Y. Yue, A. Demers, J. Gehrke, and rewrite). In VLDB '07, pages 1 1 50-1 1 60. VLDB Endowment, 2007 . W. White. Fast checkpoint recovery algorithms for frequently consistent [29] The TPC-C benchmark. www.tpc.org/tpcc. applications. SIGMOD ' 1 1 , pages 265-276. ACM, 20 1 1 . [30] VoltDB . http://voltdb.com. 615