A Commutative and Associative Logging Scheme

he major contribution of this paper is the so-called d$- ferential logging scheme that permits unrestricted parallelism in logging and recovery. Using the bit-wise XOR operation both to compute the differential log between the before and after images and to recover the consistent database state, this scheme offers the mom for significant performance improvement in the MMDBMS.

1. Differential Logging: A Commutative and Associative Logging Scheme for Highly Parallel Main Memory Database Juchang Lee Kihong Kim Sang K. Cha Graduate School of Electrical Engineering and Computer Science Seoul National University fjuch, next, chask)@kdb.snu.ac.kr Abstract processing the disk-resident data indirectly through the main-memory buffer. With a gigabyte of memory priced at less than $2,000 these days, the MMDBMS emerges as an With a gigabyte of memory priced at less than $2,000, the economically viable alternative to the DRDBMS with the main-memory DBMS (MMDBMS) is emerging as an eco- data structures and algorithms optimized for the in-memory nomically viable alternative to the disk-resident DBMS access. (DRDBMS) in many problem domains. The MMDBMS can show significantly higher performance than the DRDBMS The MMDBMS can show higher performance than the by reducing disk accesses to the sequential form of log writ- DRDBMS not only for read transactions but also for update ing and the occasional checkpointing. Upon the system transactions by orders of magnitude. This is because the crash, the recovery process begins by accessing the disk- disk access in the MMDBMS is reduced to the sequential resident log and checkpoint data to restore a consistent form of log writing and the occasional checkpointing [4]. state. With the increasing CPU speed, however, such disk Upon the system crash, the recovery begins by accessing access is still the dominant bottleneck in the MMDBMS. To the disk-resident log and checkpoint data to restore a con- overcome this bottleneck, this paper explores altematives of sistent state. However, in the MMDBMS, such disk access is still the dominant bottleneck, especially, for update- parallel logging and recoveiy. intensive applications. To overcome this bottleneck, this The major contribution of this paper is the so-called d$- paper explores the alternatives of parallel MMDB logging ferential logging scheme that permits unrestricted parallel- and recovery. The availability of low-cost but high-speed ism in logging and recovery. Using the bit-wise X O R op- multiprocessor platforms with many commodity disks justi- eration both to compute the differential log between the be- fies this direction of research. fore and after images and to recover the consistent data- base state, this scheme offers the mom for significant per- The novel contribution of this paper is the so-called dif- formance improvement in the MMDBMS. First, with log- ferential logging scheme that permits unrestricted parallel- ging done on the difference, the log volume is reduced to ism in logging and recovery. Based on the nice properties of almost half compared with the conventional physical log- the bit-wise XOR operation, this scheme uses this single ging. Second, the commutativiq and associativity of XOR operation both to computes the differential log between the enables processing of log records in an arbitrary order: before image and the after image, and to recover a consis- This means that we can freely distribute log records to mul- tent database state from the checkpoint data and a collection tiple disks to improve the logging pelforniance. During the of log records. Compared with the well-known DRDBMS- recovery time, we can do parallel restart independently f o r oriented recovery schemes such as ARIES [9], the differen- each log disk. This paper shows the superior performance tial logging offers the room for significant performance im- of the differential logging comparatively with the physical provement in MMDBMS. First, since logging is done on logging in the shared-memory multiprocessor environment. the difference, the log volume is reduced to almost half compared with the physical logging, which writes both the before and after images. Second, since the recovery process 1. Introduction involves a single XOR operation, there is no need to distin- guish the redo and undo phases as in the typical DRDBMS. Emerging data-intensive applications such as e-commerce In addition, since the XOR operation is both commutative and mobile value-added information services require grace- and associative, the log records can be processed in an arbi- ful processing of highly concentrated user transaction pro- trary order, independent of the serialization order. This files. The traditional DRDBMS cannot effectively deal with means that we can freely distribute log records in parallel to such a requirement because of the overhead associated with an arbitrary number of disks to improve the logging per- 173 1063-6382/01$10.000 2001 IEEE

2. log during the recovery time. Typically, the two most recent ~~ ~ backup copies are kept. The recovery manager recovers a consistent in-memory DB from the log and the most recent backup DB copy in the case of the system failure. This paper assumes that the primary DB consists of a set of fixed-length in-memory pages, and each page consists of \ J a fixed number of slots. The slot length is identical in a sin- 4 + gle page but may differ among pages. This is to say that the slot is the basic unit of logging as well as storage manage- ment. In this slotted page scheme, variable-length records are handled by linking multiple slots [ 131. We assume the slotted page scheme just for simplicity, but our discussion Figure 1. Logical structure of MMDB in this paper is also valid for the heap scheme, which alle formance. During recovery, we can do parallel restart inde- cates an arbitrary size of memory [7]. pendently for each of log disks. Even intermixing the log replay and the backup DB replay is allowed to pursue the 2.2. Logging, Checkpointing, and Restart maximum degree of parallelism. To guarantee the durability of the committed transactions, To verify the benefit of using the differential logging, we all of the redo log information (the “after image” in the have implemented the parallel recovery schemes based on physical logging) is flushed to the disk before writing the the differential logging and compared them with those commit record. The undo log information (the “before im- based on the physical logging. Both types of schemes use age” in the physical logging) is also created before every the transaction-based log partitioning to distribute log re- update to prepare for aborting the transaction. On the trans- cords to multiple log disks for parallel logging and parallel action abort, the compensation log record (CLR) is gener- restart. In a series of experiments conducted in a CCPU ated to facilitate the undo of the aborted transaction during multiprocessor environment with multiple log disks, the dif- post-crash restart [9]. ferential logging outperforms the best physical logging im- To allow normal transactions to run simultaneously dur- plementation by about twice in the transaction processing ing the checkpoint, we focus on the fuzzy checkpoint policy performance and by about six times in the log processing [5]. Since an inconsistent backup copy can be made with time for the recovery. this policy, the undo log is required to be flushed to the disk. This paper is organized as follows. Section 2 presents an However, the WAL (Write Ahead Log) protocol of DRDB, overview of MMDB recovery and discusses the problems which requires flushing the undo log information to the disk associated with parallelizing it. Section 3 presents the defi- before writing a data page on disk, is not necessary for the nition and properties of differential logging and the parallel MMDB. Instead, the undo log is flushed in bulk just before MMDB logging and recovery algorithms based on it. Sec- finishing the checkpoint. If the system crashes during tion 4 presents a framework for parallel logging and recov- checkpointing to a backup copy, the database can be recov- ery based on the transaction-based log partitioning, and ex- ered from the other backup copy. plores the levels of parallelism achievable in the multiproc- With the physical logging scheme, the above discussion essor environment with multiple disks. Section 5 presents translates to forcing both the before and after images to be the result of the experiment conducted to compare the dif- flushed to the log before the commit, abort, or checkpoint ferential logging with the physical logging. Section 6 [9]. In addition, during post-crash restart, the log records for briefly compares the differential logging with the previous the same resource have to be applied by the serialization MMDB recovery schemes, and section 7 finally concludes order. The well-known ARIES-based recovery algorithm this paper. proceeds in the following order: 1. Read the most recent backup copy into the primary DB. 2. Parallelism in MMDBMS 2. Redo by scanning the log forward from the beginning of the most recent checkpoint. 2.1. MMDB Structure 3. Undo by scanning the log backward for the loser transactions that were active at the time of crash. Figure 1 shows the usual MMDB structure. The primary DB keeps the up-tedate data in memory. The log manager 2.3. Parallel Logging and Restart maintains the in-memory log buffer and the disk-resident log volume. The checkpoint manager creates backup data- Since the log write and read is the dominant bottleneck in base copies from time to time to avoid replaying the entire MMDBMS, it has been suggested to use multiple log disks 174

3. 123. However, using multiple log disks in parallel has sev- 3. The Differential Logging eral problems to be solved, which will be discussed in this section. To the best of our knowledge, these problems have not been studied. 3.1. Definitions and Properties We first list two desirable restart properties for parallel logging schemes. First, if possible, it is desirable to avoid Definition 1 (Differential Log) Assume that a transaction changes the value p of a slot to q. Then, the corresponding the cost of merging logs by the serialization order during the post-crash restart. When the physical or the logical log- differential log A@, q) is defined a s p @ q, where @ denotes ging scheme is used, log records for the same resource have the bit-wise XOR operation. to be replayed by the serialization order [9][3]. Otherwise, the consistent state of the database cannot be recovered. Definition 2 (Redo and Undo) Second, it is desirable to distribute the processing of log re- Using the differential logging, the redo and undo operations cords in parallel to multiple CPUs. This is because the log are defined as p @ A@, q) and q 0 A@, q), respectively, ' transfer rate may exceed the processing rate of a single where p is the before image and q is the after image. CPU when multiple log disks are used. These two properties, desirable to reduce the restart time, For example, consider that p is 0010 and q is 1100. Then, can be achieved by partitioning log records by the resource. the corresponding differential log is 1110 (= 0010@1100). For example, consider partitioning the database into several The redo 001O@lllO(i.e.@A) recovers q correctly, and the segments and storing the log records for each segment to a undo 1100@1110(i.e. q 0 A ) recovers p correctly. To ease different disk [3]. Then, each log can be replayed in parallel. the discussion of the differential logging, we first list the However, this resource-based partitioning scheme has the properties of bit-wise XOR operator. following three problems. 1. When updates are skewed to certain segments, the disks 1.Existence of identity: p @ 0 = p for other segments are not fully utilized. 2.Existence of inverse: p @ p = 0 3.Comutative: p @ q = q @ p 2. Suppose that the system crashes in the middle of com- 4.Associative: (p @ q) @ r = p @ ( q @ r ) mitting a transaction that updated multiple segments. Since the log records are written to multiple disks simul- Theorem 1 (Recoverability) Assume that the value of a taneously, the records for a certain segment might not be slot has changed from boto b, by m updates and each u p written although the commit record has been already date U; (i = 1 , ..., m) has generated the differential log A(b;.], written to a disk. In this case, it is impossible to recover bJ. Then, the final value b,can be recovered from the ini- the consistent database. To avoid this problem, the tial value bo and the differential log, and the initial value commit record can be written to all the disks. However, can be recovered from the final value and the differential this complicates the task of determining whether a trans- log. action committed or not during restart. Proof: 3. Suppose that there are many transactions that update i) Recovery of b, from bo : multiple segments. Since log records are flushed to the disk before committing a transaction, such a transaction bo 0 A ( ~ o6,) 0 A ( b l , b ~0) *.. 0 A ( b m - 1 , b m ) 3 incurs multiple disk I/Os, one for each segment. This = bo 0 (b,@b/)0 (b/@b2)0 ... @ (b,./@b,) leads to a significant decrease of transaction throughput, = (bo0bo) 0 (bleb,)0 ... Obm= 0 @... @ 0 @ b, = b, compared with the case where each transaction incurs ii) Recovery of bo from b,,, : only one disk I/O. b, 0 (b,./@b,) 0 ... 0 (bo061) = b, @ (bmObm./) @ ... @ (bl0bo)= ... = bo Our proposal in this paper is to use the differential log- ging scheme. Since the redo and undo operations of the dif- Theorem 2 (Order-Independence of Redo and Undo) ferential logging scheme are commutative and associative, those two desirable restart properties are automatically Given the initial value boof a slot and differential logs A;, i achieved. Therefore, we can freely distribute log records to = 1, ..., m,where some of Ai may be undo logs, , the final multiple disks such that the above problems with the re- value b, can be recovered applying the differential logs in source-based partitioning scheme do not occur. One such a an arbitrary order. distribution scheme is the transaction-based partitioning, Proof: which will be described in section 4. Assume that differential logs are applied in the order of Ak(,), In addition to the logging, checkpointing can be also per- Ak(z), ..., Ak,,) , where k(i) E { 1, 2, ..., m ) and k(i) != k(j) iff formed in parallel using multiple disks [5]. This paper as- i ! = j for all i andj. Then, the final value of the slot is sumes that the backup database is partitioned into several bo @ AMI)0 0 ... @ &(m) disks. 175

4. Size of Differential Log LSN PrevLSN TID PagelD Differential Log chronization problems. Compared with the physical logging scheme, the differ- During checkpointing, a dirty page should not be copied ential logging has the following nice properties that offer room for significant improvement of the logging and the re- into a backup DB while a transaction is updating a slot in start performance in the MMDBMS. the page. Otherwise, a mixture of the before and the after The log records can be processed in an arbitrary order. images may be copied, and then the page cannot be This means that we can freely distribute log records to recovered from a crash. multiple disks to improve the logging and the restart per- Since the XOR operation used for redo and undo is not formance. The restart techniques to get unrestricted par- idempotent, we have to know whether the update re- allelism will be given in section 4. corded in a certain log record has been reflected to a Redo and undo operations can be mixed in an arbitrary backup. If so, the record should not be replayed when order. This means that two separate passes of redo and rolling the corresponding transaction forward. undo during restart is not necessary. We develop a one- To handle the first problem, we use the simplest locking pass restart algorithm in section 3.4. primitive, the so-called mutex, so that the checkpointer and Compared with the physical logging, the log volume is update transactions may have to acquire the mutex for a reduced to almost half since the redo log and the undo page before copying or updating. And, to deal with the sec- log are same. Thus, both the time to write the log during ond problem, we use a flag named BackupID. Each page the normal operation and the time to read the log during has a BackupID field in its header, and this field is copied the restart decrease. into each differential log record. This field of a page is tog- gled after flushing the page to the appropriate backup DB 3.2. Logging during checkpointing. Figure 2 shows the structure of a differential log record. The corresponding update and fuzzy checkpointing algo- The LSN (log sequence number) represents the order of log rithms are presented in Algorithms 1 and 2. Algorithm 1 is record creation. In our implementation, we use the physical used by a transaction when updating a slot. After check- location of a log record on disk as its LSN. The PrevLSN is pointing, Algorithm 2 records the LSN of end-checkpoint used to chain the log records of a transaction for fast back- record in the log anchor, which keeps the active status of ward traversal. The TID represents the transaction identifier log, so that the end-checkpoint record can be located that created the log record. There are six record types: begin, quickly for restart. abort, commit, DL, begin-checkpoint, end-checkpoint. The first three are used to record the beginning and the end of a Algorithm 1. Update transaction. Only log records of the DL type have the re- 1. Generate the corresponding differential log maining five fields, whose meaning is self-evident except record. 2. If the global checkpoint flag is set, ac- BackupID, which indicates the more recent one of the two quire the mutex for the page. backup DBs. We will explain its usage in section 3.3 and 3. Copy the BackupID flag stored in the page's section 3.4. header into the BackupID field of the log When updating a slot, the transaction generates a log re- record. cord of the DL type by applying the bit-wise XOR opera- 4. Update the page. tion to the before and the after images. Then, the record is 5. Release the mutex if it was acquired previ- appended to the log buffer, which is flushed to the log disk ously . 6. Append the log record to the log buffer. when the buffer overflows or the transaction commits. 3.3. Checkpointing Algorithm 2. FuzzyCheckpoint 1. Begin with the following. A . Set the global checkpoint flag. The differential logging can be used with any of the fuzzy B. Create a begin-checkpoint record and ap- 176

5. pend it to the log buffer. 2 . Read the current backup DB, and reconstruct C. Choose the backup DB that was the least the primary DB as of the checkpointing. recently checkpointed as the current 3 . Initialize CTT as empty. backup DB. 4 . Scanning the log backward from the end of 2 . Scanning the database page by page, do the log until the end-checkpoint record is en- following for each page. countered, do the following depending on A. Acquire the mutex for the page. the record type. B. Toggle the BackupID flag in the page's A. Begin record: remove the corresponding header. entry from CTT C. If the dirty bit in the page header that B. Commit record: create a corresponding en- corresponds to the current backup DB is try in CTT set, copy the page asynchronously into C. Abort record: ignore the current backup DB. D. DL record: if TID is in CTT, redo. Other- D. Release the mutex wise, ignore it 3 . Finish by doing the following. 5 . Initialize ATT with TIDs in the A. Reset the global checkpoint flag. end-checkpoint record. Then, let ATT be B. Wait until all the asynchronous I/Os com- ATT - CTT plete. 6. Continue scanning the log backward to the C. Create an end-checkpoint record contain- begin-checkpoint record doing the follow- ing active transaction IDS, and append ing depending on the log record type. it to the log buffer. A. Begin record: remove the corresponding D. Flush the log buffer. entry in either of CTT o r ATT B. Commit record: create a corresponding en- 3.4. Restart try in CTT C. Abort record: create a corresponding en- When the differential logging is used, the forward scan of try in ATT log is not mandatory since the redo operation is commuta- D. DL record tive and associative. If we scan the log backward, we en- i. If TID is in CTT and the BackupIDs in the log record and the correspond- counter a commit or an abort record of a transaction before ing page header are same, redo. other records of the transaction. Thus, we can skip the re- ii. If TID is not in CTT and the Back- cords of aborted transactions and loser transactions. On the upIDs in the log record and the other hand, when scanning forward, all the log records are corresponding page header are d i f - usually replayed presuming the commit because the abort ferent, undo. ratio is usually low. This strategy needs the separate undo iii. Otherwise, ignore the current log re- pass to roll back loser transactions. The undo pass scans the cord. log backward from the end of the log, and it may incur UOs 7. Undo for all the transactions remained in ATT scanning the log backward. if some of the log data of loser transactions have been flushed out of memory. Although the differential logging can also be used with the two-pass restart strategy, we pre- Since the one-pass restart algorithm just ignores the log sent only the backward one-pass restart algorithm due to the records of aborted transactions, not only compensation log space limitation. records don't have to be made when a transaction aborts but When scanning backward, two special cases need to be also log records don't have to be flushed before aborting. handled. One is the transaction that was active at the time of To facilitate the backward scan by the one-pass restart algo- crash, and the other is the transaction that aborted after rithm, we store the log header after the log body. An alter- checkpointing. Since there is no commit or abort record for native is to append the size of log body at the end of each the first type of transactions, we should skip the log records record. that appear without a corresponding commit or abort record. Since the pages updated by the second type of transactions 4. Parallel Logging and Restart might have been copied into a backup DB in an inconsistent state, we need to roll back the reflected updates by those transactions. To identify these two types of transactions, 4.1. Parallel Logging two tables CTT(committed transaction table) and A l T (aborted transaction table) are maintained. The detailed al- For parallel logging, this paper proposes to partition log re- gorithm is given in Algorithm 3. cords based on the transaction ID. Figure 3 shows the log- ging system architecture with multiple instantiations of the Algorithm 3. One-Pass Restart log manager. When a transaction begins, the transaction 1. Read the current backup DB ID from the log manager inserts a record in the transaction table. Its anchor. LogMgrlD field is filled with the identifier of the idlest log 177

6. BL: Backup DB Loader Transaction Table BR Backup DB Read BP: Backup DB Play LL: Log Loader LR: Log Read LP Log Play b... Active Transactions U--.& Backup DB Backup DB fg Partition 1 Partition m Partition I Partition n Figure 3. Transaction-partitioned parallel logging manager, which has the smallest number of log pages to Figure 4. Parallel restart flush. Then, subsequent log records of the transaction are directed to that log manager. The LastLSN is initialized the BR and the LR simultaneously if the YO bandwidth of with 0. Each log manager has its own log buffer and log the system allows. However, since the read log records disk. cannot be applied until BLs finish the BR and BP, log re- cords are just piled up in memory. Thus, this level of paral- 4.2. Parallel Restart lelism is useful when there is enough memory to keep a large number of log records. Figure 4 shows the parallel restart architecture. The restart process is performed by two kinds of agents, BL (backup Parallel Restart Level 4 The final level of parallelism is to DB loader) and LL (log loader), which are instantiated and do the BP and the LP simultaneously, i.e., applying log re- coordinated by the recovery manager. The recovery cords while copying a backup DB into the primary DB. Al- manager instantiates as many BLs and LLs as the number though it seems impossible, the differential logging allows of backup DB partitions and log partitions, respectively. it with the following two modifications. The BL does two jobs, BR (backup DB read) and BP The BL applies the bit-wise XOR operator to two pages: (backup DB play), and the LL does LR and LP. The BR and one in the primary database and the other read from the the LR are VO jobs while the BP and the LP are CPU jobs. backup database. Therefore, both the BP and the LP are Basically, these four jobs are performed sequentially, done using the XOR operator. Of course, the BL has to ac- + + + BR BP LR LP, but they also can be performed in quire the mutex before updating a page. parallel. This section presents four levels of parallel restart Before the recovery manager instantiates BLs and LLs, it schemes. clears the primary database such that all pages have Os. Parallel Restart Level 1 In this base level of parallelism, Theorem 3. The restart scheme using the Level 4 parallel- multiple BLs run simultaneously, and multiple LLs run si- ism recovers the primary database correctly from the crash. multaneously, but LLs can start only after all the BLs fin- ished their jobs. Proof: Due to the space limitation, we omit the rigorous Since each BL updates a different part of the primary DB, proof, and give a sketch of it. Since a page in the primary no synchronization is needed among BLs. However, LLs database is initially filled with Os, applying the XOR to the need to be synchronized so that they should not apply log page and the corresponding page read from the backup DB records for the same resource simultaneously. Such syn- is equivalent to copying from the backup DB to the primary chronization can be done inexpensively by acquiring the DB. Since the XOR operator is commutative, BP and LP mutex for a page as described in section 3.3. Furthermore, since log records are partitioned by the transaction, the transactions tables used in Algorithms 3 don’t have to be shared among LLs. Parallel Restart Level 2 A simple additional parallelism is to use the pipeline. When the asynchronous YO is supported, each of BLs and LLs can do their YO and the CPU jobs si- multaneously using a pipeline. Table 1. Four levels of parallelism in restart ({ }: multiple instantiation, I: pipelined parallel execution, Parallel Restart Level 3 When the backup database and “II”: parallel execution) log records are stored in different disks, it is possible to do 178

7.can be done in an arbitrary order. 0 60 50 Table I compares these four levels of parallelism. We as- 2'40 sume that the higher level of parallelism includes all the lower levels of parallelism. For example, all the four levels of parallelism are used in the level 4. -8 vi 30 i= 20 10 5. Experimental Evaluation 0 OK 200K 400K 600K 5.1. Experimental Setup #transactions since the checkpoint To show the practical impact of the differential logging, we Figure 5. Log processing time with the log volume and implemented the proposed recovery schemes in our main- log disks memory DBMS kernel called P*TIME [I]. Like its prede- cessor XMAS [12][13], this system is implemented in C++ from the table. These transaction types mock the message to facilitate the object-oriented extension, and supports mul- receiver and the flusher processes of the SMS system, re- tithreaded processing for scalable performance on spectively. The former receives message through the SS#7 multiprocessor platforms. network and inserts them, and the latter moves sent mes- sages to the archive in order to keep the MMDB from Table 2 summarizes three implemented parallel recovery exploding. schemes. PL2P combines the physical logging and the two- pass restart algorithm. DL2P combines the differential log- The experiment is conducted on a Compaq ML 570 ging and the two-pass restart algorithm. DLlP combines the server with four 700 MHz Xeon CPUs, 6GB RAM, and differential logging and the one-pass restart algorithm. All dozen disks. It has a single U 0 channel, which can support of these schemes support pipelining between LR and LP up to 160MB/sec. The disks are of the same model. Their during log processing in restart. When restarting with more average seek time is 6.3 ms, average latency time is 4.17ms, than two log disks, PL2P reads log records from log disks and maximum sustained transfer rate is 29 MB/sec. in parallel, merge them by the serialization order, and re- 5.2. Log Processing Time plays them in that order. Therefore, PL2P does the I/O job in parallel but it does the CPU job sequentially. In contrast, As we analyzed in the section 4.2, the restart time is broken DL schemes (DL2P and DLlP) also do the CPU job in par- down to the backup DB loading time and the log processing allel. In the experiment, log records are partitioned into one time. The first is dominated by the disk access and is inde- to eight disks using the transaction-based log-partitioning pendent of the logging scheme. Therefore, in identifying the scheme, and the fuzzy checkpointing is used. impact of the differential logging, we first focus on the The test database is derived from the SMS domain [Ill, measurement of the latter varying the volume of log, the which is one of update-intensive wireless information ser- number of log disks, and the abort ratio. In the measure- vices. The size of the primary database is roughly 250 MB, ment, we expect that: which includes 1 million records. For the controlled ex- DL schemes reduce the total log volume by almost half periment, we made a simplified SMS message table with compared with PL2P, and thus will accelerate the log three fields: message id, destination address, and message. processing even if a single log disk is used. The message id is a 4-byte integer. The destination address The speedup ratio of DL schemes over PL2P remains is a 12-byte character array, which consists of a MIN (mo- bile identification number), the type of MIN, and so on. The constant despite the changes in the number of update message is a 240-byte character array. Thus, the total record transactions since the last checkpoint. or slot size is 256 bytes. The benefit of increasing the number of log disks is We mixed two types of update transactions: one inserts 2 higher in DL schemes than in PL2P in the multiprocessor messages into the table and the other removes 2 messages environment. Logging Exploited parallel- Number of passes Number of Checkpointing scheme ism during restart in restart log disks scheme PL2P Two DL2P Log reading and 1-8 Fuzzy - DLIP Differential log replaying - One Table 2. Tested recovery schemes 179

8. h 0 J 4 .-E G3 .- v) M 2 K M 1 0 ’ I 3 0 2 4 6 8 0 # log disks 0% 5% 10% 15% 20% (a) PLZP Abort ratio 30 I I Figure 7. Comparison of DL2P and DLlP with the abort ratio --a-- LR 8 20 - - a - - LP To compare DLIP with DL2P, we measured the log processing time varying the abort ratio from 0% to 20%. Figure 7 shows the result when four log disks are used and 5 A. 600K transactions had been performed since the last check- 0 point. This result shows that DLlP outperforms DL2P by 0 2 4 6 8 about 1.15 times when the abort ratio is 20%. This im- # log disks provement is mainly attributed to the difference in the log (b) DLlP volume. Since DLlP does not make compensation log re- cords when a transaction aborts, the log volume of DLlP Figure 6. Breakdown of log processing time remains constant regardless of the abort ratio. DLIP is also DLIP will outperform DL2P as the abort ratio increases. faster than DL2P in terms of the CPU time because DLlP does not replay the log records of aborted transactions Figure 5 shows the log processing time measured by while DL2P does the log records of both the committed and varying the number of update transactions since the last the aborted transactions. checkpoint. Note that this number determines the total log volume to process during restart. When it is 600K, the log 5.3. Total Recovery Time volume is 721MB for PL2P, 377MB for DL2P and 370MB for DLlP. The abort ratio was fixed at 2%. The numbers in To see the impact of fully parallel restart based on the dif- the parentheses represent the number of log disks. DLlP(4) ferential logging, we measured the restart time for the fol- is about six times faster than PL2P(4) and about seven lowing recovery schemes: times faster than PL2P(I), regardless of the total log vol- PL2P-Seq: PL2P observing the parallelism level 2 rule of ume. section 4.Namely, the backup DB loading and the log Figure 6 shows the log processing time of PL2P and processing are done in sequence while the pipelining is DLIP (with 600K update transactions since the last check- allowed in each of these two steps. point) by varying the number of log disks. For DLlP, it de- DLIP-Seq: the DLIP version of PL2P-Seq. creases almost linearly with the number of disks until it DLIP-Para: DLlP observing the parallelism level 4 rule reaches a saturation point. On the other hand, for PL2P, the log processing time first decreases linearly and then in- of section 4. The backup DB processing is intermixed creases slowly after a certain point. To show where the per- with the log processing. formance difference comes, two components of the log Figure 8 shows the measured restart time when the processing time are shown with broken lines: the U 0 time backup DB size is 250MB and the processed log volume is to read log records from disk (LR time) and the CPU time 370MB for DLIP and 721MB for PL2P. Figure 8(a) shows to replay the log (LP time). For both PL2P and DLIP, the the result with two log disks and two backup DB.disks and U 0 time decreases almost linearly with the number of disks. Figure 8(b) with four log disks and four backup DB disks. DLIP takes 47% less VO time than PL2P because of the Here are the observations drawn from this figure. difference in the total log volume. Note that PL2P is domi- As in the log processing time, the backup DB loading nated by the LP time when more than two log disks are time also decreases by half as the number of disks dou- used, while DLIP is dominated by the LR time regardless bles. of the number of disks. This is because the LP step of PL2P DLlP-Seq outperforms PL2P-Seq significantly, by about requires merging of logs from different disks by the two times in Figure 8(a) and by about three times in serialization order. 180

9. XI I . 1 400K 1 I Fl I , IDBllLog E;; s d 10 5 0 I PL2P-Seq DLlP-Seq DLlP-Para 0 2 4 6 8 (a) Two log disks and two backup disks #log disks 30 (a) With the number of log disks h 25 400K s 20 0 .$ 15 300K E 4 10 rA 2 200K d 5 0 lOOK PL2P-Seq DLlP-Seq DLlP-Para (b) Four log disks and four backup disks OK 0 512 1024 1536 2048 Figure 8. Breakdown of total restart time (the symbol Slot size (bytes) II denotes the parallel processing) (b) With the size of a slot Figure 8(b). The difference comes mainly from the log Figure 9. Logging performance with the number of processing time while the backup DB loading time re- log disks and the size of data objects mains the same. The parallelism further decreases the total restart time. DLIP-Para outperforms DLIP-Seq by about two times 6. Related Work in both of Figure 8(a) and Figure 8(b). The idea of using the XOR-based logging has been ad- 5.4. Transaction Throughput dressed in the DRDBMS context under the name of transi- tion logging [6]. However, if the normal update and recov- In this series of experiment, we measured the logging per- ery operations are applied directly to the disk database with formance varying the number of log disks and the size of a only the XOR difference on the log disk and if the system slot. crashes in the middle of such operations, it is impossible to Figure 9(a) shows the throughput in TPS (transactions recover the consistent database state. The shadow page per second) with the number of disks. In this measurement, scheme has been considered for each disk update to solve the abort ratio is 2%.This result shows that the logging per- this problem, but it has been abandoned because of its high formance of DLlP and PL2P increases in proportion to the run-time overhead. Fortunately, the MMDBMS does not number of disks. DLlP(8) outperforms PL2P(I) by 9.6 have this problem because it applies the update and recov- times. Even if the same number of log disks are used, DLlP ery operations to the main memory copy of the backup da- processes over 1.8 times as many transactions as PL2P dur- tabase on the disk. ing the same time interval. In order to reduce the log write and recovery time, the Figure 9(b) shows the throughput measured varying the redo-only logging has been studied in the MMDB context slot size. As the slot size increases, the overhead ratio of log [2][5][8]. By keeping the log of uncommitted, active trans- header decreases and the ratio of DL1P.log volume to PL2P actions in memory, it is possible to write only the redo part log volume approaches 0.5.As result, the performance gain of the log on disk when a transaction commits. Thus, the of DLIP(4) over PL2P(4) approaches two. Note that the disk-resident log volume is reduced to almost half as in the performance gain of DLIP(4) over PL2P(l)is about six. differential logging. However, the differential logging has at least the following three advantages over the redo-only logging. First, it permits the unrestricted parallelism in re- covery as mentioned before. Second, it can be used with the 181

10.fuzzy checkpointing while the redo-only logging can be We also plan to verify the benefit of the differential logging used only with the consistent checkpointing schemes [8]. by applying it to the real environment such as e-commerce Third, it does not need the memory space to keep the log of and mobile value-added information services. active transactions, whose volume can grow substantially when long transactions are involved. Although we com- Acknowledgement pared the differential logging with the redo-only logging, they are basically orthogonal and can be used together. This research has been in part supported by the research However, the only benefit from such combination is not to contract with the SK Telecom of Korea on the feasibility write the log data of aborted transactions while the fuzzy study of the main-memory database for mobile value-added checkpointing becomes impossible. information services and the Brain Korea 21 program spon- The transaction-partitioned parallel logging and recovery sored by the Korean ministry of education. has been studied in the shared disk architecture [lo]. This scheme, named ARIESISD, is similar to the PL2P physical logging scheme introduced in section 5. The difference is References that AFUES/SD logs at the page level while the PL2P logs at S. K. Cha, et. al., “P*TIME: A Highly Parallel Main Mem- the slot level. ory DBMS Based on Differential Logging,” submitted to ACM SIGMOD 2001 Conference for demo. 7. Conclusion D.J. DeWitt, et. al., “Implementation Techniques for Main Memory Database Systems,” Proceedings of ACM SIG- This paper studied the problem of parallel MMDB logging MOD Conference, pages 1-8, 1984. I. Gray and A. Reuter, Transaction Processing: Concepts and recovery to increase the transaction processing per- and Techniques, 2nd ed., Morgan Kaufmann, 1993. formance and to reduce the recovery time exploiting the H. Garcia-Molina and K. Salem, “Main Memory Database low-cost multiprocessor platforms and commodity disks in Systems: An Overview,” IEEE Transactions on Knowledge the market. and Data Engineering, 4(6), pages 509-516, 1992. The novel contribution of this paper is to introduce the R. B. Hagmann, “A Crash Recovery Scheme for a Memory- secalled differential logging for parallel MMDB logging Resident Database Systems,” IEEE Transactions on Com- and recovery. Using the bit-wise XOR operation both to puters, C-35(9), pages 839-843, 1986. compute the log and to recover the consistent state from the T. Haerder and A. Reuter, “Principles of Transactions- Oriented Database Recovery,” ACM Compuring Surveys, log, the differential logging permits unrestricted parallelism 15(4), pages 287-31 7, 1983. in MMDB logging and recovery. In addition, because the H. V. Jagadish, D. Lieuwen, R. Rastogi, and A. Silberschatz, log is the XOR difference between the before and after im- “Dah: A High Performance Main Memory Storage Man- ages, the log volume is reduced to almost half compared ager,” Proceedings of VLDB Conference, pages 48-59, 1994. with the physical logging. We assumed the transaction- H. V. Jagadish, A. Silberschatz and S . Sudarshan, “Recov- based partitioning of log for distribution to multiple log erying from Main-Memory Lapses,” Proceedings of VLDB disks and presented two versions of the parallel recovery Conference, pages 39 1-404, 1993. scheme based on the differential logging. The two-pass ver- C. Mohan, D. Haderle, B. G Lindsay, H. Pirahesh, and P. Schwarz, “ARIES: A Transaction Recovery Method Sup- sion scans the log twice for redo and undo as in the well- porting Fine-Granularity Locking and Partial Rollbacks Us- known ARIES algorithm, and the one-pass version scans ing Write-Ahead Logging,” ACM Transactions on Database the log only once taking advantage of the nice properties of Systems, 17(1), pages 94-162, 1992. the differential logging. C. Mohan and 1. Narang, “Data Base Recovery in Shared An experimental study of comparing the differential log- Disks and Client-Server Architectures,” Proceedings of In- ging and the physical logging shows that the former outper- ternational Conference on Distributed Computing Systems, forms the latter significantly both in the update transaction pages 310-317, 1992. throughput and in the recovery time. It is also shown that Mobile Lifestreams, “An Introduction to the Short Message Services,” White paper, http://www.mobiIesms.com/wp/ since the differential logging permits interleaving of the log wp2.htm replaying with the backup DB replaying during the restart, J. H. Park, Y. Sik Kwon, K. Kim, S. Lee, B. D. Park, and S. the backup DB loading and the log processing can proceed K. Cha, “Xmas: An Extensible Main-Memory Storage Sys- in parallel to reduce the total recovery time. Such a level of tem for High-Performance Applications,” Proceedings of parallelism is unthinkable in the MMDBMS based on the ACM SIGMOD Conference, pages 578-580, 1998. traditional logging schemes or in the context of DRDBMS. J. H. Park, K. Kim, S. K. Cha, M. S. Song, S. Lee, and J. In this paper, we have conducted our experiment in the Lee, “A High-performance Spatial Storage System Based on Main-Memory Database Architecture,” Proceedings of 4-CPU shared-memory multiprocessor environment. In the DEXA Conference, pages 1066-1075, 1999. future, we plan to study the benefit of the differential log- ging in different types of parallel computing environment. 182