Low-Overhead Paxos Replication



Data Sci. Eng. DOI 10.1007/s41019-017-0039-z

Jinwei Guo, Jiajia Chu, Peng Cai, Minqi Zhou, Aoying Zhou
East China Normal University, 3663 N. Zhongshan Rd., Shanghai 200062, China
Correspondence: Peng Cai, pcai@sei.ecnu.edu.cn

Received: 30 November 2016 / Revised: 10 March 2017 / Accepted: 13 March 2017
© The Author(s) 2017. This article is an open access publication

Abstract  Log replication is a key component in highly available database systems. To guarantee data consistency and reliability, it is common for modern database systems to use the Paxos protocol, which is responsible for replicating transactional logs from one primary node to multiple backups. However, Paxos replication needs to store and synchronize additional metadata, such as the committed log sequence number (commit point), to guarantee the consistency of the database. This increases storage and network overhead, which has a negative impact on throughput under update-intensive workloads. In this paper, we present an implementation of log replication and database recovery that adopts the idea of piggybacking, i.e., the commit point is embedded in the commit logs. This practice not only retains the virtues of Paxos replication but also reduces disk and network IO effectively. We implemented and evaluated our approach in a main memory database system. Our experiments show that the piggybacking method offers 1.39x higher throughput than typical log replication with a synchronization mechanism.

Keywords  Log replication · Database recovery · Paxos · OceanBase

1 Introduction

Through smart phones, we can submit transaction processing requests to databases at any time, and in Internet application scenarios, highly concurrent requests have overwhelmed traditional database systems. For example, on the Chinese "Singles' Day" (i.e., the Double 11 shopping carnival), the total number of transactions may hit the level of hundreds of millions in the first minute. To address this challenge, many NoSQL and NewSQL systems have been designed and implemented [7]. NoSQL refers to data storage systems that are non-relational, distributed and not guaranteed to follow the ACID properties. Compared to relational DBMSs, NoSQL systems have some excellent characteristics, such as no need to pre-define the data schema, high scalability, share-nothing architecture and asynchronous replication. These features provide strong support for Internet applications in the Web 2.0 era. Both the industrial and academic communities hope to use the unique features of NoSQL to solve massive data processing problems. NoSQL systems have received extensive attention, and major industry players including Google, Amazon and Facebook have developed NoSQL database products which play a key role in their services.

NoSQL systems have some limitations when they are used in mission-critical applications which require strong data consistency. For example, the asynchronous replication and eventual consistency mechanisms provided by NoSQL are not applicable to banking systems. If the inconsistency window is too long and the primary crashes during this period, the last updates may be lost because committed update transactions have not been synchronized to the backups. In this case, it is possible that a customer performs a withdrawal operation, but the final balance of the account is not reduced accordingly.

Log replication based on Paxos [15] can achieve strong data consistency. The Paxos algorithm was proposed by Leslie Lamport in 1990 and is a consensus algorithm based on the message passing model. The algorithm solves the problem of reaching agreement among multiple processes or threads in a distributed environment. Recently, many systems have adopted the Paxos algorithm for log replication [1, 10, 24, 25]. As long as the log records have been replicated to a majority of servers, the primary node can commit the transaction. This method guarantees strong consistency between primary and secondary nodes. When the primary node fails, the majority of the system nodes can select exactly one new primary to achieve a seamless takeover of the predecessor.

Unfortunately, a typical implementation of Paxos replication—which adopts the two-phase commit protocol (2PC) and uses metadata such as the commit point to guarantee the consistency of the database—can increase disk and network overhead. In other words, the synchronization of metadata for consistency can put an excessive burden on the disk and network, which negatively impacts the throughput of the transactional database. Therefore, this paper presents a low-overhead log replication that adopts the idea of piggybacking. More precisely, the commit point is embedded in the transactional log and then synchronized from the primary node to the backups along with the log records. We implemented this mechanism in the open source database system OceanBase [19] and showed that this method provides good transaction throughput, since the disk and network overhead is reduced.

The remainder of the paper is organized as follows. Preliminaries, including the OceanBase architecture and the Paxos replication model, are presented in Sect. 2. We introduce related work on Paxos replication in Sect. 3. Sections 4 and 5 introduce the mechanisms of log replication and recovery for OceanBase, which aim to reduce disk and network overhead. Section 6 presents experimental results. We conclude the paper in Sect. 7.

2 Preliminary

In this section, we introduce the database system OceanBase, in which our low-overhead method is implemented, and then describe and analyze the mechanism of typical Paxos replication.

2.1 OceanBase

OceanBase is a scalable relational database management system developed by Alibaba. It supports cross-table and cross-row transactions over billions of records with hundreds of terabytes of data. OceanBase can be divided into four modules: the master server (RootServer), update server (UpdateServer), baseline data server (ChunkServer) and data merge server (MergeServer).

• RootServer It manages the metainformation of all servers in an OceanBase cluster, as well as data storage locations.
• UpdateServer It stores updated data in OceanBase. UpdateServer is the only node responsible for executing update requests such as DELETE or UPDATE SQL statements. Thus, there are no distributed transactions in OceanBase, because all update operations are processed on a single node.
• ChunkServer It stores OceanBase baseline data, which is also called static data.
• MergeServer It receives and parses SQL requests, and forwards them to the corresponding ChunkServers or the UpdateServer after lexical analysis, syntax analysis, query optimization and a series of other operations.

UpdateServer is a key component in OceanBase, and it has some characteristics which we utilize to implement our Paxos replication:

• UpdateServer can be seen as a main memory database, which stores updated data in a memtable.
• One transaction corresponds to exactly one commit log, which is generated when the transaction is finishing.
• Log records are stored on disk contiguously. Therefore, there are no holes in log files.

OceanBase can be configured with multiple clusters, e.g., one master cluster and one slave cluster. Only the master can receive write requests and process these transactions. When the master cluster breaks down, the whole system is unavailable to clients. For this reason, we have implemented Paxos replication, whose model is introduced in the next subsection.

2.2 Paxos Replication Model

Using Paxos to replicate log records is a popular choice for building a scalable, consistent and highly available database. Traditionally, systems adopting Paxos replication have two main phases: Leader election and log replication. The servers participating in these phases are called Paxos members, which make up a Paxos group. For ease of description, we use member to refer to a member of the Paxos group.
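To make the election phase concrete before it is described in detail, here is a minimal sketch of the eligibility rule (the new Leader's last LSN must not be less than those of a majority of members). Integer LSNs and the member names are our own illustrative assumptions, not part of OceanBase:

```python
def pick_leader(last_lsns):
    """Elect a Leader whose last LSN is not less than those of a majority
    of members; last_lsns maps member name -> reported last LSN."""
    majority = len(last_lsns) // 2 + 1
    # A member is eligible if at least `majority` members (itself included)
    # report an LSN no greater than its own.
    eligible = [
        m for m, lsn in last_lsns.items()
        if sum(1 for other in last_lsns.values() if other <= lsn) >= majority
    ]
    # Among eligible members, prefer the most up-to-date one.
    return max(eligible, key=lambda m: last_lsns[m])
```

For example, with reported LSNs {"A": 130, "B": 120, "C": 110}, both A and B satisfy the majority rule and the more up-to-date member A is chosen.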

During the Leader election phase, there may be no Leader in the system. Assume that each member of the Paxos group takes part in the Leader election. Each member reports its local last log sequence number (LSN) to the election service. Note that the local last LSN may be composed of a log id, a log generation timestamp or an epoch number, which can be used to order the states of the members. The election service elects a Leader from the reporting members in consideration of their LSNs, i.e., the new Leader's LSN must not be less than those of a majority of members. When a majority of nodes learn the election result and succeed in registering with the new Leader, the election phase finishes successfully. It is noteworthy that the Leader maintains its authority by renewing its election lease periodically. If a majority of members observe that the Leader's lease has expired, the system enters the Leader election phase again.

During the log replication phase, each member has one of two replica statuses: Leader or Follower, owning the primary or a backup replica, respectively. As shown in Fig. 1, the Leader receives a write request from a client, generates a commit log, and replicates the log record to all Followers. When a Follower receives the log message from the Leader, it can take different actions according to the reliability strategy:

• Durability The Follower does not respond to the Leader until it ensures that the log record is persistent on local disk.
• Non-durability The Follower responds to the Leader immediately when it receives the log message and buffers the log record in memory.

The execution flow of point 1 in Fig. 1 shows the non-durability case. We note that the delay of a write request in this flow is smaller than in the flow of point 2, which shows the durability case.

[Fig. 1 Log replication model based on Paxos]

When the Leader gets a majority of acknowledgment responses, it can advance the local committed LSN, which is called the commit point. The Leader then flushes the commit point to disk and sends it to each Follower. Note that the commit point is an important piece of metadata in Paxos replication, used to guarantee data consistency for read operations and recovery. In other words, this metadata enables a Follower to provide clients with timeline-consistent services, and it simplifies the recovery processes that guarantee data consistency when the database recovers from a failure. However, the traditional synchronization mechanism for the commit point increases storage and network overhead. Therefore, we design and implement a low-overhead log replication adopting the durability strategy for OceanBase.

3 Related Work

ARIES [18] has been the de facto standard for transactional logging in traditional database systems. It provides a reference for the log model and recovery mechanism. Replication, whether eager or lazy [13], is an effective means to provide horizontal scalability, high availability and fault tolerance in distributed systems. Therefore, it is common for distributed database systems to replicate the transactional log from a primary node to one or multiple backup replicas. As Internet applications develop rapidly, an increasingly large number of databases leverage NoSQL techniques, which provide scalability and high availability through the use of replication. Dynamo [11], developed by Amazon, is a highly available key-value datastore. Its replication relies on the NWR strategy, which permits clients to decide how to balance availability against consistency. Facebook's Cassandra [6, 14] adopts the ideas of Dynamo and adds optimized mechanisms such as load balancing; it has since become an open source distributed database management system at Apache. Yahoo's PNUTS [8] is a scalable datastore focused on cross-datacenter replication. Although these systems offer good performance and high availability, they provide only eventual consistency.

Paxos is a consensus protocol for solving the consistency problem in distributed systems. It is described by Lamport in [15]. Multi-Paxos, introduced in [16], is an important protocol for Paxos replication, and more variants are introduced by him in [17]. Using Paxos for replication is a common choice for implementing scalable, consistent and highly available datastores.

Chubby [5] is Google's service aiming to provide a coarse-grained lock service for loosely coupled distributed systems; ZooKeeper [26] is its open source counterpart. Google has developed MegaStore [1], Spanner [10] and F1 [24], and these database systems use Paxos for log replication. Megastore is a storage system providing strong consistency. Spanner is a scalable, multi-version, globally distributed, synchronously replicated database. F1 provides the functionality of a SQL database. Raft [20] is a consensus algorithm for RAMCloud [21]; it is designed to be easy to understand and is equivalent to Paxos. Rao et al. [23] introduce a relatively complete technical solution for building a datastore using Paxos. Patterson et al. [22] analyze and discuss replication based on Paxos.

In recent years, with the rapid development of new hardware, e.g., SSD (solid-state drive), NVM (non-volatile memory) and RDMA (remote direct memory access), new log replication and recovery mechanisms have emerged. Dragojević et al. [12] leverage NVM and RDMA to build a highly scalable and available computing platform without sacrificing strong consistency. Tango [2, 3] and Hyder [4] use a log-sharing architecture—based on SSD and high-speed networks—to ensure reliability and availability.

4 Low-Overhead Replication Protocol

This section describes the log replication protocol based on the piggybacking method, which reduces disk and network IO. To simplify the discussion, we adopt three-way replication. More precisely, the Leader—the primary node—replicates log records to two Followers owning the backup replicas. Recall the architecture of OceanBase described in Sect. 2.1. For a simple explanation, we treat MergeServers and ChunkServers as the clients which forward write requests, and UpdateServers and RootServers as the Paxos members in the log replication model mentioned above, responsible for log replication and Leader election, respectively.

4.1 Commit Log Entry

During the Leader's term, the primary replica is the only one which can receive and process write operations. When a client issues a write request, it first acquires the information about which UpdateServer is the Leader, and then sends the write to the Leader node. When the Leader receives a write request, it obtains the corresponding transaction, processes the operation and pre-applies the results to the local memory table. Once the transaction enters the commit phase, a log entry—containing a unique LSN, the transaction ID and the log data (the final results produced by the transaction)—is generated.

Recall from Sect. 2.2 that the commit points need to be replicated and persisted on the Paxos members, which can be achieved through the synchronization of log records. Therefore, to reduce the cost of handling commit points, the committed LSN can be embedded in each commit log record. The format of log entries carrying the committed LSN is shown in Fig. 2. Note that the commit log entry with LSN 101 holds a committed LSN of 99, which indicates that the logs whose LSN is not greater than 99 can be applied to the memtable.

Clearly, embedding the commit point into the log increases the size of the log. However, since the size of one LSN is only a few bytes, the extra information does not take up much space. The group commit mechanism described below further reduces the impact of embedding this metadata.

[Fig. 2 An example of commit log entries with committed LSN: the entry with LSN 101 contains TransID 1478793601000, CommittedLSN 99 and LogData x='a']

4.2 Log Replication

After generating the commit log, the Leader sends the log record to all Followers through an asynchronous network function, which does not block the single commit thread responsible for the synchronization and persistence of log records. In other words, the Leader is able to flush the commit log to local disk without waiting for the responses from the Followers.

When a Follower receives a log message from the Leader, it extracts the committed LSN from the received log record and compares it with the locally cached committed LSN. If the local value is less than the new committed LSN, it is updated with the new one; otherwise, the value is not refreshed. Then the Follower appends the commit log to the end of the log file on local disk. Once the append operation has finished, or a certain period of time has elapsed, the Follower obtains the maximum flushed LSN, which represents the latest state of the commit log on local disk.
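The entry layout of Fig. 2 can be pictured as a small fixed header followed by the log data. The encoding below is an illustrative sketch only (the actual OceanBase log format differs); the field names follow Fig. 2:

```python
import struct

# Illustrative on-disk layout: three little-endian 64-bit fields
# (LSN, TransID, CommittedLSN) followed by the raw log data.
HEADER = struct.Struct("<QQQ")

def encode_entry(lsn, trans_id, committed_lsn, log_data):
    """Piggyback the current committed LSN into the commit log entry."""
    return HEADER.pack(lsn, trans_id, committed_lsn) + log_data

def decode_entry(buf):
    lsn, trans_id, committed_lsn = HEADER.unpack_from(buf)
    return lsn, trans_id, committed_lsn, buf[HEADER.size:]

# The example of Fig. 2: the entry with LSN 101 carries committed LSN 99,
# telling a Follower that all logs with LSN <= 99 may be applied.
entry = encode_entry(101, 1478793601000, 99, b"x='a'")
```

Because the piggybacked CommittedLSN is a single fixed-width field, the per-entry overhead is constant regardless of how many transactions the commit point covers.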

It then sends a response message containing the maximum flushed LSN to the Leader.

The value of the committed LSN is checked periodically by the log replay threads on the Follower node. When the value changes, these threads fetch and replay the persistent log that has not yet been replayed locally. More precisely, the log entries whose LSNs are not greater than the committed LSN are applied to the local memory table in LSN order. If the Follower cannot find the corresponding log records in the log file, it requests the missing ones from the Leader by itself.

In order to compute the commit point, the Leader has to store the flushed LSNs of all the Followers. When receiving a response message from a Follower, the Leader extracts the flushed LSN from the message and compares it with the locally cached value for this Follower. If the new flushed LSN is greater than the local one, the Leader replaces the local value; otherwise, the value is not changed. Based on all the Followers' flushed LSNs cached locally and the majority strategy, the Leader calculates a new committed LSN, which indicates a certain state of the database. If the new value is greater than the previous one cached in local memory, the local committed LSN is updated with the new value and embedded in the next commit log entry. Finally, the Leader commits the corresponding transactions in accordance with the locally cached commit point and returns the results to the clients.

Figure 3 shows an example of the Paxos members' information stored in the Leader, which is used to compute the committed LSN. The log records whose LSN is not greater than 120 are durable on a majority of members. Therefore, the Leader can update the committed LSN to 120, commit the transactions whose LSNs are not greater than 120, and embed the value 120 in the next commit log entry. When a Follower receives the new commit point, it applies the logs to the local memtable in LSN order up to 120.

Note that the members' information is stored only in the Leader's memory. Therefore, a new Leader needs to request the flushed LSNs from the Followers when it starts to take over the replication. And if a Follower fails, the primary node needs to clear the failed node's information from the table.

[Fig. 3 An example of Paxos members' information used to generate the committed LSN: flushed LSNs are Leader 130, Follower1 120, Follower2 110; the log entries whose LSNs are not greater than committed_LSN = 120 are flushed on a majority of Paxos members]

4.3 Further Discussion

The main procedure of the log replication protocol has been introduced above. However, the synchronization triggered by each individual commit log entry can produce massive disk and network IO. Therefore, to further reduce the overhead of log replication, it is common for databases to adopt a group commit mechanism, which treats a group of log records as one commit unit.

When the Leader generates a log record for a transaction, it caches the log record in a local buffer. Once the size of the log in the buffer reaches the maximum capacity, or the time since the last successful commit exceeds a configured value commit_interval, the Leader packages the buffered log entries, sends the package to each Follower asynchronously, and then appends all of them to the log file on disk. We can therefore generate a single special commit log entry containing the committed LSN at the end of each group, which reduces the space overhead in the log.

However, it is difficult to configure commit_interval, since the hardware may differ across production environments and the log replication processing time on the Followers should also be considered. Therefore, we designed an adaptive group commit mechanism which takes into account the log persistence time on each replica node. Let persistence_time(i) denote the latest time taken to flush a group of logs on node i. When the Leader receives persistence_time(i), it recalculates commit_interval as follows:

    commit_interval = (commit_interval + persistence_time(i)) / 2        (1)

The commit_interval is initially set to a value configured by the administrator and is automatically adjusted toward a relatively stable value—suitable for the database platform—through this simple adaptive method. If a replica node i encounters an abnormal condition, the Leader catches the exception and does not consider the value persistence_time(i).

5 Recovery

In this section, we describe how a Paxos member recovers from a failure. Failures are a common phenomenon in distributed systems, e.g., power failures, administrator mistakes, software or hardware errors and so on. Therefore, we need a recovery mechanism to ensure the correctness and consistency of the database.
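Before turning to recovery, the majority-based commit-point calculation illustrated in Fig. 3 and the adaptive interval of Eq. (1) can be summarized in a short sketch. The function and member names are ours, for illustration only:

```python
def compute_commit_point(flushed_lsns):
    """A log entry is committed once it is durable on a majority of members:
    sort the flushed LSNs in descending order and take the value at the
    majority position (Fig. 3: [130, 120, 110] -> 120)."""
    lsns = sorted(flushed_lsns.values(), reverse=True)
    majority = len(lsns) // 2 + 1
    return lsns[majority - 1]

def adapt_commit_interval(commit_interval, persistence_time):
    """Adaptive group commit, Eq. (1): average the current interval with
    the latest group-flush time reported by a replica node."""
    return (commit_interval + persistence_time) / 2
```

With the flushed LSNs of Fig. 3, compute_commit_point({"Leader": 130, "Follower1": 120, "Follower2": 110}) yields 120.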

Recall from Sect. 2.2 that a system adopting Paxos replication has two phases. Each Paxos member therefore needs to periodically check the system status, which can be represented by a local variable. The status has two states, DURING_ELECTION and AFTER_ELECTION, which indicate whether there exists a Leader in the system. If the status is AFTER_ELECTION, the system is in the log replication phase. When a member is restarting, its election role is definitely determined, so it can take predetermined actions in accordance with that role. If the system is in the DURING_ELECTION state, the Paxos group is in the Leader election phase; the recovering member needs to take part in the election, and only when a new Leader has been elected can the restarting member continue to recover.

5.1 Leader Recovery

When a recovering member finds that the system status is AFTER_ELECTION and its election role is Leader, it has to take several steps to ensure the consistency of the database. In other words, the new Leader cannot service requests from clients until it guarantees that its local log records are persisted on a majority of the members. Firstly, it scans the log file on disk, obtains the local last LSN and the maximum committed LSN from the file, and caches them in local variables. Then the Leader starts threads to replay the whole local log from the checkpoint; at this point, the Leader cannot service any requests from clients. Next, it appends a special commit log entry which contains only the maximum committed LSN, and replicates this record to the other members. Finally, the Leader receives the responses of the Followers and updates the corresponding server information. When the primary replica detects that the committed LSN is not less than the previously cached local last LSN, it can provide services to clients.

A Leader adopting the above recovery procedure guarantees the consistency of the database. Its main steps are summarized in Procedure 1. Once the Leader has taken over the log replication completely, a client reading data from the primary replica is provided with strong consistency. In some cases, we prefer high availability over consistency. In that case, when the Leader starts the replay task, it can service read requests from clients as soon as it has replayed the local log up to the committed LSN; the remainder of the local log is applied as the committed LSN advances.

5.2 Follower Recovery

When a restarting member finds that the system status is AFTER_ELECTION and its election role is Follower, it has to ensure that the state of its local data is consistent with the Leader. Since the Follower cannot judge by itself whether the log records whose LSNs are greater than the committed LSN should be applied to the local memory table, it must obtain the necessary information from the Leader. In order to reduce the network overhead, we implement the following recovery mechanism. To begin with, the Follower scans the log file on disk to update local variables, e.g., the local last LSN and the committed LSN. As described above, the committed LSN is the maximum committed LSN stored in the log file. Then it starts to replay the local log records whose LSNs are not greater than the committed LSN, and it discards the remaining log records. At the same time, the Follower reports its committed LSN to the Leader. When the Leader receives this message, it sends the log records after that LSN to the Follower. Finally, the Follower receives the new log records and refreshes its committed LSN, which triggers it to continue replaying the log.

Note that if the Leader role switches frequently among different members, log records of committed transactions could be lost. To prevent this, the Follower must not discard the log records beyond the committed LSN until it ensures that the log records received from the Leader are contiguous from the committed LSN and their LSNs exceed the local last LSN. In other words, the Follower buffers the new log entries until they cover the LSN range (local_committed_LSN, local_last_LSN], and then atomically replaces the corresponding log entries on disk. The main steps are illustrated in Procedure 2.
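The safety condition above, that a recovering Follower may only replace its local log tail once the entries received from the Leader are contiguous from the committed LSN and cover the range up to the local last LSN, can be sketched as follows. This is our reading of the condition, assuming consecutive integer LSNs (which holds here since the log files contain no holes):

```python
def can_replace_tail(committed_lsn, local_last_lsn, received_lsns):
    """Return True once the entries buffered from the Leader cover the whole
    range (committed_lsn, local_last_lsn], so the local tail can be replaced
    atomically without losing committed transactions."""
    expected = committed_lsn + 1
    for lsn in sorted(received_lsns):
        if lsn != expected:
            break  # a hole: keep buffering
        expected += 1
    # `expected - 1` is the highest contiguous LSN received so far.
    return expected - 1 >= local_last_lsn
```

For example, with committed LSN 99 and local last LSN 102, the buffer [100, 101, 102] is sufficient, while [100, 102] is not because LSN 101 is missing.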

After the Follower has applied the commit log entries—those whose LSNs are not greater than the local committed LSN—to the memory table, it can provide clients with weakly consistent services, such as timeline or snapshot consistency. A client therefore forwards a read request to a Paxos member in accordance with its consistency requirement.

6 Experiments

This section evaluates the performance of three different implementations of commit point synchronization, i.e., the piggybacking method, the synchronization method and the asynchronous method, all implemented in the open source database system OceanBase 0.4.2:

• Piggybacking method (PIGGY) This is the implementation described in this work. In order to reduce disk and network overhead, we append a special commit log entry containing the committed LSN to the end of each log group.
• Synchronization method (SYNC) This method differs from PIGGY. When the Leader detects that the committed LSN has changed, it calls the Linux interface fsync() to flush the committed LSN to disk and calls a method which sends the commit point to the Followers asynchronously.
• Asynchronous method (ASYNC) This method differs from the above two. The Leader starts a background thread which flushes and sends the committed LSN periodically.

All the methods adopt the group commit technique introduced in Sect. 4.3.

6.1 Experimental Setup

This subsection describes the cluster platform and the deployment of the database, and gives a brief overview of the benchmark.

Cluster Platform We ran the experiments on a cluster of 18 machines. The software and hardware setup of each server is shown in Table 1. Note that the write latency of the SSD is about hundreds of microseconds.

Table 1 Experimental setup
CPU: E5606 @ 2.13 GHz x 2
CPU cores: 8 (Hyper-threading disabled)
Memory: 16 GB PC3L-12800R x 6
Disk: 100 GB SSD x 1
Network: Gigabit Ethernet
Operating system: CentOS 6.5

Database Deployment The database system is configured with three-way replication. More precisely, we start three OceanBase clusters, and each instance—which contains a Paxos member (RootServer and UpdateServer) running on one node and 5 clients (MergeServer and ChunkServer) on the others—is deployed on 6 servers.

Benchmarks We adopted YCSB [9]—a popular key-value benchmark from Yahoo—to evaluate the log replication performance of the three methods. We used workloads containing heavy replace operations with a read/write ratio of 0/100. Since the MergeServer is the external interface of the database system, the YCSB application first connects to a MergeServer and then executes replace auto-commit transactions repeatedly. We observe the results reported by YCSB after the execution and the statistics of the system during the execution. The database is preloaded with 10 million records, and the size of each record is about 100 bytes.

6.2 Log Replication Performance

We ran experiments to benchmark PIGGY against the other methods using YCSB. We focused on performance in terms of TPS, IOPS, write throughput and Follower receiving. TPS refers to the number of transactions performed by the system per second. IOPS denotes the number of write requests issued to disk per second. Write throughput and Follower receiving represent, respectively, the volume of data flushed to disk on the Leader and the number of messages received by a Follower over a period of time. The experimental results are illustrated in Figs. 4, 5, 6 and 7.

We first compared PIGGY with the other methods in terms of TPS, which reflects the performance of transaction processing while executing various workloads.

[Fig. 4 Throughput of transactions]

Fig. 5 Write requests over the disk

Fig. 6 Write throughput over the disk

Fig. 7 Message throughput in one follower

Figure 4 shows the transaction throughput of the three methods as the workload (number of clients) was scaled up. Note that PIGGY performed better than the other methods under every workload, which demonstrates the effectiveness of our method. Since PIGGY imposes no additional load on the disk, it improved the throughput of write transactions by at least 1.39x when the number of clients was 1400. SYNC and ASYNC had nearly the same TPS, because the latency of flushing the committed LSN is less than 2 ms, which is much smaller than a common transaction response delay.

We evaluated IOPS by using the Linux tool iostat, which can monitor the number of write requests issued to the specified device per second. Comparing the three methods, Fig. 5 shows that the piggybacking method had the lowest number of write requests per second, because it does not need to store the commit point, which would incur additional IO and thus increase the disk overhead. The ASYNC method had the highest number of write requests, because its single background thread flushed the commit point every 10 ms, which caused more requests than SYNC, which persisted the committed LSN to local disk only after a group of transactional log entries had committed. The curves of all the methods show a decreasing trend as the number of connections increases; we note that the commit interval described in Sect. 4.3 becomes larger as the number of clients increases.

Figure 6 shows the write throughput over the disk for the three methods. In this case, we also used the iostat command to measure performance. Note that PIGGY has the highest write throughput, while ASYNC and SYNC had similar results. The curves of these mechanisms follow the same trend as the ones in Fig. 4, which indicates that the transaction throughput determined the volume of data written to disk and that flushing the committed LSN did not conspicuously increase the amount of data written. Although PIGGY adopts a special log to record the commit point, it is more efficient than the other methods.

We also compared PIGGY with the other methods in terms of Follower receiving, by monitoring the number of packets received by a Follower within one second during normal processing. Figure 7 shows the number of messages received per second as the workload (number of clients) was scaled up. The ASYNC method had the highest rate, since a background thread in the Leader sent the commit point frequently. As the number of client connections increased, all three curves decreased gradually; the reason for the decline is the same as that discussed for Fig. 5.
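As a concrete illustration of why the three strategies differ in write requests, the following Python sketch models one second of group commits and counts the disk write operations each method issues. This is a hypothetical back-of-the-envelope model, not the OceanBase implementation; the group-commit rate is an assumed parameter, and only the 10 ms ASYNC flush period is taken from the description above.

```python
# Hypothetical model of disk write requests per second for the three
# commit-point strategies (PIGGY, SYNC, ASYNC). Illustrative only.

def write_requests_per_sec(groups_per_sec: int, async_flush_ms: int = 10) -> dict:
    """Count disk write requests issued in one second.

    groups_per_sec: number of group commits per second (each group commit
                    flushes one batch of log entries = one write request).
    async_flush_ms: period of ASYNC's background commit-point flusher.
    """
    log_writes = groups_per_sec  # every method writes the grouped log once

    return {
        # PIGGY: the commit point is embedded in the next commit log
        # record, so no extra write request is needed.
        "PIGGY": log_writes,
        # SYNC: after each group of log entries commits, the committed
        # LSN is persisted with one additional write.
        "SYNC": log_writes + groups_per_sec,
        # ASYNC: a background thread flushes the commit point every
        # async_flush_ms milliseconds, independent of the commit rate.
        "ASYNC": log_writes + 1000 // async_flush_ms,
    }

counts = write_requests_per_sec(groups_per_sec=50)
print(counts)  # PIGGY: 50, SYNC: 100, ASYNC: 150
```

Under this assumed workload the ordering matches Fig. 5: ASYNC highest, PIGGY lowest. The model also suggests why the curves fall as clients increase: a longer commit interval means fewer group commits per second, and hence fewer log writes for every method.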

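The Follower message rates of Fig. 7 can be approximated with a similarly hypothetical sketch: SYNC's extra commit-point flush lengthens the turnaround of each group (the paper reports under 2 ms), while ASYNC's background sender adds commit-point packets at a fixed rate. All parameter values below are assumptions for illustration; the sketch deliberately does not model the crossover above 1200 clients, which the paper attributes to PIGGY forming larger commit groups.

```python
# Hypothetical sketch of the message rate seen by one Follower (cf. Fig. 7).
# Parameter values are illustrative assumptions, not measured numbers.

def follower_packets_per_sec(group_interval_ms: float,
                             sync_flush_ms: float = 2.0,
                             async_send_ms: float = 10.0) -> dict:
    """Packets per second a Follower receives under each strategy.

    group_interval_ms: time to form and replicate one group of log entries.
    sync_flush_ms:     extra time SYNC spends persisting the committed LSN
                       after each group commits.
    async_send_ms:     period of ASYNC's background commit-point sender.
    """
    piggy_groups = 1000.0 / group_interval_ms               # one packet per group
    sync_groups = 1000.0 / (group_interval_ms + sync_flush_ms)
    return {
        "PIGGY": piggy_groups,                              # commit point rides along
        "SYNC": sync_groups,                                # slower group turnaround
        "ASYNC": piggy_groups + 1000.0 / async_send_ms,     # extra commit-point pkgs
    }

rates = follower_packets_per_sec(group_interval_ms=5.0)
print(rates)  # SYNC lowest, ASYNC highest
```

With short commit intervals (few clients) this reproduces the left side of Fig. 7: SYNC receives the fewest packets and ASYNC the most.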
Since flushing the commit point in SYNC increases the processing time of a group of log entries, the SYNC method had the lowest Follower receiving rate when the number of clients was less than 1000. As the number of connections increased, PIGGY's group commit mechanism became more effective, so PIGGY had the lowest Follower receiving rate when the number of connections exceeded 1200.

From the above experimental results, we can conclude that persisting the committed LSN, which is used to improve the availability of the system, leads to considerably more disk overhead and thus reduces IO capacity, a precious resource for replicating and persisting the transactional log. Therefore, PIGGY performs better than the other approaches. Moreover, SYNC can provide transaction throughput similar to ASYNC.

7 Conclusion

Log replication based on Paxos can provide database systems with scalability, consistency, and high availability. This paper described an implementation of Paxos replication for OceanBase, which is scalable and has a main-memory transactional engine. Unlike traditional implementations, our method takes into account the overhead of storage and network, both of which have a significant impact on performance.

We find that synchronizing the committed LSN used for timeline consistency can increase the overhead of the system. Therefore, we make use of a piggybacking technique to implement log replication and database recovery. Compared to the synchronization mechanism, our method improves the throughput of update operations by 1.39x.

Acknowledgements This work is partially supported by National High-tech R&D Program (863 Program) under Grant Number 2015AA015307 and National Science Foundation of China under Grant Number 61332006.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References

1. Baker J, Bond C, Corbett JC et al (2011) Megastore: providing scalable, highly available storage for interactive services. In: Fifth biennial conference on innovative data systems research, pp 223–234
2. Balakrishnan M, Malkhi D, Davis JD et al (2013) CORFU: a distributed shared log. ACM Trans Comput Syst 31(4):10
3. Balakrishnan M, Malkhi D, Wobber T et al (2013) Tango: distributed data structures over a shared log. In: Proceedings of the twenty-fourth ACM symposium on operating systems principles, pp 325–340
4. Bernstein PA, Reid CW, Das S (2011) Hyder—a transactional record manager for shared flash. In: Fifth biennial conference on innovative data systems research, pp 9–20
5. Burrows M (2006) The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on operating systems design and implementation, pp 335–350
6. Cassandra website. http://cassandra.apache.org/
7. Cattell R (2011) Scalable SQL and NoSQL data stores. SIGMOD Rec 39(4):12–27
8. Cooper BF, Ramakrishnan R, Srivastava U et al (2008) PNUTS: Yahoo!'s hosted data serving platform. Proc VLDB Endow 1(2):1277–1288
9. Cooper BF, Silberstein A, Tam E, Ramakrishnan R, Sears R (2010) Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM symposium on cloud computing, pp 143–154
10. Corbett JC, Dean J, Epstein M et al (2013) Spanner: Google's globally distributed database. ACM Trans Comput Syst 31(3):8
11. DeCandia G, Hastorun D, Jampani M et al (2007) Dynamo: Amazon's highly available key-value store. In: Proceedings of twenty-first ACM SIGOPS symposium on operating systems principles, pp 205–220
12. Dragojević A, Narayanan D, Nightingale EB et al (2015) No compromises: distributed transactions with consistency, availability, and performance. In: Proceedings of the 25th symposium on operating systems principles, pp 54–70
13. Gray J, Helland P, O'Neil P, Shasha D (1996) The dangers of replication and a solution. SIGMOD Rec 25(2):173–182
14. Lakshman A, Malik P (2010) Cassandra: a decentralized structured storage system. ACM SIGOPS Oper Syst Rev 44(2):35–40
15. Lamport L (1998) The part-time parliament. ACM Trans Comput Syst 16(2):133–169
16. Lamport L (2001) Paxos made simple. ACM SIGACT News 32(4):18–25
17. Lamport L (2006) Fast Paxos. Distrib Comput 19(2):79–103
18. Mohan C, Haderle D, Lindsay B, Pirahesh H, Schwarz P (1992) ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Trans Database Syst 17(1):94–162
19. OceanBase website. https://github.com/alibaba/oceanbase/
20. Ongaro D, Ousterhout JK (2014) In search of an understandable consensus algorithm. In: Proceedings of the 2014 USENIX annual technical conference, pp 305–319
21. Ousterhout J, Agrawal P, Erickson D et al (2010) The case for RAMClouds: scalable high-performance storage entirely in DRAM. SIGOPS Oper Syst Rev 43(4):92–105
22. Patterson S, Elmore AJ, Nawab F, Agrawal D, El Abbadi A (2012) Serializability, not serial: concurrency control and availability in multi-datacenter datastores. Proc VLDB Endow 5(11):1459–1470
23. Rao J, Shekita EJ, Tata S (2011) Using Paxos to build a scalable, consistent, and highly available datastore. Proc VLDB Endow 4(4):243–254
24. Shute J, Vingralek R, Samwel B et al (2013) F1: a distributed SQL database that scales. Proc VLDB Endow 6(11):1068–1079
25. Thomson A, Diamond T, Weng SC et al (2012) Calvin: fast distributed transactions for partitioned database systems. In: Proceedings of the 2012 ACM SIGMOD international conference on management of data, pp 1–12
26. ZooKeeper website. http://zookeeper.apache.org/