Consensus Algorithm
In Search of an Understandable Consensus Algorithm (Extended Version)

Diego Ongaro and John Ousterhout
Stanford University

(This tech report is an extended version of [32]; additional material is noted with a gray bar in the margin. Published May 20, 2014.)

Abstract

Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.

1 Introduction

Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, they play a key role in building reliable large-scale software systems. Paxos [15, 16] has dominated the discussion of consensus algorithms over the last decade: most implementations of consensus are based on Paxos or influenced by it, and Paxos has become the primary vehicle used to teach students about consensus.

Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture requires complex changes to support practical systems. As a result, both system builders and students struggle with Paxos.

After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could provide a better foundation for system building and education. Our approach was unusual in that our primary goal was understandability: could we define a consensus algorithm for practical systems and describe it in a way that is significantly easier to learn than Paxos? Furthermore, we wanted the algorithm to facilitate the development of intuitions that are essential for system builders. It was important not just for the algorithm to work, but for it to be obvious why it works.

The result of this work is a consensus algorithm called Raft. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety) and state space reduction (relative to Paxos, Raft reduces the degree of nondeterminism and the ways servers can be inconsistent with each other). A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos.

Raft is similar in many ways to existing consensus algorithms (most notably, Oki and Liskov's Viewstamped Replication [29, 22]), but it has several novel features:

• Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand.

• Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly.

• Membership changes: Raft's mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.

We believe that Raft is superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for implementation. It is simpler and more understandable than other algorithms; it is described completely enough to meet the needs of a practical system; it has several open-source implementations and is used by several companies; its safety properties have been formally specified and proven; and its efficiency is comparable to other algorithms.

The remainder of the paper introduces the replicated state machine problem (Section 2), discusses the strengths and weaknesses of Paxos (Section 3), describes our general approach to understandability (Section 4), presents the Raft consensus algorithm (Sections 5–8), evaluates Raft (Section 9), and discusses related work (Section 10).

2 Replicated state machines

Consensus algorithms typically arise in the context of replicated state machines [37]. In this approach, state machines on a collection of servers compute identical copies of the same state and can continue operating even if some of the servers are down.
Figure 1: Replicated state machine architecture. The consensus algorithm manages a replicated log containing state machine commands from clients. The state machines process identical sequences of commands from the logs, so they produce the same outputs.

Replicated state machines are used to solve a variety of fault tolerance problems in distributed systems. For example, large-scale systems that have a single cluster leader, such as GFS [8], HDFS [38], and RAMCloud [33], typically use a separate replicated state machine to manage leader election and store configuration information that must survive leader crashes. Examples of replicated state machines include Chubby [2] and ZooKeeper [11].

Replicated state machines are typically implemented using a replicated log, as shown in Figure 1. Each server stores a log containing a series of commands, which its state machine executes in order. Each log contains the same commands in the same order, so each state machine processes the same sequence of commands. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.

Keeping the replicated log consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and adds them to its log. It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server's state machine processes them in log order, and the outputs are returned to clients. As a result, the servers appear to form a single, highly reliable state machine.

Consensus algorithms for practical systems typically have the following properties:

• They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.

• They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.

• They do not depend on timing to ensure the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.

• In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not impact overall system performance.
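The replicated state machine model above maps naturally onto a small piece of code. The following Go sketch is illustrative only (the package name, the type names, and the applyCommitted helper are assumptions, not part of the paper): it shows a deterministic state machine being fed committed log entries in log order, which is what makes the replicas appear as a single reliable machine. Later sketches in this document reuse the same hypothetical raftsketch package and its types.

```go
package raftsketch

// Command is an opaque, client-supplied state machine command.
type Command []byte

// StateMachine is any deterministic application (for example, a key-value store).
type StateMachine interface {
	// Apply executes one command and returns its output. Given the same
	// sequence of commands, every replica produces the same outputs.
	Apply(cmd Command) []byte
}

// Entry is one slot in the replicated log.
type Entry struct {
	Term    int     // term in which the entry was created by a leader
	Command Command // command to apply to the state machine
}

// applyCommitted feeds newly committed entries to the state machine in log
// order; Raft log indices are 1-based, so entry i lives at slice position i-1.
func applyCommitted(sm StateMachine, log []Entry, lastApplied, commitIndex int) int {
	for lastApplied < commitIndex {
		lastApplied++
		sm.Apply(log[lastApplied-1].Command)
	}
	return lastApplied
}
```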
3 What's wrong with Paxos?

Over the last ten years, Leslie Lamport's Paxos protocol [15] has become almost synonymous with consensus: it is the protocol most commonly taught in courses, and most implementations of consensus use it as a starting point. Paxos first defines a protocol capable of reaching agreement on a single decision, such as a single replicated log entry. We refer to this subset as single-decree Paxos. Paxos then combines multiple instances of this protocol to facilitate a series of decisions such as a log (multi-Paxos). Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.

Unfortunately, Paxos has two significant drawbacks. The first drawback is that Paxos is exceptionally difficult to understand. The full explanation [15] is notoriously opaque; few people succeed in understanding it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms [16, 20, 21]. These explanations focus on the single-decree subset, yet they are still challenging. In an informal survey of attendees at NSDI 2012, we found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative protocol, a process that took almost a year.

We hypothesize that Paxos' opaqueness derives from its choice of the single-decree subset as its foundation. Single-decree Paxos is dense and subtle: it is divided into two stages that do not have simple intuitive explanations and cannot be understood independently. Because of this, it is difficult to develop intuitions about why the single-decree protocol works. The composition rules for multi-Paxos add significant additional complexity and subtlety. We believe that the overall problem of reaching consensus on multiple decisions (i.e., a log instead of a single entry) can be decomposed in other ways that are more direct and obvious.

The second problem with Paxos is that it does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-Paxos. Lamport's descriptions are mostly about single-decree Paxos; he sketched possible approaches to multi-Paxos, but many details are missing. There have been several attempts to flesh out and optimize Paxos, such as [26], [39], and [13], but these differ from each other and from Lamport's sketches. Systems such as Chubby [4] have implemented Paxos-like algorithms, but in most cases their details have not been published.

Furthermore, the Paxos architecture is a poor one for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit to choosing a collection of log entries independently and then melding them into a sequential log; this just adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order. Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (though it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision will be made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to first elect a leader, then have the leader coordinate the decisions.

As a result, practical systems bear little resemblance to Paxos. Each implementation begins with Paxos, discovers the difficulties in implementing it, and then develops a significantly different architecture. This is time-consuming and error-prone, and the difficulties of understanding Paxos exacerbate the problem. Paxos' formulation may be a good one for proving theorems about its correctness, but real implementations are so different from Paxos that the proofs have little value. The following comment from the Chubby implementers is typical:

    There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. . . . the final system will be based on an unproven protocol [4].

Because of these problems, we concluded that Paxos does not provide a good foundation either for system building or for education. Given the importance of consensus in large-scale software systems, we decided to see if we could design an alternative consensus algorithm with better properties than Paxos. Raft is the result of that experiment.

4 Designing for understandability

We had several goals in designing Raft: it must provide a complete and practical foundation for system building, so that it significantly reduces the amount of design work required of developers; it must be safe under all conditions and available under typical operating conditions; and it must be efficient for common operations. But our most important goal, and most difficult challenge, was understandability. It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.

There were numerous points in the design of Raft where we had to choose among alternative approaches. In these situations we evaluated the alternatives based on understandability: how hard is it to explain each alternative (for example, how complex is its state space, and does it have subtle implications?), and how easy will it be for a reader to completely understand the approach and its implications?

We recognize that there is a high degree of subjectivity in such analysis; nonetheless, we used two techniques that are generally applicable. The first technique is the well-known approach of problem decomposition: wherever possible, we divided problems into separate pieces that could be solved, explained, and understood relatively independently. For example, in Raft we separated leader election, log replication, safety, and membership changes.

Our second approach was to simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible. Specifically, logs are not allowed to have holes, and Raft limits the ways in which logs can become inconsistent with each other. Although in most cases we tried to eliminate nondeterminism, there are some situations where nondeterminism actually improves understandability. In particular, randomized approaches introduce nondeterminism, but they tend to reduce the state space by handling all possible choices in a similar fashion ("choose any; it doesn't matter"). We used randomization to simplify the Raft leader election algorithm.

5 The Raft consensus algorithm

Raft is an algorithm for managing a replicated log of the form described in Section 2. Figure 2 summarizes the algorithm in condensed form for reference, and Figure 3 lists key properties of the algorithm; the elements of these figures are discussed piecewise over the rest of this section.

Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. Having a leader simplifies the management of the replicated log. For example, the leader can decide where to place new entries in the log without consulting other servers, and data flows in a simple fashion from the leader to other servers. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.

Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems, which are discussed in the subsections that follow:

• Leader election: a new leader must be chosen when an existing leader fails (Section 5.2).

• Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own (Section 5.3).

• Safety: the key safety property for Raft is the State Machine Safety Property in Figure 3: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index. Section 5.4 describes how Raft ensures this property; the solution involves an additional restriction on the election mechanism described in Section 5.2.

After presenting the consensus algorithm, this section discusses the issue of availability and the role of timing in the system.
Figure 2: A condensed summary of the Raft consensus algorithm (excluding membership changes and log compaction). The server behavior ("Rules for Servers") is described as a set of rules that trigger independently and repeatedly. Section numbers such as §5.2 indicate where particular features are discussed. A formal specification [31] describes the algorithm more precisely.

State

Persistent state on all servers (updated on stable storage before responding to RPCs):
• currentTerm: latest term server has seen (initialized to 0 on first boot, increases monotonically)
• votedFor: candidateId that received vote in current term (or null if none)
• log[]: log entries; each entry contains a command for the state machine, and the term when the entry was received by the leader (first index is 1)

Volatile state on all servers:
• commitIndex: index of highest log entry known to be committed (initialized to 0, increases monotonically)
• lastApplied: index of highest log entry applied to state machine (initialized to 0, increases monotonically)

Volatile state on leaders (reinitialized after election):
• nextIndex[]: for each server, index of the next log entry to send to that server (initialized to leader last log index + 1)
• matchIndex[]: for each server, index of highest log entry known to be replicated on that server (initialized to 0, increases monotonically)

AppendEntries RPC

Invoked by leader to replicate log entries (§5.3); also used as heartbeat (§5.2).
Arguments: term (leader's term); leaderId (so follower can redirect clients); prevLogIndex (index of log entry immediately preceding new ones); prevLogTerm (term of prevLogIndex entry); entries[] (log entries to store; empty for heartbeat; may send more than one for efficiency); leaderCommit (leader's commitIndex).
Results: term (currentTerm, for leader to update itself); success (true if follower contained an entry matching prevLogIndex and prevLogTerm).
Receiver implementation:
1. Reply false if term < currentTerm (§5.1)
2. Reply false if log doesn't contain an entry at prevLogIndex whose term matches prevLogTerm (§5.3)
3. If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it (§5.3)
4. Append any new entries not already in the log
5. If leaderCommit > commitIndex, set commitIndex = min(leaderCommit, index of last new entry)

RequestVote RPC

Invoked by candidates to gather votes (§5.2).
Arguments: term (candidate's term); candidateId (candidate requesting vote); lastLogIndex (index of candidate's last log entry, §5.4); lastLogTerm (term of candidate's last log entry, §5.4).
Results: term (currentTerm, for candidate to update itself); voteGranted (true means candidate received vote).
Receiver implementation:
1. Reply false if term < currentTerm (§5.1)
2. If votedFor is null or candidateId, and candidate's log is at least as up-to-date as receiver's log, grant vote (§5.2, §5.4)

Rules for Servers

All Servers:
• If commitIndex > lastApplied: increment lastApplied, apply log[lastApplied] to state machine (§5.3)
• If RPC request or response contains term T > currentTerm: set currentTerm = T, convert to follower (§5.1)

Followers (§5.2):
• Respond to RPCs from candidates and leaders
• If election timeout elapses without receiving AppendEntries RPC from current leader or granting vote to candidate: convert to candidate

Candidates (§5.2):
• On conversion to candidate, start election: increment currentTerm, vote for self, reset election timer, send RequestVote RPCs to all other servers
• If votes received from majority of servers: become leader
• If AppendEntries RPC received from new leader: convert to follower
• If election timeout elapses: start new election

Leaders:
• Upon election: send initial empty AppendEntries RPCs (heartbeat) to each server; repeat during idle periods to prevent election timeouts (§5.2)
• If command received from client: append entry to local log, respond after entry applied to state machine (§5.3)
• If last log index ≥ nextIndex for a follower: send AppendEntries RPC with log entries starting at nextIndex. If successful: update nextIndex and matchIndex for follower (§5.3). If AppendEntries fails because of log inconsistency: decrement nextIndex and retry (§5.3)
• If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm: set commitIndex = N (§5.3, §5.4)
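As a reading aid for Figure 2, the state and RPC messages can be transcribed almost mechanically into types. The Go sketch below does that for the same hypothetical raftsketch package; the field names follow the figure, while everything else (the struct names, the use of integer server ids, the map types) is an assumption.

```go
package raftsketch

// Persistent state on all servers (updated on stable storage before responding to RPCs).
type PersistentState struct {
	CurrentTerm int     // latest term this server has seen (0 on first boot)
	VotedFor    *int    // candidateId voted for in the current term, or nil
	Log         []Entry // log entries; first index is 1
}

// Volatile state on all servers.
type VolatileState struct {
	CommitIndex int // highest log entry known to be committed
	LastApplied int // highest log entry applied to the state machine
}

// Volatile state on leaders, reinitialized after election.
type LeaderState struct {
	NextIndex  map[int]int // for each server, next log index to send
	MatchIndex map[int]int // for each server, highest index known replicated
}

// AppendEntries RPC: invoked by the leader to replicate entries and as heartbeat.
type AppendEntriesArgs struct {
	Term         int     // leader's term
	LeaderId     int     // so followers can redirect clients
	PrevLogIndex int     // index of entry immediately preceding the new ones
	PrevLogTerm  int     // term of the PrevLogIndex entry
	Entries      []Entry // empty for heartbeat
	LeaderCommit int     // leader's commitIndex
}
type AppendEntriesReply struct {
	Term    int  // currentTerm, for the leader to update itself
	Success bool // true if the follower matched PrevLogIndex/PrevLogTerm
}

// RequestVote RPC: invoked by candidates to gather votes.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateId  int // candidate requesting the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}
type RequestVoteReply struct {
	Term        int  // currentTerm, for the candidate to update itself
	VoteGranted bool // true means the candidate received this server's vote
}
```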
Figure 3 lists the properties that Raft guarantees are true at all times; the section numbers indicate where each property is discussed:

• Election Safety: at most one leader can be elected in a given term. (§5.2)
• Leader Append-Only: a leader never overwrites or deletes entries in its log; it only appends new entries. (§5.3)
• Log Matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index. (§5.3)
• Leader Completeness: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms. (§5.4)
• State Machine Safety: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. (§5.4.3)

Figure 4: Server states. Followers only respond to requests from other servers. If a follower receives no communication, it becomes a candidate and initiates an election. A candidate that receives votes from a majority of the full cluster becomes the new leader. Leaders typically operate until they fail.

Figure 5: Time is divided into terms, and each term begins with an election. After a successful election, a single leader manages the cluster until the end of the term. Some elections fail, in which case the term ends without choosing a leader. The transitions between terms may be observed at different times on different servers.

5.1 Raft basics

A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader as described in Section 5.2. Figure 4 shows the states and their transitions; the transitions are discussed below.

Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader as described in Section 5.2. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.

Different servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even entire terms. Terms act as a logical clock [14] in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server's current term is smaller than the other's, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.

Raft servers communicate using remote procedure calls (RPCs), and the basic consensus algorithm requires only two types of RPCs. RequestVote RPCs are initiated by candidates during elections (Section 5.2), and AppendEntries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat (Section 5.3). Section 7 adds a third RPC for transferring snapshots between servers. Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.
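The role transitions and the term rule of Section 5.1 are small enough to sketch directly. Continuing the hypothetical raftsketch package, the snippet below shows the three server states and the "convert to follower on seeing a newer term" rule from Figure 2; locking and persistence to stable storage are deliberately omitted.

```go
package raftsketch

// Role is the server's current state: follower, candidate, or leader.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// Server bundles the state from Figure 2 with the server's current role.
type Server struct {
	PersistentState
	VolatileState
	role Role
}

// observeTerm implements the "All Servers" rule from Figure 2: on seeing a
// newer term in any RPC request or response, adopt it and revert to follower.
func (s *Server) observeTerm(rpcTerm int) {
	if rpcTerm > s.CurrentTerm {
		s.CurrentTerm = rpcTerm
		s.VotedFor = nil
		s.role = Follower
	}
}
```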
5.2 Leader election

Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.

To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens: (a) it wins the election, (b) another server establishes itself as leader, or (c) a period of time goes by with no winner. These outcomes are discussed separately in the paragraphs below.

A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes). The majority rule ensures that at most one candidate can win the election for a particular term (the Election Safety Property in Figure 3). Once a candidate wins an election, it becomes leader. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new elections.

While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader's term (included in its RPC) is at least as large as the candidate's current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate's current term, then the candidate rejects the RPC and continues in candidate state.

The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote RPCs. However, without extra measures split votes could repeat indefinitely.

Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (e.g., 150–300ms). This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends heartbeats before any other servers time out. The same mechanism is used to handle split votes. Each candidate restarts its randomized election timeout at the start of an election, and it waits for that timeout to elapse before starting the next election; this reduces the likelihood of another split vote in the new election. Section 9.3 shows that this approach elects a leader rapidly.

Elections are an example of how understandability guided our choice between design alternatives. Initially we planned to use a ranking system: each candidate was assigned a unique rank, which was used to select between competing candidates. If a candidate discovered another candidate with higher rank, it would return to follower state so that the higher ranking candidate could more easily win the next election. We found that this approach created subtle issues around availability (a lower-ranked server might need to time out and become a candidate again if a higher-ranked server fails, but if it does so too soon, it can reset progress towards electing a leader). We made adjustments to the algorithm several times, but after each adjustment new corner cases appeared. Eventually we concluded that the randomized retry approach is more obvious and understandable.
5.3 Log replication

Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.

Figure 6: Logs are composed of entries, which are numbered sequentially. Each entry contains the term in which it was created (the number in each box) and a command for the state machine. An entry is considered committed if it is safe for that entry to be applied to state machines.

Logs are organized as shown in Figure 6. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index identifying its position in the log.

The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers (e.g., entry 7 in Figure 6). This also commits all preceding entries in the leader's log, including entries created by previous leaders. Section 5.4 discusses some subtleties when applying this rule after leader changes, and it also shows that this definition of commitment is safe. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).

We designed the Raft log mechanism to maintain a high level of coherency between the logs on different servers. Not only does this simplify the system's behavior and make it more predictable, but it is an important component of ensuring safety. Raft maintains the following properties, which together constitute the Log Matching Property in Figure 3:

• If two entries in different logs have the same index and term, then they store the same command.

• If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower's log is identical to its own log up through the new entries.

During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. Figure 7 illustrates the ways in which followers' logs may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms.

Figure 7: When the leader at the top comes to power, it is possible that any of scenarios (a–f) could occur in follower logs. Each box represents one log entry; the number in the box is its term. A follower may be missing entries (a–b), may have extra uncommitted entries (c–d), or both (e–f). For example, scenario (f) could occur if that server was the leader for term 2, added several entries to its log, then crashed before committing any of them; it restarted quickly, became leader for term 3, and added a few more entries to its log; before any of the entries in either term 2 or term 3 were committed, the server crashed again and remained down for several terms.

In Raft, the leader handles inconsistencies by forcing the followers' logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader's log. Section 5.4 will show that this is safe when coupled with one more restriction.

To bring a follower's log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower's log after that point, and send the follower all of the leader's entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log (11 in Figure 7). If a follower's log is inconsistent with the leader's, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower's log and appends entries from the leader's log (if any). Once AppendEntries succeeds, the follower's log is consistent with the leader's, and it will remain that way for the rest of the term.
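The consistency check and conflict handling described above correspond to steps 1–5 of the AppendEntries receiver in Figure 2. The sketch below (same hypothetical raftsketch package, 1-based log indices, with persistence, locking, and election-timer resets omitted) is one way to express them.

```go
package raftsketch

// handleAppendEntries applies the receiver rules from Figure 2: reject a stale
// term or a failed prevLogIndex/prevLogTerm check, truncate conflicting
// entries, append the entries that are not already present, and advance
// commitIndex.
func (s *Server) handleAppendEntries(args AppendEntriesArgs) AppendEntriesReply {
	s.observeTerm(args.Term)
	// 1. Reply false if term < currentTerm.
	if args.Term < s.CurrentTerm {
		return AppendEntriesReply{Term: s.CurrentTerm, Success: false}
	}
	// 2. Reply false if the log has no entry at prevLogIndex with prevLogTerm.
	if args.PrevLogIndex > 0 {
		if len(s.Log) < args.PrevLogIndex ||
			s.Log[args.PrevLogIndex-1].Term != args.PrevLogTerm {
			return AppendEntriesReply{Term: s.CurrentTerm, Success: false}
		}
	}
	// 3–4. Delete conflicting entries, then append any entries not already in the log.
	for i, e := range args.Entries {
		idx := args.PrevLogIndex + 1 + i // log index this entry should occupy
		if idx <= len(s.Log) && s.Log[idx-1].Term != e.Term {
			s.Log = s.Log[:idx-1] // conflict: drop this entry and everything after it
		}
		if idx > len(s.Log) {
			s.Log = append(s.Log, e)
		}
	}
	// 5. Advance commitIndex, but never past the last new entry.
	if args.LeaderCommit > s.CommitIndex {
		s.CommitIndex = minInt(args.LeaderCommit, args.PrevLogIndex+len(args.Entries))
	}
	return AppendEntriesReply{Term: s.CurrentTerm, Success: true}
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}
```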
If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first index it stores for that term. With this information, the leader can decrement nextIndex to bypass all of the conflicting entries in that term; one AppendEntries RPC will be required for each term with conflicting entries, rather than one RPC per entry. In practice, we doubt this optimization is necessary, since failures happen infrequently and it is unlikely that there will be many inconsistent entries.

With this mechanism, a leader does not need to take any special actions to restore log consistency when it comes to power. It just begins normal operation, and the logs automatically converge in response to failures of the AppendEntries consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property in Figure 3).

This log replication mechanism exhibits the desirable consensus properties described in Section 2: Raft can accept, replicate, and apply new log entries as long as a majority of the servers are up; in the normal case a new entry can be replicated with a single round of RPCs to a majority of the cluster; and a single slow follower will not impact performance.

5.4 Safety

The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries, then it could be elected leader and overwrite these entries with new ones; as a result, different state machines might execute different command sequences.

This section completes the Raft algorithm by adding a restriction on which servers may be elected leader. The restriction ensures that the leader for any given term contains all of the entries committed in previous terms (the Leader Completeness Property from Figure 3). Given the election restriction, we then make the rules for commitment more precise. Finally, we present a proof sketch for the Leader Completeness Property and show how it leads to correct behavior of the replicated state machine.

5.4.1 Election restriction

In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. In some consensus algorithms, such as Viewstamped Replication [22], a leader can be elected even if it doesn't initially contain all of the committed entries. These algorithms contain additional mechanisms to identify the missing entries and transmit them to the new leader, either during the election process or shortly afterwards. Unfortunately, this results in considerable additional mechanism and complexity. Raft uses a simpler approach where it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their logs.

Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers. If the candidate's log is at least as up-to-date as any other log in that majority (where "up-to-date" is defined precisely below), then it will hold all the committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate's log, and the voter denies its vote if its own log is more up-to-date than that of the candidate.

Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.

Figure 8: A time sequence showing why a leader cannot determine commitment using log entries from older terms. In (a) S1 is leader and partially replicates the log entry at index 2. In (b) S1 crashes; S5 is elected leader for term 3 with votes from S3, S4, and itself, and accepts a different entry at log index 2. In (c) S5 crashes; S1 restarts, is elected leader, and continues replication. At this point, the log entry from term 2 has been replicated on a majority of the servers, but it is not committed. If S1 crashes as in (d), S5 could be elected leader (with votes from S2, S3, and S4) and overwrite the entry with its own entry from term 3. However, if S1 replicates an entry from its current term on a majority of the servers before crashing, as in (e), then this entry is committed (S5 cannot win an election). At this point all preceding entries in the log are committed as well.
5.4.2 Committing entries from previous terms

As described in Section 5.3, a leader knows that an entry from its current term is committed once that entry is stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers. Figure 8 illustrates a situation where an old log entry is stored on a majority of servers, yet can still be overwritten by a future leader.

To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader's current term are committed by counting replicas; once an entry from the current term has been committed in this way, then all prior entries are committed indirectly because of the Log Matching Property. There are some situations where a leader could safely conclude that an older log entry is committed (for example, if that entry is stored on every server), but Raft takes a more conservative approach for simplicity.

Raft incurs this extra complexity in the commitment rules because log entries retain their original term numbers when a leader replicates entries from previous terms. In other consensus algorithms, if a new leader re-replicates entries from prior "terms," it must do so with its new "term number." Raft's approach makes it easier to reason about log entries, since they maintain the same term number over time and across logs. In addition, new leaders in Raft send fewer log entries from previous terms than in other algorithms (other algorithms must send redundant log entries to renumber them before they can be committed).

5.4.3 Safety argument

Given the complete Raft algorithm, we can now argue more precisely that the Leader Completeness Property holds (this argument is based on the safety proof; see Section 9.2). We assume that the Leader Completeness Property does not hold, then we prove a contradiction. Suppose the leader for term T (leaderT) commits a log entry from its term, but that log entry is not stored by the leader of some future term. Consider the smallest term U > T whose leader (leaderU) does not store the entry.

Figure 9: If S1 (leader for term T) commits a new log entry from its term, and S5 is elected leader for a later term U, then there must be at least one server (S3) that accepted the log entry and also voted for S5.

1. The committed entry must have been absent from leaderU's log at the time of its election (leaders never delete or overwrite entries).

2. leaderT replicated the entry on a majority of the cluster, and leaderU received votes from a majority of the cluster. Thus, at least one server ("the voter") both accepted the entry from leaderT and voted for leaderU, as shown in Figure 9. The voter is key to reaching a contradiction.

3. The voter must have accepted the committed entry from leaderT before voting for leaderU; otherwise it would have rejected the AppendEntries request from leaderT (its current term would have been higher than T).

4. The voter still stored the entry when it voted for leaderU, since every intervening leader contained the entry (by assumption), leaders never remove entries, and followers only remove entries if they conflict with the leader.

5. The voter granted its vote to leaderU, so leaderU's log must have been as up-to-date as the voter's. This leads to one of two contradictions.

6. First, if the voter and leaderU shared the same last log term, then leaderU's log must have been at least as long as the voter's, so its log contained every entry in the voter's log. This is a contradiction, since the voter contained the committed entry and leaderU was assumed not to.

7. Otherwise, leaderU's last log term must have been larger than the voter's. Moreover, it was larger than T, since the voter's last log term was at least T (it contains the committed entry from term T). The earlier leader that created leaderU's last log entry must have contained the committed entry in its log (by assumption). Then, by the Log Matching Property, leaderU's log must also contain the committed entry, which is a contradiction.

8. This completes the contradiction. Thus, the leaders of all terms greater than T must contain all entries from term T that are committed in term T.

9. The Log Matching Property guarantees that future leaders will also contain entries that are committed indirectly, such as index 2 in Figure 8(d).

Given the Leader Completeness Property, we can prove the State Machine Safety Property from Figure 3, which states that if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. At the time a server applies a log entry to its state machine, its log must be identical to the leader's log up through that entry and the entry must be committed. Now consider the lowest term in which any server applies a given log index; the Leader Completeness Property guarantees that the leaders for all higher terms will store that same log entry, so servers that apply the index in later terms will apply the same value. Thus, the State Machine Safety Property holds.

Finally, Raft requires servers to apply entries in log index order. Combined with the State Machine Safety Property, this means that all servers will apply exactly the same set of log entries to their state machines, in the same order.
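The conservative commitment rule of Section 5.4.2 corresponds to the last leader rule in Figure 2. In the sketch below (same hypothetical raftsketch package), the leader advances commitIndex to an index N only when a majority stores entry N and that entry carries the leader's current term; MatchIndex is assumed to describe the other servers, with the leader counting its own copy separately.

```go
package raftsketch

// advanceCommitIndex implements the Figure 2 leader rule: find the largest N
// such that N > commitIndex, log[N].term == currentTerm, and a majority of the
// cluster stores entry N. Earlier entries then commit indirectly via the Log
// Matching Property.
func (s *Server) advanceCommitIndex(leader LeaderState, clusterSize int) {
	for n := len(s.Log); n > s.CommitIndex; n-- {
		if s.Log[n-1].Term != s.CurrentTerm {
			continue // never commit an older-term entry by counting replicas
		}
		count := 1 // the leader itself stores every entry in its own log
		for _, m := range leader.MatchIndex {
			if m >= n {
				count++
			}
		}
		if count > clusterSize/2 {
			s.CommitIndex = n
			return
		}
	}
}
```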
5.5 Follower and candidate crashes

Until this point we have focused on leader failures. Follower and candidate crashes are much simpler to handle than leader crashes, and they are both handled in the same way. If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully. If a server crashes after completing an RPC but before responding, then it will receive the same RPC again after it restarts. Raft RPCs are idempotent, so this causes no harm. For example, if a follower receives an AppendEntries request that includes log entries already present in its log, it ignores those entries in the new request.

5.6 Timing and availability

One of our requirements for Raft is that safety must not depend on timing: the system must not produce incorrect results just because some event happens more quickly or slowly than expected. However, availability (the ability of the system to respond to clients in a timely manner) must inevitably depend on timing. For example, if message exchanges take longer than the typical time between server crashes, candidates will not stay up long enough to win an election; without a steady leader, Raft cannot make progress.

Leader election is the aspect of Raft where timing is most critical. Raft will be able to elect and maintain a steady leader as long as the system satisfies the following timing requirement:

    broadcastTime ≪ electionTimeout ≪ MTBF

In this inequality broadcastTime is the average time it takes a server to send RPCs in parallel to every server in the cluster and receive their responses; electionTimeout is the election timeout described in Section 5.2; and MTBF is the average time between failures for a single server. The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages required to keep followers from starting elections; given the randomized approach used for election timeouts, this inequality also makes split votes unlikely. The election timeout should be a few orders of magnitude less than MTBF so that the system makes steady progress. When the leader crashes, the system will be unavailable for roughly the election timeout; we would like this to represent only a small fraction of overall time.

The broadcast time and MTBF are properties of the underlying system, while the election timeout is something we must choose. Raft's RPCs typically require the recipient to persist information to stable storage, so the broadcast time may range from 0.5ms to 20ms, depending on storage technology. As a result, the election timeout is likely to be somewhere between 10ms and 500ms. Typical server MTBFs are several months or more, which easily satisfies the timing requirement.
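The timing requirement can be checked mechanically once rough measurements are available. The helper below is an illustrative sketch, not part of the paper; it applies a factor-of-ten margin at each step, and the example durations are simply values inside the ranges quoted above.

```go
package raftsketch

import "time"

// timingLooksSafe is a rough sanity check of
// broadcastTime << electionTimeout << MTBF, using a factor-of-ten margin.
func timingLooksSafe(broadcastTime, electionTimeout, mtbf time.Duration) bool {
	return 10*broadcastTime <= electionTimeout && 10*electionTimeout <= mtbf
}

// Example: a 10 ms broadcast time, a 300 ms election timeout, and an MTBF of
// roughly three months comfortably satisfy the requirement.
var timingOK = timingLooksSafe(10*time.Millisecond, 300*time.Millisecond, 90*24*time.Hour)
```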
6 Cluster membership changes

Up until now we have assumed that the cluster configuration (the set of servers participating in the consensus algorithm) is fixed. In practice, it will occasionally be necessary to change the configuration, for example to replace servers when they fail or to change the degree of replication. Although this can be done by taking the entire cluster off-line, updating configuration files, and then restarting the cluster, this would leave the cluster unavailable during the changeover. In addition, if there are any manual steps, they risk operator error. In order to avoid these issues, we decided to automate configuration changes and incorporate them into the Raft consensus algorithm.

For the configuration change mechanism to be safe, there must be no point during the transition where it is possible for two leaders to be elected for the same term. Unfortunately, any approach where servers switch directly from the old configuration to the new configuration is unsafe. It isn't possible to atomically switch all of the servers at once, so the cluster can potentially split into two independent majorities during the transition (see Figure 10).

Figure 10: Switching directly from one configuration to another is unsafe because different servers will switch at different times. In this example, the cluster grows from three servers to five. Unfortunately, there is a point in time where two different leaders can be elected for the same term, one with a majority of the old configuration (C_old) and another with a majority of the new configuration (C_new).

In order to ensure safety, configuration changes must use a two-phase approach. There are a variety of ways to implement the two phases. For example, some systems (e.g., [22]) use the first phase to disable the old configuration so it cannot process client requests; then the second phase enables the new configuration. In Raft the cluster first switches to a transitional configuration we call joint consensus; once the joint consensus has been committed, the system then transitions to the new configuration. The joint consensus combines both the old and new configurations:

• Log entries are replicated to all servers in both configurations.
• Any server from either configuration may serve as leader.
• Agreement (for elections and entry commitment) requires separate majorities from both the old and new configurations.

The joint consensus allows individual servers to transition between configurations at different times without compromising safety. Furthermore, joint consensus allows the cluster to continue servicing client requests throughout the configuration change.

Cluster configurations are stored and communicated using special entries in the replicated log; Figure 11 illustrates the configuration change process. When the leader receives a request to change the configuration from C_old to C_new, it stores the configuration for joint consensus (C_old,new in the figure) as a log entry and replicates that entry using the mechanisms described previously. Once a given server adds the new configuration entry to its log, it uses that configuration for all future decisions (a server always uses the latest configuration in its log, regardless of whether the entry is committed). This means that the leader will use the rules of C_old,new to determine when the log entry for C_old,new is committed. If the leader crashes, a new leader may be chosen under either C_old or C_old,new, depending on whether the winning candidate has received C_old,new. In any case, C_new cannot make unilateral decisions during this period.

Figure 11: Timeline for a configuration change. Dashed lines show configuration entries that have been created but not committed, and solid lines show the latest committed configuration entry. The leader first creates the C_old,new configuration entry in its log and commits it to C_old,new (a majority of C_old and a majority of C_new). Then it creates the C_new entry and commits it to a majority of C_new. There is no point in time in which C_old and C_new can both make decisions independently.

Once C_old,new has been committed, neither C_old nor C_new can make decisions without approval of the other, and the Leader Completeness Property ensures that only servers with the C_old,new log entry can be elected as leader. It is now safe for the leader to create a log entry describing C_new and replicate it to the cluster. Again, this configuration will take effect on each server as soon as it is seen. When the new configuration has been committed under the rules of C_new, the old configuration is irrelevant and servers not in the new configuration can be shut down. As shown in Figure 11, there is no time when C_old and C_new can both make unilateral decisions; this guarantees safety.

There are three more issues to address for reconfiguration. The first issue is that new servers may not initially store any log entries. If they are added to the cluster in this state, it could take quite a while for them to catch up, during which time it might not be possible to commit new log entries. In order to avoid availability gaps, Raft introduces an additional phase before the configuration change, in which the new servers join the cluster as non-voting members (the leader replicates log entries to them, but they are not considered for majorities). Once the new servers have caught up with the rest of the cluster, the reconfiguration can proceed as described above.

The second issue is that the cluster leader may not be part of the new configuration. In this case, the leader steps down (returns to follower state) once it has committed the C_new log entry. This means that there will be a period of time (while it is committing C_new) when the leader is managing a cluster that does not include itself; it replicates log entries but does not count itself in majorities. The leader transition occurs when C_new is committed because this is the first point when the new configuration can operate independently (it will always be possible to choose a leader from C_new). Before this point, it may be the case that only a server from C_old can be elected leader.

The third issue is that removed servers (those not in C_new) can disrupt the cluster. These servers will not receive heartbeats, so they will time out and start new elections. They will then send RequestVote RPCs with new term numbers, and this will cause the current leader to revert to follower state. A new leader will eventually be elected, but the removed servers will time out again and the process will repeat, resulting in poor availability.

To prevent this problem, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote. This does not affect normal elections, where each server waits at least a minimum election timeout before starting an election. However, it helps avoid disruptions from removed servers: if a leader is able to get heartbeats to its cluster, then it will not be deposed by larger term numbers.
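The joint-consensus agreement rule described earlier in this section (separate majorities of C_old and C_new) reduces to a small check. The sketch below (same hypothetical raftsketch package) represents a configuration as two lists of server ids and treats an empty C_new as "no change in progress"; that representation is an assumption for illustration, not the paper's data structure.

```go
package raftsketch

// Config describes the servers participating in consensus.
type Config struct {
	Old []int // server ids in C_old
	New []int // server ids in C_new (empty when no change is in progress)
}

// majority reports whether the agreeing servers include a majority of members.
func majority(members, agreed []int) bool {
	set := make(map[int]bool, len(agreed))
	for _, id := range agreed {
		set[id] = true
	}
	count := 0
	for _, id := range members {
		if set[id] {
			count++
		}
	}
	return count > len(members)/2
}

// jointAgreement reports whether a set of agreeing servers (voters for an
// election, or servers storing an entry for commitment) satisfies the joint
// consensus rule: a majority of C_old and, during a change, also of C_new.
func jointAgreement(c Config, agreed []int) bool {
	if !majority(c.Old, agreed) {
		return false
	}
	return len(c.New) == 0 || majority(c.New, agreed)
}
```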
7 Log compaction

Raft's log grows during normal operation to incorporate more client requests, but in a practical system, it cannot grow without bound. As the log grows longer, it occupies more space and takes more time to replay. This will eventually cause availability problems without some mechanism to discard obsolete information that has accumulated in the log.

Snapshotting is the simplest approach to compaction. In snapshotting, the entire current system state is written to a snapshot on stable storage, then the entire log up to that point is discarded. Snapshotting is used in Chubby and ZooKeeper, and the remainder of this section describes snapshotting in Raft.

Incremental approaches to compaction, such as log cleaning [36] and log-structured merge trees [30, 5], are also possible. These operate on a fraction of the data at once, so they spread the load of compaction more evenly over time. They first select a region of data that has accumulated many deleted and overwritten objects, then they rewrite the live objects from that region more compactly and free the region. This requires significant additional mechanism and complexity compared to snapshotting, which simplifies the problem by always operating on the entire data set. While log cleaning would require modifications to Raft, state machines can implement LSM trees using the same interface as snapshotting.

Figure 12: A server replaces the committed entries in its log (indexes 1 through 5) with a new snapshot, which stores just the current state (variables x and y in this example). The snapshot's last included index and term serve to position the snapshot in the log preceding entry 6.

Figure 12 shows the basic idea of snapshotting in Raft. Each server takes snapshots independently, covering just the committed entries in its log. Most of the work consists of the state machine writing its current state to the snapshot. Raft also includes a small amount of metadata in the snapshot: the last included index is the index of the last entry in the log that the snapshot replaces (the last entry the state machine had applied), and the last included term is the term of this entry. These are preserved to support the AppendEntries consistency check for the first log entry following the snapshot, since that entry needs a previous log index and term. To enable cluster membership changes (Section 6), the snapshot also includes the latest configuration in the log as of the last included index. Once a server completes writing a snapshot, it may delete all log entries up through the last included index, as well as any prior snapshot.
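The metadata and log truncation just described might look roughly like the sketch below. The types and the takeSnapshot helper are assumptions for illustration; a real implementation must also persist the snapshot to stable storage before discarding any log entries.

```cpp
// Minimal sketch of taking a snapshot: record last included index/term and the
// latest configuration, then discard the covered log prefix. Illustrative only.
#include <cstdint>
#include <string>
#include <vector>

struct LogEntry {
    uint64_t index;
    uint64_t term;
    std::string command;
};

struct Snapshot {
    uint64_t lastIncludedIndex;   // index of the last entry the snapshot replaces
    uint64_t lastIncludedTerm;    // term of that entry
    std::string configuration;    // latest cluster configuration as of that index
    std::string stateMachineData; // serialized state machine contents
};

// 'lastApplied' has already been applied to the state machine, so the snapshot
// can safely cover every entry up through it.
Snapshot takeSnapshot(std::vector<LogEntry>& log, uint64_t lastApplied,
                      const std::string& serializedState,
                      const std::string& latestConfig) {
    Snapshot snap;
    snap.lastIncludedIndex = lastApplied;
    snap.lastIncludedTerm = 0;
    for (const auto& e : log)
        if (e.index == lastApplied)
            snap.lastIncludedTerm = e.term;
    snap.configuration = latestConfig;
    snap.stateMachineData = serializedState;

    // Once the snapshot is safely on disk, entries up through lastIncludedIndex
    // (and any prior snapshot) can be discarded.
    std::vector<LogEntry> retained;
    for (const auto& e : log)
        if (e.index > snap.lastIncludedIndex)
            retained.push_back(e);
    log.swap(retained);
    return snap;
}
```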
Although servers normally take snapshots independently, the leader must occasionally send snapshots to followers that lag behind. This happens when the leader has already discarded the next log entry that it needs to send to a follower. Fortunately, this situation is unlikely in normal operation: a follower that has kept up with the leader would already have this entry. However, an exceptionally slow follower or a new server joining the cluster (Section 6) would not. The way to bring such a follower up-to-date is for the leader to send it a snapshot over the network.

The leader uses a new RPC called InstallSnapshot to send snapshots to followers that are too far behind; see Figure 13. When a follower receives a snapshot with this RPC, it must decide what to do with its existing log entries. Usually the snapshot will contain new information not already in the recipient's log. In this case, the follower discards its entire log; it is all superseded by the snapshot and may possibly have uncommitted entries that conflict with the snapshot. If instead the follower receives a snapshot that describes a prefix of its log (due to retransmission or by mistake), then log entries covered by the snapshot are deleted but entries following the snapshot are still valid and must be retained.

InstallSnapshot RPC: invoked by leader to send chunks of a snapshot to a follower. Leaders always send chunks in order.

Arguments:
  term: leader's term
  leaderId: so follower can redirect clients
  lastIncludedIndex: the snapshot replaces all entries up through and including this index
  lastIncludedTerm: term of lastIncludedIndex
  offset: byte offset where chunk is positioned in the snapshot file
  data[]: raw bytes of the snapshot chunk, starting at offset
  done: true if this is the last chunk

Results:
  term: currentTerm, for leader to update itself

Receiver implementation:
  1. Reply immediately if term < currentTerm
  2. Create new snapshot file if first chunk (offset is 0)
  3. Write data into snapshot file at given offset
  4. Reply and wait for more data chunks if done is false
  5. Save snapshot file, discard any existing or partial snapshot with a smaller index
  6. If existing log entry has same index and term as snapshot's last included entry, retain log entries following it and reply
  7. Discard the entire log
  8. Reset state machine using snapshot contents (and load snapshot's cluster configuration)

Figure 13: A summary of the InstallSnapshot RPC. Snapshots are split into chunks for transmission; this gives the follower a sign of life with each chunk, so it can reset its election timer.
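The receiver rules that deal with the existing log (steps 6 through 8 in Figure 13) could be sketched as follows once all chunks of a snapshot have arrived. The types mirror the earlier sketch and are illustrative, not the paper's code.

```cpp
// Sketch of follower-side handling after a complete snapshot has been received.
// Illustrative types and names; not an API defined by the paper.
#include <cstdint>
#include <string>
#include <vector>

struct LogEntry { uint64_t index; uint64_t term; std::string command; };
struct Snapshot {
    uint64_t lastIncludedIndex;
    uint64_t lastIncludedTerm;
    std::string stateMachineData;
};

void applyReceivedSnapshot(const Snapshot& snap, std::vector<LogEntry>& log,
                           std::string& stateMachineState) {
    // If the log already contains the snapshot's last included entry (e.g., a
    // retransmitted snapshot describing a prefix of the log), keep the entries
    // that follow it; they are still valid.
    bool foundLastIncluded = false;
    std::vector<LogEntry> suffix;
    for (const auto& e : log) {
        if (e.index == snap.lastIncludedIndex && e.term == snap.lastIncludedTerm)
            foundLastIncluded = true;
        else if (foundLastIncluded && e.index > snap.lastIncludedIndex)
            suffix.push_back(e);
    }

    if (foundLastIncluded) {
        log.swap(suffix);                           // delete only the covered prefix
    } else {
        log.clear();                                // entire log is superseded
        stateMachineState = snap.stateMachineData;  // reset state machine from snapshot
    }
}
```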
This snapshotting approach departs from Raft's strong leader principle, since followers can take snapshots without the knowledge of the leader. However, we think this departure is justified. While having a leader helps avoid conflicting decisions in reaching consensus, consensus has already been reached when snapshotting, so no decisions conflict. Data still only flows from leaders to followers; it is just that followers can now reorganize their own data.

We considered an alternative leader-based approach in which only the leader would create a snapshot, then it would send this snapshot to each of its followers. However, this has two disadvantages. First, sending the snapshot to each follower would waste network bandwidth and slow the snapshotting process. Each follower already has the information needed to produce its own snapshots, and it is typically much cheaper for a server to produce a snapshot from its local state than it is to send and receive one over the network. Second, the leader's implementation would be more complex. For example, the leader would need to send snapshots to followers in parallel with replicating new log entries to them, so as not to block new client requests.

There are two more issues that impact snapshotting performance. First, servers must decide when to snapshot. If a server snapshots too often, it wastes disk bandwidth and energy; if it snapshots too infrequently, it risks exhausting its storage capacity, and it increases the time required to replay the log during restarts. One simple strategy is to take a snapshot when the log reaches a fixed size in bytes. If this size is set to be significantly larger than the expected size of a snapshot, then the disk bandwidth overhead for snapshotting will be small.

The second performance issue is that writing a snapshot can take a significant amount of time, and we do not want this to delay normal operations. The solution is to use copy-on-write techniques so that new updates can be accepted without impacting the snapshot being written. For example, state machines built with functional data structures naturally support this. Alternatively, the operating system's copy-on-write support (e.g., fork on Linux) can be used to create an in-memory snapshot of the entire state machine (our implementation uses this approach).
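These two policies might be combined roughly as in the sketch below, which assumes a POSIX system. The 64 MB threshold and the writeSnapshotToDisk() hook are placeholders for illustration, not values or APIs from the paper.

```cpp
// Sketch: snapshot only once the log exceeds a fixed size, and use the
// operating system's copy-on-write support (fork) so the parent can keep
// accepting requests while the child serializes the frozen state.
#include <cstddef>
#include <sys/types.h>
#include <unistd.h>

// Placeholder: serialize the in-memory state machine to stable storage.
void writeSnapshotToDisk() { /* omitted in this sketch */ }

void maybeSnapshot(std::size_t logSizeBytes) {
    // Choose a threshold well above the expected snapshot size so that the
    // disk bandwidth spent on snapshotting stays small.
    const std::size_t kSnapshotThresholdBytes = 64 * 1024 * 1024;  // arbitrary
    if (logSizeBytes < kSnapshotThresholdBytes)
        return;

    pid_t child = fork();            // child gets a copy-on-write image of memory
    if (child == 0) {
        writeSnapshotToDisk();       // child writes the snapshot
        _exit(0);
    }
    // Parent returns immediately and continues serving client requests.
    // A real server would also reap the child asynchronously (omitted here).
}
```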
8 Client interaction

This section describes how clients interact with Raft, including how clients find the cluster leader and how Raft supports linearizable semantics [10]. These issues apply to all consensus-based systems, and Raft's solutions are similar to other systems.

Clients of Raft send all of their requests to the leader. When a client first starts up, it connects to a randomly-chosen server. If the client's first choice is not the leader, that server will reject the client's request and supply information about the most recent leader it has heard from (AppendEntries requests include the network address of the leader). If the leader crashes, client requests will time out; clients then try again with randomly-chosen servers.

Our goal for Raft is to implement linearizable semantics (each operation appears to execute instantaneously, exactly once, at some point between its invocation and its response). However, as described so far Raft can execute a command multiple times: for example, if the leader crashes after committing the log entry but before responding to the client, the client will retry the command with a new leader, causing it to be executed a second time. The solution is for clients to assign unique serial numbers to every command. Then, the state machine tracks the latest serial number processed for each client, along with the associated response. If it receives a command whose serial number has already been executed, it responds immediately without re-executing the request.

Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests. Alternatively, the leader could rely on the heartbeat mechanism to provide a form of lease [9], but this would rely on timing for safety (it assumes bounded clock skew).
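The duplicate-detection scheme described earlier in this section can be sketched as a small session table inside the state machine. The Session type and the applyToStateMachine() placeholder are assumptions for illustration, not an API defined by the paper.

```cpp
// Sketch of exactly-once command execution: remember the latest serial number
// executed for each client, along with its response, and replay the saved
// response for duplicates. Assumes each client issues one command at a time
// with increasing serial numbers. Illustrative only.
#include <cstdint>
#include <string>
#include <unordered_map>

struct Session {
    uint64_t lastSerial = 0;
    std::string lastResponse;
};

std::unordered_map<uint64_t, Session> sessions;   // keyed by client id

// Placeholder for actually executing a command against the state machine.
std::string applyToStateMachine(const std::string& command) {
    return "ok: " + command;
}

std::string applyCommand(uint64_t clientId, uint64_t serial, const std::string& command) {
    Session& s = sessions[clientId];
    if (serial <= s.lastSerial && s.lastSerial != 0)
        return s.lastResponse;        // duplicate: reply without re-executing
    s.lastResponse = applyToStateMachine(command);
    s.lastSerial = serial;
    return s.lastResponse;
}
```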
9 Implementation and evaluation

We have implemented Raft as part of a replicated state machine that stores configuration information for RAMCloud [33] and assists in failover of the RAMCloud coordinator. The Raft implementation contains roughly 2000 lines of C++ code, not including tests, comments, or blank lines. The source code is freely available [23]. There are also about 25 independent third-party open source implementations [34] of Raft in various stages of development, based on drafts of this paper. Also, various companies are deploying Raft-based systems [34].

The remainder of this section evaluates Raft using three criteria: understandability, correctness, and performance.

9.1 Understandability

To measure Raft's understandability relative to Paxos, we conducted an experimental study using upper-level undergraduate and graduate students in an Advanced Operating Systems course at Stanford University and a Distributed Computing course at U.C. Berkeley. We recorded a video lecture of Raft and another of Paxos, and created corresponding quizzes. The Raft lecture covered the content of this paper except for log compaction; the Paxos lecture covered enough material to create an equivalent replicated state machine, including single-decree Paxos, multi-decree Paxos, reconfiguration, and a few optimizations needed in practice (such as leader election). The quizzes tested basic understanding of the algorithms and also required students to reason about corner cases. Each student watched one video, took the corresponding quiz, watched the second video, and took the second quiz. About half of the participants did the Paxos portion first and the other half did the Raft portion first in order to account for both individual differences in performance and experience gained from the first portion of the study. We compared participants' scores on each quiz to determine whether participants showed a better understanding of Raft.

We tried to make the comparison between Paxos and Raft as fair as possible. The experiment favored Paxos in two ways: 15 of the 43 participants reported having some prior experience with Paxos, and the Paxos video is 14% longer than the Raft video. As summarized in Table 1, we have taken steps to mitigate potential sources of bias. All of our materials are available for review [28, 31].

Table 1: Concerns of possible bias against Paxos in the study, steps taken to counter each, and additional materials available for review [28, 31].
• Equal lecture quality: same lecturer for both; Paxos lecture based on and improved from existing materials used in several universities; Paxos lecture is 14% longer (materials: videos).
• Equal quiz difficulty: questions grouped in difficulty and paired across exams (materials: quizzes).
• Fair grading: used a rubric; graded in random order, alternating between quizzes (materials: rubric).

On average, participants scored 4.9 points higher on the Raft quiz than on the Paxos quiz (out of a possible 60 points, the mean Raft score was 25.7 and the mean Paxos score was 20.8); Figure 14 shows their individual scores. A paired t-test states that, with 95% confidence, the true distribution of Raft scores has a mean at least 2.5 points larger than the true distribution of Paxos scores.

Figure 14: A scatter plot comparing 43 participants' performance on the Raft and Paxos quizzes (Paxos grade vs. Raft grade). Points above the diagonal (33) represent participants who scored higher for Raft.

We also created a linear regression model that predicts a new student's quiz scores based on three factors: which quiz they took, their degree of prior Paxos experience, and the order in which they learned the algorithms. The model predicts that the choice of quiz produces a 12.5-point difference in favor of Raft. This is significantly higher than the observed difference of 4.9 points, because many of the actual students had prior Paxos experience, which helped Paxos considerably, whereas it helped Raft slightly less. Curiously, the model also predicts scores 6.3 points lower on Raft for people that have already taken the Paxos quiz; although we don't know why, this does appear to be statistically significant.

We also surveyed participants after their quizzes to see which algorithm they felt would be easier to implement or explain; these results are shown in Figure 15. An overwhelming majority of participants reported Raft would be easier to implement and explain (33 of 41 for each question). However, these self-reported feelings may be less reliable than participants' quiz scores, and participants may have been biased by knowledge of our hypothesis that Raft is easier to understand.

Figure 15: Using a 5-point scale, participants were asked (left) which algorithm they felt would be easier to implement in a functioning, correct, and efficient system, and (right) which would be easier to explain to a CS graduate student.

A detailed discussion of the Raft user study is available at [31].

9.2 Correctness

We have developed a formal specification and a proof of safety for the consensus mechanism described in Section 5. The formal specification [31] makes the information summarized in Figure 2 completely precise using the TLA+ specification language [17]. It is about 400 lines long and serves as the subject of the proof. It is also useful on its own for anyone implementing Raft. We have mechanically proven the Log Completeness Property using the TLA proof system [7]. However, this proof relies on invariants that have not been mechanically checked (for example, we have not proven the type safety of the specification). Furthermore, we have written an informal proof [31] of the State Machine Safety property which is complete (it relies on the specification alone) and relatively precise (it is about 3500 words long).
9.3 Performance

Raft's performance is similar to other consensus algorithms such as Paxos. The most important case for performance is when an established leader is replicating new log entries. Raft achieves this using the minimal number of messages (a single round-trip from the leader to half the cluster). It is also possible to further improve Raft's performance. For example, it easily supports batching and pipelining requests for higher throughput and lower latency. Various optimizations have been proposed in the literature for other algorithms; many of these could be applied to Raft, but we leave this to future work.

We used our Raft implementation to measure the performance of Raft's leader election algorithm and answer two questions. First, does the election process converge quickly? Second, what is the minimum downtime that can be achieved after leader crashes?

To measure leader election, we repeatedly crashed the leader of a cluster of five servers and timed how long it took to detect the crash and elect a new leader (see Figure 16). To generate a worst-case scenario, the servers in each trial had different log lengths, so some candidates were not eligible to become leader. Furthermore, to encourage split votes, our test script triggered a synchronized broadcast of heartbeat RPCs from the leader before terminating its process (this approximates the behavior of the leader replicating a new log entry prior to crashing). The leader was crashed uniformly randomly within its heartbeat interval, which was half of the minimum election timeout for all tests. Thus, the smallest possible downtime was about half of the minimum election timeout.

Figure 16: The time to detect and replace a crashed leader. The top graph varies the amount of randomness in election timeouts, and the bottom graph scales the minimum election timeout. Each line represents 1000 trials (except for 100 trials for "150–150ms") and corresponds to a particular choice of election timeouts; for example, "150–155ms" means that election timeouts were chosen randomly and uniformly between 150ms and 155ms. The measurements were taken on a cluster of five servers with a broadcast time of roughly 15ms. Results for a cluster of nine servers are similar.

The top graph in Figure 16 shows that a small amount of randomization in the election timeout is enough to avoid split votes in elections. In the absence of randomness, leader election consistently took longer than 10 seconds in our tests due to many split votes. Adding just 5ms of randomness helps significantly, resulting in a median downtime of 287ms. Using more randomness improves worst-case behavior: with 50ms of randomness the worst-case completion time (over 1000 trials) was 513ms.

The bottom graph in Figure 16 shows that downtime can be reduced by reducing the election timeout. With an election timeout of 12–24ms, it takes only 35ms on average to elect a leader (the longest trial took 152ms). However, lowering the timeouts beyond this point violates Raft's timing requirement: leaders have difficulty broadcasting heartbeats before other servers start new elections. This can cause unnecessary leader changes and lower overall system availability. We recommend using a conservative election timeout such as 150–300ms; such timeouts are unlikely to cause unnecessary leader changes and will still provide good availability.
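A minimal sketch of this recommendation is shown below, drawing a fresh timeout uniformly from the suggested 150–300ms range; the helper itself is illustrative rather than part of the paper's implementation.

```cpp
// Sketch: pick each election timeout uniformly at random from a conservative
// range such as 150-300ms, drawing a new value for each election cycle.
#include <chrono>
#include <random>

std::chrono::milliseconds randomElectionTimeout() {
    static thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(150, 300);
    return std::chrono::milliseconds(dist(rng));
}
```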
10 Related work

There have been numerous publications related to consensus algorithms, many of which fall into one of the following categories:
• Lamport's original description of Paxos [15], and attempts to explain it more clearly [16, 20, 21].
• Elaborations of Paxos, which fill in missing details and modify the algorithm to provide a better foundation for implementation [26, 39, 13].
• Systems that implement consensus algorithms, such as Chubby [2, 4], ZooKeeper [11, 12], and Spanner [6]. The algorithms for Chubby and Spanner have not been published in detail, though both claim to be based on Paxos. ZooKeeper's algorithm has been published in more detail, but it is quite different from Paxos.
• Performance optimizations that can be applied to Paxos [18, 19, 3, 25, 1, 27].
• Oki and Liskov's Viewstamped Replication (VR), an alternative approach to consensus developed around the same time as Paxos. The original description [29] was intertwined with a protocol for distributed transactions, but the core consensus protocol has been separated in a recent update [22]. VR uses a leader-based approach with many similarities to Raft.

The greatest difference between Raft and Paxos is Raft's strong leadership: Raft uses leader election as an essential part of the consensus protocol, and it concentrates as much functionality as possible in the leader.
This approach results in a simpler algorithm that is easier to understand. For example, in Paxos, leader election is orthogonal to the basic consensus protocol: it serves only as a performance optimization and is not required for achieving consensus. However, this results in additional mechanism: Paxos includes both a two-phase protocol for basic consensus and a separate mechanism for leader election. In contrast, Raft incorporates leader election directly into the consensus algorithm and uses it as the first of the two phases of consensus. This results in less mechanism than in Paxos.

Like Raft, VR and ZooKeeper are leader-based and therefore share many of Raft's advantages over Paxos. However, Raft has less mechanism than VR or ZooKeeper because it minimizes the functionality in non-leaders. For example, log entries in Raft flow in only one direction: outward from the leader in AppendEntries RPCs. In VR, log entries flow in both directions (leaders can receive log entries during the election process); this results in additional mechanism and complexity. The published description of ZooKeeper also transfers log entries both to and from the leader, but the implementation is apparently more like Raft [35].

Raft has fewer message types than any other algorithm for consensus-based log replication that we are aware of. For example, we counted the message types VR and ZooKeeper use for basic consensus and membership changes (excluding log compaction and client interaction, as these are nearly independent of the algorithms). VR and ZooKeeper each define 10 different message types, while Raft has only 4 message types (two RPC requests and their responses). Raft's messages are a bit more dense than the other algorithms', but they are simpler collectively. In addition, VR and ZooKeeper are described in terms of transmitting entire logs during leader changes; additional message types will be required to optimize these mechanisms so that they are practical.

Raft's strong leadership approach simplifies the algorithm, but it precludes some performance optimizations. For example, Egalitarian Paxos (EPaxos) can achieve higher performance under some conditions with a leaderless approach [27]. EPaxos exploits commutativity in state machine commands. Any server can commit a command with just one round of communication as long as other commands that are proposed concurrently commute with it. However, if commands that are proposed concurrently do not commute with each other, EPaxos requires an additional round of communication. Because any server may commit commands, EPaxos balances load well between servers and is able to achieve lower latency than Raft in WAN settings. However, it adds significant complexity to Paxos.

Several different approaches for cluster membership changes have been proposed or implemented in other work, including Lamport's original proposal [15], VR [22], and SMART [24]. We chose the joint consensus approach for Raft because it leverages the rest of the consensus protocol, so that very little additional mechanism is required for membership changes. Lamport's α-based approach was not an option for Raft because it assumes consensus can be reached without a leader. In comparison to VR and SMART, Raft's reconfiguration algorithm has the advantage that membership changes can occur without limiting the processing of normal requests; in contrast, VR stops all normal processing during configuration changes, and SMART imposes an α-like limit on the number of outstanding requests. Raft's approach also adds less mechanism than either VR or SMART.

11 Conclusion

Algorithms are often designed with correctness, efficiency, and/or conciseness as the primary goals. Although these are all worthy goals, we believe that understandability is just as important. None of the other goals can be achieved until developers render the algorithm into a practical implementation, which will inevitably deviate from and expand upon the published form. Unless developers have a deep understanding of the algorithm and can create intuitions about it, it will be difficult for them to retain its desirable properties in their implementation.

In this paper we addressed the issue of distributed consensus, where a widely accepted but impenetrable algorithm, Paxos, has challenged students and developers for many years. We developed a new algorithm, Raft, which we have shown to be more understandable than Paxos. We also believe that Raft provides a better foundation for system building. Using understandability as the primary design goal changed the way we approached the design of Raft; as the design progressed we found ourselves reusing a few techniques repeatedly, such as decomposing the problem and simplifying the state space. These techniques not only improved the understandability of Raft but also made it easier to convince ourselves of its correctness.

12 Acknowledgments

The user study would not have been possible without the support of Ali Ghodsi, David Mazières, and the students of CS 294-91 at Berkeley and CS 240 at Stanford. Scott Klemmer helped us design the user study, and Nelson Ray advised us on statistical analysis. The Paxos slides for the user study borrowed heavily from a slide deck originally created by Lorenzo Alvisi. Special thanks go to David Mazières and Ezra Hoch for finding subtle bugs in Raft. Many people provided helpful feedback on the paper and user study materials, including Ed Bugnion, Michael Chan, Hugues Evrard,
Daniel Giffin, Arjun Gopalan, Jon Howell, Vimalkumar Jeyakumar, Ankita Kejriwal, Aleksandar Kracun, Amit Levy, Joel Martin, Satoshi Matsushita, Oleg Pesok, David Ramos, Robbert van Renesse, Mendel Rosenblum, Nicolas Schiper, Deian Stefan, Andrew Stone, Ryan Stutsman, David Terei, Stephen Yang, Matei Zaharia, 24 anonymous conference reviewers (with duplicates), and especially our shepherd Eddie Kohler. Werner Vogels tweeted a link to an earlier draft, which gave Raft significant exposure. This work was supported by the Gigascale Systems Research Center and the Multiscale Systems Center, two of six research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program, by STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, by the National Science Foundation under Grant No. 0963859, and by grants from Facebook, Google, Mellanox, NEC, NetApp, SAP, and Samsung. Diego Ongaro is supported by The Junglee Corporation Stanford Graduate Fellowship.

References

[1] Bolosky, W. J., Bradshaw, D., Haagens, R. B., Kusters, N. P., and Li, P. Paxos replicated state machines as the basis of a high-performance data store. In Proc. NSDI'11, USENIX Conference on Networked Systems Design and Implementation (2011), USENIX, pp. 141–154.

[2] Burrows, M. The Chubby lock service for loosely-coupled distributed systems. In Proc. OSDI'06, Symposium on Operating Systems Design and Implementation (2006), USENIX, pp. 335–350.

[3] Camargos, L. J., Schmidt, R. M., and Pedone, F. Multicoordinated Paxos. In Proc. PODC'07, ACM Symposium on Principles of Distributed Computing (2007), ACM, pp. 316–317.

[4] Chandra, T. D., Griesemer, R., and Redstone, J. Paxos made live: an engineering perspective. In Proc. PODC'07, ACM Symposium on Principles of Distributed Computing (2007), ACM, pp. 398–407.

[5] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: a distributed storage system for structured data. In Proc. OSDI'06, USENIX Symposium on Operating Systems Design and Implementation (2006), USENIX, pp. 205–218.

[6] Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J. J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., Hsieh, W., Kanthak, S., Kogan, E., Li, H., Lloyd, A., Melnik, S., Mwaura, D., Nagle, D., Quinlan, S., Rao, R., Rolig, L., Saito, Y., Szymaniak, M., Taylor, C., Wang, R., and Woodford, D. Spanner: Google's globally-distributed database. In Proc. OSDI'12, USENIX Conference on Operating Systems Design and Implementation (2012), USENIX, pp. 251–264.

[7] Cousineau, D., Doligez, D., Lamport, L., Merz, S., Ricketts, D., and Vanzetto, H. TLA+ proofs. In Proc. FM'12, Symposium on Formal Methods (2012), D. Giannakopoulou and D. Méry, Eds., vol. 7436 of Lecture Notes in Computer Science, Springer, pp. 147–154.

[8] Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In Proc. SOSP'03, ACM Symposium on Operating Systems Principles (2003), ACM, pp. 29–43.

[9] Gray, C., and Cheriton, D. Leases: An efficient fault-tolerant mechanism for distributed file cache consistency. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (1989), pp. 202–210.

[10] Herlihy, M. P., and Wing, J. M. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems 12 (July 1990), 463–492.

[11] Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. ZooKeeper: wait-free coordination for internet-scale systems. In Proc. ATC'10, USENIX Annual Technical Conference (2010), USENIX, pp. 145–158.

[12] Junqueira, F. P., Reed, B. C., and Serafini, M. Zab: High-performance broadcast for primary-backup systems. In Proc. DSN'11, IEEE/IFIP Int'l Conf. on Dependable Systems & Networks (2011), IEEE Computer Society, pp. 245–256.

[13] Kirsch, J., and Amir, Y. Paxos for system builders. Tech. Rep. CNDS-2008-2, Johns Hopkins University, 2008.

[14] Lamport, L. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21, 7 (July 1978), 558–565.

[15] Lamport, L. The part-time parliament. ACM Transactions on Computer Systems 16, 2 (May 1998), 133–169.

[16] Lamport, L. Paxos made simple. ACM SIGACT News 32, 4 (Dec. 2001), 18–25.

[17] Lamport, L. Specifying Systems, The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley, 2002.

[18] Lamport, L. Generalized consensus and Paxos. Tech. Rep. MSR-TR-2005-33, Microsoft Research, 2005.

[19] Lamport, L. Fast Paxos. Distributed Computing 19, 2 (2006), 79–103.

[20] Lampson, B. W. How to build a highly available system using consensus. In Distributed Algorithms, O. Baboaglu and K. Marzullo, Eds. Springer-Verlag, 1996, pp. 1–17.

[21] Lampson, B. W. The ABCD's of Paxos. In Proc. PODC'01, ACM Symposium on Principles of Distributed Computing (2001), ACM, pp. 13–13.

[22] Liskov, B., and Cowling, J. Viewstamped replication revisited. Tech. Rep. MIT-CSAIL-TR-2012-021, MIT, July 2012.

[23] LogCabin source code. http://github.com/logcabin/logcabin.

[24] Lorch, J. R., Adya, A., Bolosky, W. J., Chaiken, R., Douceur, J. R., and Howell, J. The SMART way to migrate replicated stateful services. In Proc. EuroSys'06, ACM SIGOPS/EuroSys European Conference on Computer Systems (2006), ACM, pp. 103–115.

[25] Mao, Y., Junqueira, F. P., and Marzullo, K. Mencius: building efficient replicated state machines for WANs. In Proc. OSDI'08, USENIX Conference on Operating Systems Design and Implementation (2008), USENIX, pp. 369–384.

[26] Mazières, D. Paxos made practical. http://www.scs.stanford.edu/~dm/home/papers/paxos.pdf, Jan. 2007.

[27] Moraru, I., Andersen, D. G., and Kaminsky, M. There is more consensus in egalitarian parliaments. In Proc. SOSP'13, ACM Symposium on Operating System Principles (2013), ACM.

[28] Raft user study. http://ramcloud.stanford.edu/~ongaro/userstudy/.

[29] Oki, B. M., and Liskov, B. H. Viewstamped replication: A new primary copy method to support highly-available distributed systems. In Proc. PODC'88, ACM Symposium on Principles of Distributed Computing (1988), ACM, pp. 8–17.

[30] O'Neil, P., Cheng, E., Gawlick, D., and O'Neil, E. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 4 (1996), 351–385.

[31] Ongaro, D. Consensus: Bridging Theory and Practice. PhD thesis, Stanford University, 2014 (work in progress). http://ramcloud.stanford.edu/~ongaro/thesis.pdf.

[32] Ongaro, D., and Ousterhout, J. In search of an understandable consensus algorithm. In Proc. ATC'14, USENIX Annual Technical Conference (2014), USENIX.

[33] Ousterhout, J., Agrawal, P., Erickson, D., Kozyrakis, C., Leverich, J., Mazières, D., Mitra, S., Narayanan, A., Ongaro, D., Parulkar, G., Rosenblum, M., Rumble, S. M., Stratmann, E., and Stutsman, R. The case for RAMCloud. Communications of the ACM 54 (July 2011), 121–130.

[34] Raft consensus algorithm website. http://raftconsensus.github.io.

[35] Reed, B. Personal communications, May 17, 2013.

[36] Rosenblum, M., and Ousterhout, J. K. The design and implementation of a log-structured file system. ACM Trans. Comput. Syst. 10 (February 1992), 26–52.

[37] Schneider, F. B. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys 22, 4 (Dec. 1990), 299–319.

[38] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. In Proc. MSST'10, Symposium on Mass Storage Systems and Technologies (2010), IEEE Computer Society, pp. 1–10.

[39] van Renesse, R. Paxos made moderately complex. Tech. rep., Cornell University, 2012.