Raft: A Consensus Algorithm for Replicated Logs

Paxos has long been the standard for distributed consensus, but it is notoriously hard to understand and even harder to implement. Raft, a newer consensus protocol from Stanford, is designed for industrial use, with an explicit focus on being easy to implement and easy to understand. Compared with a single machine, one advantage of a distributed system is better fault tolerance. Raft is built around consensus: multiple servers agreeing on a shared state. In a distributed system, however, servers can become unreliable for many reasons, which makes maintaining a consistent state difficult. We therefore need a consensus protocol that provides fault tolerance, so that the failure of some servers does not affect the system as a whole. To reach agreement in a fault-tolerant way, we cannot require that 100% of servers agree; it is enough for a majority to agree. Both Paxos and Raft pursue this consensus goal. The process resembles an election: a candidate must convince a majority of voters (servers) to vote for it, and once elected, the others follow its decisions. This article walks through Raft's leader election process and its replicated log structure.

1. Raft: A Consensus Algorithm for Replicated Logs (Diego Ongaro and John Ousterhout, Stanford University)

2. Goal: Replicated Log
- Replicated log => replicated state machine
  - All servers execute same commands in same order
- Consensus module ensures proper log replication
- System makes progress as long as any majority of servers are up
- Failure model: fail-stop (not Byzantine), delayed/lost messages
[Figure: clients send commands (add, jmp, mov, shl) to servers; each server runs a Consensus Module, a Log, and a State Machine]

3. Approaches to Consensus
Two general approaches to consensus:
- Symmetric, leader-less:
  - All servers have equal roles
  - Clients can contact any server
- Asymmetric, leader-based:
  - At any given time, one server is in charge, others accept its decisions
  - Clients communicate with the leader
Raft uses a leader:
- Decomposes the problem (normal operation, leader changes)
- Simplifies normal operation (no conflicts)
- More efficient than leader-less approaches

4. Raft Overview
- Leader election:
  - Select one of the servers to act as leader
  - Detect crashes, choose new leader
- Normal operation (basic log replication)
- Safety and consistency after leader changes
- Neutralizing old leaders
- Client interactions:
  - Implementing linearizable semantics
- Configuration changes:
  - Adding and removing servers

5. Server States
At any given time, each server is either:
- Leader: handles all client interactions and log replication; at most 1 viable leader at a time
- Follower: completely passive (issues no RPCs, responds to incoming RPCs)
- Candidate: used to elect a new leader
Normal operation: 1 leader, N-1 followers
[Figure: state diagram. A server starts as Follower; on timeout it becomes Candidate and starts an election; on receiving votes from a majority of servers it becomes Leader; on another timeout it starts a new election; a Candidate that discovers the current leader or a higher term returns to Follower; a Leader that discovers a server with a higher term "steps down" to Follower]
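
A minimal Go sketch of these roles and the common "step down" transition; the type, field, and function names are assumptions for illustration, not from the deck:

```go
package raft

// ServerState captures the three roles a Raft server can be in.
type ServerState int

const (
	Follower  ServerState = iota // passive: issues no RPCs, only responds to them
	Candidate                    // trying to win an election for the next term
	Leader                       // handles client interactions and log replication
)

// Server holds only the fields the deck has introduced so far; a real
// implementation carries much more state.
type Server struct {
	state       ServerState
	currentTerm int
	votedFor    int // candidate id voted for in currentTerm, -1 if none
}

// stepDown is the transition taken from any state on discovering a higher term.
func (s *Server) stepDown(higherTerm int) {
	s.currentTerm = higherTerm
	s.state = Follower
	s.votedFor = -1
}
```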

6. Terms
- Time divided into terms:
  - Election, then normal operation under a single leader
- At most 1 leader per term
- Some terms have no leader (failed election)
- Each server maintains current term value
- Key role of terms: identify obsolete information
[Figure: timeline split into Terms 1-5; each term begins with an election followed by normal operation, except for a term whose election ends in a split vote and has no leader]

8. Heartbeats and Timeouts
- Servers start up as followers
- Followers expect to receive RPCs from leaders or candidates
- Leaders must send heartbeats (empty AppendEntries RPCs) to maintain authority
- If electionTimeout elapses with no RPCs:
  - Follower assumes leader has crashed
  - Follower starts new election
- Timeouts typically 100-500ms
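
A sketch of the follower's timeout loop in Go, under the assumption of a channel-based RPC notification (rpcReceived) and a startElection hook; both names are invented for illustration:

```go
package raft

import (
	"math/rand"
	"time"
)

// electionTimeout picks a random timeout in [T, 2T], as the later election
// slide suggests; the value of T here is illustrative.
func electionTimeout() time.Duration {
	const T = 150 * time.Millisecond
	return T + time.Duration(rand.Int63n(int64(T)))
}

// runFollower sketches the follower loop: any RPC from a leader or candidate
// resets the timer; if electionTimeout elapses with no RPCs, assume the leader
// has crashed and start a new election.
func runFollower(rpcReceived <-chan struct{}, startElection func()) {
	timer := time.NewTimer(electionTimeout())
	defer timer.Stop()
	for {
		select {
		case <-rpcReceived: // heartbeat (empty AppendEntries) or other valid RPC
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(electionTimeout())
		case <-timer.C: // no contact from a leader: start a new election
			startElection()
			return
		}
	}
}
```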

9. Election Basics
When a follower starts an election, it:
- Increments its current term
- Changes to Candidate state
- Votes for itself
- Sends RequestVote RPCs to all other servers, retrying until either:
  - It receives votes from a majority of servers: becomes leader, sends AppendEntries heartbeats to all other servers
  - It receives an RPC from a valid leader: returns to follower state
  - No one wins the election (election timeout elapses): increments term, starts a new election
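
A simplified, sequential sketch of one election round in Go; requestVote is a hypothetical RPC stub, and a real implementation would issue the RPCs in parallel and retry:

```go
package raft

// RequestVoteReply is the vote response shape assumed for this sketch.
type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

// runCandidate sketches a single election round.
func runCandidate(currentTerm int, peers []string,
	requestVote func(peer string, term int) RequestVoteReply) (won bool, newTerm int) {

	newTerm = currentTerm + 1 // increment current term
	votes := 1                // vote for self

	for _, p := range peers {
		reply := requestVote(p, newTerm)
		if reply.Term > newTerm {
			return false, reply.Term // a higher term exists: step down to follower
		}
		if reply.VoteGranted {
			votes++
		}
	}
	clusterSize := len(peers) + 1 // peers plus self
	return votes > clusterSize/2, newTerm
}
```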

10. Elections, cont'd
- Safety: allow at most one winner per term
  - Each server gives out only one vote per term (persisted on disk)
  - Two different candidates can't accumulate majorities in the same term
- Liveness: some candidate must eventually win
  - Choose election timeouts randomly in [T, 2T]
  - One server usually times out and wins the election before others wake up
  - Works well if T >> broadcast time
[Figure: a set of servers that voted for candidate A forms a majority, so candidate B can't also get a majority]

11. Log Structure
- Log entry = index, term, command
- Log stored on stable storage (disk); survives crashes
- Entry committed if known to be stored on majority of servers
  - Durable, will eventually be executed by state machines
[Figure: the leader's log and several followers' logs, indexed 1-8; each entry shows its term and command (e.g., 1 add, 3 jmp, 1 cmp); the prefix replicated on a majority is marked as committed entries]
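
A Go sketch of this structure; the field and type names are illustrative:

```go
package raft

// LogEntry mirrors the slide's log structure: each entry records the index at
// which it sits, the term in which a leader created it, and the command to apply.
type LogEntry struct {
	Index   int
	Term    int
	Command []byte // opaque state machine command, e.g. "add", "jmp", "mov" in the figure
}

// Log is the ordered sequence of entries plus the commit point. A real
// implementation also persists this to stable storage so it survives crashes.
type Log struct {
	Entries     []LogEntry
	CommitIndex int // highest index known to be stored on a majority of servers
}
```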

12. Normal Operation
- Client sends command to leader
- Leader appends command to its log
- Leader sends AppendEntries RPCs to followers
- Once new entry committed:
  - Leader passes command to its state machine, returns result to client
  - Leader notifies followers of committed entries in subsequent AppendEntries RPCs
  - Followers pass committed commands to their state machines
- Crashed/slow followers? Leader retries RPCs until they succeed
- Performance is optimal in the common case: one successful RPC to any majority of servers
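
A sketch of what an AppendEntries request might carry to support this flow; the field names follow common Raft usage but are assumptions here, and LogEntry is the type sketched after slide 11:

```go
package raft

// AppendEntriesArgs sketches the request fields implied by the deck.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     string     // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of that entry (used by the consistency check)
	Entries      []LogEntry // new entries to replicate; empty for heartbeats
	LeaderCommit int        // leader's commit index, telling followers what is safe to apply
}
```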

13. Log Consistency
High level of coherency between logs:
- If log entries on different servers have the same index and term:
  - They store the same command
  - The logs are identical in all preceding entries
- If a given entry is committed, all preceding entries are also committed
[Figure: two logs that store identical entries (term and command) at every index they share]

14. AppendEntries Consistency Check
- Each AppendEntries RPC contains the index and term of the entry preceding the new ones
- Follower must contain a matching entry; otherwise it rejects the request
- Implements an induction step, ensures coherency
[Figure: two leader/follower pairs; AppendEntries succeeds when the follower has a matching entry at the preceding index, and fails when the terms at that index differ]
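
A sketch of the follower-side check in Go, reusing the LogEntry type sketched after slide 11; the function name is an assumption:

```go
package raft

// consistencyCheck is the test the slide describes: accept the AppendEntries
// request only if the local log contains an entry at prevLogIndex whose term is
// prevLogTerm. Indexes start at 1; prevLogIndex == 0 means the new entries start
// at the beginning of the log, so there is nothing to check.
func consistencyCheck(log []LogEntry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true
	}
	if prevLogIndex > len(log) {
		return false // log too short: no entry at that index
	}
	return log[prevLogIndex-1].Term == prevLogTerm
}
```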

15. Leader Changes
- At the beginning of a new leader's term:
  - Old leader may have left entries partially replicated
  - No special steps by the new leader: just start normal operation
  - Leader's log is "the truth"
  - Will eventually make followers' logs identical to the leader's
- Multiple crashes can leave many extraneous log entries
[Figure: logs of servers s1-s5 over indexes 1-8, each with a different length and mixture of terms after repeated crashes]

16. Safety Requirement
- Once a log entry has been applied to a state machine, no other state machine may apply a different value for that log entry
- Raft safety property: if a leader has decided that a log entry is committed, that entry will be present in the logs of all future leaders
- This guarantees the safety requirement, because:
  - Leaders never overwrite entries in their logs
  - Only entries in the leader's log can be committed
  - Entries must be committed before applying to state machine
[Figure: "Committed → present in future leaders' logs", enforced by restrictions on commitment and restrictions on leader election]

17. Picking the Best Leader
- Can't tell which entries are committed!
- During elections, choose the candidate with the log most likely to contain all committed entries:
  - Candidates include log info in RequestVote RPCs (index & term of last log entry)
  - Voting server V denies its vote if its log is "more complete":
    (lastTermV > lastTermC) || (lastTermV == lastTermC && lastIndexV > lastIndexC)
  - Leader will have the "most complete" log among the electing majority
[Figure: example logs where some servers are unavailable during the leader transition, so it is unknown whether the latest entry is committed]
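
The same comparison as a small Go helper; the function and parameter names are illustrative:

```go
package raft

// moreComplete implements the slide's comparison: voter V denies its vote if
// its log is "more complete" than candidate C's, i.e. V's last entry has a
// higher term, or the same term and a higher index.
func moreComplete(lastTermV, lastIndexV, lastTermC, lastIndexC int) bool {
	if lastTermV != lastTermC {
		return lastTermV > lastTermC
	}
	return lastIndexV > lastIndexC
}
```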

18. Committing Entry from Current Term
- Case #1 of 2: leader decides an entry from its current term is committed
- Safe: the leader for term 3 must contain entry 4
[Figure: servers s1-s5; the leader for term 2 has just had AppendEntries succeed on a majority for entry 4, so a server lacking that entry can't be elected as leader for term 3]

19. Committing Entry from Earlier Term
- Case #2 of 2: leader is trying to finish committing an entry from an earlier term
- Entry 3 is not safely committed:
  - s5 can be elected as leader for term 5
  - If elected, it will overwrite entry 3 on s1, s2, and s3!
[Figure: servers s1-s5; the leader for term 4 has just had AppendEntries succeed for the term-2 entry at index 3, yet s5, which holds conflicting term-3 entries, could still win the next election]

20. New Commitment Rules
- For a leader to decide an entry is committed:
  - It must be stored on a majority of servers
  - At least one new entry from the leader's term must also be stored on a majority of servers
- Once entry 4 is committed:
  - s5 cannot be elected leader for term 5
  - Entries 3 and 4 are both safe
- Combination of election rules and commitment rules makes Raft safe
[Figure: the leader for term 4 replicates a new term-4 entry at index 4 to a majority, which makes both entry 3 and entry 4 safe]
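
A Go sketch of this rule, reusing the LogEntry type sketched after slide 11; matchIndex is assumed leader-side bookkeeping (highest index known to be replicated on each server, including the leader itself):

```go
package raft

// canCommit sketches the new commitment rule: the leader may mark index n as
// committed only if the entry at n is from its current term and is stored on a
// majority of servers. Committing n also commits every earlier entry.
func canCommit(log []LogEntry, n, currentTerm int, matchIndex map[string]int, clusterSize int) bool {
	if n < 1 || n > len(log) || log[n-1].Term != currentTerm {
		return false // only entries from the leader's own term are committed directly
	}
	replicas := 0
	for _, idx := range matchIndex {
		if idx >= n {
			replicas++
		}
	}
	return replicas > clusterSize/2
}
```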

21. Log Inconsistencies
- Leader changes can result in log inconsistencies
[Figure: the log of the leader for term 8 (indexes 1-10) compared with possible followers (a)-(f); some followers have missing entries, others have extraneous entries from earlier terms]

22. Repairing Follower Logs
- New leader must make follower logs consistent with its own:
  - Delete extraneous entries
  - Fill in missing entries
- Leader keeps nextIndex for each follower:
  - Index of next log entry to send to that follower
  - Initialized to (1 + leader's last index)
- When AppendEntries consistency check fails, decrement nextIndex and try again
[Figure: the leader for term 7 backs nextIndex down until it reaches an index where a follower's log matches; followers (a) and (b) show missing and extraneous entries respectively]
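
A sketch of the backoff loop in Go; sendAppendEntries is a hypothetical stub standing in for the RPC, and LogEntry is the type sketched after slide 11:

```go
package raft

// repairFollower sketches the nextIndex backoff. sendAppendEntries returns true
// when the follower's consistency check passes; once it does, the entries from
// prevIndex+1 onward overwrite whatever the follower had there.
func repairFollower(leaderLog []LogEntry,
	sendAppendEntries func(prevIndex, prevTerm int, entries []LogEntry) bool) {

	nextIndex := len(leaderLog) + 1 // initialized to 1 + leader's last index
	for nextIndex > 0 {
		prevIndex := nextIndex - 1
		prevTerm := 0 // prevIndex == 0 means "start of log" and always matches
		if prevIndex > 0 {
			prevTerm = leaderLog[prevIndex-1].Term
		}
		if sendAppendEntries(prevIndex, prevTerm, leaderLog[prevIndex:]) {
			return // follower's log now matches the leader's
		}
		nextIndex-- // consistency check failed: back up one entry and retry
	}
}
```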

23. Repairing Logs, cont'd
- When a follower overwrites an inconsistent entry, it deletes all subsequent entries
[Figure: before the repair, the follower holds extraneous entries from earlier terms; after the leader's entry at nextIndex is accepted, the conflicting entry and everything after it are gone]

24. Neutralizing Old Leaders
- Deposed leader may not be dead:
  - Temporarily disconnected from the network
  - Other servers elect a new leader
  - Old leader becomes reconnected, attempts to commit log entries
- Terms are used to detect stale leaders (and candidates):
  - Every RPC contains the term of the sender
  - If the sender's term is older, the RPC is rejected and the sender reverts to follower and updates its term
  - If the receiver's term is older, it reverts to follower, updates its term, then processes the RPC normally
- Election updates the terms of a majority of servers
  - Deposed server cannot commit new log entries
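
The term check as a small Go helper; the function shape is an assumption about how an implementation might factor it:

```go
package raft

// checkTerm sketches the comparison every RPC performs on arrival. It reports
// whether the RPC should be processed, the term the receiver should now use,
// and whether the receiver must first revert to follower.
func checkTerm(localTerm, senderTerm int) (process bool, newTerm int, stepDown bool) {
	switch {
	case senderTerm < localTerm:
		return false, localTerm, false // stale sender: reject; sender will step down
	case senderTerm > localTerm:
		return true, senderTerm, true // receiver is behind: adopt the term, become follower
	default:
		return true, localTerm, false
	}
}
```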

25. Client Protocol
- Send commands to the leader:
  - If the leader is unknown, contact any server
  - If the contacted server is not the leader, it will redirect to the leader
- Leader does not respond until the command has been logged, committed, and executed by the leader's state machine
- If the request times out (e.g., leader crash):
  - Client reissues the command to some other server
  - Eventually redirected to the new leader
  - Retry the request with the new leader
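
A client-side sketch in Go of this contact/redirect/retry loop; the submit stub and its return shape are invented for illustration:

```go
package raft

// sendCommand tries any server, follows redirects to the leader, and reissues
// the command elsewhere on timeout.
func sendCommand(servers []string, cmd []byte,
	submit func(server string, cmd []byte) (result []byte, leaderHint string, err error)) []byte {

	target := servers[0] // leader unknown: start with any server
	for attempt := 1; ; attempt++ {
		result, leaderHint, err := submit(target, cmd)
		switch {
		case err != nil:
			target = servers[attempt%len(servers)] // timeout: try some other server
		case leaderHint != "":
			target = leaderHint // contacted server was not the leader: follow redirect
		default:
			return result // leader logged, committed, and executed the command
		}
	}
}
```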

26. Client Protocol, cont'd
- What if the leader crashes after executing a command, but before responding?
  - Must not execute the command twice
- Solution: client embeds a unique id in each command
  - Server includes the id in the log entry
  - Before accepting a command, the leader checks its log for an entry with that id
  - If the id is found in the log, ignore the new command and return the response from the old command
- Result: exactly-once semantics as long as the client doesn't crash
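
A leader-side sketch of the duplicate check in Go; the types, maps, and hook are assumptions about how the id lookup might be kept:

```go
package raft

// ClientCommand pairs a command with the client-chosen unique id the slide describes.
type ClientCommand struct {
	ID      string
	Command []byte
}

// acceptCommand skips re-execution of a command whose id already appears in the
// log; seen and responses are assumed leader state rebuilt from the log.
func acceptCommand(cmd ClientCommand, seen map[string]bool, responses map[string][]byte,
	appendToLog func(ClientCommand)) (response []byte, duplicate bool) {

	if seen[cmd.ID] {
		return responses[cmd.ID], true // ignore the new command, reply with the old result
	}
	seen[cmd.ID] = true
	appendToLog(cmd)
	return nil, false // result arrives once the entry commits and executes
}
```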

27. Configuration Changes
- System configuration:
  - ID, address for each server
  - Determines what constitutes a majority
- Consensus mechanism must support changes in the configuration:
  - Replace failed machine
  - Change degree of replication

28. Configuration Changes, cont'd
- Cannot switch directly from one configuration to another: conflicting majorities could arise
[Figure: servers 1-5 switch from C_old to C_new at different times; during the transition a majority of C_old and a majority of C_new could coexist and make conflicting decisions]

29. Joint Consensus
Raft uses a 2-phase approach:
- Intermediate phase uses joint consensus (need majority of both old and new configurations for elections, commitment)
- Configuration change is just a log entry; applied immediately on receipt (committed or not)
- Once joint consensus is committed, begin replicating log entry for final configuration
[Figure: timeline from C_old through C_old+new to C_new; C_old can make unilateral decisions before the C_old+new entry is committed, and C_new can make unilateral decisions only after the C_new entry is committed]
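
A Go sketch of the joint-consensus majority test; the function names and the representation of a configuration as a list of server ids are assumptions:

```go
package raft

// jointMajority sketches the rule for the C_old+new phase: an election or
// commitment decision needs a majority of the old configuration AND a majority
// of the new one. agreed holds the ids of servers that voted for or
// acknowledged the decision.
func jointMajority(agreed map[string]bool, oldConfig, newConfig []string) bool {
	return majorityOf(agreed, oldConfig) && majorityOf(agreed, newConfig)
}

func majorityOf(agreed map[string]bool, config []string) bool {
	count := 0
	for _, id := range config {
		if agreed[id] {
			count++
		}
	}
	return count > len(config)/2
}
```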