TCP传输协议,分布式决策,RPC的介绍和应用

本文介绍TCP传输协议,分布式决策,RPC及应用。TCP(Transmission Control Protocol 传输控制协议)是一种面向连接的、可靠的、基于字节流的传输层通信协议。RPC(Remote Procedure Call)—远程过程调用,它是一种通过网络从远程计算机程序上请求服务,而不需要了解底层网络技术的协议。它们主要应用在远程调用和传输方面。
展开查看详情

1. CS162 Goals of Today’s Lecture Operating Systems and •  TCP flow control Systems Programming Lecture 22 •  Two-Phase Commit TCP Flow Control, Distributed Decision Making, RPC November 13th, 2017 Prof. Ion Stoica http://cs162.eecs.Berkeley.edu 11/13/2017 CS162 © UCB Fall 2017 Lec 21.2 Flow Control TCP Flow Control •  Recall: Flow control ensures a fast sender does not overwhelm a •  TCP: sliding window protocol at byte (not packet) level slow receiver •  Example: Producer-consumer with bounded buffer (Lecture 5) •  Receiver tells sender how many more bytes it can receive –  A buffer between producer and consumer without overflowing its buffer (i.e., AdvertisedWindow) –  Producer puts items into buffer as long as buffer not full –  Consumer consumes items from buffer •  The ack(nowledgement) contains sequence number N of next byte the receiver expects, i.e., receiver has received all bytes in sequence up to and including N-1 buffer Produ- Con- cer sumer 11/13/2017 CS162 © UCB Fall 2017 Lec 21.3 11/13/2017 CS162 © UCB Fall 2017 Lec 21.4 Page 1

2. TCP Flow Control TCP Flow Control Sending Process Receiving Process Receiving Process Sending Process TCP layer 1 3 TCP layer OS OS (TCP/IP) OS (TCP/IP) IP layer IP layer •  TCP/IP implemented by OS (Kernel) 2 –  Cannot do context switching on sending/receiving every packet •  Three pairs of producer-consumer’s »  At 10Gbps, it takes 1.2 usec to send an 1500 bytes, and ①  sending process à sending TCP 80nsec to send an 100 byte packet ②  Sending TCP à receiving TCP •  Need buffers to match … ③  receiving TCP à receiving process –  sending app with sending TCP –  receiving TCP with receiving app 11/13/2017 CS162 © UCB Fall 2017 Lec 21.5 11/13/2017 CS162 © UCB Fall 2017 Lec 21.6 TCP Flow Control Circular Buffer •  Assume Sending Process Receiving Process –  A buffer of size N –  A stream of bytes, where bytes have increasing sequence numbers TCP layer TCP layer »  Think of stream as an unbounded array of bytes and of sequence number as 300 bytes indexes in this array OS •  Buffer stores at most N consecutive bytes from the stream IP layer IP layer •  Byte k stored at position (k mod N) + 1 in the buffer buffered data sequence # •  Example assumptions: 27 28 29 30 31 32 33 34 35 36 –  Maximum IP packet size = 100 bytes H E L L O W O R L –  Size of the receiving buffer (MaxRcvBuf) = 300 bytes (28 mod 10) + 1 = 9 (35 mod 10) + 1 = 6 •  Recall, ack indicates the next expected byte in-sequence, not the last Circular buffer received byte (N = 10) L O W O R E L 1 2 3 4 5 6 7 8 9 10 •  Use circular buffers end start 11/13/2017 CS162 © UCB Fall 2017 Lec 21.7 11/13/2017 CS162 © UCB Fall 2017 Lec 21.8 Page 2

3. TCP Flow Control TCP Flow Control Sending Process Receiving Process Sending Process Receiving Process LastByteWritten(0) LastByteWritten(0) LastByteSent(0) •  LastByteWritten: last byte written by sending process •  LastByteWritten: last byte written by sending process •  LastByteSent: last byte sent by sender to receiver 11/13/2017 CS162 © UCB Fall 2017 Lec 21.9 11/13/2017 CS162 © UCB Fall 2017 Lec 21.10 TCP Flow Control TCP Flow Control Sending Process Receiving Process Sending Process Receiving Process LastByteWritten(0) LastByteWritten(0) LastByteAcked(0) LastByteSent(0) LastByteAcked(0) LastByteSent(0) LastByteRcvd(0) •  LastByteWritten: last byte written by sending process •  LastByteWritten: last byte written by sending process •  LastByteSent: last byte sent by sender to receiver •  LastByteSent: last byte sent by sender to receiver •  LastByteAcked: last ack received by sender from receiver •  LastByteAcked: last ack received by sender from receiver •  LastByteRcvd: last byte received by receiver from sender 11/13/2017 CS162 © UCB Fall 2017 Lec 21.11 11/13/2017 CS162 © UCB Fall 2017 Lec 21.12 Page 3

4. TCP Flow Control TCP Flow Control Sending Process Receiving Process Sending Process Receiving Process LastByteWritten(0) LastByteWritten(0) LastByteRead(0) LastByteAcked(0) LastByteSent(0) LastByteRcvd(0) NextByteExpected(1) LastByteAcked(0) LastByteSent(0) LastByteRcvd(0) NextByteExpected(1) •  LastByteWritten: last byte written by sending process •  LastByteWritten: last byte written by sending process •  LastByteSent: last byte sent by sender to receiver •  LastByteSent: last byte sent by sender to receiver •  LastByteAcked: last ack received by sender from receiver •  LastByteAcked: last ack received by sender from receiver •  LastByteRcvd: last byte received by receiver from sender •  LastByteRcvd: last byte received by receiver from sender •  NextByteExpected: last in-sequence byte expected by receiver •  NextByteExpected: last in-sequence byte expected by receiver •  LastByteRead: last byte read by the receiving process 11/13/2017 CS162 © UCB Fall 2017 Lec 21.13 11/13/2017 CS162 © UCB Fall 2017 Lec 21.14 TCP Flow Control TCP Flow Control Receiving Process Sending Process Receiving Process LastByteRead LastByteWritten LastByteRead MaxRcvBuffer MaxSendBuffer MaxRcvBuffer NextByteExpected LastByteRcvd LastByteAcked LastByteSent NextByteExpected LastByteRcvd •  AdvertisedWindow: number of bytes TCP receiver can receive •  AdvertisedWindow: number of bytes TCP receiver can receive AdvertisedWindow = MaxRcvBuffer – (LastByteRcvd – LastByteRead) AdvertisedWindow = MaxRcvBuffer – (LastByteRcvd – LastByteRead) •  SenderWindow: number of bytes TCP sender can send SenderWindow = AdvertisedWindow – (LastByteSent – LastByteAcked) 11/13/2017 CS162 © UCB Fall 2017 Lec 21.15 11/13/2017 CS162 © UCB Fall 2017 Lec 21.16 Page 4

5. TCP Flow Control TCP Flow Control Sending Process Receiving Process Sending Process Receiving Process LastByteWritten LastByteRead LastByteWritten LastByteRead MaxSendBuffer MaxRcvBuffer MaxSendBuffer MaxRcvBuffer LastByteAcked LastByteSent NextByteExpected LastByteRcvd LastByteAcked LastByteSent NextByteExpected LastByteRcvd •  Still true if receiver missed data…. •  Still true if receiver missed data…. AdvertisedWindow = MaxRcvBuffer – (LastByteRcvd – LastByteRead) AdvertisedWindow = MaxRcvBuffer – (LastByteRcvd – LastByteRead) •  WriteWindow: number of bytes sending process can write WriteWindow = MaxSendBuffer – (LastByteWritten – LastByteAcked) 11/13/2017 CS162 © UCB Fall 2017 Lec 21.17 11/13/2017 CS162 © UCB Fall 2017 Lec 21.18 TCP Flow Control TCP Flow Control Sending Process Receiving Process Sending Process Receiving Process LastByteWritten(350) LastByteRead(0) LastByteWritten(350) LastByteRead(0) 1, 1, 1, 350 1, 101, 350 350 100 100 LastByteAcked(0) LastByteSent(0) LastByteRcvd(0) NextByteExpected(1) LastByteAcked(0) LastByteSent(100) LastByteRcvd(100) NextByteExpected(101) {[1,100]} Data[1,100] {[1,100]} •  Sending app sends 350 bytes •  Recall: –  We assume IP only accepts packets no larger than 100 bytes –  MaxRcvBuf = 300 bytes, so initial Advertised Window = 300 byets Sender sends first packet (i.e., first 100 bytes) time and receiver gets the packet 11/13/2017 CS162 © UCB Fall 2017 Lec 21.19 11/13/2017 CS162 © UCB Fall 2017 Lec 21.20 Page 5

6. TCP Flow Control TCP Flow Control Sending Process Receiving Process Sending Process Receiving Process LastByteWritten(350) LastByteRead(0) LastByteWritten(350) LastByteRead(0) 1, 1, 1, 101, 1, 101, 1, 101, 350 350 101,201, 350350 100 100 100 200 100 200 LastByteAcked(0) LastByteSent(100) LastByteRcvd(100) NextByteExpected(101) LastByteAcked(0) LastByteSent(200) LastByteRcvd(200) NextByteExpected(201) {[1,100]} Data[1,100] {[1,100]} Data[1,100] {[1,100]} {[1,100]} {[1,200]} Data[101,200] = 200 {[1,200]} 01, Ad vWin = 200 Ack=1 01, A dvWin Ack=1 Receiver sends ack for 1st packet AdvWin = MaxRcvBuffer – (LastByteRcvd – LastByteRead) Sender sends 2nd packet (i.e., next 100 bytes) = 300 – (100 – 0) = 200 and receiver gets the packet 11/13/2017 CS162 © UCB Fall 2017 Lec 21.21 11/13/2017 CS162 © UCB Fall 2017 Lec 21.22 TCP Flow Control TCP Flow Control Sending Process Receiving Process Sending Process Receiving Process 1, 100 LastByteWritten(350) LastByteRead(0) LastByteWritten(350) LastByteRead(100) 101, 1, 200 101,201, 350350 1, 200 1, 200 101,201, 350350 200 LastByteAcked(0) LastByteSent(200) LastByteRcvd(200) NextByteExpected(201) LastByteAcked(0) LastByteSent(200) LastByteRcvd(200) NextByteExpected(201) {[1,100]} Data[1,100] {[1,100]} Data[1,100] {[1,100]} {[1,100]} {[1,200]} Data[101,200] {[1,200]} Data[101,200] {[1,200]} {[1,200]} 00 00 in = 2 in = 2 01 , AdvW 01 , AdvW Ack=1 Ack=1 Sender sends 2nd packet (i.e., next 100 bytes) Receiving TCP delivers first 100 bytes to and receiver gets the packet recienving process 11/13/2017 CS162 © UCB Fall 2017 Lec 21.23 11/13/2017 CS162 © UCB Fall 2017 Lec 21.24 Page 6

7. TCP Flow Control TCP Flow Control Sending Process Receiving Process Sending Process Receiving Process LastByteWritten(350) LastByteRead(100) LastByteWritten(350) LastByteRead(100) 101, 201, 301, 101, 1, 200 101,201, 350350 1, 200 101,201, 350350 200 300 350 200 LastByteAcked(0) LastByteSent(200) LastByteRcvd(200) NextByteExpected(201) LastByteAcked(0) LastByteSent(300) LastByteRcvd(200) NextByteExpected(201) {[1,100]} Data[1,100] {[1,100]} Data[1,100] {[1,100]} {[1,100]} {[1,200]} Data[101,200] {[1,200]} Data[101,200] {[1,200]} {[1,200]} = 200 {[1,300]} Data[201,300] dvWin 00 0 1 , A in = 2 Ack=1 201, AdvW Ack= Receiver sends ack for 2nd packet AdvWin = MaxRcvBuffer – (LastByteRcvd – LastByteRead) Sender sends 3rd packet (i.e., next 100 bytes) = 300 – (200 – 100) = 200 and the packet is lost 11/13/2017 CS162 © UCB Fall 2017 Lec 21.25 11/13/2017 CS162 © UCB Fall 2017 Lec 21.26 TCP Flow Control TCP Flow Control Sending Process Receiving Process Sending Process Receiving Process LastByteWritten(350) LastByteRead(100) LastByteWritten(350) LastByteRead(100) 301, 101, 301, 101, 1,300101,201, 350350 1,300101,201, 350350 350 200 350 200 LastByteAcked(0) LastByteSent(300) LastByteRcvd(200) NextByteExpected(201) LastByteAcked(0) LastByteSent(300) LastByteRcvd(200) NextByteExpected(201) {[1,100]} Data[1,100] {[1,100]} Data[1,100] {[1,100]} {[1,100]} {[1,200]} Data[101,200] {[1,200]} Data[101,200] {[1,200]} {[1,200]} {[1,300]} Data[201,300] {[1,300]} Data[201,300] Ack=101, AdvWin = 200 Sender stops sending as window full •  Sender gets ack for 1st packet SndWin = AdvWin – (LastByteSent – LastByteAcked) •  AdWin = 200 = 300 – (300 – 0) = 0 11/13/2017 CS162 © UCB Fall 2017 Lec 21.27 11/13/2017 CS162 © UCB Fall 2017 Lec 21.28 Page 7

8. TCP Flow Control TCP Flow Control Sending Process Receiving Process Sending Process Receiving Process LastByteWritten(350) LastByteRead(100) LastByteWritten(350) LastByteRead(100) 301, 101, 301, 101, 101,300 101,201, 350350 101,300 101,201, 350350 350 200 350 200 LastByteAcked(100) LastByteSent(300) LastByteRcvd(200) NextByteExpected(201) LastByteAcked(100) LastByteSent(300) LastByteRcvd(200) NextByteExpected(201) {[1,100]} Data[1,100] {[1,100]} Data[1,100] {[1,100]} {[1,100]} {[1,200]} Data[101,200] {[1,200]} Data[101,200] {[1,200]} {[1,200]} {[1,300]} Data[201,300] {[1,300]} Data[201,300] {101, 300} Ack=101, AdvWin = 200 {101, 300} Ack=101, AdvWin = 200 •  Ack for 1st packet (ack indicates next byte Sender still cannot send as window full expected by receiver) SndWin = AdvWin – (LastByteSent – LastByteAcked) •  Receiver no longer needs first 100 bytes = 200 – (300 –CS162 100) =0 11/13/2017 CS162 © UCB Fall 2017 Lec 21.29 11/13/2017 © UCB Fall 2017 Lec 21.30 TCP Flow Control TCP Flow Control Sending Process Receiving Process Sending Process Receiving Process LastByteWritten(350) LastByteRead(100) LastByteWritten(350) LastByteRead(100) 301, 101, 201, 301, 101, 101,300 101,201, 350350 101, 201, 350 350 350 200 300 350 200 LastByteAcked(100) LastByteSent(300) LastByteRcvd(200) NextByteExpected(201) LastByteAcked(200) LastByteSent(300) LastByteRcvd(200) NextByteExpected(201) {[1,100]} Data[1,100] {[1,100]} Data[1,100] {[1,100]} {[1,100]} {[1,200]} Data[101,200] {[1,200]} Data[101,200] {[101,200]} {[101,200]} {[1,300]} Data[201,300] {[1,300]} Data[201,300] {101, 300} {101, 300} {201, 300} Ack=201, AdvWin = 200 {201, 300} Ack=201, AdvWin = 200 •  Receiver gets ack for 2nd packet Sender can now send new data! •  AdvWin = 200 bytes SndWin = AdvWin – (LasByteSent – LastByteAcked) = 100 11/13/2017 CS162 © UCB Fall 2017 Lec 21.31 11/13/2017 CS162 © UCB Fall 2017 Lec 21.32 Page 8

9. TCP Flow Control TCP Flow Control Sending Process Receiving Process Sending Process Receiving Process LastByteWritten(350) LastByteRead(100) LastByteWritten(350) LastByteRead(100) 201, 301, 101, 301, 201, 301, 101, 301, 101, 201, 350 350 101, 201, 350 350 300 350350 200 350 300 350350 200 350 LastByteAcked(200) LastByteSent(350) LastByteRcvd(350) NextByteExpected(201) LastByteAcked(200) LastByteSent(350) LastByteRcvd(350) NextByteExpected(201) {[1,100]} Data[1,100] {[1,100]} Data[1,100] {[1,100]} {[1,100]} {[1,200]} Data[101,200] {[1,200]} Data[101,200] {[101,200]} {[101,200]} {[1,300]} Data[201,300] {[1,300]} Data[201,300] {101, 300} {101, 300} {[201,350]} Data[301,350] {[201,350]} Data[301,350] {[101,200],[301,350]} {[101,200],[301,350]} {201, 350} Ack=201, AdvWin = 50 11/13/2017 CS162 © UCB Fall 2017 Lec 21.33 11/13/2017 CS162 © UCB Fall 2017 Lec 21.34 TCP Flow Control TCP Flow Control Sending Process Receiving Process Sending Process Receiving Process LastByteWritten(350) LastByteRead(100) LastByteWritten(350) LastByteRead(100) 201, 301, 101, 301, 201, 301, 101, 301, 101, 201, 350 350 101, 201, 350 350 300 350350 200 350 300 350350 200 350 LastByteAcked(200) LastByteSent(350) LastByteRcvd(350) NextByteExpected(201) LastByteAcked(200) LastByteSent(350) LastByteRcvd(350) NextByteExpected(201) {[201,350]} Data[301,350] {[101,200],[301,350]} {201, 350} Ack=201, AdvWin = 50 {[201,350]} Data[301,350] {[101,200],[301,350]} • {201, Ack350} still specifies 201 (first Ack=201, AdvWinbyte= 50 out of sequence) •  Ack still specifies 201 (first byte out of sequence) •  AdvWin = 50, so can sender re-send 3rd packet? •  AdvWin = 50, so can sender re-send 3rd packet? 11/13/2017 CS162 © UCB Fall 2017 Lec 21.35 11/13/2017 CS162 © UCB Fall 2017 Lec 21.36 Page 9

10. TCP Flow Control TCP Flow Control Sending Process Receiving Process Sending Process Receiving Process LastByteWritten(350) LastByteRead(100) LastByteWritten(350) LastByteRead(100) 201, 301, 101, 201, 301, 201, 301, 101, 201, 350 350 101, 201, 350 350 101, 350 300 350350 200 300 350 300 350350 LastByteAcked(200) LastByteSent(350) LastByteRcvd(350) NextByteExpected(351) LastByteAcked(200) LastByteSent(350) LastByteRcvd(350) NextByteExpected(351) {[201,350]} Data[301,350] {[201,350]} Data[301,350] {[101,200],[301,350]} {[101,200],[301,350]} {201, 350} Ack=201, AdvWin = 50 {201, 350} Ack=201, AdvWin = 50 {[201,350]} Data[201,300] {[201,350]} Data[201,300] {[101,350]} {[101,350]} Yes! Sender can re-send 2nd packet since it’s in existing window Yes! Sender can re-send 2nd packet since it’s in existing window – won’t cause receiver window to grow – won’t cause receiver window to grow 11/13/2017 CS162 © UCB Fall 2017 Lec 21.37 11/13/2017 CS162 © UCB Fall 2017 Lec 21.38 TCP Flow Control TCP Flow Control Sending Process Receiving Process Sending Process Receiving Process LastByteWritten(350) LastByteRead(100) LastByteWritten(350) LastByteRead(100) 201, 301, 101, 350 101, 350 300 350 LastByteAcked(200) LastByteSent(350) LastByteRcvd(350) NextByteExpected(351) LastByteAcked(350) LastByteSent(350) LastByteRcvd(350) NextByteExpected(351) {[201,350]} Data[301,350] {[201,350]} Data[301,350] {[101,200],[301,350]} {[101,200],[301,350]} {201, 350} Ack=201, AdvWin = 50 {201, 350} Ack=201, AdvWin = 50 {[201,350]} Data[201,300] {[201,350]} Data[201,300] {[101,350]} {[101,350]} {} Ack=351, AdvWin = 50 {} Ack=351, AdvWin = 50 •  Sender gets 3rd packet and sends Ack for 351 •  AdvWin = 50 Sender DONE with sending all bytes! 11/13/2017 CS162 © UCB Fall 2017 Lec 21.39 11/13/2017 CS162 © UCB Fall 2017 Lec 21.40 Page 10

11. Discussion Administrivia •  Why not have a huge buffer at the receiver (memory is •  Midterm 3 coming up on Wen 11/29 6:30-8PM cheap!)? –  All topics up to and including Lecture 24 »  Focus will be on Lectures 17 – 24 and associated readings, •  Sending window (SndWnd) also depends on network and Projects 3 congestion »  But expect 20-30% questions from materials from –  Congestion control: ensure that a fast sender doesn’t Lectures 1-16 overwhelm a router in the network (discussed in detail in –  Closed book cs168) –  2 sides hand-written notes both sides •  In practice there is another set of buffers in the protocol stack, at the link layer (i.e., Network Interface Card) 11/13/2017 CS162 © UCB Fall 2017 Lec 21.41 11/13/2017 CS162 © UCB Fall 2017 Lec 21.42 Goals of Today’s Lecture •  TCP flow control •  Two-Phase Commit BREAK 11/13/2017 CS162 © UCB Fall 2017 Lec 21.43 11/13/2017 CS162 © UCB Fall 2017 Lec 21.44 Page 11

12. General’s Paradox General’s Paradox •  Constraints of problem: •  Can messages over an unreliable network be used to guarantee two – Two generals, on separate mountains entities do something simultaneously? – Can only communicate via messengers –  Remarkably, “no”, even if all messages get through – Messengers can be captured 11 am ok? •  Problem: need to coordinate attack Yes, 11 works – If they attack at different times, they all die So, 11 it is ? – If they attack at same time, they win you Yeah, but what if •  Named after Custer, who died at Little Big Horn because he k? Don’t get this ac arrived a couple of days too early –  No way to be sure last message gets through! 11/13/2017 CS162 © UCB Fall 2017 Lec 21.45 11/13/2017 CS162 © UCB Fall 2017 Lec 21.46 Two-Phase Commit Two-Phase Commit Protocol •  Persistent stable log on each machine: keep track of whether commit •  Since we can’t solve the General’s Paradox (i.e. simultaneous has happened action), let’s solve a related problem –  If a machine crashes, when it wakes up it first checks its log to recover state of world at time of crash •  Distributed transaction: Two or more machines agree to do •  Prepare Phase: something, or not do it, atomically –  The global coordinator requests that all participants will promise to commit or rollback the transaction –  Participants record promise in log, then acknowledge •  Two-Phase Commit protocol: Developed by Turing Award –  If anyone votes to abort, coordinator writes "Abort" in its log and tells winner Jim Gray (first Berkeley CS PhD, 1969) everyone to abort; each records "Abort" in log •  Commit Phase: –  After all participants respond that they are prepared, then the coordinator writes "Commit" to its log –  Then asks all nodes to commit; they respond with ACK –  After receive ACKs, coordinator writes "Got Commit" to log •  Log used to guarantee that all machines either commit or don’t 11/13/2017 CS162 © UCB Fall 2017 Lec 21.47 11/13/2017 CS162 © UCB Fall 2017 Lec 21.48 Page 12

13. 2PC Algorithm Detailed Algorithm Coordinator Algorithm Worker Algorithm •  One coordinator •  N workers (replicas) Coordinator sends VOTE-REQ to all workers •  High level algorithm description: –  Wait for VOTE-REQ from coordinator –  Coordinator asks all workers if they can commit –  If ready, send VOTE-COMMIT to –  If all workers reply “VOTE-COMMIT”, then coordinator broadcasts coordinator “GLOBAL-COMMIT” –  If not ready, send VOTE-ABORT to Otherwise coordinator broadcasts “GLOBAL-ABORT” –  If receive VOTE-COMMIT from all N coordinator –  Workers obey the GLOBAL messages workers, send GLOBAL-COMMIT to –  And immediately abort all workers •  Use a persistent, stable log on each machine to keep track of what –  If doesn’t receive VOTE-COMMIT you are doing from all N workers, send GLOBAL- –  If a machine crashes, when it wakes up it first checks its log to recover ABORT to all workers state of world at time of crash –  If receive GLOBAL-COMMIT then commit –  If receive GLOBAL-ABORT then abort 11/13/2017 CS162 © UCB Fall 2017 Lec 21.49 11/13/2017 CS162 © UCB Fall 2017 Lec 21.50 Failure Free Example Execution State Machine of Coordinator coordinator •  Coordinator implements simple state machine: VOTE- GLOBAL- REQ COMMIT worker 1 INIT Recv: START worker 2 Send: VOTE-REQ VOTE- WAIT COMMIT Recv: VOTE-ABORT Recv: all VOTE-COMMIT worker 3 Send: GLOBAL-COMMIT Send: GLOBAL-ABORT time ABORT COMMIT 11/13/2017 CS162 © UCB Fall 2017 Lec 21.51 11/13/2017 CS162 © UCB Fall 2017 Lec 21.52 Page 13

14. State Machine of Workers Dealing with Worker Failures •  Failure only affects states in which the coordinator is waiting for messages •  Coordinator only waits for votes in “WAIT” state •  In WAIT, if doesn’t receive N votes, it times out and INIT sends GLOBAL-ABORT Recv: VOTE-REQ INIT Recv: VOTE-REQ Send: VOTE-ABORT Send: VOTE-COMMIT Recv: START READY Send: VOTE-REQ Recv: GLOBAL- WAIT Recv: GLOBAL-COMMIT ABORT Recv: VOTE-ABORT Recv: VOTE-COMMIT Send: GLOBAL-ABORT Send: GLOBAL-COMMIT ABORT COMMIT ABORT COMMIT 11/13/2017 CS162 © UCB Fall 2017 Lec 21.53 11/13/2017 CS162 © UCB Fall 2017 Lec 21.54 Example of Worker Failure Dealing with Coordinator Failure INIT •  Worker waits for VOTE-REQ in INIT WAIT –  Worker can time out and abort (coordinator handles it) •  Worker waits for GLOBAL-* message in READY coordinator ABORT COMM timeout –  If coordinator fails, workers must BLOCK waiting for coordinator GLOBAL- to recover and send GLOBAL_* message VOTE-REQ ABORT worker 1 INIT Recv: VOTE-REQ Recv: VOTE-REQ VOTE- Send: VOTE-ABORT Send: VOTE-COMMIT worker 2 COMMIT READY Recv: GLOBAL-ABORT Recv: GLOBAL-COMMIT worker 3 time ABORT COMMIT 11/13/2017 CS162 © UCB Fall 2017 Lec 21.55 11/13/2017 CS162 © UCB Fall 2017 Lec 21.56 Page 14

15. Example of Coordinator Failure #1 Example of Coordinator Failure #2 INIT INIT READY READY ABORT COMM coordinator ABORT COMM coordinator restarted VOTE- REQ timeout VOTE-REQ worker 1 worker 1 VOTE- ABORT VOTE- GLOBAL- worker 2 timeout worker 2 COMMIT ABORT timeout block waiting for worker 3 coordinator worker 3 11/13/2017 CS162 © UCB Fall 2017 Lec 21.57 11/13/2017 CS162 © UCB Fall 2017 Lec 21.58 Durability Blocking for Coordinator to Recover •  All nodes use stable storage to store current state •  A worker waiting for global decision can ask fellow workers about their state –  stable storage is non-volatile storage (e.g. backed by disk) that guarantees atomic writes. –  If another worker is in ABORT or INIT COMMIT state then coordinator Recv: VOTE-REQ Recv: VOTE-REQ must have sent GLOBAL-* Send: VOTE-ABORT Send: VOTE-COMMIT •  Upon recovery, it can restore state and resume: »  Thus, worker can safely READY –  Coordinator aborts in INIT, WAIT, or ABORT abort or commit, respectively Recv: GLOBAL-ABORT Recv: GLOBAL-COMMIT –  Coordinator commits in COMMIT ABORT COMMIT –  Worker aborts in INIT, ABORT –  If another worker is still in INIT state then both workers –  Worker commits in COMMIT can decide to abort –  Worker asks Coordinator in READY –  If all workers are in ready, need to BLOCK (don’t know if coordinator wanted to abort or commit) 11/13/2017 CS162 © UCB Fall 2017 Lec 21.59 11/13/2017 CS162 © UCB Fall 2017 Lec 21.60 Page 15

16. Distributed Decision Making Discussion (1/2) Distributed Decision Making Discussion (2/2) •  Why is distributed decision making desirable? •  Undesirable feature of Two-Phase Commit: Blocking –  Fault Tolerance! –  One machine can be stalled until another site recovers: –  A group of machines can come to a decision even if one or more »  Site B writes "prepared to commit" record to its log, of them fail during the process sends a "yes" vote to the coordinator (site A) and crashes »  Simple failure mode called “failstop” (different modes later) »  Site A crashes –  After decision made, result recorded in multiple places »  Site B wakes up, check its log, and realizes that it has voted "yes" on the update. It sends a message to site A asking what happened. At this point, B cannot decide to abort, because update may have committed »  B is blocked until A comes back –  A blocked site holds resources (locks on updated items, pages pinned in memory, etc) until learns fate of update 11/13/2017 CS162 © UCB Fall 2017 Lec 21.61 11/13/2017 CS162 © UCB Fall 2017 Lec 21.62 PAXOS Byzantine General’s Problem •  PAXOS: An alternative used by Google and others that does not Lieutenant have this blocking problem ! Atta ack Att Atta ck! –  Develop by Leslie Lamport (Turing Award Winner) ck! Attack! Att Retreat! •  What happens if one or more of the nodes is malicious? ack ! Attack! Lieutenant –  Malicious: attempting to compromise the decision making at! Retre General ! Att ck a Lieutenant Malicious! •  Byazantine General’s Problem (n players): –  One General and n-1 Lieutenants –  Some number of these (f) can be insane or malicious •  The commanding general must send an order to his n-1 lieutenants such that the following Integrity Constraints apply: –  IC1: All loyal lieutenants obey the same order –  IC2: If the commanding general is loyal, then all loyal lieutenants obey the order he sends 11/13/2017 CS162 © UCB Fall 2017 Lec 21.63 11/13/2017 CS162 © UCB Fall 2017 Lec 21.64 Page 16

17. Byzantine General’s Problem (con’t) Summary •  Impossibility Results: –  Cannot solve Byzantine General’s Problem with n=3 because one •  TCP flow control malicious player can mess up things –  Ensures a fast sender does not overwhelm a slow receiver –  Receiver tells sender how many more bytes it can receive without General General overflowing its buffer (i.e., AdvertisedWindow) Attack! Attack! Attack! Retreat! –  The ack(nowledgement) contains sequence number N of next byte the Lieutenant Lieutenant Lieutenant Lieutenant receiver expects, i.e., receiver has received all bytes in sequence up to and Retreat! Retreat! including N-1 –  With f faults, need n > 3f to solve problem •  Various algorithms exist to solve problem –  Original algorithm has #messages exponential in n •  Two-phase commit: distributed decision making –  Newer algorithms have message complexity O(n2) –  First, make sure everyone guarantees they will commit if asked »  One from MIT, for instance (Castro and Liskov, 1999) (prepare) •  Use of BFT (Byzantine Fault Tolerance) algorithm –  Allow multiple machines to make a coordinated decision even if some –  Next, ask everyone to commit subset of them (< n/3 ) are malicious Request Distributed Decision 11/13/2017 CS162 © UCB Fall 2017 Lec 21.65 11/13/2017 CS162 © UCB Fall 2017 Lec 21.66 Page 17