RPC, Key-Value Stores, Chord

These notes introduce the basic concepts of RPC, key-value storage, and Chord. In a distributed system, data must be kept consistent and available; by the CAP theorem, only two of consistency, availability, and partition tolerance can be guaranteed at once. The notes survey how various systems trade these properties off against each other, and conclude with the basic principles of the PAXOS algorithm and of eventual-consistency schemes for maintaining data consistency.

1. Goals of Today's Lecture
CS162 Operating Systems and Systems Programming, Lecture 23: RPC, Key-Value Stores, Chord
November 15th, 2017, Prof. Ion Stoica, http://cs162.eecs.Berkeley.edu
•  Ending previous lecture (PAXOS)
•  RPC
•  Key-value storage

Recap: Distributed Decision Making Discussion (1/2)
•  Why is distributed decision making desirable?
   –  Fault tolerance!
   –  A group of machines can come to a decision even if one or more of them fail during the process
      »  Simple failure mode called "fail-stop" (different modes later)
   –  After the decision is made, the result is recorded in multiple places

Distributed Decision Making Discussion (2/2)
•  Undesirable feature of Two-Phase Commit: blocking
   –  One machine can be stalled until another site recovers:
      »  Site B writes a "prepared to commit" record to its log, sends a "yes" vote to the coordinator (site A), and crashes
      »  Site A crashes
      »  Site B wakes up, checks its log, and realizes that it has voted "yes" on the update. It sends a message to site A asking what happened. At this point, B cannot decide to abort, because the update may have committed
      »  B is blocked until A comes back
   –  [Figure: the coordinator (site A) sends VOTE-REQ; the restarted worker (site B) replies VOTE-COMMIT and has to wait for A to come back]
   –  B is blocked while holding resources (locks on updated items, pages pinned in memory, etc.) until it hears from A

2. PAXOS
•  PAXOS: an alternative, used by Google and others, that does not have this blocking problem
   –  Developed by Leslie Lamport (Turing Award winner)
•  What happens if one or more of the nodes is malicious?
   –  Malicious: attempting to compromise the decision making

Byzantine General's Problem
•  [Figure: a general orders "Attack!"; loyal lieutenants relay "Attack!" while a malicious lieutenant relays "Retreat!"]
•  Byzantine General's Problem (n players):
   –  One general and n-1 lieutenants
   –  Some number f of these can be insane or malicious
•  The commanding general must send an order to his n-1 lieutenants such that the following integrity constraints apply:
   –  IC1: All loyal lieutenants obey the same order
   –  IC2: If the commanding general is loyal, then all loyal lieutenants obey the order he sends

Byzantine General's Problem (cont'd)
•  Impossibility results:
   –  Cannot solve the Byzantine General's Problem with n = 3, because one malicious player can mess things up
      [Figure: with one general and two lieutenants, a lieutenant cannot tell whether the general or the other lieutenant is lying about "Attack!" vs. "Retreat!"]
   –  With f faults, need n > 3f to solve the problem
•  Various algorithms exist to solve the problem
   –  The original algorithm has a number of messages exponential in n
   –  Newer algorithms have message complexity O(n^2)
      »  One from MIT, for instance (Castro and Liskov, 1999)
•  Use of a BFT (Byzantine Fault Tolerance) algorithm
   –  Allows multiple machines to make a coordinated decision even if some subset of them (< n/3) are malicious
      [Figure: clients send a request; the replicas reach a distributed decision]

Goals of Today's Lecture (agenda recap)
•  Ending previous lecture (PAXOS)
•  RPC
•  Key-value storage

3. Remote Procedure Call (RPC)
•  Raw messaging is a bit too low-level for programming
   –  Must wrap up information into a message at the source
   –  Must decide what to do with the message at the destination
   –  May need to sit and wait for multiple messages to arrive
•  Another option: Remote Procedure Call (RPC)
   –  Calls a procedure on a remote machine
   –  Client calls: remoteFileSystem→Read("rutabaga");
   –  Translated automatically into a call on the server: fileSys→Read("rutabaga");

RPC Implementation
•  Request-response message passing (under the covers!)
•  "Stub" provides glue on the client/server
   –  The client stub is responsible for "marshalling" arguments and "unmarshalling" the return values
   –  The server-side stub is responsible for "unmarshalling" arguments and "marshalling" the return values
•  Marshalling involves (depending on the system)
   –  Converting values to a canonical form, serializing objects, copying arguments passed by reference, etc.

RPC Information Flow
•  [Figure: on machine A, the client (caller) calls its client stub, which bundles the arguments and sends a packet; on machine B, the packet handler delivers it to the server stub, which unbundles the arguments and calls the server (callee); the return values flow back the same way through mailboxes mbox1 and mbox2]

RPC Details (1/3)
•  Equivalence with a regular procedure call
   –  Call & parameters ⇔ request message
   –  Result ⇔ reply message
   –  Name of procedure: passed in the request message
   –  Return address: mbox2 (the client's return mailbox)
•  Stub generator: compiler that generates stubs
   –  Input: interface definitions in an "interface definition language" (IDL)
      »  Contains, among other things, the types of arguments/return values
   –  Output: stub code in the appropriate source language
      »  Code for the client to pack the message, send it off, wait for the result, unpack the result, and return to the caller
      »  Code for the server to unpack the message, call the procedure, pack the results, and send them off
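To make the stub idea concrete, here is a minimal sketch in Python (not the course's code) of a client stub that marshals a call into a request message and a server stub that unmarshals it, invokes the real procedure, and marshals the result back. The FileServer class, the JSON encoding, and the in-process "transport" are illustrative assumptions; a real RPC system would generate the stubs from an IDL and send the messages over the network.

    import json

    # --- server side: the real procedure plus its stub ----------------
    class FileServer:
        def read(self, name):
            return "contents of " + name              # the real procedure

    def server_stub(request_bytes, server):
        # unmarshal the arguments, call the procedure, marshal the result
        request = json.loads(request_bytes)
        result = getattr(server, request["proc"])(*request["args"])
        return json.dumps({"result": result}).encode()

    # --- client side: the stub the caller actually invokes ------------
    def client_stub(send, proc, *args):
        # marshal the call into a request message and wait for the reply
        request = json.dumps({"proc": proc, "args": list(args)}).encode()
        reply = send(request)                          # would go over the network
        return json.loads(reply)["result"]

    # In place of a real network, hand the request straight to the server stub.
    server = FileServer()
    transport = lambda msg: server_stub(msg, server)
    print(client_stub(transport, "read", "rutabaga"))  # looks like a local call

The point of the sketch is only that the caller sees an ordinary function call while the stubs do all the packing and unpacking.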

4. RPC Details (2/3)
•  Cross-platform issues:
   –  What if the client/server machines are different architectures or use different languages?
      »  Convert everything to/from some canonical form
      »  Tag every item with an indication of how it is encoded (avoids unnecessary conversions)
•  How does the client know which mbox to send to?
   –  Need to translate the name of the remote service into a network endpoint (remote machine, port, possibly other info)
   –  Binding: the process of converting a user-visible name into a network endpoint
      »  This is another word for "naming" at the network level
      »  Static: fixed at compile time
      »  Dynamic: performed at runtime

RPC Details (3/3)
•  Dynamic binding
   –  Most RPC systems use dynamic binding via a name service
      »  The name service provides dynamic translation of service → mbox
   –  Why dynamic binding?
      »  Access control: check who is permitted to access the service
      »  Fail-over: if a server fails, use a different one
•  What if there are multiple servers?
   –  Could give flexibility at binding time
      »  Choose an unloaded server for each new client
   –  Could provide the same mbox (router-level redirect)
      »  Choose an unloaded server for each new request
      »  Only works if no state is carried from one call to the next
•  What if there are multiple clients?
   –  Pass a pointer to the client-specific return mbox in the request

Problems with RPC: Non-Atomic Failures
•  Different failure modes in a distributed system than on a single machine
•  Consider many different types of failures
   –  A user-level bug causes an address space to crash
   –  A machine failure or kernel bug causes all processes on the same machine to fail
   –  Some machine is compromised by a malicious party
•  Before RPC: the whole system would crash/die
•  After RPC: one machine crashes or is compromised while the others keep working
•  Can easily result in an inconsistent view of the world
   –  Did my cached data get written back or not?
   –  Did the server do what I requested or not?
•  Answer? Distributed transactions/Byzantine commit

Problems with RPC: Performance
•  Cost of procedure call « same-machine RPC « network RPC
•  Means programmers must be aware that RPC is not free
   –  Caching can help, but may make failure handling complex
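A minimal sketch of dynamic binding via a name service, under the assumption that the name service is just a table from service names to candidate endpoints (the table contents and the is_up probe below are hypothetical). It illustrates the two motivations from the slide: the endpoint is chosen at call time, and binding fails over to another server if the first choice is down.

    import random

    # Hypothetical name service: service name -> candidate endpoints (host, port).
    name_service = {
        "fileSys": [("10.0.0.1", 5000), ("10.0.0.2", 5000)],
    }

    def bind(service, is_up):
        # Dynamic binding: translate the user-visible name into a network endpoint
        # at call time, skipping endpoints that appear to be down (fail-over).
        candidates = name_service[service]
        for endpoint in random.sample(candidates, k=len(candidates)):
            if is_up(endpoint):
                return endpoint
        raise RuntimeError("no live server for " + service)

    # Pretend the first server is down: binding fails over to the second one.
    print(bind("fileSys", is_up=lambda e: e != ("10.0.0.1", 5000)))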

5. Administrivia
•  Midterm 3 coming up on Wed 11/29, 6:30-8PM
   –  All topics up to and including Lecture 24
      »  Focus will be on Lectures 17-24 and associated readings, and Project 3
      »  But expect 20-30% of the questions to come from material in Lectures 1-16
   –  Closed book
   –  Two sides of hand-written notes (both sides of one sheet)

BREAK

Goals of Today's Lecture (agenda recap)
•  Ending previous lecture (PAXOS)
•  RPC
•  Key-value storage

Key-Value Storage
•  Handle huge volumes of data, e.g., petabytes!
   –  Store (key, value) tuples
•  Simple interface
   –  put(key, value);    // insert/write "value" associated with "key"
   –  value = get(key);   // get/read the data associated with "key"
•  Used sometimes as a simpler but more scalable "database"
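For reference, a single-node Python sketch of the put()/get() interface above; the distributed designs in the rest of the lecture keep exactly this interface and only change where the data lives.

    class KVStore:
        def __init__(self):
            self.data = {}

        def put(self, key, value):
            # insert/write "value" associated with "key"
            self.data[key] = value

        def get(self, key):
            # read the value associated with "key" (None if absent)
            return self.data.get(key)

    store = KVStore()
    store.put("K14", "V14")
    print(store.get("K14"))   # -> V14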

6. Key Values: Examples
•  Amazon
   –  Key: customerID
   –  Value: customer profile (e.g., buying history, credit card, ...)
•  Facebook, Twitter:
   –  Key: UserID
   –  Value: user profile (e.g., posting history, photos, friends, ...)
•  iCloud/iTunes:
   –  Key: movie/song name
   –  Value: movie, song

Key-Value Storage Systems in Real Life
•  Amazon:
   –  DynamoDB: internal key-value store used for Amazon.com (cart)
   –  Simple Storage Service (S3)
•  BigTable/HBase: distributed, scalable data store
•  Cassandra: "distributed data management system" (developed by Facebook)
•  Memcached: in-memory key-value store for small chunks of arbitrary data (strings, objects)
•  ...

Key-Value Store
•  Also called Distributed Hash Tables (DHT)
•  Main idea: partition the set of key-value pairs across many machines
   [Figure: clients put/get (key, value) pairs that are spread over many machines]

Challenges
•  Fault tolerance: handle machine failures without losing data and without degradation in performance
•  Scalability:
   –  Need to scale to thousands of machines
   –  Need to allow easy addition of new machines
•  Consistency: maintain data consistency in the face of node failures and message losses
•  Heterogeneity (if deployed as a peer-to-peer system):
   –  Latency: 1 ms to 1000 ms
   –  Bandwidth: 32 Kb/s to 100 Mb/s

7. Key Questions
•  put(key, value): where do we store a new (key, value) tuple?
•  get(key): where is the value associated with a given "key" stored?
•  And do the above while providing
   –  Fault tolerance
   –  Scalability
   –  Consistency

Directory-Based Architecture
•  Have a node maintain the mapping between keys and the machines (nodes) that store the values associated with those keys
   [Figure: a master/directory maps K5→N2, K14→N3, K105→N50; put(K14, V14) goes to the master, which forwards it to N3; the nodes N1, N2, N3, N50 hold the actual values K5→V5, K14→V14, K105→V105]
•  The same master/directory serves gets: get(K14) goes to the master, which fetches V14 from N3 and returns it to the client

Directory-Based Architecture (cont'd)
•  Having the master relay the requests → recursive query
•  Another method: iterative query
   –  Return the node to the requester and let the requester contact the node directly
   [Figure: for put(K14, V14), the master replies "N3" and the client then sends the put to N3 itself; see the sketch after this page]
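A minimal sketch of the recursive, directory-based design, assuming a single Master object that owns the key → node mapping and relays every request. The node names follow the figures (N1, N2, N3, N50), and the choice of storage node is passed in explicitly just to keep the example short.

    class StorageNode:
        def __init__(self, name):
            self.name, self.data = name, {}

    class Master:
        # Keeps the key -> node mapping and relays every request (recursive query).
        def __init__(self, nodes):
            self.nodes = {n.name: n for n in nodes}
            self.directory = {}                    # key -> name of the storing node

        def put(self, key, value, node_name):
            self.directory[key] = node_name        # record where the value lives
            self.nodes[node_name].data[key] = value

        def get(self, key):
            return self.nodes[self.directory[key]].data[key]

    master = Master([StorageNode(n) for n in ("N1", "N2", "N3", "N50")])
    master.put("K14", "V14", "N3")                 # master forwards the put to N3
    print(master.get("K14"))                       # master fetches V14 from N3

In the iterative variant the master would instead return "N3" and the client would contact N3 itself.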

8. Directory-Based Architecture (cont'd)
•  [Figure: iterative get(K14): the master returns "N3" and the client contacts N3 directly to fetch V14]

Discussion: Iterative vs. Recursive Query
•  Recursive query:
   –  Advantages:
      »  Faster, as typically the master/directory is closer to the nodes
      »  Easier to maintain consistency, as the master/directory can serialize puts()/gets()
   –  Disadvantages: scalability bottleneck, especially for large "values", as all "values" go through the master/directory
•  Iterative query:
   –  Advantages: more scalable
   –  Disadvantages: slower, harder to enforce data consistency

Fault Tolerance
•  Replicate each value on several nodes
•  Usually, place replicas on different racks in a datacenter to guard against rack failures
   [Figure: put(K14, V14) is recorded in the directory as K14 → N1, N3 and the master forwards the put to both N1 and N3 (recursive replication)]

Fault Tolerance (cont'd)
•  Again, we can have
   –  Recursive replication (previous slide)
   –  Iterative replication: the master returns the replica list (N1, N3) and the client sends put(K14, V14) to each replica itself (sketched below)
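A sketch of iterative replication, assuming the client already has the replica list for the key (here a plain directory dict mapping K14 to N1 and N3): the client writes each replica itself, and a read can be served by any replica.

    # The directory maps each key to its replica list; the client writes every
    # replica itself (iterative replication) and can read from any one of them.
    directory = {"K14": ["N1", "N3"]}
    nodes = {name: {} for name in ("N1", "N2", "N3", "N50")}

    def put(key, value):
        for replica in directory[key]:             # client contacts each replica
            nodes[replica][key] = value

    def get(key):
        return nodes[directory[key][0]].get(key)   # any replica can serve the read

    put("K14", "V14")
    print(get("K14"), nodes["N1"], nodes["N3"])    # V14 stored on both replicas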

9. Fault Tolerance (cont'd)
•  Or we can use a recursive query with iterative replication...
   [Figure: put(K14, V14) goes through the master; the value is then written to replicas N1 and N3]

Scalability
•  More storage: use more nodes
•  More read requests:
   –  Can serve requests from all nodes on which a value is stored, in parallel
   –  The master can replicate a popular value on more nodes
•  Master/directory scalability:
   –  Replicate it
   –  Partition it, so different keys are served by different masters/directories
      »  How do you partition?

Scalability: Load Balancing
•  The directory keeps track of the storage available at each node
   –  Preferentially insert new values on nodes with more storage available
•  What happens when a new node is added?
   –  Cannot insert only new values on the new node. Why?
   –  Move values from the heavily loaded nodes to the new node
•  What happens when a node fails?
   –  Need to replicate the values from the failed node onto other nodes

Consistency
•  Need to make sure that a value is replicated correctly
•  How do you know a value has been replicated on every node?
   –  Wait for acknowledgements from every node
•  What happens if a node fails during replication?
   –  Pick another node and try again
•  What happens if a node is slow?
   –  Slow down the entire put()? Pick another node?
•  In general, with multiple replicas
   –  Slow puts and fast gets

10. Consistency (cont'd)
•  With concurrent updates (i.e., puts to the same key) we may need to make sure that the updates happen in the same order at every replica
•  Example: put(K14, V14') and put(K14, V14'') reach N1 and N3 in reverse order
   [Figure: N1 ends up storing V14'' while N3 ends up storing V14']
•  What does get(K14) return?
   –  Undefined!

Large Variety of Consistency Models
•  Atomic consistency (linearizability): reads/writes (gets/puts) to the replicas appear as if there were a single underlying replica (single system image)
   –  Think "one update at a time"
   –  Transactions
•  Eventual consistency: given enough time, all updates will propagate through the system
   –  One of the weakest forms of consistency; used by many systems in practice
   –  Must eventually converge on a single value per key (coherence)
•  And many others: causal consistency, sequential consistency, strong consistency, ...
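The lecture leaves the result undefined; one common remedy (not discussed on these slides) is to attach a version number or timestamp to each put so that every replica applies the same last-writer-wins rule and converges to one value, which is the coherence requirement of eventual consistency. A minimal sketch, with hand-assigned versions:

    # Each replica keeps (version, value) per key and applies a put only if its
    # version is newer, so all replicas converge no matter the arrival order.
    class Replica:
        def __init__(self):
            self.store = {}                         # key -> (version, value)

        def put(self, key, value, version):
            if key not in self.store or version > self.store[key][0]:
                self.store[key] = (version, value)

        def get(self, key):
            return self.store[key][1]

    n1, n3 = Replica(), Replica()
    # put(K14, V14') carries version 1 and put(K14, V14'') carries version 2;
    # they reach N1 and N3 in opposite orders:
    n1.put("K14", "V14'", 1); n1.put("K14", "V14''", 2)
    n3.put("K14", "V14''", 2); n3.put("K14", "V14'", 1)
    print(n1.get("K14"), n3.get("K14"))             # both replicas answer V14''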

11. Quorum Consensus
•  Improve put() and get() operation performance
•  Define a replica set of size N
   –  put() waits for acknowledgements from at least W replicas
   –  get() waits for responses from at least R replicas
   –  W + R > N
•  Why does it work?
   –  Any read quorum contains at least one node that has the update
•  Why might you use W + R > N + 1?

Quorum Consensus Example
•  N = 3, W = 2, R = 2
•  Replica set for K14: {N1, N3, N4}
•  Assume the put() on N3 fails
   [Figure: put(K14, V14) reaches N1 and N4, which ACK; the write to N3 is lost]
•  Now, issuing get() to any two nodes out of the three will return the answer
   [Figure: get(K14) sent to N3 returns NIL, while N4 returns V14]

Scaling Up Directory
•  Challenge:
   –  The directory contains a number of entries equal to the number of (key, value) tuples in the system
   –  That can be tens or hundreds of billions of entries!
•  Solution: consistent hashing
   –  Associate to each node a unique ID in a one-dimensional space 0..2^m-1
   –  Partition this space across the machines
   –  Assume keys live in the same one-dimensional space
   –  Each (key, value) is stored at the node with the smallest ID larger than (or equal to) the key
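A sketch of the N=3, W=2, R=2 example, assuming versioned values and a simple "up" flag to model the failed put on one replica. Because W + R > N, any R replicas contacted by get() must include at least one that saw the latest put.

    N, W, R = 3, 2, 2
    replicas = [{}, {}, {}]                  # replica set for K14, e.g. {N1, N3, N4}

    def put(key, value, version, up):
        acks = 0
        for i, rep in enumerate(replicas):
            if up[i]:                        # a failed replica never acknowledges
                rep[key] = (version, value)
                acks += 1
        return acks >= W                     # the put succeeds with >= W acks

    def get(key, up):
        # ask live replicas until R have answered, return the freshest value seen
        responses = []
        for i, rep in enumerate(replicas):
            if up[i] and len(responses) < R:
                responses.append(rep.get(key))       # None if the replica is stale
        versions = [r for r in responses if r is not None]
        return max(versions)[1] if versions else None

    up = [True, False, True]                 # the put to the second replica fails
    print(put("K14", "V14", 1, up))          # True: 2 acks, which is >= W
    print(get("K14", [True, True, True]))    # V14: any R replicas overlap the write set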

12. Key to Node Mapping Example
•  m = 6 → ID space: 0..63, with nodes 4, 8, 15, 20, 32, 35, 44, 58 on the ring
•  Node 8 maps keys [5, 8]
•  Node 15 maps keys [9, 15]
•  Node 20 maps keys [16, 20]
•  ...
•  Node 4 maps keys [59, 4]
   [Figure: the tuple (14, V14) is stored at node 15]

Scaling Up Directory (cont'd)
•  With consistent hashing, the directory contains only a number of entries equal to the number of nodes
   –  Much smaller than the number of tuples
•  Next challenge: every query still needs to contact the directory
•  Solution: a distributed directory (a.k.a. lookup) service
   –  Given a key, find the node storing the value associated with that key
•  Key idea: route the request from node to node until reaching the node storing the request's key
•  Key advantage: totally distributed
   –  No single point of failure; no hot spot

Chord: Distributed Lookup (Directory) Service
•  Key design decision
   –  Decouple correctness from efficiency
•  Properties
   –  Each node needs to know about O(log(M)) other nodes, where M is the total number of nodes
   –  Guarantees that a tuple is found in O(log(M)) steps
•  Many other lookup services exist: CAN, Tapestry, Pastry, Kademlia, ...

Lookup
•  Each node maintains a pointer to its successor
•  Route a packet (key, value) to the node responsible for the ID using the successor pointers
•  E.g., node 4 looks up the node responsible for key 37
   –  lookup(37) is forwarded around the ring until node 44, which is responsible for key 37
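A minimal sketch of the key-to-node rule above, using the ring from this example: successor(key) returns the first node ID greater than or equal to the key, wrapping around at the top of the ID space.

    import bisect

    node_ids = sorted([4, 8, 15, 20, 32, 35, 44, 58])   # the example ring, m = 6

    def successor(key):
        # the node responsible for a key: the first node ID >= key, wrapping around
        i = bisect.bisect_left(node_ids, key)
        return node_ids[i % len(node_ids)]

    print(successor(14))   # -> 15 (node 15 maps keys [9, 15])
    print(successor(59))   # -> 4  (wrap-around: node 4 maps keys [59, 4])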

13. Stabilization Procedure
•  Periodic operation performed by each node n to maintain its successor when new nodes join the system (a runnable version of this pseudocode follows this page)

   n.stabilize()
     x = succ.pred;
     if (x ∈ (n, succ))
       succ = x;          // if x is a better successor, update
     succ.notify(n);      // n tells its successor about itself

   n.notify(n')
     if (pred = nil or n' ∈ (pred, n))
       pred = n';         // if n' is a better predecessor, update

Joining Operation
•  A node with id=50 joins the ring
•  Node 50 needs to know at least one node already in the system
   –  Assume the known node is 15
   [Figure: before the join, node 58 has succ=4, pred=44; node 44 has succ=58, pred=35; node 50 starts with succ=nil, pred=nil]
•  n=50 sends join(50) to node 15
   –  Node 44 returns node 58
   –  n=50 updates its successor to 58
•  n=50 executes stabilize()
   –  n's successor (58) returns x = 44
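The stabilize()/notify() pseudocode above, transcribed into a runnable Python sketch. The join()/find_successor() helper and the circular-interval test between() are assumptions added only so the example runs; the slides give just the two procedures. The hand-traced walkthrough on the following pages corresponds to the stabilization rounds in the usage example at the bottom.

    def between(x, a, b):
        # True if x lies in the open circular interval (a, b) on the identifier ring
        return (a < x < b) if a < b else (x > a or x < b)

    class Node:
        def __init__(self, ident):
            self.id, self.succ, self.pred = ident, self, None

        def join(self, known):
            # learn an initial successor from any node already in the system
            self.succ = known.find_successor(self.id)

        def find_successor(self, ident):
            n = self
            while ident != n.succ.id and not between(ident, n.id, n.succ.id):
                n = n.succ                   # walk the ring via successor pointers
            return n.succ

        def stabilize(self):                 # n.stabilize() from the slide
            x = self.succ.pred
            if x is not None and between(x.id, self.id, self.succ.id):
                self.succ = x                # x is a better successor: update
            self.succ.notify(self)           # n tells its successor about itself

        def notify(self, other):             # n.notify(n') from the slide
            if self.pred is None or between(other.id, self.pred.id, self.id):
                self.pred = other            # n' is a better predecessor: update

    # Build the ring 4-8-15-20-32-35-44-58, then let node 50 join via node 15.
    ids = [4, 8, 15, 20, 32, 35, 44, 58]
    nodes = {i: Node(i) for i in ids}
    for a, b in zip(ids, ids[1:] + ids[:1]):
        nodes[a].succ, nodes[b].pred = nodes[b], nodes[a]

    n50 = Node(50)
    n50.join(nodes[15])                      # as on the slides, succ(50) becomes 58
    for _ in range(3):                       # a few periodic stabilization rounds
        for n in list(nodes.values()) + [n50]:
            n.stabilize()
    print(n50.succ.id, nodes[44].succ.id, nodes[58].pred.id, n50.pred.id)  # 58 50 50 44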

14. Joining Operation (cont'd)
•  n=50 executes stabilize()
   –  x = 44
   –  succ = 58 (44 is not in (50, 58), so the successor is unchanged)
•  n=50 sends notify(50) to its successor (58)
•  n=58 processes notify(50)
   –  pred = 44, n' = 50
   –  Since 50 ∈ (44, 58), node 58 sets pred = 50

15. Joining Operation (cont'd)
•  n=44 runs stabilize()
   –  n's successor (58) returns x = 50
   –  Since 50 ∈ (44, 58), n=44 sets succ = 50
•  n=44 sends notify(44) to its new successor (50)

16. Joining Operation (cont'd)
•  n=50 processes notify(44)
   –  pred = nil, so n=50 sets pred = 44
•  This completes the joining operation!
   [Figure: the ring now runs ... → 44 → 50 → 58 → 4 → ..., with node 58's pred = 50 and node 50's pred = 44]

Achieving Efficiency: Finger Tables
•  Say m = 7 (ID space 0..127), with peers 4, 20, 32, 45, 58, 80, 96, 112 on the ring
•  The i-th entry of the finger table at a peer with id n is the first peer with id >= (n + 2^i) mod 2^m
•  Finger table at node 80:
   –  i = 0: 80 + 2^0 = 81  → ft[0] = 96
   –  i = 1: 80 + 2^1 = 82  → ft[1] = 96
   –  i = 2: 80 + 2^2 = 84  → ft[2] = 96
   –  i = 3: 80 + 2^3 = 88  → ft[3] = 96
   –  i = 4: 80 + 2^4 = 96  → ft[4] = 96
   –  i = 5: 80 + 2^5 = 112 → ft[5] = 112
   –  i = 6: (80 + 2^6) mod 2^7 = 16 → ft[6] = 20
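A short sketch that recomputes the finger table at node 80 from the rule above, using the peers on this slide's ring; successor() here is a plain linear scan for readability.

    m = 7                                       # ID space 0..127
    ring = sorted([4, 20, 32, 45, 58, 80, 96, 112])

    def successor(ident):
        # first peer with id >= ident, wrapping around the ring
        for peer in ring:
            if peer >= ident:
                return peer
        return ring[0]

    def finger_table(n):
        # ft[i] = successor((n + 2^i) mod 2^m)
        return [successor((n + 2**i) % 2**m) for i in range(m)]

    print(finger_table(80))   # -> [96, 96, 96, 96, 96, 112, 20], as on the slide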

17. Achieving Fault Tolerance for the Lookup Service
•  To improve robustness, each node maintains its k (> 1) immediate successors instead of only one successor
•  In the pred() reply message, node A can send its k-1 successors to its predecessor B
•  Upon receiving the pred() message, B can update its successor list by concatenating the successor list received from A with its own list
•  If k = log(M), the lookup operation works with high probability even if half of the nodes fail, where M is the number of nodes in the system

Storage Fault Tolerance
•  Replicate tuples on successor nodes
•  Example: replicate (K14, V14) on nodes 20 and 32, in addition to node 15, which owns key 14
   [Figure: ring with ID space 0..63; the tuple 14→V14 is stored on nodes 15, 20, and 32]
•  If node 15 fails, no reconfiguration is needed
   –  We still have two replicas
   –  All lookups will be correctly routed
•  Will need to add a new replica on node 35

Summary (1/2)
•  Remote Procedure Call (RPC): call a procedure on a remote machine
   –  Provides the same interface as a procedure call
   –  Automatic packing and unpacking of arguments (in the stub)
•  Key-Value Store:
   –  Two operations
      »  put(key, value)
      »  value = get(key)
   –  Challenges
      »  Fault tolerance → replication
      »  Scalability → serve get()'s in parallel; replicate/cache hot tuples
      »  Consistency → quorum consensus to improve put() performance
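A sketch of storage fault tolerance by replication on successors, using the ring from the example: the tuple for key 14 is written to the owning node and its k = 2 immediate successors, so a lookup still finds it if node 15 fails. The responsible_nodes() helper is an assumption for illustration, not Chord's actual replication code.

    ring = sorted([4, 8, 15, 20, 32, 35, 44, 58])   # ID space 0..63
    k = 2                                            # number of extra replicas

    def responsible_nodes(key):
        # the node owning the key plus its k immediate successors
        i = next((j for j, n in enumerate(ring) if n >= key), 0)
        return [ring[(i + d) % len(ring)] for d in range(k + 1)]

    stores = {n: {} for n in ring}
    for n in responsible_nodes(14):                  # -> [15, 20, 32]
        stores[n][14] = "V14"

    live = [n for n in responsible_nodes(14) if n != 15]   # node 15 fails
    print(live, stores[live[0]][14])                 # lookups still find V14 on node 20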

18. Summary (2/2)
•  Chord:
   –  Highly scalable distributed lookup protocol
   –  Each node needs to know about O(log(M)) other nodes, where M is the total number of nodes
   –  Guarantees that a tuple is found in O(log(M)) steps
   –  Highly resilient: works with high probability even if half of the nodes fail