RPC, Key-Value Storage, Chord
CS162 Operating Systems and Systems Programming, Lecture 23: RPC, Key-Value Stores, Chord (November 15th, 2017, Prof. Ion Stoica, http://cs162.eecs.Berkeley.edu)

Goals of Today's Lecture
• Ending previous lecture (PAXOS)
• RPC
• Key-value storage

Recap: Distributed Decision Making Discussion (1/2)
• Why is distributed decision making desirable?
  – Fault tolerance!
  – A group of machines can come to a decision even if one or more of them fail during the process
    » Simple failure mode called "fail-stop" (different modes later)
  – After the decision is made, the result is recorded in multiple places

Distributed Decision Making Discussion (2/2)
• Undesirable feature of Two-Phase Commit: blocking
  – One machine can be stalled until another site recovers:
    » Site B writes a "prepared to commit" record to its log, sends a "yes" vote to the coordinator (site A), and crashes
    » Site A crashes
    » Site B wakes up, checks its log, and realizes that it has voted "yes" on the update. It sends a message to site A asking what happened. At this point, B cannot decide to abort, because the update may have committed
    » B is blocked until A comes back
  – [Figure: the coordinator (site A) sends VOTE-REQ; the worker (site B) replies VOTE-COMMIT, crashes, and after restarting has to wait for A to come back]
  – B is blocked while holding resources (locks on updated items, pages pinned in memory, etc.) until it hears from A
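To make the blocking concrete, here is a small sketch (not from the lecture; the log format and return strings are invented) of the only decisions a restarted worker can safely take based on its local log:

    # Hypothetical sketch of a 2PC worker recovering from a crash.

    def recover(log):
        """Decide what a restarted worker (site B) may do based on its log."""
        if "prepared" not in log:
            # Never voted yes: safe to abort unilaterally.
            return "abort"
        if "commit" in log or "abort" in log:
            # Outcome already recorded locally: just redo it.
            return log[-1]
        # Voted yes but no outcome recorded: the coordinator (site A) may have
        # committed, so B must block and ask A; it cannot abort on its own.
        return "block until coordinator answers"

    print(recover(["prepared"]))           # -> block until coordinator answers
    print(recover([]))                     # -> abort
    print(recover(["prepared", "commit"])) # -> commit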
PAXOS
• PAXOS: an alternative used by Google and others that does not have this blocking problem
  – Developed by Leslie Lamport (Turing Award winner)
• What happens if one or more of the nodes is malicious?
  – Malicious: attempting to compromise the decision making

Byzantine General's Problem
• [Figure: the general orders "Attack!"; a malicious lieutenant relays "Retreat!" to the others]
• Byzantine General's Problem (n players):
  – One general and n-1 lieutenants
  – Some number of these (f) can be insane or malicious
• The commanding general must send an order to his n-1 lieutenants such that the following integrity constraints apply:
  – IC1: All loyal lieutenants obey the same order
  – IC2: If the commanding general is loyal, then all loyal lieutenants obey the order he sends

Byzantine General's Problem (con't)
• Impossibility results:
  – Cannot solve the Byzantine General's Problem with n = 3, because one malicious player can mess things up
    » [Figure: with n = 3, a loyal lieutenant cannot distinguish a malicious general who sends "Attack!" to one lieutenant and "Retreat!" to the other from a malicious lieutenant who lies about the order he received]
  – With f faults, need n > 3f to solve the problem
• Various algorithms exist to solve the problem
  – The original algorithm has a number of messages exponential in n
  – Newer algorithms have message complexity O(n^2)
    » One from MIT, for instance (Castro and Liskov, 1999)
• Use of a BFT (Byzantine Fault Tolerance) algorithm
  – Allows multiple machines to make a coordinated decision even if some subset of them (< n/3) are malicious
  – [Figure: a client request goes to a set of replicas, which reach a distributed decision]

Goals of Today's Lecture
• Ending previous lecture (PAXOS)
• RPC
• Key-value storage
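As a quick back-of-the-envelope check of the n > 3f bound above, a small helper (purely illustrative) can compute how many malicious nodes a given group size tolerates:

    # Illustrative check of the n > 3f bound: with n participants, at most
    # floor((n - 1) / 3) can be malicious for agreement to be possible.

    def max_byzantine_faults(n: int) -> int:
        return (n - 1) // 3

    for n in (3, 4, 7, 10):
        print(f"n = {n:2d} -> tolerates f = {max_byzantine_faults(n)} malicious node(s)")
    # n = 3 tolerates f = 0, which is why one traitor among three generals already breaks agreement.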
Remote Procedure Call (RPC)
• Raw messaging is a bit too low-level for programming
  – Must wrap up information into message at source
  – Must decide what to do with message at destination
  – May need to sit and wait for multiple messages to arrive
• Another option: Remote Procedure Call (RPC)
  – Calls a procedure on a remote machine
  – Client calls: remoteFileSystem→Read("rutabaga");
  – Translated automatically into call on server: fileSys→Read("rutabaga");

RPC Implementation
• Request-response message passing (under covers!)
• "Stub" provides glue on client/server
  – Client stub is responsible for "marshalling" arguments and "unmarshalling" the return values
  – Server-side stub is responsible for "unmarshalling" arguments and "marshalling" the return values
• Marshalling involves (depending on system)
  – Converting values to a canonical form, serializing objects, copying arguments passed by reference, etc.

RPC Information Flow
• [Figure: the client (caller) on machine A calls its client stub, which bundles the args; a packet handler sends the request over the network to machine B, whose packet handler receives it; the server stub unbundles the args and calls the server (callee). Return values flow back the same way: bundled by the server stub, sent to mbox2 (the client's return mailbox), unbundled by the client stub, and returned to the caller. mbox1 is the server's mailbox.]

RPC Details (1/3)
• Equivalence with regular procedure call
  – Call & parameters ⇔ Request message
  – Result ⇔ Reply message
  – Name of procedure: passed in request message
  – Return address: mbox2 (client return mailbox)
• Stub generator: compiler that generates stubs
  – Input: interface definitions in an "interface definition language (IDL)"
    » Contains, among other things, types of arguments/return values
  – Output: stub code in the appropriate source language
    » Code for client to pack message, send it off, wait for result, unpack result and return to caller
    » Code for server to unpack message, call procedure, pack results, send them off
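The following is a minimal, hypothetical sketch of what the stubs do, with JSON standing in for the marshalling format and a direct function call standing in for the network; a real RPC system would generate this code from an IDL and actually ship the bytes to another machine:

    import json

    def server_dispatch(request_bytes: bytes) -> bytes:
        """Server-side stub: unmarshal arguments, call the procedure, marshal the result."""
        request = json.loads(request_bytes)
        procedures = {"Read": lambda name: f"<contents of {name}>"}   # illustrative procedure table
        result = procedures[request["proc"]](*request["args"])
        return json.dumps({"result": result}).encode()

    def remote_read(filename: str) -> str:
        """Client-side stub for Read: looks like a local call to the caller."""
        request_bytes = json.dumps({"proc": "Read", "args": [filename]}).encode()  # marshal
        reply_bytes = server_dispatch(request_bytes)   # stand-in for send + wait on the network
        return json.loads(reply_bytes)["result"]       # unmarshal

    print(remote_read("rutabaga"))   # to the caller this looks just like fileSys.Read("rutabaga")

The point of the sketch is the division of labor: the caller sees an ordinary function call, while all packing, sending, waiting, and unpacking is hidden inside the stubs.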
RPC Details (2/3)
• Cross-platform issues:
  – What if client/server machines are different architectures or languages?
    » Convert everything to/from some canonical form
    » Tag every item with an indication of how it is encoded (avoids unnecessary conversions)
• How does the client know which mbox to send to?
  – Need to translate the name of the remote service into a network endpoint (remote machine, port, possibly other info)
  – Binding: the process of converting a user-visible name into a network endpoint
    » This is another word for "naming" at the network level
    » Static: fixed at compile time
    » Dynamic: performed at runtime

RPC Details (3/3)
• Dynamic binding
  – Most RPC systems use dynamic binding via a name service
    » Name service provides dynamic translation of service → mbox
  – Why dynamic binding?
    » Access control: check who is permitted to access the service
    » Fail-over: if a server fails, use a different one
• What if there are multiple servers?
  – Could give flexibility at binding time
    » Choose an unloaded server for each new client
  – Could provide the same mbox (router-level redirect)
    » Choose an unloaded server for each new request
    » Only works if no state is carried from one call to the next
• What if there are multiple clients?
  – Pass a pointer to a client-specific return mbox in the request

Problems with RPC: Non-Atomic Failures
• Different failure modes in a distributed system than on a single machine
• Consider many different types of failures
  – User-level bug causes the address space to crash
  – Machine failure or kernel bug causes all processes on the same machine to fail
  – Some machine is compromised by a malicious party
• Before RPC: the whole system would crash/die
• After RPC: one machine crashes or is compromised while the others keep working
• Can easily result in an inconsistent view of the world
  – Did my cached data get written back or not?
  – Did the server do what I requested or not?
• Answer? Distributed transactions/Byzantine commit

Problems with RPC: Performance
• Cost of procedure call « same-machine RPC « network RPC
• Means programmers must be aware that RPC is not free
  – Caching can help, but may make failure handling complex
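A small illustration of dynamic binding with fail-over; the registry contents and endpoint strings are invented for the example, and a real name service would of course live on the network rather than in a local dict:

    import random

    # Hypothetical name service: maps a service name to candidate endpoints.
    name_service = {
        "fileSys": ["10.0.0.1:4000", "10.0.0.2:4000"],   # two replicas of the service
    }

    def bind(service: str, dead=frozenset()) -> str:
        """Pick a live endpoint for the service; raise if none remain."""
        candidates = [ep for ep in name_service[service] if ep not in dead]
        if not candidates:
            raise RuntimeError(f"no live server for {service}")
        return random.choice(candidates)   # could instead pick the least-loaded server

    dead_servers = {"10.0.0.1:4000"}       # pretend the first server failed
    print(bind("fileSys", dead_servers))   # -> 10.0.0.2:4000 (fail-over at binding time)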
Administrivia
• Midterm 3 coming up on Wed 11/29, 6:30-8 PM
  – All topics up to and including Lecture 24
    » Focus will be on Lectures 17-24 and associated readings, and Project 3
    » But expect 20-30% of the questions to come from material in Lectures 1-16
  – Closed book
  – 2 pages of hand-written notes, both sides

BREAK

Goals of Today's Lecture
• Ending previous lecture (PAXOS)
• RPC
• Key-value storage

Key-Value Storage
• Handle huge volumes of data, e.g., petabytes!
  – Store (key, value) tuples
• Simple interface
  – put(key, value); // insert/write "value" associated with "key"
  – value = get(key); // get/read data associated with "key"
• Used sometimes as a simpler but more scalable "database"
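A single-node sketch of this interface (just a dict behind put()/get()); the distributed designs that follow keep exactly the same two-call surface, only the placement of the data changes:

    class KVStore:
        """Minimal in-memory key-value store exposing the slide's interface."""

        def __init__(self):
            self._data = {}

        def put(self, key, value):
            """Insert/write `value` associated with `key`."""
            self._data[key] = value

        def get(self, key):
            """Get/read the data associated with `key` (None if absent)."""
            return self._data.get(key)

    store = KVStore()
    store.put("customerID-42", {"buying_history": [], "credit_card": "..."})
    print(store.get("customerID-42"))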
Key-Value Examples
• Amazon
  – Key: customerID
  – Value: customer profile (e.g., buying history, credit card, ...)
• Facebook, Twitter:
  – Key: UserID
  – Value: user profile (e.g., posting history, photos, friends, ...)
• iCloud/iTunes:
  – Key: movie/song name
  – Value: movie, song

Key-Value Storage Systems in Real Life
• Amazon:
  – DynamoDB: internal key-value store used for Amazon.com (cart)
  – Simple Storage Service (S3)
• BigTable/HBase: distributed, scalable data store
• Cassandra: "distributed data management system" (developed by Facebook)
• Memcached: in-memory key-value store for small chunks of arbitrary data (strings, objects)
• ...

Key-Value Store
• Also called Distributed Hash Tables (DHT)
• Main idea: partition the set of key-value pairs across many machines
• [Figure: (key, value) pairs are partitioned across many machines]

Challenges
• Fault tolerance: handle machine failures without losing data and without degradation in performance
• Scalability:
  – Need to scale to thousands of machines
  – Need to allow easy addition of new machines
• Consistency: maintain data consistency in the face of node failures and message losses
• Heterogeneity (if deployed as a peer-to-peer system):
  – Latency: 1 ms to 1000 ms
  – Bandwidth: 32 Kb/s to 100 Mb/s
Key Questions
• put(key, value): where to store a new (key, value) tuple?
• get(key): where is the value associated with a given "key" stored?
• And, do the above while providing
  – Fault tolerance
  – Scalability
  – Consistency

Directory-Based Architecture
• Have a node maintain the mapping between keys and the machines (nodes) that store the values associated with the keys
  – [Figure: a Master/Directory node holds the mapping K5→N2, K14→N3, K105→N50; put(K14, V14) is sent to the master, which forwards it to N3; the values K5:V5, K14:V14, K105:V105 live on storage nodes N1, N2, N3, ..., N50]
  – [Figure: get(K14) is likewise sent to the master, which fetches V14 from N3 and returns it]
• Having the master relay the requests → recursive query
• Another method: iterative query
  – Return the node to the requester and let the requester contact the node
  – [Figure: put(K14, V14) asks the master, which replies "N3"; the requester then sends put(K14, V14) directly to N3]
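A toy sketch of the recursive-query variant, with the master relaying every put and get; the node names and the hash-based placement rule are illustrative, not part of the slides:

    class StorageNode:
        def __init__(self, name):
            self.name, self.data = name, {}

    class Master:
        def __init__(self, nodes):
            self.nodes = nodes          # e.g. {"N1": StorageNode("N1"), ...}
            self.directory = {}         # key -> node name

        def put(self, key, value):
            # Toy placement policy: hash new keys onto a node, reuse the mapping afterwards.
            node_name = self.directory.get(key) or sorted(self.nodes)[hash(key) % len(self.nodes)]
            self.directory[key] = node_name
            self.nodes[node_name].data[key] = value      # master relays the write (recursive query)

        def get(self, key):
            node_name = self.directory.get(key)
            return self.nodes[node_name].data[key] if node_name else None

    master = Master({n: StorageNode(n) for n in ("N1", "N2", "N3")})
    master.put("K14", "V14")
    print(master.get("K14"))   # -> V14, relayed through the master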
Directory-Based Architecture (cont'd)
• [Figure: iterative get: get(K14) asks the master, which replies "N3"; the requester then issues get(K14) directly to N3 and receives V14]

Discussion: Iterative vs. Recursive Query
• Recursive query:
  – Advantages:
    » Faster, as typically the master/directory is closer to the nodes
    » Easier to maintain consistency, as the master/directory can serialize puts()/gets()
  – Disadvantages: scalability bottleneck, especially for large "values", as all "values" go through the master/directory
• Iterative query:
  – Advantages: more scalable
  – Disadvantages: slower, harder to enforce data consistency

Fault Tolerance
• Replicate each value on several nodes
• Usually, place replicas on different racks in a datacenter to guard against rack failures
• Again, we can have
  – Recursive replication: the master forwards put(K14, V14) to both replicas (K14 → N1, N3)
  – Iterative replication: the master returns the replica list (N1, N3) and the requester sends put(K14, V14) to each replica itself
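A corresponding sketch of the iterative style combined with replication, where the directory only answers "which nodes hold this key" and the client talks to the storage nodes itself; the replica placement is hard-coded purely for illustration:

    directory = {"K14": ["N1", "N3"]}                    # key -> replica nodes
    nodes = {name: {} for name in ("N1", "N2", "N3", "N50")}

    def put(key, value):
        for node_name in directory[key]:                 # client contacts each replica directly
            nodes[node_name][key] = value

    def get(key):
        first_replica = directory[key][0]                # any replica will do once replication is done
        return nodes[first_replica][key]

    put("K14", "V14")
    print(get("K14"))          # -> V14
    print(nodes["N3"])         # -> {'K14': 'V14'}: both replicas hold the value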
Fault Tolerance (cont'd)
• Or we can combine a recursive query with iterative replication...
  – [Figure: put(K14, V14) is sent through the master (recursive query), while the write is propagated to replicas N1 and N3 iteratively]

Scalability
• More storage: use more nodes
• More read requests:
  – Can serve requests in parallel from all nodes on which a value is stored
  – Master can replicate a popular value on more nodes
• Master/directory scalability:
  – Replicate it
  – Partition it, so different keys are served by different masters/directories
    » How do you partition?

Scalability: Load Balancing
• Directory keeps track of the storage available at each node
  – Preferentially insert new values on nodes with more storage available
• What happens when a new node is added?
  – Cannot insert only new values on the new node. Why?
  – Move values from the heavily loaded nodes to the new node
• What happens when a node fails?
  – Need to replicate the values from the failed node onto other nodes

Consistency
• Need to make sure that a value is replicated correctly
• How do you know a value has been replicated on every node?
  – Wait for acknowledgements from every node
• What happens if a node fails during replication?
  – Pick another node and try again
• What happens if a node is slow?
  – Slow down the entire put()? Pick another node?
• In general, with multiple replicas
  – Slow puts and fast gets
Consistency (cont'd)
• With concurrent updates (i.e., puts to the same key) we may need to make sure that the updates happen in the same order at every replica
• Example: put(K14, V14') and put(K14, V14'') reach N1 and N3 in reverse order
  – [Figure: the master forwards both puts to N1 and N3; N1 applies V14' then V14'', ending with K14 = V14''; N3 applies them in the opposite order, ending with K14 = V14']
• What does get(K14) return?
  – Undefined!

Large Variety of Consistency Models
• Atomic consistency (linearizability): reads/writes (gets/puts) to replicas appear as if there were a single underlying replica (single system image)
  – Think "one update at a time"
  – Transactions
• Eventual consistency: given enough time, all updates will propagate through the system
  – One of the weakest forms of consistency; used by many systems in practice
  – Must eventually converge on a single value per key (coherence)
• And many others: causal consistency, sequential consistency, strong consistency, ...
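A tiny simulation of the concurrent-update race above: the same two puts applied in different orders leave the replicas disagreeing, which is exactly why get(K14) is undefined without some ordering mechanism:

    # Two replicas of K14 apply the same two puts in opposite orders.
    replica_N1, replica_N3 = {}, {}

    updates = [("K14", "V14'"), ("K14", "V14''")]

    for key, value in updates:                 # N1 sees V14' then V14''
        replica_N1[key] = value
    for key, value in reversed(updates):       # N3 sees them in the reverse order
        replica_N3[key] = value

    print(replica_N1["K14"], replica_N3["K14"])   # -> V14'' V14'  (the replicas disagree)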
Quorum Consensus
• Improve put() and get() operation performance
• Define a replica set of size N
  – put() waits for acknowledgements from at least W replicas
  – get() waits for responses from at least R replicas
  – W + R > N
• Why does it work?
  – There is at least one node that contains the update
• Why might you use W + R > N + 1?

Quorum Consensus Example
• N = 3, W = 2, R = 2
• Replica set for K14: {N1, N3, N4}
• Assume the put() on N3 fails
  – [Figure: put(K14, V14) is sent to N1, N3, and N4; N1 and N4 acknowledge, so the put completes with W = 2 ACKs and K14:V14 is stored on N1 and N4]
• Now, issuing a get() to any two nodes out of the three will return the answer
  – [Figure: get(K14) is sent to two replicas; N3 returns NIL because it missed the put, but the other replica returns V14, so the read still succeeds]

Scaling Up Directory
• Challenge:
  – The directory contains a number of entries equal to the number of (key, value) tuples in the system
  – This can be tens or hundreds of billions of entries!
• Solution: consistent hashing
• Associate to each node a unique id in a one-dimensional space 0..2^m - 1
  – Partition this space across the machines
  – Assume keys are in the same one-dimensional space
  – Each (key, value) is stored at the node with the smallest ID larger than the key
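Returning to the quorum example above: a hedged sketch of N = 3, W = 2, R = 2 in which the put to N3 is assumed to fail; the version numbers used to pick the newest reply are an added detail, not something on the slides:

    from itertools import combinations

    replicas = {"N1": {}, "N3": {}, "N4": {}}   # replica set for K14
    W, R = 2, 2

    def put(key, value, version, failed=()):
        acks = 0
        for name, store in replicas.items():
            if name not in failed:              # simulate the failed replica never acking
                store[key] = (version, value)
                acks += 1
        return acks >= W                        # put succeeds once W replicas acknowledged

    def get(key, contact):
        """Read from an R-sized subset of replicas; the newest version wins."""
        replies = [replicas[name][key] for name in contact if key in replicas[name]]
        return max(replies)[1] if replies else None

    print(put("K14", "V14", version=1, failed=("N3",)))   # True: 2 ACKs >= W
    for contact in combinations(replicas, R):             # every choice of R = 2 replicas
        print(contact, "->", get("K14", contact))         # each returns V14 (W + R > N overlap)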
Key-to-Node Mapping Example
• m = 6 → ID space: 0..63
• Node 8 maps keys [5, 8]
• Node 15 maps keys [9, 15]
• Node 20 maps keys [16, 20]
• ...
• Node 4 maps keys [59, 4]
• [Figure: identifier ring 0..63 with nodes 4, 8, 15, 20, 32, 35, 44, 58; the pair (14, V14) is stored at node 15]

Scaling Up Directory (cont'd)
• With consistent hashing, the directory contains only a number of entries equal to the number of nodes
  – Much smaller than the number of tuples
• Next challenge: every query still needs to contact the directory
• Solution: a distributed directory (a.k.a. lookup) service
  – Given a key, find the node storing the value associated with the key
• Key idea: route the request from node to node until reaching the node storing the request's key
• Key advantage: totally distributed
  – No single point of failure; no hot spot

Chord: Distributed Lookup (Directory) Service
• Key design decision
  – Decouple correctness from efficiency
• Properties
  – Each node needs to know about O(log(M)) other nodes, where M is the total number of nodes
  – Guarantees that a tuple is found in O(log(M)) steps
• Many other lookup services: CAN, Tapestry, Pastry, Kademlia, ...

Lookup
• Each node maintains a pointer to its successor
• Route a request for a key to the node responsible for that ID by following successor pointers
• Example: node 4 looks up the node responsible for Key=37; the request follows successor pointers around the ring until it reaches node 44, which is responsible for Key=37
• [Figure: lookup(37) issued at node 4 is forwarded along successor pointers until it arrives at node 44]
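The key-to-node rule is easy to state in code; this sketch assumes m = 6 and the node IDs from the example above:

    import bisect

    # Node IDs live on a ring 0..2^m - 1 (here m = 6, so 0..63); each key goes to
    # the first node whose ID is >= the key, wrapping around at the top of the ring.
    m = 6
    node_ids = sorted([4, 8, 15, 20, 32, 35, 44, 58])

    def successor(key: int) -> int:
        """First node ID clockwise from `key` (the node that stores the key)."""
        key %= 2 ** m
        i = bisect.bisect_left(node_ids, key)
        return node_ids[i] if i < len(node_ids) else node_ids[0]   # wrap around to the smallest ID

    print(successor(14))   # -> 15, matching "node 15 maps keys [9, 15]"
    print(successor(59))   # -> 4,  matching "node 4 maps keys [59, 4]" (wrap-around)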
Stabilization Procedure
• Periodic operation performed by each node n to maintain its successor when new nodes join the system

    n.stabilize()
      x = succ.pred;
      if (x ∈ (n, succ))
        succ = x;          // if x is a better successor, update
      succ.notify(n);      // n tells its successor about itself

    n.notify(n')
      if (pred = nil or n' ∈ (pred, n))
        pred = n';         // if n' is a better predecessor, update

Joining Operation
• Node with id=50 joins the ring
• Node 50 needs to know at least one node already in the system
  – Assume the known node is 15
• [Figure: ring with nodes 4, 8, 15, 20, 32, 35, 44, 58; node 58 has succ=4, pred=44; node 44 has succ=58, pred=35; the joining node 50 starts with succ=nil, pred=nil]
• n=50 sends join(50) to node 15; n=44 returns node 58; n=50 updates its successor to 58
• n=50 executes stabilize(); n's successor (58) returns x = 44
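A runnable translation of the pseudocode above (assuming IDs on a ring of size 2^m and open, wrap-around intervals); it plays out the join of node 50 that the next slides walk through step by step, ending with the ring segment 44 → 50 → 58:

    M = 6
    RING = 2 ** M

    def in_open_interval(x, a, b):
        """True if x lies strictly between a and b going clockwise on the ring."""
        if a < b:
            return a < x < b
        return x > a or x < b          # interval wraps past 0 (a == b means "whole ring except a")

    class Node:
        def __init__(self, node_id):
            self.id, self.succ, self.pred = node_id, self, None

        def stabilize(self):
            x = self.succ.pred
            if x is not None and in_open_interval(x.id, self.id, self.succ.id):
                self.succ = x                      # x is a better successor
            self.succ.notify(self)                 # tell the successor about ourselves

        def notify(self, other):
            if self.pred is None or in_open_interval(other.id, self.pred.id, self.id):
                self.pred = other                  # other is a better predecessor

    # Reproduce the joining example: node 50 joins between 44 and 58.
    n44, n50, n58 = Node(44), Node(50), Node(58)
    n44.succ, n58.pred = n58, n44                  # existing ring segment 44 -> 58
    n50.succ = n58                                 # join(50) learned that its successor is 58
    n50.stabilize(); n44.stabilize()               # one stabilization round on each node
    print(n44.succ.id, n50.pred.id, n50.succ.id, n58.pred.id)   # -> 50 44 58 50

After one stabilize() on node 50 and one on node 44, the pointers match the final state shown at the end of the walkthrough below.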
Joining Operation (cont'd)
• n=50 executes stabilize()
  – x = 44, which is not in (50, 58), so succ stays 58
  – n=50 sends notify(50) to its successor (58)
• n=58 processes notify(50)
  – pred = 44, n' = 50, and 50 ∈ (44, 58)
  – n=58 sets pred = 50
Joining Operation (cont'd)
• n=44 runs stabilize()
  – Its successor (58) returns x = 50
  – 50 ∈ (44, 58), so n=44 sets succ = 50
  – n=44 sends notify(44) to its new successor (50)
Joining Operation (cont'd)
• n=50 processes notify(44)
  – pred = nil, so n=50 sets pred = 44
• This completes the joining operation!
  – [Figure: node 50 now has succ=58 and pred=44; node 44 has succ=50; node 58 has pred=50]

Achieving Efficiency: Finger Tables
• Say m = 7 (ID space 0..127), with nodes 20, 32, 45, 80, 96, and 112
• The i-th finger table entry at the node with id n points to the first node with id ≥ (n + 2^i) mod 2^m
• Finger table at node 80: ft[0..4] = 96, ft[5] = 112, ft[6] = 20
• Example for the last entry: (80 + 2^6) mod 2^7 = 16, and the first node with id ≥ 16 is 20
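A short sketch that recomputes the finger table from the example, assuming the node set {20, 32, 45, 80, 96, 112} and m = 7:

    import bisect

    m = 7
    nodes = sorted([20, 32, 45, 80, 96, 112])

    def successor(ident):
        """First node with id >= ident on the ring 0..2^m - 1 (with wrap-around)."""
        i = bisect.bisect_left(nodes, ident % 2 ** m)
        return nodes[i] if i < len(nodes) else nodes[0]

    def finger_table(n):
        # Entry i points to the first node with id >= (n + 2^i) mod 2^m.
        return [successor(n + 2 ** i) for i in range(m)]

    print(finger_table(80))   # -> [96, 96, 96, 96, 96, 112, 20], matching the table above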
Achieving Fault Tolerance for the Lookup Service
• To improve robustness, each node maintains its k (> 1) immediate successors instead of only one successor
• In the pred() reply message, node A can send its k-1 successors to its predecessor B
• Upon receiving the pred() message, B can update its successor list by concatenating the successor list received from A with its own list
• If k = log(M), the lookup operation works with high probability even if half of the nodes fail, where M is the number of nodes in the system

Storage Fault Tolerance
• Replicate tuples on successor nodes
• Example: replicate (K14, V14) on nodes 20 and 32 (the successors of node 15, which is responsible for key 14)
• If node 15 fails, no reconfiguration is needed
  – We still have two replicas
  – All lookups will be correctly routed
• Will need to add a new replica on node 35

Summary (1/2)
• Remote Procedure Call (RPC): call a procedure on a remote machine
  – Provides the same interface as a procedure call
  – Automatic packing and unpacking of arguments (in the stub)
• Key-value store:
  – Two operations
    » put(key, value)
    » value = get(key)
  – Challenges
    » Fault tolerance → replication
    » Scalability → serve get()'s in parallel; replicate/cache hot tuples
    » Consistency → quorum consensus to improve put() performance
Summary (2/2)
• Chord:
  – Highly scalable distributed lookup protocol
  – Each node needs to know about O(log(M)) other nodes, where M is the total number of nodes
  – Guarantees that a tuple is found in O(log(M)) steps
  – Highly resilient: works with high probability even if half of the nodes fail