12-Gossip协议

自我管理系统中的一个难题是组件可能甚至不知道如何彼此联系。Gossip协议非常强大且可扩展,并且具有完全可预测的负载。这使它们在系统管理中特别有价值。 以下是我们将在本讲座中讨论的两个主要系统的论文。亚马逊收购了Astrolabe,尽管他们很快就将它演变为不再像原版一样了。后来他们报道了Gossip导致问题的一种方式:着名的Amazon S3存储可用性跟踪错误,将S3关闭了一整天
展开查看详情

1.CS5412/Lecture 12 Gossip Protocols Ken Birman CS5412 Spring 2019 http://www.cs.cornell.edu/courses/cs5412/2019sp 1

2.Gossip 101 Gossip protocols: Ones in which information is spread node-to-node at random, like a Zombie virus. At first, the rate of spread doubles on each round of gossip. Eventually, a lot of “already infected” events slow the spread down. CS5412 Spring 2016 2

3.Key aspects to the concept Participants have a membership list, or some random subset of it. They pick some other participant at random, once every T time units. Then the two interact to share data: The messages are of fixed maximum size. http://www.cs.cornell.edu/courses/cs5412/2019sp 3 { Push: A “tells” B some rumors Pull: A “asks” B for news Push-Pull: Both

4.Notice that gossip has fixed peak load! Every process sends and receives at the same fixed rate (due to random peer selection, some processes might receive 2 messages in time period T, but very few receive 3 or more… the “birthday paradox”) And at most, we fill those messages to the limit with rumors, but then they max out and nothing more can be added. So gossip is very predictable. System managers like this aspect. http://www.cs.cornell.edu/courses/cs5412/2019sp 4

5.Gossip spreads slowly at first, then faster Log(N) tells us how many rounds (each taking T time units) to anticipate With N=100,000, log(N) would be 12 So with one gossip round per five seconds, information would need one minute to spread in a large system! Some gossip protocols combine pure gossip with an accelerator A good way to get the word out quickly CS5412 Spring 2016 5

6.Easy to work with A recent Cornell student created a framework for Gossip applications, called the M I CA system ( Microprotocol Composition Architecture) You take a skeleton, add a few lines of logic to tell it how to merge states (incoming gossip), and MICA runs the resulting application for you. Plus, it supports a modular, “compositional” coding style. Use cases were mostly focused on large-scale system management. http://www.cs.cornell.edu/courses/cs5412/2019sp 6

7.Bimodal Multicast CS5412 Spring 2016 7 This uses gossip to send a message from one source to many receivers. It combines gossip with a feature called IP multicast: an unreliable 1-to-many UDP option available on optical Ethernet In Bimodal Multicast, the first step is to send a message using IP multicast. Not reliable, and we don’t add acks or retransmissions No flow control (but it does support a rate limiting feature) In data centers that lack IP multicast, can simulate by sending UDP packets 1:1. Again, these use UDP without acks

8.What’s the cost of an IP multicast? CS5412 Spring 2016 8 In principle, each Bimodal Multicast packet traverses the relevant data center links and routers just once per message So this is extremely cheap... but how do we deal with systems that didn’t receive the multicast?

9.Making Bimodal Multicast reliable CS5412 Spring 2016 9 We can use gossip! The “rumors” will be the IP multicast messages! Every node tracks the membership of the target group (using gossip) Then after doing the IP multicast, “fill in the holes” (missed messages).

10.Making Bimodal Multicast reliable CS5412 Spring 2016 10 So, layer in a gossip mechanism that gossips about multicasts each node knows about Rather than sending the multicasts themselves, the gossip messages just talk about “digests”, which are lists of messages received, perhaps in a compressed format Node A might send node B I have messages 1-18 from sender X I have message 11 from sender Y I have messages 14, 16 and 22-71 from sender Z This is a form of “push” gossip

11.Making Bimodal Multicast reliable CS5412 Spring 2016 11 On receiving such a gossip message, the recipient checks to see which messages it has that the gossip sender lacks, and vice versa Then it responds I have copies of messages M, M’ and M’’ (which you seem to lack) I would like a copy of messages N, N’ and N’’ An exchange of the actual messages follows

12.This makes it “bimodal” There is a first wave of message delivery from the IP multicast, which takes a few milliseconds to reach every node in the whole data center. But a few miss the message. Then a second wave of gossip follows, filling in the gaps, but this takes a few rounds, so we see a delay of T*2 or T*3 while this plays out. http://www.cs.cornell.edu/courses/cs5412/2019sp 12 Delay Count of nodes reached after this delay

13.Experimental findings Bimodal multicasts works best if the initial IP multicast reaches almost every process, and “usually” this is so. But “sometimes” a lot of loss occurs. In those cases, N (the number of receivers missing the message) is much larger. Then the second “mode” (second bump in the curve) is large and slow. http://www.cs.cornell.edu/courses/cs5412/2019sp 13

14.Optimizations CS5412 Spring 2016 14 Bimodal Multicast resends using IP multicast if there is “evidence” that a few nodes may be missing the same thing E.g. if two nodes ask for the same retransmission Or if a retransmission shows up from a very remote node (IP multicast doesn’t always work in WANs) It also prioritizes recent messages over old ones With these changes, “almost all” receivers will get the message via IP multicast, so N is small and gossip fills gaps within just 2 or 3 rounds.

15.lpbcast variation ( Kermarrec , Guerraoui ) CS5412 Spring 2016 15 In this variation on Bimodal Multicast instead of gossiping with every node in a system, the protocol: Maintains a “peer overlay”: each member tracks two sets of neighbors. First set: peers picked to be reachable with low round-trip times. Second set: peers picked to ensure that the graph is an “expander” graph. Called a “small worlds” structure by Jon Kleinberg. Lpbcast is often faster, but equally reliable!

16.Speculation... about speed CS5412 Spring 2016 16 When we combine IP multicast with gossip we try to match the tool we’re using with the need Try to get the messages through fast... but if loss occurs, try to have a very predictable recovery cost Gossip has a totally predictable worst-case load Even the IP multicast acceleration idea just adds an unacknowledged IP multicast message or two, per Bimodal Multicast sent. This is appealing at large scales How can we generalize this concept?

17.Astrolabe Help for applications adrift in a sea of information Structure emerges from a randomized gossip protocol This approach is robust and scalable even under stress that cripples traditional systems Initially developed by a team led by Robbert van Renesse. Technology was adopted at Amazon.com (but they rebuild it over time) CS5412 Spring 2016 17

18.Astrolabe is a flexible monitoring overlay Name Time Load Weblogic? SMTP? Word Version swift 2003 .67 0 1 6.2 falcon 1976 2.7 1 0 4.1 cardinal 2201 3.5 1 1 6.0 Name Time Load Weblogic? SMTP? Word Version swift 2011 2.0 0 1 6.2 falcon 1971 1.5 1 0 4.1 cardinal 2004 4.5 1 0 6.0 swift.cs.cornell.edu cardinal.cs.cornell.edu Periodically, pull data from monitored systems Name Time Load Weblogic ? SMTP? Word Version swift 2271 1.8 0 1 6.2 falcon 1971 1.5 1 0 4.1 cardinal 2004 4.5 1 0 6.0 Name Time Load Weblogic ? SMTP? Word Version swift 2003 .67 0 1 6.2 falcon 1976 2.7 1 0 4.1 cardinal 2231 1.7 1 1 6.0 CS5412 Spring 2016 18

19.Astrolabe in a single domain Each node owns a single tuple, like the management information base (MIB) Nodes discover one-another through a simple broadcast scheme (“anyone out there?”) and gossip about membership Nodes also keep replicas of one-another’s rows Periodically (uniformly at random) merge your state with some else… CS5412 Spring 2016 19

20.State Merge: Core of Astrolabe epidemic Name Time Load Weblogic ? SMTP? Word Version swift 2003 .67 0 1 6.2 falcon 1976 2.7 1 0 4.1 cardinal 2201 3.5 1 1 6.0 Name Time Load Weblogic? SMTP? Word Version swift 2011 2.0 0 1 6.2 falcon 1971 1.5 1 0 4.1 cardinal 2004 4.5 1 0 6.0 swift.cs.cornell.edu cardinal.cs.cornell.edu CS5412 Spring 2016 20

21.State Merge: Core of Astrolabe epidemic Name Time Load Weblogic? SMTP? Word Version swift 2003 .67 0 1 6.2 falcon 1976 2.7 1 0 4.1 cardinal 2201 3.5 1 1 6.0 Name Time Load Weblogic? SMTP? Word Version swift 2011 2.0 0 1 6.2 falcon 1971 1.5 1 0 4.1 cardinal 2004 4.5 1 0 6.0 swift.cs.cornell.edu cardinal.cs.cornell.edu swift 2011 2.0 cardinal 2201 3.5 CS5412 Spring 2016 21

22.State Merge: Core of Astrolabe epidemic Name Time Load Weblogic? SMTP? Word Version swift 2011 2.0 0 1 6.2 falcon 1976 2.7 1 0 4.1 cardinal 2201 3.5 1 1 6.0 Name Time Load Weblogic? SMTP? Word Version swift 2011 2.0 0 1 6.2 falcon 1971 1.5 1 0 4.1 cardinal 2201 3.5 1 0 6.0 swift.cs.cornell.edu cardinal.cs.cornell.edu CS5412 Spring 2016 22

23.Observations Merge protocol has constant cost One message sent, received (on avg ) per unit time. The data changes slowly, so no need to run it quickly – we usually run it every five seconds or so Information spreads in O(log N) time But this assumes bounded region size In Astrolabe, we limit them to 50-100 rows CS5412 Spring 2016 23

24.Big systems… A big system could have many regions Looks like a pile of spreadsheets A node only replicates data from its neighbors within its own region CS5412 Spring 2016 24

25.Scaling up… and up… With a stack of domains, we don’t want every system to “see” every domain Cost would be huge So instead, we’ll see a summary Name Time Load Weblogic? SMTP? Word Version swift 2011 2.0 0 1 6.2 falcon 1976 2.7 1 0 4.1 cardinal 2201 3.5 1 1 6.0 cardinal.cs.cornell.edu Name Time Load Weblogic? SMTP? Word Version swift 2011 2.0 0 1 6.2 falcon 1976 2.7 1 0 4.1 cardinal 2201 3.5 1 1 6.0 Name Time Load Weblogic? SMTP? Word Version swift 2011 2.0 0 1 6.2 falcon 1976 2.7 1 0 4.1 cardinal 2201 3.5 1 1 6.0 Name Time Load Weblogic? SMTP? Word Version swift 2011 2.0 0 1 6.2 falcon 1976 2.7 1 0 4.1 cardinal 2201 3.5 1 1 6.0 Name Time Load Weblogic? SMTP? Word Version swift 2011 2.0 0 1 6.2 falcon 1976 2.7 1 0 4.1 cardinal 2201 3.5 1 1 6.0 Name Time Load Weblogic? SMTP? Word Version swift 2011 2.0 0 1 6.2 falcon 1976 2.7 1 0 4.1 cardinal 2201 3.5 1 1 6.0 Name Time Load Weblogic? SMTP? Word Version swift 2011 2.0 0 1 6.2 falcon 1976 2.7 1 0 4.1 cardinal 2201 3.5 1 1 6.0 CS5412 Spring 2016 25

26.Name Load Weblogic? SMTP? Word Version … swift 2.0 0 1 6.2 falcon 1.5 1 0 4.1 cardinal 4.5 1 0 6.0 Name Load Weblogic? SMTP? Word Version … gazelle 1.7 0 0 4.5 zebra 3.2 0 1 6.2 gnu .5 1 0 6.2 Name Avg Load WL contact SMTP contact SF 2.6 123.45.61.3 123.45.61.17 NJ 1.8 127.16.77.6 127.16.77.11 Paris 3.1 14.66.71.8 14.66.71.12 Astrolabe builds a hierarchy using a P2P protocol that “assembles the puzzle” without any servers Name Load Weblogic? SMTP? Word Version … swift 2.0 0 1 6.2 falcon 1.5 1 0 4.1 cardinal 4.5 1 0 6.0 Name Load Weblogic? SMTP? Word Version … gazelle 1.7 0 0 4.5 zebra 3.2 0 1 6.2 gnu .5 1 0 6.2 Name Avg Load WL contact SMTP contact SF 2.6 123.45.61.3 123.45.61.17 NJ 1.8 127.16.77.6 127.16.77.11 Paris 3.1 14.66.71.8 14.66.71.12 San Francisco New Jersey SQL query “summarizes” data Dynamically changing query output is visible system-wide Name Load Weblogic? SMTP? Word Version … swift 1.7 0 1 6.2 falcon 2.1 1 0 4.1 cardinal 3.9 1 0 6.0 Name Load Weblogic? SMTP? Word Version … gazelle 4.1 0 0 4.5 zebra 0.9 0 1 6.2 gnu 2.2 1 0 6.2 Name Avg Load WL contact SMTP contact SF 2.2 123.45.61.3 123.45.61.17 NJ 1.6 127.16.77.6 127.16.77.11 Paris 2.7 14.66.71.8 14.66.71.12 CS5412 Spring 2016 26

27.Large scale: “fake” regions These are Computed by queries that summarize a whole region as a single row Gossiped in a read-only manner within a leaf region But who runs the gossip? Each region elects “k” members to run gossip at the next level up. Can play with selection criteria and “k” CS5412 Spring 2016 27

28.Hierarchy is virtual… data is replicated Name Load Weblogic? SMTP? Word Version … swift 2.0 0 1 6.2 falcon 1.5 1 0 4.1 cardinal 4.5 1 0 6.0 Name Load Weblogic? SMTP? Word Version … gazelle 1.7 0 0 4.5 zebra 3.2 0 1 6.2 gnu .5 1 0 6.2 Name Avg Load WL contact SMTP contact SF 2.6 123.45.61.3 123.45.61.17 NJ 1.8 127.16.77.6 127.16.77.11 Paris 3.1 14.66.71.8 14.66.71.12 San Francisco New Jersey Yellow leaf node “sees” its neighbors and the domains on the path to the root. Falcon runs level 2 epidemic because it has lowest load Gnu runs level 2 epidemic because it has lowest load CS5412 Spring 2016 28

29.Hierarchy is virtual… data is replicated Name Load Weblogic? SMTP? Word Version … swift 2.0 0 1 6.2 falcon 1.5 1 0 4.1 cardinal 4.5 1 0 6.0 Name Load Weblogic? SMTP? Word Version … gazelle 1.7 0 0 4.5 zebra 3.2 0 1 6.2 gnu .5 1 0 6.2 Name Avg Load WL contact SMTP contact SF 2.6 123.45.61.3 123.45.61.17 NJ 1.8 127.16.77.6 127.16.77.11 Paris 3.1 14.66.71.8 14.66.71.12 San Francisco New Jersey Green node sees different leaf domain but has a consistent view of the inner domain CS5412 Spring 2016 29