1.CS5412: Lecture 8 Time in I o T Applications Ken Birman Spring, 2019 http://www.cs.cornell.edu/courses/cs5412/2019sp 1
2.guarantees In I o T Platforms Everyone loves guarantees! They make it easier to create systems. Right? Now that we know how a file system can keep track of time and offer temporal reads with guaranteed consistency, why not go further? Can we invent a -service offering time-based help to IoT sensors and applications? http://www.cs.cornell.edu/courses/cs5412/2019sp 2
3.Microsoft FarmBeats Recall: FarmBeats is a prototype smart farm solution on Azure IoT Edge Does time arise as a need in this kind of system, other than indexing the stored data by time? http://www.cs.cornell.edu/courses/cs5412/2019sp 3
4.Insights from Lecture 6 + 7 Two drones will cross the same spot in a field. Should we coordinate paths using accurate clocks, or precise clocks, if we can’t have both? As they fly, they measure air movement and save time-based wind data into a file system. Would we want to use Theo’s Freeze Frame File System to “learn” wind patterns, or would HDFS be fine? http://www.cs.cornell.edu/courses/cs5412/2019sp 4
5.What other “Uses” for time arise? In 1995, the United States launched an ambitious program to update all of US air traffic control. And controlling drones is sort of a 2018 story that sounds like a scaled up version of ATC. In fact, the 1995 FAA project centered on an IoT computing model! Basically, sensors + radar + on-plane GPS let us track planes. Pilots and Air Traffic Controllers interact with the system. http://www.cs.cornell.edu/courses/cs5412/2019sp 5
6.Core Real-Time Mechanism We’ve discussed publish-subscribe. The FAA settled on this model. Real-time systems often center on a publish-subscribe product called a real-time data distribution service (DDS) DDS technology has become highly standardized (by the OMG) One can layer DDS over an object store (like in Derecho). In fact we are doing that in Derecho now, as an optional API. http://www.cs.cornell.edu/courses/cs5412/2019sp 6
7.Air Traffic Example A persistent real-time DDS combines database and pub/sub functionality http://www.cs.cornell.edu/courses/cs5412/2019sp 7 Owner of flight plan updates it… there can only be one owner. DDS makes the update persistent, records the ordering of the event, reports it to client systems … Other ATC controllers see real-time read-only notifications Intelligent -service Intelligent -service Real-Time Guarantee: Notifications get through within delay , even if something crashes!
8.Ideas IBM proposed A real-time key-value storage system, in which updates to variables would become visible to all processes simultaneously. A real-time “heart beat” to help devices take coordinated actions. A real-time message bus (DDS) to report new events in a temporally consistent way to processes interested in those events. http://www.cs.cornell.edu/courses/cs5412/2019sp 8
9.A real-time distributed shared memory http://www.cs.cornell.edu/courses/cs5412/2019sp 9 p 0 p 1 p 2 p 3 p 4 p 5 t t+a t+b Think of “x” as a key and “3” as the new value. Here x=3 becomes visible within a time window between t+a and t+b . set x=3 x=3
10.Periodic process group: Marzullo http://www.cs.cornell.edu/courses/cs5412/2019sp 10 p 0 p 1 p 2 p 3 p 4 p 5 Here, a set of devices can coordinate their actions based on a heartbeat.
11.Fault-tolerant real-time DDS http://www.cs.cornell.edu/courses/cs5412/2019sp 11 p 0 p 1 p 2 p 3 p 4 p 5 t t+a t+b * * * * * Message is sent at time t by p 0 . Later both p 0 and p 1 fail. But message is still delivered atomically, after a bounded delay, and within a bounded interval of time By having every receiver echo every message to all other members, we create an O(n 2 ) message burst but can overcome crash failures or network link issues.
12.The CASD protocol suite (named for the authors: Cristian, Aghili , Strong, Dolev) Also known as the “ -T” protocols, solves all of these problems! Goal is to implement a timed atomic broadcast tolerant of failures First, they looked at crash failures and message loss. Then they added in more severe scenarios like message corruption (the so-called “Byzantine” model). To make this feasible, they assumed limits on how many faults occur. http://www.cs.cornell.edu/courses/cs5412/2019sp 12
13.Examples of the assumptions they made Assumes use of clock synchronization, known clock skew limits. Sender timestamps message, then sends it. Could crash and not send to some receivers: they can only tolerated a few crashes of this kind. Recipients forward the message using a flooding technique (each echos the message to others). Crashes here count towards the “crash failure” limit. Wait until all correct processors have a copy, then deliver in unison (up to limits of the clock skew) http://www.cs.cornell.edu/courses/cs5412/2019sp 13
14.CASD picture http://www.cs.cornell.edu/courses/cs5412/2019sp 14 p 0 p 1 p 2 p 3 p 4 p 5 t t+a t+b * * * * * p 0 , p 1 fail. Messages are lost when echoed by p 2 , p 3 These were crashes. Otherwise the messages from p 0 would have reached everyone in step one! Notice how time elapses step by step After so many “all-to-all” echo steps, one of them must have been successful.
15.Idea of CASD Because we are assuming that there are known limits on number of processes that fail during protocol, number of messages lost, we can do a kind of worst-case reasoning. Same for temporal (timing) mistakes. The key idea is to overwhelm the failures – run down the clock. Then schedule delivery using original time plus a delay computed from the worst-case assumptions http://www.cs.cornell.edu/courses/cs5412/2019sp 15
16.Basic protocol is very simple! Every process that receives a message Discards it, if the timestamp on the message is “out of bounds” If this is a duplicate, no more work to do, otherwise, save a copy. Echo it to every other process if it wasn’t a duplicate. Then after a determined delay, deliver the message at a time computed by taking the timestamp and adding a specific (hence the protocol name). http://www.cs.cornell.edu/courses/cs5412/2019sp 16
17.Where do the bounds come from? They do a series of “pessimistic” worst-case analyses. For example, they say things like this: Suppose message M is sent at time T by some process. What is the earliest clock value that process Q could have when M first arrives at Q? What is the latest possible clock value? This would let them compute the bounds for discarding a message that “definitely was from a process with a faulty clock” in step 1. http://www.cs.cornell.edu/courses/cs5412/2019sp 17
18.The analysis is surprisingly difficult! A message can jump from process to process so by the time it reaches Q, it may actually have gone through several hops. Each hop contributes delay, plus at each hop some process ran through the same decision logic, and apparently, decided to forward the message. If Q and R are correct, we need a proof that they will ultimately both deliver M, or both reject M, and this is hard to pull off! http://www.cs.cornell.edu/courses/cs5412/2019sp 18
19.What does it mean to be “correct”? In many settings, it is obvious when a process is faulty! But with CASD a process is correct if it behaves according to a set of assumptions that cover timing and other aspects (for example, messages have digital signatures, and a process that damages a message is incorrect). So there when we say “If Q and R are correct”, this is a fairly elaborate set of conditions on their behavior. http://www.cs.cornell.edu/courses/cs5412/2019sp 19
20.The problems with CASD In the usual case, nothing goes wrong, hence the delay can is too conservative Even if things do go wrong, is it right to assume that if a worst-case message needs ms to make one hop, we should budget n * time for n hops? How realistic is it to bound the number of failures expected during a run? http://www.cs.cornell.edu/courses/cs5412/2019sp 20
21.CASD in a more typical run http://www.cs.cornell.edu/courses/cs5412/2019sp 21 p 0 p 1 p 2 p 3 p 4 p 5 t t+a t+b * * * * * * In a normal run, everything gets through right away, but then we wait a long time before the DDS can deliver messages.
22.... Developers tuned, aiming for this http://www.cs.cornell.edu/courses/cs5412/2019sp 22 p 0 p 1 p 2 p 3 p 4 p 5 t t+a t+b * * * * * * Here we just changed the delay parameter to deliver sooner. Seems reasonable, no?
23.Oops! CASD starts to “malfunction” http://www.cs.cornell.edu/courses/cs5412/2019sp 23 p 0 p 1 p 2 p 3 p 4 p 5 t t+a t+b * all processes look “incorrect” (red) from time to time * * * Very few processes are deemed “correct”. The others do weird wrong things. CASD didn’t work!
24.Why does CASD Treat so many processes as faulty? We need to think about what these assumptions meant. Suppose CASD assumes that a message sent from P to Q will always arrive within delay , but then we set the limit, , to a very small value. If =100ms, we would never have seen problems. But with =1ms, perhaps 10% of messages show up “late”. This will be treated as if P or Q had failed! http://www.cs.cornell.edu/courses/cs5412/2019sp 24
25.More examples CASD has a limit on how long after a message arrives, Q can take to process it. If this limit is very low, scheduling delays make Q look faulty. CASD has a limit on how many messages can be lost in the network. CASD has a limit on how far the clocks on the computers can drift. http://www.cs.cornell.edu/courses/cs5412/2019sp 25
26.What does faulty “mean”? Since P and Q are still running, in what sense are they faulty? They won’t be notified that CASD “thinks” of them as faulty. They don’t have any local evidence they can rely on. In fact both think of themselves as healthy. They are normal programs. Yet the CASD guarantees no longer apply because of these violations of the assumptions – CASD only promises atomicity and accurate timing if the model isn’t violated. http://www.cs.cornell.edu/courses/cs5412/2019sp 26
27.Closer look http://www.cs.cornell.edu/courses/cs5412/2019sp 27 p 0 p 1 p 2 p 3 p 4 p 5 t t+a t+b * all processes look “incorrect” (red) from time to time * * * Very few processes are deemed “correct”. The others do weird wrong things. CASD didn’t work!
28.Consequence? Only some of the drones learn the most current search plan. Perhaps a “periodic event” triggers and only p 3 and p 5 act. Others don’t do anything, and p 4 acts but 3s late. So clearly, running CASD “aggressively” didn’t work at all! But CASD with a 20s delay (a delay for which it works well) is useless! http://www.cs.cornell.edu/courses/cs5412/2019sp 28
29.Programming over CASD The original idea was to configure CASD to be perfectly reliable. But now we can see that with aggressive parameter settings, CASD becomes extremely unreliable. How would we compensate for this risk in software? http://www.cs.cornell.edu/courses/cs5412/2019sp 29