Spark 流式有两套系统:Spark Streaming 和 Structured Streaming。那么这两套系统的区别在哪里呢?以及为什么 Spark 有了 Spark Streaming 还有做 Structured Streaming 呢?我们应该如何去选择呢?

注脚

展开查看详情

1.2018/12/4 From Spark Streaming to Structured Streaming From Spark Streaming to Structured Streaming @E-MapReduce http://localhost:3999/xxx.slide#1 1/44

2.2018/12/4 From Spark Streaming to Structured Streaming Outline Spark Streaming Google Data ow Structured Streaming Reference http://localhost:3999/xxx.slide#1 2/44

3.2018/12/4 From Spark Streaming to Structured Streaming 1. Spark Streaming http://localhost:3999/xxx.slide#1 3/44

4.2018/12/4 From Spark Streaming to Structured Streaming 1.1 Overview http://localhost:3999/xxx.slide#1 4/44

5.2018/12/4 From Spark Streaming to Structured Streaming 1.2 DStream Model http://localhost:3999/xxx.slide#1 5/44

6.2018/12/4 From Spark Streaming to Structured Streaming 1.2 DStream Model pageViews = readStream("http://...", "1s") ones = pageViews.map(event => (event.url, 1)) counts = ones.runningReduce((a, b) => a + b) http://localhost:3999/xxx.slide#1 6/44

7.2018/12/4 From Spark Streaming to Structured Streaming 1.3 Failure Recovery Parallel Recovery Straggler http://localhost:3999/xxx.slide#1 7/44

8.2018/12/4 From Spark Streaming to Structured Streaming 1.4 Consistency Semantics Input -> That Depends DStream -> Exactly Once Output -> At-least-once(default) http://localhost:3999/xxx.slide#1 8/44

9.2018/12/4 From Spark Streaming to Structured Streaming 1.5 DStream API Transformations on DStreams Output Operations on DStreams http://localhost:3999/xxx.slide#1 9/44

10.2018/12/4 From Spark Streaming to Structured Streaming 1.6 Evaluation Linear Scalability High Throuphput High Performance http://localhost:3999/xxx.slide#1 10/44

11.2018/12/4 From Spark Streaming to Structured Streaming 2. Google Data ow http://localhost:3999/xxx.slide#1 11/44

12.2018/12/4 From Spark Streaming to Structured Streaming 2.1 Overview http://localhost:3999/xxx.slide#1 12/44

13.2018/12/4 From Spark Streaming to Structured Streaming 2.2 Points Unbounded/Bounded vs Streaming/Batch Window Time Domain: Processing Time vs Event Time http://localhost:3999/xxx.slide#1 13/44

14.2018/12/4 From Spark Streaming to Structured Streaming 2.3 More The world beyond batch: Streaming 101 The world beyond batch: Streaming 102 Streaming Systems http://localhost:3999/xxx.slide#1 14/44

15.2018/12/4 From Spark Streaming to Structured Streaming 3. Structured Streaming http://localhost:3999/xxx.slide#1 15/44

16.2018/12/4 From Spark Streaming to Structured Streaming 3.1 DStream Pains using processing time, not event time complex, low-level api reason about end-to-end guarantees http://localhost:3999/xxx.slide#1 16/44

17.2018/12/4 From Spark Streaming to Structured Streaming 3.2 Structured Streaming Overview http://localhost:3999/xxx.slide#1 17/44

18.2018/12/4 From Spark Streaming to Structured Streaming 3.2 Structured Streaming Overview Incremental query model Support for end-to-end applications Spark SQL engine reuse: optimizer and runtime code generator. http://localhost:3999/xxx.slide#1 18/44

19.2018/12/4 From Spark Streaming to Structured Streaming 3.3 Program Model http://localhost:3999/xxx.slide#1 19/44

20.2018/12/4 From Spark Streaming to Structured Streaming 3.3 Program Model http://localhost:3999/xxx.slide#1 20/44

21.2018/12/4 From Spark Streaming to Structured Streaming 3.3 Program Model WordCount Example // Create DataFrame representing the stream of input lines from connection to localhost:9999 val lines = spark.readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load() // Split the lines into words val words = lines.as[String].flatMap(_.split(" ")) // Generate running word count val wordCounts = words.groupBy("value").count() http://localhost:3999/xxx.slide#1 21/44

22.2018/12/4 From Spark Streaming to Structured Streaming 3.3 Program Model http://localhost:3999/xxx.slide#1 22/44

23.2018/12/4 From Spark Streaming to Structured Streaming 3.4 Output Mode Append mode (default) Complete mode Update mode http://localhost:3999/xxx.slide#1 23/44

24.2018/12/4 From Spark Streaming to Structured Streaming 3.5 API http://localhost:3999/xxx.slide#1 24/44

25.2018/12/4 From Spark Streaming to Structured Streaming 3.5 API Static-typing and runtime type-safety High-level abstraction and custom view into data Ease-of-use of APIs with structure Performance and Optimization http://localhost:3999/xxx.slide#1 25/44

26.2018/12/4 From Spark Streaming to Structured Streaming 3.6 Window with Event Time import spark.implicits._ val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String } // Group the data by window and word and compute the count of each group val windowedCounts = words.groupBy( window("eventTime", "10 minutes", "5 minutes"), $"word" ).count() http://localhost:3999/xxx.slide#1 26/44

27.2018/12/4 From Spark Streaming to Structured Streaming 3.6 Window with Event Time http://localhost:3999/xxx.slide#1 27/44

28.2018/12/4 From Spark Streaming to Structured Streaming 3.7 EventTime Watermark import spark.implicits._ val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String } // Group the data by window and word and compute the count of each group val windowedCounts = words .withWatermark("eventTime", "10 minutes") .groupBy( window("eventTime", "10 minutes", "5 minutes"), $"word") .count() http://localhost:3999/xxx.slide#1 28/44

29.2018/12/4 From Spark Streaming to Structured Streaming 3.7 EventTime Watermark http://localhost:3999/xxx.slide#1 29/44