Spark 流式有两套系统:Spark Streaming 和 Structured Streaming。那么这两套系统的区别在哪里呢?以及为什么 Spark 有了 Spark Streaming 还有做 Structured Streaming 呢?我们应该如何去选择呢?

_新年快快发布于2018/12/05 22:25

注脚

1.2018/12/4 From Spark Streaming to Structured Streaming From Spark Streaming to Structured Streaming @E-MapReduce http://localhost:3999/xxx.slide#1 1/44

2.2018/12/4 From Spark Streaming to Structured Streaming Outline Spark Streaming Google Data ow Structured Streaming Reference http://localhost:3999/xxx.slide#1 2/44

3.2018/12/4 From Spark Streaming to Structured Streaming 1. Spark Streaming http://localhost:3999/xxx.slide#1 3/44

4.2018/12/4 From Spark Streaming to Structured Streaming 1.1 Overview http://localhost:3999/xxx.slide#1 4/44

5.2018/12/4 From Spark Streaming to Structured Streaming 1.2 DStream Model http://localhost:3999/xxx.slide#1 5/44

6.2018/12/4 From Spark Streaming to Structured Streaming 1.2 DStream Model pageViews = readStream("http://...", "1s") ones = pageViews.map(event => (event.url, 1)) counts = ones.runningReduce((a, b) => a + b) http://localhost:3999/xxx.slide#1 6/44

7.2018/12/4 From Spark Streaming to Structured Streaming 1.3 Failure Recovery Parallel Recovery Straggler http://localhost:3999/xxx.slide#1 7/44

8.2018/12/4 From Spark Streaming to Structured Streaming 1.4 Consistency Semantics Input -> That Depends DStream -> Exactly Once Output -> At-least-once(default) http://localhost:3999/xxx.slide#1 8/44

9.2018/12/4 From Spark Streaming to Structured Streaming 1.5 DStream API Transformations on DStreams Output Operations on DStreams http://localhost:3999/xxx.slide#1 9/44

10.2018/12/4 From Spark Streaming to Structured Streaming 1.6 Evaluation Linear Scalability High Throuphput High Performance http://localhost:3999/xxx.slide#1 10/44

11.2018/12/4 From Spark Streaming to Structured Streaming 2. Google Data ow http://localhost:3999/xxx.slide#1 11/44

12.2018/12/4 From Spark Streaming to Structured Streaming 2.1 Overview http://localhost:3999/xxx.slide#1 12/44

13.2018/12/4 From Spark Streaming to Structured Streaming 2.2 Points Unbounded/Bounded vs Streaming/Batch Window Time Domain: Processing Time vs Event Time http://localhost:3999/xxx.slide#1 13/44

14.2018/12/4 From Spark Streaming to Structured Streaming 2.3 More The world beyond batch: Streaming 101 The world beyond batch: Streaming 102 Streaming Systems http://localhost:3999/xxx.slide#1 14/44

15.2018/12/4 From Spark Streaming to Structured Streaming 3. Structured Streaming http://localhost:3999/xxx.slide#1 15/44

16.2018/12/4 From Spark Streaming to Structured Streaming 3.1 DStream Pains using processing time, not event time complex, low-level api reason about end-to-end guarantees http://localhost:3999/xxx.slide#1 16/44

17.2018/12/4 From Spark Streaming to Structured Streaming 3.2 Structured Streaming Overview http://localhost:3999/xxx.slide#1 17/44

18.2018/12/4 From Spark Streaming to Structured Streaming 3.2 Structured Streaming Overview Incremental query model Support for end-to-end applications Spark SQL engine reuse: optimizer and runtime code generator. http://localhost:3999/xxx.slide#1 18/44

19.2018/12/4 From Spark Streaming to Structured Streaming 3.3 Program Model http://localhost:3999/xxx.slide#1 19/44

20.2018/12/4 From Spark Streaming to Structured Streaming 3.3 Program Model http://localhost:3999/xxx.slide#1 20/44

21.2018/12/4 From Spark Streaming to Structured Streaming 3.3 Program Model WordCount Example // Create DataFrame representing the stream of input lines from connection to localhost:9999 val lines = spark.readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load() // Split the lines into words val words = lines.as[String].flatMap(_.split(" ")) // Generate running word count val wordCounts = words.groupBy("value").count() http://localhost:3999/xxx.slide#1 21/44

22.2018/12/4 From Spark Streaming to Structured Streaming 3.3 Program Model http://localhost:3999/xxx.slide#1 22/44

23.2018/12/4 From Spark Streaming to Structured Streaming 3.4 Output Mode Append mode (default) Complete mode Update mode http://localhost:3999/xxx.slide#1 23/44

24.2018/12/4 From Spark Streaming to Structured Streaming 3.5 API http://localhost:3999/xxx.slide#1 24/44

25.2018/12/4 From Spark Streaming to Structured Streaming 3.5 API Static-typing and runtime type-safety High-level abstraction and custom view into data Ease-of-use of APIs with structure Performance and Optimization http://localhost:3999/xxx.slide#1 25/44

26.2018/12/4 From Spark Streaming to Structured Streaming 3.6 Window with Event Time import spark.implicits._ val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String } // Group the data by window and word and compute the count of each group val windowedCounts = words.groupBy( window("eventTime", "10 minutes", "5 minutes"), $"word" ).count() http://localhost:3999/xxx.slide#1 26/44

27.2018/12/4 From Spark Streaming to Structured Streaming 3.6 Window with Event Time http://localhost:3999/xxx.slide#1 27/44

28.2018/12/4 From Spark Streaming to Structured Streaming 3.7 EventTime Watermark import spark.implicits._ val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String } // Group the data by window and word and compute the count of each group val windowedCounts = words .withWatermark("eventTime", "10 minutes") .groupBy( window("eventTime", "10 minutes", "5 minutes"), $"word") .count() http://localhost:3999/xxx.slide#1 28/44

29.2018/12/4 From Spark Streaming to Structured Streaming 3.7 EventTime Watermark http://localhost:3999/xxx.slide#1 29/44

30.2018/12/4 From Spark Streaming to Structured Streaming 3.8 UDF Stateful Operator mapGroupsWithState atMapGroupsWithState http://localhost:3999/xxx.slide#1 30/44

31.2018/12/4 From Spark Streaming to Structured Streaming 3.9 Query Planning Catalyst optimizer in Spark SQL 1. Validity Analysis 2. Incrementalization 3. Query Optimization http://localhost:3999/xxx.slide#1 31/44

32.2018/12/4 From Spark Streaming to Structured Streaming 3.10 Consistency Semantics End to End Exactly Once 1. Input: Replayable + WAL 2. Output: Idempotent Writes http://localhost:3999/xxx.slide#1 32/44

33.2018/12/4 From Spark Streaming to Structured Streaming 3.11 Continuous Processing http://localhost:3999/xxx.slide#1 33/44

34.2018/12/4 From Spark Streaming to Structured Streaming 3.11 Continuous Processing micro-batch latency http://localhost:3999/xxx.slide#1 34/44

35.2018/12/4 From Spark Streaming to Structured Streaming 3.11 Continuous Processing continuous latency http://localhost:3999/xxx.slide#1 35/44

36.2018/12/4 From Spark Streaming to Structured Streaming 3.11 Continuous Processing latency ~1 ms at-least-once fault-tolerance guarantees Chandy-Lamport algorithm http://localhost:3999/xxx.slide#1 36/44

37.2018/12/4 From Spark Streaming to Structured Streaming 3.11 Continuous Processing Spark + AI submit 2018: https://databricks.com/session/continuous-processing-in- structured-streaming http://localhost:3999/xxx.slide#1 37/44

38.2018/12/4 From Spark Streaming to Structured Streaming 3.12 Evaluation http://localhost:3999/xxx.slide#1 38/44

39.2018/12/4 From Spark Streaming to Structured Streaming 3.13 Operational Features Code Updates Manual Rollback Hybrid Batch and Streaming Execution Monitoring Fault and Straggler Recovery http://localhost:3999/xxx.slide#1 39/44

40.2018/12/4 From Spark Streaming to Structured Streaming 4. Reference http://localhost:3999/xxx.slide#1 40/44

41.2018/12/4 From Spark Streaming to Structured Streaming 4. Reference 1. Zaharia M, Das T, Li H, et al. Discretized streams: Fault-tolerant streaming computation at scale[C]//Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013: 423-438. 2. Akidau T, Bradshaw R, Chambers C, et al. The data ow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing[J]. Proceedings of the VLDB Endowment, 2015, 8(12): 1792-1803. 3. Armbrust M, Das T, Torres J, et al. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark[C]//Proceedings of the 2018 International Conference on Management of Data. ACM, 2018: 601-613. 4. The world beyond batch: Streaming 101 5. The world beyond batch: Streaming 102 6. Streaming Systems 7. https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in- structured-streaming-in-apache-spark-2-3-0.html 8. https://en.wikipedia.org/wiki/Chandy-Lamport_algorithm http://localhost:3999/xxx.slide#1 41/44

42.2018/12/4 From Spark Streaming to Structured Streaming 4. Reference 9. A Deep Dive Into Structured Streaming: https://databricks.com/session/a-deep-dive-into- structured-streaming 10. Continuous Applications: Evolving Streaming in Apache Spark 2.0: https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in- apache-spark-2-0.html 11. Spark Structured Streaming:A new high-level API for streaming: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html 12. Event-time Aggregation and Watermarking in Apache Spark’s Structured Streaming: https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache- sparks-structured-streaming.html 13. http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html 14. Benchmarking Structured Streaming on Databricks Runtime Against State-of-the-Art Streaming Systems: https://databricks.com/blog/2017/10/11/benchmarking-structured- streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html http://localhost:3999/xxx.slide#1 42/44

43.2018/12/4 From Spark Streaming to Structured Streaming Thank you kelu.tkl@alibaba-inc.com (mailto:kelu.tkl@alibaba-inc.com) http://www.legendtkl.com (http://www.legendtkl.com) http://localhost:3999/xxx.slide#1 43/44

44.2018/12/4 From Spark Streaming to Structured Streaming http://localhost:3999/xxx.slide#1 44/44

user picture

相关Slides

  • 讲解了Facebook在spark shuffle方面的优化,相关论文为 EuroSys ’18: Riffle: Optimized Shuffle Service for Large-Scale Data Analytics

  • Hive作为数据仓库的核心,其元数据管理已经成为大数据领域事实上的标准,各种大数据处理引擎都尝试对其兼容,本文描述社区如何讲Hive服务以及Hive MetaStore服务独立处理,并支持各种权限验证功能。

  • Spark 流式有两套系统:Spark Streaming 和 Structured Streaming。那么这两套系统的区别在哪里呢?以及为什么 Spark 有了 Spark Streaming 还有做 Structured Streaming 呢?我们应该如何去选择呢?

  • MLSQL的文档自助系统 更多信息访问官网: http://www.mlsql.tech