Continuous Processing in Structured Streaming

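This talk covers the details of continuous processing in Structured Streaming, along with my work implementing the initial version in Spark 2.3 and the updates in 2.4. DStreams were Spark's first attempt at streaming, and with DStreams Spark became the first framework to offer both batch and streaming functionality in a single unified execution engine.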

1. Continuous Processing in Structured Streaming. Jose Torres, Databricks. #Dev4SAIS

2. Continuous Processing Overview
● Unified Spark SQL API
● No microbatches
● Low (~1 ms) latency
[Diagram labels: Continuous Processing, Microbatch]

3. DStream API
● Non-declarative, similar to RDDs
● Scala/Java only
● Checkpoints only through complete snapshots
● No event time

4. Structured Streaming
● Data represented as a virtual append-only table
● Unified Spark SQL query API
● Batch and streaming queries return same results

5. Structured Streaming

6. Structured Streaming Features
● DataFrames and Datasets
● SQL, Python, and R language APIs
● Delta-based aggregation state

7. Microbatches

8. Continuous Processing

9. Chandy-Lamport Checkpoints
● Asynchronous
● Consistent
[Diagram labels: Driver, checkpoint complete, ready for global commit, epoch marker, Data Stream, Reader Task, Aggregation, Writer Task, save checkpoint to state store, partition level commit]
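The flow named on this slide (epoch markers in the data stream, partition-level commits, then a driver-side global commit) can be sketched in plain Python. This is a single-threaded toy under stated assumptions, not Spark's implementation: `Driver`, `reader`, `run_task`, and `EPOCH_MARKER` are hypothetical names introduced for illustration, the doubling step stands in for arbitrary map-like processing, and the state-store save shown in the diagram is not modeled.

```python
from collections import defaultdict

EPOCH_MARKER = object()  # hypothetical sentinel the reader injects into the stream


class Driver:
    """Hypothetical coordinator: an epoch commits once every partition reports."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.reports = defaultdict(dict)  # epoch -> {partition: offset}
        self.committed = []               # globally committed epochs, in order

    def report(self, epoch, partition, offset):
        # Partition-level commit: one task says "I have reached this epoch".
        self.reports[epoch][partition] = offset
        if len(self.reports[epoch]) == self.num_partitions:
            # Global commit: all partitions have passed the same marker.
            self.committed.append((epoch, dict(self.reports[epoch])))


def reader(records, epoch_every):
    """Yield records, inserting an epoch marker after every `epoch_every` records."""
    for count, record in enumerate(records, start=1):
        yield record
        if count % epoch_every == 0:
            yield EPOCH_MARKER


def run_task(partition, records, epoch_every, driver, sink):
    """Process one partition; on each marker, report the current offset."""
    epoch, offset = 0, 0
    for item in reader(records, epoch_every):
        if item is EPOCH_MARKER:
            driver.report(epoch, partition, offset)
            epoch += 1
        else:
            sink.append(item * 2)  # stand-in for map-like processing
            offset += 1


driver, sink = Driver(num_partitions=2), []
run_task(0, [1, 2, 3, 4], 2, driver, sink)
run_task(1, [10, 20, 30, 40], 2, driver, sink)
print(driver.committed)
```

Records keep flowing between markers without waiting for the driver, which is the property the "Asynchronous" bullet refers to. An epoch only becomes durable once every partition has reported it, which is what makes the checkpoint consistent.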

10. Checkpointing - Detailed
[Diagram labels: Driver, epoch markers, Reader, Processing, Shuffle, Aggregation, Writer]

11-14. Checkpointing - Detailed
[Successive frames of the same diagram; labels: Reader, Processing, Shuffle, Aggregation, Writer]

15. Checkpointing - Detailed
[Diagram labels: Reader, Processing, Shuffle, Aggregation, Writer, aggregation checkpoint]

16. Checkpointing - Detailed
[Diagram labels: Driver, ready for global commit, Reader, Processing, Shuffle, Aggregation, Writer, commit partition writer]

17. Continuous Processing API
● It's just Structured Streaming
● Run the same queries in continuous mode
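As a rough illustration of "the same queries in continuous mode", the sketch below builds an ordinary Structured Streaming query with the PySpark API and switches it to continuous execution through the trigger alone. It assumes an existing `SparkSession` (Spark 2.3 or later) is passed in as `spark`. The function name, the `doubled` column alias, and the choice of the built-in `rate` source and `console` sink are illustrative and do not come from the talk.

```python
def start_continuous_query(spark, checkpoint_interval="1 second"):
    """Start a map-only streaming query in continuous mode.

    `spark` is an existing SparkSession; nothing here is specific to the
    talk beyond the use of a continuous trigger.
    """
    events = (
        spark.readStream
        .format("rate")                      # built-in test source: (timestamp, value)
        .load()
        .selectExpr("value * 2 AS doubled")  # map-like projection only
    )
    return (
        events.writeStream
        .format("console")
        # The only difference from a micro-batch query: a continuous trigger,
        # whose argument is the checkpoint interval.
        .trigger(continuous=checkpoint_interval)
        .start()
    )
```

A caller would run `query = start_continuous_query(spark)` and later `query.stop()`. In Scala the corresponding trigger is `Trigger.Continuous("1 second")`. The query is kept to a projection because the 2.3 release is described as supporting ETL use cases, with shuffles listed as future work.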

18. Continuous Processing in 2.3
● Initial experimental release
● Supports ETL use cases

19. Ongoing And Future Work
● Shuffles (SPARK-24036)
● Event time (SPARK-24459)
● Metrics (SPARK-23887)
● Exactly-once semantics mode (SPARK-24460)
● Performance testing (TBD)
● Additional data sources (TBD)

20. Q&A