Unified Data Processing with Apache Flink and Apache Pulsar——Seth Wiesman

展开查看详情

1.Unified Data Processing with Apache Flink and Apache Pulsar Seth Wiesman Senior Solutions Architect @ Ververica Committer Apache Flink © 2019 Ververica

2.About Ververica (the company formerly known as “data Artisans”) Original Creators of Enterprise Stream Processing Subsidiary of Apache Flink® With Ververica Platform Alibaba Group 2 © 2019 Ververica

3. Apache Flink 3 © 2019 Ververica

4.Apache Flink 4 © 2019 Ververica

5. Apache Flink at The "Singles Day" (11/11/2019) containers data size throughput latency state size Sub- 2M 985 PB 2.5 B Second 100TB events / sec 5 © 2019 Ververica

6. Why Stream Processing? 6 © 2019 Ververica

7. Stream Processing is real-time data processing and real-time data-driven actions 7 © 2019 Ververica

8. Stream Processing is the unification of real-time and offline analytics 8 © 2019 Ververica

9. Stream Processing is the intersection of data analytics and applications 9 © 2019 Ververica

10. Stream Processing is to event-driven applications what the database is to request/response apps 10 © 2019 Ververica

11. Stream Processing is a flexible and extensible architecture for data-driven applications 11 © 2019 Ververica

12.Stream Processing changes how Applications and Data interact request/trigger result/response event stream event stream Stream Application / Processor Business Logic events are the data events act as triggers Application / application logic triggered Business Logic by events/changes (Datalake, Database) Batch Proc. or Req/resp. Stream Processing 12 © 2019 Ververica

13. What is Stream Processing for? Ad-hoc queries, data exploration, Most business logic ML model training query/logic changes fast data changes fast data changes slowly query/logic changes slowly Batch Proc. or Req/resp. Continuous Streaming 13 © 2019 Ververica

14. The Spectrum of Streaming Data Use Cases machine learning unified offline/ real-time behavior modeling real-time ML model model training real-time analytics (recommenders, pricing, ..) training/evaluation data warehousing continuous continuous monitoring real-time alerts distributed OLAP / BI / reporting ETL (position, risk, …) (fraud, security, …) OLTP-style apps more lag time more real time 14 © 2019 Ververica

15.Stateful Single Record Processing 15 © 2019 Ververica

16.Everything is a Stream Streams Of Records in a Log or MQ 16 © 2019 Ververica

17.Everything is a Stream Stream of Requests/Responses to/from Services GET /a/b POST /b/c PUT /e/f Service 200 404 200 200 403 DB à event sourcing architecture 17 © 2019 Ververica

18.Everything is a Stream Stream of Rows in a Table or in Files 2016-3-1 12:00 am 2016-3-1 1:00 am 2016-3-1 2:00 am … 2016-3-11 10:00pm 2016-3-11 11:00pm 2016-3-12 12:00am 2016-3-12 1:00am 2016-3-12 2:00am 2016-3-12 3:00am 18 © 2019 Ververica

19.Everything is a Stream Stream of Rows in a Table or in Files 2016-3-1 12:00 am 2016-3-1 1:00 am 2016-3-1 2:00 am … 2016-3-11 10:00pm 2016-3-11 11:00pm 2016-3-12 12:00am 2016-3-12 1:00am 2016-3-12 2:00am 2016-3-12 3:00am a batch 19 © 2019 Ververica

20.Everything is a Stream Streams may span storage systems Parquet files Avro records 2016-3-1 12:00 am 2016-3-1 1:00 am 2016-3-1 2:00 am … 2016-3-11 10:00pm 2016-3-11 11:00pm more distant past recent past (e.g., compressed files in DFS/Object Store) (e.g., events in MQ/Log) 20 © 2019 Ververica

21.21 © 2019 Ververica

22.Bounded and Unbounded Streams 22 © 2019 Ververica

23.Components of a Streaming Data Architecture (Apache Flink) Results (Views) (K/V stores, databases) Stream Processing Stream Processing Stream Processing Log / Stream Storage Triggered (Pulsar) Applications Event producers (applications, servers, databases, sensors) 23 © 2019 Ververica

24. Apache Flink: Analytics and Applications on Streaming Data Stateful Stream Processing Event-driven Streaming Analytics Applications Streams, State, Time SQL and Tables Stateful Functions Flink Runtime Stateful Computations over Data Streams 24 © 2019 Ververica

25. Stateful Stream Processing 25 © 2019 Ververica

26. Apache Flink: Analytics and Applications on Streaming Data Stateful Stream Processing Event-driven Streaming Analytics Applications Streams, State, Time SQL and Tables Stateful Functions Flink Runtime Stateful Computations over Data Streams 26 © 2019 Ververica

27.Stateful Stream Processing Source (Stream) Source (Static) Computation State Computation State Transformation Computation Computation State Sink Sink 27 © 2019 Ververica

28.Example Use Cases • Real time search and recommendation models (e.g., Alibaba) • Build a real-time session behavior profile of users (e.g., Netflix) • Real time trade settlement dashboard (e.g., UBS) • Real time revenue accounting (various AdTechs) • Machine Learning-based anomaly/fraud detection (e.g., ING, Microsoft) • Real-time data refinement and data pipelines (many) 28 © 2019 Ververica

29.DataStream API val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer011(…)) Source val events: DataStream[Event] = lines.map((line) => parse(line)) Transformation val stats: DataStream[Statistic] = stream .keyBy("sensor") .timeWindow(Time.seconds(5)) Windowed Transformation .sum(new MyAggregationFunction()) stats.addSink(new RollingSink(path)) Sink Streaming Dataflow Source Transform Window Sink (state read/write) 29 © 2019 Ververica

StreamNative 是一家围绕 Apache Pulsar 和 Apache BookKeeper 打造下一代流数据平台的开源基础软件公司。秉承 Event Streaming 是大数据的未来基石、开源是基础软件的未来这两个理念,专注于开源生态和社区的构建,致力于前沿技术。