Data streaming is becoming increasingly popular

Apache FlinkTM .... Flink 0.10 (Nov 2015): Event time support, windowing mechanism based on Dataflow/Beam model, graduated DataStream API, high ...
展开查看详情

1.Kostas Tzoumas @ kostas_tzoumas Apache Flink TM Counting elements in streams

2.Introduction 2

3.3 Data streaming is becoming increasingly popular * *Biggest understatement of 2016

4.4 Streaming technology is enabling the obvious: continuous processing on data that is continuously produced

5.5 Streaming is the next programming paradigm for data applications, and you need to start thinking in terms of streams

6.Counting 6

7.Continuous counting A seemingly simple application, but generally an unsolved problem E.g., count visitors, impressions, interactions, clicks, etc Aggregations and OLAP cube operations are generalizations of counting 7

8.Counting in batch architecture Continuous ingestion Periodic (e.g., hourly) files Periodic batch jobs 8

9.Problems with batch architecture High latency Too many moving parts Implicit treatment of time Out of order event handling Implicit batch boundaries 9

10.Counting in λ architecture "Batch layer": what we had before "Stream layer": approximate early results 10

11.Problems with batch and λ Way too many moving parts (and code dup) Implicit treatment of time Out of order event handling Implicit batch boundaries 11

12.Counting in streaming architecture Message queue ensures stream durability and replay Stream processor ensures consistent counting 12

13.Counting in Flink DataStream API Number of visitors in last hour by country 13 DataStream < LogEvent > stream = env . addSource (new FlinkKafkaConsumer (...) ); // create stream from Kafka . keyBy ("country") ; // group by country . timeWindow ( Time.minutes (60) ) // window of size 1 hour . apply (new CountPerWindowFunction ()); // do operations per window

14.Counting in Flink DataStream API Number of visitors in last hour by country 13 DataStream < LogEvent > stream = env . addSource (new FlinkKafkaConsumer (...) ); // create stream from Kafka . keyBy ("country") ; // group by country . timeWindow ( Time.minutes (60) ) // window of size 1 hour . apply (new CountPerWindowFunction ()); // do operations per window

15.Counting hierarchy of needs 15 Continuous counting

16.Counting hierarchy of needs 16 Continuous counting ... with low latency,

17.Counting hierarchy of needs 17 Continuous counting ... with low latency, ... efficiently on high volume streams,

18.Counting hierarchy of needs 18 Continuous counting ... with low latency, ... efficiently on high volume streams, ... fault tolerant (exactly once),

19.Counting hierarchy of needs 19 Continuous counting ... with low latency, ... efficiently on high volume streams, ... fault tolerant (exactly once), ... accurate and repeatable,

20.Counting hierarchy of needs 20 Continuous counting ... with low latency, ... efficiently on high volume streams, ... fault tolerant (exactly once), ... accurate and repeatable, ... queryable 1.1+

21.Rest of this talk 21 Continuous counting ... with low latency, ... efficiently on high volume streams, ... fault tolerant (exactly once), ... accurate and repeatable, ... queryable

22.Latency 22

23.Yahoo! Streaming Benchmark 23 Storm, Spark Streaming, and Flink benchmark by the Storm team at Yahoo! Focus on measuring end-to-end latency at low throughputs First benchmark that was modeled after a real application Read more: https ://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines- at

24.Benchmark task: counting! 24 Count ad impressions grouped by campaign Compute aggregates over last 10 seconds Make aggregates available for queries ( Redis )

25.Results (lower is better) 25 Flink and Storm at sub-second latencies Spark has a latency-throughput tradeoff 170k events/sec

26.Efficiency, and scalability 26

27.Handling high-volume streams Scalability: how many events/sec can a system scale to, with infinite resources? Scalability comes at a cost (systems add overhead to be scalable) Efficiency: how many events/sec can a system scale to , with limited resources? 27

28.Handling high-volume streams Scalability: how many events/sec can a system scale to, with infinite resources? Scalability comes at a cost (systems add overhead to be scalable) Efficiency: how many events/sec can a system scale to , with limited resources? 27

29.Results (higher is better) 29 Also: Flink jobs are correct under failures (exactly once), Storm jobs are not 500k events/sec 3mi events/sec 15mi events/sec