大数据实时流式处理引擎比较

从流处理的核心概念,到功能的完备性,再到周边的生态环境,全方位对比了目前比较热门的流处理框架:Spark,Flink,Storm和Gearpump。结合不同的框架的设计,为大家进行深入的剖析。与此同时,从吞吐量和延时两个方面,对各个框架进行性能评估。主要技术点:流失数据处理,Spark,Flink,Storm和Gearpump。
展开查看详情

1.Functional Comparison and Performance Evaluation 毛玮 王华峰 张天伦 2016/9/10

2. Overview  Streaming Core  MISC  Performance Benchmark Choose your weapon ! *Other names and brands may be claimed as the property of others. 2

3.

4. Continuous Streaming Continuous Streaming Micro-Batch Ack per Record Checkpoint “per Batch” Checkpoint per Batch Apache Twitter Aapche Apache Apache Spark Apache Storm Storm* Heron* Flink* Gearpump* Streaming* Trident* Storage Storage Storage Source Operator Sink Source Operator Sink Source Operator Sink id id offset state str ack offset state str job status Acker JobManager/ Driver HDFS HDFS This is the critical part, as it affects many features *Other names and brands may be claimed as the property of others. 4

5. Continuous Streaming Continuous Streaming Micro-Batch Ack per Record Checkpoint “per Batch” Checkpoint per Batch Apache Twitter Aapche Apache Apache Spark Apache Storm Storm* Heron* Flink* Gearpump* Streaming* Trident* Low Latency High Latency High Overhead Low Overhead Low Throughput High Throughput *Other names and brands may be claimed as the property of others. 5

6. Delivery Guarantee Apache Twitter Aapche Apache Apache Spark Apache Storm Storm* Heron* Flink* Gearpump* Streaming* Trident* At least once Exactly once • Ackers know about if a • State is persisted in record is processed durable storage successfully or not. If it failed, replay it. • Checkpoint is linked with state storage per Batch • There is no state consistency guarantee. *Other names and brands may be claimed as the property of others. 6

7. Native State Operator Apache Twitter Aapche Apache Apache Spark Apache Storm Storm* Heron* Flink* Gearpump* Streaming* Trident* Yes* Yes Yes • Storm: • Flink Java API: • Spark 1.5:  KeyValueState  ValueState  updateStateByKey  ListState  ReduceState • Spark 1.6: • Heron:  mapWithState X User Maintain • Flink Scala API:  mapWithState • Trident:  persistentAggregate • Gearpump  State  persistState *Other names and brands may be claimed as the property of others. 7

8. Dynamic Load Balance & Recovery Speed Apache Twitter Aapche Apache Apache Spark Apache Storm Storm* Heron* Flink* Gearpump* Streaming* Trident* exec 10s + 5s = 15s exec 10s Source exec 5s Source exec 10s exec 10s + 5s = 15s exec 10s *Other names and brands may be claimed as the property of others. 8

9.

10. Apache Storm* Compositional Twitter Heron* Apache Gearpump* • Highly customizable operator based on basic building blocks • Manual topology definition and optimization TopologyBuilder builder = new TopologyBuilder(); builder.setSpout(“input", new RandomSentenceSpout(), 1); builder.setBolt("split", new SplitSentence(), 3).shuffleGrouping("spout"); builder.setBolt("count", new WordCount(), 2).fieldsGrouping("split", new Fields("word")); “foo, foo, bar” “foo”, “foo”, “bar” {“foo”: 2, “bar”: 1} Spout Bolt Bolt *Other names and brands may be claimed as the property of others. 10

11. Apache Spark Streaming* Declarative Apache Storm Trident* Aapche Flink* • Higher order function as operators (filter, mapWithState…) Apache Gearpump* • Logical plan optimization DataStream<String> text = env.readTextFile(params.get("input")); DataStream<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer()).keyBy(0).sum(1); “foo, foo, bar” “foo”, “foo”, “bar” {“foo”: 1, “foo”: 1, “bar”: 1} {“foo”: 2, “bar”: 1} *Other names and brands may be claimed as the property of others. 11

12. Statistical • Data scientist friendly • Dynamic type Apache Spark Apache Twitter ˚Structured ˚Apache Streaming* Storm* Heron* Streaming* Storm* Python R lines <- textFile(sc, “input”) lines = ssc.textFileStream(params.get("input")) words <- flatMap(lines, function(line) { words = lines.flatMap(lambda line: line.split(“,")) strsplit(line, “ ”)[[1]] pairs = words.map(lambda word: (word, 1)) }) counts = pairs.reduceByKey(lambda x, y: x + y) wordCount <- lapply(words, function(word) { counts.saveAsTextFiles(params.get("output")) list(word, 1L) } counts <- reduceByKey(wordCount, “+”, 2L) *Other names and brands may be claimed as the property of others. 12

13. SQL Apache Spark Structured Streaming* Streaming Fusion Style Aapche Flink* Pure Style Apache Storm Trident* CREATE EXTERNAL TABLE InputDStream.transform((rdd: RDD[Order], time: Time) => ORDERS (ID INT PRIMARY KEY, UNIT_PRICE INT, QUANTITY { INT) import sqlContext.implicits._ LOCATION 'kafka://localhost:2181/brokers?topic=orders' rdd.toDF.registAsTempTable TBLPROPERTIES '{...}}‘ val SQL = "SELECT ID, UNIT_PRICE * QUANTITY AS TOTAL FROM ORDERS WHERE UNIT_PRICE * INSERT INTO LARGE_ORDERS SELECT ID, UNIT_PRICE * QUANTITY > 50" QUANTITY val largeOrderDF = sqlContext.sql(SQL) AS TOTAL FROM ORDERS WHERE UNIT_PRICE * QUANTITY > largeOrderDF.toRDD 50 }) bin/storm sql XXXX.sql *Other names and brands may be claimed as the property of others. 13

14. Summary Compositional Declarative Python/R SQL Apache Spark X √ √ √ Streaming* Apache NOT support √ X √ Storm* aggregation, Apache Storm windowing and X √ X joining Trident* Apache √ √ X X Gearpump* Aapche Support select, X √ X Flink* from, where, union Twitter √ X √˚ X Heron* *Other names and brands may be claimed as the property of others. 14

15.

16. Twitter • Single Task on Single Process Heron* JVM JVM Process Connect Process Connect with Task with Task local SM local SM Thread Thread Thread Thread Aapche • Multi Tasks of Multi Applications on Single Process Flink* JVM JVM Process Process Task Task Task Task Task Thread Thread Thread Thread Thread Task task from application A Task task from application B *Other names and brands may be claimed as the property of others. 16

17. • Multi Tasks of Single application on Single Process Apache Spark o Single task on single thread Streaming* JVM JVM Process Process Task Task Task Task Task Thread Thread Thread Thread Thread Apache Apache Storm Apache o Multi tasks on single thread Storm* Trident* Gearpump* JVM JVM Task Process Task Process Task Task Task Task Task Task Thread Thread Thread Thread *Other names and brands may be claimed as the property of others. 17

18.● Window Support ● Out-of-order Processing ● Memory Management ● Resource Management ● Web UI ● Community Maturity

19. Window Support smaller than gap t t • Sliding Window • Session Window • Count Window session gap Sliding Window Count Window Session Window Apache Spark √ X X˚ Streaming* Apache √ √ X Storm* Apache Storm √ √ X Trident* Apache √˚ X X Gearpump* Apache Flink* √ √ √ Apache X X X Heron* *Other names and brands may be claimed as the property of others. 19

20. Out-of-order Processing Processing Time Event Time Watermark Apache Spark √ √˚ X˚ Streaming* Apache √ √ √ Storm* Apache Storm √ X X Trident* Apache √ √ √ Gearpump* Aapche √ √ √ Flink* Twitter √ X X Heron* *Other names and brands may be claimed as the property of others. 20

21. Memory Management JVM Manage Self Manage on-heap Self Manage off-heap Apache Spark √ √˚ √˚ Streaming* Aapche √ √ √ Flink* Apache √ X X Storm* Apache √ X X Gearpump* Twitter √ X X Heron* *Other names and brands may be claimed as the property of others. 21

22. Resource Management Standalone YARN Mesos Apache Spark √ √ √ Streaming* Apache √ √˚ √˚ Storm* Apache Storm √ √˚ √˚ Trident* Apache √ √ X Gearpump* Aapche √ √ X Flink* Twitter √ √ √ Heron* *Other names and brands may be claimed as the property of others. 22

23. Web UI Submit Cancel Inspect Show Show Check Inspect Alert Jobs Jobs Jobs Statistics Input Rate Exceptions Config Apache Spark X √ √ √ √ √ √ X Streaming* Apache X √ √ √ √˚ √ √ X Storm* Apache √ √ √ √ √˚ √ √ X Gearpump* Apache √ √ √ √ X √ √ X Flink* Twitter X X √ √ √˚ √ √ X Heron* *Other names and brands may be claimed as the property of others. 23

24. Past 1 Months Summary on GitHub Community Maturity 1000 Commits Committor 780 Apache 800 Initiation Contrib Top 600 Time utors Project 400 Apache 217 184 200 102 130 Spark 2013 2014 926 20 21 5 34 20 Streaming* 0 Spark Storm Gearpump Flink Heron Apache 2011 2014 219 Source website: https://github.com/apache/spark/pulse/monthly Storm* Past 3 Months Summary on JIRA Apache Created Resloved 2014 Incubator 21 2500 2161 Gearpump* 2000 1500 Apache 2010 2015 208 1000 Flink* 514 500 237 161 77 Twitter 0 2014 N/A 44 Spark Storm Gearpump Flink Heron Heron* Source website: https://issues.apache.org/jira/secure/Dashboard.jspa *Other names and brands may be claimed as the property of others. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced 24 data are accurate.

25.HiBench 6.0

26.Test Philosophical • “Lazy Benchmarking” • Simple test case infer practical use case 26

27. The Setup Apache Kafka* Cluster Name Version • CPU: 2 x Intel(R) Xeon(R) CPU E5- Java 1.8 • 2699 v3@ 2.30GHz Mem: 128 GB x3 • Disk: 8 x HDD (1TB) Scala 2.11.7 • Network: 10 Gbps Apache Hadoop* 2.6.2 10 Gbps Apache Zookeeper* 3.4.8 Apache Kafka* 0.8.2.2 Apache Spark* 1.6.1 Test Cluster Apache Storm* 1.0.1 • CPU: 2 x Intel(R) Xeon(R) CPU E5- Apache Flink* 1.0.3 • • 2697 v2@ 2.70GHz Core: 20 / 24 Mem: 80 / 128 GB x7 Apache Gearpump* 0.8.1 • Disk: 8 x HDD (1TB ) • Network: 10 Gbps • Apache Heron* require specific Operation System (Ubuntu/CentOS/Mac OS) • Structured Streaming doesn’t support Kafka source yet (Spark 2.0) *Other names and brands may be claimed as the property of others. 27

28.Architecture Kafka Test Cluster (Standalone) Broker Data Kafka Client Master Topic A Topic A Generator Broker In Time Kafka Broker Slave Slave Slave 20 Core 20 Core 20 Core Topic B 80G 80G 80G Mem Mem Mem Out Time Slave Slave Slave Slave 20 Core 20 Core 20 Core 20 Core 80G 80G 80G 80G File Result Metrics Reader Mem Mem Mem Mem System Out Time – In Time 28

29. Framework Configuration Framework Related Configuration Apache Spark 7 Executor Streaming* 140 Parallelism Aapche 7 TaskManager Flink* 140 Parallelism Apache 28 Worker Storm* 140 KafkaSpout Apache 28 Executors Gearpump* 140 KafkaSource *Other names and brands may be claimed as the property of others. 29