Apache Spark 2.3 Overview: What's New?

Apache Spark 2.0 laid the architectural foundations of structure in Spark: unified high-level APIs, Structured Streaming, and underlying performance components such as the Catalyst Optimizer and the Tungsten engine. Since then, Spark community contributors have continued to build new features and fix many issues in the Spark 2.1 and 2.2 releases.

1. What's New in Apache Spark 2.3
Sameer Agarwal | Spark Summit | San Francisco | June 6th 2018 | #DevSAIS16

2. About Me
• Spark Committer and 2.3 Release Manager
• Software Engineer at Facebook (Big Compute)
• Previously at Databricks and UC Berkeley
• Research on BlinkDB (Approximate Queries in Spark)

3. Spark 2.3 Release by the Numbers
• Released on 28th February 2018
• Development span: July '17 – Feb '18
• 284 contributors
• 1406 JIRAs: SQL/Streaming (52%), Spark Core (12%), PySpark (9%), ML (8%)

4. Overview
• ML + Image Reader
• Streaming: Continuous Processing
• PySpark Performance
• Spark on Kubernetes

5. Major Features in Spark 2.3
• Continuous Processing
• Data Source API V2
• Spark on Kubernetes
• PySpark Performance
• ML on Streaming
• History Server V2
• Stream-stream Join
• UDF Enhancements
• Image Reader
• Native ORC Support
• Stable Codegen
• Various SQL Features
https://spark.apache.org/releases/spark-release-2-3-0.html

6. Overview
• ML + Image Reader
• Streaming: Continuous Processing
• PySpark Performance
• Spark on Kubernetes

7. Structured Streaming
• Users: treat a stream as an infinite table; no need to reason about micro-batches
• Developers: decoupled the high-level API from the execution engine

8. Structured Streaming

9. Micro Batch Execution

10. Micro Batch Execution
• Latency: > 100 ms
• Exactly-once semantics

11. Continuous Processing (SPARK-20928)
An experimental execution mode

12. Continuous Processing (SPARK-20928)

13. Continuous Processing (SPARK-20928)
• Latency: ~1 ms
• At-least-once semantics

14. Continuous Processing (SPARK-20928)

15. Continuous Processing (SPARK-20928)
Supported Operations
• Map-like Dataset operations: projections and selections
• All SQL functions except current_timestamp(), current_date(), and aggregation functions
Supported Sources
• Kafka source
• Rate source
Supported Sinks
• Kafka sink
• Memory sink
• Console sink
Blog: https://tinyurl.com/spark-cp
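The constraints above translate directly into how a continuous query is written: only map-like operations between a supported source and sink, with the trigger changed from micro-batch to continuous. A minimal configuration sketch (it needs a live Kafka broker; the broker address, topic names, and checkpoint path are placeholders, not from the talk):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-demo").getOrCreate()

# Source: Kafka, one of the two sources supported in continuous mode.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "input-topic")
          .load())

# Only projections/selections are allowed; no aggregations.
cleaned = (events
           .selectExpr("CAST(value AS STRING) AS value")
           .where("value IS NOT NULL"))

# The only change vs. micro-batch mode: a continuous trigger,
# whose argument is the checkpoint interval.
query = (cleaned.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "output-topic")
         .option("checkpointLocation", "/tmp/continuous-demo-checkpoint")
         .trigger(continuous="1 second")
         .start())
```

Swapping `trigger(continuous="1 second")` back to a processing-time trigger returns the same query to micro-batch execution, which is what makes the mode experiment-friendly.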

16. Overview
• ML + Image Reader
• Streaming: Continuous Processing
• PySpark Performance
• Spark on Kubernetes

17. ML on Streaming
• Model transformation/prediction on batch and streaming data with a unified API
• After fitting a model or Pipeline, you can deploy it in a streaming job:
val streamOutput = transformer.transform(streamDF)

18. Image Support in Spark (SPARK-21866)
• A standard API in Spark for reading images into DataFrames
• Utilities for loading images from common formats
• Deep learning frameworks can rely on this
val df = ImageSchema.readImages("/data/images")

19. Overview
• ML + Image Reader
• Streaming: Continuous Processing
• PySpark Performance
• Spark on Kubernetes

20. PySpark
• Introduced in Spark 0.7 (~2013); became a first-class citizen in the DataFrame API in Spark 1.3 (~2015)
• Much slower than Scala/Java with UDFs, due to serialization and the Python interpreter
• Note: most PyData tooling (e.g., Pandas, NumPy) is written in C/C++

21. PySpark Performance
Pandas UDFs perform much better than row-at-a-time UDFs across the board, ranging from 3x to over 100x.
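The speedup comes from amortizing per-row interpreter and serialization overhead over whole batches. A toy illustration of the two styles, independent of Spark (the function names are illustrative, not Spark APIs): row-at-a-time pays a Python-level call per element, while the vectorized form issues one NumPy operation per batch, which is the pattern Pandas UDFs exploit via Arrow record batches.

```python
import numpy as np

def plus_one_per_row(values):
    # Row-at-a-time style: one Python-level operation per element.
    return [v + 1 for v in values]

def plus_one_vectorized(values):
    # Vectorized style: a single NumPy operation over the whole batch.
    return np.asarray(values) + 1

batch = list(range(5))
print(plus_one_per_row(batch))          # [1, 2, 3, 4, 5]
print(plus_one_vectorized(batch).tolist())  # [1, 2, 3, 4, 5]
```

Both produce identical results; on large batches the vectorized form is dramatically faster because the loop runs in compiled code.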

22. Pandas/Vectorized UDFs
Scalar UDFs
• Used with functions such as select and withColumn
• The Python function should take a pandas.Series as input and return a pandas.Series of the same length

23. Pandas/Vectorized UDFs
Grouped Map UDFs
• Split-apply-combine
• A Python function that defines the computation for each group
• Input and output are both pandas.DataFrame
Blog: https://tinyurl.com/pyspark-udf
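The split-apply-combine contract can likewise be exercised in plain pandas: Spark's `groupBy(...).apply(udf)` hands the function one pandas.DataFrame per group and concatenates the returned frames, which `groupby(...).apply(...)` mimics locally (the mean-subtraction function below is the blog's classic example shape, shown here as an illustration):

```python
import pandas as pd

# Body of a grouped map Pandas UDF: one pandas.DataFrame per group in,
# a pandas.DataFrame out (split-apply-combine).
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf.v - pdf.v.mean())

df = pd.DataFrame({"id": [1, 1, 2, 2], "v": [1.0, 3.0, 5.0, 7.0]})

# Locally, pandas groupby().apply() plays the role of Spark's groupBy().apply(udf).
out = df.groupby("id", group_keys=False)[["v"]].apply(subtract_mean)
print(out["v"].tolist())  # [-1.0, 1.0, -1.0, 1.0]
```

Group 1 (mean 2.0) and group 2 (mean 6.0) are each centered independently, then recombined in the output.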

24. Overview
• ML + Image Reader
• Streaming: Continuous Processing
• PySpark Performance
• Spark on Kubernetes

25. Spark on Kubernetes (SPARK-18278)
The Spark stack: Spark SQL + DataFrames, Structured Streaming, MLlib, and GraphX sit on top of Spark Core, which runs on the Standalone, YARN, and Mesos cluster managers.

26. Spark on Kubernetes (SPARK-18278)
• The driver runs in a Kubernetes pod created by the submission client, and creates pods that run the executors in response to requests from the Spark scheduler
• Makes direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as pluggable authorization and logging
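In practice, the submission client described above is just spark-submit pointed at the Kubernetes API server. A configuration sketch of a cluster-mode submission (the API server address and container image name are placeholders; the SparkPi example jar ships with the 2.3 distribution):

```shell
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```

The `k8s://` master URL is what selects the Kubernetes scheduler backend; the driver pod is created first and then requests the five executor pods.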

27. Spark on Kubernetes (SPARK-18278)
Apache Spark 2.3
• Supports K8s 1.6+
• Cluster mode
• Static resource allocation
• Java/Scala applications
• Container-local and remote dependencies that are downloadable
Roadmap (Apache Spark 2.4+)
• Client mode
• Dynamic resource allocation + external shuffle service
• Python/R applications
• Client-local dependencies + Resource Staging Server (RSS)
Blog: https://tinyurl.com/spark-k8s

28. Recap
• ML + Image Reader
• Streaming: Continuous Processing
• PySpark Performance
• Spark on Kubernetes

29. Questions?
Sameer Agarwal | Spark Summit | San Francisco | June 6th 2018 | #DevSAIS16