Apache Spark 2.3 Overview: What's New?

Apache Spark 2.0 laid the architectural foundations of structure in Spark: unified high-level APIs, Structured Streaming, and underlying performance components such as the Catalyst Optimizer and the Tungsten engine. Since then, Spark community contributors have continued to build new features and fix many issues in the Spark 2.1 and 2.2 releases.

1. What's New in Apache Spark 2.3
Sameer Agarwal | Spark Summit | San Francisco | June 6th 2018 | #DevSAIS16

2. About Me
• Spark Committer and 2.3 Release Manager
• Software Engineer at Facebook (Big Compute)
• Previously at Databricks and UC Berkeley
• Research on BlinkDB (Approximate Queries in Spark)

3. Spark 2.3 Release by the Numbers
• Released on 28th February 2018
• Development span: July '17 – Feb '18
• 284 contributors
• 1406 JIRAs: SQL/Streaming (52%), Spark Core (12%), PySpark (9%), ML (8%)

4. Overview
• ML + Image Reader
• Streaming: Continuous Processing
• PySpark Performance
• Spark on Kubernetes

5. Major Features in Spark 2.3
• Continuous Processing
• Data Source API V2
• Spark on Kubernetes
• PySpark Performance
• ML on Streaming
• History Server V2
• Stream-stream Join
• UDF Enhancements
• Image Reader
• Native ORC Support
• Stable Codegen
• Various SQL Features
https://spark.apache.org/releases/spark-release-2-3-0.html

6. Overview
• ML + Image Reader
• Streaming: Continuous Processing
• PySpark Performance
• Spark on Kubernetes

7. Structured Streaming
• Users: treat a stream as an infinite table; no need to reason about micro-batches
• Developers: decoupled the high-level API from the execution engine

8. Structured Streaming

9. Micro Batch Execution

10. Micro Batch Execution
• Latency: > 100 ms
• Exactly-once semantics

11. Continuous Processing (SPARK-20928)
An experimental execution mode

12. Continuous Processing (SPARK-20928)

13. Continuous Processing (SPARK-20928)
• Latency: ~1 ms
• At-least-once semantics

14. Continuous Processing (SPARK-20928)

15. Continuous Processing (SPARK-20928)
Supported Operations
• Map-like Dataset operations: projections and selections
• All SQL functions except current_timestamp(), current_date(), and aggregation functions
Supported Sources
• Kafka source
• Rate source
Supported Sinks
• Kafka sink
• Memory sink
• Console sink
Blog: https://tinyurl.com/spark-cp
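The constraints above translate directly into how a continuous query is written: only map-like operations between a supported source and sink, with the trigger changed from micro-batch to continuous. A minimal configuration sketch (it needs a live Kafka broker; the broker address, topic names, and checkpoint path are placeholders, not from the talk):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-demo").getOrCreate()

# Source: Kafka, one of the two sources supported in continuous mode.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "input-topic")
          .load())

# Only projections/selections are allowed; no aggregations.
cleaned = (events
           .selectExpr("CAST(value AS STRING) AS value")
           .where("value IS NOT NULL"))

# The only change vs. micro-batch mode: a continuous trigger,
# whose argument is the checkpoint interval.
query = (cleaned.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "output-topic")
         .option("checkpointLocation", "/tmp/continuous-demo-checkpoint")
         .trigger(continuous="1 second")
         .start())
```

Swapping `trigger(continuous="1 second")` back to a processing-time trigger returns the same query to micro-batch execution, which is what makes the mode experiment-friendly.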

16. Overview
• ML + Image Reader
• Streaming: Continuous Processing
• PySpark Performance
• Spark on Kubernetes

17. ML on Streaming
• Model transformation/prediction on batch and streaming data with a unified API
• After fitting a model or Pipeline, you can deploy it in a streaming job:
val streamOutput = transformer.transform(streamDF)

18. Image Support in Spark (SPARK-21866)
• A standard API in Spark for reading images into DataFrames
• Utilities for loading images from common formats
• Deep learning frameworks can rely on this
val df = ImageSchema.readImages("/data/images")

19. Overview
• ML + Image Reader
• Streaming: Continuous Processing
• PySpark Performance
• Spark on Kubernetes

20. PySpark
• Introduced in Spark 0.7 (~2013); became a first-class citizen in the DataFrame API in Spark 1.3 (~2015)
• Much slower than Scala/Java with UDFs, due to serialization and the Python interpreter
• Note: most PyData tooling (e.g., Pandas, NumPy) is written in C/C++

21. PySpark Performance
Pandas UDFs perform much better than row-at-a-time UDFs across the board, ranging from 3x to over 100x.
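The speedup comes from amortizing per-row interpreter and serialization overhead over whole batches. A toy illustration of the two styles, independent of Spark (the function names are illustrative, not Spark APIs): row-at-a-time pays a Python-level call per element, while the vectorized form issues one NumPy operation per batch, which is the pattern Pandas UDFs exploit via Arrow record batches.

```python
import numpy as np

def plus_one_per_row(values):
    # Row-at-a-time style: one Python-level operation per element.
    return [v + 1 for v in values]

def plus_one_vectorized(values):
    # Vectorized style: a single NumPy operation over the whole batch.
    return np.asarray(values) + 1

batch = list(range(5))
print(plus_one_per_row(batch))          # [1, 2, 3, 4, 5]
print(plus_one_vectorized(batch).tolist())  # [1, 2, 3, 4, 5]
```

Both produce identical results; on large batches the vectorized form is dramatically faster because the loop runs in compiled code.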

22. Pandas/Vectorized UDFs
Scalar UDFs
• Used with functions such as select and withColumn
• The Python function should take a pandas.Series as input and return a pandas.Series of the same length

23. Pandas/Vectorized UDFs
Grouped Map UDFs
• Split-apply-combine
• A Python function that defines the computation for each group
• Input and output are both pandas.DataFrame
Blog: https://tinyurl.com/pyspark-udf
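The split-apply-combine contract can likewise be exercised in plain pandas: Spark's `groupBy(...).apply(udf)` hands the function one pandas.DataFrame per group and concatenates the returned frames, which `groupby(...).apply(...)` mimics locally (the mean-subtraction function below is the blog's classic example shape, shown here as an illustration):

```python
import pandas as pd

# Body of a grouped map Pandas UDF: one pandas.DataFrame per group in,
# a pandas.DataFrame out (split-apply-combine).
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf.v - pdf.v.mean())

df = pd.DataFrame({"id": [1, 1, 2, 2], "v": [1.0, 3.0, 5.0, 7.0]})

# Locally, pandas groupby().apply() plays the role of Spark's groupBy().apply(udf).
out = df.groupby("id", group_keys=False)[["v"]].apply(subtract_mean)
print(out["v"].tolist())  # [-1.0, 1.0, -1.0, 1.0]
```

Group 1 (mean 2.0) and group 2 (mean 6.0) are each centered independently, then recombined in the output.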

24. Overview
• ML + Image Reader
• Streaming: Continuous Processing
• PySpark Performance
• Spark on Kubernetes

25. Spark on Kubernetes (SPARK-18278)
The Spark stack: Spark SQL + DataFrames, Structured Streaming, MLlib, and GraphX sit on top of Spark Core, which runs on the Standalone, YARN, and Mesos cluster managers.

26. Spark on Kubernetes (SPARK-18278)
• The driver runs in a Kubernetes pod created by the submission client, and creates pods that run the executors in response to requests from the Spark scheduler
• Makes direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as pluggable authorization and logging
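In practice, the submission client described above is just spark-submit pointed at the Kubernetes API server. A configuration sketch of a cluster-mode submission (the API server address and container image name are placeholders; the SparkPi example jar ships with the 2.3 distribution):

```shell
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```

The `k8s://` master URL is what selects the Kubernetes scheduler backend; the driver pod is created first and then requests the five executor pods.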

27. Spark on Kubernetes (SPARK-18278)
Apache Spark 2.3
• Supports K8s 1.6+
• Cluster mode
• Static resource allocation
• Java/Scala applications
• Container-local and remote dependencies that are downloadable
Roadmap (Apache Spark 2.4+)
• Client mode
• Dynamic resource allocation + external shuffle service
• Python/R applications
• Client-local dependencies + Resource Staging Server (RSS)
Blog: https://tinyurl.com/spark-k8s

28. Recap
• ML + Image Reader
• Streaming: Continuous Processing
• PySpark Performance
• Spark on Kubernetes

29. Questions?
Sameer Agarwal | Spark Summit | San Francisco | June 6th 2018 | #DevSAIS16