Apache Spark 2.3 Overview: What's New?
1. What's New in Apache Spark 2.3
Sameer Agarwal
Spark Summit | San Francisco | June 6th 2018
#DevSAIS16
2. About Me
• Spark Committer and 2.3 Release Manager
• Software Engineer at Facebook (Big Compute)
• Previously at Databricks and UC Berkeley
• Research on BlinkDB (Approximate Queries in Spark)
3. Spark 2.3 Release by the Numbers
• Released on 28th February 2018
• Development Span: July '17 – Feb '18
• 284 Contributors
• 1406 JIRAs
  – SQL/Streaming (52%)
  – Spark Core (12%)
  – PySpark (9%)
  – ML (8%)
4. Overview
• Continuous Processing
• ML Streaming + Image Reader
• PySpark Performance
• Spark on Kubernetes
5. Major Features in Spark 2.3
• Continuous Processing
• Data Source API V2
• Spark on Kubernetes
• PySpark Performance
• ML on Streaming
• History Server V2
• Stream-stream Join
• UDF Enhancements
• Image Reader
• Native ORC Support
• Stable Codegen
• Various SQL Features
https://spark.apache.org/releases/spark-release-2-3-0.html
7. Structured Streaming
• Users: Treat a stream as an infinite table; no need to reason about micro-batches
• Developers: The high-level API is decoupled from the execution engine
8. Structured Streaming
9. Micro Batch Execution
10. Micro Batch Execution
• Latency > 100ms
• Exactly-once Semantics
11. Continuous Processing (SPARK-20928)
An experimental execution mode
12. Continuous Processing (SPARK-20928)
13. Continuous Processing (SPARK-20928)
• Latency ~1ms
• At-least-once Semantics
14. Continuous Processing (SPARK-20928)
15. Continuous Processing (SPARK-20928)
Supported Operations
• Map-like Dataset Operations
  – Projections
  – Selections
• All SQL functions
  – Except current_timestamp(), current_date() and aggregation functions
Supported Sources
• Kafka Source
• Rate Source
Supported Sinks
• Kafka Sink
• Memory Sink
• Console Sink
Blog: https://tinyurl.com/spark-cp
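
The continuous mode described on these slides is selected per query through the trigger. Below is a minimal PySpark sketch, assuming a Kafka source and sink as listed above and the Kafka connector package on the classpath. The function name, broker address, topic names, and checkpoint path are hypothetical placeholders, and the query uses only a projection so that it stays within the supported map-like operations.

```python
def start_continuous_query(spark, brokers, in_topic, out_topic, checkpoint_dir):
    """Start a Kafka-to-Kafka query in continuous processing mode.

    `spark` is an existing SparkSession. Every other argument is a
    placeholder chosen for this sketch, not a value from the talk.
    """
    source = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", brokers)
        .option("subscribe", in_topic)
        .load()
    )
    # Projection only: aggregations are not supported in continuous mode.
    projected = source.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    return (
        projected.writeStream.format("kafka")
        .option("kafka.bootstrap.servers", brokers)
        .option("topic", out_topic)
        .option("checkpointLocation", checkpoint_dir)
        # The interval sets how often progress is checkpointed, not a batch size.
        .trigger(continuous="1 second")
        .start()
    )
```

Swapping the trigger for `.trigger(processingTime="1 second")` would run the same query in micro-batch mode, which is what makes the two execution modes easy to compare.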
17. ML on Streaming
• Model transformation/prediction on batch and streaming data with a unified API
• After fitting a model or Pipeline, you can deploy it in a streaming job
    val streamOutput = transformer.transform(streamDF)
18. Image Support in Spark (SPARK-21866)
• A standard API in Spark for reading images into DataFrames
• Utilities for loading images from common formats
• Deep learning frameworks can rely on this
    import org.apache.spark.ml.image.ImageSchema
    val df = ImageSchema.readImages("/data/images")
20. PySpark
• Introduced in Spark 0.7 (~2013); became a first-class citizen in the DataFrame API in Spark 1.3 (~2015)
• Much slower than Scala/Java with UDFs due to serialization and the Python interpreter
• Note: Most PyData tooling (e.g., Pandas, NumPy) is written in C/C++
21. PySpark Performance
Pandas UDFs perform much better than row-at-a-time UDFs across the board, ranging from 3x to over 100x.
22. Pandas/Vectorized UDFs
Scalar UDFs
• Used with functions such as select and withColumn
• The Python function should take a pandas.Series as input and return a pandas.Series of the same length
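
A scalar Pandas UDF of the kind described above can be sketched as follows. This is an illustration under assumptions, not code from the talk: the names `plus_one` and `add_plus_one_column` are hypothetical, the column is assumed to hold doubles, and pandas plus PyArrow are assumed to be installed wherever Spark runs the function.

```python
def plus_one(v):
    """Add one to every element; Spark passes each column batch as a pandas.Series."""
    return v + 1


def add_plus_one_column(df, column):
    """Return `df` with an extra column computed by the vectorized UDF."""
    # Imported here so plus_one can be tried on its own without Spark installed.
    from pyspark.sql.functions import PandasUDFType, col, pandas_udf
    from pyspark.sql.types import DoubleType

    # Wrap the plain function as a scalar (Series in, Series out) Pandas UDF.
    plus_one_udf = pandas_udf(
        plus_one, returnType=DoubleType(), functionType=PandasUDFType.SCALAR
    )
    return df.withColumn(column + "_plus_one", plus_one_udf(col(column)))
```

Because `plus_one` operates on a whole batch at once, the per-row serialization and interpreter cost mentioned on the earlier slide is paid once per batch rather than once per value.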
23. Pandas/Vectorized UDFs
Grouped Map UDFs
• Split-apply-combine
• A Python function that defines the computation for each group
• Inputs/outputs are both pandas.DataFrame
Blog: https://tinyurl.com/pyspark-udf
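
The split-apply-combine pattern above might look like the sketch below. The column names `id` and `v` and both function names are assumptions made for illustration; `v` is assumed to be a double column so that the output matches the input schema.

```python
def subtract_mean(pdf):
    """Center column `v` within one group; `pdf` is that group's pandas.DataFrame."""
    return pdf.assign(v=pdf.v - pdf.v.mean())


def center_by_group(df, key="id"):
    """Split `df` by `key`, apply subtract_mean to each group, combine the results."""
    # Imported here so subtract_mean can be tried on its own without Spark installed.
    from pyspark.sql.functions import PandasUDFType, pandas_udf

    # The output keeps the input columns, so the input schema is reused.
    centered = pandas_udf(
        subtract_mean, returnType=df.schema, functionType=PandasUDFType.GROUPED_MAP
    )
    return df.groupby(key).apply(centered)
```

Each group is materialized as one pandas.DataFrame, so this sketch assumes every group fits in the memory of a single executor.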
25. Spark on Kubernetes (SPARK-18278)
Spark SQL + DataFrames | Structured Streaming | MLlib | GraphX
Spark Core
Standalone | YARN | Mesos
26. Spark on Kubernetes (SPARK-18278)
• The driver runs in a Kubernetes pod created by the submission client and creates pods that run the executors in response to requests from the Spark scheduler
• Makes direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as Pluggable Authorization and Logging
27. Spark on Kubernetes (SPARK-18278)
Apache Spark 2.3
• Supports K8S 1.6+
• Cluster Mode
• Static Resource Allocation
• Java/Scala Applications
• Container-local and remote dependencies that are downloadable
Roadmap (Apache Spark 2.4+)
• Client Mode
• Dynamic Resource Allocation + External Shuffle Service
• Python/R Applications
• Client-local dependencies + Resource Staging Server (RSS)
Blog: https://tinyurl.com/spark-k8s
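
To make the 2.3 feature set concrete, the sketch below assembles a `spark-submit` invocation for cluster mode with a fixed number of executors. The helper name and every value passed to it (API server address, image name, jar path) are hypothetical placeholders. The `local://` scheme is used on the assumption that the jar is already inside the container image.

```python
def build_k8s_submit_command(api_server, image, app_jar, main_class, executors=2):
    """Build a spark-submit argument list for cluster mode on Kubernetes."""
    return [
        "bin/spark-submit",
        "--master", "k8s://" + api_server,
        "--deploy-mode", "cluster",  # the only deploy mode listed for 2.3
        "--class", main_class,  # a Java/Scala entry point
        # Static allocation: the executor count is fixed at submission time.
        "--conf", "spark.executor.instances={}".format(executors),
        "--conf", "spark.kubernetes.container.image=" + image,
        app_jar,
    ]


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    command = build_k8s_submit_command(
        api_server="https://k8s-apiserver.example.com:6443",
        image="example.com/spark:2.3.0",
        app_jar="local:///opt/spark/examples/jars/spark-examples.jar",
        main_class="org.apache.spark.examples.SparkPi",
    )
    print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` from a Spark distribution directory. The submission client then creates the driver pod, which in turn requests the executor pods.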
28. Recap
• Continuous Processing
• ML Streaming + Image Reader
• PySpark Performance
• Spark on Kubernetes
29. Questions?