Deep Dive into Spark SQL with Advanced Performance Tuning

Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate, and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka, and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk dives into the technical details of Spark SQL across the entire lifecycle of query execution, giving the audience an in-depth understanding of Spark SQL internals and how to tune Spark SQL performance.

1. Deep Dive Into SQL with Advanced Performance Tuning. Xiao Li & Wenchen Fan. Spark Summit | SF | Jun 2018

2. About Us: Software Engineers at Databricks; Apache Spark Committers and PMC Members. Xiao Li (GitHub: gatorsmile), Wenchen Fan (GitHub: cloud-fan)

3. Databricks' Unified Analytics Platform: unifies data engineers and data scientists (COLLABORATIVE NOTEBOOKS); unifies data and AI technologies (DATABRICKS RUNTIME, powered by Delta, with SQL and Streaming); eliminates infrastructure complexity (CLOUD NATIVE SERVICE)

4. Spark SQL: A highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance.

5. Run Everywhere: Processes, integrates, and analyzes data from diverse data sources (e.g., Cassandra, Kafka, and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON)
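A rough sketch of what this looks like in practice: the snippet below reads a few file formats and, via the JDBC data source, an external database. The paths, connection URL, and table name are illustrative placeholders, not from the talk.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-examples").getOrCreate()

// Built-in file format readers; the paths here are hypothetical.
val parquetDF = spark.read.parquet("/data/events.parquet")
val jsonDF    = spark.read.json("/data/events.json")
val csvDF     = spark.read.option("header", "true").csv("/data/events.csv")

// External systems are reached through data source connectors,
// e.g. the JDBC source for Oracle (URL and table are placeholders).
val oracleDF = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("dbtable", "SALES")
  .load()
```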

6. The not-so-secret truth... SQL is not only SQL.

7. Spark SQL

8. Not Only SQL: Powers and optimizes the other Spark applications and libraries:
• Structured Streaming for stream processing
• MLlib for machine learning
• GraphFrames for graph-parallel computation
• Your own Spark applications that use the SQL, DataFrame, and Dataset APIs

9. Lazy Evaluation: Optimization happens as late as possible, so Spark SQL can optimize across functions and libraries: holistic optimization when these libraries and the SQL/DataFrame/Dataset APIs are used in the same Spark application. A brief illustration follows.
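A minimal sketch of this idea, with made-up helper functions: each helper only appends to the logical plan, so the optimizer sees the whole pipeline and can, for instance, merge the two filters before anything runs.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Two "library" functions written independently; both are hypothetical.
def recentOrders(df: DataFrame): DataFrame = df.filter(col("year") >= 2017)
def bigOrders(df: DataFrame): DataFrame    = df.filter(col("amount") > 1000)

val orders = Seq((2016, 500.0), (2018, 2000.0)).toDF("year", "amount")

// Nothing has executed yet: composing the functions only builds a plan.
val result = bigOrders(recentOrders(orders))

result.explain(true) // the optimized plan shows a single merged Filter
result.show()        // execution actually happens here
```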

10. New Features of Spark SQL in Spark 2.3
• PySpark Pandas UDFs [SPARK-22216] [SPARK-21187]
• Stable codegen [SPARK-22510] [SPARK-22692]
• Advanced pushdown for partition pruning predicates [SPARK-20331]
• Vectorized ORC reader [SPARK-20682] [SPARK-16060]
• Vectorized cache reader [SPARK-20822]
• Histogram support in the cost-based optimizer [SPARK-21975]
• Better Hive compatibility [SPARK-20236] [SPARK-17729] [SPARK-4131]
• More efficient and extensible Data Source API V2

11. Spark SQL: A compiler from queries to RDDs.

12. Performance Tuning for Optimal Plans: Run EXPLAIN. Interpret the plan. Tune the plan.

13. Get the plans by running the EXPLAIN command/APIs, or from the SQL tab in either the Spark UI or the Spark History Server. A short example follows.
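For instance, along these lines (the events table and its data are made up for illustration):

```scala
import spark.implicits._ // assumes the SparkSession from the earlier sketches

Seq((2017, "a"), (2018, "b")).toDF("year", "name").createOrReplaceTempView("events")

// SQL side: EXPLAIN EXTENDED prints parsed, analyzed, optimized, and physical plans.
spark.sql("EXPLAIN EXTENDED SELECT * FROM events WHERE year = 2018").show(false)

// Dataset-side equivalent of the same inspection.
spark.table("events").filter($"year" === 2018).explain(true)
```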

14. More statistics from the Job page

15. Declarative APIs

16. Declarative APIs: Declare your intentions by
• SQL API: ANSI SQL:2003 and HiveQL
• Dataset/DataFrame APIs: richer, language-integrated, and user-friendly interfaces

17. Declarative APIs: When should I use SQL, DataFrames, or Datasets?
• The DataFrame API provides untyped relational operations
• The Dataset API provides a typed version, at the cost of performance due to heavy reliance on user-defined closures/lambdas [SPARK-14083]
• http://dbricks.co/29xYnqR
A sketch of the difference follows.
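A minimal sketch of that trade-off, using a made-up Order type: the Column-based filter is transparent to the optimizer, while the lambda is an opaque closure that forces per-row deserialization.

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._ // assumes an existing SparkSession named spark

case class Order(year: Int, amount: Double)
val ds = Seq(Order(2018, 2000.0), Order(2016, 10.0)).toDS()

// Untyped DataFrame operation: the predicate is a Column expression that
// Catalyst can inspect, push down, and compile.
val viaDataFrame = ds.toDF().filter(col("amount") > 1000)

// Typed Dataset operation: the lambda is an opaque JVM closure, so rows are
// deserialized into Order objects before the predicate runs [SPARK-14083].
val viaDataset = ds.filter(o => o.amount > 1000)
```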

18. Metadata Catalog

19. Metadata Catalog
• Persistent Hive metastore [Hive 0.12 - Hive 2.3.3]
• Session-local temporary view manager
• Cross-session global temporary view manager
• Session-local function registry
The two kinds of temporary views are sketched below.
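To make the two view managers concrete, a small sketch (the DataFrame and view names are placeholders):

```scala
val df = spark.range(10).toDF("id") // assumes an existing SparkSession named spark

// Session-local temporary view: visible only within this SparkSession.
df.createOrReplaceTempView("orders_tmp")
spark.sql("SELECT COUNT(*) FROM orders_tmp").show()

// Cross-session global temporary view: kept in the reserved global_temp
// database and visible to every session in the same application.
df.createOrReplaceGlobalTempView("orders_global")
spark.newSession().sql("SELECT COUNT(*) FROM global_temp.orders_global").show()
```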

20. Metadata Catalog: Session-local function registry
• Easy-to-use lambda UDFs (see the sketch below)
• Vectorized PySpark Pandas UDFs
• Native UDAF interface
• Support for Hive UDF, UDAF, and UDTF
• Almost 300 built-in SQL functions
• Next, SPARK-23899 adds 30+ higher-order built-in functions
• Blog on higher-order functions: https://dbricks.co/2rR8vAr
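A minimal sketch of the lambda-UDF route (the function name squared is made up):

```scala
import org.apache.spark.sql.functions.udf
import spark.implicits._ // assumes an existing SparkSession named spark

// Register a Scala lambda by name so SQL queries can call it...
spark.udf.register("squared", (x: Long) => x * x)
spark.sql("SELECT id, squared(id) FROM range(10)").show()

// ...or wrap the same lambda for the DataFrame API.
val squared = udf((x: Long) => x * x)
spark.range(10).select($"id", squared($"id")).show()
```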

21. Performance Tips - Catalog: Time costs of partition metadata retrieval:
• Upgrade your Hive metastore
• Avoid very high cardinality of partition columns
• Use partition pruning predicates (improved in [SPARK-20331]); see the sketch below
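A sketch of the third tip, with a hypothetical sales table: a literal predicate on the partition column lets the catalog fetch only the matching partitions rather than listing all of them.

```scala
// Hypothetical partitioned table; names and layout are illustrative.
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales (item STRING, amount DOUBLE, sale_date STRING)
  USING parquet
  PARTITIONED BY (sale_date)
""")

// The literal predicate on the partition column can be pushed to the catalog,
// so only matching partitions are listed and scanned; the scan node in the
// physical plan should show it under PartitionFilters.
spark.sql("SELECT SUM(amount) FROM sales WHERE sale_date = '2018-06-01'").explain(true)
```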

22. Cache Manager

23. Cache Manager
• Automatically replaces matching plan fragments with the cached data
• Cross-session
• Dropping or inserting into tables/views invalidates all the caches that depend on them
• Lazy evaluation

24. Performance Tips: Caching is not always fast, especially if the data spills to disk.
• Uncache it if it is not needed.
Coming in future releases: a new cache mechanism that builds a snapshot in the cache and allows querying stale data, with caches resolved by names instead of by plans [SPARK-24461]. A sketch of basic cache usage follows.
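A minimal sketch of the cache lifecycle (reusing the illustrative events view from the earlier sketch):

```scala
val events = spark.table("events") // "events" is an illustrative table name

events.cache() // lazy: nothing is materialized yet
events.count() // the first action populates the in-memory columnar cache

// The cache manager rewrites any later plan containing a matching fragment,
// so this separate query also reads from the cache.
spark.table("events").groupBy("year").count().show()

// Per the tip above: release the memory (or disk) once it is no longer needed.
events.unpersist()
```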

25. Optimizer

26. Optimizer: Rewrites the query plans using heuristics and cost:
• Column pruning
• Outer join elimination
• Predicate push down
• Constraint propagation
• Constant folding
• Join reordering
...and many more.

27. Performance Tips: Roll your own Optimizer and Planner rules (a sketch follows)
• In class ExperimentalMethods
• var extraOptimizations: Seq[Rule[LogicalPlan]] = Nil
• var extraStrategies: Seq[Strategy] = Nil
• Examples in Herman's talk, Deep Dive into Catalyst Optimizer
• Joining two intervals: http://dbricks.co/2etjIDY
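A toy sketch of an extra optimizer rule under Spark 2.3-era APIs; the rule itself, rewriting x * 1 to x, is made up for illustration (constant folding already covers many such cases):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A toy rule: rewrite `expr * 1` to `expr` wherever it appears in a plan.
object RemoveMultiplyByOne extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Multiply(child, Literal(1, _)) => child
  }
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hook the rule into this session's optimizer via ExperimentalMethods.
spark.experimental.extraOptimizations = Seq(RemoveMultiplyByOne)
```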

28. Planner

29. Planner
• Turns logical plans into physical plans (from what to how)
• Picks the best physical plan according to the cost
[Diagram: a broadcast hash join and a sort merge join over table1 and table2; the broadcast join has lower cost if one table can fit in memory]
The broadcast choice is sketched below.
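A minimal sketch of steering that choice (table sizes and names are made up):

```scala
import org.apache.spark.sql.functions.broadcast

val large = spark.range(1000000L).toDF("k") // assumes an existing SparkSession
val small = spark.range(100L).toDF("k")

// Below spark.sql.autoBroadcastJoinThreshold (10 MB by default) the planner
// picks a broadcast hash join on its own; the hint forces it explicitly.
large.join(broadcast(small), "k").explain()
// The physical plan should show BroadcastHashJoin rather than SortMergeJoin.
```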