Bridging the Gap Between Datasets and DataFrames

下载 2

快召唤伙伴们来围观吧
微博 QQ QQ空间 贴吧
文档嵌入链接
<iframe src="https://www.slidestalk.com/Spark/Bridging_the_Gap_Between_Datasets_and_Data_Frames?embed" frame border="0" width="640" height="360" scrolling="no" allowfullscreen="true">复制
微信扫一扫分享
已成功复制到剪贴板

Spark开源社区

发布于

6年前

8233

人观看

#信息技术 Spark AI Summit 2019 NA sql spark 大数据 apple datasets

Apple leverages Apache Spark for processing large datasets to power key components of Apple’s production services. The majority of users rely on Spark SQL to benefit from state-of-the-art optimizations in Catalyst and Tungsten. As there are multiple APIs to interact with Spark SQL, users have to make a wise decision which one to pick. While DataFrames and SQL are widely used, they lack type safety so that the analysis errors will not be detected during the compile time such as invalid column names or types. Also, the ability to apply the same functional constructions as on RDDs is missing in DataFrames. Datasets expose a type-safe API and support for user-defined closures at the cost of performance. This talk will explain cases when Spark SQL cannot optimize typed Datasets as much as it can optimize DataFrames. We will also present an effort to use bytecode analysis to convert user-defined closures into native Catalyst expressions. This helps Spark to avoid the expensive conversion between the internal format and JVM objects as well as to leverage more Catalyst optimizations. A consequence, we can bridge the gap in performance between Datasets and DataFrames, so that users do not have to sacrifice the benefits of Datasets for performance reasons.

展开查看详情

1 .Bridging the Gap Between • Datasets and DataFrames • Anton Okolnychyi DB Tsai Spark+AI SF © 2019 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.

2 . Agenda • Spark at Apple • Spark SQL • Datasets vs DataFrames • Optimizing Datasets with Bytecode Analysis © 2019 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.

3 .Spark at Apple

4 . Spark at Apple • Scalable elastic on demand Spark • Disaggregated architecture • Over a million executors per day © 2019 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.

5 .Spark SQL

7 .Datasets vs DataFrames

8 . Review of APIs of Spark DataFrame: Relational untyped APIs introduced in Spark 1.3. From Spark 2.0,   type DataFrame = Dataset[Row] Dataset: Support all the untyped APIs in DataFrame + typed functional APIs © 2019 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.

9 . Review of DataFrame Real estate information can be naturally modeled by root |-- address: struct (nullable = true) | |-- houseNumber: integer (nullable = true) | |-- streetAddress: string (nullable = true) | |-- city: string (nullable = true) | |-- state: string (nullable = true) | |-- zipCode: string (nullable = true) |-- facts: struct (nullable = true) | |-- price: integer (nullable = true) | |-- size: integer (nullable = true) | |-- yearBuilt: integer (nullable = true) |-- schools: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- name: string (nullable = true) © 2019 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.

10 . Review of DataFrame val newAddress = struct( $"address.houseNumber" , new Column(Uuid()) as "streetAddress", $"address.city", $"address.state", $"address.zipCode" ) ds.withColumn("address", newAddress) .where($"facts.price" > 2000000).explain(true) © 2019 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.

11 . Execution Plan - Dataframe Untyped APIs == Physical Plan == *(1) Filter (isnotnull(facts#1) && (facts#1.price > 2000000)) +- *(1) Project [named_struct(houseNumber, address#0.houseNumber, streetAddress, uuid(Some(0)), city, address#0.city, state, address#0.state, zipCode, address#0.zipCode) AS address#7, facts#1, schools#2] +- *(1) FileScan parquet [address#0,facts#1,schools#2] DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)], Format: Parquet, PushedFilters: [IsNotNull(facts), GreaterThan(facts.price,2000000)], © 2019 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.

12 . Review of Dataset ds.map { home => home.copy(home.address.copy(streetAddress = UUID.randomUUID().toString)) }.filter(_.facts.price > 2000000).explain(true) © 2019 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.

13 . Execution Plan - Dataset Typed APIs == Physical Plan == *(1) SerializeFromObject [if (isnull(assertnotnull(input[0, Home, true]).address)) null else named_struct(houseNumber, assertnotnull(assertnotnull(input[0, Home, true]).address).houseNumber, streetAddress, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, Home, true]).address).streetAddress, true, false), city,………. +- *(1) Filter <function1>.apply +- *(1) MapElements <function1>, obj#27: Home +- *(1) DeserializeToObject newInstance(class Home), obj#26: Home +- *(1) FileScan parquet [address#0,facts#1,schools#2] DataFilters: [], Format: Parquet, PushedFilters: [], © 2019 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.

14 . Strongly Typed Pipeline • Typed Dataset is used to guarantee the schema consistency • Enables Java/Scala interoperability between systems • Compile time exceptions • Increases Data Scientist productivity © 2019 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.

15 . Drawbacks of Strongly Typed Pipeline • Dataset is slower than Dataframe https://tinyurl.com/dataset-vs-dataframe • Serialization and deserialization cost • In Dataset, many POJOs are created for each row resulting high GC pressure • Not able to apply Catalyst optimizations © 2019 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.

16 .Optimizing Datasets with Bytecode Analysis

17 . SPARK-14083 • Use bytecode analysis to convert closures/lambdas into Catalyst expressions in order to speed up Datasets • Reported by Reynold Xin in March 2016 • No progress since March 2017 © 2019 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.

18 . JVM Bytecode • A platform-independent instruction set of the JVM • Java/Scala code is compiled into bytecode before it can be executed © 2019 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.

0点赞

0收藏

2下载