Bridging the Gap Between Datasets and DataFrames
1. Bridging the Gap Between Datasets and DataFrames
Anton Okolnychyi, DB Tsai
Spark+AI SF
© 2019 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
2. Agenda
• Spark at Apple
• Spark SQL
• Datasets vs DataFrames
• Optimizing Datasets with Bytecode Analysis
3. Spark at Apple
4. Spark at Apple
• Scalable, elastic, on-demand Spark
• Disaggregated architecture
• Over a million executors per day
5. Spark SQL
6. Catalyst
7. Datasets vs DataFrames
8. Review of APIs of Spark
• DataFrame: relational, untyped APIs introduced in Spark 1.3. Since Spark 2.0, type DataFrame = Dataset[Row].
• Dataset: supports all the untyped DataFrame APIs plus typed functional APIs.
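To make the distinction concrete, here is a minimal sketch that runs the same query through the untyped and the typed API. The `Person` record, the sample rows, and the `TypedVsUntyped` object are made up for illustration and do not come from the talk; the sketch assumes Spark 2.x or later on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object TypedVsUntyped {
  // Returns the names of adults, computed once per API style.
  def adults(spark: SparkSession): (Array[String], Array[String]) = {
    import spark.implicits._

    val ds: Dataset[Person] = Seq(Person("Ann", 34), Person("Bob", 12)).toDS()
    val df: DataFrame = ds.toDF() // DataFrame is an alias for Dataset[Row]

    // Untyped API: columns are referenced by name and resolved at analysis time.
    val untyped = df.where($"age" >= 18).select($"name").as[String].collect()

    // Typed API: lambdas over Person, checked by the Scala compiler.
    val typed = ds.filter(_.age >= 18).map(_.name).collect()

    (untyped, typed)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("typed-vs-untyped")
      .getOrCreate()
    try {
      val (untyped, typed) = adults(spark)
      println(untyped.mkString(","))
      println(typed.mkString(","))
    } finally spark.stop()
  }
}
```

Both variants return the same rows. The difference is in what Spark can see: the untyped version is built from expressions, while the typed version passes opaque functions.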
9. Review of DataFrame
Real estate information can be naturally modeled by:

    root
     |-- address: struct (nullable = true)
     |    |-- houseNumber: integer (nullable = true)
     |    |-- streetAddress: string (nullable = true)
     |    |-- city: string (nullable = true)
     |    |-- state: string (nullable = true)
     |    |-- zipCode: string (nullable = true)
     |-- facts: struct (nullable = true)
     |    |-- price: integer (nullable = true)
     |    |-- size: integer (nullable = true)
     |    |-- yearBuilt: integer (nullable = true)
     |-- schools: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- name: string (nullable = true)
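For the typed API, a schema like this is usually mirrored by case classes. The sketch below is one possible modeling, not the authors' code: `Home` is the class name that appears in the talk's execution plans, while `Address`, `Facts`, `School`, the `HomeModel` helper, and the sample values are assumptions. Integer fields are declared as plain `Int` to keep the lambdas short; `Option[Int]` would represent the nullable columns more faithfully.

```scala
// Home is named in the talk; Address, Facts and School are assumed names.
case class Address(
    houseNumber: Int,
    streetAddress: String,
    city: String,
    state: String,
    zipCode: String)

case class Facts(price: Int, size: Int, yearBuilt: Int)

case class School(name: String)

case class Home(address: Address, facts: Facts, schools: Seq[School])

object HomeModel {
  // Made-up sample record, for illustration only.
  val sample: Home = Home(
    Address(1, "Main St", "Springfield", "CA", "00000"),
    Facts(price = 2500000, size = 1800, yearBuilt = 1995),
    Seq(School("Example Elementary")))

  // Replaces the street address and leaves every other field untouched.
  def withStreetAddress(home: Home, replacement: String): Home =
    home.copy(address = home.address.copy(streetAddress = replacement))
}
```

Nested `copy` calls like the one in `withStreetAddress` are the typed counterpart of rebuilding the `address` struct column by column.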
10. Review of DataFrame

    // Assumes import spark.implicits._ for the $"..." column syntax.
    import org.apache.spark.sql.Column
    import org.apache.spark.sql.catalyst.expressions.Uuid
    import org.apache.spark.sql.functions.struct

    // Rebuild the address struct, replacing streetAddress with a generated UUID.
    val newAddress = struct(
      $"address.houseNumber",
      new Column(Uuid()) as "streetAddress",
      $"address.city",
      $"address.state",
      $"address.zipCode"
    )

    ds.withColumn("address", newAddress)
      .where($"facts.price" > 2000000)
      .explain(true)
11. Execution Plan - DataFrame Untyped APIs

    == Physical Plan ==
    *(1) Filter (isnotnull(facts#1) && (facts#1.price > 2000000))
    +- *(1) Project [named_struct(houseNumber, address#0.houseNumber, streetAddress, uuid(Some(0)), city, address#0.city, state, address#0.state, zipCode, address#0.zipCode) AS address#7, facts#1, schools#2]
       +- *(1) FileScan parquet [address#0,facts#1,schools#2]
            DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
            Format: Parquet,
            PushedFilters: [IsNotNull(facts), GreaterThan(facts.price,2000000)],
12. Review of Dataset

    import java.util.UUID

    // Same transformation as the DataFrame version, written with typed lambdas.
    ds.map { home =>
      home.copy(home.address.copy(streetAddress = UUID.randomUUID().toString))
    }.filter(_.facts.price > 2000000)
      .explain(true)
13. Execution Plan - Dataset Typed APIs

    == Physical Plan ==
    *(1) SerializeFromObject [if (isnull(assertnotnull(input[0, Home, true]).address)) null else named_struct(houseNumber, assertnotnull(assertnotnull(input[0, Home, true]).address).houseNumber, streetAddress, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, Home, true]).address).streetAddress, true, false), city,……….
    +- *(1) Filter <function1>.apply
       +- *(1) MapElements <function1>, obj#27: Home
          +- *(1) DeserializeToObject newInstance(class Home), obj#26: Home
             +- *(1) FileScan parquet [address#0,facts#1,schools#2]
                  DataFilters: [],
                  Format: Parquet,
                  PushedFilters: [],
14. Strongly Typed Pipeline
• Typed Datasets guarantee schema consistency
• Enables Java/Scala interoperability between systems
• Errors surface at compile time
• Increases data scientist productivity
15. Drawbacks of Strongly Typed Pipeline
• Datasets are slower than DataFrames: https://tinyurl.com/dataset-vs-dataframe
• Serialization and deserialization cost
• Datasets create many POJOs for each row, resulting in high GC pressure
• Catalyst optimizations cannot be applied
16. Optimizing Datasets with Bytecode Analysis
17. SPARK-14083
• Use bytecode analysis to convert closures/lambdas into Catalyst expressions in order to speed up Datasets
• Reported by Reynold Xin in March 2016
• No progress since March 2017
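The goal of such a conversion can be illustrated with a toy model. The sketch below puts an opaque closure next to a hand-written expression tree that computes the same predicate. The `Expr` hierarchy and the evaluator are deliberately simplified stand-ins, not Catalyst's classes, and the tree is written by hand rather than derived from bytecode. `Facts` and `Home` are trimmed-down hypothetical record types.

```scala
// Toy expression tree. These are NOT Catalyst classes, only an illustration.
sealed trait Expr
case class Field(path: List[String]) extends Expr
case class IntLit(value: Int) extends Expr
case class GreaterThan(left: Expr, right: Expr) extends Expr

object ClosureVsExpression {
  // Hypothetical record types, reduced to the one field the predicate needs.
  case class Facts(price: Int)
  case class Home(facts: Facts)

  // What the user writes: a function the optimizer can only call, not inspect.
  val closure: Home => Boolean = _.facts.price > 2000000

  // What an analyzer would aim to produce: a tree the optimizer can inspect,
  // reorder, and push down to the data source.
  val expression: Expr =
    GreaterThan(Field(List("facts", "price")), IntLit(2000000))

  // Evaluates an integer-valued toy expression against a row of nested maps.
  def evalInt(e: Expr, row: Map[String, Any]): Int = e match {
    case IntLit(v) => v
    case Field(path) =>
      path
        .foldLeft[Any](row) { (cur, name) =>
          cur.asInstanceOf[Map[String, Any]](name)
        }
        .asInstanceOf[Int]
    case other => sys.error(s"not an integer expression: $other")
  }

  // Evaluates a boolean-valued toy expression against a row of nested maps.
  def evalBool(e: Expr, row: Map[String, Any]): Boolean = e match {
    case GreaterThan(l, r) => evalInt(l, row) > evalInt(r, row)
    case other => sys.error(s"not a boolean expression: $other")
  }
}
```

Both forms return the same answer for a given record, but only the tree exposes that the predicate reads a single field and compares it with a constant. That visibility is what makes optimizations such as filter pushdown possible.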
18. JVM Bytecode
• A platform-independent instruction set of the JVM
• Java/Scala code is compiled into bytecode before it can be executed
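A small way to see this in practice is to compile a tiny method and disassemble it. The `PriceCheck` object below is a hypothetical example, not taken from the talk. After compiling it with `scalac`, running `javap -c 'PriceCheck$'` prints the instructions of `isExpensive`. The exact listing depends on the compiler version, so none is reproduced here.

```scala
object PriceCheck {
  // A one-line predicate: small enough that its bytecode is easy to read.
  def isExpensive(price: Int): Boolean = price > 2000000

  def main(args: Array[String]): Unit =
    println(isExpensive(2500000)) // prints "true"
}
```

The same kind of instruction listing is what a bytecode analyzer has to walk in order to recover the comparison that the source code expressed.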
19. JVM Architecture
20. JVM Architecture
21. JVM Architecture
22. JVM Architecture
23. Bytecode Example
24. Bytecode Example
25. Bytecode Example
26. Bytecode Example
27. Bytecode Example
28. Bytecode Example
29. Bytecode Example