4. Wang Fei - eBay's Path to Automated Spark Upgrades
Wang Fei, Apache Kyuubi PPMC Member, is a software engineer on the eBay Hadoop team, responsible for developing and maintaining the offline computing platform.
The eBay Hadoop team releases and maintains eBay's internal Spark builds. A major Spark version upgrade inevitably changes some computation behavior, and leaving engine switching and data validation to the users themselves would be a long, drawn-out process. During the recent Spark 3 upgrade we used an automated system to validate the data of SQL jobs and switch engine versions, so users only need to click a confirmation button, which greatly reduces the effort required from both users and platform maintainers. This talk gives a systematic introduction to eBay's automated Spark upgrade system.
2. Automatic Spark Version Upgrade at Scale in eBay
Wang Fei, Software Engineer @ eBay Hadoop Team, Apache Kyuubi PPMC Member
3. Agenda
• Spark 3 Migration Challenges
• Automatic Spark Upgrade Solution
• Results & Future Plans
• Q&A
4. Part 1 - Spark 3 Migration Challenges
5. Upgrade from Spark 2.3 to Spark 3.1
• ~20k ETL SQL jobs
• ~110 accounts
• 5k+ commits between Spark 2.3 and Spark 3.1
6. Migration challenges
• 20K+ ETL SQL jobs
• Hard to push job owners to complete the migration before the deadline
• Do not impact prod jobs during migration
• Ensure computing behavior compatibility
• Performance downgrade is not acceptable
• Jobs should not fail with the new Spark version
7. Solution for the challenges
• Infeasible to manually migrate 20K+ jobs; a great waste of engineering resources
• How to automate?
  1. Modify the job queries to prevent impact on prod, and dual-run the job pipeline
  2. Must guarantee the correctness
  3. Must guarantee the performance
  4. Roll back in case of failure
8. Part 2 - Automatic Spark Upgrade Solution
9. JPM - Job Performance Monitoring (for ETL jobs)
• Cluster resource monitoring
• User job monitoring
• Job failure diagnosis
• Job performance analysis
• Job resource tuning
10. Panda - Spark binary & configuration management. Likes to roll: rollback and roll forward.
11. Woody - Compatibility Test. Catches SQL bugs across versions/configs.
12. Upgrade Process: JPM (job history) -> Panda (config / version management) -> Woody (compatibility test)
13. LogicalPlan & StatementContext (diagram: UnresolvedPlan -> StatementContext)
14. Handling challenge - Do not impact prod jobs during migration
1. ConversionContext - keep the table relationships
2. Parse information from the LogicalPlan
3. Get the table & location nodes from the StatementContext
4. Replace table & location nodes on demand
5. Prepare databases and target tables on demand
Isolation (account / database / path):
• Test account with read-only permission
• Unique testing database for output tables: ${db}.${tbl} -> ${woody_test_instance_db}.${db}__${tbl}
• Isolated testing output path: ${prod_loc} -> ${woody_test_instance_root_path}/${prod_loc}
15. Making queries safe to test
Before:
    use demo;
    drop table if exists tc;
    create table tc(id int) using parquet location '/apps/demo/loc';
    insert into tc select ta.key from ta join tb on ta.id=tb.id;
    insert into td select tc.* from tc join te on tc.id=te.id;
After:
    create database if not exists ${woody_test_db} location '${woody_test_db_loc}';
    drop table if exists ${woody_test_db}.demo__tc;
    create table ${woody_test_db}.demo__tc(id int) using parquet location '${woody_test_db_loc}/apps/demo/loc';
    insert into ${woody_test_db}.demo__tc select ta.key from demo.ta join demo.tb on ta.id=tb.id;
    insert into ${woody_test_db}.demo__td select demo__tc.* from ${woody_test_db}.demo__tc join demo.te on demo__tc.id=te.id;
16. Handling challenge - Ensure the computing behavior compatibility
• Know which tables to verify from the ConversionContext
• Check the table count & checksum
  • Concat the column values into a String with a delimiter, calculate the crc32 value, cast it to Decimal(19, 0), and sum it as the checksum
• Support skipping dynamic columns
  • Audit columns, like `update_time`
  • Non-deterministic functions, like `current_timestamp()`, `collect_list`
• Support skipping non-deterministic queries
  • Window functions
• Support finding the random column
  • Dichotomy (binary search over the columns)
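A minimal sketch of the count & checksum check described above, written as a Spark SQL query. The table and column names (demo.orders, id, amount, status, update_time) are hypothetical; crc32 and concat_ws are built-in Spark SQL functions.
    -- Checksum per table: concat the columns with a delimiter, crc32 each row,
    -- cast to Decimal(19, 0) and sum. The audit column update_time is skipped on purpose.
    SELECT count(*) AS row_count,
           sum(cast(crc32(concat_ws('|', id, amount, status)) AS decimal(19, 0))) AS checksum
    FROM demo.orders;
Running the same query against the Spark 2.3 output table and the rewritten Spark 3.1 test table and comparing row_count and checksum gives a cheap equality check without moving full result sets around.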
17. Guarantee the correctness and performance
• Correctness
  • Count & checksum
  • Sample data
• Performance (from JPM)
  • HCU (Hadoop Compute Unit)
  • 1 HCU = 1 GB * second
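As a worked example with made-up numbers: a job that holds 10 executors of 4 GB each for 10 minutes consumes 10 * 4 GB * 600 s = 24,000 HCU. If the dual-run of the same job on Spark 3.1 reports 30,000 HCU in JPM, that is a 25% increase, i.e. a performance downgrade under this metric.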
18. End-to-end testing
• Isolated from the prod env
  • Preprod account with the same read-only permissions as the prod account
  • All output data is written into the preprod root path
• Prod data is used as input, with no impact on the prod env
• No code changes, automated testing
19. Part 3 - Results & Future Plans
20. Results - the ETL SQL job migration has been completed
21. Compatibility Bug Fix Examples
• Spark 2.3 bug, already fixed in Spark 3.1
  • [SPARK-30201][SQL] HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
• Spark 3.1 bug - decimal precision loss in Spark 3.1
  • [SPARK-39316][SQL] Merge PromotePrecision and CheckOverflow into decimal binary arithmetic
• Spark 3.1 behavior change - StringType vs NumericType
  • Spark 2.3: cast both sides to DoubleType
  • Spark 3.1: cast the string to the NumericType
  • Solution: if ansi=false, keep compatibility with Spark 2.3; suggest an explicit cast for StringType vs NumericType comparisons
• Spark 3.1 feature with config
  • Spark 3.1 jobs failed due to overwriting the read path
  • Solution: spark.sql.hive.convertInsertingPartitionedTable=false
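To illustrate the StringType vs NumericType change with a hypothetical table t(str_col string), along with the config workaround from the last bullet:
    -- Implicit comparison: Spark 2.3 casts both sides to DoubleType, while Spark 3.1
    -- casts the string side to the numeric type, so a row with str_col = '1.0'
    -- matches in 2.3 but may not in 3.1.
    SELECT * FROM t WHERE str_col = 1;

    -- The suggested explicit cast keeps the behavior identical on both versions.
    SELECT * FROM t WHERE cast(str_col AS double) = 1;

    -- Config mentioned above, to avoid failures when a job overwrites its own read path.
    SET spark.sql.hive.convertInsertingPartitionedTable=false;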
22. Future Plans
• Push the migration of Spark Scala jobs
• Push the migration of PySpark jobs
  • A portable Python environment has already been provided for Spark 3
• Job tuning