Deep Dive of ADBMS Migration to Apache Spark—Use Cases Sharing

eBay has been using an enterprise ADBMS for over a decade, and our team has been migrating batch workloads from ADBMS to Spark since 2018. We gathered many experiences and lessons over the whole migration journey (85% automated + 15% manual migration): we exposed many unexpected issues and gaps between ADBMS and Spark SQL, made many practical decisions to bridge those gaps, and contributed many fixes to Spark core in order to unblock ourselves. This sharing should be interesting and helpful for many folks, especially data/software engineers planning and executing their own migration work. In this session we will walk through many very specific issues we encountered and how the team resolved or worked around each of them during the real migration process.

1. WIFI SSID: SparkAISummit | Password: UnifiedAnalytics

2.Deep Dive of ADBMS Migration to Apache Spark—Use Cases Sharing Keith Sun Data Engineer, Data Service & Solution (eBay) #UnifiedAnalytics #SparkAISummit

3. About Us • DSS (Data Services & Solutions) team in eBay. • Focus on big data development, optimization, modeling & services on ADBMS, Spark/Hive, and Hadoop platforms. • Now, more time on the migration from ADBMS to Spark.

4. Talks from our team • Experience Of Optimizing Spark SQL When Migrating from MPP Database, Yucai Yu & Yuming Wang, Spark Summit 2018, London. • Analytical DBMS to Apache Spark Auto Migration Framework, Edward Zhang, Spark Summit 2018, London.

5. Agenda • Background • Use Cases and Best Practices • Auto Migration Deep Dive

6. Why Migrate to Spark • More Complex Use Cases • Big Data Processing • Streaming, Graph Computation, Machine Learning • Extreme Performance Needs • Optimization Need

7. Spark as DW Processing Engine — [architecture diagram: DS Model (Data Science), RT Data Service, Batch Service, Metadata Service, and DW Knowledge Graph / Metadata, on top of the Integrated Data Layer (Data Warehouse) and ODS Layer, running on DI ZETA (Data Infrastructure) Compute/Storage]

8. Spark Cluster Environment • Spark 2.1.0/2.3.1 • Hadoop 2.7.1 • Hive 1.2.1 • 1900 nodes • 460TB memory

9. Agenda • Background • Use Cases and Best Practices • Auto Migration Deep Dive

10. Migration Steps Overview • Table Schema Translation • SQL Conversion • Historical Data Copy • SQL run on Yarn cluster • Post Data Quality Check • Logging and Error Parsing

11. Table Schema Translation • Single Partitioned Table Is Not Enough • Column Name Is Case Sensitive • Column Type Mapping Tips

12. Single Partitioned Table Is Not Enough • "Cannot overwrite a path that is also being read from," regardless of different partitions. See SPARK-18107. Instead, create two tables: TableX & TableX_Merge.

13. Table DDL Sample

CREATE TABLE Table_X_Merge (
  …
  dt string
)
USING parquet
OPTIONS (
  path 'hdfs://hercules/table_x/snapshot/'
)
PARTITIONED BY (dt)

CREATE TABLE Table_X (
  …
)
USING parquet
OPTIONS (
  path 'hdfs://hercules/table_x/snapshot/dt=20190311'  -- point to latest partition
)

14. Column Name Is Case Sensitive • Lowercase the column names for Hive/Spark Parquet file interoperation; otherwise you may see NULL fields, wrong results, or errors. (SPARK-25132)
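As an illustration of the lowercasing step, a minimal Python sketch (the schema representation and helper name are hypothetical, not eBay's actual tooling):

```python
# Minimal sketch: normalize column names to lowercase before writing Parquet,
# so that Hive (which lowercases identifiers) and Spark agree on the schema.
# A schema is represented here as a plain list of (name, type) pairs.

def lowercase_schema(schema):
    """Return the schema with every column name lowercased.

    Raises ValueError if two columns collide after lowercasing, which
    would otherwise silently shadow data in Parquet/Hive interop.
    """
    lowered = [(name.lower(), dtype) for name, dtype in schema]
    names = [name for name, _ in lowered]
    if len(names) != len(set(names)):
        raise ValueError("column name collision after lowercasing: %s" % names)
    return lowered

print(lowercase_schema([("LSTG_ID", "bigint"), ("AUCT_END_DT", "date")]))
# → [('lstg_id', 'bigint'), ('auct_end_dt', 'date')]
```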

15. Spark 2.1.0 throws an error: "Caused by: java.lang.IllegalArgumentException: Column [id] was not found in schema!"

16. Spark 2.3.1 returns a wrong result silently.

17. Column Type Mapping Tips • Map decimal-typed integers to Integer, so Parquet filter pushdown can accelerate the file scan.

18. Sample: Parquet filter pushdown to accelerate the file scan. (SPARK-24549)

19. Query Improvements – Predicate Pushdown [SPARK-25419] Improve Parquet predicate pushdown • [SPARK-23727] Support Date type • [SPARK-24549] Support Decimal type • [SPARK-24718] Support Timestamp type • [SPARK-24706] Support Byte type and Short type • [SPARK-24638] Support StringStartsWith predicate • [SPARK-17091] Support IN predicate

20. SQL Conversion • Update & Delete Conversion • Insert Conversion • Number Expression • String Expression • Recursive Query Conversion

21. SQL Conversion – Update/Delete: Spark SQL does not support update/delete yet. Transform the update/delete into insert or insert overwrite.

22. ADBMS Use case

update tgt
from database.tableX tgt, database.Delta ods
set AUCT_END_DT = ods.AUCT_END_DT
where tgt.LSTG_ID = ods.LSTG_ID

insert into database.tableX(LSTG_ID, AUCT_END_DT)
select LSTG_ID, AUCT_END_DT
from database.Delta ods
left outer join database.tableX tgt
  on tgt.LSTG_ID = ods.LSTG_ID
where tgt.LSTG_ID is null;

[diagram: Yesterday Full Data vs. Delta]

23. Spark-SQL sample

insert overwrite table TableX_merge partition(dt='20190312')
select coalesce(tgt.LSTG_ID, ods.LSTG_ID) as LSTG_ID
  , IF(ods.LSTG_ID is not null, ods.AUCT_END_DT, tgt.AUCT_END_DT) as AUCT_END_DT
from TableX as tgt
full outer join Delta ods
  on tgt.LSTG_ID = ods.LSTG_ID;

alter table TableX set location 'xxxx/dt=20190312';
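The full outer join above implements an upsert: rows present in the delta win, rows only in the target carry over. A minimal Python sketch of that semantics, using dictionaries keyed by LSTG_ID to stand in for the tables (names are illustrative, not the real schema):

```python
# Simulate the full-outer-join upsert: for each LSTG_ID, take the delta row
# if one exists (ods.AUCT_END_DT wins), otherwise keep the target row.
def merge_snapshot(target, delta):
    """target, delta: dicts mapping LSTG_ID -> AUCT_END_DT."""
    merged = dict(target)   # rows only in the target are carried over
    merged.update(delta)    # rows in the delta overwrite matching keys
    return merged

target = {1: "2019-03-01", 2: "2019-03-05"}
delta = {2: "2019-03-12", 3: "2019-03-12"}   # updates key 2, inserts key 3
print(merge_snapshot(target, delta))
# → {1: '2019-03-01', 2: '2019-03-12', 3: '2019-03-12'}
```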

24. SQL Conversion – Insert • ADBMS implicitly dedupes data when inserting into a SET table (the default for new tables). For such cases, a "group by" or "distinct" is necessary.

25. ADBMS Use case (TableY is defined as a SET table)

insert into TableY(LSTG_ID, AUCT_END_DT)
select LSTG_ID, AUCT_END_DT
from ods_tableY tgt

26. Spark-SQL sample

insert overwrite table TableY_merge partition(dt='20190312')
select distinct * from (
  select LSTG_ID, AUCT_END_DT FROM TableY tgt
  UNION ALL
  select LSTG_ID, AUCT_END_DT FROM ODS_TableY
) tmp;
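The distinct-over-union pattern can be sketched in Python, with tuples standing in for rows (the helper name is hypothetical):

```python
# ADBMS SET-table semantics: exact duplicate rows are silently dropped on
# insert. In Spark SQL this must be made explicit with DISTINCT over the
# union of the existing rows and the incoming rows.
def insert_into_set_table(existing_rows, new_rows):
    """Union existing and new rows, dropping exact duplicates (order not kept)."""
    return set(existing_rows) | set(new_rows)

existing = {(1, "2019-03-01"), (2, "2019-03-05")}
incoming = [(2, "2019-03-05"), (3, "2019-03-12")]   # (2, ...) is a duplicate
print(sorted(insert_into_set_table(existing, incoming)))
# → [(1, '2019-03-01'), (2, '2019-03-05'), (3, '2019-03-12')]
```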

27. SQL Conversion – Number Expression • Rounding behavior: ADBMS rounds with the "HALF_EVEN" rule by default, but Spark-SQL uses "HALF_UP".

28. ADBMS Sample

select cast(2.5 as decimal(4,0)) as result;  -- returns 2
select cast(3.5 as decimal(4,0)) as result;  -- returns 4

29. Spark-SQL Result

spark-sql> select cast(2.5 as decimal(4,0));  -- returns 3
spark-sql> select bround(2.5, 0) as col1;     -- returns 2
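Python's decimal module exposes both rounding modes, so the difference is easy to reproduce outside Spark (a sketch for illustration: ROUND_HALF_EVEN matches the ADBMS default and Spark's bround, ROUND_HALF_UP matches Spark's cast):

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# HALF_EVEN ("banker's rounding", the ADBMS default): ties go to the even digit.
print(Decimal("2.5").quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))  # → 2
print(Decimal("3.5").quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))  # → 4

# HALF_UP (Spark's cast behavior): ties round away from zero.
print(Decimal("2.5").quantize(Decimal("1"), rounding=ROUND_HALF_UP))    # → 3
```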