Migrating Apache Hive Jobs to Apache Spark
1. Migrating Apache Hive Workload to Apache Spark: Bridge the Gap
   Zhan Zhang, Jane Wang, Facebook
2. Overview
• Hive to Spark Migration Effort
• Narrowing Down Feature Gaps
  – Regex Column Specification Support
  – Local Writes Support
  – UDFs
• Performance and Reliability
  – Dynamic Join
  – Bucket Join
• Advanced Optimization for Extremely Large Jobs
  – Secondary Partitioning
  – Run-time Optimization
3. Hive to Spark Migration
• Why do we migrate workloads from Hive to Spark?
  – Performance
  – Identify and narrow down the feature gaps
4. Regex Column Specification Support
• One of the most common failures in our syntax analysis.
• Support regex column specification:
  – SELECT `(a)?+.+` FROM data table
  – SELECT t.`(a)?+.+` FROM data table
• SPARK-12139
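A minimal sketch of how this syntax is used in Spark SQL (2.3+), where backticked identifiers in a SELECT list are treated as regexes once spark.sql.parser.quotedRegexColumnNames is enabled; the table and column names below are made up for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      // SPARK-12139: treat backticked identifiers in SELECT as regexes.
      .config("spark.sql.parser.quotedRegexColumnNames", "true")
      .getOrCreate()

    // Hypothetical table with columns a, ab, and b.
    spark.range(1).selectExpr("id AS a", "id AS ab", "id AS b")
      .createOrReplaceTempView("data")

    // `(a)?+.+` matches every column name except the bare "a":
    // the possessive group consumes "a", leaving nothing for ".+".
    spark.sql("SELECT `(a)?+.+` FROM data").show()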
5. Local Filesystem Writes
• Support writing data into the filesystem from queries:
  – INSERT OVERWRITE LOCAL? DIRECTORY path=STRING rowFormat? createFileFormat?
  – INSERT OVERWRITE LOCAL? DIRECTORY (path=STRING)? tableProvider (OPTIONS options=tablePropertyList)?
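A hedged example of the two grammar forms above (SPARK-4131, available upstream since Spark 2.3), reusing the spark session from the earlier sketch. The output paths are illustrative, and the Hive-style ROW FORMAT variant assumes a session built with enableHiveSupport():

    // Hive-style form: row format and file format clauses.
    spark.sql("""
      INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive_style_out'
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      SELECT * FROM data
    """)

    // Spark-native form: an explicit data source (tableProvider) plus options.
    spark.sql("""
      INSERT OVERWRITE LOCAL DIRECTORY '/tmp/spark_style_out'
      USING parquet
      OPTIONS (compression 'snappy')
      SELECT * FROM data
    """)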
6. UDF Support
• UDAF_JAVA_F / UDTF_JAVA_F / UDF_JAVA_F
• UDF_Bind
• UDF_EVAL_F
• Non-deterministic Expression
• …
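The UDF_* names above are internal to the deck, but two of the ideas map onto Spark's public API: registering a Java/Hive UDF by class name, and marking an expression non-deterministic so the optimizer evaluates it exactly where it appears. A sketch, reusing the spark session from above; the class and function names are hypothetical:

    import org.apache.spark.sql.functions.udf

    // Register a Java/Hive UDF by its (hypothetical) class name.
    spark.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper'")

    // Mark a UDF non-deterministic so the optimizer will not cache,
    // collapse, or reorder its evaluation.
    val randomInt = udf(() => scala.util.Random.nextInt(100)).asNondeterministic()
    spark.udf.register("random_int", randomInt)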
7. Narrowing Down Feature Gaps – Syntax
• Regex Column Specification
• Syntax parser improvement
• UDF compatibility
  – Enum value
  – User-defined class type
  – Lambda function
8. 3X Workload Growth in 6 Months
[Chart: reserved CPU days vs. actual CPU days over time]
9. Joins
• Broadcast Join
• ShuffleHash Join
• SortMerge Join
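For context, the first of these can be requested explicitly through Spark's public broadcast hint; the DataFrames below are placeholders built from the spark session used earlier:

    import org.apache.spark.sql.functions.broadcast

    val largeDf = spark.range(1000000).withColumnRenamed("id", "key")
    val smallDf = spark.range(100).withColumnRenamed("id", "key")

    // Ask for a broadcast hash join: the small side is replicated to every
    // executor so the large side avoids a shuffle entirely.
    val joined = largeDf.join(broadcast(smallDf), "key")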
10. Dynamic Join
• More aggressively leverage hash join by building the hash table first.
• Provide a reliable fallback mechanism.
[Flowchart: Start → Build Hash Table → OOM? No: Hash Join; Yes: Reconstruct Iterator → Sort-Merge Join → End]
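A self-contained toy sketch of the control flow in that flowchart, not Facebook's actual implementation: try the hash-join fast path, and if building the hash table exhausts memory, fall back to a sort-merge join over reconstructed inputs. All types and helpers here are stand-ins, not Spark APIs:

    object DynamicJoinSketch {
      type Row = (Int, String) // toy (key, payload) row

      def hashJoin(build: Seq[Row], probe: Seq[Row]): Seq[(Row, Row)] = {
        val table = build.groupBy(_._1) // build side must fit in memory
        probe.flatMap(p => table.getOrElse(p._1, Nil).map(b => (p, b)))
      }

      def sortMergeJoin(left: Seq[Row], right: Seq[Row]): Seq[(Row, Row)] = {
        val r = right.sortBy(_._1) // both sides sorted, then matched by key
        left.sortBy(_._1).flatMap(l => r.filter(_._1 == l._1).map(x => (l, x)))
      }

      def dynamicJoin(left: Seq[Row], right: Seq[Row]): Seq[(Row, Row)] =
        try hashJoin(right, left) // fast path: hash join
        catch {
          // reliable fallback: rebuild inputs and sort-merge join
          case _: OutOfMemoryError => sortMergeJoin(left, right)
        }
    }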
11. Dynamic Join – Physical Plan
12. Bucket Join
• Support a different number (multiplier) of buckets on the left/right side.
[Diagram: bucket-to-split mapping when the two sides' bucket counts differ by a multiplier]
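For reference, this is how bucketed tables are written with Spark's public writer API, with bucket counts differing by a multiplier as the slide describes (joining such tables without a shuffle was Facebook's extension at the time; upstream support came later). Table names, counts, and the key column are illustrative:

    // Two tables bucketed and sorted on the join key, with bucket counts
    // related by a 2x multiplier (8 vs. 4).
    spark.range(1000).withColumnRenamed("id", "key")
      .write.bucketBy(8, "key").sortBy("key").saveAsTable("bucketed_left")

    spark.range(1000).withColumnRenamed("id", "key")
      .write.bucketBy(4, "key").sortBy("key").saveAsTable("bucketed_right")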
13. Bucket Join Validation
• To verify that Spark's bucket join generates results consistent with Hive's bucket join:
  – Read the Spark/Hive tables.
  – Zip the corresponding splits from the Spark- and Hive-generated tables.
  – Compare the sort column in the two splits sequentially.
  – Sort the bucket column in each split and compare rows in the two splits sequentially.
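A hedged sketch of that validation using RDD.zipPartitions, which requires both tables to have the same number of splits. The table names are placeholders and the bucket column is assumed to be the first Int column:

    // Pair up corresponding splits of the two tables and compare them
    // after sorting each split on the (assumed) bucket column.
    val sparkRdd = spark.table("spark_bucketed").rdd
    val hiveRdd  = spark.table("hive_bucketed").rdd

    val mismatchedSplits = sparkRdd.zipPartitions(hiveRdd) { (s, h) =>
      val left  = s.map(_.getInt(0)).toSeq.sorted
      val right = h.map(_.getInt(0)).toSeq.sorted
      Iterator.single(left == right)
    }.filter(ok => !ok).count()

    assert(mismatchedSplits == 0, s"$mismatchedSplits splits differ")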
14. Challenges in Large Jobs
• A large job with 10,000 mappers × 10,000 reducers:
  – IOPS: 100,000,000 (every reducer fetches from every mapper)
  – HDFS: 10,000 result files
  – Scheduling overhead: 20,000 tasks
  – Manual tuning
  – Data skew
15. Advanced – Secondary Partitioning
16. Pros and Cons
• Reduced IOPS
• Fewer HDFS files
• Runtime optimization
• Backward compatibility
  – Exactly the same behavior with split number = 1
• Auto-configuration
  – e.g., 503 partitions and 13 buckets to achieve good performance
BUT
• Reduced parallelism
• Need to fetch all data before computation can start
17. Runtime Join Optimization
18. JIRA
• SPARK-12139 – REGEX Column Specification for Hive Queries
• SPARK-4131 – Support "Writing data into the filesystem from queries"
• SPARK-23306 – Race condition in TaskMemoryManager
• SPARK-19326 – Speculated task attempts do not get launched in few scenarios
• SPARK-19839 – Fix memory leak in BytesToBytesMap
19. Acknowledgements
This presentation includes work from the Spark team at Facebook. Thanks for their contributions, especially Lin Wang and Tejas Patil.
20. Questions?