Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL
1. Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL
Richard Whitcomb, NVIDIA
Rong Ou, NVIDIA
#UnifiedAnalytics #SparkAISummit
2. About Us
• Richard Whitcomb: Senior Engineer at NVIDIA working on AI infrastructure. Previously at Spotify and Twitter.
• Rong Ou: Principal Engineer at NVIDIA working on AI infrastructure. Previously at Google.
3. Why Spark on GPU?
5. Spark GPU: A Machine Learning Story
• Problem: predict loan delinquency
• Dataset: Fannie Mae loan performance data
• Library: XGBoost
• Platform: Apache Spark
• GPU: NVIDIA Tesla T4
6. Dataset
• Fannie Mae single-family loan performance data
• 18 years: 2000-2017
• Number of loans: 38,964,685
• Number of performance records: 2,008,374,244
• Size (CSV): 168 GB
7. XGBoost
• Popular gradient boosting library
• Distributed mode via Spark
• GPU support via CUDA
• Multi-GPU support via NCCL 2
• Recent addition: multi-node GPU support
• Experimental: running on Spark with GPUs
8. Spark Cluster
• Standalone cluster on GCP
• 5 virtual machines, each with:
  – 64 vCPUs (32 physical cores)
  – 416 GB memory
  – 4 x NVIDIA Tesla T4
  – 400 GB SSD persistent disk
  – Default networking
• 4 Spark workers per VM
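A minimal sketch of how a driver might target such a cluster, assuming illustrative executor sizing (4 workers per 64-vCPU, 416 GB VM leaves roughly 16 cores and 96 GB per worker after headroom); the master hostname and all values here are hypothetical, not from the talk:

  import org.apache.spark.sql.SparkSession

  // Hypothetical sizing: 4 workers per VM, so ~16 cores and ~96 GB each.
  val spark = SparkSession.builder()
    .appName("fannie-mae-xgboost")         // illustrative app name
    .master("spark://master-host:7077")    // standalone master; host is made up
    .config("spark.executor.cores", "16")
    .config("spark.executor.memory", "96g")
    .getOrCreate()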
9. Sample Code

  import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

  val xgbParam = Map(
    "eta" -> 0.1f,
    "max_depth" -> 20,
    "max_leaves" -> 256,
    "grow_policy" -> "lossguide",
    "num_round" -> 100,
    "num_workers" -> 20,
    "nthread" -> 16,
    "tree_method" -> "gpu_hist")

  val xgbClassifier = new XGBoostClassifier(xgbParam)
    .setFeaturesCol("features")
    .setLabelCol("labels")
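As a follow-up sketch (not shown on the slide), the classifier is fit on a Spark DataFrame and evaluated; trainDf, testDf, and the evaluator wiring below are assumptions for illustration:

  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

  // trainDf/testDf are hypothetical DataFrames with "features" and "labels" columns.
  val model = xgbClassifier.fit(trainDf)
  val predictions = model.transform(testDf)

  // The evaluator's default metric is area under the ROC curve (AUC),
  // the accuracy measure reported on the next slide.
  val auc = new BinaryClassificationEvaluator()
    .setLabelCol("labels")
    .evaluate(predictions)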
10. Preliminary Results

  Max Tree Depth = 8    Accuracy (AUC)   Training Loop (seconds)
    CPU                 0.832            1071.002
    GPU                 0.832            139.641
    Speedup: 766.97%

  Max Tree Depth = 20   Accuracy (AUC)   Training Loop (seconds)
    CPU                 0.833            1088.662
    GPU                 0.833            165.868
    Speedup: 656.34%
11. But...
• XGBoost training is pretty fast on GPUs
• ETL is slow in comparison
• We need to accelerate the machine learning workflow end to end
12. Apache Arrow RDD
• Store Arrow batches directly in RDDs (see the sketch after this list)
• Already has library support
• Zero-copy moves from RDD to CUDA
• Eliminates PySpark serialization overhead
  – 20x speed improvement over Pickle in PySpark
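A minimal sketch of the idea, not the talk's implementation: each partition of a plain RDD is packed into a single-column Arrow record batch and kept in the RDD as Arrow IPC bytes. The input intRdd, the column name, and the one-batch-per-partition layout are all illustrative assumptions:

  import java.io.ByteArrayOutputStream
  import scala.collection.JavaConverters._
  import org.apache.arrow.memory.RootAllocator
  import org.apache.arrow.vector.{IntVector, VectorSchemaRoot}
  import org.apache.arrow.vector.ipc.ArrowStreamWriter
  import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema}
  import org.apache.spark.rdd.RDD

  // intRdd: RDD[Int] is a hypothetical input; each partition becomes
  // one Arrow record batch serialized in the Arrow IPC stream format.
  val arrowRdd: RDD[Array[Byte]] = intRdd.mapPartitions { rows =>
    val values = rows.toArray
    val allocator = new RootAllocator(Long.MaxValue)
    val field = new Field("x", FieldType.nullable(new ArrowType.Int(32, true)), null)
    val root = VectorSchemaRoot.create(new Schema(Seq(field).asJava), allocator)
    val vec = root.getVector("x").asInstanceOf[IntVector]
    vec.allocateNew(values.length)
    values.zipWithIndex.foreach { case (v, i) => vec.setSafe(i, v) }
    root.setRowCount(values.length)

    val out = new ByteArrayOutputStream()
    val writer = new ArrowStreamWriter(root, null, out)  // no dictionary provider
    writer.start(); writer.writeBatch(); writer.end()
    writer.close(); root.close(); allocator.close()
    Iterator.single(out.toByteArray)
  }

Because Arrow's columnar layout matches what CUDA libraries expect, batches in this form can be handed to the GPU without per-row conversion, which is where the zero-copy claim comes from.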
13. Arrow RDD Problems
• Users are moving to the Dataset/DataFrame API
• Difficult to use (columns vs. rows)
• Most Spark features aren't usable; it mostly works on distributed Pandas dataframes
• Users would have to rewrite all of their ETL jobs to make use of GPUs
14. Moving Towards DataFrames
• Can we provide similar speed improvements under the DataFrame API?
• Little to no code changes for ETL jobs
• The same API users are already comfortable with
15. ETL on GPUs
• The ability to process columnar data across operations is key
• Added an interface so DataFrame ops can "opt in" to consuming and producing columnar data (see the sketch after this list)
• Added columnar processing to a few DataFrame ops (CSV parsing, hash join, hash aggregate, etc.)
• Can switch between row and columnar processing with a config setting
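A sketch of what such an opt-in might look like, loosely following the shape later proposed for Spark's physical operators in SPARK-27396; the trait and method names here are illustrative, not Spark's actual API:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.vectorized.ColumnarBatch

  // Illustrative opt-in: an operator advertises columnar support and,
  // if it opts in, provides a batch-oriented execution path alongside
  // the usual row-oriented one.
  trait ColumnarExec {
    /** True if this operator can consume and produce ColumnarBatch. */
    def supportsColumnar: Boolean = false

    /** Columnar path, called only when supportsColumnar returns true. */
    def executeColumnar(): RDD[ColumnarBatch] =
      throw new UnsupportedOperationException("row-based operator")
  }

With an interface like this, the planner only needs to insert row-to-columnar and columnar-to-row transitions at boundaries where adjacent operators disagree, which is what makes end-to-end columnar chains, and the row/columnar config toggle, possible.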
16. Simple Benchmark
• No user code changes
• Config settings to enable GPU acceleration (an illustrative example follows the code)
• Uses the RAPIDS library under the covers: https://rapids.ai/
• 18x speedup

  from pyspark.sql import functions as F

  dfc = spark.read.schema(schema).csv("...")
  dfc.groupBy("id").agg(F.sum("x"))
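The slide does not show the actual settings; purely as an illustration of the config-only idea, it would look something like the following, where the key is invented for this sketch and is not a real Spark or RAPIDS option:

  // Invented configuration key, only to illustrate enabling columnar
  // GPU processing without touching user ETL code.
  spark.conf.set("spark.sql.columnar.gpu.enabled", "true")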
17. Spark on GPU
• Encouraging early results, with room to improve
  – 6x speedup of the XGBoost training loop
  – 18x speedup of the Dataset-based ETL example
• Eager to collaborate with the Spark community
  – Accelerator-aware scheduling (SPARK-24615)
  – Stage-level resource scheduling (SPARK-27495)
  – Columnar processing (SPARK-27396)
  – cuDF integration into XGBoost (XGBOOST-3997)
  – Out-of-core XGBoost on GPU (XGBOOST-4357)