Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA

Data science workflows can benefit tremendously from acceleration, enabling data scientists to explore more and larger datasets and drive towards their business goals faster and more reliably. Accelerating Apache Spark with GPUs is the next step for data science. In this talk, we will share our work on accelerating Spark applications via CUDA and NCCL. We identified several bottlenecks in Spark 2.4 in the areas of data serialization and data scalability. To address them, we accelerated Spark-based data analytics with enhancements that allow large columnar datasets to be analyzed directly in CUDA with Python. The GPU DataFrame library cuDF can be used to express advanced analytics easily. By applying Apache Arrow and cuDF, we have achieved over a 20x speedup over regular RDDs. For distributed machine learning, Spark 2.4 introduced a barrier execution mode to support MPI allreduce-style algorithms. We will demonstrate how the latest NVIDIA NCCL library, NCCL 2, can further scale out distributed learning algorithms such as XGBoost. Finally, an enhancement to the Spark Kubernetes scheduler will be introduced so that GPU resources can be scheduled from a Kubernetes cluster for Spark applications. We will share our experience deploying Spark on NVIDIA Tesla T4 server clusters. Based on the new NVIDIA Turing architecture, the T4, an energy-efficient 70-watt GPU in a small PCIe form factor, is optimized for scale-out computing environments and features multi-precision Turing Tensor Cores and new RT Cores.

1. Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL
Richard Whitcomb, NVIDIA
Rong Ou, NVIDIA
#UnifiedAnalytics #SparkAISummit

2. About Us
• Richard Whitcomb: Senior Engineer working on AI Infrastructure. Previously at Spotify and Twitter.
• Rong Ou: Principal Engineer at NVIDIA working on AI Infrastructure. Previously at Google.

3. Why Spark on GPU?


5. Spark GPU: A Machine Learning Story
• Problem: predict loan delinquency
• Dataset: Fannie Mae loan performance data
• Library: XGBoost
• Platform: Apache Spark
• GPU: NVIDIA Tesla T4

6. Dataset
• Fannie Mae single-family loan performance data
• 18 years: 2000–2017
• # loans: 38,964,685
• # performance records: 2,008,374,244
• Size (CSV): 168 GB

7. XGBoost
• Popular gradient boosting library
• Distributed mode via Spark
• GPU support via CUDA
• Multi-GPU support via NCCL 2
• Recent addition: multi-node GPU support
• Experimental: running on Spark with GPUs
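NCCL 2's multi-GPU support is built around collective operations, most importantly all-reduce, which sums gradient histograms across GPUs so every worker ends up with the same totals. The communication pattern can be illustrated with a toy, single-process simulation (a sketch of the ring algorithm only; this is not NCCL's actual API, which operates on device buffers in parallel):

```python
# Toy single-process simulation of ring all-reduce, the communication pattern
# NCCL 2 uses to sum values (e.g. gradient histograms) across GPUs.
# Illustrative sketch only, not NCCL's API.

def ring_allreduce(vectors):
    """Sum equal-length vectors across n simulated workers arranged in a ring.

    Each vector is split into n chunks. A reduce-scatter phase (n-1 steps)
    leaves each worker with one fully summed chunk; an all-gather phase
    (n-1 steps) circulates those sums so every worker ends with the total.
    """
    n = len(vectors)
    size = len(vectors[0])
    bounds = [i * size // n for i in range(n + 1)]
    chunks = [[v[bounds[c]:bounds[c + 1]] for c in range(n)] for v in vectors]

    # Reduce-scatter: at step t, worker r sends chunk (r - t) % n to worker
    # (r + 1) % n, which adds it to its own copy of that chunk.
    for t in range(n - 1):
        msgs = [((r + 1) % n, (r - t) % n, chunks[r][(r - t) % n])
                for r in range(n)]
        for dst, c, data in msgs:
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], data)]

    # All-gather: at step t, worker r forwards chunk (r + 1 - t) % n; the
    # receiver overwrites its copy with the fully reduced values.
    for t in range(n - 1):
        msgs = [((r + 1) % n, (r + 1 - t) % n, chunks[r][(r + 1 - t) % n])
                for r in range(n)]
        for dst, c, data in msgs:
            chunks[dst][c] = data

    return [[x for chunk in chunks[r] for x in chunk] for r in range(n)]

print(ring_allreduce([[1, 2], [3, 4], [5, 6]]))
# -> [[9, 12], [9, 12], [9, 12]]: every worker holds the element-wise sum
```

Because each worker only ever exchanges vector-sized chunks with its ring neighbors, total traffic per worker stays roughly constant as workers are added, which is why this pattern scales well for distributed gradient boosting.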

8. Spark Cluster
• Standalone cluster on GCP
• 5 virtual machines, each with:
  – 64 vCPUs (32 physical cores)
  – 416 GB memory
  – 4 x NVIDIA Tesla T4
  – 400 GB SSD persistent disk
  – Default networking
• 4 Spark workers per VM
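In a standalone deployment, running multiple workers per VM is typically configured through `conf/spark-env.sh` on each machine. A sketch with values inferred from the slide (the 4-workers-per-VM split matches the slide; the per-worker memory figure is an assumption, leaving headroom out of the 416 GB):

```shell
# conf/spark-env.sh on each VM (illustrative values, not from the talk)
SPARK_WORKER_INSTANCES=4   # one worker per Tesla T4 GPU
SPARK_WORKER_CORES=16      # 64 vCPUs / 4 workers
SPARK_WORKER_MEMORY=96g    # assumed split of the 416 GB with headroom
```

Pinning one worker per GPU keeps the mapping between Spark executors and devices simple in the absence of GPU-aware scheduling (which SPARK-24615, mentioned later, addresses).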

9. Sample Code

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val xgbParam = Map(
  "eta" -> 0.1f,
  "max_depth" -> 20,
  "max_leaves" -> 256,
  "grow_policy" -> "lossguide",
  "num_round" -> 100,
  "num_workers" -> 20,
  "nthread" -> 16,
  "tree_method" -> "gpu_hist")

val xgbClassifier = new XGBoostClassifier(xgbParam).
  setFeaturesCol("features").
  setLabelCol("labels")

10. Preliminary Results

                       Accuracy (AUC)   Training Loop (Seconds)
Max Tree Depth = 8
  CPU                  0.832            1071.002
  GPU                  0.832            139.641
  Speedup                               766.97%
Max Tree Depth = 20
  CPU                  0.833            1088.662
  GPU                  0.833            165.868
  Speedup                               656.34%

11. But...
• XGBoost training is pretty fast on GPUs
• ETL is slow in comparison
• We need to accelerate the machine learning workflow end to end

12. Apache Arrow RDD
• Store Arrow batches directly in RDDs
• Already has library support
• Moving between RDD and CUDA with zero copy
• Eliminates PySpark serialization overhead
  – 20x speed improvement in PySpark vs Pickle
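The serialization gap can be illustrated with a toy, pure-Python comparison (no Arrow involved): pickling a table as per-row Python objects versus pickling the same data as contiguous columnar buffers, which is closer in spirit to what Arrow record batches provide. The row form pays per-object overhead for every value; the columnar form serializes raw machine values.

```python
import pickle
from array import array

# Toy illustration of row vs. columnar serialization cost (not the actual
# Arrow code path): the same 100k-row table as Python row dicts and as two
# contiguous columnar buffers.
n = 100_000
rows = [{"id": i, "x": float(i)} for i in range(n)]   # row-oriented
col_id = array("i", range(n))                          # columnar: int32 buffer
col_x = array("d", (float(i) for i in range(n)))       # columnar: float64 buffer

row_bytes = len(pickle.dumps(rows))
col_bytes = len(pickle.dumps((col_id, col_x)))
print(row_bytes, col_bytes)  # the columnar pickle is substantially smaller
```

Beyond size, the bigger win in PySpark is avoiding the per-row object creation and pickling work entirely: an Arrow batch can be handed to CUDA as one buffer, which is where the 20x figure above comes from.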

13. Arrow RDD Problems
• Users moving to Dataset/DataFrame API
• Difficult to use (columns vs rows)
• Most of Spark's features aren't usable; mostly works on distributed Pandas DataFrames
• Users would have to rewrite all of their ETL jobs to make use of GPUs

14. Moving towards DataFrames
• Can we provide similar speed improvements under the DataFrame API?
• Little to no code changes for ETL jobs
• Same API that users are already comfortable with

15. ETL on GPUs
• Ability to process columnar data across operators is key
• Added an interface so DataFrame ops can "opt in" to consume and produce columnar data
• Added columnar processing to a few DataFrame ops (CSV parsing, hash join, hash aggregate, etc.)
• Can switch between row and columnar processing with a config setting
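The opt-in idea can be sketched as a toy planner (hypothetical Python, not Spark's actual internal API): each operator declares whether it supports columnar batches, and the planner inserts row/columnar conversions only where the execution mode changes between adjacent operators.

```python
# Hypothetical sketch of "opt-in" columnar planning (not Spark's real
# internals): a planner walks the operator pipeline and inserts conversion
# stages only at boundaries where the execution mode changes.

class Op:
    def __init__(self, name, supports_columnar):
        self.name = name
        self.supports_columnar = supports_columnar

def plan(ops, columnar_enabled=True):
    """Return the physical pipeline as a list of stage names, inserting
    RowToColumnar / ColumnarToRow transitions at mode boundaries."""
    stages = []
    mode = "row"  # sources deliver rows in this toy model
    for op in ops:
        want = "columnar" if (columnar_enabled and op.supports_columnar) else "row"
        if want != mode:
            stages.append("RowToColumnar" if want == "columnar" else "ColumnarToRow")
            mode = want
        stages.append(op.name)
    if mode == "columnar":
        stages.append("ColumnarToRow")  # results surface to the driver as rows
    return stages

pipeline = [Op("CsvScan", True), Op("HashAggregate", True), Op("Sort", False)]
print(plan(pipeline))
# -> ['RowToColumnar', 'CsvScan', 'HashAggregate', 'ColumnarToRow', 'Sort']
print(plan(pipeline, columnar_enabled=False))
# -> ['CsvScan', 'HashAggregate', 'Sort']
```

Keeping data columnar across consecutive GPU-capable operators is the point: a conversion per operator would eat the speedup, while a single config flag flips the whole pipeline back to row processing.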

16. Simple Benchmark
• No user code changes
• Config settings to enable GPU acceleration
• Uses the RAPIDS library under the covers
• 18x speedup

dfc = spark.read.schema(schema).csv("...")
dfc.groupBy("id").agg(F.sum("x"))
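Enabling the acceleration is purely a matter of configuration. As a sketch, along the lines of what the RAPIDS Accelerator for Apache Spark later shipped (the property names below belong to that project and are illustrative here, not quoted from this talk):

```shell
# Illustrative settings in the spirit of the RAPIDS Accelerator for Apache
# Spark; property names are that project's, not from this talk.
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  ...
```

The user's DataFrame code, like the groupBy/agg example above, stays unchanged; only the submit-time configuration differs between CPU and GPU runs.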

17. Spark on GPU
• Encouraging early results with room to improve
  – 6x speedup of XGBoost training loop
  – 18x speedup of dataset-based ETL example
• Eager to collaborate with the Spark community
  – Accelerator-aware scheduling (SPARK-24615)
  – Stage-level resource scheduling (SPARK-27495)
  – Columnar processing (SPARK-27396)
  – cuDF integration into XGBoost (XGBOOST-3997)
  – Out-of-core XGBoost GPU (XGBOOST-4357)

Initiated by Apache Spark PMC members and committers. Dedicated to publishing and sharing Apache Spark + AI technology, its ecosystem, best practices, and the latest developments.