Infrastructure for Deep Learning in Apache Spark

In machine learning projects, preparing large datasets is a key phase that can be complex and expensive. Traditionally, data engineers did this work before handing off to data scientists or ML engineers, and the two groups operated in different environments because each phase required different tools, frameworks, and runtimes. Spark's support for diverse workload types brought data engineering closer to the downstream activities, such as machine learning, that depend on the data. Unifying data acquisition, preprocessing, model training, and batch inference on a single Spark-based platform not only provides a seamless experience across phases and accelerates the end-to-end ML lifecycle, but also lowers the total cost of ownership of building and managing separate infrastructure for each phase. With that, the needs of shared infrastructure expanded to include specialized hardware such as GPUs and support for deep learning workloads. Spark can make effective use of such infrastructure because it integrates with popular deep learning frameworks and supports GPU acceleration of deep learning jobs. In this talk, we share learnings and experiences from supporting different types of workloads in shared clusters equipped for both deep learning and data engineering. We will cover the following topics:

* Considerations for sharing infrastructure between big data and deep learning in Spark
* Deep learning in Spark clusters with and without GPUs
* Differences between distributed data processing and distributed machine learning
* Multitenancy and isolation in shared infrastructure

1. WIFI SSID: SparkAISummit | Password: UnifiedAnalytics

2. Infrastructure for Deep Learning in Apache Spark. Kaarthik Sivashanmugam, Wee Hyong Tok (Microsoft). #UnifiedAnalytics #SparkAISummit

3. Agenda
• Evolution of data infrastructure
• ML workflow: data prep & DNN training
• Intro to deep learning and computing needs
• Distributed deep learning and challenges
• Unified platform using Spark: infra considerations, challenges
• ML pipelines

4. Organization’s Data: database / data warehouse, web logs, call logs, products, images, video feeds, ……

5. Machine Learning: Typical E2E Process: Prepare → Experiment → Deploy … Orchestrate

6. Spark + Machine Learning and Deep Learning workloads

7. How long does it take to train ResNet-50 on ImageNet? Before 2017: 14 days on an NVIDIA M40 GPU.

8. Training ResNet-50 on ImageNet:
• Facebook (Caffe2): 1 hour, Tesla P100 x 256, Apr 2017
• UC Berkeley, TACC, UC Davis (TensorFlow): 31 mins, 1,600 CPUs, Sep 2017
• Preferred Networks (ChainerMN): 15 mins, Tesla P100 x 1,024, Nov 2017
• Tencent (TensorFlow): 6.6 mins, Tesla P40 x 2,048, Jul 2018
• Sony (Neural Network Library, NNL): 2.0 mins, Tesla V100 x 3,456, Nov 2018
• Fujitsu (MXNet): 1.2 mins, Tesla V100 x 2,048, Apr 2019

9. Considerations for Deep Learning @ Scale
• CPU vs. GPU
• Single vs. multi-GPU
• MPI vs. non-MPI
• InfiniBand vs. Ethernet
Credits: Mathew Salvaris, https://azure.microsoft.com/en-us/blog/gpus-vs-cpus-for-deployment-of-deep-learning-models/

10. “Things” you need to deal with when training machine learning / deep learning models:
• Dependencies and containers
• Handling failures
• Scheduling jobs
• Secure access
• Distributing data
• Gathering results
• Scaling resources
• Provisioning VM clusters

11. Machine Learning: Typical E2E Process: Prepare → Experiment → Deploy … Orchestrate

12. Machine Learning and Deep Learning (figures contrasting ML and DL; bottom figure from NVIDIA)

13. Lots of ML frameworks: TensorFlow, PyTorch, MXNet, Chainer, Keras, Scikit-Learn, ….

14. Design Choices for Big Data and Machine Learning/Deep Learning: laptop, Spark, Spark + separate infrastructure for ML/DL training/inference, cloud

15. Execution Models for Spark and Deep Learning
• Spark: independent tasks; embarrassingly parallel and massively scalable
• Distributed learning: non-independent tasks; some parallel processing; communication between nodes must be optimized; data parallelism vs. model parallelism
Credits: Reynold Xin, Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark

16. Execution Models for Spark and Deep Learning (cont.)
• Spark: independent tasks; embarrassingly parallel and massively scalable
• Distributed learning: non-independent tasks with dependencies between them; some parallel processing; communication between nodes must be optimized
Credits: Reynold Xin, Project Hydrogen

17. Execution Models for Spark and Deep Learning (cont.)
• Spark: independent tasks; on failure, re-run only the crashed task
• Distributed learning: non-independent tasks; on failure, all tasks must be re-run
Credits: Reynold Xin, Project Hydrogen
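The contrast above can be made concrete with a small sketch (plain Python, not Spark or any DL framework): each worker's gradient computation is embarrassingly parallel, but the averaging step that follows makes the tasks non-independent, which is why one crashed task forces the whole step to be re-run.

```python
# Illustrative simulation of one synchronized data-parallel SGD step for
# a 1-D linear model y = w * x with mean-squared-error loss.

def local_gradient(w, shard):
    # Phase 1 (embarrassingly parallel): each worker computes the MSE
    # gradient on its own shard of (x, y) pairs, with no communication.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train_step(w, shards):
    grads = [local_gradient(w, s) for s in shards]
    # Phase 2 (all-reduce): every worker needs every other worker's
    # gradient before anyone can continue. Spark's map-only model has no
    # such step; in distributed DL it is the core communication primitive,
    # and a missing worker here stalls or invalidates the whole step.
    avg = sum(grads) / len(grads)
    return w - 0.1 * avg  # synchronized SGD update

shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]  # true w = 2
w = 0.0
for _ in range(50):
    w = train_step(w, shards)
# w has converged to (approximately) 2.0
```

The same two-phase shape (independent compute, then a mandatory rendezvous) is what Project Hydrogen's barrier execution mode models inside Spark.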

18. Spark + ML/DL: HorovodRunner, Sparkflow, Project Hydrogen, TensorFlowOnSpark. www.aka.ms/spark

19. Microsoft Machine Learning for Apache Spark (MMLSpark) v0.16: Microsoft’s open source contributions to Apache Spark: Cognitive Services on Spark, Spark Serving, Model Interpretability, LightGBM gradient boosting, deep networks with CNTK, HTTP on Spark. www.aka.ms/spark | Azure/mmlspark

20. Demo: Azure Databricks and Deep Learning

21. Demo: Distributed Deep Learning using TensorFlow with HorovodRunner

22. Physics of Machine Learning and Deep Learning: what do you need for training / distributed training? CPU, GPU, memory, storage, network, and a deep learning framework.

23. GPU Device Interconnect
• NVLink
• GPUDirect P2P
• GPUDirect RDMA
• Standard network stack
(Interconnect topology sample.) Credits: CUDA-MPI blog (https://bit.ly/2KnmN58)

24. From CUDA to NCCL 1 to NCCL 2, the multi-GPU communication library: CUDA targets a single GPU (plus a multi-core CPU); NCCL 1 adds multi-GPU within a node; NCCL 2 adds multi-node. Credits: NCCL tutorial (https://bit.ly/2KpPP44)

25. NCCL 2.x (multi-node). Credits: NCCL tutorial (https://bit.ly/2KpPP44)

26. NCCL 2.x (multi-node). Credits: NCCL tutorial (https://bit.ly/2KpPP44)
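For intuition about what NCCL does, the ring all-reduce at its heart can be sketched in plain Python (an illustrative simulation, not NCCL's API): lists stand in for GPU buffers, and chunks travel around a ring of ranks so that every rank ends up with the element-wise sum while per-rank bandwidth stays constant as ranks are added.

```python
# Simulated ring all-reduce over n ranks. Each rank's buffer is split
# into n chunks; the algorithm runs a reduce-scatter phase followed by
# an all-gather phase, each taking n-1 neighbor-to-neighbor steps.

def ring_allreduce(bufs):
    """After the call, every buffer holds the element-wise sum of all."""
    n = len(bufs)
    csz = len(bufs[0]) // n  # chunk size; assumes length divisible by n

    def chunk(buf, c):
        return buf[c * csz:(c + 1) * csz]

    def put(buf, c, vals):
        buf[c * csz:(c + 1) * csz] = vals

    # Reduce-scatter: at step s, rank r sends chunk (r - s) mod n to its
    # right neighbor, which adds it into its own copy of that chunk.
    # Snapshot sends first so all ranks act "simultaneously".
    for step in range(n - 1):
        sends = [chunk(bufs[r], (r - step) % n) for r in range(n)]
        for r in range(n):
            c = (r - step) % n
            dst = bufs[(r + 1) % n]
            put(dst, c, [a + b for a, b in zip(chunk(dst, c), sends[r])])

    # Now rank r holds the fully reduced chunk (r + 1) mod n.
    # All-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [chunk(bufs[r], (r + 1 - step) % n) for r in range(n)]
        for r in range(n):
            put(bufs[(r + 1) % n], (r + 1 - step) % n, sends[r])
    return bufs

bufs = ring_allreduce([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0],
                       [7.0, 8.0, 9.0]])
# every rank now holds the element-wise sum [12.0, 15.0, 18.0]
```

NCCL applies the same pattern to gradient buffers in GPU memory, moving chunks over NVLink, PCIe, or (in NCCL 2.x) InfiniBand between nodes.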

27. Spark & GPU
Options for using GPUs with Spark:
1. Native support (cluster manager, GPU tasks): SPARK-24615
2. Use cores/memory as a proxy for GPU resources and allow GPU-enabled code execution
3. Code implementation/generation for GPU offload
Considerations: flexibility, data management, multi-GPU execution
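Option 2 above can be sketched as a Spark configuration (the sizes are hypothetical, assuming a worker with 16 cores and 4 GPUs): with no native GPU scheduling available, `spark.task.cpus` is set so that at most one task per GPU lands on each executor, and the task code grabs a GPU itself.

```properties
# Hypothetical worker: 16 cores, 4 GPUs. Cap concurrency at 4 tasks per
# executor (one per GPU) by making each task claim a quarter of the cores.
spark.executor.cores=16
spark.task.cpus=4
```

This proxy approach works but is fragile: Spark cannot tell tasks which GPU to use or isolate them, which is the gap SPARK-24615 addresses with first-class GPU scheduling.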

28. Infrastructure Considerations
• Data format, storage and reuse
  – Co-locate data engineering storage infrastructure (cluster-local)
  – DL framework support for HDFS (reading from HDFS does not mean data-locality-aware computation)
  – Sharing data between Spark and deep learning (HDFS, Spark-TensorFlow connector, Parquet/Petastorm)
• Job execution
  – Gang scheduling (see SPARK-24374)
  – Support for GPUs and other accelerators (see SPARK-24615)
  – Cluster sharing with other types of jobs (CPU-only vs. CPU+GPU clusters)
  – Quota management
  – Support for Docker containers
  – MPI vs. non-MPI
  – Different GPU generations
• Node and GPU connectivity
  – InfiniBand, RDMA
  – GPU interconnect options
  – Interconnect-aware scheduling; minimize distribution; repacking
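The gang-scheduling requirement can be illustrated with a small sketch (plain Python threads, not Spark's API): a barrier models the rendezvous that every task in the gang must reach, which is why the scheduler must launch all tasks of such a stage together; if only some tasks get slots, the launched ones block at the barrier forever.

```python
# Simulated gang of N tasks (cf. SPARK-24374 / barrier execution mode):
# no task proceeds past the barrier until every task has reached it.
import threading

N = 4
barrier = threading.Barrier(N)
order = []                      # event log shared by all workers
lock = threading.Lock()

def worker(rank):
    with lock:
        order.append(("start", rank))
    barrier.wait()              # rendezvous: all N tasks meet here
    with lock:
        order.append(("after_barrier", rank))

threads = [threading.Thread(target=worker, args=(r,)) for r in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every "start" event precedes every "after_barrier" event: the barrier
# enforces the all-or-nothing start that gang scheduling provides.
last_start = max(i for i, e in enumerate(order) if e[0] == "start")
first_after = min(i for i, e in enumerate(order) if e[0] == "after_barrier")
assert last_start < first_after
```

On a shared cluster this is exactly why quota management matters: an MPI-style job that is granted only part of its gang holds resources while making no progress.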

29. ML Pipelines
• Machine learning pipelines let data scientists, data engineers, and IT professionals collaborate on different steps/phases
• Enable use of the best technology for each phase of the ML/DL workflow