DLoBD:深度学习大数据的新兴范例

深层学习大数据(DLoBD)正成为从海量收集的数据中挖掘价值的最重要的研究范式之一。许多新兴的深度学习框架开始运行在大数据上,比如Hadoop和Skar。随着HPC、大数据和深度学习的融合,这些DLoBD栈正在利用RDMA和基于多核/多核的CPU/GPU的优势。
展开查看详情

1. DLoBD: An Emerging Paradigm of Deep Learning Over Big Data Stacks Talk at Spark+AI Summit 2018 by Dhabaleswar K. (DK) Panda Xiaoyi Lu The Ohio State University The Ohio State University E-mail: panda@cse.ohio-state.edu E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda http://www.cse.ohio-state.edu/~luxi

2. Why Deep Learning is so hot? •  Deep Learning is a sub-set of Machine Learning –  But, it is perhaps the most radical and revolutionary subset •  Deep Learning is going through a resurgence –  Model: Excellent accuracy for deep/convolutional neural networks Courtesy: –  Data: Public availability of versatile datasets like http://www.zdnet.com/article/caffe2-deep-learning-wide-ambitions-flexibility- scalability-and-advocacy/ MNIST, CIFAR, and ImageNet –  Capability: Unprecedented computing and communication capabilities: Multi-/Many-Core, GPGPUs, Xeon Phi, InfiniBand, RoCE, etc. •  Big Data has become one of the most important elements in business analytics –  Increasing demand for getting Big Value out of Big Data to drive the revenue continuously growing MNIST handwritten digits Deep Neural Network Network Based Computing Laboratory Spark+AI Summit 2018 2

3. Application Example of DL: Flickr’s Magic View Photo Filtering •  Image recognition to divide pictures into surprisingly accurate categories •  Magic of AI/DL: Generate accurate tags for billions of pictures Courtesy: https://thenextweb.com/opinion/2015/05/22/flickrs-new-magic-view-photo-filtering-feature-works-so-well-it-convinced-me-to-ditch-iphoto/#.tnw_RaZEaD6g Network Based Computing Laboratory Spark+AI Summit 2018 3

4. Deep Learning over Big Data (DLoBD) •  Deep Learning over Big Data (DLoBD) is one of the most efficient analyzing paradigms •  More and more deep learning tools or libraries (e.g., Caffe, TensorFlow) start running over big data stacks, such as Apache Hadoop and Spark •  Benefits of the DLoBD approach –  Easily build a powerful data analytics pipeline •  E.g., Flickr DL/ML Pipeline, “How Deep Learning Powers Flickr”, http://bit.ly/1KIDfof (3) Non-deep (1) Prepare (2) Deep learning (4) Apply ML Datasets @Scale Learning @Scale model @Scale analytics @Scale –  Better data locality –  Efficient resource sharing and cost effective Network Based Computing Laboratory Spark+AI Summit 2018 4

5. Examples of DLoBD Stacks •  CaffeOnSpark •  SparkNet •  TensorFlowOnSpark •  TensorFrame •  DeepLearning4J •  BigDL •  mmlspark –  CNTKOnSpark •  Many others… Network Based Computing Laboratory Spark+AI Summit 2018 5

6. Overview of DLoBD Stacks •  Layers of DLoBD Stacks –  Deep learning application layer –  Deep learning library layer –  Big data analytics framework layer –  Resource scheduler layer Sub-optimal –  Distributed file system layer Performance –  Hardware resource layer ? •  How much performance benefit we can achieve for end deep learning applications? Network Based Computing Laboratory Spark+AI Summit 2018 6

7. Increasing Usage of HPC, Big Data and Deep Learning Big Data HPC (Hadoop, Spark, (MPI, RDMA, HBase, Lustre, etc.) Memcached, etc.) Deep Learning (Caffe, TensorFlow, BigDL, etc.) Convergence of HPC, Big Data, and Deep Learning!!! Network Based Computing Laboratory Spark+AI Summit 2018 7

8. Highly-Optimized Underlying Libraries with HPC Technologies •  BLAS Libraries – the heart of math operations –  Atlas/OpenBLAS –  NVIDIA cuBlas –  Intel Math Kernel Library (MKL) •  DNN Libraries – the heart of Convolutions! –  NVIDIA cuDNN (already reached its 7th iteration – cudnn-v7) –  Intel MKL-DNN (MKL 2017) – recent but a very promising development •  Communication Libraries – the heart of model parameter updating –  RDMA –  GPUDirect RDMA Network Based Computing Laboratory Spark+AI Summit 2018 8

9. Outline •  Accelerating Big Data Stacks •  Benchmarking and Characterizing DLoBD Stacks –  CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL •  Accelerating DLoBD Stacks –  BigDL on RDMA-Spark –  TensorFlow Network Based Computing Laboratory Spark+AI Summit 2018 9

10. The High-Performance Big Data (HiBD) Project •  RDMA for Apache Spark •  RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) –  Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions •  RDMA for Apache HBase Available for InfiniBand and RoCE •  RDMA for Memcached (RDMA-Memcached) Also run on Ethernet •  RDMA for Apache Hadoop 1.x (RDMA-Hadoop) •  OSU HiBD-Benchmarks (OHB) Available for x86 and OpenPOWER –  HDFS, Memcached, HBase, and Spark Micro-benchmarks •  http://hibd.cse.ohio-state.edu Support Singularity and Docker •  Users Base: 285 organizations from 34 countries •  More than 26,550 downloads from the project site Network Based Computing Laboratory Spark+AI Summit 2018 10

11. Number of Downloads 0 5000 10000 15000 20000 25000 30000 RDMA-Hadoop 1.x 0.9.0 RDMA-Hadoop 1.x 0.9.8 RDMA-Hadoop 1.x 0.9.9 Network Based Computing Laboratory RDMA-Memcached 0.9.1 & OHB-0.7.1 RDMA-Hadoop 2.x 0.9.1 RDMA-Hadoop 2.x 0.9.6 RDMA-Hadoop 2.x 0.9.7 RDMA-Memcached 0.9.4 Timeline Spark+AI Summit 2018 RDMA-Spark 0.9.1 HiBD Release Timeline and Downloads RDMA-HBase 0.9.1 RDMA-Hadoop 2.x 1.0.0 RDMA-Spark 0.9.4 RDMA-Hadoop 2.x 1.3.0 RDMA-Memcached 0.9.6 & OHB-0.9.3 RDMA-Spark 0.9.5 11

12. Example: RDMA-based Apache Spark RDMA-based Apache Spark (HotI’17) RDMA-based Apache Spark New Workloads -- DLoBD: (BigData’16) CaffeOnSpark, TensorFlowOnSpark, Default Design Choices: BigDL, etc. 1.  Sort-based Shuffle Default Design Choices: 2.  Netty-based data transfer RDMA-based 1.  Sort-based Shuffle New choice: Tungsten-Sort! Apache Spark Default Design Choices: 2.  Netty-based data (HotI’14) 1.  Sort-based Shuffle transfer 2.  NIO-based data Default Design Choices: transfer 1. Hash-based Shuffle 2. NIO-based data transfer 0.9.1 1.2.0 1.3.1 1.4.0~2.0.0+ 1.6.0~2.2.0+ Version Number Network Based Computing Laboratory Spark+AI Summit 2018 12

13. Design Overview of Spark with RDMA Apache Spark Benchmarks/Applications/Libraries/Frameworks •  Design Features Spark Core –  RDMA based shuffle plugin Shuffle Manager (Sort, Hash, Tungsten-Sort) –  SEDA-based architecture Block Transfer Service (Netty, NIO, RDMA-Plugin) –  Dynamic connection Netty NIO RDMA Netty NIO management and sharing RDMA Server Server Server Client Client Client –  Non-blocking data transfer Java Socket Interface Java Native Interface (JNI) –  Off-JVM-heap buffer management Native RDMA-based Comm. Engine –  InfiniBand/RoCE support RDMA Capable Networks 1/10/40/100 GigE, IPoIB Network (IB, iWARP, RoCE ..) •  Enables high performance RDMA communication, while supporting traditional socket interface •  JNI Layer bridges Scala based Spark with communication library written in native code X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014 X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData ‘16, Dec. 2016. Network Based Computing Laboratory Spark+AI Summit 2018 13

14. RDMA for Apache Spark Distribution •  High-Performance Design of Spark over RDMA-enabled Interconnects –  High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Spark –  RDMA-based data shuffle and SEDA-based shuffle architecture –  Non-blocking and chunk-based data transfer –  Off-JVM-heap buffer management –  Support for OpenPOWER –  Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB) •  Current release: 0.9.5 –  Based on Apache Spark 2.1.0 –  Tested with •  Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR) •  RoCE support with Mellanox adapters •  Various multi-core platforms (x86, POWER) •  RAM disks, SSDs, and HDD –  http://hibd.cse.ohio-state.edu Network Based Computing Laboratory Spark+AI Summit 2018 14

15. HiBD Packages on SDSC Comet and Chameleon Cloud •  RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and available on SDSC Comet. –  Examples for various modes of usage are available in: •  RDMA for Apache Hadoop 2.x: /share/apps/examples/HADOOP •  RDMA for Apache Spark: /share/apps/examples/SPARK/ –  Please email help@xsede.org (reference Comet as the machine, and SDSC as the site) if you have any further questions about usage and configuration. •  RDMA for Apache Hadoop is also available on Chameleon Cloud as an appliance –  https://www.chameleoncloud.org/appliances/17/ M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE’16, July 2016 Network Based Computing Laboratory Spark+AI Summit 2018 15

16. Performance Evaluation on SDSC Comet – HiBench PageRank 800 450 IPoIB 37% IPoIB 43% 700 400 350 600 RDMA RDMA Time (sec) Time (sec) 500 300 250 400 200 300 150 200 100 100 50 0 0 Huge BigData Gigantic Huge BigData Gigantic Data Size (GB) Data Size (GB) 32 Worker Nodes, 768 cores, PageRank Total Time 64 Worker Nodes, 1536 cores, PageRank Total Time •  InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R) •  RDMA-based design for Spark 1.5.1 •  RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node. –  32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps) –  64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps) Network Based Computing Laboratory Spark+AI Summit 2018 16

17. Outline •  Accelerating Big Data Stacks •  Benchmarking and Characterizing DLoBD Stacks –  CaffeOnSpark, TensorFlowOnSpark, MMLSpark, and BigDL •  Accelerating DLoBD Stacks –  BigDL on RDMA-Spark –  TensorFlow Network Based Computing Laboratory Spark+AI Summit 2018 17

18. Benchmarking and Characterization Methodology •  Choose proper DL workloads, models and •  Define characterization dimensions datasets –  Processor Type –  Varied sizes to cover big and small models. –  Parameter updating approach (i.e., communication) Small and large data sets –  Network Protocol (IPoIB, RDMA) –  Cover different kinds of combinations •  Generate evaluation reports •  Choose representative DLoBD stacks –  Performance (End-to-end training time; time to a certain –  CaffeOnSpark, TensorFlowOnSpark, and BigDL accuracy; epoch execution time) –  Accuracy, Scalability, Resource Utilization –  Running over Spark, Yarn, HDFS –  Breakdown Network Based Computing Laboratory Spark+AI Summit 2018 18

19. Overview of Representative DLoBD Stacks - CaffeOnSpark •  Spark Driver: Job Launching and Job Control •  Spark Executor: For data feeding and task control •  Model Synchronizer: Communicates across nodes with RDMA / TCP, and output model file on HDFS •  Scalable and Communication intensive –  Server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates scalability bottleneck –  Out-of-band communication Network Based Computing Laboratory Spark+AI Summit 2018 19

20. Overview of Representative DLoBD Stacks - TensorFlowOnSpark •  Spark Executors acting as containers used to run TensorFlow code •  Two different modes to ingesting data –  Read data directly from HDFS using built- in TensorFlow modules –  Feeding data from Spark RDDs to Spark executors (TensorFlow core) •  Scalable and Communication intensive –  Parameter Server-based approach –  Embedded inside one Spark executor and talk to other workers over gRPC or gPRC with RDMA –  Out-of-band communication Network Based Computing Laboratory Spark+AI Summit 2018 20

21. Overview of Representative DLoBD Stacks – CNTKOnSpark/ MMLSpark •  Microsoft Cognitive Toolkit (CNTK) and OpenCV into Spark Machine Learning pipelines without data transfer overhead •  Feeding data for CNTK Core (e.g. images or texts) can be directly read from HDFS by Spark Executors •  Scalable and Communication intensive –  Embedded inside one Spark executor and talk to other workers over MPI (RDMA, TCP) –  Out-of-band communication Network Based Computing Laboratory Spark+AI Summit 2018 21

22. Overview of Representative DLoBD Stacks - BigDL •  Users can write deep learning applications as Spark programs •  Users can load pre-trained Caffe or Torch models into Spark programs using BigDL •  Feed data to BigDL core by Spark Executor which can directly load data from HDFS •  High performance –  Support Intel MKL –  Support both Xeon and Xeon Phi (e.g., KNL) •  Scalable and Communication intensive –  Spark block manager as parameter server –  Organically designed and integrated with Spark architecture –  In-band Communication •  RDMA communication can be achieved through our RDMA-Spark package! Network Based Computing Laboratory Spark+AI Summit 2018 22

23. Selected Various Datasets and Models MNIST CIFAR-10 ImageNet Category Digit Classification Object Classification Object Classification Resolution 28 × 28 B&W 32 × 32 Color 256 × 256 Color Classes 10 10 1000 Training Images 60 K 50 K 1.2 M Testing Images 10 K 10 K 100 K Model Layers (Conv. / Full-connected) Dataset Framework LeNet 2 / 2 MNIST CaffeOnSpark, TensorFlowOnSpark SoftMax Regression NA / NA MNIST TensorFlowOnSpark CIFAR-10 Quick 3 / 1 CIFAR-10 CaffeOnSpark, TensorFlowOnSpark, MMLSpark VGG-16 13 / 3 CIFAR-10 BigDL AlexNet 5 / 3 ImageNet CaffeOnSpark GoogLeNet 22 / 0 ImageNet CaffeOnSpark Resnet-50 53/1 Synthetic TensorFlow Network Based Computing Laboratory Spark+AI Summit 2018 23

24. Performance Characterization for CPU-/GPU-based Deep Learning with CaffeOnSpark 91.26% 98.08% 81.31% 63% 80.08% 88.91% 96.78% CIFAR-10 Quick LeNet on MNIST 78.3% 6000 1200 CPU+OpenBlas CPU+OpenBlas 5000 1000 CPU+MKL CPU+MKL End-to-End Time (secs) End-to-End Time (secs) 4000 GPU 800 GPU 41.5% 3000 GPU+cuDNN 600 GPU+cuDNN 63.92% 2000 400 15.9% 1000 200 0 0 1 2 4 8 16 1 2 4 8 16 Number of Nodes (one device/node) Number of Nodes (one device/node) •  DL workloads can benefit from the high performance of the DLoBD stacks. •  Network will become a bottleneck at some point if the sub-optimal IPoIB network protocol is used. •  GPU/GPU+cuDNN can get the best performance. GPU + cuDNN is degraded at a large scale (e.g., 16 nodes). •  For some models, solutions with CPU + MKL may outperform GPU-based solutions. Network Based Computing Laboratory Spark+AI Summit 2018 24

25. Performance Characterization for IPoIB and RDMA with CaffeOnSpark and TensorFlowOnSpark (IB EDR) CaffeOnSpark TensorFlowOnSpark 450 14.2% 450 400 2 GPUs End-to-End Time (secs) 2 4 8 16 400 4.9% End-to-End Time (secs) 350 13.3% 4 GPUs 350 300 300 250 51.2% 8 GPUs 45.6% 250 200 150 200 100 150 33.01% 50 100 0 50 IPoIB RDMA IPoIB RDMA IPoIB RDMA IPoIB RDMA 0 IPoIB RDMA IPoIB RDMA GPU GPU-cuDNN GPU GPU-cuDNN CIFAR10 (2 GPUs/node) MNIST (1 GPU/node) CIFAR-10 Quick LeNet on MNIST •  CaffeOnSpark benefits from the high performance of RDMA compared to IPoIB once communication overhead becomes significant. •  Our experiments show that the default RDMA design in TensorFlowOnSpark is not fully optimized yet. For MNIST tests, RDMA is not showing obvious benefits. Network Based Computing Laboratory Spark+AI Summit 2018 25

26. Performance Characterization with MMLSpark 5000 160 CPU+OpenBlas 2 4 8 16 End-to-End Time (secs) End-to-End Time (secs) 4000 120 CPU+MKL 3000 GPU+cuDNN 80 2000 40 1000 0 0 1 2 4 8 16 IPoIB RDMA CIFAR-10 Number of Nodes (one device/node) •  The solution of GPU + cuDNN performs best, up to 55x faster than CPU + OpenBLAS, and up to 15x than CPU + MKL. •  OpenMPI-based communication over IPoIB and RDMA; Similar performance; The latency and bandwidth of IPoIB in this cluster are sufficient for small models. •  Could not find other benchmarks with bigger models for MMLSpark Network Based Computing Laboratory Spark+AI Summit 2018 26

27. Characterization on Performance and Accuracy 80 70 70% Accuracy 90 Reduce by 15% 100 Reduce by 48% 70% Accuracy 80 70% Accuracy 60 70 Accuracy (%) Accuracy (%) Accuracy (%) 50 60 40 Reduce by 22% 50 40 30 IPoIB 30 20 20 RDMA IPoIB RDMA IPoIB RDMA 10 10 0 160 511 863 1214 1565 2268 2619 2971 3322 3908 4259 4610 4962 1917 286 1001 12939 18673 22974 27273 30138 37301 43030 50326 5297 165 474 711 961 1280 1598 1920 2132 2261 2908 3645 2017 4337 Time (secs) Time (secs) Time (secs) AlexNet on ImageNet with CaffeOnSpark GoogleNet on ImageNet with CaffeOnSpark VGG on CIFAR-10 with BigDL •  Performance Evaluation of CaffeOnSpark (training time to achieve a 70% accuracy) –  RDMA reduces the overall time cost by 22% in training AlexNet on ImageNet –  RDMA reduces the overall time cost by 15% in training GoogleNet on ImageNet •  Performance Evaluation of BigDL (training time to achieve a 70% accuracy) –  RDMA reduces the overall time cost by 48% in training VGG on CIFAR-10 Network Based Computing Laboratory Spark+AI Summit 2018 27

28. Memory and Network Utilization of CaffeOnSpark 30 CPU+OpenBLAS IPoIB Host Memory Usage Network Throughput CPU+MKL 20 20 GPU RDMA GPU+cuDNN (MB/s) (GB) 10 10 0 0 0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 0 1 2 3 4 5 6 7 Time (min) Time (min) •  CIFAR-10 Quick Model and CIFAR-10 Dataset •  GPU-based solutions use less memory than CPU-based ones as they mostly use GPU memory. •  CPU + MKL solution uses host memory more efficiently and has better performance than CPU + OpenBLAS. •  RDMA utilizes the network resources more efficiently than the IPoIB in CaffeOnSpark. •  CaffeOnSpark still does not fully utilize the high throughput characteristic of RDMA and memory resource. Network Based Computing Laboratory Spark+AI Summit 2018 28

29. Performance Overhead across Layers in DLoBD Stacks 70 •  SoftMax Regression model, over MNIST Other dataset 60 Spark-Overhead YARN-Overhead •  Up to 15.5% time in Apache Hadoop 50 Actual Training YARN scheduler layer Time (second) 40 •  Up to 18.1% execution time in Spark job 30 execution layer 20 •  Data size is small, so we do not count the time spent on accessing HDFS layer. 10 •  Need more effort to reduce the 0 overhead across different layers of IPoIB RDMA IPoIB RDMA DLoBD stacks TensorFlowOnSpark Native TensorFlow •  Maybe amortized in long-running deep learning jobs Network Based Computing Laboratory Spark+AI Summit 2018 29