Introduction to BigDL, a Distributed Deep Learning Framework on Apache Spark

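Intel has open-sourced BigDL, a distributed deep learning framework that runs natively on Apache Spark. BigDL offers users two main benefits. First, it integrates seamlessly with Spark and can be embedded easily in an existing Spark program. Second, it provides high-performance distributed training and inference of deep learning models on Intel Xeon server clusters, which greatly reduces the time deep learning jobs take on Spark. BigDL is also compatible with common Caffe/TensorFlow models, which makes it easier for users to migrate models.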

1. BigDL: A scalable & easy deep learning solution on Apache Spark
Yiheng Wang (yiheng.wang@intel.com), Big Data Technology, Software and Service Group, Intel

2. Build an End-2-End Solution https://github.com/intel-analytics/BigDL

3. Build an End-2-End Solution
Practical challenges:
• Compatibility with different data sources
• Performance and scalability
• Stability and fault tolerance
• Data management / pre-processing
• Resource sharing
• Programming tools / languages
• …

4. Build an End-2-End Solution on Hadoop/Spark
[Diagram: SQL, Streaming, GraphX, MLlib, and BigDL on top of Apache Spark]

5. An example of end-2-end large-scale machine learning
• Historical data is stored on Hive
• Data preprocessing with SparkSQL
• Spark ML pipeline for complex feature engineering
• Use multiple BigDL NN models
• Use Sample+Bagging to solve the class imbalance problem
• Grid search for hyperparameter tuning
Powered by BigDL
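
The slide names Sample+Bagging without detail. One common reading, sketched below under that assumption, is to build several balanced training sets by keeping every minority-class row and drawing an equally sized random subset of the majority class, train one model per set, and average the predictions. The function names, the toy data, and the threshold "model" (a stand-in for a BigDL neural network) are all hypothetical and are not taken from the deck.

```python
import random
from statistics import mean

def balanced_bags(samples, n_bags, seed=0):
    """Yield n_bags training sets: all minority rows plus an equally sized
    random draw (without replacement) from the majority rows."""
    rng = random.Random(seed)
    pos = [s for s in samples if s[1] == 1]
    neg = [s for s in samples if s[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    for _ in range(n_bags):
        yield minority + rng.sample(majority, len(minority))

def train_threshold_model(bag):
    """Hypothetical stand-in for a neural network: a threshold halfway
    between the class means (assumes positives have larger feature values)."""
    pos_mean = mean(x for x, y in bag if y == 1)
    neg_mean = mean(x for x, y in bag if y == 0)
    threshold = (pos_mean + neg_mean) / 2
    return lambda x: 1.0 if x > threshold else 0.0

def bagged_predict(models, x):
    """Average the votes of all models; the result is a score in [0, 1]."""
    return mean(m(x) for m in models)

# Toy data: 5 positive rows (feature near 10) against 95 negative rows (below 1).
data = [(10.0 + i * 0.1, 1) for i in range(5)] + [(i * 0.01, 0) for i in range(95)]
bags = list(balanced_bags(data, n_bags=4))
models = [train_threshold_model(b) for b in bags]
```

Each bag here holds 10 rows, 5 per class, so no single model is swamped by the majority class, and averaging over bags recovers some of the information discarded by undersampling.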

6. Build on Apache Spark
[Diagram: MLlib, GraphX, BigDL, SQL, Streaming, and ML Pipelines; RDD / DataFrame; Spark Core]

7. BigDL is easy to use
• A friendly API compatible with Torch and Keras
• Provides Scala and Python programming APIs
• Supports Apache Spark SQL / Streaming / ML Pipeline

8. High performance from your server
• Powered by Intel Math Kernel Library
• Extremely high performance on Xeon CPUs
  – An order of magnitude faster than out-of-the-box Caffe / Torch / TensorFlow
• Good scalability
  – Hundreds of nodes

9. Competitive End-to-end Performance and Scalability Compares the throughput of K40 and Intel® Xeon® processors in the image feature extraction pipeline. You can find more info at: https://software.intel.com/en-us/articles/building-large-scale-image-feature-extraction-with-bigdl-at-jdcom

10. Development and Deployment Are Really Easy with BigDL
[Diagram: Yarn / Mesos hosting several Spark instances, each with BigDL]

11. BigDL programs are normal Spark applications
BigDL library files are submitted with the Spark job. You don't need to install extra files on your cluster.
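
To make "submitted with the Spark job" concrete, here is a minimal sketch that assembles a `spark-submit` command line using the standard `--jars` and `--py-files` options, which ship extra files alongside the application. The helper name, the driver script, and both library paths are hypothetical placeholders; the deck does not give the actual file names.

```python
import shlex

def build_spark_submit(app, master, jars=(), py_files=(), app_args=()):
    """Assemble a spark-submit command that ships library files with the job."""
    cmd = ["spark-submit", "--master", master]
    if jars:
        # --jars takes a comma-separated list of jars to distribute to the cluster
        cmd += ["--jars", ",".join(jars)]
    if py_files:
        # --py-files takes a comma-separated list of .zip/.egg/.py files
        cmd += ["--py-files", ",".join(py_files)]
    cmd.append(app)
    cmd += list(app_args)
    return cmd

cmd = build_spark_submit(
    app="train.py",                          # hypothetical driver script
    master="yarn",
    jars=["/path/to/bigdl.jar"],             # hypothetical path to the BigDL jar
    py_files=["/path/to/bigdl-python.zip"],  # hypothetical path to the Python API archive
)
print(shlex.join(cmd))
```

Because the libraries travel with the job, nothing has to be preinstalled on the worker nodes, which is the point the slide makes.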

12. BigDL Feature Overview
• Training, evaluation and prediction
• Fine-tune / Streaming / Batch / Java Web application
• More than 200 layers
• Linear, Conv2D, Conv3D, Embedding, Recurrent…
• Dozens of loss functions and optimization algorithms
• CrossEntropy, CTC, Adam, SGD…
• Supports loading model files from other frameworks
• Torch / Caffe / TensorFlow / Keras

13. Algorithms
• Auto-encoders, VAE
• CNN models (AlexNet, Inception, VGG, ResNet, MobileNet, DenseNet, SqueezeNet)
• RNN / LSTM / GRU / Seq2Seq
• Neural Recommendation
• Wide-and-deep
• Deep Speech
• Chatbot
• Reinforcement Learning
• SSD / Faster-RCNN
• Fraud Detection

14. Visualize Training Process

15. Use Cases https://software.intel.com/bigdl

16. Public Cloud
• Running BigDL, Deep Learning for Apache Spark, on AWS* (Amazon* Web Service): https://aws.amazon.com/blogs/ai/running-bigdl-deep-learning-for-apache-spark-on-aws/
• Use BigDL on Microsoft* Azure* HDInsight*: https://azure.microsoft.com/en-us/blog/use-bigdl-on-hdinsight-spark-for-distributed-deep-learning/
• BigDL on Alibaba* Cloud E-MapReduce*: https://yq.aliyun.com/articles/73347
• Using Apache Spark with BigDL on CDH* and Cloudera* Data Science Workbench*: http://blog.cloudera.com/blog/2017/04/bigdl-on-cdh-and-cloudera-data-science-workbench/
• Intel's BigDL on Databricks*: https://databricks.com/blog/2017/02/09/intels-bigdl-databricks.html
• Intel BigDL on Mesosphere* DC/OS* (by Lightbend*): http://developer.lightbend.com/blog/2017-06-22-bigdl-on-mesos/

17. Image Feature Extraction in JD.com http://mp.weixin.qq.com/s/xUCkzbHK4K06-v5qUsaNQQ https://software.intel.com/en-us/articles/building-large-scale-image-feature-extraction-with-bigdl-at-jdcom

18. Image Similarity Search for MLS Listing https://homes-prod-homes-poc.azurewebsites.net/Property/ml81678150/5738-san-lorenzo-dr-san-jose-ca-95123

19. Neural Recommendation Engine in China Life https://strata.oreilly.com.cn/strata-cn/public/schedule/detail/59722?locale=en

20. User-Merchant Propensity Modeling in MasterCard https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63897

21. Fraud detection in Union-Pay
• Historical data is stored on Hive
• Data preprocessing with SparkSQL
• Spark ML pipeline for complex feature engineering
• Use multiple BigDL NN models
• Use Sample+Bagging to solve the class imbalance problem
• Grid search for hyperparameter tuning
Powered by BigDL
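
The grid search step in the list above can be sketched as an exhaustive loop over every combination of candidate values. This is a generic illustration, not the pipeline's actual code: the parameter names, the candidate values, and `toy_score` (a stand-in for "train a model and measure validation quality") are assumptions.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return (best_params, best_score).
    evaluate(params) must return a score where higher is better."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Hypothetical scoring function that peaks at learning_rate=0.01, batch_size=64."""
    return -(params["learning_rate"] - 0.01) ** 2 - (params["batch_size"] - 64) ** 2 / 1e6

grid = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
best, score = grid_search(grid, toy_score)
```

The cost grows with the product of the list lengths (9 evaluations here), and the evaluations are independent of one another, so they can in principle be run in parallel.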

22. Cray Urika-XC provides BigDL

23. Medical Image Analysis https://www.ucsf.edu/news/2017/01/405536/ucsf-intel-join-forces-develop-deep-learning-analytics-health-care

24. Deep Speech 2 on BigDL
[Diagram: conv → biRNN 1 → biRNN 2 → ... → biRNN k → affine → softmax → CTC. 9 layers biRNN: >50 million parameters]
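
The last stage of the pipeline is CTC. At inference time, one simple way to turn per-frame softmax outputs into text is best-path (greedy) decoding: pick the most likely label in each frame, merge consecutive repeats, then drop the blank label. The sketch below illustrates that general technique only; it is not BigDL's implementation, and the alphabet and frame probabilities are made up.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path CTC decoding: argmax per frame, merge consecutive repeats,
    then remove blanks."""
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded = []
    previous = None
    for label in best_path:
        # Emit a symbol only when the label changes and is not the blank.
        if label != previous and label != blank:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)

alphabet = ["_", "h", "i"]  # index 0 is the blank label
frames = [
    [0.1, 0.8, 0.1],    # h
    [0.1, 0.7, 0.2],    # h again, merged with the previous frame
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # i
    [0.2, 0.1, 0.7],    # i again, merged
]
text = ctc_greedy_decode(frames, alphabet)
```

A blank between two identical labels keeps them separate, which is how CTC represents doubled letters.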

25. Language Model - RNN Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

26. Language Model – Generate Shakespeare Poems Output of RNN: Long live the King . The King and Queen , and the Strange of the Veils of the rhapsodic . and grapple, and the entreatments of the pressure . Upon her head , and in the world ? `` Oh, the gods ! O Jove ! To whom the king : `` O friends ! Her hair, nor loose ! If , my lord , and the groundlings of the skies . jocund and Tasso in the Staggering of the Mankind . and

27. Transfer Learning
[Diagram: a Caffe model, BigDL model, or Torch model is loaded as a BigDL model and fine-tuned; example image style labels: Melancholy, Macro, Sunny. Image source: https://www.flickr.com/photos/]
• Train on a different dataset based on a pre-trained model
• Predict image style instead of type
• Save training time and improve accuracy

28. Integrate with Spark Streaming
Integrations with Spark Streaming for runtime training and prediction
[Diagram: Kafka, Flume, HDFS/S3, Kinesis, Twitter → Spark Streaming → RDDs → BigDL (Train, Model, Predict) → Evaluator, StreamWriter]

29. Intelligent Query
Tight integrations with Spark SQL, DataFrames and Structured Streaming

    df.select($"image")
      .withColumn("image_type", ImgClassifier("image"))
      .filter($"image_type" === "dog")
      .show()

Image classification on ImageNet (http://www.image-net.org)