Accelerating Machine Learning on Databricks Runtime

“We all know the unprecedented potential impact of Machine Learning. But how do you take advantage of the myriad data and ML tools now available? How do you streamline processes, speed up discovery, share knowledge, and scale up implementations for real-life scenarios? In this talk, we’ll cover some of the latest innovations brought into the Databricks Unified Analytics Platform for Machine Learning. In particular, we will show you how to:
– Get started quickly using the Databricks Runtime for Machine Learning, which provides pre-configured Databricks clusters including the most popular ML frameworks and libraries, Conda support, performance optimizations, and more.
– Get started with the most popular Deep Learning frameworks within a few minutes and go deep with state-of-the-art DL model diagnostics tools.
– Scale up Deep Learning training workloads with ease, from a single machine to large clusters for the most demanding applications, using the new HorovodRunner.
– Expose all of these ML frameworks to large and distributed data using Databricks Runtime for Machine Learning.”

2. Accelerating Machine Learning on Databricks Runtime
Hossein Falaki & Yifan Cao, Databricks Inc.
#UnifiedAnalytics #SparkAISummit

3. Outline
• Databricks Runtime for ML
• Use Case Examples
• Under the Hood
• Demo
• What is Next

4. Broad Adoption of ML
Disruptive innovations are affecting most enterprises on the planet:
• Healthcare and Genomics
• Fraud Prevention
• Digital Personalization
• Internet of Things
• and many more customers in different industries and segments

5. Hidden Tech Debt in ML Systems
“Hidden Technical Debt in Machine Learning Systems,” Google, NIPS 2015
[Figure: ML Code shown as a small box surrounded by Configuration, Data Collection, Feature Extraction, Data Verification, Machine Resource Management, Analysis Tools, Process Management Tools, Serving Infrastructure, and Monitoring]
Only a small fraction of a real-world ML system is composed of the ML code, as shown by the small green box in the middle. The required surrounding infrastructure is vast and complex.

7. ML Runtime: Job To Be Done
• As an ML practitioner
1. I want to quickly start with my ML project
  • Today I have to spend many hours setting up environments
2. I want a single runtime for all steps of my work
  • I don’t want to move data and code around

8. ML Project Stages
Stages: Prepare, Build, Productionize
Diagram labels: Quality, Data, Models
Databricks Runtime for ML

9. What is Databricks Runtime for ML?
• A ready-to-use environment for machine learning and data science
• Built on top of and updated with every Databricks Runtime release
• APIs for distributed deep learning on Spark (HorovodRunner)
• Performance improvements for popular distributed algorithms in Spark (GraphFrames, logistic regression, and tree classifiers)
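As a rough sketch of how the HorovodRunner API mentioned above is typically called from a notebook: the `train` function, its `learning_rate` argument, and `run_distributed` are hypothetical placeholders, and the code assumes the `sparkdl` and `horovod` packages that ship with the ML Runtime. Imports are deferred into the functions so the definitions can be read outside a cluster; actually launching training needs a Databricks ML Runtime cluster.

```python
def train(learning_rate=0.1):
    """Hypothetical training function; HorovodRunner invokes it in each process."""
    # Imported lazily: assumed to be installed on ML Runtime workers.
    import horovod.tensorflow.keras as hvd

    hvd.init()  # set up communication between the Horovod processes
    # Common Horovod convention: scale the learning rate by the worker count.
    scaled_lr = learning_rate * hvd.size()
    # ... build the model, wrap the optimizer with hvd.DistributedOptimizer,
    # and call model.fit() here ...
    return scaled_lr


def run_distributed(num_processes=2):
    """Launch `train` on the cluster through HorovodRunner."""
    from sparkdl import HorovodRunner

    hr = HorovodRunner(np=num_processes)
    # Keyword arguments are forwarded to the training function.
    hr.run(train, learning_rate=0.1)
```

Keeping the training logic in one plain function means the same code can be tried on a single machine first and then handed to HorovodRunner unchanged.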

10. What is Databricks Runtime for ML?
The ML environment is set up on all cluster nodes with a single click.

11. 1. Prepare Data
• Easily access, explore, and visualize data in collaborative notebooks
• Prepare data sets at scale with:
  o Scala / Python / R / SQL
  o Optimized Apache Spark
  o Structured Streaming
  o Delta Lake
  o Persisted data meta store
• Quickly automate notebooks with jobs

12. 2. Build Models
Support for popular open source ML frameworks
• TensorFlow and TensorBoard
• PyTorch
• Keras
• Horovod for distributed DL
• XGBoost
• GraphFrames
• Popular single-node tools in Python and R

13. 3. Productionize ML Models
Model Deployment
• MLflow API for inference on third-party services such as Docker containers, AzureML on Azure, and SageMaker on AWS
• Databricks Runtime for ML includes MLeap for model serialization.

14. Use Case Examples

15. Vision
Challenge
• 325,000 listed hotels, massive volume of image files
• Apply ML to improve match between traveler and hotels with personalized viewing experience
Solution
• Leverage Databricks to train DL models on 100% of image data
• Increase processing power by 20X and enable real-time scoring
Result
• Significantly improved customer engagement and conversions by improving personalization models
• Customer Case Study:

16. NLP
Challenge
• >100 million gamers every month
• 2% of all games infected by serious toxicity
Solution
• Leveraged Databricks to apply NLP & ML to proactively identify abusive language
• Scaled training on much larger dataset and hyperparameter tuning
Result
• Riot Games increased customer satisfaction, retention, and lifetime value by detecting abusive language in real time
• Customer Case Study:

17. IoT
Challenge
• Offer insights into what consumers buy and watch
• Scale from single-machine data science to large datasets to improve product offerings
Solution
• Leveraged Databricks to ensure collaboration across teams
• Reduced annual cost by 40% and improved model performance by one third
Result
• Nielsen improved competitive offering by applying ML to batch & live stream data from IoT devices
• Customer Case Study:

18. Under the Hood

19. High-level Engineering Goals
• Reproducible environments
  – Package & dependency management
• Testability
  – Testing & QA infrastructure and process
• Cross-compatibility
  – Careful configuration of all packages to be compatible
• Performance optimization
  – High-performance I/O

20. Package Management
• Package management
• Environment management
  – Python 2.x & Python 3.x environments
• Environment is selected during cluster setup
• Latest stable versions from Anaconda distribution

21. Python Environments
• ML Runtime vs. Databricks Runtime
  – Upgraded packages
  – Conda vs. pip
  – Additional ML packages
• MKL for CPU acceleration
• CUDA & cuDNN for GPU acceleration
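One way to see which environment a cluster actually ended up with is to ask the interpreter directly. The sketch below uses only the standard library; `report_environment` is a hypothetical helper, and the package names passed to it are illustrative guesses rather than an official ML Runtime manifest.

```python
import importlib.util
import sys


def report_environment(packages):
    """Return the Python version and, per package, whether it can be imported."""
    available = {
        name: importlib.util.find_spec(name) is not None for name in packages
    }
    return {"python": "%d.%d" % sys.version_info[:2], "packages": available}


# Illustrative package names only; `json` is included as a module that always exists.
report = report_environment(["tensorflow", "torch", "xgboost", "json"])
print("Python", report["python"])
for name, found in sorted(report["packages"].items()):
    print(f"{name}: {'available' if found else 'missing'}")
```

Running the same check on a standard runtime and on the ML runtime is a quick way to compare the two without importing (and initializing) the heavy frameworks.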

22. Dependency Management
• Bazel for build system
• Audit files for change detection
  – Python: Conda
  – JAR: Maven
  – R: MRAN
  – Native: Ubuntu APT and Docker
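The slide does not describe what the audit files look like, so the following is only a sketch of the general idea of change detection: compare a committed list of pinned versions against a freshly resolved one and report the differences. The `name==version` format, the helper names, and the package pins are all assumptions made up for illustration.

```python
def parse_audit(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def diff_audit(old, new):
    """Return packages added, removed, or changed between two audit dicts."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Made-up pins, for illustration only.
committed = parse_audit("numpy==1.15.4\npandas==0.24.1\n")
rebuilt = parse_audit("numpy==1.16.2\npandas==0.24.1\nscipy==1.2.1\n")
print(diff_audit(committed, rebuilt))
```

A build could fail whenever the diff is non-empty, which forces every dependency change to show up as a reviewed change to the audit file.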

23. Docker Containers
• We internally use Docker to build Databricks Runtime images
  – Full control over content
  – Reproducible and automated
• Runtime for ML is a layer on top of DBR
  – MLR benefits from all existing DBR tests and QA
  – MLR gets every hotfix and patch that goes into DBR

24. Extensive Integration Testing
• Extensive tests for top-tier packages
• Each commit runs unit and integration tests
• Nightly tests on master and released branches
• All CPU and GPU instances on Azure & AWS
• Integration Tests:
  – Launch a Docker container and run code
  – Launch a cluster and execute notebooks
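The "launch an environment and run code" style of test can be approximated locally by running a snippet in a fresh interpreter and checking its exit code. This is a simplified stand-in, not the Databricks test harness: `run_snippet` is a hypothetical helper, it starts a subprocess rather than a Docker container, and `json` stands in for the top-tier packages a real suite would exercise.

```python
import subprocess
import sys


def run_snippet(code, timeout=60):
    """Run `code` in a fresh Python interpreter; return (exit_code, stdout)."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout.strip()


# A real suite would import and exercise each supported package here.
exit_code, output = run_snippet("import json; print(json.dumps({'ok': True}))")
print(exit_code, output)
```

Running each check in its own process keeps one package's import side effects from masking problems in another.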

25. High Performance FUSE
• Why Filesystem in Userspace (FUSE)?
• We use high-throughput FUSE clients for ML/DL
  – Azure Storage FUSE on Azure
  – Goofys on AWS
• The mount points are pre-configured on ML Runtime at dbfs:/ml
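The practical benefit of a FUSE mount is that ML libraries can read and write cloud storage with ordinary local file calls. The sketch below assumes the ML mount is visible on each node as the local path /dbfs/ml; the helper names and directory layout are hypothetical, and the code falls back to a temporary directory when that path does not exist so it can also run off-cluster.

```python
import os
import tempfile

# Assumption: the pre-configured ML mount appears locally at this path.
FUSE_ROOT = "/dbfs/ml"


def checkpoint_dir(run_name):
    """Return a writable checkpoint directory, preferring the FUSE mount."""
    root = FUSE_ROOT if os.path.isdir(FUSE_ROOT) else tempfile.gettempdir()
    path = os.path.join(root, "checkpoints", run_name)
    os.makedirs(path, exist_ok=True)
    return path


def save_checkpoint(run_name, step, payload):
    """Write bytes with plain file I/O; no storage-specific client is needed."""
    path = os.path.join(checkpoint_dir(run_name), f"step-{step}.bin")
    with open(path, "wb") as f:
        f.write(payload)
    return path


saved = save_checkpoint("demo-run", 1, b"weights")
with open(saved, "rb") as f:
    print(f.read())
```

Because the path behaves like a local directory, the same code works for frameworks that only know how to write checkpoints to the local filesystem.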

26. Demo

27. What is Next?

28. GA of Runtime for ML
• Release history:
  – 4.1 Beta: June 2018
  – …
  – 5.3 GA: April 2019
  – 5.4: May 2019
  – 6.0: Second Half 2019

29. Roadmap for Environment
• DBR with Conda (Beta)
  – Enable customizable environment
  – Databricks Runtime & Databricks Runtime for ML will continue to be supported
• 6.0
  – Unify all into single Runtime
  – Considering removing Python 2.x