Navigating the ML Pipeline Jungle with MLflow

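Pipelines have been a key focus of modern software engineering, with our API / services / containers / DevOps-driven environments, so it may come as a surprise that pipelines are where AI projects tend to fail. But it is precisely because modern software development has focused on decoupling pipelines that we have struggled to cope with the rise of AI. Specifically, companies are able to use AI effectively when they can build end-to-end AI model factories that explicitly account for the coupling between data, models, and code. In this talk, I will walk through what a model factory is and how MLflow's design supports the creation of end-to-end model factories. I will also share what I have observed while helping customers, from startups to Fortune 50 companies, create, productionize, and scale end-to-end ML pipelines, and watching those pipelines produce serious, game-changing business impact.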

1. Navigating the ML Pipeline Jungle with MLflow: Notes from the Field
Thunder Shiviah
#SAISDS11

2. Who am I
● Databricks Solutions Architect focused on machine learning and deep learning
● Previously McKinsey Data Scientist and QuantumBlack Machine Learning Engineer designing and building ML pipelines for Fortune 100 companies
● Developed and deployed models across diverse verticals such as healthcare, telecom, finance, and renewable energy

3.
● Overview of challenges with AI in production
● How we’re solving these challenges
● Demos
● A final word on where AI in production is heading
● Q&A

4. AI is a Game Changing Opportunity
Lots of new data: Customer Data, Click Streams, Sensor data (IoT), Video/Speech, …
Business opportunity: Fraud Detection, Genome Sequencing, Recommendation Engine, Predictive Maintenance, …
Roles: Data Engineer, Data Scientist
Machine Learning Requires Collaborative Experimentation on Big Data

5. Hardest Part of AI isn’t AI, it’s plumbing
“Hidden Technical Debt in Machine Learning Systems,” Google, NIPS 2015
[Figure components: Configuration, Data Collection, Data Verification, Feature Extraction, ML Code, Machine Resource Management, Analysis Tools, Process Management Tools, Serving Infrastructure, Monitoring]
Figure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small green box in the middle. The required surrounding infrastructure is vast and complex.

6. ML Lifecycle is Manual, Inconsistent and Disconnected
Data Prep
● Low-level integrations for data and ML
● Difficult to track data used for a model
Build Model
● Ad hoc approach to track experiments
● Very hard to reproduce experiments
Deploy Model
● Multiple tightly coupled deployment options
● Different monitoring approach for each framework

7. How we’re making AI in production simple

8. Simplifying the AI pipeline
● ML Runtime: pre-configured ML libraries for CPU and GPU
● Pandas vectorized UDFs
● Distributed transfer learning with Deep Learning Pipelines
● MLflow

9. New: Databricks Runtime for ML
Ready-to-use clusters with built-in ML frameworks
GPU support

10. Run your native Python code with PySpark, fast, with vectorized Pandas UDFs
● Use Pandas UDFs to convert existing pandas code into performant Spark UDFs
● Convert PySpark DataFrames to pandas DataFrames quickly
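
The idea on this slide can be sketched as follows. This is a minimal illustration, not code from the talk: the function name `fahrenheit_to_celsius`, the column names, and the local Spark session are all assumptions made for the example. The pandas logic is an ordinary function that can be tested without Spark; wrapping it with `pandas_udf` lets Spark apply it to whole batches of rows rather than one row at a time.

```python
import pandas as pd


def fahrenheit_to_celsius(temps_f: pd.Series) -> pd.Series:
    """Existing pandas logic: operates on a whole Series at once."""
    return (temps_f - 32.0) * 5.0 / 9.0


def run_on_spark() -> pd.DataFrame:
    """Apply the same function as a vectorized UDF.

    Assumes pyspark and pyarrow are installed; imported lazily so the
    pandas function above stays usable without them.
    """
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    # Hypothetical local session and toy data, for illustration only.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    to_celsius = pandas_udf(fahrenheit_to_celsius, returnType="double")

    df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
    result = df.withColumn("temp_c", to_celsius("temp_f"))

    # Second bullet: bring the (small) result back as a pandas DataFrame.
    return result.toPandas()
```

Because the UDF body is plain pandas, it can be unit tested on a `pd.Series` before it ever runs on a cluster; `run_on_spark()` is only needed to exercise the Spark side.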

11. Transfer learning with DL pipelines
● Use pre-trained neural networks to harness the power of neural nets on smaller data
● Model inference using Spark SQL UDFs

12. New: Databricks MLflow standardizes ML Lifecycle
Data Prep (Databricks Delta)
● Feed data to models
● Enrich data in experiments
Build Model (Databricks Runtime for ML, MLflow Project & Tracker)
● Track experiments
● Reproduce experiments
Deploy Model (MLflow Serving)
● Integrate with multiple clouds
● Manage and monitor models

13. MLflow Components
● Tracking: record and query experiments (code, data, config, results)
● Projects: packaging format for reproducible runs on any platform
● Models: general model format that supports diverse deployment tools
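
As a rough sketch of what the Tracking component records, the example below logs one parameter and one metric for a toy model. It is not taken from the talk's demos: the constant-mean "model", the helper names, and the parameter and metric names are hypothetical. It assumes the `mlflow` package is installed and uses `mlflow.start_run`, `mlflow.log_param`, and `mlflow.log_metric`; with no tracking server configured, MLflow writes runs to a local `mlruns` directory.

```python
def fit_constant_model(y_train):
    """Toy 'model': always predict the mean of the training targets."""
    return sum(y_train) / len(y_train)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def tracked_run(y_train, y_test):
    """Train the toy model and record the run with MLflow Tracking."""
    import mlflow  # imported lazily; assumes mlflow is installed

    with mlflow.start_run():
        # Config: which kind of model produced this result.
        mlflow.log_param("model_type", "constant_mean")

        prediction = fit_constant_model(y_train)
        mse = mean_squared_error(y_test, [prediction] * len(y_test))

        # Result: the number to compare across runs.
        mlflow.log_metric("mse", mse)
    return mse
```

Calling `tracked_run([1.0, 2.0, 3.0], [1.0, 3.0])` would create one run whose parameter and metric can then be queried or compared against other runs, which is the "record and query" role described for Tracking.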


15. A word about where AI in production is going

16. Q&A

17. Thank you! #SAISDS11