Public Introduction to Kubeflow

Kubeflow (Machine Learning Toolkit for Kubernetes) is an open-source project promoted by Google DC Ops: Kubernetes for ML.
1.Introduction to Kubeflow aronchick@

2.Machine Learning is a way of solving problems without explicitly knowing how to create the solution

3.Google DC Ops

4.PUE == Power Usage Effectiveness
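
PUE is conventionally defined as total facility energy divided by the energy delivered to IT equipment, so a value of 1.0 would mean no overhead at all. A minimal sketch of that ratio follows; the function name and the sample figures are hypothetical and are not taken from the slides.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical figures, for illustration only (not data from the talk).
example = pue(total_facility_kwh=1200.0, it_equipment_kwh=1000.0)
print(example)
```

Lower is better: the closer the ratio is to 1.0, the less energy goes to cooling, power distribution, and other non-IT overhead.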

8.But...

9.Most Folks → LOTS OF PAIN → Magical AI Goodness

10.Why the Gap?

11.Composability Portability Scalability

12.Composability
● Data Ingestion
● Data Analysis
● Data Transformation
● Data Validation
● Data Splitting
● Trainer
● Building a Model
● Model Validation
● Training At Scale
● Roll-out
● Serving
● Monitoring
● Logging
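
One way to read composability is that each stage is a small, swappable step with a shared interface. The sketch below illustrates that idea only; the stage functions, the `run_pipeline` helper, and the sample data are hypothetical and are not part of Kubeflow.

```python
from typing import Any, Callable, Dict, List, Optional

# A stage takes the pipeline state and returns an updated copy of it.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


def ingest(state: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for data ingestion: load a tiny in-memory dataset.
    return {**state, "rows": [2.0, 4.0, 6.0, 8.0]}


def transform(state: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for data transformation: scale values by the maximum.
    peak = max(state["rows"])
    return {**state, "rows": [r / peak for r in state["rows"]]}


def split(state: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for data splitting: first three rows train, the rest test.
    rows = state["rows"]
    return {**state, "train": rows[:3], "test": rows[3:]}


def run_pipeline(stages: List[Stage],
                 state: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    # Run the stages in order, threading the state through each one.
    current: Dict[str, Any] = dict(state or {})
    for stage in stages:
        current = stage(current)
    return current


result = run_pipeline([ingest, transform, split])
print(result["train"], result["test"])
```

Because every stage has the same signature, a stage can be dropped or replaced (for example `run_pipeline([ingest, split])`) without touching the others.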

13.Portability

14.Each ML Stage is an Independent System (the slide groups the stages below into System 1 through System 6)
● Data Ingestion
● Data Analysis
● Data Transformation
● Data Validation
● Data Splitting
● Trainer
● Building a Model
● Model Validation
● Training At Scale
● Roll-out
● Serving
● Monitoring
● Logging

15.Portability

16.Portability: Laptop
● Stack: Model, UX, Tooling, Framework, Storage, Runtime, Drivers, OS, Accelerator, HW

18.Portability: Laptop, Training Rig
● Each environment has its own stack: Model, UX, Tooling, Framework, Storage, Runtime, Drivers, OS, Accelerator, HW

19.Portability: Laptop, Training Rig, Cloud
● Each environment has its own stack: Model, UX, Tooling, Framework, Storage, Runtime, Drivers, OS, Accelerator, HW

20.Scalability
● Machine specific HW (GPU)
● Limited (or unlimited) compute
● Network & storage constraints
  ○ Rack, Server Locality
  ○ Bandwidth constraints
● Heterogeneous hardware
● HW & SW lifecycle management
● Scale isn’t JUST about adding new machines!
  ○ Intern vs Researcher
  ○ Scale to 1000s of experiments

21.You Know What’s Really Good at Composability, Portability, and Scalability?

22.Containers and Kubernetes

23.Kubernetes for ML (stack diagram, top to bottom)
● Namespace, Quota, Logging, Monitoring, RBAC
● Spark, Jupyter, Airflow, NFS, Cassandra, TensorFlow, TF-Serving, Ceph, MySQL, Caffe, Flask+Scikit
● Kubernetes
● Operating system (Linux, Windows)
● CPU, Memory, SSD, Disk, GPU, FPGA, ASIC, NIC
● GCP, AWS, Azure, On-prem

24.Kubernetes for ML
● Supports accelerators in an extensible manner
  ○ GPUs already in progress
  ○ Support for FPGAs, high perf NICs under discussion
● Existing Controllers simplify devops challenges
  ○ K8S Jobs for Training
  ○ K8S Deployments for Serving
● Handles 1000s of nodes
● Container base images for ML workloads
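
As a rough illustration of "K8S Jobs for Training" with a GPU, the sketch below builds a `batch/v1` Job manifest as a plain Python dict and prints it as JSON, which `kubectl apply -f` also accepts. It assumes a cluster where the NVIDIA device plugin exposes the `nvidia.com/gpu` resource; the `training_job` helper, the job name, and the image name are hypothetical.

```python
import json
from typing import Any, Dict


def training_job(name: str, image: str, gpus: int = 1) -> Dict[str, Any]:
    """Build a minimal Kubernetes Job manifest for a one-off training run."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run
            "template": {
                "spec": {
                    # Jobs require restartPolicy Never or OnFailure.
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,  # hypothetical image name
                            # Assumes the NVIDIA device plugin is installed.
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


manifest = training_job("mnist-train", "example.com/ml/trainer:latest")
print(json.dumps(manifest, indent=2))
```

Serving would follow the same pattern with an `apps/v1` Deployment instead of a Job, since a serving process is expected to stay up rather than run to completion.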

25.But Wait, There’s More!
● Kubernetes native scaling objects
  ○ Autoscaling the cluster based on workload metrics
  ○ Priority eviction for removal of low priority jobs
  ○ Scales to a large number of pods (experiments)
● Passes through cluster specs for specific needs
  ○ Scheduling jobs close to the data they need
  ○ Node labels for Heterogeneous HW (more in the future)
  ○ Manage SW drivers and HW health via addons
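
Node labels and priorities both surface as ordinary fields on a pod spec: `nodeSelector` pins a pod to nodes carrying matching labels, and `priorityClassName` refers to a PriorityClass object that must already exist in the cluster. The sketch below adds both to an existing pod spec dict; the `schedule_on` helper, the `accelerator` label key and value, and the priority class name are hypothetical.

```python
from typing import Any, Dict, Optional


def schedule_on(pod_spec: Dict[str, Any],
                node_labels: Dict[str, str],
                priority_class: Optional[str] = None) -> Dict[str, Any]:
    """Return a copy of pod_spec constrained to nodes with the given labels."""
    spec = dict(pod_spec)  # shallow copy so the caller's dict is not modified
    spec["nodeSelector"] = {**pod_spec.get("nodeSelector", {}), **node_labels}
    if priority_class is not None:
        # Must match the name of a PriorityClass defined in the cluster.
        spec["priorityClassName"] = priority_class
    return spec


base = {"containers": [{"name": "trainer", "image": "example.com/ml/trainer:latest"}]}
# Hypothetical label and priority class, for illustration only.
spec = schedule_on(base, {"accelerator": "example-gpu"}, "low-priority-experiments")
print(spec["nodeSelector"], spec["priorityClassName"])
```

With priority classes in place, the scheduler can preempt low priority experiment pods when higher priority work needs the same nodes.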

26.But...

27.Oh, you want to use ML on K8s? Before that, can you become an expert in:
● Containers
● Packaging
● Kubernetes service endpoints
● Persistent volumes
● Scaling
● Immutable deployments
● GPUs, Drivers & the GPL
● Cloud APIs
● DevOps
● ...

28.Kubeflow

29.Make it Easy for Everyone to Learn, Deploy and Manage Portable, Distributed ML on Kubernetes (Everywhere)