1.Introduction to Kubeflow aronchick@
2.Machine Learning is a way of solving problems without explicitly knowing how to create the solution
3.Google DC Ops
4.PUE == Power Usage Effectiveness
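PUE is conventionally defined as total facility energy divided by the energy delivered to IT equipment, so a value of 1.0 would mean no overhead at all. A minimal sketch of that ratio follows. The function name and the sample figures are illustrative assumptions, not numbers from the talk.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical figures: the whole facility draws 1,200 kWh,
# of which 1,000 kWh reaches the IT equipment.
example = pue(1200.0, 1000.0)
print(f"PUE = {example:.2f}")
```

The closer the result is to 1.0, the smaller the share of energy spent on cooling, power distribution, and other overhead.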
9. Most Folks → LOTS OF PAIN → Magical AI Goodness
10.Why the Gap?
11.Composability Portability Scalability
12.Composability: Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Trainer, Building a Model, Model Validation, Training At Scale, Roll-out, Serving, Monitoring, Logging
14.Each ML Stage is an Independent System (Systems 1 to 6): Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Trainer, Building a Model, Model Validation, Training At Scale, Roll-out, Serving, Monitoring, Logging
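One way to picture composability is to treat each stage as an independent unit behind a shared interface, so any stage can be swapped without touching the others. The sketch below is a toy illustration under that assumption. The stage functions, the `Stage` type, and `run_pipeline` are hypothetical names, and the "model" is just a mean.

```python
from typing import Any, Callable, Dict, List, Optional

# Assumed interface: a stage takes the pipeline state and returns an updated copy.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


def ingest(state: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical stand-in for data ingestion.
    return {**state, "rows": [1.0, 2.0, 3.0, 4.0]}


def split(state: Dict[str, Any]) -> Dict[str, Any]:
    # Keep the first three quarters for training, the rest for testing.
    rows = state["rows"]
    cut = len(rows) * 3 // 4
    return {**state, "train": rows[:cut], "test": rows[cut:]}


def train(state: Dict[str, Any]) -> Dict[str, Any]:
    # Toy "model": the mean of the training rows.
    data = state["train"]
    return {**state, "model": sum(data) / len(data)}


def run_pipeline(stages: List[Stage],
                 state: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    # Run the stages in order, threading the state through each one.
    current = dict(state or {})
    for stage in stages:
        current = stage(current)
    return current


result = run_pipeline([ingest, split, train])
print(result["model"])
```

Replacing `train` with a different function that honors the same interface leaves `ingest` and `split` unchanged, which is the property the slide is pointing at.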
16. Portability. Laptop stack: Model, UX, Tooling, Framework, Storage, Runtime, Drivers, OS, Accelerator, HW
18. Portability. Laptop and Training Rig, each with its own stack: Model, UX, Tooling, Framework, Storage, Runtime, Drivers, OS, Accelerator, HW
19. Portability. Laptop, Training Rig, and Cloud, each with its own stack: Model, UX, Tooling, Framework, Storage, Runtime, Drivers, OS, Accelerator, HW
20.Scalability
● Machine-specific HW (GPU)
● Limited (or unlimited) compute
● Network & storage constraints
  ○ Rack, Server Locality
  ○ Bandwidth constraints
● Heterogeneous hardware
● HW & SW lifecycle management
● Scale isn’t JUST about adding new machines!
  ○ Intern vs Researcher
  ○ Scale to 1000s of experiments
21.You Know What’s Really Good at Composability, Portability, and Scalability?
22.Containers and Kubernetes
23.Kubernetes for ML (stack, top to bottom)
● Workloads and cluster services: Namespace, Spark, Jupyter, Airflow, Quota, Logging, NFS, Cassandra, Tensorflow, TF-Serving, Monitoring, RBAC, Ceph, MySQL, Caffe, Flask+Scikit
● Kubernetes
● Operating system (Linux, Windows)
● CPU, Memory, SSD, Disk, GPU, FPGA, ASIC, NIC
● GCP, AWS, Azure, On-prem
24.Kubernetes for ML
● Supports accelerators in an extensible manner
  ○ GPUs already in progress
  ○ Support for FPGAs, high perf NICs under discussion
● Existing Controllers simplify devops challenges
  ○ K8S Jobs for Training
  ○ K8S Deployments for Serving
● Handles 1000s of nodes
● Container base images for ML workloads
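The "Jobs for Training, Deployments for Serving" split can be sketched as two manifests built as plain Python dictionaries. This is an illustration under stated assumptions, not the configuration used in the talk: the helper names, image names, and port are hypothetical, and the `nvidia.com/gpu` resource name assumes a cluster where the NVIDIA device plugin is installed.

```python
import json


def training_job(name: str, image: str, gpus: int = 1) -> dict:
    """Build a batch/v1 Job manifest for a run-to-completion training task."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # retry a failed training pod at most twice
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        # Assumes the NVIDIA device plugin exposes this resource.
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }


def serving_deployment(name: str, image: str,
                       replicas: int = 2, port: int = 8500) -> dict:
    """Build an apps/v1 Deployment manifest for a long-running model server."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "server",
                        "image": image,
                        "ports": [{"containerPort": port}],
                    }]
                },
            },
        },
    }


# Hypothetical image names, for illustration only.
job = training_job("mnist-train", "example.com/ml/trainer:latest")
deployment = serving_deployment("mnist-serve", "example.com/ml/server:latest")
print(json.dumps(job, indent=2))
print(json.dumps(deployment, indent=2))
```

Because `kubectl apply -f` accepts JSON as well as YAML, the printed output could be saved to a file and applied directly once the placeholder images are replaced with real ones.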
25.But Wait, There’s More!
● Kubernetes native scaling objects
  ○ Autoscaling cluster based on workload metrics
  ○ Priority eviction for removal of low priority jobs
  ○ Scales to a large number of pods (experiments)
● Passes through cluster specs for specific needs
  ○ Scheduling jobs where the data they need lives
  ○ Node labels for heterogeneous HW (more in the future)
  ○ Manage SW drivers and HW health via addons
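Node labels and priority-based eviction both surface as fields on the pod spec: `nodeSelector` restricts a pod to nodes carrying matching labels, and `priorityClassName` ties it to a PriorityClass. The sketch below shows one way to set them. The helper name, the `accelerator` label key and value, and the `low-priority-experiments` class are hypothetical and would have to exist in the target cluster.

```python
import copy
from typing import Dict, Optional


def target_hardware(pod_spec: dict,
                    node_labels: Dict[str, str],
                    priority_class: Optional[str] = None) -> dict:
    """Return a copy of pod_spec pinned to labeled nodes, with an optional priority."""
    spec = copy.deepcopy(pod_spec)  # leave the caller's spec untouched
    spec.setdefault("nodeSelector", {}).update(node_labels)
    if priority_class is not None:
        spec["priorityClassName"] = priority_class
    return spec


base_spec = {
    "restartPolicy": "Never",
    "containers": [{"name": "trainer", "image": "example.com/ml/trainer:latest"}],
}

# Hypothetical label and PriorityClass names, for illustration only.
gpu_spec = target_hardware(
    base_spec,
    node_labels={"accelerator": "example-gpu"},
    priority_class="low-priority-experiments",
)
print(gpu_spec["nodeSelector"], gpu_spec["priorityClassName"])
```

Keeping the hardware targeting in a separate step means the same base spec can be reused unchanged on clusters with different node labels.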
27.Oh, you want to use ML on K8s? Before that, can you become an expert in:
● Containers
● Packaging
● Kubernetes service endpoints
● Persistent volumes
● Scaling
● Immutable deployments
● GPUs, Drivers & the GPL
● Cloud APIs
● DevOps
● ...
29.Make it Easy for Everyone to Learn, Deploy and Manage Portable, Distributed ML on Kubernetes (Everywhere)