Kubernetes Loves Machine Learning, Even in a Private Cloud - Hui Luo, VMware
1. Kubernetes loves machine learning on-premise - Hui Luo, VMware
2. About me: Software engineer on the VMware cloud native applications team. Active contributor to upstream Kubernetes in areas such as the device plugin. Contributor to the vSphere cloud provider and Cluster API for vSphere. GitHub: @figo
3. Machine learning on the k8s landscape: Kubeflow runs on top of Kubernetes, which manages the underlying CPU, memory, GPU, storage, and network resources.
4. Major aspects of GPU resources: 1) Lifecycle management: setup, update, upgrade, auto-scaling 2) Sharing and isolation 3) Monitoring 4) Heterogeneous GPU types 5) Performance consistency
5. GPU resources in k8s: request a GPU through resource limits, then submit the pod to a GPU-enabled cluster with `kubectl create -f mypod.yml`:

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-gpu-pod
    spec:
      containers:
      - name: image-processor
        image: gcr.io/image-processor:latest
        resources:
          limits:
            nvidia.com/gpu: 1
6. Lifecycle management: the stack, top to bottom: the Kubernetes cluster running a GPU device plugin; VMs running the GPU driver and CRI; the hypervisor; and the hardware hosts with their physical GPUs.
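A GPU device plugin is typically deployed as a DaemonSet so every GPU node advertises its devices to the kubelet over the device plugin socket directory. A minimal sketch, assuming the NVIDIA device plugin; the image name and version tag here are illustrative, not from the slides:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      containers:
      - name: nvidia-device-plugin
        image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0  # illustrative image/tag
        volumeMounts:
        # The plugin registers itself with the kubelet through this directory.
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```

Running it as a DaemonSet is what makes lifecycle management tractable: upgrading the plugin is a rolling update rather than per-node manual work.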
7. Lifecycle management, cont.: DIY solution vs. vendor solution. DIY: use your existing processes and build the automation yourself. Vendor: many off-the-shelf choices exist.
8. Sharing and isolation. Tips: 1) Use namespaces and a GPU quota 2) Use Pod PriorityClass and Pod QoS. Note: unlike CPU, GPU does not support millicores; GPUs are requested in whole units.
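The namespace-plus-quota tip can be sketched as a ResourceQuota capping the extended GPU resource per namespace; the namespace name and the limit value here are illustrative:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-ml          # illustrative namespace
spec:
  hard:
    # Pods in this namespace may request at most 4 GPUs in total.
    requests.nvidia.com/gpu: "4"
```

Combined with PriorityClass, this lets a high-priority training job preempt lower-priority pods while the quota keeps any one team from monopolizing the GPUs.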
9. GPU resource monitoring: the kubelet exposes per-container accelerator stats:

    // AcceleratorStats contains stats of accelerators attached to a container
    type AcceleratorStats struct {
        Make        string `json:"make"`
        Model       string `json:"model"`
        ID          string `json:"id"`
        MemoryTotal uint64 `json:"memoryTotal"`
        MemoryUsed  uint64 `json:"memoryUsed"`
        DutyCycle   uint64 `json:"dutyCycle"`
    }

To make it extendable: [KEP] Compute device assignment
10. Homogeneous to heterogeneous: what if the cluster mixes NVIDIA Tesla K80 and P100 GPUs? Solutions: 1) [KEP] Resource API 2) Use labels, e.g. expose a custom resource name per GPU class:

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-gpu-pod
    spec:
      containers:
      - name: image-processor
        image: gcr.io/image-processor:latest
        resources:
          limits:
            gpu-gold: 1
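The label-based approach can also be sketched with plain node labels and a nodeSelector, without defining a custom resource name; the label key and values here are illustrative:

```yaml
# First label the nodes by GPU type (illustrative command):
#   kubectl label node gpu-node-1 accelerator=nvidia-tesla-p100
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  nodeSelector:
    accelerator: nvidia-tesla-p100   # schedule only onto P100 nodes
  containers:
  - name: image-processor
    image: gcr.io/image-processor:latest
    resources:
      limits:
        nvidia.com/gpu: 1
```

The trade-off: labels steer placement but all GPU types still share one resource name, whereas per-class resource names (gpu-gold) let quota and limits distinguish the types.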
11. Performance consistency. CPU manager and hugepages are already supported. To further address NUMA and device-locality requirements: 1) [KEP] NUMA manager 2) Hypervisor NUMA scheduler 3) Linux AutoNUMA
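With the CPU manager's static policy enabled on the node, a Guaranteed-QoS pod that requests integer CPUs gets exclusive cores, which reduces jitter for GPU-fed training loops. A minimal sketch; the pod name, image, and resource sizes are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-trainer        # illustrative name
spec:
  containers:
  - name: trainer
    image: gcr.io/image-processor:latest   # illustrative image
    resources:
      # Equal integer requests and limits => Guaranteed QoS,
      # so the static CPU manager pins these 4 cores exclusively.
      requests:
        cpu: "4"
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 8Gi
        nvidia.com/gpu: 1
```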
12. Join the discussions at: wg-machine-learning, wg-resource-management, sig-node. Contact me on GitHub: @figo
13. References
1. [KEP] Compute device assignment: https://github.com/kubernetes/community/pull/2454
2. [KEP] Resource API: kubernetes/community/keps/sig-node/00014-resource-api.md
3. [KEP] NUMA manager: kubernetes/community/contributors/design-proposals/node/numa-manager.md
4. CPU manager: https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
5. Hugepages: https://kubernetes.io/docs/tasks/manage-hugepages/scheduling-hugepages/
6. Pod PriorityClass: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/
14. Thank you