Kubernetes Loves Machine Learning, Even On-Premises - Hui Luo, VMware
This talk covers three major challenges private clouds face when enabling GPUs on Kubernetes, along with projects that help address them: 1) private clouds usually need to support a broader range of GPU types, in some cases heterogeneous GPUs within a single cluster; 2) complex hardware topologies such as RDMA and NVLINK must be supported; 3) contention for GPU resources is often very high when a limited pool of GPUs is shared by multiple teams.

1. Kubernetes loves machine learning on-premises - Hui Luo, VMware

2. About me: Software engineer on the VMware cloud native applications team. Active contributor to upstream Kubernetes in areas like the device plugin. Contributor to the vSphere cloud provider and Cluster API vSphere. GitHub: @figo

3. Machine learning on the k8s landscape: Kubeflow runs on top of Kubernetes, which manages the underlying CPU, memory, GPU, storage, and network resources.

4. Major aspects of GPU resources:
1. Lifecycle management: setup, update, upgrade, auto-scaling
2. Sharing and isolation
3. Monitoring
4. Heterogeneous GPU types
5. Performance consistency

5. GPU resource in k8s: request a GPU via the nvidia.com/gpu resource limit, then submit the pod to a k8s cluster with GPUs (kubectl create -f mypod.yml):

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-gpu-pod
    spec:
      containers:
      - name: image-processor
        image: gcr.io/image-processor:latest
        resources:
          limits:
            nvidia.com/gpu: 1

6. Lifecycle management: [architecture diagram: a Kubernetes cluster with a GPU device plugin; VMs carrying the GPU driver and CRI run on a hypervisor; the underlying hardware hosts expose the GPUs]

7. Lifecycle management (cont.): DIY solution vs. vendor solution. With DIY you use existing processes and build the automation yourself; on the vendor side, many choices exist.

8. Sharing and isolation. Tips: 1) use namespaces and GPU quotas; 2) use Pod PriorityClass and Pod QoS. Note: unlike CPU, GPU does not support millicore-style fractional requests.
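The two tips above can be sketched as Kubernetes objects. A minimal example, assuming a hypothetical namespace named team-a and NVIDIA's nvidia.com/gpu extended resource (the names and limits are illustrative, not from the talk):

```yaml
# Cap team-a at 4 GPUs; extended resources are quota'd via the
# "requests.<resource-name>" key in a ResourceQuota.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "4"
---
# A PriorityClass so critical training jobs can preempt
# lower-priority workloads when GPU contention is high.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high
value: 1000000
globalDefault: false
description: "High priority for GPU training workloads"
```

A pod then opts in by setting `priorityClassName: gpu-high` in its spec; its QoS class is still derived from its CPU/memory requests and limits as usual.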

9. GPU resource monitoring:

    // AcceleratorStats contains stats of accelerators attached to a container.
    type AcceleratorStats struct {
        Make        string `json:"make"`
        Model       string `json:"model"`
        ID          string `json:"id"`
        MemoryTotal uint64 `json:"memoryTotal"`
        MemoryUsed  uint64 `json:"memoryUsed"`
        DutyCycle   uint64 `json:"dutyCycle"`
    }

To make it extendable: [KEP] Compute device assignment.

10. Homogeneous to heterogeneous: what if a cluster mixes nvidia Tesla K80 and P100? Solutions: 1) [KEP] Resource api; 2) use labels, e.g. request a custom resource name:

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-gpu-pod
    spec:
      containers:
      - name: image-processor
        image: gcr.io/image-processor:latest
        resources:
          limits:
            gpu-gold: 1
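One concrete way to apply the label approach is to label nodes by GPU model and steer pods with a nodeSelector. The label key "accelerator" and the node names here are illustrative assumptions, not from the talk:

```yaml
# First, label the nodes out of band, e.g.:
#   kubectl label node node-1 accelerator=nvidia-tesla-k80
#   kubectl label node node-2 accelerator=nvidia-tesla-p100
# Then pin the pod to P100 nodes while still requesting a GPU:
apiVersion: v1
kind: Pod
metadata:
  name: my-p100-pod
spec:
  nodeSelector:
    accelerator: nvidia-tesla-p100
  containers:
  - name: image-processor
    image: gcr.io/image-processor:latest
    resources:
      limits:
        nvidia.com/gpu: 1
```

This keeps scheduling on heterogeneous clusters predictable without waiting on a new resource API, at the cost of maintaining the node labels yourself.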

11. Performance consistency: the CPU manager and hugepages are already supported. To further address NUMA and device-locality requirements: 1) [KEP] NUMA manager; 2) hypervisor NUMA scheduler; 3) Linux AutoNUMA.

12. Join the discussions at wg-machine-learning, wg-resource-management, and sig-node. Contact me on GitHub: @figo

13. References
1. [KEP] Compute device assignment: https://github.com/kubernetes/community/pull/2454
2. [KEP] Resource api: kubernetes/community/keps/sig-node/00014-resource-api.md
3. [KEP] NUMA manager: kubernetes/community/contributors/design-proposals/node/numa-manager.md
4. CPU manager: https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
5. Hugepages: https://kubernetes.io/docs/tasks/manage-hugepages/scheduling-hugepages/
6. Pod PriorityClass: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/

14. Thank you
