Deeply Customized Kubernetes for Machine Learning at Tencent

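Kubernetes and TensorFlow play important roles in machine learning. However, vanilla Kubernetes does not serve machine learning well: it lacks accurate GPU scheduling policies, GPU topology awareness, resource limits, and more. In this talk, we review recent machine learning developments in the Kubernetes community and outline what was changed in Kubernetes to support machine learning at Tencent, which challenges came up, and how they were addressed.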

1.Deep Customized Kubernetes for Machine Learning in Tencent Shengbo Song, thomassong@tencent.com

2.Agenda
  1. GaiaStack Introduction
  2. Why we need a custom Kubernetes
  3. Highlights of GaiaStack

3.Intro to GaiaStack

4.GaiaStack Overview
[Architecture diagram. Labels: Repository, Authorization, Autoscale, Docker, Cloud LB, Share, CI/CD, P2P Download, Rolling Upgrade, Kubernetes, Third LB, Private, Monitor System, devops, image, app, compose, LB, Cloud Storage, Authorization, Container, Kubernetes, Ceph, Docker, Network, Infrastructure]

5.GaiaStack ecosystem GaiaStack Monitoring/Logging Core ...

6.Why we need a custom Kubernetes

7.Why we need a custom Kubernetes
• The official Kubernetes release is not designed for ML. It lacks support for:
  • GPU topology awareness
  • Host devices for some apps
  • GPU resource management
• Deployment, StatefulSet, Job, and CronJob are not suitable for ML
• ...
• ML apps need custom design and optimization

8.Highlights of GaiaStack

9.Highlights of GaiaStack
• GPU topology-aware scheduler
• GPU resource management
• Tapp - a CRD of Kubernetes
• Galaxy - a powerful CNI plugin

10.GPU Topology scheduler
• NVIDIA GPU topology
Different GPU combinations for the same app lead to large differences in performance. For example, an app running on GPU-0 and GPU-1 has a shorter execution time than one running on GPU-0 and GPU-3.

11.GPU Topology scheduler
• Gaia GPU scheduler
A scheduler that is aware of the GPU topology.
      GPU0  GPU1  GPU2  GPU3
GPU0  X     PIX   SYS   SYS
GPU1  PIX   X     SYS   SYS
GPU2  SYS   SYS   X     SYS
GPU3  SYS   SYS   SYS   X
[Chart: data transfer at 1M, 2M, 4M, and 8M for GPU 0,1 versus GPU 0,3]
The combination chosen by the Gaia scheduler (GPU0,1) has 200% of the data transfer speed of the combination chosen by the default scheduler (GPU0,3).
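The idea of topology-aware GPU selection can be illustrated with a minimal Python sketch. It is not the Gaia scheduler's implementation: the numeric link costs, the function names `group_cost` and `pick_gpus`, and the "sum of pairwise costs" scoring rule are all assumptions made for illustration. Only the link types (X, PIX, SYS) and the matrix come from the slide.

```python
from itertools import combinations

# Link types as they appear in the topology matrix; lower cost = closer GPUs.
# The numeric costs are illustrative assumptions, not values from the talk.
LINK_COST = {"X": 0, "PIX": 1, "SYS": 4}

# Topology matrix from the slide (row = source GPU, column = target GPU).
TOPOLOGY = [
    ["X",   "PIX", "SYS", "SYS"],
    ["PIX", "X",   "SYS", "SYS"],
    ["SYS", "SYS", "X",   "SYS"],
    ["SYS", "SYS", "SYS", "X"],
]


def group_cost(gpus, topology=TOPOLOGY):
    """Sum of pairwise link costs inside a candidate GPU group."""
    return sum(LINK_COST[topology[a][b]] for a, b in combinations(gpus, 2))


def pick_gpus(free_gpus, count, topology=TOPOLOGY):
    """Return the lowest-cost group of `count` free GPUs, or None if too few."""
    candidates = combinations(sorted(free_gpus), count)
    return min(candidates, key=lambda g: group_cost(g, topology), default=None)


if __name__ == "__main__":
    # With all four GPUs free, the PIX-connected pair is preferred.
    print(pick_gpus([0, 1, 2, 3], 2))
```

With this scoring, a request for two GPUs on an idle host selects GPU 0 and GPU 1, the only pair connected through PIX rather than SYS.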

12.GPU Share and Limit
NVIDIA supports sharing a single GPU in two ways:
• VMs use NVIDIA GRID
• Processes use the MPS service
Pros:
• Official support
• Easy to use
Cons:
• NVIDIA GRID is not suitable for Kubernetes, where runc is the default container runtime
• A single MPS client can affect other MPS clients and the MPS server
• Resource limits are fixed and cannot be changed once a process is running
• Time slicing is not based on the containers' requests

13.GPU Share and Limit
Our solutions:
• Use a lightweight server
• Time slicing is based on the requested share
• Clients do not affect each other
• Limits can be changed at any time
• Zero injection into user applications

14.GPU Share and Limit
AlexNet on Tesla P4
            Execution Time       Utilization
            NVIDIA    Ours       NVIDIA    Ours
0.3 GPU     87.3s     97.79s     57.16%    46.59%
0.7 GPU     87.86s    66.91s     47.98%    69.62%
Total Time  97.86s    97.79s
[Charts: NVIDIA and Ours, each plotting the 0.3 GPU and 0.7 GPU clients]
NVIDIA's default strategy divides the available threads equally among the clients, so the client that starts first finishes first. Our strategy divides the available threads according to each client's requested share, so the client with the larger share finishes first.
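The difference between the two allocation strategies can be sketched in a few lines of Python. This is only an illustration of equal versus share-proportional division. The function names, the client names, and the choice to express shares as integer percentages (30 and 70 for 0.3 GPU and 0.7 GPU) are assumptions, and the real mechanism in GaiaStack is not described in enough detail on the slides to reproduce.

```python
def split_equally(total_threads, clients):
    """Baseline: every client gets the same number of threads."""
    per_client = total_threads // len(clients)
    return {name: per_client for name in clients}


def split_by_share(total_threads, requests):
    """Give each client threads in proportion to its requested share.

    `requests` maps a client name to its share as an integer percentage
    (an assumed representation, chosen to avoid floating-point rounding).
    """
    total_share = sum(requests.values())
    return {
        name: total_threads * share // total_share
        for name, share in requests.items()
    }


if __name__ == "__main__":
    requests = {"small": 30, "large": 70}  # hypothetical 0.3 GPU and 0.7 GPU clients
    print(split_equally(100, list(requests)))
    print(split_by_share(100, requests))
```

Under the equal split both clients receive the same capacity regardless of what they requested. Under the proportional split the client that asked for 0.7 GPU receives more and therefore tends to finish first.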

15.Tapp - a CRD of Kubernetes
Tapp is a Kubernetes CRD that is similar to StatefulSet but more powerful.
Similarities:
• Each Pod has a unique, incrementing ID
• Supports Persistent Volumes
• ...
Differences:
• Operations on any individual Pod can be repeated, including deletion, stop, reboot, rolling upgrade, and downgrade
• Different versions of a Pod can coexist in the same Tapp
• The image version of a Pod can be changed, and a new container can even be added
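To make the per-Pod control concrete, here is a toy Python model of a Tapp-like workload. It is a sketch under stated assumptions, not the Tapp API: the class name `TappSketch`, its fields, and the image names are hypothetical. It only shows how indexed Pods could carry individual image versions and individual stop states within one workload.

```python
from dataclasses import dataclass, field


@dataclass
class TappSketch:
    """Toy model of a Tapp-like workload; all field names are hypothetical."""
    replicas: int
    default_image: str
    image_overrides: dict = field(default_factory=dict)  # pod id -> image
    stopped: set = field(default_factory=set)            # pod ids kept stopped

    def desired_pods(self):
        """Return {pod id: image} for every Pod that should be running."""
        return {
            pod_id: self.image_overrides.get(pod_id, self.default_image)
            for pod_id in range(self.replicas)
            if pod_id not in self.stopped
        }


if __name__ == "__main__":
    app = TappSketch(replicas=3, default_image="trainer:v1")
    app.image_overrides[1] = "trainer:v2"  # upgrade only Pod 1
    app.stopped.add(2)                     # stop only Pod 2
    print(app.desired_pods())
    app.stopped.discard(2)                 # the same Pod can be started again
    print(app.desired_pods())
```

In this model, Pod 1 runs a newer image while Pod 0 stays on the default, and Pod 2 can be stopped and restarted without touching the others.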

16.Galaxy - a powerful CNI plugin
We designed a CNI plugin called Galaxy to cover various scenarios.
Pros:
• Underlay + Overlay
• Adapts to any network scenario
• Good performance
• Zero injection into the client network
For applications:
• Different apps can choose different network modes
• Pods on the same host can use different network modes
[Diagram labels: Galaxy, Floating IP, Multiple IPs, Underlay, SRIOV, Bridge, MacVlan, IPVlan, Multiple Vlan, L2 non-encapsulation, Overlay, L3 IPIP encapsulation, Pod Network, NAT, Policy, BGP Protocol]

17.Galaxy - a powerful CNI plugin
Overlay Performance
[Charts: TCP_RR (r/s), TCP_CRR (r/s), and TCP_STREAM (Mbits/s) for host, vxlan, ipip, and gateway]
The overlay solution in GaiaStack is IPIP + Host Gateway. Its performance is 14% to 40% better than VXLAN (flannel), and the solution has been accepted by the flannel community.

18.Thank you!