Deep Dive: VMware Special Interest Group (SIG VMware) - Steven Wong, VMware and Hui Luo, VMware

Kubernetes allows using topology labels to influence the scheduler's placement of pods. This is used to spread pods across availability zones while still respecting resource access and availability concerns. When Kubernetes runs on vSphere, the hypervisor platform also supports an underlying tier of high availability and automated placement for both the control plane and worker nodes, so two levels of scheduling and resource management are active. Currently no automatic scheduling integration occurs; that is, Kubernetes is not aware of the underlying vSphere topology (sites, affinity groups, NUMA, etc.). This session covers options to gain better performance, resource optimization, and availability through tuning of vSphere, Kubernetes configuration, and labels. This applies to any K8s distribution running on the vSphere stack.

1. VMware SIG Deep Dive into Kubernetes Scheduling Performance and high availability options for vSphere Steve Wong, Hui Luo VMware Cloud Native Applications Business Unit November 12, 2018

2. Presenter Bios
Steve Wong, Open Source Community Relations Engineer, VMware. Active in the Kubernetes storage community since 2015. Chair of Kubernetes VMware SIG. GitHub: @cantbewong
Hui Luo, Software Engineer, VMware. First open source project was to enable GPU on Kubernetes with vSphere. Also actively contributing to the kubelet, device manager, and device plugin area. GitHub: @figo

3. Abstract. Kubernetes allows using topology labels to affect the scheduler's placement of pods. This is used to spread pods across availability zones, while still respecting resource access and availability concerns. When Kubernetes runs on vSphere, the hypervisor platform also supports an underlying tier of high availability and automated placement options, for both control plane and worker nodes. Two levels of scheduling and resource management are active. Currently no automatic scheduling integration occurs, that is, Kubernetes is not aware of the underlying vSphere topology (sites, affinity groups, NUMA, etc.). This session will explain the options to gain better performance, resource optimization and availability through tuning of vSphere, and Kubernetes configuration and labeling. This is applicable to any K8s distribution running on the vSphere stack.

4. Agenda
• Kubernetes default scheduling: how it works
• Utilizing Zones to improve scheduling: using vSphere tags to define regions and zones (add cloud provider)
• What is NUMA? How to solve potential issues with CPU- and memory-intensive workloads
• Kubernetes default resource management: how it works
• Extending the functionality of Kubernetes: using vSphere DRS with Kubernetes
• High Availability options: using vSphere HA with Kubernetes

5. Kubernetes scheduling. What does the scheduler do? As pods are created, they are placed in a queue (priority available in Beta). The scheduler continuously pulls pods off the queue, evaluates each pod's requirements, and assigns it to a worker node.

6. Kubernetes scheduling. What does the scheduler do? As pods are created, they are placed in a queue (priority available in Beta). The scheduler continuously pulls pods off the queue, evaluates each pod's requirements, and assigns it to a worker node. Placement decision stages: 1. Filter out impossible worker nodes. a. Filters are called predicates; extensible in code, with a default list.

7. Kubernetes scheduling. What does the scheduler do? As pods are created, they are placed in a queue (priority available in Beta). The scheduler continuously pulls pods off the queue, evaluates each pod's requirements, and assigns it to a worker node. Placement decision stages: 1. Filter out impossible worker nodes. a. Filters are called predicates; extensible in code, with a default list. 2. Rank remaining nodes. a. Ranking is driven by priorities; this is extensible and configurable, with a default list (e.g. zones).

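The two placement stages described above (filter with predicates, then rank with priorities) can be sketched in a few lines of Python. This is a simplified illustration, not the real kube-scheduler; the names `Node`, `fits_resources`, and `least_loaded_score` are assumptions invented for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cpu_millicores: int
    free_memory_mib: int

def fits_resources(pod_req, node):
    """Predicate: filter out nodes without room for the pod's requests."""
    return (node.free_cpu_millicores >= pod_req["cpu_m"]
            and node.free_memory_mib >= pod_req["mem_mi"])

def least_loaded_score(pod_req, node):
    """Priority: prefer the node with the most CPU left after placement."""
    return node.free_cpu_millicores - pod_req["cpu_m"]

def schedule(pod_req, nodes):
    # Stage 1: filter out impossible worker nodes (predicates)
    feasible = [n for n in nodes if fits_resources(pod_req, n)]
    if not feasible:
        return None  # pod stays Pending
    # Stage 2: rank remaining nodes (priorities) and pick the best
    return max(feasible, key=lambda n: least_loaded_score(pod_req, n)).name

nodes = [Node("node-a", 500, 1024), Node("node-b", 4000, 8192)]
print(schedule({"cpu_m": 1000, "mem_mi": 512}, nodes))  # -> node-b
```

In the real scheduler both stages run against a configurable default list of predicates and priorities, but the overall shape is the same.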

9. Scheduling modifiers: elements that influence pod placement.
• Node selector: constrains which nodes your pod is eligible to be scheduled on, based on <key: value> labels on the node. Some labels are automatically created, but you can add more; specified as nodeSelector <key: value> in the Pod spec.
• Affinity: a pod can define rules based on node labels, or based on the placement of other pods.
• Zones: label nodes with failure zones/regions.
• Taints / Tolerations: mark nodes with arbitrary labels, which could correspond to a resource or whatever you like.
• Admission Controllers: a wide variety are available, in validating and mutating classes.
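The node selector and taint/toleration checks above reduce to simple label matching. A minimal sketch, with hypothetical helper names rather than real Kubernetes API calls:

```python
def node_selector_matches(node_labels, selector):
    """A pod's nodeSelector matches only if every <key: value> pair
    is present on the node."""
    return all(node_labels.get(k) == v for k, v in selector.items())

def tolerates(node_taints, tolerations):
    """Every taint on the node must be tolerated by the pod."""
    return all(t in tolerations for t in node_taints)

def eligible(node_labels, node_taints, pod_selector, pod_tolerations):
    """A node is eligible when the selector matches and all taints are tolerated."""
    return (node_selector_matches(node_labels, pod_selector)
            and tolerates(node_taints, pod_tolerations))

labels = {"disktype": "ssd", "failure-domain.beta.kubernetes.io/zone": "zone-a"}
print(eligible(labels, ["gpu"], {"disktype": "ssd"}, ["gpu"]))  # -> True
print(eligible(labels, ["gpu"], {"disktype": "ssd"}, []))       # -> False
```

Real taints also carry effects (NoSchedule, PreferNoSchedule, NoExecute); this sketch only models the hard-exclusion case.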

10. Why use Zones? Kubernetes will automatically spread the pods in replication controllers or services across zones, to reduce the impact of zone failures. How it works:
• Kubernetes supports running a single cluster in multiple failure zones.
• When nodes are started, labels are automatically added with zone information, based on tags pre-applied by a vSphere administrator.
• The scheduler uses these labels when placing pods.
Limitations:
• Because this is a priority, not a predicate, this is best-effort placement. If the zones in your cluster have uneven available resources due to node variations or unevenly distributed pre-existing workloads, this might prevent perfectly even spreading of your pods across zones.
• The Kubernetes Zones feature is designed to intelligently place Pods on worker nodes. It does not place the nodes themselves within vSphere failure domains.
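The best-effort nature of zone spreading follows from it being a scoring step: nodes in the zone with the fewest existing replicas simply score higher. A simplified model of that priority (the function name and scoring scheme are illustrative, not the scheduler's actual code):

```python
from collections import Counter

def zone_spread_scores(existing_pod_zones, node_zones):
    """Score candidate nodes so that zones with fewer existing replicas
    of this service rank higher. A priority (soft), not a predicate (hard)."""
    counts = Counter(existing_pod_zones)
    return {node: -counts[zone] for node, zone in node_zones.items()}

# Two replicas already in zone-a, one in zone-b: a node in zone-b wins.
scores = zone_spread_scores(
    ["zone-a", "zone-a", "zone-b"],
    {"n1": "zone-a", "n2": "zone-b"},
)
print(max(scores, key=scores.get))  # -> n2
```

Because this only biases the ranking, a zone short on resources can still end up with fewer replicas, which is exactly the limitation noted above.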

11. What is NUMA? Non-Uniform Memory Access.

12. Why should you care about NUMA? Memory-intensive workloads: nearly all database servers (e.g. Oracle, MongoDB) present a workload which will attempt to detect and consume as much of the system's memory as possible. Where does this lead? Unpredictable performance? Swapping? [diagram: a NUMA host with 2 CPU nodes, Node 0 with 32GB and Node 1 with 21GB] This basically comes down to a choice of two: whether you would rather have a fast cache, or a slower cache that is larger. When Linux initially allocates a thread, it is assigned a preferred node; by default, memory allocations come from the node the thread runs on, but can potentially come from other nodes, with broad performance implications. Many popular application runtimes (e.g. the Java JRE) have similar NUMA-related issues.

13. How can NUMA issues be avoided? Can the application be modified / reconfigured?
• The application can be "wrapped" with a numactl command to interleave memory, or engage other options, with potentially broad performance effects (e.g. interleaving gets predictable albeit reduced performance).
• A cgroup-aware version (e.g. Java JRE v10) can be deployed. This is often not available; many runtimes were developed in a pre-container era.
Active discussions regarding Kubernetes enhancements are going on now in the Resource Management Working Group. Please join in; see Issue #49964.

14. Using a NUMA-aware hypervisor to solve issues now. VM composition guidelines (a NUMA-aware hypervisor can have IO benefits too):
• Assuming your workload fits within the footprint of a single NUMA node, compose worker-node VMs as "walled gardens" corresponding to node size.
• Specify multiple cores per socket, not multiple sockets.
• If you can't fit in a single node because of core or memory requirements, minimize socket count to what is needed to meet requirements.
• Don't assign an odd number of vCPUs.
• Never compose a VM larger than the number of physical cores.
For the vSphere hypervisor, there are advanced vNUMA settings; they rarely need to be changed from defaults. (link)
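The composition guidelines above can be expressed as a small checker. This is an illustrative sketch, not a VMware tool; the function name and parameters are assumptions made for the example.

```python
def vm_composition_problems(vcpus, mem_gb,
                            cores_per_pnode, mem_gb_per_pnode, pnode_count):
    """Return guideline violations for a proposed worker-node VM on a
    NUMA host with `pnode_count` physical nodes."""
    problems = []
    if vcpus > 1 and vcpus % 2 == 1:
        problems.append("odd vCPU count")
    if vcpus > cores_per_pnode * pnode_count:
        problems.append("VM larger than the number of physical cores")
    if vcpus > cores_per_pnode or mem_gb > mem_gb_per_pnode:
        # Workload spans NUMA nodes: keep socket count to the minimum needed.
        problems.append("does not fit in a single NUMA node")
    return problems

# 8 vCPUs / 48 GB on a 2-node host with 12 cores and 96 GB per node: fits.
print(vm_composition_problems(8, 48, 12, 96, 2))   # -> []
print(vm_composition_problems(7, 48, 12, 96, 2))   # -> ['odd vCPU count']
```

A "walled garden" VM in this model is one that returns no problems: it stays inside one physical NUMA node's core and memory footprint.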

15. Kubernetes Resource Management: how it works.
• Specified and "metered" on a per-container basis:
• Requests: what a container is guaranteed to get; the pod won't be scheduled if this is not available.
• Limits: restrictions are engaged when a limit is exceeded.
• Unmanaged by default: mechanisms exist to allow a cloud provider or admin to supply a default, and to override container specifications outside an allowed range.
• Supplemental "metering" at the namespace level: Resource Quotas can be applied by an administrator at a namespace level, covering requests, limits, and numeric counts of allowed instances of objects.
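A simplified model of the namespace-level quota check described above. This is not the real ResourceQuota admission controller, just a sketch of its decision: refuse any pod whose requests would push the namespace total over quota.

```python
def admits(pod_requests, namespace_usage, namespace_quota):
    """Scheduling-time check: True if the pod's requests, added to current
    namespace usage, stay within every quoted resource."""
    for resource, req in pod_requests.items():
        used = namespace_usage.get(resource, 0)
        limit = namespace_quota.get(resource)
        if limit is not None and used + req > limit:
            return False
    return True

quota = {"cpu_m": 4000, "mem_mi": 8192}
usage = {"cpu_m": 2000, "mem_mi": 4096}
print(admits({"cpu_m": 1000, "mem_mi": 2048}, usage, quota))  # -> True
print(admits({"cpu_m": 3000, "mem_mi": 1024}, usage, quota))  # -> False
```

Note that, as the enforcement slide later points out, this check happens only at scheduling time; pods already running are not evicted when a quota is tightened.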

16. Kubernetes Resource Management: what resources are managed? Pod + namespace level:
• CPU: units are millicores, 2000m = 2 cores.
• Memory: mebibytes, 1Mi = 1,048,576 bytes.
Supplemental "metering" at the namespace level:
• Memory
• CPU
• Object counts: configmaps, persistentvolumeclaims, replicationcontrollers, secrets, services, loadbalancers.
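The units above are easy to get wrong, so here is a worked example as two tiny parsers. These are hypothetical helpers for illustration, not part of any Kubernetes client library.

```python
def parse_cpu(quantity):
    """'2000m' -> 2.0 cores; a bare number like '2' is already in cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)

def parse_memory_mi(quantity):
    """'1000Mi' -> bytes, where 1 Mi (mebibyte) = 1,048,576 bytes."""
    assert quantity.endswith("Mi"), "sketch only handles Mi quantities"
    return int(quantity[:-2]) * 1024 * 1024

print(parse_cpu("2000m"))        # -> 2.0
print(parse_memory_mi("1Mi"))    # -> 1048576
print(parse_memory_mi("1000Mi")) # -> 1048576000
```

Kubernetes also accepts other suffixes (Ki, Gi, and decimal M, G); the sketch handles only the two units named on the slide.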

17. Kubernetes Default Resource Management: goals. Efficiency, fairness, quotas, prioritization, isolation.

18. Kubernetes built-in resource management: enforcement.
Run-time enforcement at the worker-node level:
• CPU is "compressible": a violation results in throttling.
• Memory is "incompressible": a violation triggers the "death penalty" (termination) of the Pod hosting the container.
Scheduling-time enforcement:
• The ResourceQuota admission controller will refuse to schedule a Pod that would violate a limit.
• After scheduling, running Pods are not affected by quota.
Limitations: CPU measurement is in arbitrary units, not uniform across hosts, and is a share, not a guarantee.

19. Where Resource Management enforcement takes place: Kubernetes -> container runtime -> Linux -> hypervisor (optional). The Kubernetes control plane manages desired policy. Enforcement passes Pod -> container runtime -> Linux OS. Cgroups are used to map Pod CPU and memory resources. Note: two cgroup drivers exist (cgroupfs [default], systemd).
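As a rough sketch of that cgroup mapping, assuming cgroup v1 with the cgroupfs driver: requests become a relative CPU weight, limits become a hard CFS quota and a memory cap. The exact kubelet logic has more cases, so treat these formulas as approximations.

```python
def cgroup_v1_values(cpu_request_m, cpu_limit_m, mem_limit_bytes):
    """Approximate per-container cgroup-v1 settings derived from
    Kubernetes requests/limits (simplified illustration)."""
    return {
        # Relative weight: 1000m (one core) maps to 1024 shares.
        "cpu.shares": cpu_request_m * 1024 // 1000,
        # CFS hard cap: limit is enforced by throttling within each period.
        "cpu.cfs_period_us": 100_000,
        "cpu.cfs_quota_us": cpu_limit_m * 100,
        # Exceeding this triggers the OOM killer ("incompressible").
        "memory.limit_in_bytes": mem_limit_bytes,
    }

print(cgroup_v1_values(500, 2000, 256 * 1024 * 1024))
```

This makes the compressible/incompressible split from the previous slide concrete: CPU overuse is throttled against cfs_quota, while memory overuse past limit_in_bytes kills the container.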

20. Supplement Kubernetes Resource Management with vSphere DRS. What is DRS? The vSphere Distributed Resource Scheduler (DRS) is a load balancer for VMs deployed on a hypervisor cluster. It has advanced features that can provide actual guaranteed resource reservations, not just shares. It also incorporates health monitoring and IO awareness. This enables secure multi-tenant (multi-department) Kubernetes deployments:
• with the ability to have true guaranteed resource reservations (not just shares)
• with governed sharing of unutilized capacity for improved efficiency
• allowing maintenance with less service-level disruption
[diagram: a DRS cluster of hosts running K8S Prod and K8S Test master and worker VMs]

21.Thank You Questions?

22. The remaining slides were not presented, to meet time constraints; they are included in the published deck for reference.

23. Configuring VM affinity rules. Quorum dictates design: Host-VM Rules (VM Anti-Affinity). [diagram: K8S Prod master and worker VMs and K8S Test VMs spread across Fault Domain A and Fault Domain B]

24. Extending Kubernetes with vSphere HA. What is HA? Hosts in an HA cluster are health-monitored and, in the event of a failure, the virtual machines on a failed host are restarted on alternate hosts. When running on hardware that supports health reporting, proactive failure avoidance can also be engaged: for example, loss of a system cooling fan or degraded storage can trigger automated evacuation before host failure. Deploying HA: at least 2 hypervisor hosts are required. HA can be deployed independent of DRS, but the combination of the two in a cluster is recommended; this will enable load balancing and application of affinity/anti-affinity rules.

25. Configuring HA restart priority. Ensure etcd and the control plane start first, and Prod systems before others.