Node Operator: Kubernetes Node Management Made Simple

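Kubernetes nodes depend on a lot of host software and configuration, including the container runtime, network plugins, and the kubelet. Maintaining these dependencies is tedious and error-prone. At Alibaba and Ant Financial, a typical cluster administrator maintains many thousands of Kubernetes nodes on average. We developed Node Operator to simplify this work and reduce its risk. In this talk, we share how we use Node Operator to maintain node software and configuration. We designed a declarative API that lets cluster administrators interact with node CRD resources to manage the lifecycle of any node. Node Operator also responds to node state changes and takes recovery actions when necessary. Node Operator has an extensible design, so it can also manage host software that is not part of Kubernetes.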

1. Node Operator: Kubernetes Node Management Made Simple 陈俊, Ant Financial

2. Agenda • Background and Motivation • Introduction to Operators • Node-Operator • Advanced Topic: Kube-on-Kube-Operator • Achievement • Q&A

3. Background: DC/OS From Sigma 2.0 (Swarm) to Sigma 3.1 (Kubernetes)

4. Background: Operation Requirements • Applies to large-scale clusters • Set up and tear down clusters quickly and conveniently • Add and delete Nodes at any time • Upgrade Master and Node components reliably • Canary rollout • Master and Node component version management

5. Motivation: Work Order Deployment • Upgrade Node Versions • Work Order: • Upgrade Node • Upgrade docker • Upgrade kubelet • Upgrade Node • Upgrade docker • Upgrade kubelet • …

6. Motivation: Work Order Deployment Disadvantages • Inconsistency • Not failure-aware • Complicated architecture. A work order deployment system cannot meet the requirements of resource management.

7. Operator • Observe: watch the desired resource and the actual resource • Analyze: diff the desired config against the actual config • Action: drive the resource to the desired config
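The Observe, Analyze, Action cycle on this slide can be illustrated with a minimal sketch. It is not the Node-Operator implementation: state is modelled as a plain dictionary of component versions, and the names `analyze`, `act`, and `reconcile` are hypothetical, chosen only to mirror the three steps.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]           # e.g. component name -> version (illustrative)
Action = Tuple[str, str, str]    # (verb, key, value)


def analyze(desired: State, actual: State) -> List[Action]:
    """Diff desired against actual and return the actions that close the gap."""
    actions: List[Action] = []
    for key, want in sorted(desired.items()):
        if actual.get(key) != want:
            actions.append(("set", key, want))
    for key in sorted(actual.keys() - desired.keys()):
        actions.append(("remove", key, ""))
    return actions


def act(actual: State, actions: List[Action]) -> None:
    """Apply actions in place; a real operator would change the host or call an API here."""
    for verb, key, value in actions:
        if verb == "set":
            actual[key] = value
        else:
            actual.pop(key, None)


def reconcile(observe: Callable[[], Tuple[State, State]], max_rounds: int = 10) -> int:
    """Run observe -> analyze -> act until nothing differs; return the rounds used."""
    for round_no in range(1, max_rounds + 1):
        desired, actual = observe()
        actions = analyze(desired, actual)
        if not actions:
            return round_no
        act(actual, actions)
    raise RuntimeError("state did not converge")
```

Because the loop recomputes the diff on every round, it keeps working when the actual state drifts between rounds, which is the property a one-shot work order lacks.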

8. Operator: Advantages • Declarative system • Continually drives resources to the final state • kube-apiserver-oriented programming • CustomResourceDefinition (CRD) • Built on Kubernetes APIs • Kubernetes repo support • Agile, flexible, and convenient

9. Node-Operator: Overview • User: SREs who scale Nodes and take them offline by posting Machine CRs. • Node-Operator: diffs Machine and Node state, and manages Node software and configuration files. • Machine: an instance of the Machine CRD with basic node information, which represents a desired node in the Kubernetes cluster. • NPD (Node Problem Detector): posts Node state to kube-apiserver.
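To make the Machine versus Node comparison concrete, here is a small sketch. The talk does not show the Machine CRD schema, so every field name below (`components`, `offline`, `registered`) is an assumption made for illustration, and `plan` is a hypothetical helper, not part of Node-Operator.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Machine:
    """Desired node, as an SRE might post it (field names are hypothetical)."""
    name: str
    ip: str
    components: Dict[str, str] = field(default_factory=dict)  # component -> version
    offline: bool = False


@dataclass
class NodeStatus:
    """Actual node state as observed by the operator (hypothetical shape)."""
    name: str
    registered: bool = False
    components: Dict[str, str] = field(default_factory=dict)


def plan(machine: Machine, node: NodeStatus) -> List[str]:
    """Return the ordered steps that move the node toward the Machine spec."""
    if machine.offline:
        return [f"drain and remove {node.name}"] if node.registered else []
    steps: List[str] = []
    for component, version in sorted(machine.components.items()):
        have = node.components.get(component)
        if have is None:
            steps.append(f"install {component} {version}")
        elif have != version:
            steps.append(f"upgrade {component} {have} -> {version}")
    if not node.registered:
        steps.append(f"register {node.name}")
    return steps
```

Under these assumptions, scaling out, upgrading, and taking a node offline all reduce to editing one Machine object and letting the diff produce the steps.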

10. Node-Operator: Scale Nodes

11. Node-Operator: Upgrade Nodes

12. Node-Operator: Grayscale Rollout
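The diagram for this slide is not preserved, so the following is only a sketch of one common way to stage a grayscale (canary) rollout, not a description of how Node-Operator does it. Nodes are upgraded in growing batches and the rollout halts as soon as a batch contains a failure. The function names and the `upgrade` callback are hypothetical.

```python
from typing import Callable, Iterable, List, Sequence


def batches(nodes: Sequence[str], sizes: Iterable[int]) -> List[List[str]]:
    """Split nodes into consecutive batches; the last size repeats until all nodes are covered."""
    out: List[List[str]] = []
    start = 0
    size = 1  # used only if no sizes are given
    size_iter = iter(sizes)
    while start < len(nodes):
        size = next(size_iter, size)
        if size < 1:
            raise ValueError("batch sizes must be positive")
        out.append(list(nodes[start:start + size]))
        start += size
    return out


def rollout(nodes: Sequence[str], sizes: Iterable[int],
            upgrade: Callable[[str], bool]) -> List[str]:
    """Upgrade batch by batch; stop before the next batch if any node in the current one fails."""
    done: List[str] = []
    for batch in batches(nodes, sizes):
        results = [upgrade(node) for node in batch]
        done.extend(node for node, ok in zip(batch, results) if ok)
        if not all(results):
            break
    return done
```

Starting with a batch of one node limits the blast radius of a bad version; the remaining nodes stay on the old version until someone fixes the spec or rolls it back.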

13. Kube-on-Kube-Operator: Overview • Biz-Cluster: used to deploy our applications. • Meta-Cluster: used to set up Biz-Cluster master components. We add Biz-Cluster master nodes to the Meta-Cluster. • User: SREs who set up and upgrade a Biz-Cluster by posting Cluster CRs. • Kube-on-Kube-Operator: diffs Biz-Cluster CRs against the state of the Biz-Cluster master components, and manages those components through Kubernetes resources such as Deployment, Pod, etc.
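As a rough illustration of managing Biz-Cluster master components through ordinary Kubernetes resources, the sketch below renders one `apps/v1` Deployment manifest per master component from a Cluster CR. The Cluster CR fields (`version`, `masterReplicas`, `imageRepository`), the namespace-per-cluster layout, and the function name are all assumptions; the talk does not show the real schema. etcd and component flags are left out to keep the sketch short.

```python
from typing import Any, Dict, List

MASTER_COMPONENTS = ("kube-apiserver", "kube-controller-manager", "kube-scheduler")


def render_deployments(cluster: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build Deployment manifests for the Meta-Cluster from a (hypothetical) Cluster CR."""
    name = cluster["metadata"]["name"]
    spec = cluster["spec"]
    manifests: List[Dict[str, Any]] = []
    for component in MASTER_COMPONENTS:
        labels = {"cluster": name, "component": component}
        image = f"{spec['imageRepository']}/{component}:{spec['version']}"
        manifests.append({
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            # Assumption: each Biz-Cluster gets its own namespace in the Meta-Cluster.
            "metadata": {"name": f"{name}-{component}", "namespace": name, "labels": labels},
            "spec": {
                "replicas": spec["masterReplicas"],
                "selector": {"matchLabels": labels},
                "template": {
                    "metadata": {"labels": labels},
                    "spec": {"containers": [{"name": component, "image": image}]},
                },
            },
        })
    return manifests
```

With this shape, upgrading a Biz-Cluster master is a change to `version` in the Cluster CR; the Meta-Cluster's own Deployment controller then performs the rolling update.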

14. Work Together

15. Achievement • Anyone can operate and maintain a Kubernetes cluster • Set up and tear down a Kubernetes cluster in two minutes • Automated rollouts and rollbacks • Cluster and Node self-healing

16. Q&A • Thanks for listening • 陈俊 • WeChat: answer1991chen


18. Background: Cluster Scale • Production environment: • Dozens of clusters • 5k+ Nodes / cluster • 10k+ Nodes in the largest cluster • Testing environment: • Hundreds of clusters for CI/CD • 500+ Nodes / cluster