Production Cluster Monitoring and Remediation for High Reliability at eBay

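eBay runs dozens of Kubernetes clusters in data centers across different regions worldwide. Thousands of nodes support core eBay services such as search and big data. Large, complex, cross-region production clusters, together with workloads that demand very high cluster stability, make monitoring and remediation a major challenge for us. Building on Prometheus federation, component assertions, metrics exporters, and our own monitoring tools, we built a set of clear dashboards and then implemented a complete cross-cluster remediation process along with incident management and monitoring automation. In this talk, we share our experience monitoring large-scale Kubernetes production clusters and our ideas for the future.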

1.Production Cluster Monitoring and Remediation for High Reliability at eBay
钱世俊, Cloud Software Engineer, eBay @danielqsj
刘应科, MTS1, Cloud Software Engineer, eBay @keyingliu

2.Agenda
● Growing Clusters
● Monitoring
● Remediation
● Q&A

3.Growing Clusters
● 30+ Clusters
● 8K+ Nodes (BMs+VMs)
● 100K+ Pods

4.Monitoring Goals
● Control Plane Management
  ○ Apiserver
  ○ ETCD
  ○ Scheduler
  ○ Controller
● Data Plane Management
  ○ Node Lifecycle Management
  ○ Pod Lifecycle Management
  ○ Daemonset / Deployment / Service / Ingress …
● Alert Management
● AIOps

5.Monitoring Overview
● Alerts
● Logging
● Metrics
● Automation
● AIOps

6.How we log
Diagram components:
● Resource Manager
● Egress Service
● Kube API Server
● Flink
● Daemonset: Filebeat
● Ingress Service
● Kafka
● Storage
● SDK
● Logrotate


8.How we collect metrics
Federated Prometheus
● Cluster Prometheus A
● Cluster Prometheus B
● Cluster Prometheus C
Control Plane Management
● Apiserver Latency
● Scheduling Latency
● IP Allocation Latency
● ETCD Latency
● ETCD disk usage
● Namespace Resource Usage
● ...
Node Lifecycle Management
● NotReady Nodes: Amount and Timestamp
● SchedulingDisabled Nodes: Amount, Timestamp and Reason
● CPU, Memory, Disk usage
● Network Status
● PID, FD status
● ...
Pod Lifecycle Management
● Pod Creation Latency
● Pod Terminating Latency
● Pod Restart Times
● Pod Resource Usage
● Container Creation Latency
● Container Terminating Latency
● Container Exit Status
● ...
Metric sources
● K8S Key Components
● Node Problem Detector
● Kube State Metrics
● Exporter
● Assertion
● ...
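The federation layer on this slide can be illustrated with a minimal sketch. It assumes the global Prometheus pulls selected series from each cluster Prometheus through Prometheus's `/federate` endpoint, where every series selector is passed as a `match[]` parameter. The cluster names, host names, and selectors below are hypothetical and do not come from the talk.

```python
from urllib.parse import urlencode

# Hypothetical per-cluster Prometheus endpoints (assumption, not from the slides).
CLUSTER_PROMETHEUS = {
    "cluster-a": "http://prometheus.cluster-a.example:9090",
    "cluster-b": "http://prometheus.cluster-b.example:9090",
    "cluster-c": "http://prometheus.cluster-c.example:9090",
}

# Hypothetical selectors for the series the global layer wants to pull.
SELECTORS = ['{job="apiserver"}', '{job="etcd"}']


def federate_url(base_url, selectors):
    """Build a /federate URL; each selector becomes one match[] parameter."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url}/federate?{query}"


# One scrape URL per cluster Prometheus.
FEDERATE_URLS = {
    name: federate_url(base, SELECTORS)
    for name, base in CLUSTER_PROMETHEUS.items()
}
```

In a real deployment the same selectors would normally live in the federated Prometheus scrape configuration rather than in code; the function only shows how the request is shaped.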

9.Assertion
Cluster Health assertions
● api-server
● node
● etcd
● ipam
● ecr
● vip-usage
● kubeproxy-healthz
● pod-connectivity
● pod-failure
● cross-cluster netperf
● simulate workloads
● real time analysis of workloads
● more...
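One way to picture the assertion approach is a runner that executes named checks and folds the results into a single cluster-health verdict. This is a sketch under assumptions: the slides do not show how assertions are implemented, so the function names and the check bodies here are hypothetical stand-ins.

```python
from typing import Callable, Dict


def run_assertions(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every named check; a check that raises counts as failed."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def cluster_healthy(results: Dict[str, bool]) -> bool:
    """The cluster is healthy only if every assertion passed."""
    return all(results.values())


def failing_etcd_probe() -> bool:
    # Hypothetical probe that simulates an unreachable etcd endpoint.
    raise ConnectionError("etcd endpoint unreachable")


# Hypothetical checks; real ones would probe the API server, etcd, and so on.
EXAMPLE_CHECKS = {
    "api-server": lambda: True,
    "pod-connectivity": lambda: True,
    "etcd": failing_etcd_probe,
}

EXAMPLE_RESULTS = run_assertions(EXAMPLE_CHECKS)
```

Treating an exception as a failed assertion keeps one broken probe from hiding the results of the others.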

10.How we build dashboards
Cluster Dashboards (Cluster Prometheus)
● Apiserver
● ETCD
● Node
● Namespace
● Pod
● Service
● Ingress
● Storage
● Network
● Capacity
● ...
Global Dashboards (Federated Prometheus)
● Global Health
● Global Cluster Capacity
● Global Alerts
● Components Version
● ...

11.Global Health Dashboard

12.Cluster Health Dashboard

13.Node Health Dashboard

14.How we manage alerts
Alert Rules feed the Cluster Alert Dashboard and the Global Alert Dashboard.
Labels
● Component
● Severity
● Cluster
Annotations
● Description
● Summary
● Runbook
● RCA
● Execution Plan
● Time Consumption
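The label and annotation convention on this slide could be enforced with a small lint step before rules are rolled out. The sketch below assumes rules are loaded as dictionaries shaped like Prometheus alerting rules (`alert`, `expr`, `labels`, `annotations`). Which fields are mandatory is an assumption, and the example rule is hypothetical.

```python
# Assumed mandatory fields; the slide lists these names but not which are required.
REQUIRED_LABELS = {"component", "severity", "cluster"}
REQUIRED_ANNOTATIONS = {"description", "summary", "runbook"}


def missing_fields(rule):
    """Return the sorted label/annotation keys a rule lacks, prefixed by kind."""
    labels = set(rule.get("labels", {}))
    annotations = set(rule.get("annotations", {}))
    missing = [f"labels.{key}" for key in REQUIRED_LABELS - labels]
    missing += [f"annotations.{key}" for key in REQUIRED_ANNOTATIONS - annotations]
    return sorted(missing)


# Hypothetical rule used only to exercise the check.
EXAMPLE_RULE = {
    "alert": "NodeNotReady",
    "expr": 'kube_node_status_condition{condition="Ready",status="true"} == 0',
    "labels": {"component": "node", "severity": "critical"},
    "annotations": {"summary": "Node is NotReady", "description": "Node lost Ready status"},
}

EXAMPLE_MISSING = missing_fields(EXAMPLE_RULE)
```

For the example rule the check reports the missing `cluster` label and `runbook` annotation, which is the kind of gap that would otherwise only surface on the alert dashboards.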

15.Global Alert Dashboard

16.KubeWatch
● How to audit across clusters?
● How to execute complex queries quickly?

17.KubeWatch Architecture

18.KubeWatch Query Example:
1. Get all pods for namespace kube-system
    SELECT name AS podName,
           data->'metadata'->>'namespace' AS namespace,
           data->'spec'->>'nodeName' AS nodeName,
           meta, data
    FROM pods
    WHERE data->'metadata'->>'namespace' LIKE 'kube-system'
      AND deleted = FALSE;
2. Get all services of type load balancer
    SELECT name AS serviceName,
           data->'spec'->>'type' AS type,
           data->'metadata'->>'namespace' AS nameSpace
    FROM svcs
    WHERE data->'spec'->>'type' LIKE 'LoadBalancer';
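To make the semantics of the first query concrete, here is a rough Python equivalent over in-memory rows. It assumes each row carries `name`, `deleted`, and a `data` dictionary holding the object JSON, mirroring the columns the query references; the sample rows are invented. Because the `LIKE` pattern contains no wildcards, it behaves as an equality test, which is what the sketch uses.

```python
def pods_in_namespace(rows, namespace):
    """Mimic query 1: live pods in one namespace, with the node they run on."""
    selected = []
    for row in rows:
        metadata = row["data"].get("metadata", {})
        if row["deleted"] or metadata.get("namespace") != namespace:
            continue
        selected.append({
            "podName": row["name"],
            "namespace": metadata["namespace"],
            # data->'spec'->>'nodeName' yields NULL when absent; None plays that role here.
            "nodeName": row["data"].get("spec", {}).get("nodeName"),
        })
    return selected


# Invented sample rows, for illustration only.
SAMPLE_PODS = [
    {"name": "coredns-1", "deleted": False,
     "data": {"metadata": {"namespace": "kube-system"}, "spec": {"nodeName": "node-1"}}},
    {"name": "coredns-0", "deleted": True,
     "data": {"metadata": {"namespace": "kube-system"}, "spec": {"nodeName": "node-2"}}},
    {"name": "web-1", "deleted": False,
     "data": {"metadata": {"namespace": "default"}, "spec": {}}},
]

KUBE_SYSTEM_PODS = pods_in_namespace(SAMPLE_PODS, "kube-system")
```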

19.Monitoring Automation
● Rollout alert rules
● Rollout monitoring configurations
● Git Driven
Diagram labels: Dev, Create PR, Review, Github Repo, Sync, GitLab Repo, Alert Controller, Prometheus
Alert Controller steps:
1. Get newest rules / configuration
2. Get runtime rules / configuration
3. Compare rules / configuration
4. Update rules / configuration and Reload
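The four controller steps amount to a reconcile loop, sketched below under assumptions. Rules are modeled as a mapping from file name to file content, and `apply_changes` and `reload_prometheus` are hypothetical callbacks, since the talk does not show the Alert Controller's internals. In practice the reload callback could send a POST to Prometheus's `/-/reload` endpoint, which is available when the lifecycle API is enabled.

```python
def reconcile(desired, runtime, apply_changes, reload_prometheus):
    """Compare desired and runtime rule files; update and reload only on drift.

    desired: rules from the Git repo (step 1).
    runtime: rules currently loaded (step 2).
    Returns True when an update was applied, False when already in sync.
    """
    # Step 3: compare the two views.
    changed = {name: body for name, body in desired.items() if runtime.get(name) != body}
    removed = sorted(name for name in runtime if name not in desired)
    if not changed and not removed:
        return False
    # Step 4: write the new state, then ask Prometheus to reload it.
    apply_changes(changed, removed)
    reload_prometheus()
    return True


# Tiny in-memory stand-ins used to exercise the loop.
CALLS = []
DESIRED = {"node.rules": "v2", "etcd.rules": "v1"}
RUNTIME = {"node.rules": "v1", "old.rules": "v1"}

UPDATED = reconcile(
    DESIRED,
    RUNTIME,
    apply_changes=lambda changed, removed: CALLS.append(("apply", changed, removed)),
    reload_prometheus=lambda: CALLS.append(("reload",)),
)
```

Skipping the reload when nothing changed keeps the loop safe to run on a short interval.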

20.AIOps
● Real-time analysis and alerts
● Reducing MTTD and MTTR
Diagram labels: Data Preprocessing, Anomaly Detection, Correlation Analysis, Alert
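The slide does not say which anomaly detection method is used. As a stand-in, the sketch below flags a metric sample whose z-score against recent history exceeds a threshold. The threshold and the sample latencies are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, value, threshold=3.0):
    """Flag value if it lies more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # flat history: any deviation stands out
    return abs(value - mu) / sigma > threshold


# Invented latency samples in milliseconds.
RECENT_LATENCY_MS = [10, 11, 9, 10, 10, 11, 9, 10]
```

A production pipeline would add the preprocessing and correlation stages named on the slide; this covers only the detection step in its simplest form.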

21.We have
• More than 30 clusters
• Thousands of nodes
• Including both BMs and VMs


23.Hardware Failures
Sensors (non-intrusive detection)
● TEMP
● CPU
● PSU
● MEMORY
● VOLT
● HDD
● FAN
In OS (intrusive detection)
● Kernel message
● MCE message
● Disk check
Handling
● Define a pattern for each known failure
● Define more patterns when a new failure is found
● Check whether the failure can be tolerated
● Mark the hardware as failed
● Get notified when hardware issues have been fixed
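Pattern-based detection of in-OS failures can be sketched as a table of regular expressions, each mapped to a failure type and a flag saying whether the failure can be tolerated. The patterns, failure names, and tolerance decisions below are illustrative assumptions, not the rules used at eBay.

```python
import re

# (pattern, failure type, tolerable?) - all entries are illustrative assumptions.
FAILURE_PATTERNS = [
    (re.compile(r"machine check events? logged", re.IGNORECASE), "mce", True),
    (re.compile(r"I/O error, dev \w+", re.IGNORECASE), "disk", False),
    (re.compile(r"temperature above threshold", re.IGNORECASE), "temp", True),
]


def classify(message):
    """Return (failure_type, tolerable) for the first matching pattern, else None."""
    for pattern, failure_type, tolerable in FAILURE_PATTERNS:
        if pattern.search(message):
            return failure_type, tolerable
    return None


def should_mark_failed(message):
    """Mark the hardware as failed only for known, non-tolerable failures."""
    result = classify(message)
    return result is not None and not result[1]
```

Supporting a newly discovered failure then means appending one entry to `FAILURE_PATTERNS`, which matches the "define more patterns" step on the slide.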

24.Software Failures
Health check for key components:
● Container Runtime
● Kube*
● Configurations
● Key Services
● Kernel soft lockup etc.

25.Thank You !