High-Reliability Production Cluster Monitoring and Remediation at eBay
1 .Production Cluster Monitoring and Remediation for High Reliability at eBay ● 钱世俊, Cloud Software Engineer, eBay, @danielqsj ● 刘应科, MTS1, Cloud Software Engineer, eBay, @keyingliu
2 .Agenda ● Growing Clusters ● Monitoring ● Remediation ● Q&A
3 .Growing Clusters ● 30+ Clusters ● 8K+ Nodes (BMs + VMs) ● 100K+ Pods
4 .Monitoring Goals ● Control Plane Management ○ Apiserver ○ ETCD ○ Scheduler ○ Controller ● Data Plane Management ○ Node Lifecycle Management ○ Pod Lifecycle Management ○ Daemonset / Deployment / Service / Ingress … ● Alert Management ● AIOps
5 .Monitoring Overview ● Alerts ● Logging ● Metrics ● Automation ● AIOps
6 .How we log ● Resource Manager ● Egress Service ● Kube API Server ● Flink ● Daemonset: Filebeat ● Ingress Service ● Kafka ● Storage ● SDK ● Logrotate
7 .
8 .How we collect metrics ● Federated Prometheus ○ Cluster Prometheus A ○ Cluster Prometheus B ○ Cluster Prometheus C ● Control Plane Management ○ Apiserver Latency ○ Scheduling Latency ○ IP Allocation Latency ○ ETCD Latency ○ ETCD Disk Usage ○ Namespace Resource Usage ○ ... ● Node Lifecycle Management ○ NotReady Nodes: Amount and Timestamp ○ SchedulingDisabled Nodes: Amount, Timestamp and Reason ○ CPU, Memory, Disk Usage ○ Network Status ○ PID, FD Status ○ ... ● Pod Lifecycle Management ○ Pod Creation Latency ○ Pod Terminating Latency ○ Pod Restart Times ○ Pod Resource Usage ○ Container Creation Latency ○ Container Terminating Latency ○ Container Exit Status ○ ... ● Metric Sources ○ K8S Key Components ○ Node Problem Detector ○ Kube State Metrics ○ Exporter ○ Assertion ○ ...
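As an illustration of the "NotReady Nodes" and "SchedulingDisabled Nodes" metrics, the sketch below derives both from Node objects in the shape returned by the Kubernetes API (`status.conditions` and `spec.unschedulable`). The function names and the sample data are assumptions made for this example; the deck does not show how eBay computes these values.

```python
from datetime import datetime, timezone


def not_ready_nodes(nodes):
    """Return (name, since) pairs for nodes whose Ready condition is not "True"."""
    result = []
    for node in nodes:
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None:
            # No Ready condition reported yet: count the node, timestamp unknown.
            result.append((name, None))
        elif ready.get("status") != "True":
            since = datetime.strptime(
                ready["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
            result.append((name, since))
    return result


def scheduling_disabled_nodes(nodes):
    """Return names of cordoned nodes (spec.unschedulable is true)."""
    return [
        n["metadata"]["name"]
        for n in nodes
        if n.get("spec", {}).get("unschedulable")
    ]


# Hypothetical sample data in the shape of Kubernetes Node objects.
sample_nodes = [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "Ready", "status": "True",
                                "lastTransitionTime": "2019-06-01T00:00:00Z"}]}},
    {"metadata": {"name": "node-b"}, "spec": {"unschedulable": True},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown",
                                "lastTransitionTime": "2019-06-02T03:04:05Z"}]}},
]

down = not_ready_nodes(sample_nodes)
print(len(down), [(name, since.isoformat()) for name, since in down])
print(scheduling_disabled_nodes(sample_nodes))
```

A Ready status of "Unknown" is counted as not ready here, which is a design choice of this sketch rather than something stated on the slide.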
9 .Assertion ● Cluster Health ● api-server ● vip-usage ● simulate workloads ● node ● kubeproxy-healthz ● etcd ● pod-connectivity ● ipam ● pod-failure ● real time analysis of workloads ● ecr ● cross-cluster netperf ● more...
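One way to picture an assertion is as a named probe that either passes or fails, with the results rolled up into a cluster health verdict. The sketch below is a minimal runner under that assumption; `run_assertions`, `cluster_healthy`, and the two fake checks are hypothetical names, not part of eBay's tooling.

```python
import time


def run_assertions(checks):
    """Run each named check and return one result record per check."""
    results = []
    for name, check in checks.items():
        started = time.monotonic()
        try:
            ok, detail = bool(check()), ""
        except Exception as exc:  # a crashing check counts as a failed assertion
            ok, detail = False, f"{type(exc).__name__}: {exc}"
        results.append({
            "assertion": name,
            "passed": ok,
            "seconds": time.monotonic() - started,
            "detail": detail,
        })
    return results


def cluster_healthy(results):
    """The cluster is healthy only if every assertion passed."""
    return all(r["passed"] for r in results)


# Hypothetical stand-ins for real probes such as etcd or pod-connectivity.
def fake_etcd_check():
    return True


def fake_pod_connectivity_check():
    raise TimeoutError("no reply from peer pod")


results = run_assertions({
    "etcd": fake_etcd_check,
    "pod-connectivity": fake_pod_connectivity_check,
})
for r in results:
    print(r["assertion"], r["passed"], r["detail"])
print("healthy:", cluster_healthy(results))
```

Recording the duration alongside the pass/fail flag lets the same run feed latency metrics as well as alerts.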
10 .How we build dashboards ● Cluster Dashboards (Cluster Prometheus) ○ Apiserver ○ ETCD ○ Node ○ Namespace ○ Pod ○ Service ○ Ingress ○ Storage ○ Network ○ Capacity ○ ... ● Global Dashboards (Federated Prometheus) ○ Global Health ○ Global Cluster Capacity ○ Global Alerts ○ Components Version ○ ...
11 .Global Health Dashboard
12 .Cluster Health Dashboard
13 .Node Health Dashboard
14 .How we manage alerts ● Alert Rules ○ Labels: Component, Severity, Cluster ○ Annotations: Description, Summary, Runbook, RCA, Execution Plan, Time Consumption ● Cluster Alert Dashboard ● Global Alert Dashboard
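Since every alert rule is expected to carry a fixed set of labels and annotations, a small lint step can catch rules that omit them before rollout. The sketch below assumes rules are already parsed into dictionaries in the Prometheus rule shape (`alert`, `expr`, `labels`, `annotations`). The lowercase key names and the choice of which annotations are mandatory are assumptions for illustration.

```python
# Assumed key names; the slide lists the fields but not their exact spelling.
REQUIRED_LABELS = {"component", "severity", "cluster"}
REQUIRED_ANNOTATIONS = {"description", "summary", "runbook"}


def missing_fields(rule):
    """Return the sorted label and annotation keys an alert rule is missing."""
    labels = set(rule.get("labels", {}))
    annotations = set(rule.get("annotations", {}))
    return {
        "labels": sorted(REQUIRED_LABELS - labels),
        "annotations": sorted(REQUIRED_ANNOTATIONS - annotations),
    }


def is_complete(rule):
    """True when no required label or annotation is missing."""
    gaps = missing_fields(rule)
    return not gaps["labels"] and not gaps["annotations"]


# Hypothetical rule, not taken from the deck.
sample_rule = {
    "alert": "NodeNotReady",
    "expr": 'kube_node_status_condition{condition="Ready",status="true"} == 0',
    "for": "5m",
    "labels": {"component": "node", "severity": "critical"},
    "annotations": {"summary": "Node is not ready",
                    "description": "Node has been NotReady for 5 minutes"},
}

print(missing_fields(sample_rule))
print(is_complete(sample_rule))
```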
15 .Global Alert Dashboard
16 .KubeWatch ● How to audit across clusters? ● How to execute complex queries quickly?
17 .KubeWatch Architecture
18 .KubeWatch Query Example:
1. Get all pods for namespace kube-system
    SELECT name AS podName,
           data->'metadata'->>'namespace' AS namespace,
           data->'spec'->>'nodeName' AS nodeName,
           meta,
           data
    FROM pods
    WHERE data->'metadata'->>'namespace' LIKE 'kube-system'
      AND deleted = FALSE;
2. Get all services of type load balancer
    SELECT name AS serviceName,
           data->'spec'->>'type' AS type,
           data->'metadata'->>'namespace' AS nameSpace
    FROM svcs
    WHERE data->'spec'->>'type' LIKE 'LoadBalancer';
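The `deleted = FALSE` filter suggests that rows are kept after an object is removed and only flagged. The sketch below mimics that behaviour in memory: Kubernetes watch events (`ADDED`, `MODIFIED`, `DELETED`) are folded into a table of rows, and a helper reproduces the first query. The table layout and function names are assumptions; the deck does not describe how KubeWatch ingests events.

```python
def apply_event(table, event):
    """Upsert the object on ADDED/MODIFIED; keep it but flag it on DELETED."""
    obj = event["object"]
    key = (obj["metadata"]["namespace"], obj["metadata"]["name"])
    table[key] = {
        "name": obj["metadata"]["name"],
        "data": obj,
        "deleted": event["type"] == "DELETED",
    }


def pods_in_namespace(table, namespace):
    """In-memory counterpart of query 1: live pods in one namespace."""
    return [
        {"podName": row["name"],
         "namespace": row["data"]["metadata"]["namespace"],
         "nodeName": row["data"].get("spec", {}).get("nodeName")}
        for row in table.values()
        if not row["deleted"]
        and row["data"]["metadata"]["namespace"] == namespace
    ]


def pod(namespace, name, node):
    """Build a minimal, hypothetical pod object for the demo."""
    return {"metadata": {"namespace": namespace, "name": name},
            "spec": {"nodeName": node}}


pods = {}
for event in [
    {"type": "ADDED", "object": pod("kube-system", "coredns", "node-a")},
    {"type": "ADDED", "object": pod("default", "web", "node-b")},
    {"type": "ADDED", "object": pod("kube-system", "old-proxy", "node-c")},
    {"type": "DELETED", "object": pod("kube-system", "old-proxy", "node-c")},
]:
    apply_event(pods, event)

print(pods_in_namespace(pods, "kube-system"))
```

Keeping deleted rows makes after-the-fact audits possible, at the cost of requiring every query to filter on the flag.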
19 .Monitoring Automation ● Roll out alert rules ● Roll out monitoring configurations ● Git Driven ○ Dev ○ Create PR ○ Review ○ GitHub Repo ○ Sync ○ GitLab Repo ○ Prometheus Alert Controller ● Controller Steps ○ 1. Get newest rules / configuration ○ 2. Get runtime rules / configuration ○ 3. Compare rules / configuration ○ 4. Update rules / configuration and reload
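The four controller steps amount to a reconcile loop: fetch the desired rule files, read what is currently on disk, compare, then write and reload only when something changed. The sketch below assumes the desired state has already been fetched from Git into a `{filename: content}` dictionary and that rule files use a `.yml` suffix. The `reload` callable is injected; against a real Prometheus it could send an HTTP POST to `/-/reload` (available when `--web.enable-lifecycle` is set) or a SIGHUP.

```python
import hashlib
import tempfile
from pathlib import Path


def digest(text):
    """Content hash used to compare desired and runtime files."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def reconcile(desired, runtime_dir, reload):
    """Sync runtime_dir to desired and reload once if anything changed.

    desired: {filename: content}. Returns the sorted list of changed filenames.
    """
    runtime_dir = Path(runtime_dir)
    changed = []
    for filename, content in desired.items():
        target = runtime_dir / filename
        current = target.read_text(encoding="utf-8") if target.exists() else None
        if current is None or digest(current) != digest(content):
            target.write_text(content, encoding="utf-8")
            changed.append(filename)
    # Remove rule files that no longer exist in the desired state.
    for stale in runtime_dir.glob("*.yml"):
        if stale.name not in desired:
            stale.unlink()
            changed.append(stale.name)
    if changed:
        reload()
    return sorted(changed)


reloads = []
with tempfile.TemporaryDirectory() as tmp:
    desired_rules = {"node.yml": "groups: []\n"}
    first = reconcile(desired_rules, tmp, lambda: reloads.append("reload"))
    second = reconcile(desired_rules, tmp, lambda: reloads.append("reload"))
    third = reconcile({}, tmp, lambda: reloads.append("reload"))

print(first, second, third, len(reloads))
```

The second call is a no-op because the runtime files already match, so Prometheus is not reloaded needlessly.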
20 .AIOps ● Real-time analysis and alerts ● Reducing MTTD and MTTR ● Stages ○ Data Preprocessing ○ Anomaly Detection ○ Correlation Analysis ○ Alert
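The slide does not say which anomaly detection method is used. As a stand-in, the sketch below flags points that deviate from a trailing window by more than a chosen number of standard deviations (a rolling z-score). The window size, the threshold, and the sample latencies are all assumptions for illustration.

```python
from statistics import mean, pstdev


def anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value is more than `threshold` sigmas away
    from the mean of the preceding `window` values."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            # Flat history: any change at all is treated as anomalous.
            if values[i] != mu:
                flagged.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical apiserver latencies in milliseconds.
latencies = [100, 102, 98, 101, 99, 100, 103, 450, 101, 100]
print(anomalies(latencies))
```

A trailing window keeps the detector usable on a live stream, though a single spike inflates the window's variance and can mask anomalies that follow shortly after.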
21 .We have • More than 30 clusters • Thousands of nodes • Including both BMs and VMs
22 .Overall
23 .Hardware Failures ● Sensors (non-intrusive detection) ○ TEMP ○ CPU ○ PSU ○ MEMORY ○ VOLT ○ HDD ○ FAN ● In OS (intrusive detection) ○ Kernel message ○ MCE message ○ Disk check ● Define a pattern for each known failure ● Define more patterns when a new failure is found ● Check whether the failure can be tolerated ● Mark the hardware as failed ● Get notified when hardware issues have been fixed
24 .Software Failures ● Health check for key components ○ Container Runtime ○ Kube* ○ Configurations ○ Key Services ○ Kernel soft lockup etc.
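The pattern-per-failure idea can be sketched as a table of regular expressions matched against kernel log lines, each tagged with whether the failure is tolerable. The three patterns, their `tolerable` flags, and the sample log lines below are illustrative assumptions, not eBay's actual definitions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FailurePattern:
    name: str
    regex: str
    tolerable: bool


# Illustrative patterns; new entries can be appended as new failures are found.
PATTERNS = [
    FailurePattern("mce", r"mce: \[Hardware Error\]", tolerable=False),
    FailurePattern("memory-corrected", r"EDAC .*\bCE\b", tolerable=True),
    FailurePattern("disk-io", r"I/O error, dev \w+", tolerable=False),
]


def classify(lines, patterns=PATTERNS):
    """Return the patterns matched by any log line, in pattern order."""
    return [
        p for p in patterns
        if any(re.search(p.regex, line) for line in lines)
    ]


def should_mark_failed(matches):
    """Mark the hardware as failed if any matched failure is not tolerable."""
    return any(not p.tolerable for p in matches)


# Hypothetical kernel log lines.
log_lines = [
    "EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0",
    "blk_update_request: I/O error, dev sdb, sector 123456",
]

matches = classify(log_lines)
print([p.name for p in matches], should_mark_failed(matches))
```

Separating classification from the mark-as-failed decision keeps the tolerance policy in one place, so a pattern can be downgraded or upgraded without touching the matching logic.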
25 .Thank You!
26 .Q&A