PingCAP-Infra-Meetup-105-Chaos practice in TiDB

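This talk by 杜川 (Du Chuan) has three parts: 1. Starting from an analysis of existing Streaming and Batch systems, it discusses the similarities and differences between Streaming and Batch in data processing, pins down the core nature of Streaming, explores whether unified Streaming and Batch processing is possible and necessary, and briefly analyzes existing systems of this kind. 2. It briefly reviews the execution flow of the classic Volcano model in RDBMSs, then discusses the difficulties of supporting Streaming processing on an RDBMS and the key elements of Streaming SQL design. 3. It introduces the design approach and architecture of TBSSQL and the choices made on several key technical points, and shows a running demo of TBSSQL. Using TBSSQL as an example, it also outlines the general approach and entry points for adding a feature to TiDB.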

1.Chaos practice in TiDB PingCAP 舒科 (Shu Ke) June 1, 2019

2.Testing in TiDB ● Unit Testing ● Integration Testing ● Performance Testing ● Schrodinger Testing ○ Chaos Testing

3.content ● Why Chaos ● Practice in TiDB ● Schrodinger

4.Why Chaos ● Use fault injection ● by Netflix 2010 ○ Break things ● Why 2010 ○ Netflix moved to AWS ○ lots of errors ■ hardware ■ network latency ■ ...

5.Why Chaos (cont.) ● Goal: ○ Make system stronger ● Steps

6.Why Chaos (cont.) ● Samples ○ Chaos in EMC ■ Robot to remove hard disk in BMW POC ○ Chaos in Facebook ■ shut down a data center ● lack of Chaos ○ 737 MAX

7.Why Chaos (cont.) ● Microservices ○ Too complex to understand ● Errors always happen ● Do Chaos to gain confidence

8.Why Chaos (cont.)

9.Why Chaos (cont.) ● ETCD bug ● RocksDB bug ● Leader partitioned ● Transfer leader if busy ● Too many regions ● Crashed when processing batch raft ● ...

10.● Why Chaos ● Practice in TiDB ● Schrodinger

11.Chaos practice in TiDB

12.Chaos practice in TiDB (cont.) ● Region heartbeats ○ check what happens when a huge number of regions are on one machine ○ choose metric: CPU ○ Hypothesis: ■ CPU is still low ○ Experiments ■ 40k regions on a machine ○ What happened? ■ OOM ■ 30% CPU occupied

13.Chaos practice in TiDB (cont.) ● Choose Metrics ○ often QPS ○ CPU ○ memory ● Hypothesis ○ QPS reverts to previous level in X seconds ○ QPS drops by 1/x
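The hypothesis "QPS reverts to previous level in X seconds" can be turned into a mechanical check. The following is only a minimal sketch: the sample format, the function names `recovery_time` and `hypothesis_holds`, and the 10% tolerance are assumptions for illustration, not something taken from the talk or from TiDB's tooling.

```python
from typing import Optional, Sequence, Tuple

# Assumed sample format: (timestamp in seconds, observed QPS).
Sample = Tuple[float, float]


def recovery_time(
    samples: Sequence[Sample],
    baseline_qps: float,
    fault_at: float,
    tolerance: float = 0.1,
) -> Optional[float]:
    """Seconds from fault injection until QPS is back near baseline for good.

    "Near baseline" means at least (1 - tolerance) * baseline_qps.
    Returns None if QPS is still below that level at the last sample.
    """
    threshold = baseline_qps * (1.0 - tolerance)
    recovered_at: Optional[float] = None
    for ts, qps in samples:
        if ts < fault_at:
            continue  # ignore samples taken before the fault
        if qps >= threshold:
            if recovered_at is None:
                recovered_at = ts  # first sample of a healthy streak
        else:
            recovered_at = None  # dipped again, restart the streak
    if recovered_at is None:
        return None
    return recovered_at - fault_at


def hypothesis_holds(
    samples: Sequence[Sample],
    baseline_qps: float,
    fault_at: float,
    max_recovery_seconds: float,
    tolerance: float = 0.1,
) -> bool:
    """True if QPS returned to the previous level within X seconds."""
    elapsed = recovery_time(samples, baseline_qps, fault_at, tolerance)
    return elapsed is not None and elapsed <= max_recovery_seconds
```

In practice the samples would come from whatever monitoring system records QPS during the experiment; the check itself stays the same.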

14.Chaos practice in TiDB (cont.) Error injection

15.Chaos practice in TiDB (cont.) ● Applications ○ kill, kill -9 ○ renice ○ sigstop, sigcont
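The application-level faults listed on this slide map directly onto POSIX signals and process priorities. As a rough sketch (the helper names are hypothetical, and this is not the tooling used at PingCAP), they could be injected from Python's standard library like this:

```python
import os
import signal
import time


def kill_soft(pid: int) -> None:
    """Equivalent of `kill <pid>`: ask the process to terminate (SIGTERM)."""
    os.kill(pid, signal.SIGTERM)


def kill_hard(pid: int) -> None:
    """Equivalent of `kill -9 <pid>`: terminate without any cleanup (SIGKILL)."""
    os.kill(pid, signal.SIGKILL)


def deprioritize(pid: int, niceness: int = 19) -> None:
    """Equivalent of `renice`: starve the process of CPU by raising its nice value."""
    os.setpriority(os.PRIO_PROCESS, pid, niceness)


def stall_for(pid: int, seconds: float) -> None:
    """Freeze the process with SIGSTOP, then let it continue with SIGCONT.

    To its peers the process looks hung rather than dead, which exercises
    timeout and leader-election paths instead of crash-recovery paths.
    """
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        os.kill(pid, signal.SIGCONT)  # always resume, even if interrupted
```

These calls only work on Unix-like systems and only on processes the caller is allowed to signal.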

16.Chaos practice in TiDB (cont.) ● Memory ○ cgroup ● Storage ○ fuse ○ rm -rf ● Network ○ tc ○ iptables
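For the network faults, `tc` (with its netem queueing discipline) can add latency or packet loss, and `iptables` can drop traffic to and from a peer to simulate a partition. The sketch below only builds the command lines, since running them needs root; the function names, the device name, and the peer address in the usage note are placeholders chosen for illustration.

```python
from typing import List

Command = List[str]  # argv-style command, suitable for subprocess.run


def add_delay(dev: str, delay_ms: int) -> Command:
    """Add a fixed delay to all packets leaving `dev` using netem."""
    return ["tc", "qdisc", "add", "dev", dev, "root", "netem",
            "delay", f"{delay_ms}ms"]


def add_loss(dev: str, percent: int) -> Command:
    """Randomly drop `percent`% of packets leaving `dev` using netem."""
    return ["tc", "qdisc", "add", "dev", dev, "root", "netem",
            "loss", f"{percent}%"]


def clear_netem(dev: str) -> Command:
    """Remove the root qdisc on `dev`, undoing add_delay or add_loss."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]


def partition_from(peer_ip: str) -> List[Command]:
    """Drop all traffic to and from one peer, isolating it from this host."""
    return [
        ["iptables", "-A", "INPUT", "-s", peer_ip, "-j", "DROP"],
        ["iptables", "-A", "OUTPUT", "-d", peer_ip, "-j", "DROP"],
    ]


def heal_partition(peer_ip: str) -> List[Command]:
    """Delete the rules added by partition_from."""
    return [
        ["iptables", "-D", "INPUT", "-s", peer_ip, "-j", "DROP"],
        ["iptables", "-D", "OUTPUT", "-d", peer_ip, "-j", "DROP"],
    ]
```

A caller with root privileges could run, for example, `subprocess.run(add_delay("eth0", 100), check=True)`, and should always pair each fault with its matching cleanup command.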

17.Chaos practice in TiDB (cont.) ● other errors ○ ETCD key deleted ○ NTP errors ○ ...

18.Chaos practice in TiDB (cont.) ● Observe results ○ Learn from history

19.Chaos practice in TiDB (cont.) ● Observe results ○ Learn from log

20.Chaos practice in TiDB (cont.) ● Automation ○ Take some machines from SRE ○ Deploy ○ Experiment ○ Debug ○ Return
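The automation steps above form a simple lifecycle whose main requirement is that borrowed machines are always returned. One way to sketch it, assuming each step is supplied as a callback (all five parameter names are hypothetical and stand in for real provisioning and deployment tooling):

```python
from typing import Callable, List

Machines = List[str]  # assumed representation: a list of host names


def run_experiment(
    acquire: Callable[[], Machines],
    deploy: Callable[[Machines], None],
    experiment: Callable[[Machines], bool],
    collect_debug: Callable[[Machines], None],
    release: Callable[[Machines], None],
) -> bool:
    """Run one chaos experiment: take machines, deploy, experiment, debug, return.

    Returns True if the experiment's hypothesis held. Debug information is
    collected only on failure, and machines are released in every case.
    """
    machines = acquire()
    try:
        deploy(machines)
        ok = experiment(machines)
        if not ok:
            collect_debug(machines)  # keep logs before the cluster is torn down
        return ok
    finally:
        release(machines)  # runs even if deploy or experiment raises
```

The `try`/`finally` is the important design choice: a failed deployment or a crashing experiment must not leak machines.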

21.● Why Chaos ● Practice in TiDB ● Schrodinger

22.Schrodinger

23.Schrodinger (cont.)

24.Schrodinger (cont.)

25.Schrodinger (cont.)

26.Schrodinger (cont.) cat

27.Schrodinger (cont.)

28.Schrodinger (cont.) Chaos Operator

29.Schrodinger (cont.) Run with your own Helm charts

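TiDB is a converged database product positioned for Hybrid Transactional/Analytical Processing (HTAP). Its key features include one-click horizontal scaling, strongly consistent multi-replica data safety, distributed transactions, and real-time OLAP.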