Percona XtraDB Cluster: Failure Scenarios and their Recovery

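Percona XtraDB Cluster (a.k.a. PXC) is a multi-master, high-availability clustering solution. Because it is multi-master, multiple guards are in place to protect the cluster from entering an inconsistent state. Most of these protections are configurable to suit the user's environment, but if they are not configured properly, the cluster can stall, fail, or error out.
In this session, we will go over some of the failure scenarios, such as the cluster entering a non-primary state due to network partitioning, the cluster getting paused due to flow control, data inconsistency causing a node to shut down, common problems during initial catch-up (a.k.a. State Snapshot Transfer (SST)), delayed purging of transactions, a blocking DDL causing the complete cluster to stall, a misconfigured cluster, etc.
We will also discuss how to solve some of these problems, or how to recover safely from these failures.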


1.Percona XtraDB Cluster: Failure Scenarios and their Recovery
Krunal Bauskar (PXC Lead, Percona)
Alkin Tezuysal (Sr. Technical Manager, Percona)

2.Who we are?
Krunal Bauskar
● Database enthusiast.
● Practicing databases (MySQL) for over a decade now.
● Wide interest in data handling and management.
● Worked on some real big data that powered applications @ Yahoo, Oracle, Teradata.
Alkin Tezuysal (@ask_dba)
● Open Source Database Evangelist
● Global Database Operations Expert
● Cloud Infrastructure Architect AWS
● Inspiring Technical and Strategic Leader
● Creative Team Builder
● Speaker, Mentor, and Coach
● Outdoor Enthusiast

3.Agenda
● Quick sniff at PXC
● Failure Scenarios and their recovery
● PXC Genie - You wish. We implement.
● Q&A

4.Quick Sniff at PXC

5.What is PXC?
● Multi-master
● Enhanced Security
● Network protection (Geo-distributed)
● Flexible topology
● Auto-node provisioning
● Performance tuned

6.Failure Scenarios and their recovery

7.Scenario: New node fails to connect to cluster

8.Scenario: New node fails to connect to cluster. Joiner log

9.Scenario: New node fails to connect to cluster. Joiner log. Administrator verifies that configuration settings such as IP addresses are sane and valid. DONOR log doesn't have any traces of JOINER trying to JOIN.

10.Scenario: New node fails to connect to cluster. Joiner log. Administrator verifies that configuration settings such as IP addresses are sane and valid. DONOR log doesn't have any traces of JOINER trying to JOIN. Still, JOINER fails to connect.

11.Scenario: New node fails to connect to cluster. Joiner log. Administrator verifies that configuration settings such as IP addresses are sane and valid. DONOR log doesn't have any traces of JOINER trying to JOIN. SELinux/AppArmor

12.Scenario: New node fails to connect to cluster. Joiner log. Don't confuse this error with an SST failure: the node has not yet been offered membership of the cluster, and SST comes only after membership.

13.Scenario: New node fails to connect to cluster
● Solution-1:
  ○ Set the mode to PERMISSIVE or DISABLED.

14.Scenario: New node fails to connect to cluster
● Solution-1:
  ○ Set the mode to PERMISSIVE or DISABLED.
● Solution-2:
  ○ Configure a policy that allows access in ENFORCING mode.
  ○ Related blogs
    ■ “Lock Down: Enforcing SELinux with Percona XtraDB Cluster”. It probes which permissions are needed and adds rules accordingly.
    ■ “Lock Down: Enforcing AppArmor with Percona XtraDB Cluster”
    ■ With this approach, SELinux can stay enabled. (You can also refer to the SELinux configuration notes on the Codership site.)
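Before choosing between the two solutions, it helps to know which mode the host is configured for. The following is a minimal sketch, assuming the standard `/etc/selinux/config` file format with a `SELINUX=` line; the function names are hypothetical, and the sample text is illustrative rather than taken from a real host.

```python
def configured_selinux_mode(config_text):
    """Return the SELINUX= value (lowercased) from config text, or None if absent."""
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "SELINUX":
            return value.strip().lower()
    return None


def may_block_joiner(mode):
    """Only enforcing mode denies access; permissive mode merely logs denials."""
    return mode == "enforcing"


if __name__ == "__main__":
    # Illustrative content; on a real host, read /etc/selinux/config instead.
    sample = "# sample only\nSELINUX=enforcing\nSELINUXTYPE=targeted\n"
    mode = configured_selinux_mode(sample)
    print(mode, may_block_joiner(mode))
```

If the mode is enforcing and the JOINER cannot connect, either relax the mode (Solution-1) or add policy rules (Solution-2). This sketch reads only the persistent configuration; the mode currently in effect can differ and is reported by the `getenforce` command.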

15.Scenario: New node fails to connect to cluster. PXC can operate with SELinux/AppArmor.

16.Scenario: Catching up cluster (SST, IST)

17.Scenario: Catching up cluster (SST, IST)
● SST: complete copy-over of the data directory.
  ○ SST has multiple external components: the SST script, XB (Percona XtraBackup), network aspects, etc. Some of these are outside the control of the PXC process.
● IST: transfer of only the missing write-sets (as the node is already a member of the cluster).
  ○ Intrinsic to the PXC process space.
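The difference between the two catch-up paths can be sketched as a small decision function. This is a simplified model, not PXC's actual implementation: it assumes that IST is possible only when the JOINER shares the cluster's state UUID and the DONOR still caches every write-set the JOINER is missing. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JoinerState:
    cluster_uuid: str
    seqno: int  # last applied write-set; -1 means unknown (e.g. a fresh node)


@dataclass
class DonorCache:
    cluster_uuid: str
    first_seqno: int  # oldest write-set still cached on the donor
    last_seqno: int   # newest write-set committed on the donor


def choose_transfer(joiner, donor):
    """Return 'SST', 'IST' or 'NONE' under the simplified model described above."""
    if joiner.cluster_uuid != donor.cluster_uuid or joiner.seqno < 0:
        return "SST"  # unknown or foreign state: a full copy is needed
    if joiner.seqno > donor.last_seqno:
        raise ValueError("joiner is ahead of the donor; not covered by this sketch")
    if joiner.seqno == donor.last_seqno:
        return "NONE"  # nothing is missing
    if joiner.seqno + 1 >= donor.first_seqno:
        return "IST"  # every missing write-set is still cached
    return "SST"  # the gap is older than the donor's cache
```

For example, a node that was briefly down and is only a few write-sets behind falls into the IST branch, while a brand-new node (seqno of -1) always takes the SST branch.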

18.Scenario: Catching up cluster (SST, IST) #1: Joiner log

19.Scenario: Catching up cluster (SST, IST) #1: Joiner log. SST failed on DONOR.

20.Scenario: Catching up cluster (SST, IST) #1: Joiner log. SST failed on DONOR: wsrep_sst_auth not set on DONOR.

21.Scenario: Catching up cluster (SST, IST) #1: Joiner log. wsrep_sst_auth should be set on the DONOR (users often set it on the JOINER, and things still fail). After SST, the JOINER copies the said user over from the DONOR.
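A quick pre-flight check on the DONOR's configuration can catch this case before an SST is attempted. The sketch below assumes the DONOR's options have already been read into a dictionary (for example from its my.cnf) and that `wsrep_sst_auth` uses the `user:password` form; the function name and the messages are hypothetical.

```python
def check_donor_sst_auth(donor_options):
    """Return a problem description, or None if wsrep_sst_auth looks usable.

    donor_options: mapping of option name to value, as read from the
    DONOR's configuration (not the JOINER's).
    """
    value = donor_options.get("wsrep_sst_auth", "").strip().strip("\"'")
    if not value:
        return "wsrep_sst_auth is not set on the DONOR"
    user, sep, password = value.partition(":")
    if not sep or not user or not password:
        return "wsrep_sst_auth should look like user:password"
    return None


if __name__ == "__main__":
    # Illustrative values only.
    print(check_donor_sst_auth({}))
    print(check_donor_sst_auth({"wsrep_sst_auth": "sstuser:s3cret"}))
```

The check covers only whether the value is present and well-formed; whether the user exists and has the right privileges is a separate question.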

22.Scenario: Catching up cluster (SST, IST) #2: Donor log

23.Scenario: Catching up cluster (SST, IST) #2: Donor log. Possible causes:
● The specified wsrep_sst_auth user doesn't exist.
● Credentials are wrong.
● Insufficient privileges.
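The third cause can be checked by comparing what the SST user has been granted against what the SST method needs. The required set below is an assumption: it is the privilege list documented for the xtrabackup-based SST user in PXC 5.7, so verify it against the documentation for your version. The function name is hypothetical.

```python
# Assumption: privileges documented for the SST user in PXC 5.7 (xtrabackup-based SST).
REQUIRED_SST_PRIVILEGES = frozenset(
    {"RELOAD", "LOCK TABLES", "PROCESS", "REPLICATION CLIENT"}
)


def missing_sst_privileges(granted):
    """Return the required privileges absent from `granted`, sorted by name.

    granted: iterable of privilege names, e.g. collected from SHOW GRANTS output.
    """
    normalized = {name.strip().upper() for name in granted}
    if "ALL PRIVILEGES" in normalized:
        return []
    return sorted(REQUIRED_SST_PRIVILEGES - normalized)


if __name__ == "__main__":
    # Illustrative input: a user that was granted only two privileges.
    print(missing_sst_privileges(["RELOAD", "PROCESS"]))
```

An empty result rules out the privileges cause, leaving a missing user or wrong credentials to investigate.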

24.Scenario: Catching up cluster (SST, IST) #3: Joiner log

25.Scenario: Catching up cluster (SST, IST) #3: Joiner log. Trying to get an old-version JOINER to join from a new-version DONOR (not supported). The opposite is naturally allowed.
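The version rule can be expressed as a simple comparison. This is a sketch under a simplifying assumption: versions are compared on their leading `major.minor.patch` numbers only, and any build suffix is ignored. The function names and the sample version strings are illustrative.

```python
import re


def parse_version(text):
    """Extract (major, minor, patch) from the start of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def joiner_version_ok(donor_version, joiner_version):
    """True when the JOINER is the same version as the DONOR or newer."""
    return parse_version(joiner_version) >= parse_version(donor_version)


if __name__ == "__main__":
    print(joiner_version_ok("5.7.25", "5.7.23"))  # older JOINER
    print(joiner_version_ok("5.7.23", "5.7.25"))  # newer JOINER
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "5.7.9" would sort after "5.7.25".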

26.Scenario: Catching up cluster (SST, IST) #4: Donor log, Joiner log

27.Scenario: Catching up cluster (SST, IST) #4: Donor log, Joiner log. WSREP_SST: [WARNING] wsrep_node_address or wsrep_sst_receive_address not set. Consider setting them if SST fails.
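To act on this warning, one can check whether the two options named in it are present in the node's configuration. The sketch below uses a deliberately minimal reader for my.cnf-style text: it looks only at the `[mysqld]` section, skips `!include` directives, and treats dashes and underscores in option names as equivalent. The function names and the sample address are hypothetical.

```python
def read_mysqld_options(cnf_text):
    """Collect key/value options from the [mysqld] section of my.cnf-style text."""
    options = {}
    section = None
    for raw in cnf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;!":
            continue  # blank lines, comments, !include directives
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
            continue
        if section != "mysqld":
            continue
        key, _, value = line.partition("=")
        name = key.strip().replace("-", "_").lower()
        options[name] = value.strip().strip("\"'")
    return options


def unset_address_options(cnf_text):
    """Return which of the two address options named in the warning are unset."""
    options = read_mysqld_options(cnf_text)
    names = ("wsrep_node_address", "wsrep_sst_receive_address")
    return [name for name in names if not options.get(name)]


if __name__ == "__main__":
    # Illustrative configuration; 192.0.2.10 is a documentation-only address.
    sample = "[mysqld]\nwsrep-node-address = 192.0.2.10\n"
    print(unset_address_options(sample))
```

A non-empty result on a node whose SST fails suggests setting the listed options explicitly, as the warning recommends.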

28.Scenario: Catching up cluster (SST, IST) #5

29.Scenario: Catching up cluster (SST, IST) #5: Faulty SSL configuration
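One way to narrow down a faulty SSL configuration is to confirm that the files on each node at least load correctly. The sketch below uses Python's standard `ssl` module to check that a CA file is readable and that a certificate matches its private key. It does not verify that the certificate was signed by that CA, nor that all nodes use consistent files; the function name and the paths are hypothetical.

```python
import ssl


def local_tls_problems(ca_file, cert_file, key_file):
    """Return a list of problems found while loading the given PEM files."""
    problems = []
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        context.load_verify_locations(cafile=ca_file)
    except OSError as exc:  # covers missing files and ssl.SSLError
        problems.append(f"CA file {ca_file!r}: {exc}")
    try:
        # Raises ssl.SSLError when the private key does not match the certificate.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    except OSError as exc:
        problems.append(f"certificate/key {cert_file!r}, {key_file!r}: {exc}")
    return problems


if __name__ == "__main__":
    # Hypothetical paths; substitute the files referenced in your configuration.
    for problem in local_tls_problems("ca.pem", "server-cert.pem", "server-key.pem"):
        print(problem)
```

Running the same check on the DONOR and the JOINER with the paths from each node's configuration helps separate a broken file from a mismatch between nodes.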