1. The Highs and Lows of Running a Distributed Database on Kubernetes
Presented by Alex Robinson / Systems Engineer @alexwritescode
2. Databases are critical to the applications that use them
3. You need to be very careful when making big changes to your database
4. Containers are a huge change
8. To succeed, you must:
   1. Understand your database
   2. Understand your orchestration system
   3. Plan for the worst
9. Let’s talk about databases in Kubernetes
• Why would you even want to run databases in Kubernetes?
• What do databases need to run reliably?
• What should you know about your orchestration system?
• What’s likely to go wrong and what can you do about it?
10. My experience with databases and containers
• Worked directly on Kubernetes and GKE from 2014 to 2016
  ○ Part of the original team that launched GKE
• Led all container-related efforts for CockroachDB from 2016 to 2019
  ○ Configurations for Kubernetes, DC/OS, Docker Swarm, even Cloud Foundry
  ○ AWS, GCP, Azure, On-Prem
  ○ From single availability zone deployments to multi-region
  ○ Helped users deploy and troubleshoot their custom setups
12. Why even bother? We’ve been operating databases for decades
13. Traditional management of databases
   1. Provision one or more beefy machines with large/fast disks
   2. Copy binaries and configuration onto machines
   3. Run binaries with provided configuration
   4. Never change anything unless absolutely necessary
14. Traditional management of databases
• Pros
  ○ Stable, predictable, understandable
• Cons
  ○ Most management is manual, especially to scale or recover from hardware failures
    ■ And that manual intervention may not be very well practiced
15. So why move state into Kubernetes?
• The same reasons you’d move stateless applications to Kubernetes
  ○ Automated deployment, scheduling, resource isolation, scalability, failure recovery, rolling upgrades
    ■ Less manual toil, less room for operator error
• Avoid separate workflows for stateless vs stateful applications
16. Challenges of managing state: “Understand your database”
19. What do stateful systems need?
• Process management
• Persistent storage
• If distributed, also:
  ○ Network connectivity
  ○ Consistent name/address
  ○ Peer discovery
20. Managing State on Kubernetes: “Understand your orchestration system”
21. Let’s skip over the basics
• Unless you want to manually pin pods to nodes, you should use either:
  ○ StatefulSet:
    ■ decouples replicas from nodes
    ■ persistent address for each replica, DNS-based peer discovery
    ■ network-attached storage instance associated with each replica
  ○ DaemonSet:
    ■ pin one replica to each node
    ■ use node’s disk(s)
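To make the "persistent address" point concrete, here is a minimal sketch of how a replica could compute its peers' addresses. It relies on the StatefulSet convention that pods are named `<statefulset>-<ordinal>` and are resolvable as `<pod>.<headless-service>.<namespace>.svc.<cluster-domain>`. The names `db` and `default`, the helper `peer_addresses`, and the default cluster domain `cluster.local` are assumptions for illustration; 26257 is CockroachDB's default port.

```python
def peer_addresses(statefulset, service, namespace, replicas, port,
                   cluster_domain="cluster.local"):
    """Return the stable DNS address of every replica in a StatefulSet.

    Assumes `service` is the headless Service named in the StatefulSet's
    serviceName field, and that the cluster uses the given DNS domain.
    """
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}:{port}"
        for ordinal in range(replicas)
    ]


if __name__ == "__main__":
    # Hypothetical three-replica StatefulSet called "db" in the "default" namespace.
    print(",".join(peer_addresses("db", "db", "default", replicas=3, port=26257)))
```

Because these names survive pod rescheduling, a list like this can be passed to each replica at startup as its join or seed list without knowing which node a peer lands on.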
22. Where do things go wrong?
25. Don’t trust the defaults!
• If you don’t specifically ask for persistent storage, you won’t get any
  ○ Always think about and specify where your data will live
    1. Data in container
    2. Data on host filesystem
    3. Data in network storage
26. Ask for a dynamically provisioned PersistentVolume
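As a rough illustration of what such a request looks like, the sketch below builds a PersistentVolumeClaim manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f -` accepts alongside YAML. The claim name `datadir`, the size `100Gi`, and the helper `persistent_volume_claim` are hypothetical; when `storageClassName` is omitted, the cluster's default StorageClass (if one is configured) provisions the volume.

```python
import json


def persistent_volume_claim(name, size, storage_class=None):
    """Build a PersistentVolumeClaim manifest for a dynamically provisioned volume."""
    claim = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # One node at a time may mount the volume read-write.
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }
    if storage_class is not None:
        # Leaving this field out falls back to the cluster's default class.
        claim["spec"]["storageClassName"] = storage_class
    return claim


if __name__ == "__main__":
    print(json.dumps(persistent_volume_claim("datadir", "100Gi"), indent=2))
```

In a StatefulSet the same `spec` would normally go under `volumeClaimTemplates`, so that each replica gets its own claim.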
27. Don’t trust the defaults!
• Now your data is persistent
• But how’s performance?
28. Don’t trust the defaults!
• If you don’t create and request your own StorageClass, you’re probably getting slow disks
  ○ Default on GCE is non-SSD (pd-standard)
  ○ Default on Azure is non-SSD (non-managed blob storage)
  ○ Default on AWS is gp2, which is backed by SSDs but offers fewer IOPS than io2
• This really affects database performance
29. Use a custom StorageClass
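A minimal sketch of what that could look like on GCE, again emitting JSON manifests from Python. It assumes the in-tree `kubernetes.io/gce-pd` provisioner with `type: pd-ssd`; clusters that use a CSI driver need a different provisioner name, and other clouds need their own provisioner and parameters. The class name `fast-ssd`, the claim name `datadir`, and both helper functions are hypothetical.

```python
import json


def ssd_storage_class(name="fast-ssd"):
    """Build a StorageClass that provisions SSD-backed GCE persistent disks.

    Assumption: the in-tree GCE PD provisioner is available in the cluster.
    """
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "kubernetes.io/gce-pd",
        "parameters": {"type": "pd-ssd"},
    }


def volume_claim_template(storage_class, size):
    """Build one entry for a StatefulSet's spec.volumeClaimTemplates list."""
    return {
        "metadata": {"name": "datadir"},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            # Explicitly request the custom class instead of the default one.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    sc = ssd_storage_class()
    print(json.dumps(sc, indent=2))
    print(json.dumps(volume_claim_template(sc["metadata"]["name"], "100Gi"), indent=2))
```

The key step is that the claim names the class explicitly; creating the StorageClass alone does not change what existing or default claims receive.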