Capacity Planning for your data stores

Percona首席传道者Colin Charles于2018年10月3日在加州帕萨迪纳举行的16倍规模会议上介绍了“数据存储容量规划”。

展开查看详情

1.Capacity planning for your data stores Colin Charles, Chief Evangelist, Percona Inc. colin.charles@percona.com / byte@bytebot.net http://bytebot.net/blog/ | @bytebot on Twitter SCALE16x, Pasadena, California, USA 10 March 2018

2.whoami • Chief Evangelist, Percona Inc • we make 100% open source tools, enhanced MySQL/MongoDB servers, XtraBackup, TokuDB, work on MyRocks/MongoRocks, Percona Toolkit and many more! • Founding team of MariaDB Server (2009-2016) • Formerly MySQL AB/Sun Microsystems • Past lives include Fedora Project (FESCO), OpenOffice.org • MySQL Community Contributor of the Year Award winner 2014

3.

4.License • Creative Commons BY-NC-SA 4.0 • https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode

5.Database, data store, etc. • Database: 1. a structured set of data held in a computer, especially one that is accessible in various ways. [Google] • Data store: A data store is a repository for persistently storing and managing collections of data which include not just repositories like databases, but also simpler store types such as simple files, emails etc. [Wikipedia]

6.Presto, the Distributed SQL Query Engine for Big Data • Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization. • Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte each per day.

7.Why capacity plan?

8.

9.

10.Revenue Management • Cannot sell more than you actually have • Seat map: theatre, planes • Rooms: types, quantity

11.

12.Uptime Percentile target Max downtime per year 90% 36 days 99% 3.65 days 99.5% 1.83 days 99.9% 8.76 hours 99.99% 52.56 minutes 99.999% 5.25 minutes 99.9999% 31.5 seconds

13.You can start now! • Start collecting metrics, NOW! • metric: standard of measurement • You need your baseline, your traffic patterns

14.Baseline • How well is your current infrastructure working? • what is your QPS? QPS before performance degradation? QPS before performance degradation affects user experience? • What more will you need, in the (near) future, to maintain acceptable performance? • load that causes failure - alerting? Add/remove capacity, what do you expect? When do you spin up new resources/size new orders? • How do you manage the resources? • Iterate!

15.MySQL world • Operating System • vmstat, netstat, df, ps, iostat, uptime • MySQL • SHOW [TABLE] STATUS, SHOW PROCESSLIST, INFORMATION_SCHEMA, PERFORMANCE_SCHEMA, slow query log, mytop/innotop

16.Working Set Size Estimation • http://www.brendangregg.com/wss.html • Size main memory for your database, with the intent of keeping it from swapping. Measure in bytes over an interval. • https://github.com/brendangregg/wss

17.A note on swap • On a machine with 32GB of RAM, and database stored on Intel 750 NVMe storage, a uniform sysbench workload gives about 44K QPS, 95% response time of 3.5ms (buffer pool=24GB) • Swapping when buffer pool=32GB, gives 20K QPS, response time of 9ms

18.More on swap • Try 48GB for a buffer pool (more than RAM), and you get 6K QPS, 35ms response time

19.MySQL 5.7 online buffer pool resize

20.Swappiness • https://www.percona.com/blog/2017/01/13/impact-of-swapping-on- mysql-performance/ • Don’t set vm.swappiness=0 with a modern kernel (3.5-rc1 or backports like CentOS 2.6.32-303) • https://www.percona.com/blog/2014/04/28/oom-relation-vm- swappiness0-new-kernel/ • Otherwise the OOM killer comes for you • vm.swappiness=1 is preferred nowadays

21.Sharding • Sharding • Split your data across multiple nodes • Sharding alone isn’t enough, you need ability to split reads/writes • Tools: ProxySQL, Vitess, Tumblr JetPants, Tungsten Replicator, SPIDER (MariaDB 10.3)

22.Database specific watch points • QPS (SELECTs, INSERTs, UPDATEs, DELETEs) • Open connections • Lag time between masters/slaves • Cache hit rates

23.Bottlenecks? • Bottleneck: reads or writes? • High CPU? • I/O? • Lag on replicas and the queries seem fine • Locking?

24.Context-based metrics • pt-query-digest: https://www.percona.com/doc/percona-toolkit/ 3.0/pt-query-digest.html • Analyse queries from logs, processlist, tcpdump • Box Anemometer: https://github.com/box/Anemometer • Analyse slow query logs to identify problematic queries • Commercial tools exist for this as well

25.Percona Monitoring & Management (PMM) • Query analytics + visualise it (w/sparklines, etc.) • Metrics monitor: OS & MySQL • Built on-top of open source: Prometheus, Consul, Grafana, Orchestrator • Get Docker container for “server”, get agent for “client” • http://pmmdemo.percona.com/

26.PMM

27.Understanding your workload better • Percona Lab Query Playback • https://github.com/Percona-Lab/query-playback • Query Playback is a tool for replaying the load of one database server to another • --slow-query-log --log-slow-admin-statements --log- slow-verbosity=microtime --long-query-time=0

28.Load balancing • Do you just pick a random database server? • Load balancing strategies matter • Strategy: • Pick 2 random servers • Machine has less load? • Send request

29.ProxySQL • Connection Pooling & Multiplexing • Read/Write Split and Sharding • Seamless failover (including query rerouting), load balancing • Query caching • Query rewriting • Query blocking (database aware firewall) • Query mirroring (cache warming) • Query throttling and timeouts • Runtime reconfigurable • Monitoring built-in