Capacity Planning for your data stores

在这场网络研讨会上,科林使用了一个售票网站,该网站提供“正常”活动,如M2M音乐会,但偶尔也会出售哈利波特戏剧表演的门票。
卖票要求你卖的票不要超过你的实际数量。您希望负载平衡您的查询。您希望共享数据存储。您可能需要拆分读写。您需要确定系统瓶颈在哪里,因此您需要一个基线,并知道什么是常规的流量模式。
这是一个完美的容量规划示例,因为您不希望购买在大部分时间内都不做任何事情的服务器。像这样的例子就是为什么云在今天如此流行。虽然本文的重点不是帮助规划应用服务器的加载和缓存,但是数据层绝对是一件很难处理的事情。
除此之外,我们将讨论OLTP和数据仓库使用的存储容量规划。
从度量集合中,您可以计划您的需求。将其与云的弹性特性结合起来,就永远不会有“建立数据库连接时出错”的情况。
本次讨论涉及的工具包括(但不限于):Box风速计、InnoTop、Slow Query Log、Percona工具箱(pt query digest)、vmstat、Facebook的Prophet和Percona监控和管理(PMM)。

展开查看详情

1.Capacity planning for your data stores Colin Charles, Chief Evangelist, Percona Inc. colin.charles@percona.com / byte@bytebot.net http://bytebot.net/blog/ | @bytebot on Twitter Percona Webminar 15 December 2017

2. whoami • Chief Evangelist (in the CTO office), Percona Inc • we make 100% open source tools, enhanced MySQL/MongoDB servers, XtraBackup, TokuDB, work on MyRocks/MongoRocks, Percona Toolkit and many more! • Founding team of MariaDB Server (2009-2016) • Formerly MySQL AB/Sun Microsystems • Past lives include Fedora Project (FESCO), OpenOffice.org • MySQL Community Contributor of the Year Award winner 2014

3.

4. License • Creative Commons BY-NC-SA 4.0 • https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode

5. Database, data store, etc. • Database: 1. a structured set of data held in a computer, especially one that is accessible in various ways. [Google] • Data store: A data store is a repository for persistently storing and managing collections of data which include not just repositories like databases, but also simpler store types such as simple files, emails etc. [Wikipedia]

6. Presto, the Distributed SQL Query Engine for Big Data • Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization. • Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte each per day.

7.Why capacity plan?

8.

9.

10.

11. Revenue Management • Cannot sell more than you actually have • Seat map: theatre, planes • Rooms: types, quantity

12.

13. Uptime Percentile target Max downtime per year 90% 36 days 99% 3.65 days 99.5% 1.83 days 99.9% 8.76 hours 99.99% 52.56 minutes 99.999% 5.25 minutes 99.9999% 31.5 seconds

14. You can start now! • Start collecting metrics, NOW! • metric: standard of measurement • You need your baseline, your traffic patterns

15. Baseline • How well is your current infrastructure working? • what is your QPS? QPS before performance degradation? QPS before performance degradation affects user experience? • What more will you need, in the (near) future, to maintain acceptable performance? • load that causes failure - alerting? Add/remove capacity, what do you expect? When do you spin up new resources/size new orders? • How do you manage the resources? • Iterate!

16. MySQL world • Operating System • vmstat, netstat, df, ps, iostat, uptime • MySQL • SHOW [TABLE] STATUS, SHOW PROCESSLIST, INFORMATION_SCHEMA, PERFORMANCE_SCHEMA, slow query log, mytop/innotop

17. Sharding • Sharding • Split your data across multiple nodes • Sharding alone isn’t enough, you need ability to split reads/writes • Tools: ProxySQL, Vitess, Tumblr JetPants, Tungsten Replicator

18. Database specific watch points • QPS (SELECTs, INSERTs, UPDATEs, DELETEs) • Open connections • Lag time between masters/slaves • Cache hit rates

19. Bottlenecks? • Bottleneck: reads or writes? • High CPU? • I/O? • Lag on replicas and the queries seem fine • Locking?

20. Context-based metrics • pt-query-digest: https://www.percona.com/doc/percona-toolkit/ 3.0/pt-query-digest.html • Analyse queries from logs, processlist, tcpdump • Box Anemometer: https://github.com/box/Anemometer • Analyse slow query logs to identify problematic queries • Commercial tools exist for this as well

21.Percona Monitoring & Management (PMM) • Query analytics + visualise it (w/sparklines, etc.) • Metrics monitor: OS & MySQL • Built on-top of open source: Prometheus, Consul, Grafana, Orchestrator • Get Docker container for “server”, get agent for “client” • http://pmmdemo.percona.com/

22.PMM

23.Understanding your workload better • Percona Lab Query Playback • https://github.com/Percona-Lab/query-playback • Query Playback is a tool for replaying the load of one database server to another • --slow-query-log --log-slow-admin-statements --log- slow-verbosity=microtime --long-query-time=0

24. Load balancing • Do you just pick a random database server? • Load balancing strategies matter • Strategy: • Pick 2 random servers • Machine has less load? • Send request

25. ProxySQL • Connection Pooling & • Query blocking (database Multiplexing aware firewall) • Read/Write Split and Sharding • Query mirroring (cache warming) • Seamless failover (including query rerouting), load balancing • Query throttling and timeouts • Query caching • Runtime reconfigurable • Query rewriting • Monitoring built-in

26. ProxySQL comparison • http://www.proxysql.com/compare

27. Storage capacity planning • Small single server deployment: 3-4x working capacity is not a bad option • size of database and data files (/var/lib/mysql) • size of largest table * 2 (for tmp/sort files) • size of each local logical backup • 5% free for OS • The above may not necessarily make sense for large scale deployments

28. Prophet • Works by fitting time-series data to get a prediction of how that metric will look in future • Generalised Additive Model • Linear or logistic regression + additive model applied to regression • Paper: https://facebookincubator.github.io/prophet/static/prophet_paper_20170113.pdf • Tip: have at least a year of data to fit the model (you may miss seasonal effects otherwise) • Tip: holidays (https://facebookincubator.github.io/prophet/docs/holiday_effects.html) • Our evaluation: https://www.percona.com/blog/2017/03/20/prophet-forecasting-our- metrics-or-predicting-the-future/

29. Auto-scaling frameworks • Scalr • Amazon • Vertical: grow the instance • Horizontal: replicas • EC2: auto scaling + groups • Amazon RDS Aurora, Google Cloud Spanner, Azure Cosmos DB