Billion Goods in Few Categories:how Histograms Save a Life

我们存储数据的目的是使用它:搜索,检索,分组,排序…为了有效地执行这些操作,MySQL存储引擎在编译查询执行计划时索引数据并与优化器通信统计信息。这种方法非常有效,除非您的数据分布不均匀。
去年,我研究了几张门票,其中的数据遵循相同的模式:数以百万计的流行产品分为两类,其余的则使用其余的。我们很难找到快速取回货物的解决方案。提供了5.7版的解决方案。然而,新的MySQL8.0特性:柱状图,-将工作得更好,更干净,更快。谈话的想法就是这样产生的。
我会讨论的
-如何物理存储索引统计信息
-与优化器交换哪些数据
-为什么做正确的索引选择是不够的
最后,我将解释哪些问题可以解决柱状图,以及为什么使用索引统计数据不足以快速检索不均匀分布的数据。

展开查看详情

1.Billion Goods in Few Categories: how Histograms Save a Life? November, 7, 2018 Sveta Smirnova

2. Table of Contents •The Case •The Cardinality: Two Levels •ANALYZE TABLE Limitations •Solutions in Percona Server 5.7 •Histograms •Conclusion 2

3. Sveta Smirnova • MySQL Support engineer • Author of • MySQL Troubleshooting • JSON UDF functions • FILTER clause for MySQL • Speaker • Percona Live, OOW, Fosdem, DevConf, HighLoad... 3

4. Everything can be Resolved! • Hardware • Wise options • Optimized queries • Brain 4

5. Not Everything • This talk is about • How I spent last two years • Resolving the same issue • For different customers 5

6. Not Everything • This talk is about • How I spent last two years • Resolving the same issue • For different customers • Task was to speed up the query 5

7. Not All the Queries can be Optimized • Specific data distribution • Access on different fields • ON clause • WHERE clause • GROUP BY • ORDER BY • Index cannot be used effectively 6

8. Disclaimer • Topic based on real Support cases • Couple of them are still in progress 7

9. Disclaimer • Topic based on real Support cases • All examples are 100% fake • They created such that • No customer can be identified • Everything generated Table names Column names Data • Use case itself is fictional 7

10. Disclaimer • Topic based on real Support cases • All examples are 100% fake • All examples are simplified • Only columns, required to show the issue • Everything extra removed • Real tables usually store much more data 7

11. Disclaimer • Topic based on real Support cases • All examples are 100% fake • All examples are simplified • All disasters happened with version 5.7 7

12.The Case

13. Two tables • categories • Less than 20 rows 9

14. Two tables • categories • Less than 20 rows • goods • More than 1M rows • 20 unique cat id values • Many other fields Price Date: added, last updated, etc. Characteristics Store ... 9

15. JOIN select * from goods join categories on (categories.id=goods.cat_id) where date_added between ’2018-07-01’ and ’2018-08-01’ and cat_id in (16,11) and price >= 1000 and <=10000 [ and ... ] [ GROUP BY ... [ORDER BY ... [ LIMIT ...]]] ; 10

16. Option 1: Select from the Small Table First • Select from the Small Table 11

17. Option 1: Select from the Small Table First • Select from the Small Table • For each cat id select from the large table 11

18. Option 1: Select from the Small Table First • Select from the Small Table • For each cat id select from the large table • Filter result on date added[ and price[...]] 11

19. Option 1: Select from the Small Table First • Select from the Small Table • For each cat id select from the large table • Filter result on date added[ and price[...]] • Slow with many items in the category 11

20. Option 2: Select from the Large Table First • Filter rows by date added[ and price[...]] 12

21. Option 2: Select from the Large Table First • Filter rows by date added[ and price[...]] • Get cat id values 12

22. Option 2: Select from the Large Table First • Filter rows by date added[ and price[...]] • Get cat id values • Retrieve rows from the small table 12

23. Option 2: Select from the Large Table First • Filter rows by date added[ and price[...]] • Get cat id values • Retrieve rows from the small table • Slow if number of rows, filtered by date added, is larger than number of goods in the selected categories 12

24. What if use Combined Indexes? • CREATE INDEX index everything (cat id, date added[, price[, ...]]) • It resolves the issue 13

25. What if use Combined Indexes? • CREATE INDEX index everything (cat id, date added[, price[, ...]]) • It resolves the issue • But not in all cases 13

26. The Problem • Maintenance cost • Slower INSERT/UPDATE/DELETE • Disk space 14

27. The Problem • Maintenance cost • Slower INSERT/UPDATE/DELETE • Disk space • Tables may have wrong cardinality 14

28.The Cardinality: Two Levels

29. MySQL is Layered Architecture • Optimizer • Engine • TokuDB • InnoDB • Any 16