- 快召唤伙伴们来围观吧
- 微博 QQ QQ空间 贴吧
- 文档嵌入链接
- 复制
- 微信扫一扫分享
- 已成功复制到剪贴板
PingCAP-Infra-Meetup-92-xiehaibin-TiDB+Statistics
展开查看详情
1 .Introduction To TiDB Statistics TiDB Internals Presented by Haibin Xie
2 .Part I - Overview
3 .Why we need stats ● To choose the best plan among the generated plans, we need an evaluation of best ● To estimate the cost, the most important factor is number of rows processed ● To estimate the row count, we need stats PingCAP.com
4 .Basic components ● Table Level ○ Row count ○ Modify count ● Column Level ○ Histogram ○ Count-Min Sketch ○ Distinct count ○ Null count ○ Avg col size ○ Correlation PingCAP.com
5 .Part II - Selectivity Eestimation
6 .Histogram ● For range queries ○ select count(*) from t where a > 1 and a < 10 ● Equal depth, non-overlap boundaries ○ One value could only contains in one bucket ● The boundaries of indices are encoded bytes of column values PingCAP.com
7 .Query in histogram ● Fully covered bucket ○ Just add the bucket depth ● Partially covered bucket ○ Uniform distribution assumption ○ rowCount(1.7, 1.9) = rowCount(1.6, 1.9) * (1.9-1.7) / (1.9-1.6) ○ For bytes array, remove their common prefixes, and treat 8 signficant bytes as uint64 ■ rowCount(aa, ab) = rowCount(aa, ac)*(b-a)/(c-a) PingCAP.com
8 .Count-Min Sketch ● For equality queries ○ select count(*) from t where a = 1 ● Table of d*w ● Insertion ○ d indepent hash functions F ○ Each function Fi hashing values Vi of range [0, w) ○ Add 1 in (i, Vi) ● Queries ○ Min of the d places PingCAP.com
9 .Multi-column selectivity ● select count(*) from t where a = 1 and b < 1 and c > 1 ● Independent assumptions ○ sel(a = 1 and b < 1 and c >1) = sel(a = 1) * sel(b < 1) * sel(c > 1) ● Make use of index stats to reduce assumptions ○ For index(a, b), sel(a = 1 and b < 1) or sel(a = 1 and b =1) does not need independent assumptions ● Use least number of column or indices to cover all the conditions PingCAP.com
10 .Multi-column selectivity ● Choose the indices or columns greedily ○ Always prefer the primary key or index ○ The number of expression that it covers, the more the better ○ The number of columns that it contains, the less the better ● select count(*) from t where a = 1 and b < 1 and c > 1 and d > 1 ○ schema: index(a,b), index(a,b,c), primary key(c) ○ Index(a,b) -> primary key(c) -> column d PingCAP.com
11 .When it fails ● Stats is outdated, independent assumption breaks... ● Range queries becomes equality queries ○ Index (a,b), Bucket [(10000, 1), (20000,2)] ○ Query (a = 15000 and b <1) is nearly a point inside the bucket ○ Fixed in 2.1 by extending the CM Sketch of index ■ Prefix values is also inserted in the CM Sketch, like (a=15000) ■ sel(a = 15000 and b < 1) = sel(a = 15000) * sel(b < 1) PingCAP.com
12 .Part III - Building stats
13 .Analyze ● Split analyze into several tasks ○ one task per index ○ one task for all columns in a table ● Read committed, StartTS maxUInt64, Low priority ● Concurrecy control ○ tidb_build_stats_concurrency, concurrent runing tasks, default 4 ○ tidb_distsql_scan_concurrency, table scan concurrency, default 15 ○ tidb_index_serial_scan_concurrency, index scan concurrency, default 1 ● Modify count will be reset to 0 ○ Even for analyze single index PingCAP.com
14 .Column Stats ● Histogram ○ Sample 10000 rows ■ When numProcessedRows < 10000, add row to sample pool ■ Otherwise, Add it in a probibilty of 10000/numProcessedRows ○ Sorts the rows and build histograms ■ Max 256 buckets ● CM Sketch PingCAP.com
15 .Index Stats ● Histogram ○ Init each buckets’ depth to 1 ○ When all the buckets are full ■ Double the bucket depths ■ Merge each neighboring buckets ● CM Sketch ○ Need to consider each index prefix PingCAP.com
16 .Auto analyze ● When it triggers ○ Contains at least 1000 rows ○ Not been analyzed and no modifications with 1 min ○ Analyzed before and contains enough change and within analyze period ■ tidb_auto_analyze_ratio, default 0.5 ■ tidb_auto_analyze_start_time, default 00:00 ■ tidb_auto_analyze_end_time, default 23:59 PingCAP.com
17 .Auto analyze ● Concurrecy control ○ Smallest concurrency for analyze worker ● Stats owner ○ Only one in a cluster ○ Execute analyze one at a time ○ Different from DDL owner, analyze trigered by user is not handled by stats owner PingCAP.com
18 .Part IV - Maintainment
19 .Row count & Modify count ● Each session caches its own modfications ● Dump to storage per 20*statsLease(default 1 min) ○ The update time of stats_meta will also be updated ● Not right when ○ The cached modifications are not saved to storage ○ DML -> Analyze(correct row count) -> Dump modifications of DML PingCAP.com
20 .Row count & Modify count ● Stats healthy ○ (1 - ModifiyCount/RowCount) * 100 ● When ModifyCount/RowCount > 0.7, TiDB will use pseudo stats PingCAP.com
21 .Histogram ● Split the query range accroding to histogram upper bound ○ So each range can be covered by one extended bucket ● Adjust the depth according to uniform distribution assumption ○ Count(QR 1)/Count(Bkt 1) = len(QR 1)/len(Bkt 1) PingCAP.com
22 .Histogram ● Split Bucket ○ Split the bucket that covered by enough queries ■ e.g. split once if the bucket is covered by 10% queries and we can split 10 buckets in a round ○ Split the bucket according to query points ■ Choose N points if the bucket needs to split N times ■ 10 query points, split to 2 buckets, then the buckets will be [lower bound, 10th point], [10th point+1, upper bound] ● Merge buckets ○ Merge consecutive buckets that leads to smallest estimate error PingCAP.com
23 .Feedback ● Collect the query feedback by probability(default 5%) ● Dump the feedback to storeage per 10 min ● Stats owner load feedback and update stats per 15s PingCAP.com
24 .Thank You! Any Questions ? 关注 PingCAP 官方微信 了解更多技术干货