展开查看详情
1.Discovering Statistical Properties for Query Optimization Presented by Han Fei
2.Part I - Introduction to Query Optimization
3.The Process of Query Optimization ● Enumerate all possible plans by transformation rules (Cascade Optimzier) ● Get the ordinary table/columns cardinality by simple predicate ● Derive cardinality for every subplan ● Estimate the cost of plan and choose the best one
4.
5.Most Common Statistics : Histogram ● Bucket Scheme ○ Equi-Width ○ Equi-Depth ○ Max Diff ○ V-Optimial ● Estimation Scheme ○ Continuous Spread Assumption ○ Four Level Tree
6.What do We Need More for Modern Databases? ● Correlation Discovery ● ‘soft’ Functional Dependency ● Algebraic constraints ● Holes in joins ● Workload-aware Methods ● Incremental Maintance
7.Part II - Proactive Methods
8.How do we estimate multiple predicates? ● attribute value independece assumption ○ the distributions of individual attributes are independent of each other ● join uniformity assumption ○ a tuple from one relation is equally likely to join with any tuple from the second relation
9.Automated Correlation Discovery ● Consider a table with tree attributes: ○ Education ○ Income ○ Home-holder ● some of the correlations between attributes might be indirect ones, mediate by others. ○ a high-school dropout who owns a successful Internet startup is more likely to own a home than a highly educated beach bum.
10.Conditionally Independent ● P(H = h | E = e, I = i) = P (H = h | I = i) ● We just need hold some marginal distribution: ○ P(E) ○ P(I | E) ○ P(H | I) ● Then P (H , E, I) = P (E) P (I | E) P (H | I)
11.Use Graphical Model to detect correlation ● We can build a Bayesian network which consists of two component ○ a DAG whose nodes correspond to A_1, ... , A_n, edges denote a direct dependency of A_i on its parents(A_i) ○ conditional probability distribution
12.
13.Part III - Reactive Methods
14.Feedback based Histogram ● monitor queries on specified column gathering estimated and actual cardinalities ● create/refine maximum-entropy distribution at given condition PingCAP.com
15.Max Entropy Principle ● We define every bucket as {l_i, r_i, f_i}, of which f_i is relative frequency. ● H(R) = - sum(m_i / ln (m_i/ h_i)) ● Use this principle to determine the boundaries PingCAP.com
16.Meeting a Space Budget - Prunning ● PingCAP.com
17.Meeting a Space Budget - Prunning ● PingCAP.com
18.Thank You !