discovering statistical properties for query optimization

Discovering and exploiting statistical features in relational datasets is key to query optimization in a relational database management system (RDBMS), and is also needed for database design, cleaning, and integration. This paper surveys a variety of methods for automatically discovering important statistical features such as correlations, functional dependencies, keys, and algebraic constraints.

1.Discovering Statistical Properties for Query Optimization Presented by Han Fei

2.Part I - Introduction to Query Optimization

3.The Process of Query Optimization ● Enumerate all possible plans via transformation rules (Cascades optimizer) ● Estimate base table/column cardinalities for simple predicates from statistics ● Derive a cardinality estimate for every subplan ● Estimate the cost of each plan and choose the best one
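As a rough illustration of these steps (a minimal sketch with made-up cardinalities, selectivities, and a toy cost model, not the Cascades implementation itself), the snippet below derives subplan cardinalities bottom-up and picks the cheaper of two candidate plans:

```python
# Toy cost-based plan selection: all numbers are assumed for illustration.
base_card = {"orders": 1_000_000, "customers": 50_000}   # from catalog statistics
sel_status = 0.1                                         # selectivity of orders.status = 'shipped'

def est_card(plan):
    """Derive cardinality bottom-up for a tiny plan tree."""
    op = plan[0]
    if op == "scan":
        return base_card[plan[1]]
    if op == "filter":
        return sel_status * est_card(plan[1])
    if op == "join":  # join-uniformity style estimate over the customer key
        return est_card(plan[1]) * est_card(plan[2]) / base_card["customers"]

def est_cost(plan):
    """Toy cost model: each operator pays for the rows it consumes."""
    op = plan[0]
    if op == "scan":
        return est_card(plan)
    if op == "filter":
        return est_cost(plan[1]) + est_card(plan[1])
    if op == "join":
        return est_cost(plan[1]) + est_cost(plan[2]) + est_card(plan[1]) + est_card(plan[2])

plans = [
    ("join", ("filter", ("scan", "orders")), ("scan", "customers")),  # push the filter below the join
    ("filter", ("join", ("scan", "orders"), ("scan", "customers"))),  # filter after the join
]
best = min(plans, key=est_cost)
print("chosen plan:", best, "estimated cost:", est_cost(best))
```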

4.

5.Most Common Statistics: Histogram ● Bucket Scheme ○ Equi-Width ○ Equi-Depth ○ MaxDiff ○ V-Optimal ● Estimation Scheme ○ Continuous Spread Assumption ○ Four-Level Tree
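A minimal sketch (my own illustration, using NumPy on synthetic skewed data) of how the first two bucket schemes differ: equi-width fixes the value range per bucket, equi-depth fixes the row count per bucket.

```python
import numpy as np

values = np.random.zipf(1.5, 10_000)          # synthetic, skewed column values
values = values[values <= 100]
num_buckets = 8

# Equi-width: every bucket covers the same value range.
width_edges = np.linspace(values.min(), values.max(), num_buckets + 1)
# Equi-depth: every bucket holds roughly the same number of rows.
depth_edges = np.quantile(values, np.linspace(0, 1, num_buckets + 1))

print(np.histogram(values, bins=width_edges)[0])  # very uneven counts on skewed data
print(np.histogram(values, bins=depth_edges)[0])  # roughly equal counts by construction
```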

6.What More Do We Need for Modern Databases? ● Correlation Discovery ● 'Soft' Functional Dependencies ● Algebraic Constraints ● Holes in Joins ● Workload-Aware Methods ● Incremental Maintenance

7.Part II - Proactive Methods

8.How do we estimate multiple predicates? ● attribute value independence assumption ○ the distributions of individual attributes are independent of each other ● join uniformity assumption ○ a tuple from one relation is equally likely to join with any tuple from the second relation
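A back-of-the-envelope sketch of how the two assumptions turn into estimates (all numbers, column names, and predicates are assumed for illustration):

```python
n_rows = 1_000_000

# Attribute-value independence: the selectivity of a conjunction is the product of
# the per-predicate selectivities, e.g. city = 'Oslo' AND make = 'Tesla'.
sel_city, sel_make = 0.01, 0.02
est_and = n_rows * sel_city * sel_make           # 200 rows; badly wrong if the columns correlate

# Join uniformity: |R join S| ~= |R| * |S| / max(ndv_R(key), ndv_S(key)).
card_r, card_s = 1_000_000, 50_000
ndv_r, ndv_s = 40_000, 50_000
est_join = card_r * card_s / max(ndv_r, ndv_s)   # 1,000,000 rows
print(est_and, est_join)
```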

9.Automated Correlation Discovery ● Consider a table with three attributes: ○ Education ○ Income ○ Home-ownership ● Some of the correlations between attributes may be indirect ones, mediated by other attributes. ○ e.g. a high-school dropout who owns a successful Internet startup is more likely to own a home than a highly educated beach bum.

10.Conditional Independence ● P(H = h | E = e, I = i) = P(H = h | I = i) ● We only need to keep a few small distributions: ○ P(E) ○ P(I | E) ○ P(H | I) ● Then P(E, I, H) = P(E) P(I | E) P(H | I)
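A quick numeric sketch (made-up probabilities for Education, Income, and Home-ownership) showing that the three small distributions are enough to reconstruct the full joint:

```python
import itertools

P_E = {"low": 0.6, "high": 0.4}                                   # P(E)
P_I_given_E = {"low": {"low": 0.7, "high": 0.3},                  # P(I | E)
               "high": {"low": 0.2, "high": 0.8}}
P_H_given_I = {"low": {"no": 0.8, "yes": 0.2},                    # P(H | I)
               "high": {"no": 0.3, "yes": 0.7}}

# 2 + 4 + 4 stored numbers reconstruct all 8 entries of P(E, I, H).
joint = {(e, i, h): P_E[e] * P_I_given_E[e][i] * P_H_given_I[i][h]
         for e, i, h in itertools.product(P_E, ["low", "high"], ["no", "yes"])}
print(sum(joint.values()))   # 1.0: a full joint distribution without ever storing it
```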

11.Use a Graphical Model to Detect Correlations ● We can build a Bayesian network, which consists of two components ○ a DAG whose nodes correspond to A_1, ..., A_n; edges denote the direct dependency of each A_i on parents(A_i) ○ a conditional probability distribution for each node given its parents
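Continuing the same made-up example, a sketch of the two components in data-structure form: the DAG as a parents map plus one conditional probability table per node, with the joint recovered by the chain rule:

```python
parents = {"E": [], "I": ["E"], "H": ["I"]}                       # DAG: E -> I -> H
cpts = {  # one conditional probability table per node, keyed by its parents' values
    "E": {(): {"low": 0.6, "high": 0.4}},
    "I": {("low",): {"low": 0.7, "high": 0.3}, ("high",): {"low": 0.2, "high": 0.8}},
    "H": {("low",): {"no": 0.8, "yes": 0.2}, ("high",): {"no": 0.3, "yes": 0.7}},
}

def joint_prob(assignment):
    """P(A_1, ..., A_n) = product over nodes of P(node | parents(node))."""
    p = 1.0
    for node, pars in parents.items():
        key = tuple(assignment[a] for a in pars)
        p *= cpts[node][key][assignment[node]]
    return p

print(joint_prob({"E": "high", "I": "high", "H": "yes"}))   # 0.4 * 0.8 * 0.7 = 0.224
```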

12.

13.Part III - Reactive Methods

14.Feedback-Based Histogram ● monitor queries on a specified column, gathering estimated and actual cardinalities ● create/refine a maximum-entropy distribution consistent with the observed feedback
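A minimal sketch (names and structure are my own) of the feedback-collection side: each record keeps the predicate range together with the optimizer's estimate and the cardinality actually observed at execution time, which later becomes a constraint for the refined histogram.

```python
feedback_log = []

def record_feedback(lo, hi, estimated_rows, actual_rows):
    """Log one piece of execution feedback for a range predicate on the monitored column."""
    feedback_log.append({"range": (lo, hi), "est": estimated_rows, "act": actual_rows})

# e.g. WHERE age BETWEEN 30 AND 40 was estimated at 1200 rows but returned 4800
record_feedback(30, 40, 1200, 4800)

# A reactive histogram is then refined so it reproduces the observed counts on the
# logged ranges, while staying as uniform (maximum entropy) as possible elsewhere.
```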

15.Max Entropy Principle ● Define each bucket as {l_i, r_i, f_i}, where f_i is the bucket's relative frequency. ● H(R) = - Σ_i f_i · ln(f_i / h_i), where h_i = r_i - l_i + 1 is the number of distinct values the bucket covers ● Use this principle to determine the bucket boundaries
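A small helper (with illustrative numbers) computing H(R) from the formula above, under the uniform-spread-within-bucket assumption:

```python
import math

buckets = [(1, 10, 0.5), (11, 40, 0.3), (41, 100, 0.2)]   # (l_i, r_i, f_i), frequencies sum to 1

def histogram_entropy(buckets):
    """H(R) = - sum_i f_i * ln(f_i / h_i) with h_i = r_i - l_i + 1 values per bucket."""
    return -sum(f * math.log(f / (hi - lo + 1)) for lo, hi, f in buckets)

print(histogram_entropy(buckets))   # larger when mass is spread over wider buckets
```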

16.Meeting a Space Budget - Pruning

17.Meeting a Space Budget - Pruning
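The two pruning slides are figure-only here; as one hedged illustration of meeting a space budget under the maximum-entropy framing above (an assumed strategy, not necessarily the one on the slides), this sketch greedily merges the adjacent pair of buckets whose merge changes the entropy the least, until the bucket count fits the budget:

```python
import math

def entropy(buckets):
    # H = - sum f * ln(f / width), uniform spread inside each (l, r, f) bucket
    return -sum(f * math.log(f / (hi - lo + 1)) for lo, hi, f in buckets)

def merge(b1, b2):
    # the merged bucket spans both ranges and carries the combined frequency
    return (b1[0], b2[1], b1[2] + b2[2])

def prune_to_budget(buckets, budget):
    buckets = list(buckets)
    while len(buckets) > budget:
        # merge the adjacent pair whose merge perturbs the total entropy the least
        i = min(range(len(buckets) - 1),
                key=lambda j: abs(entropy(buckets[j:j + 2]) -
                                  entropy([merge(buckets[j], buckets[j + 1])])))
        buckets[i:i + 2] = [merge(buckets[i], buckets[i + 1])]
    return buckets

print(prune_to_budget([(1, 10, 0.5), (11, 40, 0.25), (41, 100, 0.2), (101, 120, 0.05)], 2))
```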

18.Thank You!
