discovering statistical properties for query optimization

Discovering and exploiting statistical features in relational datasets is key to query optimization in a relational database management system (RDBMS ), and is also needed for database design, cleaning, and integration. This paper surveys a variety of methods for automatically discovering important statistical features such as correlations, functional dependencies, keys, and algebraic constraints.

1.Discovering Statistical Properties for Query Optimization Presented by Han Fei

2.Part I - Introduction to Query Optimization

3.The Process of Query Optimization ● Enumerate all possible plans by transformation rules (Cascade Optimzier) ● Get the ordinary table/columns cardinality by simple predicate ● Derive cardinality for every subplan ● Estimate the cost of plan and choose the best one


5.Most Common Statistics : Histogram ● Bucket Scheme ○ Equi-Width ○ Equi-Depth ○ Max Diff ○ V-Optimial ● Estimation Scheme ○ Continuous Spread Assumption ○ Four Level Tree

6.What do We Need More for Modern Databases? ● Correlation Discovery ● ‘soft’ Functional Dependency ● Algebraic constraints ● Holes in joins ● Workload-aware Methods ● Incremental Maintance

7.Part II - Proactive Methods

8.How do we estimate multiple predicates? ● attribute value independece assumption ○ the distributions of individual attributes are independent of each other ● join uniformity assumption ○ a tuple from one relation is equally likely to join with any tuple from the second relation

9.Automated Correlation Discovery ● Consider a table with tree attributes: ○ Education ○ Income ○ Home-holder ● some of the correlations between attributes might be indirect ones, mediate by others. ○ a high-school dropout who owns a successful Internet startup is more likely to own a home than a highly educated beach bum.

10.Conditionally Independent ● P(H = h | E = e, I = i) = P (H = h | I = i) ● We just need hold some marginal distribution: ○ P(E) ○ P(I | E) ○ P(H | I) ● Then P (H , E, I) = P (E) P (I | E) P (H | I)

11.Use Graphical Model to detect correlation ● We can build a Bayesian network which consists of two component ○ a DAG whose nodes correspond to A_1, ... , A_n, edges denote a direct dependency of A_i on its parents(A_i) ○ conditional probability distribution


13.Part III - Reactive Methods

14.Feedback based Histogram ● monitor queries on specified column gathering estimated and actual cardinalities ● create/refine maximum-entropy distribution at given condition

15.Max Entropy Principle ● We define every bucket as {l_i, r_i, f_i}, of which f_i is relative frequency. ● H(R) = - sum(m_i / ln (m_i/ h_i)) ● Use this principle to determine the boundaries

16.Meeting a Space Budget - Prunning ●

17.Meeting a Space Budget - Prunning ●

18.Thank You !

TiDB 是一款定位于在线事务处理/在线分析处理( HTAP: Hybrid Transactional/Analytical Processing)的融合型数据库产品,实现了一键水平伸缩,强一致性的多副本数据安全,分布式事务,实时 OLAP 等重要特性。