本次分享主要介绍 TiDB SQL 层的三个组件:优化器,统计信息和执行引擎。 •优化器部分主要举例介绍了逻辑优化规则和物理优化框架; •统计信息部分主要介绍直方图,CMSketch 以及使用方法; •执行引擎部分以两种 join 方式为例介绍了我们在执行引擎实现中用到的一些优化方法。

1. Welcome! 加入 Infra Meetup No.95 交流群 和大家一起讨论吧~ 我们将分享本期 Meetup 资料

2.Introduction of TiDB SQL Layer Presented by Kenan Yao

3.Agenda ● TiDB Overview ● TiDB SQL Layer ○ Query Optimizer ○ Statistics ○ Execution Engine

4.Part I - TiDB Overview

5.What is TiDB? A distributed relational database that speaks MySQL protocol ● MySQL compatibility ● Distributed transaction ● High availability ● Scalability ● HTAP

6.TiDB Architecture PD PD TSO/Data location Data location PD PD Cluster Metadata Spark Driver TiDB MySQL Clients TiKV TiKV Job TiDB DistSQL API DistSQL API Worker TiDB TiKV TiKV Syncer Worker TiDB TiKV TiKV Worker TiDB ... ... ... Spark Cluster TiDB Cluster TiKV Cluster (Storage) TiSpark

7.Part II - TiDB SQL Layer

8.TiDB Layer listener SQL Core Layer Packet Privilege Manager SQL AST parser validator Logical AST Connection Schema Manager Plan Context Physical Logical Optimize Optimize Command Session DDL Worker Protocol SQL Context Physical Decode Statistics Plan Data GC Worker Feedback Protocol TiDB Distributed encode Executor Data Coprocessor BG Job Worker Data Data Data Protocol Layer TiKV

9.TiDB SQL Layer Key components: SQL Core Layer ● Query Optimizer SQL AST parser validator ● Statistics Logical AST ● Execution Engine Physical Plan Logical Optimize Optimize Physical Statistics Plan Feedback TiDB Distributed Data Executor Data Coprocessor Data TiKV

10.Part III - Query Optimizer

11.Query Optimizer Find a reasonable plan in reasonable time ● Operator Pushdown ● Access Path Selection ● Join Order/Algorithm Selection ● Aggregation Algorithm Selection ● Subquery Optimization ● ...

12.Execution Plan Example Example select t1.key, t2.val π t1.key, t2.val from t1, t2 where t1.key=t2.key ⋈ t1.key=t2.key and t2.val>22; ρ t2.val > 22 t1 t2

13.Try PointGet Plan For simple point get queries 1. Single table SELECT/UPDATE/DELETE 2. Filters should be point get, e.g, unique_key = xx Benefit 1. Reduce the query optimization overhead 2. The PointGet executor is also very efficient

14.PlanBuilder Build logical plan tree bottom-up from input AST ● Add flags for later logical optimization ● Optimization starts from here ○ Fold constant ○ Eliminate IFNULL(pk, 0 ,1) ○ Subquery handling ○ ...

15.Query Optimizer Phase 1: Logical Optimization / RBO ● Logical ● Equal ● Beneficial Phase 2: Physical Optimization / CBO

16.Logical Optimization Apply applicable rules in order ● Column Pruning ● Outer Join Simplification ● Partition Pruning ● Subquery Decorrelation ● Group By Elimination ● Predicate Push Down ● Max/Min Eliminatation ● Aggregate Push Down ● Project Elimination ● TopN/Limit Push Down ● Outer Join Elimination ● Join Reorder ● ...

17.Logical Optimization Example 1: Outer Join Simplification & Predicate Push Down select * Filter: Filter: Inner Join: t1.val > 11 t1.val > 11 from t1 t1.key = t2key t2.val > 22 t2.val > 22 left join t2 on t1.key = t2.key where t1.val > 11 Left Join: Inner Join: Filter: Filter: and t2.val > 22; t1.key = t2.key t1.key = t2.key t1.val > 11 t2.val > 22 t1 t2 t1 t2 t1 t2

18.Logical Optimization Example 2: Aggregation Push Down Aggregate: Aggregate: group by: t1.key group by: t1.key select sum(t1.val) sum(t1.val) sum(t1.val) from t1, t2 on t1.key = t2.key group by t1.key; Inner Join: Inner Join: t1.key = t2.key t1.key = t2.key t1 t2 Aggregate: group by: t1.key t2 sum(t1.val) t1

19.Logical Optimization Example 3: Outer Join Elimination ● Cond 1: parent operator only needs outer columns ● Cond 2: ○ Cond 2-1: join key on inner side is unique select t1.* from t1 left join t2 on t1.a=t2.unique_key select * from t1;

20.Logical Optimization Example 3: Outer Join Elimination ● Cond 1: parent operator only needs outer columns ● Cond 2: ○ Cond 2-1: join key on inner side is unique ○ Cond 2-2: parent conly needs distinct outer columns select sum(distinct t1.a) from t1 left join t2 on t1.b=t2.b group by t1.b; select sum(distinct t1.a) from t1;

21.Logical Optimization Example 4: Join Reorder ● extract join nodes ● pull up statistics ● apply a bottom-up DP algorithm for small join group use bitmask to represent join nodes 1. 6(110) represents a join group with node 1 and 2 2. f[6] represents the best join order for join group {1, 2} 3. f[group] = min{join(f[sub], f[group-sub])}

22.Logical Optimization Example 4: Join Reorder ● extract join nodes ● pull up statistics ● apply a bottom-up DP algorithm for small join group ● apply a greedy algorithm otherwise Join 600 Join Join Join t2 800 200 200 100 t1 t1 t2 t1 t3 t1 t3 10 10 100 10 100 10 100 t1 t2 t3 10 100 100

23.Query Optimizer Phase 1: Logical Optimization Phase 2: Physical Optimization

24.Physical Optimization start None None Sort NominalSort select t1.a from t1 None (t1.c1) (t1.c1) join t2 on t1.c1 = t2.c1 order by t1.c1 HashJoin MergeJoin IndexJoin (t2.c1) (t1.c1) None None (t1.c1) None None IndexScan TableScan IndexScan TableScan t1 t1 t2 t2

25.Physical Optimization Goals: ● Which index should be used? ● Physical operator implementation ○ Hash or Stream Aggregate? ○ Hash Join, Merge Join or Index Lookup Join? ● Split the Root Task and Coprocessor Task ● Push Agg/Limit/TopN to coprocessor?

26.Physical Optimization Physical Property: ● task type ● data order ● expected count A Dynamic Programming Progress: ● (Logical Plan, Required Physical Property) -> Physical Plan ● A top-down search approach ● Memorize the search result ● Cost model

27.Part IV - Statistics

28.Statistics Cardinality Estimation ● WHERE t.col > 0 ● ON t1.col1 = t2.col2 ● GROUP BY t.col1, t.col2 What kinds of statistics does TiDB need?

29.Statistics What kinds of statistics does TiDB need? ● Equi-depth Histogram

TiDB 是一款定位于在线事务处理/在线分析处理( HTAP: Hybrid Transactional/Analytical Processing)的融合型数据库产品,实现了一键水平伸缩,强一致性的多副本数据安全,分布式事务,实时 OLAP 等重要特性。