PingCAP-Infra-Meetup-98-xuhuaiyu-Compiled+and+Vectorized+Query

本次分享徐怀宇老师为大家介绍了论文《Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask》,主要包括: •介绍经典 Volcano 模型的执行流程,并分析其运行时性能。 •介绍行存、列存的基本概念,并进而引出向量化执行,分析其如何克服经典 Volcano 模型的缺点。 •介绍代码生成的基本概念,结合案例分析其如何克服经典 Volcano 模型的缺点。 最后,结合论文内容,重点从 micro-architecture, data-parallel execution 两个方面,分析对比向量化执行和代码生成的特性,进而引出论文结论:向量化执行在 memory-bound 类的查询中更有优势,代码生成在 calculation-heavy 类的查询中更有优势。但是总体来看,在 OLAP 场景中,向量化执行和代码生成的执行性能相近。
展开查看详情

1. Welcome! 加入 Infra Meetup No.98 交流群 和大家一起讨论吧~ 我们将分享本期 Meetup 资料

2. Compiled and Vectorized Queries Presented by Huaiyu Xu Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask

3.About Me ● Huaiyu Xu (徐怀宇) ● Kernel Engineer for TiDB (SQL execution engine, distributed computing framework) ● xuhuaiyu@pingcap.com PingCAP.com

4.Agenda ● Classic Volcano-style Model ● Vectorized Execution ● Compiled Execution ● Comparison and Analysis PingCAP.com

5.Part I - Volcano-style model

6.Volcano-style model ● A.k.a Tuple at a time ● Open-Next-Close next() tuple ● Elegant, flexible, extensible and Hash powerful[1] GroupBy next() tuple ● Efficient at that time Hash ○ disk I/O is the main overhead Join next() tuple next() ● Pull-based model Selection Selection [1] Graefe, Goetz. "Volcano/spl minus/an extensible and parallel query evaluation system." IEEE Transactions on Knowledge and Data Engineering 6.1 (1994): 120-135.

7.Volcano-style model SELECT SUM(price) FROM goods WHERE shop_id = 10 id shop_id price 1 5 10.5 2 10 1.2 3 10 13.7 Scan: goods Selection: shop_id = 10 Aggregation: sum(price)

8.Volcano-style model SELECT SUM(price) FROM goods WHERE shop_id = 10 id shop_id price 1 5 10.5 2 10 1.2 3 10 13.7 next() Scan: goods Selection: shop_id = 10 Aggregation: sum(price)

9.Volcano-style model SELECT SUM(price) FROM goods WHERE shop_id = 10 id shop_id price 1 5 10.5 2 10 1.2 3 10 13.7 next() Scan: goods Selection: shop_id = 10 Aggregation: sum(price)

10.Volcano-style model SELECT SUM(price) FROM goods WHERE shop_id = 10 id shop_id price 1 5 10.5 2 10 1.2 3 10 13.7 next() Scan: goods Selection: shop_id = 10 Aggregation: sum(price)

11.Volcano-style model SELECT SUM(price) FROM goods WHERE shop_id = 10 id shop_id price 1 5 10.5 2 10 1.2 3 10 13.7 next() Scan: goods Selection: shop_id = 10 Aggregation: sum(price) id shop_id price 1 5 10.5

12.Volcano-style model SELECT SUM(price) FROM goods WHERE shop_id = 10 id shop_id price 1 5 10.5 2 10 1.2 3 10 13.7 Scan: goods Selection: shop_id = 10 Aggregation: sum(price)

13.Volcano-style model SELECT SUM(price) FROM goods WHERE shop_id = 10 id shop_id price 1 5 10.5 2 10 1.2 3 10 13.7 next() Scan: goods Selection: shop_id = 10 Aggregation: sum(price)

14.Volcano-style model SELECT SUM(price) FROM goods WHERE shop_id = 10 id shop_id price 1 5 10.5 2 10 1.2 3 10 13.7 next() Scan: goods Selection: shop_id = 10 Aggregation: sum(price) id shop_id price 2 10 1.2

15.Volcano-style model SELECT SUM(price) FROM goods WHERE shop_id = 10 id shop_id price 1 5 10.5 2 10 1.2 3 10 13.7 next() Scan: goods Selection: shop_id = 10 Aggregation: sum(price) id shop_id price 2 10 1.2

16.Volcano-style model SELECT SUM(price) FROM goods WHERE shop_id = 10 id shop_id price 1 5 10.5 2 10 1.2 3 10 13.7 Scan: goods Selection: shop_id = 10 Aggregation: sum(price)

17.Volcano-style model SELECT SUM(price) FROM goods WHERE shop_id = 10 id shop_id price 1 5 10.5 2 10 1.2 3 10 13.7 next() Scan: goods Selection: shop_id = 10 Aggregation: sum(price)

18.Volcano-style model SELECT SUM(price) FROM goods WHERE shop_id = 10 id shop_id price 1 5 10.5 2 10 1.2 3 10 13.7 next() Scan: goods Selection: shop_id = 10 Aggregation: sum(price)

19.Volcano-style model SELECT SUM(price) FROM goods WHERE shop_id = 10 id shop_id price 1 5 10.5 2 10 1.2 3 10 13.7 next() Scan: goods Selection: shop_id = 10 Aggregation: sum(price) id shop_id price 3 10 13.7

20.Volcano-style model SELECT SUM(price) FROM goods WHERE shop_id = 10 id shop_id price 1 5 10.5 2 10 1.2 3 10 13.7 next() Scan: goods Selection: shop_id = 10 Aggregation: sum(price) id shop_id price 3 10 13.7

21.Volcano-style model SELECT SUM(price) FROM goods WHERE shop_id = 10 id shop_id price 1 5 10.5 2 10 1.2 3 10 13.7 next() Scan: goods Selection: shop_id = 10 Aggregation: sum(price) sum(price) 14.9

22.Runtime performance ● Expensive virtual function calls and type dispatch ○ Too many non-productive / control flow instructions ○ CPU pipeline bubbles ● Function call overhead for each next ○ Poor cache utilization

23.Part II - Vectorized Execution

24.Vectorized Execution ● Row Stores ● Column Stores

25.Row Stores ● Suitable for OLTP workloads ● Mostly reading / writing all of the values for entire records && typically a small number of records at a time: SELECT * FROM goods WHERE id = 10

26.Row Stores ● Not suitable for OLAP workloads ● Only one column is referred, all data has to be read SELECT SUM(UNITS_SOLD) FROM SALE

27.Column Stores SELECT SUM(UNITS_SOLD) FROM SALE ● Only one column is referenced ● Only reads small part of the table

28.Column-at-a-time Model SELECT Id, Name, Age, (Age-30)*50 AS Bonus FROM People WHERE Age > 30

29.Column-at-a-time Model ● However, column may not not be fit into the cache or memory Total Runtime Column Size Cache Size Store in files??? Memory Size

TiDB 是一款定位于在线事务处理/在线分析处理( HTAP: Hybrid Transactional/Analytical Processing)的融合型数据库产品,实现了一键水平伸缩,强一致性的多副本数据安全,分布式事务,实时 OLAP 等重要特性。