本次杜川老师的分享主要分成三个部分: 1.首先通过对现有 Streaming 系统和 Batch 系统的分析,讨论了在数据处理领域 Streaming 和 Batch 的异同,明确了 Streaming 的核心本质,探讨了 Streaming 和 Batch 融合处理的可能性和必要性,并对现有类似系统进行了简单的分析。 2.简单回顾了 RDMS 中经典的 Volcano 模型的执行流程,探讨了在 RDMS 上支持 Streaming 处理的难点以及 Streaming SQL 设计的关键要素。 3.介绍了 TBSSQL 的设计思路,架构设计和若干关键技术点的方案选择,展示了 TBSSQL 的运行 Demo。并以 TBSSQL 为例,简单介绍了在 TiDB 上增加一个 Feature 的大致思路和入手点。

1. Streaming, SQL and TiDB Batch Streaming SQL Presented By Du Chuan

2.About me • Infrastructure engineer - working on cloud database • TiDB committer - mainly in the TiDB SQL layer • Technophile - database & distribute systems

3.Agenda • Streaming Overview • SQL and Streaming SQL • Introduction to TBSSQL

4.Part I – Streaming Overview

5.“Before it’s news, it’s data. After it‘s news, it’s data.”

6.Batch systems VS Streaming systems

7.Batch VS Streaming • Processing on data • Get data via processing • Finite data set • Infinite data streaming

8.But wait… What’s streaming, indeed?

9.Essentials of Streaming • Bind with time • Strictly ordered by time • Infinite

10.Streaming is the way the world organizing data

11. Batch …… …… Streaming

12.Batch is a snapshot of some Streaming At some point of time In some scope

13.So... One system to rule them both ?

14.Batch && Streaming systems • Using a batch runtime to simulate • Using a streaming runtime to simulate streaming batch

15.What about the RDMS?

16. An example – Ads click analysis Click- User ID Click User 1 ADS Firehose 1 DML Click- User ID Click User 2 ADS 2 Click- User ID Click User 3 ADS 3 Click- User ID Click User 1 ADS 1 Click- User ID Click User 3 ADS 3 Click- User ID User Table Ads Table Click User 1 ADS 1

17.Part II – SQL and Streaming SQL

18.The volcano model… Again

19.SQL consists various kinds of operators

20. The SQL operators SELECT t1.a, count(*) FROM t1 JOIN t2 ON t1.b = t2.b WHERE t1.c > 10 GROUP BY t1.a; Projection Join Selection Aggregation

21.The SQL operator interface • Open: Init operator SQL Operator • Next: Consume input and generate output • Close: Do cleanup job

22.The volcano model input = child.Next() return proj(input, t.a + t.b) SELECT t.a + t.b (". $ + ". &) FROM t input = child.Next() WHERE t.c > 10; return filter(input, t.c > 10) (t.c > 10) t return scan(t)

23.The SQL operator reordering >1 ( ) ) ) > 2 < 2 > ( ) ) ) 2 < >1 2

24. Aggregation operator • SELECT a, COUNT(c) AS cnt FROM t GROUP BY a HAVING cnt > 1; 1. Grouping 2. Apply aggregators 3. Filter using“Having”clause

25. Join operator • SELECT * FROM t1 (LEFT/RIGHT/FULL OUTER) JOIN)t2 ON t1.a = t2.a; 1. Get Cartesian product 2. Filter using “On” . clause

26.When SQL operator meets Streaming…

27.SELECT a, COUNT(c) FROM t GROUP BY a; 1. Grouping(Never End) 2. Apply aggregators Infinite Data Stream

28.Solution: limit the scope of streaming using “window”

29. SELECT a, COUNT(c) FROM t GROUP BY a; 1. Grouping 2. Apply Aggregators Window1 Infinite Data Stream Window2

TiDB 是一款定位于在线事务处理/在线分析处理( HTAP: Hybrid Transactional/Analytical Processing)的融合型数据库产品,实现了一键水平伸缩,强一致性的多副本数据安全,分布式事务,实时 OLAP 等重要特性。