基于streaming构建统一的数据处理引擎的挑战与实践 部分1

基于streaming构建统一的数据处理引擎的挑战与实践
展开查看详情

1.‫ݪل‬ғᴨ᯾૬૬ $OLEDED ᘳ֖ғṛᕆದ๞ӫਹ  ṛᕆ୏‫ݎ‬ૡᑕ૵ ᄍᦖᘏғ๷‫ظ‬ᇙ  շ嬬

2. About me n ๷‫ظ‬ᇙҁṼੰ҅Kurt҂ n Apache Flink Committer n ଙᏗॊླӱ‫فے‬ᴨ᯾҅݇Өग़ӻᦇᓒଘ‫ݣ‬໐ஞᶱፓጱ ᦡᦇ޾Ꮈ‫҅ݎ‬۱ೡ൤ᔱ୚ක̵᧣ଶᔮᕹ̵ፊഴ‫ړ‬ຉᒵ̶ፓ ‫ڹ‬ᨮᨱ%OLQN 64/ጱᎸ‫޾ݎ‬ս۸

3. Agenda     Why What How Achievement Future

4.Why to Unify Batch and Stream ᴹવጱ਍ԟใᕚғӷӻ୚ක҅ӷղդᎱ Steep learning curve: two engines, two codes https://mapr.com/developercentral/lambda-architecture/

5.Why to Unify Batch and Stream य़හഝ޾ $,‫ق‬ว Ń  BIG DATA & LANDSCAPE 2018 http://mattturck.com/bigdata2018/

6. Why to Unify Batch and Stream ਍ԟӞॺ୚කٟ҅ӞղդᎱ҅‫ݶ‬෸ᬩᤈၞ॒ቘ޾ಢ॒ቘ҅ኜᛗๅग़ Learn One Engine, Write One Code, Run both Stream and Batch, Even More

7. Agenda     Why What How Achievement Future

8. What is an Unified SQL Engine 01 አಁ᥯ଶғӞղդᎱ҅Ӟ໏ጱᕮຎ User Perspective: One Query, Same Result 02 ୏‫ݎ‬᥯ଶғຝ຅ᕹӞ̵դᎱ॔አ̵ၞಢᣟ‫ݳ‬ Dev Perspective: Unified Arch, Code Reuse

9. User: One Query, Same Result CREATE TABLE USER_SCORES ( Name VARCHAR, Score DOUBLE, Time TIMESTAMP ) WITH ( ... ); SQL is exactly the same in CREATE TABLE TOTOAL_SCORES ( Name VARCHAR, Streaming and Batch TotalScore DOUBLE, MaxTime TIMESTAMP ) WITH ( ... ); INSERT INTO TOTAL_SOURCES SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;

10. User: One Query, Same Result Batch Mode: 12:07> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name; ------------------------- ------------------------- | Name | Score | Time | | USER_SCORES | ------------------------- ------------------------- | Julie | 12 | 12:07 | | User | Score | Time | | Frank | 5 | 12:06 | ------------------------- ------------------------- | Julie | 7 | 12:01 | | Frank | 3 | 12:03 | Stream Mode: | Julie | 1 | 12:03 | 12:01> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name; | Frank | 2 | 12:06 | ------------------------------------------------------------------------------------- | Julie | 4 | 12:07 | | [-inf, 12:01) | [12:01, 12:04) | [12:04, now) | ------------------------- | ------------------------- | ------------------------- | ------------------------- | USER_SCORES is a | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | source table/stream. | | | | | | | Julie | 8 | 12:03 | | | Julie | 12 | 12:07 | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 5 | 12:06 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------------------------------------------------------------------- Reference: https://qconsf.com/sf2017/system/files/presentation-slides/foundations_of_streaming_sql_qcon_sf_2017.pdf

11. Developer: Unified Architecture ? DataStream API DataSet API Stream Processing Batch Processing Table API & SQL Runtime Relational Operator Tree DataStream API DataSet API Transformation Stream Processing Batch Processing Batch Plan Runtime StreamGraph Distributed Streaming Dataflow Optimized Plan Local Cluster Cloud Single JVM Standalone, YARN GCE, EC2 Job Graph Stream Task & Operator Batch Task & Driver

12. Developer: What about Code Reuse SELECT * FROM t1, t2 WHERE t1.a = t2.b where t1.a < 1 Streaming Batch DataStream[Row] t1; DataSet[Row] t1; DataStream[Row] t2; DataSet[Row] t2; t1.connect(t2) t1.join(t2) .keyBy(a, b) .where(a) .transform(KeyedCoProcessOperator) .equalTo(b) .with(joinFunc) // t1.a < 1 class CoProcessFunction<IN1, IN2, OUT>

13. Developer: What about Runtime Push (stream) scanFile() process(elem) process(elem) process(elem) " # ! SELECT SUM(T.B) T execute() FROM T WHERE T.A < 10 scanFile() elem=next() elem=next() elem=next() Pull (batch) ഴ‫ګ‬ၞ (Control Flow) හഝၞ (Data Flow)

14. Agenda     Why What How Achievement Future

15. About me n շ嬬ҁԯᮐ҅-DUN҂ n Apache Flink Committer l Contributing since Flink v1.0 l Focusing on Flink Table & SQL since 3 years n Software Engineer at Alibaba in Blink SQL team

16. How to Unify Batch and Stream 01 ቘᦞचᏐғ ۖாᤒ 03 ս۸࢏ጱᕹӞ Dynamic Table Unification of the Optimizer 02 ຝ຅දᬰ 04 चᏐහഝᕮ຅ጱᕹӞ Improve Architecture Unification of Basic Data Structure 05 ᇔቘਫሿጱ‫و‬Ձ Sharing of the Runtime Implementation