Apache Kylin Updates 2018 2Q

In June 2018, at the Apache Kylin meetup in Shenzhen, a Kyligence architect introduced the new features in Apache Kylin 2.3 and 2.4.

1. Apache Kylin Updates 2018 2Q
Shaofeng Shi, Kylin PMC, shaofengshi@apache.org

2. Kylin 2.3 Recall V2.3.0 released on 2018-3-4 V2.3.1 released on 2018-3-28 All rights reserved ©Kyligence Inc. http://kyligence.io

3. System Cube, Cube Planner, Dashboard (eBay)
• KYLIN-2721: Metrics from the query and job engines go to the System Cube
• KYLIN-2726: Dashboard to visualize Kylin queries and jobs
• KYLIN-2727: Cube Planner for cube optimization
Installation documents:
https://kylin.apache.org/docs23/tutorial/setup_systemcube.html
https://kylin.apache.org/docs23/tutorial/use_cube_planner.html
https://kylin.apache.org/docs23/tutorial/use_dashboard.html

4. KYLIN-2721: Metrics to System Cube

5. KYLIN-2726: Dashboard for Queries and Jobs

6. KYLIN-2727: Cube Planner

7. Cube Planner: Greedy Algorithm
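
The slide only names the technique. As a rough illustration, here is a minimal sketch of greedy cuboid selection in the textbook style: repeatedly materialize the cuboid that most reduces total query cost, assuming a query on a cuboid is answered from the smallest materialized cuboid that contains its dimensions. The function name, the cost model, and the row counts are assumptions made for illustration, not Kylin's actual Cube Planner code.

```python
def greedy_select(sizes, base, k):
    """Pick up to k extra cuboids to materialize, greedily by total benefit.

    sizes: dict mapping a cuboid (frozenset of dimension names) to its row count.
    base:  the base cuboid (all dimensions); it is always materialized.
    """
    selected = {base}

    def cost(cuboid, chosen):
        # A query on `cuboid` scans the smallest chosen cuboid that covers it.
        return min(sizes[c] for c in chosen if cuboid <= c)

    for _ in range(k):
        best, best_benefit = None, 0
        for cand in sizes:
            if cand in selected:
                continue
            trial = selected | {cand}
            # Benefit = total scan cost saved across all cuboids.
            benefit = sum(cost(c, selected) - cost(c, trial) for c in sizes)
            if benefit > best_benefit:
                best, best_benefit = cand, benefit
        if best is None:  # no remaining cuboid reduces the cost
            break
        selected.add(best)
    return selected


# Hypothetical row counts for a cube with dimensions A, B and C.
sizes = {
    frozenset("ABC"): 100,
    frozenset("AB"): 50,
    frozenset("AC"): 75,
    frozenset("BC"): 90,
    frozenset("A"): 20,
    frozenset("B"): 30,
    frozenset("C"): 40,
    frozenset(): 1,
}
picked = greedy_select(sizes, frozenset("ABC"), k=2)
```

With these made-up sizes the first pick is AB, because it serves AB, A, B and the empty cuboid at half the cost of the base cuboid; the second pick is C.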

8. Cube Planner: Sunburst Chart

9. Cube Planner: One-click optimization
• Directly optimizes the in-use Cube
• Ensures atomicity when switching the Cube structure

10. Cube Planner: Use case
• Cube build is 10x faster.
• Queries are 5x faster.
• Storage is reduced by 95%.
• Overall, more than 50x ROI improvement.

11. Faster Cube build
- KYLIN-3125: Use SparkSQL to create flat table
RDBMS as data source
- KYLIN-3052: Support Redshift as data source
- KYLIN-3044: Support SQL Server as data source

12. Kylin 2.4 Coming soon

13. KYLIN-3359: Support SUM(expression)
The expression can take either of the following forms:
1. a1*col1 + a2*col2 + ... + an*coln + b, if sum(col1), sum(col2), ..., sum(coln) are defined
2. case when filter1 then expr1 when filter2 then expr2 ... else exprN end, if filter1, filter2, ..., filterN-1 and expr1, expr2, ..., exprN are supported
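
To see why the first form can be answered from existing measures, note that the sum of a linear expression is the same linear combination of the per-column sums, with the constant multiplied by the row count. The sketch below checks this on made-up values; the column values and coefficients are assumptions, and it assumes a row count is available alongside the sum measures.

```python
rows = [(3, 10), (5, 20), (7, 30)]  # (col1, col2), made-up values
a1, a2, b = 2, 4, 1                 # made-up coefficients

# Row-by-row evaluation, as a scan over the raw data would do it.
direct = sum(a1 * c1 + a2 * c2 + b for c1, c2 in rows)

# The same result computed from pre-aggregated measures only.
sum_col1 = sum(c1 for c1, _ in rows)
sum_col2 = sum(c2 for _, c2 in rows)
row_count = len(rows)
from_measures = a1 * sum_col1 + a2 * sum_col2 + b * row_count
```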

14. KYLIN-3359: Example
select CAL_DT,
  sum(case when SLR_SEGMENT_CD is null then 0
           when SLR_SEGMENT_CD = 0 then PRICE * 2
           else PRICE end)
from TEST_KYLIN_FACT
group by CAL_DT
order by CAL_DT

15. KYLIN-3359: Why it works
The query is equivalent to:
select CAL_DT,
  sum(case when SLR_SEGMENT_CD is null then 0
           when SLR_SEGMENT_CD = 0 then PRICE * 2
           else PRICE end)
from (
  select CAL_DT, SLR_SEGMENT_CD, sum(PRICE) as PRICE
  from TEST_KYLIN_FACT
  group by CAL_DT, SLR_SEGMENT_CD
) tmp
group by CAL_DT
order by CAL_DT
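
The equivalence can be checked on a small data set. The following sketch uses Python's built-in sqlite3 module as a stand-in engine (not Kylin) and a handful of made-up rows; the CASE expression is written in standard form, with one `when` per branch and a single `else`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table TEST_KYLIN_FACT (CAL_DT text, SLR_SEGMENT_CD integer, PRICE integer)"
)
# Made-up sample rows, including a NULL segment code.
conn.executemany(
    "insert into TEST_KYLIN_FACT values (?, ?, ?)",
    [
        ("2018-06-01", None, 10),
        ("2018-06-01", 0, 20),
        ("2018-06-01", 0, 5),
        ("2018-06-01", 1, 7),
        ("2018-06-02", 0, 3),
        ("2018-06-02", 2, 4),
    ],
)

CASE_EXPR = """case when SLR_SEGMENT_CD is null then 0
                    when SLR_SEGMENT_CD = 0 then PRICE * 2
                    else PRICE end"""

# Original query: the CASE expression is evaluated on every raw row.
direct = conn.execute(
    f"select CAL_DT, sum({CASE_EXPR}) from TEST_KYLIN_FACT "
    "group by CAL_DT order by CAL_DT"
).fetchall()

# Rewritten query: pre-aggregate sum(PRICE) per (CAL_DT, SLR_SEGMENT_CD),
# then apply the same CASE expression to the aggregated rows.
rewritten = conn.execute(
    f"select CAL_DT, sum({CASE_EXPR}) from ("
    "  select CAL_DT, SLR_SEGMENT_CD, sum(PRICE) as PRICE"
    "  from TEST_KYLIN_FACT group by CAL_DT, SLR_SEGMENT_CD"
    ") tmp group by CAL_DT order by CAL_DT"
).fetchall()
```

Both queries return the same rows because every filter in the CASE expression depends only on SLR_SEGMENT_CD, which is a grouping column of the inner query, and each branch is linear in PRICE.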

16. KYLIN-3359: Implementation Brief
SQLDigest:
- groupby: CAL_DT, SLR_SEGMENT_CD
- measure: sum(PRICE), sum(DYNA)
- DynamicFunctionDesc: DYNA = case when SLR_SEGMENT_CD is null then 0 when SLR_SEGMENT_CD = 0 then PRICE * 2 else PRICE end
The dynamic column is added to:
- GTInfo
- CuboidToGridTableMappingExt
- GTScanRequest
Pushed down into the coprocessor for evaluation:
- GTFilterScanner
- GTFunctionScanner
- GTAggregateScanner

17. KYLIN-3221: Externalizing Lookup table
Limitations of lookup tables in Kylin:
1. A lookup table must be small enough to load into memory
• Max size is 300 MB by default
2. Lookup table snapshots are taken per segment
• Hard to refresh snapshots for all segments

18. Workaround before KYLIN-3221
Don't take a snapshot for a big lookup table:
• Direct queries on the lookup table are not supported;
• Derived dimensions on the big lookup table are not allowed.

19. KYLIN-3221 Core ideas
1. Introduce SnapshotTableDesc
• StorageType: "metaStore" by default
• Storage location
• Global: shared or not
2. Each storage type needs to implement the externalize methods
• ILookupMaterializer.materializeLookupTablesForCube()
• ILookupMaterializer.materializeLookupTables()
3. Only an HBase implementation exists now, but the framework is extensible
• An MR job converts the lookup table to HFile
• The job engine loads the lookup table snapshot into HBase

20. KYLIN-3221 Lookup table in HBase
Encoding in HBase:
• Rowkey: composite of the lookup table PKs
• Column: 1 column family with all non-PK columns in it
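
As an illustration of this layout, the sketch below turns one lookup record into a rowkey built from its primary-key values plus a set of cells in a single column family. The length-prefixed byte encoding, the family name "c", and both function names are assumptions chosen for the example; the slide does not specify Kylin's actual encoding.

```python
import struct


def encode_rowkey(pk_values):
    """Concatenate primary-key values, each prefixed with a 2-byte length."""
    out = b""
    for value in pk_values:
        raw = str(value).encode("utf-8")
        out += struct.pack(">H", len(raw)) + raw
    return out


def to_hbase_row(record, pk_columns, family="c"):
    """Split a record into (rowkey, cells); all non-PK columns share one family."""
    rowkey = encode_rowkey([record[col] for col in pk_columns])
    cells = {
        f"{family}:{name}".encode("utf-8"): str(value).encode("utf-8")
        for name, value in record.items()
        if name not in pk_columns
    }
    return rowkey, cells


# Hypothetical lookup record with a two-column primary key.
record = {"COUNTRY": "CN", "CITY": "SZ", "NAME": "Shenzhen"}
rowkey, cells = to_hbase_row(record, ["COUNTRY", "CITY"])
```

The length prefix keeps composite keys unambiguous, so ("AB", "C") and ("A", "BC") produce different rowkeys.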

21. RocksDB as local cache
• When there are many derived calculations, HBase is much slower than memory.
• A local cache is needed for lookup table snapshots.
• RocksDB is an embeddable persistent key-value store for fast storage.

22. KYLIN-3221 Building local cache
• When a snapshot is taken, the job engine calls every query node to build its local cache.
• The query node loads data from HBase into RocksDB as a cache.
• The query engine accesses RocksDB to get the lookup table;
• When the cache is unavailable, it uses HBase.
[Figure: 1. The Job Engine loads the lookup snapshot to the HBase cluster; 2. The Job Engine notifies the query servers; 3. Each Query Engine dumps from HBase to its local RocksDB.]
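
The read path described above is a cache with a remote fallback. Here is a minimal sketch of that pattern, with plain Python dicts standing in for both HBase and RocksDB; the class and method names are hypothetical and do not correspond to Kylin classes.

```python
class LookupTable:
    """Lookup reads served from a local cache, falling back to the remote store."""

    def __init__(self, remote_store):
        self.remote_store = remote_store  # stand-in for the HBase snapshot
        self.local_cache = None           # stand-in for RocksDB; None until built

    def build_local_cache(self):
        # Triggered when the job engine notifies the query node:
        # dump the whole snapshot from the remote store into the local cache.
        self.local_cache = dict(self.remote_store)

    def get(self, key):
        # Prefer the local cache; use the remote store when it is unavailable.
        if self.local_cache is not None:
            return self.local_cache.get(key), "local"
        return self.remote_store.get(key), "remote"


table = LookupTable({"k1": "v1"})
before = table.get("k1")   # served remotely, cache not built yet
table.build_local_cache()
after = table.get("k1")    # served from the local cache
```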

23. KYLIN-3221 Performance data
• HDD disk; lookup table with 49 columns and 54.5 million rows; original source data size 824 MB
• Building the local cache takes 1 hour 56 min
• Group by a derived column with a result set of 534018 IDs takes 3.7s; from HBase it takes 237s.

24. KYLIN-3221 Summary
• Kylin can support large lookup tables (tens to hundreds of millions of rows)
• RocksDB can serve as a local cache for Kylin
• The query server becomes stateful with this feature

25. KYLIN-3378: Kafka topic join with Hive tables
• In 2.3 and before, the Kafka data source only allows a single table in the model.
• In the real world, users manage lookup tables in Hive. The need for Kafka + Hive is strong.
• KYLIN-3378 adds this support.
[Figure: Kafka topic (Fact) joined with Lookup A, Lookup B and Lookup C]

26. KYLIN-3378: Work flow
• The work flow is shown in the following figure.
• The build job may take more time if it needs to join with Hive.
[Figure: Kafka topic → Save Kafka data to HDFS → Star schema? → No: Build Cube; Yes: Create intermediate Hive fact table → Join intermediate fact with lookup tables → Build Cube]

27. Other enhancements
• KYLIN-3137, KYLIN-2484: Spark engine to support Kafka data
• KYLIN-3369: Reduce data size from Kafka
• KYLIN-3135: Allow project to set its own data source

28. Bug Fixes
• KYLIN-3388 Hive data may be lost after the "Redistribution" step
• KYLIN-1768 ArrayIndexOutOfBounds when using fixed_length:256
• KYLIN-3348 "missing LastBuildJobID" error when building multiple new segments concurrently
• …

29. Kylin + Superset (incubating)

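Kyligence (上海跬智信息技术有限公司) was founded by the core team of Apache Kylin, the first top-level open source project at the Apache Software Foundation to originate from China. It is a data technology company focused on big data analytics, and its mission is to accelerate users' critical business decisions through insight from cutting-edge data technology.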