Apache Kylin Updates 2018 2Q


Kyligence


At the Apache Kylin meetup in Shenzhen in June 2018, a Kyligence architect introduced the new features of Apache Kylin 2.3 and 2.4.


1. Apache Kylin Updates 2018 2Q
Shaofeng Shi, Kylin PMC
shaofengshi@apache.org

3. System Cube, Cube Planner, Dashboard (eBay)
[Diagram: metrics from the Kylin query and job engines feed the System Cube; the Cube Planner uses it for cube optimization and the Dashboard visualizes it.]
JIRAs shown: KYLIN-2727, KYLIN-2726, KYLIN-2721
Installation documents:
• https://kylin.apache.org/docs23/tutorial/setup_systemcube.html
• https://kylin.apache.org/docs23/tutorial/use_cube_planner.html
• https://kylin.apache.org/docs23/tutorial/use_dashboard.html
All rights reserved ©Kyligence Inc. http://kyligence.io

10. Cube Planner: Use case
• Cube building is 10x faster.
• Queries are 5x faster.
• Storage is reduced by 95%.
• Overall, more than 50x ROI improvement.

11. Faster cube build
- KYLIN-3125: Use SparkSQL to create flat table
RDBMS as data source
- KYLIN-3052: Support Redshift as data source
- KYLIN-3044: Support SQL Server as data source

13. KYLIN-3359: Support SUM(expression)
The expression can take either of the following forms:
1. a1*col1 + a2*col2 + ... + an*coln + b, if sum(col1), sum(col2), ..., sum(coln) are defined
2. case when filter1 then expr1 when filter2 then expr2 ... else exprN end, if filter1, filter2, ..., filterN-1 and expr1, expr2, ..., exprN are supported
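Form 1 can be answered from pre-aggregated measures because SUM is linear: over n rows, the sum of a1*col1 + ... + b equals a1*SUM(col1) + ... + b*n. The following is a minimal Python check of that identity; the rows and the coefficients a1, a2, b are made up for illustration and do not come from the slides.

```python
# Illustrative rows of (col1, col2); the values are made up for this sketch.
rows = [(3, 10), (5, 20), (7, 30)]
a1, a2, b = 2, 4, 1  # hypothetical coefficients

# Row-by-row evaluation of SUM(a1*col1 + a2*col2 + b).
direct = sum(a1 * c1 + a2 * c2 + b for c1, c2 in rows)

# The same value from pre-aggregated measures: SUM(col1), SUM(col2) and the row count.
sum_col1 = sum(c1 for c1, _ in rows)
sum_col2 = sum(c2 for _, c2 in rows)
from_measures = a1 * sum_col1 + a2 * sum_col2 + b * len(rows)

assert direct == from_measures
```

Note that the constant term b needs the row count in addition to the SUM measures.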

14. KYLIN-3359: Example
    select CAL_DT,
           sum(case when SLR_SEGMENT_CD is null then 0
                    when SLR_SEGMENT_CD = 0 then PRICE * 2
                    else PRICE end)
    from TEST_KYLIN_FACT
    group by CAL_DT
    order by CAL_DT

15. KYLIN-3359: Because it is equivalent to
    select CAL_DT,
           sum(case when SLR_SEGMENT_CD is null then 0
                    when SLR_SEGMENT_CD = 0 then PRICE * 2
                    else PRICE end)
    from (
        select CAL_DT, SLR_SEGMENT_CD, sum(PRICE) as PRICE
        from TEST_KYLIN_FACT
        group by CAL_DT, SLR_SEGMENT_CD
    ) tmp
    group by CAL_DT
    order by CAL_DT
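The equivalence between the direct query and the rewritten one can be checked on a small data set. The sketch below uses Python's built-in sqlite3 module as a stand-in for Kylin; the table contents are invented for illustration, and only the table and column names come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TEST_KYLIN_FACT (CAL_DT TEXT, SLR_SEGMENT_CD INTEGER, PRICE INTEGER)"
)
# Made-up sample rows, including a NULL segment code.
conn.executemany(
    "INSERT INTO TEST_KYLIN_FACT VALUES (?, ?, ?)",
    [
        ("2018-06-01", None, 10),
        ("2018-06-01", 0, 20),
        ("2018-06-01", 0, 5),
        ("2018-06-01", 1, 7),
        ("2018-06-02", 0, 3),
        ("2018-06-02", 2, 4),
    ],
)

CASE_EXPR = """case when SLR_SEGMENT_CD is null then 0
                    when SLR_SEGMENT_CD = 0 then PRICE * 2
                    else PRICE end"""

# SUM(expression) evaluated row by row on the fact table.
direct = conn.execute(
    f"select CAL_DT, sum({CASE_EXPR}) from TEST_KYLIN_FACT "
    "group by CAL_DT order by CAL_DT"
).fetchall()

# The same expression applied on top of pre-aggregated sum(PRICE) per segment code.
rewritten = conn.execute(
    f"""select CAL_DT, sum({CASE_EXPR})
        from (select CAL_DT, SLR_SEGMENT_CD, sum(PRICE) as PRICE
              from TEST_KYLIN_FACT
              group by CAL_DT, SLR_SEGMENT_CD) tmp
        group by CAL_DT order by CAL_DT"""
).fetchall()

assert direct == rewritten
```

The rewrite holds here because every branch of the CASE is either a constant or linear in PRICE, and the filters only depend on the group-by column SLR_SEGMENT_CD.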

16. KYLIN-3359: Implementation Brief
SQLDigest:
- groupby: CAL_DT, SLR_SEGMENT_CD
- measure: sum(PRICE), sum(DYNA)
- DynamicFunctionDesc: DYNA = case when SLR_SEGMENT_CD is null then 0 when SLR_SEGMENT_CD = 0 then PRICE * 2 else PRICE end
The dynamic column is added to:
- GTInfo
- CuboidToGridTableMappingExt
- GTScanRequest
Pushed down into the coprocessor for evaluation:
- GTFilterScanner
- GTFunctionScanner
- GTAggregateScanner

17. KYLIN-3221: Externalizing Lookup table
Limitations for lookup tables in Kylin:
1. A lookup table must be small enough to load into memory
• Max size 300 MB by default
2. Lookup table snapshots are taken per segment
• Hard to refresh snapshots for all segments

18. Workaround before KYLIN-3221
Don't take a snapshot for a big lookup table:
• Direct query to the lookup table is not supported;
• Derived dimensions on a big lookup table are not allowed.

19. KYLIN-3221 Core ideas
1. Introduce SnapshotTableDesc
• StorageType: "metaStore" by default
• Storage location
• Global: shared or not
2. Each storage type needs to implement the externalize methods
• ILookupMaterializer.materializeLookupTablesForCube()
• ILookupMaterializer.materializeLookupTables()
3. Only an HBase implementation exists for now, but the framework is extensible
• MR job to convert the lookup table to HFile
• The job engine loads the lookup table snapshot into HBase

20. KYLIN-3221 Lookup table encoding in HBase
Rowkey: composite of the lookup table PKs
Column: 1 column family with all non-PK columns in it
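The layout above can be illustrated with a small Python sketch. The slides do not specify how the PK values are encoded, so the length-prefixed encoding, the column family name "c", the sample row, and both function names are assumptions made for illustration only, not Kylin's actual implementation.

```python
import struct


def encode_rowkey(pk_values):
    """Concatenate PK values, each prefixed with a 2-byte big-endian length.

    This encoding is an assumption; the slides only say the rowkey is a
    composite of the lookup table PKs.
    """
    parts = []
    for value in pk_values:
        raw = str(value).encode("utf-8")
        parts.append(struct.pack(">H", len(raw)) + raw)
    return b"".join(parts)


def to_hbase_row(row, pk_columns, family="c"):
    """Split a lookup row into (rowkey, cells); all non-PK columns share one family."""
    rowkey = encode_rowkey([row[col] for col in pk_columns])
    cells = {
        (family, col): str(val).encode("utf-8")
        for col, val in row.items()
        if col not in pk_columns
    }
    return rowkey, cells


# Hypothetical lookup row with a two-column primary key.
row = {"SITE_ID": 15, "CATEG_ID": 1349, "CATEG_NAME": "Collectibles", "UPD_USER": "etl"}
rowkey, cells = to_hbase_row(row, pk_columns=["SITE_ID", "CATEG_ID"])
```

Length prefixes keep composite keys unambiguous: without them, the PK pairs ("1", "23") and ("12", "3") would produce the same bytes.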

21. RocksDB as local cache
• When there are many derived-column calculations, HBase is much slower than memory.
• A local cache is needed for lookup table snapshots.
• RocksDB is an embeddable persistent key-value store for fast storage.

22. KYLIN-3221 Building local cache
• When a snapshot is taken, the job engine calls every query node to build its local cache.
• The query node loads data from HBase into RocksDB as a cache.
• The query engine accesses RocksDB to get the lookup table.
• When the cache is unavailable, HBase is used.
[Diagram: 1. The job engine loads the lookup snapshot to the HBase cluster. 2. The job engine notifies the query servers. 3. Each query engine dumps the snapshot from HBase to its local RocksDB.]
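The read path described above (local cache first, HBase when the cache is unavailable) can be sketched as follows. Plain Python dicts stand in for RocksDB and HBase, and the class and method names are hypothetical; this shows the fallback pattern only, not Kylin's code.

```python
class LookupReader:
    """Serve lookups from a local cache, falling back to the remote store."""

    def __init__(self, remote_store):
        self.remote_store = remote_store  # dict standing in for the HBase table
        self.local_cache = None           # dict standing in for RocksDB; None = unavailable

    def build_local_cache(self):
        # Dump the snapshot from the remote store into the local store.
        self.local_cache = dict(self.remote_store)

    def get(self, rowkey):
        # Prefer the local cache; use the remote store only when no cache exists.
        if self.local_cache is not None:
            return self.local_cache.get(rowkey), "local"
        return self.remote_store.get(rowkey), "remote"


hbase = {b"15|1349": {"CATEG_NAME": "Collectibles"}}
reader = LookupReader(hbase)
before = reader.get(b"15|1349")  # cache not built yet, served remotely
reader.build_local_cache()       # would be triggered by the job engine's notification
after = reader.get(b"15|1349")   # now served from the local cache
```

Because each query node holds its own copy of the snapshot, the node carries state that has to be rebuilt whenever a new snapshot is taken.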

23. KYLIN-3221 Performance data
• HDD disk; lookup table with 49 columns and 54.5 million rows; original source data size 824 MB
• Building the local cache takes 1 hour 56 min
• Group by a derived column with a result set of 534018 IDs takes 3.7 s; the same query served from HBase takes 237 s.

24. KYLIN-3221 Summary
• Kylin can support large lookup tables (tens to hundreds of millions of rows)
• RocksDB can serve as a local cache for Kylin
• The query server becomes stateful with this feature

25. KYLIN-3378: Kafka topic join with Hive tables
• In 2.3 and earlier, a Kafka data source only allows a single table in the model.
• In the real world, users manage lookup tables in Hive. The need for Kafka + Hive is strong.
• KYLIN-3378 adds this support.
[Diagram: Kafka topic (Fact) joined with Lookup A, Lookup B, and Lookup C]

26. KYLIN-3378: Work flow
• The workflow is shown in the following figure.
• The build job may take more time if it needs to join with Hive.
[Diagram: Kafka topic -> save Kafka data to HDFS -> star schema? -> No: build cube. Yes: create intermediate Hive fact table -> join intermediate fact with lookup tables -> build cube.]

27. Other enhancements
• KYLIN-3137, KYLIN-2484: Spark engine to support Kafka data
• KYLIN-3369: Reduce data size from Kafka
• KYLIN-3135: Allow project to set its own data source

28. Bug Fixes
• KYLIN-3388: Hive data may be lost after the "Redistribution" step
• KYLIN-1768: ArrayIndexOutOfBounds when using fixed_length:256
• KYLIN-3348: "missing LastBuildJobID" error when building multiple new segments concurrently
• …
