Apache Kylin 2.5 Updates

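At the Apache Kylin meetup in Hangzhou in October 2018, an Apache Kylin committer & PMC member presented the new features and improvements in Kylin v2.5.0, along with some of the community's work in progress.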

1. Apache Kylin 2.5.0 Update
Shaofeng Shi, shaofeng.shi@kyligence.io
Apache Kylin committer & PMC, 2018/10/26

2. Agenda
• Kylin 2.5.0 new features
  • All-in-Spark job engine
  • Hadoop 3 / HBase 2 support
  • Hybrid model
  • Enable cube planner
  • Merge dictionary on Yarn
  • MySQL as metastore
  • Enhanced segment pruning
• Kylin up-coming features
  • Distributed query cache
  • Data source SDK
  • Parquet as cube storage
© Kyligence Inc. 2018, Confidential.

3. All-in-Spark job engine
• Why move to Spark?
  • Better performance and resource utilization;
  • Less dependency on Hadoop;
• Kylin 2.0 moved only the by-layer cubing steps to Spark
• Now all jobs are moved to Spark
  • Convert to HFile
  • Merge segments
  • Fact distinct dimension values

4. All-in-Spark job engine – cont.
• Other enhancements in the Spark engine
  • Optimized memory usage in cubing;
  • Optimized default configuration for Spark;
  • Support for discarding Spark jobs;
• Performance
  • Cubing steps are about 10% faster than in the previous version;
  • The Convert to HFile step on Spark is somewhat slower (20% to 50%) than on MR;
    • Spark repartition and sort is slow;
  • The Fact distinct dimension values step performs close to MR;
  • Performance tuning will continue.

5. Hadoop 3 / HBase 2.0 support
• New features in Hadoop 3
  • Erasure coding: saves up to 50% of disk space
  • More than 2 NameNodes: maximizes fault tolerance
  • Shaded client jars: avoid conflicts with application dependencies
• New features in HBase 2
  • AssignmentManager V2: faster region assignment
  • Off-heap read/write path: less GC
  • NettyRpcServer: lower RPC latency
  • Async RPC client: higher throughput
• Some users have already adopted Hadoop 3;
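The "up to 50%" figure for erasure coding can be checked with simple arithmetic. The sketch below assumes the comparison is between 3x replication and a Reed-Solomon layout with 6 data units and 3 parity units; the slide does not name a policy, so both choices are assumptions, and the function names are illustrative only.

```python
def raw_bytes_with_replication(logical_bytes: float, replicas: int = 3) -> float:
    """Raw disk used when every block is stored `replicas` times."""
    return logical_bytes * replicas


def raw_bytes_with_erasure_coding(
    logical_bytes: float, data_units: int = 6, parity_units: int = 3
) -> float:
    """Raw disk used when each stripe holds data_units + parity_units cells."""
    return logical_bytes * (data_units + parity_units) / data_units


logical = 100.0  # e.g. 100 TB of cube data (made-up figure)
replicated = raw_bytes_with_replication(logical)      # 3x the logical size
encoded = raw_bytes_with_erasure_coding(logical)      # 1.5x the logical size
saving = 1 - encoded / replicated

print(f"replication: {replicated:.0f}, erasure coding: {encoded:.0f}, saving: {saving:.0%}")
```

Under these assumptions the raw footprint drops from 3x to 1.5x of the logical data, which is where a 50% saving would come from.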

6. Hadoop 3 / HBase 2.0 support – cont.
• From 2.5, Kylin provides binary packages for Hadoop 3 / HBase 2
  • Tomcat upgraded to 8.5
  • Startup scripts refined
  • Tested on CDH 6.0 and HDP 3.0
• Known issues so far
  • Spark does not support Hadoop 3;
  • HBase does not support erasure coding;

7. Hybrid model
• A hybrid model combines multiple cubes to answer a query;
  • The cubes must share the same model;
  • They can have different dimensions and measures;
• Hybrid can support schema changes;
• Hybrid was introduced in Kylin 1.0, with CLI support;
• Kylin 2.5 adds a GUI for hybrid
  • https://kylin.apache.org/docs/tutorial/hybrid.html

8. Enable Cube Planner by default
• Kylin introduced Cube Planner in v2.3, disabled by default;
  • kylin.cube.cubeplanner.enabled=false
  • kylin.cube.cubeplanner.enabled-for-existing-cube=false
• Cube Planner optimizes the cube spanning tree in two phases:
  • Phase I: optimize by statistics, automated;
  • Phase II: optimize by query history, user triggered;
    • (Requires the system cube)
• Kylin 2.5 enables phase I optimization and refines the configuration;
  • kylin.cube.cubeplanner.enabled=true
  • kylin.cube.cubeplanner.enabled-for-existing-cube=true
  • kylin.cube.aggrgroup.max-combination=32768
• Kylin 2.6 will provide a simplified script for building the system cube.

9. Cube Planner: make cube smaller
• An example with SSB:
  • Cuboid count reduced from 512 to 152 with phase I optimization alone;
  • Storage reduced by 70%;
  • Cube build improved by 25%
• Tutorial:
  • https://kylin.apache.org/docs/tutorial/use_cube_planner.html
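The cuboid numbers in the SSB example can be related to each other with a little arithmetic. A full cube over n dimensions has 2^n cuboids, so 512 cuboids would correspond to 9 dimensions; the slide does not state the dimension count, so that is an inference, and the helper name below is illustrative.

```python
def full_cuboid_count(num_dimensions: int) -> int:
    """Every subset of the dimensions is one cuboid, hence 2^n in a full cube."""
    return 2 ** num_dimensions


before, after = 512, 152           # cuboid counts quoted on the slide
dims = before.bit_length() - 1     # exact log2 for a power of two
pruned_fraction = (before - after) / before

print(f"{dims} dimensions, {pruned_fraction:.1%} of cuboids pruned")
# prints "9 dimensions, 70.3% of cuboids pruned"
```

This only restates the cuboid counts; the storage and build-time figures on the slide are separate measurements.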

10. Merge dictionaries on Yarn
• Why
  • Dictionary merge is a CPU- and memory-intensive task;
  • Merging in the Kylin node can slow down or even crash Kylin;
  • It gets worse when many jobs run together;
• Submit dictionary files to HDFS and merge them in a Yarn container
  • Eliminates the bottleneck;
  • Kylin becomes more lightweight;
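To make the cost of a dictionary merge concrete, here is a minimal sketch. It assumes each segment dictionary is a sorted list of distinct values whose position is its ID; Kylin's real dictionaries use a more compact structure, and merge_dictionaries is a hypothetical name, not a Kylin API.

```python
import heapq
from itertools import groupby
from typing import Dict, List, Tuple


def merge_dictionaries(*segment_dicts: List[str]) -> Tuple[List[str], Dict[str, int]]:
    """Merge sorted per-segment value lists into one sorted, de-duplicated list.

    Returns the merged values and a value -> new ID mapping.
    """
    # heapq.merge streams the sorted inputs in order; groupby drops duplicates.
    merged = [value for value, _ in groupby(heapq.merge(*segment_dicts))]
    ids = {value: new_id for new_id, value in enumerate(merged)}
    return merged, ids


# Made-up dictionaries for one dimension in two segments.
seg1 = ["beijing", "hangzhou", "shanghai"]
seg2 = ["hangzhou", "shenzhen"]

merged, ids = merge_dictionaries(seg1, seg2)
print(merged)
print(ids["shenzhen"])
```

Even in this simplified form, the merged result must hold every distinct value across all segments, which suggests why running the merge outside the Kylin process relieves its memory.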

11. MySQL as metastore
• HBase was the default metadata store for Kylin;
  • Metadata includes: project/model/cube definitions, job information, user ACLs, etc.;
  • When HBase is down, Kylin can NOT work.
• MySQL is the second option now;
  • An HBase outage won't crash Kylin;
  • Supports HBase read-only mode (e.g., AWS EMR HBase read replica);
  • Makes ops easier: backup, restore, monitoring, etc.;
  • Prepares for a no-HBase deployment;
• Design
  • Two tables are created in MySQL: one for job info, the other for all other metadata;
  • Big files are routed to HDFS;
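The design bullets above can be pictured as a small routing decision. The sketch below is purely illustrative: the table names, the path prefixes that identify job entries, and the size threshold are all hypothetical placeholders, not Kylin's actual values.

```python
# All constants below are hypothetical, chosen only for illustration.
BIG_FILE_THRESHOLD_BYTES = 1024 * 1024
JOB_PATH_PREFIXES = ("/execute", "/execute_output")
JOB_TABLE = "mysql:job_table"
METADATA_TABLE = "mysql:metadata_table"
HDFS = "hdfs"


def route(resource_path: str, size_bytes: int) -> str:
    """Pick a destination for one metadata entry."""
    if size_bytes > BIG_FILE_THRESHOLD_BYTES:
        return HDFS                # big files bypass MySQL
    if resource_path.startswith(JOB_PATH_PREFIXES):
        return JOB_TABLE           # job information
    return METADATA_TABLE          # project/model/cube definitions, ACLs, ...


print(route("/cube/sales_cube.json", 4_096))
print(route("/execute/job-0001", 2_048))
print(route("/dict/huge_dictionary.dict", 50 * 1024 * 1024))
```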

12. MySQL as metastore
• Configuration
  • MySQL server 5.1 or above, connected via the JDBC driver;
  • https://kylin.apache.org/docs/install/advance_settings.html
• This feature is in beta now
  • Tested with AWS RDS, Google Cloud SQL, Aliyun RDS;
  • Can easily be extended to support other JDBC services

13. Enhanced segment pruning
• Previously, Kylin pruned segments only by time range (partition column value) and dictionaries (if available);
• From 2.5, Kylin records the min/max value of each dimension at the segment level;
• Segments are pruned by the min/max range before reading from storage, which reduces storage I/O;
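The min/max pruning idea can be sketched in a few lines. This is a simplified illustration, not Kylin's implementation: it assumes filters are inclusive ranges on integer dimensions, and the Segment class and prune_segments function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Range = Tuple[int, int]  # inclusive (low, high)


@dataclass
class Segment:
    name: str
    dim_ranges: Dict[str, Range]  # dimension -> (min, max) recorded at build time


def may_match(seg_range: Optional[Range], query_range: Range) -> bool:
    """True if the segment could hold rows inside the query range."""
    if seg_range is None:
        return True  # no statistics for this dimension: cannot prune safely
    seg_min, seg_max = seg_range
    low, high = query_range
    return seg_min <= high and low <= seg_max


def prune_segments(segments: List[Segment], filters: Dict[str, Range]) -> List[Segment]:
    """Keep only segments whose min/max ranges overlap every filter."""
    return [
        seg
        for seg in segments
        if all(may_match(seg.dim_ranges.get(dim), rng) for dim, rng in filters.items())
    ]


# Made-up segments and filter.
segments = [
    Segment("seg_2018_09", {"price": (1, 100)}),
    Segment("seg_2018_10", {"price": (200, 500)}),
]
kept = prune_segments(segments, {"price": (150, 300)})
print([seg.name for seg in kept])
```

Only the second segment overlaps the filter, so the first one never needs to be read from storage.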

14. Up-coming features
• Cache enhancements (KYLIN-2895)
  • Memcached as query cache;
  • Storage-level cache;
• Data Source SDK (KYLIN-3552)
  • SDK for adding new JDBC data sources easily;
  • API for handling dialects;
  • Can do syntax conversion when pushing down to the data source;
• Parquet as cube storage
  • Use Parquet + Spark for Kylin storage and query;

15. Our Contact
Apache Kylin: dev@kylin.apache.org
Kyligence Inc
Address: Room 405, Building Y1, No. 112 Liangxiu Road, Pudong New Area, Shanghai (上海市浦东新区亮秀路112号Y1座405)
Telephone: 021-61060928
E-mail: info@kyligence.io
Website: https://kyligence.io

16. THANK YOU

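Kyligence (上海跬智信息技术有限公司) was founded by the core team of Apache Kylin, the first top-level open source project of the Apache Software Foundation to originate from China. It is a data technology company focused on big data analytics, and its mission is to accelerate users' critical business decisions through analytical insight built on cutting-edge data technology.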