2018年10月Apache Kylin meetup@杭州,eBay 大数据团队 Kylin 架构师分享了团队基于 Kylin 开发的实时 OLAP 解决方案。

注脚

1.New Kylin Streaming in eBay 2018-10-22

2.Agenda Why new Streaming Overall Architecture Detail Design Segment and Storage HA Checkpoint Performance KYLIN REAL-TIME ANALYTICS LAUNCH 2

3.Why New Streaming • Milliseconds Data Preparation Delay • Lambda Architecture • Less MR jobs and HBase Tables KYLIN REAL-TIME ANALYTICS LAUNCH 3

4.New Streaming Architecture KYLIN REAL-TIME ANALYTICS LAUNCH 4

5.New Streaming We divide the unbounded incoming streaming data into 3 stages, the data come into different stages are all queryable. InMem Stage Unbounded Continuously InMem streaming events Aggregations On Disk Stage Flush to disk, columnar based storage and indexes Full Cubing Stage Full cubing with MR or Spark, save to HBase. KYLIN REAL-TIME ANALYTICS LAUNCH 5

6.Streaming Components Query Engine Build Engine Management Monitor And Streaming Coordinator Metadata Store Streaming Receiver KYLIN REAL-TIME ANALYTICS LAUNCH 6

7.How Streaming Cube Engine Works new streaming cube request 1 Steaming Coordinator 6 Build Engine 8 2 5 Streaming 7 Receivers Cluster Streaming Sources ReplicaSet1 ReplicaSet2 4 3 Cube Storage ReplicaSet3 ReplicaSet4 (HBase) ReplicaSet5 KYLIN REAL-TIME ANALYTICS LAUNCH 7

8.How Streaming Query Engine Works SQL Query SQL Response 1 Query Engine Steaming Coordinator 2 3 Streaming Receivers Cube Cluster Storage (HBase) 8 KYLIN REAL-TIME ANALYTICS LAUNCH 8

9.Real-time Segment States Seg_3 Seg_4 1 … L In Memory Seg_2 Store 1 … M Unbounded Fragments streaming events 1 … J Seg_1 1 … N Active Segments Immutable Segments Open to Write Close to Process KYLIN REAL-TIME ANALYTICS LAUNCH 9

10.Segment Store On Disk KYLIN REAL-TIME ANALYTICS LAUNCH 10

11.Column Based Fragment File Format KYLIN REAL-TIME ANALYTICS LAUNCH 11

12.Invert Index Format • Use Roaring Bitmap. • Two format for tri-tree encoded values and fix-len encoded values KYLIN REAL-TIME ANALYTICS LAUNCH 12

13.Compression • Support Run Length Encoding and LZ4 Compression • Use RLE compression for time-related dim and first dim • Use LZ4 for other dimensions by default • Use LZ4 Compression simple-type measure(long, double) • No compression for complex measure(count dinstinct, topn, etc.) KYLIN REAL-TIME ANALYTICS LAUNCH 13

14.Replica Set • All receivers in the Replica Set replica set share the same assignment. Receiver1 • The lead of the ReplicaSet is responsible to upload Receiver2 Assignment: “cube1”:[1,2] real-time segments to “cube2”:[2,3] HDFS • Use Zookeeper to do leader election Zookeeper KYLIN REAL-TIME ANALYTICS LAUNCH 14

15. Local Check Point Date Time Partition Offsets SeqID of Active Segments 2016/10/01 Kafka 1,x 2,y 3,z Seg_5, I Seg_4, J Seg_3, K 12:00:00 Seg_5 1 … I Topic_1 Part_1 ... x x+1 … Seg_4 Topic_1 Part_2 ... y x+1 … 1 … J Seg_3 Topic_1 Part_3 ... z z+1 … 1 … K Kafka Active Fragments KYLIN REAL-TIME ANALYTICS LAUNCH 15

16.Remote Check Point • Checkpoint is saved to Cube Segment metadata after HBase segment build ”segments”:[{…, "stream_source_checkpoint": {"0":8946898241, “1”: 8193859535, ...} }, ] • The checkpoint info is the smallest partition offsets on the streaming receiver when real-time segment is sent to full build. KYLIN REAL-TIME ANALYTICS LAUNCH 16

17.Performance • Count Query on one hour data which has 36M rows take around 800ms • Consume around 44000 events/s for one receiver(11 dimensions, 1 metrics, no aggregations) • Detail Performance Doc: • https://drive.google.com/file/d/1GSBMpRuVQRmr8Ev2BWvssfMd- Rck9vsH/view?ths=true KYLIN REAL-TIME ANALYTICS LAUNCH 17

18.Streaming Next Step Star Schema Support Multi-Tenant Enhance Monitoring/Alerting For Streaming Receiver On Kubernetes KYLIN REAL-TIME ANALYTICS LAUNCH 18

19.Thank you! KYLIN REAL-TIME ANALYTICS LAUNCH 19

user picture
  • Kyligence
  • Kyligence (上海跬智信息技术有限公司)由首个来自中国的 Apache 软件基金会顶级开源项目 Apache Kylin 核心团队组建,是专注于大数据分析领域的数据科技公司,通过前沿数据技术的分析认知来加速用户关键商业决策是其使命。

相关Slides

  • 本PPT解释了作为支持交易型分布式数据库系统的TiDB核心产品架构及其主要组件,包括TiDB,TiKV,Placement Driver,TiSpark,TheFlash,Tool,TiDB-operator for k8s等,对其基本作用进行阐述,并对其中的核心组件TiKV重点分析,解释了基本数据组织方式,执行方式,数据管理,水平扩展和负载均衡,以及分布式一致性等基本问题。最好对其分析引擎TiSpark也进行了简要功能说明。

  • 介绍了ES的基本结构,功能和原理,重点分析了在实际生产环境中各种运维和监控的指标,以及各种调优经验和配置参数,还有运维自动化的方法论探讨,可以作为ES在实际生产环境中的最佳实践部署和运维监控案例,也可以帮助ES平台维护者理解并思考如何提供更好的ES服务及运维保障。

  • Adaptive Execution @ Spark + AI Summit Europe 2018 Video @ https://databricks.com/session/spark-sql-adaptive-execution-unleashes-the-power-of-cluster-in-large-scale-2

  • Apache Spark作为分布式内存计算引擎,内存使用的优化对于性能提升至关重要,Intel的Optane(傲腾)技术,让内存和SSD之间架设了个新的数据缓存/存储层,并通过PMDK等特殊的API绕过文件系统,系统调用,内存拷贝等一系列额外操作,让性能有极大的提升。Intel开源的OAP(Optimized Analytics Package)for Apache Spark项目,也是基于这个前体,构建即席查询引擎,以及在机器学习算法诸如KMeans算法上也获得了不错的性能回报。