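Speaker: Ma Gang, Senior Engineer at eBay
Talk outline:
- Why a new Kylin streaming was built, and what distinguishes it
- Overall architecture and components of the new Kylin Streaming
- HA and columnar storage design
- Ingestion and query performance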

Published by Kyligence on 2019/02/26


1. Apache Kylin Real-time Streaming, 2019-02

2. Agenda
• Why Real-time Streaming
• Overall Architecture
• Detail Design: Segment and Storage, HA, Checkpoint
• Performance
• In eBay
KYLIN REAL-TIME ANALYTICS LAUNCH

3. Why Real-time Streaming
• Millisecond-level data preparation delay
• Lambda Architecture
• Fewer Hadoop jobs and HBase tables

4. RT Streaming Architecture

5. RT Streaming
The unbounded incoming streaming data is divided into 3 stages, and data in every stage is queryable.
• InMem Stage: unbounded streaming events are continuously aggregated in memory
• On Disk Stage: data is flushed to disk as column-based storage with indexes
• Full Cubing Stage: full cubing with MR or Spark, saved to HBase
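The three stages can be pictured with a toy model. This is only a sketch under stated assumptions, not Kylin's implementation: the class `StagedStore`, its method names, and the flush trigger (a count of distinct keys) are all hypothetical, and a plain dictionary stands in for HBase. What it shows is that a query has to combine results from all three stages.

```python
from collections import defaultdict


class StagedStore:
    """Toy model of the three stages (hypothetical names, not Kylin's API)."""

    def __init__(self, flush_threshold=3):
        # Assumption: flush when the in-memory map holds this many distinct keys.
        self.flush_threshold = flush_threshold
        self.in_mem = defaultdict(int)      # stage 1: aggregated in memory
        self.fragments = []                 # stage 2: immutable flushed fragments
        self.full_cube = defaultdict(int)   # stage 3: stands in for HBase

    def ingest(self, dim, value):
        """Aggregate one event in memory and flush when the threshold is hit."""
        self.in_mem[dim] += value
        if len(self.in_mem) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Move the in-memory aggregates into a new on-disk style fragment."""
        if self.in_mem:
            self.fragments.append(dict(self.in_mem))
            self.in_mem.clear()

    def full_build(self):
        """Merge all fragments into the full cube, as a batch job would."""
        for fragment in self.fragments:
            for dim, value in fragment.items():
                self.full_cube[dim] += value
        self.fragments.clear()

    def query_sum(self, dim):
        """Answer a SUM query by combining all three stages."""
        total = self.in_mem.get(dim, 0) + self.full_cube.get(dim, 0)
        total += sum(fragment.get(dim, 0) for fragment in self.fragments)
        return total
```

In this model the answer to `query_sum` does not change when `flush` or `full_build` runs, which is the property that keeps data queryable in every stage.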

6. RT Streaming Components
• Query Engine
• Build Engine
• Management And Monitor
• Streaming Coordinator
• Metadata Store
• Streaming Receiver

7. How Cube Engine Works
Diagram labels: new streaming cube request; Streaming Coordinator; Build Engine; Streaming Sources; Streaming Receivers Cluster (ReplicaSet1 to ReplicaSet5); Cube Storage (HBase); steps numbered 1 to 8.

8. How Query Engine Works
Diagram labels: SQL Query; SQL Response; Query Engine; Streaming Coordinator; Streaming Receivers Cluster; Cube Storage (HBase); steps numbered 1 to 3.

9. Real-time Segment States
Diagram labels: unbounded streaming events; In Memory Store; Fragments; Seg_1 to Seg_4, each holding a numbered run of fragments (1 … L, 1 … M, 1 … J, 1 … N).
• Active Segments: open to write
• Immutable Segments: closed for processing
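The two segment states can be sketched as a small state machine. The slide does not say how events map to segments or when a segment closes, so everything here is an assumption made for illustration: segments are keyed by a fixed event-time window (`WINDOW_SECONDS`), and a segment becomes immutable once a hypothetical lateness allowance (`CLOSE_AFTER_SECONDS`) has passed.

```python
from dataclasses import dataclass, field

WINDOW_SECONDS = 3600       # hypothetical segment window length
CLOSE_AFTER_SECONDS = 600   # hypothetical allowance for late events


@dataclass
class Segment:
    start: int
    state: str = "ACTIVE"   # ACTIVE: open to write; IMMUTABLE: closed
    events: list = field(default_factory=list)


class SegmentManager:
    def __init__(self):
        self.segments = {}

    def add_event(self, event_time, payload):
        """Route an event to its window; reject it if that segment is closed."""
        start = event_time - event_time % WINDOW_SECONDS
        segment = self.segments.setdefault(start, Segment(start))
        if segment.state != "ACTIVE":
            return False
        segment.events.append(payload)
        return True

    def close_expired(self, now):
        """Mark segments whose window plus allowance has passed as immutable."""
        closed = []
        for segment in self.segments.values():
            deadline = segment.start + WINDOW_SECONDS + CLOSE_AFTER_SECONDS
            if segment.state == "ACTIVE" and now >= deadline:
                segment.state = "IMMUTABLE"
                closed.append(segment.start)
        return sorted(closed)
```

Once a segment is immutable in this sketch it accepts no further writes, so it can be handed to later processing without racing against the ingestion path.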

10. Segment Store On Disk

11. Column Based Fragment File Format

12. Inverted Index Format
• Uses Roaring Bitmap
• Two formats: one for trie-tree encoded values and one for fixed-length encoded values
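The idea behind a bitmap-based inverted index can be sketched in a few lines. This is an illustration, not the fragment file format: a plain Python integer stands in for a Roaring Bitmap, and the column names and values are made up. Each distinct value maps to a bitmap whose set bits are the row ids containing that value, so filters become bitwise AND and OR.

```python
def build_inverted_index(column):
    """Map each distinct value to a bitmap (int) of the row ids that hold it."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def rows_of(bitmap):
    """Expand a bitmap back into a sorted list of row ids."""
    rows = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows


# Hypothetical dimension columns of one fragment, one entry per row.
site = ["US", "DE", "US", "UK", "DE"]
device = ["web", "web", "app", "app", "web"]

site_index = build_inverted_index(site)
device_index = build_inverted_index(device)

# WHERE site = 'US' AND device = 'web'
both = site_index["US"] & device_index["web"]
# WHERE site IN ('US', 'UK')
either = site_index["US"] | site_index["UK"]
```

A real Roaring Bitmap compresses these bit sets, but the query-side operations are the same set intersections and unions.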

13. Compression
• Supports Run Length Encoding (RLE) and LZ4 compression
• RLE for time-related dimensions and the first dimension
• LZ4 for other dimensions by default
• LZ4 for simple-type measures (long, double)
• No compression for complex measures (count distinct, topN, etc.)
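Run Length Encoding pays off when a column contains long runs of repeated values. A minimal sketch of the technique follows; it is not Kylin's on-disk encoding, and the sample column is invented. The assumption here is that rows arrive roughly ordered by time, so a time-related dimension repeats the same value many times in a row.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    decoded = []
    for value, count in runs:
        decoded.extend([value] * count)
    return decoded


# Hypothetical hour dimension of seven consecutive rows.
hours = ["12:00"] * 4 + ["13:00"] * 3
encoded = rle_encode(hours)
```

Seven values shrink to two pairs here. A high-cardinality, unordered column would produce runs of length one and gain nothing, which is consistent with the slide using a general-purpose compressor (LZ4) for the other dimensions.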

14. Replica Set
• All receivers in a Replica Set share the same assignment
• The leader of the Replica Set is responsible for uploading real-time segments to HDFS
• ZooKeeper is used for leader election
Diagram labels: a replica set containing Receiver1 and Receiver2; Zookeeper; Assignment: "cube1":[1,2], "cube2":[2,3]

15. Local Check Point
Checkpoint record:
• Date Time: 2016/10/01 12:00:00
• Partition Offsets (Kafka): 1,x 2,y 3,z
• SeqID of Active Segments: Seg_5, I; Seg_4, J; Seg_3, K
Kafka: Topic_1 Part_1 ... x x+1 …; Topic_1 Part_2 ... y y+1 …; Topic_1 Part_3 ... z z+1 …
Active Fragments: Seg_5 1 … I; Seg_4 1 … J; Seg_3 1 … K
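The local checkpoint on this slide bundles three things: a timestamp, the Kafka offset per partition, and a sequence id per active segment. A minimal sketch of such a record and of a restart from it is shown below. The field names, the JSON serialization, and the concrete numbers are assumptions for illustration; the sketch also assumes the stored offset is the last event already consumed, so consumption resumes at offset + 1.

```python
import json


def make_checkpoint(time_str, partition_offsets, active_segment_seq_ids):
    """Bundle the three pieces of state shown on the slide into one record."""
    return {
        "time": time_str,
        "partition_offsets": dict(partition_offsets),
        "active_segment_seq_ids": dict(active_segment_seq_ids),
    }


def resume_positions(checkpoint):
    """Return the next offset to read for every partition.

    Assumption: each stored offset is the last event already consumed.
    """
    return {
        partition: offset + 1
        for partition, offset in checkpoint["partition_offsets"].items()
    }


# Hypothetical values standing in for x, y, z and I, J, K on the slide.
checkpoint = make_checkpoint(
    "2016/10/01 12:00:00",
    {"1": 100, "2": 250, "3": 75},
    {"Seg_5": 2, "Seg_4": 7, "Seg_3": 9},
)

# Round-trip through JSON, as if written to and read back from local disk.
restored = json.loads(json.dumps(checkpoint))
next_offsets = resume_positions(restored)
```

Storing the fragment sequence ids next to the offsets lets a restarted receiver tell which fragments on disk were written before the checkpoint and which were written after it.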

16. Remote Check Point
• The checkpoint is saved to the Cube Segment metadata after the HBase segment build:
"segments":[{…, "stream_source_checkpoint": {"0":8946898241, "1": 8193859535, ...} }, ]
• The checkpoint info is the smallest partition offsets on the streaming receiver when the real-time segment is sent to full build.
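Taking the smallest offset per partition is the conservative choice: restarting from it may re-read some events but cannot skip any. The sketch below illustrates that rule only. The slide does not say where the candidate offsets come from, so the input (several per-partition offset maps) and the function name `smallest_offsets` are assumptions; the key `stream_source_checkpoint` and the two smaller offset values are taken from the slide.

```python
def smallest_offsets(offset_maps):
    """Keep the minimum offset seen for every partition."""
    merged = {}
    for offsets in offset_maps:
        for partition, offset in offsets.items():
            merged[partition] = min(offset, merged.get(partition, offset))
    return merged


# Hypothetical offset reports; the larger values are invented.
reports = [
    {"0": 8946898241, "1": 8193859600},
    {"0": 8946898300, "1": 8193859535},
]

segment_metadata = {"stream_source_checkpoint": smallest_offsets(reports)}
```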

17. Performance
• A count query over one hour of data (36M rows) takes around 800 ms
• One receiver consumes around 44,000 events/s (11 dimensions, 1 metric)
• Detailed performance doc: https://drive.google.com/file/d/1GSBMpRuVQRmr8Ev2BWvssfMd-Rck9vsH/view?ths=true
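The two figures on this slide can be put on a common scale with some simple arithmetic. The inputs below are the slide's rounded numbers; the derived values are back-of-the-envelope estimates, not measurements, and the variable names are only illustrative.

```python
ROWS_IN_ONE_HOUR = 36_000_000   # from the slide: one hour of data
QUERY_MILLIS = 800              # from the slide: count query latency
EVENTS_PER_SECOND = 44_000      # from the slide: one receiver's consume rate

# Average arrival rate implied by 36M rows in one hour.
avg_rows_per_second = ROWS_IN_ONE_HOUR // 3600

# Rows covered per second of query time (an effective rate, since the
# query may be answered from aggregates and indexes, not a raw scan).
effective_rows_per_second = ROWS_IN_ONE_HOUR * 1000 // QUERY_MILLIS

# Events one receiver could consume in an hour at the quoted rate.
events_per_hour = EVENTS_PER_SECOND * 3600

print(avg_rows_per_second, effective_rows_per_second, events_per_hour)
```

By this estimate the quoted consume rate of a single receiver is about 4.4 times the average arrival rate of the one-hour data set used for the query test.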

18. In eBay
• 20 streaming receivers; hardware: VMs with 86 GB RAM and 16 cores
• Use case: site speed analytics, 16 dimensions, 50 metrics

19. Kylin RT Streaming Next Step
• Star Schema Support
• Multi-Tenant
• Enhance Monitoring/Alerting For Streaming Receiver On Kubernetes

20. Thank you!

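Kyligence (上海跬智信息技术有限公司) was founded by the core team of Apache Kylin, the first top-level open-source project of the Apache Software Foundation to originate from China. It is a data technology company focused on big data analytics, with the mission of accelerating users' key business decisions through analytical insight built on cutting-edge data technology.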
