Data Product at Airbnb
1. Data Product at Airbnb, Liyin Tang & Jingwei Lu
2. Data Infrastructure at Airbnb
3. Batch Infrastructure
[Diagram components: Airflow scheduling; event logs, Kafka; MySQL dumps, Sqoop; Gold cluster, Silver cluster, Spark cluster; ReAir; Hive, Spark, YARN, HDFS; S3; AirPal, Presto cluster, SuperSet, Tableau]
4. Streaming at Airbnb
[Diagram: sources, a cluster, and sinks, with Airflow scheduling. Components shown: Kafka, Spark Streaming, S3, HDFS, Datadog, DynamoDB, HBase, Elasticsearch, …]
5. Lambda Architecture
6. Lambda Architecture
[Diagram: AirStream with a streaming side (Kafka, Spark Streaming) and a batch side (Hive, Spark SQL). Also shown: HBase]
7. Our Foundations
• Combine stream with batch
• Shared global state store
8. Unified API through AirStream
• Declarative job configuration
• A computation operator or sink can be shared by stream and batch jobs
• Stream source vs. static source
• A single driver executes a job in either stream or batch mode
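The unified-API idea on this slide can be illustrated with a minimal sketch. It is not AirStream's actual API: the config keys, the operator names, and the `run_job` driver are all hypothetical, and plain Python lists stand in for Spark. It only shows one declarative config whose operators are reused unchanged whether the source is static (batch) or a stream of micro-batches.

```python
# Hypothetical operator registry: each operator maps an iterable of records
# to an iterable of records, so it does not care where the records came from.
OPERATORS = {
    "double": lambda records: (r * 2 for r in records),
    "keep_multiples_of_four": lambda records: (r for r in records if r % 4 == 0),
}


def make_config(source_type, records, batch_size=None):
    """Build a hypothetical declarative job config (keys are illustrative)."""
    source = {"type": source_type, "records": list(records)}
    if batch_size is not None:
        source["batch_size"] = batch_size
    return {"source": source, "operators": ["double", "keep_multiples_of_four"]}


def run_job(config):
    """Single driver: only the source handling differs between the two modes."""
    source = config["source"]
    records = source["records"]
    if source["type"] == "static":
        batches = [records]  # batch mode: one pass over all records
    elif source["type"] == "stream":
        size = source["batch_size"]  # stream mode: process micro-batches
        batches = [records[i:i + size] for i in range(0, len(records), size)]
    else:
        raise ValueError(f"unknown source type: {source['type']}")

    output = []
    for batch in batches:
        data = iter(batch)
        for name in config["operators"]:  # same operators in both modes
            data = OPERATORS[name](data)
        output.extend(data)
    return output


batch_result = run_job(make_config("static", [1, 2, 3, 4]))
stream_result = run_job(make_config("stream", [1, 2, 3, 4], batch_size=2))
```

Because the operators are stateless here, both modes produce the same output; stateful operators would need the shared state store discussed on the following slides.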
9. Shared Global State Store
[Diagram: under AirStream, several Spark Streaming and Spark Batch jobs sit on top of the same HBase tables]
10. Shared Global State Store
[Diagram: a DataFrame is re-partitioned into <Region 1, [RowKey, Value]>, <Region 2, [RowKey, Value]>, …, <Region N, [RowKey, Value]> and written to Region 1 … Region N through HBase Puts or HFile BulkLoad]
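The re-partition step in this diagram can be sketched in plain Python. This is only an illustration under stated assumptions: regions are described by a sorted list of start keys (the first being the empty string), and the function names are hypothetical rather than taken from AirStream or the HBase client. Rows are grouped by the region that owns their row key and sorted within each group, since HFiles store keys in sorted order.

```python
from bisect import bisect_right
from collections import defaultdict


def region_index(row_key, region_start_keys):
    """Return the index of the region whose key range contains row_key.

    region_start_keys must be sorted and start with "" (start of the table).
    A region owns keys from its start key (inclusive) up to the next start key.
    """
    return bisect_right(region_start_keys, row_key) - 1


def repartition_by_region(rows, region_start_keys):
    """Group (row_key, value) pairs by region, sorted by row key in each group."""
    partitions = defaultdict(list)
    for row_key, value in rows:
        partitions[region_index(row_key, region_start_keys)].append((row_key, value))
    return {region: sorted(cells) for region, cells in sorted(partitions.items())}


start_keys = ["", "g", "p"]  # hypothetical region boundaries
rows = [("zebra", 1), ("apple", 2), ("grape", 3), ("mango", 4)]
by_region = repartition_by_region(rows, start_keys)
```

Each resulting group could then be sent as puts to its region or written out as one sorted file for a bulk load.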
11. Shared Global State Store
[Diagram: Spark Streaming/Batch jobs read the HBase tables through Multi-Gets, Prefix Scan, and Time Range Scan]
12. Why HBase?
13. Why HBase
• Rich API for point lookups and sequential scans (TimeRange, TTL, Prefix Scan, …)
• Merged view based on version
• Unified API for streaming writes and bulk uploads
• Unified API for reading from live tables and snapshot tables
14. Streaming Computation
15. Merged Storage
[Diagram: versions of row key R1 ordered by time, all from streaming writes]
Streaming Writes    R1    V100    V100
Streaming Writes    R1    V99     V99
…
Streaming Writes    R1    V01     V01
16. Merged Storage
[Diagram: versions of row key R1 ordered by time, with a batch bulk upload merged in among the streaming writes]
Streaming Writes     R1    V100    V100
Streaming Writes     R1    V99     V99
Batch Bulk Upload    R1    V100    100
Streaming Writes     R1    V01     V01
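The merged view on these two slides can be sketched as a small versioned store. This is an in-memory illustration, not HBase client code: the class and method names are hypothetical, and the rule that a write at an existing version replaces the cell is an assumption of the sketch. Streaming writes and batch bulk uploads land in the same per-row version map, and a read returns the value at the newest version.

```python
class VersionedStore:
    """Toy store: row_key -> {version: value}, read by newest version."""

    def __init__(self):
        self._cells = {}

    def put(self, row_key, version, value):
        """Streaming write of a single cell."""
        # Sketch assumption: writing an existing version replaces that cell.
        self._cells.setdefault(row_key, {})[version] = value

    def bulk_load(self, cells):
        """Batch upload of many (row_key, version, value) cells at once."""
        for row_key, version, value in cells:
            self.put(row_key, version, value)

    def get(self, row_key, max_version=None):
        """Return the value at the newest version not above max_version."""
        versions = self._cells.get(row_key, {})
        eligible = [v for v in versions if max_version is None or v <= max_version]
        if not eligible:
            return None
        return versions[max(eligible)]


store = VersionedStore()
for version in (1, 99, 100):
    store.put("R1", version, f"stream-{version}")   # streaming writes
store.bulk_load([("R1", 100, "batch-100")])          # batch correction

latest = store.get("R1")
as_of_99 = store.get("R1", max_version=99)
```

Readers never need to know which path produced a cell; they only see the merged, versioned row.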
17. Distinct Count
[Diagram: row keys share a prefix and are ordered by time; a prefix scan with a TimeRange selects the rows to count]
Prefix_R1    V102    V102
Prefix_R2    V101    V101
Prefix_R3    V100    100
Prefix_R4    V01     V01
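A distinct count built from a prefix scan with a time range can be sketched as follows. This is plain Python over a sorted list of `(row_key, timestamp)` cells, not HBase client code; the function names and sample keys are hypothetical. The time range is treated as start-inclusive and end-exclusive, which matches how HBase's `TimeRange` is defined.

```python
from bisect import bisect_left


def prefix_scan(cells, prefix, min_ts, max_ts):
    """Yield (row_key, ts) for keys starting with prefix and min_ts <= ts < max_ts.

    cells must be a list of (row_key, timestamp) tuples sorted ascending.
    """
    start = bisect_left(cells, (prefix,))  # seek to the first key >= prefix
    for row_key, ts in cells[start:]:
        if not row_key.startswith(prefix):
            break  # sorted order: no later key can match the prefix
        if min_ts <= ts < max_ts:
            yield row_key, ts


def distinct_count(cells, prefix, min_ts, max_ts):
    """Count distinct row keys under prefix with at least one cell in range."""
    return len({row_key for row_key, _ in prefix_scan(cells, prefix, min_ts, max_ts)})


cells = sorted([
    ("a_R1", 102), ("a_R1", 100), ("a_R2", 101),
    ("a_R3", 100), ("a_R4", 1), ("b_R1", 101),
])
recent_a = distinct_count(cells, "a_", 100, 103)
```

Here `a_R4` is skipped because its only cell falls outside the time range, and `b_R1` is never visited because the scan stops at the end of the prefix.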
18. Moving Average
[Diagram: versions of row key R1 ordered by time and split into Window 1 and Window 2; each window is computed as count difference / time elapsed]
R1    V102    102
R1    V101    101
…
R1    V100    100
…
R1    V01     V01
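The "count difference / time elapsed" computation can be sketched like this. It assumes, beyond what the slide states, that each stored version carries a timestamp and a running (cumulative) count; the function name and sample data are hypothetical.

```python
def moving_average(versions, window_start, window_end):
    """Average rate over a window: count difference / time elapsed.

    versions is a list of (timestamp, cumulative_count) sorted by timestamp.
    Returns None when the window holds fewer than two distinct timestamps.
    """
    in_window = [(ts, count) for ts, count in versions
                 if window_start <= ts <= window_end]
    if len(in_window) < 2:
        return None
    (t_first, c_first), (t_last, c_last) = in_window[0], in_window[-1]
    if t_last == t_first:
        return None
    return (c_last - c_first) / (t_last - t_first)


versions = [(0, 10), (10, 30), (20, 70), (30, 100)]
window_1 = moving_average(versions, 0, 10)    # (30 - 10) / (10 - 0)
window_2 = moving_average(versions, 10, 30)   # (100 - 30) / (30 - 10)
```

Only the first and last versions inside each window are needed, which is what makes a time-range scan over one row a natural fit.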
19. Long Window Computation
20. Spark HBase Connector
[Diagram: Spark and Zeppelin access HBase through the HBase connector]
21. Presto - HBase Connector
[Diagram: Presto accesses HBase through the HBase connector, which provides Presto-to-HBase schema mapping and split -> RS mapping]
22. Hive - HBase Connector
[Diagram: Hive accesses HBase through the HBase connector, using the Hive HBase table InputFormat and the Hive HBase snapshot InputFormat]
23. Use Cases
24. MySQL DB Snapshot Using Binlog Replay
25. Move Elephant Database Snapshot
• Large amount of data: multiple large MySQL DBs
• Realtime-ness: minutes of delay / hours of delay
• Transactions: transactions that span different tables must be preserved
• Schema change: table schemas evolve
26. Binlog Replay on Spark
[Figure labels: 20+ hr, 4+ hr, 15, 5 mins, 1 hr; AirStream job]
27. Architecture
• Streaming and batch share logic: binlog file reader, DDL processor, transaction processor, DML processor.
• Idempotent: the log can be replayed multiple times.
• Schema changes: full schema change history.
[Diagram labels: Binlog, Log Parser, DDL, DML, xvid, HBase]
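The idempotency point can be made concrete with a small sketch. The slide only says the log can be replayed multiple times; using the binlog position as the cell version, as done below, is one possible way to get that property and is an assumption of this example. The event layout and function names are likewise hypothetical.

```python
def apply_binlog(store, events):
    """Apply binlog events to a versioned store.

    events: iterable of (position, op, table, primary_key, row).
    store:  dict mapping (table, primary_key) -> {position: row or None}.
    Each event writes the cell versioned by its own log position, so applying
    the same event again rewrites the same cell with the same content.
    """
    for position, op, table, primary_key, row in events:
        versions = store.setdefault((table, primary_key), {})
        versions[position] = None if op == "delete" else row  # None = tombstone


def snapshot(store, as_of_position):
    """Rebuild table contents as of a log position from the versioned cells."""
    result = {}
    for key, versions in store.items():
        eligible = [p for p in versions if p <= as_of_position]
        if eligible:
            row = versions[max(eligible)]
            if row is not None:
                result[key] = row
    return result


events = [
    (1, "insert", "users", 7, {"name": "ann"}),
    (2, "update", "users", 7, {"name": "anna"}),
    (3, "insert", "users", 8, {"name": "bob"}),
    (4, "delete", "users", 8, None),
]

once, twice = {}, {}
apply_binlog(once, events)
apply_binlog(twice, events)
apply_binlog(twice, events)  # replaying the same log changes nothing
```

Keeping every version also means a snapshot can be taken as of any earlier log position, not only the latest one.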
28. Realtime Ingestion & Interactive Query
29. Realtime Ingestion and Interactive Query
[Diagram components: Kafka, AirStream, Spark Streaming, HBase; query engines: Spark SQL, Hive SQL, Presto SQL; Data Portal]