Data Infrastructure at Airbnb.

1. Liyin Tang & Jingwei Lu, Data Product at Airbnb

2. Data Infrastructure at Airbnb

3. Batch Infrastructure
Diagram components: Airflow scheduling; event logs (Kafka); MySQL dumps (Sqoop); Gold cluster and Silver cluster (Hive, Spark, Yarn, HDFS) with ReAir; Spark cluster; S3; Presto cluster; AirPal; SuperSet; Tableau.

4. Streaming at Airbnb
Diagram components: sources (Kafka cluster, …); Airflow scheduling; Spark Streaming; sinks (Datadog, Kafka, S3, DynamoDB, HBase, Elasticsearch, HDFS, …).

5. Lambda Architecture

6. Lambda Architecture
Diagram components: AirStream; streaming path (Kafka, Spark Streaming); batch path (Hive, Spark SQL); HBase.

7. Our Foundations
• Combine stream with batch
• Shared global state store

8. Unified API through AirStream
• Declarative job configuration
• A computation operator or sink can be shared by stream and batch jobs
• Stream source vs. static source
• A single driver executes a job in stream or batch mode
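
The idea of one declarative job whose operators run unchanged in stream or batch mode can be sketched in plain Python. This is an illustration only, not AirStream's actual configuration format or API: `JOB_CONFIG`, `OPERATORS`, `static_source`, `stream_source`, and `run_job` are all hypothetical names.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical operator registry: each operator maps one record to one record.
OPERATORS: Dict[str, Callable[[dict], dict]] = {
    "add_total": lambda r: {**r, "total": r["price"] * r["nights"]},
}

# Hypothetical declarative job description (not AirStream's real schema).
JOB_CONFIG = {"operators": ["add_total"]}


def static_source(rows: List[dict]) -> Iterator[List[dict]]:
    """Batch mode: the whole data set arrives as a single batch."""
    yield list(rows)


def stream_source(rows: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Stream mode: records arrive in micro-batches of at most batch_size."""
    batch: List[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_job(config: dict, source: Iterator[List[dict]]) -> List[dict]:
    """Single driver: applies the same operators whatever the source type."""
    sink: List[dict] = []
    for batch in source:
        for record in batch:
            for name in config["operators"]:
                record = OPERATORS[name](record)
            sink.append(record)
    return sink
```

Calling `run_job(JOB_CONFIG, static_source(rows))` and `run_job(JOB_CONFIG, stream_source(rows, 2))` on the same rows yields the same records, which is the property the shared operators are meant to give.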

9. Shared Global State Store
Diagram: within AirStream, multiple Spark Streaming and Spark Batch jobs share the same HBase tables.

10. Shared Global State Store
Diagram: a DataFrame is re-partitioned by HBase region into <Region 1, [RowKey, Value]>, <Region 2, [RowKey, Value]>, …, <Region N, [RowKey, Value]>; the data is then written with HBase Puts or as HFiles for BulkLoad.
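
The re-partition step can be sketched as a lookup of each row key against the sorted region start keys. This is a minimal in-memory sketch, assuming regions cover half-open key ranges [start, end); `region_index` and `repartition` are hypothetical helpers, not part of the HBase or Spark APIs.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Tuple


def region_index(row_key: bytes, split_keys: List[bytes]) -> int:
    """Return the index of the region that owns row_key.

    split_keys holds the sorted start keys of regions 1..N-1; region 0 starts
    at the empty key. A key equal to a split key belongs to the region that
    starts at that key.
    """
    return bisect_right(split_keys, row_key)


def repartition(
    puts: List[Tuple[bytes, bytes]], split_keys: List[bytes]
) -> Dict[int, List[Tuple[bytes, bytes]]]:
    """Group (row_key, value) pairs by region and sort each group by row key,
    since a file prepared for bulk load must hold its rows in key order."""
    by_region = defaultdict(list)
    for row_key, value in puts:
        by_region[region_index(row_key, split_keys)].append((row_key, value))
    return {region: sorted(rows) for region, rows in by_region.items()}
```

With split keys `[b"g", b"p"]`, the key `b"abc"` falls in region 0, `b"g"` in region 1, and `b"zed"` in region 2.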

11. Shared Global State Store
Diagram: Spark Streaming/Batch jobs read from HBase tables using multi-gets, prefix scans, and time range scans.

12. Why HBase?

13. Why HBase
• Rich API for point lookups and sequential scans (TimeRange, TTL, Prefix Scan, …)
• Merged view based on version
• Unified API for streaming writes and bulk uploads
• Unified API for reading from live tables and snapshot tables

14. Streaming Computation

15. Merged Storage
Diagram: row key R1 with versions V100, V99, …, V01 ordered by time, all produced by streaming writes.

16. Merged Storage
Diagram: row key R1 with versions ordered by time; streaming writes supply V100, V99, and V01, and a batch bulk upload also supplies V100.
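
A toy model of the version-based merged view, for illustration only. It assumes that a later write to the same row key and version replaces the earlier one, so a batch upload can correct a value already written by the stream. `VersionedStore` is a hypothetical in-memory class, not the HBase client API.

```python
from typing import Dict, Hashable, Optional


class VersionedStore:
    """In-memory stand-in for a table that keeps several versions per row."""

    def __init__(self) -> None:
        # row_key -> {version: value}
        self._cells: Dict[Hashable, Dict[int, object]] = {}

    def put(self, row_key: Hashable, version: int, value: object) -> None:
        """Write one version; the same (row_key, version) is overwritten."""
        self._cells.setdefault(row_key, {})[version] = value

    def get(self, row_key: Hashable, max_version: Optional[int] = None):
        """Return the value of the newest version, optionally capped."""
        versions = self._cells.get(row_key, {})
        eligible = [v for v in versions if max_version is None or v <= max_version]
        if not eligible:
            return None
        return versions[max(eligible)]
```

Readers see one merged view: they ask for the newest version and do not need to know whether it came from a streaming write or a bulk upload.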

17. Distinct Count
Diagram: row keys Prefix_R1 (V102), Prefix_R2 (V101), Prefix_R3 (V100), Prefix_R4 (V01); prefix scans with TimeRange select the rows to count.
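
A distinct count over a prefix scan with a time range can be sketched on a sorted list of rows. This is a plain-Python sketch, assuming each row is a `(row_key, timestamp)` pair and the time range is half-open; `distinct_count` is a hypothetical helper, not an HBase call.

```python
from bisect import bisect_left
from typing import List, Tuple


def distinct_count(
    rows: List[Tuple[str, int]], prefix: str, start_ts: int, end_ts: int
) -> int:
    """Count distinct row keys that start with prefix and whose timestamp
    lies in [start_ts, end_ts). rows must be sorted by row key."""
    keys = [key for key, _ in rows]
    # Keys sharing a prefix are contiguous in sorted order, so seek to the
    # first candidate and stop at the first key without the prefix.
    i = bisect_left(keys, prefix)
    seen = set()
    while i < len(rows) and rows[i][0].startswith(prefix):
        key, ts = rows[i]
        if start_ts <= ts < end_ts:
            seen.add(key)
        i += 1
    return len(seen)
```

Putting the dimension to count (for example a user ID) at the end of the row key means the store itself deduplicates, and the scan only has to count rows.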

18. Moving Average
Diagram: versions of row key R1 ordered by time (V102, V101, …, V100, …, V01). For each window (Window 1, Window 2), the average is the count difference divided by the time elapsed.
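
The "count difference / time elapsed" computation can be sketched as follows. The sketch assumes each stored version holds a cumulative count together with its timestamp; `window_rate` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def window_rate(
    versions: List[Tuple[int, int]], window_start: int, window_end: int
) -> Optional[float]:
    """versions: (timestamp, cumulative_count) pairs for one row key.

    Returns (newest count - oldest count) / (newest ts - oldest ts) over the
    versions inside [window_start, window_end], or None when the window holds
    fewer than two distinct timestamps.
    """
    in_window = sorted(
        (ts, count) for ts, count in versions if window_start <= ts <= window_end
    )
    if len(in_window) < 2:
        return None
    (t0, c0), (t1, c1) = in_window[0], in_window[-1]
    if t1 == t0:
        return None
    return (c1 - c0) / (t1 - t0)
```

Only the oldest and newest versions inside a window are needed, so a time range read over one row key is enough to compute each window.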

19. Long Window Computation

20. Spark HBase Connector
Diagram components: Spark; Zeppelin; HBase Connector; HBase.

21. Presto - HBase Connector
Diagram components: Presto; HBase Connector (schema mapping, split -> RS mapping); HBase.

22. Hive - HBase Connector
Diagram components: Hive; HBase Connector (Table InputFormat, Snapshot InputFormat); HBase.

23. Use Cases

24. MySQL DB Snapshot Using Binlog Replay

25. Move Elephant Database Snapshot
• Large amount of data: multiple large MySQL DBs
• Realtime-ness: minutes of delay / hours of delay
• Transactions: need to preserve transactions across different tables
• Schema change: table schemas evolve

26. Binlog Replay on Spark
Diagram labels: 20+ hr, 4+ hr, 15, 5 mins, 1 hr; AirStream job.

27. Architecture
• Streaming and batch share logic: binlog file reader, DDL processor, transaction processor, DML processor.
• Idempotent: the log can be replayed multiple times.
• Schema changes: full schema change history.
Diagram labels: Binlog, Log Parser, DDL, DML, xvid, HBase.
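
Why replay can be idempotent is easiest to see in a small sketch: if every change carries a monotonically increasing log position and the store keeps the position it last applied per row, replaying the same events leaves the state unchanged. The event shape (`position`, `table`, `pk`, `op`, `row`) and `apply_binlog` are hypothetical, not the MySQL binlog format or the AirStream implementation.

```python
from typing import Dict, List, Optional, Tuple

# (table, primary key) -> (last applied log position, row or None if deleted)
State = Dict[Tuple[str, int], Tuple[int, Optional[dict]]]


def apply_binlog(state: State, events: List[dict]) -> State:
    """Apply DML events to state; events already applied are skipped."""
    for event in events:
        key = (event["table"], event["pk"])
        current = state.get(key)
        if current is not None and current[0] >= event["position"]:
            continue  # this change, or a newer one, is already applied
        row = None if event["op"] == "delete" else event["row"]
        state[key] = (event["position"], row)
    return state
```

Deletes are kept as a position with a `None` row rather than removed, so a replayed older insert cannot bring the row back.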

28. Realtime Ingestion & Interactive Query

29. Realtime Ingestion and Interactive Query
Diagram components: Kafka; AirStream (Spark Streaming); HBase; query engines (Spark SQL, Hive SQL, Presto SQL); Data Portal.
