Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Queries
1. WIFI SSID: SparkAISummit | Password: UnifiedAnalytics
2. Near Real-Time Analytics with Apache Spark
   Brandon Hamric, Data Engineer at Eventbrite
   Beck Cronin-Dixon, Data Engineer at Eventbrite
3. Who are we?
   • Brandon Hamric, Principal Data Engineer, Eventbrite
   • Beck Cronin-Dixon, Senior Data Engineer, Eventbrite
4. Overview
   • Components of near real-time data
   • Requirements
   • Data ingestion approaches
   • Benchmarks
   • Considerations
5. Components of Near Real-Time Data
   • Data
     – Size
     – Mutability
     – Schema evolution
   • Deduplication
   • Infrastructure
     – Ad-hoc queries
     – Batch
     – Streaming queries
6. Components of Near Real-Time Data
   • Storage
     – Folder/file partitioning and bucketing
     – Format
   • Compression
     – At rest
     – In transit
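As a minimal PySpark sketch of the storage knobs listed above (folder partitioning, bucketing, columnar format, at-rest compression). The table, column names, and paths are illustrative assumptions, not from the deck.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("storage-layout")
         .enableHiveSupport()        # bucketed tables are written through the metastore
         .getOrCreate())

events_df = spark.read.parquet("s3://staging/events/")   # hypothetical source DataFrame

(events_df.write
    .mode("overwrite")
    .partitionBy("ds")                        # folder partitioning by date
    .bucketBy(32, "event_id")                 # file bucketing by a join/dedup key
    .sortBy("event_id")
    .option("compression", "snappy")          # at-rest compression
    .format("parquet")
    .saveAsTable("analytics.events"))         # bucketBy requires saveAsTable, not save()
```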
7. Common Requirements
   • Latency requirements are specific to the use case
   • Most stakeholders ask for latency in minutes
   • Latency of dependencies*
   • Some implementations fit certain latency targets better than others
   • Mixed grain*
8. Eventbrite Requirements
   • Web interaction data: billions of records per month
   • Real-time DB: multiple TB
   • Event organizer reporting requirements
   • Business and accounting reporting
   • Mixed grain: seconds to daily
   • Stack: Spark, Presto, S3, HDFS, Kafka
   • Parquet
9. Ingestion Approaches
   • Full overwrite
   • Batch incremental merge
   • Append only
   • Key/value store
   • Hybrid batch/stream view
10. Approaches: Full Overwrite
   • Batch Spark process
   • Overwrite the entire table every run
   • Complete copy of the real-time DB
   • Direct ETL
11. Approaches: Full Overwrite
   • The Good
     – Simple to implement
     – Ad-hoc queries are fast/simple
   • The Bad
     – Significant load on the real-time DB
     – High latency
     – High write I/O requirement
12. Full Overwrite Logic
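A minimal PySpark sketch of the full-overwrite flow (not the presenters' exact code): read the whole source table over JDBC and overwrite the data-lake copy. The connection URL, table name, and output path are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("full-overwrite").getOrCreate()

# Read the entire table from the real-time DB (ideally a read replica).
full_snapshot = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://replica-host:3306/events_db")  # hypothetical replica
    .option("dbtable", "orders")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Overwrite the lake copy on every run: simple, but every run pays the full
# read load on the DB and the full write I/O on the lake.
full_snapshot.write.mode("overwrite").parquet("s3://data-lake/orders/")
```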
13. Approaches: Batch Incremental Merge
   • Get new/changed rows
   • Union new rows to the base table
   • Deduplicate to get the latest rows
   • Overwrite the entire table
   • Latency is limited by how fast you can write to the data lake
14. Approaches: Batch Incremental Merge
   • The Good
     – Lower load on the real-time DB
     – Queries are fast/simple
   • The Bad
     – Relatively high latency
     – High write I/O requirement
     – Requires reliable incremental fields
15. Batch Incremental Merge Logic
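A hedged PySpark sketch of the merge described above: pull rows changed since the last watermark, union them with the base table, keep the newest version of each key, and rewrite the table. The id/updated_at columns, paths, and connection details are assumptions.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("incremental-merge").getOrCreate()

base = spark.read.parquet("s3://data-lake/orders/")
watermark = base.agg(F.max("updated_at")).first()[0]   # last ingested change

changed = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://replica-host:3306/events_db")
    .option("query", f"SELECT * FROM orders WHERE updated_at > '{watermark}'")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Deduplicate: keep only the most recent row per primary key.
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
merged = (
    base.unionByName(changed)
    .withColumn("rn", F.row_number().over(w))
    .where("rn = 1")
    .drop("rn")
)

# Write to a staging path, then swap; overwriting the path being read in the
# same job is unsafe.
merged.write.mode("overwrite").parquet("s3://data-lake/orders_staged/")
```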
16. Approaches: Append-Only Tables
   • Query the real-time DB for new/changed rows
   • Coalesce and write new part files
   • Run compaction hourly, then daily
17. Approaches: Append Only
   • The Good
     – Latency in minutes
     – Ingestion is easy to implement
     – Simplifies the data lake
     – Great for immutable source tables
   • The Bad
     – Requires a compaction process
     – Extra logic in the queries
     – May require views
18. Append-Only Logic
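A sketch of the append-only pattern under assumed column names (updated_at incremental field, ds date partition) and paths: each run appends a few part files, and a separate compaction job rewrites a partition's small files into larger ones. Deduplication stays in the read-side queries or views, as the previous slide notes.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("append-only").getOrCreate()

# New/changed rows since the previous run (watermark hard-coded for brevity).
new_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://replica-host:3306/events_db")
    .option("query", "SELECT * FROM orders WHERE updated_at > '2019-04-24 12:00:00'")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
    .withColumn("ds", F.to_date("updated_at"))
)

# Coalesce to a few part files and append; no rewrite of existing data.
(new_rows.coalesce(4)
    .write.mode("append")
    .partitionBy("ds")
    .parquet("s3://data-lake/orders_append/"))

# Compaction (hourly, then daily): rewrite one partition's many small files
# as a few larger ones; de-duplication is left to the queries/views.
one_day = (spark.read.parquet("s3://data-lake/orders_append/")
           .where(F.col("ds") == "2019-04-24"))
(one_day.coalesce(8)
    .write.mode("overwrite")
    .parquet("s3://data-lake/orders_compacted/ds=2019-04-24/"))
```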
19. Approaches: Key/Value Store
   • Query the real-time DB for new/changed rows
   • Upsert rows
20. Approaches: Key/Value Store
   • The Good
     – Straightforward to implement
     – Built-in idempotency
     – Good bridge between the data lake and web services
   • The Bad
     – Batch writes to a key/value store are slower than writes to HDFS
     – Not optimized for large scans
21. Key/Value Store Logic
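The deck does not name a specific key/value store, so this sketch upserts changed rows through a hypothetical client library (kv_store.connect, client.put) inside foreachPartition; substitute the real HBase/DynamoDB/Cassandra client calls as appropriate. Upserts keyed on the primary key give the built-in idempotency mentioned above.

```python
from pyspark.sql import SparkSession

import kv_store   # hypothetical key/value client, stands in for HBase/DynamoDB/etc.

spark = SparkSession.builder.appName("kv-upsert").getOrCreate()

changed = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://replica-host:3306/events_db")
    .option("query", "SELECT * FROM orders WHERE updated_at > '2019-04-24 12:00:00'")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

def upsert_partition(rows):
    # One connection per partition, opened on the executor.
    client = kv_store.connect("kv-host:8000")     # hypothetical API
    for row in rows:
        # Upserts are idempotent: re-processing the same row is harmless.
        client.put(key=str(row["id"]), value=row.asDict())
    client.close()

changed.foreachPartition(upsert_partition)
```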
22. Approaches: Hybrid Batch/Stream View
   • Batch ingest from DB transaction logs in Kafka
   • Batch merge new rows into base rows
   • Store the transaction ID in the base table
   • Ad-hoc queries merge the base table with the latest transactions and deduplicate on the fly
23. Approaches: Hybrid Batch/Stream View
   • The Good
     – Data within seconds*
     – Batch job is relatively easy to implement
     – Spark can do both tasks
   • The Bad
     – Streaming merge is complex
     – Processing required on read
24. Hybrid Batch/Stream View
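A sketch of the read-side merge for the hybrid view, assuming both tables share a schema with an id key and a txn_id transaction column, and that the latest transactions have already been landed from Kafka; the paths are placeholders. Transactions newer than the base table's high-water mark are unioned in and the highest txn_id per key wins.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hybrid-view").getOrCreate()

base = spark.read.parquet("s3://data-lake/orders_base/")        # batch-merged base table
recent = spark.read.parquet("s3://data-lake/orders_txn_log/")   # latest transactions from Kafka

# Only transactions newer than what the base table already contains.
max_txn = base.agg(F.max("txn_id")).first()[0]
fresh = recent.where(F.col("txn_id") > max_txn)

# Merge and deduplicate on the fly: the highest txn_id per key wins.
w = Window.partitionBy("id").orderBy(F.col("txn_id").desc())
merged = (
    base.unionByName(fresh)
    .withColumn("rn", F.row_number().over(w))
    .where("rn = 1")
    .drop("rn")
)

# Register so ad-hoc queries see near-real-time rows.
merged.createOrReplaceTempView("orders_nrt")
```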
25. So Delta Lake is open source now...
   • ACID transactions in Spark!
   • Frequent ingestion processes
   • Simpler than other incremental merge approaches
   • The hybrid approach still bridges the latency gap
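For comparison, a hedged sketch of the same upsert expressed as a Delta Lake MERGE, which replaces the union-and-rewrite steps of the batch incremental merge. It assumes the io.delta package is on the classpath and a table keyed by id; paths are placeholders.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-merge").getOrCreate()

# New/changed rows pulled from the real-time DB (see the JDBC reader sketches above).
changed = spark.read.parquet("s3://staging/orders_changed/")

target = DeltaTable.forPath(spark, "s3://data-lake/orders_delta/")

(target.alias("t")
    .merge(changed.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()      # existing keys get the latest values
    .whenNotMatchedInsertAll()   # brand-new keys are inserted
    .execute())
```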
26. Read/Write Benchmark
27. Where is Spark Streaming?
   • More frequent incremental writes
   • Fewer stream-to-stream joins
   • Decreasing batch intervals gives us more stream-to-batch joins
   • Fewer stream-to-stream joins means less memory use and faster joins
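A sketch of the stream-to-batch (stream-static) join pattern favored above, using Structured Streaming: a Kafka stream joined to a batch dimension table and written in small micro-batches. The topic, JSON field, schemas, and paths are assumptions, and the Kafka source needs the spark-sql-kafka package.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-batch-join").getOrCreate()

# Static (batch) side: a dimension table refreshed by the batch pipeline.
orders_dim = spark.read.parquet("s3://data-lake/orders/")

# Streaming side: click events from Kafka.
clicks = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "clicks")
    .load()
    .select(
        F.get_json_object(F.col("value").cast("string"), "$.order_id").alias("order_id"),
        F.col("timestamp"),
    )
)

# Stream-to-batch join: no streaming state is kept for the static side, so it
# is cheaper than a stream-to-stream join.
enriched = clicks.join(orders_dim, "order_id")

query = (
    enriched.writeStream
    .format("parquet")
    .option("path", "s3://data-lake/clicks_enriched/")
    .option("checkpointLocation", "s3://checkpoints/clicks_enriched/")
    .trigger(processingTime="1 minute")   # small micro-batches for low latency
    .start()
)
```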
28. Considerations
   • Use case
   • Latency needs
   • Data size
   • Deduplication
   • Storage
29. Sample MySQL JDBC Reader
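A hedged sketch of a typical parallel MySQL JDBC read in Spark, not the presenters' exact code: the hostnames, table, bounds, and tuning values are placeholders to adjust per table.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-jdbc-reader").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://replica-host:3306/events_db")   # hypothetical read replica
    .option("driver", "com.mysql.jdbc.Driver")                   # MySQL Connector/J on the classpath
    .option("dbtable", "orders")
    .option("user", "etl_user")
    .option("password", "***")
    # Split the read into parallel range scans over a numeric key.
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "50000000")      # roughly max(id); stale bounds skew partition sizes
    .option("numPartitions", "16")
    .option("fetchsize", "10000")          # rows fetched per round trip
    .load()
)

orders.printSchema()
```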