Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Queries

Near real-time analytics has become a common requirement for many data teams as the technology has caught up to the demand. One of the hardest aspects of enabling near real-time analytics is ingesting and deduplicating the source data often enough to be useful to analysts, while writing it in a format that your analytics query engine can use. This usually requires several tools, since the problem has three distinct aspects: streaming ingestion of data, deduplication using an ETL process, and interactive analytics. With Spark, all three can be done with one tool. This talk will walk you through how to use Spark Streaming to ingest change-log data, use Spark batch jobs to perform major and minor compaction, and query the results with Spark SQL. At the end of this talk you will know what is required to set up near real-time analytics at your organization, the common gotchas (including file formats and distributed file systems), and how to handle the unique data integrity issues that arise from near real-time analytics.


2. Near Real-Time Analytics with Apache Spark
Brandon Hamric, Data Engineer at Eventbrite
Beck Cronin-Dixon, Data Engineer at Eventbrite
#UnifiedAnalytics #SparkAISummit

3. Who are we
• Brandon Hamric, Principal Data Engineer, Eventbrite
• Beck Cronin-Dixon, Senior Data Engineer, Eventbrite

4. Overview
• Components of near real-time data
• Requirements
• Data ingestion approaches
• Benchmarks
• Considerations

5. Components of Near Real-time Data
• Data
  – Size
  – Mutability
  – Schema evolution
• Deduplication
• Infrastructure
  – Ad-hoc queries
  – Batch
  – Streaming queries

6. Components of Near Real-time Data
• Storage
  – Folder/file partitioning and bucketing
  – Format
• Compression
  – At-rest
  – In-transit

7. Common Requirements
• Latency is specific to use case
• Most stakeholders ask for latency in minutes
• Latency of dependencies*
• Some implementations fit certain latencies better than others
• Mixed grain*

8. Eventbrite Requirements
• Web interaction data: billions of records per month
• Real-time DB: multiple TB
• Event organizer reporting requirements
• Business and accounting reporting
• Mixed grain: seconds to daily
• Stack: Spark, Presto, S3, HDFS, Kafka
• Parquet

9. Ingestion Approaches
• Full overwrite
• Batch incremental merge
• Append only
• Key/value store
• Hybrid batch/stream view

10. Approaches: Full Overwrite
• Batch Spark process
• Overwrite entire table every run
• Complete copy of real-time DB
• Direct ETL

11. Approaches: Full Overwrite
• The Good
  – Simple to implement
  – Ad-hoc queries are fast/simple
• The Bad
  – Significant load on real-time DB
  – High latency
  – High write I/O requirement

12. Full Overwrite Logic
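The full-overwrite logic can be sketched in PySpark roughly as follows. This is a hypothetical reconstruction, not the code from the slide; the function name, parameters, and target path are assumptions.

```python
def full_overwrite(spark, jdbc_url, table, target_path):
    """Hypothetical full-overwrite job: read the entire source table
    over JDBC and replace the data-lake copy on every run."""
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", table)
          .load())
    # mode("overwrite") discards the previous snapshot entirely,
    # which is what makes this approach simple but I/O-heavy.
    df.write.mode("overwrite").parquet(target_path)
```

Because every run rewrites the whole table, there is no dedup step; the trade-off is the full read load on the real-time DB and the high write I/O noted on the previous slide.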

13. Approaches: Batch Incremental Merge
• Get new/changed rows
• Union new rows to base table
• Deduplicate to get latest rows
• Overwrite entire table
• Latency limited by how fast you can write to the data lake

14. Approaches: Batch Incremental Merge
• The Good
  – Lower load on real-time DB
  – Queries are fast/simple
• The Bad
  – Relatively high latency
  – High write I/O requirement
  – Requires reliable incremental fields

15. Batch Incremental Merge Logic
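The merge-and-dedup step can be illustrated in plain Python over lists of dicts. The key and version column names are assumptions; in Spark this is typically a union followed by a `row_number()` window over the key, ordered by the incremental column.

```python
def incremental_merge(base, increment, key="id", version="updated_at"):
    """Union the base table with new/changed rows, then keep only the
    latest version of each key -- the dedup step of the merge."""
    latest = {}
    for row in base + increment:
        k = row[key]
        # A later version for the same key replaces the earlier row
        if k not in latest or row[version] > latest[k][version]:
            latest[k] = row
    return sorted(latest.values(), key=lambda r: r[key])

base = [{"id": 1, "updated_at": 10, "city": "SF"},
        {"id": 2, "updated_at": 10, "city": "NY"}]
increment = [{"id": 1, "updated_at": 20, "city": "LA"},
             {"id": 3, "updated_at": 20, "city": "TX"}]

merged = incremental_merge(base, increment)
# id 1 now carries the newer row; ids 2 and 3 pass through unchanged
```

Note the dependence on a reliable incremental field (`updated_at` here): if the source DB does not maintain it consistently, the dedup picks the wrong row, which is the "requires reliable incremental fields" caveat on slide 14.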

16. Approaches: Append-Only Tables
• Query the real-time DB for new/changed rows
• Coalesce and write new part files
• Run compaction hourly, then daily

17. Approaches: Append Only
• The Good
  – Latency in minutes
  – Ingestion is easy to implement
  – Simplifies the data lake
  – Great for immutable source tables
• The Bad
  – Requires a compaction process
  – Extra logic in the queries
  – May require views

18. Append-Only Logic
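A minimal sketch of the append-only pattern in plain Python (the file representation and column names are assumptions): ingestion only ever appends new part files, compaction merges the small parts, and queries deduplicate down to the latest row per key.

```python
def compact(part_files):
    """Minor compaction: merge many small part files into one larger
    part. No rows are dropped -- append-only keeps every version."""
    merged = []
    for part in part_files:
        merged.extend(part)
    return merged

def latest_rows(rows, key="id", version="updated_at"):
    """The 'extra logic in the queries': pick the newest version of
    each key at read time (often hidden behind a view)."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[version] > latest[k][version]:
            latest[k] = row
    return sorted(latest.values(), key=lambda r: r[key])

parts = [
    [{"id": 1, "updated_at": 1, "status": "created"}],  # part file 1
    [{"id": 1, "updated_at": 2, "status": "paid"}],     # part file 2
    [{"id": 2, "updated_at": 2, "status": "created"}],  # part file 3
]
table = compact(parts)        # all 3 row versions survive compaction
current = latest_rows(table)  # 2 rows after query-time dedup
```

Running compaction hourly and then daily (as on slide 16) is the major/minor compaction split from the abstract: minor compaction fixes the small-files problem quickly, major compaction keeps long scans efficient.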

19. Approaches: Key/Value Store
• Query the real-time DB for new/changed rows
• Upsert rows

20. Approaches: Key/Value Store
• The Good
  – Straightforward to implement
  – Built-in idempotency
  – Good bridge between data lake and web services
• The Bad
  – Batch writes to a key/value store are slower than using HDFS
  – Not optimized for large scans

21. Key/Value Store Logic
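The upsert logic is the simplest of the approaches; here is a plain-Python sketch with a dict standing in for the key/value store (the store API is an assumption -- HBase, DynamoDB, Cassandra, etc. each differ):

```python
def upsert(store, rows, key="id"):
    """Write each row under its key; existing rows are overwritten.
    Replaying a batch leaves the store unchanged, which is the
    'built-in idempotency' called out on slide 20."""
    for row in rows:
        store[row[key]] = row
    return store

store = {}
batch = [{"id": 1, "qty": 2}, {"id": 2, "qty": 5}, {"id": 1, "qty": 3}]
upsert(store, batch)
upsert(store, batch)  # replaying the batch is a no-op: same end state
```

The same idempotency is what makes this approach a good bridge to web services: a failed ingestion run can simply be re-run without producing duplicates.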

22. Approaches: Hybrid Batch/Stream View
• Batch ingest from DB transaction logs in Kafka
• Batch merge new rows to base rows
• Store transaction ID in the base table
• Ad-hoc queries merge the base table and latest transactions, deduplicating on-the-fly

23. Approaches: Hybrid Batch/Stream View
• The Good
  – Data within seconds*
  – Batch job is relatively easy to implement
  – Spark can do both tasks
• The Bad
  – Streaming merge is complex
  – Processing required on read

24. Hybrid Batch/Stream View
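The read path of the hybrid view can be sketched in plain Python (column names and the high-water-mark mechanism are assumptions): the query takes the compacted base table plus only the streamed transactions newer than the transaction ID stored with the base table, and deduplicates on the fly.

```python
def merge_on_read(base_rows, stream_rows, base_txn_id,
                  key="id", txn="txn_id"):
    """Hypothetical read-time merge for the hybrid view."""
    # Keep only transactions the batch merge has not yet folded in
    fresh = [r for r in stream_rows if r[txn] > base_txn_id]
    latest = {r[key]: r for r in base_rows}
    # Apply fresh transactions in order; the latest txn wins per key
    for row in sorted(fresh, key=lambda r: r[txn]):
        latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])

base = [{"id": 1, "txn_id": 100, "state": "batch"},
        {"id": 2, "txn_id": 90, "state": "batch"}]
stream = [{"id": 2, "txn_id": 95, "state": "stale"},   # already merged
          {"id": 2, "txn_id": 101, "state": "fresh"},
          {"id": 3, "txn_id": 102, "state": "fresh"}]

view = merge_on_read(base, stream, base_txn_id=100)
```

This is the "processing required on read" cost from slide 23: every ad-hoc query pays for the filter-and-dedup, in exchange for seconds-level freshness.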

25. So Delta Lake is open source now...
• ACID transactions in Spark!
• Frequent ingestion processes
• Simpler than other incremental merge approaches
• Hybrid approach still bridges the latency gap

26. Read/Write Benchmark

27. Where is Spark Streaming?
• More frequent incremental writes
• Fewer stream-to-stream joins
• Decreasing batch intervals gives us more stream-to-batch joins
• Fewer stream-to-stream joins mean less memory use and faster joins

28. Considerations
• Use case
• Latency needs
• Data size
• Deduplication
• Storage

29. Sample MySQL JDBC reader
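A sketch of what such a reader might look like in PySpark. The helper name and defaults are assumptions, but the option keys (`url`, `dbtable`, `partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`) are Spark's standard JDBC data source options.

```python
def mysql_jdbc_options(host, database, table, user, password,
                       partition_column=None, lower=None, upper=None,
                       num_partitions=None):
    """Build the option map for Spark's JDBC reader (hypothetical
    helper). The partitioning options enable parallel reads by
    splitting the table on a numeric column."""
    opts = {
        "url": f"jdbc:mysql://{host}:3306/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.mysql.cj.jdbc.Driver",
    }
    if partition_column is not None:
        # Spark requires all four of these options together
        opts.update({
            "partitionColumn": partition_column,
            "lowerBound": str(lower),
            "upperBound": str(upper),
            "numPartitions": str(num_partitions),
        })
    return opts

# Usage (needs a live SparkSession and the MySQL Connector/J jar):
# df = (spark.read.format("jdbc")
#       .options(**mysql_jdbc_options("db-host", "eventbrite", "orders",
#                                     "reader", "secret",
#                                     partition_column="id",
#                                     lower=0, upper=10_000_000,
#                                     num_partitions=32))
#       .load())
```

Without the partitioning options Spark reads the whole table through a single connection, which is usually the bottleneck for the full-overwrite and incremental-merge approaches above.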