1.WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
2.Spark Listeners A Crash Course in Fast, Easy Monitoring Landon Robinson & Ben Storrie #UnifiedAnalytics #SparkAISummit
3.Who We Are Landon Robinson Ben Storrie Data Engineer Data Engineer Big Data Infrastructure Team @ SpotX #UnifiedAnalytics #SparkAISummit 3
4.But first… why are we here? → Building monitoring for large apps can be tough. → You want an effective and reliable solution. → Spark Listeners are an awesome solution! #UnifiedAnalytics #SparkAISummit 4
5.Our Company THE MODERN VIDEO AD SERVING PLATFORM (we show you video ads) (which means we also process a lot of data) #UnifiedAnalytics #SparkAISummit 5
6.We Process a Lot of Data Data: - 220 MM+ Total Files/Blocks - 8 PB+ HDFS Space - 20 TB+ new data daily - 100MM+ records/minute - 300+ Data Nodes Apps: - Thousands of daily Spark apps - Hundreds of daily user queries - Several 24/7 Streaming apps
7. Spark Streaming is Key for Us we want to monitor these 24/7 Spark Streaming apps are critical for us. They: • a) process billions of records every day • b) need real-time monitoring (visual dashboards + alerts) • c) without sacrificing performance So… a clever solution is necessary.
8.Our Solution Monitoring and visualization of a Spark Streaming app that is: • Fast • Easy to integrate • Has zero-to-minimal performance impact
9. Goal & Talking Points Equip you with the knowledge and code to get fast, easy monitoring implemented in your app using Spark Listeners. Talking Points ● Common Spark Streaming ● Less than Ideal Monitoring Metrics ● Better Monitoring ● Monitoring is Awesome ○ Spark Listeners! ● Monitoring can be Difficult ■ StreamingListeners ■ BatchListeners
10. Common Metrics of Spark Streaming Apps Performance: • batch duration • batch scheduling delays • # of records processed per batch Problems: • failures & exceptions Recovery: • Position (offset) within Kafka topic
11. Monitoring is Awesome It can reveal: • How your app is performing • Problems + Bugs! And provide opportunities to: • See and address issues • Observe behavior visually
12.Monitoring can be Difficult to Implement (especially when working with very big data) Usually because it requires reading your data! Example: .count().print() • Shuffling involved, and is often a blocking operation! Mapping over data to gather metrics can: • be expensive • and add processing delays – Virtually unreliable with big data
13. Less than Ideal Monitoring Make the app compute all your metrics! Example: Looping over RDDs to: • Count records • Track Kafka offsets • Processing time / delays Why is this bad? • Calculating performance significantly impacts performance… not great. • All these metrics are calculated by Spark!
14.There has to be a better way... Streaming Listeners!
15. The Better Way: Streaming Listeners Let a Spark Streaming Listener handle the basics metrics. Only make your app compute the “special” metrics it has to. Spark already crunches some numbers for you -- most evident in the Streaming tab. So let’s take advantage of it!
16. Streaming Listeners: StreamingListener Trait The StreamingListener Trait within the Developer API has everything you need. It has 8 convenience methods you can override, each with relevant, contextual information from Spark. onBatchCompleted() is the most useful to us. It allows you to execute logic after each batch.
17. Streaming Listeners: Extend Trait in New Listener Create a new listener that extends the StreamingListener Trait. Override the convenience methods. Within each is contextual monitoring information. Let’s look at onBatchCompleted().
18.Streaming Listeners: onBatchCompleted() What data is available to us from Spark? • Batch Info • Stream Info – numRecords (total) – Topic object – Processing times • Offsets – delays • numRecords (topic level)
19. Streaming Listeners: onBatchCompleted() We want to write performance stats to We want to write our Kafka Offsets to Influx using our custom method: MySQL using our custom method: writeBatchSchedulingStatsToInflux() writeBatchOffsetsAndCounts()
20. writeBatchSchedulingStatsToInflux() This custom method consumes the Batch Info stored in the object... • Batch duration • Batch delay • total records – across all streams/topics … and writes them to Influx!
21.writeBatchSchedulingStatsToInflux() Grafana dashboard reading from an Influx time-series database.
22. Streaming Listeners: onBatchCompleted() We want to write performance stats to We want to write our Kafka Offsets to Influx using our custom method: MySQL using our custom method: writeBatchSchedulingStatsToInflux() writeBatchOffsetsAndCounts()
23. writeBatchOffsetsAndCounts() This custom method consumes the • Write offsets to MySQL Stream Info stored in the object and • Write # records to Influx has two steps: – per stream/topic
24. writeTopicOffsetsToMySQL() This custom method takes a Topic object… … and finds the last offset processed for each kafka partition. It then will update each partition in a database with the latest row/offset processed.
25.writeTopicOffsetsToMySQL() MySQL table maintaining latest offsets for an app. Data is at an app / topic / partition level.
26.Reading Offsets from MySQL Your offsets are now stored in a DB after each batch completes. Whenever your app restarts, it reads those offsets from the DB... And starts processing where it last left off!
27.Getting Offsets from the Database
28.Example: Reading Offsets from MySQL
29.Example: Reading Offsets from MySQL