Near Real-Time Netflix Recommendations Using Apache Spark Streaming

As a data-driven company, we use machine-learning-based algorithms and A/B tests to drive all of the content recommendations for our members. Traditionally, these recommendations are precomputed in batch, but such a model cannot react quickly to member interactions, title interests, popularity, etc. With an ever-growing Netflix catalog, finding the right content for our audience in near real time would provide the best personalized experience. We’ll take a deep dive into our real-time Spark Streaming ecosystem at Netflix, covering both its infrastructure and its business use cases. On the infrastructure front, we will delve into scale challenges, state management, data persistence, resiliency considerations, metrics, operations, and auto-remediation. We will talk about a few use cases that leverage real-time data for model training, such as providing the right personalized videos in a member’s Billboard and choosing the right personalized image soon after the launch of a show. We will also reflect on the lessons learnt while building such high-volume infrastructure.

Near Real-Time Recommendations - Spark Streaming ● Elliot Chow, Netflix ● Nitin Sharma, Netflix

Agenda ● Recommendations @ Netflix ● The Need for Near Real Time ● Use Cases ● Common Infrastructure ● Scaling Challenges

Recommendations at Netflix ● Personalize the Netflix experience for each member ○ Goal: Quickly help members find content they'd like to watch ○ Risk: Member may lose interest and abandon the service ○ Challenge: Recommending at scale

Scale @ Netflix ● 125M+ active members ● 190 countries ● 450B+ unique events/day ● 700+ Kafka topics

Typical Data Pipelines @ Netflix ● Data stored in Hive/S3 ● Batch ETLs using Spark/Hive ● Table partitioning by day or hour ● Job scheduling by either cron or data availability ● Latency is often on the order of days

The Need for Near Real Time (NRT) ● Dynamic catalog ● Growing member base ● Time sensitivity ○ Content popularity changes ○ Member interests evolve

The Need for Near Real Time (NRT) ● Increasing amount of data ○ Process data as soon as possible to keep latencies low ○ Minimize amount of data to reprocess in case of failure ● Multi-Armed Bandits Adoption

Use Cases ● Video Insights ● ML Pipelines for Recommendations

NRT for Video Insights

Video Insights ● New title launches ● Early reads on title performance ● Slice by various dimensions

Video Insights Service

Video Insights Client Service

Video Insights - State ● Counts maintained in Spark ● Custom state management ○ Based on the mapWithState implementation: input.scan(initRDD)((currentRDD, batchRDD) => f(currentRDD, batchRDD)) ○ Easier to re-use the function f in batch mode ○ Lower-level control over state management ○ scanByKey alternative for keyed state
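
The scan pattern on the slide above can be illustrated without Spark. The following is a minimal sketch, assuming plain Scala collections stand in for RDDs and micro-batches; `ScanStateSketch`, `mergeCounts`, `scanBatches`, and `foldBatches` are hypothetical names, not part of the Netflix implementation or the Spark API. It shows why a single merge function `f` can serve both the streaming path (state emitted after every batch) and the batch path (one fold over all data).

```scala
// Sketch only: Maps stand in for RDDs, a Seq of Maps for the stream of micro-batches.
object ScanStateSketch {
  type Counts = Map[String, Long]

  // The reusable function f: merge one batch of per-title counts into the running state.
  def mergeCounts(current: Counts, batch: Counts): Counts =
    batch.foldLeft(current) { case (acc, (title, n)) =>
      acc.updated(title, acc.getOrElse(title, 0L) + n)
    }

  // Streaming-style use: produce the state as it stands after each micro-batch.
  def scanBatches(init: Counts, batches: Seq[Counts]): Seq[Counts] =
    batches.scanLeft(init)(mergeCounts).tail

  // Batch-style use: apply the same f once over all the data.
  def foldBatches(init: Counts, batches: Seq[Counts]): Counts =
    batches.foldLeft(init)(mergeCounts)
}
```

The last state produced by `scanBatches` always equals the result of `foldBatches` on the same input, which is the property that makes re-using `f` in batch mode straightforward.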

NRT for Recommendations

Billboard Recommendations ● Recommend a relevant title to each member ● Right time ● Respond quickly to member feedback

Artwork Personalization ● Personalized Image ● Visual Evidence ● Quickly adapting - Title launches, member tastes ● Rapid learning - Cold start

Traditional Recommendations [diagram; labels: Millions of play-related signals, ETL layers, Few days, Training pipelines, Models, Precompute/Live Compute, Rankings, DataStore/Online caches]

NRT Recommendations [diagram; labels: Millions of play-related signals, Streaming layer, NRT, Training pipelines, Models, Precompute/Live Compute, Rankings, DataStore/Online caches]

Required Data ● Impressions, Plays, etc. ● Attribution ● Explore/Exploit Metadata

Attribution Infrastructure [diagram; components: Stream Processing, Batch Processing, Query API]

Stream Processing - Zoomed in [diagram; labels: Impression, Play, Enrich, Cache, Video Metadata]
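
One way to read the enrich step is as a lookup of video metadata through a cache for each incoming impression or play. The sketch below assumes a simple read-through cache in plain Scala; every name in it (`EnrichSketch`, `Event`, `VideoMetadata`, `MetadataCache`, `lookup`) is hypothetical, and the actual Netflix cache and metadata source are not described in the slides.

```scala
import scala.collection.mutable

object EnrichSketch {
  case class Event(kind: String, memberId: Long, videoId: Long)
  case class VideoMetadata(videoId: Long, title: String)
  case class EnrichedEvent(event: Event, metadata: Option[VideoMetadata])

  // Read-through cache: the (hypothetical) metadata lookup runs at most once per video id.
  class MetadataCache(lookup: Long => Option[VideoMetadata]) {
    private val entries = mutable.Map.empty[Long, Option[VideoMetadata]]
    var lookups = 0 // counts calls to the backing lookup, for illustration

    def get(videoId: Long): Option[VideoMetadata] =
      entries.getOrElseUpdate(videoId, { lookups += 1; lookup(videoId) })
  }

  // Attach metadata to each event; events for unknown videos keep metadata = None.
  def enrich(events: Seq[Event], cache: MetadataCache): Seq[EnrichedEvent] =
    events.map(e => EnrichedEvent(e, cache.get(e.videoId)))
}
```

This sketch also caches misses, so an unknown video id is looked up only once; whether that is desirable depends on how quickly new titles appear in the metadata source.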

Batch Processing - Zoomed in [diagram; labels: Impression, Windowed Impression Dedupe, Play, Windowed Play Join, Video Metadata, Experimentation Data]
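
The two windowed steps named in this diagram can be sketched in plain Scala. This is an illustration under stated assumptions, not the Netflix job: timestamps are epoch milliseconds, an impression is treated as a duplicate if it follows a kept impression for the same member and video within the window, and a play is attributed to an impression if it occurs at or after the impression and within the window. `AttributionSketch` and all names inside it are hypothetical.

```scala
object AttributionSketch {
  case class Impression(memberId: Long, videoId: Long, ts: Long)
  case class Play(memberId: Long, videoId: Long, ts: Long)
  case class Attributed(impression: Impression, play: Option[Play])

  // Windowed dedupe: per (member, video), drop an impression that arrives
  // less than windowMs after the previously kept one.
  def dedupe(impressions: Seq[Impression], windowMs: Long): Seq[Impression] =
    impressions
      .groupBy(i => (i.memberId, i.videoId))
      .values
      .flatMap { group =>
        group.sortBy(_.ts).foldLeft(List.empty[Impression]) { (kept, imp) =>
          kept.headOption match {
            case Some(last) if imp.ts - last.ts < windowMs => kept
            case _                                         => imp :: kept
          }
        }
      }
      .toSeq
      .sortBy(i => (i.ts, i.memberId, i.videoId))

  // Windowed join: attach the earliest play for the same member and video
  // that falls within windowMs after the impression, if any.
  def attribute(impressions: Seq[Impression], plays: Seq[Play], windowMs: Long): Seq[Attributed] = {
    val playsByKey = plays.groupBy(p => (p.memberId, p.videoId))
    impressions.map { imp =>
      val candidates = playsByKey
        .getOrElse((imp.memberId, imp.videoId), Seq.empty[Play])
        .filter(p => p.ts >= imp.ts && p.ts - imp.ts <= windowMs)
      Attributed(imp, candidates.sortBy(_.ts).headOption)
    }
  }
}
```

Impressions with no matching play are kept with `play = None`, so downstream training data can include both positive and negative examples.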

Common Infrastructure

Netflix Spark Stack ● Jenkins ● Spinnaker ● ASG ● Runners: Marathon, Meson ● Resource Manager: Mesos ● Storage: HDFS, S3, EFS ● Multi-Region

Multi Region Challenges ● Geo routing ● Impression in one region; play in another ● Streaming - Multi Region ● Batch Reduce/Merge - One region

Can it withstand Chaos? ● Chaos is a design principle ● Instance Failovers => Region Failovers ● Transparent to the consumers ● Over-provisioned ● Long term - Autoscale