As a data driven company, we use Machine learning based algos and A/B tests to drive all of the content recommendations for our members. Traditionally, these recommendations are precomputed in a batch processing fashion, but such a model cannot react quickly based on member interactions, title interests, popularity etc. With an ever-growing Netflix catalog, finding the right content for our audience in near real-time would provide the best personalized experience.

We’ll take a deep dive into our realtime Spark Streaming ecosystem at Netflix. Both it’s infrastructure and business use cases. On the infrastructure front, we will delve into scale challenges, state management, data persistence, resiliency considerations, metrics, operations and auto-remediation. We will talk about a few use cases that leverage real-time data for model training, such as providing the right personalized videos in a member’s Billboard and choosing the right personalized image soon after the launch of the show. We will also reflect on the lessons learnt while building such high volume infrastructure.

注脚

展开查看详情

1.Near Real-Time Recommendations - Spark Streaming Elliot Chow, Netflix Nitin Sharma, Netflix #ML7SAIS #ML7SAIS

2.Agenda ● Recommendations @ Netflix ● The Need for Near Real Time ● Use Cases ● Common Infrastructure ● Scaling Challenges #ML7SAIS

3.Recommendations at Netflix ● Personalize the Netflix experience for each member ○ Goal: Quickly help members find content they’d like to watch ○ Risk: Member may lose interest and abandon the service ○ Challenge: Recommending at scale #ML7SAIS

4.Scale @ Netflix ● 125M+ active members ● 190 countries ● 450B+ unique events/day ● 700+ Kafka topics #ML7SAIS

5.Typical Data Pipelines @ Netflix ● Data stored in Hive/S3 ● Batch ETLs using Spark/Hive ● Table partitioning by day or hour ● Job scheduling by both CRON or data availability ● Latency often is on the order of days #ML7SAIS

6.The Need for Near Real Time (NRT) ● Dynamic catalog ● Growing member base ● Time sensitivity ○ Content popularity changes ○ Member interests evolve #ML7SAIS

7.The Need for Near Real Time (NRT) ● Increasing amount of data ○ Process data as soon as possible to keep latencies low ○ Minimize amount of data to reprocess in case of failure ● Multi-Armed Bandits Adoption #ML7SAIS

8.Use Cases ● Video Insights ● ML Pipelines for Recommendations #ML7SAIS

9.NRT for Video Insights #ML7SAIS

10.Video Insights ● New title launches ● Early reads on title performance ● Slice by various dimensions #ML7SAIS

11.Video Insights Service #ML7SAIS

12.Video Insights Service #ML7SAIS

13.Video Insights Service #ML7SAIS

14.Video Insights Service #ML7SAIS

15.Video Insights Client Service #ML7SAIS

16.Video Insights - State ● Counts maintained in Spark ● Custom state management ○ Based on mapWithState implementation input.scan(initRDD)((currentRDD, batchRDD) => f(currentRDD, batchRDD)) ○ Easier to re-use the function f in batch mode ○ Lower-level control over state management ○ scanByKey alternative for keyed state #ML7SAIS

17.NRT for Recommendations #ML7SAIS

18.Billboard Recommendations ● Recommend a relevant title to each member ● Right time ● Respond quickly to member feedback #ML7SAIS

19. Artwork Personalization ● Personalized Image ● Visual Evidence ● Quickly adapting - Title launches, member tastes ● Rapid learning - Cold start #ML7SAIS 19

20. Traditional Recommendations DataStore/ Millions of ETL Training Online Play related Layers pipelines caches Signals Few days Rankings Models Precompute/Live Compute #ML7SAIS

21. NRT Recommendations DataStore/Online Millions of Streaming caches Play related layer Signals NRT Rankings Training pipelines Models Precompute/Live Compute #ML7SAIS

22.Required Data ● Impressions, Plays, etc. ● Attribution ● Explore/Exploit Metadata #ML7SAIS

23.Attribution Infrastructure Stream Processing Batch Processing Query API #ML7SAIS

24.Stream Processing - Zoomed in Impression Play Enrich Cache Video Metadata #ML7SAIS

25.Batch Processing - Zoomed in Impression Windowed Impression Dedupe Windowed Play Join Play Video Experimentation Metadata Data #ML7SAIS

26.Common Infrastructure #ML7SAIS

27. Netflix Spark Stack ● Jenkins ● Spinnaker ● ASG ● Runners: Marathon, Meson ● Resource Manager: Mesos ● Storage: HDFS, S3, EFS ● Multi-Region #ML7SAIS

28. Multi Region Challenges ● Geo routing ● Impression in one region; play in another ● Streaming - Multi Region ● Batch Reduce/Merge - One region #ML7SAIS

29.Can it withstand Chaos? • Chaos is a design principle • Instance Failovers => Region Failovers • Transparent to the consumers • Over provisioned • Long term - Autoscale #ML7SAIS

30.When Things Go South ● What if something doesn’t look right? ○ Stream Processing is stuck ○ Driver/Executor failures ○ Intermittent issues with external dependencies ● Metrics - Spark metrics to Atlas (similar to RRDTool + Graphite) ● Getting paged at 2 am - Not fun :) ! ● Need for auto-remediation - less operational overhead #ML7SAIS

31.Auto Remediation Infrastructure Triggers Metadata De-queue SQS Auto Remediation #ML7SAIS

32. Streaming Challenges • Scalability Performance tuning – Micro batch interval – Memory Tuning – Parallelism/Shuffle tradeoff • Data quality issues – Low latency vs data quality #ML7SAIS

user picture
由Apache Spark PMC & Committers发起。致力于发布与传播Apache Spark + AI技术,生态,最佳实践,前沿信息。

相关文档