Fact Store at Scale for Netflix Recommendations

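As a data-driven company, we use machine learning algorithms and A/B tests to drive all of the content recommendations for our members. To improve the quality of our personalized recommendations, we first try an idea offline using historical data. Ideas that improve our offline metrics are then pushed to A/B tests, which are judged by statistically significant improvements in core metrics such as member engagement, satisfaction, and retention. These offline analyses depend on historical fact data, which is used to generate the features that the machine learning models require; examples include a member's viewing history and the videos in their My List.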

1. Fact Store - Netflix Recommendations
Kedar Sadekar, Netflix
Nitin Sharma, Netflix
#DevSAIS11

2. Agenda
[Six bullet items; text not recovered]

3. Recommendations at Netflix
● Personalized homepage for each member
  ○ Goal: Quickly help members find content they’d like to watch
  ○ Risk: Member may lose interest and abandon the service
  ○ Challenge: Recommendations at scale

4. Scale @ Netflix
[Four bullet items; text not recovered]

5. Experimentation Cycle @ Netflix
[Diagram: "Design a New Experiment to Test Out Different Ideas"]
● Offline Experiment: Design Experiment, Collect Label Data, Offline Feature Generation, Model Training, Model Validation Metrics
● Online System: Model Testing, Online A/B Testing

6. ML Feature Engineering - Architectural View
[Diagram labels]
● Offline Experiment: Offline Feature Generation, Model Training
● Online System: Microservices, Facts, Online Feature Generation, Features, Online Scoring
● Between the two: Shared Feature Encoders, Deploy Models

7. What is a Fact?
● Fact
  ○ Input data for feature encoders; used to construct a feature
  ○ Example: viewing history of a member, My List of a member
● Historical version of a fact
  ○ Rewindable - state of the world at that time
● Temporal
  ■ Facts are temporal, i.e. they change with time
  ■ Each online scoring service uses the latest value of a fact
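
The "rewindable" and "latest value" properties above can be sketched as an append-only log with an as-of lookup. This is a minimal in-memory illustration, not the Netflix implementation; the class name `TemporalFactLog`, its methods, and the integer timestamps are all assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict


class TemporalFactLog:
    """Append-only log of fact values, queryable as of a past timestamp.

    Hypothetical structure for illustration only.
    """

    def __init__(self):
        # (member_id, fact_name) -> parallel lists of timestamps and values
        self._times = defaultdict(list)
        self._values = defaultdict(list)

    def log(self, member_id, fact_name, timestamp, value):
        key = (member_id, fact_name)
        # This sketch assumes facts arrive in non-decreasing timestamp order.
        if self._times[key] and timestamp < self._times[key][-1]:
            raise ValueError("timestamps must be non-decreasing per fact")
        self._times[key].append(timestamp)
        self._values[key].append(value)

    def as_of(self, member_id, fact_name, timestamp):
        """Rewind: return the value that was current at `timestamp`, or None."""
        key = (member_id, fact_name)
        idx = bisect_right(self._times.get(key, []), timestamp)
        return self._values[key][idx - 1] if idx > 0 else None

    def latest(self, member_id, fact_name):
        """What an online scoring service would read: the newest value."""
        values = self._values.get((member_id, fact_name))
        return values[-1] if values else None
```

Offline feature generation would call `as_of` with the time of the original scoring request, while online scoring would call `latest`.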

8. Feature Logging / Fact Logging
[Diagram, two panels. Each panel shows: Fact Microservices, Facts, Online Scoring, Features, Predictor, Recommendations]
● Feature Logging panel: "Log these" marks the Features
● Fact Logging panel: "Log these" marks the Facts
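
One way to read the contrast between the two panels: logged features are frozen to the encoder that produced them, whereas logged facts can be replayed through a new encoder later. The sketch below illustrates that reading only; the encoder functions, field names, and sample records are hypothetical.

```python
def encode_v1(facts):
    """Original encoder (hypothetical): size of the viewing history."""
    return {"num_views": len(facts["view_history"])}


def encode_v2(facts):
    """A later idea (hypothetical): also use the size of My List."""
    features = encode_v1(facts)
    features["my_list_size"] = len(facts["my_list"])
    return features


# Fact logging: the raw inputs seen at scoring time are stored.
logged_facts = [
    {"member_id": 122312, "view_history": ["a", "b", "c"], "my_list": ["d"]},
    {"member_id": 254637, "view_history": ["a"], "my_list": ["b", "c"]},
]

# Feature logging: only the output of the encoder in use at the time is stored.
logged_features = [encode_v1(f) for f in logged_facts]

# With logged facts, a new encoder can be replayed over history.
replayed = [encode_v2(f) for f in logged_facts]
```

Here `logged_features` has no `my_list_size` entry and cannot gain one after the fact, while `replayed` has it for every historical record.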

9. Fact Logging - Pull Architecture
● Daily snapshots of key facts
● Storage
  ○ S3 & Parquet
● API to access the data
  ○ RDD & DataFrames
● Cons
  ○ Lacks temporal accuracy
  ○ Load on microservices
  ○ Missing experiment-specific facts
[Diagram labels: Fact Microservices, Pull, Capture Snapshots, Snapshots, Stratified Member Sets]

10. Fact Logging - Push Architecture
● Compute engines themselves control what to log
● Stratification
● Temporal accuracy
[Diagram labels: Compute Services, Fact Logger, Fact Transformer, Fact Store, Fact Fetcher, ML Workflows, Feature Generation, Model Training]
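
A rough sketch of the push idea: the compute service calls a logger at scoring time, and the logger decides which members to record. The slides do not say how member sets are stratified, so plain deterministic hash-based sampling stands in here; `is_sampled`, `FactLogger`, the salt, and the record layout are all assumptions.

```python
import hashlib


def is_sampled(member_id, sample_rate, salt="fact-logging"):
    """Deterministically decide whether this member's facts are logged.

    Hash-based sampling is an assumption of this sketch, not a detail
    taken from the slides.
    """
    digest = hashlib.sha256(f"{salt}:{member_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


class FactLogger:
    """Called by the compute service itself at scoring time (push)."""

    def __init__(self, sink, sample_rate):
        self.sink = sink  # any object with an append() method
        self.sample_rate = sample_rate

    def log(self, member_id, facts, log_time):
        """Record the facts exactly as the scorer saw them, if sampled."""
        if not is_sampled(member_id, self.sample_rate):
            return False
        self.sink.append(
            {"member_id": member_id, "log_time": log_time, "facts": facts}
        )
        return True
```

Because the logger runs inside the scoring path, the recorded `log_time` is the moment the facts were actually used, which is where the temporal accuracy comes from.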

11. Fact Logger
● Library
● Facts
  ○ User related
  ○ Video related
  ○ Computation specific
● Serialization
● Stratification
● Fact stream
● Storage
[Diagram labels: Precompute, Live Compute, Fact Logger, Stratification Service, Base Fact Tables]

12. Fact Logging - Scalability
● 5-10x increase in data through Kafka
● SLA impact; cost increase
● Compression - 70% decrease
[Diagram labels: Precompute, Live Compute, Fact Logger]
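
To illustrate the compression step, here is a small sketch that compresses a serialized fact record before it would be sent downstream. The slides name neither the serialization format nor the codec, so JSON and zlib are stand-ins, and the sample record is invented; the ratio achieved depends entirely on the data.

```python
import json
import zlib


def serialize(record):
    """Serialize a fact record to compressed bytes (JSON + zlib as stand-ins)."""
    raw = json.dumps(record, sort_keys=True).encode("utf-8")
    return zlib.compress(raw, 6)


def deserialize(blob):
    """Inverse of serialize()."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical record with a repetitive structure, as fact payloads often have.
record = {
    "member_id": 122312,
    "log_time": 1528000000,
    "view_history": [{"video_id": i, "watched": True} for i in range(500)],
}

raw_size = len(json.dumps(record, sort_keys=True).encode("utf-8"))
compressed_size = len(serialize(record))
print(f"raw={raw_size} bytes, compressed={compressed_size} bytes")
```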

13. Storage & Access
● Pipeline load
  ○ Repeated facts
● Aggressive or not
  ○ Loss threshold
● Spark job
  ○ Fact pointers
  ○ SLA
[Diagram labels: Precompute, Live Compute, Deduplication, Conditional push, Fact Logger, Fact Transformer, Fact Store]
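
The "conditional push" label suggests skipping facts that have not changed since they were last sent. A minimal sketch of that idea follows, assuming a per-key digest comparison; the class `ConditionalPusher` and its fields are hypothetical, and the "loss threshold" trade-off from the slide is not modelled.

```python
import hashlib
import json


class ConditionalPusher:
    """Push a fact only when its value differs from the last pushed value."""

    def __init__(self):
        self._last_digest = {}  # (member_id, fact_name) -> digest of last push
        self.pushed = []        # stand-in for the downstream fact stream

    @staticmethod
    def _digest(value):
        # Canonical JSON so that equal values always hash identically.
        payload = json.dumps(value, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def push(self, member_id, fact_name, value, log_time):
        key = (member_id, fact_name)
        digest = self._digest(value)
        if self._last_digest.get(key) == digest:
            return False  # unchanged since the last push: skip it
        self._last_digest[key] = digest
        self.pushed.append((member_id, fact_name, log_time, value))
        return True
```

Slow-moving facts such as My List would mostly be skipped under this scheme, which is what reduces the repeated-fact load on the pipeline.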

14. API Lookback
[Diagram: a per-member table whose cells hold either a value or a pointer]

Member ID   My List           View History           Thumbs
122312      My List Value     View History Pointer   Thumbs Value
254637      My List Pointer   View History Pointer   Thumbs Pointer
Member n    My List Pointer   View History Value     Thumbs Pointer

[Diagram: separate My List, Thumbs, and View History tables, each split into Partition 1 ... Partition m of values and labelled with an earlier log time (log_time - x, log_time - y, log_time - z)]
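
The lookback in the diagram can be sketched as pointer resolution: a cell holds either the value itself or a reference to an earlier log time where the value lives. The storage layout, the `("value", ...)` / `("pointer", ...)` cell encoding, and the `max_hops` guard below are assumptions for illustration, not the actual Fact Store schema.

```python
def resolve(store, fact_name, member_id, log_time, max_hops=10):
    """Follow pointers back through earlier log times until a value is found.

    `store` maps (fact_name, log_time) -> {member_id: (kind, payload)},
    where kind is "value" or "pointer" (hypothetical encoding).
    """
    for _ in range(max_hops):
        kind, payload = store[(fact_name, log_time)][member_id]
        if kind == "value":
            return payload
        log_time = payload  # pointer: payload is an earlier log_time
    raise LookupError("pointer chain exceeded max_hops")


# Hypothetical contents mirroring the table: member 254637's My List is a
# pointer to an earlier log time, member 122312's Thumbs is stored inline.
store = {
    ("my_list", 100): {254637: ("value", ["video_a"])},
    ("my_list", 200): {254637: ("pointer", 100)},
    ("thumbs", 200): {122312: ("value", {"video_b": "up"})},
}

my_list_now = resolve(store, "my_list", 254637, 200)
thumbs_now = resolve(store, "thumbs", 122312, 200)
```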

15. Storage & Access
● Query performance
  ○ Slow moving facts
● Point query
  ○ Connector
● Query time reduction
  ○ Hours to minutes
[Diagram labels: Precompute, Live Compute, Deduplication, Conditional push, Fact Logger, Write, Fact Transformer, Fact Store, Read, Read API]

16. Performance: Storage
• Partitioning scheme
  – Noisy neighbor
• Storage format
  – Exploratory vs production
• Fast & slow lane
  – Lookback limit

17. Performance: Spark reads
• Bloom filters
  – Reduce scan
• Cache access
  – EVCache, Spectator
• MapPartitions vs UDF
  – Eager vs lazy
  – SPARK-11438, SPARK-11469, SPARK-20586
[Diagram labels: Application, ML Library, Read API]
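
As an illustration of how Bloom filters can reduce scans, the sketch below keeps one small filter per partition and skips partitions whose filter rules the key out. It is a pure-Python toy with assumed sizes and hashing; it does not reflect how Spark, Parquet, or the Fact Store implement their filters.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def partitions_to_scan(filters, member_id):
    """Return only the partitions whose filter says the member may be present."""
    return [name for name, bf in filters.items() if bf.might_contain(member_id)]
```

A point query for one member then reads only the partitions returned by `partitions_to_scan`; a false positive costs an unnecessary read but never a missed record.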

18. Future Work
• Structured with schema evolution
  – Best of both (POJO & Spark SQL), Iceberg
• Streaming vs batch
  – Multiple lanes, accountability, independent scale
• Duplication
  – Storage vs runtime cost

19. Questions?