Fact Store at Scale for Netflix Recommendations
1. Fact Store - Netflix Recommendations
Kedar Sadekar, Netflix
Nitin Sharma, Netflix
#DevSAIS11
2. Agenda
3. Recommendations at Netflix
● Personalized homepage for each member
  ○ Goal: Quickly help members find content they'd like to watch
  ○ Risk: Member may lose interest and abandon the service
  ○ Challenge: Recommendations at scale
4. Scale @ Netflix
5. Experimentation Cycle @ Netflix
Design a new experiment to test out different ideas
[Diagram: Offline Experiment: Design Experiment, Collect Label Data, Offline Feature Generation, Model Training, Model Validation Metrics, Model Testing]
[Diagram: Online System: Online A/B Testing]
6. ML Feature Engineering - Architectural View
[Diagram: Offline Experiment: Offline Feature Generation, Model Training]
[Diagram: Online System: Microservices, Online Feature Generation, Online Scoring]
[Diagram: other labels: Facts, Features, Shared Feature Encoders, Deploy Models]
7. What is a Fact?
● Fact
  ○ Input data for feature encoders; used to construct a feature
  ○ Example: viewing history of a member, My List of a member
● Historical version of a fact
  ○ Rewindable: state of the world at that time
● Temporal
  ■ Facts are temporal, i.e. they change with time
  ■ Each online scoring service uses the latest value of a fact
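
The "rewindable" property above can be illustrated with a small in-memory sketch. This is not the Fact Store's actual data model: the `FactHistory` class, its method names, and the integer timestamps are hypothetical and only show the difference between reading the latest value (what online scoring sees) and reading the value as of a past time (what offline feature generation needs).

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class FactHistory:
    """All logged versions of one fact for one member (hypothetical model)."""
    timestamps: List[int] = field(default_factory=list)  # ascending log times
    values: List[Any] = field(default_factory=list)

    def log(self, ts: int, value: Any) -> None:
        # Assumption: versions arrive in non-decreasing time order.
        if self.timestamps and ts < self.timestamps[-1]:
            raise ValueError("versions must be logged in time order")
        self.timestamps.append(ts)
        self.values.append(value)

    def latest(self) -> Optional[Any]:
        # What an online scoring service would read right now.
        return self.values[-1] if self.values else None

    def as_of(self, ts: int) -> Optional[Any]:
        # "Rewind": the most recent version logged at or before ts.
        i = bisect_right(self.timestamps, ts)
        return self.values[i - 1] if i > 0 else None


my_list = FactHistory()
my_list.log(100, ["title_a"])
my_list.log(200, ["title_a", "title_b"])
```

Here `my_list.as_of(150)` returns the first version, while `my_list.latest()` returns the second; a query before the first log time returns `None`.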
8. Feature Logging / Fact Logging
[Diagram: Feature Logging: Fact Microservices → Facts → Online Scoring → Features (log these) → Predictor → Recommendations]
[Diagram: Fact Logging: Fact Microservices → Facts (log these) → Online Scoring → Features → Predictor → Recommendations]
9. Fact Logging - Pull Architecture
● Daily snapshots of key facts
● Storage
  ○ S3 & Parquet
● API to access the data
  ○ RDD & DataFrames
● Cons
  ○ Lacks temporal accuracy
  ○ Load on microservices
  ○ Missing experiment-specific facts
[Diagram: Fact Microservices, Pull, Capture Snapshots, Snapshots, Stratified Member Sets]
10. Fact Logging - Push Architecture
● Compute engines themselves control what to log
● Stratification
● Temporal accuracy
[Diagram: Compute Services, Fact Logger, Fact Transformer, Fact Store, Fact Fetcher, ML Workflows (Feature Generation, Model Training)]
11. Fact Logger
● Library
● Facts
  ○ User related
  ○ Video related
  ○ Computation specific
● Serialization
● Stratification
● Fact stream
● Storage
[Diagram: Precompute, Live Compute, Fact Logger, Stratification Service, Base Fact Tables]
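
A minimal sketch of how a logger library might combine stratification and serialization before writing to the fact stream. The talk does not show the real implementation, so everything here is an assumption: the stratum names, the sampling rates, the hash-based sampling rule, and the JSON record layout are all hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Optional

# Hypothetical per-stratum sampling rates (not taken from the talk).
SAMPLE_RATES: Dict[str, float] = {"new_member": 0.5, "tenured_member": 0.1}


def in_sample(member_id: int, stratum: str) -> bool:
    """Deterministically keep a fixed fraction of members in each stratum."""
    rate = SAMPLE_RATES.get(stratum, 0.0)
    digest = hashlib.sha256(f"{stratum}:{member_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < rate * 2**64


def log_fact(member_id: int, stratum: str, fact_name: str,
             value: Any, log_time: int) -> Optional[bytes]:
    """Return a serialized record for the fact stream, or None if not sampled."""
    if not in_sample(member_id, stratum):
        return None
    record = {
        "member_id": member_id,
        "fact": fact_name,
        "value": value,
        "log_time": log_time,  # captured at compute time for temporal accuracy
    }
    return json.dumps(record, sort_keys=True).encode("utf-8")
```

Hashing the member ID instead of drawing a random number means the same member is consistently in or out of the sample, so every fact logged for that member lines up across computations.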
12. Fact Logging - Scalability
● 5-10x increase in data through Kafka
● SLA impact; cost increase
● Compression - 70% decrease
[Diagram: Precompute, Live Compute, Fact Logger]
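
As an illustration of the compression bullet, the sketch below compresses a batch of serialized fact records before it would be handed to a producer. The codec (`zlib`), the JSON encoding, and the record fields are assumptions chosen for a self-contained example; the talk does not say which codec or format was used, and the ratio achieved here says nothing about the 70% figure on the slide.

```python
import json
import zlib
from typing import Any, Dict, List


def pack(records: List[Dict[str, Any]]) -> bytes:
    """Serialize a batch of fact records and compress it."""
    raw = json.dumps(records, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, level=6)


def unpack(blob: bytes) -> List[Dict[str, Any]]:
    """Inverse of pack(): decompress and deserialize a batch."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical batch: many members with similar, repetitive fact payloads.
batch = [
    {"member_id": m, "fact": "view_history",
     "value": ["title_a", "title_b", "title_c"]}
    for m in range(1000)
]
raw_size = len(json.dumps(batch, separators=(",", ":")).encode("utf-8"))
packed = pack(batch)
```

Fact payloads tend to repeat field names and values, which is why batch-level compression pays off; comparing `len(packed)` with `raw_size` shows the saving for this synthetic batch.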
13. Storage & Access
● Pipeline load
  ○ Repeated facts
● Aggressive or not
  ○ Loss threshold
● Spark job
  ○ Fact pointers
  ○ SLA
[Diagram: Precompute, Live Compute, Deduplication, Conditional push, Fact Logger, Fact Transformer, Fact Store]
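
The fact-pointer idea can be sketched on the write side as follows. This is an assumption about how deduplication might work, not the Spark job from the talk: the cell layout (`("value", payload)` or `("pointer", log_time)`), the function name, and the in-memory dictionaries are hypothetical. A fact whose value has not changed since it was last stored is written as a pointer to the log time that holds the value.

```python
from typing import Any, Dict, Tuple

Key = Tuple[int, str]    # (member_id, fact_name)
Cell = Tuple[str, Any]   # ("value", payload) or ("pointer", log_time)


def write_partition(log_time: int,
                    snapshot: Dict[Key, Any],
                    last_stored: Dict[Key, Tuple[int, Any]]) -> Dict[Key, Cell]:
    """Build the cells for one log_time, replacing repeated values with pointers.

    last_stored maps each key to (log_time, value) of the most recent full
    value written; it is updated in place.
    """
    cells: Dict[Key, Cell] = {}
    for key, value in snapshot.items():
        prev = last_stored.get(key)
        if prev is not None and prev[1] == value:
            cells[key] = ("pointer", prev[0])   # unchanged: point at stored copy
        else:
            cells[key] = ("value", value)       # new or changed: store in full
            last_stored[key] = (log_time, value)
    return cells


key = (122312, "my_list")
last: Dict[Key, Tuple[int, Any]] = {}
store = {
    1: write_partition(1, {key: ["title_a"]}, last),
    2: write_partition(2, {key: ["title_a"]}, last),
    3: write_partition(3, {key: ["title_a", "title_b"]}, last),
}
```

Because pointers always reference a log time that holds a full value, a reader needs at most one extra lookup to resolve any cell.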
14. API Lookback
Member ID | My List | View History | Thumbs
122312    | Value   | Pointer      | Value
254637    | Pointer | Pointer      | Pointer
Member n  | Pointer | Value        | Pointer
[Diagram: value tables for My List, Thumbs, and View History stored under log_time - x, log_time - y, and log_time - z, each split into Partition 1 ... Partition m holding Values]
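
A read-side sketch of the lookback shown in the diagram: a cell either holds the value or a pointer to an earlier log time where the value is stored. The cell encoding, the `read_fact` name, and the sample data are hypothetical; only the member IDs are taken from the slide.

```python
from typing import Any, Dict, Tuple

Key = Tuple[int, str]    # (member_id, fact_name)
Cell = Tuple[str, Any]   # ("value", payload) or ("pointer", log_time)
Partitions = Dict[int, Dict[Key, Cell]]  # keyed by log_time


def read_fact(partitions: Partitions, log_time: int,
              member_id: int, fact: str) -> Any:
    """Return the fact value at log_time, following a pointer if needed."""
    kind, payload = partitions[log_time][(member_id, fact)]
    if kind == "value":
        return payload
    # Pointer: payload is the earlier log_time that holds the full value.
    target_kind, target_payload = partitions[payload][(member_id, fact)]
    if target_kind != "value":
        raise ValueError("pointer must reference a stored value")
    return target_payload


partitions: Partitions = {
    1: {
        (122312, "view_history"): ("value", ["title_a"]),
        (254637, "my_list"): ("value", ["title_c"]),
    },
    2: {
        (122312, "view_history"): ("pointer", 1),
        (254637, "my_list"): ("value", ["title_c", "title_d"]),
    },
}
```

In this example `read_fact(partitions, 2, 122312, "view_history")` looks back to log time 1, while the `my_list` fact for member 254637 is read directly from log time 2.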
15. Storage & Access
● Query performance
  ○ Slow moving facts
● Point query
  ○ Connector
● Query time reduction
  ○ Hours to minutes
[Diagram: Precompute, Live Compute, Deduplication, Conditional push, Fact Logger, Write, Fact Transformer, Fact Store, Read, Read API]
16. Performance: Storage
• Partitioning scheme
  – Noisy neighbor
• Storage format
  – Exploratory vs production
• Fast & slow lane
  – Lookback limit
17. Performance: Spark reads
• Bloom filters
  – Reduce scan
• Cache access
  – EVCache, Spectator
• MapPartitions vs UDF
  – Eager vs lazy
  – SPARK-11438, SPARK-11469, SPARK-20586
[Diagram: Application, ML Library, Read API]
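
To show how Bloom filters can reduce a scan, here is a minimal pure-Python sketch: one filter per data file records which member IDs the file contains, and a point query only opens files whose filter might contain the requested member. The class, the sizing parameters, and the `files_to_scan` helper are hypothetical and unrelated to any Spark or Parquet API; the talk does not describe its implementation.

```python
import hashlib
from typing import Dict, List


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> List[int]:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


def files_to_scan(filters: Dict[str, BloomFilter], member_id: int) -> List[str]:
    """Keep only the files whose filter may contain the member."""
    return [name for name, bf in filters.items()
            if bf.might_contain(str(member_id))]


# Hypothetical layout: two data files, each with its own filter.
filters = {"part-0": BloomFilter(), "part-1": BloomFilter()}
filters["part-0"].add(str(122312))
filters["part-1"].add(str(254637))
```

A Bloom filter never produces false negatives, so skipping a file on a negative answer is safe; the occasional false positive only costs an unnecessary read.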
18. Future Work
• Structured with schema evolution
  – Best of both (POJO & Spark SQL), Iceberg
• Streaming vs batch
  – Multiple lanes, accountability, independent scale
• Duplication
  – Storage vs runtime cost
19. Questions?