Using Apache Spark to Predict Installer Retention from Messy Clickstream Data
1. User Predictions from Messy Clickstream Data
Patrick Halina, Zynga
#DSSAIS15
2. Application Overview
• User installs game
• Capture user actions from first session
• Predict if user will play again in 7 days
[Diagram labels: Game, User Actions, First Session, Predictions]
3. About Me (Patrick Halina)
• Undergrad in Comp Eng
• Grad school in Statistics
• Previously worked on ML Platform at Amazon
• Tech lead for ML Eng at Zynga, based out of Toronto office
4. Over 80 million monthly mobile users
Mission: Connect the world through games
5. Batch ML Predictions at Zynga
[Architecture diagram; labels: Client, Data Warehouse, Databricks, Data Services, Feature Generation, Train models, Score data]
6. ML Challenges
• Building features to train ML models is hard
• At Zynga, the feature generation pipeline takes the most time in the model dev process
• Garbage in → Garbage out
7. Why is Feature Generation Hard?
• Need to select important features
• Want to capture relationships between features
• Typically, each feature is explicitly coded in SQL/SparkSQL/Pandas
8. Clickstream Data
• Log of user actions
• Difficult to wrangle into features
Example events:
08/09/2018 12:01:18 user=123 type=game_action subtype=level_up level=12
10/09/2018 15:29:01 user=124 type=purchase price=12.99
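
The example events above can be read into a structured form with a few lines of Python. This is a minimal sketch, not Zynga's parser: the function name `parse_event` is hypothetical, the layout (date, time, then space-separated `key=value` pairs) is inferred from the two sample records, and the day/month order of the timestamp is an assumption.

```python
from datetime import datetime


def parse_event(line, ts_format="%d/%m/%Y %H:%M:%S"):
    """Parse one clickstream line into (timestamp, fields).

    Assumes the layout seen in the sample records: a date, a time,
    then key=value pairs. The day/month order is a guess.
    """
    date_part, time_part, *pairs = line.split()
    timestamp = datetime.strptime(f"{date_part} {time_part}", ts_format)
    # All values are kept as strings; each event type has its own schema.
    fields = dict(pair.split("=", 1) for pair in pairs)
    return timestamp, fields


ts, fields = parse_event(
    "08/09/2018 12:01:18 user=123 type=game_action subtype=level_up level=12"
)
```

Values stay as strings on purpose: with thousands of event types and an incomplete catalog, casting `level` or `price` would require exactly the per-event interpretation the talk tries to avoid.
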
9. Problem 1: Messy
• Huge logs, typically the largest dataset
• Thousands of event types
• Different structure/interpretation for each event type
• Event catalog erroneous and incomplete
10. Problem 2: Wrong Data Shape
[Figure: User Feature Matrix]
11. Problem 2: Wrong Data Shape
• Most ML models need input as a matrix
• Every matrix input to the model needs to have the same features
• Challenge: transform sequence into matrix
12. Problem 3: Temporal Info
• How to capture timing info of events?
• E.g. increasing/decreasing trends
• E.g. user plays multiplayer game before trying single player mode

User                       A   B   C
Num multiplayer battles    5   0   23
Num single player battles  22  12  3
13. Traditional Solution
• Select events, aggregate over time period

User                     A   B   C
Total Num Battles        22  6   99
Num Battles Last 7 Days  10  6   0
Max Level                5   2   17
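
To make the traditional approach concrete, here is a minimal sketch of hand-coded aggregation in plain Python. The event tuples, the function name `traditional_features`, and the feature names are all hypothetical; the point is that every feature is a separate, explicitly written rule.

```python
from collections import defaultdict

# Hypothetical pre-parsed events: (user, days_ago, event_type, level)
events = [
    ("A", 10, "battle", 3),
    ("A", 2, "battle", 5),
    ("A", 1, "level_up", 5),
    ("B", 3, "battle", 2),
]


def traditional_features(events, recent_days=7):
    """Build one hand-coded feature row per user."""
    feats = defaultdict(
        lambda: {"total_battles": 0, "battles_recent": 0, "max_level": 0}
    )
    for user, days_ago, event_type, level in events:
        row = feats[user]
        if event_type == "battle":
            row["total_battles"] += 1
            if days_ago < recent_days:
                row["battles_recent"] += 1
        row["max_level"] = max(row["max_level"], level)
    return dict(feats)


features = traditional_features(events)
```

Each new signal (a new event type, a new time window) means another branch in this loop, which is the maintenance cost the following slides describe.
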
14. Traditional Solution
• Human interpretation of events
• Explicitly code features for each event
• Add a few temporal signals to code as features
• E.g. week-over-week change in battles
15. Problems
• Takes too much time
• Repetitive
• Hard to debug and maintain
• Hand-picking features to add misses signals
16. Deep Learning?
• Trend with Deep Learning is to let algorithms select high level features from data
• Deep Learning outperforms handmade features
17. Deep Learning?
• How can we apply this to event sequences?
• Theoretically: Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM)
• Area of active research, tricky to apply in practice
18. Solution: Temporal User Heatmap
Clickstream:
08/09/2018 12:01:18 user=A subtype=level_up level=12
08/09/2018 12:02:11 user=B subtype=purchase amount=12
[Figure: clickstream alongside heatmaps for User A, User B, User C]
19. Solution: Temporal User Heatmap
[Figure: heatmap with actions (Battle, Level Up, Error) as rows and time intervals (5m, 10m, 15m, 20m) as columns]
20. Solution: Temporal User Heatmap
[Figure: heatmap with rows Battle, Level Up, Error and columns 5m, 10m, 15m, 20m]
• Battle: user increased battles over first 15 minutes, then stopped
• Error: user had error in last 5 minutes
21. Dataflow
[Diagram: Clickstream → User Heatmaps → Model → Predictions]
Example clickstream events:
08/09/2018 12:01:18 user=A subtype=level_up level=12
08/09/2018 12:02:11 user=B subtype=purchase amount=12
22. Advantages
• Heatmap is auto generated
• No manual interpretation of events
• Right shape for ML models
• Captures temporal relations
• Simpler than advanced sequence learning models like RNNs and LSTMs
23. Methodology 1: Timing
• Choose total time window
• Choose time intervals to break up window
• We applied this to new installers: 5 minute intervals during 30 minutes after install
[Timeline: 0, 5m, 10m, 15m, 20m, 25m, 30m]
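
The timing step amounts to mapping each event to an interval index. A minimal sketch, assuming events carry a "seconds since install" offset; the function name `interval_index` and the choice of left-closed intervals (an event at exactly 5:00 falls in the second bucket) are assumptions, not details from the talk.

```python
def interval_index(seconds_since_install, interval_s=300, window_s=1800):
    """Return the 0-based interval an event falls into.

    Defaults match the slide: 5 minute intervals over a 30 minute window.
    Events outside the window return None so callers can drop them.
    """
    if not 0 <= seconds_since_install < window_s:
        return None
    return int(seconds_since_install // interval_s)
```

With the defaults this yields six columns (indices 0 to 5) per user, regardless of how many events the user generated.
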
24. Methodology 2: Calculate Intensities
• Aggregate actions over each time period
• Choose aggregation functions
• Simple aggregation: count occurrences
• Other aggregation functions (max, sum) require interpretation of events but give more info
• Normalize and scale values in heatmaps
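
A minimal sketch of the simplest case, counting occurrences per event type and interval for one user and then scaling. The names `build_heatmap` and `scale_rows` are hypothetical, and dividing each row by its own maximum is only one possible scaling; the slides do not say which scheme was used.

```python
def build_heatmap(events, event_types, n_intervals=6, interval_s=300):
    """Count events per (event type, interval) for a single user.

    events: iterable of (seconds_since_install, event_type).
    Returns a grid with one row per event type, one column per interval.
    """
    row_of = {name: i for i, name in enumerate(event_types)}
    grid = [[0.0] * n_intervals for _ in event_types]
    for seconds, event_type in events:
        col = int(seconds // interval_s)
        # Unknown event types and out-of-window events are ignored.
        if event_type in row_of and 0 <= col < n_intervals:
            grid[row_of[event_type]][col] += 1
    return grid


def scale_rows(grid):
    """Scale each row to [0, 1] by its maximum (assumed scheme)."""
    scaled = []
    for row in grid:
        peak = max(row)
        scaled.append([v / peak if peak else 0.0 for v in row])
    return scaled


user_events = [(30, "battle"), (200, "battle"), (400, "battle"),
               (1000, "level_up"), (1700, "error")]
heatmap = build_heatmap(user_events, ["battle", "level_up", "error"])
scaled = scale_rows(heatmap)
```

Every user ends up with a grid of the same shape, which is what solves the "wrong data shape" problem. In practice, scaling statistics would more likely be computed across the training population than per user.
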
25. Methodology 3: Limit Events
• Roll up events to a limited hierarchy, e.g. purchase | car | Audi → purchase | car
• Select top events by significance to predictor, presence in population
• I.e. decision tree metrics (Gini impurity, information gain)
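
Both steps can be sketched briefly. The code below rolls event names up to a fixed depth and ranks events by information gain against a binary label. The function names, the ` | ` separator handling, and the use of simple presence/absence indicators are assumptions for illustration.

```python
import math


def roll_up(event_name, depth=2, sep=" | "):
    """Keep only the first `depth` levels of a hierarchical event name."""
    return sep.join(event_name.split(sep)[:depth])


def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(has_event, labels):
    """Entropy reduction from splitting users on presence of an event."""
    with_event = [y for x, y in zip(has_event, labels) if x]
    without = [y for x, y in zip(has_event, labels) if not x]
    n = len(labels)
    conditional = (len(with_event) / n) * entropy(with_event) \
        + (len(without) / n) * entropy(without)
    return entropy(labels) - conditional


def top_events(presence, labels, k):
    """presence: {event name: per-user 0/1 list}. Return the k best names."""
    ranked = sorted(presence,
                    key=lambda name: information_gain(presence[name], labels),
                    reverse=True)
    return ranked[:k]
```

For example, `roll_up("purchase | car | Audi")` gives `"purchase | car"`, and an event whose presence perfectly separates retained from churned users scores the maximum gain of 1 bit.
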
26. Implementation
• Use Spark
• Developed generic Python framework at Zynga to generate features from any game’s events
• No need to write custom feature generation for each game
27. Modeling
• Rearrange heatmap so similar events are closer to each other
• Use hierarchical clustering
• Now grid cells are related to nearby cells by both time and correlation
[Figure: heatmap with axes Events and Time]
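
One way to obtain such a row ordering is sketched below: average-linkage agglomerative clustering on the distance 1 minus Pearson correlation, written in plain Python so it is self-contained. The talk does not specify the linkage or distance used, so both are assumptions, as are the function names; a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `leaves_list` would be the usual choice on real data.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length vectors (0.0 if constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def cluster_order(rows):
    """Return row indices ordered so correlated rows end up adjacent.

    rows: one intensity vector per event type (for example, intensities
    pooled over users and intervals). Naive O(n^3) merging, fine for a
    few hundred event types.
    """
    dist = [[1.0 - pearson(a, b) for b in rows] for a in rows]
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(p, q) for p in clusters[i] for q in clusters[j]]
                d = sum(dist[p][q] for p, q in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0] if clusters else []


event_rows = [[1, 2, 3, 4], [4, 3, 2, 1], [2, 4, 6, 8], [8, 6, 4, 2]]
order = cluster_order(event_rows)
```

Reordering the heatmap rows by `order` places the two rising series next to each other and the two falling series next to each other, so neighbouring cells carry related information in both directions.
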
28. Modeling
• Clickstream events are now an image
• Apply models that capture spatial structure between columns and rows of data
• E.g. image classification techniques
29. Retention Prediction
• Capture user game actions from first 30 minutes after installation
• How well can we predict retention?
[Diagram labels: First Session, Clickstream, User Heatmap, Predictions]