Using Apache Spark to Predict Installer Retention from Messy Clickstream Data

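Clickstream data is messy. A single user session in a Zynga game can generate thousands of events, and each game, client version, and OS has its own event schema. Unfortunately, most ML models require their training data to be formatted as a uniform matrix, with exactly the same columns for every user. Developing a feature set that captures all the subtle trends and interactions in the event stream is a time-consuming challenge.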

1.User Predictions from Messy Clickstream Data Patrick Halina, Zynga #DSSAIS15

2.Application Overview • User installs game • Capture user actions from first session • Predict if user will play again in 7 days [Diagram: Game, User Actions, First Session, Predictions]

3.About Me (Patrick Halina) • Undergrad in Comp Eng • Grad school in Statistics • Previously worked on ML Platform at Amazon • Tech lead for ML Eng at Zynga, based out of Toronto office

4.Over 80 million monthly mobile users • Mission: Connect the world through games

5.Batch ML Predictions at Zynga [Diagram: Client, Data Warehouse, Databricks, Data Services; steps: Feature Generation, Train models, Score data]

6.ML Challenges • Building features to train ML models is hard • At Zynga, the feature generation pipeline takes the most time in the model development process • Garbage in → Garbage out

7.Why is Feature Generation Hard? • Need to select important features • Want to capture relationships between features • Typically, each feature is explicitly coded in SQL/SparkSQL/Pandas

8.Clickstream Data • Log of user actions • Difficult to wrangle into features
Example events:
08/09/2018 12:01:18 user=123 type=game_action subtype=level_up level=12
10/09/2018 15:29:01 user=124 type=purchase price=12.99
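
A log line of this shape can be split into a timestamp plus a set of `key=value` fields. The sketch below assumes space-separated tokens with the date and time first, and it guesses a day/month/year date order. `parse_event` is a hypothetical helper for illustration, not part of Zynga's framework.

```python
from datetime import datetime

def parse_event(line, ts_format="%d/%m/%Y %H:%M:%S"):
    """Split one log line into a timestamp and a dict of key=value fields.

    Assumes the first two space-separated tokens are date and time. The
    day/month order in ts_format is a guess and may need to be swapped.
    """
    date_part, time_part, *pairs = line.split()
    event = {"ts": datetime.strptime(f"{date_part} {time_part}", ts_format)}
    for pair in pairs:
        key, _, value = pair.partition("=")
        event[key] = value  # values stay as strings; no per-event schema is assumed
    return event

event = parse_event(
    "08/09/2018 12:01:18 user=123 type=game_action subtype=level_up level=12"
)
```

Values are kept as strings on purpose: each event type has its own fields, so any type conversion would need knowledge of the event catalog.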

9.Problem 1: Messy • Huge logs, typically largest dataset • Thousands of event types • Different structure/interpretation for each event type • Event catalog erroneous and incomplete

10.Problem 2: Wrong Data Shape [Figure: User Feature Matrix]

11.Problem 2: Wrong Data Shape • Most ML models need input as a matrix • Every matrix input to model needs to have the same features • Challenge: transform sequence into matrix

12.Problem 3: Temporal Info • How to capture timing info of events? • E.g. Increasing/decreasing trends • E.g. User plays multiplayer game before trying single player mode
User                         A    B    C
Num multiplayer battles      5    0    23
Num single player battles    22   12   3

13.Traditional Solution • Select events, aggregate over time period
User                      A    B    C
Total Num Battles         22   6    99
Num Battles Last 7 Days   10   6    0
Max Level                 5    2    17

14.Traditional Solution • Human interpretation of events • Explicitly code features for each event • Add a few temporal signals to code as features • E.g. Week over week change in battles
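
To make the hand-coded approach concrete, here is a minimal sketch of the kind of per-event feature code it implies. The event tuples, the `as_of` date, and the feature names are all hypothetical. Each new signal needs another explicitly written line.

```python
from datetime import datetime, timedelta

# Hypothetical, already-parsed events for one user: (timestamp, event type, level).
events = [
    (datetime(2018, 9, 1), "battle", 3),
    (datetime(2018, 9, 3), "battle", 4),
    (datetime(2018, 9, 9), "battle", 5),
    (datetime(2018, 9, 10), "battle", 5),
    (datetime(2018, 9, 12), "battle", 5),
]
as_of = datetime(2018, 9, 14)  # hypothetical scoring date

def battles_between(start, end):
    """Count battle events with start <= timestamp < end."""
    return sum(1 for ts, kind, _ in events if kind == "battle" and start <= ts < end)

week = timedelta(days=7)
features = {
    "total_battles": battles_between(datetime.min, as_of),
    "battles_last_7d": battles_between(as_of - week, as_of),
    "max_level": max(level for _, _, level in events),
}
# Week-over-week change: last 7 days minus the 7 days before that.
features["battles_wow_change"] = (
    features["battles_last_7d"] - battles_between(as_of - 2 * week, as_of - week)
)
```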

15.Problems • Takes too much time • Repetitive • Hard to debug, maintain • Miss signals by hand picking features to add

16.Deep Learning? • Trend with Deep Learning is to let algorithms select high level features from data • Deep Learning outperforms handmade features

17.Deep Learning? • How can we apply this to event sequences? • Theoretically: Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) • Area of active research, tricky to apply in practice

18.Solution: Temporal User Heatmap [Diagram: Clickstream events, e.g. "08/09/2018 12:01:18 user=A subtype=level_up level=12" and "08/09/2018 12:02:11 user=B subtype=purchase amount=12", shown alongside one heatmap each for User A, User B, User C]

19.Solution: Temporal User Heatmap [Figure: heatmap with Actions as rows (Battle, Level Up, Error) and Time Intervals as columns (5m, 10m, 15m, 20m)]

20.Solution: Temporal User Heatmap • User increased battles over first 15 minutes, then stopped • User had error in last 5 minutes [Figure: heatmap with rows Battle, Level Up, Error and columns 5m, 10m, 15m, 20m]

21.Dataflow [Diagram: Clickstream (e.g. "08/09/2018 12:01:18 user=A subtype=level_up level=12", "08/09/2018 12:02:11 user=B subtype=purchase amount=12"), User Heatmaps, Model, Predictions]

22.Advantages • Heatmap is auto generated • No manual interpretation of events • Right shape for ML models • Captures temporal relations • Simpler than advanced sequence learning models like RNNs and LSTMs

23.Methodology 1: Timing • Choose total time window • Choose time intervals to break up window • We applied this to new installers: 5 minute intervals during the 30 minutes after install [Timeline: 0, 5m, 10m, 15m, 20m, 25m, 30m]
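
The timing step amounts to mapping each event to an interval index relative to install time. A minimal sketch using the 30 minute window and 5 minute intervals from the slide follows. `interval_index` and the sample install timestamp are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)      # total window after install (from the slide)
INTERVAL = timedelta(minutes=5)     # interval width (from the slide)
NUM_INTERVALS = WINDOW // INTERVAL  # number of heatmap columns

def interval_index(install_ts, event_ts):
    """Return the 0-based interval an event falls in, or None if outside the window."""
    offset = event_ts - install_ts
    if offset < timedelta(0) or offset >= WINDOW:
        return None
    return offset // INTERVAL

install = datetime(2018, 9, 8, 12, 0, 0)  # hypothetical install time
idx = interval_index(install, install + timedelta(minutes=7))  # falls in the 5m-10m interval
```

Events outside the window return `None` so that the caller can drop them before aggregating.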

24.Methodology 2: Calculate Intensities • Aggregate actions over each time period • Choose aggregation functions • Simple aggregation: count occurrences • Other aggregation functions (max, sum) require interpretation of events but give more info • Normalize and scale values in heatmaps
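
As an illustration of the simplest aggregation, the sketch below counts occurrences per event type and interval, then scales each row by its maximum. The event list and the event type names are hypothetical, and the slides do not say which normalization was used. Per-row max scaling is an assumption here, and scaling against population-wide statistics would be an equally plausible choice. It is written in plain Python, but the same counts could come from a grouped count in Spark.

```python
from collections import Counter

EVENT_TYPES = ["battle", "level_up", "error"]  # hypothetical heatmap rows
NUM_INTERVALS = 4                               # e.g. four 5-minute columns

# Hypothetical (event type, interval index) pairs for one user.
user_events = [
    ("battle", 0), ("battle", 1), ("battle", 1),
    ("battle", 2), ("battle", 2), ("battle", 2), ("battle", 2),
    ("level_up", 1), ("error", 3),
]

def count_heatmap(events):
    """Count occurrences of each event type in each interval (rows x columns)."""
    counts = Counter(events)
    return [[counts[(etype, i)] for i in range(NUM_INTERVALS)] for etype in EVENT_TYPES]

def scale_rows(heatmap):
    """Scale each row to [0, 1] by its maximum; all-zero rows stay zero."""
    scaled = []
    for row in heatmap:
        peak = max(row)
        scaled.append([v / peak if peak else 0.0 for v in row])
    return scaled

heatmap = count_heatmap(user_events)
scaled = scale_rows(heatmap)
```

Every user gets a grid of the same shape regardless of how many events they generated, which is what makes the result usable as model input.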

25.Methodology 3: Limit Events • Roll up events to limited hierarchy: purchase | car | Audi → purchase | car • Select top events by significance to predictor, presence in population • E.g. Decision Tree metrics (Gini impurity, Information gain)
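
Both steps can be sketched in a few lines. The code below truncates a hierarchical event name to a fixed depth, then ranks events by information gain against a binary retention label. The separator, the depth of 2, the event names other than the slide's own example, and the sample data are assumptions for illustration.

```python
import math

def roll_up(event_name, max_depth=2, sep=" | "):
    """Truncate a hierarchical event name to its first max_depth levels."""
    return sep.join(event_name.split(sep)[:max_depth])

def entropy(labels):
    """Shannon entropy (bits) of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(has_event, retained):
    """Reduction in label entropy from splitting users on event presence."""
    with_event = [y for x, y in zip(has_event, retained) if x]
    without = [y for x, y in zip(has_event, retained) if not x]
    n = len(retained)
    split = (len(with_event) / n) * entropy(with_event) + (len(without) / n) * entropy(without)
    return entropy(retained) - split

# Hypothetical data: one entry per user.
retained = [1, 1, 0, 0]
presence = {
    "purchase | car": [1, 1, 0, 0],   # perfectly separates retained users
    "session | start": [1, 1, 1, 1],  # present for everyone, so uninformative
}
ranked = sorted(presence, key=lambda e: information_gain(presence[e], retained), reverse=True)
```

An event seen by every user scores zero gain, which is the behaviour wanted when trimming the event list to the rows that help the predictor.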

26.Implementation • Use Spark • Developed generic Python framework at Zynga to generate features from any game’s events • No need to write custom feature generation for each game

27.Modeling • Rearrange heatmap so similar events are closer to each other • Use hierarchical clustering • Now grid cells are related to nearby cells by both time and correlation [Figure: heatmap with Events on the vertical axis and Time on the horizontal axis]
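
One way to picture the reordering is a small agglomerative clustering over event rows, using 1 minus Pearson correlation as the distance. The slides do not specify the linkage or distance that was used. Average linkage, the correlation distance, and the sample rows below are assumptions, and in practice a library routine such as SciPy's `scipy.cluster.hierarchy` would likely replace this hand-rolled loop.

```python
import math

def correlation(a, b):
    """Pearson correlation of two equal-length rows; 0.0 if either is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def cluster_order(rows):
    """Order row indices by average-linkage agglomerative clustering.

    Distance is 1 - correlation. Clusters are merged closest-first and their
    member lists concatenated, so correlated rows end up adjacent.
    """
    dist = [[1 - correlation(a, b) for b in rows] for a in rows]
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(dist[p][q] for p in clusters[i] for q in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters[0]

# Hypothetical event rows (one value per time interval).
rows = [
    [1, 2, 3, 4],  # 0: rising
    [4, 3, 2, 1],  # 1: falling
    [2, 4, 6, 8],  # 2: rising, perfectly correlated with row 0
    [5, 3, 2, 0],  # 3: falling, close to row 1
]
order = cluster_order(rows)
reordered = [rows[i] for i in order]
```

After reordering, the two rising rows sit next to each other and so do the two falling rows, so neighbouring cells are related along both axes.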

28.Modeling • Clickstream events are now an image • Apply models that capture spatial structure between columns and rows of data • E.g. Image classification techniques
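
The slides do not name a specific model. As a sketch of what "capturing spatial structure" can mean, the code below applies one small kernel across a heatmap, which is the sliding-window operation (cross-correlation without padding) that convolutional image classifiers are built from. The heatmap values and the kernel are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding) and return the grid of weighted sums."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Hypothetical 3-event x 4-interval heatmap (rows: events, columns: time).
heatmap = [
    [1, 2, 4, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
# Hypothetical 1x2 kernel that measures change between adjacent intervals.
trend_kernel = [[-1, 1]]
trend = conv2d_valid(heatmap, trend_kernel)
```

A trained network would learn its kernel weights rather than use a fixed one. The fixed trend kernel here only shows that a local filter can pick up rising or falling activity without a hand-written feature per event.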

29.Retention Prediction • Capture user game actions from first 30 minutes after installation • How well can we predict retention? [Diagram: First Session Clickstream, User Heatmap, Predictions]