Merchant Churn Prediction Using Spark ML at PayPal

In this session, PayPal shows how it uses the Spark ML platform and several machine learning models to retain merchants. For PayPal, merchant retention translates directly into value, so it is critical to identify the right models to train on our data and predict merchant behavior, giving us the insight needed to prevent merchant churn. We also dig into how to capture the right signals, how to filter out noise that could skew predictions, and some of the challenges we faced while scaling this solution. Finally, we look at how Spark ML orchestrates the stages of the pipeline we built, letting us perform feature engineering, train on the resulting features, and validate and cross-validate at scale across different data samples.

1. Merchant Churn Prediction using Spark ML at PayPal

2. Who are we?
• Data Engineers – ETL pipelines using Spark
• Like all great projects, we started from a hack!
• Data Engineering to Machine Learning

3. Agenda
1. Scale at PayPal
2. Understanding Merchant Churn
3. Machine Learning Workflow
4. Learnings
5. Spark ML

4. Scale at PayPal
• 200 Countries
• 25 Currencies
• 19 Million Merchants
• 237 Million Active Users
• 8 Billion Transactions per Year
• 6 Billion Events per Day

5. Understanding Merchant Churn: Compliance Use Case for CLAC
• Story: Increase in compliance limitations for CLAC in 2017. Regulations mandate merchants to complete compliance verification, applicable to merchants exceeding $$ in a 12-month period; non-compliance might lead to the merchant's account being suspended.
• Triggers: Merchant not aware of the limitation; merchant did not understand how to resolve the limitation.
• Impact: High impact for small merchants; $M in payments.
• Insights: Biggest churn driver for CLAC in 2017.

6. Churn Recovery Efforts: Existing pipeline
Flow: New Merchants → Merchants Get Limitation → Merchants Churn → Account Manager → Relaunched Merchants
• Limited success
• Reactive process
• Account managers reach out to merchants who have already churned
• Reversing the limitation and relaunching merchants takes time
• Large set of merchants for reach-outs

7. Churn Recovery Efforts: Enhanced pipeline
Flow: New Merchants → ML Model (Predict Revenue and Timelines) → Merchants Likely to Get Limitation → Account Manager → Merchants Complete Regulation
• Proactive process
• Use the machine learning pipeline to predict time to reach $$
• Reach out to merchants before the limitation is reached
• Mitigate restriction and churn

8. ML Platform
• Data: Channel Metadata (Segment, Geo, Capacity, Priority, Channel, etc.), Channel Data, Feedback Data (Optimization & Learnings)
• Models: Model 1, Model 2, … Model N
• Integration: Salesforce Alerts, Salesforce SSO, E-mail, Performance Tracking

9. So where do we start from?

10. Learning to do Machine Learning
• Explore data (we're here): Let's analyze what kind of data we have
• Stop churn: We're done & merchants are happy!

11. Select Training Data: Ask questions
• What datasets do we use for training the model?
• Which merchants should we use to train the model?
• Should we consider inflation and currency conversion?
• Should we focus only on initial transactions?
• What data is relevant for new merchants?

12. Data: Analyze our datasets
• Merchants: Account, Currency, Country, Industry, PayPal products
• Consumers: Identity, Demographics, Consumer spending, Low/mid/high shopper, Country, Cross border
• Payments: Transaction amount, Frequency, Payment attempts, Successful transactions, Transaction type
• Activity: Visits data, New users, Repeat users

13. Data Transformation Strategies: Transforming data into features
• Raw features: Merchant profile
• Binning: Transaction & revenue data
• Trendlines: Weekly trends in transactions
• Binarization: Payment methods / cross border (see the sketch below)
• Seasonality: Tune weights for transaction data
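The binning and binarization strategies map naturally onto Spark ML transformers. A minimal sketch, not the production code: the column names (`weekly_revenue`, `cross_border_txn_count`), the parquet path, and the split points are all assumptions for illustration.

```scala
import org.apache.spark.ml.feature.{Binarizer, Bucketizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("MerchantChurnFeatures").getOrCreate()
val merchants = spark.read.parquet("/path/to/merchant_features") // hypothetical path

// Binning: bucket weekly revenue into coarse ranges (split points are illustrative).
val revenueBuckets = new Bucketizer()
  .setInputCol("weekly_revenue")
  .setOutputCol("weekly_revenue_bucket")
  .setSplits(Array(Double.NegativeInfinity, 0.0, 100.0, 1000.0, 10000.0, Double.PositiveInfinity))

// Binarization: collapse a cross-border transaction count into a 0/1 flag.
val crossBorderFlag = new Binarizer()
  .setInputCol("cross_border_txn_double")
  .setOutputCol("has_cross_border")
  .setThreshold(0.0)

val withBuckets = revenueBuckets.transform(
  merchants.withColumn("cross_border_txn_double", col("cross_border_txn_count").cast("double")))
val transformed = crossBorderFlag.transform(withBuckets)
```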

14. Learning to do Machine Learning
• Explore data: Let's analyze what kind of data we have
• Data prep (we're here): Let's prepare the data for machine learning
• Stop churn: We're done & merchants are happy!

15. Feature Engineering: Transforming data into features
• Multiple source stitching
• Indicator variables
• Normalization (see the sketch below)
• Feature selection
• Outlier removal
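Normalization is the only item in this list not expanded on a later slide. A minimal sketch of what it could look like in Spark ML, reusing the hypothetical `merchants` DataFrame and assuming illustrative numeric columns:

```scala
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

// Assemble a few hypothetical numeric columns into a vector, then standardize them.
val assembler = new VectorAssembler()
  .setInputCols(Array("weekly_revenue", "txn_count", "visit_count"))
  .setOutputCol("raw_features")

val scaler = new StandardScaler()
  .setInputCol("raw_features")
  .setOutputCol("scaled_features")
  .setWithMean(true)
  .setWithStd(true)

val assembled = assembler.transform(merchants)
val normalized = scaler.fit(assembled).transform(assembled)
```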

16. Data Preparation: Multiple source stitching
• Source 1: Coverage 30%, Source 2: Coverage 30%, Source 3: Coverage 20%
• Stitch attribute values based on accuracy
• Industry & sub-industry enrichment
• Enriched feature: Coverage 70%
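One way to express this stitching in Spark SQL is a precedence-ordered coalesce across the source columns. A sketch with hypothetical source column names; the accuracy ordering here is illustrative, not the one used in production:

```scala
import org.apache.spark.sql.functions.{coalesce, col}

// Pick the industry value from the most accurate source that has coverage for the merchant.
val enriched = merchants.withColumn(
  "industry_enriched",
  coalesce(col("industry_src1"), col("industry_src2"), col("industry_src3")))
```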

17. Data Preparation: Indicator variables
• Attribute X counted by Type 1 / Type 2 / Type 3 → Type 1 count, Type 2 count, Type 3 count (3 features)

18. Data Preparation: Indicator variables
• Attribute X with Type 1 / Type 2 / Type 3 counts → calculate the most active type → Most Active Type (1 feature)
• E.g. Gender, Monthly transaction count

19. Data Preparation: Indicator variables
• Attribute X → calculate buckets and assign bucket indicator → Attribute indicator
• E.g. Age, Income
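A sketch of the three indicator-variable patterns from slides 17–19 — per-type counts, a single most-active-type feature, and bucket indicators — assuming hypothetical `transactions` and `merchants` DataFrames and column names:

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.functions.{col, greatest, when}

// Slide 17: one count feature per transaction type (3 features).
val typeCounts = transactions
  .groupBy("merchant_id")
  .pivot("txn_type", Seq("type1", "type2", "type3"))
  .count()
  .na.fill(0)

// Slide 18: collapse the counts into a single "most active type" feature (1 feature).
val withMostActive = typeCounts.withColumn(
  "most_active_type",
  when(col("type1") >= greatest(col("type2"), col("type3")), "type1")
    .when(col("type2") >= col("type3"), "type2")
    .otherwise("type3"))

// Slide 19: bucket a continuous attribute (e.g. income) and keep the bucket id as the indicator.
val incomeBucketizer = new QuantileDiscretizer()
  .setInputCol("monthly_income")
  .setOutputCol("income_bucket")
  .setNumBuckets(5)
val withIncomeBucket = incomeBucketizer.fit(merchants).transform(merchants)
```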

20. Data Preparation: Hypothesis testing
• All features → Chi-square selector (p-value) → Top 30 features
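Spark ML ships a chi-square based selector that matches this step. A minimal sketch, assuming an assembled `features` vector column and a binary churn `label` column:

```scala
import org.apache.spark.ml.feature.ChiSqSelector

// Keep the 30 features with the strongest chi-square association with the churn label.
val selector = new ChiSqSelector()
  .setNumTopFeatures(30)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selected_features")
// Alternative: .setSelectorType("fpr").setFpr(0.05) selects by p-value threshold instead.

val reduced = selector.fit(trainingData).transform(trainingData)
```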

21. Data Preparation: Outliers
• Dormant merchants
• Restriction placed to not receive funds
• Account locked
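Outlier removal here amounts to filtering those merchant records out before training. A sketch with hypothetical status-flag columns:

```scala
import org.apache.spark.sql.functions.col

// Drop merchants whose records would skew the model: dormant accounts,
// accounts restricted from receiving funds, and locked accounts.
val cleanedMerchants = merchants.filter(
  !col("is_dormant") && !col("funds_restricted") && !col("is_account_locked"))
```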

22. Learning to do Machine Learning
• Explore data: Let's analyze what kind of data we have
• Data prep: Let's prepare the data for machine learning
• Model selection (we're here): Let's discuss the approach to decide the 'y' and choose a model
• Stop churn: We're done & merchants are happy!

23. Model selection: Choosing the right label
• Choosing the right 'y'
• Week / Quarter / Year → Classification
• No. of days → Regression

24. Model selection: Choosing the right model (Classification)
• Logistic Regression: Low accuracy
• Decision Tree: Accuracy improved; overfitting; add more categorical features
• Naïve Bayes: Accuracy improved; overfitting persisted
• Gradient-boosting tree: Accuracy improved; overfitting reduced; high time to train
• Random Forests: Accuracy improved; overfitting reduced; low time to train
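A minimal Random Forest classifier sketch matching the outcome of this comparison. The `features` and `label` column names are the same assumptions as in the earlier sketches, and the hyper-parameter values here are placeholders that get tuned properly on the following slides:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// Random Forest over an assembled "features" vector predicting the churn label.
val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setNumTrees(20)
  .setMaxDepth(10)
  .setImpurity("gini")

val rfModel = rf.fit(trainingData)
```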

25. Learning to do Machine Learning
• Explore data: Let's analyze what kind of data we have
• Data prep: Let's prepare the data for machine learning
• Model selection: Let's discuss the approach to decide the 'y' and choose a model
• Cross validation & hyperparameter tuning (we're here): Fine-tune and re-verify the model
• Stop churn: We're done & merchants are happy!

26. Hyper-parameter tuning and Cross validation: Hyper-parameter values for the Random Forest model
• Number of trees: 5, 10, 15, 20, 25
• Max bins: 5, 10, 20, 30
• Impurity: Gini, Entropy
• Max depth: 5, 10, 20, 30
• Feature subset strategy: auto
• Folds: 3
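This grid translates almost directly into a `ParamGridBuilder` plus `CrossValidator`. A sketch that mirrors the values above; the column names and the choice of ROC as the scoring metric are assumptions:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")

// Grid mirrors the hyper-parameter values listed on this slide.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(5, 10, 15, 20, 25))
  .addGrid(rf.maxBins, Array(5, 10, 20, 30))
  .addGrid(rf.impurity, Array("gini", "entropy"))
  .addGrid(rf.maxDepth, Array(5, 10, 20, 30))
  .addGrid(rf.featureSubsetStrategy, Array("auto"))
  .build()

// 3-fold cross validation, scored by area under the ROC curve.
val cv = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("label"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingData)
```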

27. Hyper-parameter tuning and Cross validation: How do we measure whether we have the right model?
• Accuracy
• AUC ROC
• Precision / Recall
• AUC PR
• F1
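Spark ML's evaluators cover these metrics. A sketch of scoring a held-out set with the cross-validated model from the previous sketch; `testData` is an assumed hold-out DataFrame with the same schema as the training set:

```scala
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, MulticlassClassificationEvaluator}

// Predictions carry "label", "prediction", and "rawPrediction" columns.
val predictions = cvModel.transform(testData)

val auROC = new BinaryClassificationEvaluator()
  .setLabelCol("label").setMetricName("areaUnderROC").evaluate(predictions)
val auPR = new BinaryClassificationEvaluator()
  .setLabelCol("label").setMetricName("areaUnderPR").evaluate(predictions)
val accuracy = new MulticlassClassificationEvaluator()
  .setLabelCol("label").setMetricName("accuracy").evaluate(predictions)
val f1 = new MulticlassClassificationEvaluator()
  .setLabelCol("label").setMetricName("f1").evaluate(predictions)
// Precision and recall are available as "weightedPrecision" and "weightedRecall".
```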

28. Hyper-parameter tuning and Cross validation: Model comparison
[Chart: Accuracy, auROC, and auPR compared across Logistic Regression, Decision Trees, Naïve Bayes, Gradient-boosting tree, and Random Forests]

29. Hyper-parameter tuning and Cross validation: Model comparison
[Chart: Accuracy, auROC, and auPR across Logistic Regression, Decision Trees, Naïve Bayes, Gradient-boosting tree, and Random Forests, with the best-F1 model highlighted]