Merchant Churn Prediction using Spark ML at PayPal
1. Merchant Churn Prediction using Spark ML at PayPal
2. Who are we?
• Data Engineers: ETL pipelines using Spark
• Like all great projects, we started from a hack!
• Data Engineering to Machine Learning
3. Agenda
1. Scale at PayPal
2. Understanding Merchant Churn
3. Machine Learning Workflow
4. Learnings
5. Spark ML
4. Scale at PayPal
• 200 Countries
• 25 Currencies
• 19 Million Merchants
• 237 Million Active Users
• 8 Billion Transactions per Year
• 6 Billion Events per Day
5. Understanding Merchant Churn: Compliance use case for CLAC
• Story: Increase in compliance limitations for CLAC in 2017. Regulations mandate merchants to complete compliance verification, applicable to merchants exceeding $$ in a 12-month period; this might lead to the merchant's account being suspended.
• Triggers: Merchant not aware of the limitation; merchant did not understand how to resolve the limitation.
• Impact: High impact for small merchants; $M in payments.
• Insights: Biggest churn driver for CLAC in 2017.
6. Churn Recovery Efforts: Existing pipeline
New Merchants → Merchants Get Limitation → Merchants Churn → Account Manager → Relaunched Merchants
• Limited success
• Reactive process
• Account managers reach out to merchants who have already churned
• Reversing the limitation and relaunching merchants takes time
• Large set of merchants for reach-outs
7. Churn Recovery Efforts: Enhanced pipeline
New Merchants → ML Model (Predict Revenue and Timelines) → Merchants Likely to Get Limitation → Account Manager → Merchants Complete Regulation
• Proactive process
• Use a machine learning pipeline to predict the time to reach $$
• Reach out to merchants before the limitation is reached
• Mitigate restriction and churn
8. ML Platform
Data → Models → Integration → Channel
• Data: Metadata (Segment, Geo, Capacity, Priority, Channel, etc.), Channel Data, Feedback Data (Optimization & Learnings)
• Models: Model 1, Model 2, … Model N; Performance Tracking
• Integration: Salesforce Integration
• Channel: Salesforce Alerts, SSO, E-mail
9. So where do we start from?
10. Learning to do Machine Learning (we're here: Explore data)
Explore data ("Let's analyze what kind of data we have") → … → Stop churn ("We're done & merchants are happy!")
11. Select Training Data: Ask questions
• What datasets do we use for training the model?
• Which merchants should we use to train the model?
• Should we consider inflation and currency conversion?
• Should we focus only on initial transactions?
• What data is relevant for new merchants?
12. Data: Analyze our datasets
• Merchants: Account, Currency, Country, Industry, PayPal products
• Consumers: Identity, Demographics, Consumer Spending, Low/mid/high shopper, Country, Cross border
• Payments: Transaction Amount, Frequency, Payment attempts, Successful transactions, Transaction Type
• Activity: Visits data, New users, Repeat users
13. Data Transformation Strategies: Transforming raw data into features
• Merchant Profile → Binning
• Transaction & Revenue Data → Trendlines (weekly trends in transactions)
• Payment Methods / Cross Border → Binarization
• Transaction Data → Seasonality (tune weights for transaction data)
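A minimal Spark ML (Scala) sketch of the binning and binarization strategies above. The column names, bucket splits, and the `rawFeatures` DataFrame are illustrative assumptions rather than code from the deck.

```scala
import org.apache.spark.ml.feature.{Binarizer, Bucketizer}

// Bin a continuous revenue column into coarse buckets (splits are illustrative).
val revenueBinner = new Bucketizer()
  .setInputCol("weekly_revenue")
  .setOutputCol("weekly_revenue_bucket")
  .setSplits(Array(Double.NegativeInfinity, 100.0, 1000.0, 10000.0, Double.PositiveInfinity))

// Turn a cross-border transaction count (stored as a double) into a 0/1 flag.
val crossBorderFlag = new Binarizer()
  .setInputCol("cross_border_txn_count")
  .setOutputCol("has_cross_border")
  .setThreshold(0.0)

val transformed = crossBorderFlag.transform(revenueBinner.transform(rawFeatures))
```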
14. Learning to do Machine Learning (we're here: Data Prep)
Explore data → Data Prep ("Let's prepare the data for machine learning") → … → Stop churn ("We're done & merchants are happy!")
15. Feature Engineering: Transforming data into features
• Multiple Source Stitching
• Indicator Variables
• Normalization
• Feature Selection
• Outlier Removal
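A sketch of the normalization item, assuming a `merchantFeatures` DataFrame with the hypothetical numeric columns below; the other items in the list are illustrated on the following slides.

```scala
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

// Assemble the numeric columns into a single vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("txn_amount", "txn_frequency", "visit_count"))
  .setOutputCol("numeric_features")

// Scale to zero mean and unit variance so large-valued features don't dominate.
val scaler = new StandardScaler()
  .setInputCol("numeric_features")
  .setOutputCol("scaled_features")
  .setWithMean(true)
  .setWithStd(true)

val assembled = assembler.transform(merchantFeatures)
val normalized = scaler.fit(assembled).transform(assembled)
```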
16. Data Preparation: Multiple source stitching
• Source 1: Coverage 20%; Source 2: Coverage 30%; Source 3: Coverage 30%
• Stitch attribute values based on accuracy
• Industry & sub-industry enrichment
• Enriched feature: Coverage 70%
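One way to express the stitching in Spark SQL, assuming three hypothetical source DataFrames that each carry a partial `industry` attribute keyed by `merchant_id`; `coalesce` takes the first non-null value in order of source accuracy.

```scala
import org.apache.spark.sql.functions.{coalesce, col}

// Left-join the partial sources, then take the first non-null industry value,
// ordered by how accurate each source is considered to be.
val stitched = merchants
  .join(source1.select(col("merchant_id"), col("industry").as("industry_s1")), Seq("merchant_id"), "left")
  .join(source2.select(col("merchant_id"), col("industry").as("industry_s2")), Seq("merchant_id"), "left")
  .join(source3.select(col("merchant_id"), col("industry").as("industry_s3")), Seq("merchant_id"), "left")
  .withColumn("industry_enriched", coalesce(col("industry_s1"), col("industry_s2"), col("industry_s3")))
```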
17. Data Preparation: Indicator variables
Attribute X takes Type 1 / Type 2 / Type 3; count per type → Type 1 count, Type 2 count, Type 3 count (3 features)
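A sketch of turning a typed attribute into per-type count features, assuming a hypothetical `events` DataFrame where `attribute_x` takes the values type1/type2/type3.

```scala
import org.apache.spark.sql.functions.{count, lit}

// One row per merchant and type becomes one count column per type (3 features).
val typeCounts = events
  .groupBy("merchant_id")
  .pivot("attribute_x", Seq("type1", "type2", "type3"))
  .agg(count(lit(1)))
  .na.fill(0)
```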
18. Data Preparation: Indicator variables
Attribute X type counts → calculate the most active type → Most Active Type (1 feature)
E.g. most active gender, monthly transaction count
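Continuing the sketch above, the three counts can be collapsed into a single categorical indicator (the most active type); the tie-breaking rule here is an assumption.

```scala
import org.apache.spark.sql.functions.{col, when}

// Keep only the type with the highest count; ties go to the earlier type.
val withMostActive = typeCounts.withColumn(
  "most_active_type",
  when(col("type1") >= col("type2") && col("type1") >= col("type3"), "type1")
    .when(col("type2") >= col("type3"), "type2")
    .otherwise("type3"))
```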
19. Data Preparation: Indicator variables
Attribute X → calculate buckets and assign a bucket indicator → Attribute indicator
E.g. Age, Income
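A sketch of the bucket-indicator idea using Spark ML's QuantileDiscretizer; the column names and bucket count are illustrative.

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer

// Assign each merchant an income bucket indicator based on quantile boundaries.
val incomeBuckets = new QuantileDiscretizer()
  .setInputCol("monthly_income")
  .setOutputCol("income_bucket")
  .setNumBuckets(5)

val bucketed = incomeBuckets.fit(merchantFeatures).transform(merchantFeatures)
```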
20. Data Preparation: Hypothesis testing
All features → Chi-square Selector (p-value) → Top 30 features
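The chi-square selection step maps directly onto Spark ML's ChiSqSelector; the column names and the `trainingData` DataFrame are assumptions.

```scala
import org.apache.spark.ml.feature.ChiSqSelector

// Keep the 30 features with the strongest chi-square association with the label.
// A p-value cutoff is also possible via .setSelectorType("fpr").setFpr(...).
val selector = new ChiSqSelector()
  .setNumTopFeatures(30)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selected_features")

val selected = selector.fit(trainingData).transform(trainingData)
```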
21. Data Preparation: Outliers
Treated as outliers: dormant merchants, merchants restricted from receiving funds, locked accounts
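A sketch of the outlier removal as simple DataFrame filters; the flag columns are hypothetical names for the three conditions on the slide.

```scala
import org.apache.spark.sql.functions.col

// Drop merchants that would distort the training set.
val cleaned = merchantFeatures
  .filter(col("is_dormant") === false)                // dormant merchants
  .filter(col("receive_funds_restricted") === false)  // restricted from receiving funds
  .filter(col("is_account_locked") === false)         // locked accounts
```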
22. Learning to do Machine Learning (we're here: Model selection)
Explore data → Data Prep → Model selection ("Let's discuss the approach to decide the 'y' and choose a model") → … → Stop churn ("We're done & merchants are happy!")
23. Model selection: Choosing the right label ('y')
• Classification: Week / Quarter / Year
• Regression: No. of days
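A sketch of how the two label choices could be derived, assuming a hypothetical `days_to_threshold` column; the 90-day window is only an example of one classification horizon.

```scala
import org.apache.spark.sql.functions.{col, when}

// Classification label: does the merchant reach the threshold within the window?
val classificationData = merchantFeatures
  .withColumn("label", when(col("days_to_threshold") <= 90, 1.0).otherwise(0.0))

// Regression label: predict the number of days directly.
val regressionData = merchantFeatures
  .withColumn("label", col("days_to_threshold").cast("double"))
```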
24. Model selection: Choosing the right model (classification)
• Logistic Regression: low accuracy
• Decision tree: accuracy improved; overfitting; add more categorical features
• Naïve Bayes: accuracy improved; overfitting persisted
• Gradient boosting tree: accuracy improved; overfitting reduced; high time to train
• Random forests: accuracy improved; overfitting reduced; low time to train
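The candidates on this slide all exist as Spark ML estimators, so they can be trained on the same feature and label columns and compared on one held-out split; a minimal sketch:

```scala
import org.apache.spark.ml.classification.{
  DecisionTreeClassifier, GBTClassifier, LogisticRegression,
  NaiveBayes, RandomForestClassifier}

// Each estimator reads the default "features" and "label" columns, so the same
// training and test DataFrames can be reused for every candidate.
val candidates = Seq(
  "logistic regression"    -> new LogisticRegression(),
  "decision tree"          -> new DecisionTreeClassifier(),
  "naive bayes"            -> new NaiveBayes(),
  "gradient-boosted trees" -> new GBTClassifier(),
  "random forest"          -> new RandomForestClassifier())
```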
25. Learning to do Machine Learning (we're here: Cross validation & hyperparameter tuning)
Explore data → Data Prep → Model selection → Cross validation & hyperparameter tuning ("Fine-tune and re-verify the model") → … → Stop churn ("We're done & merchants are happy!")
26. Hyper-parameter tuning and cross validation: Hyper-parameter values for the Random Forest model
• Number of trees: 5, 10, 15, 20, 25
• Max bins: 5, 10, 20, 30
• Impurity: Gini, Entropy
• Max depth: 5, 10, 20, 30
• Feature subset strategy: auto
• Folds: 3
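The list above maps directly onto a ParamGridBuilder plus CrossValidator in Spark ML; the feature and label column names and the `trainingData` DataFrame are assumptions.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val rf = new RandomForestClassifier()
  .setFeaturesCol("selected_features")
  .setLabelCol("label")
  .setFeatureSubsetStrategy("auto")

// Grid mirrors the hyper-parameter values listed above.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(5, 10, 15, 20, 25))
  .addGrid(rf.maxBins, Array(5, 10, 20, 30))
  .addGrid(rf.impurity, Array("gini", "entropy"))
  .addGrid(rf.maxDepth, Array(5, 10, 20, 30))
  .build()

// 3-fold cross validation over the grid, selecting by area under the ROC curve.
val cv = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(new BinaryClassificationEvaluator().setMetricName("areaUnderROC"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingData)
```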
27. Hyper-parameter tuning and cross validation: How do we measure that we have the right model?
Metrics: Accuracy, AUC ROC, AUC PR, Precision, Recall, F1
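Each of these metrics is available through Spark ML's evaluators; a sketch assuming the cross-validated model and a held-out `testData` DataFrame from the previous step.

```scala
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, MulticlassClassificationEvaluator}

val predictions = cvModel.transform(testData)

// Ranking metrics computed from the raw prediction scores.
val auROC = new BinaryClassificationEvaluator().setMetricName("areaUnderROC").evaluate(predictions)
val auPR  = new BinaryClassificationEvaluator().setMetricName("areaUnderPR").evaluate(predictions)

// Threshold-based metrics computed from the predicted labels.
val accuracy  = new MulticlassClassificationEvaluator().setMetricName("accuracy").evaluate(predictions)
val f1        = new MulticlassClassificationEvaluator().setMetricName("f1").evaluate(predictions)
val precision = new MulticlassClassificationEvaluator().setMetricName("weightedPrecision").evaluate(predictions)
val recall    = new MulticlassClassificationEvaluator().setMetricName("weightedRecall").evaluate(predictions)
```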
28. Hyper-parameter tuning and cross validation: Model comparison
[Chart: Accuracy, auROC, and auPR compared across Logistic Regression, Decision Trees, Naïve Bayes, Gradient-boosting tree, and Random Forests]
29. Hyper-parameter tuning and cross validation: Model comparison
[Chart: Accuracy, auROC, and auPR across Logistic Regression, Decision Trees, Naïve Bayes, Gradient-boosting tree, and Random Forests, with the best F1 highlighted]