Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, MLf
1. Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with ML, MLflow, and Jupyter
Josh Johnston, Director of AI Science
josh.johnston@kount.com
2. Overview
• Model lifecycle
• Our fraud-detecting model
• Initial method with database and scikit-learn
• Improved method with HDFS and Spark
• Robust model governance
©Kount Inc All Rights Reserved
3. Manage the model lifecycle
Modeling
• Configuration management
• Performance (speed)
• Accuracy
• Validation
Governance Questions
• Which model are you using?
• How did you train it?
• How well does it work?
After each answer: Why?
Science is repeatable
Microsoft. (2017, October 19). What is the Team Data Science Process? Retrieved March 26, 2019, from https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
4. Our fraud-detecting model
5. Kount protects digital innovations from…
• Fraudulent Account Creation
• Account Takeover Fraud
• Transaction/Payment Fraud
• Authentication Friction
6. Evaluate transactions for fraud
• Substantial throughput
  • 30-100 transactions per second
• Low latency
  • 250 ms end-to-end system latency
  • ~15 ms for machine learning features and model
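The throughput and latency figures on this slide imply a simple budget, worked through below as a sketch. Only the numbers come from the slide; the variable names are illustrative, and the in-flight estimate applies Little's law to average values, which is an added assumption rather than something the deck states.

```python
# Numbers taken from the slide; names are illustrative.
END_TO_END_BUDGET_MS = 250      # end-to-end system latency
ML_BUDGET_MS = 15               # approximate time for ML features and model
THROUGHPUT_TPS = (30, 100)      # transactions per second, low and high

# Fraction of the end-to-end latency budget consumed by machine learning.
ml_share = ML_BUDGET_MS / END_TO_END_BUDGET_MS

# Mean time between arrivals at each throughput level.
interarrival_ms = [1000 / tps for tps in THROUGHPUT_TPS]

# Little's law (assumption): average in-flight requests = arrival rate * latency.
in_flight_ml = [tps * ML_BUDGET_MS / 1000 for tps in THROUGHPUT_TPS]

print(f"ML share of latency budget: {ml_share:.0%}")
print(f"Inter-arrival time: {interarrival_ms[0]:.1f} ms down to {interarrival_ms[1]:.1f} ms")
print(f"Average transactions inside the ML step: {in_flight_ml[0]:.2f} to {in_flight_ml[1]:.2f}")
```

At the high end of the stated range, transactions arrive every 10 ms on average while the ML step takes about 15 ms, so on average more than one transaction is inside that step at a time.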
7. Evaluate transactions for fraud
9. Boost Technology™ Customer View
Fraud Manager Feedback:
• Reduced manual reviews by 20%
• Reduced manual reviews by 200 hours/month
• Reduced chargeback rate by 17%
• Approve an extra ~3K transactions and $1.2M USD per month
• Don’t hear complaints from fraud team about review queue anymore
• Sleep better at night
10. Boost Technology™ Technical View
Feature Engineering
• 200 GB of precomputed data
Model
• Random forest
• 250 trees
• ~100k nodes per tree
• ~1GB serialized representation
Model Training
• ~150 features
• ~60M observations
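The model and training figures above can be turned into rough size estimates. The sketch below is back-of-the-envelope arithmetic only: it treats "~1GB" as 10^9 bytes and assumes a dense numeric matrix for the training data, neither of which the slide specifies.

```python
# Approximate figures from the slide.
TREES = 250
NODES_PER_TREE = 100_000
SERIALIZED_BYTES = 10**9          # "~1GB", taken as decimal gigabytes (assumption)
FEATURES = 150
OBSERVATIONS = 60_000_000

# Total nodes in the forest and the implied serialized size per node.
total_nodes = TREES * NODES_PER_TREE
bytes_per_node = SERIALIZED_BYTES / total_nodes

# Size of the training matrix if held densely in memory (assumption).
cells = OBSERVATIONS * FEATURES
dense_gb_float64 = cells * 8 / 10**9
dense_gb_float32 = cells * 4 / 10**9

print(f"Total nodes: {total_nodes:,}")
print(f"Serialized bytes per node: {bytes_per_node:.0f}")
print(f"Dense training matrix: {dense_gb_float64:.0f} GB as float64, {dense_gb_float32:.0f} GB as float32")
```

Under these assumptions the forest has 25 million nodes at roughly 40 bytes each, and the raw feature matrix alone occupies tens of gigabytes before any training overhead.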
11. Initial training with database and scikit-learn
12. First approach gets to production
[Diagram: training pipeline on a time axis. Components: Analytics Database, Model Training Service, Network Storage, Logging. Steps and artifacts: Fetch observations, Lookup compute, Fetch lookups, Observation Lookup Flat File, Train Model (Scikit Learn), Pickled Model. Durations shown: 16 hrs, 12 hrs, 8 hrs, 1 hr, 24 hrs. Total: 2.5 days. Memory: 400GB RAM, 1TB into swap.]
13. What works
• Trains a high-value model
14. What doesn’t work
• Time-intensive
• Errors force restarts since everything is held in memory (and swap)
• Burdens production analytics database
• Pickled model ties execution environment to training environment
• Traceability provided by log files and manual documentation
• Ad hoc experiments with little configuration control
Governance Questions
• Which model are you using?
• How did you train it?
• How well does it work?
After each answer: Why?
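The point about pickled models can be illustrated with the standard library alone. A pickle stores a reference to a class by its import path plus the instance state, not the class's code, so whatever loads it must provide a compatible module. In this sketch `fractions.Fraction` is only a stand-in for an estimator class; the deck does not show how its model was pickled.

```python
import pickle
import pickletools
from fractions import Fraction

# Fraction stands in for a trained estimator object (assumption for illustration).
model_like = Fraction(3, 4)
payload = pickle.dumps(model_like)

# Collect the string arguments embedded in the pickle stream.
names = [
    str(arg)
    for opcode, arg, _pos in pickletools.genops(payload)
    if isinstance(arg, str)
]
print(names)

# Loading works only because the 'fractions' module is importable here.
restored = pickle.loads(payload)
print(restored == model_like)
```

The printed names include the module and class name, which is the part that couples the loading environment to the one that wrote the file.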
15. Improved training with HDFS and Spark
16. Cluster for distributed computing
• Dell hardware
• 6 nodes
• 484 vCores
• 1.35 TB RAM
• Cloudera Manager
• Spark 2.4
• Mostly Python
HDFS
• Attached to 3 nodes
• 171 TB usable space
17. Improved approach through cluster
[Diagram: training pipeline on a time axis. Components: Analytics Database, HDFS, Spark Cluster, Luigi, Logging, MLflow. Steps and artifacts: sqoop data, Compute lookups, Observation Lookup, Perform lookups, Train Model (Spark ML), Zipped MLeap Model. Durations shown: 45 min, 8 hrs, 2 hrs. Total: <1/2 day.]
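The durations printed on the two pipeline diagrams (slides 12 and 17) can be compared directly. The sketch below assumes the listed stages run one after another and that the printed durations cover the whole pipeline; the diagrams as transcribed do not confirm either point, so the ratio is indicative only.

```python
# Stage durations in hours, as printed on the slides.
first_approach_hours = [16, 12, 8, 1, 24]   # slide 12
improved_hours = [45 / 60, 8, 2]            # slide 17

# Totals under the sequential-stages assumption.
first_total = sum(first_approach_hours)
improved_total = sum(improved_hours)
speedup = first_total / improved_total

print(f"First approach: {first_total} hrs ({first_total / 24:.2f} days)")
print(f"Improved approach: {improved_total} hrs")
print(f"Ratio: {speedup:.1f}x")
```

The totals come to 61 hours and 10.75 hours, which agree with the "2.5 days" and "<1/2 day" labels on the respective slides.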
18. Remote development with Jupyter
• Most criticisms of notebooks are things you COULD do, not what you MUST do
• Good development practices are independent of tools
[Diagram: maturity axis from Research to Production, with labels Version Control (git), Jupyter Notebook, Python Packages, PySpark Application, Automation.]
19. What works
• Faster
• Failures restart in the middle
• Reduces burden on production analytics database
• Redesign experiments without penalty
• MLeap decouples evaluation environment from training environment
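"Failures restart in the middle" is the behavior a workflow tool such as Luigi (named in the slide 17 diagram) gives when each task declares an output and completed outputs are skipped on rerun. The sketch below imitates that idea in plain Python with marker files. It does not use Luigi's API, and the step names are hypothetical.

```python
import tempfile
from pathlib import Path

def run_pipeline(steps, workdir, log):
    """Run steps in order, skipping any whose marker file already exists."""
    for name, func in steps:
        target = workdir / f"{name}.done"
        if target.exists():
            log.append(f"skip {name}")
            continue
        func()
        target.write_text("ok")   # marker is written only after the step succeeds
        log.append(f"ran {name}")

def make_flaky_step():
    """Return a step that fails on its first call and succeeds afterwards."""
    state = {"calls": 0}
    def step():
        state["calls"] += 1
        if state["calls"] == 1:
            raise RuntimeError("simulated failure")
    return step

workdir = Path(tempfile.mkdtemp())
steps = [
    ("import_data", lambda: None),        # hypothetical step names
    ("compute_lookups", lambda: None),
    ("train_model", make_flaky_step()),
]

log = []
try:
    run_pipeline(steps, workdir, log)
except RuntimeError:
    log.append("failed train_model")
run_pipeline(steps, workdir, log)         # second attempt resumes at train_model
print(log)
```

On the second attempt the two completed steps are skipped and only the failed step runs again, which is the property that avoids restarting a multi-hour pipeline from the beginning.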
20. What still doesn’t work
• Non-deterministic Spark ML behavior and errors
• Spark pipelines rely on configurations that change based on input data
21. Tools and Processes for Model Governance
22. Tools and processes for governance
Governance Questions
• Which model are you using?
• How did you train it?
• How well does it work?
After each answer: Why?
Solution components
• Data traceability
• Experiment, configuration, and accuracy traceability
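One common way to get data traceability is to record a content hash of each training input, so a model can later be tied to the exact bytes it was trained on. The deck does not say how Kount implements this; the sketch below is a generic illustration with a hypothetical file name and toy contents.

```python
import hashlib
import tempfile
from pathlib import Path

def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Toy data file (hypothetical name and contents).
workdir = Path(tempfile.mkdtemp())
train_file = workdir / "observations.csv"

train_file.write_text("id,amount,label\n1,10.0,0\n2,999.0,1\n")
before = fingerprint(train_file)

# Changing a single label changes the fingerprint.
train_file.write_text("id,amount,label\n1,10.0,0\n2,999.0,0\n")
after = fingerprint(train_file)

print(before != after)
```

Storing the digest next to the trained model gives a concrete answer to "which data did you train on?" that does not depend on file names or timestamps.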
29.
• Data pipelines with error handling
• Repeatable and documented data transformations
• Document parameters
• Trace to code and data used
• Record accuracy of selected and not selected models
• Store final model and configurations as artifact
Governance Questions
• Which model are you using?
• How did you train it?
• How well does it work?
After each answer: Why?
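The practices listed above amount to writing down, for every training run, the parameters, the code and data lineage, and the score of every candidate model. MLflow appears in the deck's title and in the slide 17 diagram as the tool for this; the sketch below uses only the standard library to show the shape of such a record, so none of the names are MLflow's API. The parameter values echo slide 10, while the candidate names and AUC scores are placeholders, not results from the talk.

```python
import json
import tempfile
import time
from pathlib import Path

def record_run(run_dir, params, candidates, code_version, data_fingerprint):
    """Write one JSON record with parameters, lineage, and every candidate's score."""
    selected = max(candidates, key=lambda c: c["auc"])
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "code_version": code_version,          # e.g. a git commit hash
        "data_fingerprint": data_fingerprint,  # e.g. a hash of the training data
        "params": params,
        "selected": selected["name"],
        "candidates": candidates,              # rejected models are kept too
    }
    path = Path(run_dir) / "run.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path

run_dir = tempfile.mkdtemp()
path = record_run(
    run_dir,
    params={"num_trees": 250, "num_features": 150},
    candidates=[
        {"name": "forest_a", "auc": 0.91},     # placeholder scores
        {"name": "forest_b", "auc": 0.88},
    ],
    code_version="<git commit hash>",
    data_fingerprint="<sha256 of training data>",
)

loaded = json.loads(path.read_text())
print(loaded["selected"])
```

Keeping the rejected candidates in the same record is what allows the "Why?" follow-up to each governance question to be answered later.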