Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, MLf

This talk describes migrating a large random forest classifier from scikit-learn to Spark’s MLlib. We cut training time from 2 days to 2 hours, reduced failed runs, and now track experiments better with MLflow.

Kount provides certainty in digital interactions like online credit card transactions. One of our scores uses a random forest classifier with 250 trees and 100,000 nodes per tree. We used scikit-learn to train on 60 million samples, each containing over 150 features. Training required more than 750 GB of memory, took 2 days, and was not robust to disruptions in our database or in training execution.

To migrate the workflow to Spark, we built a 6-node cluster with HDFS. This provides 1.35 TB of RAM and 484 cores. Using MLlib and parallelization, the training time for our random forests is now less than 2 hours. Training data stays in our production environment, which used to require a deploy cycle to move locally developed code onto our training server. The new implementation uses Jupyter notebooks for remote development with server-side execution. MLflow tracks all input parameters, code, and the git revision number, while the performance metrics and the model itself are retained as experiment artifacts.

The new workflow is robust to service disruption. Our training pipeline begins by pulling from a Vertica database. Originally, this pull used a single connection and took over 8 hours, with any problem forcing a restart. Using sqoop and multiple connections, we pull the data in 45 minutes. The old technique used volatile storage and required pulling the data again for each experiment. Now, we pull the data from Vertica one time and then reload it much faster from HDFS.

While a significant undertaking, moving to the Spark ecosystem converted an ad hoc and hands-on training process into a fully repeatable pipeline that meets regulatory and business goals for traceability and speed.

1. Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with ML, MLflow, and Jupyter Josh Johnston, Director of AI Science

2. Overview • Model lifecycle • Our fraud-detecting model • Initial method with database and scikit-learn • Improved method with HDFS and Spark • Robust model governance ©Kount Inc All Rights Reserved

3. Manage the model lifecycle Modeling • Configuration management • Performance (speed) • Accuracy • Validation Governance Questions • Which model are you using? • How did you train it? • How well does it work? After each answer: Why? Science is repeatable Source: Microsoft. (2017, October 19). What is the Team Data Science Process? Retrieved March 26, 2019.

4. Our fraud-detecting model

5. Kount protects digital innovations from… • Fraudulent Account Creation • Account Takeover Fraud • Transaction/Payment Fraud • Authentication Friction

6. Evaluate transactions for fraud • Substantial throughput • 30-100 transactions per second • Low latency • 250 ms end-to-end system latency • ~15 ms for machine learning features and model
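The throughput and latency figures above can be turned into a rough budget. The sketch below only does arithmetic on the quoted numbers; treating the 30-100 transactions per second as a rate sustained over a full day is an assumption made for illustration, not something the slide states.

```python
# Figures quoted on the slide.
END_TO_END_BUDGET_MS = 250
ML_BUDGET_MS = 15
TPS_RANGE = (30, 100)

SECONDS_PER_DAY = 24 * 60 * 60


def ml_share_of_budget(ml_ms, total_ms):
    """Fraction of the end-to-end latency budget spent on ML features and model."""
    return ml_ms / total_ms


def daily_volume(tps):
    """Transactions per day, ASSUMING the per-second rate is sustained all day."""
    return tps * SECONDS_PER_DAY


share = ml_share_of_budget(ML_BUDGET_MS, END_TO_END_BUDGET_MS)
low, high = (daily_volume(tps) for tps in TPS_RANGE)

print(f"ML share of latency budget: {share:.0%}")
print(f"Transactions per day: {low:,} to {high:,}")
```

Under that assumption, the machine learning step gets a small fraction of the total latency budget while the system handles millions of transactions per day.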

7. Evaluate transactions for fraud

9. Boost Technology™ Customer View Fraud Manager Feedback: • Reduced manual reviews by 20% • Reduced manual reviews by 200 hours/month • Reduced chargeback rate by 17% • Approve an extra ~3K transactions and $1.2M USD per month • Don’t hear complaints from fraud team about review queue anymore • Sleep better at night

10. Boost Technology™ Technical View Feature Engineering • 200 GB of precomputed data Model • Random forest • 250 trees • ~100k nodes per tree • ~1GB serialized representation Model Training • ~150 features • ~60M observations
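A back-of-the-envelope calculation gives a feel for the scale of these numbers. This is a sketch on the approximate slide figures; the 8 bytes per value (a dense float64 matrix) is an assumption for illustration and is not how the source says the data was stored.

```python
# Approximate figures from the slide.
TREES = 250
NODES_PER_TREE = 100_000
SERIALIZED_BYTES = 1 * 10**9      # "~1GB serialized representation"
FEATURES = 150
OBSERVATIONS = 60_000_000

# ASSUMPTION: every feature value is held as a dense 8-byte float.
BYTES_PER_VALUE = 8

total_nodes = TREES * NODES_PER_TREE
bytes_per_node = SERIALIZED_BYTES / total_nodes
dense_matrix_gb = OBSERVATIONS * FEATURES * BYTES_PER_VALUE / 10**9

print(f"Total nodes in the forest: {total_nodes:,}")
print(f"Serialized bytes per node: {bytes_per_node:.0f}")
print(f"Dense float64 training matrix: {dense_matrix_gb:.0f} GB")
```

Even before any copies or intermediate structures, the raw training matrix alone is tens of gigabytes under this assumption, which is consistent with training being memory-bound on a single machine.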

11. Initial training with database and scikit-learn

12. First approach gets to production [Diagram. Components: Analytics Database, Model Training Service, Network Storage, Logging. Steps: Fetch observations, Fetch lookups, Lookup compute time, Observation and Lookup flat files, Train Model (scikit-learn), Pickled Model. Durations shown: 16 hrs, 12 hrs, 8 hrs, 1 hr, 24 hrs. Total: 2.5 days. Memory: 400 GB RAM, 1 TB into swap.]

13. What works • Trains a high value model

14. What doesn’t work • Time-intensive • Errors force restarts since everything is held in memory (and swap) • Burdens production analytics database • Pickled model ties execution environment to training environment • Traceability provided by log files and manual documentation • Ad hoc experiments with little configuration control Governance Questions • Which model are you using? • How did you train it? • How well does it work? After each answer: Why?

15. Improved training with HDFS and Spark

16. Cluster for distributed computing • Dell hardware • 6 nodes • 484 vCores • 1.35 TB RAM • Cloudera Manager • Spark 2.4 • Mostly Python HDFS • Attached to 3 nodes • 171 TB usable space

17. Improved approach through cluster [Diagram. Components: Analytics Database, Spark Cluster, HDFS, Luigi, MLflow, Logging. Steps: sqoop data, Compute lookups, Perform lookups on Observation and Lookup data, Train Model (Spark ML), Zipped MLeap Model. Durations shown: 45 min, 8 hrs, 2 hrs. Total: <1/2 day.]
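The speedup from pulling over multiple connections comes from splitting one large query into non-overlapping key ranges that are fetched concurrently, which is the general idea behind Sqoop's parallel imports. The sketch below illustrates that idea with the standard library only. It is not Sqoop and does not use its API: `split_range`, `fetch_chunk`, and `parallel_pull` are hypothetical names, and `fetch_chunk` fakes the database by returning the keys themselves.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(lo, hi, parts):
    """Split the inclusive key range [lo, hi] into up to `parts` contiguous ranges."""
    total = hi - lo + 1
    base, extra = divmod(total, parts)
    ranges = []
    start = lo
    for i in range(parts):
        # The first `extra` ranges take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def fetch_chunk(bounds):
    """HYPOTHETICAL stand-in for one connection running a bounded query,
    e.g. SELECT ... WHERE id BETWEEN lo AND hi. Here it just returns the keys."""
    lo, hi = bounds
    return list(range(lo, hi + 1))


def parallel_pull(lo, hi, connections):
    """Fetch every key in [lo, hi] using one worker per range."""
    chunks = split_range(lo, hi, connections)
    with ThreadPoolExecutor(max_workers=connections) as pool:
        # pool.map preserves the order of `chunks`, so rows come back in key order.
        results = list(pool.map(fetch_chunk, chunks))
    return [row for chunk in results for row in chunk]


rows = parallel_pull(1, 10, 4)
print(split_range(1, 10, 4))
print(rows)
```

A side benefit of range splitting is that a failed range can be retried on its own instead of restarting the whole pull.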

18. Remote development with Jupyter • Most criticisms of notebooks are things you COULD do, not what you MUST do • Good development practices are independent of tools [Diagram: maturity from Research to Production. Labels: Version Control (git), Jupyter Notebook, Python Packages, PySpark Application, Automation.]

19. What works • Faster • Failures restart in the middle • Reduces burden on production analytics database • Redesign experiments without penalty • MLeap decouples evaluation environment from training environment
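"Failures restart in the middle" follows from persisting each stage's output and skipping any stage whose output already exists, which is how workflow tools such as Luigi decide what still needs to run. The sketch below shows that pattern with the standard library only. It is not Luigi's API: `run_stage`, `pipeline`, the stage names, and the toy data are all hypothetical, and a local temporary directory stands in for HDFS.

```python
import json
import tempfile
from pathlib import Path


def run_stage(name, func, workdir, log):
    """Run `func` unless its output file already exists; return the stage output."""
    target = Path(workdir) / f"{name}.json"
    if target.exists():
        log.append(f"skip {name}")
        return json.loads(target.read_text())
    result = func()
    # Write under a temporary name first so a crash never leaves a partial target.
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(result))
    tmp.replace(target)
    log.append(f"run {name}")
    return result


def pipeline(workdir, log, fail_in_train=False):
    """Three toy stages standing in for pull, lookups, and training."""
    obs = run_stage("pull", lambda: [1, 2, 3, 4], workdir, log)
    feats = run_stage("lookups", lambda: [x * 10 for x in obs], workdir, log)

    def train():
        if fail_in_train:
            raise RuntimeError("simulated training failure")
        return {"model_sum": sum(feats)}

    return run_stage("train", train, workdir, log)


workdir = tempfile.mkdtemp()
log = []
try:
    pipeline(workdir, log, fail_in_train=True)   # first attempt dies in training
except RuntimeError:
    log.append("failed train")
model = pipeline(workdir, log)                   # second attempt resumes at training

print(log)
print(model)
```

On the second attempt the pull and lookup stages are skipped because their outputs are already on disk, so only the failed stage is repeated.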

20. What still doesn’t work • Non-deterministic Spark ML behavior and errors • Spark pipelines rely on configurations that change based on input data

21. Tools and Processes for Model Governance

22. Tools and processes for governance Governance Questions • Which model are you using? • How did you train it? • How well does it work? After each answer: Why? Solution components • Data traceability • Experiment, configuration, and accuracy traceability

29. • Data pipelines with error handling • Repeatable and documented data transformations • Document parameters • Trace to code and data used • Record accuracy of selected and not selected models • Store final model and configurations as artifact Governance Questions • Which model are you using? • How did you train it? • How well does it work? After each answer: Why?
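The checklist above amounts to writing one self-describing record per training run. The talk uses MLflow for this; the sketch below does not use MLflow's API and only illustrates, with the standard library, what such a record might contain. `fingerprint` and `record_run` are hypothetical helpers, and the rows, git revision, and metric value are placeholders rather than real results.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(rows):
    """Stable SHA-256 over the training rows, so a run can be traced to its data."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys makes the hash independent of dictionary key order.
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def record_run(run_dir, params, git_revision, rows, metrics, selected):
    """Write parameters, code revision, data fingerprint, and metrics to run.json."""
    record = {
        "params": params,
        "git_revision": git_revision,
        "data_sha256": fingerprint(rows),
        "metrics": metrics,
        # Kept for rejected models too, so their accuracy stays on record.
        "selected": selected,
    }
    path = Path(run_dir) / "run.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path


run_dir = tempfile.mkdtemp()
rows = [{"amount": 12.5, "label": 0}, {"amount": 980.0, "label": 1}]  # placeholder data
path = record_run(
    run_dir,
    params={"num_trees": 250},
    git_revision="0000000",          # placeholder, not a real commit
    rows=rows,
    metrics={"auc": 0.0},            # placeholder, not a measured result
    selected=False,
)
loaded = json.loads(path.read_text())
print(sorted(loaded))
```

With a record like this stored next to the model artifact, the three governance questions can each be answered from a single file: which model, how it was trained, and how well it worked.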