Moving a Fraud-Fighting Random Forest from scikit-learn to Spark with MLlib, MLf

Spark开源社区 (Spark Open Source Community) / 8,293 views
This talk describes migrating a large random forest classifier from scikit-learn to Spark's MLlib. We cut training time from 2 days to 2 hours, reduced failed runs, and now track experiments better with MLflow.

Kount provides certainty in digital interactions such as online credit card transactions. One of our scores uses a random forest classifier with 250 trees and 100,000 nodes per tree. We used scikit-learn to train it on 60 million samples, each containing over 150 features. Training required more than 750 GB of memory, took 2 days, and was not robust to disruptions in our database or in training execution.

To migrate the workflow to Spark, we built a 6-node cluster with HDFS, which provides 1.35 TB of RAM and 484 cores. Using MLlib and parallelization, the training time for our random forests is now less than 2 hours. Because training data stays in our production environment, moving locally developed code onto our training server used to require a deploy cycle. The new implementation uses Jupyter notebooks for remote development with server-side execution. MLflow tracks all input parameters, the code, and the git revision number, while the performance metrics and the model itself are retained as experiment artifacts.

The new workflow is robust to service disruption. Our training pipeline begins by pulling data from a Vertica database. Originally, this pull used a single connection and took over 8 hours to complete, with any problem causing a restart. Using sqoop and multiple connections, we now pull the data in 45 minutes. The old technique used volatile storage, so the data had to be pulled again for each experiment. Now we pull the data from Vertica once and then reload it much faster from HDFS.

While a significant undertaking, moving to the Spark ecosystem converted an ad hoc, hands-on training process into a fully repeatable pipeline that meets regulatory and business goals for traceability and speed.
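As a rough illustration of the training and tracking steps described above, the following is a minimal sketch in PySpark with MLflow, not the speakers' actual code. Only the tree count (250) comes from the abstract; the function name `train_and_log`, the `maxDepth` and `seed` values, the Parquet data path, the column names, and the 80/20 split are all assumptions made for the example. It also assumes the label column is numeric and binary.

```python
# Hypothetical hyperparameters: only numTrees=250 is taken from the abstract.
RF_PARAMS = {"numTrees": 250, "maxDepth": 20, "seed": 42}


def train_and_log(spark, data_path, feature_cols, label_col="label"):
    """Train an MLlib random forest on Parquet data and record the run in MLflow.

    `spark` is an existing SparkSession; `data_path` is an assumed HDFS
    location holding data already pulled from the database.
    """
    # Imported lazily so this module loads even where Spark/MLflow are absent.
    import mlflow
    import mlflow.spark
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler

    # Reload the training data from HDFS instead of re-querying the database.
    df = spark.read.parquet(data_path)
    train_df, test_df = df.randomSplit([0.8, 0.2], seed=RF_PARAMS["seed"])

    # MLlib expects all features packed into a single vector column.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    forest = RandomForestClassifier(
        featuresCol="features", labelCol=label_col, **RF_PARAMS
    )
    pipeline = Pipeline(stages=[assembler, forest])

    with mlflow.start_run():
        # Record the inputs so the run can be reproduced later.
        mlflow.log_params(RF_PARAMS)
        mlflow.log_param("data_path", data_path)

        model = pipeline.fit(train_df)

        # Record performance and keep the fitted model as a run artifact.
        evaluator = BinaryClassificationEvaluator(
            labelCol=label_col, metricName="areaUnderROC"
        )
        auc = evaluator.evaluate(model.transform(test_df))
        mlflow.log_metric("test_auc", auc)
        mlflow.spark.log_model(model, "model")

    return auc
```

In a notebook attached to the cluster, this could be called as `train_and_log(spark, "hdfs:///path/to/training_data", feature_cols)`, where the path and the feature list are placeholders. MLlib has no direct "nodes per tree" setting, so tree size would be controlled indirectly through parameters such as `maxDepth` and `minInstancesPerNode`.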