Detecting Financial Fraud at Scale with Machine Learning

Detecting fraudulent patterns at scale is a challenge given the massive amounts of data to sift through, the complexity of constantly evolving fraud techniques, and the very small number of actual examples of fraudulent behavior. In finance, added security concerns and the importance of explaining how fraudulent behavior was identified further increase the difficulty of the task. Legacy systems rely on rule-based detection that is difficult to implement and run at scale, and the resulting code is complex and brittle, making it hard to update in response to new threats.

In this talk, we will go over how to convert a rule-based financial fraud detection program into a scalable, modular machine learning solution on Spark. We will examine how to identify appropriate features and labels and how to create a feedback loop that allows the model to evolve and improve over time. We will also look at how MLflow can be leveraged throughout this effort for experiment tracking and model deployment.

Specifically, we will discuss:
- How to create a fraud-detection data pipeline
- How to leverage a framework for building features from large datasets
- How to create modular code to reuse and maintain machine learning models
- How to choose appropriate models and algorithms for a given fraud-detection problem


1. WiFi SSID: Spark+AISummit | Password: UnifiedDataAnalytics

2. Detecting Financial Fraud at Scale with Machine Learning
Elena Boiarskaia, #UnifiedDataAnalytics #SparkAISummit

3. Overview
• Legacy fraud detection system
• Convert rule-based program to ML
• Improve model with active learning
• Unified solution on Spark
• Leverage Spark Pipelines and MLflow

4. Legacy System
• Siloed data sources / different platforms
• Deploy on legacy system
• Models built have to be translated

5. Rule-Based Model
• Established by domain experts
• Deployed in SQL or C/C++
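
The kind of expert-authored logic this slide refers to can be sketched in a few lines. The rules, thresholds, and transaction fields below are illustrative assumptions, not the rules from the talk; in a legacy system the same checks would live in SQL or C/C++.

```python
# Minimal sketch of a legacy rule-based fraud check. All rules, thresholds,
# and field names are hypothetical, chosen only to illustrate the pattern.

def rule_based_flags(txn: dict) -> list:
    """Return the names of every rule a transaction trips."""
    flags = []
    if txn["amount"] > 10_000:
        flags.append("large_amount")
    if txn["country"] != txn["home_country"]:
        flags.append("foreign_country")
    if txn["merchant_category"] in {"wire_transfer", "crypto_exchange"}:
        flags.append("high_risk_merchant")
    return flags

def is_fraud(txn: dict) -> bool:
    # Legacy systems commonly flag a transaction when several rules fire.
    return len(rule_based_flags(txn)) >= 2
```

The brittleness the abstract mentions shows up here: every new fraud pattern means hand-writing, testing, and redeploying another rule.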

6. Spark: Unified System
• Work on the same system
• Direct access to data
• Modeling at scale
• Ease of deployment

7. Machine Learning Conversion
• Use rules to train first ML model
  – Labels
  – Features
• Need easy-to-explain ML model
• Model will find new cases (false positives)
• Need feedback loop to update labels
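
The bootstrapping step above can be sketched as follows: each legacy rule becomes a binary feature, and the legacy verdict becomes the initial training label. The rules and transactions are made-up; in the talk's setting this would run on Spark, with the features feeding an interpretable model.

```python
# Bootstrapping ML training data from legacy rules (hypothetical rules/data).
# Each rule yields one binary feature; the legacy verdict is the first label,
# later refined by the feedback loop.

RULES = {
    "large_amount": lambda t: t["amount"] > 10_000,
    "foreign_country": lambda t: t["country"] != t["home_country"],
}

def featurize(txn):
    return [int(rule(txn)) for rule in RULES.values()]

def label(txn):
    # Initial label = legacy rule verdict (here: two or more rules fired).
    return int(sum(featurize(txn)) >= 2)

transactions = [
    {"amount": 20_000, "country": "FR", "home_country": "US"},
    {"amount": 12, "country": "US", "home_country": "US"},
]
X = [featurize(t) for t in transactions]
y = [label(t) for t in transactions]
```

A model trained on (X, y) starts out mimicking the rules, but unlike the rules it produces scores, which is what makes the review loop on the next slide possible.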

8. Model Feedback
• Many false positives
• Model learns new fraud behavior
  – Review uncertain cases
  – Review extreme misses
• Update labels for training
• Model will improve over time
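
One round of the review loop above can be sketched with uncertainty sampling: route the cases whose fraud score is closest to 0.5 to human reviewers. The scores below are invented for illustration.

```python
# Uncertainty sampling sketch for the feedback loop (hypothetical scores).
# Cases the model is least sure about (probability near 0.5) go to human
# review; the reviewed labels are then folded back into the training set.

def select_for_review(scores, k=2):
    """Return the indices of the k scores closest to 0.5."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))
    return ranked[:k]

scores = [0.02, 0.48, 0.97, 0.55, 0.10]
review_idx = select_for_review(scores)  # indices 1 and 3 are most uncertain
```

Extreme misses (confident predictions that reviewers overturn) would be added to the review queue the same way, giving the model examples of genuinely new fraud behavior.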

9. Selecting Appropriate Model
• How accurate should the model be?
• What metric is most important?
  – FP vs FN
• How interpretable should our model be?
  – GDPR
• Is there class imbalance?
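
The FP-vs-FN question above is why metric choice matters under class imbalance: with very little fraud, raw accuracy is misleading, so precision (false-positive cost) and recall (false-negative cost) are compared instead. The confusion counts below are illustrative, not from the talk.

```python
# Why accuracy misleads on imbalanced fraud data (counts are made up).
# With 50 fraud cases in 10,000 transactions, a model that flags nothing
# is 99.5% "accurate" yet catches zero fraud.

def precision(tp, fp):
    """Of the transactions we flagged, what share was actually fraud?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of the actual fraud, what share did we flag?"""
    return tp / (tp + fn)

tp, fp, fn, tn = 40, 160, 10, 9_790
p = precision(tp, fp)  # 0.2 -> reviewers see many false positives
r = recall(tp, fn)     # 0.8 -> most fraud is caught
```

Which of the two to optimize depends on the relative cost of a missed fraud versus a wrongly blocked customer, which is a business decision, not a modeling one.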

10. Model Pipeline
• Prepare data using Spark
• Spark ML feature transformers
• Sparkling Water model inside Spark ML pipeline
• Save and deploy Spark pipeline
• Track with MLflow
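
The modularity idea behind chaining Spark ML stages can be shown with a plain-Python analogue (stage names and transforms are hypothetical). In the actual solution, the stages would be Spark ML feature transformers followed by a Sparkling Water model, with the fitted pipeline saved and tracked via MLflow.

```python
# Plain-Python analogue of a staged ML pipeline (illustrative, not Spark).
# Each stage is a reusable row transform; composing them mirrors how a
# Spark ML Pipeline chains feature transformers before the model stage.
import math

class SimplePipeline:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = [stage(row) for row in rows]
        return rows

def add_amount_log(row):
    # Log-scale the amount, a common transform for heavy-tailed values.
    return {**row, "amount_log": math.log1p(row["amount"])}

def add_is_foreign(row):
    return {**row, "is_foreign": int(row["country"] != row["home_country"])}

pipeline = SimplePipeline([add_amount_log, add_is_foreign])
out = pipeline.transform([{"amount": 99, "country": "FR", "home_country": "US"}])
```

Because each stage is independent, a new feature or a swapped-in model changes one stage rather than the whole program, which is the maintainability win over monolithic rule code.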


12. Takeaways
• Collaborate on same platform with Spark
• ML has high accuracy on rule-based labels
• Use Sparkling Water for transparent models
• Feedback loop to update labels
• Active learning improves accuracy
• Keep up with new fraudulent behavior