Lessons Learned Replatforming A Large Machine Learning Application to Spark

Morningstar’s Risk Model project is built by stitching together statistical and machine learning models to produce risk and performance metrics for millions of financial securities. Previously, we ran a single version of this application, but needed to expand it to allow for customizations based on client demand. With the goal of running hundreds of custom Risk Model runs at once, each producing around 1TB of output, we had a challenging technical problem on our hands! In this presentation, we’ll talk about the challenges we faced replatforming this application to Spark, how we solved them, and the benefits we saw.

Some things we’ll touch on include how we created customized models, the architecture of our machine learning application, how we maintain an audit trail of data transformations (for rigorous third party audits), and how we validate the input data our model takes in and output data our model produces. We want the attendees to walk away with some key ideas of what worked for us when productizing a large scale machine learning platform.



2. Lessons Learned Replatforming A Large ML Application
Patrick Caldon – Director of Quant Research
Taylor Hess – Lead Quant Analyst
Morningstar Inc.
#UnifiedDataAnalytics #SparkAISummit

3. Roadmap
• Our Model
• Lessons Learned
  1. Just Get A Bigger Cluster
  2. End to End Models
  3. Make It Easy To Iterate
  4. Focus On Local Runs

4. Our Model

5. Why Finance Models Are Different
1. Hard to validate
2. Probabilistic outputs
3. More collaborative
4. Heavy compliance issues – models and data need versioning

6. Large ML models are even more difficult…
• Software installations can be difficult
• Data can’t fit on a single machine
• Desktop/laptop not powerful enough

7. What Are We Building?
• Financial terms: Risk Factor Model
• ML terms: cross-sectional regression + more
• Time series of each coefficient
• Forecasted return distributions
• Covariance estimates
[Example – daily factors for Apple Inc. | 2019-01-01: Momentum = 0.3, Size = 2.1, Data Health = 1.5]
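The core estimation step named on this slide – a cross-sectional regression – can be sketched in a few lines. This is an illustrative numpy-only sketch, not Morningstar's implementation: for a single date, security returns are regressed on that date's factor exposures to estimate factor returns; the securities, factor names, and numbers are made up.

```python
import numpy as np

def cross_sectional_fit(exposures, returns):
    """One date's cross-sectional regression.

    exposures: (n_securities, n_factors) matrix of factor exposures
    returns:   (n_securities,) vector of security returns
    Returns the estimated factor returns for that date.
    """
    coef, *_ = np.linalg.lstsq(exposures, returns, rcond=None)
    return coef

# Toy cross-section: four securities, two factors (size, momentum).
X = np.array([[2.1,  0.4],
              [2.0,  0.3],
              [1.6, -0.1],
              [1.8,  0.7]])
true_factor_returns = np.array([0.01, 0.02])
y = X @ true_factor_returns  # returns generated from known factor returns
print(cross_sectional_fit(X, y))  # recovers ~[0.01, 0.02]
```

Repeating this fit for every date yields the "time series of each coefficient" the slide mentions.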

8. What Are We Building?
• Essentially, we take features of financial securities and estimate distributions of future returns
• We make millions of these estimates
• Try to understand how stock returns move together
• The feature engineering work has been studied extensively in academic financial research (it is utilized by quant hedge funds to invest as well)
  – Some features are simple, some are complex

9. What Are We Building?
• Inputs (~500GB)
  • Return data
  • Security information (region, sector, etc.)
  • Financial information
  • Portfolio information
• Outputs (~500GB each run)
  • Security and portfolio exposures (daily for each security/portfolio)
  • Security and portfolio forecasted distributions (daily for each security/portfolio)

Example output:
Security   Date       Size  Momentum  …
Apple      10/1/2018  2.1    0.4      …
Google     10/1/2018  2.0    0.3      …
Barclays   10/1/2018  1.6   -0.1      …
BP         10/1/2018  1.8    0.7      …

10. Risk Model 1.0
Years of research and development to come up with our proprietary model
• Equity-only model (~40,000 securities)
• Single server relying on a database for many calculations
• 10 hours to run each day
• Producing ~10M datapoints daily
[Diagram: on-prem server + data warehouse]

11. Rethinking Our Approach
• Long time to regenerate
• Hard to expand code
• Validation is arduous
• New model creation painful

12. New Architecture
[Architecture diagram: Airflow, Amazon S3, Amazon EMR (Spark), Amazon Athena, Amazon Route 53, Amazon Fargate, Amazon RDS, Morningstar API]

13. Risk Model 2.0

                          Old Model              New Model
Datapoints per model run  10 million             25 billion
Securities                40,000 (equity only)   1,000,000 (equity + fixed income)
Models at a time          1                      10+
Time to refresh all data  Months                 Hours
Validation data           Hard to get            Automated

4000x faster (full rebuild) · 5,000x output data (each model run) · 50x parallel models

14. Four Lessons

15. Lesson 1: Just Get A Bigger Cluster

16. What is it?
• Get larger servers and more of them – then trim down later

17. Why?
It’s easy to do – and we should do the easy things first.
1. 2x larger >> 2x faster in many cases (so it’s cost effective)
2. Good joins can’t make up for a poorly sized cluster (sometimes)

18. Some Reasons To Scale
• I/O
• Caching
• Parallelization

19. I/O
• Too small a cluster will cause a spill to disk
• Writing to and reading from disk are slooooooow
• Monitor the Spark UI for spills to disk and add more RAM
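The "add more RAM" step above usually means resizing executors at submission time. The sketch below uses real Spark configuration keys, but the application name and sizes are hypothetical and assume the job is submitted to a cluster – tune them against the spill metrics you see in the Spark UI rather than copying them.

```python
from pyspark.sql import SparkSession

# Hypothetical sizing for a cluster-submitted job, not a recommendation.
spark = (
    SparkSession.builder
    .appName("risk-model")
    # More executor memory keeps shuffle and aggregation buffers in RAM.
    .config("spark.executor.memory", "32g")
    # Headroom for off-heap/overhead allocations on the same executors.
    .config("spark.executor.memoryOverhead", "4g")
    # More (smaller) shuffle partitions also lower per-task memory pressure.
    .config("spark.sql.shuffle.partitions", "2000")
    .getOrCreate()
)
```

If a stage still shows "Spill (Disk)" in the UI after this, the cluster is undersized for the shuffle – which is exactly the "just get a bigger cluster" situation the talk describes.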

20. Caching
• If you use large datasets 2+ times, cache them
• Caching requires lots of RAM
• Partial caching is not good enough

21. Parallelization
• If data skew is not a problem: 2x larger = 2x faster
• Make sure the cluster is fully utilized
  – Executor count / size
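One way to reason about "executor count / size" is a back-of-the-envelope layout calculation. This helper is hypothetical (not from the talk); it applies the common rule of thumb of roughly 5 cores per executor, reserving a core and some memory per node for the OS and cluster daemons:

```python
def executor_layout(nodes, cores_per_node, mem_per_node_gb,
                    cores_per_executor=5, reserved_cores=1, reserved_mem_gb=8):
    """Rough executor sizing for a homogeneous cluster.

    Returns (total_executors, cores_per_executor, memory_per_executor_gb).
    """
    usable_cores = cores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    total_executors = nodes * executors_per_node
    mem_per_executor = ((mem_per_node_gb - reserved_mem_gb)
                        // max(executors_per_node, 1))
    return total_executors, cores_per_executor, mem_per_executor

# 10 nodes, 16 cores and 128GB each:
print(executor_layout(10, 16, 128))  # (30, 5, 40)
```

With a layout like this, checking "is the cluster fully utilized" reduces to comparing running tasks in the Spark UI against total executor cores.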

22. Lesson 2: Build Models End To End

23. What is it?
• Ability to rerun models historically
• Always keep source data intact
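One simple way to implement "always keep source data intact" is to land raw data at immutable, date-stamped paths, so any historical run can be reproduced from exactly the inputs it originally saw. The bucket name and path layout below are hypothetical, shown only to illustrate the idea:

```python
from datetime import date

def raw_landing_path(root: str, source: str, as_of: date) -> str:
    """Immutable landing location: one path per source per as-of date.

    Writers only ever create new as_of partitions and never overwrite
    old ones, so rerunning a model for a past date reads unchanged data.
    """
    return f"{root}/{source}/as_of={as_of.isoformat()}"

print(raw_landing_path("s3://risk-raw", "returns", date(2018, 10, 1)))
# s3://risk-raw/returns/as_of=2018-10-01
```

A historical rerun then just points the preparation step at the partition for the date in question instead of the latest one.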

24. Why?
• Distribution of data can shift over time – was your model stable?
• Necessary in projects with a time component
• Bugfixes are quicker
• Makes it easy to tweak preprocessing steps

25. Model Deployment
• Liebig's law of the minimum (~1850) – plant growth is governed by the least available nutrient (cf. Amdahl's law).
• In any software-based environment, an end-to-end system test is a necessity. The test will likely find bugs. So the slowest process in our development governs release and bugfix speed.
• Liebig's law of the minimum (rephrased) – model development speed is governed by the slowest part of the development/deployment environment.
• Conclusion – if you have a multi-day process to rebuild a model, you're at risk of this process governing the release and bugfix cadence.

26. Latest Model
[Diagram: a Raw Storage → Prepared Data → Model → Output pipeline, repeated across time]

27. Full Latest Model
[Diagram: a Raw Storage → Prepared Data → Model → Output pipeline, repeated across time]

28. End To End Model
[Diagram: a Raw Storage → Prepared Data → Model → Output pipeline, repeated across time]

29. Lesson 3: Make It Easy To Iterate