SparkML: Easy ML Productization for Real-Time Bidding

dataxu bids on ads in real time on behalf of its customers at a rate of 3 million requests per second, and trains on past bids to optimize future bids. Our system trains thousands of advertiser-specific models and runs on multi-terabyte datasets. In this presentation we share the lessons learned from our transition to a fully automated Spark-based machine learning system, and how it has drastically reduced the time needed to get a research idea into production. We’ll also share how we:
- continually ship models to production
- train models in an unattended fashion with auto-tuning capabilities
- tune and overbook cluster resources for maximum performance
- ported our previous ML solution into Spark
- evaluate the performance of high-rate bidding models

2.SparkML: Easy ML Productization for Real-Time Bidding. Maximo Gurmendez, Javier Buquet. #UnifiedAnalytics #SparkAISummit

3.Boston company. Builds software for marketers to run effective programmatic marketing campaigns. Automated decisioning at the core.

4.Real Time Ad Bidding [Diagram: ad auction; bidder X, bidder Y; bids of $3, $2, $1]

5.dataxu: make marketing smarter through data science! [Diagram: event data (bids, wins, losses, attributions) -> ML System -> bidding models -> bids, e.g. $3]

6.Scale?
• 2 Petabytes Processed Daily
• 3 Million Bid Decisions Per Second
• Runs 24 x 7 on 5 Continents
• Thousands of ML Models Trained per Day

7.Goals of dataxu’s ML System
• Highly predictive
• Fast to bid (< 1 millisecond)
• Optimal use of training resources
• No downtime
• Always fresh models
• Unattended operation
• Easy to deploy new algorithms
• Self tuning
• Transparent

8.Nine years ago [Diagram: campaign events training data -> Custom Hadoop Jobs (single pass) -> models f(x) used at bid time for each campaign]

9.Four years ago: Can we use Spark?
• Does it use too much memory?
• Is it fast enough?
• Thread safe?
• Can we use its out-the-box ML algorithms?
• Spark models work well with our data?
• Is it expensive to train?

10.Problem #1: Data Partitioning. Beware of the fat reducers! 1 sample pass + 1 write pass.
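One possible reading of "1 sample pass + 1 write pass", as a minimal sketch rather than dataxu's actual job: a first pass samples the events to estimate how many rows each key has, and a second pass writes the data while spreading heavy keys over several partitions so that no single reducer becomes fat. The function names, the use of a campaign-like key, and the rows-per-partition target are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def estimate_key_counts(records, key_fn, sample_rate, seed=0):
    """Sample pass: estimate rows per key from a random sample (hypothetical helper)."""
    rng = random.Random(seed)
    sampled = Counter(key_fn(r) for r in records if rng.random() < sample_rate)
    # Scale the sampled counts back up to the full data size.
    return {key: int(count / sample_rate) for key, count in sampled.items()}


def plan_splits(estimated_counts, target_rows_per_partition):
    """Decide how many partitions each key needs (ceiling division, at least 1)."""
    return {
        key: max(1, -(-count // target_rows_per_partition))
        for key, count in estimated_counts.items()
    }


def write_pass(records, key_fn, splits):
    """Write pass: route each record to (key, salt), round-robin within a key."""
    seen = Counter()
    partitions = defaultdict(list)
    for record in records:
        key = key_fn(record)
        salt = seen[key] % splits.get(key, 1)  # keys missed by the sample get 1 partition
        seen[key] += 1
        partitions[(key, salt)].append(record)
    return dict(partitions)
```

With a skewed input, the heavy key ends up spread over several similarly sized partitions while small keys keep a single one; in a real job the sample rate would be well below 1.0, so the split plan is only approximate.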

11.Problem #2: Spark models not ready for a low latency bidding setting
Spark Model: a whole table of rows goes in, the same table with a prediction column comes out.
  Feature 1  Feature 2  ->  Feature 1  Feature 2  Prediction
  1          0              1          0          0.3
  1          1              1          1          0.7
  0          1              0          1          0.4
At bid time things are different… Model Needed: one row in, one prediction out.
  Feature 1  Feature 2  ->  Feature 1  Feature 2  Prediction
  1          0              1          0          0.3
Solution: Extended Spark with RowModels

12.Problem #2: Spark models not ready for a low latency bidding setting Solution: Extended Spark with RowModels
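To illustrate the idea behind a row-level model, here is a minimal sketch, assuming a logistic regression whose parameters were exported after training. It is not dataxu's RowModel API: the class name `RowLogisticModel` and its methods are hypothetical. The point it shows is that, at bid time, scoring is a plain in-memory function call on one feature row, with no distributed table involved.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RowLogisticModel:
    """Hypothetical row-level scorer built from already-trained parameters."""

    weights: Dict[str, float]
    intercept: float

    def predict_row(self, row: Dict[str, float]) -> float:
        # Missing features are treated as 0.0 (an assumption of this sketch).
        margin = self.intercept + sum(
            weight * row.get(name, 0.0) for name, weight in self.weights.items()
        )
        return 1.0 / (1.0 + math.exp(-margin))

    def predict_batch(self, rows: List[Dict[str, float]]) -> List[float]:
        # Batch scoring is just the row scorer applied repeatedly, so offline
        # evaluation and bid-time scoring share the same code path.
        return [self.predict_row(row) for row in rows]
```

Because the object is immutable and holds no shared mutable state, one instance can be called from many bidding threads, which speaks to the "thread safe?" question raised earlier in the talk.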

13.Problem #3: Categorical Features Encoding Slow
Spark typical: one StringIndexer per column, applied one after another.
  F1 F2        StringIndexer      F1 F2 IX1        StringIndexer      F1 F2 IX1 IX2
  A  X     ->                 ->  A  X  0      ->                 ->  A  X  0   1
  A  Y                            A  Y  0                             A  Y  0   0
  B  Y                            B  Y  1                             B  Y  1   0
Instead: a single MultiTopK step indexes all columns at once.
  F1 F2        MultiTopK          F1 F2 IX1 IX2
  A  X     ->                 ->  A  X  0   1
  A  Y                            A  Y  0   0
  B  Y                            B  Y  1   0
Based on Metwally, Agrawal, and El Abbadi (Efficient computation of frequent and top-k elements in data streams).
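The paper cited on this slide describes the Space-Saving algorithm, which tracks approximate top-k items in a stream with a fixed number of counters. Below is a minimal sketch of how a MultiTopK-style step could use it to index several categorical columns in one pass over the data. The function names and the "unseen values share one extra index" rule are assumptions of this sketch, not the talk's implementation.

```python
from typing import Dict, Hashable, Iterable, Mapping, Sequence


def space_saving_update(counters: Dict[Hashable, int], item: Hashable, capacity: int) -> None:
    """One Space-Saving step: keep at most `capacity` counters per column."""
    if item in counters:
        counters[item] += 1
    elif len(counters) < capacity:
        counters[item] = 1
    else:
        # Evict the item with the smallest count; the newcomer inherits count + 1.
        # (A linear scan for simplicity; the paper uses a faster summary structure.)
        victim = min(counters, key=counters.get)
        counters[item] = counters.pop(victim) + 1


def multi_top_k(rows: Iterable[Mapping[str, Hashable]], columns: Sequence[str], capacity: int):
    """Single pass over the rows, updating one counter set per column."""
    counters = {column: {} for column in columns}
    for row in rows:
        for column in columns:
            space_saving_update(counters[column], row[column], capacity)
    # Most frequent value gets index 0; ties are broken by the value's string form.
    return {
        column: {
            value: index
            for index, (value, _) in enumerate(
                sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
            )
        }
        for column, counts in counters.items()
    }


def encode(row, indexes):
    """Map each categorical value to its index; unseen values share one extra index."""
    return {column: index.get(row[column], len(index)) for column, index in indexes.items()}
```

On the slide's three-row example (A X, A Y, B Y) this yields the same indexes as the chained StringIndexer picture, since both assign index 0 to the most frequent value in each column.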

14.Problem #4: Expensive to train
We were running one campaign at a time…
Observations:
• Some campaigns took hours, some a few minutes
• Some parts of training were IO bound, some CPU bound
• We observed cluster idleness between jobs
Solutions:
• Launch smart batches of jobs in parallel
• Carefully overbook the cluster resources instead of using “maxResourceAllocation”
Result: 60% cheaper than the legacy 1-pass Hadoop method!
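The talk does not say how the "smart batches" were formed. One simple way to balance long and short campaigns across parallel lanes is greedy longest-job-first assignment, sketched below under the assumption that a rough duration estimate per campaign is available (for example from the previous day's run). The function name, the lane concept, and the estimates are hypothetical.

```python
import heapq
from typing import Dict, List


def plan_batches(estimated_minutes: Dict[str, float], lanes: int) -> List[List[str]]:
    """Assign campaigns to parallel lanes, longest first, always to the least-loaded lane."""
    heap = [(0.0, lane) for lane in range(lanes)]  # (current load, lane id)
    heapq.heapify(heap)
    plan: List[List[str]] = [[] for _ in range(lanes)]
    ordered = sorted(estimated_minutes.items(), key=lambda kv: (-kv[1], kv[0]))
    for campaign, minutes in ordered:
        load, lane = heapq.heappop(heap)
        plan[lane].append(campaign)
        heapq.heappush(heap, (load + minutes, lane))
    return plan
```

Choosing more lanes than the cluster can nominally hold at once is one way to express overbooking: IO-bound and CPU-bound phases of different jobs then overlap instead of leaving the cluster idle between jobs.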

15.Problem #5: How to switch systems? Stage 1: Decorated Model [Diagram: the Active Bidding Model is decorated with the Spark Bidding Model; the Spark model is pulsed on that day; A/B tests]

16.Problem #5: How to switch systems? Stage 2: Selected Bidding Machine. Stage 3: Full Switch.

17.Problem #5: How to switch systems?
Everything went smoothly? Not exactly!
• Reached S3 request limits upon deploy!
  • Rolled back
  • Implemented retries
  • Random waits
  • Back-offs & jitter
• Latencies not exposed in simulations
  • Rolled back
  • Deeper profiling with YourKit
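The retry, back-off, and jitter measures listed on this slide can be sketched as follows. This is a generic illustration of exponential back-off with "full jitter" (each wait is drawn uniformly between zero and an exponentially growing cap), not the code used at dataxu. The function name, default values, and injected `sleep` and `rng` parameters are assumptions of the sketch.

```python
import random
import time


def call_with_retries(fn, attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=None, retry_on=(Exception,)):
    """Call fn(); on failure wait a random, exponentially bounded time and try again."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: random wait in [0, min(cap, base * 2^attempt)] seconds,
            # so many machines deploying at once do not retry in lockstep.
            sleep(rng.uniform(0.0, min(cap, base * 2 ** attempt)))
```

A model download would then be wrapped as `call_with_retries(lambda: fetch(uri))`, where `fetch` stands for whatever storage client call is in use.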

18.What about self-tuning, unattended operations? [Diagram: model trainer, selector & calibrator, insights builder, and manifest builder exchange event data, bidding models, calibrations, insights, and manifests through a Blackboard (S3), alongside the bidding machines]
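A blackboard design can be sketched as independent components that only read and write named artifacts in a shared store, with no direct calls between them. In the minimal sketch below a dict stands in for S3. The component wiring (which builder reads which artifact) is an assumption made for illustration and is not taken from the slide, and `Component` and `run_until_quiet` are hypothetical names.

```python
from typing import Callable, Dict, List, NamedTuple


class Component(NamedTuple):
    name: str
    inputs: List[str]          # artifact names this component reads
    output: str                # artifact name this component writes
    run: Callable[..., object]


def run_until_quiet(blackboard: Dict[str, object], components: List[Component]) -> List[str]:
    """Fire any component whose inputs exist and whose output is missing, until nothing changes."""
    fired = []
    progress = True
    while progress:
        progress = False
        for component in components:
            ready = all(name in blackboard for name in component.inputs)
            if ready and component.output not in blackboard:
                args = [blackboard[name] for name in component.inputs]
                blackboard[component.output] = component.run(*args)
                fired.append(component.name)
                progress = True
    return fired
```

Because components are coupled only through artifact names, adding a new algorithm amounts to registering one more component that writes an artifact the downstream builders already understand.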

19.What about transparency?
{
  "model": {
    "partition": "Xm9ZgQEjav",
    "pipeline": "prospecting_random_forest",
    "uri": "s3://ml-bucket/../20180923.204250/"
  },
  "bid_modifiers": [
    {
      "name": "prospecting_random_forest",
      "parameters": { "profile": "quality_calibration" },
      "type": "calibration",
      "uri": "s3://.../calibration.cjson"
    },
    {
      "name": "insights-aware-bidding",
      "type": "insights-aware-bidding",
      "uri": "s3://insights/../261716353"
    }
  ]
}
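A manifest like the one on this slide makes it possible to answer "what is this campaign bidding with?" by reading a single document. The sketch below parses the slide's manifest (with its elided URIs kept exactly as shown) and lists the model pipeline followed by each bid modifier in order. The function `describe_manifest` is a hypothetical helper, not part of dataxu's tooling.

```python
import json

# The manifest from the slide; elided URI segments are left as they appear there.
MANIFEST = """
{
  "model": {
    "partition": "Xm9ZgQEjav",
    "pipeline": "prospecting_random_forest",
    "uri": "s3://ml-bucket/../20180923.204250/"
  },
  "bid_modifiers": [
    {
      "name": "prospecting_random_forest",
      "parameters": { "profile": "quality_calibration" },
      "type": "calibration",
      "uri": "s3://.../calibration.cjson"
    },
    {
      "name": "insights-aware-bidding",
      "type": "insights-aware-bidding",
      "uri": "s3://insights/../261716353"
    }
  ]
}
"""


def describe_manifest(text):
    """Return the model pipeline and each bid modifier as 'type:name' strings, in order."""
    doc = json.loads(text)
    steps = ["model:" + doc["model"]["pipeline"]]
    steps += [m["type"] + ":" + m["name"] for m in doc.get("bid_modifiers", [])]
    return steps
```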

20.Easy to add new algorithms? Took 2 days to port a standard Spark ML pipeline for a customer into production, thanks to the blackboard design.

21.DEMO

22.Outcomes
Benefits:
• Greater flexibility to adapt to new use cases
• Better overall performance
• Better reliability and upgrade path
• 50% less code
• 60% savings
Lessons:
• Spark can be used for serious production systems
• Some tweaks are needed but still have the benefits of the 3rd Party ML libraries
• There’s no test like a full live test!
• Gradual switchover, pulsing and vigilance protected our business from harm.