Advanced Hyperparameter Optimization for Deep Learning with MLflow

Building on the “Best Practices for Hyperparameter Tuning with MLflow” talk, we will present advanced topics in HPO for deep learning, including early stopping, multi-metric optimization, and robust optimization. We will then discuss implementations using open-source tools. Finally, we will show how MLflow can be leveraged with these tools and techniques to analyze the performance of our models.


2. Advanced HPO for Deep Learning
Maneesh Bhide, Databricks
#UnifiedAnalytics #SparkAISummit

3. Review: HPO Approaches
Grid search:
• PRO: can be run in one time-step
• CON: naive, computationally expensive, suffers from the curse of dimensionality, may alias over global optima
Random search:
• PRO: suffers less from the curse of dimensionality, can be run in one time-step
• CON: naive, no certainty about results, still computationally expensive
Population-based:
• PRO: implicit predictions, can be run in several time-steps, good at resolving many optima
• CON: computationally expensive, may converge to local optima
Bayesian:
• PRO: explicit predictions, computationally efficient
• CON: requires sequential observations
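The contrast between the first two approaches can be sketched in a few lines; the parameter names and values here are hypothetical, chosen only for illustration:

```python
import itertools
import random

# Hypothetical two-parameter space.
learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [32, 64, 128]

# Grid search: every combination, all trials known up front (one time-step).
grid_trials = list(itertools.product(learning_rates, batch_sizes))

# Random search: a fixed budget of independent draws, also one time-step.
rng = random.Random(0)
random_trials = [(rng.choice(learning_rates), rng.choice(batch_sizes))
                 for _ in range(5)]

# The grid grows multiplicatively with each added parameter (curse of
# dimensionality); the random budget stays whatever you set it to.
print(len(grid_trials), len(random_trials))  # 9 5
```

Adding a third parameter with three values triples the grid to 27 trials, while the random budget is unchanged — which is why random search degrades more gracefully in high dimensions.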

4. Review: Best Practices
• Tune the entire pipeline, not individual models
• How you phrase parameters matters!
  – Are categoricals really categorical?
    • [2, 4, 8, 16, 32] → integer param in {1, ..., 5}, model uses 2^param
  – Use transformations to your advantage
    • For learning_rate, instead of {0, 1} → param in {-10, 0}, model uses 10^param
• Don't restrict yourself to traditional hyperparameters
  – SGD flavor
  – Architecture
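The two reparameterizations above can be written as simple transforms (the helper names are ours, not from the talk):

```python
def batch_size(k: int) -> int:
    """Categorical [2, 4, 8, 16, 32] rephrased as the integer range
    k in {1, ..., 5}; the model receives 2**k."""
    return 2 ** k

def learning_rate(exp: float) -> float:
    """learning_rate in (0, 1] rephrased on a log scale: the optimizer
    searches exp in [-10, 0]; the model receives 10**exp."""
    return 10 ** exp
```

The optimizer now searches a small, evenly spaced range in both cases, so neighboring samples are equally informative everywhere in the space — which is exactly what adaptive methods assume.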

5. HPO for Neural Networks
• Can benefit from the compute efficiency of Bayesian optimization, as the parameter space can explode
  – Challenge of sequential training and long training time
• Optimize more than just hyperparameters
  – Challenge of parameters depending on other parameters
• Production models often have multiple criteria
  – Challenge of trading off between objectives

6. Agenda
• Challenge of sequential training and long training time
  – Early Termination
• Challenge of parameters depending on other parameters
  – Awkward/Conditional Spaces
• Challenge of trading off between objectives
  – Multimetric Optimization

7. How Early Termination Works
From the HyperBand paper:
1. Select an initial set of candidate configurations
2. Train the configurations for X_n epochs
3. Evaluate performance (preferably on the objective metric)
4. Use SuccessiveHalving (eliminate the worse half), then run the remaining configurations for an additional X_n epochs
5. Set X_{n+1} = 2 * X_n
6. Go to step 2
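The loop above can be sketched directly. This is a toy simplification: `train(config, epochs)` stands in for real model training and returns a score (higher is better), and each round retrains from scratch rather than resuming from a checkpoint as a real implementation would:

```python
def successive_halving(configs, train, min_epochs=1):
    """Sketch of steps 1-6: score every surviving config at the current
    budget, keep the better half, double the budget, repeat."""
    epochs = min_epochs
    while len(configs) > 1:
        # Steps 2-3: train and evaluate each surviving configuration.
        ranked = sorted(configs, key=lambda c: train(c, epochs), reverse=True)
        configs = ranked[: len(ranked) // 2]   # step 4: eliminate the worse half
        epochs *= 2                            # step 5: X_{n+1} = 2 * X_n
    return configs[0]
```

Starting from 128 configurations, this performs seven halvings (128 → 64 → … → 1), which is the schedule used in the cost walkthrough that follows.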


9. Assumptions
• Well-behaved learning curves
• Model performance: you don't need the best model, you need a good model faster


11. Scenario Walkthrough
• ResNet-50 on ImageNet
• 9 hyperparameters for HPO
• 128 configurations
• 1 p2.xlarge ($0.90/hour)
• 12 hours training time per configuration

12. Standard Training
12 hours × 128 configurations
Total compute time: 1536 hours
Total cost: $1382.40

13. With HyperBand

    % trained:    0.78%   1.56%   3.12%   6.25%   12.5%   25%   50%   100%
    Hours:        0.09    0.19    0.37    0.75    1.5     3     6     12
    Configs:      128     64      32      16      8       4     2     1
    Terminated:   64      32      16      8       4       2     1     –
    Hours used:   5.76    6.08    5.92    6       6       6     6     12

Total compute time: 53.76 hours
Total cost: $48.38
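The table's arithmetic can be reproduced in a few lines. Each configuration terminated at a stage has consumed the cumulative training time up to that point; the slide's 53.76 h total comes from rounding per-stage hours to two decimals, while exact arithmetic gives 54 h (the ~96.5% reduction holds either way):

```python
RATE = 0.90        # $/hour for a p2.xlarge (from the scenario)
FULL = 12.0        # hours for a complete training run
N = 128            # initial configurations

# Without early termination: every configuration trains to completion.
standard_hours = N * FULL

# With SuccessiveHalving: half are eliminated at each stage, each having
# consumed the cumulative training time up to that stage.
halving_hours = 0.0
configs, frac = N, 1 / N                 # start at 1/128 of full training
while configs > 1:
    eliminated = configs // 2
    halving_hours += eliminated * FULL * frac
    configs //= 2
    frac *= 2                            # budget doubles each stage
halving_hours += FULL                    # the lone survivor trains fully

print(standard_hours, round(standard_hours * RATE, 2))  # 1536.0 1382.4
print(halving_hours)                                    # 54.0
```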

14. Scenario Summary
Without early termination: 1536 hours
With early termination: 53.76 hours
96.5% reduction in compute (and cost!)

15. Bayesian + HyperBand
1. Articulate checkpoints
2. The optimizer selects an initial sample (bootstrapping)
3. Train for "checkpoint N" epochs
4. Evaluate performance (preferably on the objective metric)
5. Use a Bayesian method to select new candidates
6. Increment N
7. Go to step 3
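Structurally, the loop looks like the sketch below. This is not any particular library's API: `suggest(history, n)` stands in for the surrogate + acquisition step of a real Bayesian optimizer, and `train(config, epochs)` stands in for real training, returning the objective (higher is better). Real implementations also decide which surviving configurations to resume; here each round simply trains the freshly suggested batch at the next checkpoint:

```python
def bayes_early_termination(suggest, train, checkpoints, n_initial=8, n_new=4):
    """Structural sketch of steps 1-7 above."""
    history = []                                   # (config, epochs, score)
    configs = suggest(history, n_initial)          # step 2: bootstrap sample
    for epochs in checkpoints:                     # step 1: checkpoints fixed up front
        for c in configs:                          # steps 3-4: train and evaluate
            history.append((c, epochs, train(c, epochs)))
        configs = suggest(history, n_new)          # steps 5-7: select, advance
    return max(history, key=lambda rec: rec[2])    # best (config, epochs, score) seen
```

The key difference from plain HyperBand is step 5: instead of mechanically keeping the top half, the surrogate model decides which regions of the space deserve the next budget increment.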


17. Assumptions: None
• Black-box optimization
• Allows the user to account for potential stagnation in checkpoint selection
• Regret is intrinsically accounted for

18. Random vs. Bayesian
1. Number of initial candidates
   – Random: scales exponentially with the number of parameters
   – Bayesian: scales linearly with the number of parameters
2. Candidate selection
   – Random: naive, static
   – Bayesian: adaptive
3. Regret implementation
   – Random: user must explicitly define it
   – Bayesian: surrogate + acquisition function

19. Which is Better?

20. Does this Actually Work?

21. Summary
• Attempts to optimize resource allocation
• Dramatically reduces compute and wall-clock time to convergence
• Better implementations include a "regret" mechanism to recover configurations
• Bayesian outperforms random
  – But in principle, compatible with any underlying hyperparameter optimization technique

22. What about Keras/TF EarlyStopping?
NOT THE SAME THING
It evaluates a single model against a predetermined rate of loss improvement, in order to:
1. Terminate stagnating runs
2. Prevent overtraining
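The distinction is easiest to see in the patience logic itself. Below is an illustrative reimplementation of Keras-style early stopping (our code, not the Keras source): it watches one run's own loss curve, whereas HyperBand eliminates configurations by comparing them against each other:

```python
def early_stop_epoch(losses, patience=3, min_delta=0.0):
    """Return the epoch at which Keras-style early stopping would halt:
    training stops once the loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs. Returns len(losses)
    if stopping never triggers."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best, waited = loss, 0     # improvement: reset the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch + 1       # halt after this epoch
    return len(losses)                 # never triggered
```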

23. Libraries
Open source: HyperBand
• HpBandSter (with random search)
Open source: conceptually similar
• HpBandSter (with HyperOpt search)
• Fabolas* (RoBO)
Commercial: conceptually similar
• SigOpt

24. Code
• HpBandSter: uto_examples/index.html
• Fabolas: mples/
• SigOpt:

25. Awkward/Conditional Spaces
The range, or existence, of one hyperparameter is dependent on the value of another hyperparameter.
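As a concrete illustration, consider optimizer selection: a momentum term only exists when SGD is chosen. The sketch below uses a plain nested dict and our own names rather than any particular library's syntax (HyperOpt, for instance, expresses the same idea with nested `hp.choice` spaces):

```python
import random

# Hypothetical conditional space: `momentum` exists only under the "sgd"
# branch, `beta1` only under "adam" -- each branch carries its own parameters.
SPACE = {
    "sgd":  {"momentum": (0.0, 0.99)},
    "adam": {"beta1": (0.8, 0.999)},
}

def sample(rng):
    """Draw one configuration: first the branch, then only the parameters
    that exist under that branch."""
    optimizer = rng.choice(sorted(SPACE))
    branch = {name: rng.uniform(lo, hi)
              for name, (lo, hi) in SPACE[optimizer].items()}
    return {"optimizer": optimizer, **branch}

config = sample(random.Random(0))   # e.g. an "sgd" config with a momentum
```

Flattening such a space (always sampling both `momentum` and `beta1`) wastes samples on parameters that have no effect in the active branch, which is exactly the problem conditional spaces avoid.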

26. Examples
• Optimizing gradient descent algorithm selection
• Neural network topology refinement
• Neural Architecture Search
• Ensemble models as featurizers


28. Why Does This Matter?
• Bayesian/adaptive algorithms learn from prior observations
• For every hyperparameter, the optimizer requires some number of samples to "learn" its dependencies

29. Libraries
Open source
• HyperOpt
• HpBandSter
Commercial
• SigOpt