1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Soji Adeshina, Machine Learning Engineer, Amazon AI
SageMaker Automatic Model Tuning
2. Roadmap
• Hyperparameters
• Search Based HPO
• Bayesian HPO
• Amazon SageMaker AMT
3. Hyperparameters
4. What Is a Hyperparameter?
• Hyperparameter = a parameter of the training algorithm itself, as opposed to a learned model parameter
• The training algorithm accepts hyperparameters and returns model parameters
• Hyperparameters affect how the algorithm behaves during the model training process
• “Any decision an algorithm author can’t make for you”
5. Examples of Hyperparameters
Model:
  Number of layers: 1, 2, 3, …
  Activation functions: sigmoid, tanh, ReLU, …
Optimization:
  Method: SGD, Adam, AdaGrad, …
  Learning rate: 0.01 to 2
Data:
  Batch size: 8, 16, 32, …
  Augmentation: resize, normalize, color jitter, …
6. Model vs Hyperparameter Optimization
$$\ell^{*} = \min_{\theta} h(\theta) \qquad \text{optimize hyperparameters } (\theta)$$
$$h(\theta) = \min_{w} f(w \mid X, y, \theta) \qquad \text{optimize model parameters } (w)$$
7. Blackbox Optimization
• We aim to minimize the objective function $h(\theta)$.
• We have no knowledge of what the objective function is.
• We don’t have access to the gradients of the objective function.
• All we know is what goes into the function and what comes out (see the sketch below).
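To make the black-box setting concrete, here is a minimal runnable sketch; `train_and_evaluate` is a hypothetical, synthetic stand-in for a real training job, which would take minutes or hours per call:

```python
import math
import random

def train_and_evaluate(learning_rate, num_layers):
    """Hypothetical stand-in for a real training job. In practice this
    would launch training and return a validation loss hours later."""
    rng = random.Random(hash((learning_rate, num_layers)))
    loss = (math.log10(learning_rate) + 2.0) ** 2 + 0.1 * num_layers
    return loss + rng.gauss(0, 0.05)  # noisy, expensive, gradient-free

def h(theta):
    """Black-box objective h(theta): we see only inputs and outputs,
    never a closed form or gradients."""
    return train_and_evaluate(theta["learning_rate"], theta["num_layers"])

print(h({"learning_rate": 0.01, "num_layers": 2}))
```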
8. Search Based HPO
9. Grid Search
[Figure: evenly spaced grid of candidate points over learning rate (0 to 2) × activation function (sigmoid, ReLU, tanh)]
10. Grid Search
[Figure: the same grid over learning rate × activation]
11. Grid Search - Shortcomings
• In grid search the user specifies a finite set of values for each hyperparameter.
• Each additional hyperparameter adds a degree of freedom and results in a combinatorial explosion (see the sketch below).
• Assume each hyperparameter has 5 options, e.g. learning rate: 0, 0.5, 1, 1.5, 2
  1 HP = 5 combinations
  2 HPs = 5 × 5 = 25 combinations
  3 HPs = 5 × 5 × 5 = 125 combinations
  …
  10 HPs = 5^10 = 9,765,625 combinations
  N HPs = 5^N combinations
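A quick illustration of the growth; the hyperparameter values are invented for the example:

```python
from itertools import product

# Five candidate values per hyperparameter, as on the slide.
learning_rates = [0.0, 0.5, 1.0, 1.5, 2.0]
activations = ["sigmoid", "tanh", "ReLU", "softsign", "ELU"]
batch_sizes = [8, 16, 32, 64, 128]

grid = list(product(learning_rates, activations, batch_sizes))
print(len(grid))   # 125 training jobs for only 3 hyperparameters
print(5 ** 10)     # 9765625 jobs for 10 hyperparameters
```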
12. Grid Search - Shortcomings
[Figure: grid over learning rate (0 to 2) × activation (sigmoid, ReLU, tanh)]
Some hyperparameters are more important than others.
13. Grid Search
[Figure: the same grid over learning rate × activation]
Wasted compute.
14. Random Grid Search
[Figure: randomly sampled points over learning rate (0 to 2) × activation (sigmoid, ReLU, tanh)]
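A minimal sketch of drawing random configurations instead of walking a grid; sampling the learning rate log-uniformly is an assumption of this example, not something the slides prescribe:

```python
import random

def sample_config(rng):
    """Draw one random hyperparameter configuration."""
    return {
        # log-uniform over roughly 1e-4 .. 2 (10**0.3 ~= 2.0)
        "learning_rate": 10 ** rng.uniform(-4, 0.3),
        "activation": rng.choice(["sigmoid", "tanh", "ReLU"]),
    }

rng = random.Random(0)
candidates = [sample_config(rng) for _ in range(25)]
print(candidates[0])
```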
15. Bayesian HPO
16. Model based Bayesian HPO
[Figure: surrogate model fit over the learning rate axis (activation fixed to ReLU). Legend: $h(\theta)$: true objective (hidden); $D$: evaluated samples; $h'(\theta)$: surrogate approximation; $c$: next candidate]
• $h(\theta)$ is expensive to evaluate, so we use an approximation, or surrogate model, $h'(\theta)$ instead.
• An acquisition function $\mathbb{E}[I(\lambda)]$ is used to select the next points to evaluate.
17. Model based Bayesian HPO
• Keeps track of previous evaluations and infers expected behaviour.
• It is Bayesian in the sense that the surrogate model uses a prior probability distribution to make predictions about the posterior:
$$P(Y \mid X) \propto P(X \mid Y)\,P(Y)$$
• Improves our beliefs about the objective function through iterative learning.
18. Surrogate Model - Gaussian Process
• A Gaussian process is a distribution over functions whose values at any finite set of points are jointly Gaussian:
$$f : \mathcal{X} \to \mathbb{R}, \qquad f(X_{t_1}), f(X_{t_2}), \ldots, f(X_{t_n}) \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
• A Gaussian distribution is a distribution over random numbers described by a mean $\mu$ and a variance $\sigma^2$.
• Each point corresponds to a set of hyperparameters from the search space $\Lambda$, the product of the individual hyperparameter ranges:
$$\lambda_i \in \Lambda = \prod_{i=1}^{n} \Lambda_i$$
• A Gaussian process is fully specified by a mean $\mu(\lambda)$ and a covariance function $k(\lambda, \lambda')$:
$$\mathcal{G}\big(\mu(\lambda), k(\lambda, \lambda')\big)$$
19. Gaussian Process as a Model of the Model Loss
20. Covariance Matrix
• Measures the similarity between two points; controls the ‘smoothness’ of the surrogate.
• SageMaker uses the Matérn kernel with $\nu = 5/2$ (a surrogate-fitting sketch follows below).
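As an illustration (not SageMaker's internal code), a Gaussian-process surrogate with a Matérn $\nu = 5/2$ kernel can be fit with scikit-learn; the data points here are invented:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Invented (hyperparameter, loss) observations: log10(learning rate) -> loss.
X = np.array([[-3.0], [-2.0], [-1.0], [0.0]])
y = np.array([0.9, 0.4, 0.3, 0.8])

# Matern kernel with nu = 5/2, the choice the slide attributes to SageMaker.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X, y)

# The surrogate returns a mean and an uncertainty at unseen points.
mu, sigma = gp.predict(np.array([[-1.5]]), return_std=True)
print(mu, sigma)
```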
21. Acquisition Function
• Given the posterior distribution over functions:
$$\mathbb{E}[I(\lambda)] = \mathbb{E}[\max(f_{\min} - Y,\, 0)]$$
• Used as the criterion for selecting the next candidate hyperparameters for evaluation (a closed-form sketch follows below).
• Often depends on the best hyperparameters seen so far in the search.
• Controls exploration vs. exploitation in the search.
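For a Gaussian posterior the expectation above has a closed form. A sketch, with mu and sigma the surrogate's mean and standard deviation at a candidate and f_min the best loss observed so far:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_min):
    """EI = E[max(f_min - Y, 0)] for Y ~ N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)   # guard against zero variance
    z = (f_min - mu) / sigma
    return (f_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Low mean (exploitation) and high variance (exploration) both raise EI.
print(expected_improvement(mu=0.30, sigma=0.05, f_min=0.35))
print(expected_improvement(mu=0.45, sigma=0.30, f_min=0.35))
```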
22. Acquisition Function: Expected Improvement
[Figure: surrogate posterior with two candidates $x_1$ and $x_2$ marked against the current best; $EI(x_1) > EI(x_2)$]
23. Using the Acquisition Function
• Expected improvement [maximizing the dashed line] has two components:
  • one depends on the negative mean $-\mu$ [solid line],
  • the other depends on the uncertainty, or variance, $k(\lambda, \lambda')$ [blue line].
• Therefore the acquisition function is maximized wherever:
  • the mean $\mu$ is low, or
  • the uncertainty $k(\lambda, \lambda')$ is high.
(A toy end-to-end loop combining the surrogate and the acquisition function follows below.)
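Putting the surrogate and the acquisition function together, a toy Bayesian-optimization loop over a synthetic one-dimensional objective; everything here is illustrative, and a real tuner would evaluate training jobs instead:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    """Synthetic stand-in for the expensive black-box h(theta)."""
    return np.sin(3 * x) + 0.1 * x ** 2

X = [[-2.0], [0.5], [2.0]]                     # a few initial evaluations
y = [objective(x[0]) for x in X]
grid = np.linspace(-3, 3, 200).reshape(-1, 1)  # candidate pool

for _ in range(10):
    # Refit the surrogate on everything observed so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.array(X), np.array(y))
    mu, sigma = gp.predict(grid, return_std=True)
    # Expected improvement over the best loss seen so far.
    z = (min(y) - mu) / np.maximum(sigma, 1e-12)
    ei = (min(y) - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = float(grid[np.argmax(ei), 0])     # maximize the acquisition
    X.append([x_next])
    y.append(objective(x_next))

print(min(y))  # best loss found
```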
24. Part 2: Hands On with Amazon SageMaker AMT
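As a preview of the hands-on part, a hedged sketch of launching a tuning job with the SageMaker Python SDK; the image URI, role, metric regex, and S3 path are placeholders to adapt to your own training script:

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import (
    HyperparameterTuner,
    ContinuousParameter,
    CategoricalParameter,
)

# Placeholder estimator: point it at your own training container and role.
estimator = Estimator(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-training-image",
    role="<your-sagemaker-execution-role>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:loss",
    objective_type="Minimize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(0.001, 2.0),
        "activation": CategoricalParameter(["sigmoid", "tanh", "relu"]),
    },
    # Regex to pull the objective metric out of the training logs.
    metric_definitions=[
        {"Name": "validation:loss", "Regex": "val_loss=([0-9\\.]+)"}
    ],
    max_jobs=20,          # total training jobs in the search
    max_parallel_jobs=2,  # concurrent jobs per round
)

tuner.fit({"train": "s3://<your-bucket>/train"})
```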
Editor's notes
• Hyperparameters come in various data types: continuous, integer, categorical.
• Hyperparameters have various ranges. $f$ and $h$ return the loss, e.g. a cross-entropy loss.
• We can compute the gradient with respect to the model parameters $w$: first-order optimization.
• We can't compute the gradient with respect to the hyperparameters $\theta$: zeroth-order optimization. There is often no closed form; the underlying true relationship is hidden.
• Evaluations cost time and money.
• We must sample or discretize: a full 5^10 grid for a model that takes 1 hour to train would need over 1,000 years of compute. Often some hyperparameters are more important than others, so compute is wasted; we can limit the number of samples.
• Use a quick surrogate model to choose the next point to evaluate.
• Use the acquisition function to choose the next point. Assumes similar points give similar results: the covariance function.
• Gives probabilistic estimates.
• Closed-form expressions for the mean and variance.
• The most common kernel is the squared exponential (Gaussian radial basis function); the Matérn kernel generalizes it.
• $\nu = \infty$ gives the squared exponential kernel, which is infinitely differentiable.
• $\nu = 5/2$ can be differentiated twice but not three times: a good default that works on a wide range of problems and is robust.
• There are simplifications for these cases.