Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Automatic Model Tuning
Using Amazon SageMaker
Cyrus M Vahid
Principal Evangelist – AWS AI Labs
CyrusMV@amazon.com
A I M 4 1 2

Agenda
Hyperparameters
Search-based HPO
Bayesian HPO
Amazon SageMaker Automatic
Model Tuning (AMT)

What is a hyperparameter
• Hyperparameter = algorithm parameter
• It affects how an algorithm behaves during model learning
process
• “Any decision an algorithm author can’t make for you”

Examples of hyperparameters
Model
Number of layers: 1, 2, 3, . . .
Activation functions: Sigmoid, tanh, RELU, . . .
Optimization
Method: SGD, Adam, AdaGrad, . . .
Learning rate: 0.01 to 2
Data
Batch size: 8, 16, 32, . . .
Augmentation: Resize, Normalize, Color Jitter, . . .

Optimizing loss function for a model
gradient

Optimizing loss function for a model

Blackbox optimization
• We aim to minimize the objective function 𝐽
• We have no knowledge of what the objective function is
• We do have access to the gradients of the objective function
• All we know is what goes into the function and what comes out
𝐽(𝑊, 𝜃|𝜆)

Grid search
Learning rate
Activation
Sigmoid
RELU
tanh
0 20.5 1 1.5

Curse of dimensionality
• In grid search the user specifies a finite set of values for each hyperparameter and then
• Each hyperparameter increases degree of freedom and results in combinatorial explosion
• Example
• One HP and five options  5 combinations
• Two HP and five options each  5*5 combinations
• Ten HO and five options each  5*5*5*5*5*5*5*5*5*5 = 510 ≈ 106 combinations

Random grid search
• Random search samples configurations at random until a certain budget 𝐵 for the search is
exhausted
• The number of evaluations for 𝑁 hyperparameters is only 𝐵1/𝑁 vs. 𝐵 for grid search
Bergstra, Bengio. 2012. “Random Search for Hyper-Parameter Optimization”

Model optimization
• The aim of model optimization is to find a set of model parameters,
𝜃∗
, that returns the most accurate results for a given dataset
Optimize model-params: 𝜃
𝜃∗ = 𝑎𝑟𝑔 min
𝜃𝜖Θ
𝑓(𝜆, 𝜃, 𝒟𝑡 𝒟 𝑣)
𝒟𝑡, 𝒟 𝑣 ∼ 𝒟

Hyperparameter optimization
• The aim of HPO is to find a set of hyperparameters, 𝜆∗
, that given
algorithm 𝒜 and dataset 𝒟, returns the most accurate results
Optimize model-params: 𝜃
𝜃∗ = 𝑎𝑟𝑔 min
𝜃𝜖Θ
𝑓(𝜆, 𝜃, 𝒟𝑡 𝒟 𝑣)
𝒟𝑡, 𝒟 𝑣 ∼ 𝒟
Optimize model-params: 𝜆∗
𝜆∗ = 𝑎𝑟𝑔 min
𝜆𝜖Λ
𝔼 𝒟 𝑡,𝒟 𝑣 ∼𝒟 𝑉(ℒ, 𝒜 𝜆, 𝒟𝑡, 𝒟 𝑣)
𝑊ℎ𝑒𝑟𝑒 𝑉 𝑚𝑒𝑎𝑠𝑢𝑟𝑒𝑠 𝑡ℎ𝑒 𝑙𝑜𝑠𝑠 𝑜𝑓 𝑎 𝑚𝑜𝑑𝑒𝑙 𝑔𝑒𝑛𝑒𝑟𝑎𝑡𝑒𝑑 𝑏𝑦
𝑎𝑙𝑔𝑜𝑟𝑖𝑡ℎ𝑚 𝒜 𝑤𝑖𝑡ℎ ℎ𝑦𝑝𝑒𝑟𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠 𝜆 𝑜𝑛 𝑑𝑎𝑡𝑎𝑠𝑒𝑡 𝒟

Bayesian optimization
• Keeps track of previous evaluations and infers expected behavior
• It is Bayesian in a sense that the surrogate model model uses prior
probability distribution to make predictions about the posterior
𝑃 𝑌 𝑋 ∝ 𝑃 𝑌 𝑋 𝑃(𝑌)
• An acquisition function selects next points to be evaluated. Commonly
expected improvement
𝔼 𝕀 𝜆 = 𝔼[max(𝑓_ min −𝑌, 0)]
• As per iteration improves our beliefs about the objective function by
applying iterative learning

Bayesian optimization
1: 𝑓𝑜𝑟 𝑛 = 1, 2, 3, … 𝑑𝑜
2: 𝑠𝑒𝑙𝑒𝑐𝑡 𝑛𝑒𝑤 𝑋 𝑛+1 𝑏𝑦 𝑜𝑝𝑡𝑖𝑚𝑖𝑧𝑖𝑛𝑔 𝑎𝑐𝑞𝑢𝑖𝑠𝑖𝑡𝑖𝑜𝑛 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛 𝛼
𝑋 𝑛+1 = arg max
𝑥
𝛼 𝑥; 𝒟
3: 𝑞𝑢𝑒𝑟𝑦 𝑜𝑏𝑗𝑒𝑐𝑡𝑖𝑣𝑒 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛 𝑡𝑜 𝑜𝑏𝑡𝑎𝑖𝑛 𝑦 𝑛+1
4: augment data 𝒟 𝑛+1 = 𝒟 𝑛, 𝑥 𝑛+1, 𝑦 𝑛+1
5: 𝑢𝑝𝑑𝑎𝑡𝑒 𝑡ℎ𝑒 𝑠𝑢𝑟𝑟𝑜𝑔𝑎𝑡𝑒 𝑚𝑜𝑑𝑒𝑙
https://ieeexplore.ieee.org/document/7352306

Gaussian process
• Gaussian distribution is a series of random numbers that is described
by mean 𝜇 and variance 𝜎2
• Gaussian process is a distribution over functions, each of which returns
mean and variance of a Gaussian distribution
𝑓: 𝒳 → ℝ
𝑓(𝑋𝑡1
), 𝑓(𝑋𝑡2
), … , 𝑓(𝑋𝑡 𝑛
)~𝒩(𝝁, 𝜮)

Gaussian process – continued . . .
• Each distribution corresponds to a set of hyperparameters Λ;
𝜆𝑖 𝜖Λ = ς𝑖=1
𝑛
Λ 𝑖
• A Gaussian process is fully specified by a mean 𝜇 𝜆 and a covariance
function 𝑘(𝜆, 𝜆′)
𝒢(𝜇 𝜆 , 𝑘(𝜆, 𝜆′))

Gaussian process for model of model loss
• A Gaussian process 𝒢(𝜇 𝜆 , 𝑘(𝜆, 𝜆′)) is fully specified by a mean 𝜇 𝜆 and
a covariance function 𝑘(𝜆, 𝜆′)
𝑓(𝑋𝑡1
), 𝑓(𝑋𝑡2
), … , 𝑓(𝑋𝑡 𝑛
)~𝒩(𝝁, 𝜮)

Covariant function
• The mean function is usually
constant in Bayesian optimization
• The quality of the Gaussian
process is solely dependent on
the quality of the covariant
function
• A common choice is the Matern
5/2 kernel, with its
hyperparameters integrated out
by Markov chain Monte Carlo

Acquisition function: Probability of improvement
70%
Current best
50% 70%
𝑃𝐼 𝑥1 < 𝑃𝐼(𝑥2)
𝑥1
𝑥2

Acquisition function: Expected improvement
0.3 0.2
𝐸𝐼 𝑥1 > 𝐸𝐼(𝑥2)
𝑥1
𝑥2
1
-1
70%
Current best

Using acquisition function
• Expected improvement
[maximining the dashed line] has
two components
• One is dependent on −𝜇 [solid line]
• The other dependent on uncertainty or
variance 𝑘(𝜆, 𝜆′) [blue line]
• Therefore we maximize the
acquisition function wherever
• Mean, 𝜇, is low, or
• Uncertainty,𝑘(𝜆, 𝜆′), is high

Parallelism through Thompson sampling (TS)
• TS is a decades-old heuristic for
choosing actions that maximize
the expected reward
𝑎 𝑛+1 = arg max
𝑎
𝑓𝑤 𝑎 , where ෥𝑤 ~ 𝑝(𝑤|𝒟 𝑛)
• Sequentiality makes sampling
computationally interactable
• We can use multiple workers for
parallelizing TS
~~
https://www.cs.cmu.edu/~kkandasa/misc/automl-slides.pdf

Automatic Model Tuning — Workflow
Training jobHyperparameter
tuning job
Tuning strategy
Objective
metrics
Training job
Training job
Training job
Clients
(console, notebook, IDEs, CLI)
Model name
model1
model2
…
Objective
metric
0.8
0.75
…
eta
0.07
0.09
…
max_depth
6
5
…
…

Automatic Model Tuning — Architecture
Training code
• Factorization machine
• Regression/classification
• Principal component analysis
• K-means clustering
• XGBoost
• DeepAR
• And more
Amazon SageMaker
built-in algorithms
Bring Your Own Script (prebuilt containers) Bring Your Own Algorithm
Fetch training data Save model artifacts
Fully
managed –
Secured–
Automatic Model Tuning

Input mapping
• Continuous – e.g., learning rate
• Integral – e.g., number of epochs
• Categorical – e.g., type of activation
0,2 → 0,1
1,2, … , 50 → [0,1]
𝑆𝑖𝑔𝑚𝑜𝑖𝑑, 𝑅𝐸𝐿𝑈, 𝑡𝑎𝑛ℎ → 0,1 3
Create a unit hypercube from all input hyperparameters

Thank you!
Cyrus Vahid
cyrusmv@amazon.com

Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018

Similaire à Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018 (20)

Plus de Amazon Web Services

Plus de Amazon Web Services (20)

Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018