1. Competition Winning Learning Rates
Leslie N. Smith
Naval Center for Applied Research in Artificial Intelligence
US Naval Research Laboratory, Washington, DC 20375
leslie.smith@nrl.navy.mil; Phone: (202) 767-9532
MLConf 2018
November 14, 2018
UNCLASSIFIED
2. Introduction
• My story of enlightenment about learning rates (LR)
• My first steps: Cyclical learning rates (CLR)
– What are cyclical learning rates? Why do they matter?
• A new LR schedule and Super-Convergence
– Fast training of networks with large learning rates
• Competition winning learning rates
– Stanford’s DAWNBench competition
– Kaggle’s iMaterialist Challenge
• Enlightenment is a never-ending story
– Is weight decay more important than LR?
3. Deep Learning Basics Background
– Uses a neural network composed of many “hidden” layers l; each layer contains trainable weights W_l, biases b_l, and a non-linear function σ
– Image x is the input, and the output y_L is compared to the label y, which defines the loss (see the sketch below)
[Figure: feed-forward network mapping an input image x through weighted hidden layers to output classes (e.g., “Speed limit 20”, “Yield”, “Pedestrian crossing”); back-propagation flows from the loss back through the layers, where gradients can vanish]
  y_l = F(y_{l−1}) = σ(W_l y_{l−1} + b_l)
  y_L = σ(W_L σ(W_{L−1} σ(… W_1 x + b_1 …)))
  Loss = || y_L − y ||
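As a concrete illustration of the layer equations above, here is a minimal PyTorch sketch of the forward pass and loss; the layer sizes and the choice of ReLU for σ are illustrative assumptions, not values from the slides.

```python
import torch
import torch.nn as nn

# A tiny feed-forward net; each Linear + ReLU pair computes y_l = sigma(W_l y_{l-1} + b_l).
# The layer sizes (784 -> 128 -> 64 -> 10) and ReLU as sigma are illustrative choices.
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

x = torch.randn(1, 784)                 # input image x, flattened
y = torch.zeros(1, 10)                  # one-hot label
y[0, 3] = 1.0

y_L = model(x)                          # y_L = sigma(W_L sigma(... W_1 x + b_1 ...))
loss = torch.norm(y_L - y)              # || y_L - y ||, the loss on the slide
loss.backward()                         # back-propagation fills each weight's gradient
```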
4. What are learning rates?
• The learning rate is the step size ε in Stochastic Gradient Descent’s (SGD) weight update, applied to the gradients computed by back-propagation (see the sketch below):
  w_{t+1} = w_t − ε ∇L(θ, x)
• It has long been known that the learning rate (LR) is the most important hyper-parameter to tune
  – Too large: training diverges
  – Too small: training is slow and converges to a sub-optimal solution
• How do you find an optimal LR?
  – Grid or random search
  – Time consuming and inaccurate
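A minimal sketch of this update rule on a toy problem; the learning rate value and the quadratic toy loss are arbitrary illustrations.

```python
import torch

eps = 0.1                                    # learning rate (step size); arbitrary value
w = torch.randn(10, requires_grad=True)      # weights theta = w
x = torch.randn(10)                          # a single input example

loss = ((w * x).sum() - 1.0) ** 2            # toy loss L(theta, x)
loss.backward()                              # back-propagation computes grad L

with torch.no_grad():
    w -= eps * w.grad                        # w_{t+1} = w_t - eps * grad L(theta, x)
    w.grad.zero_()
```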
5. What is CLR? And who cares?
• Cyclical learning rates (CLR)
  – A learning rate schedule that varies the LR between min and max values
• LR range test: one stepsize of linearly increasing LR
  – A quick and easy way to find an optimal learning rate (see the sketch below)
  – The peak of the accuracy curve defines max_lr
  – The optimal LR is a bit less than max_lr
  – min_lr ≈ max_lr / 3
[Figure: cyclical learning rate schedule oscillating between min_lr and max_lr]
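A minimal PyTorch sketch of a triangular CLR schedule using torch.optim.lr_scheduler.CyclicLR. The max_lr value, step size, and stand-in model/data are assumptions for illustration; in practice max_lr comes from your own LR range test, with min_lr ≈ max_lr / 3 as above.

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)                                  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

max_lr = 0.3                                                # assumed; take it from an LR range test
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=max_lr / 3,                                     # min_lr ~= max_lr / 3
    max_lr=max_lr,
    step_size_up=2000,                                      # iterations from min_lr up to max_lr
    mode="triangular",
)

for step in range(10):                                      # toy training loop
    optimizer.zero_grad()
    loss = model(torch.randn(32, 784)).pow(2).mean()        # dummy loss on random data
    loss.backward()
    optimizer.step()
    scheduler.step()                                        # move the LR along the cycle
```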
8. Super-convergence
• What if there’s no peak?
  – Implies the ability to train at very large learning rates
• Super-convergence
  – Start with a small LR
  – Grow to a large maximum learning rate
• 1cycle learning rate schedule (see the sketch below)
  – One CLR cycle, ending with an LR smaller than the min
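A minimal PyTorch sketch of the 1cycle schedule via torch.optim.lr_scheduler.OneCycleLR; the max_lr, step count, and div factors are illustrative assumptions. The final_div_factor makes the run end at an LR well below the starting point, matching the slide’s “ending with an LR smaller than the min”.

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)                                  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

total_steps = 1000                                          # assumed length of the run
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.3,                                             # assumed; from an LR range test
    total_steps=total_steps,
    div_factor=25,                                          # initial LR = max_lr / 25
    final_div_factor=1e4,                                   # end far below the initial LR
)

for step in range(total_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(32, 784)).pow(2).mean()        # dummy loss on random data
    loss.backward()
    optimizer.step()
    scheduler.step()                                        # one cycle up, then anneal down
```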
11. Competition Winning Learning Rates
• DAWNBench challenge
– “Howard explains that in order to create an algorithm for solving CIFAR,
Fast.AI’s group turned to a relatively unknown technique known as “super
convergence.” This wasn’t developed by a well-funded tech company or
published in a big journal, but was created and self-published by a single
engineer named Leslie Smith working at the Naval Research Laboratory.”
The Verge, 5/7/2018, article about the fast.ai team’s 1st-place finish
12. Competition Winning Learning Rates
• Kaggle iMaterialist
Challenge (Fashion)
– “For training I used Adam
initially but I switched to the
1cycle policy with SGD very
early on. You can read more
about this training regime in
a paper by Leslie Smith and
you can find details on how to
use it by Sylvain Gugger, the
author of the implementation
in the fastai library here.”
— Radek Osmulski, 1st-place winner
13. Relevant publications
• Smith, Leslie N. "Cyclical learning rates for training neural
networks." In Applications of Computer Vision (WACV),
2017 IEEE Winter Conference on, pp. 464-472. IEEE, 2017.
• Smith, Leslie N., and Nicholay Topin. "Super-Convergence:
Very Fast Training of Residual Networks Using Large
Learning Rates." arXiv preprint arXiv:1708.07120 (2017).
• Smith, Leslie N. "A disciplined approach to neural network
hyper-parameters: Part 1--learning rate, batch size,
momentum, and weight decay." arXiv preprint
arXiv:1803.09820 (2018).
• There’s a large batch of literature on Large Batch Training
15. What is Weight Decay?
• L2 regularization: a penalty on the weights is added to the loss
  L(θ, x) = || f(θ, x) − y ||² + ½ λ || w ||²
• The “effective weight decay” is a combination of the WD coefficient λ and the learning rate ε (see the sketch below):
  w_{t+1} = w_t − ε ∇L(θ, x) − ε λ w_t
  w_{t+1} = (1 − ε λ) w_t − ε ∇L(θ, x)
• Which term does the learning rate schedule impact more?
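A minimal sketch of the rearranged update above, applying weight decay as the (1 − ε λ) shrink factor; the values of ε, λ, and the toy loss are illustrative.

```python
import torch

eps, lam = 0.1, 1e-4                         # learning rate and WD coefficient; illustrative values
w = torch.randn(10, requires_grad=True)
x = torch.randn(10)

loss = ((w * x).sum() - 1.0) ** 2            # data loss only; the L2 term is folded into the update
loss.backward()

with torch.no_grad():
    w.mul_(1 - eps * lam)                    # (1 - eps*lam) w_t  -- the "effective WD" shrink
    w.sub_(eps * w.grad)                     # ... - eps * grad L(theta, x)
    w.grad.zero_()
```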
16. What is the optimal WD?
• Decreasing the weight decay has a much greater effect than decreasing the learning rate
• The maximum weight decay, maxWD, can be found in a similar way to maxLR (i.e., a WD range test; see the sketch below)
• Use a large WD early in training and let it decay
[Figure: WD range test curve with maxWD marked]
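One way to read the WD range test is a sweep of short runs over candidate WD values while watching how the loss behaves; this sketch assumes that reading, and the grid, run length, and dummy data are all illustrative.

```python
import torch
import torch.nn as nn

candidate_wd = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]                # assumed grid of WD values

for wd in candidate_wd:
    model = nn.Linear(784, 10)                               # fresh stand-in model per run
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=wd)
    for step in range(100):                                  # short probe run
        optimizer.zero_grad()
        loss = model(torch.randn(32, 784)).pow(2).mean()     # dummy loss on random data
        loss.backward()
        optimizer.step()
    print(f"WD={wd:.0e}  final probe loss={loss.item():.4f}")
```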
17. Dynamic weight decay
• Hyper-parameter relationship (see the sketch below):
  (LR × WD) / (TBS × (1 − α)) ≈ constant
  where LR = learning rate, WD = weight decay coefficient, TBS = total batch size, and α = momentum
  – A large TBS and smaller values of LR and α permit a larger maxWD
• Large batch training can be improved by a large WD if the WD decays during training
[Figure: accuracy comparison across weight decay schedules — 83.7%, 82.2%, 84%, 83.5%]
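A small sketch of using the relationship above to carry a known-good WD over to a new batch size, LR, or momentum; the reference configuration numbers are made-up examples, not values from the slides.

```python
# (LR * WD) / (TBS * (1 - momentum)) ~= constant, so holding the constant fixed
# lets us solve for WD when the other hyper-parameters change.
def rescale_wd(ref_lr, ref_wd, ref_tbs, ref_mom, lr, tbs, mom):
    const = (ref_lr * ref_wd) / (ref_tbs * (1 - ref_mom))
    return const * tbs * (1 - mom) / lr

# Reference run (made-up): LR=0.1, WD=5e-4, batch size 128, momentum 0.9.
# New run: 8x larger batch with the LR also scaled up 8x, same momentum.
new_wd = rescale_wd(0.1, 5e-4, 128, 0.9, lr=0.8, tbs=1024, mom=0.9)
print(new_wd)   # ~5e-4: WD is unchanged when LR scales in proportion to TBS
```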
18. Conclusions (for now)
• Takeaways
  – Enlightenment is a never-ending story
  – A new diet plan for your network
    • Decaying weight decay in large batch training: set WD large in the early phase of training and zero near the end
  – Hyper-parameters are tightly coupled and must be tuned together:
    (LR × WD) / (TBS × (1 − α)) ≈ constant