IDS Lab
The Marginal Value of Adaptive Gradient
Methods in Machine Learning
Does deep learning really do any generalization? (part 2)

presented by Jamie Seol
IDS Lab
Jamie Seol
Preface
• Toy problem: smooth, strongly convex quadratic optimization

• Let the objective f be as follows (written out below), and WLOG suppose A is symmetric and nonsingular

• why WLOG? symmetric, because a quadratic form only sees the symmetric part of A; and a singular curvature (the curvature of a quadratic is A) is reducible, so we may restrict to the nonsingular part

• moreover, strong convexity = positive definite curvature

• meaning that all eigenvalues are positive
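• concretely (notation ours; this is the standard form the rest of the derivation assumes):

```latex
f(w) = \tfrac{1}{2}\, w^{\top} A w - b^{\top} w, \qquad
\nabla f(w) = A w - b, \qquad \nabla^{2} f(w) = A
```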
IDS Lab
Jamie Seol
Preface
• Note that A is a real symmetric matrix, so by the spectral theorem, A has an eigendecomposition with an orthonormal basis

• In this simple objective function, we can explicitly compute the optimum:
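• in symbols (with f as above):

```latex
A = Q \Lambda Q^{\top}, \quad Q^{\top} Q = I, \quad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n), \qquad
w^{\star} = A^{-1} b \;\; \big(\text{from } \nabla f(w^{\star}) = 0\big)
```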
IDS Lab
Jamie Seol
Preface
• We’ll apply a gradient descent! let superscript be an iteration:

• Will it converge to the optima? let’s check it out!

• We use some tricky trick using change of basis

• This new sequence x(k) should converge to 0

• But when?
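• in symbols (step size α; f, Q, w⋆ as above):

```latex
w^{(k+1)} = w^{(k)} - \alpha \nabla f\big(w^{(k)}\big)
          = w^{(k)} - \alpha \big(A w^{(k)} - b\big), \qquad
x^{(k)} := Q^{\top} \big(w^{(k)} - w^{\star}\big)
```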
IDS Lab
Jamie Seol
Preface
• This identity holds:

• [homework: prove it]

• Rewriting in element-wise notation (both shown below):
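• the identity and its element-wise form (it follows from Aw⋆ = b and QᵀA = ΛQᵀ):

```latex
x^{(k+1)} = (I - \alpha \Lambda)\, x^{(k)}
\;\Longrightarrow\;
x_i^{(k)} = (1 - \alpha \lambda_i)^{k}\, x_i^{(0)}
```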
IDS Lab
Jamie Seol
Preface
• So, gradient descent converges only if

• for all i

• In summary, it converges when

• And the optimal step size is

• where 𝜎(A) denotes the spectral radius of A, i.e. the maximal absolute value among the eigenvalues [homework: the n = 1 case] (all written out below)
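• written out (λmin, λmax are the extreme eigenvalues; since A is positive definite, 𝜎(A) = λmax, and the optimal rate is the standard quadratic result):

```latex
\big| 1 - \alpha \lambda_i \big| < 1 \;\; \forall i
\;\Longleftrightarrow\;
0 < \alpha < \frac{2}{\sigma(A)}, \qquad
\alpha^{\star} = \frac{2}{\lambda_{\min} + \lambda_{\max}}
```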
IDS Lab
Jamie Seol
Preface (appendix)
• Actually, this result is rather obvious

• Note that A is the curvature of the objective, and the spectral radius, i.e. the largest (absolute) eigenvalue, measures the "stretching" along A's principal axes

• curvature ← see differential geometry

• principal axes ← see linear algebra

• So, it is only natural that the learning rate should stay in a safe range relative to this "stretching", which can be achieved with a simple normalization
IDS Lab
Jamie Seol
Preface
• Similarly, the optimal momentum decay can also be derived, using the condition number 𝜅

• the condition number of a matrix is the ratio between its maximal and minimal (absolute) eigenvalues

• Therefore, if we can bound the spectral radius of the objective's curvature, then we can approximate the optimal parameters for gradient descent

• this is the main idea of the YellowFin optimizer (formulas below)
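• the standard heavy-ball tuning on a quadratic (cf. Goh, 2017, in the references):

```latex
\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}, \qquad
\beta^{\star} = \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{2}, \qquad
\alpha^{\star} = \left( \frac{2}{\sqrt{\lambda_{\max}} + \sqrt{\lambda_{\min}}} \right)^{2}
```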
IDS Lab
Jamie Seol
Preface
• So what?

• We pretty much know the behavior of gradient descent well

• if the objective is a smooth, strongly convex quadratic…

• but the objectives of deep learning are not that nice!

• We just don't really know the characteristics of deep learning objective functions yet

• more research is required
IDS Lab
Jamie Seol
Preface 2
• Here’s a typical linear regression problem

• If the number of features d is bigger than the number of samples
m, than it is underdetermined system

• So it has (possibly infinitely) many solutions

• Let’s use stochastic gradient descent (SGD)

• which solution will SGD find?
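• the setting, in symbols (notation ours):

```latex
\min_{w \in \mathbb{R}^{d}} \; \tfrac{1}{2} \lVert X w - y \rVert_2^2,
\qquad X \in \mathbb{R}^{m \times d}, \;\; y \in \mathbb{R}^{m}, \;\; d > m
```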
IDS Lab
Jamie Seol
Preface 2
• Actually, we’ve already discussed about this in the previous
seminar

• Anyway, even if the system is underdetermined, SGD always
converges to some unique solution which belongs to span of X
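• why: each SGD step moves along one data point, so when initialized at w(0) = 0 (assumed here) the iterates never leave that span; a sketch:

```latex
w^{(k+1)} = w^{(k)} - \alpha_k \big( x_{i_k}^{\top} w^{(k)} - y_{i_k} \big)\, x_{i_k}
\;\;\Longrightarrow\;\;
w^{(k)} \in \operatorname{span}\{x_1, \dots, x_m\}
```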
IDS Lab
Jamie Seol
Preface 2
• Moreover, experiments show that SGD's solution has a small norm

• We know that l2-regularization helps generalization

• l2-regularization: keeping the parameters' norm small

• So, we can say that SGD performs implicit regularization

• but there's also evidence that l2-regularization does not help at all…

• see the previous seminar presented by me

• it works, but actually it kind of doesn't; it's still on the good side, but not all that good…
IDS Lab
Jamie Seol
Introduction
• In summary,

• adaptive gradient descent methods

• might be poor

• at generalization
IDS Lab
Jamie Seol
Preliminaries
• Famous non-adaptive gradient descent methods:

• Stochastic Gradient Descent [SGD]

• Heavy-Ball [HB] (Polyak, 1964)

• Nesterov’s Accelerated Gradient [NAG] (Nesterov, 1983)
IDS Lab
Jamie Seol
Preliminaries
• Adaptive methods can be summarized as:

• AdaGrad (Duchi, 2011)

• RMSProp (Tieleman and Hinton, 2012, from a Coursera lecture!)

• Adam (Kingma and Ba, 2015)

• In short, these methods adaptively change the learning rate and momentum decay (a sketch follows)
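• as a minimal sketch of the per-coordinate rescaling these methods perform, here is the Adam update in NumPy (a hedged sketch; variable names ours, k counts steps from 1):

```python
import numpy as np

def adam_step(w, grad, m, v, k, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2015): the step is rescaled per
    coordinate by running estimates of the gradient's first/second moments."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum-like average)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (per-coordinate scale)
    m_hat = m / (1 - beta1 ** k)             # bias corrections for the zero init
    v_hat = v / (1 - beta2 ** k)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```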
IDS Lab
Jamie Seol
Preliminaries
• All together
IDS Lab
Jamie Seol
Synopsis
• For a system with multiple solutions, which solution does an algorithm find, and how well does it generalize to unseen data?

• Claim: there exists a constructive problem (dataset) in which

• non-adaptive methods work well and

• find a solution with good generalization power

• adaptive methods work poorly

• find a solution with poor generalization power

• we can even make this arbitrarily poor, while the non-adaptive solution keeps working
IDS Lab
Jamie Seol
Problem settings
• Think of a simple binary least-squares classification problem

• When d > n, if there is an optimum with loss 0, then there are infinitely many optima

• But as shown in Preface 2, SGD converges to the unique solution

• known to be the minimum norm solution (closed form below)

• which generalizes well

• why? because here, it's also the largest margin solution

• All the other non-adaptive methods also converge to the same solution
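• the minimum norm solution in closed form (assuming X has full row rank; notation ours):

```latex
w_{\mathrm{SGD}} = X^{\top} (X X^{\top})^{-1} y
= \operatorname*{arg\,min}_{w} \; \lVert w \rVert_2 \;\; \text{s.t.} \;\; X w = y
```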
IDS Lab
Jamie Seol
Lemma
• Let sign(x) denote a function that maps each component of x to its
sign

• ex) sign([2, -3]) = [1, -1]

• If there exists a solution proportional to sign(Xᵀy), this is precisely the unique solution to which all adaptive methods converge

• quite an interesting lemma!

• pf) by induction

• Note that this solution is just:

• the mean of the positively labeled vectors minus the mean of the negatively labeled vectors (in symbols below)
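• in symbols: the vector whose sign is taken is the difference of the class sums (up to scaling, of the class means):

```latex
X^{\top} y = \sum_{i=1}^{n} y_i\, x_i
= \sum_{i:\, y_i = +1} x_i \;-\; \sum_{i:\, y_i = -1} x_i
```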
IDS Lab
Jamie Seol
Funny dataset
• Let’s fool adaptive methods

• first, assign yi to 1 with probability p > 1/2

• when y = [-1, -1, -1, -1]

• when y = [1, 1, 1, 1]
IDS Lab
Jamie Seol
Funny dataset
• Note that for such a dataset, the only discriminative feature is the
first one!

• if y = [1, -1, -1, 1, -1] then X becomes:
IDS Lab
Jamie Seol
Funny dataset
• Let and assume b > 0 (p > 1/2)

• Suppose , then
IDS Lab
Jamie Seol
Funny dataset
• So, holds!

• Take a closer look

• the first three entries are all 1, and the remaining entries contribute 0 on new data

• this solution is bad!

• it will classify every new data point as the positive class!!!

• what horrible generalization!
IDS Lab
Jamie Seol
Funny dataset
• How about the non-adaptive methods?

• So, when , the solution makes no errors

• wow
IDS Lab
Jamie Seol
Funny dataset
• Think this is too extreme?

• Well, even in real datasets, the following are rather common:

• a few frequent features (j = 2, 3)

• some features that are good indicators, but hard to identify (j = 1)

• many other sparse features (the rest)
IDS Lab
Jamie Seol
Experiments
• (the authors said that they downloaded the models from the internet…)

• Results in summary:

• adaptive methods generalize poorly

• even when they reach a lower training loss than the non-adaptive ones!!!

• adaptive methods look fast, but that's it

• adaptive methods promise "no more tuning", but tuning the initial values still mattered a lot

• and it takes as much time as tuning the non-adaptive ones…
IDS Lab
Jamie Seol
Experiments
• CIFAR-10

• use non-adaptive
IDS Lab
Jamie Seol
Experiments
• lower training loss, yet more test error (Adam vs HB)
IDS Lab
Jamie Seol
Experiments
• Character-level language model

• AdaGrad looks very fast, but in the end it is not good

• surprisingly, RMSProp closely trails SGD on the test set
IDS Lab
Jamie Seol
Experiments
• Parsing

• well, it is true that non-adaptive methods are slow
IDS Lab
Jamie Seol
Conclusion
• Adaptive methods are not advantageous for optimization

• They might be fast, but they generalize poorly

• then why is Adam so popular?

• because it's popular…?

• especially, it is known to be popular in GANs and Q-learning

• but these are not exactly optimization problems

• we don't know much about the nature of the objectives in those two yet
IDS Lab
Jamie Seol
References
• Wilson, Ashia C., et al. "The Marginal Value of Adaptive Gradient Methods in Machine Learning." arXiv preprint arXiv:1705.08292 (2017).
• Zhang, Jian, Ioannis Mitliagkas, and Christopher Ré. "YellowFin and the Art of Momentum Tuning." arXiv preprint arXiv:1706.03471 (2017).
• Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).
• Polyak, Boris T. "Some methods of speeding up the convergence of iteration methods." USSR Computational Mathematics and Mathematical Physics 4.5 (1964): 1-17.
• Goh, "Why Momentum Really Works", Distill, 2017. http://doi.org/10.23915/distill.00006
