IDS Lab
The Marginal Value of Adaptive Gradient
Methods in Machine Learning
Does deep learning really do any generalization? (part 2)

presented by Jamie Seol
IDS Lab
Jamie Seol
Preface
• Toy problem: smooth, strongly convex quadratic optimization

• Let the objective f be as follows (written out below), and WLOG suppose A is symmetric and nonsingular

• why WLOG? symmetric, because a quadratic form only sees the symmetric part of A; and a singular curvature (the curvature of a quadratic is A) is reducible, so we may restrict to the nonsingular part

• moreover, strong convexity = positive definite curvature

• meaning that all eigenvalues are positive
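• concretely (notation ours; this is the standard form the rest of the derivation assumes):

```latex
f(w) = \tfrac{1}{2}\, w^{\top} A w - b^{\top} w, \qquad
\nabla f(w) = A w - b, \qquad \nabla^{2} f(w) = A
```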
IDS Lab
Jamie Seol
Preface
• Note that A is a real symmetric matrix, so by the spectral theorem, A has an eigendecomposition with an orthonormal basis

• In this simple objective function, we can explicitly compute the optimum:
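• in symbols (with f as above):

```latex
A = Q \Lambda Q^{\top}, \quad Q^{\top} Q = I, \quad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n), \qquad
w^{\star} = A^{-1} b \;\; \big(\text{from } \nabla f(w^{\star}) = 0\big)
```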
IDS Lab
Jamie Seol
Preface
• We’ll apply a gradient descent! let superscript be an iteration:

• Will it converge to the optima? let’s check it out!

• We use some tricky trick using change of basis

• This new sequence x(k) should converge to 0

• But when?
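• in symbols (step size α; f, Q, w⋆ as above):

```latex
w^{(k+1)} = w^{(k)} - \alpha \nabla f\big(w^{(k)}\big)
          = w^{(k)} - \alpha \big(A w^{(k)} - b\big), \qquad
x^{(k)} := Q^{\top} \big(w^{(k)} - w^{\star}\big)
```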
IDS Lab
Jamie Seol
Preface
• This identity holds:

• [homework: prove it]

• Rewriting in element-wise notation (both shown below):
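• the identity and its element-wise form (it follows from Aw⋆ = b and QᵀA = ΛQᵀ):

```latex
x^{(k+1)} = (I - \alpha \Lambda)\, x^{(k)}
\;\Longrightarrow\;
x_i^{(k)} = (1 - \alpha \lambda_i)^{k}\, x_i^{(0)}
```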
IDS Lab
Jamie Seol
Preface
• So, gradient descent converges only if

• for all i

• In summary, it converges when

• And the optimal step size is

• where 𝜎(A) denotes the spectral radius of A, i.e. the maximal absolute value among the eigenvalues [homework: the n = 1 case] (all written out below)
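• written out (λmin, λmax are the extreme eigenvalues; since A is positive definite, 𝜎(A) = λmax, and the optimal rate is the standard quadratic result):

```latex
\big| 1 - \alpha \lambda_i \big| < 1 \;\; \forall i
\;\Longleftrightarrow\;
0 < \alpha < \frac{2}{\sigma(A)}, \qquad
\alpha^{\star} = \frac{2}{\lambda_{\min} + \lambda_{\max}}
```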
IDS Lab
Jamie Seol
Preface (appendix)
• Actually, this result is rather obvious

• Note that A is the curvature of the objective, and the spectral radius, i.e. the largest (absolute) eigenvalue, measures the "stretching" along A's principal axes

• curvature ← see differential geometry

• principal axes ← see linear algebra

• So, it is only natural that the learning rate should stay in a safe range relative to this "stretching", which can be achieved with a simple normalization
IDS Lab
Jamie Seol
Preface
• Similarly, the optimal momentum decay can also be derived, using the condition number 𝜅

• the condition number of a matrix is the ratio between its maximal and minimal (absolute) eigenvalues

• Therefore, if we can bound the spectral radius of the objective's curvature, then we can approximate the optimal parameters for gradient descent

• this is the main idea of the YellowFin optimizer (formulas below)
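• the standard heavy-ball tuning on a quadratic (cf. Goh, 2017, in the references):

```latex
\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}, \qquad
\beta^{\star} = \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{2}, \qquad
\alpha^{\star} = \left( \frac{2}{\sqrt{\lambda_{\max}} + \sqrt{\lambda_{\min}}} \right)^{2}
```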
IDS Lab
Jamie Seol
Preface
• So what?

• We pretty much know the behavior of gradient descent well

• if the objective is a smooth, strongly convex quadratic…

• but the objectives of deep learning are not that nice!

• We just don't really know the characteristics of deep learning objective functions yet

• more research is required
IDS Lab
Jamie Seol
Preface 2
• Here’s a typical linear regression problem

• If the number of features d is bigger than the number of samples
m, than it is underdetermined system

• So it has (possibly infinitely) many solutions

• Let’s use stochastic gradient descent (SGD)

• which solution will SGD find?
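• the setting, in symbols (notation ours):

```latex
\min_{w \in \mathbb{R}^{d}} \; \tfrac{1}{2} \lVert X w - y \rVert_2^2,
\qquad X \in \mathbb{R}^{m \times d}, \;\; y \in \mathbb{R}^{m}, \;\; d > m
```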
IDS Lab
Jamie Seol
Preface 2
• Actually, we’ve already discussed about this in the previous
seminar

• Anyway, even if the system is underdetermined, SGD always
converges to some unique solution which belongs to span of X
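• why: each SGD step moves along one data point, so when initialized at w(0) = 0 (assumed here) the iterates never leave that span; a sketch:

```latex
w^{(k+1)} = w^{(k)} - \alpha_k \big( x_{i_k}^{\top} w^{(k)} - y_{i_k} \big)\, x_{i_k}
\;\;\Longrightarrow\;\;
w^{(k)} \in \operatorname{span}\{x_1, \dots, x_m\}
```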
IDS Lab
Jamie Seol
Preface 2
• Moreover, experiments show that SGD's solution has a small norm

• We know that l2-regularization helps generalization

• l2-regularization: keeping the parameters' norm small

• So, we can say that SGD performs implicit regularization

• but there's also evidence that l2-regularization does not help at all…

• see the previous seminar presented by me

• it works, but actually it kind of doesn't; it's still on the good side, but not all that good…
IDS Lab
Jamie Seol
Introduction
• In summary,

• adaptive gradient descent methods

• might be poor

• at generalization
IDS Lab
Jamie Seol
Preliminaries
• Famous non-adaptive gradient descent methods:

• Stochastic Gradient Descent [SGD]

• Heavy-Ball [HB] (Polyak, 1964)

• Nesterov’s Accelerated Gradient [NAG] (Nesterov, 1983)
IDS Lab
Jamie Seol
Preliminaries
• Adaptive methods can be summarized as:

• AdaGrad (Duchi, 2011)

• RMSProp (Tieleman and Hinton, 2012, from a Coursera lecture!)

• Adam (Kingma and Ba, 2015)

• In short, these methods adaptively change the learning rate and momentum decay (a sketch follows)
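• as a minimal sketch of the per-coordinate rescaling these methods perform, here is the Adam update in NumPy (a hedged sketch; variable names ours, k counts steps from 1):

```python
import numpy as np

def adam_step(w, grad, m, v, k, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2015): the step is rescaled per
    coordinate by running estimates of the gradient's first/second moments."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum-like average)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (per-coordinate scale)
    m_hat = m / (1 - beta1 ** k)             # bias corrections for the zero init
    v_hat = v / (1 - beta2 ** k)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```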
IDS Lab
Jamie Seol
Preliminaries
• All together
IDS Lab
Jamie Seol
Synopsis
• For a system with multiple solutions, which solution does an algorithm find, and how well does it generalize to unseen data?

• Claim: there exists a constructive problem (dataset) in which

• non-adaptive methods work well and

• find a solution with good generalization power

• adaptive methods work poorly

• find a solution with poor generalization power

• we can even make this arbitrarily poor, while the non-adaptive solution keeps working
IDS Lab
Jamie Seol
Problem settings
• Think of a simple binary least-squares classification problem

• When d > n, if there is an optimum with loss 0, then there are infinitely many optima

• But as shown in Preface 2, SGD converges to the unique solution

• known to be the minimum norm solution (closed form below)

• which generalizes well

• why? because here, it's also the largest margin solution

• All the other non-adaptive methods also converge to the same solution
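• the minimum norm solution in closed form (assuming X has full row rank; notation ours):

```latex
w_{\mathrm{SGD}} = X^{\top} (X X^{\top})^{-1} y
= \operatorname*{arg\,min}_{w} \; \lVert w \rVert_2 \;\; \text{s.t.} \;\; X w = y
```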
IDS Lab
Jamie Seol
Lemma
• Let sign(x) denote a function that maps each component of x to its
sign

• ex) sign([2, -3]) = [1, -1]

• If there exists a solution proportional to sign(Xᵀy), this is precisely the unique solution to which all adaptive methods converge

• quite an interesting lemma!

• pf) by induction

• Note that this solution is just:

• the mean of the positively labeled vectors minus the mean of the negatively labeled vectors (in symbols below)
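• in symbols: the vector whose sign is taken is the difference of the class sums (up to scaling, of the class means):

```latex
X^{\top} y = \sum_{i=1}^{n} y_i\, x_i
= \sum_{i:\, y_i = +1} x_i \;-\; \sum_{i:\, y_i = -1} x_i
```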
IDS Lab
Jamie Seol
Funny dataset
• Let’s fool adaptive methods

• first, assign yi to 1 with probability p > 1/2

• when y = [-1, -1, -1, -1]

• when y = [1, 1, 1, 1]
IDS Lab
Jamie Seol
Funny dataset
• Note that for such a dataset, the only discriminative feature is the
first one!

• if y = [1, -1, -1, 1, -1] then X becomes:
IDS Lab
Jamie Seol
Funny dataset
• Let and assume b > 0 (p > 1/2)

• Suppose , then
IDS Lab
Jamie Seol
Funny dataset
• So, holds!

• Take a closer look

• the first three entries are all 1, and the remaining entries contribute 0 on new data

• this solution is bad!

• it will classify every new data point as the positive class!!!

• what horrible generalization!
IDS Lab
Jamie Seol
Funny dataset
• How about the non-adaptive methods?

• So, when , the solution makes no errors

• wow
IDS Lab
Jamie Seol
Funny dataset
• Think this is too extreme?

• Well, even in real datasets, the following are rather common:

• a few frequent features (j = 2, 3)

• some features that are good indicators, but hard to identify (j = 1)

• many other sparse features (the rest)
IDS Lab
Jamie Seol
Experiments
• (the authors said that they downloaded the models from the internet…)

• Results in summary:

• adaptive methods generalize poorly

• even when they reach a lower training loss than the non-adaptive ones!!!

• adaptive methods look fast, but that's it

• adaptive methods promise "no more tuning", but tuning the initial values still mattered a lot

• and it takes as much time as tuning the non-adaptive ones…
IDS Lab
Jamie Seol
Experiments
• CIFAR-10

• use non-adaptive
IDS Lab
Jamie Seol
Experiments
• lower training loss, yet more test error (Adam vs HB)
IDS Lab
Jamie Seol
Experiments
• Character-level language model

• AdaGrad looks very fast, but in the end it is not good

• surprisingly, RMSProp closely trails SGD on the test set
IDS Lab
Jamie Seol
Experiments
• Parsing

• well, it is true that non-adaptive methods are slow
IDS Lab
Jamie Seol
Conclusion
• Adaptive methods are not advantageous for optimization

• They might be fast, but they generalize poorly

• then why is Adam so popular?

• because it's popular…?

• especially, it is known to be popular in GANs and Q-learning

• but these are not exactly optimization problems

• we don't know much about the nature of the objectives in those two yet
IDS Lab
Jamie Seol
References
• Wilson, Ashia C., et al. "The Marginal Value of Adaptive Gradient Methods in Machine Learning." arXiv preprint arXiv:1705.08292 (2017).
• Zhang, Jian, Ioannis Mitliagkas, and Christopher Ré. "YellowFin and the Art of Momentum Tuning." arXiv preprint arXiv:1706.03471 (2017).
• Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).
• Polyak, Boris T. "Some methods of speeding up the convergence of iteration methods." USSR Computational Mathematics and Mathematical Physics 4.5 (1964): 1-17.
• Goh, "Why Momentum Really Works", Distill, 2017. http://doi.org/10.23915/distill.00006
