Regularization
Yow-Bang (Darren) Wang
8/1/2013
Outline
● VC dimension & VC bound – Frequentist viewpoint
● L1 regularization – An intuitive interpretation
● Model parameter prior – Bayesian viewpoint
● Early stopping – Also a regularization
● Conclusion
VC dimension & VC bound
– Frequentist viewpoint
Regularization
● (My) definition: Techniques to prevent overfitting
● Frequentists’ viewpoint:
○ Regularization = suppress model complexity
○ “Usually” done by inserting a term representing model complexity into the objective function, weighted against the training error by a trade-off weight (a generic form is sketched below)
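A generic sketch of such an objective, with notation chosen here for illustration (Etrain is the training error, Ω measures model complexity, λ is the trade-off weight):

```latex
% Regularized training objective: training error plus weighted model-complexity term
\min_{\mathbf{w}} \;\; E_{\text{train}}(\mathbf{w}) \;+\; \lambda \, \Omega(\mathbf{w})
```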
VC dimension & VC bound
● Why suppress model complexity?
○ A theoretical bound on the testing error, the Vapnik–Chervonenkis (VC) bound, states the following (one common form is sketched below):
● To reduce the testing error, we prefer:
○ Low training error (Etrain ↓)
○ Big data (N ↑)
○ Low model complexity (dVC ↓)
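One common form of the bound (following Learning From Data; the exact expression on the slide is not recoverable from the transcript): with probability at least 1 − δ over the draw of N training examples,

```latex
% VC generalization bound: testing error vs. training error, sample size N, and d_VC
E_{\text{test}} \;\le\; E_{\text{train}} \;+\; \sqrt{\frac{8}{N}\,\ln\!\frac{4\big((2N)^{d_{VC}} + 1\big)}{\delta}}
```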
VC dimension & VC bound
● dVC : VC dimension
○ We say a hypothesis set H has dVC ≥ N iff there exists a certain set of N instances that can be binary-classified into any combination of class labels (i.e. shattered) by H.
● Example: H = {straight lines in 2D space}
(Figure: example labelings of points as 1 or 0, each realized by a straight-line classifier ……)
VC dimension & VC bound
● dVC : VC dimension
○ We say a hypothesis set H has dVC ≥ N iff there exists a certain set of N instances that can be binary-classified into any combination of class labels (i.e. shattered) by H.
● Example: H = {straight lines in 2D space}
○ N=2: {0,0}, {0,1}, {1,0}, {1,1}
VC dimension & VC bound
● dVC : VC dimension
○ We say a hypothesis set H has dVC ≥ N iff there exists a certain set of N instances that can be binary-classified into any combination of class labels (i.e. shattered) by H.
● Example: H = {straight lines in 2D space}
○ N=2: {0,0}, {0,1}, {1,0}, {1,1}
○ N=3: {0,0,0}, {0,0,1},……, {1,1,1}
VC dimension & VC bound
● dVC : VC dimension
○ We say a hypothesis set H has dVC ≥ N iff there exists a certain set of N instances that can be binary-classified into any combination of class labels (i.e. shattered) by H.
● Example: H = {straight lines in 2D space}
○ N=2: {0,0}, {0,1}, {1,0}, {1,1}
○ N=3: {0,0,0}, {0,0,1},……, {1,1,1}
○ N=4: fails, e.g. in the XOR-like case where diagonally opposite points share the same label
Regularization – Frequentist viewpoint
● In general, more model parameters
↔ higher VC dimension
↔ higher model complexity
↔ looser VC bound on the testing error (higher risk of overfitting)
Regularization – Frequentist viewpoint
● ……Therefore, reduce model complexity
↔ reduce VC dimension
↔ reduce number of free parameters
↔ reduce the L-0 norm of the parameter vector
↔ sparsity of parameters!
Regularization – Frequentist viewpoint
● The L-p norm of a K-dimensional vector x (a small numeric check follows this list):
1. L-2 norm: $\|x\|_2 = \big(\sum_{k=1}^{K} x_k^2\big)^{1/2}$
2. L-1 norm: $\|x\|_1 = \sum_{k=1}^{K} |x_k|$
3. L-0 norm: defined as the number of non-zero entries of x
4. L-∞ norm: $\|x\|_\infty = \max_k |x_k|$
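A minimal sketch checking these definitions numerically with NumPy (the vector values are arbitrary, chosen for illustration):

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0, 0.0])   # arbitrary example vector

l2   = np.sqrt(np.sum(x ** 2))        # L-2 norm: sqrt of sum of squares -> 5.0
l1   = np.sum(np.abs(x))              # L-1 norm: sum of absolute values -> 7.0
l0   = np.count_nonzero(x)            # "L-0 norm": number of non-zero entries -> 2
linf = np.max(np.abs(x))              # L-inf norm: largest absolute value -> 4.0

print(l2, l1, l0, linf)
```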
Regularization – Frequentist viewpoint
● However, since the L-0 norm is hard to incorporate into the objective function (it is discontinuous and non-convex), we turn to other, more tractable L-p norms
● E.g. the linear SVM objective: $\min_{w} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \max(0,\, 1 - y_i\, w^\top x_i)$
○ The first term is the L-2 regularization (a.k.a. the large-margin term), C is the trade-off weight, and the sum is the hinge loss
● Linear SVM = hinge loss + L-2 regularization! (A numeric sketch of this objective follows.)
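A minimal NumPy sketch of this objective; the data and weight vector are arbitrary placeholders, and the code only evaluates the objective rather than training the SVM:

```python
import numpy as np

def linear_svm_objective(w, X, y, C=1.0):
    """Soft-margin linear SVM primal: 0.5*||w||^2 + C * sum of hinge losses."""
    margins = y * (X @ w)                     # y_i * w^T x_i for each sample
    hinge = np.maximum(0.0, 1.0 - margins)    # hinge loss per sample
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)

# Toy data: 4 points in 2D with labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([0.5, 0.5])                      # arbitrary weight vector

print(linear_svm_objective(w, X, y, C=1.0))
```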
L1 regularization
– An intuitive interpretation
L1 Regularization – An Intuitive Interpretation
● Now we know we prefer sparse parameters
○ ↔ small L-0 norm
● ……but why do people say that minimizing the L1 norm induces sparsity?
● “For most large underdetermined systems of linear equations, the minimal L1-norm solution is also the sparsest solution”
○ Donoho, David L., Communications on Pure and Applied Mathematics, 2006. (An empirical sketch of this effect follows.)
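A small empirical sketch, assuming scikit-learn is available: fit Lasso (L1-regularized) and Ridge (L2-regularized) regression on the same synthetic data whose true coefficient vector is sparse, and count exact zeros in the learned coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 100, 50
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]     # only 5 of 50 true coefficients are non-zero
y = X @ w_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)           # L1-regularized least squares
ridge = Ridge(alpha=0.1).fit(X, y)           # L2-regularized least squares

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))   # typically many exact zeros
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))   # typically none exactly zero
```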
L1 Regularization – An Intuitive Interpretation
● An intuitive interpretation: the L-p norm encodes our preference over parameters
○ L-2 norm: the equal-preference lines in parameter space are circles centered at the origin
○ L-1 norm: the equal-preference lines are diamonds, with tips on the coordinate axes
(Figure: equal-preference lines in the parameter space)
L1 Regularization – An Intuitive Interpretation
● Intuition: with L1 regularization, the minimal training error is more likely to occur at one of the tip points of the preference lines
○ Assume the equal-training-error lines are concentric circles ……
(Figure: equal-training-error lines; the optimal solution sits at a tip of the L1 diamond)
L1 Regularization – An Intuitive Interpretation
● Intuition: with L1 regularization, the minimal training error is more likely to occur at one of the tip points of the preference lines
○ Assume the equal-training-error lines are concentric circles ……
L1 Regularization – An Intuitive Interpretation
● Intuition: with L1 regularization, the minimal training error is more likely to occur at one of the tip points of the preference lines
○ Assume the equal-training-error lines are concentric circles. Then the minimal training error occurs at a tip point iff the center of those circles lies in the shaded areas of the figure, which cover a relatively large portion of the parameter space. (A one-dimensional sketch of the same effect follows.)
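A one-dimensional sketch of why L1 produces exact zeros while L2 only shrinks, assuming a simple quadratic training error (a − w)² centered at an arbitrary point a:

```python
import numpy as np

def argmin_l1(a, lam):
    """Minimize (a - w)^2 + lam*|w| over w: the soft-thresholding operator."""
    return np.sign(a) * max(abs(a) - lam / 2.0, 0.0)

def argmin_l2(a, lam):
    """Minimize (a - w)^2 + lam*w^2 over w: pure shrinkage, never exactly zero unless a = 0."""
    return a / (1.0 + lam)

for a in [2.0, 0.4, 0.1]:
    print(a, "-> L1:", argmin_l1(a, lam=1.0), " L2:", argmin_l2(a, lam=1.0))
# With lam = 1.0, L1 sets w exactly to 0 whenever |a| <= 0.5, while L2 only scales a down.
```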
Model parameter prior
– Bayesian viewpoint
Regularization – Bayesian viewpoint
● Bayesian: model parameters are probabilistic.
● Frequentist: model parameters are deterministic.
(Diagram: “Fixed yet unknown universe” → “Sampling” → “Given observation” → “Estimate parameters”; “Unknown universe” → “Sampling” → “Random observation” → “Estimate parameters assuming the universe is a certain type of model”)
Regularization – Bayesian viewpoint
● To conclude:
              Data       Model parameter
Bayesian      Fixed      Variable
Frequentist   Variable   Fixed yet unknown
Regularization – Bayesian viewpoint
● E.g. L-2 regularization
● Assume the parameters w are drawn from a Gaussian distribution with zero mean and identity covariance:
(Figure: equal-probability lines in the parameter probability space are circles centered at the origin)
Regularization – Bayesian viewpoint
● E.g. L-2 regularization
● Assume the parameters w are drawn from a Gaussian distribution with zero mean and identity covariance (the resulting MAP objective is sketched below):
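A sketch of the standard derivation under this assumption (the slide’s own equations are not recoverable from the transcript): with a zero-mean, identity-covariance Gaussian prior, maximizing the posterior over w is the same as minimizing the training loss plus an L-2 penalty.

```latex
% Gaussian prior p(w) \propto \exp(-\|w\|_2^2 / 2)  =>  MAP estimate = loss + L-2 penalty
\hat{\mathbf{w}}_{\mathrm{MAP}}
  = \arg\max_{\mathbf{w}} \, p(\mathbf{w} \mid \mathcal{D})
  = \arg\max_{\mathbf{w}} \, p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})
  = \arg\min_{\mathbf{w}} \, \big[ -\log p(\mathcal{D} \mid \mathbf{w}) \big] \;+\; \tfrac{1}{2}\,\|\mathbf{w}\|_2^2
```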
Early stopping
– Also a regularization
Early Stopping
● Early stopping – stop training before the training objective reaches its optimum
● Often used in MLP training
● An intuitive interpretation (a code sketch follows this list):
○ Training iteration ↑
○ → number of updates of weights ↑
○ → number of active (far from 0) weights ↑
○ → complexity ↑
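A minimal sketch of validation-based early stopping, the usual practical form; `train_one_epoch` and `validation_loss` are hypothetical placeholders for the user's own training and evaluation routines:

```python
def fit_with_early_stopping(model, train_one_epoch, validation_loss,
                            max_epochs=100, patience=5):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    best_state = None
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)              # one pass over the training data
        loss = validation_loss(model)       # held-out loss after this epoch

        if loss < best_loss:
            best_loss = loss
            best_state = model.copy()       # assumes the model exposes a copy() snapshot
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                       # stop before the training optimum

    return best_state if best_state is not None else model
```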
Early Stopping
● Theoretical proof (following Collobert & Bengio, 2004):
○ Consider a perceptron trained with the hinge loss
○ Assume the optimal separating hyperplane is w*, with maximal margin ρ*
○ Denote the weight vector at the t-th iteration as wt, with margin ρt
○ (The standard inequalities this style of argument rests on are sketched below.)
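The equations on the following proof slides are not recoverable from the transcript; as a hedged sketch, using the notation introduced above (w*, ρ*, wt, ρt), the two standard inequalities that perceptron-style margin arguments rest on are, assuming updates wt+1 = wt + η yi xi are made only on margin violations, ‖w*‖ = 1, w0 = 0, and all data lie within radius R:

```latex
% 1. Each update makes progress along the optimal direction w*:
\mathbf{w}_{t+1} \cdot \mathbf{w}^* \;\ge\; \mathbf{w}_t \cdot \mathbf{w}^* + \eta\,\rho^*
\quad\Rightarrow\quad
\mathbf{w}_t \cdot \mathbf{w}^* \;\ge\; t\,\eta\,\rho^*

% 2. The weight norm grows slowly (R: radius of the data distribution),
%    since y_i\,\mathbf{w}_t \cdot \mathbf{x}_i < 1 whenever an update is triggered:
\|\mathbf{w}_{t+1}\|^2 \;\le\; \|\mathbf{w}_t\|^2 + 2\eta + \eta^2 R^2
```

Bounds of this kind relate the margin ρt after t updates to the learning rate η and the number of updates, which is what the conclusion on the later slides draws on.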
Early Stopping
● 1. ∵
Early Stopping
● 1.  2.  (R: radius of the data distribution)
Early Stopping
● 1.  2.  →  (R: radius of the data distribution)
Early Stopping
● Small learning rate → Large margin
● Small number of updates → Large margin
→ Early Stopping!!!
Early Stopping
(Figure slides; axis label: training iteration ↑)
Conclusion
Conclusion
● Regularization: Techniques to prevent overfitting
○ L1-norm: Sparsity of parameters
○ L2-norm: Large Margin
○ Early stopping
○ ……etc.
● The philosophy of regularization
○ Occam’s razor: “Entities must not be multiplied beyond necessity.”
Reference
● Learning From Data - A Short Course
○ Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin
● Ronan Collobert, Samy Bengio, “Links Between Perceptrons, MLPs and SVMs”, in Proc. ICML, 2004.
