Learning Sparse Neural Networks using
L0 Regularization
- Varun Reddy G
Neural Networks
• Very good, flexible function approximators
• Scale well
Some problems:
1. Highly overparameterized
2. Can easily overfit
One of the solutions:
Model compression and sparsification
• A typical Lp-regularized loss looks like

R(θ) = (1/N) Σᵢ L(h(xᵢ; θ), yᵢ) + λ ||θ||_p

where ||θ||_p is the Lp norm, L(·) is the loss function, and h(xᵢ; θ) is the model's prediction.
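As a rough illustration (not from the slides), here is a minimal PyTorch-style sketch of such a loss; lp_regularized_loss, lam, and p are illustrative names and values:

```python
import torch

def lp_regularized_loss(data_loss, params, lam=1e-3, p=2.0):
    # R(theta) = L(theta) + lam * ||theta||_p  (lam and p are illustrative)
    lp_norm = sum(w.abs().pow(p).sum() for w in params) ** (1.0 / p)
    return data_loss + lam * lp_norm
```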
• The L0 norm essentially counts the number of non-zero parameters in the model: ||θ||_0 = Σⱼ I[θⱼ ≠ 0]
• It penalizes all non-zero values equally, unlike other Lp norms, which penalize according to the magnitude of θⱼ and therefore shrink larger values more
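A toy example of the difference, with made-up values:

```python
import torch

theta = torch.tensor([0.0, 0.5, -2.0, 0.0, 3.0])
l0 = (theta != 0).sum()   # tensor(3): counts non-zeros, ignores magnitude
l1 = theta.abs().sum()    # tensor(5.5): larger values are penalized more
```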
So the error function now looks like this:

R(θ) = (1/N) Σᵢ L(h(xᵢ; θ), yᵢ) + λ ||θ||_0

But this function is computationally intractable, given its non-differentiability and the combinatorial nature of the 2^|θ| possible states of the parameter vector θ.
So we reformulate to try and make it continuous.
• Consider the following re-parameterization:

θⱼ = θ̃ⱼ zⱼ,  zⱼ ∈ {0, 1},  ||θ||_0 = Σⱼ zⱼ

where zⱼ is a binary gate representing whether the parameter is present or not.
Now, if we take q(zⱼ | πⱼ) = Bern(πⱼ), where πⱼ is the probability of zⱼ = 1, then we can reformulate the loss on average as

R(θ̃, π) = E_{q(z|π)} [ (1/N) Σᵢ L(h(xᵢ; θ̃ ⊙ z), yᵢ) ] + λ Σⱼ πⱼ

The second term is now easy to minimize, but the first term is difficult to optimize due to the discrete nature of z.
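A small sketch of these gated weights, assuming PyTorch; theta_tilde and logits are illustrative names:

```python
import torch

theta_tilde = torch.randn(5)                 # free weight values
logits = torch.zeros(5, requires_grad=True)  # parameterize gate probabilities
pi = torch.sigmoid(logits)                   # pi_j = P(z_j = 1)

z = torch.bernoulli(pi)       # hard 0/1 gates; sampling blocks gradients to pi
theta = theta_tilde * z       # theta_j = theta_tilde_j * z_j
expected_l0 = pi.sum()        # E[||theta||_0] = sum_j pi_j
```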
Let s be a continuous random variable with distribution q(s), and let the z's be given by a hard-sigmoid rectification of s.
Hard-sigmoid:
g(·) = min(1, max(0, ·))
So z is now given by
z = min(1, max(0, s))
This is equivalent to

z = 0  if s ≤ 0
    s  if 0 < s < 1
    1  if s ≥ 1
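In code, the hard-sigmoid rectification is just a clamp (a sketch):

```python
import torch

def hard_sigmoid(s):
    # g(s) = min(1, max(0, s)): exact 0 for s <= 0, exact 1 for s >= 1
    return s.clamp(min=0.0, max=1.0)
```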
Looking at the loss function, we have to penalize all non-zero θ, so the second term becomes the probability of each gate being non-zero, i.e. P(sⱼ > 0) = 1 − Q(0 | φⱼ), where Q is the CDF of s.
Substituting these,
• our loss function becomes

R(θ̃, φ) = E_{q(s|φ)} [ (1/N) Σᵢ L(h(xᵢ; θ̃ ⊙ g(s)), yᵢ) ] + λ Σⱼ (1 − Q(sⱼ ≤ 0 | φⱼ))

where g(s) is our hard-sigmoid function.
Re-parameterization Trick
We can choose q(s), with parameters φ, so that it admits the re-parameterization trick: the loss is expressed as an expectation over a parameter-free noise distribution p(ϵ), with s produced by a deterministic, differentiable transformation f(·) of the parameters φ and the noise ϵ, i.e. s = f(φ, ϵ).
Therefore, the objective now becomes

R(θ̃, φ) = E_{p(ϵ)} [ (1/N) Σᵢ L(h(xᵢ; θ̃ ⊙ g(f(φ, ϵ))), yᵢ) ] + λ Σⱼ (1 − Q(sⱼ ≤ 0 | φⱼ))
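To make the trick concrete, here is the classic Gaussian example (not the distribution used in this paper, which comes next); mu and sigma are illustrative:

```python
import torch

mu = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)

eps = torch.randn(1000)   # parameter-free noise: eps ~ N(0, 1)
s = mu + sigma * eps      # s = f(phi, eps), deterministic and differentiable
loss = (s ** 2).mean()
loss.backward()           # gradients flow to mu and sigma through f
```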
Choosing q(s)
We are free to choose q(s); something that works well in practice is a binary concrete random variable, distributed in (0, 1) with probability density q_s(s | φ) and cumulative density Q_s(s | φ).
The parameters of this distribution are φ = (log α, β), where log α is the location and β is the temperature.
We stretch this distribution to an interval (γ, ζ) with γ < 0 and ζ > 1, and apply the hard-sigmoid to its random samples.
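A sketch of sampling such a stretched ("hard concrete") gate, following the paper's sampling procedure; the defaults β = 2/3, γ = −0.1, ζ = 1.1 are the values the paper reports:

```python
import torch

def sample_hard_concrete(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    u = torch.rand_like(log_alpha)                  # u ~ Uniform(0, 1)
    s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / beta)
    s_bar = s * (zeta - gamma) + gamma              # stretch to (gamma, zeta)
    return s_bar.clamp(0.0, 1.0)                    # hard-sigmoid -> gate z
```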
• So, with the above changes, the objective function (Eq. 9 in the paper) is

R(θ̃, φ) = E_{p(ϵ)} [ (1/N) Σᵢ L(h(xᵢ; θ̃ ⊙ g(f(φ, ϵ))), yᵢ) ] + λ Σⱼ Sigmoid(log αⱼ − β log(−γ/ζ))
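The penalty term has this closed form under the stretched concrete distribution; a sketch, per the paper's derivation:

```python
import math
import torch

def expected_l0_penalty(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    # P(z_j != 0) = 1 - Q(s_j <= 0) = Sigmoid(log_alpha - beta * log(-gamma/zeta))
    return torch.sigmoid(log_alpha - beta * math.log(-gamma / zeta)).sum()
```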
Results
(Result figures from the paper; see the original paper linked in Resources.)
Summary
1. Force the network weights to become exact zeros.
2. To remove the non-differentiability, re-parameterize the weights with binary gates.
3. To make the objective function continuous, and to keep the sampling step out of the main network, use the re-parameterization trick.
4. Learn the parameters φ of q(s) and use them at inference time via the deterministic gate (sketched below):

ẑ = min(1, max(0, Sigmoid(log α)(ζ − γ) + γ))
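A sketch of that deterministic test-time gate: the noise is dropped and the mean transform is rectified, yielding exact zeros for pruned weights:

```python
import torch

def inference_gate(log_alpha, gamma=-0.1, zeta=1.1):
    # z_hat = min(1, max(0, Sigmoid(log_alpha) * (zeta - gamma) + gamma))
    s_bar = torch.sigmoid(log_alpha) * (zeta - gamma) + gamma
    return s_bar.clamp(0.0, 1.0)   # exact 0 for sufficiently negative log_alpha

# e.g. theta = theta_tilde * inference_gate(log_alpha)
```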
Resources
• Numenta Journal Club: https://www.youtube.com/watch?v=HD2uvsAEZFM
• Original Paper: https://arxiv.org/abs/1712.01312