Vanishing Gradients – What?
1. “Vanishing” means disappearing. Vanishing gradients means that the error gradients become so small that the weights barely update
(refer to the gradient descent equation). As a result, training stalls and convergence is not achieved.
2. Before going further, note that when we multiply two numbers that both lie between 0 and 1, the product is smaller than either
input (for example, 0.5 × 0.2 = 0.1).
3. Let’s assume the network shown on the next page, with sigmoid activation used in every layer. Sigmoid limits its output to between
0 and 1 (tanh limits it to between -1 and 1). The derivative of sigmoid lies between 0 and 0.25, so any number multiplied by it
shrinks in absolute terms, as seen in step 2. The derivative of tanh lies between 0 and 1, so it cannot enlarge the product either.
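The shrinking effect in steps 2 and 3 can be checked numerically. The sketch below is a minimal illustration in plain Python; the helper names and the layer counts (1, 5, 10) are chosen here for illustration and do not come from the slides.

```python
import math

def sigmoid(z):
    """Logistic sigmoid; output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the sigmoid: s * (1 - s), at most 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_derivative(z):
    """Derivative of tanh: 1 - tanh(z)^2, at most 1."""
    return 1.0 - math.tanh(z) ** 2

# The sigmoid derivative peaks at z = 0.
peak = sigmoid_derivative(0.0)

# Best case for a chain of n sigmoid layers: every derivative equals the peak.
# Even then, the product shrinks geometrically with depth.
best_case_product = {n: peak ** n for n in (1, 5, 10)}

for n, value in best_case_product.items():
    print(f"{n:2d} layers: product of derivatives <= {value:.3e}")
```

Away from z = 0 the derivatives are smaller than the peak, so in practice the product shrinks faster than this best case.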
Vanishing Gradients
Vanishing Gradients – How to Avoid?
1. Reason: Compare the equation for the gradient of the error w.r.t. w17 with the one for the gradient w.r.t. w23. Computing the gradient
w.r.t. w17 (a weight in an initial layer) requires multiplying many more terms than computing the gradient w.r.t. w23 (a weight in a
later layer). The terms in these gradients that are partial derivatives of the activation take values between 0 and 0.25 (refer to
point 3). Because the gradients for the initial layers contain more of these terms smaller than 1, the vanishing gradient effect is most
prominent in the initial layers of the network. The number of terms required to compute the gradient w.r.t. w1, w2, etc. is quite
high.
Resolution: One way to reduce the chance of a vanishing gradient problem is to use activations whose derivative is not limited to values less
than 1. We can use ReLU activation: ReLU’s derivative for positive inputs is 1. The issue with ReLU is that its derivative for negative inputs
is 0, which makes the contribution of some nodes 0. This can be managed by using Leaky ReLU instead.
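To make the comparison concrete, here is a small sketch of the ReLU and Leaky ReLU derivatives and of what happens when they are multiplied along a path through several layers. The slope alpha = 0.01 and the pre-activation values are assumptions picked for illustration, not values from the slides.

```python
def relu_derivative(z):
    """1 for positive inputs, 0 otherwise (the value at z = 0 is a convention)."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_derivative(z, alpha=0.01):
    """1 for positive inputs, a small slope alpha otherwise (alpha is an assumed value)."""
    return 1.0 if z > 0 else alpha

def product_of_derivatives(derivative, pre_activations):
    """Multiply the activation derivatives along one path through the layers."""
    product = 1.0
    for z in pre_activations:
        product *= derivative(z)
    return product

# Hypothetical pre-activation values along one path, one per layer.
all_positive = [0.5, 1.2, 0.3, 2.0, 0.8]
one_negative = [0.5, -1.2, 0.3, 2.0, 0.8]

print(product_of_derivatives(relu_derivative, all_positive))        # 1.0: nothing shrinks
print(product_of_derivatives(relu_derivative, one_negative))        # 0.0: the path is cut off
print(product_of_derivatives(leaky_relu_derivative, one_negative))  # 0.01: small but non-zero
```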
Vanishing Gradients – How to Avoid?
2. Reason: The first problem we discussed was the use of activations whose derivatives are small. The second problem is weights that are
initialized to very small values. We can understand this from the simple network shown on the previous page. The equation for the
error gradient w.r.t. w1 includes the value of w5 as well. Hence, if w5 is initialized to a very small value, it also plays a role in making the
gradient w.r.t. w1 smaller, i.e. in causing a vanishing gradient.
We can also say that vanishing gradient problems are more prominent in deep networks. This is because the number of multiplicative terms needed
to compute the gradient for the initial layers of a deep network is very high.
Resolution: As the gradient equations show, both the derivative of the activation function and the weights play a role in causing vanishing
gradients, because both appear in the equation for the error gradient. We need to initialize the weights properly to avoid the vanishing
gradient problem. This is discussed further in the weight initialization strategy section.
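The role of the weights can be illustrated with a toy chain of one sigmoid neuron per layer. This is a hypothetical setup chosen to keep the arithmetic visible, not the network from the slides: with a_0 = x and a_k = sigmoid(w_k * a_{k-1}), the chain rule gives the gradient of the output w.r.t. w_1 as a product that contains every later weight w_2, ..., w_n together with every activation derivative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_wrt_first_weight(weights, x=1.0):
    """d(output)/d(w_1) for the chain a_k = sigmoid(w_k * a_{k-1}), with a_0 = x."""
    # Forward pass: store every activation.
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))

    # Backward pass: apply the chain rule from the last layer back to layer 1.
    grad = 1.0
    for k in range(len(weights), 0, -1):
        a_k = activations[k]
        local = a_k * (1.0 - a_k)            # sigmoid'(z_k)
        if k > 1:
            grad *= local * weights[k - 1]   # d a_k / d a_{k-1} = sigmoid'(z_k) * w_k
        else:
            grad *= local * activations[0]   # d a_1 / d w_1 = sigmoid'(z_1) * a_0
    return grad

# Assumed example values: five layers with very small vs. moderate weights,
# and a deeper chain of ten layers.
tiny_init = grad_wrt_first_weight([0.01] * 5)
moderate_init = grad_wrt_first_weight([1.0] * 5)
deeper = grad_wrt_first_weight([1.0] * 10)

print(f"5 layers, w = 0.01: {tiny_init:.3e}")
print(f"5 layers, w = 1.0 : {moderate_init:.3e}")
print(f"10 layers, w = 1.0: {deeper:.3e}")
```

In this toy chain, the gradient for the first weight is much smaller with the tiny weights than with the moderate ones, and it shrinks again when the chain is made deeper.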
Exploding Gradients – What?
1. “Exploding” means increasing to a large extent. Exploding gradients means that the error gradients become so large that the weight update
is too big in every iteration. This causes the weights to oscillate widely and the error to keep overshooting the minimum. Hence,
convergence becomes hard to achieve.
2. Exploding gradients are caused by large weights in the network.
3. Probable resolutions
   1. Keep the learning rate low to compensate for larger weights
   2. Gradient clipping
   3. Gradient scaling
4. Gradient scaling
   1. For every batch, collect all the gradient vectors computed on that batch.
   2. Find the L2 norm of the concatenated error gradient vector.
      1. If the L2 norm > 1 (1 is used as an example threshold here)
      2. Scale/normalize the gradient terms so that the L2 norm becomes 1
   3. Code example: opt = SGD(learning_rate=0.01, momentum=0.9, clipnorm=1.0)
      (Recent Keras versions name the argument learning_rate; older versions used lr. In Keras, clipnorm limits the norm of each weight's
      gradient separately, while global_clipnorm limits the norm over all gradients together, as described in step 2.)
5. Gradient clipping
   1. For every batch, if the gradient value w.r.t. any weight is outside a range (let’s say -0.5 <= gradient_value <= 0.5), we clip
   the gradient value to the nearest boundary. If the gradient value is 0.6, we clip it to 0.5.
   2. Code example: opt = SGD(learning_rate=0.01, momentum=0.9, clipvalue=0.5)
6. The usual practice is to use the same clipping / scaling values throughout the network.
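The two procedures in points 4 and 5 can be sketched in a few lines of plain Python. This is an illustration of the idea under simple assumptions (the gradient is a flat list of floats, and the function names are made up here); it is not how Keras implements clipnorm or clipvalue internally.

```python
import math

def scale_by_norm(gradients, max_norm=1.0):
    """Gradient scaling: shrink the whole vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)               # already within the limit, leave unchanged
    return [g * max_norm / norm for g in gradients]

def clip_by_value(gradients, limit=0.5):
    """Gradient clipping: force every component into the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in gradients]

# Hypothetical gradient vectors for one batch.
scaled = scale_by_norm([3.0, -4.0])          # L2 norm is 5, so the vector is scaled down to norm 1
clipped = clip_by_value([0.6, -0.9, 0.2])    # 0.6 -> 0.5, -0.9 -> -0.5, 0.2 unchanged

print(scaled)
print(clipped)
```

Scaling keeps the direction of the gradient vector and only changes its length, whereas clipping by value treats each component separately and can therefore change the direction.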

Vanishing & Exploding Gradients


Editor's Notes

  1. Why is BN not applied in batch or stochastic mode? When using ReLU, you can encounter the dying ReLU problem; in that case, use Leaky ReLU with the He initialization strategy (to be covered in the activation function video).