Activation functions such as sigmoid and tanh cause gradients to vanish when neurons saturate, which hinders training. ReLU avoids saturation but can produce dead neurons. Variants such as leaky ReLU, parametric ReLU, and ELU address the dead-neuron problem while remaining computationally efficient. Maxout takes the maximum over multiple linear transformations, avoiding both saturation and dead neurons at the cost of extra parameters. In summary, ReLU and its variants are the most commonly used activations because they allow efficient training of deep networks without severe gradient problems.
4. Sigmoid
σ(z) = 1 / (1 + e^(−z))
• Squashes real values into the range [0, 1]
• ∂σ(z)/∂z = σ(z)(1 − σ(z))
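As a quick illustration, a minimal NumPy sketch (not part of the slides) of the sigmoid and its derivative:

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)): squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d sigma / dz = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(np.array([-5.0, 0.0, 5.0])))       # ~[0.0067, 0.5, 0.9933]
print(sigmoid_grad(np.array([-5.0, 0.0, 5.0])))  # ~[0.0066, 0.25, 0.0066]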
5. Sigmoid
• Squashes real values into the range [0, 1]
• ∂σ(z)/∂z = σ(z)(1 − σ(z))
• ∂y/∂z3 = ∂a3/∂z3 = ∂σ(z3)/∂z3 = σ(z3)(1 − σ(z3))
[Diagram: a chain of sigmoid units, x → z1 → a1 = σ(z1) → z2 → a2 = σ(z2) → z3 → a3 = σ(z3) = y]
If a sigmoid neuron is saturated, σ(z3) ≈ 0 or σ(z3) ≈ 1,
then ∂σ(z3)/∂z3 = σ(z3)(1 − σ(z3)) ≈ 0.
Remember the update rule: w = w − η ∂ℒ/∂w.
Since ∂ℒ/∂w contains the factor ∂σ(z3)/∂z3 ≈ 0, the weight barely changes: the gradient has vanished.
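To see this numerically, here is a small sketch (the saturated pre-activation z3 and the learning rate are made-up illustrative values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z3 = 12.0                                   # saturated: sigma(z3) ~ 1
dy_dz3 = sigmoid(z3) * (1.0 - sigmoid(z3))  # ~6e-6, the vanishing factor

# update rule w = w - eta * dL/dw; dL/dw contains dy/dz3, so w barely moves
eta, w, dL_dy = 0.1, 0.5, 1.0               # illustrative values (assumed)
w_new = w - eta * (dL_dy * dy_dz3)
print(dy_dz3, w, w_new)                     # w_new is essentially unchanged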
6. Why does a sigmoid neuron saturate?
yᵢ = σ(z), where z = Σᵢ xᵢwᵢ
When the inputs or weights are large in magnitude, |z| becomes large, so σ(z) is pushed toward 0 or 1 and the neuron saturates.
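A small sketch (illustrative numbers, not from the slides) showing that large weights push z = Σᵢ xᵢwᵢ far from 0, where the derivative collapses:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
for w in (np.array([0.1, 0.1, 0.1]), np.array([5.0, 5.0, 5.0])):
    z = x @ w                        # weighted sum z = sum_i x_i * w_i
    s = 1.0 / (1.0 + np.exp(-z))     # sigma(z)
    print(z, s, s * (1 - s))         # for z = 30, sigma ~ 1 and sigma' ~ 0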
7. Problem with sigmoid
• Gradient vanishes when the neuron saturates
• Expensive to compute
• Not zero-centered
z3 = a1w1 + a2w2,  y = σ(z3) = σ(a1w1 + a2w2)
∇w1 = (∂ℒ/∂y)(∂y/∂z3)(∂z3/∂w1)
∇w2 = (∂ℒ/∂y)(∂y/∂z3)(∂z3/∂w2)
[Diagram: inputs x1, x2 pass through sigmoid units σ(z1), σ(z2); their outputs a1, a2 feed the output unit σ(z3) = y through weights w1, w2]
8. Problem with sigmoid
z3 = a1w1 + a2w2,  y = σ(z3)
∇w1 = (∂ℒ/∂y)(∂y/∂z3) · a1
∇w2 = (∂ℒ/∂y)(∂y/∂z3) · a2
The common factor (∂ℒ/∂y)(∂y/∂z3) is either −ve or +ve for both weights, while a1, a2 ≥ 0 because they are sigmoid outputs.
So the gradients of the loss w.r.t. all weights into a particular hidden unit are either all +ve or all −ve.
This restricts the possible update directions.
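A sketch of this sign argument (illustrative values; the common factor is simply assumed to be some negative number):

import numpy as np

a = np.array([0.7, 0.2])   # a1, a2 >= 0: sigmoid outputs from the previous layer
common = -0.35             # (dL/dy) * (dy/dz3), shared by both weights
grad = common * a          # [dL/dw1, dL/dw2]
print(grad)                # both negative -> both weights must move the same way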
9. TanH
• Squashes real values into [−1, 1]
• Zero-centered
• ∂tanh(x)/∂x = 1 − tanh²(x)
• Gradient vanishing problem
• Expensive to compute
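For comparison, a minimal NumPy sketch (not from the slides) of tanh and its derivative:

import numpy as np

def tanh_grad(x):
    # d tanh(x) / dx = 1 - tanh(x)^2
    t = np.tanh(x)
    return 1.0 - t ** 2

x = np.array([-3.0, 0.0, 3.0])
print(np.tanh(x))    # zero-centered: ~[-0.995, 0.0, 0.995]
print(tanh_grad(x))  # still vanishes for large |x|: ~[0.0099, 1.0, 0.0099]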
10. ReLU
relu(x) = max(0, x)
• Gradient does not saturate (for x > 0)
• Computationally efficient
• Converges faster than sigmoid and tanh
• Not zero-centered
• Dead-neuron problem
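A minimal sketch (not from the slides) of ReLU and its (sub)gradient:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 if x > 0, 0 otherwise (the subgradient at 0 is taken as 0 here)
    return (x > 0).astype(float)

x = np.array([-2.0, 0.5, 3.0])
print(relu(x), relu_grad(x))   # [0. 0.5 3.] [0. 1. 1.]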
11. Dead Neuron
∂relu(x)/∂x = 1 if x > 0, 0 otherwise

A neuron is said to be dead if its weights are not updated during training.
If z1 = x1w1 + x2w2 + b < 0 for every input (e.g. because b ≪ 0), then
∇w1 = (∂ℒ/∂y)(∂y/∂z1)(∂z1/∂w1) = 0, because ∂y/∂z1 = 0.
Remember: a large number of ReLU neurons die during training if the learning rate is too high.
[Diagram: a single unit y = relu(z1), where z1 = x1w1 + x2w2 + b, with inputs x1, x2, weights w1, w2, and bias b]
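A sketch of a dead neuron (the weights and the very negative bias are made-up illustrative values):

import numpy as np

w, b = np.array([0.5, -0.3]), -100.0        # b << 0
for x in (np.array([1.0, 2.0]), np.array([3.0, -1.0])):
    z1 = x @ w + b                          # < 0 for every input here
    grad_factor = float(z1 > 0)             # d relu(z1) / d z1
    print(z1, grad_factor * x)              # gradient w.r.t. w is all zeros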
12. Leaky ReLU
leaky_relu(x) = max(0.1·x, x)
• No saturation
• No dead-neuron problem
• Computationally efficient
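A minimal sketch (not from the slides), using the slide's slope of 0.1 on the negative side:

import numpy as np

def leaky_relu(x, slope=0.1):
    return np.maximum(slope * x, x)

x = np.array([-4.0, 0.0, 2.0])
print(leaky_relu(x))   # [-0.4  0.   2. ]: negative inputs keep a nonzero gradient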
14. Exponential ReLU (ELU)
ELU(x) = f(x) = x if x > 0, a(e^x − 1) if x ≤ 0
All the benefits of leaky ReLU,
but expensive to compute.
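A minimal sketch (not from the slides; the parameter name alpha is an assumption for the slide's constant a):

import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (e^x - 1) for x <= 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -0.5, 2.0])
print(elu(x))          # smoothly approaches -alpha for very negative x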
15. Maxout activation
maxout(x) = max(xᵀw1 + b1, xᵀw2 + b2)
• No saturation
• No dead-neuron problem
• Increases the number of parameters
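A minimal sketch of one maxout unit (the weight and bias values are made up):

import numpy as np

x = np.array([1.0, -2.0])
w1, b1 = np.array([0.5, 0.3]), 0.1
w2, b2 = np.array([-0.4, 0.8]), -0.2

out = max(x @ w1 + b1, x @ w2 + b2)   # max over two learned affine maps
print(out)                            # max(0.0, -2.2) = 0.0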
16. Conclusion
• Sigmoid and tanh suffer from the vanishing-gradient problem (avoid them in deep networks).
• ReLU is the most popular activation, but it can lead to the dead-neuron problem.
• Variants of ReLU require careful tuning of their hyperparameters.