The ever-increasing number of parameters in deep neural networks poses challenges for memory-limited applications. Regularize-and-prune methods aim at meeting these challenges by sparsifying the network weights. In this context we quantify the output sensitivity to the parameters (i.e. their relevance to the network output) and introduce a regularization term that gradually lowers the absolute value of parameters with low sensitivity. Thus, a very large fraction of the parameters approach zero and are eventually set to zero by simple thresholding. Our method surpasses most of the recent techniques both in terms of sparsity and error rates. In some cases, the method reaches twice the sparsity obtained by other techniques at equal error rates.
1. Introducing sparsity in artificial neural networks: a sensitivity-based approach
ENZO TARTAGLIONE
POSTDOC AT UNIVERSITY OF TORINO
Università degli Studi di Torino
Computer Science Dept.
EIDOS group
2. Deep networks
• High number of hidden layers
• More complex classification tasks (e.g. ImageNet)
• Use of convolutional layers, pooling layers, very large fully-connected layers
• Very high number of parameters (hundreds of millions and even more…)
• Is it possible to boost performance by making the ANN robust to noise?
STATE OF THE ART
3. Size of ANN models vs generalization
4. Approaches to reduce the size of an ANN
Quantization [Zhou et al., 2016] [Han et al., 2015-1]
Modify the architecture [Howard et al., 2017]
Regularize and prune to achieve sparsity
5. Why sparse networks?
Less memory required.
Fewer computational resources needed.
Deployability on embedded devices.
Typical architectures are overparametrized!
6. Some existing pruning strategies…
Design a proxy L0 regularizer [Louizos et al., 2018]
Greedy thresholding after L2+dropout strategy [Han et al., 2015-2]
Grouping for convolutional features [Lebedev and Lempitsky, 2016] [Hadifar et al., 2020]
Dropout-based approaches [Srivastava et al., 2014]
Lasso-based regularizers [Scardapane et al., 2017]
…
7. When is a parameter necessary?
[Figure: a feed-forward network diagram with the input on the left, the output y on the right, and one weight w highlighted along the forward-propagation path.]
If changing w changes the output of the network… we don't want to modify it!
If changing w does not change the output of the network… we are free to change it!
PUBLISHED
Tartaglione, E., Lepsoy, S., Fiandrotti, A., Francini, G. (2018). Learning sparse neural networks via sensitivity-driven regularization. In Advances in Neural Information Processing Systems (NeurIPS 2018).
PRE-TRAINED MODEL
8. Definition of sensitivity
Small perturbation of w:

\Delta y_k \approx \frac{\partial y_k}{\partial w_i}\, \Delta w_i

Sensitivity of the output to the parameter w_i:

S(\boldsymbol{y}, w_i) = \sum_{k=1}^{C} \alpha_k \left| \frac{\partial y_k}{\partial w_i} \right|

where C is the size of the output and \alpha_k is a weighting scalar factor.
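As a concrete illustration, here is a minimal PyTorch-style sketch of how this sensitivity could be estimated for a single input sample; `model`, `x` and the weighting factors `alpha` are placeholders, and the per-output backward loop is just one straightforward (not necessarily efficient) way to obtain the partial derivatives, not the authors' reference code.

```python
import torch

def sensitivity(model, x, alpha):
    """Estimate S(y, w_i) = sum_k alpha_k * |dy_k / dw_i| for every parameter.

    alpha is a sequence of C non-negative weighting factors
    (e.g. the uniform choice alpha_k = 1/C)."""
    y = model(x).flatten()                     # network output with C components
    S = [torch.zeros_like(p) for p in model.parameters()]
    for k in range(y.numel()):                 # one backward pass per output component
        grads = torch.autograd.grad(y[k], list(model.parameters()),
                                    retain_graph=True)
        for s, g in zip(S, grads):
            s += alpha[k] * g.abs()
    return S                                   # one sensitivity tensor per parameter tensor
```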
9. Towards the definition of the update term
We need an insensitivity parameter:

\bar{S}(\boldsymbol{y}, w_i) = 1 - S(\boldsymbol{y}, w_i)

To guarantee that this quantity is always non-negative, a trivial choice is:

\bar{S}_b(\boldsymbol{y}, w_i) = \max\left[0,\ \bar{S}(\boldsymbol{y}, w_i)\right]

[Table on the slide relating the value of S and of w to the importance of the parameter.]
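In code, the insensitivity and its bounded version are a one-line transformation of the sensitivity tensors from the sketch above (illustrative, not the authors' reference implementation):

```python
import torch

def bounded_insensitivity(S):
    """S_bar_b(y, w_i) = max(0, 1 - S(y, w_i)), applied element-wise."""
    return torch.clamp(1.0 - S, min=0.0)
```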
10. Proposed weight update (sensitivity-based regularization)

w_i^{t} := w_i^{t-1} - \eta \left.\frac{\partial L}{\partial w_i}\right|^{t-1} - \lambda\, w_i^{t-1}\, \bar{S}_b\!\left(\boldsymbol{y}, w_i^{t-1}\right)

where
• \bar{S}_b is the bounded insensitivity: a function which states whether the parameter is relevant or not to the computation of the output \boldsymbol{y} of the network.
• \boldsymbol{y} is the output of the network.
• w_i is the parameter.
• L is a generic loss function.
• \eta is the learning rate and \lambda the regularization factor.
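A sketch of a single update step under this rule, assuming `loss.backward()` has already populated the gradients and `S_list` holds the per-parameter sensitivities from the earlier sketch; `eta` and `lam` are hypothetical names for the learning rate and the regularization factor.

```python
import torch

@torch.no_grad()
def update_step(model, S_list, eta=0.1, lam=1e-4):
    """w <- w - eta * dL/dw - lam * w * max(0, 1 - S(y, w))."""
    for p, S in zip(model.parameters(), S_list):
        insensitivity = torch.clamp(1.0 - S, min=0.0)   # bounded insensitivity
        p -= eta * p.grad                                # plain gradient descent on the loss
        p -= lam * p * insensitivity                     # shrink only the insensitive weights
```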
11. Which function are we minimizing?
We need to solve the integral:

R = \int w\, \bar{S}_b(\boldsymbol{y}, w)\, dw

Math, math, math….

R = \Theta\!\left(\bar{S}(\boldsymbol{y}, w)\right) \frac{w^{2}}{2} \left[ 1 - \sum_{k=1}^{C} \alpha_k\, \mathrm{sign}\!\left(\frac{\partial y_k}{\partial w}\right) \sum_{n=1}^{\infty} (-1)^{n+1} \frac{\partial^{n} y_k}{\partial w^{n}} \frac{w^{n-1}}{(n+1)!} \right]

valid for any architecture and any loss function, but for ReLU-activated networks…

R = \frac{w^{2}}{2}\, \bar{S}_b(\boldsymbol{y}, w)
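As a sanity check (my own one-line working, not taken from the slides): treating the bounded insensitivity as locally constant in w, the gradient of the ReLU-case regularizer reproduces exactly the extra term of the update rule,

```latex
\frac{\partial R}{\partial w}
  = \frac{\partial}{\partial w}\left[\frac{w^{2}}{2}\,\bar{S}_b(\boldsymbol{y}, w)\right]
  \approx w\,\bar{S}_b(\boldsymbol{y}, w)
\quad\Longrightarrow\quad
w^{t} = w^{t-1} - \eta\,\left.\frac{\partial L}{\partial w}\right|^{t-1}
        - \lambda\, w^{t-1}\,\bar{S}_b\!\left(\boldsymbol{y}, w^{t-1}\right).
```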
12. Thresholding
During training, due to numerical errors and asymptotic behavior, a parameter's value might never exactly reach zero.
For this reason, we introduce a simple thresholding mechanism:
\left| w_i \right| < T \;\Rightarrow\; w_i := 0
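A sketch of this pruning step: at thresholding time, any parameter whose magnitude lies below a threshold `T` is set to zero (the value of T is a hyperparameter, not specified on the slide).

```python
import torch

@torch.no_grad()
def prune(model, T):
    """Set w_i = 0 whenever |w_i| < T and return the resulting sparsity."""
    total, zeroed = 0, 0
    for p in model.parameters():
        mask = p.abs() < T
        p[mask] = 0.0
        total += p.numel()
        zeroed += int(mask.sum())
    return zeroed / total        # fraction of parameters now exactly zero
```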
13. Overview of the technique
w_i^{t} := w_i^{t-1} - \eta \left.\frac{\partial L}{\partial w_i}\right|^{t-1} - \lambda\, w_i^{t-1}\, \bar{S}_b\!\left(\boldsymbol{y}, w_i^{t-1}\right)
1. Forward propagation
2. Back-propagation: compute \partial L / \partial w and the sensitivity S(\boldsymbol{y}, w)
3. Update (rule above)
4. Pruning, at the end of the epoch
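Putting the four stages together, one epoch of the procedure could look like the following sketch, reusing the `sensitivity`, `update_step` and `prune` helpers from the earlier sketches; the data loader, loss function and hyperparameter values are placeholders.

```python
def train_epoch(model, loader, loss_fn, alpha, eta=0.1, lam=1e-4, T=1e-3):
    for x, target in loader:
        y = model(x)                           # forward propagation
        loss = loss_fn(y, target)
        model.zero_grad()
        loss.backward()                        # back-propagation: dL/dw
        S_list = sensitivity(model, x, alpha)  # sensitivity S(y, w) of the output
        update_step(model, S_list, eta, lam)   # gradient step + insensitivity-driven shrinkage
    return prune(model, T)                     # pruning at the end of the epoch
```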
14. Sensitivity-based regularization: results on LeNet300-MNIST
15. Sensitivity-based regularization: results on VGG16-ImageNet
17. References
[Zhou et al., 2016] Zhou, Shuchang, et al. "Dorefa-net: Training low bitwidth convolutional neural networks with
low bitwidth gradients." arXiv preprint arXiv:1606.06160 (2016).
[Han et al., 2015-1] Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding." arXiv preprint arXiv:1510.00149 (2015).
[Howard et al., 2017] Howard, Andrew G., et al. "Mobilenets: Efficient convolutional neural networks for mobile
vision applications." arXiv preprint arXiv:1704.04861 (2017).
[Louizos et al., 2018] Louizos, C., Welling, M., and Kingma, D. P. "Learning sparse neural networks through L0 regularization." 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, 2018.
[Han et al., 2015-2] Han, S., Pool, J., Tran, J., and Dally, W. "Learning both weights and connections for efficient neural network." In Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
[Srivastava et al., 2014] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. "Dropout: a simple way to prevent neural networks from overfitting." The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
18. References (II)
[Lebedev and Lempitsky, 2016] Lebedev, V., and Lempitsky, V. "Fast ConvNets using group-wise brain damage." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2554–2564.
[Scardapane et al., 2017] Scardapane, Simone, et al. "Group sparse regularization for deep
neural networks." Neurocomputing 241 (2017): 81-89.
[Hadifar et al., 2020] Hadifar, Amir, et al. "Block-wise Dynamic Sparseness." arXiv preprint
arXiv:2001.04686 (2020).
Editor's notes
- noise -> are some parameters less relevant than others?
Two questions: how do we estimate the change of the output? Where do we drive the non-necessary w?
T is magnitude-based in all of the literature out there