The ever-increasing number of parameters in deep neural networks poses challenges for memory-limited applications. Regularize-and-prune methods aim at meeting these challenges by sparsifying the network weights. In this context we quantify the output sensitivity to the parameters (i.e. their relevance to the network output) and introduce a regularization term that gradually lowers the absolute value of parameters with low sensitivity. Thus, a very large fraction of the parameters approach zero and are eventually set to zero by simple thresholding. Our method surpasses most of the recent techniques both in terms of sparsity and error rates. In some cases, the method reaches twice the sparsity obtained by other techniques at equal error rates.
1. Introducing sparsity in artificial neural networks: a sensitivity-based approach
ENZO TARTAGLIONE
POSTDOC AT UNIVERSITY OF TORINO
Università degli Studi di Torino
Computer Science Dept.
EIDOS group
2. Deep networks
• High number of hidden layers
• More complex classification tasks (e.g. ImageNet)
• Use of convolutional layers, pooling layers, very large fully-connected layers
• Very high number of parameters (hundreds of millions and even more…)
• Is it possible to boost performance by making the ANN robust to noise?
STATE OF THE ART
3. Size of ANN models vs generalization
4. Approaches to reduce the size of an ANN
Quantization [Zhou et al., 2016] [Han et al., 2015-1]
Modify the architecture [Howard et al., 2017]
Regularize and prune to achieve sparsity
5. Why sparse networks?
Less memory required.
Fewer computational resources needed.
Deployability on embedded devices.
Typical architectures are overparametrized!
6. Some existing pruning strategies…
Design a proxy L0 regularizer [Louizos et al., 2018]
Greedy thresholding after L2+dropout strategy [Han et al., 2015-2]
Grouping for convolutional features [Lebedev and Lempitsky, 2016] [Hadifar et al., 2020]
Dropout-based approaches [Srivastava et al., 2014]
Lasso-based regularizers [Scardapane et al., 2017]
…
7. When is a parameter necessary?
[Figure: a feed-forward network diagram with the input on the left, the output y on the right, and one weight w highlighted along the forward-propagation path.]
If changing w changes the output of the network… we don't want to modify it!
If changing w does not change the output of the network… we are free to change it!
PUBLISHED
Tartaglione, E., Lepsoy, S., Fiandrotti, A., Francini, G. (2018). Learning sparse neural networks via sensitivity-driven regularization. In Advances in Neural Information Processing Systems (NeurIPS 2018).
PRE-TRAINED MODEL
8. Definition of sensitivity
Small perturbation of w:

\Delta y_k \approx \frac{\partial y_k}{\partial w_i}\, \Delta w_i

Sensitivity of the output to the parameter w_i:

S(\boldsymbol{y}, w_i) = \sum_{k=1}^{C} \alpha_k \left| \frac{\partial y_k}{\partial w_i} \right|

where C is the size of the output and \alpha_k is a weighting scalar factor.
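As a concrete illustration, here is a minimal PyTorch-style sketch of how this sensitivity could be estimated for a single input sample; `model`, `x` and the weighting factors `alpha` are placeholders, and the per-output backward loop is just one straightforward (not necessarily efficient) way to obtain the partial derivatives, not the authors' reference code.

```python
import torch

def sensitivity(model, x, alpha):
    """Estimate S(y, w_i) = sum_k alpha_k * |dy_k / dw_i| for every parameter.

    alpha is a sequence of C non-negative weighting factors
    (e.g. the uniform choice alpha_k = 1/C)."""
    y = model(x).flatten()                     # network output with C components
    S = [torch.zeros_like(p) for p in model.parameters()]
    for k in range(y.numel()):                 # one backward pass per output component
        grads = torch.autograd.grad(y[k], list(model.parameters()),
                                    retain_graph=True)
        for s, g in zip(S, grads):
            s += alpha[k] * g.abs()
    return S                                   # one sensitivity tensor per parameter tensor
```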
9. Towards the definition of the update term
We need an insensitivity parameter:

\bar{S}(\boldsymbol{y}, w_i) = 1 - S(\boldsymbol{y}, w_i)

To guarantee that this quantity is always non-negative, a trivial choice is:

\bar{S}_b(\boldsymbol{y}, w_i) = \max\left[0,\ \bar{S}(\boldsymbol{y}, w_i)\right]

[Table on the slide relating the value of S and of w to the importance of the parameter.]
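In code, the insensitivity and its bounded version are a one-line transformation of the sensitivity tensors from the sketch above (illustrative, not the authors' reference implementation):

```python
import torch

def bounded_insensitivity(S):
    """S_bar_b(y, w_i) = max(0, 1 - S(y, w_i)), applied element-wise."""
    return torch.clamp(1.0 - S, min=0.0)
```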
10. Proposed weight update (sensitivity-based regularization)

w_i^{t} := w_i^{t-1} - \eta \left.\frac{\partial L}{\partial w_i}\right|^{t-1} - \lambda\, w_i^{t-1}\, \bar{S}_b\!\left(\boldsymbol{y}, w_i^{t-1}\right)

where
• \bar{S}_b is the bounded insensitivity: a function which states whether the parameter is relevant or not to the computation of the output \boldsymbol{y} of the network.
• \boldsymbol{y} is the output of the network.
• w_i is the parameter.
• L is a generic loss function.
• \eta is the learning rate and \lambda the regularization factor.
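A sketch of a single update step under this rule, assuming `loss.backward()` has already populated the gradients and `S_list` holds the per-parameter sensitivities from the earlier sketch; `eta` and `lam` are hypothetical names for the learning rate and the regularization factor.

```python
import torch

@torch.no_grad()
def update_step(model, S_list, eta=0.1, lam=1e-4):
    """w <- w - eta * dL/dw - lam * w * max(0, 1 - S(y, w))."""
    for p, S in zip(model.parameters(), S_list):
        insensitivity = torch.clamp(1.0 - S, min=0.0)   # bounded insensitivity
        p -= eta * p.grad                                # plain gradient descent on the loss
        p -= lam * p * insensitivity                     # shrink only the insensitive weights
```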
11. Which function are we minimizing?
We need to solve the integral:

R = \int w\, \bar{S}_b(\boldsymbol{y}, w)\, dw

Math, math, math….

R = \Theta\!\left(\bar{S}(\boldsymbol{y}, w)\right) \frac{w^{2}}{2} \left[ 1 - \sum_{k=1}^{C} \alpha_k\, \mathrm{sign}\!\left(\frac{\partial y_k}{\partial w}\right) \sum_{n=1}^{\infty} (-1)^{n+1} \frac{\partial^{n} y_k}{\partial w^{n}} \frac{w^{n-1}}{(n+1)!} \right]

valid for any architecture and any loss function, but for ReLU-activated networks…

R = \frac{w^{2}}{2}\, \bar{S}_b(\boldsymbol{y}, w)
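As a sanity check (my own one-line working, not taken from the slides): treating the bounded insensitivity as locally constant in w, the gradient of the ReLU-case regularizer reproduces exactly the extra term of the update rule,

```latex
\frac{\partial R}{\partial w}
  = \frac{\partial}{\partial w}\left[\frac{w^{2}}{2}\,\bar{S}_b(\boldsymbol{y}, w)\right]
  \approx w\,\bar{S}_b(\boldsymbol{y}, w)
\quad\Longrightarrow\quad
w^{t} = w^{t-1} - \eta\,\left.\frac{\partial L}{\partial w}\right|^{t-1}
        - \lambda\, w^{t-1}\,\bar{S}_b\!\left(\boldsymbol{y}, w^{t-1}\right).
```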
12. Thresholding
During training, due to numerical errors and asymptotic behavior, a parameter's value might never exactly reach zero.
For this reason, we introduce a simple thresholding mechanism:
\left| w_i \right| < T \;\Rightarrow\; w_i := 0
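A sketch of this pruning step: at thresholding time, any parameter whose magnitude lies below a threshold `T` is set to zero (the value of T is a hyperparameter, not specified on the slide).

```python
import torch

@torch.no_grad()
def prune(model, T):
    """Set w_i = 0 whenever |w_i| < T and return the resulting sparsity."""
    total, zeroed = 0, 0
    for p in model.parameters():
        mask = p.abs() < T
        p[mask] = 0.0
        total += p.numel()
        zeroed += int(mask.sum())
    return zeroed / total        # fraction of parameters now exactly zero
```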
13. Overview of the technique
w_i^{t} := w_i^{t-1} - \eta \left.\frac{\partial L}{\partial w_i}\right|^{t-1} - \lambda\, w_i^{t-1}\, \bar{S}_b\!\left(\boldsymbol{y}, w_i^{t-1}\right)
1. Forward propagation
2. Back-propagation: compute \partial L / \partial w and the sensitivity S(\boldsymbol{y}, w)
3. Update (rule above)
4. Pruning, at the end of the epoch
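Putting the four stages together, one epoch of the procedure could look like the following sketch, reusing the `sensitivity`, `update_step` and `prune` helpers from the earlier sketches; the data loader, loss function and hyperparameter values are placeholders.

```python
def train_epoch(model, loader, loss_fn, alpha, eta=0.1, lam=1e-4, T=1e-3):
    for x, target in loader:
        y = model(x)                           # forward propagation
        loss = loss_fn(y, target)
        model.zero_grad()
        loss.backward()                        # back-propagation: dL/dw
        S_list = sensitivity(model, x, alpha)  # sensitivity S(y, w) of the output
        update_step(model, S_list, eta, lam)   # gradient step + insensitivity-driven shrinkage
    return prune(model, T)                     # pruning at the end of the epoch
```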
14. Sensitivity-based regularization: results on LeNet300-MNIST
15. Sensitivity-based regularization: results on VGG16-ImageNet
17. References
[Zhou et al., 2016] Zhou, Shuchang, et al. "Dorefa-net: Training low bitwidth convolutional neural networks with
low bitwidth gradients." arXiv preprint arXiv:1606.06160 (2016).
[Han et al., 2015-1] Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding." arXiv preprint arXiv:1510.00149 (2015).
[Howard et al., 2017] Howard, Andrew G., et al. "Mobilenets: Efficient convolutional neural networks for mobile
vision applications." arXiv preprint arXiv:1704.04861 (2017).
[Louizos et al., 2018] Louizos, C., Welling, M., and Kingma, D. P. "Learning sparse neural networks through L0 regularization." 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, 2018.
[Han et al., 2015-2] Han, S., Pool, J., Tran, J., and Dally, W. "Learning both weights and connections for efficient neural network." In Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
[Srivastava et al., 2014] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. "Dropout: a simple way to prevent neural networks from overfitting." The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
18. References (II)
[Lebedev and Lempitsky, 2016] Lebedev, V., and Lempitsky, V. "Fast ConvNets using group-wise brain damage." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2554–2564.
[Scardapane et al., 2017] Scardapane, Simone, et al. "Group sparse regularization for deep
neural networks." Neurocomputing 241 (2017): 81-89.
[Hadifar et al., 2020] Hadifar, Amir, et al. "Block-wise Dynamic Sparseness." arXiv preprint
arXiv:2001.04686 (2020).
Editor's notes
- noise -> are some parameters less relevant than others?
Two questions: how do we estimate the change of the output? Where do we drive the non-necessary w?
T is magnitude-based in all of the literature out there