Dense-Sparse-Dense Training for DNN and Other Models

Perspective of Squeezing Model
Dense-Sparse-Dense Training for DNN: paper review & squeeze model methods
davinnovation@gmail.com
Industrial Engineering, Dong Heon Cho

Why Dense-Sparse-Dense Training?
• Deep neural networks show improved performance in many fields such as computer vision, natural language processing, and speech recognition, and better hardware is making it possible to train more complex models
• A complex model can learn the relationship between features and outputs better, but it also fits the noise in the data more readily, which causes overfitting
• Shrinking the model runs into under-fitting instead, so it is not a good remedy
• The Dense-Sparse-Dense training flow (DSD) proposes a way to train while keeping the existing model
DSD training is proposed as a way to avoid overfitting while keeping the model

What is Dense-Sparse-Dense Training?
• Dense-Sparse-Dense training trains the model in three steps
• Step 1 (Dense) is the same as ordinary deep learning training: backpropagation with gradient descent
• Step 2 (Sparse) regularizes the model through a sparsity constraint
• Step 3 (Dense) retrains the weights that were excluded in step 2
Dense -> Sparse -> Dense

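Dense Sparse Dense
Step 1: Dense
The step-1 Dense phase learns the weight values and the connection strengths
* The number of training iterations is chosen heuristically

Dense Sparse Dense
Step 2: Sparse
The step-2 Sparse phase retrains and strengthens the important weights
* The number of training iterations and the Top-k cutoff are chosen heuristically

Dense Sparse Dense
Step 3: Dense
The step-3 Dense phase retrains all of the weights
* The number of training iterations is chosen heuristically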
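
The three steps above can be sketched on a toy weight vector. This is a minimal illustration in plain Python, not the authors' implementation: the helper names `topk_mask` and `sgd_step`, the 50% sparsity, the learning rates, and the constant placeholder gradients are all assumptions made for the example. It also assumes that pruning removes the weights with the smallest absolute value and that pruned weights restart from zero in the final Dense step.

```python
def topk_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of smallest-|w| weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending magnitude; the first n_prune get pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1.0] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0.0
    return mask


def sgd_step(weights, grads, lr, mask=None):
    """One gradient-descent update; masked-out weights are held at zero."""
    if mask is None:
        mask = [1.0] * len(weights)
    return [m * (w - lr * g) for w, g, m in zip(weights, grads, mask)]


# Step 1 (Dense): pretend these weights came out of ordinary training.
dense_weights = [0.9, -0.05, 0.4, 0.02, -0.7, 0.1]
grads = [0.1] * len(dense_weights)  # placeholder gradients (assumption)

# Step 2 (Sparse): keep the largest-magnitude half and train under the mask.
mask = topk_mask(dense_weights, sparsity=0.5)
sparse_weights = sgd_step(dense_weights, grads, lr=0.1, mask=mask)

# Step 3 (Dense): drop the mask; pruned weights restart from zero and train again.
redense_weights = sgd_step(sparse_weights, grads, lr=0.01)
```

In a real network the mask would be computed per layer and the gradients would come from backpropagation; only the masking logic is shown here.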

Dense Sparse Dense
Step 3: Dense
* Compared with the result of the step-1 Dense phase, the number of small-valued weights increases == reinforcement effect

Dense-Sparse-Dense Training Result
Error rate (%)
Model         Baseline   DSD
GoogLeNet     31.1       30
VGG-16        31.5       27.2
ResNet-18     30.4       29.2
ResNet-50     24         22.9
DeepSpeech2   14.5       13.4
Applying DSD to the baseline models lowers the error rate in every case

DSD Training
DSD results on Models
* Sparsity 30%
Error rate (%)
Model       Method     Top-1   Top-5
GoogLeNet   Baseline   31.14   10.96
GoogLeNet   Sparse     30.58   10.58
GoogLeNet   DSD        30.02   10.34
VGGNet      Baseline   31.5    11.32
VGGNet      Sparse     28.19   9.23
VGGNet      DSD        27.19   8.67
ResNet-50   Baseline   24.01   7.02
ResNet-50   Sparse     23.55   6.88
ResNet-50   DSD        22.89   6.47

Effect of DSD Training
1. During the Sparse and Re-Dense phases, the model can be observed escaping saddle points
2. As a consequence of 1, training reaches a better minimum
3. Training in the Sparse phase tends to make the model more robust to noise
- Because the Sparse phase trains in a lower-dimensional space, it produces a more robust model, and the re-initialization in the Re-Dense phase adds to that robustness
DSD training yields a more robust and better-performing model

Other Learning Techniques
Dropout
Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from
overfitting." Journal of machine learning research 15.1 (2014): 1929-1958.
• Model combination (ensembling) produces better-performing models, but deep learning models are so deep that training several of them is difficult
=>
Randomly excluding nodes from the computation during training
1. has the effect of training many different model configurations, and
2. gives a result similar to ensembling those models
Figure: random dropout in the training phase
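
A minimal sketch of the idea, in plain Python. It uses the "inverted dropout" variant, which rescales the surviving activations during training, whereas the cited paper rescales the weights at test time; the function name `dropout`, the drop probability, and the seeded generator are assumptions made for the example.

```python
import random


def dropout(activations, drop_prob, training, rng):
    """Zero each activation with probability drop_prob during training.

    Survivors are divided by the keep probability so the expected value of
    each activation is unchanged; at test time the input passes through as is.
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)  # fixed seed only to make the example repeatable
train_out = dropout([1.0] * 8, drop_prob=0.5, training=True, rng=rng)
test_out = dropout([1.0] * 8, drop_prob=0.5, training=False, rng=rng)
```

Each training call samples a different subset of nodes, which is what gives the "many thinned models" effect described on the slide.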

Dropout
Comparing the layer visualizations shows that sparsity is higher with dropout
Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from
overfitting." Journal of machine learning research 15.1 (2014): 1929-1958.

Batch Normalization
Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network
training by reducing internal covariate shift." International Conference on Machine
Learning. 2015.
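• When deep learning models use non-linear functions as activations, back-propagation through a deep model suffers from vanishing/exploding gradients
=>
The activation outputs are normalized to zero mean and unit variance, which mitigates the vanishing-gradient problem
1. The output dimensions of each layer are treated as independent (normalized per dimension)
2. Normalization is done per batch
- In the training phase
  - the mean and variance of the current batch are used
- In the test phase
  - moving averages of the training-phase mean and variance are used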
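
The training/test distinction above can be sketched as follows. This is a simplified illustration in plain Python: the class name `BatchNorm`, the momentum of 0.1, and the epsilon are assumptions, and the learnable scale and shift parameters of the original method are omitted to keep the example short.

```python
class BatchNorm:
    """Per-dimension batch normalization without learnable scale/shift."""

    def __init__(self, dim, momentum=0.1, eps=1e-5):
        self.running_mean = [0.0] * dim
        self.running_var = [1.0] * dim
        self.momentum = momentum
        self.eps = eps

    def __call__(self, batch, training):
        dim = len(self.running_mean)
        if training:
            # Training phase: statistics of the current batch, per dimension.
            n = len(batch)
            mean = [sum(row[j] for row in batch) / n for j in range(dim)]
            var = [sum((row[j] - mean[j]) ** 2 for row in batch) / n
                   for j in range(dim)]
            # Keep moving averages for use in the test phase.
            m = self.momentum
            for j in range(dim):
                self.running_mean[j] = (1 - m) * self.running_mean[j] + m * mean[j]
                self.running_var[j] = (1 - m) * self.running_var[j] + m * var[j]
        else:
            # Test phase: moving averages collected during training.
            mean, var = self.running_mean, self.running_var
        return [[(row[j] - mean[j]) / (var[j] + self.eps) ** 0.5
                 for j in range(dim)] for row in batch]


bn = BatchNorm(dim=2)
train_out = bn([[1.0, 10.0], [3.0, 30.0]], training=True)
test_out = bn([[0.2, 2.0]], training=False)
```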

Batch Normalization
Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network
training by reducing internal covariate shift." International Conference on Machine
Learning. 2015.
Batch Normalization can be expected to give faster training and, in the end, a lower loss

Residual Network
He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the
IEEE conference on computer vision and pattern recognition. 2016.
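• As models get deeper, the vanishing-gradient problem makes them harder to train
=>
Each layer is connected not only to the previous layer but also, through a shortcut, to the layer two positions back, so that earlier information is carried forward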
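
A minimal sketch of such a shortcut, in plain Python: two small fully connected layers compute F(x), and the input x is added back before the final activation. The helper names `linear`, `relu`, and `residual_block` and the use of fully connected instead of convolutional layers are assumptions made to keep the example short.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(weight_rows, values):
    """Matrix-vector product; each row of weight_rows produces one output."""
    return [sum(w * v for w, v in zip(row, values)) for row in weight_rows]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU between."""
    hidden = relu(linear(w1, x))
    residual = linear(w2, hidden)
    # Shortcut: the input skips both layers and is added to their output.
    return relu([r + xi for r, xi in zip(residual, x)])


zeros = [[0.0, 0.0], [0.0, 0.0]]
# With all-zero weights F(x) = 0, so the block passes its input straight through.
identity_out = residual_block([1.0, 2.0], zeros, zeros)
```

The zero-weight case shows why the shortcut helps: even when the two layers contribute nothing, the signal and its gradient still pass through the block.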

Residual Network
He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the
IEEE conference on computer vision and pattern recognition. 2016.
ResNet makes it possible to train models with more layers

Dense Network
Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE
conference on computer vision and pattern recognition. 2017.
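• As models get deeper, the vanishing-gradient problem makes them harder to train
=>
Each layer is connected not only to the previous layer but also to the layers up to N positions back, so that earlier information is carried forward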
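
The connectivity pattern can be sketched as below, assuming that "connected to earlier layers" means each layer receives the concatenation of all earlier outputs. The function name `dense_block` and the toy layer used in the example are hypothetical.

```python
def dense_block(x, layers):
    """Feed each layer the concatenation of the input and all earlier outputs."""
    features = [x]
    for layer in layers:
        concatenated = [v for feature in features for v in feature]
        features.append(layer(concatenated))
    # The block output is the concatenation of everything produced so far.
    return [v for feature in features for v in feature]


def sum_layer(values):
    """Toy stand-in for a real layer: emits a single feature, the input sum."""
    return [sum(values)]


block_out = dense_block([1.0, 2.0], [sum_layer, sum_layer])
```

The second layer sees both the original input and the first layer's output, which is the difference from a plain stack where it would see only the first layer's output.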

Dense Network
Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE
conference on computer vision and pattern recognition. 2017.
Achieves better accuracy even with fewer parameters

Other Learning Techniques & Compression
Distilling
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural
network." arXiv preprint arXiv:1503.02531 (2015).
• Model combination (ensembling) produces better-performing models, but deep learning models are so deep that training several of them is difficult
=>
After several models are ensembled, the output of the ensemble is used to train a new single model, which gives a smaller model with better performance
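
A minimal sketch of training on the ensemble's output, in plain Python. The ensemble's class probabilities are softened with a temperature and averaged, and the single model is scored by cross-entropy against them. The function names, the temperature of 4.0, and the example logits are assumptions; the hard-label term and the temperature-squared scaling of the full method are left out.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(ensemble_logits, temperature):
    """Average the softened predictions of every model in the ensemble."""
    probs = [softmax(logits, temperature) for logits in ensemble_logits]
    n_models = len(probs)
    return [sum(p[c] for p in probs) / n_models for c in range(len(probs[0]))]


def distillation_loss(student_logits, soft_targets, temperature):
    """Cross-entropy between the soft targets and the student's softened output."""
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(soft_targets, student))


ensemble = [[2.0, 1.0, 0.1], [1.5, 1.2, 0.3]]  # logits from two teacher models
targets = ensemble_soft_targets(ensemble, temperature=4.0)
loss = distillation_loss([0.5, 0.5, 0.5], targets, temperature=4.0)
```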

Distilling
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural
network." arXiv preprint arXiv:1503.02531 (2015).
Accuracy (%)
Baseline       58.9
10x Ensemble   61.1
Distilling     60.8
The distilled model performs slightly worse than the 10x ensemble but better than the single baseline model

Singular Value Decomposition
Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for
efficient evaluation." Advances in Neural Information Processing Systems. 2014.
• Models have grown to hold many weights, which increases their storage size (FC layers account for most of the weights)
=>
The weight matrix of a fully connected layer is compressed through singular value decomposition
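
The saving from a truncated SVD can be checked with simple arithmetic. The sketch below counts the weights stored before and after keeping only the top `rank` singular values; the 4096 x 4096 layer size and the rank of 256 are illustrative assumptions, not figures from the cited paper.

```python
def dense_params(rows, cols):
    """Weights stored by a full rows x cols matrix."""
    return rows * cols


def low_rank_params(rows, cols, rank):
    """Weights stored by a rank-limited factorization W ~ A @ B.

    A is rows x rank (left singular vectors scaled by the singular values)
    and B is rank x cols (right singular vectors).
    """
    return rank * (rows + cols)


rows, cols, rank = 4096, 4096, 256
full = dense_params(rows, cols)
compressed = low_rank_params(rows, cols, rank)
ratio = full / compressed
```

The factorization only pays off when `rank` is below `rows * cols / (rows + cols)`, which is 2048 for this layer size.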

Singular Value Decomposition
Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for
efficient evaluation." Advances in Neural Information Processing Systems. 2014.
Accuracy drops, but the number of weights in the model decreases

Pruning
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS.
2015.
• Models have grown to hold many weights, which increases their storage size
=>
As in DSD, the Dense phase is followed by pruning

Pruning
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS.
2015.
Reduces the final number of weights while keeping model performance at a similar level

Pruning and splicing
Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS. 2016.
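=>
As in DSD, the Dense phase is followed by pruning, but a splicing step takes the place of Re-Dense
* Splicing: during training, decide for each weight whether it is included in the network (weight updates continue for all weights)
Train Network -> Update T_k -> Update W_k
The difference from DSD is whether the network is fully connected in the final step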
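
The loop above can be sketched as follows, assuming that the mask T_k is set by two magnitude thresholds: a weight is pruned when it falls below the lower one and spliced back when it rises above the upper one. The function names, threshold values, learning rate, and placeholder gradients are all assumptions made for the example.

```python
def update_mask(weights, mask, low, high):
    """Update T_k: prune small weights, splice large ones back, else keep as is."""
    new_mask = []
    for w, t in zip(weights, mask):
        if abs(w) < low:
            new_mask.append(0.0)   # prune
        elif abs(w) > high:
            new_mask.append(1.0)   # splice back in
        else:
            new_mask.append(t)     # in between: leave the decision unchanged
    return new_mask


def update_weights(weights, grads, lr):
    """Update W_k: every weight is updated, including currently pruned ones."""
    return [w - lr * g for w, g in zip(weights, grads)]


weights = [0.05, 0.5, 0.2]
mask = [1.0, 0.0, 1.0]
mask = update_mask(weights, mask, low=0.1, high=0.3)

# The pruned first weight keeps training and grows past the upper threshold.
weights = update_weights(weights, grads=[-3.0, 0.0, 0.0], lr=0.1)
mask_after = update_mask(weights, mask, low=0.1, high=0.3)
effective = [w * t for w, t in zip(weights, mask_after)]
```

Because pruned weights keep receiving updates, a connection removed too early can return, which is what separates splicing from one-way pruning.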

Pruning and splicing
Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS. 2016.
Reduces the final number of weights even further while keeping model performance at a similar level

Other Compression
SqueezeNet
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters
and <0.5 MB model size."
=>
The architecture itself is designed to need few weights
* Fire module: the building block of SqueezeNet
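
A rough weight count shows why such a module is small, assuming it consists of a 1x1 "squeeze" convolution followed by parallel 1x1 and 3x3 "expand" convolutions. The helper names and the channel sizes below are illustrative assumptions, and biases are ignored.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights in a kernel x kernel convolution, ignoring biases."""
    return in_channels * out_channels * kernel * kernel


def fire_params(in_channels, squeeze, expand1x1, expand3x3):
    """Weights in one Fire module: squeeze 1x1, then 1x1 and 3x3 expand branches."""
    return (conv_params(in_channels, squeeze, 1)
            + conv_params(squeeze, expand1x1, 1)
            + conv_params(squeeze, expand3x3, 3))


# 96 input channels squeezed to 16, then expanded to 64 + 64 = 128 channels.
fire = fire_params(96, squeeze=16, expand1x1=64, expand3x3=64)
# A plain 3x3 convolution producing the same 128 output channels.
plain = conv_params(96, 128, 3)
```

Most of the saving comes from the squeeze layer, which cuts the number of input channels that the 3x3 filters see.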

Other Compression
SqueezeNet
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters
and <0.5 MB model size."
Reduces the final number of weights while model performance is equal or better

References
• Han, Song, et al. "Dsd: Regularizing deep neural networks with dense-sparse-dense training flow." arXiv preprint arXiv:1607.04381 (2016).
• Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." Journal of machine learning research 15.1 (2014): 1929-1958.
• Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European conference on computer vision. Springer, Cham, 2014.
• Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).
• Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." International Conference on Machine Learning. 2015.
• He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
• Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for efficient evaluation." Advances in Neural Information Processing Systems. 2014.
• Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size."
