Perspective of Squeezing Model
Dense-Sparse-Dense Training for
DNN paper review &
squeeze model methods
davinnovation@gmail.com
Industrial Engineering, Dong Heon Cho
Why Dense-Sparse-Dense Training?
• Deep neural networks show improved performance in Computer Vision, Natural Language Processing, Speech Recognition, and many other fields, and better performance and hardware are making it possible to train ever more complex models
• A more complex model can learn the relationship between features and outputs better, but it also becomes more biased toward the noise in the data, i.e., it overfits
• Simply shrinking the model runs into under-fitting instead, so that is not a good solution
• Dense-Sparse-Dense training (DSD) proposes a way to train the model better while keeping the existing architecture
The paper proposes Dense-Sparse-Dense Training, which avoids overfitting while keeping the model unchanged
• Dense-Sparse-Dense Training trains the model in three phases (a code sketch follows below)
• Phase 1 (Dense) is identical to standard deep learning training: backpropagation with gradient descent
• Phase 2 (Sparse) regularizes the model through a sparsity constraint
• Phase 3 retrains the weights that were removed in phase 2
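A minimal sketch of the three phases, assuming PyTorch; the model, the train_step helper, the iteration counts, and the 30% sparsity level are placeholders (the paper picks the pruning ratio and iteration counts heuristically):

```python
import torch

def dsd_train(model, train_step, dense_iters=1000, sparse_iters=1000, sparsity=0.3):
    """Sketch of Dense-Sparse-Dense training.
    `train_step(model)` is assumed to run one optimizer step on a mini-batch."""
    # Phase 1 (Dense): ordinary training with backpropagation / gradient descent.
    for _ in range(dense_iters):
        train_step(model)

    # Phase 2 (Sparse): keep only the largest-magnitude weights per layer,
    # zero out the rest, and retrain while keeping the pruned weights at zero.
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                                   # skip biases / norm parameters
            continue
        k = int(p.numel() * (1 - sparsity))               # number of weights to keep
        threshold = p.detach().abs().flatten().kthvalue(p.numel() - k).values
        masks[name] = (p.detach().abs() > threshold).float()
    for _ in range(sparse_iters):
        train_step(model)
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p.mul_(masks[name])                   # re-apply the sparsity mask

    # Phase 3 (re-Dense): drop the masks and retrain all weights
    # (the previously pruned weights restart from zero).
    for _ in range(dense_iters):
        train_step(model)
    return model
```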
What is Dense-Sparse-Dense Training?
Dense Sparse Dense
What is Dense-Sparse-Dense Training?
Dense Sparse Dense
Phase 1 : Dense
Phase 1 (Dense) learns the values of the weights and the strength of the connections
* The number of training iterations is chosen heuristically
What is Dense-Sparse-Dense Training?
Dense Sparse Dense
Phase 2 : Sparse
Phase 2 (Sparse) retrains and strengthens the important weights
* The number of training iterations and the Top-k ratio are chosen heuristically
What is Dense-Sparse-Dense Training?
Dense Sparse Dense
Phase 3 : Dense
Phase 3 (Dense) retrains all of the weights
* The number of training iterations is chosen heuristically
What is Dense-Sparse-Dense Training?
Dense Sparse Dense
Phase 1 : Dense
Phase 1 (Dense) learns the values of the weights and the strength of the connections
What is Dense-Sparse-Dense Training?
Dense Sparse Dense
Phase 2 : Sparse
Phase 2 (Sparse) retrains and strengthens the important weights
What is Dense-Sparse-Dense Training?
Dense Sparse Dense
Phase 3 : Dense
Phase 3 (Dense) retrains all of the weights
* Compared with the phase-1 Dense result, the number of small-valued weights increases == a strengthening effect
Dense-Sparse-Dense Training Result
Error rate (%), Baseline vs. DSD:
GoogLeNet   : Baseline 31.1 / DSD 30.0
VGG-16      : Baseline 31.5 / DSD 27.2
ResNet-18   : Baseline 30.4 / DSD 29.2
ResNet-50   : Baseline 24.0 / DSD 22.9
DeepSpeech2 : Baseline 14.5 / DSD 13.4
Applying DSD to the baseline models reduces the error rate
DSD Training
DSD results on Models
* Sparsity 30%
Top-1 / Top-5 error rate (%):
GoogLeNet : Baseline 31.14 / 10.96, Sparse 30.58 / 10.58, DSD 30.02 / 10.34
VGG-16    : Baseline 31.50 / 11.32, Sparse 28.19 /  9.23, DSD 27.19 /  8.67
ResNet-50 : Baseline 24.01 /  7.02, Sparse 23.55 /  6.88, DSD 22.89 /  6.47
Effect of DSD Training
1. During the Sparse and re-Dense phases, the training can be observed escaping saddle points
2. As a result of 1, the training reaches a better minimum
3. The Sparse-phase training makes the model more robust to noise
 - The Sparse phase trains in a lower-dimensional space, which yields a more robust model, and the re-initialization in the re-Dense phase adds further robustness
DSD training yields a model that is both more robust and more accurate
Other Learning Techniques
Dropout
Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from
overfitting." Journal of machine learning research 15.1 (2014): 1929-1958.
• Model combination (ensembling) can produce a better-performing model, but deep learning models are so deep that training multiple models is difficult
=>
By randomly excluding nodes from the computation during training,
1. we get the effect of training models of many different shapes, and
2. we obtain a result similar to ensembling those models
(Figure: nodes are randomly dropped out during the training phase)
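A small illustration of this behavior, assuming PyTorch; the layer sizes and drop probability are arbitrary:

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden unit is dropped with probability 0.5 during training
    nn.Linear(64, 10),
)

x = torch.randn(4, 128)

net.train()              # training phase: random units are zeroed, the rest are scaled by 1/(1-p)
y_train = net(x)

net.eval()               # test phase: dropout is a no-op, the full (implicitly averaged) network is used
y_test = net(x)
```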
Other Learning Techniques
Dropout
Comparing the layer visualizations with and without Dropout shows that the Dropout network has higher sparsity
Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from
overfitting." Journal of machine learning research 15.1 (2014): 1929-1958.
Other Learning Techniques
Batch Normalization
Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network
training by reducing internal covariate shift." International Conference on Machine
Learning. 2015.
• Deep learning models use non-linear activation functions, so backpropagation through deep models suffers from vanishing/exploding gradients
=>
Normalize the activation outputs toward a normal distribution to address the vanishing-gradient problem
1. Each layer's outputs are treated as independent (normalized per dimension)
2. Normalization is performed per batch
 - During the training phase: apply the mean and variance of the current batch
 - During the test phase: apply a moving average of the training-phase means and variances
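A short illustration, assuming PyTorch, of the train-time vs. test-time behavior described above; the feature and batch sizes are arbitrary:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=32)   # normalizes each of the 32 dimensions independently

x = torch.randn(16, 32)                # a mini-batch of 16 samples

bn.train()
y_train = bn(x)                        # uses the mean/variance of this batch, and updates
                                       # bn.running_mean / bn.running_var (moving averages)

bn.eval()
y_test = bn(x)                         # uses the stored moving-average statistics instead
```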
Other Learning Techniques
Batch Normalization
Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network
training by reducing internal covariate shift." International Conference on Machine
Learning. 2015.
Batch Normalization gives faster training and a lower final loss
Other Learning Techniques
Residual Network
He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the
IEEE conference on computer vision and pattern recognition. 2016.
• As models get deeper, they become harder to train because of the vanishing-gradient problem
=>
Each layer is connected not only to the previous layer but also, via a skip connection, to the layer two steps earlier, so information from earlier layers can be carried forward
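A minimal residual-block sketch, assuming PyTorch; the channel count is a placeholder, and the real ResNet blocks also use batch normalization and a projection shortcut when shapes change:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # skip connection: the input bypasses two layers and is added back

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 32, 32))
```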
Other Learning Techniques
Residual Network
He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the
IEEE conference on computer vision and pattern recognition. 2016.
ResNet makes it possible to train models with far more layers
Other Learning Techniques
Dense Network
Iandola, Forrest, et al. "Densenet: Implementing efficient convnet descriptor pyramids."
arXiv preprint arXiv:1404.1869 (2014).
• As models get deeper, they become harder to train because of the vanishing-gradient problem
=>
Each layer is connected not only to the previous layer but also to the N preceding layers, so information from earlier layers can be carried forward
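A sketch of the dense-connectivity idea, assuming PyTorch and following the later DenseNet formulation in which each layer receives the concatenation of all earlier feature maps; the channel counts and growth rate are placeholders:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate=12, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            # each layer sees the block input plus every previous layer output, concatenated
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_channels + i * growth_rate, growth_rate, kernel_size=3, padding=1),
                nn.ReLU(),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # reuse information from all earlier layers
            features.append(out)
        return torch.cat(features, dim=1)

block = DenseBlock(16)
y = block(torch.randn(1, 16, 32, 32))   # output has 16 + 3*12 = 52 channels
```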
Other Learning Techniques
Dense Network
Iandola, Forrest, et al. "Densenet: Implementing efficient convnet descriptor pyramids."
arXiv preprint arXiv:1404.1869 (2014).
Achieves better accuracy with fewer parameters
Other Learning Techniques & Compression
Distilling
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural
network." arXiv preprint arXiv:1503.02531 (2015).
• Model combination (ensembling) can produce a better-performing model, but deep learning models are so deep that training multiple models is difficult
=>
After ensembling the models, the ensemble's outputs are used to train a new single model, producing a smaller model with better performance
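A sketch of the distillation loss, assuming PyTorch; the temperature T and mixing weight alpha are illustrative hyperparameters, and teacher_logits would come from the (ensemble) teacher:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # soft targets: match the teacher's probabilities softened by temperature T
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale so gradients match the hard-label term
    # hard targets: ordinary cross-entropy against the true labels
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# usage: teacher_logits could be the averaged logits of an ensemble
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```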
Other Learning Techniques & Compression
Distilling
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural
network." arXiv preprint arXiv:1503.02531 (2015).
Accuracy (%):
Baseline     58.9
10x Ensemble 61.1
Distilling   60.8
The results above show that the distilled model performs worse than the 10x ensemble, but better than the single baseline model
Other Learning Techniques & Compression
Singular Value Decomposition
Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for
efficient evaluation." Advances in Neural Information Processing Systems. 2014.
• As models have grown they contain many weights and take up a lot of storage (the FC layers hold most of the weights)
=>
Compress the weight matrices of the fully connected layers with a singular value decomposition
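A minimal sketch of compressing a fully connected layer with a truncated SVD, assuming PyTorch; the layer sizes and the retained rank k are placeholders:

```python
import torch

# a hypothetical fully connected layer: 4096 -> 1000, i.e. ~4.1M weights
W = torch.randn(1000, 4096)
b = torch.randn(1000)

k = 128                                     # retained rank (a compression hyperparameter)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :k] * S[:k]                        # shape (1000, k)
B = Vh[:k, :]                               # shape (k, 4096)

# original layer:   y = W x + b        -> 1000*4096      ~ 4.10M weights
# compressed layer: y = A (B x) + b    -> (1000+4096)*k  ~ 0.65M weights
x = torch.randn(4096)
y_approx = A @ (B @ x) + b
```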
Other Learning Techniques & Compression
Singular Value Decomposition
Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for
efficient evaluation." Advances in Neural Information Processing Systems. 2014.
Accuracy drops slightly, but the number of weights in the model shrinks
Other Learning Techniques & Compression
Pruning
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS.
2015.
• As models have grown they contain many weights and take up a lot of storage
=>
Similar to DSD, a dense training phase is followed by pruning
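A brief sketch of magnitude pruning for a single weight tensor, assuming PyTorch; unlike DSD, the mask is kept for good and the prune-then-retrain cycle is typically iterated. The pruning ratio is a placeholder:

```python
import torch

def magnitude_prune(weight, prune_ratio=0.9):
    """Zero out the smallest-magnitude weights and return the binary mask,
    which is re-applied during fine-tuning so pruned connections stay removed."""
    k = int(weight.numel() * prune_ratio)                     # number of weights to remove
    threshold = weight.detach().abs().flatten().kthvalue(k).values
    mask = (weight.detach().abs() > threshold).float()
    weight.data.mul_(mask)
    return mask

w = torch.nn.Linear(256, 256).weight
mask = magnitude_prune(w, prune_ratio=0.9)
# during fine-tuning, re-apply `w.data.mul_(mask)` after every optimizer step
```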
Other Learning Techniques & Compression
Pruning
Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS.
2015.
The final number of weights is reduced while the model's performance is kept roughly the same
Other Learning Techniques & Compression
Pruning and splicing
Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS. 2016.
• As models have grown they contain many weights and take up a lot of storage
=>
Similar to DSD, a dense training phase is followed by pruning, but a splicing step is performed instead of the re-Dense phase
* Splicing step: at each training step, decide whether each weight is included in the forward pass (the weight update is still applied to every weight)
Train network -> update T_k (inclusion mask) -> update W_k (weights)
The difference from DSD is whether the final step is fully dense (all connections restored)
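A rough sketch of the splicing idea, assuming PyTorch: the mask T_k is re-decided at every step from the current weight magnitudes (so pruned connections can come back), while the W_k update moves every weight. The thresholds t_low and t_high and the plain SGD step are illustrative placeholders:

```python
import torch
import torch.nn.functional as F

def splice_update(weight, mask, t_low=0.05, t_high=0.10):
    """Update the inclusion mask T_k from the current weight magnitudes:
    prune entries that drop below t_low, splice back entries that grow above t_high."""
    a = weight.detach().abs()
    mask[a < t_low] = 0.0     # exclude from the forward pass
    mask[a > t_high] = 1.0    # re-include previously pruned connections
    return mask

# sketch of one training step for a single weight matrix
weight = torch.randn(256, 256) * 0.1
mask = torch.ones_like(weight)
x, target = torch.randn(8, 256), torch.randn(8, 256)

effective = (weight * mask).requires_grad_(True)   # forward pass uses only included weights (W * T)
loss = F.mse_loss(x @ effective.t(), target)
loss.backward()

weight -= 0.01 * effective.grad     # W_k update: every weight moves, pruned ones included
mask = splice_update(weight, mask)  # T_k update: prune / splice
```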
Other Learning Techniques & Compression
Pruning and splicing
Guo, Yiwen, et al. "Dynamic Network Surgery for Efficient DNNs." NIPS. 2016.
The final number of weights is reduced even further while the model's performance is kept roughly the same
Other Compression
SqueezeNet
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters
and <0.5 MB model size."
• As models have grown they contain many weights and take up a lot of storage
=>
Design the architecture itself so that the model has fewer weights
* Fire module: the building block of SqueezeNet
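A sketch of the Fire module SqueezeNet is built from, assuming PyTorch: a 1x1 squeeze convolution reduces the channel count, then parallel 1x1 and 3x3 expand convolutions are concatenated; the channel counts below are placeholders:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    def __init__(self, in_channels, squeeze, expand1x1, expand3x3):
        super().__init__()
        self.squeeze = nn.Conv2d(in_channels, squeeze, kernel_size=1)   # few channels -> few weights
        self.expand1x1 = nn.Conv2d(squeeze, expand1x1, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze, expand3x3, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(s)),
                          self.relu(self.expand3x3(s))], dim=1)

fire = Fire(96, squeeze=16, expand1x1=64, expand3x3=64)   # output: 128 channels
y = fire(torch.randn(1, 96, 55, 55))
```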
Other Compression
SqueezeNet
Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters
and <0.5 MB model size."
The final number of weights is reduced while the model's accuracy is equal to or better than the baseline
References
• Han, Song, et al. "DSD: Regularizing deep neural networks with dense-sparse-dense training flow." arXiv preprint arXiv:1607.04381 (2016).
• Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." Journal of machine learning research 15.1 (2014): 1929-1958.
• Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European conference on computer vision. Springer, Cham, 2014.
• Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).
• Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." International Conference on Machine Learning. 2015.
• He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
• Denton, Emily L., et al. "Exploiting linear structure within convolutional networks for efficient evaluation." Advances in Neural Information Processing Systems. 2014.
• Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size."
Q & A
and Discussion