"Five+ Techniques for Efficient Implementation of Neural Networks," a Presentation from Synopsys
1. © 2019 Synopsys
5+ Techniques for Efficient
Implementation of Neural
Networks
Bert Moons
Synopsys
May 2019
3. © 2019 Synopsys
Neural Network accuracy comes at a high cost in terms of model
storage and operations per input feature
3 Major challenges
Introduction – Challenges of embedded deep learning
Many embedded applications require real-time operation on high-
dimensional, large input data from various input sources
Many embedded applications require support for a variety of
networks: CNNs for feature extraction, RNNs for sequence modeling
4. © 2019 Synopsys
3 Major challenges
Introduction – Challenges of embedded deep learning
1. Many operations per pixel
2. Many pixels to process in real time
3. A wide variety of algorithms to support
5. © 2019 Synopsys
Classification accuracy comes at a cost
Introduction – Challenges of embedded deep learning
[Chart: best reported top-5 accuracy on ImageNet-1000 [%] for conventional machine learning, deep learning, and human performance]
Neural network accuracy comes at the cost of a high workload per input pixel,
huge model sizes, and large bandwidth requirements
6. © 2019 Synopsys
Computing on large input data
Introduction – Challenges of embedded deep learning
[Figure: relative input size per frame: ImageNet (1x), Full HD (40x), 4K (160x)]
Embedded applications require
real-time operation on large input frames
7. © 2019 Synopsys
Massive workload in real-time applications
Introduction – Challenges of embedded deep learning
[Chart: Top-1 ImageNet accuracy [%] (65-75) vs. # operations per ImageNet image (1 GOP to 1 TOP) for MobileNet V2, GoogleNet, ResNet-50 and VGG-16, on a single ImageNet image]
1 GOP to 10 GOP per ImageNet image
[Chart: the same accuracy range vs. # operations per second (1 GOP/s to 1 TOP/s), scaled to 6 cameras at 30 fps, Full HD]
5-to-180 TOPS @ 30 fps, FHD, ADAS
8. © 2019 Synopsys
5+ Techniques to reduce the DNN workload
A. Neural Networks are
error-tolerant
Introduction – Challenges of embedded deep learning
1. Linear post-training 8/12/16b quantization
2. Linear trained 2/4/8 bit quantization
3. Non-linear trained 2/4/8 bit quantization
through clustering
C. Neural Networks have
sparse and correlated
intermediate results
B. Neural Networks have
redundancies and
are over-dimensioned
4. Network pruning and compression
5. Network decomposition: low-rank network
approximations
6. Sparsity and correlation based feature map
compression
10. © 2019 Synopsys
The benefits of quantized number representations
5 Techniques – A. Neural Networks Are Error-Tolerant
8-bit fixed point is 3-4x faster and 2-4x more energy-efficient than 16b floating point
                                 16b float        8b fixed     4b fixed
Relative fps (processing
units x classification time)     O(1)             O(16)        O(256)
Energy consumption per unit      ~16              ~6-8         ~2-4
Relative accuracy                100% (no loss)   99%          50-95%
* [Choi,2019]
11. © 2019 Synopsys
Linear post-training quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Convert floating point pretrained models to Dynamic Fixed Point
[Figure: bit patterns for fixed point with one system exponent vs. dynamic fixed point with per-group exponents (Group 1 Exponent, Group 2 Exponent)]
* [Courbariaux,2014]
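The conversion above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes one shared exponent for the whole tensor; real implementations keep one exponent per group of weights or activations, as in the figure:

```python
import numpy as np

def quantize_dfp(x, bits=8):
    """Quantize a tensor to `bits`-bit dynamic fixed point: integers that
    share one power-of-two exponent per group (here, per tensor)."""
    max_val = np.max(np.abs(x))
    # Choose the group exponent so the largest magnitude still fits.
    exp = int(np.ceil(np.log2(max_val))) - (bits - 1)
    q = np.clip(np.round(x / 2.0**exp), -(2**(bits - 1)), 2**(bits - 1) - 1)
    return q.astype(np.int32), exp

def dequantize_dfp(q, exp):
    return q.astype(np.float64) * 2.0**exp

w = np.array([0.3, -0.7, 0.05, 0.9])
q, exp = quantize_dfp(w)          # q = [38, -90, 6, 115], exp = -7
w_hat = dequantize_dfp(q, exp)    # max error stays below one quantization step
```

The integers `q` are what the fixed-point MACs operate on; only the shared exponent is needed to recover the floating-point range.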
12. © 2019 Synopsys
Linear post-training quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Dynamic Fixed-Point Quantization allows running neural networks with 8
bit weights and activations across the board
32 bit float baseline 8 bit fixed point
* [Nvidia,2017]
13. © 2019 Synopsys
Linear post-training quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
How to optimally choose: dynamic exponent groups, saturation
thresholds, weight and activation exponents?
Min-max scaling preserves large values but throws away small values.
A saturation threshold better represents small values, but clips large values.
* [Nvidia,2017]
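The trade-off can be made concrete with a toy experiment. This is a hedged sketch on synthetic bell-shaped activations, comparing plain mean-squared error at a low bit-width; the TensorRT calibration in [Nvidia,2017] instead minimizes KL divergence between the original and quantized distributions:

```python
import numpy as np

def quant_mse(x, threshold, bits=8):
    """MSE of symmetric linear quantization when saturating |x| at `threshold`."""
    levels = 2**(bits - 1) - 1
    scale = threshold / levels
    q = np.clip(np.round(x / scale), -levels, levels) * scale
    return np.mean((x - q) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 100_000)                                 # bell-shaped activations
minmax = quant_mse(x, np.max(np.abs(x)), bits=4)              # min-max: no clipping
clipped = quant_mse(x, np.percentile(np.abs(x), 99), bits=4)  # saturation threshold
# At low bit-widths, saturating at the 99th percentile spends the few levels
# on the dense part of the distribution and yields lower overall error.
```

The percentile is a hypothetical choice; calibration methods search for the threshold rather than fixing it.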
14. © 2019 Synopsys
Linear trained quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Floating-point models are a poor initializer for low-precision fixed point.
Training the quantization from scratch replaces heuristic-based optimization.
Quantize weights and activations with straight-
through estimators, allowing back-prop,
and train the saturation range
for activations
* PACT, Parametrized Clipping Activation [Choi,2018]
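A NumPy sketch of a PACT-style quantizer: the forward pass clips and rounds, while the hand-written backward pass applies the straight-through estimator. This is illustrative only, not the reference implementation of [Choi,2018]:

```python
import numpy as np

def pact_forward(x, alpha, k):
    """Forward: clip activations to the trained range [0, alpha],
    then quantize uniformly to k bits."""
    levels = 2**k - 1
    return np.round(np.clip(x, 0.0, alpha) / alpha * levels) / levels * alpha

def pact_backward(x, alpha, grad_out):
    """Backward: the straight-through estimator treats round() as identity,
    so gradients pass through wherever x fell inside the clipping range;
    alpha collects gradient from the saturated activations."""
    grad_x = grad_out * ((x > 0) & (x < alpha))
    grad_alpha = np.sum(grad_out * (x >= alpha))
    return grad_x, grad_alpha

x = np.array([-1.0, 0.5, 3.0, 7.0])
y = pact_forward(x, alpha=6.0, k=4)              # [0.0, 0.4, 3.2, 6.0]
gx, ga = pact_backward(x, 6.0, np.ones_like(x))  # gx = [0, 1, 1, 0], ga = 1.0
```

In an autograd framework this pair becomes one custom op; training then updates both the weights and the clipping range alpha.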
15. © 2019 Synopsys
Linear trained quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Good accuracy down to 2b
Graceful performance degradation
[Chart: relative benchmark accuracy vs. float baseline (0.85-1.05) for CIFAR10, SVHN, AlexNet, ResNet18 and ResNet50, at full precision, 5b, 4b, 3b and 2b]
* [Choi,2018]
16. © 2019 Synopsys
Non-linear trained quantization – codebook clustering
5 Techniques – A. Neural Networks Are Error-Tolerant
Clustered, codebook quantization can be trained optimally.
This only reduces bandwidth; computations are still in floating point.
* [Han,2015]
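A minimal sketch of the idea: 1-D k-means over the weights, so each weight is stored as a small index into a shared codebook. Deep Compression [Han,2015] additionally fine-tunes the centroids with gradients during retraining, which is omitted here:

```python
import numpy as np

def cluster_weights(w, n_clusters=16, iters=20, seed=0):
    """Codebook quantization: approximate every weight by the nearest of
    k shared centroids (plain 1-D k-means), so each weight is stored as a
    log2(k)-bit index into a tiny floating-point codebook."""
    rng = np.random.default_rng(seed)
    flat = w.ravel()
    centroids = rng.choice(flat, n_clusters, replace=False)
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(idx == c):              # guard against empty clusters
                centroids[c] = flat[idx == c].mean()
    idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return idx.reshape(w.shape), centroids

w = np.random.default_rng(1).normal(size=(64, 64)).astype(np.float32)
idx, book = cluster_weights(w)     # 4-bit indices + 16-entry codebook
w_hat = book[idx]                  # "decompressed" weights used for compute
```

Storage drops from 32 bits to 4 bits per weight plus a 16-entry table, but as the slide notes the arithmetic on `w_hat` is still floating point.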
18. © 2019 Synopsys
Pruning Neural Networks
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
Pruning removes unnecessary connections in the neural network.
Accuracy is recovered by retraining the pruned network.
* [Han,2015]
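The core of magnitude pruning is a threshold and a mask; a sketch (the retraining loop that recovers accuracy is only indicated in the comment):

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.5):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude,
    as in the magnitude-based pruning of [Han,2015]."""
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) >= threshold
    return w * mask, mask

w = np.random.default_rng(0).normal(size=(128, 128))
w_pruned, mask = prune_by_magnitude(w, sparsity=0.5)
# Retrain with the mask re-applied after every weight update, so pruned
# connections stay at zero while the surviving weights recover accuracy.
```

Iterating prune-then-retrain usually reaches higher sparsity than pruning in one shot.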
19. © 2019 Synopsys
Low Rank Singular Value Decomposition (SVD) in DNNs
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
Many singular values are small and can be discarded
* [Xue,2013]
A = U Σ Vᵀ
A ≈ U′ Σ′ V′ᵀ
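For a fully connected layer this is a direct application of truncated SVD; a sketch on a synthetic, approximately low-rank weight matrix (the rank choice and test matrix are illustrative):

```python
import numpy as np

def low_rank_factorize(w, rank):
    """Replace one m x n dense layer by two layers (m x r and r x n) using
    truncated SVD; parameters drop from m*n to r*(m+n)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # m x r, singular values folded in
    b = vt[:rank, :]             # r x n
    return a, b

rng = np.random.default_rng(0)
# A synthetic layer that is nearly low-rank, as trained weights often are:
w = rng.normal(size=(256, 32)) @ rng.normal(size=(32, 256)) \
    + 0.01 * rng.normal(size=(256, 256))
a, b = low_rank_factorize(w, rank=32)
err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)   # small relative error
```

Here 65,536 weights become 2 x 8,192 while the layer output `x @ a @ b` stays close to `x @ w`; in practice the factorized network is retrained to recover the remaining accuracy.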
20. © 2019 Synopsys
Low Rank Canonical Polyadic (CP) decomp. in CNNs
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
Convert a large convolutional filter into a triplet of smaller filters
* [Astrid,2017]
21. © 2019 Synopsys
Basic example: Combining SVD, pruning and clustering
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
11x model compression in a phone-recognition LSTM
[Chart: LSTM compression rate (0-12x) for Base, P, SVD, SVD+P, P+C and SVD+P+C]
* [Goetschalckx,2018]
P = Pruning
SVD = Singular Value Decomposition
C = Clustering / Codebook Compression
23. © 2019 Synopsys
Sparse feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Feature map bandwidth dominates in modern CNNs
[Chart: bandwidth in MobileNet-V1 [MB] (0-12): coefficient BW vs. feature map BW, per layer (1x1, 32; 3x3 DW, 32; 1x1, 64; ...)]
24. © 2019 Synopsys
Sparse feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
ReLU activation introduces 50-90% zero-valued numbers in intermediate
feature maps
Before ReLU (8b features):   After ReLU (8b features):
 -5   4  12                   0   4  12
-10   0  17                   0   0  17
 -1   3   2                   0   3   2
25. © 2019 Synopsys
Sparse feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Hardware support for multi-bit Huffman encoding allows up to 2x
bandwidth reduction in typical networks.
Zero-runlength encoding as in [Chen, 2016],
Huffman-encoding as in [Moons, 2017]
8b features (72b):   Huffman code:
0  4  12             zero     → 2'b00
0  0  17             <16      → 2'b01 + 4'b WORD
0  3   2             nonzero  → 1'b1 + 8'b WORD
Encoded: 41b < 72b
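The 41-bit figure can be reproduced by costing each symbol under the slide's prefix code; a sketch of the bit accounting only, not a full encoder:

```python
def huffman_bits(values):
    """Bit cost of the prefix code from the slide: zeros cost 2 bits ('00'),
    small nonzeros (<16) cost 2+4 bits ('01' + 4-bit word), and all other
    values cost 1+8 bits ('1' + 8-bit word)."""
    total = 0
    for v in values:
        if v == 0:
            total += 2
        elif v < 16:
            total += 2 + 4
        else:
            total += 1 + 8
    return total

fmap = [0, 4, 12, 0, 0, 17, 0, 3, 2]   # the post-ReLU feature map from the slide
print(huffman_bits(fmap), "bits vs", 8 * len(fmap), "raw")  # 41 bits vs 72 raw
```

Four zeros cost 8 bits, the four small values (4, 12, 3, 2) cost 24, and the single large value (17) costs 9, giving the 41 < 72 result on the slide.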
26. © 2019 Synopsys
Correlated feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Intermediate features in the same channel-plane are highly correlated
Intermediate feature maps in ReLU-less YOLOv2
[Figure: example feature maps at scale 1 and scale 9]
27. © 2019 Synopsys
Correlated feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Super-linear correlation based extended bit-plane compression allows
feature-map compression even on non-sparse data
* [Cavigelli,2018]
Correlated values: 16, 20, 20, 20, 28, 28, 28, 99
Delta values: 0, 4, 0, 0, 8, 0, 0, 71
The deltas split into zero values and small non-zero values, which compress well.
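A simplified sketch of the delta step on the slide's example row; the extended bit-plane compression of [Cavigelli,2018] is more elaborate, this only shows why correlated neighbors become compressible:

```python
def delta_encode(row):
    """Delta-encode a run of correlated neighbor values: keep the first
    value and store successive differences, which are mostly small or zero
    and therefore well suited to zero-run-length or Huffman coding."""
    prev = row[0]
    deltas = []
    for v in row[1:]:
        deltas.append(v - prev)
        prev = v
    return row[0], deltas

base, deltas = delta_encode([16, 20, 20, 20, 28, 28, 28, 99])
# deltas == [4, 0, 0, 8, 0, 0, 71]: sparse even though the input had no zeros
```

Decoding is the running sum starting from `base`, so the transform is lossless.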
28. © 2019 Synopsys
Correlated feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Correlated compression outperforms sparsity-based compression
[Chart: compression rate (0-3x), sparsity-based vs. correlation-based, on MobileNet, ResNet-50, YOLOv2 VOC and VGG-16]
30. © 2019 Synopsys
A first-order analysis on ResNet-50
5 Techniques – Conclusion: bringing it all together
A first-order energy model for Neural Network Inference
Assume:
• Quadratic energy scaling per MAC when going from 32 to 8 bit
• Linear energy saving per read/write in DDR/SRAM when going from 32 to 8 bit
• 50% of coefficients zero after pruning*
• 50% compute reduction under decomposition
• 50% of activations can be compressed
* [Han,2015]
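These assumptions can be turned into a toy model. Everything below is a hedged sketch: the energy constants are made-up placeholders, not the numbers behind the slides' bars; only the O(3.6G) MACs and O(65MB) DDR per-frame figures come from the ResNet-50 analysis that follows:

```python
# Toy first-order energy model following the slide's assumptions.
E_MAC_32B = 1.0    # hypothetical relative energy per 32b MAC
E_MEM_32B = 50.0   # hypothetical relative energy per 32b word moved to/from DDR

def relative_energy(macs, mem_words, bits=32, compute_factor=1.0, bw_factor=1.0):
    scale = bits / 32.0
    e_mac = E_MAC_32B * scale**2 * macs * compute_factor   # quadratic MAC scaling
    e_mem = E_MEM_32B * scale * mem_words * bw_factor      # linear memory scaling
    return e_mac + e_mem

MACS, DDR = 3.6e9, 65e6     # O(3.6G) MACs, O(65MB) DDR per ResNet-50 frame
base = relative_energy(MACS, DDR)                              # 32b float
e_a = relative_energy(MACS, DDR, bits=8)                       # A. 8b fixed
e_b = relative_energy(MACS, DDR, bits=8,                       # B. + 50% pruning
                      compute_factor=0.25, bw_factor=0.5)      #    x 50% decomposition
e_c = relative_energy(MACS, DDR, bits=8,                       # C. + 50% feature-map
                      compute_factor=0.25, bw_factor=0.25)     #    compression
```

With any positive constants the ordering base > A > B > C holds; the exact ratios on the next slides depend on the real per-access energies and traffic counts.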
31. © 2019 Synopsys
A first-order analysis on ResNet-50
5 Techniques – Conclusion: bringing it all together
When all model data is stored in DRAM, the optimized ResNet-50 is 10x
more efficient than its plain 32b counterpart
O(65MB) DDR / frame O(1GB) SRAM / frame O(3.6G) MACS / frame
[Chart: relative energy consumption: 32b float 100%; A. 8b fixed 22%; B. decomposition + pruning 16%; C. feature-map compression 11% (10x overall)]
32. © 2019 Synopsys
A first-order analysis on ResNet-50
5 Techniques – Conclusion: bringing it all together
In a system with sufficient on-chip SRAM, optimized ResNet-50 is 12.5x
more efficient than its plain 32b counterpart
O(0MB) DDR / frame O(1GB) SRAM / frame O(3.6G) MACS / frame
[Chart: relative energy consumption: 32b float 100%; A. 8b fixed 15%; B. decomposition + pruning 9%; C. feature-map compression 8% (12.5x overall)]
33. © 2019 Synopsys
For More Information
Visit the Synopsys booth for
demos on Automotive ADAS,
Virtual Reality & More
33
EV6x Embedded Vision
Processor IP with Safety
Enhancement Package
• Thursday, May 23
• Santa Clara Convention Center
• Doors open 8 AM
• Sessions on EV6x Vision Processor IP, Functional Safety, Security, OpenVX…
• Register via the EV Alliance website or at Synopsys Booth
Join Synopsys’ EV Seminar on Thursday
Navigating Embedded Vision at the Edge
B E S T P R O C E S S O R
34. © 2019 Synopsys
References
[Han, 2015, 2016] https://arxiv.org/abs/1510.00149 and https://arxiv.org/abs/1602.01528
[Xue, 2013] https://www.microsoft.com/en-us/research/wp-content/uploads/2013/01/svd_v2.pdf
[Nvidia, 2017] http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
[Choi, 2018, 2019] https://arxiv.org/abs/1805.06085 and https://www.ibm.com/blogs/research/2019/04/2-bit-precision/
[Goetschalckx, 2018] https://www.sigmobile.org/mobisys/2018/workshops/deepmobile18/papers/Efficiently_Combining_SVD_Pruning_Clustering_Retraining.pdf
[Astrid, 2017] https://arxiv.org/abs/1701.07148
[Moons, 2017] https://ieeexplore.ieee.org/abstract/document/7870353
[Chen, 2016] http://eyeriss.mit.edu/
[Cavigelli, 2018] https://arxiv.org/abs/1810.03979
[Courbariaux, 2014] https://arxiv.org/pdf/1412.7024.pdf
Embedded Vision Summit
Bert Moons, "5+ Techniques for Efficient Implementation of Neural Networks"
May 2019