"Five+ Techniques for Efficient Implementation of Neural Networks," a Presentation from Synopsys
1. © 2019 Synopsys
5+ Techniques for Efficient
Implementation of Neural
Networks
Bert Moons
Synopsys
May 2019
3. © 2019 Synopsys
Neural Network accuracy comes at a high cost in terms of model
storage and operations per input feature
3 Major challenges
Introduction – Challenges of embedded deep learning
Many embedded applications require real-time operation on high-
dimensional, large input data from various input sources
Many embedded applications require support for a variety of
networks: CNNs for feature extraction, RNNs for sequence modeling
4. © 2019 Synopsys
3 Major challenges
Introduction – Challenges of embedded deep learning
1. Many operations per pixel
2. Many pixels to process in real time
3. A wide variety of algorithms to support
5. © 2019 Synopsys
Classification accuracy comes at a cost
Introduction – Challenges of embedded deep learning
[Chart: best reported top-5 accuracy on ImageNet-1000 [%] for conventional machine learning, deep learning, and human performance]
Neural network accuracy comes at the cost of a high workload per input pixel,
huge model sizes, and large bandwidth requirements
6. © 2019 Synopsys
Computing on large input data
Introduction – Challenges of embedded deep learning
[Figure: relative input size per frame: ImageNet (1x), Full HD (40x), 4K (160x)]
Embedded applications require
real-time operation on large input frames
7. © 2019 Synopsys
Massive workload in real-time applications
Introduction – Challenges of embedded deep learning
[Chart: Top-1 ImageNet accuracy [%] (65-75) vs. # operations per ImageNet image (1 GOP to 1 TOP) for MobileNet V2, GoogleNet, ResNet-50 and VGG-16, on a single ImageNet image]
1 GOP to 10 GOP per ImageNet image
[Chart: the same accuracy range vs. # operations per second (1 GOP/s to 1 TOP/s), scaled to 6 cameras at 30 fps, Full HD]
5-to-180 TOPS @ 30 fps, FHD, ADAS
8. © 2019 Synopsys
5+ Techniques to reduce the DNN workload
A. Neural Networks are
error-tolerant
Introduction – Challenges of embedded deep learning
1. Linear post-training 8/12/16b quantization
2. Linear trained 2/4/8 bit quantization
3. Non-linear trained 2/4/8 bit quantization
through clustering
C. Neural Networks have
sparse and correlated
intermediate results
B. Neural Networks have
redundancies and
are over-dimensioned
4. Network pruning and compression
5. Network decomposition: low-rank network
approximations
6. Sparsity and correlation based feature map
compression
10. © 2019 Synopsys
The benefits of quantized number representations
5 Techniques – A. Neural Networks Are Error-Tolerant
8-bit fixed point is 3-4x faster and 2-4x more energy-efficient than 16b floating point
                                 16b float        8b fixed     4b fixed
Relative fps (processing
units x classification time)     O(1)             O(16)        O(256)
Energy consumption per unit      ~16              ~6-8         ~2-4
Relative accuracy                100% (no loss)   99%          50-95%
* [Choi,2019]
11. © 2019 Synopsys
Linear post-training quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Convert floating point pretrained models to Dynamic Fixed Point
[Figure: bit patterns for fixed point with one system exponent vs. dynamic fixed point with per-group exponents (Group 1 Exponent, Group 2 Exponent)]
* [Courbariaux,2014]
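The conversion above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes one shared exponent for the whole tensor; real implementations keep one exponent per group of weights or activations, as in the figure:

```python
import numpy as np

def quantize_dfp(x, bits=8):
    """Quantize a tensor to `bits`-bit dynamic fixed point: integers that
    share one power-of-two exponent per group (here, per tensor)."""
    max_val = np.max(np.abs(x))
    # Choose the group exponent so the largest magnitude still fits.
    exp = int(np.ceil(np.log2(max_val))) - (bits - 1)
    q = np.clip(np.round(x / 2.0**exp), -(2**(bits - 1)), 2**(bits - 1) - 1)
    return q.astype(np.int32), exp

def dequantize_dfp(q, exp):
    return q.astype(np.float64) * 2.0**exp

w = np.array([0.3, -0.7, 0.05, 0.9])
q, exp = quantize_dfp(w)          # q = [38, -90, 6, 115], exp = -7
w_hat = dequantize_dfp(q, exp)    # max error stays below one quantization step
```

The integers `q` are what the fixed-point MACs operate on; only the shared exponent is needed to recover the floating-point range.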
12. © 2019 Synopsys
Linear post-training quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Dynamic Fixed-Point Quantization allows running neural networks with 8
bit weights and activations across the board
32 bit float baseline 8 bit fixed point
* [Nvidia,2017]
13. © 2019 Synopsys
Linear post-training quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
How to optimally choose: dynamic exponent groups, saturation
thresholds, weight and activation exponents?
Min-max scaling preserves large values but throws away small values.
A saturation threshold better represents small values, but clips large values.
* [Nvidia,2017]
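The trade-off can be made concrete with a toy experiment. This is a hedged sketch on synthetic bell-shaped activations, comparing plain mean-squared error at a low bit-width; the TensorRT calibration in [Nvidia,2017] instead minimizes KL divergence between the original and quantized distributions:

```python
import numpy as np

def quant_mse(x, threshold, bits=8):
    """MSE of symmetric linear quantization when saturating |x| at `threshold`."""
    levels = 2**(bits - 1) - 1
    scale = threshold / levels
    q = np.clip(np.round(x / scale), -levels, levels) * scale
    return np.mean((x - q) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 100_000)                                 # bell-shaped activations
minmax = quant_mse(x, np.max(np.abs(x)), bits=4)              # min-max: no clipping
clipped = quant_mse(x, np.percentile(np.abs(x), 99), bits=4)  # saturation threshold
# At low bit-widths, saturating at the 99th percentile spends the few levels
# on the dense part of the distribution and yields lower overall error.
```

The percentile is a hypothetical choice; calibration methods search for the threshold rather than fixing it.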
14. © 2019 Synopsys
Linear trained quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Floating-point models are a poor initializer for low-precision fixed point.
Training the quantization from scratch replaces heuristic-based optimization.
Quantize weights and activations with straight-
through estimators, allowing back-prop,
and train the saturation range
for activations
* PACT, Parametrized Clipping Activation [Choi,2018]
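A NumPy sketch of a PACT-style quantizer: the forward pass clips and rounds, while the hand-written backward pass applies the straight-through estimator. This is illustrative only, not the reference implementation of [Choi,2018]:

```python
import numpy as np

def pact_forward(x, alpha, k):
    """Forward: clip activations to the trained range [0, alpha],
    then quantize uniformly to k bits."""
    levels = 2**k - 1
    return np.round(np.clip(x, 0.0, alpha) / alpha * levels) / levels * alpha

def pact_backward(x, alpha, grad_out):
    """Backward: the straight-through estimator treats round() as identity,
    so gradients pass through wherever x fell inside the clipping range;
    alpha collects gradient from the saturated activations."""
    grad_x = grad_out * ((x > 0) & (x < alpha))
    grad_alpha = np.sum(grad_out * (x >= alpha))
    return grad_x, grad_alpha

x = np.array([-1.0, 0.5, 3.0, 7.0])
y = pact_forward(x, alpha=6.0, k=4)              # [0.0, 0.4, 3.2, 6.0]
gx, ga = pact_backward(x, 6.0, np.ones_like(x))  # gx = [0, 1, 1, 0], ga = 1.0
```

In an autograd framework this pair becomes one custom op; training then updates both the weights and the clipping range alpha.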
15. © 2019 Synopsys
Linear trained quantization
5 Techniques – A. Neural Networks Are Error-Tolerant
Good accuracy down to 2b
Graceful performance degradation
[Chart: relative benchmark accuracy vs. float baseline (0.85-1.05) for CIFAR10, SVHN, AlexNet, ResNet18 and ResNet50, at full precision, 5b, 4b, 3b and 2b]
* [Choi,2018]
16. © 2019 Synopsys
Non-linear trained quantization – codebook clustering
5 Techniques – A. Neural Networks Are Error-Tolerant
Clustered, codebook quantization can be trained optimally.
This only reduces bandwidth; computations are still in floating point.
* [Han,2015]
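A minimal sketch of the idea: 1-D k-means over the weights, so each weight is stored as a small index into a shared codebook. Deep Compression [Han,2015] additionally fine-tunes the centroids with gradients during retraining, which is omitted here:

```python
import numpy as np

def cluster_weights(w, n_clusters=16, iters=20, seed=0):
    """Codebook quantization: approximate every weight by the nearest of
    k shared centroids (plain 1-D k-means), so each weight is stored as a
    log2(k)-bit index into a tiny floating-point codebook."""
    rng = np.random.default_rng(seed)
    flat = w.ravel()
    centroids = rng.choice(flat, n_clusters, replace=False)
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(idx == c):              # guard against empty clusters
                centroids[c] = flat[idx == c].mean()
    idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return idx.reshape(w.shape), centroids

w = np.random.default_rng(1).normal(size=(64, 64)).astype(np.float32)
idx, book = cluster_weights(w)     # 4-bit indices + 16-entry codebook
w_hat = book[idx]                  # "decompressed" weights used for compute
```

Storage drops from 32 bits to 4 bits per weight plus a 16-entry table, but as the slide notes the arithmetic on `w_hat` is still floating point.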
18. © 2019 Synopsys
Pruning Neural Networks
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
Pruning removes unnecessary connections in the neural network.
Accuracy is recovered by retraining the pruned network.
* [Han,2015]
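The core of magnitude pruning is a threshold and a mask; a sketch (the retraining loop that recovers accuracy is only indicated in the comment):

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.5):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude,
    as in the magnitude-based pruning of [Han,2015]."""
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) >= threshold
    return w * mask, mask

w = np.random.default_rng(0).normal(size=(128, 128))
w_pruned, mask = prune_by_magnitude(w, sparsity=0.5)
# Retrain with the mask re-applied after every weight update, so pruned
# connections stay at zero while the surviving weights recover accuracy.
```

Iterating prune-then-retrain usually reaches higher sparsity than pruning in one shot.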
19. © 2019 Synopsys
Low Rank Singular Value Decomposition (SVD) in DNNs
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
Many singular values are small and can be discarded
* [Xue,2013]
A = U Σ Vᵀ
A ≈ U′ Σ′ V′ᵀ
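For a fully connected layer this is a direct application of truncated SVD; a sketch on a synthetic, approximately low-rank weight matrix (the rank choice and test matrix are illustrative):

```python
import numpy as np

def low_rank_factorize(w, rank):
    """Replace one m x n dense layer by two layers (m x r and r x n) using
    truncated SVD; parameters drop from m*n to r*(m+n)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # m x r, singular values folded in
    b = vt[:rank, :]             # r x n
    return a, b

rng = np.random.default_rng(0)
# A synthetic layer that is nearly low-rank, as trained weights often are:
w = rng.normal(size=(256, 32)) @ rng.normal(size=(32, 256)) \
    + 0.01 * rng.normal(size=(256, 256))
a, b = low_rank_factorize(w, rank=32)
err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)   # small relative error
```

Here 65,536 weights become 2 x 8,192 while the layer output `x @ a @ b` stays close to `x @ w`; in practice the factorized network is retrained to recover the remaining accuracy.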
20. © 2019 Synopsys
Low Rank Canonical Polyadic (CP) decomp. in CNNs
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
Convert a large convolutional filter into a triplet of smaller filters
* [Astrid,2017]
21. © 2019 Synopsys
Basic example: Combining SVD, pruning and clustering
5 Techniques – B. Neural Networks are Over-Dimensioned & Redundant
11x model compression in a phone-recognition LSTM
[Chart: LSTM compression rate (0-12x) for Base, P, SVD, SVD+P, P+C and SVD+P+C]
* [Goetschalckx,2018]
P = Pruning
SVD = Singular Value Decomposition
C = Clustering / Codebook Compression
23. © 2019 Synopsys
Sparse feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Feature map bandwidth dominates in modern CNNs
[Chart: bandwidth in MobileNet-V1 [MB] (0-12): coefficient BW vs. feature map BW, per layer (1x1, 32; 3x3 DW, 32; 1x1, 64; ...)]
24. © 2019 Synopsys
Sparse feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
ReLU activation introduces 50-90% zero-valued numbers in intermediate
feature maps
Before ReLU (8b features):   After ReLU (8b features):
 -5   4  12                   0   4  12
-10   0  17                   0   0  17
 -1   3   2                   0   3   2
25. © 2019 Synopsys
Sparse feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Hardware support for multi-bit Huffman encoding allows up to 2x
bandwidth reduction in typical networks.
Zero-runlength encoding as in [Chen, 2016],
Huffman-encoding as in [Moons, 2017]
8b features (72b):   Huffman code:
0  4  12             zero     → 2'b00
0  0  17             <16      → 2'b01 + 4'b WORD
0  3   2             nonzero  → 1'b1 + 8'b WORD
Encoded: 41b < 72b
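The 41-bit figure can be reproduced by costing each symbol under the slide's prefix code; a sketch of the bit accounting only, not a full encoder:

```python
def huffman_bits(values):
    """Bit cost of the prefix code from the slide: zeros cost 2 bits ('00'),
    small nonzeros (<16) cost 2+4 bits ('01' + 4-bit word), and all other
    values cost 1+8 bits ('1' + 8-bit word)."""
    total = 0
    for v in values:
        if v == 0:
            total += 2
        elif v < 16:
            total += 2 + 4
        else:
            total += 1 + 8
    return total

fmap = [0, 4, 12, 0, 0, 17, 0, 3, 2]   # the post-ReLU feature map from the slide
print(huffman_bits(fmap), "bits vs", 8 * len(fmap), "raw")  # 41 bits vs 72 raw
```

Four zeros cost 8 bits, the four small values (4, 12, 3, 2) cost 24, and the single large value (17) costs 9, giving the 41 < 72 result on the slide.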
26. © 2019 Synopsys
Correlated feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Intermediate features in the same channel-plane are highly correlated
Intermediate feature maps in ReLU-less YOLOv2
[Figure: example feature maps at scale 1 and scale 9]
27. © 2019 Synopsys
Correlated feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Super-linear correlation based extended bit-plane compression allows
feature-map compression even on non-sparse data
* [Cavigelli,2018]
Correlated values: 16, 20, 20, 20, 28, 28, 28, 99
Delta values: 0, 4, 0, 0, 8, 0, 0, 71
The deltas split into zero values and small non-zero values, which compress well.
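A simplified sketch of the delta step on the slide's example row; the extended bit-plane compression of [Cavigelli,2018] is more elaborate, this only shows why correlated neighbors become compressible:

```python
def delta_encode(row):
    """Delta-encode a run of correlated neighbor values: keep the first
    value and store successive differences, which are mostly small or zero
    and therefore well suited to zero-run-length or Huffman coding."""
    prev = row[0]
    deltas = []
    for v in row[1:]:
        deltas.append(v - prev)
        prev = v
    return row[0], deltas

base, deltas = delta_encode([16, 20, 20, 20, 28, 28, 28, 99])
# deltas == [4, 0, 0, 8, 0, 0, 71]: sparse even though the input had no zeros
```

Decoding is the running sum starting from `base`, so the transform is lossless.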
28. © 2019 Synopsys
Correlated feature-map compression
5 Techniques – C. Neural Networks have Sparse, Correlated Intermediate Feature Maps
Correlated compression outperforms sparsity-based compression
[Chart: compression rate (0-3x), sparsity-based vs. correlation-based, on MobileNet, ResNet-50, YOLOv2 VOC and VGG-16]
30. © 2019 Synopsys
A first-order analysis on ResNet-50
5 Techniques – Conclusion: bringing it all together
A first-order energy model for Neural Network Inference
Assume:
• Quadratic energy scaling per MAC when going from 32 to 8 bit
• Linear energy saving per read/write in DDR/SRAM when going from 32 to 8 bit
• 50% of coefficients zero after pruning*
• 50% compute reduction under decomposition
• 50% of activations can be compressed
* [Han,2015]
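These assumptions can be turned into a toy model. Everything below is a hedged sketch: the energy constants are made-up placeholders, not the numbers behind the slides' bars; only the O(3.6G) MACs and O(65MB) DDR per-frame figures come from the ResNet-50 analysis that follows:

```python
# Toy first-order energy model following the slide's assumptions.
E_MAC_32B = 1.0    # hypothetical relative energy per 32b MAC
E_MEM_32B = 50.0   # hypothetical relative energy per 32b word moved to/from DDR

def relative_energy(macs, mem_words, bits=32, compute_factor=1.0, bw_factor=1.0):
    scale = bits / 32.0
    e_mac = E_MAC_32B * scale**2 * macs * compute_factor   # quadratic MAC scaling
    e_mem = E_MEM_32B * scale * mem_words * bw_factor      # linear memory scaling
    return e_mac + e_mem

MACS, DDR = 3.6e9, 65e6     # O(3.6G) MACs, O(65MB) DDR per ResNet-50 frame
base = relative_energy(MACS, DDR)                              # 32b float
e_a = relative_energy(MACS, DDR, bits=8)                       # A. 8b fixed
e_b = relative_energy(MACS, DDR, bits=8,                       # B. + 50% pruning
                      compute_factor=0.25, bw_factor=0.5)      #    x 50% decomposition
e_c = relative_energy(MACS, DDR, bits=8,                       # C. + 50% feature-map
                      compute_factor=0.25, bw_factor=0.25)     #    compression
```

With any positive constants the ordering base > A > B > C holds; the exact ratios on the next slides depend on the real per-access energies and traffic counts.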
31. © 2019 Synopsys
A first-order analysis on ResNet-50
5 Techniques – Conclusion: bringing it all together
When all model data is stored in DRAM, the optimized ResNet-50 is 10x
more efficient than its plain 32b counterpart
O(65MB) DDR / frame O(1GB) SRAM / frame O(3.6G) MACS / frame
[Chart: relative energy consumption: 32b float 100%; A. 8b fixed 22%; B. decomposition + pruning 16%; C. feature-map compression 11% (10x overall)]
32. © 2019 Synopsys
A first-order analysis on ResNet-50
5 Techniques – Conclusion: bringing it all together
In a system with sufficient on-chip SRAM, optimized ResNet-50 is 12.5x
more efficient than its plain 32b counterpart
O(0MB) DDR / frame O(1GB) SRAM / frame O(3.6G) MACS / frame
[Chart: relative energy consumption: 32b float 100%; A. 8b fixed 15%; B. decomposition + pruning 9%; C. feature-map compression 8% (12.5x overall)]
33. © 2019 Synopsys
For More Information
Visit the Synopsys booth for
demos on Automotive ADAS,
Virtual Reality & More
33
EV6x Embedded Vision
Processor IP with Safety
Enhancement Package
• Thursday, May 23
• Santa Clara Convention Center
• Doors open 8 AM
• Sessions on EV6x Vision Processor IP, Functional Safety, Security, OpenVX…
• Register via the EV Alliance website or at Synopsys Booth
Join Synopsys’ EV Seminar on Thursday
Navigating Embedded Vision at the Edge
B E S T P R O C E S S O R
34. © 2019 Synopsys
References
[Han, 2015, 2016] https://arxiv.org/abs/1510.00149 and https://arxiv.org/abs/1602.01528
[Xue, 2013] https://www.microsoft.com/en-us/research/wp-content/uploads/2013/01/svd_v2.pdf
[Nvidia, 2017] http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
[Choi, 2018, 2019] https://arxiv.org/abs/1805.06085 and https://www.ibm.com/blogs/research/2019/04/2-bit-precision/
[Goetschalckx, 2018] https://www.sigmobile.org/mobisys/2018/workshops/deepmobile18/papers/Efficiently_Combining_SVD_Pruning_Clustering_Retraining.pdf
[Astrid, 2017] https://arxiv.org/abs/1701.07148
[Moons, 2017] https://ieeexplore.ieee.org/abstract/document/7870353
[Chen, 2016] http://eyeriss.mit.edu/
[Cavigelli, 2018] https://arxiv.org/abs/1810.03979
[Courbariaux, 2014] https://arxiv.org/pdf/1412.7024.pdf
Embedded Vision Summit
Bert Moons, "5+ Techniques for Efficient Implementation of Neural Networks"
May 2019