Practical DNN Quantization
Techniques and Tools
Raghuraman Krishnamoorthi
Software Engineer, Facebook
Sept 2020
Outline
• Motivation
• Quantization: Overview
• Quantizing deep networks
• Post Training quantization
• Quantization aware training
• Best Practices
• Conclusions
Motivation(1)
• The number of edge devices is growing rapidly, and many of these devices are resource constrained.
• Gartner predicts that 80% of mobile devices shipped in 2022 will have on-device AI.
Source: https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/
Motivation(2)
• While models are becoming more efficient, high accuracy still implies high complexity.
From: Benchmark Analysis of Representative Deep Neural Network Architectures, Simone Bianco et al.
Motivation(3)
• Energy consumption is dominated by memory accesses.
• Reduced precision is critical to save power.
Quantization
• Many approaches exist to solve the problems outlined here:
➢ Better hardware accelerators: DSPs, NPUs
  ➢ Requires new custom hardware
➢ Efficient deep network architectures: MnasNet, MobileNet, FBNet
  ➢ Requires new model architectures
• A simpler approach that does not require re-designing models or new hardware is quantization.
• Quantization refers to techniques that perform computation and storage at reduced precision.
• Works in combination with the above approaches.
• Table stakes for low-power custom hardware.
Quantization: Benefits
• Applicability: broad applicability across models and use cases
• Hardware support: x86, Nvidia Volta, ARM, Mali, Hexagon
• Software support: kernel libraries widely available
• Memory size: 4x reduction
• Memory bandwidth/cache: 4x reduction
• Compute: 2x to 4x speedup, depending on ISA
• Power: ~4x*
Note: Comparing float32 implementations with 8-bit inference on CPU.
* Higher power savings are possible due to cache effects.
Background: Quantization(1)
• Quantization refers to mapping values from fp32 to a lower precision format.
• Specified by
• Format
• Mapping type
• Granularity
Formats (in decreasing precision): fp32, fp16, bfloat16, int8, int4, binary
Background: Quantization(2)
• We also consider different granularities of quantization:
• Per tensor quantization
• Same mapping for all elements in a tensor.
• Per-axis/per-channel quantization:
• Quantizer parameters are chosen independently for each output row (fully connected layers) or each convolution kernel (conv layers) of the weight tensor.
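To make the granularity distinction concrete, below is a minimal sketch using PyTorch's quantized-tensor primitives (torch.quantize_per_tensor and torch.quantize_per_channel). The tensor shape and the simple max-based scale choice are illustrative assumptions, not the exact recipe of any particular backend.

```python
import torch

# Illustrative conv weight: (out_channels, in_channels, kH, kW)
w = torch.randn(8, 3, 3, 3)

# Per-tensor: one scale/zero-point shared by every element of the tensor.
scale = w.abs().max().item() / 127.0
w_q_tensor = torch.quantize_per_tensor(w, scale=scale, zero_point=0, dtype=torch.qint8)

# Per-channel: an independent scale/zero-point for each conv kernel (axis 0).
scales = w.abs().reshape(w.size(0), -1).max(dim=1).values / 127.0
zero_points = torch.zeros(w.size(0), dtype=torch.int64)
w_q_channel = torch.quantize_per_channel(w, scales, zero_points, axis=0, dtype=torch.qint8)

# Per-channel tracks each kernel's dynamic range, so reconstruction error is usually lower.
print((w - w_q_tensor.dequantize()).abs().max().item())
print((w - w_q_channel.dequantize()).abs().max().item())
```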
Quantizing Deep Neural Networks
Model Quantization: Overview
(Diagram: three deployment pipelines compared. A float pipeline: train → model → convert for inference → graph optimization → compile, backed by kernel implementations. A post-training quantization pipeline adds a quantization step to this conversion flow. A quantization-aware-training pipeline additionally inserts a "prepare for quantization" step around training before the quantization step.)
Post Training Quantization: Weight Compression
• The simplest quantization scheme is to compress the weights to lower precision.
• Requires no input data and can be done statically as part of preparing a model for inference.
• Hardware accelerators can benefit if de-compression is done after the memory access.
• Trivial for the case of fp16/int8 quantization of weights.
• K-means compression is also supported on select platforms and is amenable to simple de-compression:
• Scatter-gather operation in select processors
• Supported in CoreML
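As a sketch of weight-only compression (assuming simple symmetric int8 rounding; the helper names are hypothetical, not a framework API), the fp32 weights can be replaced offline by int8 values plus a scale, and de-compressed after the memory access:

```python
import torch

def compress_weight(w: torch.Tensor):
    # Data-free, offline int8 compression of a single weight tensor (hypothetical helper).
    scale = w.abs().max() / 127.0
    w_int8 = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return w_int8, scale            # ~4x smaller than fp32; store these with the model

def decompress_weight(w_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # De-compression after the memory access; compute can still run in fp32.
    return w_int8.float() * scale

w = torch.randn(1024, 1024)
w_int8, scale = compress_weight(w)
w_restored = decompress_weight(w_int8, scale)
print((w - w_restored).abs().max().item())   # small reconstruction error
```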
Dynamic Quantization
• Dynamic quantization refers to schemes where the activations are read/written in fp32 and are dynamically quantized to lower precision for compute.
• Requires no calibration data.
• Data exchanged between operations is in floating point, so there is no need to worry about format conversion.
• Provides performance improvements close to static quantization as long as the latency is compute bound.
• Suitable for inference of transformer/LSTM models.
• Supported by:
• PyTorch
• TensorFlow Lite
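A minimal sketch of dynamic quantization using PyTorch's torch.quantization.quantize_dynamic (the tiny Sequential model below stands in for a transformer/LSTM workload and is an assumption for illustration):

```python
import torch
import torch.nn as nn

# Stand-in for a transformer/LSTM model dominated by large matrix multiplies.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).eval()

# Weights are converted to int8 ahead of time; activations are quantized on the fly
# at each Linear layer, so no calibration data is required.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantized_model(torch.randn(1, 512))
```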
Quantizing Weights and Activations
• Post-training quantization refers to quantizing both weights and activations to reduced precision, typically int8.
• Requires estimating statistics of the activations to determine quantizer parameters.
• Quantizer parameters are determined by minimizing an error metric:
• KL divergence: TensorRT
• Saturation error: TensorFlow Lite
• Expected quantization error: PyTorch
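Below is a minimal sketch of this post-training (static) flow with PyTorch's eager-mode API: observers estimate activation statistics during a calibration pass, then the model is converted to int8. The small network and the random calibration batches are illustrative assumptions; real calibration should use representative data.

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fp32 -> int8 at the input
        self.conv = nn.Conv2d(3, 16, 3)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> fp32 at the output

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

model = SmallNet().eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')  # x86 server backend
torch.quantization.prepare(model, inplace=True)      # insert observers

# Calibration pass: estimate activation ranges from representative inputs.
with torch.no_grad():
    for _ in range(8):
        model(torch.randn(1, 3, 32, 32))             # stand-in for real calibration data

torch.quantization.convert(model, inplace=True)      # int8 weights + fixed activation params
```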
Modeling Quantization During Training
• Emulate quantization by quantizing and de-quantizing in succession.
• Values are still in floating point, but with reduced precision.
• x_out = FakeQuant(x) = s · (Clamp(round(x/s) − z) + z) = DeQuant(Quant(x))
• Can also model quantization as a stochastic rounding operation.
[Figure: fake quantizer (top), showing the quantization of output values; the approximation used for derivative calculation (bottom).]
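A minimal sketch of the fake-quantize operation with a straight-through estimator for the derivative, matching the formula above (an illustrative autograd.Function, not the framework's internal implementation; the scale and zero point are assumed fixed here):

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Quantize-dequantize in the forward pass; straight-through estimator in the backward pass."""

    @staticmethod
    def forward(ctx, x, s, z, qmin, qmax):
        q = torch.round(x / s) - z                     # Quant
        ctx.save_for_backward((q >= qmin) & (q <= qmax))
        return s * (torch.clamp(q, qmin, qmax) + z)    # DeQuant

    @staticmethod
    def backward(ctx, grad_out):
        (in_range,) = ctx.saved_tensors
        # Treat rounding as identity: pass gradients where the value was in range,
        # zero them where the clamp saturated.
        return grad_out * in_range, None, None, None, None

x = torch.randn(4, requires_grad=True)
y = FakeQuant.apply(x, 0.1, 0, -128, 127)
y.sum().backward()     # x.grad is 1 wherever x fell inside the quantization range
```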
Comparison of Quantization Techniques
Multiple ways to quantize a network with different impact:
• Weight-only quantization to int8: 4x memory bandwidth reduction for weights, 1x for activations, 1x compute speedup. Suitable for embedding lookups.
• Dynamic quantization: 4x for weights, 1x for activations, 2x compute speedup. Suitable for compute-bound layers.
• Static quantization (int32 accumulators): 4x for weights, 4x for activations, 2x compute speedup. Suited for all layers; important for convolutions.
• Static quantization (int16 accumulators): 4x for weights, 4x for activations, 4x compute speedup. Requires lower-precision weights/activations.
Results
Post Training Quantization Results
• ResNet50 (Imagenet): fp32 accuracy 76.1, int8 accuracy 75.9 (-0.2). Technique: post-training quantization. CPU inference speedup: 2x (214 ms → 102 ms, Intel Skylake-DE).
• BERT (F1 on GLUE MRPC): fp32 score 90.2, int8 score 90.2 (0.0). Technique: dynamic quantization with per-channel quantization. CPU inference speedup: 1.6x (581 ms → 313 ms, Intel Skylake-DE, batch size = 1).
Quantization Aware Training
• Quantization aware training provides the best accuracy and allows for simpler quantization schemes.
• MobileNetV2 (Imagenet, fp32 accuracy 71.9):
• Post-training, per-tensor quantization: int8 accuracy 65.6 (-6.3); 4x CPU inference speedup (75 ms → 18 ms, OnePlus 5, Snapdragon 835)
• Post-training, per-channel quantization: int8 accuracy 67.1 (-4.8)
• Quantization-aware training, per-tensor quantization: int8 accuracy 71.5 (-0.4)
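A minimal sketch of the eager-mode quantization-aware-training flow in PyTorch (the toy model, the Conv-ReLU fusion, and the random fine-tuning loop are illustrative assumptions; a real flow starts from a pretrained model and fine-tunes on the original training data):

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 16, 3)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

model = TinyNet()

# Fuse Conv-ReLU (Conv-BN-ReLU in a real model) so fake quantization sees the fused op.
model.eval()
torch.quantization.fuse_modules(model, [['conv', 'relu']], inplace=True)

# Insert fake-quant modules and fine-tune so the weights adapt to quantization noise.
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)

opt = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(10):                                   # placeholder fine-tuning loop
    loss = model(torch.randn(8, 3, 32, 32)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

model.eval()
quantized_model = torch.quantization.convert(model)   # real int8 model for deployment
```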
Recipe for Quantizing a Model
Recipe(1)
• Profile your model first:
• Latency: Dynamic quantization (CPU)
• Model size: Dynamic/Static quantization
• Code size: Choose appropriate framework
• Power: Static quantization, custom hardware (NPU)
• Accuracy: Quantization aware training (if needed)
• Iterate on architecture: Exponential tradeoff between complexity and accuracy
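As a rough way to get the latency numbers driving these choices, the sketch below measures average CPU inference time for a model (a hypothetical helper using time.perf_counter; dedicated profilers give per-operator detail):

```python
import time
import torch

def profile_latency_ms(model, example_input, warmup=10, iters=100):
    # Crude average-latency measurement for CPU inference (hypothetical helper).
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(example_input)
        start = time.perf_counter()
        for _ in range(iters):
            model(example_input)
    return (time.perf_counter() - start) / iters * 1e3

# Typical use: compare a float baseline against its quantized variant, e.g.
#   profile_latency_ms(float_model, x) vs. profile_latency_ms(quantized_model, x)
```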
Recipe(2)
• Fuse: Optimize your model and kernels
• Fuse operations as much as possible:
• Conv-ReLU, Conv-BatchNorm, and more
• Quantize and check the accuracy
• If accuracy is not sufficient:
• Try quantization aware training
• Selectively quantize parts of the model (see the sketch below)
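A minimal sketch of this recipe in PyTorch eager mode: fuse Conv-BN-ReLU, then quantize the whole model while selectively keeping an accuracy-sensitive submodule in fp32 by leaving its qconfig unset (the model, the "head" module chosen to skip, and the random calibration input are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 16, 3)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()
        self.head = nn.Conv2d(16, 8, 1)     # pretend this layer is accuracy-sensitive

    def forward(self, x):
        x = self.relu(self.bn(self.conv(self.quant(x))))
        return self.head(self.dequant(x))   # head runs in fp32 after de-quantization

model = Net().eval()

# Fuse as much as possible before quantizing.
torch.quantization.fuse_modules(model, [['conv', 'bn', 'relu']], inplace=True)

# Quantize everything except the sensitive part: leave its qconfig set to None.
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
model.head.qconfig = None

torch.quantization.prepare(model, inplace=True)
with torch.no_grad():
    model(torch.randn(1, 3, 32, 32))        # stand-in for calibration data
torch.quantization.convert(model, inplace=True)
```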
Learn More
Resources
https://pytorch.org/docs/stable/quantization.html
https://www.tensorflow.org/model_optimization
Tutorials
https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html
Quantized models for download
https://github.com/pytorch/vision/tree/master/torchvision/models/quantization
2020 Embedded Vision Summit
“Practical DNN Quantization Techniques
and Tools”
9/15/2020