Standardising the
compressed representation
of neural networks
Werner Bailer
TEWI-Kolloquium, 2021-06-25
Outline
Context
State of the art in NN compression
Standard: need and design considerations
Coding methods
High-level syntax
Performance
Ongoing work and conclusion
Context
Artificial neural networks are widely used, e.g. in multimedia
visual and audio content recognition and classification
speech and natural language processing
Deep learning makes use of very large networks
many layers with nodes and connections
parameters/weights attached to each of them (e.g. convolution operations)
Context
Training
learn parameters from data
typically once, on powerful infrastructure
updates or adaptations in target environment may be necessary
Inference
use the trained network for prediction
needs network with all its parameters, i.e., large amount of data to be transmitted and processed
→ focus: small size to be transmitted
often used on resource constrained devices (mobile phones, smart cameras, edge nodes, …)
→ focus: low memory and computational complexity during inference
SotA in NN Compression
typically three steps
reduction of parameters, e.g.
eliminating neurons (pruning)
reducing the entropy of a tensor
decomposing/transforming a tensor
reducing the precision of parameters (i.e. quantization)
performing entropy coding
Han et al., ICLR 2016
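The following is a minimal NumPy sketch of these three steps on a single weight tensor; the threshold, bit width and the entropy estimate standing in for actual entropy coding are illustrative choices, not those of any particular method or of the standard.

```python
# Illustrative three-step pipeline on one weight tensor (threshold, bit
# width and entropy estimate are arbitrary choices for the example).
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(256, 256)).astype(np.float32)

# 1) parameter reduction: magnitude pruning (small weights set to zero)
w_pruned = np.where(np.abs(w) > 0.05, w, 0.0)

# 2) precision reduction: uniform quantization to 8-bit integers
step = np.abs(w_pruned).max() / 127.0
q = np.round(w_pruned / step).astype(np.int8)

# 3) entropy coding: here only the achievable rate estimated from the histogram
_, counts = np.unique(q, return_counts=True)
p = counts / q.size
entropy_bits = float(-(p * np.log2(p)).sum())
print(f"sparsity {np.mean(q == 0):.2f}, ~{entropy_bits:.2f} bits/weight (vs. 32)")
```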
SotA in NN Compression
Pruning and sparsification
large number of methods
simple to implement
strongly depends on initial model
at low compression rates, performance improvements are possible
performance loss can often be regained by training again
computational effort is a concern
Blalock et al., MLSys 2020
SotA in NN Compression
Lottery ticket hypothesis [Frankle and Carbin, ICLR 2018]
for a dense, randomly initialized feed-forward NN there are subnetworks that achieve the same performance at similar training cost
such a subnetwork can be obtained by pruning, without further training
retrain from an early training iteration
Inference complexity gain
straightforward for pruning layers, channels, etc.
sparse tensors need specific support (e.g. recently added to CUDA)
usually a certain sparsity ratio needs to be met to obtain a gain
SotA in NN Compression - Quantisation
possible to go below 10-bit parameters for many common NNs without performance loss
many works show losses can be regained with fine-tuning (or by already training with the quantized network)
mixed precision (e.g., per layer) is beneficial for many networks
going to very low precision (1 bit) is feasible (needs approaches to ensure backprop still works, e.g., the asymptotic-quantized estimator [Chen et al., IEEE JSTSP 2020])
Inference complexity gain
needs support on target platform, and in DL framework (+ any custom operators)
increasingly supported on GPUs and specific NN processors (at least 4, 8 and 16 bit)
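A small sketch of symmetric uniform quantization at different bit widths, as one might apply it per layer for mixed precision; the scaling scheme and bit widths are illustrative only.

```python
# Symmetric uniform quantization of one tensor at several bit widths
# (illustrative; real mixed-precision schemes choose widths per layer).
import numpy as np

def quantize(w, bits):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax), scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 64)).astype(np.float32)
for bits in (16, 8, 4, 2):
    q, scale = quantize(w, bits)
    print(f"{bits:2d} bit: mean abs error {np.abs(q * scale - w).mean():.6f}")
```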
Relation to Network Architecture Search (NAS)
finding an alternative architecture and training it
architecture search is computationally expensive
training is then done using e.g. knowledge distillation (teacher-student learning)
Faster NAS methods have been proposed, e.g. Single-Path Mobile AutoML [Stamoulis et al., IEEE JSTSP 2020]
also needs access to the full training data, while fine-tuning could be done on partial or application-specific data
[Strubell, et al. ACL 2019]
Relation to target hardware
Optimising for target hardware
supported operations (e.g., sparse matrix multiplication, weight precisions)
relative costs of memory access and computing operations
train a network once, then derive networks for particular platforms
first approaches exist [Cai, ICLR 2020]
no reliable prediction of inference costs on a particular target architecture
in particular, prediction of speed and energy consumption
cf. autotuning during inference in DL frameworks
Need for a standardised interface
[Diagram: accelerator libraries supporting multiple backends]
Standardization in MPEG (ISO/IEC JTC1 SC29)
Develop interoperable compressed representation of neural networks
Leverage the know-how in MPEG on compression of various types of (multimedia) data
Enable multimedia applications to benefit from the progress in machine learning using deep neural networks
Cover a broad set of relevant use cases
selected image classification, visual content matching, content coding and audio classification as applications in which the technology is validated
Design Considerations
Interoperability with exchange formats (NNEF, ONNX) and formats of common DL frameworks
Reuse existing approaches for representing topology
Agnostic to inference platform and its specificities
Different types of networks, applications, … may be best served by different compression tools
Evaluating compression technologies
Compression ratio
Reconstruction of original parameters (cf. PSNR for multimedia data) is not a useful metric
performance in the target application (e.g., image classification) needs to be measured (cf. perceptual quality metrics for multimedia)
requires models and data sets for each target application
Runtime/memory consumption
of the encoding/decoding process
of inference using the resulting model
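A minimal sketch of how such an evaluation can be wired up, combining file sizes for the stored-model compression ratio with a labelled set for task performance; the predict() function and file names are placeholders, not part of the standard.

```python
# Hypothetical evaluation helpers: compression ratio from file sizes and
# top-1 accuracy from a labelled evaluation set.
import os
import numpy as np

def compression_ratio(original_path, compressed_path):
    return os.path.getsize(compressed_path) / os.path.getsize(original_path)

def top1_accuracy(predict, inputs, labels):
    predictions = np.array([predict(x) for x in inputs])
    return float(np.mean(predictions == np.asarray(labels)))

# usage (hypothetical file names and model):
# ratio = compression_ratio("model.onnx", "model.nnr")
# acc   = top1_accuracy(model.predict, eval_images, eval_labels)
```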
Evaluating compression technologies
compression ratio of the stored/transmitted model (format must be considered)
compression ratio of the model in memory (depends on platform)
performance for the task (with/without fine-tuning)
Standard as a Toolbox
[Pipeline diagram] NNR encoder: original neural net O → bitstream S (representation for storage/transmission); NNR decoder: bitstream S → reconstructed neural net R (representation for inference)
Optional preprocessing / parameter reduction: Sparsification, Pruning, LR-Decomposition, Unification, Batchnorm Folding, Local Scaling, optional postprocessing / fine-tuning (output can still be represented in any NN format, maybe not efficiently)
Quantization: Uniform Nearest Neighbor Q., Codebook Q., Dependent Q.
Entropy coding: DeepCABAC
Decoder side: Entropy Decoding (DeepCABAC), Value Reconstruction / Rescaling, Network Reconstruction / Re-Composition
Parameter Reduction - Sparsification
General sparsification
perform fine-tuning and determine sparsity loss
overall loss is a weighted sum of task-specific and sparsity loss
set weights below threshold to zero
Micro-structured sparsification
based on General Matrix Multiplication (GEMM)
reshape tensor to 1-D, 2-D or 3-D block structure
set blocks of weights to zero
keep mask of blocks that have been set to zero
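A minimal NumPy sketch of both sparsification variants; the threshold, block size and keep ratio are illustrative (in practice they result from the fine-tuning with the weighted task/sparsity loss described above).

```python
# Illustrative NumPy versions of both variants; in practice the threshold
# and the kept blocks result from fine-tuning with
# loss = task_loss + lambda * sparsity_loss.
import numpy as np

def general_sparsify(w, threshold=0.02):
    return np.where(np.abs(w) > threshold, w, 0.0)

def micro_structured_sparsify(w, block=4, keep_ratio=0.5):
    flat = w.reshape(-1, block)                    # 1-D block structure
    energy = np.abs(flat).sum(axis=1)
    keep = np.zeros(len(energy), dtype=bool)
    keep[np.argsort(energy)[-int(len(energy) * keep_ratio):]] = True
    return (flat * keep[:, None]).reshape(w.shape), keep   # mask is stored too

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(8, 16)).astype(np.float32)
w_sparse, mask = micro_structured_sparsify(w)
print("zeroed blocks:", 1.0 - mask.mean())
```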
Parameter Reduction - Pruning
Pruning is used in combination with one of the sparsification methods
Perform weight importance estimation
graph diffusion on output channels to estimate weight redundancy
Iterative algorithm
determine weight importance
prune neurons according to target pruning ratio
apply sparsification
repeat until target ratios are reached
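A sketch of the iterative loop; the importance estimate here is a simple per-output-channel L2 norm standing in for the graph-diffusion-based estimate, and the sparsification step is left as a stub.

```python
# Illustrative iterative pruning; channel importance is a plain L2 norm
# standing in for the graph-diffusion estimate.
import numpy as np

def channel_importance(w):              # w: (output channels, input features)
    return np.linalg.norm(w, axis=1)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128)).astype(np.float32)
target_ratio, step = 0.5, 0.1

ratio = step
while ratio <= target_ratio + 1e-9:
    n_prune = int(round(w.shape[0] * ratio))
    w[np.argsort(channel_importance(w))[:n_prune], :] = 0.0  # least important
    # ... apply sparsification to the remaining weights here ...
    ratio += step

print("pruned channels:", float(np.mean(np.all(w == 0.0, axis=1))))
```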
Parameter Reduction – LR Decomposition, Unification
Low-rank decomposition
decompose weight matrix (using SVD)
choose rank r to trade off reconstruction error vs. number of parameters
Unification
generalisation of micro-structured sparsification
instead of setting a block to zero, set it to an approximated value (keeping the sign)
keep mask of unified blocks
optimised multiplication can be applied if the mask is used
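A sketch of the low-rank decomposition via truncated SVD, showing how the rank r trades reconstruction error against the number of stored parameters.

```python
# Truncated-SVD decomposition of a weight matrix W into A (m x r) and B (r x n).
import numpy as np

def low_rank(w, r):
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :r] * s[:r], vt[:r, :]            # W ~ A @ B

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 512)).astype(np.float32)
for r in (32, 64, 128):
    a, b = low_rank(w, r)
    rel_err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
    print(f"r={r:3d}: {a.size + b.size} params (orig. {w.size}), rel. error {rel_err:.3f}")
```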
Parameter Reduction – Batchnorm folding, Local scaling
Batch normalisation is commonly used in NNs
store batchnorm parameters and apply them to the weights
better compressibility of the transformed weights
Local scaling
scaling factor per row
factors can be adjusted using fine-tuning
if used together with batchnorm folding, no additional parameters are added
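A sketch of batchnorm folding for a layer with per-output-channel batchnorm statistics; after folding, only the transformed weights and bias need to be stored.

```python
# Folding per-output-channel batchnorm parameters into weights and bias.
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    scale = gamma / np.sqrt(var + eps)            # one factor per output channel
    return w * scale[:, None], (b - mean) * scale + beta

rng = np.random.default_rng(0)
out_ch, in_ch = 16, 32
w, b = rng.normal(size=(out_ch, in_ch)), rng.normal(size=out_ch)
gamma, beta = rng.normal(size=out_ch), rng.normal(size=out_ch)
mean, var = rng.normal(size=out_ch), rng.uniform(0.5, 2.0, size=out_ch)

x = rng.normal(size=in_ch)
y_bn = gamma * ((w @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
w_f, b_f = fold_batchnorm(w, b, gamma, beta, mean, var)
print("max difference:", np.abs(y_bn - (w_f @ x + b_f)).max())
```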
Quantisation
Uniform Nearest Neighbor Quantization
Codebook Quantization
Dependent Scalar Quantization
Trellis-coded quantisation: two scalar quantisers and a procedure for switching between them (state machine with 8 states)
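A sketch of the codebook idea using plain Lloyd/k-means iterations: weights are replaced by indices into a small codebook; the codebook size and initialisation are illustrative, not those of the standard.

```python
# Codebook quantization with plain Lloyd/k-means iterations; the tensor is
# then represented by per-weight indices plus the small codebook.
import numpy as np

def codebook_quantize(w, n_codes=16, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    flat = w.reshape(-1)
    codebook = rng.choice(flat, size=n_codes, replace=False)
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        for c in range(n_codes):
            if np.any(idx == c):
                codebook[c] = flat[idx == c].mean()
    return idx.reshape(w.shape), codebook

rng = np.random.default_rng(1)
w = rng.normal(size=(64, 64)).astype(np.float32)
idx, codebook = codebook_quantize(w)
print(f"{codebook.size} codewords, mean abs error {np.abs(codebook[idx] - w).mean():.4f}")
```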
Entropy Coding
DeepCABAC
Adaptation of Context-Adaptive Binary Arithmetic Coding (CABAC)
Binarization
Context modelling
separate models for each of the flags
select from a fixed set of models
Arithmetic coding applied to regular and bypass bins
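A heavily simplified sketch of the binarization idea behind CABAC-style coders: each quantized weight becomes a significance flag, a sign flag and unary/remainder bins, which an adaptive arithmetic coder (not shown) would code using per-flag context models. This is not the exact DeepCABAC binarization.

```python
# Toy binarization: significance flag, sign flag, truncated-unary magnitude
# bins and a binary remainder; a context-adaptive arithmetic coder would
# then compress these bins (not shown).
def binarize(q, max_unary=4):
    bins = [1 if q != 0 else 0]                  # significance flag
    if q != 0:
        bins.append(1 if q < 0 else 0)           # sign flag
        magnitude = abs(q) - 1
        unary = min(magnitude, max_unary)
        bins += [1] * unary + ([0] if unary < max_unary else [])
        if magnitude >= max_unary:               # simplified remainder
            bins += [int(b) for b in format(magnitude - max_unary, "04b")]
    return bins

for value in (0, 1, 3, -6):
    print(value, binarize(value))
```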
Decoding
Decoding an encoded tensor outputs integer or floating point values
Block: set of combined tensors (e.g., components of decomposed tensor)
Parallel decoding of parts of a tensor is supported
option to specify entry points for the parts during encoding
High-level Syntax
Sequence of units
size, header, payload
Unit types
Start unit
Model parameter set
Layer parameter set
Topology
Quantisation data
Compressed data
Aggregated unit: grouping units for joint decoding
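A sketch of walking such a sequence of size-prefixed units; the 4-byte size field, the 1-byte type field and the type codes below are hypothetical and only illustrate the size/header/payload pattern, not the actual NNR syntax.

```python
# Hypothetical walk over size-prefixed units (4-byte big-endian size,
# 1-byte type, payload); this only illustrates the pattern, not the
# actual NNR syntax.
import io
import struct

def read_units(stream):
    while True:
        size_bytes = stream.read(4)
        if len(size_bytes) < 4:
            return
        size = struct.unpack(">I", size_bytes)[0]
        unit = stream.read(size)
        yield unit[0], unit[1:]                  # (unit type, payload)

# build a toy bitstream with two units and parse it again
buf = io.BytesIO()
for unit_type, payload in [(1, b"\x00\x01"), (5, b"\xaa" * 8)]:
    body = bytes([unit_type]) + payload
    buf.write(struct.pack(">I", len(body)) + body)
buf.seek(0)
for unit_type, payload in read_units(buf):
    print("unit type", unit_type, "payload bytes", len(payload))
```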
Interoperability with exchange formats
Include network topology in encoded bitstream
ONNX, NNEF, TensorFlow, PyTorch
Supports encoding just some of the tensors
Compatibility with quantisation formats supported in those formats
Include encoded tensors in exchange format
Recommended approach for NNEF and ONNX
Performance
[Chart, pretrained models: Top-1 accuracy in % for image classification (VGG16, ResNet50, MobileNetV2) and audio classification (DCase), and PSNR in dB for image encoding (UC12B), plotted over compression ratio (0 to 0.25); each model encoded with NCTM and with BZIP2]
Performance
[Chart, low-rank decomposition: Top-1 accuracy in % and PSNR in dB for the same models, plotted over compression ratio (0 to 0.15); each model encoded with NCTM and with BZIP2]
Ongoing work: incremental compression
Use cases that need to send updated models
e.g. deploy to mobile devices, federated learning
Encode model w.r.t. base model
support tensor updates and structural changes, e.g. transfer learning with a different number of output classes
Initial results
updates in distributed training can be represented at <1% of the base model size
[Diagram, federated learning example: Institutions A and B train a VGG-16 on their local training data, a central server aggregates the updates; NCTM is available for the exchanged models]
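A toy sketch of the underlying idea: encode only the difference to the base model; in distributed training these deltas are typically sparse and compress far better than the full model. The tensor size and update pattern below are made up for illustration.

```python
# Toy example: only the (sparse) difference to the base model is encoded.
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(1024, 1024)).astype(np.float32)

# an "update" that changes ~1% of the weights, e.g. after a local training round
update = base.copy()
changed = rng.choice(update.size, size=update.size // 100, replace=False)
update.flat[changed] += rng.normal(scale=0.01, size=changed.size).astype(np.float32)

delta = update - base                            # what gets encoded incrementally
print("non-zero fraction of delta:", np.count_nonzero(delta) / delta.size)
```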
Conclusion
standard for compressing NN parameters
compresses models to less than 10% of their original size without performance loss
interoperability with exchange formats
status
compression standard is going to final ballot
reference software under development (first public release in autumn)
work on incremental compression ongoing
www.joanneum.at/digital