Artificial neural networks have been adopted for a broad range of tasks in multimedia analysis and processing, such as visual and acoustic classification, extraction of multimedia descriptors or image and video coding. The trained neural networks for these applications contain a large number of parameters (weights), resulting in a considerable size. Thus, transferring them to a number of clients using them in applications (e.g., mobile phones, smart cameras) benefits from a compressed representation of neural networks.
MPEG Neural Network Coding and Representation is the first international standard for efficient compression of neural networks (NNs). The standard is designed as a toolbox of compression methods, which can be used to create coding pipelines. It can be either used as an independent coding framework (with its own bitstream format) or together with external neural network formats and frameworks. For providing the highest degree of flexibility, the network compression methods operate per parameter tensor in order to always ensure proper decoding, even if no structure information is provided. The standard contains compression-efficient quantization and an arithmetic coding scheme (DeepCABAC) as core encoding and decoding technologies, as well as neural network parameter pre-processing methods like sparsification, pruning, low-rank decomposition, unification, local scaling, and batch norm folding. NNR achieves a compression efficiency of more than 97% for transparent coding cases, i.e. without degrading classification quality, such as top-1 or top-5 accuracies.
This talk presents an overview of the context, technical features, and characteristics of the NN coding standard, and discusses ongoing topics such as incremental neural network representation.
2. Outline
Context
State of the art in NN compression
Standard: need and design considerations
Coding methods
High-level syntax
Performance
Ongoing work and conclusion
3. Context
Artificial neural networks are widely used, e.g. in multimedia
visual and audio content recognition and classification
speech and natural language processing
Deep learning makes use of very large networks
many layers with nodes and connections
parameters/weights attached to each of them (e.g. convolution operations)
4
4. Context
Training
learn parameters from data
typically once, on powerful infrastructure
updates or adaptations in target environment may be necessary
Inference
use the trained network for prediction
needs network with all its parameters, i.e., large amount of data to be transmitted and processed
→ focus: small size to be transmitted
often used on resource constrained devices (mobile phones, smart cameras, edge nodes, …)
→ focus: low memory and computational complexity during inference
5
5. SotA in NN Compression
typically three steps
reduction of parameters, e.g.
eliminating neurons (pruning)
reducing the entropy of a tensor
decomposing/transforming a tensor
reducing the precision of parameters (i.e. quantization)
performing entropy coding
6
Han et al., ICLR 2016
6. SotA in NN Compression
Pruning and sparsification
large number of methods
simple to implement
strongly depends on initial model
at low compression rates, performance
improvements are possible
performance loss can often be regained by
training again
computational effort is a concern
7
Blalock et al., MLSys 2020
7. SotA in NN Compression
Lottery ticket hypothesis [Frankle and Carbin, ICLR 2018]
for a dense randomly initialized feed forward NN there are subnetworks that
achieve the same performance at similar training cost
such a subnetwork can be obtained by pruning, without further training
retrain from early training iteration
Inference complexity gain
straight forward for pruning layers, channels, etc.
sparse tensors need specific support (e.g. recently added to CUDA)
usually certain sparseness ratio needs to be met to get gain
8
8. SotA in NN Compression - Quantisation
possible to go <10bit parameters for many common NNs without performance loss
many works show losses can be regained with fine-tuning (or already training with quantized
network)
mixed precision (e.g., per layer) is beneficial for many networks
going to very low precision (1bit) is feasible (needs approaches to ensure backprop still works, e.g.,
asymptotic-quantized estimator [Chen et al., IEEE JSTSP 2020])
Inference complexity gain
needs support on target platform, and in DL framework (+ any custom operators)
increasingly supported on GPUs and specific NN processors (at least 4, 8, 16)
9
9. Relation to Network Architecture Search (NAS)
finding an alternative architecture
and train
architecture search is
computationally expensive
training is then done using e.g. knowledge distillation,
teacher-student learning
Faster NAS methods have been proposed, e.g. Single-Path Mobile AutoML
[Stamoulis et al., IEEE JSTSP 2020]
needs also access to full training data, while fine-tuning could be done on partial
data or application specific data
10
[Strubell, et al. ACL 2019]
10. Relation to target hardware
Optimising for target hardware
supported operations (e.g., sparse matrix multiplication, weight precisions)
relative costs of memory access and computing operations
training a network, derive network for particular platform
first approaches [Cai, ICLR 2020]
no reliable prediction of inference
costs on particular target architecture
in particular, prediction of speed and
energy consumption
cf. autotune in inference of
DL frameworks
11
11. Need for a standardised interface
12
Accelerator Libraries
supporting multiple
backends
Accelerator Libraries
supporting multiple
backends
12. Standardization in MPEG (ISO/IEC JTC1 SC29)
Develop interoperable compressed representation of neural networks
Leverage the know-how in the MPEG on compression of various types of
(multimedia) data
Enable multimedia applications to benefit from the progress in machine
learning using deep neural networks
Cover a broad set of relevant use cases
selected image classification, visual content matching, content coding and audio
classification as applications in which technology is validated
13
13. Design Considerations
Interoperability with exchange formats (NNEF, ONNX) and formats of
common DL frameworks
Reuse existing approaches for representing topology
Agnostic to inference platform and its specificities
Different types of networks, applications, … may be best served by
different compression tools
14
14. Evaluating compression technologies
Compression ratio
Reconstruction of original parameters (cf. PSNR for multimedia data)
not a useful metric
performance in target application (e.g., image classification) needs to be measured
(cf. perceptual quality metrics for multimedia)
requires models and data sets for each target application
runtime/memory consumption
of the encoding/decoding process
inference using the resulting model
15
15. Evaluating compression technologies
16
compression ratio of
stored/transmitted model
(format must be considered)
compression ratio of
model in memory
(depends on platform)
performance for task
(with/without finetuning)
16. Standard as a Toolbox
17
Reconstructed
Neural Net
R
Original
Neural Net
O
NNR Encoder
NNR Decoder
Pi Q
RP RQ
Bitstream S
0 1 0 1 1 1
Sparsification
Pruning
LR-Decomp.
Optional Preprocessing /
Parameter Reduction
Quantization Entropy Coding
Uniform Nearest
Neighbor Q.
DeepCABAC
Codebook Q.
Dependent Q.
Unification
Batchnorm
Folding
Local Scaling
Optional Postprocessing /
FineTuning
Value Reconstruction Entropy Decoding
DeepCABAC
Value
Reconstruction /
Rescaling
Network Reconstruction /
Re-Composition
can be represented in
any NN format
(maybe not efficiently)
representation for
inference
representation for
storage/transmission
17. Parameter Reduction - Sparsification
General sparsification
perform fine-tuning and determine sparsity loss
overall loss is weighted sum of task-specific
and sparsity loss
set weights below threshold to zero
Micro-structured sparsification
based on General Matrix Multiplication (GEMM)
reshape tensor to 1-D, 2-D or 3-D block structure
set blocks of weights to zero
keep mask of blocks that have been set to zero
18
18. Parameter Reduction - Pruning
Pruning is used in combination with one of the sparsification methods
Perform weight importance estimation
graph diffusion on output channels -> get an estimate weight redundancy
Iterative algorithm
determine weight importance
prune neurons according to target pruning ratio
apply sparsification
repeat until target ratios are reached
19
19. Parameter Reduction – LR Decomposition, Unification
Low-rank decomposition
decompose weight matrix (using SVD)
choose rank r to trade off reconstruction error vs. number of parameters
Unification
generalisation of micro-structured sparsification
instead of setting block to zero, set to approximated value (keep sign)
keep mask of unified blocks
optimised multiplication can be applied if mask is used
20
20. Parameter Reduction – Batchnorm folding, Local scaling
Batch normalisation is commonly used in NNs
store batchnorm parameters, and apply to weights
better compressability of transformed weights
Local scaling
scaling factor per row
factors can be adjusted using fine-tuning
if used together with batchnorm folding,
no addition parameters are added
21
21. Quantisation
Uniform Nearest Neighbor Quantization
Codebook Quantization
Dependent Scalar Quantization
Trellis-coded quantisation
two scalar quantisers, and procedure for
switching between them (state-machine
with 8 states)
22
22. Entropy Coding
DeepCABAC
Adaptation of Context-adaptive Binary
Arithmetic Coding (CABAC)
Binarization
Context-modelling
separate models for each of the flags
select from a fixed set of models
Arithmetic coding to regular and
bypass bins
23
23. Decoding
Output of encoded tensor
Integer or floating point
Block: set of combined tensors (e.g., components of decomposed tensor)
Parallel decoding of parts of a tensor is supported
option to specify entry points for the parts during encoding
24
24. High-level Syntax
Sequence of units
size, header, payload
Unit types
Start unit
Model parameter set
Layer parameter set
Topology
Quantisation data
Compressed data
Aggregated unit: grouping units for joint decoding
25
25. Interoperability with exchange formats
Include network topology in encoded bitstream
ONNX, NNEF, Tensorflow, PyTorch
Supports encoding just some of the tensors
Compatibility with quantisation formats supported in those formats
Include encoded tensors in exchange format
Recommended approach for NNEF and ONNX
26
30. Ongoing work: incremental compression
Use cases that need to send updated models
e.g. deploy to mobile devices, federated learning
Encode model w.r.t. base model
support tensor updates and structural changes,
e.g. transfer learning with different number
of output classes
Initial results
updates in distributed training can be represented at <1% of the base model size
31
VGG-16
Central Server
Central Server Institution A
Institution B
Train data C
Train data C
Train data B
Train data B
Aggregation
(Central Server)
Aggregation
(Central Server)
VGG-16
NCTM available
NCTM available
31. Conclusion
standard for compressing NN parameters
compresses to less than 10% without performance loss
interoperability with exchange formats
status
compression standard is going to final ballot
reference software under development (first public release in autumn)
work on incremental compression ongoing
32