A practical talk by Anirudh Koul on how to run Deep Neural Networks on memory- and energy-constrained devices like smartphones. Highlights frameworks and best practices.
7. Response Time Limits – Powers of 10
0.1 second : Reacting instantly
1.0 second : User’s flow of thought
10 seconds : Keeping the user’s attention
[Miller 1968; Card et al. 1991; Jakob Nielsen 1993]:
8. Mobile Deep Learning Recipe
(Efficient) Mobile Inference Engine + (Efficient) Pretrained Model = DL App
11. Use Cloud APIs for General Recognition Needs
• Microsoft Cognitive Services
• Clarifai
• Google Cloud Vision
• IBM Watson Services
• Amazon Rekognition
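For quick experiments, these services can be called over plain REST. Below is a minimal sketch for image tagging with Google Cloud Vision; the API key, image filename, and LABEL_DETECTION feature are assumptions based on the public REST documentation, not part of the talk.

import base64, json, requests

API_KEY = 'YOUR_API_KEY'  # placeholder key
URL = 'https://vision.googleapis.com/v1/images:annotate?key=' + API_KEY

with open('photo.jpg', 'rb') as f:  # placeholder image
    content = base64.b64encode(f.read()).decode()

body = {'requests': [{'image': {'content': content},
                      'features': [{'type': 'LABEL_DETECTION', 'maxResults': 10}]}]}
response = requests.post(URL, json=body)
print(json.dumps(response.json(), indent=2))  # returned tags with confidence scores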
12. How to Choose a Computer Vision Based API?
Benchmark & Compare them
COCO-Text v2.0 for Text reading in the wild
• ~2k random images
• Candidate text has at least 2 characters together
• Direct word match
COCO-Val 2017 for Image Tagging in the wild
• ~4k random images
• Tag similarity match instead of word match
14. Recognize Text Benchmarks
Text API | Accuracy
Amazon Rekognition | 45.4%
Google Cloud Vision | 33.4%
Microsoft Cognitive Services | 55.4%
Evaluation criteria:
• Photos have candidate words of length >= 2
• Direct word match with ground truth
15. Image Tagging Benchmarks
Evaluation criteria:
• Concept similarity match instead of word match
• E.g. ‘military-officer’ tag matched with ground truth tag ‘person’
API | Accuracy
Amazon Rekognition | 65%
Google Cloud Vision | 47.6%
Microsoft Cognitive Services | 50.0%
16. Image Tagging Benchmarks
Evaluation criteria:
• Concept similarity match instead of word match
• E.g. ‘military-officer’ tag matched with ground truth tag ‘person’
API | Accuracy | Avg #Tags
Amazon Rekognition | 65% | 14
Google Cloud Vision | 47.6% | 14
Microsoft Cognitive Services | 50.0% | 8
17. Image Tagging Benchmarks
Hard to do Precision-Recall since COCO ground truth tags are not exhaustive
Lower # of tags for a given accuracy indicates higher F-measure
API | Accuracy | Avg #Tags
Amazon Rekognition | 65% | 14
Google Cloud Vision | 47.6% | 14
Microsoft Cognitive Services | 50.0% | 8
18. Tips for reducing network latency
• For Text Recognition
• Compression setting of up to 90% has little effect on accuracy, but gives drastic savings in size
• Resizing is dangerous, text recognition needs a minimum size for
recognition
• For image recognition
• Resize so that min(height, width) = 224, at 50% compression, with bilinear interpolation
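A minimal sketch of the image-recognition tip above, assuming Pillow is installed; 'photo.jpg' is a placeholder filename:

from PIL import Image

img = Image.open('photo.jpg')
w, h = img.size
scale = 224 / min(w, h)  # resize so that min(height, width) becomes 224
if scale < 1:
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
img.save('photo_small.jpg', format='JPEG', quality=50)  # roughly 50% JPEG quality

The smaller file is what gets uploaded to the cloud API; for text recognition, skip the resize and only compress.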
24. Apple’s Ecosystem
Metal
- low-level, low-overhead hardware-accelerated 3D graphics and
compute shader application programming interface (API)
- Available since iOS 8
Metal (2014) → BNNS + MPS (2016) → CoreML (2017) → CoreML 2 (2018)
25. Apple’s Ecosystem
Fast low-level primitives:
• BNNS – Basic Neural Network Subroutines
• Ideal case: Fully connected NN
• MPS – Metal Performance Shaders
• Ideal case: Convolutions
Inconvenient for large networks:
• Inception-v3 inference required ~1.5K lines of hard-coded model definition
• Libraries like Forge by Matthijs Hollemans provide abstraction
Metal (2014) → BNNS + MPS (2016) → CoreML (2017) → CoreML 2 (2018)
26. Apple’s Ecosystem
Convert a Caffe/TensorFlow model to a CoreML model in 3 lines:
import coremltools
coreml_model = coremltools.converters.caffe.convert('my_caffe_model.caffemodel')
coreml_model.save('my_model.mlmodel')
Add model to iOS project and call for prediction.
Direct support for Keras, Caffe, scikit-learn, XGBoost, LibSVM
Automatically minimizes memory footprint and power consumption
Metal (2014) → BNNS + MPS (2016) → CoreML (2017) → CoreML 2 (2018)
27. Apple’s Ecosystem
• Model quantization support down to 1 bit
• Batch API for improved performance
• Conversion support for MXNet, ONNX
• ONNX opens models from PyTorch, Cognitive Toolkit, Caffe2, Chainer
• Create ML for quick training
• tf-coreml for direct conversion from TensorFlow
Metal (2014) → BNNS + MPS (2016) → CoreML (2017) → CoreML 2 (2018)
28. CoreML Benchmark - Pick a DNN for your mobile architecture
Model | Top-1 Accuracy (%) | Model Size (MB) | iPhone 5S (ms) | iPhone 6 (ms) | iPhone 6S/SE (ms) | iPhone 7 (ms) | iPhone 8/X (ms)
VGG 16 | 71 | 553 | 7408 | 4556 | 235 | 181 | 146
Inception v3 | 78 | 95 | 727 | 637 | 114 | 90 | 78
ResNet 50 | 75 | 103 | 538 | 557 | 77 | 74 | 71
MobileNet | 71 | 17 | 129 | 109 | 44 | 35 | 33
SqueezeNet | 57 | 5 | 75 | 78 | 36 | 30 | 29
(Execution times span iPhones released 2013–2017; the huge improvement in GPU hardware in 2015 explains the drop from iPhone 6 to iPhone 6S.)
32. TensorFlow Ecosystem
TensorFlow (2015) → TensorFlow Mobile (2016) → TensorFlow Lite (2018)
Easy pipeline to bring TensorFlow models to mobile
Excellent documentation
Optimizations to bring models to mobile
33. TensorFlow Ecosystem
• Smaller
• Faster
• Minimal dependencies
• Easier to package & deploy
• Allows running custom operators
1 line conversion from Keras to TensorFlow Lite
• tflite_convert --keras_model_file=keras_model.h5 --output_file=foo.tflite
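The same conversion can be done from Python; a sketch assuming a TensorFlow 1.x installation and a Keras model saved as keras_model.h5:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model_file('keras_model.h5')
tflite_model = converter.convert()
open('foo.tflite', 'wb').write(tflite_model)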
TensorFlow (2015) → TensorFlow Mobile (2016) → TensorFlow Lite (2018)
34. TensorFlow Lite is small
• ~75KB for core interpreter
• ~400KB for core interpreter + supported operations
• Compared to 1.5MB for Tensorflow Mobile
35. TensorFlow Lite is fast
• Takes advantage of on-device hardware acceleration
• Uses FlatBuffers
• Reduces code footprint, memory usage
• Reduces CPU cycles on serialization and deserialization
• Improves startup time
• Pre-fused activations
• Combining batch normalization layer with previous Convolution
• Interpreter uses static memory and static execution plan
• Decreases load time
38. TensorFlow Lite Benchmarks - http://ai-benchmark.com/
• Crowdsourcing benchmarking with AI Benchmark android app
• By Andrey Ignatov from ETH
• 9 Tests
• E.g Semantic Segmentation, Image Super Resolution, Face Recognition
39. Caffe2
From Facebook
Under 1 MB of binary size
Built for Speed :
For ARM CPU : Uses NEON Kernels, NNPack
For iPhone GPU : Uses Metal Performance Shaders and Metal
For Android GPU : Uses Qualcomm Snapdragon NPE (4-5x speedup)
ONNX format support to import models from CNTK/PyTorch
41. Recommendation for development
1. Train a model using Keras
2. For iOS:
• Convert to CoreML using coremltools
3. For Android:
• Convert to Tensorflow Lite using tflite_convert
Keras model → coremltools → .mlmodel file
Keras model → tflite_convert → .tflite file
42. Common Questions
“My app has become too big to download. What do I do?”
• iOS doesn’t allow apps over 150 MB to be downloaded
• Solution : Download on demand, and compile on device
• 0 MB change to app size on first install
43. Common Questions
“Do I need to ship a new app update with every model improvement?”
• Making app updates is a decent amount of overhead, plus ~2 days wait time
• Solution : Check for model updates, download and compile on device
• Easier solution – Use a framework for Model Management, e.g.
• Google ML Kit
• Fritz
• Numericcal
44. Common Questions
“Why does my app not recognize objects at top/bottom of screen?”
• Solution : Check the cropping used; by default, it’s a center crop
47. Learning to Play an Accordion
From scratch: 3 months
Already knows piano? Fine-tune the skills: 1 week
48. I got a dataset. Now what?
Step 1 : Find a pre-trained model
Step 2 : Fine tune a pre-trained model
Step 3 : Run using existing frameworks
“Don’t Be A Hero”
- Andrej Karpathy
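A minimal fine-tuning sketch for Steps 1–2 using Keras, assuming a MobileNet backbone; num_classes and the training data are placeholders, not from the talk:

from keras.applications import MobileNet
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model

num_classes = 5  # placeholder: number of classes in your dataset

base = MobileNet(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
x = GlobalAveragePooling2D()(base.output)
predictions = Dense(num_classes, activation='softmax')(x)
model = Model(inputs=base.input, outputs=predictions)

for layer in base.layers:  # freeze the pretrained backbone, train only the new head
    layer.trainable = False

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=5)  # x_train / y_train: your labeled images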
49. How to find pretrained models for my task?
Search “Model Zoo”
https://modelzoo.co
- 300+ models
50. AlexNet, 2012 (simplified)
[Krizhevsky, Sutskever,Hinton’12]
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”
n-dimensional feature representation
51. Deciding how to fine tune
Size of New Dataset | Similarity to Original Dataset | What to do?
Large | High | Fine-tune
Small | High | Don’t fine-tune (it will overfit); train a linear classifier on CNN features
Small | Low | Train a classifier on activations from lower layers (higher layers are specific to the original dataset)
Large | Low | Train the CNN from scratch
http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
55. Could you train your own classifier ... without coding?
• Microsoft CustomVision.ai
• Unique: Under a minute training, Custom object detection
• Google AutoML
• Unique: Full CNN training, crowdsourced workers
• IBM Watson Visual recognition
• Baidu EZDL
• Unique: Custom Sound recognition
56. Custom Vision Service (customvision.ai) – Drag and drop training
Tip : Upload 30 photos per class to make a prototype model
Upload 200 photos per class for a more robust production model
The more distinct the shape/type of object, the fewer images required.
57. Custom Vision Service (customvision.ai) – Drag and drop training
Tip : Use the Fatkun browser extension to download images from a search engine, or use the Bing Image Search API to programmatically download photos with proper rights
58. CoreML exporter from customvision.ai
– Drag and drop training
5-minute shortcut to training, fine-tuning and getting a model ready in CoreML format
Drag and drop interface
60. Barcode recognition from Seeing AI
Aim : Help blind users identify products using barcodes
Issue : Blind users don’t know where the barcode is
Live : Guide the user in finding a barcode with audio cues
With server : Decode the barcode to identify the product
Tech : MPSCNN running on mobile GPU + barcode library
Metrics : 40 FPS (~25 ms) on iPhone 7
61. Currency recognition from Seeing AI
Aim : Identify currency
Live : Identify the denomination of paper currency instantly
With server : –
Tech : Task-specific CNN running on mobile GPU
Metrics : 40 FPS (~25 ms) on iPhone 7
62. Training Data Collection App
Request volunteers to take photos of objects
in non-obvious settings
Sends photos to cloud, trains model nightly
Newsletter shows the best photos from volunteers
Let them compete for fame
72. Accuracy vs Operations Per Image Inference
Circle size is proportional to the number of parameters (AlexNet: 240 MB, VGG: 552 MB). What we want: high accuracy at a low operation count.
Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications” 2016
73. Your Budget - Smartphone Floating Point Operations Per Second (2015)
http://pages.experts-exchange.com/processing-power-compared/
74. iPhone X is more powerful than a Macbook Pro
https://thenextweb.com/apple/2017/09/12/apples-new-iphone-x-already-destroying-android-devices-g/
75. Strategies to get maximum efficiency from your CNN
Before training
• Pick an efficient architecture for your task
• Design efficient layers
After training
• Pruning
• Quantization
• Network binarization
76. CoreML Benchmark - Pick a DNN for your mobile architecture
Model | Top-1 Accuracy (%) | Model Size (MB) | Million Mult-Adds | iPhone 5S (ms) | iPhone 6 (ms) | iPhone 6S/SE (ms) | iPhone 7 (ms) | iPhone 8/X (ms)
VGG 16 | 71 | 553 | 15300 | 7408 | 4556 | 235 | 181 | 146
Inception v3 | 78 | 95 | 5000 | 727 | 637 | 114 | 90 | 78
ResNet 50 | 75 | 103 | 3900 | 538 | 557 | 77 | 74 | 71
MobileNet | 71 | 17 | 569 | 129 | 109 | 44 | 35 | 33
SqueezeNet | 57 | 5 | 800 | 75 | 78 | 36 | 30 | 29
(Execution times span iPhones released 2013–2017; the huge improvement in GPU hardware in 2015 explains the drop from iPhone 6 to iPhone 6S.)
77. MobileNet family
Splits the convolution into a 3x3 depthwise conv and a 1x1 pointwise
conv
Tune with two parameters – Width Multiplier and resolution multiplier
Andrew G. Howard et al, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, 2017
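In Keras, both multipliers are exposed when constructing the model; a sketch assuming the keras.applications API of that era (alpha is the width multiplier, a smaller input_shape acts as the resolution multiplier):

from keras.applications import MobileNet

# alpha=0.5 halves the channels in every layer; 160x160 input lowers the resolution
small = MobileNet(alpha=0.5, input_shape=(160, 160, 3), weights='imagenet')
small.summary()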
82. Tricks while designing your own network
• Dilated Convolutions
• Great for segmentation / when the target object occupies a large area of the image
• Replace NxN convolutions with Nx1 followed by 1xN
• Depthwise separable convolutions (e.g. MobileNet)
• Inverted residual block (e.g. MobileNetV2)
• Replacing large filters with multiple small filters
• 5x5 is slower than 3x3 followed by 3x3
83. Design consideration for custom architectures – Small Filters
Three layers of 3x3 convolutions
>>
One layer of 7x7 convolution
Replace large 5x5, 7x7 convolutions with stacks of 3x3 convolutions
Replace NxN convolutions with stack of 1xN and Nx1
Fewer parameters, less compute, more non-linearity → better, faster, stronger
Andrej Karpathy, CS-231n Notes, Lecture 11
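A quick way to see the parameter savings is to count them in Keras; a sketch with an illustrative 64-channel feature map (the layer sizes are assumptions, not from the talk):

from keras.layers import Conv2D, Input
from keras.models import Model

inp = Input(shape=(32, 32, 64))
one_5x5 = Model(inp, Conv2D(64, (5, 5), padding='same')(inp))
two_3x3 = Model(inp, Conv2D(64, (3, 3), padding='same')(
                     Conv2D(64, (3, 3), padding='same', activation='relu')(inp)))

print(one_5x5.count_params())  # 5*5*64*64 + 64 = 102,464 parameters
print(two_3x3.count_params())  # 2*(3*3*64*64 + 64) = 73,856 parameters, plus an extra non-linearity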
84. Selective training to keep networks shallow
Idea : Augment data limited to how your network will be used
Example : If making a selfie app, there is no benefit in rotating training images beyond ±45 degrees; your phone will auto-rotate anyway. (Approach followed by Word Lens / Google Translate.)
Example : Add blur if analyzing mobile phone frames
85. Pruning
Aim : Remove all connections
with absolute weights below a
threshold
Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
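A toy sketch of the thresholding step (the retraining of remaining connections described in the paper is omitted); the matrix size and threshold are illustrative:

import numpy as np

def prune_below_threshold(weights, threshold):
    # Zero out every connection whose absolute weight is below the threshold
    mask = np.abs(weights) >= threshold
    print('kept {:.1%} of connections'.format(mask.mean()))
    return weights * mask

w = np.random.randn(1024, 1024) * 0.05  # stand-in for a fully connected layer's weights
w_pruned = prune_below_threshold(w, threshold=0.05)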
86. Observation : Most parameters in Fully Connected Layers
AlexNet (240 MB): 96% of all parameters are in fully connected layers
VGG-16 (552 MB): 90% of all parameters are in fully connected layers
87. Pruning gets quickest model compression without accuracy loss
AlexNet 240 MB VGG-16 552 MB
First layer which directly interacts with image is sensitive
and cannot be pruned too much without hurting
accuracy
88. Weight Sharing
Idea : Cluster weights with similar values together, and store in a
dictionary.
Codebook
Huffman coding
HashedNets
Cons: Need a special inference engine, doesn’t work for most
applications
89. Filter Pruning - ThiNet
Idea : Discard whole filter if not important to predictions
Advantage:
• No change in architecture, other than thinning of filters per layer
• Can be further compressed with other methods
Just like feature selection, select filter to discard. Possible greedy
methods:
• Absolute weight sum of entire filter closest to 0
• Average percentage of ‘Zeros’ as outputs
• ThiNet – Collect statistics on the output of the next layer
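A sketch of the first criterion (absolute weight sum per filter); shapes follow the Keras convention and keep_ratio is an illustrative parameter:

import numpy as np

def filters_to_prune(conv_weights, keep_ratio=0.75):
    # conv_weights shape: (kh, kw, in_channels, num_filters)
    l1 = np.abs(conv_weights).sum(axis=(0, 1, 2))  # one importance score per filter
    num_keep = int(len(l1) * keep_ratio)
    return np.argsort(l1)[: len(l1) - num_keep]  # indices of the weakest filters

w = np.random.randn(3, 3, 64, 128)
print(filters_to_prune(w))  # indices of the 32 filters to discard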
90. SqueezeNet - AlexNet-level accuracy in 0.5 MB
SqueezeNet base 4.8 MB
SqueezeNet compressed 0.5 MB
80.3% top-5 Accuracy on ImageNet
0.72 GFLOPS/image
Fire Block
Forrest N. Iandola, Song Han et al, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size"
91. Quantization
Reduce precision from 32 bits to 16 bits or fewer
Use stochastic rounding for best results
In Practice:
• Ristretto + Caffe
• Automatic Network quantization
• Finds balance between compression rate and accuracy
• Apple Metal Performance Shaders automatically quantize to 16 bits
• Tensorflow has 8 bit quantization support
• Gemmlowp – Low precision matrix multiplication library
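For the TensorFlow route, a sketch of 8-bit post-training quantization through the TFLite converter, assuming TensorFlow 1.x of that era (the flag name changed in later versions):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model_file('keras_model.h5')
converter.post_training_quantize = True  # store weights as 8-bit values
open('model_quant.tflite', 'wb').write(converter.convert())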
92. Quantizing CNNs in Practice
Reducing CoreML models to half size
# Load a model, lower its precision, and then save the smaller model.
model_spec = coremltools.utils.load_spec('model.mlmodel')
model_fp16_spec = coremltools.utils.convert_neural_network_spec_weights_to_fp16(model_spec)
coremltools.utils.save_spec(model_fp16_spec, 'modelFP16.mlmodel')
93. Quantizing CNNs in Practice
Reducing CoreML models to even smaller size
Choose bits and quantization mode
Bits from [1,2,4,8]
Quantization mode from ["linear", "linear_lut", "kmeans_lut", "custom_lut"]
• lut = look-up table
from coremltools.models.neural_network.quantization_utils import *
quantized_model = quantize_weights(model, 8, 'linear')
quantized_model.save('quantizedModel.mlmodel')
compare_model(model, quantized_model, './sample_data/')
94. Binary weighted Networks
Idea : Reduce the weights to -1, +1
Speedup : The convolution operation can be approximated using only summation and subtraction
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
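A NumPy sketch of the weight approximation W ≈ alpha * B, using a single per-layer scaling factor for simplicity (the paper computes one per filter):

import numpy as np

def binarize_weights(W):
    alpha = np.abs(W).mean()  # scaling factor
    B = np.sign(W)            # weights reduced to -1 / +1
    return alpha, B

W = np.random.randn(3, 3, 64, 64) * 0.1
alpha, B = binarize_weights(W)
print(np.abs(W - alpha * B).mean())  # approximation error of alpha * B vs. the real weights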
97. XNOR-Net
Idea : Reduce both weights and inputs to -1, +1
Speedup : The convolution operation can be approximated using XNOR and bit-count operations
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
101. Challenges
Off the shelf CNNs not robust for video
Solutions:
• Collective confidence over several frames
• CortexNet
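A sketch of collective confidence: average the per-frame softmax outputs before deciding, which smooths over individual noisy frames (the probability vectors are illustrative):

import numpy as np

def aggregate_predictions(frame_probs):
    # frame_probs: list of per-frame probability vectors from the classifier
    return np.mean(frame_probs, axis=0)

frames = [np.array([0.7, 0.2, 0.1]),
          np.array([0.4, 0.5, 0.1]),
          np.array([0.8, 0.1, 0.1])]
print(aggregate_predictions(frames).argmax())  # class 0 wins despite one noisy frame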
102. Building a DL App and Getting $10 Million in Funding (or a PhD)
103. Competitions to follow
Winners = High accuracy + Low energy consumption
* LPIRC - Low-Power Image Recognition Challenge
* EDLDC - Embedded deep learning design contest
* System Design Contest at Design Automation Conference (DAC)
104. AutoML – Let AI design an efficient AI architecture
MnasNet: Platform-Aware Neural Architecture Search for Mobile
• An automated neural architecture search approach for designing mobile
models using reinforcement learning
• Incorporates latency information into the reward objective function
• Measures real-world inference latency by executing the model on actual mobile phones
(Diagram: a Controller samples models from the search space; a Trainer measures accuracy; mobile phones measure latency; accuracy and latency combine into a multi-objective reward fed back to the Controller.)
105. AutoML – Let AI design an efficient AI architecture
For same accuracy:
• 1.5x faster than MobileNetV2
• ResNet-50 accuracy with 19x fewer parameters
• SSD300 mAP with 35x fewer FLOPs
[Miller 1968; Card et al. 1991]:
0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result.
1.0 second is about the limit for the user's flow of thought to stay uninterrupted, even though the user will notice the delay. Normally, no special feedback is necessary during delays of more than 0.1 but less than 1.0 second, but the user does lose the feeling of operating directly on the data.
10 seconds is about the limit for keeping the user's attention focused on the dialogue. For longer delays, users will want to perform other tasks while waiting for the computer to finish, so they should be given feedback indicating when the computer expects to be done. Feedback during the delay is especially important if the response time is likely to be highly variable, since users will then not know what to expect.
No, don’t do it right now. Do it in the next session.
If you need a hand warmer on a cold day, I suggest you try training on a phone
Core ML supports a variety of machine learning models, including neural networks, tree ensembles, support vector machines, and generalized linear models. Core ML requires the Core ML model format (models with a .mlmodel file extension).
Apple provides several popular, open source models that are already in the Core ML model format. You can download these models and start using them in your app. Additionally, various research groups and universities publish their models and training data, which may not be in the Core ML model format. To use these models, you need to convert them, as described in Converting Trained Models to Core ML.
Internally does context switching between GPU and CPU.
Uses Accelerate for the CPU (e.g. sentiment analysis) and the GPU (e.g. image classification)
Speedups : No need to decode JPEGs, directly deal with camera image buffers
“surprisingly, ARM CPUs outperform the on-board GPUs (our NNPACK ARM CPU implementation outperforms Apple’s MPSCNNConvolution for all devices except the iPhone 7). There are other advantages to offloading compute onto the GPU/DSP, and it’s an active work in progress to expose these in Caffe2 “
Built for first-class support on phones from 2015 onwards; also supports running on phone models from 2013 onwards
Uses NEON kernels for certain operations like transpose
Heavily uses NNPACK, which is extremely fast for convolutions on ARM CPUs. Even on phones from 2014 without GPUs, the ARM CPU will outperform. NNPACK implements Winograd for the convolution math: it converts convolution to element-wise multiplication, reducing the number of FLOPs by 2.5 times
Also works on any ARM CPU, which doesn't limit you to cell phones
Under 1 MB of compiled binary size
You don't need Microsoft's ocean-boiling GPU cluster
Learned hierarchical features from a deep learning algorithm. Each feature can be thought of as a filter, which filters the input image for that feature (a nose). If the feature is found, the responsible units generate large activations, which can be picked up by the later classifier stages as a good indicator that the class is present.
In practice, we don’t usually train an entire DCNN from scratch with random initialization. This is because it is relatively rare to have a dataset of sufficient size that is required for the depth of network required. Instead, it is common to pre-train a DCNN on a very large dataset and then use the trained DCNN weights either as an initialization or a fixed feature extractor for the task of interest.
Fine-Tuning: Transfer learning strategies depend on various factors, but the two most important ones are the size of the new dataset, and its similarity to the original dataset. Keeping in mind that DCNN features are more generic in early layers and more dataset-specific in later layers, there are four major scenarios:
New dataset is smaller in size and similar in content compared to original dataset: If the data is small, it is not a good idea to fine-tune the DCNN due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the DCNN to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN-features.
New dataset is relatively large in size and similar in content compared to the original dataset: Since we have more data, we can have more confidence that we would not over fit if we were to try to fine-tune through the full network.
New dataset is smaller in size but very different in content compared to the original dataset: Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier from the top of the network, which contains more dataset-specific features. Instead, it might work better to train a classifier from activations somewhere earlier in the network.
New dataset is relatively large in size and very different in content compared to the original dataset: Since the dataset is very large, we may expect that we can afford to train a DCNN from scratch. However, in practice it is very often still beneficial to initialize with weights from a pre-trained model. In this case, we would have enough data and confidence to fine-tune through the entire network.
As we all have painfully experienced, in real life, what you really want, is not what you can always afford.
And that's the same in machine learning,
We all know deep learning works if you have large GPU servers, what about when you want to run it on a tiny little device.
What's the number 1 limitation, it turns out to be memory.
If you look at ImageNet models in the last couple of years, it started with 240 megabytes. VGG was over half a gig.
So the question we will solve now is how to get these neural networks do these amazing things yet have a very small memory footprint
Just more layers, nothing special
Here is another view of this model.
This is the equivalent of Big data for Powerpoint
Errors are reducing by 40% year on year
Previously, they used to reduce by 5% year by year
Compromise between accuracy and number of parameters.
Apple is increasing the core count from two to six with a new A11 chip. Two of the cores are meant to do the bulk of intensive processing, while the other four are high efficiency cores dedicated to low-power tasks.
DNNs often suffer from over-parameterization and large amount of redundancy in their models. This typically results in inefficient computation and memory usage
iPhone 7 has a considerable mobile gpu....10 years ago when CUDA came out, desktop GPUS had similar performance
DNNs often suffer from over-parameterization and large amount of redundancy in their models. This typically results in inefficient computation and memory usage
1x1 bottleneck convolutions are very efficient
Word Lens app uses this
Pruning redundant, non-informative weights in a previously trained network reduces the size of the network at inference time.
Take a network, prune, and then retrain the remaining connections
VGG-16 contains 90% of the weights
AlexNet contains 96% of the weights
Most computation happen in convolutional layers
VGG-16 contains 90% of the weights
AlexNet contains 96% of the weights
Resnet, GoogleNet, Inception have majority convolutional layers, so they compress less
Caffe2 does this, but does dense multiplication
Facebook app uses this
Facebook app uses this
SqueezeNet has been recently released. It is a re-hash of many concepts from ResNet and Inception, and shows that, after all, a better design of architecture will deliver small network sizes and parameter counts without needing complex compression algorithms.
Strategy 1. Replace 3x3 filters with 1x1 filters
Strategy 2. Decrease the number of input channels to 3x3 filters
Strategy 3. Downsample late in the network so that convolution layers have large activation maps.
Ristretto is an automated CNN-approximation tool which condenses 32-bit floating point networks. Ristretto is an extension of Caffe and allows to test, train and fine-tune networks with limited numerical precision.
Ristretto In a Minute
Ristretto Tool: The Ristretto tool performs automatic network quantization and scoring, using different bit-widths for number representation, to find a good balance between compression rate and network accuracy.
Ristretto Layers: Ristretto re-implements Caffe-layers and simulates reduced word width arithmetic.
Testing and Training: Thanks to Ristretto’s smooth integration into Caffe, network description files can be changed to quantize different layers. The bit-width used for different layers as well as other parameters can be set in the network’s prototxt file. This allows to directly test and train condensed networks, without any need of recompilation.
Facebook app uses this
Reduce weights to binary values, then scale them during training
Now that I see this slide, this should probably have been the title for this session. We would have gotten a lot more people in this room.
Facebook app uses this
Three components:
(1) an RNN-based controller for learning and sampling model architectures
(2) a trainer that trains models
(3) an inference engine for measuring the model speed on real mobile phones
SE = Squeeze-and-Excitation optimization
Minerva consists of five stages, as shown in Figure 2. Stages 1–2 establish a fair baseline accelerator implementation. Stage 1 generates the baseline DNN: fixing a network topology and a set of trained weights. Stage 2 selects an optimal baseline accelerator implementation. Stages 3– 5 employ novel co-design optimizations to minimize power consumption over the baseline in the following ways: Stage 3 analyzes the dynamic range of all DNN signals and reduces slack in data type precision. Stage 4 exploits observed network sparsity to minimize data accesses and MAC operations. Stage 5 introduces a novel fault mitigation technique, which allows for aggressive SRAM supply voltage reduction. For each of the three optimization stages, the ML level measures the impact on prediction accuracy, the architecture level evaluates hardware resource savings, and the circuit level characterizes the hardware models and validates simulation results.