Deep Learning on mobile phones
- A Practitioners guide
Anirudh Koul
Anirudh Koul , @anirudhkoul , http://koul.ai
Head of AI & Research,
Aira
[lastname]@aira.io
Founder, Seeing AI
Previously at Microsoft
Why Deep Learning On Mobile?
Latency Privacy
Response Time Limits – Powers of 10
0.1 second : Reacting instantly
1.0 seconds : User’s flow of thought
10 seconds : Keeping the user’s attention
[Miller 1968; Card et al. 1991; Jakob Nielsen 1993]
Mobile Deep Learning Recipe
Mobile Inference Engine + Pretrained Model = DL App
(Efficient) (Efficient)
Building a DL App in _ time
Building a DL App in 1 hour
Use Cloud APIs for General Recognition Needs
• Microsoft Cognitive Services
• Clarifai
• Google Cloud Vision
• IBM Watson Services
• Amazon Rekognition
How to Choose a Computer Vision Based API?
Benchmark & Compare them
COCO-Text v2.0 for Text reading in the wild
• ~2k random images
• Candidate text has at least 2 characters together
• Direct word match
COCO-Val 2017 for Image Tagging in the wild
• ~4k random images
• Tag similarity match instead of word match
Pricing
Recognize Text Benchmarks
Text API Accuracy
Amazon Rekognition 45.4%
Google Cloud Vision 33.4%
Microsoft Cognitive Services 55.4%
Evaluation criteria:
• Photos have candidate words of length >= 2
• Direct word match with ground truth
Image Tagging Benchmarks
Evaluation criteria:
• Concept similarity match instead of word match
• E.g. ‘military-officer’ tag matched with ground truth tag ‘person’
API Accuracy
Amazon Rekognition 65%
Google Cloud Vision 47.6%
Microsoft Cognitive Services 50.0%
Image Tagging Benchmarks
Evaluation criteria:
• Concept similarity match instead of word match
• E.g. ‘military-officer’ tag matched with ground truth tag ‘person’
API Accuracy Avg #Tags
Amazon Rekognition 65% 14
Google Cloud Vision 47.6% 14
Microsoft Cognitive Services 50.0% 8
Image Tagging Benchmarks
Hard to do Precision-Recall since COCO ground truth tags are not exhaustive
Lower # of tags for a given accuracy indicates higher F-measure
API Accuracy Avg #Tags
Amazon Rekognition 65% 14
Google Cloud Vision 47.6% 14
Microsoft Cognitive Services 50.0% 8
Tips for reducing network latency
• For text recognition
• A JPEG compression setting of up to 90% has little effect on accuracy but gives drastic savings in size
• Resizing is dangerous; text recognition needs a minimum resolution to work
• For image recognition
• Resize so that min(height, width) = 224, at 50% compression, with bilinear interpolation (see the sketch below)
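A minimal preprocessing sketch with Pillow, under these assumptions (local file paths, JPEG output, and an illustrative prepare_for_api helper; tune the quality setting per API):

from PIL import Image

def prepare_for_api(path, out_path='resized.jpg', min_side=224, quality=50):
    # Resize so the shorter side is min_side, then save a compressed JPEG.
    img = Image.open(path).convert('RGB')
    w, h = img.size
    scale = min_side / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    img.save(out_path, 'JPEG', quality=quality)
    return out_path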
Building a DL App in 1 day
[Figure: energy to train vs. energy to use a convolutional neural network]
Source: http://deeplearningkit.org/2015/12/28/deeplearningkit-deep-learning-for-ios-tested-on-iphone-6s-tvos-and-os-x-developed-in-metal-and-swift/
Base Pretrained Model
ImageNet – 1000 Object Categorizer
VGG16
Inception-v3
Resnet-50
MobileNet
SqueezeNet
Running pre-trained models on mobile
Core ML
TensorFlow Lite
Caffe2
Apple’s Ecosystem
Metal (2014) → BNNS + MPS (2016) → Core ML (2017) → Core ML 2 (2018)
Apple’s Ecosystem
Metal
- Low-level, low-overhead, hardware-accelerated 3D graphics and compute shader application programming interface (API)
- Available since iOS 8
Apple’s Ecosystem
Fast low-level primitives:
• BNNS – Basic Neural Network Subroutines
• Ideal case: Fully connected NN
• MPS – Metal Performance Shaders
• Ideal case: Convolutions
Inconvenient for large networks:
• Inception-v3 inference required ~1.5K lines of hand-coded model definition
• Libraries like Forge by Matthijs Hollemans provide an abstraction
Apple’s Ecosystem
Convert a Caffe/TensorFlow model to a Core ML model in 3 lines:
import coremltools
coreml_model = coremltools.converters.caffe.convert('my_caffe_model.caffemodel')
coreml_model.save('my_model.mlmodel')
Add model to iOS project and call for prediction.
Direct support for Keras, Caffe, scikit-learn, XGBoost, LibSVM
Automatically minimizes memory footprint and power consumption
Apple’s Ecosystem
• Model quantization support down to 1 bit
• Batch API for improved performance
• Conversion support for MXNet, ONNX
• ONNX opens up models from PyTorch, Cognitive Toolkit, Caffe2, Chainer
• Create ML for quick training
• tf-coreml for direct conversion from TensorFlow
CoreML Benchmark - Pick a DNN for your mobile architecture
Model | Top-1 Accuracy | Model Size (MB) | iPhone 5S (ms) | iPhone 6 (ms) | iPhone 6S/SE (ms) | iPhone 7 (ms) | iPhone 8/X (ms)
VGG 16 | 71 | 553 | 7408 | 4556 | 235 | 181 | 146
Inception v3 | 78 | 95 | 727 | 637 | 114 | 90 | 78
ResNet 50 | 75 | 103 | 538 | 557 | 77 | 74 | 71
MobileNet | 71 | 17 | 129 | 109 | 44 | 35 | 33
SqueezeNet | 57 | 5 | 75 | 78 | 36 | 30 | 29
(Devices span 2013–2017; note the huge improvement in GPU hardware in 2015.)
Putting out more frames than an art gallery
TensorFlow Ecosystem
TensorFlow (2015) → TensorFlow Mobile (2016) → TensorFlow Lite (2018)
TensorFlow Ecosystem
TensorFlow (2015): the full, bulky deal
TensorFlow Ecosystem
TensorFlow Mobile (2016):
• Easy pipeline to bring TensorFlow models to mobile
• Excellent documentation
• Optimizations to bring models to mobile
TensorFlow Ecosystem
TensorFlow Lite (2018):
• Smaller
• Faster
• Minimal dependencies
• Easier to package & deploy
• Allows running custom operators
1-line conversion from Keras to TensorFlow Lite (see also the Python API sketch below)
• tflite_convert --keras_model_file=keras_model.h5 --output_file=foo.tflite
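The same conversion can be scripted from Python; a minimal sketch assuming the TensorFlow 1.x-era tf.lite converter and a hypothetical keras_model.h5 file:

import tensorflow as tf

# Convert a saved Keras model to a TensorFlow Lite flatbuffer.
converter = tf.lite.TFLiteConverter.from_keras_model_file('keras_model.h5')
tflite_model = converter.convert()
with open('foo.tflite', 'wb') as f:
    f.write(tflite_model)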
TensorFlow Lite is small
• ~75KB for core interpreter
• ~400KB for core interpreter + supported operations
• Compared to 1.5MB for TensorFlow Mobile
TensorFlow Lite is fast
• Takes advantage of on-device hardware acceleration
• Uses FlatBuffers
• Reduces code footprint, memory usage
• Reduces CPU cycles on serialization and deserialization
• Improves startup time
• Pre-fused activations
• Combines the batch normalization layer with the preceding convolution
• Interpreter uses static memory and static execution plan
• Decreases load time
TensorFlow Lite Architecture
TensorFlow Lite Benchmarks - http://alpha.lab.numericcal.com/
TensorFlow Lite Benchmarks - http://ai-benchmark.com/
• Crowdsourcing benchmarking with AI Benchmark android app
• By Andrey Ignatov from ETH
• 9 Tests
• E.g. semantic segmentation, image super-resolution, face recognition
Caffe2
From Facebook
Under 1 MB of binary size
Built for Speed :
For ARM CPU : Uses NEON Kernels, NNPack
For iPhone GPU : Uses Metal Performance Shaders and Metal
For Android GPU : Uses Qualcomm Snapdragon NPE (4-5x speedup)
ONNX format support to import models from CNTK/PyTorch
Recommendation for development
1. Train a model using Keras
2. For iOS:
• Convert to CoreML using coremltools
3. For Android:
• Convert to Tensorflow Lite using tflite_convert
Keras model → coremltools → .mlmodel file (iOS)
Keras model → tflite_convert → .tflite file (Android)
(see the conversion sketch below)
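A minimal sketch of the iOS half of this pipeline, assuming an older coremltools release (3.x or earlier, which still ships the Keras converter) and a hypothetical two-class model.h5:

import coremltools

# Convert a Keras .h5 model to Core ML (coremltools <= 3.x API).
coreml_model = coremltools.converters.keras.convert(
    'model.h5',
    input_names='image',
    image_input_names='image',    # treat the input as an image rather than a raw array
    class_labels=['cat', 'dog'],  # hypothetical labels for a 2-class classifier
)
coreml_model.save('model.mlmodel')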
Common Questions
“My app has become too big to download. What do I do?”
• iOS doesn’t allow apps over 150 MB to be downloaded
• Solution : Download on demand, and compile on device
• 0 MB change to app size on first install
Common Questions
“Do I need to ship a new app update with every model improvement?”
• Making app updates is a decent amount of overhead, plus ~2 days of wait time
• Solution : Check for model updates, download and compile on device
• Easier solution – Use a framework for Model Management, e.g.
• Google ML Kit
• Fritz
• Numericcal
Common Questions
“Why does my app not recognize objects at top/bottom of screen?”
• Solution : Check the cropping used; by default, it's a center crop (see the sketch below)
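A minimal, framework-agnostic sketch (using Pillow) of what a square center crop does to a portrait camera frame; the strips at the top and bottom never reach the model:

from PIL import Image

def center_crop_square(img):
    # Crop the largest centered square; anything above or below it is discarded.
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    return img.crop((left, top, left + side, top + side))

frame = Image.new('RGB', (1080, 1920))   # a portrait frame
print(center_crop_square(frame).size)    # (1080, 1080): 420 rows lost at top and bottom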
Building a DL App in 1 week
Learning to play an accordion: 3 months
Already knows piano? Fine-tune those skills: 1 week
I got a dataset, Now What?
Step 1 : Find a pre-trained model
Step 2 : Fine tune a pre-trained model
Step 3 : Run using existing frameworks
“Don’t Be A Hero”
- Andrej Karpathy
How to find pretrained models for my task?
Search “Model Zoo”
https://modelzoo.co
- 300+ models
AlexNet, 2012 (simplified)
[Krizhevsky, Sutskever, Hinton 2012]
[Figure: stacked convolutional layers ending in an n-dimensional feature representation]
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”
Deciding how to fine-tune
Size of New Dataset | Similarity to Original Dataset | What to do?
Large | High | Fine-tune.
Small | High | Don't fine-tune, it will overfit. Train a linear classifier on CNN features.
Small | Low | Train a classifier on activations from lower layers; higher layers are specific to the original dataset.
Large | Low | Train the CNN from scratch.
http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
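A minimal fine-tuning sketch with tf.keras, assuming an ImageNet-pretrained MobileNetV2 base and a hypothetical 5-class dataset (for the small-dataset rows above, keep the base frozen and train only the new head):

import tensorflow as tf

NUM_CLASSES = 5  # hypothetical number of target classes

# Pretrained feature extractor without its ImageNet classification head.
base = tf.keras.applications.MobileNetV2(
    weights='imagenet', include_top=False, input_shape=(224, 224, 3), pooling='avg')
base.trainable = False  # freeze the base; only the new head is trained

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(train_data, epochs=5)  # plug in your own data pipeline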
Could you train your own classifier ... without coding?
• Microsoft CustomVision.ai
• Unique: Under a minute training, Custom object detection
• Google AutoML
• Unique: Full CNN training, crowdsourced workers
• IBM Watson Visual recognition
• Baidu EZDL
• Unique: Custom Sound recognition
Custom Vision Service (customvision.ai) – Drag and drop training
Tip : Upload 30 photos per class to make a prototype model
Upload 200 photos per class for a more robust production model
The more distinct the shape/type of the object, the fewer images required.
Custom Vision Service (customvision.ai) – Drag and drop training
Tip : Use Fatkun Browser Extension to download images from Search Engine,
or use Bing Image Search API to programmatically download photos with
proper rights
CoreML exporter from customvision.ai
– Drag and drop training
5 minute shortcut to training, fine-tuning and getting a model ready in CoreML format
Drag and drop interface
Building a Crowdsourced Data Collector
in 1 month
Barcode recognition from Seeing AI
Aim : Help blind users identify products using the barcode
Issue : Blind users don't know where the barcode is
Live : Guide the user in finding a barcode with audio cues
With Server : Decode the barcode to identify the product
Tech : MPSCNN running on mobile GPU + barcode library
Metrics : 40 FPS (~25 ms) on iPhone 7
Currency recognition from Seeing AI
Aim : Identify currency
Live : Identify the denomination of paper currency instantly
With Server : -
Tech : Task-specific CNN running on mobile GPU
Metrics : 40 FPS (~25 ms) on iPhone 7
Training Data Collection App
Request volunteers to take photos of objects
in non-obvious settings
Sends photos to cloud, trains model nightly
Newsletter shows the best photos from volunteers
Let them compete for fame
Daily challenge - Collected by volunteers
Building a production DL App
in 3 months
What you want: $200,000. What you can afford: $2,000.
Image credits: https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016
Revolution of Depth
AlexNet, 8 layers (ILSVRC 2012)
VGG, 19 layers (ILSVRC 2014)
GoogleNet, 22 layers (ILSVRC 2014)
ResNet, 152 layers (ILSVRC 2015): ultra deep
[Architecture diagrams: layer-by-layer listings of AlexNet, VGG-19, GoogleNet, and ResNet-152]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
Revolution of Depth vs Classification Accuracy
ImageNet classification top-5 error (%):
ILSVRC'10 (shallow): 28.2
ILSVRC'11 (shallow): 25.8
ILSVRC'12 (AlexNet, 8 layers): 16.4
ILSVRC'13 (8 layers): 11.7
ILSVRC'14 (VGG, 19 layers): 7.3
ILSVRC'14 (GoogleNet, 22 layers): 6.7
ILSVRC'15 (ResNet, 152 layers): 3.6
ILSVRC'16 (ensemble of ResNet, Inception-ResNet, Inception and Wide Residual Network): 2.9
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
Accuracy vs Operations Per Image Inference
[Chart: accuracy vs. operations per inference; marker size proportional to the number of parameters. VGG (552 MB) and AlexNet (240 MB) sit far from the small-and-accurate region we want.]
Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications”, 2016
Your Budget - Smartphone Floating Point Operations Per Second (2015)
http://pages.experts-exchange.com/processing-power-compared/
iPhone X is more powerful than a MacBook Pro
https://thenextweb.com/apple/2017/09/12/apples-new-iphone-x-already-destroying-android-devices-g/
Strategies to get maximum efficiency from your CNN
Before training
• Pick an efficient architecture for your task
• Design efficient layers
After training
• Pruning
• Quantization
• Network binarization
CoreML Benchmark - Pick a DNN for your mobile architecture
Model | Top-1 Accuracy | Model Size (MB) | Million Mult-Adds | iPhone 5S (ms) | iPhone 6 (ms) | iPhone 6S/SE (ms) | iPhone 7 (ms) | iPhone 8/X (ms)
VGG 16 | 71 | 553 | 15300 | 7408 | 4556 | 235 | 181 | 146
Inception v3 | 78 | 95 | 5000 | 727 | 637 | 114 | 90 | 78
ResNet 50 | 75 | 103 | 3900 | 538 | 557 | 77 | 74 | 71
MobileNet | 71 | 17 | 569 | 129 | 109 | 44 | 35 | 33
SqueezeNet | 57 | 5 | 800 | 75 | 78 | 36 | 30 | 29
(Devices span 2013–2017; note the huge improvement in GPU hardware in 2015.)
MobileNet family
Splits each convolution into a 3x3 depthwise convolution followed by a 1x1 pointwise convolution (see the sketch below)
Tuned with two parameters: a width multiplier and a resolution multiplier
Andrew G. Howard et al, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, 2017
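A minimal sketch of the depthwise + pointwise split in tf.keras (illustrative shapes only; the real MobileNet block also inserts batch norm and ReLU6 between the two convolutions):

import tensorflow as tf

inputs = tf.keras.Input(shape=(224, 224, 32))

# Standard convolution: one 3x3 kernel spanning all 32 input channels per output channel.
standard = tf.keras.layers.Conv2D(64, kernel_size=3, padding='same')(inputs)

# Depthwise separable version: per-channel 3x3 filtering, then a 1x1 channel mix.
depthwise = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding='same')(inputs)
pointwise = tf.keras.layers.Conv2D(64, kernel_size=1)(depthwise)

print(tf.keras.Model(inputs, standard).count_params())   # 18496 weights
print(tf.keras.Model(inputs, pointwise).count_params())  # 2432 weights, roughly 8x fewer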
Efficient Classification Architectures
https://ai.googleblog.com/2018/04/mobilenetv2-next-generation-of-on.html
MobileNetV2 is the current favourite
Efficient Detection Architectures
Jonathan Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors”, 2017
Efficient Detection Architectures
Jonathan Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors”, 2017
Efficient Segmentation Architectures
ICNet (Image Cascade Network)
Tricks while designing your own network
• Dilated convolutions
• Great for segmentation, or when the target object occupies a large area of the image
• Replace NxN convolutions with Nx1 followed by 1xN
• Depthwise separable convolutions (e.g. MobileNet)
• Inverted residual blocks (e.g. MobileNetV2)
• Replace large filters with multiple small filters
• 5x5 is slower than 3x3 followed by 3x3
Design consideration for custom architectures – Small Filters
Three layers of 3x3 convolutions
>>
One layer of 7x7 convolution
Replace large 5x5, 7x7 convolutions with stacks of 3x3 convolutions
Replace NxN convolutions with stack of 1xN and Nx1
Fewer parameters
Less compute
More non-linearity
Better
Faster
Stronger
Andrej Karpathy, CS-231n Notes, Lecture 11
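A quick back-of-the-envelope check of this claim, counting weights for layers that map C channels to C channels (a sketch; biases and activations ignored):

C = 64  # assumed channel count

def conv_weights(k_h, k_w, channels=C):
    # Weights of one k_h x k_w convolution from `channels` inputs to `channels` outputs.
    return k_h * k_w * channels * channels

print(conv_weights(7, 7))                          # 200704 for a single 7x7 layer
print(3 * conv_weights(3, 3))                      # 110592 for three stacked 3x3 layers
print(conv_weights(5, 5), 2 * conv_weights(3, 3))  # 102400 vs 73728
print(conv_weights(7, 1) + conv_weights(1, 7))     # 57344 for a 7x1 + 1x7 stack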
Selective training to keep networks shallow
Idea : Augment data only in ways that match how your network will be used
Example : For a selfie app there is no benefit in rotating training images beyond +/- 45 degrees; the phone auto-rotates anyway. This is the approach followed by Word Lens / Google Translate.
Example : Add blur if analyzing mobile phone camera frames
Pruning
Aim : Remove all connections
with absolute weights below a
threshold
Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
Observation : Most parameters live in the fully connected layers
AlexNet (240 MB): ~96% of all parameters
VGG-16 (552 MB): ~90% of all parameters
Pruning gives the quickest model compression without accuracy loss (AlexNet 240 MB, VGG-16 552 MB)
The first layer, which directly interacts with the image, is sensitive and cannot be pruned much without hurting accuracy (a minimal sketch of magnitude pruning follows)
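A minimal numpy sketch of magnitude-based pruning as described above (a stand-in fully connected weight matrix; real pipelines prune iteratively and retrain between steps):

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(4096, 4096)).astype(np.float32)  # stand-in FC layer

def prune_by_magnitude(w, sparsity=0.9):
    # Zero out the smallest-magnitude weights until `sparsity` fraction are zero.
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) >= threshold
    return w * mask, mask

pruned, mask = prune_by_magnitude(weights, sparsity=0.9)
print(f'zeroed: {1 - mask.mean():.1%}')  # ~90.0%
# During retraining, reapply the mask after every update so pruned weights stay zero.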
Weight Sharing
Idea : Cluster weights with similar values together, and store in a
dictionary.
Codebook
Huffman coding
HashedNets
Cons: Need a special inference engine, doesn’t work for most
applications
Filter Pruning - ThiNet
Idea : Discard whole filter if not important to predictions
Advantage:
• No change in architecture, other than thinning of filters per layer
• Can be further compressed with other methods
Just like feature selection, select filters to discard. Possible greedy methods:
• Absolute weight sum of entire filter closest to 0
• Average percentage of ‘Zeros’ as outputs
• ThiNet – Collect statistics on the output of the next layer
SqueezeNet - AlexNet-level accuracy in 0.5 MB
SqueezeNet base 4.8 MB
SqueezeNet compressed 0.5 MB
80.3% top-5 Accuracy on ImageNet
0.72 GFLOPS/image
Fire Block
Forrest N. Iandola, Song Han et al, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size"
Quantization
Reduce precision from 32 bits to 16 bits or fewer (see the sketch after this list)
Use stochastic rounding for best results
In Practice:
• Ristretto + Caffe
• Automatic Network quantization
• Finds balance between compression rate and accuracy
• Apple Metal Performance Shaders automatically quantize to 16 bits
• Tensorflow has 8 bit quantization support
• Gemmlowp – Low precision matrix multiplication library
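A minimal numpy sketch of 8-bit linear quantization with optional stochastic rounding (illustrative only; the libraries above handle this internally):

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)

def quantize_linear(w, num_bits=8, stochastic=False):
    # Map float weights to integers in [0, 2**num_bits - 1] with a scale and zero point.
    qmax = 2 ** num_bits - 1
    scale = (w.max() - w.min()) / qmax
    zero_point = w.min()
    q = (w - zero_point) / scale
    q = np.floor(q + rng.random(q.shape)) if stochastic else np.round(q)
    return q.astype(np.uint8), scale, zero_point

q, scale, zero_point = quantize_linear(weights, stochastic=True)
dequantized = q.astype(np.float32) * scale + zero_point
print(np.abs(weights - dequantized).max())  # worst-case error, about one quantization step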
Quantizing CNNs in Practice
Reducing CoreML models to half size
import coremltools

# Load a model, lower its precision, and then save the smaller model.
model_spec = coremltools.utils.load_spec('model.mlmodel')
model_fp16_spec = coremltools.utils.convert_neural_network_spec_weights_to_fp16(model_spec)
coremltools.utils.save_spec(model_fp16_spec, 'modelFP16.mlmodel')
Quantizing CNNs in Practice
Reducing CoreML models to even smaller size
Choose bits and quantization mode
Bits from [1,2,4,8]
Quantization mode from ['linear', 'linear_lut', 'kmeans_lut', 'custom_lut']
• lut = look-up table
from coremltools.models.neural_network.quantization_utils import *
quantized_model = quantize_weights(model, 8, 'linear')
quantized_model.save('quantizedModel.mlmodel')
compare_model(model, quantized_model, './sample_data/')
Binary weighted Networks
Idea : Reduce the weights to -1, +1
Speedup : The convolution operation can be approximated by only summation and subtraction (see the sketch below)
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
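A minimal numpy sketch of the weight binarization idea: per XNOR-Net, each filter is approximated by its sign times a scaling factor alpha, the filter's mean absolute weight (shapes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
filters = rng.normal(size=(64, 3, 3, 3)).astype(np.float32)  # 64 filters of shape 3x3x3

def binarize(w):
    # Approximate each filter as alpha * sign(w), with alpha = mean(|w|) per filter.
    alpha = np.abs(w).reshape(w.shape[0], -1).mean(axis=1)
    return alpha[:, None, None, None] * np.sign(w)

approx = binarize(filters)
print(np.mean(np.abs(filters - approx)))  # average approximation error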
XNOR-Net
Idea : Reduce both weights and inputs to -1, +1
Speedup : The convolution operation can be approximated by XNOR and bitcount operations
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
XNOR-Net on Mobile
Challenges
Off-the-shelf CNNs are not robust on video
Solutions:
• Collective confidence over several frames
• CortexNet
Building a DL App and getting
$10 million in funding
(or a PhD)
Competitions to follow
Winners = High accuracy + Low energy consumption
* LPIRC - Low-Power Image Recognition Challenge
* EDLDC - Embedded deep learning design contest
* System Design Contest at Design Automation Conference (DAC)
AutoML – Let AI design an efficient AI architecture
MnasNet: Platform-Aware Neural Architecture Search for Mobile
• An automated neural architecture search approach for designing mobile
models using reinforcement learning
• Incorporates latency information into the reward objective function
• Measures real-world inference latency by executing models on actual phones
[Diagram: a controller samples models from the search space; a trainer measures accuracy, mobile phones measure latency, and a multi-objective reward combining both is fed back to the controller]
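The multi-objective reward is, roughly, accuracy scaled by how far measured latency is from a target; a sketch of the form used in the MnasNet paper (the target latency and exponent w are hyperparameters):

def mnas_reward(accuracy, latency_ms, target_ms=80.0, w=-0.07):
    # Reward = accuracy * (latency / target) ** w; with w < 0, slower models are penalized.
    return accuracy * (latency_ms / target_ms) ** w

print(mnas_reward(0.75, 80.0))   # on-target latency: reward equals accuracy
print(mnas_reward(0.75, 160.0))  # twice the target latency: reward shrinks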
AutoML – Let AI design an efficient AI architecture
For same accuracy:
• 1.5x faster than MobileNetV2
• ResNet-50 accuracy with 19x fewer parameters
• SSD300 mAP with 35x fewer FLOPs
Mr. Data Scientist PhD
One Last Question
How to access the slides in 1 second
Link posted here -> @anirudhkoul
Deep learning on mobile

Contenu connexe

Tendances

Tendances (20)

Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga Petrova
 
Onnx and onnx runtime
Onnx and onnx runtimeOnnx and onnx runtime
Onnx and onnx runtime
 
GPT : Generative Pre-Training Model
GPT : Generative Pre-Training ModelGPT : Generative Pre-Training Model
GPT : Generative Pre-Training Model
 
Prompt Engineering - an Art, a Science, or your next Job Title?
Prompt Engineering - an Art, a Science, or your next Job Title?Prompt Engineering - an Art, a Science, or your next Job Title?
Prompt Engineering - an Art, a Science, or your next Job Title?
 
Paper presentation on LLM compression
Paper presentation on LLM compression Paper presentation on LLM compression
Paper presentation on LLM compression
 
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
 
How to fine-tune and develop your own large language model.pptx
How to fine-tune and develop your own large language model.pptxHow to fine-tune and develop your own large language model.pptx
How to fine-tune and develop your own large language model.pptx
 
How ChatGPT and AI-assisted coding changes software engineering profoundly
How ChatGPT and AI-assisted coding changes software engineering profoundlyHow ChatGPT and AI-assisted coding changes software engineering profoundly
How ChatGPT and AI-assisted coding changes software engineering profoundly
 
Prompt Engineering
Prompt EngineeringPrompt Engineering
Prompt Engineering
 
Chain-of-thought Prompting.pptx
Chain-of-thought Prompting.pptxChain-of-thought Prompting.pptx
Chain-of-thought Prompting.pptx
 
Sequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural NetworksSequence to Sequence Learning with Neural Networks
Sequence to Sequence Learning with Neural Networks
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with Transformers
 
NLP 101 + Chatbots
NLP 101 + ChatbotsNLP 101 + Chatbots
NLP 101 + Chatbots
 
Foundation of Generative AI: Study Materials Connecting the Dots by Delving i...
Foundation of Generative AI: Study Materials Connecting the Dots by Delving i...Foundation of Generative AI: Study Materials Connecting the Dots by Delving i...
Foundation of Generative AI: Study Materials Connecting the Dots by Delving i...
 
Holland & Barrett: Gen AI Prompt Engineering for Tech teams
Holland & Barrett: Gen AI Prompt Engineering for Tech teamsHolland & Barrett: Gen AI Prompt Engineering for Tech teams
Holland & Barrett: Gen AI Prompt Engineering for Tech teams
 
Research Methods in Natural Language Processing
Research Methods in Natural Language ProcessingResearch Methods in Natural Language Processing
Research Methods in Natural Language Processing
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0
 
Build an LLM-powered application using LangChain.pdf
Build an LLM-powered application using LangChain.pdfBuild an LLM-powered application using LangChain.pdf
Build an LLM-powered application using LangChain.pdf
 
⼤語⾔模型 LLM 應⽤開發入⾨
⼤語⾔模型 LLM 應⽤開發入⾨⼤語⾔模型 LLM 應⽤開發入⾨
⼤語⾔模型 LLM 應⽤開發入⾨
 
딥 러닝 자연어 처리 학습을 위한 PPT! (Deep Learning for Natural Language Processing)
딥 러닝 자연어 처리 학습을 위한 PPT! (Deep Learning for Natural Language Processing)딥 러닝 자연어 처리 학습을 위한 PPT! (Deep Learning for Natural Language Processing)
딥 러닝 자연어 처리 학습을 위한 PPT! (Deep Learning for Natural Language Processing)
 

Similaire à Deep learning on mobile

Siddha Ganju, NVIDIA. Deep Learning for Mobile
Siddha Ganju, NVIDIA. Deep Learning for MobileSiddha Ganju, NVIDIA. Deep Learning for Mobile
Siddha Ganju, NVIDIA. Deep Learning for Mobile
IT Arena
 
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Ian Gomez
 
"Enabling Ubiquitous Visual Intelligence Through Deep Learning," a Keynote Pr...
"Enabling Ubiquitous Visual Intelligence Through Deep Learning," a Keynote Pr..."Enabling Ubiquitous Visual Intelligence Through Deep Learning," a Keynote Pr...
"Enabling Ubiquitous Visual Intelligence Through Deep Learning," a Keynote Pr...
Edge AI and Vision Alliance
 

Similaire à Deep learning on mobile (20)

Squeezing Deep Learning Into Mobile Phones
Squeezing Deep Learning Into Mobile PhonesSqueezing Deep Learning Into Mobile Phones
Squeezing Deep Learning Into Mobile Phones
 
Siddha Ganju. Deep learning on mobile
Siddha Ganju. Deep learning on mobileSiddha Ganju. Deep learning on mobile
Siddha Ganju. Deep learning on mobile
 
Siddha Ganju, NVIDIA. Deep Learning for Mobile
Siddha Ganju, NVIDIA. Deep Learning for MobileSiddha Ganju, NVIDIA. Deep Learning for Mobile
Siddha Ganju, NVIDIA. Deep Learning for Mobile
 
Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016
 
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
 
Deep Learning with CNTK
Deep Learning with CNTKDeep Learning with CNTK
Deep Learning with CNTK
 
AI at Google (30 min)
AI at Google (30 min)AI at Google (30 min)
AI at Google (30 min)
 
Track2 02. machine intelligence at google scale google, kaz sato, staff devel...
Track2 02. machine intelligence at google scale google, kaz sato, staff devel...Track2 02. machine intelligence at google scale google, kaz sato, staff devel...
Track2 02. machine intelligence at google scale google, kaz sato, staff devel...
 
Anomaly Detection with Azure and .NET
Anomaly Detection with Azure and .NETAnomaly Detection with Azure and .NET
Anomaly Detection with Azure and .NET
 
Generative AI at the edge.pdf
Generative AI at the edge.pdfGenerative AI at the edge.pdf
Generative AI at the edge.pdf
 
"Enabling Ubiquitous Visual Intelligence Through Deep Learning," a Keynote Pr...
"Enabling Ubiquitous Visual Intelligence Through Deep Learning," a Keynote Pr..."Enabling Ubiquitous Visual Intelligence Through Deep Learning," a Keynote Pr...
"Enabling Ubiquitous Visual Intelligence Through Deep Learning," a Keynote Pr...
 
Deep learning for FinTech
Deep learning for FinTechDeep learning for FinTech
Deep learning for FinTech
 
Dato Keynote
Dato KeynoteDato Keynote
Dato Keynote
 
Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies
 
Anomaly Detection with Azure and .net
Anomaly Detection with Azure and .netAnomaly Detection with Azure and .net
Anomaly Detection with Azure and .net
 
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
AWS re:Invent 2016: Bringing Deep Learning to the Cloud with Amazon EC2 (CMP314)
 
Deep Learning on Qubole Data Platform
Deep Learning on Qubole Data PlatformDeep Learning on Qubole Data Platform
Deep Learning on Qubole Data Platform
 
Introduction to ML.NET
Introduction to ML.NETIntroduction to ML.NET
Introduction to ML.NET
 
AI for Manufacturing (Machine Vision, Edge AI, Federated Learning)
AI for Manufacturing (Machine Vision, Edge AI, Federated Learning)AI for Manufacturing (Machine Vision, Edge AI, Federated Learning)
AI for Manufacturing (Machine Vision, Edge AI, Federated Learning)
 
Using Algorithmia to leverage AI and Machine Learning APIs
Using Algorithmia to leverage AI and Machine Learning APIsUsing Algorithmia to leverage AI and Machine Learning APIs
Using Algorithmia to leverage AI and Machine Learning APIs
 

Dernier

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Dernier (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Deep learning on mobile

  • 1. Deep Learning on mobile phones - A Practitioners guide Anirudh Koul
  • 2.
  • 3. Deep Learning on mobile phones - A Practitioners guide Anirudh Koul
  • 4. Anirudh Koul , @anirudhkoul , http://koul.ai Head of AI & Research, Aira [lastname]@aira.io Founder, Seeing AI Previously at Microsoft
  • 5.
  • 6. Why Deep Learning On Mobile? Latency Privacy
  • 7. Response Time Limits – Powers of 10 0.1 second : Reacting instantly 1.0 seconds : User’s flow of thought 10 seconds : Keeping the user’s attention [Miller 1968; Card et al. 1991; Jakob Nielsen 1993]:
  • 8. Mobile Deep Learning Recipe Mobile Inference Engine + Pretrained Model = DL App (Efficient) (Efficient)
  • 9. Building a DL App in _ time
  • 10. Building a DL App in 1 hour
  • 11. Use Cloud APIs for General Recognition Needs • Microsoft Cognitive Services • Clarifai • Google Cloud Vision • IBM Watson Services • Amazon Rekognition
  • 12. How to Choose a Computer Vision Based API? Benchmark & Compare them COCO-Text v2.0 for Text reading in the wild • ~2k random images • Candidate text has at least 2 characters together • Direct word match COCO-Val 2017 for Image Tagging in the wild • ~4k random images • Tag similarity match instead of word match
  • 14. Recognize Text Benchmarks Text API Accuracy Amazon Rekognition 45.4% Google Cloud Vision 33.4% Microsoft Cognitive Services 55.4% Evaluation criteria: • Photos have candidate words with at length>=2 • Direct word match with ground truth
  • 15. Image Tagging Benchmarks Evaluation criteria: • Concept similarity match instead of word match • E.g. ‘military-officer’ tag matched with ground truth tag ‘person’ Text API Accuracy Amazon Rekognition 65% Google Cloud Vision 47.6% Microsoft Cognitive Services 50.0%
  • 16. Image Tagging Benchmarks Evaluation criteria: • Concept similarity match instead of word match • E.g. ‘military-officer’ tag matched with ground truth tag ‘person’ Text API Accuracy Avg #Tags Amazon Rekognition 65% 14 Google Cloud Vision 47.6% 14 Microsoft Cognitive Services 50.0% 8
  • 17. Image Tagging Benchmarks Hard to do Precision-Recall since COCO ground truth tags are not exhaustive Lower # of tags for a given accuracy indicates higher F-measure Text API Accuracy Avg #Tags Amazon Rekognition 65% 14 Google Cloud Vision 47.6% 14 Microsoft Cognitive Services 50.0% 8
  • 18. Tips for reducing network latency • For Text Recognition • Compressing setting of upto 90% has little effect on accuracy, but drastic savings in size • Resizing is dangerous, text recognition needs a minimum size for recognition • For image recognition • Resize to 224 as the minimum(height,width) at 50% compression with bilinear interpolation
  • 19. Building a DL App in 1 day
  • 21. Base Pretrained Model ImageNet – 1000 Object Categorizer VGG16 Inception-v3 Resnet-50 MobileNet SqueezeNet
  • 22. Running pre-trained models on mobile Core ML TensorFlow Lite Caffe2
  • 23. Apple’s Ecosystem Metal BNNS +MPS CoreML CoreML2 2014 2016 2017 2018
  • 24. Apple’s Ecosystem Metal - low-level, low-overhead hardware-accelerated 3D graphic and compute shader application programming interface (API) - Available since iOS 8 Metal BNNS +MPS CoreML CoreML2 2014 2016 2017 2018
  • 25. Apple’s Ecosystem Fast low-level primitives: • BNNS – Basic Neural Network Subroutine • Ideal case: Fully connected NN • MPS – Metal Performance Shaders • Ideal case: Convolutions Inconvenient for large networks: • Inception-v3 inference consisted of 1.5K hard coded model definition • Libraries Like Forge by Matthijs Hollemans provide abstraction Metal BNNS +MPS CoreML CoreML2 2014 2016 2017 2018
  • 26. Apple’s Ecosystem Convert Caffe/Tensorflow model to CoreML model in 3 lines: import coremltools coreml_model = coremltools.converters.caffe.convert('my_caffe_model.caffemodel’) coreml_model.save('my_model.mlmodel’) Add model to iOS project and call for prediction. Direct support for Keras, Caffe, scikit-learn, XGBoost, LibSVM Automatically minimizes memory footprint and power consumption Metal BNNS +MPS CoreML CoreML2 2014 2016 2017 2018
  • 27. Apple’s Ecosystem • Model quantization support upto 1 bit • Batch API for improved performance • Conversion support for MXNet, ONNX • ONNX opens models from PyTorch, Cognitive Toolkit, Caffe2, Chainer • ML Create for quick training • tf-coreml for direct conversion from tensorflow Metal BNNS +MPS CoreML CoreML2 2014 2016 2017 2018
  • 28. CoreML Benchmark - Pick a DNN for your mobile architecture Model Top-1 Accurac y Size of Model (MB) iPhone 5S Execution Time (ms) iPhone 6 Execution Time (ms) iPhone 6S/SE Execution Time (ms) iPhone 7 Execution Time (ms) iPhone 8/X Execution Time (ms) VGG 16 71 553 7408 4556 235 181 146 Inception v3 78 95 727 637 114 90 78 Resnet 50 75 103 538 557 77 74 71 MobileNet 71 17 129 109 44 35 33 SqueezeNet 57 5 75 78 36 30 29 2014 2015 2016 Huge improvement in GPU hardware in 2015 2013 2017
  • 29. Putting out more frames than an art gallery
  • 30. TensorFlow Ecosystem TensorFlow TensorFlow Mobile TensorFlow Lite 2015 2016 2018
  • 31. TensorFlow Ecosystem The full, bulky deal TensorFlow TensorFlow Mobile TensorFlow Lite 2015 2016 2018
  • 32. TensorFlow Ecosystem TensorFlow TensorFlow Mobile TensorFlow Lite 2015 2016 2018 Easy pipeline to bring Tensorflow models to mobile Excellent documentation Optimizations to bring model to mobile
  • 33. TensorFlow Ecosystem • Smaller • Faster • Minimal dependencies • Easier to package & deploy • Allows running custom operators 1 line conversion from Keras to TensorFlow lite • tflite_convert --keras_model_file=keras_model.h5 --output_file=foo.tflite TensorFlow TensorFlow Mobile TensorFlow Lite 2015 2016 2018
  • 34. TensorFlow Lite is small • ~75KB for core interpreter • ~400KB for core interpreter + supported operations • Compared to 1.5MB for Tensorflow Mobile
  • 35. TensorFlow Lite is fast • Takes advantage of on-device hardware acceleration • Uses FlatBuffers • Reduces code footprint, memory usage • Reduces CPU cycles on serialization and deserialization • Improves startup time • Pre-fused activations • Combining batch normalization layer with previous Convolution • Interpreter uses static memory and static execution plan • Decreases load time
  • 37. TensorFlow Lite Benchmarks - http://alpha.lab.numericcal.com/
  • 38. TensorFlow Lite Benchmarks - http://ai-benchmark.com/ • Crowdsourcing benchmarking with AI Benchmark android app • By Andrey Ignatov from ETH • 9 Tests • E.g Semantic Segmentation, Image Super Resolution, Face Recognition
  • 39. Caffe2 From Facebook Under 1 MB of binary size Built for Speed : For ARM CPU : Uses NEON Kernels, NNPack For iPhone GPU : Uses Metal Performance Shaders and Metal For Android GPU : Uses Qualcomm Snapdragon NPE (4-5x speedup) ONNX format support to import models from CNTK/PyTorch
  • 41. Recommendation for development 1. Train a model using Keras 2. For iOS: • Convert to CoreML using coremltools 3. For Android: • Convert to Tensorflow Lite using tflite_convert Keras .mlmodel file .tflite file coremltools tflite_convert
  • 42. Common Questions “My app has become too big to download. What do I do?” • iOS doesn’t allow apps over 150 MB to be downloaded • Solution : Download on demand, and compile on device • 0 MB change to app size on first install
  • 43. Common Questions “Do I need to ship a new app update with every model improvement?” • Making App updates is a decent amount of overheard, plus ~2 days wait time • Solution : Check for model updates, download and compile on device • Easier solution – Use a framework for Model Management, e.g. • Google ML Kit • Fritz • Numerrical
  • 44. Common Questions “Why does my app not recognize objects at top/bottom of screen?” • Solution : Check the cropping used, by default, its center crop 
  • 45. Building a DL App in 1 week
  • 46. Learn Playing an Accordion 3 months
  • 47. Learn Playing an Accordion 3 months Knows Piano Fine Tune Skills 1 week
  • 48. I got a dataset, Now What? Step 1 : Find a pre-trained model Step 2 : Fine tune a pre-trained model Step 3 : Run using existing frameworks “Don’t Be A Hero” - Andrej Karpathy
  • 49. How to find pretrained models for my task? Search “Model Zoo” https://modelzoo.co - 300+ models
  • 50. AlexNet, 2012 (simplified) [Krizhevsky, Sutskever,Hinton’12] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, 11 n-dimension Feature representation
  • 51. Deciding how to fine tune Size of New Dataset Similarity to Original Dataset What to do? Large High Fine tune. Small High Don’t Fine Tune, it will overfit. Train linear classifier on CNN Features Small Low Train a classifier from activations in lower layers. Higher layers are dataset specific to older dataset. Large Low Train CNN from scratch http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
  • 52. Deciding when to fine tune Size of New Dataset Similarity to Original Dataset What to do? Large High Fine tune. Small High Don’t Fine Tune, it will overfit. Train linear classifier on CNN Features Small Low Train a classifier from activations in lower layers. Higher layers are dataset specific to older dataset. Large Low Train CNN from scratch http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
  • 53. Deciding when to fine tune Size of New Dataset Similarity to Original Dataset What to do? Large High Fine tune. Small High Don’t Fine Tune, it will overfit. Train linear classifier on CNN Features Small Low Train a classifier from activations in lower layers. Higher layers are dataset specific to older dataset. Large Low Train CNN from scratch http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
  • 54. Deciding when to fine tune Size of New Dataset Similarity to Original Dataset What to do? Large High Fine tune. Small High Don’t Fine Tune, it will overfit. Train linear classifier on CNN Features Small Low Train a classifier from activations in lower layers. Higher layers are dataset specific to older dataset. Large Low Train CNN from scratch http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
  • 55. Could you training your own classifier ... without coding? • Microsoft CustomVision.ai • Unique: Under a minute training, Custom object detection • Google AutoML • Unique: Full CNN training, crowdsourced workers • IBM Watson Visual recognition • Baidu EZDL • Unique: Custom Sound recognition
  • 56. Custom Vision Service (customvision.ai) – Drag and drop training Tip : Upload 30 photos per class for make prototype model Upload 200 photos per class for more robust production model More distinct the shape/type of object, lesser images required.
  • 57. Custom Vision Service (customvision.ai) – Drag and drop training Tip : Use Fatkun Browser Extension to download images from Search Engine, or use Bing Image Search API to programmatically download photos with proper rights
  • 58. CoreML exporter from customvision.ai – Drag and drop training 5 minute shortcut to training, finetuning and getting model ready in CoreML format Drag and drop interface
  • 59. Building a Crowdsourced Data Collector in 1 months
  • 60. Barcode recognition from Seeing AI Live Guide user in finding a barcode with audio cues With Server Decode barcode to identify product Tech MPSCNN running on mobile GPU + barcode library Metrics 40 FPS (~25 ms) on iPhone 7 Aim : Help blind users identify products using barcode Issue : Blind users don’t know where the barcode is
  • 61. Currency recognition from Seeing AI Aim : Identify currency Live Identify denomination of paper currency instantly With Server - Tech Task specific CNN running on mobile GPU Metrics 40 FPS (~25 ms) on iPhone 7
  • 62. Training Data Collection App Request volunteers to take photos of objects in non-obvious settings Sends photos to cloud, trains model nightly Newsletter shows the best photos from volunteers Let them compete for fame
  • 63. Daily challenge - Collected by volunteers
  • 64. Daily challenge - Collected by volunteers
  • 65. Building a production DL App in 3 months
  • 66. What you want https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016 $2000$200,000 What you can afford
  • 67. 11x11 conv, 96, /4, pool/2 5x5 conv, 256, pool/2 3x3 conv, 384 3x3 conv, 384 3x3 conv, 256, pool/2 fc, 4096 fc, 4096 fc, 1000 AlexNet, 8 layers (ILSVRC 2012) Revolution of Depth Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
  • 68. 11x11 conv, 96, /4, pool/2 5x5 conv, 256, pool/2 3x3 conv, 384 3x3 conv, 384 3x3 conv, 256, pool/2 fc, 4096 fc, 4096 fc, 1000 AlexNet, 8 layers (ILSVRC 2012) 3x3 conv, 64 3x3 conv, 64, pool/2 3x3 conv, 128 3x3 conv, 128, pool/2 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256, pool/2 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512, pool/2 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512, pool/2 fc, 4096 fc, 4096 fc, 1000 VGG, 19 layers (ILSVRC 2014) input Conv 7x7+ 2(S) MaxPool 3x3+ 2(S) LocalRespNorm Conv 1x1+ 1(V) Conv 3x3+ 1(S) LocalRespNorm MaxPool 3x3+ 2(S) Conv Conv Conv Conv 1x1+ 1(S) 3x3+ 1(S) 5x5+ 1(S) 1x1+ 1(S) Conv Conv MaxPool 1x1+ 1(S) 1x1+ 1(S) 3x3+ 1(S) Dept hConcat Conv Conv Conv Conv 1x1+ 1(S) 3x3+ 1(S) 5x5+ 1(S) 1x1+ 1(S) Conv Conv MaxPool 1x1+ 1(S) 1x1+ 1(S) 3x3+ 1(S) Dept hConcat MaxPool 3x3+ 2(S) Conv Conv Conv Conv 1x1+ 1(S) 3x3+ 1(S) 5x5+ 1(S) 1x1+ 1(S) Conv Conv MaxPool 1x1+ 1(S) 1x1+ 1(S) 3x3+ 1(S) Dept hConcat Conv Conv Conv Conv 1x1+ 1(S) 3x3+ 1(S) 5x5+ 1(S) 1x1+ 1(S) Conv Conv MaxPool 1x1+ 1(S) 1x1+ 1(S) 3x3+ 1(S) AveragePool 5x5+ 3(V) Dept hConcat Conv Conv Conv Conv 1x1+ 1(S) 3x3+ 1(S) 5x5+ 1(S) 1x1+ 1(S) Conv Conv MaxPool 1x1+ 1(S) 1x1+ 1(S) 3x3+ 1(S) Dept hConcat Conv Conv Conv Conv 1x1+ 1(S) 3x3+ 1(S) 5x5+ 1(S) 1x1+ 1(S) Conv Conv MaxPool 1x1+ 1(S) 1x1+ 1(S) 3x3+ 1(S) Dept hConcat Conv Conv Conv Conv 1x1+ 1(S) 3x3+ 1(S) 5x5+ 1(S) 1x1+ 1(S) Conv Conv MaxPool 1x1+ 1(S) 1x1+ 1(S) 3x3+ 1(S) AveragePool 5x5+ 3(V) Dept hConcat MaxPool 3x3+ 2(S) Conv Conv Conv Conv 1x1+ 1(S) 3x3+ 1(S) 5x5+ 1(S) 1x1+ 1(S) Conv Conv MaxPool 1x1+ 1(S) 1x1+ 1(S) 3x3+ 1(S) Dept hConcat Conv Conv Conv Conv 1x1+ 1(S) 3x3+ 1(S) 5x5+ 1(S) 1x1+ 1(S) Conv Conv MaxPool 1x1+ 1(S) 1x1+ 1(S) 3x3+ 1(S) Dept hConcat AveragePool 7x7+ 1(V) FC Conv 1x1+ 1(S) FC FC Soft maxAct ivat ion soft max0 Conv 1x1+ 1(S) FC FC Soft maxAct ivat ion soft max1 Soft maxAct ivat ion soft max2 GoogleNet, 22 layers (ILSVRC 2014) Revolution of Depth Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
  • 69. AlexNet, 8 layers (ILSVRC 2012) ResNet, 152 layers (ILSVRC 2015) 3x3 conv, 64 3x3 conv, 64, pool/2 3x3 conv, 128 3x3 conv, 128, pool/2 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256, pool/2 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512, pool/2 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512, pool/2 fc, 4096 fc, 4096 fc, 1000 11x11 conv, 96, /4, pool/2 5x5 conv, 256, pool/2 3x3 conv, 384 3x3 conv, 384 3x3 conv, 256, pool/2 fc, 4096 fc, 4096 fc, 1000 1x1 conv, 64 3x3 conv, 64 1x1 conv, 256 1x1 conv, 64 3x3 conv, 64 1x1 conv, 256 1x1 conv, 64 3x3 conv, 64 1x1 conv, 256 1x2 conv, 128, /2 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 128 3x3 conv, 128 1x1 conv, 512 1x1 conv, 256, /2 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 256 3x3 conv, 256 1x1 conv, 1024 1x1 conv, 512, /2 3x3 conv, 512 1x1 conv, 2048 1x1 conv, 512 3x3 conv, 512 1x1 conv, 2048 1x1 conv, 512 3x3 conv, 512 1x1 conv, 2048 ave pool, fc 1000 7x7 conv, 64, /2, pool/2 VGG, 19 layers (ILSVRC 2014) Revolution of Depth Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015 Ultra deep
  • 70. Revolution of Depth – ResNet, 152 layers, built from repeated 1x1 / 3x3 / 1x1 bottleneck convolution blocks on top of a 7x7, stride-2 stem. [Slide zooms into the early residual blocks of the architecture diagram.] Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
  • 71. Revolution of Depth vs Classification Accuracy – ImageNet classification top-5 error (%): ILSVRC'10 28.2 (shallow), ILSVRC'11 25.8 (shallow), ILSVRC'12 AlexNet 16.4 (8 layers), ILSVRC'13 11.7, ILSVRC'14 VGG 7.3 (19 layers), ILSVRC'14 GoogleNet 6.7 (22 layers), ILSVRC'15 ResNet 3.6 (152 layers), ILSVRC'16 Ensemble 2.9 (ensemble of ResNet, Inception-ResNet, Inception and Wide Residual Network). Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
  • 72. Accuracy vs Operations Per Image Inference – marker size is proportional to the number of parameters (AlexNet ≈ 240 MB, VGG ≈ 552 MB); what we want is high accuracy at low compute and small model size. Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications”, 2016
  • 73. Your Budget - Smartphone Floating Point Operations Per Second (2015) http://pages.experts-exchange.com/processing-power-compared/
  • 74. iPhone X is more powerful than a Macbook Pro https://thenextweb.com/apple/2017/09/12/apples-new-iphone-x-already-destroying-android-devices-g/
  • 75. Strategies to get maximum efficiency from your CNN. Before training: • pick an efficient architecture for your task • design efficient layers. After training: • pruning • quantization • network binarization
  • 76. CoreML Benchmark - Pick a DNN for your mobile architecture
Model | Top-1 Accuracy | Size (MB) | Million Mult-Adds | iPhone 5S (ms) | iPhone 6 (ms) | iPhone 6S/SE (ms) | iPhone 7 (ms) | iPhone 8/X (ms)
VGG 16 | 71 | 553 | 15300 | 7408 | 4556 | 235 | 181 | 146
Inception v3 | 78 | 95 | 5000 | 727 | 637 | 114 | 90 | 78
ResNet 50 | 75 | 103 | 3900 | 538 | 557 | 77 | 74 | 71
MobileNet | 71 | 17 | 569 | 129 | 109 | 44 | 35 | 33
SqueezeNet | 57 | 5 | 800 | 75 | 78 | 36 | 30 | 29
(Phones span 2013–2017 release years; note the huge improvement in GPU hardware in 2015.)
  • 77. MobileNet family – splits each standard convolution into a 3x3 depthwise convolution followed by a 1x1 pointwise convolution. Tune with two parameters: the width multiplier and the resolution multiplier. Andrew G. Howard et al, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, 2017
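A minimal Keras sketch of that depthwise/pointwise split (assuming TensorFlow 2.x; the filter counts are illustrative, not the exact MobileNet configuration — the width and resolution multipliers surface in Keras as the alpha and input_shape arguments):

import tensorflow as tf

def depthwise_separable_block(x, pointwise_filters, stride=1):
    # 3x3 depthwise conv: one filter per input channel (spatial filtering only)
    x = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    # 1x1 pointwise conv: mixes channels and sets the output depth
    x = tf.keras.layers.Conv2D(pointwise_filters, 1, padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

inputs = tf.keras.Input((224, 224, 3))
features = depthwise_separable_block(inputs, pointwise_filters=64)

# Width multiplier 0.5 and a 128x128 input resolution on the stock model:
small_mobilenet = tf.keras.applications.MobileNet(alpha=0.5, input_shape=(128, 128, 3), weights=None)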
  • 79. Efficient Detection Architectures Jonathan Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors”, 2017
  • 82. Tricks while designing your own network • Dilated convolutions – great for segmentation / when the target object covers a large area of the image • Replace NxN convolutions with Nx1 followed by 1xN • Depthwise separable convolutions (e.g. MobileNet) • Inverted residual blocks (e.g. MobileNetV2) • Replace large filters with multiple small filters – a single 5x5 is slower than a 3x3 followed by a 3x3
  • 83. Design consideration for custom architectures – Small Filters. Three layers of 3x3 convolutions >> one layer of 7x7 convolution. Replace large 5x5 and 7x7 convolutions with stacks of 3x3 convolutions; replace NxN convolutions with a stack of 1xN and Nx1. Fewer parameters → less compute → more non-linearity → better, faster, stronger. Andrej Karpathy, CS-231n Notes, Lecture 11
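A back-of-the-envelope check of the savings (plain arithmetic, biases ignored; the channel count is an arbitrary example):

C = 256  # input and output channels, chosen only for illustration

params_7x7 = 7 * 7 * C * C               # one 7x7 conv layer
params_3x3_stack = 3 * (3 * 3 * C * C)   # three stacked 3x3 convs, same receptive field
params_5x5 = 5 * 5 * C * C
params_5x1_1x5 = (5 * 1 * C * C) + (1 * 5 * C * C)  # Nx1 followed by 1xN factorization

print(f"7x7: {params_7x7:,} vs three 3x3: {params_3x3_stack:,}")   # ~3.2M vs ~1.8M
print(f"5x5: {params_5x5:,} vs 5x1 + 1x5: {params_5x1_1x5:,}")     # ~1.6M vs ~0.66M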
  • 84. Selective training to keep networks shallow. Idea: augment data only to the extent your network will actually be used. Example: for a selfie app there is no benefit in rotating training images beyond ±45 degrees – the phone auto-rotates anyway. This approach is also followed by WordLens / Google Translate. Example: add blur if analyzing mobile phone frames.
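A sketch of such constrained augmentation with Keras' ImageDataGenerator (assumptions: a selfie-style classifier; the blur probability and sigma range are arbitrary illustrations):

import numpy as np
from scipy.ndimage import gaussian_filter
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def occasional_blur(img):
    # Occasionally soften the frame to mimic blurry phone captures
    if np.random.rand() < 0.3:
        sigma = np.random.uniform(0.5, 1.5)
        img = gaussian_filter(img, sigma=(sigma, sigma, 0))  # blur H and W, not channels
    return img

# Rotation capped at +/-45 degrees: the phone auto-rotates anything beyond that,
# so wider rotations would never be seen at inference time.
datagen = ImageDataGenerator(
    rotation_range=45,
    horizontal_flip=True,
    preprocessing_function=occasional_blur,
)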
  • 85. Pruning. Aim: remove all connections whose absolute weight is below a threshold. Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
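A minimal NumPy sketch of that threshold-based pruning (the threshold value is arbitrary; the surviving weights are then retrained with the mask held fixed):

import numpy as np

def magnitude_prune(weights, threshold=0.01):
    # Zero out connections whose absolute weight is below the threshold
    mask = np.abs(weights) >= threshold
    pruned = weights * mask
    sparsity = 1.0 - mask.mean()
    return pruned, mask, sparsity

w = np.random.randn(4096, 4096) * 0.01          # e.g. a fully connected layer
pruned_w, mask, sparsity = magnitude_prune(w)
print(f"Pruned away {sparsity:.1%} of the connections")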
  • 86. Observation: most parameters are in the fully connected layers – AlexNet, 240 MB (96% of all parameters); VGG-16, 552 MB (90% of all parameters)
  • 87. Pruning gives the quickest model compression without accuracy loss (AlexNet 240 MB, VGG-16 552 MB). The first layer, which interacts directly with the image, is sensitive and cannot be pruned much without hurting accuracy.
  • 88. Weight Sharing. Idea: cluster weights with similar values together and store them in a dictionary (codebook). Related techniques: codebook quantization, Huffman coding, HashedNets. Cons: needs a special inference engine, so it doesn't work for most applications.
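A sketch of codebook-style weight sharing with k-means (assuming scikit-learn is available; 16 clusters means each weight is stored as a 4-bit index into the shared codebook):

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(weights, n_clusters=16):
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(flat)
    codebook = km.cluster_centers_.flatten()      # shared values
    indices = km.labels_.reshape(weights.shape)   # small integer index per weight
    return codebook, indices

w = np.random.randn(256, 256).astype(np.float32)
codebook, indices = build_codebook(w)
reconstructed = codebook[indices]                 # lookup at inference time
print("Max reconstruction error:", np.abs(w - reconstructed).max())

As the slide notes, this only pays off with an inference engine that can run directly off the index + codebook representation.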
  • 89. Filter Pruning – ThiNet. Idea: discard a whole filter if it is not important to the predictions. Advantages: • no change in architecture other than thinning of filters per layer • can be further compressed with other methods. Just like feature selection, choose which filters to discard. Possible greedy criteria: • absolute weight sum of the entire filter is closest to 0 • average percentage of zeros in the filter's outputs • ThiNet – collect statistics on the output of the next layer
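The first greedy criterion above (smallest absolute weight sum per filter), sketched in NumPy; the layer shape follows the common (out_channels, in_channels, k, k) convention and is only an example:

import numpy as np

def filters_to_discard(conv_weights, prune_ratio=0.3):
    # Rank filters by the L1 norm of their weights; return indices of the weakest
    l1_per_filter = np.abs(conv_weights).sum(axis=(1, 2, 3))
    n_prune = int(len(l1_per_filter) * prune_ratio)
    return np.argsort(l1_per_filter)[:n_prune]

w = np.random.randn(64, 32, 3, 3)
weak = filters_to_discard(w, prune_ratio=0.3)
print(f"Discarding {len(weak)} of {w.shape[0]} filters")
# Physically removing these filters also shrinks the next layer's input channels.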
  • 90. SqueezeNet – AlexNet-level accuracy in 0.5 MB. SqueezeNet base: 4.8 MB; SqueezeNet compressed: 0.5 MB. 80.3% top-5 accuracy on ImageNet at 0.72 GFLOPS/image, built from Fire blocks. Forrest N. Iandola, Song Han et al, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size"
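A sketch of the Fire block in Keras (assuming TensorFlow 2.x; the squeeze/expand filter counts are one illustrative ratio, not necessarily the exact published configuration):

import tensorflow as tf

def fire_block(x, squeeze_filters=16, expand_filters=64):
    # Squeeze: 1x1 convs cut the channel count fed into the expensive 3x3 path
    s = tf.keras.layers.Conv2D(squeeze_filters, 1, activation='relu', padding='same')(x)
    # Expand: parallel 1x1 and 3x3 convs, concatenated along the channel axis
    e1 = tf.keras.layers.Conv2D(expand_filters, 1, activation='relu', padding='same')(s)
    e3 = tf.keras.layers.Conv2D(expand_filters, 3, activation='relu', padding='same')(s)
    return tf.keras.layers.Concatenate()([e1, e3])

inputs = tf.keras.Input((55, 55, 96))
outputs = fire_block(inputs)   # output has 128 channels (64 + 64)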
  • 91. Quantization. Reduce precision from 32 bits to 16 bits or fewer; use stochastic rounding for best results. In practice: • Ristretto + Caffe – automatic network quantization that finds a balance between compression rate and accuracy • Apple Metal Performance Shaders automatically quantize to 16 bits • TensorFlow has 8-bit quantization support • gemmlowp – low-precision matrix multiplication library
  • 92. Quantizing CNNs in Practice – reducing Core ML models to half their size:
import coremltools

# Load a model, lower its precision to fp16, and save the smaller model
model_spec = coremltools.utils.load_spec('model.mlmodel')
model_fp16_spec = coremltools.utils.convert_neural_network_spec_weights_to_fp16(model_spec)
coremltools.utils.save_spec(model_fp16_spec, 'modelFP16.mlmodel')
  • 93. Quantizing CNNs in Practice – reducing Core ML models to an even smaller size. Choose the number of bits and the quantization mode: bits from [1, 2, 4, 8]; quantization mode from ["linear", "linear_lut", "kmeans_lut", "custom_lut"] (lut = lookup table).
from coremltools.models.neural_network.quantization_utils import *

quantized_model = quantize_weights(model, 8, 'linear')
quantized_model.save('quantizedModel.mlmodel')
compare_models(model, quantized_model, './sample_data/')
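For the TensorFlow 8-bit path mentioned two slides earlier, post-training quantization in TensorFlow Lite follows a similar pattern; a minimal sketch assuming an already-trained tf.keras model:

import tensorflow as tf

keras_model = tf.keras.applications.MobileNet(weights=None)   # stand-in for your trained model

converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]           # enables weight quantization
tflite_bytes = converter.convert()

with open('model_quant.tflite', 'wb') as f:
    f.write(tflite_bytes)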
  • 94. Binary weighted networks. Idea: reduce the weights to {-1, +1}. Speedup: the convolution operation can then be approximated with only summations and subtractions. Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
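The weight approximation behind these binary networks (W ≈ α · sign(W), with α the mean absolute weight of each filter) in a short NumPy sketch:

import numpy as np

def binarize_weights(w):
    # w: (out_channels, in_channels, k, k); alpha = mean |w| per output filter
    alpha = np.abs(w).mean(axis=(1, 2, 3), keepdims=True)
    b = np.sign(w)
    b[b == 0] = 1                    # treat exact zeros as +1
    return alpha, b

w = np.random.randn(64, 32, 3, 3)
alpha, b = binarize_weights(w)
approx = alpha * b                   # convolving with `approx` needs only add/subtract plus one scale
print("Mean approximation error:", np.abs(w - approx).mean())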
  • 97. XNOR-Net. Idea: reduce both weights and inputs to {-1, +1}. Speedup: the convolution operation can then be approximated with XNOR and bitcount operations. Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
  • 101. Challenges: off-the-shelf CNNs are not robust on video. Solutions: • collective confidence over several frames • CortexNet
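Collective confidence can be as simple as averaging class probabilities over a sliding window of frames; a sketch (window size and confidence threshold are arbitrary choices):

from collections import deque
import numpy as np

class TemporalSmoother:
    # Average class probabilities over the last N frames before reporting a label
    def __init__(self, window=10, threshold=0.6):
        self.buffer = deque(maxlen=window)
        self.threshold = threshold

    def update(self, probs):
        self.buffer.append(np.asarray(probs, dtype=np.float32))
        mean_probs = np.mean(self.buffer, axis=0)
        best = int(np.argmax(mean_probs))
        confident = mean_probs[best] >= self.threshold
        return (best if confident else None), float(mean_probs[best])

smoother = TemporalSmoother(window=10, threshold=0.6)
# per camera frame: label, confidence = smoother.update(model_softmax_output)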
  • 102. Building a DL App and getting $10 million in funding (or a PhD)
  • 103. Competitions to follow – winners = high accuracy + low energy consumption: * LPIRC – Low-Power Image Recognition Challenge * EDLDC – Embedded Deep Learning Design Contest * System Design Contest at the Design Automation Conference (DAC)
  • 104. AutoML – let AI design an efficient AI architecture. MnasNet: Platform-Aware Neural Architecture Search for Mobile • an automated neural architecture search approach for designing mobile models using reinforcement learning • incorporates latency information into the reward objective function • measures real-world inference latency by executing models on actual phones. [Slide diagram: a controller samples models from the search space, a trainer measures their accuracy, real mobile phones measure latency, and a multi-objective reward combining accuracy and latency is fed back to the controller.]
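The multi-objective reward in the MnasNet paper has the form ACC(m) × [LAT(m)/T]^w, scaling accuracy by a soft latency penalty; a sketch of that scoring function (the exponent value is an assumption recalled from the paper's soft-constraint setting, and the latency target is arbitrary):

def mnas_style_reward(accuracy, latency_ms, target_ms=80.0, w=-0.07):
    # accuracy: validation accuracy of the sampled model (0..1)
    # latency_ms: inference latency measured on a real phone
    # target_ms: latency budget T; w < 0 penalizes models slower than T
    return accuracy * (latency_ms / target_ms) ** w

print(mnas_style_reward(0.75, 80))    # on budget
print(mnas_style_reward(0.76, 160))   # twice the budget: the small accuracy gain doesn't win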
  • 105. AutoML – let AI design an efficient AI architecture. For the same accuracy, the searched models are: • 1.5x faster than MobileNetV2 • at ResNet-50 accuracy with 19x fewer parameters • at SSD300 mAP with 35x fewer FLOPs
  • 108. How to access the slides in 1 second Link posted here -> @anirudhkoul

Editor's notes

  1. https://hongkongphooey.wordpress.com/2009/02/18/first-look-huawei-android-phone/ https://medium.com/@startuphackers/building-a-deep-learning-neural-network-startup-7032932e09c
  2. https://hongkongphooey.wordpress.com/2009/02/18/first-look-huawei-android-phone/ https://medium.com/@startuphackers/building-a-deep-learning-neural-network-startup-7032932e09c
  3. [Miller 1968; Card et al. 1991]: 0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result. 1.0 second is about the limit for the user's flow of thought to stay uninterrupted, even though the user will notice the delay. Normally, no special feedback is necessary during delays of more than 0.1 but less than 1.0 second, but the user does lose the feeling of operating directly on the data. 10 seconds is about the limit for keeping the user's attention focused on the dialogue. For longer delays, users will want to perform other tasks while waiting for the computer to finish, so they should be given feedback indicating when the computer expects to be done. Feedback during the delay is especially important if the response time is likely to be highly variable, since users will then not know what to expect.
  4. No, don’t do it right now. Do it in the next session.
  5. If you need a hand warmer on a cold day, I suggest you try training on a phone
  6. Core ML supports a variety of machine learning models, including neural networks, tree ensembles, support vector machines, and generalized linear models. Core ML requires the Core ML model format (models with a .mlmodel file extension). Apple provides several popular, open source models that are already in the Core ML model format. You can download these models and start using them in your app. Additionally, various research groups and universities publish their models and training data, which may not be in the Core ML model format. To use these models, you need to convert them, as described in Converting Trained Models to Core ML. Internally does context switching between GPU and CPU. Uses Accelerate for CPU (For eg Sentiment Analysis) and GPU (for Image Classification)
  7. Speedups : No need to decode JPEGs, directly deal with camera image buffers
  8. “Surprisingly, ARM CPUs outperform the on-board GPUs (our NNPACK ARM CPU implementation outperforms Apple’s MPSCNNConvolution for all devices except the iPhone 7). There are other advantages to offloading compute onto the GPU/DSP, and it’s an active work in progress to expose these in Caffe2.” Built for first-class support on phones from 2015 onwards, but also runs on models from 2013 and later. Uses NEON kernels for certain operations like transpose. Heavily uses NNPack, which is extremely fast for convolutions on ARM CPUs – even on phones from 2014 without GPUs, the ARM CPU will outperform. NNPack implements Winograd for convolution math, converting convolution into element-wise multiplication and reducing the number of FLOPs by about 2.5x; it also works on any ARM CPU, so it isn't limited to cell phones. Under 1 MB of compiled binary size.
  9. You don't need Microsoft's ocean-boiling GPU cluster.
  10. Learned hierarchical features from a deep learning algorithm. Each feature can be thought of as a filter, which filters the input image for that feature (a nose). If the feature is found, the responsible units generate large activations, which can be picked up by the later classifier stages as a good indicator that the class is present.
  11. In practice, we don’t usually train an entire DCNN from scratch with random initialization. This is because it is relatively rare to have a dataset of the size required for the depth of network needed. Instead, it is common to pre-train a DCNN on a very large dataset and then use the trained DCNN weights either as an initialization or as a fixed feature extractor for the task of interest.
Fine-tuning: transfer learning strategies depend on various factors, but the two most important ones are the size of the new dataset and its similarity to the original dataset. Keeping in mind that DCNN features are more generic in early layers and more dataset-specific in later layers, there are four major scenarios:
• New dataset is smaller in size and similar in content to the original dataset: if the data is small, it is not a good idea to fine-tune the DCNN due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the DCNN to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN features.
• New dataset is relatively large in size and similar in content to the original dataset: since we have more data, we can have more confidence that we would not overfit if we were to fine-tune through the full network.
• New dataset is smaller in size but very different in content from the original dataset: since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier from the top of the network, which contains more dataset-specific features. Instead, it might work better to train a classifier on activations from somewhere earlier in the network.
• New dataset is relatively large in size and very different in content from the original dataset: since the dataset is very large, we may expect that we can afford to train a DCNN from scratch. However, in practice it is very often still beneficial to initialize with weights from a pre-trained model. In this case, we would have enough data and confidence to fine-tune through the entire network.
  15. As we have all painfully experienced, in real life what you really want is not always what you can afford – and it's the same in machine learning. We all know deep learning works if you have large GPU servers; what about when you want to run it on a tiny device? The number-one limitation turns out to be memory. Look at ImageNet models over the last couple of years: AlexNet started at 240 megabytes, and VGG was over half a gig. So the question we will solve now is how to get these neural networks to do these amazing things while keeping a very small memory footprint.
  16. Just more layers, nothing special
  17. Here is another view of this model. This is the equivalent of Big data for Powerpoint
  18. Errors are now dropping by about 40% year on year; previously they dropped by only about 5% per year.
  19. A compromise between accuracy, number of parameters, and compute per inference.
  20. Apple is increasing the core count from two to six with a new A11 chip. Two of the cores are meant to do the bulk of intensive processing, while the other four are high efficiency cores dedicated to low-power tasks.
  21. DNNs often suffer from over-parameterization and a large amount of redundancy in their models. This typically results in inefficient computation and memory usage.
  22. The iPhone 7 has a considerable mobile GPU – 10 years ago, when CUDA came out, desktop GPUs had similar performance.
  23. DNNs often suffer from over-parameterization and a large amount of redundancy in their models. This typically results in inefficient computation and memory usage.
  24. 1x1 bottleneck convolutions are very efficient
  25. Word Lens app uses this
  26. Pruning redundant, non-informative weights in a previously trained network reduces the size of the network at inference time. Take a network, prune, and then retrain the remaining connections
  27. In VGG-16, the fully connected layers contain 90% of the weights; in AlexNet, 96%. Most computation happens in the convolutional layers.
  28. VGG-16 has 90% of its weights in fully connected layers, AlexNet 96%. ResNet, GoogleNet and Inception are mostly convolutional layers, so they compress less. Caffe2 does this, but still performs dense multiplication.
  29. Facebook app uses this
  30. Facebook app uses this
  31. SqueezeNet has been recently released. It is a re-hash of many concepts from ResNet and Inception, and shows that, after all, a better architecture design will deliver small network sizes and parameter counts without needing complex compression algorithms. Strategy 1: replace 3x3 filters with 1x1 filters. Strategy 2: decrease the number of input channels to 3x3 filters. Strategy 3: downsample late in the network so that convolution layers have large activation maps.
  32. Ristretto is an automated CNN-approximation tool which condenses 32-bit floating point networks. Ristretto is an extension of Caffe and allows testing, training and fine-tuning networks with limited numerical precision. Ristretto in a minute – Ristretto tool: performs automatic network quantization and scoring, using different bit-widths for number representation, to find a good balance between compression rate and network accuracy. Ristretto layers: re-implement Caffe layers and simulate reduced word-width arithmetic. Testing and training: thanks to Ristretto’s smooth integration into Caffe, network description files can be changed to quantize different layers. The bit-width used for different layers, as well as other parameters, can be set in the network’s prototxt file. This allows directly testing and training condensed networks, without any need for recompilation.
  33. Facebook app uses this
  34. Reduce the weights to binary values, then scale them during training.
  37. Now that I see this slide, this should probably have been the title for this session. We would have gotten a lot more people in this room.
  38. Facebook app uses this
  39. Three components: (1) an RNN-based controller for learning and sampling model architectures; (2) a trainer that trains the sampled models; (3) an inference engine that measures model speed on real mobile phones.
  40. SE = Squeeze-and-Excitation optimization
  41. Minerva consists of five stages, as shown in Figure 2. Stages 1–2 establish a fair baseline accelerator implementation. Stage 1 generates the baseline DNN: fixing a network topology and a set of trained weights. Stage 2 selects an optimal baseline accelerator implementation. Stages 3–5 employ novel co-design optimizations to minimize power consumption over the baseline in the following ways: Stage 3 analyzes the dynamic range of all DNN signals and reduces slack in data type precision. Stage 4 exploits observed network sparsity to minimize data accesses and MAC operations. Stage 5 introduces a novel fault mitigation technique, which allows for aggressive SRAM supply voltage reduction. For each of the three optimization stages, the ML level measures the impact on prediction accuracy, the architecture level evaluates hardware resource savings, and the circuit level characterizes the hardware models and validates simulation results.