DEEP CNN VS CONVENTIONAL ML Algorithms and Case Study
My learning path
Deep learning for coders: http://course.fast.ai/
Michael Nielsen : http://neuralnetworksanddeeplearning.com/
Stanford: http://cs231n.github.io/classification/
UCF computer vision class:
https://www.youtube.com/watch?v=715uLCHt4jE&list=PLd3hlSJsX_ImKP68wfKZJVIPTd8Ie5u-9
The deep learning book: http://www.deeplearningbook.org/
30 different blogs for detailed topics.
Deep CNN (Convolutional NN) vs Conventional Neural Network
CNN: (architecture diagram)
NN: (architecture diagram)
Architecture Differences:
In addition to fully connected layers and a final softmax layer:
1. Conv layers.
2. Max-pooling layers.
3. More hidden layers: from a few to a dozen.
Algorithm Differences:
In addition to SGD and backpropagation to train weights and biases:
1. Nonlinear activation: ReLU instead of Sigmoid/Tanh.
2. Regularization: Dropout.
3. Batch normalization.
Convolutional Layers: How do they work?
Fully connected layer: every neuron in the network is connected to every neuron in adjacent layers.
Conv layer: each neuron in the hidden layer will be connected to a small region of the input neurons. The
transformation is defined by a filter. We then slide the filter across the entire input image.
Figure 3: each hidden neuron has a bias and 5×5 weights (which define a filter) connected to its local receptive field. We slide the
filter over by one pixel to the right (i.e., by one neuron) to connect to the second hidden neuron.
Weight Sharing:
Use the same weights and bias (one filter/kernel) for each of the 24×24 hidden neurons, i.e., all of these neurons detect
exactly the same feature but at different locations in the input image; a conv layer has many such filters (e.g., LeNet's
first conv layer used 6 different filters). The 5×5 filter and the resulting 24×24 feature map are sketched below.
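A minimal NumPy sketch (not from the slides) of the shared-filter convolution described above: one 5×5 filter plus one bias slid over a 28×28 input, producing a 24×24 feature map.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))          # toy grayscale input
weights = rng.random((5, 5))          # one shared 5x5 filter
bias = 0.1                            # one shared bias

feature_map = np.zeros((24, 24))
for i in range(24):                   # 28 - 5 + 1 = 24 valid positions per axis
    for j in range(24):
        receptive_field = image[i:i+5, j:j+5]
        feature_map[i, j] = np.sum(receptive_field * weights) + bias

print(feature_map.shape)              # (24, 24)
```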
Convolution in CNN vs Convolution in traditional computer vision
Image filtering: compute a function of the local neighborhood at each position. It can be used to enhance images
(denoise, resize, increase contrast, etc.) or to extract information from images (texture, edges, distinctive points, etc.).
Convolution in computer vision:
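The definition presumably illustrated at this point in the slide (the figure itself is not reproduced here) is the standard 2D discrete convolution of an image I with a kernel K:

```latex
(I * K)(i, j) = \sum_{m}\sum_{n} I(i - m,\, j - n)\, K(m, n)
```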
Convolution in CNN vs Convolution in traditional computer vision
Image filtering examples (figure): shifting, averaging, vertical edge detection, horizontal edge detection.
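The exact kernels behind the slide's figure are not shown; a hedged sketch with standard kernels for the same four operations (shift, average, vertical edges, horizontal edges), using SciPy:

```python
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(64, 64)                     # toy grayscale image

shift_kernel = np.zeros((3, 3)); shift_kernel[1, 0] = 1     # shifts the image by one pixel
averaging_kernel = np.ones((3, 3)) / 9.0                    # box blur / denoise
vertical_edge_kernel = np.array([[-1, 0, 1],                # Sobel-style vertical edge detector
                                 [-2, 0, 2],
                                 [-1, 0, 1]])
horizontal_edge_kernel = vertical_edge_kernel.T             # Sobel-style horizontal edge detector

filtered = {name: convolve2d(image, k, mode="same")
            for name, k in [("shift", shift_kernel), ("average", averaging_kernel),
                            ("v_edges", vertical_edge_kernel), ("h_edges", horizontal_edge_kernel)]}
```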
Convolution in CNN vs Convolution in traditional computer vision (why is it better?)
Feature engineering in traditional computer vision: Haar filters, Gabor filters.
Convolution in CNN:
Shared weights and biases are computed within the network through training, no feature engineering needed.
Advantage 1: Feature generation and classification are tied within one system.
E.g., Haar filters in face detection: a series
of predefined simple filters used to classify faces.
Number of Layers in CNN vs in conventional NN (why is it better?)
Figure 6: VGG16 architecture (13 conv layers, 3 dense layers)
Advantage 2: The ability to learn hierarchies of concepts, building up
multiple layers of abstraction.
Number of Layers in CNN vs in conventional NN (why is it better?)
Advantage 3: Easier to compute complex functions.
Example: designing a computer from scratch.
It is hard to do much with shallow layers of circuits; even small tasks need multiple layers of assembly.
There are mathematical proofs showing that for some functions, very shallow circuits require exponentially more circuit
elements to compute than do deep circuits.
Problems that arise when making networks deep
A huge number of parameters leads to overfitting and long computation time.
Solutions:
1. Conv layers: local connectivity and shared weights.
E.g., a filter needs 5x5=25 shared weights plus a bias term to define it, so 20 such filters need 20x26=520
parameters to define a conv layer. If we instead use a fully connected layer with 28x28=784 input neurons and 30
hidden neurons, there are 784x30 weights plus 30 biases, for a total of 23,550 parameters (see the quick check after this list).
2. Max-pooling to reduce the number of parameters.
3. Regularization to prevent overfitting: L2/L1, dropout, early stopping, data augmentation, noise injection…
4. GPUs.
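A quick check of the parameter counts quoted in item 1:

```python
# Parameter counts from the example above.
conv_params = 20 * (5 * 5 + 1)    # 20 filters x (25 shared weights + 1 bias) = 520
fc_params = 784 * 30 + 30         # 28x28 inputs fully connected to 30 hidden neurons = 23,550
print(conv_params, fc_params)     # 520 23550
```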
Vanishing gradient problem:
Solutions:
ReLU activation.
Blowing-up (exploding activations) problem:
Solutions:
Batch normalization.
Max pooling in traditional computer vision
Spatial Pyramid Matching (SPM) for image compression:
• Each level in the pyramid is 1/4 of the size of the previous level.
• Resolution (dimension) is reduced at each level from bottom to top.
• The higher-order representation introduces some invariances.
• Pooling methods: sum, max, random, histogram, Gaussian, Laplacian, L2-norm.
Max pooling in Conv NN:
• Max pooling is chosen to achieve high speed.
• Pooling layers are usually applied immediately after conv layers
to produce a condensed feature map.
Intuition:
• Once a feature has been found, its exact location isn't as
important as its rough location relative to other features.
• Max pooling is claimed to mirror part of our visual system: so-called
receptive fields arguably act as max pooling over the sensory data
obtained by the eyes (a minimal pooling sketch follows).
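A minimal 2×2 max-pooling sketch (assuming the 24×24 feature map from the earlier convolution example), producing a 12×12 condensed map:

```python
import numpy as np

feature_map = np.random.rand(24, 24)                          # stand-in for a conv-layer output
pooled = feature_map.reshape(12, 2, 12, 2).max(axis=(1, 3))   # non-overlapping 2x2 max pooling
print(pooled.shape)                                           # (12, 12)
```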
Regularization methods in DL vs in conventional ML
Popular regularization methods in DL based on previously used ML/Statistics methods:
1. L2/L1 regularization: as in Ridge regression, Lasso, Elastic Net.
2. Data augmentation: translating, rotating, scaling the original image. Used frequently in traditional computer vision.
The idea is similar to bootstrapping in Statistics: sampling with replacement from the original samples.
3. Noise injection into the input dataset: for some models, the addition of noise with infinitesimal variance at the input of the
model is equivalent to imposing a penalty on the norm of the weights (Bishop, 1995a,b).
4. Noise injection into the weights (mainly for RNNs): can be interpreted as a stochastic implementation of Bayesian inference
over the weights. (Bayesian methods assume parameters follow a probability distribution.)
5. Noise injection into the output labels (label smoothing): assuming the labels are wrong with probability α, replace the hard 0 and 1
targets with α/(k−1) and 1−α respectively (for k classes). Based on the maximum-entropy principle; this strategy has been used since the 1980s.
6. Early stopping: record the validation error during training; the algorithm terminates when no parameters have improved
over the best recorded validation error for some pre-specified number of iterations (treating the number of training steps
as another hyperparameter; a generic loop is sketched below). In the case of a simple linear model with a quadratic error
function and simple gradient descent, early stopping is equivalent to L2 regularization.
Similar to overfit monitoring in ML / convergence monitoring in Bayesian inference.
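A generic early-stopping loop matching item 6, with a simulated validation-error curve standing in for real training (a sketch, not any particular framework's API):

```python
import numpy as np

rng = np.random.default_rng(0)
max_epochs = 200
patience = 10                        # pre-specified number of iterations without improvement

best_val_error = float("inf")
best_epoch = 0
epochs_without_improvement = 0

for epoch in range(max_epochs):
    # Stand-in for one epoch of SGD/backprop followed by a validation pass:
    # a noisy curve that improves early and then starts to overfit.
    val_error = (epoch - 60) ** 2 / 3600 + 0.1 * rng.random()
    if val_error < best_val_error:
        best_val_error = val_error
        best_epoch = epoch           # in practice, also snapshot the weights here
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                    # stop; restore the snapshot from best_epoch

print(best_epoch, round(best_val_error, 3))
```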
Regularization methods in DL vs in conventional ML
Popular regularization methods in DL (relatively new):
Dropout (Srivastava et al., 2014): an inexpensive approximation to training and evaluating a bagged ensemble of
exponentially many neural networks. Usually only applied to FC layers.
What does it do? Bagging by randomly destroying features.
Specifically, it randomly removes non-output units (by multiplying
their output values by zero) from an underlying base network. Each
time we load an example into a minibatch, we randomly sample
a different binary mask to apply to all of the input and hidden
units in the network; this is then equivalent to bagging with
bootstrapped training data. Computationally, the ensembled result
is approximated by multiplying the weights going out of unit i by
the probability of including unit i (see the sketch below).
Purpose of the destruction:
it makes the model more robust to the loss of individual pieces
of evidence, and thus less likely to rely on particular
idiosyncrasies of the training data.
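A minimal inverted-dropout sketch (an assumption about the variant: it scales the kept units up at training time, which has the same effect as the test-time weight scaling described above):

```python
import numpy as np

def dropout(activations, keep_prob=0.5, training=True):
    if not training:
        return activations                                   # test time: keep all units
    mask = np.random.rand(*activations.shape) < keep_prob    # fresh binary mask per mini-batch
    return activations * mask / keep_prob                    # zero out dropped units, scale the rest up

hidden = np.random.rand(32, 128)                             # a mini-batch of hidden-layer activations
hidden = dropout(hidden, keep_prob=0.5)
```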
Regularization methods in DL vs in conventional ML
Another ML method that does bagging by randomly dropping features and bootstrapping inputs: Random Forest.
Differences:
1. In RF, each tree is trained to convergence on its respective training set. With dropout, typically most models are not
explicitly trained at all. Instead, a tiny fraction of the possible sub-networks are each trained for a single step, and the
parameter sharing causes the remaining sub-networks to arrive at good settings of the parameters.
2. Dropout destroys extracted features rather than original values, which allows the destruction process to make use of
all of the knowledge about the input distribution that the model has acquired so far.
What does it do? It bootstraps the input dataset and selects a
random subset of the features, to reduce the correlation of
the trees and give weak features an opportunity to
contribute (see the scikit-learn sketch below).
Similarity to dropout: both bootstrap data
points and bag over features, following the same
principle: a group of “weak learners” can come
together to form a “strong learner”.
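A hedged illustration of that Random Forest recipe (bootstrapped rows plus a random feature subset per split) using scikit-learn; the dataset here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,       # number of "weak learner" trees
    bootstrap=True,         # sample rows with replacement for each tree
    max_features="sqrt",    # random feature subset considered at each split
    random_state=0,
).fit(X, y)
print(rf.score(X, y))
```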
Activation function change from Sigmoid/Tanh to ReLU
Vanishing gradient problem for deep NN using Sigmoid/Tanh:
the 1st-layer bias gradient will usually be a factor of ~16 smaller than the 3rd-layer bias gradient.
Sigmoid function: σ(x) = 1 / (1 + e^(−x))
Example: expand the expression for the 1st-layer bias gradient and compare it to the expression for the gradient with
respect to the 3rd-layer bias (C is the cost function, z_j is the weighted input to neuron j):
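The slide's equation is not reproduced in this transcript; a hedged reconstruction following Nielsen's derivation for a chain of sigmoid neurons with weights w_j and weighted inputs z_j:

```latex
\frac{\partial C}{\partial b_1}
  = \sigma'(z_1)\, w_2\, \sigma'(z_2)\, w_3 \cdot \frac{\partial C}{\partial b_3},
\qquad
|\sigma'(z)| \le \tfrac{1}{4}
\;\Rightarrow\;
\left|\frac{\partial C}{\partial b_1}\right|
  \lesssim \tfrac{1}{16}\left|\frac{\partial C}{\partial b_3}\right|
\quad \text{when } |w_j| \lesssim 1 .
```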
Activation function change from Sigmoid/Tanh to ReLU
ReLU: max(0, x)
Benefits:
1. No vanishing gradient problem (ReLU has a constant gradient of 1 for positive inputs); see the numerical
illustration at the end of this slide. (Krizhevsky et al. reported a 6x improvement in convergence with
the ReLU unit compared to the tanh unit.)
2. Simpler computation, which reduces training and evaluation time.
3. Introduces sparsity (when x<0), which has a similar effect to dropout.
4. Can be used in Restricted Boltzmann machines to model real/integer
valued outputs.
Drawbacks:
1. Blowing up: ReLU may amplify the signal inside the network more than softmax or sigmoid, since there is no squashing.
Solution: dropout, batch norm.
2. Dead units: if the learning rate is set too high, a large gradient flowing through a ReLU neuron can cause the
weights to update in such a way that the neuron will never activate on any data point again. If this happens,
the gradient flowing through the unit will be zero forever from that point on.
Solution: careful learning-rate setting.
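A small numerical illustration of Benefit 1: the sigmoid derivative never exceeds 0.25, so gradients shrink through stacked sigmoid layers, while ReLU passes a gradient of 1 for positive inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum value is 0.25, at x = 0

def relu_grad(x):
    return (x > 0).astype(float)  # 1 for positive inputs, 0 otherwise

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid_grad(z))            # all entries <= 0.25
print(relu_grad(z))               # [0. 0. 0. 1. 1.]
```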
ReLU’s brother used in ML
ReLU: max(0, x)
Hinge function: Direct hinge: max(0, x-c) Mirror hinge: max(0, c-x)
MARS (Multivariate Adaptive Regression Splines) uses hinge
functions as basis functions to fit a regression and capture non-linear relationships.
MARS with one variable (example fit): ŷ = 25 + 6.1·max(0, x−13) − 3.1·max(0, 13−x).
MARS with multiple variables adds interactions as products of hinge functions.
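Evaluating the single-variable fit quoted above as a plain Python function:

```python
def mars_fit(x):
    # y-hat = 25 + 6.1*max(0, x - 13) - 3.1*max(0, 13 - x), with a knot at x = 13
    return 25 + 6.1 * max(0, x - 13) - 3.1 * max(0, 13 - x)

print(mars_fit(10))   # below the knot: 25 - 3.1*3 = 15.7
print(mars_fit(20))   # above the knot: 25 + 6.1*7 = 67.7
```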
Batch normalization in Deep NN and in ML
Batch normalization: doing preprocessing (i.e., normalizing inputs to zero mean and unit variance) at
every layer of the network, for every mini-batch (a minimal sketch appears at the end of this slide).
Why Normalization?
Input variable normalization in ML:
Example 1: in clustering, regression and SVMs, normalization makes sure a variable with larger values (often due to different
measurement units) does not overshadow the effect of a variable with smaller values.
Example 2: in NNs, a good learning rate depends on the input scaling: small-valued inputs typically require larger weights
and learning rates, while large-valued inputs need smaller learning rates; since a single learning rate is used, rescaling is
helpful.
Normalization is especially needed for deep NNs because they are easily ill-conditioned: a small perturbation in the initial
layers leads to a large change in the later layers.
Why Batch?
Covariate shift problem in ML:
the distribution of a variable differs between training and testing (e.g., market conditions change between training time and
testing time). Rescaling makes training and testing data comparable.
Benefits: easier weight initialization and learning-rate setup, providing faster optimization.
Note: since BN has a regularizing effect, it also means you can often remove dropout (which is helpful, as dropout usually slows
down training).
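A minimal batch-normalization sketch for one layer and one mini-batch (training-time statistics only; the running averages used at inference, and the learning of gamma/beta, are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features) activations for one layer
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learned scale and shift

x = np.random.randn(64, 128) * 5 + 3          # a poorly scaled mini-batch
out = batch_norm(x, gamma=np.ones(128), beta=np.zeros(128))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])   # approximately 0 and 1
```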
Case study 1: Healthcare image classification
Dataset: Kaggle cervical cancer screening images.
https://www.kaggle.com/c/intel-mobileodt-cervical-cancer-screening
Method A:
Traditional computer vision: visual bag of words with SIFT features + SVM (a sketch of this pipeline follows the result)
1. Extract SIFT features from each image.
2. Compute Kmeans over the entire set of SIFT features, extracted from the training set (i.e. construct vocabulary).
3. Compute the histogram of visual words for each image by assigning each SIFT feature in the image to its nearest cluster.
4. Feed each image histogram into SVM.
Validation data accuracy: 51%
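A hedged sketch of Method A (the vocabulary size, SVM kernel, and data loading used for the 51% result are not given; this assumes OpenCV ≥ 4.4 with SIFT and scikit-learn, and that `train_images`, `train_labels`, and `val_images` are provided elsewhere):

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def sift_descriptors(images):
    """SIFT descriptors per image (step 1); images are grayscale uint8 arrays."""
    sift = cv2.SIFT_create()
    descs = []
    for img in images:
        _, d = sift.detectAndCompute(img, None)
        descs.append(d if d is not None else np.empty((0, 128), dtype=np.float32))
    return descs

def bovw_histograms(descs, kmeans):
    """Normalized visual-word histogram per image (step 3)."""
    hists = []
    for d in descs:
        words = kmeans.predict(d) if len(d) else np.array([], dtype=int)
        hist, _ = np.histogram(words, bins=np.arange(kmeans.n_clusters + 1))
        hists.append(hist / max(hist.sum(), 1))
    return np.vstack(hists)

# Usage (train_images, train_labels, val_images assumed to be loaded elsewhere):
# train_descs = sift_descriptors(train_images)
# kmeans = KMeans(n_clusters=500, random_state=0).fit(np.vstack(train_descs))      # step 2: vocabulary
# svm = SVC(kernel="rbf").fit(bovw_histograms(train_descs, kmeans), train_labels)  # step 4
# val_preds = svm.predict(bovw_histograms(sift_descriptors(val_images), kmeans))
```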
Method B:
CNN using pre-trained VGG16 conv layers + 0.5 dropout
Validation data accuracy: 72%
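A hedged sketch of Method B in Keras (everything beyond "pre-trained VGG16 conv layers + 0.5 dropout" — head size, image size, optimizer — is an assumption; the competition has 3 cervix-type classes):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                       # reuse the pre-trained conv layers as-is

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),    # assumed head size
    layers.Dropout(0.5),                     # the 0.5 dropout from the slide
    layers.Dense(3, activation="softmax"),   # the competition's 3 cervix-type classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)    # image datasets assumed
```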
Case study 2: Movie recommendation
Dataset: MovieLens movie rating dataset.
https://movielens.org/
Method A:
State-of-the-art collaborative filtering algorithm (alternating least squares)
Validation data MSE: 0.831
Method B:
3 layer fully connected NN with dropout
Validation data MSE: 0.802
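A hedged sketch of what Method B might look like in Keras (the slide only says "3-layer fully connected NN with dropout"; the user/movie embeddings, layer widths, and dropout placement below are assumptions in the fast.ai style):

```python
from tensorflow.keras import layers, Model

n_users, n_movies, emb_dim = 10_000, 5_000, 50     # assumed sizes

user_in = layers.Input(shape=(1,))
movie_in = layers.Input(shape=(1,))
u = layers.Flatten()(layers.Embedding(n_users, emb_dim)(user_in))
m = layers.Flatten()(layers.Embedding(n_movies, emb_dim)(movie_in))

x = layers.Concatenate()([u, m])
x = layers.Dropout(0.5)(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(64, activation="relu")(x)
rating = layers.Dense(1)(x)                        # predicted rating

model = Model([user_in, movie_in], rating)
model.compile(optimizer="adam", loss="mse")        # MSE, as reported on the slide
# model.fit([user_ids, movie_ids], ratings, validation_split=0.1)   # data assumed
```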
Case study 3: document clustering
Word2Vec: the “fake” DL model.
Train a single-hidden-layer fully connected NN on the following task: given a specific word, the network tells
us, for every word in our vocabulary, the probability of it being a “nearby word”. Then take the
hidden-layer weight matrix as a way to reduce dimension.
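A hedged sketch of that featurization with gensim (assumes gensim 4.x; the window/dimension values match rows in the table on the next slide, and the toy corpus here stands in for the tokenized reviews):

```python
from gensim.models import Word2Vec

# Toy stand-in for the tokenized Amazon review corpus (assumed input in practice).
tokenized_reviews = [["great", "book", "about", "deep", "learning"],
                     ["boring", "book", "poor", "plot"]] * 100

w2v = Word2Vec(tokenized_reviews, vector_size=150, window=8, min_count=5, workers=1)
word_vector = w2v.wv["book"]     # a 150-dimensional word vector
# A document vector can then be formed, e.g., as the mean of its words' vectors.
```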
Case study 3: document clustering
NMI Featurization Clustering
0.412 TFIDF Hierarchical
0.366 w2v: window 8, dim 150 Kmeans
0.342 w2v: window 5, dim 100 Kmeans
0.324 w2v: window 8, dim 150 Hierarchical
0.321 w2v: window 5, dim 150 Hierarchical
0.258 TFIDF Kmeans
0.188 SVD: rank 100 Hierarchical
0.116 d2v: window 2, dim 20 Kmeans
0.095 SVD: rank 50 Hierarchical
0.083 d2v: window 2, dim 50 Kmeans
Dataset: cluster Amazon book reviews into their corresponding books (NMI is computed against the true books, as sketched below).
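The NMI scores in the table can be computed with scikit-learn once each featurization has been clustered (a sketch; the toy `doc_vectors` and `book_ids` below stand in for the real data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

# doc_vectors / book_ids would come from the featurizations above; toy stand-ins here.
book_ids = np.repeat(np.arange(10), 20)                       # 10 books, 20 reviews each
doc_vectors = np.random.rand(200, 150) + book_ids[:, None]    # crude separation by book

cluster_labels = KMeans(n_clusters=10, random_state=0).fit_predict(doc_vectors)
print(normalized_mutual_info_score(book_ids, cluster_labels))
```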
Automated DL in production (AI)?
Issue 1: Huge number of possible settings:
A. Large number of possible architectures: e.g., number of layers, number of filters, filter size, max-pooling size,
number of FC-layer neurons, whether and where to apply dropout and batch norm…
B. Large number of hyperparameters: learning rate and epochs (trial and error for where and when to change them),
random seed, mini-batch size, dropout percentage, early-stopping iterations, L1/L2 regularization, data
augmentation types, cost function, optimization method…
Issue 2: Many hyperparameters and methods are related:
e.g., many methods have a regularization effect: convolution, L2/L1, dropout, ReLU, batch norm, early stopping, data
augmentation, noise injection…
Issue 3: Generalization problems:
A. Need to re-tune for each different dataset.
B. Borrowing pre-trained conv layers can save a lot of training and tuning time, but they are hard to control and understand,
and easy to overfit.
Issue 4: Computation time and resources:
A. Long training time if training from scratch. E.g., Google's AutoML claims to auto-tune DL models, but reportedly needs on the
order of 800 GPUs running for a week.
B. Long debugging time.