1. DEEP CNN VS CONVENTIONAL ML Algorithms and Case Study
2. My learning path
Deep learning for coders: http://course.fast.ai/
Michael Nielsen : http://neuralnetworksanddeeplearning.com/
Stanford: http://cs231n.github.io/classification/
UCF computer vision class:
https://www.youtube.com/watch?v=715uLCHt4jE&list=PLd3hlSJsX_ImKP68wfKZJVIPTd8Ie5u-9
The deep learning book: http://www.deeplearningbook.org/
30 different blogs for detailed topics.
3. Deep CNN (Convolutional NN) vs Conventional Neural Network
CNN :
NN :
Architecture Differences:
In addition to fully connected layers and a final softmax layer:
1. Conv layers.
2. Max-pooling layers.
3. Number of hidden layers: from a few to a dozen.
Algorithm Differences:
In addition to SGD and backpropagation to train weights and biases:
1. Nonlinear activation: ReLU activation instead of Sigmoid/Tanh.
2. Regularization: Dropout.
3. Batch normalization.
4. Convolutional Layers: How it works?
Fully connected layer: every neuron in a layer is connected to every neuron in the adjacent layers.
Conv layer: each neuron in the hidden layer is connected only to a small region of the input neurons (its local receptive field). The transformation is defined by a filter, which we slide across the entire input image.
Figure 3: each hidden neuron has a bias and 5×5 weights (which define a filter) connected to its local receptive field. We slide the filter over by one pixel to the right (i.e., by one neuron) to connect to a second hidden neuron.
Weight Sharing:
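Use the same weights and bias (the filter, or kernel) for each of the 24×24 hidden neurons (a 5×5 filter moved one pixel at a time over a 28×28 input fits in 24×24 positions). All the neurons therefore detect exactly the same feature, but at different locations in the input image. A conv layer has many such filters (e.g., LeNet-5 used 6 different filters in its first conv layer).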
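A minimal NumPy sketch of the sliding filter and weight sharing described on this slide. The 28×28 random input, the random filter values and the function name `conv2d_valid` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0):
    """Slide one shared kernel over the image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each hidden neuron sees only its local receptive field,
            # but every neuron reuses the same weights and bias.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28))          # hypothetical 28x28 input image
kernel = rng.standard_normal((5, 5))  # one 5x5 filter: 25 shared weights
feature_map = conv2d_valid(image, kernel, bias=0.1)
print(feature_map.shape)              # (24, 24)
```

One filter produces one 24×24 feature map; a conv layer with several filters simply repeats this with different kernels.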
5. Convolution in CNN vs Convolution in traditional computer vision
Image filtering: compute a function of the local neighborhood at each position. It can be used to enhance images (denoise, resize, increase contrast, etc.) or to extract information from images (texture, edges, distinctive points, etc.).
Convolution in computer vision:
6. Convolution in CNN vs Convolution in traditional computer vision
Image filtering examples: shifting, averaging, detecting vertical edges, detecting horizontal edges.
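The four filtering examples can be sketched with hand-written 3×3 kernels. The kernels below are common textbook choices (a one-pixel shift, a box average, and Sobel-style edge detectors) and are assumptions; the slide's original figures may have used different values.

```python
import numpy as np

def filter2d(image, kernel):
    """Correlate image with kernel, zero-padding so the size is kept."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# Hand-designed kernels (assumed, standard textbook choices).
shift = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]], dtype=float)  # moves content one pixel left
average = np.full((3, 3), 1.0 / 9.0)                              # box blur
vertical_edge = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
horizontal_edge = vertical_edge.T

# Toy image: dark left half, bright right half, i.e. one vertical edge.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

shifted = filter2d(image, shift)
averaged = filter2d(image, average)
v_edges = filter2d(image, vertical_edge)    # responds strongly at the edge
h_edges = filter2d(image, horizontal_edge)  # zero away from the image border
```

In traditional computer vision these kernel values are fixed by hand; in a CNN the same operation is used but the values are learned.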
7. Convolution in CNN vs Convolution in traditional computer vision (why it’s better?)
Feature engineering in traditional computer vision: hand-designed filters such as Haar filters and Gabor filters.
E.g., Haar filters in face detection: a series of predefined simple filters used to classify faces.
Convolution in CNN:
Shared weights and biases are learned within the network through training; no feature engineering is needed.
Advantage 1: Feature generation and classification are tied together within one system.
8. Number of Layers in CNN vs in conventional NN (why it’s better?)
Figure 6: vgg16 architecture
(13 conv layers, 3 dense layers)
Advantage 2: The ability to learn hierarchies of concepts, building up
multiple layers of abstraction.
9. Number of Layers in CNN vs in conventional NN (why it’s better?)
Advantage 3: Easier to compute complex functions.
Example: designing a computer from scratch.
It is hard to do much with shallow circuits; even small tasks need multiple layers of assembly.
There are mathematical proofs showing that, for some functions, very shallow circuits require exponentially more circuit elements than deep circuits do.
10. Problems arose when making networks deep
A huge number of parameters leads to overfitting and long computation time.
Solutions:
1. Conv layers: local connectivity and shared weights.
E.g., a filter needs 5x5=25 shared weights plus a bias term to define it, so 20 such filters need 20x26=520 parameters to define a conv layer. A fully connected layer with 28x28=784 input neurons and 30 hidden neurons instead has 784x30 weights plus 30 biases, for a total of 23,550 parameters.
2. Max-pooling to shrink the feature maps, which reduces the number of parameters in later layers.
3. Regularization to prevent overfitting: L2/L1, dropout, early stopping, data augmentation, noise injection…
4. GPU.
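The parameter arithmetic in solution 1 can be checked with two small helper functions. The function names are hypothetical, and a single input channel is assumed, as in the slide's example.

```python
def conv_params(filter_h, filter_w, n_filters, in_channels=1):
    """Each filter has filter_h * filter_w * in_channels shared weights plus one bias."""
    return n_filters * (filter_h * filter_w * in_channels + 1)

def dense_params(n_inputs, n_hidden):
    """Every input connects to every hidden neuron, plus one bias per hidden neuron."""
    return n_inputs * n_hidden + n_hidden

print(conv_params(5, 5, 20))      # 520
print(dense_params(28 * 28, 30))  # 23550
```

The conv layer needs roughly 45 times fewer parameters here, and its count does not grow with the image size.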
Vanishing gradient problem:
Solution: ReLU activation.
Blowing up problem:
Solution: Batch normalization.
11. Max pooling in traditional computer vision
Spatial Pyramid Matching (SPM) for image compression:
• Each level in the pyramid is 1/4 of the size of the previous level.
• Resolution (dimension) decreases at each level from bottom to top.
• Higher-order representations introduce some invariances.
• Pooling methods: sum, max, random, histogram, Gaussian, Laplacian, L2-norm.
Max pooling in Conv NN:
• Max pooling is chosen because it is fast to compute.
• Pooling layers are usually used immediately after conv layers to produce a condensed feature map.
Intuition:
• Once a feature has been found, its exact location isn't as important as its rough location relative to other features.
• Max pooling is claimed to be part of our visual system: the so-called receptive fields act as max pooling over the sensory data obtained by the eyes.
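A minimal sketch of non-overlapping max pooling, assuming a 2×2 window (the slides do not state the window size). It also illustrates the intuition above: moving a detected feature by one pixel inside a pooling window does not change the pooled output.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    h_out = feature_map.shape[0] // size
    w_out = feature_map.shape[1] // size
    trimmed = feature_map[:h_out * size, :w_out * size]
    # Split into (h_out, size, w_out, size) blocks and take the max of each block.
    blocks = trimmed.reshape(h_out, size, w_out, size)
    return blocks.max(axis=(1, 3))

feature_map = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool2d(feature_map)  # block maxima: 5, 7, 13, 15

# Same feature, shifted by one pixel within the top-left pooling window.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0
same_after_pooling = np.array_equal(max_pool2d(a), max_pool2d(b))
```

A 4×4 map becomes 2×2, so any layer that follows sees a quarter as many inputs.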
12. Regularization methods in DL vs in conventional ML
Popular regularization methods in DL that are based on previously used ML/Statistics methods:
1. L2/L1 regularization: as in Ridge regression, Lasso, Elastic Net.
2. Data augmentation: translating, rotating, or scaling the original image. Used frequently in traditional computer vision. The idea is similar to bootstrapping in Statistics: sampling with replacement from the original samples.
3. Noise injection to the input dataset: for some models, adding noise with infinitesimal variance at the input of the model is equivalent to imposing a penalty on the norm of the weights (Bishop, 1995a,b).
4. Noise injection to weights (mainly for RNN): can be interpreted as a stochastic implementation of Bayesian inference over the weights. (Bayesian methods assume the parameters follow a certain probability distribution.)
5. Noise injection to output labels (label smoothing): assuming there are mistakes in y with probability α, replace the hard 0 and 1 targets with α/(k−1) and 1−α, where k is the number of output classes. Based on the maximum entropy principle. This strategy has been used since the 1980s.
6. Early stopping: record the validation error during training; the algorithm terminates when the validation error has not improved over the best recorded value for some pre-specified number of iterations (this treats the number of training steps as another hyperparameter). In the case of a simple linear model with a quadratic error function and simple gradient descent, early stopping is equivalent to L2 regularization. Similar to overfit monitoring in ML and convergence monitoring in Bayesian methods.
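Label smoothing (item 5) is easy to sketch: every wrong class gets α/(k−1) and the labelled class gets 1−α. The function name and the example values (k=3, α=0.1) are illustrative assumptions.

```python
import numpy as np

def smooth_labels(labels, n_classes, alpha=0.1):
    """Turn integer class labels into smoothed target distributions."""
    n = len(labels)
    # Every wrong class receives alpha / (k - 1) ...
    targets = np.full((n, n_classes), alpha / (n_classes - 1))
    # ... and the labelled class receives 1 - alpha.
    targets[np.arange(n), labels] = 1.0 - alpha
    return targets

labels = np.array([0, 2])
targets = smooth_labels(labels, n_classes=3, alpha=0.1)
# Each row still sums to 1, so it remains a valid probability distribution.
```

With these targets a softmax classifier is never pushed toward infinitely large logits, which is where the regularizing effect comes from.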
13. Regularization methods in DL vs in conventional ML
Popular regularization methods in DL (relatively new):
Dropout (Srivastava et al., 2014): an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks. Usually only applied to FC layers.
What does it do? Bagging by randomly destroying features.
Specifically, it randomly removes non-output units (by multiplying their output values by zero) from an underlying base network. Each time we load an example into a minibatch, we randomly sample a different binary mask to apply to all of the input and hidden units in the network, which is analogous to bagging with bootstrapped training data. Computationally, the ensemble prediction is approximated by multiplying the weights going out of unit i by the probability of including unit i.
Purpose of destruction:
This makes the model more robust to the loss of individual pieces of evidence, and thus less likely to rely on particular idiosyncrasies of the training data.
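A small sketch of the two dropout phases described above, for a single linear layer. The keep probability of 0.5, the layer sizes and the random values are assumptions. For a linear layer, averaging over many random masks matches the weight-scaled prediction; for deep nonlinear networks the weight scaling is only an approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5                  # probability of including a unit (assumed value)
h = rng.random(8)                # activations of 8 hidden units
W = rng.standard_normal((3, 8))  # weights going out of those units

def forward_train(h, W, rng):
    """Training: multiply each unit's output by a freshly sampled 0/1 mask."""
    mask = rng.random(h.shape) < keep_prob
    return W @ (h * mask)

def forward_test(h, W):
    """Inference: keep all units, scale outgoing weights by the keep probability."""
    return (W * keep_prob) @ h

# Average the predictions of many randomly thinned sub-networks.
masks = rng.random((20000, h.size)) < keep_prob
ensemble_mean = ((masks * h) @ W.T).mean(axis=0)
scaled = forward_test(h, W)      # close to ensemble_mean
```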
14. Regularization methods in DL vs in conventional ML
Another ML method that does bagging by randomly destroying features and bootstrapping inputs: Random Forest.
Differences:
1. In RF, each tree is trained to convergence on its respective training set. With dropout, most models are typically not explicitly trained at all. Instead, a tiny fraction of the possible sub-networks are each trained for a single step, and parameter sharing causes the remaining sub-networks to arrive at good settings of the parameters.
2. Dropout destroys extracted features rather than original values, which allows the destruction process to make use of all of the knowledge about the input distribution that the model has acquired so far.
What Random Forest does: it bootstraps the input dataset and selects a random subset of the features, to reduce the correlation between the trees and give weak features an opportunity to contribute.
Similarity to dropout: both bootstrap the data points and randomly subsample the features, following the same principle: a group of “weak learners” can come together to form a “strong learner”.
15. Activation function change from Sigmoid/Tanh to ReLU
Vanishing gradient problem for Deep NN using Sigmoid/Tanh:
The 1st layer bias gradient will usually be a factor of 16 (or more) smaller than the 3rd layer bias gradient.
Sigmoid function: σ(x) = 1 / (1 + e^(−x))
Example: the expression for the 1st layer bias gradient, compared with the expression for the gradient with respect to the 3rd layer bias
(C is the cost function, z is the weighted input to a neuron):
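The equation for this slide did not survive extraction. A reconstruction, assuming the setup from Michael Nielsen's book listed in the learning path: a chain of four layers with one neuron each, where w_j is the weight into neuron j, z_j its weighted input, b_j its bias and a_4 the output activation.

```latex
\frac{\partial C}{\partial b_1}
  = \sigma'(z_1)\, w_2\, \sigma'(z_2)\, w_3\,
    \underbrace{\sigma'(z_3)\, w_4\, \sigma'(z_4)\,
    \frac{\partial C}{\partial a_4}}_{\partial C / \partial b_3}
\qquad\Longrightarrow\qquad
\left|\frac{\partial C}{\partial b_1}\right|
  \le \frac{1}{16}\left|\frac{\partial C}{\partial b_3}\right|
  \quad \text{if } |w_2|, |w_3| \le 1 ,
```

because the sigmoid derivative satisfies σ'(z) ≤ 1/4, so the two extra factors σ'(z_1) and σ'(z_2) contribute at most 1/4 each.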
16. Activation function change from Sigmoid/Tanh to ReLU
ReLU: max(0, x)
Benefits:
1. No vanishing gradient problem for active units (ReLU has a constant gradient of 1 for x>0). (Krizhevsky et al. report a 6x improvement in convergence with the ReLU unit compared to the tanh unit.)
2. Simpler computation, which reduces training and evaluation time.
3. Introduces sparsity (the output is zero when x<0), which has an effect similar to dropout.
4. Can be used in Restricted Boltzmann machines to model real/integer valued outputs.
Drawbacks:
1. Blowing up: ReLU may amplify the signal inside the network more than softmax and sigmoid do, since there is no squashing.
Solution: dropout, batch-norm.
2. Dead units: if the learning rate is set too high, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron never activates on any data point again. If this happens, the gradient flowing through the unit will be zero forever from that point on.
Solution: careful learning rate setting.
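A rough numerical illustration of benefit 1 and drawback 2. It multiplies only the activation derivatives along a chain of layers and ignores the weights, which is a simplifying assumption; the depth of 10 layers is also an arbitrary choice.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    # Gradient is 1 for positive inputs and 0 otherwise.
    return (np.asarray(z) > 0).astype(float)

n_layers = 10

# Best case for the sigmoid: z = 0 in every layer, where the derivative peaks at 0.25.
sigmoid_chain = np.prod(sigmoid_grad(np.zeros(n_layers)))  # 0.25 ** 10, about 1e-6

# Active ReLU units pass the gradient through unchanged.
relu_chain = np.prod(relu_grad(np.ones(n_layers)))         # 1.0

# A "dead" unit whose input is always negative gets zero gradient.
dead_unit_grad = relu_grad(np.array([-3.0, -0.1]))
```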
17. ReLU’s brother used in ML
ReLU: max(0, x)
Hinge function:
Direct hinge: max(0, x-c)
Mirror hinge: max(0, c-x)
The MARS (Multivariate adaptive regression splines) algorithm uses hinge functions as basis functions to fit a regression and find non-linear relationships.
MARS with one variable: y = 25 + 6.1*max(0, x-13) - 3.1*max(0, 13-x)
[Figure: MARS with multiple variables and variable interactions]
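The one-variable MARS model on this slide can be evaluated directly. The function names and the sample x values are illustrative assumptions; the coefficients come from the slide.

```python
import numpy as np

def hinge(x, c):
    """Direct hinge: max(0, x - c)."""
    return np.maximum(0.0, x - c)

def mirror_hinge(x, c):
    """Mirror hinge: max(0, c - x)."""
    return np.maximum(0.0, c - x)

def mars_one_var(x):
    # Piecewise-linear fit with a single knot at x = 13.
    return 25.0 + 6.1 * hinge(x, 13.0) - 3.1 * mirror_hinge(x, 13.0)

x = np.array([10.0, 13.0, 15.0])
y = mars_one_var(x)  # approximately 15.7, 25.0, 37.2
```

Note that ReLU is the direct hinge with the knot fixed at c = 0.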
18. Batch normalization in Deep NN and in ML
Batch normalization: preprocessing (i.e., normalizing the inputs to zero mean and unit variance) applied at every layer of the network for every mini-batch.
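A minimal sketch of the training-time forward pass for one layer. The learnable scale `gamma` and shift `beta`, the small constant `eps`, and the example batch are assumptions that follow the usual formulation; they are not spelled out on the slide. The running statistics used at test time are omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then rescale and shift."""
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # 64 examples, 4 features
out = batch_norm_forward(batch, gamma=np.ones(4), beta=np.zeros(4))
# With gamma = 1 and beta = 0, each column of `out` has mean ~0 and std ~1.
```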
Why Normalization?
Input variable normalization in ML:
Example 1: in clustering, regression and SVM, normalization makes sure that a variable with larger values (often due to different measurement units) does not overshadow the effect of a variable with smaller values.
Example 2: in NN, a good learning rate depends on the input scaling: small-valued inputs typically require larger weights and a larger learning rate, while large-valued inputs need a smaller learning rate. Because a single learning rate is used, rescaling is helpful.
Normalization is especially needed for deep NN because they easily become ill-conditioned, i.e., a small perturbation in the initial layers leads to a large change in the later layers.
Why Batch?
Covariate shift problem in ML:
The distributions of the variables differ between training and testing (e.g., market conditions change between training time and testing time). Rescaling makes the training and testing data comparable.
Benefits: easier weight initialization and learning rate setup, which gives faster optimization.
Note: Since BN has a regularizing effect, you can often remove dropout (which is helpful, as dropout usually slows down training).
19. Case study 1: Healthcare image classification
Dataset: Kaggle cervical cancer screening images.
https://www.kaggle.com/c/intel-mobileodt-cervical-cancer-screening
Method A:
Traditional computer vision: visual bag of words with SIFT features + SVM
1. Extract SIFT features from each image.
2. Run K-means over the entire set of SIFT features extracted from the training set (i.e., construct the vocabulary).
3. Compute the histogram of features for each image by assigning each SIFT feature in the image to its nearest cluster.
4. Feed each image histogram into an SVM.
Validation data accuracy: 51%
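Steps 2 and 3 of Method A can be sketched in plain NumPy. Random 128-dimensional vectors stand in for real SIFT descriptors, and the vocabulary size of 8, the simple K-means loop and all function names are assumptions for illustration; the actual case study would use real SIFT extraction (step 1) and an SVM (step 4), which are not shown.

```python
import numpy as np

def assign_clusters(descriptors, centroids):
    """Index of the nearest centroid (visual word) for every descriptor."""
    diffs = descriptors[:, None, :] - centroids[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)

def kmeans(descriptors, k, n_iter=10, seed=0):
    """Step 2: build the visual vocabulary with a basic K-means loop."""
    rng = np.random.default_rng(seed)
    centroids = descriptors[rng.choice(len(descriptors), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign_clusters(descriptors, centroids)
        for c in range(k):
            members = descriptors[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids

def bow_histogram(descriptors, centroids):
    """Step 3: normalized histogram of visual words for one image."""
    labels = assign_clusters(descriptors, centroids)
    hist = np.bincount(labels, minlength=len(centroids)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(1)
# Stand-in for step 1: three images with 40, 55 and 30 descriptors each.
image_descriptors = [rng.random((n, 128)) for n in (40, 55, 30)]
vocabulary = kmeans(np.vstack(image_descriptors), k=8)
features = np.array([bow_histogram(d, vocabulary) for d in image_descriptors])
# `features` has one fixed-length row per image, ready for a classifier.
```

The histogram gives every image the same feature length regardless of how many descriptors it has, which is what lets a standard classifier such as an SVM consume it.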
Method B:
CNN using pre-trained VGG16 conv layers + 0.5 dropout
Validation data accuracy: 72%
20. Case study 2: Movie recommendation
Dataset: MovieLens movie rating dataset.
https://movielens.org/
Method A:
State-of-the-art collaborative filtering algorithm (alternating least squares)
Validation data MSE: 0.831
Method B:
3 layer fully connected NN with dropout
Validation data MSE: 0.802
21. Case study 3: document clustering
Word2Vec: The fake DL model.
Train a single-hidden-layer fully connected NN on a task: given a specific word, the network tells us, for every word in our vocabulary, the probability of it being the “nearby word” that we chose. Then take the hidden layer weight matrix as a way to reduce dimension: each row is a low-dimensional vector for one word.
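A forward-pass sketch of the network described above, with untrained random weights. The toy vocabulary, the embedding size of 3 and the variable names are assumptions; training (and the sampling tricks real Word2Vec uses) is omitted.

```python
import numpy as np

vocab = ["deep", "learning", "neural", "network", "book"]  # toy vocabulary
V, D = len(vocab), 3                                        # vocabulary size, hidden size

rng = np.random.default_rng(0)
W_in = rng.standard_normal((V, D)) * 0.1   # hidden layer weights: one row per word
W_out = rng.standard_normal((D, V)) * 0.1  # output layer weights

def nearby_word_probs(word_idx):
    """Probability, for every vocabulary word, of being a 'nearby word'."""
    one_hot = np.zeros(V)
    one_hot[word_idx] = 1.0
    hidden = one_hot @ W_in                # identical to looking up row W_in[word_idx]
    scores = hidden @ W_out
    exp = np.exp(scores - scores.max())    # numerically stable softmax
    return exp / exp.sum()

idx = vocab.index("neural")
probs = nearby_word_probs(idx)
word_vector = W_in[idx]                    # the D-dimensional representation kept after training
```

Because the input is one-hot, the hidden layer is just a table lookup, which is why the weight matrix itself serves as the word embedding.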
22. Case study 3: document clustering
NMI    Featurization            Clustering
0.412  TFIDF                    Hierarchical
0.366  w2v: window 8, dim 150   Kmeans
0.342  w2v: window 5, dim 100   Kmeans
0.324  w2v: window 8, dim 150   Hierarchical
0.321  w2v: window 5, dim 150   Hierarchical
0.258  TFIDF                    Kmeans
0.188  SVD: rank 100            Hierarchical
0.116  d2v: window 2, dim 20    Kmeans
0.095  SVD: rank 50             Hierarchical
0.083  d2v: window 2, dim 50    Kmeans
Dataset: cluster Amazon book reviews into their corresponding books.
23. Automated DL in production (AI)?
Issue 1: Huge number of possible settings:
A. Large number of possible architectures: e.g., number of layers, number of filters, filter size, max pooling dim, number of FC layer neurons, whether and where to apply dropout and batch-norm…
B. Large number of hyperparameters: learning rate and epochs (trial and error for where and when to change them), random seed, mini-batch size, dropout percentage, early stop iteration, L1/L2 regularization, data augmentation types, cost function, optimization method…
Issue 2: Many hyperparameters and methods are related:
E.g., many methods have a regularization effect: convolution, L2/L1, dropout, ReLU, batch-norm, early stopping, data augmentation, noise injection…
Issue 3: Generalization problem:
A. Need to re-tune for each different dataset.
B. Borrowing pre-trained conv layers can save a lot of training and tuning time, but they are hard to control and understand, and easy to overfit.
Issue 4: Computation time and resources:
A. Long training time if training from scratch. E.g., Google’s AutoML claims to auto-tune DL models, but needs 800G GPU to run for a week.
B. Long debugging time.