2. What is Deep Learning?
Deep learning is a set of algorithms in machine learning that attempt to
model high-level abstractions in data by using architectures composed of
multiple non-linear transformations.
• Multiple-Layer Deep Neural Networks
• Works Well for Media and Unstructured Data
• Automatic Feature Engineering
• Complex Architectures; Computationally Intensive
7. Historical Background: First Generation ANN
• Perceptron (~1960) used a layer of hand-coded features and tried to recognize objects by learning how to weight these features.
– There was a neat learning algorithm for adjusting the weights.
– But perceptron nodes are fundamentally limited in what they can learn to do.
[Figure: sketch of a typical perceptron from the 1960s — the input features feed a layer of non-adaptive, hand-coded features whose weighted sum produces the output class labels (here "Bomb" vs. "Toy").]
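The learning rule itself fits in a few lines. A minimal sketch, assuming NumPy; the hand-coded features and the toy data below are invented purely for illustration:

# A minimal sketch of a 1960s-style perceptron, assuming NumPy; the hand-coded
# features and the toy data are illustrative only.
import numpy as np

def hand_coded_features(x):
    # Non-adaptive feature layer: the designer fixes these, only the weights are learned.
    return np.array([x[0], x[1], x[0] * x[1], 1.0])   # last entry acts as a bias feature

X = np.array([[0.2, 0.9], [0.8, 0.1], [0.9, 0.8], [0.1, 0.2]])  # toy inputs
y = np.array([1, 1, -1, -1])                                    # toy class labels (+1 / -1)

w = np.zeros(4)
for _ in range(20):                       # the classic perceptron learning rule
    for x, t in zip(X, y):
        f = hand_coded_features(x)
        if np.sign(w @ f) != t:           # on a mistake, move the weights toward the target
            w += t * f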
8. Multiple Layer Perceptron ANN (1960~1985)
[Figure: a multi-layer perceptron — an input vector feeds one or more hidden layers, which feed the outputs. The outputs are compared with the correct answer to get an error signal, and that error signal is back-propagated to get the derivatives for learning.]
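A minimal sketch of that forward/compare/back-propagate loop, assuming NumPy; the tiny two-layer network, the XOR-style toy data and the learning rate are illustrative choices, not anything prescribed by the slides:

# A minimal hand-written backpropagation loop for a two-layer MLP, assuming NumPy.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy inputs
Y = np.array([[0], [1], [1], [0]], dtype=float)               # correct answers (XOR)

W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)     # input -> hidden layer
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)     # hidden layer -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    # Forward pass: input vector -> hidden layer -> outputs.
    H = sigmoid(X @ W1 + b1)
    out = sigmoid(H @ W2 + b2)
    # Compare outputs with the correct answer to get an error signal.
    err = out - Y
    # Back-propagate the error signal to get derivatives for learning.
    d_out = err * out * (1 - out)
    d_H = (d_out @ W2.T) * H * (1 - H)
    W2 -= 0.5 * H.T @ d_out; b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_H;   b1 -= 0.5 * d_H.sum(axis=0)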
10. Back Propagation Algorithm
Advantages
• A multi-layer perceptron network can be trained by the back propagation algorithm to perform any mapping between the input and the output.
Disadvantages
• It requires labeled training data, yet almost all data is unlabeled.
• The learning time does not scale well: it is very slow in networks with multiple hidden layers.
• It can get stuck in poor local optima.
11. Support Vector Machines
• Vapnik and his co-workers developed a very clever type of
perceptron called a Support Vector Machine.
o Instead of hand-coding the layer of non-adaptive features, each
training example is used to create a new feature using a fixed
recipe.
• The feature computes how similar a test example is to that
training example.
o Then a clever optimization technique is used to select the best
subset of the features and to decide how to weight each feature
when classifying a test case.
• But it is just a perceptron, and it has all the same limitations.
• In the 1990s, many researchers abandoned neural networks with multiple adaptive hidden layers because Support Vector Machines worked better.
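A minimal sketch of that idea, assuming scikit-learn: with an RBF kernel, every training example effectively contributes a similarity feature, and the optimizer keeps a weighted subset of them (the support vectors). The two-moons toy data is illustrative:

# A minimal RBF-kernel SVM, assuming scikit-learn; data and hyperparameters are illustrative.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# Only the selected training examples (support vectors) end up defining the classifier.
print(len(clf.support_), "of", len(X), "training examples kept as support vectors")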
26. Greedy Layer-Wise Training
• Train first layer using your data without the labels (unsupervised)
Since there are no targets at this level, labels don't help.
Could also use the more abundant unlabeled data which is
not part of the training set (i.e. self-taught learning).
• Then freeze the first layer parameters and start training the
second layer using the output of the first layer as the
unsupervised input to the second layer
• Repeat this for as many layers as desired
This builds our set of robust features
• Use the outputs of the final layer as inputs to a supervised layer/model and train the last supervised layer(s), leaving the earlier weights frozen
• Unfreeze all weights and fine-tune the full network with a supervised approach, starting from the pre-trained weight settings
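A minimal sketch of the whole procedure, assuming PyTorch; the layer sizes, optimizer settings, iteration counts and stand-in data are illustrative, and the unsupervised step here uses simple auto-encoders (one common choice, introduced in later slides):

# Greedy layer-wise pre-training with stacked auto-encoders, then supervised fine-tuning.
import torch
import torch.nn as nn

layer_sizes = [784, 256, 64]          # e.g. flattened 28x28 inputs -> two feature layers
unlabeled = torch.rand(1024, 784)     # stand-in for (possibly unlabeled) training data
labeled_x = torch.rand(256, 784)      # stand-in for the labeled subset
labeled_y = torch.randint(0, 10, (256,))

encoders = []
inputs = unlabeled
for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
    dec = nn.Linear(d_out, d_in)                       # decoder is dropped after training
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(50):                                # unsupervised reconstruction training
        recon = dec(enc(inputs))
        loss = nn.functional.mse_loss(recon, inputs)
        opt.zero_grad(); loss.backward(); opt.step()
    for p in enc.parameters():                         # freeze this layer before stacking the next
        p.requires_grad_(False)
    with torch.no_grad():
        inputs = enc(inputs)                           # its outputs become the next layer's input
    encoders.append(enc)

# Train a supervised output layer on the final features (early weights still frozen).
classifier = nn.Linear(layer_sizes[-1], 10)
net = nn.Sequential(*encoders, classifier)
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
for _ in range(50):
    loss = nn.functional.cross_entropy(net(labeled_x), labeled_y)
    opt.zero_grad(); loss.backward(); opt.step()

# Finally unfreeze everything and fine-tune the whole network with supervision.
for p in net.parameters():
    p.requires_grad_(True)
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
for _ in range(50):
    loss = nn.functional.cross_entropy(net(labeled_x), labeled_y)
    opt.zero_grad(); loss.backward(); opt.step()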
28. Benefit of Greedy Layer-Wise Training
• Greedy layer-wise training avoids many of the problems of
trying to train a deep net in a supervised fashion
o Each layer gets full learning focus in its turn since it is the
only current "top" layer
o Can take advantage of unlabeled data
o When you finally tune the entire network with supervised training, the network weights have already been adjusted so that you are in a good error basin and only need fine-tuning.
This helps with problems of
• Ineffective early layer learning
• Deep network local minima
• Two most common approaches
o Stacked Auto-Encoders
o Deep Belief Networks
30. What Can an Auto-Encoder Do?
• A type of unsupervised learning which tries to discover generic features of the data
o Learns the identity function by capturing important sub-features, not by just passing the data through
o Can use just the new features in the new training set, or concatenate them with the original features
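Stated as an objective (notation here is mine, not from the slides): for an encoder f and a decoder g, the auto-encoder minimises the reconstruction error
$$\min_{f,\,g}\; \sum_{x} \left\lVert x - g(f(x)) \right\rVert^{2}$$
so that the hidden code f(x) has to capture the important sub-features of x rather than copying it through unchanged.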
34. Stacked Auto-Encoders Approach
• Stack many sparse auto-encoders in succession and train them using greedy
layer-wise training
• Drop the decode output layer each time
• Do supervised training on the last layer using final features
• Finally do supervised training on the entire network to fine-tune all weights
35. What Are Sparse Encoders?
• Auto encoders will often do a dimensionality reduction
o PCA-like or non-linear dimensionality reduction
• This leads to a "dense" representation which is nice in terms of
parsimony
o All features typically have non-zero values for any input and the
combination of values contains the compressed information
• However, this distributed and entangled representation can often
make it more difficult for successive layers to pick out the salient
features
• A sparse representation uses more features where at any given time
a significant number of the features will have a 0 value
o This leads to more localist variable length encodings where a
particular node (or small group of nodes) with value 1 signifies the
presence of a feature (small set of bases)
o A type of simplicity bottleneck (regularizer)
o This is easier for subsequent layers to use for learning
36. Implementation of Sparse Auto-Encoder
• Use more hidden nodes in the encoder
• Use regularization techniques which encourage
sparseness e.g. a significant portion of nodes have 0
output for any given input
o Penalty in the learning function for non-zero nodes
with weight decay
• De-noising Auto-Encoder
o Stochastically corrupt training instance each time, but
still train auto-encoder to decode the uncorrupted
instance, forcing it to learn conditional dependencies
within the instance
o Better empirical results, handles missing values well
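A minimal sketch combining the two ideas, assuming PyTorch; the over-complete layer size, the L1 penalty on the hidden code (one common way to encourage mostly-zero activations) and the 30% masking noise are illustrative choices:

# A sparse, de-noising auto-encoder sketch, assuming PyTorch; sizes and penalties are illustrative.
import torch
import torch.nn as nn

x = torch.rand(512, 784)                                 # stand-in training data
enc = nn.Sequential(nn.Linear(784, 1024), nn.ReLU())     # over-complete: more hidden nodes
dec = nn.Linear(1024, 784)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for _ in range(100):
    corrupted = x * (torch.rand_like(x) > 0.3)   # de-noising: randomly zero out 30% of inputs
    code = enc(corrupted)
    recon = dec(code)
    loss = nn.functional.mse_loss(recon, x)      # still decode the *uncorrupted* instance
    loss = loss + 1e-3 * code.abs().mean()       # penalty on non-zero hidden activations
    opt.zero_grad(); loss.backward(); opt.step()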
37. General Belief Nets
• A belief net is a directed acyclic graph composed of stochastic variables.
• We would like to solve two problems:
– The inference problem: infer the states of the unobserved variables.
– The learning problem: adjust the interactions between variables to make the network more likely to generate the observed data.
[Figure: a belief net — layers of stochastic hidden causes with directed connections down to visible effects.]
We use nets composed of layers of stochastic binary variables with weighted connections, though other types of variables can be generalized as well.
38. Stochastic Binary Units
(Bernoulli Variables)
• Variables with state 1 or 0.
• The probability of turning on is determined by the weighted input from other units (plus a bias):
$$p(s_i = 1) = \frac{1}{1 + \exp\!\left(-b_i - \sum_j s_j\, w_{ji}\right)}$$
[Figure: the logistic curve of $p(s_i = 1)$ plotted against the total input $b_i + \sum_j s_j w_{ji}$, rising from 0 to 1.]
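A minimal sketch of sampling one such unit, assuming NumPy; the bias, weights and parent states are arbitrary numbers chosen for illustration:

# Sample one stochastic binary (Bernoulli) unit from its parents, assuming NumPy.
import numpy as np

b_i = -1.0                                  # bias of unit i
w_ji = np.array([2.0, -0.5, 1.5])           # weights from parent units j to unit i
s_j = np.array([1, 0, 1])                   # binary states of the parents

p_on = 1.0 / (1.0 + np.exp(-(b_i + s_j @ w_ji)))    # probability of turning on
s_i = np.random.rand() < p_on                        # sample the binary state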
39. Learning Rule for Sigmoid Belief Nets
• Learning is easy if we can get an unbiased sample from the posterior distribution over hidden states given the observed data.
• For each unit, maximize the log probability that its binary state in the sample from the posterior would be generated by the sampled binary states of its parents.
$$p_i \equiv p(s_i = 1) = \frac{1}{1 + \exp\!\left(-\sum_j s_j\, w_{ji}\right)}, \qquad \Delta w_{ji} = \varepsilon\, s_j\, (s_i - p_i)$$
where $s_j$ is the sampled binary state of parent $j$, $s_i$ the sampled state of unit $i$, and $\varepsilon$ the learning rate.
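A minimal sketch of this update, assuming NumPy and a single unit i with three parents; the states, weights and learning rate are illustrative:

# One step of the sigmoid belief net learning rule, assuming NumPy.
import numpy as np

eps = 0.1                                   # learning rate
w_ji = np.array([0.5, -0.2, 0.1])           # weights from parents j to unit i
s_j = np.array([1, 1, 0])                   # sampled binary states of the parents
s_i = 1                                     # sampled binary state of unit i (from the posterior)

p_i = 1.0 / (1.0 + np.exp(-(s_j @ w_ji)))   # probability the parents would turn unit i on
w_ji += eps * s_j * (s_i - p_i)             # raise the log prob of the sampled state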
40. Problems with Deep Belief Nets
Since DBNs are directed graphical models, the posterior over the hidden units given input data is intractable due to the "explaining away" effect: even if two hidden causes are independent a priori, they can become dependent once we observe an effect that they can both influence.
Solution: complementary priors ensure that the posterior over the hidden units still factorizes into independent terms.
[Figure: "General Deep Belief Nets" / "Explaining Away Effect" — two hidden causes, "truck hits house" and "earthquake", each connected with weight 20 to the visible effect "house jumps"; the biases are -10 on the two causes and -20 on the effect. The posterior over the two causes given the observed effect is p(1,1)=.0001, p(1,0)=.4999, p(0,1)=.4999, p(0,0)=.0001.]
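A minimal sketch that reproduces the explaining-away effect with the figure's numbers, assuming NumPy; it enumerates the four joint states of the two causes given that the effect ("house jumps") is observed:

# Enumerate the posterior over the two hidden causes given the observed effect, assuming NumPy.
import numpy as np
from itertools import product

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
bias_cause, bias_effect, weight = -10.0, -20.0, 20.0

post = {}
for c1, c2 in product([0, 1], repeat=2):
    prior = sigmoid(bias_cause) ** (c1 + c2) * (1 - sigmoid(bias_cause)) ** (2 - c1 - c2)
    likelihood = sigmoid(bias_effect + weight * (c1 + c2))    # p(effect = 1 | c1, c2)
    post[(c1, c2)] = prior * likelihood
total = sum(post.values())
post = {k: v / total for k, v in post.items()}   # (1,0) and (0,1) dominate; (1,1) is tiny

print(post)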
41. Complementary Priors
Definition of complementary priors:
Consider observations x and hidden variables y. For a given likelihood function P(x|y), a prior P(y) is called a complementary prior of P(x|y) if the joint P(x,y) = P(x|y)P(y) leads to a posterior P(y|x) that factorizes into independent terms.
Infinite directed model with tied weights, complementary priors and Gibbs sampling:
Recall that RBMs have the property
$$P(h \mid v) = \prod_{j=1}^{n} P(h_j \mid v), \qquad P(v \mid h) = \prod_{i=1}^{m} P(v_i \mid h)$$
The definition of the RBM's energy function makes it a model with two sets of conditional independencies (complementary priors for both v and h).
Since we need to estimate the distribution of the data, P(v), we can perform Gibbs sampling alternately from P(v,h) for infinitely many steps. This procedure is analogous to unrolling the single RBM into an infinite directed stack of RBMs with tied weights (due to the "complementary priors"), where each RBM takes its input from the hidden layer of the RBM below.
42. Restricted Boltzmann Machines
• Restrict the connectivity to make
learning easier.
Only one layer of hidden units
No connections between hidden units.
• The hidden units are conditionally
independent given the visible states.
Quickly get an unbiased sample from
the posterior distribution when given a
data-vector, which is a big advantage
over directed belief nets
[Figure: an RBM — a layer of hidden units j above a layer of visible units i, with connections only between the two layers.]
43. Energy of A Joint Configuration
$$E(v,h) = -\sum_{i,j} v_i\, h_j\, w_{ij}$$
where $E(v,h)$ is the energy with configuration v on the visible units and h on the hidden units, $v_i$ is the binary state of visible unit i, $h_j$ is the binary state of hidden unit j, and $w_{ij}$ is the weight between units i and j. It follows that
$$-\frac{\partial E(v,h)}{\partial w_{ij}} = v_i\, h_j$$
44. Weights, Energies and Probabilities
• Each possible joint configuration of the visible and hidden
units has an energy
The energy is determined by the weights and biases as in
a Hopfield net.
• The energy of a joint configuration of the visible and hidden
units determines its probability:
• The probability of a configuration over the visible units is
found by summing the probabilities of all the joint
configurations that contain it.
$$p(v,h) \propto e^{-E(v,h)}$$
45. Using Energies to Define Probabilities
• The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations.
• The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it.
$$p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}, \qquad p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$$
where the denominator $\sum_{u,g} e^{-E(u,g)}$ is the partition function.
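A minimal sketch that makes these definitions concrete, assuming NumPy: for a tiny RBM (two visible and two hidden units, no biases, illustrative weights) it enumerates every joint configuration, computes the energies, and normalises by the partition function:

# Exact probabilities for a tiny RBM by brute-force enumeration, assuming NumPy.
import numpy as np
from itertools import product

W = np.array([[1.0, -0.5],
              [0.3,  0.8]])                  # w_ij between visible unit i and hidden unit j

def energy(v, h):
    return -np.asarray(v) @ W @ np.asarray(h)

states = list(product([0, 1], repeat=2))
Z = sum(np.exp(-energy(v, h)) for v in states for h in states)   # partition function

def p_joint(v, h):
    return np.exp(-energy(v, h)) / Z

def p_visible(v):
    return sum(p_joint(v, h) for h in states)   # sum out the hidden units

print(p_visible((1, 0)))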
46. Maximum Likelihood RBM Learning Algorithm
Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.
[Figure: the alternating Gibbs chain over visible units i and hidden units j at t = 0, 1, 2, ..., t = ∞; $\langle v_i h_j\rangle^{0}$ is measured at t = 0 and $\langle v_i h_j\rangle^{\infty}$ at t = ∞, where the chain produces a "fantasy".]
The maximum-likelihood gradient is
$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j\rangle^{0} - \langle v_i h_j\rangle^{\infty}$$
47. A Quick Way to Learn an RBM
[Figure: one step of alternating Gibbs sampling between visible units i and hidden units j, from the data at t = 0 to the reconstruction at t = 1; $\langle v_i h_j\rangle^{0}$ is measured on the data and $\langle v_i h_j\rangle^{1}$ on the reconstruction.]
$$\Delta w_{ij} = \varepsilon\left(\langle v_i h_j\rangle^{0} - \langle v_i h_j\rangle^{1}\right)$$
• Start with a training vector on
the visible units.
• Update all the hidden units in
parallel
• Update all the visible units in parallel to get a "reconstruction".
• Update the hidden units again.
Contrastive divergence: This is not following the gradient of the
log likelihood. But it works well. It is approximately following the
gradient of another objective function.
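A minimal sketch of CD-1 training for a binary RBM, assuming NumPy; the sizes, learning rate and stand-in binary data are illustrative, and biases are omitted to match the simplified energy used in the earlier slides:

# Contrastive divergence (CD-1) training of a binary RBM, assuming NumPy.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

data = (rng.random((100, 6)) > 0.5).astype(float)   # stand-in binary training vectors
W = 0.01 * rng.standard_normal((6, 3))              # 6 visible units, 3 hidden units
eps = 0.1

for _ in range(200):
    v0 = data
    # Update all the hidden units in parallel given the data.
    ph0 = sigmoid(v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Update all the visible units in parallel to get a "reconstruction".
    pv1 = sigmoid(h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    # Update the hidden units again.
    ph1 = sigmoid(v1 @ W)
    # Contrastive divergence weight update: <v_i h_j>^0 - <v_i h_j>^1.
    W += eps * (v0.T @ ph0 - v1.T @ ph1) / len(data)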