6. Success or failure of an object recognition algorithm hinges on the features used
[Diagram: Input → Feature representation → Classifier → Label (e.g., Human / Background, or digit 0/1/2/3/…); the classifier is obtained by learning. Our focus is the feature representation.]
12. Related Work
• Convolutional Neural Network (CNN) [LeCun et al. '98] [Ranzato et al. CVPR'07]
– Filtering layers are bundled with a classifier, and all the layers are learned together using error backpropagation.
– Does not perform well on natural images
• Biologically plausible models [Serre et al. PAMI'07] [Mutch and Lowe CVPR'06]
– Hand-crafted first layer vs. randomly selected prototypes for the second layer
13. Related Work (cont’d)
• Deep Belief Net [Hinton et al., NC'2006]
– A two-layer, partially observed MRF, called the RBM, is the building block
– Learning is performed unsupervised, layer by layer, from the bottom layer upwards
• Our contributions: We incorporate spatial locality into RBMs and adapt the learning algorithm accordingly
• We add further components, such as pooling and sparsity, to deep belief nets
14. Why Generative & Unsupervised
• Discriminative learning of deep and large
neural networks has not been successful
– Requires large training sets
– Over-fits easily for large models
– First layer gradients are relatively small
• Alternative hybrid approach
– Learn a large set of first layer features generatively
– Switch to a discriminative model to select the
discriminative features from those that are learned
– Discriminative fine-tuning is helpful
16. CRBM
• The image is the visible layer; the hidden layer corresponds to filter responses
• An energy-based probabilistic model
The energy and joint distribution are

$$E(V, H; W) = -\sum_k H^k \bullet \mathrm{Filter}(V, W^k), \qquad P(V, H; W) = \frac{1}{Z} \exp\bigl(-E(V, H; W)\bigr)$$

where $\bullet$ denotes the dot product of vectorized matrices and $\mathrm{Filter}(V, W^k)$ is the response of filtering the visible layer $V$ with the $k$-th filter $W^k$.
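A minimal NumPy/SciPy sketch of this energy, assuming 2-D arrays for the image, the hidden maps, and the filters, and taking Filter(V, W^k) to be a valid cross-correlation (function and variable names here are illustrative, not from the talk):

```python
import numpy as np
from scipy.signal import correlate2d

def crbm_energy(V, H, W):
    """Energy of a CRBM: E(V, H; W) = -sum_k H^k . Filter(V, W^k).

    V : (n, n) visible image
    H : (K, m, m) hidden maps, one per filter (m = n - w + 1)
    W : (K, w, w) filters
    """
    E = 0.0
    for k in range(W.shape[0]):
        # Filter(V, W^k): valid cross-correlation of the image with filter k
        response = correlate2d(V, W[k], mode="valid")
        # dot product of the vectorized hidden map and the response map
        E -= np.sum(H[k] * response)
    return E
```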
17. Training CRBMs
• Maximum likelihood learning of CRBMs is difficult
• Contrastive Divergence (CD) learning is applicable
• For CD learning we need to compute the conditionals $P(H \mid V)$ and $P(V \mid H)$.
[Diagram: one step of Gibbs sampling, from the data $V$ to a sample.]
18. CRBM (Backward)
• Nearby hidden variables
cooperate in reconstruction
• Conditional probabilities take the form
$$P(H^k \mid V) = \sigma\bigl(\mathrm{Filter}(V, W^k)\bigr), \qquad P(V \mid H) = \sigma\Bigl(\sum_k \mathrm{Filter}(H^k, \tilde{W}^k)\Bigr)$$

where $\sigma(x) = \dfrac{1}{1 + \exp(-x)}$ is applied element-wise and $\tilde{W}^k$ denotes the filter $W^k$ flipped in both directions.
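Under the same assumptions as the earlier sketch, these conditionals can be computed with a cross-correlation for Filter(V, W^k) and a full convolution (which flips the filter) for the reconstruction term that sums the contributions of nearby hidden units:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_hidden_given_visible(V, W):
    """P(H^k = 1 | V) = sigmoid(Filter(V, W^k)) for each filter k."""
    return np.stack([sigmoid(correlate2d(V, Wk, mode="valid")) for Wk in W])

def p_visible_given_hidden(H, W):
    """P(V = 1 | H) = sigmoid(sum_k Filter(H^k, ~W^k)).

    Nearby hidden units cooperate in reconstruction: each visible unit
    receives input from every hidden unit whose receptive field covers it.
    """
    total = sum(convolve2d(H[k], W[k], mode="full") for k in range(W.shape[0]))
    return sigmoid(total)
```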
19. Learning the Hierarchy
• The structure is trained bottom-up and layer-wise
• A CRBM is used to train each filtering layer
• Filtering layers are followed by down-sampling
[Architecture diagram: repeated (Filtering → Non-linearity → Pooling) layers, each filtering layer trained as a CRBM; pooling reduces the dimensionality; a classifier sits on top.]
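A self-contained sketch of the forward pass through such a hierarchy, assuming each layer is simply a bank of 2-D filters followed by a sigmoid non-linearity and max pooling (training of the filters by CD is described on later slides; channel handling is simplified here):

```python
import numpy as np
from scipy.signal import correlate2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def max_pool(response, p):
    """Down-sample a 2-D response map by taking the max over p x p blocks."""
    h, w = response.shape
    r = response[:h - h % p, :w - w % p]
    return r.reshape(h // p, p, w // p, p).max(axis=(1, 3))

def forward(image, layers, pool_size=2):
    """Apply (filtering -> non-linearity -> pooling) for each trained layer.

    `layers` is a list of filter banks, one (K, w, w) array per CRBM layer;
    multi-channel inputs are summed over channels for simplicity.
    """
    maps = [image]                      # treat the image as a 1-channel input
    for filters in layers:
        responses = [sigmoid(sum(correlate2d(m, f, mode="valid") for m in maps))
                     for f in filters]  # filtering + non-linearity
        maps = [max_pool(r, pool_size) for r in responses]  # reduce dimensionality
    return maps                         # final feature maps fed to the classifier
```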
22. Evaluation
MNIST digit dataset
• Training set: 60,000 images of digits of size 28×28
• Test set: 10,000 images
INRIA person dataset
• Training set: 2,416 person windows of size 128×64 pixels and 4.5×10^6 negative windows
• Test set: 1,132 positive and 2×10^6 negative windows
23. First layer filters
• INRIA positive set (gray-scale images): 15 filters of size 7×7
• MNIST unlabeled digits: 15 filters of size 5×5
24. Second Layer Features (MNIST)
• The filters themselves are hard to visualize
• Instead, we show patches that responded strongly to each filter:
33. INRIA Results
• Adding our large-scale features significantly
improves performance of the baseline (HOG)
34. Conclusion
• We extended the RBM model to Convolutional
RBM, useful for domains with spatial locality
• We exploited CRBMs to train local hierarchical
feature detectors one layer at a time and
generatively
• This method obtained results comparable to the state of the art in digit classification and human detection
37. Contrastive Divergence Learning
The gradient of the energy with respect to each filter is

$$\frac{\partial E(V, H; \theta)}{\partial W^k} = -\,\mathrm{Filter}(V, H^k)$$

which gives the CD update rule

$$W^k \leftarrow W^k + \eta \Bigl( \bigl\langle \mathrm{Filter}(V, H^k) \bigr\rangle_0 - \bigl\langle \mathrm{Filter}(V, H^k) \bigr\rangle_1 \Bigr)$$

where $\langle \cdot \rangle_0$ is the expectation with the visible units clamped to the data and $\langle \cdot \rangle_1$ is the expectation after one step of Gibbs sampling started from the data.
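A sketch of one CD-1 step for a single binary image under these formulas (biases, mini-batches, and the border-handling and sparsity tricks from the next slides are omitted; all names are illustrative):

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, lr=0.01, rng=np.random.default_rng(0)):
    """One CD-1 step: W^k += lr * (<Filter(V, H^k)>_0 - <Filter(V, H^k)>_1)."""
    K = W.shape[0]
    # Positive phase: hidden probabilities given the data
    H0 = np.stack([sigmoid(correlate2d(V, W[k], mode="valid")) for k in range(K)])
    H0_sample = (rng.random(H0.shape) < H0).astype(float)
    # One Gibbs step: reconstruct the visible layer, then re-infer the hiddens
    V1 = sigmoid(sum(convolve2d(H0_sample[k], W[k], mode="full") for k in range(K)))
    H1 = np.stack([sigmoid(correlate2d(V1, W[k], mode="valid")) for k in range(K)])
    # Gradient approximation and filter update
    for k in range(K):
        pos = correlate2d(V, H0[k], mode="valid")   # <Filter(V, H^k)>_0
        neg = correlate2d(V1, H1[k], mode="valid")  # <Filter(V, H^k)>_1
        W[k] += lr * (pos - neg)
    return W
```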
38. Training CRBMs (Cont'd)
• The problem of reconstructing the border region becomes severe when the number of Gibbs sampling steps is greater than 1
– Partition visible units into middle and border regions
• Instead of maximizing the likelihood, we (approximately) maximize $p(\mathbf{v}_m \mid \mathbf{v}_b)$, where $\mathbf{v}_m$ and $\mathbf{v}_b$ denote the middle and border visible units
39. Enforcing Feature Sparsity
• The CRBM's representation is K (number of
filters) times overcomplete
• After a few CD learning iterations, V is
perfectly reconstructed
• Enforce sparsity to tackle this problem
– Hidden bias terms were frozen at large negative
values
• Having a single non-sparse hidden unit
improves the learned features
– Might be related to the ergodicity condition
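One simple way to realize the frozen-negative-bias idea from the bullets above, as a sketch rather than the exact scheme used in the talk: add a shared, large negative bias to each hidden map before the sigmoid, and leave a single unit's bias free (non-sparse):

```python
import numpy as np

def sparse_hidden_input(filter_responses, bias=-4.0, free_unit=0):
    """Add a frozen, large negative bias to each hidden map before the sigmoid.

    filter_responses : (K, m, m) array of Filter(V, W^k) responses
    bias             : frozen negative bias shared by the sparse hidden units
    free_unit        : index of a single non-sparse unit whose bias stays 0
    """
    biases = np.full(filter_responses.shape[0], bias)
    biases[free_unit] = 0.0                      # one non-sparse hidden unit
    return filter_responses + biases[:, None, None]
```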
40. Probabilistic Meaning of Max
[Diagram: a 1-D example with visible units $v_1, \dots, v_6$, hidden units $h_1, \dots, h_4$, and pooled units $h'_1, h'_2$, where each $h'_j$ takes the max over a pair of hidden units.]

$$E(v, h) = h_1\, w^T v_{1:3} + h_2\, w^T v_{2:4} + h_3\, w^T v_{3:5} + h_4\, w^T v_{4:6}$$

$$E(v, h') = h'_1 \max\bigl(w^T v_{1:3},\; w^T v_{2:4}\bigr) + h'_2 \max\bigl(w^T v_{3:5},\; w^T v_{4:6}\bigr)$$
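A toy numeric sketch of the two energies in this 1-D example (filter width 3, stride 1, pooling over pairs of hidden units; the values of v and w are arbitrary placeholders):

```python
import numpy as np

def energy_full(v, h, w):
    """E(v, h) = sum_i h_i * w^T v_{i:i+2} for the 1-D toy example."""
    responses = np.array([w @ v[i:i + 3] for i in range(len(h))])
    return float(h @ responses)

def energy_pooled(v, h_prime, w):
    """E(v, h') = sum_j h'_j * max over the j-th pair of filter responses."""
    responses = np.array([w @ v[i:i + 3] for i in range(2 * len(h_prime))])
    pooled = responses.reshape(-1, 2).max(axis=1)   # max over pairs
    return float(h_prime @ pooled)

# Toy example matching the slide: 6 visible units, 4 hidden, 2 pooled units
v = np.arange(1.0, 7.0)          # v_1 .. v_6
w = np.array([0.5, -1.0, 0.25])  # a 3-tap filter
print(energy_full(v, np.array([1.0, 0.0, 1.0, 0.0]), w))
print(energy_pooled(v, np.array([1.0, 1.0]), w))
```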
41. The Classifier Layer
• We used an SVM as our final classifier
– RBF kernel for MNIST
– Linear kernel for INRIA
– For INRIA we combined our 4th layer outputs and
HOG features
• We experimentally observed that relaxing the sparsity of the CRBM's hidden units yields better results
– This lets the discriminative model set the thresholds itself
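A sketch of this classifier layer using scikit-learn, with random placeholder arrays standing in for the real CRBM and HOG features (hyper-parameters are not taken from the talk):

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

# Placeholder feature matrices standing in for the real extracted features.
rng = np.random.default_rng(0)
X_crbm = rng.random((200, 500))      # pooled top-layer CRBM responses
X_hog = rng.random((200, 3780))      # HOG descriptors for the same windows
y_digit = rng.integers(0, 10, 200)   # MNIST-style class labels
y_person = rng.integers(0, 2, 200)   # person / background labels

# MNIST: RBF-kernel SVM on the learned features
digit_clf = SVC(kernel="rbf").fit(X_crbm, y_digit)

# INRIA: linear SVM on the 4th-layer outputs concatenated with HOG features
person_clf = LinearSVC().fit(np.hstack([X_crbm, X_hog]), y_person)
```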
42. Why are HOG features added?
• Because part-like features
are very sparse
• Having a template of the
human figure helps a lot
43. RBM
• Two-layer pairwise MRF with a full set of hidden–visible connections
• The RBM is an energy-based model
• Hidden random variables are binary; visible variables can be binary or continuous
• Inference is straightforward: compute $p(h \mid v)$ and $p(v \mid h)$
• Contrastive Divergence learning is used for training

[Diagram: visible layer $v$ connected to hidden layer $h$ by weights $w$.]

$$p(v, h; \theta) = \frac{1}{Z(\theta)} \exp\bigl(-E(v, h; \theta)\bigr)$$

$$E(v, h; \theta) = \frac{1}{2}\sum_i v_i^2 - \sum_{i,j} v_i w_{ij} h_j - \sum_i b_i v_i - \sum_j c_j h_j$$

(the quadratic term applies to continuous visible units)
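A sketch of the resulting energy and conditionals for the Gaussian-visible / binary-hidden case (the quadratic term and unit variance are assumptions consistent with "binary or continuous" visible units; names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    """p(h_j = 1 | v) = sigmoid(sum_i v_i w_ij + c_j)."""
    return sigmoid(v @ W + c)

def mean_v_given_h(h, W, b):
    """For Gaussian visible units, v | h is Gaussian with mean W h + b."""
    return W @ h + b

def energy(v, h, W, b, c):
    """E(v, h) = 0.5*sum_i v_i^2 - v^T W h - b^T v - c^T h."""
    return 0.5 * np.sum(v ** 2) - v @ W @ h - b @ v - c @ h
```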
44. Why Unsupervised Bottom-Up
• Discriminative learning of deep structure has
not been successful
– Requires large training sets
– Over-fits easily for large models
– First layer gradients are relatively small
• Alternative hybrid approach
– Learn a large set of first layer features generatively
– Later, switch to a discriminative model to select
the discriminative features from those learned
– Fine-tune the features using discriminative training
45. INRIA Results (Cont'd)
• Miss rate at different FPPW (false positives per window) values
• FPPI (false positives per image) is a better indicator of performance
• More experiments on the size of the features and the number of layers are desired