6. Success or failure of an object recognition algorithm hinges on the features used
[Diagram: Input → Feature representation → Classifier → Label (e.g., Human / Background, or digit 0/1/2/3/…); the classifier is obtained by learning. Our focus is the feature representation.]
12. Related Work
• Convolutional Neural Network (CNN) [LeCun et al. '98] [Ranzato et al. CVPR'07]
– Filtering layers are bundled with a classifier, and all the layers are learned together using error backpropagation.
– Does not perform well on natural images
• Biologically plausible models [Serre et al. PAMI'07] [Mutch and Lowe CVPR'06]
– Hand-crafted first layer vs. randomly selected prototypes for the second layer
13. Related Work (cont’d)
• Deep Belief Net [Hinton et al., NC'2006]
– A two-layer, partially observed MRF, called the RBM, is the building block
– Learning is performed unsupervised, layer by layer, from the bottom layer upwards
• Our contributions: We incorporate spatial locality into RBMs and adapt the learning algorithm accordingly
• We add further components, such as pooling and sparsity, to deep belief nets
14. Why Generative & Unsupervised
• Discriminative learning of deep and large
neural networks has not been successful
– Requires large training sets
– Over-fits easily for large models
– First layer gradients are relatively small
• Alternative hybrid approach
– Learn a large set of first layer features generatively
– Switch to a discriminative model to select the
discriminative features from those that are learned
– Discriminative fine-tuning is helpful
16. CRBM
• The image is the visible layer; the hidden layer corresponds to filter responses
• An energy-based probabilistic model
The energy and joint distribution are

$$E(V, H; W) = -\sum_k H^k \bullet \mathrm{Filter}(V, W^k), \qquad P(V, H; W) = \frac{1}{Z} \exp\bigl(-E(V, H; W)\bigr)$$

where $\bullet$ denotes the dot product of vectorized matrices and $\mathrm{Filter}(V, W^k)$ is the response of filtering the visible layer $V$ with the $k$-th filter $W^k$.
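A minimal NumPy/SciPy sketch of this energy, assuming 2-D arrays for the image, the hidden maps, and the filters, and taking Filter(V, W^k) to be a valid cross-correlation (function and variable names here are illustrative, not from the talk):

```python
import numpy as np
from scipy.signal import correlate2d

def crbm_energy(V, H, W):
    """Energy of a CRBM: E(V, H; W) = -sum_k H^k . Filter(V, W^k).

    V : (n, n) visible image
    H : (K, m, m) hidden maps, one per filter (m = n - w + 1)
    W : (K, w, w) filters
    """
    E = 0.0
    for k in range(W.shape[0]):
        # Filter(V, W^k): valid cross-correlation of the image with filter k
        response = correlate2d(V, W[k], mode="valid")
        # dot product of the vectorized hidden map and the response map
        E -= np.sum(H[k] * response)
    return E
```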
17. Training CRBMs
• Maximum likelihood learning of CRBMs is difficult
• Contrastive Divergence (CD) learning is applicable
• For CD learning we need to compute the conditionals $P(H \mid V)$ and $P(V \mid H)$.
[Diagram: one step of Gibbs sampling, from the data $V$ to a sample.]
18. CRBM (Backward)
• Nearby hidden variables
cooperate in reconstruction
• Conditional probabilities take the form
$$P(H^k \mid V) = \sigma\bigl(\mathrm{Filter}(V, W^k)\bigr), \qquad P(V \mid H) = \sigma\Bigl(\sum_k \mathrm{Filter}(H^k, \tilde{W}^k)\Bigr)$$

where $\sigma(x) = \dfrac{1}{1 + \exp(-x)}$ is applied element-wise and $\tilde{W}^k$ denotes the filter $W^k$ flipped in both directions.
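Under the same assumptions as the earlier sketch, these conditionals can be computed with a cross-correlation for Filter(V, W^k) and a full convolution (which flips the filter) for the reconstruction term that sums the contributions of nearby hidden units:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_hidden_given_visible(V, W):
    """P(H^k = 1 | V) = sigmoid(Filter(V, W^k)) for each filter k."""
    return np.stack([sigmoid(correlate2d(V, Wk, mode="valid")) for Wk in W])

def p_visible_given_hidden(H, W):
    """P(V = 1 | H) = sigmoid(sum_k Filter(H^k, ~W^k)).

    Nearby hidden units cooperate in reconstruction: each visible unit
    receives input from every hidden unit whose receptive field covers it.
    """
    total = sum(convolve2d(H[k], W[k], mode="full") for k in range(W.shape[0]))
    return sigmoid(total)
```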
19. Learning the Hierarchy
• The structure is trained bottom-up and layer-wise
• A CRBM is used to train each filtering layer
• Filtering layers are followed by down-sampling
[Architecture diagram: repeated (Filtering → Non-linearity → Pooling) layers, each filtering layer trained as a CRBM; pooling reduces the dimensionality; a classifier sits on top.]
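A self-contained sketch of the forward pass through such a hierarchy, assuming each layer is simply a bank of 2-D filters followed by a sigmoid non-linearity and max pooling (training of the filters by CD is described on later slides; channel handling is simplified here):

```python
import numpy as np
from scipy.signal import correlate2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def max_pool(response, p):
    """Down-sample a 2-D response map by taking the max over p x p blocks."""
    h, w = response.shape
    r = response[:h - h % p, :w - w % p]
    return r.reshape(h // p, p, w // p, p).max(axis=(1, 3))

def forward(image, layers, pool_size=2):
    """Apply (filtering -> non-linearity -> pooling) for each trained layer.

    `layers` is a list of filter banks, one (K, w, w) array per CRBM layer;
    multi-channel inputs are summed over channels for simplicity.
    """
    maps = [image]                      # treat the image as a 1-channel input
    for filters in layers:
        responses = [sigmoid(sum(correlate2d(m, f, mode="valid") for m in maps))
                     for f in filters]  # filtering + non-linearity
        maps = [max_pool(r, pool_size) for r in responses]  # reduce dimensionality
    return maps                         # final feature maps fed to the classifier
```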
22. Evaluation
MNIST digit dataset
• Training set: 60,000 images of digits of size 28×28
• Test set: 10,000 images
INRIA person dataset
• Training set: 2,416 person windows of size 128×64 pixels and 4.5×10^6 negative windows
• Test set: 1,132 positive and 2×10^6 negative windows
23. First layer filters
• INRIA positive set (gray-scale images): 15 filters of size 7×7
• MNIST unlabeled digits: 15 filters of size 5×5
24. Second Layer Features (MNIST)
• The filters themselves are hard to visualize
• Instead, we show patches that responded strongly to each filter:
33. INRIA Results
• Adding our large-scale features significantly
improves performance of the baseline (HOG)
34. Conclusion
• We extended the RBM model to Convolutional
RBM, useful for domains with spatial locality
• We exploited CRBMs to train local hierarchical
feature detectors one layer at a time and
generatively
• This method obtained results comparable to the state of the art in digit classification and human detection
37. Contrastive Divergence Learning
The gradient of the energy with respect to each filter is

$$\frac{\partial E(V, H; \theta)}{\partial W^k} = -\,\mathrm{Filter}(V, H^k)$$

which gives the CD update rule

$$W^k \leftarrow W^k + \eta \Bigl( \bigl\langle \mathrm{Filter}(V, H^k) \bigr\rangle_0 - \bigl\langle \mathrm{Filter}(V, H^k) \bigr\rangle_1 \Bigr)$$

where $\langle \cdot \rangle_0$ is the expectation with the visible units clamped to the data and $\langle \cdot \rangle_1$ is the expectation after one step of Gibbs sampling started from the data.
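A sketch of one CD-1 step for a single binary image under these formulas (biases, mini-batches, and the border-handling and sparsity tricks from the next slides are omitted; all names are illustrative):

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, lr=0.01, rng=np.random.default_rng(0)):
    """One CD-1 step: W^k += lr * (<Filter(V, H^k)>_0 - <Filter(V, H^k)>_1)."""
    K = W.shape[0]
    # Positive phase: hidden probabilities given the data
    H0 = np.stack([sigmoid(correlate2d(V, W[k], mode="valid")) for k in range(K)])
    H0_sample = (rng.random(H0.shape) < H0).astype(float)
    # One Gibbs step: reconstruct the visible layer, then re-infer the hiddens
    V1 = sigmoid(sum(convolve2d(H0_sample[k], W[k], mode="full") for k in range(K)))
    H1 = np.stack([sigmoid(correlate2d(V1, W[k], mode="valid")) for k in range(K)])
    # Gradient approximation and filter update
    for k in range(K):
        pos = correlate2d(V, H0[k], mode="valid")   # <Filter(V, H^k)>_0
        neg = correlate2d(V1, H1[k], mode="valid")  # <Filter(V, H^k)>_1
        W[k] += lr * (pos - neg)
    return W
```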
38. Training CRBMs (Cont'd)
• The problem of reconstructing the border region becomes severe when the number of Gibbs sampling steps is greater than 1
– Partition visible units into middle and border regions
• Instead of maximizing the likelihood, we (approximately) maximize $p(\mathbf{v}_m \mid \mathbf{v}_b)$, where $\mathbf{v}_m$ and $\mathbf{v}_b$ denote the middle and border visible units
39. Enforcing Feature Sparsity
• The CRBM's representation is K (number of
filters) times overcomplete
• After a few CD learning iterations, V is
perfectly reconstructed
• Enforce sparsity to tackle this problem
– Hidden bias terms were frozen at large negative
values
• Having a single non-sparse hidden unit
improves the learned features
– Might be related to the ergodicity condition
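One simple way to realize the frozen-negative-bias idea from the bullets above, as a sketch rather than the exact scheme used in the talk: add a shared, large negative bias to each hidden map before the sigmoid, and leave a single unit's bias free (non-sparse):

```python
import numpy as np

def sparse_hidden_input(filter_responses, bias=-4.0, free_unit=0):
    """Add a frozen, large negative bias to each hidden map before the sigmoid.

    filter_responses : (K, m, m) array of Filter(V, W^k) responses
    bias             : frozen negative bias shared by the sparse hidden units
    free_unit        : index of a single non-sparse unit whose bias stays 0
    """
    biases = np.full(filter_responses.shape[0], bias)
    biases[free_unit] = 0.0                      # one non-sparse hidden unit
    return filter_responses + biases[:, None, None]
```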
40. Probabilistic Meaning of Max
[Diagram: a 1-D example with visible units $v_1, \dots, v_6$, hidden units $h_1, \dots, h_4$, and pooled units $h'_1, h'_2$, where each $h'_j$ takes the max over a pair of hidden units.]

$$E(v, h) = h_1\, w^T v_{1:3} + h_2\, w^T v_{2:4} + h_3\, w^T v_{3:5} + h_4\, w^T v_{4:6}$$

$$E(v, h') = h'_1 \max\bigl(w^T v_{1:3},\; w^T v_{2:4}\bigr) + h'_2 \max\bigl(w^T v_{3:5},\; w^T v_{4:6}\bigr)$$
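A toy numeric sketch of the two energies in this 1-D example (filter width 3, stride 1, pooling over pairs of hidden units; the values of v and w are arbitrary placeholders):

```python
import numpy as np

def energy_full(v, h, w):
    """E(v, h) = sum_i h_i * w^T v_{i:i+2} for the 1-D toy example."""
    responses = np.array([w @ v[i:i + 3] for i in range(len(h))])
    return float(h @ responses)

def energy_pooled(v, h_prime, w):
    """E(v, h') = sum_j h'_j * max over the j-th pair of filter responses."""
    responses = np.array([w @ v[i:i + 3] for i in range(2 * len(h_prime))])
    pooled = responses.reshape(-1, 2).max(axis=1)   # max over pairs
    return float(h_prime @ pooled)

# Toy example matching the slide: 6 visible units, 4 hidden, 2 pooled units
v = np.arange(1.0, 7.0)          # v_1 .. v_6
w = np.array([0.5, -1.0, 0.25])  # a 3-tap filter
print(energy_full(v, np.array([1.0, 0.0, 1.0, 0.0]), w))
print(energy_pooled(v, np.array([1.0, 1.0]), w))
```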
41. The Classifier Layer
• We used an SVM as our final classifier
– RBF kernel for MNIST
– Linear kernel for INRIA
– For INRIA we combined our 4th layer outputs and
HOG features
• We experimentally observed that relaxing the sparsity of the CRBM's hidden units yields better results
– This lets the discriminative model set the thresholds itself
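A sketch of this classifier layer using scikit-learn, with random placeholder arrays standing in for the real CRBM and HOG features (hyper-parameters are not taken from the talk):

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

# Placeholder feature matrices standing in for the real extracted features.
rng = np.random.default_rng(0)
X_crbm = rng.random((200, 500))      # pooled top-layer CRBM responses
X_hog = rng.random((200, 3780))      # HOG descriptors for the same windows
y_digit = rng.integers(0, 10, 200)   # MNIST-style class labels
y_person = rng.integers(0, 2, 200)   # person / background labels

# MNIST: RBF-kernel SVM on the learned features
digit_clf = SVC(kernel="rbf").fit(X_crbm, y_digit)

# INRIA: linear SVM on the 4th-layer outputs concatenated with HOG features
person_clf = LinearSVC().fit(np.hstack([X_crbm, X_hog]), y_person)
```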
42. Why are HOG features added?
• Because part-like features
are very sparse
• Having a template of the
human figure helps a lot
43. RBM
• Two-layer pairwise MRF with a full set of hidden–visible connections
• The RBM is an energy-based model
• Hidden random variables are binary; visible variables can be binary or continuous
• Inference is straightforward: compute $p(h \mid v)$ and $p(v \mid h)$
• Contrastive Divergence learning is used for training

[Diagram: visible layer $v$ connected to hidden layer $h$ by weights $w$.]

$$p(v, h; \theta) = \frac{1}{Z(\theta)} \exp\bigl(-E(v, h; \theta)\bigr)$$

$$E(v, h; \theta) = \frac{1}{2}\sum_i v_i^2 - \sum_{i,j} v_i w_{ij} h_j - \sum_i b_i v_i - \sum_j c_j h_j$$

(the quadratic term applies to continuous visible units)
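A sketch of the resulting energy and conditionals for the Gaussian-visible / binary-hidden case (the quadratic term and unit variance are assumptions consistent with "binary or continuous" visible units; names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    """p(h_j = 1 | v) = sigmoid(sum_i v_i w_ij + c_j)."""
    return sigmoid(v @ W + c)

def mean_v_given_h(h, W, b):
    """For Gaussian visible units, v | h is Gaussian with mean W h + b."""
    return W @ h + b

def energy(v, h, W, b, c):
    """E(v, h) = 0.5*sum_i v_i^2 - v^T W h - b^T v - c^T h."""
    return 0.5 * np.sum(v ** 2) - v @ W @ h - b @ v - c @ h
```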
44. Why Unsupervised Bottom-Up
• Discriminative learning of deep structure has
not been successful
– Requires large training sets
– Over-fits easily for large models
– First layer gradients are relatively small
• Alternative hybrid approach
– Learn a large set of first layer features generatively
– Later, switch to a discriminative model to select
the discriminative features from those learned
– Fine-tune the features using discriminative training
45. INRIA Results (Cont'd)
• Miss rate at different FPPW (false positives per window) values
• FPPI (false positives per image) is a better indicator of performance
• More experiments on the size of the features and the number of layers are desired