1. A Deep Belief Network Approach to Learning Depth from Optical Flow
Applied Mathematics Honors Thesis by Reuben Feinman
2. Background
• The visual systems of insects are exquisitely sensitive to motion
• Srinivasan et al. (1989) showed that bees decipher the range of their targets using absolute motion and motion relative to the background
• Key idea: optical flow is important for navigation
3. Motion Parallax in the Dorsal Stream
Humans perceive depth rather precisely via motion parallax
• Motion is a powerful monocular cue to depth understanding
• Assists with interpretation of spatial relationships
• “Optical flow”: the motion information encoded in the visual system
source: opticflow.bu.edu
4. Deep Learning
• The mapping from motion to depth is highly nonlinear (Braunstein, 1976)
• Deep learning has made great progress: multiple layers of nonlinear processing can capture a more complex input-to-output function
source: www.deeplearning.stanford.edu
Diagram: motion information is mapped through successive network layers to a depth prediction
5. Computer Graphics
• Supervised learning needs labeled training data, and real videos do not come with ground-truth depth
• Graphical scenes generated by a gaming engine provide a large number of training samples for supervised learning
Figure: a scene excerpt from our CryEngine forest database (RGB frame and ground-truth depth map)
6. MT Motion Model
• Hierarchical model of motion processing that alternates between template matching and max pooling (a minimal sketch follows below)
• Convolutional learning of spatio-temporal features
• Extension of HMAX (Serre et al. 2007)
Jhuang et al. 2007
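As a rough illustration of the alternation (not the authors' implementation; function names, shapes, and the pooling size are hypothetical), one template-matching plus max-pooling stage might look like this in NumPy/SciPy:

import numpy as np
from scipy.signal import correlate

def s_layer(clip, templates):
    # Template matching: correlate the (time, height, width) clip
    # with each spatio-temporal template.
    return [correlate(clip, t, mode='valid') for t in templates]

def c_layer(responses, pool=2):
    # Max pooling over local spatial neighborhoods for invariance.
    pooled = []
    for r in responses:
        t, h, w = r.shape
        r = r[:, :h - h % pool, :w - w % pool]
        r = r.reshape(t, h // pool, pool, w // pool, pool)
        pooled.append(r.max(axis=(2, 4)))
    return pooled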
7. Population Responses
The dorsal velocity model outputs a motion-energy feature map
• Shape: (# speeds) × (# directions) × height × width
• In other words, each pixel contains a feature vector x with (# speeds) × (# directions) dimensions (illustrated below)
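As a concrete illustration (the 64×64 frame size is hypothetical; the 9 speeds and 8 directions come from the DBN slide below), the 4-D map can be rearranged so each pixel holds one 72-dimensional vector:

import numpy as np

n_speeds, n_dirs, H, W = 9, 8, 64, 64        # 64x64 is illustrative
energy = np.zeros((n_speeds, n_dirs, H, W))  # placeholder feature map

# (speeds, dirs, H, W) -> (H, W, speeds*dirs): one vector per pixel
per_pixel = energy.reshape(n_speeds * n_dirs, H, W).transpose(1, 2, 0)
assert per_pixel.shape == (64, 64, 72)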
8. Deep Belief Networks
• A plain MLP fails on this task
• Lots of unlabeled data is available; perhaps we can exploit it to extract deep hierarchical representations of our motion-model outputs
• Initialize the network with these learned feature detectors
source: http://deeplearning.net
9. The RBM Model
Maximum likelihood learning: update the model parameters to maximize the likelihood of our training data
Standard RBM: E(v,h) = −b·v − c·h − v·W·h
Gaussian-Bernoulli RBM: E(v,h) = ∑_i (v_i − b_i)^2 / (2σ_i^2) − c·h − ∑_{i,j} (v_i/σ_i) W_ij h_j
Joint distribution: P(v,h) = (1/Z)·exp(−E(v,h))
We then define a "free energy" version that sums over all possible hidden states:
P(v) = (1/Z)·exp(−F(v)), where F(v) = −log ∑_h exp(−E(v,h))
source: http://deeplearning.net
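For the standard (Bernoulli) RBM the free energy has a closed form, since the hidden units can be summed out analytically. A minimal sketch (random placeholder weights, not the thesis code):

import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 72, 800                        # sizes taken from the DBN slide below
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b, c = np.zeros(n_vis), np.zeros(n_hid)       # visible / hidden biases

def free_energy(v):
    # F(v) = -b.v - sum_j log(1 + exp(c_j + (vW)_j)), so P(v) = exp(-F(v)) / Z
    return -(v @ b) - np.logaddexp(0, c + v @ W).sum()

v = rng.integers(0, 2, n_vis).astype(float)   # a random binary visible vector
unnormalized_log_p = -free_energy(v)          # equals log P(v) + log Z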
10. Justifying Greedy Layer-Wise Pre-Training
• We run a Markov chain with alternating Gibbs sampling (sketched below):
h′ ~ P(h | v = v)
v′ ~ P(v | h = h′)
• Gibbs sampling is guaranteed to reduce the KL divergence between the posterior distribution in a given layer and the model's equilibrium distribution
Hinton et al. 2006
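A minimal sketch of one alternating Gibbs step for a Bernoulli RBM, reusing the placeholder W, b, c, n_vis, n_hid from the previous sketch:

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    p_h = sigmoid(c + v @ W)                          # P(h_j = 1 | v)
    h = (rng.random(n_hid) < p_h).astype(float)       # h' ~ P(h | v)
    p_v = sigmoid(b + W @ h)                          # P(v_i = 1 | h')
    v_new = (rng.random(n_vis) < p_v).astype(float)   # v' ~ P(v | h')
    return v_new, h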
11. The DBN
• The data: feature vectors have 72 elements, tuned to 9 different speeds and 8 directions (9 × 8 = 72)
• The DBN takes in a 3×3 pixel window
• 3 hidden layers of 800 units with sigmoidal activations
• Linear output layer
Technicalities:
• Mini-batch training with a batch size of 5000
• Sparse initialization scheme
• RMSprop learning rule (root mean square propagation); see the sketch below
• Backpropagation fine-tuning with dropout, dropping 20% of the units at each layer except the input layer
• Geometrically decaying learning rate (LR = 0.998 × LR at each epoch)
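A sketch of the RMSprop update and the geometric learning-rate decay listed above; the decay constant 0.9, epsilon, starting learning rate, and epoch count are illustrative defaults, not values from the thesis:

import numpy as np

def rmsprop_update(param, grad, cache, lr, decay=0.9, eps=1e-8):
    # A running average of squared gradients scales each parameter's step.
    cache = decay * cache + (1 - decay) * grad ** 2
    return param - lr * grad / (np.sqrt(cache) + eps), cache

lr = 0.001                      # illustrative starting learning rate
for epoch in range(100):        # epoch count is illustrative
    # ... loop over mini-batches of 5000 samples, applying rmsprop_update ...
    lr *= 0.998                 # LR = 0.998 * LR at each epoch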
14. Markov Random Field Smoothing
The receptive field can be a powerful tool for decoding
The MRF is defined by two potential functions (evaluated in the sketch below):
1) Φ = ∑_i (w · x_i − d_i)^2
2) Ψ = ∑_<i,j> (d_i − d_j)^2 / ((d_i − d_j)^2 + 1)
(note: <i,j> ranges over all neighboring pairs i, j)
P(d | x; α, w) = (1/Z) · exp(−(α·Ψ + Φ))
source: Peter Orchard, University of Edinburgh
Figure: ground truth, original prediction (0.595), and MRF prediction (0.630)
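A sketch evaluating the two potentials on a depth map d of shape (H, W), with per-pixel features x and decoding weights w; the array layout and names are illustrative assumptions:

import numpy as np

def phi(d, x, w):
    # Data term: squared error between the linear decode w.x_i and d_i.
    pred = x @ w                     # x: (H, W, F), w: (F,) -> (H, W)
    return ((pred - d) ** 2).sum()

def psi(d):
    # Smoothness term over 4-neighbor pairs; saturates for large depth jumps.
    dx = (d[:, 1:] - d[:, :-1]) ** 2
    dy = (d[1:, :] - d[:-1, :]) ** 2
    return (dx / (dx + 1)).sum() + (dy / (dy + 1)).sum()

def neg_log_prob(d, x, w, alpha):
    # -log P(d | x; alpha, w), up to the constant log Z.
    return alpha * psi(d) + phi(d, x, w)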
19. Normalizing the Data
• Training a GB-RBM is hard; the distributions of spike firing rates vary widely from dataset to dataset
• We propose a normalized GB-RBM in which the training data is normalized to zero mean and unit variance; all subsequent datasets (validation & test) are normalized with the same parameters (see the sketch below)
Dataset histograms before and after normalization
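A minimal sketch of this scheme, with random placeholder arrays standing in for the firing-rate datasets:

import numpy as np

rng = np.random.default_rng(0)
train_X = rng.normal(5.0, 2.0, (1000, 72))   # placeholder firing rates
test_X = rng.normal(5.0, 2.0, (200, 72))

# Fit normalization parameters on the training set only.
mu, sigma = train_X.mean(axis=0), train_X.std(axis=0)
train_n = (train_X - mu) / sigma
test_n = (test_X - mu) / sigma               # same parameters reused for test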