Stereo Matching by Deep Learning
1. STEREO MATCHING BY DEEP
LEARNING
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
2. Outline
◦ Self-Supervised Learning for Stereo Matching with Self-Improving Ability
◦ Unsupervised Learning of Stereo Matching
◦ Pyramid Stereo Matching Network
◦ Learning for Disparity Estimation through Feature Constancy
◦ Deep Material-aware Cross-spectral Stereo Matching
◦ SegStereo: Exploiting Semantic Information for Disparity Estimation
◦ DispSegNet: Leveraging Semantics for End-to-End Learning of Disparity
Estimation from Stereo Imagery
◦ Group-wise Correlation Stereo Network
3. Self-Supervised Learning for Stereo
Matching with Self-Improving Ability
◦ A simple CNN architecture that is able to learn to compute dense disparity
maps directly from the stereo inputs.
◦ Training is performed in an end-to-end fashion without the need for ground-truth disparity maps.
◦ The idea is to use the image warping error (instead of disparity-map residuals) as the loss function driving the learning process, aiming to find a disparity map that minimizes the warping error.
◦ The network is self-adaptive to unseen imagery as well as to different camera settings.
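The warping-error idea above can be sketched with a tiny NumPy example: the right image is warped to the left view using the predicted disparity, and the mean absolute difference is the training signal. Nearest-neighbor sampling is used here for brevity; the network uses bilinear sampling so the loss stays differentiable.

```python
import numpy as np

def warp_right_to_left(right, disp):
    """Warped left view: I~_L(x, y) = I_R(x - d(x, y), y), nearest-neighbor."""
    H, W = right.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.rint(xs - disp).astype(int), 0, W - 1)
    return right[ys, src_x]

def warping_loss(left, right, disp):
    """Mean absolute warping error: the self-supervised training signal."""
    return float(np.abs(left - warp_right_to_left(right, disp)).mean())

# Toy pair: an intensity ramp viewed with a uniform 2-pixel disparity.
left = np.tile(np.arange(12.0), (4, 1))
right = left + 2.0                 # I_R(u) = I_L(u + 2)
disp = np.full_like(left, 2.0)
```

With the correct disparity the warped right image reproduces the left image (away from the image border), so the loss drops toward zero.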
4. Self-Supervised Learning for Stereo
Matching with Self-Improving Ability
The self-supervised deep stereo matching network architecture. The network consists of five modules,
feature extraction, cross feature volume, 3D feature matching, soft-argmin, and warping loss evaluation.
5. Self-Supervised Learning for Stereo
Matching with Self-Improving Ability
Feature Volume Construction. The cross feature volume is
constructed by concatenating the learned features extracted
from the left and right images correspondingly. The blue
rectangle represents a feature map from the left image, the
stacked orange rectangle set represents traversed right
feature maps from 0 toward a preset disparity range D.
Different intensities correspond to different levels of disparity.
Note that the left feature map is copied D + 1 times to match
the traversed right feature maps.
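The construction in this caption can be sketched as follows (NumPy, with an illustrative zero-padding choice for pixels shifted out of view):

```python
import numpy as np

def cross_feature_volume(feat_l, feat_r, max_disp):
    """Build the (D + 1, 2C, H, W) cross feature volume: the left map is
    copied D + 1 times; the right map is traversed over d = 0..D (shifted
    right by d, zero-padded) and concatenated along the channel axis."""
    C, H, W = feat_l.shape
    vol = np.zeros((max_disp + 1, 2 * C, H, W), dtype=feat_l.dtype)
    for d in range(max_disp + 1):
        vol[d, :C] = feat_l                        # left map, copied
        if d == 0:
            vol[d, C:] = feat_r
        else:
            vol[d, C:, :, d:] = feat_r[:, :, :-d]  # right map traversed by d
    return vol
```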
6. Self-Supervised Learning for Stereo
Matching with Self-Improving Ability
Diagram of the res-TDM module for 3D feature matching with learned regularization. It takes the cross feature volume as input, followed by a series of 3D convolutions and deconvolutions. The output of this module is a 3D disparity volume of dimension H × W × (D + 1).
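A minimal sketch of the soft-argmin step that turns this H × W × (D + 1) volume into a dense disparity map (treating the volume as matching costs, so the softmax is taken over negated costs):

```python
import numpy as np

def soft_argmin(cost_volume):
    """Differentiable disparity regression over a (D + 1, H, W) cost volume:
    d_hat = sum_d d * softmax_d(-cost). Returns an (H, W) disparity map."""
    neg = -cost_volume
    neg = neg - neg.max(axis=0, keepdims=True)     # numerical stability
    p = np.exp(neg)
    p /= p.sum(axis=0, keepdims=True)              # softmax over disparities
    d = np.arange(cost_volume.shape[0], dtype=float)[:, None, None]
    return (d * p).sum(axis=0)
```

Unlike a hard argmin, this regression is sub-pixel and differentiable, so the warping loss can be backpropagated through it.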
9. Unsupervised Learning of Stereo
Matching
◦ A framework for learning stereo matching costs
without human supervision.
◦ This method updates network parameters in an
iterative manner.
◦ It starts with a randomly initialized network.
◦ A left-right consistency check is adopted to guide the training.
◦ Suitable matches are picked and used as training data in the following iterations.
◦ The system finally converges to a stable state.
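The left-right check that selects confident training samples can be sketched as follows (nearest-neighbor lookup and a hypothetical 1-pixel threshold):

```python
import numpy as np

def left_right_check(disp_l, disp_r, thresh=1.0):
    """Confidence mask: keep pixels where the left disparity agrees with the
    right disparity at the matched location, |d_L(x) - d_R(x - d_L(x))| < t."""
    H, W = disp_l.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.rint(xs - disp_l).astype(int), 0, W - 1)
    return np.abs(disp_l - disp_r[ys, src_x]) < thresh
```

Pixels passing the check become the pseudo ground truth for the next training iteration; occluded and mismatched pixels fail it and are excluded.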
10. Unsupervised Learning of Stereo
Matching
The learning network takes stereo images as input and generates a disparity map. The architecture has two branches: the first computes the cost volume and the other jointly filters it.
11. Unsupervised Learning of Stereo
Matching
Configuration of each component of the network: cost-volume branch (CVB), image feature branch (IFB) and joint filtering branch (JF). Torch notations (channels, kernel, stride) are used to define the convolutional layers.
12. Unsupervised Learning of Stereo
Matching
The iterative unsupervised training framework consists of four parts: disparity
prediction, confidence map estimation, training data selection and network training.
14. Pyramid Stereo Matching Network
◦ Current architectures rely on patch-based Siamese networks and lack the means to exploit context information for finding correspondences in ill-posed regions.
◦ To tackle this problem, PSMNet, a pyramid stereo matching network, consists of two main modules: spatial pyramid pooling (SPP) and a 3D CNN.
◦ The spatial pyramid pooling module exploits global context information by aggregating context at different scales and locations to form a cost volume.
◦ The 3D CNN learns to regularize the cost volume using multiple stacked hourglass networks in conjunction with intermediate supervision.
◦ Code of PSMNet: https://github.com/JiaRenChang/PSMNet.
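A minimal sketch of the spatial-pyramid-pooling idea (illustrative pooling scales; PSMNet additionally applies 1×1 convolutions to each pooled branch before concatenation):

```python
import numpy as np

def spp_concat(feat, scales=(1, 2, 4)):
    """Average-pool a (C, H, W) feature map over s x s grids, upsample each
    pooled map back to H x W (nearest neighbor), and concatenate everything
    with the input. H and W must be divisible by every scale."""
    C, H, W = feat.shape
    branches = [feat]
    for s in scales:
        bh, bw = H // s, W // s
        pooled = feat.reshape(C, s, bh, s, bw).mean(axis=(2, 4))   # (C, s, s)
        branches.append(np.repeat(np.repeat(pooled, bh, axis=1), bw, axis=2))
    return np.concatenate(branches, axis=0)
```

The coarsest branch (scale 1) injects a global summary into every pixel, which is what lets the cost volume see context beyond a local patch.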
15. Pyramid Stereo Matching Network
Architecture overview of
proposed PSMNet. The left and
right input stereo images are
fed to two weight-sharing
pipelines consisting of a CNN
for feature maps calculation, an
SPP module for feature
harvesting by concatenating
representations from sub-
regions with different sizes, and
a convolution layer for feature
fusion. The left and right image
features are then used to form
a 4D cost volume, which is fed
into a 3D CNN for cost volume
regularization and disparity
regression.
16. Pyramid Stereo Matching Network
Table 1. Parameters of the proposed PSMNet architecture. The construction of residual blocks is designated in brackets with the number of stacked blocks. Downsampling is performed by conv0_1 and conv2_1 with a stride of 2. The usage of batch normalization and ReLU follows ResNet, with the exception that PSMNet does not apply ReLU after summation.
19. Learning for Disparity Estimation
through Feature Constancy
◦ A network architecture to incorporate all steps: matching cost calculation,
matching cost aggregation, disparity calculation, and disparity refinement.
◦ The network consists of three parts.
◦ 1) calculates the multi-scale shared features.
◦ 2) performs matching cost calculation, matching cost aggregation and disparity
calculation to estimate the initial disparity using shared features.
◦ Note: The initial disparity and the shared features are used to calculate the feature
constancy that measures correctness of the correspondence between two input images.
◦ 3) The initial disparity and the feature constancy are then fed into a sub-network to refine
the initial disparity.
◦ Source code: http://github.com/leonzfa/iResNet.
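The feature-constancy measure described above — how well the left features are reconstructed by warping the right features with the initial disparity — can be sketched as:

```python
import numpy as np

def feature_constancy(feat_l, feat_r, disp):
    """Per-pixel feature reconstruction error |F_L(x) - F_R(x - d(x))|,
    summed over channels. Large values flag pixels where the initial
    disparity is likely wrong, guiding the refinement sub-network."""
    C, H, W = feat_l.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.rint(xs - disp).astype(int), 0, W - 1)
    return np.abs(feat_l - feat_r[:, ys, src_x]).sum(axis=0)   # (H, W)
```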
20. Learning for Disparity Estimation
through Feature Constancy
The architecture. It incorporates all four steps of stereo matching into a single network. Note that the skip connections between the encoder and decoder at different scales are omitted here for better visualization.
22. Learning for Disparity Estimation
through Feature Constancy
Comparison with other
state-of-the-art
methods on the KITTI
2015 dataset.
23. SegStereo: Exploiting Semantic
Information for Disparity
◦ Appropriate incorporation of semantic cues can greatly rectify prediction in
commonly-used disparity estimation frameworks.
◦ This method conducts semantic feature embedding and regularizes semantic
cues as the loss term to improve learning disparity.
◦ The unified model SegStereo employs semantic features from segmentation
and introduces semantic softmax loss, which helps improve the prediction
accuracy of disparity maps.
◦ The semantic cues work well in both unsupervised and supervised manners.
24. SegStereo: Exploiting Semantic
Information for Disparity
Extract intermediate features from
stereo input. Calculate the cost
volume via the correlation operator.
The left segmentation feature map is
aggregated into disparity branch as
semantic feature embedding. The
right segmentation feature map is
warped to left view for per-pixel
semantic prediction with softmax
loss regularization. Both steps
incorporate semantic info. to
improve disparity estimation. The
SegStereo framework enables both
unsupervised and supervised
learning, using photometric loss or
disparity regression loss.
27. Deep Material-aware Cross-spectral
Stereo Matching
◦ Cross-spectral imaging provides benefits for recognition and detection tasks.
◦ Stereo matching also provides an opportunity to obtain depth without an
active projector source.
◦ Matching images from different spectral bands is challenging because of
large appearance variations.
◦ A deep learning framework to simultaneously transform images across spectral
bands and estimate disparity.
◦ A material-aware loss function is incorporated within the disparity prediction
network to handle regions with unreliable matching such as light sources, glass
windshields and glossy surfaces.
◦ No depth supervision is required.
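A minimal sketch of the material-aware weighting idea, with hypothetical per-material reliabilities (the paper derives material awareness from recognition output rather than a fixed table):

```python
import numpy as np

# Hypothetical per-material reliabilities, for illustration only.
RELIABILITY = {0: 1.0,    # diffuse surfaces: cross-spectral matching is trustworthy
               1: 0.1,    # glass: reflections/refraction break matching
               2: 0.05}   # light sources: saturated and spectrum-dependent

def material_aware_loss(reproj_err, material_map):
    """Weighted reprojection loss: unreliable materials are down-weighted so
    they do not dominate the disparity training signal."""
    w = np.vectorize(RELIABILITY.get, otypes=[float])(material_map)
    return float((w * reproj_err).sum() / (w.sum() + 1e-8))
```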
28. Deep Material-aware Cross-spectral
Stereo Matching
The disparity prediction network (DPN) predicts left-right disparity for an RGB-NIR stereo input. The spectral translation network (STN) converts the left RGB image into a pseudo-NIR image. The two networks are trained simultaneously with a reprojection error. The symmetric CNN in (b) prevents the STN from learning disparity.
29. Deep Material-aware Cross-spectral
Stereo Matching
Intermediate results. (a) Left image. (b) material recognition from DeepLab. (c) RGB-to-NIR filters
corrected by exposure and white balancing. The R,G,B values represent the weights of R,G,B channels.
31. DispSegNet: Leveraging Semantics for End-to-End
Learning of Disparity Estimation from Stereo Imagery
◦ A CNN architecture that improves the quality and accuracy of disparity estimation with the help of semantic segmentation.
◦ A network structure in which these two tasks are highly coupled.
◦ A two-stage refinement process is used:
◦ Initial disparity estimates are refined with an embedding learned from the
semantic segmentation branch of the network.
◦ The model is trained using an unsupervised approach, in which images from one
of the stereo pair are warped and compared against images from the other.
◦ A single network is capable of outputting disparity estimates and semantic labels.
◦ Leveraging embedding learned from semantic segmentation improves the
performance of disparity estimation.
32. DispSegNet: Leveraging Semantics for End-to-End
Learning of Disparity Estimation from Stereo Imagery
Architecture. The pipeline consists of: (a) rectified input stereo images. (b) useful features are extracted from input stereo
images. (c) cost volume is formed by concatenating corresponding features from both sides. (d) initial disparity is estimated
from cost volume using 3D convolution. (e) initial disparity is further improved by fusing segment embedding. The PSP
(Pyramid scene parsing) incorporates more context info. for the semantic segmentation task. (f) estimated disparity and
semantic segmentation from both left and right views are generated from the model.
35. Group-wise Correlation Stereo Network
◦ This method tries to construct the cost volume by group-wise correlation.
◦ The left and right features are divided into groups along the channel dimension, and correlation maps are computed within each group to obtain multiple matching-cost proposals, which are then packed into a cost volume.
◦ Group-wise correlation provides an efficient representation for measuring feature similarity and does not lose as much information as full correlation.
◦ It also degrades less in performance when the number of parameters is reduced.
◦ The code is available at https://github.com/xy-guo/GwcNet.
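The group-wise correlation volume can be sketched as follows (NumPy, with illustrative small sizes; the cost per group is the mean inner product over that group's channels):

```python
import numpy as np

def groupwise_correlation(feat_l, feat_r, groups):
    """Split C channels into groups; per-group cost = mean over the group's
    channels of the elementwise product of left and right features.
    Returns (G, H, W)."""
    C, H, W = feat_l.shape
    prod = (feat_l * feat_r).reshape(groups, C // groups, H, W)
    return prod.mean(axis=1)

def gwc_volume(feat_l, feat_r, max_disp, groups):
    """Pack per-disparity group correlations into a (G, D + 1, H, W) volume."""
    C, H, W = feat_l.shape
    vol = np.zeros((groups, max_disp + 1, H, W), dtype=feat_l.dtype)
    for d in range(max_disp + 1):
        if d == 0:
            vol[:, d] = groupwise_correlation(feat_l, feat_r, groups)
        else:
            vol[:, d, :, d:] = groupwise_correlation(
                feat_l[:, :, d:], feat_r[:, :, :-d], groups)
    return vol
```

With G groups the volume carries G similarity channels per disparity instead of one (full correlation) or 2C (concatenation), which is the efficiency/information trade-off the slide describes.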
36. Group-wise Correlation Stereo Network
The pipeline of the proposed group-wise correlation network. The whole network consists of four parts, unary
feature extraction, cost volume construction, 3D convolution aggregation, and disparity prediction. The cost
volume is divided into two parts, concatenation volume (Cat) and group-wise correlation volume (Gwc).
Concatenation volume is built by concatenating the compressed left and right features.
37. Group-wise Correlation Stereo Network
The structure of the 3D aggregation network. The network consists of a pre-hourglass module (four convolutions at the beginning) and three stacked 3D hourglass networks. Compared with PSMNet, the shortcut connections between different hourglass modules and output modules are removed, so output modules 0, 1, and 2 can be dropped during inference to save time. 1×1×1 3D convolutions are added to the shortcut connections within hourglass modules.
39. Group-wise Correlation Stereo Network
Table: Structure details of the modules. H and W represent the height and the width of the input image. S1/2 denotes the convolution stride. If not specified, each 3D convolution is followed by batch normalization and ReLU.
* denotes the ReLU is not included.
** denotes convolution only.