1. Deep VO and SLAM II
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
2. Outline
• Single View Stereo Matching, Mar. 2018
• LEGO: Learning Edge with Geometry all at Once by Watching Videos, Mar. 2018
• Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature Reconstruction, Apr. 2018
• Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints, Jun. 2018
• Look Deeper into Depth: Monocular Depth Estimation with Semantic Booster and Attention-Driven Loss, ECCV, 2018
• GeoNet: Geometric Neural Network, CVPR 2018
• GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose
• Learning Depth from Monocular Videos using Direct Methods, CVPR, 2018
• CNN-SVO: Improving the Mapping in Semi-Direct Visual Odometry Using Single-Image Depth Prediction, Oct. 2018
• Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos, Nov. 2018
• Self-Supervised Learning of Depth and Camera Motion from 360◦ Videos, Nov. 2018
• Unsupervised Learning-based Depth Estimation aided Visual SLAM Approach, Jan. 2019
3. Single View Stereo Matching
• Previous monocular depth estimation methods take a single view and directly regress the
expected results.
• Though recent advances are made by applying geometrically inspired loss functions during
training, the inference procedure does not explicitly impose any geometrical constraint.
• Therefore, these models rely purely on the quality of the data and the effectiveness of learning to generalize.
• This either leads to suboptimal results or demands a huge amount of expensive labelled ground-truth data to produce reasonable results.
• This paper shows that the monocular depth estimation problem can be reformulated as two sub-
problems, a view synthesis procedure followed by stereo matching, with two intriguing
properties, namely i) geometrical constraints can be explicitly imposed during inference; ii)
demand on labelled depth data can be greatly alleviated.
• The whole pipeline can still be trained in an end-to-end fashion and this new formulation plays a
critical role in advancing the performance.
• The model also generalizes well to other monocular depth estimation benchmarks.
• It also discusses the implications and the advantages of solving monocular depth estimation using
stereo methods.
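Once the view-synthesis step has produced a virtual second view and the stereo-matching step a disparity map, depth follows from the standard stereo relation Z = fB/d. A minimal sketch (the focal length and baseline values are illustrative, not the paper's):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) to metric depth via Z = f * B / d."""
    return focal_px * baseline_m / np.maximum(disparity, eps)

# Toy example with KITTI-like intrinsics (illustrative values only).
disp = np.array([[10.0, 20.0], [40.0, 80.0]])
depth = disparity_to_depth(disp, focal_px=720.0, baseline_m=0.54)
```

Larger disparities map to nearer points, which is why explicit stereo matching imposes a geometric constraint the direct-regression models lack.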
6. Single View Stereo Matching
“Unsupervised monocular depth estimation with left-right consistency”
7. LEGO: Learning Edge with Geometry all at
Once by Watching Videos
• Learning to estimate 3D geometry in a single image by watching unlabeled videos via deep
convolutional network is attracting significant attention.
• This paper introduces a "3D as-smooth-as-possible (3D-ASAP)" prior inside the pipeline, which enables
joint estimation of edges and 3D scene, yielding results with significant improvement in accuracy for
fine detailed structures.
• Specifically, it defines the 3D-ASAP prior by requiring that any two points recovered in 3D from an image should lie on an existing planar surface if no other cues are provided.
• They design an unsupervised framework that Learns Edges and Geometry (depth, normal) all at Once
(LEGO).
• The predicted edges are embedded into depth and surface normal smoothness terms, where pixels
without edges in-between are constrained to satisfy the prior.
• In the framework, the predicted depths, normals and edges are forced to be consistent all the time.
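The edge-gated smoothness idea can be sketched as a simple 2D loss: depth variation is penalized only between neighboring pixels with no predicted edge in between. This is a simplified first-order stand-in for the full 3D-ASAP prior, and the function name is ours:

```python
import numpy as np

def edge_aware_smoothness(depth, edge_prob):
    """First-order depth smoothness gated by predicted edge probability.

    Pixel pairs with no edge in between (edge_prob ~ 0) are penalized for
    depth variation; pairs separated by an edge are left unconstrained.
    """
    dx = np.abs(depth[:, 1:] - depth[:, :-1])   # horizontal depth gradients
    dy = np.abs(depth[1:, :] - depth[:-1, :])   # vertical depth gradients
    wx = 1.0 - edge_prob[:, 1:]                 # zero weight across edges
    wy = 1.0 - edge_prob[1:, :]
    return (wx * dx).mean() + (wy * dy).mean()
```

A depth discontinuity that coincides with a predicted edge contributes nothing to the loss, so the network is free to keep fine structures sharp.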
10. Unsupervised Learning of Depth and Ego-Motion from
Monocular Video Using 3D Geometric Constraints
• It is an approach for unsupervised learning of depth and ego-motion from monocular video.
• Unsupervised learning removes the need for separate supervisory signals (depth or ego-motion
ground truth, or multi-view video).
• Prior work in unsupervised depth learning uses pixel-wise or gradient-based losses, which only
consider pixels in small local neighborhoods.
• This idea is to explicitly consider the inferred 3D geometry of the whole scene, and enforce
consistency of the estimated 3D point clouds and ego-motion across consecutive frames.
• This is a challenging task and is solved by a novel (approximate) backpropagation algorithm for
aligning 3D structures.
• They combine this 3D-based loss with 2D losses based on photometric quality of frame
reconstructions using estimated depth and ego-motion from adjacent frames.
• It also incorporates validity masks to avoid penalizing areas in which no useful information exists.
• Because they only require a simple video, learning depth and ego-motion on large and varied
datasets becomes possible.
• Code is available at https://sites.google.com/view/vid2depth
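The 3D loss can be sketched as backprojecting each depth map into a point cloud and penalizing the residual after applying the estimated ego-motion. The paper aligns the clouds with an ICP-like step and approximate gradients; the sketch below assumes known correspondences for brevity:

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map to a 3D point cloud using pinhole intrinsics K."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    rays = np.linalg.inv(K) @ pix            # unit-depth viewing rays
    return (rays * depth.reshape(1, -1)).T   # (N, 3) points

def point_cloud_loss(cloud_t, cloud_t1, R, t):
    """Residual between frame t's cloud moved by ego-motion (R, t) and
    frame t+1's cloud, with correspondences assumed known here."""
    moved = cloud_t @ R.T + t
    return np.abs(moved - cloud_t1).mean()
```

When the estimated ego-motion explains the scene rigidly, the moved cloud overlays the next frame's cloud and the loss vanishes.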
11. Unsupervised Learning of Depth and Ego-Motion from
Monocular Video Using 3D Geometric Constraints
12. Unsupervised Learning of Depth and Ego-Motion from
Monocular Video Using 3D Geometric Constraints
13. Unsupervised Learning of Depth and Ego-Motion from
Monocular Video Using 3D Geometric Constraints
14. Unsupervised Learning of Depth and Ego-Motion from
Monocular Video Using 3D Geometric Constraints
“Unsupervised cnn for single view depth
estimation: Geometry to the rescue”
“Unsupervised learning of depth
and ego-motion from video”
15. Look Deeper into Depth: Monocular Depth Estimation
with Semantic Booster and Attention-Driven Loss
• Monocular depth estimation benefits greatly from learning based techniques.
• By studying the training data, they observe that the per-pixel depth values in existing datasets
typically exhibit a long-tailed distribution.
• However, most previous approaches treat all the regions in the training data equally regardless
of the imbalanced depth distribution, which restricts the model performance particularly on
distant depth regions.
• This paper investigates the long-tail property and delves deeper into the distant depth regions (i.e., the tail part) to propose an attention-driven loss for the network supervision.
• In addition, to better leverage the semantic information for monocular depth estimation, it
proposes a synergy network to automatically learn the information sharing strategies between
the two tasks.
• With the proposed attention-driven loss and synergy network, the depth estimation and
semantic labeling tasks can be mutually improved.
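One simple instantiation of the attention-driven idea (not the paper's exact formula) is an L1 loss whose per-pixel weight grows with ground-truth depth, so the rare distant regions of the long tail contribute more to the gradient:

```python
import numpy as np

def attention_driven_l1(pred, gt, alpha=1.0):
    """Depth-aware weighted L1: distant (long-tail) pixels get larger
    weights. A simplified stand-in for the paper's attention-driven loss."""
    w = 1.0 + alpha * gt / gt.max()   # weight grows with ground-truth depth
    return (w * np.abs(pred - gt)).mean()
```

With this weighting, a one-meter error on a far pixel is penalized more than the same error on a near pixel, counteracting the imbalanced depth distribution.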
16. Look Deeper into Depth: Monocular Depth Estimation
with Semantic Booster and Attention-Driven Loss
17. Look Deeper into Depth: Monocular Depth Estimation
with Semantic Booster and Attention-Driven Loss
“Deep convolutional
neural fields for
depth estimation
from a single image”
“Depth map prediction from a single image
using a multi-scale deep network”
“Deeper depth prediction with fully
convolutional residual networks”
18. GeoNet: Geometric Neural Network
• This paper proposes Geometric Neural Network (GeoNet) to jointly predict depth
and surface normal maps from a single image.
• Building on top of two-stream CNNs, the GeoNet incorporates geometric relation
between depth and surface normal via the new depth-to-normal and normal-to-
depth networks.
• The depth-to-normal network exploits the least-squares solution of surface normal from depth and improves its quality with a residual module.
• The normal-to-depth network, conversely, refines the depth map based on the constraints from the surface normal through a kernel regression module, which has no parameters to learn.
• Together, the two networks push the underlying model to predict depth and surface normal maps that are highly consistent and correspondingly more accurate.
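The least-squares step of the depth-to-normal direction can be sketched by fitting a plane to a backprojected 3D neighborhood and taking the smallest singular vector as the normal (a simplified stand-alone version, not the network itself):

```python
import numpy as np

def normal_from_depth_patch(points):
    """Least-squares surface normal of a 3D neighborhood.

    Solves n . (p_i - centroid) = 0 in the least-squares sense: the normal
    is the right singular vector with the smallest singular value of the
    centered point matrix.
    """
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered)
    n = vt[-1]                      # direction of least variance
    return n / np.linalg.norm(n)
```

For points on a plane the fit is exact; the residual module in the paper then corrects for noise and depth quantization.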
23. GeoNet--Unsupervised Learning of Dense
Depth, Optical Flow and Camera Pose
• GeoNet, a jointly unsupervised learning framework for monocular depth, optical
flow and ego-motion estimation from videos.
• The three components are coupled by the nature of 3D scene geometry and jointly learned by the framework in an end-to-end manner.
• Specifically, geometric relationships are extracted over the predictions of
individual modules and then combined as an image reconstruction loss,
reasoning about static and dynamic scene parts separately.
• Furthermore, they propose an adaptive geometric consistency loss to increase
robustness towards outliers and non-Lambertian regions, which resolves
occlusions and texture ambiguities effectively.
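The geometric coupling rests on "rigid flow": the optical flow induced by camera motion alone, computed from the predicted depth and pose, to which a learned residual flow accounts for dynamic objects. A sketch assuming a pinhole camera:

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Optical flow induced by camera motion alone: backproject each pixel
    with its depth, apply the rigid motion (R, t), reproject, and subtract
    the source pixel coordinates. Simplified sketch of GeoNet's rigid flow."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)   # 3D points
    proj = K @ (R @ pts + t.reshape(3, 1))                # move and project
    proj = proj[:2] / proj[2:3]
    return (proj - pix[:2]).reshape(2, h, w)
```

A stationary camera (identity rotation, zero translation) yields zero rigid flow, so any remaining image motion must be explained by the residual flow for dynamic regions.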
27. Unsupervised Learning of Monocular Depth Estimation
and Visual Odometry with Deep Feature Reconstruction
• Despite learning-based methods showing promising results in single-view depth estimation and visual odometry, most existing approaches treat the tasks in a supervised manner.
• Recent approaches to single view depth estimation explore the possibility of learning without full
supervision via minimizing photometric error.
• This paper explores the use of stereo sequences for learning depth and visual odometry.
• The use of stereo sequences enables the use of both spatial (between left-right pairs) and temporal (forward-backward) photometric warp error, and constrains the scene depth and camera motion to be in a common, real-world scale.
• At test time the framework is able to estimate single view depth and two-view odometry from a
monocular sequence.
• They improve on a standard photometric warp loss by considering a warp of deep features.
• The source code is available at https://github.com/Huangying-Zhan/Depth-VO-Feat.
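The combined objective can be sketched as a photometric warp error plus a deep-feature reconstruction term. The warping itself is omitted here, and the feature maps are assumed to have been sampled at the same warped locations as the pixels:

```python
import numpy as np

def reconstruction_loss(img_ref, img_warped, feat_ref, feat_warped, lam=0.1):
    """Photometric warp error plus a deep-feature reconstruction term.

    feat_* stand for dense CNN feature maps; lam balances the two terms
    (the weighting is illustrative, not the paper's tuned value).
    """
    photo = np.abs(img_ref - img_warped).mean()   # raw-intensity term
    feat = np.abs(feat_ref - feat_warped).mean()  # deep-feature term
    return photo + lam * feat
```

Features are more invariant to lighting than raw intensities, which is why adding the feature term improves over the standard photometric warp loss.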
28. Unsupervised Learning of Monocular Depth Estimation
and Visual Odometry with Deep Feature Reconstruction
29. Unsupervised Learning of Monocular Depth Estimation
and Visual Odometry with Deep Feature Reconstruction
30. Unsupervised Learning of Monocular Depth Estimation
and Visual Odometry with Deep Feature Reconstruction
31. Learning Depth from Monocular Videos using Direct
Methods
• The ability to predict depth from a single image - using recent advances in CNNs - is of
increasing interest to the vision community.
• Unsupervised learning strategies are particularly appealing as they can utilize much larger and more varied monocular video datasets during learning without the need for ground-truth depth or stereo.
• In previous works, separate pose and depth CNN predictors had to be determined such
that their joint outputs minimized the photometric error.
• Inspired by recent advances in direct visual odometry (DVO), it argues that the depth
CNN predictor can be learned without a pose CNN predictor.
• Further, they demonstrate empirically that incorporating a differentiable implementation of DVO, along with a novel depth normalization strategy, substantially improves performance over the state of the art that uses monocular videos for training.
• https://github.com/MightyChaos/LKVOLearner
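The depth normalization strategy, as we understand it, removes the global scale ambiguity of monocular training by dividing the predicted (inverse) depth map by its mean before the losses are computed. A sketch:

```python
import numpy as np

def normalize_depth(inv_depth, eps=1e-8):
    """Normalize predicted inverse depth by its mean so every prediction
    has unit mean, removing the global scale ambiguity that otherwise
    lets the network shrink depth arbitrarily (sketch of the idea)."""
    return inv_depth / (inv_depth.mean() + eps)
```

Without such normalization, the smoothness term can be trivially minimized by scaling all depths toward zero; fixing the mean closes that loophole.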
32. Learning Depth from Monocular Videos using Direct
Methods
Unsupervised learning pipeline. The learning
algorithm takes 3 sequential images at a time. The
Depth-CNN produces 3 inverse depth maps for
the inputs, and the pose predictor estimates two
relative camera pose between the second image
and the other two. The appearance dissimilarity
loss is measured between the second image I2
and the inversely warped images of I1, I3; In
addition, the loss is evaluated in a reverse
direction- it is also measured between I1, I3 and
two warped images of I2. Lower part illustrates 3
architectures we evaluated for pose prediction: 1)
Pose-CNN, 2) use the proposed differentiable
Direct Visual Odometry (DDVO), the initialization
of pose is set as zero (identity transformation),
and 3) a hybrid of the above two - use pretrained
Pose-CNN to give a better initial pose for DDVO.
33. Learning Depth from Monocular Videos using Direct Methods
“Unsupervised learning of depth and ego-motion from video”
34. CNN-SVO: Improving the Mapping in Semi-Direct Visual
Odometry Using Single-Image Depth Prediction
• Reliable feature correspondence between frames is a critical step in visual odometry (VO) and
visual simultaneous localization and mapping (V-SLAM) algorithms.
• In comparison with existing VO and V-SLAM algorithms, semi-direct visual odometry (SVO) has two main advantages that lead to state-of-the-art frame-rate camera motion estimation: direct pixel correspondence and an efficient implementation of a probabilistic mapping method.
• This paper improves the SVO mapping by initializing the mean and the variance of the depth at a
feature location according to the depth prediction from a single image depth prediction network.
• By significantly reducing the depth uncertainty of the initialized map point (i.e., small variance
centered about the depth prediction), the benefits are twofold: reliable feature correspondence
between views and fast convergence to the true depth in order to create new map points.
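The difference in map-point initialization can be sketched with a toy depth filter: vanilla SVO starts each filter at the average scene depth with a large variance, while the CNN-aided variant centers it on the network's prediction with a small variance. All values here are illustrative, not SVO's actual parameters:

```python
from dataclasses import dataclass

@dataclass
class DepthFilter:
    mu: float      # mean inverse depth of the filter
    sigma2: float  # variance of the inverse-depth estimate

def init_filter_svo(scene_mean_depth: float) -> DepthFilter:
    """Vanilla SVO-style init: mean at the average scene depth, large
    variance covering the whole plausible depth range."""
    return DepthFilter(mu=1.0 / scene_mean_depth, sigma2=1.0)

def init_filter_cnn(cnn_depth: float, rel_sigma: float = 0.2) -> DepthFilter:
    """CNN-SVO-style init: mean from the single-image depth prediction,
    small variance centered on it, so fewer observations are needed to
    converge to the true depth."""
    mu = 1.0 / cnn_depth
    return DepthFilter(mu=mu, sigma2=(rel_sigma * mu) ** 2)
```

The smaller initial variance also shrinks the epipolar search range, which is what makes feature correspondence between views more reliable.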
35. CNN-SVO: Improving the Mapping in Semi-Direct Visual
Odometry Using Single-Image Depth Prediction
map point initialization
36. CNN-SVO: Improving the Mapping in Semi-Direct Visual
Odometry Using Single-Image Depth Prediction
37. CNN-SVO: Improving the Mapping in Semi-Direct Visual
Odometry Using Single-Image Depth Prediction
Camera motion estimation in an HDR environment. Camera trajectory and map points.
38. CNN-SVO: Improving the Mapping in Semi-Direct Visual
Odometry Using Single-Image Depth Prediction
39. Depth Prediction Without the Sensors: Leveraging Structure for
Unsupervised Learning from Monocular Videos
• Learning to predict depth from RGB is challenging both for indoor and outdoor robot navigation.
• This work addresses unsupervised learning of scene depth and robot ego-motion where
supervision is provided by monocular videos, as cameras are the cheapest, least restrictive and
most ubiquitous sensor for robotics.
• Previous unsupervised image-to-depth learning has established strong baselines in the domain.
• This approach is able to model moving objects and is shown to transfer across data domains, e.g.
from outdoors to indoor scenes.
• The main idea is to introduce geometric structure in the learning process, by modeling the scene
and the individual objects; camera ego-motion and object motions are learned from monocular
videos as input.
• An online refinement method is introduced to adapt learning on the fly to unknown domains.
• The code can be found at https://sites.google.com/view/struct2depth.
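The online refinement idea amounts to running a few self-supervised optimization steps on the incoming video window before predicting. Below is a toy numeric-gradient version with a single scalar parameter; the real method updates network weights against a photometric loss:

```python
def online_refine(params, frames, loss_fn, lr=1e-4, steps=20):
    """Test-time refinement: take a few gradient steps on a self-supervised
    loss over the incoming frames before predicting, adapting the model
    on the fly to an unseen domain. Toy central-difference version."""
    p = float(params)
    eps = 1e-4
    for _ in range(steps):
        # numeric gradient of the self-supervised loss w.r.t. the parameter
        g = (loss_fn(p + eps, frames) - loss_fn(p - eps, frames)) / (2 * eps)
        p -= lr * g
    return p
```

Because the loss needs no labels, this adaptation can run on any video stream, which is how the method transfers, e.g., from outdoor to indoor scenes.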
40. Depth Prediction Without the Sensors: Leveraging Structure for
Unsupervised Learning from Monocular Videos
41. Depth Prediction Without the Sensors: Leveraging Structure for
Unsupervised Learning from Monocular Videos
42. Depth Prediction Without the Sensors: Leveraging Structure for
Unsupervised Learning from Monocular Videos
43. Self-Supervised Learning of Depth and Camera
Motion from 360◦ Videos
• As 360◦ cameras become prevalent in many autonomous systems (e.g., self-driving cars and
drones), efficient 360◦ perception becomes more and more important.
• This is a self-supervised learning approach for predicting the omnidirectional depth and camera
motion from a 360◦ video.
• In particular, starting from SfMLearner, which is designed for cameras with a normal field-of-view, they introduce three key features to process 360◦ images efficiently:
• convert each image from equirectangular projection to cubic projection to avoid image distortion; in each network layer, use Cube Padding (CP), which pads intermediate features from adjacent faces, to avoid image boundaries;
• apply a “spherical” photometric consistency constraint on the whole viewing sphere, so that, unlike with a normal field-of-view, no pixel is projected outside the image boundary;
• rather than estimating 6 independent camera motions (i.e., SfMLearner applied to each face of the cube), apply a camera pose consistency loss to ensure the estimated camera motions reach consensus.
• They collect PanoSUNCG, a dataset containing a large number of 360◦ videos with ground-truth depth and camera motion.
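The camera pose consistency loss can be sketched as penalizing disagreement among the six per-face motion estimates. This sketch is translation-only; real rotations would need averaging on SO(3):

```python
import numpy as np

def pose_consistency_loss(face_poses):
    """Penalize disagreement between the six per-face motion estimates so
    they converge to one consensus camera motion (translation-only sketch)."""
    face_poses = np.asarray(face_poses)        # (6, 3) translation vectors
    consensus = face_poses.mean(axis=0)        # consensus motion
    return np.abs(face_poses - consensus).mean()
```

Since all six cube faces are rigidly attached to the same camera, any disagreement between their estimated motions is pure estimation error, which this loss drives down.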
47. Unsupervised Learning-based Depth Estimation
aided Visual SLAM Approach
• Recently, deep learning technologies have achieved great success in the visual SLAM area, which
can directly learn high-level features from the visual inputs and improve the estimation accuracy
of the depth information.
• Therefore, deep learning technologies maintain the potential to extend the source of the depth
information and improve the performance of the SLAM system.
• However, existing deep learning-based methods are mainly supervised and require a large amount of ground-truth depth data, which is hard to acquire under realistic constraints.
• This paper presents an unsupervised learning framework that not only uses image reconstruction as supervision but also exploits the pose estimation method to enhance the supervisory signal and add training constraints for the task of monocular depth and camera motion estimation.
• Furthermore, it exploits the unsupervised learning framework to assist the traditional ORB-SLAM system when the initialization module of ORB-SLAM cannot match enough features.
• The unsupervised learning framework can significantly accelerate the initialization process of the ORB-SLAM system and effectively improve the accuracy of environmental mapping in strong-lighting and weak-texture scenes.
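The fallback can be sketched as follows: when two-view matching yields too few correspondences for standard initialization, instantiate map points directly from the predicted depth. The interface below is hypothetical, not the paper's actual code:

```python
import numpy as np

def initialize_map(matches, kps, K, pred_depth, min_matches=100):
    """If feature matching yields too few correspondences for ORB-SLAM-style
    two-view triangulation, fall back to creating map points directly from
    the predicted depth map (hypothetical interface for illustration)."""
    if len(matches) >= min_matches:
        return "triangulate", None          # normal two-view initialization
    pts = []
    for (u, v) in kps:
        z = pred_depth[int(v), int(u)]      # predicted depth at the keypoint
        x = (u - K[0, 2]) * z / K[0, 0]     # backproject with pinhole model
        y = (v - K[1, 2]) * z / K[1, 1]
        pts.append((x, y, z))
    return "depth_init", np.array(pts)
```

Because the predicted depth carries an absolute scale prior, this also sidesteps the slow, scale-ambiguous bootstrapping that monocular ORB-SLAM suffers in weak-texture scenes.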