1. Deep VO and SLAM IV
Yu Huang
Yu.haung07@gmail.com
Sunnyvale, California
2. Outline
• Flowdometry: An Optical Flow and Deep Learning Based Approach to Visual Odometry
• Supervising the new with the old: learning SFM from SFM
• Self-Supervised Learning of Depth and Motion Under Photometric Inconsistency
• Digging Into Self-Supervised Monocular Depth Estimation
• Learning monocular visual odometry with dense 3D mapping from dense 3D flow
• Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion,
Optical Flow and Motion Segmentation
• Estimating Metric Scale Visual Odometry from Videos using 3D Convolutional Networks
• GANVO: Unsupervised Deep Monocular Visual Odometry and Depth Estimation with
Generative Adversarial Networks
• DeepPCO: End-to-End Point Cloud Odometry through Deep Parallel Neural Network
• DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency
• Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic
Understanding
3. Flowdometry: An Optical Flow and Deep Learning
Based Approach to Visual Odometry
• https://github.com/petermuller/flowdometry
• Visual odometry is a challenging task related to simultaneous localization and mapping that aims
to reconstruct the path traveled from a visual data stream.
• Based on one or two cameras, motion is estimated from features and pixel differences between
frames.
• Because of the frame rate of the cameras, there are generally small, incremental changes
between subsequent frames where optical flow can be assumed to be proportional to the
physical distance moved by an egocentric reference, such as a camera on a vehicle.
• This paper proposes a visual odometry system called Flowdometry based on optical flow and
deep learning.
• Optical flow images are used as input to a convolutional neural network, which calculates a
rotation and displacement for each image pixel.
• The displacements and rotations are applied incrementally to construct a map of where the
camera has traveled.
• The proposed system is trained and tested on the KITTI visual odometry dataset, and accuracy is
measured by the difference in distances between ground truth and predicted driving trajectories.
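The incremental map-building step described above can be sketched as follows: the per-frame rotation and displacement predicted by the network are composed one frame at a time into a driving trajectory. This is a minimal hypothetical illustration of the integration step, not the paper's actual code.

```python
# Compose per-frame (yaw, forward-distance) predictions into a 2D trajectory.
import numpy as np

def integrate_trajectory(rotations, displacements):
    """Accumulate heading and position from per-frame predictions."""
    heading = 0.0
    x, z = 0.0, 0.0
    path = [(x, z)]
    for dtheta, d in zip(rotations, displacements):
        heading += dtheta                 # accumulate yaw rotation
        x += d * np.sin(heading)          # move along the current heading
        z += d * np.cos(heading)
        path.append((x, z))
    return np.array(path)

# Straight driving: zero rotation, 1 m per frame, 3 frames.
path = integrate_trajectory([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

Trajectory accuracy on KITTI is then measured by comparing such an integrated path against the ground-truth poses.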
4. Flowdometry: An Optical Flow and Deep Learning
Based Approach to Visual Odometry
The Flowdometry convolutional neural network architecture
based on the contractive part of FlowNetS
FlowNetS architecture with the contractive side of the network
6. Supervising the new with the old: learning SFM
from SFM
• Recent work has demonstrated that it is possible to learn deep neural networks for monocular
depth and ego-motion estimation from unlabelled video sequences, an interesting theoretical
development with numerous advantages in applications.
• This paper propose a number of improvements to these approaches.
• First, since such self-supervised approaches are based on the brightness constancy assumption,
which is valid only for a subset of pixels, they apply a probabilistic learning formulation where the
network predicts distributions over variables rather than specific values.
• As these distributions are conditioned on the observed image, the network can learn which scene
and object types are likely to violate the model assumptions, resulting in more robust learning.
• Second, they build on decades of experience in developing handcrafted structure-from-motion
(SFM) algorithms by using an off-the-shelf SFM system to generate a supervisory signal for the
deep neural network.
• While this signal is also noisy, this probabilistic formulation can learn and account for the defects
of SFM, helping to integrate different sources of information and boosting the overall
performance of the network.
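The probabilistic formulation above can be sketched with a simple per-pixel likelihood: instead of a plain photometric residual, the network also predicts a per-pixel scale, and training minimizes a negative log-likelihood so that pixels violating brightness constancy can be down-weighted. A Laplace likelihood is used here purely for illustration; the paper's exact distribution and parameterization may differ.

```python
# Laplace negative log-likelihood with a learned per-pixel scale b:
# NLL = |r| / b + log(2b). Large predicted b de-emphasizes outlier pixels.
import numpy as np

def laplace_nll(residual, scale):
    """Mean negative log-likelihood of residuals under Laplace(0, scale)."""
    scale = np.maximum(scale, 1e-6)       # guard against log(0)
    return np.mean(np.abs(residual) / scale + np.log(2.0 * scale))

residual = np.array([0.1, 0.1, 2.0])      # last pixel breaks brightness constancy
uniform = laplace_nll(residual, np.full(3, 0.5))
adaptive = laplace_nll(residual, np.array([0.1, 0.1, 2.0]))  # large b on outlier
```

Predicting a larger scale on the violating pixel lowers the total loss, which is how the network learns to account for SFM defects and model violations.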
10. Self-Supervised Learning of Depth and Motion
Under Photometric Inconsistency
• The self-supervised learning of depth and pose from monocular sequences provides an attractive
solution by using the photometric consistency of nearby frames as it depends much less on the
ground-truth data.
• This paper addresses the issue when previous assumptions of the self-supervised approaches are
violated due to the dynamic nature of real-world scenes.
• Different from handling the noise as uncertainty, the key idea is to incorporate more robust
geometric quantities and enforce internal consistency in the temporal image sequence.
• Enforcing the depth consistency across adjacent frames significantly improves the depth
estimation with much fewer noisy pixels.
• The geometric information is implicitly embedded into neural networks and does not bring
overhead for inference.
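The depth-consistency idea can be sketched per pixel: depths of the same surface seen in adjacent frames should agree once warped into a common view, and a normalized inconsistency map can flag unreliable (e.g. dynamic) pixels. The sketch below assumes the warping has already been applied; it is illustrative, not the paper's exact loss.

```python
# Scale-normalized depth inconsistency between a frame and its neighbor
# warped into the same view; values near 0 indicate consistent depth.
import numpy as np

def depth_inconsistency(d_t, d_warped):
    """Symmetric, scale-normalized depth difference in [0, 1)."""
    return np.abs(d_t - d_warped) / (d_t + d_warped)

d_t = np.array([10.0, 10.0, 5.0])
d_warped = np.array([10.0, 12.0, 20.0])   # last pixel: likely a moving object
inc = depth_inconsistency(d_t, d_warped)
mask = inc < 0.2                          # keep only consistent pixels
```

Pixels failing such a check can be excluded from the photometric loss, which is where the reduction in noisy depth pixels comes from; at test time no extra computation is needed.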
14. Digging Into Self-Supervised Monocular Depth
Estimation
• Per-pixel ground-truth depth data is challenging to acquire at scale.
• To overcome this limitation, self-supervised learning has emerged as a promising alternative for
training models to perform monocular depth estimation.
• This paper proposes a set of improvements, which together result in both quantitatively and
qualitatively improved depth maps compared to competing self-supervised methods.
• Research on self-supervised monocular training usually explores increasingly complex
architectures, loss functions, and image formation models, all of which have recently helped to
close the gap with fully-supervised methods.
• It shows that a surprisingly simple model, and associated design choices, lead to superior
predictions.
• (i) a minimum reprojection loss, designed to robustly handle occlusions;
• (ii) a full-resolution multi-scale sampling method that reduces visual artifacts;
• (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions.
• https://github.com/nianticlabs/monodepth2
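The first and third design choices above can be sketched per pixel: take the minimum photometric error over the source frames (robust to occlusion), and auto-mask pixels where the unwarped source image already matches better than any warped one (e.g. a static camera, or objects moving at the same speed as the camera). This is an illustrative numpy sketch, not the Monodepth2 implementation.

```python
# Minimum reprojection loss with auto-masking over photometric error maps.
import numpy as np

def min_reprojection_loss(warped_errors, identity_errors):
    """Both inputs: (num_source_frames, num_pixels) photometric errors."""
    min_warped = warped_errors.min(axis=0)    # (i) minimum reprojection
    min_identity = identity_errors.min(axis=0)
    mask = min_warped < min_identity          # (iii) auto-masking
    return np.where(mask, min_warped, 0.0).mean(), mask

warped = np.array([[0.2, 0.9], [0.3, 0.8]])   # 2 source frames, 2 pixels
identity = np.array([[0.5, 0.1], [0.6, 0.1]]) # pixel 2 looks static
loss, mask = min_reprojection_loss(warped, identity)
```

Pixel 2 is excluded because warping does not improve its match, which is exactly the camera-motion violation the auto-mask is meant to catch.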
18. Learning monocular visual odometry with dense 3D
mapping from dense 3D flow
• This paper introduces a fully deep learning approach to monocular SLAM, which can perform
simultaneous localization using a NN for learning visual odometry (L-VO) and dense 3D mapping.
• Dense 2D flow and a depth image are generated from monocular images by sub-networks, which
are then used by a 3D flow associated layer in the L-VO network to generate dense 3D flow.
• Given this 3D flow, the dual stream L-VO network can then predict the 6DOF relative pose and
furthermore reconstruct the vehicle trajectory.
• In order to learn the correlation between motion directions, bivariate Gaussian modeling is
employed in the loss function.
• Moreover, the learned depth is leveraged to generate a dense 3D map.
• As a result, an entire visual SLAM system, that is, learning monocular odometry combined with
dense 3D mapping, is achieved.
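The 3D flow association step can be sketched per pixel: a pixel with depth is back-projected to 3D in the current frame, its 2D-flow correspondence is back-projected in the next frame, and the difference of the two 3D points is the dense 3D flow fed to the L-VO network. Pinhole intrinsics (fx, fy, cx, cy) are assumed here; this is an illustration, not the paper's layer.

```python
# Build per-pixel 3D flow from depth maps and 2D optical flow.
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with depth -> 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

def flow_3d(u, v, flow_uv, depth_t, depth_t1, K):
    fx, fy, cx, cy = K
    p0 = backproject(u, v, depth_t, fx, fy, cx, cy)
    p1 = backproject(u + flow_uv[0], v + flow_uv[1], depth_t1, fx, fy, cx, cy)
    return p1 - p0                            # dense 3D flow at this pixel

K = (100.0, 100.0, 64.0, 48.0)                # assumed intrinsics
f3d = flow_3d(64.0, 48.0, (10.0, 0.0), 5.0, 5.0, K)
```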
22. Competitive Collaboration: Joint Unsupervised Learning of
Depth, Camera Motion, Optical Flow and Motion Segmentation
• It addresses the unsupervised learning of several interconnected problems in low-level vision:
single view depth prediction, camera motion estimation, optical flow, and segmentation of a
video into the static scene and moving regions.
• The key insight is that these four fundamental vision problems are coupled through geometric constraints.
• Consequently, learning to solve them together simplifies the problem because the solutions can
reinforce each other.
• They go beyond previous work by exploiting geometry more explicitly and segmenting the scene
into static and moving regions.
• To that end, it introduces Competitive Collaboration, a framework that facilitates the coordinated
training of multiple specialized NNs to solve complex problems.
• Competitive Collaboration works much like expectation-maximization, but with NNs that act as
both competitors to explain pixels that correspond to static or moving regions, and as
collaborators through a moderator that assigns pixels to be either static or independently moving.
• This method integrates all these problems in a common framework and simultaneously reasons
about the segmentation of the scene into moving objects and the static background, the camera
motion, depth of the static scene structure, and the optical flow of moving objects.
• All models and code are available at https://github.com/anuragranj/cc.
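The moderator's role can be sketched as a soft per-pixel assignment: the flow explained by camera motion plus depth (the static model) competes with the flow network's output (the moving model), and each pixel is assigned a probability of being static based on which model explains the observed flow better, much as in an EM step. This is a hypothetical numpy sketch of the assignment rule only.

```python
# Soft static/moving assignment from model residuals (sigmoid of error gap).
import numpy as np

def moderate(static_flow, moving_flow, observed_flow, temperature=1.0):
    """Return per-pixel P(static) from how well each model fits."""
    e_static = np.linalg.norm(static_flow - observed_flow, axis=-1)
    e_moving = np.linalg.norm(moving_flow - observed_flow, axis=-1)
    return 1.0 / (1.0 + np.exp((e_static - e_moving) / temperature))

static = np.array([[1.0, 0.0], [1.0, 0.0]])
moving = np.array([[1.0, 0.0], [4.0, 0.0]])
observed = np.array([[1.0, 0.0], [4.0, 0.0]])  # 2nd pixel moves independently
m = moderate(static, moving, observed)
```

Pixels the static model cannot explain get a low static probability, and the resulting mask in turn determines which networks train on which pixels.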
27. Estimating Metric Scale Visual Odometry from
Videos using 3D Convolutional Networks
• Monocular visual odometry (VO) is a heavily studied topic in robotics as it enables robust 3D
localization with a ubiquitous, lightweight sensor: a single camera.
• Scale accuracy can only be achieved with geometric methods in one of two ways: 1) by fusing info
from a sensor that measures physical units, such as an IMU or GPS receiver, or 2) by exploiting
prior knowledge about objects in a scene, such as the typical size.
• This is an end-to-end (E2E) deep learning approach for performing metric scale-sensitive regression
tasks, such as visual odometry, with a single camera and no additional sensors.
• They propose a 3D convolutional architecture, 3DC-VO, that can leverage temporal relationships
over a short moving window of images to estimate linear and angular velocities.
• The network makes local predictions on stacks of images that can be integrated to form a full
trajectory.
• https://www.github.com/alexanderkoumis/3dc_vo.
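Because the network predicts metric linear and angular velocities per window, a full trajectory follows by integrating them over the inter-frame interval dt (e.g. dt = 0.1 s at 10 Hz). The planar sketch below is a hypothetical illustration of that integration, not the paper's code.

```python
# Integrate (v [m/s], omega [rad/s]) window predictions into a planar path.
import numpy as np

def integrate_velocities(lin_vels, ang_vels, dt):
    theta, x, y = 0.0, 0.0, 0.0
    path = [(x, y)]
    for v, w in zip(lin_vels, ang_vels):
        theta += w * dt                   # heading from angular velocity
        x += v * dt * np.cos(theta)       # metric step from linear velocity
        y += v * dt * np.sin(theta)
        path.append((x, y))
    return np.array(path)

# 10 m/s straight for 5 windows at 10 Hz -> 5 m traveled.
path = integrate_velocities([10.0] * 5, [0.0] * 5, dt=0.1)
```

The metric scale comes directly from the regressed velocities, which is what distinguishes this setup from scale-ambiguous monocular methods.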
28. Estimating Metric Scale Visual Odometry from
Videos using 3D Convolutional Networks
A 3D convolution
Generic subnetwork structure
30. GANVO: Unsupervised Deep Monocular Visual Odometry and
Depth Estimation with Generative Adversarial Networks
• In the last decade, supervised deep learning approaches have been extensively employed in
visual odometry (VO) applications, an approach that is not feasible in environments where
labelled data is not abundant.
• On the other hand, unsupervised deep learning approaches for localization and mapping in
unknown environments from unlabelled data have received comparatively less attention in VO
research.
• This study proposes a generative unsupervised learning framework that predicts 6-DoF pose
camera motion and monocular depth map of the scene from unlabelled RGB image sequences,
using deep convolutional Generative Adversarial Networks (GANs).
• They create a supervisory signal by warping view sequences and assigning the re-projection
minimization to the objective loss function that is adopted in multi-view pose estimation and
single-view depth generation network.
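The warping-based supervisory signal can be sketched per pixel: a target pixel is lifted to 3D with the predicted depth, transformed by the predicted 6-DoF pose, and projected into the source view; the photometric difference between the target pixel and the sampled source pixel is the re-projection term in the loss. A pinhole model with assumed intrinsics is used; this is an illustration, not GANVO's implementation.

```python
# Reproject a target pixel into the source view via depth and pose (R, t).
import numpy as np

def reproject(u, v, depth, K, R, t):
    fx, fy, cx, cy = K
    p = np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])
    q = R @ p + t                         # point in the source camera frame
    return fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy

K = (100.0, 100.0, 64.0, 48.0)            # assumed intrinsics
R = np.eye(3)
t = np.array([0.5, 0.0, 0.0])             # pure sideways translation
u2, v2 = reproject(64.0, 48.0, 5.0, K, R, t)
```

Minimizing the photometric error at the reprojected location jointly constrains the depth generator and the pose regressor without any labels.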
35. DeepPCO: End-to-End Point Cloud Odometry
through Deep Parallel Neural Network
• Odometry is of key importance for localization in the absence of a map.
• There is considerable work in the area of visual odometry (VO), and recent advances in deep
learning have brought novel approaches to VO, which directly learn salient features from raw
images.
• These learning-based approaches have led to more accurate and robust VO systems.
• However, they have not been well applied to point cloud data yet.
• This work investigates how to exploit deep learning to estimate point cloud odometry (PCO),
which may serve as a critical component in point cloud-based downstream tasks or learning-
based systems.
• Specifically, they propose an end-to-end deep parallel neural network called DeepPCO, which can
estimate the 6-DOF poses using consecutive point clouds.
• It consists of two parallel sub-networks that estimate 3-D translation and orientation respectively,
rather than a single neural network.
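The dual-branch idea can be sketched structurally: a shared encoding of the input point-cloud pair is fed to two separate heads, one regressing the 3-D translation and one the orientation, rather than a single 6-DoF head. The weights below are random placeholders; this is a hypothetical structural sketch, not the DeepPCO architecture.

```python
# Two parallel regression branches on top of a shared feature encoding.
import numpy as np

rng = np.random.default_rng(0)
W_shared = rng.standard_normal((16, 8))   # placeholder shared encoder
W_trans = rng.standard_normal((8, 3))     # translation head -> (x, y, z)
W_rot = rng.standard_normal((8, 3))       # orientation head -> (i, j, k)

def deep_pco_sketch(x):
    h = np.maximum(x @ W_shared, 0.0)     # shared features (ReLU)
    return h @ W_trans, h @ W_rot         # two parallel branches

trans, rot = deep_pco_sketch(rng.standard_normal(16))
```

Separating the heads lets each branch specialize, since translation and rotation live on different scales and error surfaces.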
39. DeepPCO: End-to-End Point Cloud Odometry
through Deep Parallel Neural Network
Ablation experiment with single-branch fully connected layers. All parameter configurations
of the convolutional and fully connected layers are the same as in DeepPCO. Unlike DeepPCO,
in which the transformation vector is trained using two branches, the 3-D translation (x, y, z)
and orientation (i, j, k) are jointly trained and inferred by a single branch here.
40. DeepPCO: End-to-End Point Cloud Odometry
through Deep Parallel Neural Network
“Deep Learning for Laser Based Odometry Estimation”
41. DF-Net: Unsupervised Joint Learning of Depth and
Flow using Cross-Task Consistency
• https://github.com/vt-vl-lab/DF-Net
• It presents an unsupervised learning framework for simultaneously training single-view depth
prediction and optical flow estimation models using unlabeled video sequences.
• Existing unsupervised methods often exploit brightness constancy and spatial smoothness priors
to train depth or flow models.
• This paper proposes to leverage geometric consistency as additional supervisory signals.
• The core idea is that for rigid regions it can use the predicted scene depth and camera motion to
synthesize 2D optical flow by back-projecting the induced 3D scene flow.
• The discrepancy between the rigid flow (from depth prediction and camera motion) and the
estimated flow (from optical flow model) allows us to impose a cross-task consistency loss.
• While all the networks are jointly optimized during training, they can be applied independently at
test time.
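The cross-task consistency check can be sketched per pixel: rigid flow synthesized from the predicted depth and camera motion should match the flow network's output in rigid regions, and their discrepancy supplies the extra supervisory signal. Pinhole intrinsics are assumed; this is illustrative, not DF-Net's implementation.

```python
# Rigid 2D flow induced at a pixel by depth and camera pose, compared to
# the flow network's prediction.
import numpy as np

def rigid_flow(u, v, depth, K, R, t):
    fx, fy, cx, cy = K
    p = np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])
    q = R @ p + t                         # back-project, transform by pose
    u2 = fx * q[0] / q[2] + cx            # project into the second view
    v2 = fy * q[1] / q[2] + cy
    return np.array([u2 - u, v2 - v])

K = (100.0, 100.0, 64.0, 48.0)            # assumed intrinsics
flow_rigid = rigid_flow(64.0, 48.0, 5.0, K, np.eye(3), np.array([0.5, 0.0, 0.0]))
flow_pred = np.array([10.5, 0.2])         # example flow-network output
consistency = np.linalg.norm(flow_rigid - flow_pred)
```

A large discrepancy indicates either a non-rigid region or an error in one of the models, so the loss is typically applied only where the two flows roughly agree.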
45. Every Pixel Counts ++: Joint Learning of Geometry
and Motion with 3D Holistic Understanding
• https://github.com/chenxuluo/EPC
• Learning to estimate 3D geometry in a single frame and optical flow from consecutive frames by
watching unlabeled videos via Deep CNN has made significant progress recently.
• Current SoTA methods treat the two tasks independently. One typical assumption of existing
depth estimation methods is that scenes contain no independently moving objects, while object
motion can be easily modeled using optical flow.
• This paper proposes to address the two tasks as a whole, i.e. to jointly understand per-pixel 3D
geometry and motion.
• This eliminates the need of static scene assumption and enforces the inherent geometrical
consistency during the learning process, yielding significantly improved results for both tasks.
• This method is called “Every Pixel Counts++”, or “EPC++”.
• Specifically, during training, given two consecutive frames from a video, they adopt three parallel
networks to predict the camera motion (MotionNet), dense depth map (DepthNet), and per-pixel
optical flow between two frames (OptFlowNet) respectively.
• The three types of information are fed into a holistic 3D motion parser (HMP), and the per-pixel 3D
motion of both the rigid background and the moving objects is disentangled and recovered.
• Various loss terms are formulated to jointly supervise the three networks.
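The parsing step can be sketched as a decomposition: the full per-pixel 3D motion splits into the rigid part induced by camera motion acting on the static background plus the residual motion of independently moving objects. This is an illustrative numpy sketch of the decomposition only, not the HMP module itself.

```python
# Split total per-point 3D motion into camera-induced rigid motion plus
# residual object motion.
import numpy as np

def parse_motion(points_3d, total_motion, R, t):
    """points_3d, total_motion: (N, 3); (R, t) is the relative camera pose."""
    rigid = (points_3d @ R.T + t) - points_3d  # motion from camera pose alone
    object_motion = total_motion - rigid       # residual: moving objects
    return rigid, object_motion

pts = np.array([[0.0, 0.0, 5.0], [1.0, 0.0, 5.0]])
total = np.array([[0.5, 0.0, 0.0], [2.5, 0.0, 0.0]])  # 2nd point also moves
rigid, obj = parse_motion(pts, total, np.eye(3), np.array([0.5, 0.0, 0.0]))
```

Supervising the three networks so that this decomposition stays consistent is what removes the static-scene assumption.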
47. Every Pixel Counts ++: Joint Learning of Geometry
and Motion with 3D Holistic Understanding
DepthNet