Robust Large Scale Reconstruction of Indoor Scenes
Vincent Kee*, Matthew Graham†, Jeffrey Shapiro*
*Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, USA
† Machine Intelligence Group, Charles Stark Draper Laboratory, USA
Email: vkee@mit.edu, mgraham@draper.com, jshaps@mit.edu
Abstract—Creating high quality three-dimensional models of
indoor scenes promises many applications in augmented reality,
robotics, and object scanning. One popular approach for 3D
model reconstruction is dense visual Simultaneous Localization
and Mapping (SLAM), in which a camera is used to build a
map of the environment while simultaneously localizing itself
within the map. One major drawback of most dense visual SLAM
methods is that they can either build accurate maps in small-scale
environments (using model-based techniques) or estimate their
pose accurately in large-scale environments (using pose graphs),
but not both. Deformation graphs can bridge this gap. They can
be used in large-scale environments while directly optimizing map
reconstruction, theoretically resulting in higher map accuracy.
We investigate how much deformation graphs embedded in the
model’s surface will improve large-scale scene reconstruction
quality. Specifically, this project has added deformation graphs
into Draper’s existing dense visual SLAM system and evaluated
the system on published benchmark datasets and real world tests.
In addition, we compare the performance of our approach to
existing state-of-the-art dense visual SLAM algorithms.
I. INTRODUCTION
For an autonomous mobile robot, building an accurate map
of its surroundings and localizing itself in the map are critical
for intelligent operation in an unknown environment. This
problem, Simultaneous Localization and Mapping (SLAM),
has been the focus of robotics research for the past few decades
[1], [2], [3]. In the past ten years, practical SLAM techniques
have been developed and successfully implemented. However,
most of these methods constructed sparse feature-based maps
that lacked sufficient information for executing tasks beyond
basic navigation [4], [5]. Approaches using LiDAR systems (3D laser scanners) to build high resolution point clouds of the surroundings existed [6], but the sensors' high cost limited adoption and further development. As a recent alternative, RGB-D (red-green-blue-depth) sensors such as the Microsoft Kinect provide accurate depth sensing at a much lower cost.
Coupled with the increased availability of high performance
computing through graphical processing units (GPUs), these
new sensors fueled the development of dense visual SLAM
algorithms such as KinectFusion [7], which produced signif-
icantly higher resolution maps than those of earlier methods.
However, existing dense methods have their limitations due
to their underlying implementation: they can provide accurate
model reconstruction of small-scale scenes [7] or accurate
camera trajectory estimation in large-scale environments [8],
but not both.
A computer graphics data structure presented in [9], the
deformation graph, allows for natural, non-rigid space defor-
Fig. 1. Reconstruction of an office desk dataset [11] produced by our system
mations of any geometric model. The deformation graph can
bridge the gap in existing dense techniques as it can be applied
to model large-scale environments while directly optimizing
surface reconstruction, theoretically resulting in higher model
accuracy.
In this paper, we present our work integrating deformation
graphs into our dense visual SLAM system and results demon-
strating that deformation graphs embedded in the model’s
surface improve large-scale scene reconstruction quality. Our
system is based on KinectFusion [7] and Kintinuous [10], with two significant differences: to prioritize map reconstruction quality, we neither reconstruct the environment in real time nor downsample the point cloud data. Section II discusses existing
related work in the field of dense visual SLAM. Section
III summarizes our approach, outlining the components of
the real-time data collection module and the offline post
processing module. Section IV details our experimental setup
and the benchmark datasets used to evaluate our system against
existing state-of-the-art dense visual SLAM algorithms. Sec-
tion V discusses the qualitative and quantitative results of the
experiments. Section VI presents conclusions and areas for
future research.
II. RELATED WORK
Many dense visual SLAM methods have been developed
over the years, particularly following the release of the
Microsoft Kinect [12], [13]. KinectFusion [7] was the first
algorithm to generate real-time high quality reconstructions
of indoor scenes using a Microsoft Kinect and commodity
TABLE I
FEATURE COMPARISON OF DENSE VISUAL SLAM ALGORITHMS
[Table I compares our system against Kinfu, Kintinuous, and ElasticFusion on the following features: dense map, large-scale environments, surface-model loop closure, dense point cloud, real time, deterministic behavior, and deformation graph.]
Fig. 2. System Architecture Diagram. $m_{ab}$ refers to the movement metric defined in Equation 2, $m_p$ is the place recognition movement threshold, and $m_s$ is the cloud slice movement threshold.
GPU, serving as a model for many future techniques. However,
KinectFusion was limited to operation in small-scale environ-
ments with high geometric diversity as it used only depth
data from the Kinect for tracking and localization. Kintinuous
[10], based on KinectFusion, extended the operating range of
KinectFusion to handle larger scale environments and added
RGB data processing into the pipeline. However, Kintinuous struggled to handle frequently occurring loop closures (determining when part of the environment is rediscovered) and to properly merge the rediscovered data with the model.
Perhaps the most promising approach was presented in 2015,
named ElasticFusion [14], which integrated the deformation
graph structure presented in [9] to optimize map reconstruction
in both small-scale and large-scale environments. However,
both of these methods exhibit nondeterministic behavior due
to their heavy reliance on multi-threading to achieve real time
performance.
The Perception and Localization Group at the Charles Stark Draper Laboratory has developed a dense visual SLAM system based on KinectFusion that uses both depth and color data for surface reconstruction. It works fairly well in
small-scale environments, but struggles with handling loop
closures, particularly in large-scale rooms, leading to incon-
sistent models with misaligned features.
We have integrated deformation graphs into Draper’s system
and applied techniques from [10] to improve performance in
larger environments. Unlike Kintinuous, we do not attempt
to generate the map in real time. Instead, we are post pro-
cessing the data in an attempt to generate higher quality
reconstructions and avoid the nondeterministic behavior issues.
Additionally, we do not downsample point cloud vertices
with a voxel grid filter during surface extraction as done
in [10]. We investigate whether working with denser point
clouds improves map reconstruction quality. Table I outlines
the feature comparison between our system and other dense
visual SLAM algorithms.
III. APPROACH
We adapt the key structure of [7], [10] for performing large-scale dense visual SLAM but divide the system into two separate modules to improve performance. Our system is
organized into two primary components: the data collection
module and the post processing module. Much of the code has been adapted from Draper’s modified repository of the Point
Cloud Library (PCL) [15], an open source library for point
cloud processing, and the open source version of Kintinuous
[10]. Functions that could benefit significantly from paral-
lelization have been implemented with NVIDIA CUDA for
performance as described in [7], [10]. Table II outlines where
each of the major system components is from and how it is
modified in our system.
In this section, we briefly describe each part of the data collection and post processing modules, summarizing their main functions and focusing on the deformation components. The interested reader can see the original papers [7], [10] for further details. The system architecture and data pipeline are shown in Figure 2.
A. Data Collection Module
The data collection module is based on KinectFusion [7]
and Kintinuous [10]. Data is captured from an RGB-D sensor
and processed and fused into a volumetric data structure that
shifts as the sensor moves throughout the environment. We
create a dense pose graph of the camera pose at each processed
TABLE II
SYSTEM COMPONENT ORIGINS
| System Component | Adapted From | Modifications |
| --- | --- | --- |
| Volumetric Fusion | KinectFusion/KinectFusion [7], [15] | FOVIS [16] for improved robustness in camera pose estimation; no downsampling of point cloud data |
| Cloud Slice Extraction | Kintinuous/KinectFusion [10], [15] | Run sequentially rather than concurrently with other data collection methods |
| Place Recognition | Kintinuous [10] | Run sequentially rather than concurrently with other data collection methods |
| Pose Graph Construction and Optimization | Kintinuous [10] | Separated into data collection and post processing modules and optimized using g2o instead of iSAM |
| Deformation Graph | Kintinuous [10] | Separated into an offline post processing module |
RGB-D frame, with loop closure constraints detected by a bag-
of-words-based DBoW place recognition system [17].
The primary components of the data collection module are
volumetric fusion of RGB-D data, cloud slice extraction, and
place recognition.
1) Volumetric Fusion of RGB-D Data: The volumetric fusion component fuses each frame's depth-derived 3D point cloud into one cohesive 3D model of the environment. It is adapted from
KinectFusion [7], with modifications to improve robustness
in camera pose estimation. It consists of four sequential sub-
components: surface measurement, camera pose estimation,
surface reconstruction update, and surface prediction.
The surface measurement subcomponent performs pre-
processing of the depth data in every frame captured by the
RGB-D sensor. Raw depth data is passed through a bilateral
filter to remove noise and converted into a dense 3D point
cloud.
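As an illustrative sketch only (our system uses the GPU pipeline of [7], [15]), the following Python snippet filters a metric depth image and back-projects it into a point cloud; the pinhole intrinsics fx, fy, cx, cy and the filter parameters are placeholder values, not the calibration used in our experiments.

```python
# Sketch (not the authors' implementation): depth preprocessing and back-projection,
# assuming hypothetical pinhole intrinsics for a 640x480 sensor.
import numpy as np
import cv2

def depth_to_point_cloud(depth_m, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """Bilateral-filter a metric depth image and back-project it to a 3D point cloud."""
    # Edge-preserving smoothing of the raw depth (parameters are illustrative only).
    filtered = cv2.bilateralFilter(depth_m.astype(np.float32), 5, 0.03, 4.5)
    h, w = filtered.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = filtered
    valid = z > 0                      # zero depth marks missing measurements
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=1)   # (N, 3) points
```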
The camera pose estimation subcomponent estimates the
camera pose for each frame at time k using the iterative
closest point (ICP) algorithm [18], parallelized on a GPU for
performance. ICP finds the transformation, $T_k$, that minimizes the point-to-plane distance between each point in the first point cloud and the tangent plane at its corresponding point in the second point cloud. This transformation is in the form of a rigid body
matrix:

$$T_k = \begin{bmatrix} R_k & t_k \\ 0 & 1 \end{bmatrix} \in \mathrm{SE}_3 \qquad (1)$$

where $\mathrm{SE}_3 := \{ R \in \mathrm{SO}_3,\; t \in \mathbb{R}^3 \}$. The FOVIS visual
odometry system from [19], [16] is run in parallel to serve
as an auxiliary camera pose estimation system in case ICP
does not converge. This often occurs in planar environments
with few geometric features.
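As a minimal illustration of the quantity being minimized (not our GPU implementation), the sketch below evaluates the point-to-plane error for a candidate rigid-body transform of the form in Equation 1; correspondences and destination normals are assumed to be given.

```python
# Sketch: the point-to-plane error that the ICP step minimizes for a candidate
# rigid-body transform T (Eq. 1). Correspondences and normals are assumed given.
import numpy as np

def make_se3(R, t):
    """Assemble the 4x4 rigid-body matrix of Eq. 1 from R in SO(3) and t in R^3."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def point_to_plane_error(T, src_pts, dst_pts, dst_normals):
    """Sum of squared distances from transformed source points to the tangent
    planes at their corresponding destination points."""
    R, t = T[:3, :3], T[:3, 3]
    transformed = src_pts @ R.T + t                        # apply T to each source point
    residuals = np.einsum('ij,ij->i', transformed - dst_pts, dst_normals)
    return np.sum(residuals ** 2)
```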
The surface reconstruction update subcomponent integrates
each frame’s 3D point cloud into one global model stored
in the GPU using the volumetric truncated signed distance
function (TSDF). The TSDF representation stores at each
voxel the signed distance to the closest object surface, with
negative values for voxels inside an object and zero for voxels
on an object’s surface. The new TSDF is fused with the
existing model’s TSDF with a weighted running average [7].
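The weighted running average can be summarized with the following sketch; the NaN convention for unobserved voxels and the weight cap are illustrative assumptions rather than details of [7].

```python
# Sketch of a TSDF weighted running average update; voxel layout, truncation,
# and the weight cap are illustrative assumptions.
import numpy as np

def fuse_tsdf(tsdf, weights, new_sdf, new_weight=1.0, max_weight=100.0):
    """Fuse a frame's truncated signed distances into the global TSDF volume."""
    valid = ~np.isnan(new_sdf)                       # voxels observed in this frame
    w_old = weights[valid]
    tsdf[valid] = (w_old * tsdf[valid] + new_weight * new_sdf[valid]) / (w_old + new_weight)
    weights[valid] = np.minimum(w_old + new_weight, max_weight)
    return tsdf, weights
```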
The surface prediction subcomponent uses raycasting to
refine the model surface and remove noise. A line is projected
from the camera focal point towards the model surface until
the line intersects the surface. The intersection point and its
corresponding pixel are stored in the refined surface model.
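The sketch below illustrates the zero-crossing search along a single ray; the step size, maximum range, and the interpolating tsdf_lookup helper are assumptions for illustration, not our CUDA implementation.

```python
# Sketch of surface prediction by raycasting a TSDF: march along a camera ray
# until the signed distance changes from positive to negative (zero crossing).
import numpy as np

def raycast_voxel(tsdf_lookup, origin, direction, t_max=4.0, step=0.01):
    """Return the first zero-crossing point along a ray, or None if none is hit.
    tsdf_lookup(p) is assumed to return the interpolated TSDF value at point p."""
    prev_t, prev_val = 0.0, tsdf_lookup(origin)
    for t in np.arange(step, t_max, step):
        val = tsdf_lookup(origin + t * direction)
        if prev_val > 0 and val < 0:                    # surface crossed between samples
            frac = prev_val / (prev_val - val)          # linear interpolation of crossing
            return origin + (prev_t + frac * (t - prev_t)) * direction
        prev_t, prev_val = t, val
    return None
```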
2) Cloud Slice Extraction: By implementing the TSDF
data structure using modular arithmetic for GPU memory
addressing, environments that were previously too large to fit
in the original TSDF representation can now be mapped. This
approach [10] allows the TSDF to act as a cyclical buffer as
the camera moves through the environment. When the camera
translation exceeds the movement threshold, ms (ms = 1.5m
in our experiments), the TSDF volume shifts as necessary to
accommodate the new data and the surface exiting the TSDF
is extracted as a 3D slice of vertices called a cloud slice. This
cloud slice is saved in the world model stored in the CPU and
written to disk for post processing. It is associated with the
camera pose at the time of extraction.
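The shifting logic can be summarized as in the following sketch; extract_exiting_surface and shift_volume are hypothetical stand-ins for the TSDF bookkeeping described in [10].

```python
# Sketch of the cloud-slice trigger: when camera translation since the last shift
# exceeds m_s (1.5 m in our experiments), shift the volume and emit a cloud slice.
import numpy as np

M_S = 1.5  # cloud slice movement threshold in metres

def maybe_extract_slice(cam_t, last_shift_t, extract_exiting_surface, shift_volume):
    """extract_exiting_surface() and shift_volume() stand in for the TSDF bookkeeping."""
    if np.linalg.norm(cam_t - last_shift_t) > M_S:
        cloud_slice = extract_exiting_surface()   # vertices leaving the volume
        shift_volume(cam_t)                       # recentre the cyclical TSDF buffer
        return cloud_slice, cam_t                 # slice is tagged with the current pose
    return None, last_shift_t
```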
3) Camera Pose Graph: A dense pose graph, containing
a camera pose for each processed frame, is constructed for
the camera trajectory. Additionally, the graph has relative
transformations between poses involved in loop closures. This
pose graph is generated for the post processing module, which
will use it to refine the camera pose estimates associated with
each cloud slice, thereby improving the map reconstruction.
Each camera pose $P_i$ is composed of a rotation $R_i \in \mathrm{SO}_3$, a translation $t_i \in \mathbb{R}^3$, and a timestamp.
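A minimal sketch of this bookkeeping is shown below; the PoseNode and PoseGraph types are illustrative, not the data structures used in our implementation.

```python
# Sketch of the dense pose graph: one node per processed frame plus
# relative-transform edges for loop closures (types are illustrative).
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class PoseNode:
    R: np.ndarray          # rotation, 3x3, in SO(3)
    t: np.ndarray          # translation in R^3
    timestamp: float

@dataclass
class PoseGraph:
    nodes: List[PoseNode] = field(default_factory=list)
    # (i, j, T_ij): constraint that pose j ~ pose i composed with T_ij
    edges: List[Tuple[int, int, np.ndarray]] = field(default_factory=list)

    def add_pose(self, R, t, stamp):
        self.nodes.append(PoseNode(R, t, stamp))
        return len(self.nodes) - 1

    def add_loop_closure(self, i, j, T_ij):
        self.edges.append((i, j, T_ij))
```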
4) Place Recognition: The Place Recognition component
[10] detects loop closures to eliminate odometry drift caused
by error accumulation as the system goes through an environ-
ment. It uses Speeded Up Robust Feature (SURF) descriptors
with a bag-of-words-based loop closure detector (DBoW) [17].
New frames are only added to the system when a movement
metric based on rotations and translations exceeds a movement
threshold mp (mp = 0.3 in our experiments) for performance
reasons. The movement metric, $m_{ab}$, is computed from the current pose, $a$, and the pose of the last frame added, $b$:

$$m_{ab} = \left\lVert r(R_a^{-1} R_b) \right\rVert_2 + \left\lVert t_a - t_b \right\rVert_2 \qquad (2)$$

where $r(R) : \mathrm{SO}_3 \rightarrow \mathbb{R}^3$ is the rotation vector form of a rotation matrix. For each RGB-D frame $i$, $[rgb_i, d_i]$, SURF keypoints and descriptors $U_i \in \Omega \times \mathbb{R}^{64}$ are computed
and queried in the bag-of-words descriptor database to detect
matches, which indicate possible loop closures. If a match m
is found, additional validation tests are run on the keypoints,
descriptors Um, and depth data dm of the match frame to avoid
false positives.
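A direct transcription of Equation 2 and the admission test is sketched below (using SciPy's rotation-vector conversion for r(R)); it is for illustration only.

```python
# Sketch of the movement metric of Eq. 2, used to decide whether a frame enters
# the place recognition system (m_p = 0.3 in our experiments).
import numpy as np
from scipy.spatial.transform import Rotation

M_P = 0.3  # place recognition movement threshold

def movement_metric(R_a, t_a, R_b, t_b):
    """m_ab = ||r(R_a^{-1} R_b)||_2 + ||t_a - t_b||_2, with r() the rotation vector."""
    rot_vec = Rotation.from_matrix(R_a.T @ R_b).as_rotvec()
    return np.linalg.norm(rot_vec) + np.linalg.norm(t_a - t_b)

def should_add_frame(R_a, t_a, R_b, t_b):
    return movement_metric(R_a, t_a, R_b, t_b) > M_P
```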
a) SURF Correspondence Threshold: A set of valid SURF correspondences, $G \in \Omega \times \Omega$, is found between the frames using a k-nearest neighbors search. An $\ell_2$-norm
between the SURF descriptors $U_i$ and $U_m$ is used to threshold the matches in $\mathbb{R}^{64}$. The loop closure candidate is discarded if $|G| < \sigma$ ($\sigma = 40$ in our experiments).
b) RANSAC Transformation Estimation: A RANSAC-
based 3-point algorithm [20] is used to estimate a 6-DOF
relative transformation, Te, between camera poses i and m
using the matched image depth data dm and the valid SURF
correspondences G. The loop closure candidate is discarded
if the percentage of inliers is less than µ (µ = 0.35 in our
experiments). Otherwise, the Levenberg-Marquardt algorithm
(LMA) [21] is used to refine Te by minimizing inlier feature
reprojection errors.
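The sketch below outlines the RANSAC stage; a least-squares (Kabsch) fit on three sampled correspondences stands in for the 3-point solver of [20], and the inlier distance is an illustrative value.

```python
# Sketch of the RANSAC stage: repeatedly fit a rigid transform to 3 random 3D-3D
# correspondences, keep the hypothesis with the most inliers, and reject the
# candidate if the inlier ratio falls below mu = 0.35.
import numpy as np

def fit_rigid(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst (Kabsch)."""
    sc, dc = src.mean(0), dst.mean(0)
    U, _, Vt = np.linalg.svd((src - sc).T @ (dst - dc))
    R = Vt.T @ np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))]) @ U.T
    return R, dc - R @ sc

def ransac_transform(src, dst, iters=500, inlier_dist=0.05, mu=0.35, rng=np.random):
    best_R, best_t, best_inliers = None, None, 0
    n = len(src)
    for _ in range(iters):
        idx = rng.choice(n, 3, replace=False)          # minimal sample
        R, t = fit_rigid(src[idx], dst[idx])
        errs = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = int(np.sum(errs < inlier_dist))
        if inliers > best_inliers:
            best_R, best_t, best_inliers = R, t, inliers
    return (best_R, best_t) if best_inliers / n >= mu else None
```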
c) Point Cloud ICP: Non-linear ICP is performed be-
tween the matched depth frames di and dm, using Te as
an initial guess and LMA to minimize nearest neighbor
correspondence distances between the clouds. If the mean $\ell_2^2$-norm of correspondence errors is below $\phi$ ($\phi = 0.01$ in our
experiments), the relative transformation constraint between
the matched poses is added to the pose graph.
B. Post Processing Module
The post processing module is based on Kintinuous [10]
and run offline after data collection. The dense camera pose
graph is optimized using g2o [22]. The optimized pose graph
along with all the vertices from the cloud slices and TSDF are
fed into a deformation graph structure to optimize the model.
The primary components of the post processing module are
pose graph optimization and deformation.
1) Pose Graph Optimization: The camera pose graph,
including the loop closure constraints, is optimized using
g2o, a framework for optimizing graph-based, non-linear error functions [22].
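We do not reproduce g2o here; the sketch below only illustrates the edge residual that such a graph optimizer minimizes, with 4x4 homogeneous pose matrices assumed.

```python
# Sketch (not g2o itself): the residual a pose graph optimizer drives to zero,
# i.e. the discrepancy between each edge's measured relative transform and the
# one implied by the current pose estimates.
import numpy as np
from scipy.spatial.transform import Rotation

def edge_residual(T_i, T_j, T_ij_meas):
    """6-vector residual (rotation vector + translation) for one graph edge."""
    T_err = np.linalg.inv(T_ij_meas) @ np.linalg.inv(T_i) @ T_j
    rot = Rotation.from_matrix(T_err[:3, :3]).as_rotvec()
    return np.concatenate([rot, T_err[:3, 3]])

def total_error(poses, edges):
    """Sum of squared residuals over all (i, j, T_ij) constraints."""
    return sum(np.sum(edge_residual(poses[i], poses[j], T_ij) ** 2)
               for i, j, T_ij in edges)
```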
A deformation graph allows for natural, non-rigid space deformations of any geometric model composed of vertices in $\mathbb{R}^3$ [9], accounting for all measured constraints.
2) Deformation: The deformation component is adapted
from [9], [10].
a) Deformation Graph: A deformation graph enables defined constraints to apply natural, non-rigid space deformations of any geometric model composed of vertices in $\mathbb{R}^3$ [9]. This graph structure is composed of nodes representing affine transformations in $\mathbb{R}^3$ and undirected edges connecting neighboring nodes sharing influence on nearby space. Each node $j$ has a position $g_j \in \mathbb{R}^3$, a set of neighboring nodes $N(j)$, and an affine transformation represented by a rotation matrix $R_j \in \mathrm{SO}_3$ (initialized to the identity matrix) and a translation vector $t_j \in \mathbb{R}^3$ (initialized to the zero vector).
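A node can be represented as in the following sketch (types are illustrative):

```python
# Sketch of a deformation graph node as described above: a position, an affine
# transform initialised to identity/zero, and its neighbours.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class DeformationNode:
    g: np.ndarray                                                # node position in R^3
    R: np.ndarray = field(default_factory=lambda: np.eye(3))     # rotation, identity at start
    t: np.ndarray = field(default_factory=lambda: np.zeros(3))   # translation, zero at start
    neighbours: List[int] = field(default_factory=list)          # indices of N(j)
```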
The deformation graph in the post processing module is initialized by sampling from the g2o-optimized camera pose
graph using the Incremental Deformation Node Sampling
algorithm from [10] and connecting nodes based on their
temporal location in the pose graph to prevent unrelated
regions of the map from overlapping and becoming connected.
Each node causes a space deformation on the nearby space, mapping a point $p \in \mathbb{R}^3$ to $\hat{p}$:

$$\hat{p} = R_j(p - g_j) + g_j + t_j \qquad (3)$$

The set of $k$ nodes deforming $p$ is denoted $N(v)$, with $k = 4$ in our experiments. It is composed of the $k$-nearest neighbors to $p$ in the pose graph that are spatially and temporally optimal, as generated by the Back-Traversal Vertex Association algorithm from [10].
The deformed position of a map point cloud vertex $v$ is given by [9]:

$$\hat{v} = \sum_{k \in N(v)} w_k(v) \left[ R_k(v - g_k) + g_k + t_k \right] \qquad (4)$$

with $w_k(v)$ defined as (normalized over all $k$):

$$w_k(v) = \left( 1 - \frac{\lVert v - g_k \rVert_2}{d_{max}} \right)^2 \qquad (5)$$

where $d_{max}$ is the Euclidean distance to the $k+1$-nearest node to the vertex.
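Equations 4 and 5 can be transcribed as in the sketch below; for simplicity it selects nodes by plain spatial nearest neighbors, whereas our system uses the Back-Traversal Vertex Association of [10].

```python
# Sketch of Equations 4 and 5: deform a map vertex with its k nearest graph nodes,
# weighting each node's affine transform and normalizing the weights.
import numpy as np

def deform_vertex(v, node_positions, node_R, node_t, k=4):
    """node_positions: (M,3); node_R: (M,3,3); node_t: (M,3)."""
    d = np.linalg.norm(node_positions - v, axis=1)
    nearest = np.argsort(d)[:k + 1]                 # k influencing nodes + the (k+1)-th
    idx, d_max = nearest[:k], d[nearest[k]]
    w = (1.0 - d[idx] / d_max) ** 2                 # Eq. 5, before normalization
    w /= w.sum()
    deformed = np.zeros(3)
    for weight, j in zip(w, idx):                   # Eq. 4
        deformed += weight * (node_R[j] @ (v - node_positions[j])
                              + node_positions[j] + node_t[j])
    return deformed
```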
b) Optimization: We use three cost functions from [9],
[10], [23] to optimize the deformation graph. The first cost
function maximizes rigidity in the deformation by biasing
towards isometry and minimizes rotation error over all trans-
formations:
$$E_{rot} = \sum_{l} \left\lVert R_l^{\top} R_l - I \right\rVert_F^2 \qquad (6)$$
where $\lVert \cdot \rVert_F$ denotes the Frobenius norm. The second cost function is a regularization term ensuring a smooth deformation by minimizing the distance between neighboring nodes:

$$E_{reg} = \sum_{l} \sum_{k \in N(l)} \left\lVert R_l(g_k - g_l) + g_l + t_l - (g_k + t_k) \right\rVert_2^2 \qquad (7)$$
The third cost function links the optimized camera pose graph to the map deformation:

$$E_{conP} = \sum_{i} \left\lVert \phi(t_i) - t_i' \right\rVert_2^2 \qquad (8)$$

where $t_i$ is the translation component of node $i$ in the pose graph before optimization, $t_i'$ is the translation component of node $i$ in the pose graph after optimization, and $\phi(t_i)$ is the result of applying Equation 4 to $t_i$.
The total cost function is:

$$w_{rot} E_{rot} + w_{reg} E_{reg} + w_{conP} E_{conP} \qquad (9)$$

with experimental weights $w_{rot} = 1$, $w_{reg} = 10$, and $w_{conP} = 100$ from [10], [9]. We optimize this cost function with iterative Gauss-Newton. The optimized deformation graph is then applied in parallel to all cloud slice vertices using Equation 4.
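The three terms and the weighted total of Equation 9 can be transcribed as follows (an illustrative evaluation only; our implementation builds these residuals inside an iterative Gauss-Newton solver):

```python
# Sketch of the cost terms of Equations 6-8 evaluated for a given set of
# deformation graph parameters; the optimized pose translations are assumed given.
import numpy as np

def e_rot(node_R):
    return sum(np.linalg.norm(R.T @ R - np.eye(3), 'fro') ** 2 for R in node_R)

def e_reg(node_g, node_R, node_t, neighbours):
    total = 0.0
    for l, nbrs in enumerate(neighbours):
        for k in nbrs:
            pred = node_R[l] @ (node_g[k] - node_g[l]) + node_g[l] + node_t[l]
            total += np.linalg.norm(pred - (node_g[k] + node_t[k])) ** 2
    return total

def e_con(pose_t, pose_t_prime, deform):
    """deform(t) applies Equation 4 to a pose translation."""
    return sum(np.linalg.norm(deform(t) - tp) ** 2
               for t, tp in zip(pose_t, pose_t_prime))

def total_cost(node_R, node_g, node_t, neighbours, pose_t, pose_t_prime, deform,
               w_rot=1.0, w_reg=10.0, w_con=100.0):
    return (w_rot * e_rot(node_R)
            + w_reg * e_reg(node_g, node_R, node_t, neighbours)
            + w_con * e_con(pose_t, pose_t_prime, deform))
```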
IV. EXPERIMENTAL SETUP
We describe our experimental setup to evaluate the qualita-
tive and quantitative performance of our system and compare it
against existing state-of-the-art dense visual SLAM algorithms
including KinectFusion [7], Kintinuous [10], and ElasticFu-
sion [14].
A. Hardware
For all tests, we ran all algorithms on a standard laptop
PC running 64-bit Ubuntu 14.04 Linux with an Intel Core i7-
4710MQ 2.5GHz CPU, 8GB DDR3 1600MHz RAM, and an
NVIDIA GeForce GTX 980M 8GB GPU.
TABLE III
DATASET STATISTICS
| Dataset | Duration (s) | Frames | Length (m) | Avg Translational Velocity (m/s) | Avg Angular Velocity (deg/s) | Trajectory Dimensions (m × m × m) |
| --- | --- | --- | --- | --- | --- | --- |
| lr kt0 | 51 | 1510 | 6.54 | 0.126 | - | - |
| lr kt1 | 33 | 967 | 2.05 | 0.063 | - | - |
| lr kt2 | 30 | 882 | 8.43 | 0.282 | - | - |
| lr kt3 | 42 | 1242 | 11.32 | 0.263 | - | - |
| fr1/desk | 23.40 | 1352 | 9.263 | 0.413 | 23.327 | 2.42 x 1.34 x 0.66 |
| fr1/desk2 | 24.86 | 620 | 10.161 | 0.426 | 29.308 | 2.44 x 1.47 x 0.52 |
| fr1/room | 48.90 | 1352 | 15.989 | 0.334 | 29.882 | 2.54 x 2.21 x 0.51 |
| fr1/xyz | 30.09 | 792 | 7.112 | 0.244 | 8.920 | 0.46 x 0.70 x 0.44 |
| fr1/rpy | 27.687 | 694 | 1.664 | 0.062 | 50.147 | 0.15 x 0.21 x 0.21 |
| fr1/plant | 41.53 | 1126 | 14.795 | 0.365 | 27.891 | 1.71 x 1.70 x 1.07 |
| fr2/desk | 99.36 | 620 | 18.880 | 0.193 | 6.338 | 3.90 x 4.13 x 0.57 |
| fr2/xyz | 122.74 | 3615 | 7.029 | 0.058 | 1.716 | 1.30 x 0.96 x 0.72 |
| fr3/long | 87.09 | 2488 | 21.455 | 0.249 | 10.188 | 5.12 x 4.89 x 0.54 |
B. Datasets
Our algorithm must be able to accurately reconstruct models
of a wide variety of indoor environments. Accordingly, we ran
our system on published benchmark data sets.
The Imperial College London and National University of
Ireland Maynooth (ICL-NUIM) have released an extensive
synthetic dataset for benchmarking SLAM algorithms con-
sisting of a living room and office room environment, each
with 4 different sensor trajectories [11]. The synthetic living
room environment dataset has color and depth images
(640×480) recorded at 30Hz, time synchronized ground truth
camera poses, and a 3D ground truth surface model to evaluate
surface reconstruction quality. All trajectories are in the same
environment, and lr kt3 has a small loop closure.
The Computer Vision Group at Technische Universität München (TUM) has also released a large RGB-D SLAM dataset benchmark containing color and depth images (640 × 480) captured at 30Hz with a Microsoft Kinect and time synchronized ground truth sensor poses recorded with a high speed motion-capture system (100 Hz) [24]. Datasets with labels containing ‘fr1’ were recorded in a typical office environment (6 × 6 m²) while those containing ‘fr2’ and ‘fr3’ were recorded in a large industrial hall (10 × 12 m²). The
fr1/desk dataset sweeps over four cluttered desks with mul-
tiple computer monitors and numerous objects. The fr1/desk2
dataset sweeps over the same environment, but a bit faster and
with more trajectory overlap, allowing for more loop closure
testing. The fr1/room dataset begins at the same desk view,
but pans through the entire room, ending back at the desk
shot. This dataset allows for testing larger scale environment
evaluation as well as evaluating loop closure performance. The
fr1/xyz dataset begins at the same desk view but only translates
in the x, y, and z axes with minimal rotation. The fr1/rpy
dataset also begins at the same desk view but only rotates about
the x, y, and z axes with minimal translation. Both of these
datasets allow for simple pose estimation testing. The fr1/plant
dataset has a thorough 360 degree sweep around a potted plant
from all views, which is good for testing reconstruction quality
and loop closure performance. The fr2/desk dataset has a 360
degree sweep around a desk with many objects on it, which
is good for testing loop closure performance. The fr2/xyz
dataset begins at the same desk as fr2/desk, but only translates
very slowly in the x, y, and z axes with minimal rotation,
resulting in very clean data free of motion blur and rolling
shutter effects. This dataset is good for debugging tracking
translation movements. The fr3/long dataset has a 360 degree sweep around desks in a home and office environment with many objects and one large loop closure. The statistics on all
of the datasets are outlined in Table III.
V. EVALUATION
Our system was evaluated qualitatively and quantitatively against existing dense visual SLAM algorithms in terms of trajectory estimation, surface reconstruction, and computational performance. However,
the existing state-of-the-art SLAM algorithms did not achieve
results comparable to published results. This discrepancy could
be attributed to the rewriting of system components to open-source them, which the authors concede may introduce performance regressions¹.
A. Trajectory Estimation
To evaluate sensor trajectory estimation, we calculated the absolute trajectory error (ATE) root mean square error (RMSE) of the estimated trajectory using evaluation tools from [24]. This metric is the root mean square error between each estimated pose and its corresponding ground truth pose, linked by timestamp [24]. As
the results show in Table IV and Figure 3, the system tracks
the position fairly well. We found that pose graph optimization
marginally improves the RMSE accuracy, helping eliminate
drift, although not as much as we originally thought.
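The metric can be sketched as follows; timestamp association is done by nearest neighbor, and the rigid alignment step performed by the tools of [24] is omitted for brevity.

```python
# Sketch of the ATE RMSE: pair estimated and ground-truth poses by timestamp and
# take the RMS of the translational differences (trajectory alignment omitted).
import numpy as np

def ate_rmse(est, gt, max_dt=0.02):
    """est, gt: lists of (timestamp, t) with t a 3-vector translation."""
    errors = []
    gt_times = np.array([s for s, _ in gt])
    for stamp, t_est in est:
        j = int(np.argmin(np.abs(gt_times - stamp)))     # closest ground-truth pose
        if abs(gt_times[j] - stamp) <= max_dt:
            errors.append(np.linalg.norm(t_est - gt[j][1]) ** 2)
    return float(np.sqrt(np.mean(errors)))
```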
Our system performs similarly to Kinfu in some datasets
where Kinfu does not get lost such as lr kt0, lr kt1, and
¹ https://github.com/mp3guy/Kintinuous, https://github.com/mp3guy/ElasticFusion
Fig. 3. Two dimensional trajectory plots of estimated trajectories versus ground truth for datasets fr1/xyz and lr kt2
fr1/room. This result is expected as our system’s pose estimation component is based on Kinfu’s. From previous
experiments using Kinfu, we found that the RGB-D sensor
must move slowly through environments for Kinfu’s ICP-
based pose estimator to track the sensor’s trajectory. As shown
in Table III and IV, Kinfu gets lost in the datasets where the
sensor translates and rotates faster than Kinfu can track such
as fr1/plant and fr1/desk2. Accordingly, Kinfu performs best
in datasets where the sensor’s movements are slow such as
fr1/xyz and fr1/rpy.
Our system is significantly more robust to more drastic
frame-to-frame movements as the FOVIS visual odometry sys-
tem provides a decent transformation estimate when ICP does
not converge. However, Kinfu estimates its actual trajectory
better than our system in some datasets where Kinfu does not
get lost such as fr1/rpy and fr2/xyz. One possible explanation
of this behavior is that pairwise ICP performs better than
FOVIS in these cases. When the initial point-to-plane ICP
does not converge on a valid transformation estimate, Kinfu
tries to get a valid transformation by running pairwise ICP
on the frames. Our system, on the other hand, uses FOVIS
before trying to use pairwise ICP as a last resort only if the
FOVIS solution is invalid. We have not investigated FOVIS
performance versus pairwise ICP, which is an area for future
work.
Compared to Kintinuous in Table IV, our system performs
similarly, as each system tracks more accurately than the other
on different datasets. Although both systems’ pose estimation
modules are based off of Kinfu, our system’s is augmented
with FOVIS while Kintinuous’s combines geometric infor-
mation with photometric constraints to improve robustness in
environments with low geometric features [10].
As seen in Table IV, ElasticFusion outperforms all the other
systems on all datasets except for lr kt3 where it is slightly
outperformed by Kintinuous. This superior performance can be
attributed to its pose tracking module utilizing both geometric
TABLE IV
COMPARISON OF ATE RMSE ON DATASETS
Dataset Ours(m) Kintinuous(m) ElasticFusion(m) Kinfu(m)
lr kt0 0.237349 0.309554 0.197866 0.202962
lr kt1 0.374268 0.029417 0.177573 0.381164
lr kt2 0.661996 0.26592 0.207747 Lost
lr kt3 0.693045 0.169388 0.259297 Lost
fr1/desk 0.711651 0.955585 0.224547 Lost
fr1/desk2 0.685753 0.937414 0.156361 Lost
fr1/room 0.784516 0.937414 0.095809 0.744844
fr1/xyz 0.132682 0.024784 0.017364 0.132682
fr1/rpy 0.058544 0.041976 0.035581 0.03617
fr1/plant 0.532928 0.58716 0.077455 Lost
fr2/desk 0.744481 0.208078 0.15544 Lost
fr2/xyz 0.202528 0.058109 0.0146268 0.153812
fr3/long 1.31371 1.782465 0.043245 1.324468
and photometric constraints in combination with utilizing
frequent local and global loop closures to minimize drift [14].
B. Surface Reconstruction
Qualitatively, our system generates decent quality surface
reconstructions, which share a combination of issues to be
addressed in future work. The main problems are model
feature misalignments and noisy mesh models. As noted by
[10], accurate trajectory estimation does not necessarily lead to
accurate surface reconstructions. However, the misalignment
flaws, such as those depicted in the mesh model from lr kt1 in
Figure 4, are primarily due to poor trajectory estimation. This
error manifests as our system integrates the new data from each
frame into the global model using the estimated pose at which
the frame is processed. By improving the pose estimation, loop
closure detection, and pose and deformation graph optimiza-
tion, this aliasing could be mitigated. Volumetric reintegration
TABLE V
RECONSTRUCTION STATISTICS
| Dataset | Ours Vertices | Ours Faces | Ours Cloud Slices | Ours Loop Closures | Kinfu Vertices | Kinfu Faces | Kintinuous Vertices | Kintinuous Faces | ElasticFusion Vertices |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lr kt0 | 11761467 | 3920489 | 10 | 1 | 4372128 | 1457376 | 243309 | 460133 | 757378 |
| lr kt1 | 10808433 | 3602811 | 6 | 0 | 7690662 | 2563554 | 354427 | 695516 | 672893 |
| lr kt2 | 11539362 | 3846454 | 7 | 0 | 1332318 | 444106 | 451289 | 871573 | 360518 |
| lr kt3 | 15338664 | 5112888 | 4 | 0 | 4357767 | 1452589 | 432376 | 839220 | 717285 |
| fr1/desk | 34462794 | 11487598 | 21 | 0 | 8366436 | 2788812 | 472511 | 873861 | 604127 |
| fr1/desk2 | 8241693 | 2747261 | 7 | 1 | 2930019 | 976673 | 293664 | 542559 | 177618 |
| fr1/room | 24707529 | 8235843 | 15 | 0 | 8366772 | 2788924 | 473425 | 874585 | 5310688 |
| fr1/xyz | 2197521 | 732507 | 0 | 3 | 2197473 | 732491 | 67522 | 126493 | 163975 |
| fr1/rpy | 9098793 | 3032931 | 11 | 6 | 3251754 | 1083918 | 141038 | 261853 | 120761 |
| fr1/plant | 23590230 | 7863410 | 11 | 0 | 6346272 | 2115424 | 368252 | 704246 | 307691 |
| fr2/desk | 7284849 | 2428283 | 7 | 4 | 2929998 | 976666 | 151875 | 26243 | 178237 |
| fr2/xyz | 2370717 | 790239 | 0 | 17 | 2371512 | 790504 | 76974 | 142763 | 358988 |
| fr3/long | 9881976 | 293992 | 4 | 3 | 8749332 | 2916444 | 282927 | 529690 | 951408 |
Fig. 4. Misalignment error in a living room environment in lr kt1. Note how
the two walls should intersect at a right angle, but due to odometry drift, this
is visibly not the case.
and fusion of revisited areas in the environment would also
help improve the model quality and remove misalignments, as
mentioned by [10]. This is a challenging area for future work
[10].
The majority of noise in the model is also due to poor
trajectory estimation. As shown in Figure 5, Kinfu generates
significantly cleaner models for some datasets like fr1/rpy as
the pairwise ICP pose estimates are superior to those generated
by FOVIS in these specific cases. Additional noise could
be reduced by downsampling the point cloud data with a
voxel grid filter as done in [10], which would also improve
performance by reducing the data processed. As shown in
Table V and Figure 6, having more vertices and faces in
the mesh model does not necessarily result in higher quality
models.
Optimizing pose graphs by detecting loop closures and
modifying the model with deformation graphs do improve map
reconstruction quality, as seen when running Kintinuous and
ElasticFusion. Even when trajectory estimation is poor and
suffering from odometry drift, once a loop closure is detected
and the pose and deformation graphs are optimized and applied
Fig. 5. Comparison of models generated by our system (a) and Kinfu (b)
from the fr1/rpy dataset. Note how Kinfu captures the entire desk object while
our system captures only part of the desk object and picks up significant noise
in the background.
to the model, the misaligned components of the map snap into
place, drastically improving the model as can be seen in Figure
6. This model of dataset fr1/room was generated with poor
pose tracking (RMSE of 0.937414) as seen in Table IV, but
has no major aliasing.
When processing datasets with frame-to-frame movements
that our system can track well, our system performs com-
parably with the state-of-the-art as seen in Figure 6. Clearly, ElasticFusion generates the highest quality map reconstruction of dataset fr1/xyz, as the geometric features are best preserved
Fig. 6. Comparison of models from dataset fr1/xyz generated by our system (a), Kinfu (b), Kintinuous (c), and ElasticFusion (d). Note how the models
generated by our system and Kinfu are virtually identical and extremely similar to the one generated by Kintinuous. Also note the accurate and highly detailed
colored geometric features in ElasticFusion.
Fig. 7. Model generated by Kintinuous of the fr1/room dataset. Note how there do not appear to be any major misalignments of features such as the walls at the corners of the room or objects on the table.
and colored.
TABLE VI
COMPARISON OF COMPUTATIONAL PERFORMANCE
Dataset Ours(ms) Kintinuous(ms) ElasticFusion(ms) Kinfu(ms)
lr kt0 75.6045 25.95 32.289 46.7025
lr kt1 29.8339 27.186 34.699 31.2649
lr kt2 27.9252 22.827 34.794 28.1382
lr kt3 30.1917 22.558 37.366 38.6743
fr1/desk 26.5032 22.773 36.198 46.085
fr1/desk2 25.6016 20.064 20.064 29.9576
fr1/room 105.338 19.569 76.903 52.6674
fr1/xyz 25.6625 17.187 34.328 25.5975
fr1/rpy 19.7236 20.578 33.8709 29.2447
fr1/plant 28.9778 29.921 37.349 27.8769
fr2/desk 28.9778 19.355 19.355 30.4725
fr2/xyz 26.3884 19.4279 39.563 24.5707
fr3/long 27.251 23.08 41.6801 27.1341
C. Computational Performance
To assess computational performance, we measured the average time it took each system to process an RGB-D frame (preprocessing the data, estimating the pose, and integrating
the data into the world model). As the results in Table VI
show, our system performs similarly to the other systems for
most datasets. The two outliers are due to FOVIS estimating the transformation between frames, as it is the only component that differs between our system and Kinfu. However, this
slower performance on the two datasets is a minor drawback
as FOVIS drastically improves robustness as seen in the
Trajectory Estimation section and Table IV.
VI. CONCLUSION
We have presented our approach to robust large-scale map reconstruction of indoor scenes. Our main contribution is to divide
the approach of [10] into data collection and post processing
modules to eliminate nondeterministic behavior and use denser
point clouds for map reconstruction. In our experiments, we
show that our system performs comparably to the state-of-the-
art dense visual SLAM algorithms on some datasets. However,
our system struggles to track the frame-to-frame movement
in faster moving datasets as well as reintegrating data from
revisited areas, leading to aliasing. This challenge is an area
for future research.
ACKNOWLEDGMENTS
I would like to thank my SuperUROP supervisor Matthew
Graham for providing extensive support throughout the
project. I would also like to thank my SuperUROP advisor
Professor Jeffrey Shapiro for offering guidance. This work
was supported by SuperUROP funding from the Charles Stark
Draper Laboratory.
REFERENCES
[1] H. Durrant-Whyte and T. Bailey, “Simultaneous localization and map-
ping: part i,” Robotics & Automation Magazine, IEEE, vol. 13, no. 2,
pp. 99–110, 2006.
[2] T. Bailey and H. Durrant-Whyte, “Simultaneous localization and map-
ping (slam): Part ii,” IEEE Robotics & Automation Magazine, vol. 13,
no. 3, pp. 108–117, 2006.
[3] S. Thrun et al., “Robotic mapping: A survey,” Exploring artificial
intelligence in the new millennium, vol. 1, pp. 1–35, 2002.
[4] D. Hahnel, W. Burgard, D. Fox, and S. Thrun, “An efficient fastslam
algorithm for generating maps of large-scale cyclic environments from
raw laser range measurements,” in Intelligent Robots and Systems,
2003.(IROS 2003). Proceedings. 2003 IEEE/RSJ International Confer-
ence on, vol. 1. IEEE, 2003, pp. 206–211.
[5] S. Thrun, W. Burgard, and D. Fox, “A real-time algorithm for mobile
robot mapping with applications to multi-robot and 3d mapping,” in
Robotics and Automation, 2000. Proceedings. ICRA’00. IEEE Interna-
tional Conference on, vol. 1. IEEE, 2000, pp. 321–328.
[6] J. Levinson, M. Montemerlo, and S. Thrun, “Map-based precision
vehicle localization in urban environments.” in Robotics: Science and
Systems, vol. 4. Citeseer, 2007, p. 1.
[7] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim,
A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon,
“Kinectfusion: Real-time dense surface mapping and tracking,” in
Proceedings of the 2011 10th IEEE International Symposium on
Mixed and Augmented Reality, ser. ISMAR ’11. Washington, DC,
USA: IEEE Computer Society, 2011, pp. 127–136. [Online]. Available:
http://dx.doi.org/10.1109/ISMAR.2011.6092378
[8] J. McDonald, M. Kaess, C. Cadena, J. Neira, and J. J. Leonard, “Real-
time 6-dof multi-session visual slam over large-scale environments,”
Robotics and Autonomous Systems, vol. 61, no. 10, pp. 1144–1158, 2013.
[9] R. W. Sumner, J. Schmid, and M. Pauly, “Embedded deformation
for shape manipulation,” in ACM SIGGRAPH 2007 Papers, ser.
SIGGRAPH ’07. New York, NY, USA: ACM, 2007. [Online].
Available: http://doi.acm.org/10.1145/1275808.1276478
[10] T. Whelan, M. Kaess, H. Johannsson, M. Fallon, J. J. Leonard, and
J. McDonald, “Real-time large-scale dense rgb-d slam with volumetric
fusion,” The International Journal of Robotics Research, vol. 34, no.
4-5, pp. 598–626, 2015.
[11] A. Handa, T. Whelan, J. McDonald, and A. Davison, “A benchmark
for RGB-D visual odometry, 3D reconstruction and SLAM,” in IEEE
Intl. Conf. on Robotics and Automation, ICRA, Hong Kong, China, May
2014.
[12] P. Henry, D. Fox, A. Bhowmik, and R. Mongia, “Patch volumes:
Segmentation-based consistent mapping with rgb-d cameras,” in 3D
Vision-3DV 2013, 2013 International Conference on. IEEE, 2013, pp.
398–405.
[13] F. Endres, J. Hess, N. Engelhard, J. Sturm, D. Cremers, and W. Burgard,
“An evaluation of the rgb-d slam system,” in Robotics and Automation
(ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp.
1691–1696.
[14] T. Whelan, S. Leutenegger, R. S. Moreno, B. Glocker, and A. Davison,
“Elasticfusion: Dense slam without a pose graph,” in Proceedings of
Robotics: Science and Systems, Rome, Italy, July 2015.
[15] R. B. Rusu and S. Cousins, “3d is here: Point cloud library (pcl),” in
Robotics and Automation (ICRA), 2011 IEEE International Conference
on. IEEE, 2011, pp. 1–4.
[16] T. Whelan, H. Johannsson, M. Kaess, J. J. Leonard, and J. McDonald,
“Robust real-time visual odometry for dense rgb-d mapping,” in Robotics
and Automation (ICRA), 2013 IEEE International Conference on. IEEE,
2013, pp. 5724–5731.
[17] D. Galvez-Lopez and J. D. Tardos, “Real-time loop detection with
bags of binary words,” in Intelligent Robots and Systems (IROS), 2011
IEEE/RSJ International Conference on. IEEE, 2011, pp. 51–58.
[18] P. J. Besl and N. D. McKay, “Method for registration of 3-d shapes,” in
Robotics-DL tentative. International Society for Optics and Photonics,
1992, pp. 586–606.
[19] A. S. Huang, A. Bachrach, P. Henry, M. Krainin, D. Maturana, D. Fox,
and N. Roy, “Visual odometry and mapping for autonomous flight using
an rgb-d camera,” in International Symposium on Robotics Research
(ISRR), vol. 2, 2011.
[20] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm
for model fitting with applications to image analysis and automated
cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395,
1981.
[21] K. Levenberg, “A method for the solution of certain non–linear problems
in least squares,” 1944.
[22] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard,
“g2o: A general framework for graph optimization,” in Robotics and
Automation (ICRA), 2011 IEEE International Conference on. IEEE,
2011, pp. 3607–3613.
[23] J. Chen, S. Izadi, and A. Fitzgibbon, “KinÊtre: animating the world with
the human body,” in Proceedings of the 25th annual ACM symposium
on User interface software and technology. ACM, 2012, pp. 435–444.
[24] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A
benchmark for the evaluation of rgb-d slam systems,” in Proc. of the
International Conference on Intelligent Robot Systems (IROS), Oct.
2012.

More Related Content

What's hot

“Efficient Deep Learning for 3D Point Cloud Understanding,” a Presentation fr...
“Efficient Deep Learning for 3D Point Cloud Understanding,” a Presentation fr...“Efficient Deep Learning for 3D Point Cloud Understanding,” a Presentation fr...
“Efficient Deep Learning for 3D Point Cloud Understanding,” a Presentation fr...Edge AI and Vision Alliance
 
Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...
Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...
Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...Tomohiro Fukuda
 
OpenStreetMap in 3D - current developments
OpenStreetMap in 3D - current developmentsOpenStreetMap in 3D - current developments
OpenStreetMap in 3D - current developmentsvirtualcitySYSTEMS GmbH
 
Explainable deep learning with applications in Healthcare By Sunil Kumar Vupp...
Explainable deep learning with applications in Healthcare By Sunil Kumar Vupp...Explainable deep learning with applications in Healthcare By Sunil Kumar Vupp...
Explainable deep learning with applications in Healthcare By Sunil Kumar Vupp...Analytics India Magazine
 
MULTIPLE REGION OF INTEREST TRACKING OF NON-RIGID OBJECTS USING DEMON'S ALGOR...
MULTIPLE REGION OF INTEREST TRACKING OF NON-RIGID OBJECTS USING DEMON'S ALGOR...MULTIPLE REGION OF INTEREST TRACKING OF NON-RIGID OBJECTS USING DEMON'S ALGOR...
MULTIPLE REGION OF INTEREST TRACKING OF NON-RIGID OBJECTS USING DEMON'S ALGOR...cscpconf
 
Multiple region of interest tracking of non rigid objects using demon's algor...
Multiple region of interest tracking of non rigid objects using demon's algor...Multiple region of interest tracking of non rigid objects using demon's algor...
Multiple region of interest tracking of non rigid objects using demon's algor...csandit
 
Two mark qn answer
Two mark qn answerTwo mark qn answer
Two mark qn answerJessy Mol
 
Analysis of KinectFusion
Analysis of KinectFusionAnalysis of KinectFusion
Analysis of KinectFusionDong-Won Shin
 
Decomposing image generation into layout priction and conditional synthesis
Decomposing image generation into layout priction and conditional synthesisDecomposing image generation into layout priction and conditional synthesis
Decomposing image generation into layout priction and conditional synthesisNaeem Shehzad
 
Integrating UAV Development Technology with Augmented Reality Toward Landscap...
Integrating UAV Development Technology with Augmented Reality Toward Landscap...Integrating UAV Development Technology with Augmented Reality Toward Landscap...
Integrating UAV Development Technology with Augmented Reality Toward Landscap...Tomohiro Fukuda
 
Tracking Robustness and Green View Index Estimation of Augmented and Diminish...
Tracking Robustness and Green View Index Estimation of Augmented and Diminish...Tracking Robustness and Green View Index Estimation of Augmented and Diminish...
Tracking Robustness and Green View Index Estimation of Augmented and Diminish...Tomohiro Fukuda
 
3D_Exploitation
3D_Exploitation3D_Exploitation
3D_ExploitationPeter Cho
 
How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1wang yaohui
 
Adaptive lifting based image compression scheme using interactive artificial ...
Adaptive lifting based image compression scheme using interactive artificial ...Adaptive lifting based image compression scheme using interactive artificial ...
Adaptive lifting based image compression scheme using interactive artificial ...csandit
 

What's hot (20)

“Efficient Deep Learning for 3D Point Cloud Understanding,” a Presentation fr...
“Efficient Deep Learning for 3D Point Cloud Understanding,” a Presentation fr...“Efficient Deep Learning for 3D Point Cloud Understanding,” a Presentation fr...
“Efficient Deep Learning for 3D Point Cloud Understanding,” a Presentation fr...
 
Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...
Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...
Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...
 
reviewpaper
reviewpaperreviewpaper
reviewpaper
 
OpenStreetMap in 3D - current developments
OpenStreetMap in 3D - current developmentsOpenStreetMap in 3D - current developments
OpenStreetMap in 3D - current developments
 
Explainable deep learning with applications in Healthcare By Sunil Kumar Vupp...
Explainable deep learning with applications in Healthcare By Sunil Kumar Vupp...Explainable deep learning with applications in Healthcare By Sunil Kumar Vupp...
Explainable deep learning with applications in Healthcare By Sunil Kumar Vupp...
 
MULTIPLE REGION OF INTEREST TRACKING OF NON-RIGID OBJECTS USING DEMON'S ALGOR...
MULTIPLE REGION OF INTEREST TRACKING OF NON-RIGID OBJECTS USING DEMON'S ALGOR...MULTIPLE REGION OF INTEREST TRACKING OF NON-RIGID OBJECTS USING DEMON'S ALGOR...
MULTIPLE REGION OF INTEREST TRACKING OF NON-RIGID OBJECTS USING DEMON'S ALGOR...
 
Multiple region of interest tracking of non rigid objects using demon's algor...
Multiple region of interest tracking of non rigid objects using demon's algor...Multiple region of interest tracking of non rigid objects using demon's algor...
Multiple region of interest tracking of non rigid objects using demon's algor...
 
Two mark qn answer
Two mark qn answerTwo mark qn answer
Two mark qn answer
 
Analysis of KinectFusion
Analysis of KinectFusionAnalysis of KinectFusion
Analysis of KinectFusion
 
FinalReport
FinalReportFinalReport
FinalReport
 
Decomposing image generation into layout priction and conditional synthesis
Decomposing image generation into layout priction and conditional synthesisDecomposing image generation into layout priction and conditional synthesis
Decomposing image generation into layout priction and conditional synthesis
 
Masters Thesis
Masters ThesisMasters Thesis
Masters Thesis
 
Kintinuous review
Kintinuous reviewKintinuous review
Kintinuous review
 
Integrating UAV Development Technology with Augmented Reality Toward Landscap...
Integrating UAV Development Technology with Augmented Reality Toward Landscap...Integrating UAV Development Technology with Augmented Reality Toward Landscap...
Integrating UAV Development Technology with Augmented Reality Toward Landscap...
 
Tracking Robustness and Green View Index Estimation of Augmented and Diminish...
Tracking Robustness and Green View Index Estimation of Augmented and Diminish...Tracking Robustness and Green View Index Estimation of Augmented and Diminish...
Tracking Robustness and Green View Index Estimation of Augmented and Diminish...
 
3D_Exploitation
3D_Exploitation3D_Exploitation
3D_Exploitation
 
H010315356
H010315356H010315356
H010315356
 
How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1
 
ei2106-submit-opt-415
ei2106-submit-opt-415ei2106-submit-opt-415
ei2106-submit-opt-415
 
Adaptive lifting based image compression scheme using interactive artificial ...
Adaptive lifting based image compression scheme using interactive artificial ...Adaptive lifting based image compression scheme using interactive artificial ...
Adaptive lifting based image compression scheme using interactive artificial ...
 

Similar to paper

10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdfmokamojah
 
Depth-DensePose: an efficient densely connected deep learning model for came...
Depth-DensePose: an efficient densely connected deep learning  model for came...Depth-DensePose: an efficient densely connected deep learning  model for came...
Depth-DensePose: an efficient densely connected deep learning model for came...IJECEIAES
 
Обучение нейросетей компьютерного зрения в видеоиграх
Обучение нейросетей компьютерного зрения в видеоиграхОбучение нейросетей компьютерного зрения в видеоиграх
Обучение нейросетей компьютерного зрения в видеоиграхAnatol Alizar
 
Satellite Image Classification with Deep Learning Survey
Satellite Image Classification with Deep Learning SurveySatellite Image Classification with Deep Learning Survey
Satellite Image Classification with Deep Learning Surveyijtsrd
 
IRJET- 3D Object Recognition of Car Image Detection
IRJET-  	  3D Object Recognition of Car Image DetectionIRJET-  	  3D Object Recognition of Car Image Detection
IRJET- 3D Object Recognition of Car Image DetectionIRJET Journal
 
Point cloud mesh-investigation_report-lihang
Point cloud mesh-investigation_report-lihangPoint cloud mesh-investigation_report-lihang
Point cloud mesh-investigation_report-lihangLihang Li
 
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4IRJET Journal
 
Efficient resampling features and convolution neural network model for image ...
Efficient resampling features and convolution neural network model for image ...Efficient resampling features and convolution neural network model for image ...
Efficient resampling features and convolution neural network model for image ...IJEECSIAES
 
Efficient resampling features and convolution neural network model for image ...
Efficient resampling features and convolution neural network model for image ...Efficient resampling features and convolution neural network model for image ...
Efficient resampling features and convolution neural network model for image ...nooriasukmaningtyas
 
Conception_et_realisation_dun_site_Web_d.pdf
Conception_et_realisation_dun_site_Web_d.pdfConception_et_realisation_dun_site_Web_d.pdf
Conception_et_realisation_dun_site_Web_d.pdfSofianeHassine2
 
[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...
[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...
[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...Seiya Ito
 
Efficient Point Cloud Pre-processing using The Point Cloud Library
Efficient Point Cloud Pre-processing using The Point Cloud LibraryEfficient Point Cloud Pre-processing using The Point Cloud Library
Efficient Point Cloud Pre-processing using The Point Cloud LibraryCSCJournals
 
A ROS IMPLEMENTATION OF THE MONO-SLAM ALGORITHM
A ROS IMPLEMENTATION OF THE MONO-SLAM ALGORITHMA ROS IMPLEMENTATION OF THE MONO-SLAM ALGORITHM
A ROS IMPLEMENTATION OF THE MONO-SLAM ALGORITHMcsandit
 
Dataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problemsDataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problemsPetteriTeikariPhD
 
Performance Analysis of Iterative Closest Point (ICP) Algorithm using Modifie...
Performance Analysis of Iterative Closest Point (ICP) Algorithm using Modifie...Performance Analysis of Iterative Closest Point (ICP) Algorithm using Modifie...
Performance Analysis of Iterative Closest Point (ICP) Algorithm using Modifie...IRJET Journal
 
A Wireless Network Infrastructure Architecture for Rural Communities
A Wireless Network Infrastructure Architecture for Rural CommunitiesA Wireless Network Infrastructure Architecture for Rural Communities
A Wireless Network Infrastructure Architecture for Rural CommunitiesAIRCC Publishing Corporation
 

Similar to paper (20)

10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
 
sibgrapi2015
sibgrapi2015sibgrapi2015
sibgrapi2015
 
Depth-DensePose: an efficient densely connected deep learning model for came...
Depth-DensePose: an efficient densely connected deep learning  model for came...Depth-DensePose: an efficient densely connected deep learning  model for came...
Depth-DensePose: an efficient densely connected deep learning model for came...
 
Обучение нейросетей компьютерного зрения в видеоиграх
Обучение нейросетей компьютерного зрения в видеоиграхОбучение нейросетей компьютерного зрения в видеоиграх
Обучение нейросетей компьютерного зрения в видеоиграх
 
Satellite Image Classification with Deep Learning Survey
Satellite Image Classification with Deep Learning SurveySatellite Image Classification with Deep Learning Survey
Satellite Image Classification with Deep Learning Survey
 
IRJET- 3D Object Recognition of Car Image Detection
IRJET-  	  3D Object Recognition of Car Image DetectionIRJET-  	  3D Object Recognition of Car Image Detection
IRJET- 3D Object Recognition of Car Image Detection
 
Point cloud mesh-investigation_report-lihang
Point cloud mesh-investigation_report-lihangPoint cloud mesh-investigation_report-lihang
Point cloud mesh-investigation_report-lihang
 
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
Real-time 3D Object Detection on LIDAR Point Cloud using Complex- YOLO V4
 
AR/SLAM for end-users
AR/SLAM for end-usersAR/SLAM for end-users
AR/SLAM for end-users
 
kanimozhi2019.pdf
kanimozhi2019.pdfkanimozhi2019.pdf
kanimozhi2019.pdf
 
Efficient resampling features and convolution neural network model for image ...
Efficient resampling features and convolution neural network model for image ...Efficient resampling features and convolution neural network model for image ...
Efficient resampling features and convolution neural network model for image ...
 
Efficient resampling features and convolution neural network model for image ...
Efficient resampling features and convolution neural network model for image ...Efficient resampling features and convolution neural network model for image ...
Efficient resampling features and convolution neural network model for image ...
 
PointNet
PointNetPointNet
PointNet
 
Conception_et_realisation_dun_site_Web_d.pdf
Conception_et_realisation_dun_site_Web_d.pdfConception_et_realisation_dun_site_Web_d.pdf
Conception_et_realisation_dun_site_Web_d.pdf
 
[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...
[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...
[3D勉強会@関東] Deep Reinforcement Learning of Volume-guided Progressive View Inpa...
 
Efficient Point Cloud Pre-processing using The Point Cloud Library
Efficient Point Cloud Pre-processing using The Point Cloud LibraryEfficient Point Cloud Pre-processing using The Point Cloud Library
Efficient Point Cloud Pre-processing using The Point Cloud Library
 
A ROS IMPLEMENTATION OF THE MONO-SLAM ALGORITHM
A ROS IMPLEMENTATION OF THE MONO-SLAM ALGORITHMA ROS IMPLEMENTATION OF THE MONO-SLAM ALGORITHM
A ROS IMPLEMENTATION OF THE MONO-SLAM ALGORITHM
 
Dataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problemsDataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problems
 
Performance Analysis of Iterative Closest Point (ICP) Algorithm using Modifie...
Performance Analysis of Iterative Closest Point (ICP) Algorithm using Modifie...Performance Analysis of Iterative Closest Point (ICP) Algorithm using Modifie...
Performance Analysis of Iterative Closest Point (ICP) Algorithm using Modifie...
 
A Wireless Network Infrastructure Architecture for Rural Communities
A Wireless Network Infrastructure Architecture for Rural CommunitiesA Wireless Network Infrastructure Architecture for Rural Communities
A Wireless Network Infrastructure Architecture for Rural Communities
 

paper

  • 1. 1 Robust Large Scale Reconstruction of Indoor Scenes Vincent Kee*, Matthew Graham†, Jeffrey Shapiro* *Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, USA † Machine Intelligence Group, Charles Stark Draper Laboratory, USA Email: vkee@mit.edu, mgraham@draper.com, jshaps@mit.edu Abstract—Creating high quality three-dimensional models of indoor scenes promises many applications in augmented reality, robotics, and object scanning. One popular approach for 3D model reconstruction is dense visual Simultaneous Localization and Mapping (SLAM), in which a camera is used to build a map of the environment while simultaneously localizing itself within the map. One major drawback of most dense visual SLAM methods is that they can either build accurate maps in small-scale environments (using model-based techniques) or estimate their pose accurately in large-scale environments (using pose graphs), but not both. Deformation graphs can bridge this gap. They can be used in large-scale environments while directly optimizing map reconstruction, theoretically resulting in higher map accuracy. We investigate how much deformation graphs embedded in the model’s surface will improve large-scale scene reconstruction quality. Specifically, this project has added deformation graphs into Draper’s existing dense visual SLAM system and evaluated the system on published benchmark datasets and real world tests. In addition, we compare the performance of our approach to existing state-of-the-art dense visual SLAM algorithms. I. INTRODUCTION For an autonomous mobile robot, building an accurate map of its surroundings and localizing itself in the map are critical for intelligent operation in an unknown environment. This problem, Simultaneous Localization and Mapping (SLAM), has been the focus of robotics research for the past few decades [1], [2], [3]. In the past ten years, practical SLAM techniques have been developed and successfully implemented. However, most of these methods constructed sparse feature-based maps that lacked sufficient information for executing tasks beyond basic navigation [4], [5]. Approaches using LiDAR systems (3D laser scanner) to build high resolution point clouds of the surroundings existed [6], but the high cost of the system limited adoption and further development. A recent alternative, RGB-D (red-green-blue-depth) sensors such as the Microsoft Kinect provide accurate depth sensing at a much lower cost. Coupled with the increased availability of high performance computing through graphical processing units (GPUs), these new sensors fueled the development of dense visual SLAM algorithms such as KinectFusion [7], which produced signif- icantly higher resolution maps than those of earlier methods. However, existing dense methods have their limitations due to their underlying implementation: they can provide accurate model reconstruction of small-scale scenes [7] or accurate camera trajectory estimation in large-scale environments [8], but not both. A computer graphics data structure presented in [9], the deformation graph, allows for natural, non-rigid space defor- Fig. 1. Reconstruction of an office desk dataset [11] produced by our system mations of any geometric model. The deformation graph can bridge the gap in existing dense techniques as it can be applied to model large-scale environments while directly optimizing surface reconstruction, theoretically resulting in higher model accuracy. 
In this paper, we present our work integrating deformation graphs into our dense visual SLAM system, along with results demonstrating that deformation graphs embedded in the model's surface improve large-scale scene reconstruction quality. Our system is based on KinectFusion [7] and Kintinuous [10], with two significant differences: to prioritize map reconstruction quality, we neither reconstruct the environment in real time nor downsample point cloud data.

Section II discusses existing related work in the field of dense visual SLAM. Section III summarizes our approach, outlining the components of the real-time data collection module and the offline post processing module. Section IV details our experimental setup and the benchmark datasets used to evaluate our system against existing state-of-the-art dense visual SLAM algorithms. Section V discusses the qualitative and quantitative results of the experiments. Section VI presents conclusions and areas for future research.

II. RELATED WORK

Many dense visual SLAM methods have been developed over the years, particularly following the release of the Microsoft Kinect [12], [13]. KinectFusion [7] was the first algorithm to generate real-time, high quality reconstructions of indoor scenes using a Microsoft Kinect and commodity
GPU, serving as a model for many later techniques. However, KinectFusion was limited to operation in small-scale environments with high geometric diversity, as it used only depth data from the Kinect for tracking and localization. Kintinuous [10], based on KinectFusion, extended its operating range to larger scale environments and added RGB data processing to the pipeline. However, Kintinuous struggled to handle frequently occurring loop closures (determining when part of the environment is rediscovered) and to properly merge the rediscovered data with the model. Perhaps the most promising approach, ElasticFusion [14], was presented in 2015; it integrated the deformation graph structure from [9] to optimize map reconstruction in both small-scale and large-scale environments. However, both Kintinuous and ElasticFusion exhibit nondeterministic behavior due to their heavy reliance on multi-threading to achieve real-time performance.

The Perception and Localization Group at the Charles Stark Draper Laboratory has developed a dense visual SLAM system based on KinectFusion that uses both depth and color data for surface reconstruction. It works fairly well in small-scale environments, but struggles with loop closures, particularly in large rooms, leading to inconsistent models with misaligned features.

We have integrated deformation graphs into Draper's system and applied techniques from [10] to improve performance in larger environments. Unlike Kintinuous, we do not attempt to generate the map in real time; instead, we post process the data to generate higher quality reconstructions and to avoid nondeterministic behavior. Additionally, we do not downsample point cloud vertices with a voxel grid filter during surface extraction as done in [10]; we investigate whether working with denser point clouds improves map reconstruction quality. Table I outlines the feature comparison between our system and other dense visual SLAM algorithms.

TABLE I. Feature comparison of dense visual SLAM algorithms. The compared systems (our system, Kinfu, Kintinuous, and ElasticFusion) are characterized by the following features: dense map, large-scale environments, surface-model loop closure, dense point cloud, real time, deterministic behavior, and deformation graph.

Fig. 2. System Architecture Diagram. m_ab refers to the movement metric defined in Equation 2, m_p is the place recognition movement threshold, and m_s is the cloud slice movement threshold.

III. APPROACH

We adopt the key structure of [7], [10] for performing large-scale dense visual SLAM, but divide the system into two separate modules to improve performance: the data collection module and the post processing module. Much of the code has been adapted from Draper's modified repository of the Point Cloud Library (PCL) [15], an open source library for point cloud processing, and from the open source version of Kintinuous [10].
Functions that could benefit significantly from parallelization have been implemented with NVIDIA CUDA for performance, as described in [7], [10]. Table II outlines where each of the major system components comes from and how it has been modified in our system. In this section, each part of the data collection and post processing modules is briefly described, summarizing the main functions and focusing on the deformation components. The interested reader can see the original papers [7], [10] for further details. The system architecture and data pipeline are shown in Figure 2.

A. Data Collection Module

The data collection module is based on KinectFusion [7] and Kintinuous [10]. Data is captured from an RGB-D sensor, then processed and fused into a volumetric data structure that shifts as the sensor moves through the environment. We create a dense pose graph of the camera pose at each processed
RGB-D frame, with loop closure constraints detected by a bag-of-words-based DBoW place recognition system [17]. The primary components of the data collection module are volumetric fusion of RGB-D data, cloud slice extraction, and place recognition.

TABLE II. SYSTEM COMPONENT ORIGINS
System Component | Adapted From | Modifications
Volumetric Fusion | KinectFusion/KinectFusion [7], [15] | FOVIS [16] for improved robustness in camera pose estimation; no downsampling of point cloud data
Cloud Slice Extraction | Kintinuous/KinectFusion [10], [15] | Run sequentially rather than concurrently with other data collection methods
Place Recognition | Kintinuous [10] | Run sequentially rather than concurrently with other data collection methods
Pose Graph Construction and Optimization | Kintinuous [10] | Separated into data collection and post processing modules; optimized using g2o instead of iSAM
Deformation Graph | Kintinuous [10] | Separated into an offline post processing module

1) Volumetric Fusion of RGB-D Data: The volumetric fusion component fuses the depth 3D point clouds into one cohesive 3D model of the environment. It is adapted from KinectFusion [7], with modifications to improve robustness in camera pose estimation. It consists of four sequential subcomponents: surface measurement, camera pose estimation, surface reconstruction update, and surface prediction.

The surface measurement subcomponent performs preprocessing of the depth data in every frame captured by the RGB-D sensor. Raw depth data is passed through a bilateral filter to remove noise and converted into a dense 3D point cloud.

The camera pose estimation subcomponent estimates the camera pose for each frame at time k using the iterative closest point (ICP) algorithm [18], parallelized on a GPU for performance. ICP finds the transformation T_k that minimizes the point-to-plane distance between each point in the incoming point cloud and the tangent plane at its corresponding point in the model point cloud. This transformation is a rigid body matrix:

$$T_k = \begin{bmatrix} R_k & t_k \\ 0 & 1 \end{bmatrix} \in \mathbb{SE}_3 \qquad (1)$$

where $\mathbb{SE}_3 := \{R \in \mathbb{SO}_3,\, t \in \mathbb{R}^3\}$. The FOVIS visual odometry system from [19], [16] is run in parallel to serve as an auxiliary camera pose estimation system in case ICP does not converge, which often occurs in planar environments with few geometric features.

The surface reconstruction update subcomponent integrates each frame's 3D point cloud into one global model stored on the GPU using the volumetric truncated signed distance function (TSDF). The TSDF representation stores at each voxel the signed distance to the closest object surface, with positive values for voxels in free space, negative values for voxels inside an object, and zero for voxels on an object's surface. The new TSDF is fused with the existing model's TSDF using a weighted running average [7].

The surface prediction subcomponent uses raycasting to refine the model surface and remove noise. A ray is projected from the camera focal point towards the model surface until it intersects the surface; the intersection point and its corresponding pixel are stored in the refined surface model.

2) Cloud Slice Extraction: By implementing the TSDF data structure using modular arithmetic for GPU memory addressing, environments that were previously too large to fit in the original TSDF representation can now be mapped. This approach [10] allows the TSDF to act as a cyclical buffer as the camera moves through the environment.
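To make these two ideas concrete, the following is a minimal numpy sketch of a fixed-size TSDF volume addressed with modular arithmetic, so it behaves as a cyclical buffer as the volume origin shifts, combined with per-voxel fusion of a new truncated signed distance sample using a weighted running average in the spirit of [7]. The class name, array layout, truncation distance, and weight cap are illustrative assumptions, not the values or structures used in our system.

```python
import numpy as np

class CyclicalTSDF:
    """Fixed-size TSDF volume reused as a cyclical buffer via modular indexing (illustrative sketch)."""

    def __init__(self, dim=128, voxel_size=0.02, trunc=0.06, max_weight=100.0):
        self.dim = dim                       # voxels per axis (assumed value)
        self.voxel_size = voxel_size         # metres per voxel (assumed value)
        self.trunc = trunc                   # truncation band around the surface
        self.max_weight = max_weight         # cap on the running-average weight
        self.tsdf = np.ones((dim, dim, dim), dtype=np.float32)     # normalized signed distances
        self.weight = np.zeros((dim, dim, dim), dtype=np.float32)  # fusion weights
        self.origin_voxel = np.zeros(3, dtype=np.int64)            # shifts with the camera

    def _index(self, world_voxel):
        # Modular arithmetic maps an unbounded world voxel coordinate into the fixed-size
        # buffer, so the volume can slide with the camera without reallocating memory.
        return tuple((world_voxel - self.origin_voxel) % self.dim)

    def fuse(self, world_voxel, sdf_measurement, measurement_weight=1.0):
        """Fuse one truncated signed distance sample with a weighted running average."""
        d = np.clip(sdf_measurement, -self.trunc, self.trunc) / self.trunc
        i = self._index(np.asarray(world_voxel))
        w_old = self.weight[i]
        self.tsdf[i] = (w_old * self.tsdf[i] + measurement_weight * d) / (w_old + measurement_weight)
        self.weight[i] = min(w_old + measurement_weight, self.max_weight)

    def shift(self, voxel_offset):
        """Move the volume origin; voxels that wrap around would be extracted as a cloud slice."""
        self.origin_voxel += np.asarray(voxel_offset, dtype=np.int64)

# Example: fuse two observations of the same voxel, then shift the volume by 10 voxels in x.
vol = CyclicalTSDF()
vol.fuse((5, 5, 5), 0.03)
vol.fuse((5, 5, 5), 0.01)
vol.shift((10, 0, 0))
```

Repeated observations of the same voxel drive its value toward a weighted average of the measurements, while shifting the origin leaves the buffer contents in place and only changes how world coordinates map into it.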
When the camera translation exceeds the movement threshold m_s (m_s = 1.5 m in our experiments), the TSDF volume shifts as necessary to accommodate the new data, and the surface exiting the TSDF is extracted as a 3D slice of vertices called a cloud slice. Each cloud slice is saved in the world model stored on the CPU and written to disk for post processing, and is associated with the camera pose at the time of extraction.

3) Camera Pose Graph: A dense pose graph, containing a camera pose for each processed frame, is constructed for the camera trajectory. Additionally, the graph holds the relative transformations between poses involved in loop closures. This pose graph is generated for the post processing module, which uses it to refine the camera pose estimates associated with each cloud slice, thereby improving the map reconstruction. Each camera pose P_i is composed of a rotation R_i ∈ SO_3, a translation t_i ∈ R^3, and a timestamp.

4) Place Recognition: The place recognition component [10] detects loop closures to eliminate the odometry drift caused by error accumulation as the system moves through an environment. It uses Speeded Up Robust Feature (SURF) descriptors with a bag-of-words-based loop closure detector (DBoW) [17]. For performance reasons, new frames are only added to the system when a movement metric based on rotations and translations exceeds a movement threshold m_p (m_p = 0.3 in our experiments). The movement metric m_ab is computed from the current pose a and the pose of the last frame added b:

$$m_{ab} = \left\| r(R_a^{-1} R_b) \right\|_2 + \left\| t_a - t_b \right\|_2 \qquad (2)$$

where $r(R) : \mathbb{SO}_3 \to \mathbb{R}^3$ is the rotation vector form of a rotation matrix.
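As a concrete illustration of Equation 2, the sketch below computes the movement metric between two poses with numpy and SciPy's rotation utilities. The pose representation and the use of scipy are assumptions made for this example only, not how the metric is implemented in our pipeline.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def rotation_vector(R):
    """r(R): SO(3) -> R^3, the rotation vector (axis times angle) of a rotation matrix."""
    return Rotation.from_matrix(R).as_rotvec()


def movement_metric(R_a, t_a, R_b, t_b):
    """Equation 2: m_ab = ||r(R_a^{-1} R_b)||_2 + ||t_a - t_b||_2."""
    relative_rotation = R_a.T @ R_b   # R_a^{-1} equals R_a^T for rotation matrices
    return np.linalg.norm(rotation_vector(relative_rotation)) + np.linalg.norm(t_a - t_b)


# Example: a 20 degree rotation about z plus a 0.25 m translation.
R_a, t_a = np.eye(3), np.zeros(3)
R_b = Rotation.from_euler("z", 20, degrees=True).as_matrix()
t_b = np.array([0.25, 0.0, 0.0])
print(movement_metric(R_a, t_a, R_b, t_b))  # ~0.60, which exceeds m_p = 0.3
```

In this example the metric exceeds m_p = 0.3, so the frame would be passed to the place recognition system.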
For each RGB-D frame i, consisting of a color image rgb_i and a depth image d_i, SURF keypoints and descriptors U_i ∈ Ω × R^64 are computed and queried against the bag-of-words descriptor database to detect matches, which indicate possible loop closures. If a match m is found, additional validation tests are run on the keypoints, descriptors U_m, and depth data d_m of the matched frame to avoid false positives.

a) SURF Correspondence Threshold: A set of valid SURF correspondences, G ∈ Ω × Ω, is found between the frames using a k-nearest neighbors search, with an l2-norm between the SURF descriptors U_i and U_m used to threshold the matches in R^64. The loop closure candidate is discarded if |G| < σ (σ = 40 in our experiments).

b) RANSAC Transformation Estimation: A RANSAC-based 3-point algorithm [20] is used to estimate a 6-DOF relative transformation, T_e, between camera poses i and m using the matched image depth data d_m and the valid SURF correspondences G. The loop closure candidate is discarded if the percentage of inliers is less than µ (µ = 0.35 in our experiments). Otherwise, the Levenberg-Marquardt algorithm (LMA) [21] is used to refine T_e by minimizing inlier feature reprojection errors.

c) Point Cloud ICP: Non-linear ICP is performed between the matched depth frames d_i and d_m, using T_e as an initial guess and LMA to minimize nearest neighbor correspondence distances between the clouds. If the mean squared L2-norm of the correspondence errors is below φ (φ = 0.01 in our experiments), the relative transformation constraint between the matched poses is added to the pose graph.

B. Post Processing Module

The post processing module is based on Kintinuous [10] and run offline after data collection. The dense camera pose graph is optimized using g2o [22]. The optimized pose graph, along with all the vertices from the cloud slices and the TSDF, is fed into a deformation graph structure to optimize the model. The primary components of the post processing module are pose graph optimization and deformation.

1) Pose Graph Optimization: The camera pose graph, including the loop closure constraints, is optimized using g2o, a framework for optimizing graph-based, non-linear error functions [22]. The deformation graph, described next, then applies natural, non-rigid space deformations to the geometric model composed of vertices in R^3 [9] while accounting for all measured constraints.

2) Deformation: The deformation component is adapted from [9], [10].

a) Deformation Graph: A deformation graph enables defined constraints to apply natural, non-rigid space deformations to any geometric model composed of vertices in R^3 [9]. This graph structure is composed of nodes representing affine transformations in R^3 and undirected edges connecting neighboring nodes that share influence on nearby space. Each node j has a position g_j ∈ R^3, a set of neighboring nodes N(j), and an affine transformation represented by a rotation matrix R_j ∈ SO_3 (initialized to the identity matrix) and a translation vector t_j ∈ R^3 (initialized to the zero vector). The deformation graph in the post processing module is initialized by sampling from the g2o-optimized camera pose graph using the Incremental Deformation Node Sampling algorithm from [10] and connecting nodes based on their temporal location in the pose graph, which prevents unrelated regions of the map from overlapping and becoming connected. Each node causes a space deformation on the nearby space, mapping a point p ∈ R^3 to p̂:

$$\hat{p} = R_j (p - g_j) + g_j + t_j \qquad (3)$$

The set of k nodes deforming a point is denoted N(v), with k = 4 in our experiments. It is composed of the k-nearest neighbors to the point in the pose graph that are spatially and temporally optimal, generated by the Back-Traversal Vertex Association algorithm from [10]. The deformed position of a map point cloud vertex v is given by [9]:

$$\hat{v} = \sum_{k \in N(v)} w_k(v) \left[ R_k (v - g_k) + g_k + t_k \right] \qquad (4)$$

with w_k(v) defined as (normalized over all k):

$$w_k(v) = \left( 1 - \frac{\| v - g_k \|_2}{d_{max}} \right)^2 \qquad (5)$$

where d_max is the Euclidean distance from the vertex to the (k+1)-nearest node.
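The sketch below applies Equations 4 and 5 to a single vertex given its nearest deformation graph nodes. The function name, input layout, and node values are hypothetical illustrations; node selection via Back-Traversal Vertex Association is outside its scope, and only two nodes are used for brevity (the system uses k = 4).

```python
import numpy as np


def deform_vertex(v, node_positions, node_rotations, node_translations, d_max):
    """Blend the per-node affine transforms (Eq. 4) with weights from Eq. 5.

    v: (3,) vertex; node_*: the k nearest nodes' g_k, R_k, t_k;
    d_max: distance to the (k+1)-nearest node, used to normalize the weights."""
    weights = np.array([
        (1.0 - np.linalg.norm(v - g) / d_max) ** 2 for g in node_positions
    ])
    weights /= weights.sum()                      # normalized over all k nodes
    deformed = np.zeros(3)
    for w, g, R, t in zip(weights, node_positions, node_rotations, node_translations):
        deformed += w * (R @ (v - g) + g + t)     # Eq. 3 applied per node, then blended
    return deformed


# Illustration with two nodes: one identity node and one node translated by 0.1 m in x.
g = [np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])]
R = [np.eye(3), np.eye(3)]
t = [np.zeros(3), np.array([0.1, 0.0, 0.0])]
print(deform_vertex(np.array([0.5, 0.0, 0.0]), g, R, t, d_max=2.0))  # roughly [0.55, 0, 0]
```

A vertex midway between the two nodes is pulled halfway toward the translated node's deformation, which is the smooth blending behavior the weights in Equation 5 are designed to produce.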
b) Optimization: We use three cost functions from [9], [10], [23] to optimize the deformation graph. The first cost function maximizes rigidity in the deformation by biasing each transformation towards an isometry, minimizing the rotation error over all transformations:

$$E_{rot} = \sum_{l} \left\| R_l^{\top} R_l - I \right\|_F^2 \qquad (6)$$

where $\| \cdot \|_F$ denotes the Frobenius norm. The second cost function is a regularization term that ensures a smooth deformation by keeping the transformations of neighboring nodes consistent:

$$E_{reg} = \sum_{l} \sum_{k \in N(l)} \left\| R_l (g_k - g_l) + g_l + t_l - (g_k + t_k) \right\|_2^2 \qquad (7)$$

The third cost function links the optimized camera pose graph to the map deformation:

$$E_{conP} = \sum_{i} \left\| \phi(t_i) - t_i' \right\|_2^2 \qquad (8)$$

where t_i is the translation component of node i in the pose graph before optimization, t_i' is the translation component of node i in the pose graph after optimization, and φ(t_i) is the result of applying Equation 4 to t_i. The total cost function is:

$$w_{rot} E_{rot} + w_{reg} E_{reg} + w_{conP} E_{conP} \qquad (9)$$

with experimental weights w_rot = 1, w_reg = 10, and w_conP = 100 from [9], [10]. We optimize this cost function with iterative Gauss-Newton. The optimized deformation graph is then applied in parallel to all cloud slice vertices using Equation 4.
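For reference, the following sketch evaluates the combined objective of Equation 9 for a small deformation graph. It only computes the cost (the Gauss-Newton solver itself is omitted), and the function name and container layout are hypothetical illustrations rather than our implementation's data structures.

```python
import numpy as np

W_ROT, W_REG, W_CON = 1.0, 10.0, 100.0   # weights from Equation 9


def deformation_cost(nodes, neighbors, constraints, deform_point):
    """nodes: list of (g, R, t); neighbors: dict node index -> neighbor indices;
    constraints: list of (t_before, t_after) pose graph translations;
    deform_point: function applying Equation 4 to a 3D point."""
    # Equation 6: bias each node's transform toward a rigid rotation.
    e_rot = sum(np.linalg.norm(R.T @ R - np.eye(3), ord="fro") ** 2 for _, R, _ in nodes)

    # Equation 7: regularization keeping neighboring node transforms consistent.
    e_reg = 0.0
    for l, nbrs in neighbors.items():
        g_l, R_l, t_l = nodes[l]
        for k in nbrs:
            g_k, _, t_k = nodes[k]
            e_reg += np.linalg.norm(R_l @ (g_k - g_l) + g_l + t_l - (g_k + t_k)) ** 2

    # Equation 8: pull the deformed map toward the optimized pose graph translations.
    e_con = sum(np.linalg.norm(deform_point(t_before) - t_after) ** 2
                for t_before, t_after in constraints)

    return W_ROT * e_rot + W_REG * e_reg + W_CON * e_con


# Trivial usage: a single identity node, no neighbors, one already-satisfied constraint.
nodes = [(np.zeros(3), np.eye(3), np.zeros(3))]
print(deformation_cost(nodes, {0: []}, [(np.zeros(3), np.zeros(3))], lambda p: p))  # 0.0
```

An optimizer such as Gauss-Newton would adjust the node rotations and translations to drive this weighted sum down while the loop closure constraints hold the map to the optimized trajectory.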
IV. EXPERIMENTAL SETUP

We describe our experimental setup for evaluating the qualitative and quantitative performance of our system and comparing it against existing state-of-the-art dense visual SLAM algorithms, including KinectFusion [7], Kintinuous [10], and ElasticFusion [14].

A. Hardware

For all tests, we ran all algorithms on a standard laptop PC running 64-bit Ubuntu 14.04 Linux with an Intel Core i7-4710MQ 2.5 GHz CPU, 8 GB DDR3 1600 MHz RAM, and an NVIDIA GeForce GTX 980M 8 GB GPU.

B. Datasets

Our algorithm must be able to accurately reconstruct models of a wide variety of indoor environments. Accordingly, we ran our system on published benchmark datasets.

Imperial College London and the National University of Ireland Maynooth (ICL-NUIM) have released an extensive synthetic dataset for benchmarking SLAM algorithms, consisting of a living room and an office room environment, each with four different sensor trajectories [11]. The synthetic living room environment dataset has color and depth images (640×480) recorded at 30 Hz, time-synchronized ground truth camera poses, and a 3D ground truth surface model for evaluating surface reconstruction quality. All trajectories are in the same environment, and lr kt3 has a small loop closure.

The Computer Vision Group at Technische Universität München (TUM) has also released a large RGB-D SLAM benchmark containing color and depth images (640×480) captured at 30 Hz with a Microsoft Kinect and time-synchronized ground truth sensor poses recorded with a high speed motion-capture system (100 Hz) [24]. Datasets with labels containing 'fr1' were recorded in a typical office environment (6 × 6 m²), while those containing 'fr2' and 'fr3' were recorded in a large industrial hall (10 × 12 m²). The fr1/desk dataset sweeps over four cluttered desks with multiple computer monitors and numerous objects. The fr1/desk2 dataset sweeps over the same environment, but a bit faster and with more trajectory overlap, allowing for more loop closure testing. The fr1/room dataset begins at the same desk view, but pans through the entire room and ends back at the desk shot; it allows evaluation in a larger scale environment as well as evaluation of loop closure performance. The fr1/xyz dataset begins at the same desk view but only translates along the x, y, and z axes with minimal rotation. The fr1/rpy dataset also begins at the same desk view but only rotates about the x, y, and z axes with minimal translation. Both of these datasets allow for simple pose estimation testing. The fr1/plant dataset has a thorough 360 degree sweep around a potted plant from all views, which is good for testing reconstruction quality and loop closure performance. The fr2/desk dataset has a 360 degree sweep around a desk with many objects on it, which is good for testing loop closure performance. The fr2/xyz dataset begins at the same desk as fr2/desk, but only translates very slowly along the x, y, and z axes with minimal rotation, resulting in very clean data free of motion blur and rolling shutter effects; this dataset is good for debugging translational tracking.
The fr3/long dataset has a 360 degree sweep around desks in a home and office environment with many objects and one large loop closure. The statistics for all of the datasets are outlined in Table III.

TABLE III. DATASET STATISTICS
Dataset | Duration (s) | Frames | Length (m) | Avg Translational Velocity (m/s) | Avg Angular Velocity (deg/s) | Trajectory Dimensions (m × m × m)
lr kt0 | 51 | 1510 | 6.54 | 0.126 | - | -
lr kt1 | 33 | 967 | 2.05 | 0.063 | - | -
lr kt2 | 30 | 882 | 8.43 | 0.282 | - | -
lr kt3 | 42 | 1242 | 11.32 | 0.263 | - | -
fr1/desk | 23.40 | 1352 | 9.263 | 0.413 | 23.327 | 2.42 x 1.34 x 0.66
fr1/desk2 | 24.86 | 620 | 10.161 | 0.426 | 29.308 | 2.44 x 1.47 x 0.52
fr1/room | 48.90 | 1352 | 15.989 | 0.334 | 29.882 | 2.54 x 2.21 x 0.51
fr1/xyz | 30.09 | 792 | 7.112 | 0.244 | 8.920 | 0.46 x 0.70 x 0.44
fr1/rpy | 27.687 | 694 | 1.664 | 0.062 | 50.147 | 0.15 x 0.21 x 0.21
fr1/plant | 41.53 | 1126 | 14.795 | 0.365 | 27.891 | 1.71 x 1.70 x 1.07
fr2/desk | 99.36 | 620 | 18.880 | 0.193 | 6.338 | 3.90 x 4.13 x 0.57
fr2/xyz | 122.74 | 3615 | 7.029 | 0.058 | 1.716 | 1.30 x 0.96 x 0.72
fr3/long | 87.09 | 2488 | 21.455 | 0.249 | 10.188 | 5.12 x 4.89 x 0.54

V. EVALUATION

Our system was evaluated qualitatively and quantitatively against existing dense visual SLAM algorithms in terms of trajectory estimation, surface reconstruction, and computational performance. Note, however, that the existing state-of-the-art SLAM algorithms did not achieve results comparable to their published results in our experiments. This discrepancy could be attributed to the rewriting of system components for the open source releases, which the authors concede may introduce performance regressions.¹

¹ https://github.com/mp3guy/Kintinuous, https://github.com/mp3guy/ElasticFusion

A. Trajectory Estimation

To evaluate sensor trajectory estimation, we calculated the absolute trajectory error (ATE) root mean square error (RMSE) of each trajectory using the evaluation tools from [24]. This metric is the root mean square error between each estimated pose and its corresponding ground truth pose, linked by timestamp [24]. As the results in Table IV and Figure 3 show, the system tracks the position fairly well. We found that pose graph optimization marginally improves the RMSE, helping eliminate drift, although not as much as we originally expected.
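Once estimated and ground truth poses have been associated by timestamp and aligned, the ATE RMSE reduces to a simple computation; the numpy sketch below shows that final step. The function name is a hypothetical illustration, and the timestamp association and rigid alignment performed by the benchmark tools [24] are assumed to have been done already.

```python
import numpy as np


def ate_rmse(estimated_positions, ground_truth_positions):
    """Root mean square of the translational error between matched poses.

    Both inputs are (N, 3) arrays whose rows correspond to the same timestamps,
    already expressed in a common (aligned) frame."""
    errors = estimated_positions - ground_truth_positions
    return np.sqrt(np.mean(np.sum(errors ** 2, axis=1)))


# Example: a constant 5 cm offset in x yields an ATE RMSE of 0.05 m.
gt = np.zeros((100, 3))
est = gt + np.array([0.05, 0.0, 0.0])
print(ate_rmse(est, gt))  # 0.05
```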
Fig. 3. Two dimensional trajectory plots of estimated trajectories versus ground truth for datasets fr1/xyz and lr kt2.

Our system performs similarly to Kinfu on some datasets where Kinfu does not get lost, such as lr kt0, lr kt1, and fr1/room. This result is expected, as our system's pose estimation component is based on Kinfu's. From previous experiments using Kinfu, we found that the RGB-D sensor must move slowly through the environment for Kinfu's ICP-based pose estimator to track the sensor's trajectory. As shown in Tables III and IV, Kinfu gets lost in the datasets where the sensor translates and rotates faster than Kinfu can track, such as fr1/plant and fr1/desk2. Accordingly, Kinfu performs best in datasets where the sensor's movements are slow, such as fr1/xyz and fr1/rpy.

Our system is significantly more robust to drastic frame-to-frame movements, as the FOVIS visual odometry system provides a decent transformation estimate when ICP does not converge. However, Kinfu estimates the trajectory better than our system on some datasets where Kinfu does not get lost, such as fr1/rpy and fr2/xyz. One possible explanation for this behavior is that pairwise ICP performs better than FOVIS in these cases: when the initial point-to-plane ICP does not converge to a valid transformation estimate, Kinfu tries to obtain a valid transformation by running pairwise ICP on the frames, whereas our system uses FOVIS first and only falls back to pairwise ICP as a last resort if the FOVIS solution is invalid. We have not investigated FOVIS performance versus pairwise ICP, which is an area for future work.

Compared to Kintinuous in Table IV, our system performs similarly, with each system tracking more accurately than the other on different datasets. Although both systems' pose estimation modules are based on Kinfu, ours is augmented with FOVIS while Kintinuous's combines geometric information with photometric constraints to improve robustness in environments with few geometric features [10].

As seen in Table IV, ElasticFusion outperforms all the other systems on all datasets except lr kt3, where it is slightly outperformed by Kintinuous. This superior performance can be attributed to its pose tracking module utilizing both geometric and photometric constraints in combination with frequent local and global loop closures to minimize drift [14].

TABLE IV. COMPARISON OF ATE RMSE ON DATASETS
Dataset | Ours (m) | Kintinuous (m) | ElasticFusion (m) | Kinfu (m)
lr kt0 | 0.237349 | 0.309554 | 0.197866 | 0.202962
lr kt1 | 0.374268 | 0.029417 | 0.177573 | 0.381164
lr kt2 | 0.661996 | 0.26592 | 0.207747 | Lost
lr kt3 | 0.693045 | 0.169388 | 0.259297 | Lost
fr1/desk | 0.711651 | 0.955585 | 0.224547 | Lost
fr1/desk2 | 0.685753 | 0.937414 | 0.156361 | Lost
fr1/room | 0.784516 | 0.937414 | 0.095809 | 0.744844
fr1/xyz | 0.132682 | 0.024784 | 0.017364 | 0.132682
fr1/rpy | 0.058544 | 0.041976 | 0.035581 | 0.03617
fr1/plant | 0.532928 | 0.58716 | 0.077455 | Lost
fr2/desk | 0.744481 | 0.208078 | 0.15544 | Lost
fr2/xyz | 0.202528 | 0.058109 | 0.0146268 | 0.153812
fr3/long | 1.31371 | 1.782465 | 0.043245 | 1.324468

B. Surface Reconstruction

Qualitatively, our system generates decent quality surface reconstructions, which share a set of issues to be addressed in future work. The main problems are misaligned model features and noisy mesh models. As noted by [10], accurate trajectory estimation does not necessarily lead to accurate surface reconstruction. However, the misalignment flaws, such as those depicted in the mesh model from lr kt1 in Figure 4, are primarily due to poor trajectory estimation.
This error arises because our system integrates the new data from each frame into the global model using the estimated pose at which the frame is processed. By improving the pose estimation, loop closure detection, and pose and deformation graph optimization, this aliasing could be mitigated. Volumetric reintegration
and fusion of revisited areas in the environment would also help improve model quality and remove misalignments, as mentioned by [10]. This remains a challenging area for future work [10].

Fig. 4. Misalignment error in a living room environment in lr kt1. Note how the two walls should intersect at a right angle, but due to odometry drift, this is visibly not the case.

The majority of the noise in the model is also due to poor trajectory estimation. As shown in Figure 5, Kinfu generates significantly cleaner models for some datasets like fr1/rpy, as the pairwise ICP pose estimates are superior to those generated by FOVIS in these specific cases.

Fig. 5. Comparison of models generated by our system (a) and Kinfu (b) from the fr1/rpy dataset. Note how Kinfu captures the entire desk object while our system captures only part of the desk object and picks up significant noise in the background.

Additional noise could be reduced by downsampling the point cloud data with a voxel grid filter as done in [10], which would also improve performance by reducing the amount of data processed.
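As a point of reference, voxel grid filtering of this kind amounts to snapping points to a regular grid and keeping one representative per occupied cell. The numpy sketch below shows a simple centroid-based version; the function name and leaf size are illustrative assumptions, and this is not the filter implementation used in [10].

```python
import numpy as np


def voxel_grid_downsample(points, leaf_size=0.01):
    """Replace all points falling in the same leaf_size voxel with their centroid.

    points: (N, 3) array; returns an (M, 3) array with M <= N."""
    voxel_indices = np.floor(points / leaf_size).astype(np.int64)
    # Group points by voxel and average each group.
    _, inverse, counts = np.unique(voxel_indices, axis=0,
                                   return_inverse=True, return_counts=True)
    inverse = np.asarray(inverse).reshape(-1)   # flatten for compatibility across numpy versions
    centroids = np.zeros((counts.size, 3))
    np.add.at(centroids, inverse, points)       # unbuffered per-voxel sum
    return centroids / counts[:, None]


# Example: 10,000 random points in a 10 cm cube collapse to at most 1,000 voxel centroids.
cloud = np.random.rand(10000, 3) * 0.1
print(voxel_grid_downsample(cloud).shape)
```

The leaf size trades model detail against the amount of data retained, which is exactly the trade-off we chose not to make when we kept the full-density cloud slices.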
As shown in Table V and Figure 6, having more vertices and faces in the mesh model does not necessarily result in a higher quality model.

TABLE V. RECONSTRUCTION STATISTICS
Dataset | Ours Vertices | Ours Faces | Ours Cloud Slices | Ours Loop Closures | Kinfu Vertices | Kinfu Faces | Kintinuous Vertices | Kintinuous Faces | ElasticFusion Vertices
lr kt0 | 11761467 | 3920489 | 10 | 1 | 4372128 | 1457376 | 243309 | 460133 | 757378
lr kt1 | 10808433 | 3602811 | 6 | 0 | 7690662 | 2563554 | 354427 | 695516 | 672893
lr kt2 | 11539362 | 3846454 | 7 | 0 | 1332318 | 444106 | 451289 | 871573 | 360518
lr kt3 | 15338664 | 5112888 | 4 | 0 | 4357767 | 1452589 | 432376 | 839220 | 717285
fr1/desk | 34462794 | 11487598 | 21 | 0 | 8366436 | 2788812 | 472511 | 873861 | 604127
fr1/desk2 | 8241693 | 2747261 | 7 | 1 | 2930019 | 976673 | 293664 | 542559 | 177618
fr1/room | 24707529 | 8235843 | 15 | 0 | 8366772 | 2788924 | 473425 | 874585 | 5310688
fr1/xyz | 2197521 | 732507 | 0 | 3 | 2197473 | 732491 | 67522 | 126493 | 163975
fr1/rpy | 9098793 | 3032931 | 11 | 6 | 3251754 | 1083918 | 141038 | 261853 | 120761
fr1/plant | 23590230 | 7863410 | 11 | 0 | 6346272 | 2115424 | 368252 | 704246 | 307691
fr2/desk | 7284849 | 2428283 | 7 | 4 | 2929998 | 976666 | 151875 | 26243 | 178237
fr2/xyz | 2370717 | 790239 | 0 | 17 | 2371512 | 790504 | 76974 | 142763 | 358988
fr3/long | 9881976 | 293992 | 4 | 3 | 8749332 | 2916444 | 282927 | 529690 | 951408

Optimizing pose graphs by detecting loop closures and modifying the model with deformation graphs does improve map reconstruction quality, as seen when running Kintinuous and ElasticFusion. Even when trajectory estimation is poor and suffers from odometry drift, once a loop closure is detected and the pose and deformation graphs are optimized and applied to the model, the misaligned components of the map snap into place, drastically improving the model, as can be seen in Figure 7. That model of dataset fr1/room was generated with poor pose tracking (an RMSE of 0.937414, as seen in Table IV), yet has no major aliasing.

When processing datasets with frame-to-frame movements that our system can track well, our system performs comparably with the state of the art, as seen in Figure 6. ElasticFusion clearly generates the highest quality map reconstruction of dataset fr1/xyz, as the geometric features are best preserved and colored.

Fig. 6. Comparison of models from dataset fr1/xyz generated by our system (a), Kinfu (b), Kintinuous (c), and ElasticFusion (d). Note how the models generated by our system and Kinfu are virtually identical and extremely similar to the one generated by Kintinuous. Also note the accurate and highly detailed colored geometric features in ElasticFusion.

Fig. 7. Model generated by Kintinuous of the fr1/room dataset. Note how there do not appear to be any major misalignments of features such as the walls at the corners of the room or objects on the table.

C. Computational Performance

To measure computational performance, we measured the average time it took each system to process an RGB-D frame (preprocessing the data, estimating the pose, and integrating the data into the world model).

TABLE VI. COMPARISON OF COMPUTATIONAL PERFORMANCE (average per-frame processing time)
Dataset | Ours (ms) | Kintinuous (ms) | ElasticFusion (ms) | Kinfu (ms)
lr kt0 | 75.6045 | 25.95 | 32.289 | 46.7025
lr kt1 | 29.8339 | 27.186 | 34.699 | 31.2649
lr kt2 | 27.9252 | 22.827 | 34.794 | 28.1382
lr kt3 | 30.1917 | 22.558 | 37.366 | 38.6743
fr1/desk | 26.5032 | 22.773 | 36.198 | 46.085
fr1/desk2 | 25.6016 | 20.064 | 20.064 | 29.9576
fr1/room | 105.338 | 19.569 | 76.903 | 52.6674
fr1/xyz | 25.6625 | 17.187 | 34.328 | 25.5975
fr1/rpy | 19.7236 | 20.578 | 33.8709 | 29.2447
fr1/plant | 28.9778 | 29.921 | 37.349 | 27.8769
fr2/desk | 28.9778 | 19.355 | 19.355 | 30.4725
fr2/xyz | 26.3884 | 19.4279 | 39.563 | 24.5707
fr3/long | 27.251 | 23.08 | 41.6801 | 27.1341
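Averages of this kind come from timing the full per-frame pipeline and dividing by the number of frames processed; a minimal sketch of that measurement is given below, with a stand-in process_frame callable as a hypothetical placeholder for the real pipeline.

```python
import time


def average_frame_time_ms(frames, process_frame):
    """Average wall-clock time, in milliseconds, spent processing each frame."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)   # stand-in for preprocessing, pose estimation, and fusion
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(frames)


# Stand-in usage: a dummy per-frame workload in place of the real pipeline.
print(average_frame_time_ms(range(100), lambda frame: sum(i * i for i in range(10000))))
```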
As the results in Table VI show, our system performs similarly to the other systems for most datasets. The two outliers are due to FOVIS estimating the transformation between frames, as FOVIS is the only component that differs between our system and Kinfu. However, this slower performance on two datasets is a minor drawback, as FOVIS drastically improves robustness, as shown in the Trajectory Estimation section and Table IV.

VI. CONCLUSION

We have presented our approach to robust large-scale map reconstruction of indoor scenes. Our main contribution is to divide the approach of [10] into data collection and post processing modules in order to eliminate nondeterministic behavior and to use denser point clouds for map reconstruction. Our experiments show that our system performs comparably to state-of-the-art dense visual SLAM algorithms on some datasets. However, our system struggles to track frame-to-frame movement in faster moving datasets and to reintegrate data from revisited areas, leading to aliasing. These challenges are areas for future research.

ACKNOWLEDGMENTS

I would like to thank my SuperUROP supervisor Matthew Graham for providing extensive support throughout the project. I would also like to thank my SuperUROP advisor Professor Jeffrey Shapiro for offering guidance. This work was supported by SuperUROP funding from the Charles Stark Draper Laboratory.

REFERENCES

[1] H. Durrant-Whyte and T. Bailey, "Simultaneous localization and mapping: part I," IEEE Robotics & Automation Magazine, vol. 13, no. 2, pp. 99-110, 2006.
[2] T. Bailey and H. Durrant-Whyte, "Simultaneous localization and mapping (SLAM): part II," IEEE Robotics & Automation Magazine, vol. 13, no. 3, pp. 108-117, 2006.
[3] S. Thrun et al., "Robotic mapping: A survey," Exploring Artificial Intelligence in the New Millennium, vol. 1, pp. 1-35, 2002.
[4] D. Hahnel, W. Burgard, D. Fox, and S. Thrun, "An efficient FastSLAM algorithm for generating maps of large-scale cyclic environments from raw laser range measurements," in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), vol. 1, 2003, pp. 206-211.
[5] S. Thrun, W. Burgard, and D. Fox, "A real-time algorithm for mobile robot mapping with applications to multi-robot and 3D mapping," in Proc. IEEE International Conference on Robotics and Automation (ICRA), vol. 1, 2000, pp. 321-328.
[6] J. Levinson, M. Montemerlo, and S. Thrun, "Map-based precision vehicle localization in urban environments," in Robotics: Science and Systems, vol. 4, 2007, p. 1.
[7] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in Proc. IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2011, pp. 127-136. [Online]. Available: http://dx.doi.org/10.1109/ISMAR.2011.6092378
[8] J. McDonald, M. Kaess, C. Cadena, J. Neira, and J. J. Leonard, "Real-time 6-DOF multi-session visual SLAM over large-scale environments," Robotics and Autonomous Systems, vol. 61, no. 10, pp. 1144-1158, 2013.
[9] R. W. Sumner, J. Schmid, and M. Pauly, "Embedded deformation for shape manipulation," in ACM SIGGRAPH 2007 Papers, 2007. [Online]. Available: http://doi.acm.org/10.1145/1275808.1276478
[10] T. Whelan, M. Kaess, H.
Johannsson, M. Fallon, J. J. Leonard, and J. McDonald, "Real-time large-scale dense RGB-D SLAM with volumetric fusion," The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 598-626, 2015.
[11] A. Handa, T. Whelan, J. McDonald, and A. Davison, "A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM," in Proc. IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, May 2014.
[12] P. Henry, D. Fox, A. Bhowmik, and R. Mongia, "Patch volumes: Segmentation-based consistent mapping with RGB-D cameras," in Proc. International Conference on 3D Vision (3DV), 2013, pp. 398-405.
[13] F. Endres, J. Hess, N. Engelhard, J. Sturm, D. Cremers, and W. Burgard, "An evaluation of the RGB-D SLAM system," in Proc. IEEE International Conference on Robotics and Automation (ICRA), 2012, pp. 1691-1696.
[14] T. Whelan, S. Leutenegger, R. S. Moreno, B. Glocker, and A. Davison, "ElasticFusion: Dense SLAM without a pose graph," in Proc. Robotics: Science and Systems, Rome, Italy, July 2015.
[15] R. B. Rusu and S. Cousins, "3D is here: Point Cloud Library (PCL)," in Proc. IEEE International Conference on Robotics and Automation (ICRA), 2011, pp. 1-4.
[16] T. Whelan, H. Johannsson, M. Kaess, J. J. Leonard, and J. McDonald, "Robust real-time visual odometry for dense RGB-D mapping," in Proc. IEEE International Conference on Robotics and Automation (ICRA), 2013, pp. 5724-5731.
[17] D. Galvez-Lopez and J. D. Tardos, "Real-time loop detection with bags of binary words," in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011, pp. 51-58.
[18] P. J. Besl and N. D. McKay, "A method for registration of 3-D shapes," in Robotics-DL Tentative, International Society for Optics and Photonics, 1992, pp. 586-606.
[19] A. S. Huang, A. Bachrach, P. Henry, M. Krainin, D. Maturana, D. Fox, and N. Roy, "Visual odometry and mapping for autonomous flight using an RGB-D camera," in International Symposium on Robotics Research (ISRR), vol. 2, 2011.
[20] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381-395, 1981.
[21] K. Levenberg, "A method for the solution of certain non-linear problems in least squares," Quarterly of Applied Mathematics, vol. 2, pp. 164-168, 1944.
[22] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, "g2o: A general framework for graph optimization," in Proc. IEEE International Conference on Robotics and Automation (ICRA), 2011, pp. 3607-3613.
[23] J. Chen, S. Izadi, and A. Fitzgibbon, "Kinêtre: animating the world with the human body," in Proc. ACM Symposium on User Interface Software and Technology (UIST), 2012, pp. 435-444.
[24] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2012.