paper

1
Robust Large Scale Reconstruction of Indoor Scenes
Vincent Kee*, Matthew Graham†, Jeffrey Shapiro*
*Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, USA
† Machine Intelligence Group, Charles Stark Draper Laboratory, USA
Email: vkee@mit.edu, mgraham@draper.com, jshaps@mit.edu
Abstract—Creating high quality three-dimensional models of
indoor scenes promises many applications in augmented reality,
robotics, and object scanning. One popular approach for 3D
model reconstruction is dense visual Simultaneous Localization
and Mapping (SLAM), in which a camera is used to build a
map of the environment while simultaneously localizing itself
within the map. One major drawback of most dense visual SLAM
methods is that they can either build accurate maps in small-scale
environments (using model-based techniques) or estimate their
pose accurately in large-scale environments (using pose graphs),
but not both. Deformation graphs can bridge this gap. They can
be used in large-scale environments while directly optimizing map
reconstruction, theoretically resulting in higher map accuracy.
We investigate how much deformation graphs embedded in the
model’s surface will improve large-scale scene reconstruction
quality. Specifically, this project has added deformation graphs
into Draper’s existing dense visual SLAM system and evaluated
the system on published benchmark datasets and real world tests.
In addition, we compare the performance of our approach to
existing state-of-the-art dense visual SLAM algorithms.
I. INTRODUCTION
For an autonomous mobile robot, building an accurate map
of its surroundings and localizing itself in the map are critical
for intelligent operation in an unknown environment. This
problem, Simultaneous Localization and Mapping (SLAM),
has been the focus of robotics research for the past few decades
[1], [2], [3]. In the past ten years, practical SLAM techniques
have been developed and successfully implemented. However,
most of these methods constructed sparse feature-based maps
that lacked sufficient information for executing tasks beyond
basic navigation [4], [5]. Approaches using LiDAR systems
(3D laser scanner) to build high resolution point clouds of
the surroundings existed [6], but the high cost of the system
limited adoption and further development. A recent alternative,
RGB-D (red-green-blue-depth) sensors such as the Microsoft
Kinect provide accurate depth sensing at a much lower cost.
Coupled with the increased availability of high performance
computing through graphical processing units (GPUs), these
new sensors fueled the development of dense visual SLAM
algorithms such as KinectFusion [7], which produced signif-
icantly higher resolution maps than those of earlier methods.
However, existing dense methods have their limitations due
to their underlying implementation: they can provide accurate
model reconstruction of small-scale scenes [7] or accurate
camera trajectory estimation in large-scale environments [8],
but not both.
A computer graphics data structure presented in [9], the
deformation graph, allows for natural, non-rigid space defor-
Fig. 1. Reconstruction of an office desk dataset [11] produced by our system
mations of any geometric model. The deformation graph can
bridge the gap in existing dense techniques as it can be applied
to model large-scale environments while directly optimizing
surface reconstruction, theoretically resulting in higher model
accuracy.
In this paper, we present our work integrating deformation
graphs into our dense visual SLAM system and results demon-
strating that deformation graphs embedded in the model’s
surface improve large-scale scene reconstruction quality. Our
system is based off of KinectFusion [7] and Kintinuous [10],
with two significant differences. To achieve map reconstruction
quality, we do not reconstruct the environment in real time or
downsample point cloud data. Section II discusses existing
related work in the field of dense visual SLAM. Section
III summarizes our approach, outlining the components of
the real-time data collection module and the offline post
processing module. Section IV details our experimental setup
and the benchmark datasets used to evaluate our system against
existing state-of-the-art dense visual SLAM algorithms. Sec-
tion V discusses the qualitative and quantitative results of the
experiments. Section VI presents conclusions and areas for
future research.
II. RELATED WORK
Many dense visual SLAM methods have been developed
over the years, particularly following the release of the
Microsoft Kinect [12], [13]. KinectFusion [7] was the first
algorithm to generate real-time high quality reconstructions
of indoor scenes using a Microsoft Kinect and commodity

2
TABLE I
FEATURE COMPARISON OF DENSE VISUAL SLAM ALGORITHMS
Dense SLAM
Algorithms
Dense
Map
Large Scale
Environments
Surface-Model
Loop Closure
Dense Point
Cloud
Real
Time
Deterministic
Behavior
Deformation
Graph
Our System
Kinfu
Kintinuous
ElasticFusion
RGB-D Frame
Surface
Measurement
Camera Pose
Estimation
Place Recognition
Cloud Slice
Extraction
Surface
Reconstruction
Update and
Surface Prediction
TSDF
Pose Graph
DBoW
Database
3D Depth
Point Cloud
m
ab

m
p
R
G
B-D
Loop Closure
Constraint
Pose
Odometry
Pose

ms
Pose
RGB-D
Pose
Data Collection Module
Legend
Data Move
Conditional
Data Move
Data Store
Module
Component
Point Cloud Data
(Cloud Slices,
TSDF) + Pose
Graph
Optimization
Deformation
Optimized
Poses
Optimized
Map
Deformation
Graph
Post Processing Module
3D Point
Cloud
Fig. 2. System Architecture Diagram. mab refers to the movement metric defined in Equation 2, mp is the place recognition movement threshold, and ms
is the cloud slice movement threshold.
GPU, serving as a model for many future techniques. However,
KinectFusion was limited to operation in small-scale environ-
ments with high geometric diversity as it used only depth
data from the Kinect for tracking and localization. Kintinuous
[10], based on KinectFusion, extended the operating range of
KinectFusion to handle larger scale environments and added
RGB data processing into the pipeline. However, Kintinuous
suffered with handling frequently occurring loop closures
(determining when part of the environment is rediscovered)
and properly merging the rediscovered data with the model.
Perhaps the most promising approach was presented in 2015,
named ElasticFusion [14], which integrated the deformation
graph structure presented in [9] to optimize map reconstruction
in both small-scale and large-scale environments. However,
both of these methods exhibit nondeterministic behavior due
to their heavy reliance on multi-threading to achieve real time
performance.
Currently, the Perception and Localization Group at the
Charles Stark Draper Laboratory has developed a dense visual
SLAM system based on KinectFusion, using both depth and
color data for surface reconstruction. It works fairly well in
small-scale environments, but struggles with handling loop
closures, particularly in large-scale rooms, leading to incon-
sistent models with misaligned features.
We have integrated deformation graphs into Draper’s system
and applied techniques from [10] to improve performance in
larger environments. Unlike Kintinuous, we do not attempt
to generate the map in real time. Instead, we are post pro-
cessing the data in an attempt to generate higher quality
reconstructions and avoid the nondeterministic behavior issues.
Additionally, we do not downsample point cloud vertices
with a voxel grid filter during surface extraction as done
in [10]. We investigate whether working with denser point
clouds improves map reconstruction quality. Table I outlines
the feature comparison between our system and other dense
visual SLAM algorithms.
III. APPROACH
We adapt the key structure of [7], [10] in performing large-
scale dense visual SLAM but divide the system into two
separate modules to improve the performance. Our system is
organized into two primary components: the data collection
module and the post processing module. Much of the code has
been adopted from Draper’s modified repository of the Point
Cloud Library (PCL) [15], an open source library for point
cloud processing, and the open source version of Kintinuous
[10]. Functions that could benefit significantly from paral-
lelization have been implemented with NVIDIA CUDA for
performance as described in [7], [10]. Table II outlines where
each of the major system components is from and how it is
modified in our system.
In this section, each part of the data collection and post
processing modules is briefly described, with the main func-
tions summarized and the deformation components focused
on. The interested reader can see the original papers [7], [10]
for further details. The system architecture and data pipeline
are shown in Figure (2).
A. Data Collection Module
The data collection module is based off of KinectFusion [7]
and Kintinuous [10]. Data is captured from an RGB-D sensor
and processed and fused into a volumetric data structure that
shifts as the sensor moves throughout the environment. We
create a dense pose graph of the camera pose at each processed

3
TABLE II
SYSTEM COMPONENT ORIGINS
System Component Adapted From Modifications
Volumetric Fusion KinectFusion/KinectFusion [7], [15]
FOVIS [16] for improved robustness in camera pose estimation,
no downsampled point cloud data
Cloud Slice Extraction Kintinuous/KinectFusion [10], [15] Run sequentially rather than concurrently with other data collection methods
Place Recognition Kintinuous [10] Run sequentially rather than concurrently with other data collection methods
Pose Graph Construction
and Optimization
Kintinuous [10]
Separated into data collection and post processing modules and optimized
using g2o instead of iSAM
Deformation Graph Kintinuous [10] Separated into an offline post processing module
RGB-D frame, with loop closure constraints detected by a bag-
of-words-based DBoW place recognition system [17].
The primary components of the data collection module are
volumetric fusion of RGB-D data, cloud slice extraction, and
place recognition.
1) Volumetric Fusion of RGB-D Data: The volumetric
fusion component fuses the depth 3D point clouds into one
cohesive 3D model of the environment. It is adapted from
KinectFusion [7], with modifications to improve robustness
in camera pose estimation. It consists of four sequential sub-
components: surface measurement, camera pose estimation,
surface reconstruction update, and surface prediction.
The surface measurement subcomponent performs pre-
processing of the depth data in every frame captured by the
RGB-D sensor. Raw depth data is passed through a bilateral
filter to remove noise and converted into a dense 3D point
cloud.
The camera pose estimation subcomponent estimates the
camera pose for each frame at time k using the iterative
closest point (ICP) algorithm [18], parallelized on a GPU for
performance. ICP finds the transformation, Tk, that minimizes
the distance between the plane parallel to a point in the first
point cloud with its corresponding point in the second point
cloud. This transformation is in the form of a rigid body
matrix:
Tk =
Rk tk
0 1
∈ SE3 (1)
where SE3 := {R ∈ SO3, t ∈ R3
}. The FOVIS visual
odometry system from [19], [16] is run in parallel to serve
as an auxiliary camera pose estimation system in case ICP
does not converge. This often occurs in planar environments
with few geometric features.
The surface reconstruction update subcomponent integrates
each frame’s 3D point cloud into one global model stored
in the GPU using the volumetric truncated signed distance
function (TSDF). The TSDF representation stores at each
voxel the signed distance to the closest object surface, with
negative values for voxels inside an object and zero for voxels
on an object’s surface. The new TSDF is fused with the
existing model’s TSDF with a weighted running average [7].
The surface prediction subcomponent uses raycasting to
refine the model surface and remove noise. A line is projected
from the camera focal point towards the model surface until
the line intersects the surface. The intersection point and its
corresponding pixel are stored in the refined surface model.
2) Cloud Slice Extraction: By implementing the TSDF
data structure using modular arithmetic for GPU memory
addressing, environments that were previously too large to fit
in the original TSDF representation can now be mapped. This
approach [10] allows the TSDF to act as a cyclical buffer as
the camera moves through the environment. When the camera
translation exceeds the movement threshold, ms (ms = 1.5m
in our experiments), the TSDF volume shifts as necessary to
accommodate the new data and the surface exciting the TSDF
is extracted as a 3D slice of vertices called a cloud slice. This
cloud slice is saved in the world model stored in the CPU and
written to disk for post processing. It is associated with the
camera pose at the time of extraction.
3) Camera Pose Graph: A dense pose graph, containing
a camera pose for each processed frame, is constructed for
the camera trajectory. Additionally, the graph has relative
transformations between poses involved in loop closures. This
pose graph is generated for the post processing module, which
will use it to refine the camera pose estimates associated with
each cloud slice, thereby improving the map reconstruction.
Each camera pose Pi is composed of a rotation Ri ∈ SO3,
translation ti ∈ R3
, and timestamp.
4) Place Recognition: The Place Recognition component
[10] detects loop closures to eliminate odometry drift caused
by error accumulation as the system goes through an environ-
ment. It uses Speeded Up Robust Feature (SURF) descriptors
with a bag-of-words-based loop closure detector (DBoW) [17].
New frames are only added to the system when a movement
metric based on rotations and translations exceeds a movement
threshold mp (mp = 0.3 in our experiments) for performance
reasons. The movement metric, mab is computed from the
current pose, a, and the pose of the last frame added, b:
mab = r(R−1
a Rb) 2 + ta − tb 2 (2)
where r(R) : SO3 → R3
is defined as the rotation vector
form of a rotation matrix. For each RGB-D frame i [rgbi, di],
SURF keypoints and descriptors Ui ∈ Ω × R64
are computed
and queried in the bag-of-words descriptor database to detect
matches, which indicate possible loop closures. If a match m
is found, additional validation tests are run on the keypoints,
descriptors Um, and depth data dm of the match frame to avoid
false positives.
a) SURF Correspondence Threshold: A set of valid
SURF correspondences, G ∈ Ω × Ω, are found between
the frames using a k-nearest neighbors search. An l2-norm
between SURF descriptors, Ui and Um, is used to threshold

4
the matches in R64
. The loop closure candidate is discarded
if |G| σ (σ = 40 in our experiments).
b) RANSAC Transformation Estimation: A RANSAC-
based 3-point algorithm [20] is used to estimate a 6-DOF
relative transformation, Te, between camera poses i and m
using the matched image depth data dm and the valid SURF
correspondences G. The loop closure candidate is discarded
if the percentage of inliers is less than µ (µ = 0.35 in our
experiments). Otherwise, the Levenberg-Marquardt algorithm
(LMA) [21] is used to refine Te by minimizing inlier feature
reprojection errors.
c) Point Cloud ICP: Non-linear ICP is performed be-
tween the matched depth frames di and dm, using Te as
an initial guess and LMA to minimize nearest neighbor
correspondence distances between the clouds. If the mean L2
2-
norm of correspondence errors is below φ (φ = 0.01 in our
experiments), the relative transformation constraint between
the matched poses is added to the pose graph.
B. Post Processing Module
The post processing module is based off of Kintinuous [10]
and run offline after data collection. The dense camera pose
graph is optimized using g2
o [22]. The optimized pose graph
along with all the vertices from the cloud slices and TSDF are
fed into a deformation graph structure to optimize the model.
The primary components of the post processing module are
pose graph optimization and deformation.
1) Pose Graph Optimization: The camera pose graph,
including the loop closure constraints, is optimized using
g2
o, a framework for optimizing graph-based, non-linear error
functions [22].
A deformation graph allows for natural, non-rigid space
deformations of any geometric model composed of vertices
in R3
[9], accounting for all measured constraints.
2) Deformation: The deformation component is adapted
from [9], [10].
a) Deformation Graph: A deformation graph enables
defined constraints to apply natural, non-rigid space defor-
mations of any geometric model composed of vertices in R3
[9]. This graph structure is composed of nodes representing
affine transformations in R3
and undirected edges connecting
neighboring nodes sharing influence on nearby space. Each
node j has a position gj ∈ R3
, a set of neighboring nodes
N(j), and an affine transformation represented by a rotation
matrix Rj ∈ SO3 (initialized to the identity matrix) and
translation vector tj ∈ R3 (initialized the zero vector).
The deformation graph in the post processing module is
initialized by sampling from the g2
o optimized camera pose
graph using the Incremental Deformation Node Sampling
algorithm from [10] and connecting nodes based on their
temporal location in the pose graph to prevent unrelated
regions of the map from overlapping and becoming connected.
Each node causes a space deformation on the nearby space,
mapping a point p ∈ R3
to ˆp:
ˆp = Rj(p − gj) + gj + tj (3)
The set of k nodes deforming p is denoted as N(v), with k =
4 in our experiments. It is composed of the k-nearest neighbors
to p in the pose graph that are spatially and temporally optimal
generated by the Back-Traversal Vertex Association algorithm
from [10].
The deformed position of a map point cloud vertex v is
given by [9]:
ˆv =
k∈N(v)
wk(v) [Rk(v − gk) + gk + tk] (4)
with wk(v) defined as (normalized over all k):
wk(v) = 1 −
v − gk
dmax 2
2
(5)
where dmax is the Euclidean distance to the k+1-nearest node
to the vertex.
b) Optimization: We use three cost functions from [9],
[10], [23] to optimize the deformation graph. The first cost
function maximizes rigidity in the deformation by biasing
towards isometry and minimizes rotation error over all trans-
formations:
Erot =
l
Rl Rl − I
2
F
(6)
where · F denotes the Frobenius norm. The second cost
function is a regularization term ensuring a smooth deforma-
tion by minimizing distance between neighboring nodes:
Ereg =
l k∈N(v)
Rl(gk − gl) + gl + tl − (gk + tk)
2
2 (7)
The third cost function links the optimized camera pose
graph to the map deformation:
EconP
=
i
φ(ti) − ti
2
2 (8)
where ti is the translation component of node i in the pose
graph before optimization, ti is the translation component of
node i in the pose graph after optimization, and φ(ti) is the
result of applying Equation 4 to ti.
The total cost function is:
wrotErot + wregEreg + wconP
EconP
(9)
with experimental weights wrot = 1, wreg = 10, and wconP
=
100 from [10], [9]. We optimize this cost function with itera-
tive Gauss-Newton. The optimized deformation graph is then
applied in parallel to all cloud slice vertices using Equation 4.
IV. EXPERIMENTAL SETUP
We describe our experimental setup to evaluate the qualita-
tive and quantitative performance of our system and compare it
against existing state-of-the-art dense visual SLAM algorithms
including KinectFusion [7], Kintinuous [10], and ElasticFu-
sion [14].
A. Hardware
For all tests, we ran all algorithms on a standard laptop
PC running 64-bit Ubuntu 14.04 Linux with an Intel Core i7-
4710MQ 2.5GHz CPU, 8GB DDR3 1600MHz RAM, and an
NVIDIA GeForce GTX 980M 8GB GPU.

5
TABLE III
DATASET STATISTICS
Dataset
Duration
(s)
Frames
Length
(m)
Avg Translational Velocity
(m/s)
Avg Angular Velocity
(deg/s)
Trajectory Dimensions
(m × m × m)
lr kt0 51 1510 6.54 0.126 - -
lr kt1 33 967 2.05 0.063 - -
lr kt2 30 882 8.43 0.282 - -
lr kt3 42 1242 11.32 0.263 - -
fr1/desk 23.40 1352 9.263 0.413 23.327 2.42 x 1.34 x 0.66
fr1/desk2 24.86 620 10.161 0.426 29.308 2.44 x 1.47 x 0.52
fr1/room 48.90 1352 15.989 0.334 29.882 2.54 x 2.21 x 0.51
fr1/xyz 30.09 792 7.112 0.244 8.920 0.46 x 0.70 x 0.44
fr1/rpy 27.687 694 1.664 0.062 50.147 0.15 x 0.21 x 0.21
fr1/plant 41.53 1126 14.795 0.365 27.891 1.71 x 1.70 x 1.07
fr2/desk 99.36 620 18.880 0.193 6.338 3.90 x 4.13 x 0.57
fr2/xyz 122.74 3615 7.029 0.058 1.716 1.30 x 0.96 x 0.72
fr3/long 87.09 2488 21.455 0.249 10.188 5.12 x 4.89 x 0.54
B. Datasets
Our algorithm must be able to accurately reconstruct models
of a wide variety of indoor environments. Accordingly, we ran
our system on published benchmark data sets.
The Imperial College London and National University of
Ireland Maynooth (ICL-NUIM) have released an extensive
synthetic dataset for benchmarking SLAM algorithms con-
sisting of a living room and office room environment, each
with 4 different sensor trajectories [11]. The synthetic living
room environment dataset has color and depth maps images
(640×480) recorded at 30Hz, time synchronized ground truth
camera poses, and a 3D ground truth surface model to evaluate
surface reconstruction quality. All trajectories are in the same
environment, and lr kt3 has a small loop closure.
The Computer Vision Group at Technische Universität
München (TUM) has also released a large RGB-D SLAM
dataset benchmark containing color and depth images (640 ×
480) captured at 30Hz with a Microsoft Kinect and time
synchronized ground truth sensor poses recorded with a high
speed motion-capture system (100 Hz) [24]. Datasets with
labels containing ‘fr1’ were recorded in a typical office en-
vironment (6 × 6m2
) while those containing ‘fr2’ and ‘fr3’
were recorded in a large industrial hall (10 × 12m2
). The
fr1/desk dataset sweeps over four cluttered desks with mul-
tiple computer monitors and numerous objects. The fr1/desk2
dataset sweeps over the same environment, but a bit faster and
with more trajectory overlap, allowing for more loop closure
testing. The fr1/room dataset begins at the same desk view,
but pans through the entire room, ending back at the desk
shot. This dataset allows for testing larger scale environment
evaluation as well as evaluating loop closure performance. The
fr1/xyz dataset begins at the same desk view but only translates
in the x, y, and z axes with minimal rotation. The fr1/rpy
dataset also begins at the same desk view but only rotates about
the x, y, and z axes with minimal translation. Both of these
datasets allow for simple pose estimation testing. The fr1/plant
dataset has a thorough 360 degree sweep around a potted plant
from all views, which is good for testing reconstruction quality
and loop closure performance. The fr2/desk dataset has a 360
degree sweep around a desk with many objects on it, which
is good for testing loop closure performance. The fr2/xyz
dataset begins at the same desk as fr2/desk, but only translates
very slowly in the x, y, and z axes with minimal rotation,
resulting in very clean data free of motion blur and rolling
shutter effects. This dataset is good for debugging tracking
translation movements. The fr3/long dataset has a 360 degree
sweep around desks in a home and office environment with
many objects with one large loop closure. The statistics on all
of the datasets are outlined in Table III.
V. EVALUATION
Our system was evaluated and compared against existing
dense visual SLAM algorithms qualitatively and quantitatively
in terms of trajectory estimation, surface reconstruction, and
computational performance with existing algorithms. However,
the existing state-of-the-art SLAM algorithms did not achieve
results comparable to published results. This discrepancy could
be attributed to rewriting system components to opensource the
system, which the authors concede may introduce performance
regressions 1
.
A. Trajectory Estimation
To evaluate sensor trajectory estimation, we calculated the
absolute trajectory (ATE) root mean square error (RMSE) of
the trajectory using evaluation tools from [24]. This metric
is the root mean square error of each estimated pose with its
corresponding ground truth pose, linked by timestamp [24]. As
the results show in Table IV and Figure 3, the system tracks
the position fairly well. We found that pose graph optimization
marginally improves the RMSE accuracy, helping eliminate
drift, although not as much as we originally thought.
Our system performs similarly to Kinfu in some datasets
where Kinfu does not get lost such as lr kt0, lr kt1, and
1https://github.com/mp3guy/Kintinuous, https://github.com/mp3guy/ElasticFusion

6
lr_kt2fr1/xyz
Fig. 3. Two dimensional trajectory plots of estimated trajectories versus ground truth for datasets fr1/xyz and lr kt2
fr1/room. This result is expected as our system’s pose es-
timation component is based off of Kinfu’s. From previous
experiments using Kinfu, we found that the RGB-D sensor
must move slowly through environments for Kinfu’s ICP-
based pose estimator to track the sensor’s trajectory. As shown
in Table III and IV, Kinfu gets lost in the datasets where the
sensor translates and rotates faster than Kinfu can track such
as fr1/plant and fr1/desk2. Accordingly, Kinfu performs best
in datasets where the sensor’s movements are slow such as
fr1/xyz and fr1/rpy.
Our system is signiﬁcantly more robust to more drastic
frame-to-frame movements as the FOVIS visual odometry sys-
tem provides a decent transformation estimate when ICP does
not converge. However, Kinfu estimates its actual trajectory
better than our system in some datasets where Kinfu does not
get lost such as fr1/rpy and fr2/xyz. One possible explanation
of this behavior is that pairwise ICP performs better than
FOVIS in these cases. When the initial point-to-plane ICP
does not converge on a valid transformation estimate, Kinfu
tries to get a valid transformation by running pairwise ICP
on the frames. Our system, on the other hand, uses FOVIS
before trying to use pairwise ICP as a last resort only if the
FOVIS solution is invalid. We have not investigated FOVIS
performance versus pairwise ICP, which is an area for future
work.
Compared to Kintinuous in Table IV, our system performs
similarly, as each system tracks more accurately than the other
on different datasets. Although both systems’ pose estimation
modules are based off of Kinfu, our system’s is augmented
with FOVIS while Kintinuous’s combines geometric infor-
mation with photometric constraints to improve robustness in
environments with low geometric features [10].
As seen in Table IV, ElasticFusion outperforms all the other
systems on all datasets except for lr kt3 where it is slightly
outperformed by Kintinuous. This superior performance can be
attributed to its pose tracking module utilizing both geometric
TABLE IV
COMPARISON OF ATE RMSE ON DATASETS
Dataset Ours(m) Kintinuous(m) ElasticFusion(m) Kinfu(m)
lr kt0 0.237349 0.309554 0.197866 0.202962
lr kt1 0.374268 0.029417 0.177573 0.381164
lr kt2 0.661996 0.26592 0.207747 Lost
lr kt3 0.693045 0.169388 0.259297 Lost
fr1/desk 0.711651 0.955585 0.224547 Lost
fr1/desk2 0.685753 0.937414 0.156361 Lost
fr1/room 0.784516 0.937414 0.095809 0.744844
fr1/xyz 0.132682 0.024784 0.017364 0.132682
fr1/rpy 0.058544 0.041976 0.035581 0.03617
fr1/plant 0.532928 0.58716 0.077455 Lost
fr2/desk 0.744481 0.208078 0.15544 Lost
fr2/xyz 0.202528 0.058109 0.0146268 0.153812
fr3/long 1.31371 1.782465 0.043245 1.324468
and photometric constraints in combination with utilizing
frequent local and global loop closures to minimize drift [14].
B. Surface Reconstruction
Qualitatively, our system generates decent quality surface
reconstructions, which share a combination of issues to be
addressed in future work. The main problems are model
feature misalignments and noisy mesh models. As noted by
[10], accurate trajectory estimation does not necessary lead to
accurate surface reconstructions. However, the misalignment
ﬂaws, such as those depicted in the mesh model from lr kt1 in
Figure 4, are primarily due to poor trajectory estimation. This
error manifests as our system integrates the new data from each
frame into the global model using the estimated pose at which
the frame is processed. By improving the pose estimation, loop
closure detection, and pose and deformation graph optimiza-
tion, this aliasing could be mitigated. Volumetric reintegration

7
TABLE V
RECONSTRUCTION STATISTICS
Datasets
Ours Kinfu Kintinuous ElasticFusion
Vertices Faces
Cloud
Slices
Loop
Closures
Vertices Faces Vertices Faces Vertices
lr kt0 11761467 3920489 10 1 4372128 1457376 243309 460133 757378
lr kt1 10808433 3602811 6 0 7690662 2563554 354427 695516 672893
lr kt2 11539362 3846454 7 0 1332318 444106 451289 871573 360518
lr kt3 15338664 5112888 4 0 4357767 1452589 432376 839220 717285
fr1/desk 34462794 11487598 21 0 8366436 2788812 472511 873861 604127
fr1/desk2 8241693 2747261 7 1 2930019 976673 293664 542559 177618
fr1/room 24707529 8235843 15 0 8366772 2788924 473425 874585 5310688
fr1/xyz 2197521 732507 0 3 2197473 732491 67522 126493 163975
fr1/rpy 9098793 3032931 11 6 3251754 1083918 141038 261853 120761
fr1/plant 23590230 7863410 11 0 6346272 2115424 368252 704246 307691
fr2/desk 7284849 2428283 7 4 2929998 976666 151875 26243 178237
fr2/xyz 2370717 790239 0 17 2371512 790504 76974 142763 358988
fr3/long 9881976 293992 4 3 8749332 2916444 282927 529690 951408
Fig. 4. Misalignment error in a living room environment in lr kt1. Note how
the two walls should intersect at a right angle, but due to odometry drift, this
is visibly not the case.
and fusion of revisited areas in the environment would also
help improve the model quality and remove misalignments, as
mentioned by [10]. This is a challenging area for future work
[10].
The majority of noise in the model is also due to poor
trajectory estimation. As shown in Figure 5, Kinfu generates
significantly cleaner models for some datasets like fr1/rpy as
the pairwise ICP pose estimates are superior to those generated
by FOVIS in these specific cases. Additional noise could
be reduced by downsampling the point cloud data with a
voxel grid filter as done in [10], which would also improve
performance by reducing the data processed. As shown in
Table V and Figure 6, having more vertices and faces in
the mesh model does not necessarily result in higher quality
models.
Optimizing pose graphs by detecting loop closures and
modifying the model with deformation graphs do improve map
reconstruction quality, as seen when running Kintinuous and
ElasticFusion. Even when trajectory estimation is poor and
suffering from odometry drift, once a loop closure is detected
and the pose and deformation graphs are optimized and applied
(a)
(b)
Fig. 5. Comparison of models generated by our system (a) and Kinfu (b)
from the fr1/rpy dataset. Note how Kinfu captures the entire desk object while
our system captures only part of the desk object and picks up significant noise
in the background.
to the model, the misaligned components of the map snap into
place, drastically improving the model as can be seen in Figure
6. This model of dataset fr1/room was generated with poor
pose tracking (RMSE of 0.937414) as seen in Table IV, but
has no major aliasing.
When processing datasets with frame-to-frame movements
that our system can track well, our system performs com-
parably with the state-of-the-art as seen in Figure 6. Clearly,
ElasticFusion generates the highest quality map reconstruction
of dataset fr1/xyz. as the geometric features are best preserved

8
(a) (b)
(c) (d)
Fig. 6. Comparison of models from dataset fr1/xyz generated by our system (a), Kinfu (b), Kintinuous (c), and ElasticFusion (d). Note how the models
generated by our system and Kinfu are virtually identical and extremely similar to the one generated by Kintinuous. Also note the accurate and highly detailed
colored geometric features in ElasticFusion.
Fig. 7. Model generated by Kintinuous of the fr1/room dataset. Note how
there do not appear to be any major misalignments of features such as the
walls at the corners of the wall or objects on the table.
and colored.
TABLE VI
COMPARISON OF COMPUTATIONAL PERFORMANCE
Dataset Ours(ms) Kintinuous(ms) ElasticFusion(ms) Kinfu(ms)
lr kt0 75.6045 25.95 32.289 46.7025
lr kt1 29.8339 27.186 34.699 31.2649
lr kt2 27.9252 22.827 34.794 28.1382
lr kt3 30.1917 22.558 37.366 38.6743
fr1/desk 26.5032 22.773 36.198 46.085
fr1/desk2 25.6016 20.064 20.064 29.9576
fr1/room 105.338 19.569 76.903 52.6674
fr1/xyz 25.6625 17.187 34.328 25.5975
fr1/rpy 19.7236 20.578 33.8709 29.2447
fr1/plant 28.9778 29.921 37.349 27.8769
fr2/desk 28.9778 19.355 19.355 30.4725
fr2/xyz 26.3884 19.4279 39.563 24.5707
fr3/long 27.251 23.08 41.6801 27.1341
C. Computational Performance
To measure computational performance, we measured the
average time it took each system to process an RGB-D frame
(preprocessing the data, estimating the pose, and integrating

9
the data into the world model). As the results in Table VI
show, our system performs similarly to the other systems for
most datasets. The two outliers are due to FOVIS estimating
the transformation between frames, as it is the only different
component between our system and Kinfu. However, this
slower performance on the two datasets is a minor drawback
as FOVIS drastically improves robustness as seen in the
Trajectory Estimation section and Table IV.
VI. CONCLUSION
We have presented our approach to robust large map recon-
struction of indoor scenes. Our main contribution is to divide
the approach of [10] into data collection and post processing
modules to eliminate nondeterministic behavior and use denser
point clouds for map reconstruction. In our experiments, we
show that our system performs comparably to the state-of-the-
art dense visual SLAM algorithms on some datasets. However,
our system struggles to track the frame-to-frame movement
in faster moving datasets as well as reintegrating data from
revisited areas, leading to aliasing. This challenge is an area
for future research.
ACKNOWLEDGMENTS
I would like to thank my SuperUROP supervisor Matthew
Graham for providing extensive support throughout the
project. I would also like to thank my SuperUROP advisor
Professor Jeffrey Shapiro for offering guidance. This work
was supported by SuperUROP funding from the Charles Stark
Draper Laboratory.
REFERENCES
[1] H. Durrant-Whyte and T. Bailey, “Simultaneous localization and map-
ping: part i,” Robotics Automation Magazine, IEEE, vol. 13, no. 2,
pp. 99–110, 2006.
[2] T. Bailey and H. Durrant-Whyte, “Simultaneous localization and map-
ping (slam): Part ii,” IEEE Robotics Automation Magazine, vol. 13,
no. 3, pp. 108–117, 2006.
[3] S. Thrun et al., “Robotic mapping: A survey,” Exploring artificial
intelligence in the new millennium, vol. 1, pp. 1–35, 2002.
[4] D. Hahnel, W. Burgard, D. Fox, and S. Thrun, “An efficient fastslam
algorithm for generating maps of large-scale cyclic environments from
raw laser range measurements,” in Intelligent Robots and Systems,
2003.(IROS 2003). Proceedings. 2003 IEEE/RSJ International Confer-
ence on, vol. 1. IEEE, 2003, pp. 206–211.
[5] S. Thrun, W. Burgard, and D. Fox, “A real-time algorithm for mobile
robot mapping with applications to multi-robot and 3d mapping,” in
Robotics and Automation, 2000. Proceedings. ICRA’00. IEEE Interna-
tional Conference on, vol. 1. IEEE, 2000, pp. 321–328.
[6] J. Levinson, M. Montemerlo, and S. Thrun, “Map-based precision
vehicle localization in urban environments.” in Robotics: Science and
Systems, vol. 4. Citeseer, 2007, p. 1.
[7] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim,
A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon,
“Kinectfusion: Real-time dense surface mapping and tracking,” in
Proceedings of the 2011 10th IEEE International Symposium on
Mixed and Augmented Reality, ser. ISMAR ’11. Washington, DC,
USA: IEEE Computer Society, 2011, pp. 127–136. [Online]. Available:
http://dx.doi.org/10.1109/ISMAR.2011.6092378
[8] J. McDonald, M. Kaess, C. Cadena, J. Neira, and J. J. Leonard, “Real-
time 6-dof multi-session visual slam over large-scale environments,”
Robotics and Autonomous Systems, vol. 61, no. 10, pp. 1144–1158, 2013.
[9] R. W. Sumner, J. Schmid, and M. Pauly, “Embedded deformation
for shape manipulation,” in ACM SIGGRAPH 2007 Papers, ser.
SIGGRAPH ’07. New York, NY, USA: ACM, 2007. [Online].
Available: http://doi.acm.org/10.1145/1275808.1276478
[10] T. Whelan, M. Kaess, H. Johannsson, M. Fallon, J. J. Leonard, and
J. McDonald, “Real-time large-scale dense rgb-d slam with volumetric
fusion,” The International Journal of Robotics Research, vol. 34, no.
4-5, pp. 598–626, 2015.
[11] A. Handa, T. Whelan, J. McDonald, and A. Davison, “A benchmark
for RGB-D visual odometry, 3D reconstruction and SLAM,” in IEEE
Intl. Conf. on Robotics and Automation, ICRA, Hong Kong, China, May
2014.
[12] P. Henry, D. Fox, A. Bhowmik, and R. Mongia, “Patch volumes:
Segmentation-based consistent mapping with rgb-d cameras,” in 3D
Vision-3DV 2013, 2013 International Conference on. IEEE, 2013, pp.
398–405.
[13] F. Endres, J. Hess, N. Engelhard, J. Sturm, D. Cremers, and W. Burgard,
“An evaluation of the rgb-d slam system,” in Robotics and Automation
(ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp.
1691–1696.
[14] T. Whelan, S. Leutenegger, R. S. Moreno, B. Glocker, and A. Davison,
“Elasticfusion: Dense slam without a pose graph,” in Proceedings of
Robotics: Science and Systems, Rome, Italy, July 2015.
[15] R. B. Rusu and S. Cousins, “3d is here: Point cloud library (pcl),” in
Robotics and Automation (ICRA), 2011 IEEE International Conference
on. IEEE, 2011, pp. 1–4.
[16] T. Whelan, H. Johannsson, M. Kaess, J. J. Leonard, and J. McDonald,
“Robust real-time visual odometry for dense rgb-d mapping,” in Robotics
and Automation (ICRA), 2013 IEEE International Conference on. IEEE,
2013, pp. 5724–5731.
[17] D. Galvez-Lopez and J. D. Tardos, “Real-time loop detection with
bags of binary words,” in Intelligent Robots and Systems (IROS), 2011
IEEE/RSJ International Conference on. IEEE, 2011, pp. 51–58.
[18] P. J. Besl and N. D. McKay, “Method for registration of 3-d shapes,” in
Robotics-DL tentative. International Society for Optics and Photonics,
1992, pp. 586–606.
[19] A. S. Huang, A. Bachrach, P. Henry, M. Krainin, D. Maturana, D. Fox,
and N. Roy, “Visual odometry and mapping for autonomous flight using
an rgb-d camera,” in International Symposium on Robotics Research
(ISRR), vol. 2, 2011.
[20] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm
for model fitting with applications to image analysis and automated
cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395,
1981.
[21] K. Levenberg, “A method for the solution of certain non–linear problems
in least squares,” 1944.
[22] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard,
“g 2 o: A general framework for graph optimization,” in Robotics and
Automation (ICRA), 2011 IEEE International Conference on. IEEE,
2011, pp. 3607–3613.
[23] J. Chen, S. Izadi, and A. Fitzgibbon, “Kinêtre: animating the world with
the human body,” in Proceedings of the 25th annual ACM symposium
on User interface software and technology. ACM, 2012, pp. 435–444.
[24] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A
benchmark for the evaluation of rgb-d slam systems,” in Proc. of the
International Conference on Intelligent Robot Systems (IROS), Oct.
2012.

paper

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to paper

Similar to paper (20)

paper