A Report
On
3D OBJECT RECOGNITION
USING POINT CLOUD LIBRARY
(PCL)
prepared by:
Rishikesh Bagwe (2012A8PS401G)
Mentor:
Imran Syed (Sc ‘C’)
Centre for Artificial Intelligence and Robotics, Bangalore
A Practice School – II station of
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI
(May, 2016)
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI
(RAJASTHAN)
Practice School Division
Station: Centre for Artificial Intelligence and Robotics, Bangalore
Duration: 5 months and 7 days          Date of Start: 12th Jan 2016
Date of Submission: 31st May 2016
Title of the Project: 3D Object Recognition using Point Cloud Library (PCL)
Students:
Name ID numbers Discipline
Rishikesh Bagwe 2012A8PS401G Electronics and Instrumentation
Station Experts:
Name Designation
Imran Syed Scientist ‘C’
PS Faculty: K. Pradheep Kumar
Siju C. R.
Keywords:
Project Area: Artificial Intelligence, Object Recognition
Abstract:
This report gives the detailed steps required for 3D object recognition. It states the literature and
concepts used in the process, and presents the execution results of the global and local pipelines.
A comparative study of the different algorithms used in the pipelines is listed here, and the
combination of algorithms to be used is concluded.
Preface
The following project report is based on 3D object recognition. It explains why 3D object
recognition is preferable to 2D recognition and what steps are involved in the process of object
recognition. It also gives an insight into the mathematics used by various algorithms for
keypoint description. Some of the commonly used algorithms for surface description are
explained in the report. The training and testing parts are the body of object recognition. The
training is done using 400 point clouds (3D images) for each class (object). Combinations of
various algorithms are tried and the fastest and most accurate combination is selected. A
Microsoft Kinect Xbox 360 is used to gather 3D data for testing purposes.
Acknowledgement
I would like to express my special thanks and gratitude to my station mentor Mr. Imran Syed for
giving me the opportunity to work on the 3D object recognition project and for continuously guiding
me through the obstacles I faced. I would also like to thank the organisation, Centre for Artificial
Intelligence and Robotics (CAIR), for allowing me to use their equipment and expertise for my
project. Secondly, I would like to thank my BITS PS faculty, Dr. K. Pradheep Kumar and Dr. Siju C.R.,
for easing the process of entering CAIR and for guiding me through the rules and
regulations of the Practice School Division. Lastly, I want to thank the Birla Institute of
Technology and Science (BITS), Pilani for providing me the opportunity to work in a reputed
research organisation, CAIR.
Table of Contents
Abstract Sheet
Preface
Acknowledgement
Table of Contents
1 Introduction
2 Terminology and Concepts
3 The process classification and flow
3.1 The Global Pipeline
3.2 The Local Pipeline
4 Testing and Training
5 Experiment Results
5.1 Experiment 1 – RGB-D object dataset from the internet
5.2 Experiment 2 – Creating and testing on own dataset
6 Conclusion
References
1 Introduction
The objective of 3D object recognition is to identify objects correctly in a point cloud and to
determine their poses (i.e., location and orientation). For many years, the most common
sensors for computer vision were 2D cameras that retrieved an RGB image of the scene (like all
the digital cameras that are so common nowadays in our laptops or smartphones). Algorithms
exist that are able to find an object in a picture even if it is rotated or scaled. Then came 3D
sensors, which give us the depth of each point in the scene; because of this new measurement
we can now detect an object and its pose even from a camera view different from the one it was
trained with. But the addition of a new dimension makes the calculations expensive. Working with
the data these sensors retrieve is a lot different from working with a 2D image, and texture
information is rarely used.
3D sensors can be categorized into three classes:
- Stereo Cameras: They are the only passive measurement devices in the list. They are
essentially two identical cameras assembled together (some centimetres apart) that
capture slightly different scenes. By computing the differences between the two
scenes, it is possible to infer depth information about the points in each image. A
stereo pair is cheap, but perhaps the least accurate sensor. Ideally, it would require
perfect calibration of both cameras, which is unfeasible in practice, and bad lighting
conditions render it useless.
- Time-of-flight (ToF): These sensors work by measuring the time it has taken a ray
or pulse of light to travel a certain distance. Because the speed of light is a known
constant, a simple formula can be used to obtain the range to the object. These
sensors are not affected by lighting conditions and have the potential to be very precise.
A LIDAR (light detection and ranging) sensor is essentially a laser range finder mounted
on a platform that rotates very fast, scanning the scene point by point.
- Structured Light: These sensors (like the Kinect and Xtion) work by projecting a pattern of
infrared light (for example, a grid of lines or a "constellation" of points) onto
the scene's objects. The pattern appears distorted when viewed from a perspective different
from the projector's. By analysing this distortion, information about the depth
can be retrieved and the surface(s) reconstructed.
In this project I will be using a Microsoft Kinect for capturing data, so it is important to go into the
details of 3D image formation in the Kinect sensor. The basis of the Microsoft Kinect is
PrimeSense technology: the Kinect has a PS1080 system-on-chip (SoC) which handles its 3D image
formation. The Kinect has an RGB camera and an IR projector and sensor, as can be seen in the
picture given below.
The Kinect uses the IR projector and sensor pair to measure the depth of a point in the scene.
The theory of operation is simple, but its execution can be complex; this is handled by PrimeSense's
PS1080 SoC. The IR projector projects a pattern of IR dots, which are detected using a conventional
CMOS image sensor with an IR filter. The pattern changes based on the objects that reflect
the light: the dots change size and position depending on how far the objects are from the
source.
The PS1080 SoC has both the projected pattern and the sensed pattern. First it maps the points
in the projected pattern to the ones in the sensed pattern. Then it measures the distance by which
each point has moved (the disparity). This disparity is then used to calculate the depth of that point
as follows.
In the diagram beside (O and O' denote the positions of the IR projector and the IR sensor,
separated by the baseline OO'; f is the focal length; X is a scene point at depth Z), by
similarity of triangles on the projector side,

x / OP = f / Z        ... (1)

and by similarity of triangles on the sensor side,

x' / O'P = f / Z      ... (2)

Adding (1) and (2),

(x + x') / OO' = f / Z

We know the distance OO' and f (the focal length). x and x' are found by the PrimeSense PS1080
SoC from the IR projector and the IR sensor respectively. Therefore we can find Z, the depth of
the point X.
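To make the relation concrete, the short function below computes Z from the disparity. It is only a
sketch: the function name and the numeric values (focal length, baseline, disparity) are illustrative
assumptions, not the Kinect's actual calibration constants.

```cpp
#include <iostream>

// Depth of a point from its total disparity d = x + x' (in pixels),
// given the focal length f (in pixels) and the baseline b = OO' (in metres).
double depthFromDisparity(double focal_px, double baseline_m, double disparity_px) {
    // From (x + x') / OO' = f / Z  =>  Z = f * OO' / (x + x')
    return focal_px * baseline_m / disparity_px;
}

int main() {
    // Illustrative values only, not real calibration data.
    double f = 580.0;   // focal length in pixels
    double b = 0.075;   // baseline in metres
    double d = 20.0;    // measured disparity in pixels
    std::cout << "Z = " << depthFromDisparity(f, b, d) << " m" << std::endl;
    return 0;
}
```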
The Kinect returns the 3D data in the form of a point cloud. A point cloud is a set of points in
three-dimensional space, each with its own XYZ coordinates. Every point corresponds to
exactly one pixel of the captured image. Optionally, the point cloud can also store RGB
data if the sensor has an RGB camera. The file format in which a point cloud is stored is called
point cloud data (.pcd).
In order to do processing such as object detection, object recognition and 3D modelling on these
point clouds, and to handle the complex calculations involving depth measurement, the Point Cloud
Library (PCL) was started in early 2010 by Willow Garage. The first version was
fully introduced in 2011, and it has been actively maintained ever since. PCL aims to be a
one-for-all solution for point cloud and 3D processing. It is an open-source, multiplatform
library divided into many submodules for different tasks, like visualization, filtering,
segmentation, registration, searching, and feature estimation.
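As a small illustration of the library, the sketch below reads a .pcd file and reports how many
points it contains; the file name scene.pcd is a placeholder, not a file from this project.

```cpp
#include <pcl/io/pcd_io.h>
#include <pcl/point_types.h>
#include <iostream>

int main() {
    // A point cloud whose points carry XYZ coordinates plus an RGB colour.
    pcl::PointCloud<pcl::PointXYZRGB>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZRGB>);

    // loadPCDFile returns a negative value if the file cannot be read.
    if (pcl::io::loadPCDFile<pcl::PointXYZRGB>("scene.pcd", *cloud) < 0) {
        std::cerr << "Could not read scene.pcd" << std::endl;
        return -1;
    }
    std::cout << "Loaded " << cloud->size() << " points" << std::endl;
    return 0;
}
```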
(Figure: IR projection (top left), RGB image (bottom left), depth map (bottom right), RGB-D point cloud (top right))
2 Terminology and Concepts
Keypoints:
According to the original publication, a keypoint is a point on the object which:
1. takes information about borders and the surface structure into account
2. can be reliably detected even if the object is observed from another perspective
3. provides stable areas for normal estimation or the descriptor calculation in general
As can be seen in the figure beside, the good keypoints
are not exactly on the edge but just around it, so that it
is easy to calculate the normals in the neighbourhood.
The red (bad) keypoints, on the other hand, do not have a
characteristic surface change beneath them. The main
reason to find keypoints is to reduce the load on the
further processing. A point cloud of an object can have as many as 100,000 (1 lakh) points, which
makes processing lengthy. But if we calculate the keypoints, the number of points is reduced to a
few hundred.
There are different algorithms for detecting keypoints. A small set of detectors has been specifically
proposed for 3D point clouds and range maps, e.g. Intrinsic Shape Signatures (ISS), NARF,
etc. Several other keypoint detectors are derived from 2D interest point detectors; among these are
Harris3D, SIFT3D and SUSAN3D.
The following image shows the ISS3D keypoints calculated on a keyboard point cloud: the coloured
points are the keypoints. As you can see, not all of the keypoints are good ones according to the
above description of keypoints. This is mainly due to aberrations in the collected point cloud data.
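The detection step is available in PCL as pcl::ISSKeypoint3D; the sketch below shows a typical
invocation, with radii and thresholds that are illustrative assumptions rather than the values tuned
for this project.

```cpp
#include <pcl/io/pcd_io.h>
#include <pcl/point_types.h>
#include <pcl/keypoints/iss_3d.h>
#include <pcl/search/kdtree.h>
#include <iostream>

int main() {
    pcl::PointCloud<pcl::PointXYZ>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZ>);
    if (pcl::io::loadPCDFile<pcl::PointXYZ>("object.pcd", *cloud) < 0)  // placeholder file name
        return -1;

    pcl::search::KdTree<pcl::PointXYZ>::Ptr tree(new pcl::search::KdTree<pcl::PointXYZ>);
    pcl::PointCloud<pcl::PointXYZ>::Ptr keypoints(new pcl::PointCloud<pcl::PointXYZ>);

    pcl::ISSKeypoint3D<pcl::PointXYZ, pcl::PointXYZ> iss;
    iss.setSearchMethod(tree);
    iss.setSalientRadius(0.02);   // neighbourhood used to build the scatter matrix (assumed)
    iss.setNonMaxRadius(0.01);    // non-maxima suppression radius (assumed)
    iss.setThreshold21(0.975);    // eigenvalue ratio lambda2/lambda1
    iss.setThreshold32(0.975);    // eigenvalue ratio lambda3/lambda2
    iss.setMinNeighbors(5);
    iss.setInputCloud(cloud);
    iss.compute(*keypoints);

    std::cout << cloud->size() << " points reduced to "
              << keypoints->size() << " keypoints" << std::endl;
    return 0;
}
```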
Descriptors:
A descriptor is an n-dimensional vector calculated for each point's local neighbourhood, or sometimes
computed for the whole cloud. The dimension of the vector depends on the algorithm used
for calculating it. Descriptors are divided into two categories: global and local descriptors.
Each local descriptor describes the surface beneath the neighbourhood of one keypoint,
whereas a single global descriptor describes the whole viewed object surface. In order to calculate
the descriptors we first have to calculate the normals at each point in the specified neighbourhood.
Then the differences in the angles between the normals are binned into a histogram. For example,
the Fast Point Feature Histogram (FPFH) has 33 bins: 11 bins for each of three angular parameters
(e.g. one parameter is the angle difference between the normal at the point of interest and the
normal at one of its neighbouring points). The number of instances falling into each value interval
of each parameter is counted and added to the histogram at the appropriate bin. Apart from FPFH
there are various other algorithms for descriptor calculation, such as, in the local category,
Signature of Histograms of Orientations (SHOT) and Point Feature Histogram (PFH), and in the global
category, Viewpoint Feature Histogram (VFH) and Global Fast Point Feature Histogram (GFPFH). In this
report we have used the VFH descriptor for the global pipeline and the SHOT descriptor for the local
pipeline.
Here are the FPFH and VFH descriptors calculated for the keyboard point cloud shown above
(figures: VFH descriptor, FPFH descriptors).
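In PCL, histogram descriptors are computed after normal estimation. The sketch below computes the
single 308-bin VFH descriptor of a segmented object cloud; the search radius is an assumed value and
object.pcd is a placeholder name.

```cpp
#include <pcl/io/pcd_io.h>
#include <pcl/point_types.h>
#include <pcl/features/normal_3d.h>
#include <pcl/features/vfh.h>
#include <pcl/search/kdtree.h>

int main() {
    pcl::PointCloud<pcl::PointXYZ>::Ptr object(new pcl::PointCloud<pcl::PointXYZ>);
    if (pcl::io::loadPCDFile<pcl::PointXYZ>("object.pcd", *object) < 0)
        return -1;

    pcl::search::KdTree<pcl::PointXYZ>::Ptr tree(new pcl::search::KdTree<pcl::PointXYZ>);

    // Surface normals are required before any histogram-based descriptor.
    pcl::PointCloud<pcl::Normal>::Ptr normals(new pcl::PointCloud<pcl::Normal>);
    pcl::NormalEstimation<pcl::PointXYZ, pcl::Normal> ne;
    ne.setInputCloud(object);
    ne.setSearchMethod(tree);
    ne.setRadiusSearch(0.03);   // 3 cm neighbourhood (assumed)
    ne.compute(*normals);

    // VFH produces one 308-dimensional descriptor for the whole cloud.
    pcl::PointCloud<pcl::VFHSignature308>::Ptr vfh(new pcl::PointCloud<pcl::VFHSignature308>);
    pcl::VFHEstimation<pcl::PointXYZ, pcl::Normal, pcl::VFHSignature308> est;
    est.setInputCloud(object);
    est.setInputNormals(normals);
    est.setSearchMethod(tree);
    est.compute(*vfh);   // vfh->points[0].histogram holds the 308 bins
    return 0;
}
```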
3 The process classification and flow
The basis of 3D object recognition is to find a set of correspondences between two different
point clouds, one of them containing the object we are looking for. The process is classified
into two separate pipelines, viz. the local pipeline and the global pipeline. The global pipeline is
usually used for object detection in a scene, while the local pipeline is used to find the object's
position and orientation with respect to the camera. Each pipeline has different stages, as shown in
the figure below.
The main difference between the two pipelines is at the description stage: the pipelines
use different algorithms for finding descriptors. The local pipeline describes the surface
curvature of an object while the global pipeline considers the object as a whole.
For complete object recognition we need to use the global pipeline first and then the
local one. Accordingly, there are different possible combinations of description algorithms
in the two pipelines. The accuracy and speed of the process depend strongly on the combination
used, so a comparative study needs to be done in order to get the optimum throughput.
Apart from the algorithms used for the descriptors, the two pipelines have different stages. As
seen in the figure above, keypoint extraction is present only in the local pipeline, because in the
global pipeline the descriptor is found for the whole object, whereas in the local pipeline, in order
to reduce the processing time, keypoints are found first and then their descriptors are calculated.
Similarly, the segmentation step is present only in the global pipeline, because there we need to
separate the object from the scene before finding the object's global descriptor as a whole, whereas
in the local pipeline we already have the object itself.
Usually, when these two pipelines are integrated, we first carry out the global pipeline, i.e. we
find the object in the given scene, and then run the local pipeline to calculate the orientation of
the found object.
3.1 The Global Pipeline
As stated earlier, the global pipeline is used for finding the class of a given object (i.e., what
object it is). The number of classes that the system can recognise depends on the training of the
system.
Training Part:
A segmented point cloud database is used for training, so we do not need to perform
segmentation while training. The training part is done by taking 130 images of each object
from different views and then grouping them into few groups based on a threshold distance of
the calculated global descriptors. The global descriptor used in this project is VFH (View-Point
Feature Histogram) which is view dependent that is why it needs images from different view-
points. The descriptors are stored in the database in a KD –tree format. Along with the KD-tree
format (hdf5) file there are 2 more files generated namely the descriptor name file and the
training data path file. The data path file generated is shown below:
Testing Part:
First the data captured from the Kinect is segmented to keep only the object in the point
cloud. This segmentation is done on the basis of surface models such as planar, cylindrical,
spherical, etc. Here, different objects are kept on a table, so we perform planar
segmentation to detect the table surface and cut it out. The remaining disjoint point clouds are
the objects kept on the table.
(photo of segmentation)
Sometimes the segmentation gives back 2-3 point clouds even if only one object is kept on
the table in front of the sensor. In such situations the first point cloud produced by the algorithm
is the intended object; the others may be due to faulty registration of the point cloud by the
Kinect.
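A rough sketch of this table-removal and clustering step, using PCL's RANSAC plane segmentation and
Euclidean cluster extraction, is given below; the thresholds and cluster sizes are assumptions, not
the parameters used in the project.

```cpp
#include <pcl/point_types.h>
#include <pcl/ModelCoefficients.h>
#include <pcl/segmentation/sac_segmentation.h>
#include <pcl/segmentation/extract_clusters.h>
#include <pcl/filters/extract_indices.h>
#include <pcl/search/kdtree.h>
#include <vector>

std::vector<pcl::PointIndices> segmentObjects(pcl::PointCloud<pcl::PointXYZ>::Ptr scene) {
    // 1. Fit the dominant plane (the table surface) with RANSAC.
    pcl::ModelCoefficients::Ptr coeffs(new pcl::ModelCoefficients);
    pcl::PointIndices::Ptr plane(new pcl::PointIndices);
    pcl::SACSegmentation<pcl::PointXYZ> seg;
    seg.setOptimizeCoefficients(true);
    seg.setModelType(pcl::SACMODEL_PLANE);
    seg.setMethodType(pcl::SAC_RANSAC);
    seg.setDistanceThreshold(0.01);          // 1 cm tolerance around the plane (assumed)
    seg.setInputCloud(scene);
    seg.segment(*plane, *coeffs);

    // 2. Remove the plane, keeping everything standing on it.
    pcl::PointCloud<pcl::PointXYZ>::Ptr objects(new pcl::PointCloud<pcl::PointXYZ>);
    pcl::ExtractIndices<pcl::PointXYZ> extract;
    extract.setInputCloud(scene);
    extract.setIndices(plane);
    extract.setNegative(true);               // keep the points NOT on the plane
    extract.filter(*objects);

    // 3. Group the remaining points into disjoint clusters, one per object.
    pcl::search::KdTree<pcl::PointXYZ>::Ptr tree(new pcl::search::KdTree<pcl::PointXYZ>);
    tree->setInputCloud(objects);
    std::vector<pcl::PointIndices> clusters;
    pcl::EuclideanClusterExtraction<pcl::PointXYZ> ec;
    ec.setClusterTolerance(0.02);            // 2 cm point-to-point gap (assumed)
    ec.setMinClusterSize(100);
    ec.setSearchMethod(tree);
    ec.setInputCloud(objects);
    ec.extract(clusters);
    return clusters;
}
```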
After the segmentation, the VFH descriptor is calculated for the acquired point cloud. The
descriptor is then compared with all the VFH descriptors stored in the KD-tree during training,
based on a distance threshold in 308 dimensions (VFH being a 308-dimensional descriptor), and the
closest 5 matching point clouds are output by the program, as shown in the figure below.
The part highlighted in grey is the object model provided for matching and the one highlighted
in pink is the matching result. As you can see, the keyboard is matched correctly to a keyboard.
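In the project the training descriptors are held in a FLANN KD-tree; the sketch below performs the
equivalent comparison by brute force in the 308-dimensional space and returns the indices of the five
closest training clouds, which is enough to illustrate the matching step.

```cpp
#include <pcl/point_types.h>
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Return the indices of the k training descriptors closest to the query,
// using the Euclidean distance over all 308 histogram bins.
std::vector<int> closestMatches(const pcl::VFHSignature308& query,
                                const std::vector<pcl::VFHSignature308>& training,
                                std::size_t k = 5) {
    std::vector<std::pair<float, int>> dists;
    for (std::size_t i = 0; i < training.size(); ++i) {
        float d = 0.0f;
        for (int b = 0; b < 308; ++b) {
            float diff = query.histogram[b] - training[i].histogram[b];
            d += diff * diff;
        }
        dists.emplace_back(std::sqrt(d), static_cast<int>(i));
    }
    std::sort(dists.begin(), dists.end());   // smallest distance first
    std::vector<int> best;
    for (std::size_t i = 0; i < std::min(k, dists.size()); ++i)
        best.push_back(dists[i].second);
    return best;
}
```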
3.2 The Local Pipeline
The local pipeline is used to measure the orientation and position of the object in the scene.
The orientation we obtain is relative to a point cloud that we provide for comparison (whose
orientation and position we know).
Training Part:
After matching and determining the class of the object, we retrieve from our database the local
descriptor file of the point cloud with which it was matched. This database is created for all the
files that are used for global pipeline training. For local pipeline training we first find the
keypoints of a point cloud; in this project we use the ISS3D keypoint detector for finding the
keypoints. Having found the keypoints, we then compute the descriptors. We have used SHOT
(Signature of Histograms of Orientations) descriptors. The database of local descriptors for all
point clouds is stored in individual files.
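A minimal sketch of this description step is shown below: normals are estimated on the full object
cloud and SHOT descriptors are computed only at the ISS3D keypoints. The function name and the radii
are assumptions, not project values.

```cpp
#include <pcl/point_types.h>
#include <pcl/features/normal_3d.h>
#include <pcl/features/shot.h>
#include <pcl/search/kdtree.h>

pcl::PointCloud<pcl::SHOT352>::Ptr describeKeypoints(
        pcl::PointCloud<pcl::PointXYZ>::Ptr cloud,      // full object cloud
        pcl::PointCloud<pcl::PointXYZ>::Ptr keypoints)  // its ISS3D keypoints
{
    pcl::search::KdTree<pcl::PointXYZ>::Ptr tree(new pcl::search::KdTree<pcl::PointXYZ>);

    pcl::PointCloud<pcl::Normal>::Ptr normals(new pcl::PointCloud<pcl::Normal>);
    pcl::NormalEstimation<pcl::PointXYZ, pcl::Normal> ne;
    ne.setInputCloud(cloud);
    ne.setSearchMethod(tree);
    ne.setRadiusSearch(0.03);            // assumed normal-estimation radius
    ne.compute(*normals);

    pcl::PointCloud<pcl::SHOT352>::Ptr descriptors(new pcl::PointCloud<pcl::SHOT352>);
    pcl::SHOTEstimation<pcl::PointXYZ, pcl::Normal, pcl::SHOT352> shot;
    shot.setInputCloud(keypoints);       // describe only the keypoints...
    shot.setSearchSurface(cloud);        // ...using the full cloud as support
    shot.setInputNormals(normals);
    shot.setRadiusSearch(0.05);          // assumed descriptor support radius
    shot.compute(*descriptors);
    return descriptors;
}
```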
Testing Part:
In the testing part, having identified the class of the given point cloud, we find its
ISS3D keypoints and then their SHOT descriptors. These descriptors are then compared with the
local descriptors of the globally matched point cloud. The comparison is based on the n-dimensional
Euclidean distance between the descriptors of the two point clouds. The output of this pipeline
is a rotation matrix and a translation vector which give the orientation of the given point
cloud with respect to the trained point cloud. The figure below shows the result of the local
pipeline.
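The descriptor comparison can be sketched as follows, modelled on PCL's correspondence-grouping
tutorial: each test (scene) descriptor is matched to its nearest training (model) descriptor and kept
only if the squared descriptor distance is below a threshold (0.25 here, an assumed value).

```cpp
#include <pcl/point_types.h>
#include <pcl/correspondence.h>
#include <pcl/kdtree/kdtree_flann.h>
#include <cmath>
#include <vector>

pcl::CorrespondencesPtr matchDescriptors(
        pcl::PointCloud<pcl::SHOT352>::Ptr model_descriptors,
        pcl::PointCloud<pcl::SHOT352>::Ptr scene_descriptors)
{
    pcl::CorrespondencesPtr correspondences(new pcl::Correspondences);
    pcl::KdTreeFLANN<pcl::SHOT352> match_search;
    match_search.setInputCloud(model_descriptors);

    for (std::size_t i = 0; i < scene_descriptors->size(); ++i) {
        std::vector<int> index(1);
        std::vector<float> sqr_dist(1);
        if (!std::isfinite(scene_descriptors->at(i).descriptor[0]))
            continue;                                     // skip NaN descriptors
        if (match_search.nearestKSearch(scene_descriptors->at(i), 1, index, sqr_dist) == 1
                && sqr_dist[0] < 0.25f) {                 // assumed distance threshold
            correspondences->push_back(
                pcl::Correspondence(index[0], static_cast<int>(i), sqr_dist[0]));
        }
    }
    return correspondences;
}
```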
RANSAC (Random Sample Consensus) part:
The correspondences from the local matching, as seen in the figure above, are not all
correct. There is always a considerable number of wrong matches, which lead to an incorrect
rotation matrix and translation vector. In order to remove the incorrect matches, RANSAC is
used. It is an iterative process which estimates the parameters of a mathematical model from a set
of pre-recorded data containing outliers; its aim here is to remove those outliers. In this case it
finds the dominant orientation of the correspondence lines and removes the lines that are not
parallel to it. The following figures show the results when RANSAC was used on matching two
keyboards.
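PCL provides this outlier rejection as pcl::registration::CorrespondenceRejectorSampleConsensus; the
sketch below keeps only the RANSAC inliers and returns the estimated rigid transform (rotation and
translation in one 4x4 matrix). The inlier threshold and iteration count are assumed values.

```cpp
#include <pcl/point_types.h>
#include <pcl/correspondence.h>
#include <pcl/registration/correspondence_rejection_sample_consensus.h>
#include <Eigen/Core>

Eigen::Matrix4f rejectOutliers(
        pcl::PointCloud<pcl::PointXYZ>::Ptr model_keypoints,
        pcl::PointCloud<pcl::PointXYZ>::Ptr scene_keypoints,
        pcl::CorrespondencesPtr all_correspondences,
        pcl::Correspondences& inliers)
{
    pcl::registration::CorrespondenceRejectorSampleConsensus<pcl::PointXYZ> rejector;
    rejector.setInputSource(model_keypoints);
    rejector.setInputTarget(scene_keypoints);
    rejector.setInlierThreshold(0.02);        // 2 cm tolerance (assumed)
    rejector.setMaximumIterations(1000);
    rejector.setInputCorrespondences(all_correspondences);
    rejector.getCorrespondences(inliers);     // correspondences surviving RANSAC

    // 4x4 rigid transform (rotation + translation) estimated from the inliers.
    return rejector.getBestTransformation();
}
```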
4 Testing and Training
Training and testing are 2 separate processes. As mentioned above each of the global and local
pipeline has training part and testing part. But while execution the training is done together for
both the pipelines and testing for both is done together. The training is done to build a database
for object recognition. It has to be done for different views of the object. In testing part, there
are additional steps for matching. Training is an offline process i.e. it is done once and then the
system is ready recognition while testing is an online process. The efforts are always made to
reduce the time required for testing. Training always takes more time since it has to process
more number of point clouds.
5 Experiment Results
Two experiments are conducted. The first uses an object dataset from the internet to train
the system; in the second, a chair dataset is constructed using an office chair. The matching
accuracy results are stated in each case.
5.1 Experiment 1 – RGB-D object dataset from the internet
The standard RGB-D object dataset is used in this project for experimentation. The dataset has
different folders for each class (for example, inside the Keyboard folder there are keyboard_1,
keyboard_2, etc. folders). Each folder has object views from a particular elevation. We used
only 10 classes (objects) from this dataset for final testing. The following are the classes used:
We tried global recognition with two different algorithms.
VFH:
1. Training done with folder 1 (e.g. keyboard_1) and testing done on different images from
folder 1.
Object Matched/Tested Mostly confused with
Apple 213/219 Orange, Cap
Banana 219/257 Soda_can, Shampoo
Cap 222/227 Apple, Soda_can
Coffee_mug 197/200 Apple
Keyboard 252/252
Kleenex 232/271 Apple, Cap
Orange 251/252 Apple
Plate 249/253 Coffee_mug
Shampoo 208/273 Soda_can, Kleenex
Soda_can 207/227 Apple, Shampoo
2. Training done with folder 1 (e.g. keyboard_1) and testing done on different images from
folder 2 (e.g. keyboard_2).
Object Matched/Tested Mostly confused with
Apple 208/225 Orange, Soda_can
Banana 167/253 Soda_can, Shampoo
Cap 191/238 Coffee_mug, Soda_can
Coffee_mug 171/201 Shampoo, Soda_can
Keyboard 160/211 Shampoo, Kleenex
Kleenex 213/273 Cap, Soda_can
Orange 190/253 Apple
Plate 182/211
Shampoo 199/273 Soda_can, Kleenex
Soda_can 182/211 Apple, Cap
OUR_CVFH:
1. Training done with folder 1 (e.g. keyboard_1) and testing done on different images from
folder 1.
Object Matched/Tested Mostly confused with
Apple 321/330 Orange, Soda_can
Banana 288/300 Kleenex, Shampoo
Cap 396/411 Kleenex, Coffee_mug
Coffee_mug 638/640 Cap
Keyboard 418/424 Banana, Plate
Kleenex 560/603 Cap, Coffee_mug
Orange 406/409 Apple
Plate 547/553 Cap, Kleenex
Shampoo 367/382 Soda_can, Kleenex
Soda_can 309/324 Cap, Apple
2. Training done with folder 1 (e.g. keyboard_1) and testing done on different images from
folder 2 (e.g. keyboard_2).
Object Matched/Tested Mostly confused with
Apple 326/345 Orange, Soda_can
Banana 251/348 Shampoo
Cap 366/446 Coffee_mug, Kleenex
Coffee_mug 669/700 Kleenex, Cap
Keyboard 286/511 Shampoo, Cap
Kleenex 413/483 Cap, Soda_can
Orange 400/415 Apple
Plate 633/636 Cap
Shampoo 242/411 Soda_can, Banana
Soda_can 270/312 Cap, Coffee_mug
Based on the accuracy of the two methods, VFH global descriptors are used for matching
objects against the dataset. The results are demonstrated by capturing a 3D image of a coffee mug
and of a keyboard and matching each with the available dataset.
For Keyboard:
So the system identified the given object correctly as a keyboard from among the 9 objects used
for training. The white keyboard is from the dataset while the black one is the one captured
with the Kinect. There is a problem with the sizes of the keyboards: the two are of different sizes,
so they cannot be aligned exactly. The following is the rotation matrix and the translation vector
for the matching:
For Coffee Mug:
Again, for the coffee mug, the system identified it correctly. The light purple mug is the coffee mug
from the dataset while the red one is captured using the Kinect. As one can see, the alignment
is not proper; to improve the results one can increase the number of views used for
training or increase the number of points used for computing a single local descriptor. The
alignment issue arises due to the inaccurate performance of the local descriptors.
5.2 Experiment 2 – Creating and testing on own dataset
In this experiment a Chair dataset is created by taking 3D images of an office chair from a
particular elevation. The chair was rotated at the same place and different 3D views were
collected. All these images are first passed through a Pass Through filter to remove the
background objects and the ground and then only the chair remains in the image. VFH global
descriptors are found for these images and then the descriptors are stored in a KDtree along
with the other 10 object dataset.
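The pass-through filtering mentioned above can be sketched in PCL as follows; the filter field and
depth limits are placeholders, not the values used when the chair dataset was captured.

```cpp
#include <pcl/point_types.h>
#include <pcl/filters/passthrough.h>

pcl::PointCloud<pcl::PointXYZ>::Ptr cropToChair(pcl::PointCloud<pcl::PointXYZ>::Ptr scene) {
    pcl::PointCloud<pcl::PointXYZ>::Ptr chair(new pcl::PointCloud<pcl::PointXYZ>);
    pcl::PassThrough<pcl::PointXYZ> pass;
    pass.setInputCloud(scene);
    pass.setFilterFieldName("z");          // keep only a band of depths
    pass.setFilterLimits(0.5, 1.5);        // 0.5 m to 1.5 m in front of the Kinect (assumed)
    pass.filter(*chair);                   // points outside the limits are dropped
    return chair;
}
```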
An image of a different chair is taken; similarly, it is passed through the filter and its descriptor
is calculated to be used for matching. The following are the results of the chair matching:
In the above figures, the brown chair is the one used in the dataset and the pink chair is the one
used for testing. All three images, showing different view angles of the pink chair, are given for
testing. In every case, the system correctly identified the given object as a chair. Here again, the
accuracy of alignment is different for different views of the test chair. The following is the
rotation matrix and translation vector for the first alignment figure:
6 Conclusion
A comparative study of algorithms for 3D object recognition was performed. The algorithms
considered in this project are, in the global category, VFH and OUR_CVFH and, in the local category,
SHOT, FPFH and PFH. After executing the algorithms individually, the results showed that the VFH and
SHOT descriptors perform better, along with ISS3D keypoints. When VFH and SHOT were executed together
in a pipeline, the combination also showed good results, as is evident in this report.
References
- Khaled Alhamzi, Mohammed Elmogy, Sherif Barakat. 3D Object Recognition Based on Local and Global
  Features Using Point Cloud Library.
- Tutorial documentation from the PCL website:
  http://pointclouds.org/documentation/tutorials/
- Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. A Large-Scale Hierarchical Multi-View RGB-D
  Object Dataset. In IEEE International Conference on Robotics and Automation (ICRA), May 2011.
- Wikipedia:
  https://en.wikipedia.org/wiki/Structured-light_3D_scanner
  https://en.wikipedia.org/wiki/Time-of-flight_camera
- PCL/OpenNI tutorials:
  http://robotica.unileon.es/mediawiki/index.php/PhD-3D-Object-Tracking
Contenu connexe

Tendances

BachelorThesis 5.3
BachelorThesis 5.3BachelorThesis 5.3
BachelorThesis 5.3
Nguyen Huy
 
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
c.choi
 
3 track kinect@Bicocca - sdk e camere
3   track kinect@Bicocca - sdk e camere3   track kinect@Bicocca - sdk e camere
3 track kinect@Bicocca - sdk e camere
Matteo Valoriani
 
Integration of a Structure from Motion into Virtual and Augmented Reality for...
Integration of a Structure from Motion into Virtual and Augmented Reality for...Integration of a Structure from Motion into Virtual and Augmented Reality for...
Integration of a Structure from Motion into Virtual and Augmented Reality for...
Tomohiro Fukuda
 

Tendances (18)

Image recognition
Image recognitionImage recognition
Image recognition
 
BachelorThesis 5.3
BachelorThesis 5.3BachelorThesis 5.3
BachelorThesis 5.3
 
Availability of Mobile Augmented Reality System for Urban Landscape Simulation
Availability of Mobile Augmented Reality System for Urban Landscape SimulationAvailability of Mobile Augmented Reality System for Urban Landscape Simulation
Availability of Mobile Augmented Reality System for Urban Landscape Simulation
 
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
Real-time 3D Object Pose Estimation and Tracking for Natural Landmark Based V...
 
高精地图数据协议标准探究
高精地图数据协议标准探究高精地图数据协议标准探究
高精地图数据协议标准探究
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous Driving
 
3 track kinect@Bicocca - sdk e camere
3   track kinect@Bicocca - sdk e camere3   track kinect@Bicocca - sdk e camere
3 track kinect@Bicocca - sdk e camere
 
Kinect2 hands on
Kinect2 hands onKinect2 hands on
Kinect2 hands on
 
Image recognition
Image recognitionImage recognition
Image recognition
 
Secure System based on Dynamic Features of IRIS Recognition
Secure System based on Dynamic Features of IRIS RecognitionSecure System based on Dynamic Features of IRIS Recognition
Secure System based on Dynamic Features of IRIS Recognition
 
Integrating UAV Development Technology with Augmented Reality Toward Landscap...
Integrating UAV Development Technology with Augmented Reality Toward Landscap...Integrating UAV Development Technology with Augmented Reality Toward Landscap...
Integrating UAV Development Technology with Augmented Reality Toward Landscap...
 
DISTRIBUTED AND SYNCHRONISED VR MEETING USING CLOUD COMPUTING: Availability a...
DISTRIBUTED AND SYNCHRONISED VR MEETING USING CLOUD COMPUTING: Availability a...DISTRIBUTED AND SYNCHRONISED VR MEETING USING CLOUD COMPUTING: Availability a...
DISTRIBUTED AND SYNCHRONISED VR MEETING USING CLOUD COMPUTING: Availability a...
 
GOAR: GIS Oriented Mobile Augmented Reality for Urban Landscape Assessment
GOAR: GIS Oriented Mobile Augmented Reality for Urban Landscape AssessmentGOAR: GIS Oriented Mobile Augmented Reality for Urban Landscape Assessment
GOAR: GIS Oriented Mobile Augmented Reality for Urban Landscape Assessment
 
Integration of a Structure from Motion into Virtual and Augmented Reality for...
Integration of a Structure from Motion into Virtual and Augmented Reality for...Integration of a Structure from Motion into Virtual and Augmented Reality for...
Integration of a Structure from Motion into Virtual and Augmented Reality for...
 
Bb26347353
Bb26347353Bb26347353
Bb26347353
 
Traffic Light Detection and Recognition for Self Driving Cars using Deep Lear...
Traffic Light Detection and Recognition for Self Driving Cars using Deep Lear...Traffic Light Detection and Recognition for Self Driving Cars using Deep Lear...
Traffic Light Detection and Recognition for Self Driving Cars using Deep Lear...
 
Programming with kinect v2
Programming with kinect v2Programming with kinect v2
Programming with kinect v2
 
SOAR: SENSOR ORIENTED MOBILE AUGMENTED REALITY FOR URBAN LANDSCAPE ASSESSMENT
SOAR: SENSOR ORIENTED MOBILE AUGMENTED REALITY FOR URBAN LANDSCAPE ASSESSMENTSOAR: SENSOR ORIENTED MOBILE AUGMENTED REALITY FOR URBAN LANDSCAPE ASSESSMENT
SOAR: SENSOR ORIENTED MOBILE AUGMENTED REALITY FOR URBAN LANDSCAPE ASSESSMENT
 

En vedette (8)

Real time gesture recognition
Real time gesture recognitionReal time gesture recognition
Real time gesture recognition
 
Gesture recognition technology
Gesture recognition technologyGesture recognition technology
Gesture recognition technology
 
Hand gesture recognition
Hand gesture recognitionHand gesture recognition
Hand gesture recognition
 
Hand gesture recognition system(FYP REPORT)
Hand gesture recognition system(FYP REPORT)Hand gesture recognition system(FYP REPORT)
Hand gesture recognition system(FYP REPORT)
 
Hand Gesture Recognition
Hand Gesture RecognitionHand Gesture Recognition
Hand Gesture Recognition
 
Gesture Recognition
Gesture RecognitionGesture Recognition
Gesture Recognition
 
Gesture recognition
Gesture recognitionGesture recognition
Gesture recognition
 
Gesture recognition technology
Gesture recognition technology Gesture recognition technology
Gesture recognition technology
 

Similaire à Final_draft_Practice_School_II_report

Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrate...
 Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrate... Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrate...
Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrate...
AIRCC Publishing Corporation
 
Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrated...
Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrated...Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrated...
Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrated...
AIRCC Publishing Corporation
 
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
mokamojah
 
Interactive Full-Body Motion Capture Using Infrared Sensor Network
Interactive Full-Body Motion Capture Using Infrared Sensor Network  Interactive Full-Body Motion Capture Using Infrared Sensor Network
Interactive Full-Body Motion Capture Using Infrared Sensor Network
ijcga
 
Interactive full body motion capture using infrared sensor network
Interactive full body motion capture using infrared sensor networkInteractive full body motion capture using infrared sensor network
Interactive full body motion capture using infrared sensor network
ijcga
 
From Sense to Print: Towards Automatic 3D Printing from 3D Sensing Devices
From Sense to Print: Towards Automatic 3D Printing from 3D Sensing DevicesFrom Sense to Print: Towards Automatic 3D Printing from 3D Sensing Devices
From Sense to Print: Towards Automatic 3D Printing from 3D Sensing Devices
toukaigi
 
PS1_2014_2012B5A7521P_2012B5A7848P_2012B4A7958H
PS1_2014_2012B5A7521P_2012B5A7848P_2012B4A7958HPS1_2014_2012B5A7521P_2012B5A7848P_2012B4A7958H
PS1_2014_2012B5A7521P_2012B5A7848P_2012B4A7958H
Saurabh Kumar
 

Similaire à Final_draft_Practice_School_II_report (20)

A Wireless Network Infrastructure Architecture for Rural Communities
A Wireless Network Infrastructure Architecture for Rural CommunitiesA Wireless Network Infrastructure Architecture for Rural Communities
A Wireless Network Infrastructure Architecture for Rural Communities
 
Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrate...
 Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrate... Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrate...
Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrate...
 
Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrated...
Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrated...Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrated...
Complete End-to-End Low Cost Solution to a 3D Scanning System with Integrated...
 
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf10.1109@ICCMC48092.2020.ICCMC-000167.pdf
10.1109@ICCMC48092.2020.ICCMC-000167.pdf
 
IRJET - Direct Me-Nevigation for Blind People
IRJET -  	  Direct Me-Nevigation for Blind PeopleIRJET -  	  Direct Me-Nevigation for Blind People
IRJET - Direct Me-Nevigation for Blind People
 
Interactive Full-Body Motion Capture Using Infrared Sensor Network
Interactive Full-Body Motion Capture Using Infrared Sensor Network  Interactive Full-Body Motion Capture Using Infrared Sensor Network
Interactive Full-Body Motion Capture Using Infrared Sensor Network
 
Interactive full body motion capture using infrared sensor network
Interactive full body motion capture using infrared sensor networkInteractive full body motion capture using infrared sensor network
Interactive full body motion capture using infrared sensor network
 
IRJET- 3D Object Recognition of Car Image Detection
IRJET-  	  3D Object Recognition of Car Image DetectionIRJET-  	  3D Object Recognition of Car Image Detection
IRJET- 3D Object Recognition of Car Image Detection
 
Virtual Yoga System Using Kinect Sensor
Virtual Yoga System Using Kinect SensorVirtual Yoga System Using Kinect Sensor
Virtual Yoga System Using Kinect Sensor
 
Dataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problemsDataset creation for Deep Learning-based Geometric Computer Vision problems
Dataset creation for Deep Learning-based Geometric Computer Vision problems
 
REGISTRATION TECHNOLOGIES and THEIR CLASSIFICATION IN AUGMENTED REALITY THE K...
REGISTRATION TECHNOLOGIES and THEIR CLASSIFICATION IN AUGMENTED REALITY THE K...REGISTRATION TECHNOLOGIES and THEIR CLASSIFICATION IN AUGMENTED REALITY THE K...
REGISTRATION TECHNOLOGIES and THEIR CLASSIFICATION IN AUGMENTED REALITY THE K...
 
REGISTRATION TECHNOLOGIES and THEIR CLASSIFICATION IN AUGMENTED REALITY THE K...
REGISTRATION TECHNOLOGIES and THEIR CLASSIFICATION IN AUGMENTED REALITY THE K...REGISTRATION TECHNOLOGIES and THEIR CLASSIFICATION IN AUGMENTED REALITY THE K...
REGISTRATION TECHNOLOGIES and THEIR CLASSIFICATION IN AUGMENTED REALITY THE K...
 
From Sense to Print: Towards Automatic 3D Printing from 3D Sensing Devices
From Sense to Print: Towards Automatic 3D Printing from 3D Sensing DevicesFrom Sense to Print: Towards Automatic 3D Printing from 3D Sensing Devices
From Sense to Print: Towards Automatic 3D Printing from 3D Sensing Devices
 
Computer Vision.pdf
Computer Vision.pdfComputer Vision.pdf
Computer Vision.pdf
 
A Literature Survey: Neural Networks for object detection
A Literature Survey: Neural Networks for object detectionA Literature Survey: Neural Networks for object detection
A Literature Survey: Neural Networks for object detection
 
Object Detetcion using SSD-MobileNet
Object Detetcion using SSD-MobileNetObject Detetcion using SSD-MobileNet
Object Detetcion using SSD-MobileNet
 
IRJET- Comparative Study of Different Techniques for Text as Well as Object D...
IRJET- Comparative Study of Different Techniques for Text as Well as Object D...IRJET- Comparative Study of Different Techniques for Text as Well as Object D...
IRJET- Comparative Study of Different Techniques for Text as Well as Object D...
 
Seminar report on image sensor
Seminar report on image sensorSeminar report on image sensor
Seminar report on image sensor
 
final_report
final_reportfinal_report
final_report
 
PS1_2014_2012B5A7521P_2012B5A7848P_2012B4A7958H
PS1_2014_2012B5A7521P_2012B5A7848P_2012B4A7958HPS1_2014_2012B5A7521P_2012B5A7848P_2012B4A7958H
PS1_2014_2012B5A7521P_2012B5A7848P_2012B4A7958H
 

Plus de Rishikesh Bagwe

Stb of Condensate system
Stb of Condensate systemStb of Condensate system
Stb of Condensate system
Rishikesh Bagwe
 
PLC_ProjectReport_BITS_Pilani
PLC_ProjectReport_BITS_PilaniPLC_ProjectReport_BITS_Pilani
PLC_ProjectReport_BITS_Pilani
Rishikesh Bagwe
 

Plus de Rishikesh Bagwe (7)

Sterilization Unit
Sterilization UnitSterilization Unit
Sterilization Unit
 
DC Motor Drive System (Cascade Control Strategy)
DC Motor Drive System (Cascade Control Strategy)DC Motor Drive System (Cascade Control Strategy)
DC Motor Drive System (Cascade Control Strategy)
 
Gesture controlled robotic arm embedded systems project
Gesture controlled robotic arm embedded systems projectGesture controlled robotic arm embedded systems project
Gesture controlled robotic arm embedded systems project
 
QNET Heating Ventilation and Air Conditioning in LABVIEW & Strain Guages
QNET Heating Ventilation and Air Conditioning in LABVIEW & Strain GuagesQNET Heating Ventilation and Air Conditioning in LABVIEW & Strain Guages
QNET Heating Ventilation and Air Conditioning in LABVIEW & Strain Guages
 
Dynamic Matrix Control (DMC) on jacket tank heater - Rishikesh Bagwe
Dynamic Matrix Control (DMC) on jacket tank heater - Rishikesh BagweDynamic Matrix Control (DMC) on jacket tank heater - Rishikesh Bagwe
Dynamic Matrix Control (DMC) on jacket tank heater - Rishikesh Bagwe
 
Stb of Condensate system
Stb of Condensate systemStb of Condensate system
Stb of Condensate system
 
PLC_ProjectReport_BITS_Pilani
PLC_ProjectReport_BITS_PilaniPLC_ProjectReport_BITS_Pilani
PLC_ProjectReport_BITS_Pilani
 

Final_draft_Practice_School_II_report

  • 1. A Report On 3D OBJECT RECOGNITION USING POINT CLOUD LIBRARY (PCL) prepared by: Rishikesh Bagwe (2012A8PS401G) Mentor: Imran Syed (Sc ‘C’) Centre for Artificial Intelligence and Robotics, Bangalore A Practice School – II station of BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI (May, 2016)
  • 2. i BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI (RAJASTHAN) Practice School Division Station: Centre for Artificial Intelligence and Robotics, Bangalore Duration: 5 months and 7 days Date of Start: 12th Jan 2016 Date of Submission: 31st May 2016 Title of the Project: 3D Object Recognition using point cloud library (pcl) Students: Name ID numbers Discipline Rishikesh Bagwe 2012A8PS401G Electronics and Instrumentation Station Experts: Name Designation Imran Syed Scientist ‘C’ PS Faculty: K. Pradheep Kumar Siju C. R. Keywords: Project Area: Artificial Intelligence, Object Recognition Abstract: This report gives detailed steps required for 3D object recognition. It states the literature and concepts used in the process. The report also has the global and local pipeline execution results. A comparative study of different algorithms used in the pipelines is listed here and the combination of algorithms to be used is concluded.
  • 3. ii Preface The following project report is based on 3D object recognition. It gives information on why 3D object recognition is better than 2D one, what are the steps involved in the process of object recognition. It also gives an insight into the mathematics used for various algorithms in keypoint description. Some of the commonly used algorithms for surface description are explained in the report. The training and testing parts are the body of object recognition. The training is done by using 400 points clouds (3D images) for each class (object). The combination of various algorithms is tried and the fastest and accurate combination is selected. Microsoft Kinect Xbox360 is used to gather 3D data for testing purpose.
  • 4. iii Acknowledgement I would like to express my special thanks of gratitude to my station mentor Mr. Imran Syed for giving me an opportunity to work on 3D object recognition project and continuously guiding me through the obstacles I faced. I would also like to thank the organisation Centre for Artificial Intelligence and Robotics (CAIR) for allowing me to use their equipment and expertise for my project. Secondly I like thank my BITS PS faculty Dr. K. Pradheep Kumar and Dr. Siju C.R. for easing the process of entering into the CAIR and for guiding me through the rules and regulation of the Practice School Division. Lastly I want to thank the Birla Institute of Technology and Science (BITS), Pilani for providing me the opportunity to work in a reputed research organisation, CAIR.
  • 5. iv Table of Contents Abstract Sheet .............................................................................................................................i Preface........................................................................................................................................ii Acknowledgement ................................................................................................................... iii Table of Contents......................................................................................................................iv 1 Introduction ........................................................................................................................1 2 Terminology and Concepts.................................................................................................4 3 The process classification and flow....................................................................................6 3.1 The Global Pipeline.....................................................................................................7 3.2 The Local Pipeline ......................................................................................................9 4 Testing and Training.........................................................................................................11 5 Experiment Results...........................................................................................................12 5.1 Experiment 1 – RGB-D object dataset from the internet..........................................12 5.2 Experiment 2 – Creating and testing on own dataset................................................16 6 Conclusion........................................................................................................................17 References................................................................................................................................18
  • 6. 1 1 Introduction The objective of the 3D object recognition is to identify objects correctly in the point cloud and determines their poses (i.e., location and orientation). For many years, the most common sensors for computer vision were 2D cameras that retrieved a RGB image of the scene (like all the digital cameras that are so common nowadays, in our laptops or smartphones). Algorithms exist that are able to find an object in a picture, even if it is rotated or scaled. Then came 3D sensor which gave us the depth of each point in the scene and because of this new measurement we can now detect the object and its pose even from a different camera view than it was trained with. But the addition of a new dimension makes calculations expensive. Working with the data they retrieve is a lot different that working with a 2D image, and texture information is rarely used. There are different 3D sensors categorized into 3 classes viz. - Stereo Cameras: They are the only passive measurement device of the list. They are essentially two identical cameras assembled together (some centimeters apart), that capture slightly different scenes. By computing the differences between both scenes, it is possible to infer depth information about the points in each image. A stereo pair is cheap, but perhaps the least accurate sensor. Ideally, it would require perfect calibration of both cameras, which is unfeasible in practice. Bad light conditions will render it useless. - Time-of-flight (Tof): These sensors work by measuring the time it has taken a ray or pulse of light to travel a certain distance. Because the speed of light is a known constant, a simple formula can be used to obtain the range to the object. These sensors are not affected by light conditions and have the potential to be very precise. A LIDAR (light+radar) is just a common laser range finder mounted on a platform that is able to rotate very fast, scanning the scene point by point. - Structured Light: sensors (like Kinect and Xtion) work by projecting a pattern of infrared light (for example, a grid of lines, or a "constellation" of points) on top of the scene's objects. This pattern is seen distorted when looked from a different from the projector's perspective. By analysing this distortion, information about the depth can be retrieved, and the surface(s) reconstructed.
  • 7. 2 In this project I will be using Microsoft Kinect for taking data. So it is important to go into the details of 3D image formation from Kinect sensor. The basis of Microsoft Kinect is the PrimeSense Technology. Kinect has PS1080 system-on-chip which handles its 3D image formation. The Kinect has a RGB Camera and an IR projector and sensor. It can be seen in the picture given below. It uses the IR sensor & projector pair to measure the depth of a point in the scene. The theory of operation is simple, but its execution can be complex which is done by PrimeSense’s PS1080 SoC. The IR projector projects a pattern of IR dots and detects them using a conventional CMOS image sensor with an IR filter. The pattern will change based upon objects that reflect the light. The dots will change size and position based on how far the objects are from the source. For example: The PS1080 SoC has both the projected pattern and the sensed pattern. First it maps the points in the projected pattern to the ones in sensed pattern. Then it measures the distance by which the point has moved (disparity). This disparity is then used to calculate the depth of that point as follows.
  • 8. 3 In the diagram beside, By similarity of ΔO & ΔXOP 𝑥 𝑂𝑃 = 𝑓 𝑍 ……..1 And by similarity of ΔO’ & ΔXO’P 𝑥′ 𝑂′𝑃 = 𝑓 𝑍 …….2 From 1 and 2, 𝑥+𝑥′ 𝑂𝑂′ = 𝑓 𝑍 . We know OO’ distance and f (focal length). x and x’ are found by the Primesense SoC PS1080 from IR projector and IR sensor respectively. Therefore we can find Z, the depth of the point X. The Kinect return the 3D data in the form of a point cloud. A point cloud is a set of points in three-dimensional space, each with its own XYZ coordinates. Every point corresponds to exactly one pixel of the captured image. Optionally, the point cloud data can also store RGB data if the sensor has a RGB camera. The data format in which a point cloud is stored is called point cloud data (.pcd). In order to do the processing like object detection, object recognition, 3D modelling on these point clouds and to handle the complex calculation involving depth measurement a point cloud library (pcl) was started in early 2010 by Willow Garage and OpenCV. The first version was fully introduced in 2011, and it has been actively maintained ever since. PCL aims to be an one-for-all solution for point cloud and 3D processing. It is an open source, multiplatform library divided in many submodules for different tasks, like visualization, filtering, segmentation, registration, searching, and feature estimation. left up – IR projection left down – RGB image2D right down – depth map right up – point cloud (RGBD)
  • 9. 4 2 Terminology and Concepts Keypoints: According to the original publication a keypoint is a point on the object which 1. takes information about borders and the surface structure into account 2. can be reliably detected even if the object is observed from another perspective 3. provides stable areas for normal estimation or the descriptor calculation in general As you can see in the beside figure the good keypoints are not exactly on the edge but just around it so that it is easy to calculate the normals in the neighbourhood. Also the red bad keypoints does not have characteristic surface change beneath them. The main reason to find the keypoints is to reduce the stress on further process. A point cloud of an object can have as much as 1 lakh points which makes processing lengthy. But if we calculate the keypoints, it reduces the point number to some hundreds. There are different algorithms for detecting keypoints. A small set of detectors specifically proposed for 3D point clouds and range maps viz. Intrinsic Shape Signatures (ISS), NARF, etc. Several keypoint detectors are derived from 2D interest point detectors, they are Harris3D, SIFT3D, SUSAN3D. The following image shows the ISS3D keypoints calculated on a keyboard point cloud data: The colored points are the keypoints. As you can see not all the keypoints are good ones according to the above description of keypoints. This is mainly due to the aberrations in the point cloud data collected.
  • 10. 5 Descriptors: It is a ‘n’ dimensional vector calculated for each points local neighbourhood or sometimes it is computed for the whole cloud. The dimension of the vector depends on the algorithm used for calculating it. These descriptors are divided into 2 categories global and local descriptors. Each local descriptors describe the surface beneath the neighbourhood of each keypoint whereas one global descriptor describes the whole viewed object surface. In order to calculate the descriptors we first have to calculate normals at each point in the specified neighbourhood. Then the difference in the angels between the normals is binned into a histogram. For example Fast Point Feature Histogram (FPFH) has 33 bins. These 33 bins are subdivided into 11 bins based on the value intervals for each parameter (e.g. parameter will the angle difference between the normal at desired point and one of its neighbouring point in x plane). So each interval will have 3 bins. The number of instances for each interval and for each parameter is calculated and then added to the histogram at the appropriate bin. Apart from FPFH there are various algorithm for descriptor calculation like in Local category - Signature of Histograms of Orientations (SHOT), Point Feature Histogram (PFH) and in Global category – Viewpoint Feature Histogram (VFH), Global Fast Point Feature Histogram (GFPFH). In this report we have used the VFH descriptor for global pipeline and SHOT descriptor for local pipeline. Here are the FPFH and VFH descriptors calculated for the keyboard point cloud data shown above: VFH descriptor FPFH descriptors
  • 11. 6 3 The process classification and flow The basis of 3D object recognition is to find a set of correspondences between two different point clouds, one of them containing the object we are looking for. The process is classified into 2 separate pipelines viz. Local pipeline and Global pipeline. The global pipeline is usually used for object detection in a scene and the local pipeline is to find the object position and orientation with respect to the camera. Each pipeline has different stages as shown in the figure below. The main difference between the 2 pipeline is at the stage of description. Both the pipelines use different algorithms for finding descriptors. The local pipeline describes the surface curvatures of an object while the global pipeline considers the object as a whole. So for the complete object recognition we need to use the global pipeline first and then the local one. So accordingly there will be different combinations of algorithm used for description in both the pipeline. The accuracy and speed of the process highly depends on the combination used. A comparative study needs to be done in order to get the optimum throughput. Apart from the algorithms used for descriptors, both the pipelines have different stages. As seen from the figure above, keypoint extraction is only in local pipeline because in global pipeline, descriptor is found for whole object while in local pipeline in order to reduce the processing time keyoints are found first and then their descriptors are calculated. Similarly segmentation step is there in only global pipeline because we want the object from the separately for finding the objects global descriptor as a whole whereas in local pipeline we have the object itself. Usually when these 2 pipelines are integrated, first we carry out the global pipeline i.e. first we find the object in the given scene and then run the local pipeline to calculate the orientation of the found object.
  • 12. 7 3.1 The Global Pipeline As stated earlier the global pipeline is used for finding the class (what object it is?) of the given object. The number of classes that the system can recognise depends on the training of the system. Training Part: A segmented point cloud database is used for training, so we do not need to perform segmentation while training. The training part is done by taking 130 images of each object from different views and then grouping them into few groups based on a threshold distance of the calculated global descriptors. The global descriptor used in this project is VFH (View-Point Feature Histogram) which is view dependent that is why it needs images from different view- points. The descriptors are stored in the database in a KD –tree format. Along with the KD-tree format (hdf5) file there are 2 more files generated namely the descriptor name file and the training data path file. The data path file generated is shown below:
  • 13. 8 Testing Part: First the data captured from the Kinect is segmented to keep only the object in the point cloud. This segmentation is done on the basis of surface planes like cylindrical, planar, spherical etc. For there are different objects kept on the table, so we perform planar segmentation to detect the table surface and cut it out. The remaining disjoint point clouds will the objects kept on the table. (photo of segmentation) Sometimes the segmentation will give back 2-3 points cloud even if only one object is kept on the table in front of it. In such situations the first point cloud produced by the algorithm is the intended object the others are maybe due to the faulty registration of the point cloud in the Kinect. After the segmentation, VFH descriptors are calculated for the acquired point cloud. The descriptor are then compared to all the VFH descriptors stored in the KD-tree while training based on a distance threshold in 308 dimensions (VFH being 308 dimensions). And the closest 5 matching points clouds are output by the program as shown in the figure below The highlighted part in grey is the object model provided for matching and the one highlighted in pink is the matching result. As you can see the keyboard is matched correctly to keyboard.
  • 14. 9 3.2 The Local Pipeline The local pipeline is used to measure the orientation and position of the object in the scene. The orientation which we get is relative to a point cloud which we give for testing (whose orientation and position we know). Training Part: After matching and determining the class of the object, we find the local descriptors file from our database for the one with which it is matched. This database is created for all the files which are used for global pipeline training. For local pipeline training we first find the keypoints of a point cloud. In this project we use ISS3D keypoint detector for finding the keypoints. Having found the keypoints we then find the descriptors. We have used SHOT (Signature of Histogram of Orientations) descriptors. The database of local descriptors for all point clouds is stored in individual files. Testing Part: In this testing part, having identified the class of the given point cloud. We then find ISS3D keypoints and then the SHOT descriptors. These descriptors are then compared with the globally matched point cloud’s local descriptors. The comparison is based on the n dimensional euclidean distance between the descriptors of the two point clouds. The output of this pipeline is the rotational matrix and a translation vector which gives the orientation of the given point cloud with respect to the trained point cloud. The figure below shows the result of the local pipeline.
RANSAC (Random Sample Consensus) part: The correspondences obtained from local matching, as seen in the figure above, are not all correct. There is always a considerable number of wrong matches, which leads to an incorrect rotation matrix and translation vector. RANSAC is used to remove these incorrect matches. It is an iterative process that estimates the parameters of a mathematical model from a set of observations containing outliers, with the aim of rejecting those outliers. Here it finds the dominant orientation of the correspondence lines and removes the lines that are not consistent with it. The following figures show the results when RANSAC was used while matching two keyboards.
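In PCL this outlier rejection can be expressed with the sample-consensus correspondence rejector, as in the minimal sketch below; the inlier threshold and the input names are assumptions, not the project's exact settings.

```cpp
// Minimal sketch of the RANSAC rejection step using PCL's sample-consensus
// correspondence rejector. Threshold and input names are assumed.
#include <pcl/point_types.h>
#include <pcl/correspondence.h>
#include <pcl/registration/correspondence_rejection_sample_consensus.h>

Eigen::Matrix4f
rejectBadMatches (const pcl::PointCloud<pcl::PointXYZ>::Ptr &model_keypoints,
                  const pcl::PointCloud<pcl::PointXYZ>::Ptr &scene_keypoints,
                  const pcl::CorrespondencesPtr &all_matches,
                  pcl::Correspondences &inlier_matches)
{
  pcl::registration::CorrespondenceRejectorSampleConsensus<pcl::PointXYZ> rejector;
  rejector.setInputSource (model_keypoints);
  rejector.setInputTarget (scene_keypoints);
  rejector.setInputCorrespondences (all_matches);
  rejector.setInlierThreshold (0.01);             // assumed 1 cm inlier threshold
  rejector.getCorrespondences (inlier_matches);   // only geometrically consistent matches remain

  // 4x4 transform: rotation in the top-left 3x3 block, translation in the last column.
  return rejector.getBestTransformation ();
}
```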
4 Testing and Training

Training and testing are two separate processes. As mentioned above, both the global and the local pipeline have a training part and a testing part, but during execution the training for both pipelines is done together, and likewise the testing for both is done together. Training is done to build the database used for object recognition and has to be carried out for different views of each object. The testing part involves additional steps for matching. Training is an offline process, i.e. it is done once and the system is then ready for recognition, whereas testing is an online process, so the effort is always to reduce the time required for testing. Training always takes more time since it has to process a larger number of point clouds.
5 Experiment Results

Two experiments are conducted. In the first, an object dataset from the internet is used to train the system; in the second, a chair dataset is constructed using an office chair. The matching accuracy results are stated for each case.

5.1 Experiment 1 – RGB-D object dataset from the internet

The standard RGB-D object dataset is used in this project for experimentation. The dataset has a separate folder for each class (for example, inside the Keyboard folder there are keyboard_1, keyboard_2, etc. folders), and each of these folders contains views of the object from a particular elevation. Only 10 classes (objects) from this dataset are used for the final testing: apple, banana, cap, coffee_mug, keyboard, kleenex, orange, plate, shampoo and soda_can.
We tried global recognition with two different algorithms.

VFH:
1. Training done with folder 1 (e.g. keyboard_1) and testing done on different images from folder 1.

Object        Matched/Tested   Mostly confused with
Apple         213/219          Orange, Cap
Banana        219/257          Soda_can, Shampoo
Cap           222/227          Apple, Soda_can
Coffee_mug    197/200          Apple
Keyboard      252/252          -
Kleenex       232/271          Apple, Cap
Orange        251/252          Apple
Plate         249/253          Coffee_mug
Shampoo       208/273          Soda_can, Kleenex
Soda_can      207/227          Apple, Shampoo

2. Training done with folder 1 (e.g. keyboard_1) and testing done on different images from folder 2 (e.g. keyboard_2).

Object        Matched/Tested   Mostly confused with
Apple         208/225          Orange, Soda_can
Banana        167/253          Soda_can, Shampoo
Cap           191/238          Coffee_mug, Soda_can
Coffee_mug    171/201          Shampoo, Soda_can
Keyboard      160/211          Shampoo, Kleenex
Kleenex       213/273          Cap, Soda_can
Orange        190/253          Apple
Plate         182/211          -
Shampoo       199/273          Soda_can, Kleenex
Soda_can      182/211          Apple, Cap
OUR_CVFH:
1. Training done with folder 1 (e.g. keyboard_1) and testing done on different images from folder 1.

Object        Matched/Tested   Mostly confused with
Apple         321/330          Orange, Soda_can
Banana        288/300          Kleenex, Shampoo
Cap           396/411          Kleenex, Coffee_mug
Coffee_mug    638/640          Cap
Keyboard      418/424          Banana, Plate
Kleenex       560/603          Cap, Coffee_mug
Orange        406/409          Apple
Plate         547/553          Cap, Kleenex
Shampoo       367/382          Soda_can, Kleenex
Soda_can      309/324          Cap, Apple

2. Training done with folder 1 (e.g. keyboard_1) and testing done on different images from folder 2 (e.g. keyboard_2).

Object        Matched/Tested   Mostly confused with
Apple         326/345          Orange, Soda_can
Banana        251/348          Shampoo
Cap           366/446          Coffee_mug, Kleenex
Coffee_mug    669/700          Kleenex, Cap
Keyboard      286/511          Shampoo, Cap
Kleenex       413/483          Cap, Soda_can
Orange        400/415          Apple
Plate         633/636          Cap
Shampoo       242/411          Soda_can, Banana
Soda_can      270/312          Cap, Coffee_mug
Based on the accuracy of the two methods, VFH global descriptors are used for matching objects against the dataset. The results are demonstrated by capturing a 3D image of a coffee mug and of a keyboard and matching each with the available dataset.

For Keyboard: The system identified the given object correctly as a keyboard from among the object classes used for training. The white keyboard is from the dataset while the black one was captured with the Kinect. The two keyboards are of different sizes, so they cannot be aligned exactly. The rotation matrix and translation vector for this match are shown in the figure below.

For Coffee Mug: Again, the system identified the coffee mug correctly. The light purple mug is from the dataset while the red one was captured with the Kinect. As one can see, the alignment is not exact; the results can be improved by increasing the number of views used for training or by increasing the number of points used to compute each local descriptor. The alignment issue arises from the limited accuracy of the local descriptors.
5.2 Experiment 2 – Creating and testing on own dataset

In this experiment a chair dataset is created by taking 3D images of an office chair from a particular elevation. The chair was rotated in place and different 3D views were collected. All these images are first passed through a pass-through filter to remove the background objects and the ground, so that only the chair remains in each image (a sketch of this filtering step is given at the end of this subsection). VFH global descriptors are computed for these images and stored in a KD-tree along with those of the other 10-object dataset. A 3D image of a different chair is then taken, passed through the same filter, and its descriptor is calculated and used for matching. The following figures show the results of the chair matching: the brown chair is the one used in the dataset and the pink chair is the one used for testing. All three test images show different view angles of the pink chair, and in every case the system correctly identified the given object as a chair. Here again, the alignment accuracy differs between views of the test chair. The rotation matrix and translation vector for the first alignment figure are shown in the figure below.
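The pass-through filtering mentioned above could be sketched as follows; the filter axes and limits are assumptions chosen only to illustrate the idea of cropping out the background and the ground, not the values used for the chair dataset.

```cpp
// Sketch of the pass-through filtering used to isolate the chair; axes and
// limits are assumed for illustration only.
#include <pcl/point_types.h>
#include <pcl/filters/passthrough.h>

pcl::PointCloud<pcl::PointXYZ>::Ptr
cropChair (const pcl::PointCloud<pcl::PointXYZ>::Ptr &scene)   // hypothetical helper
{
  pcl::PassThrough<pcl::PointXYZ> pass;

  // Keep only points within an assumed depth window in front of the Kinect.
  pcl::PointCloud<pcl::PointXYZ>::Ptr depth_cropped (new pcl::PointCloud<pcl::PointXYZ>);
  pass.setInputCloud (scene);
  pass.setFilterFieldName ("z");
  pass.setFilterLimits (0.5, 2.0);     // assumed limits in metres
  pass.filter (*depth_cropped);

  // Second pass along the (assumed) height axis to cut away the ground region.
  pcl::PointCloud<pcl::PointXYZ>::Ptr chair_only (new pcl::PointCloud<pcl::PointXYZ>);
  pass.setInputCloud (depth_cropped);
  pass.setFilterFieldName ("y");
  pass.setFilterLimits (-1.0, 0.5);    // assumed limits; tune for the actual scene
  pass.filter (*chair_only);
  return chair_only;
}
```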
6 Conclusion

A comparative study of algorithms for 3D object recognition has been performed. The algorithms considered in this project are VFH and OUR_CVFH for the global pipeline, and SHOT, FPFH and PFH for the local pipeline. Executing the algorithms individually showed that the VFH and SHOT descriptors, together with ISS3D keypoints, perform best. When VFH and SHOT were run together in a single pipeline they also gave good results, as is evident in this report.
References

- Khaled Alhamzi, Mohammed Elmogy, Sherif Barakat. 3D Object Recognition Based on Local and Global Features Using Point Cloud Library.
- Point Cloud Library tutorial documentation: http://pointclouds.org/documentation/tutorials/
- Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. A Large-Scale Hierarchical Multi-View RGB-D Object Dataset. In IEEE International Conference on Robotics and Automation (ICRA), May 2011.
- Wikipedia:
  https://en.wikipedia.org/wiki/Structured-light_3D_scanner
  https://en.wikipedia.org/wiki/Time-of-flight_camera
- PCL/OpenNI tutorials: http://robotica.unileon.es/mediawiki/index.php/PhD-3D-Object-Tracking