
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

LE VAN HUNG

3D OBJECT DETECTION AND RECOGNITION:
ASSISTING VISUALLY IMPAIRED PEOPLE IN
DAILY ACTIVITIES

Major: Computer Science
Code: 9480101

ABSTRACT OF DOCTORAL DISSERTATION
COMPUTER SCIENCE

Hanoi - 2018


The dissertation is completed at:
Hanoi University of Science and Technology

Supervisors:
1. Dr. Vu Hai
2. Assoc. Prof. Nguyen Thi Thuy

Reviewer 1: Assoc. Prof. Luong Chi Mai
Reviewer 2: Assoc. Prof. Le Thanh Ha
Reviewer 3: Assoc. Prof. Nguyen Quang Hoan

The dissertation will be defended before the approval committee
at Hanoi University of Science and Technology:



Time..........., date.......month.......year.......

The dissertation can be found at:
1. Ta Quang Buu Library
2. Vietnam National Library


INTRODUCTION
Motivation
Visually Impaired People (VIPs) face many difficulties in their daily living. Nowadays, many aided systems for VIPs have been deployed, such as navigation services, obstacle detection (the iNavBelt and GuideCane products; Andreas et al. IROS, 2014; Rimon et al., 2016), and object recognition in supermarkets (EyeRing at MIT's Media Lab). The most common situation is that VIPs need to locate home facilities. However, even a simple activity such as querying common objects (e.g., a bottle, a coffee cup, a jar) in a conventional environment (e.g., a kitchen or a cafeteria) may be a challenging task. In terms of deploying an aided system for VIPs, not only must the object's position be provided, but further information about the queried object (e.g., its size, or how to grab objects on a flat surface such as bowls and coffee cups on a kitchen table) is also required.
Let us consider a real scenario, as shown in Fig. 1: to look for a tea or coffee cup, a VIP goes into the kitchen, touches the surrounding objects, and picks up the right one. With an aided system, that person simply makes a query: "Where is a coffee cup?", "What is the size of the cup?", "Is the cup lying or standing on the table?". The aided system should provide this information so that VIPs can grasp the objects and avoid accidents such as being burned. Even when performing 3-D object detection and recognition on 2-D image data with additional depth information, as presented in (Bo et al. NIPS 2010; Bo et al. CVPR 2011; Bo et al. IROS 2011), only the object's label is provided. At the same time, the information the system captures from the environment consists of image frames, so the data of an object on the table covers only its visible part, such as the front of a cup, box, or fruit, whereas the information the VIPs need concerns the position, size, and direction for safe grasping. For this reason, we use a "3-D object estimation method" to estimate this information about the objects.
Knowing that the queried object is a coffee cup, which usually has a cylindrical shape and lies on a flat surface (a table plane), the aided system can resolve the query by fitting a primitive shape to the point cloud collected from the object. The objects in a kitchen or tea room, such as cups, bowls, jars, fruit, and funnels, are usually placed on tables; therefore, these objects can be simplified by primitive shapes. The problem of detecting and recognizing complex objects in the scene is not considered in this dissertation.


Figure 1 Illustration of a real scenario: a VIP comes to the kitchen and gives a query: "Where is a coffee cup?" on the table. The left panel shows a Kinect mounted on the person's chest. Right panel: the developed system is built on a laptop PC.

Prior knowledge observed from the current scene (e.g., a cup normally stands on the table) and contextual constraints (e.g., walls in the scene are perpendicular to the table plane; the size/height of the queried object is limited) would be valuable cues to improve the system's performance.
Generally, we observe that the queried objects can be identified through simplified geometric shapes: planar segments (boxes), cylinders (coffee mugs, soda cans), spheres (balls), and cones, without utilizing conventional 3-D features. Following these ideas, a pipeline for "3-D Object Detection and Recognition for Assisting Visually Impaired People" is proposed. It consists of several tasks: (1) separating the queried objects from the table-plane detection result by transforming the original coordinate system; (2) detecting candidates for the objects of interest using appearance features; and (3) estimating a model of the queried object from a 3-D point cloud. The last task plays the most important role. Instead of matching the queried objects against 3-D models as conventional learning-based approaches do, this research focuses on constructing a simplified geometrical model of the queried objects from an unstructured point cloud collected by an RGB and range sensor.


Objective
In this dissertation, we aim to propose a robust 3-D object detection and recognition system. As a feasible solution for deploying a real application, the proposed framework should be simple, robust, and friendly to VIPs. However, it is necessary to note that there are critical issues that might affect the performance of the proposed system, in particular: (1) objects are queried in a complex scene where clutter and occlusion may appear; (2) the collected data is noisy; and (3) the computational cost is high due to the huge number of points in the cloud data. Although a number of relevant works on 3-D object detection and recognition have addressed these issues in the literature, in this study we do not attempt to solve them separately. Instead, we aim to build a unified solution. To this end, the concrete objectives are:


Figure 2 Illustration of the 3-D query-based object detection process in an indoor environment. The full object model is the estimated green cylinder fitted to the point cloud of the coffee cup (red points).

- To propose a complete 3-D query-based object detection system supporting VIPs with high accuracy. Figure 2 illustrates the process of 3-D query-based object detection in an indoor environment.
- To deploy a real application that locates and describes objects' information, supporting VIPs in grasping objects. The application is evaluated in practical scenarios such as finding objects in a sharing room or a kitchen.
A natural extension of this research is to give VIPs a feeling or a way of interacting in a simple form. In fact, VIPs want to make optimal use of all their senses (i.e., audition, touch, and kinesthetic feedback). Through this study, informative data extracted from cameras (i.e., position, size, and safe directions for object grasping) becomes available. As a result, the proposed method can offer an effective way to turn the large amount of collected data into a valuable, feasible resource.

Context, constraints and challenges
Figure 1 shows the context in which a VIP comes to a cafeteria and uses an aided system to locate an object on the table. The input of the system is a query, and the output is the object's position in a 3-D coordinate system together with the object's information (size, height). The proposed system operates with a MS Kinect sensor version 1. The Kinect sensor is mounted on the chest of the VIP, and the laptop is wrapped in the backpack, as shown in Fig. 1 (bottom). For deploying a real application, we impose the following constraints on the scenario:
• The MS Kinect sensor:

– A MS Kinect sensor is mounted on the VIP's chest, and he/she moves slowly around the table in order to collect data of the environment.
– A MS Kinect sensor captures RGB and depth images at a normal frame rate (from 10 to 30 fps) with an image resolution of 640×480 pixels for both image types. With each frame obtained from the Kinect, an acceleration vector is also obtained. Because the MS Kinect collects images at 10 to 30 fps, it fits well with the slow movement of the VIPs (∼1 m/s). Although collecting image data via a wearable sensor can be affected by the subject's movement (e.g., image blur and vibrations in practical situations), there are no specific requirements for collecting the image data. For instance, VIPs are not required to stand still before collecting the image data.
– Every queried object needs to be placed in the visible area of the MS Kinect sensor, which is at a distance of 0.8 to 4 meters and within an angle of 30° around the center axis of the MS Kinect sensor. Therefore, the distance from the VIP to the table is also constrained to about 0.8 to 4 m.
• The objects of interest (or queried objects) are assumed to have simple geometrical structures. For instance, coffee mugs, bowls, jars, bottles, etc. have a cylindrical shape, balls have a spherical shape, and boxes may have a cube shape. They are idealized and labeled. The modular interaction between a VIP and the system has not been developed in this dissertation.
• Once a VIP wants to query an object on the table, he/she should stand in front of the table. This ensures that the current scene is in the visible area of the MS Kinect sensor, and the VIP can move around the table. The proposed system computes and returns the object's information such as position, size, and orientation. Sending such information to the senses (e.g., as audible information, on a Braille screen, or by a vibrating device) is out of the scope of this dissertation.
• Some heuristic parameters are pre-determined. For instance, a VIP's height and other parameters of contextual constraints (e.g., the size of the table plane in a scene, limits on an object's height) are pre-selected.
The above-mentioned scenario requires coping with the following challenges:
• Occlusion and clutter of the objects of interest: In practice, when a VIP comes to a cafeteria to find an object on the table, the queried objects may be occluded by others. At a certain viewpoint, a MS Kinect sensor captures only a part of an object, so data of the queried objects is missing. Another issue is that the data contains a lot of noise, because the depth image of a MS Kinect version 1 is often affected by illumination conditions. These issues are challenges for fitting, detecting, and classifying objects from a point cloud.
• Various appearances of the same object type: The system is to support VIPs in querying common objects. In fact, a "blue" tea/coffee cup and a "yellow" bottle share the same type of primitive shape (a cylindrical model): these objects have the same geometric structure but different colors. We exploit learning-based techniques that utilize appearance features (on RGB images) for recognizing the queried objects.

• Computational time: A point cloud of a scene generated from an image of size 640 × 480 pixels consists of hundreds of thousands of points. Therefore, computations in the 3-D environment often incur higher computational costs than the same task in the 2-D environment.

Contributions
Throughout the dissertation, the main objectives are addressed by a unified solution. We achieve the following contributions:
• Contribution 1: We proposed a new robust estimator called GCSAC (Geometrical Constraint SAmple Consensus) for estimating primitive shapes from the point cloud of an object. Unlike conventional RANSAC (RANdom SAmple Consensus) algorithms, GCSAC selects uncontaminated (so-called qualified or good) samples from a set of data points using geometrical constraints. Moreover, GCSAC is extended by utilizing contextual constraints to validate the results of the model estimation.
• Contribution 2: We performed a comparative study of three different approaches for recognizing 3-D objects in a complex scene. The best one turned out to be a combination of a deep-learning-based technique and the proposed robust estimator (GCSAC). This method takes advantage of recent neural-network object detection on RGB images and utilizes the proposed GCSAC to estimate the full 3-D models of the queried objects.
• Contribution 3: We successfully deployed a system using the proposed methods for detecting 3-D primitive-shaped objects in a lab-based environment. The system combines the table plane detection technique and the proposed method for 3-D object detection and estimation. It achieves fast computation for both tasks of locating and describing the objects. As a result, it fully supports VIPs in grasping the queried objects.


General framework and dissertation outline
In this dissertation, we propose a unified framework for detecting queried 3-D objects on a table to support VIPs in an indoor environment. The proposed framework consists of three main phases, as illustrated in Fig. 3. The first phase is considered a pre-processing step.


[Figure 3: pipeline diagram — Microsoft Kinect (RGB-D image, acceleration vector) → pre-processing step (point cloud representation, table plane detection) → objects detection on RGB image (candidates) → 3-D objects location on the table plane → 3-D objects model estimation (fitting 3-D objects) → 3-D objects information.]

Figure 3 A general framework for detecting the 3-D queried objects on the table for the VIPs.
It consists of point cloud representation from the RGB and depth images, and table plane detection in order to separate the objects of interest from the current scene. The second phase aims to label the object candidates on the RGB images. The third phase estimates a full model from the point cloud specified in the first and second phases; the 3-D objects are estimated by utilizing a new robust estimator, GCSAC, to obtain the full geometrical models. Utilizing this framework, we deploy a real application. The application is evaluated in different scenarios, including datasets collected in lab environments and public datasets. The research in this dissertation is organized into six chapters as follows:
• Introduction: This chapter describes the main motivations and objectives of the study. We also present critical points of the research's context, constraints, and challenges that we meet and address in the dissertation. Additionally, the general framework and main contributions of the dissertation are presented.
• Chapter 1: A Literature Review: This chapter mainly surveys existing aided systems for VIPs. In particular, the related techniques for developing an aided system are discussed. We also present relevant works on estimation algorithms and a series of techniques for 3-D object detection and recognition.
• Chapter 2: In this chapter, we describe a point cloud representation built from data collected by a MS Kinect sensor. A real-time table plane detection technique for separating the objects of interest from a certain scene is described. The proposed table plane detection technique is adapted to the contextual constraints. The experimental results confirm the effectiveness of the proposed method on both self-collected and public datasets.
• Chapter 3: This chapter describes a new robust estimator for primitive shape estimation from point cloud data. The proposed robust estimator, named GCSAC (Geometrical Constraint SAmple Consensus), utilizes geometrical constraints to choose good samples for estimating models. Furthermore, we utilize contextual information to validate the estimation results. In the experiments, the proposed GCSAC is compared with various RANSAC-based variants on both synthesized and real datasets.
• Chapter 4: This chapter describes the complete framework for locating and providing the full information of the queried objects. In this chapter, we exploit the advantages of recent deep learning techniques for object detection. Moreover, to estimate the full 3-D model of the queried object, we utilize GCSAC on the point cloud data of the labeled object. Consequently, we can directly extract the object's information (e.g., size, surface normal, grasping direction). This scheme outperforms existing approaches such as solely using 3-D object fitting or 3-D feature learning.
• Chapter 5: We conclude the work and discuss the limitations of the proposed methods. Research directions are also described for future work.


CHAPTER 1

LITERATURE REVIEW
In this chapter, we present surveys of related work on aided systems for VIPs and methods for detecting objects in indoor environments. Firstly, relevant aiding applications for VIPs are presented in Sec. 1.1. Then, we introduce and analyze the state-of-the-art works on 3-D object detection and recognition in Sec. 1.2. Finally, robust estimators and their applications in robotics and computer vision are presented in Sec. 1.3.


1.1 Aided systems supporting visually impaired people
1.1.1 Aided systems for navigation services
1.1.2 Aided systems for obstacle detection
1.1.3 Aided systems for locating the interested objects in scenes
1.1.4 Aided systems for detecting objects in daily activities
1.1.5 Discussions
1.2 3-D object detection and recognition from point cloud data
1.2.1 Appearance-based methods
1.2.2 Geometry-based methods
1.2.3 Discussions
1.3 Fitting primitive shapes: A brief survey
1.3.1 Linear fitting algorithms
1.3.2 Robust estimation algorithms
1.3.3 RANdom SAmple Consensus (RANSAC) and its variations
1.3.4 Discussions

CHAPTER 2

POINT CLOUD REPRESENTATION AND THE
PROPOSED METHOD FOR TABLE PLANE
DETECTION
A common situation in the daily activities of visually impaired people (VIPs) is to query an object (a coffee cup, a water bottle, and so on) on a flat surface. We assume that such a flat surface could be a table plane in a sharing room or in a kitchen. To build a complete aided system supporting VIPs, the queried objects obviously need to be separated from the table plane in the current scene. In a general framework that also contains steps such as detection and full-model estimation of the queried objects, the table plane detection can be considered a pre-processing step. Therefore, this chapter is organized as follows: Firstly, we introduce a representation of the point clouds that combines the data collected by the Kinect sensor in Section 2.1. We then present the proposed method for table plane detection in Section 2.2.




2.1 Point cloud representation
2.1.1 Capturing data by a Microsoft Kinect sensor

To collect data from the environment for an aid system that helps the VIPs detect and grasp objects with simple geometrical structure on a table in an indoor environment, color and depth images are captured by a MS Kinect sensor version 1.
2.1.2 Point cloud representation

The result of image calibration is the camera's intrinsic matrix Hm for projecting pixels from 2-D space to 3-D space:

    Hm = | fx  0   cx |
         | 0   fy  cy |
         | 0   0   1  |

where (cx, cy) is the principal point (usually the image center), and fx and fy are the focal lengths.
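For illustration, a minimal back-projection sketch using this intrinsic model is given below. It is not the dissertation's implementation; the intrinsic values (fx = fy = 525, cx = 319.5, cy = 239.5) are typical Kinect v1 calibration values assumed here for the example.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an organized depth image (meters) into a 3-D point cloud.

    Each pixel (u, v) with depth z maps to
    X = (u - cx) * z / fx,  Y = (v - cy) * z / fy,  Z = z.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.dstack((x, y, z))  # shape (h, w, 3): an organized cloud

# Example with assumed, typical Kinect v1 intrinsics
depth = np.full((480, 640), 1.5)  # a flat surface 1.5 m away
cloud = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(cloud.shape)  # (480, 640, 3)
```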

2.2 The proposed method for table plane detection
2.2.1 Introduction

Plane detection in 3-D point clouds is a critical task for many robotics and computer vision applications. In order to help visually impaired/blind people find and grasp objects of interest (e.g., a coffee cup, a bottle, a bowl) on the table, one has to find the table planes in the captured scenes. This work is motivated by such an adaptation, in which acceleration data provided by the MS Kinect sensor is used to prune the extracted results. The proposed algorithms achieve real-time performance as well as a high detection rate for table planes.
2.2.2 Related Work
2.2.3 The proposed method
2.2.3.1 The proposed framework

Our research context aims to develop object-finding and grasping-aided services for VIPs. The proposed framework, as shown in Fig. 2.6, consists of four steps: down-sampling, organized point cloud representation, plane segmentation, and table plane classification. Because our work utilizes only the depth feature, a simple and effective method for down-sampling and smoothing the depth data is described below.

[Figure 2.6: pipeline diagram — Microsoft Kinect depth (with acceleration vector) → down-sampling → organized point cloud representation → plane segmentation → plane classification → table plane.]

Figure 2.6: The proposed framework for table plane detection.
Given a sliding window (of size n × n pixels), the depth value of the center pixel D(xc, yc) is computed by Eq. 2.2:

    D(xc, yc) = (1/N) Σ_{i=1..N} D(xi, yi)    (2.2)

where D(xi, yi) is the depth value of the i-th neighboring pixel of the center pixel (xc, yc), and N is the number of pixels in the n × n neighborhood (N = n × n − 1).
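A direct, unoptimized sketch of this neighborhood average (Eq. 2.2) might look as follows; a practical implementation would use a vectorized convolution instead of Python loops.

```python
import numpy as np

def smooth_depth(depth, n=3):
    """Replace each interior pixel by the mean of its N = n*n - 1 neighbors (Eq. 2.2)."""
    assert n % 2 == 1, "window size must be odd"
    h, w = depth.shape
    r = n // 2
    out = depth.copy()
    for yc in range(r, h - r):
        for xc in range(r, w - r):
            window = depth[yc - r:yc + r + 1, xc - r:xc + r + 1]
            s = window.sum() - depth[yc, xc]  # exclude the center pixel itself
            out[yc, xc] = s / (n * n - 1)
    return out

d = np.arange(25, dtype=float).reshape(5, 5)
print(smooth_depth(d, n=3)[2, 2])  # mean of the 8 neighbors of the center: 12.0
```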
2.2.3.2 Plane segmentation

The detailed process of the plane segmentation is given in (Holz et al. RoboCup, 2011).

2.2.3.3 Table plane detection/extraction

The results of the first step are the planes that are perpendicular to the acceleration vector. The y axis is then rotated so that it is parallel with the acceleration vector; the table plane is the highest plane in the scene, which means it is the plane with the minimum y-value.
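A minimal sketch of this selection rule is shown below, assuming each candidate plane is given as a point set and the accelerometer vector points roughly along gravity; the 0.9 alignment threshold is an assumption added for the example.

```python
import numpy as np

def pick_table_plane(planes, acc):
    """Return the highest plane among those perpendicular to gravity.

    planes: list of (Ni, 3) point arrays, one per candidate plane segment.
    acc:    accelerometer vector from the Kinect (points roughly downward).
    """
    g = np.asarray(acc, float)
    g /= np.linalg.norm(g)
    best, best_y = None, np.inf
    for pts in planes:
        centered = pts - pts.mean(axis=0)
        _, vecs = np.linalg.eigh(centered.T @ centered)
        normal = vecs[:, 0]              # smallest-eigenvalue direction = plane normal
        if abs(normal @ g) < 0.9:        # plane not horizontal: skip it
            continue
        y_mean = (pts @ g).mean()        # height measured along the gravity axis
        if y_mean < best_y:              # minimum y = highest plane in the scene
            best, best_y = pts, y_mean
    return best
```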
2.2.4 Experimental results
2.2.4.1 Experimental setup and dataset collection

The first dataset is called 'MICA3D': a Microsoft Kinect version 1 is mounted on the person's chest, and the person then moves around one table in the room. The distance between the Kinect and the center of the table is about 1.5 m. The height of the Kinect above the table plane is about 0.6 m, and the height of the table plane is about 60-80 cm. We captured data from 10 different scenes, which include a cafeteria, a showroom, a kitchen, and so on. These scenes cover common contexts in the daily activities of visually impaired people. The second dataset was introduced in (Richtsfeld et al. IROS, 2012). It contains calibrated RGB-D data of 111 scenes; each scene has a table plane. The image size is 640×480 pixels.
2.2.4.2 Table plane detection evaluation method

Three evaluation measures are needed; they are defined below.
Evaluation measure 1 (EM1): This measure evaluates the difference between the normal vector extracted from the detected table plane and the normal vector extracted from the ground-truth data.


Table 2.2: The average results of table plane detection on our own dataset (%).

Approach        | EM1   | EM2   | EM3   | Average | Missing rate | Frames per second
First Method    | 87.43 | 87.26 | 71.77 | 82.15   | 1.2          | 0.2
Second Method   | 98.29 | 98.25 | 96.02 | 97.52   | 0.63         | 0.83
Proposed Method | 96.65 | 96.78 | 97.73 | 97.0    | 0.81         | 5

Table 2.3: The average results of table plane detection on the dataset [3] (%).

Approach        | EM1   | EM2   | EM3   | Average | Missing rate | Frames per second
First Method    | 87.39 | 68.47 | 98.19 | 84.68   | 0.0          | 1.19
Second Method   | 87.39 | 68.47 | 95.49 | 83.78   | 0.0          | 0.98
Proposed Method | 87.39 | 68.47 | 99.09 | 84.99   | 0.0          | 5.43
Evaluation measure 2 (EM2): With EM1, only one point (the center point of the ground truth) is used to estimate the angle. To reduce the influence of noise, more points are used to determine the normal vector of the ground truth: for EM2, 3 points (p1, p2, p3) are randomly selected from the ground-truth point cloud.
Evaluation measure 3 (EM3): The two evaluation measures presented above do not take into account the area of the detected table plane. Therefore, EM3 is proposed, inspired by the Jaccard index for object detection:
    r = (Rd ∩ Rg) / (Rd ∪ Rg)    (2.6)

where Rd and Rg denote the detected and ground-truth table-plane regions, respectively.
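With the regions represented as binary pixel masks, EM3 can be computed directly; the sketch below is illustrative only.

```python
import numpy as np

def em3_jaccard(detected_mask, gt_mask):
    """EM3 (Eq. 2.6): Jaccard overlap between the detected table-plane
    region Rd and the ground-truth region Rg, given as boolean masks."""
    inter = np.logical_and(detected_mask, gt_mask).sum()
    union = np.logical_or(detected_mask, gt_mask).sum()
    return inter / union if union > 0 else 0.0

a = np.zeros((4, 4), bool); a[:2] = True   # detected region: top two rows
b = np.zeros((4, 4), bool); b[1:3] = True  # ground truth: middle two rows
print(em3_jaccard(a, b))  # 4 / 12 ≈ 0.333
```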

2.2.4.3 Results

The comparative results of three different evaluation measures on two datasets are
shown in Tab. 2.2 and Tab. 2.3, respectively.
2.2.5 Discussions

In this work, a method for table plane detection is proposed, using down-sampling, accelerometer data, and the organized point cloud structure obtained from the color and depth images of the MS Kinect sensor.


2.3 Separating the interested objects on the table plane
2.3.1 Coordinate system transformation
2.3.2 Separating table plane and interested objects
2.3.3 Discussions

CHAPTER 3

PRIMITIVE SHAPES ESTIMATION BY A NEW
ROBUST ESTIMATOR USING GEOMETRICAL
CONSTRAINTS
3.1 Fitting primitive shapes by GCSAC
3.1.1 Introduction

The geometrical model of an object of interest can be estimated using from two to seven geometrical parameters, as in (Schnabel et al. 2007). RANdom SAmple Consensus (RANSAC) and its paradigm attempt to extract shape parameters as good as possible, subject either to heavy noise in the data or to processing-time constraints. In particular, at each hypothesis of a RANSAC-based algorithm, a search process is carried out that aims at finding good samples based on the constraints of the estimated model. To search for good samples, we define two criteria: (1) the selected samples must be consistent with the estimated model via a rough inlier-ratio evaluation; (2) the samples must satisfy explicit geometrical constraints of the objects of interest (e.g., cylindrical constraints).
3.1.2 Related work
3.1.3 The proposed new robust estimator
3.1.3.1 Overview of the proposed robust estimator (GCSAC)

To estimate the parameters of a 3-D primitive shape, the original RANSAC paradigm, shown in the top panel of Figure 3.2, randomly selects a Minimal Sample Subset (MSS) from a point cloud; model parameters are then estimated and validated. The algorithm is often computationally infeasible, and it is unnecessary to try every possible sample. Our proposed method (GCSAC, bottom panel of Figure 3.2) is based on the original version of RANSAC, but differs in three major aspects: (1) At each iteration, the minimal sample set is constructed when the random sampling procedure is performed, so that probing the consensus data is easily achievable. In other words, a low pre-defined inlier threshold can be deployed as a weak condition of consistency; after only a few random sampling iterations, good sample candidates can be obtained.


[Figure 3.2: flowcharts. Top panel (RANSAC/MLESAC paradigm): a point cloud → randomly sampling a minimal subset → geometrical parameter estimation M → model evaluation (inlier ratio or negative log-likelihood) and update of the best model → adaptive update of the number of iterations K (Eq. 3.2) → terminate? Bottom panel (proposed GCSAC): a point cloud → randomly sampling a minimal subset → searching good samples (GS) using geometrical constraints → geometrical parameter estimation M → model evaluation via negative log-likelihood and update of the best model → adaptive update of K (Eq. 3.2) → estimated model.]

Figure 3.2: Top panel: overview of a RANSAC-based algorithm. Bottom panel: a diagram of the GCSAC implementation.
(2) The minimal sample sets consist of qualified samples that satisfy the geometrical constraints of the object of interest. (3) The termination condition of the adaptive RANSAC algorithm of (Hartley et al. 2003) is adopted, so that the algorithm terminates as soon as a minimal sample set is found for which the required number of iterations is less than the number already performed.
To determine the termination criterion of the estimation algorithm, a well-known calculation of the number of sample selections K is given by Eq. 3.2:

    K = log(1 − p) / log(1 − w^s)    (3.2)

where p is the probability of finding a model that describes the data, s is the minimal number of samples needed to estimate a model, and w is the percentage of inliers in the point cloud.
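A small helper implementing this adaptive criterion might look as follows; the cap k_max is an assumption added here to keep K finite as w approaches zero.

```python
import math

def adaptive_iterations(w, s, p=0.99, k_max=10000):
    """Number of iterations K (Eq. 3.2) so that, with probability p, at least
    one minimal sample of size s is outlier-free given inlier ratio w."""
    if w <= 0.0:
        return k_max
    if w >= 1.0:
        return 1
    return min(k_max, int(math.ceil(math.log(1 - p) / math.log(1 - w ** s))))

# A cylinder needs s = 2 point-normal samples; at a 30% inlier ratio:
print(adaptive_iterations(w=0.3, s=2))  # ~49 iterations
```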

[Figure 3.3: geometry sketches; recoverable labels: plane PlaneY, main axis γc, points p1, p2, p3, normals n1, n2, n3, lines L1, L2, intersection Ic, angles γ, γ1, γ2, and the estimated cylinder.]

Figure 3.3: Geometrical parameters of a cylindrical object. (a)-(c) Explanation of the geometrical analysis to estimate a cylindrical object. (d)-(e) Illustration of the geometrical constraints applied in GCSAC. (f) Result of the estimated cylinder from a point cloud. Blue points are outliers; red points are inliers.
3.1.3.2 Geometrical analyses and constraints for qualifying good samples

In the following sections, the principles of the 3-D primitive shapes are explained. Based on the geometrical analysis, related constraints are given to select good samples. The normal vector of any point is computed following the approach in (Holz et al. 2011): at each point pi, the k nearest neighbors of pi are determined within a radius r. Computing the normal vector of pi therefore reduces to an analysis of the eigenvectors and eigenvalues of the covariance matrix C, as presented in Sec. 2.2.3.2.
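A brute-force sketch of this PCA-based normal estimation is given below; it is illustrative only, and a real implementation would use a KD-tree and the radius r to find neighbors.

```python
import numpy as np

def point_normals(points, k=10):
    """Estimate a unit normal at each point as the eigenvector belonging to
    the smallest eigenvalue of the local k-nearest-neighbor covariance."""
    normals = np.zeros_like(points)
    for i, p in enumerate(points):
        d = np.linalg.norm(points - p, axis=1)
        nbrs = points[np.argsort(d)[1:k + 1]]   # k nearest neighbors (brute force)
        c = nbrs - nbrs.mean(axis=0)
        _, vecs = np.linalg.eigh(c.T @ c)       # eigenvalues in ascending order
        normals[i] = vecs[:, 0]                 # smallest-variance direction
    return normals

# Smoke test: points on the z = 0 plane should get normals along ±z
pts = np.random.default_rng(1).uniform(size=(200, 3)) * [1, 1, 0]
print(np.round(point_normals(pts)[:3], 3))
```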
a. Geometrical analysis for cylindrical objects
The geometrical relationships of the above parameters are shown in Fig. 3.3 (a). A cylinder can be estimated from two points (p1, p2) (the two blue-squared points) and their corresponding normal vectors (n1, n2) (marked by green and yellow lines). Let γc be the main axis of the cylinder (red line), which is estimated by:

    γc = n1 × n2    (3.3)

To specify a centroid point I, we project the two parametric lines L1 = p1 + t·n1 and L2 = p2 + t·n2 onto the plane denoted PlaneY (see Figure 3.3(b)). The normal vector of this plane is estimated by the cross product of γc and n1 (γc × n1). The centroid point I is the intersection of L1 and L2 (see Figure 3.3 (c)). The radius Ra is set to the distance between I and p1 in PlaneY. A result of the estimated cylinder from a point cloud is illustrated in Figure 3.3 (f). The height of the estimated cylinder is normalized to 1.



Figure 3.4: (a) Setting the geometrical parameters for estimating a cylindrical object from a point cloud as described above. (b) The estimated cylinder (green) from an inlier p1 and an outlier p2; as shown, it is an incorrect estimation. (c) Normal vectors n1 and n2* on the plane π are specified.

We first build a plane π that is perpendicular to the plane PlaneY and contains n1. Its normal vector is therefore nπ = nPlaneY × n1, where nPlaneY is the normal vector of PlaneY, as shown in Figure 3.4 (a). In other words, n1 is nearly perpendicular to n2*, where n2* is the projection of n2 onto the plane π. This observation leads to the criterion below:
    cp = arg min_{p2 ∈ Un \ {p1}} (n1 · n2*)    (3.4)
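The sketch below illustrates this good-sample search for a cylinder under the stated constraints. It is a sketch, not the dissertation's code: it takes the absolute dot product so that n1 and n2* are as close to perpendicular as possible, which is one reading of Eq. 3.4, and the helper name is hypothetical.

```python
import numpy as np

def pick_second_sample(n1, normals):
    """Pick the candidate whose projected normal n2* is most nearly
    perpendicular to n1 on the plane pi (Eqs. 3.3-3.4)."""
    best_j, best_val = -1, np.inf
    for j, n2 in enumerate(normals):
        axis = np.cross(n1, n2)            # candidate main axis (Eq. 3.3)
        if np.linalg.norm(axis) < 1e-9:
            continue                       # parallel normals: degenerate pair
        n_plane_y = np.cross(axis, n1)     # normal of PlaneY
        n_pi = np.cross(n_plane_y, n1)     # normal of plane pi containing n1
        n_pi /= np.linalg.norm(n_pi)
        n2_star = n2 - (n2 @ n_pi) * n_pi  # projection of n2 onto pi
        if np.linalg.norm(n2_star) < 1e-9:
            continue
        n2_star /= np.linalg.norm(n2_star)
        score = abs(n1 @ n2_star)          # |n1 . n2*|; 0 means perpendicular
        if score < best_val:
            best_j, best_val = j, score
    return best_j

# Smoke test with random unit normals
rng = np.random.default_rng(0)
normals = rng.normal(size=(50, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
print(pick_second_sample(np.array([1.0, 0.0, 0.0]), normals))
```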

3.1.4 Experimental results of the robust estimator
3.1.4.1 Datasets for evaluation of the robust estimator


The first group is synthesized datasets, consisting of cylinders, spheres, and cones. In addition, we evaluate the proposed method on real datasets. For the cylindrical objects, the dataset is collected from a public dataset [1] that contains 300 objects belonging to 51 categories; it is named 'second cylinder'. For the spherical objects, the dataset consists of two balls collected from four real scenes. Finally, the point cloud data of the cone objects, named 'second cone', is collected from the dataset given in [4].
3.1.4.2 Evaluation measurements of the robust estimator

To evaluate the performance of the proposed method, we use the following measurements:
- The relative error Ew of the estimated inlier ratio, where wgt is the defined ground-truth inlier ratio and w is the inlier ratio of the estimated model. The smaller Ew is, the better the algorithm.
- The total distance error Sd, calculated as the sum of the distances from every point pj to the estimated model Me.



Table 3.2: The average evaluation results of synthesized datasets. The synthesized datasets were repeated 50 times for statistically representative results.

Dataset          | Measure   | RANSAC  | PROSAC  | MLESAC  | MSAC    | LOSAC   | NAPSAC   | GCSAC
'first cylinder' | Ew (%)    | 23.59   | 28.62   | 43.13   | 10.92   | 9.95    | 61.27    | 8.49
                 | Sd        | 1528.71 | 1562.42 | 1568.81 | 1527.93 | 1536.47 | 3168.17  | 1495.33
                 | tp (ms)   | 89.54   | 52.71   | 70.94   | 90.84   | 536.84  | 52.03    | 41.35
                 | Ed (cm)   | 0.05    | 0.06    | 0.17    | 0.04    | 0.05    | 0.93     | 0.03
                 | EA (deg.) | 3.12    | 4.02    | 5.87    | 2.81    | 2.84    | 7.02     | 2.24
                 | Er (%)    | 1.54    | 2.33    | 7.54    | 1.02    | 2.40    | 112.06   | 0.69
'first sphere'   | Ew (%)    | 23.01   | 31.53   | 85.65   | 33.43   | 23.63   | 57.76    | 19.44
                 | Sd        | 3801.95 | 3803.62 | 3774.77 | 3804.27 | 3558.06 | 3904.22  | 3452.88
                 | tp (ms)   | 10.68   | 23.45   | 1728.21 | 9.46    | 31.57   | 2.96     | 6.48
                 | Ed (cm)   | 0.05    | 0.07    | 1.71    | 0.08    | 0.21    | 0.97     | 0.05
                 | Er (%)    | 2.92    | 4.12    | 203.60  | 5.15    | 17.52   | 63.60    | 2.61
'first cone'     | Ew (%)    | 24.89   | 37.86   | 68.32   | 40.74   | 30.11   | 86.15    | 24.40
                 | Sd        | 2361.79 | 2523.68 | 2383.01 | 2388.64 | 2298.03 | 13730.53 | 2223.14
                 | tp (ms)   | 495.26  | 242.26  | 52525   | 227.57  | 1258.07 | 206.17   | 188.4
                 | EA (deg.) | 6.48    | 15.64   | 11.67   | 15.64   | 6.79    | 14.54    | 4.77
                 | Er (%)    | 20.47   | 17.65   | 429.44  | 17.31   | 20.22   | 54.44    | 17.21

Table 3.3: Experimental results on the 'second cylinder' dataset. The experiments were repeated 20 times, then errors are averaged.

Dataset                        | Method | w (%) | Sd      | tp (ms) | Er (%)
'second cylinder' (coffee mug) | MLESAC | 9.94  | 3269.77 | 110.28  | 9.93
                               | GCSAC  | 13.83 | 2807.40 | 33.44   | 7.00
'second cylinder' (food can)   | MLESAC | 19.05 | 1231.16 | 479.74  | 19.58
                               | GCSAC  | 21.41 | 1015.38 | 119.46  | 13.48
'second cylinder' (food cup)   | MLESAC | 15.04 | 1211.91 | 101.61  | 21.89
                               | GCSAC  | 18.8  | 1035.19 | 14.43   | 17.87
'second cylinder' (soda can)   | MLESAC | 13.54 | 1238.96 | 620.62  | 29.63
                               | GCSAC  | 20.6  | 1004.27 | 16.25   | 27.7

- The processing time tp, measured in milliseconds (ms). The smaller tp is, the faster the algorithm.
- The relative error of the estimated center (only for the synthesized datasets) Ed, the Euclidean distance between the estimated center Ee and the ground-truth center Et. A sketch of one plausible reading of these measurements is given below.
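The exact formulas are not spelled out in this abstract; the helpers below are hypothetical illustrations under that reading.

```python
import numpy as np

def relative_inlier_error(w_est, w_gt):
    """Ew: relative error between the estimated and ground-truth inlier ratios."""
    return abs(w_est - w_gt) / w_gt

def total_distance_error(points, dist_to_model):
    """Sd: sum of distances from every point to the estimated model;
    dist_to_model is a model-specific point-to-surface distance function."""
    return sum(dist_to_model(p) for p in points)

def center_error(e_est, e_gt):
    """Ed: Euclidean distance between the estimated and ground-truth centers."""
    return float(np.linalg.norm(np.asarray(e_est) - np.asarray(e_gt)))

print(relative_inlier_error(0.45, 0.50))          # 0.1
print(center_error([0, 0, 0], [0.03, 0.04, 0.0]))  # 0.05
```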
3.1.4.3 Evaluation results of the new robust estimator

The performance of each method on the synthesized datasets is reported in Tab. 3.2. For the real datasets, the experimental results are reported in Tab. 3.3 for the cylindrical objects; Table 3.4 reports the fitting results for the spherical and cone datasets.



Table 3.4: The average evaluation results on the 'second sphere' and 'second cone' datasets. The experiments on the real datasets were repeated 20 times for statistically representative results.

Dataset         | Measure   | RANSAC | PROSAC | MLESAC | MSAC   | LOSAC  | NAPSAC  | GCSAC
'second sphere' | w (%)     | 99.77  | 99.78  | 99.98  | 99.83  | 99.80  | 98.20   | 100.00
                | Sd        | 29.60  | 28.77  | 26.62  | 29.38  | 29.37  | 35.55   | 11.31
                | tp (ms)   | 3.44   | 7.82   | 3.43   | 4.17   | 2.97   | 4.11    | 2.93
                | Er (%)    | 30.56  | 31.05  | 26.55  | 30.36  | 30.38  | 33.72   | 14.08
'second cone'   | w (%)     | 79.52  | 80.21  | 71.89  | 75.45  | 71.89  | 38.79   | 82.27
                | Sd        | 126.56 | 96.37  | 156.40 | 147.00 | 143.00 | 1043.34 | 116.09
                | tp (ms)   | 10.94  | 96.37  | 7.42   | 13.05  | 9.65   | 25.39   | 7.14
                | EA (deg.) | 38.11  | 29.42  | 40.35  | 35.62  | 25.39  | 52.64   | 23.74
                | Er (%)    | 77.52  | 71.66  | 77.09  | 74.84  | 75.10  | 76.06   | 68.84

3.1.5 Discussions

In this work, we have proposed GCSAC, a new RANSAC-based robust estimator for fitting primitive shapes to point clouds. The key idea of GCSAC is to combine consistency with the estimated model, via a rough inlier-ratio evaluation, with the geometrical constraints of the shapes of interest. This strategy aims to select good samples for model estimation. The proposed method was examined on primitive shapes such as cylinders, spheres, and cones, using both synthesized and real datasets. The results of the GCSAC algorithm were compared to various RANSAC-based algorithms and confirm that GCSAC works well even on point clouds with a low inlier ratio. In the future, we will continue to validate GCSAC on other geometrical structures and evaluate the proposed method in real scenarios for detecting multiple objects.

3.2 Fitting objects using the context and geometrical constraints
3.2.1 Finding objects using the context and geometrical constraints

Let us consider a real scenario in the common daily activities of visually impaired people: they come to a cafeteria and give a query "Where is a coffee cup?", as shown in Fig. 1.
3.2.2 The proposed method of finding objects using the context and geometrical constraints

We work in the context of developing object-finding-aided systems for the VIPs (as shown in Fig. 1).



3.2.2.1 Model verification using contextual constraints
3.2.3 Experimental results of finding objects using the context and geometrical constraints
3.2.3.1 Descriptions of the datasets for evaluation

The first dataset is constructed from a public one used in [3].
3.2.3.2 Evaluation measurements
3.2.3.3 Results of finding objects using the context and geometrical constraints
Table 3.5 compares the performances of the proposed method GCSAC and MLESAC.
Table 3.5: Average results of the evaluation measurements using GCSAC and MLESAC on three datasets, without the context's constraint. The fitting procedures were repeated 50 times for statistical evaluation.

Dataset        | Method | Ea (deg.) | Er (%) | tp (ms)
First dataset  | MLESAC | 46.47     | 92.85  | 18.10
               | GCSAC  | 36.17     | 81.01  | 13.51
Second dataset | MLESAC | 47.56     | 50.78  | 25.89
               | GCSAC  | 40.68     | 38.29  | 18.38
Third dataset  | MLESAC | 45.32     | 48.48  | 22.75
               | GCSAC  | 43.06     | 46.9   | 17.14

3.2.4 Discussions

CHAPTER 4


DETECTING AND ESTIMATING THE FULL
MODEL OF 3-D OBJECTS AND DEPLOYING THE
APPLICATION
4.1 3-D object detection
4.1.1 Introduction

The objects of interest are placed on the table plane and have simple geometric structures (e.g., coffee mugs, jars, bottles, and soda cans are cylindrical; soccer balls are spherical). Our method exploits YOLO [2], a state-of-the-art method for object detection in RGB images with the highest detection performance among the methods considered. The detected objects are then projected into the point cloud data (3-D data) to generate the full object models for grasping and describing the objects.


Table 4.1: The average results of detecting spherical objects in the two evaluation stages.

Dataset       | Method | Stage-1 Recall (%) | Stage-1 Precision (%) | Stage-2 Recall (%) | Stage-2 Precision (%) | Avg. processing time tp (s)/scene
First Dataset | PSM    | 62.23 | 48.36 | 60.56 | 46.68 | 1.05
              | CVFGS  | 56.24 | 50.38 | 48.27 | 42.34 | 1.2
              | DLGS   | 88.24 | 78.52 | 76.52 | 72.29 | 0.5

4.1.2 Related Work
4.1.3 Three different approaches for 3-D object detection in a complex scene
4.1.3.1 Geometry-based Primitive Shape detection Method (PSM)

This method applies the Primitive Shape detection Method (PSM) of (Schnabel et al.) to the point cloud of the objects.
4.1.3.2 Combination of object clustering, Viewpoint Feature Histogram, and GCSAC for estimating 3-D full object models (CVFGS)
4.1.3.3 Combination of deep learning and GCSAC for estimating 3-D full object models (DLGS)

The network divides the input image into a grid of size c × c and uses features from the entire image to predict a bounding box for each cell of this grid.
4.1.4 Experimental results
4.1.4.1 Data collection
4.1.4.2 Object detection evaluation
4.1.4.3 Evaluation parameters
4.1.4.4 Results

The average results of detecting spherical objects at the first stage of evaluation are presented in Tab. 4.1.
4.1.5 Discussions

4.2 Deploying an aided system for visually impaired people

From the evaluations above, we can see that the DLGS method gives the best results for detecting 3-D primitive objects based on the queries of the VIPs. Therefore, the complete system is developed according to the framework shown in Fig. 4.20. To detect objects based on the query of a VIP at a table in the 3-D environment, the following steps are performed:


[Figure 4.20: pipeline diagram — Microsoft Kinect (RGB-D image, acceleration vector) → point cloud representation and table plane detection → objects detection on the RGB image → 3-D objects located on the table plane → 3-D object model estimation → 3-D object information (location and description for grasping).]

Figure 4.20: The framework for deploying the complete system to detect 3-D primitive objects according to the queries of the VIPs.
1. Generating the RGB point cloud from the RGB image and depth image (presented in Sec. 2.1), using the calibration matrix and down-sampling.
2. Using the acceleration vector and constraints to detect the table plane (presented in Sec. 2.2).
3. Separating the table plane and objects (presented in Sec. 2.3).
4. Detecting objects on the RGB image (YOLO).
5. Locating the 3-D objects on the table plane.
6. Fitting models by GCSAC (presented in Sec. 3.1) for grasping and describing objects. A sketch of how these steps chain together is given below.
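The sketch below shows how these six steps might be glued together. Every stage is injected as a callable, and all function names are hypothetical placeholders rather than the system's actual API.

```python
from typing import Any, Callable

def find_queried_object(
    rgb: Any, depth: Any, acc: Any,
    make_cloud: Callable, detect_table: Callable, separate: Callable,
    detect_2d: Callable, locate: Callable, fit_gcsac: Callable,
) -> Any:
    """Glue code mirroring steps 1-6; each stage is an injected callable."""
    cloud = make_cloud(rgb, depth)       # 1. RGB-D -> point cloud (Sec. 2.1)
    table = detect_table(cloud, acc)     # 2. table plane via accelerometer (Sec. 2.2)
    objects = separate(cloud, table)     # 3. split table plane / objects (Sec. 2.3)
    boxes = detect_2d(rgb)               # 4. object detection on the RGB image (YOLO)
    candidates = locate(objects, boxes)  # 5. 3-D candidates on the table plane
    return fit_gcsac(candidates)         # 6. full model: position, size, orientation
```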

4.2.1 Environment and material setup

To build an aiding system that detects and locates 3-D queried primitive objects on the table for the VIPs, we use two types of devices. The first device is a MS Kinect version 1; the second is a laptop.
4.2.2 Pre-built script

We experimented with three blind people and three types of table according to the following scenarios:
• A VIP moves around the table and wants to find the spherical or cylindrical objects on it, among a coffee cup, a jar, and balls; there is a large enough distance between the objects.
• A VIP moves around the table and wants to find the spherical or cylindrical objects on it, among a coffee cup, a jar, and balls; these objects are occluded.



Table 4.6: The average results of 3-D queried-object detection.

Measurement     | Stage-1 Recall (%) | Stage-1 Precision (%) | Stage-2 Recall (%) | Stage-2 Precision (%) | Processing time (frames/s)
Average results | 100 | 99.27 | 97.80 | 90.45 | 0.86
4.2.3 Experimental results

The experimental setup of the system is described in Sec. 4.2.1 and Sec. 4.2.2. The data includes 8 scenes with different types of table; each scene has about 400 frames, and the frame rate of the MS Kinect is about 10 frames per second.
4.2.3.1 Evaluation of finding 3-D objects

To evaluate the 3-D queried-object detection for the VIPs, we prepared ground-truth data for the two phases. The first phase evaluates the table plane detection: we prepared the data as in Sec. 2.2.4.2 and used the 'EM1' measurement. To evaluate the object detection, we also prepared ground-truth data and computed T1 for evaluating 3-D cylindrical object detection and T2 for evaluating 3-D spherical object detection; they are presented in Sec. 4.1.4.2. To detect objects in the RGB images, we utilized the YOLO network for training the object classifier; the numbers of classes and iterations are as in Sec. 4.1.4.3. All source code of the program is published online.
We performed training on 20% of the data and testing on the remaining 80%; all of the data is published online. A true object detection requires a true table plane detection and satisfying the T1 rate for 3-D cylindrical objects or the T2 rate for 3-D spherical objects. The average results of 3-D queried-object detection using the DLGS method are shown in Tab. 4.6. Demo videos of the real system are also published online.


4.2.4 Evaluation of usability

CHAPTER 5

CONCLUSION AND FUTURE WORKS

5.1 Conclusion

In this dissertation, we have proposed a new robust estimator called GCSAC (Geometrical Constraint SAmple Consensus) for estimating primitive shapes (e.g., cylinder, sphere, cone) from point cloud data that may be contaminated. This algorithm is a RANSAC variant with improvements in the sampling step. Unlike RANSAC and MLESAC, where the samples are drawn randomly, GCSAC intentionally selects good samples based on the proposed geometrical constraints. GCSAC was evaluated and compared to RANSAC variants for the estimation of primitive shapes on synthesized and real datasets. The experimental results confirmed that GCSAC outperforms the RANSAC variants in both the quality of the estimated models and the computational time. We also proposed to use contextual constraints, derived from the specific context of the environment, to significantly improve the estimation results.
In this dissertation, we also described a complete aided system for detecting 3-D primitive objects based on a VIP's query. This system was demonstrated and evaluated in real environments. The application was developed utilizing the following proposed techniques:
• Real-time table plane detection that achieves both high accuracy and low computational time. It is a combination of down-sampling, a region-growing algorithm, and the contextual constraints. A real dataset of table planes collected in various real scenes is made publicly available.
• A combination of deep learning (the YOLO network) for object detection on RGB images and the proposed robust estimator (GCSAC) for generating the full object models that provide object descriptions for the VIPs. The evaluations confirmed that YOLO achieves acceptable accuracy with the fastest computational time, while GCSAC can estimate full models from contaminated or occluded data. These results ensure the feasibility of the developed application.
During the experiments, we also found limitations of the proposed methods, which are listed below:
• In the table plane detection step, some contextual constraints are assumed. For instance, the table plane is flat, the table stands on the floor, and its height is lower than the MS Kinect's position.
• In order to detect objects, depth information is combined with color information to project into 3-D space. However, the resolution of depth images captured by the MS Kinect sensor is not good enough. In particular, at a far distance (more than 4 m) or too near (less than 0.8 m), the depth data is unavailable. Therefore, the performance of the proposed method can be reduced when a user stands too far from or too near to the objects.
• Each primitive shape uses a different type of constraint, so the number of objects that can be found is limited. In this study, only three types of objects (cylindrical, spherical, and conical) were investigated.
• The contextual constraints are only applied for some specific objects whose main axis direction is specified.
• The proposed system requires training time when many objects appear in the scene. In particular, we have not solved the problem of detecting and recognizing 3-D objects with complex geometry; constraints for objects composed of many primitive shapes have not been studied.
• At the moment, the object descriptions (e.g., size and position information) of the queried objects are estimated on each separate frame. Temporal information has not been exploited in the parameter estimation procedures, so the estimated parameters may contain noise or incorrect results. To resolve these issues, relevant techniques from time-series analysis could be adopted. For instance, a Kalman filter can be applied on consecutive frames to eliminate outliers and correct the estimation results; see the sketch below. In addition, observations of the estimated parameters in consecutive frames could be used to calculate statistical measurements, which would be more stable and reliable.
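As an illustration of this idea, a minimal one-dimensional Kalman filter that smooths a single estimated parameter (e.g., a cylinder radius) across frames is sketched below; the noise settings are arbitrary example values, not tuned for the system.

```python
def kalman_smooth(measurements, q=1e-4, r=1e-2):
    """Minimal 1-D Kalman filter smoothing one object parameter over frames.
    q: process-noise variance, r: measurement-noise variance (assumed values)."""
    x, p = measurements[0], 1.0
    out = [x]
    for z in measurements[1:]:
        p += q                # predict: the parameter is assumed nearly constant
        k = p / (p + r)       # Kalman gain
        x += k * (z - x)      # update with the new frame's estimate
        p *= (1 - k)
        out.append(x)
    return out

print(kalman_smooth([3.0, 3.2, 2.9, 3.1, 3.0]))
```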
Beyond the above limitations, the remaining challenges of this study suggest research directions for the future.

• Short term:

– Improving GCSAC to estimate more primitive shapes: we need to propose geometrical constraints for estimating other geometrical structures. The combination of the proposed algorithm and constraints for complex shapes can be adopted from the work of (Schnabel et al. 2007) or by composing a graph of primitive shapes as proposed by (Nieuwenhuisen et al. 2012).
– Evaluating the developed system needs to be deployed on many VIPs with
