
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

LE VAN HUNG

3-D OBJECT DETECTIONS AND
RECOGNITIONS: ASSISTING VISUALLY
IMPAIRED PEOPLE

Major: Computer Science
Code: 9480101

DOCTORAL DISSERTATION OF
COMPUTER SCIENCE

SUPERVISORS:
1. Dr. Vu Hai
2. Assoc. Prof. Dr. Nguyen Thi Thuy

Hanoi − 2018




DECLARATION OF AUTHORSHIP
I, Le Van Hung, declare that this dissertation, titled "3-D Object Detections and Recognitions: Assisting Visually Impaired People in Daily Activities", and the work presented in it are my own. I confirm that:
• This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
• Where any part of this dissertation has previously been submitted for a degree or any other qualification at Hanoi University of Science and Technology or any other institution, this has been clearly stated.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this dissertation is entirely my own work.
• I have acknowledged all main sources of help.
• Where the dissertation is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Hanoi, November 2018
PhD Student


Le Van Hung

SUPERVISORS

Dr. Vu Hai

Assoc. Prof. Dr. Nguyen Thi Thuy



ACKNOWLEDGEMENT
This dissertation was written during my doctoral course at the International Research Institute of Multimedia, Information, Communication and Applications (MICA), Hanoi University of Science and Technology (HUST). It is my great pleasure to thank all the people who supported me in completing this work.

First, I would like to express my sincere gratitude to my advisors, Dr. Hai Vu and Assoc. Prof. Dr. Thi Thuy Nguyen, for their continuous support, patience, motivation, and immense knowledge. Their guidance helped me throughout the research and the writing of this dissertation. I could not have imagined better advisors and mentors for my Ph.D. study.

Besides my advisors, I would like to thank Assoc. Prof. Dr. Thi-Lan Le, Assoc. Prof. Dr. Thanh-Hai Tran and the members of the Computer Vision Department at MICA Institute. These colleagues assisted me greatly during my research and co-authored the published papers. Moreover, attending scientific conferences has always been a great experience for me and a source of many useful comments.

During my Ph.D. course, I have received much support from the Management Board of MICA Institute. My sincere thanks to Prof. Yen Ngoc Pham, Prof. Eric Castelli and Dr. Son Viet Nguyen, who gave me the opportunity to join research works and gave me permission to join the laboratory at MICA Institute. Without their precious support, it would have been impossible to conduct this research.

As a Ph.D. student of the 911 program, I would like to thank this program for its financial support. I also gratefully acknowledge the financial support for attending conferences from the Nafosted-FWO project (FWO.102.2013.08) and the VLIR project (ZEIN2012RIP19). I would also like to thank the College of Statistics for its support over the years, both in my professional work and beyond.

Special thanks to my family, particularly to my mother and father, for all of the sacrifices that they have made on my behalf. I also would like to thank my beloved wife for all of her support.
Hanoi, November 2018
Ph.D. Student
Le Van Hung



CONTENTS
DECLARATION OF AUTHORSHIP . . . i
ACKNOWLEDGEMENT . . . ii
CONTENTS . . . v
SYMBOLS . . . vi
LIST OF TABLES . . . viii
LIST OF FIGURES . . . xvii

1 LITERATURE REVIEW . . . 8
1.1 Aided-systems for supporting visually impaired people . . . 8
1.1.1 Aided-systems for navigation services . . . 8
1.1.2 Aided-systems for obstacle detection . . . 9
1.1.3 Aided-systems for locating the interested objects in scenes . . . 11
1.1.4 Discussions . . . 12
1.2 3-D object detection, recognition from a point cloud data . . . 13
1.2.1 Appearance-based methods . . . 13
1.2.1.1 Discussion . . . 16
1.2.2 Geometry-based methods . . . 16
1.2.3 Datasets for 3-D object recognition . . . 17
1.2.4 Discussions . . . 17
1.3 Fitting primitive shapes . . . 18
1.3.1 Linear fitting algorithms . . . 18
1.3.2 Robust estimation algorithms . . . 19
1.3.3 RANdom SAmple Consensus (RANSAC) and its variations . . . 20
1.3.4 Discussions . . . 23

2 POINT CLOUD REPRESENTATION AND THE PROPOSED METHOD FOR TABLE PLANE DETECTION . . . 24
2.1 Point cloud representations . . . 24
2.1.1 Capturing data by a Microsoft Kinect sensor . . . 24
2.1.2 Point cloud representation . . . 25
2.2 The proposed method for table plane detection . . . 28
2.2.1 Introduction . . . 28


2.2.2 Related Work . . . 29
2.2.3 The proposed method . . . 30
2.2.3.1 The proposed framework . . . 30
2.2.3.2 Plane segmentation . . . 32
2.2.3.3 Table plane detection and extraction . . . 34
2.2.4 Experimental results . . . 36
2.2.4.1 Experimental setup and dataset collection . . . 36
2.2.4.2 Table plane detection evaluation method . . . 37
2.2.4.3 Results . . . 40
2.3 Separating the interested objects on the table plane . . . 46
2.3.1 Coordinate system transformation . . . 46
2.3.2 Separating table plane and the interested objects . . . 48
2.3.3 Discussions . . . 48

3 PRIMITIVE SHAPES ESTIMATION BY A NEW ROBUST ESTIMATOR USING GEOMETRICAL CONSTRAINTS . . . 51
3.1 Fitting primitive shapes by GCSAC . . . 52
3.1.1 Introduction . . . 52
3.1.2 Related work . . . 53
3.1.3 The proposed new robust estimator . . . 55
3.1.3.1 Overview of the proposed robust estimator (GCSAC) . . . 55
3.1.3.2 Geometrical analyses and constraints for qualifying good samples . . . 58
3.1.4 Experimental results of the robust estimator . . . 64
3.1.4.1 Evaluation datasets of the robust estimator . . . 64
3.1.4.2 Evaluation measurements of the robust estimator . . . 67
3.1.4.3 Evaluation results of the new robust estimator . . . 68
3.1.5 Discussions . . . 74
3.2 Fitting objects using the context and geometrical constraints . . . 76
3.2.1 The proposed method of finding objects using the context and geometrical constraints . . . 77
3.2.1.1 Model verification using contextual constraints . . . 77
3.2.2 Experimental results of finding objects using the context and geometrical constraints . . . 78
3.2.2.1 Descriptions of the datasets for evaluation . . . 78
3.2.2.2 Evaluation measurements . . . 81
3.2.2.3 Results of finding objects using the context and geometrical constraints . . . 82
3.2.3 Discussions . . . 85


4 DETECTION AND ESTIMATION OF A 3-D OBJECT MODEL FOR A REAL APPLICATION . . . 86
4.1 A Comparative study on 3-D object detection . . . 86
4.1.1 Introduction . . . 86
4.1.2 Related Work . . . 88
4.1.3 Three different approaches for 3-D objects detection in a complex scene . . . 90
4.1.3.1 Geometry-based method for Primitive Shape detection Method (PSM) . . . 90
4.1.3.2 Combination of Clustering objects and Viewpoint Features Histogram, GCSAC for estimating 3-D full object models (CVFGS) . . . 91
4.1.3.3 Combination of Deep Learning based and GCSAC for estimating 3-D full object models (DLGS) . . . 93
4.1.4 Experiments . . . 95
4.1.4.1 Data collection . . . 95
4.1.4.2 Evaluation method . . . 98
4.1.4.3 Setup parameters in the evaluations . . . 101
4.1.4.4 Evaluation results . . . 102
4.1.5 Discussions . . . 106
4.2 Deploying an aided-system for visually impaired people . . . 109
4.2.1 Environment and material setup for the evaluation . . . 111
4.2.2 Pre-built script . . . 112
4.2.3 Performances of the real system . . . 114
4.2.3.1 Evaluation of finding 3-D objects . . . 115
4.2.4 Evaluation of usability and discussion . . . 118

5 CONCLUSION AND FUTURE WORKS . . . 121
5.1 Conclusion . . . 121
5.2 Future works . . . 123

Bibliography . . . 125

PUBLICATIONS . . . 139



ABBREVIATIONS

No.  Abbreviation  Meaning
1   API          Application Programming Interface
2   CNN          Convolutional Neural Network
3   CPU          Central Processing Unit
4   CVFH         Clustered Viewpoint Feature Histogram
5   FN           False Negative
6   FP           False Positive
7   FPFH         Fast Point Feature Histogram
8   fps          frames per second
9   GCSAC        Geometrical Constraint SAmple Consensus
10  GPS          Global Positioning System
11  GT           Ground Truth
12  HT           Hough Transform
13  ICP          Iterative Closest Point
14  ISS          Intrinsic Shape Signatures
15  JI           Jaccard Index
16  KDES         Kernel DEScriptors
17  KNN          K Nearest Neighbors
18  LBP          Local Binary Patterns
19  LMNN         Large Margin Nearest Neighbor
20  LMS          Least Mean of Squares
21  LO-RANSAC    Locally Optimized RANSAC
22  LRF          Local Receptive Fields
23  LSM          Least Squares Method
24  MAPSAC       Maximum A Posteriori SAmple Consensus
25  MLESAC       Maximum Likelihood Estimation SAmple Consensus
26  MS           MicroSoft
27  MSAC         M-estimator SAmple Consensus
28  MSI          Modified Plessey
29  MSS          Minimal Sample Set
30  NAPSAC       N-Adjacent Points SAmple Consensus
31  NARF         Normal Aligned Radial Features
32  NN           Nearest Neighbor
33  NNDR         Nearest Neighbor Distance Ratio
34  OCR          Optical Character Recognition
35  OPENCV       OPEN source Computer Vision library
36  PC           Personal Computer
37  PCA          Principal Component Analysis
38  PCL          Point Cloud Library
39  PROSAC       PROgressive SAmple Consensus
40  QR code      Quick Response code
41  RAM          Random Access Memory
42  RANSAC       RANdom SAmple Consensus
43  RFID         Radio-Frequency IDentification
44  R-RANSAC     Recursive RANdom SAmple Consensus
45  SDK          Software Development Kit
46  SHOT         Signature of Histograms of OrienTations
47  SIFT         Scale-Invariant Feature Transform
48  SQ           SuperQuadric
49  SURF         Speeded Up Robust Features
50  SVM          Support Vector Machine
51  TN           True Negative
52  TP           True Positive
53  TTS          Text To Speech
54  UPC          Universal Product Code
55  URL          Uniform Resource Locator
56  USAC         A Universal Framework for Random SAmple Consensus
57  VFH          Viewpoint Feature Histogram
58  VIP          Visually Impaired Person
59  VIPs         Visually Impaired People


LIST OF TABLES

Table 2.1  The number of frames of each scene . . . 36
Table 2.2  The average result of detected table plane on our own dataset (%) . . . 41
Table 2.3  The average result of detected table plane on the dataset [117] (%) . . . 43
Table 2.4  The average result of detected table plane of our method with different down sampling factors on our dataset . . . 44
Table 3.1  The characteristics of the generated cylinder, sphere, cone dataset (synthesized dataset) . . . 66
Table 3.2  The average evaluation results of synthesized datasets. The synthesized datasets were repeated 50 times for statistically representative results . . . 75
Table 3.3  Experimental results on the 'second cylinder' dataset. The experiments were repeated 20 times, then errors are averaged . . . 75
Table 3.4  The average evaluation results on the 'second sphere', 'second cone' datasets. The real datasets were repeated 20 times for statistically representative results . . . 76
Table 3.5  Average results of the evaluation measurements using GCSAC and MLESAC on three datasets. The fitting procedures were repeated 50 times for statistical evaluations . . . 83
Table 4.1  The average result detecting spherical objects on two stages . . . 102
Table 4.2  The average results of detecting the cylindrical objects at the first stage in both the first and second datasets . . . 103
Table 4.3  The average results of detecting the cylindrical objects at the second stage in both the first and second datasets . . . 106
Table 4.4  The average processing time of detecting cylindrical objects in both the first and second datasets . . . 106
Table 4.5  The average results of 3-D queried objects detection . . . 116


LIST OF FIGURES

Figure 1  Illustration of a real scenario: a VIP comes to the kitchen and gives a query "Where is a coffee cup?" on the table. Left panel: a Kinect mounted on the human's chest. Right panel: the developed system is built on a laptop PC . . . 2
Figure 2  Illustration of the process of 3-D query-based object detection in the indoor environment. The full object model is the estimated green cylinder from the point cloud of a coffee cup (red points) . . . 3
Figure 3  A general framework of detecting the 3-D queried objects on the table of the VIPs . . . 6
Figure 1.1  Illustration of the 3-D object recognition process towards the local feature based method [53] . . . 13
Figure 1.2  Illustration of primitive shapes extraction from the point cloud [144] . . . 17
Figure 1.3  Illustration of the Least Squares process . . . 19
Figure 1.4  Line presentation in image space and in Hough space [126] . . . 20
Figure 1.5  Illustration of line estimation by the RANSAC algorithm . . . 21
Figure 1.6  Diagram of RANSAC-based algorithms . . . 22
Figure 2.1  Microsoft Kinect sensor version 1 . . . 25
Figure 2.2  Illustration of the organized point cloud representation process . . . 26
Figure 2.3  Description of the organized and unorganized point cloud . . . 27
Figure 2.4  (a) is a RGB image, (b) is a point cloud of a scene . . . 28
Figure 2.5  The proposed framework for table plane detection . . . 30
Figure 2.6  (a) Computing the depth value of the center pixel based on its neighborhoods (within a 3 × 3 pixel window); (b) down sampling of the depth image . . . 32
Figure 2.7  Illustration of estimating the normal vector of a set of points in the 3-D space. (a) a set of points; (b) estimation of the normal vector of a black point; (c) selection of two points for estimating a plane; (d) the normal vector of a black point . . . 33
Figure 2.8  Illustration of the point cloud segmentation process . . . 33
Figure 2.9  Example of plane segmentation: (a) color image of the scene; (b) plane segmentation result with PROSAC in our publication; (c) plane segmentation result with the organized point cloud . . . 35
Figure 2.10  Illustrating the acceleration vector provided by a Microsoft Kinect sensor. (xk, yk, zk) are the three coordinate axes of the Kinect coordinate system mounted on the chest of a VIP . . . 35
Figure 2.11  Illustration of extracting the table plane in the complex scene . . . 36
Figure 2.12  Examples of 10 scenes captured in our dataset . . . 37
Figure 2.13  Scenes in the dataset [?] . . . 37
Figure 2.14  (a) Color and depth image of the scene; (b) mask data of the table plane; (c) cropped region; (d) point cloud corresponding to the cropped region; the green point is the 3-D centroid of the region . . . 38
Figure 2.15  (a) Illustration of the angle between the normal vector of the detected table plane and T; (b) illustration of the angle between the normal vector of the detected table plane ne and ng; (c) illustration of the overlap and union between detected and ground-truth regions . . . 39
Figure 2.16  Detailed results for each scene of the three plane detection methods on our dataset: (a) using the first evaluation measure; (b) the second evaluation measure; and (c) the third evaluation measure . . . 42
Figure 2.17  Illustration of a floor plane segmented into multiple planes . . . 43
Figure 2.18  Results of table detection with our dataset (two first rows) and the dataset in [117] (two bottom rows). The table plane is limited by the red boundary in the image and by green points in the point cloud. The red arrow is the normal vector of the detected table . . . 44
Figure 2.19  Top line: an example detection that is defined as a true detection when using the two first evaluation measures and as a false detection when using the third evaluation measure: (a) color image; (b) point cloud of the scene; (c) the overlap area between the 2-D contour of the detected table plane and the table plane ground-truth. Bottom line: an example of a missing case with our method: (a) color image, (b) point cloud of the scene. After down sampling, the number of points belonging to the table is 276, which is lower than our threshold . . . 45
Figure 2.20  The transformation of the original point cloud: from the Kinect's original coordinate system Ok(xk, yk, zk) to a new coordinate system Ot(xt, yt, zt), in which the normal vector nt of the table plane is parallel to the y-axis . . . 47
Figure 2.21  Setup of (a) experiment 1 and (b) experiment 2 . . . 48
Figure 2.22  The distribution of error (in distance measure) (ε) of object center estimation in two cases (Case 1: cylinder and Case 2: circle estimation) obtained from experiment 1 (a) and experiment 2 (b) . . . 49
Figure 2.23  Illustrating the result of the detected table plane and separating interested objects. Left: the result of the detected table plane (green points) in the point cloud data of a scene; right: the resulting point cloud data of objects on the table . . . 50
Figure 3.1  Illustration of using primitive shapes estimation to estimate the full object models . . . 51
Figure 3.2  Top panel: overview of RANSAC-based algorithms. Bottom panel: a diagram of the GCSAC implementation . . . 56
Figure 3.3  Geometrical parameters of a cylindrical object. (a)-(c) Explanation of the geometrical analysis to estimate a cylindrical object. (d)-(e) Illustration of the geometrical constraints applied in GCSAC. (f) Result of the estimated cylinder from a point cloud. Blue points are outliers, red points are inliers . . . 59
Figure 3.4  (a) Setting geometrical parameters for estimating a cylindrical object from a point cloud as described above. (b) The estimated cylinder (green one) from an inlier p1 and an outlier p2; as shown, it is an incorrect estimation. (c) Normal vectors n1 and n2* on the plane π are specified . . . 60
Figure 3.5  Estimating parameters of a sphere from 3-D points. Red points are inlier points. In this figure, p1 and p2 are the two selected samples for estimating a sphere (two gray points); they are outlier points. Therefore, the estimated sphere has a wrong centroid and radius (see the green sphere, left bottom panel) . . . 62
Figure 3.6  Estimating parameters of a cone from 3-D points using the geometrical analysis proposed in [131]: (a) point cloud with three samples (p1, p2, p3) and their normal vectors; (b) three estimated planes pli (i = 1, 2, 3); (c) Ei is calculated according to Eq. (3.8); (d) Ap and main axis γco of the estimated cone; (e) illustration of the proposed constraint to estimate a conical object . . . 63
Figure 3.7  Point clouds of (a) dC1, dC2, dC3, (b) dSP1, dSP2, dSP3 and (c) dCO1, dCO2, dCO3 of the three datasets (the synthesized data) in case of 50% inlier ratio. The red points are inliers, whereas blue points are outliers . . . 65
Figure 3.8  Examples of four cylindrical-like objects collected from the 'second cylinder' dataset . . . 66
Figure 3.9  (a) Illustrating the separation of the point cloud data of a ball in the scene. (b) Illustrating the point cloud data of a cone and preparing the ground-truth for evaluating the cone fitting . . . 66
Figure 3.10  Comparisons of the GCSAC and MLESAC algorithms. (a) Total residual errors of the estimated model and the ideal case. (b) Relative errors of the estimated model and the ideal case . . . 69
Figure 3.11  The average number of iterations of GCSAC and MLESAC on the synthesized dataset, repeated 50 times for statistically representative results . . . 70
Figure 3.12  Decomposition of residual density distribution: inlier (blue) and outlier (red) density distributions of a synthesized point cloud with 50 inliers. (a) Noise added by a uniform distribution. (b) Noise added by a Gaussian distribution µ = 0, σ = 1.5. In each subfigure, the left panel shows the distribution of an axis (e.g., x-axis), the right panel shows the corresponding point cloud . . . 70
Figure 3.13  An illustration of GCSAC at the k-th iteration to estimate a coffee mug in the second dataset. Left: the fitting result with a random MSS. Middle: the fitting result where the random samples are updated due to the application of the geometrical constraints. Right: the current best model . . . 71
Figure 3.14  The best estimated model using GCSAC (a) and MLESAC (b) with a 50% inlier synthesized point cloud. In each sub-figure, two different view-points are given . . . 71
Figure 3.15  Results of coffee mug fitting. Ground-truth objects are marked as red points; estimated ones are marked as green points . . . 72
Figure 3.16  The cup is detected by fitting a cylinder. Top row: the scenes in Dataset 3 captured at different view-points; bottom row: the results of locating a coffee cup using GCSAC in the corresponding point clouds . . . 72
Figure 3.17  An illustration of GCSAC at the k-th iteration to estimate a coffee mug in the second dataset. Left: the fitting result with a random MSS. Middle: the fitting result where the random samples are updated due to applying the geometrical constraints. Right: the current best model . . . 73
Figure 3.18  Illustrating the cylinder fitting of GCSAC and some RANSAC variations on the synthesized datasets with a 15% inlier ratio. Red points are inliers, blue points are outliers. The estimated cylinder is the green cylinder. In this figure, GCSAC estimated a cylinder from a point cloud for which Ed, Er, Ea are smallest . . . 73
Figure 3.19  Illustrating the sphere fitting of GCSAC and some RANSAC variations on the synthesized datasets with a 15% inlier ratio. Red points are inliers, blue points are outliers. The estimated spheres are the green spheres . . . 74
Figure 3.20  Illustrating the cone fitting of GCSAC and some RANSAC variations on the synthesized datasets with a 15% inlier ratio. Red points are inliers, blue points are outliers. The estimated cones are the green cones . . . 74
Figure 3.21  Fitting results of some instances collected from the real datasets. (a) A coffee mug; (b) a toy ball; (c) a cone object. In each sub-figure: the left panel is the RGB image for reference, the right panel is the fitting result. Ground-truths are marked as red points; the estimated objects are marked as green points . . . 76
Figure 3.22  Illustrations of a correct (a) and an incorrect estimation without using the verification scheme. In each sub-figure: left panel: point cloud data; middle panel: the normal vector of each point; right panel: the estimated model . . . 79
Figure 3.23  (a) The histogram of deviation angles with the x-axis (1, 0, 0) of a real dataset in the bottom panel of Fig. 3.22; (b) the histogram of deviation angles with the x-axis (1, 0, 0) of a generated cylinder dataset in the top panel of Fig. 3.22 . . . 79
Figure 3.24  Illustrating the deviation angle between the estimated cylinder's axis and the normal vector of the plane . . . 80
Figure 3.25  Some examples of scenes with cylindrical objects [117] collected in the first dataset . . . 80
Figure 3.26  Illustration of six types of cylindrical objects in the third dataset . . . 81
Figure 3.27  Result of the table plane detection in a pre-processing step using the methods in our previous publication and [33]. (a) RGB image of the current scene; (b) the detected table plane is marked in green points; (c) the point clouds above the table plane are located and marked in red . . . 82
Figure 3.28  (a) The results of estimating the cylindrical objects of the 'MICA3D' dataset. (b) The results of estimating the cylindrical objects of the [68] dataset. In these scenes, there is more than one cylindrical object; they are marked in red, green, blue, yellow, and so on. The estimated cylinders include radius, position (the center of the cylinder) and main axis direction. The height can be computed using a normalization of the y-value of the estimated object . . . 83
Figure 3.29  (a) The green estimated cylindrical object has the relative error of the estimated radius Er = 111.08%; (b) the blue estimated cylindrical object has the relative error of the estimated radius Er = 165.92% . . . 84
Figure 3.30  Angle errors Ea of the fitting results using GCSAC with and without using the context's constraint . . . 84
Figure 3.31  Extracting the fitting results of the video on scene 1 of the first dataset . . . 84
Figure 4.1  Top panel: the procedure of the PSM method. Bottom panel: illustration of the result of each step . . . 91
Figure 4.2  A result of object clustering when using the method of Qingming et al. [110]. (a) RGB image; (b) the result of object clustering projected to the image space . . . 92
Figure 4.3  (a) Presentation of neighbors of Pq [123]. (b) Estimating parameters of the PFH descriptor [124] as in Eq. 4.2 . . . 93
Figure 4.4  Illustration of the training phase of the CVFGS method . . . 93
Figure 4.5  Illustration of the testing phase of the CVFGS method . . . 94
Figure 4.6  The architecture of the Faster R-CNN network [116] for object detection on a RGB image . . . 95
Figure 4.7  The architecture of the YOLO version 2 network [114] . . . 95
Figure 4.8  The size of the feature map when using YOLO version 2 for training the model . . . 96
Figure 4.9  YOLO divides the image into regions and predicts bounding boxes and probabilities for each region [115] . . . 96
Figure 4.10  Illustration of a MS Kinect mounted on the VIP's chest . . . 97
Figure 4.11  Illustration of a laptop mounted on the VIPs . . . 98
Figure 4.12  Illustration of object types and scenes in our dataset and the published dataset [68] . . . 98
Figure 4.13  Illustration of the size of the table . . . 99
Figure 4.14  Illustration of the spherical object detection evaluation. The left column is the result of detecting spherical objects on the RGB image by the YOLO CNN. The middle column is the result of the estimated spherical object from the point cloud of the detected object in the left column. The right column is the result of spherical object detection when projecting the estimated sphere to the RGB image . . . 99
Figure 4.15  Illustration of computing the deviation angle between the normal vector of the table plane yt and the estimated cylinder axis γc . . . 101
Figure 4.16  A final result of detecting a spherical object in the scene. Left: the result in the RGB image, where the location of the spherical object is shown; right: the finding result in the 3-D environment (green points are the generated points of the estimated sphere and the inlier points that estimated the blue sphere). The location and description of the estimated spherical object in the scene are x = -0.4 m, y = -0.45 m, z = 1.77 m, radius = 0.098 m . . . 103
Figure 4.17  (a), (b) Illustrating the false cases of detecting spherical objects in the first stage. (c) Illustrating the false case of finding spherical objects following the query of the VIPs. The green set of points is the point cloud that estimates the blue sphere. In this case, the rectangle projected from the generated point cloud data of the blue sphere is large . . . 104
Figure 4.18  Illustration of the result of detecting, locating and describing a spherical object in the 3-D environment . . . 105
Figure 4.19  (a) Illustration of a RGB image; (b) the results of object detection on the RGB image when using the object classifier of YOLO [115]; (c) illustrating the results of finding cylindrical objects when not using the angle constraint of the context; (d) illustrating the results of finding cylindrical objects when using the angle constraint of the context. The deviation angle of the estimated cylinder with the normal vector of the table plane is 88 degrees for the bottle and 38 degrees for the coffee cup in (c), and 18.6 degrees for the bottle and 4.2 degrees for the coffee cup in (d) . . . 107
Figure 4.20  (a) Illustration of the estimated cylinder from the point cloud data of the green jar in Fig. 4.19 . . . 107
Figure 4.21  Illustrating the result of detecting two cylindrical objects on the table. (a) The result on the RGB image; (b) the result in the 3-D environment . . . 108
Figure 4.22  Illustration of the results of detecting, locating and describing a 3-D cylindrical object (green cylinder) in the 3-D environment. The red points belong to a detected object . . . 108
Figure 4.23  The framework for deploying the complete system to detect 3-D primitive objects according to the queries of the VIPs . . . 110
Figure 4.24  Illustration of 3-D queried primitive objects in 2-D and 3-D environments. The left panel is the results of detecting cylindrical objects and the right panel is the results of spherical objects in the RGB image and point cloud . . . 111
Figure 4.25  Illustration of the system set up on the VIPs . . . 112
Figure 4.26  Illustrating the VIPs coming to the sharing room or kitchen to find the spherical objects or cylindrical objects on the table . . . 113
Figure 4.27  Illustration of the trajectory of the visually impaired person . . . 113
Figure 4.28  Illustration of object detection in the RGB images . . . 114
Figure 4.29  Full trajectory of a volunteer in the experiments. First row: scene collected by a surveillance camera; the frame ID is noted above each panel. Second row: results of YOLO detection on the RGB images. Last row: results of the model fitting; the objects' descriptions (e.g., position, radius) are given. There are five spherical-like objects in this scene . . . 114
Figure 4.30  Illustration of scenes and objects when VIPs move in the real environment of the sharing room and kitchen . . . 115
Figure 4.31  Computation of the occluded data. Rg is the area of objects; Rs is the visible area . . . 116
Figure 4.32  Illustration of results on the occluded data . . . 117
Figure 4.33  A wrong detection of the table plane . . . 117
Figure 4.34  A wrong estimation of the spherical object . . . 118
Figure 4.35  Illustration of using GCSAC to estimate the spherical object on the missing data . . . 118
Figure 4.36  Illustration of the inclined cylinder . . . 119
Figure 5.1  Illustration of the problem solved in our dissertation [93], [131] . . . 123


INTRODUCTION
Motivation
Visually Impaired People (VIPs) face many difficulties in their daily living. Nowadays, many aided systems for the VIPs have been deployed, such as navigation services, obstacle detection (the iNavBelt and GuideCane products [10, 119]), and object recognition in supermarkets (EyeRing at MIT's Media Lab [74]). In their activities of daily living, the most common situation is that VIPs need to locate home facilities. However, even a simple activity such as querying common objects (e.g., a bottle, a coffee cup, jars, and so on) in a conventional environment (e.g., a kitchen or a cafeteria room) may be a challenging task. In terms of deploying an aided system for the VIPs, not only must the object's position be provided, but further information about the queried objects, such as their size and grasping directions, is also required.
Let us consider a real scenario, as shown in Fig. 1: to look for a tea or coffee cup, a VIP goes into the kitchen, touches any surrounding object and picks up the right one. With an aided system, that person just makes queries such as "Where is a coffee cup?", "What is the size of the cup?", "Is the cup lying/standing on the table?". The aided system should provide the information to the VIPs so that they can grasp the objects and avoid accidents. Approaching these issues, recognition on 2-D image data and adding information from depth images are presented in [81], [82], [83]. However, in these works only the object's label is provided. Moreover, 2-D information covers only the visible part of an object at a certain view-point, while VIPs need information about position, size and orientation for safe grasping. For this reason, we use 3-D approaches to address these critical issues.
To this end, we exploit knowledge of the shape of the queried object; for instance, a coffee cup usually has a cylindrical shape and lies on a flat surface (a table plane). The aided system can then resolve the query by fitting a primitive shape to the point cloud collected from the object. More generally, the objects in a kitchen or tea room, such as cups, bowls, jars, fruit, funnels, etc., are usually placed on tables; therefore, these objects can be simplified by primitive shapes. The problem of detecting and recognizing complex objects is out of the scope of the dissertation. In addition, we observe that prior knowledge about the current scene, such as the fact that a cup normally stands on the table, contextual information such as the walls in the scene being perpendicular to the table plane, and the limited size/height of the queried objects, provide valuable cues to improve the system performance.
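As a concrete, self-contained illustration of this idea (not code from the dissertation), a spherical object such as a ball seen from a single view-point can be recovered from its partial point cloud with a simple algebraic least-squares fit; robust variants that tolerate outliers are the subject of Chapter 3. The NumPy sketch below assumes metric coordinates and a reasonably clean segment of the object's points.

    import numpy as np

    def fit_sphere_least_squares(points):
        # Algebraic sphere fit: |p|^2 = 2 c.p + (r^2 - |c|^2), solved in the least-squares sense.
        A = np.hstack([2.0 * points, np.ones((len(points), 1))])
        b = np.sum(points ** 2, axis=1)
        sol, *_ = np.linalg.lstsq(A, b, rcond=None)
        center = sol[:3]
        radius = np.sqrt(sol[3] + center @ center)
        return center, radius

    # Example: a noisy half-sphere (radius 5 cm), roughly what a depth sensor sees of a ball.
    rng = np.random.default_rng(0)
    theta = rng.uniform(0.0, np.pi, 2000)
    phi = rng.uniform(0.0, np.pi, 2000)
    pts = 0.05 * np.stack([np.sin(phi) * np.cos(theta),
                           np.sin(phi) * np.sin(theta),
                           np.cos(phi)], axis=1) + np.array([0.1, -0.2, 1.5])
    pts += rng.normal(scale=0.001, size=pts.shape)   # millimetre-level depth noise
    print(fit_sphere_least_squares(pts))             # ~ ((0.1, -0.2, 1.5), 0.05)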



Figure 1  Illustration of a real scenario: a VIP comes to the kitchen and gives a query "Where is a coffee cup?" on the table. The left panel shows a Kinect mounted on the human's chest. Right panel: the developed system is built on a laptop PC.
More generally, we realize that the queried objects can be located through simplifying geometric shapes: planar segments (boxes), cylinders (coffee mugs, soda cans), spheres (balls) and cones, rather than by utilizing conventional 3-D features. Following these ideas, a pipeline for the work "3-D Object Detection and Recognition for Assisting Visually Impaired People" is proposed in this dissertation. The proposed framework consists of several tasks: (1) separating the queried objects from a table plane; (2) detecting candidates of the interested objects using appearance features; and (3) estimating a model of the queried object from a 3-D point cloud. Instead of matching the queried objects to 3-D models as conventional learning-based approaches do, this research work focuses on constructing a simplified geometrical model of the queried object from an unstructured point cloud collected by an RGB and depth sensor, wherein the last step plays the most important role.

Objective
In this dissertation, we aim to propose a robust 3-D object detection and recognition system. As a feasible solution for deploying a real application, the proposed framework should be simple, robust and friendly to the VIPs. However, it is necessary to notice that there are critical issues that might affect the performance of the proposed system. In particular: (1) objects are queried in a complex scene where clutter and occlusion may appear; (2) the collected data is noisy; and (3) the computational cost is high due to the huge number of points in a point cloud. Although a number of relevant works on 3-D object detection and recognition have been attempted in the literature for a long time, in this study we do not attempt to solve these issues separately; instead, we aim to provide a unified solution. To this end, the concrete objectives are:
- To propose a complete 3-D query-based object detection system supporting the VIPs with high accuracy. Figure 2 illustrates the process of 3-D query-based object detection in an indoor environment.
- To deploy a real application that locates and describes objects' information, supporting the VIPs in grasping objects. The application is evaluated in practical scenarios such as finding objects in a sharing room or a kitchen.

Figure 2  Illustration of the process of 3-D query-based object detection in the indoor environment. The full object model is the estimated green cylinder fitted to the point cloud of a coffee cup (red points).
A possible extension of this research is to give the VIPs a simple form of feedback or interaction, since VIPs want to make optimal use of all their senses (i.e., audition, touch, and kinesthetic feedback). Through this study, informative data extracted from the cameras (i.e., position, size, and safe directions for object grasping) becomes available. As a result, the proposed method can turn the large amount of collected data into a valuable and feasible resource.

Context, constraints and challenges
Figure 1 shows the context in which a VIP comes to a cafeteria and uses an aided system for locating an object on the table. The input of the system is a user's query and the output is the object's position in a 3-D coordinate system together with the object's information (size, height, surface normal). The proposed system operates using an MS Kinect sensor [72]. The Kinect sensor is mounted on the chest of the VIP and a laptop is carried in a backpack, as shown in Fig. 1. For deploying a real application, we impose some constraints on the scenario, as follows:
• The MS Kinect sensor:
  – An MS Kinect sensor is mounted on the VIP's chest, and he/she moves slowly around the table. This is done to collect data of the environment.
  – The MS Kinect sensor captures RGB and depth images at a normal frame rate (from 10 to 30 fps) [95] with an image resolution of 640 × 480 pixels for both image types. With each frame obtained from the Kinect, an acceleration vector is also obtained. Because the MS Kinect collects images at 10 to 30 fps, it fits well with the slow movements of the VIPs (∼ 1 m/s). Although collecting image data via a wearable sensor can be affected by the subject's movement, such as image blur and vibrations in practical situations, there are no specific requirements for collecting the image data; for instance, VIPs are not required to stand still before the image data is collected. (A back-projection sketch that turns such a depth frame into a point cloud, under assumed intrinsics, is given after this list.)
  – Every queried object needs to be placed in the visible area of the MS Kinect sensor, which is at a distance of 0.8 to 4 meters and within an angle of about 30° around the central axis of the MS Kinect sensor. Therefore, the distance from the VIPs to the table is also constrained to about 0.8 to 4 m.
• Interested (or queried) objects are assumed to have simple geometrical structures. For instance, coffee mugs, bowls, jars, bottles, etc. have a cylindrical shape, whereas balls have a spherical shape and boxes have a cuboid shape. The objects are idealized and labeled. The interaction module between a VIP and the system has not been developed in this dissertation.
• Once a VIP wants to query an object on the table, he/she should stand in front of the table. This ensures that the current scene is in the visible area of the MS Kinect sensor; the VIP can then move around the table. The proposed system computes and returns the object's information, such as position, size and orientation. Conveying such information to the user's senses (e.g., as audible information, on a Braille display, or through vibration) is out of the scope of this dissertation.

• Some heuristic parameters are pre-determined. For instance, a VIP's height and other parameters of the contextual constraints (e.g., the size of the table plane in a scene, limitations on the objects' heights) are pre-selected.
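For reference, the back-projection that turns a Kinect depth frame into an organized point cloud (the phase-1 representation used throughout the dissertation) is a direct application of the pinhole camera model. The sketch below is illustrative only; the intrinsic parameters fx, fy, cx, cy are typical nominal values for a Kinect v1 at 640 × 480 and are assumptions, not values taken from this thesis.

    import numpy as np

    def depth_to_point_cloud(depth_mm, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
        # Back-project a 480 x 640 depth image (millimetres) to an organized
        # point cloud of shape (480, 640, 3) in metres; invalid pixels become NaN.
        h, w = depth_mm.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth_mm.astype(np.float64) / 1000.0
        z[depth_mm == 0] = np.nan
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        return np.stack([x, y, z], axis=-1)

    # Example with a synthetic flat depth frame at 1.2 m.
    cloud = depth_to_point_cloud(np.full((480, 640), 1200, dtype=np.uint16))
    print(cloud.shape, float(np.nanmean(cloud[..., 2])))   # (480, 640, 3) 1.2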
The above-mentioned scenarios and constraints are designed to cope with the following issues:
• Occlusion and clutter of the interested objects: In a practical situation, when a VIP comes to a cafeteria to find an object on the table, the queried objects may be occluded by others. At a certain view-point, the MS Kinect sensor captures only a part of an object; therefore, the full 3-D data of the queried objects is missing. Another issue is that the collected data contains a great deal of noise, because the depth images collected by the MS Kinect are often affected by illumination conditions. These issues raise challenges for fitting, detecting and classifying the queried objects from a point cloud.



• Various appearances of the same object type: The system is to support the VIPs in querying common objects. A "blue" tea/coffee cup and a "yellow" bottle share the same type of primitive shape (e.g., a cylindrical model); these objects have the same geometrical structure but different appearances in color and texture. We exploit recent deep learning techniques to utilize appearance features (on RGB images) to address this issue.
• Computational time: A point cloud of a scene generated from an image of size 640 × 480 pixels consists of hundreds of thousands of points (up to 640 × 480 = 307,200). Therefore, computations on the 3-D scene often have a higher computational cost than the corresponding tasks on 2-D images. (A minimal down-sampling sketch that mitigates this cost is given after this list.)
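One standard mitigation, in the spirit of the down-sampling used in the dissertation's pre-processing step (the thesis itself down-samples the depth image; Chapter 2 evaluates several down-sampling factors), is to thin the cloud before any 3-D processing. The voxel-grid sketch below only illustrates that idea and is not the exact procedure used in this thesis.

    import numpy as np

    def voxel_downsample(points, voxel=0.01):
        # Keep one representative point (the centroid) per cubic voxel of edge `voxel` metres.
        keys = np.floor(points / voxel).astype(np.int64)
        _, inverse, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
        inverse = inverse.reshape(-1)
        sums = np.zeros((len(counts), 3))
        np.add.at(sums, inverse, points)
        return sums / counts[:, None]

    # A 640 x 480 frame yields up to 307,200 points; a 5 cm grid keeps only a few thousand.
    pts = np.random.default_rng(0).uniform(0.0, 1.0, size=(307200, 3))
    print(len(voxel_downsample(pts, voxel=0.05)))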

Contributions
Throughout the dissertation, the main objectives are addressed by a unified solution. We achieve the following contributions:
• Contribution 1: We proposed a new robust estimator, called GCSAC (Geometrical Constraint SAmple Consensus), for the estimation of primitive shapes from the point cloud of an object. Different from conventional RANSAC (RANdom SAmple Consensus) algorithms, GCSAC selects uncontaminated (so-called qualified or good) samples from a set of data points using geometrical constraints. Moreover, GCSAC is extended by utilizing contextual constraints to validate the results of the model estimation. (A schematic sketch of this constrained-sampling idea is given after this list.)
• Contribution 2: We carried out a comparative study on three different approaches for recognizing 3-D objects in a complex scene. The best one is a combination of a deep-learning-based technique and the proposed robust estimator (GCSAC). This method takes advantage of recent object detection using a neural network on RGB images and utilizes the proposed GCSAC to estimate the full 3-D models of the queried objects.
• Contribution 3: We deployed a successful system using the proposed methods for detecting 3-D primitive-shape objects in a lab-based environment. The system combines the table plane detection technique and the proposed method of 3-D object detection and estimation. It achieves fast computation for both tasks of locating and describing the objects. As a result, it fully supports the VIPs in grasping the queried objects.
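To make Contribution 1 more concrete, the sketch below shows the constrained-sampling idea in a RANSAC-style loop for a cylinder: a minimal sample (two points with normals) is discarded whenever the axis it implies is not roughly parallel to the table-plane normal, i.e. the contextual "standing object" constraint discussed above. This is a self-contained schematic under simplifying assumptions (infinite cylinder, synthetic data, illustrative thresholds), not the dissertation's exact GCSAC algorithm.

    import numpy as np

    def cylinder_from_two_oriented_points(p, n):
        # Candidate infinite cylinder (axis a, axis point c2 in the projected plane,
        # radius r, 2x3 projection basis B) from two surface points with unit normals.
        a = np.cross(n[0], n[1])
        if np.linalg.norm(a) < 1e-6:          # near-parallel normals: degenerate sample
            return None
        a = a / np.linalg.norm(a)
        u = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(u) < 1e-6:
            u = np.cross(a, [0.0, 1.0, 0.0])
        u = u / np.linalg.norm(u)
        v = np.cross(a, u)
        B = np.stack([u, v])                  # projects points onto the plane perpendicular to the axis
        q, m = p @ B.T, n @ B.T
        A = np.stack([m[0], -m[1]], axis=1)
        if abs(np.linalg.det(A)) < 1e-6:
            return None
        t = np.linalg.solve(A, q[1] - q[0])   # intersect the two projected normal lines
        c2 = q[0] + t[0] * m[0]
        r = 0.5 * (np.linalg.norm(q[0] - c2) + np.linalg.norm(q[1] - c2))
        return a, c2, r, B

    def constrained_ransac_cylinder(points, normals, table_normal,
                                    n_iter=500, tol=0.005, max_dev_deg=10.0, seed=0):
        # GCSAC-flavoured loop: discard minimal samples whose implied axis is not
        # (anti)parallel to the table normal, then score the survivors by inlier count.
        rng = np.random.default_rng(seed)
        cos_min = np.cos(np.deg2rad(max_dev_deg))
        best, best_inliers = None, -1
        for _ in range(n_iter):
            idx = rng.choice(len(points), size=2, replace=False)
            cand = cylinder_from_two_oriented_points(points[idx], normals[idx])
            if cand is None:
                continue
            a, c2, r, B = cand
            if abs(np.dot(a, table_normal)) < cos_min:   # contextual constraint
                continue
            d = np.abs(np.linalg.norm(points @ B.T - c2, axis=1) - r)
            inliers = int(np.sum(d < tol))
            if inliers > best_inliers:
                best, best_inliers = (a, c2, r), inliers
        return best, best_inliers

    # Synthetic standing cylinder (radius 4 cm, axis along y) plus 30% outliers.
    rng = np.random.default_rng(1)
    ang = rng.uniform(0.0, 2.0 * np.pi, 700)
    pts = np.stack([0.04 * np.cos(ang), rng.uniform(0.0, 0.1, 700), 0.04 * np.sin(ang)], axis=1)
    nrm = np.stack([np.cos(ang), np.zeros(700), np.sin(ang)], axis=1)
    pts = np.vstack([pts, rng.uniform(-0.1, 0.1, (300, 3))])
    nrm = np.vstack([nrm, rng.normal(size=(300, 3))])
    nrm = nrm / np.linalg.norm(nrm, axis=1, keepdims=True)
    print(constrained_ransac_cylinder(pts, nrm, np.array([0.0, 1.0, 0.0])))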




[Figure 3 block diagram: a Microsoft Kinect provides the RGB-D images and an acceleration vector; a pre-processing step performs point cloud representation and table plane detection; object detection on the RGB image provides candidates; the fitting stage locates the 3-D objects on the table plane and estimates the 3-D object models, yielding the 3-D object information.]
Figure 3  A general framework of detecting the 3-D queried objects on the table of the VIPs.

General framework and dissertation outline
In this dissertation, we propose a unified framework for detecting the queried 3-D objects on a table in order to support the VIPs in an indoor environment. The proposed framework consists of three main phases, as illustrated in Fig. 3. The first phase is considered a pre-processing step: it consists of point cloud representation from the RGB and depth images, and table plane detection in order to separate the interested objects from the current scene. The second phase aims to label the object candidates on the RGB images. The third phase estimates a full model from the point cloud specified by the first and second phases; in this last phase, the 3-D objects are estimated by utilizing the new robust estimator GCSAC to obtain the full geometrical models. Utilizing this framework, we deploy a real application. The application is evaluated in different scenarios, including datasets collected in lab environments and public datasets. In particular, the research works in this dissertation are composed of six chapters, as follows:
• Introduction: This chapter describes the main motivations and objectives of the study. We also present the critical points of the research's context, constraints and challenges that we meet and address in the dissertation. Additionally, the general framework and main contributions of the dissertation are presented.
• Chapter 1, Literature Review: This chapter mainly surveys existing aided systems for the VIPs. In particular, the related techniques for developing an aided system are discussed. We also present the relevant works on estimation algorithms and a series of techniques for 3-D object detection and recognition.
• Chapter 2: In this chapter, we describe the point cloud representation built from data collected by an MS Kinect sensor. A real-time table plane detection technique for
