
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

THI HUONG GIANG DOAN

DYNAMIC HAND GESTURE RECOGNITION USING RGB-D
IMAGES FOR HUMAN-MACHINE INTERACTION

DOCTORAL THESIS OF
CONTROL ENGINEERING AND AUTOMATION

Hanoi − 2017


MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

THI HUONG GIANG DOAN

DYNAMIC HAND GESTURE RECOGNITION USING
RGB-D IMAGES FOR HUMAN-MACHINE
INTERACTION

Specialty: Control Engineering and Automation
Specialty Code: 62520216

DOCTORAL THESIS OF
CONTROL ENGINEERING AND AUTOMATION

SUPERVISORS:
1. Dr. Hai Vu


2. Dr. Thanh Hai Tran

Hanoi − 2017


DECLARATION OF AUTHORSHIP
I, Thi Huong Giang Doan, declare that the thesis titled, “Dynamic Hand Gesture
Recognition Using RGB-D Images for Human-Machine Interaction”, and the works
presented in it are my own. I confirm that:
This work was done wholly or mainly while in candidature for a Ph.D. research
degree at Hanoi University of Science and Technology.
Where any part of this thesis has previously been submitted for a degree or any
other qualification at Hanoi University of Science and Technology or any other
institution, this has been clearly stated.
Where I have consulted the published work of others, this is always clearly attributed.
Where I have quoted from the work of others, the source is always given. With
the exception of such quotations, this thesis is entirely my own work.
I have acknowledged all main sources of help.
Where the thesis is based on work done by myself jointly with others, I have
made clear exactly what was done by others and what I have contributed myself.

Hanoi, December 2017
PhD STUDENT

Thi Huong Giang DOAN

SUPERVISORS

Dr. Hai VU


Dr. Thi Thanh Hai TRAN



ACKNOWLEDGEMENT
This thesis was written during my doctoral study at the International Research Institute MICA (Multimedia, Information, Communication and Applications), Hanoi University of Science and Technology (HUST). It is my great pleasure to thank all the people who supported me in completing this work.
First, I would like to express my sincere gratitude to my advisors, Dr. Hai Vu and Dr. Thi Thanh Hai Tran, for their continuous support of my Ph.D. study and related research, and for their patience, motivation, and immense knowledge. Their guidance helped me throughout the research and the writing of this thesis. I could not have imagined having better advisors and mentors for my Ph.D. study.
Besides my advisors, I would like to thank the scientists and authors of the published works cited in this thesis, whose work provided valuable information resources for my research. Attending scientific conferences has always been a great experience and a source of many useful comments.
In the process of carrying out and completing my research, I have received much support from the board of MICA directors. My sincere thanks go to Prof. Yen Ngoc Pham, Prof. Eric Castelli, and Dr. Son Viet Nguyen, who provided me with the opportunity to join research work at the MICA institute and gave me access to the laboratory and research facilities. Without their precious support, it would have been impossible to conduct this research.
As a Ph.D. student of the 911 programme, I would like to thank the programme for its financial support during my Ph.D. course. I also gratefully acknowledge the financial support for publishing papers and conference fees from research projects T2014-100, T2016-PC-189, and T2016-LN-27. I would like to thank my colleagues at the Computer Vision Department and Multi-Lab of the MICA institute over the years, both at work and outside of work.
Special thanks to my family. Words cannot express how grateful I am to my mother and father for all of the sacrifices that they have made on my behalf. I would also like to thank my beloved husband. Thank you for supporting me in everything.

Hanoi, December 2017
Ph.D. Student
Thi Huong Giang DOAN



CONTENTS

DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENT
CONTENTS
SYMBOLS
LIST OF TABLES
LIST OF FIGURES

1 LITERATURE REVIEW
1.1 Completed hand gesture recognition systems for controlling home appliances
1.1.1 GUI device dependent systems
1.1.2 GUI device independent systems
1.2 Hand detection and segmentation
1.2.1 Color
1.2.2 Shape
1.2.3 Motion
1.2.4 Depth
1.2.5 Discussions
1.3 Hand gesture spotting system
1.3.1 Model-based approaches
1.3.2 Feature-based approaches
1.3.3 Discussions
1.4 Dynamic hand gesture recognition
1.4.1 HMM-based approach
1.4.2 DTW-based approach
1.4.3 SVM-based approach
1.4.4 Deep learning-based approach
1.4.5 Conclusion
1.5 Discussion and Conclusion

2 A NEW DYNAMIC HAND GESTURE SET OF CYCLIC MOVEMENT


2.1 Defining dynamic hand gestures
2.2 The existing dynamic hand gesture datasets
2.2.1 The published dynamic hand gesture datasets
2.2.1.1 The RGB hand gesture datasets
2.2.1.2 The Depth hand gesture datasets
2.2.1.3 The RGB and Depth hand gesture datasets
2.2.2 The non-published hand gesture datasets
2.2.3 Conclusion
2.3 Definition of the closed-form pattern of gestures and phasing issues
2.3.1 Conducting commands of a dynamic hand gesture set
2.3.2 Definition of the closed-form pattern of gestures and phasing issues
2.3.3 Characteristics of the dynamic hand gesture set
2.4 Data collection
2.4.1 MICA1 dataset
2.4.2 MICA2 dataset
2.4.3 MICA3 dataset
2.4.4 MICA4 dataset
2.5 Discussion and Conclusion

3 HAND DETECTION AND GESTURE SPOTTING WITH USER-GUIDE SCHEME
3.1 Introduction
3.2 Heuristic user-guide scheme
3.2.1 Assumptions
3.2.2 Proposed framework
3.2.3 Estimating heuristic parameters
3.2.3.1 Estimating parameters of background model for body detection
3.2.3.2 Estimating the distance from hand to the Kinect sensor for extracting hand candidates
3.2.3.3 Estimating skin color parameters for pruning hand regions
3.2.4 Hand detection phase using heuristic parameters
3.2.4.1 Hand detection
3.2.4.2 Hand posture recognition
3.3 Dynamic hand gesture spotting
3.3.1 Catching buffer
3.3.2 Spotting dynamic hand gesture
3.4 Experimental results
3.4.1 The required learning time for end-users

3.4.2 The computational time for hand segmentation and recognition
3.4.3 Performance of the hand region segmentations
3.4.3.1 Evaluate the hand segmentation
3.4.3.2 Compare the hand posture recognition results
3.4.4 Performance of the gesture spotting algorithm
3.5 Discussion and Conclusion
3.5.1 Discussions
3.5.2 Conclusions


4 DYNAMIC HAND GESTURE REPRESENTATION AND RECOGNITION USING SPATIAL-TEMPORAL FEATURES
4.1 Introduction
4.2 Proposed framework
4.2.1 Hand representation from spatial and temporal features
4.2.1.1 Temporal features extraction
4.2.1.2 Spatial features extraction using linear reduction space
4.2.1.3 Spatial features extraction using non-linear reduction space
4.2.2 DTW-based phase synchronization and KNN-based classification
4.2.2.1 Dynamic Time Warping for phase synchronization
4.2.2.2 Dynamic hand gesture recognition using K-NN method
4.2.3 Interpolation-based synchronization and SVM classification
4.2.3.1 Dynamic hand gesture representation
4.2.3.2 Quasi-periodic dynamic hand gesture pattern
4.2.3.3 Phase synchronization using hand posture interpolation
4.2.3.4 Dynamic hand gesture recognition using different classifiers
4.3 Experimental results
4.3.1 Influence of temporal resolution on recognition accuracy
4.3.2 Tuning kernel scale parameters of the RBF-SVM classifier
4.3.3 Performance evaluation of the proposed method
4.3.4 Impacts of the phase normalization
4.3.5 Further evaluations on public datasets
4.4 Discussion and Conclusion
4.4.1 Discussion
4.4.2 Conclusion

5 CONTROLLING HOME APPLIANCES USING DYNAMIC HAND GESTURES
5.1 Introduction


5.2 Deployment of control systems using hand gestures
5.2.1 Assignment of hand gestures to commands
5.2.2 Different modes of operations carried out by hand gestures
5.2.2.1 Different states of lamp and their transitions
5.2.2.2 Different states of fan and their transitions
5.2.3 Implementation of the control system
5.2.3.1 Main components of the control system using hand gestures
5.2.3.2 Integration of hand gesture recognition modules
5.3 Experiments of control systems using hand gestures
5.3.1 Environment and material setup
5.3.2 Pre-built script
5.3.3 Experimental results
5.3.3.1 Evaluation of hand gesture recognition
5.3.3.2 Evaluation of time costs
5.3.4 Evaluation of usability
5.4 Discussion and Conclusion
5.4.1 Discussions
5.4.2 Conclusion

Bibliography


ABBREVIATIONS

TT  Abbreviation  Meaning
1   ANN       Artificial Neural Network
2   ASL       American Sign Language
3   BB        Bounding Box
4   BGS       Background Subtraction
5   BW        Baum-Welch
6   BOW       Bag of Words
7   C3D       Convolutional 3D
8   CD        Compact Disc
9   CIF       Common Intermediate Format
10  CNN       Convolutional Neural Network
11  CPU       Central Processing Unit
12  CRFs      Conditional Random Fields
13  CSI       Channel State Information
14  DBN       Deep Belief Network
15  DDNN      Deep Dynamic Neural Networks
16  DoF       Degree of Freedom
17  DT        Decision Tree
18  DTM       Dense Trajectories Motion
19  DTW       Dynamic Time Warping
20  FAR       False Acceptance Rate
21  FD        Fourier Descriptor
22  FP        False Positive
23  FN        False Negative
24  FSM       Finite State Machine
25  fps       frames per second
26  GA        Genetic Algorithm
27  GMM       Gaussian Mixture Model
28  GT        Ground Truth
29  GUI       Graphical User Interface
30  HCI       Human-Computer Interaction
31  HCRFs     Hidden Conditional Random Fields
32  HNN       Hopfield Neural Network
33  HMM       Hidden Markov Model
34  HOG       Histogram of Oriented Gradients
35  HSV       Hue Saturation Value
36  ID        IDentification
37  IP        Internet Protocol
38  IR        InfraRed
39  ISOMAP    ISOmetric MAPping
40  JI        Jaccard Index
41  KLT       Kanade-Lucas-Tomasi
42  KNN       K Nearest Neighbors
43  LAN       Local Area Network
44  LE        Laplacian Eigenmaps
45  LLE       Locally Linear Embedding
46  LRB       Left-Right Banded
47  MOG       Mixture of Gaussians
48  MFC       Microsoft Foundation Classes
49  MSC       Mean Shift Clustering
50  MR        Magic Ring
51  NB        Naive Bayesian
52  PC        Personal Computer
53  PCA       Principal Component Analysis
54  PDF       Probability Distribution Function
55  PNG       Portable Network Graphics
56  QCIF      Quarter Common Intermediate Format
57  RAM       Random Access Memory
58  RANSAC    RANdom SAmple Consensus
59  RBF       Radial Basis Function
60  RF        Random Forest
61  RGB       Red Green Blue
62  RGB-D     Red Green Blue - Depth
63  RMSE      Root Mean Square Error
64  ROI       Region of Interest
65  RNN       Recurrent Neural Network
66  SIFT      Scale Invariant Feature Transform
67  SVM       Support Vector Machine
68  STE       Short Time Energy
69  STF       Spatial Temporal Feature
70  ToF       Time of Flight
71  TN        True Negative
72  TP        True Positive
73  TV        TeleVision
74  XML       eXtensible Markup Language


LIST OF TABLES

Table 1.1 Soft remote control system and commands assignment
Table 1.2 Omron TV command assignment
Table 1.3 Hand gestures utilized for different devices using the WiSee technique
Table 1.4 Hand gestures utilized for different devices using the MR technique
Table 1.5 The existing in-air gesture-based systems
Table 1.6 The existing vision-based dynamic hand gesture methods
Table 2.1 The existing hand gesture datasets
Table 2.2 The main commands of some smart home electrical appliances
Table 2.3 Notations used in this research
Table 2.4 Characteristics of the defined databases
Table 3.1 The required time to learn parameters of the background model
Table 3.2 The required time to learn parameters of the hand-skin color model
Table 3.3 The required time to learn the hand-to-Kinect distance
Table 3.4 The required time for hand segmentation
Table 3.5 The required time for hand posture recognition
Table 3.6 Results of the JI indexes without/with the learning scheme
Table 4.1 Recall rate (%) of the proposed method on the self-collected datasets with different classifiers
Table 4.2 Performance of the proposed method on three different datasets
Table 5.1 Assignment of hand gestures to commands for controlling lamp and fan
Table 5.2 Confusion matrix of dynamic hand gesture recognition
Table 5.3 Accuracy rate (%) of dynamic hand gesture commands
Table 5.4 Assessment of end-users on the defined dataset


LIST OF FIGURES

Figure 1 Home appliances in a smart home
Figure 2 Controlling home appliances using dynamic hand gestures in a smart house
Figure 3 The proposed framework of the dynamic hand gesture recognition for controlling home appliances
Figure 1.1 Mitsubishi hand gesture-based TV [46]
Figure 1.2 Samsung Smart TV using hand gestures
Figure 1.3 Dynamic hand gestures used for the Samsung Smart TV
Figure 1.4 Hand gesture commands in the Soft Remote Control System [39]
Figure 1.5 General framework of the Soft Remote Control System [39]
Figure 1.6 Hand gesture-based home appliances system [143]
Figure 1.7 TV controlling with GUI of dynamic gesture recognition [151]
Figure 1.8 Commands of GUI of dynamic gesture recognition [103]
Figure 1.9 Features of the Omron dataset [3]
Figure 1.10 Wi-Fi signals to control home appliances using hand gestures [119]
Figure 1.11 Seven hand gestures for wireless-based interaction [9] (WiSee dataset)
Figure 1.12 Simulation of using MR to control some home appliances [62]
Figure 1.13 AirTouch-based control using depth cue [33]
Figure 1.14 Depth threshold cues and face skin [97]
Figure 1.15 Depth threshold and skeleton [60]
Figure 1.16 The process of detecting hand region [69]
Figure 1.17 Spotting dynamic hand gestures system using HMM model [71]
Figure 1.18 Threshold using HMM model for different gestures [71]
Figure 1.19 CRFs-based spotting method using threshold [142]
Figure 1.20 Designed gesture in the proposed method [13]
Figure 1.21 Two gesture boundaries are spotted [65]
Figure 1.22 Gesture recognition using HMM [42]
Figure 1.23 Gesture features extraction [8]
Figure 2.1 Periodic image sequences appear in many common actions
Figure 2.2 Four hand gestures of [83]
Figure 2.3 Cambridge hand gesture dataset of [67]
Figure 2.4 Five hand gestures of [82]
Figure 2.5 Twelve dynamic hand gestures of the MSRGesture3D dataset [1]
Figure 2.6 Dynamic hand gestures of [88]
Figure 2.7 Gestures of the NATOPS dataset [140]
Figure 2.8 Dynamic hand gestures of the SKIG dataset [76]
Figure 2.9 Gestures in the ChaLearn dataset
Figure 2.10 Dynamic hand gestures of [93]
Figure 2.11 Dynamic hand gestures of the NVIDIA dataset [87]
Figure 2.12 Dynamic hand gestures of the PowerGesture dataset [71]
Figure 2.13 Hand shape variations and hand trajectories (lower panel) of the proposed gesture set (5 gestures)
Figure 2.14 In each row, changes of the hand shape during a gesture performance. From left to right, hand shapes of the completed gesture change in a cyclical pattern (closed-opened-closed)
Figure 2.15 Comparing the similarity between the closed-form gestures and a simple sinusoidal signal
Figure 2.16 Closed cyclical hand gesture pattern and cyclic signal
Figure 2.17 The environment setup for the MICA1 dataset
Figure 2.18 The environment setup for the MICA2 dataset
Figure 2.19 The environment setup for the MICA3 dataset
Figure 2.20 The environment setup for the MICA4 dataset
Figure 3.1 Diagram of the proposed hand gesture spotting system
Figure 3.2 Diagram of the proposed hand detection and segmentation system
Figure 3.3 The Venn diagram representing the relationship between the pixel sets I, D, Bd, Hd, S and H*
Figure 3.4 Results of hand region detection
Figure 3.5 Result of the learning distance parameter. (a-c) Three consecutive frames; (d) Result of subtracting the first two frames; (e) Result of subtracting the next two frames; (f) Binary thresholding operator; (g) A range of hand (left) and of body (right) on the depth histogram
Figure 3.6 The training skin color model
Figure 3.7 Result of the training skin color model
Figure 3.8 Results of the hand segmentation. (a) A candidate of hand; (b) Mahalanobis distance; (c) Refining the segmentation results using RGB features
Figure 3.9 Catching buffer to store continuous hand frames
Figure 3.10 The area cues of the hand regions
Figure 3.11 The velocity cues of the hand regions
Figure 3.12 The combination of area and velocity signals of the hands
Figure 3.13 Finding local peaks from the original area signal of the hands
Figure 3.14 Log activities of an evaluator who follows stages of the user-guide scheme and performs seven hand postures for preparing the posture dataset
Figure 3.15 Seven types of postures recognized in the proposed system. (a) The first row: original images with results of the hand detections (in red boxes). (b) The second row: zoomed-in version of the hand regions without segmentation. (c) The third row: the corresponding segmented hand
Figure 3.16 Results of the kernel-based descriptors for hand posture recognition without/with segmentation
Figure 3.17 Performance of the dynamic gesture spotting on the two datasets MICA1 and MICA2
Figure 3.18 An illustration of the gesture spotting errors
Figure 4.1 The comparison framework of hand gesture recognition
Figure 4.2 Optical flow and trajectory of the go-right hand gesture
Figure 4.3 An illustration of the Go-left hand gesture before and after projecting into the constructed PCA space
Figure 4.4 3D manifold of hand postures belonging to five gesture classes
Figure 4.5 An illustration of the DTW results of two hand gestures (T, P). (a)-(b) Alignments between postures in T and P in the image space and the spatial-temporal space. (c)-(d) The refined alignments after removing repetitive ones
Figure 4.6 Distribution of dynamic hand gestures in the low-dimensional space
Figure 4.7 Five dynamic hand gestures in the 3D space
Figure 4.8 Defining a quasi-periodic image sequence
Figure 4.9 Illustrations of the phase variations
Figure 4.10 Defining a quasi-periodic image sequence in the phase domain
Figure 4.11 Manifold representation of the cyclical Next hand gesture
Figure 4.12 Phase synchronization
Figure 4.13 Whole-length sequence synchronized with the best difference phase
Figure 4.14 Whole-length sequence synchronized with the best similar phase
Figure 4.15 (a, c) Original hand gestures; (b, d) corresponding interpolated hand gestures
Figure 4.16 ROC curves of hand gesture recognition results with SVM classifier
Figure 4.17 The dynamic hand gesture recognition results with different kernel scales of the SVM
Figure 4.18 Comparison of combined characteristics (KLT and ISOMAP) of dynamic hand gestures
Figure 4.19 Performance comparisons with different techniques
Figure 4.20 Comparison results between the proposed method and others at thirteen positions
Figure 4.21 Dynamic hand gestures in the sub-NVIDIA dataset
Figure 4.22 Confusion matrices with the MSRGesture3D and sub-NVIDIA datasets
Figure 5.1 Illustration of light controlling using dynamic hand gestures with different levels of intensity of the lamp
Figure 5.2 Illustration of ten modes of fan controlled by dynamic hand gestures
Figure 5.3 The state diagram of the proposed lighting control system
Figure 5.4 The state diagram of the proposed fan control system
Figure 5.5 A schematic representation of basic components in the hand gesture-based control system
Figure 5.6 Integration of hand gesture recognition modules
Figure 5.7 The proposed framework for the training phase
Figure 5.8 The proposed flow chart for the online dynamic hand gesture recognition
Figure 5.9 The proposed flow chart for controlling the lamp
Figure 5.10 The proposed flow chart for controlling the fan
Figure 5.11 Setup for evaluating the control systems
Figure 5.12 Illustration of environment and material setup
Figure 5.13 The timeline of the proposed evaluation system
Figure 5.14 The time cost for the proposed dynamic hand gesture recognition system
Figure 5.15 Usability evaluation of the proposed system


INTRODUCTION
Motivation
Home-automation products have been widely used in smart homes (or smart spaces) thanks to recent advances in intelligent computing, smart devices, and new communication protocols. In terms of automation ability, most advanced technologies focus either on saving energy or on facilitating control via a user interface (e.g., remote controllers [92], mobile phones [7], tablets [52], voice recognition [11]). To maximize usability, a human-computer interaction method must allow end-users to perform the conventional operations easily and naturally. Motivated by such advantages, this thesis pursues a unified solution for deploying a complete hand gesture-based control system for home appliances: a natural and friendly way of interacting that replaces the conventional remote controller.
A complete gesture-based controlling application requires both robustness and low computational time. However, these requirements face many technical challenges, such as a high computational cost and the complexity of hand movements, and previous solutions typically focus on only one of the problems in this field. To solve these issues, two trends in the literature are investigated: one common trend relies on aiding devices, while the other focuses on improving the relevant algorithms/paradigms. The first group addresses the critical issues by using supportive devices such as a data glove [85, 75], hand markers [111], or contact sensors mounted on the hand or palm of end-users when they control home appliances. Obviously, these solutions are expensive or inconvenient for the end-users. For the second trend, hand gesture recognition has been widely attempted by researchers in the communities of computer vision, robotics, and automation control. However, how to achieve robustness and low computational time still remains an open question. The main motivation of this thesis is to pursue a set of "suggestive" hand gestures, arguing that the characteristics of the hand gestures themselves are important cues when deploying a complete hand gesture-based system.
On the other hand, recent low-cost depth sensors have been widely applied in the fields of robotics and automation control. These devices open new opportunities for addressing the critical issues of gesture recognition schemes. This work attempts to benefit from the Kinect sensor [2], which provides both RGB and depth features. Utilizing such valuable features offers an efficient and robust solution for addressing the challenges.



Objectives
The thesis aims to achieve a robust, real-time hand gesture recognition system. As a feasible solution, the proposed method should be natural and friendly for end-users. A real application is deployed for automatically controlling a fan and/or a lamp using hand gestures; these are common electrical home appliances. Without loss of generality, the proposed technique can be extended from this specific case to general home automation control systems. To this end, the concrete objectives are:
- Defining a unique set of dynamic hand gestures. This gesture set conveys commands that are available in common home electronic appliances such as televisions, fans, lamps, doors, air-conditioners, and so on. Moreover, the proposed gesture set is designed with unique characteristics. These characteristics are important cues and offer promising solutions to address the challenges of a dynamic hand gesture recognition system.
- Spotting dynamic hand gestures in real time from an input video stream. The proposed gesture spotting technique consists of relevant solutions for hand detection and hand segmentation from consecutive RGB-D images. In the view of a complete system, the spotting technique is considered a preprocessing procedure (a simple boundary-detection sketch based on the hand-area signal follows this list).
- The performance of a dynamic hand gesture recognition method depends on the gesture representation and matching phases. This work aims to extract and represent both spatial and temporal features of the gestures. Moreover, the thesis intends to match phases of the gallery and probe sequences using a phase synchronization scheme, which aims to handle variations in gesture speed and acquisition frame rate. In the experiments, the proposed method is evaluated with various positions, directions, and distances from the user to the Kinect sensor.
- A proposed framework to control home appliances (such as a lamp/fan) is deployed. A full hand gesture-based system is built in an indoor scenario (a smart-room). The prototypes of the proposed system for controlling fans and lamps are shown in Fig. 5.1 and Fig. 5.2, respectively. Usability evaluations on the proposed datasets and experimental evaluations are reported. The datasets are also shared with the community for further evaluations.
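As a hedged illustration of the spotting objective, the following minimal Python sketch marks the boundaries of one candidate gesture from the per-frame area of the segmented hand region. Because the proposed gestures follow a cyclical closed-opened-closed pattern (Chapter 2), the area signal rises and falls over a gesture; peak detection on a smoothed version of this signal, in the spirit of the area cues used in Chapter 3, is one simple way to delimit it. The function name, smoothing window, and prominence threshold are illustrative assumptions, not the thesis's tuned values.

    import numpy as np
    from scipy.signal import find_peaks

    def spot_gesture_frames(hand_area, smooth_win=5, prominence=0.1):
        # hand_area: per-frame area of the segmented hand region, normalized
        # to [0, 1]; a closed-opened-closed gesture produces a single
        # prominent bump in this signal.
        area = np.asarray(hand_area, dtype=float)
        # Moving-average smoothing suppresses segmentation jitter.
        kernel = np.ones(smooth_win) / smooth_win
        smooth = np.convolve(area, kernel, mode="same")
        peaks, _ = find_peaks(smooth, prominence=prominence)
        if len(peaks) == 0:
            return None  # no gesture-like bump in this buffer
        p = int(peaks[0])
        # Walk outwards from the peak to the surrounding valleys, which
        # delimit the start and end frames of the candidate gesture.
        start, end = p, p
        while start > 0 and smooth[start - 1] <= smooth[start]:
            start -= 1
        while end < len(smooth) - 1 and smooth[end + 1] <= smooth[end]:
            end += 1
        return start, p, end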

Context, constraints, and challenges
Figure 2 shows the context in which an end-user controls home electronic appliances in a living-room environment. Nowadays there are many methods to control home appliances (as illustrated in Fig. 1 (a)). The main difference from existing ones is that the proposed hand gesture recognition system aims to convey the commands of home appliance equipment naturally and conveniently, without requiring any remote control, as illustrated in Fig. 1 (b).

[Figure 1 (two panels): (a) smart home appliance equipment, including a fan, lamp, television, air-conditioner, and camera; (b) traditional controlling methods, including push buttons, a remote control, and a smart phone.]
Figure 1 Home appliances in a smart home

Figure 2 Controlling home appliances using dynamic hand gestures in a smart house.
The proposed system operates with a Kinect sensor. This device is mounted at a fixed position, both to obtain good system performance and to make end-users feel comfortable. To deploy a real application of home appliance controlling using dynamic hand gestures, the thesis imposes the following constraints on the study of dynamic hand gesture recognition:


- The Kinect sensor:
  – The Kinect sensor is immobile while end-users perform interactions.
  – The Kinect sensor captures RGB and depth images at a normal frame rate (from 10 to 30 fps) with an image resolution of 640×480 pixels for both image types.
  – The visible area is the region in front of the Kinect sensor in which every object can be viewed; it is limited not only by the distance from the objects to the camera (from 0.8 m to 4 m) but also by an angle of 30° around the central axis of the Kinect sensor (see the geometric sketch after this list).
- Furniture and other objects are distributed uniformly in a square room.
- At any instant, it is assumed that only one end-user controls a home appliance using dynamic hand gestures of his/her right hand. If there is more than one subject in the room, the person nearest to the Kinect sensor is considered.
- When an end-user wants to control an electronic appliance, he/she should stand in the visible area in front of the Kinect sensor, raise one hand toward the sensor, and perform the previously designed gestures.
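To make the visible-area and nearest-user constraints concrete, the following minimal Python sketch (a hypothetical helper, not part of the thesis software) checks the 0.8-4 m distance range and the 30° half-angle around the central axis for a 3D point in the sensor frame, and keeps the nearest of several candidate subjects:

    import numpy as np

    def in_visible_area(point_xyz, min_d=0.8, max_d=4.0, half_angle_deg=30.0):
        # Distance constraint: the object must lie between 0.8 m and 4 m
        # from the Kinect sensor.
        dist = float(np.linalg.norm(point_xyz))
        if not (min_d <= dist <= max_d):
            return False
        # Angular constraint: the direction to the point must stay within
        # 30 degrees of the sensor's central (z) axis.
        angle = np.degrees(np.arccos(point_xyz[2] / dist))
        return angle <= half_angle_deg

    def nearest_user(user_points):
        # Only one end-user interacts at a time; with several subjects in
        # the room, the person nearest to the sensor is retained.
        visible = [p for p in user_points if in_visible_area(p)]
        return min(visible, key=np.linalg.norm) if visible else None

    # Two subjects at about 2.0 m and 3.5 m: the nearer one is selected.
    users = [np.array([0.3, 0.0, 2.0]), np.array([0.5, 0.1, 3.5])]
    print(nearest_user(users))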
The above-mentioned scenarios and constraints are intended to cope with the following issues:
- Changing illumination: the natural lighting source changes within a day (morning, noon, afternoon), and the artificial lighting sources in a smart-room environment may change as well.
- Complex background: in a practical environment, the background of the scene is complex, with many types of furniture. The background could contain objects whose color appearance is similar to human skin color; moreover, some objects may appear at the same distance from the Kinect sensor as the hand. Therefore, the task of clearly detecting and segmenting the hand from the background is challenging (a minimal depth-plus-skin-color pruning sketch follows this list).
- Computational time: this consists of the cost of training the end-users and of processing the relevant procedures of a complete system. The proposed gesture-based application requires real-time performance, so it is worth studying and proposing reasonable solutions to address this issue.
- Representing dynamic hand gestures: the gestures consist of non-rigid hand shapes in continuous image sequences. Therefore, to obtain good recognition performance, the gesture representation should adapt to the variation of hand shape along the temporal dimension.
- Variations of gestures: the end-users (subjects) perform dynamic hand gestures with artifacts such as different speeds, changing capture frame rates, and various lengths of hand trajectories. Therefore, the proposed dynamic hand gesture system must be designed to adapt to such variations; this thesis mainly addresses these issues with a new phase synchronization technique.
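To make the complex-background issue concrete, the sketch below intersects a depth cue with a skin-color cue to prune hand candidate pixels, so that skin-like background objects at other depths and near objects of other colors are both rejected. The HSV bounds and the 0.8-1.5 m hand depth range are illustrative assumptions only; the actual scheme of this thesis (Chapter 3) learns such parameters with a user-guide scheme.

    import cv2
    import numpy as np

    def hand_candidate_mask(bgr, depth_mm,
                            depth_range=(800, 1500),
                            hsv_lo=(0, 40, 60), hsv_hi=(25, 180, 255)):
        # Depth cue: keep pixels inside the assumed hand-to-sensor range
        # (in millimeters).
        near = cv2.inRange(depth_mm, depth_range[0], depth_range[1])
        # Color cue: keep pixels inside an assumed skin-tone range in HSV.
        hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
        skin = cv2.inRange(hsv, hsv_lo, hsv_hi)
        # A hand candidate must satisfy both cues.
        mask = cv2.bitwise_and(near, skin)
        # Morphological opening removes small speckles from the mask.
        return cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))

    # Usage with one 640x480 RGB-D pair from the sensor driver:
    # mask = hand_candidate_mask(color_bgr, depth_uint16)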

Contributions
Throughout the thesis, the main objectives are addressed by a unified solution. The thesis makes the following contributions:
- Contribution 1: Designing a dynamic hand gesture set that conveys the commands of home electronic appliances. The proposed gestures are suitable for deploying gesture-based systems in smart-room environments, and the set has specific characteristics that are useful and supportive for deploying a robust hand gesture recognition system. A number of datasets are captured with a large number of end-users. These datasets contain both RGB and depth images and are published for the dynamic hand gesture research community; in addition, they are used to evaluate the performance of the proposed algorithms.
- Contribution 2: An efficient user-guide scheme is proposed to learn heuristic parameters, with a trade-off between a real-time system and a user-independent system. This scheme helps to obtain both real-time hand detection and good hand segmentation performance. An efficient gesture spotting method is then proposed that utilizes the features extracted from the continuously segmented hand regions.
- Contribution 3: Proposing an efficient representation for dynamic hand gestures that combines spatial and temporal features. The spatial features are extracted from the most significant dimensions of a nonlinear reduced space (the ISOMAP technique), while the trajectories of hand movements are extracted using the KLT technique. This representation is especially helpful for discriminating the different types of gestures. In addition, to resolve the gesture-variation issues, a new phase synchronization is proposed: using an interpolation method in the spatial-temporal space, a new sequence with a pre-determined length is created (a minimal resampling sketch of this idea is given after this list).
- Contribution 4: A complete system is deployed to control a light and a fan in a smart-room environment. The system utilizes the proposed algorithms and achieves both high accuracy and real-time performance. In addition, it has been evaluated by a large number of end-users in different contexts, such as Techmart exhibitions (Sept. 2015 and 2016) and technical demonstration sessions (celebration of HUST's 60th anniversary, Oct. 2016).
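As a hedged illustration of the phase synchronization in Contribution 3, the sketch below linearly resamples a variable-length sequence of feature vectors onto a common phase grid of fixed length, so that gestures performed at different speeds or captured at different frame rates become directly comparable. The function name, the 32-frame target length, and the per-dimension linear interpolation are assumptions for illustration; the thesis's actual method interpolates hand postures in the reduced spatial-temporal space (Chapter 4).

    import numpy as np

    def synchronize_length(features, target_len=32):
        # features: one gesture as an (n_frames, n_dims) array of
        # spatial-temporal feature vectors.
        features = np.asarray(features, dtype=float)
        src_phase = np.linspace(0.0, 1.0, features.shape[0])  # observed phase
        dst_phase = np.linspace(0.0, 1.0, target_len)         # common grid
        # Interpolate each feature dimension independently over the phase.
        return np.stack([np.interp(dst_phase, src_phase, features[:, d])
                         for d in range(features.shape[1])], axis=1)

    # A slow (40-frame) and a fast (18-frame) instance of the same gesture
    # are both mapped to a 32-frame representation before classification.
    slow, fast = np.random.rand(40, 3), np.random.rand(18, 3)
    print(synchronize_length(slow).shape, synchronize_length(fast).shape)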

General framework and thesis outline

Setup
Hand gestures DB

5 types of
controls

Defining dataset
Hand detection &
segmentation

Spotting dynamic
hand gesture

Spotted dynamic
hand gesture


Real-time dynamic hand gesture spotting
Dynamic hand gesture
representation

Phase synchronization

Hand gesture
classifer

Robust dynamic hand gesture recognition
Control home appliances
(natural way & real environment)
Application system

Figure 3 The proposed frame-work of the dynamic hand gesture recognition for controlling home appliances.
This thesis proposes a unified solution for dynamic hand gesture recognition. The proposed framework consists of three main phases, as illustrated in Fig. 3: (1) hand detection and segmentation from a video stream; (2) spotting dynamic hand gestures; and (3) the recognition scheme. Utilizing this framework, a real application is also deployed and evaluated in different contexts, such as lab-based environments and demonstrations at exhibitions and Tech-mart events. The research work in this thesis is divided into five chapters as follows:
- Introduction: This chapter describes the main motivations and objectives of the study. It also presents the research context, constraints, and challenges, which arise when addressing the relevant problems in the thesis. Additionally, the general proposed framework and the main contributions are presented.




- Chapter 1: This chapter mainly surveys existing complete hand gesture-based control systems. In particular, the related techniques for home appliance electronic equipment are discussed, and a series of relevant techniques, consisting of hand detection, segmentation, and recognition, is surveyed.
- Chapter 2: In this chapter, existing datasets of dynamic hand gestures are first described. Then, the common commands of home appliances are examined. Based on these studies, a new set of hand gestures is proposed, consisting of gestures with cyclical hand patterns. Datasets of the proposed gestures are collected in different settings, such as exhibitions and lab-based environments, for further work.
- Chapter 3: This chapter proposes a scheme to learn heuristic parameters for hand detection and segmentation. Utilizing the results of the learning scheme, hand detection and segmentation achieve not only robust, real-time operation but also good performance. Given the segmented hands of the continuous sequence, a method for spotting gestures is also presented.
- Chapter 4: This chapter describes the proposed algorithms and experimental evaluations for the dynamic hand gesture recognition system. An efficient representation of the hand gestures based on spatial-temporal features is proposed. To solve the critical issues of gesture variations, a phase synchronization that enhances the system's performance is presented. The proposed algorithms are evaluated on several datasets (both collected and public).
- Chapter 5: By utilizing the proposed framework, a complete system is deployed to control lamps/bulbs and fans in indoor environments. A number of volunteer end-users are invited to interact with the proposed system. The computational costs and the end-users' feedback are reported. The application shows the feasibility of deploying the proposed method in a real setting.

- Conclusion and Future Works: Conclusions of the work and relevant discussions on the limitations of the proposed method are given in this chapter. Further research directions are proposed for future work.


