
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

TRAN NGUYEN LE

HAND GESTURE RECOGNITION FOR
INTELLIGENT PRESENTATION

MASTER THESIS OF COMPUTER SCIENCE

Hanoi - 2015


VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

TRAN NGUYEN LE

HAND GESTURE RECOGNITION FOR
INTELLIGENT PRESENTATION

Major: Computer Science
Code: 60480101

MASTER THESIS OF COMPUTER SCIENCE
SUPERVISOR: Dr. Le Thanh Ha

Hanoi - 2015




AUTHORSHIP
“I hereby declare that the work contained in this thesis is my own and has not been
previously submitted for a degree or diploma at this or any other higher education
institution. To the best of my knowledge and belief, the thesis contains no material
previously published or written by another person except where due reference or
acknowledgement is made.”
Signature:………………………………………………


SUPERVISOR’S APPROVAL
“I hereby approve that the thesis in its current form is ready for committee
examination as a requirement for the Master of Computer Science degree at the
University of Engineering and Technology.”
Signature:………………………………………………


ACKNOWLEDGEMENT
I would like to express my sincere gratitude to my advisor, Dr. Le Thanh Ha,
University of Engineering and Technology, Vietnam National University, Hanoi, for
his enthusiastic guidance, warm encouragement and useful research experience. I am
also grateful to all the teachers of the University of Engineering and Technology, VNU,
for the extremely valuable knowledge they shared with me during my master's course.
I would like to thank all my friends and lab mates in the Human Machine
Interaction Laboratory for their helpful discussions about my research topic. I
sincerely acknowledge the 2012 basic research project in natural science of the
National Foundation for Science & Technology Development (Nafosted), Vietnam
(102.01-2012.36, Coding and communication of multiview video plus depth for 3D
Television Systems) for financially supporting my master's study. Last, but not least, my
family is the biggest motivation behind me. I would like to thank my parents and
my brother for supporting me spiritually throughout the writing of this thesis. I would
like to send them my gratefulness and love.
Hanoi, October 20th, 2015
Tran Nguyen Le



HAND GESTURE RECOGNITION FOR
INTELLIGENT PRESENTATION
Tran Nguyen Le
Computer Science
Abstract:
This research presents a contour-based hand gesture recognition solution for
presentation control using depth image data. In this work, a motion-based algorithm is
used to detect and track the human hand. Then, the hand contours are extracted and
described by illumination-, rotation- and scale-invariant feature vectors. After that,
logistic regression and multilayer perceptron classifiers are employed for hand posture
and dynamic hand gesture recognition, respectively. Finally, in the application of
presentation control, two recognized gestures are used to move forward to the next
slide or backward to the preceding slide. The experimental results demonstrate the high
recognition accuracy and efficiency of our approach, and our prototype application can
control PowerPoint slides in real time.
Keywords: hand gesture, recognition, intelligent presentation, depth image.



Table of Contents

Table of Contents ............................................................................................. vii
Abbreviations .....................................................................................................ix
List of Figures .....................................................................................................x
List of Tables ......................................................................................................xi
Chapter 1 INTRODUCTION ............................................................................1
1.1. Motivation ..................................................................................................1
1.2. Objectives ..................................................................................................2
1.3. Methodology ..............................................................................................2
1.4. Thesis’s outline ..........................................................................................2
Chapter 2 RELATED WORK ..........................................................................3
2.1. Infrared laser tracking devices for presentation ........................................3
2.2. Distance transform based hand gesture recognition ..................................4
2.3. Body tracking-based hand gesture recognition using Microsoft Kinect ...5
Chapter 3 HAND GESTURE RECOGNITION FOR INTELLIGENT
PRESENTATION ...............................................................................................8
3.1. Image sequence preprocess........................................................................9
3.1.1. Motion extraction from depth images .................................................9
3.1.2. Noise reduction .................................................................................10
3.1.3. Initial hand detection .........................................................11
3.2. Hand localization .....................................................................................15
3.2.1. Hand tracking ....................................................................................15
3.2.2. Hand region segmentation ................................................................ 16
3.2.2.1. Updating the hand point position ...............................................16
3.2.2.2. Using depth threshold from the depth value of the hand point ..17
3.2.2.3. Using blob detection to detect hand region from others ............17
3.2.2.4. Reducing noise from hand area using hand point position ........18
3.2.3. Hand contour extraction ....................................................................18
3.3. Hand gesture recognition .........................................................................20
3.3.1. Sample gesture definition .................................................................20
3.3.2. Feature vector selection ....................................................................22

3.3.2.1. Hand posture ..............................................................................22
3.3.2.2. Dynamic hand gesture ................................................................ 24
3.3.3. Training and classifying ....................................................................25
3.3.3.1. Hand posture ..............................................................................25
3.3.3.2. Dynamic hand gesture ................................................................ 26
3.4. Presentation controller .............................................................................29


3.4.1. System requirements .........................................................................29
3.4.2. Workflow of controlling presentation ...............................................29
3.4.3. Presentation controller interface .......................................................30
Chapter 4 EXPERIMENTAL RESULTS ......................................................32
4.1. Data collection .........................................................................................32
4.1.1. Hand posture database ......................................................................32
4.1.2. Dynamic hand gesture database ........................................................32
4.2. Test-bed system and results .....................................................................32
4.2.1. Accuracy of hand posture recognition ..............................................32
4.2.2. Accuracy of dynamic hand gesture recognition ...............................33
4.2.3. Presentation controller performance .................................................34
Chapter 5 CONCLUSION ...............................................................................35
References .........................................................................................................36



Abbreviations
TV      Television
RGB     Red Green Blue
SDK     Software Development Kit
PC      Personal Computer



List of Figures
Figure 2.1: Tracked skeleton joints of the user’s body [9] ..............................................5
Figure 3.1: Abstract layered view of proposed system ...................................................8
Figure 3.2: The process of generating the motion image ..............................................10
Figure 3.3: (a) The opening operation, (b) The erosion operation, (c) The dilation
operation ........................................................................................................................11
Figure 3.4: (a) The original motion image, (b) The reduced noise motion image ........11
Figure 3.5: Motion clustering with hand size: (a) Before applying the threshold of hand
size (b) After applying the threshold of hand size .........................................................12
Figure 3.6: Motion history image and motion template procedure. Motion history at
time (a) t, (b) t+1, (c) t+2, (d) Depth motion history image ..........................................13
Figure 3.7: The direction of cluster ...............................................................................14
Figure 3.8: Result of the initial hand detection .............................................................15
Figure 3.9: Hand tracking using Kalman filter ..............................................................16
Figure 3.10: The result of hand region extraction using blob detection: (a) Detected
blobs (b) Extracted blob including hand point ..............................................................17

Figure 3.11: The result of hand segmentation ...............................................................18
Figure 3.12: Binary image including hand area ............................................................18
Figure 3.13: Contour tracing using Moore-Neighbor tracing algorithm .......................20
Figure 3.14: Hand contour extraction using Moore-Neighbor tracing algorithm .........20
Figure 3.15: Hand postures definition ...........................................................................21
Figure 3.16: Dynamic hand gesture definition ..............................................................22
Figure 3.17: Computation of angle relation ..................................................................25
Figure 3.18: Workflow of gesture recognition process .................................................28
Figure 3.19: Workflow of controlling presentation.......................................................30
Figure 3.20: Presentation controller interface ...............................................................30
Figure 4.1: Result with Logistic Regression classifier ..................................................33



List of Tables
Table 4.1: The result of classifying next/previous and grasp/release gesture ...............33
Table 4.2: The result of classifying next and previous gesture .....................................34
Table 4.3: The result of classifying grasp and release gesture ......................................34



Chapter 1
INTRODUCTION
1.1. Motivation
With present-day technology, slideshow presentation applications such as PowerPoint
are becoming more popular and play an important role in many areas, especially
business and education. However, among the various ways of controlling a presentation,
the most common and widely used tools are still the standard mouse and keyboard,
which can be inconvenient for presenters during their speech. For example, when the
projection plane is far away from the computer, presenters have to walk back and forth
a long distance between the computer and the screen whenever they want to point at
something on the slide, which causes many interruptions to their presentation. On the
other hand, staying close to the computer reduces body language and eye contact with
the listeners. Another popular presentation tool that is emerging nowadays is the laser
pointer. Nevertheless, the laser dot is hard for the audience to follow because of its
fast movement and unpredictable trajectory. As technology progresses, more
and more natural and easy-to-use presentation techniques are being developed to
overcome the above disadvantages and deliver a good experience to presenters as well as
listeners.
To address this demand, one of the most actively studied research topics at present is hand
gesture recognition. In recent years, hand gesture recognition systems have
gained great attention because of their ability to support effective interaction with
computers. The use of gestures makes the interaction between human and computer easy,
convenient and interesting. As evidence, hand gestures are used today to control various
applications such as robots, smart TVs, games, etc. Along with the strong
development of such systems, more and more new devices in the area of gesture
recognition are becoming popular and successful. One of them is an input device for
motion sensing developed by Microsoft, namely the Kinect sensor [1]. This sensor allows
users to control and interact with an application using real gestures. Its low
price, its ability to work with conventional computer hardware and the existence of
developer tools for Kinect application development have made the Kinect popular
compared to other existing sensors for motion tracking. Therefore, this thesis puts forth
the idea of controlling slides during a presentation with a hand gesture recognition
system using the Kinect.




1.2. Objectives
The main objective of this thesis is to propose an architectural design of an intelligent
presentation system using a contour-based hand gesture recognition method [2]. Our
system contains four major components: image sequence preprocessing, hand
localization, hand gesture recognition and the presentation controller. Unlike other
hand gesture recognition systems based on visible-color methods [3,4], which are highly
affected by illumination conditions, the proposed system uses depth image data captured
from the Kinect sensor, so it should be able to work in the low-light environments that
are common during presentations. In addition, it must ensure the accuracy
and real-time performance of the hand gesture recognition method.
1.3. Methodology
In this research, the first component of the proposed system detects the initial hand with
a motion-based algorithm. Then, after detecting and tracking the hand region, the hand
localization unit extracts hand contours and describes them with illumination-, rotation-
and scale-invariant feature vectors. In the third major component, logistic regression and
multilayer perceptron classifiers are employed for hand posture and dynamic hand
gesture recognition, respectively. Finally, in the presentation controller module, the
recognized hand gestures are translated into commands to move forward or backward
one slide.
1.4. Thesis’s outline
The remainder of this thesis is organized as follows. Chapter 2 describes related
hand gesture recognition methods and existing systems for intelligent presentation.
Chapter 3 then presents the proposed hand localization and hand gesture recognition
methods as well as the way PowerPoint presentations are controlled. Chapter 4 shows the
experimental results of our prototype application. Finally, Chapter 5 concludes the
thesis.




Chapter 2
RELATED WORK
There are many existing hand gesture recognition solutions for presentation control.
Gesture-controlled solutions for presentation control are usually based on motion-sensing
devices such as cameras, data gloves, infrared sensors and other similar devices.
Some of these solutions are described below.
2.1. Infrared laser tracking devices for presentation
In [5], a system for large-display interaction using an infrared laser tracking device is
presented. The authors address the challenge of natural interaction by hiding the
cursor and the laser point, not requiring clicking, and using hotspots and gestures.
Hotspots are areas around objects which are highlighted with a colored background
when the pointer enters them. This provides a mechanism for objects to be selected
without clicking. Technically, to select an object, the user moves the laser pointer
towards the object. When the pointer moves inside the object, the system detects the
crossing of the boundary by the laser beam. The system then reacts to this crossing
and highlights the object, while the laser pointer stops at the center of the object. This
is ideal because people tend to point towards the center of an object rather than its
edges. When the user points away from the object, it reverts to its original appearance.
Gestures are natural movements of the hand (as indicated by the path traced by the
pointer) which the system recognizes, allowing an action to be performed. Such
gestures can be found and used successfully in modern web browsers such as Mozilla
and Opera. The idea here is to use gestures to select objects by circling around the
object, or to navigate information, for instance moving forward with a left-to-right
sweeping gesture and backward with a right-to-left gesture.
On the other hand, two noteworthy limitations have been identified. These
ideas were demonstrated in an add-in module for Microsoft PowerPoint using the
NaturalPoint™ Smart-Nav™ tracking device. Smart-Nav is designed for use by
individuals at a distance of less than approximately 2 meters and has a low resolution of
256 x 256 pixels. This is possibly sufficient for an application on a large display at a
distance of around 3 meters. The more significant issue is that the camera has
difficulty tracking small objects smoothly. In some cases, the camera may lose
tracking altogether, often caused by bursts of frames between periods of inactivity.


2.2. Distance transform based hand gesture recognition
In [6], Ram Rajesh J et al. suggest two techniques to control the slides of a PowerPoint
presentation in a device-free manner, without any markers or gloves. Using the bare
hand, the gesture is given as input to a webcam connected to the PC. Then,
using an algorithm that counts the number of active fingers, the gesture is
recognized and the slideshow is controlled. The number of active fingers is determined
using two methods, namely circular profiling and distance transform.
• Circular profiling
The finger count is determined as in [7]. First, the centroid of the segmented binary image
of the hand is computed. After that, the length of the longest active finger is estimated by
drawing the bounding box of the hand. A circle is then drawn with the computed centroid
as its center and a radius equal to 0.7 times the length of the longest finger [8], so that it
intersects the active fingers of the hand. If a finger is active, it crosses the circle. The pixel
values sampled along the circle are then used to count the number of transitions from
white to dark regions; this number gives the number of active fingers, from which the
gesture can be resolved. If a value less than 0.7 is used, the circle encloses only the palm
region; if a value greater than 0.7 is used, the circle does not intersect the thumb.
However, the disadvantage of this strategy is that the hand has to be properly
positioned with respect to the webcam so that the whole hand region is captured and the
circle can be drawn. If the hand is not positioned correctly, the gesture is not recognized
properly. Gestures in this technique involve only one hand, which reduces the number of
gestures that could otherwise be made using both hands. Additionally, the response time
is quite high.
• Distance transform
The distance transform gives the Euclidean distance of each pixel from the
nearest boundary pixel. The distance from the boundary to a pixel in the hand area
increases as the pixel moves away from the boundary. Using this distance value, the
centroid of the palm area can be computed. The number of fingers describing the
gesture is found by drawing a line along the major axis of each segmented finger area;
the number of lines drawn equals the number of active fingers. This value is used to
control the slides of the PowerPoint presentation. A rough code sketch of both methods
is given at the end of this section.



The efficiency of the distance transform method diminishes when the hand is far
from the focus of the camera. Improper gestures, or gestures made quickly without a
pause, also reduce the level of accuracy. The effectiveness decreases if the background
contains components such as wall hangings or furniture whose color is similar to skin
color. Problems also occur if the fingers are not stretched out properly while making a
gesture.
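To make the two techniques above concrete, the following sketch shows one possible implementation on a binary hand mask using OpenCV and NumPy. The bounding-box estimate of the finger length, the sampling resolution and the chosen thresholds are illustrative assumptions rather than the exact procedure of [6-8].

import cv2
import numpy as np

def count_fingers_circular(hand_mask):
    """Circular profiling: count white-to-dark transitions along a circle of
    radius 0.7 x (longest finger length) centered at the hand centroid."""
    m = cv2.moments(hand_mask, binaryImage=True)
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
    ys, xs = np.nonzero(hand_mask)
    # Crude bounding-box proxy for the longest finger length (an assumption).
    finger_len = 0.5 * max(xs.max() - xs.min(), ys.max() - ys.min())
    radius = 0.7 * finger_len
    angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
    px = np.clip((cx + radius * np.cos(angles)).astype(int), 0, hand_mask.shape[1] - 1)
    py = np.clip((cy + radius * np.sin(angles)).astype(int), 0, hand_mask.shape[0] - 1)
    on_hand = hand_mask[py, px] > 0
    # Each white-to-dark transition along the circle corresponds to one active finger.
    return int(np.count_nonzero(on_hand[:-1] & ~on_hand[1:]))

def palm_center_distance_transform(hand_mask):
    """Distance transform: the palm center is the hand pixel farthest from the boundary."""
    dist = cv2.distanceTransform(hand_mask, cv2.DIST_L2, 5)
    _, palm_radius, _, center = cv2.minMaxLoc(dist)
    return center, palm_radius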
2.3. Body tracking-based hand gesture recognition using Microsoft Kinect
Many researchers use the Microsoft Kinect device to capture both RGB and depth
data. In [9], the authors developed algorithms that identify humans in a scene, perform
full body tracking, and predict a person's skeletal structure in real time.

Figure 2.1: Tracked skeleton joints of the user’s body [9]

Using the Microsoft Kinect SDK, three streams of information can be obtained: the RGB,
depth and skeleton data streams. The RGB data stream gives the color information for each
pixel, while the depth data stream gives the distance between each pixel and the
sensor. The skeleton data stream gives the positions of various skeletal joints of
the users that are in the range of the sensor. The tracked skeleton joints of the user’s
body are shown in Figure 2.1. The skeleton data stream is derived by processing the depth
stream data. For gesture detection, the authors used the skeleton data stream; since pixel
color is not required, the RGB data stream was not used.
The characteristics of the swipe left gesture that the authors observed are:
• The x-axis coordinate values decrease as the gesture is executed;
• The y-axis coordinate values remain nearly equal as the gesture is executed;
• The length of the line formed as the sum of the distances between the points of the
gesture has to exceed some previously defined value;
• The time spent between the first and the last tracked point of the gesture has to
be within the previously defined allowed range;
• The characteristics of the swipe right gesture are the same, except for the first
one, where the x-axis coordinate values increase (not decrease) as the
gesture is executed.
The parameters that the authors introduced are:
• Xmax: the maximal threshold on the x-axis distance between two consecutive hand
joint samples, expressed in meters, for a recognized gesture;
• Ymax: the maximal threshold on the y-axis distance between hand joint samples,
expressed in meters, for a recognized gesture;
• Lmin: the minimal length of a recognized swipe gesture, expressed in meters;
• Tmin: the minimal duration of a recognized swipe gesture, expressed in
milliseconds;
• Tmax: the maximal duration of a recognized swipe gesture, expressed in
milliseconds.
To detect a swipe gesture, the successive skeleton hand joint data must be checked, and
when the data satisfy all of the parameters, a gesture is detected.
To keep the skeleton hand data, two queues are used: one for the right skeleton
hand joint data (left swipe gesture), and the other for the left hand joint data (right swipe
gesture). The maximal number of elements in each queue is 38 (the maximal
number of successive joint samples in a gesture). When new skeleton data arrives, the
hand joint data is added to the queues.
The last two skeleton joint entries are checked against the parameters Xmax and
Ymax for both gestures. If these parameters are not fulfilled, the data in the corresponding
queue is erased. If they are satisfied, the other three parameters, Lmin, Tmin and Tmax,
are additionally checked. If they are also satisfied, a swipe gesture is detected.
When a gesture is identified, a corresponding keyboard key press is simulated:
the left swipe gesture presses the left arrow key, and the right swipe gesture presses the
right arrow key.
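The following sketch illustrates this queue-based swipe check for the left swipe, assuming hand-joint samples arrive as (x, y) coordinates in meters together with a timestamp in milliseconds. The threshold values are placeholders, not the values used in [9], and the keyboard simulation is omitted.

from collections import deque

class SwipeDetector:
    """Detects a left swipe from successive hand joint samples (x, y in meters, t in ms)."""

    def __init__(self, x_max=0.08, y_max=0.05, l_min=0.25, t_min=200, t_max=1500, max_len=38):
        # Threshold values here are illustrative placeholders.
        self.x_max, self.y_max = x_max, y_max
        self.l_min, self.t_min, self.t_max = l_min, t_min, t_max
        self.points = deque(maxlen=max_len)    # at most 38 successive joint samples

    def update(self, x, y, t_ms):
        if self.points:
            px, py, _ = self.points[-1]
            # Left swipe: x must decrease by at most x_max per step, y must stay nearly constant.
            if not (0.0 <= px - x <= self.x_max and abs(y - py) <= self.y_max):
                self.points.clear()            # constraints violated: restart the gesture
        self.points.append((x, y, t_ms))
        return self._completed()

    def _completed(self):
        if len(self.points) < 2:
            return False
        pts = list(self.points)
        length = sum(((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
                     for (x1, y1, _), (x2, y2, _) in zip(pts, pts[1:]))
        duration = pts[-1][2] - pts[0][2]
        if length >= self.l_min and self.t_min <= duration <= self.t_max:
            self.points.clear()                # report once, then reset
            return True
        return False

A right-swipe detector is symmetric, with the x-coordinate required to increase rather than decrease.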
This method is effective and robust for detecting the location of the hand only when
the prediction of a person's skeletal structure is good; its accuracy depends heavily on
human body posture. Hence, our project uses a hand detection method that does not rely
on human skeletal information in order to improve the performance of the recognition
system.




Chapter 3
HAND GESTURE RECOGNITION FOR INTELLIGENT PRESENTATION
This chapter describes our proposed hand gesture recognition system for
presentation control. We define an abstract layered view of the intelligent presentation
system, as illustrated in Figure 3.1.


Figure 3.1: Abstract layered view of proposed system

Each layer of the proposed system represents an integral element. The bottom layer
shows the Kinect hardware device that captures the visual scene together with depth data.
The Kinect depth sensing unit consists of an infrared emitter and a depth sensor. The
infrared emitter looks like a camera from the outside, but it is an infrared projector that
emits infrared light in a “pseudo-random dot” pattern over everything in front of it. These
dots are normally invisible to us, but their depth information can be captured using the
infrared depth sensor. The dotted light reflects off different objects, and the depth sensor
reads the reflected dots and converts them into depth information by measuring the
distance between the sensor and the object from which each infrared dot was reflected.
The middle layer includes three major modules that process the depth image
sequences from the Kinect sensor and automatically recognize the hand gestures. The top
layer represents the module that implements natural interaction for presentation
control.
3.1. Image sequence preprocess
3.1.1. Motion extraction from depth images
After receiving depth image data from the Kinect sensor, the image sequence preprocessing
module extracts a motion image, which is later used to detect the hand point position.
The Kinect sensor captures approximately 30 depth frames per second. However, in
our method, we only use 5 consecutive frames each time to create a motion image. The
process of generating the motion image is shown in Figure 3.2. First, a difference
image is obtained by subtracting the previous frame i_{t-1} from the current frame i_t
as follows:

Diff_image_t = i_t - i_{t-1}                                        (3.1)

Then we apply a threshold to each difference image to generate a binary difference
image. Finally, the accumulation of these binary difference images is the motion
image. In the accumulated image, all movements of the human body, hands, objects and
noise are represented.



Figure 3.2: The process of generating the motion image
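A minimal sketch of the motion-image generation described above, assuming the five consecutive depth frames are available as NumPy arrays; the binarization threshold is an illustrative value.

import numpy as np

def motion_image(depth_frames, diff_thresh=10):
    """Accumulate binary frame differences over consecutive depth frames (Eq. 3.1)."""
    motion = np.zeros(depth_frames[0].shape, dtype=np.uint8)
    for prev, curr in zip(depth_frames[:-1], depth_frames[1:]):
        # Diff_image_t = i_t - i_{t-1}, taken here as an absolute difference.
        diff = np.abs(curr.astype(np.int32) - prev.astype(np.int32))
        binary = (diff > diff_thresh).astype(np.uint8) * 255
        motion |= binary                       # accumulate the binary difference images
    return motion

For the 5-frame case described above, depth_frames would hold the current frame and the four preceding ones.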

3.1.2. Noise reduction
Before detecting the hand point position from the motion image, we first need to remove
noise from it in order to increase the accuracy of the method. Spatial filtering
and morphological processing are used for noise reduction. We use a 5x5-aperture
median filter for spatial filtering; the median filter replaces each pixel value with the
median value of the sub-image under the aperture [10]. The median filter is good at
removing salt-and-pepper noise, which makes it very effective in this case because the
noise pattern of the motion image is very similar to salt-and-pepper noise.
The morphological processing consists of three operations: opening, erosion
and dilation [10]. The opening operation erodes an object and then dilates it again.
Generally, this operation smooths the outer contour, splits narrow connections and
removes thin surrounding areas; thus, it removes noise while keeping the original shape
smooth. The erosion and dilation operations are opposites: erosion removes irrelevant
pixels and eliminates small noise components from the image, while dilation returns the
eroded objects to their original size and can make objects in the image bigger than before.
These operations are highly effective for depth image noise reduction. Figure 3.3 shows
the opening, erosion and dilation operators in a simple way. Figure 3.3.a presents the
opening of the dark-blue square by a disk, resulting in the light-blue square with rounded
corners. Figure 3.3.b presents the erosion of the dark-blue square by a disk, resulting in
the light-blue square. Figure 3.3.c presents the dilation of the dark-blue square by a disk,
resulting in the light-blue square with rounded corners.

Figure 3.3: (a) The opening operation, (b) The erosion operation, (c) The dilation
operation

The original motion image and the result of the noise removal methods of spatial
filtering and the morphological processing on the motion image are shown in Figure
3.4.a and Figure 3.4.b respectively.

Figure 3.4: (a) The original motion image, (b) The reduced noise motion image
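A minimal OpenCV sketch of this noise-reduction step; the elliptical 5x5 structuring element and the extra erode/dilate pass are assumptions about details the text leaves open.

import cv2

def reduce_noise(motion_img):
    """Clean up the binary motion image with a median filter and morphology."""
    filtered = cv2.medianBlur(motion_img, 5)                 # 5x5 median filter
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    # Opening (erosion followed by dilation) removes small noise blobs and smooths contours.
    opened = cv2.morphologyEx(filtered, cv2.MORPH_OPEN, kernel)
    # An extra erode/dilate pair suppresses remaining specks and restores object size.
    cleaned = cv2.dilate(cv2.erode(opened, kernel), kernel)
    return cleaned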

3.1.3. Initial hand detection
In this section, motion regions are clustered to detect the hand region position. First,
the connected components are selected from the motion image and then clustered.
These clusters can be either real motion or noise, and one of them is the hand. The
noise clusters are usually small, so if the size of a cluster is smaller than a threshold, it is
identified as noise and removed.
To decide the size threshold, the polynomial regression method is applied [11].
First, a variety of hand size measurements are obtained over a large range of distances.
Then, the polynomial method is employed to fit a curve to the dataset. The fitted curve is
evaluated to choose the threshold of the hand size. Figure 3.5.a shows the result of
motion clustering before applying the hand size threshold, and Figure 3.5.b shows the
result of motion clustering after applying it. The hand cluster is found among the
remaining clusters in the hand detection process.

Figure 3.5: Motion clustering with hand size: (a) Before applying the threshold of
hand size (b) After applying the threshold of hand size
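A minimal sketch of this clustering and size-filtering step; the calibration samples, the polynomial degree and the keep ratio are illustrative assumptions, not the values used in this work.

import cv2
import numpy as np

# Hypothetical calibration data: hand area in pixels measured at several distances (mm).
distances_mm = np.array([800, 1200, 1600, 2000, 2400, 2800])
hand_areas_px = np.array([5200, 2600, 1500, 980, 690, 510])
# Fit a polynomial curve to model expected hand size as a function of distance [11].
hand_size_model = np.poly1d(np.polyfit(distances_mm, hand_areas_px, deg=3))

def motion_clusters(motion_img, hand_depth_mm, keep_ratio=0.5):
    """Return bounding boxes of motion clusters no smaller than the expected hand size."""
    min_area = keep_ratio * hand_size_model(hand_depth_mm)
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(motion_img, connectivity=8)
    boxes = []
    for i in range(1, n):                      # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            x, y, w, h = stats[i, :4]          # left, top, width, height
            boxes.append((int(x), int(y), int(w), int(h)))
    return boxes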

To find the hand, a hand wave motion condition is set, which consists of a side-to-side
motion sequence. First, the directions of cluster movements are detected using a motion
template [12, 13]. The motion template is an effective method for tracking general
movement, and it is especially useful for gesture recognition. To use the motion
template, a segmented cluster is needed, which is the white rectangle shown in Figure
3.6.a. This image is referred to as the motion history image. When the rectangle moves, a
new cluster is calculated from the new current motion image and stored in the
motion history image. The white rectangle represents the newest cluster, while the
previous clusters of older motions become darker, as shown in Figures 3.6.b and 3.6.c.
The darkest rectangle is the oldest motion. These continuously changing rectangles
represent the movement of clusters. Figure 3.6.d shows the motion history image in
depth space.



Figure 3.6: Motion history image and motion template procedure. Motion history
at time (a) t, (b) t+1, (c) t+2, (d) Depth motion history image

From the motion history image, the gradient is taken to represent the direction. The
gradient can be calculated with the Sobel gradient function [13]. In some situations,
gradients from the motion history image are invalid, because non-movement regions
have zero gradients and the outer edges of the cluster have large gradients. Once the
time between frames is defined, the valid range of gradients can be calculated and the
invalid gradients removed. Finally, the global gradient is assigned as the direction.
Figure 3.7 shows the direction of clusters; the line in the circle shows the direction in
which the clusters are moving.



Figure 3.7: The direction of cluster
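A minimal sketch of the motion history image update and the gradient-based direction estimate, written directly with NumPy and OpenCV's Sobel operator; the history duration and the gradient-magnitude cutoffs are assumptions.

import cv2
import numpy as np

MHI_DURATION = 0.5        # seconds a motion trace remains in the history image (assumed)

def update_mhi(mhi, motion_mask, timestamp):
    """Stamp moving pixels with the current time and age out old traces."""
    mhi[motion_mask > 0] = timestamp
    mhi[mhi < timestamp - MHI_DURATION] = 0.0
    return mhi

def global_direction(mhi):
    """Estimate the dominant motion direction (degrees) from the MHI gradients."""
    gx = cv2.Sobel(mhi, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(mhi, cv2.CV_32F, 0, 1, ksize=3)
    mag = np.hypot(gx, gy)
    moving = mag > 1e-3                        # zero gradients belong to static regions
    if not np.any(moving):
        return None
    # Drop the largest gradients, which occur at the outer edges of the cluster.
    valid = moving & (mag < np.percentile(mag[moving], 95))
    if not np.any(valid):
        valid = moving
    angles = np.degrees(np.arctan2(gy[valid], gx[valid]))
    return float(np.median(angles))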

Next, the hand cluster is found using wave motion detection. From the movement
clusters, their directions can be calculated. The method used for wave motion detection
is to count the number of direction changes of a cluster. The required wave number is
set to three, and the number of times a cluster changes between moving left and moving
right is counted. After the direction has changed three times, the selected cluster is
assigned as the initial hand; Figure 3.8 shows the result of the initial hand detection.
This method is robust to illumination conditions. Occasionally, noise clusters in the
image may satisfy both the wave motion and the size conditions and be falsely detected
as the hand. In the next section, a tracking method is used to eliminate such false
detections as much as possible.
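A minimal sketch of the wave-motion check, fed by the global direction estimate from the previous sketch; the angular bins used to decide left versus right motion are assumptions.

class WaveDetector:
    """Flags a cluster as the initial hand after three left/right direction changes."""

    def __init__(self, required_changes=3):
        self.required_changes = required_changes
        self.last_side = None                  # 'left' or 'right'
        self.changes = 0

    def update(self, direction_deg):
        if direction_deg is None:
            return False
        # Map the global direction to a horizontal side; ignore mostly vertical motion.
        if abs(direction_deg) < 45:
            side = 'right'
        elif abs(direction_deg) > 135:
            side = 'left'
        else:
            return False
        if self.last_side is not None and side != self.last_side:
            self.changes += 1
        self.last_side = side
        return self.changes >= self.required_changes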


