
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
---------------------------------------

Tuan Dung LE

IMPROVING MULTI-VIEW HUMAN ACTION RECOGNITION
WITH SPATIAL-TEMPORAL POOLING AND
VIEW SHIFTING TECHNIQUES

MASTER OF SCIENCE THESIS IN
INFORMATION SYSTEM

2017-2018

Hanoi – 2018


MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
---------------------------------------

Tuan Dung LE


IMPROVING MULTI-VIEW HUMAN ACTION RECOGNITION WITH
SPATIAL-TEMPORAL POOLING AND VIEW SHIFTING TECHNIQUES

Speciality: Information System

MASTER OF SCIENCE THESIS IN
INFORMATION SYSTEM

SUPERVISORS:
1. Dr. Thi Oanh NGUYEN
2. Dr. Thi Thanh Hai TRAN

Hanoi – 2018



ACKNOWLEDGEMENT
First of all, I sincerely thank the teachers of the School of Information and
Communication Technology, as well as all the teachers at Hanoi University of
Science and Technology, who have taught me valuable knowledge and experience
during the past five years.
I would like to thank my two supervisors, Dr. Nguyen Thi Oanh, lecturer in
Information Systems, School of Information and Communication Technology, Hanoi
University of Science and Technology, and Dr. Tran Thi Thanh Hai, MICA Research
Institute, who have guided me in completing this master thesis. I have learned a lot
from them, not only knowledge in the field of computer vision but also working and
studying skills such as writing papers, preparing slides and presenting to an audience.
Finally, I would like to send my thanks to my family, friends and everyone who
has supported me throughout my studies and the research for this thesis.
Hanoi, March 2018
Master student

Tuan Dung LE



TABLE OF CONTENT
ACKNOWLEDGEMENT
TABLE OF CONTENT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS AND DEFINITIONS OF TERMS
INTRODUCTION
CHAPTER 1. HUMAN ACTION RECOGNITION APPROACHES
1.1 Overview
1.2 Baseline method: combination of multiple 2D views in the Bag-of-Words model
CHAPTER 2. PROPOSED FRAMEWORK
2.1 General framework
2.2 Combination of spatial/temporal information and Bag-of-Words model
2.2.1 Combination of spatial information and Bag-of-Words model (S-BoW)
2.2.2 Combination of temporal information and Bag-of-Words model (T-BoW)
2.3 View shifting technique
CHAPTER 3. EXPERIMENTS
3.1 Setup environment
3.2 Setup
3.3 Datasets
3.3.1 West Virginia University Multi-view Action Recognition Dataset (WVU)
3.3.2 Northwestern-UCLA Multiview Action 3D (N-UCLA)
3.4 Performance measurement
3.5 Experiment results
3.5.1 WVU dataset
3.5.2 N-UCLA dataset
CONCLUSION & FUTURE WORK
REFERENCES
APPENDIX 1



LIST OF FIGURES
Figure 1.1 a) human body in a frame, b) binary silhouettes, c) 3D human pose (visual hull), d) motion history volume, e) Motion Context, f) Gaussian blob human body model, g) cylindrical/ellipsoid human body model [1].
Figure 1.2 Construction of the HOG-HOF descriptor based on the SSM matrix [6].
Figure 1.3 a) Original video of a walking action at viewpoints 0° and 45°, their volumes and silhouettes, b) epipolar geometry in the case of extracted actor body silhouettes, c) epipolar geometry in the case of a dynamic scene with a dynamic actor and a static background, without extracting silhouettes [9].
Figure 1.4 MHI (middle row) and MEI (last row) templates [15].
Figure 1.5 Illustration of spatio-temporal interest points detected in a video of a person clapping [16].
Figure 1.6 Three ways to combine information from multiple 2D views in the BoW model [11].

Figure 2.1 Proposed framework.
Figure 2.2 Dividing the spatial domain based on bounding box and centroid.
Figure 2.3 Illustration of the T-BoW model.
Figure 2.4 Illustration of view shifting in the testing phase.

Figure 3.1 Illustration of the 12 action classes in the WVU multi-view actions dataset.
Figure 3.2 Camera setup for capturing the WVU dataset.
Figure 3.3 Illustration of the 10 action classes in the N-UCLA Multiview Action 3D dataset.
Figure 3.4 Camera setup for capturing the N-UCLA dataset.
Figure 3.5 Illustration of a confusion matrix.
Figure 3.6 Confusion matrices: a) basic BoW model with codebook D3, accuracy 70.83%; b) S-BoW model with 4 spatial parts, codebook D3, accuracy 82.41%.
Figure 3.7 Confusion matrices: a) S-BoW model with 6 spatial parts, codebook D3, accuracy 78.24%; b) S-BoW model with 6 spatial parts and view shifting, codebook D3, accuracy 96.67%.
Figure 3.8 Confusion matrices: a) basic BoW model, codebook D3, accuracy 59.57%; b) S-BoW model with 6 spatial parts, codebook D3, accuracy 63.40%.
Figure 3.9 Illustration of view shifting on the N-UCLA dataset.



LIST OF TABLES
Table 3.1 Accuracy (%) of the basic BoW model on the WVU dataset
Table 3.2 Accuracy (%) of the T-BoW model on the WVU dataset
Table 3.3 Accuracy (%) of the S-BoW model on the WVU dataset
Table 3.4 Accuracy (%) of the S-BoW model with (w) and without (w/o) the view shifting technique on the WVU dataset
Table 3.5 Comparison with other methods on the WVU dataset
Table 3.6 Accuracy (%) of the basic model on the N-UCLA dataset
Table 3.7 Accuracy (%) of the T-BoW model on the N-UCLA dataset
Table 3.8 Accuracy (%) of the combination of the S-BoW model and view shifting on the N-UCLA dataset
Table 3.9 Accuracy (%) of the S-BoW model with (w) and without (w/o) the view shifting technique on the N-UCLA dataset



LIST OF ABBREVIATIONS AND DEFINITIONS OF TERMS

Index   Abbreviation   Full name
1       MHI            Motion History Image
2       MEI            Motion Energy Image
3       LMEI           Localized Motion Energy Image
4       STIP           Spatio-Temporal Interest Point
5       SSM            Self-Similarity Matrix
6       HOG            Histogram of Oriented Gradients
7       HOF            Histogram of Optical Flow
8       IXMAS          INRIA Xmas Motion Acquisition Sequences
9       BoW            Bag-of-Words
10      ROI            Region of Interest



INTRODUCTION
As society moves from the Industry 3.0 era (automation based on information
technology and electronic production) to the Industry 4.0 era (a new convergence of
technologies such as the Internet of Things, collaborative robots, 3D printing and
cloud computing, together with the emergence of new business models), the automatic
collection and processing of information by computers has become essential. This
raises higher demands on the interaction between humans and machines, both in
precision and in speed. Problems such as object recognition, action recognition and
speech recognition are therefore attracting great interest from scientists and companies
around the world. Nowadays, video data is easily generated by devices such as digital
cameras, laptops and mobile phones, and by video-sharing websites. Human action
recognition in video contributes to the automated exploitation of this rich data source.

Human action recognition has many applications. Traditional security and
surveillance systems consist of camera networks monitored by human operators; as
the number of cameras increases and these systems are deployed in more locations,
it becomes difficult for supervisors to cover the entire system with the required
efficiency and accuracy. The task of computer vision is to find solutions that can
replace or assist the supervisor, and the automatic detection of abnormal events in
surveillance video is a topic that attracts much research. Enhancing the interaction
between humans and machines is another challenging problem: visual cues are the
most important form of non-verbal communication, and effectively exploiting
gesture-based communication would enable more accurate and natural human-computer
interaction. A typical application in this field is the "smart home", which responds
intelligently to the user's gestures and actions. However, such applications are still
incomplete and continue to attract research. In addition, human action recognition is
also applied in several other areas, such as robotics, content-based video analysis,
content-based video compression and retrieval, video indexing, and virtual reality games.



With the aim of studying the problem of human action recognition using a
combination of multiple views, we explored some recent approaches and chose to
experiment with a method that combines local features with the Bag-of-Words (BoW)
model. After analyzing the weaknesses of this method, we propose improvements and
evaluate them experimentally. The thesis is organized as follows:
 Chapter 1: focuses on existing approaches to give the reader an overview of human
action recognition in general and multi-view recognition in particular. The last part
of the chapter introduces a method that combines local features with the Bag-of-Words
model, evaluates its advantages and disadvantages, and then introduces the proposed
improvements.
 Chapter 2: presents the proposed improvement framework, which combines
spatial/temporal information with the BoW model and applies the view shifting
technique.
 Chapter 3: describes the experiments on the proposed method and presents the
results with some evaluation.
 Conclusion and Future work: summarizes what has and has not been achieved in
this master thesis, highlights its strengths and limitations, and outlines future
development.

 References



CHAPTER 1. HUMAN ACTION RECOGNITION APPROACHES
1.1 Overview
Recognition and analysis of human actions has attracted much interest over the
past three decades and is currently an active research topic in computer vision. It
underpins a large number of potential applications in intelligent surveillance, video
retrieval, video analysis and human-machine interaction. Recent research has
highlighted the difficulty of the problem, which stems from the large variability in
human action data: differences in the way individuals perform actions; movement and
clothing; camera angles and motion effects; lighting changes; occlusion caused by
objects in the environment or by parts of the human body; and disturbances in the
surroundings. Because so many factors can affect the outcome, current methods are
often limited to simple scenarios with simple backgrounds, simple action classes,
stationary cameras, or restricted variation in viewing angles.
Many different approaches have been proposed over the years for human action
recognition. These approaches may be categorized by the visual information used to
describe the action. Single-view methods use a single camera to record the human
body during the execution of the action. However, the appearance of an action can be
quite different when it is viewed from an arbitrary angle. Thus, single-view methods
usually rely on the basic assumption that the action is observed from the same angle
in both the training data and the testing data, and their efficiency drops significantly
when this assumption does not hold. An obvious way to improve the accuracy of
human action recognition is to increase the number of views per action by adding
cameras, which allows a larger amount of visual information to be exploited to
describe an action. The multi-view approach has been studied for only about a decade,
because the limited capabilities of devices and tools in previous decades could not
meet the computational demands of these methods. Recent technological advances
have brought powerful tools that make the multi-view approach feasible in a variety
of application contexts.
Action recognition methods can be divided into two families: the traditional
approach based on hand-crafted features, and the neural network approach. Neural
network approaches typically require large training sets, otherwise they are
ineffective, while in practical applications the datasets are usually small or
medium-sized. Therefore, in the context of this study, we focus on the traditional
approach that uses hand-crafted features. In this approach, the action representation
can be constructed from 2D data (2D approaches) or from 3D data (3D approaches) [1].
 3D approaches
The general trend in 3D methods is to integrate the visual information captured
from various viewing angles and then represent the action by a 3D model. This is
usually achieved by combining 2D human body poses, in the form of binary
silhouettes marking the video frame pixels that belong to the human body in each
camera (Fig. 1.1b). After the corresponding 3D human body representation is
obtained, actions are described as sequences of successive 3D human body poses.
Human body representations adopted by 3D methods include visual hulls (Fig. 1.1c),
motion history volumes (Fig. 1.1d) [2], optical flow corresponding to the human body
(Fig. 1.1e) [3], Gaussian blobs (Fig. 1.1f) [4], and cylindrical/ellipsoid body models
(Fig. 1.1g) [5].



Figure 1.1 a) human body in a frame, b) binary silhouettes, c) 3D human pose
(visual hull), d) motion history volume, e) Motion Context, f) Gaussian blob human
body model, g) cylindrical/ellipsoid human body model [1].
 2D approaches
Although approaches using 3D human body shape and motion have been
implemented successfully, most of them assume that the human body appears in all
cameras of a fixed camera system for both training and testing. This restricts their
application, since in reality the person may not appear in all cameras, either because
they are outside a camera's field of view or because they are occluded by other
objects. Obviously, when there is not enough information from all cameras, an
accurate 3D description of the human body cannot be obtained and the prediction
becomes unreliable. Multi-view 2D methods, on the other hand, can overcome this
drawback. 2D methods tend to look for features that are invariant across viewing
angles and then combine the predictions of the action class, so that missing
information from one view does not affect the results. Multi-view 2D methods are
often divided into two smaller approaches:
o View-invariant features
The first approach tries to represent the action with features that are invariant
to the view [6, 7, 8, 9, 10]. Action recognition is performed on the video of each
camera independently: the methods first represent the action by view-invariant
features, and the action class is then predicted from this invariant representation.
A view-invariant approach proposed by N. Junejo et al. [6] is to compute the
similarity of a series of images over time and to exploit the stability of this model
across multiple viewing angles. It builds a descriptive vector that records the
structural characteristics of similarity and time difference in a sequence of actions.
First, from each video, the authors compute the difference between consecutive
frames. From this they build a self-similarity matrix, called SSM-pos for 3D data and
SSM-HOG-HOF for 2D data. Next, from the SSM, the authors extract local SSM
descriptors (Fig. 1.2) and feed them into the K-means clustering algorithm; each
cluster corresponds to a word in the dictionary (BoW approach). Finally, an SVM
classifier with a squared kernel is used in a one-vs-all strategy. The advantage of the
method is its high stability under varying viewing angles and under differences in the
way actions are performed.



Figure 1.2 Construction of the HOG-HOF descriptor based on the SSM matrix [6].
Another view-invariant approach, proposed by Anwaar-ul-Haq et al. [9], is based
on dense optical flow and epipolar geometry. The authors propose a novel similarity
score for action matching based on properties of the segmentation matrix or the
two-body fundamental matrix. This helps establish a view-invariant action matching
framework without any preprocessing of the original video sequences.

Figure 1.3 (a) Original video of a walking action at viewpoints 0° and 45°,
their volumes and silhouettes, (b) epipolar geometry in the case of extracted actor
body silhouettes, (c) epipolar geometry in the case of a dynamic scene with a dynamic
actor and a static background, without extracting silhouettes [9].
o Combination of information from multiple views
The second approach combines information from different views [11, 12, 13].
Unlike the view-invariant approach, it exploits the fact that different views hold
different amounts of information and may complement one another; for example,
parts of the human body may be occluded in one view but fully visible in another.
This combination is quite similar to merging views into a 3D representation of the
action, as in the 3D approaches, but 2D approaches can combine information at
different stages of the classification problem: either combining the features from
different views and then classifying, or combining the results after performing
classification in each view. This may yield final classification results with higher
accuracy. The two issues that need to be addressed in this approach are (1) which
features to use to characterize the action and (2) how to combine information between
views in order to achieve the best prediction results.
For problem (1), according to [14], action representations can be divided into
global representations and local-feature representations. Global representations often
record the structure, shape, and movement of the whole human body. Two examples
of global representations are MHI and MEI [15]. The idea of these two methods is
to encode information about human motion and shape in a sequence of images into a
single "template" image (Figure 1.4); the needed information can then be exploited
from this template. Global representations were studied extensively for action
recognition in the period 1997-2007; they often preserve the spatial and temporal
structure of the action. Nowadays, however, local representations attract more
research interest because global representations suffer more from viewpoint changes
and occlusion.



Figure 1.4 MHI (middle row) and MEI (last row) templates [15].
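To make the template idea concrete, the following is a minimal sketch of an MHI computation (not the exact formulation of [15]; frame differencing with a fixed threshold and a linear decay over tau frames are illustrative assumptions):

```python
import numpy as np

def motion_history_image(frames, tau=30, diff_thresh=30):
    """Build a simple MHI from a list of grayscale uint8 frames of equal size.
    Recently moving pixels are bright; the brightness decays over tau frames."""
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, cur in zip(frames[:-1], frames[1:]):
        moving = np.abs(cur.astype(np.int16) - prev.astype(np.int16)) > diff_thresh
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))  # set or decay
    return mhi / tau  # normalized to [0, 1]; (mhi > 0) gives a binary MEI-like template
```

Shape statistics computed from such templates can then serve as the global action descriptor.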

Local representations of actions are built with a pipeline that includes interest
point detection, local feature extraction, and aggregation of the local features into an
action representation vector. Many interest point detectors have been proposed for
video: Harris3D [16], Cuboid [17], Hessian3D [18], and dense sampling. After
detection, a descriptor is computed for each point from the image intensities in its
spatio-temporal neighborhood. The most commonly used descriptors are Cuboid,
HOG/HOF, HOG3D, and ESURF. Next, the local descriptors are used to train a BoW
model that provides a descriptor for each video. Finally, the video descriptor is passed
through a classifier to assign an action label. In [19], the authors evaluated
combinations of detectors and descriptors on the KTH and UCF Sports datasets: on
KTH, the highest accuracy was achieved by combining the Harris3D detector with the
HOF descriptor, while on UCF Sports dense sampling combined with the HOG3D
descriptor performed best. In summary, it is difficult to judge which combination is
best for every situation.
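As an illustration of the codebook and quantization steps of this pipeline (a generic K-means variant, not tied to any specific paper cited above; the vocabulary size is an arbitrary choice):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors, n_words=1000):
    """Cluster local descriptors (e.g. HOG/HOF around detected interest points)
    from the training videos into a vocabulary of visual words."""
    return KMeans(n_clusters=n_words, n_init=10).fit(all_descriptors)

def video_histogram(video_descriptors, codebook):
    """Quantize one video's local descriptors into a normalized word histogram,
    which is the fixed-length vector passed to the classifier."""
    words = codebook.predict(video_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```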



For problem (2), the two most common ways of combining information from
different views are early fusion and late fusion. In early fusion, the feature descriptors
from different views are concatenated to form a single vector describing the action
before being passed to a common classifier. In late fusion, a classifier is trained for
each view and their outputs are combined to produce the final result. G. Burghouts et
al. [11] use the STIP feature and the BoW model; their results show that late fusion
achieves the highest accuracy on the IXMAS dataset. R. Kavi et al. [20] extract
features from LMEI (similar to the MEI template but with more spatially distributed
information), which are fed into an LDA classifier for each individual view, and the
final result is obtained by combining the outputs of the LDAs. Another method by
R. Kavi et al. [21], using an LSTM-ConvNet architecture, reports results with both
early and late fusion strategies. The results of these papers show that late fusion
achieves higher recognition accuracy than early fusion.

There are two possible explanations for this result. First, occlusion in some views
causes wrong or missing feature extraction from those views, so the vector
representing the final action is less effective for classification. Second, when a
multi-camera system is used and the person chooses different positions and
orientations when performing actions, early fusion produces vectors that represent the
same action but are poorly correlated, because the person's appearance in each camera
differs; the accuracy of early fusion therefore decreases in this case. Late fusion is also
affected by this issue, but since a classifier is trained for each view, the final prediction
can still be correct as long as at least one view assigns a high probability to the right
action class. To cope with differences in the actor's position and orientation between
the training and testing data, R. Kavi et al. [20] proposed a circular view shifting in
the evaluation phase, which aims to bring the test data into the same orientation as the
training set. This technique is also used and evaluated in the proposed framework of
this thesis.
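The following sketch illustrates one hedged reading of circular view shifting at test time, assuming per-view classifiers with a scikit-learn-style predict_proba and cameras arranged in a ring (the exact matching criterion used in [20] and in this thesis may differ):

```python
import numpy as np

def view_shifting_predict(view_models, test_descriptors):
    """view_models[v]: classifier trained on camera v; test_descriptors[v]:
    BoW descriptor extracted from test camera v. A different actor orientation
    at test time looks like a rotation of the camera order, so every cyclic
    shift of the view assignment is tried and the most confident one is kept."""
    n_views = len(view_models)
    best_label, best_conf = None, -np.inf
    for s in range(n_views):                          # try every circular shift
        probs = np.stack([
            view_models[v].predict_proba(
                test_descriptors[(v + s) % n_views].reshape(1, -1))[0]
            for v in range(n_views)])
        fused = probs.mean(axis=0)                    # late fusion across views
        if fused.max() > best_conf:
            best_conf, best_label = float(fused.max()), int(np.argmax(fused))
    return best_label, best_conf
```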



Section 1.2 introduces the baseline method that we apply and on which we build
the proposed improvement framework.
1.2 Baseline method: combination of multiple 2D views in the Bag-of-Words
model
G. Burghouts et al. [11] proposed a BoW pipeline consisting of STIP features
[10] (using the Harris3D detector and the HOG/HOF descriptor) extracted from the
video (Fig. 1.5), a random forest model [22] that transforms the features into
histograms serving as video descriptors, and an SVM classifier that predicts the action
class. The authors experimented with several strategies for combining the information
from multiple views: combining features (early fusion), combining video descriptors
(intermediate fusion), and combining posterior probabilities (late fusion) (Fig. 1.6).
Experimental results showed that averaging the prediction probabilities from all views
gave the highest accuracy on the IXMAS dataset.



Figure 1.5 Illustration of spatio-temporal interest points detected in a video of a
person clapping [16].

Figure 1.6 Three ways to combine information from multiple 2D views in the
BoW model [11].
Assume that we have observations from M views capturing the human actions
and a set of N different actions to be recognized. For a specific view, we perform the
following steps. First, STIP features are extracted from the video. For each detected
local keypoint, a histogram of oriented gradients (HOG) and a histogram of optical
flow (HOF) are computed over a 3 × 3 × 2 grid of spatio-temporal blocks to capture
shape and motion information in the local neighborhood of the point. By
concatenating the HOG and HOF histograms, we obtain a descriptor with 162 values.
Second, for each action, a random forest model is trained to learn the codebook.
A random forest is used instead of K-means clustering because of its ability to create
a more discriminative codebook and its speed [23]. The training input consists of a set
of positive features from one action class and a set of negative features from the other
classes. There are two options for collecting the negative features: random and
selective [22]. (In this study, we follow the random option; the selective option could
be an axis of our future work.) Each forest contains 10 trees, each tree has 32 leaves,
and each leaf corresponds to a codeword in the codebook. For each view m, we thus
obtain a set of codebooks:



D_m = {D_m^1, D_m^2, ..., D_m^N}     (1.1)

where D_m^i is the codebook for action i in view m.
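A hedged sketch of this codebook step, together with the quantization described in the next paragraph, using scikit-learn's random forest as a stand-in for the forest of [22] (the tree sizes follow the text, but the training details here are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_codebook(pos_feats, neg_feats, n_trees=10, n_leaves=32):
    """Train a random-forest codebook for one action class.
    pos_feats/neg_feats: arrays of shape (n, 162) of HOG/HOF descriptors."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.concatenate([np.ones(len(pos_feats)), np.zeros(len(neg_feats))])
    forest = RandomForestClassifier(n_estimators=n_trees,
                                    max_leaf_nodes=n_leaves).fit(X, y)
    # Map every tree's leaf node ids to contiguous codeword indices.
    leaf_maps, offset = [], 0
    for tree in forest.estimators_:
        leaves = np.where(tree.tree_.children_left == -1)[0]   # leaf node ids
        leaf_maps.append({leaf: offset + i for i, leaf in enumerate(leaves)})
        offset += len(leaves)
    return forest, leaf_maps, offset        # offset = total number of codewords

def quantize_video(stip_feats, forest, leaf_maps, n_words):
    """Turn a video's STIP descriptors into a normalized bag-of-words histogram."""
    hist = np.zeros(n_words)
    leaf_ids = forest.apply(stip_feats)     # shape (n_feats, n_trees)
    for t, mapping in enumerate(leaf_maps):
        for leaf in leaf_ids[:, t]:
            hist[mapping[leaf]] += 1
    return hist / max(hist.sum(), 1.0)
```

With 10 trees of at most 32 leaves each, this yields the 320-bin histogram mentioned below.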

The next step is to quantize the STIP features of a video into a histogram.
Passing the STIP features through the learned forests, we obtain a 320-bin normalized
histogram describing the video. Finally, this descriptor is used to train a binary
classifier SVM_m^i for class i. Corresponding to the codebooks, we have a set of
binary SVMs:
SVM_m = {SVM_m^1, SVM_m^2, ..., SVM_m^N}     (1.2)

where SVM_m^i is the classifier for action i in view m.

Each SVM is trained with a chi-square kernel (C = 1) and its outputs are posterior
probabilities. In the binary case, the probabilities are calibrated using Platt scaling:
a logistic regression on the SVM scores, fit by an additional cross-validation on the
training data. For each test sample in view m, we obtain a set of probabilities:
P_m = {P_m^1, P_m^2, ..., P_m^N}     (1.3)

where P_m^i is the probability of action i in view m.
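For illustration, a per-view binary classifier with these properties could be sketched as follows with scikit-learn (a hedged example, not the thesis's actual code; labels are assumed to be 0/1 with 1 the positive action class):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_binary_svm(train_hists, labels, C=1.0):
    """Train one binary SVM on BoW histograms with a chi-square kernel;
    probability=True enables Platt-scaled posterior probabilities."""
    K_train = chi2_kernel(train_hists)                   # precomputed kernel matrix
    return SVC(kernel="precomputed", C=C, probability=True).fit(K_train, labels)

def positive_probability(svm, train_hists, test_hists):
    """Posterior probability of the positive class for each test video."""
    K_test = chi2_kernel(test_hists, train_hists)
    return svm.predict_proba(K_test)[:, 1]               # column 1 = label 1
```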
Then, the posterior probabilities from all views are combined by taking their
average:

P^i = (1/M) Σ_{j=1}^{M} P_j^i     (1.4)

where P^i is the probability that the test sample belongs to action i.
The label is assigned to the class with the highest probability:

k = argmax_{i ∈ {1, ..., N}} P^i     (1.5)
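Equations (1.4) and (1.5) amount to the following few lines (per-view probabilities stacked into an M × N array; the array name is illustrative):

```python
import numpy as np

def late_fusion_label(view_probs):
    """view_probs: array of shape (M, N) holding P_m^i for M views and N actions.
    Averages over views (Eq. 1.4) and returns the most probable action (Eq. 1.5)."""
    fused = view_probs.mean(axis=0)
    return int(np.argmax(fused)), fused
```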

 Limitations & towards improvement



This method showed good performance on multi-view human action
recognition, reaching 96.4% accuracy on the IXMAS dataset (with selective negative
samples for the random forest). We also tested this method with random negative
samples for the random forest and obtained 88% accuracy on the same dataset.
However, the STIP local descriptor provides shape and motion information around the
keypoint but lacks location information (both spatial and temporal coordinates).
Moreover, the BoW model captures the distribution of visual words in a video but
discards their order of appearance as well as the spatial correlation between them.
These factors may lead to confusion between actions that share similar local
information but differ in the relative positions of body parts (such as arms and legs)
or in the order of appearance. Based on this observation, several methods have
proposed adding the spatial/temporal information of local features to the BoW model
and have shown improved performance on single-view human action recognition
[24, 25]. Parul Shukla et al. [24] divide a video into several parts along the time
domain. M. Ullah et al. [25] divide the detected spatial domain into several smaller
parts using action, motion and object detectors. The common goal is to obtain a final
descriptor that carries more information by combining information from the smaller
spatial/temporal parts.

Based on the ideas mentioned above, we propose a framework for human action
recognition using multiple views, which is described in detail in the next chapter.
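As a quick illustration of the temporal-division idea of [24] (a generic sketch only; the number of segments and the exact T-BoW scheme used in Chapter 2 may differ):

```python
import numpy as np

def temporal_pooled_descriptor(frame_ids, word_ids, n_frames, n_words, n_segments=3):
    """Split a video into n_segments equal temporal parts, build one normalized
    BoW histogram per part from the (frame, codeword) pairs of its local features,
    and concatenate the histograms into the final descriptor."""
    hists = np.zeros((n_segments, n_words))
    for f, w in zip(frame_ids, word_ids):
        seg = min(int(n_segments * f / n_frames), n_segments - 1)
        hists[seg, w] += 1
    hists /= np.maximum(hists.sum(axis=1, keepdims=True), 1.0)
    return hists.ravel()
```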



CHAPTER 2. PROPOSED FRAMEWORK
2.1 General framework
We propose a framework for human action recognition from multi-view
cameras, as illustrated in Fig. 2.1. The dotted block shows the processing steps for
each camera view. Given a video sequence of an action, we start by extracting STIP
features. Next, a BoW model is used to represent the action. However, in order to
distinguish activities with small motions, we propose combining spatial/temporal
information with the BoW model (the spatial-temporal pooling block) to describe the
action classes better. We propose using a simple background subtraction algorithm,
whose output helps us add the spatial information of the local features. In addition,
because the person's position and orientation may differ between the training data
and the testing data, we also apply view shifting in the testing phase. The main
contributions are described in more detail in the following sections.

Figure 2.1 Proposed framework.
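The thesis only states that a conventional background subtraction technique is used; the following is one plausible OpenCV-based sketch of how the moving ROI could be obtained per frame (the subtractor choice, the morphology step and the area threshold are assumptions for illustration; OpenCV 4 API):

```python
import cv2
import numpy as np

def moving_roi(frame, bg_subtractor, min_area=500):
    """Return the bounding box (x, y, w, h) of the moving person, or None.
    bg_subtractor can be created once with cv2.createBackgroundSubtractorMOG2()."""
    mask = bg_subtractor.apply(frame)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = [c for c in contours if cv2.contourArea(c) > min_area]
    if not contours:
        return None
    return cv2.boundingRect(np.vstack(contours))  # STIP features outside this box are discarded
```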




2.2 Combination of spatial/temporal information and Bag-of-Words model
2.2.1 Combination of spatial information and Bag-of-Words model (S-BoW)
STIP features are detected from the sequence of original images, and the descriptor
is obtained by computing histograms of spatial gradients and optical flow accumulated
in space-time neighborhoods of the detected interest points [14]. This means that
small movements in the background are also taken into account. To avoid this harmful
effect of the background, we detect STIP features only inside the moving ROIs, which
are obtained by a conventional background subtraction technique. Furthermore, there
is a large inter-class similarity among actions: some actions differ from others only
by a minor movement of a body part (hand, foot). To enrich the BoW model with
spatial information, for each frame we divide the ROI (bounding box) into s parts
according to the human body structure. We try several ways of dividing (Fig. 2.2)
and compare their effectiveness in the experimental section: the bounding box is
divided into 3 or 4 spatial parts based only on its height (Fig. 2.2e), and into 6 spatial
parts by also using the centroid's coordinates (Fig. 2.2f). For each spatial part, we
compute the histogram of the features located in that part; these histograms are then
concatenated into the final vector.
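A hedged sketch of this spatial pooling for the 6-part case (the split in Fig. 2.2f is assumed here to be 3 horizontal bands by bounding-box height, each cut left/right at the centroid's x coordinate; the real layout may differ):

```python
import numpy as np

def spatial_pooled_descriptor(points, word_ids, bbox, centroid, n_words):
    """points: (n, 2) array of (x, y) STIP locations inside the bounding box;
    word_ids: their codeword indices; bbox: (x0, y0, w, h); centroid: (cx, cy).
    Returns the concatenation of 6 per-part normalized BoW histograms."""
    x0, y0, w, h = bbox
    cx = centroid[0]
    hists = np.zeros((6, n_words))
    for (x, y), word in zip(points, word_ids):
        band = min(int(3 * (y - y0) / h), 2)      # 0: top, 1: middle, 2: bottom
        side = 0 if x < cx else 1                 # left or right of the centroid
        hists[2 * band + side, word] += 1
    hists /= np.maximum(hists.sum(axis=1, keepdims=True), 1.0)
    return hists.ravel()
```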


