MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
---------------------------------------

Tuan Dung LE

IMPROVING MULTI-VIEW HUMAN ACTION RECOGNITION
WITH SPATIAL-TEMPORAL POOLING AND
VIEW SHIFTING TECHNIQUES

MASTER OF SCIENCE THESIS IN
INFORMATION SYSTEM
2017-2018

Hanoi – 2018
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
---------------------------------------
Tuan Dung LE
IMPROVING MULTI-VIEW HUMAN ACTION RECOGNITION WITH
SPATIAL-TEMPORAL POOLING AND VIEW SHIFTING TECHNIQUES
Speciality:
Information System
MASTER OF SCIENCE THESIS IN
INFORMATION SYSTEM
SUPERVISOR:
1. Dr. Thi Oanh NGUYEN
Hanoi – 2018
Master student: Tuan Dung LE – CBC17016
Page 2
ACKNOWLEDGEMENT
First of all, I sincerely thank the teachers in the School of Information and
Communication Technology, as well as all the teachers at the Hanoi University of
Science and Technology, who have taught me valuable knowledge and experience
during the past five years.
I would like to thank my two supervisors, Dr. Nguyen Thi Oanh, lecturer in
Information Systems, School of Information and Communication Technology,
Hanoi University of Science and Technology, and Dr. Tran Thi Thanh Hai, MICA
Research Institute, who have guided me in completing this master thesis. I have
learned a lot from them, not only knowledge in the field of computer vision but also
working and studying skills such as writing papers, preparing slides, and presenting
to an audience.
Finally, I would like to send my thanks to my family, friends, and everyone who
has supported me throughout the process of studying and researching this thesis.
Hanoi, March 2018
Master student
Tuan Dung LE
TABLE OF CONTENTS
ACKNOWLEDGEMENT ..........................................................................................3
TABLE OF CONTENTS ..........................................................................................4
LIST OF FIGURES.....................................................................................................6
LIST OF TABLES ......................................................................................................8
LIST OF ABBREVIATIONS AND DEFINITIONS OF TERMS .............................9
INTRODUCTION .....................................................................................................10
CHAPTER 1. HUMAN ACTION RECOGNITION APPROACHES .....................12
1.1 Overview ..........................................................................................................12
1.2 Baseline method: combination of multiple 2D views in the Bag-of-Words
model .....................................................................................................................20
CHAPTER 2. PROPOSED FRAMEWORK ............................................................24
2.1 General framework ..........................................................................................24
2.2 Combination of spatial/temporal information and Bag-of-Words model .......25
2.2.1 Combination of spatial information and Bag-of-Words model (S-BoW) .25
2.2.2 Combination of temporal information and Bag-of-Words model (T-BoW)
............................................................................................................................26
2.3 View shifting technique ...................................................................................27
CHAPTER 3. EXPERIMENTS ................................................................................30
3.1 Setup environment ...........................................................................................30
3.2 Setup ................................................................................................................30
3.3 Datasets ............................................................................................................30
3.3.1 Western Virginia University Multi-view Action Recognition Dataset
(WVU) ................................................................................................................30
3.3.2 Northwestern-UCLA Multiview Action 3D (N-UCLA) ...........................32
3.4 Performance measurement...............................................................................33
3.5 Experiment results ...........................................................................................35
3.5.1 WVU dataset .............................................................................................35
3.5.2 N-UCLA dataset ........................................................................................40
CONCLUSION & FUTURE WORK .......................................................................43
REFERENCES ..........................................................................................................44
APPENDIX 1 ............................................................................................................47
LIST OF FIGURES
Figure 1. 1 a) human body in frame, b) binary silhouettes, c) 3D Human Pose (visual
hull), d) motion history volume, e) Motion Context, f) Gaussian blob human body
model, g) cylindrical/ellipsoid human body model [1]. ...........................................14
Figure 1. 2 Construct HOG-HOF descriptive vector based on SSM matrix[6]. .......16
Figure 1. 3 a) Original video of walking action with viewpoints 0° and 45°, their
volumes and silhouettes, b) epipolar geometry in case of extracted actor body
silhouettes, c) epipolar geometry in case of dynamic scene with dynamic actor and
static background without extracting silhouettes [9]. ................................................16
Figure 1. 4 MHI (middle row) and MEI (last row) template [15]. ...........................18
Figure 1. 5 Illustration of spatio-temporal interest points detected in a video of a
person clapping [16]. .................................................................................................21
Figure 1. 6 Three ways to combine multiple 2D views information in the BoW model
[11]. ...........................................................................................................................21
Figure 2. 1 Proposed framework. ..............................................................................24
Figure 2. 2 Dividing space domain based on bounding box and centroid. ...............26
Figure 2. 3 Illustration of T-BoW model. .................................................................27
Figure 2. 4 Illustration of view shifting in testing phase. .........................................28
Figure 3. 1 Illustration of 12 action classes in the WVU Multi-view actions dataset.
...................................................................................................................................31
Figure 3. 2 Cameras setup for capturing WVU dataset. ...........................................31
Figure 3. 3 Illustration of 10 action classes in the N-UCLA Multi-view Actions 3D
dataset. .......................................................................................................................32
Figure 3. 4 Cameras setup for capturing N-UCLA dataset. ......................................33
Figure 3. 5 Illustration of confusion matrix. .............................................................35
Figure 3. 6 Confusion matrices: a) Basic BoW model with codebook D3, accuracy
70.83%; b) S-BoW model with 4 spatial parts, codebook D3, accuracy 82.41%. ....37
Figure 3. 7 Confusion matrices: a) S-BoW model with 6 spatial parts, codebook D3,
accuracy 78.24%; b) S-BoW model with 6 spatial parts and view shifting, codebook
D3, accuracy 96.67%. ...............................................................................................38
Figure 3. 8 Confusion matrices: a) Basic BoW model, codebook D3, accuracy
59.57%; b) S-BoW model with 6 spatial parts, codebook D3, accuracy 63.40%. ....41
Figure 3. 9 Illustration of view shifting on N-UCLA dataset. ..................................42
LIST OF TABLES
Table 3. 1 Accuracy (%) of basic BoW model on WVU dataset.............................36
Table 3. 2 Accuracy (%) of T-BoW model on WVU dataset ..................................36
Table 3. 3 Accuracy (%) of S-BoW model on WVU dataset ...................................38
Table 3. 4 Accuracy (%) of S-BoW model with (w) and without (w/o) view shifting
technique on WVU dataset........................................................................................39
Table 3. 5 Comparison with other methods on WVU Dataset ..................................39
Table 3. 6 Accuracy (%) of basic model on N-UCLA dataset .................................40
Table 3. 7 Accuracy (%) of T-BoW model on N-UCLA dataset .............................40
Table 3. 8 Accuracy (%) of the combination of S-BoW model and view shifting on
N-UCLA dataset ........................................................................................................41
Table 3. 9 Accuracy (%) of S-BoW model with (w) and without (w/o) view shifting
technique on N-UCLA dataset ..................................................................................42
LIST OF ABBREVIATIONS AND DEFINITIONS OF TERMS
Index   Abbreviation   Full name
1       MHI            Motion History Image
2       MEI            Motion Energy Image
3       LMEI           Localized Motion Energy Image
4       STIP           Spatio-Temporal Interest Point
5       SSM            Self-Similarity Matrix
6       HOG            Histogram of Oriented Gradients
7       HOF            Histogram of Optical Flow
8       IXMAS          INRIA Xmas Motion Acquisition Sequences
9       BoW            Bag-of-Words
10      ROIs           Regions of Interest
INTRODUCTION
As society moves from the Industry 3.0 era (automation of production through
information technology and electronics) to Industry 4.0 (a new convergence of
technologies such as the Internet of Things, collaborative robots, 3D printing, and
cloud computing, and the emergence of new business models), the automatic
collection and processing of information by computers has become essential. This
leads to higher demands on the interaction between humans and machines, both in
precision and in speed. Thus, problems such as object recognition, action
recognition, and speech recognition are now attracting great interest from scientists
and companies around the world. Nowadays, video data is easily generated by
devices such as digital cameras, laptops, and mobile phones, and distributed through
video-sharing websites. Human action recognition in video contributes to the
automated exploitation of this rich data source.
Human action recognition has a wide range of applications. Traditional security
and monitoring systems consist of camera networks watched by human operators.
As the number of cameras grows and these systems are deployed in more locations,
it becomes difficult for supervisors to cover the entire system with adequate
efficiency and accuracy. A task of computer vision is therefore to find solutions that
can replace or assist the supervisor; automatic recognition of abnormal events in
surveillance footage is a problem that attracts a lot of research. Enhancing the
interaction between humans and machines is also still challenging: visual cues are
the most important channel of non-verbal communication, and effectively
exploiting gesture-based communication would make human-computer interaction
more accurate and natural. A typical application in this field is the "smart home",
which responds intelligently to the user's gestures and actions. However, such
applications are still incomplete and continue to attract research. In addition, human
action recognition is applied in a number of other areas, such as robotics,
content-based video analysis and retrieval, video compression, video indexing, and
virtual reality games.
With the aim of studying the problem of human action recognition using a
combination of multiple views, we explored several recent approaches and chose to
experiment with a method that combines local features with the Bag-of-Words
model. After analyzing the weaknesses of this method, we proposed improvements
and evaluated them experimentally. The thesis is organized as follows:
Chapter 1: This chapter surveys existing approaches to give readers an overview
of human action recognition in general and multi-view recognition in particular.
The last part of the chapter introduces the baseline method combining local
features with the Bag-of-Words model, evaluates its advantages and
disadvantages, and then outlines the proposed improvements.
Chapter 2: This chapter presents the proposed framework, which combines
spatial/temporal information with the Bag-of-Words model and applies a view
shifting technique.
Chapter 3: This chapter describes the experiments with the proposed method and
discusses the results.
Conclusion and Future Work: This section reviews what has and has not been
achieved in the master's thesis, highlights its strengths and weaknesses, and
discusses future development.
References
CHAPTER 1. HUMAN ACTION RECOGNITION APPROACHES
1.1 Overview
Recognition and analysis of human actions has attracted much interest over the
past three decades and is currently an active research topic in computer vision. It is
key to a large number of potential applications in intelligent surveillance, video
retrieval, video analysis, and human-machine interaction. Recent research has
highlighted the difficulty of this problem, caused by the large variability in human
action data: differences in the way individuals perform actions; movement and
clothing; camera angles and motion effects; lighting changes; occlusion by objects
in the environment or by parts of the human body; and background clutter. Because
so many factors can affect the outcome, current methods are often limited to simple
scenarios with simple backgrounds, simple action classes, and stationary cameras,
or restrict the variation in viewing angles.
Many different approaches have been proposed over the years for human action
recognition. These approaches may be categorized by the visual information used
to describe the action. Single-view methods use one camera to record the human
body during the execution of the action. However, the appearance of an action
differs considerably when viewed from arbitrary angles. Thus, single-view methods
usually rely on the basic assumption that the action is observed from the same angle
in both the training data and the testing data, and their efficiency drops significantly
when this assumption does not hold. An obvious way to improve the accuracy of
human action recognition is to increase the number of views per action by adding
cameras, which lets us exploit a larger amount of visual information to describe an
action. The multi-view approach has been studied for only about a decade, because
the limited capabilities of earlier devices and tools could not meet the
computational demands of these methods. Recent technological advances have
brought powerful tools that make the multi-view approach practical in a variety of
application contexts.
Action recognition methods can be divided into two broad approaches: the
traditional approach using hand-crafted features and the neural network approach.
Neural network approaches typically require large training sets and become
ineffective otherwise, whereas in practical applications datasets are usually small or
medium-sized. Therefore, in the context of this study, we are interested in the
traditional approach based on hand-crafted features. Within this approach, the
action representation can be constructed from 2D data (2D approaches) or from 3D
data (3D approaches) [1].
3D approaches
The general trend in 3D methods is to integrate visual information captured from
various viewing angles and then represent actions with a 3D model. This is usually
achieved by combining 2D human body poses, in the form of binary silhouettes
marking the video frame pixels belonging to the human body in each camera (Fig.
1.1b). After the corresponding 3D human body representation is obtained, actions
are described as sequences of successive 3D body poses. Human body
representations adopted by 3D methods include visual hulls (Fig. 1.1c), motion
history volumes (Fig. 1.1d) [2], optical flow corresponding to the human body (Fig.
1.1e) [3], Gaussian blobs (Fig. 1.1f) [4], and cylindrical/ellipsoid body models (Fig.
1.1g) [5].
Figure 1. 1 a) human body in frame, b) binary silhouettes, c) 3D Human Pose
(visual hull), d) motion history volume, e) Motion Context, f) Gaussian blob human
body model, g) cylindrical/ellipsoid human body model [1].
2D approaches
Although approaches using 3D human body shape and motion have been
implemented successfully, most of them assume that the human body appears in all
cameras of a fixed camera system during both training and testing. This restricts
their applicability: in reality, the person may not appear in every camera, either
because they are outside a camera's field of view or because they are occluded by
other objects. Without sufficient information from all cameras, an accurate 3D
description of the human body cannot be obtained, which leads to false predictions.
Multi-view 2D methods, on the other hand, can overcome this drawback. 2D
methods tend to look for features that are invariant across viewing angles, or to
combine per-view predictions of the action class, so that the lack of information
from one view does not ruin the result. Multi-view 2D methods are often divided
into two smaller approaches:
o View-invariant features
The first approach tries to describe the action with features that are invariant to
the viewpoint [6, 7, 8, 9, 10]. Action recognition is performed on each video from
each camera independently: the methods first represent the action by view-invariant
features, then classify the action based on these features.
A view-invariant approach proposed by N. Junejo et al. [6] computes the
similarity of a series of images over time and observes the stability of this pattern
across viewing angles. It builds a descriptor that records the structure of
self-similarities and temporal differences in an action sequence. First, from each
video, the authors compute the difference between consecutive frames and build a
self-similarity matrix, called SSM-pos for 3D data and SSM-HOG-HOF for 2D
data. Next, from the SSM, the authors extract local SSM descriptors (Fig. 1.2) and
feed them to the k-means clustering algorithm; each cluster corresponds to a word
in the dictionary (the BoW approach). Finally, an SVM classifier with a squared
kernel is used in a one-vs-all strategy. The advantage of the method is its high
stability under varying viewing angles and under differences in the way the actions
are performed.
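The SSM construction described above can be sketched in a few lines: given one descriptor per frame (pose coordinates or HOG/HOF features), the matrix entry (i, j) is the distance between frames i and j. The descriptor dimensions and the Euclidean distance here are illustrative choices, not taken from [6].

```python
import numpy as np

def self_similarity_matrix(frame_descriptors):
    """Build an SSM where entry (i, j) is the Euclidean distance
    between the descriptors of frames i and j."""
    X = np.asarray(frame_descriptors, dtype=float)
    # Pairwise squared distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))

# Toy example: 5 frames with 3-dimensional per-frame descriptors.
rng = np.random.default_rng(0)
ssm = self_similarity_matrix(rng.normal(size=(5, 3)))
```

The local SSM descriptors of [6] are then cut from diagonal neighborhoods of this matrix; the point of the structure is that it is stable under viewpoint changes.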
Master student: Tuan Dung LE – CBC17016
Page 15
Figure 1. 2 Construct HOG-HOF descriptive vector based on SSM matrix [6].
Another view-invariant approach, proposed by Anwaar-ul-Haq et al. [9], is based
on dense optical flow and epipolar geometry. The authors propose a novel
similarity score for action matching based on properties of the two-body
fundamental matrix. It establishes a view-invariant action matching framework
without any preprocessing of the original video sequences.
Figure 1. 3 (a) Original video of walking action with viewpoints 0° and 45°, their
volumes and silhouettes, (b) epipolar geometry in case of extracted actor body
silhouettes, (c) epipolar geometry in case of dynamic scene with dynamic actor and
static background without extracting silhouettes [9].
o Combination of information from multiple views
The second approach combines information from different views [11, 12, 13].
Unlike the view-invariant approach, it exploits the fact that different views hold
different amounts of information and may complement one another: for example,
parts of the human body may be occluded in one view but clearly captured in
another. This is quite similar to combining views into a 3D representation as in the
3D approaches, but 2D approaches can combine information at different stages of
the classification problem: fusing features from the different views before
classification, or fusing the per-view results after classification, which may yield
higher final accuracy. Two issues need to be addressed in this approach: (1) what
features to use to characterize the action, and (2) how to combine information
between views to achieve the best prediction results.
For problem (1), according to [14], action representations can be divided into
global representations and local-feature representations. Global representations
record the structure, shape, and movement of the whole human body; two examples
are MHI and MEI [15]. The idea of these two methods is to encode information
about human motion and shape across a sequence of images into a single
"template" image (Fig. 1.4), from which the needed information can then be
extracted. Global representations were studied extensively in the period 1997-2007;
they preserve the spatial and temporal structure of the action well. Nowadays,
however, local representations receive more research attention, because global
representations suffer more in cases of differing viewing angles or occlusion.
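As an illustration of the template idea, a minimal MHI/MEI computation might look like the following; the decay parameter `tau` and the motion threshold are assumed values for this sketch, not those of [15].

```python
import numpy as np

def motion_history_image(frames, tau=10, threshold=25):
    """Compute MHI and MEI templates from a grayscale frame sequence.
    Pixels with recent motion get high MHI values that decay over time;
    the MEI is the binary union of the motion regions."""
    frames = [np.asarray(f, dtype=np.int16) for f in frames]
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, cur in zip(frames, frames[1:]):
        moving = np.abs(cur - prev) > threshold
        mhi[moving] = tau                                   # refresh where motion occurred
        mhi[~moving] = np.maximum(mhi[~moving] - 1, 0)      # decay elsewhere
    mei = (mhi > 0).astype(np.uint8)
    return mhi, mei

# Toy usage: a bright square moving right across a dark background.
frames = []
for t in range(5):
    f = np.zeros((20, 20), dtype=np.uint8)
    f[5:10, 2 + 3 * t: 7 + 3 * t] = 255
    frames.append(f)
mhi, mei = motion_history_image(frames, tau=10)
```

The brightest MHI pixels mark the most recent motion, which is exactly the temporal cue the template encodes.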
Master student: Tuan Dung LE – CBC17016
Page 17
Figure 1. 4 MHI (middle row) and MEI (last row) template [15].
Local representations of actions use a pipeline consisting of interest-point
detection, local feature extraction, and aggregation of the local features into an
action representation vector. Many detectors have been proposed for finding
interest points in video: Harris3D [16], Cuboid [17], Hessian3D [18], and dense
sampling. After detection, a descriptor vector is computed for each point based on
the image intensities in its three-dimensional neighborhood. The most commonly
used descriptors are Cuboid, HOG/HOF, HOG3D, and ESURF. Next, the local
descriptors are used to train a BoW model, producing a descriptor vector for each
video. Finally, the video's representation vector is passed through a classifier to
assign an action label. In [19], the authors experimented with combinations of
detectors and descriptors on two datasets, KTH and UCF Sports. On KTH, the
highest accuracy was achieved by combining the Harris3D detector with the HOF
descriptor, while on UCF Sports the best combination was dense sampling with the
HOG3D descriptor. In summary, it is difficult to judge which combination is best
for every situation.
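The detect → describe → quantize → classify pipeline above can be sketched with toy data. Here k-means plays the role of the codebook learner and random vectors stand in for real HOG/HOF descriptors, so every size and parameter is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# Toy stand-ins for local spatio-temporal descriptors (e.g., HOG/HOF around STIPs):
# 8 videos, each with 30 detected keypoints described by 16-dimensional vectors.
rng = np.random.default_rng(42)
train_videos = [rng.normal(size=(30, 16)) for _ in range(8)]
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# 1) Learn a codebook by clustering all training descriptors.
codebook = KMeans(n_clusters=20, n_init=10, random_state=0)
codebook.fit(np.vstack(train_videos))

# 2) Quantize each video into a normalized visual-word histogram.
def to_histogram(descriptors):
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=20).astype(float)
    return hist / hist.sum()

X = np.array([to_histogram(v) for v in train_videos])

# 3) Train a classifier on the video histograms.
clf = SVC(kernel="rbf").fit(X, labels)
```

Swapping the detector, descriptor, or codebook learner changes only one stage of this pipeline, which is why [19] could compare so many combinations.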
For problem (2), the two most common ways of combining information from
different views are early and late fusion. In early fusion, the feature descriptors
from the different views are concatenated into a final action vector before being
passed to a common classifier. In late fusion, a classifier is trained for each view
and their outputs are combined to produce the final result. G. Burghouts et al. [11]
use STIP features and the BoW model; their results show that late fusion achieves
the highest accuracy on the IXMAS dataset. R. Kavi et al. [20] extract features from
LMEI templates (similar to MEI but with more spatially distributed information)
and feed them to an LDA classifier for each individual view; the final result is
obtained by combining the outputs of the LDAs. Another method by R. Kavi et al.
[21], using an LSTM ConvNet architecture, reports results for both early and late
fusion strategies. These papers show that late fusion achieves higher recognition
accuracy than early fusion. There are two possible explanations. First, occlusion in
some views causes wrong or missing features from those views, making the final
concatenated action vector less effective for classification. Second, when a
multi-camera system is used and the person chooses different directions and
positions to perform actions, early fusion produces vectors that represent the same
action but are not correlated, because the person's appearance in each camera
differs; the accuracy of early fusion therefore decreases. Late fusion is also affected
by this, but when a classifier is trained for each view, it is enough for one view to
predict the correct action class with high probability for the final prediction to be
correct. To improve efficiency when the position and direction of the performance
differ between the training and evaluation data, R. Kavi et al. [20] proposed a
circular view shifting in the evaluation phase, with the aim of aligning the
evaluation data with the direction of the training set. This technique will also be
used and evaluated in the proposed framework of this thesis.
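The contrast between the two fusion strategies can be shown on synthetic two-view data. Logistic regression stands in here for whichever per-view classifier is used (LDA, SVM, ...), and all shapes are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy per-view action descriptors: 40 samples, 8 dimensions per view.
rng = np.random.default_rng(1)
n, d = 40, 8
view1 = rng.normal(size=(n, d))
view2 = rng.normal(size=(n, d))
y = np.array([0, 1] * (n // 2))

# Early fusion: concatenate per-view descriptors, train one common classifier.
early = LogisticRegression().fit(np.hstack([view1, view2]), y)
early_pred = early.predict(np.hstack([view1, view2]))

# Late fusion: one classifier per view, then average their posterior probabilities.
clf1 = LogisticRegression().fit(view1, y)
clf2 = LogisticRegression().fit(view2, y)
late_proba = (clf1.predict_proba(view1) + clf2.predict_proba(view2)) / 2
late_pred = late_proba.argmax(axis=1)
```

The structural difference is visible in the code: early fusion commits to one joint feature space, while late fusion lets a single confident view dominate the averaged posterior, which matches the explanation above for its robustness to occlusion.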
Section 1.2 introduces the baseline method that we adopt and on which we build
our improved framework.
1.2 Baseline method: combination of multiple 2D views in the Bag-of-Words
model
G. Burghouts et al. [11] proposed a BoW pipeline consisting of STIP features
[10] (Harris3D detector with HOG/HOF descriptors) extracted from video (Fig.
1.5), a random forest model [22] that transforms the features into histograms
serving as video descriptors, and an SVM classifier that predicts the action class.
The authors experimented with several strategies for combining the information
from multiple views: combining features (early fusion), combining video
descriptors (intermediate fusion), and combining posterior probabilities (late
fusion) (Fig. 1.6). Experimental results showed that averaging the prediction
probabilities from all views yields the highest accuracy on the IXMAS dataset.
Figure 1. 5 Illustration of spatio-temporal interest points detected in a video of a
person clapping [16].
Figure 1. 6 Three ways to combine multiple 2D views information in the
BoW model [11].
Assume that we have observations from M views capturing human actions and a
set of N different actions to be recognized. For a specific view, we perform the
following steps. First, STIP features are extracted from the video. For each detected
local keypoint, a histogram of oriented gradients (HOG) and a histogram of optical
flow (HOF) over a 3 × 3 × 2 grid of spatio-temporal blocks are computed to
capture shape and motion information in the local neighborhood of the point.
Concatenating the HOG and HOF histograms yields a descriptor with 162 values.
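As a sanity check on the descriptor length, the arithmetic works out as follows under the commonly used STIP configuration (4 orientation bins for HOG and 5 bins for HOF, including a "no motion" bin; the bin counts are an assumption, since the text does not state them).

```python
cells = 3 * 3 * 2                   # spatio-temporal cells per interest point
hog_bins, hof_bins = 4, 5           # assumed bin counts per cell
hog_len = cells * hog_bins          # gradient-orientation part of the descriptor
hof_len = cells * hof_bins          # optical-flow part of the descriptor
descriptor_len = hog_len + hof_len  # total length of the concatenated HOG/HOF vector
```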
Second, for each action, a random forest model is trained to learn a codebook.
Random forests are used instead of k-means clustering because of their ability to
create a more discriminative codebook and their speed [23]. The training input
consists of a set of positive features from one action class and a set of negative
features from the other classes. There are two options for collecting negative
features: random and selective [22]; in this study we follow the random option (the
selective option could be an axis of future work). Each forest contains 10 trees,
each tree has 32 leaves, and each leaf corresponds to a codeword in the codebook.
We thus have a set of codebooks for each view m:

    D_m = { D_m^1, D_m^2, ..., D_m^N }                              (1.1)

where D_m^i is the codebook for the i-th action in view m.
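The codebook construction for one action class might be sketched as follows, with scikit-learn's random forest standing in for the forest of [22] and synthetic descriptors standing in for real STIPs. Note that `max_leaf_nodes=32` bounds each tree at 32 leaves rather than forcing exactly 32, so this sketch yields at most 320 bins.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic 162-dimensional STIP descriptors: positives from the target
# action class, negatives sampled at random from the other classes.
rng = np.random.default_rng(7)
pos = rng.normal(loc=0.5, size=(200, 162))
neg = rng.normal(loc=-0.5, size=(200, 162))
X = np.vstack([pos, neg])
y = np.array([1] * 200 + [0] * 200)

# 10 trees, each with at most 32 leaves -> at most 10 * 32 = 320 codewords.
forest = RandomForestClassifier(n_estimators=10, max_leaf_nodes=32,
                                random_state=0).fit(X, y)

# Per tree, map each leaf's raw node id to a consecutive bin index.
leaf_maps = []
for est in forest.estimators_:
    leaf_ids = np.where(est.tree_.children_left == -1)[0]
    leaf_maps.append({int(node): i for i, node in enumerate(leaf_ids)})

def video_histogram(descriptors):
    """Quantize a video's STIP descriptors: each descriptor lands in one
    leaf per tree, and the normalized leaf counts form the video histogram."""
    leaves = forest.apply(descriptors)          # (n_points, n_trees) leaf node ids
    hist = np.zeros(sum(len(m) for m in leaf_maps))
    offset = 0
    for t, m in enumerate(leaf_maps):
        for node in leaves[:, t]:
            hist[offset + m[int(node)]] += 1
        offset += len(m)
    return hist / hist.sum()

h = video_histogram(rng.normal(size=(50, 162)))
```

Because the trees are trained discriminatively against the negatives, descriptors of the target action and of other actions tend to fall into different leaves, which is the advantage over k-means cited above.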
The next step is to quantize the STIP features of a video into a histogram. We
pass the STIP features through the learned forests and obtain a 320-bin normalized
histogram describing the video. Finally, this descriptor is used to train a binary
classifier SVM_m^i for the i-th class. Corresponding to the codebooks, we have a
set of binary SVMs:

    SVM_m = { SVM_m^1, SVM_m^2, ..., SVM_m^N }                      (1.2)

where SVM_m^i is the classifier for the i-th action in view m.
The SVM is trained using a chi-squared kernel (C = 1), and the SVM outputs are
posterior probabilities. In the binary case, the probabilities are calibrated using Platt
scaling: a logistic regression on the SVM's scores, fit by an additional
cross-validation on the training data. For each test sample in view m, we obtain a
set of probabilities:

    P_m = { P_m^1, P_m^2, ..., P_m^N }                              (1.3)

where P_m^i is the probability of action i in view m.
Then the posterior probabilities from all views are combined by taking their
average:

    P^i = (1/M) Σ_{j=1}^{M} P_j^i                                   (1.4)

where P^i is the probability that the test sample belongs to action i.
The label is assigned to the class having the highest probability:

    k = argmax_{i ∈ {1,...,N}} P^i                                  (1.5)
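Equations (1.3)-(1.5) amount to averaging Platt-calibrated per-view SVM posteriors and taking the arg max. The sketch below uses synthetic 320-bin histograms, and scikit-learn's RBF-kernel multiclass SVC replaces the chi-squared binary SVMs of the baseline, so it only illustrates the fusion step, not the exact classifier.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n_views, n_classes, n_train = 3, 4, 60

# Toy 320-bin video histograms for each view, with shared class labels.
Xs = [rng.random((n_train, 320)) for _ in range(n_views)]
y = np.arange(n_train) % n_classes          # 15 samples per class

# One probabilistic SVM per view; probability=True enables Platt scaling,
# fit by an internal cross-validation on the training data.
svms = [SVC(kernel="rbf", C=1.0, probability=True, random_state=0).fit(X, y)
        for X in Xs]

# Late fusion: average the posterior probabilities over all views (Eq. 1.4),
# then assign the class with the highest mean probability (Eq. 1.5).
test = [rng.random((1, 320)) for _ in range(n_views)]
P = np.mean([svm.predict_proba(x) for svm, x in zip(svms, test)], axis=0)
k = int(P.argmax())
```

Since each per-view posterior sums to one, the averaged vector P does too, and the arg max in (1.5) is well defined.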
Limitations & towards improvement
This method showed good performance on multi-view human action recognition,
achieving 96.4% accuracy on the IXMAS dataset (with selective negative samples
for the random forest). We also tested this method with random negative samples
for the random forest and obtained 88% accuracy on the same dataset. However, the
local STIP descriptor provides shape and motion information around the keypoint
but lacks location information (both spatial and temporal coordinates). Moreover,
the BoW model provides the distribution of visual words in a video but lacks
information on their order of appearance as well as the spatial correlation between
visual words. These factors may lead to confusion between actions that have
similar local information but actually differ in the relative positions of body parts,
such as arms and legs, or in their order of appearance. Based on this observation,
several methods have proposed adding spatial/temporal information about the local
features into the BoW model and have shown improved performance on
single-view human action recognition [24, 25]. Parul Shukla et al. [24] divide a
video into several parts in the time domain. M. Ullah et al. [25] divide the detected
spatial domain into smaller parts using action, motion, and object detectors. The
common goal is to obtain a final descriptor that carries more information by
combining information from smaller spatial/temporal parts.
Based on these ideas, we propose a framework for human action recognition
using multiple views, described in detail in the next chapter.
CHAPTER 2. PROPOSED FRAMEWORK
2.1 General framework
We propose a framework for human action recognition from multi-view cameras,
as illustrated in Fig. 2.1. The dotted block shows the processing steps for each
camera view. Given a video sequence of an action, we start by extracting STIP
features. Next, a BoW model is used to represent the action. However, to better
distinguish activities with small motions, we propose combining spatial/temporal
information with the BoW model (the spatial-temporal pooling block) to describe
action classes better. We use a simple background subtraction algorithm whose
output helps us add the spatial information of the local features. In addition,
because the person's direction of performance may differ between the training data
and the evaluation data, we also apply view shifting in the testing phase. The main
contributions are described in more detail in the following sections.
Figure 2.1 Proposed framework.
2.2 Combination of spatial/temporal information and Bag-of-Words model
2.2.1 Combination of spatial information and Bag-of-Words model (S-BoW)
STIP features are detected in the sequence of original images, and the descriptor
is obtained by computing histograms of spatial gradients and optical flow
accumulated in space-time neighborhoods of the detected interest points [14]. This
takes small background movements into account; to avoid such bad effects of the
background, we detect STIP features only in the moving ROIs, obtained by a
conventional background subtraction technique. Furthermore, there is a large
interclass similarity among actions: some actions differ from others only by a
minor movement of a body part (hand, foot). To enrich the BoW model with spatial
information, for each frame we divide the ROI (bounding box) into s parts
according to the human body structure. We try several ways of dividing (Fig. 2.2)
and compare their effectiveness in the experimental sections: the bounding box is
divided into 3 or 4 spatial parts based only on its height (Fig. 2.2e), or into 6 spatial
parts using the centroid's coordinates (Fig. 2.2f). For each spatial part, we compute
the histogram corresponding to the features located in that part; these histograms
are then concatenated into the final vector.
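The per-frame pooling step can be sketched as follows. The horizontal-band split corresponds to the height-only division (Fig. 2.2e); all coordinates, word indices, and vocabulary sizes are toy values.

```python
import numpy as np

def spatial_pooled_histogram(points, words, bbox, n_words, parts=4):
    """S-BoW sketch: split the person bounding box into `parts` horizontal
    bands, build one visual-word histogram per band, and concatenate them."""
    x0, y0, x1, y1 = bbox
    band_h = (y1 - y0) / parts
    hists = np.zeros((parts, n_words))
    for (x, y), w in zip(points, words):
        band = min(int((y - y0) // band_h), parts - 1)  # clamp the bottom edge
        hists[band, w] += 1
    flat = hists.ravel()
    return flat / max(flat.sum(), 1)

# Toy usage: 4 detected STIPs in a 100x200 bounding box, vocabulary of 5 words.
pts = [(10, 20), (15, 90), (30, 150), (80, 199)]
words = [0, 2, 2, 4]
hist = spatial_pooled_histogram(pts, words, bbox=(0, 0, 100, 200),
                                n_words=5, parts=4)
```

Because the same visual word counts into different bins depending on which body band it falls in, two actions that share local features but move different body parts now receive different descriptors, which is exactly the motivation above.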