BỘ GIÁO DỤC VÀ ĐÀO TẠO
TRƯỜNG ĐẠI HỌC BÁCH KHOA HÀ NỘI
Khổng Văn Minh
KẾT HỢP ĐẶC TRƯNG DIỆN MẠO VÀ CHUYỂN ĐỘNG
TRONG BIỂU DIỄN HOẠT ĐỘNG CỦA NGƯỜI SỬ DỤNG
MẠNG NƠ RON TÍCH CHẬP
Chuyên ngành: Hệ thống thông tin
LUẬN VĂN THẠC SĨ KHOA HỌC
HỆ THỐNG THÔNG TIN
NGƯỜI HƯỚNG DẪN KHOA HỌC :
TS. Trần Thị Thanh Hải
Hà Nội – Năm 2018
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
KHONG VAN MINH
COMBINATION OF APPEARANCE AND MOTION
INFORMATION IN HUMAN ACTION REPRESENTATION
USING CONVOLUTIONAL NEURAL NETWORK
FIELD OF STUDY: INFORMATION SYSTEMS
MASTER'S THESIS
IN INFORMATION SYSTEMS
SUPERVISOR:
Dr. Tran Thi Thanh Hai
HANOI – 2018
SĐH.QT9.BM11
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
CONFIRMATION OF MASTER'S THESIS REVISION
Full name of thesis author: Khổng Văn Minh
Thesis title: Combination of appearance and motion information in human action representation using convolutional neural network
Field of study: Information Systems
Student ID: CBC17021
The author, the scientific supervisor, and the thesis examination committee confirm that the author has revised and supplemented the thesis according to the minutes of the committee meeting dated ….........................………… with the following contents:
……………………………………………………………………………………………………..
……………………………………………………………………………………………………..
……………………………………………………………………………………………………..
……………………………………………………………………………………………………..
……………………………………………………………………………………………………..
……………………………………………………………………………………………………..
……………………………………………………………………………………..
Date: day ... month ... year ...
Supervisor
Thesis author
CHAIRMAN OF THE COMMITTEE
Abstract
In this thesis, I focus on the problem of recognizing human actions in a video or a stack of consecutive frames. This problem plays an important role in surveillance systems, which are very popular nowadays. There are two main approaches: using hand-crafted features or using features learned by deep learning. Both approaches have pros and cons; the approach that I study belongs to the second category. Recently, techniques relying on convolutional neural networks have produced impressive improvements over traditional techniques based on hand-crafted features. In addition, the literature shows that using several streams of data helps to increase recognition performance. This thesis proposes a method that exploits both RGB and optical flow for human action recognition. Specifically, we deploy a two-stream convolutional neural network that takes as inputs the RGB frames and the optical flow computed from them. Each stream has the architecture of an existing 3D convolutional neural network (C3D), which has been shown to be compact yet efficient for action recognition from video. The streams work independently and are then combined by early fusion or late fusion to output the recognition result. We show that the proposed two-stream 3D convolutional neural network (two-stream C3D) outperforms the one-stream C3D on three datasets: UCF101 (from 82.79% to 89.11%), HMDB51 (from 45.71% to 60.87%), and CMDFALL (from 65.35% to 71.77%).
Acknowledgments
Firstly, I would like to express my deep gratitude to my supervisor, Dr. Tran Thi Thanh Hai, for supporting my research direction, which allowed me to explore new ideas in the field of computer vision and machine learning. I would like to thank her for her supervision, encouragement, motivation, and support; her guidance helped me throughout the research work and the writing of this thesis.
I would like to acknowledge the International Research Institute MICA, HUST for providing me with a great research environment.
I wish to express my gratitude to the teachers in the Computer Vision department, MICA for giving me the opportunity to work and acquire great research experience.
I would like to acknowledge the School of Information and Communication Technology for providing me with the knowledge and the opportunity to study.
I would like to thank my friends for supporting me in my studies.
Last but not least, I would like to convey my deepest gratitude to my family for their support and sacrifices during my studies.
Contents
1 Introduction to Human Action Recognition
  1.1 Human Action Recognition problem
  1.2 Overview of human action recognition approach
    1.2.1 Hand crafted feature based methods
    1.2.2 Deep learning based methods
    1.2.3 Purpose of thesis
2 State-of-the-art on HAR using CNN
  2.1 Introduction to Convolutional Neural Networks
  2.2 2D Convolutional Neural Networks
  2.3 3D Convolutional Neural Networks
  2.4 Multistream Convolutional Neural Networks
3 Proposed method for HAR using multistream C3D
  3.1 General framework
  3.2 RGB stream
  3.3 Optical Flow Stream
  3.4 Fusion of multistream 3D CNN
    3.4.1 Early fusion
    3.4.2 Late fusion
4 Experimental Results
  4.1 Datasets
    4.1.1 UCF101 dataset
    4.1.2 HMDB51 dataset
    4.1.3 CMDFALL dataset
  4.2 Experiment setup
  4.3 Single stream
  4.4 Multiple stream
5 Conclusion
  5.1 Pros and Cons
  5.2 Discussion
List of Figures
1-1 Human Action Recognition Problem
1-2 Human Action Recognition phases
1-3 Hand-crafted feature based method for Human Action Recognition
1-4 Deep learning method for Human Action Recognition problem
2-1 Main layers in Convolutional Neural Networks
2-2 Fusion techniques used in [1]
2-3 3D convolution operator
2-4 Two stream architecture for Human Action Recognition in [2]
3-1 General framework for human action recognition
3-2 Early fusion method by concatenating two L2-normalized feature vectors
3-3 Late fusion by averaging class scores
4-1 The class labels in UCF101 dataset
4-2 The class labels in HMDB51 dataset
4-3 Experiment steps for each dataset
4-4 The steps using C3D for experiment
4-5 C3D clip and video prediction
4-6 Confusion matrix of two stream on UCF101
4-7 Confusion matrix of two stream on HMDB51
4-8 Confusion matrix of two stream on CMDFALL
4-9 In HMDB51, the most confused action in the RGB stream is swing baseball. 60% of its videos are confused with throw.
4-10 Most benefit classes in UCF101 when combining compared to RGB stream
4-11 Most benefit classes in HMDB51 when combining compared to RGB stream
4-12 Most benefit classes in HMDB51 when combining compared to RGB stream
4-13 Classes of UCF101 in which RGB stream performs better
4-14 Classes of UCF101 in which Flow stream performs better
4-15 Classes of HMDB51 in which RGB stream performs better
4-16 Classes of HMDB51 in which Flow stream performs better
4-17 Classes of CMDFALL in which RGB stream performs better
4-18 Classes of CMDFALL in which Flow stream performs better
Acronyms
3DCNN  3D Convolutional Neural Networks
CNN    Convolutional Neural Networks
HAR    Human Action Recognition
HOG    Histogram of Oriented Gradients
MBH    Motion Boundary Histograms
SIFT   Scale-Invariant Feature Transform
List of Tables
2.1 Results of fusion techniques on the 200,000 videos of the Sport1M test set. Hit@k indicates the fraction of test samples that contained at least one of the ground truth labels in the top k predictions [1].
2.2 C3D results on different tasks
2.3 Two-stream architecture mean accuracy (%) on UCF101 and HMDB51 dataset
4.1 Class tree of CMDFALL dataset
4.2 Accuracy of action recognition on single and multiple streams C3D (%)
4.3 Comparison result on two popular benchmark datasets (%)
Chapter 1
Introduction to Human Action Recognition
1.1 Human Action Recognition problem
Human action recognition is an important topic in computer vision. It has many applications, such as surveillance systems in hospitals, abnormal activity detection in buildings (banks, airports, hotels), and human-machine interaction. There are various types of human activities. Depending on their complexity, we can categorize them into four levels: gestures, actions, interactions, and group activities.
∙ Gestures are elementary movements of a person's body part and are the atomic components describing the meaningful motion of a person. Examples: "stretching an arm", "raising a leg".
∙ Actions are single-person activities that may be composed of multiple gestures organized temporally, such as "walking", "waving", and "punching".
∙ Interactions are human activities that involve two or more persons and/or objects. For example, "two persons fighting" is an interaction between two humans, and "drinking water" is an interaction between a human and an object.
∙ Group activities are activities performed by conceptual groups composed of multiple persons and/or objects. Example: "a group of persons marching".
Figure 1-1: Human Action Recognition Problem
In this thesis, we focus on human action recognition. The problem can be defined as follows.
∙ Input: A video or a sequence of consecutive frames that contains a human action.
∙ Output: The label of the action, which belongs to one of the predefined classes.
Human action recognition is challenging for computer vision researchers because of noisy backgrounds, viewpoint changes, and the variety in how each person performs an action. Figure 1-1 illustrates the human action recognition problem.
Key components of a visual recognition system
Figure 1-2 illustrates the two phases of a recognition system.
∙ Training: Learn the parameters of the recognition model from the training dataset.
∙ Recognition: Use the model learned in the training phase to recognize new data.
Figure 1-2: Human Action Recognition phases
Each phase has the following main components:
∙ Preprocessing: Convert the data into a form that is compatible with the model.
∙ Feature extraction: From the preprocessed data, extract features suitable for representing the human action. The features can be hand-crafted or learned by deep learning techniques.
∙ Classification: Use the features extracted in the previous step as input for training the classifier or for prediction.
∙ Recognition: New data goes through preprocessing and feature extraction, and the trained classifier then predicts its label.
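The two phases and their components can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the features (mean intensity and mean absolute frame difference), the nearest-centroid classifier, and all function names are hypothetical stand-ins chosen for brevity, not the features or classifier used in this thesis.

```python
from typing import Dict, List

Frame = List[float]   # a flattened frame (toy stand-in for an image)
Video = List[Frame]   # a sequence of consecutive frames


def preprocess(video: Video) -> Video:
    """Preprocessing: scale pixel values from [0, 255] to [0, 1]."""
    return [[p / 255.0 for p in frame] for frame in video]


def extract_features(video: Video) -> List[float]:
    """Toy features: mean intensity (appearance) and mean absolute
    difference between consecutive frames (motion)."""
    n_pixels = sum(len(frame) for frame in video)
    appearance = sum(sum(frame) for frame in video) / n_pixels
    diffs = [abs(a - b)
             for f0, f1 in zip(video, video[1:])
             for a, b in zip(f0, f1)]
    motion = sum(diffs) / len(diffs) if diffs else 0.0
    return [appearance, motion]


def train(videos: List[Video], labels: List[str]) -> Dict[str, List[float]]:
    """Training phase: learn one feature centroid per class."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for video, label in zip(videos, labels):
        feat = extract_features(preprocess(video))
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def recognize(model: Dict[str, List[float]], video: Video) -> str:
    """Recognition phase: same preprocessing and features, then the
    label of the nearest class centroid."""
    feat = extract_features(preprocess(video))

    def squared_distance(centroid: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(feat, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


# Two tiny made-up training videos with 2 pixels per frame.
static_video = [[100, 100], [100, 100], [100, 100]]
moving_video = [[0, 255], [255, 0], [0, 255]]
model = train([static_video, moving_video], ["standing", "waving"])
```

Calling `recognize(model, [[90, 90], [90, 90]])` runs a new clip through the same preprocessing and feature extraction steps before classification, which is the point of sharing those components between the two phases.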
Figure 1-3: Hand-crafted feature based method for Human Action Recognition
1.2 Overview of human action recognition approaches
1.2.1 Hand-crafted feature based methods
In this approach, human actions are represented by features that are manually designed by highly experienced researchers. Once features are extracted, they are fed to a generic trainable classifier for action recognition. The building blocks of the hand-crafted feature based approach are illustrated in Figure 1-3:
∙ Feature extraction: Takes the pixels of an image or video as input and outputs the features of that image or video.
∙ Classification: A classifier that takes the features as input and outputs a class label.
There are many types of hand-crafted features designed by experts to solve the human action recognition problem. Many classical image features have been generalized to videos, e.g. 3D-SIFT and HOG3D. Among local space-time features, dense trajectories have been shown to perform best on a variety of datasets. The main idea is to densely sample feature points in each frame and track them through the video based on optical flow. Multiple descriptors are computed along the trajectories of the feature points to capture shape, appearance, and motion information. Motion boundary histograms (MBH) give the best results among these descriptors. The idea of dense trajectories was extended by Wang and Schmid [3], who improved performance by taking camera motion into account and achieved the state of the art among hand-crafted features. Despite its good performance, this method is computationally intensive.
Figure 1-4: Deep learning method for Human Action Recognition problem
1.2.2 Deep learning based methods
In contrast, learning-based representation approaches, and deep learning in particular, use computational models with multiple processing layers to learn multiple levels of abstraction from data. Deep learning encompasses a set of methods that enable the machine to process data in raw form and automatically transform it into a representation suitable for classification. These are what we call trainable feature extractors. The transformation is carried out by successive layers, which are learned from raw data using a general-purpose learning procedure and do not need to be designed manually by experts. The performance of human action recognition methods depends mainly on an appropriate and efficient representation of the data.
Recently, deep learning has achieved very good results on image-based tasks [4]. These results inspired researchers to extend it to video classification, and especially to human action recognition. To deal with video input, the authors of [1] apply a 2D CNN to individual frames and exploit temporal information by fusing information over the temporal dimension through the network. In [5] and [6], the authors use a 3D convolution operator to learn temporal information. In [2], the authors decompose video into spatial and temporal parts.
Deep learning methods require a large amount of training data to achieve good results. In [1], the authors construct a large-scale dataset named Sport1M, which consists of 1 million videos downloaded from YouTube and annotated with 487 classes. Features learned from this dataset generalize well to other datasets such as UCF101 [7].
1.2.3 Purpose of thesis
In this thesis, we propose to improve an existing 3D convolutional network, specifically the C3D network [6]. Although C3D was designed with 3D kernels in which one dimension is temporal, the filter size is only 3 × 3 × 3, which seems unable to represent long temporal variation. Therefore, instead of using only the RGB stream, we deploy two streams (RGB and optical flow). Each stream goes through an independent C3D network, and the streams are then combined at the fully-connected level or at the score level. We evaluate the proposed method on two popular and challenging benchmark datasets (UCF101 and HMDB51) and on a dataset built by MICA (CMDFALL), and show that the two-stream C3D outperforms the original one-stream C3D.
The thesis is organized as follows. In Chapter 2, we present the state of the art on Human Action Recognition using CNN. In Chapter 3, we describe our proposed method, which uses 3D convolutional neural networks in a two-stream architecture for action recognition. In Chapter 4, we report and analyse the results on UCF101, HMDB51, and CMDFALL. Chapter 5 concludes and gives ideas for future work.
Chapter 2
State-of-the-art on HAR using CNN
2.1 Introduction to Convolutional Neural Networks
Convolutional Neural Networks (CNN) are biologically inspired variants of multilayer perceptrons. They have been very effective in areas such as image recognition and classification. There are four main types of layers used to build ConvNet architectures: the convolutional layer, the non-linearity layer, the pooling layer, and the fully-connected layer. These layers are stacked to form a full ConvNet architecture.
Figure 2-1: Main layers in Convolutional Neural Networks
Convolutional layer
The convolutional (CONV) layer is the core building block of a convolutional network. Its parameters consist of a set of learnable filters. Every filter is small spatially (along width and height) but extends through the full depth of the input volume. For example, a typical filter in the first layer of a ConvNet might have size 5×5×3 (5 pixels in width and height, and 3 for the number of channels of an RGB image). During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute the dot product between the entries of the filter and the input at each position. This produces a 2-dimensional activation map that gives the response of that filter at every spatial position. Intuitively, the network learns filters that activate when they see some type of visual feature, such as an edge of some orientation or a blotch of some color in the first layer. Each CONV layer has an entire set of filters, and each of them produces a separate 2-dimensional activation map. These activation maps are stacked along the depth dimension to produce the output volume.
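The forward pass described above can be sketched in plain Python. This is a minimal illustration, assuming no padding, square filters, and the cross-correlation form of the operation that is common in CNN software; the function name `conv2d_forward` and the constant-valued inputs are hypothetical, and a real implementation would rely on an optimized library.

```python
def conv2d_forward(volume, filters, stride=1):
    """Slide each filter over the width and height of the input volume.

    volume:  nested list of shape [H][W][C]
    filters: list of nested lists, each of shape [k][k][C]
    returns: nested list of shape [H_out][W_out][len(filters)]
    """
    height, width = len(volume), len(volume[0])
    k = len(filters[0])
    out_h = (height - k) // stride + 1
    out_w = (width - k) // stride + 1
    output = [[[0.0] * len(filters) for _ in range(out_w)] for _ in range(out_h)]
    for f_idx, filt in enumerate(filters):
        for i in range(out_h):
            for j in range(out_w):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        pixel = volume[i * stride + di][j * stride + dj]
                        weights = filt[di][dj]
                        # Dot product across the full depth of the input.
                        acc += sum(p * w for p, w in zip(pixel, weights))
                # One activation map per filter, stacked along depth.
                output[i][j][f_idx] = acc
    return output


# A 32x32 RGB image of ones and two 5x5x3 filters with constant weights.
image = [[[1.0] * 3 for _ in range(32)] for _ in range(32)]
filters = [
    [[[1.0] * 3 for _ in range(5)] for _ in range(5)],
    [[[0.5] * 3 for _ in range(5)] for _ in range(5)],
]
activation = conv2d_forward(image, filters)
```

With these inputs the output volume has size 28×28×2: the spatial size shrinks from 32 to 28 because a 5×5 filter fits in 28 positions per axis, and the depth equals the number of filters.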
Non-Linearity layer (ReLU)
An additional operation called ReLU is applied after every convolution operation. ReLU stands for Rectified Linear Unit and is a non-linear operation. Its output is given by Output = max(0, Input). ReLU is an element-wise operation (applied per pixel) that replaces all negative values in the feature map with zero. The purpose of ReLU is to introduce non-linearity into the ConvNet, since most of the real-world data we want the ConvNet to learn is non-linear. Other non-linear functions such as tanh or sigmoid can be used instead, but ReLU has been found to perform better in most situations.
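As a small illustration of the element-wise rule Output = max(0, Input), the sketch below applies ReLU to a made-up 2×2 feature map; the function name and the values are hypothetical.

```python
def relu(feature_map):
    """Replace every negative value in a 2D feature map with zero."""
    return [[max(0.0, value) for value in row] for row in feature_map]


feature_map = [[-1.5, 2.0], [0.0, -0.3]]
rectified = relu(feature_map)
```

Positive values pass through unchanged, so only the two negative entries are affected.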
Pooling layer
The pooling layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The most common form is a pooling layer with filters of size 2×2 applied with a stride of 2, which downsamples every depth slice of the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation in this case takes the maximum over 4 numbers (a 2×2 region in some depth slice). The depth dimension remains unchanged.
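The 2×2, stride-2 case can be checked on a small example. The sketch below pools a single made-up 4×4 depth slice; the function name is hypothetical and the code assumes even height and width.

```python
def max_pool_2x2(depth_slice):
    """2x2 max pooling with stride 2 on one depth slice of shape [H][W]."""
    return [
        [
            max(depth_slice[i][j], depth_slice[i][j + 1],
                depth_slice[i + 1][j], depth_slice[i + 1][j + 1])
            for j in range(0, len(depth_slice[0]), 2)
        ]
        for i in range(0, len(depth_slice), 2)
    ]


depth_slice = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(depth_slice)

# Fraction of activations that survive pooling.
kept_fraction = (len(pooled) * len(pooled[0])) / (len(depth_slice) * len(depth_slice[0]))
```

The 16 input activations are reduced to 4, so `kept_fraction` is 0.25, which corresponds to the 75% of activations discarded. Applying the same function to each depth slice separately leaves the depth dimension unchanged.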
Fully-Connected layer
Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed
with a matrix multiplication followed by a bias offset.
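The matrix multiplication followed by a bias offset can be written out as follows. The weights, biases, and input activations are made-up numbers used only to show the computation.

```python
def fully_connected(activations, weights, biases):
    """Each output neuron is a weighted sum of all inputs plus a bias.

    activations: list of N input values
    weights:     list of M rows, each with N weights (an M x N matrix)
    biases:      list of M bias values
    """
    return [
        sum(w * a for w, a in zip(row, activations)) + b
        for row, b in zip(weights, biases)
    ]


activations = [1.0, 2.0, 3.0]
weights = [
    [1.0, 0.0, -1.0],
    [0.5, 0.5, 0.5],
]
biases = [0.5, -1.0]
outputs = fully_connected(activations, weights, biases)
```

Every output depends on every input, which is what distinguishes this layer from a convolutional layer, where each output only sees a small spatial neighbourhood.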
In the rest of this chapter, we present related work on action recognition using CNN techniques. We categorize it into three groups: methods based on 2D convolutional neural networks, methods based on 3D convolutional neural networks, and methods using multiple streams.
2.2 2D Convolutional Neural Networks
Figure 2-2: Fusion techniques used in [1]
Recently, 2D ConvNets have obtained very good results on image-based tasks [4]. Encouraged by these results, the authors of [1] study multiple approaches for extending CNNs to video input. As a baseline, they use a 2D CNN operating on a single frame to evaluate the contribution of static appearance to classification accuracy. To learn the information that lies in the temporal domain and study how it influences performance, they use the fusion techniques shown in Figure 2-2:
∙ Early fusion: T consecutive frames are taken from the video as input, and the filters of the first convolutional layer are extended to size 11 × 11 × 3 × T. In [1], T = 10, which is approximately a third of a second. This technique combines information immediately at the pixel level and allows the network to learn the local motion information of the video.
∙ Late fusion: Two separate CNN towers are used, each taking a single frame as input. The two frames are 15 frames apart in the video. The temporal information is then combined at the first fully-connected layer, which is a high-level abstraction. This model can learn the global motion information in the video.
∙ Slow fusion: This model is a balanced mix of the two approaches that slowly combines temporal information through the network. The lower layers process local temporal information, while the higher layers have access to global temporal information. In [1], the first convolutional layer is applied with temporal extent T = 4 and stride 2 on an input clip of 10 frames. The second and third layers have temporal extent T = 2 and stride 2. Thus, the third convolutional layer has access to information across all 10 frames.
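The claim that the third layer of the slow fusion model sees all 10 frames follows from simple arithmetic on temporal extents and strides. The sketch below works it out, assuming valid (unpadded) temporal convolutions; the helper names are hypothetical.

```python
def temporal_outputs(num_inputs, extent, stride):
    """Number of responses in time for a valid temporal convolution."""
    return (num_inputs - extent) // stride + 1


def receptive_field(layers):
    """Number of input frames seen by one unit on top of the given layers.

    layers: list of (temporal extent, stride) pairs, lowest layer first.
    """
    field, jump = 1, 1
    for extent, stride in layers:
        field += (extent - 1) * jump
        jump *= stride
    return field


# Slow fusion as described above: (T = 4, stride 2), then twice (T = 2, stride 2).
slow_fusion = [(4, 2), (2, 2), (2, 2)]

steps = 10  # frames in the input clip
trace = []
for extent, stride in slow_fusion:
    steps = temporal_outputs(steps, extent, stride)
    trace.append(steps)

frames_seen = receptive_field(slow_fusion)
```

The number of temporal responses shrinks from 10 frames to 4, then 2, then 1, and the single unit left after the third layer covers 4 + 2 + 4 = 10 input frames.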
They conducted experiments on the large-scale Sport1M dataset, which consists of 1 million videos downloaded from YouTube and annotated with 487 classes. The results in Table 2.1 show that the slow fusion model performs best.
Table 2.1: Results of fusion techniques on the 200,000 videos of the Sport1M test set. Hit@k indicates the fraction of test samples that contained at least one of the ground truth labels in the top k predictions [1].
Model          Clip Hit@1   Video Hit@1   Video Hit@5
Single frame   41.1         59.3          77.7
Early fusion   38.9         57.7          76.8
Late fusion    40.7         59.3          78.7
Slow fusion    41.9         60.9          80.2
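The Hit@k metric in the caption of Table 2.1 can be computed as sketched below. The class names and scores are made up for illustration, and the function name is hypothetical; only the definition of the metric comes from the caption.

```python
def hit_at_k(scores_per_sample, truth_per_sample, k):
    """Fraction of samples whose top-k predictions contain at least one
    ground truth label.

    scores_per_sample: list of dicts mapping class label -> score
    truth_per_sample:  list of sets of ground truth labels
    """
    hits = 0
    for scores, truth in zip(scores_per_sample, truth_per_sample):
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        if any(label in truth for label in top_k):
            hits += 1
    return hits / len(scores_per_sample)


# Two made-up test samples with three classes each.
scores = [
    {"tennis": 0.6, "squash": 0.3, "golf": 0.1},
    {"tennis": 0.2, "squash": 0.5, "golf": 0.3},
]
truth = [{"tennis"}, {"golf"}]

hit1 = hit_at_k(scores, truth, k=1)
hit2 = hit_at_k(scores, truth, k=2)
```

Only the first sample is correct at k = 1, while the second sample has its true label ranked second, so Hit@2 is higher than Hit@1. This is why the Video Hit@5 column of Table 2.1 is larger than the Video Hit@1 column.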
2.3 3D Convolutional Neural Networks
In [5] and [6], the authors extend the 2D convolution operator to the temporal dimension for video analysis tasks. They propose to perform 3D convolutions in the convolution stages of CNNs to compute features from both the spatial and temporal dimensions. The 3D convolution is achieved by convolving a 3D kernel with the fixed-size cube formed by stacking multiple contiguous frames together, as shown in Figure 2-3. By this construction, the feature maps in a convolution layer are connected to multiple contiguous frames in the previous layer, thereby capturing motion information.
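A minimal sketch of the 3D convolution operator on a single-channel cube of stacked frames is given below. It assumes no padding, stride 1, and the cross-correlation form of the operation; the function name, the synthetic clips, and the hand-set temporal-difference kernel are hypothetical and only meant to show how a kernel that spans several frames responds to change over time.

```python
def conv3d_valid(cube, kernel):
    """Valid 3D convolution of a [T][H][W] cube with a [kt][kh][kw] kernel."""
    t_len, height, width = len(cube), len(cube[0]), len(cube[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    return [
        [
            [
                sum(
                    cube[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                    for dt in range(kt)
                    for dy in range(kh)
                    for dx in range(kw)
                )
                for x in range(width - kw + 1)
            ]
            for y in range(height - kh + 1)
        ]
        for t in range(t_len - kt + 1)
    ]


# 7 stacked 5x5 frames; frame t has constant brightness t (a brightening clip).
cube = [[[t] * 5 for _ in range(5)] for t in range(7)]
# A clip that does not change over time.
static_cube = [[[1] * 5 for _ in range(5)] for _ in range(7)]

# 3x3x3 kernel: -1 on the first frame, 0 on the middle one, +1 on the last.
kernel = [[[w] * 3 for _ in range(3)] for w in (-1, 0, 1)]

response = conv3d_valid(cube, kernel)
static_response = conv3d_valid(static_cube, kernel)
```

The output has 5 temporal positions of size 3×3. Each output value depends on three contiguous frames, so this kernel gives a non-zero response on the brightening clip and zero on the static clip, something a kernel confined to one frame cannot distinguish.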
In [5], experiments are performed on the TRECVID 2008 data and the KTH data. The TRECVID 2008 dataset consists of 49 hours of real-world video captured at London Gatwick Airport. The KTH dataset consists of 6 action classes performed by 25 subjects. The input is a 7-frame cube in the TRECVID 2008 experiments and a 9-frame cube in the KTH experiments. The results show that 3D convolutional networks outperform 2D CNNs by a noticeable margin.
Figure 2-3: 3D convolution operator
In [6], the authors propose a 3D convolutional network called C3D to learn spatio-temporal features on the large-scale Sport1M dataset. They show that C3D has great learning capacity, captures the information in video well, and can process a large number of videos. They trained C3D on two large-scale datasets, I380K and Sport1M. The trained model can be used as a feature extractor on other datasets. They show that the 3D CNN architecture learns effective features from video by conducting experiments on different tasks: activity recognition, action similarity labeling, and scene and object recognition. Table 2.2 shows the results of C3D on these tasks. C3D outperforms most of the previous methods by a noticeable margin. Thus, C3D is generic in capturing appearance and motion information in videos.
Table 2.2: C3D results on different tasks
Dataset   Sport1M   UCF101   ASLAN   YUPENN   UMD    Object
Method    [8]       [9]      [10]    [11]     [11]   [12]
Result    90.8      75.8     68.7    96.2     77.7   12.0
C3D       85.2      85.2     78.3    98.1     87.7   22.3
2.4 Multistream Convolutional Neural Networks
In [2], the authors decompose videos into spatial and temporal components by using RGB frames and optical flow. These components are fed into separate ConvNets to learn spatial and temporal information about the appearance and movement of the objects in a scene. Each stream performs video recognition on its own, and for the final classification the softmax scores are combined by late fusion.
∙ The spatial stream operates on individual video frames and performs action recognition from still images.
∙ The temporal stream operates on the motion information of the video, in the form of stacked optical flow displacement fields between several consecutive frames.
For the spatial stream, the input to the network is a randomly selected frame from the video. A 224 × 224 sub-image is randomly cropped from the selected frame; it then undergoes random horizontal flipping and RGB jittering. For the temporal stream, they study several techniques to form the input:
∙ Optical flow stacking: The optical flow is computed by Brox's method. By stacking the horizontal and vertical components of the flow of L consecutive frames, they create an input volume of size 224 × 224 × 2L for the network.
∙ Trajectory stacking: An alternative motion representation, inspired by trajectory-based descriptors, replaces the optical flow sampled at fixed locations with flow sampled along the motion trajectories.
∙ Bi-directional optical flow: The optical flow in the techniques above is forward flow. In the bi-directional method, they stack L/2 forward flows computed from the L/2 frames following the current frame and L/2 backward flows computed from the L/2 frames preceding it.
∙ Mean flow subtraction: To zero-center the network input, they subtract from each displacement field its mean vector.
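Optical flow stacking and mean flow subtraction from the list above can be sketched as follows. The flow fields are made-up 2×2 examples rather than real optical flow, the function names are hypothetical, and the channel order (horizontal and vertical components interleaved frame by frame) is an assumption of this sketch.

```python
def subtract_mean_flow(flow):
    """Subtract the mean displacement vector from one flow field.

    flow: nested list of shape [H][W] holding (dx, dy) tuples
    """
    n = len(flow) * len(flow[0])
    mean_dx = sum(dx for row in flow for dx, _ in row) / n
    mean_dy = sum(dy for row in flow for _, dy in row) / n
    return [[(dx - mean_dx, dy - mean_dy) for dx, dy in row] for row in flow]


def stack_flows(flows):
    """Stack L flow fields into an [H][W][2L] input volume.

    Assumed channel order: dx_1, dy_1, dx_2, dy_2, ..., dx_L, dy_L.
    """
    height, width = len(flows[0]), len(flows[0][0])
    return [
        [[component for flow in flows for component in flow[y][x]]
         for x in range(width)]
        for y in range(height)
    ]


# L = 2 made-up flow fields on a 2x2 grid. In the first one, three pixels
# share the same displacement and one pixel moves differently.
flow_a = [[(1.0, 0.0), (1.0, 0.0)],
          [(1.0, 0.0), (5.0, 4.0)]]
flow_b = [[(0.0, 0.0), (0.0, 0.0)],
          [(0.0, 0.0), (0.0, 0.0)]]

centered = [subtract_mean_flow(f) for f in (flow_a, flow_b)]
volume = stack_flows(centered)
```

With L = 2 the stacked volume has 2L = 4 channels per pixel. After mean subtraction each displacement field sums to zero, which is the zero-centering described in the last item of the list.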
Figure 2-4: Two stream architecture for Human Action Recognition in [2]
They report that mean flow subtraction is helpful, as it reduces the effect of global motion between frames. The bi-directional optical flow input performs best for the temporal stream. However, for ConvNet fusion, uni-directional optical flow with multi-task learning is the most beneficial. The results in Table 2.3 show that combining multiple streams of information improves performance significantly (6% over the temporal net and 14% over the spatial net). This means that the information in the RGB and optical flow images is complementary.
Method                                    UCF101   HMDB51
Spatial stream ConvNet                    73.0     40.5
Temporal stream ConvNet                   83.7     54.6
Two-stream model (fusion by averaging)    86.9     58.0
Two-stream model (fusion by SVM)          88.0     59.4
Table 2.3: Two-stream architecture mean accuracy (%) on UCF101 and HMDB51 dataset
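The "fusion by averaging" row of Table 2.3 refers to averaging the class scores of the two streams. A minimal sketch is given below; the class names and score values are invented for illustration and are not results from [2], and the function names are hypothetical.

```python
def late_fusion_average(spatial_scores, temporal_scores):
    """Average the per-class softmax scores of the two streams."""
    return [(s + t) / 2 for s, t in zip(spatial_scores, temporal_scores)]


def predict(classes, scores):
    """Return the class with the highest score."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return classes[best_index]


classes = ["swing baseball", "throw", "walk"]

# Made-up softmax outputs of the two streams for one video.
spatial_scores = [0.40, 0.45, 0.15]   # appearance alone prefers "throw"
temporal_scores = [0.70, 0.20, 0.10]  # motion strongly prefers "swing baseball"

fused_scores = late_fusion_average(spatial_scores, temporal_scores)
spatial_prediction = predict(classes, spatial_scores)
fused_prediction = predict(classes, fused_scores)
```

In this invented example the spatial stream alone predicts "throw", while the fused scores favour "swing baseball". This illustrates how the two streams can correct each other when their information is complementary.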
The work in [13] builds upon the architecture of [2] and studies methods for fusing the two networks in both the spatial and temporal dimensions.
They study the following spatial fusion techniques:
∙ Sum fusion: Compute the sum of the two feature maps at the same spatial location.
∙ Max fusion: Take the maximum of the two feature maps at the same spatial location.
∙ Concatenation fusion: Stack the two feature maps at the same spatial location.
∙ Conv fusion: First stack the two feature maps at the same spatial location as above, then convolve the stacked data with a bank of filters.
∙ Bilinear fusion: Compute the matrix outer product of the two features at each pixel location, followed by a summation over the locations.
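The five spatial fusion operations listed above can be sketched on two tiny feature maps. This is an illustration only: the feature maps and filter weights are made-up numbers, the function names are hypothetical, and the use of 1×1 filters for conv fusion is an assumption of this sketch.

```python
def fuse_sum(a, b):
    """Element-wise sum of two [H][W][D] feature maps."""
    return [[[x + y for x, y in zip(pa, pb)] for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def fuse_max(a, b):
    """Element-wise maximum of two [H][W][D] feature maps."""
    return [[[max(x, y) for x, y in zip(pa, pb)] for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def fuse_concat(a, b):
    """Stack the channels of both maps at each location: [H][W][2D]."""
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse_conv(a, b, filters):
    """Stack the channels, then apply a bank of 1x1 filters (assumed here).

    filters: list of weight vectors, each of length 2D
    """
    stacked = fuse_concat(a, b)
    return [[[sum(w * v for w, v in zip(f, pixel)) for f in filters]
             for pixel in row] for row in stacked]


def fuse_bilinear(a, b):
    """Sum over all locations of the outer product of the two feature vectors."""
    d_a, d_b = len(a[0][0]), len(b[0][0])
    out = [[0.0] * d_b for _ in range(d_a)]
    for ra, rb in zip(a, b):
        for pa, pb in zip(ra, rb):
            for i, x in enumerate(pa):
                for j, y in enumerate(pb):
                    out[i][j] += x * y
    return out


# Two made-up feature maps with 1x2 spatial locations and D = 2 channels.
spatial_map = [[[1.0, 2.0], [0.0, 1.0]]]
temporal_map = [[[3.0, 0.0], [1.0, 4.0]]]

summed = fuse_sum(spatial_map, temporal_map)
maxed = fuse_max(spatial_map, temporal_map)
stacked = fuse_concat(spatial_map, temporal_map)
convolved = fuse_conv(spatial_map, temporal_map, [[0.5, 0.5, 0.5, 0.5]])
bilinear = fuse_bilinear(spatial_map, temporal_map)
```

Sum and max fusion keep the number of channels at D, concatenation doubles it to 2D, conv fusion maps the 2D stacked channels to as many channels as there are filters, and bilinear fusion discards the spatial layout and returns a D × D matrix.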
For temporal fusion, they use a 3D pooling layer, or a 3D convolution operator followed by a 3D pooling operator. The results show that the two-stream architecture can be fused at a convolution layer without loss of performance while substantially reducing the number of parameters to learn, and that it is better to fuse such networks spatially at the last convolutional layer than at an earlier one.