
Face Tracking and Indexing in Video Sequences

Student:
Ngoc-Trung Tran

Supervisors:
M. Fakhreddine ABABSA
M. Maurice CHARBIT

Reporters:
M. Serge MIGUET
M. Farid MELGANI

Examiners:
Mme. Bernadette DORIZZI
M. Malik MALLEM
M. Sami ROMDHANI
M. Kevin BAILLY

A thesis submitted in fulfilment of the requirements
for the degree of Doctor of Philosophy
in the
LTCI Department
Telecom ParisTech

July 2015


Abstract


Face tracking in video sequences is an important problem in computer vision because of its many applications in domains such as video surveillance, human-computer interaction and biometrics. Despite many recent advances in this field, it remains a challenging topic, in particular when the 6 Degrees of Freedom (DOF), i.e. the three-dimensional (3D) translation and rotation, or fiducial points (to capture the facial animation) need to be estimated at the same time. The challenge comes mainly from the following factors: illumination variations, wide head rotations, expressions, occlusions, cluttered backgrounds, etc. In this thesis, contributions are made to address two of these major difficulties: expression and wide head rotation. We aim to build a 3D tracking framework for monocular cameras that is able to estimate the 6 DOF and the facial animation accurately, while remaining robust to wide head rotations, even up to profile views. Our method adopts a 3D face model, defined as a set of 3D vertices, to compute continuous values of the 6 DOF across video sequences.

Tracking wide 3D head poses raises significant difficulties regarding data collection, since the pose and/or a large number of fiducial points are generally required to build statistical shape and appearance models. Pose ground-truth is expensive to collect because it requires specific devices or sensors, while the manual annotation of correspondences on large databases is tedious and error-prone. A further difficulty of correspondence annotation is locating the points hidden by self-occlusion in profile faces. As a result, general approaches that tackle out-of-plane rotations usually have to rely on view-based or adaptive models. The problem matters because different views, for example profile and frontal, expose different numbers of fiducial points, which creates a gap when tracking across views with a single model. The first main contribution of this thesis is the use of large synthetic datasets to overcome this problem. Leveraging such data, which cover the full range of head poses, together with the on-line information correlating consecutive frames, we propose combining the two to make the tracking more robust. Local features are adopted in this thesis to reduce the high variation of facial appearance when learning on a large dataset.

To track the facial animation and out-of-plane rotations rigorously and simultaneously, and to move towards in-the-wild conditions, we propose training a cascaded-regression tracking model on a combination of synthetic and real datasets. From the perspective of cascaded regression, the local descriptor is crucial for high performance. Hence, we utilize feature learning to make the representation of local patches more discriminative than hand-crafted descriptors. Furthermore, the proposed method modifies the later stages of the traditional cascaded approach in order to boost the quality of the detected fiducial points or correspondences.

Lastly, the Unconstrained 3D Pose Tracking (U3PT) dataset, our own recordings, is proposed to the community for the evaluation of in-the-wild face tracking. The dataset captures ten subjects (five videos per subject) in an office environment with a cluttered background. The recorded people move comfortably in front of the camera, rotate their heads in three directions (Yaw, Pitch and Roll), up to 90 degrees of Yaw, and perform some occlusions and expressions. In addition, these videos are recorded under different lighting conditions. With the available 3D pose ground-truth computed using an infrared camera system, the robustness and accuracy of a tracking framework intended to work in-the-wild can be evaluated precisely.

Keywords: head pose tracking, pose estimation, 3D face tracking, in-the-wild face tracking, Bayesian tracking, face alignment, cascaded regression, synthetic data.



Publications

During this study, the following papers were published or under submission:

U3PT: A New Dataset for Unconstrained 3D Pose Tracking Evaluation
Ngoc Trung Tran, Fakhreddine Ababsa and Maurice Charbit,
International Conference on Computer Analysis of Images and Patterns (CAIP), 2015.

Cascaded Regression of Learning Feature for Face Alignment
Ngoc Trung Tran, Fakhreddine Ababsa, Sarra Ben Fredj and Maurice Charbit,
Advanced Concepts for Intelligent Vision Systems (ACIVS), 2015.

Towards Pose-Free Tracking of Non-Rigid Face using Synthetic Data
Ngoc Trung Tran, Fakhreddine Ababsa and Maurice Charbit,
International Conference on Pattern Recognition Applications and Methods (ICPRAM), 2015.

A Robust Framework for Tracking Simultaneously Face Pose and Animation using Synthesized Faces
Ngoc Trung Tran, Fakhreddine Ababsa and Maurice Charbit,
Pattern Recognition Letters, 2014.

3D Face Pose and Animation Tracking via Eigen-Decomposition based Bayesian Approach
Ngoc-Trung Tran, Fakhreddine Ababsa, Jacques Feldmar, Maurice Charbit, Dijana Petrovska-Delacretaz and Gerard Chollet,
International Symposium on Visual Computing (ISVC), 2013.

3D Face Pose Tracking from Monocular Camera via Sparse Representation of Synthesized Faces
Ngoc-Trung Tran, Jacques Feldmar, Maurice Charbit, Dijana Petrovska-Delacretaz and Gerard Chollet,
International Conference on Computer Vision Theory and Applications (VISAPP), 2013.

Towards In-the-wild 3D Head Tracking
Ngoc Trung Tran, Fakhreddine Ababsa and Maurice Charbit,
Journal of Machine Vision and Applications (MVA), 2015. (Submitted)


Abbreviations

2D       Two Dimensional
3D       Three Dimensional
AAM      Active Appearance Model
ASM      Active Shape Model
CLM      Constrained Local Model
DOF      Degrees-of-Freedom
GMM      Gaussian Mixture Model
IC       Inverse Compositional
IRLS     Iteratively Re-weighted Least Squares
LDM      Linear Deformable Model
KF       Kalman Filter
KLT      Kanade–Lucas–Tomasi
ML       Maximum Likelihood
MAE      Mean Average Error
PCA      Principal Component Analysis
RANSAC   RANdom SAmple Consensus
RGB      Red, Green and Blue
RMS      Root Mean Squared
POSIT    Pose from Orthography and Scaling with ITerations
SDM      Supervised Descent Method
SfM      Structure from Motion
SSD      Sum of Squared Differences
SVD      Singular Value Decomposition
SVM      Support Vector Machine
3DMM     3D Morphable Model



Contents

Contents  v
List of Figures  viii
List of Tables  xi

1 Introduction  1
  1.1 Motivation  1
  1.2 Challenge  4
  1.3 Objectives  5
  1.4 Overview  7
  1.5 Notation  8

2 State-of-the-art  10
  2.1 Face Models  10
    2.1.1 Linear Deformable Models (LDMs)  11
    2.1.2 Rigid Models  12
  2.2 Tracking Approaches  13
    2.2.1 Optical Flows Regularization  13
    2.2.2 Manifold Embedding  14
    2.2.3 Template Model  15
    2.2.4 Local Matching  16
    2.2.5 Discriminative  17
    2.2.6 Regression  19
    2.2.7 Hybrid  20
  2.3 Multi-view Tracking  20
    2.3.1 On-line adaptive models  20
    2.3.2 View-based models  21
    2.3.3 3D Synthetic models  21
  2.4 Discussion  22
    2.4.1 Off-line learning vs. On-line adaptation  22
    2.4.2 Real data vs. Synthetic  23
    2.4.3 Global vs. Local features  23
  2.5 Databases  23

3 A Baseline Framework for 3D Face Tracking  25
  3.1 The General Bayesian Tracking Formulation  25
  3.2 Face Tracking using Covariance Matrices of Synthesized Faces  26
    3.2.1 Face Representation  26
    3.2.2 Baseline Framework  29
      3.2.2.1 Data Generation  29
      3.2.2.2 Training  31
      3.2.2.3 Tracking  31
    3.2.3 Experiments  34
  3.3 Face Tracking using Feature Reconstruction through Sparse Representation  36
    3.3.1 Reconstructed-Error Function  36
    3.3.2 Codebook  38
    3.3.3 Experiments  39
  3.4 Face Tracking using Adaptive Local Models  40
    3.4.1 Adaptive Local Models  40
    3.4.2 Experiments  42
      3.4.2.1 Pose accuracy  42
      3.4.2.2 Landmark precision  44
      3.4.2.3 Out-of-plane tracking  47
  3.5 Conclusion  47

4 Pose-Free Tracking of Non-rigid Face in Controlled Environment  49
  4.1 Introduction  49
  4.2 Wide Baseline Matching as Initialization  49
  4.3 Large-rotation tracking via Pose-wise Appearance Models  52
    4.3.1 View-based Appearance Model  52
    4.3.2 Matching Strategy by Keyframes  54
    4.3.3 Rigid and Non-rigid Estimation using View-based Appearance Model  55
    4.3.4 Experimental Results  57
  4.4 Flexible tracking via Pose-wise Classifiers  59
    4.4.1 Templates for Matching using Support Vector Machine  60
    4.4.2 Large off-line Synthetic Dataset  62
    4.4.3 Local Appearance Models  63
    4.4.4 Fitting via Pose-Wise Classifiers  64
    4.4.5 Experimental Results  66
      4.4.5.1 Pose Accuracy  66
      4.4.5.2 Landmark precision  67
      4.4.5.3 Out-of-plane tracking  68
  4.5 Conclusion  69

5 Towards In-the-wild 3D Face Tracking  70
  5.1 Cascaded Regression of Learning Features for Landmark Detection  70
    5.1.1 Introduction  70
    5.1.2 Cascaded Regression  72
    5.1.3 Local Learning Feature  74
    5.1.4 Coarse-to-Fine Correlative Cascaded Regression  75
    5.1.5 Experimental Results  77
      5.1.5.1 Learning Feature on Inner vs Boundary points  79
      5.1.5.2 Coarse-to-Fine Correlative Regression (CFCR)  81
    5.1.6 Conclusion  81
  5.2 In-the-wild 3D Face Tracking  81
    5.2.1 Introduction  81
    5.2.2 Real and synthetic data for in-the-wild tracking  83
    5.2.3 Unconstrained 3D Pose Tracking Dataset  88
      5.2.3.1 Recording System  88
      5.2.3.2 Calibration  89
      5.2.3.3 Dataset  92
    5.2.4 Experiments  93
      5.2.4.1 Public datasets  93
      5.2.4.2 Datasets for Visualization  94
      5.2.4.3 Our dataset: Unconstrained Pose Tracking dataset  95
    5.2.5 Conclusion  98
  5.3 Conclusion  99

6 Conclusion and Future Works  100

A Algorithms in Use  103
  A.1 Nelder-Mead algorithm  103
  A.2 POSIT  103
  A.3 RANSAC  106
  A.4 Local Binary Patterns  107
  A.5 3D Morphable Model  107

Bibliography  109


List of Figures

1.1 Three head orientations: Yaw, Pitch and Roll.  2
1.2 Some challenging conditions of 3D face tracking: illumination, head rotation, cluttered background in one sample video sequence.  4
2.1 Some 3D face models, from left to right: AAM, Cylinder, Mesh and 3DMM.  11
3.1 The frontal and profile views of the Candide-3 model.  27
3.2 One example of the perspective projection in our method to obtain 2D projected landmarks from the current model given the parameters.  29
3.3 The diagram of the baseline framework for 3D face tracking: data generation, training and tracking stages.  30
3.4 Landmark initialization, 3D model fitting using POSIT and the rendering of synthesized images.  31
3.5 The landmarks used in the framework, with examples of mean and covariance matrices at the eye and mouth corners.  32
3.6 The visualization of the first two and three components of the descriptor projection using PCA, learned through synthetic images of some feature points.  33
3.7 Some examples of BUFT videos at different views.  34
3.8 The framework using reconstructed features via sparse representation, with two modifications from the baseline.  37
3.9 The sparse representation to estimate the coefficients of the positive (green) and negative (red) local patches of one eyebrow corner.  38
3.10 The construction of positive and negative patches of eye corners.  39
3.11 The framework using adaptive local models, based on the baseline.  40
3.12 A sample video of the BUFT dataset using the local adaptive model.  44
3.13 Some tracking examples on the BUFT dataset using the local adaptive model (green) and FaceTracker (yellow).  45
3.14 The visualization of our (Yaw, Pitch, Roll) estimation (blue) and the ground-truth (red) on the video jam7.avi using the local adaptive model.  45
3.15 The visualization of our (Yaw, Pitch, Roll) estimation (blue) and the ground-truth (red) on the video vam2.avi using the local adaptive model.  46
3.16 The RMS error of 12 selected points for tracking in our framework (red) compared to [97] (blue). The vertical axis is the RMS error (in pixels) and the horizontal axis is the frame number.  47
4.1 The basic idea of wide baseline matching.  50
4.2 Computing the 3D points L_k as the 3D intersection points.  51
4.3 From left to right: the 2D SIFT keypoints of the keyframe, SIFT matching, and outliers removed by RANSAC.  52
4.4 The pipeline of the Pose-Wise Appearance Model based framework.  53
4.5 The structure of the mapping table, with the three orientations as key and the local descriptors as content.  54
4.6 The cross-validation to select two parameters: the number of nearest neighbors N_q and the threshold k_Hub. The validation of a) first row: the number of nearest poses, b) second row: k_Hub for |Yaw| ≤ 30°, and c) third row: k_Hub for |Yaw| > 30°. The vertical axis is the RMS error.  58
4.7 Tracking examples on VidTimit and Honda/UCSD.  59
4.8 How positive (blue) and negative (red) samples are picked, and the response map computed at the mouth corner after training.  61
4.9 Some weight matrices or templates of local patches T(x) in our implementation.  61
4.10 From left to right of the training process: 143 frontal images, landmark annotation and 3D model alignment, synthesized image rendering, and pose-wise SVM training.  62
4.11 a) The Candide-3 model with the facial points of our method. b) How the response map is computed at the mouth corner using three descriptors via SVM templates.  64
4.12 The pipeline of the tracking process from frame t to t+1.  64
4.13 The RMS of our framework (red curve) and FaceTracker [97] (blue curve). The vertical axis is the RMS error (in pixels) and the horizontal axis is the frame number.  67
4.14 Our tracking method on some sample videos of VidTimit and Honda/UCSD.  68
5.1 Some results of face alignment on some challenging images of the 300-W dataset.  71
5.2 The visualization of weight matrices when learning local descriptors of eye and mouth corners using an RBM.  75
5.3 The overview of our approach. In training, we learn sequentially the RBM models and the coarse-to-fine regression (correlative and global regression).  76
5.4 CED evaluation of local correlative regression on specific landmarks. i → j means using the i-th landmark to detect the j-th landmark.  77
5.5 The cross-validation of the feature size, the number of hidden units, the number of random samples and the number of regressors.  79
5.6 The evaluation of the effect of inner and boundary points on the cross-validation set of the 300-W dataset.  80
5.7 Some results on the 300-W dataset.  82
5.8 The annotation of 51 landmarks in synthetic images and its automatic rendering in different views.  83
5.9 Examples of landmark annotation of Multi-PIE at different views.  84
5.10 Examples of automatic rendering of synthetic images.  85
5.11 Landmark detection for some large rotations using synthetic training data.  86
5.12 Examples of landmark detection on real images using the model trained from the synthetic data.  86
5.13 The mean shape, the shape transformed as close as possible to the target shape using anisotropic scaling, and the local descriptors of the transformed shape.  87
5.14 Tracking between consecutive frames using the simple strategy.  88
5.15 The installation of the recording system.  89
5.16 The tree flystick and its detection using the stereo infrared camera and Dtrack2 software.  89
5.17 The checkerboard for calibration of the RGB camera.  90
5.18 Annotation of the zero marker for stereo calibration. The red point is the annotation of the original coordinate position in 2D images.  90
5.19 The diagram to estimate the rotation and translation from the infrared camera to the RGB camera.  91
5.20 Some examples of stereo-calibration results.  91
5.21 Some sample video sequences in our dataset.  94
5.22 The RMS error on the Talking Face video: 5.3 pixels, compared to 6.8 pixels for FaceTracker on this video.  95
5.23 Some examples of face tracking on the YouTube Celebrities and Honda/UCSD datasets.  96
5.24 The accuracy evaluation of U3PT on each subset of videos.  97
5.25 The robustness evaluation of U3PT on each subset of videos.  98
A.1 The pinhole camera model of POSIT (courtesy of [33]).  106
A.2 How to compute the LBP of a pixel (courtesy of [90]).  107


List of Tables

3.1 The list of shape and animation units in our study.  28
3.2 The comparison of robustness (P_s) and accuracy (E_yaw, E_pitch, E_roll and E_m) between the intensity and SIFT descriptors on the uniform-light set of the BUFT dataset.  35
3.3 The comparison of robustness (P_s) and accuracy (E_yaw, E_pitch, E_roll and E_m) between different ranges of Yaw on the uniform-light set of the BUFT dataset.  35
3.4 The comparison of robustness (P_s) and accuracy (E_yaw, E_pitch, E_roll and E_m) with the baseline method on the uniform-light set of the BUFT dataset. The results are computed at two different thresholds (95% and 100%) of robustness.  40
3.5 The cross-validation using the accuracy for the α and β terms of the adaptive method on the jam1.avi video.  43
3.6 The comparison of robustness (P_s) and accuracy (E_yaw, E_pitch, E_roll and E_m) of adaptive local models with our previous results on the BUFT dataset.  44
3.7 The comparison of robustness (P_s) and accuracy (E_yaw, E_pitch, E_roll and E_m) of adaptive local models with state-of-the-art methods on the uniform-light set of the BUFT dataset.  44
4.1 The evaluation of the view-based appearance model based method on the BUFT dataset.  58
4.2 The pose precision of our method and state-of-the-art methods on the uniform-light set of the BUFT dataset.  66
5.1 Comparison with state-of-the-art methods on the 300-W dataset using 68 landmarks.  80
5.2 Comparison with state-of-the-art methods on the 300-W dataset using 51 landmarks.  81
5.3 The tracking performance on the uniform-light set of the BUFT dataset.  93
5.4 The performance of methods on our own U3PT dataset.  97


Chapter 1

Introduction
1.1 Motivation

Three-dimensional (3D) human face tracking is a generic problem that has received considerable attention in the computer vision community. The main goal of 3D face tracking is to estimate two sets of parameters of human faces from video frames: i) the 6 Degrees of Freedom (DOF), consisting of the 3D translation and the three axial rotations shown in Fig. 1.1, of a person's head relative to the camera view. As is common in the literature [82], we adopt the three terms Yaw (or Pan), Pitch (or Tilt) and Roll for the three axial rotations. The Yaw orientation is observed when, for example, rotating the head from right to left. The Pitch orientation is related to moving the head forward and backward. The Roll orientation corresponds to bending the head from left to right. The 6 DOF are considered the rigid parameters. ii) The non-rigid parameters describe the facial muscle movements or facial animation, which are usually the early step towards recognizing facial expressions such as happiness, sadness, anger, disgust, surprise and fear. In practice, the non-rigid parameters are often represented in the form of detected and tracked facial points. These points are known as fiducial points, feature points or landmarks in the face processing community. The word "indexing" in our report means that rigid and non-rigid parameters are estimated from video frames. Our aim is to read a video (or a webcam stream) capturing a single face (in the case of multiple faces, the biggest face is selected), and to output the rigid and non-rigid parameters for each video frame. Indeed, the non-rigid parameters are difficult to represent because their form depends on the application; in our study, they are represented indirectly by localizing or detecting feature points on the face.
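To make the rigid parameters concrete, the three axial rotations are commonly composed into a single rotation matrix. One widespread convention (an illustrative choice; the composition order varies between works and is not fixed by this chapter) is

    R(θ_yaw, θ_pitch, θ_roll) = R_z(θ_roll) R_y(θ_yaw) R_x(θ_pitch),

where R_x, R_y and R_z denote the elementary rotations about the three axes. The rigid motion of a 3D vertex v is then v' = R v + t, with the three rotation angles and the translation t ∈ R³ giving the 6 DOF.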


Figure 1.1: Three head orientations: Yaw, Pitch and Roll.

There are several potential applications in many domains which use face tracking. The most popular one is recognizing facial behaviors in order to support an automatic system for understanding human communication [76]. In this context, the visual focus of attention of a person is a very important cue to recognize: it is a nonverbal communication channel and an indicative signal in a conversation. For this problem, we first have to analyze the head pose to determine the direction in which people are likely looking in video sequences. It makes sense that people may be focusing on someone or something while talking. Furthermore, head movements carry important meaning as a form of gesturing in a conversation. For example, nodding indicates that people understand or agree with what is being said, while shaking the head indicates misunderstanding or disagreement. Emphasized head movements are a conventional way of directing someone to observe a particular object or location. In addition, the head pose is intrinsically linked with the gaze: head pose gives a coarse estimate of gaze in situations where the eyes are invisible, such as low-resolution imagery, very low-bit-rate video recordings, or eye occlusion due to wearing sunglasses. Even when the eyes are visible, the head pose helps to predict the gaze direction more accurately. There are other gestures able to indicate dissent, confusion, consideration, etc. Facial animation analysis is also necessary to read what kind of expression people are showing. Facial expression is a natural part of human communication; it is one of the most cogent means for human beings to infer the attitude and emotions of other persons in the vicinity. Expression analysis, which requires facial animation detection, is a crucial topic not only in machine vision but also in psychology [1, 86]. Head gestures and expressions complement each other to clarify more comprehensively the interactions, intentions and emotions between persons.



3D face tracking plays an important role in many other applications. For example, integrating the estimation of facial animation allows a human user to naturally control a distant avatar in virtual environments [21]. It is useful for controlling the facial animation of virtual characters in video games (e.g. Second Life or Poker 3D), for creating characters as realistic as possible in animated movies, and for enhancing remote communication in collaborative virtual environments. Vision-driven user interfaces for driver fatigue detection systems detect the early signs of fatigue or drowsiness and raise an alarm while driving a car [35, 114]. Human-machine interaction uses face gestures to control computers or devices [107, 109]. Face tracking and pose estimation are also useful for building teleconferencing and Virtual Reality interface systems [51, 113]. Face tracking is a compulsory stage of face recognition in video security systems that track and identify people, because the face has to be tracked (at least the rigid parameters) before pre-processing stages such as normalizing or estimating the frontal view of the face. Many monitoring systems involve information associated with the face position. In addition to security, they can also be used for marketing. One example is the tracking of faces and the recognition of human activity in supermarkets to follow customer behavior; this kind of system analyses human actions to understand behavior and, for instance, give efficient recommendations for product arrangement in the supermarket.
Nowadays, there are many types of camera device that could be used for 3D face tracking, such as the Kinect, 3D infrared cameras or stereo cameras. Even though these cameras can provide more useful information (like depth), allowing more robust and accurate tracking, monocular 2D cameras are still our choice in this thesis. The first reason is that 3D cameras are usually installed only for indoor applications. Second, they are more expensive and not portable enough to be integrated into small devices such as mobile phones or chat webcams. Third, complicated calibration and configuration are often required to use such devices. Therefore, there is still plenty of room for applications of monocular cameras. If a face tracking system is built successfully with a monocular camera, it can be extended to a stereo system or combined with depth cameras to improve performance.




Figure 1.2: Some challenging conditions of 3D face tracking: illumination, head rotation, cluttered background in one sample video sequence.

1.2 Challenge

In the literature, the estimation of rigid and non-rigid parameters with a monocular camera has usually been considered as two separate tasks: pose estimation [82] and face alignment, or facial feature point detection [116]. When attempting to estimate the rigid parameters, i.e. the head pose, under wide rotations, previous works did not focus on the non-rigid parameters. Conversely, when trying to represent the facial animation efficiently, it was difficult to address a wide range of rotations. Indeed, the simultaneous estimation of rigid and non-rigid parameters is a very difficult task, because dozens of parameters have to be estimated at once under various challenging conditions. In order to build a face tracking system that is robust for both rigid and non-rigid parameters, researchers have to face several difficulties, given below:

• 3D-2D projection: The projection is an ill-posed problem, because the observation from a monocular camera is only 2D, whereas the recovery of the 3D pose, i.e. the estimation of the rigid parameters, is 3D.

• Illumination variations: Face appearance depends on the location of light sources; moreover, light reflection is not the same on all parts of the face and depends on the light sources, the camera and surrounding objects. Such conditions change the head appearance in different ways, as can be seen in Fig. 1.2.

• Head orientation: Obviously, a head does not look the same from different sides, for instance between frontal and profile views. When the head orientation changes relative to the camera, the face appearance changes depending on the visible parts of the face.

• Biological appearance variation: This is a problem that all face processing systems have to tackle. In other words, the tracking system has to be able to track any face of an unknown person.

  – Intra variability: For the same person, different conditions such as short or long hair, beard and mustache, or make-up can change the face appearance.

  – Extra variability: Different persons have different face appearances because of the different sizes and shapes of their eyes, noses and mouths. Different facial hair and beards are other factors that make tracking less accurate.

• Simultaneous non-rigid expressions: The estimation of non-rigid parameters usually depends on facial points around the eyes, nose and mouth corners. It is difficult to collect enough training data to build statistical models of shape and appearance, because a large number of annotated correspondences is generally required. Moreover, estimating dozens of rigid and non-rigid 3D parameters from noisy observations is challenging in itself.

• Occlusions and self-occlusions: When a person wears accessories such as glasses or hats, part of the face is occluded, and half of the face is invisible in profile views.

• Low resolution: Although this is not really a big problem thanks to today's high-resolution cameras, in some settings, such as surveillance systems, low resolution commonly occurs when people stay at a distance from the camera.

• A cluttered or ambiguous background can make the tracking fragile.

• Moreover, face tracking has to run in real time for many applications.

In addition, many stages must be rigorously integrated to build an automatic and robust 3D face tracking system for video sequences. First, face detection localizes the face in the first frame. Second, the alignment stage aligns the face model to the 2D facial image. Third, the tracking stage infers the face model at the next time step given the head model at the current one. Fourth, the recovery stage quickly recovers from tracking failures, etc. Each stage is itself a challenging task in practice.
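As a rough sketch of how these stages chain together (the stage functions below are hypothetical placeholders, not an implementation from this thesis), the main loop of such a system could look like:

    # Hypothetical skeleton of an automatic 3D face tracking pipeline.
    # detect_face, align_model, track_step and tracking_failed stand in
    # for the four stages described above; concrete systems supply them.
    def track_video(frames):
        face_box = detect_face(frames[0])           # 1) detection in the first frame
        params = align_model(frames[0], face_box)   # 2) fit the 3D model (rigid + non-rigid)
        results = [params]
        for frame in frames[1:]:
            params = track_step(frame, params)      # 3) frame-to-frame tracking
            if tracking_failed(frame, params):      # 4) recovery after a failure
                face_box = detect_face(frame)
                params = align_model(frame, face_box)
            results.append(params)
        return results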

1.3 Objectives

Because the face is a deformable object, as discussed above, one of the major issues that is really challenging in real-world applications is the need for a large corpus of training data with ground-truth covering a wide range of rotations, expressions, shapes and so on. In addition, a lot of human effort has to be spent on annotation, which is very costly. However, there are still some ways to circumvent this issue in practice. The main aim of this thesis is to propose an algorithm which is able to track both facial pose and animation. We focus not only on robustness against divergence but also on accuracy, as well as on developing methods to overcome current issues and make the application realistic. To that end, we want to tackle the following challenging problems:

• Many studies have proposed estimating the rigid and non-rigid parameters separately because of the challenges discussed above. In this study, we want to analyse how to track them simultaneously, so that the rigid parameters can help estimate the non-rigid ones and vice versa.

• Handling a wide range of poses is still a big challenge, and this problem is a focus of our study. We propose a combined approach using both off-line and on-line information to address it.

• Our goal is to build a robust 3D face tracking framework, so a large amount of training data is required to build statistical models. Data collection is very expensive; moreover, annotating the hidden landmarks in profile views is very difficult. We therefore propose the use of a synthetic dataset to train the tracking model: it is inexpensive to collect and helps address the hidden landmarks of profile views.

• We investigate the utility of synthetic data, and then combine it with available real datasets, aiming to work reasonably well on in-the-wild videos. Our study is the first work approaching efficient 3D tracking in this way, adopting a combination of cascaded regression and matching. We aim to develop a method able to produce reasonable results for 3D in-the-wild tracking.

• We propose to create a new database with 3D pose ground-truth. This dataset is recorded under challenging conditions: wide rotations, expressions, occlusions, cluttered background, etc. We set up a recording system (RGB camera + 3D infrared cameras) that allows us to capture accurate 3D ground-truth while people move comfortably in front of the system. Furthermore, the dataset is recorded at different levels of difficulty, which enables tracking to be evaluated more precisely (robustness, accuracy, profile-tracking capability, ...) than with previously reported datasets. This database could be a contribution to the vision community as a benchmark for in-the-wild tracking.

1.4 Overview

This dissertation comprises six chapters, the first of which is this introduction. The chapters are organized so that the conventions, terminology and methods set out in earlier chapters are developed further in the chapters that follow. A brief outline of the following chapters is given below:

Chapter 2 reviews state-of-the-art methods. This includes a detailed discussion of methods from different perspectives, such as the head model and the tracking approach. The chapter serves as a basic analysis of existing methods, discussing their strengths and weaknesses, in order to figure out which approach suits our problems.

Chapter 3 presents a basic investigation into the utility of a synthetic dataset for rigid tracking. We develop a baseline framework, based on the idea of one published work, to derive a new and efficient way of tracking rigid faces. The applicability of the approach is evaluated empirically through experiments on a face tracking dataset. The results we achieved are encouraging and are improved upon in the following chapters. An evaluation of two different descriptors used in our framework is also reported. This chapter is the foundation for the developments of our methods in the next ones.

Chapter 4 describes the proposed two-step method, which shows the applicability of wide baseline matching, or tracking-by-detection, for improving the robustness of out-of-plane tracking. a) The first step uses matching between the current frame and adaptively stored preceding keyframes to estimate only the rigid parameters. In this way, our method withstands fast movement and can recover from lost tracking. b) The second step obtains the whole set of parameters (rigid and non-rigid) by a heuristic method using pose-wise SVMs, which can efficiently align a 3D model to a profile face in a similar manner to frontal face fitting. The combination of three descriptors is also considered for a better local representation.

Chapter 5 reports two improvements over the earlier chapters. In a first stage, we propose a new method for face alignment using deep learning, based on the Supervised Descent Method (SDM) and features learned with a Restricted Boltzmann Machine (RBM). Details regarding its derivation are given, as well as the efficiency of learned features compared to hand-designed features from the face alignment point of view. We also propose the combination of local correlative regression and global regression to improve accuracy. The performance reported in this chapter is comparable to recent state-of-the-art methods, and this study provides the new framework with the alignment of the first video frame. In a second stage, we extend the face alignment method of the first stage to enable the framework to work in in-the-wild conditions. In addition, we propose the use of a 3D Morphable Model (3DMM) to generate a better synthetic dataset for training, combined with real databases. Empirical evaluations are performed on our own face tracking dataset, captured with a new recording system. Analyses of the results are presented along with ideas for further performance gains.

Chapter 6 sums up this study with an overview of what has been presented and gives directions for future work.

1.5 Notation

• Scalars are written in italics, either in lower or upper case, for example a and B.

• Vectors are written in lower-case non-italic boldface, with components separated by spaces: v = [a b c]^T.

• Matrices are written in upper-case non-italic boldface, for example M; however, we sometimes use Greek symbols instead, such as Φ.

• Functions are typeset in the upper-case Ralph Smith's Formal Script (RSFS) font, for example F, G.

• Function composition is denoted by the ◦ symbol, for example G(F(v)) = G ◦ F(v).

• N_{id} denotes the number of the specific thing {id}; for example, N_p is the number of vertices of the 3D model.

• The model parameters (rigid and non-rigid) are usually denoted Θ in this dissertation.

• Perspective projection is denoted P; for example, P(v) is the projection of a 3D point v to 2D (a minimal pinhole form is sketched below).
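Under the usual pinhole assumption (a standard formulation given here for reference; the camera model used in later chapters may include the full intrinsic matrix), the projection of v = [x y z]^T with focal length f and principal point (c_x, c_y) is

    P(v) = [ f·x/z + c_x    f·y/z + c_y ]^T.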


Chapter 2

State-of-the-art
In this chapter, the state-of-the-art methods in the literature are categorized into groups depending on the tracking technique point of view; more precisely, on the technique used to align the face model to the 2D frames of video sequences. Conveniently, each group of tracking methods tends to use one specific 3D model, feature descriptor or type of training data. For example, discriminative methods for face tracking often use Linear Deformable Models (LDMs) with local descriptors, and the face model is trained from an off-line dataset. Hence, the review of tracking methods itself includes a review of the face models, features and data. However, we will first discuss 3D face models, because of their importance in deciding which tracking approach can be used. In addition, robustness to profile views, one of our main objectives, will be studied in a separate part, to find out how previous works track the profile sides efficiently. Some datasets for evaluation are also discussed at the end of this chapter.

2.1 Face Models

To estimate the facial pose or head orientation (rigid parameters) in video sequences, three basic approaches are commonly used. First, using a 3D model to track and estimate the head pose directly. Second, tracking 2D feature points; the pose is then estimated by fitting a given 3D face model to the 2D feature points (sometimes using geometrical characteristics instead of a 3D model). Third, tracking the 2D face region, extracting features, and estimating the pose directly from its appearance using trained models, without the need for a 3D face model. To estimate the facial shape and animation, 2D or 3D models of feature points are considered. The feature points (also named facial points, fiducial points or landmarks) are usually points localized at the corners of the eyes, eyebrows, nose and mouth; more precisely, they are points rich in local information content, e.g. significant 2D texture. In fact, if accurate landmark detectors working at any view in real time existed, 3D face tracking would be unnecessary; however, such detectors do not exist up to now, so it is still important to build an efficient tracking framework using 3D models. The choice of head model determines which characteristics of the face can be interpreted during tracking.

Figure 2.1: Some 3D face models, from left to right: AAM, Cylinder, Mesh and 3DMM.

2.1.1 Linear Deformable Models (LDMs)

The most popular face model in the literature is the Linear Deformable Model (LDM). Some instances of this model represent deformability only in shape, others in both shape and appearance. The main point of this model is that, from a set of annotated training images, separate linear models of shape and appearance are learned through Principal Component Analysis (PCA) [89]. The pioneering work on LDMs is the Active Shape Model (ASM) proposed by [24, 25], which learns a PCA of shapes. ASMs were extended to Active Appearance Models (AAMs) [26] by combining two separate linear models of shape and appearance. At first, ASMs and AAMs were 2D; they were then extended to 3D by adding more parameters [122] or by using structure-from-motion to learn from 2D correspondences [106]. ASMs and AAMs are often used for face alignment (or landmark detection) rather than pose estimation (rigid). A well-known problem when using AAMs is their weak generalization capability, because the facial texture space is too large to be captured sufficiently. Recently, LDMs combined with local features have become more popular thanks to the development of invariant local descriptors [7, 97, 111, 118].
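In their standard form (a textbook summary of the PCA models above, with illustrative notation), these linear models generate any face shape s from a mean shape plus a linear combination of learned modes:

    s = s̄ + Φ_s b_s,

where s̄ is the mean shape, the columns of Φ_s are the principal shape modes, and b_s is the vector of shape parameters; AAMs add the analogous linear model A = Ā + Φ_a b_a for the appearance.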



The Candide model [3] is another 3D face model, designed to handle both rigid and non-rigid parameters in the same framework. In contrast to the previous LDMs, this model separates the shape and animation parameters, and it is a publicly available model. The latest version, Candide-3, is applied in many works [4, 36, 100]. It is an efficient model for frontal faces but, to the best of our knowledge, there is no work using it to track profile faces.
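Concretely, the Candide geometry g is written with separate static and dynamic terms (following the published Candide formulation; the notation here is illustrative):

    g(σ, α) = ḡ + S σ + A α,

where ḡ is the mean 3D shape, the columns of S are the shape units (person-specific and static), the columns of A are the animation units (expression-driven and dynamic), and σ, α are their parameter vectors.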
The 3D Morphable Model (3DMM) [14], which is closely related to AAMs, is a technique for modeling textured 3D faces. The difference from AAMs is that a 3DMM is trained directly from 3D scans. Ideally, the 3DMM is probably the most appropriate model for 3D face tracking, because it is truly 3D and has large regions on the two sides of the face, which is significant for keeping the tracking robust on profile faces. However, it is, let us say, a "heavy" model, because many vertices need to be controlled. In addition, the texture appearance on the two sides is not well reconstructed by current works. A few works, such as [80, 130], use this model for 3D face tracking.
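In the formulation of [14] (summarized here for reference), a 3DMM describes shape and texture as independent linear combinations learned from registered 3D scans:

    S = S̄ + Σ_i α_i s_i,    T = T̄ + Σ_i β_i t_i,

where the s_i and t_i are eigenvectors of the shape and texture covariance matrices, and the coefficients α_i, β_i parameterize a novel face.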
Many other deformable models exist in the literature but are not commonly used by other works. For example, [17, 32] constructed, fully by hand, a 3D polygon mesh animated by a set of parameterized deformations; however, this model is limited in expressing some complicated motions, for example lip deformations. [67] constructed textured 3D face models from videos, with some initial user interaction at the beginning.

2.1.2 Rigid Models

These models use pre-defined 3D geometric shapes, such as cylinders, ellipsoids or cubes, to represent the 3D head. They address only the estimation of the rigid parameters. Some of them are the following:

Planar Model: A rectangle, usually used to capture the face region in the video sequence. For example, [135] used this model in a combination of face detection and 3D pose tracking for pose estimation; however, the head model has 6 degrees of freedom to initialize, and the experiments indicate that this kind of model is too simple and not efficient enough to track large rotations.

Cylindrical Model: The most popular rigid model used for pose estimation. It is the surface of points at a fixed distance from a given line, and it is parameterized by a 3D position and orientation. For tracking, the model texture is usually obtained by warping the frame texture from the video. This model is appropriate for tracking a wide range of rotations, especially in the Yaw orientation [2, 20, 50, 79, 103, 123], because it has few degrees of freedom and still approximates the human head reasonably.
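As a sketch of this parameterization (an illustrative form; individual works differ in the details), a point on the cylinder surface and its position under the current pose can be written

    p(θ, h) = [ r cos θ   h   r sin θ ]^T,    p' = R p + t,

so the whole state reduces to the rotation R and translation t (the 6 DOF), which is why the model stays stable under large Yaw rotations.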
Ellipsoid Model: This model [5, 9, 23] is supposed to be better than the cylindrical model because it shows relatively more stable performance on Pitch movements. In addition, it can approximate the forehead well, since the geometric shape of the forehead is curved. However, controlling this model is more difficult than the cylindrical one.

Mesh Model: A collection of vertices and polygons defining a fixed 3D face shape structure [112]. It is a model designed by hand, not learned from a training set. This model is appropriate for modeling and tracking 3D faces; however, it is not flexible, because it needs to be adjusted to the specific face each time before tracking. In addition, the initialization of this model in the first frame is challenging. There are other hand-designed mesh models, such as [131].

Most rigid head models are designed for rigid estimation only; they are easy to control and useful for tracking large rotations, but it is worth noting that they are not appropriate for non-rigid estimation. However, it is possible to combine them with non-rigid models to obtain better estimates.

2.2 Tracking Approaches


In our study, we classify the state-of-the-art methods in some categories depending
mainly on the tracking approach. However, we will also discuss about some important
properties, such as the features (e.g, local or global), adaptive or off-line learning, the
type of training data (real, synthetic or hybrid). Other properties are sometimes mentioned; for example, which parameters are estimated (rigid and non-rigid ), the recovery
of failure tracking, the robustness to occlusion, illumination, clustered background and
time consuming. The problem of out-of-plane rotation will be considered in a separate
section afterward, because it is one of the most important problems that needs to be
solved to build a robust tracking.

2.2.1 Optical Flows Regularization

This approach uses a set of vertices (given or random points) on the face model, regularized by optical flow, for tracking. These points are not used to extract clearly the appearance

