
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

THUY PHAM THI THANH

NGHIÊN CỨU VÀ PHÁT TRIỂN CÁC KỸ THUẬT
ĐỊNH VỊ VÀ ĐỊNH DANH KẾT HỢP THÔNG TIN
HÌNH ẢNH VÀ WIFI
PERSON LOCALIZATION AND IDENTIFICATION
BY FUSION OF VISION AND WIFI

DOCTORAL THESIS OF COMPUTER SCIENCE

Hanoi − 2017


MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

THUY PHAM THI THANH

NGHIÊN CỨU VÀ PHÁT TRIỂN CÁC KỸ THUẬT
ĐỊNH VỊ VÀ ĐỊNH DANH KẾT HỢP THÔNG TIN
HÌNH ẢNH VÀ WIFI
PERSON LOCALIZATION AND IDENTIFICATION
BY FUSION OF VISION AND WIFI

Specialization: Computer Science
Code No: 62480101

DOCTORAL THESIS OF COMPUTER SCIENCE



SUPERVISORS:
1. Assoc.Prof. Thi Lan Le
2. Dr. Trung Kien Dao

Hanoi − 2017


DECLARATION OF AUTHORSHIP
I, Thuy Pham Thi Thanh, declare that this thesis titled, "Person Localization and
Identification by Fusion of Vision and WiFi", and the work presented in it are my own.
I confirm that:
This work was done wholly or mainly while in candidature for a PhD research
degree at Hanoi University of Science and Technology.
Where any part of this thesis has previously been submitted for a degree or any
other qualification at Hanoi University of Science and Technology or any other
institution, this has been clearly stated.
Where I have consulted the published work of others, this is always clearly attributed.
Where I have quoted from the work of others, the source is always given. With
the exception of such quotations, this thesis is entirely my own work.
I have acknowledged all main sources of help.
Where the thesis is based on work done by myself jointly with others, I have
made clear exactly what was done by others and what I have contributed myself.

Hanoi, January 2017
PhD STUDENT

Thuy Pham Thi Thanh

SUPERVISORS


Assoc.Prof. Thi Lan Le

Dr. Trung Kien Dao



ACKNOWLEDGEMENT
This thesis was done at the International Research Institute MICA, Hanoi University
of Science and Technology. First, I would like to express my sincere gratitude to my
advisors, Assoc. Prof. Thi Lan Le and Dr. Trung Kien Dao, for their continuous support
of my Ph.D. study and related research, and for their patience, motivation, and immense
knowledge. Their guidance helped me throughout the research and writing of this
thesis. I could not have imagined having better advisors and mentors for my Ph.D.
study.
Besides my advisors, I would like to thank the scientists who authored the published works cited in this thesis; their works provided valuable information resources for my research. In the process of carrying out and completing my research, I have received much support from the board of MICA directors.
My sincere thanks go to Prof. Yen Ngoc Pham, Prof. Eric Castelli and Dr. Son Viet
Nguyen, who provided me with the opportunity to join research work at MICA,
and who gave me access to the laboratory and research facilities. Without their precious
support it would have been impossible to conduct this research.
I would also like to thank the board of directors of the University of Technology and
Logistics, where I work. I received financial support and time from my office and
leaders to complete my doctoral thesis.
I gratefully acknowledge the financial support for publishing papers and conference
fees from research projects B2013.01.48 and NAFOSTED 102.04-2013.32. I would like
to thank my colleagues at the Computer Vision Department and the Pervasive Spaces and
Interaction Department for their companionship during my research.
A special thanks to my family. Words cannot express how grateful I am to my
mother and father for all of the sacrifices that they have made on my behalf. I would
also like to thank my beloved husband and my sisters. Thank you for supporting
me in everything.
Hanoi, January 2017
PhD Student
Thuy Pham Thi Thanh



CONTENTS

DECLARATION OF AUTHORSHIP                                                      i
ACKNOWLEDGEMENT                                                               ii
CONTENTS                                                                       v
SYMBOLS                                                                       vi
LIST OF TABLES                                                                 x
LIST OF FIGURES                                                             xvii

1 LITERATURE REVIEW                                                           13
  1.1 WiFi-based localization                                                 14
  1.2 Vision-based person localization                                        18
    1.2.1 Human detection                                                     19
      1.2.1.1 Motion-based detection                                          19
      1.2.1.2 Classifier-based detection                                      21
    1.2.2 Human tracking                                                      22
    1.2.3 Human localization                                                  23
  1.3 Person localization based on fusion of WiFi and visual properties      24
  1.4 Vision-based person re-identification                                   26
  1.5 Conclusion                                                              29

2 WIFI-BASED PERSON LOCALIZATION                                              30
  2.1 Framework                                                               30
  2.2 Probabilistic propagation model                                         32
    2.2.1 Parameter estimation                                                33
    2.2.2 Reduction of Algorithm Complexity                                   34
  2.3 Fingerprinting database and KNN matching                                35
  2.4 Experimental results                                                    38
    2.4.1 Testing environment and data collection                             38
    2.4.2 Experiments for propagation model                                   40
    2.4.3 Localization experiments                                            43
      2.4.3.1 Evaluation metrics                                              43
      2.4.3.2 Experimental results                                            43
  2.5 Conclusion                                                              49

3 VISION-BASED PERSON LOCALIZATION                                            50
  3.1 Introduction                                                            50
  3.2 Experimental datasets                                                   53
  3.3 Shadow Removal                                                          56
    3.3.1 Chromaticity-based feature extraction and shadow-matching score calculation   57
    3.3.2 Shadow-matching score utilizing physical properties                 60
    3.3.3 Density-based score fusion scheme                                   62
    3.3.4 Experimental Evaluation                                             65
  3.4 Human detection                                                         67
    3.4.1 Fusion of background subtraction and HOG-SVM                        67
    3.4.2 Experimental evaluation                                             69
      3.4.2.1 Dataset and evaluation metrics                                  69
      3.4.2.2 Experimental results                                            70
  3.5 Person tracking and localization                                        72
    3.5.1 Kalman filter                                                       72
    3.5.2 Person tracking and data association                                73
    3.5.3 Person localization and linking trajectories in camera network      80
      3.5.3.1 Person localization                                             80
      3.5.3.2 Linking person's trajectories in camera network                 82
    3.5.4 Experimental evaluation                                             84
      3.5.4.1 Initial values                                                  84
      3.5.4.2 Evaluation metrics for person tracking in one camera FOV        85
      3.5.4.3 Experimental results                                            87
  3.6 Conclusion                                                              90

4 PERSON IDENTIFICATION AND RE-IDENTIFICATION IN A CAMERA NETWORK             92
  4.1 Face recognition system                                                 93
    4.1.1 Framework                                                           93
    4.1.2 Experimental evaluation                                             96
      4.1.2.1 Testing scenarios                                               96
      4.1.2.2 Measurements                                                    96
      4.1.2.3 Testing data and results                                        96
  4.2 Appearance-based person re-identification                               97
    4.2.1 Framework                                                           97
    4.2.2 Improved kernel descriptor for human appearance                     98
    4.2.3 Experimental results                                               102
      4.2.3.1 Testing datasets                                               102
      4.2.3.2 Results and discussion                                         104
  4.3 Conclusion                                                             115

5 FUSION OF WIFI AND CAMERA FOR PERSON LOCALIZATION AND IDENTIFICATION       117
  5.1 Fusion framework and algorithm                                         118
    5.1.1 Framework                                                          118
    5.1.2 Fusion method                                                      120
      5.1.2.1 Kalman filter                                                  121
      5.1.2.2 Optimal Assignment                                             123
  5.2 Dataset and Evaluation                                                 124
    5.2.1 Testing dataset                                                    124
    5.2.2 Experimental results                                               128
      5.2.2.1 Experimental results on script 1 data                          128
      5.2.2.2 Experimental results on script 2 data                          129
  5.3 Conclusion                                                             132

PUBLICATIONS                                                                 137

BIBLIOGRAPHY                                                                 139

A                                                                            154


ABBREVIATIONS

No.  Abbreviation  Meaning
1    AHPE          Asymmetry based Histogram Plus Epitome
2    AmI           Ambient Intelligence
3    ANN           Artificial Neural Network
4    AoA           Angle of Arrival
5    AP            Access Point
6    BB            Bounding Box
7    BGS           Background Subtraction
8    CCD           Charge-Coupled Device
9    DSM           Direct Stein Method
10   EM            Expectation Maximization
11   FAR           False Acceptance Rate
12   FN            False Negative
13   FOV           Field of View
14   FP            False Positive
15   fps           frames per second
16   JPDAF         Joint Probability Data Association Filtering
17   GA            Genetic Algorithm
18   GLOH          Gradient Location and Orientation Histogram
19   GLONASS       Global Navigation Satellite System
20   GMM           Gaussian Mixture Model
21   GMOTA         Global Multiple Object Tracking Accuracy
22   GNSS          Global Navigation Satellite Systems
23   GPS           Global Positioning System
24   HOG           Histogram of Oriented Gradient
25   HSV           Hue Saturation Value
26   ID            Identity
27   IP            Internet Protocol
28   KLT           Kanade Lucas Tomasi
29   KNN           K-Nearest Neighbors
30   LAP           Linear Assignment Problem
31   LBP           Local Binary Pattern
32   LBPH          Local Binary Pattern Histogram
33   LDA           Linear Discriminant Analysis
34   LMNR          Large Margin Nearest Neighbor
35   LoB           Line of Bearing
36   LOS           Line of Sight
37   LR            Large Region
38   MAC           Media Access Control
39   MHT           Multiple Hypothesis Tracking
40   MOTA          Multiple Object Tracking Accuracy
41   MOG           Mixture of Gaussian
42   MOTP          Multiple Object Tracking Precision
43   MSCR          Maximally Stable Colour Regions
44   NLoS          Non-Line-of-Sight
45   PCA           Principal Component Analysis
46   PDF           Probability Distribution Function
47   PLS           Partial Least Squares
48   PNG           Portable Network Graphics
49   PPM           Probabilistic Propagation Model
50   RBF           Radial Basis Function
51   RDC           Relative Distance Comparison
52   Re-ID         Re-Identification
53   RFID          Radio Frequency Identification
54   RGB           Red Green Blue
55   ROI           Region of Interest
56   RSS           Received Signal Strength
57   RSSI          Received Signal Strength Indication
58   SD            Shadow
59   SDALF         Symmetry Driven Accumulation of Local Features
60   SIFT          Scale Invariant Feature Transform
61   SKMGM         Spatial Kinetic Mixture of Gaussian Model
62   SLAM          Simultaneous Localization and Mapping
63   SMP           Stable Marriage Problem
64   SPOT          Structure Preserving Object Tracker
65   SR            Small Region
66   STGMM         Spatio Temporal Gaussian Mixture Model
67   STL           Standard Template Library
68   SURF          Speeded Up Robust Features
69   SVM           Support Vector Machine
70   SVR           Support Vector Regression
71   TAPPMOG       Time Adaptive Per Pixel Mixtures of Gaussians
72   TDoA          Time Difference of Arrival
73   TLGMM         Two-Layer Gaussian Mixture Model
74   TN            True Negative
75   ToA           Time of Arrival
76   TP            True Positive
77   UWB           Ultra-Wideband
78   VOR           VHF Omnidirectional Range
79   WLAN          Wireless Local Area Network
80   WPS           WiFi Positioning System


LIST OF TABLES

Table 2.1  Genetic algorithm configuration.  40
Table 2.2  Optimized system parameters for the first and the second scenarios of testing environments.  41
Table 2.3  Evaluations for the first scenario with distance and RSSI features.  45
Table 2.4  Localization results for the first scenario using different features of distance and RSSI, without using coefficient λ.  45
Table 2.5  Evaluations for the second scenario with distance and RSSI features.  47
Table 3.1  Performance of human detectors with the HOG-SVM method and the combination of HOG-SVM and Adaptive GMM, with and without shadow removal (SR), on the MICA2 dataset.  71
Table 3.2  Reference points on image and floor plan coordinate systems.  86
Table 3.3  Evaluations for homography transformation.  88
Table 3.4  Testing results of person tracking and localization in Cam1's FOV.  88
Table 3.5  Testing results of person tracking and localization in Cam2's FOV.  89
Table 3.6  Testing results of person tracking and localization in Cam4's FOV.  90
Table 4.1  Comparative face recognition results and time consumption for gallery and probe sets.  97
Table 4.2  Datasets for person re-identification testing. In the last column, the number of signs ( ) shows the ranking for intra-class variation of the datasets.  104
Table 4.3  The comparative evaluations of person Re-ID on the HDA dataset.  111
Table 4.4  The testing results on the iLIDS VID dataset for the proposed method, the original KDES and the method in [157].  112
Table 4.5  The comparative evaluations on Rank 1 (%) for person Re-ID with different methods and datasets. (The sign "×" indicates no information available. For the iLIDS dataset, there are two data settings as described in [19] and in [10].)  112
Table 5.1  The comparative results of the proposed fusion algorithm against the evaluations in chapter 4 with testing data of script 1.  129
Table 5.2  The experimental results for person tracking by identification and person Re-ID with the second dataset.  131
Table A.1  Technical information of the WiFi-based localization system.  154
Table A.2  Technical information of the vision-based localization system.  156
Table A.3  Technical information of the fusion-based localization system.  157


LIST OF FIGURES

Figure 1  Person surveillance context in indoor environment.  4
Figure 2  Multimodal localization system fusing WiFi signals and images.  4
Figure 3  Surveillance region with WiFi range covering disjoint camera FOVs.  6
Figure 4  Framework for person localization and identification by fusion of WiFi and camera.  11
Figure 1.1  Flowchart of WiFi-based person localization.  14
Figure 1.2  The angle-based positioning technique using AoA.  15
Figure 1.3  The position of a mobile client is determined by (a) the intersection of three circles (circular trilateration), with the radius of each being the distance d_i, and (b) the intersection of two hyperbolas (hyperbolic lateration).  16
Figure 1.4  Framework of a person localization system using fixed cameras.  19
Figure 1.5  Human detection in an image [149].  19
Figure 1.6  Human detection results with shadows (images in the left column) and without shadows (images in the right column) [134].  20
Figure 1.7  Human tracking and Re-ID results in two different cameras [22].  22
Figure 1.8  Camera-based localization system in [154]: (a) original frame, (b) foreground segmentation by MOG, (c) extraction of foot region from (b), (d) Gaussian kernel of (c) mapped on the floor plan.  24
Figure 2.1  Diagram of the proposed WiFi-based object localization system.  31
Figure 2.2  An example of radio map with a set of p_i RPs and the distance values d_i(L) from each RP to L APs.  31
Figure 2.3  WiFi signal attenuation through walls/floors.  33
Figure 2.4  Optimization of system parameters using GAs.  34
Figure 2.5  Weights of different values of θ based on dissimilarity.  37
Figure 2.6  Weights of different values of λ based on dissimilarity.  38
Figure 2.7  Distribution of APs in the first scenario of the testing environment.  39
Figure 2.8  Ground plan of the second floor in the second testing scenario.  39
Figure 2.9  Radio map (a) with (b) 2000 fingerprint locations collected on the 8th floor in the first testing scenario.  39
Figure 2.10  Radio map (a) with (b) 1200 fingerprint locations collected on the 2nd floor in the second testing scenario.  40
Figure 2.11  Deterministic propagation model compared to measurements in the first scenario.  41
Figure 2.12  Deterministic propagation model compared to measurements in the second scenario.  42
Figure 2.13  Probabilistic propagation model for the first scenario.  42
Figure 2.14  Probabilistic propagation model for the second scenario.  43
Figure 2.15  Localization results for the first scenario, with distance and RSSI features.  44
Figure 2.16  Distribution of localization error for distance and RSSI features in the first scenario.  44
Figure 2.17  Localization reliability for distance and RSSI features in the first scenario.  45
Figure 2.18  Localization results for distance and RSSI features, without using coefficient λ.  46
Figure 2.19  Distribution of localization error for distance and RSSI features, without using coefficient λ.  46
Figure 2.20  Localization reliability for distance and RSSI features, without using coefficient λ.  47
Figure 2.21  Localization results for the second scenario, with distance and RSSI features.  47
Figure 2.22  Distribution of localization error for distance and RSSI features in the second scenario.  48
Figure 2.23  Localization reliability for distance and RSSI features in the second scenario.  48
Figure 3.1  Framework of person localization in camera networks.  50
Figure 3.2  Examples of tracking lines formed by linking trajectories of corresponding FootPoint positions.  51
Figure 3.3  Testing environment.  53
Figure 3.4  Examples in the MICA1 dataset. The images on the top are captured from the camera at the check-in region and used for the training phase. The images at the bottom are the testing images acquired from four other cameras (Cam1, Cam2, Cam3, Cam4) in the surveillance region.  54
Figure 3.5  Examples of manually-extracted human ROIs from Cam2.  55
Figure 3.6  Examples of manually-cropped human ROIs from Cam1 and Cam4.  55
Figure 3.7  Framework of the proposed shadow removal method.  56
Figure 3.8  Extracting shadow pixels: (a) original frame, (b) foreground mask obtained with adaptive GMM, (c) frame superimposed on foreground mask, (d) and (e) object and shadow pixels labeled manually from (c), respectively.  57
Figure 3.9  Example using chromaticity-based features for shadow removal: background image at (a) H, (b) S and (c) V channels; (d) ∆(H) and (e) ∆(S) of shadow pixels. This example uses the same frame as in Fig. 3.8.  58
Figure 3.10  The results of the GMM fitting algorithm with K = 3 on the distribution of the chromaticity-based features: (a) original distribution of feature vectors x; (b) isocurves of each mixture component k; (c) density of the mixture components.  60
Figure 3.11  The results of the shadow score calculated on an example image: (a) original image; (b) background subtraction result; (c) shadow matching score s1 calculated using the GMM fitting results in Fig. 3.10 with chromaticity features; (d) shadow matching score s2 calculated with physical features.  60
Figure 3.12  Physical shadow model [76], with shadow pixels falling into the gray area. The physics-based feature vector for a shadow pixel p contains two components: α(p) (length of vector SD) and θ(p).  61
Figure 3.13  (a) Physical shadow model [76], and examples of (b) original image, (c) log-scale of the θ(p) property, (d) log-scale of the α(p) property.  61
Figure 3.14  Illustration of log-likelihood calculated from learning (a) chromaticity-based score, (b) physics-based score, and (c) score fusion of shadow and non-shadow pixels. (d) Visualization of s = (s1, s2) for shadow pixels (blue dots) and non-shadow pixels (red dots).  63
Figure 3.15  Illustration of shadow removal results using the score fusion scheme. The first row contains original frames. The second row shows the foreground masks with background subtraction (BGS). The foreground masks with shadow removal are presented in the third row. The fourth row indicates the shadow pixels detected by the proposed method, and the final row is ground truth data.  64
Figure 3.16  The evaluations for shadow removal from the proposed method and other methods in [134].  65
Figure 3.17  Examples of shadow removal results: the original frames are in the first row; the second row: chromaticity-based method; the third row: physical method; the fourth row: geometry-based method; the fifth and sixth rows: texture-based method; the last row: our method.  66
Figure 3.18  Examples of training data for the HOG descriptor. Positive and negative training images captured by (a) Camera 1 (Cam1), (b) Camera 2 (Cam2) and (c) Camera 3 (Cam3).  70
Figure 3.19  Examples of (a) false positive (FP) and (b) false negative (FN) in the HOG-SVM detector.  71
Figure 3.20  The Kalman recursive model.  72
Figure 3.21  Example of (a) a grid map and (b) a threshold region bounded by a contour line.  74
Figure 3.22  Examples of noise in detection: (a) equal number of real targets and detections, but not all of the detections are true ones; (b) the number of real targets is larger than the number of detections; (c) the number of real targets is smaller than the number of detections.  77
Figure 3.23  Example of the Hungarian algorithm.  79
Figure 3.24  Camera pinhole model.  80
Figure 3.25  Examples of original frames (top row) and frames with corrected distortion (bottom row).  82
Figure 3.26  The flowchart for warping camera FOVs.  83
Figure 3.27  Four marked points and the detected points on the floor plan resulting from the inverse transformation of matrix H are shown on the top. The bottom images are bird's-eye view projections of Cam1 and Cam3.  84
Figure 3.28  The matching points on the floor plan between two images captured from Cam1 and Cam3.  84
Figure 3.29  Flowchart of linking user trajectories.  85
Figure 3.30  Image points marked on the frames captured from (a) camera 1 (Cam1), (b) camera 2 (Cam2) and (c) camera 4 (Cam4). These points are used for calculating the camera extrinsic parameters.  86
Figure 3.31  Floor map with a 2D coordinate system.  87
Figure 3.32  Examples of frame sequences in the MICA2 dataset with (a) hallway scenario captured from Cam1, (b) lobby scenario captured from Cam2 and (c) showroom scenario captured from Cam4. They are used for evaluations of person tracking and localization.  89
Figure 3.33  Examples of person tracking results in Cam1 FOV (hallway scene).  89
Figure 3.34  Examples of person tracking results in Cam2 FOV (lobby scene).  90
Figure 3.35  Examples of person tracking results in Cam4 FOV (showroom scene).  90
Figure 4.1  Framework of human face recognition.  93
Figure 4.2  Face detection result represented by a rectangular region.  93
Figure 4.3  Example of LBP computation [75].  94
Figure 4.4  LBP images with different gray-scale transformations.  95
Figure 4.5  Face description with LBPH.  95
Figure 4.6  Examples in the training database with (a) face images of 20 subjects, (b) images of one subject.  97
Figure 4.7  Diagram of the vision-based person Re-ID system.  98
Figure 4.8  The basic idea of representation based on kernel methods.  98
Figure 4.9  Illustration of size-adaptive patches (a, c) and size-fixed patches (a, b) as mentioned in [25].  100
Figure 4.10  Image-level feature vector concatenated from feature vectors of blocks in the pyramid layers.  101
Figure 4.11  Examples of testing images detected automatically in the MICA2 dataset. The first and second rows contain the human ROIs with and without shadow removal, respectively.  103
Figure 4.12  Results of the proposed method against AHPE [19] and KDES [25] on the CAVIAR4REID dataset.  105
Figure 4.13  Results of the proposed method against AHPE [19], SDALF [54] and KDES [25] on the iLIDS dataset.  105
Figure 4.14  Comparative results with reported methods in [10] on the iLIDS dataset.  106
Figure 4.15  Testing results on our MICA1 dataset.  107
Figure 4.16  The person Re-ID evaluation of the proposed KDES descriptor against the original KDES and PLS [138] on the ETHZ dataset with (a) Sequence 1, (b) Sequence 2 and (c) Sequence 3.  108
Figure 4.17  The person Re-ID evaluations of the proposed KDES descriptor compared with the original KDES method on the WARD dataset.  109
Figure 4.18  The person Re-ID evaluations of the proposed KDES descriptor compared with the original KDES method on the RAiD dataset.  110
Figure 4.19  The recognition rates of the proposed KDES on the HDA dataset.  110
Figure 4.20  Recognition rates of the proposed KDES on the MICA2 dataset with manually-cropped human ROIs, and automatically-detected human ROIs with and without shadow removal.  113
Figure 4.21  Examples of person Re-ID results on the MICA2 dataset. The first column shows the frames captured from Cam1 and Cam4. The second column contains human ROIs extracted manually from these frames. The human ROIs detected automatically with and without shadow removal are shown in the third and fourth columns, respectively. The ID labels are placed on top of these human ROIs, and at the bottom, the filled circles and squares indicate the correct and incorrect results of person Re-ID, respectively.  114
Figure 5.1  The different types of sensor combination for person localization: (a) late fusion, (b) early fusion, (c) trigger fusion.  119
Figure 5.2  Framework for person localization and Re-ID using the combined system of WiFi and camera.  119
Figure 5.3  Flowchart of the fusion algorithm.  120
Figure 5.4  A 2D floor map of the testing environment in Figure 5.5, with the routing path of moving people in the testing scenarios.  125
Figure 5.5  Testing environment.  126
Figure 5.6  The visual examples in script 2. The first row contains frames for the scenario of one moving person. The scenarios for two, three and five moving people are shown in the second, third and fourth rows.  127
Figure 5.7  Training examples of manually-extracted human ROIs from Cam 2 for person 1 (images on the left) and person 2 (images on the right).  129
Figure 5.8  Testing examples of manually-extracted human ROIs from Cam 1 (left column) and Cam 4 (right column) for (a) person 1 and (b) person 2.  130
Figure 5.9  Person Re-ID evaluations on testing data of two moving people.  131


INTRODUCTION
Motivation
Modern technology is changing human life in many ways, notably in the way people
interact with technological products. Human-computer interaction is becoming more and
more natural and convenient, and this makes our life more enjoyable and comfortable.
A new concept was formed for this revolutionary change: Ambient Intelligence (AmI).
Ambient Intelligence (AmI) has become an active area with many related projects,
research programs and potential applications in different domains, such as home, office,
education, entertainment, health care, emergency services, logistics, transportation, security and surveillance [135], [116], [164], [67], [62], [43], [123]. A common viewpoint
shared by many authors [3], [61], [14], [40] is that AmI refers to digital environments
in which the environmental information (temperature, humidity, etc.) and the human
presence are perceived automatically by sensors and devices interconnected through
networks. Three main features of Perceiving, Modeling and Acting are required for an
AmI system [40]. Perceiving is also considered as the problem of context awareness, in
which humans and their attributes are the center of perception. Modeling relates to
feature extraction and building a discriminative descriptor for each object. Finally, Acting
specifies the response of the environment to the people inhabiting it by providing
adaptive and user-transparent services.
Although the vision of AmI was introduced more than ten years ago and its research has strengthened and expanded, the development and implementation of
real-life applications are still in their infancy. There are many practical challenges that need
to be addressed in each of the contributing technological areas or particular applications
[2].
In this research, the information of person position and identity is considered in
indoor environments. These are two of the most crucial attributes for ambient environments. In order to determine position (where a person is) and identity (who a person
is) in indoor environments, the two problems of person localization and identification need
to be solved. A wide range of sensors can be used to handle these problems, such as
Ultra-Wideband (UWB), ultrasound, Radio-Frequency Identification (RFID), camera,
WiFi, etc. [101]. UWB is especially useful for indoor environments where multipath is
severe, but it is not widely used because of the requirement for dedicated transmitter and receiver infrastructure. Ultrasound-based systems are able to locate objects
within a few centimeters, but remain prohibitively expensive and rely on a large amount
of fixed infrastructure. Such infrastructure is not only labor-intensive to install, but
also expensive to maintain. RFID allows identifying a person or an object and wirelessly transmitting the identity via radio waves using a unique RFID tag. RFID-based localization outperforms other technologies, but the deployment is expensive and the positioning range is limited. Cameras have become a dominant
technology for person localization and identification thanks to the improvement and
miniaturization of actuators (e.g., lasers) and particularly advances in the technology of detectors (e.g., CCD sensors). However, deployment is limited because of the
exorbitant cost of the solution (both in terms of licensing and processing requirements)
and the effectiveness of the image processing algorithms themselves in solving real-world dynamic situations. A WiFi positioning system (WPS) is a suitable alternative
to GPS and GLONASS in indoor environments, where these technologies are inadequate due
to various causes, including multi-path propagation and signal blockage indoors. Moreover, WiFi
positioning takes advantage of the rapid growth of wireless access points in building
areas and of wireless-enabled smart mobile devices. However, the positioning accuracy
of WiFi-signal-based systems is lower than that of vision-based positioning systems. Indoor environments are particularly challenging for WiFi-based positioning for several reasons:
multi-path propagation, Non-Line-of-Sight (NLoS) conditions, high attenuation and
signal scattering due to the high density of obstacles, etc.

It has by now become apparent that no overall solution based on a single technology is perfect for all applications. Therefore, besides developing optimal algorithms for
each technology, fusion of their data is a new trend in solving the problem of person
localization and identification in indoor environments [121], [12], [98], [104]. The main
purpose of the fusion is to retain the benefits of each individual sensor technology,
whilst at the same time mitigating their weaknesses. Motivated by this, our research focuses on person localization and identification by combining WiFi-based and
vision-based technologies. This combination offers the following benefits in comparison
with each single method:
❼ A coarse-to-fine localization system can be set up. The coarse level of positioning
is established by the WiFi system, and based on this, the fine positioning processes
are done by the cameras that are within the range of WiFi-based localization (see the sketch after this list). The
coarse-to-fine localization system allows people to be localized continuously with a
sparse camera network, and offers lower cost for system deployment and computation, with cameras deployed only in the regions that require high positioning
accuracy.
❼ Easy scalability of the coverage area by simply deploying more APs (Access Points)
in the environment.


❼ Richer information for person identification and re-identification (Re-ID). One
object can be identified by both WiFi and camera systems.
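
The coarse-to-fine idea mentioned in the first benefit above can be pictured as a small control-flow sketch. The Python fragment below is only an illustration under assumed helpers: wifi_localize, camera_localize and the camera FOV polygons are hypothetical names, not the system implemented in this thesis.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) on the floor plan, in metres


@dataclass
class Camera:
    name: str
    fov_polygon: List[Point]  # camera FOV projected onto the floor plan

    def covers(self, p: Point) -> bool:
        """Ray-casting point-in-polygon test for the camera FOV."""
        x, y = p
        inside = False
        n = len(self.fov_polygon)
        for i in range(n):
            x1, y1 = self.fov_polygon[i]
            x2, y2 = self.fov_polygon[(i + 1) % n]
            if (y1 > y) != (y2 > y):
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x_cross > x:
                    inside = not inside
        return inside


def coarse_to_fine_position(
    mac: str,
    cameras: List[Camera],
    wifi_localize: Callable[[str], Point],                      # coarse: WiFi-based estimate
    camera_localize: Callable[[Camera, str], Optional[Point]],  # fine: vision-based localizer
) -> Point:
    """Coarse WiFi estimate first; refine only with a camera whose FOV covers it."""
    coarse = wifi_localize(mac)
    for cam in cameras:
        if cam.covers(coarse):
            fine = camera_localize(cam, mac)
            if fine is not None:
                return fine            # fine positioning inside a covered FOV
    return coarse                      # outside every FOV: keep the coarse WiFi estimate
```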

Objective
The thesis focuses on researching and developing solutions for person localization
and identification, considered in the context of automatic person surveillance in indoor environments using WiFi and camera systems. For this, the concrete
objectives are:
❼ Constructing an improved method for WiFi-based localization. The method allows the use of popular WiFi-enabled devices, such as smartphones or tablets, for
localization. These kinds of devices are not originally produced for localization,
but the RSSI values they scan from nearby APs are widely used for this purpose. The proposed method can overcome some of the unstable characteristics
of RSSI values used for localization in indoor environments. It also
provides coarse-level localization in the combined system of WiFi and camera. The
performance criterion set in this thesis for the WiFi-based localization system is
an error under 4 m at a reliability of 90%.
❼ Building efficient methods for vision-based person localization, including solutions for human detection, tracking and linking a person's trajectories in camera
networks. The performance target for human localization is a positioning error under
50 cm for all matched pairs of person and tracker hypotheses over all frames.
❼ Constructing an efficient solution for person Re-ID in camera networks.
❼ Developing a method for person localization and Re-ID by combining WiFi
and camera systems. The method can leverage the advantages of each single technology, such as the high localization accuracy of vision-based systems and the low computational cost and more reliable identity (ID) of targets in WiFi-based localization
systems. In the combined localization system, the positioning performance based
on cameras is preserved, while the performance of person identification and Re-ID
in the camera network is improved.
❼ Setting up a combined system of WiFi and camera under indoor scenarios of an
automatic person surveillance system. The proposed methods for person localization and identification are evaluated in this system.
❼ Building datasets for experimental evaluation of the proposed solutions. To the
best of our knowledge, public multi-modal datasets for evaluating combined localization and identification systems of WiFi and camera do not exist.

Context, constraints, and challenges

Context
The combined system of WiFi and camera for person localization and identification
is deployed in real-world scenarios for an automated person surveillance system in
building environments. In almost all buildings, entrance and exit gates are set up in order
to control who comes into or goes out of a building. This context is also considered in
our system. Figure 1 shows the context in which the testing environment is divided into
two areas: Check-in/check-out and Surveillance. The proposed system is implemented
in these areas with two main functions. The first function is learning ID cues, which is
executed for each individual in the check-in/check-out area. The second function is person
localization and Re-ID, which is processed in the surveillance area (see Figure 2).
Figure 1 Person surveillance context in indoor environment.

Figure 2 Multimodal localization system fusing WiFi signals and images.
Each person is required to hold a WiFi-integrated device and, one by one, comes in
at the entrance of the check-in/check-out area. At the entrance gate of this region,
the person's ID is learned individually from the images captured by cameras
and the MAC address of the WiFi-enabled equipment held by each person. One camera,
which is attached at the front door of the check-in gate, captures the human face for
face recognition. In this case, face recognition is processed in a closed-set manner, which
means the faces in the probe set are included in the gallery set. Another camera acquires
human images at different poses, and the learning phase of the appearance-based ID is done
for each person. In short, in the first region, we obtain three types of signatures for each
person N_i: the face-based identity ID_i^F, the WiFi-based identity ID_i^W and the appearance-based
identity ID_i^A. Based on ID_i^F, we know which corresponding ID_i^W is already
inside the surveillance region. The corresponding ID_i^A is also assigned to this person.
Depending on the circumstances, these ID cues can be used for
person localization and Re-ID in the surveillance region.
Each person ends his/her route at the exit gate, where he/she is checked out by another camera that captures the human face for face recognition. The checked-out person is then removed from the processing system.
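
As an illustration of the check-in/check-out bookkeeping described above, the following Python sketch shows one possible way to keep the mapping between the three ID cues. The class and method names (PersonRecord, CheckInRegistry, check_in, check_out) are assumptions for this example, not the thesis implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PersonRecord:
    face_id: str               # ID_i^F, learned by the check-in camera
    wifi_id: str               # ID_i^W, MAC address of the carried WiFi device
    appearance_id: str         # ID_i^A, appearance-based identity learned at check-in
    appearance_model: object = None  # e.g. feature vectors of the current clothes


class CheckInRegistry:
    """Keeps the ID cues of everyone currently inside the surveillance area."""

    def __init__(self) -> None:
        self._by_face: Dict[str, PersonRecord] = {}
        self._by_wifi: Dict[str, PersonRecord] = {}

    def check_in(self, record: PersonRecord) -> None:
        # Learning phase at the entrance gate: map the three cues to one person.
        self._by_face[record.face_id] = record
        self._by_wifi[record.wifi_id] = record

    def lookup_by_wifi(self, mac: str) -> Optional[PersonRecord]:
        # In the surveillance area, a WiFi detection narrows the Re-ID search space.
        return self._by_wifi.get(mac)

    def check_out(self, face_id: str) -> None:
        # Exit gate: remove the checked-out person, shrinking the matching space.
        record = self._by_face.pop(face_id, None)
        if record is not None:
            self._by_wifi.pop(record.wifi_id, None)
```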
In summary, with the above-mentioned scenarios, in the check-in/check-out area, we
can:
❼ Monitor the changes in each individual's appearance (changes in clothes) each
time he/she comes into the surveillance region. This makes appearance-based
person descriptors more feasible for person Re-ID.
❼ Decrease the computing cost of the system and narrow the ID-matching space
by eliminating checked-out people from the processing system.
❼ Map between different ID cues for the same person.

In the surveillance area, the two problems of person localization and Re-ID are
simultaneously solved by combining visual and WiFi information. The surveillance
region is set up so that the WiFi range, which is formed by deploying wireless Access
Points (APs), covers all visual ranges (the cameras' FOVs). Figure 3 demonstrates
this setting, with two camera ranges covered by the WiFi range.
Constraints
With the above-mentioned context, some constraints are taken into account for
the person localization and identification system by fusion of WiFi and camera:
❼ Environment:

– Indoor environment with space constraints:
✯ A single floor including the scenarios in a hallway, a lobby and a room,
with the area of the hallway being 9×5.1 m, the lobby 7.5×1.8 m and the showroom 5.4×5.1 m.

Figure 3 Surveillance region with WiFi range covering disjoint camera FOVs.
✯ The space in which people are continuously located by the WiFi range and
at least two non-overlapping cameras.
– Furniture and other objects are distributed statically in an office building.
❼ Illumination conditions: Both natural and artificial lighting sources are considered.
– Natural lighting changes within a day (morning, noon, afternoon).
– Artificial lighting sources are stable in a room.
❼ Sensors:

– Vision:
✯ Stationary RGB cameras capturing frames at a normal frame rate
(15 to 30 fps) with an image resolution of 640×480 pixels.
✯ Cameras are deployed in the environment with non-overlapping FOVs.
✯ Cameras are time synchronized with Internet time.

– WiFi:
✯ Wireless APs are deployed so that their wireless ranges cover the entire
surveillance region.
✯ WiFi-enabled devices, such as smartphones or tablets, have their own
ID (the MAC address of the WiFi adapter). Each person holding this device