
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

THUY PHAM THI THANH

NGHIÊN CỨU VÀ PHÁT TRIỂN CÁC KỸ THUẬT
ĐỊNH VỊ VÀ ĐỊNH DANH KẾT HỢP THÔNG TIN
HÌNH ẢNH VÀ WIFI
PERSON LOCALIZATION AND IDENTIFICATION
BY FUSION OF VISION AND WIFI

DOCTORAL THESIS OF COMPUTER SCIENCE

Hanoi − 2017


CONTENTS

DECLARATION OF AUTHORSHIP . . . . . . . . . . . . . . . . . . . . . . . i
ACKNOWLEDGEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
CONTENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
SYMBOLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii

1 LITERATURE REVIEW                                                      13
1.1 WiFi-based localization . . . . . . . . . . . . . . . . . . . . . .  14
1.2 Vision-based person localization . . . . . . . . . . . . . . . . . . 18
1.2.1 Human detection . . . . . . . . . . . . . . . . . . . . . . . . .  19
1.2.1.1 Motion-based detection . . . . . . . . . . . . . . . . . . . . . 19
1.2.1.2 Classifier-based detection . . . . . . . . . . . . . . . . . . . 21
1.2.2 Human tracking . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.2.3 Human localization . . . . . . . . . . . . . . . . . . . . . . . . 23
1.3 Person localization based on fusion of WiFi and visual properties .  24
1.4 Vision-based person re-identification . . . . . . . . . . . . . . .  26
1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2 WIFI-BASED PERSON LOCALIZATION                                         30
2.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  30
2.2 Probabilistic propagation model . . . . . . . . . . . . . . . . . .  32
2.2.1 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.2 Reduction of Algorithm Complexity . . . . . . . . . . . . . . . .  34
2.3 Fingerprinting database and KNN matching . . . . . . . . . . . . . . 35
2.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.1 Testing environment and data collection . . . . . . . . . . . . .  38
2.4.2 Experiments for propagation model . . . . . . . . . . . . . . . .  40
2.4.3 Localization experiments . . . . . . . . . . . . . . . . . . . . . 43
2.4.3.1 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . 43
2.4.3.2 Experimental results . . . . . . . . . . . . . . . . . . . . . . 43
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3 VISION-BASED PERSON LOCALIZATION                                       50
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2 Experimental datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.3 Shadow Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.1 Chromaticity-based feature extraction and shadow-matching score
calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.2 Shadow-matching score utilizing physical properties . . . . . . . 60
3.3.3 Density-based score fusion scheme . . . . . . . . . . . . . . . . . 62
3.3.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . 65
3.4 Human detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.4.1 Fusion of background subtraction and HOG-SVM . . . . . . . . 67
3.4.2 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . 69
3.4.2.1 Dataset and evaluation metrics . . . . . . . . . . . . . 69
3.4.2.2 Experimental results . . . . . . . . . . . . . . . . . . . 70
3.5 Person tracking and localization . . . . . . . . . . . . . . . . . . . . . . 72
3.5.1 Kalman filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.5.2 Person tracking and data association . . . . . . . . . . . . . . . 73
3.5.3 Person localization and linking trajectories in camera network . 80
3.5.3.1 Person localization . . . . . . . . . . . . . . . . . . . . 80
3.5.3.2 Linking person’s trajectories in camera network . . . . 82
3.5.4 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . 84
3.5.4.1 Initial values . . . . . . . . . . . . . . . . . . . . . . . 84
3.5.4.2 Evaluation metrics for person tracking in one camera
FOV . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.5.4.3 Experimental results . . . . . . . . . . . . . . . . . . . 87
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4 PERSON IDENTIFICATION AND RE-IDENTIFICATION IN A CAMERA NETWORK        92
4.1 Face recognition system . . . . . . . . . . . . . . . . . . . . . .  93
4.1.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . .  93
4.1.2 Experimental evaluation . . . . . . . . . . . . . . . . . . . . .  96
4.1.2.1 Testing scenarios . . . . . . . . . . . . . . . . . . . . . . .  96
4.1.2.2 Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.1.2.3 Testing data and results . . . . . . . . . . . . . . . . . . . . 96
4.2 Appearance-based person re-identification . . . . . . . . . . . . .  97
4.2.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . .  97
4.2.2 Improved kernel descriptor for human appearance . . . . . . . . .  98
4.2.3 Experimental results . . . . . . . . . . . . . . . . . . . . . .  102
4.2.3.1 Testing datasets . . . . . . . . . . . . . . . . . . . . . . .  102
4.2.3.2 Results and discussion . . . . . . . . . . . . . . . . . . . .  104
4.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . .  115

5 FUSION OF WIFI AND CAMERA FOR PERSON LOCALIZATION AND IDENTIFICATION  117
5.1 Fusion framework and algorithm . . . . . . . . . . . . . . . . . .  118
5.1.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.1.2 Fusion method . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.1.2.1 Kalman filter . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.1.2.2 Optimal Assignment . . . . . . . . . . . . . . . . . . . . . .  123
5.2 Dataset and Evaluation . . . . . . . . . . . . . . . . . . . . . .  124
5.2.1 Testing dataset . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.2.2 Experimental results . . . . . . . . . . . . . . . . . . . . . .  128
5.2.2.1 Experimental results on script 1 data . . . . . . . . . . . . . 128
5.2.2.2 Experimental results on script 2 data . . . . . . . . . . . . . 129
5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . .  132

PUBLICATIONS                                                             137

BIBLIOGRAPHY                                                             139

A                                                                        154


ABBREVIATIONS

No.  Abbreviation  Meaning
1    AHPE          Asymmetry-based Histogram Plus Epitome
2    AmI           Ambient Intelligence
3    ANN           Artificial Neural Network
4    AoA           Angle of Arrival
5    AP            Access Point
6    BB            Bounding Box
7    BGS           Background Subtraction
8    CCD           Charge-Coupled Device
9    DSM           Direct Stein Method
10   EM            Expectation Maximization
11   FAR           False Acceptance Rate
12   FN            False Negative
13   FOV           Field of View
14   FP            False Positive
15   fps           frames per second
16   JPDAF         Joint Probability Data Association Filtering
17   GA            Genetic Algorithm
18   GLOH          Gradient Location and Orientation Histogram
19   GLONASS       Global Navigation Satellite System
20   GMM           Gaussian Mixture Model
21   GMOTA         Global Multiple Object Tracking Accuracy
22   GNSS          Global Navigation Satellite Systems
23   GPS           Global Positioning System
24   HOG           Histogram of Oriented Gradients
25   HSV           Hue Saturation Value
26   ID            Identity
27   IP            Internet Protocol
28   KLT           Kanade-Lucas-Tomasi
29   KNN           K-Nearest Neighbors
30   LAP           Linear Assignment Problem
31   LBP           Local Binary Pattern
32   LBPH          Local Binary Pattern Histogram
33   LDA           Linear Discriminant Analysis
34   LMNR          Large Margin Nearest Neighbor
35   LoB           Line of Bearing
36   LOS           Line of Sight
37   LR            Large Region
38   MAC           Media Access Control
39   MHT           Multiple Hypothesis Tracking
40   MOTA          Multiple Object Tracking Accuracy
41   MOG           Mixture of Gaussians
42   MOTP          Multiple Object Tracking Precision
43   MSCR          Maximally Stable Colour Regions
44   NLoS          Non-Line-of-Sight
45   PCA           Principal Component Analysis
46   PDF           Probability Distribution Function
47   PLS           Partial Least Squares
48   PNG           Portable Network Graphics
49   PPM           Probabilistic Propagation Model
50   RBF           Radial Basis Function
51   RDC           Relative Distance Comparison
52   Re-ID         Re-Identification
53   RFID          Radio Frequency Identification
54   RGB           Red Green Blue
55   ROI           Region of Interest
56   RSS           Received Signal Strength
57   RSSI          Received Signal Strength Indication
58   SD            Shadow
59   SDALF         Symmetry-Driven Accumulation of Local Features
60   SIFT          Scale Invariant Feature Transform
61   SKMGM         Spatial Kinetic Mixture of Gaussian Model
62   SLAM          Simultaneous Localization and Mapping
63   SMP           Stable Marriage Problem
64   SPOT          Structure Preserving Object Tracker
65   SR            Small Region
66   STGMM         Spatio-Temporal Gaussian Mixture Model
67   STL           Standard Template Library
68   SURF          Speeded Up Robust Features
69   SVM           Support Vector Machine
70   SVR           Support Vector Regression
71   TAPPMOG       Time Adaptive Per Pixel Mixtures of Gaussians
72   TDoA          Time Difference of Arrival
73   TLGMM         Two-Layer Gaussian Mixture Model
74   TN            True Negative
75   ToA           Time of Arrival
76   TP            True Positive
77   UWB           Ultra-Wideband
78   VOR           VHF Omnidirectional Range
79   WLAN          Wireless Local Area Network
80   WPS           WiFi Positioning System


LIST OF TABLES

Table 2.1  Genetic algorithm configuration. . . . 40
Table 2.2  Optimized system parameters for the first and the second scenarios of testing environments. . . . 41
Table 2.3  Evaluations for the first scenario with distance and RSSI features. . . . 45
Table 2.4  Localization results for the first scenario using different features of distance and RSSI, without using coefficient λ. . . . 45
Table 2.5  Evaluations for the second scenario with distance and RSSI features. . . . 47
Table 3.1  Performance of human detectors with the HOG-SVM method, and the combination of HOG-SVM and adaptive GMM with and without shadow removal (SR), on the MICA2 dataset. . . . 71
Table 3.2  Reference points on image and floor plan coordinate systems. . . . 86
Table 3.3  Evaluations for homography transformation. . . . 88
Table 3.4  Testing results of person tracking and localization in Cam1's FOV. . . . 88
Table 3.5  Testing results of person tracking and localization in Cam2's FOV. . . . 89
Table 3.6  Testing results of person tracking and localization in Cam4's FOV. . . . 90
Table 4.1  Comparative face recognition results and time consumption for gallery and probe sets. . . . 97
Table 4.2  Datasets for person re-identification testing. In the last column, the number of signs ( ) shows the ranking for intra-class variation of the datasets. . . . 104
Table 4.3  The comparative evaluations of person Re-ID on the HDA dataset. . . . 111
Table 4.4  The testing results on the iLIDS-VID dataset for the proposed method, the original KDES and the method in [157]. . . . 112
Table 4.5  The comparative evaluations on Rank 1 (%) for person Re-ID with different methods and datasets. (The sign "×" indicates no information available. For the iLIDS dataset, there are two data settings, as described in [19] and in [10].) . . . 112
Table 5.1  The comparative results of the proposed fusion algorithm against the evaluations in Chapter 4 with testing data of script 1. . . . 129
Table 5.2  The experimental results for person tracking by identification and person Re-ID with the second dataset. . . . 131
Table A.1  Technical information of the WiFi-based localization system. . . . 154
Table A.2  Technical information of the vision-based localization system. . . . 156
Table A.3  Technical information of the fusion-based localization system. . . . 157


LIST OF FIGURES

Figure 1    Person surveillance context in indoor environment. . . . 4
Figure 2    Multimodal localization system fusing WiFi signals and images. . . . 4
Figure 3    Surveillance region with WiFi range covering disjoint camera FOVs. . . . 6
Figure 4    Framework for person localization and identification by fusion of WiFi and camera. . . . 11
Figure 1.1  Flowchart of WiFi-based person localization. . . . 14
Figure 1.2  The angle-based positioning technique using AoA. . . . 15
Figure 1.3  The position of a mobile client is determined by (a) the intersection of three circles (circular trilateration), where the radius of each is the distance d_i, and (b) the intersection of two hyperbolas (hyperbolic lateration). . . . 16
Figure 1.4  Framework of person localization system using fixed cameras. . . . 19
Figure 1.5  Human detection in an image [149]. . . . 19
Figure 1.6  Human detection results with shadows (images in the left column) and without shadows (images in the right column) [134]. . . . 20
Figure 1.7  Human tracking and Re-ID results in two different cameras [22]. . . . 22
Figure 1.8  Camera-based localization system in [154]: (a) original frame, (b) foreground segmentation by MOG, (c) extraction of foot region from (b), (d) Gaussian kernel of (c) mapped onto the floor plan. . . . 24
Figure 2.1  Diagram of the proposed WiFi-based object localization system. . . . 31
Figure 2.2  An example of radio map with a set of p_i RPs and the distance values d_i(L) from each RP to L APs. . . . 31
Figure 2.3  WiFi signal attenuation through walls/floors. . . . 33
Figure 2.4  Optimization of system parameters using GAs. . . . 34
Figure 2.5  Weights of different values of θ based on dissimilarity. . . . 37
Figure 2.6  Weights of different values of λ based on dissimilarity. . . . 38
Figure 2.7  Distribution of APs in the first scenario of the testing environment. . . . 39
Figure 2.8  Ground plan of the second floor in the second testing scenario. . . . 39
Figure 2.9  Radio map (a) with (b) 2000 fingerprint locations collected on the 8th floor in the first testing scenario. . . . 39
Figure 2.10 Radio map (a) with (b) 1200 fingerprint locations collected on the 2nd floor in the second testing scenario. . . . 40
Figure 2.11 Deterministic propagation model compared to measurements in the first scenario. . . . 41
Figure 2.12 Deterministic propagation model compared to measurements in the second scenario. . . . 42
Figure 2.13 Probabilistic propagation model for the first scenario. . . . 42
Figure 2.14 Probabilistic propagation model for the second scenario. . . . 43
Figure 2.15 Localization results for the first scenario, with distance and RSSI features. . . . 44
Figure 2.16 Distribution of localization error for distance and RSSI features in the first scenario. . . . 44
Figure 2.17 Localization reliability for distance and RSSI features in the first scenario. . . . 45
Figure 2.18 Localization results for distance and RSSI features, without using coefficient λ. . . . 46
Figure 2.19 Distribution of localization error for distance and RSSI features, without using coefficient λ. . . . 46
Figure 2.20 Localization reliability for distance and RSSI features, without using coefficient λ. . . . 47
Figure 2.21 Localization results for the second scenario, with distance and RSSI features. . . . 47
Figure 2.22 Distribution of localization error for distance and RSSI features in the second scenario. . . . 48
Figure 2.23 Localization reliability for distance and RSSI features in the second scenario. . . . 48
Figure 3.1  Framework of person localization in camera networks. . . . 50
Figure 3.2  Examples of tracking lines formed by linking trajectories of corresponding FootPoint positions. . . . 51
Figure 3.3  Testing environment. . . . 53
Figure 3.4  Examples in the MICA1 dataset. The images on the top are captured from the camera at the check-in region and used for the training phase. The images at the bottom are the testing images acquired from 4 other cameras (Cam1, Cam2, Cam3, Cam4) in the surveillance region. . . . 54
Figure 3.5  Examples of manually-extracted human ROIs from Cam2. . . . 55
Figure 3.6  Examples of manually-cropped human ROIs from Cam1 and Cam4. . . . 55
Figure 3.7  Framework of the proposed shadow removal method. . . . 56
Figure 3.8  Extracting shadow pixels: (a) original frame, (b) foreground mask obtained with adaptive GMM, (c) frame superimposed on foreground mask, (d) and (e) object and shadow pixels labeled manually from (c), respectively. . . . 57
Figure 3.9  Example using chromaticity-based features for shadow removal: background image at (a) H, (b) S and (c) V channels; (d) ∆(H) and (e) ∆(S) of shadow pixels. This example uses the same frame as in Fig. 3.8. . . . 58
Figure 3.10 The results of the GMM fitting algorithm with K = 3 on the distribution of the chromaticity-based features: (a) original distribution of feature vectors x; (b) isocurves of each mixture component k; (c) density of mixture components. . . . 60
Figure 3.11 The results of the shadow score calculated on an example image: (a) original image; (b) background subtraction result; (c) shadow-matching score s1 calculated using the GMM fitting results in Fig. 3.10 with chromaticity features; (d) shadow-matching score s2 calculated with physical features. . . . 60
Figure 3.12 Physical shadow model [76], with shadow pixels falling into the gray area. The physics-based feature vector for a shadow pixel p contains two components: α(p) (length of vector SD) and θ(p). . . . 61
Figure 3.13 (a) Physical shadow model [76], and examples of (b) original image, (c) log-scale of the θ(p) property, (d) log-scale of the α(p) property. . . . 61
Figure 3.14 Illustration of log-likelihood calculated from learning (a) chromaticity-based score, (b) physics-based score, and (c) score fusion of shadow and non-shadow pixels. (d) Visualization of s = (s1, s2) for shadow pixels (blue dots) and non-shadow pixels (red dots). . . . 63
Figure 3.15 Illustration of shadow removal results using the score fusion scheme. The first row contains original frames. The second row shows the foreground masks with background subtraction (BGS). The foreground masks with shadow removal are presented in the third row. The fourth row indicates the shadow pixels detected by the proposed method, and the final row is ground-truth data. . . . 64
Figure 3.16 The evaluations for shadow removal from the proposed method and other methods in [134]. . . . 65
Figure 3.17 Examples of shadow removal results: original frames in the first row; second row: chromaticity-based method; third row: physical method; fourth row: geometry-based method; fifth and sixth rows: texture-based method; last row: our method. . . . 66
Figure 3.18 Examples of training data for the HOG descriptor. Positive and negative training images captured by (a) Camera 1 (Cam1), (b) Camera 2 (Cam2) and (c) Camera 3 (Cam3). . . . 70
Figure 3.19 Examples of (a) false positives (FP) and (b) false negatives (FN) in the HOG-SVM detector. . . . 71
Figure 3.20 The Kalman recursive model. . . . 72
Figure 3.21 Example of (a) grid map and (b) a threshold region bounded by a contour line. . . . 74
Figure 3.22 Examples of noise in detection: (a) equal number of real targets and detections, but not all detections are true ones; (b) the number of real targets is larger than the number of detections; (c) the number of real targets is smaller than the number of detections. . . . 77
Figure 3.23 Example of the Hungarian algorithm. . . . 79
Figure 3.24 Camera pinhole model. . . . 80
Figure 3.25 Examples of original frames (top row) and frames with corrected distortion (bottom row). . . . 82
Figure 3.26 The flowchart for warping camera FOVs. . . . 83
Figure 3.27 Four marked points and the detected points on the floor plan resulting from the inverse transformation of matrix H are shown on the top. The bottom images are bird's-eye view projections of Cam1 and Cam3. . . . 84
Figure 3.28 The matching points on the floor plan between two images captured from Cam1 and Cam3. . . . 84
Figure 3.29 Flowchart of linking user trajectories. . . . 85
Figure 3.30 Image points marked on the frames captured from (a) camera 1 (Cam1), (b) camera 2 (Cam2) and (c) camera 4 (Cam4). These points are used for calculating the camera extrinsic parameters. . . . 86
Figure 3.31 Floor map with a 2D coordinate system. . . . 87
Figure 3.32 Examples of frame sequences in the MICA2 dataset with (a) hallway scenario captured from Cam1, (b) lobby scenario captured from Cam2 and (c) showroom scenario captured from Cam4. They are used for evaluations of person tracking and localization. . . . 89
Figure 3.33 Examples of person tracking results in Cam1 FOV (hallway scene). . . . 89
Figure 3.34 Examples of person tracking results in Cam2 FOV (lobby scene). . . . 90
Figure 3.35 Examples of person tracking results in Cam4 FOV (showroom scene). . . . 90
Figure 4.1  Framework of human face recognition. . . . 93
Figure 4.2  Face detection result represented by a rectangular region. . . . 93
Figure 4.3  Example of LBP computation [75]. . . . 94
Figure 4.4  LBP images with different gray-scale transformations. . . . 95
Figure 4.5  Face description with LBPH. . . . 95
Figure 4.6  Examples in the training database with (a) face images of 20 subjects, (b) images of one subject. . . . 97
Figure 4.7  Diagram of the vision-based person Re-ID system. . . . 98
Figure 4.8  The basic idea of representation based on kernel methods. . . . 98
Figure 4.9  Illustration of size-adaptive patches (a, c) and size-fixed patches (a, b), as mentioned in [25]. . . . 100
Figure 4.10 Image-level feature vector concatenated from feature vectors of blocks in the pyramid layers. . . . 101
Figure 4.11 Examples of testing images detected automatically in the MICA2 dataset. The first and second rows contain the human ROIs with and without shadow removal, respectively. . . . 103
Figure 4.12 Results of the proposed method against AHPE [19] and KDES [25] on the CAVIAR4REID dataset. . . . 105
Figure 4.13 Results of the proposed method against AHPE [19], SDALF [54] and KDES [25] on the iLIDS dataset. . . . 105
Figure 4.14 Comparative results with reported methods in [10] on the iLIDS dataset. . . . 106
Figure 4.15 Testing results on our MICA1 dataset. . . . 107
Figure 4.16 The person Re-ID evaluation of the proposed KDES descriptor against the original KDES and PLS [138] on the ETHZ dataset with (a) Sequence 1, (b) Sequence 2 and (c) Sequence 3. . . . 108
Figure 4.17 The person Re-ID evaluations of the proposed KDES descriptor compared with the original KDES method on the WARD dataset. . . . 109
Figure 4.18 The person Re-ID evaluations of the proposed KDES descriptor compared with the original KDES method on the RAiD dataset. . . . 110
Figure 4.19 The recognition rates of the proposed KDES on the HDA dataset. . . . 110
Figure 4.20 Recognition rates of the proposed KDES on the MICA2 dataset with manually-cropped human ROIs, and automatically-detected human ROIs with and without shadow removal. . . . 113
Figure 4.21 Examples of person Re-ID results on the MICA2 dataset. The first column shows frames captured from Cam1 and Cam4. The second column contains human ROIs extracted manually from these frames. The human ROIs detected automatically with and without shadow removal are shown in the third and fourth columns, respectively. The ID labels are placed at the top of these human ROIs, and at the bottom, filled circles and squares indicate correct and incorrect person Re-ID results, respectively. . . . 114
Figure 5.1  The different types of sensor combination for person localization: (a) late fusion, (b) early fusion, (c) trigger fusion. . . . 119
Figure 5.2  Framework for person localization and Re-ID using the combined system of WiFi and camera. . . . 119
Figure 5.3  Flowchart of the fusion algorithm. . . . 120
Figure 5.4  A 2D floor map of the testing environment in Figure 5.5, with the routing path of moving people in the testing scenarios. . . . 125
Figure 5.5  Testing environment. . . . 126
Figure 5.6  Visual examples in script 2. The first row contains frames for the scenario of one moving person. The scenarios for two, three and five moving people are shown in the second, third and fourth rows. . . . 127
Figure 5.7  Training examples of manually-extracted human ROIs from Cam2 for person 1 (images on the left) and person 2 (images on the right). . . . 129
Figure 5.8  Testing examples of manually-extracted human ROIs from Cam1 (left column) and Cam4 (right column) for (a) person 1 and (b) person 2. . . . 130
Figure 5.9  Person Re-ID evaluations on testing data of two moving people. . . . 131


INTRODUCTION
Motivation
Modern technology is changing human life in many ways, notably in the way people interact with technological products. Human-computer interaction is becoming more natural and convenient, making our lives more enjoyable and comfortable. A new concept, Ambient Intelligence (AmI), was coined for this revolutionary change.
Ambient Intelligence (AmI) has become an active research area with many related projects, research programs and potential applications in different domains, such as home, office, education, entertainment, health care, emergency services, logistics, transportation, security and surveillance [135], [116], [164], [67], [62], [43], [123]. A common viewpoint shared by many authors [3], [61], [14], [40] is that AmI refers to digital environments in which environmental information (temperature, humidity, etc.) and human presence are perceived automatically by sensors and devices interconnected through networks. Three main capabilities are required of an AmI system: Perceiving, Modeling and Acting [40]. Perceiving is also regarded as the problem of context awareness, in which humans and their attributes are the center of perception. Modeling relates to feature extraction and to building a discriminative descriptor for each object. Finally, Acting specifies the response of the environment to the people inhabiting it, by providing adaptive and user-transparent services.
Although the vision of AmI was introduced more than ten years ago and research on it has strengthened and expanded, the development and implementation of real-life applications are still in their infancy. There are many practical challenges that need to be addressed in each of the contributing technological areas and particular applications [2].
In this research, person position and identity are considered in indoor environments. They are two of the most crucial attributes for ambient environments. In order to determine position (where a person is) and identity (who a person is) indoors, the two problems of person localization and identification need to be solved. A wide range of sensors can be used to handle these problems, such as Ultra-Wideband (UWB), ultrasound, Radio-Frequency Identification (RFID), cameras, WiFi, etc. [101]. UWB is especially useful for indoor environments where multipath is severe, but it is not widely used because of the requirement for dedicated transmitter and receiver infrastructure. Ultrasound-based systems are able to locate objects to within a few centimeters, but remain prohibitively expensive and rely on a large amount of fixed infrastructure; such infrastructure is not only labor intensive to install but also expensive to maintain. RFID allows the identity of a person or an object to be transmitted wirelessly via radio waves using a unique RFID tag. RFID-based localization outperforms other technologies, but its deployment is expensive and its positioning range is limited. Cameras have become a dominant technology for person localization and identification thanks to the improvement and miniaturization of actuators (e.g., lasers) and particularly to advances in detector technology (e.g., CCD sensors). However, their deployment is limited by the high cost of the solution (both in terms of licensing and processing requirements) and by the effectiveness of the image processing algorithms themselves in real-world dynamic situations. A WiFi positioning system (WPS) is a suitable alternative to GPS and GLONASS in indoor environments, where satellite positioning is inadequate due to various causes, including multipath and signal blockage. Moreover, WiFi positioning takes advantage of the rapid growth of wireless access points in buildings and of wireless-enabled smart mobile devices. However, the positioning accuracy of WiFi-based systems is lower than that of vision-based positioning systems. Indoor environments are particularly challenging for WiFi-based positioning for several reasons: multipath propagation, Non-Line-of-Sight (NLoS) conditions, high attenuation and signal scattering due to the high density of obstacles, etc.
It has by now become apparent that no overall solution based on a single technology is perfect for all applications. Therefore, besides developing optimal algorithms for each technology, fusing their data is a new trend in solving the problem of person localization and identification in indoor environments [121], [12], [98], [104]. The main purpose of the fusion is to retain the benefits of each individual sensor technology while mitigating their weaknesses. Motivated by this, our research focuses on person localization and identification by combining WiFi-based and vision-based technologies. This combination offers the following benefits in comparison with each single method:

• A coarse-to-fine localization system can be set up. Coarse positioning is provided by the WiFi system, and on that basis fine positioning is performed by the cameras lying within the range of the WiFi estimate (a rough sketch of this triggering idea is given after this list). The coarse-to-fine system allows people to be localized continuously with a sparse camera network, and lowers the cost of system deployment and computation, since cameras need to be deployed only in regions that require high positioning accuracy.
• Easy scalability of the coverage area, by simply deploying more APs (Access Points) in the environment.

• Richer information for person identification and re-identification (Re-ID): one object can be identified by both the WiFi and camera systems.
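
As a rough illustration of this coarse-to-fine (trigger-style) idea, the sketch below uses the coarse WiFi estimate only to decide which cameras should run the fine, vision-based localization. It is a minimal Python sketch with hypothetical camera names and floor-plan coordinates; the actual fusion algorithm is developed in Chapter 5.

    from math import hypot

    # Hypothetical floor-plan layout: each camera is approximated by a
    # circular coverage region (center and radius in meters).
    CAMERAS = [
        {"name": "Cam1", "center": (4.5, 2.5), "radius": 5.0},   # hallway
        {"name": "Cam2", "center": (15.0, 1.0), "radius": 3.0},  # lobby
        {"name": "Cam4", "center": (22.0, 4.0), "radius": 3.0},  # showroom
    ]

    def cameras_to_trigger(wifi_position, wifi_error=4.0):
        """Return the cameras whose FOV may contain the person, given a
        coarse WiFi position and its error bound (here 4 m, the target
        accuracy of the WiFi subsystem)."""
        x, y = wifi_position
        selected = []
        for cam in CAMERAS:
            cx, cy = cam["center"]
            if hypot(x - cx, y - cy) <= cam["radius"] + wifi_error:
                selected.append(cam["name"])
        return selected

    # A coarse WiFi fix near the hallway activates only Cam1.
    print(cameras_to_trigger((5.0, 3.0)))  # ['Cam1']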

Objective

The thesis focuses on researching and developing solutions for person localization and identification, considered in the context of automatic person surveillance in indoor environments using WiFi and camera systems. The concrete objectives are:

• Constructing an improved method for WiFi-based localization. The method allows popular WiFi-enabled devices, such as smartphones or tablets, to be used for localization. These devices are not originally produced for localization, but the RSSI values they scan from nearby APs are widely used for it (the path-loss relation underlying RSSI-based ranging is recalled after this list). The proposed method can overcome some of the unstable characteristics of RSSI values used for localization in indoor environments. It also provides the coarse-level localization in the combined system of WiFi and camera. The performance criterion set in this thesis for the WiFi-based localization system is an error under 4 m at a reliability of 90%.

• Building efficient methods for vision-based person localization, including solutions for human detection, tracking and linking a person's trajectories in camera networks. The performance target for human localization is a positioning error under 50 cm for all matched pairs of person and tracker hypothesis over all frames.

• Constructing an efficient solution for person Re-ID in camera networks.

• Developing a method for person localization and Re-ID by combining the WiFi and camera systems. The method can leverage the advantages of each single technology, such as the high localization accuracy of vision-based systems and the low computational cost and more reliable target identity (ID) of WiFi-based localization systems. In the combined localization system, the camera-based positioning performance is preserved, while the performance of person identification and Re-ID in the camera network is improved.

• Setting up a combined system of WiFi and camera under indoor scenarios of an automatic person surveillance system. The proposed methods for person localization and identification are evaluated in this system.

• Building datasets for experimental evaluation of the proposed solutions. To the best of our knowledge, public multi-modal datasets for evaluating combined WiFi and camera localization and identification systems do not exist.
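
For reference, the relation that underlies RSSI-based ranging in systems of this kind is the log-distance path-loss model with wall attenuation; the probabilistic propagation model developed in Chapter 2 refines this kind of relation, so the form below is only the textbook baseline:

    RSSI(d) = RSSI(d0) - 10 n log10(d / d0) - sum_k W_k + X,

where d0 is a reference distance (typically 1 m), n is the path-loss exponent, W_k is the attenuation caused by the k-th wall or floor between the device and the AP, and X is a random term accounting for signal fluctuation. Inverting such a relation is how a distance feature can be derived from RSSI alongside the raw values.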

Context, constraints, and challenges
Context
The combined system of WiFi and camera for person localization and identification is deployed in real-world scenarios of an automated person surveillance system in building environments. In most buildings, entrance and exit gates are set up in order to control who enters or leaves the building. This context is also considered in our system. Figure 1 shows the context, in which the testing environment is divided into two areas: check-in/check-out and surveillance. The proposed system is implemented in these areas with two main functions. The first function is learning ID cues, which is executed for each individual in the check-in/check-out area. The second function is person localization and Re-ID, which is processed in the surveillance area (see Figure 2).
Figure 1 Person surveillance context in indoor environment.

Figure 2 Multimodal localization system fusing WiFi signals and images.
Each person is required to hold a WiFi-integrated device and enters the check-in/check-out area one by one. At the entrance gate of this region, the person's ID is learned individually from the images captured by cameras and from the MAC address of the WiFi-enabled device held by each person. One camera, mounted at the front door of the check-in gate, captures the human face for face recognition. In this case, face recognition is performed in a closed-set manner, which means the faces in the probe set are included in the gallery set. Another camera acquires human images at different poses, and the learning phase of the appearance-based ID is done for each person. In short, in the first region we obtain three types of signatures for each person N_i: a face-based identity ID_i^F, a WiFi-based identity ID_i^W and an appearance-based identity ID_i^A. Based on ID_i^F, we know which corresponding ID_i^W is already inside the surveillance region. The corresponding ID_i^A is also assigned to this person. Depending on the circumstances, these ID cues can be used for person localization and Re-ID in the surveillance region.
Each person ends his/her route at the exit gate, where he/she is checked out by another camera that captures the face for face recognition. The checked-out person is then removed from the processing system.
In summary, with the above-mentioned scenarios, in the check-in/check-out area we can:
• Monitor the changes in each individual's appearance (changes in clothing) each time he/she enters the surveillance region. This makes appearance-based person descriptors more feasible for person Re-ID.
• Decrease the computing cost of the system and narrow the ID-matching space by eliminating checked-out people from the processing system.
• Map between the different ID cues of the same person.

In the surveillance area, the two problems of person localization and Re-ID are solved simultaneously by combining visual and WiFi information. The surveillance region is set up so that the WiFi range, formed by deploying wireless Access Points (APs), covers all visual ranges (the cameras' FOVs). Figure 3 illustrates this setting, with two camera ranges covered by the WiFi range.

Figure 3 Surveillance region with WiFi range covering disjoint camera FOVs.
Constraints
With the above-mentioned context, some constraints are taken into account for the person localization and identification system based on fusion of WiFi and camera:
• Environment:
  – Indoor environment with space constraints:
    * A single floor, including scenarios in a hallway, a lobby and a room (showroom), with areas of 9×5.1 m, 7.5×1.8 m and 5.4×5.1 m, respectively.
    * A space in which people are continuously covered by the WiFi range and by at least two non-overlapping cameras.
  – Furniture and other objects are static, as in an office building.
• Illumination conditions: both natural and artificial lighting sources are considered.
  – Natural lighting changes during the day (morning, noon, afternoon).
  – Artificial lighting sources are stable in a room.
• Sensors:
  – Vision:
    * Stationary RGB cameras capturing frames at a normal frame rate (15 to 30 fps) and an image resolution of 640×480 pixels.
    * Cameras are deployed in the environment with non-overlapping FOVs.
    * Cameras are time-synchronized with Internet time.
  – WiFi:
    * Wireless APs are deployed so that their ranges cover the whole surveillance region.
    * WiFi-enabled devices, such as smartphones or tablets, have their own ID (the MAC address of the WiFi adapter). Each person holding such a device is uniquely assigned the ID of that device.
    * WiFi-enabled devices are time-synchronized with Internet time.
• Pedestrians:
  – More than one pedestrian may be present at the same time.
  – Each person is required to hold a WiFi-enabled device and moves at normal walking speed (1-1.3 m/s) in the monitored areas.
Challenges
Person localization and identification in indoor environments by fusion of WiFi and camera systems are very challenging. First, challenges come from the vision-based system, including:
• Illumination conditions: light variations that occur suddenly can strongly affect the performance of the human detector. For person Re-ID this issue is critical, especially in the case of non-overlapping cameras. The same person observed by two different cameras under distinct illumination conditions may have different appearances. This degrades person Re-ID, because most of the proposed Re-ID methods rely on human appearance.
• Shadows and reflections: depending on the illumination conditions, lighting angle and floor/wall smoothness, shadows and reflections can appear in various ways, and they cause problems for human detection, tracking and localization. Shadows and reflections are difficult to handle: depending on the features (motion, shape or background) used for detection or tracking, a shadow on the ground or a reflection from walls or windows may behave and appear like the person that casts it. Localization errors may become larger if people detection and tracking give poor results because of shadows.
• Occlusions: occlusions appear when people move close to each other or are hidden by obstacles in the environment. This phenomenon can cause track loss and errors in position-ID assignment. For multi-target tracking, inter-person occlusion is still a challenging problem.
• Person appearance variation: the appearance of one person can be strongly influenced by the color of the clothing he/she wears and by the distinct view angles of one camera or of different cameras. This variation in human appearance is challenging for human detection, tracking and Re-ID.
• Crowded scenes: the number of persons in the scene is a critical parameter. For human tracking, a high number of persons has two negative effects: first, the probability of occlusion increases with the number of persons; second, the risk of ID permutations and tracking errors grows, because the models of tracked persons are more likely to be similar. For person Re-ID, this issue is also important: a high number of persons in the scene increases the number of matching candidates for each Re-ID query, and thereby the probability of Re-ID error; it also increases the probability of having similar visual signatures.
• Multiple cameras: person localization and identification in multi-camera scenarios are much more challenging than with a single camera. In camera networks, the problems of person Re-ID and of linking trajectories when people move from one camera FOV to another are still open issues.
• Computational cost and real-time performance: computational costs include the time and memory used when building and testing a system. Many image-based localization applications require real-time performance, so it is worth studying how these systems address the time-performance issue.
Second, in comparison with vision-based person localization systems, the deployment of WiFi-based systems is easier, wireless chips are much cheaper than cameras, and their power and computing resource consumption is significantly lower than that of vision-based localization systems. However, WiFi-based localization techniques are always associated with a set of challenges, mainly originating from the influence of obstacles on the propagation of radio waves in indoor environments:
• Unpredictability of WiFi signal propagation through indoor environments. The data distribution may vary because of changes in temperature and humidity, as well as the positions of moving obstacles, such as people walking throughout the building. This uncertainty makes it difficult to generate accurate estimates of the signal strength measurements from which positions are calculated.
• Non-Line-of-Sight (NLoS) conditions, in which the radio-frequency (RF) propagation path is partially or completely obscured by obstacles, making it difficult for the radio signal to pass through.
Additionally, the quality of WiFi data for localization depends highly on the type, position, orientation, quantity and distribution of the wireless transceivers (APs, mobile phones, tablets, etc.).
Third, some challenges arise from the combination of WiFi and vision-based systems for person localization and identification in indoor environments:
• The different natures of WiFi and visual signals: combining data collected from sensors as distinct as WiFi and cameras is challenging.
• Signal synchronization between the different WiFi and camera sensors. This is a necessary step before testing and evaluating any fusion solution.

Contributions
In order to achieve the above-mentioned objectives, several contributions are made in this thesis:
• Contribution 1: Proposing an improvement for WiFi-based person localization. In this proposal, an efficient path-loss model is defined that takes obstacle constraints in indoor environments into account. Based on this, we can effectively model the relationship between RSSI and the distance between a mobile user and the APs. A well-known fingerprinting method with a new radio map is defined to obtain stable and reliable fingerprint data for localization. To match a query sample against the fingerprint data, the KNN method is used with an additional coefficient reflecting the chronological changes of the fingerprint data in the environment. The WiFi-based localization results are used to activate the vision-based localization processes at the cameras lying within the range of the position returned by the WiFi system.
• Contribution 2: Improving vision-based person localization by proposing efficient shadow removal and human detection methods. For shadow removal, a combination of chromaticity-based and physical features is proposed, and a density-based score fusion scheme is built to integrate the shadow-matching scores achieved by each independent feature. This is a preprocessing step for better human detection, which is based on the fusion of HOG-SVM and adaptive GMM background subtraction techniques. This fusion takes advantage of the high-speed computation of adaptive GMM and the accuracy of HOG-SVM for human detection. Additionally, for the HOG-SVM detector, we build HOG descriptors and train the SVM on our own database and on the standard INRIA dataset. This helps to improve the performance of human detection by HOG-SVM in the considered environments.
• Contribution 3: An efficient appearance-based human descriptor is proposed for person Re-ID in camera networks. The descriptor is built on each human ROI returned by the human detector. Three different features of gradient, color and shape are extracted from a human ROI at three levels (pixel, patch and whole human ROI), and three match kernel functions are built from them. Fusion of these match kernel functions results in a descriptor that is invariant to the scale and rotation of