Machine Learning in Computer Vision
by
N. SEBE
University of Amsterdam, The Netherlands
IRA COHEN
HP Research Labs, U.S.A.
ASHUTOSH GARG
Google Inc., U.S.A.
and
THOMAS S. HUANG
University of Illinois at Urbana-Champaign, Urbana, IL, U.S.A.
A C.I.P. Catalogue record for this book is available from the Library of Congress.
Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands.
Printed on acid-free paper
All Rights Reserved
© 2005 Springer
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
Printed in the Netherlands.
ISBN-10 1-4020-3274-9 (HB) Springer Dordrecht, Berlin, Heidelberg, New York
ISBN-10 1-4020-3275-7 (e-book) Springer Dordrecht, Berlin, Heidelberg, New York
ISBN-13 978-1-4020-3274-5 (HB) Springer Dordrecht, Berlin, Heidelberg, New York
ISBN-13 978-1-4020-3275-2 (e-book) Springer Dordrecht, Berlin, Heidelberg, New York
To my parents
Nicu
To Merav and Yonatan
Ira
To my parents
Asutosh
To my students: Past, present, and future
Tom
Contents
Foreword
Preface
1. INTRODUCTION
  1 Research Issues on Learning in Computer Vision
  2 Overview of the Book
  3 Contributions
2. THEORY: PROBABILISTIC CLASSIFIERS
  1 Introduction
  2 Preliminaries and Notations
    2.1 Maximum Likelihood Classification
    2.2 Information Theory
    2.3 Inequalities
  3 Bayes Optimal Error and Entropy
  4 Analysis of Classification Error of Estimated (Mismatched) Distribution
    4.1 Hypothesis Testing Framework
    4.2 Classification Framework
  5 Density of Distributions
    5.1 Distributional Density
    5.2 Relating to Classification Error
  6 Complex Probabilistic Models and Small Sample Effects
  7 Summary
3. THEORY: GENERALIZATION BOUNDS
  1 Introduction
  2 Preliminaries
  3 A Margin Distribution Based Bound
    3.1 Proving the Margin Distribution Bound
  4 Analysis
    4.1 Comparison with Existing Bounds
  5 Summary
4. THEORY: SEMI-SUPERVISED LEARNING
  1 Introduction
  2 Properties of Classification
  3 Existing Literature
  4 Semi-supervised Learning Using Maximum Likelihood Estimation
  5 Asymptotic Properties of Maximum Likelihood Estimation with Labeled and Unlabeled Data
    5.1 Model Is Correct
    5.2 Model Is Incorrect
    5.3 Examples: Unlabeled Data Degrading Performance with Discrete and Continuous Variables
    5.4 Generating Examples: Performance Degradation with Univariate Distributions
    5.5 Distribution of Asymptotic Classification Error Bias
    5.6 Short Summary
  6 Learning with Finite Data
    6.1 Experiments with Artificial Data
    6.2 Can Unlabeled Data Help with Incorrect Models? Bias vs. Variance Effects and the Labeled-unlabeled Graphs
    6.3 Detecting When Unlabeled Data Do Not Change the Estimates
    6.4 Using Unlabeled Data to Detect Incorrect Modeling Assumptions
  7 Concluding Remarks
5. ALGORITHM: MAXIMUM LIKELIHOOD MINIMUM ENTROPY HMM
  1 Previous Work
  2 Mutual Information, Bayes Optimal Error, Entropy, and Conditional Probability
  3 Maximum Mutual Information HMMs
    3.1 Discrete Maximum Mutual Information HMMs
    3.2 Continuous Maximum Mutual Information HMMs
    3.3 Unsupervised Case
  4 Discussion
    4.1 Convexity
    4.2 Convergence
    4.3 Maximum A-posteriori View of Maximum Mutual Information HMMs
  5 Experimental Results
    5.1 Synthetic Discrete Supervised Data
    5.2 Speaker Detection
    5.3 Protein Data
    5.4 Real-time Emotion Data
  6 Summary
6. ALGORITHM: MARGIN DISTRIBUTION OPTIMIZATION
  1 Introduction
  2 A Margin Distribution Based Bound
  3 Existing Learning Algorithms
  4 The Margin Distribution Optimization (MDO) Algorithm
    4.1 Comparison with SVM and Boosting
    4.2 Computational Issues
  5 Experimental Evaluation
  6 Conclusions
7. ALGORITHM: LEARNING THE STRUCTURE OF BAYESIAN NETWORK CLASSIFIERS
  1 Introduction
  2 Bayesian Network Classifiers
    2.1 Naive Bayes Classifiers
    2.2 Tree-Augmented Naive Bayes Classifiers
  3 Switching between Models: Naive Bayes and TAN Classifiers
  4 Learning the Structure of Bayesian Network Classifiers: Existing Approaches
    4.1 Independence-based Methods
    4.2 Likelihood and Bayesian Score-based Methods
  5 Classification Driven Stochastic Structure Search
    5.1 Stochastic Structure Search Algorithm
    5.2 Adding VC Bound Factor to the Empirical Error Measure
  6 Experiments
    6.1 Results with Labeled Data
    6.2 Results with Labeled and Unlabeled Data
  7 Should Unlabeled Data Be Weighed Differently?
  8 Active Learning
  9 Concluding Remarks
8. APPLICATION: OFFICE ACTIVITY RECOGNITION
  1 Context-Sensitive Systems
  2 Towards Tractable and Robust Context Sensing
  3 Layered Hidden Markov Models (LHMMs)
    3.1 Approaches
    3.2 Decomposition per Temporal Granularity
  4 Implementation of SEER
    4.1 Feature Extraction and Selection in SEER
    4.2 Architecture of SEER
    4.3 Learning in SEER
    4.4 Classification in SEER
  5 Experiments
    5.1 Discussion
  6 Related Representations
  7 Summary
9. APPLICATION: MULTIMODAL EVENT DETECTION
  1 Fusion Models: A Review
  2 A Hierarchical Fusion Model
    2.1 Working of the Model
    2.2 The Duration Dependent Input Output Markov Model
  3 Experimental Setup, Features, and Results
  4 Summary
10. APPLICATION: FACIAL EXPRESSION RECOGNITION
  1 Introduction
  2 Human Emotion Research
    2.1 Affective Human-computer Interaction
    2.2 Theories of Emotion
    2.3 Facial Expression Recognition Studies
  3 Facial Expression Recognition System
    3.1 Face Tracking and Feature Extraction
    3.2 Bayesian Network Classifiers: Learning the “Structure” of the Facial Features
  4 Experimental Analysis
    4.1 Experimental Results with Labeled Data
      4.1.1 Person-dependent Tests
      4.1.2 Person-independent Tests
    4.2 Experiments with Labeled and Unlabeled Data
  5 Discussion
11. APPLICATION: BAYESIAN NETWORK CLASSIFIERS FOR FACE DETECTION
  1 Introduction
  2 Related Work
  3 Applying Bayesian Network Classifiers to Face Detection
  4 Experiments
  5 Discussion
References
Index
Foreword
It started with image processing in the sixties. Back then, it took ages to digitize a Landsat image and then process it with a mainframe computer. Processing was inspired by the achievements of signal processing and was still very much oriented towards programming.
In the seventies, image analysis spun off combining image measurement with statistical pattern recognition. Slowly, computational methods detached themselves from the sensor and the goal to become more generally applicable.
In the eighties, model-driven computer vision originated when artificial intelligence and geometric modelling came together with image analysis components. The emphasis was on precise analysis with little or no interaction, still very much an art evaluated by visual appeal. The main bottleneck was in the amount of data using an average of 5 to 50 pictures to illustrate the point.
At the beginning of the nineties, vision became available to many with the advent of sufficiently fast PCs. The Internet revealed the interest of the general public in images, eventually introducing content-based image retrieval. Combining independent (informal) archives, as the web is, calls for interactive evaluation of approximate results and hence weak algorithms and their combination in weak classifiers.
In the new century, the last analog bastion was taken. In a few years, sensors have become all digital. Archives will soon follow. As a consequence of this change in the basic conditions, datasets will overflow. Computer vision will spin off a new branch to be called something like archive-based or semantic vision including a role for formal knowledge description in an ontology equipped with detectors. An alternative view is experience-based or cognitive vision. This is mostly a data-driven view on vision and includes the elementary laws of image formation.
This book comes right on time. The general trend is easy to see. The methods of computation went from dedicated to one specific task to more generally applicable building blocks, from detailed attention to one aspect like filtering to a broad variety of topics, from a detailed model design evaluated against a few data to abstract rules tuned to a robust application.
From the source to consumption, images are now all digital. Very soon, archives will be overflowing. This is slightly worrying as it will raise the level of expectations about the accessibility of the pictorial content to a level compatible with what humans can achieve.
There is only one realistic chance to respond. From the trend displayed above, it is best to identify basic laws and then to learn the specifics of the model from a larger dataset. Rather than excluding interaction in the evaluation of the result, it is better to perceive interaction as a valuable source of instant learning for the algorithm.
This book builds on that insight: that the key element in the current revolution is the use of machine learning to capture the variations in visual appearance, rather than having the designer of the model accomplish this. As a bonus, models learned from large datasets are likely to be more robust and more realistic than the brittle all-design models.
This book recognizes that machine learning for computer vision is distinctively different from plain machine learning. Loads of data, spatial coherence, and the large variety of appearances, make computer vision a special challenge for the machine learning algorithms. Hence, the book does not waste itself on the complete spectrum of machine learning algorithms. Rather, this book is focussed on machine learning for pictures.
It is amazing so early in a new field that a book appears which connects theory to algorithms and through them to convincing applications.
The authors met one another at Urbana-Champaign and then dispersed over the world, apart from Thomas Huang who has been there forever. This book will surely be with us for quite some time to come.
Arnold Smeulders
University of Amsterdam
The Netherlands
October, 2004
Preface
The goal of computer vision research is to provide computers with human-like perception capabilities so that they can sense the environment, understand the sensed data, take appropriate actions, and learn from this experience in order to enhance future performance. The field has evolved from the application of classical pattern recognition and image processing methods to advanced techniques in image understanding like model-based and knowledge-based vision.
In recent years, there has been an increased demand for computer vision systems to address “real-world” problems. However, many of our current models and methodologies do not seem to scale out of limited “toy” domains. Therefore, the current state of the art in computer vision needs significant advancements to deal with real-world applications, such as navigation, target recognition, manufacturing, photo interpretation, remote sensing, etc. It is widely understood that many of these applications require vision algorithms and systems to work under partial occlusion, possibly under high clutter, low contrast, and changing environmental conditions. This requires that the vision techniques be robust and flexible to optimize performance in a given scenario.
The field of machine learning is driven by the idea that computer algorithms and systems can improve their own performance with time. Machine learning has evolved from the relatively “knowledge-free” general purpose learning system, the “perceptron” [Rosenblatt, 1958], and decision-theoretic approaches for learning [Blockeel and De Raedt, 1998], to symbolic learning of high-level knowledge [Michalski et al., 1986], artificial neural networks [Rowley et al., 1998a], and genetic algorithms [DeJong, 1988]. With the recent advances in hardware and software, a variety of practical applications of the machine learning research is emerging [Segre, 1992].
Vision provides interesting and challenging problems and a rich environment to advance the state of the art in machine learning. Machine learning technology has a strong potential to contribute to the development of flexible and robust vision algorithms, thus improving the performance of practical vision systems. Learning-based vision systems are expected to provide a higher level of competence and greater generality. Learning may allow us to use the experience gained in creating a vision system for one application domain to a vision system for another domain by developing systems that acquire and maintain knowledge. We claim that learning represents the next challenging frontier for computer vision research.
More specifically, machine learning offers effective methods for computer vision for automating the model/concept acquisition and updating processes, adapting task parameters and representations, and using experience for generating, verifying, and modifying hypotheses. Expanding this list of computer vision problems, we find that some of the applications of machine learning in computer vision are: segmentation and feature extraction; learning rules, relations, features, discriminant functions, and evaluation strategies; learning and refining visual models; indexing and recognition strategies; integration of vision modules and task-level learning; learning shape representation and surface reconstruction strategies; self-organizing algorithms for pattern learning; biologically motivated modeling of vision systems that learn; and parameter adaptation, and self-calibration of vision systems. As an eventual goal, machine learning may provide the necessary tools for synthesizing vision algorithms starting from adaptation of control parameters of vision algorithms and systems.
The goal of this book is to address the use of several important machine learning techniques in computer vision applications. An innovative combination of computer vision and machine learning techniques has the promise of advancing the field of computer vision, which will contribute to better understanding of complex real-world applications. There is another benefit of incorporating a learning paradigm in the computational vision framework. To mature the laboratory-grown vision systems into real-world working systems, it is necessary to evaluate the performance characteristics of these systems using a variety of real, calibrated data. Learning offers this evaluation tool, since no learning can take place without appropriate evaluation of the results.
Generally, learning requires large amounts of data and fast computational resources for its practical use. However, not all learning has to be on-line. Some of the learning can be done off-line, e.g., optimizing parameters, features, and sensors during training to improve performance. Depending upon the domain of application, the large number of training samples needed for inductive learning techniques may not be available. Thus, learning techniques should be able to work with varying amounts of a priori knowledge and data.
The effective usage of machine learning technology in real-world computer vision problems requires understanding the domain of application, abstraction of a learning problem from a given computer vision task, and the selection of appropriate representations for the learnable (input) and learned (internal) entities of the system. To succeed in selecting the most appropriate machine learning technique(s) for the given computer vision task, an adequate understanding of the different machine learning paradigms is necessary.
A learning system has to clearly demonstrate and answer questions such as what is being learned, how it is learned, what data is used to learn, how to represent what has been learned, how well and how efficiently the learning is taking place, and what the evaluation criteria are for the task at hand. Experimental details are essential for demonstrating the learning behavior of algorithms and systems. These experiments need to include scientific experimental design methodology for training/testing, parametric studies, and measures of performance improvement with experience. Experiments that exhibit scalability of learning-based vision systems are also very important.
In this book, we address all these important aspects. In each of the chapters, we show how the literature has introduced the techniques into the particular topic area, we present the background theory, discuss comparative experiments made by us, and conclude with comments and recommendations.
Acknowledgments
This book would not have existed without the assistance of Marcelo Cirelo, Larry Chen, Fabio Cozman, Michael Lew, and Dan Roth whose technical contributions are directly reflected within the chapters. We would like to thank Theo Gevers, Nuria Oliver, Arnold Smeulders, and our colleagues from the Intelligent Sensory Information Systems group at University of Amsterdam and the IFP group at University of Illinois at Urbana-Champaign who gave us valuable suggestions and critical comments. Beyond technical contributions, we would like to thank our families for years of patience, support, and encouragement. Furthermore, we are grateful to our departments for providing an excellent scientific environment.
Chapter 1
INTRODUCTION
Computer vision has grown rapidly within the past decade, producing tools that enable the understanding of visual information, especially for scenes with no accompanying structural, administrative, or descriptive text information. The Internet, more specifically the Web, has become a common channel for the transmission of graphical information, thus moving visual information retrieval rapidly from stand-alone workstations and databases into a networked environment.
Practicality has begun to dictate that the indexing of huge collections of images by hand is a task that is both labor intensive and expensive - in many cases more than can be afforded to provide some method of intellectual access to digital image collections. In the world of text retrieval, text “speaks for itself” whereas image analysis requires a combination of high-level concept creation as well as the processing and interpretation of inherent visual features. In the area of intellectual access to visual information, the interplay between human and machine image indexing methods has begun to influence the development of computer vision systems. Research and application by the image understanding (IU) community suggests that the most fruitful approaches to IU involve analysis and learning of the type of information being sought, the domain in which it will be used, and systematic testing to identify optimal methods.
The goal of computer vision research is to provide computers with human-like perception capabilities so that they can sense the environment, understand the sensed data, take appropriate actions, and learn from this experience in order to enhance future performance. The vision field has evolved from the application of classical pattern recognition and image processing techniques to advanced applications of image understanding, model-based vision, knowledge-based vision, and systems that exhibit learning capability. The ability to reason and the ability to learn are the two major capabilities associated with these systems. In recent years, theoretical and practical advances are being made in the field of computer vision and pattern recognition by new techniques and processes of learning, representation, and adaptation. It is probably fair to claim, however, that learning represents the next challenging frontier for computer vision.
1. Research Issues on Learning in Computer Vision
In recent years, there has been a surge of interest in developing machine learning techniques for computer vision based applications. The interest derives from both commercial projects to create working products from computer vision techniques and from a general trend in the computer vision field to incorporate machine learning techniques.
Learning is one of the current frontiers for computer vision research and has been receiving increased attention in recent years. Machine learning technology has strong potential to contribute to:
- the development of flexible and robust vision algorithms that will improve the performance of practical vision systems with a higher level of competence and greater generality, and
- the development of architectures that will speed up system development time and provide better performance.
The goal of improving the performance of computer vision systems has brought new challenges to the field of machine learning, for example, learning from structured descriptions, partial information, incremental learning, focusing attention or learning regions of interest (ROI), learning with many classes, etc. Solving problems in visual domains will result in the development of new, more robust machine learning algorithms that will be able to work in more realistic settings.
From the standpoint of computer vision systems, machine learning can offer effective methods for automating the acquisition of visual models, adapting task parameters and representation, transforming signals to symbols, building trainable image processing systems, focusing attention on the target object, and learning when to apply what algorithm in a vision system.
From the standpoint of machine learning systems, computer vision can provide interesting and challenging problems. As examples consider the following: learning models rather than handcrafting them, learning to transfer experience gained in one application domain to another domain, learning from large sets of images with no annotation, designing evaluation criteria for the quality of learning processes in computer vision systems. Many studies in machine learning assume that a careful trainer provides internal representations of the observed environment, thus paying little attention to the problems of perception. Unfortunately, this assumption leads to the development of brittle systems with noisy, excessively detailed, or quite coarse descriptions of the perceived environment.
Esposito and Malerba [Esposito and Malerba, 2001] listed some of the important research issues that have to be dealt with in order to develop successful applications:
Can we learn the models used by a computer vision system rather than handcrafting them?
In many computer vision applications, handcrafting the visual model of an object is neither easy nor practical. For instance, humans can detect and identify faces in a scene with little or no effort. This skill is quite robust, despite large changes in the visual stimulus. Nevertheless, providing computer vision systems with models of facial landmarks or facial expressions is very difficult [Cohen et al., 2003b]. Even when models have been handcrafted, as in the case of page layout descriptions used by some document image processing systems [Nagy et al., 1992], it has been observed that they limit the use of the system to a specific class of images, which is subject to change in a relatively short time.
How is machine learning used in computer vision systems?
Machine learning algorithms can be applied in at least two different ways in computer vision systems:
- to improve perception of the surrounding environment, that is, to improve the transformation of sensed signals into internal representations, and
- to bridge the gap between the internal representations of the environment and the representation of the knowledge needed by the system to perform its task.
A possible explanation of the marginal attention given to learning internal representations of the perceived environment is that feature extraction has received very little attention in the machine learning community, because it has been considered application-dependent and research on this issue is not of general interest. The identification of required data and domain knowledge requires collaboration with a domain expert and is an important step of the process of applying machine learning to real-world problems.
Only recently, the related issues of feature selection and, more generally, data preprocessing have been more systematically investigated in machine learning. Data preprocessing is still considered a step of the knowledge discovery process and is confined to data cleaning, simple data transformations (e.g., summarization), and validation. By contrast, many studies in computer vision and pattern recognition have focused on the problems of feature extraction and selection. Hough transform, FFT, and textural features, just to mention some, are all examples of features widely applied in image classification and scene understanding tasks. Their properties have been well investigated and available tools make their use simple and efficient.
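As one illustration of the frequency-domain features mentioned above, the following is a minimal sketch that turns a 1-D intensity profile into Fourier magnitude features. It assumes a naive O(n^2) discrete Fourier transform written with the standard library only (an FFT would compute the same coefficients faster); the function name `dft_magnitudes` and the sample `row` are hypothetical and not taken from the book.

```python
import cmath


def dft_magnitudes(signal):
    """Return the magnitude of each discrete Fourier coefficient of `signal`.

    Naive O(n^2) DFT, used here only for illustration; an FFT computes the
    same coefficients in O(n log n).
    """
    n = len(signal)
    magnitudes = []
    for k in range(n):
        # k-th coefficient: sum over samples of x_t * exp(-2*pi*i*k*t/n)
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        magnitudes.append(abs(coeff))
    return magnitudes


# Hypothetical intensity profile along one image row: one full cosine cycle.
row = [1.0, 0.0, -1.0, 0.0]
features = dft_magnitudes(row)
```

The resulting list can serve as a feature vector for a classifier; in this toy profile the energy concentrates in the coefficients that correspond to one cycle per row.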
How do we represent visual information?
In many computer vision applications, feature vectors are used to represent the perceived environment. However, relational descriptions are deemed to be of crucial importance in high-level vision. Since relations cannot be represented by feature vectors, pattern recognition researchers use graphs to capture the structure of both objects and scenes, while people working in the field of machine learning prefer to use first-order logic formalisms. By mapping one formalism into another, it is possible to find some similarities between research done in pattern recognition and machine learning. An example is the spatio-temporal decision tree proposed by Bischof and Caelli [Bischof and Caelli, 2001], which can be related to logical decision trees induced by some general-purpose inductive learning systems [Blockeel and De Raedt, 1998].
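The contrast between the two representations, and the mapping from a graph to logic-style facts, can be sketched as follows. This is only an illustrative sketch under stated assumptions: the scene, the object names, the feature values, and the helper functions `to_logic_facts` and `related` are all hypothetical and do not come from the cited works.

```python
# Hypothetical scene with three regions; all names and values are illustrative.
# Feature-vector view: each object is described in isolation.
feature_vectors = {
    "roof": [0.8, 0.1, 120.0],  # [redness, texture energy, area]
    "wall": [0.3, 0.4, 300.0],
    "door": [0.5, 0.2, 40.0],
}

# Relational view: a labeled directed graph stored as (source, label, target).
relations = [
    ("roof", "above", "wall"),
    ("door", "inside", "wall"),
]


def to_logic_facts(edges):
    """Render graph edges as first-order-logic style ground facts."""
    return [f"{label}({src}, {dst})" for src, label, dst in edges]


def related(edges, label, src):
    """Return every object that `src` is linked to by relation `label`."""
    return [dst for s, l, dst in edges if s == src and l == label]


facts = to_logic_facts(relations)
```

The feature vectors carry no information about how the regions are arranged, whereas the edge list (or the equivalent facts) makes the spatial structure explicit and queryable.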
What machine learning paradigms and strategies are appropriate to the computer vision domain?
Inductive learning, both supervised and unsupervised, emerges as the most important learning strategy. There are several important paradigms that are being used: conceptual (decision trees, graph-induction), statistical (support vector machines), and neural networks (Kohonen maps and similar auto-organizing systems). Another emerging paradigm, which is described in detail in this book, is the use of probabilistic models in general and probabilistic graphical models in particular.
What are the criteria for evaluating the
q
uality of the learning processes in
computer vision s
y
stems
?
In benchmarking computer vision systems, estimates of the predictive accuracy, recall, and precision [Huijsman and Sebe, 2004] are considered the main parameters to evaluate the success of a learning algorithm. However, the comprehensibility of learned models is also deemed an important criterion, especially when domain experts have strong expectations on the properties of visual models or when understanding of system failures is important. Comprehensibility is needed by the expert to easily and reliably verify the inductive assertions and relate them to their own domain knowledge. When comprehensibility is an important issue, the conceptual learning paradigm is usually preferred, since it is based on the comprehensibility postulate stated by Michalski [Michalski, 1983]:
The results of computer induction should be symbolic descriptions of given entities, semantically and structurally similar to those a human expert might produce observing the same entities. Components of these descriptions should be comprehensible as single “chunks” of information, directly interpretable in natural language, and should relate quantitative and qualitative concepts in an integrated fashion.
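
The benchmarking measures named before the postulate (predictive accuracy, recall, and precision) can be illustrated with a minimal sketch. The function name `evaluate` and the label lists are hypothetical and chosen only for illustration; they are not taken from the book.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when no positives are predicted/present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical face (1) / non-face (0) detector outputs.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
accuracy, precision, recall = evaluate(y_true, y_pred)
```

Here three of the four true faces are found and three of the four detections are correct, so all three measures equal 0.75. None of them says anything about how understandable the learned model is, which is why comprehensibility is treated as a separate criterion.
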
When is it useful to adopt several representations of the perceived environment with different levels of abstraction?
In complex real-world applications, multi-representations of the perceived environment prove very useful. For instance, a low resolution document image is suitable for the efficient separation of text from graphics, while a finer resolution is required for the subsequent step of interpreting the symbols in a text block (OCR). Analogously, the representation of an aerial view of a cultivated area by means of a vector of textural features can be appropriate to recognize the type of vegetation, but it is too coarse for the recognition of a particular geomorphology. By applying abstraction principles in computer programming, software engineers have managed to develop complex software systems. Similarly, the systematic application of abstraction principles in knowledge representation is the keystone for a long term solution to many problems encountered in computer vision tasks.
How can mutual dependency of visual concepts be dealt with?
In scene labelling problems, image segments have to be associated with a class name or a label, the number of distinct labels depending on the different types of objects allowed in the perceived world. Typically, image segments cannot be labelled independently of each other, since the interpretation of a part of a scene depends on the understanding of the whole scene (holistic view). Context-dependent labelling rules will take such concept dependencies into account, so as to guarantee that the final result is globally (and not only locally) consistent [Haralick and Shapiro, 1979]. Learning context-dependent labelling rules is another research issue, since most learning algorithms rely on the independence assumption, according to which the solution to a multiclass or multiple concept learning problem is simply the sum of independent solutions to single class or single concept learning problems.
Obviously, the above list cannot be considered complete. Other equally relevant research issues might be proposed, such as the development of noise-tolerant learning techniques, the effective use of large sets of unlabeled images and the identification of suitable criteria for starting/stopping the learning process and/or revising acquired visual models.
2. Overview of the Book
In general, the study of machine learning and computer vision can be divided into three broad categories: Theory leading to Algorithms and Applications built on top of theory and algorithms. In this framework, the applications should form the basis of the theoretical research leading to interesting algorithms. As a consequence, the book was divided into three parts. The first part develops the theoretical understanding of the concepts that are being used in developing algorithms in the second part. The third part focuses on the analysis of computer vision and human-computer interaction applications that use the algorithms and the theory presented in the first parts.
The theoretical results in this book originate from different practical problems encountered when applying machine learning in general, and probabilistic models in particular, to computer vision and multimedia problems. The first set of questions arises from the high dimensionality of models in computer vision and multimedia. For example, integration of audio and visual information plays a critical role in multimedia analysis. Different media streams (e.g., audio, video, and text) may carry information about the task being performed, and recent results [Brand et al., 1997; Chen and Rao, 1998; Garg et al., 2000b] have shown that improved performance can be obtained by combining information from different sources compared with the situation when a single modality is considered. At times, different streams may carry similar information and in that case, one attempts to use the redundancy to improve the performance of the desired task by cancelling the noise. At other times, two streams may carry complementary information and in that case the system must make use of the information carried in both channels to carry out the task. However, the merits of using multiple streams are overshadowed by the formidable task of learning in high dimensional spaces, which is invariably the case in multi-modal information processing. Although the existing theory supports the task of learning in high dimensional spaces, the data and model complexity requirements posed are typically not met by real life systems. Under such a scenario, the existing results in learning theory fall short of giving any meaningful guarantees for the learned classifiers. This raises a number of interesting questions:
Can we analyze the learning theory for more practical scenarios?
Can the results of such analysis be used to develop better algorithms?
Another set of questions arise from the practical problem of data availability in computer vision, mainly labeled data. In this respect, there are three main paradigms for learning from training data. The first is known as supervised learning, in which all the training data are labeled, i.e., a datum contains both the values of the attributes and the labeling of the attributes to one of the classes. The labeling of the training data is usually done by an external mechanism (usually humans) and thus the name supervised. The second is known as unsupervised learning in which each datum contains the values of the attributes but does not contain the label. Unsupervised learning tries to find regularities in the unlabeled training data (such as different clusters under some metric space), infer the class labels and sometimes even the number of classes. The third kind is semi-supervised learning in which some of the data is labeled and some unlabeled. In this book, we are more interested in the latter.
Semi-supervised learning is motivated by the fact that in many computer vision (and other real world) problems, obtaining unlabeled data is relatively easy (e.g., collecting images of faces and non-faces), while labeling is difficult, expensive, and/or labor intensive. Thus, in many problems, it is very desirable to have learning algorithms that are able to incorporate a large amount of unlabeled data with a small amount of labeled data when learning classifiers. Some of the questions raised in semi-supervised learning of classifiers are:
Is it feasible to use unlabeled data in the learning process?
Is the classification performance of the learned classifier guaranteed to improve when adding the unlabeled data to the labeled data?
What is the value of unlabeled data?
The goal of the book is to address all the challenging questions posed so far. We believe that a detailed analysis of the way machine learning theory can be applied through algorithms to real-world applications is very important and extremely relevant to the scientific community.
Chapters 2, 3, and 4 provide the theoretical answers to the questions posed above. Chapter 2 introduces the basics of probabilistic classifiers. We argue that there are two main factors contributing to the error of a classifier. Because of the inherent nature of the data, there is an upper limit on the performance of any classifier and this is typically referred to as Bayes optimal error. We start by analyzing the relationship between the Bayes optimal performance of a classifier and the conditional entropy of the data. The mismatch between the true underlying model (one that generated the data) and the model used for classification contributes to the second factor of error. In this chapter, we develop bounds on the classification error under the hypothesis testing framework when there is a mismatch in the distribution used with respect to the true distribution. Our bounds show that the classification error is closely related to the conditional entropy of the distribution. The additional penalty, because of the mismatched distribution, is a function of the Kullback-Leibler distance between the true and the mismatched distribution. Once these bounds are developed, the next logical step is to see how often the error caused by the mismatch between distributions is large. Our average case analysis for the independence assumptions leads to results that justify the success of the conditional independence assumption (e.g., in naive Bayes architecture). We show that in most cases, almost all distributions are very close to the distribution assuming conditional independence. More formally, we show that the number of distributions for which the additional penalty term is large goes down exponentially fast.
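
The quantities named in this summary of Chapter 2 can be made concrete on a toy problem. The sketch below is only an illustration under stated assumptions: the joint distribution `p` is invented, and the code computes the Bayes optimal error, the conditional entropy H(C | X), the Kullback-Leibler divergence between `p` and its naive Bayes factorisation `q`, and the error incurred when classifying with `q`. It does not reproduce the bounds derived in the book.

```python
import math
from itertools import product

# Hypothetical joint distribution p(c, x1, x2) over binary variables.
p = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.10, (0, 1, 0): 0.05, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.05, (1, 1, 0): 0.10, (1, 1, 1): 0.30,
}

def marginal(dist, keep):
    """Sum out all coordinates except those listed in `keep`."""
    out = {}
    for key, prob in dist.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + prob
    return out

p_c = marginal(p, [0])
p_cx1 = marginal(p, [0, 1])
p_cx2 = marginal(p, [0, 2])
p_x = marginal(p, [1, 2])

# Mismatched model q: the naive Bayes factorisation p(c) p(x1|c) p(x2|c).
q = {(c, a, b): p_cx1[(c, a)] * p_cx2[(c, b)] / p_c[(c,)]
     for c, a, b in product([0, 1], repeat=3)}

# Bayes optimal error: for each x, the mass of the less probable class.
bayes_error = sum(min(p[(0, a, b)], p[(1, a, b)]) for a, b in p_x)

# Conditional entropy H(C | X) in bits.
cond_entropy = -sum(prob * math.log2(prob / p_x[(a, b)])
                    for (c, a, b), prob in p.items() if prob > 0)

# Kullback-Leibler divergence D(p || q) in bits.
kl = sum(prob * math.log2(prob / q[key]) for key, prob in p.items() if prob > 0)

def decide(a, b):
    """Class chosen by the mismatched model q for the input (a, b)."""
    return 0 if q[(0, a, b)] >= q[(1, a, b)] else 1

# Error of the q-based classifier when the data actually follow p.
mismatch_error = sum(p[(1 - decide(a, b), a, b)] for a, b in p_x)
```

In this particular example the naive Bayes model differs from the true distribution (the divergence is strictly positive) and yet makes the same decisions as the Bayes optimal classifier, which is in the spirit of the remark that the conditional independence assumption often costs little.
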
Roth [Roth, 1998] has shown that probabilistic classifiers can always be mapped to linear classifiers and as such, one can analyze their performance under the probably approximately correct (PAC) or Vapnik-Chervonenkis (VC) dimension framework. This viewpoint is important as it allows one to directly study the classification performance by developing the relations between the performance on the training data and the expected performance on the future unseen data. In Chapter 3, we build on these results of Roth [Roth, 1998]. It turns out that although the existing theory argues that one needs large amounts of data to do the learning, we observe that in practice a good generalization is achieved with a much smaller number of examples. The existing VC-dimension based bounds (being the worst case bounds) are too loose and we need to make use of properties of the observed data, leading to data dependent bounds. Our observation that in practice classification is achieved with good margin motivates us to develop bounds based on margin distribution. We develop a classification version of the Random projection theorem [Johnson and Lindenstrauss, 1984] and use it to develop data dependent bounds. Our results show that in most problems of practical interest, data actually reside in a low dimensional space. Comparison with existing bounds on real datasets shows that our bounds are tighter than existing bounds and in most cases less than 0.5.
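
The intuition behind using random projections for margin-based bounds can be sketched numerically. The code below is a plain Gaussian random projection in the style of Johnson and Lindenstrauss, not the classification version of the theorem developed in Chapter 3; the dimensions, the margin value 0.6, and all names are assumptions made for illustration. It checks how often a linear classifier keeps its decisions after both the weight vector and the points are projected to a lower dimension.

```python
import math
import random

random.seed(0)
d, k, n = 300, 100, 30   # original dim, projected dim, number of points

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(R, x):
    """Map x from d to k dimensions with the random matrix R."""
    return [dot(row, x) for row in R]

# Unit-norm weight vector of a hypothetical linear classifier.
w = [random.gauss(0, 1) for _ in range(d)]
norm = math.sqrt(dot(w, w))
w = [a / norm for a in w]

# Points placed at distance about 0.6 from the hyperplane, plus noise
# of roughly unit norm.
points, labels = [], []
for _ in range(n):
    y = random.choice([-1, 1])
    noise = [random.gauss(0, 1 / math.sqrt(d)) for _ in range(d)]
    points.append([y * 0.6 * a + e for a, e in zip(w, noise)])
    labels.append(y)

# Random projection matrix with entries drawn from N(0, 1/k).
R = [[random.gauss(0, 1 / math.sqrt(k)) for _ in range(d)] for _ in range(k)]
w_low = project(R, w)

# Fraction of points classified identically before and after projection.
agree = 0
for x in points:
    high = 1 if dot(w, x) > 0 else -1
    low = 1 if dot(w_low, project(R, x)) > 0 else -1
    agree += (high == low)
agreement = agree / n
```

When the points sit at a good margin from the hyperplane, as they do here by construction, almost all decisions survive the projection. This is one way to read the claim that such data effectively reside in a low dimensional space.
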
The next chapter (Chapter 4) provides a unified framework of probabilistic classifiers learned using maximum likelihood estimation. In a nutshell, we discuss what type of probabilistic classifiers are suited for using unlabeled data in a systematic way with maximum likelihood learning, namely classifiers known as generative. We discuss the conditions under which the assertion that unlabeled data are always profitable when learning classifiers, made in the existing literature, is valid, namely when the assumed probabilistic model matches reality. We also show, both analytically and experimentally, that unlabeled data can be detrimental to the classification performance when the conditions are violated. Here we use the term ‘reality’ to mean that there exists some true probability distribution that generates data, the same one for both labeled and unlabeled data. The terms are more rigorously defined in Chapter 4.
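
A minimal sketch of maximum likelihood learning of a generative classifier from labeled and unlabeled data is given below. It assumes a deliberately simple model that is not taken from the book: two classes with one-dimensional Gaussian class-conditional densities of known unit variance, fitted with the standard EM algorithm. The data are sampled from exactly this model, so the example covers only the favourable case in which the assumed model matches reality.

```python
import math
import random

random.seed(1)

def sample(label):
    """Draw x from the class-conditional density N(-2, 1) or N(+2, 1)."""
    return random.gauss(-2.0 if label == 0 else 2.0, 1.0)

# A handful of labeled points and many unlabeled ones.
labeled = [(sample(c), c) for c in (0, 0, 0, 1, 1, 1)]
unlabeled = [sample(random.randint(0, 1)) for _ in range(300)]

def density(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

# Initialise the class means from the labeled data alone.
means = [sum(x for x, c in labeled if c == k) / 3 for k in (0, 1)]
prior1 = 0.5
total = len(unlabeled) + len(labeled)

for _ in range(50):
    # E-step: posterior probability of class 1 for every unlabeled point.
    resp = []
    for x in unlabeled:
        a = (1 - prior1) * density(x, means[0])
        b = prior1 * density(x, means[1])
        resp.append(b / (a + b))
    # M-step: labeled points count with weight 1 for their known class.
    w1 = sum(resp) + sum(1 for _, c in labeled if c == 1)
    w0 = total - w1
    s1 = (sum(r * x for r, x in zip(resp, unlabeled))
          + sum(x for x, c in labeled if c == 1))
    s0 = (sum((1 - r) * x for r, x in zip(resp, unlabeled))
          + sum(x for x, c in labeled if c == 0))
    means = [s0 / w0, s1 / w1]
    prior1 = w1 / total
```

With only three labeled points per class the initial mean estimates are noisy; the unlabeled points pull them towards the true values of -2 and +2. If the data came from a distribution outside the assumed family, the same procedure could move the estimates in a direction that hurts classification, which is the detrimental case the chapter analyzes.
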

The theoretical analysis, although interesting in itself, becomes really attractive if it can be put to use in practical problems. Chapters 5 and 6 build on the results developed in Chapters 2 and 3, respectively. In Chapter 5, we use the results of Chapter 2 to develop a new algorithm for learning HMMs. In Chapter 2, we show that conditional entropy is inversely related to classification performance. Building on this idea, we argue that when HMMs are used for classification, instead of learning parameters by only maximizing the likelihood, one should also attempt to minimize the conditional entropy between the query (hidden) and the observed variables. This leads to a new algorithm for learning HMMs: MMIHMM. Our results on both synthetic and real data demonstrate the superiority of this new algorithm over the standard ML learning of HMMs.
In Chapter 3, a new, data-dependent complexity measure for learning, the projection profile, is introduced and is used to develop improved generalization bounds. In Chapter 6, we extend this result by developing a new learning algorithm for linear classifiers. The projection profile is a function of the margin distribution (the distribution of the distance of instances from a separating hyperplane). We argue that instead of maximizing the margin, one should attempt to directly minimize this term, which actually depends on the margin distribution. Experimental results on some real world problems (face detection and context sensitive spelling correction) and on several UCI data sets show that this new algorithm is superior (in terms of classification performance) to Boosting and SVM.
Chapter 7 provides a discussion of the implications of the analysis of semi-supervised learning (Chapter 4) when learning Bayesian network classifiers, suggesting and comparing different approaches that can be taken to make positive use of unlabeled data. Bayesian networks are directed acyclic graph models that represent joint probability distributions of a set of variables. The graphs consist of nodes (vertices in the graph) which represent the random variables and directed edges between the nodes which represent probabilistic dependencies between the variables and the causal relationship between the two connected nodes. With each node there is an associated probability mass function when the variable is discrete, or probability distribution function when the variable is continuous. In classification, one of the nodes in the graph is the class variable while the rest are the attributes. One of the main advantages of Bayesian networks is the ability to handle missing data, thus it is possible to systematically handle unlabeled data when learning the Bayesian network. The structure of a Bayesian network is the graph structure of the network. We show that learning the graph structure of the Bayesian network is key when learning with unlabeled data. Motivated by this observation, we review the existing structure learning approaches and point out their potential disadvantages when learning classifiers. We describe a structure learning algorithm driven by classification accuracy and provide empirical evidence of the algorithm’s success.
Chapter 8 deals with automatic recognition of high level human behavior. In particular, we focus on the office scenario and attempt to build a system that can decode the human activities (phone conversation, face-to-face conversation, presentation mode, other activity, nobody around, and distant conversation). Although there has been some work in the area of behavioral analysis, this is probably the first system that does the automatic recognition of human activities in real time from low-level sensory inputs. We make use of probabilistic models for this task. Hidden Markov models (HMMs) have been successfully applied for the task of analyzing temporal data (e.g. speech). Although very powerful, HMMs are not very successful in capturing long term relationships and modeling concepts lasting over long periods of time. One can always increase the number of hidden states, but then the complexity of decoding and the amount of data required to learn increase manyfold. In our work, to solve this problem, we propose the use of layered (a type of hierarchical) HMMs (LHMM), which can be viewed as a special case of Stacked Generalization [Wolpert, 1992]. At each level of the hierarchy, HMMs are used as classifiers to do the inference. The inferential output of these HMMs forms the input to the next level of the hierarchy. As our results show, this new architecture has a number of advantages over the standard HMMs. It allows one to capture events at different levels of abstraction while at the same time capturing long term dependencies, which are critical in the modeling of higher level concepts (human activities). Furthermore, this architecture provides robustness to noise and generalizes well to different settings. Comparison with the standard HMM shows that this model has superior performance in modeling the behavioral concepts.
The other challenging problem related to multimedia deals with automatic analysis/annotation of videos. This problem forms the topic of Chapter 9. Although similar in spirit to the problem of human activity recognition, this problem gets challenging because of the limited number of modalities (audio and vision) and the correlation between them being the key in event identification. In this chapter, we present a new algorithm for detecting events in videos, which combines the features with temporal support from multiple modalities. This algorithm is based on a new framework, “Duration dependent input/output Markov models (DDIOMM)”. Essentially, DDIOMM is a time varying Markov model (the state transition matrix is a function of the inputs at any given time) and the state transition probabilities are modified to explicitly take into account the non-exponential nature of the durations of the various events being modeled. Two main features of this model are (a) the ability to account for non-exponential duration and (b) the ability to map discrete state input sequences to decision sequences. The standard algorithms modeling video events use HMMs, which model the duration of events as an exponentially decaying distribution. However, we argue that the duration is an important characteristic of each event and we demonstrate it by the improved performance over standard HMMs in solving real world problems. The model is tested on the audio-visual event “explosion”. Using a set of hand-labeled video data, we compare the performance of our model with and without the explicit model for duration. We also compare the performance of the proposed model with the traditional HMM and observe an improvement in detection performance.
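
The statement that a standard HMM models event durations with an exponentially decaying distribution follows from the Markov property: a state with self-transition probability a is occupied for exactly d consecutive steps with probability a^(d-1) (1 - a), a geometric distribution. The short sketch below checks this numerically; the value a = 0.9 and the function name are assumptions made only for illustration.

```python
def duration_pmf(a, d):
    """P(stay exactly d steps in a state with self-transition probability a)."""
    return a ** (d - 1) * (1 - a)

a = 0.9  # hypothetical self-transition probability
durations = range(1, 2001)
pmf = [duration_pmf(a, d) for d in durations]

total = sum(pmf)                                        # close to 1
mean_duration = sum(d * q for d, q in zip(durations, pmf))  # close to 1 / (1 - a)
most_likely = max(durations, key=lambda d: duration_pmf(a, d))
```

Whatever the value of a, the most likely duration is a single time step and the probabilities only decay from there. An event whose typical duration is, say, a few seconds with little spread cannot be represented this way, which is the motivation for modeling duration explicitly.
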
The algorithms LHMM and DDIOMM presented in Chapters 8 and 9, respectively, have their origins in HMM and are motivated by the vast literature on probabilistic models and some psychological studies arguing that human behavior does have a hierarchical structure [Zacks and Tversky, 2001]. However, the problem lies in the fact that we are using these probabilistic models for classification and not purely for inference (the performance is measured with respect to the 0-1 loss function). Although one can use arguments related to Bayes optimality, these arguments fall apart in the case of mismatched distributions (i.e. when the true distribution is different from the one used). This mismatch may arise because of the small number of training samples used for learning, because of assumptions made to simplify the inference procedure (e.g. a number of conditional independence assumptions are made in Bayesian networks), or simply because of the lack of information about the true model. Following the arguments of Roth [Roth, 1999], one can analyze these algorithms both from the perspective of probabilistic classifiers and from the perspective of statistical learning theory. We apply these algorithms to two distinct but related applications which require machine learning techniques for multimodal information fusion: office activity recognition and multimodal event detection.
Chapters 10 and 11 demonstrate the theory and algorithms of semi-supervised learning (Chapters 4 and 7) on two classification tasks related to human computer intelligent interaction. The first is facial expression recognition from video sequences using non-rigid face tracking results as the attributes. We show that Bayesian networks can be used as classifiers to recognize facial expressions with good accuracy when the structure of the network is estimated from data. We also describe a real-time facial expression recognition system which is based on this analysis. The second application is frontal face detection from images under various illuminations. We describe the task and show that learning Bayesian network classifiers for detecting faces using our
