Machine Learning in Computer Vision
by
N. SEBE
University of Amsterdam, The Netherlands
IRA COHEN
HP Research Labs, U.S.A.
ASHUTOSH GARG
Google Inc., U.S.A.
and
THOMAS S. HUANG
University of Illinois at Urbana-Champaign, Urbana, IL, U.S.A.
A C.I.P. Catalogue record for this book is available from the Library of Congress.
Published by Springer,
P.O. Box 17, 3300 AA Dordrecht, The Netherlands.
Printed on acid-free paper
All Rights Reserved
© 2005 Springer
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
Printed in the Netherlands.
ISBN-10 1-4020-3274-9 (HB) Springer Dordrecht, Berlin, Heidelberg, New York
ISBN-10 1-4020-3275-7 (e-book) Springer Dordrecht, Berlin, Heidelberg, New York
ISBN-13 978-1-4020-3274-5 (HB) Springer Dordrecht, Berlin, Heidelberg, New York
ISBN-13 978-1-4020-3275-2 (e-book) Springer Dordrecht, Berlin, Heidelberg, New York
To my parents
Nicu
To Merav and Yonatan
Ira
To my parents
Asutosh
To my students: Past, present, and future
Tom
Contents
Foreword
Preface
1. INTRODUCTION
   1 Research Issues on Learning in Computer Vision
   2 Overview of the Book
   3 Contributions
2. THEORY: PROBABILISTIC CLASSIFIERS
   1 Introduction
   2 Preliminaries and Notations
      2.1 Maximum Likelihood Classification
      2.2 Information Theory
      2.3 Inequalities
   3 Bayes Optimal Error and Entropy
   4 Analysis of Classification Error of Estimated (Mismatched) Distribution
      4.1 Hypothesis Testing Framework
      4.2 Classification Framework
   5 Density of Distributions
      5.1 Distributional Density
      5.2 Relating to Classification Error
   6 Complex Probabilistic Models and Small Sample Effects
   7 Summary
3. THEORY: GENERALIZATION BOUNDS
   1 Introduction
   2 Preliminaries
   3 A Margin Distribution Based Bound
      3.1 Proving the Margin Distribution Bound
   4 Analysis
      4.1 Comparison with Existing Bounds
   5 Summary
4. THEORY: SEMI-SUPERVISED LEARNING
   1 Introduction
   2 Properties of Classification
   3 Existing Literature
   4 Semi-supervised Learning Using Maximum Likelihood Estimation
   5 Asymptotic Properties of Maximum Likelihood Estimation with Labeled and Unlabeled Data
      5.1 Model Is Correct
      5.2 Model Is Incorrect
      5.3 Examples: Unlabeled Data Degrading Performance with Discrete and Continuous Variables
      5.4 Generating Examples: Performance Degradation with Univariate Distributions
      5.5 Distribution of Asymptotic Classification Error Bias
      5.6 Short Summary
   6 Learning with Finite Data
      6.1 Experiments with Artificial Data
      6.2 Can Unlabeled Data Help with Incorrect Models? Bias vs. Variance Effects and the Labeled-unlabeled Graphs
      6.3 Detecting When Unlabeled Data Do Not Change the Estimates
      6.4 Using Unlabeled Data to Detect Incorrect Modeling Assumptions
   7 Concluding Remarks
5. ALGORITHM: MAXIMUM LIKELIHOOD MINIMUM ENTROPY HMM
   1 Previous Work
   2 Mutual Information, Bayes Optimal Error, Entropy, and Conditional Probability
   3 Maximum Mutual Information HMMs
      3.1 Discrete Maximum Mutual Information HMMs
      3.2 Continuous Maximum Mutual Information HMMs
      3.3 Unsupervised Case
   4 Discussion
      4.1 Convexity
      4.2 Convergence
      4.3 Maximum A-posteriori View of Maximum Mutual Information HMMs
   5 Experimental Results
      5.1 Synthetic Discrete Supervised Data
      5.2 Speaker Detection
      5.3 Protein Data
      5.4 Real-time Emotion Data
   6 Summary
6. ALGORITHM: MARGIN DISTRIBUTION OPTIMIZATION
   1 Introduction
   2 A Margin Distribution Based Bound
   3 Existing Learning Algorithms
   4 The Margin Distribution Optimization (MDO) Algorithm
      4.1 Comparison with SVM and Boosting
      4.2 Computational Issues
   5 Experimental Evaluation
   6 Conclusions
7. ALGORITHM: LEARNING THE STRUCTURE OF BAYESIAN NETWORK CLASSIFIERS
   1 Introduction
   2 Bayesian Network Classifiers
      2.1 Naive Bayes Classifiers
      2.2 Tree-Augmented Naive Bayes Classifiers
   3 Switching between Models: Naive Bayes and TAN Classifiers
   4 Learning the Structure of Bayesian Network Classifiers: Existing Approaches
      4.1 Independence-based Methods
      4.2 Likelihood and Bayesian Score-based Methods
   5 Classification Driven Stochastic Structure Search
      5.1 Stochastic Structure Search Algorithm
      5.2 Adding VC Bound Factor to the Empirical Error Measure
   6 Experiments
      6.1 Results with Labeled Data
      6.2 Results with Labeled and Unlabeled Data
   7 Should Unlabeled Data Be Weighed Differently?
   8 Active Learning
   9 Concluding Remarks
8. APPLICATION: OFFICE ACTIVITY RECOGNITION
   1 Context-Sensitive Systems
   2 Towards Tractable and Robust Context Sensing
   3 Layered Hidden Markov Models (LHMMs)
      3.1 Approaches
      3.2 Decomposition per Temporal Granularity
   4 Implementation of SEER
      4.1 Feature Extraction and Selection in SEER
      4.2 Architecture of SEER
      4.3 Learning in SEER
      4.4 Classification in SEER
   5 Experiments
      5.1 Discussion
   6 Related Representations
   7 Summary
9. APPLICATION: MULTIMODAL EVENT DETECTION
   1 Fusion Models: A Review
   2 A Hierarchical Fusion Model
      2.1 Working of the Model
      2.2 The Duration Dependent Input Output Markov Model
   3 Experimental Setup, Features, and Results
   4 Summary
10. APPLICATION: FACIAL EXPRESSION RECOGNITION
   1 Introduction
   2 Human Emotion Research
      2.1 Affective Human-computer Interaction
      2.2 Theories of Emotion
      2.3 Facial Expression Recognition Studies
   3 Facial Expression Recognition System
      3.1 Face Tracking and Feature Extraction
      3.2 Bayesian Network Classifiers: Learning the “Structure” of the Facial Features
   4 Experimental Analysis
      4.1 Experimental Results with Labeled Data
         4.1.1 Person-dependent Tests
         4.1.2 Person-independent Tests
      4.2 Experiments with Labeled and Unlabeled Data
   5 Discussion
11. APPLICATION: BAYESIAN NETWORK CLASSIFIERS FOR FACE DETECTION
   1 Introduction
   2 Related Work
   3 Applying Bayesian Network Classifiers to Face Detection
   4 Experiments
   5 Discussion
References
Index
Foreword
It started with image processing in the sixties. Back then, it took ages to digitize a Landsat image and then process it with a mainframe computer. Processing was inspired by the achievements of signal processing and was still very much oriented towards programming.
In the seventies, image analysis spun off combining image measurement with statistical pattern recognition. Slowly, computational methods detached themselves from the sensor and the goal to become more generally applicable.
In the eighties, model-driven computer vision originated when artificial intelligence and geometric modelling came together with image analysis components. The emphasis was on precise analysis with little or no interaction, still very much an art evaluated by visual appeal. The main bottleneck was in the amount of data, using an average of 5 to 50 pictures to illustrate the point.
At the beginning of the nineties, vision became available to many with the advent of sufficiently fast PCs. The Internet revealed the interest of the general public in images, eventually introducing content-based image retrieval. Combining independent (informal) archives, as the web is, urges for interactive evaluation of approximate results and hence weak algorithms and their combination in weak classifiers.
In the new century, the last analog bastion was taken. In a few years, sensors have become all digital. Archives will soon follow. As a consequence of this change in the basic conditions, datasets will overflow. Computer vision will spin off a new branch to be called something like archive-based or semantic vision, including a role for formal knowledge description in an ontology equipped with detectors. An alternative view is experience-based or cognitive vision. This is mostly a data-driven view on vision and includes the elementary laws of image formation.
This book comes right on time. The general trend is easy to see. The methods of computation went from dedicated to one specific task to more generally applicable building blocks, from detailed attention to one aspect like filtering to a broad variety of topics, from a detailed model design evaluated against a few data to abstract rules tuned to a robust application.
From the source to consumption, images are now all digital. Very soon, archives will be overflowing. This is slightly worrying as it will raise the level of expectations about the accessibility of the pictorial content to a level compatible with what humans can achieve.
There is only one realistic chance to respond. From the trend displayed above, it is best to identify basic laws and then to learn the specifics of the model from a larger dataset. Rather than excluding interaction in the evaluation of the result, it is better to perceive interaction as a valuable source of instant learning for the algorithm.
This book builds on that insight: that the key element in the current revolution is the use of machine learning to capture the variations in visual appearance, rather than having the designer of the model accomplish this. As a bonus, models learned from large datasets are likely to be more robust and more realistic than the brittle all-design models.
This book recognizes that machine learning for computer vision is distinctively different from plain machine learning. Loads of data, spatial coherence, and the large variety of appearances make computer vision a special challenge for the machine learning algorithms. Hence, the book does not waste itself on the complete spectrum of machine learning algorithms. Rather, this book is focussed on machine learning for pictures.
It is amazing so early in a new field that a book appears which connects theory to algorithms and through them to convincing applications.
The authors met one another at Urbana-Champaign and then dispersed over the world, apart from Thomas Huang who has been there forever. This book will surely be with us for quite some time to come.
Arnold Smeulders
University of Amsterdam
The Netherlands
October, 2004
Preface
The goal of computer vision research is to provide computers with humanlike perception capabilities so that they can sense the environment, understand the sensed data, take appropriate actions, and learn from this experience in order to enhance future performance. The field has evolved from the application of classical pattern recognition and image processing methods to advanced techniques in image understanding like model-based and knowledge-based vision.
In recent years, there has been an increased demand for computer vision systems to address “real-world” problems. However, many of our current models and methodologies do not seem to scale out of limited “toy” domains. Therefore, the current state-of-the-art in computer vision needs significant advancements to deal with real-world applications, such as navigation, target recognition, manufacturing, photo interpretation, remote sensing, etc. It is widely understood that many of these applications require vision algorithms and systems to work under partial occlusion, possibly under high clutter, low contrast, and changing environmental conditions. This requires that the vision techniques be robust and flexible to optimize performance in a given scenario.
The field of machine learning is driven by the idea that computer algorithms and systems can improve their own performance with time. Machine learning has evolved from the relatively “knowledge-free” general purpose learning system, the “perceptron” [Rosenblatt, 1958], and decision-theoretic approaches for learning [Blockeel and De Raedt, 1998], to symbolic learning of high-level knowledge [Michalski et al., 1986], artificial neural networks [Rowley et al., 1998a], and genetic algorithms [DeJong, 1988]. With the recent advances in hardware and software, a variety of practical applications of the machine learning research is emerging [Segre, 1992].
Vision provides interesting and challenging problems and a rich environment to advance the state-of-the-art in machine learning. Machine learning technology has a strong potential to contribute to the development of flexible and robust vision algorithms, thus improving the performance of practical vision systems. Learning-based vision systems are expected to provide a higher level of competence and greater generality. Learning may allow us to transfer the experience gained in creating a vision system for one application domain to a vision system for another domain by developing systems that acquire and maintain knowledge. We claim that learning represents the next challenging frontier for computer vision research.
More specifically, machine learning offers effective methods for computer vision for automating the model/concept acquisition and updating processes, adapting task parameters and representations, and using experience for generating, verifying, and modifying hypotheses. Expanding this list of computer vision problems, we find that some of the applications of machine learning in computer vision are: segmentation and feature extraction; learning rules, relations, features, discriminant functions, and evaluation strategies; learning and refining visual models; indexing and recognition strategies; integration of vision modules and task-level learning; learning shape representation and surface reconstruction strategies; self-organizing algorithms for pattern learning; biologically motivated modeling of vision systems that learn; and parameter adaptation and self-calibration of vision systems. As an eventual goal, machine learning may provide the necessary tools for synthesizing vision algorithms starting from adaptation of control parameters of vision algorithms and systems.
The goal of this book is to address the use of several important machine learning techniques in computer vision applications. An innovative combination of computer vision and machine learning techniques has the promise of advancing the field of computer vision, which will contribute to better understanding of complex real-world applications. There is another benefit of incorporating a learning paradigm in the computational vision framework. To mature the laboratory-grown vision systems into real-world working systems, it is necessary to evaluate the performance characteristics of these systems using a variety of real, calibrated data. Learning offers this evaluation tool, since no learning can take place without appropriate evaluation of the results.
Generally, learning requires large amounts of data and fast computational resources for its practical use. However, not all learning has to be on-line. Some of the learning can be done off-line, e.g., optimizing parameters, features, and sensors during training to improve performance. Depending upon the domain of application, the large number of training samples needed for inductive learning techniques may not be available. Thus, learning techniques should be able to work with varying amounts of a priori knowledge and data.
The effective usage of machine learning technology in real-world computer vision problems requires understanding the domain of application, abstraction of a learning problem from a given computer vision task, and the selection of appropriate representations for the learnable (input) and learned (internal) entities of the system. To succeed in selecting the most appropriate machine learning technique(s) for the given computer vision task, an adequate understanding of the different machine learning paradigms is necessary.
A learning system has to clearly demonstrate and answer questions such as what is being learned, how it is learned, what data is used to learn, how to represent what has been learned, how well and how efficiently the learning is taking place, and what the evaluation criteria are for the task at hand. Experimental details are essential for demonstrating the learning behavior of algorithms and systems. These experiments need to include scientific experimental design methodology for training/testing, parametric studies, and measures of performance improvement with experience. Experiments that exhibit scalability of learning-based vision systems are also very important.
In this book, we address all these important aspects. In each of the chapters, we show how the literature has introduced the techniques into the particular topic area, we present the background theory, discuss comparative experiments made by us, and conclude with comments and recommendations.
Acknowledgments
This book would not have existed without the assistance of Marcelo Cirelo, Larry Chen, Fabio Cozman, Michael Lew, and Dan Roth whose technical contributions are directly reflected within the chapters. We would like to thank Theo Gevers, Nuria Oliver, Arnold Smeulders, and our colleagues from the Intelligent Sensory Information Systems group at University of Amsterdam and the IFP group at University of Illinois at Urbana-Champaign who gave us valuable suggestions and critical comments. Beyond technical contributions, we would like to thank our families for years of patience, support, and encouragement. Furthermore, we are grateful to our departments for providing an excellent scientific environment.
Chapter 1
INTRODUCTION
Computer vision has grown rapidly within the past decade, producing tools that enable the understanding of visual information, especially for scenes with no accompanying structural, administrative, or descriptive text information. The Internet, more specifically the Web, has become a common channel for the transmission of graphical information, thus moving visual information retrieval rapidly from stand-alone workstations and databases into a networked environment.
Practicality has begun to dictate that the indexing of huge collections of images by hand is a task that is both labor intensive and expensive - in many cases more than can be afforded to provide some method of intellectual access to digital image collections. In the world of text retrieval, text “speaks for itself” whereas image analysis requires a combination of high-level concept creation as well as the processing and interpretation of inherent visual features. In the area of intellectual access to visual information, the interplay between human and machine image indexing methods has begun to influence the development of computer vision systems. Research and application by the image understanding (IU) community suggests that the most fruitful approaches to IU involve analysis and learning of the type of information being sought, the domain in which it will be used, and systematic testing to identify optimal methods.
The goal of computer vision research is to provide computers with humanlike perception capabilities so that they can sense the environment, understand the sensed data, take appropriate actions, and learn from this experience in order to enhance future performance. The vision field has evolved from the application of classical pattern recognition and image processing techniques to advanced applications of image understanding, model-based vision, knowledge-based vision, and systems that exhibit learning capability. The ability to reason and the ability to learn are the two major capabilities associated with these systems. In recent years, theoretical and practical advances are being made in the field of computer vision and pattern recognition by new techniques and processes of learning, representation, and adaptation. It is probably fair to claim, however, that learning represents the next challenging frontier for computer vision.
1. Research Issues on Learning in Computer Vision
In recent years, there has been a surge of interest in developing machine learning techniques for computer vision based applications. The interest derives from both commercial projects to create working products from computer vision techniques and from a general trend in the computer vision field to incorporate machine learning techniques.
Learning is one of the current frontiers for computer vision research and has been receiving increased attention in recent years. Machine learning technology has strong potential to contribute to:
  – the development of flexible and robust vision algorithms that will improve the performance of practical vision systems with a higher level of competence and greater generality, and
  – the development of architectures that will speed up system development time and provide better performance.
The goal of improving the performance of computer vision systems has brought new challenges to the field of machine learning, for example, learning from structured descriptions, partial information, incremental learning, focusing attention or learning regions of interest (ROI), learning with many classes, etc. Solving problems in visual domains will result in the development of new, more robust machine learning algorithms that will be able to work in more realistic settings.
From the standpoint of computer vision systems, machine learning can offer effective methods for automating the acquisition of visual models, adapting task parameters and representation, transforming signals to symbols, building trainable image processing systems, focusing attention on target object, and learning when to apply what algorithm in a vision system.
From the standpoint of machine learning systems, computer vision can provide interesting and challenging problems. As examples consider the following: learning models rather than handcrafting them, learning to transfer experience gained in one application domain to another domain, learning from large sets of images with no annotation, designing evaluation criteria for the quality of learning processes in computer vision systems. Many studies in machine learning assume that a careful trainer provides internal representations of the observed environment, thus paying little attention to the problems of perception. Unfortunately, this assumption leads to the development of brittle systems with noisy, excessively detailed, or quite coarse descriptions of the perceived environment.
Esposito and Malerba [Esposito and Malerba, 2001] listed some of the important research issues that have to be dealt with in order to develop successful applications:
Can we learn the models used by a computer vision system rather than handcrafting them?
In many computer vision applications, handcrafting the visual model of an object is neither easy nor practical. For instance, humans can detect and identify faces in a scene with little or no effort. This skill is quite robust, despite large changes in the visual stimulus. Nevertheless, providing computer vision systems with models of facial landmarks or facial expressions is very difficult [Cohen et al., 2003b]. Even when models have been handcrafted, as in the case of page layout descriptions used by some document image processing systems [Nagy et al., 1992], it has been observed that they limit the use of the system to a specific class of images, which is subject to change in a relatively short time.
How is machine learning used in computer vision systems?
Machine learning algorithms can be applied in at least two different ways in computer vision systems:
  – to improve perception of the surrounding environment, that is, to improve the transformation of sensed signals into internal representations, and
  – to bridge the gap between the internal representations of the environment and the representation of the knowledge needed by the system to perform its task.
A possible explanation of the marginal attention given to learning internal representations of the perceived environment is that feature extraction has received very little attention in the machine learning community, because it has been considered application-dependent and research on this issue is not of general interest. The identification of required data and domain knowledge requires the collaboration with a domain expert and is an important step of the process of applying machine learning to real-world problems.
Only recently, the related issues of feature selection and, more generally, data preprocessing have been more systematically investigated in machine learning. Data preprocessing is still considered a step of the knowledge discovery process and is confined to data cleaning, simple data transformations (e.g., summarization), and validation. On the contrary, many studies in computer vision and pattern recognition focused on the problems of feature extraction and selection. Hough transform, FFT, and textural features, just to mention some, are all examples of features widely applied in image classification and scene understanding tasks. Their properties have been well investigated and available tools make their use simple and efficient.
How do we represent visual information?
In many computer vision applications, feature vectors are used to represent the perceived environment. However, relational descriptions are deemed to be of crucial importance in high-level vision. Since relations cannot be represented by feature vectors, pattern recognition researchers use graphs to capture the structure of both objects and scenes, while people working in the field of machine learning prefer to use first-order logic formalisms. By mapping one formalism into another, it is possible to find some similarities between research done in pattern recognition and machine learning. An example is the spatio-temporal decision tree proposed by Bischof and Caelli [Bischof and Caelli, 2001], which can be related to logical decision trees induced by some general-purpose inductive learning systems [Blockeel and De Raedt, 1998].
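As a minimal sketch of the contrast drawn above, the following Python fragment stores the same toy scene once as per-region feature vectors and once as a labelled graph of relations that can be queried in a first-order style. The region names, feature values, relation names, and helper functions (`holds`, `successors`, `above_transitive`) are all hypothetical and chosen only for illustration; they are not taken from the cited works.

```python
# Feature-vector view: one fixed-length vector per region (hypothetical values).
# Nothing in these numbers says how the regions are arranged in the scene.
features = {
    "sky":  [0.9, 0.1],   # [mean brightness, texture energy]
    "roof": [0.4, 0.6],
    "wall": [0.6, 0.3],
}

# Relational view: a labelled directed graph stored as
# (subject, relation, object) triples.
relations = {
    ("sky", "above", "roof"),
    ("roof", "above", "wall"),
    ("roof", "adjacent", "wall"),
}


def holds(relation, a, b):
    """First-order-style ground fact test: is relation(a, b) asserted?"""
    return (a, relation, b) in relations


def successors(relation, a):
    """Graph-style neighbourhood query: all b such that relation(a, b)."""
    return sorted(b for (s, r, b) in relations if s == a and r == relation)


def above_transitive(a, b):
    """True if b can be reached from a by following 'above' edges."""
    frontier, seen = [a], set()
    while frontier:
        node = frontier.pop()
        for nxt in successors("above", node):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False


print(holds("above", "sky", "roof"))      # direct fact
print(above_transitive("sky", "wall"))    # derived through the graph
```

The same triples can be read either as graph edges or as logical facts, which is one simple way to see how the two formalisms map onto each other.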
What machine learning paradigms and strategies are appropriate to the computer vision domain?
Inductive learning, both supervised and unsupervised, emerges as the most important learning strategy. There are several important paradigms that are being used: conceptual (decision trees, graph-induction), statistical (support vector machines), and neural networks (Kohonen maps and similar auto-organizing systems). Another emerging paradigm, which is described in detail in this book, is the use of probabilistic models in general and probabilistic graphical models in particular.
What are the criteria for evaluating the quality of the learning processes in computer vision systems?
In benchmarking computer vision systems, estimates of the predictive accuracy, recall, and precision [Huijsman and Sebe, 2004] are considered the main parameters to evaluate the success of a learning algorithm. However, the comprehensibility of learned models is also deemed an important criterion, especially when domain experts have strong expectations on the properties of visual models or when understanding of system failures is important. Comprehensibility is needed by the expert to easily and reliably verify the inductive assertions and relate them to their own domain knowledge. When comprehensibility is an important issue, the conceptual learning paradigm is usually preferred, since it is based on the comprehensibility postulate stated by Michalski [Michalski, 1983]:
The results of computer induction should be symbolic descriptions of given entities, semantically and structurally similar to those a human expert might produce observing the same entities. Components of these descriptions should be comprehensible as single “chunks” of information, directly interpretable in natural language, and should relate quantitative and qualitative concepts in an integrated fashion.
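Returning to the quantitative criteria named in the answer above (predictive accuracy, recall, and precision), all three reduce to simple ratios of confusion-matrix counts. The sketch below is illustrative only: the function names and the toy label vectors are assumptions introduced here, not something taken from [Huijsman and Sebe, 2004].

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one class of interest."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred, positive=1):
    """Fraction of all decisions that are correct."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    return (tp + tn) / (tp + fp + fn + tn)


def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are truly positive."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fp) if tp + fp else 0.0


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that the system retrieved."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical toy labels (1 = object present, 0 = absent).
Y_TRUE = [1, 1, 1, 1, 0, 0, 0, 0]
Y_PRED = [1, 1, 0, 0, 1, 0, 0, 0]

if __name__ == "__main__":
    print(accuracy(Y_TRUE, Y_PRED), precision(Y_TRUE, Y_PRED), recall(Y_TRUE, Y_PRED))
```

On the toy labels the three measures take different values, which is why benchmarks usually report them together rather than accuracy alone.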
When is it useful to adopt several representations of the perceived environment with different levels of abstraction?
In complex real-world applications, multi-representations of the perceived environment prove very useful. For instance, a low resolution document image is suitable for the efficient separation of text from graphics, while a finer resolution is required for the subsequent step of interpreting the symbols in a text block (OCR). Analogously, the representation of an aerial view of a cultivated area by means of a vector of textural features can be appropriate to recognize the type of vegetation, but it is too coarse for the recognition of a particular geomorphology. By applying abstraction principles in computer programming, software engineers have managed to develop complex software systems. Similarly, the systematic application of abstraction principles in knowledge representation is the keystone for a long term solution to many problems encountered in computer vision tasks.
How can mutual dependency of visual concepts be dealt with?
In scene labelling problems, image segments have to be associated with a class name or a label, the number of distinct labels depending on the different types of objects allowed in the perceived world. Typically, image segments cannot be labelled independently of each other, since the interpretation of a part of a scene depends on the understanding of the whole scene (holistic view). Context-dependent labelling rules will take such concept dependencies into account, so as to guarantee that the final result is globally (and not only locally) consistent [Haralick and Shapiro, 1979]. Learning context-dependent labelling rules is another research issue, since
most learning algorithms rely on the independence assumption, according to which the solution to a multiclass or multiple concept learning problem is simply the sum of independent solutions to single class or single concept learning problems.
Obviously, the above list cannot be considered complete. Other equally relevant research issues might be proposed, such as the development of noise-tolerant learning techniques, the effective use of large sets of unlabeled images, and the identification of suitable criteria for starting/stopping the learning process and/or revising acquired visual models.
2. Overview of the Book
In general, the study of machine learning and computer vision can be divided into three broad categories: Theory leading to Algorithms and Applications built on top of theory and algorithms. In this framework, the applications should form the basis of the theoretical research leading to interesting algorithms. As a consequence, the book was divided into three parts. The first part develops the theoretical understanding of the concepts that are being used in developing algorithms in the second part. The third part focuses on the analysis of computer vision and human-computer interaction applications that use the algorithms and the theory presented in the first two parts.
The theoretical results in this book originate from different practical problems encountered when applying machine learning in general, and probabilistic models in particular, to computer vision and multimedia problems. The first set of questions arises from the high dimensionality of models in computer vision and multimedia. For example, integration of audio and visual information plays a critical role in multimedia analysis. Different media streams (e.g., audio, video, and text) may carry information about the task being performed, and recent results [Brand et al., 1997; Chen and Rao, 1998; Garg et al., 2000b] have shown that improved performance can be obtained by combining information from different sources compared with the situation when a single modality is considered. At times, different streams may carry similar information and in that case, one attempts to use the redundancy to improve the performance of the desired task by cancelling the noise. At other times, two streams may carry complementary information and in that case the system must make use of the information carried in both channels to carry out the task. However, the merits of using multiple streams are overshadowed by the formidable task of learning in high dimensional spaces, which is invariably the case in multi-modal information processing. Although the existing theory supports the task of learning in high dimensional spaces, the data and model complexity requirements posed are typically not met by real life systems. Under such a scenario, the existing
results in learning theory fall short of giving any meaningful guarantees for the learned classifiers. This raises a number of interesting questions:
Can we analyze the learning theory for more practical scenarios?
Can the results of such analysis be used to develop better algorithms?
Another set of questions arises from the practical problem of data availability in computer vision, mainly labeled data. In this respect, there are three main paradigms for learning from training data. The first is known as supervised learning, in which all the training data are labeled, i.e., a datum contains both the values of the attributes and the labeling of the attributes to one of the classes. The labeling of the training data is usually done by an external mechanism (usually humans), hence the name supervised. The second is known as unsupervised learning, in which each datum contains the values of the attributes but does not contain the label. Unsupervised learning tries to find regularities in the unlabeled training data (such as different clusters under some metric space), infer the class labels, and sometimes even the number of classes. The third kind is semi-supervised learning, in which some of the data are labeled and some unlabeled. In this book, we are more interested in the latter.
Semi-supervised learning is motivated by the fact that in many computer vision (and other real world) problems, obtaining unlabeled data is relatively easy (e.g., collecting images of faces and non-faces), while labeling is difficult, expensive, and/or labor intensive. Thus, in many problems, it is very desirable to have learning algorithms that are able to incorporate a large amount of unlabeled data with a small amount of labeled data when learning classifiers.
Some of the questions raised in semi-supervised learning of classifiers are:
Is it feasible to use unlabeled data in the learning process?
Is the classification performance of the learned classifier guaranteed to improve when adding the unlabeled data to the labeled data?
What is the value of unlabeled data?
The goal of the book is to address all the challenging questions posed so far. We believe that a detailed analysis of the way machine learning theory can be applied through algorithms to real-world applications is very important and extremely relevant to the scientific community.
Chapters 2, 3, and 4 provide the theoretical answers to the questions posed above. Chapter 2 introduces the basics of probabilistic classifiers. We argue that there are two main factors contributing to the error of a classifier. Because of the inherent nature of the data, there is an upper limit on the performance of any classifier and this is typically referred to as Bayes optimal error. We start by analyzing the relationship between the Bayes optimal performance of
a classifier and the conditional entropy of the data. The mismatch between the true underlying model (the one that generated the data) and the model used for classification contributes the second factor of error. In this chapter, we develop bounds on the classification error under the hypothesis testing framework when there is a mismatch between the distribution used and the true distribution. Our bounds show that the classification error is closely related to the conditional entropy of the distribution. The additional penalty, because of the mismatched distribution, is a function of the Kullback-Leibler distance between the true and the mismatched distribution. Once these bounds are developed, the next logical step is to see how often the error caused by the mismatch between distributions is large. Our average case analysis for the independence assumptions leads to results that justify the success of the conditional independence assumption (e.g., in the naive Bayes architecture). We show that in most cases, almost all distributions are very close to the distribution assuming conditional independence. More formally, we show that the number of distributions for which the additional penalty term is large goes down exponentially fast.
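The quantities named in this paragraph can be computed by hand for a small discrete problem. The sketch below does so for a single binary attribute and a binary class. The two distributions `P_TRUE` and `Q_MODEL`, and all function names, are hypothetical choices made for illustration; the code only evaluates the Bayes optimal error, the conditional entropy, the Kullback-Leibler distance, and the error of a classifier built from the mismatched model. It does not reproduce the bounds derived in Chapter 2.

```python
import math

# Hypothetical true joint distribution P(x, y): binary attribute x, binary class y.
P_TRUE = {(0, 0): 0.40, (0, 1): 0.10, (1, 0): 0.15, (1, 1): 0.35}
# Hypothetical mismatched model Q(x, y) used in place of the true distribution.
Q_MODEL = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.30, (1, 1): 0.20}


def bayes_error(p):
    """Error of the rule that picks the most probable class for each x under p."""
    xs = {x for x, _ in p}
    return sum(min(p[(x, 0)], p[(x, 1)]) for x in xs)


def conditional_entropy(p):
    """H(Y | X) in bits under the joint distribution p."""
    h = 0.0
    for x in {x for x, _ in p}:
        px = p[(x, 0)] + p[(x, 1)]
        for y in (0, 1):
            if p[(x, y)] > 0:
                h -= p[(x, y)] * math.log2(p[(x, y)] / px)
    return h


def kl_divergence(p, q):
    """Kullback-Leibler distance D(p || q) in bits."""
    return sum(p[k] * math.log2(p[k] / q[k]) for k in p if p[k] > 0)


def mismatched_error(p, q):
    """Error, measured under the true p, of the classifier argmax_y q(x, y)."""
    err = 0.0
    for x in {x for x, _ in p}:
        y_hat = 0 if q[(x, 0)] >= q[(x, 1)] else 1
        err += p[(x, 1 - y_hat)]  # mass of the class that was not chosen
    return err


if __name__ == "__main__":
    print("Bayes error:        ", bayes_error(P_TRUE))
    print("H(Y|X) in bits:     ", conditional_entropy(P_TRUE))
    print("D(P||Q) in bits:    ", kl_divergence(P_TRUE, Q_MODEL))
    print("mismatched error:   ", mismatched_error(P_TRUE, Q_MODEL))
```

In this toy case the mismatched model always predicts class 0, so its error exceeds the Bayes optimal error; when `Q_MODEL` equals `P_TRUE` the Kullback-Leibler distance is zero and the two errors coincide.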
Roth [Roth, 1998] has shown that probabilistic classifiers can always be mapped to linear classifiers and, as such, one can analyze their performance under the probably approximately correct (PAC) or Vapnik-Chervonenkis (VC)-dimension framework. This viewpoint is important as it allows one to directly study the classification performance by developing the relations between the performance on the training data and the expected performance on future unseen data. In Chapter 3, we build on these results of Roth [Roth, 1998]. It turns out that although the existing theory argues that one needs large amounts of data to do the learning, we observe that in practice good generalization is achieved with a much smaller number of examples. The existing VC-dimension based bounds (being worst case bounds) are too loose, and we need to make use of properties of the observed data, leading to data-dependent bounds. Our observation that, in practice, classification is achieved with good margin motivates us to develop bounds based on the margin distribution. We develop a classification version of the random projection theorem [Johnson and Lindenstrauss, 1984] and use it to develop data-dependent bounds. Our results show that in most problems of practical interest, data actually reside in a low dimensional space. Comparison with existing bounds on real datasets shows that our bounds are tighter than existing bounds and in most cases less than 0.5.
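The intuition behind a random projection argument for classification can be checked numerically: when data are separated with a good margin, projecting both the data and the separating direction onto a random low dimensional subspace tends to preserve the sign of most inner products. The sketch below is a toy experiment under stated assumptions, not the theorem or proof of Chapter 3. The Gaussian projection matrix, the synthetic data generator, and every parameter value (`n`, `d`, `k`, `margin`, `seed`) are hypothetical choices made for illustration.

```python
import math
import random


def random_projection_matrix(k, d, rng):
    """k x d matrix with i.i.d. N(0, 1/k) entries (one common construction)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(matrix, vec):
    """Apply the projection: one inner product per row of the matrix."""
    return [dot(row, vec) for row in matrix]


def make_margin_data(n, d, margin, rng):
    """Unit-norm points at signed distance `margin` from the hyperplane w = e_1."""
    w = [1.0] + [0.0] * (d - 1)
    rest = math.sqrt(1.0 - margin ** 2)
    data = []
    for _ in range(n):
        label = rng.choice((-1, 1))
        noise = [rng.gauss(0.0, 1.0) for _ in range(d - 1)]
        norm = math.sqrt(sum(z * z for z in noise))
        x = [label * margin] + [rest * z / norm for z in noise]
        data.append((x, label))
    return w, data


def projected_error(k, n=100, d=200, margin=0.5, seed=0):
    """Fraction of points misclassified after projecting from d to k dimensions."""
    rng = random.Random(seed)
    w, data = make_margin_data(n, d, margin, rng)
    matrix = random_projection_matrix(k, d, rng)
    w_low = project(matrix, w)
    mistakes = sum(1 for x, y in data if y * dot(w_low, project(matrix, x)) <= 0)
    return mistakes / n


if __name__ == "__main__":
    for k in (5, 20, 50):
        print(k, projected_error(k))
```

Shrinking `margin` or `k` in this experiment is a quick way to see how the projected error depends on how the margins are distributed, which is the kind of data-dependent quantity the paragraph refers to.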
The next chapter (Chapter 4) provides a unified framework of probabilistic classifiers learned using maximum likelihood estimation. In a nutshell, we discuss what type of probabilistic classifiers are suited for using unlabeled data in a systematic way with maximum likelihood learning, namely classifiers known as generative. We discuss the conditions under which the assertion that unlabeled data are always profitable when learning classifiers, made in
the existing literature, is valid, namely when the assumed probabilistic model matches reality. We also show, both analytically and experimentally, that unlabeled data can be detrimental to the classification performance when the conditions are violated. Here we use the term ‘reality’ to mean that there exists some true probability distribution that generates data, the same one for both labeled and unlabeled data. The terms are more rigorously defined in Chapter 4.
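To make the idea of using unlabeled data with maximum likelihood in a generative classifier concrete, here is a minimal sketch based on the standard EM algorithm for a two-class mixture. It is not the analysis or the algorithm of Chapter 4. The model (one unit-variance Gaussian per class over a single attribute), the toy data, and all names are assumptions introduced here, and the sketch deliberately sits in the favorable case where the assumed model roughly matches the data.

```python
import math


def gaussian_pdf(x, mean, var=1.0):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def semi_supervised_em(labeled, unlabeled, iterations=50):
    """Maximum likelihood fit of a two-class generative model.

    labeled:   list of (x, y) pairs with y in {0, 1}
    unlabeled: list of x values whose class is treated as a hidden variable
    Returns (prior of class 1, mean of class 0, mean of class 1).
    """
    xs0 = [x for x, y in labeled if y == 0]
    xs1 = [x for x, y in labeled if y == 1]
    # Initialise with the supervised estimate from the labeled data alone.
    mean0 = sum(xs0) / len(xs0)
    mean1 = sum(xs1) / len(xs1)
    prior1 = len(xs1) / len(labeled)
    for _ in range(iterations):
        # E-step: posterior probability of class 1 for each unlabeled point.
        resp = []
        for x in unlabeled:
            p1 = prior1 * gaussian_pdf(x, mean1)
            p0 = (1.0 - prior1) * gaussian_pdf(x, mean0)
            resp.append(p1 / (p0 + p1))
        # M-step: labeled points count fully, unlabeled points fractionally.
        w1 = len(xs1) + sum(resp)
        w0 = len(xs0) + sum(1.0 - r for r in resp)
        mean1 = (sum(xs1) + sum(r * x for r, x in zip(resp, unlabeled))) / w1
        mean0 = (sum(xs0) + sum((1.0 - r) * x for r, x in zip(resp, unlabeled))) / w0
        prior1 = w1 / (len(labeled) + len(unlabeled))
    return prior1, mean0, mean1


# Hypothetical toy data: four labeled and six unlabeled one-dimensional points.
LABELED = [(-1.0, 0), (-0.8, 0), (1.0, 1), (1.2, 1)]
UNLABELED = [-1.2, -0.9, -1.1, 0.9, 1.1, 1.3]

if __name__ == "__main__":
    print(semi_supervised_em(LABELED, UNLABELED))
```

Replacing the toy data with samples that violate the Gaussian assumption is a simple way to explore the situation described above, in which the model does not match reality and unlabeled data may no longer help.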
The theoretical analysis, although interesting in itself, becomes really attractive when it can be put to use in practical problems. Chapters 5 and 6 build on the results developed in Chapters 2 and 3, respectively. In Chapter 5, we use the results of Chapter 2 to develop a new algorithm for learning HMMs. In Chapter 2, we show that conditional entropy is inversely related to classification performance. Building on this idea, we argue that when HMMs are used for classification, instead of learning parameters by only maximizing the likelihood, one should also attempt to minimize the conditional entropy between the query (hidden) and the observed variables. This leads to a new algorithm for learning HMMs, MMIHMM. Our results on both synthetic and real data demonstrate the superiority of this new algorithm over the standard ML learning of HMMs.
In Chapter 3, a new data-dependent complexity measure for learning, the projection profile, is introduced and is used to develop improved generalization bounds. In Chapter 6, we extend this result by developing a new learning algorithm for linear classifiers. The projection profile is a function of the margin distribution (the distribution of the distance of instances from a separating hyperplane). We argue that instead of maximizing the margin, one should attempt to directly minimize this term, which actually depends on the margin distribution. Experimental results on some real world problems (face detection and context sensitive spelling correction) and on several UCI data sets show that this new algorithm is superior (in terms of classification performance) to Boosting and SVM.
Chapter 7 provides a discussion of the implications of the analysis of semi-supervised learning (Chapter 4) when learning Bayesian network classifiers, suggesting and comparing different approaches that can be taken to make positive use of unlabeled data. Bayesian networks are directed acyclic graph models that represent joint probability distributions of a set of variables. The graphs consist of nodes (vertices in the graph), which represent the random variables, and directed edges between the nodes, which represent probabilistic dependencies between the variables and the causal relationship between the two connected nodes. With each node there is an associated probability mass function when the variable is discrete, or probability distribution function when the variable is continuous. In classification, one of the nodes in the graph is the class variable while the rest are the attributes. One of the main advantages of Bayesian networks is the ability to handle missing data; thus it is possible to systematically handle unlabeled data when learning the Bayesian network. The
structure of a Bayesian network is the graph structure of the network. We show that learning the graph structure of the Bayesian network is key when learning with unlabeled data. Motivated by this observation, we review the existing structure learning approaches and point out their potential disadvantages when learning classifiers. We describe a structure learning algorithm driven by classification accuracy, and provide empirical evidence of the algorithm’s success.
Chapter 8 deals with automatic recognition of high level human behavior. In particular, we focus on the office scenario and attempt to build a system that can decode the human activities (phone conversation, face-to-face conversation, presentation mode, other activity, nobody around, and distant conversation). Although there has been some work in the area of behavioral analysis, this is probably the first system that does the automatic recognition of human activities in real time from low-level sensory inputs. We make use of probabilistic models for this task. Hidden Markov models (HMMs) have been successfully applied to the task of analyzing temporal data (e.g. speech). Although very powerful, HMMs are not very successful in capturing long term relationships and modeling concepts lasting over long periods of time. One can always increase the number of hidden states, but then the complexity of decoding and the amount of data required to learn increase manyfold. In our work, to solve this problem, we propose the use of layered (a type of hierarchical) HMMs (LHMM), which can be viewed as a special case of Stacked Generalization [Wolpert, 1992]. At each level of the hierarchy, HMMs are used as classifiers to do the inference. The inferential output of these HMMs forms the input to the next level of the hierarchy. As our results show, this new architecture has a number of advantages over standard HMMs. It allows one to capture events at different levels of abstraction and at the same time captures long term dependencies, which are critical in the modeling of higher level concepts (human activities). Furthermore, this architecture provides robustness to noise and generalizes well to different settings. Comparison with the standard HMM shows that this model has superior performance in modeling the behavioral concepts.
The other challenging problem related to multimedia deals with automatic analysis/annotation of videos. This problem forms the topic of Chapter 9. Although similar in spirit to the problem of human activity recognition, this problem is challenging because only a limited number of modalities (audio and vision) are available and the correlation between them is the key to event identification. In this chapter, we present a new algorithm for detecting events in videos, which combines the features with temporal support from multiple modalities. This algorithm is based on a new framework, “Duration dependent input/output Markov models (DDIOMM)”. Essentially, DDIOMM is a time varying Markov model (the state transition matrix is a function of the inputs at any given time) and
the state transition probabilities are modified to explicitly take into account the non-exponential nature of the durations of the various events being modeled. The two main features of this model are (a) the ability to account for non-exponential duration and (b) the ability to map discrete state input sequences to decision sequences. The standard algorithms for modeling video events use HMMs, which model the duration of events as an exponentially decaying distribution. However, we argue that the duration is an important characteristic of each event, and we demonstrate this by the improved performance over standard HMMs in solving real world problems. The model is tested on the audio-visual event “explosion”. Using a set of hand-labeled video data, we compare the performance of our model with and without the explicit model for duration. We also compare the performance of the proposed model with the traditional HMM and observe an improvement in detection performance.
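The remark that a standard HMM models event durations as exponentially decaying follows directly from the Markov property: a state with self-transition probability a is occupied for exactly d consecutive steps with probability a^(d-1)(1-a). The sketch below evaluates this implied duration distribution. The function names and the value 0.9 are illustrative assumptions, and the code does not implement the DDIOMM itself.

```python
def geometric_duration_pmf(self_transition, d):
    """P(a standard HMM state lasts exactly d steps) = a**(d - 1) * (1 - a)."""
    if d < 1:
        raise ValueError("duration must be at least one step")
    return self_transition ** (d - 1) * (1.0 - self_transition)


def expected_duration(self_transition):
    """Mean of the implied duration distribution, 1 / (1 - a)."""
    return 1.0 / (1.0 - self_transition)


def most_likely_duration(self_transition, max_d=100):
    """Mode of the implied duration distribution over 1..max_d."""
    return max(range(1, max_d + 1),
               key=lambda d: geometric_duration_pmf(self_transition, d))


if __name__ == "__main__":
    a = 0.9  # hypothetical self-transition probability
    for d in (1, 5, 10, 20):
        print(d, geometric_duration_pmf(a, d))
    print("mean:", expected_duration(a), "mode:", most_likely_duration(a))
```

Whatever value of a is chosen, the most likely duration is a single step, so an event whose typical length is, say, twenty frames cannot be represented by one standard HMM state. This is the limitation that an explicit duration model is meant to address.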
The algorithms LHMM and DDIOMM, presented in Chapters 8 and 9, respectively, have their origins in HMMs and are motivated by the vast literature on probabilistic models and by some psychological studies arguing that human behavior does have a hierarchical structure [Zacks and Tversky, 2001]. However, the problem lies in the fact that we are using these probabilistic models for classification and not purely for inferencing (the performance is measured with respect to the 0-1 loss function). Although one can use arguments related to Bayes optimality, these arguments fall apart in the case of mismatched distributions (i.e. when the true distribution is different from the one used). This mismatch may arise because of the small number of training samples used for learning, because of assumptions made to simplify the inference procedure (e.g. a number of conditional independence assumptions are made in Bayesian networks), or simply because of the lack of information about the true model. Following the arguments of Roth [Roth, 1999], one can analyze these algorithms both from the perspective of probabilistic classifiers and from the perspective of statistical learning theory. We apply these algorithms to two distinct but related applications which require machine learning techniques for multimodal information fusion: office activity recognition and multimodal event detection.
Chapters 10 and 11 apply the theory and algorithms of semi-supervised learning (Chapters 4 and 7) to two classification tasks related to human computer intelligent interaction. The first is facial expression recognition from video sequences using non-rigid face tracking results as the attributes. We show that Bayesian networks can be used as classifiers to recognize facial expressions with good accuracy when the structure of the network is estimated from data. We also describe a real-time facial expression recognition system which is based on this analysis. The second application is frontal face detection from images under various illuminations. We describe the task and show that learning Bayesian network classifiers for detecting faces using our