Handbook of Pattern Recognition and Computer Vision
3rd Edition
editors
C. H. Chen, University of Massachusetts Dartmouth, USA
P. S. P. Wang, Northeastern University, USA
World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI
Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
First published 2005
Reprinted 2006
HANDBOOK OF PATTERN RECOGNITION & COMPUTER VISION (3rd Edition)
Copyright © 2005 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-256-105-6
Printed in Singapore by Mainland Press
Preface to the Third Edition
Dedicated to the memory of the late Professor King Sun Fu (1930-1985), the handbook series, with first edition (1993), second edition (1999) and third edition (2005), provides a comprehensive, concise and balanced coverage of the progress and achievements in the field of pattern recognition and computer vision in the last twenty years. This is a highly dynamic field which has been expanding greatly over the last thirty years. No handbook can cover the essence of all aspects of the field, and we have not attempted to do that. The carefully selected 33 chapters in the current edition were written by leaders in the field, and we believe that the book and its sister volumes, the first and second editions, will provide the growing pattern recognition and computer vision community with a set of valuable resource books that can last for a long time. Each chapter speaks for itself of the importance of the subject area covered.
The book continues to contain five parts. Part 1 is on the basic methods of pattern recognition. Though there are only five chapters, readers may find further coverage of basic methods in the first and second editions. Part 2 is on basic methods in computer vision. Again, readers may find that Part 2 complements well what was offered in the first and second editions. Part 3, on recognition applications, continues to emphasize character recognition and document processing. It also presents new applications in digital mammograms, remote sensing images and functional magnetic resonance imaging data. One intensively explored area of pattern recognition applications is currently the personal identification problem, also called biometrics, though the problem has been around for a number of years. Part 4 is especially devoted to this topic area. Indeed, the chapters in both Part 3 and Part 4 represent the growing importance of applications in pattern recognition.
In fact, Prof. Fu had envisioned the growth of pattern recognition applications in the early 60's. He and his group at Purdue had worked on character recognition, speech recognition, fingerprint recognition, seismic pattern recognition, biomedical and remote sensing recognition problems, etc. Part 5, on system and technology, presents other important aspects of pattern recognition and computer vision.
Our sincere thanks go to all contributors of this volume for their outstanding technical contributions. We would like to mention especially Dr. Quang-Tuan Luong, Dr. Giovanni Garibotto and Prof. Ching Y. Suen for their original contributions to all three volumes. Other authors who have contributed to all three volumes are: Prof. Thomas S. Huang, Prof. J.K. Aggarwal, Prof. Yun Y. Tang, Prof. C.C. Li, Prof. R. Chellappa and Prof. P.S.P. Wang.
We are pleased to mention that Prof. Thomas Huang and Prof. Jake Aggarwal are the recipients, in 2002 and 2004 respectively, of the prestigious K.S. Fu Prize sponsored by the International Association for Pattern Recognition (IAPR). Among Prof. Fu's Ph.D. graduates at Purdue who have contributed to the handbook series are: C.H. Chen (1965), M.H. Loew (1972), S.M. Hsu (1975), S.Y. Lu (1977), K.Y. Huang (1983) and H.D. Cheng (1985).
Finally, we would like to pay tribute to the late Prof. Azriel Rosenfeld (1931-2004) who, as one IAPR member put it, was a true scientist and a great giant in the field. He was awarded the K.S. Fu Prize by the IAPR in 1988. Readers are reminded to read Prof. Rosenfeld's inspirational article on "Vision - Some Speculations" that appeared as the Foreword of the second edition of the handbook series. Prof. Rosenfeld's profound influence on the field will be felt for many years to come.
The camera-ready manuscript production requires a certain amount of additional effort, as compared to typeset printing, on the part of editors and authors. We would like to thank all contributors for their patience in making the necessary revisions to comply with the format requirements during this long process of manuscript preparation. Our special thanks go to Steven Patt, in-house editor of World Scientific Publishing, for his efficient effort to make a timely publication of the book possible.
September, 2004
The Editors
Contents
Preface to the Third Edition v
Contents vii
Part 1. Basic Methods in Pattern Recognition 1
Chapter 1.1 Statistical Pattern Recognition 3
R.P.W. Duin and D.M.J. Tax
Chapter 1.2 Hidden Markov Models for Spatio-Temporal Pattern
Recognition 25
Brian C. Lovell and Terry Caelli
Chapter 1.3 A New Kernel-Based Formalization of Minimum Error Pattern
Recognition 41
Erik McDermott and Shigeru Katagiri
Chapter 1.4 Parallel Contextual Array Grammars with Trajectories 55
P. Helen Chandra, C. Martin-Vide, K.G. Subramanian, D.L. Van and P.S.P. Wang
Chapter 1.5 Pattern Recognition with Local Invariant Features 71
C. Schmid, G. Dorko, S. Lazebnik, K. Mikolajczyk and J. Ponce
Part 2. Basic Methods in Computer Vision 93
Chapter 2.1 Case-Based Reasoning for Image Analysis and Interpretation 95
Petra Perner
Chapter 2.2 Multiple Image Geometry - A Projective Viewpoint 115
Quang-Tuan Luong
Chapter 2.3 Skeletonization in 3D Discrete Binary Images 137
Gabriella Sanniti di Baja and Ingela Nyström
Chapter 2.4 Digital Distance Transforms in 2D, 3D, and 4D 157
Gunilla Borgefors
Chapter 2.5 Computing Global Shape Measures 177
Paul L. Rosin
Chapter 2.6 Texture Analysis with Local Binary Patterns 197
Topi Mäenpää and Matti Pietikäinen
Part 3. Recognition Applications 217
Chapter 3.1 Document Analysis and Understanding 219
Yuan Yan Tang
Chapter 3.2 Chinese Character Recognition 241
Xiaoqing Ding
Chapter 3.3 Extraction of Words from Handwritten Legal Amounts on Bank Cheques 259
In Cheol Kim and Ching Y. Suen
Chapter 3.4 OCR Assessment of Printed-Fonts for Enhancing Human
Vision 273
Ching Y. Suen, Qizhi Xu and Cedric Devoghelaere
Chapter 3.5 Clustering and Classification of Web Documents Using a
Graph Model 287
Adam Schenker, Horst Bunke, Mark Last and Abraham Kandel
Chapter 3.6 Automated Detection of Masses in Mammograms 303
H.D. Cheng, X.J. Shi, R. Min, X.P. Cai and H.N. Du
Chapter 3.7 Wavelet-Based Kalman Filtering in Scale Space for Image
Fusion 325
Hsi-Chin Hsin and Ching-Chung Li
Chapter 3.8 Multisensor Fusion with Hyperspectral Imaging Data:
Detection and Classification 347
Su May Hsu and Hsiao-hua Burke
Chapter 3.9 Independent Component Analysis of Functional Magnetic
Resonance Imaging Data 365
V.D. Calhoun and B. Hong
Part 4. Human Identification 385
Chapter 4.1 Multimodal Emotion Recognition 387
Nicu Sebe, Ira Cohen and Thomas S. Huang
Chapter 4.2 Gait-Based Human Identification from a Monocular Video
Sequence 411
Amit Kale, Aravind Sundaresan, Amit K. Roy-Chowdhury and Rama Chellappa
Chapter 4.3 Palmprint Authentication System 431
David Zhang
Chapter 4.4 Reconstruction of High-Resolution Facial Images for Visual
Surveillance 445
Jeong-Seon Park and Seong-Whan Lee
Chapter 4.5 Object Recognition with Deformable Feature Graphs: Faces,
Hands, and Cluttered Scenes 461
Jochen Triesch and Christian Eckes
Chapter 4.6 Hierarchical Classification and Feature Reduction for Fast Face
Detection 481
Bernd Heisele, Thomas Serre, Sam Prentice and Tomaso Poggio
Part 5. System and Technology 497
Chapter 5.1 Tracking and Classifying Moving Objects Using Single or
Multiple Cameras 499
Quming Zhou and J.K. Aggarwal
Chapter 5.2 Performance Evaluation of Image Segmentation Algorithms 525
Xiaoyi Jiang
Chapter 5.3 Contents-Based Video Analysis for Knowledge Discovery 543
Chia-Hung Yeh, Shih-Hung Lee and C.-C. Jay Kuo
Chapter 5.4 Object-Process Methodology and Its Applications to Image
Processing and Pattern Recognition 559
Dov Dori
Chapter 5.5 Musical Style Recognition - A Quantitative Approach 583
Peter van Kranenburg and Eric Backer
Chapter 5.6 Auto-Detector: Mobile Automatic Number Plate Recognition 601
Giovanni B. Garibotto
Chapter 5.7 Omnidirectional Vision 619
Hiroshi Ishiguro
Index 629
CHAPTER 1.1
STATISTICAL PATTERN RECOGNITION
R.P.W. Duin, D.M.J. Tax
Information and Communication Theory Group
Faculty of Electrical Engineering, Mathematics and Computer Science
Delft University of Technology
P.O.Box 5031, 2600 GA, Delft, The Netherlands
E-mail: {R.P.W.Duin, D.M.J.Tax}@ewi.tudelft.nl
A review is given of the area of statistical pattern recognition: the representation of objects and the design and evaluation of trainable systems for generalization. Traditional as well as more recently studied procedures are reviewed, such as the classical Bayes classifiers, neural networks, support vector machines, one-class classifiers and combining classifiers. Further, we introduce methods for feature reduction and error evaluation. New developments in statistical pattern recognition are briefly discussed.
1. Introduction
Statistical pattern recognition is the research area that studies statistical tools for
the generalization of sets of real world objects or phenomena. It thereby aims to
find procedures that answer questions like: does this new object fit into the pattern
of a given set of objects, or: to which of the patterns defined in a given set does it
fit best? The first question is related to cluster analysis, but is also discussed from
some perspective in this chapter. The second question is on pattern classification
and that is what will be the main concern here.
The overall structure of a pattern recognition system may be summarized as in
Figure 1. Objects have first to be appropriately represented before a generalization
can be derived. Depending on the demands of the procedures used for this the
representation has to be adapted, e.g. transformed, scaled or simplified.
The procedures discussed in this chapter are partially also studied in areas like statistical learning theory [32], machine learning [25] and neural networks [14]. As the
emphasis in pattern recognition is close to application areas, questions related to the
representation of the objects are important here: how are objects described (e.g. fea-
tures,
distances to prototypes), how extensive may this description be, what are the
ways to incorporate knowledge from the application domain? Representations have
to be adapted to fit the tools that are used later. Simplifications of representations
like feature reduction and prototype selection should thereby be considered.
In order to derive, from a training set, a classifier that is valid for new objects (i.e. that is able to generalize), the representation should fulfill an important condition: representations of similar real world objects have to be similar as well. The representations should be close. This is the so-called compactness hypothesis [2] on which the generalization from examples to new, unseen objects is built. It enables the estimation of their class labels on the basis of distances to examples or on class densities derived from examples.
Objects are traditionally represented by vectors in a feature space. An important recent development to incorporate domain knowledge is the representation of objects by their relation to other objects. This may be done by a so-called kernel method [29], derived from features, or directly on dissimilarities computed from the raw data [26].
We will assume that, after processing the raw measurements, objects are given in a $p$-dimensional vector space $\Omega$. Traditionally this space is spanned by $p$ features, but also the dissimilarities with $p$ prototype objects may be used. To simplify the discussion we will use the term feature space for both. If $K$ is the number of classes to be distinguished, a pattern classification system, or shortly a classifier $C(\mathbf{x})$, is a function or a procedure that assigns to each object $\mathbf{x}$ in $\Omega$ a class $\omega_c$, with $c = 1,\ldots,K$. Such a classifier has to be derived from a set of examples $X^{tr} = \{\mathbf{x}_i,\, i = 1,\ldots,N\}$ of known classes $y_i$. $X^{tr}$ will be called the training set and $y_i \in \{\omega_1,\ldots,\omega_K\}$ a label. Unless otherwise stated it is assumed that $y_i$ is unique (objects belong to just a single class) and is known for all objects in $X^{tr}$.
Fig. 1. The pattern recognition system: objects are represented (features, dissimilarities), the representation is adapted (feature extraction, prototype selection), a generalization is derived (classifiers, class models) and evaluated (class labels, confidences), with update paths back to the earlier stages (better sensors or measurement conditions, larger training sets).

In section 2 training procedures will be discussed to derive classifiers $C(\mathbf{x})$ from training sets. The performance of these classifiers is usually not just related to the quality of the features (their ability to show class differences) but also to their number, i.e. the dimensionality of the feature space. A growing number of features may increase the class separability, but may also decrease the statistical accuracy
of the training procedure. It is thereby important to have a small number of good
features. In section 3 a review is given of ways to reduce the number of features
by selection or by combination (so called feature extraction). The evaluation of
classifiers, discussed in section 4, is an important topic. As the characteristics of
new applications are often unknown beforehand, the best algorithms for feature reduction
and classification have to be found iteratively on the basis of unbiased and accurate
testing procedures.
This chapter builds further on earlier reviews of the area of statistical pattern recognition by Fukunaga [12] and by Jain et al. [16]. It is inevitable to repeat and summarize them partly. We will, however, also discuss some new directions like one-
summarize them partly. We will, however, also discuss some new directions like one-
class classifiers, combining classifiers, dissimilarity representations and techniques
for building good classifiers and reducing the feature space simultaneously. In the
last section of this chapter, the discussion, we will return to these new developments.
2. Classifiers
For the development of classifiers, we have to consider two main aspects: the basic
assumptions that the classifier makes about the data (which results in a functional
form of the classifier), and the optimization procedure to fit the model to the training
data. It is possible to consider very complex classifiers, but without efficient methods
to fit these classifiers to the data, they are not useful. Therefore, in many cases the
functional form of the classifier is restricted by the available optimization routines.
We will start discussing the two-class classification problem. In the first three
sections, 2.1, 2.2 and 2.3, the three basic approaches with their assumptions are
given: first, modeling the class posteriors, second, modeling class conditional prob-
abilities and finally modeling the classification boundary. In section 2.4 we discuss
how these approaches can be extended to work for more than two classes. In the
next section, the special case is considered where just one of the classes is reli-
ably sampled. The last section, 2.6, discusses the possibilities to combine several
(non-optimal) classifiers.
2.1. Bayes classifiers and approximations
A classifier should assign a new object $\mathbf{x}$ to the most likely class. In a probabilistic setting this means that the label of the class with the highest posterior probability should be chosen. This class can be found when $p(\omega_1|\mathbf{x})$ and $p(\omega_2|\mathbf{x})$ (for a two-class classification problem) are known. The classifier becomes:

$$\text{if } p(\omega_1|\mathbf{x}) > p(\omega_2|\mathbf{x}) \text{ assign object } \mathbf{x} \text{ to } \omega_1, \text{ otherwise to } \omega_2. \qquad (1)$$

When we assume that $p(\omega_1|\mathbf{x})$ and $p(\omega_2|\mathbf{x})$ are known, and further assume that misclassifying an object originating from $\omega_1$ to $\omega_2$ is as costly as vice versa, classifier (1) is the theoretically optimal classifier and will make the minimum error. This classifier is called the Bayes optimal classifier.
In practice $p(\omega_1|\mathbf{x})$ and $p(\omega_2|\mathbf{x})$ are not known, only samples $\mathbf{x}_i$ are available, and the misclassification costs might be only known in approximation. Therefore approximations to the Bayes optimal classifier have to be made. This classifier can be approximated in several different ways, depending on knowledge of the classification problem.

The first way is to approximate the class posterior probabilities $p(\omega_c|\mathbf{x})$. The logistic classifier assumes a particular model for the class posterior probabilities:

$$p(\omega_1|\mathbf{x}) = \frac{1}{1+\exp(-\mathbf{w}^T\mathbf{x})}, \qquad p(\omega_2|\mathbf{x}) = 1 - p(\omega_1|\mathbf{x}), \qquad (2)$$
where w is a p-dimensional weight vector. This basically implements a linear clas-
sifier in the feature space.
An approach to fit this logistic classifier (2) to training data $X^{tr}$ is to maximize the data likelihood $L$:

$$L = \prod_{i=1}^{N} p(\omega_1|\mathbf{x}_i)^{n_1(\mathbf{x}_i)}\, p(\omega_2|\mathbf{x}_i)^{n_2(\mathbf{x}_i)}, \qquad (3)$$

where $n_c(\mathbf{x})$ is 1 if object $\mathbf{x}$ belongs to class $\omega_c$, and 0 otherwise. This can be done by, for instance, an iterative gradient ascent method. Weights are iteratively updated using:

$$\mathbf{w}_{\mathrm{new}} = \mathbf{w}_{\mathrm{old}} + \eta\,\frac{\partial L}{\partial\mathbf{w}}, \qquad (4)$$

where $\eta$ is a suitably chosen learning rate parameter. In Ref. 1 the first (and second) derivative of $L$ with respect to $\mathbf{w}$ are derived for this and can be plugged into (4).
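As an illustration, a minimal NumPy sketch of this training scheme follows. It assumes labels encoded as 1 for $\omega_1$ and 0 for $\omega_2$, appends a bias term to the feature vectors (not made explicit in (2)), and ascends the log of the likelihood (3) rather than $L$ itself; the function names are our own.

```python
import numpy as np

def train_logistic(X, y, eta=0.1, n_iter=1000):
    """Gradient ascent on the log-likelihood of the logistic model (2)-(4).
    X: (N, p) objects, y: (N,) labels, 1 for omega_1 and 0 for omega_2."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias/threshold term
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p1 = 1.0 / (1.0 + np.exp(-Xb @ w))      # p(omega_1 | x_i), Eq. (2)
        w += eta * Xb.T @ (y - p1) / len(y)     # update rule, cf. Eq. (4)
    return w

def assign_logistic(w, X):
    """Assign to omega_1 (label 1) when p(omega_1 | x) > 0.5, cf. Eq. (1)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (1.0 / (1.0 + np.exp(-Xb @ w)) > 0.5).astype(int)
```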
2.2. Class densities and Bayes rule
Assumptions on $p(\omega|\mathbf{x})$ are often difficult to make. Sometimes it is more convenient to make assumptions on the class conditional probability densities $p(\mathbf{x}|\omega)$: they indicate the distribution of the objects which are drawn from one of the classes. When assumptions on these distributions can be made, classifier (1) can be derived using Bayes' decision rule:

$$p(\omega|\mathbf{x}) = \frac{p(\mathbf{x}|\omega)\,p(\omega)}{p(\mathbf{x})}. \qquad (5)$$

This rule basically rewrites the class posterior probabilities in terms of the class conditional probabilities and the class priors $p(\omega)$. This result can be substituted into (1), resulting in the following form:

$$\text{if } p(\mathbf{x}|\omega_1)p(\omega_1) > p(\mathbf{x}|\omega_2)p(\omega_2) \text{ assign } \mathbf{x} \text{ to } \omega_1, \text{ otherwise to } \omega_2. \qquad (6)$$

The term $p(\mathbf{x})$ is ignored because it is constant for a given $\mathbf{x}$. Any monotonically increasing function can be applied to both sides without changing the final decision. In some cases, a suitable choice will simplify the notation significantly. In particular,
using a logarithmic transformation can simplify the classifier when functions from
the exponential family are used.
For the special case of a two-class problem the classifiers can be rewritten in terms of a single discriminant function $f(\mathbf{x})$ which is the difference between the left hand side and the right hand side. A few possibilities are:

$$f(\mathbf{x}) = p(\omega_1|\mathbf{x}) - p(\omega_2|\mathbf{x}), \qquad (7)$$
$$f(\mathbf{x}) = p(\mathbf{x}|\omega_1)p(\omega_1) - p(\mathbf{x}|\omega_2)p(\omega_2), \qquad (8)$$
$$f(\mathbf{x}) = \ln\frac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)} + \ln\frac{p(\omega_1)}{p(\omega_2)}. \qquad (9)$$

The classifier becomes:

$$\text{if } f(\mathbf{x}) > 0 \text{ assign } \mathbf{x} \text{ to } \omega_1, \text{ otherwise to } \omega_2. \qquad (10)$$
In many cases fitting $p(\mathbf{x}|\omega)$ on training data is relatively straightforward. It is the standard density estimation problem: fit a density on a data sample. To estimate each $p(\mathbf{x}|\omega_c)$ only the objects from class $\omega_c$ are used.
Depending on the functional form of the class densities, different classifiers are constructed. One of the most common approaches is to assume a Gaussian density for each of the classes:

$$p(\mathbf{x}|\omega) = \mathcal{N}(\mathbf{x};\boldsymbol{\mu},\Sigma) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right), \qquad (11)$$

where $\boldsymbol{\mu}$ is the ($p$-dimensional) mean of the class $\omega$, and $\Sigma$ is the covariance matrix. Further, $|\Sigma|$ indicates the determinant of $\Sigma$ and $\Sigma^{-1}$ its inverse. For the explicit values of the parameters $\boldsymbol{\mu}$ and $\Sigma$ usually the maximum likelihood estimates are plugged in; therefore this classifier is called the plug-in Bayes classifier. Extra complications occur when the sample size $N$ is insufficient to (in particular) compute $\Sigma^{-1}$. In these cases a standard solution is to regularize the covariance matrix such that the inverse can be computed:

$$\Sigma_\lambda = \Sigma + \lambda I, \qquad (12)$$

where $I$ is the $p\times p$ identity matrix, and $\lambda$ is the regularization parameter that sets the trade-off between the estimated covariance matrix and the regularizer $I$.
Substituting (11) for each of the classes $\omega_1$ and $\omega_2$ (with their estimated $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$ and $\Sigma_1$, $\Sigma_2$) into (9) results in:

$$f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T\!\left(\Sigma_2^{-1}-\Sigma_1^{-1}\right)\mathbf{x} + \left(\boldsymbol{\mu}_1^T\Sigma_1^{-1}-\boldsymbol{\mu}_2^T\Sigma_2^{-1}\right)\mathbf{x} - \tfrac{1}{2}\boldsymbol{\mu}_1^T\Sigma_1^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^T\Sigma_2^{-1}\boldsymbol{\mu}_2 - \tfrac{1}{2}\ln|\Sigma_1| + \tfrac{1}{2}\ln|\Sigma_2| + \ln\frac{p(\omega_1)}{p(\omega_2)}. \qquad (13)$$

This classifier rule is quadratic in terms of $\mathbf{x}$, and it is therefore called the normal-based quadratic classifier.

For the quadratic classifier a full covariance matrix has to be estimated for each of the classes. In high dimensional feature spaces it can happen that insufficient data is available to estimate these covariance matrices reliably. By restricting the covariance matrices to have fewer free variables, estimations can become more reliable. One approach to reduce the number of parameters is to assume that both classes have an identical covariance structure: $\Sigma_1 = \Sigma_2 = \Sigma$. The classifier simplifies to:

$$f(\mathbf{x}) = (\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^T\Sigma^{-1}\mathbf{x} - \tfrac{1}{2}\boldsymbol{\mu}_1^T\Sigma^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^T\Sigma^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(\omega_1)}{p(\omega_2)}. \qquad (14)$$

Because this classifier is linear in terms of $\mathbf{x}$, it is called the normal-based linear classifier.
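A minimal NumPy sketch of the plug-in approach may make this concrete: class means and (optionally regularized) covariances are estimated as in (11)-(12), and the sign of the log-ratio (9) decides the class. The function names, and the use of log-densities (which lets the common $(2\pi)^{p/2}$ factor cancel), are our own choices.

```python
import numpy as np

def fit_class_gaussian(X, lam=0.0):
    """ML estimates of mu and Sigma for one class, regularized as in Eq. (12)."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False) + lam * np.eye(X.shape[1])
    return mu, Sigma

def quadratic_classifier(x, mu1, S1, mu2, S2, prior1=0.5, prior2=0.5):
    """f(x) of Eq. (13) via log class densities; f(x) > 0 assigns x to omega_1."""
    def log_density(x, mu, S):
        d = x - mu
        return -0.5 * d @ np.linalg.solve(S, d) - 0.5 * np.linalg.slogdet(S)[1]
    return (log_density(x, mu1, S1) - log_density(x, mu2, S2)
            + np.log(prior1 / prior2))
```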
For the linear and the quadratic classifier, strong class distributional assumptions are made: each class has a Gaussian distribution. In many applications this cannot be assumed, and more flexible class models have to be used. One possibility is to use a 'non-parametric' model. An example is the Parzen density model. Here the density is estimated by summing local kernels with a fixed size $h$ which are centered on each of the training objects:

$$p(\mathbf{x}|\omega) = \frac{1}{N}\sum_{i=1}^{N}\mathcal{N}(\mathbf{x};\mathbf{x}_i, hI), \qquad (15)$$

where $I$ is the identity matrix and $h$ is the width parameter which has to be optimized. By substituting (15) into (6), the Parzen classifier is defined. The only free parameter in this classifier is the size (or width) $h$ of the kernel. Optimizing this parameter by maximizing the likelihood on the training data will result in the solution $h = 0$. To avoid this, a leave-one-out procedure can be used [9].
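A sketch of the Parzen estimate (15) with spherical Gaussian kernels of covariance $hI$ is given below (our own helper; the leave-one-out choice of $h$ is not shown).

```python
import numpy as np

def parzen_density(x, X_class, h):
    """Parzen estimate of p(x | omega), Eq. (15): average of Gaussian kernels
    N(x; x_i, h*I) centered on the training objects of one class."""
    p = X_class.shape[1]
    d2 = np.sum((X_class - x) ** 2, axis=1)            # squared distances to x_i
    kernels = np.exp(-0.5 * d2 / h) / ((2 * np.pi * h) ** (p / 2))
    return kernels.mean()
```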
2.3. Boundary methods
Density estimation in high dimensional spaces is difficult. In order to have a reliable estimate, large amounts of training data should be available. Unfortunately, in many cases the number of training objects is limited. Therefore it is not always wise to estimate the class distributions completely. Looking at (1), (6) and (10), it is only of interest which class is to be preferred over the other. This problem is simpler than estimating $p(\mathbf{x}|\omega)$. For a two-class problem, just a function $f(\mathbf{x})$ is needed which is positive for objects of $\omega_1$ and negative otherwise. In this section we will list some classifiers which avoid estimating $p(\mathbf{x}|\omega)$ but try to obtain a suitable $f(\mathbf{x})$.
The Fisher classifier searches for a direction $\mathbf{w}$ in the feature space such that the two classes are separated as well as possible. The degree to which the two classes are separated is measured by the so-called Fisher ratio, or Fisher criterion:

$$J = \frac{|m_1 - m_2|^2}{s_1^2 + s_2^2}. \qquad (16)$$

Here $m_1$ and $m_2$ are the means of the two classes, projected onto the direction $\mathbf{w}$: $m_1 = \mathbf{w}^T\boldsymbol{\mu}_1$ and $m_2 = \mathbf{w}^T\boldsymbol{\mu}_2$. The $s_1^2$ and $s_2^2$ are the variances of the two classes projected onto $\mathbf{w}$. The criterion therefore favors directions in which the means are far apart and the variances are small.
This Fisher ratio can be explicitly rewritten in terms of $\mathbf{w}$. First we rewrite

$$s_c^2 = \sum_{\mathbf{x}\in\omega_c}\left(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\boldsymbol{\mu}_c\right)^2 = \sum_{\mathbf{x}\in\omega_c}\mathbf{w}^T(\mathbf{x}-\boldsymbol{\mu}_c)(\mathbf{x}-\boldsymbol{\mu}_c)^T\mathbf{w} = \mathbf{w}^T S_c\,\mathbf{w}.$$

Second we write

$$(m_1-m_2)^2 = \left(\mathbf{w}^T\boldsymbol{\mu}_1 - \mathbf{w}^T\boldsymbol{\mu}_2\right)^2 = \mathbf{w}^T(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^T\mathbf{w} = \mathbf{w}^T S_B\,\mathbf{w}.$$

The term $S_B$ is also called the between scatter matrix. $J$ becomes:

$$J = \frac{|m_1-m_2|^2}{s_1^2+s_2^2} = \frac{\mathbf{w}^T S_B\mathbf{w}}{\mathbf{w}^T S_1\mathbf{w} + \mathbf{w}^T S_2\mathbf{w}} = \frac{\mathbf{w}^T S_B\mathbf{w}}{\mathbf{w}^T S_W\mathbf{w}}, \qquad (17)$$

where $S_W = S_1 + S_2$ is also called the within scatter matrix.

In order to optimize (17), we set the derivative of (17) to zero and obtain:

$$(\mathbf{w}^T S_B\mathbf{w})\,S_W\mathbf{w} = (\mathbf{w}^T S_W\mathbf{w})\,S_B\mathbf{w}. \qquad (18)$$

We are interested in the direction of $\mathbf{w}$ and not in its length, so we drop the scalar terms between brackets. Further, from the definition of $S_B$ it follows that $S_B\mathbf{w}$ is always in the direction $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$. Multiplying both sides of (18) by $S_W^{-1}$ gives:

$$\mathbf{w} \sim S_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2). \qquad (19)$$

This classifier is known as the Fisher classifier. Note that the threshold $b$ is not defined for this classifier. It is also linear and requires the inversion of the within-scatter matrix $S_W$. This formulation yields an identical shape of $\mathbf{w}$ as the expression in (14), although the classifiers use very different starting assumptions!
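In code, (19) is a single linear solve. A small NumPy sketch follows; the threshold halfway between the projected class means is a common but not prescribed choice, and the function name is ours.

```python
import numpy as np

def fisher_classifier(X1, X2):
    """Fisher direction w ~ S_W^{-1}(mu_1 - mu_2), Eq. (19), for two classes."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)           # class scatter matrices
    S2 = (X2 - mu2).T @ (X2 - mu2)
    w = np.linalg.solve(S1 + S2, mu1 - mu2)  # within-scatter S_W = S_1 + S_2
    b = -0.5 * w @ (mu1 + mu2)               # one possible threshold choice
    return w, b                              # assign x to omega_1 if w @ x + b > 0
```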
Most classifiers which have been discussed so far have a very restricted form of their decision boundary. In many cases these boundaries are not flexible enough to follow the true decision boundaries. A flexible method is the k-nearest neighbor rule. This classifier looks locally which labels are most dominant in the training set. First it finds the $k$ nearest objects to $\mathbf{x}$ in the training set, and then counts how many of these neighbors, $n_1$ and $n_2$, are from class $\omega_1$ or $\omega_2$:

$$\text{if } n_1 > n_2 \text{ assign } \mathbf{x} \text{ to } \omega_1, \text{ otherwise to } \omega_2. \qquad (20)$$
Although the training of the k-nearest neighbor classifier is trivial (it only has to store all training objects; $k$ can simply be optimized by a leave-one-out estimation), it may become expensive to classify a new object $\mathbf{x}$. For this the distances to all training objects have to be computed, which may be prohibitive for large training sets and high dimensional feature spaces.
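A direct (brute-force) sketch of the rule (20), assuming Euclidean distances and labels 1 and 2 for $\omega_1$ and $\omega_2$:

```python
import numpy as np

def knn_assign(x, X_train, y_train, k=3):
    """k-nearest neighbor rule, Eq. (20): majority label among the k closest
    training objects (labels 1 or 2)."""
    d2 = np.sum((X_train - x) ** 2, axis=1)   # squared distances to all objects
    nearest = np.argsort(d2)[:k]
    n1 = np.sum(y_train[nearest] == 1)
    return 1 if n1 > k - n1 else 2
```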
Another classifier which is flexible but does not require the storage of the full training set is the multi-layered feed-forward neural network [4]. A neural network is a collection of small processing units, called the neurons, which are interconnected by weights $\mathbf{w}$ and $\mathbf{v}$ to form a network. A schematic picture is shown in Figure 2. An input object $\mathbf{x}$ is processed through different layers of neurons, through the hidden layer to the output layer. The output of the $j$-th output neuron becomes:

$$o_j(\mathbf{x}) = \frac{1}{1+\exp(-\mathbf{v}_j^T\mathbf{h}(\mathbf{x}))}, \qquad (21)$$

(see Figure 2 for the meaning of the variables), where $\mathbf{h}(\mathbf{x})$ denotes the vector of hidden-neuron outputs, each obtained from $\mathbf{x}$ through the first-layer weights $\mathbf{w}$ and the same sigmoidal function. The object $\mathbf{x}$ is now assigned to the class $j$ for which the corresponding output neuron has the highest output $o_j$.
Fig. 2. Schematic picture of a neural network with inputs $x_1,\ldots,x_p$.
To optimize this neural network, the squared error between the network output and the desired class label is defined:

$$E = \sum_{i=1}^{N}\sum_{j=1}^{K}\left(n_j(\mathbf{x}_i) - o_j(\mathbf{x}_i)\right)^2, \qquad (22)$$

where $n_j(\mathbf{x})$ is 1 if object $\mathbf{x}$ belongs to class $\omega_j$, and 0 otherwise. To simplify the notation, we will combine all the weights $\mathbf{w}$ and $\mathbf{v}$ into one weight vector $\mathbf{w}$.

This error $E$ is a continuous function of the weights $\mathbf{w}$, and the derivative of $E$ with respect to these weights can easily be calculated. The weights of the neural network can therefore be optimized to minimize the error by gradient descent, analogous to (4):

$$\mathbf{w}_{\mathrm{new}} = \mathbf{w}_{\mathrm{old}} - \eta\,\frac{\partial E}{\partial\mathbf{w}}, \qquad (23)$$

where $\eta$ is the learning parameter. After expanding this learning rule (23), it appears that the weight updates for each layer of neurons can be computed by back-propagating the error which is computed at the output of the network, $(n_j(\mathbf{x}_i) - o_j(\mathbf{x}_i))$. This is therefore called the back-propagation update rule.
The advantage of this type of neural networks is that they are flexible, and that they can be trained using these update rules. The disadvantages are that there are many important parameters to be chosen beforehand (the number of layers, the number of neurons per layer, the learning rate, the number of training updates, etc.), and that the optimization can be extremely slow. To increase the training speed, several additions and extensions have been proposed, for instance the inclusion of momentum terms in (23), or the use of second order moments.
Neural networks can easily be overtrained. Many heuristic techniques have been developed to decrease the chance of overtraining. One of the methods is to use weight decay, in which an extra regularization term is added to equation (22). This regularization term often has the form

$$\tilde{E} = E + \lambda\|\mathbf{w}\|^2, \qquad (24)$$

and tries to reduce the size of the individual weights in the network. By restricting the size of the weights, the network will adjust less to the noise in the data sample and become less complex. The regularization parameter $\lambda$ regulates the trade-off between the classification error $E$ and the classifier complexity. When the size of the network (in terms of the number of neurons) is also chosen carefully, good performances can be achieved by the neural network.
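A compact back-propagation sketch for a single hidden layer, minimizing the squared error (22) with the weight-decay term (24) by gradient descent (23), is given below. The layer sizes, the sigmoid transfer functions and the absorption of constant factors of 2 into the learning rate are our own simplifications.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp(X, T, n_hidden=5, eta=0.1, lam=1e-3, n_iter=2000, seed=0):
    """Back-propagation with weight decay; X: (N, p) inputs, T: (N, K) matrix
    of targets n_j(x_i). Returns input->hidden weights W and hidden->output V."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))
    V = rng.normal(scale=0.1, size=(n_hidden, T.shape[1]))
    for _ in range(n_iter):
        H = sigmoid(X @ W)                 # hidden-layer outputs
        O = sigmoid(H @ V)                 # network outputs o_j(x_i)
        dO = (O - T) * O * (1 - O)         # output error (factor 2 absorbed in eta)
        dH = (dO @ V.T) * H * (1 - H)      # error back-propagated to the hidden layer
        V -= eta * (H.T @ dO + lam * V)    # gradient step, Eq. (23), plus decay
        W -= eta * (X.T @ dH + lam * W)
    return W, V
```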
A similar approach is chosen for the support vector classifier [32]. The most basic version is just a linear classifier as in Eq. (10) with

$$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b. \qquad (25)$$

The minimum distance from the training objects to the classifier is thereby maximized. This gives the classifier some robustness against noise in the data, such that it will generalize well for new data. It appears that this maximum margin $\rho$ is inversely related to $\|\mathbf{w}\|$, such that maximizing this margin means minimizing $\|\mathbf{w}\|^2$ (taking into account the constraints that all the objects are correctly classified).
Fig. 3. Schematic picture of a support vector classifier: the decision boundary $\mathbf{w}^T\mathbf{x} + b = 0$ and the margin $\rho$; objects far from the boundary have $\alpha_i = 0$.
Given linearly separable data, the linear classifier is found which has the largest margin $\rho$ to each of the classes. To allow for some errors in the classification, slack variables $\xi_i$ are introduced to weaken the hard constraints. The error to minimize for the support vector classifier therefore consists of two parts: the complexity of the classifier in terms of $\mathbf{w}^T\mathbf{w}$, and the number of classification errors, measured by $\sum_i\xi_i$. The optimization can be stated by the following mathematical formulation:

$$\min_{\mathbf{w}}\;\mathbf{w}^T\mathbf{w} + C\sum_i\xi_i, \qquad (26)$$

such that

$$\mathbf{w}^T\mathbf{x}_i + b \ge 1 - \xi_i \ \text{ if } \mathbf{x}_i\in\omega_1, \qquad \mathbf{w}^T\mathbf{x}_i + b \le -1 + \xi_i \ \text{ otherwise}. \qquad (27)$$

Parameter $C$ determines the trade-off between the complexity of the classifier, as measured by $\mathbf{w}^T\mathbf{w}$, and the number of classification errors.
Although the basic version of the support vector classifier is a linear classifier, it can be made much more powerful by the introduction of kernels. When the constraints (27) are incorporated into (26) by the use of Lagrange multipliers $\alpha_i$, this error can be rewritten in the so-called dual form. For this, we define the labels $y_i$, where $y_i = 1$ when $\mathbf{x}_i\in\omega_1$ and $y_i = -1$ otherwise. The optimization becomes:

$$\max_{\boldsymbol{\alpha}}\;\mathbf{1}^T\boldsymbol{\alpha} - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j\,y_i y_j\,\mathbf{x}_i^T\mathbf{x}_j, \qquad \text{s.t. } \mathbf{y}^T\boldsymbol{\alpha} = 0,\; 0\le\alpha_i\le C\ \forall i, \qquad (28)$$

with $\mathbf{w} = \sum_i\alpha_i y_i\mathbf{x}_i$. Due to the constraints in (28) the optimization is not trivial, but standard software packages exist which can solve this quadratic programming problem. It appears that in the optimal solution of (28) many of the $\alpha_i$ become 0. Therefore only a few $\alpha_i\ne 0$ determine $\mathbf{w}$. The corresponding objects $\mathbf{x}_i$ are called the support vectors. All other objects in the training set can be ignored.

The special feature of this formulation is that both the classifier $f(\mathbf{x})$ and the error (28) are completely stated in terms of inner products between objects, $\mathbf{x}_i^T\mathbf{x}_j$. This means that the classifier does not explicitly depend on the features of the objects. It depends on the similarity between the object $\mathbf{x}$ and the support vectors $\mathbf{x}_i$, measured by the inner product $\mathbf{x}^T\mathbf{x}_i$. By replacing the inner product by another similarity, defined by the kernel function $K(\mathbf{x},\mathbf{x}_i)$, other non-linear classifiers are obtained. One of the most popular kernel functions is the Gaussian kernel:

$$K(\mathbf{x},\mathbf{x}_i) = \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{x}_i\|^2}{\sigma^2}\right), \qquad (29)$$

where $\sigma$ is still a free parameter.
The drawback of the support vector classifier is that it requires the solution of a large quadratic programming problem (28), and that suitable settings for the parameters $C$ and $\sigma$ have to be found. On the other hand, when $C$ and $\sigma$ are optimized, the performance of this classifier is often very competitive. Another advantage of this classifier is that it offers the possibility to encode problem specific knowledge in the kernel function $K$. In particular for problems where a good feature representation is hard to derive (for instance in the classification of shapes or text documents) this can be important.
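Once $\boldsymbol{\alpha}$ and $b$ have been obtained from (28) (e.g. by a quadratic programming package), classification only needs the support vectors. A sketch with the Gaussian kernel (29) follows; the scaling of the exponent and all function names are our assumptions.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """Gaussian kernel of Eq. (29) between the rows of X1 and X2."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma**2)

def svm_decision(x, X_sv, y_sv, alpha, b, sigma=1.0):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b over the support vectors;
    f(x) > 0 assigns x to omega_1 (y = +1), cf. Eq. (10)."""
    k = gaussian_kernel(X_sv, x[None, :], sigma).ravel()
    return float(np.sum(alpha * y_sv * k) + b)
```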
2.4. Multi-class classifiers
In the previous section we focused on two-class classification problems. This simplifies the formulation and notation of the classifiers. Many classifiers can trivially be extended to multi-class problems. For instance the Bayes classifier (1) becomes:

$$\text{assign } \mathbf{x} \text{ to } \omega_{c^*}, \qquad c^* = \arg\max_c\, p(\omega_c|\mathbf{x}). \qquad (30)$$

Most of the classifiers directly follow from this. Only the boundary methods which were constructed to explicitly distinguish between two classes, for instance the
Fisher classifier or the support vector classifier, cannot be trivially extended. For
these classifiers several combining techniques are available. The two main ap-
proaches to decompose a multi-class problem into a set of two-class problems are:
(1) one-against-all: train $K$ classifiers between one of the classes and all others,
(2) one-against-one: train $K(K-1)/2$ classifiers to distinguish all pairwise classes.
Afterward the classifiers have to be combined using classification confidences (posterior probabilities) or by majority voting. A more advanced approach is to use Error-Correcting Output Codes (ECOC), where classifiers are trained to distinguish specific combinations of classes, but are allowed to ignore others [7]. The classes are chosen such that a redundant output labeling appears, and possible classification errors can be fixed.
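As an example of the first decomposition, a one-against-all sketch that assumes each trained binary classifier returns a real-valued confidence for its own class against the rest:

```python
def one_against_all(x, binary_classifiers):
    """binary_classifiers[c](x) returns a confidence for class c versus the rest;
    the class with the largest confidence is chosen."""
    scores = [f(x) for f in binary_classifiers]
    return max(range(len(scores)), key=lambda c: scores[c])
```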
2.5. One-class classifiers
A fundamental assumption in all previous discussions is that a representative training set $X^{tr}$ is available. That means that examples from both classes are present, sampled according to their class priors. In some applications one of the classes might contain diverse objects, or its objects are difficult or expensive to measure. This happens for instance in machine diagnostics or in medical applications. A sufficient number of representative examples from the class of ill patients or the class of faulty machines is sometimes hard to collect. In these cases one cannot rely on a representative dataset to train a classifier, and a so-called one-class classifier [30] may be used.
Fig. 4. One-class classifier example.
In one-class classifiers, it is assumed that we have examples from just one of the classes, called the target class. From all other possible objects, per definition the outlier objects, no examples are available during training. When it is assumed that the outliers are uniformly distributed around the target class, the classifier should circumscribe the target objects as tightly as possible in order to minimize the chance of accepting outlier objects.
In general, the problem of one-class classification is harder than the problem
of conventional two-class classification. In conventional classification problems the
decision boundary is supported from both sides by examples of both classes. Because
in the case of one-class classification only one set of data is available, only one side
of the boundary is supported. It is therefore hard to decide, on the basis of just one
class,
how strictly the boundary should fit around the data in each of the feature
directions. In order to have a good distinction between the target objects and the
outliers, a good representation of the data is essential.
Approaches similar to standard two-class classification can be used here. Us-
ing the uniform outlier distribution assumption, the class posteriors can be esti-
mated and the class conditional distributions or direct boundary methods can be
constructed. For high dimensional spaces the density estimators suffer and often
boundary methods are to be preferred.
2.6. Combining classifiers
In practice it is hard to find (and train) a classifier which fits the data distribution sufficiently well. The model can be difficult to construct (by the user), too hard to optimize, or insufficient training data is available for training. In these cases it can be very beneficial to combine several "weak" classifiers in order to boost the classification performance [21]. It is hoped that each individual classifier will focus on different aspects of the data and err on different objects. Combining the set of so-called base classifiers will then complement their weak areas.
Fig. 5. Combining classifiers: several base classifiers operate on the feature space and their outputs (e.g. confidences) are fed into a combining classifier that produces the final classification.
The most basic combining approach is to train several different types of classifiers on the same dataset and combine their outputs. One has to realize that classifiers can only correct each other when their outputs vary, i.e. the set of classifiers is diverse [22]. It appears therefore to be more advantageous to combine classifiers which were trained on objects represented by different features. Another approach to force classifiers to become diverse is to artificially change the training set by resampling (resulting in a bagging [6] or a boosting [8] approach).
The outputs of the classifiers can be combined using several combining rules [18], depending on the type of classifier outputs. If the classifiers provide crisp output labels, a voting combining rule has to be used. When real valued outputs are available, they can be averaged, weighted averaged or multiplied, the maximum or minimum output can be taken, or even an output classifier can be trained. If fixed (i.e. not trained) rules are used, it is important that the output of a classifier is properly scaled. Using a trainable combining rule, this constraint can be alleviated, but clearly training data is required to optimize this combining rule [10].
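A sketch of a few fixed combining rules applied to properly scaled base-classifier outputs (a matrix with one row per base classifier and one column per class); the interface is our assumption, and the rule names follow the list above.

```python
import numpy as np

def combine_fixed(outputs, rule="mean"):
    """Fixed combining of an (n_classifiers, n_classes) array of scaled outputs
    (e.g. posterior estimates); returns the index of the winning class."""
    if rule == "mean":
        combined = outputs.mean(axis=0)
    elif rule == "product":
        combined = outputs.prod(axis=0)
    elif rule == "max":
        combined = outputs.max(axis=0)
    elif rule == "vote":                     # majority vote on crisp decisions
        combined = np.bincount(outputs.argmax(axis=1),
                               minlength=outputs.shape[1])
    else:
        raise ValueError(f"unknown rule: {rule}")
    return int(np.argmax(combined))
```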
3. Feature reduction
In many classification problems it is unclear what features have to be taken into account. Often a large set of potentially useful features is collected, and by feature reduction the $k$ most suitable features are chosen. Often the distinction between feature selection and feature extraction is made. In selection, only a subset of the original features is chosen. The advantage is that in the final application just a few features have to be measured. The disadvantage is that the selection of the appropriate subset is an expensive search. In extraction new features are derived from the original features. Often all original features are used, and no reduction is obtained in the number of measurements. But in many cases the optimization is easier. In Section 3.1 we will discuss several evaluation criteria, then in Section 3.2 feature selection and finally in Section 3.3 feature extraction.
3.1. Feature set evaluation criteria
In order to evaluate a feature set, a criterion $J$ has to be defined. Because feature reduction is often applied in classification, the most obvious criterion is the performance of the classifier. Unfortunately, the optimization of a classifier is often hard. Other evaluation criteria might be a cheaper approximation to this classification performance. Therefore approximate criteria are used, measuring the distance or dissimilarity between distributions, or even ignoring the class labels and just focusing on unsupervised characteristics.

Some typical evaluation criteria are listed in Table 1. The most simple ones use the scatter matrices characterizing the scatter within classes (showing how samples scatter around their class mean vector, called $S_W$, the within scatter) and the scatter between the clusters (showing how the means of the clusters scatter, $S_B$, the between scatter matrix; see also the discussion of the Fisher ratio in section 2.3). These scatter matrices can be combined using several functions, listed in the first part of Table 1. Often $S_1 = S_B$ is used, and $S_2 = S_W$ or $S_2 = S_W + S_B$.

The measures between distributions involve the class distributions $p(\mathbf{x}|\omega_i)$, and in practice often single Gaussian distributions for each of the classes are chosen. The reconstruction errors still contain free parameters in the form of a matrix of basis vectors $W$ or a set of prototypes $\boldsymbol{\mu}_k$. These are optimized in their respective
procedures, like the Principal Component Analysis or Self-Organizing Maps. These
scatter criteria and the supervised measures between distributions are mainly used
in the feature selection, Section 3.2. The unsupervised reconstruction errors are used
in feature extraction, Section 3.3.
Table 1. Feature selection criteria for measuring the difference between two distributions or for measuring a reconstruction error.

Measures using scatter matrices (for the choice of $S_1$ and $S_2$ see text):
  $J = \mathrm{tr}(S_2^{-1}S_1)$
  $J = \ln|S_2^{-1}S_1|$
  $J = \mathrm{tr}(S_1)/\mathrm{tr}(S_2)$

Measures between distributions:
  Kolmogorov:         $J = \int |p(\omega_1|\mathbf{x}) - p(\omega_2|\mathbf{x})|\,p(\mathbf{x})\,d\mathbf{x}$
  Average separation: $J = \frac{1}{n_1 n_2}\sum_{\mathbf{x}_i\in\omega_1}\sum_{\mathbf{x}_j\in\omega_2}\|\mathbf{x}_i-\mathbf{x}_j\|$
  Divergence:         $J = \int \left(p(\mathbf{x}|\omega_1)-p(\mathbf{x}|\omega_2)\right)\ln\frac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)}\,d\mathbf{x}$
  Chernoff:           $J = -\log\int p^{s}(\mathbf{x}|\omega_1)\,p^{1-s}(\mathbf{x}|\omega_2)\,d\mathbf{x}$
  Fisher:             $J = \sqrt{\int\left(p(\mathbf{x}|\omega_1)p(\omega_1)-p(\mathbf{x}|\omega_2)p(\omega_2)\right)^2 d\mathbf{x}}$

Reconstruction errors:
  PCA: $E = \|\mathbf{x} - W(W^TW)^{-1}W^T\mathbf{x}\|^2$
  SOM: $E = \min_k \|\mathbf{x} - \boldsymbol{\mu}_k\|^2$
3.2. Feature selection
In feature selection a subset of the original features is chosen. A feature reduction procedure consists of two ingredients: the first is the evaluation criterion to evaluate a given set of features, the second is a search strategy to search over all possible feature subsets [16]. Exhaustive search is in many applications not feasible. When we start with 250 features and we want to select 10, we have to consider in principle $\binom{250}{10} \approx 2\cdot 10^{17}$ different subsets, which is clearly too many.
Instead of exhaustive search, a forward selection can be applied. It starts with the single best feature (according to the evaluation criterion) and adds the feature which gives the biggest improvement in performance. This is repeated till the requested number of features $k$ is reached. Instead of forward selection, the opposite approach can be used: backward selection. This starts with the complete set of features and removes the feature for which the performance increase is the largest. These approaches have the significant drawback that they might miss the optimal subsets. These are the subsets for which the individual features have poor discriminability but which combined give a very good performance. In order to find these subsets, a more advanced search strategy is required. It can be a floating search where adding and removing features is alternated. Another approach is the branch-and-bound algorithm [12], where all subsets of features are arranged in a search tree. This tree is traversed in such an order that large sub-branches can be disregarded as soon as possible, and the search process is shortened significantly. This strategy will yield the optimal subset when the evaluation criterion $J$ is monotone, which means that when for a certain feature set a value of $J$ is obtained, a subset of those features cannot attain a higher value of $J$. Criteria like the Bayes error, the Chernoff distance or the functions on the scatter matrices fulfill this.
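A minimal sketch of the greedy forward selection strategy with a generic evaluation criterion J (any of the criteria of Table 1, or a classifier error estimate); the interface is our assumption.

```python
def forward_selection(J, all_features, k):
    """Greedy forward selection: J(subset) scores a feature subset (larger is
    better); features are added one at a time until k are selected."""
    selected, remaining = [], list(all_features)
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```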
Currently, other approaches appear which combine the traditional feature selection and the subsequent training of a classifier. One example is a linear classifier (with the functional form of (25)) called LASSO, the Least Absolute Shrinkage and Selection Operator [31]. The classification problem is approached as a regression problem with an additional regularization. A linear function is fitted to the data by minimizing the following error:

$$\min_{\mathbf{w},b}\;\sum_{i=1}^{n}\left(y_i - \mathbf{w}^T\mathbf{x}_i - b\right)^2 + C\|\mathbf{w}\|_1. \qquad (31)$$

The first part defines the deviation of the linear function $\mathbf{w}^T\mathbf{x}_i + b$ from the expected label $y_i$. The second part shrinks the weights $\mathbf{w}$, such that many of them become zero. By choosing a suitable value for $C$, the number of retained features can be changed. This kind of regularization appears to be very effective when the number of features is huge (in the thousands) and the training size is small (in the tens). A similar solution can be obtained when the term $\mathbf{w}^T\mathbf{w}$ in (26) is replaced by $|\mathbf{w}|$ [3].
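The chapter does not prescribe an optimizer for (31); one simple possibility is proximal gradient descent (iterative soft-thresholding), sketched below under the assumption of an L1 penalty on the weights and an unpenalized intercept.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_fit(X, y, C=1.0, n_iter=5000):
    """Proximal-gradient (ISTA) sketch for Eq. (31):
    sum_i (y_i - w^T x_i - b)^2 + C * ||w||_1."""
    Xb = np.hstack([X, np.ones((len(X), 1))])      # last column carries the intercept b
    step = 1.0 / (2 * np.linalg.norm(Xb, 2) ** 2)  # step below the Lipschitz bound
    wb = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        grad = -2 * Xb.T @ (y - Xb @ wb)           # gradient of the squared error
        wb = wb - step * grad
        wb[:-1] = soft_threshold(wb[:-1], step * C)  # shrink only the feature weights
    return wb[:-1], wb[-1]                         # (w, b); zeros in w drop features
```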
3.3. Feature extraction
Instead of using a subset of the given features, a smaller set of new features may be derived from the old ones. This can be done by linear or nonlinear feature extraction. For the computation of new features usually all original features are used. Feature extraction will therefore almost never reduce the amount of measurements. The optimization criteria are often based on reconstruction errors as in Table 1.

The most well-known linear extraction method is Principal Component Analysis (PCA) [17]. Each new feature $x'_i$ is a linear combination of the original features: $x'_i = \mathbf{w}_i^T\mathbf{x}$. The new features are optimized to minimize the PCA mean squared reconstruction error of Table 1. PCA basically extracts the directions $\mathbf{w}_i$ in which the data set shows the highest variance. These directions appear to be equivalent to the eigenvectors of the (estimated) covariance matrix $\Sigma$ with the largest eigenvalues. For the $i$-th principal component $\mathbf{w}_i$ it therefore holds that:

$$\Sigma\mathbf{w}_i = \lambda_i\mathbf{w}_i, \qquad \lambda_i \ge \lambda_j \ \text{ if } i < j. \qquad (32)$$
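A NumPy sketch of (32): center the data, estimate the covariance matrix and keep the eigenvectors with the largest eigenvalues (names and return values are our choice).

```python
import numpy as np

def pca(X, n_components):
    """PCA, Eq. (32): eigenvectors of the estimated covariance matrix with the
    largest eigenvalues; returns the directions W and the projected data."""
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(Sigma)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]    # pick the largest
    W = eigvecs[:, order]
    return W, Xc @ W
```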
An extension of the (linear) PCA is the kernelized version, kernel-PCA [24]. Here the standard covariance matrix $\Sigma$ is replaced by a covariance matrix in a feature space. After rewriting, the eigenvalue problem in the feature space reduces to the