
Introduction to Statistical Pattern Recognition
Second Edition

Keinosuke Fukunaga
This completely revised second edition presents an introduction to statistical pattern recognition. Pattern recognition in general covers a wide range of problems: it is applied to engineering problems, such as character readers and waveform analysis, as well as to brain modeling in biology and psychology. Statistical decision and estimation, which are the main subjects of this book, are regarded as fundamental to the study of pattern recognition. This book is appropriate as a text for introductory courses in pattern recognition and as a reference book for people who work in the field. Each chapter also contains computer projects as well as exercises.


This is a volume in
COMPUTER SCIENCE AND SCIENTIFIC COMPUTING
Editor: WERNER RHEINBOLDT

Introduction to Statistical Pattern Recognition
Second Edition

Keinosuke Fukunaga
School of Electrical Engineering
Purdue University
West Lafayette, Indiana
Morgan Kaufmann is an imprint of Academic Press
A Harcourt Science and Technology Company

San Diego  San Francisco  New York  Boston  London  Sydney  Tokyo
This book is printed on acid-free paper.

Copyright © 1990 by Academic Press

All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher.

ACADEMIC PRESS
A Harcourt Science and Technology Company
525 B Street, Suite 1900, San Diego, CA 92101-4495, USA
http://www.academicpress.com
Academic Press
24-28 Oval Road, London NW1 7DX, United Kingdom
http://www.hbuk.co.uk/ap/
Morgan Kaufmann
340 Pine Street, Sixth Floor, San Francisco, CA 94104-3205

Library of Congress Cataloging-in-Publication Data

Fukunaga, Keinosuke.
    Introduction to statistical pattern recognition / Keinosuke Fukunaga. - 2nd ed.
        p. cm.
    Includes bibliographical references.
    ISBN 0-12-269851-7
    1. Pattern perception - Statistical methods. 2. Decision-making - Mathematical models. 3. Mathematical statistics. I. Title.
    Q327.F85 1990
    006.4 - dc20        89-18195
                        CIP

PRINTED IN THE UNITED STATES OF AMERICA

To Reiko, Gen, and Nina

Contents

Preface  xi
Acknowledgments  xiii

Chapter 1  Introduction  1
    1.1  Formulation of Pattern Recognition Problems  1
    1.2  Process of Classifier Design  7
    Notation  9
    References  10

Chapter 2  Random Vectors and Their Properties  11
    2.1  Random Vectors and Their Distributions  11
    2.2  Estimation of Parameters  17
    2.3  Linear Transformation  24
    2.4  Various Properties of Eigenvalues and Eigenvectors  35
    Computer Projects  47
    Problems  48
    References  50

Chapter 3  Hypothesis Testing  51
    3.1  Hypothesis Tests for Two Classes  51
    3.2  Other Hypothesis Tests  65
    3.3  Error Probability in Hypothesis Testing  85
    3.4  Upper Bounds on the Bayes Error  97
    3.5  Sequential Hypothesis Testing  110
    Computer Projects  119
    Problems  120
    References  122

Chapter 4  Parametric Classifiers  124
    4.1  The Bayes Linear Classifier  125
    4.2  Linear Classifier Design  131
    4.3  Quadratic Classifier Design  153
    4.4  Other Classifiers  169
    Computer Projects  176
    Problems  177
    References  180

Chapter 5  Parameter Estimation  181
    5.1  Effect of Sample Size in Estimation  182
    5.2  Estimation of Classification Errors  196
    5.3  Holdout, Leave-One-Out, and Resubstitution Methods  219
    5.4  Bootstrap Methods  238
    Computer Projects  250
    Problems  250
    References  252

Chapter 6  Nonparametric Density Estimation  254
    6.1  Parzen Density Estimate  255
    6.2  k-Nearest Neighbor Density Estimate  268
    6.3  Expansion by Basis Functions  287
    Computer Projects  295
    Problems  296
    References  297

Chapter 7  Nonparametric Classification and Error Estimation  300
    7.1  General Discussion  301
    7.2  Voting kNN Procedure - Asymptotic Analysis  305
    7.3  Voting kNN Procedure - Finite Sample Analysis  313
    7.4  Error Estimation  322
    7.5  Miscellaneous Topics in the kNN Approach  351
    Computer Projects  362
    Problems  363
    References  364

Chapter 8  Successive Parameter Estimation  367
    8.1  Successive Adjustment of a Linear Classifier  367
    8.2  Stochastic Approximation  375
    8.3  Successive Bayes Estimation  389
    Computer Projects  395
    Problems  396
    References  397

Chapter 9  Feature Extraction and Linear Mapping for Signal Representation  399
    9.1  The Discrete Karhunen-Loève Expansion  400
    9.2  The Karhunen-Loève Expansion for Random Processes  417
    9.3  Estimation of Eigenvalues and Eigenvectors  425
    Computer Projects  435
    Problems  438
    References  440

Chapter 10  Feature Extraction and Linear Mapping for Classification  441
    10.1  General Problem Formulation  442
    10.2  Discriminant Analysis  445
    10.3  Generalized Criteria  460
    10.4  Nonparametric Discriminant Analysis  466
    10.5  Sequential Selection of Quadratic Features  480
    10.6  Feature Subset Selection  489
    Computer Projects  503
    Problems  504
    References  506

Chapter 11  Clustering  508
    11.1  Parametric Clustering  509
    11.2  Nonparametric Clustering  533
    11.3  Selection of Representatives  549
    Computer Projects  559
    Problems  560
    References  562

Appendix A  Derivatives of Matrices  564
Appendix B  Mathematical Formulas  572
Appendix C  Normal Error Table  576
Appendix D  Gamma Function Table  578
Index  579
Preface

This book presents an introduction to statistical pattern recognition. Pattern recognition in general covers a wide range of problems, and it is hard to find a unified view or approach. It is applied to engineering problems, such as character readers and waveform analysis, as well as to brain modeling in biology and psychology. However, statistical decision and estimation, which are the subjects of this book, are regarded as fundamental to the study of pattern recognition. Statistical decision and estimation are covered in various texts on mathematical statistics, statistical communication, control theory, and so on. But obviously each field has a different need and view. So that workers in pattern recognition need not look from one book to another, this book is organized to provide the basics of these statistical concepts from the viewpoint of pattern recognition.

The material of this book has been taught in a graduate course at Purdue University and also in short courses offered in a number of locations. Therefore, it is the author's hope that this book will serve as a text for introductory courses of pattern recognition as well as a reference book for the workers in the field.

Acknowledgments

The author would like to express his gratitude for the support of the National Science Foundation for research in pattern recognition. Much of the material in this book was contributed by the author's past co-workers, T. F. Krile, D. R. Olsen, W. L. G. Koontz, D. L. Kessell, L. D. Hostetler, P. M. Narendra, R. D. Short, J. M. Mantock, T. E. Flick, D. M. Hummels, and R. R. Hayes. Working with these outstanding individuals has been the author's honor, pleasure, and delight. Also, the continuous discussion with W. H. Schoendorf, B. J. Burdick, A. C. Williams, and L. M. Novak has been stimulating. In addition, the author wishes to thank his wife Reiko for continuous support and encouragement.

The author acknowledges those at the Institute of Electrical and Electronics Engineers, Inc., for their authorization to use material from its journals.

Chapter 1
INTRODUCTION

This book presents and discusses the fundamental mathematical tools for statistical decision-making processes in pattern recognition. It is felt that the decision-making processes of a human being are somewhat related to the recognition of patterns; for example, the next move in a chess game is based upon the present pattern on the board, and buying or selling stocks is decided by a complex pattern of information. The goal of pattern recognition is to clarify these complicated mechanisms of decision-making processes and to automate these functions using computers. However, because of the complex nature of the problem, most pattern recognition research has been concentrated on more realistic problems, such as the recognition of Latin characters and the classification of waveforms. The purpose of this book is to cover the mathematical models of these practical problems and to provide the fundamental mathematical tools necessary for solving them. Although many approaches have been proposed to formulate more complex decision-making processes, these are outside the scope of this book.
1.1 Formulation of Pattern Recognition Problems

Many important applications of pattern recognition can be characterized as either waveform classification or classification of geometric figures. For example, consider the problem of testing a machine for normal or abnormal operation by observing the output voltage of a microphone over a period of time. This problem reduces to discrimination of waveforms from good and bad machines. On the other hand, recognition of printed English characters corresponds to classification of geometric figures. In order to perform this type of classification, we must first measure the observable characteristics of the sample. The most primitive but assured way to extract all information contained in the sample is to measure the time-sampled values for a waveform, x(t1), ..., x(tn), and the grey levels of pixels for a figure, x(1), ..., x(n), as shown in Fig. 1-1. These n measurements form a vector X. Even under the normal machine condition, the observed waveforms are different each time the observation is made. Therefore, x(ti) is a random variable and will be expressed, using boldface, as x(ti). Likewise, X is called a random vector if its components are random variables and is expressed as X. Similar arguments hold for characters: the observation, x(i), varies from one A to another and therefore x(i) is a random variable, and X is a random vector.
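
As a minimal illustration of this vector representation, assuming a synthetic noisy sine waveform, n = 8 time samples, and 100 repeated observations (all arbitrary choices made for this sketch), each observation becomes one sample vector X and the collection of rows forms the distribution of X:

import numpy as np

rng = np.random.default_rng(0)

n = 8                                   # number of time samples per waveform
t = np.linspace(0.0, 1.0, n)            # sampling instants t1, ..., tn

def observe_waveform():
    """One noisy observation of the machine's output: a vector in n-dimensional space."""
    return np.sin(2 * np.pi * 3 * t) + 0.2 * rng.standard_normal(n)

# Each observation is one sample vector X; many observations together form a
# distribution of X in the n-dimensional space.
X = np.vstack([observe_waveform() for _ in range(100)])   # shape (100, n)
print(X.shape, X.mean(axis=0))
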
Thus, each waveform or character is expressed by a vector (or a sample) in an n-dimensional space, and many waveforms or characters form a distribution of X in the n-dimensional space. Figure 1-2 shows a simple two-dimensional example of two distributions corresponding to normal and abnormal machine conditions, where points depict the locations of samples and solid lines are the contour lines of the probability density functions. If we know these two distributions of X from past experience, we can set up a boundary between these two distributions, g(x1, x2) = 0, which divides the two-dimensional space into two regions. Once the boundary is selected, we can classify a sample without a class label to a normal or abnormal machine, depending on g(x1, x2) < 0 or g(x1, x2) > 0. We call g(x1, x2) a discriminant function, and a network which detects the sign of g(x1, x2) is called a pattern recognition network, a categorizer, or a classifier. Figure 1-3 shows a block diagram of a classifier in a general n-dimensional space. Thus, in order to design a classifier, we must study the characteristics of the distribution of X for each category and find a proper discriminant function. This process is called learning or training, and samples used to design a classifier are called learning or training samples. The discussion can be easily extended to multi-category cases.
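
A minimal sketch of this sign rule, assuming a hand-picked linear discriminant function g(x1, x2) = w1*x1 + w2*x2 + w0 and a few illustrative sample points (the coefficients and data are assumptions for this sketch, not values from the book):

import numpy as np

# A hand-chosen linear discriminant function g(x1, x2) = w1*x1 + w2*x2 + w0.
w = np.array([1.0, -1.0])
w0 = 0.5

def g(x):
    """Discriminant function: its sign decides the class of sample x."""
    return x @ w + w0

def classify(x):
    # g(x) < 0 -> class 1 (say, "normal"), g(x) > 0 -> class 2 ("abnormal")
    return 1 if g(x) < 0 else 2

samples = np.array([[0.2, 1.3], [2.0, 0.1], [-1.0, 0.4]])
print([classify(x) for x in samples])      # [1, 2, 1]
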
Thus, pattern recognition, or decision-making in a broader sense, may be considered as a problem of estimating density functions in a high-dimensional space and dividing the space into the regions of categories or classes. Because of this view, mathematical statistics forms the foundation of the subject. Also, since vectors and matrices are used to represent samples and linear operators, respectively, a basic knowledge of linear algebra is required to read this book. Chapter 2 presents a brief review of these two subjects.

Fig. 1-1 Two measurements of patterns: (a) waveform; (b) character.
The first question we ask is what is the theoretically best classifier, assuming that the distributions of the random vectors are given. This problem is statistical hypothesis testing, and the Bayes classifier is the best classifier which minimizes the probability of classification error. Various hypothesis tests are discussed in Chapter 3.

The probability of error is the key parameter in pattern recognition. The error due to the Bayes classifier (the Bayes error) gives the smallest error we can achieve from given distributions. In Chapter 3, we discuss how to calculate the Bayes error. We also consider a simpler problem of finding an upper bound of the Bayes error.
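
As a minimal sketch, assuming two known one-dimensional Gaussian class densities N(0,1) and N(2,1) with equal priors (purely illustrative choices), the Bayes classifier assigns a sample to the class with the larger likelihood, and its error can be estimated by Monte Carlo simulation:

import numpy as np

rng = np.random.default_rng(1)

# Two known class-conditional densities: N(0,1) and N(2,1), with equal
# prior probabilities P1 = P2 = 0.5 (illustrative assumption).
def gauss_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# The Bayes classifier assigns x to the class with the larger posterior;
# with equal priors this is simply the larger likelihood.
def bayes_decide(x):
    return np.where(gauss_pdf(x, 0.0, 1.0) >= gauss_pdf(x, 2.0, 1.0), 1, 2)

# Monte Carlo estimate of the Bayes error: draw labeled samples from the
# true densities and count how often the Bayes classifier is wrong.
n = 200_000
labels = rng.integers(1, 3, size=n)                    # true class, 1 or 2
x = np.where(labels == 1, rng.normal(0.0, 1.0, n), rng.normal(2.0, 1.0, n))
print("estimated Bayes error:", np.mean(bayes_decide(x) != labels))   # about 0.159

No other decision rule can do better than this error for the given densities; Chapter 3 treats the general calculation and its bounds.
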
Fig. 1-2 Distributions of samples from normal and abnormal machines.
Although the Bayes classifier is optimal, its implementation is often difficult in practice because of its complexity, particularly when the dimensionality is high. Therefore, we are often led to consider a simpler, parametric classifier. Parametric classifiers are based on assumed mathematical forms for either the density functions or the discriminant functions. Linear, quadratic, or piecewise classifiers are the simplest and most common choices. Various design procedures for these classifiers are discussed in Chapter 4.
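
A sketch of one such parametric design, assuming Gaussian class densities, equal priors, and synthetic two-dimensional training samples (all assumptions made for illustration; the design procedures themselves are the subject of Chapter 4): the class means and covariance matrices are estimated from the samples and plugged into a quadratic discriminant function.

import numpy as np

rng = np.random.default_rng(2)

# Synthetic two-dimensional training samples for two classes (assumption).
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=200)
X2 = rng.multivariate_normal([2.0, 1.5], [[1.5, -0.2], [-0.2, 0.8]], size=200)

def fit_gaussian(X):
    """Estimate the mean vector and covariance matrix of one class."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

m1, S1 = fit_gaussian(X1)
m2, S2 = fit_gaussian(X2)

def quadratic_discriminant(x):
    """h(x) > 0 favors class 1; assumes Gaussian densities and equal priors."""
    d1 = x - m1
    d2 = x - m2
    log_p1 = -0.5 * d1 @ np.linalg.solve(S1, d1) - 0.5 * np.log(np.linalg.det(S1))
    log_p2 = -0.5 * d2 @ np.linalg.solve(S2, d2) - 0.5 * np.log(np.linalg.det(S2))
    return log_p1 - log_p2

print(quadratic_discriminant(np.array([0.1, -0.2])) > 0)   # True: looks like class 1
print(quadratic_discriminant(np.array([2.2, 1.4])) > 0)    # False: looks like class 2
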
Even when the mathematical forms can be assumed, the values of the parameters are not given in practice and must be estimated from available samples. With a finite number of samples, the estimates of the parameters and subsequently of the classifiers based on these estimates become random variables. The resulting classification error also becomes a random variable and is biased with a variance. Therefore, it is important to understand how the number of samples affects classifier design and its performance. Chapter 5 discusses this subject.

When no parametric structure can be assumed for the density functions,
we
must use
nonparametric techniques
such as the
Parzen
and
k-nearest neigh-
bor
approaches for estimating density functions. In Chapter
6,
we develop the
basic statistical properties of these estimates.
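
A minimal one-dimensional sketch of both estimates, assuming a Gaussian kernel, a hand-picked window width h, and a hand-picked k (the sensitivity of the estimates to such control parameters is exactly what the later chapters examine):

import numpy as np

rng = np.random.default_rng(3)
samples = rng.normal(0.0, 1.0, size=500)     # training samples from an "unknown" density

def parzen_estimate(x, samples, h=0.3):
    """Parzen (kernel) density estimate at x with a Gaussian kernel of width h."""
    u = (x - samples) / h
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.mean() / h

def knn_density_estimate(x, samples, k=25):
    """k-nearest neighbor density estimate: k / (N * length of the smallest
    interval around x that contains k samples)."""
    dist = np.sort(np.abs(samples - x))
    return k / (len(samples) * 2.0 * dist[k - 1])

# Compare both estimates with the true N(0,1) density at a few points.
for x in (-1.0, 0.0, 1.0):
    true = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
    print(f"x={x:+.1f}  Parzen={parzen_estimate(x, samples):.3f}  "
          f"kNN={knn_density_estimate(x, samples):.3f}  true={true:.3f}")
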
Then, in Chapter 7, the nonparametric density estimates are applied to classification problems. The main topic in Chapter 7 is the estimation of the Bayes error without assuming any mathematical form for the density functions. In general, nonparametric techniques are very sensitive to the number of control parameters, and tend to give heavily biased results unless the values of these parameters are carefully chosen. Chapter 7 presents an extensive discussion of how to select these parameter values.
In Fig. 1-2, we presented decision-making as dividing a high-dimensional space. An alternative view is to consider decision-making as a dictionary search. That is, all past experiences (learning samples) are stored in a memory (a dictionary), and a test sample is classified to the class of the closest sample in the dictionary. This process is called the nearest neighbor classification rule. This process is widely considered as a decision-making process close to the one of a human being. Figure 1-4 shows an example of the decision boundary due to this classifier. Again, the classifier divides the space into two regions, but in a somewhat more complex and sample-dependent way than the boundary of Fig. 1-2. This is a nonparametric classifier discussed in Chapter 7.
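
A minimal sketch of this rule, assuming a small synthetic "dictionary" of labeled samples (the samples and labels below are illustrative assumptions):

import numpy as np

# The "dictionary": stored learning samples and their class labels (synthetic).
prototypes = np.array([[0.0, 0.0], [0.5, 1.0], [3.0, 2.5], [2.5, 3.0]])
labels = np.array([1, 1, 2, 2])

def nearest_neighbor_classify(x):
    """Assign x to the class of the closest stored sample (1-NN rule)."""
    distances = np.linalg.norm(prototypes - x, axis=1)
    return labels[np.argmin(distances)]

print(nearest_neighbor_classify(np.array([0.3, 0.2])))   # 1
print(nearest_neighbor_classify(np.array([2.8, 2.6])))   # 2
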
From the very beginning of the computer age, researchers have been interested in how a human being learns, for example, to read English characters. The study of neurons suggested that a single neuron operates like a linear classifier, and that a combination of many neurons may produce a complex, piecewise linear boundary. So, researchers came up with the idea of a learning machine as shown in Fig. 1-5. The structure of the classifier is given along with a number of unknown parameters w0, ..., wn. The input vector, for example an English character, is fed, one sample at a time, in sequence. A teacher stands beside the machine, observing both the input and output. When a discrepancy is observed between the input and output, the teacher notifies the machine, and the machine changes the parameters according to a predesigned algorithm. Chapter 8 discusses how to change these parameters and how the parameters converge to the desired values. However, changing a large number of parameters by observing one sample at a time turns out to be a very inefficient way of designing a classifier.
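
A minimal sketch of such a teacher-corrected adjustment, assuming a perceptron-like update rule, synthetic linearly separable data, and a fixed number of passes (all illustrative choices rather than the procedures analyzed in Chapter 8):

import numpy as np

rng = np.random.default_rng(4)

# Synthetic linearly separable data: class +1 around (2,2), class -1 around (-2,-2).
X = np.vstack([rng.normal(2.0, 0.7, (50, 2)), rng.normal(-2.0, 0.7, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
X = np.hstack([np.ones((100, 1)), X])        # prepend 1 so w[0] acts as the threshold w0

w = np.zeros(3)                              # unknown parameters w0, w1, w2
for _ in range(20):                          # present samples one at a time, in sequence
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:               # the "teacher" flags a misclassified sample
            w = w + yi * xi                  # adjust the parameters and continue
print("learned parameters:", w)
print("training errors:", np.sum(np.sign(X @ w) != y))
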
Fig. 1-4 Nearest neighbor decision boundary.
We started our discussion by choosing time-sampled values of waveforms or pixel values of geometric figures. Usually, the number of measurements n becomes high in order to ensure that the measurements carry all of the information contained in the original data. This high-dimensionality makes many pattern recognition problems difficult. On the other hand, classification by a human being is usually based on a small number of features such as the peak value, fundamental frequency, etc. Each of these measurements carries significant information for classification and is selected according to the physical meaning of the problem. Obviously, as the number of inputs to a classifier becomes smaller, the design of the classifier becomes simpler. In order to enjoy this advantage, we have to find some way to select or extract important features from the observed samples. This problem is called feature selection or extraction and is another important subject of pattern recognition. However, it should be noted that, as long as features are computed from the measurements, the set of features cannot carry more classification information than the measurements. As a result, the Bayes error in the feature space is always larger than that in the measurement space.
Feature selection can be considered as a mapping from the n-dimensional space to a lower-dimensional feature space. The mapping should be carried out without severely reducing the class separability. Although most features that a human being selects are nonlinear functions of the measurements, finding the optimum nonlinear mapping functions is beyond our capability. So, the discussion in this book is limited to linear mappings.
In Chapter 9, feature extraction for signal representation is discussed, in which the mapping is limited to orthonormal transformations and the mean-square error is minimized. On the other hand, in feature extraction for classification, mapping is not limited to any specific form and the class separability is used as the criterion to be optimized. Feature extraction for classification is discussed in Chapter 10.
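
A minimal sketch of such an orthonormal mapping, assuming synthetic correlated data and a reduction from five measurements to two features (both assumptions for illustration): the mapping is built from the dominant eigenvectors of the sample covariance matrix, so the remaining mean-square representation error is roughly the sum of the discarded eigenvalues.

import numpy as np

rng = np.random.default_rng(5)

# Synthetic samples in a 5-dimensional measurement space (assumption).
n, m = 300, 5
X = rng.standard_normal((n, m)) @ rng.standard_normal((m, m))   # correlated measurements

# Eigenvectors of the sample covariance matrix, sorted by decreasing eigenvalue.
mean = X.mean(axis=0)
cov = np.cov(X - mean, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)            # eigh: symmetric matrix
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the two dominant eigenvectors as the orthonormal mapping A (5 -> 2),
# and reconstruct the data from the two retained features.
A = eigvecs[:, :2]
Y = (X - mean) @ A                                # mapped (feature) samples
X_hat = Y @ A.T + mean                            # reconstruction from 2 features
print("mean-square error:", np.mean(np.sum((X - X_hat) ** 2, axis=1)))
print("sum of discarded eigenvalues:", eigvals[2:].sum())
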

It is sometimes important to decompose a given distribution into several clusters. This operation is called clustering or unsupervised classification (or learning). The subject is discussed in Chapter 11.
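
A minimal sketch of the idea, assuming synthetic unlabeled data and the familiar k-means iteration (one simple clustering procedure used here only for illustration; Chapter 11 covers parametric and nonparametric clustering in detail):

import numpy as np

rng = np.random.default_rng(6)

# Synthetic unlabeled data drawn from two groups (assumption).
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(3.0, 0.5, (100, 2))])

k = 2
centers = X[rng.choice(len(X), size=k, replace=False)]    # initial representatives
for _ in range(10):
    # Assign each sample to the nearest cluster center ...
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = np.argmin(d, axis=1)
    # ... then move each center to the mean of its assigned samples.
    centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])

print("cluster centers:")
print(centers)     # roughly (0, 0) and (3, 3)
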
1.2 Process of Classifier Design

Figure 1-6 shows a flow chart of how a classifier is designed. After data is gathered, samples are normalized and registered. Normalization and registration are very important processes for a successful classifier design. However, different data requires different normalization and registration, and it is difficult to discuss these subjects in a generalized way. Therefore, these subjects are not included in this book.

After normalization and registration, the class separability of the data is measured. This is done by estimating the Bayes error in the measurement space. Since it is not appropriate at this stage to assume a mathematical form for the data structure, the estimation procedure must be nonparametric. If the Bayes error is larger than the final classifier error we wish to achieve (denoted by ε0), the data does not carry enough classification information to meet the specification. Selecting features and designing a classifier in the later stages