
Statistical Pattern Recognition, Second Edition. Andrew R. Webb
Copyright © 2002 John Wiley & Sons, Ltd.
ISBNs: 0-470-84513-9 (HB); 0-470-84514-7 (PB)


Statistical Pattern Recognition
Second Edition

Andrew R. Webb
QinetiQ Ltd., Malvern, UK


First edition published by Butterworth Heinemann.
Copyright © 2002

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester,
West Sussex PO19 8SQ, England
Telephone (+44) 1243 779777

Email (for orders and customer service enquiries):
Visit our Home Page on www.wileyeurope.com or www.wiley.com
All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or
transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or
otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of
a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP,
UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed
to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West


Sussex PO19 8SQ, England, or emailed to , or faxed to (+44) 1243 770571.
This publication is designed to provide accurate and authoritative information in regard to the subject
matter covered. It is sold on the understanding that the Publisher is not engaged in rendering
professional services. If professional advice or other expert assistance is required, the services of a
competent professional should be sought.

Other Wiley Editorial Offices
John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA
Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA
Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany
John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809
John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada M9W 1L1

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN 0-470-84513-9 (Cloth)
ISBN 0-470-84514-7 (Paper)
Typeset from LaTeX files produced by the author by Laserwords Private Limited, Chennai, India
Printed and bound in Great Britain by Biddles Ltd, Guildford, Surrey
This book is printed on acid-free paper responsibly manufactured from sustainable forestry in which at
least two trees are planted for each one used for paper production.


To Rosemary,
Samuel, Miriam, Jacob and Ethan


Contents


Preface

Notation

1 Introduction to statistical pattern recognition
   1.1 Statistical pattern recognition
      1.1.1 Introduction
      1.1.2 The basic model
   1.2 Stages in a pattern recognition problem
   1.3 Issues
   1.4 Supervised versus unsupervised
   1.5 Approaches to statistical pattern recognition
      1.5.1 Elementary decision theory
      1.5.2 Discriminant functions
   1.6 Multiple regression
   1.7 Outline of book
   1.8 Notes and references
   Exercises

2 Density estimation – parametric
   2.1 Introduction
   2.2 Normal-based models
      2.2.1 Linear and quadratic discriminant functions
      2.2.2 Regularised discriminant analysis
      2.2.3 Example application study
      2.2.4 Further developments
      2.2.5 Summary
   2.3 Normal mixture models
      2.3.1 Maximum likelihood estimation via EM
      2.3.2 Mixture models for discrimination
      2.3.3 How many components?
      2.3.4 Example application study
      2.3.5 Further developments
      2.3.6 Summary
   2.4 Bayesian estimates
      2.4.1 Bayesian learning methods
      2.4.2 Markov chain Monte Carlo
      2.4.3 Bayesian approaches to discrimination
      2.4.4 Example application study
      2.4.5 Further developments
      2.4.6 Summary
   2.5 Application studies
   2.6 Summary and discussion
   2.7 Recommendations
   2.8 Notes and references
   Exercises

3 Density estimation – nonparametric
   3.1 Introduction
   3.2 Histogram method
      3.2.1 Data-adaptive histograms
      3.2.2 Independence assumption
      3.2.3 Lancaster models
      3.2.4 Maximum weight dependence trees
      3.2.5 Bayesian networks
      3.2.6 Example application study
      3.2.7 Further developments
      3.2.8 Summary
   3.3 k-nearest-neighbour method
      3.3.1 k-nearest-neighbour decision rule
      3.3.2 Properties of the nearest-neighbour rule
      3.3.3 Algorithms
      3.3.4 Editing techniques
      3.3.5 Choice of distance metric
      3.3.6 Example application study
      3.3.7 Further developments
      3.3.8 Summary
   3.4 Expansion by basis functions
   3.5 Kernel methods
      3.5.1 Choice of smoothing parameter
      3.5.2 Choice of kernel
      3.5.3 Example application study
      3.5.4 Further developments
      3.5.5 Summary
   3.6 Application studies
   3.7 Summary and discussion
   3.8 Recommendations
   3.9 Notes and references
   Exercises

4 Linear discriminant analysis
   4.1 Introduction
   4.2 Two-class algorithms
      4.2.1 General ideas
      4.2.2 Perceptron criterion
      4.2.3 Fisher’s criterion
      4.2.4 Least mean squared error procedures
      4.2.5 Support vector machines
      4.2.6 Example application study
      4.2.7 Further developments
      4.2.8 Summary
   4.3 Multiclass algorithms
      4.3.1 General ideas
      4.3.2 Error-correction procedure
      4.3.3 Fisher’s criterion – linear discriminant analysis
      4.3.4 Least mean squared error procedures
      4.3.5 Optimal scaling
      4.3.6 Regularisation
      4.3.7 Multiclass support vector machines
      4.3.8 Example application study
      4.3.9 Further developments
      4.3.10 Summary
   4.4 Logistic discrimination
      4.4.1 Two-group case
      4.4.2 Maximum likelihood estimation
      4.4.3 Multiclass logistic discrimination
      4.4.4 Example application study
      4.4.5 Further developments
      4.4.6 Summary
   4.5 Application studies
   4.6 Summary and discussion
   4.7 Recommendations
   4.8 Notes and references
   Exercises

5 Nonlinear discriminant analysis – kernel methods
   5.1 Introduction
   5.2 Optimisation criteria
      5.2.1 Least squares error measure
      5.2.2 Maximum likelihood
      5.2.3 Entropy
   5.3 Radial basis functions
      5.3.1 Introduction
      5.3.2 Motivation
      5.3.3 Specifying the model
      5.3.4 Radial basis function properties
      5.3.5 Simple radial basis function
      5.3.6 Example application study
      5.3.7 Further developments
      5.3.8 Summary
   5.4 Nonlinear support vector machines
      5.4.1 Types of kernel
      5.4.2 Model selection
      5.4.3 Support vector machines for regression
      5.4.4 Example application study
      5.4.5 Further developments
      5.4.6 Summary
   5.5 Application studies
   5.6 Summary and discussion
   5.7 Recommendations
   5.8 Notes and references
   Exercises

6 Nonlinear discriminant analysis – projection methods
   6.1 Introduction
   6.2 The multilayer perceptron
      6.2.1 Introduction
      6.2.2 Specifying the multilayer perceptron structure
      6.2.3 Determining the multilayer perceptron weights
      6.2.4 Properties
      6.2.5 Example application study
      6.2.6 Further developments
      6.2.7 Summary
   6.3 Projection pursuit
      6.3.1 Introduction
      6.3.2 Projection pursuit for discrimination
      6.3.3 Example application study
      6.3.4 Further developments
      6.3.5 Summary
   6.4 Application studies
   6.5 Summary and discussion
   6.6 Recommendations
   6.7 Notes and references
   Exercises

7 Tree-based methods
   7.1 Introduction
   7.2 Classification trees
      7.2.1 Introduction
      7.2.2 Classifier tree construction
      7.2.3 Other issues
      7.2.4 Example application study
      7.2.5 Further developments
      7.2.6 Summary
   7.3 Multivariate adaptive regression splines
      7.3.1 Introduction
      7.3.2 Recursive partitioning model
      7.3.3 Example application study
      7.3.4 Further developments
      7.3.5 Summary
   7.4 Application studies
   7.5 Summary and discussion
   7.6 Recommendations
   7.7 Notes and references
   Exercises

8 Performance
   8.1 Introduction
   8.2 Performance assessment
      8.2.1 Discriminability
      8.2.2 Reliability
      8.2.3 ROC curves for two-class rules
      8.2.4 Example application study
      8.2.5 Further developments
      8.2.6 Summary
   8.3 Comparing classifier performance
      8.3.1 Which technique is best?
      8.3.2 Statistical tests
      8.3.3 Comparing rules when misclassification costs are uncertain
      8.3.4 Example application study
      8.3.5 Further developments
      8.3.6 Summary
   8.4 Combining classifiers
      8.4.1 Introduction
      8.4.2 Motivation
      8.4.3 Characteristics of a combination scheme
      8.4.4 Data fusion
      8.4.5 Classifier combination methods
      8.4.6 Example application study
      8.4.7 Further developments
      8.4.8 Summary
   8.5 Application studies
   8.6 Summary and discussion
   8.7 Recommendations
   8.8 Notes and references
   Exercises

9 Feature selection and extraction
   9.1 Introduction
   9.2 Feature selection
      9.2.1 Feature selection criteria
      9.2.2 Search algorithms for feature selection
      9.2.3 Suboptimal search algorithms
      9.2.4 Example application study
      9.2.5 Further developments
      9.2.6 Summary
   9.3 Linear feature extraction
      9.3.1 Principal components analysis
      9.3.2 Karhunen–Loève transformation
      9.3.3 Factor analysis
      9.3.4 Example application study
      9.3.5 Further developments
      9.3.6 Summary
   9.4 Multidimensional scaling
      9.4.1 Classical scaling
      9.4.2 Metric multidimensional scaling
      9.4.3 Ordinal scaling
      9.4.4 Algorithms
      9.4.5 Multidimensional scaling for feature extraction
      9.4.6 Example application study
      9.4.7 Further developments
      9.4.8 Summary
   9.5 Application studies
   9.6 Summary and discussion
   9.7 Recommendations
   9.8 Notes and references
   Exercises

10 Clustering
   10.1 Introduction
   10.2 Hierarchical methods
      10.2.1 Single-link method
      10.2.2 Complete-link method
      10.2.3 Sum-of-squares method
      10.2.4 General agglomerative algorithm
      10.2.5 Properties of a hierarchical classification
      10.2.6 Example application study
      10.2.7 Summary
   10.3 Quick partitions
   10.4 Mixture models
      10.4.1 Model description
      10.4.2 Example application study
   10.5 Sum-of-squares methods
      10.5.1 Clustering criteria
      10.5.2 Clustering algorithms
      10.5.3 Vector quantisation
      10.5.4 Example application study
      10.5.5 Further developments
      10.5.6 Summary
   10.6 Cluster validity
      10.6.1 Introduction
      10.6.2 Distortion measures
      10.6.3 Choosing the number of clusters
      10.6.4 Identifying genuine clusters
   10.7 Application studies
   10.8 Summary and discussion
   10.9 Recommendations
   10.10 Notes and references
   Exercises

11 Additional topics
   11.1 Model selection
      11.1.1 Separate training and test sets
      11.1.2 Cross-validation
      11.1.3 The Bayesian viewpoint
      11.1.4 Akaike’s information criterion
   11.2 Learning with unreliable classification
   11.3 Missing data
   11.4 Outlier detection and robust procedures
   11.5 Mixed continuous and discrete variables
   11.6 Structural risk minimisation and the Vapnik–Chervonenkis dimension
      11.6.1 Bounds on the expected risk
      11.6.2 The Vapnik–Chervonenkis dimension

A Measures of dissimilarity
   A.1 Measures of dissimilarity
      A.1.1 Numeric variables
      A.1.2 Nominal and ordinal variables
      A.1.3 Binary variables
      A.1.4 Summary
   A.2 Distances between distributions
      A.2.1 Methods based on prototype vectors
      A.2.2 Methods based on probabilistic distance
      A.2.3 Probabilistic dependence
   A.3 Discussion

B Parameter estimation
   B.1 Parameter estimation
      B.1.1 Properties of estimators
      B.1.2 Maximum likelihood
      B.1.3 Problems with maximum likelihood
      B.1.4 Bayesian estimates

C Linear algebra
   C.1 Basic properties and definitions
   C.2 Notes and references

D Data
   D.1 Introduction
   D.2 Formulating the problem
   D.3 Data collection
   D.4 Initial examination of data
   D.5 Data sets
   D.6 Notes and references

E Probability theory
   E.1 Definitions and terminology
   E.2 Normal distribution
   E.3 Probability distributions

References

Index



Preface

This book provides an introduction to statistical pattern recognition theory and techniques.
Most of the material presented is concerned with discrimination and classification and
has been drawn from a wide range of literature including that of engineering, statistics,
computer science and the social sciences. The book is an attempt to provide a concise
volume containing descriptions of many of the most useful of today’s pattern processing techniques, including many of the recent advances in nonparametric approaches to
discrimination developed in the statistics literature and elsewhere. The techniques are
illustrated with examples of real-world application studies. Pointers are also provided
to the diverse literature base where further details on applications, comparative studies
and theoretical developments may be obtained.
Statistical pattern recognition is a very active area of research. Many advances over
recent years have been due to the increased computational power available, enabling
some techniques to have much wider applicability. Most of the chapters in this book have
concluding sections that describe, albeit briefly, the wide range of practical applications
that have been addressed and further developments of theoretical techniques.
Thus, the book is aimed at practitioners in the ‘field’ of pattern recognition (if such
a multidisciplinary collection of techniques can be termed a field) as well as researchers
in the area. Also, some of this material has been presented as part of a graduate course
on information technology. A prerequisite is a knowledge of basic probability theory
and linear algebra, together with basic knowledge of mathematical methods (the use
of Lagrange multipliers to solve problems with equality and inequality constraints, for
example). Some basic material is presented as appendices. The exercises at the ends of
the chapters vary from ‘open book’ questions to more lengthy computer projects.
Chapter 1 provides an introduction to statistical pattern recognition, defining some terminology and introducing supervised and unsupervised classification. Two related approaches
to supervised classification are presented: one based on the estimation of probability
density functions and a second based on the construction of discriminant functions. The
chapter concludes with an outline of the pattern recognition cycle, putting the remaining
chapters of the book into context. Chapters 2 and 3 pursue the density function approach
to discrimination, with Chapter 2 addressing parametric approaches to density estimation

and Chapter 3 developing classifiers based on nonparametric schemes.
Chapters 4–7 develop discriminant function approaches to supervised classification.
Chapter 4 focuses on linear discriminant functions; much of the methodology of this
chapter (including optimisation, regularisation and support vector machines) is used in
some of the nonlinear methods. Chapter 5 explores kernel-based methods, in particular,
the radial basis function network and the support vector machine, techniques for discrimination and regression that have received widespread study in recent years. Related nonlinear models (projection-based methods) are described in Chapter 6. Chapter 7 considers a
decision-tree approach to discrimination, describing the classification and regression tree
(CART) methodology and multivariate adaptive regression splines (MARS).
Chapter 8 considers performance: measuring the performance of a classifier and improving the performance by classifier combination.
The techniques of Chapters 9 and 10 may be described as methods of exploratory
data analysis or preprocessing (and as such would usually be carried out prior to the
supervised classification techniques of Chapters 2–7, although they could, on occasion,
be post-processors of supervised techniques). Chapter 9 addresses feature selection and
feature extraction – the procedures for obtaining a reduced set of variables characterising
the original data. Such procedures are often an integral part of classifier design and it is
somewhat artificial to partition the pattern recognition problem into separate processes
of feature extraction and classification. However, feature extraction may provide insights
into the data structure and the type of classifier to employ; thus, it is of interest in its
own right. Chapter 10 considers unsupervised classification or clustering – the process of
grouping individuals in a population to discover the presence of structure; its engineering
application is to vector quantisation for image and speech coding.
Finally, Chapter 11 addresses some important diverse topics including model selection. Appendices largely cover background material and material appropriate if this book
is used as a text for a ‘conversion course’: measures of dissimilarity, estimation, linear
algebra, data analysis and basic probability.

The website www.statistical-pattern-recognition.net contains references and links to further information on techniques and applications.
In preparing the second edition of this book I have been helped by many people.
I am grateful to colleagues and friends who have made comments on various parts of
the manuscript. In particular, I would like to thank Mark Briers, Keith Copsey, Stephen
Luttrell, John O’Loghlen and Kevin Weekes (with particular thanks to Keith for examples
in Chapter 2); Wiley for help in the final production of the manuscript; and especially
Rosemary for her support and patience.


Notation

Some of the more commonly used notation is given below. I have used some notational
conveniences. For example, I have tended to use the same symbol for a variable as well
as a measurement on that variable. The meaning should be obvious from the context.
Also, I denote the density function of $\mathbf{x}$ as $p(\mathbf{x})$ and that of $y$ as $p(y)$, even though the functions
differ. A vector is denoted by a lower-case quantity in bold face, and a matrix by upper
case.

$p$ : number of variables
$C$ : number of classes
$n$ : number of measurements
$n_i$ : number of measurements in class $i$
$\omega_i$ : label for class $i$
$X_1, \ldots, X_p$ : the $p$ random variables
$x_1, \ldots, x_p$ : measurements on the variables $X_1, \ldots, X_p$
$\mathbf{x} = (x_1, \ldots, x_p)^T$ : measurement vector
$X = [\mathbf{x}_1, \ldots, \mathbf{x}_n]^T$ : $n \times p$ data matrix, with $(r, j)$th element $x_{rj}$
$P(\mathbf{x}) = \mathrm{prob}(X_1 \le x_1, \ldots, X_p \le x_p)$ : cumulative distribution function
$p(\mathbf{x}) = \partial P / \partial \mathbf{x}$ : probability density function
$p(\omega_i)$ : prior probability of class $i$
$\boldsymbol{\mu} = \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x}$ : population mean
$\boldsymbol{\mu}_i = \int \mathbf{x}\, p(\mathbf{x} | \omega_i)\, d\mathbf{x}$ : mean of class $i$, $i = 1, \ldots, C$
$\mathbf{m} = (1/n) \sum_{r=1}^{n} \mathbf{x}_r$ : sample mean
$\mathbf{m}_i = (1/n_i) \sum_{r=1}^{n} z_{ir} \mathbf{x}_r$ : sample mean of class $i$, $i = 1, \ldots, C$, where $z_{ir} = 1$ if $\mathbf{x}_r \in \omega_i$ and 0 otherwise, and $n_i$ is the number of patterns in $\omega_i$, $n_i = \sum_{r=1}^{n} z_{ir}$
$\hat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{r=1}^{n} (\mathbf{x}_r - \mathbf{m})(\mathbf{x}_r - \mathbf{m})^T$ : sample covariance matrix (maximum likelihood estimate)
$\frac{n}{n-1} \hat{\boldsymbol{\Sigma}}$ : sample covariance matrix (unbiased estimate)
$\hat{\boldsymbol{\Sigma}}_i = (1/n_i) \sum_{j=1}^{n} z_{ij} (\mathbf{x}_j - \mathbf{m}_i)(\mathbf{x}_j - \mathbf{m}_i)^T$ : sample covariance matrix of class $i$ (maximum likelihood estimate)
$S_i = \frac{n_i}{n_i - 1} \hat{\boldsymbol{\Sigma}}_i$ : sample covariance matrix of class $i$ (unbiased estimate)
$S_W = \sum_{i=1}^{C} \frac{n_i}{n} \hat{\boldsymbol{\Sigma}}_i$ : pooled within-class sample covariance matrix
$S = \frac{n}{n - C} S_W$ : pooled within-class sample covariance matrix (unbiased estimate)
$S_B = \sum_{i=1}^{C} \frac{n_i}{n} (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T$ : sample between-class matrix
$S_B + S_W = \hat{\boldsymbol{\Sigma}}$
$\| A \|^2 = \sum_{ij} A_{ij}^2$
$N(\mathbf{m}, \boldsymbol{\Sigma})$ : normal distribution with mean $\mathbf{m}$ and covariance matrix $\boldsymbol{\Sigma}$
$E[Y | X]$ : expectation of $Y$ given $X$
$I(\cdot)$ : indicator function, equal to 1 if its argument is true and 0 otherwise

Notation for specific probability density functions is given in Appendix E.
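
To make the sample statistics above concrete, here is a minimal NumPy sketch (not part of the book) that computes the sample mean, the class means, the maximum likelihood covariance estimate and the matrices $S_W$ and $S_B$ from a data matrix and a label vector; the function name sample_statistics and the synthetic data are illustrative only.

```python
import numpy as np

def sample_statistics(X, z):
    """Sample statistics from the notation above, for an n x p data matrix X
    and integer class labels z (one label per row of X)."""
    n, p = X.shape
    m = X.mean(axis=0)                                   # sample mean
    Sigma_hat = (X - m).T @ (X - m) / n                  # ML covariance estimate
    S_W = np.zeros((p, p))                               # pooled within-class covariance
    S_B = np.zeros((p, p))                               # sample between-class matrix
    for c in np.unique(z):
        Xc = X[z == c]
        n_c = len(Xc)
        m_c = Xc.mean(axis=0)                            # class sample mean m_i
        Sigma_hat_c = (Xc - m_c).T @ (Xc - m_c) / n_c    # class ML covariance estimate
        S_W += (n_c / n) * Sigma_hat_c
        S_B += (n_c / n) * np.outer(m_c - m, m_c - m)
    return m, Sigma_hat, S_W, S_B

# Quick check on random data that S_B + S_W equals the overall ML covariance estimate.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
z = rng.integers(1, 4, size=100)          # three classes labelled 1, 2, 3
m, Sigma_hat, S_W, S_B = sample_statistics(X, z)
assert np.allclose(S_B + S_W, Sigma_hat)
```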



1 Introduction to statistical pattern recognition


Overview
Statistical pattern recognition is a term used to cover all stages of an investigation
from problem formulation and data collection through to discrimination and classification, assessment of results and interpretation. Some of the basic terminology
is introduced and two complementary approaches to discrimination described.

1.1 Statistical pattern recognition
1.1.1 Introduction
This book describes basic pattern recognition procedures, together with practical applications of the techniques on real-world problems. A strong emphasis is placed on the
statistical theory of discrimination, but clustering also receives some attention. Thus,
the subject matter of this book can be summed up in a single word: ‘classification’,
both supervised (using class information to design a classifier – i.e. discrimination) and
unsupervised (allocating to groups without class information – i.e. clustering).
Pattern recognition as a field of study developed significantly in the 1960s. It was
very much an interdisciplinary subject, covering developments in the areas of statistics, engineering, artificial intelligence, computer science, psychology and physiology,
among others. Some people entered the field with a real problem to solve. The large
numbers of applications, ranging from the classical ones such as automatic character
recognition and medical diagnosis to the more recent ones in data mining (such as credit
scoring, consumer sales analysis and credit card transaction analysis), have attracted considerable research effort, with many methods developed and advances made. Other researchers were motivated by the development of machines with ‘brain-like’ performance,
that in some way could emulate human performance. There were many over-optimistic
and unrealistic claims made, and to some extent there exist strong parallels with the
growth of research on knowledge-based systems in the 1970s and neural networks in
the 1980s.
Nevertheless, within these areas significant progress has been made, particularly where

the domain overlaps with probability and statistics, and within recent years there have
been many exciting new developments, both in methodology and applications. These
build on the solid foundations of earlier research and take advantage of increased computational resources readily available nowadays. These developments include, for example,
kernel-based methods and Bayesian computational methods.
The topics in this book could easily have been described under the term machine
learning that describes the study of machines that can adapt to their environment and learn
from example. The emphasis in machine learning is perhaps more on computationally
intensive methods and less on a statistical approach, but there is strong overlap between
the research areas of statistical pattern recognition and machine learning.

1.1.2 The basic model
Since many of the techniques we shall describe have been developed over a range of
diverse disciplines, there is naturally a variety of sometimes contradictory terminology.
We shall use the term ‘pattern’ to denote the $p$-dimensional data vector $\mathbf{x} = (x_1, \ldots, x_p)^T$
of measurements ($T$ denotes vector transpose), whose components $x_i$ are measurements of
the features of an object. Thus the features are the variables specified by the investigator
and thought to be important for classification. In discrimination, we assume that there
exist $C$ groups or classes, denoted $\omega_1, \ldots, \omega_C$, and associated with each pattern $\mathbf{x}$ is a
categorical variable $z$ that denotes the class or group membership; that is, if $z = i$, then
the pattern belongs to $\omega_i$, $i \in \{1, \ldots, C\}$.
Examples of patterns are measurements of an acoustic waveform in a speech recognition problem; measurements on a patient made in order to identify a disease (diagnosis);
measurements on patients in order to predict the likely outcome (prognosis); measurements on weather variables (for forecasting or prediction); and a digitised image for
character recognition. Therefore, we see that the term ‘pattern’, in its technical meaning,
does not necessarily refer to structure within images.
The main topic in this book may be described by a number of terms such as pattern
classifier design or discrimination or allocation rule design. By this we mean specifying
the parameters of a pattern classifier, represented schematically in Figure 1.1, so that it
yields the optimal (in some sense) response for a given pattern. This response is usually
an estimate of the class to which the pattern belongs. We assume that we have a set of
patterns of known class $\{(\mathbf{x}_i, z_i), i = 1, \ldots, n\}$ (the training or design set) that we use
to design the classifier (to set up its internal parameters). Once this has been done, we
may estimate class membership for an unknown pattern $\mathbf{x}$.
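
As a small illustration of this design-then-classify setup (not taken from the book), the sketch below represents a training set as a data matrix with a label vector and uses a nearest-class-mean rule purely as a placeholder classifier; the data values and function names are hypothetical.

```python
import numpy as np

# A toy design (training) set: n patterns x_i (rows) with known class labels z_i.
X_train = np.array([[5.1, 3.5], [4.9, 3.0], [6.3, 3.3], [6.7, 3.1]])   # n x p
z_train = np.array([1, 1, 2, 2])                                        # class membership

def design_classifier(X, z):
    """'Set up the internal parameters': here, just one mean vector per class."""
    return {c: X[z == c].mean(axis=0) for c in np.unique(z)}

def classify(x, class_means):
    """Estimate the class of an unknown pattern x: assign it to the nearest class mean."""
    return min(class_means, key=lambda c: np.linalg.norm(x - class_means[c]))

params = design_classifier(X_train, z_train)
print(classify(np.array([6.0, 3.2]), params))     # estimated class for an unseen pattern
```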
The form derived for the pattern classifier depends on a number of different factors. It
depends on the distribution of the training data, and the assumptions made concerning its
distribution. Another important factor is the misclassification cost – the cost of making
an incorrect decision. In many applications misclassification costs are hard to quantify,
being combinations of several contributions such as monetary costs, time and other more
subjective costs. For example, in a medical diagnosis problem, each treatment has different costs associated with it. These relate to the expense of different types of drugs,


Figure 1.1 Pattern classifier: sensor → representation pattern → feature selector/extractor → feature pattern → classifier → decision

the suffering the patient is subjected to by each course of action and the risk of further
complications.
Figure 1.1 grossly oversimplifies the pattern classification procedure. Data may undergo several separate transformation stages before a final outcome is reached. These

transformations (sometimes termed preprocessing, feature selection or feature extraction)
operate on the data in a way that usually reduces its dimension (reduces the number
of features), removing redundant or irrelevant information, and transforms it to a form
more appropriate for subsequent classification. The term intrinsic dimensionality refers
to the minimum number of variables required to capture the structure within the data.
In the speech recognition example mentioned above, a preprocessing stage may be to
transform the waveform to a frequency representation. This may be processed further
to find formants (peaks in the spectrum). This is a feature extraction process (taking a
possible nonlinear combination of the original variables to form new variables). Feature
selection is the process of selecting a subset of a given set of variables.
Terminology varies between authors. Sometimes the term ‘representation pattern’ is
used for the vector of measurements made on a sensor (for example, optical imager, radar)
with the term ‘feature pattern’ being reserved for the small set of variables obtained by
transformation (by a feature selection or feature extraction process) of the original vector
of measurements. In some problems, measurements may be made directly on the feature
vector itself. In these situations there is no automatic feature selection stage, with the
feature selection being performed by the investigator who ‘knows’ (through experience,
knowledge of previous studies and the problem domain) those variables that are important
for classification. In many cases, however, it will be necessary to perform one or more
transformations of the measured data.
In some pattern classifiers, each of the above stages may be present and identifiable
as separate operations, while in others they may not be. Also, in some classifiers, the
preliminary stages will tend to be problem-specific, as in the speech example. In this book,
we consider feature selection and extraction transformations that are not application-specific. That is not to say all will be suitable for any given application, however, but
application-specific preprocessing must be left to the investigator.

1.2 Stages in a pattern recognition problem
A pattern recognition investigation may consist of several stages, enumerated below.
Further details are given in Appendix D. Not all stages may be present; some may be
merged together so that the distinction between two operations may not be clear, even if

both are carried out; also, there may be some application-specific data processing that may
not be regarded as one of the stages listed. However, the points below are fairly typical.



1. Formulation of the problem: gaining a clear understanding of the aims of the investigation and planning the remaining stages.
2. Data collection: making measurements on appropriate variables and recording details
of the data collection procedure (ground truth).
3. Initial examination of the data: checking the data, calculating summary statistics and
producing plots in order to get a feel for the structure.
4. Feature selection or feature extraction: selecting variables from the measured set that
are appropriate for the task. These new variables may be obtained by a linear or
nonlinear transformation of the original set (feature extraction). To some extent, the
division of feature extraction and classification is artificial.
5. Unsupervised pattern classification or clustering. This may be viewed as exploratory
data analysis and it may provide a successful conclusion to a study. On the other hand,
it may be a means of preprocessing the data for a supervised classification procedure.
6. Apply discrimination or regression procedures as appropriate. The classifier is designed using a training set of exemplar patterns.
7. Assessment of results. This may involve applying the trained classifier to an independent test set of labelled patterns.
8. Interpretation.
The above is necessarily an iterative process: the analysis of the results may pose
further hypotheses that require further data collection. Also, the cycle may be terminated
at different stages: the questions posed may be answered by an initial examination of
the data or it may be discovered that the data cannot answer the initial question and the
problem must be reformulated.
The emphasis of this book is on techniques for performing steps 4, 5 and 6.


1.3 Issues
The main topic that we address in this book concerns classifier design: given a training
set of patterns of known class, we seek to design a classifier that is optimal for the
expected operating conditions (the test conditions).
There are a number of very important points to make about the sentence above,
straightforward as it seems. The first is that we are given a finite design set. If the
classifier is too complex (there are too many free parameters) it may model noise in the
design set. This is an example of over-fitting. If the classifier is not complex enough,
then it may fail to capture structure in the data. An example of this is the fitting of a set
of data points by a polynomial curve. If the degree of the polynomial is too high, then,
although the curve may pass through or close to the data points, thus achieving a low
fitting error, the fitting curve is very variable and models every fluctuation in the data
(due to noise). If the degree of the polynomial is too low, the fitting error is large and
the underlying variability of the curve is not modelled.
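
The polynomial example can be reproduced in a few lines; the following is an illustrative sketch (not from the book), in which the sine curve, the noise level and the degrees tried are arbitrary choices. The training error keeps falling as the degree grows, while the error on an independent test set typically rises for high degrees.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_samples(n):
    """Points on a smooth curve with additive noise (the 'structure plus noise' setting)."""
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(2.0 * np.pi * x) + 0.3 * rng.normal(size=n)
    return x, y

x_train, y_train = noisy_samples(15)      # small design set
x_test, y_test = noisy_samples(500)       # stand-in for the true operating conditions

for degree in (1, 3, 9, 12):
    coeffs = np.polyfit(x_train, y_train, degree)                 # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}  test MSE {test_mse:.3f}")
```

With a design set this small, high-degree fits tend to interpolate the noise, which is exactly the over-fitting behaviour described above.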
Thus, achieving optimal performance on the design set (in terms of minimising some
error criterion perhaps) is not required: it may be possible, in a classification problem,
to achieve 100% classification accuracy on the design set but the generalisation performance – the expected performance on data representative of the true operating conditions
(equivalently, the performance on an infinite test set of which the design set is a sample) – is poorer than could be achieved by careful design. Choosing the ‘right’ model is
an exercise in model selection.
In practice we usually do not know what is structure and what is noise in the data.
Also, training a classifier (the procedure of determining its parameters) should not be
considered as a separate issue from model selection, but it often is.
A second point about the design of optimal classifiers concerns the word ‘optimal’.
There are several ways of measuring classifier performance, the most common being
error rate, although this has severe limitations. Other measures, based on the closeness
of the estimates of the probabilities of class membership to the true probabilities, may

be more appropriate in many cases. However, many classifier design methods usually
optimise alternative criteria since the desired ones are difficult to optimise directly. For
example, a classifier may be trained by optimising a squared error measure and assessed
using error rate.
Finally, we assume that the training data are representative of the test conditions. If
this is not so, perhaps because the test conditions may be subject to noise not present
in the training data, or there are changes in the population from which the data are
drawn (population drift), then these differences must be taken into account in classifier
design.

1.4 Supervised versus unsupervised
There are two main divisions of classification: supervised classification (or discrimination) and unsupervised classification (sometimes in the statistics literature simply referred
to as classification or clustering).
In supervised classification we have a set of data samples (each consisting of measurements on a set of variables) with associated labels, the class types. These are used
as exemplars in the classifier design.
Why do we wish to design an automatic means of classifying future data? Cannot
the same method that was used to label the design set be used on the test data? In
some cases this may be possible. However, even if it were possible, in practice we
may wish to develop an automatic method to reduce labour-intensive procedures. In
other cases, it may not be possible for a human to be part of the classification process.
An example of the former is in industrial inspection. A classifier can be trained using
images of components on a production line, each image labelled carefully by an operator.
However, in the practical application we would wish to save a human operator from the
tedious job, and hopefully make it more reliable. An example of the latter reason for
performing a classification automatically is in radar target recognition of objects. For
vehicle recognition, the data may be gathered by positioning vehicles on a turntable and
making measurements from all aspect angles. In the practical application, a human may
not be able to recognise an object reliably from its radar image, or the process may be
carried out remotely.
In unsupervised classification, the data are not labelled and we seek to find groups in
the data and the features that distinguish one group from another. Clustering techniques,
described further in Chapter 10, can also be used as part of a supervised classification
scheme by defining prototypes. A clustering scheme may be applied to the data for each
class separately and representative samples for each group within the class (the group
means, for example) used as the prototypes for that class.
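
A hedged sketch of this prototype idea follows (not from the book): each class is clustered separately with a small hand-rolled k-means, and an unseen pattern is assigned to the class owning the nearest prototype. The clustering routine and data are illustrative only; Chapter 10 discusses clustering algorithms properly.

```python
import numpy as np

def kmeans(X, k, n_iter=25, seed=0):
    """A very small k-means: returns k cluster means (prototypes) for the rows of X."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        assign = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2).argmin(axis=1)
        means = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else means[j]
                          for j in range(k)])
    return means

def class_prototypes(X, z, k):
    """Cluster each class separately; the group means act as prototypes for that class."""
    return {c: kmeans(X[z == c], k) for c in np.unique(z)}

def nearest_prototype_class(x, prototypes):
    """Supervised rule built on the prototypes: the class owning the closest prototype."""
    return min(prototypes,
               key=lambda c: np.linalg.norm(prototypes[c] - x, axis=1).min())

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([4.0, 4.0], 1.0, size=(50, 2))])
z = np.repeat([1, 2], 50)
protos = class_prototypes(X, z, k=2)
print(nearest_prototype_class(np.array([3.5, 4.2]), protos))   # expected: class 2
```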

1.5 Approaches to statistical pattern recognition
The problem we are addressing in this book is primarily one of pattern classification. Given a set of measurements obtained through observation and represented as
a pattern vector $\mathbf{x}$, we wish to assign the pattern to one of $C$ possible classes $\omega_i$,
$i = 1, \ldots, C$. A decision rule partitions the measurement space into $C$ regions $\Omega_i$,
$i = 1, \ldots, C$. If an observation vector is in $\Omega_i$ then it is assumed to belong to class
$\omega_i$. Each region may be multiply connected – that is, it may be made up of several
disjoint regions. The boundaries between the regions $\Omega_i$ are the decision boundaries or
decision surfaces. Generally, it is in regions close to these boundaries that the highest proportion of misclassifications occurs. In such situations, we may reject the pattern or withhold a decision until further information is available so that a classification
may be made later. This option is known as the reject option and therefore we have
$C + 1$ outcomes of a decision rule (the reject option being denoted by $\omega_0$) in a $C$-class
problem.
In this section we introduce two approaches to discrimination that will be explored
further in later chapters. The first assumes a knowledge of the underlying class-conditional
probability density functions (the probability density function of the feature vectors for
a given class). Of course, in many applications these will usually be unknown and must
be estimated from a set of correctly classified samples termed the design or training
set. Chapters 2 and 3 describe techniques for estimating the probability density functions
explicitly.

The second approach introduced in this section develops decision rules that use the
data to estimate the decision boundaries directly, without explicit calculation of the
probability density functions. This approach is developed in Chapters 4, 5 and 6 where
specific techniques are described.

1.5.1 Elementary decision theory
Here we introduce an approach to discrimination based on knowledge of the probability
density functions of each class. Familiarity with basic probability theory is assumed.
Some basic definitions are given in Appendix E.


Bayes decision rule for minimum error
Consider $C$ classes, $\omega_1, \ldots, \omega_C$, with a priori probabilities (the probabilities of each class
occurring) $p(\omega_1), \ldots, p(\omega_C)$, assumed known. If we wish to minimise the probability
of making an error and we have no information regarding an object other than the class
probability distribution then we would assign an object to class $\omega_j$ if

$$p(\omega_j) > p(\omega_k), \qquad k = 1, \ldots, C; \; k \ne j$$

This classifies all objects as belonging to one class. For classes with equal probabilities,
patterns are assigned arbitrarily between those classes.
However, we do have an observation vector or measurement vector $\mathbf{x}$ and we wish
to assign $\mathbf{x}$ to one of the $C$ classes. A decision rule based on probabilities is to assign
$\mathbf{x}$ to class $\omega_j$ if the probability of class $\omega_j$ given the observation $\mathbf{x}$, $p(\omega_j | \mathbf{x})$, is greatest
over all classes $\omega_1, \ldots, \omega_C$. That is, assign $\mathbf{x}$ to class $\omega_j$ if

$$p(\omega_j | \mathbf{x}) > p(\omega_k | \mathbf{x}), \qquad k = 1, \ldots, C; \; k \ne j \tag{1.1}$$

This decision rule partitions the measurement space into $C$ regions $\Omega_1, \ldots, \Omega_C$ such that
if $\mathbf{x} \in \Omega_j$ then $\mathbf{x}$ belongs to class $\omega_j$.
The a posteriori probabilities $p(\omega_j | \mathbf{x})$ may be expressed in terms of the a priori
probabilities and the class-conditional density functions $p(\mathbf{x} | \omega_i)$ using Bayes’ theorem
(see Appendix E) as

$$p(\omega_i | \mathbf{x}) = \frac{p(\mathbf{x} | \omega_i)\, p(\omega_i)}{p(\mathbf{x})}$$

and so the decision rule (1.1) may be written: assign $\mathbf{x}$ to $\omega_j$ if

$$p(\mathbf{x} | \omega_j)\, p(\omega_j) > p(\mathbf{x} | \omega_k)\, p(\omega_k), \qquad k = 1, \ldots, C; \; k \ne j \tag{1.2}$$

This is known as Bayes’ rule for minimum error.
For two classes, the decision rule (1.2) may be written

$$l_r(\mathbf{x}) = \frac{p(\mathbf{x} | \omega_1)}{p(\mathbf{x} | \omega_2)} > \frac{p(\omega_2)}{p(\omega_1)} \quad \text{implies } \mathbf{x} \in \text{class } \omega_1$$

The function $l_r(\mathbf{x})$ is the likelihood ratio. Figures 1.2 and 1.3 give a simple illustration for
a two-class discrimination problem. Class $\omega_1$ is normally distributed with zero mean and
unit variance, $p(x | \omega_1) = N(x | 0, 1)$ (see Appendix E). Class $\omega_2$ is a normal mixture (a
weighted sum of normal densities) $p(x | \omega_2) = 0.6\, N(x | 1, 1) + 0.4\, N(x | {-1}, 2)$. Figure 1.2
plots $p(x | \omega_i)\, p(\omega_i)$, $i = 1, 2$, where the priors are taken to be $p(\omega_1) = 0.5$, $p(\omega_2) = 0.5$.
Figure 1.3 plots the likelihood ratio $l_r(x)$ and the threshold $p(\omega_2)/p(\omega_1)$. We see from
this figure that the decision rule (1.2) leads to a disjoint region for class $\omega_2$.
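
The two-class example above is easy to reproduce numerically. The following sketch (not from the book) evaluates the likelihood ratio $l_r(x)$ on a grid and locates where it crosses the threshold $p(\omega_2)/p(\omega_1)$; it assumes the second argument of $N(\cdot | \mu, \sigma^2)$ is a variance and uses SciPy's normal density.

```python
import numpy as np
from scipy.stats import norm

# Class-conditional densities of the illustration above.
def p_x_given_w1(x):
    return norm.pdf(x, loc=0.0, scale=1.0)                      # N(x|0, 1)

def p_x_given_w2(x):
    return (0.6 * norm.pdf(x, loc=1.0, scale=1.0)               # 0.6 N(x|1, 1)
            + 0.4 * norm.pdf(x, loc=-1.0, scale=np.sqrt(2.0)))  # 0.4 N(x|-1, 2)

prior_w1, prior_w2 = 0.5, 0.5

def bayes_decision(x):
    """Rule (1.2): assign x to class 1 if l_r(x) > p(w2)/p(w1), otherwise to class 2."""
    lr = p_x_given_w1(x) / p_x_given_w2(x)
    return np.where(lr > prior_w2 / prior_w1, 1, 2)

# Locate the decision boundaries from sign changes of l_r(x) minus the threshold.
x = np.linspace(-4.0, 4.0, 8001)
lr = p_x_given_w1(x) / p_x_given_w2(x)
crossings = np.flatnonzero(np.diff(np.sign(lr - prior_w2 / prior_w1)))
print("approximate decision boundaries:", x[crossings])
print("class assigned at x = 0:", bayes_decision(0.0))
```

The two reported crossings bracket a central interval assigned to class $\omega_1$, with class $\omega_2$ taking both tails, matching the disjoint class $\omega_2$ region seen in Figure 1.3.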
The fact that the decision rule (1.2) minimises the error may be seen as follows. The
probability of making an error, $p(\text{error})$, may be expressed as

$$p(\text{error}) = \sum_{i=1}^{C} p(\text{error} | \omega_i)\, p(\omega_i) \tag{1.3}$$

