Introduction to Machine Learning
Second Edition
Adaptive Computation and Machine Learning
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors
A complete list of books published in The Adaptive Computation and
Machine Learning series appears at the back of this book.
Ethem Alpaydın
The MIT Press
Cambridge, Massachusetts
London, England
© 2010 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.
For information about special quantity discounts, please email
Typeset in 10/13 Lucida Bright by the author using LaTeX 2ε.
Printed and bound in the United States of America.
Library of Congress Cataloging-in-Publication Information
Alpaydin, Ethem.
Introduction to machine learning / Ethem Alpaydin. — 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-262-01243-0 (hardcover : alk. paper)
1. Machine learning. I. Title
Q325.5.A46 2010
006.3’1—dc22
2009013169
CIP
10 9 8 7 6 5 4 3 2 1
Brief Contents
1  Introduction
2  Supervised Learning
3  Bayesian Decision Theory
4  Parametric Methods
5  Multivariate Methods
6  Dimensionality Reduction
7  Clustering
8  Nonparametric Methods
9  Decision Trees
10 Linear Discrimination
11 Multilayer Perceptrons
12 Local Models
13 Kernel Machines
14 Bayesian Estimation
15 Hidden Markov Models
16 Graphical Models
17 Combining Multiple Learners
18 Reinforcement Learning
19 Design and Analysis of Machine Learning Experiments
A  Probability
Contents
Series Foreword
Figures
Tables
Preface
Acknowledgments
Notes for the Second Edition
Notations
1 Introduction
   1.1  What Is Machine Learning?
   1.2  Examples of Machine Learning Applications
        1.2.1  Learning Associations
        1.2.2  Classification
        1.2.3  Regression
        1.2.4  Unsupervised Learning
        1.2.5  Reinforcement Learning
   1.3  Notes
   1.4  Relevant Resources
   1.5  Exercises
   1.6  References
2 Supervised Learning
   2.1  Learning a Class from Examples
   2.2  Vapnik-Chervonenkis (VC) Dimension
   2.3  Probably Approximately Correct (PAC) Learning
   2.4  Noise
   2.5  Learning Multiple Classes
   2.6  Regression
   2.7  Model Selection and Generalization
   2.8  Dimensions of a Supervised Machine Learning Algorithm
   2.9  Notes
   2.10 Exercises
   2.11 References
3 Bayesian Decision Theory
   3.1  Introduction
   3.2  Classification
   3.3  Losses and Risks
   3.4  Discriminant Functions
   3.5  Utility Theory
   3.6  Association Rules
   3.7  Notes
   3.8  Exercises
   3.9  References
4 Parametric Methods
   4.1  Introduction
   4.2  Maximum Likelihood Estimation
        4.2.1  Bernoulli Density
        4.2.2  Multinomial Density
        4.2.3  Gaussian (Normal) Density
   4.3  Evaluating an Estimator: Bias and Variance
   4.4  The Bayes’ Estimator
   4.5  Parametric Classification
   4.6  Regression
   4.7  Tuning Model Complexity: Bias/Variance Dilemma
   4.8  Model Selection Procedures
   4.9  Notes
   4.10 Exercises
   4.11 References
5 Multivariate Methods
   5.1  Multivariate Data
   5.2  Parameter Estimation
   5.3  Estimation of Missing Values
   5.4  Multivariate Normal Distribution
   5.5  Multivariate Classification
   5.6  Tuning Complexity
   5.7  Discrete Features
   5.8  Multivariate Regression
   5.9  Notes
   5.10 Exercises
   5.11 References
6 Dimensionality Reduction
   6.1  Introduction
   6.2  Subset Selection
   6.3  Principal Components Analysis
   6.4  Factor Analysis
   6.5  Multidimensional Scaling
   6.6  Linear Discriminant Analysis
   6.7  Isomap
   6.8  Locally Linear Embedding
   6.9  Notes
   6.10 Exercises
   6.11 References
7 Clustering
   7.1  Introduction
   7.2  Mixture Densities
   7.3  k-Means Clustering
   7.4  Expectation-Maximization Algorithm
   7.5  Mixtures of Latent Variable Models
   7.6  Supervised Learning after Clustering
   7.7  Hierarchical Clustering
   7.8  Choosing the Number of Clusters
   7.9  Notes
   7.10 Exercises
   7.11 References
8 Nonparametric Methods
   8.1  Introduction
   8.2  Nonparametric Density Estimation
        8.2.1  Histogram Estimator
        8.2.2  Kernel Estimator
        8.2.3  k-Nearest Neighbor Estimator
   8.3  Generalization to Multivariate Data
   8.4  Nonparametric Classification
   8.5  Condensed Nearest Neighbor
   8.6  Nonparametric Regression: Smoothing Models
        8.6.1  Running Mean Smoother
        8.6.2  Kernel Smoother
        8.6.3  Running Line Smoother
   8.7  How to Choose the Smoothing Parameter
   8.8  Notes
   8.9  Exercises
   8.10 References
9 Decision Trees
   9.1  Introduction
   9.2  Univariate Trees
        9.2.1  Classification Trees
        9.2.2  Regression Trees
   9.3  Pruning
   9.4  Rule Extraction from Trees
   9.5  Learning Rules from Data
   9.6  Multivariate Trees
   9.7  Notes
   9.8  Exercises
   9.9  References
10 Linear Discrimination
   10.1  Introduction
   10.2  Generalizing the Linear Model
   10.3  Geometry of the Linear Discriminant
         10.3.1  Two Classes
         10.3.2  Multiple Classes
   10.4  Pairwise Separation
   10.5  Parametric Discrimination Revisited
   10.6  Gradient Descent
   10.7  Logistic Discrimination
         10.7.1  Two Classes
         10.7.2  Multiple Classes
   10.8  Discrimination by Regression
   10.9  Notes
   10.10 Exercises
   10.11 References
11 Multilayer Perceptrons
   11.1  Introduction
         11.1.1  Understanding the Brain
         11.1.2  Neural Networks as a Paradigm for Parallel Processing
   11.2  The Perceptron
   11.3  Training a Perceptron
   11.4  Learning Boolean Functions
   11.5  Multilayer Perceptrons
   11.6  MLP as a Universal Approximator
   11.7  Backpropagation Algorithm
         11.7.1  Nonlinear Regression
         11.7.2  Two-Class Discrimination
         11.7.3  Multiclass Discrimination
         11.7.4  Multiple Hidden Layers
   11.8  Training Procedures
         11.8.1  Improving Convergence
         11.8.2  Overtraining
         11.8.3  Structuring the Network
         11.8.4  Hints
   11.9  Tuning the Network Size
   11.10 Bayesian View of Learning
   11.11 Dimensionality Reduction
   11.12 Learning Time
         11.12.1 Time Delay Neural Networks
         11.12.2 Recurrent Networks
   11.13 Notes
   11.14 Exercises
   11.15 References
12 Local Models
   12.1  Introduction
   12.2  Competitive Learning
         12.2.1  Online k-Means
         12.2.2  Adaptive Resonance Theory
         12.2.3  Self-Organizing Maps
   12.3  Radial Basis Functions
   12.4  Incorporating Rule-Based Knowledge
   12.5  Normalized Basis Functions
   12.6  Competitive Basis Functions
   12.7  Learning Vector Quantization
   12.8  Mixture of Experts
         12.8.1  Cooperative Experts
         12.8.2  Competitive Experts
   12.9  Hierarchical Mixture of Experts
   12.10 Notes
   12.11 Exercises
   12.12 References
13 Kernel Machines
   13.1  Introduction
   13.2  Optimal Separating Hyperplane
   13.3  The Nonseparable Case: Soft Margin Hyperplane
   13.4  ν-SVM
   13.5  Kernel Trick
   13.6  Vectorial Kernels
   13.7  Defining Kernels
   13.8  Multiple Kernel Learning
   13.9  Multiclass Kernel Machines
   13.10 Kernel Machines for Regression
   13.11 One-Class Kernel Machines
   13.12 Kernel Dimensionality Reduction
   13.13 Notes
   13.14 Exercises
   13.15 References
14 Bayesian Estimation
   14.1  Introduction
   14.2  Estimating the Parameter of a Distribution
         14.2.1  Discrete Variables
         14.2.2  Continuous Variables
   14.3  Bayesian Estimation of the Parameters of a Function
         14.3.1  Regression
         14.3.2  The Use of Basis/Kernel Functions
         14.3.3  Bayesian Classification
   14.4  Gaussian Processes
   14.5  Notes
   14.6  Exercises
   14.7  References
15 Hidden Markov Models
   15.1  Introduction
   15.2  Discrete Markov Processes
   15.3  Hidden Markov Models
   15.4  Three Basic Problems of HMMs
   15.5  Evaluation Problem
   15.6  Finding the State Sequence
   15.7  Learning Model Parameters
   15.8  Continuous Observations
   15.9  The HMM with Input
   15.10 Model Selection in HMM
   15.11 Notes
   15.12 Exercises
   15.13 References
16 Graphical Models
   16.1  Introduction
   16.2  Canonical Cases for Conditional Independence
   16.3  Example Graphical Models
         16.3.1  Naive Bayes’ Classifier
         16.3.2  Hidden Markov Model
         16.3.3  Linear Regression
   16.4  d-Separation
   16.5  Belief Propagation
         16.5.1  Chains
         16.5.2  Trees
         16.5.3  Polytrees
         16.5.4  Junction Trees
   16.6  Undirected Graphs: Markov Random Fields
   16.7  Learning the Structure of a Graphical Model
   16.8  Influence Diagrams
   16.9  Notes
   16.10 Exercises
   16.11 References
17 Combining Multiple Learners
   17.1  Rationale
   17.2  Generating Diverse Learners
   17.3  Model Combination Schemes
   17.4  Voting
   17.5  Error-Correcting Output Codes
   17.6  Bagging
   17.7  Boosting
   17.8  Mixture of Experts Revisited
   17.9  Stacked Generalization
   17.10 Fine-Tuning an Ensemble
   17.11 Cascading
   17.12 Notes
   17.13 Exercises
   17.14 References
18 Reinforcement Learning
   18.1  Introduction
   18.2  Single State Case: K-Armed Bandit
   18.3  Elements of Reinforcement Learning
   18.4  Model-Based Learning
         18.4.1  Value Iteration
         18.4.2  Policy Iteration
   18.5  Temporal Difference Learning
         18.5.1  Exploration Strategies
         18.5.2  Deterministic Rewards and Actions
         18.5.3  Nondeterministic Rewards and Actions
         18.5.4  Eligibility Traces
   18.6  Generalization
   18.7  Partially Observable States
         18.7.1  The Setting
         18.7.2  Example: The Tiger Problem
   18.8  Notes
   18.9  Exercises
   18.10 References
19 Design and Analysis of Machine Learning Experiments
   19.1  Introduction
   19.2  Factors, Response, and Strategy of Experimentation
   19.3  Response Surface Design
   19.4  Randomization, Replication, and Blocking
   19.5  Guidelines for Machine Learning Experiments
   19.6  Cross-Validation and Resampling Methods
         19.6.1  K-Fold Cross-Validation
         19.6.2  5×2 Cross-Validation
         19.6.3  Bootstrapping
   19.7  Measuring Classifier Performance
   19.8  Interval Estimation
   19.9  Hypothesis Testing
   19.10 Assessing a Classification Algorithm’s Performance
         19.10.1 Binomial Test
         19.10.2 Approximate Normal Test
         19.10.3 t Test
   19.11 Comparing Two Classification Algorithms
         19.11.1 McNemar’s Test
         19.11.2 K-Fold Cross-Validated Paired t Test
         19.11.3 5 × 2 cv Paired t Test
         19.11.4 5 × 2 cv Paired F Test
   19.12 Comparing Multiple Algorithms: Analysis of Variance
   19.13 Comparison over Multiple Datasets
         19.13.1 Comparing Two Algorithms
         19.13.2 Multiple Algorithms
   19.14 Notes
   19.15 Exercises
   19.16 References
A Probability
   A.1  Elements of Probability
        A.1.1  Axioms of Probability
        A.1.2  Conditional Probability
   A.2  Random Variables
        A.2.1  Probability Distribution and Density Functions
        A.2.2  Joint Distribution and Density Functions
        A.2.3  Conditional Distributions
        A.2.4  Bayes’ Rule
        A.2.5  Expectation
        A.2.6  Variance
        A.2.7  Weak Law of Large Numbers
   A.3  Special Random Variables
        A.3.1  Bernoulli Distribution
        A.3.2  Binomial Distribution
        A.3.3  Multinomial Distribution
        A.3.4  Uniform Distribution
        A.3.5  Normal (Gaussian) Distribution
        A.3.6  Chi-Square Distribution
        A.3.7  t Distribution
        A.3.8  F Distribution
   A.4  References
Index
Series Foreword
The goal of building systems that can adapt to their environments and learn from their experience has attracted researchers from many fields, including computer science, engineering, mathematics, physics, neuroscience, and cognitive science. Out of this research has come a wide variety of learning techniques that are transforming many industrial and scientific fields. Recently, several research communities have converged on a common set of issues surrounding supervised, semi-supervised, unsupervised, and reinforcement learning problems. The MIT Press Series on Adaptive Computation and Machine Learning seeks to unify the many diverse strands of machine learning research and to foster high-quality research and innovative applications.

The MIT Press is extremely pleased to publish this second edition of Ethem Alpaydın’s introductory textbook. This book presents a readable and concise introduction to machine learning that reflects these diverse research strands while providing a unified treatment of the field. The book covers all of the main problem formulations and introduces the most important algorithms and techniques encompassing methods from computer science, neural computation, information theory, and statistics. The second edition expands and updates coverage of several areas, particularly kernel machines and graphical models, that have advanced rapidly over the past five years. This updated work continues to be a compelling textbook for introductory courses in machine learning at the undergraduate and beginning graduate level.
Figures
1.1  Example of a training dataset where each circle corresponds to one data instance with input values in the corresponding axes and its sign indicates the class.
1.2  A training dataset of used cars and the function fitted.
2.1  Training set for the class of a “family car.”
2.2  Example of a hypothesis class.
2.3  C is the actual class and h is our induced hypothesis.
2.4  S is the most specific and G is the most general hypothesis.
2.5  We choose the hypothesis with the largest margin, for best separation.
2.6  An axis-aligned rectangle can shatter four points.
2.7  The difference between h and C is the sum of four rectangular strips, one of which is shaded.
2.8  When there is noise, there is not a simple boundary between the positive and negative instances, and zero misclassification error may not be possible with a simple hypothesis.
2.9  There are three classes: family car, sports car, and luxury sedan.
2.10 Linear, second-order, and sixth-order polynomials are fitted to the same set of points.
2.11 A line separating positive and negative instances.
3.1  Example of decision regions and decision boundaries.
4.1  θ is the parameter to be estimated.
4.2  (a) Likelihood functions and (b) posteriors with equal priors for two classes when the input is one-dimensional.
4.3  (a) Likelihood functions and (b) posteriors with equal priors for two classes when the input is one-dimensional.
4.4  Regression assumes 0 mean Gaussian noise added to the model; here, the model is linear.
4.5  (a) Function, f(x) = 2 sin(1.5x), and one noisy (N(0, 1)) dataset sampled from the function.
4.6  In the same setting as that of figure 4.5, using one hundred models instead of five, bias, variance, and error for polynomials of order 1 to 5.
4.7  In the same setting as that of figure 4.5, training and validation sets (each containing 50 instances) are generated.
4.8  In the same setting as that of figure 4.5, polynomials of order 1 to 4 are fitted.
5.1  Bivariate normal distribution.
5.2  Isoprobability contour plot of the bivariate normal distribution.
5.3  Classes have different covariance matrices.
5.4  Covariances may be arbitrary but shared by both classes.
5.5  All classes have equal, diagonal covariance matrices, but variances are not equal.
5.6  All classes have equal, diagonal covariance matrices of equal variances on both dimensions.
5.7  Different cases of the covariance matrices fitted to the same data lead to different boundaries.
6.1  Principal components analysis centers the sample and then rotates the axes to line up with the directions of highest variance.
6.2  (a) Scree graph. (b) Proportion of variance explained is given for the Optdigits dataset from the UCI Repository.
6.3  Optdigits data plotted in the space of two principal components.
6.4  Principal components analysis generates new variables that are linear combinations of the original input variables.
6.5  Factors are independent unit normals that are stretched, rotated, and translated to make up the inputs.
6.6  Map of Europe drawn by MDS.
6.7  Two-dimensional, two-class data projected on w.
6.8  Optdigits data plotted in the space of the first two dimensions found by LDA.
6.9  Geodesic distance is calculated along the manifold as opposed to the Euclidean distance that does not use this information.
6.10 Local linear embedding first learns the constraints in the original space and next places the points in the new space respecting those constraints.
7.1  Given x, the encoder sends the index of the closest code word and the decoder generates the code word with the received index as x′.
7.2  Evolution of k-means.
7.3  k-means algorithm.
7.4  Data points and the fitted Gaussians by EM, initialized by one k-means iteration of figure 7.2.
7.5  A two-dimensional dataset and the dendrogram showing the result of single-link clustering.
8.1  Histograms for various bin lengths.
8.2  Naive estimate for various bin lengths.
8.3  Kernel estimate for various bin lengths.
8.4  k-nearest neighbor estimate for various k values.
8.5  Dotted lines are the Voronoi tessellation and the straight line is the class discriminant.
8.6  Condensed nearest neighbor algorithm.
8.7  Regressograms for various bin lengths. ‘×’ denote data points.
8.8  Running mean smooth for various bin lengths.
8.9  Kernel smooth for various bin lengths.
8.10 Running line smooth for various bin lengths.
8.11 Kernel estimate for various bin lengths for a two-class problem.
8.12 Regressograms with linear fits in bins for various bin lengths.
9.1  Example of a dataset and the corresponding decision tree.
9.2  Entropy function for a two-class problem.
9.3  Classification tree construction.
9.4  Regression tree smooths for various values of θr.
9.5  Regression trees implementing the smooths of figure 9.4 for various values of θr.
9.6  Example of a (hypothetical) decision tree.
9.7  Ripper algorithm for learning rules.
9.8  Example of a linear multivariate decision tree.
10.1  In the two-dimensional case, the linear discriminant is a line that separates the examples from two classes.
10.2  The geometric interpretation of the linear discriminant.
10.3  In linear classification, each hyperplane Hi separates the examples of Ci from the examples of all other classes.
10.4  In pairwise linear separation, there is a separate hyperplane for each pair of classes.
10.5  The logistic, or sigmoid, function.
10.6  Logistic discrimination algorithm implementing gradient descent for the single output case with two classes.
10.7  For a univariate two-class problem (shown with ‘◦’ and ‘×’), the evolution of the line wx + w0 and the sigmoid output after 10, 100, and 1,000 iterations over the sample.
10.8  Logistic discrimination algorithm implementing gradient descent for the case with K > 2 classes.
10.9  For a two-dimensional problem with three classes, the solution found by logistic discrimination.
10.10 For the same example in figure 10.9, the linear discriminants (top), and the posterior probabilities after the softmax (bottom).
11.1  Simple perceptron.
11.2  K parallel perceptrons.
11.3  Perceptron training algorithm implementing stochastic online gradient descent for the case with K > 2 classes.
11.4  The perceptron that implements AND and its geometric interpretation.
11.5  XOR problem is not linearly separable.
11.6  The structure of a multilayer perceptron.
11.7  The multilayer perceptron that solves the XOR problem.
11.8  Sample training data shown as ‘+’, where xt ∼ U(−0.5, 0.5), and yt = f(xt) + N(0, 0.1).
11.9  The mean square error on training and validation sets as a function of training epochs.
11.10 (a) The hyperplanes of the hidden unit weights on the first layer, (b) hidden unit outputs, and (c) hidden unit outputs multiplied by the weights on the second layer.
11.11 Backpropagation algorithm for training a multilayer perceptron for regression with K outputs.
11.12 As complexity increases, training error is fixed but the validation error starts to increase and the network starts to overfit.
11.13 As training continues, the validation error starts to increase and the network starts to overfit.
11.14 A structured MLP.
11.15 In weight sharing, different units have connections to different inputs but share the same weight value (denoted by line type).
11.16 The identity of the object does not change when it is translated, rotated, or scaled.
11.17 Two examples of constructive algorithms.
11.18 Optdigits data plotted in the space of the two hidden units of an MLP trained for classification.
11.19 In the autoassociator, there are as many outputs as there are inputs and the desired outputs are the inputs.
11.20 A time delay neural network.
11.21 Examples of MLP with partial recurrency.
11.22 Backpropagation through time.
12.1  Shaded circles are the centers and the empty circle is the input instance.
12.2  Online k-means algorithm.
12.3  The winner-take-all competitive neural network, which is a network of k perceptrons with recurrent connections at the output.
12.4  The distance from xa to the closest center is less than the vigilance value ρ and the center is updated as in online k-means.
12.5  In the SOM, not only the closest unit but also its neighbors, in terms of indices, are moved toward the input.
12.6  The one-dimensional form of the bell-shaped function used in the radial basis function network.
12.7  The difference between local and distributed representations.
12.8  The RBF network where ph are the hidden units using the bell-shaped activation function.
12.9  (-) Before and (- -) after normalization for three Gaussians whose centers are denoted by ‘*’.
12.10 The mixture of experts can be seen as an RBF network where the second-layer weights are outputs of linear models.
12.11 The mixture of experts can be seen as a model for combining multiple models.
13.1  For a two-class problem where the instances of the classes are shown by plus signs and dots, the thick line is the boundary and the dashed lines define the margins on either side.
13.2  In classifying an instance, there are four possible cases.
13.3  Comparison of different loss functions for rt = 1.
13.4  The discriminant and margins found by a polynomial kernel of degree 2.
13.5  The boundary and margins found by the Gaussian kernel with different spread values, s².
13.6  Quadratic and ε-sensitive error functions.
13.7  The fitted regression line to data points shown as crosses and the ε-tube are shown (C = 10, ε = 0.25).
13.8  The fitted regression line and the ε-tube using a quadratic kernel are shown (C = 10, ε = 0.25).
13.9  The fitted regression line and the ε-tube using a Gaussian kernel with two different spreads are shown (C = 10, ε = 0.25).
13.10 One-class support vector machine places the smoothest boundary (here using a linear kernel, the circle with the smallest radius) that encloses as much of the instances as possible.
13.11 One-class support vector machine using a Gaussian kernel with different spreads.
13.12 Instead of using a quadratic kernel in the original space (a), we can use kernel PCA on the quadratic kernel values to map to a two-dimensional new space where we use a linear discriminant (b); these two dimensions (out of five) explain 80 percent of the variance.
14.1  The generative graphical model.
14.2  Plots of beta distributions for different sets of (α, β).
14.3  20 data points are drawn from p(x) ∼ N(6, 1.5²), prior is p(μ) ∼ N(4, 0.8²), and posterior is then p(μ|X) ∼ N(5.7, 0.3²).
14.4  Bayesian linear regression for different values of α and β.
14.5  Bayesian regression using kernels with one standard deviation error bars.
14.6  Gaussian process regression with one standard deviation error bars.
14.7  Gaussian process regression using a Gaussian kernel with s² = 0.5 and varying number of training data.
15.1  Example of a Markov model with three states.
15.2  An HMM unfolded in time as a lattice (or trellis) showing all the possible trajectories.
15.3  Forward-backward procedure.
15.4  Computation of arc probabilities, ξt(i, j).
15.5  Example of a left-to-right HMM.
16.1  Bayesian network modeling that rain is the cause of wet grass.
16.2  Head-to-tail connection.
16.3  Tail-to-tail connection.
16.4  Head-to-head connection.
16.5  Larger graphs are formed by combining simpler subgraphs over which information is propagated using the implied conditional independencies.
16.6  (a) Graphical model for classification. (b) Naive Bayes’ classifier assumes independent inputs.
16.7  Hidden Markov model can be drawn as a graphical model where qt are the hidden states and shaded Ot are observed.