
37
Bias vs Variance Decomposition For Regression and
Classification
Pierre Geurts
Department of Electrical Engineering and Computer Science, University of Liège, Belgium.
Postdoctoral Researcher, F.N.R.S., Belgium
Summary. In this chapter, the important concepts of bias and variance are introduced. After
an intuitive introduction to the bias/variance tradeoff, we discuss the bias/variance decom-
positions of the mean square error (in the context of regression problems) and of the mean
misclassification error (in the context of classification problems). Then, we carry out a small
empirical study providing some insight about how the parameters of a learning algorithm in-
fluence bias and variance.
Key words: bias, variance, supervised learning, overfitting
37.1 Introduction
The general problem of supervised learning is often formulated as an optimization problem.
An error measure is defined that evaluates the quality of a model and the goal of learning
is to find, in a family of models (the hypothesis space), a model that minimizes this error
estimated on the learning sample (or dataset) S. So, at first sight, if no good enough model is
found in this family, it should be sufficient to extend the family or to exchange it for a more
powerful one in terms of model flexibility. However, we are often interested in a model that

generalizes well to unseen data rather than a model that perfectly predicts the output for
the learning sample cases. And, unfortunately, in practice, good results on the learning set do
not necessarily imply good generalization performance on unseen data, especially if the “size”
of the hypothesis space is large in comparison to the sample size.
Let us use a simple one-dimensional regression problem to explain intuitively why larger
hypothesis spaces do not necessarily lead to better models. In this synthetic problem, learning
outputs are generated according to $y = f_b(x) + \varepsilon$, where $f_b$ is represented by the dashed curves
in Figure 37.1 and $\varepsilon$ is distributed according to a Gaussian $N(0,\sigma)$ distribution. With squared
error loss, we will see below that the best possible model for this problem is $f_b$ and its average
squared error is $\sigma^2$. Let us consider two extreme situations of a bad model structure choice.
• A too simple model: using a linear model $y = w \cdot x + b$ and minimizing squared error on
the learning set, we obtain the estimations given in the left part of Figure 37.1 for two
different learning set choices. These models are not very good, neither on their learning
sets nor in generalization. Whatever the learning set, there will always remain an error
due to the fact that the model is too simple with respect to the complexity of $f_b$.

Fig. 37.1. Left, a linear model fitted to two learning samples. Right, a neural network fitted to
the same samples

O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed.,
DOI 10.1007/978-0-387-09823-4_37, © Springer Science+Business Media, LLC 2010
• A too complex model: by using a very complex model like a neural network with two
hidden layers of ten neurons each, we get the functions in the right part of Figure 37.1 for
the same learning sets. This time, the models receive an almost perfect score on the learning
set. However, their generalization errors are still not very good, because of two phenomena.
First, the learning algorithm is able to fit the learning set perfectly, and hence also its
noise term. We say in this case that the learning algorithm "overfits" the data. Second,
even if there is no noise, some errors will still remain due to the high complexity of
the model. Indeed, the learning algorithm has many different models at its disposal and, if
the learning set is relatively small, several of them will fit the learning set perfectly.
Since at most one of them coincides with the best model, any other choice
by the learning algorithm will result in suboptimality.
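These two failure modes are easy to reproduce numerically. The following sketch is illustrative only: the true curve, the noise level, and the sample sizes are all assumptions (the chapter does not specify them), and a high-degree polynomial stands in for the over-complex neural network. It fits a too-simple and a too-complex model to synthetic data of the form $y = f_b(x) + \varepsilon$ and compares the error on the learning set with the error on fresh data:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3  # assumed noise level

def f_b(x):
    # Hypothetical best model; the chapter's true curve is not specified.
    return np.sin(2 * np.pi * x)

def sample_set(m):
    x = rng.uniform(0, 1, m)
    return x, f_b(x) + rng.normal(0, sigma, m)

def fit_and_errors(degree, m=20):
    x, y = sample_set(m)
    coeffs = np.polyfit(x, y, degree)              # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    xt, yt = sample_set(5000)                      # fresh data -> generalization error
    test_mse = np.mean((np.polyval(coeffs, xt) - yt) ** 2)
    return train_mse, test_mse

simple_train, simple_test = fit_and_errors(degree=1)    # too simple
complex_train, complex_test = fit_and_errors(degree=12) # too complex
```

The complex model almost interpolates its learning set, yet its error on fresh data stays well above its learning-set error, while the simple model is mediocre on both.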
The main source of error is very different in both cases. In the first case, the error is
essentially independent of the particular learning set and must be attributed to the lack of
complexity of the model. This source of error is called bias. In the second case, on the other
hand, the error may be attributed to the variability of the model from one learning set to
another (which is due on one hand to overfitting and on the other hand to the sparse nature
of the learning set with respect to the complexity of the model). This source of error is called
variance. Note that in the first case there is also a dependence of the model on the learning set
and thus some variability of the predictions. However the resulting variance is negligible with
respect to bias. In general, bias and variance both depend on the complexity of the model, but
in opposite directions, and thus there must exist an optimal tradeoff between these two sources
of error. As a matter of fact, this optimal tradeoff also depends on the smoothness of the best
model and on the sample size. An important consequence of this is that, because of variance,
we should take care not to increase the complexity of the model structure too much
with respect to the complexity of the problem and the size of the learning sample.
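This tradeoff can be sketched numerically as well. Assuming, purely for illustration, a sine as the best model and least-squares polynomial fitting as the learning algorithm (neither comes from the chapter), the estimated squared bias at a fixed point shrinks with the polynomial degree while the estimated variance grows:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3  # assumed noise level

def f_b(x):
    return np.sin(2 * np.pi * x)  # hypothetical best model

def bias2_and_var(x0, degree, m=25, n_sets=300):
    """Monte Carlo estimate of squared bias and variance at point x0
    for least-squares polynomial fitting of a given degree."""
    preds = np.empty(n_sets)
    for i in range(n_sets):
        x = rng.uniform(0, 1, m)
        y = f_b(x) + rng.normal(0, sigma, m)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x0)
    f_avg = preds.mean()                     # average model at x0
    return (f_b(x0) - f_avg) ** 2, preds.var()

b_simple, v_simple = bias2_and_var(0.25, degree=1)     # too simple
b_complex, v_complex = bias2_and_var(0.25, degree=12)  # too complex
```

With the simple model, bias dominates variance; with the complex model, the situation is reversed, which is exactly the tradeoff described above.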
In the next section, we give a formal additive decomposition of the mean (over all learn-
ing set choices) squared error into two terms which represent the bias and the variance effect.
Similar decompositions that have been proposed in the context of 0-1 loss functions are also discussed.
They reveal fundamental differences between the two types of problems, although
the bias and variance concepts remain useful in both. Section 3 discusses procedures to estimate bias
and variance terms for practical problems. In Section 4, we give some experiments and appli-
cations of bias/variance decompositions.
37.2 Bias/Variance Decompositions
Let us introduce some notations. A learning sample S is a collection of m input/output pairs
$(\langle x_1, y_1 \rangle, \ldots, \langle x_m, y_m \rangle)$, each one randomly and independently drawn from a probability
distribution $P_D(x,y)$. A learning algorithm I produces a model I(S) from S, i.e. a function from
inputs x to the domain of y. The error of this model is computed as the expectation:

$$\mathrm{Error}(I(S)) = E_{x,y}[L(y, I(S)(x))],$$

where L is some loss function that measures the discrepancy between its two arguments. Since
the learning sample S is randomly drawn from some distribution D, the model I(S) and its
prediction I(S)(x) at x are also random. Hence, $\mathrm{Error}(I(S))$ is again a random variable and we
are interested in studying the expected value of this error (over the set of all learning sets of
size m), $E_S[\mathrm{Error}(I(S))]$. This error can be decomposed into:

$$E_S[\mathrm{Error}(I(S))] = E_x[E_S[E_{y|x}[L(y, I(S)(x))]]] = E_x[E_S[\mathrm{Error}(I(S)(x))]],$$

where $\mathrm{Error}(I(S)(x))$ denotes the local error at point x.
Bias/variance decompositions usually try to decompose this error into three terms: the
residual or minimal attainable error, the systematic error, and the effect of the variance. The
exact decomposition depends on the loss function L. The next two subsections are devoted to
the most common loss functions, i.e. the squared loss for regression problems and the 0-1 loss
for classification problems. Notice however that these loss functions are not the only plausible
loss functions and several authors have studied bias/variance decompositions for other loss
functions (Wolpert, 1997; Hansen, 2000). Actually, several of the decompositions for 0-1 loss
presented below are derived as special cases of more general bias/variance decompositions
(Tibshirani, 1996; Wolpert, 1997; Heskes, 1998; Domingos, 1996; James, 2003). The interested
reader may refer to these references for more details.
37.2.1 Bias/Variance Decomposition of the Squared Loss
When the output y is numerical, the usual loss function is the squared loss $L_2(y_1, y_2) = (y_1 - y_2)^2$.
With this loss function, it is easy to show that the best possible model is $f_b(x) = E_{y|x}[y]$,
which takes the expectation of the target y at each point x. The best model according to a given
loss function is often called the Bayes model in statistical pattern recognition. Introducing this
model in the mean local error, we get with some elementary calculations:

$$E_S[\mathrm{Error}(I(S)(x))] = E_{y|x}[(y - f_b(x))^2] + E_S[(f_b(x) - I(S)(x))^2]. \quad (37.1)$$

Symmetrically to the Bayes model, let us define the average model, $f_{avg}(x) = E_S[I(S)(x)]$,
which outputs the average prediction among all learning sets. Introducing this model in the
second term of Equation (37.1), we obtain:

$$E_S[(f_b(x) - I(S)(x))^2] = (f_b(x) - f_{avg}(x))^2 + E_S[(I(S)(x) - f_{avg}(x))^2].$$

In summary, we have the following well-known decomposition of the mean square error at a
point x:

$$E_S[\mathrm{Error}(I(S)(x))] = \sigma^2_R(x) + \mathrm{bias}^2_R(x) + \mathrm{var}_R(x)$$

by defining:

$$\sigma^2_R(x) = E_{y|x}[(y - f_b(x))^2], \quad (37.2)$$
$$\mathrm{bias}^2_R(x) = (f_b(x) - f_{avg}(x))^2, \quad (37.3)$$
$$\mathrm{var}_R(x) = E_S[(I(S)(x) - f_{avg}(x))^2]. \quad (37.4)$$
This error decomposition is well known in estimation theory and has been introduced in the
automatic learning community by (Geman et al., 1995).
The residual squared error, $\sigma^2_R(x)$, is the error obtained by the best possible model. It
provides a theoretical lower bound that is independent of the learning algorithm. Thus, the
suboptimality of a particular learning algorithm is composed of two terms: the (squared) bias
measures the discrepancy between the best and the average model, i.e. how good the estimate
is on average; the variance measures the variability of the predictions with respect
to the learning set randomness.
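The decomposition (37.2)-(37.4) can be checked by simulation at a single point. Everything in the sketch below is an assumption made for illustration (a sine as the Bayes model, a known noise level, a linear model fitted by least squares); the Monte Carlo estimates of the three terms should approximately sum to the mean error:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.3   # assumed, so the residual error sigma^2 is known exactly
x0 = 0.25     # point at which the error is decomposed

def f_b(x):
    return np.sin(2 * np.pi * x)  # hypothetical Bayes model

def learn(m=20):
    """One learning set -> one fitted (too simple) linear model."""
    x = rng.uniform(0, 1, m)
    y = f_b(x) + rng.normal(0, sigma, m)
    return np.polyfit(x, y, 1)

# Predictions at x0 over many independent learning sets
preds = np.array([np.polyval(learn(), x0) for _ in range(2000)])

residual = sigma ** 2                  # (37.2), known by construction
bias2 = (f_b(x0) - preds.mean()) ** 2  # (37.3)
var = preds.var()                      # (37.4)

# Direct Monte Carlo estimate of the mean error E_S[Error(I(S)(x0))]
ys = f_b(x0) + rng.normal(0, sigma, preds.size)
mean_error = np.mean((ys - preds) ** 2)
```

Because the model is too simple, the squared bias term is the dominant one here, as the discussion below illustrates graphically.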
Fig. 37.2. Top: the average models; bottom: residual error, bias, and variance
To explain why these two terms are indeed the consequence of the two phenomena dis-
cussed in the introduction of this chapter, let us come back to our simple regression problem.
The average model is depicted in the top of Figure 37.2 for the two cases of bad model choice.
Residual error, bias and variance for each position x are drawn in the bottom of the same

figure. The residual error is entirely specified by the problem and loss criterion and hence in-
dependent of the algorithm and learning set used. When the model is too simple, the average
model is far from the Bayes model almost everywhere and thus the bias is large. On the other
hand, the variance is small as the model does not match very strongly the learning set and thus
the prediction at each point does not vary too much from one learning set to another. Bias is
thus the dominant term of error. When the model is too complex, the distribution of predictions
matches very strongly the distribution of outputs at each point. The average prediction
is thus close to the Bayes model and the bias is small. However, because of the noise and the
small learning set size, predictions are highly variable at each point. In this case, variance is
the dominant term of error.
37.2.2 Bias/Variance Decompositions of the 0-1 Loss
The usual loss function for classification problems (i.e. a discrete target variable) is the 0-1
loss function, $L_c(y_1, y_2) = 1$ if $y_1 \neq y_2$, 0 otherwise, which yields the mean misclassification
error at x:

$$E_S[\mathrm{Error}(I(S)(x))] = E_S[E_{y|x}[L_c(y, I(S)(x))]] = P_{D,S}(y \neq I(S)(x)|x).$$

The Bayes model in this case is the model that outputs the most probable class at x, i.e.
$f_b(x) = \arg\max_c P_D(y = c|x)$. The corresponding residual error is:

$$\sigma_C(x) = 1 - P_D(y = f_b(x)|x). \quad (37.5)$$
By analogy with the decomposition of the square error, it is possible to define what we call
“natural” bias and variance terms for the 0-1 loss function. First, by symmetry with the Bayes
model and by analogy with the square loss decomposition, the equivalent in classification of
the average model is the majority vote classifier defined by:

$$f_{maj}(x) = \arg\max_c P_S(I(S)(x) = c),$$

which outputs at each point the class receiving the majority of votes among the distribution
of classifiers induced from the distribution of learning sets. The squared bias is the error of the
average model with respect to the best possible model. This definition yields here:

$$\mathrm{bias}_C(x) = L_c(f_b(x), f_{maj}(x)).$$
So, biased points are those for which the majority vote classifier disagrees with the Bayes
classifier. On the other hand, variance can be naturally defined as:
$$\mathrm{var}_C(x) = E_S[L_c(I(S)(x), f_{maj}(x))] = P_S(I(S)(x) \neq f_{maj}(x)),$$

which is the average error of the models induced from random learning samples S with respect
to the majority vote classifier. This definition is indeed a measure of the variability of the
predictions at x: when $\mathrm{var}_C(x) = 0$, every model outputs the same class whatever the learning
set from which it is induced, and $\mathrm{var}_C(x)$ is maximal when the probability of the class given by
the majority vote classifier is equal to $1/z$ (with z the number of classes), which corresponds
to the most uncertain distribution of predictions.
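When the distribution of predictions $P_S(I(S)(x) = c)$ is known or has been estimated, these natural terms can be computed directly. The helper below is a hypothetical sketch, not code from the chapter; it is exercised on the three-class distributions used in the example later in this section:

```python
import numpy as np

def natural_terms(p_true, p_pred):
    """Natural 0-1 bias/variance terms at a single point x.

    p_true[c] = P_D(y = c | x), p_pred[c] = P_S(I(S)(x) = c);
    both are hypothetical discrete distributions over the same classes.
    """
    bayes = int(np.argmax(p_true))      # f_b(x)
    maj = int(np.argmax(p_pred))        # f_maj(x)
    residual = 1.0 - p_true[bayes]      # sigma_C(x), eq. (37.5)
    bias = float(bayes != maj)          # bias_C(x)
    var = 1.0 - p_pred[maj]             # var_C(x) = P_S(I(S)(x) != f_maj(x))
    # Mean misclassification error: sum_c P_S(pred = c) P_D(y != c | x)
    mean_error = float(np.dot(p_pred, 1.0 - p_true))
    return residual, bias, var, mean_error

res, b, v, err = natural_terms(np.array([0.7, 0.2, 0.1]),
                               np.array([0.1, 0.8, 0.1]))
```

Note that residual + bias + variance comes to 1.5 here while the mean error is only 0.76, anticipating the failure of additivity discussed next.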
Unfortunately, these natural bias and variance terms do not sum up with the residual error
to give the local misclassification error. In other words:
$$E_S[\mathrm{Error}(I(S)(x))] \neq \sigma_C(x) + \mathrm{bias}_C(x) + \mathrm{var}_C(x).$$
Let us illustrate on a simple example how increased variance may decrease the average classification
error in some situations. Suppose that we have a 3-class problem such that the
true class probability distribution is given by $(P_D(y=c_1|x), P_D(y=c_2|x), P_D(y=c_3|x)) = (0.7, 0.2, 0.1)$.
The best possible prediction at x is thus the class $c_1$ and the corresponding minimal
error is 0.3. Suppose that we have two learning algorithms $I_1$ and $I_2$ and that the distributions
of predictions of the models built by these algorithms are given by:
$$(P_S(I_1(S)(x)=c_1), P_S(I_1(S)(x)=c_2), P_S(I_1(S)(x)=c_3)) = (0.1, 0.8, 0.1)$$
$$(P_S(I_2(S)(x)=c_1), P_S(I_2(S)(x)=c_2), P_S(I_2(S)(x)=c_3)) = (0.4, 0.5, 0.1)$$
So, we observe that both algorithms produce models that most probably will decide
class $c_2$ (respectively with probability 0.8 and 0.5). Thus, the two methods are biased
($\mathrm{bias}_C(x) = 1$). On the other hand, the variances of the two methods are obtained in
the following way:

$$\mathrm{var}^1_C(x) = 1 - 0.8 = 0.2 \quad \text{and} \quad \mathrm{var}^2_C(x) = 1 - 0.5 = 0.5,$$

and their mean misclassification errors are found to be

$$E_S[\mathrm{Error}(I_1(S)(x))] = 0.76 \quad \text{and} \quad E_S[\mathrm{Error}(I_2(S)(x))] = 0.61.$$

Thus, between these two methods with identical bias, it is the one having the largest
variance that has the smallest average error rate.
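These numbers can be confirmed by direct simulation. In the sketch below the sample size and the seed are arbitrary choices; labels are drawn from the true class distribution and predictions from each algorithm's prediction distribution:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000  # arbitrary simulation size

p_true = np.array([0.7, 0.2, 0.1])   # P_D(y = c | x) from the example
p_pred1 = np.array([0.1, 0.8, 0.1])  # prediction distribution of I_1
p_pred2 = np.array([0.4, 0.5, 0.1])  # prediction distribution of I_2

y = rng.choice(3, size=n, p=p_true)
pred1 = rng.choice(3, size=n, p=p_pred1)
pred2 = rng.choice(3, size=n, p=p_pred2)

err1 = np.mean(y != pred1)  # estimate of E_S[Error(I_1(S)(x))], about 0.76
err2 = np.mean(y != pred2)  # estimate of E_S[Error(I_2(S)(x))], about 0.61
```

The higher-variance algorithm $I_2$ indeed comes out with the lower average error rate.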
It is easy to see that this happens here because of the existence of a bias. Indeed,

with 0-1 loss, an algorithm that has small variance and high bias is an algorithm
that systematically (i.e. whatever the learning sample) produces a wrong answer,
whereas an algorithm that has a high bias but also a high variance is only wrong
for a majority of learning samples, but not necessarily systematically. So the latter
may be better than the former. In other words, with 0-1 loss, high variance can be
beneficial because it can bring the system closer to the Bayes classifier.
As a result of this counter-intuitive interaction between bias and variance terms
with 0-1 loss, several authors have proposed their own decompositions. We briefly
describe below the most representative of them. For a more detailed discussion of
these decompositions, see for example (Geurts, 2002) or (James, 2003). In the following
sections, we present a very different approach to the study of bias and variance for
0-1 loss, due to Friedman (1997), which relates the mean error to the squared bias
and variance terms of the class probability estimates.
Some decompositions
Tibshirani (1996) defines the bias as the difference between the probability of the
Bayes class and the probability of the majority vote class:
$$\mathrm{bias}_T(x) = P_D(y = f_b(x)|x) - P_D(y = f_{maj}(x)|x). \quad (37.6)$$

Thus, the sum of this bias and the residual error is actually the misclassification error
of the majority vote classifier:

$$\sigma_C(x) + \mathrm{bias}_T(x) = 1 - P_D(y = f_{maj}(x)|x) = \mathrm{Error}(f_{maj}(x)).$$
This is exactly the part of the error that would remain if we could completely can-
cel the variability of the predictions. The variance is then defined as the difference
between the mean misclassification error and the error of the majority vote classifier:
$$\mathrm{var}_T(x) = E_S[\mathrm{Error}(I(S)(x))] - \mathrm{Error}(f_{maj}(x)). \quad (37.7)$$
Tibshirani (1996) denotes this variance term the aggregation effect. Indeed, this is the
variation of error that results from the aggregation of the predictions over all learning
sets. Note that this variance term is not necessarily positive. From different considerations,
James (2003) has proposed exactly the same decomposition. To distinguish
(37.6) and (37.7) from the natural bias and variance terms, he calls them the systematic
effect and the variance effect respectively. Dietterich and Kong (1995) have proposed a
decomposition that applies only to the noise-free case but that exactly reduces to
Tibshirani’s decomposition in this latter case.
Domingos (2000) agrees with the natural definitions of bias and variance given in
the introduction of this section and combines them into a non-additive expression:

$$E_S[\mathrm{Error}(I(S)(x))] = b_1(x) \cdot \sigma_C(x) + \mathrm{bias}_C(x) + b_2(x) \cdot \mathrm{var}_C(x),$$

where $b_1$ and $b_2$ are two factors that are in fact functions of the true class distribution
and of the distribution of predictions.
Kohavi and Wolpert (1996) have proposed a very different decomposition which
is closer in spirit to the decomposition of the squared loss. Their decomposition

makes use of quadratic functions of the probabilities P
S
(I(S)(x)
|
x) and P(y
|
x).
Heskes (1998) adopts the natural variance term $\mathrm{var}_C$ and, ignoring the residual error,
defines bias as the difference between the mean misclassification error and this
variance. As a consequence, his bias can be smaller than the residual
error. Breiman (1996a, 2000) has successively proposed two decompositions. In the
first one, bias and variance are defined globally instead of locally: bias is the part of
the error due to biased points (i.e. points such that $\mathrm{bias}_C(x) = 1$) and variance is
the part of the error due to unbiased points.
This multitude of decompositions reflects the complexity of the interaction
between bias and variance in classification. Each decomposition has its pros and
cons. Notably, we may observe in some cases counterintuitive behavior with respect
to what would be observed with the classical decomposition of the squared error (e.g.
a negative variance). This makes the choice of a particular decomposition difficult,
both in theoretical and empirical studies. Nevertheless, all decompositions have proven
to be useful to analyze classification algorithms, each one at least in the context of
its introduction.
Bias and variance of class probability estimates
Many classification algorithms work by first computing an estimate $I_c(S)(x)$ of the
conditional probability of each class c at x, and then deriving their classification
model by:

$$I(S)(x) = \arg\max_c I_c(S)(x).$$
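This rule is a one-liner in practice. The sketch below uses hypothetical probability estimates; the estimator $I_c$ itself is whatever the classification algorithm provides:

```python
import numpy as np

def classify(prob_estimates):
    """I(S)(x) = argmax_c I_c(S)(x), applied row-wise to a set of points."""
    return np.argmax(prob_estimates, axis=1)

# Hypothetical class probability estimates; rows are points x,
# columns are the estimated probabilities of classes 0, 1, 2.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.3, 0.4]])
labels = classify(probs)
```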
