32
Data Mining Model Comparison
Paolo Giudici
University of Pavia
Summary. The aim of this contribution is to illustrate the role of statistical models and, more
generally, of statistics, in choosing a Data Mining model. After a preliminary introduction on
the distinction between Data Mining and statistics, we will focus on the issue of how to choose
a Data Mining methodology. This well illustrates how statistical thinking can bring real added
value to a Data Mining analysis, as otherwise it becomes rather difficult to make a reasoned
choice. In the third part of the paper we will present, by means of a case study in credit risk
management, how Data Mining and statistics can profitably interact.
Key words: Model choice, statistical hypothesis testing, cross-validation, loss functions, credit risk management, logistic regression models.
32.1 Data Mining and Statistics
Statistics has always been involved with creating methods to analyse data. The main differ-
ence compared to the methods developed in Data Mining is that statistical methods are usually
developed in relation to the data being analyzed but also according to a conceptual reference
paradigm. Although this has made the various statistical methods available coherent and rig-
orous at the same time, it has also limited their ability to adapt quickly to the methodological
requests put forward by the developments in the field of information technology.
There are at least four aspects that distinguish the statistical analysis of data from Data
Mining.
First, while statistical analysis traditionally concerns itself with analyzing primary data
that has been collected to check specific research hypotheses, Data Mining can also concern
itself with secondary data collected for other reasons. This is the norm, for example, when an-
alyzing company data that comes from a data warehouse. Furthermore, while in the statistical
field the data can be of an experimental nature (the data could be the result of an experiment
which randomly allocates all the statistical units to different kinds of treatment) in Data Min-
ing the data is typically of an observational nature.
Second, Data Mining is concerned with analyzing great masses of data. This implies new considerations for statistical analysis. For example, for many applications it is impossible to analyze, or even access, the whole database, for reasons of computational efficiency. Therefore
it becomes necessary to have a sample of the data from the database being examined. This
sampling must be carried out bearing in mind the Data Mining aims and, therefore, it cannot
be analyzed with the traditional statistical sampling theory tools.
Third, many databases do not lead to the classic forms of statistical data organization. This
is true, for example, of data that comes from the Internet. This creates the need for appropriate
analytical methods to be developed, which are not available in the statistics field.
One last but very important difference that we have already mentioned is that Data Mining
results must be of some consequence. This means that constant attention must be given to
business results achieved with the data analysis models.
32.2 Data Mining Model Comparison
Several classes of computational and statistical methods for data mining are available. Once a class of models has been established, the problem is to choose the "best" model from it. In this chapter, summarized from Chapter 6 in (Giudici, 2003), we present a systematic comparison of such models.
Comparison criteria for Data Mining models can be classified schematically into: criteria
based on statistical tests, based on scoring functions, Bayesian criteria, computational criteria,
and business criteria.
The first are based on the theory of statistical hypothesis testing and, therefore, there is a large and detailed literature on this topic. See, for example, a text on statistical inference, such as (Mood et al., 1991) or (Bickel and Doksum, 1977). A statistical model can be specified by a discrete probability function or by a probability density function, f(x). Such a model is usually left unspecified, up to unknown quantities that have to be estimated on the basis of the data at hand. Typically, the observed sample is not sufficient to reconstruct each detail of f(x), but it can indeed be used to approximate f(x) with a certain accuracy. Often a density
function is parametric, so that it is defined by a vector of parameters $\Theta = (\theta_1, \ldots, \theta_I)$, such that each value $\theta$ of $\Theta$ corresponds to a particular density function, $p_\theta(x)$. In order to measure the
accuracy of a parametric model, one can resort to the notion of distance between a model f ,
which underlies the data, and an approximating model g (see, for instance, (Zucchini, 2000)).
Notable examples of distance functions are, for categorical variables: the entropic dis-
tance, which describes the proportional reduction of the heterogeneity of the dependent vari-
able; the chi-squared distance, based on the distance from the case of independence; the 0-1
distance, which leads to misclassification rates.
The entropic distance of a distribution g from a target distribution f is:

$$ {}_E d = \sum_i f_i \log \frac{f_i}{g_i} \qquad (32.1) $$
The chi-squared distance of a distribution g from a target distribution f is instead:
$$ {}_{\chi^2} d = \sum_i \frac{(f_i - g_i)^2}{g_i} \qquad (32.2) $$
The 0-1 distance between a vector of predicted values, $X_{g}$, and a vector of observed values, $X_{f}$, is:

$$ {}_{0\text{-}1} d = \sum_{r=1}^{n} 1\!\left(X_{fr}, X_{gr}\right) \qquad (32.3) $$
where $1(w,z) = 1$ if $w \neq z$ and 0 otherwise, so that the distance counts the number of misclassified observations.
For quantitative variables, the typical choice is the Euclidean distance, representing the
distance between two vectors in the Cartesian plane. Another possible choice is the uniform
distance, applied when nonparametric models are being used.
The Euclidean distance between a distribution g and a target f is expressed by the equa-
tion:
$$ {}_{2} d\!\left(X_f, X_g\right) = \sqrt{\sum_{r=1}^{n} \left(X_{fr} - X_{gr}\right)^2} \qquad (32.4) $$
Given two distribution functions F and G with values in [0, 1], the uniform distance is defined as the quantity:

$$ \sup_{0 \le t \le 1} \left| F(t) - G(t) \right| \qquad (32.5) $$
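As an illustration, the following minimal Python sketch computes the entropic, chi-squared and Euclidean distances of equations (32.1), (32.2) and (32.4) for two small, made-up discrete distributions; the arrays f and g are placeholders, not data from the chapter.

```python
# A minimal sketch of the distances in equations (32.1), (32.2) and (32.4),
# computed for an illustrative target distribution f and approximating model g.
import numpy as np

f = np.array([0.5, 0.3, 0.2])   # target distribution
g = np.array([0.4, 0.4, 0.2])   # approximating distribution

entropic_d = np.sum(f * np.log(f / g))        # equation (32.1)
chi2_d = np.sum((f - g) ** 2 / g)             # equation (32.2)
euclidean_d = np.sqrt(np.sum((f - g) ** 2))   # equation (32.4)

print(f"entropic: {entropic_d:.4f}, chi-squared: {chi2_d:.4f}, euclidean: {euclidean_d:.4f}")
```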
Any of the previous distances can be employed to define the notion of discrepancy of a
statistical model. The discrepancy of a model, g, can be obtained as the discrepancy between
the unknown probabilistic model, f , and the best (closest) parametric statistical model. Since
f is unknown, closeness can be measured with respect to a sample estimate of the unknown
density f .
Assume that f represents the unknown density of the population, and let $g = p_\theta$ be a family of density functions (indexed by a vector of I parameters, $\theta$) that approximates it. Using, to exemplify, the Euclidean distance, the discrepancy of a model g with respect to a target model f is:
$$ \Delta(f, p_{\hat{\theta}}) = \sum_{i=1}^{n} \left( f(x_i) - p_{\hat{\theta}}(x_i) \right)^2 \qquad (32.6) $$
A common choice of discrepancy function is the Kullback-Leibler divergence, which derives from the entropic distance and can be applied to any type of observations. In this context, the best model can be interpreted as the one with a minimal loss of information from the true unknown distribution.
The Kullback-Leibler divergence of a parametric model $p_\theta$ with respect to an unknown density f is defined by:

$$ \Delta_{K\text{-}L}(f, p_{\theta}) = \sum_i f(x_i) \log \frac{f(x_i)}{p_{\hat{\theta}}(x_i)} \qquad (32.7) $$

where the parametric density in the denominator has been evaluated at the values of the parameters which minimize the distance with respect to f.
It can be shown that the statistical tests used for model comparison are generally based
on estimators of the total Kullback-Leibler discrepancy. The most used of such estimators is
the log-likelihood score. Statistical hypothesis testing is based on successive pairwise comparisons between alternative models. The idea is to compare the log-likelihood scores of two alternative models.
The log-likelihood score is then defined by:
$$ -2 \sum_{i=1}^{n} \log p_{\hat{\theta}}(x_i) \qquad (32.8) $$
Hypothesis testing theory allows one to derive a threshold below which the difference between two models is not significant and, therefore, the simpler model can be chosen. To summarize,
using statistical tests it is possible to make an accurate choice among the models, based on the
observed data. The defect of this procedure is that it allows only a partial ordering of models,
requiring a comparison between model pairs and, therefore, with a large number of alternatives
it is necessary to make heuristic choices regarding the comparison strategy (such as choosing
among forward, backward and stepwise criteria, whose results may diverge). Furthermore, a
probabilistic model must be assumed to hold, and this may not always be a valid assumption.
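As an illustration of such a pairwise comparison, the following minimal Python sketch fits two nested logistic regression models with statsmodels on synthetic data, computes their -2 log-likelihood scores as in equation (32.8), and performs a likelihood-ratio test; the variable names and the synthetic data are assumptions for illustration, not the chapter's case study.

```python
# A minimal sketch of a pairwise likelihood-ratio comparison between two nested
# logistic regression models, fitted on synthetic data.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
# The true model uses only the first two predictors.
p = 1 / (1 + np.exp(-(0.5 + 1.0 * X[:, 0] - 0.8 * X[:, 1])))
y = rng.binomial(1, p)

# Simpler model (2 predictors) nested in the larger model (3 predictors).
small = sm.Logit(y, sm.add_constant(X[:, :2])).fit(disp=0)
large = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

# -2 log-likelihood scores, as in equation (32.8).
score_small = -2 * small.llf
score_large = -2 * large.llf

# Likelihood-ratio statistic and p-value (one extra parameter).
lr_stat = score_small - score_large
p_value = stats.chi2.sf(lr_stat, df=1)
print(f"LR statistic = {lr_stat:.3f}, p-value = {p_value:.3f}")
# A large p-value means the difference is not significant: keep the simpler model.
```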
A less structured approach has been developed in the field of information theory, giving
rise to criteria based on score functions. These criteria give each model a score, which puts
them into some kind of complete order. We have seen how the Kullback-Leibler discrepancy
can be used to derive statistical tests to compare models. In many cases, however, a formal
test cannot be derived. For this reason, it is important to develop scoring functions that attach a score to each model. The Kullback-Leibler discrepancy estimator is an example of such a scoring function that, for complex models, can often be approximated asymptotically. A problem with the Kullback-Leibler score is that it depends on the complexity of a model as
described, for instance, by the number of parameters. It is thus necessary to employ score
functions that penalise model complexity.
The most important of such functions is the AIC (Akaike Information Criterion, see
(Akaike, 1974)). The AIC criterion is defined by the following equation:
$$ \mathrm{AIC} = -2 \log L(\hat{\theta}; x_1, \ldots, x_n) + 2q \qquad (32.9) $$

where the first term is minus twice the logarithm of the likelihood function evaluated at the maximum likelihood parameter estimate, and q is the number of parameters of the model.
From its definition notice that the AIC score essentially penalises the log-likelihood score
with a term that increases linearly with model complexity. The AIC criterion is based on the
implicit assumption that q remains constant when the size of the sample increases. However
this assumption is not always valid and therefore the AIC criterion does not lead to a consis-
tent estimate of the dimension of the unknown model. An alternative, and consistent, scoring
function is the BIC criterion (Bayesian Information Criterion), also called SBC, formulated
in (Schwarz, 1978). The BIC criterion is defined by the following expression:
$$ \mathrm{BIC} = -2 \log L(\hat{\theta}; x_1, \ldots, x_n) + q \log(n) \qquad (32.10) $$
As can be seen from its definition the BIC differs from the AIC only in the second part
which now also depends on the sample size n. Compared to the AIC, when n increases the BIC
favours simpler models. As n gets large, the first term (linear in n) will dominate the second
term (logarithmic in n). This corresponds to the fact that, for a large n, the variance term in
the mean squared error expression tends to be negligible. We also point out that, despite the
superficial similarity between the AIC and the BIC, the first is usually justified by resorting to
classical asymptotic arguments, while the second by appealing to the Bayesian framework.
To conclude, the scoring function criteria for selecting models are easy to calculate and lead to a total ordering of the models. From most statistical packages we can get the AIC and
BIC scores for all the models considered. A further advantage of these criteria is that they can
be used also to compare non-nested models and, more generally, models that do not belong to
the same class (for instance a probabilistic neural network and a linear regression model).
However, the limit of these criteria is the lack of a threshold, as well as the difficult interpretability of their measurement scale. In other words, it is not easy to determine whether the difference between two models is significant or not, and how it compares to another difference. These criteria are indeed useful in a preliminary exploration phase. To examine these criteria and to compare them with the previous ones see, for instance, (Zucchini, 2000) or (Hand et al., 2001).
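The following minimal Python sketch shows how the AIC and BIC scores of equations (32.9) and (32.10) can be computed directly from a model's maximised log-likelihood; the log-likelihood values and parameter counts in the example call are invented for illustration.

```python
# A minimal sketch of the AIC and BIC scores of equations (32.9) and (32.10).
import math

def aic(loglik: float, q: int) -> float:
    # AIC = -2 log L(theta_hat) + 2q
    return -2 * loglik + 2 * q

def bic(loglik: float, q: int, n: int) -> float:
    # BIC = -2 log L(theta_hat) + q log(n)
    return -2 * loglik + q * math.log(n)

# Hypothetical fitted models: (maximised log-likelihood, number of parameters).
models = {"model_A": (-1290.4, 25), "model_B": (-1301.7, 12)}
n = 3712  # hypothetical training sample size
for name, (ll, q) in models.items():
    print(f"{name}: AIC = {aic(ll, q):.1f}, BIC = {bic(ll, q, n):.1f}")
# The model with the lower score is preferred; BIC penalises the larger model more.
```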
A possible "compromise" between the previous two classes of criteria is given by the Bayesian criteria, which can be developed in a rather coherent way (see e.g. (Bernardo and Smith, 1994)). They appear to combine the advantages of the two previous approaches: a coherent decision threshold and a complete ordering. One of the problems that may arise is connected to the absence of general-purpose software. For Data Mining work using Bayesian criteria the reader may see, for instance, (Giudici, 2003) and (Giudici and Castelo, 2001).
The widespread use of computational methods has led to the development of computationally intensive model comparison criteria. These criteria are usually based on using a dataset different from the one being analyzed (external validation) and are applicable to all the models considered, even when they belong to different classes (for example in the comparison between logistic regression, decision trees and neural networks, even when the latter two are non-probabilistic). A possible problem with these criteria is that they take a long time to be designed and implemented, although general-purpose software has made this task easier.
The most common such criterion is based on cross-validation. The idea of the cross-validation method is to divide the sample into two sub-samples: a "training" sample, with
n −m observations, and a ”validation” sample, with m observations. The first sample is used
to fit a model and the second is used to estimate the expected discrepancy or to assess a
distance. Using this criterion the choice between two or more models is made by evaluating
an appropriate discrepancy function on the validation sample. Notice that the cross-validation
idea can be applied to the calculation of any distance function.

One problem regarding the cross-validation criterion is in deciding how to select m, that is,
the number of the observations contained in the ”validation sample”. For example, if we select
m = n/2 then only n/2 observations would be available to fit a model. We could reduce m, but this would mean having few observations in the validation sample and, therefore, reducing the accuracy with which the choice between models is made. In practice, proportions of 75% and 25% are usually used, respectively, for the training and the validation samples.
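As an illustration of this holdout idea, the following minimal Python sketch splits a synthetic dataset 75%/25%, fits a model on the training part and evaluates a discrepancy on the validation part; the data and the choice of log-loss as discrepancy are assumptions for illustration.

```python
# A minimal sketch of a 75%/25% training/validation split with an out-of-sample
# discrepancy evaluation, using scikit-learn on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
# Discrepancy on the validation sample; here the log-loss plays the role of an
# estimated Kullback-Leibler type discrepancy.
print("validation log-loss:",
      log_loss(y_valid, model.predict_proba(X_valid)[:, 1]))
```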
To summarize, these criteria have the advantage of being generally applicable but have the
disadvantage of taking a long time to be calculated and of being sensitive to the characteristics
of the data being examined. A way to overcome this problem is to consider model combi-
nation methods, such as bagging and boosting. For a thorough description of these recent
methodologies, see (Hastie et al., 2001).
One last group of criteria seems specifically tailored to the data mining field. These are criteria that compare the performance of the models in terms of their relative losses, connected to the errors of approximation made by fitting Data Mining models. Criteria based on loss functions have appeared recently, although related ideas have long been known in Bayesian decision theory (see for instance (Bernardo and Smith, 1994)). They are of great interest and have great application potential, although at present they are mainly concerned with solving classification problems. For a more detailed examination of these criteria the reader can see, for example, (Hand, 1997, Hand et al., 2001) or the reference manuals on Data Mining software, such as that of SAS Enterprise Miner.
The idea behind these methods is that, in the choice among alternative models, it is important to compare the utility of the results obtained from the models, and not to look exclusively at the statistical comparison between the models themselves. Since the main problem dealt with by the data analysis is to reduce the uncertainty on the risk or "loss" factors, reference is often made to developing criteria that minimize the loss connected to the problem being examined. In other words, the best model is the one that leads to the least loss.
Most of the loss function based criteria apply to predictive classification problems, where
the concept of a confusion matrix arises. The confusion matrix is used as an indication of
the properties of a classification (discriminant) rule. It contains the number of elements that
have been correctly or incorrectly classified for each class. On its main diagonal we can see
the number of observations that have been correctly classified for each class while the off-
diagonal elements indicate the number of observations that have been incorrectly classified. If
it is (explicitly or implicitly) assumed that each incorrect classification has the same cost, the proportion of incorrect classifications over the total number of classifications is called the error rate, or misclassification error, and it is the quantity which must be minimized. Of course, the assumption of equal costs can be replaced by weighting errors with their relative costs.
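The following minimal Python sketch builds a confusion matrix from illustrative predictions and computes both the plain misclassification rate and a cost-weighted version; the observations and the 20-to-1 cost ratio are assumptions chosen to mirror the kind of loss matrix used later in Table 32.1.

```python
# A minimal sketch of a confusion matrix and of a cost-weighted
# misclassification rate, on made-up predictions.
import numpy as np

actual    = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])  # 1 = event (e.g. bad)
predicted = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])

# Rows: actual class, columns: predicted class.
confusion = np.zeros((2, 2), dtype=int)
for a, p in zip(actual, predicted):
    confusion[a, p] += 1
print(confusion)

# Plain misclassification rate (equal costs).
error_rate = (confusion[0, 1] + confusion[1, 0]) / confusion.sum()

# Cost-weighted version: a missed event is assumed 20 times as costly
# as a false alarm.
costs = np.array([[0, 1],     # actual 0: predicting 1 costs 1
                  [20, 0]])   # actual 1: predicting 0 costs 20
expected_cost = (confusion * costs).sum() / confusion.sum()
print(error_rate, expected_cost)
```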
The confusion matrix gives rise to a number of graphs that can be used to assess the rel-
ative utility of a model, such as the Lift Chart, and the ROC Curve. For a detailed illustration
of these graphs we refer to (Hand, 1997) or (Giudici, 2003). The lift chart orders the validation set observations, in increasing or decreasing order, on the basis of their score, which is the probability of the response event (success), as estimated on the basis of the training set. Subsequently, it subdivides such scores into deciles. It then calculates and graphs the observed
probability of success for each of the decile classes in the validation set. A model is valid
if the observed success probabilities follow the same order (increasing or decreasing) as the
estimated ones. Notice that, in order to be better interpreted, the lift chart of a model is usually
compared with a baseline curve, for which the probability estimates are drawn in the absence
of a model, that is, taking the mean of the observed success probabilities.
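The following minimal Python sketch carries out this decile construction on synthetic scores and outcomes (both are placeholders, not the chapter's data): it sorts by score, splits into deciles, and compares the observed success rate in each decile with the overall baseline rate.

```python
# A minimal sketch of the lift chart computation: observed success rate per
# score decile, divided by the baseline rate.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
score = rng.uniform(size=1000)            # estimated event probabilities
event = rng.binomial(1, score * 0.1)      # observed outcomes (synthetic)

df = pd.DataFrame({"score": score, "event": event}).sort_values(
    "score", ascending=False)
df["decile"] = np.repeat(np.arange(1, 11), len(df) // 10)

baseline = df["event"].mean()
per_decile = df.groupby("decile")["event"].mean()
lift = per_decile / baseline
print(lift.round(2))
# A valid model shows observed success rates that decrease across deciles,
# with lift well above 1 in the first deciles.
```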
The ROC (Receiver Operating Characteristic) curve is a graph that also measures predic-
tive accuracy of a model. It is based on four conditional frequencies that can be derived from a model and the choice of a cut-off point for its scores:
• the observations predicted as events and effectively such (sensitivity);
• the observations predicted as events and effectively non-events;
• the observations predicted as non-events and effectively events;
• the observations predicted as non-events and effectively such (specificity).
The ROC curve is obtained by representing, for any fixed cut-off value, a point in the Cartesian plane having as x-value the false positive rate (1 - specificity) and as y-value the sensitivity. Each point on the curve therefore corresponds to a particular cut-off. In terms of model comparison, the best curve is the one that is leftmost, the ideal one coinciding with the
y-axis. To summarize, criteria based on loss functions have the advantage of being easy to
interpret and, therefore, well suited for Data Mining applications but, on the other hand, they
still need formal improvements and mathematical refinements. In the next section we give an
example of how this can be done, and show that statistics and Data Mining applications can
fruitfully interact.
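The following minimal Python sketch shows how the ROC points described above can be computed: for each cut-off, classify by comparing the score with the cut-off and record (1 - specificity, sensitivity). The scores and outcomes are synthetic placeholders, not the case-study data.

```python
# A minimal sketch of ROC curve points computed over a grid of cut-off values.
import numpy as np

rng = np.random.default_rng(0)
score = rng.uniform(size=1000)
event = rng.binomial(1, score)          # outcomes loosely driven by the score

points = []
for cutoff in np.linspace(0, 1, 21):
    pred = (score >= cutoff).astype(int)
    tp = np.sum((pred == 1) & (event == 1))
    fp = np.sum((pred == 1) & (event == 0))
    tn = np.sum((pred == 0) & (event == 0))
    fn = np.sum((pred == 0) & (event == 1))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    points.append((1 - specificity, sensitivity))

for x, y in points[::5]:
    print(f"1-specificity = {x:.2f}, sensitivity = {y:.2f}")
# The best model is the one whose curve lies furthest towards the upper left.
```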
32.3 Application to Credit Risk Management
We now apply the previous considerations to a case-study that concerns credit risk manage-
ment. The objective of the analysis is the evaluation of the credit reliability of small and
medium enterprises (SMEs) that demand financing for their development.
In order to assess credit reliability each applicant for credit is associated with a score,
usually expressed in terms of a probability of repayment (or, conversely, of default). Data Mining methods are used to estimate such a score and, on the basis of it, to classify applicants as reliable (worthy of credit) or not.
Data Mining models for credit scoring are of the predictive (or supervised) kind: they
use explanatory variables obtained from information available on the applicant in order to get
an estimate of the probability of repayment (target or response variable). The methods most used in practical credit scoring applications are linear and logistic regression models, neural networks and classification trees. Often, in banking practice, the resulting scores are called "statistical" and supplemented with subjective, judgemental evaluations.
In this section we consider the analysis of a database that includes 7134 SMEs belonging to the retail segment of an important Italian bank. The retail segment contains companies with total sales of less than 2.5 million per year. For each of these companies the bank has calculated a score, in order to evaluate their financing (or refinancing) in the period from April 1st, 1999 to April 30th, 2000. After data cleaning, 13 variables are included in the analysis database, of which one binary variable expressing credit reliability (BAD = 0 for the reliable companies, BAD = 1 for the non-reliable ones) can be considered as the response or target variable.
The sample contains 361 companies with BAD = 1 (about 5%) and 6773 companies with BAD = 0 (about 95%). The objective of the analysis is to build a statistical rule that explains the target variable as a function of the explanatory ones. Once built on the observed data, such a rule will be extrapolated to assess and predict future applicants for credit. Notice the unbalanced distribution of the target response: this situation, typical of predictive Data Mining problems, poses serious challenges to the performance of a model.
The remaining 12 available variables are assumed to influence reliability, and can be considered as explanatory predictors. Among them we have: the age of the company, its legal status, the number of employees, the total sales and the variation of sales in the last period, the region of residence, the specific business, and the duration of the relationship of the company's managers with the bank. Most of them can be considered as "demographic" information on the company, stable in time but not very powerful for building a statistical model. However, since the companies considered are all SMEs, it is rather difficult to rely on other information, such as balance sheet data.
A preliminary exploratory analysis can give indications on how to code the explanatory variables, in order to maximize their predictive power. In order to reach this objective we have employed statistical measures of association between pairs of variables, such as chi-squared based measures, and statistical measures of dependence, such as Goodman and Kruskal's indices (see (Giudici, 2003) for a systematic comparison of such measures). We remark that the use of such tools is very beneficial for the analysis, and can considerably improve the final performance results. As a result of our analysis, all explanatory variables have been discretised, with a number of levels ranging from 2 to 26.
In order to focus on the issue of model comparison we now concentrate on the comparison
of three different logistic regression models on the data. This model is the most used in credit
scoring applications; other models that are employed are classification trees, linear discrimi-
nant analysis and neural networks. Here we prefer to compare models belonging to the same
class, to better illustrate our issue; for a detailed comparison of credit scoring methods, on a

different data set, see (Giudici, 2003). Our analysis has been conducted using the SAS and SAS Enterprise Miner software, available at the bank that is the subject of the analysis.
We have chosen, in agreement with the bank's experts, three logistic regression models: a saturated model, which contains all explanatory variables, with the levels obtained from the exploratory analysis; a statistically selected model, obtained using pairwise statistical hypothesis testing; and a model that minimizes the loss function. In the following, the saturated model will be named "RegA" (model A); the model chosen according to a statistical selection strategy, "RegB" (model B); and the model chosen by minimizing the loss function, "RegC" (model C). Statistical model comparison has been carried out using a stepwise model selection approach,
with a reference value of 0.05 against which to compare p-values. On the other hand, the loss function has been expressed by the bank's experts as a function of the classification errors. Table 32.1 below describes such a loss function.
Table 32.1. The chosen loss function

                    Predicted
Actual           BAD      GOOD
BAD                0        20
GOOD              -1         0
The table contains the estimated losses (in scale-free values) corresponding to the combinations of actual and predicted values of the target variable. The specified loss function means that giving credit to a non-reliable (bad) enterprise is considered 20 times more costly than not giving credit to a reliable (good) enterprise. In statistical terms, the type I error costs 20 times the type II error. As each of the four scenarios in Table 32.1 has an occurrence probability, it is possible to calculate the expected loss of each considered statistical model. The best one will be the one minimizing such expected loss.
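The following minimal Python sketch illustrates this expected-loss comparison: it combines the loss matrix of Table 32.1 with one confusion matrix per model and picks the model with the smallest expected loss. The confusion matrices are invented for illustration, not the models' actual validation results.

```python
# A minimal sketch of the expected-loss comparison based on Table 32.1.
import numpy as np

# Loss matrix, rows = actual (BAD, GOOD), columns = predicted (BAD, GOOD),
# as in Table 32.1.
loss = np.array([[0, 20],
                 [-1, 0]])

# Hypothetical confusion matrices on the validation sample for three models.
confusions = {
    "RegA": np.array([[40, 42], [150, 1407]]),
    "RegB": np.array([[43, 39], [160, 1397]]),
    "RegC": np.array([[47, 35], [180, 1377]]),
}

for name, cm in confusions.items():
    expected_loss = (cm * loss).sum() / cm.sum()
    print(f"{name}: expected loss = {expected_loss:.3f}")
# The best model is the one that minimises the expected loss.
```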
In the SAS Enterprise Miner tool the Assessment node provides a common framework to compare models in terms of their predictions. This requires that the data has been partitioned into two or more datasets, according to the computational criteria of model comparison. The Assessment node produces a table view of the model results that lists relevant statistics and measures of model adequacy, as well as several different charts/reports, depending on whether the target variable is continuous or categorical and whether a profit/loss function has been specified.
In the case under examination, the initial dataset (5351 observations) has been split in two, using a sampling mechanism stratified with respect to the target variable. The training dataset contains about 70% of the observations (about 3712) and the validation dataset the remaining 30% (about 1639 observations). As the samples are stratified, in both resulting datasets the percentages of "bad" and "good" enterprises remain the same as those in the combined dataset (5% and 95%).
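As a sketch of this stratified 70%/30% partition (the real partition in the case study was produced within SAS Enterprise Miner), the following Python snippet assumes a pandas DataFrame df with a binary target column named "BAD".

```python
# A minimal sketch of a stratified 70%/30% training/validation partition.
from sklearn.model_selection import train_test_split

def partition(df, target="BAD", valid_fraction=0.30, seed=0):
    train, valid = train_test_split(
        df, test_size=valid_fraction, stratify=df[target], random_state=seed)
    return train, valid

# train, valid = partition(df)
# Stratification keeps the 5%/95% bad/good proportions in both datasets.
```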
The first model comparison tool we consider is the lift chart. For a binary target, the lift chart (also called a gains chart) is built as follows. The scored data set is sorted by the probabilities of the target event in descending order; observations are then grouped into deciles. For each decile, a lift chart can report either the percentage of captured target responses (bad repayers here) or the ratio between that percentage and the corresponding one for the baseline (random) model, called the lift. Lift charts show the percentage of positive responses or the lift value on the vertical axis. Table 32.2 shows the calculations that give rise to the lift chart for the credit scoring problem considered here. Figure 32.1 shows the corresponding curves.
Table 32.2. Calculations for the lift chart

Obs. per group   Percentile   % captured responses   % captured responses   % captured responses   % captured responses
                              (BASELINE)             (REG A)                (REG B)                (REG C)
163.90            10          5.064                  20.134                 22.575                 22.679
163.90            20          5.064                  12.813                 12.813                 14.033
163.90            30          5.064                   9.762                 10.103                 10.293
163.90            40          5.064                   8.237                  8.237                  8.542
163.90            50          5.064                   7.322                  7.383                  7.445
163.90            60          5.064                   6.508                  6.913                  6.624
163.90            70          5.064                   5.753                  6.237                  6.096
163.90            80          5.064                   5.567                  5.567                  5.644
163.90            90          5.064                   5.288                  5.220                  5.185
163.90           100          5.064                   5.064                  5.064                  5.064
Fig. 32.1. Lift charts for the best model
Comparing the results in Table 32.2 and Figure 32.1, it emerges that the performances of the three models being compared are rather similar; however, the best model seems to be model C (the model that minimises the losses), as it is the model that, in the first deciles, is able to
effectively capture more bad enterprises, a difficult task in the given problem. Recalling that
the actual percentage of bad enterprises observed is equal to 5%, the previous graph can be
normalized by dividing the percentage of bads in each decile by the overall 5% percentage.
The result is the actual lift of a model, that is, the actual improvement with respect to the
baseline situation of absence of a model (as if each company were estimated good/bad accord-
ing to a purely random mechanism). In terms of model C, in the first decile (with about 164 enterprises) the lift is equal to 4.46 (i.e. 22.7% / 5.1%); this means that, using model C, we expect to obtain in the first decile a number of bad enterprises about 4.5 times higher than with a random sample of the considered enterprises.
The second Assessment tool we consider is the threshold chart. Threshold-based charts enable one to display the agreement between the predicted and actual target values across a range of threshold levels. The threshold level is the cutoff used to classify an observation, based on the posterior probabilities of the event level. The default threshold level is 0.50. For the credit scoring case the calculations leading to the threshold chart are in Table 32.3 and the

corresponding figure in Figure 32.3 below.
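As an illustration of what a threshold chart computes, the following minimal Python sketch evaluates the agreement between predicted and actual classes over a grid of threshold levels; the scores and outcomes are synthetic placeholders, not the bank's data.

```python
# A minimal sketch of a threshold chart: classification agreement as a
# function of the cut-off applied to the event posterior probabilities.
import numpy as np

rng = np.random.default_rng(0)
score = rng.uniform(size=1639)               # posterior probabilities of the event
actual = rng.binomial(1, 0.05, size=1639)    # roughly 5% events, as in the case study

for threshold in np.linspace(0.1, 0.9, 9):
    predicted = (score >= threshold).astype(int)
    agreement = np.mean(predicted == actual)
    print(f"threshold = {threshold:.1f}, agreement = {agreement:.3f}")
# The default threshold level is 0.50; the chart shows how agreement changes
# as the cut-off moves across its range.
```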
In order to interpret correctly the previous table and figure, let us consider some numerical
examples. First we remark that the results refer to the validation dataset, with 1629 enterprises