By repeating this procedure for each case in the database, we compute fitted values for each variable Y_i, and then define the blanket residuals by

r_{ik} = y_{ik} − ŷ_{ik}

for numerical variables, and by

c_{ik} = δ(y_{ik}, ŷ_{ik})

for categorical variables, where the function δ(a, b) takes the value δ = 0 when a = b and δ = 1 when a ≠ b. Lack of significant patterns in the residuals r_{ik}, together with approximate symmetry about 0, provides evidence of a good fit for the variable Y_i, while anomalies in the blanket residuals can help to identify weaknesses in the dependency structure that may be due to outliers or leverage points. Significance testing of the goodness of fit can be based on the standardized residuals

R_{ik} = r_{ik} / √V(y_i)

where the variance V(y_i) is computed from the fitted values. Under the hypothesis that the network fits the data well, we expect approximately 95% of the standardized residuals to fall within the limits [−2, 2]. When the variable Y_i is categorical, the residuals c_{ik} identify the errors made in reproducing the data and can be summarized to compute the error rate of the fit.
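A minimal sketch of how these residuals might be computed is shown below, assuming the fitted values ŷ_{ik} have already been obtained from each variable's Markov blanket; taking V(y_i) to be the sample variance of the fitted values is our reading of "computed from the fitted values", not a prescription from the text.

```python
import numpy as np

def blanket_residuals(y_obs, y_fit):
    """Blanket residuals r_ik = y_ik - yhat_ik for one numerical variable Y_i."""
    return np.asarray(y_obs, dtype=float) - np.asarray(y_fit, dtype=float)

def standardized_residuals(y_obs, y_fit):
    """Standardized residuals R_ik = r_ik / sqrt(V(y_i)); V(y_i) is taken here
    as the sample variance of the fitted values (one plausible reading)."""
    r = blanket_residuals(y_obs, y_fit)
    v = np.var(np.asarray(y_fit, dtype=float), ddof=1)
    return r / np.sqrt(v)

def categorical_residuals(y_obs, y_fit):
    """c_ik = 0 when the observed and fitted categories agree, 1 otherwise."""
    return (np.asarray(y_obs) != np.asarray(y_fit)).astype(int)

# Roughly 95% of standardized residuals should lie in [-2, 2] for a good fit.
y_obs = np.array([2.1, 1.9, 3.2, 2.8, 2.5])
y_fit = np.array([2.0, 2.0, 3.0, 3.0, 2.4])
R = standardized_residuals(y_obs, y_fit)
print("fraction within [-2, 2]:", np.mean(np.abs(R) <= 2))
print("error rate:", categorical_residuals(["a", "b"], ["a", "a"]).mean())
```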
Because these residuals measure the difference between observed and fitted values, anomalies in the residuals can identify inadequate dependencies in the network. However, residuals that are on average not significantly different from 0 do not necessarily prove that the model is good. A better validation of the network should be done on an independent test set, to show that the model induced from one particular data set is reproducible and gives good predictions. Predictive accuracy can be measured by monitors based on the logarithmic scoring function (Good, 1952). The basic intuition is to measure the degree of surprise in predicting that the variable Y_i will take the value y_{ih} in the hth case of an independent test set. The measure of surprise is defined by the score

s_{ih} = −log p(y_{ih} | MB(y_i)_h)

where MB(y_i)_h is the configuration of the Markov blanket of Y_i in test case h, p(y_{ih} | MB(y_i)_h) is the predictive probability computed with the model induced from the data, and y_{ih} is the value of Y_i in the hth case of the test set. The score s_{ih} is 0 when the model predicts y_{ih} with certainty, and it increases as the probability of y_{ih} decreases. The scores can be summarized to derive local and global monitors and to define tests for predictive accuracy (Cowell et al., 1999).
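The score itself is a one-liner; the sketch below also shows one simple way the scores might be summarized into a global monitor (the total over test cases), which is only an illustration of the idea rather than the specific monitors of Cowell et al. (1999).

```python
import math

def log_score(p_observed_value):
    """Logarithmic score s_ih = -log p(y_ih | MB(y_i)_h): 0 when the observed
    value was predicted with certainty, unbounded as its probability -> 0."""
    return -math.log(p_observed_value)

def global_monitor(predictive_probs):
    """A simple summary of the scores over a test set (their total)."""
    return sum(log_score(p) for p in predictive_probs)

# Predictive probabilities assigned to the observed values of Y_i in four
# test cases (illustrative numbers only).
print(global_monitor([0.9, 0.7, 0.95, 0.4]))
```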
In the absence of an independent test set, standard cross-validation techniques are typically used to assess the predictive accuracy of one or more nodes (Hand, 1997). In K-fold cross validation, the data are divided into K non-overlapping sets of approximately the same size. Then K − 1 sets are used for training (or inducing) the network, which is then tested on the remaining set using monitors or other measures of predictive accuracy (Hastie et al., 2001). By repeating this process K times, we derive independent measures of the predictive accuracy of the network induced from the data, as well as measures of the robustness of the network to sampling variability. Note that predictive accuracy based on cross-validation is usually an over-optimistic measure, and several authors have recently argued that cross-validation should be used with caution (Braga-Neto and Dougherty, 2004), particularly with small sample sizes.
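A hedged sketch of the K-fold loop follows; `induce` and `monitor` are hypothetical placeholders for whatever learning procedure and predictive-accuracy measure are in use, and `data` is assumed to be a NumPy array of cases.

```python
import numpy as np

def kfold_indices(n_cases, k, seed=0):
    """Split case indices into K non-overlapping folds of roughly equal size."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_cases)
    return np.array_split(idx, k)

def cross_validated_score(data, k, induce, monitor):
    """Generic K-fold loop: `induce` learns a network from the training cases,
    `monitor` returns a predictive-accuracy measure on the held-out cases."""
    folds = kfold_indices(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = induce(data[train_idx])
        scores.append(monitor(model, data[test_idx]))
    # Mean and spread over folds: accuracy and robustness to sampling variability.
    return np.mean(scores), np.std(scores)
```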
10.5 Bayesian Networks in Data Mining

This section describes the use of Bayesian networks to undertake other typical Data Mining tasks, such as classification, and to represent more complex structures, such as nonlinear and temporal dependencies.
10.5.1 Bayesian Networks and Classification
The term “supervised classification” covers two complementary tasks: the first is to identify a function mapping a set of attributes onto a class, and the other is to assign a class label to a set of unclassified cases described by attribute values. We denote by C the variable whose states represent the class labels c_i, and by Y_i the attributes.

Classification is typically performed by first training a classifier on a set of labelled cases (the training set) and then using it to label unclassified cases (the test set). The supervisory component of this classifier resides in the training signal, which provides the classifier with a way to assess a dependency measure between attributes and classes. The classification of a case with attribute values y_{1k}, ..., y_{vk} is then performed by computing the probability distribution p(C | y_{1k}, ..., y_{vk}) of the class variable given the attribute values, and by labelling the case with the most probable label. Most algorithms for learning classifiers described as Bayesian networks impose a restriction on the network structure, namely that there cannot be arcs pointing to the class variable. In this case, by the local Markov property, the joint probability p(y_{1k}, ..., y_{vk}, c_k) of class and attributes factorizes as p(c_k) p(y_{1k}, ..., y_{vk} | c_k). The simplest example is known as the Naïve Bayes classifier (NBC) (Duda and Hart, 1973, Langley et al., 1992), which makes the further simplification that the attributes Y_i are conditionally independent given the class C, so that

p(y_{1k}, ..., y_{vk} | c_k) = Π_i p(y_{ik} | c_k).
Figure 10.5 depicts the directed acyclic graph of a NBC. Because of the restriction on the network topology, the training step for a NBC consists of estimating the conditional probability distributions of each attribute given the class from a training data set. When the attributes are discrete, or are continuous variables that follow Gaussian distributions, the parameters are learned by using the procedure described in Section 10.4. Once trained, the NBC classifies a case by computing the posterior probability distribution over the classes via Bayes' theorem and assigns the case to the class with the highest posterior probability.
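As a concrete illustration, here is a minimal Gaussian NBC along the lines described above; the class name, the toy data, and the maximum likelihood estimation of class-conditional means and variances are our choices for the sketch, not a reproduction of any implementation from the chapter.

```python
import numpy as np

class GaussianNBC:
    """Naive Bayes classifier with Gaussian attributes: each Y_i is modelled as
    Gaussian given the class, and the attributes are assumed conditionally
    independent given C."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = {c: np.mean(y == c) for c in self.classes_}
        # Class-conditional means and variances, estimated by maximum likelihood.
        self.means_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        self.vars_ = {c: X[y == c].var(axis=0) for c in self.classes_}
        return self

    def predict(self, X):
        labels = []
        for x in np.atleast_2d(X):
            # log p(c) + sum_i log p(y_i | c), then pick the most probable class.
            scores = {}
            for c in self.classes_:
                m, v = self.means_[c], self.vars_[c]
                log_lik = -0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v)
                scores[c] = np.log(self.priors_[c]) + log_lik
            labels.append(max(scores, key=scores.get))
        return np.array(labels)

# Tiny illustrative example with two classes and two attributes.
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 4.1], [3.2, 3.9]])
y = np.array([0, 0, 1, 1])
print(GaussianNBC().fit(X, y).predict([[1.1, 2.1], [3.1, 4.0]]))
```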
When the attributes are all continuous and modelled by Gaussian variables, and the class variable is binary, say c_k = 0, 1, the classification rule induced by a NBC is very similar to the Fisher discriminant rule and turns out to be a function of

r = Σ_i { log(σ²_{i0}/σ²_{i1}) − (y_i − μ_{i1})²/σ²_{i1} + (y_i − μ_{i0})²/σ²_{i0} }

where y_i is the value of attribute i in the new sample to classify, and the parameters σ²_{ik} and μ_{ik} are the variance and mean of the attribute's Gaussian distribution conditional on class membership; they are usually estimated by maximum likelihood.
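A direct transcription of this discriminant quantity, assuming the class-conditional means and variances have already been estimated (for instance by maximum likelihood, as in the NBC sketch earlier):

```python
import numpy as np

def nbc_discriminant(y, mu0, var0, mu1, var1):
    """r = sum_i [ log(var_i0/var_i1) - (y_i - mu_i1)^2/var_i1
                                      + (y_i - mu_i0)^2/var_i0 ];
    large r favours class 1, small r favours class 0 (the exact threshold
    also depends on the class priors)."""
    y, mu0, var0, mu1, var1 = map(np.asarray, (y, mu0, var0, mu1, var1))
    return np.sum(np.log(var0 / var1)
                  - (y - mu1) ** 2 / var1
                  + (y - mu0) ** 2 / var0)

print(nbc_discriminant([1.1, 2.1], mu0=[1.1, 1.9], var0=[0.01, 0.01],
                       mu1=[3.1, 4.0], var1=[0.01, 0.01]))
```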
Other classifiers have been proposed to relax the assumption that attributes are conditionally independent given the class. Perhaps the most competitive one is the Tree Augmented Naïve Bayes (TAN) classifier (Friedman et al., 1997), in which every attribute has the class variable as a parent as well as at most one other attribute. To avoid cycles, the attributes are ordered and the first attribute has no parent other than the class variable. Figure 10.6 shows an example of a TAN classifier with five attributes. An algorithm to infer a TAN classifier needs to choose both the dependency structure between the attributes and the parameters that quantify this dependency. Due to the simplicity of its structure, identifying a TAN classifier does not require a search but only the construction of a tree among the attributes. An ad hoc algorithm called Construct-TAN (CTAN) was proposed in (Friedman et al., 1997). One limitation of the CTAN algorithm is that it applies only to discrete attributes, so continuous attributes need to be discretized.
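CTAN proceeds, roughly, by weighting each pair of attributes by their conditional mutual information given the class and extracting a maximum-weight spanning tree, which is then directed away from a root attribute. The sketch below implements that tree-construction step under those assumptions; the use of Prim's method, the choice of attribute 0 as root, and the toy data are our own choices, and estimating the conditional probability tables afterwards proceeds as for the NBC.

```python
import numpy as np
from itertools import combinations

def cond_mutual_info(xi, xj, c):
    """I(Y_i; Y_j | C) estimated from empirical frequencies of discrete data."""
    cmi = 0.0
    for cv in np.unique(c):
        mask = (c == cv)
        pc = mask.mean()
        xi_c, xj_c = xi[mask], xj[mask]
        for a in np.unique(xi_c):
            for b in np.unique(xj_c):
                p_ab = np.mean((xi_c == a) & (xj_c == b))
                p_a, p_b = np.mean(xi_c == a), np.mean(xj_c == b)
                if p_ab > 0:
                    cmi += pc * p_ab * np.log(p_ab / (p_a * p_b))
    return cmi

def tan_tree(X, c):
    """Attribute tree of a TAN classifier: weight each attribute pair by
    I(Y_i; Y_j | C), grow a maximum-weight spanning tree (Prim's method),
    and direct it away from attribute 0, used here as an arbitrary root."""
    v = X.shape[1]
    w = {(i, j): cond_mutual_info(X[:, i], X[:, j], c)
         for i, j in combinations(range(v), 2)}
    in_tree, parents = {0}, {0: None}   # the root has only the class as parent
    while len(in_tree) < v:
        i, j = max(((i, j) for i in in_tree for j in range(v) if j not in in_tree),
                   key=lambda e: w[(min(e), max(e))])
        parents[j] = i
        in_tree.add(j)
    return parents                      # parents[j] = attribute parent of Y_j

# Tiny categorical example: 4 binary attributes, binary class.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 4))
c = (X[:, 0] ^ X[:, 1]).astype(int)
print(tan_tree(X, c))
```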

Other extensions of the NBC try to relax some of the assumptions made by the NBC or the TAN classifier. One example is the l-Limited Dependence Bayesian classifier (l-LDB), in which the maximum number of parents that an attribute can have is l (Sahami, 1996). Another example is the unrestricted Augmented Naïve Bayes classifier (ANB), in which the number of parents is unlimited but the scoring metric used for learning, the minimum description length criterion, biases the search toward models with a small number of parents per attribute (Friedman et al., 1997). Due to the high dimensionality of the space of ANB networks, algorithms that build this type of classifier must rely on heuristic searches. More examples are reported in (Friedman et al., 1997).

Fig. 10.5. The structure of the Naïve Bayes classifier.

Fig. 10.6. The structure of a TAN classifier.
10.5.2 Generalized Gamma Networks
Most of the work on learning Bayesian networks from data has focused on learning
networks of categorical variables, or networks of continuous variables modeled by
Gaussian distributions with linear dependencies. This section describes a new class
of Bayesian networks, called Generalized Gamma networks (GGN), able to describe
possibly nonlinear dependencies between variables with non-normal distributions
(Sebastiani and Ramoni, 2003).
In a GGN, the conditional distribution of each variable Y_i given its parents Pa(y_i) = {Y_{i1}, ..., Y_{ip(i)}} follows a Gamma distribution

Y_i | pa(y_i), θ_i ∼ Gamma(α_i, μ_i(pa(y_i), β_i)),

where μ_i(pa(y_i), β_i) is the conditional mean of Y_i and μ_i(pa(y_i), β_i)²/α_i is the conditional variance. We use the standard parameterization of generalized linear models (McCullagh and Nelder, 1989), in which the mean μ_i(pa(y_i), β_i) is not restricted to be a linear function of the parameters β_{ij}; rather, linearity in the parameters is enforced in the linear predictor η_i, which is itself related to the mean by the link function μ_i = g(η_i). Therefore, we model the conditional density function as
p(y_i | pa(y_i), θ_i) = [α_i^{α_i} / (Γ(α_i) μ_i^{α_i})] y_i^{α_i − 1} e^{−α_i y_i/μ_i},   y_i ≥ 0   (10.4)
where μ_i = g(η_i) and the linear predictor η_i is parameterized as

η_i = β_{i0} + Σ_j β_{ij} f_j(pa(y_i))

where the f_j(pa(y_i)) are possibly nonlinear functions. The linear predictor η_i is linear in the parameters β, but it is not restricted to be a linear function of the parent values, so the generality of Gamma networks lies in their ability to encode general nonlinear stochastic dependencies between the node variables. Table 10.1 shows examples of nonlinear mean functions.
LINK       g(·)        LINEAR PREDICTOR η
IDENTITY   μ = η       η_i = β_{i0} + Σ_j β_{ij} y_{ij}
INVERSE    μ = η^{−1}  η_i = β_{i0} + Σ_j β_{ij} y_{ij}^{−1}
LOG        μ = e^η     η_i = β_{i0} + Σ_j β_{ij} log(y_{ij})

Table 10.1. Link functions and parameterizations of the linear predictor.
Figure 10.7 shows examples of Gamma density functions for shape parameters α = 1, 1.5, and 5 and mean μ = 400. Note that approximately symmetric distributions are obtained as the shape parameter α increases.
Fig. 10.7. Examples of Gamma density functions for shape parameters α = 1 (continuous line), α = 1.5 (dashed line), and α = 5 (dotted line) and mean μ = 400. For fixed mean, the parameter α determines the shape of the distribution, which is skewed to the right for small α and approaches symmetry as α increases.
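To make the parameterization concrete, the sketch below evaluates the conditional density (10.4) for a node with a log link, η_i = β_{i0} + Σ_j β_{ij} log(y_{ij}) (the LOG row of Table 10.1); the parameter values and parent configuration are invented purely for illustration.

```python
import numpy as np
from scipy.special import gammaln

def gamma_node_density(y, parents, beta, alpha):
    """Conditional density p(y_i | pa(y_i), theta_i) of a Generalized Gamma
    network node, Eq. (10.4), with shape alpha_i and conditional mean
    mu_i = exp(eta_i), eta_i = beta_i0 + sum_j beta_ij * log(parent_j)."""
    parents = np.asarray(parents, dtype=float)
    eta = beta[0] + np.dot(beta[1:], np.log(parents))
    mu = np.exp(eta)                      # mu_i = g(eta_i) with a log link
    log_p = (alpha * np.log(alpha) - gammaln(alpha) - alpha * np.log(mu)
             + (alpha - 1) * np.log(y) - alpha * y / mu)
    return np.exp(log_p)

# One node with two parents; beta = (beta_i0, beta_i1, beta_i2).
print(gamma_node_density(y=350.0, parents=[120.0, 80.0],
                         beta=np.array([1.0, 0.6, 0.4]), alpha=1.5))
```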
Unfortunately, there is no closed form solution for learning the parameters of a GGN, and we therefore have to resort to Markov Chain Monte Carlo methods to compute stochastic estimates (Madigan and Ridgeway, 2003), or to maximum likelihood to compute numerical approximations of the posterior modes (Kass and Raftery, 1995). A well known property of generalized linear models is that the parameters β_{ij} can be estimated independently of α_i, which is then estimated conditionally on β_{ij} (McCullagh and Nelder, 1989).

To compute the maximum likelihood estimates of the parameters β_{ij} within each family (Y_i, Pa(y_i)), we need to solve the system of equations ∂ log p(D | θ_i)/∂β_{ij} = 0. The Fisher scoring method is an efficient algorithm for finding the solution of this system. This iterative procedure is a generalization of the Newton-Raphson procedure in which the Hessian matrix is replaced by its expected value. This modification speeds up convergence, and the procedure typically converges in about 5 steps for appropriate initial values. Details can be found, for example, in (McCullagh and Nelder, 1989).
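For a Gamma node with a log link, the Fisher scoring iteration reduces to iteratively reweighted least squares with constant weights, so each step is an ordinary least-squares fit of a working response. The sketch below shows that special case only; the simulated family, the convergence tolerance, and the starting point are our choices for illustration.

```python
import numpy as np

def fisher_scoring_gamma_log(X, y, n_iter=25, tol=1e-8):
    """Fisher scoring for the beta_ij of one Gamma node with a log link.
    Each step regresses the working response z = eta + (y - mu)/mu on the
    design matrix X (first column of ones for beta_i0)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())            # start from an intercept-only fit
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu           # working response
        beta_new, *_ = np.linalg.lstsq(X, z, rcond=None)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Simulated family: two parents entering the predictor through log(.) terms.
rng = np.random.default_rng(0)
parents = rng.uniform(50, 150, size=(200, 2))
X = np.column_stack([np.ones(200), np.log(parents)])
mu_true = np.exp(X @ np.array([1.0, 0.6, 0.4]))
y = rng.gamma(shape=2.0, scale=mu_true / 2.0)   # mean mu_true, shape 2
print(fisher_scoring_gamma_log(X, y))
```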
Once the ML estimates of the β_{ij} are known, say β̂_i, we compute the fitted means

μ̂_{ik} = g(β̂_{i0} + Σ_j β̂_{ij} f_j(pa(y_i)))

and use these quantities to estimate the shape parameter α_i. Estimation of the shape parameter of a Gamma distribution is an open issue, and several estimators have been suggested (see, for example, (McCullagh and Nelder, 1989)). A popular choice is the deviance-based estimator, defined as

α̃_i = (n − q) / [Σ_k (y_{ik} − μ̂_{ik})² / μ̂²_{ik}]

where q is the number of parameters β_{ij} that appear in the linear predictor. The maximum likelihood estimate α̂_i of the shape parameter α_i requires the solution of the equation

n + n log(α_i) − n Γ'(α_i)/Γ(α_i) − Σ_k log(μ̂_{ik}) + Σ_k log(y_{ik}) − Σ_k y_{ik}/μ̂_{ik} = 0

with respect to α_i. We have an approximate closed form solution to this equation, based on a Taylor expansion, that is discussed in (Sebastiani, Ramoni, and Kohane, 2003, Sebastiani et al., 2004, Sebastiani, Yu, and Ramoni, 2003).
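Both estimators are easy to compute once the fitted means are available. The sketch below implements the deviance-based estimator and solves the score equation above (as reconstructed here) numerically; the bracketing interval passed to the root finder is an arbitrary choice of ours, not something prescribed in the text.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def alpha_deviance(y, mu_hat, q):
    """Deviance-based estimator: (n - q) / sum_k (y_k - mu_k)^2 / mu_k^2."""
    y, mu_hat = np.asarray(y, float), np.asarray(mu_hat, float)
    return (len(y) - q) / np.sum((y - mu_hat) ** 2 / mu_hat ** 2)

def alpha_mle(y, mu_hat):
    """Numerical root of the score equation for the shape parameter:
    n + n log(a) - n psi(a) - sum log(mu_k) + sum log(y_k) - sum y_k/mu_k = 0."""
    y, mu_hat = np.asarray(y, float), np.asarray(mu_hat, float)
    n = len(y)
    const = -np.sum(np.log(mu_hat)) + np.sum(np.log(y)) - np.sum(y / mu_hat)

    def score(a):
        return n + n * np.log(a) - n * digamma(a) + const

    return brentq(score, 1e-3, 1e3)   # bracket chosen for illustration

rng = np.random.default_rng(2)
mu_hat = rng.uniform(200, 600, size=500)
y = rng.gamma(shape=2.5, scale=mu_hat / 2.5)
print(alpha_deviance(y, mu_hat, q=3), alpha_mle(y, mu_hat))
```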
The model selection process also requires the use of approximation methods. In this case, we use the Bayesian information criterion (BIC) (Kass and Raftery, 1995) to approximate the marginal likelihood by 2 log p(D | θ̂) − n_p log(n), where θ̂ is the maximum likelihood estimate of θ and n_p is the overall number of parameters in the network. BIC is independent of the prior specification on the model space and trades off goodness of fit, measured by the term 2 log p(D | θ̂), against model complexity, measured by the term n_p log(n). We note that BIC decomposes into a term for each variable Y_i, which makes it possible to conduct local structural learning.
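As a small illustration of this local decomposition, the sketch below scores one family (Y_i, Pa(y_i)) by its BIC contribution using the Gamma log-likelihood of Eq. (10.4); the helper names and the simulated fitted means are ours.

```python
import numpy as np
from scipy.special import gammaln

def gamma_loglik(y, mu_hat, alpha):
    """Log-likelihood of one family under Eq. (10.4), summed over cases."""
    return np.sum(alpha * np.log(alpha) - gammaln(alpha) - alpha * np.log(mu_hat)
                  + (alpha - 1) * np.log(y) - alpha * y / mu_hat)

def bic_family(y, mu_hat, alpha, n_params):
    """Local BIC contribution: 2 log p(D_i | theta_i) - n_p log(n)."""
    return 2 * gamma_loglik(y, mu_hat, alpha) - n_params * np.log(len(y))

# Compare two candidate parent sets for the same node: the one with the
# higher local BIC is preferred.
rng = np.random.default_rng(3)
mu_a = rng.uniform(300, 500, 200)          # fitted means under parent set A
mu_b = np.full(200, 400.0)                 # fitted means with no parents
y = rng.gamma(2.0, mu_a / 2.0)
print(bic_family(y, mu_a, 2.0, n_params=3), bic_family(y, mu_b, 2.0, n_params=1))
```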
While the general type of dependencies in Gamma networks makes it possible to model a variety of relationships between the variables, exact probabilistic reasoning with the network becomes impossible, and we need to resort to Gibbs sampling (see Section 10.2). Our simulation approach uses adaptive rejection sampling (Gilks and Roberts, 1996) when the conditional density p(y_i | Y \ y_i, θ̂) is log-concave, and adaptive rejection Metropolis sampling (ARMS) in the other cases (Sebastiani and Ramoni, 2003).
10.5.3 Bayesian Networks and Dynamic Data
One of the limitations of Bayesian networks is their inability to represent feedback loops: by definition, the directed graph that encodes the marginal and conditional independencies between the network variables cannot have cycles. This limitation makes traditional Bayesian networks unsuitable for representing systems in which feedback control is a critical component, in application domains ranging from control engineering to the biomedical sciences. Dynamic Bayesian networks provide a general framework to integrate multivariate time series and to represent feed-forward loops and feedback mechanisms.
Fig. 10.8. A directed acyclic graph that represents the temporal dependency of three categori-
cal variables describing positive (+) and negative (-) regulation.
A dynamic Bayesian network is defined by a directed acyclic graph in which nodes continue to represent stochastic variables and arrows represent temporal dependencies that are quantified by probability distributions. The crucial assumption is that these probability distributions are time invariant, so that the directed acyclic graph of a dynamic Bayesian network represents only the time transitions that are necessary and sufficient to reconstruct the overall temporal process. Figure 10.8 shows the directed acyclic graph of a dynamic Bayesian network with three variables. The subscript of each node denotes the time lag, so the arrows from the nodes Y_{2(t−1)} and Y_{1(t−1)} to the node Y_{1(t)} describe the dependency of the probability distribution of the variable Y_1 at time t on the values of Y_1 and Y_2 at time t − 1. Similarly, the directed acyclic graph shows that the probability distribution of the variable Y_2 at time t is a function of the values of Y_1 and Y_2 at time t − 1. This symmetrical dependency allows us to represent feedback loops, and we used it to describe the regulatory control of glucose in diabetic patients (Ramoni et al., 1995). A dynamic Bayesian network is not restricted to representing temporal dependencies of order 1. For example, the probability distribution of the variable Y_3 at time t depends on the value of the variable at time t − 1 as well as on the value of the variable Y_2 at time t − 2. The conditional probability table in Figure 10.8 shows an example in which the variables Y_2 and Y_3 are categorical.

By using the local Markov property, the joint probability distribution of the three variables at time t, given the past history

h_t := y_{1(t−1)}, ..., y_{1(t−l)}, y_{2(t−1)}, ..., y_{2(t−l)}, y_{3(t−1)}, ..., y_{3(t−l)},

is given by the product of the three factors

p(y_{1(t)} | h_t) = p(y_{1(t)} | y_{1(t−1)}, y_{2(t−1)})
p(y_{2(t)} | h_t) = p(y_{2(t)} | y_{1(t−1)}, y_{2(t−1)})
p(y_{3(t)} | h_t) = p(y_{3(t)} | y_{3(t−1)}, y_{2(t−2)})

that represent the probabilities of transition over time. By assuming that these probability distributions are time invariant, they are sufficient to compute the probability that a process starting from the known values y_{1(1)}, y_{2(1)}, y_{3(0)}, y_{3(1)} evolves into y_{1(T)}, y_{2(T)}, y_{3(T)}, by using one of the algorithms for probabilistic reasoning described in Section 10.2. The same algorithms can be used to compute the probability that a process with values y_{1(T)}, y_{2(T)}, y_{3(T)} at time T started from the initial states y_{1(1)}, y_{2(1)}, y_{3(0)}, y_{3(1)}.
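A small sketch of how these time-invariant transition distributions can be used generatively for the three-variable network of Figure 10.8 follows. The probability values in the tables are invented for illustration (the chapter's own conditional probability table is not reproduced here), and we simply supply values at times 0 and 1 for all three variables as a convenient choice of initial conditions covering the lag-2 dependency.

```python
import numpy as np

rng = np.random.default_rng(4)

# Time-invariant transition distributions for binary variables Y1, Y2, Y3:
# P(Y1(t)=1 | Y1(t-1), Y2(t-1)), P(Y2(t)=1 | Y1(t-1), Y2(t-1)),
# P(Y3(t)=1 | Y3(t-1), Y2(t-2)).  Values are illustrative only.
p_y1 = {(0, 0): 0.2, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.9}
p_y2 = {(0, 0): 0.3, (0, 1): 0.4, (1, 0): 0.7, (1, 1): 0.8}
p_y3 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}

def simulate(T, init):
    """Draw one trajectory up to time T; `init` supplies values at times 0 and 1."""
    y1, y2, y3 = dict(init["y1"]), dict(init["y2"]), dict(init["y3"])
    for t in range(2, T + 1):
        y1[t] = int(rng.random() < p_y1[(y1[t - 1], y2[t - 1])])
        y2[t] = int(rng.random() < p_y2[(y1[t - 1], y2[t - 1])])
        y3[t] = int(rng.random() < p_y3[(y3[t - 1], y2[t - 2])])
    return y1, y2, y3

init = {"y1": {0: 0, 1: 1}, "y2": {0: 0, 1: 0}, "y3": {0: 0, 1: 1}}
print(simulate(10, init))
```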
Fig. 10.9. Modular learning of the dynamic Bayesian network in Figure 10.8. First a regressive model is learned for each of the three variables at time t, and then the three models are joined by their common ancestors Y_{1(t−1)}, Y_{2(t−1)}, and Y_{2(t−2)} to produce the directed acyclic graph in Figure 10.8.
Learning dynamic Bayesian networks when all the variables are observable is a straightforward parallel application of the structural learning described in Section 10.4. To build the network, we proceed by selecting the set of parents for each variable Y_i at time t, and the local models are then joined by their common ancestors. An example is shown in Figure 10.9. The search for each local dependency structure is simplified by the natural ordering imposed on the variables by the temporal frame (Friedman et al., 1998), which constrains the model space of each variable Y_i at time t: the set of candidate parents consists of the variables Y_{i(t−1)}, ..., Y_{i(t−p)} as well as the variables Y_{h(t−j)} for all h ≠ i and j = 1, ..., p. The K2 algorithm (Cooper and Herskovitz, 1992) discussed in Section 10.4 appears to be particularly suitable for exploring the space of dependencies for each variable Y_{i(t)}. The only critical issue is that the largest temporal order worth exploring depends on the sample size, because each temporal lag of order p leads to the loss of the first p temporal observations in the data set (Yu et al., 2002).
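The temporal constraint on the candidate parent sets, and the loss of the first p observations, are easy to make explicit; the sketch below enumerates the candidate parents that a K2-style search over each local model would score, and arranges a short series into lagged form (variable naming and the toy data are ours).

```python
def candidate_parents(variables, max_lag):
    """Candidate parents of any Y_i(t) under the temporal constraint: every
    variable (including Y_i itself) at lags 1, ..., p.  Each entry is
    (variable, time offset); arcs can only point forward in time."""
    return [(h, -lag) for lag in range(1, max_lag + 1) for h in variables]

def lagged_design(series, max_lag):
    """Pair the values at time t with the candidate-parent values at lags
    1..p.  The first p time points are lost, which is why the largest lag
    worth exploring depends on the sample size."""
    rows = []
    for t in range(max_lag, len(series)):
        lagged = {(h, -lag): series[t - lag][h]
                  for lag in range(1, max_lag + 1) for h in series[t]}
        rows.append((series[t], lagged))
    return rows

print(candidate_parents(["Y1", "Y2", "Y3"], max_lag=2))
data = [{"Y1": 0, "Y2": 1, "Y3": 0}, {"Y1": 1, "Y2": 1, "Y3": 0},
        {"Y1": 1, "Y2": 0, "Y3": 1}, {"Y1": 0, "Y2": 0, "Y3": 1}]
print(len(lagged_design(data, max_lag=2)))   # 4 - 2 = 2 usable transitions
```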
10.6 Data Mining Applications
Bayesian networks have been used by us and others as knowledge discovery tools in a variety of fields, ranging from survey data analysis (Sebastiani and Ramoni, 2000, Sebastiani and Ramoni, 2001B) to customer profiling (Sebastiani et al., 2000) and bioinformatics (Friedman, 2004, Sebastiani et al., 2004). Here we describe two Data Mining and knowledge discovery applications based on Bayesian networks.
10.6.1 Survey Data
A major goal of surveys conducted by Federal Agencies is to provide citizens, consumers, and decision makers with useful information in a compact and understandable format. The data are expected to improve the understanding that institutions, businesses, and citizens have of the current state of affairs in the country, and they play a key role in political decisions. But the size and structure of these fast-growing databases pose the challenge of how to effectively extract and present this information in order to enhance planning, prediction, and decision making. An example of a fast-growing database is the Current Population Survey (CPS) database, which collects the monthly surveys of about 50,000 households conducted by the U.S. Bureau of the Census. These surveys are the primary source of information on the labor force characteristics of the U.S. population; they provide estimates for the nation as a whole and serve as part of model-based estimates for individual states and other geographic areas. Estimates obtained from the CPS include employment and unemployment, earnings, hours of work, and other indicators, and are often associated with a variety of demographic characteristics including age, sex, race, marital status, and education. CPS data are used by government policymakers and legislators as important indicators of the nation's economic situation and for planning and evaluating many government programs.

For most of the surveys conducted by the U.S. Census Bureau, users can access both the microdata and summary tables. Summary tables provide easy access to findings of interest by relating a small number of preselected variables. In so doing, summary tables disintegrate the information contained in the original data into micro-components and fail to convey an overall picture of the process underlying the data. A different approach to the analysis of survey data is to employ Data Mining tools to generate hypotheses and hence to make new discoveries in an automated way (Hand et al., 2001, Hand et al., 2002).

Fig. 10.10. Bayesian network induced from a portion of the 1996 General Household Survey, conducted between April 1996 and March 1997 by the British Office of National Statistics in Great Britain.
As an example, Figure 10.10 shows a Bayesian network learned from a data set of 13 variables extracted from the 1996 General Household Survey conducted between April 1996 and March 1997 by the British Office of National Statistics in Great Britain. The variables and their states are summarized in Table 10.2. The network structure shows interesting directed dependencies and conditional independencies. For example, there is a dependency between the ethnic group of the head of the household and the region of birth (variables Region and HoH origin), and the conditional probability table that shapes this dependency reveals a more cosmopolitan society in England than in Wales and Scotland, with a larger proportion of Blacks and Indians as household heads. The working status of the head of the household (HoH status) is independent of the ethnic group given gender and age. The conditional probability table that quantifies this dependency shows that young female heads of household are much more likely to be inactive than male heads of household (40% compared to 6% when the age group is 17–36). This difference is attenuated as the age of the head

×