Data Mining and Knowledge Discovery Handbook, 2nd Edition


Storey J.D., Taylor J.E. and Siegmund D., (2004). Strong control, conservative point estimation,
and simultaneous conservative consistency of false discovery rates: A unified
approach. Journal of the Royal Statistical Society Series B, 66:187–205.
Therneau T.M. and Grambsch P.M., (2000). Modeling Survival Data, Extending the Cox
Model. Springer.
Tibshirani R. and Knight K., (1999). The covariance inflation criterion for adaptive model
selection. Journal of the Royal Statistical Society Series B, 61:Part 3 529–546.
Zembowicz R. and Zytkow J.M., (1996). From contingency tables to various forms of knowledge
in databases. In U.M. Fayyad, R. Uthurusamy, G. Piatetsky-Shapiro and P. Smyth
(editors) Advances in Knowledge Discovery and Data Mining (pp. 329-349). MIT Press.
Zytkow J.M. and Zembowicz R., (1997). Contingency tables as the foundation for concepts,
concept hierarchies and rules: The 49er system approach. Fundamenta Informaticae,
30:383–399.
26
Logics for Data Mining
Petr Hájek
Institute of Computer Science
Academy of Sciences of the Czech Republic
182 07 Prague, Czech Republic

Summary. Systems of formal (symbolic) logic suitable for Data Mining are presented, with the
main emphasis on various kinds of generalized quantifiers.
Key words: logic, Data Mining, generalized quantifiers, GUHA method
Introduction
Data Mining, as presently understood, is a broad term that covers the search for “association
rules”, classification, regression, clustering and similar tasks. Here we restrict ourselves to the
search for “rules” in a rather general sense, namely general dependencies valid in given data and
expressed by formulas of a formal logical language. The present theoretical approach is the
result of a long development of the GUHA method of automated generation of hypotheses
(General Unary Hypotheses Automaton, see the paragraph on the GUHA method in Section 26.3)
but is believed to be fully relevant for contemporary mining of association rules and its possible
generalizations. See (Agrawal et al., 1996; Höppner, 2005; Adamo, 2001) for association rules.
Data are assumed to have the form of one or more tables, matrices or relations. A rectangular
matrix may be understood as giving data on objects (corresponding to rows of the matrix)
and their attributes (columns). Alternatively, rows may correspond to objects from one set and
columns to objects of the same or a different set, and the whole matrix is understood as one
binary attribute. In the former case we have variables $x, y$ for objects and a unary predicate
for each column ($P_i$ for the $i$-th column, say); $P_i(x)$ denotes the value of $P_i$ for the
object $x$. In the latter case we have variables for objects from the first set ($x$, say), other
variables for objects from the other set ($y$, say) and one binary predicate $P$; then $P(x,y)$
denotes the value of the attribute for the pair $(x,y)$ of objects.
For example, let rows correspond to patients and let the first column correspond to having fever
(yes – value 1, no – value 0). Then patient Novák satisfies $P_1(x)$ if he has fever. As a second
example, let the matrix describe the relation “being a married couple” and take $MC$ for the
predicate. Then the couple (Novák, Nováková) satisfies $MC(x,y)$ if they are a married couple,
so the corresponding field in the matrix has value 1.
These were examples of Boolean (0-1-valued) data; more generally, data may take values from
some set of values (reals, colours, …).
O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed.,
DOI 10.1007/978-0-387-09823-4_26, © Springer Science+Business Media, LLC 2010
It should be clear what the value of $P(x)$ is for an object in the former case and what the
value of $P(x,y)$ is for a pair of objects in the latter.
Logic enables us to construct composed formulas from atomic formulas as above using
connectives (such as conjunction, disjunction, implication and negation in Boolean logic;
notation: $\wedge$, $\vee$, $\to$, $\neg$) and quantifiers (universal $\forall$ and existential
$\exists$ in classical Boolean logic). Our main message is that in Data Mining we have to deal
with generalized quantifiers of a particular kind and that their logical properties are very
important. For simplicity we restrict ourselves to two-valued (0-1-valued) data. Our approach
generalizes easily to categorical (finitely valued) data when we work with atomic formulas of the
form $(X)P(x)$ (or $(X)P(x,y)$ etc.), where $X$ is a subset of the domain of values of the
attribute $P$ and an object $o$ satisfies $(X)P(x)$ iff the $P$-value of $o$ is in $X$; similarly
for $(X)P(x,y)$. (For example, let $P$ denote age in years $0, \dots, 100$ and let $X$ be the set
of numbers $n$ with $0 \le n \le 30$.)
The reader with some knowledge of classical propositional and predicate calculus will
have no problems with these notions; a reader who has difficulties is advised to consult
a textbook of mathematical logic, e.g. (Ebbinghaus et al., 1984).
26.1 Generalized quantifiers
We shall present the notion of a generalized quantifier and supply several examples. (In
the next section we shall study various classes of quantifiers.) For simplicity we shall work
with data having the form of a rectangular Boolean matrix, rows corresponding to objects and
columns to (yes-no) attributes. Predicates $P_1, \dots, P_n$ are names of the attributes; we have
an object variable $x$, and $P_1(x), \dots, P_n(x)$ are atomic formulas. For each formula
$\varphi$ we have its negation $\neg\varphi$; an object satisfies $\neg\varphi$ if it does not
satisfy $\varphi$. For each pair $\varphi, \psi$ of formulas we have their conjunction
$\varphi \wedge \psi$ and disjunction $\varphi \vee \psi$. An object satisfies
$\varphi \wedge \psi$ if it satisfies both $\varphi$ and $\psi$; it satisfies $\varphi \vee \psi$
if it satisfies at least one of $\varphi, \psi$. Similarly for conjunctions/disjunctions of three,
four, … formulas. An object satisfies the implication $\varphi \to \psi$ if it satisfies $\psi$
or does not satisfy $\varphi$. Formulas built from atomic formulas using the connectives
$\neg, \wedge, \vee, \to$ are open. Each open formula $\varphi$ defines an attribute: each object
of our data either satisfies $\varphi$ or does not satisfy it, and this is uniquely determined by
the data. Thus $\varphi$ defines two numbers: $r$, the number of objects satisfying $\varphi$,
and $s$, the number of objects satisfying $\neg\varphi$. The pair $(r,s)$ may be called the
two-fold table of $\varphi$ (given by the data); $r + s = m$ is the number of objects in the data
(rows of the matrix).
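The two-fold table of an open formula can be computed directly from a Boolean data matrix. The following is a minimal sketch, not part of the original text: the function name `two_fold_table`, the sample matrix and the formula `phi` are all illustrative assumptions.

```python
from typing import Callable, Sequence, Tuple

Row = Sequence[bool]

def two_fold_table(data: Sequence[Row], phi: Callable[[Row], bool]) -> Tuple[int, int]:
    """Return (r, s): the number of rows satisfying phi and the number satisfying not-phi."""
    r = sum(1 for row in data if phi(row))
    return r, len(data) - r

# Hypothetical data: 4 objects (rows), attributes P1, P2, P3 (columns).
data = [
    (True, False, True),
    (True, True, False),
    (False, False, True),
    (True, True, True),
]

def phi(row: Row) -> bool:
    # Open formula P1(x) AND NOT P2(x)
    return row[0] and not row[1]

r, s = two_fold_table(data, phi)
print(r, s, r + s)  # r + s is the number of rows m
```

Only the first row satisfies the formula here, so the table depends on the data alone and not on the order of the rows.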
A (one-dimensional) quantifier $Q$ applied to an open formula $\varphi$ describes the behaviour
of the attribute defined by $\varphi$ in the data as a whole, i.e. gives a global characterization
of the attribute (in the data). The reader surely knows the classical quantifiers $\forall$
(universal) and $\exists$ (existential). The formula $(\forall x)\varphi$ is true in the data if
each object satisfies $\varphi$ (thus $s = 0$); the formula $(\exists x)\varphi$ is true in the
data if at least one object satisfies $\varphi$ (thus $r \ge 1$). Note that the truth of such a
quantified formula does not depend on any ordering of the rows of the data matrix; it is
determined solely by the two-fold table of $\varphi$.
A (one-dimensional) quantifier $Q$ is determined by its truth function $\mathrm{Tr}_Q$ assigning
to each two-fold table $(r,s)$ either 1 (true) or 0 (false). For each open formula $\varphi$, the
closed formula $(Qx)\varphi$ is true in the data iff the two-fold table $(r,s)$ of $\varphi$
(given by the data) satisfies $\mathrm{Tr}_Q(r,s) = 1$. Clearly, $\mathrm{Tr}_\forall(r,s) = 1$
iff $s = 0$ and $\mathrm{Tr}_\exists(r,s) = 1$ iff $r > 0$.
Some other examples:
Majority: $\mathrm{Tr}_{\mathrm{Maj}}(r,s) = 1$ iff $r > s$; $(\mathrm{Maj}\,x)\varphi$ says that
the majority of objects satisfy $\varphi$.
Many: Let $0 < p < 1$; $\mathrm{Tr}_{\mathrm{Many}_p}(r,s) = 1$ iff $r/(r+s) \ge p$;
$(\mathrm{Many}_p\,x)\varphi$ says that the relative frequency of objects satisfying $\varphi$
is at least $p$.
At least: $\mathrm{Tr}_{\exists^{\ge n}}(r,s) = 1$ iff $r \ge n$ (at least $n$ objects satisfy
$\varphi$).
Odd: $\mathrm{Tr}_{\mathrm{Odd}}(r,s) = 1$ iff $r$ is an odd number.
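The truth functions listed above translate almost literally into code. This is an illustrative sketch only: the function names are hypothetical, and treating the empty table ($r + s = 0$) as false for Many is an assumption that the text does not address.

```python
from fractions import Fraction

def tr_forall(r, s):
    return int(s == 0)

def tr_exists(r, s):
    return int(r > 0)

def tr_maj(r, s):
    return int(r > s)

def tr_many(p, r, s):
    # Assumption: an empty table (r + s == 0) is treated as false.
    return int(r + s > 0 and Fraction(r, r + s) >= p)

def tr_at_least(n, r, s):
    return int(r >= n)

def tr_odd(r, s):
    return int(r % 2 == 1)

# Hypothetical two-fold table: 3 objects satisfy phi, 1 does not.
r, s = 3, 1
print(tr_forall(r, s), tr_exists(r, s), tr_maj(r, s))
print(tr_many(Fraction(3, 4), r, s), tr_many(Fraction(4, 5), r, s))
print(tr_at_least(3, r, s), tr_odd(r, s))
```

Exact rational arithmetic (`Fraction`) is used so that a boundary case such as $r/(r+s) = p$ is not affected by floating-point rounding.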
The reader may produce many more examples. We shall be more general:
A two-dimensional quantifier $Q$, applied to a pair $\varphi, \psi$ of open formulas, describes the
behaviour of the pair of attributes defined by $\varphi$ and $\psi$ in the data as a whole, and
thus gives a global characterization of the mutual relation of $\varphi, \psi$ (in the data). The
closed formula given by $Q, \varphi, \psi$ is written as $(Qx)(\varphi, \psi)$. Its truth/falsity
in the data is determined by the four-fold table $(a,b,c,d)$, where $a, b, c, d$ denote the
numbers of objects in the data satisfying $\varphi\wedge\psi$, $\varphi\wedge\neg\psi$,
$\neg\varphi\wedge\psi$, $\neg\varphi\wedge\neg\psi$ respectively. This is often displayed as

$$
\begin{array}{c|cc|c}
            & \psi & \neg\psi &   \\ \hline
\varphi     & a    & b        & r \\
\neg\varphi & c    & d        & s \\ \hline
            & k    & l        & m
\end{array}
$$

where $r = a + b$ (number of objects satisfying $\varphi$), $s = c + d$, $k = a + c$, $l = b + d$
(marginal sums), $m = a+b+c+d = r+s = k+l$. Thus the truth function of a two-dimensional
quantifier $Q$ assigns to each four-fold table $(a,b,c,d)$ the value
$\mathrm{Tr}_Q(a,b,c,d) \in \{0,1\}$.
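As an illustration, the four-fold table and its marginal sums can be counted in a single pass over the data. This sketch is not from the original text; `four_fold_table`, the sample rows and the two formulas are hypothetical.

```python
def four_fold_table(data, phi, psi):
    """Count rows by the truth values of phi and psi; returns (a, b, c, d)."""
    a = b = c = d = 0
    for row in data:
        f, g = bool(phi(row)), bool(psi(row))
        if f and g:
            a += 1      # phi and psi
        elif f:
            b += 1      # phi and not psi
        elif g:
            c += 1      # not phi and psi
        else:
            d += 1      # not phi and not psi
    return a, b, c, d

# Hypothetical data: each row holds the truth values of two attributes.
data = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0)]
a, b, c, d = four_fold_table(data, lambda row: row[0], lambda row: row[1])

# Marginal sums as in the displayed table.
r, s, k, l = a + b, c + d, a + c, b + d
m = a + b + c + d
print((a, b, c, d), (r, s, k, l, m))
```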
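All (two-dimensional): $\varphi(x) \Rightarrow \psi$ says “all $\varphi$’s are $\psi$’s”;
$\mathrm{Tr}_{\Rightarrow}(a,b,c,d) = 1$ iff $b = 0$. This is definable by the one-dimensional
$\forall$ and the connective $\to$, namely $\varphi(x) \Rightarrow \psi$ says the same as
$(\forall x)(\varphi \to \psi)$.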
Many: for $0 < p \le 1$, $\varphi(x) \Rightarrow_p \psi$ says “$p$-many $\varphi$’s are
$\psi$’s”, i.e. the relative frequency of objects satisfying $\psi$ among those satisfying
$\varphi$ is $\ge p$, thus $\mathrm{Tr}_{\Rightarrow_p}(a,b,c,d) = 1$ iff $a/(a+b) \ge p$.
Caution: this is not the same as $(\mathrm{Many}_p\,x)(\varphi \to \psi)$. For example, if
$(a,b,c,d)$ is $(2,2,5,5)$ then $a/(a+b) = 1/2$, but the number of objects satisfying
$\varphi \to \psi$ is $a + c + d = 12$, thus $(a+c+d)/m = 12/14 = 6/7$. Thus if $p = 0.8$ then
$\varphi(x) \Rightarrow_p \psi$ is false but $(\mathrm{Many}_p\,x)(\varphi \to \psi)$ is true.
But $\varphi(x) \Rightarrow_p \psi$ can also be written as $(\mathrm{Many}_p\,x)\psi/\varphi$,
understood as saying that the formula $(\mathrm{Many}_p\,x)\psi$ is true in the subtable
consisting of rows satisfying $\varphi$.
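p-equivalence: $\varphi(x) \Leftrightarrow_p \psi$ ($\varphi$ is $p$-equivalent to $\psi$) is
true in the data if both $\varphi(x) \Rightarrow_p \psi$ and $\neg\varphi(x) \Rightarrow_p
\neg\psi$ are true, thus $a/(a+b) \ge p$ and $d/(c+d) \ge p$.
Foundedness: Let $t$ be a natural number. $(\mathrm{Fdd}_t\,x)(\varphi,\psi)$ says that at least
$t$ objects satisfy $\varphi \wedge \psi$, i.e. $a \ge t$. Similarly:
Support: Let $0 < \sigma < 1$. $(\mathrm{Supp}_\sigma\,x)(\varphi,\psi)$ is
$(\mathrm{Many}_\sigma\,x)(\varphi \wedge \psi)$, thus says that the relative frequency of
$\varphi \wedge \psi$ in the data is at least $\sigma$.
Founded implication: $(\mathrm{FIMPL}_{p,t}\,x)(\varphi,\psi)$ (or just
$\varphi \Rightarrow_{p,t} \psi$) is the conjunction of $\varphi \Rightarrow_p \psi$ and
$(\mathrm{Fdd}_t\,x)(\varphi,\psi)$, hence $\mathrm{Tr}_{\Rightarrow_{p,t}}(a,b,c,d) = 1$ iff
$a/(a+b) \ge p$ and $a \ge t$.
Agrawal: $\varphi \Rightarrow^{\mathrm{Agr}}_{p,\sigma} \psi$ is the conjunction of
$\varphi \Rightarrow_p \psi$ and $(\mathrm{Supp}_\sigma\,x)(\varphi,\psi)$, hence
$\mathrm{Tr}_{\Rightarrow^{\mathrm{Agr}}_{p,\sigma}}(a,b,c,d) = 1$ iff $a/(a+b) \ge p$ and
$a/(a+b+c+d) \ge \sigma$.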
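Several of the two-dimensional truth functions above can be sketched as follows, together with the cautionary table $(2,2,5,5)$ for $p = 0.8$. The function names are hypothetical, and returning false for tables with an empty denominator is an assumption not made in the text.

```python
from fractions import Fraction

def tr_all(a, b, c, d):
    # "All phi's are psi's": no object satisfies phi and not psi.
    return int(b == 0)

def tr_p_implication(p, a, b, c, d):
    # Assumption: a table with a + b == 0 is treated as false.
    return int(a + b > 0 and Fraction(a, a + b) >= p)

def tr_many_of_material_implication(p, a, b, c, d):
    # (Many_p x)(phi -> psi): objects satisfying phi -> psi are counted by a + c + d.
    m = a + b + c + d
    return int(m > 0 and Fraction(a + c + d, m) >= p)

def tr_p_equivalence(p, a, b, c, d):
    # phi =>_p psi together with (not phi) =>_p (not psi), whose table is (d, c, b, a).
    return int(tr_p_implication(p, a, b, c, d) and tr_p_implication(p, d, c, b, a))

table = (2, 2, 5, 5)
p = Fraction(4, 5)
print(tr_p_implication(p, *table))                 # a/(a+b) = 1/2 is below p
print(tr_many_of_material_implication(p, *table))  # 12/14 = 6/7 is at least p
```

The two printed values differ, which is exactly the point of the caution: a high frequency of the material implication does not mean that many $\varphi$’s are $\psi$’s.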
Clearly the last two quantifiers differ only very little: the founded implication
$\varphi \Rightarrow^{*}_{p,t} \psi$ is equivalent to
$\varphi \Rightarrow^{\mathrm{Agr}}_{p,\sigma} \psi$ for $\sigma = t/(a+b+c+d)$. Now
$\Rightarrow^{\mathrm{Agr}}$ is the quantifier of the association rules of Agrawal; it is
little known, and has to be stressed, that the “almost the same” quantifier of founded
implication was used in GUHA to generate “association rules” in the presently common sense as
early as the mid-sixties of the past century (Hájek et al., 1966).
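The equivalence of founded implication and Agrawal's quantifier for a fixed number of objects can be checked by brute force. The sketch below is illustrative: all names are hypothetical, and the check covers only tables with a given sum $m$, since $\sigma = t/m$ depends on the size of the data.

```python
from fractions import Fraction

def tr_founded_implication(p, t, a, b, c, d):
    return int(a + b > 0 and Fraction(a, a + b) >= p and a >= t)

def tr_agrawal(p, sigma, a, b, c, d):
    m = a + b + c + d
    return int(a + b > 0 and Fraction(a, a + b) >= p and Fraction(a, m) >= sigma)

def tables_with_sum(m):
    """Yield every four-fold table (a, b, c, d) with a + b + c + d == m."""
    for a in range(m + 1):
        for b in range(m + 1 - a):
            for c in range(m + 1 - a - b):
                yield a, b, c, m - a - b - c

def agree_for_fixed_size(m, p):
    """True if both quantifiers coincide on all tables of size m when sigma = t/m."""
    return all(
        tr_founded_implication(p, t, *tab) == tr_agrawal(p, Fraction(t, m), *tab)
        for t in range(m + 1)
        for tab in tables_with_sum(m)
    )

print(agree_for_fixed_size(8, Fraction(1, 2)))
```

For data of a different size the same threshold $t$ corresponds to a different $\sigma$, which is why the two quantifiers are “almost” but not literally the same.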
The reader may experiment with defining more and more two-dimensional quantifiers; clearly not
all of them are relevant for Data Mining. We close this section with two important remarks.
Closed formulas. Each formula (of the present formalism, with unary predicates and just
one object variable) beginning with a quantifier is closed, i.e. does not refer to any particular
object but expresses some global pattern found in the data. Further closed formulas result
from those beginning with a quantifier by using connectives (e.g. we have seen that
$\varphi(x) \Leftrightarrow_p \psi$ is equivalent to
$(\varphi(x) \Rightarrow_p \psi) \wedge (\neg\varphi(x) \Rightarrow_p \neg\psi)$, etc.).
A closed formula is a tautology (logical truth) if it is true in all data. To give a trivial
example, observe that if $p_1 \le p_2$ then the formula
$(\varphi(x) \Rightarrow_{p_2} \psi) \to (\varphi(x) \Rightarrow_{p_1} \psi)$ is a tautology.
Predicates of higher arity. If our data contain information on relations of higher arity (binary,
ternary, …) we have to use predicates of higher arity and more than one object variable.
A quantifier always binds (quantifies) a variable. This leads to the logical notion of free and
bound variables of a formula, free variables varying over arbitrary objects of the data. For
example, take a binary predicate $P$; $P(x,y)$ is a formula in which $x, y$ are free, and
$(\mathrm{Many}_p\,y)P(x,y)$ is a formula in which only $x$ is free. An object $o$ satisfies the
last formula iff it is $P$-related with $p$-many objects. (Let $P(x,y)$ say “$x$ knows $y$” and
let $p$ be 0.8. An object $o$ satisfies $(\mathrm{Many}_{0.8}\,y)P(x,y)$ if it knows at least
80% of the objects in the data.) We may form composed formulas using several quantifiers binding
different variables, e.g. $(\forall x)((\mathrm{Many}_{0.8}\,y)P(x,y) \to R(x))$ (saying “each
object knowing at least 80% of the objects has the property $R$”) etc. Formulas of this sort are
used in relational Data Mining (Džeroski and Lavrač, 2001). We shall not go into details.
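A small sketch of how a formula with one free and one bound variable can be evaluated over a binary relation. Everything here is hypothetical illustration: the object names, the relation `knows`, the property set `has_R` and both helper functions.

```python
from fractions import Fraction

# Hypothetical data: five objects, a binary relation "x knows y", a unary property R.
objects = ["o1", "o2", "o3", "o4", "o5"]
knows = {("o1", y) for y in ["o1", "o2", "o3", "o4"]} | {("o2", "o1"), ("o2", "o2")}
has_R = {"o1"}

def satisfies_many(p, relation, x, domain):
    """Does x satisfy (Many_p y) P(x, y), with y bound and x free?"""
    related = sum(1 for y in domain if (x, y) in relation)
    return Fraction(related, len(domain)) >= p

def forall_many_implies(p, relation, prop, domain):
    """Truth of the closed formula (forall x)((Many_p y) P(x, y) -> R(x))."""
    return all((not satisfies_many(p, relation, x, domain)) or (x in prop) for x in domain)

p = Fraction(4, 5)
print([x for x in objects if satisfies_many(p, knows, x, objects)])
print(forall_many_implies(p, knows, has_R, objects))
```

Binding $y$ turns the binary relation into a unary attribute of $x$, after which the outer quantifier is evaluated as in the one-variable case.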
26.2 Some important classes of quantifiers
26.2.1 One-dimensional
Let us call a one-dimensional quantifier $Q$ multitudinal if its truth function $\mathrm{Tr}_Q$
is non-decreasing in its first argument and non-increasing in the second, i.e. for any two-fold
tables $(r_1,s_1)$, $(r_2,s_2)$, whenever $r_1 \le r_2$, $s_1 \ge s_2$ and
$\mathrm{Tr}_Q(r_1,s_1) = 1$ then $\mathrm{Tr}_Q(r_2,s_2) = 1$. This means that the formula
$(Qx)\varphi$ says, in some sense given by $\mathrm{Tr}_Q$, that sufficiently many objects
satisfy $\varphi$. “Sufficiently many” may mean “all” ($\forall$), at least one ($\exists$),
at least 7 ($\exists^{\ge 7}$), at least $100p\%$ ($\mathrm{Many}_p$) etc.
Very important: The quantifier may correspond to a statistical test of high probability.
Telegraphically: Our hypothesis is that the probability of the attribute $\varphi$ is bigger than
$p$ (under frame assumptions saying that all objects have the same probability of having
$\varphi$ and are mutually independent). Take a small $\alpha$ (e.g. 0.05, the significance
level). The number $\sum_{i=r}^{r+s} \binom{r+s}{i} p^i (1-p)^{r+s-i}$ is the probability that
at least $r$ objects (from our $r+s$ objects) will have $\varphi$, assuming that the probability
of $\varphi$ is $p$. If this sum is $\le \alpha$ then we can reject the (null) hypothesis saying
that the probability of $\varphi$ is $p$ or less (since if it were, then what we have observed
would be improbable). This is the (simplified) idea of statistical hypothesis testing.
We get a quantifier of testing high probability, $\mathrm{HProb}_{p,\alpha}$:

$$
\mathrm{Tr}_{\mathrm{HProb}_{p,\alpha}}(r,s) = 1 \quad\text{iff}\quad
\sum_{i=r}^{r+s} \binom{r+s}{i} p^i (1-p)^{r+s-i} \le \alpha .
$$

This is an example of a statistically motivated one-dimensional quantifier (it can be proved
to be multitudinal). See Chapter 31.4.6 for statistical hypothesis testing and (Hájek and
Havránek, 1978) for its logical foundations.
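The truth function of the high-probability quantifier is an upper binomial tail compared with $\alpha$. A minimal sketch, assuming exact rational arithmetic and hypothetical function names; it sums the tail naively and is only meant for small tables.

```python
from fractions import Fraction
from math import comb

def binomial_upper_tail(n, r, p):
    """Probability of at least r successes in n independent trials with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(r, n + 1))

def tr_hprob(p, alpha, r, s):
    return int(binomial_upper_tail(r + s, r, p) <= alpha)

p, alpha = Fraction(1, 2), Fraction(1, 20)
print(binomial_upper_tail(10, 9, p), tr_hprob(p, alpha, 9, 1))  # tail is 11/1024
print(binomial_upper_tail(10, 7, p), tr_hprob(p, alpha, 7, 3))  # tail is 176/1024 = 11/64
```

With 9 of 10 objects satisfying $\varphi$ the null hypothesis “probability at most 1/2” is rejected at level 0.05, while with 7 of 10 it is not.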
26.2.2 Two-dimensional
Recall that a two-dimensional quantifier is given by its truth function assigning to each
four-fold table $(a,b,c,d)$ a truth value (1 or 0). In Data Mining we are especially interested
in two-dimensional quantifiers expressing in some sense a kind of association of two attributes
(described by two open formulas). In some sense, the formula $(Qx)(\varphi,\psi)$ should say that
there are sufficiently many coincidences in (the truth values of) $\varphi, \psi$ and not too
many differences. This leads to the following definition (Hájek and Havránek, 1978):
A two-dimensional quantifier $Q$ is associational if it satisfies the following for each pair
$(a_1,b_1,c_1,d_1)$, $(a_2,b_2,c_2,d_2)$ of four-fold tables:
$a_2 \ge a_1$, $b_2 \le b_1$, $c_2 \le c_1$, $d_2 \ge d_1$ and
$\mathrm{Tr}_Q(a_1,b_1,c_1,d_1) = 1$ implies $\mathrm{Tr}_Q(a_2,b_2,c_2,d_2) = 1$.
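In other words: if

$$
\begin{array}{c|cc}
              & \psi_1 & \neg\psi_1 \\ \hline
\varphi_1     & a_1    & b_1        \\
\neg\varphi_1 & c_1    & d_1
\end{array}
\qquad\qquad
\begin{array}{c|cc}
              & \psi_2 & \neg\psi_2 \\ \hline
\varphi_2     & a_2    & b_2        \\
\neg\varphi_2 & c_2    & d_2
\end{array}
$$

are four-fold tables of the pairs $(\varphi_1,\psi_1)$, $(\varphi_2,\psi_2)$ of open formulas in
given data, if $(Qx)(\varphi_1,\psi_1)$ is true in the data and the above inequalities hold
($a_2 \ge a_1$, $b_2 \le b_1$, $c_2 \le c_1$, $d_2 \ge d_1$), then $(Qx)(\varphi_2,\psi_2)$ is
also true. The second table has more coincidences ($a_2 \ge a_1$, $d_2 \ge d_1$) and fewer
differences ($b_2 \le b_1$, $c_2 \le c_1$).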
A quantifier $Q$ is locally associational if the above condition holds for all
$(a_1,b_1,c_1,d_1)$, $(a_2,b_2,c_2,d_2)$ satisfying the additional assumption
$a_1 + b_1 + c_1 + d_1 = a_2 + b_2 + c_2 + d_2$ (i.e. the tables correspond to two data matrices
of the same cardinality; in particular, think of $\varphi_1, \psi_1, \varphi_2, \psi_2$
evaluated in the same data matrix).
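The (locally) associational condition can be probed by exhaustive search over small tables. This is only a sketch under stated assumptions: the search is bounded by `max_entry`, so it can refute the property but never prove it, and both sample truth functions (`tr_half_implication`, `tr_odd_a`) are hypothetical examples.

```python
from fractions import Fraction
from itertools import product

def find_counterexample(tr, max_entry=3, local=False):
    """Search tables with entries 0..max_entry for a violation of the associational condition."""
    tables = list(product(range(max_entry + 1), repeat=4))
    for t1 in tables:
        if not tr(*t1):
            continue
        for t2 in tables:
            if local and sum(t1) != sum(t2):
                continue  # locally associational: compare only tables with equal sums
            more_coincidences = t2[0] >= t1[0] and t2[3] >= t1[3]
            fewer_differences = t2[1] <= t1[1] and t2[2] <= t1[2]
            if more_coincidences and fewer_differences and not tr(*t2):
                return t1, t2
    return None

def tr_half_implication(a, b, c, d):
    # "At least half of the phi's are psi's"; empty antecedent treated as false (assumption).
    return a + b > 0 and Fraction(a, a + b) >= Fraction(1, 2)

def tr_odd_a(a, b, c, d):
    # Hypothetical non-example: true iff the number of coincidences a is odd.
    return a % 2 == 1

print(find_counterexample(tr_half_implication))  # no violation found in the searched range
print(find_counterexample(tr_odd_a))             # a violating pair of tables
```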
We shall deal with (locally) associational quantifiers of two important kinds: implicational
and comparative. We give examples and state general (deductive) properties of quantifiers in
these classes.
Implicational quantifiers formalize the association formulated as “many $\varphi$’s are
$\psi$’s”. (They could also be called two-dimensional multitudinal quantifiers.) The definition
reads as follows:
A two-dimensional quantifier $Q$ is implicational if each pair $(a_1,b_1,c_1,d_1)$,
$(a_2,b_2,c_2,d_2)$ of four-fold tables satisfies the following condition: If $a_2 \ge a_1$,
$b_2 \le b_1$ and $\mathrm{Tr}_Q(a_1,b_1,c_1,d_1) = 1$ then $\mathrm{Tr}_Q(a_2,b_2,c_2,d_2) = 1$.
$Q$ is locally implicational if this condition is satisfied for each pair of four-fold tables
with the same sum ($a_1 + b_1 + c_1 + d_1 = a_2 + b_2 + c_2 + d_2$).
Clearly, the quantifier $\Rightarrow_p$ ($p$-many) is implicational: $a_2 \ge a_1$ and
$b_2 \le b_1$ imply $a_2/(a_2+b_2) \ge a_1/(a_1+b_1)$. The quantifier $\Rightarrow^{*}_{p,t}$ of
founded implication is also implicational: if $a_2 \ge a_1$ and $a_1 \ge t$ then trivially
$a_2 \ge t$. The “almost the same” Agrawal’s quantifier $\Rightarrow^{\mathrm{Agr}}_{p,\sigma}$
is locally implicational: if the tables have equal sum and $a_2 \ge a_1$ then trivially
$a_2/(a_2+b_2+c_2+d_2) \ge a_1/(a_1+b_1+c_1+d_1)$.
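The difference between “implicational” and “locally implicational” can be made visible for Agrawal's quantifier with a bounded search. This is a sketch only: the names are hypothetical, the parameters $p = \sigma = 1/2$ are arbitrary choices, and a search over small tables can refute a property but cannot prove it.

```python
from fractions import Fraction
from itertools import product

def implicational_counterexample(tr, max_entry=3, local=False):
    """Search small tables for a pair violating the (locally) implicational condition."""
    tables = list(product(range(max_entry + 1), repeat=4))
    for t1 in tables:
        if not tr(*t1):
            continue
        for t2 in tables:
            if local and sum(t1) != sum(t2):
                continue
            # Only a and b are constrained; c and d are arbitrary.
            if t2[0] >= t1[0] and t2[1] <= t1[1] and not tr(*t2):
                return t1, t2
    return None

def tr_agrawal(a, b, c, d, p=Fraction(1, 2), sigma=Fraction(1, 2)):
    m = a + b + c + d
    return a + b > 0 and Fraction(a, a + b) >= p and Fraction(a, m) >= sigma

print(implicational_counterexample(tr_agrawal, local=True))  # none found for equal sums
print(implicational_counterexample(tr_agrawal))              # a pair with different sums
```

Without the equal-sum restriction, adding rows that do not satisfy $\varphi$ lowers the support $a/m$ while leaving $a$ and $b$ unchanged, which is what the unrestricted search finds.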
Note the statistical parallel of $\Rightarrow_p$: The hypothesis $P(\psi \mid \varphi) \ge p$
(conditional probability of $\psi$, given $\varphi$) is tested using the statistic
$\sum_{i=a}^{a+b} \binom{a+b}{i} p^i (1-p)^{a+b-i}$. The corresponding quantifier
$\Rightarrow^{!}_{p,\alpha}$ of likely $p$-implication (with significance level $\alpha$) is

$$
\mathrm{Tr}_{\Rightarrow^{!}_{p,\alpha}}(a,b,c,d) = 1 \quad\text{iff}\quad
\sum_{i=a}^{a+b} \binom{a+b}{i} p^i (1-p)^{a+b-i} \le \alpha .
$$
This is also an implicational quantifier (see (Hájek and Havránek, 1978), where another
statistically motivated implicational quantifier is also discussed). For each (locally)
implicational quantifier (denote it $\Rightarrow^{\#}$) the following two deduction rules are
sound (in the sense that whenever the assumption is true in your data, the consequence is also
true):

$$
\frac{(\varphi_1 \wedge \varphi_2) \Rightarrow^{\#} \psi}
     {\varphi_1 \Rightarrow^{\#} (\psi \vee \neg\varphi_2)},
\qquad
\frac{\varphi \Rightarrow^{\#} \psi_1}
     {\varphi \Rightarrow^{\#} (\psi_1 \vee \psi_2)}
$$
For example, if the formula “$p$-many probands who are smokers and older than 50 have cancer”
is true in your data then the following is true too: “$p$-many probands who are smokers have
cancer or are not older than 50”. Second: If the “association rule” “$x$ buys Lidové noviny
$\Rightarrow$ $x$ is Czech” is 90%-true with support 1000 then also
“$x$ buys Lidové noviny $\Rightarrow$ ($x$ is Czech or $x$ is Slovak)”
is 90%-true with the same support. These deduction rules are extremely useful for optimizing
the search for formulas (“rules”) of the form $\varphi \Rightarrow^{\#} \psi$ where
$\Rightarrow^{\#}$ is an implicational quantifier, $\varphi$ is an elementary conjunction
(a conjunction of atomic open formulas and negated atomic formulas containing each predicate at
most once, e.g. $P_1(x) \wedge \neg P_3(x) \wedge P_7(x)$) and $\psi$ is an elementary
disjunction (similar definition, e.g. $\neg P_2(x) \vee \neg P_{10}(x)$).
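The reason the first deduction rule is sound can be seen on the four-fold tables themselves: passing from $((\varphi_1 \wedge \varphi_2), \psi)$ to $(\varphi_1, \psi \vee \neg\varphi_2)$ never decreases $a$ and leaves $b$ unchanged. The sketch below checks this on all data sets of three rows; the helper names and the sample data are hypothetical.

```python
from itertools import product

def four_fold(data, phi, psi):
    a = sum(1 for x in data if phi(x) and psi(x))
    b = sum(1 for x in data if phi(x) and not psi(x))
    c = sum(1 for x in data if not phi(x) and psi(x))
    return a, b, c, len(data) - a - b - c

def first_rule_tables(data):
    """Rows are triples (phi1, phi2, psi); returns the tables of premise and conclusion."""
    premise = four_fold(data, lambda x: x[0] and x[1], lambda x: x[2])
    conclusion = four_fold(data, lambda x: x[0], lambda x: x[2] or not x[1])
    return premise, conclusion

def rule_preserves_order(n_rows=3):
    """Check a2 >= a1 and b2 == b1 on every data set with n_rows rows."""
    row_types = list(product([False, True], repeat=3))
    for data in product(row_types, repeat=n_rows):
        (a1, b1, _, _), (a2, b2, _, _) = first_rule_tables(data)
        if not (a2 >= a1 and b2 == b1):
            return False
    return True

sample = [(True, True, True), (True, False, False), (True, True, False), (False, True, True)]
print(first_rule_tables(sample))
print(rule_preserves_order())
```

Since the conclusion's table has at least as many coincidences in $a$ and the same $b$, any implicational quantifier true for the premise is true for the conclusion.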
Caution: For the “classical” quantifier $\Rightarrow_1$ ($\varphi \Rightarrow_1 \psi$ saying
“all $\varphi$’s are $\psi$’s”) the first rule can be reversed, thus truth of
$\varphi_1 \Rightarrow_1 (\psi \vee \neg\varphi_2)$ implies truth of
$(\varphi_1 \wedge \varphi_2) \Rightarrow_1 \psi$. But this is not true for $\Rightarrow_p$
with $p < 1$ and the other mentioned implicational (locally implicational) quantifiers.
Let us stress once more that implicational quantifiers formalize, in various possible ways,
what we mean by saying “many $\varphi$’s are $\psi$’s”. Agrawal’s association rules are a
particular case, with one particular implicational quantifier and also with specific open
formulas (no negation allowed, just conjunction of atoms). Even if this may be the most used
case, the reader is invited to consider broader, more general and more powerful possibilities.
We now turn our attention to a very important class of quantifiers that we shall call
comparative. The intuitive meaning of association expressed by a comparative quantifier is that
the formula $(Qx)(\varphi,\psi)$ should say that the presence of $\varphi$ positively contributes
to the presence of $\psi$. This does not mean that many $\varphi$’s are $\psi$’s, i.e. that the
relative frequency of $\psi$ among $\varphi$ (denoted $\mathrm{Freq}(\psi \mid \varphi)$) is big,
but that $\mathrm{Freq}(\psi \mid \varphi)$ is (sufficiently) bigger than
$\mathrm{Freq}(\psi \mid \neg\varphi)$.
For example, imagine that 30% of smokers have an illness and only 5% of non-smokers
have the same illness. The simplest quantifier of this kind is called the simple associational
quantifier, denoted SIMPLE or $\sim_0$ (see (Hájek and Havránek, 1978; Hájek et al., 1995) or
other GUHA papers); the truth function is $\mathrm{Tr}_{\sim_0}(a,b,c,d) = 1$ iff $ad > bc$.
A trivial computation shows that $ad > bc$ is equivalent both to
$\frac{a}{a+b} > \frac{a+c}{a+b+c+d}$ (if $(a,b,c,d)$ is the four-fold table of
$\varphi, \psi$ then this says that, in the data,
$\mathrm{Freq}(\psi \mid \varphi) > \mathrm{Freq}(\psi)$) and to
$\frac{a}{a+b} > \frac{c}{c+d}$
($\mathrm{Freq}(\psi \mid \varphi) > \mathrm{Freq}(\psi \mid \neg\varphi)$). You may make this
quantifier parametric, demanding $ad > h \cdot bc$ for some $h \ge 1$.
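The “trivial computation” can be spelled out as follows, assuming $a + b > 0$ and $c + d > 0$ so that all fractions are defined (this assumption is implicit in the text):

$$
\begin{aligned}
\frac{a}{a+b} > \frac{c}{c+d}
  &\iff a(c+d) > c(a+b) \iff ad > bc, \\
\frac{a}{a+b} > \frac{a+c}{a+b+c+d}
  &\iff a(a+b+c+d) > (a+c)(a+b) \\
  &\iff a^2 + ab + ac + ad > a^2 + ab + ac + bc \iff ad > bc .
\end{aligned}
$$

As a purely hypothetical illustration of the smokers example, take 100 smokers and 100 non-smokers, i.e. the table $(30, 70, 5, 95)$: then $ad = 2850 > bc = 350$, in agreement with $30\% > 5\%$.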
Thus let us accept the following definition: A two-dimensional quantifier $Q$ is comparative
if $\mathrm{Tr}_Q(a,b,c,d) = 1$ implies $ad > bc$. The statistical counterpart is the Fisher
quantifier $\sim^{F}_{\alpha}$ based on the test of the hypothesis
$P(\psi \mid \varphi) > P(\psi)$ (against the null hypothesis
$P(\psi \mid \varphi) \le P(\psi)$), with significance $\alpha$. The formula is:

$$
\mathrm{Tr}_{\sim^{F}_{\alpha}}(a,b,c,d) = 1 \quad\text{iff}\quad ad > bc \ \text{ and }\
\sum_{i=a}^{\min(a+b,\,a+c)}
\frac{\binom{a+c}{i}\binom{b+d}{a+b-i}}{\binom{a+b+c+d}{a+b}} \le \alpha .
$$

If we adopt the usual notation $a + b = r$, $a + c = k$, $b + d = l$, $a + b + c + d = m$ then
the last formula becomes

$$
\sum_{i=a}^{\min(r,k)} \frac{\binom{k}{i}\binom{l}{r-i}}{\binom{m}{r}} \le \alpha .
$$

This is a rather complicated formula; there are non-trivial algorithms for computing the sum
in question. The Fisher quantifier can be proved to be associational (Hájek and Havránek, 1978).
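For small tables the sum in the Fisher quantifier can be evaluated naively. The sketch below uses the marginal notation $r, k, l, m$ and exact rational arithmetic; the function names are hypothetical, and this direct summation is not one of the efficient algorithms the text alludes to.

```python
from fractions import Fraction
from math import comb

def fisher_tail(a, b, c, d):
    """Sum over i = a .. min(r, k) of C(k, i) * C(l, r - i) / C(m, r)."""
    r, k, l, m = a + b, a + c, b + d, a + b + c + d
    return sum(
        Fraction(comb(k, i) * comb(l, r - i), comb(m, r))
        for i in range(a, min(r, k) + 1)
    )

def tr_fisher(alpha, a, b, c, d):
    return int(a * d > b * c and fisher_tail(a, b, c, d) <= alpha)

alpha = Fraction(1, 20)
print(fisher_tail(3, 1, 1, 3), tr_fisher(alpha, 3, 1, 1, 3))  # 17/70, not significant
print(fisher_tail(4, 0, 0, 4), tr_fisher(alpha, 4, 0, 0, 4))  # 1/70, significant
```

Both tables satisfy $ad > bc$, but only the second passes the test at level 0.05, which shows how the statistical quantifier is stricter than the simple one.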
Let us mention that another comparative associational statistically motivated quantifier is
based on the statistical chi-square test:

$$
\mathrm{Tr}_{\sim^{\mathrm{CHISQ}}_{\alpha}}(a,b,c,d) = 1 \quad\text{iff}\quad ad > bc
\ \text{ and }\ \frac{m(ad - bc)^2}{rskl} \ge \chi^2_{\alpha},
$$

where $\chi^2_{\alpha}$ is a constant (the $(1-\alpha)$-quantile of the $\chi^2$ distribution
function).
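A sketch of the chi-square quantifier. Two assumptions are made here that the text does not state: the quantile is taken from the $\chi^2$ distribution with one degree of freedom (the usual choice for a four-fold table), for which the 0.95-quantile is approximately 3.841, and tables with a zero marginal sum are treated as false. The function names are hypothetical.

```python
from fractions import Fraction

# Assumption: 0.95-quantile of the chi-square distribution with 1 degree of freedom.
CHI2_CRITICAL_005 = 3.841

def chi_square_statistic(a, b, c, d):
    r, s, k, l = a + b, c + d, a + c, b + d
    m = r + s
    return Fraction(m * (a * d - b * c) ** 2, r * s * k * l)

def tr_chisq(a, b, c, d, critical=CHI2_CRITICAL_005):
    r, s, k, l = a + b, c + d, a + c, b + d
    if r * s * k * l == 0:
        return 0  # assumption: statistic undefined, treat as false
    return int(a * d > b * c and chi_square_statistic(a, b, c, d) >= critical)

print(chi_square_statistic(3, 1, 1, 3), tr_chisq(3, 1, 1, 3))
print(float(chi_square_statistic(9, 1, 10, 80)), tr_chisq(9, 1, 10, 80))
```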
Now let us present three deduction rules and ask if our quantifiers obey them. Once more,
obeying a rule means that whenever the assumption (above the line) is true in the data then the
conclusion (below the line) is true. Here $\sim$ stands for a quantifier; we write
$\varphi \sim \psi$ instead of $(\sim x)(\varphi,\psi)$.

Rule of symmetry (SYM): $\dfrac{\varphi \sim \psi}{\psi \sim \varphi}$
Rule of negation (NEG): $\dfrac{\varphi \sim \psi}{\neg\varphi \sim \neg\psi}$
Rule of conversion (CNVS): $\dfrac{\varphi \sim \psi}{\neg\psi \sim \neg\varphi}$
Fact
The simple quantifier $\sim_0$, the Fisher quantifier $\sim^{F}_{\alpha}$ as well as the
chi-square quantifier $\sim^{\mathrm{CHISQ}}_{\alpha}$ obey all the rules (SYM), (NEG), (CNVS).
For a proof see again (Hájek and Havránek, 1978), observing that if any quantifier obeys
(SYM) and (NEG) then it automatically obeys (CNVS). Now we present three more quantifiers
occurring in the literature, each obeying just one of our present rules.
The quantifier of pure $p$-equivalence $\equiv_p$ (Rauch, see e.g. (Rauch, 1998A)). The formula
$\varphi \equiv_p \psi$ is true if both $\varphi \Rightarrow_p \psi$ and
$\neg\varphi \Rightarrow_p \neg\psi$ are true, thus $\mathrm{Tr}_{\equiv_p}(a,b,c,d) = 1$ iff
$a/(a+b) \ge p$ and $d/(c+d) \ge p$. For $p > \frac{1}{2}$ this quantifier is comparative.
(Indeed, if $a/(a+b) > \frac{1}{2}$ and $d/(c+d) > \frac{1}{2}$ then
$c/(c+d) < \frac{1}{2} < a/(a+b)$, which gives $bc < ad$.)
The quantifier of conviction (Adamo, 2001). $\varphi \sim^{\mathrm{conv}}_{h} \psi$ is true if
$(a+b)(b+d) > h \cdot b(a+b+c+d)$, or equivalently (for $b > 0$), $(rl)/(bm) > h$, where $h$ is
a parameter, $h \ge 1$. An elementary computation shows that the last inequality for $h = 1$
(and hence for each $h \ge 1$) implies $ad > bc$; the quantifier is comparative.
The quantifier “above average” is a variant of SIMPLE (used in the program 4FT-miner
(lispminer)). $\varphi \sim^{AA}_{h} \psi$ is true if
$a/(a+b) > h \cdot (a+c)/(a+b+c+d)$ (thus $a/r > h \cdot k/m$), which means that
$\mathrm{Freq}(\psi \mid \varphi) > h \cdot \mathrm{Freq}(\psi)$. For $h = 1$ this is equivalent
to the simple quantifier with $h = 1$; evidently, for each $h \ge 1$, the AA quantifier is
comparative.
But these last three quantifiers differ as far as our deduction rules are concerned:
Fact
(1) The quantifier AA obeys symmetry but for $h > 1$ neither negation nor conversion. (2) The
quantifier of pure $p$-equivalence obeys negation but for $p < 1$ neither symmetry nor
conversion. (3) The quantifier of conviction obeys conversion but for $h > 1$ neither symmetry
nor negation.
The positive claims are verified by easy computations; the negative claims can all be
witnessed e.g. by the table $(9,1,10,80)$.
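The witness table can be checked mechanically. In the sketch below each rule is expressed as a transformation of the four-fold table, and the three quantifiers are written in cross-multiplied form to avoid division. The parameter values $h = 2$ and $p = 0.8$ are arbitrary choices for illustration, and all names are hypothetical.

```python
from fractions import Fraction

# Four-fold tables of the transformed pairs, given the table (a, b, c, d) of (phi, psi).
def sym(t):   # table of (psi, phi)
    a, b, c, d = t
    return (a, c, b, d)

def neg(t):   # table of (not phi, not psi)
    a, b, c, d = t
    return (d, c, b, a)

def cnvs(t):  # table of (not psi, not phi)
    a, b, c, d = t
    return (d, b, c, a)

def tr_aa(h, a, b, c, d):          # a/(a+b) > h*(a+c)/m, cross-multiplied
    return a * (a + b + c + d) > h * (a + b) * (a + c)

def tr_pure_equiv(p, a, b, c, d):  # a/(a+b) >= p and d/(c+d) >= p, cross-multiplied
    return a >= p * (a + b) and d >= p * (c + d)

def tr_conviction(h, a, b, c, d):  # (a+b)(b+d) > h*b*m
    return (a + b) * (b + d) > h * b * (a + b + c + d)

def profile(tr, t):
    """Truth values of the quantifier on the table and on its SYM, NEG, CNVS transforms."""
    return tuple(bool(tr(*u)) for u in (t, sym(t), neg(t), cnvs(t)))

witness = (9, 1, 10, 80)
h, p = 2, Fraction(4, 5)
print("AA        ", profile(lambda *t: tr_aa(h, *t), witness))
print("pure equiv", profile(lambda *t: tr_pure_equiv(p, *t), witness))
print("conviction", profile(lambda *t: tr_conviction(h, *t), witness))
```

Each quantifier is true on the witness table and stays true under exactly one of the three transformations, matching the three statements of the Fact for these parameter values.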
Let us also mention the quantifier of double $p$-implication $\Leftrightarrow_p$ (Rauch):
$\varphi \Leftrightarrow_p \psi$ is true if both $\varphi \Rightarrow_p \psi$ and
$\psi \Rightarrow_p \varphi$ are true. The reader may show that this quantifier is not
comparative (consider e.g. $(9,1,1,0)$); it obeys symmetry but (for $p < 1$) neither negation
nor conversion.
The study of deductive rules is important for the interpretation of the results of Data Mining
as well as for the optimization of mining algorithms.
To close this section let us mention that each two-dimensional quantifier $\sim$ can be used
to define a three-dimensional quantifier by partializing: the formula
$(\varphi \sim \psi)/\chi$ is true in the data matrix in question iff $\varphi \sim \psi$ is
true in the submatrix of objects satisfying $\chi$. Cf. (Hájek, 2003).
26.3 Some comments and conclusion
Using four-fold tables
Even though we have dealt with logical aspects of Data Mining, we feel obliged to stress once
more the importance of the statistical side of the game. We already referred to (Giudici, 2003);
let us make some further references. (Glymour et al., 1996) is good reading on the
prehistory of Data Mining, namely exploratory data analysis, and on some dangers of using
statistically motivated notions in Data Mining. (Zytkow and Zembowicz, 1997) deal with
generating knowledge from four-fold tables (and describe their database discovery system
“49er”). Recently, the chi-square statistic was used to define “generalized association rules”
(we would say: using a comparative quantifier) by Hegland (Hegland, 2001) and Brin (Brin et al.,
1998). Papers by Rauch (et al.) discuss several further classes of quantifiers, see references.
Two generalizations
First, the logical approach to mining generalized association rules can be and has been
generalized to fuzzy logic. We refer to Holeňa’s papers ((Holeňa, 1996) – (Holeňa, 1996)) and
also to Chen et al. (Chen et al., 2003). Second, we have only mentioned relational Data Mining
and its techniques of inductive logic programming. Besides Džeroski and Lavrač (Džeroski
and Lavrač, 2001) the reader may consult e.g. Dehaspe and Toivonen (Dehaspe and Toivonen,
1999).
The GUHA method
The reader should be informed about the more than 30-year-old story of the GUHA
method of automated generation of hypotheses (General Unary Hypotheses Automaton),
which is undoubtedly one of the oldest methods of computerized exploratory data
analysis (or, if you want, mining of association rules), starting with (Hájek et al.,
1966). We already mentioned above that formulas almost identical with
Agrawal’s association rules were considered and algorithms for their generation were
discussed in that paper. This was followed by a long period of research culminating in 1978
in the monograph (Hájek and Havránek, 1978), presenting logical and statistical foundations
that are still relevant for contemporary Data Mining. (Note that the
book is presently available on the web, see references.) The research has continued; see
(Hájek et al., 2003) and (Hájek and Holeňa, 2003; Hájek, 2001) for a survey of the present state
and relation to other Data Mining methods. The GUHA approach offers observational logical
calculi (based on generalized quantifiers as presented here), logical foundations of statistical
inference (theoretical logical calculi), a theory of some auxiliary (helpful) quantifiers good
for compression of results, three semantics of missing information and several other facts,
notions and techniques. There have been several implementations; for two presently available
ones see (GUHA+-, lispminer).
It is regrettable that the mainstream of Data Mining has neglected the GUHA approach
((Liu et al., 2000) being one of the few exceptions); this subsection is a small attempt to
change this.
Conclusion
The study of logical aspects of Data Mining is interesting and useful: it gives an exact abstract
approach to “association rules” based on the notion of (generalized) quantifiers, important
classes of quantifiers, deductive properties of associations expressed using such quantifiers
as well as other results not mentioned here (e.g. results on computational complexity).
Hopefully the present chapter will help the reader to enjoy this.
Acknowledgments
Partial support of the COST Action 274 (TARSKI) is recognized.
References
Adamo, J. M. Data Mining for association rules and sequential patterns. Springer, 2001.
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A. I. “Fast discovery of
association rules.” In: Advances in knowledge discovery and Data Mining. Fayyad U.
M. et al., ed., AAAI Press/MIT Press, 1996.
