of an instance with A_7 = 1, A_18 = 1 and A_i = 0 for all other attributes, is therefore a set-oriented notation {A_7, A_18}. The task of association rule mining is, basically, to examine relationships between all possible subsets. For instance, a rule {A_9, A_37, A_214} → {A_189, A_165} would indicate that, whenever a database record possesses attribute values A_9 = 1, A_37 = 1, and A_214 = 1, also A_189 = 1 and A_165 = 1 will hold. The main difference to classification (see Chapter II in this book) is that association rules are usually not restricted to the prediction of a single attribute of interest. Even if we consider only n = 100 attributes and rules with two items in antecedent and consequent, there are already more than 23,500,000 possible rules. Every possible rule has to be verified against a very large database, where a single scan already takes considerable time. For this challenging task the technique of association rule mining has been developed (Agrawal and Shafer, 1996).
15.1.1 Formal Problem Definition
Let I = {a_1, ..., a_n} be a set of literals, properties or items. A record (t_1, ..., t_n) ∈ dom(A_1) × ... × dom(A_n) from our transaction database with schema S = (A_1, ..., A_n) can be reformulated as an itemset T by a_i ∈ T ⇔ t_i = 1. We want to use the motivation of the introductory example to define an association explicitly. How can we characterize “associated products”?
Definition: We call a set Z ⊆ I an association if the frequency of occurrence of Z deviates from our expectation given the frequencies of the individual X ∈ Z.
If the probability of having sausages (S) or mustard (M) in the shopping carts
of our customers is 10% and 4%, resp., we expect 0.4% of the customers to buy
both products at the same time. If we instead observe this behaviour for 1% of the
customers, this deviates from our expectations and thus {S, M} is an association.
Do sausages require mustard (S → M) or does mustard require sausages (M→S)?
If preference (or causality) induces a kind of direction in the association, it can be
captured by rules:
Definition: We call X → Y an association rule with antecedent X and consequent Y, if Z = X ∪ Y ⊆ I and X ∩ Y = ∅ holds.¹
We say that an association rule X → Y is supported by database D if there is a record T with X ∪ Y ⊆ T. Naturally, the more records in the database support a rule, the more reliable it is:
Definition: Let X → Y be an association rule. The support of the rule is defined as supp(X → Y) := supp(X ∪ Y), where the support of an arbitrary itemset S is defined as the probability P(S) of observing S in a randomly selected record of D:

supp(S) = |{T ∈ D | S ⊆ T}| / |D|
If we interpret the binary values in the records as truth values, an association rule can be considered as a logical implication. In this sense, an association rule holds
¹ From the foregoing definition it is clear that it would make sense to additionally require that Z must be an association. But this is not part of the “classical” problem statement.
if for every transaction T that supports X, Y is also supported. In our market basket
scenario this means that, for a rule butter → bread to hold, nobody will ever buy
butter without bread. This strict logical semantics is unrealistic in most applications (though an interesting special case, see Section 15.4.5); in general we allow for partially fulfilled association rules and measure their degree via confidence values:
Definition: Let X → Y be an association rule. The confidence of the rule is defined as the fraction of itemsets that support the rule among those that support the antecedent: conf(X → Y) := P(Y|X) = supp(X ∪ Y) / supp(X).
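For concreteness, here is a minimal Python sketch of these two measures (our own illustration, not part of the chapter; the function names supp and conf and the toy transactions are hypothetical):

def supp(itemset, db):
    # fraction of transactions (sets of items) containing all items of `itemset`
    s = frozenset(itemset)
    return sum(1 for t in db if s <= t) / len(db)

def conf(antecedent, consequent, db):
    # confidence of the rule antecedent -> consequent
    x, y = frozenset(antecedent), frozenset(consequent)
    return supp(x | y, db) / supp(x, db)

db = [frozenset(t) for t in ({"butter", "bread"}, {"bread"},
                             {"butter", "bread", "milk"}, {"milk"})]
print(supp({"butter", "bread"}, db))    # 0.5
print(conf({"butter"}, {"bread"}, db))  # 1.0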
Problem Statement: Given a database D and two thresholds min_conf, min_supp ∈ [0,1], the problem of association rule mining is to generate all association rules X → Y with supp(X → Y) > min_supp and conf(X → Y) > min_conf in D.
It has to be emphasized at this point that these definitions are tailored towards the standard association rule framework; other definitions would have also been possible. For instance, it is commonly agreed that support and confidence are not particularly good as rule ranking measures in general. As an example, consider an everyday product, such as bread. It is very likely that it is contained in many carts, giving it a high support, say 70%. If another product, say batteries, is completely independent from bread, then the conditional probability P(bread|batteries) will be identical
to P(bread). In such a case, where no relationship exists, a rule batteries → bread
should not be flagged as interesting. However, given the high a priori probability of
bread being in the cart, the confidence of this rule will be close to P(bread)=70%.
Therefore, having found a rule that satisfies the constraints on support and confi-
dence according to the problem definition does not mean that an association has
been discovered (see footnote 1)! The reason why support and confidence are never-
theless so prominent is that their properties are massively exploited for the efficient
enumeration of association rules (see Section 15.2). Their deficiencies (as sufficient
conditions) for discovering associations can often be compensated by an additional
postprocessing phase. There are, however, also competing approaches that use differ-
ent measures from the beginning (often addressing the deficiencies of the minimum
support as a necessary condition), which we will discuss in section 15.4.5.
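A quick calculation with made-up numbers illustrates the point: if P(bread) = 0.7, P(batteries) = 0.05 and the two items are independent, then

conf(batteries → bread) = supp({batteries, bread}) / supp({batteries}) = (0.05 · 0.7) / 0.05 = 0.7,

so the rule clears a confidence threshold of, say, 60%, although batteries tell us nothing about bread.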
In the next section, we review the most prominent association rule mining tech-
nique, which provides the central basis for many extensions and modifications, and
illustrate its operation with an example. We then give a short impression of applications of the framework to kinds of data other than market baskets. In section 15.4 we discuss several ways to overcome or compensate for the difficulties when support and confidence alone are used for rule evaluation.
15.2 Association Rule Mining
In this section we discuss the Apriori algorithm (Agrawal and Srikant, 1994, Agrawal and Shafer, 1996), which solves the association rule mining problem. In a first phase, it enumerates all itemsets found in the transaction database whose support exceeds min_supp. We call the itemsets enumerated in this way frequent itemsets, although the min_supp threshold may be only a few percent. In a second phase, the confidence
threshold min_conf comes into play: The algorithm generates all rules from the frequent itemsets whose confidence exceeds this threshold. The enumeration will be discussed in section 15.2.1, the rule generation in section 15.2.2.
This subdivision into two independent phases is characteristic for many approaches to association rule mining. Depending on the properties of the database or problem at hand, the frequent itemset mining may be replaced by more efficient variants of Apriori, which will be discussed in detail in Chapter 15.5. Many different rule evaluation measures (see Section 15.4.1) can be calculated from the output of the first phase, such that rule generation may also be altered independently of the first phase. The two-phase decomposition makes it easy to combine different itemset mining and rule generation algorithms; however, it also means that the individual properties of the rule evaluation measures used in phase two, which could potentially lead to pruning mechanisms in phase one, are not utilized in the standard approach. In section 15.4.5 we will examine a few alternative methods from the literature, where both phases are more tightly coupled.
15.2.1 Association Mining Phase
We speak of a k-itemset X, if |X| = k. With the Apriori algorithm, the set of frequent itemsets is found iteratively by identifying all frequent 1-itemsets first, then searching for all frequent 2-itemsets, 3-itemsets, and so forth. Since the number of frequent itemsets is finite (bounded by 2^|I|), for some k we will find no more frequent itemsets and stop the iteration. Can we stop already if we have found no frequent k-itemset, or is it possible that we miss some frequent (k + j)-itemsets, j > 0, then? The following simple but important observation guarantees that this cannot happen:

Observation (Closure under Support): Given two itemsets X and Y with X ⊆ Y. Then supp(X) ≥ supp(Y).
Whenever a transaction contains the larger itemset Y , it also contains X, therefore
this inequality is trivially true. Once we have found an itemset X to be infrequent, no
superset Y of X can become frequent; therefore we can stop our iteration at the first
k for which no frequent k-itemsets have been found.
The observation addresses the downward closure of frequent itemsets with re-
spect to minimum support: Given a frequent itemset, all of its subsets must also be
frequent. The Apriori algorithm explores the lattice over I (induced by set inclu-
sion, see Figure 15.1) in a breadth-first search, starting from the empty set (which
is certainly frequent), expanding every frequent itemset by a new item as long as
the new itemset turns out to be frequent. This already gives us the main loop of the Apriori algorithm: Starting from a set C_k of possibly frequent itemsets of size k, so-called k-candidates, the database D is scanned to determine their support (function countSupport(C_k, D)). Afterwards, the set of frequent k-itemsets F_k is easily identified as the subset of C_k for which the minimum support is reached. From the frequent k-itemsets F_k we create candidates of size k + 1 and the loop starts all over, until we finally obtain an empty set C_k and the iteration terminates. The frequent itemset mining step of the Apriori algorithm is illustrated in Figure 15.2.
Fig. 15.1. Itemset lattice and tree of subsets of {1, 2, 3}
C_1 := {{i} | i ∈ I};
k := 1;
while C_k ≠ ∅ do begin
    countSupport(C_k, D);
    F_k := { S ∈ C_k | supp(S) > min_supp };
    C_{k+1} := candidateGeneration(F_k);
    k := k + 1;
end
F = F_1 ∪ F_2 ∪ ... ∪ F_{k−1};   // all freq. itemsets

Fig. 15.2. Apriori algorithm
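The following Python rendering of Figure 15.2 may help to make the loop concrete; it is only a sketch of our own (frozensets instead of a candidate tree, a naive subset test instead of countSupport, and C_1 restricted to items that actually occur in the database):

from itertools import combinations

def apriori(db, min_supp):
    # db: list of transactions given as frozensets of items
    # returns a dict mapping each frequent itemset to its support
    n = len(db)
    supp = lambda s: sum(1 for t in db if s <= t) / n
    candidates = {frozenset([i]) for t in db for i in t}   # C_1
    frequent, k = {}, 1
    while candidates:
        f_k = {c: supp(c) for c in candidates if supp(c) > min_supp}
        frequent.update(f_k)
        # candidateGeneration(F_k): join two frequent k-itemsets and keep
        # the union only if all of its k-subsets are frequent
        candidates = {a | b for a in f_k for b in f_k
                      if len(a | b) == k + 1
                      and all(frozenset(s) in f_k for s in combinations(a | b, k))}
        k += 1
    return frequent

Run on the example database used later in this section (transactions ABCE, BCE, ACDE, BE, ABC) with a threshold just below 2/5, this yields exactly the frequent itemsets derived by hand there, up to the three frequent 3-itemsets {A,B,C}, {A,C,E} and {B,C,E}.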
Before examining some parts of the algorithm in greater detail, we introduce
an appropriate tree data structure for the lattice, as shown in Figure 15.1. Let us
fix an arbitrary order for items. Each node in the tree represents an itemset, which
contains all items on the path from the node to the root node. If, at any node labeled
x, we create the successors labeled y with y > x (according to the item order), the tree
encodes every itemset in the lattice exactly once.
Candidate Generation. The naïve approach for the candidate generation func-
tion would be to simply return all subsets of I of size k + 1, which would lead to a
brute force enumeration of all possible combinations. Given the size of I, this is of
course prohibitive.
The key to escape from the combinatorial explosion is to exploit the closure under
support for pruning: For a (k + 1)-candidate set to be frequent, all of its subsets must
be frequent. Any (k + 1)-candidate has k + 1 subsets of size k, and if one of them
cannot be found in the set of frequent k-itemsets F_k, the candidate itself cannot be
frequent. If we include only those (k + 1)-sets in C_{k+1} that pass this test, the size of C_{k+1} is shrunk considerably. The candidateGeneration function therefore returns:

C_{k+1} = { C ⊂ I | |C| = k+1, ∀S ⊂ C: (|S| = k ⇒ S ∈ F_k) }.
How can C_{k+1} be determined efficiently? Let us assume that the sets F_k are stored in the tree data structure according to Figure 15.1. Then the k + 1 containment tests (S ∈ F_k) can be done efficiently by descending the tree. Instead of enumerating all possible sets S of size k + 1, it is more efficient to construct them from a union of two k-itemsets in F_k (as a positive effect, then only k − 1 containment tests remain to be executed). This can be done as follows: All itemsets that have a common node at level k − 1 are denoted as a block (they share k − 1 items). From every two k-itemsets X and Y coming from the same block, we create a candidate Z = X ∪ Y (implying that |Z| = k+1). Candidates generated in this way can easily be used to extend the tree of k-itemsets into a tree of candidate (k + 1)-itemsets.
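A sketch of this block-based join in Python (our own illustration; itemsets are kept as sorted tuples so that two members of the same block agree on their first k−1 items):

from itertools import combinations

def candidate_generation(freq_k):
    # freq_k: set of frequent k-itemsets, each a sorted tuple of items
    freq_k = set(freq_k)
    candidates = set()
    for x in freq_k:
        for y in freq_k:
            # same block: identical up to the last (largest) item
            if x[:-1] == y[:-1] and x[-1] < y[-1]:
                z = x + (y[-1],)                       # |z| = k + 1
                # pruning: every k-subset of z must be frequent
                if all(s in freq_k for s in combinations(z, len(x))):
                    candidates.add(z)
    return candidates

f2 = {("A","B"), ("A","C"), ("A","E"), ("B","C"), ("B","E"), ("C","E")}
print(sorted(candidate_generation(f2)))
# -> ('A','B','C'), ('A','B','E'), ('A','C','E'), ('B','C','E'),
#    as in the example run of this section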
Support Counting. Whenever the function countSupport(C_k, D) is invoked, a pass over the database is performed. At the beginning, the support counters for all candidate itemsets are set to zero. For each transaction T in the database, the counters for all subsets of T in C_k have to be incremented. If the candidates are represented as a tree, this can be done as follows: starting at the root node (empty set), follow all paths with an item contained in T. This procedure is repeated at every node. Whenever a node is reached at level k, a k-candidate contained in T has been found and its counter can be incremented. Note that for subsets of arbitrary size we would have to mark at each node whether it represents an itemset contained in the tree or whether it has been inserted just because it lies on a path to a larger itemset. Here, all itemsets in C_k or F_k are of the same length and therefore valid itemsets consist of all leaves at depth k only.
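A small sketch of this counting scheme (again our own rendering, with the candidate tree stored as nested dictionaries and one counter per leaf at depth k):

def build_tree(candidates):
    # candidates: k-candidates as sorted tuples; one item per tree edge
    root = {}
    for c in candidates:
        node = root
        for item in c:
            node = node.setdefault(item, {})
        node["count"] = 0                 # counter at depth k
    return root

def count_supports(root, db, k):
    # one pass over the database: descend along the items of each transaction
    for t in db:
        descend(root, sorted(t), 0, k)
    return root

def descend(node, items, depth, k):
    if depth == k:
        node["count"] += 1                # a k-candidate contained in T
        return
    for i, item in enumerate(items):
        child = node.get(item)
        if child is not None:
            descend(child, items[i + 1:], depth + 1, k)

db = [set(t) for t in ("ABCE", "BCE", "ACDE", "BE", "ABC")]
cands = [("A","B"), ("A","C"), ("A","E"), ("B","C"), ("B","E"), ("C","E")]
tree = count_supports(build_tree(cands), db, k=2)
print(tree["A"]["B"]["count"])            # 2, i.e. support 2/5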
It is clear from Figure 15.1, that the leftmost subtree of the root node is always the
largest and the rightmost subtree consists of a single node only. To make the subset
search as fast as possible, the items should be sorted by their frequency in ascending
order, such that the probability of descending the tree without finding a candidate is
as small as possible.
Runtime complexity. Candidate generation is log-linear in the number of frequent patterns |F_k|. A single iteration of the main routine in Figure 15.2 depends
linearly on the number of candidates and the database size. The most critical number
is the number of candidate itemsets, which is usually tractable in the market basket
data, because the transaction/item-table is sparse. However, if the support threshold
is very small or the table is dense, the number of candidates (as well as frequent
patterns) grows exponentially (see Section 15.4.4) and so does the runtime of this
algorithm. A comparison of the Apriori runtime with other frequent itemset mining
methods can be found in Chapter 15.5 of this volume.
15 Association Rules 305
Example. For illustration purposes, here is an example run of the Apriori algo-
rithm. Consider I = {A, B, C, D, E} and a database D consisting of the following transactions:
Transaction # A B C D E Itemset
#1 1 1 1 0 1 A, B, C, E
#2 0 1 1 0 1 B, C, E
#3 1 0 1 1 1 A, C, D, E
#4 0 1 0 0 1 B, E
#5 1 1 1 0 0 A, B, C
For the 1-candidates C_1 = {{A}, {B}, {C}, {D}, {E}} we obtain the support values 3/5, 4/5, 4/5, 1/5, 4/5. Suppose min_supp = 2/5, then F_1 = {{A}, {B}, {C}, {E}}. The 2-candidate generation always yields (|F_1| choose 2) candidates, because the pruning has no effect yet. By pairing all frequent 1-itemsets we obtain the following 2-candidates with respective support:
candidate 2-itemset Support frequent 2-itemset
{A, B} 2/5 {A, B}
{A, C} 3/5 {A, C}
{A, E} 2/5 {A, E}
{B, C} 3/5 {B, C}
{B, E} 3/5 {B, E}
{C, E} 3/5 {C, E}
Here, all 2-candidates are frequent. For the generation of 3-candidates, the alphabetic order is used and the blocks are indicated by double lines. From the first block (going from frequent 2-itemset {A,B} to {A,E}) we obtain the 3-candidates {A, B, C}, {A, B, E} (pairing {A,B} with its successors {A,C} and {A,E} in the same block), and {A, C, E} (pairing {A,C} with its successor {A,E}); from the second block {B, C, E} is created, and none from the third block. To qualify as candidate itemsets, we finally verify that all 2-subsets are frequent, e.g. for {A,B,C} (which was generated from {A,B} and {A,C}) we can see that the third 2-subset {B,C} is also frequent.
cand. 3-itemset Support Freq. 3-itemset
{A, B, C} 2/5 {A, B, C}
{A, B, E} 1/5
{A, C, E} 2/5 {A, C, E}
{B, C, E} 2/5 {B, C, E}
The 3-itemset {A, B, E} turns out to be infrequent. Thus, three blocks with a single itemset remain and no candidate 4-itemset will be generated. As an example, {A, B, C, E} would be considered as a candidate 4-itemset only if all of its subsets {B, C, E}, {A, C, E}, {A, B, E} and {A, B, C} are frequent, but {A, B, E} is not.
15.2.2 Rule Generation Phase
Once frequent itemsets are available, rules can be extracted from them. The objective is to create for every frequent itemset Z and its subsets X a rule X → Z−X and include it in the result if supp(Z)/supp(X) > min_conf. The observation in section 15.2.1 states that for X_1 ⊂ X_2 ⊂ Z we have supp(X_1) ≥ supp(X_2) ≥ supp(Z), and therefore supp(Z)/supp(X_1) ≤ supp(Z)/supp(X_2) ≤ 1. Thus, the confidence value for rule X_2 → Z−X_2 will be higher than that of X_1 → Z−X_1. In general, the largest confidence value will be obtained for antecedent X_2 being as large as possible (or
consequent Z−X_2 as small as possible). Therefore we start with all rules with a single item in the consequent. If such a rule does not reach the minimum confidence value, we do not need to evaluate a rule with a larger consequent.
Now, for a rule X_1 → Y_1 with consequent Y_1 = Z−X_1 to reach the minimum confidence, all rules X_2 → Y_2 with Y_2 ⊂ Y_1 ⊂ Z must also reach the minimum confidence. If for one Y_2 this is not the case, the rule X_1 → Y_1 must also miss the minimum confidence. Thus, for rules X → Y with given Z = X ∪ Y, we have the same downward closure of consequents (with respect to confidence) as we had with itemsets (with respect to support). This closure can be used for pruning in the same way as in the frequent itemset pruning in section 15.2.1: Before we look up the support value of a (k + 1)-consequent Y, we check if all k-subsets of Y have led to rules with minimum confidence. If that is not the case, Y will not contribute to the output either. The following algorithm generates all rules from the set of frequent itemsets F. To underline the similarity to the algorithm in Figure 15.2, we use again the sets F_k and C_k, but this time they contain the candidate k-consequents (rather than candidate k-itemsets) and the k-consequents of rules that satisfy min_conf (rather than frequent k-itemsets). The benefit of pruning, however, is much smaller here, because we only save a look-up of the support values that have been determined in phase one.
forall Z ∈ F do begin
    R := ∅;                                   // resulting set of rules
    C_1 := {{i} | i ∈ Z};                     // cand. 1-consequents
    k := 1;
    while C_k ≠ ∅ do begin
        F_k := { X ∈ C_k | conf(Z−X → X) > min_conf };
        R := R ∪ { Z−X → X | X ∈ F_k };
        C_{k+1} := candidateGeneration(F_k);
        k := k + 1;
    end
end

Fig. 15.3. Rule generation algorithm
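A Python sketch of this rule generation phase (our own illustration; supports is the table produced in phase one, here recomputed by brute force for the small example):

from itertools import combinations

def generate_rules(supports, min_conf):
    # supports: dict mapping frequent itemsets (frozensets) to their support
    rules = []
    for z in supports:
        if len(z) < 2:
            continue
        cons = [frozenset([i]) for i in z]          # candidate 1-consequents
        k = 1
        while cons and k < len(z):
            good = [y for y in cons
                    if supports[z] / supports[z - y] > min_conf]
            rules += [(z - y, y) for y in good]     # rules Z-Y -> Y
            # grow the consequents, pruning those with a "bad" k-subset
            good_set = set(good)
            cons = {a | b for a in good for b in good
                    if len(a | b) == k + 1
                    and all(frozenset(s) in good_set
                            for s in combinations(a | b, k))}
            k += 1
    return rules

db = [frozenset(t) for t in ("ABCE", "BCE", "ACDE", "BE", "ABC")]
items = sorted(set().union(*db))
supports = {}
for r in range(1, len(items) + 1):
    for s in map(frozenset, combinations(items, r)):
        v = sum(1 for t in db if s <= t) / len(db)
        if v > 0.3:
            supports[s] = v
for x, y in generate_rules(supports, 0.7):
    print("".join(sorted(x)), "->", "".join(sorted(y)))

With min_conf = 0.7 this prints, among others, the rule AE → C found in the example below; no rule with a two-item consequent survives.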
As an example, consider the case of Z = {A, C, E} in the outer loop. The set of candidate 1-consequents is C_1 = {{A}, {C}, {E}}. The confidence of the rules AC → E is 2/3, AE → C is 1, and CE → A is 2/3. With min_conf = 0.7, we obtain F_1 = {{C}}. Suppose {B, C} is considered as a candidate; then both of its subsets {B} and {C} have to be found in F_1, which contains only {C}. Thus, the candidate generation yields only an empty set for C_2 and rule enumeration for Z is finished.
15.3 Application to Other Types of Data
The traditional application of association rules is market basket analysis, see for
instance (Brijs et al., 1999). Since then, the technique has been applied to other
kinds of data, such as:
• Census data (Brin et al., 1997A, Brin et al., 1997B)
• Linguistic data for writer evaluation (Aumann and Lindell, 2003)
• Insurance data (Castelo and Giudici, 2003)
• Medical diagnosis (Gamberger et al., 1999)
One of the first generalizations, which still has applications in the field of market basket analysis, is the consideration of temporal or sequential information, such as the date of purchase. Applications include:
• Market basket data (Agrawal and Srikant, 1995)
• Causes of plan failures (Zaki, 2001)
• Web personalization (Mobasher et al., 2002)
• Text data (Brin et al., 1997A, Delgado et al., 2002)
• Publication databases (Lee et al., 2001)
In the analysis of sequential data, very long sequences may easily occur (e.g. in text data), corresponding to frequent k-itemsets with large values of k. Given that any subsequence of such a sequence will also be frequent, the number of frequent sequences quickly becomes intractable (to discover a frequent sequence of length k, 2^k subsequences have to be found first). As a consequence, the reduced problem of enumerating only maximal sequences is considered. A sequence is maximal if there is no frequent sequence that contains it as a subsequence. While in the Apriori
algorithm a breadth first search is performed to find all frequent k-itemsets first,
before (k + 1)-itemsets are considered, for maximal sequences a depth first search
may become computationally advantageous, because the exhaustive enumeration of
subsequences is skipped. Techniques for mining maximal itemsets will be discussed
in detail in the subsequent chapter.
In the analysis of sequential data, the traditional transactions are replaced by
short sequences, such as the execution of a short plan, products bought by a single
customer, or events caused by a single customer during website navigation. It is also
possible to apply association mining when the observations consist of a few attributes
observed over a long period of time. With such kind of data, the semantics of support
counting has to be revised carefully to avoid misleading or uninterpretable results
(Mannila et al., 1997, Höppner and Klawonn, 2002). Usually a sliding window is
shifted along the sequences and the probability of observing the pattern in a randomly
selected position is defined as the support of the pattern. Examples for this kind of
application can be found in:
• Telecommunication failures (Mannila et al., 1997) (Patterns are sequences of
parallel or consecutive events, e.g. “if alarm A is followed by simultaneous events
B1 and B2, then event C is likely to occur”.)
• Discovery of qualitative patterns in time series (Höppner and Klawonn, 2002)
(Patterns consist of labeled intervals and their interval relationship to each other,
e.g. “if an increase in time series A is finished by a decrease in series B, then it
is likely that the decrease in B overlaps a decrease in C”)
A number of extensions refer to the generalization of association rule mining to
the case of numerical data. In the market basket scenario, these techniques may be
applied to identify the most valuable customers (Webb, 2001). When broadening the
range of scales for the attributes in the database (not only 0/1 data), the discovery of
associations becomes very similar to cluster analysis and the discovery of association
rules becomes closely related to machine learning approaches (Mitchell, 1997) to
rule induction, classification and regression. Some approaches from the literature
are:
• Mining of quantitative rules by discretizing numerical attributes into categorical
attributes (Srikant and Agrawal, 1996)
• Mining of quantitative rules using mean values in the consequent (Aumann and Lindell, 1999, Webb, 2001)
• Mining interval data (Miller and Yang, 1997)
Finally, here are a few examples for applications with more complex data structures:
• Discovery of spatial and spatio-temporal patterns (Koperski and Han, 1995, Tsoukatos and Gunopulos, 2001)
• Mining of molecule structure for drug discovery (Borgelt, 2002)
• Usage in inductive logic programming (ILP) (Dehaspe and Toivonen, 1999)
15.4 Extensions of the Basic Framework
Within the confidence/support framework for rule evaluation, users often complain that either the rules are well known or even trivial, as is often the case for rules with high confidence and large support, or that there are several hundreds of variants of the same rule with very similar confidence and support values. As we have seen already in the introduction, many rules may actually be incidental or represent nothing of interest (see the batteries and bread example in Section 15.1.1). In the literature, many different ways to tackle this problem have been proposed:
• The use of alternative rule measures
• Support rule browsing and interactive navigation through rules
• The use of a compressed representation of rules
• Additional limitations to further restrict the search space
• Alternatives to frequent itemset enumeration that allow for arbitrary rule support
(highly interesting but rare patterns)
While the first few options may be implemented in a further postprocessing
phase, this is not true for the last options. For instance, if no minimum support
threshold is given, the Apriori-like enumeration of frequent itemsets has to be re-
placed completely. In the subsequent sections, we will briefly address all of these
approaches.
15.4.1 Some other Rule Evaluation Measures
The set of rules returned to the user must be evaluated manually. This costly manual
inspection should pay off, so only potentially interesting rules should be returned. Let
us consider the example from the introduction again, where the support/confidence-
framework flagged rules as worth investigating which were actually obtained from
independent items. If items A and B are statistically independent, the empirical joint
probability P(A,B) is approximately equal to P(A)P(B), therefore we could flag a
rule as interesting only in case the rule support deviates from P(A)P(B) (Piatetsky-
Shapiro, 1991):
rule-interest(X → Y) = supp(X → Y) − supp(X) supp(Y)
If rule-interest > 0 (rule-interest < 0), then X is positively (negatively) correlated to Y. The same underlying idea is used for the lift measure, a well-known statistical
measure, also known as interest or strength, where a division instead of a subtraction
is used
lift(X → Y) = supp(X → Y) / (supp(X) supp(Y))
Lift can be interpreted as the factor by which buyers of X are more likely to buy Y than customers in general (conf(X → Y)/conf(∅ → Y)). For independent X and Y we obtain a value of 1 (rather than 0 for rule-interest). The significance of the
correlation between X and Y can be determined using a chi-square test for a 2 x 2
contingency table. A problem with this kind of test is, however, that statistics poses
some conditions on the use of the chi-square test: (1) all cells in the contingency
table should have expected value greater than 1, and (2) at least 80% of the cells
should have expected value greater than 5 (Brin et al., 1997A). These conditions are
necessary since the binomial distribution (in case of items) is approximated by the
normal distribution, but they do not necessarily hold for all associations to be tested.
The outcome of the statistical test partitions the set of rules into significant and
non-significant, but cannot provide a ranking within the set of significant rules (the
chi-square value itself could be used, but as it is not bounded, a comparison between
rules is not that meaningful). Thus, chi-square alone is not sufficient (Tan and Kumar, 2002, Berzal et al., 2001). On the other hand, a great advantage of using contingency
tables is that also negative correlations are discovered (e.g. customers that buy coffee
do not buy tea).
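To see the measures side by side, here is a small sketch with invented counts for the bread/batteries example from Section 15.1.1 (the counts are chosen to match exact independence):

def measures(n, n_x, n_y, n_xy):
    # rule-interest, lift and chi-square statistic for X -> Y, computed from
    # absolute counts: n transactions, n_x with X, n_y with Y, n_xy with both
    supp_x, supp_y, supp_xy = n_x / n, n_y / n, n_xy / n
    rule_interest = supp_xy - supp_x * supp_y
    lift = supp_xy / (supp_x * supp_y)
    chi2 = 0.0
    for x_in in (True, False):                  # 2 x 2 contingency table
        for y_in in (True, False):
            observed = (n_xy if x_in and y_in else
                        n_x - n_xy if x_in else
                        n_y - n_xy if y_in else
                        n - n_x - n_y + n_xy)
            expected = (n_x if x_in else n - n_x) * (n_y if y_in else n - n_y) / n
            chi2 += (observed - expected) ** 2 / expected
    return rule_interest, lift, chi2

# 1000 carts, 700 with bread, 50 with batteries, 35 with both (hypothetical)
print(measures(1000, 50, 700, 35))   # approximately (0.0, 1.0, 0.0)

With counts deviating from independence, say 45 of the 50 battery buyers also buying bread, rule-interest and lift move above 0 and 1, respectively, and the chi-square value grows.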
From an information-theoretical point of view, the ultimate rule evaluation mea-
sure is the J-measure (Smyth, 1992), which ranks rules by their information content.
Consider X and Y as being random variables and X = x →Y = y being a rule. The
