
International Journal of Advanced Computer Science, Vol. 1, No. 4, Pp. 142-145, Oct. 2011.

An Efficient Algorithm for Association Rule Mining
Maryam Shekofteh

Manuscript Received: 9 Sep. 2011; Revised: 11 Oct. 2011; Accepted: 25 Oct. 2011; Published: 30 Oct. 2011

Keywords: Data mining, association rule mining, frequent closed itemset

Abstract
Efficient frequent itemset mining is the most important problem in association rule mining. To date, a number of algorithms have been developed in the field, but the large number of frequent itemsets in a database results in redundant rules. To overcome this problem, recent studies focus on frequent closed itemset mining, since the set of frequent closed itemsets is significantly smaller than the set of all frequent itemsets yet has the same expressive power. In this paper, a new algorithm called FC-Close is introduced for frequent closed itemset mining. The algorithm employs a pruning technique to improve the efficiency of frequent closed itemset mining. Test results show that FC-Close is more efficient than the existing FP-close algorithm.

1. Introduction

Association Rule Mining (ARM) is one of the most important data mining techniques. ARM aims at extracting hidden relations and interesting associations between the items of a transactional database. It is highly useful in market basket analysis for stores and business centers. For example, mining the database of a department store's customers may reveal that those who buy milk also buy butter on 60% of occasions, and that this rule is observed in 80% of transactions. In this example, the former percentage is called the confidence, and the percentage of transactions covering the rule is called the support. To find such rules, the user sets minimum values for support and confidence, called minimum support (min-sup) and minimum confidence (min-conf) respectively [1].

The main step in association rule mining is mining frequent itemsets. In effect, with the frequent itemsets in hand, generating association rules is highly straightforward. Frequent itemset mining, however, often generates a very large number of frequent itemsets and rules, which reduces the efficiency and power of mining. To overcome this problem, condensed representations have in recent years been used for frequent itemsets [3, 7]. A popular condensed representation is the set of frequent closed itemsets. Compared with all frequent itemsets, the frequent closed itemsets form a much smaller set with the same power. Using them decreases the number of redundant rules and increases mining efficiency. Many algorithms have been presented for mining frequent closed itemsets.

In this paper a new algorithm called FC-Close is introduced for frequent closed itemset mining. The algorithm, developed from FP-growth [5], employs a pruning technique to improve its efficiency. The rest of the paper is structured as follows: Section 2 introduces frequent closed itemset mining and related concepts. Section 3 reviews the structures used by the new algorithm. Section 4 describes the developed algorithm. The evaluation of findings is presented in Section 5, and Section 6 is devoted to the conclusion.

2. Problem Development

Let D be a transactional database. A transactional database consists of a set of transactions. Each transaction t is represented as <TID, x>, in which x is a set of items and TID is the unique identifier of the transaction. Further, let I = {i1, i2, ..., in} be the complete set of distinct items in D. Each non-empty subset y of I is termed an itemset, and if it includes k items it is called a k-itemset. The number of transactions in D that include itemset y is called the support of y, denoted sup(y), and it is usually expressed as a percentage. Given a minimum support min-sup, an itemset y is a frequent itemset if sup(y) ≥ min-sup.

This work was supported by Islamic Azad University, Sarvestan Branch, Shiraz, Iran. Maryam Shekofteh is with the Department of Computer Engineering, Islamic Azad University, Sarvestan Branch, Shiraz, Iran.

Definition 1 (Closed Itemset): An itemset y is a closed itemset if there is no proper superset y' of y such that sup(y) = sup(y').
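The notions of support, confidence, frequent itemset, and closed itemset above can be illustrated with a short brute-force sketch. The toy database `D` and all function names below are hypothetical, introduced only for illustration; this is not part of the paper's algorithm.

```python
# Transactional database in the paper's notation: each transaction is a set of items.
# This is a made-up toy example, not the paper's Table 1.
D = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c", "e"}]

def support(itemset, db):
    """sup(y): fraction of transactions in db that contain itemset y."""
    y = set(itemset)
    return sum(y <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """Confidence of the rule antecedent -> consequent."""
    union = set(antecedent) | set(consequent)
    return support(union, db) / support(antecedent, db)

def is_frequent(itemset, db, min_sup):
    """y is frequent iff sup(y) >= min-sup."""
    return support(itemset, db) >= min_sup

def is_closed(itemset, db):
    """Definition 1: y is closed iff no proper superset has the same support.
    Checking only one-item extensions suffices, because support is
    antimonotone: if any superset ties y's support, some y ∪ {i} does."""
    items = set().union(*db)
    y = set(itemset)
    return all(support(y | {i}, db) < support(y, db) for i in items - y)
```

For instance, `support({"a", "c"}, D)` is 0.5 (two of four transactions), while `{"b"}` is not closed because `{"b", "c"}` has the same support.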

3. Related Literature
A. FP-tree and FP-growth Method
As FC-Close is an extended version of FP-growth, an introduction to FP-growth and the FP-tree structure is needed. FP-growth uses a new structure called the FP-tree. The FP-tree is a compact data structure that stores all the information on the frequent itemsets of a database that is needed for mining. Each branch of the FP-tree represents frequent itemsets, and the nodes along a branch hold items in descending order of their counts.



Each node in the FP-tree has three fields: item-name, count, and node-link. The item-name records the item the node represents; the count shows the number of transactions covering the path from the root to the node; and the node-link points to the next node in the FP-tree carrying the same item, or is null if there is no such node. In addition, each FP-tree has its own header table. The frequent single items of the database are stored in this table in descending order of support. Each entry of the header table has two fields: the item name and the head of the node-link, which refers to the first node in the FP-tree carrying that item name.

Compared with Apriori [1] and its variants, which require numerous passes over the database, FP-growth needs just two passes to mine all frequent itemsets. In the first pass, the support of each item is calculated and the frequent single items (frequent itemsets of length 1) of the database are ordered descendingly. In the second pass, an FP-tree holding all the frequent-itemset information of the database is created. In other words, each transaction of the ordered database is read and added, one at a time, to the FP-tree structure. (To add a new transaction to the FP-tree, if the transaction shares a prefix with previously added transactions, no new nodes are created for the items inside the prefix; only their counts are increased.) Therefore, mining the database reduces to mining the FP-tree. Figure 1 shows the FP-tree built in the second pass with a minimum support of 20%. When an item i is added to the itemset y, with y ∪ {i} denoted z, the path from the parent of node i to the root in the FP-tree of y is called the prefix path of z.
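The node layout and header table described above can be sketched as follows. This is a minimal illustration with invented class and field names; for simplicity, new nodes are prepended to the node-link chain rather than appended as in the paper's description.

```python
class FPNode:
    """One FP-tree node with the three fields described above:
    item-name, count, and node-link (next node holding the same item)."""
    def __init__(self, item, parent):
        self.item = item        # item-name (None for the root)
        self.count = 0          # transactions covering the path to this node
        self.parent = parent
        self.children = {}      # item -> child FPNode
        self.node_link = None   # next FPNode with the same item-name, or None

def insert_transaction(root, ordered_items, header):
    """Add one ordered transaction. A shared prefix creates no new nodes;
    only the counts along the existing path are increased."""
    node = root
    for item in ordered_items:
        if item not in node.children:
            child = FPNode(item, node)
            node.children[item] = child
            # thread the new node onto the header table's node-link chain
            child.node_link = header.get(item)
            header[item] = child
        node = node.children[item]
        node.count += 1
    return root
```

Inserting two transactions with the common prefix `["c", "a"]` yields a single path whose shared nodes carry count 2.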
Let us review FP-growth in further detail. Once the FP-tree is created, the frequent patterns are mined from it with the FP-growth algorithm. FP-growth works by recursive projection: the database is repeatedly restricted to the existing itemsets, where the database restricted to an itemset y is called the conditional pattern base of y and is denoted T_y. To create the conditional pattern base of an itemset, all of its prefix paths (read from the root) are written out. After creating the conditional pattern base of an itemset, its conditional FP-tree is built. To build the conditional FP-tree of an itemset, the same steps are followed as in building the initial FP-tree; in this stage, however, instead of the whole database, the conditional pattern base of that itemset is used. Accordingly, the support counts of each item are summed over all prefix paths of the conditional pattern base, and if the total reaches the threshold, the item is added to the header table and to the conditional FP-tree. The mining procedure is applied recursively on conditional FP-trees until a tree is empty or consists of a single branch, from which all remaining frequent patterns are extracted directly.
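The two passes and the conditional pattern base can be mimicked directly on the transaction list, without building the tree. The sketch below is a hypothetical brute-force rendering of those definitions for a single item, not the FP-growth implementation; all names are invented for illustration.

```python
def ordered_frequent(db, min_count):
    """First pass: count single items, keep those meeting min_count, and
    order them by descending support (the order used to build the FP-tree)."""
    counts = {}
    for t in db:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    keep = [i for i in counts if counts[i] >= min_count]
    return sorted(keep, key=lambda i: (-counts[i], i))  # ties broken alphabetically

def conditional_pattern_base(db, item, min_count):
    """All prefix paths preceding `item` in the ordered transactions,
    i.e. the database restricted to transactions containing `item`."""
    order = ordered_frequent(db, min_count)
    rank = {it: r for r, it in enumerate(order)}
    base = []
    for t in db:
        ordered_t = sorted((i for i in t if i in rank), key=rank.get)
        if item in ordered_t:
            base.append(ordered_t[: ordered_t.index(item)])
    return base
```

On a toy database with the ordered items `a, c, b`, the conditional pattern base of `b` consists of the prefixes `["a", "c"]` and `["c"]`.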
B. CFI-tree
In the FP-close algorithm, the CFI-tree is introduced as a special structure for storing closed frequent itemsets. A CFI-tree is like an FP-tree. It includes a root node labeled root. Each node below the root has four fields: item-name, count, level, and node-link. All nodes with the same item name are connected: the node-link refers to the next node with that item name. A header table is created for the items of the CFI-tree, where the order of items in the table is the same as the order of items in the FP-tree first built over the database. Each entry of the header table has two fields: the item name and the head of the node-link, which links to the first node with that item name in the CFI-tree. The level field is used to speed up subset testing. The count field is needed when a new itemset y is compared with itemsets z already in the tree: before y is inserted, the tree is checked to confirm there is no z with y ⊂ z such that y and z have the same count.
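The subsumption test described above can be sketched with a plain dictionary standing in for the CFI-tree. This is an illustrative reduction under assumed names: itemsets are stored as frozensets with their support counts, and the tree layout, levels, and node-links that make the real test fast are omitted.

```python
def is_subsumed(y, sup_y, closed_sets):
    """Mirror of the CFI-tree test: a newly found itemset y is redundant
    if some stored itemset z satisfies y ⊂ z with the same support count."""
    y = frozenset(y)
    return any(y < z and sup_y == c for z, c in closed_sets.items())

def insert_closed(y, sup_y, closed_sets):
    """Store y only if no stored proper superset has the same count
    (the insertion rule FP-close applies to its CFI-tree)."""
    if not is_subsumed(y, sup_y, closed_sets):
        closed_sets[frozenset(y)] = sup_y
    return closed_sets
```

For example, after storing {a, c} with count 8, the itemset {a} with count 8 is rejected as subsumed, while {a} with a different count is kept.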
TABLE 1.
A SAMPLE DATABASE

tid   items
1     abcefo
2     acg
3     ei
4     acdeg
5     glace
6     ej
7     abcefp
8     acd
9     acegm
10    acegn
Fig. 1 Structure of FP-tree

The insertion of a frequent itemset into the CFI-tree is similar to the insertion of a transaction into the FP-tree. However, when an itemset shares a prefix with previously inserted itemsets, the counts of the shared nodes are not accumulated; instead, each node keeps the maximum of the counts. In effect, in FP-close a newly discovered itemset is put into the CFI-tree unless it is a subset of an itemset already in the tree with the same occurrence count. Figure 2 displays the CFI-tree of the database of Table 1 with a minimum support of 20%. In this figure, a node labeled (x, c, l) carries item x with count c and level l. More details on the CFI-tree are available in [4].

Fig. 2 Structure of CFI-tree



4. FC-Close Algorithm
The FC-Close algorithm employs the same FP-tree and CFI-tree structures for mining frequent closed itemsets. Here, however, the search space is reduced effectively by an optimization technique called pruning. The smaller search space and number of trees imply less running time than similar algorithms such as FP-close. Let us review the pruning technique.
A. Optimal Pruning Technique in the FC-Close Algorithm
Suppose y is an itemset. One optimization compares the support of y with that of y ∪ {i}, where i is an item of y's conditional pattern base. If sup(y) = sup(y ∪ {i}), then every transaction that includes y also includes i. This guarantees that every frequent itemset z that includes y but not i has a frequent superset z ∪ {i}, and the two sets have the same support. By the definition of closed frequent itemsets, counting the itemsets that include y but not i is therefore unnecessary; i can be moved into y and removed from y's conditional pattern base.

Another optimization compares the itemsets y ∪ {i} and z ∪ {i}, where z consists of the items of y without an item j already added to y. If sup(y ∪ {i}) and sup(z ∪ {i}) are equal, it is guaranteed that each frequent itemset in the branch of z ∪ {i} has a frequent superset containing y ∪ {i} with the same count (the same support). By the definition of closed frequent itemsets, the search of the branch containing the itemset z ∪ {i} can therefore be avoided. Figure 3 shows the pseudo-code of the function that performs the pruning technique:
Pruning (current itemset: y)
{
 For each item i in y's conditional pattern base
 {
  Newitem = y ∪ {i}
  If (support(Newitem) == support(y))
  { Move i to y
    Remove i from y's conditional pattern base }
  Newitem = y ∪ {i}
  If (support(Newitem) == support(z ∪ {i}))
    Stop the search of the branch containing the itemset z ∪ {i}
 }
}
Fig. 3 Pseudo-Code of Pruning Function

B. Mining Frequent Closed Itemsets using FC-Close
In this article a new algorithm called FC-Close is introduced, built on the pruning technique. It is developed from the FP-growth method and, like FP-growth, it is recursive. In the first call, an FP-tree is built from the first pass over the database. A linked list y holds the items that form the conditional pattern of the current call. If the FP-tree consists of a single path, each frequent itemset x generated from this path is combined with the frequent itemset y, and it is checked whether the combined itemset is a closed frequent itemset: if y ∪ {x} is a closed frequent itemset, it is put into the CFI-tree (lines (4)-(5)). If the FP-tree is not a single-path tree, each item i of the header table is appended to y in turn. The if-closed function is then called to test whether the itemset y is a closed frequent itemset; if so, y is put into the CFI-tree. In the following lines, y's FP-tree is constructed and the pruning function is called; then FC-Close is called recursively. Figure 4 displays the FC-Close pseudo-code:

FC-Close(T)
Input T: FP-tree
Global:
 y: a linked list of items
 CFI-tree: a CFI-tree
Output: the CFI-tree, which contains the closed frequent itemsets
Method:
(1) If T only contains a single path p {
(2)  Generate all frequent itemsets from p
(3)  For each x in the frequent itemsets
(4)   If not if-closed(y ∪ {x}) {
(5)    Insert y ∪ {x} into CFI-tree } }
(6) Else for each i in header-table of T {
(7)  Append i to y
(8)  If not if-closed(y)
(9)  {
(10)  Insert y into CFI-tree
(11)  Construct y's FP-tree Ty
(12)  Pruning(y)
(13)  FC-Close(Ty) }
(14) Remove i from y }
Fig. 4 FC-Close Algorithm Pseudo-Code

5. Results
In this section the developed algorithm, FC-Close, is compared with the existing FP-close algorithm. The test machine has a 3 GHz Pentium processor, 1 GB of RAM, and a 200 GB disk, and runs Windows XP 2003. All code is implemented in C++. The algorithms are tested on two databases: a dense database called Chess and the sparse database T40.I10.D100K. The specifications of these databases are shown in Table 2; both databases are taken from [11].

TABLE 2.
SPECIFICATIONS OF TESTED DATABASES

Dataset       #transactions  Avg. transaction size
Chess         3196           3553
T40I10D100k   100000         3954

Figures 5 and 6 contrast the execution time of the developed algorithm, FC-Close, with that of the existing FP-close on the Chess and T40.I10.D100K databases.
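For reference, the set of closed frequent itemsets that FC-Close is meant to produce on the Table 1 database can also be computed by brute force. The sketch below enumerates frequent itemsets and applies Definition 1 directly; it is a naive baseline for checking results, not the FC-Close algorithm, and its function names are invented for illustration.

```python
from itertools import combinations

# The paper's Table 1 database, each transaction as a set of single-letter items.
DB = [set(s) for s in
      ["abcefo", "acg", "ei", "acdeg", "glace",
       "ej", "abcefp", "acd", "acegm", "acegn"]]

def support_count(itemset, db):
    """Number of transactions in db containing itemset."""
    return sum(set(itemset) <= t for t in db)

def closed_frequent_itemsets(db, min_sup):
    """Brute-force reference: enumerate frequent itemsets level by level
    (stopping once a level is empty, by the antimonotonicity of support),
    then keep those with no proper superset of equal support (Definition 1)."""
    min_count = min_sup * len(db)
    items = sorted(set().union(*db))
    frequent = {}
    for k in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, k):
            c = support_count(combo, db)
            if c >= min_count:
                frequent[frozenset(combo)] = c
                found = True
        if not found:
            break
    return {y: c for y, c in frequent.items()
            if not any(y < z and c == cz for z, cz in frequent.items())}
```

With the paper's minimum support of 20%, for instance, {a, c} is closed with count 8, while {a} alone is not closed because {a, c} has the same count.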




Fig. 5 Comparison of execution time on the Chess database

Fig. 6 Comparison of execution time on the T40.I10.D100K database

As shown in Figures 5 and 6, FC-Close is more efficient than FP-Close.

6. Conclusion
In this article, FC-Close is introduced as an efficient algorithm for mining closed frequent itemsets. The algorithm decreases the search space and FP-tree size using a pruning technique. The experiments show that FC-Close is more efficient than the FP-close algorithm.

References
[1] R. Agrawal & R. Srikant, “Fast algorithms for mining association rules,” (1994) Proceedings of the VLDB, Santiago de Chile.
[2] C-C. Chang & C-Y. Lin, “Perfect hashing schemes for mining association rules,” (2005) Oxford University Press on behalf of the British Computer Society, vol. 48, no. 2.
[3] B. Goethals, “Survey on frequent pattern mining,” (2004) Department of Computer Science, University of Helsinki.
[4] G. Grahne & J. Zhu, “Efficiently using prefix-trees in mining frequent itemsets,” (2003) IEEE ICDM Workshop on Frequent Itemset Mining Implementations.
[5] J. Han, J. Pei, Y. Yin, & R. Mao, “Mining frequent patterns without candidate generation,” (2003) Data Mining and Knowledge Discovery.
[6] J.S. Park, M-S. Chen, & P.S. Yu, “An effective hash-based algorithm for mining association rules,” (1995) ACM SIGMOD International Conference on Management of Data, vol. 24, pp. 175-186.
[7] N. Pasquier, Y. Bastide, R. Taouil, & L. Lakhal, “Discovering frequent closed itemsets for association rules,” (1999) Proc. Int'l Conf. Database Theory, pp. 398-416.
[8] J. Pei, J. Han, & R. Mao, “CLOSET: An efficient algorithm for mining frequent closed itemsets,” (2000) ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 21-30.
[9] J. Wang, J. Han, & J. Pei, “CLOSET+: Searching for the best strategies for mining frequent closed itemsets,” (2003) Proc. Int'l Conf. Knowledge Discovery and Data Mining, pp. 236-245.
[10] M.J. Zaki & C. Hsiao, “CHARM: An efficient algorithm for closed itemset mining,” (2002) Proc. SIAM Int'l Conf. Data Mining, pp. 457-473.
[11] , 2003.
[12] , 2005.
