A Fast Parallel Algorithm for Discovering Frequent Patterns
Kawuu W. Lin and Yu-Chin Luo
Department of Computer Science and Information Engineering
National Kaohsiung University of Applied Sciences
Kaohsiung, Taiwan, R.O.C.

Abstract
Fast discovery of frequent patterns is the most extensively discussed problem in the data mining field because of its wide applications. As the size of the database increases, the computation time and the required memory increase severely. The difficulty of mining large databases has motivated research on parallel and distributed algorithms for the problem. Most past studies parallelize the computation by dividing the database and distributing the parts to other nodes for mining. This approach may leak data and is evidently unsuitable for sensitive domains such as health care. In this paper, we propose a novel data mining algorithm named FD-Mine that efficiently utilizes the nodes to discover frequent patterns in cloud computing environments while preserving data privacy. Empirical evaluations under various simulation conditions show that FD-Mine delivers excellent performance in terms of scalability and execution time.
Keywords: data mining; cloud computing; association rule mining; frequent pattern mining; privacy preservation
I. Introduction
With the progress of information technology, data mining techniques have been extensively applied in various domains. The goal of data mining is to discover hidden, useful information in large databases. The discovered information can support decision processes, aid commercial promotion, and so forth. Data mining includes four main topics: association rule mining [2], sequential pattern mining [3], clustering [11] and classification [5]. Among these, the problem of frequent pattern mining, i.e. association rule mining and sequential pattern mining, is the most discussed because of its wide applications.
The basic idea of frequent pattern mining is to discover the patterns whose frequency of appearance in the database is greater than a specified threshold. An association rule is defined as X=>Y, where X and Y are sets of items. The aim of association rule mining is to discover sets of items that tend to be associated with others in the database. Studies on association rule mining can be classified into two types: 1) the generate-and-test (Apriori-like) approach [2] and 2) the frequent pattern growth (FP-growth-like) approach [6]. The Apriori-like methods iteratively generate candidate itemsets of size (k+1) from frequent itemsets of size k and scan the database repeatedly to test the frequency of each candidate itemset. These methods suffer from the large number of candidate itemsets, especially when the support threshold is small. For this reason, Han et al. [6] proposed a novel data structure, the frequent pattern tree (FP-tree), in which the transactions are compressed and stored. A mining algorithm, FP-growth, was also proposed for discovering the frequent patterns from the FP-tree. FP-growth needs only two scans of the physical database and therefore greatly improves the execution time.
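The two scans can be illustrated with a minimal sketch. It is not the implementation of [6]; the names `FPNode`, `build_fp_tree` and `min_count` are hypothetical, and the threshold is assumed to be an absolute transaction count. The first scan counts item supports, and the second inserts each transaction's frequent items in descending support order so that common prefixes share a path.

```python
from collections import Counter


class FPNode:
    """One FP-tree node: an item, its count, and links to parent and children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Scan 1: count how many transactions contain each item.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in support.items() if c >= min_count}

    root = FPNode(None)
    header = {item: [] for item in frequent}  # item -> nodes holding that item

    # Scan 2: insert each transaction's frequent items, most frequent first,
    # so that transactions sharing a prefix share a path in the tree.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


transactions = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["b", "c"]]
root, header = build_fp_tree(transactions, min_count=2)
```

In this toy database the item d appears only once, so it is filtered out in the first scan and never enters the tree or the header table.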
As the size of the database increases, the computation time and the required memory increase severely. Many studies on association rule mining have therefore aimed mainly at improving execution time. In the past decades, parallel and distributed computing (PDC) techniques have attracted extensive attention for their ability to manage and process large amounts of data. The difficulty of mining large databases has motivated research on parallel and distributed algorithms for the problem [7], [8], [10], [13], [14]. The main approach of the existing studies is to divide the database and distribute each part to nodes or processors for mining, with the goal of spreading the computation load. During the mining process, the nodes exchange the required transactions with each other. This exchange workload becomes heavy when the average transaction length is long or the database is large. Although many algorithms have been proposed, the execution efficiency of frequent pattern mining remains a challenge because of the data explosion. In addition to the exchange workload, data privacy is a major concern, since this kind of algorithm copies the database to every node in the PDC architecture. This approach is evidently unsuitable for sensitive domains such as health care.
In this paper, we propose a novel data mining method named FD-Mine that efficiently utilizes the cloud nodes to quickly discover frequent patterns in cloud computing environments while preserving data privacy. Empirical evaluations under various simulation conditions show that FD-Mine delivers excellent performance in terms of scalability and execution time.
Authorized licensed use limited to: LA TROBE UNIVERSITY. Downloaded on June 13,2010 at 08:00:36 UTC from IEEE Xplore. Restrictions apply.
The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 proposes the architecture and presents the data mining algorithm. Section 4 reports the empirical performance evaluation, and Section 5 gives the conclusions.
II. Related Work
To improve the performance of association rule mining, many researchers have tried to distribute the mining computation over more than one processor or node. In [9], the authors proposed a parallel algorithm named Parallel FP-tree (PFP-tree), based on the FP-tree data structure, for mining frequent patterns on message passing multiprocessor systems. The algorithm divides the database into several non-overlapping parts according to the number of available processors, and lets each processor construct its FP-tree by exchanging the necessary information with other processors. Because the algorithm runs on a single node, the data exchange takes place within that node, so the overhead might not be severe. To parallelize frequent pattern mining, past studies have relied mainly on the database dividing method [4], [15]. The database is divided equally or by some criterion, and each part is sent to a node for mining. Copying the database to other nodes risks leaking the data, so data privacy cannot be preserved by this approach.
Note that in cloud computing environments, network latency is an important issue that should be carefully considered. The targeted database in mining applications is generally large. Transmitting the database and exchanging large amounts of data over the internet greatly degrades performance. In [12], the proposed method, named QFP-growth, divides the database equally and constructs FP-trees from the assigned parts of the database. The FP-trees are then merged into a single FP-tree to complete the mining task.
The data transmission overhead was studied in [14]. The authors observed that the time spent exchanging transactions is much longer than the mining time. To exchange transactions efficiently among nodes in the database dividing approach, the TPFP-tree algorithm was proposed; it uses a transaction identification set (Tidset) to select transactions directly instead of scanning the physical database. The Tidset is a table recording the IDs of the transactions that contain a certain item, so the memory required by the Tidset is about the same size as the assigned partial database. Therefore, TPFP-tree is bounded by the size of the targeted database.
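As a rough illustration of the Tidset idea, and not the data layout used in [14], the sketch below keeps one set of transaction IDs per item. The names `build_tidset` and `select_transactions` are hypothetical, and the partition is assumed to be an in-memory mapping from transaction ID to item list with no repeated items inside a transaction.

```python
from collections import defaultdict


def build_tidset(partition):
    """Map each item to the set of transaction IDs that contain it."""
    tidset = defaultdict(set)
    for tid, items in partition.items():
        for item in items:
            tidset[item].add(tid)
    return tidset


def select_transactions(partition, tidset, item):
    """Fetch the transactions containing `item` without scanning the partition."""
    return {tid: partition[tid] for tid in sorted(tidset.get(item, ()))}


partition = {1: ["a", "b"], 2: ["b", "c"], 3: ["a", "c", "d"]}
tidset = build_tidset(partition)
selected = select_transactions(partition, tidset, "a")
```

The Tidset stores one ID per item occurrence, which is why its memory footprint tracks the size of the assigned partition.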
To balance the computing load of TPFP-tree, the authors of [15] proposed the BTP-tree algorithm, a balanced Tidset-based parallel FP-tree algorithm for mining frequent patterns. The algorithm divides the database equally into p parts, where p is the number of nodes, and sends each part to its node. Each node builds the Tidset and header table for its assigned part. A global header table, GHT, is derived by gathering the header tables of all nodes into one table and filtering out the items whose support is below the threshold. Before executing the mining task, the BTP-tree algorithm calculates a performance index for each node and records the sum of the performance indexes. The mining task is then separated into p sub-tasks, where the load of each sub-task is measured by the number of items in the header table. The task assignment is decided by the performance indexing mechanism. After the task assignment, each node constructs its Tidset for fast selection. The required transactions are exchanged among the nodes to generate the new sub-databases by referring to the items of the header tables. Finally, FP-growth is performed on each node to discover the frequent patterns, which are gathered from all the nodes to obtain the complete set of frequent patterns.
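The derivation of the global header table can be sketched as follows. This is only an illustration under stated assumptions: each local header table is taken to be a plain item-to-count mapping, the threshold is an absolute count, and the function name `global_header_table` is hypothetical rather than taken from [15].

```python
from collections import Counter


def global_header_table(local_tables, min_count):
    """Sum per-node item counts, then drop items below the support threshold."""
    total = Counter()
    for table in local_tables:
        total.update(table)  # adds the counts of matching items
    return {item: count for item, count in total.items() if count >= min_count}


# Hypothetical header tables reported by three nodes.
local_tables = [{"a": 3, "b": 1}, {"a": 2, "c": 2}, {"b": 1, "c": 1}]
ght = global_header_table(local_tables, min_count=3)
```

Filtering has to happen after the tables are gathered: item c is below the threshold on every single node here, yet it is frequent globally.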
III. Proposed Algorithm: FD-Mine
In this section, we describe the proposed algorithm, which efficiently distributes the computation in cloud computing environments. The cloud architecture for mining frequent patterns is introduced in Section 3.1. In Section 3.2, we formulate the problem. The details of the proposed algorithm are described in Section 3.3.
3.1 Proposed Cloud Architecture for Frequent Pattern Mining
Data privacy is an important issue in cloud computing environments. Since the clouds are physically distributed and each cloud node provides only its computation ability, the trustworthiness of the nodes cannot be guaranteed. Therefore, to preserve data privacy, only a node that is known to be safe, rather than every node, can access the database. In our architecture, we call this node the trusted node or kernel node, and the cloud in which it resides the kernel cloud. For efficient data transmission among clouds, each cloud is designed to have only one node that connects to other clouds, named the connection node and abbreviated as conn-node. If a node N needs data from the trusted node, N asks the conn-node of its cloud whether the conn-node already has the data. If it does, N downloads the data from the conn-node via the intranet. Otherwise, the data is first copied to the conn-node via the internet, and N then downloads it from the conn-node via the intranet. With this transmission policy, the network latency can be minimized.
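The transmission policy amounts to a per-cloud cache at the conn-node. The sketch below is a single-process illustration of that idea, not the paper's networked implementation; the class `ConnNode`, its counters, and the dictionary standing in for the trusted node are all hypothetical.

```python
class ConnNode:
    """Hypothetical connection node that caches data fetched from the trusted node."""

    def __init__(self, trusted_store):
        self.trusted_store = trusted_store  # stands in for the trusted node
        self.cache = {}
        self.internet_transfers = 0
        self.intranet_transfers = 0

    def fetch(self, key):
        if key not in self.cache:
            # Cache miss: one copy travels over the internet to the conn-node.
            self.cache[key] = self.trusted_store[key]
            self.internet_transfers += 1
        # Every request is then served to the asking node over the intranet.
        self.intranet_transfers += 1
        return self.cache[key]


conn = ConnNode({"fp_tree": b"compressed-tree-bytes"})
copies = [conn.fetch("fp_tree") for _ in range(3)]  # three nodes in one cloud
```

However many nodes in a cloud ask for the same data, only the first request crosses the internet; the rest stay on the intranet.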
Figure 1. Proposed architecture for frequent pattern mining. (Legend: physical machine; database; trusted node, connection node and computing node, each a virtual machine.)
In this architecture, each conn-node maintains a table recording the status of the nodes in its cloud. The information recorded for each node consists of the node's ID and its availability. All of the tables are then gathered at the kernel node, so that the kernel node has complete information about the computation ability in terms of available nodes. The information is updated periodically.
3.2 Mining frequent patterns in cloud computing environments
One characteristic of the proposed algorithm is that data privacy is preserved. Unlike the parallel Apriori-like algorithms [4], which need to copy the database to remote nodes, or the BTP-tree algorithm [15], which distributes parts of the database directly to cloud nodes, our architecture and algorithm permit only the kernel node to access the database. Besides the privacy leakage of the conventional algorithms, the time they require to copy the physical database is considerable.
The data structure used by the proposed algorithm is based on that of FP-growth. The FP-tree is a data structure that stores the frequent items in compressed form. Because items whose support is below the support threshold are filtered out, and the filtered transactions are merged into the FP-tree, retrieving the complete transaction of any user from the FP-tree is impossible. Moreover, because the FP-tree is often implemented as a linked list and our algorithm additionally compresses the FP-tree with ZIP to reduce the transmission time, the transactions cannot be reconstructed. Data privacy can therefore be preserved.
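The compression step can be sketched as below. The paper does not specify the serialization format or library, so everything here is an assumption: the tree is encoded as nested dictionaries, serialized as JSON, and compressed with Python's `zlib` module (DEFLATE, the method commonly used inside ZIP archives). The helper names `pack` and `unpack` are hypothetical.

```python
import json
import zlib

# Hypothetical FP-tree encoded as nested dicts: item -> [count, children].
tree = {"a": [3, {"b": [2, {"c": [1, {}]}]}], "b": [1, {"c": [1, {}]}]}


def pack(tree):
    """Serialize the tree and compress it before it is sent over the network."""
    return zlib.compress(json.dumps(tree).encode("utf-8"))


def unpack(blob):
    """Decompress and deserialize a tree received from the conn-node."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


blob = pack(tree)
restored = unpack(blob)
```

The tree holds only frequent items and their counts, so the sender transmits this structure and never the original transactions; the receiving node reverses the packing step to recover the tree for mining.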
3.3 FD-Mine algorithm
The purpose of FD-Mine is fast mining. In cloud computing environments, distributing the mining computation entails data transmission over the network. In BTP-tree [15], the database is divided equally into several parts that are sent to the available nodes. The nodes then request the required data from each other to finish the mining task. Since the database is often large, this approach not only leaks the data but also incurs a great deal of data transmission over the network. The performance of this kind of approach is therefore expected to be poor.
An intuitive way to save time is to minimize the amount of data transmitted. FD-Mine is designed to transmit as little data as possible, saving network latency and disk I/O time. The algorithm is presented in Figure 2, and its details are as follows. The trusted node TN follows the FP-tree construction algorithm, scanning the database twice, and constructs the corresponding FP-tree, which is stored in TN (line 1). The next step is to obtain the header table HT (line 2) and to divide HT into |N| disjoint sets, stored in IS (line 3). Since the frequent patterns are not predictable, HT is divided randomly with the goal of balancing the load of each node. For execution efficiency, the most important issue is that the amount of data transmission should be minimized. To this end, the FP-tree constructed on TN is copied to each idle node. In cloud computing environments, we must also consider network latency. Since internet latency is always larger than intranet latency, the FP-tree duplication should be done over the intranet.
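The random division of the header table (line 3) might be sketched as follows. The paper states only that the division is random; shuffling and then dealing the items round-robin is an assumption made here so that the set sizes differ by at most one, and `divide_header_table` and its `seed` parameter are hypothetical names.

```python
import random


def divide_header_table(items, num_nodes, seed=None):
    """Randomly split header-table items into num_nodes disjoint sets."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Deal the shuffled items round-robin so set sizes differ by at most one.
    return [shuffled[i::num_nodes] for i in range(num_nodes)]


header_items = ["a", "b", "c", "d", "e", "f", "g"]
item_sets = divide_header_table(header_items, num_nodes=3, seed=0)
```

Each node then mines only the conditional items in its own set, so the union of the per-node results covers every frequent item exactly once.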
Because internet latency exceeds intranet latency, the FP-tree duplication is processed as follows. First, the algorithm selects an idle node n (line 5), and selects the connection node cn of n from the cloud architecture C (line 6). If cn has no copy of the FP-tree, TN sends one to cn (lines 7 to 9). Note that, to minimize the transmission overhead, the FP-tree should be compressed in advance. Afterwards, node n obtains the compressed FP-tree from cn via the intranet and decompresses it (line 10). After receiving the FP-tree, node n is assigned a subset of IS (line 11), and batch-runs FP-growth for each conditional item in the subset to mine the frequent patterns (lines 12 to 13). Each node thus needs only one data transmission, i.e. the FP-tree duplication, and the transmission takes place over the intranet to minimize the network latency. After all of the |N| disjoint sets are processed, the frequent patterns are returned (line 15).
Algorithm FD-Mine
Input: A transaction database DB, a minimum support threshold ξ, the trusted node TN, and a set of nodes N with cloud architecture C
Output: The complete set of frequent patterns, FP
1  TN.FPT ← constructFPTree(DB, ξ)   // TN reads the DB and constructs the corresponding FP-tree
2  HT ← getHT(FPT)   // Obtain the header table of FPT
3  IS ← divideHT(|N|)   // Randomly divide the items of HT into |N| disjoint sets
4  FOR i = 1 TO |IS|
5    n ← selectNode(N, i)   // Select the ith node
6    cn ← selectConnNode(n, C)   // Select the conn-node of n
7    IF (isExistFPT(cn) == FALSE)
8      cn.FPT ← TN.FPT   // Duplicate FPT from TN if cn does not have FPT
9    ENDIF
10   n.FPT ← cn.FPT   // Duplicate FPT from the conn-node of n
11   is_i ← getSet(IS, i)   // Obtain the ith set of IS
12   fp_i ← N_i.BatchFPGrowth(is_i)   // Batch-run FP-growth for each conditional item in is_i to mine the frequent patterns
13   FP ← FP ∪ fp_i
14 ENDFOR
15 RETURN FP
Figure 2. FD-Mine Algorithm.
IV. Experimental Results
To evaluate the performance of the proposed algorithm, we use IBM's Quest Synthetic Data Generator [1] to generate the workload data for mining. The experiments were conducted on a cloud system with three clouds. The first cloud contains four nodes, including the kernel node, each equipped with an E8400 2.4GHz CPU, 1GB of available RAM and 320GB of disk storage. The second and third clouds contain four and three nodes respectively, each equipped with a P8600 2.4GHz CPU, 1GB of available RAM and 160GB of disk storage. Note that the kernel node is responsible for receiving the requests and is not used for mining, so ten nodes in total can be used for mining. Since there are very few parallel, privacy-preserving algorithms for frequent pattern mining, and most of the existing parallel algorithms follow the database dividing approach, we select BTP-tree for comparison; it is one of the most efficient algorithms that parallelize the mining task on grid systems. Both FD-Mine and BTP-tree were implemented in Java, and the message passing among nodes and the remote function calls were implemented with Java RMI.
4.1 Effects of varying the number of cloud nodes
In the following experiments, we investigate the performance of FD-Mine in terms of execution time by varying the number of cloud nodes from 1 to 10. The results for database T20.I5.N100K.D100K are described first. The support threshold is set to 0.03%, which is a very small value, in order to verify the performance of both algorithms, FD-Mine and BTP-tree. Figure 3 shows that the required execution time of FD-Mine and BTP-tree decreases as the number of nodes increases. The execution time of FD-Mine is almost the same as that of BTP-tree when only one node is available, which is expected because both perform FP-growth on a single node. The execution time of FD-Mine is slightly longer than that of BTP-tree when the number of nodes is 2 or 3, because the time spent on FP-tree compression and decompression exceeds the time needed to transmit the divided parts of the database directly. With more than 3 nodes, FD-Mine shows the advantage of sending after compression: less time is required to complete the whole mining task.
Figure 3. The execution time for FD-Mine and BTP-tree with number of nodes varied on dataset T20.I5.N100K.D100K.
Figure 4 shows the impact on execution time when the average transaction length is increased to 40. FD-Mine delivers better performance than BTP-tree when the number of nodes is greater than 2. The reason is that BTP-tree, as a database dividing approach, needs to exchange transactions among the nodes, and its performance suffers from the large number of exchanged transactions.
Figure 4. The execution time for FD-Mine and BTP-tree with number of nodes varied on dataset T40.I5.N100K.D100K.
Figure 5 shows the performance of FD-Mine and BTP-tree when the number of transactions is set to 200K. In this experiment, FD-Mine outperforms BTP-tree when the number of nodes is greater than 2, which demonstrates the intrinsic drawback of the database dividing approach. Across this series of experiments, FD-Mine not only preserves data privacy but also delivers better execution time than BTP-tree, especially when the database is large.
Figure 5. The execution time for FD-Mine and BTP-tree with number of nodes varied on dataset T20.I5.N100K.D200K.
4.2 Effects of varying the parameters of dataset
In this section, we study the effects of varying the support threshold and the parameters of the data generator, namely the number of transactions and the average transaction length. Two algorithms, FD-Mine and BTP-tree, are compared in the experiment.
In Figure 6, we explore the impact on execution time of varying the support threshold from 0.05% to 0.01% with ten cloud nodes. FD-Mine always requires less time than BTP-tree. The efficiency of FD-Mine is mainly achieved by reducing the transmission overhead and the disk I/O time. In this experiment, the time required by FD-Mine is on average only about 82% of the execution time of BTP-tree.
Figure 6. The execution time for FD-Mine and BTP-tree with support threshold varied on dataset T20.I5.N100K.D100K.
V. Conclusions
In this paper, we have presented an efficient algorithm named FD-Mine that utilizes the cloud nodes to discover frequent patterns in cloud computing environments while preserving data privacy. The proposed FD-Mine is composed of two algorithms, namely HD-Mine and FD-Mine. Conventional algorithms are limited by the available memory when mining datasets with a large number of frequent patterns. The proposed HD-Mine is able to discover the frequent patterns from such datasets by merging the memory of several nodes. The proposed FD-Mine focuses on the fast discovery of frequent patterns by utilizing the cloud nodes, and is useful for applications that emphasize real-time mining. Another important characteristic of the proposed algorithms is that data privacy is preserved. Unlike the parallel Apriori-like algorithms, which need to copy the database to remote nodes, or the BTP-tree algorithm, which distributes parts of the database directly to cloud nodes, our architecture and algorithms never duplicate the database and permit only the kernel node to access it. Empirical evaluations under various simulation conditions show that FD-Mine delivers excellent performance in terms of scalability and execution time.
Acknowledgement
This research was partially supported by National Science Council, Taiwan, ROC under Grant No. 97-2218-E-151-003-MY2.
References
[1] R. Agrawal and R. Srikant. Quest Synthetic Data Generator. IBM Almaden Research Center, San Jose, California.
[2] R. Agrawal, T. Imielinski, A. Swami. Mining association rules between sets of items in large databases. In: Proc. ACM SIGMOD Intl. Conf. Management Data, 1993.
[3] R. Agrawal, R. Srikant, Mining Sequential Patterns, in: Proc. of the 11th Int'l Conf. on Data Engineering, 1995, pp. 3-14.
[4] R. Agrawal, John C. Shafer, "Parallel Mining of Association Rules", IEEE Transactions on Knowledge and Data Engineering, December 1996.
[5] R. J. Bayardo, Jr., Brute-force mining of high-confidence classification rules. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD'97), Newport Beach, California, USA.
[6] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns Without Candidate Generation. Proc. of ACM Int. Conf. on Management of Data (SIGMOD), 1-12, 2000.
[7] J.D. Holt, S.M. Chung, "Parallel mining of association rules from text databases on a cluster of workstations," Proceedings of 18th International Symposium on Parallel and Distributed Processing, 2004, pp. 86.
[8] P. Iko and M. Kitsuregawa, "Shared Nothing Parallel Execution of FPgrowth." DBSJ Letters, Volume 2, No. 1, 2003, pp. 43-46.
[9] A. Javed, A. Khokhar, "Frequent Pattern Mining on Message Passing Multiprocessor Systems," Distributed and Parallel Databases, Volume 16, Issue 3, 2004, pp. 321-334.
[10] T. Li, S. Zhu, M. Ogihara, "A New Distributed Data Mining Model Based on Similarity," Symposium on Applied Computing, 2003, pp. 432-436.
[11] Ester M., Kriegel H.-P., Sander J., Xu X.: "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, 1996, pp. 226-231.
[12] Y. Qiu, Y. J. Lan and Q. S. Xie, "An improved algorithm of mining from FP-tree," Proceedings of the Third International Conference on Machine Learning and Cybernetics, pp. 26-29, 2004.
[13] E.-H. S. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. IEEE Transactions on Knowledge and Data Engineering, 12(3):352-377, 2000.
[14] J. Zhou, K.-M. Yu, "Tidset-based Parallel FP-tree Algorithm for the Frequent Pattern Mining Problem on PC Clusters," Lecture Notes in Computer Science 5036, 2008, pp. 18-28.
[15] J. Zhou, K.-M. Yu, Balanced Tidset-based Parallel FP-tree Algorithm for the Frequent Pattern Mining on Grid System, Fourth International Conference on Semantics, Knowledge and Grid, 2008.