
A method for mining top-rank-k frequent closed itemsets

Loan T.T. Nguyen (a,b,*), Truc Trinh (c), Ngoc-Thanh Nguyen (d) and Bay Vo (e)

(a) Division of Knowledge and System Engineering for ICT, Ton Duc Thang University, Ho Chi Minh City, Vietnam
(b) Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
(c) VOV College, Ho Chi Minh City, Vietnam
(d) Faculty of Computer Science and Management, Wroclaw University of Science and Technology, Wrocław, Poland
(e) Faculty of Information Technology, Ho Chi Minh City University of Technology, Vietnam


Journal of Intelligent & Fuzzy Systems xx (20xx) x–xx
DOI:10.3233/JIFS-169128
IOS Press

1064-1246/16/$35.00 © 2016 – IOS Press and the authors. All rights reserved

*Corresponding author. Loan T.T. Nguyen, Division of Knowledge and System Engineering for ICT, Ton Duc Thang University, Ho Chi Minh City, Vietnam. E-mail:

Abstract. Mining frequent closed itemsets (FCIs) is important in mining non-redundant (minimal) association rules. Therefore, many algorithms have been developed for mining FCIs with reduced mining time and memory usage. For mining FCIs, algorithms use the minimum support threshold, minSup, to prune itemsets. However, using a fixed minSup is not suitable for mining top-rank-k FCIs. A large threshold will lead to a small number of generated FCIs, leaving insufficient FCIs to query when k is large. On the other hand, a small minSup will generate a huge number of FCIs, leading to large runtimes and high memory usage. In this paper, we propose a method for mining top-rank-k FCIs without using a fixed minimum support threshold. A strategy is first used to eliminate 1-items that cannot generate FCIs belonging to top-rank-k FCIs. Next, based on the set of candidate 1-items, we propose TRK-FCI, a DCI-Plus-based algorithm, for mining top-rank-k FCIs. In the process of mining top-rank-k FCIs, TRK-FCI automatically increases minSup according to the mined FCIs, efficiently pruning itemsets that cannot belong to top-rank-k FCIs. We also modify the dynamic bit vector (DBV) structure and apply it to reduce memory usage and runtime in the TRK-FCI-DBV algorithm. Experimental results show that TRK-FCI-DBV is more efficient than TRK-FCI for various databases.

Keywords: DCI-Plus, dynamic bit vectors, frequent closed itemsets, top-rank-k frequent closed itemsets

1. Introduction

Data mining is the process of extracting interesting knowledge from data. Various methods for discovering knowledge have been proposed, such as mining traditional association rules [1–4, 6, 7, 23, 31, 36, 37], mining non-redundant association rules [8, 41], mining minimal non-redundant association rules [26, 27], mining most generalization association rules [38], classification using decision trees [13, 20, 30] or ILA [33], classification based on association rules [13, 14, 20, 21], and clustering [22]. Mining association rules has many applications in practice [3, 23]. For mining association rules, frequent itemsets [2, 11, 21, 42], frequent closed itemsets (FCIs) [15, 21, 26, 28, 31, 32, 37, 40–42], or maximal frequent itemsets [12, 19] must be mined. Mining frequent itemsets is often used for generating all association rules that satisfy the minimum support threshold (minSup) and the minimum confidence threshold (minConf) [1, 2, 35, 36], and mining FCIs is used for mining (minimal) non-redundant association rules (i.e., rules considered redundant based on certain criteria are eliminated) [26, 27, 41]. For mining maximal frequent itemsets, all frequent itemsets or FCIs (for which the database must be scanned to compute the supports of itemsets) must be generated to mine the above kinds of rules.

Mining FCIs is important for pruning redundant rules. The problem was first stated in 1999 by Pasquier et al. [26]. Since then, many algorithms have been developed to enhance the efficiency of mining FCIs, such as those based on FP-tree [11, 28, 40], IT-tree [32, 42], bit vectors [31, 37], and N-lists [15]. To mine FCIs, a minSup threshold is set and the FCIs that satisfy it are selected. With a fixed threshold, it is difficult to mine a sufficient number of top-rank-k FCIs: an excessively high threshold leads to very few FCIs, not enough to query, whereas a minSup that is too low leads to a very large number of FCIs, requiring a lot of memory and time to mine. Therefore, developing efficient algorithms for mining top-rank-k FCIs is necessary.

Some algorithms have been developed for mining top-rank-k frequent itemsets. Deng et al. [5] proposed the NTK algorithm, which uses the Node-list structure to mine top-rank-k frequent itemsets. The iNTK algorithm, an improved version of the NTK algorithm proposed by Le et al. [14], uses the subsume concept and the N-list structure to quickly mine top-rank-k itemsets. In addition, algorithms have been developed for mining top-k frequent itemsets [29], top-k FCIs [39], and top-k non-redundant association rules [8].

Mining top-rank-k FCIs is important for mining non-redundant association rules. However, to the best of our knowledge, no algorithm has been developed for mining top-rank-k FCIs. Moreover, algorithms developed for mining top-rank-k frequent itemsets or top-k FCIs cannot be applied to mine top-rank-k FCIs. Therefore, in this paper, we propose the TRK-FCI algorithm, which is based on DCI-Plus [31], for mining top-rank-k FCIs. First, the algorithm finds a set of candidate items that may belong to top-rank-k FCIs, where k is a given threshold. Then, it uses the DCI-Plus algorithm to generate FCIs based on these candidate items. When an FCI is generated, it is directly inserted into a table named tab_k. FCIs with the same support are stored in the same entry. The number of entries in tab_k does not exceed the threshold k. In the process of mining top-rank-k FCIs, the algorithm automatically increases minSup to reduce the number of FCI candidates that do not belong to tab_k. Because DCI-Plus uses fixed bit vectors, it has high memory usage and runtime for storing and computing the bit vector of a new itemset, checking subsets, and computing the supports of itemsets. TRK-FCI-DBV, an improved version of TRK-FCI, is therefore developed. TRK-FCI-DBV uses the dynamic bit vector (DBV) structure instead of the bit vector structure to reduce mining time and memory usage.

The rest of this paper is organized as follows. Section 2 presents definitions of FCIs and top-rank-k FCIs and states the problem of mining top-rank-k FCIs. In Section 3, we review works related to the problem of mining FCIs, top-k and top-rank-k frequent itemsets, and top-k FCIs. Section 4 describes a method for mining top-rank-k FCIs based on the DCI-Plus algorithm and an improved algorithm based on DBVs. Experimental results on standard databases for TRK-FCI and TRK-FCI-DBV are presented in Section 5. Conclusions and suggestions for future work are given in Section 6.

2. Definitions and problem statement

Let I = {i_1, i_2, ..., i_m} be a set of items and DB = {t_1, t_2, ..., t_n} be a set of transactions, where each t_i (1 ≤ i ≤ n) is a transaction labeled by a unique identifier and contains a set of items in I.

Definition 1. (support of an itemset). Given a DB and an itemset X (X ⊆ I), the support of X, denoted by SUP_X, is the number of transactions containing X in DB.

Definition 2. (frequent itemset). Given a DB and an itemset X (X ⊆ I), X is a frequent itemset if SUP_X ≥ minSup.

Definition 3. (FCI). Given a DB and an itemset X (X ⊆ I), X is called an FCI if no itemset Y exists such that X ⊂ Y and SUP_X = SUP_Y.

Definition 4. (rank of an FCI). Given the set CI of all closed itemsets of a transaction database DB and an FCI X (X ∈ CI), the rank of X in CI is the number of distinct support values in CI that are no smaller than the support of X. The rank of X is defined as:

$$R_X = |\{SUP_Y \mid Y \in CI \wedge SUP_Y \ge SUP_X\}|$$
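Definitions 1 to 4 can be checked by brute force on a small database. The sketch below is not part of the proposed method. It uses the ten transactions of Table 1 (Section 4.2), counts supports as absolute numbers of transactions, and the names `support`, `closed_itemsets`, `rank`, and `top_rank_k` are illustrative choices. Exhaustive enumeration is only practical for a handful of items.

```python
from itertools import combinations

# Transactions of Table 1 (Section 4.2), one string of items per transaction.
DB = ["ACEFGH", "DEFH", "GHI", "BDEFG", "DEFG",
      "GH", "ACEFG", "BEFH", "DEFGH", "DEFGHJ"]


def support(itemset, db):
    """Definition 1: number of transactions that contain every item of itemset."""
    return sum(1 for t in db if set(itemset) <= set(t))


def closed_itemsets(db):
    """Definition 3 by exhaustive search: map each closed itemset to its support."""
    items = sorted(set("".join(db)))
    closed = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            sup = support(combo, db)
            if sup == 0:
                continue
            # X is closed if adding any further item strictly lowers the support.
            if all(support(combo + (i,), db) < sup
                   for i in items if i not in combo):
                closed["".join(combo)] = sup
    return closed


def rank(itemset, closed):
    """Definition 4: number of distinct support values >= the support of itemset."""
    return len({s for s in closed.values() if s >= closed[itemset]})


def top_rank_k(closed, k):
    """Closed itemsets whose rank does not exceed k."""
    return {x: s for x, s in closed.items() if rank(x, closed) <= k}
```

With k = 5, `top_rank_k(closed_itemsets(DB), 5)` returns eight itemsets with supports 8, 7, 6, 5, and 4, which correspond to the relative supports 0.8 to 0.4 on this ten-transaction database.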


Definition 5. (top-rank-k FCI). Given the set CI of all closed itemsets of a transaction database DB and a threshold k, an itemset X ∈ CI is called a top-rank-k FCI if and only if R_X is no greater than k, i.e., R_X ≤ k.

Definition 6. (mining top-rank-k FCIs). Given the set CI of all FCIs of a transaction database DB and a threshold k, the goal of mining top-rank-k FCIs is to find the complete set of FCIs whose ranks are no greater than k, i.e., the set {X ∈ CI | R_X ≤ k}.

From Definition 6, the problem of mining top-rank-k FCIs is stated as follows. Given a database DB and a threshold k, mining top-rank-k FCIs is divided into two steps:
Step 1: Mine all closed itemsets in DB; call this set CI.
Step 2: Keep the closed itemsets in CI that satisfy Definition 6.

The above approach is simple but not feasible because the number of closed itemsets in the database is often large. Therefore, finding a direct solution for mining top-rank-k FCIs without mining all closed itemsets is a challenge.

3. Related works

3.1. Mining frequent closed itemsets

The problem of mining FCIs was first proposed in 1999 [26]. Many algorithms for mining FCIs have since been developed to reduce runtime and memory usage. Apriori-based algorithms for this purpose include Close [26] and A-Close [27]. These algorithms generate candidates and compute their closures to find FCIs. Algorithms based on the divide-and-conquer technique have also been developed. Closet [28] uses FP-tree to compress the database and early pruning to prune non-closed itemsets. Closet+ [40], an improved version of Closet (which uses a bottom-up projection scheme for FP-tree), uses a hybrid approach: bottom-up for dense databases and top-down for sparse databases. It uses item merging and sub-itemset pruning, which are widely used in other algorithms, and applies the subset checking strategy to quickly check closed itemsets and item skipping to eliminate items at high levels that have the same support as that of items at low levels. FPClose [11] is an improved version of Closet+ that uses FP-array to reduce the number of FP-tree scans when FP-tree is projected. CHARM [42] is based on tidsets for quickly computing the supports of itemsets and uses subset checking to quickly prune non-closed itemsets. To check whether a generated itemset is closed, CHARM uses a hash table in which the key of each itemset is the sum of its items. dCHARM, a diffset-based approach for mining FCIs, has also been developed [42]. CloseMiner [32] uses closed tidsets to check whether an itemset is closed. Although CHARM, dCHARM, and CloseMiner have advantages over algorithms based on the horizontal data format, such as Close, A-Close, Closet, Closet+, and FPClose, they must use hash tables to check whether a candidate itemset is closed, and thus closed itemsets must be stored in main memory for easy checking. DCI-Closed [21] uses tidsets and a non-duplication generation strategy for mining FCIs. DCI-Plus [31], an improved version of DCI-Closed [21], generates FCIs and the minimal generators of each FCI. Because DCI-Closed is based on tidsets, when the tidsets of itemsets are long, a lot of memory is required to store the tidsets and the runtime required to compute the intersection with other tidsets is high. To reduce the length of tidsets and the computation time, DCI-Plus uses BitTable.

3.2. Mining top-rank-k frequent itemsets

Deng et al. proposed the NTK algorithm for mining top-rank-k frequent itemsets [5]. NTK uses the Node-list data structure to represent itemsets and a level-wise approach for mining top-rank-k frequent itemsets, i.e., t-patterns are used to form (t+1)-patterns. By using Node-lists, the algorithm does not need to rescan the database to compute the supports of itemsets. A dynamic minSup is used to efficiently prune candidates. Le et al. developed iNTK [14], an improved version of NTK. iNTK uses the subsume concept to generate fewer candidates than NTK, reducing the time required to generate candidates.
3.3. Mining top-k frequent closed itemsets

Wang et al. [39] proposed the TFP algorithm for mining top-k FCIs, where k is the number of FCIs that need to be mined. TFP uses a divide-and-conquer technique (like FP-Growth) and prunes candidates based on minSup (automatically increased in the process of updating candidates). The authors also used a threshold min_l to eliminate itemsets whose lengths are smaller than min_l.

3.4. Mining top-k association rules

In 2012, Fournier-Viger et al. [10] proposed the TopKRules algorithm for mining top-k association rules from datasets. This algorithm uses the minConf value during the mining process of top-k rules. The change of the minSup value depends on the lowest support of itemsets. The TopKRules algorithm is based on the principle of extending rules and on some methods for the early elimination of rules that do not belong to the top-k rules. Fournier-Viger and Tseng also extended TopKRules for mining top-k non-redundant rules [8] and top-k sequential rules [9]. These algorithms are very efficient compared to post-processing methods.

3.5. Dynamic bit vectors

In 2012, Vo et al. [37] proposed the concept of dynamic bit vectors (DBVs) and used it in mining frequent closed itemsets. The DBV of an itemset is a bit vector from which the zero bits at the beginning and the end are removed. This saves memory for storing bit vectors and time for computing the intersection of bit vectors. Tran et al. extended this concept to mine frequent closed sequences [34]. Le et al. also used DBVs to develop an efficient algorithm for mining frequent closed inter-sequence patterns [16].

4. Proposed algorithms

4.1. TRK-FCI algorithm

In this section, we present the TRK-FCI algorithm for mining top-rank-k FCIs based on BitTable. TRK-FCI uses DCI-Plus [31] to generate candidate closed itemsets and applies some early pruning techniques to prune candidates. First, the algorithm chooses a set of candidate items that may belong to top-rank-k FCIs, where k is a given threshold. Then, it uses the DCI-Plus algorithm to generate FCIs based on these candidate items. When an FCI is generated, it is directly inserted into a table named tab_k. FCIs with the same support are stored in the same entry. The number of entries in tab_k does not exceed the threshold k. In the process of mining top-rank-k FCIs, the algorithm automatically increases minSup to reduce the number of FCI candidates that do not belong to tab_k.

Fig. 1. TRK-FCI algorithm for mining top-rank-k FCIs.
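The tab_k bookkeeping described in this section can be sketched as follows. This is an illustrative reading of the prose, not the procedure of Fig. 1: the class name `TabK`, the use of absolute support counts as keys, and the rule that minSup is raised to the smallest key once the table holds k entries are all assumptions. The insertion order in the usage lines follows the illustration in Section 4.2 for the first ten FCIs; the order of the last two is assumed.

```python
class TabK:
    """Table of at most k entries; each entry groups FCIs with the same support."""

    def __init__(self, k, min_sup=0):
        self.k = k
        self.min_sup = min_sup
        self.entries = {}  # support count -> list of FCIs with that support

    def insert(self, itemset, sup):
        """Try to insert an FCI; return True if it was kept."""
        if sup < self.min_sup:
            return False
        if sup in self.entries:
            # An entry for this support already exists: no new rank is created.
            self.entries[sup].append(itemset)
            return True
        if len(self.entries) == self.k:
            lowest = min(self.entries)
            if sup < lowest:
                return False
            # A higher support arrives: the lowest-ranked entry is dropped.
            del self.entries[lowest]
        self.entries[sup] = [itemset]
        if len(self.entries) == self.k:
            # Table is full: anything below the last entry can be pruned.
            self.min_sup = min(self.entries)
        return True


# FCIs and absolute supports on the ten-transaction example, k = 5.
tab = TabK(k=5, min_sup=2)
for itemset, sup in [("ACEFG", 2), ("DEF", 5), ("DEFH", 3), ("DEFGH", 2),
                     ("DEFG", 4), ("H", 7), ("HEF", 5), ("HEFG", 3),
                     ("HG", 5), ("EF", 8), ("EFG", 6), ("G", 8)]:
    tab.insert(itemset, sup)
```

Keying the table by support rather than by itemset keeps the number of entries equal to the number of distinct ranks, so the pruning threshold is simply the smallest key.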

In the TRK-FCI algorithm (Fig. 1), database D is first scanned to compute the BitTable and determine the single items F_1. These items are sorted in descending order according to their supports; if two items have the same support, then they are sorted in increasing lexicographical order. Next, the algorithm creates F_2 by inserting items of F_1 into F_2 until the number of items in F_2 that differ in their BitTables is equal to k. The items in F_2 are sorted in increasing order according to their supports; if two items have the same support, then they are sorted in increasing lexicographical order. POST_SET is created by computing the closure of each item in F_2. The procedure DCI_CLOSED++ is called with the input CLOSED_SET = ∅, PRE_SET = ∅, POST_SET, and minSup, where minSup is the support of the first item in F_2 if the number of items in F_2 that differ in their BitTables is equal to k; otherwise, minSup is set to 0.

Fig. 2. DCI_CLOSED++ procedure.

4.2. Illustration

Consider the database in Table 1, which includes 10 transactions and 10 items. Assume that k = 5. The process of mining top-rank-k FCIs is as follows. First, the BitTable and support of each item are obtained, as shown in Table 2. After F_1 is sorted, we have F_1 = {G, F, E, H, D, C, B, A, J, I}. Next, we choose items from F_1 that may belong to top-rank-k FCIs and store them in F_2, i.e., F_2 = {A, C, D, H, E, F, G} (after sorting). Because A has the same BitTable as C and E has the same BitTable as F, they are grouped into two groups, (A, C) and (E, F), respectively. After grouping, the algorithm computes the closure of each item. The results are shown in Table 3. From Table 3, we have POST_SET = {ACEFG, CEFG, DEF, H, EF, F, G}.

Table 1
Transaction database D

Transaction   Items
1             A, C, E, F, G, H
2             D, E, F, H
3             G, H, I
4             B, D, E, F, G
5             D, E, F, G
6             G, H
7             A, C, E, F, G
8             B, E, F, H
9             D, E, F, G, H
10            D, E, F, G, H, J

Table 2
Items in D with their BitTable and support

Item   BitTable   Support
A      520        0.2
B      68         0.2
C      520        0.2
D      355        0.5
E      879        0.8
F      879        0.8
G      763        0.8
H      919        0.7
I      128        0.1
J      1          0.1

Table 3
BitTable and closure of items in D

Item   BitTable   Closure   Support
A      520        ACEFG     0.2
C      520        CEFG      0.2
D      355        DEF       0.5
H      919        H         0.7
E      879        EF        0.8
F      879        F         0.8
G      763        G         0.8
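The preprocessing that produces Tables 2 and 3 can be reproduced with a few lines. The sketch below rests on assumptions that happen to match the tables: transaction t of n is mapped to bit n − t of an integer, F_2 is filled in F_1 order until k distinct BitTables are present (items whose BitTable is already present are still admitted), and the closure of an item is the item followed by the later items of F_2 whose BitTable contains its own. The function names are illustrative and are not taken from the paper.

```python
# Transactions of Table 1, one string of items per transaction (ids 1..10).
TRANSACTIONS = ["ACEFGH", "DEFH", "GHI", "BDEFG", "DEFG",
                "GH", "ACEFG", "BEFH", "DEFGH", "DEFGHJ"]


def bit_table(transactions):
    """Map each item to an integer whose set bits mark its transactions."""
    n = len(transactions)
    table = {}
    for tid, items in enumerate(transactions, start=1):
        for item in items:
            # Assumed encoding: transaction 1 is the most significant bit.
            table[item] = table.get(item, 0) | (1 << (n - tid))
    return table


def support_count(bits):
    """Number of transactions, i.e. number of set bits."""
    return bin(bits).count("1")


def candidate_items(table, k):
    """Select F_2: items covering the k best distinct BitTables of F_1."""
    f1 = sorted(table, key=lambda i: (-support_count(table[i]), i))
    seen, f2 = set(), []
    for item in f1:
        if table[item] in seen or len(seen) < k:
            seen.add(table[item])
            f2.append(item)
    # F_2 is ordered by increasing support, ties in lexicographical order.
    return sorted(f2, key=lambda i: (support_count(table[i]), i))


def closures(f2, table):
    """POST_SET: each item extended by later items that cover its transactions."""
    post_set = []
    for pos, item in enumerate(f2):
        bits = table[item]
        later = [j for j in f2[pos + 1:] if (table[j] & bits) == bits]
        post_set.append(item + "".join(later))
    return post_set
```

Under these assumptions, `closures(candidate_items(bit_table(TRANSACTIONS), 5), bit_table(TRANSACTIONS))` yields the POST_SET listed in the illustration, and the bit test `(table[j] & bits) == bits` is the subset check on tidsets expressed on integers.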

Procedure DCI_CLOSED++ is called with the input PRE_SET = ∅, POST_SET, CLOSED_SET = ∅, and minSup = supp(A) = 0.2. The first element of POST_SET (ACEFG) is set to I. Because PRE_SET = ∅ and supp(ACEFG) = minSup, ACEFG is an FCI, and it is put into tab_k with its key, which is its support (0.2). ACEFG is also inserted into PRE_SET. Next, itemset CEFG is processed. Because supp(CEFG) = minSup and the BitTable of CEFG is a subset of the BitTable of ACEFG in PRE_SET, CEFG is pruned. When DEF is processed, because supp(DEF) > minSup and its BitTable is not a subset of the BitTable of any itemset in PRE_SET, DEF is an FCI, and it is put into tab_k with its key, which is its support (0.5). After that, procedure DCI_CLOSED++ is called with PRE_SET = {ACEFG}, CLOSED_SET_new = DEF, and POST_SET_new = {H, G}. EF and F do not appear in POST_SET_new because they belong to CLOSED_SET. Because CLOSED_SET ≠ ∅, DEF is joined with H to create a new generator (newgen), which is DEFH. Similarly, DEFH is an FCI, and is inserted into tab_k with its key (0.3). The procedure is called recursively with parameters PRE_SET = {ACEFG}, CLOSED_SET_new = DEFH, and POST_SET_new = {G}. Because CLOSED_SET ≠ ∅, its generator is DEFHG, and there is no itemset X in PRE_SET such that the BitTable of DEFHG is a subset of the BitTable of X, DEFGH is an FCI, and is inserted into tab_k with its key (0.2). Now, POST_SET = ∅ and thus DEF is added into PRE_SET. The process continues by joining DEF with G to form DEFG. DEFG is also an FCI and it is inserted into tab_k with its key (0.4). The algorithm then starts with a newgen H. H is an FCI and is inserted into tab_k with its key (0.7). Note that the number of entries in tab_k is now 5, which is equal to k. The algorithm continues to insert generated FCIs into tab_k. They include HEF (key is 0.5), HEFG (key is 0.3), and HG (key is 0.5). Consider the process of inserting FCI EF (whose key is 0.8) into tab_k. Because the key of EF is greater than that of the last entry (DEFGH) in tab_k (key is 0.2), DEFGH is removed, EF is inserted into tab_k, and minSup is set to 0.3 (the key of the last entry in tab_k). The algorithm continues to process the other FCIs. The results are shown in Table 4.

Table 4
Top-rank-k FCIs generated according to TRK-FCI algorithm

k   key/sup   FCIs
1   0.8       {EF}, {G}
2   0.7       {H}
3   0.6       {EFG}
4   0.5       {DEF}, {HEF}, {HG}
5   0.4       {DEFG}

4.3. Improved algorithm

The TRK-FCI algorithm is based on the DCI-Plus algorithm. Because DCI-Plus uses bit vectors to represent the tidsets of items, it requires more memory to store bit vectors and more time to compute the intersection of bit vectors when the number of transactions in the database is large. To reduce the mining time and memory usage, we develop an improved algorithm that uses DBVs instead.

Table 5 illustrates the use of DBVs for mining top-rank-k FCIs. It shows the items of F_2 with their DBVs, closures, and supports. Procedure DCI_CLOSED++ is the same as that in TRK-FCI, but the operations on BitTable are replaced by operations on DBVs. The final results are the same as those obtained with TRK-FCI.

Table 5
DBVs, closures, and supports of items in F_2

Item   DBV       Closure   Support
A      {0,520}   ACEFG     0.2
C      {0,520}   CEFG      0.2
D      {0,355}   DEF       0.5
H      {0,919}   H         0.7
E      {0,879}   EF        0.8
F      {0,879}   F         0.8
G      {0,763}   G         0.8

5. Experiments

The algorithms used in the experiments were implemented in C# 2012 on a personal computer with an i5-4200U 1.60-GHz CPU and 4 GB of RAM running Windows 8.1. The experiments were conducted on three databases downloaded from the UCI Machine Learning Repository ( />data). Table 6 shows the characteristics of the experimental databases.

Table 6
Characteristics of experimental databases

Database    # of transactions   # of items
Chess       3196                76
Pumsb       49046               7117
Accidents   340183              468

The experimental databases have different features. The Pumsb and Accidents databases have many transactions (or records), whereas the Chess database is small (3196 transactions).

5.1. Execution time

The efficiency of applying BitTable and DBVs for mining top-rank-k FCIs was evaluated. The experiments were conducted with various values of threshold k for the Accidents, Chess, and Pumsb databases. With increasing threshold k, the number of FCIs increased, increasing the time required to obtain top-rank-k FCIs.

Figures 3 to 5 show that the time required for mining top-rank-k FCIs from the three databases increases with increasing k. TRK-FCI-DBV runs faster than TRK-FCI. For example, consider the Pumsb database with a threshold k of 200. The mining time of TRK-FCI is 179.8 s and that of TRK-FCI-DBV is 130.7 s. Most of the processing time for both algorithms is spent in the itemset expansion stage. TRK-FCI-DBV has a lower processing time because it uses a more compact data format.

Fig. 3. Runtimes of TRK-FCI-DBV and TRK-FCI for Accidents database.

Fig. 4. Runtimes of TRK-FCI-DBV and TRK-FCI for Chess database.

Fig. 5. Runtimes of TRK-FCI-DBV and TRK-FCI for Pumsb database.

5.2. Memory usage

Fig. 6. Memory usage of TRK-FCI-DBV and TRK-FCI for Chess database.

Fig. 7. Memory usage of TRK-FCI-DBV and TRK-FCI for Accidents database.

Fig. 8. Memory usage of TRK-FCI-DBV and TRK-FCI for Pumsb database.

Figures 6 to 8 show that the memory usage for mining top-rank-k FCIs for the three experimental databases increases with increasing threshold k. The memory required by TRK-FCI-DBV is significantly less than that required by TRK-FCI for large k. Consider the Pumsb database. With a threshold k of 120, the memory usage values of the two algorithms are similar; however, when the threshold k is increased to 200, the memory used by TRK-FCI is nearly double that used by TRK-FCI-DBV.
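The memory and time savings come from the DBV representation of Section 3.5, in which the zero parts at both ends of a bit vector are not stored. The sketch below illustrates the idea at byte granularity; it is not the authors' implementation. The names `DBV`, `to_dbv`, `intersect`, and `support` are hypothetical, and the layout (a start position plus the remaining bytes) is an assumption loosely modelled on the {position, value} pairs of Table 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DBV:
    pos: int     # index of the first non-zero byte in the full bit vector
    data: bytes  # bytes from the first to the last non-zero byte


EMPTY = DBV(0, b"")


def to_dbv(full: bytes) -> DBV:
    """Drop the zero bytes at the beginning and the end of a bit vector."""
    start, end = 0, len(full)
    while start < end and full[start] == 0:
        start += 1
    while end > start and full[end - 1] == 0:
        end -= 1
    return DBV(start, full[start:end]) if start < end else EMPTY


def intersect(a: DBV, b: DBV) -> DBV:
    """AND two DBVs; only the overlapping byte range has to be visited."""
    lo = max(a.pos, b.pos)
    hi = min(a.pos + len(a.data), b.pos + len(b.data))
    if lo >= hi:
        return EMPTY
    merged = bytes(a.data[i - a.pos] & b.data[i - b.pos] for i in range(lo, hi))
    trimmed = to_dbv(merged)  # the result may have new zero bytes at its ends
    return DBV(lo + trimmed.pos, trimmed.data) if trimmed.data else EMPTY


def support(d: DBV) -> int:
    """Number of set bits, i.e. number of transactions in the tidset."""
    return sum(bin(byte).count("1") for byte in d.data)
```

Both the stored size and the cost of an intersection depend on the span between the first and last non-zero byte rather than on the total number of transactions, which is consistent with the gap widening on the larger databases.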

6. Conclusion and future work

This paper proposed a method for mining top-rank-k FCIs based on DCI-Plus. Two efficient algorithms, TRK-FCI and TRK-FCI-DBV, were proposed. These two algorithms differ in the way they represent data for each itemset, which gives them different mining times and memory usage values. A strategy is used to automatically change minSup to prune candidates in the mining process. The mining time and memory usage of the two algorithms were analyzed to compare the effectiveness of DBV with that of BitTable.

In the future, we will study how to prune candidates more efficiently. Moreover, we will try to use other approaches for mining top-rank-k FCIs. We will also expand our research to quantitative databases.

References

[1] R. Agrawal, T. Imielinski and A. Swami, Mining association rules between sets of items in large databases, In Proc of the 1993 ACM SIGMOD Conference, Washington DC, USA, 1993, pp. 207–216.
[2] R. Agrawal and R. Srikant, Fast algorithms for mining association rules in large databases, In Proc of the 20th International Conference on Very Large Data Bases, San Francisco, CA, USA, 1994, pp. 487–499.
[3] S. Ayubi, M.K. Muyeba, A. Baraani and J. Keane, An algorithm to mine general association rules from tabular data, Information Sciences 179(20) (2009), 3520–3539.
[4] E. Baralis, L. Cagliero, T. Cerquitelli and P. Garza, Generalized association rule mining with constraints, Information Sciences 194 (2011), 68–84.
[5] Z.H. Deng, Fast mining top-rank-k frequent patterns by using Node-lists, Expert Systems with Applications 41(4) (2014), 1763–1768.
[6] Y.J. Du and H.M. Li, Strategy for mining association rules for web pages based on formal concept analysis, Applied Soft Computing 10 (2010), 772–783.
[7] H.V. Duong and T.C. Truong, An efficient method for mining association rules based on minimum single constraints, Vietnam Journal of Computer Science 2(2) (2015), 67–83.
[8] P. Fournier-Viger and V.S. Tseng, Mining top-k non-redundant association rules, In Proc of 20th International Symposium, ISMIS 2012, Macau, China, 7661, 2012, pp. 31–40.
[9] P. Fournier-Viger and V.S. Tseng, Mining top-K sequential rules, In Proc of ADMA 2011, Beijing, China, 7121, 2011, pp. 180–194.
[10] P. Fournier-Viger, C.W. Wu and V.S. Tseng, Mining top-K association rules, In Proc of Canadian Conference on AI 2012, Toronto, Canada, 7310, 2011, pp. 61–73.
[11] G. Grahne and J. Zhu, Fast algorithms for frequent itemset mining using FP-trees, IEEE Transactions on Knowledge and Data Engineering 17(10) (2005), 1347–1362.
[12] K. Gouda and M.J. Zaki, GenMax: An efficient algorithm for mining maximal frequent itemsets, Data Mining and Knowledge Discovery 11(3) (2005), 223–242.
[13] T.R. Hoens, Q. Qian, N.V. Chawla and Z.H. Zhou, Building decision trees for the multiclass imbalance problem, In Proc of PAKDD 2012, 2012, pp. 122–134.
[14] Q.H.T. Le, T. Le, B. Vo and B. Le, An efficient and effective algorithm for mining top-rank-k frequent patterns, Expert Systems with Applications 42(1) (2015), 156–164.
[15] T. Le and B. Vo, An N-list-based algorithm for mining frequent closed patterns, Expert Systems with Applications 42(9) (2015), 6648–6657.
[16] B. Le, M.T. Tran and B. Vo, Mining frequent closed inter-sequence patterns efficiently using dynamic bit vectors, Applied Intelligence 43(1) (2015), 74–84.
[17] W. Li, J. Han and J. Pei, CMAR: Accurate and efficient classification based on multiple class-association rules, In Proc of The 1st IEEE International Conference on Data Mining, San Jose, California, USA, 2001, pp. 369–376.
[18] B. Liu, W. Hsu and Y. Ma, Integrating classification and association rule mining, In Proc of the 4th International Conference on Knowledge Discovery and Data Mining, New York, USA, 1998, pp. 80–86.
[19] X.B. Liu, K. Zhai and W. Pedrycz, An improved association rules mining method, Expert Systems with Applications 39(1) (2012), 1362–1374.
[20] W.Y. Loh, Classification and regression trees, WIREs Data Mining and Knowledge Discovery 1(1) (2011), 14–23.
[21] C. Lucchese, S. Orlando and R. Perego, Fast and memory efficient mining of frequent closed itemsets, IEEE Trans Knowledge and Data Engineering 18(1) (2006), 21–36.
[22] S.T. Mai, X. He, J. Feng, C. Plant and C. Böhm, Anytime density-based clustering of complex data, Knowledge and Information Systems 45(2) (2015), 319–355.
[23] V. Nebot and R. Berlanga, Finding association rules in semantic web data, Knowledge-Based Systems 25 (2012), 51–62.
[24] L.T.T. Nguyen, B. Vo, T.P. Hong and H.C. Thanh, CAR-Miner: An efficient algorithm for mining class-association rules, Expert Systems with Applications 40(6) (2013), 2305–2311.
[25] D. Nguyen, L.T.T. Nguyen, B. Vo and T.P. Hong, A novel method for constrained class association rule mining, Information Sciences 320 (2015), 107–125.
[26] N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal, Discovering frequent closed itemsets for association rules, In Proc of the 5th International Conference on Database Theory, 1999, pp. 398–416.
[27] N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal, Efficient mining of association rules using closed itemset lattices, Information Systems 24(1) (1999), 25–46.
[28] J. Pei, J. Han and R. Mao, CLOSET: An efficient algorithm for mining frequent closed itemsets, In Proc of the 5th ACM-SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Dallas, Texas, USA, 2000, pp. 11–20.
[29] G. Pyun and U. Yun, Mining top-k frequent patterns with combination reducing techniques, Applied Intelligence 41(1) (2014), 76–98.
[30] J.R. Quinlan, Introduction of decision tree, Machine Learning 1(1) (1986), 81–106.
[31] J. Sahoo, A.K. Das and A. Goswami, An effective association rule mining scheme using a new generic basis, Knowledge and Information Systems 43(1) (2015), 127–156.
[32] N.G. Singh, S.R. Singh and A.K. Mahanta, CloseMiner: Discovering frequent closed itemsets using frequent closed tidsets, In Proc of the 5th ICDM, Washington DC, USA, 2005, pp. 633–636.
[33] M.R. Tolun and S.M. Abu-Soud, ILA: An inductive learning algorithm for production rule discovery, Expert Systems with Applications 14(3) (1998), 361–370.
[34] M.T. Tran, B. Le and B. Vo, Combination of dynamic bit vectors and transaction information for mining frequent closed sequences efficiently, Engineering Applications of Artificial Intelligence 38 (2015), 183–189.
[35] B. Vo and B. Le, Mining traditional association rules using frequent itemsets lattice, 39th International Conference on CIE, Troyes, France, 2009, pp. 1401–1406.
[36] B. Vo and B. Le, Interestingness measures for association rules: Combination between lattice and hash tables, Expert Systems with Applications 38(9) (2011), 11630–11640.
[37] B. Vo, T.P. Hong and B. Le, DBV-Miner: A dynamic bit-vector approach for fast mining frequent closed itemsets, Expert Systems with Applications 39(8) (2012), 7196–7206.
[38] B. Vo, T.P. Hong and B. Le, A lattice-based approach for mining most generalization association rules, Knowledge-Based Systems 45 (2013), 20–30.
[39] J. Wang, J. Han, Y. Lu and P. Tzvetkov, TFP: An efficient algorithm for mining top-k frequent closed itemsets, IEEE Transactions on Knowledge and Data Engineering 17(5) (2005), 652–664.
[40] J. Wang, J. Han and J. Pei, CLOSET+: Searching for the best strategies for mining frequent closed itemsets, In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 236–245.
[41] M.J. Zaki, Mining non-redundant association rules, Data Mining and Knowledge Discovery 9(3) (2004), 223–248.
[42] M.J. Zaki and C.J. Hsiao, Efficient algorithms for mining closed itemsets and their lattice structure, IEEE Transactions on Knowledge and Data Engineering 17(4) (2005), 462–478.
