
Appl Intell
DOI 10.1007/s10489-014-0644-8

EIFDD: An efficient approach for erasable itemset mining
of very dense datasets
Giang Nguyen · Tuong Le · Bay Vo · Bac Le

© Springer Science+Business Media New York 2015

Abstract Erasable itemset mining, first proposed in 2009, is an interesting problem in supply chain optimization. The dPidset structure, a very effective structure for mining erasable itemsets, was introduced in 2014. The dPidset structure outperforms previous structures such as PID List and NC Set, and algorithms based on dPidsets can effectively mine erasable itemsets. However, for very dense datasets, the mining time and memory usage remain large. Therefore, this paper proposes an effective approach that uses the subsume concept for mining erasable itemsets from very dense datasets. The subsume concept is used to determine the information of a large number of erasable itemsets early, without the usual computational cost. The erasable itemsets for very dense datasets (EIFDD) algorithm, which uses the subsume concept and the dPidset structure for the erasable itemset mining of very dense datasets, is then proposed.
An illustrative example is given to demonstrate the proposed algorithm. Finally, an experiment is conducted to show the effectiveness of EIFDD.

Keywords Pattern mining · Erasable itemset · Subsume concept · Dense datasets

G. Nguyen
Faculty of Information Technology, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam

T. Le · B. Vo
Division of Data Science, Ton Duc Thang University, Ho Chi Minh City, Vietnam

T. Le · B. Vo
Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam

B. Le
Faculty of Information Technology, University of Science, VNU, Ho Chi Minh City, Vietnam

1 Introduction
Data mining, the computational process of discovering patterns in large datasets, has become important due to the fast growth of data. Pattern mining, including frequent itemset mining [3, 9, 12, 16, 19, 21–23, 25, 26, 28, 29], top-rank-k frequent pattern mining [10, 11], and sequential pattern mining [18, 30], is an essential task in data mining, especially association rule mining [1, 2, 17, 25] and its applications [4, 24]. In recent years, the mining of erasable itemsets [7] and top-rank-k erasable itemsets [5, 20] has been introduced. Consider the following problem. A manufacturer produces many products, which are created from a number of items (components). Each product brings income to the manufacturer. Unfortunately, in a financial crisis, the budget is insufficient to purchase all necessary components. The erasable itemset mining problem is to find the itemsets that can be erased without the loss of the factory's revenue exceeding a given threshold. Managers can utilize the knowledge from these erasable itemsets to make a new production plan via a recommendation system. There are many algorithms for mining erasable itemsets, including META [7], VME [8], MERIT [6], and MEI [14]. These algorithms were presented and compared in a survey [15]. The MERIT algorithm uses an NC Set generated from a WPPC-tree, whereas the MEI algorithm uses the dPidset structure. Based on the experimental results reported in [14], MEI is currently the most effective algorithm for mining erasable itemsets.



However, the runtime and memory usage of MEI are still quite large for very dense datasets, which have a small number of total items, a large number of items per product, and many co-occurring itemsets. Therefore, this paper proposes an improved algorithm that uses the subsume concept and the dPidset structure to enhance the process of mining erasable itemsets for very dense datasets. The main contributions of this paper are as follows: (i) the subsume concept for erasable itemsets is described, (ii) a method for determining the subsume index associated with erasable 1-itemsets is proposed, and (iii) the erasable itemsets for very dense datasets (EIFDD) algorithm is proposed. To demonstrate the effectiveness of the proposed method for very dense datasets, experiments were conducted. The experimental results show that EIFDD outperforms the MEI and MERIT algorithms in terms of mining time and memory usage for very dense datasets.
The rest of the paper is organized as follows. Section 2 presents the basic concepts of erasable itemsets. Related work is presented in Section 3. The dPidset structure for quickly mining erasable itemsets, the subsume concept for erasable itemset mining, a fast method for determining the subsume index of erasable 1-itemsets, and the EIFDD algorithm are introduced in Section 4. Performance studies are presented in Section 5 to show the effectiveness of EIFDD for very dense datasets. Section 6 gives a summary and a number of recommendations.

2 Basic concepts

Deng et al. [7] defined erasable itemsets and the problem of mining erasable itemsets, which are summarized below.
Let all items of a manufacturer be represented as I = {i1, i2, ..., im}. Let DB = {P1, P2, ..., Pn} be a product dataset of this manufacturer, where each Pi is a product presented in the form ⟨Items, Val⟩. Items is the set of items that compose this product, and Val is the revenue that the factory gets by selling this product. The example dataset presented in Table 1 is used throughout this article.
Definition 1 (revenue of an itemset) Let X (⊆ I) be an itemset. The revenue of X is determined as:

g(X) = \sum_{\{P_i \in DB \mid X \cap Items(P_i) \neq \emptyset\}} Val(P_i)    (1)

where:
- g(X) is the revenue of itemset X;
- Val(Pi) is the revenue of product Pi;
- Items(Pi) represents the items used to create Pi.

Table 1 Example dataset (DBE)

Product   Items          Val ($)
P1        a, b           1,000
P2        a, b, e        200
P3        c, e           150
P4        b, d, e, f     50
P5        c, d, e        100
P6        d, e, f, h     200
P7        d, h           150
P8        d, f, h        100

Example 1 According to Definition 1, g(a) = Val(P1) + Val(P2) = 1200, and g(h) = Val(P6) + Val(P7) + Val(P8) = 450.

Definition 2 (erasable itemset) Given a threshold ξ and a product dataset DB, let T = \sum_{P_i \in DB} Val(P_i) be the total revenue of the factory. An itemset X is an erasable itemset if and only if:

g(X) ≤ T × ξ    (2)

Example 2 Let ξ = 40 %. For DBE, T = 1950. According to Definition 2, g(a) = 1200 > T × ξ = 40 % × 1950 = 780; therefore, item a is not an erasable itemset. In contrast, g(h) = 450 < 780, and hence item h is an erasable itemset. All EIs for DBE with ξ = 40 % are shown in Table 2.
Table 2 All EIs for DBE with ξ = 40 %

Erasable itemsets   Val ($)
e                   700
e, c                700
d                   600
d, h                600
d, f                600
d, h, f             600
d, c                750
d, c, h             750
d, c, f             750
d, c, h, f          750
h                   450
h, f                500
h, c                700
h, c, f             750
f                   350
f, c                600
c                   250



The erasable itemset mining problem is to find all itemsets X in the dataset whose revenue g(X) does not exceed T × ξ.
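To make Definitions 1 and 2 concrete, the following minimal Python sketch computes g(X) by brute force on DBE (Table 1) and enumerates every erasable itemset for ξ = 40 %. It is only an illustration of the definitions, not an efficient mining algorithm; all names (DBE, gain, is_erasable) are ours, not the paper's.

```python
from itertools import combinations

# Example dataset DBE from Table 1: product id -> (items, revenue)
DBE = {
    1: ({"a", "b"}, 1000),
    2: ({"a", "b", "e"}, 200),
    3: ({"c", "e"}, 150),
    4: ({"b", "d", "e", "f"}, 50),
    5: ({"c", "d", "e"}, 100),
    6: ({"d", "e", "f", "h"}, 200),
    7: ({"d", "h"}, 150),
    8: ({"d", "f", "h"}, 100),
}

def gain(itemset, db=DBE):
    """Revenue g(X): total revenue of the products that contain
    at least one item of X (Definition 1, Eq. 1)."""
    return sum(val for items, val in db.values() if itemset & items)

T = sum(val for _, val in DBE.values())   # total revenue, 1950 for DBE
xi = 0.40                                 # threshold from Example 2

def is_erasable(itemset):
    """Definition 2, Eq. 2: X is erasable iff g(X) <= T * xi."""
    return gain(itemset) <= T * xi

# Brute-force enumeration of all erasable itemsets (only feasible for a tiny I)
all_items = sorted(set().union(*(items for items, _ in DBE.values())))
erasable = [frozenset(c)
            for k in range(1, len(all_items) + 1)
            for c in combinations(all_items, k)
            if is_erasable(set(c))]

print(gain({"a"}), gain({"h"}))   # 1200 and 450, as in Example 1
print(len(erasable))              # 17 itemsets, matching Table 2
```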

3 Related work
Deng et al. [7] proposed META, an Apriori-based algorithm, for mining erasable itemsets. However, the runtime
of this algorithm is slow because:
1. META scans the database k + 1 times, where k is the maximum level of erasable itemsets.
2. The strategy META uses to generate candidate itemsets is naïve: each erasable (k − 1)-itemset X is combined with all remaining erasable (k − 1)-itemsets to generate candidate erasable k-itemsets.
Consequently, [8] proposed the VME algorithm, which is faster than META. However, VME still has some significant disadvantages:
1. It scans the dataset twice to determine which 1-itemsets are erasable (it is well established that scanning a dataset requires considerable computer time and memory).
2. It uses an inefficient mechanism for generating candidate erasable k-itemsets in which the set of (k − 1)-itemsets is used to define candidate k-itemsets. For example, given the erasable 2-itemsets {ab, ac, ad, bc, bd, cd}, VME groups these according to their 1-itemset prefixes ({a}, {b}, and {c}) to give three groups: {ab, ac, ad}, {bc, bd}, and {cd}. VME then combines the elements of each group to create the candidate erasable 3-itemsets, {abc, abd, acd} and {bcd}. However, this process is computationally expensive (a minimal sketch of this join is given after this list).
3. It stores each product's revenue in the form of a tuple ⟨PID, Val⟩, where PID is the product identifier and Val is the revenue value. This leads to duplication of data because a ⟨PID, Val⟩ pair can appear in many PID Lists associated with different erasable itemsets. Thus, the VME algorithm requires a lot of memory.
4. It uses a strategy whereby the PID List associated with
erasable itemset X is a subset of the PID List associated with erasable itemset Y , where X ⊂ Y. This
strategy requires significant memory and computational
power when large numbers of erasable itemsets are
considered.
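A minimal sketch of the prefix-based join mentioned in item 2 above (our illustration of the general Apriori-style mechanism, not VME's actual code) shows the grouping and pairwise combination that make candidate generation expensive:

```python
from collections import defaultdict

def generate_candidates(erasable_k_minus_1):
    """Apriori/VME-style join: group erasable (k-1)-itemsets by their (k-2)-prefix
    and combine the members of each group to form candidate k-itemsets."""
    groups = defaultdict(list)
    for itemset in erasable_k_minus_1:        # itemsets as sorted tuples, e.g. ('a', 'b')
        groups[itemset[:-1]].append(itemset)
    candidates = []
    for members in groups.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                candidates.append(members[i] + members[j][-1:])
    return candidates

print(generate_candidates([("a", "b"), ("a", "c"), ("a", "d"),
                           ("b", "c"), ("b", "d"), ("c", "d")]))
# [('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'c', 'd'), ('b', 'c', 'd')]
```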
MERIT [6] uses the concept of NC Sets to reduce memory
usage for mining EIs. Although the use of NC Sets gives
MERIT some advantages over VME, there are still some
disadvantages:

1. The weight value of each node code (NC) is stored individually even though it can appear in many erasable
itemsets’ NC Sets, leading to a lot of duplication.
2. It uses a strategy whereby itemset X’s NC Set is
assumed to be a subset of itemset Y ’s NC Set if X ⊂
Y . This leads to high memory consumption and high
runtime when itemsets are combined to create new
nodes.
MEI [14] uses the dPidset structure to quickly determine the
information of erasable itemsets. Although mining time and
memory usage are better than those of the above algorithms,
MEI’s performance for mining erasable itemsets for very
dense datasets can be improved.

4 EIFDD algorithm
4.1 dPidset structure

Definition 3 (pidset of an itemset) The pidset of itemset X is defined as:

p(X) = \bigcup_{A \in X} p(A)    (3)

where:
- A is an item in itemset X;
- p(A) is the pidset of item A, i.e., the set of identifiers (PIDs) of the products that contain item A.

Definition 4 (revenue of an itemset based on pidset) The revenue of itemset X, denoted by g(X), is easily computed as follows:

g(X) = \sum_{P_i \in p(X)} Val(P_i)    (4)

Theorem 1 Let XA and XB be two itemsets with the same prefix X (X can be an empty set), and let p(XA) and p(XB) be the pidsets of XA and XB, respectively. The pidset of XAB, denoted by p(XAB), is computed as follows:

p(XAB) = p(XB) ∪ p(XA)    (5)

Example 3 For DBE, p(ab) = {1, 2, 4} and p(ac) = {1, 2, 3, 5}. According to Theorem 1, p(abc) = p(acb) = p(ab) ∪ p(ac) = {1, 2, 4} ∪ {1, 2, 3, 5} = {1, 2, 3, 4, 5}.
Definition 5 (dPidset of an itemset) The dPidset of itemset XAB, denoted by dP(XAB) and based on p(XA) and p(XB), is defined as follows:

dP(XAB) = p(XB) \ p(XA)    (6)

where p(XB) \ p(XA) is the set of product IDs that exist only in p(XB).



Example 4 We have p(ab) = {1, 2, 4} and p(ac) = {1, 2, 3, 5}. Based on Definition 5, dP(abc) = p(ac) \ p(ab) = {1, 2, 3, 5} \ {1, 2, 4} = {3, 5}. Note that reversing the order of ab and ac gives a different result: dP(acb) = p(ab) \ p(ac) = {4}.
Theorem 2 Let XA and XB be two itemsets, and let dP(XA) and dP(XB) be the dPidsets of XA and XB, respectively. The dPidset of XAB, denoted by dP(XAB), is computed as follows:

dP(XAB) = dP(XB) \ dP(XA)    (7)

Example 5 Based on DBE, p(a) = {1, 2}, p(b) = {1, 2, 4} and p(c) = {3, 5}. According to Definition 5, dP(ac) = p(c) \ p(a) = {3, 5} and dP(ab) = p(b) \ p(a) = {4}. Based on Theorem 2, dP(abc) = dP(ac) \ dP(ab) = {3, 5} \ {4} = {3, 5}. Example 4 also gives dP(abc) = {3, 5}; therefore, these examples verify Theorem 2.
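The following self-contained Python sketch (our own illustration; the dataset is DBE from Table 1 and all function names are ours) reproduces the pidset and dPidset calculations of Definitions 3–5 and Theorems 1–2 for the itemsets used in Examples 3–5.

```python
# DBE from Table 1: product id -> (items, revenue)
DBE = {1: ({"a", "b"}, 1000), 2: ({"a", "b", "e"}, 200), 3: ({"c", "e"}, 150),
       4: ({"b", "d", "e", "f"}, 50), 5: ({"c", "d", "e"}, 100),
       6: ({"d", "e", "f", "h"}, 200), 7: ({"d", "h"}, 150), 8: ({"d", "f", "h"}, 100)}

def pidset(itemset):
    """p(X): IDs of the products that contain at least one item of X (Definition 3)."""
    return {pid for pid, (items, _) in DBE.items() if itemset & items}

def revenue(pids):
    """g(X) computed from a pidset (Definition 4)."""
    return sum(DBE[pid][1] for pid in pids)

p_ab, p_ac = pidset({"a", "b"}), pidset({"a", "c"})
print(sorted(p_ab), sorted(p_ac))        # [1, 2, 4] and [1, 2, 3, 5], as in Example 3
print(sorted(p_ab | p_ac))               # p(abc) = [1, 2, 3, 4, 5] (Theorem 1)

dP_ab = pidset({"b"}) - pidset({"a"})    # dP(ab) = {4}      (Definition 5)
dP_ac = pidset({"c"}) - pidset({"a"})    # dP(ac) = {3, 5}
dP_abc = dP_ac - dP_ab                   # Theorem 2: dP(XAB) = dP(XB) \ dP(XA)
print(sorted(dP_abc))                    # [3, 5], as in Examples 4 and 5

g_ab = revenue(p_ab)                     # g(ab) = 1250
print(g_ab + revenue(dP_abc))            # g(abc) = 1500 (anticipating Theorem 3 below)
```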
Theorem 3 The revenue of XAB, denoted by g(XAB), is determined based on that of XA as follows:

g(XAB) = g(XA) + \sum_{P_i \in dP(XAB)} Val(P_i)    (8)

where g(XA) is the revenue of XA and Val(Pi) is the revenue of Pi.

Example 6 We have p(a) = {1, 2}, p(b) = {1, 2, 4} and p(c) = {3, 5}, and thus g(a) = 1200, g(b) = 1250, and g(c) = 250. According to Example 5, dP(ac) = {3, 5} and dP(ab) = {4}, and thus g(ac) = g(a) + \sum_{P_i \in dP(ac)} Val(P_i) = 1200 + 150 + 100 = 1450 and g(ab) = g(a) + \sum_{P_i \in dP(ab)} Val(P_i) = 1200 + 50 = 1250. Since dP(abc) = {3, 5}, g(abc) = g(ab) + \sum_{P_i \in dP(abc)} Val(P_i) = 1250 + 150 + 100 = 1500.

4.2 Subsume concept

Definition 6 [11] The subsume index of an erasable 1-itemset A, denoted by Subsume(A), is defined as follows:

Subsume(A) = {B ∈ I1 | p(B) ⊆ p(A)}    (9)

where I1 is the set of 1-itemsets.

We have p(a) = {1, 2} and p(b) = {1, 2, 4}. Because p(a) ⊆ p(b), a ∈ Subsume(b).

Theorem 4 Given the subsume index of an item A, Subsume(A) = {a1, a2, ..., am}, the revenue of each of the 2^m − 1 itemsets obtained by combining A with a nonempty subset of {a1, a2, ..., am} is equal to the revenue of A.

According to DBE, we have Subsume(d) = {f, h}. Therefore the 2^m − 1 nonempty subsets of Subsume(d) are {f, h, fh}. Based on Theorem 4, the revenue of the 2^m − 1 itemsets formed by combining these subsets with d is equal to g(d). In this case, the revenue of each of {df, dh, dfh} is 600 dollars.

Theorem 5 Let A, B, and C ∈ I1 be three items. If A ∈ Subsume(B) and B ∈ Subsume(C), then A ∈ Subsume(C).

Proof We have A ∈ Subsume(B) and B ∈ Subsume(C), and hence p(A) ⊆ p(B) and p(B) ⊆ p(C). Therefore p(A) ⊆ p(C), and thus the theorem is proven.

4.3 Algorithm for finding the subsume index associated with E1

Using the definition of the subsume index associated with erasable 1-itemsets (E1) based on pidsets (Definition 6), this paper proposes Algorithm 1 for finding this index. After determining the pidsets associated with E1 and sorting E1 in descending order of pidset length, the algorithm uses two loops to find the subsume index associated with E1, as shown in Fig. 1.

Fig. 1 Algorithm for finding subsume index associated with E1
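Algorithm 1 itself is only given as Fig. 1, which is not reproduced in this extract. The sketch below is one possible Python realization of the two-loop procedure just described (our reconstruction under the assumptions of Definition 6, not the authors' code); applied to the erasable 1-itemsets of Fig. 3 it reproduces Table 3.

```python
def find_subsume_index(e1_pidsets):
    """Sketch of Algorithm 1: e1_pidsets maps each erasable 1-itemset to its pidset.
    Returns Subsume(A) = {B | p(B) is a subset of p(A)} for every A in E1."""
    # Sort E1 in descending order of pidset length.
    order = sorted(e1_pidsets, key=lambda a: len(e1_pidsets[a]), reverse=True)
    subsume = {a: [] for a in order}
    # Two nested loops over the sorted order: a subset can only be at most as long,
    # so it suffices to scan the items after A (equal pidsets are credited to the
    # item that comes first in the order).
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            if e1_pidsets[b] <= e1_pidsets[a]:   # p(B) ⊆ p(A)
                subsume[a].append(b)
    return subsume

# Erasable 1-itemsets and their pidsets for DBE with ξ = 40 % (Fig. 3)
E1 = {"e": {2, 3, 4, 5, 6}, "d": {4, 5, 6, 7, 8},
      "h": {6, 7, 8}, "f": {4, 6, 8}, "c": {3, 5}}
print(find_subsume_index(E1))
# {'e': ['c'], 'd': ['h', 'f'], 'h': [], 'f': [], 'c': []}, matching Table 3
```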



4.4 EIFDD algorithm

Fig. 2 EIFDD algorithm

The EIFDD algorithm is shown in Fig. 2. Firstly, the algorithm scans the dataset to determine E1 with their pidsets, and then sorts E1 in descending order of pidset length. Secondly, the algorithm calls Algorithm 1 to generate the subsume index associated with E1. Thirdly, the algorithm puts all EIs in E1 into the results. Finally, the algorithm calls the Expand_E procedure, which uses a divide-and-conquer strategy and the subsume index associated with E1 to mine all erasable itemsets. The process is described below.
For erasable 1-itemsets, the algorithm combines the first element, A, with the remaining erasable 1-itemsets to create erasable 2-itemsets. If A has a number of subsume values, all itemsets that combine A with one of the 2^m − 1 nonempty subsets generated from Subsume(A) are considered erasable itemsets without calculating their pidsets and revenues. The remaining elements (∉ Subsume(A)) are combined with A to create 2-itemsets. For those whose revenues do not exceed T × ξ, the algorithm (i) adds them, together with the erasable itemsets obtained by combining them with the 2^m − 1 nonempty subsets generated from Subsume(A), to the results, and (ii) adds them to Enext. The procedure is then called recursively with the erasable 2-itemsets as its parameter to create erasable 3-itemsets, and so on until no more EIs are created. The algorithm applies this strategy to all elements of E1 until all itemsets that can be created from the n elements of the erasable 1-itemsets have been considered.
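The EIFDD pseudocode itself appears only in Fig. 2, which is not reproduced in this extract. The following self-contained Python sketch is our reconstruction of the flow just described, not the authors' implementation: it builds E1 and the subsume index, then expands prefixes depth-first, copying revenues for subsume combinations instead of computing their pidsets. For brevity it recomputes pidsets by union rather than maintaining dPidsets, which is enough for the running example; with ξ = 40 % on DBE it should reproduce the 17 erasable itemsets of Table 2.

```python
from itertools import chain, combinations

# Product dataset DBE from Table 1: product id -> (items, revenue)
DBE = {1: ({"a", "b"}, 1000), 2: ({"a", "b", "e"}, 200), 3: ({"c", "e"}, 150),
       4: ({"b", "d", "e", "f"}, 50), 5: ({"c", "d", "e"}, 100),
       6: ({"d", "e", "f", "h"}, 200), 7: ({"d", "h"}, 150), 8: ({"d", "f", "h"}, 100)}

def pidset(item):
    """p(A): IDs of the products that contain item A."""
    return frozenset(pid for pid, (items, _) in DBE.items() if item in items)

def revenue(pids):
    """Revenue of a pidset (Definition 4)."""
    return sum(DBE[pid][1] for pid in pids)

def nonempty_subsets(elems):
    return chain.from_iterable(combinations(elems, k) for k in range(1, len(elems) + 1))

def eifdd(threshold):
    """Sketch of EIFDD: returns {erasable itemset: revenue}.  Only the E1-level
    subsume index is used, as in the description above; Theorem 3 shows how the
    revenue would be updated incrementally (via dPidsets) in the real algorithm."""
    bound = sum(val for _, val in DBE.values()) * threshold
    items = set().union(*(prod_items for prod_items, _ in DBE.values()))

    # Erasable 1-itemsets (E1) with their pidsets, sorted by descending pidset length
    e1 = {a: pidset(a) for a in items if revenue(pidset(a)) <= bound}
    order = sorted(e1, key=lambda a: len(e1[a]), reverse=True)

    # Subsume index (Algorithm 1): Subsume(A) = {B after A | p(B) ⊆ p(A)}
    subsume = {a: [b for b in order[i + 1:] if e1[b] <= e1[a]]
               for i, a in enumerate(order)}

    results = {}

    def expand(itemset, pids, gain, candidates, skipped):
        results[frozenset(itemset)] = gain
        # Subsume shortcut: adding any nonempty subset of Subsume(A) cannot change
        # the pidset, so the revenue is copied without any computation (Theorem 4).
        for sub in nonempty_subsets(skipped):
            results[frozenset(itemset) | frozenset(sub)] = gain
        for i, b in enumerate(candidates):
            new_pids = pids | e1[b]
            new_gain = revenue(new_pids)
            if new_gain <= bound:                    # anti-monotone pruning
                expand(itemset | {b}, new_pids, new_gain, candidates[i + 1:], skipped)

    for i, a in enumerate(order):
        later = [b for b in order[i + 1:] if b not in subsume[a]]
        expand({a}, e1[a], revenue(e1[a]), later, subsume[a])
    return results

found = eifdd(0.40)
print(len(found))                                    # 17, matching Table 2 and Fig. 7
print(sorted("".join(sorted(x)) for x in found))
```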

Fig. 3 Erasable 1-itemsets and their pidsets for DBE with ξ = 40 %: e {2, 3, 4, 5, 6} (700), d {4, 5, 6, 7, 8} (600), h {6, 7, 8} (450), f {4, 6, 8} (350), c {3, 5} (250)


Table 3 Subsume index associated with erasable 1-itemsets

Erasable 1-itemset   Subsume
e                    c
d                    f, h
f                    –
h                    –
c                    –

4.5 An illustrative example

Firstly, EIFDD scans DBE to find the erasable 1-itemsets and their pidsets (line 1 of the EIFDD algorithm) with ξ = 40 %; the result is shown in Fig. 3.
Secondly, the algorithm finds the subsume index associated with the erasable 1-itemsets (Table 3) on line 2 of the EIFDD algorithm, which calls Algorithm 1 (Fig. 1).
Thirdly, the algorithm calls the Expand_E procedure to mine all erasable itemsets. The following four steps are used:
1. Step 1: e is considered with the remaining erasable 1-itemsets. Because Subsume(e) = {c}, ec is added to the results without determining its information. ed, ef, and eh are disqualified because their revenues exceed the threshold. Figure 4 shows the results of this step.
2. Step 2: d is considered with the remaining erasable 1-itemsets (h, f, and c). h and f belong to Subsume(d); therefore, the algorithm adds df, dh, and dfh to the results without determining their information. d is combined with c to create dc with g(dc) = 750 dollars. dc and the erasable itemsets {dch, dcf, dchf} created from Subsume(d) and dc are added to the results (Fig. 5).
3. Step 3: h is considered with the remaining erasable 1-itemsets (f and c) to create hf and hc with g(hf) = 500 dollars and g(hc) = 700 dollars. hf and hc are added to the results. They are used to create hfc with g(hfc) = 750 dollars, which is added to the results (Fig. 6).
4. Step 4: f is considered with c to create fc with g(fc) = 600 dollars. fc is added to the results. Figure 7 shows all erasable itemsets for DBE with ξ = 40 %.
As shown in Fig. 7, the algorithm does not compute and store the pidsets of the nodes {df, dh, dfh, dcf, dch, dchf, ec}. Therefore, using the subsume concept reduces the runtime and memory usage of erasable itemset mining.

4.6 Complexity analysis

Let m be the number of items in the product dataset. The search space for the enumeration of all erasable itemsets is 2^m − 1. However, based on the anti-monotone property (if X is not erasable and Y is a superset of X, then Y cannot be erasable), the overall complexity of algorithms such as MEI is O(r × n × 2^l), where n is the number of products, l is the length of the longest erasable itemset, and r is the maximum number of erasable itemsets.
The EIFDD algorithm reduces the search space for erasable itemset mining using the subsume concept. The complexity of determining the subsume index associated with the erasable 1-itemsets is O(m^2). In the best case, all erasable 1-itemsets that come after an erasable itemset X are in the subsume index of X; the complexity of finding all erasable itemsets is then O(m), and the overall complexity is O(m^2). In the worst case, none of the erasable 1-itemsets that come after an erasable itemset X are in the subsume index of X; the complexity of finding all erasable itemsets is then O(r × n × 2^l), and the overall complexity is O(r × n × 2^l + m^2). Fortunately, for very dense datasets, there are many elements in the subsume indexes associated with the erasable 1-itemsets (see Section 5.1), so the best case is the most likely.

5 Performance studies

This section reports the performance of the proposed EIFDD algorithm, MEI, and MERIT on a number of dense datasets. All studies in this section were performed on a computer with an Intel Core i3-3110M 2.4-GHz CPU and 4 GB of RAM. All programs were implemented in Visual Studio (C#) with the .NET Framework (version 4.5.50709).
The experiments were conducted on the Chess, Connect, and Mushroom datasets.1 In order to make these datasets look like the product datasets introduced in Section 2, a column was added to each dataset to store the product revenues. To generate revenue values, a function denoted by N(100, 50), where the mean value is 100 and the variance is 50, was created. The features of these datasets are shown in Table 4.

1 Downloaded from />
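As a minimal illustration of this preprocessing step (our own sketch; the paper gives no code, the input format is assumed to be one space-separated transaction per line, and it is not stated whether 50 is the variance or the standard deviation), a revenue column could be appended as follows:

```python
import random

def add_revenue_column(in_path, out_path, mean=100.0, variance=50.0, seed=1):
    """Append a revenue value drawn from N(mean, variance) to each transaction.

    The paper states mean 100 and variance 50; whether 'variance' actually meant
    the standard deviation is not specified, so this is only an assumption.
    """
    rng = random.Random(seed)
    std = variance ** 0.5
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            items = line.split()              # one transaction (product) per line
            if not items:
                continue
            revenue = max(1.0, rng.gauss(mean, std))   # keep revenues positive
            fout.write(" ".join(items) + " " + f"{revenue:.2f}" + "\n")

# Example (hypothetical file names):
# add_revenue_column("chess.dat", "chess_with_revenue.dat")
```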

Fig. 4 Erasable itemsets for DBE with ξ = 40 % (step 1): the erasable 1-itemsets of Fig. 3 plus the new node ec (700)


Fig. 5 Erasable itemsets for DBE with ξ = 40 % (step 2): additionally adds dh, df, dhf (600 each), dc {3} (750), and dch, dcf, dchf (750 each)

Fig. 6 Erasable itemsets for DBE with ξ = 40 % (step 3): additionally adds hc {3, 5} (700), hf {4} (500), and hfc {3, 5} (750)

Fig. 7 Erasable itemsets for DBE with ξ = 40 % (step 4): additionally adds fc {3, 5} (600), completing the 17 erasable itemsets of Table 2


Table 4 Features of datasets used in experiments

Dataset     Type         # of products   # of items
Chess       Very Dense   3,196           76
Mushroom    Normal       8,124           120
Connect     Dense        67,557          130

These datasets are available at />

Table 5 Numbers of subsume values associated with erasable 1-itemsets

Dataset     Threshold (%)   # of erasable 1-itemsets   # of subsume values
Chess       38              33                         13
Chess       43              36                         20
Chess       48              38                         23
Chess       53              39                         24
Chess       58              41                         32
Mushroom    1.2             28                         18
Mushroom    1.7             29                         18
Mushroom    2.2             31                         18
Mushroom    2.7             37                         42
Mushroom    3.2             38                         42
Connect     2.3             33                         6
Connect     2.8             34                         6
Connect     3.3             35                         6
Connect     3.8             37                         7
Connect     4.3             39                         7

5.1 Compactness of subsume index

Table 5 shows the number of subsume values associated with the erasable 1-itemsets and the number of erasable 1-itemsets. Note that an erasable 1-itemset can have one or more subsume values; therefore, the number of subsume values associated with the erasable 1-itemsets can be greater than the number of erasable 1-itemsets. The EIFDD algorithm is more effective than the MEI algorithm for datasets with a large number of subsume values associated with erasable 1-itemsets.
Figures 8, 9 and 10 show the number of nodes with subsume values and the total number of nodes for various thresholds for the Chess, Mushroom, and Connect datasets. For nodes with subsume values, the pidsets and revenues do not need to be determined; the larger this number, the more the mining time and memory usage are reduced.

Fig. 8 Number of nodes with subsume values and total number of nodes of EIFDD algorithm's results for Chess dataset

Fig. 9 Number of nodes with subsume values and total number of nodes of EIFDD algorithm's results for Mushroom dataset

Fig. 10 Number of nodes with subsume values and total number of nodes of EIFDD algorithm's results for Connect dataset

5.2 Memory usage

This section compares the memory usage of MEI [14] and EIFDD on the three experimental datasets. For the Chess dataset, the memory usage of EIFDD is less than that of MEI for all thresholds (from 38 % to 58 %), as shown in Fig. 11. For the other datasets (Figs. 12 and 13), MEI cannot run with large thresholds (3.2 % for the Mushroom dataset and 4.3 % for the Connect dataset) due to memory limitations. The EIFDD algorithm is thus more efficient than MEI in terms of memory usage.

Fig. 11 Memory usage of EIFDD and MEI algorithms for Chess dataset

Fig. 12 Memory usage of EIFDD and MEI algorithms for Mushroom dataset

Fig. 13 Memory usage of EIFDD and MEI algorithms for Connect dataset

5.3 Mining time

This section compares the mining times of MEI [14] and EIFDD on the experimental datasets. Figures 14, 15 and 16 show that the mining time of EIFDD is much smaller than that of MEI. As noted above, MEI cannot run for some thresholds (3.2 % for Mushroom and 4.3 % for Connect) due to memory limitations.

Fig. 14 Mining time of EIFDD and MEI algorithms for Chess dataset

Fig. 15 Mining time of EIFDD and MEI algorithms for Mushroom dataset

Fig. 16 Mining time of EIFDD and MEI algorithms for Connect dataset

6 Conclusion and future work

This paper proposed a method for mining erasable itemsets from very dense datasets. Firstly, the subsume concept is used to determine the information of a large number of erasable itemsets early, without the usual computational cost. Then, a fast procedure for finding the subsume index associated with erasable 1-itemsets based on the dPidset structure is used. The EIFDD algorithm was proposed based on these concepts, and an example was presented to demonstrate it. The performance studies show that EIFDD outperforms the MEI algorithm in mining erasable itemsets from very dense datasets in terms of mining time and memory usage.
In future work, a number of issues will be considered, including mining erasable closed/maximal itemsets, mining erasable itemsets from data streams, and applications of erasable itemsets.
Acknowledgments This research is funded by Foundation for Science and Technology Development of Ton Duc Thang University
(FOSTECT), website: , under Grant FOSTECT.2015.BR.01.

References

1. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: VLDB'94
2. Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: SIGMOD'93
3. Calders T, Dexters N, Gillis JJM, Goethals B (2014) Mining frequent itemsets in a stream. Inf Syst 39:233–255
4. Czibula G, Marian Z, Czibula IG (2014) Software defect prediction using relational association rule mining. Inf Sci 264:260–278
5. Deng ZH (2013) Mining top-rank-k erasable itemsets by PID lists. Int J Intell Syst 28(4):366–379
6. Deng ZH, Xu XR (2012) Fast mining erasable itemsets using NC sets. Expert Syst Appl 39(4):4453–4463
7. Deng Z, Fang G, Wang Z, Xu X (2009) Mining erasable itemsets. In: ICMLC'09
8. Deng ZH, Xu XR (2010) An efficient algorithm for mining erasable itemsets. In: ADMA'10, pp 214–225
9. Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: SIGMOD'00, pp 1–12
10. Huynh-Thi-Le Q, Le T, Vo B, Le B (2015) An efficient and effective algorithm for mining top-rank-k frequent patterns. Expert Syst Appl 42(1):156–164
11. Pyun G, Yun U (2014) Mining top-k frequent patterns with combination reducing techniques. Appl Intell 41(1):76–98
12. Pyun G, Yun U, Ryu KH (2014) Efficient frequent pattern mining based on Linear Prefix tree. Knowl-Based Syst 55:125–139
13. Le T, Vo B, Coenen F (2013) An efficient algorithm for mining erasable itemsets using the difference of NC-Sets. In: IEEE SMC'13, pp 2270–2274
14. Le T, Vo B (2014) MEI: an efficient algorithm for mining erasable itemsets. Eng Appl Artif Intell 27:155–166
15. Le T, Vo B, Nguyen G (2014) A survey of erasable itemset mining algorithms. WIREs Data Min Knowl Disc 4(5):356–379
16. Li H, Zhang H, Zhu J, Cao H, Wang Y (2014) Efficient frequent itemset mining methods over time-sensitive streams. Knowl-Based Syst 56:281–298
17. Li Y, Wu J (2014) Interpretation of association rules in multi-tier structures. Int J Approx Reason 55(6):1439–1457
18. Liao VCC, Chen MS (2014) DFSP: a Depth-First SPelling algorithm for sequential pattern mining of biological sequences. Knowl Inf Syst 38(3):623–639
19. Lin KC, Liao I, Chang TP, Lin SF (2014) A frequent itemset mining algorithm based on the Principle of Inclusion-Exclusion and transaction mapping. Inf Sci 276:278–289
20. Nguyen G, Le T, Vo B, Le B (2014) A new approach for mining top-rank-k erasable itemsets. In: ACIIDS'14
21. Nori F, Deypir M, Sadreddini MH (2013) A sliding window based algorithm for frequent closed itemset mining over data streams. J Syst Softw 86(3):615–623
22. Sohrabi MK, Barforoush AA (2013) Parallel frequent itemset mining using systolic arrays. Knowl-Based Syst 37:462–471
23. Song W, Yang B, Xu Z (2008) Index-BitTableFI: An improved algorithm for mining frequent itemsets. Knowl-Based Syst 21:507–513
24. Versichele M, Groote L, Bouuaert MC, Neutens T, Moerman I, Weghe NV (2014) Pattern mining in tourist attraction visits through association rule learning on Bluetooth tracking data: A case study of Ghent, Belgium. Tour Manag 44:67–81
25. Vo B, Coenen F, Le T, Hong T-P (2013) A hybrid approach for mining frequent itemsets. In: IEEE SMC'13, pp 4647–4651
26. Vo B, Le T, Coenen F, Hong TP (2014) Mining frequent itemsets using the N-list and subsume concepts. Int J Mach Learn Cybern. doi:10.1007/s13042-014-0252-2
27. Vo B, Hong TP, Le B (2013) A lattice-based approach for mining most generalization association rules. Knowl-Based Syst 45:20–30
28. Zaki MJ (2000) Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12(3):372–390
29. Zaki MJ, Gouda K (2003) Fast vertical mining using diffsets. In: SIGKDD'03
30. Zhang B, Lin CW, Gan W, Hong TP (2014) Maintaining the discovered sequential patterns for sequence insertion in dynamic databases. Eng Appl Artif Intell 35:131–142


