
A NOVEL APPROACH FOR MINING EMERGING PATTERNS IN DATA
STREAMS
Hamad Alhammady
Etisalat University College - UAE


ABSTRACT
Streaming data mining is one of the most difficult tasks in
Knowledge Discovery in Databases (KDD). This task is
essential in many applications such as financial analysis, network monitoring, and marketing.
In this model, data arrives in multiple, continuous, rapid,
time-varying data streams. These characteristics make it
infeasible for traditional mining techniques to deal with
data streams. In this paper, we propose a new approach for mining emerging patterns (EPs) in streaming data.
EPs are those itemsets whose frequencies in one class
are significantly higher than their frequencies in the
other classes. We experimentally prove that our new
method for mining EPs has an excellent impact on the
process of classifying data streams.

1. INTRODUCTION
Many recent studies show that the major challenge in
streaming data is its unbounded size [1] [2] [3]. This
makes it infeasible to store the entire data on disk. There
are two important problems arising from this fact.
Firstly, multi-pass algorithms, which need the entire data
to be stored in conventional relations, cannot deal
directly with data streams. Secondly, obtaining exact answers from data streams is prohibitively expensive [4].


EPs are a kind of pattern introduced recently [5]. They have proved valuable in many applications [6][7][8][9][10]. EPs can capture
significant changes between datasets. They are defined
as itemsets whose supports increase significantly from
one class to another. The discriminating power of EPs
can be measured by their growth rates. The growth rate
of an EP is the ratio of its support in a certain class over
that in another class. Usually the discriminating power of
an EP is proportional to its growth rate.
For example, the Mushroom dataset, from the UCI
Machine Learning Repository [11], contains a large
number of EPs between the poisonous and the edible
mushroom classes. Table 1 shows two examples of these
EPs. These two EPs consist of 3 items. e1 is an EP from
the poisonous mushroom class to the edible mushroom
class. It never exists in the poisonous mushroom class,

1-4244-0779-6/07/$20.00 ©2007 IEEE

and exists in 63.9% of the instances in the edible
mushroom class; hence, its growth rate is ∞ (63.9 / 0).
It has a very high predictive power to contrast edible
mushrooms against poisonous mushrooms. On the other
hand, e2 is an EP from the edible mushroom class to the
poisonous mushroom class. It exists in 3.8% of the
instances in the edible mushroom class, and in 81.4% of
the instances in the poisonous mushroom class; hence, its
growth rate is 21.4 (81.4 / 3.8). It has a high predictive
power to contrast poisonous mushrooms against edible mushrooms.
Table 1. Examples of emerging patterns.

EP    Support in poisonous    Support in edible    Growth rate
      mushrooms               mushrooms
e1     0%                     63.9%                ∞
e2    81.4%                    3.8%                21.4

e1 = {(ODOR = none), (GILL_SIZE = broad), (RING_NUMBER = one)}
e2 = {(BRUISES = no), (GILL_SPACING = close), (VEIL_COLOR = white)}



Current approaches for mining EPs [12] [13] cannot
be used directly in data streams because they are based
on multi-pass algorithms. Work in [4] introduces a new
type of EPs, approximate EPs (AEPs). This type of EPs
enables current mining techniques to operate on
streaming data. AEPs and the AEP-tree method have shown good accuracy in classifying streaming data. In this paper, we propose another new type of EPs, matching EPs (MEPs). MEPs have two advantages over AEPs, in terms of mining complexity and classification accuracy (details are discussed later).
2. RELATED WORK
The data stream model differs from the conventional
stored relation model in a number of ways. The most
significant difference is that data streams are unbounded
in size. Most of the instances from a data stream have to
be discarded after processing. However, a certain
number of instances can be stored for future analysis.
This number is proportional to the available memory.
That is, the data stream model does not preclude the
presence of some data stored in conventional relations
[1].
The idea of storing some instances from a data stream
in conventional relations is fundamental to many
techniques used in the data stream model. These
techniques include sliding windows, sampling, and
synopsis data structures. They are the basic features of any Data Stream Management System (DSMS), such as STREAM [14].
Sliding windows [1] have a noticeable power to obtain
approximate answers to data stream queries. This
technique involves using a sliding window of recent data
from the data stream rather than operating over the entire
range of data. For example, if an unlabeled instance arrives from a data stream and needs to be assigned to one of the classes associated with that stream, only a certain number of recent instances (a window) is used to train the classifier.
[Figure 1. The sliding window technique: past data (discarded), recent data (the window), and future data.]
Figure 1 sketches the idea behind this technique. The
sliding window technique has the advantage of being
well-defined. In addition, it is a deterministic method
which avoids the problem of bad approximation caused
by random sampling. Most importantly, it emphasizes recent data, which is considered the most interesting data in many real-life applications [1].
However, the sliding window technique suffers from the
elimination of some important information contained in
old (discarded) data. That is, sliding windows do not
represent the whole range of knowledge contained in the
data, but rather a portion (proportional to the size of the
sliding window) of that knowledge. This problem may
affect the quality of approximation.
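To make this concrete, here is a minimal Python sketch (not from the paper; the window size and the train placeholder are assumptions) of maintaining and using a sliding window:

from collections import deque

WINDOW_SIZE = 1000                    # assumed capacity, chosen to fit memory
window = deque(maxlen=WINDOW_SIZE)    # a full deque discards its oldest item

def on_labeled_instance(instance):
    # Append each arriving labeled instance; old instances fall out
    # automatically, which is exactly the sliding-window behavior.
    window.append(instance)

def classify(unlabeled, train):
    # `train` stands for any base learner (e.g. a decision tree);
    # only the recent window is used as training data.
    model = train(list(window))
    return model.predict(unlabeled)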
Sampling [15] is another technique for approximation
in the data stream model. In this case, the streaming data
is randomly sampled to a certain number of instances.
This number is proportional to the available memory. In contrast with the sliding window technique, sampling has
the advantage of representing the whole range of old
data. The representation level is proportional to the
sampling rate. On the other hand, sampling may suffer
from problems caused by noisy instances being selected
during the random sampling process.
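For intuition, the classic reservoir sampling algorithm below keeps a uniform random sample of size k from a stream of unknown length in O(k) memory; it is sketched here for illustration only, as a simplification of the moving-window sampling of [15]:

import random

def reservoir_sample(stream, k):
    # Keep the first k instances, then replace a random slot with
    # decreasing probability so that every instance seen so far is
    # retained with probability k / n at all times.
    sample = []
    for i, instance in enumerate(stream):
        if i < k:
            sample.append(instance)
        else:
            j = random.randint(0, i)   # uniform over 0..i inclusive
            if j < k:
                sample[j] = instance
    return sample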
Synopsis data structures [1] aim at summarizing the
most important characteristics of the whole range of data.
These important characteristics play a key role in
classifying future unlabeled instances. Synopsis data
structures, like the sampling technique, have the
advantage of representing the whole range of old data.
Moreover, these structures avoid the problem caused by
noisy instances. The reason is that they store the
important characteristics rather than the data itself. EPs
can be thought of as synopsis data structures because
they represent the discriminating characteristics of the
data they are related to.
Approximate emerging patterns (AEPs) adopt
approximation to mine EPs from data streams [4].
Mining AEPs is based on mining EPs from blocks of
streaming data and merging the resulting EP sets to get a fixed number of AEPs. These special EPs are described
as approximate because they are not mined from the
complete range of data. The AEP tree is a type of decision tree for classifying streaming data. This tree uses AEPs rather than data instances to make decisions on the classes of unlabeled data.
3. EMERGING PATTERNS AND CLASSIFICATION
Let obj = {a1, a2, a3, ... an} be a data object following the
schema {A1, A2, A3, ..., An}. A1, A2, A3, ..., An are called
attributes, and a1, a2, a3, ... an are values related to these
attributes. We call each pair (attribute, value) an item.
Let I denote the set of all items in an encoding dataset
D. Itemsets are subsets of I. We say an instance Y
contains an itemset X, if X ⊆ Y.
Definition 1. Given a dataset D and an itemset X, the support of X in D, s_D(X), is defined as

    s_D(X) = count_D(X) / |D|        (1)

where count_D(X) is the number of instances in D containing X.
Definition 2. Given two different classes of datasets D1 and D2, let s_i(X) denote the support of the itemset X in the dataset D_i. The growth rate of an itemset X from D1 to D2, gr_{D1→D2}(X), is defined as

    gr_{D1→D2}(X) = 0,                if s1(X) = 0 and s2(X) = 0
                  = ∞,                if s1(X) = 0 and s2(X) ≠ 0        (2)
                  = s2(X) / s1(X),    otherwise

Definition 3. Given a growth rate threshold ρ > 1, an itemset X is said to be a ρ-emerging pattern (ρ-EP or simply EP) from D1 to D2 if gr_{D1→D2}(X) ≥ ρ.
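These definitions translate directly into code. The following Python sketch is illustrative only (it is not from the paper); it assumes each instance and itemset is represented as a frozenset of (attribute, value) items:

from math import inf

def support(itemset, dataset):
    # Definition 1: fraction of instances in `dataset` containing `itemset`.
    if not dataset:
        return 0.0
    return sum(1 for inst in dataset if itemset <= inst) / len(dataset)

def growth_rate(itemset, d1, d2):
    # Definition 2: growth rate of `itemset` from d1 to d2.
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0:
        return 0.0 if s2 == 0 else inf
    return s2 / s1

def is_ep(itemset, d1, d2, rho):
    # Definition 3: X is a rho-EP from d1 to d2 if its growth rate >= rho.
    return growth_rate(itemset, d1, d2) >= rho

For example, e1 from Table 1 would be written as frozenset({("ODOR", "none"), ("GILL_SIZE", "broad"), ("RING_NUMBER", "one")}), and is_ep(e1, poisonous, edible, rho) would return True for any finite threshold, since its growth rate is ∞.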

Let C = {c1, ..., ck} be a set of class labels. A training dataset is a set of data objects such that, for each object obj, there exists a class label c_obj ∈ C associated with it. A classifier is a function from attributes {A1, A2, A3, ..., An} to class labels {c1, ..., ck} that assigns class labels to unseen examples.
4. MINING MATCHING EMERGING PATTERNS
We adopt the streaming data model presented in [4].
This model is shown in Figure 2. Assume that the data stream consists of two classes, C1 and C2. Data is
received in blocks of size N, where N is decided
according to the memory available in the system. Bt,j is a
block of instances related to class j (C1 or C2) at time t.
After receiving and processing a number of data
blocks, we need information to classify future unlabeled instances in the data stream. This information can be expressed as EPs. However, mining EPs from a
dataset requires the availability of all instances in this
dataset. This is infeasible in data streams as data is
arriving continuously.
[Figure 2. Data stream model: for each class (Class 1 and Class 2), data arrives over time in blocks B1,j, B2,j, B3,j, B4,j, ... of size N.]
Our method is based on mining EPs from selected
blocks of data and matching these EPs with the future
data. As data is streaming in blocks of size N, blocks of
all classes related to period t (Bt,1 and Bt,2) are stored,
processed and then discarded. EPs for both classes are
mined from data blocks Bt,1 and Bt,2 before discarding
them. These EPs are EPt,1 (for C1) and EPt,2 (for C2). These EPs are moved to new sets of EPs called matching EPs (MEPs). That is, MEP1 represents the MEPs of class C1, and MEP2 represents the MEPs of class C2.
In the following stage, new data blocks arrive, Bt+1,1
and Bt+1,2. EPs are not mined from these new blocks.
Instead, current MEPs are matched with the new data
instances to check if they are still EPs for the new data.
If MEPs retain their EP characteristics (occurring in one class more than in the other), they remain in the set of MEPs; otherwise, they are eliminated from the set. For example, EPs in MEP1 are matched with the new data blocks. If they exist in Bt+1,1 more frequently than in Bt+1,2, they are still valid EPs and remain in MEP1; otherwise, they are eliminated. The same applies to MEP2.
The above process continues for future blocks of data until the number of EPs in MEP1 (or MEP2) is reduced to a predefined number, α, by the elimination process. In this case, EPs are mined from the new blocks of data, and the best EPs are chosen to refill the MEP1 (or MEP2) set. For example, suppose that the number of EPs in MEP1 has been reduced to α at time t+8; then EPs are mined from block Bt+9,1 to create its set of EPs, EPt+9,1. The strongest EPs in EPt+9,1 (the strength of an EP is proportional to its support and growth rate) are used to refill MEP1. Algorithm 1 explains the idea of mining MEPs.
We call the emerging patterns resulting from the above algorithm matching emerging patterns (MEPs). These EPs are mined from selected blocks of data rather than the complete range of data. In spite of that, our approach guarantees that these MEPs are inherited from all the old discarded data. That is, for each class, we have to store only its limited set of MEPs rather than its growing number of data instances.
MEPs are motivated by the following points:

1. If an old EP no longer exists in the future blocks of data, it is eliminated.
2. If an EP still exists in the future data blocks, it remains in the MEP set.
Algorithm 1. Mining MEPs from streaming data

MEP1 = ∅, MEP2 = ∅, t = 0
As data is streaming do
    t = t + 1
    If the number of EPs in MEP1 < α
        EPt,1 = EPs mined from Bt,1
        MEP1 = MEP1 ∪ strongest EPs in EPt,1
    Else
        For each EP e in MEP1
            Match e with Bt,1 and Bt,2
            If e is no longer an EP
                Remove e from MEP1
            End if
        End for
    End if
    If the number of EPs in MEP2 < α
        EPt,2 = EPs mined from Bt,2
        MEP2 = MEP2 ∪ strongest EPs in EPt,2
    Else
        For each EP e in MEP2
            Match e with Bt,2 and Bt,1
            If e is no longer an EP
                Remove e from MEP2
            End if
        End for
    End if
End do
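A hedged Python rendering of Algorithm 1 is sketched below, reusing support() and growth_rate() from the Section 3 sketch. The helpers mine_eps (any off-the-shelf EP miner applied to one pair of blocks) and strength (ranking EPs by support and growth rate) are assumptions, since the paper does not specify them:

def strength(ep, own_block, other_block):
    # Assumed ranking: proportional to support and growth rate.
    s = support(ep, own_block)
    gr = growth_rate(ep, other_block, own_block)
    return s * (gr if gr != float("inf") else 1e9)

def mine_meps(block_pairs, mine_eps, alpha, rho):
    # block_pairs: iterable of (Bt,1, Bt,2) block pairs arriving over time.
    # EPs are frozensets of (attribute, value) items.
    mep = {1: set(), 2: set()}
    for b1, b2 in block_pairs:
        for cls, own, other in ((1, b1, b2), (2, b2, b1)):
            if len(mep[cls]) < alpha:
                # Refill step: mine fresh EPs from the current blocks
                # and keep the strongest ones.
                fresh = sorted(mine_eps(own, other, rho),
                               key=lambda e: strength(e, own, other),
                               reverse=True)
                mep[cls] |= set(fresh[:alpha])
            else:
                # Match step: keep only itemsets that are still EPs
                # for their class on the new blocks.
                mep[cls] = {e for e in mep[cls]
                            if growth_rate(e, other, own) >= rho}
    return mep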

The above two points preserve the emphasis on recent data, which is the main advantage of the sliding window technique. Moreover, they ensure that the MEPs are related to all the previous data, which is the advantage of the sampling technique. Furthermore, MEPs avoid the cost of mining EPs from every block of streaming data, as in the AEP method, by applying a selective approach to choose certain blocks of data. This ensures that the mining process is conducted only when necessary. Our approach ensures that at each period of time we have limited sets of MEPs that best represent all the discarded data. These sets can be used at any time to classify unlabeled data instances using the AEP-tree proposed in [4].
5. EXPERIMENTAL EVALUATION
In this section, we apply five techniques to the data streaming model described in Section 4. These techniques are the AEP-tree using MEPs, the AEP-tree using AEPs, random sampling, the sliding window, and a traditional classifier (the C4.5 decision tree). Besides being applied alone, the C4.5 decision tree serves as the base classifier for the random sampling and sliding window techniques.
The testing method is 10-fold cross-validation, adapted to agree with the data stream model adopted in our experiments. The data is divided into ten folds. For each round of the 10-fold cross-validation, one fold is used for testing and the other nine folds are used for training. The training folds act as the blocks of data explained in the data stream model.
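A minimal sketch of this adapted protocol follows (the run_stream_classifier callback, which consumes the training folds as stream blocks and returns an accuracy, is an assumption, not the paper's code):

import random

def stream_cross_validation(instances, run_stream_classifier, k=10):
    # Shuffle once and split the indices into k folds.
    idx = list(range(len(instances)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for t in range(k):
        test = [instances[i] for i in folds[t]]
        # Each of the remaining folds plays the role of one block of
        # streaming data, fed to the classifier in sequence.
        blocks = [[instances[i] for i in folds[f]]
                  for f in range(k) if f != t]
        accuracies.append(run_stream_classifier(blocks, test))
    return sum(accuracies) / k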
Table 2 shows the performance of these techniques on 10 real-life datasets from the UCI repository [11]. The last row in the table indicates the average
accuracy of each technique. The results show that our proposed method, the AEP-tree using MEPs, has the highest average accuracy. It outperforms the sliding window and
sampling techniques on all datasets. It also outperforms
the AEP-tree using AEPs on 8 datasets. This indicates
that our proposed method for mining MEPs is capable of
gaining accurate knowledge from streaming data.
Table 2. Experimental results

Dataset       C4.5*   Sliding   Sampling   AEP tree   AEP tree
                      window               (AEPs)     (MEPs)
Breast        94.6    74.1      70.4       83.4       81.7
Cleve         73.8    59.4      55.8       65.1       69.6
Diabetes      73.4    59.8      55.2       62.9       67.3
Heart         80.6    61.7      59.4       71.5       75.1
Labor         76.9    52.3      50.7       63.3       69.9
CC            85.3    70.2      67.6       77.6       75.1
Hayes-roth    70.2    60.3      55.2       66.5       68.1
Hepatitis     81.8    70.4      67.3       75.1       78.1
Horse         85.2    72.6      70.8       81.4       82.2
Segment       93.5    81.9      81.4       87.8       88.8
Average       81.53   66.27     63.38      73.46      75.59

* C4.5 with complete knowledge

6. CONCLUSIONS
In this paper, we study the mining of emerging patterns in data streams by introducing a special type of emerging patterns, matching emerging patterns (MEPs). This type of EPs can be easily mined from data streams by applying a selective approach to conduct the mining process. Our experiments show that MEPs are capable of gaining important information from streaming data. This information increases the accuracy of classification. Our research opens a wide avenue for the applications of emerging patterns (EPs). EPs can now be used in different data stream applications. Our future work will focus on designing new techniques for mining EPs from data streams. These techniques might be useful for mining other types of patterns which are currently infeasible in the data stream model.
REFERENCES
[1] B. Babcock, S. Babu, M. Datar, R. Motwani, and J.
Widom. Models and Issues in Data Stream Systems. In
Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS’02), Madison, Wisconsin,
USA.
[2] G. Dong, J. Han, L.V.S. Lakshmanan, J. Pei, H.
Wang and P.S. Yu. Online Mining of Changes from
Data Streams: Research Problems and Preliminary
Results. In Proceedings of the 2003 ACM SIGMOD
Workshop on Management and Processing of Data
Streams, San Diego, CA, USA.

[3] M. Garofalakis, J. Gehrke, and R. Rastogi. Querying
and Mining Data Streams: You Only Get One Look. In
Proceedings of the 28th International Conference on Very Large Data Bases (VLDB'02), Hong Kong, China.
[4] H. Alhammady, and K. Ramamohanarao. Mining Emerging Patterns and Classification in Data Streams. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI'05), Compiegne, France, pp. 272-275.
[5] G. Dong, and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences. In Proceedings of the 1999 International Conference on Knowledge Discovery and Data Mining (KDD'99), San Diego, CA, USA.
[6] H. Alhammady, and K. Ramamohanarao. The
Application of Emerging Patterns for Improving the
Quality of Rare-class Classification. In Proceedings of
the 2004 Pacific-Asia Conference on Knowledge
Discovery and Data Mining (PAKDD'04), Sydney,
Australia.
[7] H. Alhammady, and K. Ramamohanarao. Using
Emerging Patterns and Decision Trees in Rare-class
Classification. In Proceedings of the 2004 IEEE
International Conference on Data Mining (ICDM'04),
Brighton, UK.
[8] H. Alhammady, and K. Ramamohanarao. Expanding
the Training Data Space Using Emerging Patterns and
Genetic Methods. In Proceeding of the 2005 SIAM
International Conference on Data Mining (SDM’05),
New Port Beach, CA, USA.
[9] H. Fan, and K. Ramamohanarao. A Bayesian
Approach to Use Emerging Patterns for Classification. In Proceedings of the 14th Australasian Database Conference (ADC'03), Adelaide, Australia.
[10] G. Dong, X. Zhang, L. Wong, and J. Li.
CAEP: Classification by Aggregating Emerging Patterns.
In Proceedings of the 2nd International Conference on
Discovery Science (DS'99), Tokyo, Japan.
[11] C. Blake, E. Keogh, and C. J. Merz. UCI repository
of machine learning databases. Department of
Information and Computer Science, University of
California at Irvine, CA, 1999.
[12] H. Fan, and K. Ramamohanarao. An Efficient
Single-Scan Algorithm For Mining Essential Jumping
Emerging Patterns for Classification. In Proceedings of
the 2002 Pacific-Asia Conference on Knowledge
Discovery and Data Mining, Taipei, Taiwan.
[13] H. Fan, and K. Ramamohanarao. Efficiently Mining
Interesting Emerging Patterns. In Proceedings of the 4th
International Conference on Web-Age Information
Management (WAIM’03), Chengdu, China.
[14] Stanford Stream Data Management (STREAM) Project.
[15] B. Babcock, M. Datar, and R. Motwani. Sampling
From a Moving Window Over Streaming Data. In
Proceedings of the 2002 Annual ACM-SIAM
Symposium On Discrete Algorithms, San Francisco, CA,
USA.


