Data Mining and Knowledge Discovery Handbook, 2nd Edition (part 80)

770 Mohamed Medhat Gaber, Arkady Zaslavsky, and Shonali Krishnaswamy
Fig. 39.5. ANNCAD Framework
tion sequence. FOCUS framework uses the difference between data mining models
as the deviation in data sets.
Ferrer-Troyano et al. (Ferrer-Troyano et al., 2004) have proposed a scalable
classification algorithm for numerical data streams, called Scalable Classification
Algorithm by Learning decisiOn Patterns (SCALLOP). The algorithm starts by reading
a user-specified number of labeled records and creates a number of rules for each
class from these records. Each record read after these rules have been created falls
into one of three cases:
a) Positive covering: the new record strengthens a currently discovered rule.
b) Possible expansion: the new record is associated with at least one rule but is not
covered by any discovered rule.
c) Negative covering: the new record weakens a currently discovered rule.
A different procedure is used for each of these cases:
a) Positive covering: the positive support and confidence of the existing rule are
updated.
b) Possible expansion: the rule is extended if it satisfies two conditions:
1. The expanded rule stays within user-specified growth bounds, to avoid a possibly
wrong expansion of the rule.
2. The expanded rule does not intersect any already discovered rule associated with
the same class label.
c) Negative covering: the negative support and confidence of the rule are updated.
If the confidence falls below a user-specified minimum threshold, a new rule is added.
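The three cases above can be illustrated with a minimal sketch. It assumes that each rule is an axis-aligned hyperrectangle with a class label, and that a record is "associated with" a rule when the rule has the same label and could grow to include the record without any side exceeding a maximum width. These assumptions, and the names Rule, case_for and max_width, are illustrative only and are not taken from the SCALLOP paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Rule:
    """Hypothetical rule: an axis-aligned hyperrectangle with a class label."""
    lows: List[float]
    highs: List[float]
    label: str

    def covers(self, x: Sequence[float]) -> bool:
        # The record lies inside the rule on every dimension.
        return all(lo <= v <= hi for lo, v, hi in zip(self.lows, x, self.highs))

    def grown_to(self, x: Sequence[float]):
        # Smallest hyperrectangle containing both the rule and the record.
        lows = [min(lo, v) for lo, v in zip(self.lows, x)]
        highs = [max(hi, v) for hi, v in zip(self.highs, x)]
        return lows, highs


def case_for(x, label, rules, max_width):
    """Return which of the three cases a labeled record falls into."""
    for rule in rules:
        if rule.covers(x):
            # Same label strengthens the rule, a different label weakens it.
            return "positive covering" if rule.label == label else "negative covering"
    for rule in rules:
        if rule.label == label:
            lows, highs = rule.grown_to(x)
            # Assumed growth bound: no side of the grown rule exceeds max_width.
            if all(hi - lo <= max_width for lo, hi in zip(lows, highs)):
                return "possible expansion"
    return "no match"


rules = [Rule([0.0, 0.0], [1.0, 1.0], "A"), Rule([5.0, 5.0], [6.0, 6.0], "B")]
print(case_for([0.5, 0.5], "A", rules, max_width=2.0))
print(case_for([1.5, 0.5], "A", rules, max_width=2.0))
```

The sketch only identifies the case; updating support and confidence and checking for intersections with other rules would follow the procedures listed above.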
After a user-defined number of records has been read, a rule-refining process takes
place. First, rules of the same class that lie within a user-defined acceptable
distance are merged, on the condition that the merged rule does not intersect with rules associated
39 Data Stream Mining 771
with other class labels. The resulting hypercube should also be within the growth
bounds of the rules. The second step of the refining stage releases uninteresting
rules from the current model: rules whose positive support is below the minimum are
released, as are rules that are not covered by at least one of the last user-defined
number of received records. Figure 39.6 illustrates the basic process of using
SCALLOP to build a data stream classifier.
Finally, a voting-based technique is used to classify unlabelled records. If a rule
covers the current record, the label associated with that rule is the classifier
output; otherwise, a vote over the current rules within the growth bounds is used to
infer the class label.
Fig. 39.6. Basic SCALLOP Process
Papadimitriou et al. (Papadimitriou et al., 2003) have proposed AWSOM (Arbitrary
Window Stream mOdeling Method) for discovering interesting patterns from sensor
data. They developed a one-pass algorithm that updates the patterns incrementally
and requires only O(log N) memory, where N is the length of the sequence. The method
uses wavelet coefficients as a compact representation of the information and for
detecting correlation structure, and then applies a linear regression model in the
wavelet domain. The system relies on this compact representation to address the
problem of high-speed streaming. Experiments with real and synthetic data sets show
that the method detects correlations efficiently.
Gaber et al. (Gaber et al., 2005) have developed Lightweight Classification
(LWClass). It is a variation of LWC and is also an AOG-based technique. The idea is
to use K-nearest neighbors while updating the frequency of class occurrence given the
data stream features. When the incoming stream contradicts the stored summary of the
cases, the frequency is reduced. When the frequency reaches zero, all the cases
represented by this class are released from memory.
39.4 Frequent Pattern Mining Techniques
Frequency counting is the process of identifying the most frequent items. It can
be used as a stand-alone technique to discover the heavy hitters (Cormode and
Muthukrishnan, 2003), or as a step towards finding association rules. The main idea
is to find data items whose probability is greater than or equal to a pre-specified
minimum threshold, known in the context of frequent items as the item support
(Dunham, 2003). The item support is calculated by dividing the number of times the
observed item appears by the total number of records.
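As a small worked example of the support calculation, the sketch below assumes that each record is a collection of items and that an item is counted at most once per record. The function names item_supports and frequent_items are illustrative.

```python
from collections import Counter


def item_supports(records):
    """Support of each item: records containing it divided by all records."""
    total = len(records)
    # set(record) ensures an item is counted once per record (an assumption).
    counts = Counter(item for record in records for item in set(record))
    return {item: count / total for item, count in counts.items()}


def frequent_items(records, min_support):
    """Items whose support is at least the minimum threshold."""
    supports = item_supports(records)
    return {item for item, s in supports.items() if s >= min_support}


records = [["a", "b"], ["a"], ["a", "c"], ["b"]]
print(item_supports(records))        # a: 3/4, b: 2/4, c: 1/4
print(frequent_items(records, 0.5))  # items appearing in at least half the records
```

This exact computation needs the whole dataset; the streaming algorithms discussed in this section approximate it with bounded memory.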
Giannella et al. (Giannella et al., 2003) have proposed and implemented a frequent
itemset mining algorithm over data streams. They use tilted windows to calculate
the frequent patterns for the most recent transactions, based on the observation
that users are more interested in recent streaming information than in older data.
They have developed an incremental algorithm to maintain the FP-stream, a tree data
structure that represents and discovers frequent itemsets in data streams. FP-stream
is based on the FP-tree, which was first introduced by Han et al. (Han et al., 2000)
as a graphical representation for discovering frequent itemsets. A number of
experiments were conducted to evaluate the efficiency of the algorithm. The results
show that, with limited memory, the algorithm can discover the frequent itemsets
with approximate support.
Manku and Motwani (Manku and Motwani, 2002) have proposed and implemented
approximate frequency counting over data streams. Their approach uses all the
previous historical data to calculate the frequent patterns incrementally. Two
algorithms have been introduced: sticky sampling and lossy counting. Although sticky
sampling should analytically perform better because it has a better worst-case
bound, the experimental studies have shown that lossy counting performs better in
practice. Sticky sampling uses sampling in which new records that match already
existing entries have a higher probability of being sampled. Lossy counting uses
the idea of group testing, using buckets to count the items within the same group
while maintaining only one counter.
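A minimal sketch of lossy counting, following the commonly published formulation rather than any particular implementation: the stream is divided into buckets of width ceil(1/epsilon), each tracked item keeps a count and a maximum possible undercount, and entries are pruned at every bucket boundary. The class name LossyCounter and its methods are illustrative.

```python
import math


class LossyCounter:
    """Approximate frequency counts with error at most epsilon * n."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # records seen so far
        self.entries = {}                    # item -> [count, max_undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # id of the current bucket
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # The item may have been seen and pruned in earlier buckets.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] > bucket}

    def frequent(self, support):
        """Items whose true frequency may reach support * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}


counter = LossyCounter(epsilon=0.2)
for item in ["a", "b", "a", "c", "a", "a", "d", "a", "e", "a"]:
    counter.add(item)
print(counter.frequent(support=0.5))
```

Memory stays small because rare items are dropped at each bucket boundary; the price is that reported counts may underestimate true counts by up to epsilon times the number of records seen.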
Cormode and Muthukrishnan (Cormode and Muthukrishnan, 2003) have developed an
algorithm for counting frequent items. The algorithm uses group testing to find the
hottest k items. It can process the turnstile data stream model, which allows
addition as well as deletion of data records. A randomized approximation algorithm
is used to discover the most frequent items; it can recall the frequent items with
a given item support and probability. It is worth mentioning that the turnstile data
stream model is the hardest to analyze; the time series and cash register models are
easier. The former allows neither increments nor decrements, and the latter allows
only increments.
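The difference between the cash register and turnstile models can be made concrete with a small sketch. It assumes that the stream is a sequence of (item, delta) updates to per-item counts; the function name apply_updates and the model strings are illustrative.

```python
def apply_updates(updates, model="turnstile"):
    """Apply (item, delta) updates under the given stream model."""
    if model not in ("cash register", "turnstile"):
        raise ValueError("unknown model: " + model)
    counts = {}
    for item, delta in updates:
        if model == "cash register" and delta < 0:
            # Deletions are only meaningful in the turnstile model.
            raise ValueError("the cash register model allows only increments")
        counts[item] = counts.get(item, 0) + delta
    return counts


# Insertions and one deletion: valid only in the turnstile model.
print(apply_updates([("a", 2), ("b", 1), ("a", -1)], model="turnstile"))
```

Deletions are what make the turnstile model harder: a counter that was discarded as infrequent cannot simply be ignored if later deletions change which items are frequent.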
Jin et al. (Jin et al., 2003) have proposed the hCount algorithm for discovering
frequent items in data streams. This algorithm also deals with the turnstile data
stream model, in which insertions into and deletions from the data are allowed. The
algorithm works dynamically with any range of data and does not need any prior
knowledge about the data. It is classified as an approximation technique: it keeps
the number of counters that can guarantee a minimum acceptable error, that is, the
number of counters for which the final approximate output analytically deviates by
no more than a user-given error threshold.
Gaber et al. (Gaber et al., 2005) have developed one more AOG-based algorithm:
Lightweight Frequency counting (LWF). It finds an approximate solution for the most
frequent items in the incoming stream through adaptation and by regularly releasing
the least frequent items in order to count the more frequent ones.
39.5 Time Series Analysis
Time series analysis is concerned with discovering patterns in attribute values that
vary over time. Three main functions are performed in time series mining: clustering
of similar time series, predicting future values in a time series, and classifying
the behavior of a time series (Dunham, 2003).
Indyk et al. (Indyk et al., 2000) have proposed approximate solutions with
probabilistic error bounds to two problems in time series analysis: relaxed periods
and average trends. The algorithms use sketching techniques for dimensionality
reduction. The process starts by computing sketches over an arbitrarily chosen time
window, which creates a so-called sketch pool. Sketching is the process of random
projection over a number of attributes. Relaxed periods and average trends are
computed from this pool of sketches. Relaxed periods are those periods in a time
series that are repeated over time; since exact repetition is rare, periods that are
similar under a distance function are accepted. The average trend is the mean value
of a subsequence of observations of a pre-specified length in a time series. The
algorithms have been shown experimentally to be efficient in running time and
accurate.
Perlman and Java (Perlman and Java, 2003) have proposed an approach to mining
astronomical time series streams. Preprocessing has two phases: missing data are
first handled using interpolation, and a normalization process then takes place.
The first processing step finds frequently occurring shapes in the time series using
time windows. The second step clusters the discovered shape patterns. Rule
extraction and filtering over the created clusters form the final step of the
approach. The limitation of the implemented system is that it can process only one
time series at a time. Figure 39.7 shows a simple flow chart of the approach.
Zhu and Shasha (Zhu and Shasha, 2003) have proposed techniques to compute a set of
statistical measures over time series data streams. The proposed techniques use the
discrete Fourier transform to create a synopsis data structure. The system, called
StatStream, is able to compute approximate, error-bounded correlations and inner
products. It works over an arbitrarily chosen sliding window.
Keogh et al. (Keogh et al., 2003) have shown empirically that the most cited
algorithms for clustering time series data streams proposed so far in the literature
produce meaningless results in subsequence clustering. They have proposed a solution
that uses k-motifs to choose the subsequences on which the algorithm works. The
1-motif is the subsequence that has the highest count of non-trivial matches in a
time series. Thus, the k-motifs are the k subsequences with the highest counts of
matches. Experimental results show the success of the technique in extracting
meaningful time series clustering results.
Fig. 39.7. Astronomical Time Series Analysis
Lin et al. (Lin et al., 2003) have proposed a symbolic representation of time series
data streams termed Symbolic Aggregate approXimation (SAX). This representation
allows dimensionality and numerosity reduction, where numerosity reduction refers to
reducing the number of records. They have demonstrated the applicability of the
representation by applying it to clustering, classification, indexing and anomaly
detection. The approach has two main stages: the time series is first transformed
to a Piecewise Aggregate Approximation, and the output is then transformed to a
string of discrete symbols.
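The two stages can be sketched as follows. This is a minimal illustration under stated assumptions: the series is z-normalized first, its length is divisible by the number of segments, and the symbol breakpoints are chosen so that each symbol is equally probable under a standard normal distribution. The function names paa and sax are illustrative.

```python
from statistics import NormalDist, mean, pstdev


def paa(series, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-length segment."""
    if len(series) % segments != 0:
        raise ValueError("series length must be divisible by segments")
    size = len(series) // segments
    return [mean(series[i * size:(i + 1) * size]) for i in range(segments)]


def sax(series, segments, alphabet="abcd"):
    """Convert a numeric series to a string of len(segments) symbols."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("constant series cannot be normalized")
    normalized = [(v - mu) / sigma for v in series]
    a = len(alphabet)
    # Breakpoints that cut the standard normal into a equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    word = []
    for value in paa(normalized, segments):
        index = sum(value > b for b in breakpoints)  # region the value falls in
        word.append(alphabet[index])
    return "".join(word)


print(sax([0, 0, 4, 4, 6, 6, 10, 10], segments=4))
```

Eight numeric values are reduced to a four-symbol word, which is the dimensionality reduction the text refers to.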
Chen et al. (Chen et al., 2002) have proposed the application of so-called
regression cubes to data streams. Given the success of OnLine Analytical Processing
(OLAP) technology on static stored data, they proposed using multidimensional
regression analysis to create a compact cube that can be used to answer aggregate
queries over the incoming data streams. This research has been extended and adopted
in the ongoing project Mining Alarming Incidents in Data Streams (MAIDS). The
technique has been shown experimentally to be efficient in analyzing time series
data streams.
39.6 Systems and Applications
Recently, systems and applications that deal with mining data streams have been
developed. These systems are application-oriented, except for MAIDS, developed by
Cai et al. (Cai et al., 2004), which represents the first attempt to develop a
generic data stream mining system. The following list introduces these systems and
applications with short descriptions.
Burl et al. (Burl et al., 1999) have developed Diamond Eye for NASA and JPL. The aim
of the project is to enable remote systems as well as scientists to extract patterns
from spatial objects in real-time image streams. The success of this project will
enable "a new era of exploration using highly autonomous spacecraft, rovers, and
sensors" (Burl et al., 1999). The system uses a high-performance computational
facility to process data mining requests. The scientist uses a web interface with
Java applets to connect to the server, which requests the images needed to perform
the image mining process.
Kargupta et al. (Kargupta et al., 2002) have developed the first ubiquitous data
stream mining system, termed MobiMine. It is a client/server PDA-based distributed
data mining application for financial data streams. The system prototype was
developed using a single data source and multiple mobile clients; however, the
system is designed to handle multiple data sources. The server functionalities in
the proposed system are data collection from different financial web sites and
storage, selection of active stocks using common statistical methods, and
application of online data mining techniques to the stock data. The client
functionalities are portfolio management, using a mobile micro-database to store
portfolio data and information about the user's preferences, and construction of the
WatchList, which is the first point of interaction between the client and the
server. The server computes the most active stocks in the market, and the client in
turn selects a subset of this list to construct the personalized WatchList according
to an optimization module. The second point of interaction between the client and
the server is that the server performs online data mining, transforms the results
using a Fourier transformation, and finally sends them to the client. The client in
turn visualizes the results on the PDA screen. It is worth pointing out that the
data mining process in MobiMine is performed on the server side, given the resource
constraints of a mobile device. With the increasing need for onboard data mining in
resource-constrained computing environments, Kargupta et al. (Kargupta, 2004) have
developed onboard mining techniques for a different application: mining vehicle
sensory data streams.
Kargupta et al. (Kargupta, 2004) have developed the Vehicle Data Stream Mining
System (VEDAS). It is a ubiquitous data stream mining system that allows continuous
monitoring and pattern extraction from data streams generated on board a moving
vehicle. The mining component is located on the PDA. VEDAS uses online incremental
clustering to model driving behavior.
Tanner et al. (Tanner et al., 2002) have developed the EnVironment for On-Board
Processing (EVE) for astronomical data streams. The system analyzes data streams
continuously generated from the measurements of different on-board sensors. Only
interesting patterns are sent to the ground stations for further analysis, which
preserves the limited bandwidth.

Srivastava and Stroeve (Srivastava and Stroeve, 2003) work on a NASA project for
onboard detection of geophysical processes such as snow, ice and clouds. They use
kernel clustering methods for data compression to preserve the limited bandwidth
needed to send image streams to the ground centers. The kernel methods were chosen
for their low computational complexity.
Cai et al. (Cai et al., 2004) have developed an integrated mining and querying
system. The system can classify, cluster, count frequencies and query over data
streams. Mining Alarming Incidents of Data Streams (MAIDS) is currently under
development, and the project team has recently demonstrated its prototype
implementation. Sequential pattern mining and hidden network mining are currently
under development.
Pirttikangas et al. (Pirttikangas et al., 2001) have implemented a mobile
agent-based ubiquitous data mining system for a context-aware health club for
cyclists. The system is called Genie of the Net. The process starts by collecting
information from sensors and databases in order to recognize the information needed
for the specific application. This information includes the user's context and other
needed information collected by mobile agents. In the main scenario for the health
club system, the user has a plan for an exercise. All the needed health information,
such as heart rate, is recorded during the exercise. This information is analyzed
using data mining techniques to advise the user after each exercise.
Having discussed the state of the art in mining data streams in terms of developed
techniques as well as systems used in different applications, we can use this
review as a base for classifying these techniques into generic categories.
39.7 Taxonomy of Data Stream Mining Approaches
The research problems and challenges in mining data streams discussed earlier have
solutions that use well-established statistical and computational approaches. We can
categorize these solutions into data-based and task-based ones. In data-based
solutions, the idea is to examine only a subset of the whole dataset or to transform
the data vertically or horizontally into an approximate, smaller data
representation. In task-based solutions, on the other hand, techniques from
computational theory have been adopted to achieve time- and space-efficient
solutions. In this section we review these theoretical foundations.
39.7.1 Data-based Techniques
Data-based techniques refer to summarizing the whole dataset or choosing a subset
of the incoming stream to be analyzed. Sampling, load shedding and sketching
techniques choose a subset of the stream. Synopsis data structures and aggregation
summarize it. The following subsections outline the basics of these techniques, with
pointers to their applications in the context of data stream mining.
Sampling
Sampling refers to the process of probabilistically choosing which data items are
processed (Toivonen, 1996). Sampling is an old statistical technique that has long
been used in conventional data mining for large databases. In the context of data
stream mining, bounds on the error rate of the computation are given as a function
of the sampling rate or size. Very Fast Machine Learning techniques (Domingos and
Hulten, 2000) have used the Hoeffding bound (Hoeffding, 1963) to determine the
sample size according to a loss function derived from the running mining algorithm.
The problem with using sampling in the context of data stream analysis is that the
dataset size is unknown, so the treatment of a data stream requires a special
analysis to find the error bounds. Another problem is that sampling is not the right
choice for applications such as surveillance analysis, where it is important to
check for anomalies. Sampling also does not address the problem of fluctuating data
rates. It would be worth investigating the relationship among three parameters: data
rate, sampling rate and error bounds.
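To make the link between sample size and error bound concrete, the sketch below uses the standard form of the Hoeffding bound: with probability at least 1 - delta, the mean of n independent observations of a variable with range R is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean. The function names are illustrative, and the loss function used by any specific algorithm is not modeled here.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """Error bound after n observations, holding with probability 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))


def samples_needed(value_range, delta, epsilon):
    """Smallest n for which the Hoeffding bound is at most epsilon."""
    return math.ceil(value_range ** 2 * math.log(1 / delta) / (2 * epsilon ** 2))


n = samples_needed(value_range=1.0, delta=0.05, epsilon=0.1)
print(n, hoeffding_epsilon(1.0, 0.05, n))
```

Note that the bound depends only on the number of observations seen so far, not on the total size of the dataset, which is what makes it usable on an unbounded stream.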
Load Shedding
Load shedding (Babcock et al., 2003, Tatbul et al., 2003, Tatbul et al., 2003)
refers to the process of dropping a sequence of data streams. Load shedding has been
used successfully in querying data streams. It has the same problems as sampling.
Load shedding is difficult to use with mining algorithms because it drops chunks of
data streams that could be used in structuring the generated models or that might
represent a pattern of interest in time series analysis. However, it has recently
been used for the classification problem with acceptable accuracy in an algorithm
developed by Chi et al. (Chi et al., 2005), termed Loadstar. It represents the first
attempt to use load shedding in high-speed data stream classification problems.
Sketching
Sketching (Babcock et al., 2002, Muthukrishnan, 2003) is the process of randomly
projecting a subset of the features; that is, of vertically sampling the incoming
stream. Sketching has been applied in comparing different data streams and in
aggregate queries. The major drawback of sketching is its accuracy, which makes it
hard to use in the context of data stream mining. Principal Component Analysis
(PCA), which has been applied in streaming applications (Kargupta, 2004), would be
a better solution.
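One common way to realize such a projection is a Gaussian random projection, sketched below. This is an assumption made for illustration: the cited works may use other constructions, and the names make_projection and sketch are hypothetical.

```python
import math
import random


def make_projection(n_features, sketch_size, seed=0):
    """Random matrix with sketch_size rows of scaled Gaussian weights."""
    rng = random.Random(seed)  # fixed seed so every record uses the same matrix
    scale = 1 / math.sqrt(sketch_size)
    return [[rng.gauss(0, 1) * scale for _ in range(n_features)]
            for _ in range(sketch_size)]


def sketch(record, projection):
    """Project one record onto the lower-dimensional sketch space."""
    return [sum(w * v for w, v in zip(row, record)) for row in projection]


projection = make_projection(n_features=5, sketch_size=3, seed=1)
print(sketch([1.0, 2.0, 3.0, 4.0, 5.0], projection))
```

Each incoming record with five features is reduced to three values. Distances between sketches only approximate distances between the original records, which is the accuracy drawback mentioned above.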
Synopsis Data Structures
Creating a synopsis of data refers to applying summarization techniques that are
capable of summarizing the incoming stream for further analysis. Wavelet analysis
(Gilbert et al., 2003), histograms, quantiles and frequency moments (Babcock et al.,
2002) have been proposed as synopsis data structures. Since a synopsis does not
represent all the characteristics of the dataset, approximate answers are produced
when using such data structures.
Aggregation
Aggregation is the process of computing statistical measures, such as means and
variances, that summarize the incoming data stream. The aggregated data can then be
used by the data mining algorithm. The problem with aggregation is that it does not
perform well with highly fluctuating data distributions. Merging online aggregation
with offline mining has been studied in (Aggarwal et al., 2003, Aggarwal et al.,
2004, Aggarwal et al., 2004) for clustering and classification of data streams.
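As an illustration of one-pass aggregation, the sketch below maintains a running mean and variance with Welford's online update, so that no records need to be stored. The class name RunningStats is illustrative, and the choice of population variance is an assumption.

```python
class RunningStats:
    """Running mean and population variance over a stream, in O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)  # uses the updated mean

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
print(stats.mean, stats.variance)
```

Two numbers summarize the whole stream, which also shows the limitation noted above: if the distribution shifts, a single mean and variance hide the change.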
Definitions, advantages and disadvantages of all of the above data-based approaches
are given in Table 39.2.
39.7.2 Task-based Techniques
Task-based techniques are methods that modify existing techniques or develop new
ones in order to address the computational challenges of data stream processing.
Approximation algorithms, sliding window techniques and algorithm output granularity
represent this category. In the following subsections, we examine each of these
techniques and its application in the context of data stream analysis.
Approximation Algorithms
Approximation algorithms (Muthukrishnan, 2003) have their roots in algorithm design,
where they are concerned with designing algorithms for computationally hard
problems. These algorithms produce an approximate solution with error bounds. The
idea is that data stream mining problems are considered computationally hard, given
the continuity and speed of the streams and the resource-constrained computational
environment. Approximation algorithms have attracted researchers as a direct
solution to data stream mining problems. However, approximation algorithms alone
cannot solve the problem of data rates relative to the available resources; other
tools should be used along with them in order to adapt to the available resources.
Approximation algorithms have been used in (Cormode and Muthukrishnan, 2003, Jin et
al., 2003) for discovering frequent items.
Sliding Window
The inspiration behind sliding window techniques is that the user is more concerned
with the analysis of the most recent data. Thus, detailed analysis is performed over
the most recent data items, and summarized versions of the old ones are kept. This
idea has been adopted in many techniques in the ongoing comprehensive data stream
mining system MAIDS (Dong et al., 2003). The main issue with sliding window
techniques is how to remove expired results from the current model.
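A minimal sketch of the idea, assuming a count-based window and a mean as the "model": recent items are kept in full, each expired item is removed from the running total, and only a summary (count and sum) of the expired items is retained. The class name SlidingWindowMean and its attributes are illustrative.

```python
from collections import deque


class SlidingWindowMean:
    """Mean over the last `size` items, plus a summary of expired items."""

    def __init__(self, size):
        self.size = size
        self.window = deque()    # detailed view of the most recent items
        self.total = 0.0         # sum of the items currently in the window
        self.expired_count = 0   # summary of the old items
        self.expired_sum = 0.0

    def add(self, x):
        self.window.append(x)
        self.total += x
        if len(self.window) > self.size:
            old = self.window.popleft()
            self.total -= old    # remove the expired item's contribution
            self.expired_count += 1
            self.expired_sum += old
        return self.total / len(self.window)


w = SlidingWindowMean(size=3)
print([w.add(x) for x in [1, 2, 3, 4]])
```

Removing an expired item is trivial for a sum; for models such as clusters or decision trees, undoing the effect of old records is much harder, which is the issue the text points to.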
Algorithm Output Granularity
The algorithm output granularity (AOG) (Gaber et al., 2005, Gaber et al., 2004)
introduces the first resource-aware data analysis approach that can cope with
fluctuating, very high data rates according to the available memory and the
processing speed, represented as time constraints. The AOG performs the local data
analysis on a resource
Table 39.2. Data-based Techniques

Technique | Definition | Pros | Cons
--- | --- | --- | ---
Sampling | The process of choosing a subset of a dataset for the sake of analysis using probability theory. | Well-established techniques; error bounds guaranteed. | Poor for anomaly detection.
Load Shedding | The process of ignoring a continuous chunk of streaming data. | Proved efficiency with data stream querying; used recently with success in data stream mining. | Very poor for anomaly detection.
Sketching | Random projection of a set of features to be analyzed. | Considerably improves the running time. | Some unselected features might be of great importance.
Synopsis Data Structure | Quick transformation of the incoming stream into a summarized compressed form. | Analysis task independent. | Might not be sufficient with high data rates.
Aggregation | Calculating statistical measures that capture the features of data. | Analysis task independent. | Aggregation measures do not capture all the required features of data.
