Data Mining and Knowledge Discovery Handbook, 2 Edition part 81 pdf

780 Mohamed Medhat Gaber, Arkady Zaslavsky, and Shonali Krishnaswamy
constrained device that generates or receives streams of information. AOG has three
main stages. Mining, followed by adaptation to resources and data stream rates,
constitutes the first two stages. Merging the generated knowledge structures when
memory runs out is the last stage. AOG has been used in clustering, classification,
and frequency counting (Gaber et al., 2005).
Figure 39.8 shows a flowchart of the AOG mining process, giving the sequence of
the three stages.
Fig. 39.8. AOG Approach
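The three AOG stages described above (mining, adaptation, and merging) can be sketched for a toy one-dimensional clustering task. This is only an illustration under stated assumptions, not the algorithm of Gaber et al.: the class name `AOGSketch`, the distance threshold used as the output-granularity knob, the cluster count used as a stand-in for memory, and the 75% high-water mark are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    center: float
    weight: int = 1


class AOGSketch:
    """Toy one-dimensional illustration of the three AOG stages (assumed design)."""

    def __init__(self, threshold=1.0, max_clusters=4, growth=1.5):
        self.threshold = threshold        # output-granularity knob (assumption)
        self.max_clusters = max_clusters  # stand-in for the memory budget
        self.growth = growth              # factor by which the threshold is raised
        self.clusters = []

    def mine(self, x):
        """Stage 1: absorb x into the nearest cluster or open a new one."""
        if self.clusters:
            nearest = min(self.clusters, key=lambda c: abs(c.center - x))
            if abs(nearest.center - x) <= self.threshold:
                total = nearest.weight + 1
                nearest.center = (nearest.center * nearest.weight + x) / total
                nearest.weight = total
                return
        self.clusters.append(Cluster(center=x))

    def adapt(self):
        """Stage 2: coarsen the output when memory use nears the budget."""
        if len(self.clusters) >= 0.75 * self.max_clusters:
            self.threshold *= self.growth

    def merge(self):
        """Stage 3: merge the two closest clusters while over budget."""
        while len(self.clusters) > self.max_clusters:
            self.clusters.sort(key=lambda c: c.center)
            i = min(range(len(self.clusters) - 1),
                    key=lambda j: self.clusters[j + 1].center - self.clusters[j].center)
            a, b = self.clusters[i], self.clusters[i + 1]
            total = a.weight + b.weight
            center = (a.center * a.weight + b.center * b.weight) / total
            self.clusters[i:i + 2] = [Cluster(center, total)]

    def process(self, x):
        self.mine(x)
        self.adapt()
        self.merge()


aog = AOGSketch(threshold=1.0, max_clusters=3)
for value in [0.0, 0.1, 5.0, 5.2, 10.0, 20.0, 30.0, 40.0]:
    aog.process(value)
```

Raising the threshold makes new points more likely to be absorbed by existing clusters, which lowers the rate at which output structures are created. Merging is the fallback once the budget is actually exceeded.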
Deﬁnitions, advantages and disadvantages of all of the above task-based ap-
proaches are given in Table 39.3.
39.8 Related Work
The last few years have witnessed the emergence of data management strategies
focusing on data stream issues (Babcock et al., 2002). Querying and summarizing
data that could be stored for further analysis are the main processing tasks studied
in data stream management systems. Extension of query languages, query planning,
scheduling, and optimization are the major research activities conducted in this area.
Aurora (Abadi et al., 2003), COUGAR (Yao and Gehrke, 2002), Gigascope (Cranor
et al., 2003), STREAM (Arasu et al., 2003), and TelegraphCQ (Krishnamurthy et
al., 2003) represent the first generation of data stream management systems. A
brief description of each follows:
• STREAM: The STanford stREam datA Manager (STREAM) (Arasu et al., 2003) is a
data stream management system that handles multiple continuous data streams
and supports long-running continuous queries. The intermediate results of a
continuous query are stored in a data structure termed the Scratch Store. The result
of a query can be either a data stream delivered to the user or a relation that
can be stored for re-processing. To support continuous queries over data
streams, a continuous query language termed CQL has been developed as part
of the system. The language supports relation-to-relation, stream-to-relation, and
relation-to-stream operators.
• Gigascope: Gigascope (Cranor et al., 2003) is a specialized data stream management
system for network monitoring applications. It has its own SQL-like query language
termed GSQL. Unlike CQL, the input and output of this language are only
data streams. GSQL supports merge, selection, join, and aggregation operations
on data streams. Query optimization and performance considerations were
addressed in developing the language. The system serves a number of
network-related applications, including intrusion detection and traffic analysis.
• TelegraphCQ: TelegraphCQ (Krishnamurthy et al., 2003) is a continuous query
processing system built on the PostgreSQL open-source database system. The system
supports creating data streams, sources, wrappers, and queries.
• COUGAR: COUGAR (Yao and Gehrke, 2002) is a data stream management system
designed for sensor networks. Motivated by the fact that local computation in
sensor networks is cheaper than transferring sensor data over wireless
connections, a loosely coupled distributed architecture has been proposed to
answer in-network queries.
• Aurora: Aurora (Abadi et al., 2003) is a data stream management system that
provides optimization features for load shedding, real-time query scheduling, and
QoS assessment. It is mainly designed to deal with very large numbers of data streams.
39 Data Stream Mining 781
Table 39.3. Task-based Techniques
| Technique | Definition | Pros | Cons |
|---|---|---|---|
| Approximation Algorithms | Design algorithms that approximate mining results with error bounds. | Efficiency in running time. | Do not solve the problem of data rates with regard to the available resources. |
| Sliding Window | Analyzing the most recent data streams. | Applicable to most data stream applications. | Does not provide a model for the whole data stream. |
| Algorithm Output Granularity | Adapting the algorithm parameters according to data stream rate and memory consumption. | Generic approach that could be used with any mining technique with no or minor modifications. | Incurs an overhead when running for long periods of time. |
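The three operator classes that the STREAM entry above attributes to CQL can be illustrated with a short Python sketch. This is neither CQL syntax nor the STREAM implementation. The function names, the count-based window, and the rule that the output stream carries only rows absent from the previous relation are assumptions made for illustration.

```python
from collections import deque


def stream_to_relation(stream, size):
    """After each arrival, yield the window holding the last `size` tuples."""
    window = deque(maxlen=size)
    for item in stream:
        window.append(item)
        yield list(window)


def relation_to_relation(relation, predicate):
    """An ordinary relational selection over a finite relation."""
    return [row for row in relation if predicate(row)]


def relation_to_stream(relations):
    """Emit rows that were not present in the previous relation (by value)."""
    previous = []
    for current in relations:
        for row in current:
            if row not in previous:
                yield row
        previous = current


def continuous_query(stream, size, predicate):
    """Compose the three operator classes into one long-running query."""
    windows = stream_to_relation(stream, size)
    selected = (relation_to_relation(w, predicate) for w in windows)
    return relation_to_stream(selected)


readings = [3, 12, 7, 15, 12, 20]
result = list(continuous_query(readings, size=3, predicate=lambda v: v > 10))
print(result)  # [12, 15, 20]
```

Because rows are compared by value, the second reading of 12 is not emitted again while an equal value is still in the selected window. A real system would track tuples by identity or timestamp.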
Queries over data streams share research issues and challenges with data stream
mining. The two main constraints on querying data streams are the unbounded
memory requirement and the high data rate. Because of the high data rate, the
computation time per data element or record must be less than the interval between
arrivals, that is, the reciprocal of the data rate or the sampling rate. The unbounded
memory requirement compounds the challenge by necessitating approximate rather
than exact results. Significant research effort has gone into approximating query
results (Babcock et al., 2002, Garofalakis et al., 2002b).
Data stream mining algorithms have adopted some of the techniques introduced
in data stream management research. Sampling and load shedding (Muthukrishnan,
2003) are among the basic techniques that were introduced for querying
data streams and later extended to the data mining process.
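As a rough illustration of the two techniques named above, the sketch below pairs reservoir sampling with a simple random-drop load shedder. Both are standard textbook forms chosen here as assumptions. The chapter does not prescribe these particular variants, and the function names and the `arrival_rate` and `seconds_per_item` parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)
        else:
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:
                sample[j] = item
    return sample


def keep_probability(arrival_rate, seconds_per_item):
    """Fraction of items that can be processed without falling behind."""
    capacity = 1.0 / seconds_per_item  # items per second the miner can absorb
    return min(1.0, capacity / arrival_rate)


def shed_load(stream, arrival_rate, seconds_per_item, rng):
    """Randomly drop items so the expected processed rate fits the capacity."""
    p = keep_probability(arrival_rate, seconds_per_item)
    for item in stream:
        if rng.random() < p:
            yield item


rng = random.Random(0)
sample = reservoir_sample(range(1000), 10, rng)
kept = list(shed_load(range(10000), arrival_rate=1000, seconds_per_item=0.004, rng=rng))
```

With 1000 items arriving per second and 4 ms of work per item, the miner can absorb about 250 items per second, so roughly one item in four is kept.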
39.9 Future Directions
The field of data stream mining is at an early stage of its evolution. The last few years
have witnessed increased attention to this area of research owing to the proliferation
of data stream sources. Based on the state of the art in the area and the demands of data
streaming applications, we can identify the following future directions of research:
• Developing data mining algorithms for wireless sensor networks to serve a number
of real-time critical applications.
• Online mining of medical, scientific, and biological data streams generated by
medical and biological instruments and by the various tools employed in scientific
laboratories.
• Hardware solutions for small devices that emit or receive data streams, in order
to enable high-performance computation on such devices.
• Developing software architectures that serve data streaming applications.
39.10 Summary
In this chapter, a review of the state of the art in mining data streams has been
presented. Clustering, classification, frequency counting, and time series analysis
techniques have been discussed. Different systems that use data stream mining
techniques have also been presented. A generalization of the approaches used in
developing data stream mining techniques has been given. The approaches have been
broadly classified into data-based and task-based strategies. Sampling, load shedding,
sketching, synopsis data structure creation, and aggregation represent the data-based
approaches. Approximation algorithms, sliding window, and algorithm output
granularity are the three task-based approaches. The chapter concludes with pointers
to future research directions in the area.
References
A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, I. Nishizawa, J. Rosenstein, and J.
Widom. STREAM: The Stanford Stream Data Manager Demonstration description -
short overview of system status and plans, in Proc. of the ACM Intl Conf. on
Management of Data (SIGMOD 2003), June 2003, p. 665.
D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, C. Erwin, E. Galvez, M.
Hatoun, J. Hwang, A. Maskey, A. Rasin, A. Singer, M. Stonebraker, N. Tatbul, Y. Xing,
R. Yan, S. Zdonik. Aurora: A Data Stream Management System (Demonstration).
Proceedings of the ACM SIGMOD International Conference on Management of Data
(SIGMOD’03), San Diego, CA, June 2003.
C. Aggarwal, J. Han, J. Wang, P. S. Yu, A Framework for Clustering Evolving Data Streams,
Proc. 2003 Int. Conf. on Very Large Data Bases (VLDB’03), Berlin, Germany, Sept.
2003, pp 81-92.
C. Aggarwal, J. Han, J. Wang, and P. S. Yu, A Framework for Projected Clustering of High
Dimensional Data Streams, Proc. 2004 Int. Conf. on Very Large Data Bases (VLDB’04),
Toronto, Canada, Aug. 2004, pp. 852-863.
C. Aggarwal, J. Han, J. Wang, and P. S. Yu, On Demand Classiﬁcation of Data Streams,
Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD’04), Seattle,
WA, Aug. 2004, pp. 503-508.
I.F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. A Survey on Sensor Networks,
IEEE Communication Magazine, August, 2002, pp. 102-114.
B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream
systems, Proceedings of PODS, 2002, pp. 1-16.
B. Babcock, M. Datar, and R. Motwani. Load Shedding Techniques for Data Stream Sys-
tems (short paper), Proc. of the 2003 Workshop on Management and Processing of Data
Streams (MPDS 2003), June 2003
B. Babcock, M. Datar, R. Motwani, L. O’Callaghan, Maintaining Variance and k-Medians
over Data Stream Windows, Proceedings of the 22nd Symposium on Principles of
Database Systems (PODS 2003), pp. 234 - 243.
M. Burl, Ch. Fowlkes, J. Roden, A. Stechert, and S. Mukhtar, Diamond Eye: A distributed
architecture for image data mining, in SPIE DMKD, Orlando, April 1999, pp. 197-206.
M. Charikar, L. O’Callaghan, and R. Panigrahy, Better streaming algorithms for clustering
problems, Proc. of 35th ACM Symposium on Theory of Computing (STOC), 2003, pp.
30-39.
Y.D. Cai, D. Clutter, G. Pape, J. Han, M. Welge, and L. Auvil, MAIDS: Mining Alarming
Incidents from Data Streams, (system demonstration), Proc. 2004 ACM-SIGMOD Int.
Conf. Management of Data (SIGMOD’04), Paris, France, June 2004, pp. 919 - 920.
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, Multi-Dimensional Regression Analysis
of Time-Series Data Streams, Proceedings of VLDB Conference, 2002, pp. 323-334.
B. Castano, M. Judd, R. C. Anderson, and T. Estlin, Machine Learning Challenges in Mars
Rover Traverse Science, Proc. of the ICML 2003 workshop on Machine Learning Tech-
nologies for Autonomous Space Applications.
C. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk, Gigascope: a stream database
for network applications, In Proceedings of the 2003 ACM SIGMOD International
Conference on Management of Data (San Diego, California, June 09 - 12, 2003). SIGMOD
’03. ACM, New York, NY, 647-651
L. O’Callaghan, Nina Mishra, Adam Meyerson, Sudipto Guha, and Rajeev Motwani,
Streaming-data algorithms for high-quality clustering, Proceedings of IEEE
International Conference on Data Engineering, March 2002, pp. 685-697.
G. Cormode, S. Muthukrishnan, What’s hot and what’s not: tracking most frequent items
dynamically, PODS 2003, pp. 296-306
J. Coughlan, Accelerating Scientiﬁc Discovery at NASA, SIAM SDM 2004, Florida USA.
G. Cormode and S. Muthukrishnan, What is new: Finding significant differences in network
data streams, INFOCOM 2004.
Y. Chi, Philip S. Yu, Haixun Wang, Richard R. Muntz, Loadstar: A Load Shedding Scheme
for Classifying Data Streams, The 2005 SIAM International Conference on Data Mining
(SIAM SDM’05), 2005.
G. Dong, J. Han, L.V.S. Lakshmanan, J. Pei, H. Wang and P.S. Yu. Online mining of changes
from data streams: Research problems and preliminary results, Proceedings of the 2003
ACM SIGMOD Workshop on Management and Processing of Data Streams. In cooper-
ation with the 2003 ACM-SIGMOD International Conference on Management of Data
(SIGMOD’03), San Diego, CA, June 8, 2003.
P. Domingos and G. Hulten, Mining High-Speed Data Streams, In Proceedings of the As-
sociation for Computing Machinery Sixth International Conference on Knowledge Dis-
covery and Data Mining, 2000, pp. 71-80
P. Domingos and G. Hulten. Catching Up with the Data: Research Issues in Mining Data
Streams, Workshop on Research Issues in Data Mining and Knowledge Discovery, 2001.
Santa Barbara, CA
P. Domingos and G. Hulten, A General Method for Scaling Up Machine Learning Algo-
rithms and its Application to Clustering, Proceedings of the Eighteenth International
Conference on Machine Learning, 2001, Williamstown, MA, Morgan Kaufmann, pp.
106-113.
M. Dunham. Data Mining: Introductory and Advanced Topics. Pearson Education, 2003.
F.J. Ferrer-Troyano, J.S. Aguilar-Ruiz and J.C. Riquelme, Discovering Decision Rules from
Numerical Data Streams, ACM Symposium on Applied Computing - SAC04, 2004,
ACM Press, pp. 649-653.
U.M. Fayyad: Knowledge Discovery in Databases: An Overview. ILP 1997, pp. 3-16
U.M. Fayyad: Mining Databases: Towards Algorithms for Knowledge Discovery. IEEE Data
Eng. Bull. 21(1), 1998 pp. 39-48.
U.M. Fayyad, Georges G. Grinstein, Andreas Wierse: Information Visualization in Data Min-
ing and Knowledge Discovery Morgan Kaufmann 2001.
M.M. Gaber and P.S. Yu, A Holistic Approach for Resource-aware Adaptive Data Stream
Mining, Journal of New Generation Computing, Special Issue on Knowledge Discovery
from Data Streams, 2006.
V. Ganti, Johannes Gehrke, Raghu Ramakrishnan: Mining Data Streams under Block
Evolution. SIGKDD Explorations 3(2), 2002 pp. 1-10.
M. Garofalakis, Johannes Gehrke, Rajeev Rastogi: Querying and mining data streams: you
only get one look a tutorial. SIGMOD Conference 2002: 635
C. Giannella, J. Han, J. Pei, X. Yan, and P.S. Yu, Mining Frequent Patterns in Data Streams
at Multiple Time Granularities, in H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha
(eds.), Next Generation Data Mining, AAAI/MIT, 2003.
A.C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin Strauss: One-Pass Wavelet Decom-
positions of Data Streams. TKDE 15(3), 2003, pp. 541-554.
M.M. Gaber, S. Krishnaswamy, and A. Zaslavsky, On-board Mining of Data Streams in
Sensor Networks, a book chapter in Advanced Methods of Knowledge Discovery from
Complex Data, (Eds.) Sanghamitra Bandyopadhyay, Ujjwal Maulik, Lawrence Holder
and Diane Cook, Springer Verlag, 2005.
R. Grossman, Supporting the Data Mining Process with Next Generation Data Mining
Systems, Enterprise Systems, August 1998
M.M. Gaber, A. Zaslavsky, and S. Krishnaswamy, Towards an Adaptive Approach for Mining
Data Streams in Resource Constrained Environments, Proceedings of Sixth
International Conference on Data Warehousing and Knowledge Discovery - Industry Track
(DaWaK 2004), Zaragoza, Spain, 30 August - 3 September, Lecture Notes in Computer
Science (LNCS), Springer Verlag.
S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan, Clustering data streams, Proceedings
of the Annual Symposium on Foundations of Computer Science. IEEE, November 2000,
pp. 359-366.
S. Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O’Callaghan, Cluster-
ing Data Streams: Theory and Practice TKDE special issue on clustering, vol. 15, 2003,
pp. 515-528.
D.J. Hand, Statistics and Data Mining: Intersecting Disciplines, ACM SIGKDD Explo-
rations, 1, 1, June 1999, pp. 16-19.
D.J. Hand, Mannila H., and Smyth P. Principles of data mining, MIT Press, 2001.
W. Hoeffding. Probability inequalities for sums of bounded random variables, Journal of the
American Statistical Association (58), 1963, pp. 13-30.
J. Han, J. Pei, and Y. Yin, Mining frequent patterns without candidate generation, In Proc.
2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’00), pp. 1-12.
G. Hulten, L. Spencer, and P. Domingos. Mining Time-Changing Data Streams. ACM
SIGKDD 2001, pp. 97-106.
M. Henzinger, P. Raghavan and S. Rajagopalan, Computing on data streams, Technical Note
1998-011, Digital Systems Research Center, Palo Alto, CA, May 1998
T. Hastie, R. Tibshirani, J. Friedman, The elements of statistical learning: data mining, infer-
ence, and prediction, New York: Springer, 2001
P. Indyk, N. Koudas, and S. Muthukrishnan, Identifying Representative Trends in Massive
Time Series Data Sets Using Sketches. In Proc. of the 26th Int. Conf. on Very Large Data
Bases, Cairo, Egypt, September 2000, pp. 363 - 372.
C. Jin, Weining Qian, Chaofeng Sha, Jeffrey X. Yu, and Aoying Zhou, Dynamically Main-
taining Frequent Items over a Data Stream, In Proceedings of the 12th ACM Conference
on Information and Knowledge Management (CIKM’2003), pp. 287-294
M. Kantardzic, Data mining : concepts, models, methods and algorithms, Piscataway, NJ:
IEEE Pr. Wiley Interscience, 2003.
H. Kargupta, Ruchita Bhargava, Kun Liu, Michael Powers, Patrick Blair, Samuel Bushra,
James Dull, Kakali Sarkar, Martin Klein, Mitesh Vasa, and David Handy, VEDAS: A
Mobile and Distributed Data Stream Mining System for Real-Time Vehicle Monitoring,
Proceedings of SIAM International Conference on Data Mining 2004.
S. Krishnamurthy, S. Chandrasekaran, O. Cooper, A. Deshpande, M. Franklin, J. Hellerstein,
W. Hong, S. Madden, V. Raman, F. Reiss, and M. Shah. TelegraphCQ: An Architectural
Status Report. IEEE Data Engineering Bulletin, Vol 26(1), March 2003.
E. Keogh, J. Lin, and W. Truppel. Clustering of Time Series Subsequences is Meaningless:
Implications for Past and Future Research. In proceedings of the 3rd IEEE International
Conference on Data Mining. Melbourne, FL. Nov 19-22, 2003, pp. 115-122.
H. Kargupta, B. Park, S. Pittie, L. Liu, D. Kushraj, and K. Sarkar, MobiMine:
Monitoring the Stock Market from a PDA. ACM SIGKDD Explorations, Volume 3,
Issue 2, January 2002, ACM Press, pp. 37-46.
B. Krishnamachari and S.S. Iyengar. Efficient and Fault-tolerant Feature Extraction in Sensor
Networks. In Proceedings of the 2nd International Workshop on Information Processing
in Sensor Networks (IPSN ’03), Palo Alto, California, April 2003.
B. Krishnamachari and S. Iyengar. Distributed Bayesian Algorithms for Fault-tolerant Event
Region Detection in Wireless Sensor Networks. IEEE Transactions on Computers, vol.
53, No. 3, March 2004.
M. Last, Online Classiﬁcation of Nonstationary Data Streams, Intelligent Data Analysis, Vol.
6, No. 2, 2002, pp. 129-147.
Y. Law, C. Zaniolo, An Adaptive Nearest Neighbor Classification Algorithm for Data
Streams, Proceedings of the 9th European Conference on the Principles and Practice
of Knowledge Discovery in Databases (PKDD 2005), Springer Verlag, Porto, Portugal,
October 3-7, 2005, pp. 108-120.
J. Lin, E. Keogh, S. Lonardi, and B. Chiu, A Symbolic Representation of Time Series, with
Implications for Streaming Algorithms, In proceedings of the 8th ACM SIGMOD Work-
shop on Research Issues in Data Mining and Knowledge Discovery. San Diego, CA. June
13, 2003, pp. 2-11.
G.S. Manku and R. Motwani. Approximate frequency counts over data streams. In
Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong,
China, August 2002, pp. 346-357.
R. Moskovitch, Y. Elovici, L. Rokach, Detection of unknown computer worms based
on behavioral classiﬁcation of the host, Computational Statistics and Data Analysis,
52(9):4544–4566, 2008.
S. Muthukrishnan, Data streams: algorithms and applications. Proceedings of the fourteenth
annual ACM-SIAM symposium on discrete algorithms, 2003.
O. Nasraoui, C. Cardona, C. Rojas, and F. Gonzalez, Mining Evolving User Profiles in
Noisy Web Clickstream Data with a Scalable Immune System Clustering Algorithm, in
Proc. of WebKDD 2003 - KDD Workshop on Web mining as a Premise to Effective and
Intelligent Web Applications, Washington DC, August 2003, p. 71
C. Ordonez. Clustering Binary Data Streams with K-means ACM DMKD 2003.
B. Park and H. Kargupta. Distributed Data Mining: Algorithms, Systems, and Applications,
Data Mining Handbook. Editor: Nong Ye. 2002.
E. Perlman and A. Java, Predictive Mining of Time Series Data in Astronomy. In ASP Conf.
Ser. 295: Astronomical Data Analysis Software and Systems XII, 2003.
S. Papadimitriou, C. Faloutsos, and A. Brockwell, Adaptive, Hands-Off Stream Mining, 29th
International Conference on Very Large Data Bases VLDB, 2003.
S. Pirttikangas, J. Riekki, J. Kaartinen, J. Miettinen, S. Nissila, J. Roning. Genie Of The
Net: A New Approach For A Context-Aware Health Club. In Proceedings of Joint 12th
ECML’01 and 5th European Conference on PKDD’01. September 3-7, 2001, Freiburg,
Germany.
L. Rokach, Decomposition methodology for classiﬁcation tasks: a meta decomposer frame-
work, Pattern Analysis and Applications, 9(2006):257–271.
L. Rokach, O. Maimon and R. Arbel, Selective voting-getting more for less in sensor fusion,
International Journal of Pattern Recognition and Artiﬁcial Intelligence 20 (3) (2006), pp.
329–350.
A. Srivastava and J. Stroeve, Onboard Detection of Snow, Ice, Clouds and Other Geophysical
Processes Using Kernel Methods, Proceedings of the ICML’03 workshop on Machine
Learning Technologies for Autonomous Space Applications.
S. Tanner, M. Alshayeb, E. Criswell, M. Iyer, A. McDowell, M. McEniry, K. Regner,
EVE: On-Board Process Planning and Execution, Earth Science Technology
Conference, Pasadena, CA, Jun. 11 - 14, 2002.
N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack and M. Stonebraker, Load Shedding in a
Data Stream Manager Proceedings of the 29th International Conference on Very Large
Data Bases (VLDB), September, 2003.
N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack, M. Stonebraker. Load Shedding on
Data Streams, In Proceedings of the Workshop on Management and Processing of Data
Streams (MPDS 03), San Diego, CA, USA, June 8, 2003.
H. Toivonen, Sampling large databases for association rules, Proceeding of VLDB Confer-
ence, 1996
Y. Yao, J. E. Gehrke, The Cougar Approach to In-Network Query Processing in Sensor Net-
works, SIGMOD Record, Volume 31, Number 3. September 2002, pp. 9-18.
H. Wang, W. Fan, P. Yu and J. Han, Mining Concept-Drifting Data Streams using Ensemble
Classiﬁers, in the 9th ACM International Conference on Knowledge Discovery and Data
Mining (SIGKDD), Aug. 2003, Washington DC, USA.
Y. Zhu and D. Shasha, Efﬁcient Elastic Burst Detection in Data Streams, The Ninth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining KDD-
2003 24 August 2003 - 27 August 2003, pp 336 - 345.

40
Mining Concept-Drifting Data Streams
Haixun Wang (1), Philip S. Yu (2), and Jiawei Han (3)
(1) IBM T. J. Watson Research Center
(2) IBM T. J. Watson Research Center
(3) University of Illinois, Urbana Champaign

Summary. Knowledge discovery from infinite data streams is an important and difficult task.
We face two challenges: the overwhelming volume and the concept drift of the streaming
data. In this chapter, we introduce a general framework for mining concept-drifting data
streams using weighted ensemble classifiers. We train an ensemble of classification models,
such as C4.5, RIPPER, naive Bayesian, etc., from sequential chunks of the data stream. The
classifiers in the ensemble are judiciously weighted based on their expected classification
accuracy on the test data under the time-evolving environment. Thus, the ensemble approach
improves both the efficiency in learning the model and the accuracy in performing
classification. Our empirical study shows that the proposed methods have a substantial
advantage over single-classifier approaches in prediction accuracy, and that the ensemble
framework is effective for a variety of classification models.
Key words: Data Mining, concept learning, classiﬁer design and evaluation
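The chunk-based weighted ensemble outlined in the summary can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the chapter's method. The nearest-centroid base learner stands in for models such as C4.5 or RIPPER. Accuracy on the newest chunk is used as a simple proxy for expected accuracy. The names `CentroidClassifier`, `WeightedEnsemble`, and `max_models` are hypothetical.

```python
from collections import defaultdict


class CentroidClassifier:
    """Stand-in base learner: predicts the class whose mean is nearest."""

    def fit(self, X, y):
        groups = {}
        for features, label in zip(X, y):
            groups.setdefault(label, []).append(features)
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, features):
        def sq_dist(center):
            return sum((a - b) ** 2 for a, b in zip(features, center))
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


def accuracy(model, X, y):
    return sum(model.predict(f) == label for f, label in zip(X, y)) / len(y)


class WeightedEnsemble:
    def __init__(self, max_models=5):
        self.max_models = max_models
        self.models = []  # list of (model, weight) pairs

    def update(self, X, y):
        """Train on the newest chunk, then re-weight every member on it.

        Simplification: the newest model is scored on its own training
        chunk, which is optimistic; a held-out estimate would be fairer.
        """
        candidates = [m for m, _ in self.models] + [CentroidClassifier().fit(X, y)]
        scored = [(m, accuracy(m, X, y)) for m in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        self.models = scored[:self.max_models]

    def predict(self, features):
        """Weighted vote: each member adds its weight to its predicted class."""
        votes = defaultdict(float)
        for model, weight in self.models:
            votes[model.predict(features)] += weight
        return max(votes, key=votes.get)


ensemble = WeightedEnsemble(max_models=3)
points = [(1.0,), (2.0,), (8.0,), (9.0,)]
ensemble.update(points, [0, 0, 1, 1])  # first concept
ensemble.update(points, [1, 1, 0, 0])  # the concept drifts: labels flip
```

After the second chunk the older model scores zero on the new data, so its vote carries no weight and the ensemble follows the current concept.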
40.1 Introduction
Knowledge discovery on streaming data is a research topic of growing interest (Bab-
cock et al., 2002, Chen et al., 2002, Domingos and Hulten, 2000, Hulten et al.,
2001). The fundamental problem we need to solve is the following: given an inﬁ-
nite amount of continuous measurements, how do we model them in order to capture
time-evolving trends and patterns in the stream, and make time-critical predictions?
Huge data volumes and drifting concepts are not unfamiliar to the Data Mining
community. One of the goals of traditional Data Mining algorithms is to
learn models from large databases with bounded memory. This has been achieved
by several classification methods, including Sprint (Shafer et al., 1996), BOAT
(Gehrke et al., 1999), etc. Nevertheless, the fact that these algorithms require
multiple scans of the training data makes them inappropriate in the streaming
environment, where examples arrive at a higher rate than they can be repeatedly analyzed.
O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed.,
DOI 10.1007/978-0-387-09823-4_40, © Springer Science+Business Media, LLC 2010
