10 Oded Maimon and Lior Rokach
• Full taxonomy – for all the nine steps of the KDD process. We have shown a
taxonomy for the DM methods, but a taxonomy is needed for each of the nine
steps. Such a taxonomy will contain methods appropriate for each step (even the
first one), and for the whole process as well.
• Meta-algorithms – algorithms that examine the characteristics of the data in order
to determine the best methods, and parameters (including decompositions).
• Benefit analysis – to understand the effect of the potential KDD/DM results on
the enterprise.
• Problem characteristics – analysis of the problem itself for its suitability to the
KDD process.
• Mining complex objects of arbitrary type – Expanding Data Mining inference to
also include data from pictures, voice, video, audio, etc. This will require adapting
and developing new methods (for example, for comparing pictures using clustering
and compression analysis).
• Temporal aspects – many data mining methods assume that discovered patterns
are static. However, in practice patterns in the database evolve over time. This
poses two important challenges. The first challenge is to detect when concept
drift occurs. The second challenge is to keep the patterns up-to-date without
inducing the patterns from scratch.
• Distributed Data Mining – The ability to seamlessly and effectively employ Data
Mining methods on databases that are located in various sites. This problem is
especially challenging when the data structures are heterogeneous rather than
homogeneous.
• Expanding the knowledge base for the KDD process, including not only data but
also extraction from known facts to principles (for example, extracting from a
machine its principle, and thus being able to apply it in other situations).
• Expanding Data Mining reasoning to include creative solutions, not just the ones
that appear in the data, but being able to combine solutions and generate another
approach.
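To make the first challenge under "Temporal aspects" in the list above more concrete, here is a minimal sketch of one simple way to notice concept drift: compare a model's error rate over a recent sliding window with its error rate over an initial reference window. This is an illustrative heuristic only, not a method taken from the handbook; the class name WindowDriftMonitor, its parameters, and the threshold value are all assumptions made for this example.

```python
from collections import deque


class WindowDriftMonitor:
    """Hypothetical helper: suspects drift when the recent error rate
    exceeds the reference error rate by more than a fixed threshold."""

    def __init__(self, window_size: int = 50, threshold: float = 0.2) -> None:
        self.window_size = window_size
        self.threshold = threshold            # assumed tolerance, not from the source
        self.reference = []                   # first window_size outcomes, kept fixed
        self.recent = deque(maxlen=window_size)  # sliding window of latest outcomes

    def update(self, was_error: bool) -> bool:
        """Record one prediction outcome; return True if drift is suspected."""
        if len(self.reference) < self.window_size:
            # Still collecting the reference window.
            self.reference.append(was_error)
            return False
        self.recent.append(was_error)
        if len(self.recent) < self.window_size:
            # Not enough recent outcomes to compare yet.
            return False
        reference_rate = sum(self.reference) / self.window_size
        recent_rate = sum(self.recent) / self.window_size
        return recent_rate - reference_rate > self.threshold


# Usage: 20 correct predictions, then the model starts failing.
monitor = WindowDriftMonitor(window_size=10, threshold=0.25)
outcomes = [False] * 20 + [True] * 10
flags = [monitor.update(outcome) for outcome in outcomes]
```

A flag from such a monitor only signals that the patterns may be out of date; how to refresh them without inducing them from scratch is the second challenge, which this sketch does not address.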
1.6 The Organization of the Handbook


This handbook is organized in eight parts. From the first part through the sixth,
the book presents a comprehensive but concise description of the different
methods used throughout the KDD process. Each part describes the classic methods
as well as the extensions and novel methods developed recently. Along with the
algorithmic description of each method, the reader is given an explanation of
the circumstances in which the method is applicable and the consequences and
trade-offs of using it, including references for further reading. Part seven
presents real-world case studies and how they can be solved. The last part surveys
some of the software and tools available today. The first part covers preprocessing
methods (Steps 3 and 4 of the KDD process). The Data Mining methods are presented
in the second part, beginning with an introduction and the most frequently used
supervised methods. The third part of the handbook considers
1 Introduction to Knowledge Discovery and Data Mining 11
the unsupervised methods. The fourth part is about methods termed soft computing,
which include fuzzy logic, evolutionary algorithms, neural networks, etc. With this
foundation established, the fifth part presents supporting methods needed for Data
Mining. The sixth part covers advanced methods such as text mining and web mining.
Building on the methods described so far, the seventh part is concerned with
applications in medicine, biology and manufacturing. The eighth and final part of
this handbook deals with software tools. This part is not a complete survey of the
software available, but rather a selection of representatives of the different types
of software packages that exist in today’s market.
1.7 New to This Edition
Since the first edition was published five years ago, the field of data mining has
evolved in the following aspects:
1.7.1 Mining Rich Data Formats
While in the past data mining methods could effectively analyze only flat tables,
in recent years new mature techniques have been developed for mining rich data
formats:
• Data Stream Mining - The conventional focus of data mining research was on
mining resident data stored in large data repositories. The growth of technologies
such as wireless sensor networks has contributed to the emergence of
data streams. The distinctive characteristic of such data is that it is unbounded in
terms of continuity of data generation. This form of data has been termed data
streams to express its flowing nature. Mohamed Medhat Gaber, Arkady Zaslavsky,
and Shonali Krishnaswamy present a review of the state of the art in mining data
streams (Chapter 39). Clustering, classification, frequency counting, and time series
analysis techniques are discussed. Different systems that use data stream
mining techniques are also presented.
• Spatio-temporal - Spatio-temporal clustering is a process of grouping objects
based on their spatial and temporal similarity. It is a relatively new subfield of
data mining, which has gained high popularity, especially in geographic information
sciences, due to the pervasiveness of all kinds of location-based or environmental
devices that record the position, time and/or environmental properties of an
object or set of objects in real time. As a consequence, different types and large
amounts of spatio-temporal data have become available, introducing new challenges
to data analysis that require novel approaches to knowledge discovery. Slava
Kisilevich, Florian Mansmann, Mirco Nanni and Salvatore Rinzivillo provide a
classification of different types of spatio-temporal data (Chapter 44). They then
focus on one type of spatio-temporal clustering, trajectory clustering, provide
an overview of the state-of-the-art approaches and methods of spatio-temporal
clustering, and finally present several scenarios in different application domains
such as movement, cellular networks and environmental studies.
• Multimedia Data Mining - Zhongfei Mark Zhang and Ruofei Zhang present new
methods for Multimedia Data Mining (Chapter 57). Multimedia data mining, as
the name suggests, might be taken for a mere combination of two emerging areas:
multimedia and data mining. Instead, multimedia data mining research focuses
on merging multimedia and data mining research to exploit the synergy between
the two areas, promoting the understanding and advancing the development of
knowledge discovery in multimedia data.
1.7.2 New Techniques
In this edition the following new techniques are covered:
• In Chapter 23, Swagatam Das and Ajith Abraham present a family of bio-inspired
algorithms, known as Swarm Intelligence (SI). SI has been applied successfully
to a number of real-world clustering problems. This chapter explores the role of
SI in clustering different kinds of datasets. It also describes a new SI technique for
partitioning a linearly non-separable dataset into an optimal number of clusters
in the kernel-induced feature space. Computer simulations undertaken in this
research are also provided to demonstrate the effectiveness of the proposed
algorithm.
• Multi-label classification - Most of the research in the field of supervised
learning has focused on single-label tasks, where training instances are associated
with a single label from a set of disjoint labels. However, textual data, such
as documents and web pages, are frequently annotated with more than a single
label. In Chapter 34, Grigorios Tsoumakas, Ioannis Katakis and Ioannis Vlahavas
review techniques for addressing multi-label classification tasks, grouped
into two categories: i) problem transformation, and ii) algorithm adaptation.
The first group of methods is algorithm independent. They transform the learning
task into one or more single-label classification tasks, for which a large bibliography
of learning algorithms exists. The second group of methods extends specific
learning algorithms in order to handle multi-label data directly.
• Sequence Analysis - In Chapter 29, Noa Ruschin Rimini and Oded Maimon
introduce a new visual analysis technique for sequence datasets using an Iterated
Function System (IFS). IFS produces a fractal representation of sequences. The
proposed method offers an effective tool for visual detection of sequence patterns
influencing a target attribute, and requires no understanding of mathematical or
statistical algorithms. Moreover, it enables the detection of sequence patterns of
any length, without predefining the sequence pattern length.
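The problem transformation idea from the multi-label item above can be illustrated with a minimal sketch. One common transformation, usually called binary relevance, turns a multi-label dataset into one binary (single-label) dataset per label, so that any ordinary classifier can then be trained on each. The function name, the toy documents, and the label names below are assumptions made for this illustration; they are not taken from Chapter 34.

```python
from typing import Dict, List, Sequence, Set, Tuple


def binary_relevance_transform(
    instances: Sequence[str],
    label_sets: Sequence[Set[str]],
    all_labels: Sequence[str],
) -> Dict[str, List[Tuple[str, bool]]]:
    """Build one binary dataset per label.

    For each label, every instance is paired with True if that label
    is among the instance's labels, and False otherwise.
    """
    return {
        label: [
            (instance, label in labels)
            for instance, labels in zip(instances, label_sets)
        ]
        for label in all_labels
    }


# Toy multi-label data (hypothetical): each document carries a set of labels.
docs = ["match report", "election coverage", "stadium funding vote"]
doc_labels = [{"sports"}, {"politics"}, {"sports", "politics"}]

tasks = binary_relevance_transform(docs, doc_labels, ["sports", "politics"])

# Each entry of `tasks` is now an ordinary single-label (binary) problem.
for label, dataset in tasks.items():
    print(label, dataset)
```

Binary relevance treats the labels independently, which keeps the transformation simple but ignores any correlation between labels.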
1.7.3 New Application Domains

A new domain for KDD is the world of nanoparticles. Oded Maimon and Abel
Browarnik present a smart repository system with text and data mining for this
domain (Chapter 66). The impact of nanoparticles on health and the environment is
a significant research subject, driving increasing interest from the scientific
community, regulatory bodies and the general public. The growing body of knowledge
in this area, consisting of scientific papers and other types of publications (such
as surveys and whitepapers), emphasizes the need for a methodology to alleviate the
complexity of reviewing all the available information and discovering all the
underlying facts, using data mining algorithms and methods.
1.7.4 New Consideration
In Chapter 35, Vicenc Torra describes the main tools for privacy in data mining. He
presents an overview of the tools for protecting data, and then focuses on protection
procedures. Information loss and disclosure risk measures are also described.
1.7.5 Software
In Chapter 67, Zhang and Segall present selected commercial software for data
mining, text mining, and web mining. The selected software packages are compared
by their features and also applied to available data sets. Screenshots of each of
the selected packages are presented, as are conclusions and future directions.
1.7.6 Major Updates
Finally, several chapters have been updated. Specifically, in Chapter 19, Alex Freitas
presents a brief overview of evolutionary algorithms (EAs), focusing mainly on two
kinds of EAs, viz. Genetic Algorithms (GAs) and Genetic Programming (GP). Then the
chapter reviews the main concepts and principles used by EAs designed for solving
several data mining tasks, namely: discovery of classification rules, clustering,
attribute selection and attribute construction.
In Chapter 21, Peter Zhang provides an overview of neural network models and
their applications to data mining tasks. He traces the historical development of the
field of neural networks and presents three important classes of neural models:
feedforward multilayer networks, Hopfield networks, and Kohonen’s self-organizing
maps.
In Chapter 24, we discuss how fuzzy logic extends the envelope of the main data
mining tasks: clustering, classification, regression and association rules. We begin by
presenting a formulation of data mining using fuzzy logic attributes. Then, for
each task, we provide a survey of the main algorithms and a detailed description (i.e.,
pseudo-code) of the most popular algorithms.

Part I
Preprocessing Methods

2 Data Cleansing: A Prelude to Knowledge Discovery
Jonathan I. Maletic¹ and Andrian Marcus²
¹ Kent State University
² Wayne State University
Summary. This chapter analyzes the problem of data cleansing and the identification of
potential errors in data sets. The differing views of data cleansing are surveyed and
reviewed and a brief overview of existing data cleansing tools is given. A general
framework of the data cleansing process is presented as well as a set of general methods
that can be used to address the problem. The applicable methods include statistical
outlier detection, pattern matching, clustering, and Data Mining techniques. The
experimental results of applying these methods to a real world data set are also given.
Finally, research directions necessary to further address the data cleansing problem
are discussed.
Key words: Data Cleansing, Data Cleaning, Data Mining, Ordinal Rules, Data Quality,
Error Detection, Ordinal Association Rules

2.1 INTRODUCTION
The quality of a large real-world data set depends on a number of issues (Wang
et al., 1995, Wang et al., 1996), but the source of the data is the crucial factor. Data
entry and acquisition are inherently prone to errors, both simple and complex. Much
effort can be allocated to this front-end process to reduce entry errors, but the
fact often remains that errors in a large data set are common. While one
can establish an acquisition process to obtain high quality data sets, this does little
to address the problem of existing or legacy data. Field error rates in the data
acquisition phase are typically around 5% or more (Orr, 1998, Redman, 1998) even
when using the most sophisticated measures for error prevention available. Recent
studies have shown that as much as 40% of the collected data is dirty in one way or
another (Fayyad et al., 2003).
For existing data sets the logical solution is to attempt to cleanse the data in some
way. That is, explore the data set for possible problems and endeavor to correct the
errors. Of course, for any real-world data set, doing this task by hand is completely
out of the question given the number of person-hours involved. Some organizations
spend millions of dollars per year to detect data errors (Redman, 1998). A manual
O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed.,
DOI 10.1007/978-0-387-09823-4_2, © Springer Science+Business Media, LLC 2010
