

Scientific Data Mining and Knowledge Discovery




Mohamed Medhat Gaber
Editor

Scientific Data Mining and
Knowledge Discovery
Principles and Foundations



Editor
Mohamed Medhat Gaber
Caulfield School of Information Technology
Monash University
900 Dandenong Rd.
Caulfield East, VIC 3145
Australia


Color images of this book can be found at www.springer.com/978-3-642-02787-1
ISBN 978-3-642-02787-1
e-ISBN 978-3-642-02788-8
DOI 10.1007/978-3-642-02788-8
Springer Heidelberg Dordrecht London New York


Library of Congress Control Number: 2009931328
ACM Computing Classification (1998): I.5, I.2, G.3, H.3
© Springer-Verlag Berlin Heidelberg 2010
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations
are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not
imply, even in the absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.
Cover design: KuenkelLopka GmbH
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)


This book is dedicated to:
My parents: Dr. Medhat Gaber and
Mrs. Mervat Hassan
My wife: Dr. Nesreen Hassaan
My children: Abdul-Rahman and Mariam




Contents

Introduction
Mohamed Medhat Gaber

Part I Background

Machine Learning
Achim Hoffmann and Ashesh Mahidadia

Statistical Inference
Shahjahan Khan

The Philosophy of Science and its Relation to Machine Learning
Jon Williamson

Concept Formation in Scientific Knowledge Discovery from a Constructivist View
Wei Peng and John S. Gero

Knowledge Representation and Ontologies
Stephan Grimm

Part II Computational Science

Spatial Techniques
Nafaa Jabeur and Nabil Sahli

Computational Chemistry
Hassan Safouhi and Ahmed Bouferguene

String Mining in Bioinformatics
Mohamed Abouelhoda and Moustafa Ghanem

Part III Data Mining and Knowledge Discovery

Knowledge Discovery and Reasoning in Geospatial Applications
Nabil Sahli and Nafaa Jabeur

Data Mining and Discovery of Chemical Knowledge
Lu Wencong

Data Mining and Discovery of Astronomical Knowledge
Ghazi Al-Naymat

Part IV Future Trends

On-board Data Mining
Steve Tanner, Cara Stein, and Sara J. Graves

Data Streams: An Overview and Scientific Applications
Charu C. Aggarwal

Index


Contributors

Mohamed Abouelhoda Cairo University, Orman, Gamaa Street, 12613 Al Jizah, Giza, Egypt; Nile University, Cairo-Alex Desert Rd, Cairo 12677, Egypt

Charu C. Aggarwal IBM T. J. Watson Research Center, NY, USA

Ghazi Al-Naymat School of Information Technologies, The University of Sydney, Sydney, NSW 2006, Australia

Ahmed Bouferguene Campus Saint-Jean, University of Alberta, 8406, 91 Street, Edmonton, AB, Canada T6C 4G9

Mohamed Medhat Gaber Centre for Distributed Systems and Software Engineering, Monash University, 900 Dandenong Rd, Caulfield East, VIC 3145, Australia

John S. Gero Krasnow Institute for Advanced Study and Volgenau School of Information Technology and Engineering, George Mason University, USA

Moustafa Ghanem Imperial College, South Kensington Campus, London SW7 2AZ, UK

Sara J. Graves University of Alabama in Huntsville, AL 35899, USA

Stephan Grimm FZI Research Center for Information Technologies, University of Karlsruhe, Baden-Württemberg, Germany

Achim Hoffmann University of New South Wales, Sydney 2052, NSW, Australia

Nafaa Jabeur Department of Computer Science, Dhofar University, Salalah, Sultanate of Oman

Shahjahan Khan Department of Mathematics and Computing, Australian Centre for Sustainable Catchments, University of Southern Queensland, Toowoomba, QLD, Australia

Ashesh Mahidadia University of New South Wales, Sydney 2052, NSW, Australia

Wei Peng Platform Technologies Research Institute, School of Electrical and Computer Engineering, RMIT University, Melbourne VIC 3001, Australia

Cara Stein University of Alabama in Huntsville, AL 35899, USA

Hassan Safouhi Campus Saint-Jean, University of Alberta, 8406, 91 Street, Edmonton, AB, Canada T6C 4G9

Nabil Sahli Department of Computer Science, Dhofar University, Salalah, Sultanate of Oman

Steve Tanner University of Alabama in Huntsville, AL 35899, USA

Lu Wencong Shanghai University, 99 Shangda Road, BaoShan District, Shanghai, People's Republic of China

Jon Williamson King's College London, Strand, London WC2R 2LS, England, UK


Introduction
Mohamed Medhat Gaber

“It is not my aim to surprise or shock you – but the simplest way I can summarise is to say
that there are now in the world machines that think, that learn and that create. Moreover,
their ability to do these things is going to increase rapidly until – in a visible future – the
range of problems they can handle will be coextensive with the range to which the human
mind has been applied.” (Herbert A. Simon, 1916-2001)

1 Overview

This book suits both graduate students and researchers with a focus on discovering
knowledge from scientific data. The use of computational power for data analysis
and knowledge discovery in scientific disciplines has its roots in the revolution of high-performance computing systems. Computational science in physics, chemistry, and biology represents the first step towards the automation of data analysis tasks. The rationale behind the development of computational science in different areas was to automate the mathematical operations performed in those areas; no attention was paid to the scientific discovery process itself. Automated Scientific Discovery (ASD) [1–3] represents the second natural step. ASD attempted to automate the process of theory discovery, supported by studies in the philosophy of science and the cognitive sciences. Although early research articles reported notable successes, the area has not evolved much further, for several reasons. The most important reason was the lack of interaction between scientists and the automated systems.
With the evolution in data storage, large databases have stimulated researchers
from many areas, especially machine learning and statistics, to adopt and develop
new techniques for data analysis. This has led to a new area of data mining and
knowledge discovery. Applications of data mining in scientific domains have
been studied in many areas. The focus of data mining in this area has been to analyze data to help understand the nature of scientific datasets; automation of the whole scientific discovery process has not been the focus of data mining research.
Statistical, computational, and machine learning tools have been used in the area of scientific data analysis. With the advances in ontologies and knowledge representation, ASD has great prospects in the future. In this book, we provide the reader with a complete view of the different tools used in the analysis of data for scientific discovery. The book serves as a starting point for students and researchers interested in this area. We hope that it represents an important step towards the evolution of scientific data mining and automated scientific discovery.

2 Book Organization
The book is organized into four parts. Part I provides the reader with background on the disciplines that have contributed to scientific discovery. Hoffmann and Mahidadia provide a detailed introduction to the area of machine learning in Chapter Machine Learning. Chapter Statistical Inference by Khan gives the reader a clear start-up overview of the field of statistical inference. The relationship between scientific discovery and the philosophy of science is provided by Williamson in Chapter The Philosophy of Science and its Relation to Machine Learning. Cognitive science and its relationship to the area of scientific discovery is detailed by Peng and Gero in Chapter Concept Formation in Scientific Knowledge Discovery from a Constructivist View. Finally, Part I is concluded with an overview of the area of ontologies and knowledge representation by Grimm in Chapter Knowledge Representation and Ontologies. This part is highly recommended for graduate students and researchers starting in the area of using data mining for discovering knowledge in scientific disciplines. It could also serve as excellent introductory material for instructors teaching data mining and machine learning courses. The chapters are written by experts in their respective fields.
After the introductory material in Part I, Part II provides the reader with computational methods used in the discovery of knowledge in three different fields. In Chapter Spatial Techniques, Jabeur and Sahli survey the different computational techniques used in the geospatial area. Safouhi and Bouferguene, in Chapter Computational Chemistry, provide the reader with details on the area of computational chemistry. Finally, Part II is concluded by Abouelhoda and Ghanem in Chapter String Mining in Bioinformatics, which discusses the well-established area of bioinformatics and outlines the different computational tools used in this area.
The use of data mining techniques to discover scientific knowledge is detailed in three chapters in Part III. Chapter Knowledge Discovery and Reasoning in Geospatial Applications by Sahli and Jabeur provides the reader with techniques used in reasoning and knowledge discovery for geospatial applications. The second chapter in this part, Chapter Data Mining and Discovery of Chemical Knowledge, is written by Wencong, who presents a number of projects and details the results of using data mining techniques to discover chemical knowledge.


[Fig. 1 Book organization. The diagram groups the chapters into: scientific disciplines that contributed to automated scientific discovery (Part I: machine learning, statistical inference, philosophy of science, cognitive science, knowledge representation); computational sciences (Part II: geospatial, chemistry, bioinformatics); data mining techniques in scientific knowledge discovery (Part III: geospatial, chemical, and astronomical knowledge); and future trends and directions (Part IV: onboard mining, data stream mining).]

Finally, the last chapter of this part, Chapter Data Mining and Discovery of Astronomical Knowledge, by Al-Naymat provides a showcase of using data mining techniques to discover astronomical knowledge.
The book is concluded with a couple of chapters by eminent researchers in Part IV. This part represents future directions of using data mining techniques in the areas of scientific discovery. Chapter On-Board Data Mining by Tanner et al. describes several projects using the new area of onboard mining in spacecraft. Aggarwal, in Chapter Data Streams: An Overview and Scientific Applications, provides an overview of the area of data streams and pointers to applications in scientific discovery.
The organization of this book follows a historical view, starting with the well-established foundations and principles in Part I. This is followed by the traditional computational techniques in different scientific disciplines in Part II, and then by the core of this book, the use of data mining techniques in the process of discovering scientific knowledge, in Part III. Finally, new trends and directions in automated scientific discovery are discussed in Part IV. This organization is depicted in Fig. 1.

3 Final Remarks
The area of automated scientific discovery has a long history, dating back to the 1980s, when Langley et al. [3] published their book "Scientific Discovery: Computational Explorations of the Creative Processes", outlining early success stories in the area.



Although research in this area did not progress much during the 1990s and the first years of the new century, we believe that, with the rise of data mining and machine learning, the area of automated scientific discovery will witness accelerated development.
The use of data mining techniques to discover scientific knowledge has recently seen notable successes in the area of biology [4], and, with less impact, in the areas of chemistry [5] and physics and astronomy [6]. The next decade will witness more success stories of discovering scientific knowledge automatically, due to the large amounts of data available and the faster-than-ever production of scientific data.


References
1. R.E. Valdes-Perez, Knowl. Eng. Rev. 11(1), 57–66 (1996)
2. P. Langley, Int. J. Hum. Comput. Stud. 53, 393–410 (2000)
3. P. Langley, H.A. Simon, G.L. Bradshaw, J.M. Zytkow, Scientific Discovery: Computational Explorations of the Creative Processes (MIT Press, Cambridge, MA, 1987)
4. J.T.L. Wang, M.J. Zaki, H.T.T. Toivonen, D. Shasha, in Data Mining in Bioinformatics, eds. by X. Wu, L. Jain, Advanced Information and Knowledge Processing (Springer, London, 2005)
5. N. Chen, W. Lu, J. Yang, G. Li, Support Vector Machine in Chemistry (World Scientific Publishing, Singapore, 2005)
6. H. Karimabadi, T. Sipes, H. White, M. Marinucci, A. Dmitriev, J. Chao, J. Driscoll, N. Balac, J. Geophys. Res. 112(A11) (2007)


Part I

Background




Machine Learning
Achim Hoffmann and Ashesh Mahidadia

The purpose of this chapter is to present fundamental ideas and techniques of machine learning suitable for the field of this book, i.e., for automated scientific discovery. The chapter focuses on those symbolic machine learning methods that produce results suitable to be interpreted and understood by humans. This is particularly important in the context of automated scientific discovery, as the scientific theories to be produced by machines are usually meant to be interpreted by humans.
This chapter contains some of the most influential ideas and concepts in machine learning research, to give the reader a basic insight into the field. After the introduction in Sect. 1, general ideas of how learning problems can be framed are given in Sect. 2. That section provides useful perspectives to better understand what learning algorithms actually do. Section 3 presents the version space model, which is an early learning algorithm as well as a conceptual framework that provides important insight into the general mechanisms behind most learning algorithms. In Sect. 4, a family of learning algorithms, the AQ family for learning classification rules, is presented. The AQ family belongs to the early approaches in machine learning. Section 5 then presents the basic principles of decision tree learners, which belong to the most influential class of inductive learning algorithms today. A more recent group of learning systems, which learn relational concepts within the framework of logic programming, is presented in Sect. 6. This is a particularly interesting group of learning systems, since the framework also allows the incorporation of background knowledge, which may assist in generalisation. Section 7 discusses association rules, a technique that comes from the related field of data mining. Section 8 presents the basic idea of the naive Bayesian classifier. While this is a very popular learning technique, the learning result is not well suited for human comprehension, as it is essentially a large collection of probability values. In Sect. 9, we present a generic method for improving the accuracy of a given learner by generating multiple classifiers using variations of the training data. While this works well in most cases, the resulting classifiers have significantly increased complexity

and, hence, tend to destroy the human readability of the learning result that a single learner may produce. Section 10 contains a summary, briefly mentions other techniques not discussed in this chapter, and presents an outlook on the potential of machine learning in the future.

1 Introduction
Numerous approaches to learning have been developed for a large variety of possible applications. While learning for classification prevails, other learning tasks have been addressed as well, including learning to control dynamic systems, general function approximation, prediction, and learning to search more efficiently for solutions of combinatorial problems.
For different types of applications, specialised algorithms have been developed, although, in principle, most of the learning tasks can be reduced to each other. For example, a prediction problem can be reduced to a classification problem by defining classes for each of the possible predictions.1 Equally, a classification problem can be reduced to a prediction problem, and so on.
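As an illustration of this kind of reduction, the following is a minimal sketch (not from the chapter) of turning a sequence prediction problem into a classification problem: each fixed-length window of past values becomes an attribute vector, and the value that follows it becomes the class label. The function name, window length and toy weather sequence are illustrative assumptions.

def prediction_to_classification(sequence, window=3):
    """Turn a sequence into (attribute_vector, class_label) training examples."""
    examples = []
    for i in range(len(sequence) - window):
        attributes = tuple(sequence[i:i + window])  # the last `window` observed values
        label = sequence[i + window]                # the value to be predicted
        examples.append((attributes, label))
    return examples

# Example usage on a toy weather sequence
weather = ["sun", "sun", "rain", "sun", "sun", "rain", "sun"]
for x, y in prediction_to_classification(weather):
    print(x, "->", y)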

The Learner’s Way of Interaction
Another aspect of learning is the way in which a learning system interacts with its environment. A common setting is to provide the learning system with a number of classified training examples. Based on that information, the learner attempts to find a general classification rule which correctly classifies both the given training examples and unseen objects of the population. Another setting, unsupervised learning, provides the learner only with unclassified objects. The task is to determine which objects belong to the same class. This is a much harder task for a learning algorithm than if classified objects are presented. Interactive learning systems have also been developed, which allow interaction with the user while learning. This allows the learner to request further information in situations where it seems to be needed. Further information can range from merely providing a randomly chosen extra classified or unclassified example to answering specific questions which have been generated by the learning system. The latter allows the learner to acquire information in a very focused way. Some of the ILP systems in Sect. 6 are interactive learning systems.


1
In prediction problems there is a sequence of values given, on the basis of which the next value of the sequence is to be predicted. The given sequence, however, may usually be of varying length. In contrast, many classification problems are based on a standard representation of fixed length. However, exceptions exist here as well.



Another more technical aspect concerns how the gathered information is internally processed and finally organised. According to that aspect, the following types of representations are among the most frequently used for supervised learning of classifiers:

• Decision trees
• Classification rules (production rules) and decision lists
• PROLOG programs
• The structure and parameters of a neural network
• Instance-based learning (nearest neighbour classifiers etc.)2
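To give a flavour of the second item, the following is a minimal illustrative sketch (not from the chapter) of a classification-rule or decision-list representation: each rule is a set of attribute tests plus a class, and rules are tried in order until one fires. The attribute names and rules are made-up assumptions.

# Each rule: (conditions on attribute values, class label); rules form a decision list.
rules = [
    ({"shape": "triangle"}, "+"),                   # IF shape = triangle THEN class = +
    ({"size": "small", "shape": "circle"}, "-"),    # IF size = small AND shape = circle THEN class = -
]

def classify(obj, rules, default="-"):
    """Apply the first rule whose conditions all hold for the object (a dict of attributes)."""
    for conditions, label in rules:
        if all(obj.get(a) == v for a, v in conditions.items()):
            return label
    return default

print(classify({"size": "big", "shape": "triangle"}, rules))   # prints "+"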

In the following, the focus of the considerations will be on learning classification functions. A major part of the considerations, however, is applicable to a larger class of tasks, since many tasks can essentially be reduced to classification tasks. More specifically, the focus will be on concept learning, which is a special case of classification learning: concept learning attempts to find representations which resemble in some way the concepts humans may acquire. While it is fairly unclear how humans actually do that, in the following we understand by concept learning the attempt to find a "comprehensible"3 representation of a classification function.

2 General Preliminaries for Learning Concepts from Examples
In this section, a unified framework will be provided into which almost all learning systems that learn concepts, i.e. classifiers, from examples fit, including neural networks. The following components can be distinguished to characterise concept learning systems:

• A set of examples
• A learning algorithm
• A set of possible learning results, i.e. a set of concepts

Concerning the set of examples, it is an important issue to find a suitable representation for them. In fact, it has been recognised that the representation of examples may have a major impact on the success or failure of learning.

2
That means gathering a set of examples and a similarity function to determine the most similar example for a given new object. The most similar example is used to determine the class of the presented object. Case-based reasoning is also a related technique of significant popularity, see e.g. [1, 2].
3
Unfortunately, this term is also quite unclear. However, some types of representations are certainly more difficult to grasp for an average human than others. For example, cascaded linear threshold functions, as present in multi-layer perceptrons, seem fairly difficult to comprehend, as opposed to, e.g., Boolean formulas.
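Footnote 2 describes instance-based (nearest neighbour) classification; the following is a minimal sketch under those stated assumptions, with a made-up similarity function that simply counts matching attribute values.

def nearest_neighbour_classify(training, similarity, new_object):
    """training: list of (object, class_label) pairs; return the label of the most similar stored example."""
    most_similar_example, label = max(training, key=lambda ex: similarity(ex[0], new_object))
    return label

# Toy similarity: number of attribute values two objects have in common, position-wise
def sim(a, b):
    return sum(x == y for x, y in zip(a, b))

training = [(("big", "triangle"), "+"), (("small", "circle"), "-")]
print(nearest_neighbour_classify(training, sim, ("big", "square")))   # prints "+"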




2.1 Representing Training Data
The representation of training data, i.e. of examples for learning concepts, has to
serve two ends: On one hand, the representation has to suit the user of the learning
system, in that it is easy to reflect the given data in the chosen representation form.
On the other hand, the representation has to suit the learning algorithm. Suiting the
learning algorithm again has at least two facets: Firstly, the learning algorithm has
to be able to digest the representations of the data. Secondly, the learning algorithm
has to be able to find a suitable concept, i.e. a useful and appropriate generalisation
from the presented examples.
The most frequently used representation of data is some kind of attribute or feature vector; that is, objects are described by a number of attributes.
The most commonly used kinds of attributes are the following:


• Unstructured attributes:
– Boolean attributes, i.e. either the object does have an attribute or it does not. Usually specified by the values {f, t}, or {0, 1}, or sometimes, in the context of neural networks, by {-1, 1}.
– Discrete attributes, i.e. the attribute has a number of possible values (more than two), such as a number of colours {red, blue, green, brown}, shapes {circle, triangle, rectangle}, or even numbers where the values do not carry any meaning, or any other set of scalar values.

• Structured attributes, where the possible values have a presumably meaningful relation to each other:
– Linear attributes. Usually the possible values of a linear attribute are a set of numbers, e.g. {0, 1, ..., 15}, where the ordering of the values is assumed to be relevant for generalisations. However, non-numerical values could of course also be used where such an ordering is assumed to be meaningful. For example, colours may be ordered according to their brightness.
– Continuous attributes. The values of these attributes are normally reals (with a certain precision) within a specified interval. Similarly as with linear attributes, the ordering of the values is assumed to be relevant for generalisations.
– Tree-structured attributes. The values of these attributes are organised in a subsumption hierarchy. That is, for each value it is specified what other values it subsumes. This specification amounts to a tree-structured arrangement of the values. See 5 for an example.

Using attribute vectors of various types, it is fairly easy to represent objects of manifold nature. For example, cars can be described by features such as colour, weight, height, length, width, maximal speed, etc.
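To make this concrete, here is a minimal sketch (not from the chapter) of how such an attribute-vector representation of cars might be encoded; the attribute names, value sets and the example object are illustrative assumptions.

from typing import NamedTuple

class Car(NamedTuple):
    colour: str            # discrete attribute, e.g. one of {red, blue, green, brown}
    doors: int             # linear attribute: the ordering of values is meaningful
    weight_kg: float       # continuous attribute within some interval
    max_speed_kmh: float   # continuous attribute
    automatic: bool        # boolean attribute, {0, 1}

# One training example: the attribute vector together with its class label
example = (Car(colour="red", doors=4, weight_kg=1250.0, max_speed_kmh=180.0, automatic=True),
           "family_car")
print(example)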



2.2 Learning Algorithms
Details of various learning algorithms are given later in this chapter. However, generally speaking, we can say that every learning algorithm searches implicitly or
explicitly in a space of possible concepts for a concept that sufficiently fits the
presented examples. By considering the set of concepts and their representations
through which a learning algorithm is actually searching, the algorithm can be characterised and its suitability for a particular application can be assessed. Section 2.3
discusses how concepts can be represented.

2.3 Objects, Concepts and Concept Classes
Before discussing the representation of concepts, some remarks on their intended meaning should be made. In concept learning, concepts are generally understood to subsume a certain set of objects. Consequently, concepts can formally be described with respect to a given set of possible objects to be classified. The set of possible objects is defined by the kind of representation chosen for representing the examples. Considering, for instance, attribute vectors for describing objects, there is usually a much larger number of possible objects than the number of objects which may actually occur. This is due to the fact that, in the case of attribute vectors, the set of possible objects is simply given by the Cartesian product of the sets of allowed values for each of the attributes. That is, every combination of attribute values is allowed, although there may be no "pink elephants", "green mice", or "blue rabbits".
However, formally speaking, for a given set of objects X, a concept c is defined by its extension in X, i.e. we can say c is simply a subset of X. That implies that for a set of n objects, i.e. for |X| = n, there are 2^n different concepts. However, most actual learning systems will not be able to learn all possible concepts. They will rather only be able to learn a certain subset. Those concepts which can potentially be learnt are usually called the concept class or concept space of a learning system. In many contexts, concepts which can be learnt are also called hypotheses, and the concept space is called the hypothesis space. Later, more formal definitions will be introduced. Also, in the rather practical considerations of machine learning, a slightly different terminology is used than in the more mathematically oriented considerations.
However, in general it can be said that an actual learning system L, given n possible objects, works only on a particular subset of all the 2^n different possible concepts, which is called the concept space C of L. For C, both of the following conditions hold:
1. For every concept c ∈ C there exists training data such that L will learn c.
2. For all possible training data, L will learn some concept c such that c ∈ C. That is, L will never learn a concept c ∉ C.


Considering a set of concepts, there is the huge number of 2^(2^n) different sets of concepts on a set of n objects. To give a numerical impression: looking at 30 Boolean features describing the objects in X under consideration would amount to n = 2^30 ≈ 1000000000 = 10^9 different possible objects. Thus, there exist 2^1000000000 ≈ 10^300000000 different possible concepts and 2^(2^1000000000) ≈ 10^(10^300000000) different concept spaces, an astronomically large number.
Another characteristic of learning algorithms, besides their concept space, is the
particular order in which concepts are considered. That is, if two concepts are
equally or almost equally confirmed by the training data, which of these two concepts will be learnt?
In Sect. 2.4, the two issues are treated in more detail to provide a view of learning
which makes the similarities and dissimilarities among different algorithms more
visible.
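The following is a minimal illustrative sketch (not from the chapter) of this set-theoretic view: concepts are subsets of a small object set X, so there are 2^n concepts and 2^(2^n) possible concept spaces. The object names anticipate the geometric example used in Sect. 3.

from itertools import combinations

X = ["b.s", "b.t", "b.c", "s.s", "s.t", "s.c"]   # six possible objects

def all_concepts(objects):
    """Every concept over X is a subset of X, so there are 2^len(X) of them."""
    return [set(c) for r in range(len(objects) + 1) for c in combinations(objects, r)]

n = len(X)
concepts = all_concepts(X)
print(len(concepts) == 2 ** n)   # True: 2^6 = 64 concepts over X
print(2 ** (2 ** n))             # number of different concept spaces over X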

2.4 Consistent and Complete Concepts
In machine learning, some of the technical terms describing the relation between a hypothesis of how to classify objects and a set of classified objects (usually the training sample) are used differently in different contexts. In most mathematical/theoretical considerations, a hypothesis h is called consistent with the training set of classified objects if and only if the hypothesis h classifies all the given objects in the same way as given in the training set. A hypothesis h′ is called inconsistent with a given training set if there is an object which is classified differently in the training set than by the hypothesis h′.
Opposed to that, the terminology following Michalski [3] for concept learning assumes that there are only two classes of objects. One is the class of positive examples of a concept to be learned, and the remaining objects are negative examples. A hypothesis h for a concept description is said to cover those objects which it classifies as positive examples. Following this perspective, it is said that a hypothesis h is complete if h covers all positive examples in a given training set. Further, a hypothesis h is said to be consistent if it does not cover any of the given negative examples. The possible relationships between a hypothesis and a given set of training data are shown in Fig. 1.
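A minimal sketch (not from the chapter) of this terminology, where a hypothesis is simply a function that says which objects it covers; the object encoding and the toy hypothesis are illustrative assumptions.

def is_complete(hypothesis, positives):
    """Complete: the hypothesis covers every positive training example."""
    return all(hypothesis(x) for x in positives)

def is_consistent(hypothesis, negatives):
    """Consistent: the hypothesis covers none of the negative training examples."""
    return not any(hypothesis(x) for x in negatives)

# Toy training data: objects are (size, shape) pairs
positives = [("big", "triangle"), ("small", "triangle")]
negatives = [("small", "circle")]

h = lambda obj: obj[1] == "triangle"   # hypothesis: "all triangles are positive"
print(is_complete(h, positives), is_consistent(h, negatives))   # True True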

3 Generalisation as Search
In 1982, Mitchell [4] introduced the idea of the version space, which puts the process of generalisation into the framework of searching through a space of possible "versions" or concepts to find a suitable learning result.
The version space can be considered as the space of all concepts which are consistent with all learning examples presented so far.


[Fig. 1 The four possible relationships between a hypothesis and a set of classified examples: (a) complete and consistent, (b) complete and inconsistent, (c) incomplete and consistent, (d) incomplete and inconsistent. The correct concept c is shown as a dashed line, the hypothesis h as a solid line. A complete hypothesis covers all positive examples; a consistent hypothesis covers no negative example.]

In other words, a learning algorithm initially considers, before any training data has been presented, the complete concept space as the possible outcomes of the learning process. After examples are presented, this space of still-possible outcomes of the learning process is gradually reduced.
Mitchell provided data structures which allow an elegant and efficient maintenance of the version space, i.e. of concepts that are consistent with the examples
presented so far.
Example. To illustrate the idea, let us consider the following set of six geometrical objects: big square, big triangle, big circle, small square, small triangle, and small circle, abbreviated by b.s, b.t, ..., s.t, s.c, respectively. That is, let
X = {b.s, b.t, b.c, s.s, s.t, s.c}.
And let the set of concepts C that are potentially output by a learning system L be given by
C = {{}, {b.s}, {b.t}, {b.c}, {s.s}, {s.t}, {s.c}, {b.s, b.t, b.c}, {s.s, s.t, s.c}, {b.s, s.s}, {b.t, s.t}, {b.c, s.c}, X}.
That is, C contains the empty set, the set X, all singletons, and the abstractions of the single objects obtained by relaxing one of the requirements of having a specific size or having a specific shape.



Fig. 2 The partial order of concepts with respect to their coverage of objects

In Fig. 2, the concept space C is shown and the partial order between the concepts is indicated by the dashed lines. This partial order is the key to Mitchell’s
approach. The idea is to always maintain a set of most general concepts and a set of
most specific concepts that are consistent and complete with respect to the presented
training examples.
If a most specific concept c_s contains some object x which is given as a positive example, then all concepts which are supersets of c_s contain the positive example, i.e. they are consistent with the positive example, as is c_s itself. Similarly, if a most general concept c_g does not contain some object x which is given as a negative example, then all concepts which are subsets of c_g do not contain the negative example, i.e. they are consistent with the negative example, as is c_g itself.
In other words, the set of consistent and complete concepts, which exclude all presented negative examples and include all presented positive examples, is defined by the sets of concepts S and G, being the most specific and most general concepts consistent and complete with respect to the data. That is, all concepts of C which lie between S and G are complete and consistent as well. A concept c lies between S and G if and only if there are two concepts c_g ∈ G and c_s ∈ S such that c_s ⊆ c ⊆ c_g. An algorithm that maintains the set of consistent and complete concepts is sketched in Fig. 3. Consider the following example to illustrate the use of the algorithm in Fig. 3:
Example. Let us denote the various sets S and G by S_n and G_n, respectively, after the nth example has been processed. Before the first example is presented, we have G_0 = {X} and S_0 = {{}}.
Suppose a big triangle is presented as a positive example. Then G remains the same, but the concept in S has to be generalised. That is, we obtain G_1 = G_0 = {X} and S_1 = {{b.t}}.
Suppose the second example is a small circle, presented as a negative example. Then S remains the same, but the concept in G has to be specialised. That is, we obtain G_2 = {{b.s, b.t, b.c}, {b.t, s.t}} and S_2 = S_1 = {{b.t}}.

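To make the maintenance of the boundary sets concrete, the following is a minimal illustrative sketch for the toy concept space above. It is not the algorithm of Fig. 3: instead of updating S and G incrementally, it brute-forces the version space by enumerating the (tiny) concept space C and then extracts the most specific and most general surviving concepts, which yields the same boundary sets for this example.

objects = {"b.s", "b.t", "b.c", "s.s", "s.t", "s.c"}
C = [frozenset()] + [frozenset({o}) for o in objects] + [
    frozenset({"b.s", "b.t", "b.c"}), frozenset({"s.s", "s.t", "s.c"}),
    frozenset({"b.s", "s.s"}), frozenset({"b.t", "s.t"}),
    frozenset({"b.c", "s.c"}), frozenset(objects)]

def version_space(examples):
    """All concepts in C that are complete and consistent with the labelled examples."""
    return [c for c in C if all((x in c) == positive for x, positive in examples)]

def boundaries(vs):
    """S: most specific concepts in vs (no proper subset in vs); G: most general ones."""
    S = [c for c in vs if not any(d < c for d in vs)]
    G = [c for c in vs if not any(d > c for d in vs)]
    return S, G

examples = [("b.t", True), ("s.c", False)]   # positive big triangle, negative small circle
S, G = boundaries(version_space(examples))
print("S =", S)   # [frozenset({'b.t'})]
print("G =", G)   # the concepts {b.s, b.t, b.c} and {b.t, s.t}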