

Dawn E. Holmes and Lakhmi C. Jain (Eds.)
Data Mining: Foundations and Intelligent Paradigms


Intelligent Systems Reference Library, Volume 23
Editors-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail:

Prof. Lakhmi C. Jain
University of South Australia
Adelaide
Mawson Lakes Campus
South Australia 5095
Australia
E-mail:

Further volumes of this series can be found on our homepage:
springer.com

Vol. 1. Christine L. Mumford and Lakhmi C. Jain (Eds.)
Computational Intelligence: Collaboration, Fusion and Emergence, 2009
ISBN 978-3-642-01798-8

Vol. 2. Yuehui Chen and Ajith Abraham
Tree-Structure Based Hybrid Computational Intelligence, 2009
ISBN 978-3-642-04738-1

Vol. 3. Anthony Finn and Steve Scheding
Developments and Challenges for Autonomous Unmanned Vehicles, 2010
ISBN 978-3-642-10703-0

Vol. 4. Lakhmi C. Jain and Chee Peng Lim (Eds.)
Handbook on Decision Making: Techniques and Applications, 2010
ISBN 978-3-642-13638-2

Vol. 5. George A. Anastassiou
Intelligent Mathematics: Computational Analysis, 2010
ISBN 978-3-642-17097-3

Vol. 6. Ludmila Dymowa
Soft Computing in Economics and Finance, 2011
ISBN 978-3-642-17718-7

Vol. 7. Gerasimos G. Rigatos
Modelling and Control for Intelligent Industrial Systems, 2011
ISBN 978-3-642-17874-0

Vol. 8. Edward H.Y. Lim, James N.K. Liu, and Raymond S.T. Lee
Knowledge Seeker – Ontology Modelling for Information Search and Management, 2011
ISBN 978-3-642-17915-0

Vol. 9. Menahem Friedman and Abraham Kandel
Calculus Light, 2011
ISBN 978-3-642-17847-4

Vol. 10. Andreas Tolk and Lakhmi C. Jain
Intelligence-Based Systems Engineering, 2011
ISBN 978-3-642-17930-3

Vol. 11. Samuli Niiranen and Andre Ribeiro (Eds.)
Information Processing and Biological Systems, 2011
ISBN 978-3-642-19620-1

Vol. 12. Florin Gorunescu
Data Mining, 2011
ISBN 978-3-642-19720-8

Vol. 13. Witold Pedrycz and Shyi-Ming Chen (Eds.)
Granular Computing and Intelligent Systems, 2011
ISBN 978-3-642-19819-9

Vol. 14. George A. Anastassiou and Oktay Duman
Towards Intelligent Modeling: Statistical Approximation Theory, 2011
ISBN 978-3-642-19825-0

Vol. 15. Antonino Freno and Edmondo Trentin
Hybrid Random Fields, 2011
ISBN 978-3-642-20307-7

Vol. 16. Alexiei Dingli
Knowledge Annotation: Making Implicit Knowledge Explicit, 2011
ISBN 978-3-642-20322-0

Vol. 17. Crina Grosan and Ajith Abraham
Intelligent Systems, 2011
ISBN 978-3-642-21003-7

Vol. 18. Achim Zielesny
From Curve Fitting to Machine Learning, 2011
ISBN 978-3-642-21279-6

Vol. 19. George A. Anastassiou
Intelligent Systems: Approximation by Artificial Neural Networks, 2011
ISBN 978-3-642-21430-1

Vol. 20. Lech Polkowski
Approximate Reasoning by Parts, 2011
ISBN 978-3-642-22278-8

Vol. 21. Igor Chikalov
Average Time Complexity of Decision Trees, 2011
ISBN 978-3-642-22660-1

Vol. 22. Przemysław Różewski, Emma Kusztina, Ryszard Tadeusiewicz, and Oleg Zaikin
Intelligent Open Learning Systems, 2011
ISBN 978-3-642-22666-3

Vol. 23. Dawn E. Holmes and Lakhmi C. Jain (Eds.)
Data Mining: Foundations and Intelligent Paradigms, 2012
ISBN 978-3-642-23165-0


Dawn E. Holmes and Lakhmi C. Jain (Eds.)


Data Mining: Foundations and
Intelligent Paradigms
Volume 1: Clustering, Association and Classification



Prof. Dawn E. Holmes
Department of Statistics and Applied Probability
University of California
Santa Barbara, CA 93106
USA
E-mail:

Prof. Lakhmi C. Jain
Professor of Knowledge-Based Engineering
University of South Australia
Adelaide
Mawson Lakes, SA 5095
Australia
E-mail:

ISBN 978-3-642-23165-0

e-ISBN 978-3-642-23166-7

DOI 10.1007/978-3-642-23166-7

Intelligent Systems Reference Library

ISSN 1868-4394

Library of Congress Control Number: 2011936705
© 2012 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilm or in any other way,
and storage in data banks. Duplication of this publication or parts thereof is permitted
only under the provisions of the German Copyright Law of September 9, 1965, in
its current version, and permission for use must always be obtained from Springer.
Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general
use.
Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.
Printed on acid-free paper
987654321
springer.com


Preface

There are many invaluable books available on data mining theory and applications.
However, in compiling a volume titled “DATA MINING: Foundations and Intelligent
Paradigms: Volume 1: Clustering, Association and Classification” we wish to introduce
some of the latest developments to a broad audience of both specialists and non-specialists in this field.
The term ‘data mining’ was introduced in the 1990s to describe an emerging field
based on classical statistics, artificial intelligence and machine learning. Clustering, a
method of unsupervised learning, has applications in many areas. Association rule
learning became widely used following the seminal paper by Agrawal, Imielinski and
Swami, “Mining Association Rules Between Sets of Items in Large Databases”,
SIGMOD Conference 1993: 207–216. Classification is also an important technique in
data mining, particularly when it is known in advance how classes are to be defined.
In compiling this volume we have sought to present innovative research from
prestigious contributors in these particular areas of data mining. Each chapter is self-contained and is described briefly in Chapter 1.
This book will prove valuable to theoreticians as well as application scientists/
engineers in the area of Data Mining. Postgraduate students will also find this a useful
sourcebook since it shows the direction of current research.
We have been fortunate in attracting top class researchers as contributors and wish
to offer our thanks for their support in this project. We also acknowledge the expertise
and time of the reviewers. We thank Professor Dr. Osmar Zaïane for his visionary
Foreword. Finally, we also wish to thank Springer for their support.

Dr. Dawn E. Holmes
University of California
Santa Barbara, USA

Dr. Lakhmi C. Jain
University of South Australia
Adelaide, Australia



Contents

Chapter 1
Data Mining Techniques in Clustering, Association and Classification . . . . 1
Dawn E. Holmes, Jeffrey Tweedale, Lakhmi C. Jain
1 Introduction . . . . 1
   1.1 Data . . . . 1
   1.2 Knowledge . . . . 2
   1.3 Clustering . . . . 2
   1.4 Association . . . . 3
   1.5 Classification . . . . 3
2 Data Mining . . . . 4
   2.1 Methods and Algorithms . . . . 4
   2.2 Applications . . . . 4
3 Chapters Included in the Book . . . . 5
4 Conclusion . . . . 5
References . . . . 6

Chapter 2
Clustering Analysis in Large Graphs with Rich Attributes . . . . 7
Yang Zhou, Ling Liu
1 Introduction . . . . 8
2 General Issues in Graph Clustering . . . . 11
   2.1 Graph Partition Techniques . . . . 12
   2.2 Basic Preparation for Graph Clustering . . . . 14
   2.3 Graph Clustering with SA-Cluster . . . . 15
3 Graph Clustering Based on Structural/Attribute Similarities . . . . 16
4 The Incremental Algorithm . . . . 19
5 Optimization Techniques . . . . 21
   5.1 The Storage Cost and Optimization . . . . 22
   5.2 Matrix Computation Optimization . . . . 23
   5.3 Parallelism . . . . 24
6 Conclusion . . . . 24
References . . . . 25

Chapter 3
Temporal Data Mining: Similarity-Profiled Association Pattern . . . . 29
Jin Soung Yoo
1 Introduction . . . . 29
2 Similarity-Profiled Temporal Association Pattern . . . . 32
   2.1 Problem Statement . . . . 32
   2.2 Interest Measure . . . . 34
3 Mining Algorithm . . . . 35
   3.1 Envelope of Support Time Sequence . . . . 35
   3.2 Lower Bounding Distance . . . . 36
   3.3 Monotonicity Property of Upper Lower-Bounding Distance . . . . 38
   3.4 SPAMINE Algorithm . . . . 39
4 Experimental Evaluation . . . . 41
5 Related Work . . . . 43
6 Conclusion . . . . 45
References . . . . 45

Chapter 4
Bayesian Networks with Imprecise Probabilities: Theory and Application to Classification . . . . 49
G. Corani, A. Antonucci, M. Zaffalon
1 Introduction . . . . 49
2 Bayesian Networks . . . . 51
3 Credal Sets . . . . 52
   3.1 Definition . . . . 53
   3.2 Basic Operations with Credal Sets . . . . 53
   3.3 Credal Sets from Probability Intervals . . . . 55
   3.4 Learning Credal Sets from Data . . . . 55
4 Credal Networks . . . . 56
   4.1 Credal Network Definition and Strong Extension . . . . 56
   4.2 Non-separately Specified Credal Networks . . . . 57
5 Computing with Credal Networks . . . . 60
   5.1 Credal Networks Updating . . . . 60
   5.2 Algorithms for Credal Networks Updating . . . . 61
   5.3 Modelling and Updating with Missing Data . . . . 62
6 An Application: Assessing Environmental Risk by Credal Networks . . . . 64
   6.1 Debris Flows . . . . 64
   6.2 The Credal Network . . . . 65
7 Credal Classifiers . . . . 70
8 Naive Bayes . . . . 71
   8.1 Mathematical Derivation . . . . 73
9 Naive Credal Classifier (NCC) . . . . 74
   9.1 Comparing NBC and NCC in Texture Recognition . . . . 76
   9.2 Treatment of Missing Data . . . . 79
10 Metrics for Credal Classifiers . . . . 80
11 Tree-Augmented Naive Bayes (TAN) . . . . 81
   11.1 Variants of the Imprecise Dirichlet Model: Local and Global IDM . . . . 82
12 Credal TAN . . . . 83
13 Further Credal Classifiers . . . . 85
   13.1 Lazy NCC (LNCC) . . . . 85
   13.2 Credal Model Averaging (CMA) . . . . 86
14 Open Source Software . . . . 88
15 Conclusions . . . . 88
References . . . . 88

Chapter 5
Hierarchical Clustering for Finding Symmetries and Other Patterns in Massive, High Dimensional Datasets . . . . 95
Fionn Murtagh, Pedro Contreras
1 Introduction: Hierarchy and Other Symmetries in Data Analysis . . . . 95
   1.1 About This Article . . . . 96
   1.2 A Brief Introduction to Hierarchical Clustering . . . . 96
   1.3 A Brief Introduction to p-Adic Numbers . . . . 97
   1.4 Brief Discussion of p-Adic and m-Adic Numbers . . . . 98
2 Ultrametric Topology . . . . 98
   2.1 Ultrametric Space for Representing Hierarchy . . . . 98
   2.2 Some Geometrical Properties of Ultrametric Spaces . . . . 100
   2.3 Ultrametric Matrices and Their Properties . . . . 100
   2.4 Clustering through Matrix Row and Column Permutation . . . . 101
   2.5 Other Miscellaneous Symmetries . . . . 103
3 Generalized Ultrametric . . . . 103
   3.1 Link with Formal Concept Analysis . . . . 103
   3.2 Applications of Generalized Ultrametrics . . . . 104
   3.3 Example of Application: Chemical Database Matching . . . . 105
4 Hierarchy in a p-Adic Number System . . . . 110
   4.1 p-Adic Encoding of a Dendrogram . . . . 110
   4.2 p-Adic Distance on a Dendrogram . . . . 113
   4.3 Scale-Related Symmetry . . . . 114
5 Tree Symmetries through the Wreath Product Group . . . . 114
   5.1 Wreath Product Group Corresponding to a Hierarchical Clustering . . . . 115
   5.2 Wreath Product Invariance . . . . 115
   5.3 Example of Wreath Product Invariance: Haar Wavelet Transform of a Dendrogram . . . . 116
6 Remarkable Symmetries in Very High Dimensional Spaces . . . . 118
   6.1 Application to Very High Frequency Data Analysis: Segmenting a Financial Signal . . . . 119
7 Conclusions . . . . 126
References . . . . 126

Chapter 6
Randomized Algorithm of Finding the True Number of Clusters Based on Chebychev Polynomial Approximation . . . . 131
R. Avros, O. Granichin, D. Shalymov, Z. Volkovich, G.-W. Weber
1 Introduction . . . . 131
2 Clustering . . . . 135
   2.1 Clustering Methods . . . . 135
   2.2 Stability Based Methods . . . . 138
   2.3 Geometrical Cluster Validation Criteria . . . . 141
3 Randomized Algorithm . . . . 144
4 Examples . . . . 147
5 Conclusion . . . . 152
References . . . . 152

Chapter 7
Bregman Bubble Clustering: A Robust Framework for Mining Dense Clusters . . . . 157
Joydeep Ghosh, Gunjan Gupta
1 Introduction . . . . 157
2 Background . . . . 161
   2.1 Partitional Clustering Using Bregman Divergences . . . . 161
   2.2 Density-Based and Mode Seeking Approaches to Clustering . . . . 162
   2.3 Iterative Relocation Algorithms for Finding a Single Dense Region . . . . 164
   2.4 Clustering a Subset of Data into Multiple Overlapping Clusters . . . . 165
3 Bregman Bubble Clustering . . . . 165
   3.1 Cost Function . . . . 165
   3.2 Problem Definition . . . . 166
   3.3 Bregmanian Balls and Bregman Bubbles . . . . 166
   3.4 BBC-S: Bregman Bubble Clustering with Fixed Clustering Size . . . . 167
   3.5 BBC-Q: Dual Formulation of Bregman Bubble Clustering with Fixed Cost . . . . 169
4 Soft Bregman Bubble Clustering (Soft BBC) . . . . 169
   4.1 Bregman Soft Clustering . . . . 169
   4.2 Motivations for Developing Soft BBC . . . . 170
   4.3 Generative Model . . . . 171
   4.4 Soft BBC EM Algorithm . . . . 171
   4.5 Choosing an Appropriate p0 . . . . 173
5 Improving Local Search: Pressurization . . . . 174
   5.1 Bregman Bubble Pressure . . . . 174
   5.2 Motivation . . . . 175
   5.3 BBC-Press . . . . 176
   5.4 Soft BBC-Press . . . . 177
   5.5 Pressurization vs. Deterministic Annealing . . . . 177
6 A Unified Framework . . . . 177
   6.1 Unifying Soft Bregman Bubble and Bregman Bubble Clustering . . . . 177
   6.2 Other Unifications . . . . 178
7 Example: Bregman Bubble Clustering with Gaussians . . . . 180
   7.1 σ² Is Fixed . . . . 180
   7.2 σ² Is Optimized . . . . 181
   7.3 “Flavors” of BBC for Gaussians . . . . 182
   7.4 Mixture-6: An Alternative to BBC Using a Gaussian Background . . . . 182
8 Extending BBOCC & BBC to Pearson Distance and Cosine Similarity . . . . 183
   8.1 Pearson Correlation and Pearson Distance . . . . 183
   8.2 Extension to Cosine Similarity . . . . 185
   8.3 Pearson Distance vs. (1-Cosine Similarity) vs. Other Bregman Divergences – Which One to Use Where? . . . . 185
9 Seeding BBC and Determining k Using Density Gradient Enumeration (DGRADE) . . . . 185
   9.1 Background . . . . 186
   9.2 DGRADE Algorithm . . . . 186
   9.3 Selecting s_one: The Smoothing Parameter for DGRADE . . . . 188
10 Experiments . . . . 190
   10.1 Overview . . . . 190
   10.2 Datasets . . . . 190
   10.3 Evaluation Methodology . . . . 192
   10.4 Results for BBC with Pressurization . . . . 194
   10.5 Results on BBC with DGRADE . . . . 198
11 Concluding Remarks . . . . 202
References . . . . 204

Chapter 8
DepMiner: A Method and a System for the Extraction of Significant Dependencies . . . . 209
Rosa Meo, Leonardo D’Ambrosi
1 Introduction . . . . 209
2 Related Work . . . . 211
3 Estimation of the Referential Probability . . . . 213
4 Setting a Threshold for Δ . . . . 213
5 Embedding Δn in Algorithms . . . . 215
6 Determination of the Itemsets Minimum Support Threshold . . . . 216
7 System Description . . . . 218
8 Experimental Evaluation . . . . 220
9 Conclusions . . . . 221
References . . . . 221

Chapter 9
Integration of Dataset Scans in Processing Sets of Frequent Itemset Queries . . . . 223
Marek Wojciechowski, Maciej Zakrzewicz, Pawel Boinski
1 Introduction . . . . 223
2 Frequent Itemset Mining and Apriori Algorithm . . . . 225
   2.1 Basic Definitions and Problem Statement . . . . 225
   2.2 Algorithm Apriori . . . . 226
3 Frequent Itemset Queries – State of the Art . . . . 227
   3.1 Frequent Itemset Queries . . . . 227
   3.2 Constraint-Based Frequent Itemset Mining . . . . 229
   3.3 Reusing Results of Previous Frequent Itemset Queries . . . . 230
4 Optimizing Sets of Frequent Itemset Queries . . . . 231
   4.1 Basic Definitions . . . . 232
   4.2 Problem Formulation . . . . 233
   4.3 Related Work on Multi-query Optimization . . . . 234
5 Common Counting . . . . 234
   5.1 Basic Algorithm . . . . 234
   5.2 Motivation for Query Set Partitioning . . . . 237
   5.3 Key Issues Regarding Query Set Partitioning . . . . 237
6 Frequent Itemset Query Set Partitioning by Hypergraph Partitioning . . . . 238
   6.1 Data Sharing Hypergraph . . . . 239
   6.2 Hypergraph Partitioning Problem Formulation . . . . 239
   6.3 Computation Complexity of the Problem . . . . 241
   6.4 Related Work on Hypergraph Partitioning . . . . 241
7 Query Set Partitioning Algorithms . . . . 241
   7.1 CCRecursive . . . . 242
   7.2 CCFull . . . . 243
   7.3 CCCoarsening . . . . 246
   7.4 CCAgglomerative . . . . 247
   7.5 CCAgglomerativeNoise . . . . 248
   7.6 CCGreedy . . . . 249
   7.7 CCSemiGreedy . . . . 250
8 Experimental Results . . . . 251
   8.1 Comparison of Basic Dedicated Algorithms . . . . 252
   8.2 Comparison of Greedy Approaches with the Best Dedicated Algorithms . . . . 257
9 Review of Other Methods of Processing Sets of Frequent Itemset Queries . . . . 260
10 Conclusions . . . . 261
References . . . . 262

Chapter 10
Text Clustering with Named Entities: A Model, Experimentation and Realization . . . . 267
Tru H. Cao, Thao M. Tang, Cuong K. Chau
1 Introduction . . . . 267
2 An Entity-Keyword Multi-Vector Space Model . . . . 269
3 Measures of Clustering Quality . . . . 271
4 Hard Clustering Experiments . . . . 273
5 Fuzzy Clustering Experiments . . . . 277
6 Text Clustering in VN-KIM Search . . . . 282
7 Conclusion . . . . 285
References . . . . 286

Chapter 11
Regional Association Rule Mining and Scoping from Spatial Data . . . . 289
Wei Ding, Christoph F. Eick
1 Introduction . . . . 289
2 Related Work . . . . 291
   2.1 Hot-Spot Discovery . . . . 291
   2.2 Spatial Association Rule Mining . . . . 292
3 The Framework for Regional Association Rule Mining and Scoping . . . . 293
   3.1 Region Discovery . . . . 293
   3.2 Problem Formulation . . . . 294
   3.3 Measure of Interestingness . . . . 295
4 Algorithms . . . . 298
   4.1 Region Discovery . . . . 298
   4.2 Generation of Regional Association Rules . . . . 301
5 Arsenic Regional Association Rule Mining and Scoping in the Texas Water Supply . . . . 302
   5.1 Data Collection and Data Preprocessing . . . . 302
   5.2 Region Discovery for Arsenic Hot/Cold Spots . . . . 304
   5.3 Regional Association Rule Mining . . . . 305
   5.4 Region Discovery for Regional Association Rule Scoping . . . . 307
6 Summary . . . . 310
References . . . . 311

Chapter 12
Learning from Imbalanced Data: Evaluation Matters . . . . 315
Troy Raeder, George Forman, Nitesh V. Chawla
1 Motivation and Significance . . . . 315
2 Prior Work and Limitations . . . . 317
3 Experiments . . . . 318
   3.1 Datasets . . . . 321
   3.2 Empirical Analysis . . . . 321
4 Discussion and Recommendations . . . . 325
   4.1 Comparisons of Classifiers . . . . 325
   4.2 Towards Parts-Per-Million . . . . 328
   4.3 Recommendations . . . . 329
5 Summary . . . . 329
References . . . . 330

Author Index . . . . 333


Editors


Dr. Dawn E. Holmes serves as Senior Lecturer in the Department of Statistics and
Applied Probability and Senior Associate
Dean in the Division of Undergraduate Education at UCSB. Her main research area,
Bayesian Networks with Maximum Entropy,
has resulted in numerous journal articles and
conference presentations. Her other research
interests include Machine Learning, Data
Mining, Foundations of Bayesianism and
Intuitionistic Mathematics. Dr. Holmes has
co-edited, with Professor Lakhmi C. Jain,
volumes ‘Innovations in Bayesian Networks’ and ‘Innovations in Machine Learning’. Dr. Holmes teaches a broad range of
courses, including SAS programming, Bayesian Networks and Data Mining. She was
awarded the Distinguished Teaching Award by Academic Senate, UCSB in 2008.
As well as being Associate Editor of the International Journal of Knowledge-Based
and Intelligent Information Systems, Dr. Holmes reviews extensively and is on the
editorial board of several journals, including the Journal of Neurocomputing. She
serves as Program Scientific Committee Member for numerous conferences, including the International Conference on Artificial Intelligence and the International Conference on Machine Learning. In 2009 Dr. Holmes accepted an invitation to join the
Center for Research in Financial Mathematics and Statistics (CRFMS), UCSB. She
was made a Senior Member of the IEEE in 2011.
Professor Lakhmi C. Jain is a Director/Founder of the
Knowledge-Based Intelligent Engineering Systems
(KES) Centre, located in the University of South Australia. He is a fellow of the Institution of Engineers
Australia.
His interests focus on the artificial intelligence paradigms and their applications in complex systems, art-science fusion, e-education, e-healthcare, unmanned air
vehicles and intelligent agents.



Chapter 1

Data Mining Techniques in Clustering, Association and
Classification
Dawn E. Holmes¹, Jeffrey Tweedale², and Lakhmi C. Jain³

¹ Department of Statistics and Applied Probability, University of California Santa Barbara, Santa Barbara, CA 93106-3110, USA
² School of Electrical and Information Engineering, University of South Australia, Adelaide, Mawson Lakes Campus, South Australia SA 5095, Australia
³ School of Electrical and Information Engineering, University of South Australia, Adelaide, Mawson Lakes Campus, South Australia SA 5095, Australia

1 Introduction
The term Data Mining grew from the relentless growth of techniques used to interrogate masses of data. As a myriad of databases emanated from disparate industries, management insisted their information officers develop methodologies to exploit the knowledge held in their repositories. The process of extracting this knowledge evolved as an interdisciplinary field of computer science within academia, drawing on statistics, database management and Artificial Intelligence (AI). Science and technology provide the stimulus for an extremely rapid transformation from data acquisition to enterprise knowledge management systems.
1.1 Data
Data is the representation of anything that can be meaningfully quantized or represented in digital form, as a number, symbol or even text. We process data into information by combining a collection of artefacts that are input into a system, where they are generally stored, filtered and/or classified prior to being translated into a useful form for dissemination [1]. The processes used to achieve this task have evolved over many years and have been applied to many situations using a multitude of techniques. Accounting and payroll applications took center stage in the evolution of information processing.



Data mining, expert systems and knowledge-based systems quickly followed. Today we
live in an information age where we collect data faster than it can be processed. This
book examines many recent advances in digital information processing with paradigms
for acquisition, retrieval, aggregation, search, estimation and presentation.
Our ability to acquire data electronically has grown exponentially since the introduction of mainframe computers. We have also improved the methodology used to extract information from data in almost every aspect of life. Our biggest challenge is identifying targeted information and transforming it into useful knowledge amid the growing volume of noise collected in repositories all over the world.
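To make the data-to-information pipeline described above concrete, the following minimal Python sketch ingests raw artefacts, filters and normalises them, and classifies the survivors into labelled information. The data and function names are purely illustrative and are not drawn from this chapter.

raw_artefacts = ["  42 ", "n/a", "17", "3.5", ""]

def ingest(items):
    """Store incoming artefacts as-is (here, simply a list)."""
    return list(items)

def clean(items):
    """Filter out unusable records and normalise the rest."""
    return [s.strip() for s in items if s.strip() and s.strip() != "n/a"]

def classify(values):
    """Translate cleaned data into labelled information."""
    return [("integer" if v.isdigit() else "real", float(v)) for v in values]

store = ingest(raw_artefacts)
information = classify(clean(store))
print(information)  # [('integer', 42.0), ('integer', 17.0), ('real', 3.5)]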
1.2 Knowledge
Information, knowledge and wisdom are labels commonly applied to the way humans aggregate practical experience into an organised collection of facts. Knowledge is considered a collection of facts, truths, principles and related concepts resulting from study or investigation. Knowledge representation is the key to any communication language and a fundamental issue in AI. The way knowledge is represented and expressed has to be meaningful so that the communicating entities can grasp the concept of the knowledge transmitted among them. This requires a good technique for representing knowledge. In computers, symbols (numbers and characters) are used to store and manipulate knowledge. There are different approaches to storing knowledge because there are different kinds of knowledge, such as facts, rules, relationships, and so on. Popular approaches for storing knowledge in computers include procedural, relational and hierarchical representations. Other forms of knowledge representation include Predicate Logic, Frames, Semantic Nets, If-Then rules and the Knowledge Interchange Format. The type of knowledge representation to be used depends on the AI application and the domain in which Intelligent Agents (IAs) are required to function [2]. Knowledge should be separated from procedural algorithms in order to simplify knowledge modification and processing. For an IA to be capable of solving problems at different levels of abstraction, knowledge should be presented in the form of frames or semantic nets that can show the is-a relationship of objects and concepts. If an IA is required to find a solution from existing data, predicate logic using IF-THEN rules, Bayesian methods or any number of other techniques can be used to cluster information [3].
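As a rough illustration of IF-THEN rule representation and the separation of knowledge from procedural algorithms discussed above, the Python sketch below stores facts and rules as plain data and applies a naive forward-chaining loop. The facts and rules are invented for illustration only.

facts = {"has_feathers", "lays_eggs"}

rules = [
    ({"has_feathers"}, "is_bird"),              # IF has_feathers THEN is_bird
    ({"is_bird", "lays_eggs"}, "builds_nest"),  # IF is_bird AND lays_eggs THEN builds_nest
]

def forward_chain(facts, rules):
    """Repeatedly fire rules whose conditions hold until no new facts appear."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

print(forward_chain(facts, rules))
# {'has_feathers', 'lays_eggs', 'is_bird', 'builds_nest'}

Because the knowledge lives in the facts and rules data structures rather than in the control loop, it can be modified without touching the inference procedure, which is the design point made above.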
1.3 Clustering
In data mining, a cluster is a collection of similar items drawn from a volume of acquired facts. Each cluster has distinct characteristics; items within a cluster are similar to one another, and a cluster's extent is measured by the distance of its members from the centre and by its separation from neighbouring clusters [4]. Non-hierarchical clusters are generally partitioned by class or clumping methods. Hierarchical clustering produces sets of nested groups that need to be progressively isolated as individual subsets. The methodologies used are described as: partitioning, hierarchical agglomeration, Single Link (SLINK), Complete Link (CLINK), group average and text-based document methods. Other techniques include [5]:


• A Comparison of Techniques;
• Artificial Neural Networks for Clustering;
• Clustering Large Data Sets;
• Evolutionary Approaches for Clustering;
• Fuzzy Clustering;
• Hierarchical Clustering Algorithms;
• Incorporating Domain Constraints in Clustering;
• Mixture-Resolving and Mode-Seeking Algorithms;
• Nearest Neighbour Clustering;
• Partitional Algorithms;
• Representation of Clusters; and
• Search-Based Approaches.

Clustering is typically applied in Image Segmentation, Object/Character Recognition, Information Retrieval and Data Mining.
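The following minimal Python sketch illustrates the single-link (SLINK-style) agglomerative approach named above, repeatedly merging the two closest clusters until k remain. It is a naive, quadratic illustration on one-dimensional points, not an implementation of the optimised SLINK algorithm itself.

def single_link_distance(a, b):
    """Cluster distance = minimum pairwise distance between members."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points, k):
    """Start with singleton clusters and merge the closest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))  # merge cluster j into cluster i
    return clusters

print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.9], k=2))
# [[1.0, 1.2, 5.0, 5.1], [9.9]]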
1.4 Association
Data is merely a collection of facts. To make sense of that collection, a series of rules can be created to sort, select and match patterns of behaviour or association based on specified dependencies or relationships. For instance, the collection of sales transactions within a department store holds a significant volume of information. If a cosmetics manager desired to improve sales, knowledge about existing turnover provides an excellent baseline (this is a form of market analysis). Similarly, using the same data set, the logistics manager could determine inventory levels (this concept is currently associated with trend analysis and prediction). Association rules allow the user to reveal sequences, links and unique manifestations that emerge over time [6]. Typically, cross-tabulation can be used where items, words or conjunctions are employed to analyse simple collections that are easily classified, such as by age, cost or gender.
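In the spirit of the department-store example above, the short Python sketch below computes the two standard association-rule measures, support and confidence, over a handful of invented cosmetics transactions; the data and names are illustrative only.

transactions = [
    {"lipstick", "mascara"},
    {"lipstick", "mascara", "perfume"},
    {"perfume"},
    {"lipstick", "perfume"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

rule = ({"lipstick"}, {"mascara"})
print(support(rule[0] | rule[1]))  # 0.5  (2 of 4 baskets hold both items)
print(confidence(*rule))           # 0.666... (2 of 3 lipstick baskets add mascara)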
1.5 Classification
Databases provide an arbitrary collection of facts. In order to make sense of the random nature of such collections, any number of methods can be used to map the data into usable or quantifiable categories based on a series of attributes. These subsets improve efficiency by reducing the noise and volume of data during subsequent processing. The goal is to predict the target class for each case; an example would be to rate the risk of an activity as low, high or some category in between. The target categories must be defined before the classification process is run [7]. A number of AI techniques are used to classify data, including decision trees, rule-based methods, Bayesian methods, rough sets, dependency networks, Support Vector Machines (SVMs), Neural Networks (NNs), genetic algorithms and fuzzy logic.
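As a minimal illustration of classification into predefined target classes, the Python sketch below maps an activity's attributes to a risk class using hand-written, decision-tree-like rules. The attributes and thresholds are invented for illustration and are not drawn from the chapter.

def classify_risk(severity, failure_rate):
    """Assign one of the predefined risk classes: low, medium or high."""
    if severity >= 7 or failure_rate >= 0.5:
        return "high"
    if severity >= 3:
        return "medium"
    return "low"

print(classify_risk(severity=2, failure_rate=0.05))  # low
print(classify_risk(severity=5, failure_rate=0.10))  # medium
print(classify_risk(severity=8, failure_rate=0.10))  # high

In practice these thresholds would be learned from labelled data by one of the techniques listed above, but the classes themselves must still be defined in advance.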



2 Data Mining
There are many commercial data mining methods, algorithms and applications, several of which have had major impact. Examples include SAS, SPSS and Statistica. Other examples are listed in Sections 2.1 and 2.2. Many more can be found on-line, and many are free; examples include the Environment for DeveLoping KDD-Applications Supported by Index-Structures (ELKI), the General Architecture for Text Engineering (GATE) and the Waikato Environment for Knowledge Analysis (Weka).

2.1 Methods and Algorithms
• Association rule learning;
• Cluster analysis;
• Constructive induction;
• Data analysis;
• Decision trees;
• Factor analysis;
• Knowledge discovery;
• Neural nets;
• Predictive analytics;
• Reactive business intelligence;
• Regression;
• Statistical data analysis; and
• Text mining.

2.2 Applications

• Customer analytics;
• Data Mining in Agriculture;
• Data Mining in Meteorology;
• Law enforcement;
• National Security Agency;
• Quantitative structure-activity relationship; and
• Surveillance.

Note: ELKI is from Ludwig Maximillian University, GATE (gate.ac.uk) is from the University of Sheffield, and Weka is from the University of Waikato.



3 Chapters Included in the Book
This book includes twelve chapters. Each chapter is described briefly below.
Chapter 1 provides an introduction to data mining and presents a brief abstract of each chapter included in the book. Chapter 2 is on clustering analysis in large graphs with rich attributes. The authors state that a key challenge in clustering large graphs with rich attributes is to achieve a good balance between structural and attribute similarities. Chapter 3 is on temporal data mining; a temporal association mining problem, based on a similarity constraint, is presented. Chapter 4 is on Bayesian networks with imprecise probabilities. The authors report extensive experimentation on public benchmark data sets and real-world applications to show that on the instances indeterminately classified by a credal network, the accuracy of its Bayesian counterpart drops.
Chapter 5 is on hierarchical clustering for finding symmetries and other patterns in massive, high dimensional datasets. The authors illustrate the power of hierarchical clustering in case studies in chemistry and finance. Chapter 6 is on a randomized algorithm for finding the true number of clusters based on Chebychev polynomial approximation; a number of examples are used to validate the proposed algorithm. Chapter 7 is on Bregman bubble clustering. The authors present a broad framework for finding k dense clusters while ignoring the rest of the data. The results are validated on various datasets to demonstrate the relevance and effectiveness of the technique.
Chapter 8 is on DepMiner, a method implementing a model for the evaluation of itemsets and, in general, for the evaluation of the dependencies between the values assumed by a set of variables on a domain of finite values. Chapter 9 is on the integration of dataset scans in processing sets of frequent itemset queries. Chapter 10 is on text clustering with named entities. It is demonstrated that a weighted combination of named entities and keywords is significant to clustering quality. The authors present an implementation of the scheme and demonstrate text clustering with named entities in a semantic search engine. Chapter 11 is on regional association rule mining and scoping from spatial data. The authors investigate the duality between regional association rules and the regions where the associations are valid; the design and implementation of a reward-based region discovery framework and its evaluation are presented. Finally, Chapter 12 is on learning from imbalanced data; based on their experiments, the authors make recommendations concerning evaluation methods.



4 Conclusion
This chapter presents a collection of selected contributions from leading subject matter experts in the field of data mining. This book is intended for students, professionals and academics from all disciplines, offering them the opportunity to engage with the state-of-the-art developments in:
• Clustering Analysis in Large Graphs with Rich Attributes;
• Temporal Data Mining: Similarity-Profiled Association Pattern;
• Bayesian Networks with Imprecise Probabilities: Theory and Application to
Classification;
• Hierarchical Clustering for Finding Symmetries and Other Patterns in Massive,
High Dimensional Datasets;
• Randomized Algorithm of Finding the True Number of Clusters Based on Chebychev Polynomial Approximation;
• Bregman Bubble Clustering: A Robust Framework for Mining Dense Clusters;
• DepMiner: A method and a system for the extraction of significant dependencies;
• Integration of Dataset Scans in Processing Sets of Frequent Itemset Queries;
• Text Clustering with Named Entities: A Model, Experimentation and Realization;
• Regional Association Rule Mining and Scoping from Spatial Data; and
• Learning from Imbalanced Data: Evaluation Matters.
Readers are invited to contact individual authors to engage in further discussion or dialogue on each topic.

References
1. Moxon, B.: Defining data mining: the hows and whys of data mining, and how it differs from other analytical techniques. DBMS 9(9), S11–S14 (1996)
2. Bigus, J.P., Bigus, J.: Constructing Intelligent Agents Using Java. Professional Developer's Guide Series. John Wiley & Sons, Inc., New York (2001)
3. Tweedale, J., Jain, L.C.: Advances in information processing paradigms. In: Watanabe, T. (ed.) Innovations in Intelligent Machines, pp. 1–19. Springer, Heidelberg (2011)
4. Bouguettaya, A.: On-line clustering. IEEE Trans. on Knowl. and Data Eng. 8, 333–339 (1996)
5. Jain, A., Murty, M., Flynn, P.: Data clustering: A review. ACM Computing Surveys 31(3), 264–323 (1999)
6. Hill, T., Lewicki, P.: Statistics: Methods and Applications. StatSoft, Tulsa, OK (2007)
7. Banks, D., House, L., McMorris, F., Arabie, P., Gaul, W. (eds.): Classification, Clustering, and Data Mining Applications. Proceedings of the International Federation of Classification Societies (IFCS), Illinois Institute of Technology, Chicago. Springer, New York (2004)


Chapter 2
Clustering Analysis in Large Graphs with Rich
Attributes
Yang Zhou and Ling Liu
DiSL, College of Computing, Georgia Institute of Technology,
Atlanta, Georgia, USA
Abstract. Social networks, communication networks, biological networks and many other information networks can be modeled as a large graph. Graph vertices represent entities and graph edges represent the relationships or interactions among entities. In many large graphs, there are usually one or more attributes associated with every graph vertex to describe its properties. The goal of graph clustering is to partition vertices in a large graph into subgraphs (clusters) based on a set of criteria, such as vertex similarity measures, adjacency-based measures, connectivity-based measures, density measures, or cut-based measures. Although graph clustering has been studied extensively, the problem of clustering analysis of large graphs with rich attributes remains a big challenge in practice. In this chapter we first give an overview of the set of issues and challenges in clustering analysis of large graphs with vertices of rich attributes. Based on the type of measures used for identifying clusters, existing graph clustering methods can be categorized into three classes: structure based clustering, attribute based clustering and structure-attribute based clustering. Structure based clustering mainly focuses on the topological structure of a graph, but largely ignores the vertex properties, which are often heterogeneous. Attribute based clustering, in contrast, focuses primarily on attribute-based vertex similarity, but suffers from isolated partitions of the graph as a result of clustering. Structure-attribute based clustering is a hybrid approach, which combines structural and attribute similarities through a unified distance measure. We argue that effective clustering analysis of a large graph with rich attributes requires a systematic graph analysis framework that partitions the graph based on both structural similarity and attribute similarity. One approach is to model rich attributes of vertices as auxiliary edges among vertices, resulting in a complex attribute augmented graph with multiple edges between some vertices. To show how best to combine structure and attribute similarity in a unified framework, the second part of this chapter will outline a cluster-convergence based iterative edge-weight assignment scheme that assigns different weights to different attributes based on how fast the clusters converge. We use a K-Medoids clustering algorithm to partition a graph into k clusters with both cohesive intra-cluster structures and homogeneous attribute values based on iterative weight updates. At each iteration, a series of matrix multiplication operations is used for calculating the random walk distances between graph vertices. Optimizations are used to reduce the cost of recalculating the random walk distances upon each iteration of the edge weight update. Finally, we discuss the set of open problems in graph clustering with rich attributes, including storage cost and efficiency, scalable analytics under memory constraints, distributed graph clustering and parallel processing.
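As a rough sketch of the matrix-based random-walk computation mentioned in the abstract, the Python fragment below sums restart-weighted powers of the row-normalised adjacency matrix, in the spirit of the random walk with restart formulation commonly used with SA-Cluster. numpy is assumed; the restart probability c, walk length L and the toy graph are all illustrative, and the chapter's exact definition may differ.

import numpy as np

def random_walk_proximity(A, c=0.15, L=4):
    """Approximate proximity R = sum over l=1..L of c*(1-c)^l * P^l."""
    P = A / A.sum(axis=1, keepdims=True)  # row-normalised transition matrix
    R = np.zeros_like(P)
    P_l = np.eye(len(A))
    for l in range(1, L + 1):
        P_l = P_l @ P                     # P^l via repeated matrix multiplication
        R += c * (1 - c) ** l * P_l       # restart-weighted length-l walks
    return R                              # larger R[i, j] means i and j are closer

# Toy 4-vertex graph: two tight pairs joined by a single bridge edge.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(np.round(random_walk_proximity(A), 3))

In an attribute augmented graph the matrix A would also contain the auxiliary attribute edges with their iteratively updated weights, which is why recomputing these products at every iteration dominates the cost and motivates the optimizations discussed in the chapter.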

1 Introduction

A number of scientific and technical endeavors are generating data that usually consists of a large number of interacting physical, conceptual, and societal
components. Such examples include social networks, semantic networks, communication systems, the Internet, ecological networks, transportation networks,
database schemas and ontologies, electrical power grids, sensor networks, research coauthor networks, biological networks, and so on. All the above networks share an important common feature: they can be modeled as graphs, i.e.,
individual objects interact with one another, forming large, interconnected, and
sophisticated graphs with vertices of rich attributes. Multi-relational data mining finds the relational patterns in both the entity attributes and relations in
the data. Graph mining, as one approach of multi-relational data mining, finds
relational patterns in complex graph structures. Mining and analysis of these
annotated and probabilistic graph structures is crucial for advancing the state
of scientific research, accurate modeling and analysis of existing systems, and
engineering of new systems.
Graph clustering is one of the most popular graph mining methodologies. Clustering is a useful and important unsupervised learning technique widely studied
in literature [1,2,3,4]. The general goal of clustering is to group similar objects
into one cluster while partitioning dissimilar objects into different clusters. Clustering has broad applications in the analysis of business and financial data, biological data, time series data, spatial data, trajectory data and so on. As one
important approach of graph mining, graph clustering is an interesting and challenging research problem which has received much attention recently [5,6,7,8].
Clustering on a large graph aims to partition the graph into several densely connected components. This is very useful for understanding and visualizing large
graphs. Typical applications of graph clustering include community detection in
social networks, reduction of very large transportation networks, and identification of functionally related protein modules in large protein-protein interaction networks. Although many graph clustering techniques have been proposed in the literature, the problem of clustering analysis in large graphs with rich attributes remains challenging due to the demands on memory and computational resources and the need for fast access to disk-based storage. Furthermore, with the grand vision of a utility-driven, pay-as-you-go cloud computing paradigm, there is a growing demand for providing graph clustering as a service. We witness emerging interest from science and engineering fields in the design and development of efficient and scalable graph analytics for managing and mining large information graphs.


