
Advances in Computer Vision and Pattern Recognition

T. Ravindra Babu
M. Narasimha Murty
S.V. Subrahmanya

Compression Schemes for Mining Large Datasets
A Machine Learning Perspective


Advances in Computer Vision and Pattern Recognition

For further volumes:
www.springer.com/series/4205


T. Ravindra Babu · M. Narasimha Murty · S.V. Subrahmanya

Compression Schemes
for Mining Large Datasets
A Machine Learning Perspective


T. Ravindra Babu
Infosys Technologies Ltd.


Bangalore, India

S.V. Subrahmanya
Infosys Technologies Ltd.
Bangalore, India

M. Narasimha Murty
Indian Institute of Science
Bangalore, India
Series Editors
Prof. Sameer Singh
Rail Vision Europe Ltd.
Castle Donington
Leicestershire, UK

Dr. Sing Bing Kang
Interactive Visual Media Group
Microsoft Research
Redmond, WA, USA

ISSN 2191-6586
ISSN 2191-6594 (electronic)
Advances in Computer Vision and Pattern Recognition
ISBN 978-1-4471-5606-2
ISBN 978-1-4471-5607-9 (eBook)
DOI 10.1007/978-1-4471-5607-9
Springer London Heidelberg New York Dordrecht
Library of Congress Control Number: 2013954523
© Springer-Verlag London 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of

the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publisher’s location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any
errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect
to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)


Preface

We come across a number of celebrated textbooks on Data Mining covering multiple aspects of the topic since its early development, such as those on databases, pattern recognition, soft computing, etc. We did not find any consolidated work on data mining in the compression domain. The book took shape from this realization. Our work relates to this area of data mining with a focus on compaction. We present schemes that work in the compression domain and demonstrate their working on one or more practical datasets in each case. In this process, we cover important data mining paradigms. This is intended to provide a practitioner's viewpoint of compression
schemes in data mining. The work presented is based on the authors’ work on related

areas over the last few years. We organized each chapter to contain context setting,
background work as part of discussion, proposed algorithm and scheme, implementation intricacies, experimentation by implementing the scheme on a large dataset,
and discussion of results. At the end of each chapter, as part of bibliographic notes,
we discuss relevant literature and directions for further study.
Data Mining focuses on efficient algorithms to generate abstraction from large
datasets. The objective of these algorithms is to find interesting patterns for further
use with the least number of visits of the entire dataset, the ideal being a single visit. Similarly, since the data sizes are large, effort is made in arriving at a much smaller subset of the original dataset that is representative of the entire data and contains attributes characterizing the data. The ability to generate, from a small representative set of patterns and features, an abstraction that is as accurate as the one obtained with the entire dataset leads to efficiency in terms of both space and time. Important
data mining paradigms include clustering, classification, association rule mining,
etc. We present a discussion on data mining paradigms in Chap. 2.
In our present work, in addition to data mining paradigms discussed in Chap. 2,
we also focus on another paradigm, viz., the ability to generate abstraction in the
compressed domain without having to decompress. Such a compression would lead
to less storage and a lower computation cost. In the book, we consider both
lossy and nonlossy compression schemes. In Chap. 3, we present a nonlossy compression scheme based on run-length encoding of patterns with binary-valued features. The scheme is also applicable to floating-point-valued features that are suitably quantized to binary values. The chapter presents an algorithm that computes
the dissimilarity in the compressed domain directly. Theoretical notes are provided
for the work. We present applications of the scheme in multiple domains.
It is interesting to explore whether, when one is prepared to lose some part of the pattern representation, we obtain better generalization and compaction. We examine
this aspect in Chap. 4. The work in the chapter exploits the concept of minimum
feature or item-support. The concept of support relates to the conventional association rule framework. We consider patterns as sequences, form subsequences of short
length, and identify and eliminate repeating subsequences. We represent the pattern

by those unique subsequences leading to significant compaction. Such unique subsequences are further reduced by replacing less frequent unique subsequences by more
frequent subsequences, thereby achieving further compaction. We demonstrate the
working of the scheme on large handwritten digit data.
Pattern clustering can be construed as compaction of data. Feature selection also
reduces dimensionality, thereby resulting in pattern compression. It is interesting to
explore whether they can be simultaneously achieved. We examine this in Chap. 5.
We consider an efficient clustering scheme that requires a single database visit to
generate prototypes. We consider a lossy compression scheme for feature reduction. We also examine whether there is a preferred order of prototype selection and feature selection for achieving compaction as well as good classification accuracy on unseen patterns. We examine multiple combinations of such sequencing. We demonstrate the working of the scheme on handwritten digit data and intrusion detection data.
Domain knowledge forms an important input for efficient compaction. Such
knowledge could either be provided by a human expert or generated through an
appropriate preliminary statistical analysis. In Chap. 6, we exploit domain knowledge obtained both by expert inference and through statistical analysis and classify
a 10-class dataset through a proposed decision tree of depth 4. We make use of 2-class classifiers, AdaBoost and Support Vector Machine, to demonstrate the working of
such a scheme.
Dimensionality reduction leads to compaction. With algorithms such as run-length-encoded compression, it is instructive to study whether one can achieve efficiency in obtaining an optimal feature set that provides high classification accuracy.
In Chap. 7, we discuss concepts and methods of feature selection and extraction.
We propose an efficient implementation of simple genetic algorithms by integrating
compressed data classification and frequent features. We provide an insightful discussion of the sensitivity of various genetic operators and of frequent-item support on the final selection of the optimal feature set.
Divide-and-conquer has been one important direction for dealing with large datasets. With the reducing cost and the increasing ability to collect and store enormous amounts of data, we have massive databases at our disposal from which to make sense of the data and generate abstractions of potential business value. The term Big Data has become synonymous with streaming multisource data such as numerical data, messages, and audio and video data. There is an increasing need to process such data in real or near-real time and to generate business value in this process. In Chap. 8,
we propose schemes that exploit multiagent systems to solve these problems. We
discuss concepts of big data, MapReduce, PageRank, agents, and multiagent systems before proposing multiagent systems to solve big data problems.
The authors would like to express their sincere gratitude to their respective families for their cooperation.
T. Ravindra Babu and S.V. Subrahmanya are grateful to Infosys Limited for providing an excellent research environment in the Education and Research Unit (E&R)
that enabled them to carry out academic and applied research resulting in articles
and books.
T. Ravindra Babu would like to express his sincere thanks to his family members Padma, Ramya, Kishore, and Rahul for their encouragement and support. He dedicates his contribution to the work to the fond memory of his parents Butchiramaiah and Ramasitamma. M. Narasimha Murty would like to acknowledge the support of his parents. S.V. Subrahmanya would like to thank his wife D.R. Sudha for her patient support. The authors would like to record their sincere appreciation of the Springer team, Wayne Wheeler and Simon Rees, for their support and encouragement.
Bangalore, India

T. Ravindra Babu
M. Narasimha Murty
S.V. Subrahmanya


Contents

1  Introduction
   1.1  Data Mining and Data Compression
        1.1.1  Data Mining Tasks
        1.1.2  Data Compression
        1.1.3  Compression Using Data Mining Tasks
   1.2  Organization
        1.2.1  Data Mining Tasks
        1.2.2  Abstraction in Nonlossy Compression Domain
        1.2.3  Lossy Compression Scheme and Dimensionality Reduction
        1.2.4  Compaction Through Simultaneous Prototype and Feature Selection
        1.2.5  Use of Domain Knowledge in Data Compaction
        1.2.6  Compression Through Dimensionality Reduction
        1.2.7  Big Data, Multiagent Systems, and Abstraction
   1.3  Summary
   1.4  Bibliographical Notes
   References

2  Data Mining Paradigms
   2.1  Introduction
   2.2  Clustering
        2.2.1  Clustering Algorithms
        2.2.2  Single-Link Algorithm
        2.2.3  k-Means Algorithm
   2.3  Classification
   2.4  Association Rule Mining
        2.4.1  Frequent Itemsets
        2.4.2  Association Rules
   2.5  Mining Large Datasets
        2.5.1  Possible Solutions
        2.5.2  Clustering
        2.5.3  Classification
        2.5.4  Frequent Itemset Mining
   2.6  Summary
   2.7  Bibliographic Notes
   References

3  Run-Length-Encoded Compression Scheme
   3.1  Introduction
   3.2  Compression Domain for Large Datasets
   3.3  Run-Length-Encoded Compression Scheme
        3.3.1  Discussion on Relevant Terms
        3.3.2  Important Properties and Algorithm
   3.4  Experimental Results
        3.4.1  Application to Handwritten Digit Data
        3.4.2  Application to Genetic Algorithms
        3.4.3  Some Applicable Scenarios in Data Mining
   3.5  Invariance of VC Dimension in the Original and the Compressed Forms
   3.6  Minimum Description Length
   3.7  Summary
   3.8  Bibliographic Notes
   References

4  Dimensionality Reduction by Subsequence Pruning
   4.1  Introduction
   4.2  Lossy Data Compression for Clustering and Classification
   4.3  Background and Terminology
   4.4  Preliminary Data Analysis
        4.4.1  Huffman Coding and Lossy Compression
        4.4.2  Analysis of Subsequences and Their Frequency in a Class
   4.5  Proposed Scheme
        4.5.1  Initialization
        4.5.2  Frequent Item Generation
        4.5.3  Generation of Coded Training Data
        4.5.4  Subsequence Identification and Frequency Computation
        4.5.5  Pruning of Subsequences
        4.5.6  Generation of Encoded Test Data
        4.5.7  Classification Using Dissimilarity Based on Rough Set Concept
        4.5.8  Classification Using k-Nearest Neighbor Classifier
   4.6  Implementation of the Proposed Scheme
        4.6.1  Choice of Parameters
        4.6.2  Frequent Items and Subsequences
        4.6.3  Compressed Data and Pruning of Subsequences
        4.6.4  Generation of Compressed Training and Test Data
   4.7  Experimental Results
   4.8  Summary
   4.9  Bibliographic Notes
   References

5  Data Compaction Through Simultaneous Selection of Prototypes and Features
   5.1  Introduction
   5.2  Prototype Selection, Feature Selection, and Data Compaction
        5.2.1  Data Compression Through Prototype and Feature Selection
   5.3  Background Material
        5.3.1  Computation of Frequent Features
        5.3.2  Distinct Subsequences
        5.3.3  Impact of Support on Distinct Subsequences
        5.3.4  Computation of Leaders
        5.3.5  Classification of Validation Data
   5.4  Preliminary Analysis
   5.5  Proposed Approaches
        5.5.1  Patterns with Frequent Items Only
        5.5.2  Cluster Representatives Only
        5.5.3  Frequent Items Followed by Clustering
        5.5.4  Clustering Followed by Frequent Items
   5.6  Implementation and Experimentation
        5.6.1  Handwritten Digit Data
        5.6.2  Intrusion Detection Data
        5.6.3  Simultaneous Selection of Patterns and Features
   5.7  Summary
   5.8  Bibliographic Notes
   References

6  Domain Knowledge-Based Compaction
   6.1  Introduction
   6.2  Multicategory Classification
   6.3  Support Vector Machine (SVM)
   6.4  Adaptive Boosting
        6.4.1  Adaptive Boosting on Prototypes for Data Mining Applications
   6.5  Decision Trees
   6.6  Preliminary Analysis Leading to Domain Knowledge
        6.6.1  Analytical View
        6.6.2  Numerical Analysis
        6.6.3  Confusion Matrix
   6.7  Proposed Method
        6.7.1  Knowledge-Based (KB) Tree
   6.8  Experimentation and Results
        6.8.1  Experiments Using SVM
        6.8.2  Experiments Using AdaBoost
        6.8.3  Results with AdaBoost on Benchmark Data
   6.9  Summary
   6.10 Bibliographic Notes
   References

7  Optimal Dimensionality Reduction
   7.1  Introduction
   7.2  Feature Selection
        7.2.1  Based on Feature Ranking
        7.2.2  Ranking Features
   7.3  Feature Extraction
        7.3.1  Performance
   7.4  Efficient Approaches to Large-Scale Feature Selection Using Genetic Algorithms
        7.4.1  An Overview of Genetic Algorithms
        7.4.2  Proposed Schemes
        7.4.3  Preliminary Analysis
        7.4.4  Experimental Results
        7.4.5  Summary
   7.5  Bibliographical Notes
   References

8  Big Data Abstraction Through Multiagent Systems
   8.1  Introduction
   8.2  Big Data
   8.3  Conventional Massive Data Systems
        8.3.1  Map-Reduce
        8.3.2  PageRank
   8.4  Big Data and Data Mining
   8.5  Multiagent Systems
        8.5.1  Agent Mining Interaction
        8.5.2  Big Data Analytics
   8.6  Proposed Multiagent Systems
        8.6.1  Multiagent System for Data Reduction
        8.6.2  Multiagent System for Attribute Reduction
        8.6.3  Multiagent System for Heterogeneous Data Access
        8.6.4  Multiagent System for Agile Processing
   8.7  Summary
   8.8  Bibliographic Notes
   References

Appendix  Intrusion Detection Dataset—Binary Representation
   A.1  Data Description and Preliminary Analysis
   A.2  Bibliographic Notes
   References

Glossary

Index


Acronyms

AdaBoost: Adaptive Boosting
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
CA: Classification Accuracy
CART: Classification and Regression Trees
CF: Clustering Feature
CLARA: CLustering LARge Applications
CLARANS: Clustering Large Applications based on RANdomized Search
CNF: Conjunctive Normal Form
CNN: Condensed Nearest Neighbor
CS: Compression Scheme
DFS: Distributed File System
DNF: Disjunctive Normal Form
DTC: Decision Tree Classifier
EDW: Enterprise Data Warehouse
ERM: Expected Risk Minimization
FPTree: Frequent Pattern Tree
FS: Fisher Score
GA: Genetic Algorithm
GFS: Google File System
HDFS: Hadoop Distributed File System
HW: Handwritten
KB: Knowledge-Based
KDD: Knowledge Discovery from Databases
kNNC: k-Nearest-Neighbor Classifier
MAD Analysis: Magnetic, Agile, and Deep Analysis
MDL: Minimum Description Length
MI: Mutual Information
ML: Machine Learning
NNC: Nearest-Neighbor Classifier
NMF: Nonnegative Matrix Factorization
PAM: Partition Around Medoids
PCA: Principal Component Analysis
PCF: Pure Conjunctive Form
RLE: Run-Length Encoded
RP: Random Projections
SA: Simulated Annealing
SBS: Sequential Backward Selection
SBFS: Sequential Backward Floating Selection
SFFS: Sequential Forward Floating Selection
SFS: Sequential Forward Selection
SGA: Simple Genetic Algorithm
SSGA: Steady State Genetic Algorithm
SVM: Support Vector Machine
TS: Taboo Search
VC: Vapnik–Chervonenkis


Chapter 1

Introduction

In this book, we deal with data mining and compression; specifically, we deal with
using several data mining tasks directly on the compressed data.

1.1 Data Mining and Data Compression
Data mining is concerned with generating an abstraction of the input dataset using
a mining task.


1.1.1 Data Mining Tasks
Important data mining tasks are:
1. Clustering. Clustering is the process of grouping data points so that points in each group or cluster are more similar to each other than to points belonging to different clusters. Each resulting cluster is abstracted using one or more
representative patterns. So, clustering is some kind of compression where details of the data are ignored and only cluster representatives are used in further
processing or decision making.
2. Classification. In classification a labeled training dataset is used to learn a model
or classifier. This learnt model is used to label a test (unlabeled) pattern; this
process is called classification.
3. Dimensionality Reduction. A majority of the classification and clustering algorithms fail to produce expected results in dealing with high-dimensional datasets.
Also, computational requirements in the form of time and space can increase
enormously with dimensionality. This prompts reduction of the dimensionality
of the dataset; it is reduced either by using feature selection or feature extraction.
In feature selection, an appropriate subset of features is selected, and in feature
extraction, a subset in some transformed space is selected.
4. Regression or Function Prediction. Here a functional form for variable y is learnt
(where y = f (X)) from given pairs (X, y); the learnt function is used to predict
the values of y for new values of X. This problem may be viewed as a generalization of the classification problem. In classification, the number of class labels is finite, whereas in the regression setting, y can take infinitely many values; typically, y ∈ R.
5. Association Rule Mining. Even though it is of relatively recent origin, it is the
earliest introduced task in data mining and is responsible for bringing visibility
to the area of data mining. In association rule mining, we are interested in finding
out how frequently two subsets of items are associated.
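To make this notion of frequency concrete, the short sketch below computes the support (the fraction of transactions containing an itemset) of every item pair in a toy set of market-basket transactions; the transactions and item names are invented for illustration only.

```python
from collections import Counter
from itertools import combinations

# Toy market-basket transactions (invented for illustration).
transactions = [{"milk", "bread"}, {"milk", "butter"},
                {"milk", "bread", "butter"}, {"bread", "butter"}]

# Count, for every pair of items, how many transactions contain both.
pair_counts = Counter(frozenset(p) for t in transactions
                      for p in combinations(sorted(t), 2))

# Support of a pair = fraction of transactions containing both items.
for pair, count in pair_counts.items():
    print(set(pair), count / len(transactions))
```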

1.1.2 Data Compression
Another important topic in this book is data compression. A compression scheme
CS may be viewed as a function from the set of patterns X to a set of compressed patterns X′. It may be viewed as

CS : X → X′.

Specifically, CS(x) = x′ for x ∈ X and x′ ∈ X′. In a more general setting, we may view CS as giving output x′ using x and some knowledge structure or a dictionary K. So, CS(x, K) = x′ for x ∈ X and x′ ∈ X′. Sometimes, a dictionary
is used in compressing and uncompressing the data. Schemes for compressing data
are the following:
• Lossless Schemes. These schemes are such that CS(x) = x′ and there is an inverse CS⁻¹ such that CS⁻¹(x′) = x. For example, consider a binary string 00001111 (x) as an input; the corresponding run-length-coded string is 44 (x′), where the first 4 corresponds to a run of 4 zeros, and the second 4 corresponds to a run of 4 ones. Also, from the run-length-coded string 44 we can get back the input string 00001111. Note that such a representation is lossless as we get x′ from x using run-length encoding and x from x′ using decoding (a minimal run-length codec is sketched after this list).
• Lossy Schemes. In a lossy compression scheme, it is not possible in general to get back the original data point x from the compressed pattern x′. Pattern recognition and data mining are areas in which there are plenty of examples where lossy compression schemes are used.
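A minimal sketch of such a run-length codec for binary strings, using the convention of the example above (runs of 0s and 1s counted alternately, starting with the 0s); it only illustrates the lossless property and is not the full scheme developed in Chap. 3.

```python
def rle_encode(bits):
    """'00001111' -> [4, 4]: count runs of 0s and 1s alternately, starting
    with 0s (a leading run of length 0 if the string starts with a 1)."""
    runs, current, count = [], '0', 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs

def rle_decode(runs):
    """Invert rle_encode: [4, 4] -> '00001111'."""
    return ''.join(str(i % 2) * r for i, r in enumerate(runs))

x = '00001111'
assert rle_encode(x) == [4, 4]
assert rle_decode(rle_encode(x)) == x   # lossless: x is recovered exactly
```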

We show some example compression schemes in Fig. 1.1.

1.1.3 Compression Using Data Mining Tasks
Among the lossy compression schemes, we consider the data mining tasks themselves. Each of them is a compression scheme, as described below:
• Association rule mining deals with generating frequently cooccurring items/patterns from the given data. It ignores the infrequent items. Rules of association are generated from the frequent itemsets. So, association rules in general cannot be
used to obtain the original input data points provided.
• Clustering is lossy because the output of clustering is a collection of cluster representatives. From the cluster representatives we cannot get back the original data
points. For example, in K-means clustering, each cluster is represented by the
centroid of the data points in it; it is not possible to get back the original data
points from the centroids (a minimal sketch illustrating this appears at the end of this subsection).
• Classification is lossy as the models learnt from the training data cannot be used
to reproduce the input data points. For example, in the case of Support Vector
Machines, a subset of the training patterns called support vectors are used to get
the classifier; it is not possible to generate the input data points from the support
vectors.
• Dimensionality reduction schemes can ignore some of the input features. So, they
are lossy because it is not possible to get the training patterns back from the
dimensionality-reduced ones.
So, each of the mining tasks is lossy in terms of its output obtained from the given
data. In addition, in this book, we deal with data mining tasks working on compressed data, not the original data. We consider data compression schemes that

could be either lossy or nonlossy. Some of the nonlossy data compression schemes
are also shown in Fig. 1.1. These include run-length coding, Huffman coding, and
the zip utility used by the operating systems.
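As a concrete illustration of the clustering case above, the following sketch (assuming NumPy and scikit-learn are available; the data is synthetic) represents 1000 patterns by 8 centroids plus one cluster label per pattern; the nonzero reconstruction error shows that the original points cannot be recovered from the representatives.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((1000, 16))                  # 1000 patterns, 16 features

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
X_hat = km.cluster_centers_[km.labels_]     # each point replaced by its centroid

# Only 8 representatives plus a cluster label per point need to be stored,
# but the reconstruction error is nonzero: the compression is lossy.
print(np.mean((X - X_hat) ** 2))
```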

1.2 Organization
Material in this book is organized as follows.

1.2.1 Data Mining Tasks
We briefly discuss some data mining tasks. We provide a detailed discussion in
Chap. 2.



The data mining tasks considered are the following.
• Clustering. Clustering algorithms generate either a hard or soft partition of the
input dataset. Hard clustering algorithms are either partitional or hierarchical.
Partitional algorithms generate a single partition of the dataset. The number of all
possible partitions of a set of n points into K clusters can be shown to be equal to
\[
\frac{1}{K!} \sum_{i=1}^{K} (-1)^{K-i} \binom{K}{i}\, i^{n}.
\]

So, exhaustive enumeration of all possible partitions of a dataset could be prohibitively expensive. For example, even for a small dataset of 19 patterns to be
partitioned into four clusters, we may have to consider around 11,259,666,000 partitions (a short computation verifying this figure is sketched after this list). In order to reduce the computational load, each of the clustering algorithms restricts these possibilities by selecting an appropriate subset of the set
of all possible K-partitions. In Chap. 2, we consider two partitional algorithms
for clustering. One of them is the K-means algorithm, which is the most popular clustering algorithm; the other is the leader clustering algorithm, which is the
simplest possible algorithm for partitional clustering.
A hierarchical clustering algorithm generates a hierarchy of partitions; partitions at different levels of the hierarchy are of different sizes. We describe the
single-link algorithm, which has been classically used in a variety of areas including numerical taxonomy. Another hierarchical algorithm discussed is BIRCH,
which is very efficient. Both leader and BIRCH are efficient as they need to scan the dataset only once to generate the clusters.
• Classification. We describe two classifiers in Chap. 2. Nearest-neighbor classifier
is the simplest classifier in terms of learning. In fact, it does not learn a model;
it employs all the training data points to label a test pattern. Even though it has
no training time requirement, it can take a long time for labeling a test pattern
if the training dataset is large in size. Its performance deteriorates as the dimensionality of the data points increases; also, it is sensitive to noise in the training
data. A popular variant is the K-nearest-neighbor classifier (KNNC), which labels a test pattern based on the labels of its K nearest neighbors. Even
though KNNC is robust to noise, it can fail to perform well in high-dimensional
spaces. Also, it takes a longer time to classify a test pattern.
Another efficient and state-of-the-art classifier is based on Support Vector Machines (SVMs) and is popularly used in two-class problems. An SVM learns a
subset of the set of training patterns, called the set of support vectors. These correspond to patterns falling on two parallel hyperplanes; these planes, called the
support planes, are separated by a maximum margin. One can design the classifier using the support vectors. The decision boundary separating patterns from
the two classes is located between the two support planes, one per each class. It is
commonly used in high-dimensional spaces, and it classifies a test pattern using
a single dot product computation.
• Association rule mining. A popular scheme for finding frequent itemsets and association rules based on them is Apriori. This was the first association rule mining algorithm; perhaps, it is responsible for the emergence of the area of data mining itself. Even though it originated in market-basket analysis, it can also be used in other pattern classification and clustering applications. We use it in the classification of handwritten digits in the book. We describe the Apriori algorithm in
Chap. 2.
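As referenced in the clustering item above, a few lines suffice to evaluate the partition-count formula; for 19 patterns and four clusters the exact value is 11,259,666,950, which matches the approximate figure quoted there.

```python
from math import comb, factorial

def num_partitions(n, K):
    """Number of partitions of n points into K nonempty clusters
    (a Stirling number of the second kind), via the formula above."""
    return sum((-1) ** (K - i) * comb(K, i) * i ** n
               for i in range(1, K + 1)) // factorial(K)

print(num_partitions(19, 4))    # 11259666950, roughly 1.1e10 partitions
```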
Naturally, in data mining, we need to analyze large-scale datasets; in Chap. 2, we
discuss three different schemes for dealing with large datasets. These include:
1. Incremental Mining. Here, we use the abstraction A_K and the (K + 1)th point X_{K+1} to generate the abstraction A_{K+1}, where A_K is the abstraction generated after examining the first K points (a minimal sketch of such an incremental update appears after this list). It is useful in stream data mining; in big data analytics, it deals with velocity in the three-V model.
2. Divide-and-Conquer Approach. It is a popular scheme used in designing efficient algorithms. Also, the popular and state-of-the-art Map-Reduce scheme is based on this strategy. It is associated with dealing with the volume requirement of the three-V model.
3. Mining Based on an Intermediate Representation. Here an abstraction is learnt by accessing the dataset once or twice; this abstraction is an intermediate representation. Once an intermediate representation is available, the mining is performed on this abstraction rather than on the dataset, which reduces the computational burden. This scheme is also associated with the volume feature of the three-V model.
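A minimal sketch of the incremental idea for one common abstraction, the centroid, assuming NumPy and synthetic data: A_{K+1} is obtained from A_K and the (K+1)th point alone, so the stream is visited exactly once.

```python
import numpy as np

def update_centroid(A_K, K, x_next):
    """Incremental abstraction update: obtain A_{K+1} from A_K and the
    (K+1)th point alone, without revisiting the first K points."""
    return A_K + (x_next - A_K) / (K + 1)

stream = np.random.default_rng(1).random((1000, 5))   # synthetic data stream
A, K = np.zeros(5), 0
for x in stream:                                       # single pass
    A = update_centroid(A, K, x)
    K += 1
assert np.allclose(A, stream.mean(axis=0))             # same as batch centroid
```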

1.2.2 Abstraction in Nonlossy Compression Domain
In Chap. 3, we provide a nonlossy compression scheme and ability to cluster and
classify data in the compressed domain without having to uncompress.
The scheme employs run-length coding of binary patterns. So, it is useful in dealing with either binary input patterns or even numerical vectors that could be viewed
as binary sequences. Specifically, it considers handwritten digits that could be represented as binary patterns and compresses the strings using run-length coding. Now

the compressed patterns are input to a KNNC for classification. It requires a definition of the distance d between a pair of run-length-coded strings to use the KNNC
on the compressed data.
It is shown that the distance d(x, y) between two binary strings x and y and the modified distance d′(x′, y′) between the corresponding run-length-coded (compressed) strings x′ and y′ are equal; that is, d(x, y) = d′(x′, y′). It is shown that the
KNNC using the modified distance on the compressed strings reduces the space and
time requirements by a factor of more than 3 compared to the application of KNNC
on the given original (uncompressed) data.
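The precise modified distance d′ is defined in Chap. 3; the sketch below only illustrates the underlying idea by computing the Hamming distance (which equals the squared Euclidean distance for binary vectors) directly on two run-length encodings, without decompressing either string.

```python
def hamming_rle(runs_x, runs_y):
    """Hamming distance between two equal-length binary strings, given only
    their run-length encodings (alternating runs of 0s and 1s, 0s first)."""
    i = j = 0
    vx = vy = 0                          # value of the current run: 0,1,0,1,...
    rx, ry = runs_x[0], runs_y[0]        # remaining lengths of the current runs
    dist = 0
    while i < len(runs_x) and j < len(runs_y):
        step = min(rx, ry)               # overlap of the two current runs
        if vx != vy:
            dist += step                 # the strings differ on this overlap
        rx -= step
        ry -= step
        if rx == 0:                      # move to the next run of x
            i, vx = i + 1, vx ^ 1
            if i < len(runs_x):
                rx = runs_x[i]
        if ry == 0:                      # move to the next run of y
            j, vy = j + 1, vy ^ 1
            if j < len(runs_y):
                ry = runs_y[j]
    return dist

# '00001111' vs '00110011': encodings [4, 4] and [2, 2, 2, 2], distance 4.
assert hamming_rle([4, 4], [2, 2, 2, 2]) == 4
```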
Such a scheme can be used in a number of applications that involve dissimilarity
computation in patterns with binary-valued features. It should be noted that even
real-valued features can be quantized into binary-valued features by specifying appropriate range and scale factors. Our earlier experience with such conversion on the intrusion detection dataset is that it does not affect the accuracy. In this chapter, we
provide an application of the scheme to the classification of handwritten digit data and compare the improvement obtained in size as well as in computation time. The second application is related to the efficient implementation of genetic algorithms. Genetic algorithms
are robust methods to obtain near-optimal solutions. The compression scheme can
be gainfully employed in situations where the evaluation function in Genetic Algorithms is the classification accuracy of the nearest-neighbor classifier (NNC). NNC
involves computation of dissimilarity a number of times depending on the size of
training data or prototype pattern set as well as test data size. The method can be
used for optimal prototype and feature selection. We discuss an indicative example.
The Vapnik–Chervonenkis (VC) dimension characterizes the complexity of
a class of classifiers. It is important to control the VC dimension to improve the performance of a classifier. Here, we show that the VC dimension is not affected by
using the classifier on compressed data.


1.2.3 Lossy Compression Scheme and Dimensionality Reduction
We propose a lossy compression scheme in Chap. 4. Such compressed data can
be used in both clustering and classification. The proposed scheme compresses the
given data by using frequent items and then considering distinct subsequences. Once
the training data is compressed using this scheme, it is also required to appropriately
deal with test data; it is possible that some of the subsequences present in the test
data are absent in the training data summary. One of the successful schemes employed to deal with this issue is based on replacing a subsequence in the test data by
its nearest neighbor in the training data.
The pruning and transformation scheme employed in achieving compression reduces the dataset size significantly. Despite the loss, the classification accuracy improves because of the possible generalization resulting from the compressed representation.
It is possible to integrate rough set theory to put a threshold on the dissimilarity
between a test pattern and a training pattern represented in the compressed form. If
the distance is below a threshold, then the test pattern is assumed to be in the lower
approximation (proper core region) of the class of the training data; otherwise, it is
placed in the upper approximation (possible reject region).
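A minimal sketch of this decision rule on toy binary strings (the threshold value, the dissimilarity, and the data below are placeholders; the actual rule of Chap. 4 operates on the compressed representation):

```python
def classify_with_rough_threshold(test_x, train, labels, dist, threshold):
    """1-NN decision with a rough-set style threshold: accept the nearest
    label only when the nearest training pattern is close enough (lower
    approximation); otherwise flag a possible reject (upper approximation)."""
    d, label = min((dist(test_x, t), lab) for t, lab in zip(train, labels))
    return (label, "lower") if d < threshold else (label, "upper")

hamming = lambda a, b: sum(ca != cb for ca, cb in zip(a, b))
train, labels = ["0011", "1100"], ["A", "B"]
print(classify_with_rough_threshold("0010", train, labels, hamming, threshold=2))
# ('A', 'lower'): distance 1 to "0011" is below the threshold
```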

1.2.4 Compaction Through Simultaneous Prototype and Feature
Selection
Simultaneous selection of prototypical patterns and features is considered in
Chap. 5. Here data compression is achieved by ignoring some of the rows and
columns in the data matrix; the rows correspond to patterns, and the columns are
features in the data matrix. Some of the important directions explored in this chapter
are:




• The impact of compression based on frequent items and subsequences on prototype selection.
• The representativeness of features selected using data obtained based on frequent
items with a high support value.
• The role of clustering and frequent item generation in lossy data compression and
how the classifier is affected by the representation; it is possible to use clustering
followed by frequent item set generation or frequent item set generation followed
by clustering. Both schemes are explored in evaluating the resulting simultaneous
prototype and feature selection. Here the leader clustering algorithm is used for
prototype selection and frequent itemset-based approaches are used for feature
selection.

1.2.5 Use of Domain Knowledge in Data Compaction
Domain knowledge-based compaction is provided in Chap. 6. We make use of domain knowledge of the data under consideration to design efficient pattern classification schemes. We design a domain knowledge-based decision tree of depth 4
that can classify 10-category data with high accuracy. The classification approaches
based on support vector machines and AdaBoost are used.
We carry out preliminary analysis on datasets and demonstrate how domain knowledge can be derived from the data and from a human expert. So that classification is carried out on representative patterns and not on the complete data, we make use of the condensed nearest-neighbor approach and the leader clustering algorithm.
We demonstrate the working of the proposed schemes on large datasets and public
domain machine learning datasets.

1.2.6 Compression Through Dimensionality Reduction
Optimal dimensionality reduction for lossy data compression is discussed in
Chap. 7. Here both feature selection and feature extraction schemes are described.
In feature selection, both sequential selection schemes and genetic algorithm (GA)
based schemes are discussed. In sequential selection, features are selected one after the other based on some ranking scheme; here each of the remaining features is ranked based on its performance along with the already selected features using some validation data. These sequential schemes are greedy in nature and do not guarantee globally optimal selection. It is possible to show that the GA-based schemes are globally optimal under some conditions; however, most practical implementations may not be able to exploit this global optimality.
Two popular schemes for feature selection are based on Fisher's score and Mutual Information (MI). Fisher's score could be used to select features that can assume continuous values, whereas the MI-based scheme is the most successful for selecting features that are discrete or categorical; it has been used in selecting features in
classification of documents where the given set of features is very large.
Another popular set of feature selection schemes employs the performance of classifiers on selected feature subsets. The classifiers most popularly used in such feature selection include the NNC, the SVM, and the decision tree classifier. Some of the popular
feature extraction schemes are:
• Principal Component Analysis (PCA). Here the extracted features are linear combinations of the given features. The signal processing community has successfully
used PCA-based compression in image and speech data reconstruction. It has
also been used by search engines for capturing semantic similarity between the
query and the documents.
• Nonnegative Matrix Factorization (NMF). Most of the data one typically uses are
nonnegative. In such cases, it is possible to use NMF to reduce the dimensionality.
This reduction in dimensionality is helpful in building effective classifiers to work
on the reduced-dimensional data even though the given data is high-dimensional.
• Random projections (RP). It is another scheme that extracts features that are linear
combinations of the given features; the weights used in the linear combinations
are random values here (a short sketch of PCA and random projections appears after this list).
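A minimal sketch of two of these extraction schemes on synthetic data, assuming only NumPy: PCA derives the combination weights from the data via an SVD, whereas random projection draws them at random.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 200))           # 500 patterns, 200 features (synthetic)

# PCA via SVD: extracted features are data-driven linear combinations.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt[:20].T               # keep the top 20 principal components

# Random projection: the combination weights are random values.
R = rng.normal(size=(200, 20)) / np.sqrt(20)
X_rp = X @ R

print(X_pca.shape, X_rp.shape)       # (500, 20) (500, 20)
```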
In this chapter, it is also shown how to exploit GAs in large-scale feature selection, and the proposed scheme is demonstrated using the handwritten digit data. A problem with a feature vector of about 200 features is considered for obtaining an optimal subset of features. The implementation integrates frequent features and genetic algorithms and brings out the sensitivity of the genetic operators in arriving at the optimal set. It is shown in practice how the choice of the probability used to initialize the population, which is not often discussed in the literature, impacts the size of the final feature set with the other control parameters remaining the same.

1.2.7 Big Data, Multiagent Systems, and Abstraction
Chapter 8 contains ways to generate abstraction from massive datasets. Big data is
characterized by large volumes of heterogeneous types of datasets that need to be
processed to generate abstraction efficiently. Equivalently, big data is characterized
by three v’s, viz., volume, variety, and velocity. Occasionally, the importance of
value is articulated through another v. Big data analytics is multidisciplinary with a
host of topics such as machine learning, statistics, parallel processing, algorithms,
data visualization, etc. The contents include discussion on big data and related topics
such as conventional methods of analyzing big data, MapReduce, PageRank, agents,
and multiagent systems. A detailed discussion on agents and multiagent systems is
provided. Case studies for generating abstraction with big data using multiagent
systems are provided.
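Since MapReduce is only named here (it is discussed in Chap. 8), the toy word-count sketch below merely illustrates the map and reduce phases on made-up documents; it is an illustration of the idea, not the framework itself.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    """Reduce: sum the values emitted for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

docs = ["big data big value", "data mining on big data"]
print(reduce_phase(chain.from_iterable(map_phase(d) for d in docs)))
# {'big': 3, 'data': 3, 'value': 1, 'mining': 1, 'on': 1}
```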



1.3 Summary
In this chapter, we have provided a brief introduction to data compression and mining compressed data. It is possible to use all the data mining tasks on the compressed
data directly. We have then described how the material is organized across the different chapters.
Most of the popular and state-of-the-art mining algorithms are covered in detail in
the subsequent chapters. The various schemes considered and proposed are applied on two datasets, the handwritten digit dataset and the network intrusion detection dataset. Details of the intrusion detection dataset are provided in the Appendix.


1.4 Bibliographical Notes
A detailed description of the bibliography is presented at the end of each chapter, and notes on the bibliography are provided in the respective chapters. This
book deals with data mining and data compression. There has been no major effort so far on applying data mining algorithms directly to compressed data. Some of the important books on compression are by Sayood (2000)
and Salomon et al. (2009). An early book on Data Mining was by Hand et al.
(2001). For a good introduction to data mining, a good source is the book by
Tan et al. (2005). A detailed description of various data mining tasks is given
by Han et al. (2011). The book by Witten et al. (2011) discusses various practical issues and shows how to use the Weka machine learning workbench developed by the authors. One of the recent books is by Rajaraman and Ullman
(2011).
Some of the important journals on data mining are:
1. IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE).
2. ACM Transactions on Knowledge Discovery from Data (ACM TKDD).
3. Data Mining and Knowledge Discovery (DMKD).
Some of the important conferences on this topic are:
1. Knowledge Discovery and Data Mining (KDD).
2. International Conference on Data Engineering (ICDE).
3. IEEE International Conference on Data Mining (ICDM).
4. SIAM International Conference on Data Mining (SDM).

References
J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd edn. (Morgan Kaufmann,
San Mateo, 2011)
D.J. Hand, H. Mannila, P. Smyth, Principles of Data Mining (MIT Press, Cambridge, 2001)
A. Rajaraman, J.D. Ullman, Mining Massive Datasets (Cambridge University Press, Cambridge,
2011)




D. Salomon, G. Motta, D. Bryant, Handbook of Data Compression (Springer, Berlin, 2009)
K. Sayood, Introduction to Data Compression, 2nd edn. (Morgan Kaufmann, San Mateo, 2000)
P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining (Pearson, Upper Saddle River,
2005)
I.H. Witten, E. Frank, M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques,
3rd edn. (Morgan Kaufmann, San Mateo, 2011)


Chapter 2

Data Mining Paradigms

2.1 Introduction
In data mining, the size of the dataset involved is large. It is convenient to visualize
such a dataset as a matrix of size n × d, where n is the number of data points, and d
is the number of features. Typically, it is possible that either n or d or both are large.
In mining such datasets, important issues are:
• The dataset cannot be accommodated in the main memory of the machine. So, we
need to store the data on a secondary storage medium like a disk and transfer the
data in parts to the main memory for processing; such an activity could be time-consuming. Because disk access can be more expensive than accessing
the data from the memory, the number of database scans is an important parameter. So, when we analyze data mining algorithms, it is important to consider the
number of database scans required.

• The dimensionality of the data can be very large. In such a case, several of the
conventional algorithms that use Euclidean-distance-like metrics to characterize proximity between a pair of patterns may not play a meaningful role in
such high-dimensional spaces where the data is sparsely distributed. So, different
techniques to deal with such high-dimensional datasets become important.
• Three important data mining tasks are:
1. Clustering. Here a collection of patterns is partitioned into two or more clusters. Typically, clusters of patterns are represented using cluster representatives; a centroid of the points in the cluster is one of the most popularly used
cluster representatives. Typically, a partition or a clustering is represented by
k representatives, where k is the number of clusters; such a process leads to
lossy data compression. Instead of dealing with all the n data points in the
collection, one can just use the k cluster representatives (where k ≪ n in the
data mining context) for further decision making.
2. Classification. In classification, a machine learning algorithm is used on a
given collection of training data to obtain an appropriate abstraction of the
dataset. Decision trees and probability distributions of points in various classes

