
Constrained Clustering
Advances in Algorithms,
Theory, and Applications



Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A.

AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge
discovery, while summarizing the computational tools and techniques useful in data analysis. This
series encourages the integration of mathematical, statistical, and computational methods and
techniques through the publication of a broad range of textbooks, reference works, and handbooks. The inclusion of concrete examples and applications is highly encouraged. The scope of the
series includes, but is not limited to, titles in the areas of data mining and knowledge discovery
methods and applications, modeling, algorithms, theory and foundations, data and knowledge
visualization, data mining systems and tools, and privacy and security issues.


PUBLISHED TITLES
UNDERSTANDING COMPLEX DATASETS: Data Mining with Matrix
Decompositions
David Skillicorn
COMPUTATIONAL METHODS OF FEATURE SELECTION
Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: Advances in Algorithms, Theory,
and Applications
Sugato Basu, Ian Davidson, and Kiri L. Wagstaff



Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series

Constrained Clustering
Advances in Algorithms,
Theory, and Applications

Edited by

Sugato Basu • Ian Davidson
Kiri L. Wagstaff



Cover image shows the result of clustering a hyperspectral image of Mars using soft constraints to
impose spatial contiguity on cluster assignments. The data set was collected by the Space Telescope
Imaging Spectrograph (STIS) on the Hubble Space Telescope. This image was reproduced with permission from Intelligent Clustering with Instance-Level Constraints by Kiri Wagstaff.
Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2009 by Taylor & Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1
International Standard Book Number-13: 978-1-58488-996-0 (Hardcover)
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or
retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides
licenses and registration for a variety of users. For organizations that have been granted a photocopy
license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Constrained clustering : advances in algorithms, theory, and applications / editors,
Sugato Basu, Ian Davidson, Kiri Wagstaff.
p. cm. -- (Chapman & Hall/CRC data mining and knowledge discovery series)
Includes bibliographical references and index.
ISBN 978-1-58488-996-0 (hardback : alk. paper)
1. Cluster analysis--Data processing. 2. Data mining. 3. Computer algorithms. I.
Basu, Sugato. II. Davidson, Ian, 1971- III. Wagstaff, Kiri. IV. Title. V. Series.
QA278.C63 2008
519.5’3--dc22

2008014590

Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com




Thanks to my family, friends, and colleagues, especially Joulia, Constance,
and Ravi. – Ian
I would like to dedicate this book to all of the friends and colleagues who’ve
encouraged me and engaged in idea-swapping sessions, both about
constrained clustering and other topics. Thank you for all of your feedback
and insights! – Kiri
Dedicated to my family for their love and encouragement, with special thanks
to my wife Shalini for her constant love and support. – Sugato



Foreword

In 1962 Richard Hamming wrote, “The purpose of computation is insight,
not numbers.” But it was not until 1977 that John Tukey formalized the
field of exploratory data analysis. Since then, analysts have been seeking
techniques that give them better understanding of their data. For one- and
two-dimensional data, we can start with a simple histogram or scatter plot.
Our eyes are good at spotting patterns in a two-dimensional plot. But for
more complex data we fall victim to the curse of dimensionality; we need more
complex tools because our unaided eyes can’t pick out patterns in
thousand-dimensional data.
Clustering algorithms pick up where our eyes leave off: they can take data
with any number of dimensions and cluster them into subsets such that each
member of a subset is near the other members in some sense. For example, if
we are attempting to cluster movies, everyone would agree that Sleepless in
Seattle should be placed near (and therefore in the same cluster as) You’ve Got
Mail. They’re both romantic comedies, they’ve got the same director (Nora
Ephron), the same stars (Tom Hanks and Meg Ryan), they both involve falling
in love over a vast electronic communication network. They’re practically the
same movie. But what about comparing Charlie and the Chocolate Factory
with A Nightmare on Elm Street? On most dimensions, these films are near
opposites, and thus should not appear in the same cluster. But if you’re a
Johnny Depp completist, you know he appears in both, and this one factor
will cause you to cluster them together.
Other books have covered the vast array of algorithms for fully-automatic
clustering of multi-dimensional data. This book explains how the Johnny
Depp completist, or any analyst, can communicate his or her preferences to
an automatic clustering algorithm, so that the patterns that emerge make
sense to the analyst; so that they yield insight, not just clusters. How can the
analyst communicate with the algorithm? In the first five chapters, it is by
specifying constraints of the form “these two examples should (or should not)
go together.” In the chapters that follow, the analyst gains vocabulary, and
can talk about a taxonomy of categories (such as romantic comedy or Johnny
Depp movie), can talk about the size of the desired clusters, can talk about
how examples are related to each other, and can ask for a clustering that is
different from the last one.
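To make the simplest form of that communication concrete, here is a minimal
sketch of how must-link and cannot-link constraints can be enforced during
cluster assignment (illustrative Python; the function and variable names are
hypothetical, not code from any chapter of this book):

    def violates_constraints(point, cluster, assignment, must_link, cannot_link):
        """Return True if placing `point` in `cluster` would break a constraint."""
        for a, b in must_link:
            if point in (a, b):
                other = b if point == a else a
                # A must-link partner already assigned to a different cluster
                # rules this cluster out.
                if other in assignment and assignment[other] != cluster:
                    return True
        for a, b in cannot_link:
            if point in (a, b):
                other = b if point == a else a
                # A cannot-link partner already sitting in this cluster
                # rules it out too.
                if assignment.get(other) == cluster:
                    return True
        return False

    # The movie example from above: the two Ephron films must share a cluster.
    must_link = [("Sleepless in Seattle", "You've Got Mail")]
    cannot_link = [("Charlie and the Chocolate Factory",
                    "A Nightmare on Elm Street")]
    assignment = {"Sleepless in Seattle": 0}
    print(violates_constraints("You've Got Mail", 1,
                               assignment, must_link, cannot_link))  # True
    print(violates_constraints("You've Got Mail", 0,
                               assignment, must_link, cannot_link))  # False

Constrained k-means algorithms for hard constraints, such as the one presented
in Chapter 1 (Table 1.1), apply this kind of check in their assignment step,
ruling out any cluster whose choice would violate a constraint.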
Of course, there is a lot of theory in the basics of clustering, and in the
refinements of constrained clustering, and this book covers the theory well.
But theory would have no purpose without practice, and this book shows how



constrained clustering can be used to tackle large problems involving textual,
relational, and even video data. After reading this book, you will have the
tools to be a better analyst, to gain more insight from your data, whether it
be textual, audio, video, relational, genomic, or anything else.
Dr. Peter Norvig
Director of Research
Google, Inc.
December 2007



Editor Biographies

Sugato Basu is a senior research scientist at Google, Inc. His areas of research expertise include machine learning, data mining, information retrieval,
statistical pattern recognition and optimization, with special emphasis on scalable algorithm design and analysis for large text corpora and social networks.
He obtained his Ph.D. in machine learning from the computer science department of the University of Texas at Austin. His Ph.D. work on designing novel
constrained clustering algorithms, using probabilistic models for incorporating prior domain knowledge into clustering, won him the Best Research Paper
Award at KDD in 2004 and the Distinguished Student Paper Award at ICML
in 2005. He has served on multiple conference program committees, journal
review committees and NSF panels in machine learning and data mining, and
has given several invited tutorials and talks on constrained clustering. He has
written conference papers, journal papers, book chapters, and encyclopedia
articles in a variety of research areas including clustering, semi-supervised
learning, record linkage, social search and routing, rule mining and optimization.
Ian Davidson is an assistant professor of computer science at the University of California at Davis. His research areas are data mining, artificial
intelligence and machine learning, in particular focusing on formulating novel
problems and applying rigorous mathematical techniques to address them.
His contributions to the area of clustering with constraints include proofs of
intractability for both batch and incremental versions of the problem and
the use of constraints with both agglomerative and non-hierarchical clustering algorithms. He is the recipient of an NSF CAREER Award on Knowledge
Enhanced Clustering and has won Best Paper Awards at the SIAM and IEEE
data mining conferences. Along with Dr. Basu he has given tutorials on
clustering with constraints at several leading data mining conferences and has
served on over 30 program committees for conferences in his research fields.
Kiri L. Wagstaff is a senior researcher at the Jet Propulsion Laboratory
in Pasadena, CA. Her focus is on developing new machine learning methods,
particularly those that can be used for data analysis onboard spacecraft, enabling missions with higher capability and autonomy. Her Ph.D. dissertation,
“Intelligent Clustering with Instance-Level Constraints,” initiated work in the
machine learning community on constrained clustering methods. She has developed additional techniques for analyzing data collected by instruments on
the EO-1 Earth Orbiter, Mars Pathfinder, and Mars Odyssey. The applications
range from detecting dust storms on Mars to predicting crop yield on
Earth. She is currently working in a variety of machine learning areas including multiple-instance learning, change detection in images, and ensemble
learning. She is also pursuing a Master’s degree in Geology at the University
of Southern California, and she teaches computer science classes at California
State University, Los Angeles.



Contributors

Charu Aggarwal
IBM T. J. Watson Research Center
Hawthorne, New York, USA

Joachim M. Buhmann
ETH Zurich
Zurich, Switzerland

Arindam Banerjee
Dept. of Computer Science and Eng.
University of Minnesota Twin Cities
Minneapolis, Minnesota, USA

Rich Caruana
Dept. of Computer Science
Cornell University
Ithaca, New York, USA

Aharon Bar-Hillel
Intel Research
Haifa, Israel

David Cohn
Google, Inc.
Mountain View, California, USA

Boaz Ben-Moshe
Dept. of Computer Science
Simon Fraser University
Burnaby, British Columbia, Canada

Ayhan Demiriz
Dept. of Industrial Eng.
Sakarya University
Sakarya, Turkey

Kristin P. Bennett
Dept. of Mathematical Sciences
Rensselaer Polytechnic Institute
Troy, New York, USA

Marie desJardins
Dept. of Computer Science and EE
University of Maryland Baltimore County
Baltimore, Maryland, USA

Indrajit Bhattacharya
IBM India Research Laboratory
New Delhi, India

Martin Ester
Dept. of Computer Science
Simon Fraser University
Burnaby, British Columbia, Canada

Jean-François Boulicaut
INSA-Lyon
Villeurbanne Cedex, France

Julia Ferraioli
Bryn Mawr College
Bryn Mawr, Pennsylvania, USA

Paul S. Bradley
Apollo Data Technologies
Bellevue, Washington, USA

Byron J. Gao
Dept. of Computer Science
Simon Fraser University
Burnaby, British Columbia, Canada



Stephen C. Gates
IBM T. J. Watson Research Center
Hawthorne, New York, USA

Rong Ge
Dept. of Computer Science
Simon Fraser University
Burnaby, British Columbia, Canada

Lise Getoor
Dept. of Computer Science and UMIACS
University of Maryland
College Park, Maryland, USA

Joydeep Ghosh
Dept. of Elec. and Computer Eng.
University of Texas at Austin
Austin, Texas, USA

Nicole Immorlica
Dept. of Computer Science
Northwestern University
Evanston, Illinois, USA

Anil K. Jain
Dept. of Computer Science and Eng.
Michigan State University
East Lansing, Michigan, USA

Laks V. S. Lakshmanan
Dept. of Computer Science
University of British Columbia
Vancouver, Canada

Tilman Lange
ETH Zurich
Zurich, Switzerland

David Gondek
IBM T. J. Watson Research Center
Hawthorne, New York, USA

Martin H. Law
Dept. of Computer Science and Eng.
Michigan State University
East Lansing, Michigan, USA

Jiawei Han
Dept. of Computer Science
University of Illinois
Urbana-Champaign, Illinois, USA

Todd K. Leen
Dept. of Computer Science and Eng.
Oregon Graduate Institute
Beaverton, Oregon, USA

Alexander G. Hauptmann
School of Computer Science
Carnegie Mellon University
Pittsburgh, Pennsylvania, USA

Zhengdong Lu
Dept. of Computer Science and Eng.
Oregon Graduate Institute
Beaverton, Oregon, USA

Tomer Hertz
Microsoft Research
Redmond, Washington, USA

James MacGlashan
Dept. of Computer Science and EE
University of Maryland Baltimore County
Baltimore, Maryland, USA

Zengjian Hu
Dept. of Computer Science
Simon Fraser University
Burnaby, British Columbia, Canada



Andrew Kachites McCallum
Dept. of Computer Science
University of Massachusetts Amherst
Amherst, Massachusetts, USA


Anthony Wirth
Dept. of Computer Science
and Software Eng.
The University of Melbourne
Melbourne, Victoria, Australia

Raymond T. Ng
Dept. of Computer Science
University of British Columbia
Vancouver, Canada

Rong Yan
IBM T. J. Watson Research Center
Hawthorne, New York, USA

Satoshi Oyama
Dept. of Social Informatics
Kyoto University
Kyoto, Japan

Jie Yang
School of Computer Science
Carnegie Mellon University
Pittsburgh, Pennsylvania, USA

Ruggero G. Pensa
ISTI-CNR
Pisa, Italy

Philip Yu
IBM T. J. Watson Research Center
Hawthorne, New York, USA


Céline Robardet
INSA-Lyon
Villeurbanne Cedex, France

Jian Zhang
Dept. of Statistics
Purdue University
West Lafayette, Indiana, USA

Noam Shental
Dept. of Physics of Complex Systems
Weizmann Institute of Science
Rehovot, Israel

Katsumi Tanaka
Dept. of Social Informatics
Kyoto University
Kyoto, Japan

Anthony K. H. Tung
Dept. of Computer Science
National University of Singapore
Singapore

Daphna Weinshall
School of Computer Science and Eng.
and the Center for Neural Comp.
The Hebrew University of Jerusalem
Jerusalem, Israel




List of Tables

1.1   Constrained k-means algorithm for hard constraints . . . . . . .   4

5.1   F-scores on the toy data set . . . . . . . . . . . . . . . . . . 114
5.2   Ethnicity classification results . . . . . . . . . . . . . . . . 115
5.3   Newsgroup data classification results . . . . . . . . . . . . . 117
5.4   Segmentation results . . . . . . . . . . . . . . . . . . . . . . 117

6.1   A Boolean matrix and its associated co-clustering . . . . . . . 125
6.2   CDK-Means pseudo-code . . . . . . . . . . . . . . . . . . . . . 130
6.3   Constrained CDK-Means pseudo-code . . . . . . . . . . . . . . . 133
6.4   Co-clustering without constraints . . . . . . . . . . . . . . . 139
6.5   Co-clustering (1 pairwise constraint) . . . . . . . . . . . . . 140
6.6   Co-clustering (2 pairwise constraints) . . . . . . . . . . . . . 140
6.7   Co-clustering without interval constraints . . . . . . . . . . . 140
6.8   Clustering adult drosophila individuals . . . . . . . . . . . . 144

8.1   Samples needed for a given confidence level . . . . . . . . . . 176

9.1   Web data set k-Means and constrained k-Means results . . . . . . 212
9.2   Web data set k-Median and constrained k-Median results . . . . . 212

10.1  Performance of different algorithms on real data sets . . . . . 236
10.2  Resolution accuracy for queries for different algorithms . . . . 237
10.3  Execution time of different algorithms . . . . . . . . . . . . . 238

11.1  Confusion matrices for face data . . . . . . . . . . . . . . . . 270
11.2  Synthetic successive clustering results . . . . . . . . . . . . 278
11.3  Comparison of non-redundant clustering algorithms . . . . . . . 281

12.1  Summaries of the real data set . . . . . . . . . . . . . . . . . 306
12.2  Comparison of NetScan and k-Means . . . . . . . . . . . . . . . 306

14.1  The five approaches that we tested empirically . . . . . . . . . 338

15.1  DBLP data set . . . . . . . . . . . . . . . . . . . . . . . . . 368
15.2  Maximum F-measure values . . . . . . . . . . . . . . . . . . . . 369
15.3  Results with the correct cluster numbers . . . . . . . . . . . . 370


List of Figures

1.1   Constraints for improved clustering results . . . . . . . . . .   5
1.2   Constraints for metric learning . . . . . . . . . . . . . . . .   6

2.1   Illustration of semi-supervised clustering . . . . . . . . . . .  22
2.2   Learning curves for different clustering approaches . . . . . .  25
2.3   Overlap of top terms by γ-weighting and information gain . . . .  26

3.1   Examples of the benefits of incorporating equivalence constraints
      into EM . . . . . . . . . . . . . . . . . . . . . . . . . . . .  35
3.2   A Markov network representation of cannot-link constraints . . .  43
3.3   A Markov network representation of both must-link and
      cannot-link constraints . . . . . . . . . . . . . . . . . . . .  45
3.4   Results: UCI repository data sets . . . . . . . . . . . . . . .  47
3.5   Results: Yale database . . . . . . . . . . . . . . . . . . . . .  49
3.6   A Markov network for calculating the normalization factor Z . .  53

4.1   The influence of constraint weight on model fitting . . . . . .  66
4.2   The density model fit with different weight . . . . . . . . . .  67
4.3   Three artificial data sets, with class denoted by symbols . . .  74
4.4   Classification accuracy with noisy pairwise relations . . . . .  75
4.5   Clustering with hard constraints derived from partial labeling .  77
4.6   Clustering on the Brodatz texture data . . . . . . . . . . . . .  79

5.1   Spectrum between supervised and unsupervised learning . . . . .  93
5.2   Segmentation example and constraint-induced graph structure . .  94
5.3   Hidden MRF on labels . . . . . . . . . . . . . . . . . . . . . . 103
5.4   Synthetic data used in the experiments . . . . . . . . . . . . . 114
5.5   Sample face images for the ethnicity classification task . . . . 115
5.6   Segmentation results . . . . . . . . . . . . . . . . . . . . . . 118

6.1   A synthetic data set . . . . . . . . . . . . . . . . . . . . . . 137
6.2   Two co-clusterings . . . . . . . . . . . . . . . . . . . . . . . 138
6.3   Results on the malaria data set . . . . . . . . . . . . . . . . 141
6.4   Results on the drosophila data set . . . . . . . . . . . . . . . 143

7.1   The clustering algorithm . . . . . . . . . . . . . . . . . . . . 154
7.2   Assigning documents to pseudo-centroids . . . . . . . . . . . . 154
7.3   Projecting out the less important terms . . . . . . . . . . . . 155
7.4   Merging very closely related clusters . . . . . . . . . . . . . 155
7.5   Removal of poorly defined clusters . . . . . . . . . . . . . . . 155
7.6   The classification algorithm . . . . . . . . . . . . . . . . . . 158
7.7   Some examples of constituent documents in each cluster . . . . . 162
7.8   Some examples of classifier performance . . . . . . . . . . . . 165
7.9   Survey results . . . . . . . . . . . . . . . . . . . . . . . . . 165

8.1   Balanced clustering of news20 data set . . . . . . . . . . . . . 180
8.2   Balanced clustering of yahoo data set . . . . . . . . . . . . . 181
8.3   Frequency sensitive spherical k-means on news20 . . . . . . . . 187
8.4   Frequency sensitive spherical k-means on yahoo . . . . . . . . . 188
8.5   Static and streaming algorithms on news20 . . . . . . . . . . . 190
8.6   Static and streaming algorithms on yahoo . . . . . . . . . . . . 191

9.1   Equivalent minimum cost flow formulation . . . . . . . . . . . . 207
9.2   Average number of clusters with fewer than τ for small data
      sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
9.3   Average ratio of objective function (9.1) . . . . . . . . . . . 211

10.1  Bibliographic example of references and hyper-edges . . . . . . 224
10.2  The relational clustering algorithm . . . . . . . . . . . . . . 231
10.3  Illustrations of identifying and ambiguous relations . . . . . . 234
10.4  Precision-recall and F1 plots . . . . . . . . . . . . . . . . . 237
10.5  Performance of different algorithms on synthetic data . . . . . 239
10.6  Effect of different relationships on collective clustering . . . 240
10.7  Effect of expansion levels on collective clustering . . . . . . 240

11.1  CondEns algorithm . . . . . . . . . . . . . . . . . . . . . . . 253
11.2  Multivariate IB: Bayesian networks . . . . . . . . . . . . . . . 259
11.3  Multivariate IB: alternate output network . . . . . . . . . . . 260
11.4  Multinomial update equations . . . . . . . . . . . . . . . . . . 264
11.5  Gaussian update equations . . . . . . . . . . . . . . . . . . . 265
11.6  GLM update equations . . . . . . . . . . . . . . . . . . . . . . 266
11.7  Sequential method . . . . . . . . . . . . . . . . . . . . . . . 268
11.8  Deterministic annealing algorithm . . . . . . . . . . . . . . . 269
11.9  Face images: clustering . . . . . . . . . . . . . . . . . . . . 270
11.10 Face images: non-redundant clustering . . . . . . . . . . . . . 270
11.11 Text results . . . . . . . . . . . . . . . . . . . . . . . . . . 272
11.12 Synthetic results . . . . . . . . . . . . . . . . . . . . . . . 273
11.13 Orthogonality relaxation: example sets . . . . . . . . . . . . . 276
11.14 Orthogonality relaxation: results . . . . . . . . . . . . . . . 277
11.15 Synthetic successive clustering results: varying generation . . 279

12.1  Constructed graph g . . . . . . . . . . . . . . . . . . . . . . 294
12.2  Deployment of nodes on the line . . . . . . . . . . . . . . . . 294
12.3  Polynomial exact algorithm for CkC . . . . . . . . . . . . . . . 297
12.4  Converting a solution of CkC to a solution of CkC . . . . . . . 298
12.5  Illustration of Algorithm 2 . . . . . . . . . . . . . . . . . . 299
12.6  Step II of NetScan . . . . . . . . . . . . . . . . . . . . . . . 301
12.7  Radius increment . . . . . . . . . . . . . . . . . . . . . . . . 303
12.8  Outlier elimination by radius histogram . . . . . . . . . . . . 304

13.1  A correlation clustering example . . . . . . . . . . . . . . . . 315

14.1  Initial display of the overlapping circles data set . . . . . . 335
14.2  Layout after two instances have been moved . . . . . . . . . . . 335
14.3  Layout after three instances have been moved . . . . . . . . . . 336
14.4  Layout after four instances have been moved . . . . . . . . . . 336
14.5  Layout after 14 instances have been moved . . . . . . . . . . . 337
14.6  2D view of the Overlapping Circles data set . . . . . . . . . . 341
14.7  Experimental results on the Circles data set . . . . . . . . . . 343
14.8  Effect of edge types on the Circles data set . . . . . . . . . . 344
14.9  Experimental results on the Overlapping Circles data set . . . . 345
14.10 Effect of edge types on the Overlapping Circles data set . . . . 346
14.11 Experimental results on the Iris data set . . . . . . . . . . . 347
14.12 Effect of edge types on the Iris data set . . . . . . . . . . . 348
14.13 Experimental results on the IMDB data set . . . . . . . . . . . 349
14.14 Experimental results on the music data set . . . . . . . . . . . 350
14.15 Effect of edge types on the music data set . . . . . . . . . . . 351
14.16 Experimental results on the Amino Acid Indices data set . . . . 352
14.17 Experimental results on the Amino Acid data set . . . . . . . . 353

15.1  Using dissimilar example pairs in learning a metric . . . . . . 359
15.2  Results of author identification for DBLP data set . . . . . . . 374

16.1  A pivot movement graph . . . . . . . . . . . . . . . . . . . . . 383
16.2  The actual situation . . . . . . . . . . . . . . . . . . . . . . 383
16.3  An example of a deadlock cycle . . . . . . . . . . . . . . . . . 385
16.4  An example of micro-cluster sharing . . . . . . . . . . . . . . 388
16.5  Transforming DHP to PM . . . . . . . . . . . . . . . . . . . . . 393

17.1  Examples of various pairwise constraints . . . . . . . . . . . . 398
17.2  A comparison of loss functions . . . . . . . . . . . . . . . . . 406
17.3  An illustration of the pairwise learning algorithms applied to
      the synthetic data . . . . . . . . . . . . . . . . . . . . . . . 413
17.4  Examples of images from a geriatric nursing home . . . . . . . . 418
17.5  The flowchart of the learning process . . . . . . . . . . . . . 420
17.6  Summary of the experimental results . . . . . . . . . . . . . . 422
17.7  The classification error of various algorithms against the
      number of constraints . . . . . . . . . . . . . . . . . . . . . 423
17.8  The classification error of the CPKLR algorithm against the
      number of constraints . . . . . . . . . . . . . . . . . . . . . 424
17.9  The labeling interface for the user study . . . . . . . . . . . 425
17.10 The classification errors against the number of noisy
      constraints . . . . . . . . . . . . . . . . . . . . . . . . . . 426


Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . .   1
Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
  1.1  Background and Motivation . . . . . . . . . . . . . . . . . . .   1
  1.2  Initial Work: Instance-Level Constraints . . . . . . . . . . .   2
       1.2.1  Enforcing Pairwise Constraints . . . . . . . . . . . . .   3
       1.2.2  Learning a Distance Metric from Pairwise Constraints . .   5
  1.3  Advances Contained in This Book . . . . . . . . . . . . . . . .   6
       1.3.1  Constrained Partitional Clustering . . . . . . . . . . .   7
       1.3.2  Beyond Pairwise Constraints . . . . . . . . . . . . . .   8
       1.3.3  Theory . . . . . . . . . . . . . . . . . . . . . . . . .   9
       1.3.4  Applications . . . . . . . . . . . . . . . . . . . . . .   9
  1.4  Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . .  10
  1.5  Notation and Symbols . . . . . . . . . . . . . . . . . . . . .  12

2 Semi-Supervised Clustering with User Feedback . . . . . . . . . . .  17
David Cohn, Rich Caruana, and Andrew Kachites McCallum
  2.1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  17
       2.1.1  Relation to Active Learning . . . . . . . . . . . . . .  19
  2.2  Clustering . . . . . . . . . . . . . . . . . . . . . . . . . .  20
  2.3  Semi-Supervised Clustering . . . . . . . . . . . . . . . . . .  21
       2.3.1  Implementing Pairwise Document Constraints . . . . . . .  22
       2.3.2  Other Constraints . . . . . . . . . . . . . . . . . . .  23
  2.4  Experiments . . . . . . . . . . . . . . . . . . . . . . . . . .  24
       2.4.1  Clustering Performance . . . . . . . . . . . . . . . . .  24
       2.4.2  Learning Term Weightings . . . . . . . . . . . . . . . .  26
  2.5  Discussion . . . . . . . . . . . . . . . . . . . . . . . . . .  27
       2.5.1  Constraints vs. Labels . . . . . . . . . . . . . . . . .  27
       2.5.2  Types of User Feedback . . . . . . . . . . . . . . . . .  27
       2.5.3  Other Applications . . . . . . . . . . . . . . . . . . .  28
       2.5.4  Related Work . . . . . . . . . . . . . . . . . . . . . .  28

3 Gaussian Mixture Models with Equivalence Constraints . . . . . . . .  33
Noam Shental, Aharon Bar-Hillel, Tomer Hertz, and Daphna Weinshall
  3.1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  34
  3.2  Constrained EM: The Update Rules . . . . . . . . . . . . . . .  36
       3.2.1  Notations . . . . . . . . . . . . . . . . . . . . . . .  37
       3.2.2  Incorporating Must-Link Constraints . . . . . . . . . .  38
       3.2.3  Incorporating Cannot-Link Constraints . . . . . . . . .  41
       3.2.4  Combining Must-Link and Cannot-Link Constraints . . . .  44
  3.3  Experimental Results . . . . . . . . . . . . . . . . . . . . .  45
       3.3.1  UCI Data Sets . . . . . . . . . . . . . . . . . . . . .  46
       3.3.2  Facial Image Database . . . . . . . . . . . . . . . . .  48
  3.4  Obtaining Equivalence Constraints . . . . . . . . . . . . . . .  49
  3.5  Related Work . . . . . . . . . . . . . . . . . . . . . . . . .  50
  3.6  Summary and Discussion . . . . . . . . . . . . . . . . . . . .  51
  3.7  Appendix: Calculating the Normalizing Factor Z and its
       Derivatives when Introducing Cannot-Link Constraints . . . . .  52
       3.7.1  Exact Calculation of Z and ∂Z/∂αl . . . . . . . . . . .  53
       3.7.2  Approximating Z Using the Pseudo-Likelihood
              Assumption . . . . . . . . . . . . . . . . . . . . . . .  54

4 Pairwise Constraints as Priors in Probabilistic Clustering . . . . .  59
Zhengdong Lu and Todd K. Leen
  4.1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  60
  4.2  Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  61
       4.2.1  Prior Distribution on Cluster Assignments . . . . . . .  61
       4.2.2  Pairwise Relations . . . . . . . . . . . . . . . . . . .  62
       4.2.3  Model Fitting . . . . . . . . . . . . . . . . . . . . .  63
       4.2.4  Selecting the Constraint Weights . . . . . . . . . . . .  65
  4.3  Computing the Cluster Posterior . . . . . . . . . . . . . . . .  68
       4.3.1  Two Special Cases with Easy Inference . . . . . . . . .  68
       4.3.2  Estimation with Gibbs Sampling . . . . . . . . . . . . .  69
       4.3.3  Estimation with Mean Field Approximation . . . . . . . .  70
  4.4  Related Models . . . . . . . . . . . . . . . . . . . . . . . .  70
  4.5  Experiments . . . . . . . . . . . . . . . . . . . . . . . . . .  72
       4.5.1  Artificial Constraints . . . . . . . . . . . . . . . . .  73
       4.5.2  Real-World Problems . . . . . . . . . . . . . . . . . .  75
  4.6  Discussion . . . . . . . . . . . . . . . . . . . . . . . . . .  78
  4.7  Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . .  80
  4.8  Appendix B . . . . . . . . . . . . . . . . . . . . . . . . . .  81
  4.9  Appendix C . . . . . . . . . . . . . . . . . . . . . . . . . .  83

5 Clustering with Constraints: A Mean-Field Approximation
Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  91
Tilman Lange, Martin H. Law, Anil K. Jain, and Joachim M. Buhmann
  5.1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  92
       5.1.1  Related Work . . . . . . . . . . . . . . . . . . . . . .  93
  5.2  Model-Based Clustering . . . . . . . . . . . . . . . . . . . .  96
  5.3  A Maximum Entropy Approach to Constraint Integration . . . . .  98
       5.3.1  Integration of Partial Label Information . . . . . . . .  99
       5.3.2  Maximum-Entropy Label Prior . . . . . . . . . . . . . . 100
       5.3.3  Markov Random Fields and the Gibbs Distribution . . . . 102
       5.3.4  Parameter Estimation . . . . . . . . . . . . . . . . . . 104
       5.3.5  Mean-Field Approximation for Posterior Inference . . . . 105
       5.3.6  A Detour: Pairwise Clustering, Constraints, and Mean
              Fields . . . . . . . . . . . . . . . . . . . . . . . . . 107
       5.3.7  The Need for Weighting . . . . . . . . . . . . . . . . . 108
       5.3.8  Selecting η . . . . . . . . . . . . . . . . . . . . . . 110
  5.4  Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 112
  5.5  Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

6 Constraint-Driven Co-Clustering of 0/1 Data . . . . . . . . . . . . 123
Ruggero G. Pensa, Céline Robardet, and Jean-François Boulicaut
  6.1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 124
  6.2  Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . 126
  6.3  A Constrained Co-Clustering Algorithm Based on a
       Local-to-Global Approach . . . . . . . . . . . . . . . . . . . 127
       6.3.1  A Local-to-Global Approach . . . . . . . . . . . . . . . 128
       6.3.2  The CDK-Means Proposal . . . . . . . . . . . . . . . . . 128
       6.3.3  Constraint-Driven Co-Clustering . . . . . . . . . . . . 130
       6.3.4  Discussion on Constraint Processing . . . . . . . . . . 132
  6.4  Experimental Validation . . . . . . . . . . . . . . . . . . . . 134
       6.4.1  Evaluation Method . . . . . . . . . . . . . . . . . . . 135
       6.4.2  Using Extended Must-Link and Cannot-Link
              Constraints . . . . . . . . . . . . . . . . . . . . . . 136
       6.4.3  Time Interval Cluster Discovery . . . . . . . . . . . . 139
  6.5  Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 144

7 On Supervised Clustering for Creating Categorization
Segmentations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Charu Aggarwal, Stephen C. Gates, and Philip Yu
  7.1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 150
  7.2  A Description of the Categorization System . . . . . . . . . . 151
       7.2.1  Some Definitions and Notations . . . . . . . . . . . . . 151
       7.2.2  Feature Selection . . . . . . . . . . . . . . . . . . . 152
       7.2.3  Supervised Cluster Generation . . . . . . . . . . . . . 153
       7.2.4  Categorization Algorithm . . . . . . . . . . . . . . . . 157
  7.3  Performance of the Categorization System . . . . . . . . . . . 160
       7.3.1  Categorization . . . . . . . . . . . . . . . . . . . . . 164
       7.3.2  An Empirical Survey of Categorization Effectiveness . . 166
  7.4  Conclusions and Summary . . . . . . . . . . . . . . . . . . . . 168

8 Clustering with Balancing Constraints . . . . . . . . . . . . . . . 171
Arindam Banerjee and Joydeep Ghosh
  8.1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 171
  8.2  A Scalable Framework for Balanced Clustering . . . . . . . . . 174
       8.2.1  Formulation and Analysis . . . . . . . . . . . . . . . . 174
       8.2.2  Experimental Results . . . . . . . . . . . . . . . . . . 177
  8.3  Frequency Sensitive Approaches for Balanced Clustering . . . . 182
       8.3.1  Frequency Sensitive Competitive Learning . . . . . . . . 182
       8.3.2  Case Study: Balanced Clustering of Directional Data . . 183
       8.3.3  Experimental Results . . . . . . . . . . . . . . . . . . 186
  8.4  Other Balanced Clustering Approaches . . . . . . . . . . . . . 191
       8.4.1  Balanced Clustering by Graph Partitioning . . . . . . . 192
       8.4.2  Model-Based Clustering with Soft Balancing . . . . . . . 194
  8.5  Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . 195

9 Using Assignment Constraints to Avoid Empty Clusters in
k-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Ayhan Demiriz, Kristin P. Bennett, and Paul S. Bradley
  9.1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 202
  9.2  Constrained Clustering Problem and Algorithm . . . . . . . . . 203
  9.3  Cluster Assignment Sub-Problem . . . . . . . . . . . . . . . . 206
  9.4  Numerical Evaluation . . . . . . . . . . . . . . . . . . . . . 208
  9.5  Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . 213
  9.6  Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 217

10 Collective Relational Clustering . . . . . . . . . . . . . . . . . 221
Indrajit Bhattacharya and Lise Getoor
  10.1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 222
  10.2  Entity Resolution: Problem Formulation . . . . . . . . . . . . 223
        10.2.1  Pairwise Resolution . . . . . . . . . . . . . . . . . 224
        10.2.2  Collective Resolution . . . . . . . . . . . . . . . . 225
        10.2.3  Entity Resolution Using Relationships . . . . . . . . 226
        10.2.4  Pairwise Decisions Using Relationships . . . . . . . . 226
        10.2.5  Collective Relational Entity Resolution . . . . . . . 227
  10.3  An Algorithm for Collective Relational Clustering . . . . . . 230
  10.4  Correctness of Collective Relational Clustering . . . . . . . 233
  10.5  Experimental Evaluation . . . . . . . . . . . . . . . . . . . 235
        10.5.1  Experiments on Synthetic Data . . . . . . . . . . . . 238
  10.6  Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 241

11 Non-Redundant Data Clustering . . . . . . . . . . . . . . . . . . . 245
David Gondek
  11.1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 245
  11.2  Problem Setting . . . . . . . . . . . . . . . . . . . . . . . 246
        11.2.1  Background Concepts . . . . . . . . . . . . . . . . . 247

