

Studies in Classification, Data Analysis,
and Knowledge Organization
Managing Editors
H.-H. Bock, Aachen
W. Gaul, Karlsruhe
M. Vichi, Rome

Editorial Board
Ph. Arabie, Newark
D. Baier, Cottbus
F. Critchley, Milton Keynes
R. Decker, Bielefeld
E. Diday, Paris
M. Greenacre, Barcelona
C. Lauro, Naples
J. Meulman, Leiden
P. Monari, Bologna
S. Nishisato, Toronto
N. Ohsumi, Tokyo
O. Opitz, Augsburg
G. Ritter, Passau
M. Schader, Mannheim
C. Weihs, Dortmund


Titles in the Series
W. Gaul and D. Pfeifer (Eds.)
From Data to Knowledge. 1995


H.-H. Bock and W. Polasek (Eds.)
Data Analysis and Information
Systems. 1996
E. Diday, Y. Lechevallier, and
O. Opitz (Eds.) Ordinal and
Symbolic Data Analysis. 1996
R. Klar and O. Opitz (Eds.)
Classification and Knowledge
Organization. 1997
C. Hayashi, N. Ohsumi, K. Yajima,
Y. Tanaka, H.-H. Bock, and Y. Baba
(Eds.)
Data Science, Classification,
and Related Methods. 1998
I. Balderjahn, R. Mathar, and
M. Schader (Eds.)
Classification, Data Analysis, and
Data Highways. 1998
A. Rizzi, M. Vichi, and H.-H. Bock
(Eds.)
Advances in Data Science
and Classification. 1998
M. Vichi and O. Opitz (Eds.)
Classification and Data Analysis.
1999
W. Gaul and H. Locarek-Junge (Eds.)
Classification in the Information
Age. 1999
H.-H. Bock and E. Diday (Eds.)
Analysis of Symbolic Data. 2000

H. A. L. Kiers, J.-P. Rasson, P. J. F.
Groenen, and M. Schader (Eds.)
Data Analysis, Classification, and
Related Methods. 2000
W. Gaul, O. Opitz, M. Schader (Eds.)
Data Analysis. 2000
R. Decker and W. Gaul (Eds.)
Classification and Information
Processing at the Turn of the
Millennium. 2000
S. Borra, R. Rocci, M. Vichi,
and M. Schader (Eds.)
Advances in Classification and Data
Analysis. 2000

W. Gaul and G. Ritter (Eds.)
Classification, Automation, and New
Media. 2002
K. Jajuga, A. Sokołowski, and
H.-H. Bock (Eds.)
Classification, Clustering and Data
Analysis. 2002
M. Schwaiger and O. Opitz (Eds.)
Exploratory Data Analysis in
Empirical Research. 2003
M. Schader, W. Gaul, and M. Vichi
(Eds.)
Between Data Science and Applied
Data Analysis. 2003
H.-H. Bock, M. Chiodi, and
A. Mineo (Eds.)
Advances in Multivariate Data
Analysis. 2004
D. Banks, L. House, F.R. McMorris,
P. Arabie, and W. Gaul (Eds.)
Classification, Clustering, and Data
Mining Applications. 2004
D. Baier and K.-D. Wernecke (Eds.)
Innovations in Classification, Data
Science, and Information Systems.
2005
M. Vichi, P. Monari, S. Mignani, and
A. Montanari (Eds.)
New Developments in Classification
and Data Analysis. 2005
D. Baier, R. Decker, and L. Schmidt-Thieme (Eds.)
Data Analysis and Decision Support.
2005
C. Weihs and W. Gaul (Eds.)
Classification - the Ubiquitous
Challenge. 2005
M. Spiliopoulou, R. Kruse, C.
Borgelt, A. Nürnberger, and W. Gaul
(Eds.)
From Data and Information Analysis
to Knowledge Engineering. 2006
V. Batagelj, H.-H. Bock, A. Ferligoj,
and A. Žiberna (Eds.)
Data Science and Classification. 2006

S. Zani, A. Cerioli, M. Riani, M. Vichi
(Eds.)
Data Analysis, Classification and the
Forward Search. 2006


Reinhold Decker
Hans-J. Lenz
Editors

Advances
in Data Analysis
Proceedings of the 30th Annual Conference
of the Gesellschaft für Klassifikation e.V.,
Freie Universität Berlin, March 8-10, 2006

With 202 Figures and 92 Tables



Professor Dr. Reinhold Decker
Department of Business Administration and Economics
Bielefeld University
Universitätsstr. 25
33501 Bielefeld
Germany

Professor Dr. Hans-J. Lenz
Department of Economics
Freie Universität Berlin
Garystraße 21
14195 Berlin
Germany


Library of Congress Control Number: 2007920573

ISSN 1431-8814
ISBN 978-3-540-70980-0 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of
this publication or parts thereof is permitted only under the provisions of the German Copyright
Law of September 9, 1965, in its current version, and permission for use must always be obtained
from Springer. Violations are liable to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2007
The use of general descriptive names, registered names, trademarks, etc. in this publication does
not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
Production: LE-TEX Jelonek, Schmidt & Vöckler GbR, Leipzig
Cover-design: WMX Design GmbH, Heidelberg
SPIN 12022755

43/3100YL - 5 4 3 2 1 0

Printed on acid-free paper



Preface

This volume contains the revised versions of selected papers presented during
the 30th Annual Conference of the German Classification Society (Gesellschaft
für Klassifikation – GfKl) on “Advances in Data Analysis”. The conference was
held at the Freie Universität Berlin, Germany, in March 2006. The scientific
program featured 7 parallel tracks with more than 200 contributed talks in 63
sessions. Additionally, thanks to the support of the DFG (German Research
Foundation), 18 plenary and semi-plenary speakers from Europe and overseas
could be invited to talk about their current research in classification and data
analysis. With 325 participants from 24 countries in Europe and overseas this
GfKl Conference, once again, provided an international forum for discussions
and mutual exchange of knowledge with colleagues from different fields of
interest. Of the altogether 115 full papers that had been submitted for this
volume, 77 were finally accepted.
The scientific program included a broad range of topics from classification
and data analysis. Interdisciplinary research and the interaction between theory and practice were particularly emphasized. The following sections (with
chairs in alphabetical order) were established:
I. Theory and Methods
Clustering and Classification (H.-H. Bock and T. Imaizumi); Exploratory
Data Analysis and Data Mining (M. Meyer and M. Schwaiger); Pattern
Recognition and Discrimination (G. Ritter); Visualization and Scaling Methods (P. Groenen and A. Okada); Bayesian, Neural, and Fuzzy Clustering
(R. Kruse and A. Ultsch); Graphs, Trees, and Hierarchies (E. Godehardt
and J. Hansohm); Evaluation of Clustering Algorithms and Data Structures
(C. Hennig); Data Analysis and Time Series Analysis (S. Lang); Data Cleaning
and Pre-Processing (H.-J. Lenz); Text and Web Mining (A. Nürnberger and
M. Spiliopoulou); Personalization and Intelligent Agents (A. Geyer-Schulz);

Tools for Intelligent Data Analysis (M. Hahsler and K. Hornik).
II. Applications
Subject Indexing and Library Science (H.-J. Hermes and B. Lorenz); Marketing, Management Science, and OR (D. Baier and O. Opitz); E-commerce, Recommender Systems, and Business Intelligence (L. Schmidt-Thieme); Banking
and Finance (K. Jajuga and H. Locarek-Junge); Economics (G. Kauermann
and W. Polasek); Biostatistics and Bioinformatics (B. Lausen and U. Mansmann); Genome and DNA Analysis (A. Schliep); Medical and Health Sciences (K.-D. Wernecke and S. Willich); Archaeology (I. Herzog, T. Kerig, and
A. Posluschny); Statistical Musicology (C. Weihs); Image and Signal Processing (J. Buhmann); Linguistics (H. Goebl and P. Grzybek); Psychology
(S. Krolak-Schwerdt); Technology and Production (M. Feldmann).
Additionally, the following invited sessions were organized by colleagues
from associated societies: Classification with Complex Data Structures (A. Cerioli); Machine Learning (D.A. Zighed); Classification and Dimensionality Reduction (M. Vichi).
The editors would like to emphatically thank the section chairs for doing
such a great job regarding the organization of their sections and the associated paper reviews. The same applies to W. Esswein for organizing the
Doctoral Workshop and to H.-J. Hermes and B. Lorenz for organizing the
Librarians Workshop. Cordial thanks also go to the members of the scientific
program committee for their conceptual and practical support (in alphabetical order): D. Baier (Cottbus), H.-H. Bock (Aachen), H.W. Brachinger (Fribourg), R. Decker (Bielefeld, Chair), D. Dubois (Toulouse), A. Gammerman
(London), W. Gaul (Karlsruhe), A. Geyer-Schulz (Karlsruhe), B. Goldfarb
(Paris), P. Groenen (Rotterdam), D. Hand (London), T. Imaizumi (Tokyo),
K. Jajuga (Wroclaw), G. Kauermann (Bielefeld), R. Kruse (Magdeburg),
S. Lang (Innsbruck), B. Lausen (Erlangen-Nürnberg), H.-J. Lenz (Berlin),
F. Murtagh (London), A. Okada (Tokyo), L. Schmidt-Thieme (Hildesheim),
M. Spiliopoulou (Magdeburg), W. Stützle (Washington), and C. Weihs (Dortmund). The review process was additionally supported by the following colleagues: A. Cerioli, E. Gatnar, T. Kneib, V. Köppen, M. Meißner, I. Michalarias, F. Mörchen, W. Steiner, and M. Walesiak.
The great success of this conference would not have been possible without
the support of many people working mainly backstage. On behalf of the
whole team, we would particularly like to thank M. Darkow (Bielefeld)
and A. Wnuk (Berlin) for their exceptional efforts and great commitment
with respect to the preparation, organization and post-processing of the conference. We also thank very much our web masters I. Michalarias (Berlin) and
A. Omelchenko (Berlin). Furthermore, we cordially thank V. Köppen
(Berlin) and M. Meißner (Bielefeld) for providing excellent support regarding the management of the reviewing process and the final editing of the
papers printed in this volume.
The GfKl Conference 2006 would not have been possible in the way
it took place without the financial and/or material support of the following institutions and companies (in alphabetical order): Deutsche Forschungsgemeinschaft, Freie Universität Berlin, Gesellschaft für Klassifikation e.V., Land Software-Entwicklung, Microsoft München, SAS Deutschland, Springer-Verlag, SPSS München, Universität Bielefeld, and Westfälisch-Lippische Universitätsgesellschaft. We express our gratitude to all of them.
Finally, we would like to thank Dr. Martina Bihn of Springer-Verlag, Heidelberg, for her support and dedication to the production of this volume.

Berlin and Bielefeld, January 2007


Hans-J. Lenz
Reinhold Decker


Contents

Part I Clustering
Mixture Models for Classification
Gilles Celeux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

How to Choose the Number of Clusters: The Cramer
Multiplicity Solution
Adriana Climescu-Haulica . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Model Selection Criteria for Model-Based Clustering of
Categorical Time Series Data: A Monte Carlo Study
Jos´e G. Dias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Cluster Quality Indexes for Symbolic Classification – An
Examination
Andrzej Dudek . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Semi-Supervised Clustering: Application to Image
Segmentation

Mário A.T. Figueiredo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
A Method for Analyzing the Asymptotic Behavior of the
Walk Process in Restricted Random Walk Cluster Algorithm
Markus Franke, Andreas Geyer-Schulz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Cluster and Select Approach to Classifier Fusion
Eugeniusz Gatnar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

Random Intersection Graphs and Classification
Erhard Godehardt, Jerzy Jaworski, Katarzyna Rybarczyk . . . . . . . . . . . . . . 67
Optimized Alignment and Visualization of Clustering Results
Martin Hoffmann, Dörte Radke, Ulrich Möller . . . . . . . . . . . . . . . . . . . . . . . 75



Finding Cliques in Directed Weighted Graphs Using Complex
Hermitian Adjacency Matrices
Bettina Hoser, Thomas Bierhance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Text Clustering with String Kernels in R
Alexandros Karatzoglou, Ingo Feinerer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Automatic Classification of Functional Data with Extremal
Information
Fabrizio Laurini, Andrea Cerioli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Typicality Degrees and Fuzzy Prototypes for Clustering
Marie-Jeanne Lesot, Rudolf Kruse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
On Validation of Hierarchical Clustering
Hans-Joachim Mucha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Part II Classification
Rearranging Classified Items in Hierarchies Using
Categorization Uncertainty
Korinna Bade, Andreas Nürnberger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Localized Linear Discriminant Analysis

Irina Czogiel, Karsten Luebke, Marc Zentgraf, Claus Weihs . . . . . . . . . . . 133
Calibrating Classifier Scores into Probabilities
Martin Gebel, Claus Weihs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Nonlinear Support Vector Machines Through Iterative
Majorization and I-Splines
Patrick J.F. Groenen, Georgi Nalbantov, J. Cor Bioch . . . . . . . . . . . . . . . . 149
Deriving Consensus Rankings from Benchmarking
Experiments
Kurt Hornik, David Meyer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Classification of Contradiction Patterns
Heiko Müller, Ulf Leser, Johann-Christoph Freytag . . . . . . . . . . . . . . . . . . . 171
Selecting SVM Kernels and Input Variable Subsets in Credit
Scoring Models
Klaus B. Schebesch, Ralf Stecking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179



Part III Data and Time Series Analysis
Simultaneous Selection of Variables and Smoothing
Parameters in Geoadditive Regression Models
Christiane Belitz, Stefan Lang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Modelling and Analysing Interval Data
Paula Brito . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Testing for Genuine Multimodality in Finite Mixture Models:
Application to Linear Regression Models
Bettina Grün, Friedrich Leisch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Happy Birthday to You, Mr. Wilcoxon!
Invariance, Semiparametric Efficiency, and Ranks

Marc Hallin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
Equivalent Number of Degrees of Freedom for Neural
Networks
Salvatore Ingrassia, Isabella Morlini . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Model Choice for Panel Spatial Models: Crime Modeling in
Japan
Kazuhiko Kakamu, Wolfgang Polasek, Hajime Wago. . . . . . . . . . . . . . . . . . 237
A Boosting Approach to Generalized Monotonic Regression
Florian Leitenstorfer, Gerhard Tutz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
From Eigenspots to Fisherspots – Latent Spaces in the
Nonlinear Detection of Spot Patterns in a Highly Varying
Background
Bjoern H. Menze, B. Michael Kelm, Fred A. Hamprecht . . . . . . . . . . . . . . 255
Identifying and Exploiting Ultrametricity
Fionn Murtagh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
Factor Analysis for Extraction of Structural Components and
Prediction in Time Series
Carsten Schneider, Gerhard Arminger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Classification of the U.S. Business Cycle by Dynamic Linear
Discriminant Analysis
Roland Schuhr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281




Examination of Several Results of Different Cluster Analyses
with a Separate View to Balancing the Economic and
Ecological Performance Potential of Towns and Cities
Nguyen Xuan Thinh, Martin Behnisch, Alfred Ultsch . . . . . . . . . . . . . . . . . 289

Part IV Visualization and Scaling Methods
VOS: A New Method for Visualizing Similarities Between
Objects
Nees Jan van Eck, Ludo Waltman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
Multidimensional Scaling of Asymmetric Proximities with a
Dominance Point
Akinori Okada, Tadashi Imaizumi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Single Cluster Visualization to Optimize Air Traffic
Management
Frank Rehm, Frank Klawonn, Rudolf Kruse . . . . . . . . . . . . . . . . . . . . . . . . . 319
Rescaling Proximity Matrix Using Entropy Analyzed by
INDSCAL
Satoru Yokoyama, Akinori Okada . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327

Part V Information Retrieval, Data and Web Mining
Canonical Forms for Frequent Graph Mining
Christian Borgelt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
Applying Clickstream Data Mining to Real-Time Web
Crawler Detection and Containment Using ClickTips
Platform
Anália Lourenço, Orlando Belo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
Plagiarism Detection Without Reference Collections
Sven Meyer zu Eissen, Benno Stein, Marion Kulig . . . . . . . . . . . . . . . . . . . . 359

Putting Successor Variety Stemming to Work
Benno Stein, Martin Potthast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
Collaborative Filtering Based on User Trends
Panagiotis Symeonidis, Alexandros Nanopoulos, Apostolos
Papadopoulos, Yannis Manolopoulos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
Investigating Unstructured Texts with Latent Semantic
Analysis
Fridolin Wild, Christina Stahl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383



Part VI Marketing, Management Science and Economics
Heterogeneity in Preferences for Odd Prices
Bernhard Baumgartner, Winfried J. Steiner . . . . . . . . . . . . . . . . . . . . . . . . 393
Classification of Reference Models
Robert Braun, Werner Esswein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
Adaptive Conjoint Analysis for Pricing Music Downloads
Christoph Breidert, Michael Hahsler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
Improving the Probabilistic Modeling of Market Basket Data
Christian Buchta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
Classification in Marketing Research by Means of
LEM2-generated Rules
Reinhold Decker, Frank Kroll . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
Pricing Energy in a Multi-Utility Market
Markus Franke, Andreas Kamper, Anke Eßer . . . . . . . . . . . . . . . . . . . . . . . 433
Disproportionate Samples in Hierarchical Bayes CBC
Analysis

Sebastian Fuchs, Manfred Schwaiger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
Building on the Arules Infrastructure for Analyzing
Transaction Data with R
Michael Hahsler, Kurt Hornik . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
Balanced Scorecard Simulator – A Tool for Stochastic
Business Figures
Veit Köppen, Marina Allgeier, Hans-J. Lenz . . . . . . . . . . . . . . . . . . . . . . . . . 457
Integration of Customer Value into Revenue Management
Tobias von Martens, Andreas Hilbert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
Women’s Occupational Mobility and Segregation in the
Labour Market: Asymmetric Multidimensional Scaling
Miki Nakai . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
Multilevel Dimensions of Consumer Relationships in the
Healthcare Service Market M-L IRT vs. M-L SEM Approach
Iga Rudawska, Adam Sagan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481



Data Mining in Higher Education
Karoline Schönbrunn, Andreas Hilbert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
Attribute Aware Anonymous Recommender Systems
Manuel Stritt, Karen H.L. Tso, Lars Schmidt-Thieme . . . . . . . . . . . . . . . . 497

Part VII Banking and Finance
On the Notions and Properties of Risk and Risk Aversion in

the Time Optimal Approach to Decision Making
Martin Bouzaima, Thomas Burkhardt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
A Model of Rational Choice Among Distributions of Goal
Reaching Times
Thomas Burkhardt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
On Goal Reaching Time Distributions Estimated from DAX
Stock Index Investments
Thomas Burkhardt, Michael Haasis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
Credit Risk of Collaterals: Examining the Systematic Linkage
between Insolvencies and Physical Assets in Germany
Marc Gürtler, Dirk Heithecker, Sven Olboeter . . . . . . . . . . . . . . . . . . . . . . . 531
Foreign Exchange Trading with Support Vector Machines
Christian Ullrich, Detlef Seese, Stephan Chalup . . . . . . . . . . . . . . . . . . . . . . 539
The Influence of Specific Information on the Credit Risk
Level
Miroslaw Wójciak, Aleksandra Wójcicka-Krenz . . . . . . . . . . . . . . . . . . . . . . 547

Part VIII Bio- and Health Sciences
Enhancing Bluejay with Scalability, Genome Comparison and
Microarray Visualization
Anguo Dong, Andrei L. Turinsky, Andrew C. Ah-Seng, Morgan
Taschuk, Paul M.K. Gordon, Katharina Hochauer, Sabrina Fröls, Jung
Soh, Christoph W. Sensen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
Discovering Biomarkers for Myocardial Infarction from
SELDI-TOF Spectra
Christian Höner zu Siederdissen, Susanne Ragg, Sven Rahmann . . . . . . . 569
Joint Analysis of In-situ Hybridization and Gene Expression
Data
Lennart Opitz, Alexander Schliep, Stefan Posch . . . . . . . . . . . . . . . . . . . . . 577



Unsupervised Decision Trees Structured by Gene Ontology
(GO-UDTs) for the Interpretation of Microarray Data
Henning Redestig, Florian Sohler, Ralf Zimmer, Joachim Selbig . . . . . . . 585

Part IX Linguistics and Text Analysis
Clustering of Polysemic Words
Laurent Cicurel, Stephan Bloehdorn, Philipp Cimiano . . . . . . . . . . . . . . . . 595
Classifying German Questions According to Ontology-Based
Answer Types
Adriana Davidescu, Andrea Heyl, Stefan Kazalski, Irene Cramer,
Dietrich Klakow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
The Relationship of Word Length and Sentence Length: The
Inter-Textual Perspective
Peter Grzybek, Ernst Stadlober, Emmerich Kelih . . . . . . . . . . . . . . . . . . . . 611
Comparing the Stability of Different Clustering Results of
Dialect Data
Edgar Haimerl, Hans-Joachim Mucha . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619
Part-of-Speech Discovery by Clustering Contextual Features
Reinhard Rapp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627


Part X Statistical Musicology and Sound Classification
A Probabilistic Framework for Audio-Based Tonal Key and
Chord Recognition
Benoit Catteau, Jean-Pierre Martens, Marc Leman . . . . . . . . . . . . . . . . . . . 637
Using MCMC as a Stochastic Optimization Procedure for
Monophonic and Polyphonic Sound
Katrin Sommer, Claus Weihs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
Vowel Classification by a Neurophysiologically Parameterized
Auditory Model
Gero Szepannek, Tamás Harczos, Frank Klefenz, András Katai, Patrick
Schikowski, Claus Weihs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653



Part XI Archaeology
Uncovering the Internal Structure of the Roman Brick and
Tile Making in Frankfurt-Nied by Cluster Validation
Jens Dolata, Hans-Joachim Mucha, Hans-Georg Bartel . . . . . . . . . . . . . . . 663
Where Did I See You Before...
A Holistic Method to Compare and Find Archaeological
Artifacts
Vincent Mom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 671
Keywords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685



Part I

Clustering


Mixture Models for Classification
Gilles Celeux
Inria Futurs, Orsay, France
Abstract. Finite mixture distributions provide efficient approaches to model-based
clustering and classification. The advantages of mixture models for unsupervised
classification are reviewed. Then the article focuses on the model selection problem. The usefulness of taking into account the modeling purpose when selecting a
model is advocated in the unsupervised and supervised classification contexts. This
point of view has led to the definition of two penalized likelihood criteria, ICL and
BEC, which are presented and discussed. Criterion ICL is an approximation of the
integrated completed likelihood and is concerned with model-based cluster analysis.
Criterion BEC is an approximation of the integrated conditional likelihood and is
concerned with generative models of classification. The behavior of ICL for choosing the number of components in a mixture model, and of BEC for choosing a model
minimizing the expected error rate, is analyzed in contrast with standard model
selection criteria.

1 Introduction
Finite mixture models have been extensively studied for decades and provide
a fruitful framework for classification (McLachlan and Peel (2000)). In this
article some of the main features and advantages of finite mixture analysis
for model-based clustering are reviewed in Section 2. An important benefit
of finite mixture models is to provide a rigorous setting for assessing the number
of clusters in an unsupervised classification context, or for assessing the stability
of a classification function. Section 3 focuses on those two questions.
Model-based clustering (MBC) consists of assuming that the data come
from a source with several subpopulations. Each subpopulation is modeled
separately, and the overall population is a mixture of these subpopulations.
The resulting model is a finite mixture model. Observations x = (x_1, . . . , x_n)
in R^{nd} are assumed to be a sample from a probability distribution with density

    p(x_i | K, θ_K) = ∑_{k=1}^{K} p_k φ(x_i | a_k),                    (1)



where the p_k's are the mixing proportions (0 < p_k < 1 for all k = 1, . . . , K
and ∑_k p_k = 1), φ(· | a_k) denotes a parameterized density, and θ_K =
(p_1, . . . , p_{K−1}, a_1, . . . , a_K). When the data are multivariate continuous observations, the component density is usually the d-dimensional Gaussian density with parameter a_k = (µ_k, Σ_k), µ_k being the mean and Σ_k the
variance matrix of component k. When the data are discrete, the component density is usually the multivariate multinomial density, which assumes
conditional independence of the variables knowing the mixture component; the a_k = (a_k^j, j = 1, . . . , d) are the multinomial probabilities for variable j and mixture component k. The resulting model is the
so-called Latent Class Model (see for instance Goodman (1974)).
The mixture model is an incomplete data structure model: The complete
data are

    y = (y_1, . . . , y_n) = ((x_1, z_1), . . . , (x_n, z_n)),

where the missing data are z = (z_1, . . . , z_n), the z_i = (z_{i1}, . . . , z_{iK}) being
binary vectors such that z_{ik} = 1 iff x_i arises from group k. The z's define a
partition P = (P_1, . . . , P_K) of the observed data x with P_k = {x_i | z_{ik} = 1}.
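To make the density (1) concrete, here is a minimal sketch, in Python, of how a Gaussian mixture density can be evaluated; the function and the parameter values are illustrative and not taken from the article.

```python
import numpy as np
from scipy.stats import norm

def mixture_density(x, proportions, means, sds):
    """Evaluate the mixture density (1): sum_k p_k * phi(x | a_k),
    here with univariate Gaussian components a_k = (mu_k, sigma_k)."""
    x = np.atleast_1d(x)
    return sum(p * norm.pdf(x, loc=mu, scale=sd)
               for p, mu, sd in zip(proportions, means, sds))

# Hypothetical two-component mixture with p = (0.4, 0.6)
density = mixture_density([0.0, 1.5, 3.0],
                          proportions=[0.4, 0.6],
                          means=[0.0, 3.0],
                          sds=[1.0, 0.5])
```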
In this article, it is assumed that the mixture models at hand are estimated through maximum likelihood (ml) or related methods. Although it has
received a lot of attention since the seminal article of Diebolt and Robert
(1994), Bayesian inference is not considered here. Bayesian analysis of univariate mixtures has become the standard Bayesian tool for density estimation. But, especially in the multivariate setting, a lot of problems (possibly slow
convergence of MCMC algorithms, definition of subjective weakly informative
priors, identifiability, . . . ) remain, and it cannot be regarded as a standard tool
for Bayesian clustering of multivariate data (see Aitkin (2001)). The reader is
referred to the survey article of Marin et al. (2005) for a readable state of the
art of Bayesian inference for mixture models.

2 Some advantages of model-based clustering
In this section, some important and nice features of finite mixture analysis are
sketched. The advantages of finite mixture analysis in a clustering context,
highlighted here, are: Many versatile or parsimonious models are available,
many algorithms to estimate the mixture parameters are available, special
questions can be tackled in a proper way in the MBC context, and, last but
not least, finite mixture models can be compared and assessed in an objective
way. In particular, this allows the number of clusters to be assessed properly. The
discussion on this important point is postponed to Section 3.
Many versatile or parsimonious models are available.
In the multivariate Gaussian mixture context, the variance matrix eigenvalue
decomposition

    Σ_k = V_k D_k^t A_k D_k,

where V_k = |Σ_k|^{1/d} defines the component volume, D_k, the matrix of eigenvectors of Σ_k, defines the component orientation, and A_k, the diagonal matrix
of normalized eigenvalues, defines the component shape, leads to different
and easily interpreted models by allowing some of these quantities to vary
between components. Following Banfield and Raftery (1993) and Celeux and
Govaert (1995), a large range of fourteen versatile (from the most complex
to the simplest) models derived from this eigenvalue decomposition can
be considered. Assuming equal or free volumes, orientations and shapes leads
to eight different models. Assuming in addition that the component variance
matrices are diagonal leads to four models. And, finally, assuming in addition
that the component variance matrices are proportional to the identity matrix
leads to two other models.
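As an illustration of this decomposition, the following sketch (assuming numpy; the variance matrix is hypothetical) recovers volume, orientation and shape from a given Σ, up to the transpose convention for D:

```python
import numpy as np

def volume_orientation_shape(Sigma):
    """Split a variance matrix into volume V = |Sigma|^(1/d),
    orientation D (eigenvectors) and shape A (normalized eigenvalues),
    so that Sigma = V * D @ A @ D.T up to numerical error."""
    d = Sigma.shape[0]
    eigvals, D = np.linalg.eigh(Sigma)   # Sigma = D @ diag(eigvals) @ D.T
    V = np.linalg.det(Sigma) ** (1.0 / d)
    A = np.diag(eigvals / V)             # det(A) = 1 by construction
    return V, D, A

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])           # hypothetical component variance
V, D, A = volume_orientation_shape(Sigma)
```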
In the Latent Class Model, a re-parameterization is possible that leads to
various models taking account of the scattering around the cluster centers
in different ways (Celeux and Govaert (1991)). This re-parameterization is as
follows. The multinomial probabilities a_k are decomposed into (m_k, ε_k), where
the binary vector m_k = (m_k^1, . . . , m_k^d) provides the mode levels in cluster k: for
variable j,

    m_k^{jh} = 1 if h = arg max_{h'} a_k^{jh'}, and m_k^{jh} = 0 otherwise,

and the ε_k^j can be regarded as scattering values:

    ε_k^{jh} = 1 − a_k^{jh} if m_k^{jh} = 1, and ε_k^{jh} = a_k^{jh} if m_k^{jh} = 0.

For instance, if a_k^j = (0.7, 0.2, 0.1), the new parameters are m_k^j = (1, 0, 0) and
ε_k^j = (0.3, 0.2, 0.1). This parameterization can lead to five latent class models.

Denoting h(jk) the mode level for variable j and cluster k and h(ij) the level
of object i for the variable j, the model can be written
f (xi ; θ) =

jh(jk) xjh(jk)
i

(1 − εk

pk
k

)

jh(jk)
jh(ij) xjh(ij)
−xk
i

(εk

)

.

j

Using this form, it is possible to impose various constraints on the scattering
parameters ε_k^{jh}. The models of interest are the following:

• The standard latent class model [ε_k^{jh}]: The scattering depends upon
  clusters, variables and levels.
• [ε_k^j]: The scattering depends upon clusters and variables but not upon
  levels.
• [ε_k]: The scattering depends upon clusters, but not upon variables.
• [ε^j]: The scattering depends upon variables, but not upon clusters.
• [ε]: The scattering is constant over variables and clusters.
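The (m_k, ε_k) decomposition itself is straightforward to compute; a minimal sketch (written for this text) reproducing the numerical example above:

```python
def mode_scatter(a):
    """Decompose multinomial probabilities a into the mode indicator m
    and the scattering values eps, as described in the text."""
    mode = max(range(len(a)), key=lambda h: a[h])
    m = [1 if h == mode else 0 for h in range(len(a))]
    eps = [1.0 - a[h] if h == mode else a[h] for h in range(len(a))]
    return m, eps

# For a_k^j = (0.7, 0.2, 0.1): m = [1, 0, 0] and eps = [0.3, 0.2, 0.1]
print(mode_scatter([0.7, 0.2, 0.1]))
```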



Many algorithms are available from different points of view
The EM algorithm of Dempster et al. (1977) is the reference tool to derive
the ml estimates in a mixture model. An iteration of EM is as follows:
• E step: Compute the conditional probabilities tik , i = 1, . . . , n, k =
1, . . . , K that xi arises from the kth component for the current value
of the mixture parameters.
• M step: Update the mixture parameter estimates by maximizing the expected
value of the completed likelihood. This leads to standard formulas in which
observation i is weighted, for group k, by the conditional probability t_ik
(as in the sketch below).
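A minimal sketch of one EM iteration for a univariate Gaussian mixture (a helper written for this text, not code from the article; x is a one-dimensional numpy array):

```python
import numpy as np
from scipy.stats import norm

def em_step(x, p, mu, sigma):
    """One EM iteration for a univariate Gaussian mixture."""
    # E step: conditional probabilities t_ik for the current parameters
    t = np.column_stack([pk * norm.pdf(x, mk, sk)
                         for pk, mk, sk in zip(p, mu, sigma)])
    t /= t.sum(axis=1, keepdims=True)
    # M step: standard formulas with observation i weighted by t_ik
    nk = t.sum(axis=0)
    p_new = nk / len(x)
    mu_new = (t * x[:, None]).sum(axis=0) / nk
    sigma_new = np.sqrt((t * (x[:, None] - mu_new) ** 2).sum(axis=0) / nk)
    return p_new, mu_new, sigma_new
```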
Other algorithms take advantage of the missing data structure of the mixture model. For instance, the Classification EM (CEM) algorithm (see Celeux and Govaert (1992)) is directly concerned with the estimation of the missing labels z.
An iteration of CEM is as follows:
• E step: As in EM.
• C step: Assign each point xi to the component maximizing the conditional
probability tik using a maximum a posteriori (MAP) principle.
• M step: Update the mixture parameter estimates maximizing the completed likelihood.

CEM aims to maximize the completed likelihood where the component label
of each sample point is included in the data set. CEM is a K-means-like
algorithm and, contrary to EM, it converges in a finite number of iterations.
But CEM provides biased estimates of the mixture parameters. This algorithm
is interesting in a clustering context when the clusters are well separated (see
Celeux and Govaert (1993)).
From another point of view, the Stochastic EM (SEM) algorithm can be
useful. It is as follows:
• E step: As in EM.
• S step: Assign each point xi at random to one of the components according
to the distribution defined by the (tik , k = 1, . . . , K).
• M step: Update the mixture parameter estimates maximizing the completed likelihood.
SEM generates a Markov chain whose stationary distribution is (more or less)
concentrated around the ml parameter estimator. Thus a natural parameter
estimate from a SEM sequence is the mean of the iterate values obtained after
a burn-in period. An alternative estimate is the parameter value
leading to the largest likelihood in a SEM sequence. In any case, SEM is
expected to avoid insensible maxima of the likelihood that EM cannot avoid,
but SEM can be jeopardized by spurious maxima (see Celeux et al. (1996) or
McLachlan and Peel (2000) for details). Note that different variants (Monte
Carlo EM, Simulated Annealing EM) are possible (see, for instance, Celeux et
al. (1996)). Note also that Biernacki et al. (2003) proposed simple strategies
for getting sensible ml estimates. Those strategies act in two ways to
deal with this problem: They choose particular starting values from CEM or
SEM, and they run EM several times or use algorithms combining CEM and EM.
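Relative to the EM sketch above, CEM and SEM only change how the labels are formed from the conditional probabilities t_ik; a sketch of the two variants, assuming t is an n × K numpy array:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def c_step(t):
    """CEM: assign each point to the component maximizing t_ik (MAP)."""
    return t.argmax(axis=1)

def s_step(t):
    """SEM: draw each label from the distribution (t_i1, ..., t_iK)."""
    return np.array([rng.choice(t.shape[1], p=ti) for ti in t])
```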
Special questions can be tackled in a proper way in the MBC context
Robust cluster analysis can be obtained by making use of multivariate Student distributions instead of multivariate Gaussian distributions. This attenuates the influence of outliers (McLachlan and Peel (2000)). On the other
hand, including in the mixture a component from a uniform distribution allows
noisy data to be taken into account (DasGupta and Raftery (1998)).
To avoid spurious maxima of the likelihood, shrinking the group variance matrix toward a matrix proportional to the identity matrix can be quite efficient.
One of the most accomplished works in this domain is Ciuperca et al. (2003).
Taking advantage of the probabilistic framework, it is possible to deal with
data missing at random in a proper way with mixture models (Hunt and
Basford (2001)). Also, simple, natural and efficient methods of semi-supervised
classification can be derived in the mixture framework (a pioneering
article on this subject, recently followed by many others, is Ganesalingam and
McLachlan (1978)). Finally, it can be noted that promising variable selection
procedures for model-based clustering are beginning to appear (Raftery and Dean
(2006)).

3 Choosing a model for a classification purpose
In statistical inference from data, selecting a parsimonious model among a
collection of models is an important but difficult task. This general problem
has received much attention since the seminal articles of Akaike (1974) and
Schwarz (1978). A model selection problem consists essentially of solving the
bias-variance dilemma. A classical approach to the model assessment problem
consists of penalizing the fit of a model by a measure of its complexity. Criterion AIC of Akaike (1974) is an asymptotic approximation of the expectation
of the deviance. It is

    AIC(m) = 2 log p(x | m, θ̂_m) − 2ν_m,                    (2)

where θ̂_m is the ml estimate of parameter θ_m and ν_m is the number of free
parameters of model m.

Another point of view consists of basing the model selection on the integrated likelihood of the data in a Bayesian perspective (see Kass and Raftery
(1995)). This integrated likelihood is

    p(x | m) = ∫ p(x | m, θ_m) π(θ_m) dθ_m,                  (3)
π(θ_m) being a prior distribution for parameter θ_m. The essential technical
problem is to approximate this integrated likelihood in the right way. A classical
asymptotic approximation of the logarithm of the integrated likelihood is the
BIC criterion of Schwarz (1978). It is

    BIC(m) = log p(x | m, θ̂_m) − (ν_m / 2) log(n).                  (4)
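Both criteria are immediate to compute once the maximized loglikelihood and the number of free parameters are known; a minimal sketch following (2) and (4):

```python
import numpy as np

def aic(loglik, nu):
    """AIC(m) = 2 log p(x | m, theta_hat) - 2 nu_m, cf. (2)."""
    return 2.0 * loglik - 2.0 * nu

def bic(loglik, nu, n):
    """BIC(m) = log p(x | m, theta_hat) - (nu_m / 2) log n, cf. (4)."""
    return loglik - 0.5 * nu * np.log(n)

# For a K-component Gaussian mixture in dimension d with free variance
# matrices, nu = (K - 1) + K * d + K * d * (d + 1) / 2.
```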

Beyond technical difficulties, the scope of this section is to show how it can
be fruitful to take into account the purpose of the model user to get reliable
and useful models for statistical description or decision tasks. Two situations
are considered to support this idea: Choosing the number of components in
a mixture model in a cluster analysis perspective, and choosing a generative
probabilistic model in a supervised classification context.

3.1 Choosing the number of clusters
Assessing the number K of components in a mixture model is a difficult question, from both theoretical and practical points of view, which has received
much attention in the past two decades. This section does not propose a state
of the art of this problem, which has not been completely solved. The reader
is referred to Chapter 6 of the book of McLachlan and Peel (2000) for an
excellent overview of this subject. This section essentially aims to discuss
elements of practical interest regarding the problem of choosing the number
of mixture components when concerned with cluster analysis.
From the theoretical point of view, even when the right number of components K* is assumed to exist, if K* < K_0 then K* is not identifiable in the
parameter space Θ_{K_0} (see for instance McLachlan and Peel (2000), Chapter 6).
But, here, we want to stress the importance of taking into account the
modeling context to select a reasonable number of mixture components. Our
opinion is that, beyond the theoretical difficulties, assessing the number of
components in a mixture model from data is a weakly identifiable statistical
problem. Mixture densities with different numbers of components can lead
to quite similar resulting probability distributions. For instance, the galaxy
velocities data of Roeder (1990) has become a benchmark data set and is used
by many authors to illustrate procedures for choosing the number of mixture
components. Yet, according to those authors, the answer ranges from K = 2 to
K = 10, and it is not exaggerating a lot to say that every answer between
2 and 10 has been proposed as the good answer, at least once, in the
articles considering this particular data set. (An interesting and illuminating
comparative study on this data set can be found in Aitkin (2001).) Thus, we
consider that it is highly desirable to choose K by keeping in mind what is
expected from the mixture modeling to get a relevant answer to this question.
Actually, mixture modeling can be used for quite different purposes. It can be
regarded as a semi-parametric tool for density estimation or as a tool
for cluster analysis.
In the first perspective, much considered by Bayesian statisticians, numerical experiments (see Fraley and Raftery (1998)) show that the BIC approximation of the integrated likelihood works well at a practical level. And,
under regularity conditions including the fact that the component densities
are finite, Keribin (2000) proved that BIC provides a consistent estimator of
K.
But, in the second perspective, the integrated likelihood does not take
into account the clustering purpose for selecting a mixture model in a model-based clustering setting. As a consequence, in the most common situations,
where the distribution from which the data arose is not in the collection of
considered mixture models, the BIC criterion will tend to overestimate the correct
size regardless of the separation of the clusters (see Biernacki et al. (2000)).
To overcome this limitation, it can be advantageous to choose K in order
to get the mixture giving rise to a partition of the data with the greatest evidence.
With that purpose in mind, Biernacki et al. (2000) considered the integrated
likelihood of the complete data (x, z), or integrated completed likelihood.
(Recall that z = (z_1, . . . , z_n) denotes the missing data; the z_i =
(z_{i1}, . . . , z_{iK}) are binary K-dimensional vectors with z_{ik} = 1 if and only if x_i
arises from component k.) Then, the integrated complete likelihood is
    p(x, z | K) = ∫_{Θ_K} p(x, z | K, θ) π(θ | K) dθ,            (5)


where

    p(x, z | K, θ) = ∏_{i=1}^{n} p(x_i, z_i | K, θ)

with

    p(x_i, z_i | K, θ) = ∏_{k=1}^{K} p_k^{z_ik} [φ(x_i | a_k)]^{z_ik}.

To approximate this integrated complete likelihood, those authors propose to
use a BIC-like approximation leading to the criterion

    ICL(K) = log p(x, ẑ | K, θ̂) − (ν_K / 2) log n,                 (6)

where the missing data have been replaced by their most probable values for the
parameter estimate θ̂ (details can be found in Biernacki et al. (2000)). Roughly
speaking, criterion ICL is the criterion BIC penalized by the mean entropy

    E(K) = − ∑_{k=1}^{K} ∑_{i=1}^{n} t_ik log t_ik ≥ 0,

t_ik denoting the conditional probability that x_i arises from the kth mixture
component (1 ≤ i ≤ n and 1 ≤ k ≤ K).
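Given the conditional probabilities t_ik of a fitted mixture, ICL can be computed either from (6) directly or as BIC minus the entropy term; a sketch assuming t is an n × K numpy array of the t_ik:

```python
import numpy as np

def entropy(t):
    """Mean entropy E(K) = -sum_ik t_ik log t_ik (>= 0)."""
    t = np.clip(t, 1e-300, 1.0)          # guard against log(0)
    return -np.sum(t * np.log(t))

def icl(complete_loglik, nu, n):
    """ICL(K) = log p(x, z_hat | K, theta_hat) - (nu_K / 2) log n, cf. (6)."""
    return complete_loglik - 0.5 * nu * np.log(n)

def icl_from_bic(bic_value, t):
    """Roughly, ICL is BIC penalized by the mean entropy E(K)."""
    return bic_value - entropy(t)
```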



As a consequence, ICL favors K values giving rise to partitions of the data
with the greatest evidence, as highlighted in the numerical experiments in
Biernacki et al. (2000), because of this additional entropy term. More generally, ICL appears to provide a stable and reliable estimate of K for real data
sets and also for simulated data sets from mixtures whose components do not
overlap too much (see McLachlan and Peel (2000)). But ICL, which does
not aim to discover the true number of mixture components, can underestimate the number of components for simulated data arising from mixtures with
poorly separated components, as illustrated in Figueiredo and Jain (2002).
On the contrary, BIC performs remarkably well in assessing the true number
of components from simulated data (see Biernacki et al. (2000) or Fraley and
Raftery (1998) for instance). But, for real-world data sets, BIC has a marked
tendency to overestimate the number of components. The reason is that real
data sets do not arise from the mixture densities at hand, and the penalty
term of BIC is not strong enough to balance the tendency of the loglikelihood
to increase with K in order to improve the fit of the mixture model.
3.2 Model selection in classification
Supervised classification is about guessing the unknown group, among K
groups, of a unit i from the knowledge of d variables collected in a vector x_i.
This group is defined by z_i = (z_{i1}, . . . , z_{iK}), a binary K-dimensional
vector with z_{ik} = 1 if and only if x_i arises from group k. For that purpose,
a decision function, called a classifier, δ(x): R^d → {1, . . . , K} is designed
from a learning sample (x_i, z_i), i = 1, . . . , n. A classical approach to designing
a classifier is to represent the group conditional densities with a parametric
model p(x | m, z^k = 1, θ_m) for k = 1, . . . , K. The classifier then assigns an observation x to the group k maximizing the conditional probability
p(z^k = 1 | m, x, θ_m). Using the Bayes rule, this leads to setting δ(x) = j if and only
if j = arg max_k p_k p(x | m, z^k = 1, θ̂_m), θ̂_m being the ml estimate of the group
conditional parameters θ_m and p_k being the prior probability of group k. This
approach is known as generative discriminant analysis in the Machine
Learning community.
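A minimal sketch of the resulting MAP classification rule, with one Gaussian density per group (the fitted priors and (mean, cov) pairs are placeholders to be estimated from the learning sample):

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify(x, priors, params):
    """delta(x) = arg max_k p_k * p(x | m, z^k = 1, theta_hat_m),
    here with one Gaussian (mean, cov) per group."""
    scores = [pk * multivariate_normal.pdf(x, mean=mu, cov=cov)
              for pk, (mu, cov) in zip(priors, params)]
    return int(np.argmax(scores))
```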
In this context, it can be expected to improve the actual error rate by
selecting a generative model m among a large collection of models M (see for
instance Friedman (1989) or Bensmail and Celeux (1996)). For instance, Hastie
and Tibshirani (1996) proposed modeling each group density with a mixture of
Gaussian distributions. In this approach the numbers of mixture components
per group are sensitive tuning parameters. They can be supplied by the user, as
in Hastie and Tibshirani (1996), but this is clearly a sub-optimal solution. They
can be chosen to minimize the v-fold cross-validated error rate, as done in
Friedman (1989) or Bensmail and Celeux (1996) for other tuning parameters.
Despite the fact that the choice of v can be sensitive, this can be regarded as a nearly
optimal solution. But it is highly CPU-time consuming, and choosing tuning
parameters with a penalized loglikelihood criterion, such as BIC, can be expected


