Introduction to Data Mining for the Life Sciences
Rob Sullivan
Cincinnati, OH, USA
ISBN 978-1-58829-942-0
e-ISBN 978-1-59745-290-8
DOI 10.1007/978-1-59745-290-8
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011941596
© Springer Science+Business Media, LLC 2012
All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street,
New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis.
Use in connection with any form of information storage and retrieval, electronic adaptation, computer
software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.
Printed on acid-free paper
Humana Press is part of Springer Science+Business Media (www.springer.com)
To my wife, without whose support,
encouragement, love, and caffeine,
none of this would have been possible.
Preface
A search for the word “zettabyte” will return a page that predicts that we will enter
the zettabyte age around 2015. To store this amount of data on DVDs would require
over 215 billion disks. The search itself (on September 20, 2011) returned 775,000
results, many relevant, many irrelevant, and many duplicates. The challenge is to
elicit knowledge from all this data.
Scientific endeavors are constantly generating more and more data. As new
generations of instruments are created, one of the characteristics is typically more
sensitive results. In turn, this typically means more data is generated. New techniques made available by new instrumentation, techniques, and understanding allow
us to consider approaches such as genome-wide association studies (GWAS) that
were outside of our ability to consider just a few years ago. Again, the challenge is
to elicit knowledge from all this data.
But as we continue to generate this ever-increasing amount of data, we would
also like to know what relationships and patterns exist between the data. This, in
essence, is the goal of data mining: find the patterns within the data. This is what
this book is about. Is there some quantity X that is related to some other quantity Y
that isn’t obvious to us? If so, what could those relationships tell us? Is there
something novel, something new, that these patterns tell us? Can it advance our
knowledge?
There is no obvious end in sight to the increasing generation of data. To the
contrary, as tools, techniques, and instrumentation continue to become smaller,
cheaper, and thus, more available, it is likely that the opposite will be the case and
data will continue to be generated in ever-increasing volumes. It is for this reason
that automated approaches to processing data, understanding data, and finding these
patterns will be even more important.
This leads to the major challenge of a book like this: what to include and what to
leave out. We felt it important to cover as much of the theory of data mining as
possible, including statistical, analytical, visualization, and machine learning techniques. Much exciting work is being done under the umbrella of machine learning,
and much of it is seeing fruition within the data mining discipline itself. To say that
this covers a wide area – and a multitude of sins – is an understatement. To those
readers who question why we included a particular topic, and to those who question
why we omitted some other topic, we can only apologize for this. In writing a book
aimed at introducing data mining techniques to the life sciences, describing a broad
range of techniques is a necessity to allow the researcher to select the most
appropriate tools for his/her investigation. Many of the techniques we discuss are
not necessarily in widespread use, but we believe they can be valuable to the researcher.
Many people over many years triggered our enthusiasm for developing this text,
and it is thanks to them that we have created this book. Particular thanks go to Bruce
Lucarelli and Viswanath Balasubramanian for their contributions and insights on
various parts of the text.
Cincinnati, OH, USA
Rob Sullivan
Contents
1 Introduction
1.1 Context
1.2 New Scientific Techniques, New Data Challenges in Life Sciences
1.3 The Ethics of Data Mining
1.4 Data Mining: Problems and Issues
1.5 From Data to Information: The Process
1.5.1 CRISP-DM
1.6 What Can Be Mined?
1.7 Standards and Terminologies
1.8 Interestingness
1.9 How Good Is Good Enough?
1.10 The Datasets Used in This Book
1.11 The Rest of the Story
1.12 What Have We Missed in the Introduction?
References

2 Fundamental Concepts
2.1 Introduction
2.2 Data Mining Activities
2.3 Models
2.4 Input Representation
2.5 Missing Data
2.6 Does Data Expire?
2.7 Frequent Patterns
2.8 Bias
2.9 Generalization
2.10 Data Characterization and Discrimination
2.11 Association Analysis
2.12 Classification and Prediction
2.12.1 Relevance Analysis
2.13 Cluster Analysis
2.14 Outlier Analysis
2.15 Normalization
2.16 Dimensionality
2.17 Model Overfitting
2.18 Concept Hierarchies
2.19 Bias-Variance Decomposition
2.20 Advanced Techniques
2.20.1 Principal Component Analysis
2.20.2 Monte Carlo Strategies
2.21 Some Introductory Methods
2.21.1 The 1R Method
2.21.2 Decision Trees
2.21.3 Classification Rules
2.21.4 Association Rules
2.21.5 Relational Rules
2.21.6 Clustering
2.21.7 Simple Statistical Measures
2.21.8 Linear Models
2.21.9 Neighbors and Distance
2.22 Feature Selection
2.22.1 Global Feature Selection
2.22.2 Local Feature Selection
2.23 Conclusion
References

3 Data Architecture and Data Modeling
3.1 Introduction
3.2 Data Modeling, a Whirlwind Tour
3.3 Qualitative and Quantitative Data
3.4 Interpretive Data
3.5 Sequence Data
3.6 Gene Ontology
3.7 Image Data
3.8 Text Data
3.9 Protein and Compound 3D Structural Data
3.10 Data Integration
3.10.1 Same Entity, Different Identifiers
3.10.2 Data in Different Formats
3.10.3 Default Values
3.10.4 Multiscale Data Integration
3.11 The Test Subject Area
3.12 The Transactional Data Model, Operational Data Store, Data Warehouse, and Data Mart
3.13 Modeling Data with a Time Dimension
3.14 Historical Data and Archiving: Another Temporal Issue?
3.15 Data Management
3.16 Agile Modeling, Adaptive Data Warehouses, and Hybrid Models
3.17 Gene Ontology Database
3.18 Broadening Our Concept of Data: Images, Structures, and so on
3.19 Conclusion
References

4 Representing Data Mining Results
4.1 Introduction
4.2 Tabular Representations
4.3 Decision Tables, Decision Trees, and Classification Rules
4.4 Classification and Regression Trees (CARTs)
4.4.1 Process
4.4.2 Cross-validation
4.4.3 Summary
4.5 Association Rules
4.5.1 Context
4.5.2 The Apriori Algorithm
4.5.3 Rules, Rules, and More Rules
4.6 Some Basic Graphical Representations
4.6.1 Representing the Instances Themselves
4.6.2 Histograms
4.6.3 Frequency Polygrams
4.6.4 Data Plots
4.7 Representing the Output from Statistical Models
4.7.1 Boxplots
4.7.2 Scatterplots
4.7.3 Q-Q Plots
4.8 Heat Maps
4.9 Position Maps
4.9.1 Microbial Genome Viewer
4.9.2 GenomePlot
4.9.3 GenoMap
4.9.4 Circular Genome Viewer
4.10 Genotype Visualization
4.11 Network Visualizations
4.12 Some Tools
4.12.1 Cytoscape
4.12.2 TreeView: Phylogenetic Tree Visualization
4.12.3 Systems Biology Markup Language
4.12.4 R and RStudio
4.12.5 Weka
4.12.6 Systems Biology Graphical Notation
4.13 Conclusion
References

5 The Input Side of the Equation
5.1 Data Preparation
5.2 Internal Consistency
5.3 External Consistency
5.4 Standardization
5.5 Transformation
5.5.1 Transforming Categorical Attributes
5.6 Normalization
5.6.1 Averaging
5.6.2 Min-Max Normalization
5.7 Denormalization
5.7.1 Prejoining Tables
5.7.2 Report Tables
5.7.3 Mirroring Tables
5.7.4 Splitting Tables
5.7.5 Combining Tables
5.7.6 Copying Data Across Entities
5.7.7 Repeating Groups
5.7.8 Derived Data
5.7.9 Hierarchies of Data
5.8 Extract, Transform, and Load (ETL)
5.8.1 Integrating External Data
5.8.2 Gene/Protein Identifier Cross-Referencing
5.9 Aggregation and Calculation
5.9.1 Aggregation
5.9.2 Calculation
5.10 Noisy Data
5.10.1 Regression
5.10.2 Binning
5.10.3 Clustering
5.11 Metadata
5.11.1 Attribute Type
5.11.2 Attribute Role
5.11.3 Attribute Description
5.12 Missing Values
5.12.1 Other Issues with Missing Data
5.13 Data Reduction
5.13.1 Attribute Subsets
5.13.2 Data Aggregation
5.13.3 Dimensional Reduction
5.13.4 Alternate Representations
5.13.5 Generalization
5.13.6 Discretization
5.14 Refreshing the Dataset
5.14.1 Adding Data to Our Dataset
5.14.2 Removing Data from Our Dataset
5.14.3 Correcting the Data in Our Dataset
5.14.4 Data Changes that Significantly Affect Our Dataset
5.15 Conclusion
References

6 Statistical Methods
6.1 Introduction
6.1.1 Basic Statistical Methods
6.2 Statistical Inference and Hypothesis Testing
6.2.1 What Constitutes a “Good” Hypothesis?
6.2.2 The Null Hypothesis
6.2.3 p Value
6.2.4 Type I and Type II Errors
6.2.5 An Example: Comparing Two Groups, the t Test
6.2.6 Back to Square One?
6.2.7 Some Other Hypothesis Testing Methods
6.2.8 Multiple Comparisons and Multiple-Testing Correlation
6.3 Measures of Central Tendency, Variability, and Other Descriptive Statistics
6.3.1 Location
6.3.2 Variability
6.3.3 Heterogeneity
6.3.4 Concentration
6.3.5 Asymmetry
6.3.6 Kurtosis
6.3.7 Same Means, Different Variances – Different Variances, Same Means
6.4 Frequency Distributions
6.5 Confidence Intervals
6.5.1 Continuous Data Confidence Levels
6.6 Regression Analysis
6.6.1 Linear Regression
6.6.2 Correlation Coefficient
6.6.3 Multiple Linear Regression
6.7 Maximum Likelihood Estimation Method
6.7.1 Illustration of the MLE Method
6.7.2 Use in Data Mining
6.8 Maximum A Posteriori (MAP) Estimation
6.9 Enrichment Analysis
6.10 False Discovery Rate (FDR)
6.11 Statistical Significance and Clinical Relevance
6.12 Conclusion
References

7 Bayesian Statistics
7.1 Introduction
7.2 Bayesian Formulations
7.2.1 Bayes’ Theorem
7.3 Assigning Probabilities, Odds Ratios, and Bayes Factor
7.3.1 Probability Assignment
7.3.2 Odds Ratios
7.3.3 Bayes Factor
7.4 Putting It All Together
7.5 Bayesian Reasoning
7.5.1 A Simple Illustration
7.6 Bayesian Classification
7.7 Bayesian Belief Networks
7.8 Parameter Estimation Methods
7.9 Multiple Datasets
7.10 Hidden Markov Models (HMM)
7.11 Conditional Random Field (CRF)
7.12 Array Comparative Genomic Hybridization
7.13 Conclusion
References

8 Machine-Learning Techniques
8.1 Introduction
8.1.1 Missing and Erroneous Data
8.2 Measure of Similarity
8.3 Supervised Learning
8.3.1 Classification Learning
8.4 Unsupervised Learning
8.4.1 Association Learning
8.4.2 Clustering
8.5 Semisupervised Learning
8.5.1 Expectation Maximization
8.5.2 Cotraining
8.5.3 Graph-Based SSL Methods
8.5.4 Is This Type of Approach Always Helpful?
8.6 Kernel Learning Methods
8.7 Let’s Break for Some Examples
8.7.1 String and Tree Matching
8.7.2 Protein Structure Classification and Prediction
8.8 Support Vector Machines
8.8.1 Gene Expression Analysis
8.9 Artificial Neural Networks
8.9.1 Introduction
8.9.2 Neurons
8.9.3 Neuronal Functions
8.9.4 Encoding the Input
8.9.5 Training Methods
8.9.6 ANN Architectures
8.9.7 Application to Data Mining
8.10 Reinforcement Learning
8.11 Some Other Techniques
8.11.1 Random Walk, Diffusion Map, and Spectral Clustering
8.11.2 Network (Graph)-Based Analysis
8.11.3 Network Motif Discovery
8.11.4 Binary Tree Algorithm in Drug Target Discovery Studies
8.11.5 Petri Nets
8.11.6 Boolean and Fuzzy Logic
8.12 Conclusion
References
383
385
386
386
389
394
399
399
399
401
406
406
424
424
425
425
9 Classification and Prediction
9.1 Introduction
9.2 Data Preparation
9.3 Linear Regression
9.4 Decision Trees
9.5 1R
9.6 Nearest Neighbor
9.7 Bayesian Modeling
9.8 Neural Networks
9.9 k-Means
9.10 Distance Measures
9.11 Measuring Accuracy
9.11.1 Classifiers
9.11.2 Predictors
9.11.3 Evaluating Accuracy
9.11.4 Improving Classifier/Predictor Accuracy
9.12 Conclusion
References
10 Informatics
10.1 Introduction
10.1.1 Sources of Genomic and Proteomic Data
10.2 Data Integration
10.2.1 Integrating Annotations for a Common Sequence
10.2.2 Data in Different Formats
10.3 Tools and Databases
10.3.1 Programming Languages and Environments
10.4 Standardization
10.5 Microarrays
10.5.1 Challenges and Opportunities
10.5.2 Classification
10.5.3 Gene Selection
10.6 Finding Motifs
10.6.1 Regular Expressions
10.7 Analyzing DNA
10.7.1 Pairwise Sequence Alignment
10.7.2 Multiple Alignment
10.7.3 Trees
10.8 Conclusion
References
11 Systems Biology
11.1 What Is This Systems Biology of Which You Speak?
11.2 Biological Networks
11.3 How Much Biology Do We Need?
11.3.1 Biological Processes As Ordered Sequences of Events
11.4 But, We Do Need Some Graph Theory
11.4.1 Data Mining by Navigation
11.4.2 So. . .How Are We Going to Cram All This into a Single Chapter?
11.5 Gene Ontology and Microarray Databases
11.5.1 Gene Ontology (GO)
11.5.2 Microarray Databases
11.6 Text Mining
11.7 Some Core Problems and Techniques
11.7.1 Shotgun Fragment Assembly
11.7.2 The BioCreAtIvE Initiative
11.8 Data Mining in Systems Biology
11.8.1 Network Analysis
11.9 Novel Initiatives
11.9.1 Data Mining Promises to Dig Up New Drugs
11.9.2 Temporal Interactions Among Genes
11.10 The Cloud
11.11 Where to Next?
11.12 Have We Left Anything Out? Boy, Have We Ever
References
12 Let’s Call It a Day
12.1 We’ve Covered a Lot, But Not Really
12.2 When Two Models Are Better Than One
12.3 The Most Widely Used Algorithms?
12.4 Documenting Your Process
12.5 Where To From Here?
References
Appendix A
Appendix B
Appendix C
Appendix D
References
Index
Chapter 1
Introduction
Abstract What is this “data mining” thing that has been gradually percolating
through our scientific consciousness and why would I want to use its techniques?
In fact, what are those techniques and how can the results returned from using them
help me? What are some of the problems we encounter within the data mining
domain and how can we use these different techniques to not only overcome them,
but to provide us with new insight into the data we have at hand? These are some of
the questions we will try to answer and put into context, laying the
foundation for the more detailed discussion in the remainder of this text. Our objective
in this chapter is not to provide the depth and the breadth of the issues and
opportunities we face, but to provide an introduction – the overall context – for
everything else. In a nutshell, we want to try and begin to answer the question “why
do we need data mining?”
1.1 Context
As any discipline matures, the amount of data collected grows. We only need to
look at the plethora of business systems to see that the amount of data generated and
manipulated in any organization today is vast.
In fact, a research project by Peter Lyman and Hal Varian at UC Berkeley
estimated that “Print, film, magnetic, and optical storage media produced about
5 exabytes of new information in 2002.”¹ To put this in perspective, they give the
analogy that “If digitized with full formatting, the seventeen million books in the
Library of Congress contain about 136 terabytes of information; five exabytes of
information is equivalent in size to the information contained in 37,000 new
libraries the size of the Library of Congress book collections.” Their analysis
goes on to state that the amount of information generated in 2002 is double that
generated in 1999. An interesting infographic created by Cisco in 2011
predicts that we’ll move beyond exabytes and petabytes and enter the realm of the
zettabyte by 2015. For a point of reference, a zettabyte is roughly equivalent to the
amount of data that can be stored on 250 billion DVDs! That might take some
time to analyze.
¹ Lyman, Peter and Hal R. Varian, “How Much Information,” 2003. Retrieved from http://www.sims.berkeley.edu/how-much-info-2003 on August 4, 2006.
R. Sullivan, Introduction to Data Mining for the Life Sciences,
DOI 10.1007/978-1-59745-290-8_1, © Springer Science+Business Media, LLC 2012
Obviously, most domains of study won’t come anywhere near to needing to deal
with data of that size, but almost any domain is dealing with gigabytes and
terabytes. Some are already skirting the petabyte range, so our techniques definitely
need to be scalable.
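The storage comparisons quoted earlier are easy to sanity-check with a few lines of arithmetic. The sketch below assumes decimal units (1 ZB = 10^21 bytes) and a single-layer DVD capacity of 4.7 GB; both assumptions are ours rather than figures taken from the sources cited, and the function names are purely illustrative.

```python
# Back-of-the-envelope check of the storage comparisons quoted in the text.
# Assumptions (not from the cited sources): decimal SI units and a
# single-layer DVD capacity of 4.7 GB.
TERABYTE = 10**12
EXABYTE = 10**18
ZETTABYTE = 10**21
DVD_BYTES = 4.7 * 10**9


def libraries_of_congress(n_bytes, library_bytes=136 * TERABYTE):
    """Number of 136-terabyte book collections needed to hold n_bytes."""
    return n_bytes / library_bytes


def dvds_needed(n_bytes, dvd_bytes=DVD_BYTES):
    """Number of DVDs needed to hold n_bytes."""
    return n_bytes / dvd_bytes


if __name__ == "__main__":
    libs = libraries_of_congress(5 * EXABYTE)
    dvds = dvds_needed(ZETTABYTE)
    print(f"5 EB is about {libs:,.0f} Library of Congress book collections")
    print(f"1 ZB is about {dvds / 1e9:,.0f} billion DVDs")
```

Under these assumptions, five exabytes comes to roughly 37,000 Library of Congress book collections, and a zettabyte comes to a few hundred billion DVDs. The exact DVD count shifts with the capacity and unit convention chosen.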
In this text, we consider the data asset to be important in and of itself since it is
a cornerstone of almost every process and discipline, allowing people to leverage
its contents, draw conclusions, and support the decision-making process.
However, in order to do so, various techniques, approaches, and tools are
necessary to get the most out of this asset. We begin by discussing general
infrastructure and fundamental concepts and then move on to discussing a variety
of techniques.
The focus of this book is data mining, but data mining with application to the life
sciences. The first question, therefore, is what constitutes data mining? A (very)
brief search on the Internet brought up the following definitions, among others:
• The process of analyzing data to identify patterns or relationships.
• The ability to query very large databases in order to satisfy a hypothesis (“top down” data mining); or to interrogate a database in order to generate new hypotheses based on rigorous statistical correlations (“bottom-up” data mining).
• Searching large volumes of data looking for patterns that accurately predict behavior in customers and prospects.
• Nontrivial extraction of implicit, previously unknown and potentially useful information from data, or the search for relationships and global patterns that exist in databases.
• The process of autonomously extracting useful information or knowledge (“actionable assets”) from large data stores or sets. Data mining can be performed on a variety of data stores, including the World Wide Web, relational databases, transactional databases, internal legacy systems, PDF documents, and data warehouses.
• The process of using statistical techniques to discover subtle relationships between data items, and the construction of predictive models based on them. The process is not the same as just using an OLAP tool to find exceptional items. Generally, data mining is a very different and more specialist application than OLAP, and uses different tools from different vendors.
• Data mining serves to find information that is hidden within the available data.
• The comparison and study of large databases in order to discover new data relationships. Mining a clinical database may produce new insights on outcomes, alternate treatments or effects of treatment on different races and genders.
• The unguided (or minimally guided) application of a collection of mathematical procedures to a company’s data warehouse in an effort to find “nuggets” in the form of statistical relationships.
• The automated or semi-automated search for relationships and global patterning within data. Data mining techniques include data visualization, neural network analysis, and genetic algorithms.
• Using advanced statistical tools to identify commercially useful patterns in databases.
• Analyzing data to discover patterns and relationships that are important to decision making.
• A process of analyzing business data (often stored in a data warehouse) to uncover hidden trends and patterns and establish relationships. Data mining is normally performed by expert analysts who use specialist software tools.
• A means of extracting previously unknown, actionable information from the growing base of accessible data in data warehouses using sophisticated, automated algorithms to discover hidden patterns, correlations and relationships.
• The collection and mathematical analysis of vast amounts of computerized data to discover previously hidden patterns or unknown relationships.
• Data mining uses complex algorithms to search large amounts of data and find patterns, correlations, and trends in that data.
• Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition.
These definitions highlight many of the common themes we will encounter
throughout this book, including:
• Identifying patterns and/or relationships that are previously unknown
• Satisfying and/or generating hypotheses
• Building models for prediction
• Statistical techniques, neural networks, and genetic algorithms
• Proposing new insights
• Automated (unguided) or semiautomated (minimally guided) search
Finding new phenomena within the vast data reserves being generated by
organizations potentially offers tremendous value both in the short term and the
long term. We may, for example, be able to highlight side effects of a treatment for
subsets of the target population before they become significant. We may be able to
identify new indications for an existing product. We may even be able to identify
previously unknown, and unsuspected, patterns within the genomic and proteomic
data that have been one of the causes of the data explosion at the end of the
twentieth century and into the twenty-first century.
Data mining, or knowledge discovery, therefore has become an important topic
in many areas of science and technology since one of its predicates is the (semi)autonomous
trolling of data. With many databases growing exponentially, this
capability has moved from the nice-to-have category into the must-have category.
For most data-mining activities, data from multiple sources are often combined into
a single environment – most often called a data warehouse. However, integrating
different sources is often a source (sic) of challenge since the source domains may
differ, key fields or other designators may be incompatible with each other, or, even
after data attributes from different sources are matched, their coverage may still be
a source of problems. We consider these and other issues concerned with building
the data architecture to provide a robust platform for data-mining efforts.
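To make the integration problems described above concrete, the following is a minimal sketch, not a recommended architecture, of combining two sources whose key fields are formatted differently and then measuring how much the sources actually overlap. The identifier formats (`SUBJ-0012`, `pt12`), the field names, and the helper functions `normalize_key` and `integrate` are all hypothetical and were invented for this illustration.

```python
# Minimal sketch: integrating two sources whose key fields are formatted
# differently, then measuring how well the sources cover each other.
# All identifiers, field names, and values below are invented for illustration.


def normalize_key(raw):
    """Map source-specific subject identifiers onto one canonical form."""
    key = str(raw).strip().upper()
    for prefix in ("SUBJ-", "PT"):
        if key.startswith(prefix):
            key = key[len(prefix):]
    # Drop zero padding so that "0012" and "12" refer to the same subject.
    return key.lstrip("0") or "0"


def integrate(source_a, source_b):
    """Merge two {key: record} mappings; return merged rows and coverage."""
    a = {normalize_key(k): v for k, v in source_a.items()}
    b = {normalize_key(k): v for k, v in source_b.items()}
    merged = {k: {**a.get(k, {}), **b.get(k, {})} for k in a.keys() | b.keys()}
    matched = a.keys() & b.keys()
    # Coverage: share of all subjects that appear in both sources.
    coverage = len(matched) / len(merged) if merged else 0.0
    return merged, coverage


if __name__ == "__main__":
    labs = {"SUBJ-0012": {"alt_u_per_l": 31}, "SUBJ-0047": {"alt_u_per_l": 58}}
    demographics = {"pt12": {"age": 54}, "pt93": {"age": 61}}
    merged, coverage = integrate(labs, demographics)
    for key in sorted(merged):
        print(key, merged[key])
    print(f"share of subjects present in both sources: {coverage:.0%}")
```

Even after the keys are reconciled, only one of the three subjects in this toy example appears in both sources, which is the coverage problem noted above: matching attributes does not guarantee that the combined data set is complete.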
Within this book, we consider the objective of data mining to be providing a framework for making predictions and discoveries using large amounts of data. Within
this context, we will consider many different perspectives, identifying tools and
techniques from each, along with a focus on their use in a number of scientific
disciplines to provide a comprehensive handbook for data mining in the life
sciences. Such perspectives³ include:
• Statistics
• Pattern recognition
• Database management systems
• Artificial intelligence
³ This list contains entries which overlap, some of which are contained in other entries, and some of which have vastly different scopes. Our purpose is to highlight some of the more commonly recognized subject areas.
The next question we ask is what constitutes data mining? This is actually a
more difficult question to answer than one might initially think since data mining
and knowledge discovery from databases mean different things to different people,
such as:
• Generalization
• Classification
• Association
• Clustering
• Frequent pattern analysis
• Structured pattern analysis
• Outlier analysis
• Trend analysis
• Deviation analysis
• Stream data mining
• Biological data mining
• Time-series analysis
• Text mining
• Intrusion detection
• Web mining
• Privacy-preserving mining
Note that although the above list includes a number of items, it is by no means
exhaustive. Further, many of these areas have been grouped together to provide
value, such as with frequent pattern and structured pattern analysis, or trend and
deviation analysis.
Information technology has revolutionized many areas of science and industry.
Today, hardly any step of the pharmaceutical development process, for example, is
untouched by the information technology revolution of the latter part of the twentieth
century. However, we have an overabundance of data from which we are struggling to
extract information. This “data rich but information poor” condition is causing a
challenge for our industry, as Fayyad (1996) succinctly stated in the Preface:
Our capabilities for collecting and storing data of all kinds have far outpaced our abilities to
analyze, summarize, and extract “knowledge” from this data.
Without powerful tools and techniques, we will find it increasingly difficult, if
not impossible, to analyze and comprehend this ever-increasing, distributed base of
data. We believe that the key word here is knowledge: in many disciplines,
scientific or otherwise, certain data is valued over other data since it has been
shown to have more significance and thus, higher value. Eliciting knowledge from
this data is often the domain of specialists who have built up their knowledge and
expertise over time. Being able to provide tools and techniques to elicit this
knowledge as quickly as possible is obviously a major goal of data mining, which
has tremendous potential for specialists and laypersons alike. Understanding
how data elements relate to each other is not always intuitive. As the accuracy of
tests improves and the breadth of tests improves, the amount of data increases, often
exponentially. Our existing heuristics and knowledge serve us well, but the
interrelationships are often nonintuitive. Stepping outside of the scientific discipline
for a moment, a business school professor posed the following question: why would
a (barbeque) grill manufacturer buy a company that makes fire-starter logs? The
answer was because the grill manufacturer’s revenue model dipped in the winter
season whereas the fire-starter log’s revenue model increased in that same season
(and dipped in the summer season), thus smoothing out the revenue model for the
combined company. Not necessarily an intuitive relationship.
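The grill and fire-starter log relationship becomes visible once the two revenue series are placed side by side. The sketch below uses invented quarterly figures (they are not data from any real company) to show that two strongly negatively correlated series yield a much smoother combined series. The helper names `pearson` and `coefficient_of_variation` are ours.

```python
# Minimal sketch: two seasonal revenue streams that move in opposite
# directions produce a smoother combined stream.
# The quarterly figures are invented for illustration only.
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def coefficient_of_variation(xs):
    """Relative spread: population standard deviation divided by the mean."""
    return pstdev(xs) / mean(xs)


# Quarterly revenue (winter, spring, summer, autumn), arbitrary units.
grills = [20, 60, 90, 40]
fire_logs = [80, 35, 10, 55]
combined = [g + f for g, f in zip(grills, fire_logs)]

if __name__ == "__main__":
    print(f"correlation between the two series: {pearson(grills, fire_logs):.2f}")
    for name, series in (("grills", grills), ("fire logs", fire_logs), ("combined", combined)):
        print(f"{name:>9}: relative spread {coefficient_of_variation(series):.2f}")
```

A correlation close to -1 between two attributes is exactly the kind of relationship that is easy to compute but hard to guess in advance, which is why automated search over many attribute pairs can surface combinations a domain expert would not think to examine.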
Consider an example: what do the following drugs have in common? (Fig. 1.1)
Triazolam, Cerivastatin, Fen-phen, Rapacuronium, Terfenadine, Rofecoxib, Mibefradil, Palladone, Troglitazone, Vioxx, Alosetron, Pemoline, Cisapride, Tysabri
Fig. 1.1 Drugs withdrawn from US market since 1990
They are all drugs which have been withdrawn from the market since 1990. The
reasons for withdrawal range from toxicity to elevated risks (stroke, myocardial
infarction, cardiac arrhythmias, etc.). Are there any common factors between some⁵
or all of the problems with these drugs? Can we see any relationships between the
toxicity issues seen in a drug class, for example?
⁵ As we shall discuss throughout, any sufficiently complex area of study will not allow us to see global patterns of any complexity and value.
Much of the work performed throughout the drug discovery and development
process aims to mitigate issues which may be seen later, but it is a significant challenge
for any organization. Data mining has the potential to help in several areas:
• Aggregate and synthesize data using intelligent techniques to allow us to get a higher-level view of our data and to drill down into the details, thus allowing a manageable stepping down into the data
• Identify (hidden) relationships, patterns, associations, and correlations that exist in the very large amounts of data captured through the discovery and development process so that we can uncover potential issues before they occur
• Compare data from one clinical program with another program in the same therapeutic class, introducing normalization and standardization across programs to make the comparisons meaningful
• Help understand where/if there are dependencies and interactions with factors completely outside of our area of focus⁶
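As one illustration of the normalization mentioned in the list above, the sketch below standardizes a measurement within each clinical program so that values recorded on different scales can be compared. The readings are invented, the two "programs" are hypothetical, and z-scoring is only one of several possible standardization choices. It removes differences in units and scale, but it also removes any genuine difference in average level between programs, which may or may not be what an analysis needs.

```python
# Minimal sketch: standardizing a measurement within each clinical program so
# that values recorded on different scales can be compared.
# The readings below are invented; z-scoring is one possible choice of method.
from statistics import mean, stdev


def z_scores(values):
    """Return (value - mean) / sample standard deviation for each value."""
    m = mean(values)
    s = stdev(values)  # requires at least two values
    if s == 0:
        raise ValueError("cannot standardize a constant series")
    return [(v - m) / s for v in values]


# Hypothetical fasting glucose readings from program A, in mg/dL.
program_a_mg_dl = [90.0, 99.0, 108.0, 126.0, 117.0]
# Hypothetical readings from program B, recorded in mmol/L. For this toy
# example they are simply the program A values divided by 18.
program_b_mmol_l = [5.0, 5.5, 6.0, 7.0, 6.5]

if __name__ == "__main__":
    for name, values in (("A", program_a_mg_dl), ("B", program_b_mmol_l)):
        print(name, [round(z, 2) for z in z_scores(values)])
```

Because the second series is a rescaled copy of the first, both programs produce the same standardized values, which shows that the comparison no longer depends on the unit each program happened to record.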
The various subjects discussed in this book leverage the gamut of breakthroughs
which comprise this discipline. While we attempt to avoid acronyms and technobabble wherever possible, they are sometimes unavoidable. We assume that
the typical reader has a reasonable level of information
technology knowledge, along with a basic understanding of the various life sciences
disciplines. Thus, topics such as domain names, IP addresses, the Internet, the
World Wide Web, file transfer protocol, and the like are assumed to be familiar to
the reader from the usage perspective. Also, molecular biology, the pharmaceutical
discovery and development process, and the data aspects of the genomics and
proteomics disciplines are assumed. This book does not cover probability theory
as it is assumed the reader will have a command of this subject in both its single
variable and multivariate forms.
⁶ Viagra might be one of the best known products, in recent times, that started off as a drug for
hypertension and angina pectoris and ended up as an erectile dysfunction (ED) drug: “. . . a
telephone conversation with the clinician who was running the trials: ‘He mentioned that at
50 mg taken every 8 h for 10 days, there were episodes of indigestion [and of] aches in patients’
backs and legs. And he said, “Oh, there are also some reports of penile erections.”’” (Kling 1998).
To provide a holistic context to these areas, we consider where the various
data-mining algorithms and models may be useful within the pharmaceutical
discovery and development process. The information asset is very important within
the pharmaceutical development effort: it is something which is gathered from the
very inception of the idea and, as we are finding from press releases in the years
after 2003, which never goes away. This valuable piece of the puzzle adds a
significant amount to our overall understanding of the data at hand but also allows
us to predict, extrapolate, and view the future through a somewhat clear lens. Thus,
we include a significant second section that deals with leveraging that asset. Once
you have a comprehensive data environment, how can you use it? What are some
of the new, and classic, techniques which can provide significant value to your
processes? We hope to provide some answers to these questions and also spur on
other questions.
Much data is gathered through clinical trials. Since the different clinical trial
design types impact the type of data captured, we will often refer to them as phases
I, II, III, and IV. We use the definitions as described in Piantadosi (1997):
Phase I: pharmacologically oriented (best dose of drug to employ)
Phase II: preliminary evidence of efficacy (and side effects at a fixed dose)
Phase III: new treatments compared (with standard therapy, no therapy, or placebo)
Phase IV: postmarketing surveillance
Throughout this book, we have avoided a few terms currently in common usage.
Terms such as bioinformatics, chemoinformatics, immunological bioinformatics,
and their like are robust disciplines and worthy approaches in their own rights. The
reason for our avoiding any specific term is that much of what we describe, from
the data, methodology, algorithm, and subject-area perspectives, is applicable in
any of these disciplines. We will, however, consider models and approaches and use
subjects within these areas. Bioinformatics as a term, for example, covers a
multitude of areas, and a book on data mining in the life sciences would not be
complete without considering such things.
We are seeing a confluence of subjects within the concept of systems biology,
and data mining is providing significant value within this discipline. It does not take
too much to predict that the amount of data that will be generated into the future will
be significantly more than has already been seen.
A valid question to ask at this point is whether all patterns uncovered by a data-mining effort are interesting. A corollary is how we avoid the work and processing
associated with such “uninteresting” discoveries. As we shall see, the individual
does not become ancillary to the process but continues to drive the process, investing
effort in different steps. Further, we can keep some considerations in mind and ask
ourselves whether:
• The results are easily understood.
• They are valid on new data or on the remaining test data, with some degree of certainty.
• They are potentially useful.
• They are novel, or they validate some hypothesis.