Statistical Modeling and
Machine Learning for
Molecular Biology
CHAPMAN & HALL/CRC
Mathematical and Computational Biology Series
Aims and scope:
This series aims to capture new developments and summarize what is known
over the entire spectrum of mathematical and computational biology and
medicine. It seeks to encourage the integration of mathematical, statistical,
and computational methods into biology by publishing a broad range of
textbooks, reference works, and handbooks. The titles included in the
series are meant to appeal to students, researchers, and professionals in the
mathematical, statistical and computational sciences, fundamental biology
and bioengineering, as well as interdisciplinary researchers involved in the
field. The inclusion of concrete examples and applications, and programming
techniques and examples, is highly encouraged.
Series Editors
N. F. Britton
Department of Mathematical Sciences
University of Bath
Xihong Lin
Department of Biostatistics
Harvard University
Nicola Mulder
University of Cape Town
South Africa
Maria Victoria Schneider
European Bioinformatics Institute
Mona Singh
Department of Computer Science
Princeton University
Anna Tramontano
Department of Physics
University of Rome La Sapienza
Proposals for the series should be submitted to one of the series editors above or directly to:
CRC Press, Taylor & Francis Group
3 Park Square, Milton Park
Abingdon, Oxfordshire OX14 4RN
UK
Published Titles
An Introduction to Systems Biology:
Design Principles of Biological Circuits
Uri Alon
Glycome Informatics: Methods and
Applications
Kiyoko F. Aoki-Kinoshita
Computational Systems Biology of
Cancer
Emmanuel Barillot, Laurence Calzone,
Philippe Hupé, Jean-Philippe Vert, and
Andrei Zinovyev
Python for Bioinformatics
Sebastian Bassi
Quantitative Biology: From Molecular to
Cellular Systems
Sebastian Bassi
Methods in Medical Informatics:
Fundamentals of Healthcare
Programming in Perl, Python, and Ruby
Jules J. Berman
Computational Biology: A Statistical
Mechanics Perspective
Ralf Blossey
Game-Theoretical Models in Biology
Mark Broom and Jan Rychtář
Computational and Visualization
Techniques for Structural Bioinformatics
Using Chimera
Forbes J. Burkowski
Structural Bioinformatics: An Algorithmic
Approach
Forbes J. Burkowski
Normal Mode Analysis: Theory and
Applications to Biological and Chemical
Systems
Qiang Cui and Ivet Bahar
Kinetic Modelling in Systems Biology
Oleg Demin and Igor Goryanin
Data Analysis Tools for DNA Microarrays
Sorin Draghici
Statistics and Data Analysis for
Microarrays Using R and Bioconductor,
Second Edition
Sorin Drăghici
Computational Neuroscience:
A Comprehensive Approach
Jianfeng Feng
Biological Sequence Analysis Using
the SeqAn C++ Library
Andreas Gogol-Döring and Knut Reinert
Gene Expression Studies Using
Affymetrix Microarrays
Hinrich Göhlmann and Willem Talloen
Handbook of Hidden Markov Models
in Bioinformatics
Martin Gollery
Meta-analysis and Combining
Information in Genetics and Genomics
Rudy Guerra and Darlene R. Goldstein
Differential Equations and Mathematical
Biology, Second Edition
D.S. Jones, M.J. Plank, and B.D. Sleeman
Knowledge Discovery in Proteomics
Igor Jurisica and Dennis Wigle
Spatial Ecology
Stephen Cantrell, Chris Cosner, and
Shigui Ruan
Introduction to Proteins: Structure,
Function, and Motion
Amit Kessel and Nir Ben-Tal
Cell Mechanics: From Single Scale-Based Models to Multiscale Modeling
Arnaud Chauvière, Luigi Preziosi,
and Claude Verdier
RNA-seq Data Analysis: A Practical
Approach
Eija Korpelainen, Jarno Tuimala,
Panu Somervuo, Mikael Huss, and Garry Wong
Bayesian Phylogenetics: Methods,
Algorithms, and Applications
Ming-Hui Chen, Lynn Kuo, and Paul O. Lewis
Introduction to Mathematical Oncology
Yang Kuang, John D. Nagy, and
Steffen E. Eikenberry
Statistical Methods for QTL Mapping
Zehua Chen
Biological Computation
Ehud Lamm and Ron Unger
Optimal Control Applied to Biological
Models
Suzanne Lenhart and John T. Workman
Genome Annotation
Jung Soh, Paul M.K. Gordon, and
Christoph W. Sensen
Clustering in Bioinformatics and Drug
Discovery
John D. MacCuish and Norah E. MacCuish
Niche Modeling: Predictions from
Statistical Distributions
David Stockwell
Spatiotemporal Patterns in Ecology
and Epidemiology: Theory, Models,
and Simulation
Horst Malchow, Sergei V. Petrovskii, and
Ezio Venturino
Algorithms in Bioinformatics: A Practical
Introduction
Wing-Kin Sung
Stochastic Dynamics for Systems
Biology
Christian Mazza and Michel Benaïm
The Ten Most Wanted Solutions in
Protein Bioinformatics
Anna Tramontano
Statistical Modeling and Machine
Learning for Molecular Biology
Alan M. Moses
Combinatorial Pattern Matching
Algorithms in Computational Biology
Using Perl and R
Gabriel Valiente
Engineering Genetic Circuits
Chris J. Myers
Pattern Discovery in Bioinformatics:
Theory & Algorithms
Laxmi Parida
Exactly Solvable Models of Biological
Invasion
Sergei V. Petrovskii and Bai-Lian Li
Computational Hydrodynamics of
Capsules and Biological Cells
C. Pozrikidis
Modeling and Simulation of Capsules
and Biological Cells
C. Pozrikidis
Cancer Modelling and Simulation
Luigi Preziosi
Introduction to Bio-Ontologies
Peter N. Robinson and Sebastian Bauer
Dynamics of Biological Systems
Michael Small
Introduction to Bioinformatics
Anna Tramontano
Managing Your Biological Data with
Python
Allegra Via, Kristian Rother, and
Anna Tramontano
Cancer Systems Biology
Edwin Wang
Stochastic Modelling for Systems
Biology, Second Edition
Darren J. Wilkinson
Big Data Analysis for Bioinformatics and
Biomedical Discoveries
Shui Qing Ye
Bioinformatics: A Practical Approach
Shui Qing Ye
Introduction to Computational
Proteomics
Golan Yona
Statistical Modeling and
Machine Learning for
Molecular Biology
Alan M. Moses
University of Toronto, Canada
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20160930
International Standard Book Number-13: 978-1-4822-5859-2 (Paperback)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Names: Moses, Alan M., author.
Title: Statistical modeling and machine learning for molecular biology / Alan M.
Moses.
Description: Boca Raton : CRC Press, 2016. | Includes bibliographical
references and index.
Identifiers: LCCN 2016028358| ISBN 9781482258592 (hardback : alk. paper) |
ISBN 9781482258615 (e-book) | ISBN 9781482258622 (e-book) | ISBN
9781482258608 (e-book)
Subjects: LCSH: Molecular biology–Statistical methods. | Molecular
biology–Data processing.
Classification: LCC QH506 .M74 2016 | DDC 572.8–dc23
LC record available at the Library of Congress.
Visit the Taylor & Francis Web site and the CRC Press Web site.
For my parents
Contents

Acknowledgments

Section I: Overview

Chapter 1 ◾ Across Statistical Modeling and Machine Learning on a Shoestring
  1.1 ABOUT THIS BOOK
  1.2 WHAT WILL THIS BOOK COVER?
    1.2.1 Clustering
    1.2.2 Regression
    1.2.3 Classification
  1.3 ORGANIZATION OF THIS BOOK
  1.4 WHY ARE THERE MATHEMATICAL CALCULATIONS IN THE BOOK?
  1.5 WHAT WON’T THIS BOOK COVER?
  1.6 WHY IS THIS A BOOK?
  REFERENCES AND FURTHER READING

Chapter 2 ◾ Statistical Modeling
  2.1 WHAT IS STATISTICAL MODELING?
  2.2 PROBABILITY DISTRIBUTIONS ARE THE MODELS
  2.3 AXIOMS OF PROBABILITY AND THEIR CONSEQUENCES: “RULES OF PROBABILITY”
  2.4 HYPOTHESIS TESTING: WHAT YOU PROBABLY ALREADY KNOW ABOUT STATISTICS
  2.5 TESTS WITH FEWER ASSUMPTIONS
    2.5.1 Wilcoxon Rank-Sum Test, Also Known As the Mann–Whitney U Test (or Simply the WMW Test)
    2.5.2 Kolmogorov–Smirnov Test (KS-Test)
  2.6 CENTRAL LIMIT THEOREM
  2.7 EXACT TESTS AND GENE SET ENRICHMENT ANALYSIS
  2.8 PERMUTATION TESTS
  2.9 SOME POPULAR DISTRIBUTIONS
    2.9.1 The Uniform Distribution
    2.9.2 The T-Distribution
    2.9.3 The Exponential Distribution
    2.9.4 The Chi-Squared Distribution
    2.9.5 The Poisson Distribution
    2.9.6 The Bernoulli Distribution
    2.9.7 The Binomial Distribution
  EXERCISES
  REFERENCES AND FURTHER READING

Chapter 3 ◾ Multiple Testing
  3.1 THE BONFERRONI CORRECTION AND GENE SET ENRICHMENT ANALYSIS
  3.2 MULTIPLE TESTING IN DIFFERENTIAL EXPRESSION ANALYSIS
  3.3 FALSE DISCOVERY RATE
  3.4 eQTLs: A VERY DIFFICULT MULTIPLE-TESTING PROBLEM
  EXERCISES
  REFERENCES AND FURTHER READING

Chapter 4 ◾ Parameter Estimation and Multivariate Statistics
  4.1 FITTING A MODEL TO DATA: OBJECTIVE FUNCTIONS AND PARAMETER ESTIMATION
  4.2 MAXIMUM LIKELIHOOD ESTIMATION
  4.3 LIKELIHOOD FOR GAUSSIAN DATA
  4.4 HOW TO MAXIMIZE THE LIKELIHOOD ANALYTICALLY
  4.5 OTHER OBJECTIVE FUNCTIONS
  4.6 MULTIVARIATE STATISTICS
  4.7 MLEs FOR MULTIVARIATE DISTRIBUTIONS
  4.8 HYPOTHESIS TESTING REVISITED: THE PROBLEMS WITH HIGH DIMENSIONS
  4.9 EXAMPLE OF LRT FOR THE MULTINOMIAL: GC CONTENT IN GENOMES
  EXERCISES
  REFERENCES AND FURTHER READING

Section II: Clustering

Chapter 5 ◾ Distance-Based Clustering
  5.1 MULTIVARIATE DISTANCES FOR CLUSTERING
  5.2 AGGLOMERATIVE CLUSTERING
  5.3 CLUSTERING DNA AND PROTEIN SEQUENCES
  5.4 IS THE CLUSTERING RIGHT?
  5.5 K-MEANS CLUSTERING
  5.6 SO WHAT IS LEARNING ANYWAY?
  5.7 CHOOSING THE NUMBER OF CLUSTERS FOR K-MEANS
  5.8 K-MEDOIDS AND EXEMPLAR-BASED CLUSTERING
  5.9 GRAPH-BASED CLUSTERING: “DISTANCES” VERSUS “INTERACTIONS” OR “CONNECTIONS”
  5.10 CLUSTERING AS DIMENSIONALITY REDUCTION
  EXERCISES
  REFERENCES AND FURTHER READING

Chapter 6 ◾ Mixture Models and Hidden Variables for Clustering and Beyond
  6.1 THE GAUSSIAN MIXTURE MODEL
  6.2 E-M UPDATES FOR THE MIXTURE OF GAUSSIANS
  6.3 DERIVING THE E-M ALGORITHM FOR THE MIXTURE OF GAUSSIANS
  6.4 GAUSSIAN MIXTURES IN PRACTICE AND THE CURSE OF DIMENSIONALITY
  6.5 CHOOSING THE NUMBER OF CLUSTERS USING THE AIC
  6.6 APPLICATIONS OF MIXTURE MODELS IN BIOINFORMATICS
  EXERCISES
  REFERENCES AND FURTHER READING

Section III: Regression

Chapter 7 ◾ Univariate Regression
  7.1 SIMPLE LINEAR REGRESSION AS A PROBABILISTIC MODEL
  7.2 DERIVING THE MLEs FOR LINEAR REGRESSION
  7.3 HYPOTHESIS TESTING IN LINEAR REGRESSION
  7.4 LEAST SQUARES INTERPRETATION OF LINEAR REGRESSION
  7.5 APPLICATION OF LINEAR REGRESSION TO eQTLs
  7.6 FROM HYPOTHESIS TESTING TO STATISTICAL MODELING: PREDICTING PROTEIN LEVEL BASED ON mRNA LEVEL
  7.7 REGRESSION IS NOT JUST “LINEAR”—POLYNOMIAL AND LOCAL REGRESSIONS
  7.8 GENERALIZED LINEAR MODELS
  EXERCISES
  REFERENCES AND FURTHER READING

Chapter 8 ◾ Multiple Regression
  8.1 PREDICTING Y USING MULTIPLE Xs
  8.2 HYPOTHESIS TESTING IN MULTIPLE DIMENSIONS: PARTIAL CORRELATIONS
  8.3 EXAMPLE OF A HIGH-DIMENSIONAL MULTIPLE REGRESSION: REGRESSING GENE EXPRESSION LEVELS ON TRANSCRIPTION FACTOR BINDING SITES
  8.4 AIC AND FEATURE SELECTION AND OVERFITTING IN MULTIPLE REGRESSION
  EXERCISES
  REFERENCES AND FURTHER READING

Chapter 9 ◾ Regularization in Multiple Regression and Beyond
  9.1 REGULARIZATION AND PENALIZED LIKELIHOOD
  9.2 DIFFERENCES BETWEEN THE EFFECTS OF L1 AND L2 PENALTIES ON CORRELATED FEATURES
  9.3 REGULARIZATION BEYOND SPARSITY: ENCOURAGING YOUR OWN MODEL STRUCTURE
  9.4 PENALIZED LIKELIHOOD AS MAXIMUM A POSTERIORI (MAP) ESTIMATION
  9.5 CHOOSING PRIOR DISTRIBUTIONS FOR PARAMETERS: HEAVY-TAILS IF YOU CAN
  EXERCISES
  REFERENCES AND FURTHER READING

Section IV: Classification

Chapter 10 ◾ Linear Classification
  10.1 CLASSIFICATION BOUNDARIES AND LINEAR CLASSIFICATION
  10.2 PROBABILISTIC CLASSIFICATION MODELS
  10.3 LOGISTIC REGRESSION
  10.4 LINEAR DISCRIMINANT ANALYSIS (LDA) AND THE LOG LIKELIHOOD RATIO
  10.5 GENERATIVE AND DISCRIMINATIVE MODELS FOR CLASSIFICATION
  10.6 NAÏVE BAYES: GENERATIVE MAP CLASSIFICATION
  10.7 TRAINING NAÏVE BAYES CLASSIFIERS
  10.8 NAÏVE BAYES AND DATA INTEGRATION
  EXERCISES
  REFERENCES AND FURTHER READING

Chapter 11 ◾ Nonlinear Classification
  11.1 TWO APPROACHES TO CHOOSE NONLINEAR BOUNDARIES: DATA-GUIDED AND MULTIPLE SIMPLE UNITS
  11.2 DISTANCE-BASED CLASSIFICATION WITH k-NEAREST NEIGHBORS
  11.3 SVMs FOR NONLINEAR CLASSIFICATION
  11.4 DECISION TREES
  11.5 RANDOM FORESTS AND ENSEMBLE CLASSIFIERS: THE WISDOM OF THE CROWD
  11.6 MULTICLASS CLASSIFICATION
  EXERCISES
  REFERENCES AND FURTHER READING

Chapter 12 ◾ Evaluating Classifiers
  12.1 CLASSIFICATION PERFORMANCE STATISTICS IN THE IDEAL CLASSIFICATION SETUP
  12.2 MEASURES OF CLASSIFICATION PERFORMANCE
  12.3 ROC CURVES AND PRECISION–RECALL PLOTS
  12.4 EVALUATING CLASSIFIERS WHEN YOU DON’T HAVE ENOUGH DATA
  12.5 LEAVE-ONE-OUT CROSS-VALIDATION
  12.6 BETTER CLASSIFICATION METHODS VERSUS BETTER FEATURES
  EXERCISES
  REFERENCES AND FURTHER READING

INDEX
Acknowledgments

First, I’d like to acknowledge the people who taught me statistics and computers. As with most of the people who will read this book, I took the required semester of statistics as an undergraduate. Little of what I learned proved useful for my scientific career. I came to statistics and computers late, although I learned some HTML during a high-school job at PCI Geomatics and tried (and failed) to write my first computer program as an undergraduate hoping to volunteer in John Reinitz’s lab (then at Mount Sinai in New York). I finally did manage to write some programs as an undergraduate summer student, thanks to Tim Gardner (then a grad student in Marcelo Magnasco’s lab), who first showed me PERL code.

Most of what I learned was during my PhD with Michael Eisen (who reintroduced cluster analysis to molecular biologists with his classic paper in 1998) and postdoctoral work with Richard Durbin (who introduced probabilistic models from computational linguistics to molecular biologists, leading to such universal resources as Pfam, and wrote a classic bioinformatics textbook, to which I am greatly indebted). During my PhD and postdoctoral work, I learned a lot of what is found in this book from Derek Chiang, Audrey Gasch, Justin Fay, Hunter Fraser, Dan Pollard, David Carter, and Avril Coughlan. I was also very fortunate to take courses with Terry Speed, Mark van der Laan, and Michael Jordan while at UC Berkeley and to have sat in on Geoff Hinton’s advanced machine learning lectures in Toronto in 2012 before he was whisked off to Google. Most recently, I’ve been learning from Quaid Morris, with whom I cotaught the course that inspired this book.

I’m also indebted to everyone who read this book and gave me feedback while I was working on it: Miranda Calderon, Drs. Gelila Tilahun, Muluye Liku, and Derek Chiang, my graduate students Mitchell Li Cheong Man, Gavin Douglas, and Alex Lu, as well as an anonymous reviewer.

Much of this book was written while I was on sabbatical in 2014–2015 at Michael Elowitz’s lab at Caltech, so I need to acknowledge Michael’s generosity in hosting me and also the University of Toronto for continuing the tradition of academic leave. Michael and Joe Markson introduced me to the ImmGen and single-cell sequencing datasets that I used for many of the examples in this book.

Finally, to actually make this book (and the graduate course that inspired it) possible, I took advantage of countless pieces of freely available software, R packages, Octave, PERL, bioinformatics databases, Wikipedia articles, open-access publications, and supplementary data sets, many of which I have likely neglected to cite. I hereby acknowledge all of the people who make this material available and enable the progress of pedagogy.
Section I: Overview

The first four chapters give necessary background. The first chapter is background to the book: what it covers and why I wrote it. The next three chapters are background material needed for the statistical modeling and machine learning methods covered in the later chapters. However, although I’ve presented that material as background, I believe that the review of modeling and statistics (in Chapters 2, 3, and 4) might be valuable to readers, whether or not they intend to go on to the later chapters.
Chapter 1

Across Statistical Modeling and Machine Learning on a Shoestring

1.1 ABOUT THIS BOOK

This is a guidebook for biologists about statistics and computers. Much like a travel guide, it’s aimed to help intelligent travelers from one place (biology) find their way around a fascinating foreign place (computers and statistics). Like a good travel guide, this book should teach you enough to have an interesting conversation with the locals and to bring back some useful souvenirs and maybe some cool clothes that you can’t find at home. I’ve tried my best to make it fun and interesting to read and put in a few nice pictures to get you excited and help you recognize things when you see them.

However, a guidebook is no substitute for having lived in another place—although I can tell you about some of the best foods to try and buildings to visit, these will necessarily only be the highlights. Furthermore, as visitors we’ll have to cover some things quite superficially—we can learn enough words to say yes, no, please, and thank you, but we’ll never master the language. Maybe after reading the guidebook, some intrepid readers will decide to take a break from the familiar territory of molecular biology for a while and spend a few years in the land of computers and statistics.

Also, this brings up an important disclaimer: a guidebook is not an encyclopedia or a dictionary. This book doesn’t have a clear section heading for every topic, useful statistical test, or formula. This means that it won’t always be easy to use it for reference. However, because online resources have excellent information about most of the topics covered here, readers are encouraged to look things up as they go along.
1.2 WHAT WILL THIS BOOK COVER?

This book aims to give advanced students in molecular biology enough statistical and computational background to understand (and perform) three of the major tasks of modern machine learning that are widely used in bioinformatics and genomics applications:

1. Clustering
2. Regression
3. Classification

1.2.1 Clustering

Given a set of data, clustering aims to divide the individual observations into groups or clusters. This is a very common problem in several areas of modern molecular biology. In the genomics era, clustering has been applied to genome-wide expression data to find groups of genes with similar expression patterns; it often turns out that these genes do work together (in pathways or networks) in the cell and therefore share common functions. Finding groups of similar genes using molecular interaction data can implicate pathways or help lead to hypotheses about gene function. Clustering has therefore been applied to all types of gene-level molecular interaction data, such as genetic and physical protein interactions. Proteins and genes that share sequence similarity can also be grouped together to delineate “families” that are likely to share biochemical functions. At the other end of the spectrum, finding groups of similar patients (or disease samples) based on molecular profiles is another major current application of clustering.

Historically, biologists wanted to find groups of organisms that represented species. Given a set of measurements of biological traits of individuals, clustering can divide them into groups with some degree of objectivity. In the early days of the molecular era, evolutionary geneticists obtained sequences of DNA and proteins wanting to find patterns that could relate the molecular data to species relationships. Today, inference of population structure by clustering individuals into subpopulations (based on genome-scale genotype data) is a major application of clustering in evolutionary genetics.

Clustering is a classic topic in machine learning because the nature of the groups and the number of groups are unknown. The computer has to “learn” these from the data. Endless clustering methods have been considered, and the bioinformatics literature has contributed a very large number of them.
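To make the idea concrete, here is a toy k-means sketch in Python (k-means itself is covered in Chapter 5). All gene names and expression values below are invented for illustration, and the centroid initialization is deliberately naive.

```python
# Toy k-means on made-up two-condition expression profiles.
# Gene names and values are invented for illustration only.

def kmeans(points, k, iters=20):
    # Naive initialization: the first k points become the starting centroids.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: each centroid moves to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, clusters

genes = {
    "geneA": (0.9, 1.1), "geneB": (1.0, 0.8),      # a "high" group
    "geneC": (-1.2, -0.9), "geneD": (-0.8, -1.1),  # a "low" group
}
centroids, clusters = kmeans(list(genes.values()), k=2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```

Real applications would use many more dimensions (conditions) and a more careful initialization, but the two-step loop (assign each observation to its nearest centroid, then recompute each centroid) is the whole algorithm.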
1.2.2 Regression

Regression aims to model the statistical relationship between two or more variables. For example, regression is a powerful way to test for and model the relationship between genotype and phenotype. Contemporary data analysis methods for genome-wide association studies (GWAS) and quantitative trait loci for gene expression (eQTLs) rely on advanced forms of regression (known as generalized linear mixed models) that can account for complex structure in the data due to the relatedness of individuals and technical biases. Regression methods are used extensively in other areas of biostatistics, particularly in statistical genetics, and are often used in bioinformatics as a means to integrate data for predictive models.

In addition to its wide use in biological data analysis, I believe regression is a key area to focus on in this book for two pedagogical reasons. First, regression deals with the inference of relationships between two or more types of observations, which is a key conceptual issue in all scientific data analysis applications, particularly when one observation can be thought of as predictive or causative of the other. Because classical regression techniques yield straightforward statistical hypothesis tests, regression allows us to connect one type of data to another and can be used to compare large datasets of different types. Second, regression is an area where the evolution from classical statistics to machine learning methods can be illustrated most easily, through the development of penalized likelihood methods. Thus, studying regression can help students understand developments in other areas of machine learning (through analogy with regression), without knowing all the technical details.
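As a minimal illustration of the genotype-to-phenotype idea, here is a simple least-squares fit in Python (simple linear regression is developed properly in Chapter 7). The "genotype dosage" values and phenotype measurements are made up, and this sketch ignores the mixed-model machinery mentioned above.

```python
# Toy least-squares regression: does an invented "genotype dosage" x
# predict an invented quantitative phenotype y? Closed-form OLS fit.
xs = [0, 0, 1, 1, 2, 2]                # copies of a hypothetical allele
ys = [1.0, 1.2, 2.1, 1.9, 3.0, 2.8]    # phenotype measurements (made up)

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
# slope = cov(x, y) / var(x); the intercept comes from forcing the line
# through the point of means (mean_x, mean_y)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
print(round(slope, 3), round(intercept, 3))  # 0.9 1.1
```

On these toy numbers each extra allele copy adds about 0.9 phenotype units; the hypothesis tests and probabilistic interpretation built on top of this fit are what Chapter 7 covers.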
1.2.3 Classification

Classification is the task of assigning observations to previously defined classes. It underlies many of the mainstream successes of machine learning: spam filters, face recognition in photos, and the Shazam app. Classification techniques also form the basis for many widely used bioinformatics tools and methodologies. Typical applications include prediction of gene function based on protein sequence or genome-scale experimental data, and identification of disease subtypes and biomarkers. Historically, statistical classification techniques were used to analyze the power of medical tests: given the outcome of a blood test, how accurately could a physician diagnose a disease?

Increasingly, sophisticated machine learning techniques (such as neural networks, random forests, and support vector machines or SVMs) are used in popular software for scientific data analysis, and it is essential that modern molecular biologists understand the concepts underlying them. Because of the wide applicability of classification to everyday problems in the information technology industry, it has become a large and rapidly developing area of machine learning. Biomedical applications of these methodological developments often lead to important advances in computational biology. However, before applying these methods, it’s critical to understand the specific issues arising in genome-scale analysis, particularly with respect to evaluation of classification performance.
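To make the task concrete, here is a toy nearest-centroid classifier in Python, a simple relative of the distance-based and linear classifiers covered in Chapters 10 and 11. The class labels, "marker" measurements, and test samples are all invented.

```python
# Toy nearest-centroid classifier. Class labels, "marker" measurements,
# and test samples are all invented for illustration.
train = {
    "healthy": [(1.0, 0.9), (0.8, 1.1), (1.2, 1.0)],
    "disease": [(3.0, 3.2), (2.8, 3.1), (3.1, 2.9)],
}

# Summarize each class by its mean profile (centroid).
centroids = {label: tuple(sum(dim) / len(pts) for dim in zip(*pts))
             for label, pts in train.items()}

def classify(sample):
    # Predict the class whose centroid is nearest (squared Euclidean distance).
    return min(centroids,
               key=lambda label: sum((a - b) ** 2
                                     for a, b in zip(sample, centroids[label])))

print(classify((1.1, 1.0)), classify((2.9, 3.0)))  # healthy disease
```

Unlike clustering, the classes here are fixed in advance and the labeled training data define them; evaluating how well such predictions generalize is the subject of Chapter 12.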
1.3 ORGANIZATION OF THIS BOOK

Chapters 2, 3, and 4 review and introduce the mathematical formalism, probability theory, and statistics that are essential to understanding the modeling and machine learning approaches used in contemporary molecular biology. Finally, in Chapters 5 and 6, the first real “machine learning” and nontrivial probabilistic models are introduced. It might sound a bit daunting that three chapters are needed to give the necessary background, but this is the reality of data-rich biology. I have done my best to keep it simple, use clear notation, and avoid tedious calculations. The reality is that analyzing molecular biology data is getting more and more complicated.

You probably already noticed that the book is organized by statistical models and machine learning methods and not by biological examples or experimental data types. Although this makes it hard to look up a statistical method to use on your data, I’ve organized it this way because I want to highlight the generality of the data analysis methods. For example, clustering can be applied to diverse data, from DNA sequences to brain images, and can be used to answer questions about protein complexes and cancer subtypes. Although I might not cover your data type or biological question specifically, once you understand the method, I hope it will be relatively straightforward to apply it to your data.

Nevertheless, I understand that some readers will want to know that the book covers their type of data, so I’ve compiled a list of the molecular biology examples that I used to illustrate methods.
LIST OF MOLECULAR BIOLOGY EXAMPLES

1. Chapter 2—Single-cell RNA-seq data defies standard models
2. Chapter 2—Comparing RNA expression between cell types for one or two genes
3. Chapter 2—Analyzing the number of kinase substrates in a list of genes
4. Chapter 3—Are the genes that came out of a genetic screen involved in angiogenesis?
5. Chapter 3—How many genes have different expression levels in T cells?
6. Chapter 3—Identifying eQTLs
7. Chapter 4—Correlation between expression levels of CD8 antigen alpha and beta chains
8. Chapter 4—GC content differences on human sex chromosomes
9. Chapter 5—Groups of genes and cell types in the immune system
10. Chapter 5—Building a tree of DNA or protein sequences
11. Chapter 5—Immune cells expressing CD4, CD8, or both
12. Chapter 5—Identifying orthologs with OrthoMCL
13. Chapter 5—Protein complexes in protein interaction networks
14. Chapter 6—Single-cell RNA-seq revisited
15. Chapter 6—Motif finding with MEME
16. Chapter 6—Estimating transcript abundance with Cufflinks
17. Chapter 6—Integrating DNA sequence motifs and gene expression data
18. Chapter 7—Identifying eQTLs revisited
19. Chapter 7—Does mRNA abundance explain protein abundance?
20. Chapter 8—SAG1 expression is controlled by multiple loci
21. Chapter 8—mRNA abundance, codon bias, and the rate of protein evolution
22. Chapter 8—Predicting gene expression from transcription factor binding motifs
23. Chapter 8—Motif finding with REDUCE
24. Chapter 9—Modeling a gene expression time course
25. Chapter 9—Inferring population structure with STRUCTURE
26. Chapter 10—Are mutations harmful or benign?
27. Chapter 10—Finding a gene expression signature for T cells
28. Chapter 10—Identifying motif matches in DNA sequences
29. Chapter 11—Predicting protein folds
30. Chapter 12—The BLAST homology detection problem
31. Chapter 12—LPS stimulation in single-cell RNA-seq data
1.4 WHY ARE THERE MATHEMATICAL CALCULATIONS IN THE BOOK?

Although most molecular biologists don’t (and don’t want to) do mathematical derivations of the type that I present in this book, I have included quite a few of these calculations in the early chapters. There are several reasons for this. First of all, the machine learning methods presented here are mostly based on probabilistic models. This means that the methods described here really are mathematical things, and I don’t want to “hide” the mathematical “guts” of these methods. One purpose of this book is to empower biologists to unpack the algorithms and mathematical notations that are buried in the methods sections of most of the sophisticated primary research papers in the top journals today. Another purpose is that I hope, after seeing the worked example derivations for the classic models in this book, some ambitious students will take the plunge and learn to derive their own probabilistic machine learning models. This is another empowering skill, as it frees students from the confines of the prepackaged software that everyone else is using. Finally, there are students out there for whom doing some calculus and linear algebra will actually be fun! I hope these students enjoy the calculations here. Although calculus and basic linear algebra are requirements for medical school and graduate school in the life sciences, students rarely get to use them.

I’m aware that the mathematical parts of this book will be unfamiliar to many biology students. I have tried to include very basic introductory material to help students feel confident interpreting and attacking equations. This brings me to an important point: although I don’t assume any prior knowledge of statistics, I do assume that readers are familiar with multivariate calculus and know something about linear algebra (although I do review the latter briefly). But don’t worry if you are a little rusty and don’t remember, for example, what a partial derivative is; a quick visit to Wikipedia might be all you need.