

Statistical Modeling and
Machine Learning for
Molecular Biology


CHAPMAN & HALL/CRC
Mathematical and Computational Biology Series
Aims and scope:
This series aims to capture new developments and summarize what is known
over the entire spectrum of mathematical and computational biology and
medicine. It seeks to encourage the integration of mathematical, statistical,
and computational methods into biology by publishing a broad range of
textbooks, reference works, and handbooks. The titles included in the
series are meant to appeal to students, researchers, and professionals in the
mathematical, statistical and computational sciences, fundamental biology
and bioengineering, as well as interdisciplinary researchers involved in the
field. The inclusion of concrete examples and applications, and programming
techniques and examples, is highly encouraged.

Series Editors
N. F. Britton
Department of Mathematical Sciences
University of Bath
Xihong Lin
Department of Biostatistics
Harvard University
Nicola Mulder
University of Cape Town
South Africa
Maria Victoria Schneider
European Bioinformatics Institute
Mona Singh
Department of Computer Science
Princeton University
Anna Tramontano
Department of Physics
University of Rome La Sapienza

Proposals for the series should be submitted to one of the series editors above or directly to:
CRC Press, Taylor & Francis Group
3 Park Square, Milton Park
Abingdon, Oxfordshire OX14 4RN
UK


Published Titles
An Introduction to Systems Biology:
Design Principles of Biological Circuits
Uri Alon
Glycome Informatics: Methods and
Applications
Kiyoko F. Aoki-Kinoshita
Computational Systems Biology of
Cancer
Emmanuel Barillot, Laurence Calzone,
Philippe Hupé, Jean-Philippe Vert, and
Andrei Zinovyev
Python for Bioinformatics
Sebastian Bassi
Quantitative Biology: From Molecular to
Cellular Systems
Sebastian Bassi
Methods in Medical Informatics:
Fundamentals of Healthcare
Programming in Perl, Python, and Ruby
Jules J. Berman
Computational Biology: A Statistical
Mechanics Perspective
Ralf Blossey
Game-Theoretical Models in Biology
Mark Broom and Jan Rychtář
Computational and Visualization
Techniques for Structural Bioinformatics
Using Chimera
Forbes J. Burkowski
Structural Bioinformatics: An Algorithmic
Approach
Forbes J. Burkowski

Normal Mode Analysis: Theory and
Applications to Biological and Chemical
Systems
Qiang Cui and Ivet Bahar
Kinetic Modelling in Systems Biology
Oleg Demin and Igor Goryanin
Data Analysis Tools for DNA Microarrays
Sorin Drăghici
Statistics and Data Analysis for
Microarrays Using R and Bioconductor,
Second Edition
Sorin Drăghici
Computational Neuroscience:
A Comprehensive Approach
Jianfeng Feng
Biological Sequence Analysis Using
the SeqAn C++ Library
Andreas Gogol-Döring and Knut Reinert
Gene Expression Studies Using
Affymetrix Microarrays
Hinrich Göhlmann and Willem Talloen
Handbook of Hidden Markov Models
in Bioinformatics
Martin Gollery
Meta-analysis and Combining
Information in Genetics and Genomics
Rudy Guerra and Darlene R. Goldstein
Differential Equations and Mathematical
Biology, Second Edition
D.S. Jones, M.J. Plank, and B.D. Sleeman
Knowledge Discovery in Proteomics
Igor Jurisica and Dennis Wigle

Spatial Ecology
Stephen Cantrell, Chris Cosner, and
Shigui Ruan

Introduction to Proteins: Structure,
Function, and Motion
Amit Kessel and Nir Ben-Tal


Cell Mechanics: From Single Scale-Based Models to Multiscale Modeling
Arnaud Chauvière, Luigi Preziosi,
and Claude Verdier

RNA-seq Data Analysis: A Practical
Approach
Eija Korpelainen, Jarno Tuimala,
Panu Somervuo, Mikael Huss, and Garry Wong

Bayesian Phylogenetics: Methods,
Algorithms, and Applications
Ming-Hui Chen, Lynn Kuo, and Paul O. Lewis

Introduction to Mathematical Oncology
Yang Kuang, John D. Nagy, and
Steffen E. Eikenberry

Statistical Methods for QTL Mapping
Zehua Chen

Biological Computation
Ehud Lamm and Ron Unger


Published Titles (continued)
Optimal Control Applied to Biological
Models
Suzanne Lenhart and John T. Workman


Genome Annotation
Jung Soh, Paul M.K. Gordon, and
Christoph W. Sensen

Clustering in Bioinformatics and Drug
Discovery
John D. MacCuish and Norah E. MacCuish

Niche Modeling: Predictions from
Statistical Distributions
David Stockwell

Spatiotemporal Patterns in Ecology
and Epidemiology: Theory, Models,
and Simulation
Horst Malchow, Sergei V. Petrovskii, and
Ezio Venturino

Algorithms in Bioinformatics: A Practical
Introduction
Wing-Kin Sung

Stochastic Dynamics for Systems
Biology
Christian Mazza and Michel Benaïm

The Ten Most Wanted Solutions in
Protein Bioinformatics
Anna Tramontano


Statistical Modeling and Machine
Learning for Molecular Biology
Alan M. Moses

Combinatorial Pattern Matching
Algorithms in Computational Biology
Using Perl and R
Gabriel Valiente

Engineering Genetic Circuits
Chris J. Myers
Pattern Discovery in Bioinformatics:
Theory & Algorithms
Laxmi Parida
Exactly Solvable Models of Biological
Invasion
Sergei V. Petrovskii and Bai-Lian Li
Computational Hydrodynamics of
Capsules and Biological Cells
C. Pozrikidis
Modeling and Simulation of Capsules
and Biological Cells
C. Pozrikidis
Cancer Modelling and Simulation
Luigi Preziosi
Introduction to Bio-Ontologies
Peter N. Robinson and Sebastian Bauer
Dynamics of Biological Systems
Michael Small


Introduction to Bioinformatics
Anna Tramontano

Managing Your Biological Data with
Python
Allegra Via, Kristian Rother, and
Anna Tramontano
Cancer Systems Biology
Edwin Wang
Stochastic Modelling for Systems
Biology, Second Edition
Darren J. Wilkinson
Big Data Analysis for Bioinformatics and
Biomedical Discoveries
Shui Qing Ye
Bioinformatics: A Practical Approach
Shui Qing Ye
Introduction to Computational
Proteomics
Golan Yona


Statistical Modeling and
Machine Learning for
Molecular Biology

Alan M. Moses
University of Toronto, Canada



CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20160930
International Standard Book Number-13: 978-1-4822-5859-2 (Paperback)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Names: Moses, Alan M., author.
Title: Statistical modeling and machine learning for molecular biology / Alan M.
Moses.

Description: Boca Raton : CRC Press, 2016. | Includes bibliographical
references and index.
Identifiers: LCCN 2016028358| ISBN 9781482258592 (hardback : alk. paper) |
ISBN 9781482258615 (e-book) | ISBN 9781482258622 (e-book) | ISBN
9781482258608 (e-book)
Subjects: LCSH: Molecular biology–Statistical methods. | Molecular
biology–Data processing.
Classification: LCC QH506 .M74 2016 | DDC 572.8–dc23
LC record available at https://lccn.loc.gov/2016028358

Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at
http://www.crcpress.com


For my parents



Contents

Acknowledgments, xv

Section I  Overview

Chapter 1  Across Statistical Modeling and Machine Learning on a Shoestring  3
1.1  ABOUT THIS BOOK  3
1.2  WHAT WILL THIS BOOK COVER?  4
1.2.1  Clustering  4
1.2.2  Regression  5
1.2.3  Classification  6
1.3  ORGANIZATION OF THIS BOOK  6
1.4  WHY ARE THERE MATHEMATICAL CALCULATIONS IN THE BOOK?  8
1.5  WHAT WON'T THIS BOOK COVER?  11
1.6  WHY IS THIS A BOOK?  12
REFERENCES AND FURTHER READING  14

Chapter 2  Statistical Modeling  15
2.1  WHAT IS STATISTICAL MODELING?  15
2.2  PROBABILITY DISTRIBUTIONS ARE THE MODELS  18
2.3  AXIOMS OF PROBABILITY AND THEIR CONSEQUENCES: "RULES OF PROBABILITY"  23
2.4  HYPOTHESIS TESTING: WHAT YOU PROBABLY ALREADY KNOW ABOUT STATISTICS  26
2.5  TESTS WITH FEWER ASSUMPTIONS  30
2.5.1  Wilcoxon Rank-Sum Test, Also Known As the Mann–Whitney U Test (or Simply the WMW Test)  30
2.5.2  Kolmogorov–Smirnov Test (KS-Test)  31
2.6  CENTRAL LIMIT THEOREM  33
2.7  EXACT TESTS AND GENE SET ENRICHMENT ANALYSIS  33
2.8  PERMUTATION TESTS  36
2.9  SOME POPULAR DISTRIBUTIONS  38
2.9.1  The Uniform Distribution  38
2.9.2  The T-Distribution  39
2.9.3  The Exponential Distribution  39
2.9.4  The Chi-Squared Distribution  39
2.9.5  The Poisson Distribution  39
2.9.6  The Bernoulli Distribution  40
2.9.7  The Binomial Distribution  40
EXERCISES  40
REFERENCES AND FURTHER READING  41

Chapter 3  Multiple Testing  43
3.1  THE BONFERRONI CORRECTION AND GENE SET ENRICHMENT ANALYSIS  43
3.2  MULTIPLE TESTING IN DIFFERENTIAL EXPRESSION ANALYSIS  46
3.3  FALSE DISCOVERY RATE  48
3.4  eQTLs: A VERY DIFFICULT MULTIPLE-TESTING PROBLEM  49
EXERCISES  51
REFERENCES AND FURTHER READING  52

Chapter 4  Parameter Estimation and Multivariate Statistics  53
4.1  FITTING A MODEL TO DATA: OBJECTIVE FUNCTIONS AND PARAMETER ESTIMATION  53
4.2  MAXIMUM LIKELIHOOD ESTIMATION  54
4.3  LIKELIHOOD FOR GAUSSIAN DATA  55
4.4  HOW TO MAXIMIZE THE LIKELIHOOD ANALYTICALLY  56
4.5  OTHER OBJECTIVE FUNCTIONS  60
4.6  MULTIVARIATE STATISTICS  64
4.7  MLEs FOR MULTIVARIATE DISTRIBUTIONS  69
4.8  HYPOTHESIS TESTING REVISITED: THE PROBLEMS WITH HIGH DIMENSIONS  77
4.9  EXAMPLE OF LRT FOR THE MULTINOMIAL: GC CONTENT IN GENOMES  80
EXERCISES  83
REFERENCES AND FURTHER READING  83

Section II  Clustering

Chapter 5  Distance-Based Clustering  87
5.1  MULTIVARIATE DISTANCES FOR CLUSTERING  87
5.2  AGGLOMERATIVE CLUSTERING  91
5.3  CLUSTERING DNA AND PROTEIN SEQUENCES  95
5.4  IS THE CLUSTERING RIGHT?  98
5.5  K-MEANS CLUSTERING  100
5.6  SO WHAT IS LEARNING ANYWAY?  106
5.7  CHOOSING THE NUMBER OF CLUSTERS FOR K-MEANS  107
5.8  K-MEDOIDS AND EXEMPLAR-BASED CLUSTERING  109
5.9  GRAPH-BASED CLUSTERING: "DISTANCES" VERSUS "INTERACTIONS" OR "CONNECTIONS"  110
5.10  CLUSTERING AS DIMENSIONALITY REDUCTION  113
EXERCISES  113
REFERENCES AND FURTHER READING  115

Chapter 6  Mixture Models and Hidden Variables for Clustering and Beyond  117
6.1  THE GAUSSIAN MIXTURE MODEL  118
6.2  E-M UPDATES FOR THE MIXTURE OF GAUSSIANS  123
6.3  DERIVING THE E-M ALGORITHM FOR THE MIXTURE OF GAUSSIANS  127
6.4  GAUSSIAN MIXTURES IN PRACTICE AND THE CURSE OF DIMENSIONALITY  131
6.5  CHOOSING THE NUMBER OF CLUSTERS USING THE AIC  131
6.6  APPLICATIONS OF MIXTURE MODELS IN BIOINFORMATICS  133
EXERCISES  141
REFERENCES AND FURTHER READING  142

Section III  Regression

Chapter 7  Univariate Regression  145
7.1  SIMPLE LINEAR REGRESSION AS A PROBABILISTIC MODEL  145
7.2  DERIVING THE MLEs FOR LINEAR REGRESSION  146
7.3  HYPOTHESIS TESTING IN LINEAR REGRESSION  149
7.4  LEAST SQUARES INTERPRETATION OF LINEAR REGRESSION  154
7.5  APPLICATION OF LINEAR REGRESSION TO eQTLs  155
7.6  FROM HYPOTHESIS TESTING TO STATISTICAL MODELING: PREDICTING PROTEIN LEVEL BASED ON mRNA LEVEL  157
7.7  REGRESSION IS NOT JUST "LINEAR"—POLYNOMIAL AND LOCAL REGRESSIONS  161
7.8  GENERALIZED LINEAR MODELS  165
EXERCISES  167
REFERENCES AND FURTHER READING  167

Chapter 8  Multiple Regression  169
8.1  PREDICTING Y USING MULTIPLE Xs  169
8.2  HYPOTHESIS TESTING IN MULTIPLE DIMENSIONS: PARTIAL CORRELATIONS  171
8.3  EXAMPLE OF A HIGH-DIMENSIONAL MULTIPLE REGRESSION: REGRESSING GENE EXPRESSION LEVELS ON TRANSCRIPTION FACTOR BINDING SITES  174
8.4  AIC AND FEATURE SELECTION AND OVERFITTING IN MULTIPLE REGRESSION  179
EXERCISES  182
REFERENCES AND FURTHER READING  183

Chapter 9  Regularization in Multiple Regression and Beyond  185
9.1  REGULARIZATION AND PENALIZED LIKELIHOOD  186
9.2  DIFFERENCES BETWEEN THE EFFECTS OF L1 AND L2 PENALTIES ON CORRELATED FEATURES  189
9.3  REGULARIZATION BEYOND SPARSITY: ENCOURAGING YOUR OWN MODEL STRUCTURE  190
9.4  PENALIZED LIKELIHOOD AS MAXIMUM A POSTERIORI (MAP) ESTIMATION  192
9.5  CHOOSING PRIOR DISTRIBUTIONS FOR PARAMETERS: HEAVY-TAILS IF YOU CAN  193
EXERCISES  197
REFERENCES AND FURTHER READING  199

Section IV  Classification

Chapter 10  Linear Classification  203
10.1  CLASSIFICATION BOUNDARIES AND LINEAR CLASSIFICATION  205
10.2  PROBABILISTIC CLASSIFICATION MODELS  206
10.3  LOGISTIC REGRESSION  208
10.4  LINEAR DISCRIMINANT ANALYSIS (LDA) AND THE LOG LIKELIHOOD RATIO  210
10.5  GENERATIVE AND DISCRIMINATIVE MODELS FOR CLASSIFICATION  214
10.6  NAÏVE BAYES: GENERATIVE MAP CLASSIFICATION  215
10.7  TRAINING NAÏVE BAYES CLASSIFIERS  221
10.8  NAÏVE BAYES AND DATA INTEGRATION  222
EXERCISES  223
REFERENCES AND FURTHER READING  223

Chapter 11  Nonlinear Classification  225
11.1  TWO APPROACHES TO CHOOSE NONLINEAR BOUNDARIES: DATA-GUIDED AND MULTIPLE SIMPLE UNITS  226
11.2  DISTANCE-BASED CLASSIFICATION WITH k-NEAREST NEIGHBORS  228
11.3  SVMs FOR NONLINEAR CLASSIFICATION  230
11.4  DECISION TREES  234
11.5  RANDOM FORESTS AND ENSEMBLE CLASSIFIERS: THE WISDOM OF THE CROWD  236
11.6  MULTICLASS CLASSIFICATION  237
EXERCISES  238
REFERENCES AND FURTHER READING  239

Chapter 12  Evaluating Classifiers  241
12.1  CLASSIFICATION PERFORMANCE STATISTICS IN THE IDEAL CLASSIFICATION SETUP  241
12.2  MEASURES OF CLASSIFICATION PERFORMANCE  242
12.3  ROC CURVES AND PRECISION–RECALL PLOTS  245
12.4  EVALUATING CLASSIFIERS WHEN YOU DON'T HAVE ENOUGH DATA  248
12.5  LEAVE-ONE-OUT CROSS-VALIDATION  251
12.6  BETTER CLASSIFICATION METHODS VERSUS BETTER FEATURES  253
EXERCISES  254
REFERENCES AND FURTHER READING  255

INDEX, 257


Acknowledgments
First, I’d like to acknowledge the people who taught me statistics and computers. As with most of the people that will read this book, I took the
required semester of statistics as an undergraduate. Little of what I learned
proved useful for my scientific career. I came to statistics and computers late, although I learned some html during a high-school job at PCI
Geomatics and tried (and failed) to write my first computer program as
an undergraduate hoping to volunteer in John Reinitz’s lab (then at Mount
Sinai in New York). I finally did manage to write some programs as an
undergraduate summer student, thanks to Tim Gardner (then a grad student in Marcelo Magnasco’s lab), who first showed me PERL codes.
Most of what I learned was during my PhD with Michael Eisen (who
reintroduced cluster analysis to molecular biologists with his classic paper
in 1998) and postdoctoral work with Richard Durbin (who introduced
probabilistic models from computational linguistics to molecular biologists, leading to such universal resources as Pfam, and wrote a classic bioinformatics textbook, to which I am greatly indebted). During my PhD
and postdoctoral work, I learned a lot of what is found in this book from
Derek Chiang, Audrey Gasch, Justin Fay, Hunter Fraser, Dan Pollard,
David Carter, and Avril Coughlan. I was also very fortunate to take courses
with Terry Speed, Mark van der Laan, and Michael Jordan while at UC
Berkeley and to have sat in on Geoff Hinton’s advanced machine learning lectures in Toronto in 2012 before he was whisked off to Google. Most
recently, I’ve been learning from Quaid Morris, with whom I cotaught the
course that inspired this book.
I’m also indebted to everyone who read this book and gave me feedback
while I was working on it: Miranda Calderon, Drs. Gelila Tilahun, Muluye
Liku, and Derek Chiang, my graduate students Mitchell Li Cheong Man,
Gavin Douglas, and Alex Lu, as well as an anonymous reviewer.

Much of this book was written while I was on sabbatical in 2014–2015
at Michael Elowitz’s lab at Caltech, so I need to acknowledge Michael’s
generosity to host me and also the University of Toronto for continuing
the tradition of academic leave. Michael and Joe Markson introduced me
to the ImmGen and single-cell sequence datasets that I used for many of
the examples in this book.
Finally, to actually make this book (and the graduate course that
inspired it) possible, I took advantage of countless freely available software,
R packages, Octave, PERL, bioinformatics databases, Wikipedia articles
and open-access publications, and supplementary data sets, many of which
I have likely neglected to cite. I hereby acknowledge all of the people who
make this material available and enable the progress of pedagogy.


Section I: Overview

The first four chapters give necessary background. The first chapter is
background to the book: what it covers and why I wrote it. The next three
chapters are background material needed for the statistical modeling
and machine learning methods covered in the later chapters. However,
although I’ve presented that material as background, I believe that the
review of modeling and statistics (in Chapters 2, 3 and 4) might be valuable to readers, whether or not they intend to go on to the later chapters.





Chapter 1

Across Statistical Modeling and Machine Learning on a Shoestring

1.1 ABOUT THIS BOOK
This is a guidebook for biologists about statistics and computers. Much
like a travel guide, it’s aimed to help intelligent travelers from one place
(biology) find their way around a fascinating foreign place (computers and statistics). Like a good travel guide, this book should teach you
enough to have an interesting conversation with the locals and to bring
back some useful souvenirs and maybe some cool clothes that you can’t
find at home. I’ve tried my best to make it fun and interesting to read and
put in a few nice pictures to get you excited and help you recognize things
when you see them.
However, a guidebook is no substitute for having lived in another place—
although I can tell you about some of the best foods to try and buildings to visit, these will necessarily only be the highlights. Furthermore,
as visitors we’ll have to cover some things quite superficially—we can
learn enough words to say yes, no, please, and thank you, but we’ll never
master the language. Maybe after reading the guidebook, some intrepid
readers will decide to take a break from the familiar territory of molecular biology for a while and spend a few years in the land of computers
and statistics.
Also, this brings up an important disclaimer: A guidebook is not an
encyclopedia or a dictionary. This book doesn’t have a clear section heading for every topic, useful statistical test, or formula. This means that
it won’t always be easy to use it for reference. However, because online
resources have excellent information about most of the topics covered
here, readers are encouraged to look things up as they go along.

1.2 WHAT WILL THIS BOOK COVER?
This book aims to give advanced students in molecular biology enough
statistical and computational background to understand (and perform)
three of the major tasks of modern machine learning that are widely used
in bioinformatics and genomics applications:
1. Clustering
2. Regression
3. Classification
1.2.1 Clustering
Given a set of data, clustering aims to divide the individual observations
into groups or clusters. This is a very common problem in several areas
of modern molecular biology. In the genomics era, clustering has been
applied to genome-wide expression data to find groups of genes with similar expression patterns; it often turns out that these genes do work together
(in pathways or networks) in the cell and therefore share common functions. Finding groups of similar genes using molecular interaction data
can implicate pathways or help lead to hypotheses about gene function.
Clustering has therefore been applied to all types of gene-level molecular
interaction data, such as genetic and physical protein interactions. Proteins
and genes that share sequence similarity can also be grouped together to
delineate “families” that are likely to share biochemical functions. At the
other end of the spectrum, finding groups of similar patients (or disease
samples) based on molecular profiles is another major current application
of clustering.

Historically, biologists wanted to find groups of organisms that represented species. Given a set of measurements of biological traits of
individuals, clustering can divide them into groups with some degree of
objectivity. In the early days of the molecular era, evolutionary geneticists obtained sequences of DNA and proteins wanting to find patterns
that could relate the molecular data to species relationships. Today, inference of population structure by clustering individuals into subpopulations
(based on genome-scale genotype data) is a major application of clustering
in evolutionary genetics.
Clustering is a classic topic in machine learning because the nature of
the groups and the number of groups are unknown. The computer has
to “learn” these from the data. There are endless numbers of clustering
methods that have been considered, and the bioinformatics literature has
contributed a very large number of them.
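
To make this a little more concrete, here is a minimal sketch of what a clustering analysis can look like in code. Everything in it is an illustrative assumption rather than a recipe from this book: the "expression matrix" is just simulated numbers standing in for genes measured across conditions, the number of clusters is picked arbitrarily, and the choice of Python with scikit-learn is simply one convenient option. The k-means algorithm itself is covered properly in Chapter 5.

```python
# Minimal illustrative sketch (simulated data, arbitrary choices): cluster
# "gene expression" profiles into three groups with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 300 hypothetical genes measured in 10 conditions, drawn from 3 groups
# that differ in their mean expression pattern
group_means = rng.normal(0, 2, size=(3, 10))
expression = np.vstack(
    [rng.normal(group_means[k], 1, size=(100, 10)) for k in range(3)]
)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(expression)  # one cluster label per gene
print(np.bincount(labels))               # how many genes landed in each cluster
```

Real analyses differ mainly in how the data are normalized, how distances between observations are measured, and how the number of clusters is chosen—exactly the questions taken up in Chapters 5 and 6.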
1.2.2 Regression
Regression aims to model the statistical relationship between one or more
variables. For example, regression is a powerful way to test for and model
the relationship between genotype and phenotype. Contemporary data
analysis methods for genome-wide association studies (GWAS) and quantitative trait loci for gene expression (eQTLs) rely on advanced forms of
regression (known as generalized linear mixed models) that can account
for complex structure in the data due to the relatedness of individuals and
technical biases. Regression methods are used extensively in other areas
of biostatistics, particularly in statistical genetics, and are often used in
bioinformatics as a means to integrate data for predictive models.
In addition to its wide use in biological data analysis, I believe regression is a key area to focus on in this book for two pedagogical reasons.
First, regression deals with the inference of relationships between two
or more types of observations, which is a key conceptual issue in all
scientific data analysis applications, particularly when one observation
can be thought of as predictive or causative of the other. Because classical regression techniques yield straightforward statistical hypothesis
tests, regression allows us to connect one type of data to another, and
can be used to compare large datasets of different types. Second, regression is an area where the evolution from classical statistics to machine
learning methods can be illustrated most easily through the development of penalized likelihood methods. Thus, studying regression can
help students understand developments in other areas of machine learning (through analogy with regression), without knowing all the technical details.
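
For a taste of what this looks like in practice, here is a minimal sketch of a simple linear regression, loosely in the spirit of the protein-versus-mRNA example revisited in Chapter 7. The data are simulated and the use of SciPy is an illustrative assumption; the point is only that fitting the model and testing the association come out of one short calculation.

```python
# Minimal illustrative sketch (simulated data): regress "protein level" on
# "mRNA level" and test whether the slope differs from zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mrna = rng.normal(5.0, 1.0, size=200)                   # simulated log mRNA levels
protein = 0.8 * mrna + rng.normal(0.0, 0.5, size=200)   # simulated log protein levels

fit = stats.linregress(mrna, protein)
print(f"slope = {fit.slope:.2f}, r^2 = {fit.rvalue**2:.2f}, P = {fit.pvalue:.2g}")
```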



1.2.3 Classification
Classification is the task of assigning observations into previously defined
classes. It underlies many of the mainstream successes of machine
learning: spam filters, face recognition in photos, and the Shazam app.
Classification techniques also form the basis for many widely used bioinformatics tools and methodologies. Typical applications include predictions of gene function based on protein sequence or genome-scale
experimental data, and identification of disease subtypes and biomarkers.
Historically, statistical classification techniques were used to analyze the
power of medical tests: given the outcome of a blood test, how accurately
could a physician diagnose a disease?
Increasingly, sophisticated machine learning techniques (such as
neural networks, random forests and support vector machines or
SVMs) are used in popular software for scientific data analysis, and
it is essential that modern molecular biologists understand the concepts underlying these. Because of the wide applicability of classification in everyday problems in the information technology industry, it
has become a large and rapidly developing area of machine learning.
Biomedical applications of these methodological developments often
lead to important advances in computational biology. However, before
applying these methods, it’s critical to understand the specific issues
arising in genome-scale analysis, particularly with respect to evaluation
of classification performance.
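
As a minimal sketch of the workflow the classification chapters build up to, the following code trains a simple linear classifier and then estimates its accuracy by cross-validation, which previews why Chapter 12 is devoted to evaluation. The features and labels are simulated stand-ins (say, expression measurements and a binary disease label), and the library choice is again an illustrative assumption.

```python
# Minimal illustrative sketch (simulated data): train a linear classifier and
# estimate its generalization accuracy with 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))                                   # 100 samples, 20 features
y = (X[:, 0] + X[:, 1] + rng.normal(size=100) > 0).astype(int)   # binary class labels

classifier = LogisticRegression(max_iter=1000)
accuracy = cross_val_score(classifier, X, y, cv=5)  # held-out accuracy per fold
print(accuracy.mean())
```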

1.3 ORGANIZATION OF THIS BOOK
Chapters 2, 3, and 4 review and introduce mathematical formalism,

probability theory, and statistics that are essential to understanding
the modeling and machine learning approaches used in contemporary
molecular biology. Finally, in Chapters 5 and 6 the first real “machine
learning” and nontrivial probabilistic models are introduced. It might
sound a bit daunting that three chapters are needed to give the necessary
background, but this is the reality of data-rich biology. I have done my
best to keep it simple, use clear notation, and avoid tedious calculations.
The reality is that analyzing molecular biology data is getting more and
more complicated.
You probably already noticed that the book is organized by statistical
models and machine learning methods and not by biological examples
or experimental data types. Although this makes it hard to look up a
statistical method to use on your data, I’ve organized it this way because
I want to highlight the generality of the data analysis methods. For example, clustering can be applied to diverse data from DNA sequences to brain
images and can be used to answer questions about protein complexes and
cancer subtypes. Although I might not cover your data type or biological
question specifically, once you understand the method, I hope it will be
relatively straightforward to apply to your data.
Nevertheless, I understand that some readers will want to know that
the book covers their type of data, so I’ve compiled a list of the molecular
biology examples that I used to illustrate methods.
LIST OF MOLECULAR BIOLOGY EXAMPLES
1. Chapter 2—Single-cell RNA-seq data defies standard models
2. Chapter 2—Comparing RNA expression between cell types for one or
two genes
3. Chapter 2—Analyzing the number of kinase substrates in a list of genes

4. Chapter 3—Are the genes that came out of a genetic screen involved
in angiogenesis?
5. Chapter 3—How many genes have different expression levels in T cells?
6. Chapter 3—Identifying eQTLs
7. Chapter 4—Correlation between expression levels of CD8 antigen alpha
and beta chains
8. Chapter 4—GC content differences on human sex chromosomes
9. Chapter 5—Groups of genes and cell types in the immune system
10. Chapter 5—Building a tree of DNA or protein sequences
11. Chapter 5—Immune cells expressing CD4, CD8 or both
12. Chapter 5—Identifying orthologs with OrthoMCL
13. Chapter 5—Protein complexes in protein interaction networks
14. Chapter 6—Single-cell RNA-seq revisited
15. Chapter 6—Motif finding with MEME
16. Chapter 6—Estimating transcript abundance with Cufflinks
17. Chapter 6—Integrating DNA sequence motifs and gene expression data
18. Chapter 7—Identifying eQTLs revisited
19. Chapter 7—Does mRNA abundance explain protein abundance?
20. Chapter 8—SAG1 expression is controlled by multiple loci
21. Chapter 8—mRNA abundance, codon bias, and the rate of protein
evolution
22. Chapter 8—Predicting gene expression from transcription factor binding motifs
23. Chapter 8—Motif finding with REDUCE
24. Chapter 9—Modeling a gene expression time course


25. Chapter 9—Inferring population structure with STRUCTURE
26. Chapter 10—Are mutations harmful or benign?
27. Chapter 10—Finding a gene expression signature for T cells
28. Chapter 10—Identifying motif matches in DNA sequences
29. Chapter 11—Predicting protein folds
30. Chapter 12—The BLAST homology detection problem
31. Chapter 12—LPS stimulation in single-cell RNA-seq data

1.4 WHY ARE THERE MATHEMATICAL CALCULATIONS IN THE BOOK?
Although most molecular biologists don’t (and don’t want to) do mathematical derivations of the type that I present in this book, I have included
quite a few of these calculations in the early chapters. There are several
reasons for this. First of all, the type of machine learning methods presented here are mostly based on probabilistic models. This means that
the methods described here really are mathematical things, and I don’t
want to “hide” the mathematical “guts” of these methods. One purpose
of this book is to empower biologists to unpack the algorithms and mathematical notations that are buried in the methods section of most of the
sophisticated primary research papers in the top journals today. Another
purpose is that I hope, after seeing the worked example derivations for
the classic models in this book, some ambitious students will take the
plunge and learn to derive their own probabilistic machine learning
models. This is another empowering skill, as it frees students from the
confines of the prepackaged software that everyone else is using. Finally,
there are students out there for whom doing some calculus and linear
algebra will actually be fun! I hope these students enjoy the calculations
here. Although calculus and basic linear algebra are requirements for
medical school and graduate school in the life sciences, students rarely
get to use them.
I’m aware that the mathematical parts of this book will be unfamiliar for many biology students. I have tried to include very basic introductory material to help students feel confident interpreting and attacking
equations. This brings me to an important point: although I don’t assume
any prior knowledge of statistics, I do assume that readers are familiar
with multivariate calculus and something about linear algebra (although
I do review the latter briefly). But don’t worry if you are a little rusty and
don’t remember, for example, what a partial derivative is; a quick visit to
Wikipedia might be all you need.
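
To give a flavor of what these calculations look like, here is the classic example that Chapter 4 works through in detail: maximizing the Gaussian log-likelihood with respect to the mean. Setting the partial derivative to zero recovers the familiar sample average as the maximum likelihood estimate.

\[
\log L(\mu, \sigma) = \sum_{i=1}^{n} \log\!\left[\frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x_{i}-\mu)^{2}}{2\sigma^{2}}}\right]
= -\frac{n}{2}\log(2\pi\sigma^{2}) - \sum_{i=1}^{n}\frac{(x_{i}-\mu)^{2}}{2\sigma^{2}}
\]
\[
\frac{\partial \log L}{\partial \mu} = \sum_{i=1}^{n}\frac{x_{i}-\mu}{\sigma^{2}} = 0
\quad\Longrightarrow\quad
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_{i}
\]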

