Systems Biology
in Animal
Production and
Health, Vol. 1


Editor
Haja N. Kadarmideen
Faculty of Health and Medical Sciences
University of Copenhagen
Frederiksberg C, Denmark

ISBN 978-3-319-43333-2
ISBN 978-3-319-43335-6 (eBook)
DOI 10.1007/978-3-319-43335-6

Library of Congress Control Number: 2016956674
© Springer International Publishing Switzerland 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, express or implied, with respect to the material contained herein or for any errors
or omissions that may have been made.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG Switzerland
The registered company address is Gewerbestrasse 11, 6330 Cham, Switzerland


Foreword

The increased prominence of “systems biology” in biological research over the past
two decades is arguably a reaction to the reductionist approach exemplified by the
genome sequencing phase of the Human Genome Project. A simplistic view of the
genome projects was that the genome sequence of a species, whether humans,
model organisms, plants or farmed animals, represents a blueprint for the organism
of interest, and thus characterising the sequence would reveal the relevant instructions. Subsequent targets for the reductionist or cataloguing approach were complete lists of transcripts (transcriptomes) and proteins (proteomes) for the organism
of interest. The ‘omics approach to the comprehensive characterisation of an organism, tissue or cell has also been extended to metabolites and hence metabolomes.

A catalogue of parts, however, is insufficient to understand how an organism functions. Thus, a holistic approach that recognises the interactions between components of the system was required. Given the size and complexity of the data and the
possible interactions, it was necessary to use advanced mathematical and computational methods to attempt to make sense of the data. Thus, “systems biology” in the
‘omics era is widely considered to concern the use of mathematical modelling and
analysis together with ‘omics data (genome sequence, transcriptomes, proteomes,
metabolomes) to understand complex biological systems. The predictive aspect of
these models is viewed as particularly important. Moreover, it is desirable that the
models’ predictions can be tested experimentally. Systems biology, therefore, contributes in part to converting large ‘omics data sets from data-driven biology experiments into testable hypotheses.
Systems approaches and the use of predictive mathematical models in biological
systems long pre-date the post genome project (re-)emergence of systems biology.
Population biologists/geneticists, epidemiologists, agricultural scientists, quantitative geneticists and plant and animal breeders have been developing and successfully exploiting predictive mathematical models and systems approaches for
decades.
Quantitative geneticists and animal breeders, for example, have been remarkably
successful at developing statistical animal models that are effective predictors of
future performance. For decades, these successes were achieved without any knowledge of the underlying molecular components. The accuracy of these models has
been increased by using high-density molecular (single nucleotide polymorphism,
SNP) genotypes in so-called genomic selection. However, whilst the sequences and
genome locations of the SNP markers are known, little is known about the functional
impact or relevance of the individual SNP loci. Further improvements could be
achieved through the use of genome sequence data and by adding knowledge of the
likely effects of the sequence variants whether coding or regulatory. Thus, there is a
growing commonality between the systems approaches of quantitative geneticists
and animal breeders and the ‘omics version of systems biology.
Animals are not only complex biological systems but also function within wider
complex systems. The recognition that an animal’s phenotype is determined by a
combination of its genotype and environmental factors simply restates the latter.
The environmental factors include, amongst others, feed, pathogens and the microbiomes present in the gastrointestinal tract and other locations. The ‘omics technologies allow not only the characterisation of the components of the animal of
interest, but also those of its commensal microbes and the microbes, including
pathogens, present in its environment.
As noted earlier, it is desirable that the mathematical models developed in systems biology are predictive and that the associated hypotheses are testable. Genome
editing technologies which have been demonstrated in farmed animal species facilitate hypothesis testing at the level of modifying the genome sequence that determines components of the system of interest.
This volume of Systems Biology in Animal Production and Health, edited by
Professor Haja Kadarmideen, explores some aspects of both quantitative genetics
and ‘omics led approaches to applying systems approaches to tackling the challenges of improving animal productivity and reducing the burden of disease. This
book contains some chapters with R codes and other computer programs, workflow/
pipelines for processing and analysing multi-omic datasets from laboratory all the
way to interpretation of results. Hence, this book would be particularly useful for
students, teachers and practitioners of integrative genomics, bioinformatics and systems biology in animal and veterinary sciences.
Adhil et al. (chapter “Advanced Computational Methods, NGS Tools, and
Software for Mammalian Systems Biology”) review the computational methods
and tools required to analyse and integrate multi-omics data from different levels
including genome sequence, transcriptomics, proteomics and metabolomics. The
analysis of transcriptomic data, and specifically RNA-Seq data, is described in
greater detail by Heras-Saldana et al. (chapter “RNA Sequencing Applied to
Livestock Production”).
Whilst it is generally challenging to identify the causal genetic variants for complex phenotypes, identifying loci with effects on primary traits such as the level of
gene expression or levels of a metabolite is easier as effects are often delivered close
to the gene. For example, many expression quantitative trait loci (eQTL) are detected
as cis-effects with the causal genetic variation located in the regulatory sequences of
the gene of interest. Of course, most phenotypes of importance to animal production
or health are controlled by the effects of many genes. Wang and Michoel (chapter
“Detection of Regulator Genes and eQTLs in Gene Networks”) address the challenge
of identifying the gene networks that capture the interaction between genes from
eQTL data. Systems genetics and systems biology using gene network methods
with application to obesity using pig models are reviewed by Kogelman and
Kadarmideen (chapter “Applications of Systems Genetics and Biology for Obesity
Using Pig Models”). Fontanesi (chapter “Merging Metabolomics, Genetics, and
Genomics in Livestock to Dissect Complex Production Traits”) reviews metabolite
QTL (mQTL), which have similar advantages to eQTL in respect of ease of identification, in pigs and cattle.
Rosa et al. (chapter “Applications of Graphical Models in Quantitative Genetics
and Genomics”) discuss the use of stochastic graphical models with an emphasis on
Bayesian networks to predict phenotypes, including primary traits such as gene
expression levels and end traits from sequence variants and thus arguably traversing
the path from sequence to consequence.
Professor Alan L. Archibald FRSE
Deputy Director, Head of Genetics and Genomics
The Roslin Institute and Royal (Dick) School of Veterinary Studies
University of Edinburgh
Easter Bush, Midlothian EH25 9RG, UK


Preface

Systems biology is a research discipline at the crossroads of statistical, computational, quantitative, and molecular biology methods. It involves joint modeling,
combined analysis and interpretation of high-throughput omics (HTO) data collected at many “levels or layers” of the biological systems within and across individuals in the population. The systems biology approach is often aimed at studying
associations and interactions between different “layers or levels”, but not necessarily one layer or level in isolation. For instance, it involves the study of multidimensional
associations or interactions among DNA polymorphisms, gene expression levels,
proteins or metabolite abundances. With modern HTO biotechnologies and their
decreasing costs, hugely comprehensive multi-omic data at all “levels or layers” of
the biological system are now available. This “big data” at lower costs, along with
development of genome scale models, network approaches and computational
power, have spearheaded the progress of the systems biology era, including applications in human biology and medicine. Systems biology is an established independent discipline in humans and increasingly so in animals, plants and microbial
research. However, joint modeling and analysis of multilayer HTO data, in volumes
and on a scale never seen before, pose enormous challenges from
both computational and statistical points of view. Systems biology tackles such joint
modeling and analyses of multiple HTO datasets using a combination of statistical,
computational, quantitative and molecular biology methods and bioinformatics
tools. As I wrote in my review article (Livestock Science 2014, 166:232–248), systems biology is not only about multilayer HTO data collection from populations of
individuals and subsequent analyses and interpretations; it is also about a philosophy and a hypothesis-driven predictive modeling approach that feeds into new
experimental designs, analyses and interpretations. In fact, systems biology revolves
and iterates between these “wet” and “dry” approaches to converge on coherent
understanding of the whole biological system behind a disease or phenotype and
provide a complete blueprint of functions that leads to a phenotype or a complex
disease.
It is equally important to introduce, alongside systems biology, the sub-discipline of systems genetics as a branch of systems biology. It is akin to considering
“genetics” as a sub-discipline of “biology”. It is well known that quantitative genetics/genomics links genome-wide genetic variation with variation in disease risks or
a performance (phenotype or trait) that we can easily measure or observe in a
population of individuals. However, systems genetics or systems genomics not only
performs such genome-wide association studies (GWAS), but also links
genetic variations (e.g. SNPs, CNVs, QTLs etc.) at the DNA sequence level with
variation in molecular profiles or traits (e.g. gene expression or metabolomic or
proteomic levels etc. in tissues and biological fluids) that we can measure using
high-throughput next- and third-generation biotechnologies. The systems genetics
approach is still “genetics”, because we are looking at those genetic variants that
exert their effects from DNA to phenotypic expression or disease manifestations
through a number of intermediate molecular profiles. Hence, systems genetics
derives its name, as originally proposed in my earlier article (Mammalian Genome,
2006, 17:548–564), by being able to integrate analyses of all underlying genetic
factors acting at different biological levels, namely, QTL, eQTL, mQTL, pQTL and
so on. I have provided a complete up-to-date review and illustration of systems
genetics or systems genomics and multi-omic data integration and analyses in our
review paper published in Genetics Selection Evolution (2016), 48:38. Overall, systems genetics/genomics leads us to provide a holistic view on complex trait heredity
at different biological layers or levels.
Whether it is systems biology or systems genetics, the gene ontology annotation
is one of the most important and valuable means of assigning functional information
using standardized vocabulary. This would include annotation of genetic variants
falling into functional groups such as trait QTL, eQTL, mQTL, pQTL. Molecular
pathway profiling, signal transduction and gene set enrichment analyses along with
various types of annotations form the “icing on the cake”. For this purpose, several
bioinformatics tools are frequently used. Most chapters in this book and its associated volume cover these aspects.
I would like to point out that systems biology approaches have been proven to be
very powerful and shown to produce accurate and replicable discoveries of genes,
proteins and metabolites and their networks that are involved in complex diseases or
traits. In very practical terms, it delivers biomarkers, drug targets, vaccine targets,
target transcripts or metabolites, genetic markers, pathway targets etc. to diagnose
and treat diseases better or improve traits or characteristics in animals, plants and
humans. In the world of genomic prediction and genomic selection, there have been
an increasing number of studies that have shown high accuracy and predictive
power when models include functional QTLs such as eQTL, mQTL, pQTL which,
in fact, are results from systems genetics methods.
This book and its associated volume cover the above-mentioned principles, theory and application of systems biology and systems genetics in livestock and animal
models and provide a comprehensive overview of open source and commercially
available software tools, computer programming codes and other reading materials to
learn, use and successfully apply systems biology and systems genetics in animals.
Overall, I believe this book is an extremely valuable source for students interested in learning the basics and could serve as a textbook in higher education
institutes and universities around the world. Equally, the book chapters are very
relevant and useful for scientists interested in learning and applying advanced HTO
studies, integrative HTO data analyses (e.g. eQTLs and mQTLs) and computational
systems biology techniques to animal production, health and welfare. One of the
chapters focuses on systems genomics models and computational methods applied
to animal models for elucidating systems biology of human obesity and diabetes.
The two volumes of this book are the result of contributions from highly reputed scientists and practitioners who originate from renowned universities and multinational
companies in the UK, Denmark, France, Italy, Australia, USA, Brazil and India.
I would like to thank the publisher Springer for inviting me to edit two volumes on
this subject, publishing in an excellent form and promoting the book across the
globe. I am grateful to all contributing authors and co-authors of this book. I also
wish to thank Ms. Gilda Kischinovsky from my research group for proofreading
and the staff at Springer involved in production of this book. Last but not least,
I wish to thank my wife and children who have given me moral support and strength
while I reviewed and edited this book.
Copenhagen, Denmark
September, 2016

Haja N. Kadarmideen



Contents

Detection of Regulator Genes and eQTLs in Gene Networks
Lingfei Wang and Tom Michoel

Applications of Systems Genetics and Biology for Obesity Using Pig Models
Lisette J.A. Kogelman and Haja N. Kadarmideen

Merging Metabolomics, Genetics, and Genomics in Livestock to Dissect Complex Production Traits
Luca Fontanesi

RNA Sequencing Applied to Livestock Production
Sara de las Heras-Saldana, Hawlader A. Al-Mamun, Mohammad H. Ferdosi, Majid Khansefid, and Cedric Gondro

Applications of Graphical Models in Quantitative Genetics and Genomics
Guilherme J.M. Rosa, Vivian P.S. Felipe, and Francisco Peñagaricano

Advanced Computational Methods, NGS Tools, and Software for Mammalian Systems Biology
Mohamood Adhil, Mahima Agarwal, Prahalad Achutharao, and Asoke K. Talukder



Detection of Regulator Genes and eQTLs
in Gene Networks
Lingfei Wang and Tom Michoel
Division of Genetics and Genomics, The Roslin Institute, The University of Edinburgh,
Midlothian EH25 9RG, UK

H.N. Kadarmideen (ed.), Systems Biology in Animal Production and Health, Vol. 1,
DOI 10.1007/978-3-319-43335-6_1

Abstract

Genetic differences between individuals associated with quantitative phenotypic
traits, including disease states, are usually found in noncoding genomic regions.
These genetic variants are often also associated with differences in expression levels of nearby genes (they are “expression quantitative trait loci” or eQTLs, for
short) and presumably play a gene regulatory role, affecting the status of molecular networks of interacting genes, proteins, and metabolites. Computational systems biology approaches to reconstruct causal gene networks from large-scale
omics data have therefore become essential to understand the structure of networks controlled by eQTLs together with other regulatory genes, as well as to
generate detailed hypotheses about the molecular mechanisms that lead from
genotype to phenotype. Here we review the main analytical methods and software to identify eQTLs and their associated genes, to reconstruct coexpression
networks and modules, to reconstruct causal Bayesian gene and module networks, and to validate predicted networks in silico.

1 Introduction

Genetic differences between individuals are responsible for variation in the observable phenotypes. This principle underpins genomewide association studies (GWAS),
which map the genetic architecture of complex traits by measuring genetic variation
at single-nucleotide polymorphisms (SNPs) on a genomewide scale across many
individuals (Mackay et al. 2009). GWAS have resulted in major improvements in
plant and animal breeding (Goddard and Hayes 2009) and in numerous insights into
the genetic basis of complex diseases in humans (Manolio 2013). However, quantitative trait loci (QTLs) with large effects are uncommon and a molecular explanation
for their trait association rarely exists (Mackay et al. 2009). The vast majority of
QTLs indeed lie in noncoding genomic regions and presumably play a gene regulatory role (Hindorff et al. 2009; Schaub et al. 2012). Consequently, numerous studies
have identified cis- and trans-acting DNA variants that influence gene expression
levels (i.e., “expression QTLs”; eQTLs) in model organisms, plants, farm animals,
and humans (reviewed in Rockman and Kruglyak 2006; Georges 2007; Cookson
et al. 2009; Cheung and Spielman 2009; Cubillos et al. 2012). Gene expression
programs are of course highly tissue- and cell-type specific, and the properties and
complex relations of eQTL associations across multiple tissues are only beginning
to be mapped (Dimas et al. 2009; Foroughi Asl et al. 2015; Greenawalt et al. 2011;
Ardlie et al. 2015). At the molecular level, a mounting body of evidence shows that
cis-eQTLs primarily cause variation in transcription factor (TF) binding to gene
regulatory DNA elements, which then causes changes in histone modifications,
DNA methylation, and mRNA expression of nearby genes; trans-eQTLs in turn can
usually be attributed to coding variants in regulatory genes or cis-eQTLs of such
genes (Albert and Kruglyak 2015).
Taken together, these results motivate and justify a systems biological view of
quantitative genetics (“systems genetics”), where it is hypothesized that genetic
variation, together with environmental perturbations, affects the status of molecular
networks of interacting genes, proteins, and metabolites; these networks act within
and across different tissues and collectively control physiological phenotypes
(Williams 2006; Kadarmideen et al. 2006; Rockman 2008; Schadt 2009; Schadt and
Björkegren 2012; Civelek and Lusis 2014; Björkegren et al. 2015). Studying the
impact of genetic variation on gene regulation networks is of crucial importance in
understanding the fundamental biological mechanisms by which genetic variation
causes variation in phenotypes (Chen et al. 2008), and it is expected to lead to the
discovery of novel disease biomarkers and drug targets in human and veterinary
medicine (Schadt et al. 2009). Because the direct experimental mapping of genetic,
protein–protein, or protein–DNA interactions is an immensely challenging task,
further exacerbated by the cell-type-specific and dynamic nature of these interactions (Walhout 2006), comprehensive, experimentally verified molecular networks
will not become available for multi-cellular organisms in the foreseeable future.
Statistical and computational methods are therefore essential to reconstruct trait-associated
causal networks by integrating diverse omics data (Rockman 2008;
Schadt 2009; Ritchie et al. 2015).
A typical systems genetics study collects genotype and gene, protein, and/or
metabolite expression data from a large number of individuals segregating for one
or more traits of interest. After raw data processing and normalization, eQTLs are
identified for each of the expression data types, and a coexpression matrix is constructed. Causal Bayesian gene networks, coexpression modules (i.e., clusters), and/or
causal Bayesian module networks are then reconstructed. The in silico validation
of predicted networks and modules using independent data confirms their overall
validity, ideally followed by the experimental validation of the most promising findings in a relevant cell line or model organism (Fig. 1). Here we review the main
analytic principles behind each of the steps from eQTL identification to in silico
network validation and present a selection of the most commonly used methods and
software for each step. Throughout this chapter, we tacitly assume that all data have
been quality controlled, preprocessed, and normalized to suit the assumptions of the
analytic methods presented here. For expression data, this usually means working
with log-transformed data where each gene expression profile is centered around
zero with standard deviation one. We also assume that the data have been corrected
for any confounding factors, either by regressing out known covariates or by estimating hidden factors (Stegle et al. 2012).

Fig. 1  A flow chart for a typical systems genetics study and the corresponding software (listed in parentheses). In the original figure, the steps covered in this chapter are highlighted in light yellow.
  Adequate experimental design and data collection
  Appropriate data preprocessing and quality control
  Expression quantitative trait loci analysis (matrix-eQTL, kruX)
  Choice of correlation function and calculation of gene co-expression
  Co-expression module detection (Fast Modularity, MCL, WGCNA)
  Genotype data assisted edge directing (NEO, Trigger)
  Model-based clustering (Lemon-Tree)
  Module network reconstruction (Lemon-Tree)
  In silico validation of reconstructed gene regulation network (ENCODE, Roadmap Epigenomics, modENCODE, BioGRID, Gene Expression Omnibus, ArrayExpress)
  Experimental verification of regulatory pathways
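The preprocessing assumptions stated above can be illustrated with a minimal NumPy sketch. It is not taken from any of the packages reviewed in this chapter: the function names `regress_out` and `standardize_rows`, the simulated `age` covariate and the pseudo-count of 1 in the log transform are all assumptions made for illustration, and hidden-factor estimation is not covered.

```python
import numpy as np

def regress_out(X, covariates):
    """Remove the linear effect of known covariates from each row of X.

    X: (k, n) expression matrix, one row per gene.
    covariates: (n, c) matrix of known covariates, one row per individual.
    """
    n = X.shape[1]
    # Design matrix with an intercept column.
    D = np.column_stack([np.ones(n), covariates])
    # Least-squares fit for all genes at once: D @ beta approximates X.T.
    beta, *_ = np.linalg.lstsq(D, X.T, rcond=None)
    return X - (D @ beta).T

def standardize_rows(X):
    """Center each row at zero and scale it to standard deviation one."""
    X = X - X.mean(axis=1, keepdims=True)
    return X / X.std(axis=1, keepdims=True)

# Simulated example (hypothetical data): 4 genes, 50 individuals, one covariate.
rng = np.random.default_rng(0)
n, k = 50, 4
age = rng.normal(size=n)
raw = rng.lognormal(mean=2.0, sigma=0.5, size=(k, n)) * np.exp(0.3 * age)

# Log-transform, remove the covariate effect, then standardize each gene.
X = standardize_rows(regress_out(np.log2(raw + 1.0), age[:, None]))
```

With this convention (population standard deviation, i.e. `ddof=0`), every row of `X` sums to zero and its squared entries sum to n.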


2 Genetics of Gene Expression

A first step toward identifying molecular networks affected by DNA variants is to
identify variants that underpin variation in the abundance of transcripts (eQTLs; Cookson et al.
2009), proteins (Foss et al. 2007), or metabolites (Nicholson et al. 2011) across
individuals. When studying a single trait, as in GWAS, it is possible to consider
multiple statistical models to explicitly account for additive and/or dominant genetic
effects (Laird and Lange 2011). However, when the possible effects of a million or
more SNPs on tens of thousands of molecular abundance traits need to be tested, as
is common in modern genetics of gene expression studies, the computational cost of
testing SNP–trait associations one by one becomes prohibitive. To address this
problem, new methods have been developed to calculate the test statistics for the
parametric linear regression and analysis of variance (ANOVA) models (Shabalin
2012) and the nonparametric ANOVA model (or Kruskal–Wallis test) (Qi et al.
2014) using fast matrix multiplication algorithms, implemented in the software
matrix eQTL (Shabalin 2012) and kruX (Qi et al. 2014).
In both programs, genotype values of s genetic markers and expression levels of
k transcripts, proteins, or metabolites in n individuals are organized in an s × n
genotype matrix G and a k × n expression data matrix X. Genetic markers take values
0, 1, …, ℓ, where ℓ is the maximum number of alleles (ℓ = 2 for biallelic markers),
whereas molecular traits take continuous values. In the linear model, a linear relation is tested between the expression level of gene i and the genotype value (i.e., the
number of reference alleles) of SNP j. The corresponding test statistic is the Pearson
correlation between the ith row of X and the jth row of G, for all values of i and j.
Standardizing the data matrices to zero mean and unit variance, such that for all i
and j,

$$\sum_{l=1}^{n} X_{il} = \sum_{l=1}^{n} G_{jl} = 0 \quad\text{and}\quad \sum_{l=1}^{n} X_{il}^{2} = \sum_{l=1}^{n} G_{jl}^{2} = n,$$

it follows that the correlation values can be computed as

$$R_{ij} = \frac{1}{n}\sum_{l=1}^{n} X_{il} G_{jl} = \frac{1}{n}\left(X G^{T}\right)_{ij},$$

where $G^{T}$ denotes the transpose of G. Hence, a single matrix multiplication suffices
to compute the test statistics for the linear model for all pairs of traits and SNPs.
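The single-product computation can be sketched in a few lines of NumPy. This is an illustration of the idea only, not the matrix eQTL implementation; the function names and the simulated data are assumptions. Because the rows are scaled so that their squared entries sum to n, the product is divided by n to obtain correlations.

```python
import numpy as np

def standardize_rows(M):
    """Center each row at zero and scale it to unit (population) variance."""
    M = M - M.mean(axis=1, keepdims=True)
    return M / M.std(axis=1, keepdims=True)

def all_pairs_correlation(X, G):
    """Pearson correlation between every row of X (k x n) and every row of G (s x n)."""
    n = X.shape[1]
    # One matrix product yields the full k x s table of test statistics.
    return standardize_rows(X) @ standardize_rows(G).T / n

# Simulated example (hypothetical data): 5 genes, 8 SNPs, 200 individuals.
rng = np.random.default_rng(1)
k, s, n = 5, 8, 200
G = rng.integers(0, 3, size=(s, n)).astype(float)  # genotypes coded 0, 1, 2
X = rng.normal(size=(k, n))
X[0] += 0.8 * G[3]                                  # gene 0 is affected by SNP 3

R = all_pairs_correlation(X, G)                     # shape (k, s)
```

Each entry `R[i, j]` agrees with the pairwise value returned by `np.corrcoef(X[i], G[j])[0, 1]`, but the full table is obtained without looping over pairs.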
The ANOVA models test if expression levels in different genotype groups originate from the same distribution. Therefore, ANOVA models can account for both
additive and dominant effects of a genetic variant on expression levels. In the parametric ANOVA model, suppose the test samples are divided into ℓ + 1 groups by the
SNP j. The mean expression level for gene i in each group m can be written as

$$\overline{X}_i^{(m,j)} = \frac{1}{n^{(m,j)}} \sum_{\{l \,:\, G_{jl} = m\}} X_{il},$$

where $n^{(m,j)}$ is the number of samples in genotype group m for SNP j.
Again assuming that the expression data are standardized, the F-test statistic for
testing gene i against SNP j can be written as

$$F_i^{(j)} = \frac{n - \ell - 1}{\ell}\,\frac{SS_i^{(j)}}{n - SS_i^{(j)}},$$

where $SS_i^{(j)}$ is the sum of squares between groups,

$$SS_i^{(j)} = \sum_{m=0}^{\ell} n^{(m,j)} \left(\overline{X}_i^{(m,j)}\right)^{2}.$$

Let us define the n × s indicator matrix $I^{(m)}$ for genotype group m, i.e., $I^{(m)}_{lj} = 1$
if $G_{jl} = m$ and 0 otherwise. Then

$$\sum_{\{l \,:\, G_{jl} = m\}} X_{il} = \left(X I^{(m)}\right)_{ij}.$$

Hence, for all pairs of expression level $X_i$ and SNP $G_j$, the sum of squares matrix
$SS_i^{(j)}$ can be computed via ℓ matrix multiplications. Only ℓ multiplications are needed, rather than ℓ + 1, because the indicator matrices sum to the all-ones matrix and the rows of the standardized matrix X sum to zero, which implies that

$$X I^{(0)} = -\sum_{m=1}^{\ell} X I^{(m)}.$$

In the nonparametric ANOVA model, the expression data matrix is converted to
a matrix T of data ranks, independently over each row. In the absence of ties, the
Kruskal–Wallis test statistic is given by

$$S_{ij} = \frac{12}{n(n+1)} \sum_{m=0}^{\ell} n^{(m,j)} \left(\overline{T}_i^{(m,j)}\right)^{2} - 3(n+1),$$

where $\overline{T}_i^{(m,j)}$ is the average expression rank of gene i in genotype group m of SNP j,
defined as

$$\overline{T}_i^{(m,j)} = \frac{1}{n^{(m,j)}} \sum_{\{l \,:\, G_{jl} = m\}} T_{il},$$

which can be similarly obtained from ℓ matrix multiplications.
There is as yet no consensus about which statistical model is most appropriate
for eQTL detection. Nonparametric methods were introduced in the earliest eQTL
studies (Brem et al. 2002; Schadt et al. 2008) and have remained popular, as they are
robust against variations in the underlying genetic model and trait distribution.
More recently, the linear model implemented in matrix eQTL has been used in a
number of large-scale studies (Ardlie et al. 2015; Lappalainen et al. 2013). A comparison on a data set of 102 human whole blood samples showed that the parametric
ANOVA method was highly sensitive to the presence of outlying gene expression
values and SNPs with singleton genotype groups. Linear models reported the highest
number of eQTL associations after empirical False Discovery Rate (FDR) correction, with an expected bias toward additive linear associations. The Kruskal–Wallis
test was most robust against data outliers and heterogeneous genotype group sizes
and detected a higher proportion of nonlinear associations but was more conservative for calling additive linear associations than linear models (Qi et al. 2014).
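The two ANOVA statistics can be sketched with indicator matrices as follows. This is a NumPy illustration under stated assumptions, not the matrix eQTL or kruX code: the function names are hypothetical, genotypes are assumed to be coded 0, 1, 2 (three groups), every genotype group is assumed to be non-empty, and the expression values are assumed to contain no ties. The sums for group 0 are derived from the known row totals instead of a further matrix product.

```python
import numpy as np

def group_sums(V, G, n_groups):
    """Per-group row sums V @ I^(m) and group sizes for all pairs at once.

    V: (k, n) values (standardized expression or ranks); G: (s, n) integer genotypes.
    Returns sums with shape (n_groups, k, s) and sizes with shape (n_groups, s).
    """
    total = V.sum(axis=1, keepdims=True)            # (k, 1) row totals
    sums = np.empty((n_groups, V.shape[0], G.shape[0]))
    sizes = np.empty((n_groups, G.shape[0]))
    for m in range(1, n_groups):
        I_m = (G == m).T.astype(float)              # (n, s) indicator matrix
        sums[m] = V @ I_m
        sizes[m] = I_m.sum(axis=0)
    # Group 0 follows from the row totals, so no product is needed for it.
    sums[0] = total - sums[1:].sum(axis=0)
    sizes[0] = V.shape[1] - sizes[1:].sum(axis=0)
    return sums, sizes

def anova_f(X, G, n_groups=3):
    """F statistic for every gene/SNP pair; rows of X need mean 0 and variance 1."""
    n = X.shape[1]
    sums, sizes = group_sums(X, G, n_groups)
    # Between-group sum of squares: sum over m of n_m * mean_m^2 = sum_m^2 / n_m.
    ss_between = (sums ** 2 / sizes[:, None, :]).sum(axis=0)
    return (n - n_groups) / (n_groups - 1) * ss_between / (n - ss_between)

def kruskal_wallis(X, G, n_groups=3):
    """Kruskal-Wallis statistic for every pair, assuming no ties within a row."""
    n = X.shape[1]
    T = X.argsort(axis=1).argsort(axis=1) + 1.0     # ranks 1..n within each row
    sums, sizes = group_sums(T, G, n_groups)
    return 12.0 / (n * (n + 1)) * (sums ** 2 / sizes[:, None, :]).sum(axis=0) - 3 * (n + 1)

# Simulated example (hypothetical data): 4 genes, 6 SNPs, 120 individuals.
rng = np.random.default_rng(3)
k, s, n = 4, 6, 120
G = rng.integers(0, 3, size=(s, n))
X = rng.normal(size=(k, n))
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

F = anova_f(X, G)          # (k, s) parametric ANOVA statistics
S = kruskal_wallis(X, G)   # (k, s) Kruskal-Wallis statistics
```

Both functions reuse the same helper because the parametric and nonparametric statistics differ only in whether expression values or their ranks are summed within genotype groups.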
In summary, when large numbers of traits and markers have to be tested for association, efficient matrix multiplication methods can be used to calculate all test statistics at once, leading to a dramatic reduction in computation time compared with
calculating these statistics one by one for every pair using traditional methods.
Matrix multiplication is a basic mathematical operation, which has been intensively
studied and optimized for decades (Golub and Van Loan 1996). Highly efficient packages, such as BLAS and LAPACK (http://www.netlib.org/lapack/), are available for use on generic CPUs and are indeed used
in most mainstream scientific computing software and programming languages,
such as Matlab and R. In recent years, graphics processing unit (GPU)-accelerated
computing, for example with CUDA, has revolutionized scientific calculations that involve
repetitive operations in parallel on bulky data, offering even more speedup than the
existing CPU-based packages. The first applications of GPU computing in eQTL
analysis have already appeared (e.g., Hemani et al. 2014), and more can be expected
in the future.
Lastly, for pairs exceeding a predefined threshold on the test statistic, a p-value
can be computed from the corresponding test distribution, and these p-values can
then be further corrected for multiple testing by common procedures (Shabalin
2012; Qi et al. 2014).
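As one example of a common multiple-testing procedure, the Benjamini–Hochberg adjustment can be sketched as below. The choice of this procedure is an assumption made for illustration; the text does not prescribe a specific correction. The hypothetical `n_tests` argument covers the case described above, where p-values are computed only for pairs whose statistic exceeds a threshold while the total number of tested pairs is larger.

```python
import numpy as np

def benjamini_hochberg(pvalues, n_tests=None):
    """Benjamini-Hochberg adjusted p-values for a 1-D array of p-values.

    n_tests: total number of tests performed. Defaults to the number of
    p-values supplied; set it explicitly when only the most significant
    pairs were retained after thresholding the test statistic.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size if n_tests is None else n_tests
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, p.size + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(p.size)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

# Hypothetical p-values for five SNP-gene pairs.
q = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
# Same p-values, but retained from a total of ten tested pairs.
q_subset = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6], n_tests=10)
```

Pairs whose adjusted value falls below the chosen FDR level would then be reported as significant associations.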

3 Coexpression Networks and Modules

3.1 Coexpression Gene Networks

The Pearson correlation is the simplest and computationally most efficient similarity measure for gene expression profiles. For genes i and j, their Pearson correlation
can be written as
n

Cij = åX il X jl .


l =1

In matrix notation, this can be combined as the matrix multiplication


(1)

C = XXT .

Gene pairs with large positive or negative correlation values tend to be up- or down-regulated together due to either a direct regulatory link between them or being
jointly coregulated by a third, often hidden, factor. By filtering for correlation values
exceeding a significance threshold determined by comparison with randomly


Detection of Regulator Genes and eQTLs in Gene Networks


permuted data, a discrete coexpression network is obtained. Assuming that a high
degree of coexpression signifies that genes are involved in the same biological processes, graph theoretical methods can be used, for instance, to predict gene function
(Sharan et al. 2007).
One drawback of the Pearson correlation is that, by definition, it is biased toward
linear associations. To overcome this limitation, other measures are available. The
Spearman correlation uses expression data ranks (cf. Section 2) in Eq. (1) and
gives a high score to monotonic relations. Mutual information is the most general measure and detects both linear and nonlinear associations. For a pair of discrete random variables A and B (representing the expression levels of two genes) taking
values $a_l$ and $b_m$, respectively, the mutual information is defined as

$$\mathrm{MI}(A, B) = H(A) + H(B) - H(A, B),$$

where

$$H(A) = -\sum_{l} P(a_l) \log P(a_l), \qquad H(B) = -\sum_{m} P(b_m) \log P(b_m), \qquad H(A, B) = -\sum_{l,m} P(a_l, b_m) \log P(a_l, b_m)$$

are the individual and joint Shannon entropies of A and B, and $P(a_l) = P(A = a_l)$,
and likewise for the other terms. Because gene expression data are continuous,
mutual information estimation is nontrivial and usually involves some form of discretization (Daub et al. 2004). Mutual information has been successfully used as a
coexpression measure in a variety of contexts (Butte and Kohane 2000; Basso et al.
2005; Faith et al. 2007).
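As a rough illustration of the discretization step, the following sketch estimates mutual information with a plug-in estimator on equal-width bins. The choice of eight bins and the function name `mutual_information` are assumptions made for this example only; the estimators used in the cited studies are more refined.

```python
import numpy as np

def mutual_information(a, b, bins=8):
    """Plug-in estimate of MI(A, B) in nats after equal-width binning."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()   # joint probabilities P(a_l, b_m)
    p_a = p_ab.sum(axis=1)       # marginal P(a_l)
    p_b = p_ab.sum(axis=0)       # marginal P(b_m)

    def entropy(p):
        p = p[p > 0]             # treat 0 * log 0 as 0
        return -np.sum(p * np.log(p))

    # MI(A, B) = H(A) + H(B) - H(A, B)
    return entropy(p_a) + entropy(p_b) - entropy(p_ab.ravel())

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y_dep = x ** 2 + 0.1 * rng.normal(size=2000)  # nonlinear dependence on x
y_ind = rng.normal(size=2000)                 # independent of x
mi_dep = mutual_information(x, y_dep)
mi_ind = mutual_information(x, y_ind)
```

In this toy example the quadratic relation is expected to score clearly higher than the independent pair, even though its Pearson correlation is close to zero.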

3.2 Clustering and Coexpression Module Detection

It is generally understood that cellular functions are carried out by “modules,”
groups of molecules that operate together and whose function is separable from that
of other modules (Hartwell et al. 1999). Clustering gene expression data (i.e., dividing genes into discrete groups on the basis of similarities in their expression profiles) is a standard approach to detect such functionally coherent gene modules. The
literature on gene expression clustering is vast and cannot possibly be reviewed
comprehensively here. It includes “standard” methods such as hierarchical clustering (Eisen et al. 1998), k-means (Tavazoie et al. 1999), graph-based methods that
operate directly on coexpression networks (Sharan and Shamir 2000), and model-based
clustering algorithms which assume that the data are generated by a mixture
of probability distributions, one for each cluster (Medvedovic and Sivaganesan
2002). Here we briefly describe a few recently developed methods with readily
available software.




L. Wang and T. Michoel

3.2.1 Modularity Maximization
Modularity maximization is a network-clustering method that is particularly popular in the physical and social sciences, based on the assumption that intramodule
connectivity should be much denser than intermodule connectivity (Newman and
Girvan 2004; Newman 2006). In the context of coexpression networks, this method
can be used to identify gene modules directly from the correlation matrix C (Ayroles
et al. 2009). Suppose the genes are grouped into N modules $M_l$, $l = 1, \dots, N$. Each
module $M_l$ is a nonempty set that can contain any combination of the genes
$i = 1, \dots, k$, but each gene is contained in exactly one module. Also define $M_0$ as the
set containing all genes. The modularity score function is defined as
$$S(M) = \sum_{l=1}^{N} \left( \frac{W(M_l, M_l)}{W(M_0, M_0)} - \left( \frac{W(M_l, M_0)}{W(M_0, M_0)} \right)^{2} \right),$$

where $W(A, B) = \sum_{i \in A,\, j \in B,\, i \neq j} w(C_{ij})$ is a weight function, summing over all the edges
that connect one vertex in A with another vertex in B, and w(x) is a monotonic
function to map correlation values to edge strengths. Common functions are
$w(x) = |x|$, $|x|^{\beta}$ (power law) (Langfelder and Horvath 2008), $e^{\beta |x|}$ (exponential)
(Ayroles et al. 2009), or $1 / (1 + e^{\beta x})$ (sigmoid) (Lee et al. 2009).
A modularity maximization software package particularly suited for large networks is
Fast Modularity (Clauset et al. 2004).
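To make the score concrete, the sketch below evaluates S(M) for a given partition. It only scores a partition and does not perform the maximization itself; the weight function defaults to the absolute value, and the function name `modularity_score` and the toy block matrix are illustrative assumptions.

```python
import numpy as np

def modularity_score(C, labels, weight=np.abs):
    """Modularity S(M) of the partition `labels` for correlation matrix C."""
    W = np.array(weight(np.asarray(C, dtype=float)), dtype=float)
    np.fill_diagonal(W, 0.0)  # the sums in W(A, B) exclude i == j
    labels = np.asarray(labels)
    total = W.sum()           # W(M_0, M_0)
    score = 0.0
    for l in np.unique(labels):
        members = labels == l
        within = W[np.ix_(members, members)].sum()  # W(M_l, M_l)
        to_all = W[members, :].sum()                # W(M_l, M_0)
        score += within / total - (to_all / total) ** 2
    return score

# Toy correlation matrix with two blocks of three genes each.
C = np.full((6, 6), 0.1)
C[:3, :3] = 0.9
C[3:, 3:] = 0.9
np.fill_diagonal(C, 1.0)

s_good = modularity_score(C, [0, 0, 0, 1, 1, 1])  # follows the blocks
s_bad = modularity_score(C, [0, 1, 0, 1, 0, 1])   # cuts across the blocks
```

A partition that follows the block structure scores higher than one that cuts across it, which is the property a maximization algorithm exploits.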
Markov Cluster Algorithm
The Markov cluster (MCL) algorithm is a graph-based clustering algorithm, which
emulates random walks among gene vertices to detect clusters in a graph obtained
directly from the coexpression matrix C. It is implemented in the MCL software
(Van Dongen 2001; Enright et al. 2002). The MCL algorithm starts with the correlation matrix C as the probability flow matrix of a random
walk and then iteratively suppresses weak structures of the network and performs a
multistep random walk. In the end, only the backbones of the network structure remain,
essentially capturing the modules of the coexpression network. To be precise, the MCL
algorithm alternates between the following two operations on C:
• Inflation: The algorithm first contrasts stronger direct connections against weaker
ones, using an element-wise power law transformation, and normalizes each column separately to sum to one, such that the element $C_{ij}$ corresponds to the dissipation rate from vertex $X_i$ to $X_j$ in a single step. The inflation operation hence
updates C as $C \to \Gamma_{\alpha} C$, where the contrast rate $\alpha > 1$ is a predefined parameter
of the algorithm. After the operation $\Gamma_{\alpha}$, each element of C becomes

$$C_{ij} \to \Gamma_{\alpha} C_{ij} = \frac{C_{ij}^{\alpha}}{\sum_{p=1}^{k} C_{pj}^{\alpha}}.$$

• Expansion: The probability flow matrix C controls the random walks performed
in the expansion phase. After some integer $\beta \geq 2$ steps of random walk, gene
pairs with strong direct connections and/or strong indirect connections through
other genes tend to see more probability flow exchanges, suggesting higher probabilities of belonging to the same gene modules. The expansion operation for the
β-step random walk corresponds to the matrix power operation

$$C \to C^{\beta}.$$

The MCL algorithm performs the above two operations iteratively until convergence. Nonzero entries in the convergent matrix C connect gene pairs belonging to
the same cluster, whereas all inter-cluster edges attain the value zero, so that cluster
structure can be obtained directly from this matrix (Van Dongen 2001; Enright et al.
2002).
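The two operations can be sketched in a few lines of NumPy. This is a toy illustration and not the MCL software itself: it assumes that absolute correlations are used as nonnegative flows, uses a simple convergence check, and reads clusters off the nonzero rows of the converged matrix. The function name `mcl` and the default parameters are illustrative choices.

```python
import numpy as np

def mcl(C, alpha=2.0, beta=2, max_iter=100, tol=1e-8):
    """Toy Markov clustering on the absolute values of a correlation matrix."""
    M = np.abs(np.asarray(C, dtype=float))  # assumption: use |C| as flows
    for _ in range(max_iter):
        previous = M
        M = M ** alpha                         # inflation: element-wise power
        M = M / M.sum(axis=0, keepdims=True)   # ... and column normalization
        M = np.linalg.matrix_power(M, beta)    # expansion: beta-step walk
        if np.allclose(M, previous, atol=tol):
            break
    # Each row that retains probability mass marks one cluster, given by
    # the columns (genes) in which that row is nonzero.
    clusters = []
    for row in M:
        members = frozenset(int(j) for j in np.flatnonzero(row > 1e-6))
        if members and members not in clusters:
            clusters.append(members)
    return clusters

# Toy correlation matrix with two weakly connected blocks of three genes.
C = np.full((6, 6), 0.05)
C[:3, :3] = 0.9
C[3:, 3:] = 0.9
np.fill_diagonal(C, 1.0)
clusters = mcl(C)
```

On this toy matrix the weak inter-block flows are suppressed within a few iterations, leaving one cluster per block.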
Weighted Gene Coexpression Network Analysis
With higher than average correlation or edge densities within clusters, genes from
the same cluster typically share more neighboring (i.e., correlated) genes. The
weighted number of shared neighboring genes hence can be another measure of
gene function similarity. This information is captured in the so-called topological
overlap matrix Ω, first defined by Ravasz et al. (2002) for binary networks as

$$\omega_{ij} = \frac{A_{ij} + \sum_{u} A_{iu} A_{uj}}{\min(k_i, k_j) + 1 - A_{ij}},$$

where A is the (binary) adjacency matrix of the network and $k_i = \sum_{u} A_{iu}$ is the connectivity of vertex $X_i$. The $\sum_{u} A_{iu} A_{uj}$ term represents vertex similarity through neighboring genes, and the remaining terms normalize the output such that $0 \leq \omega_{ij} \leq 1$. This concept
was later extended to networks with weighted edges by applying a “soft threshold” preprocessing step to the correlation matrix, for example, as

$$A_{ij} = \left( \frac{1 + C_{ij}}{2} \right)^{\alpha}$$

or

$$A_{ij} = |C_{ij}|^{\alpha},$$

such that $0 \leq A_{ij} \leq 1$ (Zhang and Horvath 2005). Note that in the first case, only
positive correlations have high edge weight, whereas in the second case, positive
and negative correlations are treated equally. The parameter $\alpha > 1$ is determined
such that the weighted network with adjacency matrix A has approximately a scale-free
degree distribution (Zhang and Horvath 2005).
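A minimal sketch of the soft threshold followed by the topological overlap computation is given below. It is not the WGCNA implementation: the exponent `alpha=6.0` is an arbitrary illustrative value rather than one chosen by the scale-free criterion, self-edges are removed before computing connectivities, and the diagonal of the overlap matrix is set to one by convention. These choices and the function name are assumptions.

```python
import numpy as np

def topological_overlap(C, alpha=6.0, signed=False):
    """Soft-threshold a correlation matrix and return its topological overlap."""
    C = np.asarray(C, dtype=float)
    if signed:
        A = ((1.0 + C) / 2.0) ** alpha  # only positive correlations score high
    else:
        A = np.abs(C) ** alpha          # sign of the correlation is ignored
    np.fill_diagonal(A, 0.0)            # assumption: no self-edges
    shared = A @ A                      # sum_u A_iu A_uj
    k = A.sum(axis=1)                   # connectivities k_i
    omega = (A + shared) / (np.minimum.outer(k, k) + 1.0 - A)
    np.fill_diagonal(omega, 1.0)        # convention: a gene overlaps itself
    return omega

rng = np.random.default_rng(2)
C = np.corrcoef(rng.normal(size=(6, 30)))  # toy data: 6 genes, 30 samples
omega = topological_overlap(C)
```

The resulting matrix is symmetric with entries between 0 and 1, so it can be passed to any clustering method that accepts a similarity (or, as $1 - \omega_{ij}$, a dissimilarity) matrix.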
In principle, any clustering algorithm (including the aforementioned ones) can
be applied to the topological overlap matrix Ω. In the popular WGCNA software
(Langfelder and Horvath 2008), which is a multipurpose toolbox for
network analysis, hierarchical clustering with a dynamic tree-cut algorithm
(Langfelder et al. 2008) is used.
Model-Based Clustering
Model-based clustering approaches assume that the observed data are generated by
a mixture of probability distributions, one for each cluster, and explicitly take into
account the noise of gene expression data. To infer model parameters and cluster
assignments, techniques such as expectation maximization (EM) or Gibbs sampling
are used (Liu 2002). A recently developed method assumes that the expression levels of genes in a cluster are random samples drawn from a mixture of normal distributions, where each mixture component corresponds to a clustering of samples for
that module, i.e., it performs a two-way co-clustering operation (Joshi et al. 2008).
The method is available as part of the Lemon-Tree package and has been successfully used in a variety of applications (Bonnet
et al. 2015).
The co-clustering is carried out by a Gibbs sampler, which iteratively updates the
assignment of each gene and, within each gene cluster, the assignment of each
experimental condition. The co-clustering operation results in the full posterior distribution, which can be written as

$$p(\mathcal{C} \mid X) \propto \prod_{l=1}^{N} \prod_{u=1}^{L_l} \iint p(\mu, \tau) \prod_{i \in M_l} \prod_{m \in \mathcal{E}_{l,u}} p(X_{im} \mid \mu, \tau) \, d\mu \, d\tau,$$

where $\mathcal{C} = \{ M_l, \mathcal{E}_{l,u} : l = 1, \dots, N;\ u = 1, \dots, L_l \}$ is a co-clustering consisting of N gene
modules $M_l$, each of which has a set of $L_l$ sample clusters $\mathcal{E}_{l,u}$; $p(X_{im} \mid \mu, \tau)$ is
a normal distribution function with mean μ and precision τ, and p(μ, τ) is a noninformative normal-gamma prior. Detailed investigations of the convergence properties of the Gibbs sampler showed that the best results are obtained by deriving
consensus clusters from multiple independent runs of the sampler. In the Lemon-Tree
package, consensus clustering is performed by a novel spectral graph clustering algorithm (Michoel and Nachtergaele 2012) applied to the weighted graph of
pairwise frequencies with which two genes are assigned to the same gene module
(Bonnet et al. 2015).
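The weighted graph of pairwise frequencies used as input for consensus clustering is easy to construct. The sketch below only builds this co-assignment matrix from hypothetical module labels of several runs; it does not implement the Gibbs sampler or the spectral clustering step of Lemon-Tree, and the function name is illustrative.

```python
import numpy as np

def coassignment_frequency(runs):
    """Fraction of runs in which each pair of genes shares a module."""
    runs = np.asarray(runs)                      # shape: (n_runs, n_genes)
    same = runs[:, :, None] == runs[:, None, :]  # per-run co-membership
    return same.mean(axis=0)

# Hypothetical module labels of four genes in three independent runs.
runs = [[0, 0, 1, 1],
        [1, 1, 0, 0],   # same partition as the first run, labels swapped
        [0, 0, 0, 1]]
F = coassignment_frequency(runs)
```

Because only co-membership is compared, the arbitrary numbering of modules in each run does not matter.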
4 Causal Gene Networks

4.1 Using Genotype Data to Prioritize Edge Directions in Coexpression Networks

Pairwise correlations between gene expression traits define undirected coexpression
networks. Several studies have shown that pairs of gene expression traits can be
causally ordered using genotype data (Zhu et al. 2004; Chen et al. 2007; Aten et al.
2008; Schadt et al. 2005; Neto et al. 2008, 2013; Millstein et al. 2009). Although
varying in their statistical details, these methods conclude that gene A is causal for
gene B if the expression of B associates significantly with A’s eQTLs and this
association is abolished by conditioning on the expression of A and on any other
known confounding factors. In essence, this is the principle of “Mendelian randomization,” first introduced in epidemiology as an experimental design to detect causal
effects of environmental exposures on human health (Smith and Ebrahim 2003),
applied to gene expression traits.
To illustrate how these methods work, let A and B be two random variables representing two gene expression traits, and let E be a random variable representing a
SNP, which is an eQTL for genes A and B. Because genotype cannot be altered by
gene expression (i.e., E cannot have any incoming edges), there are three possible
regulatory models to explain the joint association of E to A and B:
1. $E \to A \to B$: the association of E to B is indirect and due to a causal interaction
from A to B.
2. $E \to B \to A$: the same, with the roles of A and B reversed.
3. $A \leftarrow E \to B$: A and B are independently associated to E.
To determine if gene A mediates the effect of SNP E on gene B (model 1), one
can test whether conditioning on A abolishes the correlation between E and B, using
the partial correlation coefficient
$$\mathrm{cor}(E, B \mid A) = \frac{\mathrm{cor}(E, B) - \mathrm{cor}(E, A)\,\mathrm{cor}(B, A)}{\sqrt{\left(1 - \mathrm{cor}(E, A)^{2}\right)\left(1 - \mathrm{cor}(B, A)^{2}\right)}}.$$
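The following sketch computes this partial correlation on simulated data and attaches an approximate two-sided p-value via Fisher's z-transform with a normal approximation. It is not the NEO implementation; the simulated effect sizes and the function names are illustrative assumptions.

```python
import math
import numpy as np

def partial_cor(e, b, a):
    """Partial correlation cor(E, B | A) from the three pairwise correlations."""
    r_eb = np.corrcoef(e, b)[0, 1]
    r_ea = np.corrcoef(e, a)[0, 1]
    r_ba = np.corrcoef(b, a)[0, 1]
    return (r_eb - r_ea * r_ba) / math.sqrt((1 - r_ea ** 2) * (1 - r_ba ** 2))

def fisher_z_pvalue(r, n, n_conditioned=1):
    """Approximate two-sided p-value for H0: (partial) correlation is zero."""
    z = math.atanh(r) * math.sqrt(n - n_conditioned - 3)
    return math.erfc(abs(z) / math.sqrt(2))

# Simulate model 1 (E -> A -> B) with illustrative effect sizes.
rng = np.random.default_rng(3)
n = 500
e = rng.integers(0, 2, size=n).astype(float)
a = 2.0 * e + rng.normal(size=n)
b = a + rng.normal(size=n)

r_given_a = partial_cor(e, b, a)  # expected to be near zero under model 1
r_given_b = partial_cor(e, a, b)  # expected to stay clearly nonzero
p_given_b = fisher_z_pvalue(r_given_b, n)
```

In this simulation, conditioning on A removes the association between E and B, whereas conditioning on B leaves a strong association between E and A.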


If model 1 is correct, then $\mathrm{cor}(E, B \mid A)$ is expected to be zero, and this can be
tested, for example, using Fisher’s Z transform to assess the significance of a sample
correlation coefficient. The same approach can be used to test model 2, and if neither is significant, it is concluded that no inference on the causal direction between
A and B can be made (using SNP E), i.e., that model 3 is correct. For more details,
see Aten et al. (2008), who have implemented this approach in the NEO software.
Other approaches are based on the same principle but use statistical model selection to identify the most likely causal model, with the probability density functions
(PDFs) for the models as follows:

• $p(E, A, B) = p(E)\, p(A \mid E)\, p(B \mid A)$,
• $p(E, A, B) = p(E)\, p(B \mid E)\, p(A \mid B)$,
• $p(E, A, B) = p(E)\, p(A \mid E)\, p(B \mid E, A)$,
where the dependence on A in the last term of the last model indicates that there
may be a residual correlation between B and A not explained by E. The minimal
additive model assumes the distributions are (Schadt et al. 2005)

$$\begin{aligned}
E &\sim \mathrm{Bernoulli}(q), \\
A \mid E &\sim N\left(\mu_{A|E}, \sigma_A^{2}\right), \\
B \mid A &\sim N\left(\mu_B + \rho \frac{\sigma_B}{\sigma_A}(A - \mu_A),\ (1 - \rho^{2})\,\sigma_B^{2}\right), \\
B \mid E, A &\sim N\left(\mu_{B|E} + \rho \frac{\sigma_B}{\sigma_A}(A - \mu_{A|E}),\ (1 - \rho^{2})\,\sigma_B^{2}\right),
\end{aligned}$$

so that E follows a Bernoulli distribution, A | E follows a normal distribution
whose mean depends on E, and B | A has a conditional normal distribution
whose mean and variance are determined in part by A. For B | E, A, the mean
of B also depends on E. The parameters of all distributions can be estimated by
maximum likelihood, and the model with the highest likelihood is selected as
the most likely causal model. The number of free parameters can be accounted
for using penalties such as the Akaike information criterion (AIC) (Schadt et al.
2005).
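A simplified sketch of this likelihood-based comparison is shown below. It assumes that each conditional normal distribution can be fitted as an ordinary least-squares regression (an equivalent reparametrization in terms of intercept, slope, and residual variance), and the parameter counts of 7, 7, and 8 are our own tally under that assumption. It is not the implementation of Schadt et al. (2005).

```python
import numpy as np

def gaussian_loglik(y, predictors):
    """Maximized log-likelihood of y ~ N(intercept + predictors . coef, s2)."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + list(predictors))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / n  # maximum likelihood variance estimate
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)

def causal_model_aic(e, a, b):
    """AIC and log-likelihood of E->A->B, E->B->A and A<-E->B (in that order)."""
    q = e.mean()
    ll_e = np.sum(e * np.log(q) + (1 - e) * np.log(1 - q))  # Bernoulli term
    loglik = np.array([
        ll_e + gaussian_loglik(a, [e]) + gaussian_loglik(b, [a]),
        ll_e + gaussian_loglik(b, [e]) + gaussian_loglik(a, [b]),
        ll_e + gaussian_loglik(a, [e]) + gaussian_loglik(b, [e, a]),
    ])
    n_params = np.array([7, 7, 8])  # assumed free-parameter counts
    return 2 * n_params - 2 * loglik, loglik

# Simulate E -> A -> B with illustrative effect sizes.
rng = np.random.default_rng(3)
n = 500
e = rng.integers(0, 2, size=n).astype(float)
a = 2.0 * e + rng.normal(size=n)
b = a + rng.normal(size=n)
aic, loglik = causal_model_aic(e, a, b)
```

Because the third model contains the first as a special case, its likelihood can never be lower, which is why a penalty on the number of parameters is needed for a fair comparison.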
The approach has been extended in various ways. In the study of Chen et al.
(2007), likelihood ratio tests, comparison to randomly permuted data, and false discovery rate estimation techniques are used to convert the three model scores into a
single probability value $P(A \to B)$ for a causal interaction from gene A to B. This
method is available in the Trigger software. In the studies of Millstein et al. (2009) and
Neto et al. (2013), the model selection task is recast into a single hypothesis test,
using F-tests and Vuong’s model selection test, respectively, resulting in a significance p-value for each gene–gene causal interaction.
It should be noted that all of these approaches suffer from limitations due to their
inherent model assumptions. In particular, the presence of unequal levels of measurement noise among genes, or of hidden regulatory factors causing additional
correlation among genes, can confuse causal inference. For example, an excessive
error level in the expression data of gene A may cause the true structure
$E \to A \to B$ to be mistaken for $E \to B \to A$. These limitations are discussed by Rockman (2008)
and Li et al. (2010).

4.2 Using Bayesian Networks to Identify Causal Regulatory Mechanisms

Bayesian networks are probabilistic graphical models that encode conditional
dependencies between random variables in a directed acyclic graph (DAG).
Although Bayesian networks cannot fully reflect certain pathways in gene regulation, such as self-regulation or feedback loops, they still serve as a popular method
for modeling gene regulation networks, as they provide a clear methodology for
learning statistical dependency structures from possibly noisy data (Friedman et al.
1999a, 2000; Koller and Friedman 2009).



We adopt our previous convention from Section 2, where we have the gene expression data X and genetic markers G. The model contains a total of k vertices (i.e.,
random variables), $X_i$ with $i = 1, \dots, k$, corresponding to the expression level of gene
i. Given a DAG $\mathcal{G}$, and denoting the parental vertex set of $X_i$ by $\mathrm{Pa}^{(\mathcal{G})}(X_i)$, the
acyclic property of $\mathcal{G}$ allows us to define the joint probability distribution function as

$$p(X_1, \dots, X_k \mid \mathcal{G}) = \prod_{i=1}^{k} p\left(X_i \mid \mathrm{Pa}^{(\mathcal{G})}(X_i)\right). \tag{2}$$

In its simplest form, we model the conditional distributions as

$$p\left(X_i \mid \mathrm{Pa}^{(\mathcal{G})}(X_i)\right) = N\left(\alpha_i + \sum_{X_j \in \mathrm{Pa}^{(\mathcal{G})}(X_i)} \beta_{ji}\,(X_j - \alpha_j),\ \sigma_i^{2}\right),$$

where $(\alpha_i, \sigma_i)$ and $\beta_{ji}$ are parameters for vertex $X_i$ and edge $X_j \to X_i$, respectively, as
part of the DAG structure $\mathcal{G}$. Under such modeling, the Bayesian network is called
a linear Gaussian network.
The likelihood of data X given the graph $\mathcal{G}$ is

$$p(X \mid \mathcal{G}) = \prod_{i=1}^{k} \prod_{l=1}^{n} p\left(X_{il} \mid \{ X_{jl} : X_j \in \mathrm{Pa}^{(\mathcal{G})}(X_i) \}\right).$$

Using Bayes’ rule, the log-posterior probability of the DAG $\mathcal{G}$ given the gene expression
data X becomes

$$\log p(\mathcal{G} \mid X) = \log p(X \mid \mathcal{G}) + \log p(\mathcal{G}) - \log p(X),$$

where $p(\mathcal{G})$ is the prior probability for $\mathcal{G}$, and p(X) is a constant once the expression data are given, so the subsequent calculations do not depend on it.
Typically, a locally optimal DAG is found by starting from a random graph and
randomly ascending the likelihood by adding, modifying, or removing one directed
edge at a time (Friedman et al. 1999a, 2000; Koller and Friedman 2009).
Alternatively, the posterior distribution $p(\mathcal{G} \mid X)$ can be estimated with Bayesian
inference using Markov chain Monte Carlo simulation, allowing us to estimate the
significance levels at an extra computational cost. The parameter values of α, β, and
σ, as part of $\mathcal{G}$, can be estimated with maximum likelihood.
When a Bayesian network is modified by a single edge, only the vertices whose
parent sets change require a recalculation, whereas all others remain intact.
This significantly reduces the amount of computation needed for each random step.
A further speedup is achievable if we constrain the maximum number of parents
each vertex can have, either by using the same fixed number for all nodes or by
preselecting a variable number of potential parents for each node using, for instance,
a preliminary L1-regularization step (Schmidt et al. 2007).
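The decomposition of the likelihood into one term per vertex can be sketched as follows for a linear Gaussian network. Each vertex is scored by the maximized log-likelihood of a regression on its parents, so adding an edge only requires rescoring the child vertex. The function names and the toy data are illustrative; a full structure search would additionally check acyclicity and include the prior.

```python
import numpy as np

def node_loglik(X, i, parents):
    """Maximized Gaussian log-likelihood of gene i given its parent genes."""
    n = X.shape[1]                       # X has shape (genes, samples)
    design = np.column_stack([np.ones(n)] + [X[j] for j in parents])
    coef, *_ = np.linalg.lstsq(design, X[i], rcond=None)
    resid = X[i] - design @ coef
    sigma2 = resid @ resid / n           # maximum likelihood variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)

def local_scores(X, parent_sets):
    """One log-likelihood term per vertex; their sum is log p(X | G)."""
    return {i: node_loglik(X, i, sorted(pa)) for i, pa in parent_sets.items()}

rng = np.random.default_rng(4)
n = 200
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(scale=0.5, size=n)  # X1 depends on X0
x2 = rng.normal(size=n)
X = np.vstack([x0, x1, x2])

parents = {0: set(), 1: set(), 2: set()}       # start from the empty DAG
scores = local_scores(X, parents)
total_before = sum(scores.values())

# Add the edge X0 -> X1: only the term of vertex 1 has to be recomputed.
parents[1].add(0)
scores[1] = node_loglik(X, 1, sorted(parents[1]))
total_after = sum(scores.values())
```

Caching the per-vertex terms in this way is what keeps each step of a greedy or random search cheap.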
Two DAGs are called Markov equivalent if they result in the same PDF (Koller
and Friedman 2009). Clearly, using gene expression data alone, Bayesian networks
can only be resolved up to Markov equivalence. To break this equivalence and
uncover a more specific causal gene regulation network, genotype data are
incorporated in the model inference process. The most straightforward approach is
to use any of the methods in the previous section to calculate the probability
$P(X_i \to X_j)$ of a causal interaction from $X_i$ to $X_j$ (Zhu et al. 2004, 2008, 2012;
Zhang et al. 2013), for example, by defining the prior as

$$p(\mathcal{G}) = \prod_{X_i} \left( \prod_{X_j \in \mathrm{Pa}^{(\mathcal{G})}(X_i)} P(X_j \to X_i) \prod_{X_j \notin \mathrm{Pa}^{(\mathcal{G})}(X_i)} \bigl(1 - P(X_j \to X_i)\bigr) \right).$$

A more ambitious approach is to jointly learn the eQTL associations and causal trait (i.e., gene or
phenotype) networks. In the study of Neto et al. (2010), EM is used to alternatingly
map eQTLs given the current DAG structure and update the DAG structure and
model parameters given the current eQTL mapping. In the study of Scutari et al.
(2014), Bayesian networks are learned where SNPs and traits both enter as variables
in the model, with the constraint that traits can depend on SNPs, but not vice versa.
However, the additional complexity of both methods means that they are computationally expensive and have only been applied to problems with a handful of traits
(Neto et al. 2010; Scutari et al. 2014).
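The edge-wise structure prior described above can be evaluated on the log scale as in the following sketch. It assumes that the causal probabilities are stored in a matrix `P` with `P[j, i]` equal to $P(X_j \to X_i)$ and that self-edges are excluded from the product; both conventions and the function name are ours.

```python
import numpy as np

def log_structure_prior(parent_sets, P):
    """log p(G), where P[j, i] is the probability of a causal edge X_j -> X_i."""
    k = P.shape[0]
    log_prior = 0.0
    for i in range(k):
        for j in range(k):
            if j == i:
                continue                           # assumption: no self-edges
            if j in parent_sets[i]:
                log_prior += np.log(P[j, i])       # edge present in the DAG
            else:
                log_prior += np.log1p(-P[j, i])    # edge absent from the DAG
    return log_prior

# Toy example: genotype data strongly support X0 -> X1 and little else.
P = np.full((3, 3), 0.1)
P[0, 1] = 0.9
lp_forward = log_structure_prior({0: set(), 1: {0}, 2: set()}, P)  # X0 -> X1
lp_reverse = log_structure_prior({0: {1}, 1: set(), 2: set()}, P)  # X1 -> X0
```

Although the two toy graphs are Markov equivalent and therefore receive the same likelihood from expression data alone, the prior favors the direction supported by the genotype-based probabilities.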
A few additional “tips and tricks” are worth mentioning:

• First, when the number of vertices is much larger than the sample count, we may
break the problem into independent subproblems by learning a separate Bayesian
network for each coexpression module (Section 3.1 and Zhang et al. 2013).
Dependencies between modules could then be learned as a Bayesian network
among the module eigengenes (Langfelder and Horvath 2007), although this
does not seem to have been explored.
• Second, Bayesian network learning algorithms inevitably result in locally optimal models, which may contain a high number of false positives. To address this
problem, we can run the algorithm multiple times and report an averaged network, only consisting of edges that appear sufficiently frequently.
• Finally, another technique that helps in distinguishing genuine dependencies
from false positives is bootstrapping, where resampling with replacement is executed on the existing sample pool. A fixed number of samples are randomly
selected and then processed to predict a Bayesian network. This process is
repeated many times, essentially regarding the distribution of the sample pool as the
true PDF, and allowing us to estimate the robustness of each predicted edge, so that
only those with high significance are retained (Friedman et al. 1999b). In theory,
even the whole pipeline of Fig. 1 up to the in silico validation could be simulated
in this way. Although bootstrapping is computationally expensive and mostly
suited for small data sets, it could be used in conjunction with the separation into
modules on larger data sets.
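The bootstrap procedure in the last point can be sketched generically. The network learner is passed in as a function; here a simple correlation threshold stands in for a real Bayesian network learner, purely as a placeholder. The function names, the number of bootstrap replicates, and the cutoff are illustrative assumptions.

```python
import numpy as np

def bootstrap_edge_frequency(X, learn_edges, n_boot=100, seed=0):
    """Fraction of bootstrap replicates in which each edge is predicted."""
    rng = np.random.default_rng(seed)
    k, n = X.shape                        # genes x samples
    counts = np.zeros((k, k))
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample samples with replacement
        counts += learn_edges(X[:, idx])
    return counts / n_boot

def threshold_learner(X, cutoff=0.5):
    """Placeholder learner: an 'edge' wherever |correlation| exceeds cutoff."""
    edges = (np.abs(np.corrcoef(X)) > cutoff).astype(float)
    np.fill_diagonal(edges, 0.0)
    return edges

rng = np.random.default_rng(5)
x0 = rng.normal(size=100)
x1 = x0 + 0.1 * rng.normal(size=100)  # strongly dependent on x0
x2 = rng.normal(size=100)             # independent of both
X = np.vstack([x0, x1, x2])
freq = bootstrap_edge_frequency(X, threshold_learner)
```

Edges with a frequency close to one are robust to resampling, whereas edges that appear in only a few replicates are candidates for removal.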

4.3 Using Module Networks to Identify Causal Regulatory Mechanisms

Module network inference is a statistically well-grounded method that uses probabilistic graphical models to reconstruct modules of coregulated genes and their upstream
regulatory programs and that has been proven useful in many biological case studies

