Tải bản đầy đủ (.pdf) (16 trang)

DSpace at VNU: Modeling protein evolution with several amino acid replacement matrices depending on site rates

Bạn đang xem bản rút gọn của tài liệu. Xem và tải ngay bản đầy đủ của tài liệu tại đây (720.06 KB, 16 trang )

Modeling Protein Evolution with Several Amino Acid
Replacement Matrices Depending on Site Rates
Si Quang Le,1,2 Cuong Cao Dang,3 and Olivier Gascuel*,1
1

Me´thodes et Algorithmes pour la Bioinformatique (LIRMM & IBC), Centre National de la Recherche Scientifique (CNRS)–
Universite´ Montpellier II, Montpellier Cedex 5, France
2
Wellcome Trust Sanger Institute, Genome Campus, Hinxton, United Kingdom
3
University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
*Corresponding author: E-mail:
Associate editor: Jeffrey Thorne

Abstract

Key words: amino acid substitutions, replacement matrices, gamma and distribution-free rate models, maximum
likelihood estimations, phylogenetic inference.

Introduction
Amino acid replacement matrices—20 Â 20 matrices containing estimates of the instantaneous substitution rates of
any amino acid by another—are essential in most methods
to infer protein phylogenies. These matrices are expected to
capture the biological and physicochemical properties of
amino acids. They are used in distance-based methods to estimate the evolutionary distance—the expected number of
substitutions per site—between sequence pairs. In maximum
likelihood (ML) and Bayesian methods, they are used to compute substitution probabilities along tree branches and hence
the likelihood of the data (see textbooks, e.g., Felsenstein 2003;
Yang 2006).
The standard approach to infer protein phylogenies is
based on the use of a single replacement matrix. Several


general matrices estimated from very large sets of taxa
and alignments have been proposed since the pioneering
work of Dayhoff et al. (1972), notably JTT (Jones et al.
1992), WAG (Whelan and Goldman 2001), and LG (Le
and Gascuel 2008). Some studies showed that specific matrices should be used for certain analyses, for example, with

membrane (Jones et al. 1994) or mitochondrial (Yang et al.
1998) proteins, but general matrices are usually robust and
tend to perform well in many cases (Keane et al. 2006). However, site evolution is highly heterogeneous and depends on
many factors such as genetic code, solvent accessibility, secondary and tertiary structure, and protein functions. Most
notably, some sites are subject to strong evolutionary pressure
and evolve slowly due to their role in the structure or functions of the protein, whereas others are much less constrained
and accumulate substitutions rapidly. In the standard
approach, this variability is modeled by discrete gamma
rate categories, which are used to modulate the (unique)
replacement matrix being selected, depending on the
site rates (Yang 1993). As site rates are unknown, all rates
are envisaged for every site and accounted for thanks to
a mixture approach (see textbooks, e.g., Gascuel and
Guindon 2007).
However, many works revealed that depending on site
specificities not only do the global rates vary but also the
substitution patterns. Notably, buried sites (typically slow)
and exposed sites (typically fast) obey very different matrix

© The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please
e-mail:

Mol. Biol. Evol. 29(10):2921–2936. 2012 doi:10.1093/molbev/mss112


Advance Access publication April 6, 2012

2921

Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015
Research
article

Most protein substitution models use a single amino acid replacement matrix summarizing the biochemical properties of
amino acids. However, site evolution is highly heterogeneous and depends on many factors that influence the substitution
patterns. In this paper, we investigate the use of different substitution matrices for different site evolutionary rates. Indeed,
the variability of evolutionary rates corresponds to one of the most apparent heterogeneity factors among sites, and there
is no reason to assume that the substitution patterns remain identical regardless of the evolutionary rate. We first
introduce LG4M, which is composed of four matrices, each corresponding to one discrete gamma rate category (of four).
These matrices differ in their amino acid equilibrium distributions and in their exchangeabilities, contrary to the standard
gamma model where only the global rate differs from one category to another. Next, we present LG4X, which also uses
four different matrices, but leaves aside the gamma distribution and follows a distribution-free scheme for the site rates. All
these matrices are estimated from a very large alignment database, and our two models are tested using a large sample of
independent alignments. Detailed analysis of resulting matrices and models shows the complexity of amino acid
substitutions and the advantage of flexible models such as LG4M and LG4X. Both significantly outperform single-matrix
models, providing gains of dozens to hundreds of log-likelihood units for most data sets. LG4X obtains substantial gains
compared with LG4M, thanks to its distribution-free scheme for site rates. Since LG4M and LG4X display such advantages
but require the same memory space and have comparable running times to standard models, we believe that LG4M
and LG4X are relevant alternatives to single replacement matrices. Our models, data, and software are available from
/>

MBE

Le et al. · doi:10.1093/molbev/mss112


2922

WAGþC4, and LGþC4) in terms of tree likelihood and
often infers different tree topologies.
We then examine the limitation of constraining rates to
a gamma distribution by testing a model of four matrices
where rates and weights of the four matrices are freely
estimated. To this end, we estimate another 4-matrix
model (LG4X), where rates and weights are left out of
the gamma distribution assumption. Experimental results
show that LG4X is significantly better than LG4M and comparable to the two-level mixture models from Le, Lartillot,
and Gascuel (2008) and Le and Gascuel (2010) and at the
same time much simpler.
These results, combined with low computing times and
memory consumption, suggest that LG4M and LG4X are
relevant alternatives to standard single-matrix models in
inferring phylogenetic trees from protein sequences. In
the following, we first describe our data, then our models
and their estimation procedures, and lastly provide comparisons with independent testing alignments.

Data Sets
To estimate LG4M and LG4X, we used alignments
extracted from the Homology-Derived Secondary Structure
of Proteins database (HSSP; Schneider et al. 1997). HSSP
comprises ;50,000 alignments of protein families, each
containing an average of ;550 members. Each alignment
is obtained by aligning a protein with known 3D structure
in the Protein Data Bank (PDB; Berman et al. 2000) to all its
probable sequence homologs in UNIPROT. The protein
with known structure is called the ‘‘test protein’’ of the

alignment. HSSP alignments contain a huge number of gaps
due to absent or unsequenced domains for some proteins.
Consequently, we cleaned each alignment by selecting
sequences that were well aligned, sufficiently different
one from the other, and had 40–99% identities with the
test protein. Gapped regions among selected sequences
were eliminated using GBLOCKS (Castresana 2000) with
default options, and we removed alignments with less than
10 selected sequences or 100 remaining sites. We also left
out membrane proteins (based on their presence in the
Membrane PDB; Raman et al. 2006) since their amino acid
replacement pattern is highly different to that of globular
proteins (Jones et al. 1994). Moreover, HSSP is highly redundant because a protein sequence may appear in more than
one alignment depending on its homologs with known
structure in PDB. Thus, we retained only independent
alignments that do not share any sequence. To this end,
we used a heuristic algorithm to find a large number of
independent alignments containing a large number of sites
with few gaps (Le, Lartillot, and Gascuel 2008). This selection procedure resulted in 1,771 nonredundant alignments,
with an average of ;56 sequences and ;254 sites per alignment, a total of ;27 million amino acids and less than 0.1%
gaps. We randomly picked 1,471 alignments to estimate
LG4M and LG4X and used the remaining 300 for model
comparison. These alignments were the same as those used
to estimate and test our two-level mixtures of profiles and
matrices, and our structure-informed models (Le, Gascuel,

Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015

models (Koshi and Goldstein 1995; Lio et al. 1998; Goldman
et al. 1998; Holmes and Rubin 2002; Le, Lartillot, and

Gascuel 2008; Le and Gascuel 2010). To a lesser extent,
it was also shown that substitution processes vary among
secondary structures (Koshi and Goldstein 1995; Thorne
et al. 1996; Le, Lartillot, and Gascuel 2008; Le and Gascuel
2010). All of these works (and others) thus explored sitedependent models using several matrices or profiles.
In the profile approach (Koshi and Goldstein 1998;
Lartillot and Philippe 2004; Le, Gascuel, and Lartillot
2008), sets of elementary models defined by their amino
acid equilibrium frequencies are used; these models rely
on simple multinomial processes over the 20 amino acids—
analogous to the (Felsenstein 1981) model of DNA
substitution—and do not use replacement matrices (or only
highly simplified ones). In the multimatrix approach (Koshi
and Goldstein 1995; Thorne et al. 1996; Goldman et al.
1998; Le, Lartillot, and Gascuel 2008), different matrices are
used for different site categories. The model introduced by
Wang et al. (2008) is a compromise between these two approaches as it uses several (full range) matrices that only differ
in their amino acid equilibrium distributions (for a similar
model, see also Lartillot and Philippe 2004). In all cases, the
set of profiles or matrices is combined thanks to a mixture approach or a Hidden Markov Model (HMM; Felsenstein and
Churchill 1996; Thorne et al. 1996). In recent studies (Le, Gascuel,
and Lartillot 2008; Le, Lartillot, and Gascuel 2008; Wang et al.
2008), this first-level mixture is combined with a second-level
mixture corresponding to the standard gamma rate categories.
This combination was shown to be quite accurate but is computationallyheavyasboththecomputingtimeandthememory
consumption are roughly proportional (see, e.g., Bryant et al.
2005) to the number of site categories (e.g., 12 with 4 gamma
categories and 3 biochemical categories).
In this paper, we investigate simpler models, where sites are
categorized depending on their evolutionary rate, and different replacement matrices are used for each site category.

Indeed, the variability of evolutionary rates corresponds to
one of the most apparent heterogeneity factor among sites,
and there is no reason to suppose (as in the standard
approach) that the substitution pattern remains identical
regardless the evolutionary rate. For example, we expect
slow sites to be mostly hydrophobic (and fast sites to be
hydrophilic), which implies that the amino acid equilibrium frequencies should vary depending on the site rate.
Investigated models thus focus on an essential site heterogeneity factor. They refine the standard gamma model by
using several different replacement matrices, instead of
only one modulated by a global rate. However, these models are less complex than two-level mixtures as they use
a single mixture level enabling fair computing times and
low memory consumption.
We first verify the use of different matrices for different
evolutionary rates that follow a discrete gamma distribution. To this end, we estimate a 4-matrix model (LG4M),
where each matrix corresponds to one standard gamma
rate category (C4; Yang 1993). Experimental results show
that LG4M outperforms single-matrix models (JTTþC4,


MBE

Modeling Protein Evolution with Site Rate–Dependent Matrices · doi:10.1093/molbev/mss112

Models

i (Di) given T and Q. L(T,Q;Di) is computed efficiently thanks to
the pruning algorithm (Felsenstein 1981).
Yang (1993) introduced a mixture model based on
a single replacement matrix but variable rates across sites
following a discrete gamma distribution with K equally

weighted rate categories. With K 5 4, the data likelihood
is given by
LðT; Q; a; DÞ 5

4
Y 1 X

4 k51

i

i

where the product runs over all the sites (independence
assumption), and L(T,Q;Di) is the likelihood of the data at site

ð2Þ

where C(a,k) is the kth rate of a discrete gamma distribution
with parameter a. The weights (or contributions) of rate categories are all equal to 1/K. Both T and a are estimated by
maximizing likelihood (2). Variants of this model include
nonequally weighted gamma rate categories (e.g., Susko
et al. 2003; Mayrose et al. 2005), an approach that could
be further investigated to improve models presented here.
Multimatrix models were proposed by several authors
(e.g., Koshi and Goldstein 1995; Thorne et al. 1996; Goldman
et al. 1998) to account for the secondary structure and
solvent accessibility. With such models, the data likelihood in a mixture context is expressed as
LðT; Q 5 fQ1 ; . . . ; QM g; w 5 fw1 ; . . . ; wM g; DÞ
M


Y X
wm L½T; Qm ; Di Š ;
5
i

All amino acid substitution matrices discussed here comply
with the general time-reversible model (see textbooks,
e.g., Bryant et al. 2005; Yang 2006). Such a matrix is denoted
Q 5 (qxy), where qxy is the substitution rate from amino
acid x to amino acid y(6¼x); diagonal termsP
are set such that
the row sums are all zero, that is, qxx 5 À y6¼x qxy . Thanks
to time reversibility, Q can be decomposed into the
symmetric exchangeability matrix R5ðrx4y Þ and the
amino acid equilibrium distribution p 5 (px), using
qxy 5py rx4y ðx 6¼ yÞ. The amino acid distribution (p)
may be estimated from the training alignments and is then
called the model equilibrium distribution or from the data
analyzed (þF option). With single-matrix models, Q and R
are normalized such that one time unit
Pcorresponds to one
substitution per site, that is, q5 À x qxx px 51:0, where
q is the global rate of Q. This constraint is released with
some multimatrix models (e.g., Le, Lartillot, and Gascuel
2008), where some site categories and matrices are fast with
a high global rate and some others are slow with a low
q value. Here, we use normalized matrices only but modulate their global rate using external parameters with values
fitted on the analyzed alignment (see below). A matrix Q
then contains 208 free parameters (190 in R þ 19 in p À 1

normalization constraint).
The likelihood of the data (denoted D) for a given tree T
(including branch lengths) and replacement matrix Q is
Y
LðT; Q; DÞ 5
LðT; Q; Di Þ;
ð1Þ


L½T; Cða; kÞQ; Di Š ;

ð3Þ

m51

where M is the number of matrices
and wm is the weight of
P
w
matrix Qm, with constraint M
m51 m 51.
Recent works (e.g., Le, Lartillot, and Gascuel 2008; Wang
et al. 2008) combined Yang’s model (2) with the above (3)
multiple-matrix model:
LðT; Q 5 fQ1 ; . . . ; QM g; w 5 fw1 ; . . . ; wM g; a; DÞ
M
K

Y X
wm X

5
L½T; Cða; kÞQm ; Di Š ; ð4Þ
K k51
m51
i
PM
where constraint
m51 wm 51 still holds. Equation (4)
expresses two levels of mixture, one for gamma distributed
rate categories and one for multiple substitution matrices.
In this framework, we introduced several supervised and unsupervised models, for example, (supervised) EX2 with two
matrices for buried and exposed sites and (unsupervised or
‘‘blind’’) UL3 based on three matrices that were estimated
without a priori knowledge on site categorization (Le, Lartillot,
and Gascuel 2008). The same framework was used by Wang
et al. (2008) in a 5-matrix model, where all matrices were
based on the same JTT or WAG exchangeability matrix (R)
but used different amino acid equilibrium distributions (p).
Although the above models (EX2, UL3, etc.) perform well
and provide high likelihood values, they are computationally expensive in terms of both computing time and memory consumption. This is mainly due to their high number
of site categories, for example, 12 with UL3 and 4 gamma
categories. In this paper, we explore simplifications of equation (4). In our LG4M model, we assume four equally
weighted gamma rate categories and use four matrices,
one for each rate category. Let Q 5 {Q1,Q2,Q3,Q4} be
the set of these four matrices, where Q1 stands for the
2923

Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015

and Lartillot 2008; Le, Lartillot, and Gascuel 2008; Le and

Gascuel 2010). Additional details on the selection procedure are provided in these references, and the training
and test alignments are available from />To assess the performance of our models, we used the
300 HSSP test alignments and another set of independent
alignments extracted from TreeBase (Sanderson et al.
1994). This database contains alignments that were produced especially for phylogenetic analyses and thus provide
a good benchmark for comparing models meant for phylogenetic reconstruction. Moreover, the use of test alignments from a different database should avoid possible
biases induced by some feature specific to our HSSP training
alignments. We took all (113) most recently updated TreeBase globular protein alignments and then removed those
including too many gaps (.45%) or showing a too high level
of sequence divergence (average number of amino acids per
site . 8, presence in the ML tree of one or several branches
with length . 2.0, or average ML tree branch length . 0.50).
We retained 84 alignments, with size ranging from small, single protein alignments (e.g., 7 taxa and 232 sites), to very
large concatenated protein alignments (e.g., 62 taxa and
11,544 sites). These TreeBase test alignments are also available from />

MBE

Le et al. · doi:10.1093/molbev/mss112

matrix corresponding to slowest category and Q4 that of
the fastest one. Data likelihood is expressed as
LðT; Q; a; DÞ 5

4
Y 1 X
i

4 k51



L½T; Cða; kÞQk ; Di Š :

ð5Þ

LðT; Q; q 5 fq1 ; q2 ; q3 ; q4 g; w 5 fw1 ; w2 ; w3 ; w4 g; DÞ
4

Y X
wk L½T; qk Qk ; Di Š ; ð6Þ
5
i

k51

wherePwk and qk are the
P weight and rate of matrix Qk such
that 4k51 wk 51 and 4k51 wk qk 51. The latter normalization constraint is needed to get 1.0 substitution per site within
one time unit, just as in standard single-matrix models (this
normalization is implicit in LG4M). This model thus involves
three free parameters among weights wk plus three free parameters among rates qk, which are estimated by maximizing likelihood (6) on the data set analyzed.

Model Estimation
We have a set of N protein alignments denoted D 5
{D1, . . .,DN}, where Da is an alignment. We aim to estimate
a 4-matrix model Q* 5 (Q1*,Q2*,Q3*,Q4*) that maximizes the
likelihood of D:

Y
N

Ã
a
a
a
a
Q 5
arg max
LðT ; Q; q ; w ; D Þ ; ð7Þ
Q 5 ðQ1 ;Q2 ;Q3 ;Q4 Þ;T;P;W

a51

P5ðq1 ; . . . ; qN Þ,
and
where
T5ðT 1 ; . . . ; T N Þ,
W5ðw1 ; . . . ; wN Þ are the trees, rates, and weights of the N
alignments, respectively; LðT a ; Q; qa ; wa ; Da Þ is the likelihood
of Da given model Q, tree Ta, rates qa 5ðqa1 ; . . . ; qa4 Þ, and
Thus,
to
estimate
weights
wa 5ðwa1 ; . . . ; wa4 Þ.
Q Ã 5ðQÃ1 ; QÃ2 ; QÃ3 ; QÃ4 Þ, we also need to estimate TÃ , PÃ , and
WÃ , which optimize likelihood (7). We are interested in
two 4-matrix models: LG4M where LðT a ; Q; qa ; wa ; Da Þ is
calculated using equation (5) and LG4X that is based on equation (6). For each alignment Da, qa, and wa of LG4M follow
a discrete gamma distribution with four equally weighted
rate categories, whereas

P parameters are freely
P in LG4X, these
estimated such that 41 wak 51 and 41 wak qak 51, without any
additional constraint.
2924

"Da : ðT a ; qa ; wa Þ 5 arg maxT;q;w fLðT; Q; q; w; Da Þg: ð8Þ

For this purpose, we use an adaptation of PhyML 3.0
(Guindon et al. 2010) that is described below. Having
obtained T*, P*, and W*, we search for Q* that maximizes
the likelihood of the data given T*, P*, and W*:

Y
N
Q Ã 5ðQÃ1 ; QÃ2 ; QÃ3 ; QÃ4 Þ5 arg max
LðT a ; Q; qa ; Ea ; Da Þ :
Q 5 ðQ1 ;Q2 ;Q3 ;Q4 Þ

a51

ð9Þ

It is impractical to optimize (Q1*,Q2*,Q3*,Q4*) directly from
equation (9) due to the huge number (4 Â 208) of free
parameters in Q. Consequently, we use the approximate
learning method proposed in Le and Gascuel (2008)
and Le, Lartillot, and Gascuel (2008), where
Q Ã 5ðQÃ1 ; QÃ2 ; QÃ3 ; QÃ4 Þ is handled by simplifying the site
likelihood in equation (9) using the site rate category

with maximum posterior probability (MAP) only, instead
of summing overall rate categories, that is,
Q Ã 5 ðQÃ1 ; QÃ2 ; QÃ3 ; QÃ4 Þ 5

YY
LðT a ; qaci Qci ; Dai Þ ;
arg max
Q 5 ðQ1 ;Q2 ;Q3 ;Q4 Þ

a

ð10Þ

i

where Dai is the ith site of alignment Da, ci is the MAP rate
category (computed during tree estimation) for site Dai , and
qci is the rate of ci corresponding to Qci substitution matrix.
Equation (10) can then be rewritten as
"k 5 1 . . . 4;
QÃk

5 arg maxQk

Y Y

LðT

a


; qak Qk ; Dai Þ


:

ð11Þ

a i:ci 5 k

In other words, every Qk is estimated independently.
To achieve these estimations, we used XRate (Holmes
and Rubin 2002; Klosterman et al. 2006) with the same
search options as in Le and Gascuel (2008) and Le, Lartillot,
and Gascuel (2008). Notably, we used the forgiven option
(with 3 jumps) to escape from local optima. XRate is able
to deal with mixtures, instead of using our simplifying
MAP-based approach (11). However, we observed that using MAP in this estimation context is much faster, less

Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015

Mathematically speaking , this model (5) is a compromise
between Yang’s model (2) and two-level mixture models (4).
Instead of sharing the same matrix as in Yang’s model, each
rate has its own matrix, and each matrix is applied only to
one rate category instead of being applied to all rates as in
two-level mixture models. This model (5) is thus more general than Yang’s model but keeps the same free parameters
to be estimated from the data (i.e., a and T) as in Yang’s
model. From a biological standpoint, the simplification from
two-level mixture (4) to model (5) means that the main heterogeneity factor among sites is their evolutionary rate, an
assumption that will be tested in the Performance Comparison section.

Model LG4M in equation (5) constrains the site rates
using a discrete gamma distribution. In our LG4X model,
we generalize LG4M by removing this constraint:

Following (Whelan and Goldman 2001; Le and Gascuel
2008; Le, Lartillot, and Gascuel 2008), we estimate
Q* 5 (Q1*,Q2*,Q3*,Q4*) from equation (7) in two steps: 1)
given a fixed starting value of Q, we estimate T*, P*,
and W* by maximizing likelihood (5) (LG4M) or (6)
(LG4X); 2) we then estimate Q* 5 (Q1*,Q2*,Q3*,Q4*) using
equation (7) with respect to T*, P*, and W* values obtained
in step (1). These two steps are iterated one after the
other until no more improvement of T*, P*, W*, and Q*
is found.
Since trees, rates, and weights of alignments in D are
independent of one another, we optimize T*, P*, and W*
for each alignment Da independently:


Modeling Protein Evolution with Site Rate–Dependent Matrices · doi:10.1093/molbev/mss112

MBE

affected by local optima, and tends to provide better results
(Le and Gascuel 2008). This is why here we adopted the
same strategy, which is close to Viterbi’s approximation
that proved to be both efficient and accurate to estimate
HMMs (Durbin et al. 1998).
To perform these computations and use our new models to infer trees, we adapted PhyML 3.0 (Guindon et al.
2010) to LG4X and LG4M. This dedicated version is called

PhyML-4X in the following. The adaptation of PhyML to
LG4M is just a trigger so that the program selects the correct matrix for each rate category. The other parts (e.g., to
optimize a or to search tree topologies) are kept the same
as in standard PhyML. In the case of LG4X, we reused the
optimization module from (Le, Lartillot, and Gascuel 2008)
to optimize weights (w1, . . .,w4) and rates (q1, . . .,q4),
alternating monodimensional Brent optimization of every
variable
until global convergence. To account forP
constraint
P
wm 51, we use the variable change wi 5evi = evm and
then optimize the vis using Brent. The second constraint

P

wm qm 51 is fulfilled by rescaling rates and branch
lengths before returning the final tree. To accelerate the
calculations, the starting tree, rates, and weights (50.25)
are first estimated with LG4M, which involves a single
(a) parameter to be optimized instead of six.
Figure 1 summarizes the whole estimation procedure.
Both LG4M and LG4X are initialized starting from the
LG matrix. LG4M uses a supervised approach where each
matrix is associated with the same gamma rate category
throughout the optimization procedure; for example,
the ‘‘Fast’’ matrix is systematically associated with the highest rate, among four gamma-distributed rates. LG4X is
estimated in a semisupervised way. During the first step,
sites are categorized based on the rate (associated to
LG) providing the highest likelihood value. During subsequent steps, sites are categorized based on the (rate,

matrix) pair with highest likelihood. In most cases, the
‘‘Fast’’ matrix is associated with the highest rate, and the
same holds with other matrices. However, since rates
2925

Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015

FIG. 1. Algorithm for estimating LG4M and LG4X. Note: Tree likelihood is calculated by PhyML-4X in step 3 using equation (5) for LG4M and
equation (6) for LG4X. In step 6, Qk % Qk* is measured by the sum of squared entry differences, to be ,0.1.


MBE

Le et al. · doi:10.1093/molbev/mss112
Table 1. Main Features of LG4M and LG4X Replacement Matrices.
LG4M
LogCor/LG
ClosestMatrix
Hydro
Average rate
5% quantiles
LG4X
LogCor/LG
ClosestMatrix
Hydro
Average weight
5% quantiles
Average rate
5% quantiles
Rate distribution


Very Slow

Slow

Medium

Fast

0.813
Intermediate (0.828)
20.492
0.145
0.035/0.315

0.874
Buried (0.914)
1.249
0.440
0.257/0.673

0.957
Intermediate (0.966)
0.219
0.952
0.826/1.053

0.917
Exposed (0.986)
21.682

2.463
1.938/2.882

0.847
Buried (0.880)
0.934
0.313
0.180/0.418
0.289
0.084/0.441
84/0/0/0

0.853
Buried (0.885)
0.325
0.332
0.185/0.419
0.770
0.394/1.209
0/75/6/3

0.898
Intermediate (0.946)
20.816
0.233
0.145/0.375
1.370
0.800/2.232
0/9/73/2


0.897
Exposed (0.987)
21.815
0.122
0.019/0.251
3.420
1.406/5.317
0/0/5/79

(and weights) are estimated independently for each alignment without any a priori constraint, it may occur for some
data sets that the ‘‘Fast’’ matrix is actually associated with
a slow rate and vice versa (see table 1 and supplementary
tables and figures at Thus, this semisupervised procedure provides
a measure of the importance of the site rate factor. If
the initial rate-based site categorization and matrix interpretation disappeared during the subsequent training
steps, this would mean that site rate is not a heterogeneity
factor of first importance and that other more important
factors exist. If (as is the case), the initial rate-based site
categorization and matrix interpretation are (mostly) preserved all along training steps, this implies that the rate
factor is of first importance, as we assumed in this study.
As all Expectation–Maximization (EM) approaches, XRate
is sensitive to starting parameter values. For computing-time
reasons (LG4X required nearly a week to be estimated and
much more to be tested and compared with other models),
we did not try alternative starting matrices and training
strategies. However, based on our previous experiments with
LG where XRate performed remarkably well (Le and Gascuel
2008), we are confident that LG4M (estimated in a supervised manner) should be relatively stable and insensitive
to the starting matrix used to initiate the training procedure.
On the other hand, we also observed (Le, Lartillot, and

Gascuel 2008) that semisupervised training of mixture
models is more sensitive to the choice of the starting
point. This suggests that LG4X could likely be improved
using other starting points or training strategies.

LG4M and LG4X Matrices and Models
LG4M and LG4X matrices (estimated as described above)
are available at />lg4x, along with additional information and statistics. Here,
2926

we discuss the main features of these matrices and models
that make them better than single-matrix models, especially LG4X thanks to its rate distribution-free scheme. Table 1 provides summary statistics and figure 2 shows some
illustrative matrices. It can be seen that LG4M and LG4X
matrices clearly depart from LG. The correlation of the
log-entries with those of LG (LogCor/LG; table 1) is below
0.9 in most cases. LG4M ‘‘Medium’’ is a noticeable exception (LogCor/LG 5 0.957), which is somewhat expected as
this matrix is used for intermediate sites with evolutionary
rates close to 1. For each matrix, table 1 provides its global
hydropathy (Hydro), computed as the average hydropathy
index (Kyte and Doolittle 1982) of the 20 amino acids with
weights equal to their equilibrium frequencies in the given
matrix. This index also points to a clear difference between
the new matrices and LG (Hydro 5 À0.253). Most of the
matrices (e.g., LG4X ‘‘Fast’’ 5 À1.815 or LG4M ‘‘Slow’’ 5
1.249) are clearly hydrophilic or clearly hydrophobic, with
hydropathy values close to that of ‘‘Exposed’’ (À1.993) or
‘‘Buried’’ (1.715) matrices from our EX3 model (Le, Lartillot,
and Gascuel 2008). These results and measures support our
working hypothesis that the substitution patterns differ
depending on the site rates. Modulating a unique replacement matrix (e.g., LG) using gamma-distributed

rates appears to be an oversimplification.
‘‘Very Slow’’ matrices show a remarkable pattern, especially that of LG4X (fig. 2), which is mostly used to express
high replacement rates between amino acid pairs that are
biochemically very similar, for example: R and K (positively
charged), D and E (negatively charged), and F and Y
(aromatic). These three pairs are very close in the genetic
code, requiring only one nucleotide change to mutate
amino acid into the other. Interestingly, some of these pairs
are highly hydrophilic (e.g., R and K), which contradicts the
first intuition that very slow sites should be all buried and
hydrophobic. However, to avoid misinterpretation, it has to

Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015

NOTE.—The four matrices of LG4M and LG4X are ranked according to their average rates as ‘‘Very Slow,’’ ‘‘Slow,’’ ‘‘Medium,’’ and ‘‘Fast.’’ ‘‘LogCor/LG’’ is the Pearson
correlation coefficient of the log-entries in the given matrix with those of LG. ‘‘ClosestMatrix’’ is the matrix among ‘‘Buried,’’ ‘‘Intermediate,’’ and ‘‘Exposed’’ matrices from
EX3 (Le, Lartillot, and Gascuel 2008) that is closest to the given matrix based on the correlation of the log-entries (value in parentheses). ‘‘Hydro’’ is the average hydropathy
index (Kyte and Doolittle 1982) of the 20 amino acids with weights equal to their equilibrium frequencies in the given matrix. ‘‘Weight’’ is the weight (w) of the given
matrix, averaged over the 84 TreeBase testing alignments. ‘‘Rate’’ is the average rate (q) among TreeBase alignments. ‘‘5% quantiles’’ provide the 5th and 80th rate and
weight values among these 84 alignments. ‘‘Rate distribution’’ is the number of alignments in which a given matrix is ranked (based on its estimated rate) as very slow/slow/
medium/fast; for example, with ‘‘Slow’’, we see that this matrix is never ranked as the slowest matrix, 75 times as the second slowest, 6 times as medium, and 3 times as the
fastest matrix. Similar statistics are obtained with the 300 HSSP test alignments (see supplementary tables and figures at />

Modeling Protein Evolution with Site Rate–Dependent Matrices · doi:10.1093/molbev/mss112

MBE

be noted that rates displayed in figure 2 are relative rates; all
matrices are normalized and do not incorporate the fact
that very slow sites globally evolve slower (;6 times on

average; table 1) than fast sites; for example, the absolute
rate (accounting for this global factor of ;6) between R
and K is nearly symmetrical and almost the same in the
‘‘Very Slow’’ and ‘‘Fast’’ matrices of LG4X. In other words,
R–K replacements are fast in all rate categories, including
the slowest one. The ‘‘Very Slow’’ LG4M matrix is less contrasted than the LG4X matrix and deals with other amino
acid groups, also very close biochemically and in the genetic
code, for example: I, L, and V (aliphatic) and S and T (tiny
and polar). The latter amino acids are focused in the LG4X
‘‘Slow’’ matrix, whereas the LG4M ‘‘Slow’’ matrix mainly
deals with tiny and nearly neutral amino acids (A, G, S,
and T) and the I, V pair (supplementary tables and figures
at />The ‘‘Very Slow’’ matrices are thus used to express the
fact that even in very slow sites, substitutions between
highly similar amino acids are likely to occur. Their con-

tents may be seen as being similar to that of the profiles
in the CAT model (Lartillot and Philippe 2004; Le, Gascuel,
and Lartillot 2008). The ‘‘Slow’’ matrices are analogous but
less contrasted. Moreover, the ‘‘Slow’’ matrix of LG4M is
relatively close to Buried from EX3 (table 1 and above hydropathy values), indicating (as expected) that buried sites
and slow sites are often the same. However, both LG4M
and LG4X ‘‘Very Slow’’ matrices partly contradict this basic
fact, as the LG4X ‘‘Very Slow’’ matrix focuses on some hydrophilic pairs and the LG4M ‘‘Very Slow’’ matrix is slightly
hydrophilic (Hydro 5 À0.429). An explanation of this finding could be that both LG4X and LG4M ‘‘Very Slow’’ matrices are strongly influenced by the genetic code (see
above examples), which intervenes first in the mutational
process (before the physicochemical constraints and selection) and favors substitutions between amino acids that are
not necessarily hydrophobic.
The ‘‘Medium’’ matrix of LG4M is correlated with both
LG and ‘‘Intermediate’’ matrix from EX3 and is thus mostly

used for standard sites with average rates and solvent
2927

Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015

FIG. 2. LG4M and LG4X replacement matrices. Note: Amino acids are ranked from highly hydrophilic (R) to highly hydrophobic (I) based on the
hydropathy index (Kyte and Doolittle 1982). Bubble sizes are proportional to the replacement rates. The horizontal axis displays the original
amino acid, the vertical axis the new one resulting from replacement. X1: ‘‘Very Slow’’ LG4X matrix; X4: ‘‘Fast’’ LG4X matrix (highly similar to
‘‘Fast’’ LG4M matrix); M1: ‘‘Very Slow’’ LG4M matrix; LG is provided as a reference average matrix.


MBE

Le et al. · doi:10.1093/molbev/mss112

2928

ibility of LG4X, which is more powerful than the association
of a single matrix with a distribution-free scheme of rates
across sites, whose performance is somewhat disappointing
(Susko et al. 2003; Mayrose et al. 2005; see results of LGþU4
below). Here, each matrix corresponds (to some extent) to
many/few and fast/slow sites, depending on the protein
analyzed. This clearly shows that substitutions are not just
Markovian with a fixed pattern (replacement matrix) modulated by site-dependent rates. As in other site-dependent
models, we have different categories of sites corresponding
to different matrices and tendencies to be slow or fast,
but no strict constraint to be so. This further illustrates
the finding by many authors (e.g., Keane et al. 2006) that
substitution patterns may be very different in different proteins. The payoff of LG4X flexibility is that matrices in this

model are not fully interpretable as ‘‘Very Slow’’, ‘‘Slow’’,
‘‘Medium’’ and ‘‘Fast’’ matrices; for example, the ‘‘Slow’’ matrix is sometimes the fastest one, and this feature most
likely impacts its coefficient values. However, the average
rates of these four matrices (0.29, 0.77, 1.37, and 3.42, respectively) clearly correspond to their natural interpretation. It must be emphasized that this ranking and global
interpretation are obtained through our semisupervised
learning procedure (see above and fig. 1), where only
the first step accounts for site rates while further optimization steps are performed in a blind manner, clustering
the sites based on their preferred matrix without reference
to their rate. The fact that the four LG4X matrices are
still clearly correlated to rate categories after this (6-step)
phase of blind learning illustrates (if needed) that the evolutionary rate is a major factor in modeling substitution
processes.

Model Comparisons
In the following, we assess the performance of the new
models LG4M and LG4X by comparing them with existing
models using 84 TreeBase and 300 HSSP test alignments
(see Data Sets). The following models are compared.

Single-Matrix Models: JTT, WAG, LG
These standard matrices are used with four categories
of gamma-distributed rates across sites (þC4 option,
not indicated below for conciseness). To assess the use
of a gamma distribution with four discrete rate categories,
we ran LGÀC (constant site rate); LGþC3, LGþC6, and
LGþC8 with 3, 6, and 8 gamma rate categories, respectively; LGþU4 (free distribution of site rates with four categories, just as in LG4X but using a single LG matrix).
Moreover, we tested LGþF, where the amino acid frequencies are estimated from the studied alignment, instead of
being assigned to the default, average frequencies of LG.
LGþF was used with the þC4 option. LGþF should better
fit the specificities of the data being analyzed, but is penalized by the large number of extraparameters (frequencies)

to be estimated. In total, LGþF has 20 free parameters (1
gamma þ 19 frequencies); LGþU4 has 6 free parameters
(3 rates þ 3 weights); LG, JTT, and LG (þC3, þC4, þC6,

Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015

accessibility. The ‘‘Medium’’ matrix of LG4X is also correlated with ‘‘Intermediate’’ but to a lesser extent than
LG4M ‘‘Medium’’, and its global hydropathy is relatively
low (À0.816) compared with that of LG (À0.253) and that
of LG4M ‘‘Medium’’ (0.219).
Lastly, the ‘‘Fast’’ matrices of LG4X and LG4M are very
close (correlation of the log-entries 5 0.994) and quite similar to ‘‘Exposed’’ from EX3 (correlation of the log-entries %
0.99). As expected, fast sites and exposed sites are often the
same. Moreover, ‘‘Fast’’ matrices show a relatively low contrast (fig. 2) and allow for all possible substitutions, with
a preference for substitutions between amino acids with
similar hydrophathy.
All together we thus see (as expected) a clear correlation
between evolutionary rate and solvent accessibility; for example, LG4M ‘‘Slow’’ is close to ‘‘Buried’’, whereas both
‘‘Fast’’ matrices are close to ‘‘Exposed’’. However, the matrices of LG4M and LG4X account for other features of the
substitution processes; for example, LG4X ‘‘Very Slow’’ is
weakly correlated with Buried but focuses on specific highly
exchangeable amino acid pairs, some highly hydrophilic
(e.g., R and K). The variability among all these replacement
matrices demonstrates the complexity of amino acid
substitutions and explains why a single matrix has limited
capacity in modeling such complex processes.
Analysis of rates across sites further illustrates the difficulty involved in substitution modeling and the advantage
of flexible models such as LG4X. With LG4M, the a value
of the gamma parameter is significantly higher than with
LG (respectively 0.866 and 0.584 on average; this ordering

of a values is observed with all but five alignments, see
supplementary tables and figures at This is an expected outcome,
as part of the rate variability is taken into account in
LG4M by the use of four different matrices. With LG4X,
the matrix rates show a different picture. While with
LG4M, each of the four matrices is pretty much associated
with the same rate, with LG4X, the rates differ significantly
depending on the data set analyzed, to the point that the
‘‘Slow’’ matrix is sometimes (three cases among 84 TreeBase test alignments; table 1) associated with the fastest
rate. It is worth to note that rates and weights in equation
(6) are optimized for each data set analyzed, without
constraining the rates to be ordered depending on the
matrix with which they are associated. In the same
way, the weights of site categories are highly variable;
for example, with TreeBase test alignments, the ‘‘Fast’’
weight varies from ;0.0 to ;0.25 (table 1). Results with
HSSP test alignments (supplementary tables and figures at
are similar
but show less variability; for example, the ‘‘Slow’’
matrix is only once the fastest one among 300 alignments,
instead of 3/84 with TreeBase. This indicates that the
flexibility of LG4X is less useful with HSSP than with TreeBase, which can be expected since LG4X was estimated
from HSSP.
The gains obtained by LG4X over LG4M (see Performance Comparisons) are thus explained by the high flex-


MBE

Modeling Protein Evolution with Site Rate–Dependent Matrices · doi:10.1093/molbev/mss112


þC8) have 1 free (gamma) parameter; LGÀC has 0 free
parameter.

Two-Level Mixture Models: EX2, UL3, EXEHO

Confidence-Based Models: EX2/S, EXEHO/S
The previous models are mixtures. EX2 and EXEHO use categories of sites having a structural meaning and matrices
estimated from sites categorized based on their structural
properties, but to infer phylogenies these two models do
not use any structural information. The likelihood of every
site is computed within each category and then averaged,
as expressed in equation (4). On the contrary, EX2/S and
EXEHO/S use structural information on the analyzed data
set. Basically, the likelihood of each site is computed based
on its known structural category, as in the standard partition approach. However, since structural information may
be erroneous or inappropriate in a phylogenetic context,
we refined this approach by introducing a confidence coefficient, estimated from the analyzed data set, which expresses
a trade-off between the standard mixture (no structural information is available) and partition (structural information is
fully reliable and relevant) models. These models are
described in details in Le and Gascuel (2010), where they
are called EX2_CONF/MIX and EX_EHO_CONF/MIX.
Here, we use EX2/S and EXEHO/S to make it clear that

Single-Level Mixture Models: LG4M and LG4X
These are the two new models proposed in this paper.
Both involve 4 site categories in total (to be compared
with the 24 categories of EXEHO and the 240 of CAT60,
remembering that the computing time and memory consumption are strongly correlated to the number of categories). Rates in LG4M are gamma distributed, whereas LG4X
uses a distribution-free scheme. LG4M has 1 (gamma) free
parameter; LG4X has 6 (3 rates þ 3 weights) free parameters.


Comparison Criteria and Methods
Our aim was to compare the performance of all these models, regarding likelihood and topological criteria. To infer
trees, we used: the last version of PhyML 3.0 (Guindon
et al. 2010) for LG, JTT, and WAG; PhyML-Structure
(Le and Gascuel 2010) for EX2, UL3, EXEHO, EX2/S, and
EXEHO/S; and our adaptation (PhyML-4X) of PhyML 3.0
for LG4M, LG4X, and LGþU4. All programs were run with
BioNJ (Gascuel 1997) starting tree and subtree pruning and
regrafting (SPR) tree searching.
Since these models involve different numbers of free
parameters, we measured their fitness to data using the
AIC criterion (Akaike 1974):
AICðM; Da Þ
5 À 2LLðM; T a ; Da Þ þ 2# parametersðMÞ;
where LL(M,Ta;Da) is the log-likelihood of alignment Da given
model M and inferred tree Ta; #parameters(M) is the number of
free parameters of model M. The AIC criterion has to be minimized; best scores are given to models with low numbers of
free parameters and high likelihood values. All tested models
involve one parameter (length) per tree branch plus the model
parameters detailed in previous section.
For every model M studied, we computed the average
AIC per site for all alignments in test set A:
P
AICðM; Da Þ
a2A
P a
AIC=siteðM; AÞ 5
;
s

a2A

where sa is the number of sites in Da. To complete this global
average result, we performed pairwise model comparisons
and counted the number of alignments Da, where AIC(M1,Da)
, AIC(M2,Da) (i.e., M1 fits Da better than M2) for a given
model pair M1, M2. To assess the statistical significance of
the observed difference between M1 and M2 for any given
alignment, we used a Kishino–Hasegawa (KH; 1989) test with
P , 0.01. As the number of free parameters between M1
and M2 may differ, we used AIC penalized likelihood values.
This test is essentially the same as that used to compare
2929

Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015

EX2 involves two first-level categories of sites, based on
solvent accessibility; its two matrices were estimated in
a supervised manner from sites being classified as Buried
and Exposed in HSSP. UL3 has three first-level categories
of sites; it was estimated in a fully unsupervised manner,
starting from three random matrices. EXEHO has six
first-level categories of sites, crossing solvent accessibility
(two categories) with secondary structure (three categories: Extended, Helix, and Other); EXEHO was learned in
a supervised manner from HSSP sites categorized accordingly. First-level categories in EX2, UL3, and EXEHO were
combined in this study with four second-level gammadistributed rate categories (þC4 option); for example,
EXEHO has 6 Â 4 5 24 site categories in total. EX2 and
UL3 proved to be our best mixture models with 2 and 3
(first-level) categories, respectively, and UL3 was even
better than the CAT60 model that involves 60 first-level

categories (profiles) and thus 240 categories in total with
the þC4 option (Le, Lartillot, and Gascuel 2008). Among
mixture models, EX2 and UL3 were only beaten by EXEHO
(Le and Gascuel 2010), but this latter requires high memory
consumption and running time with large data sets due to
its 24 site categories; for this reason, EXEHO was not run on
TreeBase test alignments, some being very large but on
HSSP alignments only. We also tested the model by Wang
et al. (2008) but encountered several difficulties when running their program and do not provide results for this
model (e.g., QmmRAxML did not finish searching for
10/84 alignments after 3 weeks). EX2, UL3, and EXEHO require estimating the gamma rate parameter from the data
plus the proportions of first-level categories, that is, 2, 3,
and 6 free parameters, respectively.

they benefit (contrary to all other models) from structural
information. Both are combined with four gamma rate
categories (þC4 option). They were run on HSSP test alignments only, as no structural information is available in TreeBase. EX2/S and EXEHO/S involve 3 (1 gamma þ 1
confidence þ 1 category proportion) and 6 (1 gamma þ
1 confidence þ 5 category proportions) free parameters
to be estimated from the data.


MBE

Le et al. · doi:10.1093/molbev/mss112
Table 2. Model Comparison with TreeBase Test Alignments, Using Likelihood and Topological Criteria.
M1
LG
LG
LG

LG
LG
LG
LG
LG
LG
LG
LG4M
LG4X
LG4X

M2
JTT
WAG
LG2G
LG1G3
LG1G6
LG1G8
LG1U4
LG1F
LG4M
LG4X
LG4X
EX2
UL3

AIC/site
0.47
0.29
2.34

0.07
20.04
20.05
20.03
0.00
20.15
20.33
20.18
0.08
20.19

#M1 > M2
73
71
83
80
7
10
34
51
33
12
1
67
39

#M1 > M2
(P < 0.01)
66
62

80
41
1
1
15
7
11
1
0
27
6

#M1 < M2
(P < 0.01)
7
7
0
1
37
33
21
6
34
50
48
2
22

#T1 > T2
39

32
62
32
3
3
17
24
20
10
0
44
24

#T1 < T2
8
10
0
0
22
29
28
12
37
48
52
11
35

#T1 > T2
(P < 0.01)

36
29
61
11
0
0
4
4
7
1
0
17
3

#T1 < T2
(P < 0.01)
5
6
0
0
6
9
10
3
24
36
26
1
17


RF (%)
430 (11)
404 (10)
566 (14)
300 (8)
266 (7)
270 (7)
476 (12)
364 (9)
616 (15)
606 (15)
530 (13)
536 (13)
602 (15)

L1 2 L2
(P < 0.01)
0.036
0.136
0.307
0.002
0.018
0.027
0.063
20.011
20.073
0.062
0.145
20.100
20.265


phylogenies. We use the RELL bootstrap to estimate the
distribution of the test statistic under the null hypothesis
that both models are equivalent, but incorporate the number of parameters of each model in this statistic just as in the
AIC criterion (for explanations and justifications of this test,
see Shimodaira 1997).
We compared the lengths of inferred trees, that is, the
sums of their branch lengths. It has been suggested that
best models tend to produce longer trees capturing more
hidden substitutions (e.g., Pagel and Meade 2005).
We also compared the topologies of inferred trees.
Indeed, if the new models produced the same topologies
as the existing models, the effort of introducing new models would be rather useless. Unfortunately, the true tree is
usually unknown with real data (as opposed to simulated
data), and thus, it is hard to assess the topological accuracy
induced by any tree-building approach in a realistic setting.

Here, we studied the topological impact of our new models,
that is, whether or not using these models enables us to
frequently infer trees that differ from those inferred with
standard models.
When comparing models M1 and M2, we counted the
number of alignments where the inferred topology using
M1 differs from that obtained using M2. Both topologies
were also compared using the Robinson and Foulds (RF;
1979) distance, which is the number of branches (bipartitions) that belong to one tree but not to the other. When
different topologies are found, one should prefer the one
with best likelihood value or best AIC (or similar criterion)
value, when evolutionary models used for tree inference
involve different numbers of parameters. However, the difference may be slight and nonsignificant, so one cannot

reject the topology with a lower fit to data. We thus

Table 3. Model Comparison with HSSP Test Alignments, Using Likelihood and Topological Criteria.
M1
LG
LG
LG
LG
LG4M
LG4X
LG4X
LG4X
LG4X
LG4X

M2
JTT
WAG
LG4M
LG4X
LG4X
EX2
UL3
EXEHO
EX2/S
EXEHO/S

AIC/site
0.72
0.31

–0.59
20.65
20.06
0.15
0.00
20.14
20.21
20.61

#M1 > M2
267
248
30
13
93
241
199
117
60
5

#M1 > M2
(P < 0.01)
221
141
3
0
2
62
37

2
1
0

#M1 < M2
(P < 0.01)
10
7
174
182
20
2
10
23
80
223

#T1 > T2
220
196
27
10
83
200
165
88
56
4

#T1 < T2

23
32
251
257
166
51
99
166
199
250

#T1 > T2
(P < 0.01)
184
110
1
0
2
42
26
2
0
0

#T1 < T2
(P < 0.01)
6
2
162
163

16
1
10
21
57
181

RF (%)
3,570 (15)
3,478 (15)
4,386 (18)
4548 (19)
4,014 (17)
4,110 (18)
4,356 (18)
4,176 (17)
4,188 (18)
4,204 (18)

L1 2 L2
(P < 0.01)
0.048
0.174
0.009
0.078
0.068
20.063
20.326
20.094
20.058

20.092

NOTE.—Models are compared using 300 HSSP test alignments. EX2/S has the same matrices as EX2 but (contrary to EX2) uses the solvent accessibility of the residues
derived from the 3D protein structure. EXEHO: 6-matrix two-level mixture model, combining accessibility to solvent and secondary structure. EXEHO/S has the same six
matrices as EXEHO but (contrary to EXEHO) uses the secondary structure and the solvent accessibility of the residues derived from the 3D protein structure. See note to
table 2 for other abbreviations and explanations.

2930

Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015

NOTE.—Models are compared using 84 TreeBase test alignments. All models use four categories of gamma distributed rates (þC4), unless explicitly stated, that is: þU4 and
LG4X for distribution-free scheme; ÀC for constant site rate; þC3, þC6, and þC8 for 3, 6, and 8 gamma rate categories, respectively. LGþF: LG exchangeability coefficients
are combined with the amino frequencies of the alignment being analyzed. EX2: 2-matrix two-level mixture model, with matrices estimated from buried/exposed sites. UL3:
3-matrix two-level mixture model, with blindly estimated matrices. LG4M: 4-matrix one-level mixture model using gamma distribution of site rates, proposed in this paper.
LG4X: 4-matrix one-level mixture model using distribution-free scheme of site rates, proposed in this paper. On each row, model M1 is compared with model M2 using all
test alignments. AIC/site: average per site difference in AIC value between M1 and M2; a positive (negative) value means that AIC/site of M1 is better (worse) than M2, on
average. #M1 . M2: number of alignments (of 84) where M1 has a better AIC value than M2. #M1 . M2 (P , 0.01): number of alignments where the AIC of M1 is
significantly better than that of M2. #M1 , M2 (P , 0.01): same as #M1 . M2 (P , 0.01), but now M2 is significantly better than M1. #T1 . T2: number of alignments
where the tree T1 inferred with M1 has a better AIC value than T2 inferred using M2 and where T1 and T2 have different topologies. #T1 , T2: same as #T1 . T2, but now
T2 is better than T1. #T1 . T2 (P , 0.01): same as #T1 . T2, but now T1 is significantly better than T2. #T1 , T2 (P , 0.01): T2 is significantly better than T1. RF (%): total
RF distance between T1 and T2 trees (i.e., sum over all data sets of the number of branches that belong to one tree but not the other); numbers in parentheses report the
percentage of RF relative to the total number (3,994) of internal branches in both T1 and T2 trees. L1 À L2 (P , 0.01): average of tree length differences between T1 and T2;
we also counted the number of cases where T1 is longer/shorter than T2 and assessed the significance using a sign test with P , 0.01, significant differences are underlined.


Modeling Protein Evolution with Site Rate–Dependent Matrices · doi:10.1093/molbev/mss112

counted the number of alignments where M1 and M2 topologies differ and where M1 is significantly better (worse) than
M2, using a KH test on AIC penalized likelihood values with

P , 0.01.
Lastly, we checked that the observed topological differences comprised some branches with significant support.
Indeed, the topological impact would be low if all differences corresponded to poorly supported branches. To this
end, we performed bootstrap analyses and counted the
number of branches with notable bootstrap support
(BP1 ! 50%) in one tree, which were not recovered in
the other tree, or had a much lower support in this tree
(BP2 þ 50%
BP1). For example, one branch with BP1
5 40% was not counted, even when it was not recovered
in the other tree; on the contrary, one branch with BP1 5
80% in one tree was counted when it was found in the
other tree with BP2 5 20%. This measure (first introduced

FIG. 4. AIC progress of amino acid replacement models, using HSSP.
Note: Models are compared using 300 HSSP test alignments. EX2/S
has the same matrices as EX2 but (contrary to EX2) uses the solvent
accessibility of the residues derived from the 3D protein structure.
EXEHO: 6-matrix two-level mixture model, combining accessibility
to solvent and secondary structure. EXEHO/S has the same six
matrices as EXEHO but (contrary to EXEHO) uses the secondary
structure and the solvent accessibility of the residues derived from
the 3D protein structure. See note to figure 3 for other
abbreviations and explanations.

in Le and Gascuel 2010) thus summarizes the topological
and branch support differences. We used only 50 bootstrap
replicates for computing time reasons, but this suffices for
our gap of 50% between BP1 and BP2 to be highly significant (P value ; 0.0 using a Z-test for two proportions).
Moreover, 50% of bootstrap support was shown to be optimal in terms of topological error (Berry and Gascuel 1996;

see also Holder et al. 2008). As this procedure is computationally heavy (even with 50 replicates), we analyzed the
63 smallest TreeBase alignments only. All (300, relatively
small) HSSP alignments were analyzed.

Fitness Comparison
Comparisons were performed on 84 TreeBase and 300
HSSP test alignments. Both sets are independent of the
training alignments and thus provide fair estimations of
model performance. Moreover, using TreeBase should
avoid possible biases induced by some of the specificities
of HSSP alignments used to train our models. Tables 2
(TreeBase) and 3 (HSSP) display comparisons between
all models listed above. Figures 3 (TreeBase) and 4 (HSSP)
show the progress in the AIC/site for the main models.
2931

Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015

FIG. 3. AIC progress of amino acid replacement models, using
TreeBase. Note: Models are compared using 84 TreeBase test
alignments. All models use four categories of gamma-distributed
rates (þC4), except LG4X. EX2: 2-matrix two-level mixture model,
with matrices estimated from buried/exposed sites. UL3: 3-matrix
two-level mixture model, with blindly estimated matrices. LG4M:
4-matrix one-level mixture model using gamma distribution of site
rates, proposed in this paper. LG4X: 4-matrix one-level mixture
model using distribution-free scheme of site rates, proposed in this
paper. In the upper panel (a) performance is measured by the
average AIC per site (AIC/site) and compared with the JTT value. In
the lower panel (b), we count the number of alignments (among

84) where each model provides a better (positive side) and a worse
(negative side) likelihood value than LG. The black bars correspond
to the numbers of significant differences using the KH test on AIC
values with P , 0.01.

MBE


Le et al. · doi:10.1093/molbev/mss112

2932

alignment. Compared with LG4M, LG4X shows a slight
advantage with HSSP (AIC/site gain of 0.06) and is clearly
better with TreeBase: mean AIC/site gain of 0.18, which is
significant for 50/84 alignments, whereas LG4M is better
than LG4X for only one nonsignificant alignment. The
advantage of LG4X over LG4M is explained by its greater
flexibility to fit the specificities of analyzed data, thanks to
its distribution-free scheme. This scheme (þU4) does not
show such an advantage with single-matrix models (see
above) but becomes clearly beneficial when combined with
different replacement matrices. The superiority of LG4X
over LG4M is more marked with TreeBase than with HSSP
alignments, possibly because LG4M and LG4X were learned
from HSSP alignments and fit both HSSP specificities well.
Moreover, HSSP alignments are smaller than TreeBase
alignments, and some HSSP alignments are likely too small
to compensate for the 5 additional parameters in LG4X
compared with LG4M. This advantage of LG4X over

LG4M should thus be observed by future users analyzing
phylogeny-intended alignments, such as those stored in
TreeBase.
Compared with two-level mixture models, we see that
LG4X is 1) slightly better than EX2 (AIC/site gain of 0.08
and 0.15 with TreeBase and HSSP, respectively); 2) nearly
equivalent to UL3 (AIC/site loss of 0.19 with TreeBase but null
with HSSP; the number of significant cases is low and does not
favor one model over the other); 3) slightly behind EXEHO
(AIC/site loss of 0.14 with HSSP, but low number [23] of cases
where EXEHO is significantly better than LG4X). Globally, we
thus do not see a clear advantage of two-level mixture models
over our new one-level mixture models, the former involving
high number of rate categories (up to 24 with EXEHO) and
heavy computational resources (at least EXEHO).
Lastly, when comparing EX2/S and EXEHO/S with LG4X
and LG4M (and all other models), we see the clear advantage of using structure-informed models when the structural annotation of the proteins analyzed is available.
The AIC/site gain of EXEHO/S over LG4X is of 0.61 on
average, and this gain is significant with 223/300 alignments, whereas LG4X is never significantly better than
EXEHO/S. The advantage of EX2/S over LG4X is less impressive but still significant.
Tree-length comparisons (tables 2 and 3) do not show
a clear picture. Some of the findings are expected, for
example, LGþC4 trees are much longer than LGÀC ones,
but the correlation between tree length and AIC value is weak
or nonexistent. Mixture models tend to produce longer trees
than standard models, with the notable exception of both
distribution-free models (LGþU4 and LG4X), which infer
trees shorter than LGþC4 ones. LG4M and LGþC4 trees
have similar length. UL3 trees (comparable to LG4X in
AIC terms) are very long, while those inferred using

EXEHO/S (our best model in AIC terms) do not differ substantially from LG ones. These results thus contradict the
assumption that better models should produce longer trees
(Pagel and Meade 2005).
Overall, the comparisons of likelihood and AIC values
show that 1) our new simple one-level mixture models

Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015

It is clear from these results that LG outperforms JTT and
WAG. The average AIC/site of LG with TreeBase alignments
is respectively 0.47 and 0.29 lower than that of JTT and
WAG, equivalent to a gain of 117.5 and 72.5 log-likelihood
units with a 500-site alignment. Moreover, LG is significantly better than JTT and WAG for most alignments.
These results reconfirm the claim in Le and Gascuel (2008).
Table 2 (TreeBase) reports comparisons between several
standard model options, combined here with LG. The comparison between LGÀC (no gamma distribution of site
rates) and LGþC4 highlights the crucial role of modeling
rates across sites. LGþC4 has significantly better AIC values
than LGÀC for 80/84 alignments, with an average AIC/site
gain of 2.34. LGþC4 is significantly better than LGþC3 for
41/84 alignments, with an average AIC/site gain of 0.07,
whereas it is worse than LGþC6 for 37/84 alignments, with
a slight average AIC/site loss of 0.04. Using eight categories in
LGþC8 does not improve AIC results compared with
LGþC6. These results indicate that with standard models,
three gamma rate categories are not enough and that six
categories suffice. Moreover, the differences in AIC/site
between these options are low compared with those of
other options and models (e.g., LGþC vs. LGÀC). Using four
categories, which is standard throughout the community,

thus appears to be a fair compromise between likelihood
(AIC) value and computing time, which is proportional to
the number of categories. These results also support our
choice of using four categories in our LG4M and LG4X models. Having three categories would be not enough and overly
simple, whereas using four categories is likely to be a fair
compromise just as with standard models. However, we cannot exclude that having more than four categories could
lead to even better (but slower) models. The low AIC/site
gain (0.03) between LGþU4 and LGþC4 confirms the conclusions of Susko et al. (2003) and Mayrose et al. (2005), who
observed low gains when combining a single replacement
matrix with a distribution-free scheme or a mixture of
gamma distributions. Lastly, when combining LG with
þF option, where amino acid distribution is estimated independently for each testing alignment, the average AIC/
site value is nearly the same as with LG alone. Six large
alignments obtain significantly better AIC values than with
LG, but for the other (small or medium-size) alignments, the
likelihood gain is not large enough to compensate for the 19
additional free parameters estimated from the data.
LG4M shows a clear improvement over LG, with an
AIC/site gain of 0.15 and 0.59, with TreeBase and HSSP
alignments, respectively (note that these gains cannot
be compared, as they depend on several factors, e.g., the
number of taxa per alignment). With HSSP, LG4M has higher
AIC (and likelihood) values than LG for 270/300 alignments
with 174 significant cases. With TreeBase results are not so
impressive but still clearly in favor of LG4M versus LG.
LG4X has a major advantage over LG, with an AIC/site
gain of 0.33 and 0.65, with TreeBase and HSSP, respectively.
LG4X is significantly better than LG for more than half of
the alignments (both HSSP and TreeBase), whereas LG is
significantly better than LG4X for only one (TreeBase)


MBE


Modeling Protein Evolution with Site Rate–Dependent Matrices · doi:10.1093/molbev/mss112

MBE

outperform the standard models; 2) they are comparable
to the best two-level mixture models, while requiring less
computational resources; and 3) they are clearly beaten by
structure-informed models, which should be preferred
when structural information is available.

Topological Impact

FIG. 5. Topological support dissimilarities of the main models. Note:
Topological support dissimilarity between models M1 and M2 is
computed from the bootstrap trees inferred using M1 and M2, by
counting the number of branches supported in one tree but not the
other (BP1 ! BP2 þ 50%, see text). Trees in this figure were built
using distance-based FastME software, from all pairwise model
dissimilarities. The tree in the upper panel (a) is based on the 63
smallest TreeBase test alignments; tree (b) is based on 300 HSSP test
alignments. All models use four categories of gamma distributed
rates (þC4), unless explicitly stated, that is, LG4X for distributionfree scheme; ÀC for constant site rate; þC8 for 8 gamma rate
categories. EX2: 2-matrix two-level mixture model, with matrices
estimated from buried/exposed sites. EX2/S has the same matrices
as EX2 but (contrary to EX2) uses the solvent accessibility of the
residues derived from the 3D protein structure. EXEHO/S: 6-matrix

model combining the accessibility to the solvent and the secondary
structure of the residues derived from the 3D protein structure.
UL3: 3-matrix two-level mixture model, with blindly estimated
matrices. LG4M: 4-matrix one-level mixture model using gamma
distribution of site rates, proposed in this paper. LG4X: 4-matrix
one-level mixture model using distribution-free scheme of site rates,
proposed in this paper.

support dissimilarity between LG and LGþC8, while this
ratio is ;2 with RF distance (both TreeBase and HSSP).
We thus measured the topological support dissimilarities
between main models using (63 smallest) TreeBase and
2933

Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015

The previous section analyzes model performance in terms
of fit to the data, measured by likelihood and AIC values.
Here, we study the impact of using refined models, that is,
how they change the topology of inferred trees.
We see from tables 2 and 3 that refined models have
a strong topological impact compared with standard models.
For example comparing LG4X with LG, both models infer different topologies with 58/84 TreeBase and 267/300 HSSP
alignments. Moreover, these topologies are clearly different
(percentage RF distance of 15% and 19% for TreeBase and
HSSP, respectively), and the AIC value significantly favors
LG4X topologies over LG topologies for 36/58 TreeBase
and 163/267 HSSP alignments, whereas the LG topology is
significantly favored for only one TreeBase alignment and
never with HSSP. This means that with 36 (;45%) TreeBase

and 163 (;55%) HSSP alignments, one should confidently
select the LG4X topology and abandon that inferred using
LG. Moreover, these two topologies are very different in general (see RF values).
However, the effective impact of selecting these alternative trees would be low if the differences between topologies of standard and refined models corresponded to
poorly supported branches and was solely due to random
effects inherent to phylogenetic reconstruction. Indeed, inspecting the topological distances (RF) between LG topologies and those of other models, we see that closely related
models still show substantial topological differences; for
example (table 2), LG (used with the þC4 option, omitted
below for conciseness as in the previous section) and
LGþC8 topologies are different for 32/84 TreeBase alignments, with 9/32 significant cases and percentage RF distance of 7%. We thus counted the number of branches that
are supported by the bootstrap in one tree but not the
other (BP1 ! BP2 þ 50%, see above). For example, comparing LG and LGþC8 with the 63 smallest TreeBase alignments, we have 8 such branches among a grand total of
2,382 branches, against 48 with LG versus LG4X. Using
the 300 HSSP alignments, we have 122 branches supported
in one tree but not the other when comparing LG and
LGþC8 and 552 for LG versus LG4X (grand total 5
23,908). These measures are much lower than the corresponding RF topological distances (tables 2 and 3), as
expected since here we only consider branches with substantial bootstrap support. However, with LG versus LG4X,
we still have on average ;1 (TreeBase) to ;2 (HSSP)
branches per tree with clearly different bootstrap support.
Moreover, this ‘‘topological support dissimilarity’’ provides
a sharper view of the models’ resemblance and dissemblance
than standard RF distance; for example, the topological support dissimilarity between LG and LG4X is ;6 times larger
with TreeBase (;4.5 with HSSP) than the topological


MBE

Le et al. · doi:10.1093/molbev/mss112


EXEHO (all three with the C4 option), respectively, use
twice, three, and six times as much memory because they
are two-level mixtures with two, three, and six matrices,
respectively. For example, LG4X and UL3þC4 require 2
GB and 6 GB of memory space, respectively, for the largest
TreeBase alignment with 62 taxa and 11,544 sites (accession
number M4680). EXEHO requires almost 12 GB to analyze
this data set, which makes it impractical for most standard
computers and users.
The same ratios apply to the computing times needed to
calculate data likelihood, given a tree with branch lengths and
model parameter values. For example, EXEHO is nearly six
times slower than a standard, single-matrix model. However,
other factors impact the total computing time. Most notably,
models differ in the number of parameters to be estimated
from the data via likelihood optimization. Standard models
and LG4M have only one (gamma) parameter, whereas EX2,
UL3, LG4X, and EXEHO, respectively, have two, three, six, and
six parameters, thus again requiring additional computing
time compared with standard models and LG4M.
Using PhyML-4X with standard options (þC4, SPRbased tree searching) on a powerful CPU (Intel[R] Xeon[R]
E5440 at 2.83GHz with 16 GB memory) to infer phylogenies
for all 84 TreeBase alignments requires about 55 h with
standard models (LG was used here), 60 h with LG4M,
and 85 h with LG4X. As expected, LG4M is nearly as fast
as standard models, but LG4X is somewhat slowed down
by model parameter estimations. Using PhyML-Structure
(Le and Gascuel 2010) for the same task requires about
280, 380, and 670 h for EX2, UL3, and EXEHO, respectively
(total time for EXEHO is about 1 month and was estimated

from a sample of alignments). Applying these models to
the largest TreeBase alignment (M4860) using the same
CPU, programs and options, requires about 6, 8, 11, 51,
53, and 84 h for LG, LG4M, LG4X, EX2, UL3, and EXEHO,
respectively. Though different programs were used in these
experiments, with PhyML-4X being based on a more recent
and about twice faster version of PhyML than PhyMLStructure, we obtain a clear picture: LG4M is nearly the
same as standard models and is a fast model as expected;
LG4X is a bit slower due to its six parameters to be estimated; EX2 and UL3 are significantly slower than standard
models but still clearly applicable even to large data sets;
EXEHO requires important computing resources for large
data sets not only in terms of computing times but also
with respect to memory space. The /S option (called
CONF/MIX in Le and Gascuel 2010) that we used for HSSP
alignments with known 3D structure requires nearly the
same time and memory as the mixture version, with both
EX2 and EXEHO. However, EXEHO (and EX2) may be used
with less demanding options (CONF/LG and PART) when
the 3D structure is known.

Conclusion
Memory Consumption and Running Time
LG4M and LG4X require the same amount of memory as
single-matrix models with the C4 option. EX2, UL3, and
2934

In this paper, we proposed two new models, LG4M and
LG4X, for amino acid replacement modeling in protein
phylogenetics. The main idea was to use different


Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015

(300) HSSP alignments, and then constructed distancebased trees representing the topological impacts and resemblance/dissemblance of these models. For this purpose, we
used the FastME software (Desper and Gascuel 2002;
with default options (analogous to NJ but using branch swapping). Resulting trees are displayed in figure 5, and the pairwise
distance matrices are available in the supplementary tables
and figures at />Although these two trees were obtained from completely different data (TreeBase, HSSP) through a complex
procedure (bootstrap, dissimilarity computation, FastME),
it is remarkable that both are nearly identical in terms of
topology and branch lengths. Moreover, these trees are
easily interpreted and illustrate the main features of studied
models. A first obvious observation is that standard and
nonstandard models form the two main clades. Moreover,
as expected with standard models: LG and LGþC8 form
a tight clade; LG, WAG, and JTT are relatively close;
LGÀC (constant site rate) is isolated at the end of a long
branch, which illustrates the (well-documented) impact of
using a gamma distribution of site rates. Nonstandard models
are separated in two clades, respectively, containing one-level
and two-level mixtures. Within two-level mixtures in the HSSP
tree, we have a clade containing all structure-based models,
with two tight subclades containing, respectively, EX2 and
EX2/S and EXEHO and EXEHO/S. Surprisingly, knowing the
3D structure in EX2/S and EXEHO/S does not have much
impact on the tree topology with respect to EX2 and EXEHO.
However, the topological impact of these four structure-based
models with respect to LG is almost the same as that of LG
with respect to LGÀC. The topological impact of UL3, LG4M,
and LG4X with respect to LG is even larger. As expected LG4M
and LG4X form a clade, but both models are relatively distant,

whereas UL3 is distant from all other models.
From trees in figure 5, it is not possible to predict which
model provides the best topologies. However, it is noticeable
that nonstandard and mixture models are on the opposite
side of LGÀC, known to be a poor model, whereas standard
models are in between. We might have a preference for
structure-based models, as they infer similar tree topologies
(fig. 5) and lead to very high likelihood values when the 3D
structure is available (fig. 4). However, there is no guaranty
that these models are best from a topological standpoint,
even if they have strong biological justifications. LG4X and
LG4M are also based on meaningful assumptions (as opposed to UL3 learned in a purely blind manner). Both have
a strong topological impact and provide high likelihood
gains compared with standard models. LG4M and LG4X
tree topologies contain well-supported clades, not discovered by any of the other models, and thus representing
biological and phylogenetic interest and deserving further
investigations.


Modeling Protein Evolution with Site Rate–Dependent Matrices · doi:10.1093/molbev/mss112

Acknowledgments
Thanks to Associate Editor Jeffrey Thorne and two anonymous referees for their help and suggestions. This research
was supported by the French ANR BIOSYS (MitoSys project)
and the Vietnam National Foundation for Science and
Technology Development.

References
Akaike H. 1974. A new look at statistical model identification. IEEE
Trans Automatic Control. 19:716–722.

Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H,
Shindyalov IN, Bourne PE. 2000. The Protein Data Bank. Nucleic
Acids Res. 28:235–242. Available from:

Berry V, Gascuel O. 1996. Interpretation of bootstrap trees:
threshold of clade selection and induced gain. Mol Biol Evol.
13:999–1011.
Bryant D, Galtier N, Poursat MA. 2005. Likelihood calculations in
phylogenetics. In: Gascuel O, editor. Mathematics of evolution
and phylogeny. Oxford: Oxford University Press. p. 33–62.
Castresana J. 2000. Selection of conserved blocks from multiple
alignments for their use in phylogenetic analysis. Mol Biol Evol.
17:540–552.
Dayhoff MO, Eyck RV, Park CM. 1972. A model of evolutionary
change in proteins. In: Dayhoff MO, editor. Atlas of protein
sequence and structure. Vol. 5. Washington (DC): National
Biomedical Research Foundation. p. 89–99.
Desper R, Gascuel O. 2002. Fast and accurate phylogeny reconstruction
algorithms based on the minimum-evolution principle. J Comp
Biol. 19:687–705.
Durbin R, Eddy S, Krogh A, Mitchison G. 1998. Biological sequence
analysis: probabilistic models of proteins and nucleic acids.
Cambridge: Cambridge University Press.
Felsenstein J. 1981. Evolutionary trees from DNA sequences:
a maximum likelihood approach. J Mol Evol. 17:368–376.
Felsenstein J. 2003. Inferring phylogenies. Sunderland (MA): Sinauer
Associates, Inc.
Felsenstein J, Churchill GA. 1996. A Hidden Markov Model approach
to variation among sites in rate of evolution. Mol Biol Evol. 13:
93–104.

Gascuel O. 1997. BIONJ: an improved version of the NJ algorithm
based on a simple model of sequence data. Mol Biol Evol.
14:685–695.
Gascuel O, Guindon S. 2007. Modelling the variability of evolutionary
processes. In: Gascuel O, Steel M, editors. Reconstructing
evolution: new mathematical and computational advances.
Oxford: Oxford University Press. p. 65–99.
Goldman N, Thorne JL, Jones DT. 1998. Assessing the impact of
secondary structure and solvent accessibility on protein
evolution. Genetics 149:445–458.
Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W,
Gascuel O. 2010. New algorithms and methods to estimate
maximum-likelihood phylogenies: assessing the performance of
PhyML 3.0. Syst Biol. 59:307–321.
Holder MT, Sukumaran J, Lewis PO. 2008. A justification for reporting
the majority-rule consensus tree in Bayesian phylogenetics. Syst
Biol. 57:814–821.
Holmes I, Rubin GM. 2002. An expectation maximization algorithm
for training hidden substitution models. J Mol Biol. 317:753–764.
Jones DT, Taylor WR, Thornton JM. 1992. The rapid generation of
mutation data matrices from protein sequences. Comput Appl
Biosci. 8:275–282.
Jones DT, Taylor WR, Thornton JM. 1994. A mutation data matrix
for transmembrane proteins. FEBS Lett. 339:269–275.
Keane TMC, Creevey CJ, Pentony MM, Naughton TJ, McLnerney JO.
2006. Assessment of methods for amino acid matrix selection
and their use on empirical data shows that ad hoc assumptions
for choice of matrix are not justified. BMC Evol Biol. 6:29.
Kishino H, Hasegawa M. 1989. Evaluation of the maximum
likelihood estimate of the evolutionary tree topologies from

DNA sequence data, and the branching order in Hominoidea.
J Mol Evol. 29:170–179.
Klosterman PSA, Uzilov AV, Bendana YR, Bradley RK, Chao S,
Kosiol C, Goldman N, Holmes I. 2006. XRate: a fast prototyping,
training and annotation tool for phylo-grammars. BMC
Bioinformatics 7:428.
Koshi JM, Goldstein RA. 1995. Context-dependent optimal substitution
matrices. Protein Eng. 8:641–645.
Koshi JM, Goldstein RA. 1998. Models of natural mutations
including site heterogeneity. Proteins 32:289–295.

2935

Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015

substitution matrices for the different evolutionary rate
categories and to reduce the standard gamma distribution
constraints on site rates by adopting a distribution-free
scheme (LG4X). Experiments with independent alignments
showed that LG4M and LG4X most often infer trees with
higher likelihood and AIC values than single-matrix models
(JTT, WAG, and LG), thus illustrating the limit of the standard approach that assumes a unique replacement matrix
regardless of the site rate. Moreover, these trees tend to
differ significantly in their topology from those inferred using the standard approach. These experiments also showed
that our distribution-free scheme for site rates offers high
flexibility and contributes greatly to LG4X performance.
Since LG4M and LG4X produce significantly better
results while requiring the same memory space and similar running times, they would be reasonable replacements for single-matrix models. Current phylogenetic
tree inference software using the standard approach
and gamma distribution of site rates could immediately

use LG4M because it requires the same data structures
and procedures as single-matrix models. LG4X could be
easily adapted as well, by adding an appropriate optimization procedure for estimating the distribution-free site
rate parameters. These two models and others, notably
structural, will be incorporated in a forthcoming release
of the official PhyML.
Many questions arise from the improvement provided by
LG4X and LG4M. The first is to check the number of rate
categories. It is commonly acknowledged (our results in
table 2 support this practice) that four gamma-distributed
rate categories provide a fair compromise for single-matrix
models. Moreover, LG4X with four rate categories is better
than two-matrix  four-rate models (EX2þC4) and comparable to three-matrix  four-rate models (UL3þC4). However, we do not yet know the difference when we increase or
decrease the number of rate categories and use the same
scheme as LG4M or LG4X. Variants of LG4X and LG4M
and their combination with the standard þF option to
fit proteins with specific amino acid distributions should also
be investigated. Lastly, a major direction for further research
is to better understand the substitution processes revealed
by these complex models and replacement matrices, unravel
the biological differences between models, and exploit them
for further improvements.

MBE


Le et al. · doi:10.1093/molbev/mss112

2936


Sanderson MJ, Donoghue MJ, Piel W, Eriksson T. 1994. TreeBase:
a prototype database of phylogenetic analyses and an interactive
tool for browsing the phylogeny of life. Am J Bot. 81:183.
Schneider R, de Daruvar A, Sander C. 1997. The HSSP database of protein
structure-sequence alignments. Nucleic Acids Res. 25:226–230.
Shimodaira H. 1997. Assessing the error probability of the model
selection test. Ann Inst Stat Math. 49:395–410.
Susko E, Field C, Blouin C, Roger AJ. 2003. Estimation of ratesacross-sites distributions in phylogenetic substitution models.
Syst Biol. 52:594–603.
Thorne JL, Goldman N, Jones DT. 1996. Combining protein
evolution and secondary structure. Mol Biol Evol. 13:666–673.
Wang HC, Li K, Susko E, Roger AJ. 2008. A class frequency mixture
model that adjusts for site-specific amino acid frequencies
and improves inference of protein phylogeny. BMC Evol Biol.
8:331.
Whelan S, Goldman N. 2001. A general empirical model of
protein evolution derived from multiple protein families
using a maximum-likelihood approach. Mol Biol Evol. 18:
691–699.
Yang Z. 1993. Maximum-likelihood estimation of phylogeny from
DNA sequences when substitution rates differ over sites. Mol
Biol Evol. 10:1396–1401.
Yang Z. 2006. Computational molecular evolution. Oxford: Oxford
University Press.
Yang Z, Nielsen R, Hasegawa M. 1998. Models of amino acid
substitution and applications to mitochondrial protein
evolution. Mol Biol Evol. 15:1600–1611.

Downloaded from at UNIVERSIDAD DE SEVILLA on June 17, 2015


Kyte J, Doolittle RF. 1982. A simple method for displaying the
hydropathic character of a protein. J Mol Biol. 157:105–132.
Lartillot N, Philippe H. 2004. A Bayesian mixture model for acrosssite heterogeneities in the amino-acid replacement process. Mol
Biol Evol. 21:1095–1109.
Le SQ, Gascuel O. 2008. An improved general amino acid
replacement matrix. Mol Biol Evol. 25:1307–1320.
Le SQ, Gascuel O. 2010. Accounting for accessibility to solvent and
secondary structure in protein phylogenetics is clearly beneficial.
Syst Biol. 59:277–287.
Le SQ, Gascuel O, Lartillot N. 2008. Empirical profile mixture
models for phylogenetic reconstruction. Bioinformatics
24:2317–2323.
Le SQ, Lartillot N, Gascuel O. 2008. Phylogenetic mixture models for
proteins. Philos Trans R Soc Lond B Biol Sci. 363:3965–3976.
Lio P, Goldman N, Thorne JL, Jones DT. 1998. PASSML: combining
evolutionary inference and protein secondary structure prediction.
Bioinformatics 14:726–733.
Mayrose I, Friedman N, Pupko T. 2005. A Gamma mixture model
better accounts for among site rate heterogeneity. Bioinformatics
21:151–158.
Pagel M, Meade A. 2005. Mixture models in phylogenetic inference.
In: Gascuel O, editor. Mathematics of evolution and phylogeny.
Oxford: Oxford University Press. p. 121–142.
Raman P, Cherezov V, Caffrey M. 2006. The Membrane Protein Data
Bank. Cell Mol Life Sci. 63:36–51.
Robinson D, Foulds L. 1979. Comparison of weighted labeled trees.
Lect Notes Math. 748:119–126.

MBE




×