
DSpace at VNU: Protein type specific amino acid substitution models for influenza viruses


2011 Third International Conference on Knowledge and Systems Engineering

Protein type specific amino acid substitution models for influenza viruses
Nguyen Van Sau (1), Dang Cao Cuong (1), Le Si Quang (3), Le Sy Vinh (1, 2)
(1) University of Engineering and Technology, VNU
(2) Institute of Information Technology, VNU
Vietnam National University Hanoi, 144 Xuan Thuy, Ha Noi, Viet Nam
(3) Wellcome Trust Centre for Human Genetics
University of Oxford, UK
Roosevelt Drive, Oxford OX3 7BN, UK
Abstract: The amino acid substitution model (matrix) is a crucial part of protein sequence analysis systems. General amino acid substitution models have been estimated from large protein databases; however, they are not specific to influenza viruses. In a previous study, we estimated an amino acid substitution model, FLU, for all influenza viruses. Experiments showed that FLU outperformed other models when analyzing influenza protein sequences. Influenza virus genomes consist of different protein types, which differ in both structure and evolutionary process. Although the FLU matrix is specific to influenza viruses, it is not specific to individual influenza protein types. Since influenza viruses cause serious problems for both human health and the economy, it is worth studying them as specifically as possible.
In this paper, we used more than 27 million amino acids to estimate 11 protein type specific models for influenza viruses. Experiments showed that the protein type specific models outperformed the FLU model, previously the best model for influenza viruses. These protein type specific models help researchers to conduct studies on influenza viruses more precisely.

Keywords: influenza virus, amino acid substitution model, phylogenetic tree.

978-0-7695-4567-7/11 $26.00 © 2011 IEEE
DOI 10.1109/KSE.2011.23

1. BACKGROUND
Protein sequence analysis systems usually require an amino acid substitution model for analyzing the relationships between protein sequences. Estimating amino acid substitution models has therefore been a crucial task in bioinformatics for more than four decades.
There are two main approaches to estimating amino acid substitution models from protein alignments. The first estimates substitution rates between amino acids under the assumption that the probability of one amino acid being exchanged for another over a period of time is linear in the substitution rate between the two amino acids. Substitution rates can thus be estimated directly from the number of exchanges between pairs of amino acid sequences. This approach is simple and applicable to large databases. However, the assumption is acceptable only if the time period is short, so the amino acid sequences must be very closely related (typically with >85% identity). PAM [1,2] and JTT [3] are the two most popular models estimated by this approach.
The second approach takes advantage of multiple alignments by using the maximum likelihood method. The main idea is to estimate both the phylogenies and the substitution model so as to maximize the likelihood of the alignments. Adachi and Hasegawa [4], Yang et al. [5], and Adachi et al. [4] were the first to apply this approach, to alignments from a few species, under the assumption that all proteins share the same phylogeny. Whelan and Goldman relaxed this assumption by using approximate phylogenies for different alignments. Le and Gascuel [6] extended Whelan and Goldman's method by optimizing phylogenies and evolutionary rates across sites during the estimation process.
General models have been estimated from large databases; however, recent studies have shown that they might not be appropriate for particular sets of species because of differences in the evolutionary processes of these species [7,8,9]. A number of specific amino acid substitution models for important species have been introduced. For example, Dimmic and colleagues estimated the rtREV model for inference of retrovirus and reverse transcriptase phylogeny [9]. Nickle and coworkers introduced HIV-specific models that showed a consistently superior fit compared with the best general models when analyzing HIV proteins [7].
Influenza viruses are among the most dangerous viruses for birds and humans. They are RNA viruses belonging to the Orthomyxoviridae family. They are divided into three types: influenza A, influenza B, and influenza C, of which influenza A is the most prevalent and dangerous. In recent years, influenza A viruses have caused serious problems for human health and the economy. Currently emerging influenza epidemics include H5N1 ('avian flu') and H1N1. More details about historical and recently emerging influenza pandemics and epidemics can be found on the World Health Organization website.
Theoretical and experimental studies have been conducted extensively for decades to understand the evolution, transmission, and infection processes of influenza viruses [10,11,12,8] (and references therein). Recently, we published the FLU model, which was specifically estimated for influenza viruses. Our extensive experiments showed that FLU performs much better than other models when analyzing influenza protein sequences.
Although the FLU model is specific to influenza viruses, it is not specific to protein types. The influenza A virus genome encodes 11 different protein types: HA, NA, M1, M2, NS1, NS2, NP, PA, PB1, PB1-F2, PB2 (see Table 1 for more details). These protein types have different structures and evolve at different rates. These differences create a need for different amino acid substitution models for different protein types.
In this study, we continue our work on amino acid substitution models for influenza viruses. Since influenza A viruses are the most prevalent and dangerous, we estimated 11 amino acid substitution models for the 11 protein types of influenza A viruses. These models will allow researchers to analyze the evolutionary processes of influenza proteins more precisely.
The paper is organized into five sections. Section 2 (Method) presents the theoretical background of amino acid substitution models and our approach to estimating protein type specific models. Section 3 (Data preparation) describes how we prepared the protein sequences used to estimate the models. Section 4 reports comparisons among the models. Conclusions are given in the last section.

2. METHOD
The substitution process at each amino acid site is assumed to be independent of the other sites, stationary, and constant over time [13,14]. We can therefore use a time-homogeneous, time-continuous, and time-reversible Markov process [13,14,15] to model the substitution process between amino acids.
The model consists of two components: 1) an instantaneous substitution rate 20x20 matrix $Q = \{q_{xy}\}$, where $q_{xy}$ is the number of substitutions from amino acid x to amino acid y per time unit; 2) an amino acid frequency 20-vector $\Pi = \{\pi_x\}$, where $\pi_x$ is the frequency of amino acid x. While $\Pi$ can be easily estimated from data using a counting method, Q is the subject of estimation methods.
We apply a four-step maximum likelihood approach to estimate protein type specific models, as pictured in Figure 1:
- Data preparation step: Download, clean, classify and align sequences to create multiple protein alignments (more details are presented in Section 3).
- Constructing tree step: For each protein alignment, use a maximum likelihood method (such as PhyML [16]) to construct a phylogenetic tree using an initial matrix Q (initialized with the FLU matrix).
- Estimating model step: Use an expectation-maximization algorithm (such as XRate [17]) to train a new model Q' using the protein alignments and the reconstructed trees.
- Comparing step: Compare Q and Q'. If Q' is nearly identical to Q, Q' is considered the final model. Otherwise, replace Q by Q' and go back to the Constructing tree step.

Figure 1. The four-step approach to estimate protein type specific amino acid substitution models.
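For reference, the Markov process assumptions of this section can be written out as follows. This is a sketch of standard properties of such processes, not a formula taken from the paper: Q denotes the instantaneous rate matrix with entries q_xy, π_x the equilibrium frequency of amino acid x, and P(t) is notation introduced here for the matrix of substitution probabilities over time t.

```latex
% P(t) is notation assumed for this sketch; Q and \pi_x follow the text.
\begin{align}
  P(t) &= e^{Qt} = \sum_{k=0}^{\infty} \frac{(Qt)^k}{k!}
       && \text{(time-homogeneous, time-continuous)} \\
  q_{xx} &= -\sum_{y \neq x} q_{xy}
       && \text{(each row of } Q \text{ sums to zero)} \\
  \pi_x \, q_{xy} &= \pi_y \, q_{yx} \quad \text{for all } x \neq y
       && \text{(time-reversible)}
\end{align}
```

The last condition is why a reversible model is commonly stored as a symmetric set of exchange rates together with the frequency vector.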
Table 1. Data of 11 protein types of influenza A viruses

Protein type  #Sequences  #Alignments  Proportion (%)
HA            17,261      646          26.10
NA            9,718       377          14.69
PB2           6,873       274          10.39
PA            6,443       245          9.74
PB1           6,195       238          9.37
NS1           4,852       176          7.34
NP            4,568       162          6.91
M2            3,263       124          4.93
NS2           2,465       91           3.73
M1            2,399       88           3.63
PB1-F2        2,102       79           3.18



Extensive experiments show that Q is almost unchanged (Q' ≈ Q) after three iterations.
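The iterative procedure of Section 2 (construct trees with the current matrix, train a new matrix, compare, repeat) can be sketched as a small fixed-point loop. This is a minimal illustration, not the authors' implementation: `build_trees` and `train_model` are hypothetical callables standing in for tools such as PhyML and XRate, and the stopping rule (largest entry-wise difference below `tol`) is an assumption, since the paper only says "nearly identical".

```python
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]


def max_abs_diff(a: Matrix, b: Matrix) -> float:
    """Largest absolute entry-wise difference between two equally sized matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def estimate_model(
    alignments: Sequence[object],
    q_init: Matrix,
    build_trees: Callable[[Sequence[object], Matrix], list],
    train_model: Callable[[Sequence[object], list], Matrix],
    tol: float = 1e-3,      # assumed convergence threshold
    max_iter: int = 10,     # assumed safety limit
) -> Tuple[Matrix, int]:
    """Alternate tree construction and model training until Q stops changing.

    Returns the final matrix and the number of iterations performed.
    """
    q = q_init
    for iteration in range(1, max_iter + 1):
        trees = build_trees(alignments, q)       # Constructing tree step
        q_new = train_model(alignments, trees)   # Estimating model step
        if max_abs_diff(q, q_new) <= tol:        # Comparing step
            return q_new, iteration
        q = q_new
    return q, max_iter
```

Passing the external tools in as callables keeps the loop itself testable with toy stand-ins.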

Model performance analysis
We compared the 11 protein type specific models with the FLU model (the best existing amino acid substitution model for influenza viruses) by comparing the maximum likelihood trees constructed using the different models. This is the standard way to compare substitution models.
As expected, experiments showed that the protein type specific models outperformed all other models when analyzing their corresponding protein sequences (see Table 2). For example, the PA model is the best model in 98.37% of cases when analyzing PA alignments. Table 2 also shows that the NS2 model does not completely outperform other models: it is the best model in only 65.93% of cases when analyzing NS2 sequences. This is because only a small number of NS2 protein sequences are available for estimating the NS2 model.
Table 3 summarizes the comparison between FLU and the protein type specific models in terms of log likelihood when analyzing their corresponding proteins. The greater the log likelihood per site, the better the model. The protein type specific models are clearly better than the FLU model when analyzing their corresponding proteins. For example, the log likelihood per site of the HA model (-16.5699) is higher than those of the other models.
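The "% of cases where the model is the best" metric can be illustrated with a short sketch. It assumes that, for every test alignment, a log likelihood is available for each candidate model; the function name and the toy values below are hypothetical and are not results from the paper.

```python
from collections import Counter
from typing import Dict, List


def best_model_share(loglk_per_alignment: List[Dict[str, float]]) -> Dict[str, float]:
    """Percentage of alignments on which each model has the highest log likelihood."""
    # For each alignment, pick the model name whose log likelihood is largest.
    wins = Counter(max(scores, key=scores.get) for scores in loglk_per_alignment)
    total = len(loglk_per_alignment)
    return {model: 100.0 * count / total for model, count in wins.items()}


# Toy example with made-up log likelihoods for four alignments.
toy = [
    {"HA": -16.5, "FLU": -16.7},
    {"HA": -16.9, "FLU": -16.8},
    {"HA": -10.0, "FLU": -10.5},
    {"HA": -12.0, "FLU": -12.1},
]
print(best_model_share(toy))
```

The same arithmetic links the counts and percentages reported for HA: 587 wins out of 646 alignments is 100 * 587 / 646, about 90.87%.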

3. DATA PREPARATION

On 7 January 2011, there were more than 9,300 complete genomes, comprising 200,000 protein sequences, in the Influenza database at NCBI (www.ncbi.nlm.nih.gov/genomes/FLU/) [18]. In the database, 95% of the sequences are influenza A proteins, including ~9,000 complete genomes and ~190,000 protein sequences. The remaining sequences are from influenza B and C viruses.
The number of available sequences for influenza B and C is not sufficient to estimate protein type specific models for these virus types. We therefore concentrated on estimating models for the 11 protein types of influenza A viruses. The data preparation process was as follows:
- Downloading step: We downloaded 200,000 influenza A protein sequences consisting of more than 27 million amino acids.
- Cleaning step: There were a large number of duplicated sequences. We removed the duplicates and obtained ~100,000 unique protein sequences.
- Categorizing step: Sequences were classified into 11 classes corresponding to the 11 protein types: HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, PB2.
- Splitting step: Sequences of the same class were split into subgroups such that each subgroup consists of 5 to 50 sequences.
- Aligning step: Sequences of each subgroup were aligned using the MUSCLE program (default parameters) [19] and subsequently cleaned with the Gblocks program (parameter -b5=h) [20] to eliminate sites containing too many gaps. We selected 2,500 alignments (66,139 sequences, 1,058,987 sites, and 27,588,017 amino acids), each consisting of at least 50 amino acid sites.
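The cleaning and splitting steps above can be sketched as follows. The paper does not say how sequences were assigned to subgroups, so the even chunking used here is only an assumption, and both function names are hypothetical.

```python
from math import ceil
from typing import Dict, List


def deduplicate(records: Dict[str, str]) -> Dict[str, str]:
    """Keep the first record for every distinct sequence string."""
    seen = set()
    unique: Dict[str, str] = {}
    for name, sequence in records.items():
        if sequence not in seen:
            seen.add(sequence)
            unique[name] = sequence
    return unique


def split_into_subgroups(names: List[str], min_size: int = 5,
                         max_size: int = 50) -> List[List[str]]:
    """Split names into the fewest near-equal groups of at most max_size.

    With the defaults (5 and 50) every group has between 5 and 50 members;
    classes with fewer than min_size sequences are dropped (an assumption).
    """
    n = len(names)
    if n < min_size:
        return []
    n_groups = ceil(n / max_size)
    base, extra = divmod(n, n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        groups.append(names[start:start + size])
        start += size
    return groups
```

For example, a class of 120 sequences would be split into three subgroups of 40, each of which would then be aligned separately.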

4. RESULTS
We estimated amino acid substitution models (HA, NA, M1, M2, NS1, NS2, NP, PA, PB1, PB1-F2 and PB2) for the 11 corresponding protein types of influenza A viruses. To compare the models, we conducted two-fold cross-validation: we randomly divided the dataset of each protein type into two equal subsets, one for training and the other for testing.

Table 2. Summary of FLU and the HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, PB2 models when analyzing their corresponding protein type sequences. For example, PA is the best model in 98.37% of PA alignments.

Model    % cases where the model is the best
PA       98.37
NS1      98.30
NP       97.53
NA       95.76
HA       90.87
M2       89.52
M1       87.50
PB1      87.39
PB2      86.50
PB1-F2   68.33
NS2      65.93
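The random split into a training half and a testing half used for cross-validation in Section 4 can be sketched in a few lines. The function name and the fixed seed are assumptions made for reproducibility of the illustration; the paper does not describe how the randomization was done.

```python
import random
from typing import List, Sequence, Tuple


def two_fold_split(items: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Shuffle a copy of items and cut it into a training and a testing half.

    With an odd number of items the testing half receives the extra one.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


train, test = two_fold_split(range(10), seed=1)
```

Each protein type would be split separately, so that every protein type specific model is trained and tested on alignments of its own type.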


Table 3. Pairwise comparisons between FLU and the HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, PB2 models in terms of log likelihood.

Model (M1)  LogLK/site (M1)  LogLK/site (FLU, M2)  M1>M2  M1<M2
HA          -16.5699         -16.6759              587    17
M1          -5.16385         -5.28431              77     1
M2          -7.84449         -8.31075              111    1
NA          -13.069          -13.1597              361    3
NS1         -9.43458         -9.69325              173    0
NS2         -7.64682         -8.07906              60     3
NP          -5.34475         -5.40045              158    3
PA          -5.03603         -5.06622              241    3
PB1         -4.49867         -4.50842              208    30
PB1-F2      -12.453          -15.6054              41     17
PB2         -4.63386         -4.65857              237    34

The most important protein types of influenza viruses are the HA and NA proteins. Combinations of different HA and NA protein variants give rise to the different influenza subtypes, such as H1N1 and H5N1. Table 1 shows that HA (~26%) and NA (~15%) are the two most prevalent proteins in the database. In the following, we present analyses of the HA and NA proteins.
Table 4 shows that the HA model is the best model when analyzing HA proteins. It yields the best likelihood trees for 587 out of 646 HA protein alignments (~90.87%) and is the second best in 58 other cases (~9.0%). FLU is the second best model in most cases (587 out of 646).

Table 4. The results of the five best models when analyzing HA sequences. HA is the best model in 587 out of 646 cases.

Model    1st   2nd   3rd   4th   5th
HA       587   58    1     0     0
PB1-F2   41    1     0     0     0
FLU      17    587   41    1     0
NS2      1     0     0     0     0
NA       0     0     593   50    3

Table 5 shows a similar pattern when analyzing NA sequences. The NA model clearly outperforms the other models when analyzing NA protein sequences. It ranks first in 361 out of 377 cases (95.76%) and second in the remaining cases. FLU is again the second best model in most cases (95.76%).

Table 5. The results of the five best models when analyzing NA sequences. NA is the best model in 361 out of 377 cases.

Model    1st   2nd   3rd   4th   5th
NA       361   16    0     0     0
PB1-F2   13    0     0     0     0
FLU      3     361   13    0     0
HA       0     0     338   28    9
PB1      0     0     21    284   60

Table 6 presents the log likelihood per site of the different models when analyzing HA and NA proteins. The log likelihood per site of the HA (NA) model when analyzing HA (NA) proteins is -16.5699 (-13.0690), which is the best among all models. FLU is the second best model.

Table 6. Log likelihood comparison among the HA (NA) model and other models when analyzing HA (NA) protein sequences.

Model    Log likelihood per site    Model    Log likelihood per site
         (HA proteins)                       (NA proteins)
HA       -16.5699                   NA       -13.0690
FLU      -16.6759                   FLU      -13.1597
NA       -16.9003                   HA       -13.3791
PB1      -17.041                    PB1      -13.4553
PB2      -17.1564                   PB2      -13.5070
PA       -17.1793                   PA       -13.5124
NP       -17.3324                   NP       -13.6538
NS1      -17.6543                   NS1      -13.9437
M1       -17.8061                   M1       -14.0849
M2       -18.4358                   M2       -14.2751
NS2      -18.5862                   NS2      -14.8277
PB1F2    -21.2087                   PB1F2    -16.6106


Model correlation analysis
The correlations among FLU and the protein type specific models are reported in Table 7. These models clearly differ from each other. For example, the correlation between the HA (NA) and FLU models is only 0.878 (0.886). The comparison suggests that the FLU model is not specific enough to analyze the different protein types. In other words, protein type specific models should be used to analyze their corresponding proteins.

Table 7. Correlations among the 12 models. These models are very different from each other. For example, the correlation between the HA and FLU models is only 0.878.

        HA     M1     M2     NA     NS1    NS2    NP     PA     PB1    PB1-F2 PB2
M1      0.295
M2      0.427  0.276
NA      0.900  0.305  0.576
NS1     0.542  0.340  0.479  0.541
NS2     0.149  0.123  0.182  0.157  0.225
NP      0.631  0.417  0.678  0.732  0.737  0.303
PA      0.643  0.476  0.642  0.722  0.809  0.296  0.897
PB1     0.751  0.447  0.608  0.800  0.741  0.234  0.915  0.888
PB1-F2  0.063  0.196  0.305  0.077  0.138  0.041  0.098  0.276  0.120
PB2     0.750  0.411  0.578  0.784  0.723  0.263  0.850  0.861  0.945  0.135
FLU     0.878  0.344  0.544  0.886  0.520  0.191  0.711  0.694  0.856  0.122  0.828
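A correlation between two substitution models can be computed along the following lines. The paper does not state which entries were compared or which coefficient was used; this sketch assumes a Pearson correlation over the entries below the diagonal of two equally sized rate matrices, and the function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence

Matrix = List[List[float]]


def lower_triangle(m: Matrix) -> List[float]:
    """Entries strictly below the diagonal, row by row."""
    return [m[i][j] for i in range(len(m)) for j in range(i)]


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def model_correlation(q1: Matrix, q2: Matrix) -> float:
    """Correlation between the below-diagonal entries of two rate matrices."""
    return pearson(lower_triangle(q1), lower_triangle(q2))
```

Under this definition, a matrix and any positively rescaled copy of it have correlation 1, so the measure compares the pattern of rates rather than their overall scale.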

Tree structure analysis
We also analyzed the impact of the matrices on tree structures. To this end, we used the Robinson-Foulds (RF) distance [21] to measure the difference between two tree topologies. The RF distance is the number of bipartitions present in one of the two trees but not the other, divided by the number of possible bipartitions. Thus, the smaller the RF distance between two trees, the closer their topologies. The RF distance ranges from 0.0 to 1.0. We computed RF distances between trees reconstructed using the FLU model and trees reconstructed using the protein type specific models.
Figure 2 shows that the tree topologies inferred using FLU and the 11 protein type specific models differ. There are 7,932 cases where the RF distance is 0.1. We even observed many cases where the trees reconstructed from FLU and from the protein type specific models are very different. Thus, these models have a strong impact on tree structures.

[Figure 2 plot: horizontal axis from 0.0 to 1.0 in steps of 0.1; vertical axis from 0 to 9000 in steps of 1000]
Figure 2. The Robinson-Foulds distances between trees inferred using FLU and the 11 protein type specific models for proteins of influenza A viruses. The horizontal axis indicates the RF distance between two tree topologies, and the vertical axis indicates the number of alignments.
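The normalized RF distance described in this section can be sketched for small trees written as nested tuples of leaf names. Two assumptions are made here that the paper does not spell out: both trees are binary and share the same leaf set, and "the number of possible bipartitions" is taken to be 2(n - 3) for n leaves, the largest possible number of differing internal bipartitions between two unrooted binary trees.

```python
from typing import FrozenSet, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]


def leaves(tree: Tree) -> FrozenSet[str]:
    """All leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree: Tree) -> Set[FrozenSet[str]]:
    """Non-trivial bipartitions, each stored as the side without a reference leaf."""
    all_leaves = leaves(tree)
    reference = min(all_leaves)
    splits: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:  # skip trivial splits
            splits.add(side)
        return clade

    visit(tree)
    return splits


def normalized_rf(t1: Tree, t2: Tree) -> float:
    """Share of bipartitions found in exactly one of the two trees (0.0 to 1.0)."""
    differing = bipartitions(t1) ^ bipartitions(t2)
    max_differing = 2 * (len(leaves(t1)) - 3)
    return len(differing) / max_differing if max_differing > 0 else 0.0


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(normalized_rf(t1, t2))
```

Here the two five-leaf trees share the bipartition separating D and E from the rest but disagree on the other one, so half of the possible bipartitions differ.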



5. CONCLUSIONS

Influenza viruses are among the most dangerous viruses for human health and the economy. Although they have been the subject of thousands of studies, they continue to attract extensive research as well as funding from governments and pharmaceutical companies.
Through our intensive study of influenza viruses with a huge number of protein sequences, we were able to estimate 11 amino acid substitution models for the 11 protein types of influenza A viruses.
Our experiments showed that the protein type specific models gave better results than FLU, previously the best model for influenza viruses. The model correlation and tree structure analyses showed that these models are very different from each other and have a strong impact on tree structures. The protein type specific models enable researchers to study influenza protein sequences more precisely. We strongly recommend that researchers use protein type specific models to analyze the corresponding protein sequences.

ACKNOWLEDGMENT
This work is partially supported by the TRIG project at University of Engineering and Technology, VNU Hanoi.

REFERENCES
[1] Dayhoff, Atlas of Protein Sequence Structure, M. O. Dayhoff, Ed. Washington DC: National Biomedical Research Foundation, 1978, vol. 5.
[2] Dayhoff, M. O. and Schwartz, R. M. and Orcutt, B. C., "A Model of Evolutionary Change in Proteins," in Atlas of Protein Sequence Structure, M. O. Dayhoff, Ed. Washington DC: National Biomedical Research Foundation, 1978, vol. 5, pp. 345-352.
[3] Jones, David T. and Taylor, William R. and Thornton, Janet M., "The rapid generation of mutation data matrices from protein sequences," Comput. Appl. Biosci., pp. 275-282, 1992.
[4] Adachi, Jun and Hasegawa, Masami, "Model of Amino Acid Substitution in Proteins Encoded by Mitochondrial DNA," J. Mol. Evol., pp. 459-468, 1996.
[5] Nielsen, Rasmus and Yang, Ziheng, "Likelihood Models for Detecting Positively Selected Amino Acid Sites and Applications to the HIV-1 Envelope Gene," Genetics, vol. 148, pp. 929-936, 1998.
[6] Quang Le and Olivier Gascuel, "An improved general amino acid replacement matrix," Mol. Biol. Evol., pp. 1307-1320, 2008.
[7] David C. Nickle, Laura Heath, Mark A. Jensen, Peter B. Gilbert, James I. Mullins, and Sergei L. Kosakovsky Pond, "HIV-Specific Probabilistic Models of Protein Evolution," PLoS ONE, vol. 2, p. e503, 2007.
[8] Cuong Cao Dang, Quang Si Le, Olivier Gascuel and Vinh Sy Le, "FLU, an amino acid substitution model for," BMC Evolutionary Biology, 2010.
[9] M. W. Dimmic, Rest JS, Mindell DP, and Goldstein RA., "rtREV: an amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogeny," J Mol. Evol., vol. 55, pp. 65-73, 2002.
[10] Anthony S. Fauci, "Race against time," Nature, vol. 435, 2009.
[11] Daniel A. Janies, Andrew Hill, Rob Guralnick, Farhat Habib, Eric Waltari, Ward C. Wheeler, "Genomic Analysis and Geographic Visualization of the Spread of Avian Influenza (H5N1)," Systematic Biology, vol. 56, pp. 321-329, 2007.
[12] Tien D. Nguyen, The Vinh Nguyen, Dhanasekaran Vijaykrishna, Robert G. Webster, Yi Guan, J.S. Malik Peiris, and Gavin J.D. Smith, "Multiple Sublineages of Influenza A Virus (H5N1), Vietnam, 2005-2007," Emerging Infectious Diseases, vol. 14, pp. 632-636, 2008.
[13] Felsenstein, Joe, Inferring Phylogenies. Sunderland, Massachusetts: Sinauer Associates, 2004.
[14] Ziheng Yang, Computational Molecular Evolution, 1st ed. Oxford University Press, 2006.
[15] Strimmer, Korbinian and von Haeseler, Arndt, "Nucleotide Substitution Models," in The Phylogenetics Handbook: A Practical Approach to DNA and Protein Phylogeny, Marco Salemi and Anne-Mieke Vandamme, Eds. Cambridge: Cambridge University Press, 2003, pp. 72-100.
[16] Guindon, Stephane and Olivier Gascuel, "A Simple, Fast and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood," Syst. Biol., vol. 52, pp. 696-704, 2003.
[17] Peter S Klosterman, Andrew V Uzilov, Yuri R Bendaña, Robert K Bradley, Sharon Chao, Carolin Kosiol, Nick Goldman and Ian Holmes, "XRate: a fast prototyping, training and annotation tool for phylo-grammars," BMC Bioinformatics, vol. 7, p. 428, 2006.
[18] Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, Tatusova T, Ostell J, Lipman D, "The Influenza Virus Resource at the National Center for Biotechnology Information," J Virol, vol. 82, pp. 596-601, 2008.
[19] Edgar, Robert C., "MUSCLE: multiple sequence alignment with high accuracy and high throughput," Nucl. Acids Res., vol. 32, pp. 1792-1797, 2004.
[20] J. Castresana, "Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis," Molecular Biology and Evolution, vol. 17, pp. 540-552, 2000.
[21] Felsenstein, Joseph, "Distance methods for inferring phylogenies: A Justification," Evolution, vol. 38, pp. 16-24, 1984.


