Domain deletions and substitutions in the modular protein
evolution
January Weiner 3rd, Francois Beaussart and Erich Bornberg-Bauer
Division of Bioinformatics, School of Biological Sciences, The Westfalian Wilhelms University of Munster, Germany
Keywords
domain loss; fission; fusion; protein
domains; protein evolution
Correspondence
E. Bornberg-Bauer, Division of
Bioinformatics, School of Biological
Sciences,The Westfalian Wilhelms
University of Munster, Schlossplatz 4,
D48149 Munster, Germany
Fax: +49 251 8321631
Tel: +49 251 8321630
E-mail:
(Received 5 December 2005, revised 13
February 2006, accepted 9 March 2006)
doi:10.1111/j.1742-4658.2006.05220.x
The main mechanisms shaping the modular evolution of proteins are
gene duplication, fusion and fission, recombination and loss of fragments. While a large body of research has focused on duplications and
fusions, we concentrated, in this study, on how domains are lost. We
investigated motif databases and introduced a measure of protein similarity that is based on domain arrangements. Proteins are represented as
strings of domains and comparison was based on the classic dynamic
alignment scheme. We found that domain losses and duplications were
more frequent at the ends of proteins. We showed that losses can be
explained by the introduction of start and stop codons which render the
terminal domains nonfunctional, such that further shortening, until the
whole domain is lost, is not evolutionarily selected against. We demonstrated that domains which also occur as single-domain proteins are less
likely to be lost at the N terminus and in the middle, than at the C terminus. We conclude that fission/fusion events with single-domain
proteins occur mostly at the C terminus. We found that domain substitutions are rare, in particular in the middle of proteins. We also showed
that many cases of substitutions or losses result from erroneous annotations, but we were also able to find courses of evolutionary events where
domains vanish over time. This is explained by a case study on the bacterial formate dehydrogenases.
Proteins are well known to evolve not only by point
mutations, but also by modular rearrangements [1–3].
By and large, these rearrangements occur at the
level of domains, which are independent folding units
and have been proposed to represent the unit of
modular evolution [3,4]. Most domains always form
the same combinations; that is, they are always
found next to the same neighbours. For example,
domains found in ribosomal proteins are not found
elsewhere and are present always in the same context. Also, it has been reported that many domains
appear in a very much conserved order (supradomains) [5], and that the frequent occurrence of certain modular arrangements (arrangements of modules
along a sequence) across phyla is the result of conservation [6].
While few domains co-occur with many others at
least once in the same protein, most domains have few
partner domains, or are even always singletons [3,7–9].
Well-known examples of highly linked domains occurring in many different combinations are the P-loop
nucleotide triphosphate hydrolase domain, the epidermal growth factor (EGF) domain, the SH3 domain,
the P-kinase domain and the domains involved in the
blood clotting cascade [1,10].
The phenomenon of differential arrangements has
often been termed domain mobility [11]. However,
this term may be misleading as it implies that single
Abbreviations
Domain ID, domain identification number; EGF, epidermal growth factor; FDHF, formate dehydrogenase H.
FEBS Journal 273 (2006) 2037–2047 © 2006 The Authors Journal compilation © 2006 FEBS
Mechanisms shaping modular protein evolution
J. Weiner 3rd et al.
modules or small arrangements are being transferred
from one protein to another. Considering that often
two modules or larger arrangements as such are
fused into one protein, it becomes difficult to define
which of the modules is ‘mobile’ and which is ‘static’. Therefore, it has been suggested that the term
versatility should be used instead of domain mobility
[3,12]. Independently of the perspective taken, the
underlying mechanisms of modular rearrangements
are mostly gene fusion and domain loss and, probably to a lesser extent, domain shuffling of exons
and recombination [13–17].
While the emergence of domain combinations is well
documented [4,6,7,18–21], relatively little is known
about domain losses.
In this article, we focus on how domains are lost.
Ultimately, this question is difficult to discern from the
recruitment of domains because, in comparing two
proteins, phylogenetic analysis is required to detect
whether a domain has been recruited in one protein or
lost in the other. To deal with this problem, we investigated the possible genetic mechanisms that can cause a
domain to be lost or gained.
As usual in sequence analysis, information on the
history of evolution can only be assumed a posteriori, meaning that disadvantageous mutations (frameshifts, domain deletions, etc.) have been weeded out
by negative selection. Thus, we only observe events
of modular rearrangements that are either beneficial
or neutral. For the sake of comprehensiveness, we
used the ProDom database [22], which records
conserved sequence fragments. However, these fragments are not
always identical to structural domains. To conform
to the general definition of domains [3], all key
results were confirmed using Pfam, which largely
agrees with structural domain definitions [23].
In the following study we first investigated whether
the relative frequencies of deletions (or recruitments)
depend on whether a domain is at the end or in the middle of
a protein. Unless explicitly stated otherwise, we used the term
‘deletion’ to cover both deletions and recruitments. We then investigated whether eliminations are
more frequently observed at the boundaries of
domains and whether or not domain substitutions are
frequent. For that purpose, we categorized and described misannotations of domains to discern them
from real substitutions or deletions of domains. Next,
we studied whether some domains are more often lost
and whether frequencies of domain deletions depend
on domain versatility. Finally, we discussed the implications of our results for a wider understanding of
modular protein evolution and the possibilities for generating a model in which modular protein evolution is
formally described in terms of module edit operations
and cost functions.
Results and Discussion
Single domain deletions
The first question we asked was whether the probability of a domain deletion is evenly distributed throughout a protein. The null hypothesis was that genetic
mechanisms which lead to domain deletions (for example, deletions and insertions of sequence fragments,
intron recombinations, etc.) do not depend on the
position within the sequence. However, two factors
could cause a bias. First, any point mutation that creates a premature stop codon will cause a C-terminal
deletion of a protein. Likewise, a mutation leading to
the emergence of an alternative transcription or translation start will cause an N-terminal deletion. Second,
a fission producing two genes from one will result in
the deletion of a terminal fragment from a protein or,
vice versa, a fusion of two smaller proteins into one
will result in the observed pattern.
We first grouped proteins by the number of domains
they have (see the Materials and methods). For each
protein, we searched for deletion events, that is, a protein which has exactly the same domain arrangement,
except for a single domain missing anywhere in the
arrangement. Then we calculated the frequency of the
deletion at each domain position within the group of
proteins containing a given number of domains.
We found that the domain deletions are more common at either of the protein termini, and that their
occurrence is slightly higher at one of the termini,
depending on the number of domains in the protein
and the database selected (Fig. 1). The prevalence of
terminal deletions did not depend on the number of
domains in proteins, and the results for Pfam and ProDom databases were similar. In only a few cases were
slightly increased frequencies of domain deletions
observed at a central position.
According to our predictions, this suggests that the
genetic mechanism of domain deletions acts predominantly on sequence termini. Therefore, we tentatively
propose that the insertions of new transcription start
and stop codons, as well as gene fusion and fission,
are more likely to occur than, for example, intron
mobility caused by exon shuffling.
Multiple domain deletions
We supported the previous findings by analysing cases
where one or more domains were deleted from a
protein. We considered only deletions in which at least
half of the domains of the full-length arrangement were
preserved, to ensure that homologous arrangements
were being compared. The results were similar to those
of single domain deletions, in that the terminal deletions were prevalent (see the Supplementary Material).
In many cases, a deleted domain is a part of a larger, deleted fragment. We have found that fragments
deleted at either terminus are, in general, much longer
than fragments deleted within a protein sequence. The
deletions within the protein are much more often single
domain deletions (Fig. 2). The total number of deletions that concern only one, single domain, is higher
for the positions between the termini. However, the
number of major deletions (deletions that span more
than one domain) is higher at terminal positions. This
supports the view that the deletions generally involve
the protein termini.
Fig. 1. Statistics of single domain deletions in the whole SwissProt/TrEMBL set of proteins. The figure shows the relative proportion of domain deletions (y-axis) at different positions (x-axis) within the proteins of length 4, 6, 10 and 11 domains. Dark grey, Pfam; light grey, ProDom.
Fig. 2. Number of occurrences of domain deletions (y-axis) as a function of the length, in domains, of the deleted fragment (x-axis). Diamonds, N-terminal deletions; squares, deletions within the protein; circles, C-terminal deletions. Single domain losses occur preferentially at one of the middle positions, whereas longer fragments tend to be deleted at the termini.
In-detail analysis of the deletion events
During our analyses, we noted that some of the apparent domain deletions are actually just misannotations.
A lack of a domain identifier at a given position in a
protein annotation does not necessarily mean that the
corresponding domain is physically deleted. Likewise,
a different identifier does not necessarily signify a
physical substitution. To address this problem, we constructed clusters of similar proteins that contained at
Table 1. Criteria used to distinguish between various types of sequence rearrangements and annotation artefacts that result in a disappearance of a domain in the domain string of a protein.
Evolutionary events
  Physical deletion: a domain is physically deleted from the protein sequence, and only a short (<20 amino acids) fragment can be found between the neighbouring domains
  Substitution: a domain is replaced by another domain that bears no similarity with the original domain
  Shadow domain: at a given position, in one protein there is a ProDom domain; at the same position in another protein there is an amino acid sequence which is not similar to the given domain and which does not correspond to a ProDom ID
Annotation artefacts
  Camouflage: although there are two different ProDom domains at the same position in two proteins, they are significantly similar (E << 1)
  Erosion: the domain is not annotated in ProDom, but there is at this position a similar amino acid sequence
least six ProDom domains. We aligned the domain
arrangements within a cluster using a simple progressive multiple alignment algorithm [24], based on
pairwise alignments generated using the Needleman-Wunsch algorithm [25] (Supplementary material).
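The pairwise step can be illustrated with a minimal sketch of Needleman-Wunsch global alignment applied to domain arrangements rather than amino acids. The scoring values (match = 1, mismatch = -1, gap = -1) and the function name are assumptions chosen for illustration; they are not the parameters used in this study.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two domain arrangements (sequences of domain IDs).

    Returns (score, aligned_a, aligned_b); None marks a gap.
    Scores are illustrative assumptions, not the study's parameters.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic programming matrix.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(None); i -= 1
        else:
            out_a.append(None); out_b.append(b[j - 1]); j -= 1
    return score[n][m], out_a[::-1], out_b[::-1]


# Six-domain arrangement versus the same arrangement lacking domain C.
total, row_a, row_b = needleman_wunsch(list("ABCDEF"), list("ABDEF"))
```

In this toy case the gap in the second row marks the position at which a domain is apparently deleted, which is the signal that the subsequent classification works on.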
We were able to distinguish five types of phenomena that resulted in an apparent deletion from the
domain arrangement (Table 1, Fig. 3). The first two
were real substitutions and physical deletions of
domains. In some cases, at the site where the domain
annotation was missing, there was, in fact, a sequence
similar to the sequence of this domain. However,
because of length or large evolutionary distance, this
sequence was not annotated by the automatic annotation mechanism of ProDom (‘erosion’). In other
cases, if there is a high sequence variation between
the instances of the domains with a given identification number (ID), homologous sequences can be
assigned different ProDom identifiers (‘camouflage’).
Yet, in other cases, although the annotation (ProDom
Fig. 3. Classification of domain-wise events observed in the domain databases. Different evolutionary events (A, B, C) and annotation artefacts (D, E) result in an apparent ‘deletion’ of a ProDom domain from a protein annotated in terms of ProDom domains. Panel labels: substitution, E-value (B,D) ~ 1; shadow domain, E-value (B,seq) ~ 1; physical deletion; camouflage, E-value (B,D) << 1; erosion, E-value (B,seq) << 1. Domain and dot plots can be found in the Supplementary material.
ID) of a given domain is missing, there is no physical
deletion or misannotation. Instead, the amino acid
sequence at this position is not similar to the given
ProDom domain; therefore, it is a case of a real substitution. We call this case a ‘shadow domain’.
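The five categories can be expressed as a small decision procedure. The sketch below follows the criteria of Table 1; the function name, the argument layout, the example domain identifiers and the numerical cut-off standing in for ‘E << 1’ (here 0.01) are assumptions made for illustration only.

```python
def classify_apparent_deletion(other_domain_id, fragment_length, evalue,
                               max_linker=20, similar_below=0.01):
    """Classify why a reference ProDom domain is absent at an aligned position.

    other_domain_id: ProDom ID annotated at this position in the other
        protein, or None if nothing is annotated there.
    fragment_length: length (amino acids) of the sequence at the position.
    evalue: E-value of the comparison with the reference domain, or None
        if no comparison was made.
    max_linker: fragments shorter than this count as a physical deletion
        (Table 1 gives <20 amino acids).
    similar_below: assumed cut-off standing in for 'E << 1'.
    """
    similar = evalue is not None and evalue < similar_below
    if other_domain_id is not None:
        # A different annotated domain occupies the position.
        return "camouflage" if similar else "substitution"
    if fragment_length < max_linker:
        return "physical deletion"
    # Unannotated sequence of substantial length remains.
    return "erosion" if similar else "shadow domain"


# Hypothetical cases: an unannotated but similar sequence, and a short linker.
case_erosion = classify_apparent_deletion(None, 140, 1e-8)
case_deletion = classify_apparent_deletion(None, 12, None)
```

Camouflage and erosion are the two outcomes in which similarity to the reference domain is retained, which is why they are treated as annotation artefacts rather than evolutionary events.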
For each of these events, we counted its occurrence in the constructed protein clusters (see the
Materials and methods for details), at each position
in each protein cluster, as follows. If a domain was
found to be deleted from an arrangement in a cluster, the amino acid sequences occurring in all the
sequences of the cluster at the given position were
analysed. We have applied the criteria from Table 1
to distinguish between the three types of real evolutionary events (physical domain deletion, substitution
and shadow domains) and two types of annotation
artefacts (camouflage and erosion). In the case of
physical deletions, shadow domains and erosions, the
numbers of these events were simply counted. However, in the case of substitutions and camouflage, it
is not reasonable to count the number of occurrences of such an event without inferring a direction
of the substitution. For example, if at a certain position in a cluster, domain A occurs in two sequences,
and each of the domains B and C occurs five times,
then what frequency of the substitutions should be
assumed here? We have used the following routine:
all possible pairwise combinations of domains from
different proteins occurring at the same domain position in a cluster were analysed. If the two domains
in a pair were different, then an event (substitution
or camouflage) was recorded. Therefore, the calculated numbers of substitution and camouflage events
cannot be used to infer any conclusions on the actual substitution rate of domains; however, because at
all domain positions the number of camouflage and
substitution events have been calculated in the same
way, relative frequencies of the camouflage and substitution events at different positions can be inferred.
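The pairwise counting routine can be sketched as follows, using the example given above (domain A in two sequences, domains B and C in five sequences each). The function name and the input representation (one domain identifier per protein at a single aligned position) are assumptions; whether a differing pair counts as a substitution or as camouflage would still be decided by the E-value criterion of Table 1.

```python
from collections import Counter
from itertools import combinations


def count_differing_pairs(domains_at_position):
    """Count protein pairs carrying different domains at one position.

    Returns a dict mapping each unordered pair of distinct domain IDs to
    the number of protein pairs in which they face each other.
    """
    counts = Counter(domains_at_position)
    return {
        (a, b): counts[a] * counts[b]
        for a, b in combinations(sorted(counts), 2)
    }


column = ["A"] * 2 + ["B"] * 5 + ["C"] * 5
events = count_differing_pairs(column)
total_events = sum(events.values())
```

Because the count grows with the product of the group sizes, it is useful only for comparing positions with each other, not as an estimate of a substitution rate.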
The relative frequencies of physical domain deletions, substitutions and shadow domains are all
higher at the termini. The average domain deletion
frequency is 9% overall: 7% at nonterminal positions
and 20% at the termini (Table 3). This trend cannot
be seen in the case of annotation artefacts (Fig. 4,
Table 3). Furthermore, annotation artefacts are roughly 10
times rarer than real, physical events (Table 3).
Therefore, our previous results for single-domain and
multiple deletions are scarcely affected by inaccuracies of the database annotations and reflect real
evolutionary events. This supports the aforementioned finding that the majority of deletions are
caused by the physical deletions of protein termini.
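As a worked check, the percentages quoted in this paragraph can be recomputed from the counts reported in Table 3. The variable names are illustrative; the numbers are taken from the table.

```python
# Counts from Table 3 (ProDom protein clusters).
domains = {"all": 152105, "N": 14520, "middle": 123065, "C": 14520}
deletions = {"all": 13925, "N": 2998, "middle": 8077, "C": 2850}

# Deletion frequency (%) at each position class.
deletion_pct = {k: round(100 * deletions[k] / domains[k], 1) for k in domains}

# Both termini pooled.
termini_pct = round(
    100 * (deletions["N"] + deletions["C"]) / (domains["N"] + domains["C"]), 1
)

# Real events (deletions, substitutions, shadow domains) versus
# annotation artefacts (camouflage, erosion).
real_events = 13925 + 3034 + 8770
artefacts = 1557 + 1235
ratio = real_events / artefacts
```

The ratio of real events to artefacts comes out slightly above 9, which is the basis for describing artefacts as about ten times rarer.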
Fig. 4. Results of the protein clusters analysis: relative percentages
of different evolutionary events and annotation artefacts (shown in separate panels) at different
domain positions within the analysed sequences. Error bars indicate the standard error of the calculated proportion. The values for
the ‘Middle position’ were averaged from the values for all nonterminal positions.
We repeated this analysis to test whether there are
differences between prokaryotes and eukaryotes; however, we did not find significant differences (see the
Supplementary material).
Distribution of termini length in proteins
We have further pursued the question of whether the
terminal deletions can be regarded as truly modular
events; that is, to what extent evolution preserves
domain boundaries upon domain deletion. The null
hypothesis is that in the case of nearly neutral evolution, the domains are depleted gradually, and partially
deleted domain fragments are common. In such a case,
the evolution of proteins cannot be modelled by the
approximation of domains or modules. However, several factors can make the situation different. First,
selection pressure could rapidly eliminate the truncated
fragments – unnecessary biosynthesis of the nonfunctional protein fragments should reduce fitness. Second,
if domain deletions are caused by genetic mechanisms
preserving domain boundaries (such as gene fusions),
partial domains will be rare. If this is the case, amino
acid sequence deletions can be simplified to domain
deletion events, and thus protein evolution could be
abstracted to the level of modules.
We tackled this problem as follows. We have constructed clusters of proteins. Each cluster contained
proteins with the same domain arrangement, or with
an arrangement shortened by a terminal domain deletion, either N terminal or C terminal. We recorded
the length of the N- or C-terminal amino acid
sequence and plotted the distribution of its length
(see the Materials and methods for details). The
lengths were normalized for every protein cluster and
then averaged for evaluation. A length of 0 corresponds to the case when the terminal domain is completely deleted, and 100 to the average length of the
terminal domain in the whole cluster. Furthermore,
we refined these results by counting only the protein
sequence fragments that are similar, at the amino acid
sequence level, to the remaining sequence of the deleted domain, given one of two E-value thresholds.
These E-values between those fragments and the
intact domain were recorded and put in three bins,
each for a different range of E-values (any E-value,
0 ≤ E ≤ 0.01; 0 ≤ E ≤ 1 × 10⁻⁵).
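The normalization and binning can be sketched as below. The function names and the bin labels are assumptions; the scale (0 for a completely deleted terminal domain, 100 for the average length of that domain in the cluster) and the two E-value thresholds follow the description above.

```python
def relative_terminal_length(fragment_length, domain_lengths_in_cluster):
    """Express a terminal fragment length in % of the mean domain length.

    0 means the terminal domain is completely absent; 100 corresponds to
    the average length of the terminal domain within the cluster.
    """
    mean_length = sum(domain_lengths_in_cluster) / len(domain_lengths_in_cluster)
    return 100.0 * fragment_length / mean_length


def evalue_bins(evalue):
    """Return the bins (labels are illustrative) a fragment falls into."""
    bins = ["any"]
    if evalue is not None and evalue <= 0.01:
        bins.append("E<=0.01")
    if evalue is not None and evalue <= 1e-5:
        bins.append("E<=1e-5")
    return bins


# A 60-residue remnant of a domain that averages 120 residues in its cluster.
remnant = relative_terminal_length(60, [110, 120, 130])
```

Note that the bins are nested: a fragment passing the stricter threshold is also counted in the two more permissive bins.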
The distributions of termini lengths are shown in
Fig. 5. The distributions show that complete domains
are much more likely to be present in proteins, and
that partial domains are rare at the terminal ends.
These distributions hold also for sets of data in which
sequences containing three or fewer domains were
Fig. 5. Length distributions of the remaining fragment from a terminal domain. Distribution of the length of the terminal sequences is based on comparison of domain arrangement alignments. Left, distribution on the N-termini; right, distribution on the C-termini. The x-axis gives the length of the terminus relative to the size of the deleted domain (= 100%); the y-axis gives the number of occurrences. White bars, all terminal fragments; light grey, terminal fragments similar to the deleted domain (E < 0.01); dark grey, terminal fragments significantly similar to the deleted domain (E < 1 × 10⁻⁵). Top, results for the ProDom database; bottom, results for the PfamA data set.
removed, and also in the case of Pfam domains
(Fig. 5, bottom). If an E-value threshold was applied (only fragments similar to the given domain were considered),
the shorter sequences with a terminal fragment that
was completely lost were eliminated from the histogram. This was not necessarily because the fragments
were not homologous, but because the fragments were
too short to show any significant similarity. However,
the right part of the distribution, corresponding to
sequence fragments of > 50% of the average domain
length, did not change significantly (grey bars on
Fig. 5).
Domain deletions and domain versatility
Finally, we investigated whether the domain deletion
events were connected to the properties of the deleted
domains themselves. Specifically, we wished to establish whether the versatility of a domain plays a role in domain
deletions. Furthermore, we considered that domains
can, in general, fold autonomously. Therefore, we
Table 2. Deleted domains and domain versatility.
Position               Fraction as single   Fraction as single   Average NN           Average NN
                       for all (a)          for deleted (b)      for all (c)          for deleted (d)
Total for all domains  3.00% ± 0.04         1.82% ± 0.10         2.50 ± 0.02          2.40 ± 0.08
N-terminus             2.46% ± 0.07         1.81% ± 0.16         1.67 ± 0.02          2.13 ± 0.12
middle                 2.40% ± 0.05         0.96% ± 0.13         3.65 (1.83) ± 0.03   4.12 (2.06) ± 0.29
C-terminus             3.20% ± 0.09         3.64% ± 0.27         1.70 ± 0.02          2.72 ± 0.19
(a) Overall fraction of domains that were found to form single-domain proteins; (b) fraction of deleted domains that were found to form single-domain proteins; (c) average number of neighbours for all domains in the protein clusters ± standard error; (d) average number of neighbours for the deleted proteins ± standard error. As each of the domains in a middle position has two neighbours, the values in parentheses are the averages divided by two. The results are based on a dataset with proteins having 3 or more domains.
Table 3. Results of the analysis of protein clusters for the ProDom database. Numbers in the table correspond to the absolute numbers of events recorded (% of the events recorded is given in parentheses).
Event                     Average (%)    N-terminus     middle         C-terminus
Total number of domains   152105         14520          123065         14520
Real events:
  Deletions               13925 (9.2)    2998 (20.6)    8077 (6.6)     2850 (19.6)
  Substitutions           3034 (2.0)     546 (3.8)      2000 (1.6)     488 (3.4)
  Shadow domains          8770 (5.8)     1811 (12.5)    5399 (4.4)     1560 (10.7)
Annotation artefacts:
  Camouflage              1557 (1.0)     110 (0.8)      1391 (1.1)     56 (0.4)
  Erosion                 1235 (0.8)     82 (0.6)       1001 (0.8)     152 (1.0)
recorded how often domains that are lost form single-domain proteins.
First, we calculated the fraction of domains that also
occur as single-domain genes in the sets of domains
that are deleted at an N-terminal, C-terminal or central position. We found that the domains which also
occur as single-domain proteins are found two to four
times more frequently at the termini, and twice as frequently at the C terminus as at the N terminus
(Table 2). Surprisingly, the average fraction of
domains that also occur as single-domain genes is
lower for the domains that partake in deletion events
than the average for all domains.
The ability of a domain to form autonomous, single-domain proteins may be related to its versatility.
We have therefore calculated the domain connectivity
and found that it is highest for the nonterminal
domains. However, as the domains at a nonterminal
position have, on average, two neighbours, whereas
the terminal domains have only one, the averages for
this type of domains must be halved. In that case,
the percentages of domains that form autonomous,
single-domain proteins are higher for domains that
undergo deletions at the termini, and lower for
domains that undergo deletions at a nonterminal
position (Table 2). Again, the numbers of domains
that form autonomous, single-domain proteins are
highest for the domains that are deleted at the
C terminus.
We conclude that the elevated rates of domain deletions at the terminal regions are partly related to
domain versatility and the ability of domains to function outside
a multidomain protein (to form single-domain proteins). The events involving domain acquisition/loss
are twice as frequent at the C terminus as at the
N terminus (Table 2).
Case study: bacterial formate
dehydrogenases
An exemplary cluster of bacterial formate dehydrogenase proteins is shown in Fig. 6. This cluster illustrates
several modular events, including domain deletion, a
substitution by a diverged sequence fragment, and erosion (Fig. 6B). A multiple alignment of the protein
sequences can be found in the Supplementary material.
For some of the proteins the structure is known [26].
We analysed the phylogeny of the cluster, as derived
from whole protein sequences (Fig. 6C). The obtained
phylogenetic tree is consistent with the modifications of
the domain arrangements (Fig. 6D), and the revealed
events can be associated with the tree nodes. Significant
rearrangements take place on the sixth position of the
cluster where, in different proteins, we found two different ProDom domains, shadow domains and, at one
position (in the protein O59078), a complete deletion.
Further rearrangements are found at the protein C terminus: two proteins have additionally two other
domains. The shadow domains may either be the result
of a substitution by another sequence, or by such a high
accumulation of mutations in a domain that it is no
longer similar to the original sequence.
There are three variable regions in the domain
arrangement of the protein cluster. First, at position 6
in the arrangement, in some proteins there are similar
sequences that were not annotated in ProDom (‘erosion’) or domains which were annotated differently
because of high sequence divergence (‘camouflage’).
Next, at position 8, there is a substitution in two of
the sequences. Finally, the C-terminal part is missing,
truncated or eroded in many sequences, for example in
the illustrated structure (Fig. 6A,B).
Conclusions
Our main conclusions are as follows: (a) domain deletion events occur frequently at either of the termini,
(b) the deletions occur domain-wise; that is, in most of
the cases the whole domain is lost, (c) domain losses
correlate with domain versatility (i.e. the number of
different combinations in which a domain occurs), (d)
versatile domains are more frequently found at the
C terminus and (e) clear definitions can be given to
distinguish misannotations from physical deletions.
Eventually the question ‘What is the probability of a
domain deletion?’ can only be answered using domain
phylogenies. However, our study shows that the deletion events are quite frequent; in the collected protein
clusters, the frequencies of proteins in a cluster with a
domain deleted at either of the termini were ≈ 9%
(Table 3), which provides a rough estimate for the
frequency of deletion events in protein–protein comparisons.
The fact that the domain deletions are not uniformly
distributed along a protein, but that they nonetheless
follow a distinct pattern of domain deletions, is an
important conclusion in the context of constructing
algorithms for sequence alignments that take into
account domain arrangements of proteins. It also provides a biological justification for choosing a lower-end
gap penalty in sequence alignment algorithms, such as
ClustalW [27].
In conclusion, by analysing the versatility of deleted
domains and their ability to form single-domain proteins, we have found that, while gene fusion and fission
indeed play a significant role in the deletion events at
the termini, the introduction of new start and stop codons also plays a major role. The fraction of the deleted domains that can be found as single-domain
proteins was twice as high at the C terminus (Table 2),
as was the connectivity of the C-terminally deleted
domains. This suggests that in a gene fusion or fission
event, the versatile, single-domain protein is more
likely to be found at the C terminus. This may be
explained by the fact that in a gene fusion/fission
event, or in the case of introduction of new start and
Fig. 6. Cluster of the bacterial formate dehydrogenases. (A,B) The structure of formate
dehydrogenase H (FDHF) from Escherichia
coli. (C) Phylogeny of the analysed proteins
obtained by the parsimony method with 100
bootstraps. (D) The corresponding domain
arrangements of the analysed proteins.
Colour code: (A) is coloured according to the
ProDom annotation, with one colour for
every domain. Colours and arrows on (B)
indicate events identified by analysis of a
cluster of related proteins and correspond
to the coloured arrows on (C) and (D). The
symbols on (C) show a possible attribution
of the events to tree nodes. sub, substitution; del, deletion/insertion; colours of the
symbols correspond to the colours in (B). The
coloured boxes on (D) correspond to different ProDom domains and are the same as
on (A). The black thin boxes on position 6
correspond to ‘shadow domains’.
stop codons, the N-terminal part of the coding
sequence remains connected to its promoter region and
regulatory sites. Thus, a versatile domain that is fused
with the C terminus of a much larger protein will not
have an effect on the regulation of the whole protein,
because it will not modify the promoter region and
regulatory sites. Our results suggest such a selective
disequilibrium: the function (and regulation) of the
protein is connected to its N-terminal part, and therefore the fusion/fission events involving smaller, versatile domains will occur more frequently at the
C terminus.
Moreover, we have found that the event of domain
deletion occurs mostly in a modular manner. This can
have two explanations. First, the apparent domain
deletion can be caused by gene fusion or fission. Second, a domain fragment truncated (e.g. by a nonsense
mutation) that is no longer functional may be rapidly
eliminated by natural selection. Either way, the
domain deletions effectively respect domain boundaries. These results have further supported the emerging
view that, by and large, the modular evolution of proteins is dominated by two major types of events:
fusion, on the one hand, and deletion and fission on
the other [3,4,21,28]. Exon shuffling and recombination
seem to be rare.
Materials and methods
For the analyses, ProDom [22] version 2004.1 was used. The
main results were confirmed using Pfam, release 17 [29].
Each database contains a number of domain arrangements,
that is, proteins annotated in terms of domains. All supplementary materials can be found on our web page (http://
www.uni-muenster.de/Bioinformatics/services/domdel/).
Overall single deletion statistics
Proteins from the ProDom database and, separately, from
the Pfam database, were divided into sets according to the
number of domains. Each set contained all proteins with a
fixed number of domains, for example ‘set6’ contained proteins with six domains.
Each protein from a given set containing proteins of
length N domains was compared with each protein from
the set containing proteins of length N − 1 domains. For
example, a protein with six domains was compared with all
proteins that have five domains. If the shorter arrangement
was identical to the longer one, with the exception of a single, missing domain, a deletion was registered. The position
of the deletion within the domain arrangement was recorded. For example, given the five-domain arrangement
ABDEF (where A to F are domains), it is identical to the
six-domain arrangement, ABCDEF, with the exception of
the deleted domain C.
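The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `find_single_deletion` and the representation of an arrangement as a string (one character per domain) are assumptions made here for brevity.

```python
def find_single_deletion(longer, shorter):
    """Return the 0-based positions in `longer` whose removal yields `shorter`.

    Both arguments are domain arrangements (strings or tuples of domain
    identifiers). An empty list means the two arrangements do not differ
    by exactly one missing domain.
    """
    if len(longer) != len(shorter) + 1:
        return []
    # Try removing each domain in turn and compare with the shorter arrangement.
    return [i for i in range(len(longer))
            if longer[:i] + longer[i + 1:] == shorter]


# The example from the text: ABDEF is ABCDEF without domain C (position 2).
positions = find_single_deletion("ABCDEF", "ABDEF")
```

When a domain is repeated (for example AAB versus AB), more than one position is returned. How such ambiguous cases were assigned to a position is not stated in the text, so the sketch simply reports all candidates.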
The average deletion frequency was calculated as the
number of all deletion events divided by the total number
of domains in all the examined sequences. The relative
domain deletion frequency at a given domain position in a
set of proteins of a given length was defined as the number
of deletions at this position, divided by the total number of
deletions in this set.
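The two frequencies defined above can be computed as follows. This is a sketch under stated assumptions: the function names are hypothetical, deletion positions are 0-based, and the input counts are invented for illustration only.

```python
from collections import Counter


def average_deletion_frequency(n_deletions, arrangements):
    """Number of deletion events divided by the total number of domains."""
    total_domains = sum(len(a) for a in arrangements)
    return n_deletions / total_domains


def relative_deletion_frequency(deletion_positions, n_domains):
    """Share of all deletions in one set that fall at each domain position.

    `deletion_positions` lists the 0-based position of every deletion
    registered for proteins with `n_domains` domains.
    """
    total = len(deletion_positions)
    if total == 0:
        return [0.0] * n_domains
    counts = Counter(deletion_positions)
    return [counts[p] / total for p in range(n_domains)]


# Invented data: four deletions in a set of six-domain proteins.
rel = relative_deletion_frequency([0, 0, 5, 2], 6)
```

By construction the relative frequencies of a set sum to one, so values from sets of different sizes can be compared position by position.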
These investigations were repeated with a nonredundant data set, in which each arrangement was represented
only once. That is, from a set of proteins which had the same
domain arrangement, only one representative was kept.
Overall multiple deletion statistics
For each given domain arrangement, all other arrangements that could be obtained by removing one or more domains from it
were considered. For
example, if A to F are domains, and ABCDEF is the given
arrangement, then we would consider the arrangements
ABCDE, BCD, ABEF, etc.
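The set of arrangements considered for one protein can be enumerated as sketched below. The function name `deletion_products` is hypothetical, and the sketch assumes that removing all domains (the empty arrangement) is not of interest, which the text does not state explicitly.

```python
from itertools import combinations


def deletion_products(arrangement):
    """All arrangements obtained by removing one or more domains.

    Domain order is preserved. The original arrangement and the empty
    arrangement are excluded (the latter is an assumption of this sketch).
    """
    n = len(arrangement)
    products = set()
    for n_kept in range(1, n):  # keep between 1 and n - 1 domains
        for kept in combinations(range(n), n_kept):
            products.add("".join(arrangement[i] for i in kept))
    return products


products = deletion_products("ABCDEF")
```

For an arrangement of n distinct domains this yields 2^n − 2 candidates, so in practice one would look up each candidate in an index of observed arrangements rather than compare all pairs of proteins.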
Similarity of protein arrangements
For the purpose of constructing multiple domain arrangement alignments and domain arrangement-based phylogenies, we implemented the Needleman-Wunsch global
alignment algorithm [25] for protein domains, with the
parameters as defined previously [17]: match = 10, mismatch = −5, gap = −1.
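A minimal sketch of global alignment over domain strings with these scores is given below. It is not the authors' code: the function name `align_domains`, the use of `None` for gaps and the linear (non-affine) gap cost are assumptions of this illustration.

```python
def align_domains(a, b, match=10, mismatch=-5, gap=-1):
    """Needleman-Wunsch global alignment of two domain arrangements.

    Returns (score, aligned_a, aligned_b); gaps are represented by None.
    A linear gap cost is assumed.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic programming matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal path from the bottom-right corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append(None); i -= 1
        else:
            row_a.append(None); row_b.append(b[j - 1]); j -= 1
    return score[n][m], row_a[::-1], row_b[::-1]


result = align_domains("ABCDEF", "ABDEF")
```

Under the linear gap cost assumed here, two gaps (−2) always score higher than one mismatch (−5), so this sketch never pairs two different domains and a substitution shows up as two adjacent gaps. Whether the original implementation behaves the same way depends on details not given in the text.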
Construction of protein clusters
We constructed clusters of proteins with similarity in their
domain arrangement of > 80%. Only clusters that had at
least six domains were considered. For each protein from
the ProDom database, all proteins were considered that had
one domain less than the given protein. If a given protein
matched the examined arrangement by all but one domain,
a deletion event was recorded. Starting with a single protein,
a number of hits was recorded and added to the cluster; furthermore, these proteins were used to obtain the next set of
hits (i.e. proteins that have one domain less than the protein
that was used in the search). The procedure stopped for a
given cluster when no further similar domain arrangements
were found. Only clusters containing at least 10 proteins
and 10 ProDom domains were used for further analysis.
Additionally, the amino acid sequences of all the sequences
in the cluster were collected. The resulting clusters were subsequently aligned with a simple multiple-domain arrangement alignment algorithm (progressive alignment). The
length (in terms of domains) of a cluster was defined as the
length of the multiple-domain arrangement alignment.
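The iterative search that grows a cluster from a single protein can be sketched as follows. This is a simplified illustration: the function name `build_cluster` is hypothetical, arrangements are modelled as tuples of domain identifiers, and the similarity threshold, the size filters and the subsequent multiple alignment are left out.

```python
def build_cluster(seed, arrangements):
    """Collect arrangements reachable from `seed` by successive single deletions.

    `arrangements` is an iterable of tuples of domain identifiers. Each hit
    is in turn used as a query for arrangements with one domain less, and
    the search stops when no further hits are found.
    """
    pool = set(arrangements)
    cluster = {seed}
    frontier = [seed]
    while frontier:
        current = frontier.pop()
        for i in range(len(current)):
            shorter = current[:i] + current[i + 1:]  # remove domain i
            if shorter in pool and shorter not in cluster:
                cluster.add(shorter)
                frontier.append(shorter)
    return cluster


# Invented arrangements for illustration.
observed = [("A", "B", "C", "D"), ("A", "B", "C"), ("A", "C"),
            ("B", "D"), ("X", "Y")]
cluster = build_cluster(("A", "B", "C", "D"), observed)
```

In this toy input, the arrangement BD is not reached, because neither ABD nor BCD is present as an intermediate step.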
Calculation of the relative event frequency at
different domain positions in protein clusters
For each of the events, e, and for each of the sets of clusters of a given length, l, the frequency of the event at a
position, k, was defined as:
\[ f_{e,k} = \frac{n_{e,k}}{\sum_{i=1}^{l} n_{e,i}}, \]
where n_{e,i} is the number of occurrences of the event e at the
domain position i. The average frequency at the middle
positions (that is, all domain positions except the N- and
C termini) was calculated as:
\[ f_{e,\mathrm{middle}} = \frac{1}{l-2} \sum_{i=2}^{l-1} f_{e,i}. \]
Finally, the N-terminal, C-terminal and central position
frequencies for each event were averaged for all sets of
clusters.
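The two definitions above translate directly into code. The sketch below uses hypothetical function names and invented event counts; it assumes clusters of length at least three, so that at least one middle position exists.

```python
def position_frequencies(counts):
    """Relative frequency of an event at each position: n_k / sum_i n_i."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no events recorded")
    return [c / total for c in counts]


def terminal_and_middle(counts):
    """Return (N-terminal, average middle, C-terminal) frequencies.

    `counts[i]` is the number of occurrences of one event type at domain
    position i in clusters of length len(counts).
    """
    if len(counts) < 3:
        raise ValueError("need at least one middle position")
    f = position_frequencies(counts)
    middle = sum(f[1:-1]) / (len(f) - 2)
    return f[0], middle, f[-1]


# Invented counts for clusters of length four.
summary = terminal_and_middle([4, 1, 1, 2])
```

Dividing the summed middle frequencies by l − 2 puts the central value on a per-position basis, which makes it comparable with the two terminal positions.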
Distribution of amino acid sequence length
of the termini
For each of the databases ProDom and Pfam, two sets of
alignments were created: one for N-terminal deletions, and
one for C-terminal deletions. In each set, an alignment contained sequences that had one of the two types of arrangements: either a complete arrangement, or one in which a
terminal domain was missing from the ProDom description.
Alignments were constructed from the whole ProDom
database. Only alignments which contained at least one
complete sequence and one sequence with a missing domain
(depending on the set, either N- or C terminal) were considered.
For each alignment in each set, the average size of the
deleted domain was calculated for the proteins with the
complete arrangement. To take into account the variability
of the length of the complete domain, the length of the
N-terminal fragment was defined as the length of the
amino acid sequence preceding the next domain in
the arrangements, expressed as the percentage of the calculated average length of the deleted domain in this alignment. Finally, the distribution of these values throughout
all of the analysed alignments was calculated.
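The normalisation described above can be sketched for a single alignment as follows. The function name and the input lengths are hypothetical; lengths are in amino acids.

```python
def relative_fragment_lengths(complete_domain_lengths, fragment_lengths):
    """Express terminal fragment lengths as a percentage of the mean domain length.

    `complete_domain_lengths`: lengths of the terminal domain in the sequences
    with the complete arrangement.
    `fragment_lengths`: lengths of the sequence stretch preceding the next
    domain in the sequences where the terminal domain is missing.
    """
    mean_length = sum(complete_domain_lengths) / len(complete_domain_lengths)
    return [100.0 * length / mean_length for length in fragment_lengths]


# Invented lengths for one alignment.
percentages = relative_fragment_lengths([100, 120, 80], [30, 150])
```

Pooling these percentages over all alignments gives the distribution referred to in the text. Values well below 100 would correspond to short remnants, and values near 100 to a fragment of roughly full domain length that is not annotated.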
References
1 Patthy L (1999) Protein Evolution. Blackwell Science,
Oxford.
2 Liu J & Rost B (2004) CHOP: parsing proteins into
structural domains. Nucleic Acids Res 32, W569–W571.
3 Bornberg-Bauer E, Beaussart F, Kummerfeld S, Teichmann S & Weiner J 3rd (2005) The evolution of domain
arrangements in proteins and interaction networks. Cell
Mol Life Sci 62, 435–445.
4 Vogel C, Teichmann S & Pereira-Leal J (2005) The
relationship between domain duplication and recombination. J Mol Biol 346, 355–365.
5 Vogel C, Berzuini C, Bashton M, Gough J & Teichmann S (2004) Supra-domains: evolutionary units larger
than single protein domains. J Mol Biol 336, 809–823.
6 Gough J (2005) Convergent evolution of domain architectures (is rare). Bioinformatics 21, 1464–1471.
7 Apic G, Gough J & Teichmann S (2001) An insight into
domain combinations. Bioinformatics 17 (Suppl. 1),
S83–S89.
8 Wuchty S (2001) Scale-free behavior in protein domain
networks. Mol Biol Evol 18, 1694–1702.
9 Bornberg-Bauer E (2002) Randomness, structural
uniqueness, modularity, and neutral evolution in
sequence space of model proteins. Z Phys Chem 216,
139–154.
10 Madera M, Vogel C, Kummerfeld S, Chothia C &
Gough J (2004) The SUPERFAMILY database in
2004: additions and improvements. Nucleic Acids Res
32, D235–D239.
11 Doolittle R & Bork P (1993) Evolutionarily mobile
modules in proteins. Sci Am 269, 50–56.
12 Apic G, Huber W & Teichmann S (2003) Multi-domain
protein families and domain pairs: comparison with
known structures and a random model of domain
recombination. J Struct Funct Genomics 4, 67–78.
13 Ponting C & Russell R (1995) Swaposins: circular permutations within genes encoding saposin homologues.
Trends Biochem Sci 20, 179–180.
14 Uliel S, Fliess A & Unger R (2001) Naturally occurring circular permutations in proteins. Prot Eng 14,
533–542.
15 Fliess A, Motro B & Unger R (2002) Swaps in protein
sequences. Proteins 48, 377–387.
16 Bujnicki J (2002) Sequence permutations in the molecular evolution of DNA methyltransferases. BMC Evol
Biol 2, 3.
17 Weiner J 3rd, Thomas G & Bornberg-Bauer E (2005)
Rapid motif-based prediction of circular permutations
in multi-domain proteins. Bioinformatics 21, 932–937.
18 Apic G, Gough J & Teichmann S (2001) Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J Mol Biol 310, 311–325.
19 Bashton M & Chothia C (2002) The geometry of
domain combination in proteins. J Mol Biol 315, 927–
939.
20 Vogel C, Bashton M, Kerrison N, Chothia C & Teichmann S (2004) Structure, function and evolution of
multidomain proteins. Curr Opin Struct Biol 14, 208–
216.
21 Kummerfeld S & Teichmann S (2005) Relative rates of
gene fusion and fission in multi-domain proteins. Trends
Genet 21, 25–30.
22 Corpet F, Servant F, Gouzy J & Kahn D (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res
28, 267–269.
23 Zhang Y, Chandonia J, Ding C & Holbrook S (2005)
Comparative mapping of sequence-based and structure-based protein domains. BMC Bioinformatics 6, 77.
24 Feng D & Doolittle R (1987) Progressive sequence
alignment as a prerequisite to correct phylogenetic trees.
J Mol Evol 25, 351–360.
25 Needleman S & Wunsch C (1970) A general method
applicable to the search for similarities in the amino
acid sequence of two proteins. J Mol Biol 48, 443–
453.
26 Boyington J, Gladyshev V, Khangulov S, Stadtman T
& Sun P (1997) Crystal structure of formate dehydrogenase H: catalysis involving Mo, molybdopterin, selenocysteine, and an Fe4S4 cluster. Science 275, 1305–
1308.
27 Thompson J, Higgins D & Gibson T (1994) CLUSTAL W:
improving the sensitivity of progressive multiple
sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
Nucleic Acids Res 22, 4673–4680.
28 Weiner J 3rd & Bornberg-Bauer E (2006) Evolution of
circular permutations in multi-domain proteins. Mol
Biol Evol 23, 734–743.
29 Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L,
Eddy S, Griffiths-Jones S, Howe K, Marshall M &
Sonnhammer E (2002) The Pfam protein families database. Nucleic Acids Res 30, 276–280.
Supplementary material
The following supplementary material is available
online:
Fig. S1. Statistics for single domain deletions.
Fig. S2. Statistics for multiple domain deletions.
Fig. S3. Detailed results of the cluster analysis.
Fig. S4. Results for the comparison of eukaryotes and
prokaryotes.
Fig. S5. Pairwise multiple alignment algorithm for
domain arrangements.
This material is available as part of the online article
from