Universal positions in globular proteins
From observation to simulation
Nikolaos Papandreou1, Igor N. Berezovsky2,3, Anne Lopes4, Elias Eliopoulos1 and Jacques Chomilier4
1Laboratory of Genetics, Agricultural University of Athens, Greece; 2Department of Structural Biology, The Weizmann Institute of Science, Rehovot, Israel; 3Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA; 4Equipe Biologie Structurale, LMCP, Universités Paris 6 and Paris 7, Paris, France
The description of globular protein structures as an ensemble of contiguous ‘closed loops’ or ‘tightened end fragments’ reveals fold elements crucial for the formation of stable structures and for navigating the very process of protein folding. These are the ends of the loops, which are spatially close to each other but are situated apart in the polypeptide chain by 25–30 residues. They also correlate with the locations of highly conserved hydrophobic residues (referred to as topohydrophobic), in a structural alignment of the members of a protein family. This study analysed these positions in 111 representatives of different protein folds, and then carried out dynamic Monte Carlo simulations of the first steps of the folding process, aimed at predicting the origins of the assembling folds. The simulations demonstrated that there is an obvious trend for certain sets of residues, named ‘mostly interacting residues’, to be buried at the early stages of the folding process. Location of these residues at the loop ends and correlation with topohydrophobic positions are demonstrated, thereby giving a route to simulations of the protein folding process.
Keywords: folding nucleus; hydrophobic core; lattice simulation; protein folding.
Despite the continuously increasing number of experimentally determined protein structures, many new folds are still to be discovered. This was illustrated clearly in a recent study [1], where a plot of the number of protein families vs. the number of resolved complete genomes resulted in a quasi-linearly increasing function. Elucidating the evolutionary mechanisms leading to the emergence of a finite number of protein folds [2,3] from the vast number of protein sequences [4,5], as well as the mechanisms of the formation of mature protein globules [6], remains a topic of both great challenge and interest. The latter mechanisms are related to the physical basis of protein structure formation and stability [7], and thus can point to possible evolutionary routes [8].
This study is based on universal structural units of protein folds, named ‘closed loops’ [9] or ‘tightened-end fragments’ (TEFs) [10]. These major elements are universally present in all types of protein folds and have the following features in common: (a) they usually start and end in the hydrophobic core [11]; (b) they form loop-like structures of nearly standard size (25–30 amino acid residues); (c) they serve as universal units of protein domain structure [12]; (d) the ends of these elements (or so-called locks [13]) mainly correspond to clusters of hydrophobic amino acids in general (WIMVYLF), and to highly conserved ones, the topohydrophobic (TH) positions [14,15], in particular. Determination of the TH positions is based on the analysis of multiple structural alignments of members of a protein family, limited to a maximum pairwise sequence identity of 30%. TH positions are of particular importance for the formation and stability of the protein core [16]. From a dynamic point of view, the early formation of a nucleus composed of TH positions would favor the formation of closed loops and considerably speed up the folding process [17]. The coupled concepts of TH and closed loops/TEFs therefore offer a simple and general scenario for the folding mechanism of globular proteins [11,15] and provide a set of critical positions in the protein core [10,11,13]. The loop structure of globular proteins is a general concept, independent of secondary structure, as well as of the particular folding mechanism of each protein [9,10,13].
This study addresses the question of predicting these critical positions from the sequence, a task of major importance in approaching the structure of a protein of unknown fold. To successfully build such a structure, numerous pieces of information have to be collected by combining various methods. An initial calculation of critical positions could be a first step, providing a frame of structural restraints, as TEF limits and TH residues are located mainly inside the protein core.
The notion of topohydrophobic positions suggests that the forces that bury these residues and lead to a stable core do not rely on the details of the amino acid side chain structure, but rather on an adequate succession of hydrophobic and polar amino acid residues along the polypeptide chain. Thus simplified protein models, such as lattice ones, are adequate tools for calculations aimed at locating critical residues.
Correspondence to N. Papandreou, Laboratory of Genetics, Agricultural University of Athens, Iera Odos 75, 11855 Athens, Greece.
Fax: +30 2105294322, Tel.: +30 2105294372,
E-mail:
Abbreviations: MIR, mostly interacting residues; PDB, protein data bank; SCOP, structural classification of proteins; TEF, tightened end fragment; TH, topohydrophobic.
(Received 29 June 2004, revised 22 September 2004, accepted 15 October 2004)
Eur. J. Biochem. 271, 4762–4768 (2004) © FEBS 2004 doi:10.1111/j.1432-1033.2004.04440.x
This study was carried out on a dataset of 111 globular proteins with well-defined structures in the Protein Data Bank (PDB), which were representative of various folds, and for which the TEFs were available. For a subset of 73 proteins of the above database, the TH positions have also been determined.
The initial stages of folding were simulated using a simplified model, which consists of an alpha-carbon reduced representation of the polypeptide chain on a 24-first neighbour lattice. A standard Monte Carlo algorithm dynamically simulated the folding process and a statistical mean force potential was used to describe the interactions between noncontiguous residues. A commonly accepted lattice model [18] was used, and the analysis focused on the first stages of the folding process, measuring the tendency of amino acids to be packed inside the hydrophobic core, depending on the peculiarities of the polypeptide chain sequence.
Starting from random conformations, the Monte Carlo simulations revealed that a subset of hydrophobic residues had a strong tendency to be buried. These residues, named ‘mostly interacting residues’ (MIR), were found to statistically match TEF limits and TH positions.
These results are in agreement with the hydrophobic collapse mechanism, which can be further generalized to the nucleation–condensation mechanism, a hybrid of the hierarchical and hydrophobic collapse mechanisms [23,24].
Materials and methods
The protein database consisted of 111 globular protein chains, representing 78 different folds, according to the structural classification of proteins (SCOP classification) [22]. In detail, there are 26 α class proteins, 23 β class, 26 α + β class, 18 α/β class and 18 of the small proteins class, providing a balanced representation of the major known folds. The polypeptide chain lengths vary between 50 and 250 residues.
Simulations have been carried out using a Cα representation of the polypeptide chains and the lattice geometry (Fig. 1) is as in [18].
On an underlying cubic lattice (Fig. 1, dotted lines) with edges of unit length, contiguous alpha carbons are connected by vectors of the form (±2, ±1, 0) (Fig. 1, solid lines). The length of such a vector is $\sqrt{5}$ lattice units and is equivalent to 3.8 Å, the typical distance between contiguous alpha carbons in proteins. In this geometry, for residue i, there are 24 possible positions for residue i + 1 to occupy. This kind of polypeptide chain projection allows for a more realistic representation of the polypeptide chain [18].
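The count of 24 neighbour positions can be checked by direct enumeration. The following is a minimal sketch, assuming that "vectors of the form (±2, ±1, 0)" includes all permutations of the three components; the names `bond_vectors`, `VECTORS` and `ANGSTROM_PER_UNIT` are illustrative and do not come from the original work.

```python
from itertools import permutations, product
import math

CA_CA_ANGSTROM = 3.8  # Calpha-Calpha distance quoted in the text


def bond_vectors():
    """Return all lattice vectors that are permutations of (+-2, +-1, 0)."""
    vectors = set()
    for perm in set(permutations((2, 1, 0))):
        for signs in product((1, -1), repeat=3):
            # The sign applied to the zero component has no effect,
            # so the set removes the resulting duplicates.
            vectors.add(tuple(s * c for s, c in zip(signs, perm)))
    return sorted(vectors)


VECTORS = bond_vectors()

# One lattice unit expressed in angstroms, given that sqrt(5) units = 3.8 A.
ANGSTROM_PER_UNIT = CA_CA_ANGSTROM / math.sqrt(5)

print(len(VECTORS))                                 # number of bond vectors
print(all(sum(c * c for c in v) == 5 for v in VECTORS))  # squared length is 5
```

With this conversion factor, a distance of $\sqrt{12}$ lattice units comes out close to the 5.88 Å interaction range used later in the text.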
Two spatial constraints are implemented. First, the distance between noncontiguous alpha carbons cannot be less than 3.8 Å ($\sqrt{5}$ lattice units). Second, in contrast to a simple cubic lattice, where only angles of 90° and 180° are possible, the angles here are limited to the range between 66° and 143° (seven possible values), approximating the range of pseudo-angles in natural proteins [19].
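The seven angle values can be recovered by enumeration. This is a sketch under two assumptions that the text does not state explicitly: the angle in question is the pseudo-valence angle at residue i between the bonds to residues i − 1 and i + 1, and bond vectors are all permutations of (±2, ±1, 0). Function names are illustrative.

```python
from itertools import permutations, product
import math


def bond_vectors():
    """All permutations of (+-2, +-1, 0): the 24 lattice bond vectors."""
    vectors = set()
    for perm in set(permutations((2, 1, 0))):
        for signs in product((1, -1), repeat=3):
            vectors.add(tuple(s * c for s, c in zip(signs, perm)))
    return sorted(vectors)


def valence_angle_deg(v_prev, v_next):
    """Angle at residue i between the bonds pointing to i-1 and i+1."""
    dot = sum(a * b for a, b in zip(v_prev, v_next))
    # The bond towards i-1 is -v_prev, hence the minus sign; |v|^2 = 5.
    return math.degrees(math.acos(-dot / 5.0))


def allowed_angles(lo=66.0, hi=143.5):
    """Distinct angles (rounded to 0.1 degree) inside the quoted range."""
    vecs = bond_vectors()
    angles = {round(valence_angle_deg(a, b), 1) for a in vecs for b in vecs}
    return sorted(x for x in angles if lo <= x <= hi)


print(allowed_angles())
```

Under these assumptions the lower limit follows from the first constraint: residues i − 1 and i + 1 are at squared distance 10 + 2(v_prev · v_next), which must be at least 5, so the dot product cannot fall below −2.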
The different nature of amino acids is taken into account in the force field used to attribute an energy value to each chain conformation. The distance-independent 20 × 20 residue pair energy matrix of Miyazawa and Jernigan was used [20]. In detail, if two noncontiguous residues i and j are found within a distance smaller than or equal to 5.88 Å, a term $E_{ij}$ is added to the total energy, depending on their nature. The maximum interaction range of 5.88 Å corresponds to $\sqrt{12}$ lattice units and seems a reasonable estimate for the mean noncovalent interaction range between amino acid residues.
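The contact energy described above can be sketched as a sum over noncontiguous pairs within the cutoff. The energy values in `TOY_ENERGY` are placeholders chosen for illustration and are not the Miyazawa and Jernigan parameters; the function name and the use of squared lattice distances are likewise assumptions of this sketch.

```python
def contact_energy(coords, sequence, pair_energy, cutoff_sq=12):
    """Sum E_ij over noncontiguous residue pairs within sqrt(12) lattice units."""
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 2, n):  # skip chain neighbours (|i - j| = 1)
            d_sq = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            if d_sq <= cutoff_sq:
                # Symmetric matrix: look up the pair in sorted order.
                key = tuple(sorted((sequence[i], sequence[j])))
                total += pair_energy[key]
    return total


# Placeholder energies for illustration only; NOT the Miyazawa-Jernigan values.
TOY_ENERGY = {("L", "L"): -1.0, ("K", "L"): 0.0, ("K", "K"): 0.5}

# A four-residue chain on the lattice (integer coordinates, lattice units).
coords = [(0, 0, 0), (2, 1, 0), (3, 1, 2), (1, 0, 2)]
print(contact_energy(coords, "LKKL", TOY_ENERGY))
```

Working with squared integer distances avoids floating-point comparisons against $\sqrt{12}$.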
For each protein, 100 different initial conformations were randomly generated and used as starting points for 100 simulation runs, to avoid dependence on the initial state. The only constraint placed on initial states is their noncompactness, in the sense that amino acid residues placed far apart in the sequence were not allowed to be close in space, to avoid clustering due to a particular initial state conformation. Quantitatively, this constraint introduces a minimum spatial distance, dmin, according to the separation Δ = |i − j| between residues i and j: (1) Δ = 6–10, dmin = 7 Å; (2) Δ = 11–15, dmin = 11 Å; (3) Δ = 16–20, dmin = 19 Å; (4) Δ more than 20, dmin = 27 Å.
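The noncompactness rule can be expressed as a lookup followed by a pairwise check. This is a minimal sketch with hypothetical function names; it assumes coordinates already converted to angstroms and treats separations below 6 as unconstrained, since the text lists no rule for them.

```python
import math


def d_min(delta):
    """Minimum allowed distance (angstroms) for sequence separation delta."""
    if delta <= 5:
        return None  # no constraint is stated in the text for delta < 6
    if delta <= 10:
        return 7.0
    if delta <= 15:
        return 11.0
    if delta <= 20:
        return 19.0
    return 27.0


def is_noncompact(coords_angstrom):
    """True if every pair separated by 6 or more residues respects d_min."""
    n = len(coords_angstrom)
    for i in range(n):
        for j in range(i + 6, n):
            dist = math.dist(coords_angstrom[i], coords_angstrom[j])
            if dist < d_min(j - i):
                return False
    return True


# A fully extended toy chain of 8 points spaced 3.8 A apart along x.
extended = [(3.8 * k, 0.0, 0.0) for k in range(8)]
print(is_noncompact(extended))
```

A random starting conformation that fails this check would simply be rejected and regenerated; that rejection loop is an assumption about usage, not something the text specifies.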
The single residue movements [18] are of two kinds: end flip movements for the N- and C-terminal residues and corner movements for the others. The choice of the move set is more or less arbitrary, as the elementary one-residue moves are sufficient to bring the protein to a folded state. In this case, the restriction to elementary moves only, apart from its simplicity, permits a sequential analysis of the chain tendency to form compact fragments around particular amino acid residues from the beginning of the simulation.
After each move, the calculated conformational energy was subjected to a standard Metropolis criterion, at constant temperature.
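For reference, the standard Metropolis acceptance test can be sketched as below. The temperature value and the energy units are not given in the text, so `temperature` here is a hypothetical parameter expressed in the same units as the energy difference; the function name and the `rng` argument are also illustrative.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng=random):
    """Standard Metropolis test for a trial move with energy change delta_e.

    Moves that lower or preserve the energy are always accepted; uphill
    moves are accepted with probability exp(-delta_e / temperature).
    """
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


# Example: a downhill move is always accepted.
print(metropolis_accept(-0.7, temperature=1.0))
```

Passing a seeded `random.Random` instance as `rng` makes a simulation run reproducible.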
Because the goal was to analyse the propensity of residues to be buried from the start of folding, we ensured that the maximum number of Monte Carlo steps was sufficient to allow formation of compact chain fragments. Due to the serial nature of the algorithm, this time limit is correlated to protein chain length L. It was empirically determined that for small proteins of about 50 residues, the value $t_{max}$ is around $10^{6}$ Monte Carlo steps. Thus, the following linear relation was adopted to generalize $t_{max}$ to proteins of any length L: $t_{max} = \mathrm{INT}(10^{6} L/50)$, where INT is the integer part, because $t_{max}$ is an integer by definition (Monte Carlo steps).
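The linear relation for the step limit is simple enough to state as a one-line function. This sketch uses integer floor division to play the role of INT; the function name is illustrative.

```python
def t_max(chain_length):
    """Maximum number of Monte Carlo steps: INT(10^6 * L / 50)."""
    return (10**6 * chain_length) // 50


# Step limits for the shortest and longest chains in the stated 50-250 range.
for length in (50, 250):
    print(length, t_max(length))
```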
Fig. 1. The lattice model. The solid line represents the backbone from Cα to Cα positions, while the dotted line is the underlying cubic lattice.
For each simulation, $10^{4}$ records of intermediate conformations were taken at regular time intervals. As the number of simulations per protein is 100 (one for each initial state), the end result is a set of $10^{6}$ records per protein.
For every recorded conformation, and for each amino acid residue, the number of residues with which it is in noncovalent interaction was calculated. In spatial terms, these noncovalent neighbours are the amino acid residues lying within a distance of 5.88 Å or $\sqrt{12}$ lattice units. For a given protein and for residue i, at the r-th record, the number of noncovalent neighbours is nc(i,r). The time mean of this quantity is

$$NC(i) = \frac{1}{10^{6}} \sum_{r=1}^{10^{6}} nc(i,r)$$

NC(i) values are rounded to the nearest integer. This mean number of noncovalent neighbours is a quantitative measure of the tendency of a residue to be buried from solvent. The higher the NC(i), the stronger this tendency.
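The neighbour count and its time mean can be sketched as follows. The sketch assumes that chain neighbours (|i − j| = 1) are excluded from the count, in line with the noncontiguous-pair rule of the energy function, and that rounding ties go upwards; neither detail is stated in the text, and the function names are illustrative.

```python
import math


def neighbour_counts(coords, cutoff_sq=12):
    """nc(i, r) for one record: neighbours within sqrt(12) lattice units."""
    n = len(coords)
    counts = [0] * n
    for i in range(n):
        for j in range(i + 2, n):  # chain neighbours are not counted
            d_sq = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            if d_sq <= cutoff_sq:
                counts[i] += 1
                counts[j] += 1
    return counts


def mean_neighbours(records):
    """NC(i): mean of nc(i, r) over all records, rounded to an integer."""
    n = len(records[0])
    totals = [0] * n
    for coords in records:
        for i, c in enumerate(neighbour_counts(coords)):
            totals[i] += c
    # Round half up (an assumption; the text only says 'nearest integer').
    return [math.floor(t / len(records) + 0.5) for t in totals]


folded = [(0, 0, 0), (2, 1, 0), (3, 1, 2), (1, 0, 2)]
extended = [(0, 0, 0), (2, 1, 0), (4, 0, 0), (6, 1, 0)]
print(neighbour_counts(folded))
print(mean_neighbours([folded, extended]))
```

In the study the mean runs over $10^{6}$ records per protein; the two-record list here only illustrates the bookkeeping.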
If $\overline{NC}$ is the mean value of NC(i) over the sequence for a given protein, the residues for which NC(i) is significantly higher than $\overline{NC}$ are of particular interest and are called mostly interacting residues (MIRs). Their selection requires fixing a cut-off value above the mean value $\overline{NC}$. It was found that NC(i) varies between 1 and 8 and that $\overline{NC} = 4$ for all studied proteins. Figure 2 presents the distribution of the different values of NC(i) over the amino acid residues of all 111 proteins. The most probable value is four, which coincides with the mean sequence value, which is also four for all proteins as stated above. From this distribution, it appears that 13% of residues have a number of noncovalent neighbours equal to or higher than six, which was adopted as the lowest NC(i) value for considering residue i as a MIR.
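The selection rule reduces to a threshold on NC(i). A minimal sketch, with a hypothetical function name and 1-based residue numbering assumed for the output:

```python
def select_mirs(nc_values, cutoff=6):
    """Return 1-based positions of residues with NC(i) >= cutoff."""
    return [i + 1 for i, nc in enumerate(nc_values) if nc >= cutoff]


# Toy NC(i) profile for a six-residue stretch (values are illustrative).
profile = [4, 6, 3, 8, 5, 7]
print(select_mirs(profile))
```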
In order to validate this model, once the positions of MIRs were determined they were compared to TEF limits and to topohydrophobic positions. The comparison with TEFs was performed on the complete database of 111 proteins. The comparison with TH positions was performed on a 73-protein subset of this database, where these positions were determined. For the remaining 38 proteins, the calculation of TH positions was not possible, because it requires at least four 3D structures of members of the same family, with a pairwise identity not exceeding 30% [14,15]. This criterion was not fulfilled for these 38 cases.
The PDB codes [21] of the database are given in Table 1.
Results
The Monte Carlo algorithm for folding simulation has been applied to the entire protein dataset and the histograms NC(i), containing the distribution of noncovalent neighbours along the amino acid sequence, have been obtained for each protein.
In Fig. 3 the positions of TEFs, TH and MIR are illustrated for 10 proteins of the database, representative of the various classes as determined by SCOP [22].
Among the 1920 calculated MIRs, 92% were hydrophobic, following the definition of topohydrophobic residues (i.e. they belonged to the set ‘VIMWYLF’). Also, the total numbers of MIRs and TH positions, in the 73-protein subset where they are compared, are relatively close (1299 MIRs vs. 1011 TH). In the same subset, the total number of TEFs was 309; thus the number of TEF limits was 618, about half the number of MIRs.
To assess the overall quality of agreement between predicted critical positions (MIR) and structure-defined ones (TH and TEF limits), a statistical analysis is required. This has been carried out over the whole database, i.e. over all 111 proteins for the comparison between MIR and TEF limits and for the subset of 73 proteins for the comparison between MIR and TH. The results are presented in two histograms in Figs 4 and 5. The histogram of Fig. 4 gives the comparison between MIR and TH positions and is constructed as follows. Each TH position is placed at the origin of the abscissa. Then, the neighbouring MIRs that are closer to this central TH than to any other TH are located. Their number is plotted as a function of their sequence distance with respect to the central TH. This is reproduced for all THs along all the 73 proteins of the data set. Thus Fig. 4 shows a histogram of the separation between TH and the closest MIR. The plotted distances range from −20 to +20, and MIRs lying at distances greater than ±20 residues from the closest TH are added to the histogram at the ±20 positions. The second histogram (Fig. 5) follows the same rules and concerns the comparison of MIR to TEF limits. It is constructed using the whole database of 111 proteins. From observation of Figs 4 and 5 it is evident that both comparisons, of MIR with TH and of MIR with TEF limits, present a clear peak at the origin. This is an indication that the residues predicted to be MIRs actually do correspond to TH positions. They also statistically correlate with TEF limits, which are mostly hydrophobic [13]; this is expected, as it was already shown that most TH positions are located at or in the vicinity of TEF ends [10]. The agreement between MIR and TH is very clear and 63% of MIR were found within ±5 positions from a TH residue. The TEF histogram presents two main secondary maxima at positions ±3 and 57% of MIR were found within ±5 positions from a TEF limit. This good agreement between prediction and analysis [13] is of great interest in the prediction of elements of the protein core from the sequence.
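The histogram construction described above can be sketched as follows. Each MIR is assigned to its nearest reference position (a TH position or a TEF limit), the signed sequence offset is recorded, and offsets beyond ±20 are accumulated in the end bins. The tie-breaking rule (lower-numbered reference wins), the sign convention (MIR minus reference) and the function names are assumptions of this sketch.

```python
from collections import Counter


def offset_histogram(mir_positions, ref_positions, max_offset=20):
    """Histogram of signed offsets from each MIR to its nearest reference."""
    hist = Counter()
    for m in mir_positions:
        # Nearest reference; ties resolved towards the smaller position.
        nearest = min(ref_positions, key=lambda t: (abs(m - t), t))
        offset = m - nearest
        # Offsets beyond the plotted range are added to the end bins.
        offset = max(-max_offset, min(max_offset, offset))
        hist[offset] += 1
    return hist


def fraction_within(hist, window=5):
    """Fraction of MIRs lying within +-window of a reference position."""
    total = sum(hist.values())
    close = sum(c for off, c in hist.items() if abs(off) <= window)
    return close / total if total else 0.0


# Toy positions for a single protein (illustrative values, not real data).
mirs = [10, 13, 50, 90]
ths = [10, 15, 60]
hist = offset_histogram(mirs, ths)
print(sorted(hist.items()))
print(fraction_within(hist))
```

Summing such per-protein histograms over the data set would give the pooled distributions of the kind shown in Figs 4 and 5, and `fraction_within` corresponds to the ±5 percentages quoted in the text.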
Discussion
The existence of critical positions in protein structures, punctuated by TH positions and/or TEF limits, is of great importance for protein folding and stability. Consecutive formation of the globule core [10,11,17] composed essentially of these residues [13] leads to tremendous optimization of the folding process, by reducing the conformational space to be explored. Thus, the prediction of these ‘hot’ residues becomes an important step in approaching the native three-dimensional structure. A first approach to this goal was undertaken in this study. The guiding hypothesis was that, in order to achieve fast folding, critical residues should have a tendency to contact each other and thus form the origins of the hydrophobic core. The results confirmed this hypothesis. Using a simple alpha-carbon lattice model, formation of the nucleation sites at the initial steps of the folding process was demonstrated.

Fig. 2. Distribution of the mean number of noncovalent neighbours over all 111 sequences of the dataset.

Table 1. A list of the PDB codes, names and SCOP classes of the proteins studied. The TEFs are known for all these proteins. Proteins with known TH positions are in bold. The uppercase letters at the end of the code correspond to the chain.

| PDB code | Name | SCOP | PDB code | Name | SCOP | PDB code | Name | SCOP |
|---|---|---|---|---|---|---|---|---|
| 1aep | Apolipophorin-III | α | 2sns | Staphylococcal nuclease | β | 1gmpA | RNase Sa | α + β |
| 1utg | Uteroglobin | α | 1yna | Xylanase II | β | 1aba | Glutaredoxin | α/β |
| 2mhr | Myohemerythrin | α | 2ayh | Bacillus 1–3, 1–4-β-glucanase | β | 1opr | Orotate phosphoribosyltransferase | α/β |
| 256bA | Cytochrome b562 | α | 1lcl | Serine esterase | β | 1ble | Fructose permease, subunit IIb | α/β |
| 1aa0 | Fibritin | α | 2pelA | Legume lectin | β | 3cla | Chloramphenicol acetyltransferase | α/β |
| 1occD | Cytochrome c oxidase | α | 1knb | Adenovirus fibre | β | 5nll | Flavodoxin | α/β |
| 1poc | Phospholipase A2 | α | 2stv | STNV coat protein | β | 3chy | Signal transduction protein | α/β |
| 1lis | Lysin | α | 1pmy | Pseudoazurin | β | 1cls | Cutinase | α/β |
| 1lbd | Retinoid-X receptor α | α | 1qabA | Transthyretin | β | 1dhr | Dihydropteridine reductase | α/β |
| 2cy3 | Cytochrome c3 | α | 2plv3 | Picornavirus | β | 5p21 | cH-p21 Ras | α/β |
| 2ilk | Interleukin-10 | α | 1cbs | Cellular retinoic-acid-binding protein | β | 1asu | Retroviral integrase, catalytic domain | α/β |
| 1rro | Oncomodulin | α | 1ivpA | 2 (HIV-2) protease | β | 1lbbA | Glutamate receptor ligand binding core | α/β |
| 2sas | Calcium-binding protein | α | 1ptf | Histidine-containing phosphocarrier | α + β | 1tml | Cellulase E2 | α/β |
| 4cpv | Parvalbumin | α | 1ubi | Ubiquitin | α + β | 1tpfB | Triosephosphate isomerase | α/β |
| 1bvd | Myoglobin | α | 1frd | 2Fe-2S ferredoxin | α + β | 1brsA | Endonuclease | α/β |
| 1hbg | Glycera globin | α | 153L | Lysozyme, Goose | α + β | 1akz | Uracil-DNA glycosylase | α/β |
| 2lhb | Lamprey globin | α | 1lsg | Lysozyme, Chicken | α + β | 1rvvA | Lumazine synthase | α/β |
| 2mhbA | Hemoglobin (horse) | α | 1acf | Profilin | α + β | 1ns5A | Hypothetical protein YbeA | α/β |
| 1dkeA | Hemoglobin (human) | α | 1ctf | Ribosomal protein L7/12 | α + β | 1jkeB | D-Tyr tRNAtyr deacylase | α/β |
| 1eca | Erythrocruorin | α | 1aihA | Integrase | α + β | 1iodG | Coagulation factor X | small |
| 1lki | Leukemia inhibitory factor | α | 1apyA | Glycosylasparaginase | α + β | 1dtdB | Carboxypeptidase inhibitor | small |
| 3cytO | Mitochondrial cytochrome c | α | 1ast | Astacin | α + β | 1icfI | MHC class II p41 invariant chain fragment | small |
| 3c2c | Cytochrome c2 | α | 1dtp | Diphtheria toxin | α + β | 2bbkL | Methylamine dehydrogenase | small |
| 1bp2 | Phospholipase A2 | α | 1nox | NADH oxidase | α + β | 1sgpI | Ovomucoid III domain | small |
| 1enh | DNA-binding protein | α | 2pii | Signal transduction protein | α + β | 1ajj | LDL receptor | small |
| 2erl | Pheromone | α | 1durA | Ferredoxin II, Peptostreptococcus | α + β | 1i8nA | Anti-platelet protein | small |
| 1pht | Phosphatidylinositol 3-kinase | β | 1fxd | Ferredoxin II, Desulfovibrio gigas | α + β | 1ejgA | Crambin | small |
| 1pwt | α-Spectrin, SH3 domain | β | 1c0bA | Ribonuclease A | α + β | 1ehs | Heat-stable enterotoxin B | small |
| 1semA | Signal transduction protein | β | 1shaA | c-src Tyrosine kinase | α + β | 1tgj | TGF-β3 | small |
| 1cauB | Seed storage protein 7S vicilin | β | 1ag2 | Prion protein domain | α + β | 4rxn | Rubredoxin, Clostridium pasteurianum | small |
| 1reiA | Immunoglobulin | β | 1abrA | Abrin A-chain | α + β | 1caa | Rubredoxin, Archaeon Pyrococcus furiosus | small |
| 1cdcA | CD2, first domain | β | 1plfB | Platelet factor 4 | α + β | 1fas | Fasciculin | small |
| 2 lm | Macromycin | β | 1mgsA | Chemokine (growth factor) | α + β | 1pk4 | Plasminogen | small |
| 1anu | Cohesin-2 domain | β | 1hucB | (Pro)cathepsin B | α + β | 1hpi | HIPIP, Ectothiorhodospira vacuolata | small |
| 1f3g | Glucose-specific factor III | β | 2act | Actinidin | α + β | 1hip | HIPIP, Allochromatium vinosum | small |
| 1sno | Staphylococcal nuclease | β | 2ci2 | Chymotrypsin inhibitor CI-2 | α + β | 1knt | Collagen type VI | small |
| 1gpc | DNA-binding protein | β | 1fkb | FK-506 binding | α + β | 1edmB | Factor IX | small |

Fig. 3. Examples of comparison of MIR, TH and TEF for 10 sequences of various folds. In each example, the PDB code (with the chain) is given, followed by the name, the SCOP class and the fold of the protein in parentheses. The following lines represent the sequence and the TEFs. The residues belonging to a TEF are indicated ‘I’. In case of TEF overlap, two lines are used for this representation (for example in protein 1shaA). The next line shows TH positions, where the corresponding residues are indicated ‘T’. The final line shows MIR residues, indicated by ‘M’. For 3chy and 5p21, due to the sequence length, the results appear in two consecutive blocks.
These results suggest that folding initiation can be based on the early formation of a set of nucleation sites around selected hydrophobic residues [10,11,13]. This is essentially the basis of the hydrophobic collapse mechanism [23], which supposes formation of hydrophobic tertiary interactions that initiate secondary structure. It can be extended to a unified nucleation–condensation mechanism, which is a combination of the hierarchical and hydrophobic collapse mechanisms [23,24]. In the latter case, hydrophobic tertiary interactions are consolidated at the same time as elements of secondary structure (with possible variations of the kinetics of the mechanism caused by the different intrinsic stabilities of the secondary structural elements). These models have been developed from experiments and simulations of folding and unfolding of several small proteins [23,24] and particularly from the analysis of the residual structure of denatured states, which are thought to correlate with the nucleation sites. The comparison of MIR predictions with this type of data is being considered for future studies.
The secondary peaks in the histogram representing the correlation between MIR and TEF (Fig. 5) come from the proteins belonging mainly to the α class. For these folds, the TEF limits are often located inside α helices and are mainly hydrophobic. Sometimes, the predicted MIR are not exactly these limits but are the nearest hydrophobic residues, which in a helix are located three positions away because of the α-helix periodicity. This observation is in full agreement with the definition of the van der Waals locks, as extended (three to five residues long) segments of polypeptide chains interacting with each other, and thus forming ‘loop-n-lock’ structures in globular proteins [13].
The main conclusion of this study is that the burial of MIR positions can create anchors for the sequential formation of closed loops. These results corroborate experimental evidence on the initial stages of the folding process. NMR analysis of folding intermediates of bovine pancreatic trypsin inhibitor [25] revealed loop formation in early, non-native states, stabilized by nonlocal interactions. Also, an NMR study on the folding of lysozyme [26] showed the early formation of hydrophobic clusters, which are linked together by long-range interactions. These interactions were shown not to occur in the native structure, but they are apparently important for keeping the loop structure and thereby speeding up the folding procedure. The appearance of these essential features in this folding simulation permits an initial estimation of the anchor regions for loop formation. This approach therefore provides a set of structural constraints from first principles for an unknown structure. This information could be incorporated at the early steps of a prediction method for building protein structures from the sequence by producing anchor residues known to belong to the structural core. In a second stage they can be introduced as a set of constraint distances in a more detailed modeling process.
Acknowledgements
This project has been funded by a Concerted Action from the European
Union, QLG2-CT-2002–01298, and by the Greek-French bilateral
PLATO program (grant no 04146WM). I. N. B. was also supported
by the Post-Doctoral Fellowship of the Feinberg Graduate School,
Weizmann Institute of Science.
References
1. Kunin, V., Cases, I., Enright, A.J., de Lorenzo, V. & Ouzounis, C.A. (2003) Myriads of protein families, and still counting. Genome Biol. 4, 401.
2. Koonin, E.V., Wolf, Y.I. & Karev, G.P. (2002) The structure of the protein universe and genome evolution. Nature 420, 218–223.
3. Xia, Y. & Levitt, M. (2004) Simulating protein evolution in sequence and structure space. Curr. Opin. Struct. Biol. 14, 202–207.
4. Rost, B. (2002) Did evolution leap to create the protein universe? Curr. Opin. Struct. Biol. 12, 409–416.
5. Liu, J. & Rost, B. (2003) Domains, motifs and clusters in the protein universe. Curr. Opin. Chem. Biol. 7, 5–11.
6. Daggett, V. & Fersht, A. (2003) The present view of the mechanism of protein folding. Nat. Rev. Mol. Cell. Biol. 4, 497–502.
7. Shakhnovich, E.I. (1997) Theoretical studies of protein-folding thermodynamics and kinetics. Curr. Opin. Struct. Biol. 7, 29–40.
8. Tiana, G., Shakhnovich, B.E., Dokholyan, N.V. & Shakhnovich, E.I. (2004) Imprint of evolution on protein structures. Proc. Natl Acad. Sci. USA 101, 2846–2851.
Fig. 4. Histogram of the correspondence between TH positions and MIR from a set of 73 proteins.
Fig. 5. Histogram of the correspondence between TEF ends and MIR from a set of 111 proteins.
9. Berezovsky, I.N., Grosberg, A.Y. & Trifonov, E.N. (2000) Closed loops of nearly standard size: common basic element of protein structure. FEBS Lett. 466, 283–286.
10. Lamarine, M., Mornon, J.P., Berezovsky, I.N. & Chomilier, J. (2001) Distribution of tightened end fragments of globular proteins statistically match that of topohydrophobic positions: towards an efficient punctuation of protein folding? Cell. Mol. Life Sci. 58, 492–498.
11. Berezovsky, I.N., Kirzhner, V., Kirzhner, A. & Trifonov, E.N. (2001) Protein folding: looping from hydrophobic nuclei. Proteins 45, 346–350.
12. Berezovsky, I.N. (2003) Discrete structure of van der Waals domains in globular proteins. Protein Engineering 16, 161–167.
13. Berezovsky, I.N. & Trifonov, E.N. (2001) Van der Waals locks: loop-n-lock structure of globular proteins. J. Mol. Biol. 307, 1419–1426.
14. Poupon, A. & Mornon, J.P. (1998) Populations of hydrophobic amino acids within protein globular domains; identification of conserved ‘topohydrophobic’ positions. Proteins 33, 329–342.
15. Poupon, A. & Mornon, J.P. (1999) ‘Topohydrophobic positions’ as key markers of globular protein folds. Theoret Chem. Accounts 101, 2–8.
16. Poupon, A. & Mornon, J.P. (1999) Predicting the protein folding nucleus from sequences. FEBS Lett. 452, 283–289.
17. Berezovsky, I.N. & Trifonov, E.N. (2002) Loop fold structure of proteins: resolution of Levinthal’s paradox. J. Biomol. Struct. Dynamics 20, 5–6.
18. Skolnick, J. & Kolinski, A. (1991) Dynamic Monte Carlo simulations of a new lattice model of globular protein folding, structure and dynamics. J. Mol. Biol. 221, 499–531.
19. Labesse, G., Colloc’h, N., Pothier, J. & Mornon, J.P. (1997) P-SEA: a new efficient assignment of secondary structure from C alpha trace of proteins. Comput Appl. Biosci. 13, 291–295.
20. Miyazawa, S. & Jernigan, R.L. (1996) Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term for simulation and threading. J. Mol. Biol. 256, 623–644.
21. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. & Bourne, P.E. (2000) The Protein Data Bank. Nucleic Acids Res. 28, 235–242.
22. Murzin, A.G., Brenner, S.E., Hubbard, T. & Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
23. Fersht, A. & Daggett, V. (2002) Protein folding at atomic resolution. Cell 108, 573–582.
24. Fersht, A. (1997) Nucleation mechanisms in protein folding. Curr. Opin. Struct. Biol. 7, 3–9.
25. Ittah, V. & Haas, E. (1995) Nonlocal interactions stabilize long range loops in the initial folding intermediates of reduced bovine pancreatic trypsin inhibitor. Biochemistry 34, 4493–4506.
26. Klein-Seetharaman, J., Oikawa, M., Grimshaw, S.B., Wirmer, J., Duchardt, E., Ueda, T., Imoto, T., Smith, L.J., Dobson, C.M. & Schwalbe, H. (2002) Long-range interactions within a non-native protein. Science 295, 1719–1722.