A knowledge-based potential function predicts the
specificity and relative binding energy of RNA-binding
proteins
Suxin Zheng 1,*, Timothy A. Robertson 2,* and Gabriele Varani 1,2
1 Department of Chemistry, University of Washington, Seattle, WA, USA
2 Department of Biochemistry, University of Washington, Seattle, WA, USA
The sequence-specific recognition of RNA by proteins
plays a fundamental role in gene expression by direct-
ing different cellular RNAs to specific processing path-
ways or subcellular locations. Many experimental
studies have explored the molecular basis for the
sequence dependence of protein–RNA recognition [1–
4]; more recently, a few studies have explored this prob-
lem from a computational perspective as well [5–16].
However, these early studies have emphasized qualita-
tive descriptions of the recognition process; relatively
few attempts have been made to quantify the character-
istics of protein–RNA interactions using computational
approaches [17]. Here, we present a new approach for
predicting the specificity of RNA-binding proteins and
for evaluating the contribution of individual amino acids
to the energetics of protein–RNA complexes.
Knowledge-based potential functions have been
employed in protein structure prediction [18–27], as
well as in the prediction of protein–protein [25,28–30]
and protein–ligand interactions [30–33]. A few studies
have explored the use of knowledge-based methods for
the prediction of protein–DNA interactions from
structure [30,34,35]. More recently, our group [36] and
others [37] have independently demonstrated that
knowledge-based potentials can provide quantitative
descriptions of protein–DNA interfaces comparable to
those provided using molecular mechanics force fields
[37].
The relative scarcity of high-resolution structures of
protein–RNA complexes has represented an under-
standable barrier to the quantitative application of
computational approaches to the problem of protein–
RNA recognition. However, we have previously dem-
onstrated that a statistical hydrogen bonding potential
can discriminate native structures of protein–RNA
complexes from docking decoy sets [17]. As hydrogen
bonds represent only approximately 25% of contacts
between protein and RNA [12], we reasoned that a
more comprehensive approach would describe these
interactions more effectively.

Keywords
distance-dependent potential; protein–RNA interaction; RRM recognition; statistical potential
Correspondence
G. Varani, Department of Chemistry and Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
Fax: +1 206 685 8665
Tel: +1 206 543 7113
E-mail:
*These authors contributed equally to this work
(Received 25 July 2007, revised 22 September 2007, accepted 19 October 2007)
doi:10.1111/j.1742-4658.2007.06155.x

RNA–protein interactions are fundamental to gene expression. Thus, the
molecular basis for the sequence dependence of protein–RNA recognition
has been extensively studied experimentally. However, there have been very
few computational studies of this problem, and no sustained attempt has
been made towards using computational methods to predict or alter the
sequence-specificity of these proteins. In the present study, we provide a
distance-dependent statistical potential function derived from our previous
work on protein–DNA interactions. This potential function discriminates
native structures from decoys, successfully predicts the native sequences
recognized by sequence-specific RNA-binding proteins, and recapitulates
experimentally determined relative changes in binding energy due to mutations
of individual amino acids at protein–RNA interfaces. Thus, this work
demonstrates that statistical models allow the quantitative analysis of
protein–RNA recognition based on their structure and can be applied to
modeling protein–RNA interfaces for prediction and design purposes.

Abbreviations
KH, K homology; MD, molecular dynamics; PDB, Protein Data Bank; RRM, RNA recognition motif; SRP, signal recognition particle.
FEBS Journal 274 (2007) 6378–6391 © 2007 The Authors Journal compilation © 2007 FEBS

In the present study, we report the application of an
all-atom, distance-dependent statistical potential to the
prediction of sequence-specific recognition between
proteins and RNA. We demonstrate that this approach
can discriminate native structures of complexes from
even close docking decoys, recapitulate experimentally
determined relative binding energies (ΔΔG values) for several
protein–RNA complexes, and predict the RNA
sequences recognized by a number of different RNA
recognition motif (RRM) and K homology (KH)
domains. These results demonstrate that statistical
models can be applied to problems requiring the high-
resolution modeling of protein–RNA interactions. The
anticipated future enrichment of the structural data-
base will further improve the predictive performance
of the potential.
Results
The all-atom distance potential is constructed from the
distribution of interatomic distances observed in the
high-resolution (< 2.5 Å) structures of protein–RNA
complexes deposited in the Protein Data Bank (PDB).
In this approach, the ‘correctness’ of a protein–RNA
structure is assumed to be approximated by the sum of
the probabilities of observing the set of intermolecular
distances defined in the 3D structure, relative to the
likelihood of encountering such distances in the dataset
of all protein–RNA structures. This kind of method
was proposed by Sippl [20], and has been applied to
protein structure prediction, protein–protein and pro-
tein–ligand interactions [18–33], as well as to protein–
DNA recognition [30,34–37]. The distance-dependent
statistical potential used here for protein–RNA inter-
faces is essentially identical to the score recently
described by us for protein–DNA complexes [36]. The
primary difference is the introduction of a new pseudocount
correction, where an optimized number of
pseudocounts are added to the observed counts for
each atom pair (for additional details, see Experimen-
tal procedures). As a control, we also tested a simple
contact-counting method, wherein every contact
between protein and RNA (within a given distance
cut-off) was assigned the same score of −1.
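The scoring scheme described above can be illustrated with a minimal Python sketch. The exact functional form, bin width, reference state and pseudocount value used in this work are given in the Experimental procedures and are not reproduced here; the inverse-Boltzmann log-ratio, the 1 Å bins, the value of `PSEUDOCOUNT`, the function names and the toy training data below are all assumptions made only for illustration.

```python
import math
from collections import defaultdict

CUTOFF = 6.0       # contact cut-off in Å (one of the values tested in the text)
BIN_WIDTH = 1.0    # assumed bin width in Å
PSEUDOCOUNT = 1.0  # assumed value; the text only states that this number is optimized
N_BINS = int(CUTOFF / BIN_WIDTH)


def bin_index(distance):
    """Return the distance bin, or None if the pair lies beyond the cut-off."""
    if distance >= CUTOFF:
        return None
    return int(distance / BIN_WIDTH)


def train(observations):
    """Build a score table from (protein_atom, rna_atom, distance) tuples."""
    pair_counts = defaultdict(lambda: [0.0] * N_BINS)
    total_counts = [0.0] * N_BINS
    for prot_atom, rna_atom, dist in observations:
        b = bin_index(dist)
        if b is None:
            continue
        pair_counts[(prot_atom, rna_atom)][b] += 1.0
        total_counts[b] += 1.0

    grand_total = sum(total_counts) + PSEUDOCOUNT * N_BINS
    table = {}
    for pair, counts in pair_counts.items():
        # Pseudocounts keep sparsely observed bins from producing log(0).
        padded = [c + PSEUDOCOUNT for c in counts]
        pair_total = sum(padded)
        scores = []
        for b in range(N_BINS):
            p_pair = padded[b] / pair_total
            p_ref = (total_counts[b] + PSEUDOCOUNT) / grand_total
            scores.append(-math.log(p_pair / p_ref))
        table[pair] = scores
    return table


def score_complex(table, contacts):
    """Sum the per-pair scores over all contacts within the cut-off."""
    total = 0.0
    for prot_atom, rna_atom, dist in contacts:
        b = bin_index(dist)
        if b is None or (prot_atom, rna_atom) not in table:
            continue
        total += table[(prot_atom, rna_atom)][b]
    return total


# Toy training data (hypothetical): N-O pairs cluster near 2.8 Å, C-C near 4.2 Å.
toy_observations = (
    [("N", "O", 2.8)] * 8 + [("N", "O", 4.5)] * 2
    + [("C", "C", 4.2)] * 8 + [("C", "C", 2.9)] * 2
)
toy_table = train(toy_observations)
print(score_complex(toy_table, [("N", "O", 2.8)]))  # enriched distance: negative
print(score_complex(toy_table, [("N", "O", 4.5)]))  # depleted distance: positive
```

In this sketch a negative score marks a distance that is seen more often for a given atom pair than for atom pairs in general, so lower totals indicate more native-like interfaces.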
Docking decoy discrimination
An important property of any potential function is its
ability to discriminate cognate (native crystallographi-
cally determined structures) from noncognate (decoy)
structures [38]. As a preliminary test of our method,
and a direct comparison with previous work, we used
our distance-dependent potential to evaluate five sets
of docking decoys generated for the application of
the ROSETTA physical potential function to protein–RNA
interactions [17]. These decoys were created
using a combination of rigid-body docking and protein
side-chain repacking, and range in rmsd (relative
to the native structure) from 0.2 Å to over 20 Å.
Thus, they represent a solid basis for comparison to
a much more complex scoring method (the multiterm,
hybrid physical/statistical potential function used by
ROSETTA).
When scored with the distance-dependent potential,
the native complex is identified as the best
structure in each of the five decoy sets (Fig. 1), even
for decoys that are very close to the native structure.
The native structure Z-scores for these decoy sets are
shown in Table 1. These values indicate a strong discriminatory
ability, comparable to that reported by
Chen et al. [17] using their significantly more complex
scoring method. Overall, the distance potential (using
a 6 Å cut-off) results in a mean native Z-score of
−5.45, versus the value of −6.37 obtained by Chen
et al. [17] (Table 1); this difference is not statistically
significant (P = 0.53, Welch’s two-sided t-test), indicating
that the two methods perform comparably on this test.
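The comparison above can be retraced from the per-complex values in Table 1. The sketch below is an illustration only: the Z-score definition in `native_z_score` (native score minus decoy mean, divided by the decoy sample standard deviation) is the conventional one and is assumed here, and the p-value is obtained by simple numerical integration of the Student t density so that no third-party library is needed.

```python
import math
from statistics import mean, stdev

# Native Z-scores from Table 1, in the order 1CVJ, 1EC6, 1FXL, 1JID, 1URN.
distance_z = [-7.02, -6.46, -2.66, -6.29, -4.80]
rosetta_z = [-5.11, -6.53, -2.70, -9.12, -8.39]


def native_z_score(native_score, decoy_scores):
    """Assumed definition: offset of the native score in decoy standard deviations."""
    return (native_score - mean(decoy_scores)) / stdev(decoy_scores)


def welch_t(a, b):
    """Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = stdev(a) ** 2 / len(a)
    vb = stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def two_sided_p(t, df, steps=20000):
    """Two-sided p-value via Simpson integration of the Student t density."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return norm * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3
    return 1 - 2 * area


t_stat, dof = welch_t(distance_z, rosetta_z)
p_value = two_sided_p(t_stat, dof)
print(f"mean distance Z = {mean(distance_z):.2f}")
print(f"mean ROSETTA + HB Z = {mean(rosetta_z):.2f}")
print(f"Welch t = {t_stat:.2f}, df = {dof:.1f}, two-sided P = {p_value:.2f}")
```

With only five complexes per method, a test of this kind has little power, so a non-significant difference is better read as "no detectable difference on this benchmark" than as proof of equivalence.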
When we investigated protein–DNA complexes
using the same approach, we demonstrated that the
all-atom potential outperformed a reduced-atom
description, in which atoms were grouped
according to their chemical similarity (as described in
the Experimental procedures) [36]. Given the relative
sparsity of the structural database, we investigated
whether a reduced-atom representation would lead
to improved performance in the protein–RNA case.
The all-atom potential performs better than the
reduced-atom potential (mean Z-score −5.45 versus
−4.66; see also supplementary Table S4), although the
difference is not as striking as for protein–DNA complexes.
We believe this is due to less favorable statistics
(fewer structures of protein–RNA complexes). We
anticipate that the increasing availability of protein–RNA
structures, together with the availability of data
on specificity, will further improve the performance of
the knowledge-based predictive method presented here.
We retained the all-atom representation because it
already performs slightly better than the reduced-atom
approach.
The protein–RNA score has distinctive properties
compared to the protein–DNA potential. When we
scored the protein–RNA decoy set using the protein–DNA
potential, the average Z-score was approximately
half that obtained with the protein–RNA
potential (−2.84 versus −5.45; see also supplementary
Table S4). Thus, although the chemistry of RNA and
DNA is very similar, the structure of RNA allows
for different interactions between proteins and the two
nucleic acids, and these differences are reflected in this result.

Fig. 1. Score–rmsd plots for the five docking decoy sets generated by Chen et al. [17]; the score generated by the distance-dependent potential (in arbitrary units) is plotted versus the deviation from the native structure (open circle at rmsd = 0). (A) Poly A-binding protein in complex with polyadenylate RNA (PDB code: 1CVJ). (B) Nova-2 KH RNA-binding domain 3 (PDB code: 1EC6). (C) HuD protein in complex with AU-rich RNA (PDB code: 1FXL). (D) Human SRP19 in complex with human SRP RNA (PDB code: 1JID). (E) Human U1A protein in complex with U1 snRNA hairpin (PDB code: 1URN). Close-up views of near-native decoys (0–3 Å rmsd) are shown in the insets.

To investigate whether the statistical potential is
simply reflecting the size of an interface or the number
of intermolecular contacts, we also used a very simple
contact-counting potential to evaluate the same decoys;
in this method, the fitness of an interface is evaluated
by counting the number of close approaches between
the protein and RNA. Reassuringly, this method was
much less effective, providing an average Z-score of
−2.64, less than half of the average native Z-score
found using the distance potential (Table 1).
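The contact-counting control is simple enough to sketch in a few lines of Python. The function name, the representation of atoms as bare (x, y, z) tuples and the use of a strict "less than" comparison at the cut-off are assumptions made for illustration; the text specifies only that every contact within the cut-off receives a score of −1.

```python
import math


def contact_count_score(protein_atoms, rna_atoms, cutoff=6.0):
    """Assign -1 to every protein-RNA atom pair closer than `cutoff` (in Å)."""
    score = 0
    for p in protein_atoms:
        for r in rna_atoms:
            if math.dist(p, r) < cutoff:
                score -= 1
    return score


# Hypothetical coordinates in Å, chosen only to exercise the function.
protein = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
rna = [(3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
print(contact_count_score(protein, rna, cutoff=6.0))
print(contact_count_score(protein, rna, cutoff=10.0))
```

Because every contact contributes equally, this score depends only on how many atoms are close together, not on which atom types they are or how far apart they sit, which is what makes it a useful baseline for the distance-dependent potential.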
Interestingly, the magnitude of the observed
Z-scores declines significantly as the contact cut-off is
increased from 6 Å to 10 Å and then to 12 Å (see supplementary
Table S5), suggesting that short-range contacts
provide the bulk of the discriminatory power in
this test. This result suggests that protein–RNA recognition
specificity is primarily determined by short-range
intermolecular contacts. Long-range effects (e.g.
nonlocal electrostatics) appear to play a more limited
role, at least in decoy discrimination.
To test the potential on near-native decoys, we next
compared its ability to
discriminate near-native protein–RNA structures with
that of the force field implemented in the AMBER 8
molecular simulation package. We generated near-native
protein–RNA decoys for 21 protein–RNA
complexes by conducting molecular dynamics (MD)
simulations of the native complexes, and by selecting
multiple time-steps from the resulting trajectories for
each structure. We then scored these structures using
the distance-dependent potential function, and examined
the correlations between distance scores and
AMBER energies for each decoy set.
This is a difficult test of score performance because
the structures are very close to native. Indeed, neither
the distance-dependent score nor the AMBER potential
appears to be able to discriminate native structures
from these very near-native, MD-generated decoys
(average Z-score of −0.69 versus −0.59; Table 2).
Although there is no correlation of either score
with rmsd, the distance-dependent statistical potential
is somewhat correlated (average R² = 0.41) with the
energy values predicted by the AMBER force field. Thus,
it remains very difficult for either approach to discriminate
the native structure from structures that are close
to it in energy.
Identifying RNA-binding sequences from
structure
Having established the performance of the statistical
potential function in decoy discrimination, we investigated
the ability of the potential to perform tasks relevant
to its intended application. First, we sought to
evaluate whether the potential could predict the cog-
nate recognition sequences of RNA-binding proteins.
This is a particularly important problem because
sequence specificity is known for only a fraction of all
RNA-binding proteins. The ability to predict (or at
least narrow down) the cognate sequence for ‘orphan’
RNA-binding proteins would greatly facilitate the
design of biological experiments aimed at dissecting
the function of these proteins. It is also a problem that
is not well suited for MD approaches because of the
demanding computational requirements.

Table 1. Native Z-scores and score–rmsd correlation coefficients for the protein–RNA docking decoy sets prepared by Chen et al. [17].

PDB code    Distance-dependent (a)   Coulomb (b)     ROSETTA + HB (c)   Contact count (a)
1CVJ        −7.02                    −1.19           −5.11              −2.44
1EC6        −6.46                    −1.09           −6.53              −3.00
1FXL        −2.66                    −1.55           −2.70              −1.26
1JID        −6.29                    −1.36           −9.12              −3.09
1URN        −4.80                    −1.35           −8.39              −3.39
Mean ± SD   −5.45 ± 1.76             −1.31 ± 0.18    −6.37 ± 2.58       −2.64 ± 0.84

(a) Using a 6 Å contact cut-off. (b) From Chen et al. [17] and referring to a potential lacking the directional component of hydrogen bonding (HB) interactions. (c) From Chen et al. [17] and referring to the complete potential function.

Table 2. Z-scores and correlations for near-native decoys generated by MD simulation.

PDB code    Largest rmsd (Å)   Z-score (distance-dependent)   Z-score (AMBER)   Distance-dependent versus AMBER (R²)
1B7F        2.33               −3.51                          −3.01             0.37
1CVJ        2.63               −1.13                          −0.44             0.51
1DFU        2.45               −1.82                          −1.93             0.39
1E7K        2.58               −0.38                          0.83              0.34
1EC6        5.21               −3.02                          −2.29             0.71
1FJE        3.54               1.49                           2.51              0.42
1FXL        2.22               −1.32                          −1.34             0.47
1JBS        2.46               −0.31                          0.60              0.38
1JID        2.92               0.26                           1.05              0.54
1K1G        3.48               0.52                           1.95              0.52
1KNZ        2.44               −1.39                          −1.69             0.24
1M8W        2.47               −1.10                          −1.21             0.44
1R9F        2.55               −1.24                          −0.19             0.11
1RKJ        3.85               0.92                           3.00              0.19
1URN        3.06               −1.08                          −2.88             0.46
2AD9        3.94               −0.68                          −2.09             0.42
2ADB        2.66               0.37                           −2.15             0.16
2ADC        3.01               −0.21                          −0.65             0.63
2ASB        2.12               −0.94                          −2.07             0.48
2ATW        2.19               −1.38                          −2.94             0.33
2CJK        2.75               1.42                           2.47              0.42
Mean ± SD                      −0.69 ± 1.28                   −0.59 ± 1.94      0.41 ± 0.15

This application relies on a specific structural
model of RNA recognition by RRM and KH
domains involving four nucleotides, as detailed in the
Experimental procedures. This model is strongly supported
by previous research on the mechanism of
RNA recognition for RRM proteins [6,39,40] and by
the structure of existing KH domains bound to RNA
[6,41–44]. As a consequence of the assumptions of
the model, complexes containing two RNA-binding
domains were divided into independent structures
(e.g. 1CVJ_1 and 1CVJ_2 represent the first and second
Poly A binding protein domain of structure
1CVJ, respectively), and the two domains were considered
structurally and thermodynamically unrelated.
Because the model assumes that each RRM and KH
domain binds to each of four nucleotides independently,
we generated a set of 4^4 (256) different
structures for each protein–RNA complex by computationally
‘threading’ all possible four-nucleotide combinations
onto the RNA bases nearest the center of
the β-sheet structure of the RRM. We then scored
these sequence-variant structures with the distance-dependent
potential function.
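The enumeration and ranking step described above can be sketched as follows. The structural threading and the statistical potential itself are not reproduced: `toy_score` is a hypothetical stand-in that simply rewards positions matching a placeholder cognate sequence ("GCAU", chosen arbitrarily), and the function names are assumptions made for illustration.

```python
from itertools import product

BASES = "ACGU"


def enumerate_sequences(length=4):
    """All RNA sequences of the given length (4**4 = 256 for four positions)."""
    return ["".join(s) for s in product(BASES, repeat=length)]


def cognate_rank(score_fn, cognate, length=4):
    """1-based rank of the cognate sequence; lower scores rank first."""
    ranked = sorted(enumerate_sequences(length), key=score_fn)
    return ranked.index(cognate) + 1


def toy_score(seq, cognate="GCAU"):
    """Hypothetical stand-in for threading plus scoring: -1 per matching base."""
    return -sum(a == b for a, b in zip(seq, cognate))


print(len(enumerate_sequences()))
print(cognate_rank(toy_score, "GCAU"))
```

In a real application, `score_fn` would build the sequence-variant structure for each of the 256 sequences and return its distance-dependent score.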
Figure 2 shows the results of this analysis. If the
potential and model of recognition were perfect, and if
each structure was sequence-specific and corresponded
to the most favorable sequence recognized by a given
domain, the cognate sequences of the tested structures
would be expected to rank as number 1. Because it is
unlikely that the cognate recognition sequences for all
domains will be consistently assigned the best score,
we expressed sequence-discrimination performance in
terms of percentiles (where perfect discrimination of
the cognate recognition sequence would result in a
percentile score of 100). Remarkably, we found that 18
of the 29 tested RRM and KH domain complexes had
their cognate recognition sequence ranked above the
90th percentile (i.e. had better than ten-fold enrich-
ment for the correct sequence). Furthermore, the
distance-dependent potential ranks the cognate recog-
nition sequences of the protein–RNA complexes in our
test set above the 90th percentile, on average. By con-
trast, when we performed the same test using a simple
counting potential as a control (Fig. 2), the average
rank was the 41st percentile.
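The relationship between a rank out of 256 and a percentile can be made concrete with a short sketch. The exact percentile convention used in this work is not stated in this section; the formula below is one assumption that maps rank 1 to a percentile of 100, as the text requires, and the function name is hypothetical.

```python
def rank_to_percentile(rank, n_sequences=256):
    """Assumed convention: rank 1 (best) maps to 100, rank n maps to 100 / n."""
    if not 1 <= rank <= n_sequences:
        raise ValueError("rank must lie between 1 and n_sequences")
    return 100.0 * (n_sequences - rank + 1) / n_sequences


for rank in (1, 26, 27, 256):
    print(rank, round(rank_to_percentile(rank), 1))
```

Under this convention, a cognate sequence must rank 26th or better out of 256 to lie above the 90th percentile, which corresponds to the roughly ten-fold enrichment mentioned above.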
Among successful examples of binding-sequence
discrimination, the native sequences of the RRM1 of
Sex-lethal protein (1B7F_1) and KH1 domain of
Poly C-binding protein-2 were both ranked first out
of 256 sequences, whereas KH domain 3 of hnRNP K
(1ZZI), RRM of U2B″ protein (1A9N) and RRM 4 of
Polypyrimidine Tract Binding protein (2ADC_1) each
had their cognate recognition sequences ranked in the
top 3 (supplementary Table S2). However, prediction
was less successful for other RRM domains, such as
the U1A complex (the cognate recognition sequence of
U1A protein was ranked at 30). This result is not
too surprising, however, because U1A recognizes a noncanonical,
seven-nucleotide sequence (AUUGCAC)
and makes an unusually specific
and strong interaction with RNA, unparalleled in
other known RRMs [45]. Relatively poor results were
also obtained for the Poly A binding protein (1CVJ_1,
rank 19), and for RRM1 of the HuD protein
(1FXL_1, rank 32). Both Pab and HuD utilize two
domains to achieve sequence-specific recognition in a
cooperative manner and do not discriminate well
between sequences that are related to their cognate recognition
motif (A-rich and AU-rich sequences, respectively)
[46]. As expected, the nonsequence-specific
RNA helicase protein (PDB code: 2DB3, included as a
negative control) had a poor cognate
sequence rank of 226/256.
Estimating experimentally determined relative
RNA-binding affinities
A second very important property of any potential
function is the ability to recapitulate the sequence
dependence of experimental binding energies; this is a
prerequisite if the potential is to be applied to prob-
lems of protein–RNA interface prediction or design.
Fortunately, a few structures have a relatively dense
set of experimentally determined binding constants for
interface mutations. We used these experimentally
characterized mutants to create a set of computationally
‘mutated’ structures of the complexes (Table 3),
and scored these structures using the distance-dependent
statistical potential.

Fig. 2. Structure-based identification of RRM recognition sequences. The cognate sequence is ranked by the distance potential (cut-off = 6 Å) for RRM/KH domain proteins. The red line represents the rank of cognate recognition sequences using the contact-counting score; the blue line represents the rank of these sequences using the distance-dependent potential. The points in each colored line are sorted independently by rank; the x-axis is the sort order. The dashed line represents the 10th percentile.

A first very instructive example is provided by
mutants of bacteriophage MS2 coat protein [47,48].
Starting with the crystal structure of the complex
between MS2 coat protein and the cognate RNA hairpin
(PDB code: 1ZDI), a series of structures were generated,
representing the RNA and protein mutants for
which binding constants are reported in the literature.
The distance-dependent potential scores for these
structures were then compared with the known binding
constants for each mutation. When all of
the MS2 mutations were considered together, a poor
correlation was observed between distance score and
experimental binding affinities (data not shown). However,
excellent correlations were obtained between
these values when the binding-affinity data were
divided into two subsets (Table 3, Fig. 3). The first set
corresponds to complexes where the bound RNA hairpin
contained an adenine, guanine or uridine base at position
−5; the second set instead contains protein
mutants where the bound RNA contained a cytosine
at this position. Within each set of mutants, the correlation
between distance score and experimental binding
affinity is strong (R² = 0.65, Fig. 3A; R² = 0.97,
Fig. 3B), and statistically significant at the 95% confidence
level. Figure 3C shows a likely explanation for
this result: an intramolecular hydrogen bond formed
by the cytosine at position −5 [47]. When this nucleotide
is mutated to any other base, the intramolecular
hydrogen bond is lost, leading to a reorganization of
the RNA structure.
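Why pooling the two MS2 subsets can destroy an otherwise strong correlation is easy to show with made-up numbers. In the sketch below, all data are hypothetical: each group follows the same linear trend between score and log Kd, but the second group carries a constant offset that stands in for an energy term the score does not model (such as the loss of an RNA intramolecular hydrogen bond). The helper `r_squared` is the squared Pearson correlation coefficient.

```python
def r_squared(x, y):
    """Squared Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical scores and log Kd values for two groups of mutants.
scores_a = [1.0, 2.0, 3.0, 4.0]
logkd_a = [1.0, 2.0, 3.0, 4.0]
scores_b = [1.0, 2.0, 3.0, 4.0]
logkd_b = [5.0, 6.0, 7.0, 8.0]  # same trend, shifted by an unmodeled offset

r2_a = r_squared(scores_a, logkd_a)
r2_b = r_squared(scores_b, logkd_b)
r2_pooled = r_squared(scores_a + scores_b, logkd_a + logkd_b)
print(r2_a, r2_b, r2_pooled)
```

Within each hypothetical group the score tracks log Kd perfectly, whereas the pooled correlation is much weaker because the offset between groups is invisible to the score.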

Table 3. Correlations between the distance-dependent score and the experimental free energy of binding for several mutant protein–RNA complexes.

                                      Distance-dependent          Contact counting
                                      6 Å     10 Å    12 Å        6 Å     10 Å    12 Å
Protein mutations
MS2 (no cytosine at position −5)      0.43    0.50    0.65        0.19    0.10    0.08
MS2 mutations (with cytosine at −5)   0.81    0.81    0.97        0.43    0.14    0.09
Fox-1                                 0.40    0.45    0.47        0.47    0.43    0.42
U1A (a)                               0.27    0.48    0.65        0.29    −0.06   −0.03
U1A (b)                               0.04    0.14    0.39        0.29    −0.06   −0.03
RNA mutations
Fox-1                                 0.20    −0.39   −0.57       0.30    0.33    0.35
SRP; 2′-OH mutations                  0.87    0.56    0.52        0.36    0.30    0.29
SRP; base mutations                   −0.07   −0.03   −0.07       0.01    0.07    0.05

(a) The native U1A complex was included in the training set for this experiment. (b) The U2B″ complex (U1A homolog) was included in the training set for this experiment.

Fig. 3. Correlation between scores generated by the distance-dependent statistical potential and experimental binding free energies (log Kd) for mutants of the MS2 coat protein. (A) Complexes between protein mutants and RNA containing nucleotides other than cytosine at position −5. (B) Complexes between protein mutants and RNA containing cytosine at position −5. (C) The characteristic intramolecular hydrogen bond between the amino group of C5 and the O1P atom of U6 observed in the structure of the MS2–RNA complex containing a cytosine at position −5 that helps organize the RNA structure for protein binding [47].

This result does not provide direct information on
the relative contribution of that hydrogen bond to
the overall binding energy; it simply implies that
mutations must be segregated into two groups to
obtain a clear correlation between experimental and
predicted relative affinities. The most likely explanation
for this result is that, at present, the statistical
potential does not consider RNA intramolecular contacts;
therefore, contributions to binding energy due to
changes in RNA structure (i.e. those that occur when that
hydrogen bond is lost) cannot be captured by our current
approach.
A second example that reinforces our interpretation
of the results obtained with MS2 is provided by Fox-1
protein, which regulates alternative splicing of tissue-specific
exons by binding to the GCAUG sequence
[49]. The structure of the complex (PDB code: 2ERR)
and the experimental binding constants for two sets of
related mutations have been reported [49]: one set for
mutations on the Fox-1 protein and a second set for
mutations to its target RNA molecule. A moderately
strong correlation was observed between the distance
score and the protein mutation data (R² = 0.46,
Fig. 4), but an anticorrelation was observed for the set
of RNA mutations (signed correlation of −0.57; Table 3). As in the
previous case, this result reflects the failure of the
current statistical potential to capture the energetic
contribution associated with the disruption of RNA
intramolecular interactions that are a characteristic of
this complex [49].
A third example is human U1A protein (PDB code:
1URN), an excellent model for the RRM superfamily
because of the availability of NMR and crystallographic
structures [50,51], as well as binding data.
In this case, we observed poor correlations between
the distance-dependent score and the experimentally
determined dissociation constants (Kd) [52] when we
conducted a test using a training set of strictly nonhomologous
protein–RNA structures. Initially, we
assumed that this observation reflected the very
large and energetically significant conformational
changes that have been observed in the RNA and
protein upon complex formation [53]. However, when
the U1A complex itself was included in the training
set, we obtained moderate to strong correlations (R²
values between 0.27 and 0.65, depending on the
choice of distance cut-off). This suggests that U1A
binds to RNA by forming intermolecular interactions
that are not commonly observed in the database of
training structures. This hypothesis is supported by
the observation that the inclusion of a close U1A
homolog (the U2B″–U2A′ complex) in the training set
improves the results of this test as well (R² increases
from 0.04 to 0.39; Table 3). Thus, it appears that the
structure of the U1A complex or of its homolog
contains a set of protein–RNA atomic contacts (i.e.
interatomic distances) that are not well represented in
the 71 other protein–RNA complexes in our training
set.

Fig. 4. (A) Correlation between scores generated by the distance-dependent statistical potential and experimental binding free energies (log Kd) for mutants of the Fox-1 protein. (B) The intramolecular hydrogen bond between uracil 1 and cytosine 3, and the non-Watson–Crick base pair between guanine 2 and adenine 4 for the RNA in complex with Fox-1 protein (PDB code: 2ERR). The protein is represented in yellow; the RNA structure is colored by atom type.

Figure 5 shows the final example, a universally conserved
component of the core of the signal recognition
particle (SRP). The structure of the complex (PDB
code: 1HQ1) and the binding affinity of a series of
RNA mutants have been determined [54]. The distance
potential results in scores that correlate significantly
(R² = 0.52, P ≤ 0.05) with experimental binding affinities
for mutations involving substitutions of deoxynucleotides
for their corresponding ribonucleotides.
However, as observed for Fox-1, no significant
correlation was found for mutations of nucleotides
that disrupt critical RNA intramolecular interactions.
In this final case, these mutations involve the disruption
of base pairs near the binding interface that define
the secondary structure of the RNA, which is
important for recognition, but do not contribute
directly to the formation of intermolecular contacts
[54].
Discussion
The central role of protein–RNA interactions in
the regulation of gene expression has led to consider-
able interest in the biochemical processes underlying
these interactions [55–57]. However, much of this
research has been devoted to the study of the structure/function
relationship for individual protein–RNA
complexes, and little effort has been made to
develop quantitative models that might describe
these interactions more comprehensively. Thus, our
understanding of the mechanisms driving protein–
RNA recognition is still largely descriptive [11].
Recent work on protein–DNA interactions has
shown that quantitative models of protein–nucleic
acid recognition can provide insight into the mecha-
nisms of gene regulation [58,59], and, in the not too
distant future, promise to allow the rational design
of DNA-binding proteins with altered specificity [60].
The development of computational tools capable of
predicting the specificity of RNA-binding proteins
across entire families (such as the RRM superfam-
ily), or of redesigning the specificity of these pro-
teins, would be of equal importance in dissecting
post-transcriptional regulatory mechanisms, and in
providing new tools to interrogate gene expression
pathways.
In a previous study, our group demonstrated that a
statistical potential function could be surprisingly accu-
rate when used to predict protein–DNA interactions
from structure [36]; this result was corroborated by a
similar study published concurrently by another group
[37]. Given these results, we hypothesized that the
same approach would be equally successful with protein–RNA
interfaces. Although various statistical
techniques have been used by a number of groups
for the prediction of protein structures, protein–DNA
and protein–ligand interactions [18–35], an all-atom,
distance-dependent potential of this kind had not
previously been applied to protein–RNA interactions.

In the present study, we describe the successful
application of the distance-dependent, all-atom statis-
tical potential function to the prediction of the ener-
getics and recognition specificity of protein–RNA
interactions. We demonstrate that the statistical
potential can recapitulate experimentally determined
relative binding constants for a number of protein–
RNA complexes (with the caveat that it cannot yet
capture the effect of mutations on RNA–RNA inter-
actions). We also demonstrate that this simple tech-
nique is remarkably successful at predicting the
cognate recognition sequences of a wide variety of
RNA-binding proteins.
The challenge of near native decoy
discrimination
The statistical potential performs very well in classical
decoy discrimination tests. It is quite remarkable
that similar Z-scores in tests of decoy discrimination
are obtained for the statistical score and the
ROSETTA-derived score because this second method
contains many more adjustable parameters that are
optimized to reproduce the average composition of
these interfaces as observed in nature. By contrast,
the current statistical potential was generated ‘as is’
from the observed frequency of intermolecular contacts
in the database of protein–RNA structures.
Thus, it appears that the distance-dependent statistical
potential implicitly captures at least some of
the complexities of these intermolecular interactions
that are explicitly enumerated in physical energy
functions.

Fig. 5. Correlation between scores generated by the distance-dependent statistical potential and experimental binding free energies (log Kd) for ribose-to-deoxyribose mutants of a universally conserved protein component of the SRP.

The question of how to generate and discriminate
near-native decoys is still an open challenge for many
areas of computational structural biology [61,62]. The
docking decoy set used here contains many near-native
decoys (e.g. < 1 Å rmsd) that can be discriminated by
the distance-dependent potential (Fig. 1). However,
when testing against the exceptionally near-native
decoys generated by extracting snapshots from MD
simulations (Table 2), we found that these near-native
decoys could not be reliably discriminated from native
structures, not even by AMBER, which was used to conduct
the MD simulations. Thus, the question of how
to create a potential that is sensitive to the extremely
subtle structural variations present in very near-native
decoys remains a challenging and important area of
research. We are hopeful that the incorporation of
terms describing the higher-order geometric preferences
of protein–RNA interfaces (e.g. the incorporation of a
directional hydrogen-bonding potential) [17] may
enhance the discriminatory power of our method, as
will the inevitable increase in high-resolution structural
data available for training. Nevertheless, the distance-dependent
potential function already performs on par
with the AMBER and ROSETTA force fields in decoy
discrimination tests.
The impact of contact distance cut-off on
discriminatory power
The contact distance cut-offs used in the present
study were varied to determine the value that maxi-
mizes decoy discrimination performance for protein–
RNA complexes. Previously, Robertson and Varani [36]
showed that shorter contact cut-offs result in optimal
discrimination ability in protein–DNA complexes,
whereas Samudrala and Moult [21] found that a longer
cut-off (> 10 Å) was better able to discriminate correct
structures during protein structure prediction
experiments. Finally, Lu and Skolnick [23] demonstrated that
the first coordination shell (i.e. a cut-off between
3.5 Å and 6.5 Å) achieves the greatest selectivity for
protein decoys created using gapless threading pro-
cedures; thus, the question remains as to the best
choice of contact cut-off.
To evaluate the influence of different cut-off values
in our study, replicate experiments were conducted
using 6 Å, 10 Å and 12 Å distance cut-offs. In nearly
all of our tests, the use of a shorter contact cut-off
(6 Å) results in greater selectivity for structural details
of the interface (Table 1). For the prediction of
mutation energies, however, a longer cut-off appears
to outperform shorter cut-off values for some sets of
mutation data (Table 3). Some of these mutations are
not near the protein–RNA interface (e.g. one of the
U1A mutations, D79V, is 9 Å from the RNA molecule),
and only the use of a longer cut-off value can
capture these effects. In light of the differing conclu-
sions of previous research [21,23,36], these results
imply that a ‘one size fits all’ approach to energy
function design may be limiting. In other words, it
may be possible to significantly improve potential
functions by customizing their parameterization to
particular problems.
Prediction of RNA recognition sequences from
protein–RNA complex structures

An obvious but yet to be attempted application of
any potential function for protein–RNA interactions
is the prediction of cognate binding sequences. In a
test of sequence recognition for 29 unique KH and
RRM domains, we found that the potential is able to
identify (within the 10th percentile) the cognate RNA
recognition motifs of these domains approximately
70% of the time. As not all RRM/KH domains (for
example, U1A) obey the simple four-nucleotide recognition
model that we have introduced (where each
nucleotide makes an independent interaction with the
protein) [6], and the specificity of some proteins is
limited (i.e. they bind nearly equally well to a set of
related sequences), this is a remarkably strong result.
Despite the simple form of the statistical potential,
and the over-simplifications of the four-nucleotide
recognition model, this method is surprisingly robust
over the diverse set of RNA-binding domains that we
have considered.
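The ranking criterion used in this test can be sketched in Python as follows. The sketch assumes that lower scores are more favorable and interprets 'within the 10th percentile' as the cognate sequence ranking among the best 10% of the 256 candidate sequences; the function `cognate_rank` and the toy scores are hypothetical.

```python
from itertools import product

def cognate_rank(scores, cognate):
    """Return (rank, percentile) of the cognate sequence; rank 1 is the best score."""
    ordered = sorted(scores, key=scores.get)  # ascending: lowest score first
    rank = ordered.index(cognate) + 1
    return rank, 100.0 * rank / len(ordered)

# Hypothetical scores for all 4**4 = 256 four-nucleotide sequences.
sequences = ["".join(p) for p in product("ACGU", repeat=4)]
scores = {seq: float(i) for i, seq in enumerate(sequences)}

rank, percentile = cognate_rank(scores, sequences[9])
in_top_decile = percentile <= 10.0
```

With 256 candidates, a cognate sequence must rank 25th or better to satisfy this criterion.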
Prediction of relative protein–RNA binding
energies
When we evaluated the relative free energy of a set
of mutations for several protein–RNA complexes of
known structure, the distance-dependent potential was
successful within defined structural classes. We
observed strong, statistically significant (P ≤ 0.05)
score–energy correlations for several sets of mutations
that we tested; however, to achieve these results, it was
necessary to subdivide several of the mutation data
sets. For example, for the MS2 complex, the mutation
data had to be divided into two classes based on the
presence or absence of a cytosine at position −5 in the
RNA. A likely explanation for the importance of
the −5 cytosine mutation is offered by the observation
that the amino group of the cytosine at position −5
makes an intramolecular hydrogen bond that increases
the propensity of the free RNA to adopt the structure
seen in the complex [48] (Fig. 3C). Because the dis-
tance potential currently measures only intermolecular
interactions, it is unable to capture the thermodynamic
effect of interactions within the RNA or protein, and
of mutation-induced changes in RNA (or protein)
structure. The good correlations of the distance potential
with experimental binding energies (i.e. when sequence
mutations are grouped according to the base identity
at position −5) strongly suggest that the potential
captures the energetic contributions of intermolecular
interactions well.
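The grouping described above (correlating scores with binding energies separately within each class of mutants) can be sketched in Python as follows. This is a minimal illustration that computes only the Pearson correlation coefficient and not its significance; the function names, the class labels and the numerical values are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlate_by_class(records):
    """records: iterable of (class_label, score, energy); returns {label: r}."""
    groups = {}
    for label, score, energy in records:
        xs, ys = groups.setdefault(label, ([], []))
        xs.append(score)
        ys.append(energy)
    return {label: pearson_r(xs, ys) for label, (xs, ys) in groups.items()}

# Hypothetical mutants, labeled by the base at position -5.
records = [
    ("C", 1.0, 1.0), ("C", 2.0, 2.0), ("C", 3.0, 3.5),
    ("other", 1.0, 5.0), ("other", 2.0, 6.0), ("other", 3.0, 7.0),
]
r_by_class = correlate_by_class(records)
```

In this toy data set each class correlates well on its own, whereas pooling the two classes would mix two different energy offsets, which is the situation that the subdivision is meant to avoid.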
The same limitations observed in the MS2 mutation
data led to the failures in prediction for RNA mutations
in the Fox-1 and SRP complexes. In the structure of the
Fox-1 complex, nucleotide U1 interacts with C3 by
forming an intramolecular hydrogen bond, whereas G2
and A4 form a non-Watson–Crick base pair [49]
(Fig. 4). Four out of seven Fox-1 RNA mutations that
were tested directly affect these intramolecular interac-
tions, which are not evaluated by the statistical potential
used in the present study. In the case of the RNA mutations
to the SRP complex, the mutated RNA residues
are located in a double-stranded region of RNA, and do
not interact with the protein [54], yet the disruption of
the helix clearly affects the binding energy. The effect of
these changes in RNA conformation cannot be captured
by the intermolecular potential function used here.
Given these observations, it is reasonable to con-
clude that the omission of protein intramolecular con-
tacts might also limit the predictive power of the
method. However, additional examples will need to be
examined before definite conclusions can be made con-
cerning the applications of statistical potentials to pre-
diction of relative binding energies.
The effect of training set composition
on potential function performance
All knowledge-based potentials face the possibility of
unintentional bias or over-training because their train-
ing depends upon the selection of a representative sam-
ple of structures. If great care is not exercised to
ensure that this training set is unbiased (i.e. structur-
ally heterogeneous), it is possible to create a statistical
potential that unfairly scores certain structures more
favorably than others simply because they are over-
represented in the training set.
The challenge of over-fitting is particularly acute for
protein–RNA interactions because there are relatively
few high-resolution structures of protein–RNA complexes.
Because of this limitation, a combined training/test
set was used in the present study. To avoid
bias, a ‘leave one out’ cross-validation strategy was
employed: the tested structure was always excluded
from the training set. Thus, every test in the present
study was conducted with a different score, trained
using only those structures that were not
homologous to the tested protein–RNA complex.
This strategy cannot be avoided at the present time,
yet it leads to situations where the training data does
not contain enough information to capture particular
structural phenomena. For example, we observed vir-
tually no correlation between the distance-dependent
score and the experimental binding affinity for muta-
tions of U1A protein until the U1A complex structure
was added to the training set (Table 3). Addition of
the homologous U2B″ complex structure (PDB code:
1A9N) to the training set improved these results con-
siderably, indicating that the training set was missing
critical structural information that would help to dis-
criminate native-like contacts unique to the U1A com-
plex (an unusually high-affinity RRM, with a long,
seven-nucleotide recognition sequence) [52]. We antici-
pate that the performance of the method will improve
with the size of the structural database, as more high-
resolution protein–RNA structures become available.
Conclusions
We have introduced a statistical potential function that
discriminates the structures of native protein–RNA
complexes from decoys, reproduces experimentally
determined relative binding affinities for a number of
RNA-binding proteins, and predicts cognate binding
sequences for a large set of protein–RNA complexes.
The statistical potential performs as well as highly
optimized physical potential functions in tests of
docking decoy discrimination. We anticipate that the
performance of the potential will only increase with
the size of the structural database and as terms are
added to the model to account for protein and RNA
intramolecular interactions that are currently ignored.
Nevertheless, even in its current implementation, this
statistical model achieves a high degree of sensitivity to
subtle changes in protein–RNA interface structure. We
are optimistic that this knowledge-based potential
function will find broad application to problems
requiring the high-resolution modeling of protein–
RNA interfaces, such as structure-based genome anno-
tation, or the rational design of novel RNA-binding
proteins.
Experimental procedures
All-atom distance potential
The potential function used here is identical to a previ-
ously described method [36] (a more complete description
of the method is provided in supplementary Doc. S1), with
the exception of a modified low-count correction. In the
present study, the correction described by Sippl [20] is
replaced with a weighted pseudocount method, where a
constant number of pseudocounts (P) are added to the
observed counts for each atom pair. A pseudocount correction
value of 75 ensured the greatest performance of
the potential, as indicated by the average Z-scores [63]
(see supplementary Table S3). These pseudocounts are
allocated over distance bins in proportion to the background
frequency $f(d_{ij})$ values, as calculated using Eqn (4)
from a previous study [36], leading to an updated expression
for $f(d_{ij}, t_i, t_j)$:

$$
f(d_{ij}, t_i, t_j)_{\mathrm{adj}} = \frac{N_{\mathrm{obs}}(d_{ij}, t_i, t_j) + P \cdot f(d_{ij})}{\sum_{d_{ij}} N_{\mathrm{obs}}(d_{ij}, t_i, t_j) + P}
$$

Here, $d_{ij}$ is the Cartesian distance observed between two
atoms $(i, j)$ of types $t_i$ and $t_j$, and $N_{\mathrm{obs}}(d_{ij}, t_i, t_j)$ represents the
number of atoms of types $t_i$ and $t_j$ observed in the structure
training set, separated by a distance of at least $d_{ij}$.
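The pseudocount correction can be illustrated with a minimal Python sketch for a single atom-type pair. It assumes that the observed counts and the background frequencies are supplied as per-bin lists and that the background frequencies sum to one; the function name `adjusted_frequencies` and the toy numbers are hypothetical, and only the pseudocount value of 75 is taken from the text.

```python
def adjusted_frequencies(n_obs, background, pseudocounts=75.0):
    """Pseudocount-adjusted frequency per distance bin for one atom-type pair.

    n_obs      -- observed counts in each distance bin
    background -- background frequency f(d) of each bin (assumed to sum to 1)
    """
    if len(n_obs) != len(background):
        raise ValueError("n_obs and background must have the same length")
    total = sum(n_obs) + pseudocounts
    return [(n + pseudocounts * f) / total for n, f in zip(n_obs, background)]

# Hypothetical three-bin example.
background = [0.2, 0.3, 0.5]
sparse = adjusted_frequencies([0, 5, 20], background)
empty = adjusted_frequencies([0, 0, 0], background)
```

Two properties follow directly from the expression: the adjusted frequencies still sum to one, and an atom pair with no observations reverts to the background distribution, so that rarely observed pairs do not produce extreme scores.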
As a control, we also tested a simple contact-counting
method, wherein every contact between protein and RNA
(within a given distance cut-off) was assigned the same score
of −1.
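A contact-counting control of this kind can be sketched in a few lines of Python. The sketch assumes that atoms are given as (x, y, z) coordinate tuples in Å and that a contact is any protein–RNA atom pair at or below the cut-off; the function name and the coordinates are hypothetical.

```python
from math import dist

def contact_score(protein_atoms, rna_atoms, cutoff=6.0):
    """Sum a score of -1 for every protein-RNA atom pair within the cut-off (Å)."""
    contacts = sum(
        1 for p in protein_atoms for r in rna_atoms if dist(p, r) <= cutoff
    )
    return -contacts

# Hypothetical coordinates for illustration only.
protein_atoms = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
rna_atoms = [(3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]

score_6 = contact_score(protein_atoms, rna_atoms, cutoff=6.0)
score_10 = contact_score(protein_atoms, rna_atoms, cutoff=10.0)
```

Because every contact contributes equally, this control carries no information about atom types or distances beyond the cut-off itself, which is what makes it a useful baseline for the distance-dependent potential.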
Atom type selection
Atom score types were assigned using the method of Robertson
and Varani [36]. Briefly, the all-atom potential treats
every atom, in every residue, as a unique type (e.g. alanine
Cβ and arginine Cβ are considered as unique atom
types under this scheme), resulting in a total of 158 protein
and 81 RNA atom types. Using a 10 Å cut-off, there are
a total of 1 639 295 counts; with this representation, they are
distributed over 158 × 81 × 8 bins, for an average of approximately
16 counts in each bin. When using a reduced atom representation,
chemically similar atoms were grouped together
based on the CHARMM atom definition, as previously
described [30,36].
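The bin arithmetic quoted above can be checked directly. In the short Python sketch below the numbers are those given in the text, whereas the variable names are illustrative.

```python
protein_types, rna_types, distance_bins = 158, 81, 8
total_counts = 1_639_295

# Number of (protein type, RNA type, distance bin) combinations.
n_bins = protein_types * rna_types * distance_bins

# Average number of observations available to estimate each bin.
mean_counts_per_bin = total_counts / n_bins
print(n_bins, round(mean_counts_per_bin, 2))
```

The low average occupancy per bin is the reason why a low-count correction is needed for the all-atom representation.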
Selection of protein–RNA training set
The training set contains crystal structures of protein–RNA
complexes downloaded from the PDB [64] with resolution
better than 2.5 Å. Structures with more than 20% sequence
identity were identified using the expasy sequence-redundancy
tool [65]; the higher-resolution structure of every
homologous pair was retained. After filtering, the training
set contained 72 protein–RNA complexes (the 50S ribo-
some structure comprises 28 individual peptide chains in
complex with RNA, plus 44 independent protein–RNA
crystal structures).
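The filtering criteria can be illustrated with a simple greedy procedure in Python. This is only a sketch of one way to apply the two thresholds, and it is not the procedure implemented by the redundancy tool cited above; the function name, the structure identifiers and the pairwise identity table are hypothetical.

```python
def filter_training_set(structures, identity, max_resolution=2.5, max_identity=0.20):
    """Greedy non-redundant selection that favors higher-resolution structures.

    structures -- {structure_id: resolution in Å}
    identity   -- {frozenset({id_a, id_b}): fractional sequence identity};
                  pairs missing from the table are treated as unrelated
    """
    kept = []
    # Visit the best-resolved structures first so that they win over homologs.
    for sid in sorted(structures, key=structures.get):
        if structures[sid] >= max_resolution:
            continue  # resolution must be better than the threshold
        if all(identity.get(frozenset((sid, k)), 0.0) <= max_identity for k in kept):
            kept.append(sid)
    return kept

# Hypothetical identifiers and values for illustration only.
structures = {"s1": 1.8, "s2": 2.2, "s3": 2.4, "s4": 3.0}
identity = {frozenset(("s1", "s2")): 0.90}
selected = filter_training_set(structures, identity)
```

In this toy example s2 is discarded as a lower-resolution homolog of s1, and s4 is discarded on resolution alone.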
Because of the limited number of protein–RNA structures,
it was necessary to use a combined training/test set.
Thus, to assess the performance of the potential without
biasing the result, the native structure of scored complexes
was excluded from the training set for the score (‘leave one
out’ cross validation).
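The ‘leave one out’ procedure can be sketched generically in Python. The callables `train`, `score` and `is_homolog` are hypothetical placeholders for the derivation of the potential, the scoring of a complex and the homology test, respectively; none of them corresponds to code from the present study.

```python
def leave_one_out(complexes, train, score, is_homolog=lambda a, b: False):
    """Score every complex with a potential trained without it (or its homologs)."""
    results = {}
    for held_out in complexes:
        training_set = [
            c for c in complexes
            if c != held_out and not is_homolog(c, held_out)
        ]
        potential = train(training_set)
        results[held_out] = score(potential, held_out)
    return results

# Toy stand-ins: the "potential" is just the set of training identifiers, and
# the "score" reports whether the held-out complex leaked into the training set.
complexes = ["c1", "c2", "c3"]
leaks = leave_one_out(complexes, train=set, score=lambda pot, c: c in pot)
```

The cost of this design is that a separate potential must be derived for every tested complex, which is acceptable here because the training set is small.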
Construction of test sets
Five high quality docking decoy sets [17] were used for ini-
tial decoy-discrimination tests. These were constructed
using the docking module of rosetta, which incorporates
energy minimization through the use of a protein side-chain
repacking algorithm [66,67]. Each of these decoy sets contains
2000 structures with deviations as low as 0.2 Å rmsd
from the native complex structure.
Additionally, large numbers of near-native decoys were
generated for 21 different protein–RNA complexes by
extracting time-step structures from MD trajectories, created
using amber 8 in a deformation-like process with the
ff99 force field [68]. These MD-generated decoys are especially
near-native structures; the maximum decoy rmsd for
the 21 sets is below 4 Å, and only seven decoy sets have a
maximum rmsd greater than 3 Å.
To generate these decoy sets, the initial structure of each
native complex was first minimized in 500 steps (250 steps
of steepest-descent and 250 steps of conjugate-gradient minimization),
then heated from 0 to 400 K in 20 ps using a
Langevin dynamics algorithm [69,70]. Snapshots were taken
every 0.05 ps, and a total of 400 structures were extracted
from each MD simulation. The binding free energy was calculated
using the mm_gbsa module of amber 8 as:

$$
\Delta G_{\mathrm{bind}} = G_{\mathrm{complex}} - (G_{\mathrm{protein}} + G_{\mathrm{RNA}})
$$

where $G_{\mathrm{complex}}$, $G_{\mathrm{protein}}$ and $G_{\mathrm{RNA}}$ represent the mm_gbsa-calculated
free energies of the protein–RNA complex, the
free protein and the free RNA, respectively.
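The binding free energy expression and the snapshot count implied by the protocol can be written out in a short Python sketch. The function name and the energy values are hypothetical; only the 20 ps duration and the 0.05 ps sampling interval are taken from the text.

```python
def binding_free_energy(g_complex, g_protein, g_rna):
    """Delta G_bind = G_complex - (G_protein + G_RNA); all terms in the same units."""
    return g_complex - (g_protein + g_rna)

# Hypothetical free energies (kcal/mol) for illustration only.
dg_bind = binding_free_energy(-1500.0, -900.0, -550.0)

# Snapshot count implied by the protocol: 20 ps sampled every 0.05 ps.
n_snapshots = round(20.0 / 0.05)
```

The snapshot count agrees with the 400 structures extracted from each simulation.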
Prediction of sequence-specificity
As many RNA-binding domains of the RRM superfamily
interact in a conserved fashion with four nucleotides across
the surface of the β-sheet [6,39,40], and recognition by KH
domains appears to be conserved between different
domains as well [6,41–44], we adopted a four-nucleotide
model for our sequence-specificity test. Starting with each
complex in the training set containing one or more RRM
or KH domains, we extracted the protein coordinates and
the four nucleotides bound at the center of the domain (for
structures containing more than one RNA-binding domain,
the structure was divided into two independent domains).
This approach was chosen even though we are well aware
that this simple model of RRM recognition would fail
for domains that bind anomalously (e.g. U1A protein), or
in situations where two domains cooperatively define
specificity.
For all chosen domains, every nucleotide was replaced
with A, U, C and G, systematically, in all possible combi-
nations, using insight ii 2000 (Accelrys Software Inc., San
Diego, CA, USA). Thus, 256 different structures were generated
for each binding domain, and minimized using
amber 8 [68] in 20 steps to regularize the local structure.

Some RRM and KH domains in complex with single-stranded
DNA (PDB codes: 2UP1, 1WTB, 1X0F, 1ZZI and
1ZZJ) were also included in the test set because recognition
of single-stranded RNA and DNA is mechanistically similar.
Protein mutations were modeled using moe (Chemical
Computing Group, Montreal, Canada), followed by energy
minimization with amber; the conformation of the mutated
residue with side chain conformation most similar to the
native residue was retained.
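The systematic substitution of the four bound nucleotides described above yields 4^4 = 256 sequence variants per domain. The enumeration itself can be sketched in Python as follows; the sketch covers only the generation of the sequences, not the structural modeling or minimization, and the function name is hypothetical.

```python
from itertools import product

def enumerate_sequences(length=4, alphabet="AUCG"):
    """All RNA sequences of the given length over the given alphabet."""
    return ["".join(bases) for bases in product(alphabet, repeat=length)]

sequences = enumerate_sequences()
print(len(sequences))
```

Each enumerated sequence corresponds to one modeled structure in the sequence decoy set for a given domain.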
Acknowledgements
We wish to thank Dr Yu Chen for providing the pro-
tein–RNA decoy sets and Mr Daniel Bjerre for many
valuable discussions. The study was supported by
grants from NIH.
References
1 Amosova O, Broitman SL & Fresco JR (2003) Alanine-
scanning mutagenesis of the predicted rRNA-binding
domain of ErmC′ redefines the substrate-binding site
and suggests a model for protein–RNA interactions.
Nucleic Acids Res 31, 4941–4949.
2 Law MJ, Rice AJ, Lin P & Laird-Offringa IA (2006)
The role of RNA structure in the interaction of U1A
protein with U1 hairpin II RNA. RNA 12, 1168–1178.
3 Xia T, Wan C, Roberts RW & Zewail AH (2005)
RNA–protein recognition: single-residue ultrafast
dynamical control of structural specificity and function.
PNAS 102, 13013–13018.
4 White SA, Hoeger M, Schweppe JJ, Shillingford A,
Shipilov V & Zarutskie J (2004) Internal loop mutations
in the ribosomal protein L30 binding site of the yeast
L30 RNA transcript. RNA 10, 369–377.
5 Allers J & Shamoo Y (2001) Structure-based analysis of
protein–RNA interactions using the program ENTAN-
GLE. J Mol Biol 311, 75–86.
6 Auweter SD, Oberstrass FC & Allain FHT (2006)
Sequence-specific binding of single-stranded RNA: is
there a code for recognition? Nucleic Acids Res 17,
4943–4959.
7 Chen Y & Varani G (2005) Protein families and RNA
recognition. FEBS J 272, 2088–2097.
8 Draper DE (1995) Protein–RNA recognition. Annu Rev
Biochem 64, 593–620.
9 Guzman RND, Turner RB & Summers MF (1998) Pro-
tein–RNA recognition. Biopolymers (Nucleic Acid Sci)
48, 181–195.
10 Jones S, Daley DTA, Luscombe NM, Berman HM &
Thornton JM (2001) Protein–RNA interactions: a struc-
tural analysis. Nucleic Acids Res 29, 943–954.
11 Messias AC & Sattler M (2004) Structural basis of
single-stranded RNA recognition. Acc Chem Res 37,
279–287.
12 Treger M & Westhof E (2001) Statistical analysis of
atomic contacts at RNA–protein interfaces. J Mol
Recognit 14, 199–214.
13 Nadassy K, Wodak SJ & Janin J (1999) Structural
features of protein–nucleic acid recognition sites.
Biochemistry 38, 1999–2017.
14 Perez-Canadillas J-M & Varani G (2001) Recent
advances in RNA–protein recognition. Curr Opin Struct
Biol 11, 53–58.

15 Stefl R, Skrisovska L & Allain FH-T (2005) RNA
sequence- and shape-dependent recognition by pro-
teins in the ribonucleoprotein particle. EMBO Rep 6,
33–38.
16 Frankel AD (2000) Fitting peptides into the RNA
world. Curr Opin Struct Biol 10, 332–340.
17 Chen Y, Kortemme T, Robertson T, Baker D & Varani
G (2004) A new hydrogen-bonding potential for the
design of protein–RNA interactions predicts specific
contacts and discriminates decoys. Nucleic Acids Res 32,
5147–5162.
18 Sippl M, Ortner M, Jaritz M, Lackner P & Flöckner H
(1996) Helmholtz free energies of atom pair interactions
in proteins. Fold Des 1, 289–298.
19 Sippl M (1993) Boltzmann’s principle, knowledge-based
mean fields and protein folding. An approach to the
computational determination of protein structures.
J Comput Aided Mol Des 7, 473–501.
20 Sippl MJ (1990) Calculation of conformational ensem-
bles from potentials of mean force: an approach to the
knowledge-based prediction of local structures in globu-
lar proteins. J Mol Biol 213, 859–883.
21 Samudrala R & Moult J (1998) An all-atom distance-
dependent conditional probability discriminatory func-
tion for protein structure prediction. J Mol Biol 275,
895–916.
22 Skolnick J, Kolinski A & Ortiz A (2000) Derivation of
protein-specific pair potentials based on weak sequence
fragment similarity. Proteins: Struct Funct Genet 38,
3–16.
23 Lu H & Skolnick J (2001) A distance-dependent atomic
knowledge-based potential for improved protein struc-
ture selection. Proteins: Struct Funct Genet 44, 223–232.
24 Zhou H & Zhou Y (2002) Distance-scaled, finite ideal-
gas reference state improves structure-derived potentials
of mean force for structure selection and stability pre-
diction. Protein Sci 11, 2714–2726.
25 Zhang C, Liu S, Zhou H & Zhou Y (2004) An accurate,
residue-level, pair potential of mean force for folding
and binding based on the distance-scaled, ideal-gas
reference state. Protein Sci 13, 400–411.
26 Skolnick J (2006) In quest of an empirical potential for
protein structure prediction. Curr Opin Struct Biol 16,
166–171.
27 Weichenberger CX & Sippl MJ (2006) Self-consistent
assignment of asparagine and glutamine amide rotamers
in protein crystal structures. Structure 14, 967–972.
28 Jiang L, Gao Y, Mao F, Liu Z & Lai L (2002) Potential
of mean force for protein–protein interaction studies.
Proteins: Struct Funct Genet 46, 190–196.
29 Lu H, Lu L & Skolnick J (2003) Development of uni-
fied statistical potentials describing protein–protein
interactions. Biophys J 84, 1895–1901.
30 Zhang C, Liu S, Zhu Q & Zhou Y (2005) A knowledge-
based energy function for protein–ligand, protein–protein,
and protein–DNA complexes. J Med Chem 48,
2325–2335.
31 Ishchenko AV & Shakhnovich EI (2002) SMall Mole-
cule Growth 2001 (SMoG2001): an improved knowl-
edge-based scoring function for protein–ligand
interactions. J Med Chem 45, 2770–2780.
32 Velec HFG, Gohlke H & Klebe G (2005) DrugScore(CSD)-knowledge-based
scoring function derived
from small molecule crystal data with superior recogni-
tion rate of near-native ligand poses and better affinity
prediction. J Med Chem 48, 6296–6303.
33 DeWitte RS & Shakhnovich EI (1996) SMoG: de novo
design method based on simple, fast, and accurate free
energy estimates. 1. Methodology and supporting evi-
dence. J Am Chem Soc 118, 11733–11744.
34 Liu Z, Mao F, Guo J-T, Yan B, Wang P, Qu Y & Xu
Y (2005) Quantitative evaluation of protein–DNA inter-
actions using an optimized knowledge-based potential.
Nucleic Acids Res 33, 546–558.
35 Kono H & Sarai A (1999) Structure-based prediction of
DNA target sites by regulatory proteins. Proteins:
Struct Funct Genet 35, 114–131.
36 Robertson TA & Varani G (2007) An all-atom, dis-
tance-dependent scoring function for the prediction of
protein–DNA interactions from structure. Proteins:
Struct Funct Bioinform 66, 359–374.
37 Donald JE, Chen WW & Shakhnovich EI (2007) Ener-
getics of protein–DNA interactions. Nucleic Acids Res
35, 1039–1047.

38 Hendlich M, Lackner P, Weitckus S, Floeckner H,
Froschauer R, Gottsbacher K, Casari G & Sippl MJ
(1990) Identification of native protein folds amongst a
large number of incorrect models: the calculation of low
energy conformations from potentials of mean force.
J Mol Biol 216, 167–180.
39 Wang X & Tanaka Hall TM (2001) Structural basis for
recognition of AU-rich element RNA by the HuD pro-
tein. Nat Struct Mol Biol 8, 141–145.
40 Maris C, Dominguez C & Allain FHT (2005) The RNA
recognition motif, a plastic RNA-binding platform to
regulate post-transcriptional gene expression. FEBS
J 272, 2118–2131.
41 Beuth B, Pennell S, Arnvig KB, Martin SR &
Taylor IA (2005) Structure of a Mycobacterium
tuberculosis NusA–RNA complex. EMBO J 24, 3576–
3587.
42 Lewis HA, Musunuru K, Jensen KB, Edo C, Chen H,
Darnell RB & Burley SK (2000) Sequence-specific RNA
binding by a Nova KH domain: implications for para-
neoplastic disease and the fragile X syndrome. Cell 100,
323–332.
43 Siomi H, Matunis MJ, Michael WM & Dreyfuss G
(1993) The pre-mRNA binding K protein contains a
novel evolutionary conserved motif. Nucleic Acids Res
21, 1193–1198.
44 Grishin NV (2001) KH domain: one motif, two folds.
Nucleic Acids Res 29, 638–643.
45 Tsai DE, Harper DS & Keene JD (1991) U1-snRNP-A
protein selects a ten nucleotide consensus sequence from
a degenerate RNA pool presented in various structural
contexts. Nucleic Acids Res 19, 4931–4936.
46 Lunde BM, Moore C & Varani G (2007) RNA-binding
proteins: modular design for efficient function. Nat Rev
Mol Cell Biol 8, 479–490.
47 Valegard K, Murray JB, Stonehouse NJ, van den Worm
S, Stockley PG & Liljas L (1997) The three-dimensional
structures of two complexes between recombinant MS2
capsids and RNA operator fragments reveal sequence-
specific protein–RNA interactions. J Mol Biol 270, 724–
738.
48 Johansson HE, Dertinger D, LeCuyer KA, Behlen
LS, Greef CH & Uhlenbeck OC (1998) A thermody-
namic analysis of the sequence-specific binding of
RNA by bacteriophage MS2 coat protein. PNAS 95,
9244–9249.
49 Auweter SD, Fasan R, Reymond L, Underwood JG,
Black DL, Pitsch S & Allain FH-T (2006) Molecular
basis of RNA recognition by the human alternative
splicing factor Fox-1. EMBO J 25, 163–173.
50 Oubridge C, Ito N, Evans PR, Teo CH & Nagai K
(1994) Crystal structure at 1.92 Å resolution of the
RNA-binding domain of the U1A spliceosomal
protein complexed with an RNA hairpin. Nature 372,
432–438.
51 Allain FHT, Gubser CC, Howe PWA, Nagai K,
Neuhaus D & Varani G (1996) Specificity of ribonucleo-
protein interaction determined by RNA folding during
complex formation. Nature 380, 646–650.
52 Timm H, Jessen Oubridge C, Teo CH, Pritchard C &
Nagai K (1991) Identification of molecular contacts
between the U1A small nuclear ribonucleoprotein and
U1 RNA. EMBO J 10, 3447–3456.
53 Gubser CC & Varani G (1996) Structure of the poly-
adenylation regulatory element of the human U1A
pre-mRNA-3′-untranslated region and interaction with
the U1A protein. Biochemistry 35, 2253–2267.
54 Batey RT, Sagar MB & Doudna JA (2001) Structural
and energetic analysis of RNA recognition by a univer-
sally conserved protein from the signal recognition par-
ticle. J Mol Biol 307, 229–246.
55 Siomi H & Dreyfuss G (1997) RNA-binding proteins as
regulators of gene expression. Curr Opin Genet Dev 7,
345–353.
56 Onesto C, Berra E, Grepin R & Pages G (2004)
Poly(A)-binding protein-interacting protein 2, a strong
regulator of vascular endothelial growth factor mRNA.
J Biol Chem 279, 34217–34226.
57 Kinnaird JH, Maitland K, Walker GA, Wheatley I,
Thompson FJ & Devaney E (2004) HRP-2, a heteroge-
neous nuclear ribonucleoprotein, is essential for embryo-
genesis and oogenesis in Caenorhabditis elegans. Exp
Cell Res 298, 418–430.
58 Havranek JJ, Duarte CM & Baker D (2004) A simple
physical model for the prediction and design of protein–
DNA interactions. J Mol Biol 344, 59–70.
59 Morii T, Sato S, Hagihara M, Mori Y, Imoto K &
Makino K (2002) Structure-based design of a leucine
zipper protein with new DNA contacting region.
Biochemistry 41, 2177–2183.
60 Ashworth J, Havranek JJ, Duarte CM, Sussman D,
Monnat RJ, Stoddard BL & Baker D (2006) Computa-
tional redesign of endonuclease DNA binding and
cleavage specificity. Nature 441, 656–659.
61 Gray JJ (2006) High-resolution protein–protein docking.
Curr Opin Struct Biol 16, 183–193.
62 Wang K, Fain B, Levitt M & Samudrala R (2004)
Improved protein structure selection using decoy-
dependent discriminatory functions. BMC Struct Biol
4, 8.
63 Sippl MJ (1993) Recognition of errors in three-dimen-
sional structures of proteins. Proteins: Struct Funct
Genet 17, 355–362.
64 Berman H, Henrick K & Nakamura H (2003) Announc-
ing the worldwide Protein Data Bank. Nat Struct Mol
Biol 10, 980–980.
65 Notredame C. ExPASy sequence-redundancy tool.
Available at />redundancy.cgi.
66 Gray JJ, Moughon SE, Kortemme T, Schueler-
Furman O, Misura KMS, Morozov AV & Baker D
(2003) Protein–protein docking predictions for the
CAPRI experiment. Proteins: Struct Funct Genet 52,
118–122.
67 Gray JJ, Moughon S, Wang C, Schueler-Furman O,
Kuhlman B, Rohl CA & Baker D (2003) Protein–pro-
tein docking with simultaneous optimization of rigid-
body displacement and side-chain conformations. J Mol
Biol 331, 281–299.

68 Case DA, Darden TA, Cheatham TE III, Simmerling
CL, Wang J, Duke RE, Luo R, Merz KM, Wang B,
Pearlman DA et al. (2004) Amber8. University of Cali-
fornia, San Francisco, CA.
69 Izaguirre JA, Catarello DP, Wozniak JM & Skeel RD
(2001) Langevin stabilization of molecular dynamics.
J Chem Phys 114, 2090–2098.
70 Pastor R, Brooks B & Szabo A (1988) An analysis of
the accuracy of Langevin and molecular dynamics algo-
rithms. Mol Physics 65, 1409–1419.
Supplementary material
The following supplementary material is available
online:
Doc. S1. Detailed description of the construction and
derivation of the distance-dependent potential.
Table S1. The complex structure (PDB code) in pro-
tein–RNA training set.
Table S2. The cognate sequence rank by distance
potential and contact score (cut-off = 6 Å) in RRM/KH
domain sequence decoy (256) sets.
Table S3. Optimization of the pseudocounts in decoy
sets discrimination.
Table S4. Comparison of the Z-scores obtained on the
same set of protein–RNA decoys for the RNA, DNA
and reduced atom distance potentials in decoy discrim-
ination.
Table S5. Comparison of the performance of the
potential with different upper cut-off values in decoy
discrimination.
This material is available as part of the online article
from
Please note: Blackwell Publishing is not responsible
for the content or functionality of any supplementary
materials supplied by the authors. Any queries (other
than missing material) should be directed to the corre-
sponding author for the article.
S. Zheng et al. A knowledge-based potential function
FEBS Journal 274 (2007) 6378–6391 ª 2007 The Authors Journal compilation ª 2007 FEBS 6391
