Identiﬁcation of novel membrane proteins by searching for patterns
in hydropathy proﬁles
John D. Clements and Rowena E. Martin
School of Biochemistry and Molecular Biology, Australian National University, Canberra, Australia
A technique has been developed to search a proteome
database for new members of a functional class of mem-
brane protein. It takes advantage of the highly conserved
secondary structure of functionally related membrane
proteins. Such proteins typically have the same number of
transmembrane domains located at similar relative positions
in their polypeptide sequence. This gives rise to a charac-
teristic pattern of peaks in their hydropathy proﬁles. To
conduct a search, each member of a polypeptide database is
converted to a hydropathy proﬁle, peaks are automatically
detected, and the pattern of peaks is compared with a tem-
plate. A template was designed for the acetylcholine (ACh)
and glycine receptors of the cys-loop receptor superfamily.
The key feature was a closely spaced triplet of hydropathy
peaks bracketed by deep valleys. When applied to the human
proteome the search procedure retrieved 153 proﬁles with a
receptor-like triplet of peaks. The approach was highly
selective with 70% of the retrieved proﬁles annotated as
known or putative receptors. These included ACh, glycine,
γ-aminobutyric acid and serotonin receptors, which are all
related by sequence. However, ionotropic glutamate recep-
tors, which have almost no sequence homology with ACh
receptors, were also retrieved. Thus, the strategy can ﬁnd
members of a functional class that cannot be identiﬁed by
sequence alignment. To demonstrate that the strategy can
easily be extended to other membrane protein families, a
template was developed for the neurotransmitter/Na+ symporter
family, and similar results were obtained. This
approach should prove a useful adjunct to sequence-based
retrieval tools when searching for novel membrane proteins.
Keywords: hydropathy proﬁle; integral membrane protein;
ligand-gated channel; neurotransmitter receptor; proteo-
mics; transporter.
Integral membrane proteins are responsible for the majority
of interactions between a cell and its external environment.
Approximately 20% of the genes in animal, plant, yeast and
bacterial genomes encode integral membrane proteins,
consistent with their fundamental importance to cellular
function [1–3]. Transmembrane α-helices are encoded by a
long stretch of predominantly hydrophobic residues (typically
15–19), which is sufficient to cross the hydrophobic
region of the membrane bilayer (2.5 nm) [4]. The
pronounced compositional bias arises because these residues
must be capable of hydrophobic interactions with the lipid
environment in the interior of the membrane. Most
membrane-associated domains produce an easily identiﬁed
peak in the hydropathy proﬁle of the polypeptide. Standard
software tools are available that can identify the putative
transmembrane domains of a membrane protein based on
its hydropathy proﬁle [5,6]. Sophisticated algorithms that
combine hydropathy and sequence analysis can predict up
to 95% of transmembrane helices [7–12], but simple
hydropathy peak detection strategies are also very effective
[13].
The primary function of most membrane proteins is to
transfer molecules, ions or signals between the exterior and
interior of a cell, or subcellular compartment, and transmembrane
domains provide the physical conduit for the
transfer. Typically, several transmembrane domains com-
bine to form a tightly coupled structure that is intimately
involved in the function of the protein [14]. It follows that
the number and the pattern of transmembrane domains will
be strongly conserved within a functionally related family.
Protein families within which secondary structure is highly
conserved include neurotransmitter receptors, voltage-gated
channels, connexins and transporters (Fig. 1).
The majority of neurotransmitter-activated channels can
be assigned either to the glutamate cationic receptor (iGluR)
superfamily, or the cys-loop receptor superfamily, which
includes acetylcholine (ACh), glycine, γ-aminobutyric acid
(GABA) and serotonin receptors [15]. Channels from both
superfamilies are formed from subunits that have four
membrane-associated domains. These four domains are
organized as a cluster of three closely spaced domains near
the centre of the polypeptide, and a fourth well separated
domain close to the C-terminal end of the polypeptide
(Fig. 1A) [14]. Despite the similarity of their secondary
structure, there is almost no sequence homology between
the two superfamilies.
Neuronal voltage-gated Na+, Ca2+ and K+ channel
families diverged from a common ancestor long ago and
there is very little sequence homology between the families,
yet all three have retained a similar secondary structure.
Correspondence to J. Clements, School of Biochemistry and Molecular
Biology, Australian National University, Canberra, ACT 0200,
Australia. Fax: + 61 26125 0313, Tel.: + 61 26125 3465,
E-mail:
Abbreviations: ACh, acetylcholine; AChR, acetylcholine receptor;
GABA, γ-aminobutyric acid; HU, hydrophobicity unit; LGIC,
ligand-gated ion channel; iGluR, glutamate cationic receptor;
NMDA, N-methyl-D-aspartate; AMPA,
α-amino-3-hydroxy-5-methyl-4-isoxazole propionate; GlyR, glycine
receptor; NSS, neurotransmitter/Na+ symporter.
(Received 31 December 2001, revised 18 February 2002,
accepted 27 February 2002)
Eur. J. Biochem. 269, 2101–2107 (2002) © FEBS 2002 doi:10.1046/j.1432-1033.2002.02859.x
They are formed from four subunits, each containing six
membrane-associated domains (Fig. 1B) [14]. In voltage-gated
Na+ and Ca2+ channels the four subunits are linked
together as a single protein with a series of internal repeats.
In the case of voltage-gated K+ channels the subunits are
expressed as separate proteins, and the channel forms as a
tetramer of these subunits (Fig. 1B) [14].
Two separate families of membrane proteins form gap-
junctions between mammalian cells (connexins), and
between invertebrate cells (innexins). There is negligible
sequence homology between these families, but they share a
similar secondary structure. Subunits of both connexins and
innexins contain four transmembrane domains, and com-
bine to form dodecamers [14,16–19]. In contrast to ligand-
gated channels, the four transmembrane domains of
connexin and innexin are organized into two closely spaced
pairs, which are separated by an intracellular hydrophilic
loop (Fig. 2D). Many other functionally related protein
families have been identiﬁed where secondary structural
features are better conserved than the underlying amino
acid sequences [20,21].
Despite clear evidence for conservation of secondary
structure, little systematic use has been made of structural
information in proteomic analysis. Most genomic software
Fig. 1. Schematic diagram showing that the pattern of transmembrane
domains is conserved within a functional class of membrane protein.
(A) LGICs typically have a closely spaced cluster of three transmem-
brane domains (dark bars) and a fourth well-separated domain. This
secondary structure is conserved across the cys-loop superfamily and
the iGluR superfamily, even though there is no sequence homology
between these families. Selected subunits from both families are shown.
(B) Distantly related voltage-gated channels also exhibit a character-
istic pattern of transmembrane domains. Channels are formed by four
groups of six transmembrane domains. Within each group, the first
five transmembrane domains are closely spaced, with the sixth domain
separated by a relatively long extracellular loop.
Fig. 2. The highly conserved secondary structure of LGICs is reﬂected in
a characteristic pattern of peaks in their hydropathy proﬁles. (A) The
hydropathy proﬁle of the human AChR alpha-1 subunit reveals a
typical cluster of three peaks bracketed by deep valleys. The peak, base
and valley threshold levels used by the search algorithm are shown as
horizontal dashed lines. Peaks located at < 20 residues are likely to be
cleaved signal sequences and are ignored. (B,C) A similar pattern of
peaks and valleys is seen in the profiles of the GABA-A receptor alpha-1
subunit and glutamate receptor GluR1 subunit. (D) A human connexin
subunit also exhibits four hydropathy peaks, but they are
organized in a different pattern. The peaks occur in two pairs separated
by a deep valley.
packages can generate a hydropathy proﬁle from an amino-
acid sequence, but in general they only permit one or a few
proﬁles to be generated at a time. The resulting hydropathy
proﬁles are typically examined by eye for signiﬁcant
features. Efforts have been made to improve and automate
this process. For example, the web-based programs
TMPRED, TMHMM and MEMSTAT identify and count putative
transmembrane helices, and suggest their orientation in the
membrane [7–11]. These programs are effective when
applied to individual amino acid sequences, but no software
tools are available to automatically analyse the pattern of
putative transmembrane domains (secondary structure).
A method for alignment of hydropathy proﬁles has been
developed [20,21], and an experimental web-based server
uses this approach to align pairs of sequences submitted
by the user, or to search a database for hydropathy proﬁles
that match a submitted sequence (Bioinformatics Unit,
Weizmann Institute of Science). At present, it is limited to
the SwissProt database, and to Hopp–Woods, or Kyte–
Doolittle hydrophobicity scales. In principle, this approach
can be used to search for proteins with conserved secondary
structure, but there are technical issues that limit its
performance. For example, a proﬁle with a similar pattern
of peaks, but differently shaped peaks and valleys may be
missed. It is equally sensitive to mismatches in both peak
(transmembrane) and valley (intra- and extracellular loop)
regions, even though evolutionary changes in valley shape
will have relatively little effect on secondary structure.
In this paper we develop and test a new automated
proteome search technique. Every member of a polypeptide
database is converted to a hydropathy proﬁle, hydropathy
peaks are automatically detected, and the pattern of peaks is
compared with a template. Sequences that match the
template are output to a new database, and their proﬁles
are displayed in a convenient format. This approach can be
used to search for new members of a family or functional
class of membrane protein. It can assist with functional
analysis, and may also be useful in proteome database
annotation.
METHODS
An algorithm was developed for searching a large polypep-
tide sequence database for proteins that are likely to be new
members of a functionally related family of membrane
proteins. The program runs on a personal computer, and
the analysis of an organism’s total proteome takes about
1 min. The test is applied to the hydropathy proﬁle of each
sequence. A standard (Kyte–Doolittle) algorithm [5,6] is
used to convert a sequence into a proﬁle. The amino acids
are each assigned a hydropathy value based on experimental
measures, and the resulting proﬁle is ﬁltered to reduce noise.
We chose a set of hydropathy values and a ﬁlter width that
are near-optimal for detection of transmembrane regions
[6]. The ﬁlter function is a rectangular averaging window
(box-car ﬁlter) with a length of 17 amino acid residues. With
these settings, the amplitude of the peak produced by a
transmembrane a-helix is typically in the range 1–3 hydro-
phobicity units (HU) (Fig. 2). For example, the four
transmembrane domains are clearly visible in the hydro-
pathy proﬁles of three different ligand-gated ion channels
(LGICs) (Fig. 2A–C) and the connexin alpha-1 subunit
(Fig. 2D).
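The conversion from sequence to profile can be sketched as follows. This is a minimal illustration, not the AxoGraph implementation: it assumes the published Kyte–Doolittle hydropathy values (the exact values used here follow ref. [6] and may differ), scores unknown residue codes as zero, and uses the hypothetical function name hydropathy_profile.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the paper's values follow ref. [6]).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=17):
    """Return the box-car averaged hydropathy profile of a sequence.

    Each value is the mean over `window` consecutive residues, so the profile
    is window - 1 points shorter than the sequence. Unknown residue codes are
    scored as 0.0 (an assumption of this sketch).
    """
    values = [KD.get(residue, 0.0) for residue in sequence.upper()]
    if len(values) < window:
        return []
    running = sum(values[:window])
    profile = [running / window]
    for i in range(window, len(values)):
        # Slide the window one residue: add the new value, drop the oldest.
        running += values[i] - values[i - window]
        profile.append(running / window)
    return profile
```

Profile index k corresponds to the window centred on residue k + 8 (zero-based) when the window length is 17, which should be kept in mind when peak positions are reported as residue numbers.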
Peak detection
Each polypeptide sequence in a database is subject to a
series of three tests. The ﬁrst test simply rejects the
sequence if it is too short or too long. The range of
acceptable lengths is determined from known members of
the membrane protein family, but this restriction can be
relaxed if necessary. Membrane proteins always have both
hydrophobic and hydrophilic regions, so profiles that do
not cross both an upper and lower threshold are also
rejected. These thresholds are the same as those used for
peak detection (Fig. 2). Next, a simple peak-detection
procedure is applied to each hydropathy proﬁle, resulting
in an estimate of the number and the locations of putative
transmembrane helices. The algorithm identiﬁes a peak
when the proﬁle rises from below a base threshold, crosses
above a peak detection threshold, then crosses back below
both the peak and base thresholds. In Fig. 2, the peak and
base thresholds are indicated with the upper two dashed
lines.
Different threshold settings are used depending on
the target protein. For example, the base threshold selected
for LGICs is higher than for connexins (Figs 2A–D). The
location and amplitude of each peak is measured at the
maximum point between the two peak threshold crossings.
The width of each peak is measured between the two base
threshold crossings. This gives a more consistent result
than measuring the width at the peak threshold level. The
location and amplitude of each valley minimum is also
measured.
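The two-threshold peak detection described above can be sketched as follows. The function name detect_peaks and the dictionary keys are hypothetical, and the sketch assumes that an excursion touching either end of the profile is ignored because it lacks one of the two base crossings. Valley minima are not measured here.

```python
def detect_peaks(profile, peak_threshold, base_threshold):
    """Detect hydropathy peaks using a base and a peak threshold.

    A peak is recorded when the profile rises from below the base threshold,
    exceeds the peak threshold, and then falls back below the base threshold.
    Width is measured between the two base-threshold crossings.
    """
    peaks = []
    start = None  # index of the upward base-threshold crossing, if any
    for i in range(1, len(profile)):
        previous, value = profile[i - 1], profile[i]
        if start is None:
            if previous < base_threshold <= value:
                start = i
        elif value < base_threshold:
            segment = profile[start:i]
            amplitude = max(segment)
            if amplitude > peak_threshold:
                peaks.append({
                    "location": start + segment.index(amplitude),
                    "amplitude": amplitude,
                    "width": i - start,
                })
            start = None
    return peaks
```

For example, with a peak threshold of 1.1 and a base threshold of 0.8, an excursion that reaches 2.0 is reported as a peak, whereas one that only reaches 1.0 is discarded.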
Comparing a proﬁle to a template
After the peaks and valleys are identiﬁed, a test is
performed to determine whether they conform to a
template. The simplest test is to count the peaks and ask
whether this number falls within a speciﬁed range. The
peak count may be adjusted by rejecting narrow peaks, or
by counting a broad peak as two merged peaks. For
example, when the base threshold is set below zero, the
majority of transmembrane regions will produce a peak
that is wider than 10 residues. If the width of a peak is
> 30 residues it is possible that two or more closely spaced
transmembrane regions have produced a single peak in the
hydropathy profile. A peak located within the first 20
residues is likely to be a signal sequence (destined
in most cases to be cleaved from the mature protein), and
can optionally be removed from the peak count (Fig. 2A).
Sometimes a false hydropathy peak is detected at a
location that is not a transmembrane domain, and true
transmembrane peaks are occasionally missed. Thus, when
searching for proteins with four transmembrane domains,
a proﬁle with three to ﬁve peaks would typically be
accepted.
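The peak-count adjustments can be sketched as follows. The function names and the default values are illustrative assumptions taken from the examples in the text (10 residues, 30 residues, first 20 residues, three to five peaks), and would be tuned for each protein family.

```python
def adjusted_peak_count(peaks, min_width=10, merged_width=30, signal_cutoff=20):
    """Count peaks after the optional adjustments described in the text.

    `peaks` is a list of (location, width) tuples, with locations given as
    residue numbers. All default values are illustrative only.
    """
    count = 0
    for location, width in peaks:
        if location < signal_cutoff:
            continue  # probable signal sequence, excluded from the count
        if width < min_width:
            continue  # too narrow to be a transmembrane region
        # A very broad peak is counted as two merged transmembrane regions.
        count += 2 if width > merged_width else 1
    return count


def count_in_range(count, lowest=3, highest=5):
    """Accept a profile whose adjusted peak count falls within the range."""
    return lowest <= count <= highest
```

With these defaults, a profile with peaks at (10, 15), (250, 18), (280, 5), (310, 35) and (450, 20) yields a count of four: the first peak is treated as a signal sequence, the third is too narrow, and the fourth is counted twice.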
If the number of peaks falls within the speciﬁed range,
then more sophisticated template-matching tests can be
applied. For example, the separation between adjacent
peaks (interpeak intervals) can be calculated. A candidate
proﬁle can be rejected if the interpeak intervals fall
outside the speciﬁed ranges. Another strategy is to scan
for a particular feature, such as a closely spaced cluster
of peaks bracketed by deep valleys. A strategy of this
type is developed below for detecting ligand-gated ion
channels.
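The interpeak-interval test can be sketched as follows. The function names and the form of the allowed ranges (one inclusive range per interval) are assumptions of this sketch; the text does not specify how the ranges are stored.

```python
def interpeak_intervals(locations):
    """Return the separations between adjacent peaks, in residues."""
    ordered = sorted(locations)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def intervals_match(locations, allowed_ranges):
    """Check every interpeak interval against its (low, high) range.

    `allowed_ranges` holds one inclusive range per interval, so a profile
    with a different number of peaks is rejected.
    """
    intervals = interpeak_intervals(locations)
    if len(intervals) != len(allowed_ranges):
        return False
    return all(low <= gap <= high
               for gap, (low, high) in zip(intervals, allowed_ranges))
```

For a four-peak profile with peaks at residues 240, 270, 300 and 450, the intervals are 30, 30 and 150, which would satisfy ranges of (20, 40), (20, 40) and (100, 200).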
Designing and reﬁning a template
When designing a search strategy, the peak detection
thresholds and the selection parameters are adjusted with
the dual goals of maximizing detection and minimizing
false positives. The first goal is achieved by applying the
algorithm to a sequence database containing all proteins
that belong to the family of interest. The parameters are
reﬁned by trial and error until almost all members of the
family are selected. Next the same set of search parameters
is applied to a database containing unrelated membrane
protein sequences. If necessary, the parameters are ﬁne-
tuned until all members of the unrelated family are rejected.
Finally, the search procedure is applied to a large database,
for example one containing the proteome of an organism.
The search algorithm and several related utilities were
written using a development environment that is built
into AxoGraph (Axon Instruments, CA), a scientiﬁc
data analysis and graphics program for Macintosh com-
puters ( The
AxoGraph plug-in programs that implement the search
algorithm are available on request, or from
http://johnc3.anu.edu.au/proteomic_plugins.sea. AxoGraph was
chosen for this study because it can plot and overlay several
thousand hydropathy proﬁles in a single window, and
analyse them in a single operation. It also has convenient
features for browsing and organizing the large number of
proﬁles generated by the search algorithm.
RESULTS
A search strategy was designed for LGICs. The strategy was
reﬁned by applying it to custom polypeptide databases, and
tested by applying it to a database containing the complete
human proteome. This database was chosen because it is
well annotated, which aids in the assessment of the
algorithm’s performance. The results presented below are
essentially a proof of concept. In general, this technique will
be more useful when applied to a database that is not
complete or well annotated.
Search strategy for LGICs
The following procedure was used to develop the search
strategy for LGICs. First, a custom database containing
two members of the cys-loop receptor superfamily was
constructed. ACh receptors (AChRs) and glycine recep-
tors (GlyRs) were selected using a text search of the
Entrez database. Truncated sequences, duplicate sequenc-
es and sequences that were not LGICs were removed
manually. This left 119 unique, full-length sequences
from many different animal species (including human,
chicken, frog, ﬁsh, locust, fruit-ﬂy and nematode); these
were converted to hydropathy proﬁles in AxoGraph.
Features common to all of the proﬁles were identiﬁed by
eye. AxoGraph’s convenient browsing features aided in
this task. Every proﬁle had a cluster of three peaks
located approximately 200–300 residues from the start of
the sequence (Fig. 2A). Each of the three peaks had an
amplitude of 1–2 HU, and the cluster of peaks was
bracketed with deep valleys extending below −2.5 HU.
The cluster of three peaks was followed by a fourth peak
close to the end of the proﬁle.
Based on these observations, and following a period of
trial-and-error reﬁnement, the following selection criteria
were chosen. Only sequences with lengths between 300 and
1800 were accepted. A peak threshold of 1.1 HU and a base
threshold of 0.8 HU reliably detected all four peaks in every
proﬁle. However, some of the peaks were measured as very
narrow (only two residues) because the base threshold was
set relatively high. Therefore, narrow peaks were not
rejected. A putative transmembrane domain occasionally
appeared as two narrow peaks. Therefore, a pair of peaks
separated by fewer than six residues were counted as a single
peak. We noted that the ﬁrst and last peaks in the
characteristic cluster of peaks were separated by between
55 and 66 residues. Thus, the template criterion for a LGIC
was the presence of a cluster of three peaks separated by
between 50 and 75 residues, bounded by deep valleys of
< −2.5 HU. The cluster had to be followed by at least one
additional peak, but no more than three peaks.
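The LGIC template can be sketched as follows, using the numerical criteria given above. Several details are assumptions of this sketch rather than statements from the text: the function names are hypothetical, close peaks are merged by keeping the first of the pair, the 50–75 residue span is measured between the first and last peaks of the triplet, and each bracketing valley is taken as the profile minimum between the triplet and the neighbouring peak (or the start of the profile).

```python
def merge_close_peaks(locations, min_gap=6):
    """Count peaks fewer than `min_gap` residues apart as a single peak."""
    merged = []
    for location in sorted(locations):
        if merged and location - merged[-1] < min_gap:
            continue  # keep only the first peak of a close pair
        merged.append(location)
    return merged


def matches_lgic_template(profile, peak_locations, seq_length,
                          span=(50, 75), valley_limit=-2.5,
                          trailing=(1, 3), length=(300, 1800)):
    """Test for a triplet of peaks bracketed by deep valleys.

    `peak_locations` are indices into `profile`. The triplet must be followed
    by between trailing[0] and trailing[1] additional peaks.
    """
    if not length[0] <= seq_length <= length[1]:
        return False
    peaks = merge_close_peaks(peak_locations)
    for i in range(len(peaks) - 2):
        first, third = peaks[i], peaks[i + 2]
        if not span[0] <= third - first <= span[1]:
            continue
        n_after = len(peaks) - (i + 3)
        if not trailing[0] <= n_after <= trailing[1]:
            continue
        left_start = peaks[i - 1] if i > 0 else 0
        right_end = peaks[i + 3] if n_after else len(profile) - 1
        left_valley = min(profile[left_start:first + 1])
        right_valley = min(profile[third:right_end + 1])
        if left_valley < valley_limit and right_valley < valley_limit:
            return True
    return False
```

Under these assumptions a connexin-like arrangement, with two pairs of peaks separated by a long loop, fails the span test for every candidate triplet and is rejected.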
Testing the LGIC search strategy
A search of the AChR and GlyR database using the above
detection criteria correctly retrieved every one of the 119
proﬁles. Thus, the search strategy exhibits excellent sensi-
tivity, as it was able to detect 100% of known GlyR and
AChR across a range of species.
The accuracy and sensitivity of the search strategy were
tested by applying it to a custom database containing
GABA-A receptor sequences retrieved via a text search of
the Entrez database. GABA-A receptors are also members of
the cys-loop superfamily, but they were not used during the
selection and tuning of the search parameters. The algorithm
retrieved 39 out of 41 sequences (95%), demonstrating
excellent sensitivity for proteins that are related in both
function and sequence to the target group.
Next, the selectivity of the search strategy was examined.
We chose two families of integral membrane proteins which
are functionally distinct from LGICs, but which also have
four transmembrane domains. A custom database of
known and putative connexins and innexins was construc-
ted using a series of text searches of the Entrez database.
The search algorithm was applied to the database and
retrieved only one out of 122 sequences. Thus, the LGIC
search strategy exhibits good selectivity.
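The retrieval rates quoted for the three custom databases follow directly from the counts reported above, as the short calculation below shows. The helper name percent and the database labels are illustrative only.

```python
def percent(retrieved, total):
    """Percentage of sequences in a database retrieved by the search."""
    return 100.0 * retrieved / total


# (retrieved, total) counts taken from the text above.
results = {
    "AChR and GlyR (used for tuning)": (119, 119),
    "GABA-A receptors (not used for tuning)": (39, 41),
    "connexins and innexins (unrelated)": (1, 122),
}

for name, (retrieved, total) in results.items():
    print(f"{name}: {retrieved}/{total} = {percent(retrieved, total):.1f}%")
```

The first two figures measure sensitivity for the target class, while the third is the false-positive rate on an unrelated family with the same number of transmembrane domains.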
The entire human proteome (Entrez) was searched and
153 proﬁles with a receptor-like triplet of peaks were
retrieved. Of these, 105 (70%) were annotated as known or
putative receptors. As expected, many of these were GlyR
or AChR (31). Other members of the cys-loop superfamily
were also identiﬁed, including receptors for GABA (18) and
serotonin (5). Of particular note, 13 members of the iGluR
superfamily were also retrieved, including the
N-methyl-D-aspartate (NMDA) and kainate receptor subtypes. Thus,
the search algorithm succeeded in its central goal of
identifying proteins that were functionally related to the
target group (GlyR and AChR), but were not related by
sequence homology.
Of the proﬁles that were not annotated as receptors,
six were voltage-gated potassium channels and two were
transporters. They were retrieved because they contained six
or seven transmembrane domains, three of which formed a
cluster separated by deep valleys (Fig. 3A). It was noted
that the valleys between the triplet peaks were usually
deeper for potassium channels and transporters than for
LGICs. The receptor detection algorithm was reﬁned
to eliminate proﬁles where the deeper of the two valleys
between the triplet peaks extended below −1.5 HU. This
reﬁned algorithm was still able to detect 99% of known
GlyR and AChR. It retrieved 87 proﬁles from the human
proteome, of which 90% were receptors. Although this
reﬁned search procedure increased the selectivity for recep-
tors, it also failed to retrieve any iGluRs. This illustrates the
inevitable trade-off between the selectivity of the search
algorithm and the likelihood of detecting distantly related
functional homologues.
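The additional test used in the refined search can be sketched as follows. The function name is hypothetical, and the sketch assumes that each inner valley is the profile minimum between two adjacent peaks of the triplet.

```python
def passes_inner_valley_test(profile, triplet, limit=-1.5):
    """Reject a profile whose deeper inner valley extends below `limit`.

    `triplet` holds the profile indices of the three clustered peaks.
    """
    first, second, third = triplet
    valley_1 = min(profile[first:second + 1])
    valley_2 = min(profile[second:third + 1])
    return min(valley_1, valley_2) >= limit
```

Raising or lowering the limit moves the search along the trade-off described above: a stricter limit removes more potassium channels and transporters, but also more distantly related receptors.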
The search strategy’s sensitivity to membrane proteins
that were related to the target group by function but not
by sequence, was investigated further. A custom database
containing 84 sequences from the iGluR superfamily was
constructed using Entrez. It included the NMDA, kainate
and α-amino-3-hydroxy-5-methyl-4-isoxazole propionate
(AMPA) receptor subtypes. These receptors are function-
ally related to GlyRs and AChRs, but share almost no
sequence homology. Also, iGluRs are thought to form
tetrameric channels, in contrast with the cys-loop super-
family that forms pentameric channels. Despite these
differences, the search algorithm retrieved 30 sequences
(36%) from the iGluR database. By subtype, 90% of the
kainate receptors in the database were detected, but only
36% of the NMDA receptors, and 1% of the AMPA
receptors. Examination of the AMPA receptor hydropathy
proﬁles revealed that the peak associated with their second
membrane-associated domain did not reach the peak
threshold in most cases. A small reduction in this threshold
would have resulted in many more AMPA and NMDA
receptors being retrieved. Nevertheless, these results dem-
onstrate the remarkable sensitivity of the original search
strategy for membrane proteins that are related to AChRs
only by function.
Candidate LGICs retrieved by the search strategy
Four proteins with receptor-like proﬁles from the second
search were annotated as having no known or putative
function. In principle, these could be novel receptors, so we
examined them in greater detail. The proﬁle with accession
number AAF86374 is a member of the ancient conserved
domain protein family (ACDP), which has sequence
elements conserved from nematode to human. Intriguingly,
its secondary structure is very similar to that of a LGIC,
with a clear triplet of peaks followed by a well-separated
fourth peak (Fig. 3B). It has a shorter section preceding
the triplet than a typical receptor, but it is reasonable to
speculate that it is a membrane protein, and possibly an
ancient ion channel or receptor. The next two proﬁles came
from an uncharacterized membrane protein expressed in
the hypothalamus (accession numbers NP_060945 and
AAG09678). These proteins had six or possibly seven
transmembrane domains and are unlikely to be receptors,
but could be novel transporters or voltage-gated channel
subunits (Fig. 3C). The profile BAA18909 is simply annotated
'unknown', but a BLAST search revealed weak homology
with a section of an intrinsic factor-vitamin B12
receptor. The proﬁle is quite similar to a typical LGIC,
although a small narrow peak precedes the main triplet
(Fig. 3D). These ﬁndings demonstrate how the hydropathy
Fig. 3. Hydropathy proﬁles of four proteins that were retrieved from the
human proteome by a search strategy designed to detect LGICs, but were
not annotated as receptors. (A) A voltage-gated potassium channel was
incorrectly retrieved because its ﬁrst two hydropathy peaks fell just
below the detection threshold. Potassium channels typically have a
cluster of five peaks followed by a sixth well-separated peak. Note that
although only one peak following the valley is highlighted, the tem-
plate will accept up to three peaks. (B) An ancient conserved domain
protein with no known function was retrieved because of its receptor-
like cluster of three transmembrane peaks bracketed by deep valleys.
The separation between the cluster and the fourth peak was larger than
for a typical LGIC, but otherwise the secondary structure is strikingly
similar. (C) An uncharacterized hypothalamus protein is unlikely to be
a LGIC, despite the fact that it is expressed in a brain region. It has two
or three extra peaks before and after the triplet, giving it a secondary
structure that has more in common with a voltage-gated channel or a
transporter. (D) A retrieved protein that was simply annotated
'unknown', but which has weak sequence homology with an intrinsic
factor-vitamin B12 receptor.
peak detection algorithm may be used to search for truly
novel members of a functional class of membrane proteins.
Search strategy for neurotransmitter/Na+ symporters
To demonstrate that our approach can be applied to other
functional classes of membrane protein, we developed a
search strategy for the neurotransmitter/Na+ symporter
(NSS) family. A custom database was constructed contain-
ing 40 GABA and dopamine transporters, which have 10–
12 putative transmembrane domains. The corresponding
peaks in the transporter proﬁles could be detected using a
peak threshold of 1.4 and a base threshold of 0.6. The
minimum peak width was set to 10, and peaks with a width
of up to 60 residues were accepted. Proﬁles were accepted
only if they had between 10 and 13 peaks, arranged as a pair
of peaks, followed by a deep valley (< −1.9), then a cluster
of 8–11 peaks, extending over no more than 300 residues
(Fig. 4A,B). It is likely that the initial pair of peaks actually
represents three transmembrane domains. The second peak
was typically 40 residues in width, and is probably produced
by two closely spaced transmembrane domains. This search
strategy identiﬁed all 40 of the targeted NSS transporter
proﬁles.
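The NSS template can be sketched as follows, using the criteria listed above. The function name is hypothetical, and several details are assumptions of this sketch: peaks outside the 10–60 residue width range are dropped before counting, the deep valley is the profile minimum between the second and third peaks, and the extent of the cluster is measured from its first to its last peak.

```python
def matches_nss_template(profile, peaks, valley_limit=-1.9,
                         total=(10, 13), cluster=(8, 11),
                         max_extent=300, width=(10, 60)):
    """Test for a pair of peaks, a deep valley, then a cluster of peaks.

    `peaks` is a list of (location, width) tuples, with locations given as
    indices into `profile`.
    """
    kept = sorted(p for p in peaks if width[0] <= p[1] <= width[1])
    if not total[0] <= len(kept) <= total[1]:
        return False
    locations = [location for location, _ in kept]
    pair_end, cluster_start = locations[1], locations[2]
    # The pair and the cluster must be separated by a deep valley.
    if min(profile[pair_end:cluster_start + 1]) >= valley_limit:
        return False
    n_cluster = len(locations) - 2
    extent = locations[-1] - cluster_start
    return cluster[0] <= n_cluster <= cluster[1] and extent <= max_extent
```

Because the peak count is checked first, the indexing of the second and third peaks is always valid when the default total range is used.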
The entire human proteome (Entrez) was searched and 59
proﬁles with an NSS transporter-like pattern of peaks were
retrieved. Of these, 51 were annotated as known or putative
transporters (86%). As expected, many of these were NSS
transporters (54%), but several other transporters were also
identified, including Na+/Ca2+ antiporters (9%), Na+/glucose
symporters (7%), K+/Cl− symporters (5%), Na+/nucleoside
transporters (3%), and organic ion transporters
(3%) (Fig. 4C). Thus, the search algorithm again succeeded
in identifying proteins that were functionally related to the
target group, but were not related by sequence homology.
DISCUSSION
We have developed and tested an algorithm that can scan a
large polypeptide database, and retrieve membrane proteins
on the basis of secondary structure rather than sequence
homology. The algorithm locates putative transmembrane
domains in each sequence, and tests whether their spatial
pattern matches a template. In the past this process has been
performed manually, by visual inspection of hydropathy
plots generated one at a time. Our major innovation was to
automate the process, and apply it on the proteome scale. A
computer program performs the peak detection and tem-
plate matching. The complete proteome of an organism can
be scanned in about 1 min using a desktop personal
computer. This represents a qualitative increase in the
power of the technique, and it permits new questions to be
addressed. An analogy may be drawn with modern
sequence-based search programs, such as BLAST, which
can scan multiple genomes. Although it was directly based
on earlier sequence analysis programs that could align small
groups of sequences, its development opened an entirely
new ﬁeld.
In principle, our technique could be extended by
complementing hydropathy peak detection with a more
sophisticated analysis of the underlying sequence [8–12].
Several web-based programs use such an approach to
improve the reliability with which transmembrane domains
can be identiﬁed, and to predict topology. Incorporating
additional sequence analysis into our technique would
permit an orientation to be assigned to each transmembrane
α-helix, which would assist structural analysis. However, the
additional processing would substantially slow the search
run, and it is unclear how much improvement would be
achieved in practice. A recent study evaluated all of the
current methods for predicting transmembrane domains,
and found TMHMM to be the best performing program [13].
However, the standard Kyte–Doolittle algorithm, which
forms the basis of our search technique, was a close runner-
up. Some membrane proteins incorporate a hydrophobic
pore-lining region that does not cross the membrane, but
instead forms a beta hairpin structure that dips into the
membrane then re-emerges on the same side [22]. These
membrane-associated domains represent an important
component of the highly conserved secondary structure
Fig. 4. The conserved secondary structure of neurotransmitter/Na+
symporters is reflected in a characteristic pattern of peaks in their
hydropathy proﬁles. (A) The hydropathy proﬁle of a rat dopamine
symporter reveals a pair of peaks followed by a deep valley, then a
cluster of nine peaks. The peak, base and valley threshold levels used
by the search algorithm are shown as horizontal dashed lines. (B) A
similar pattern of peaks and valleys is seen in the proﬁle of a closely
related rat GABA symporter. (C) A human Na+-independent organic
anion transporter retrieved by the NSS symporter template exhibits a
similar pattern of peaks, although it has no sequence homology with
the neurotransmitter symporters.
of voltage-gated potassium channels, and similar hairpin
structures may also be present in other membrane proteins
[22]. A sophisticated α-helix-detection algorithm may reject
or misinterpret such regions.
Our approach is loosely analogous to a strategy that
uses alignment of hydropathy profiles to search for
conserved secondary structural features in polypeptide
sequences [20,21]. This alignment technique is based on
the same algorithm that is used in standard peptide and
nucleotide sequence alignment, but is applied to sequences
of hydropathy values. Profile alignment will generally
provide a more stringent test for conserved structure than
our template-matching approach. However, a more stringent
test will be less likely to detect unusual or distantly
related family members. For example, an LGIC containing a
triplet of unusually high hydropathy peaks will be reliably
detected by our approach, but will receive a low score in an
alignment-based search. Another problematic issue for the
alignment algorithm is what penalty should be assigned
when introducing gaps into one or both profiles, and
how this penalty should be weighted for transmembrane
domains vs. extra-membrane loops.
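To make the gap-penalty issue concrete, the sketch below applies a standard global (Needleman–Wunsch style) dynamic-programming alignment to two lists of hydropathy values. It is only an illustration of the general idea, not the published profile-alignment method [20,21]: the pair score (`bonus` minus the absolute difference in hydropathy), the single linear `gap` penalty and the function name are assumptions, and no distinction is made between transmembrane domains and loops.

```python
def align_profiles(p, q, gap=1.0, bonus=1.0):
    """Global alignment score for two numeric hydropathy profiles.

    Aligned positions score bonus - |a - b|; every position aligned to a
    gap costs `gap`. Both parameters are illustrative assumptions.
    """
    # prev[j] holds the best score for aligning the first i-1 values
    # of p against the first j values of q.
    prev = [-gap * j for j in range(len(q) + 1)]
    for i, a in enumerate(p, start=1):
        curr = [-gap * i]
        for j, b in enumerate(q, start=1):
            match = prev[j - 1] + bonus - abs(a - b)
            curr.append(max(match, prev[j] - gap, curr[j - 1] - gap))
        prev = curr
    return prev[-1]
```

With this scoring, two profiles of the same shape but different peak heights score lower than two identical profiles, which is the behaviour described above for a receptor with unusually high peaks. Choosing `gap`, and deciding whether it should differ inside and outside transmembrane domains, remains the open question.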
We tested the performance of the hydropathy alignment
approach by submitting the sequence of the GlyR
alpha-1 subunit to the web-based search engine
http://bioinformatics.weizmann.ac.il/hydroph/, and analysing the
first 200 sequences retrieved from the SwissProt database.
Only 43% of these sequences were annotated as receptors,
and all were close relatives of AChR (ACh, glycine and
GABA receptors). No receptors for serotonin or glutamate
were identified. Thus, hydropathy alignment is much less
sensitive to distantly related functional homologues, and less
selective for the membrane protein family of interest than
the template-matching approach.
We chose the human genome to test our search strategy,
because the thorough annotations permitted a detailed
assessment of the algorithm’s performance. In practice, the
hydropathy profile search tool will be more useful when
applied to an actively growing proteome database that is
not yet well annotated. The most important use for the
technique will be to search for new members of established
functional families of membrane proteins, especially those
that are missed by standard sequence-based search
techniques. We have demonstrated how this can be achieved for
LGICs, and for neurotransmitter symporters. Other candidate
families include voltage-gated ion channels, G-protein
coupled receptors, connexins and a wide variety of
transporters.
ACKNOWLEDGEMENTS
This work was supported by a Senior Research Fellowship from the
Australian Research Council (J. D. C.) and an Australian
Postgraduate Award (R. E. M.).
REFERENCES
1. Himmelreich, R., Hilbert, H., Plagens, H., Pirkl, E., Li, B.C. &
Herrmann, R. (1996) Complete sequence analysis of the genome
of the bacterium Mycoplasma pneumoniae. Nucleic Acids Res. 24,
4420–4449.
2. Frishman, D. & Mewes, H.W. (1997) Protein structural classes in
five complete genomes. Nat. Struct. Biol. 4, 626–628.
3. Wallin, E. & von Heijne, G. (1998) Genome-wide analysis of
integral membrane proteins from eubacterial, archaean, and
eukaryotic organisms. Protein Sci. 7, 1029–1038.
4. Deisenhofer, J., Remington, S.J. & Steigemann, W. (1985)
Experience with various techniques for the refinement of protein
structures. Methods Enzymol. 115, 303–323.
5. Kyte, J. & Doolittle, R.F. (1982) A simple method for displaying
the hydropathic character of a protein. J. Mol. Biol. 157, 105–132.
6. Engelman, D.M., Steitz, T.A. & Goldman, A. (1986) Identifying
nonpolar transbilayer helices in amino acid sequences of
membrane proteins. Annu. Rev. Biophys. Biophys. Chem. 15,
321–353.
7. Jones, D.T., Taylor, W.R. & Thornton, J.M. (1994) A model
recognition approach to the prediction of all-helical membrane
protein structure and topology. Biochemistry 33, 3038–3049.
8. Rost, B., Casadio, R., Fariselli, P. & Sander, C. (1995)
Transmembrane helices predicted at 95% accuracy. Protein Sci. 4,
521–533.
9. Cserzo, M., Wallin, E., Simon, I., von Heijne, G. & Elofsson, A.
(1997) Prediction of transmembrane alpha-helices in prokaryotic
membrane proteins: the dense alignment surface method. Protein
Eng. 10, 673–676.
10. Sonnhammer, E.L., von Heijne, G. & Krogh, A. (1998) A hidden
Markov model for predicting transmembrane helices in protein
sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 6, 175–182.
11. Tusnady, G.E. & Simon, I. (1998) Principles governing amino acid
composition of integral membrane proteins: application to
topology prediction. J. Mol. Biol. 283, 489–506.
12. Krogh, A., Larsson, B., von Heijne, G. & Sonnhammer, E.L.
(2001) Predicting transmembrane protein topology with a hidden
Markov model: application to complete genomes. J. Mol. Biol.
305, 567–580.
13. Moller, S., Croning, M.D. & Apweiler, R. (2001) Evaluation of
methods for the prediction of membrane spanning regions.
Bioinformatics 17, 646–653.
14. Hille, B. (1992) Ionic Channels of Excitable Membranes, 2nd edn.
Sinauer Associates, Sunderland, MA.
15. Le Novere, N. & Changeux, J.P. (2001) LGICdb: the ligand-gated
ion channel database. Nucleic Acids Res. 29, 294–295.
16. Landesman, Y., White, T.W., Starich, T.A., Shaw, J.E.,
Goodenough, D.A. & Paul, D.L. (1999) Innexin-3 forms
connexin-like intercellular channels. J. Cell Sci. 112, 2391–2396.
17. Unger, V.M., Kumar, N.M., Gilula, N.B. & Yeager, M. (1999)
Three-dimensional structure of a recombinant gap junction
membrane channel. Science 283, 1176–1180.
18. Bennett, M.V., Barrio, L.C., Bargiello, T.A., Spray, D.C.,
Hertzberg, E. & Saez, J.C. (1991) Gap junctions: new tools, new
answers, new questions. Neuron 6, 305–320.
19. Ganfornina, M.D., Sanchez, D., Herrera, M. & Bastiani, M.J.
(1999) Developmental expression and molecular characterization
of two gap junction channel proteins expressed during
embryogenesis in the grasshopper Schistocerca americana. Dev. Genet.
24, 137–150.
20. Lolkema, J.S. & Slotboom, D.J. (1998) Estimation of structural
similarity of membrane proteins by hydropathy profile alignment.
Mol. Membr. Biol. 15, 33–42.
21. Lolkema, J.S. & Slotboom, D.J. (1998) Hydropathy profile
alignment: a tool to search for structural homologues of
membrane proteins. FEMS Microbiol. Rev. 22, 305–322.
22. Wood, M.W., VanDongen, H.M. & VanDongen, A.M. (1995)
Structural conservation of ion conduction pathways in K channels
and glutamate receptors. Proc. Natl. Acad. Sci. USA 92, 4882–
4886.
