Protein Family Databases
Steven Henikoff,
Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center,
Seattle, Washington, USA
Jorja G Henikoff,
Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
The rapid expansion of biological sequence databanks and the use of protein
sequence homologies to draw functional inferences have led to a proliferation of databases
aimed at organizing protein homology information. Databases differ in how families are
defined and in how family information is depicted.
Introduction
Improvements in the efficiency of large-scale DNA
sequencing are resulting in rapid increases in the number
of protein sequences that lack genetic or biochemical
annotation. One traditional way to deduce the function of
a protein of interest is to compare it with other sequences of
known function to find a possible homologue. Methods for
homology detection formerly relied on pairwise compar-
isons of protein sequences. However, the accumulation of
sequence data has motivated and facilitated the creation of
families of related proteins. Whereas the number of protein
sequences increases at an exponential rate, the number of
new protein families has begun to level off. As these
families become populated with more and more sequences,
the utility of the classification increases, allowing for better
detection of family members, for identification of con-
served residues, for distinguishing orthologues (which are
related by descent) from paralogues (which derive from
gene duplication) and for structure modelling. The
increasing utility of protein family databases has led to
their proliferation: the first efforts to create a database of
protein families began in 1988 (Bairoch, 1992), and the
Nucleic Acids Research database issue for 2000 lists more
than a dozen. This article surveys these databases and
describes their use in inferring protein function.
What Is a Protein Family?
Each database uses a somewhat different operational
definition of a family. These differences reflect the difficulty
in defining just what constitutes a protein family. Some of
this ambiguity in demarcating relationships among pro-
teins that share sequence similarity is reflected in the use of
the imprecise but useful terms ‘superfamily’ and ‘sub-
family’. Whereas orthologous enzymes in different organ-
isms are clearly members of the same family or subfamily,
distinctions between groupings of paralogues, especially
between those that are not detectably similar in pairwise
comparisons, suggest a broader superfamily relationship.
Furthermore, in modular and multidomain proteins,
relationships are typically limited to only parts of the
protein’s sequence.
As detection of relationships improves with more
samples and better methodology, families and super-
families can become more populated. At present, structur-
al relationships provide the highest level of classification,
and structure-based databases classify proteins with
similar ‘folds’. These classifications reveal that whenever
structures are known for two proteins that are considered
members of the same family or superfamily, the structures
are similar, whereas the converse is often not true.
Therefore, significant sequence similarity can be used to
infer common structure (and common ancestry); however,
similar structures that lack detectable sequence similarity
may have resulted from either divergence beyond detection
or convergence to a similar fold. Because divergence from a
common ancestor can occur with retention of function,
family, subfamily and superfamily relationships are valu-
able for drawing functional inferences, whereas similarities
in fold but not sequence are less likely to reveal common
function. Two excellent protein structure databases, SCOP
(Lo Conte et al., 2000) and CATH (Pearl et al., 2000),
provide hierarchical structural classifications of proteins
above the superfamily level.
Problems in defining what a protein family is make it
difficult to estimate how many families exist. It has been
estimated that there are about 1000 protein folds (Chothia,
1992), and so there must be more than 1000 families.
Currently, the InterPro database lists almost 3000 families
classified by manual curation; however, databases that use
automated methods to cluster proteins into families may
list an order of magnitude more (e.g. Corpet et al., 2000),
with ‘singleton’ sequences potentially representing tens of
thousands of protein families yet to be catalogued. It may
be that the large number of potential families reflects the
greater divergence of proteins in the very diverse bacterial
and archaeal genomes, where sequence divergence over
eons has obliterated sequence similarity. Alternatively,
these unclassified proteins may constitute distinct families
ENCYCLOPEDIA OF LIFE SCIENCES / © 2001 Macmillan Publishers Ltd, Nature Publishing Group / www.els.net
that have not yet entered curated databases. Despite this
complication, it is evident that at least half of all proteins in
most eukaryotic organisms have been classified into
families, and so for most organisms of interest to
experimentalists working on model organisms and to
human biologists, protein family databases constitute
fairly comprehensive resources.
Classes of Protein Family Databases
Protein family databases obtain sequences from one of the
large protein sequence databases, most commonly SWISS-
PROT with TrEMBL (Bairoch and Apweiler, 2000) but
also PIR (Barker et al., 2000). They then apply an
algorithm, either manual or automatic, to group the
sequences into families. Each family is represented in one
or more ways to facilitate both inspection by humans and
comparison by computer programs. The most common
representation is a multiple alignment of the family’s
sequences, either with insertion and deletion (gap)
characters or without. Sometimes the multiple alignment
is summarized as a pattern or consensus sequence. For
comparison of a user’s query sequence with the protein
family database, the multiple alignment is commonly
converted to a position-specific scoring matrix (PSSM),
also called a profile or hidden Markov model (HMM).
Patterns can be compared directly with a query sequence
(Bairoch and Apweiler, 2000), and consensus sequences
with the use of a general-purpose amino acid substitution
scoring matrix (Henikoff and Henikoff, 2000).
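The conversion of an alignment into a PSSM can be illustrated with a minimal Python sketch. It builds log-odds scores from an ungapped alignment and slides the resulting matrix along a query sequence. The uniform background frequencies, the pseudocount of 1, the made-up aligned segments and the function names build_pssm and best_hit are all assumptions made for this example; production tools generally use more elaborate sequence weighting and pseudocount schemes.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Turn an ungapped alignment into per-column log-odds scores."""
    width = len(alignment[0])
    if any(len(segment) != width for segment in alignment):
        raise ValueError("all aligned segments must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(width):
        counts = Counter(segment[col] for segment in alignment)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_hit(pssm, query):
    """Slide the PSSM along the query; return (best score, start offset)."""
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Residues outside the 20-letter alphabet (e.g. X) score zero.
        score = sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))
        best = max(best, (score, start))
    return best


family_block = ["GKS", "GKT", "GKS"]  # made-up aligned segments
score, offset = best_hit(build_pssm(family_block), "AAAGKSAAA")
print(offset, round(score, 2))
```

A positive best score means that the best window resembles the family more than the background model does; deciding how high a score must be to classify a query is a separate calibration problem that this sketch does not address.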
In addition to primary protein family databases, there
are databases derived from them. Some of these derivative
databases use the primary databases’ family definitions,
but represent them differently for display and comparison
purposes, for example Blocks+ (JG Henikoff et al., 2000)
and ProClass (Huang et al., 2000). Others combine and
cross-reference the primary databases without providing
different representations, for example InterPro (Apweiler
et al., 2000) and MetaFam (Silverstein et al., 2000). The
InterPro project is cross-referencing most of the European
protein family databases and provides a single entry point
into them.
Ideally, in a protein family database each family’s
function would be fully documented with appropriate
references. In practice, only two of the curated databases
(PROSITE and PRINTS-S) provide this level of information
because it requires laborious effort. Fortunately, there
is a high level of cooperation and cross-referencing
between the protein family databases and links are usually
provided to one of the curated collections if possible.
All of the protein family databases described here
provide access via the World Wide Web (WWW), and all
allow entry into the database by some sort of keyword
search. Except as noted below, family databases also
provide a searching tool to compare a user’s sequence with
the database for classification. When a sequence is
classified in this way, users have immediate access to what
is known about the family and can apply it to their own
sequence. A few sites offer additional services such as
graphical displays, phylogenetic trees and structural
information.
Curated Protein Family Databases
Protein families in curated databases are delineated by a
human overseer, usually on the basis of personal knowl-
edge or from the published literature. Usually a proto-
family is aligned manually or semiautomatically and then
sequences are added to the family from the protein
sequence databases on the basis of sequence similarity
followed by careful validation. Curated databases have the
best documentation, but are the most difficult to maintain.
PROSITE
The PROSITE database (Hofmann et al., 1999) is the
original and best-documented protein family database;
unfortunately, it has not been appreciably updated with
addition of new families for several years. Protein
sequences are obtained from SWISS-PROT and grouped
based on documented common function. Each family is
represented by a simple pattern and sequences can belong
to more than one family. A few families are also
represented by profiles. The WWW site provides keyword
searching and classification of protein sequences by
pattern searching. PROSITE is part of the InterPro
project.
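Classification by pattern searching can be sketched in Python by translating a PROSITE-style pattern into a regular expression. The translation below covers only the core syntax (x for any residue, square brackets for allowed residues, curly brackets for excluded residues, parenthesized repeat counts, and the < and > terminal anchors). The function names are hypothetical, the query sequence is invented, and the P-loop pattern [AG]-x(4)-G-K-[ST] is used purely as a familiar example; it is not the kinesin signature discussed later in this article.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (4) or (2,3).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError("empty pattern element")
        core, low, high = match.groups()
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        # [ABC] and single residue letters are already valid regex syntax.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def find_pattern(pattern, sequence):
    """Return (start offset, matched segment) for each non-overlapping hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


print(prosite_to_regex("[AG]-x(4)-G-K-[ST]."))
print(find_pattern("[AG]-x(4)-G-K-[ST].", "MSGQTGSGKTYTM"))
```

Because a pattern either matches or does not, this style of search gives no graded score, which is one reason profiles and PSSMs are used alongside patterns.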
PRINTS-S [dbbrowser/PRINTS/]
PRINTS (Attwood et al., 2000) obtains protein sequences
from SWISS-PROT and TrEMBL. Related sequences are
aligned manually, conserved motifs are excised and
searched iteratively through the databases to add se-
quences. The results are manually validated after each
iteration. Each family is represented by a fingerprint, which
is a series of ungapped multiple alignments corresponding
to the conserved motifs. PRINTS makes a particular effort
to provide subfamily-specific entries. The documentation
is extensive and the collection is updated regularly. The
WWW site is well maintained and provides keyword
searching and classification of protein sequences by PSSM
searching. PRINTS is part of the InterPro project.
SMART
SMART (Schultz et al., 2000) is a carefully curated
database of signalling, extracellular and chromatin-asso-
ciated protein domains, which are represented as gapped
multiple alignments. A manual alignment based on
known tertiary structures is converted to an HMM for
searching against the protein databases to find more
sequences, which are validated before being added to the
family. The WWW site provides keyword searching and
classification of protein sequences by PSSM (HMM)
searching.
Pfam-A
Pfam (Bateman et al., 2000) uses structural criteria, in
addition to sequence similarity and shared function, in
defining a protein family. For example, eukaryotic proteins
containing the histone structural fold, including all four of
the distinct proteins found in the nucleosome cores, are
aligned, even though no significant sequence similarity is
detected between histone H3 and the other three histone
families. Pfam starts with manual or semiautomatic
multiple alignments of sequences with similar sequence,
function and/or structure obtained from the literature or
from other protein family databases, such as SMART.
Pfam-A constructs an HMM from a manually validated
seed alignment and searches the SWISS-PROT and
TrEMBL databases to collect more sequences. The
resulting full multiple alignment is apparently not manu-
ally validated and can be extensively gapped. Annotation
is minimal but does include references for families
taken from the literature. If families are defined on
the basis of sequence similarity alone, they are often
just documented as ‘domain of unknown function’.
Pfam-A is closely coupled with the automatically clustered
databases Pfam-B and ProDom, and is part of the InterPro
project.
PIR Superfamilies [http://pir.georgetown.edu/pirwww/dbinfo/]
The Protein Information Resource (Barker et al., 2000) is a
collection of tools, among which is a set of protein
superfamilies. PIR is unique in providing an explicit
definition of a superfamily as sequences with the same
function in various organisms. These sequences are
identified as being at least 50% identical and globally
alignable. Unfortunately, this strict definition results in
many related entries with only a few sequences each. The
sequences are taken from the PIR protein sequence
database and represented by a single typical sequence.
Annotation does not extend beyond that for the individual
sequences. The WWW site allows query by keyword and
classification of protein sequences by a gapped BLAST
(basic local alignment search tool (Altschul et al., 1997))
search versus the representative sequences, but it is
apparently not possible to view the full multiple alignment.
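The 50% identity criterion can be made concrete with a short Python sketch that measures identity over a precomputed global alignment. How PIR actually counts gapped columns is not stated here, so the choice to count a residue aligned against a gap as a mismatch is an assumption, as are the function names and the example strings.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the columns of a global pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Assumption: columns where both rows are gaps are ignored, and a
    # residue aligned against a gap counts as a mismatch.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == gap and b == gap)]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def meets_identity_cutoff(aligned_a, aligned_b, threshold=50.0):
    """True when the aligned pair reaches the identity threshold."""
    return percent_identity(aligned_a, aligned_b) >= threshold


print(round(percent_identity("ACDE-G", "ACDQFG"), 1))
print(meets_identity_cutoff("AC--", "--DE"))
```

Counting gaps as mismatches penalizes pairs that align over only part of their length, which loosely reflects the requirement that members be globally alignable.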
Clustered Protein Family Databases
Several efforts have been made since 1990 to overcome the
effort required to maintain curated protein family data-
bases by automatically clustering the protein sequence
databases using sequence similarity. The general approach
is to compute all possible pairwise comparisons, and then
cluster them in some fashion, shifting the effort from
humans to computers. This otherwise computationally
very demanding process has benefited from the introduc-
tion of the rapid PSI-BLAST system (Altschul et al., 1997).
PSI-BLAST starts searching with a single sequence but
then makes a multiple alignment PSSM from the hits after
one iteration, then searches with it, and so forth. Problems
with any clustering method include deciding how to
delineate clusters (usually on the basis of some sort of
cutoff score from the searches) and how to handle
multidomain sequences. Users of these compendiums must
be aware that they are largely unvalidated by humans and
may not always correctly group sequences with the same
function.
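The general approach, and the multidomain problem it raises, can be sketched in Python with single-linkage clustering over a table of pairwise scores. Single linkage with a fixed cutoff is only one possible clustering rule and is not claimed to be the method of any database described here; the sequence names, scores and function name are invented for illustration.

```python
def cluster_by_cutoff(names, pair_scores, cutoff):
    """Single-linkage clustering: join any pair scoring at or above cutoff."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        if score >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical scores: B shares one domain with A and another with C,
# while A and C are not detectably similar to each other.
names = ["A", "B", "C", "D"]
scores = {("A", "B"): 80, ("B", "C"): 75, ("A", "C"): 5, ("C", "D"): 10}
print(cluster_by_cutoff(names, scores, cutoff=50))
print(cluster_by_cutoff(names, scores, cutoff=78))
```

With the lower cutoff, A and C land in the same cluster only because both resemble the two-domain sequence B, which is the kind of chaining that clustered databases must guard against. Raising the cutoff separates them but also drops C from the cluster containing B, showing how sensitive the result is to the chosen score.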
ProDom [prodom.html]
ProDom (Corpet et al., 2000) is one of the earliest clustered
protein family databases and continually updates its
methods and services. Currently, it coordinates some of
its larger entries with Pfam-A and uses PSI-BLAST to
cluster the remaining sequences in SWISS-PROT and
TrEMBL. While only large entries have been scrutinized
manually, the consistency of all families is assessed by
computing a series of numerical measurements. The
resulting families are represented as consensus sequences
and gapped multiple alignments. Phylogenetic trees are
computed from these alignments and used to display a
family in overlapped subfamilies based on distances in the
tree. Documentation consists of links to the protein
sequence databases and to other protein family databases
(PROSITE, Pfam). The WWW site has graphic displays
that link related families through their shared sequences.
Keyword searches are supported, and classification of protein sequences is
provided by pairwise comparison with every sequence in
each family. ProDom has recently been used by Pfam in
place of Pfam-B, which is based on the older Domainer
algorithm (Sonnhammer and Kahn, 1994).
DOMO [gracy/domo/home.htm]
DOMO is similar in concept to ProDom and Pfam-B, and
uses SWISS-PROT as its source database. DOMO uses a
different algorithm, however, which is intended to avoid
inclusion of overlapping subsets derived from the same
family (Gracy and Argos, 1998). DOMO has not been
updated since its initial release in 1998.
ProtoMap [http://www.protomap.cs.huji.ac.il]
ProtoMap (Yona et al., 2000) automatically clusters
SWISS-PROT using three different pairwise alignment
algorithms. It scores alignments with multiple substitution
matrices, resulting in a hierarchical organization, stored as
a graph where the nodes are sequences and edges are a
measure of their similarity. Uniquely among protein family
databases, the representation is a similarity-based dendrogram.
New proteins are classified by adding them to the
existing graph. Documentation consists of links to the
sequence databases. The WWW site also supports key-
word queries.
SYSTERS [http://www.dkfz-heidelberg.de/tbi/services/cluster/systersform]
SYSTERS (Krause et al., 2000) automatically clusters the
SWISS-PROT and PIR sequence databases. It uses
pairwise alignment algorithms with conservative cutoffs.
Each family is represented by a gapped multiple alignment
with links to PROSITE and Pfam for documentation.
Keyword searching and protein classification by searching
against the multiple alignments or consensus sequences is
supported at the WWW site.
Clustered Databases from Genomes
As more complete genomes are sequenced, special
databases are being created to facilitate their comparison.
The two described here cluster whole genomes instead of
protein sequence databases.
COG
COG (Tatusov et al., 2000) defines a family as an ancient
conserved region (Green et al., 1993). It clustered 21
complete genomes representing 17 phylogenetic lineages,
and each cluster of orthologous groups (COG) of proteins
consists of individual proteins or groups of paralogues
from three or more lineages. Each COG is represented as a
gapped multiple alignment with minimal documentation.
Proteins can be classified at the WWW site by searching
against the individual proteins which are then linked to
their COGs.
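The lineage requirement can be expressed as a simple filter, sketched here in Python. The sketch assumes that candidate clusters and a protein-to-lineage mapping have already been computed; the identifiers, lineage labels and function name are hypothetical, and the actual COG construction procedure involves more than this filter.

```python
def filter_by_lineage(clusters, lineage_of, min_lineages=3):
    """Keep clusters whose members span at least min_lineages lineages."""
    kept = {}
    for cluster_id, proteins in clusters.items():
        lineages = {lineage_of[protein] for protein in proteins}
        if len(lineages) >= min_lineages:
            kept[cluster_id] = sorted(lineages)
    return kept


# Hypothetical data: paralogues from one genome count as a single lineage.
clusters = {
    "c1": ["eco_1", "bsu_1", "mja_1"],
    "c2": ["eco_2", "eco_3", "bsu_2"],
}
lineage_of = {
    "eco_1": "proteobacteria", "eco_2": "proteobacteria",
    "eco_3": "proteobacteria", "bsu_1": "firmicutes",
    "bsu_2": "firmicutes", "mja_1": "archaea",
}
print(filter_by_lineage(clusters, lineage_of))
```

Counting lineages rather than proteins means that a cluster inflated by paralogues from a single genome does not qualify on size alone.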
ProDom-CG [prodomCG.html]
ProDom-CG (Corpet et al., 2000) applies the ProDom
method to 20 complete genomes instead of to the protein
sequence databases.
Derivative Protein Family Databases
Protein family databases derived from primary collections
provide additional or alternative perspectives. Where they
are derived from more than one database, they can
facilitate comparison and validation of classifications.
Blocks+
Blocks+ (JG Henikoff et al., 2000) provides a nonredundant
collection of protein families drawn from PROSITE,
PRINTS, Pfam, ProDom and DOMO. Starting with the
sequences documented in each PROSITE entry, Blocks+
runs the BlockMaker motif-finding algorithm to find
conserved regions, which are represented as a series of
ungapped multiple alignments called blocks. PRINTS
entries are then converted to PSSMs and compared with
the resulting blocks from PROSITE using the LAMA (Local
Alignment of Multiple Alignments) algorithm (Pietrokovski,
1996). New PRINTS entries are then added to
Blocks+. Next, blocks are made from Pfam-A entries and
searched with LAMA against the PROSITE and PRINTS-
derived blocks and new entries added. Then ProDom and
DOMO entries are processed successively. Note that
Blocks+ uses only the sequences documented in each
family of the primary protein family databases and not
their representations, and thus provides an alternative
representation and classification tool. The WWW site
provides access by keyword and comparison of protein or
DNA sequence with the blocks represented as PSSMs.
Phylogenetic tree, sequence logo and 3D structural dis-
plays are also provided. Documentation consists of links to
source protein family databases.
ProClass [gfserver/]
ProClass (Huang et al., 2000) entries cross-reference
PROSITE and PIR superfamilies. ProClass computes a
neural network for each entry and uses it to add more
sequences from SWISS-PROT and PIR. Documentation
consists of links to source protein family databases.
InterPro [http://www.ebi.ac.uk/interpro]
Unlike Blocks+ and ProClass, which provide alternative
representations, InterPro is a curated cross-reference of
several protein family databases. Currently, PROSITE,
PRINTS and Pfam-A are included. Each InterPro family
entry includes documentation drawn from the participat-
ing databases. Classification of new protein sequences is
not yet available.
MetaFam
Whereas InterPro cross-references protein family data-
bases using a manual procedure based on documentation,
MetaFam cross-references them automatically based on
shared sequence segments. Families that correspond
between pairs of databases are identified by maximizing
the sequence membership overlaps. Then pairwise corre-
spondences are grouped into supersets by transitive
closure. Currently PROSITE, PRINTS, Pfam-A, PIR
Superfamilies, ProDom, SYSTERS, ProtoMap, DOMO,
SBASE-A (Murvai et al., 2000) and Blocks+ are cross-referenced.
Documentation consists of links to these
databases, as well as to the protein sequence databases.
The WWW site shows interrelationships between the
various family classifications graphically. Classification
of new protein sequences is available.
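The two steps of this procedure, choosing the best-overlapping family between each pair of databases and then grouping the pairwise links by transitive closure, can be sketched in Python. This is a simplified illustration rather than the MetaFam implementation: it compares whole sequence identifiers instead of sequence segments, links each family only in one direction, and uses invented family names and memberships.

```python
from itertools import combinations


def best_correspondences(families):
    """families maps (database, family_id) to a set of sequence ids."""
    by_db = {}
    for key in families:
        by_db.setdefault(key[0], []).append(key)
    links = set()
    for db_a, db_b in combinations(sorted(by_db), 2):
        for fam_a in by_db[db_a]:
            # Pick the family in the other database sharing most members.
            best = max(by_db[db_b],
                       key=lambda fam_b: len(families[fam_a] & families[fam_b]))
            if families[fam_a] & families[best]:
                links.add((fam_a, best))
    return links


def supersets(families, links):
    """Group families connected by any chain of links (transitive closure)."""
    parent = {key: key for key in families}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)
    groups = {}
    for key in families:
        groups.setdefault(find(key), []).append(key)
    return sorted(sorted(group) for group in groups.values())


families = {  # hypothetical memberships
    ("Pfam", "PF1"): {"s1", "s2", "s3"},
    ("PROSITE", "PS1"): {"s1", "s2"},
    ("PROSITE", "PS2"): {"s9"},
    ("ProDom", "PD1"): {"s3", "s4"},
    ("ProDom", "PD2"): {"s9"},
}
for group in supersets(families, best_correspondences(families)):
    print(group)
```

In this example PS1 and PD1 share no sequences, yet they fall into the same superset because both correspond to PF1. Transitive closure therefore widens coverage, but a single weak link can also pull in unrelated families.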
An Example: Kinesins
We choose the kinesins to illustrate similarities and
differences between protein family databases. Kinesin
and its relatives are motor proteins that utilize ATP
hydrolysis to move along microtubules in eukaryotic cells.
The motor portion of a kinesin is structurally very similar
to that of the myosin motor, which moves along actin
filaments, although no sequence similarity is evident
between them. This is an example of likely divergence
from an ancestral fold that is beyond current sequence-
based comparison methods to detect. Kinesin subfamilies
based on sequence similarities between motor domains are
strongly predictive of cellular function, indicating diver-
gence from an ancestral kinesin-like motor. Kinesins are
multidomain proteins, with a coiled-coil stalk attached to
the motor domain and typically a domain that interacts
with other protein subunits or directly with cargo, such as
vesicles and chromosomes. A curated web site describes
kinesins in detail and provides subfamily and functional
information. This web
site lists 238 sequences divided into 10 different subfamilies.
The InterPro entry (IPR001752) points to correspond-
ing entries in the curated family databases PROSITE
(PS00411, PS50067), PRINTS-S (PR00380) and Pfam-A
(PF00225), which include a total of 208 different sequences.
These correspondences are made on the basis of docu-
mentation in the source databases. All sequences in the
source databases can be displayed graphically, with the
kinesin region defined by each database highlighted. It can
be seen from this display that agreement is good concern-
ing its location. The PROSITE entries provide a pattern
and a profile representation. PRINTS and Pfam-A provide
two different multiple alignments. The four ungapped
PRINTS blocks correspond to four conserved regions
within the semimanually constructed gapped Pfam seed
alignment, which was made from 12 sequences. Automated
addition of 208 sequences to the Pfam full alignment
introduced numerous additional gaps, and split the
domain in some of the sequences.
The MetaFam tabular entry for kinesins (Table 1)
provides links to kinesin entries in most of the other
databases described in this article, noncurated as well as
curated, for a total of 587 sequences. The curated databases
PROSITE, PRINTS-S and Pfam-A are represented by the
same entries as for InterPro. For PIR, MetaFam lists seven
different entries, of which two are myosins. These entries
were brought in by a very poor SYSTERS family
connection, which can be examined at the MetaFam site.
This could be due to the fact that myosins and kinesins
have long coiled-coil stalks attached to their dissimilar
motor domains, and these heptad amphipathic repeats
detect one another in standard database searches. A visit to
the PIR site reveals 14 different superfamilies for kinesins,
ranging from three to 183 sequences each. The presence of
multiple kinesin entries reflects the conservative similarity
criteria used to distinguish superfamilies from one another:
for example, three sequences including the KLPA protein,
a member of the C-terminal subfamily listed in the kinesin
web site, are considered to form a superfamily of their own
by PIR.
Among the clustered databases, ProDom lists four
separate entries for kinesins. These entries represent
nonoverlapping conserved regions found in 135 to 139
proteins. Such fragmentation of a family into multiple
entries occurs frequently in ProDom. The four SYSTERS
entries include two with one sequence each, one with 16
sequences, including KLPA, and one with 660 sequences
subdivided into 30 subfamilies, including myosin heavy
chain. DOMO lists only a single kinesin motor domain
entry, which indicates that the DOMO clustering algorithm
has succeeded in avoiding fragmentation of the kinesins.
ProtoMap also has a single entry for the kinesin motor
domains. Although ProtoMap does not provide a multiple
alignment representation, the ProtoMap kinesin classification
dendrogram separates subsets of sequences below
distinct nodes: these approximately correspond to manu-
ally curated subfamilies listed on the kinesin web site.
Blocks+, a derivative database, has a single kinesin entry
(derived from PROSITE), with eight blocks corresponding
to the eight conserved regions identified as such in the
kinesin web site. Blocks+ also provides a phylogenetic
tree that separates sequences approximately correspond-
ing to kinesin web site subfamilies.
Conclusions
Evolutionary processes, including mutation, transposition
and chromosome rearrangement followed by selection and
drift, have resulted in diversification of protein families and
modules from their ancient common ancestors (Henikoff
et al., 1997). The resulting protein machines have been
responsible for most of the molecular processes in life on
earth. Evolutionary ‘tinkering’ has resulted in complex
relationships between protein family members and in
multidomain proteins that complicate any simple classifi-
cation scheme. However, the relationships themselves are
of extraordinary value because much of modern biology
relies upon inferences drawn from comparing and aligning
related protein sequences. Therefore, the effort to classify
proteins into families continues despite the complexity,
and different classification models have resulted in an
abundance of protein family databases. All users should be
able to find a classification that meets their needs.
Protein family databases change constantly. In addition
to the large protein family databases such as those
described here, there are numerous small databases and
WWW sites devoted to single protein families, usually
maintained by individual researchers. Up-to-date infor-
mation may be obtained from the annual database issue of
Nucleic Acids Research and the ProWeb WWW site
[http://www.proweb.org] listed in the references.
References
Altschul SF, Madden TL, Schaffer AA et al. (1997) Gapped BLAST and
PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Research 25: 3389–3402.
Apweiler R, Attwood TK, Bairoch A et al. (2000) The InterPro database,
an integrated documentation resource for protein families, domains
and functional sites. Nucleic Acids Research 29: 37–40.
[http://www.ebi.ac.uk/interpro]

Table 1  MetaFam SuperSet 312: kinesin motor domain
Adapted from <>

Database   Protein Family   Description
Blocks+    BL00411          KINESIN_MOTOR_DOMAIN, Kinesin motor domain proteins
DOMO       DM00198          KINESIN MOTOR DOMAIN
Pfam       PF00225          Kinesin motor domain
PIR-D      DA1175           kinesin motor domain homology
PIR-F      FA1228           1134.0: kinesin heavy chain 1.0
PIR-F      FA2471           1143.5: myosin heavy chain 1.0
PIR-S      1134.0           kinesin heavy chain
PIR-S      1141.5           kinesin-related protein KLPA
PIR-S      1143.5           myosin heavy chain
PIR-S      2580.0           unassigned kinesin-related proteins
PRINTS     PR00380          KINESINHEAVY: Kinesin heavy chain signature
ProDom     PD000454         PROTEIN MOTOR ATP-BINDING COILED COIL MICROTUBULES KINESIN-LIKE KINESIN HEAVY CHAIN
ProDom     PD000458         PROTEIN MOTOR ATP-BINDING COILED COIL MICROTUBULES KINESIN-LIKE KINESIN HEAVY CHAIN
ProDom     PD000470         PROTEIN MOTOR ATP-BINDING COILED COIL MICROTUBULES KINESIN-LIKE KINESIN HEAVY CHAIN
PROSITE    PS00411          KINESIN_MOTOR_DOMAIN1: Kinesin motor domain signature
PROSITE    PS50067          KINESIN_MOTOR_DOMAIN2: Kinesin motor domain profile
ProtoMap   183              protomap 183
SBASE      SB00795          KINESIN MOTOR DOMAIN
SYSTERS    N1722            systers N1722
SYSTERS    O1099            systers O1099
SYSTERS    S42943           systers S42943
SYSTERS    S43289           systers S43289

Attwood TK, Croning MDR, Flower DR et al. (2000) PRINTS-S: the
database formerly known as PRINTS. Nucleic Acids Research 28:
225–227.
Bairoch A (1992) PROSITE: a dictionary of sites and patterns in
proteins. Nucleic Acids Research 20: 2013–2018. [a-
sy.ch/sprot/]
Bairoch A and Apweiler R (2000) The SWISS-PROT protein sequence
database and its supplement TrEMBL in 2000. Nucleic Acids Research
28: 45–48.
Barker WC, Garavelli JS, Huang H et al. (2000) The Protein Information
Resource (PIR). Nucleic Acids Research 28: 263–266.
[http://pir.georgetown.edu/pirwww/dbinfo/]
Bateman A, Birney E, Durbin R et al. (2000) The Pfam protein families
database. Nucleic Acids Research 28: 263–266.
Chothia C (1992) One thousand families for the molecular biologist.
Nature 357: 543–544.
Corpet F, Servant F, Gouzy J and Kahn D (2000) ProDom and
ProDom-CG: tools for protein domain analysis and whole genome
comparisons. Nucleic Acids Research 28: 267–269. [-
louse.inra.fr/prodom.html] [ />domCG.html]
Gracy J and Argos P (1998) Automated protein sequence database
classification. I. Integration of compositional similarity search, local
similarity search, and multiple sequence alignment. Bioinformatics 14:
164–173. [ gracy/domo/home/htm]
Green P, Lipman D, Hillier L et al. (1993) Ancient conserved regions in
new gene sequences and the protein databases. Science 259: 1711–
1716.
Henikoff JG, Greene EA, Pietrokovski S and Henikoff S (2000)
Increased coverage of protein families with the Blocks Database
servers. Nucleic Acids Research 28: 228–230.
Henikoff S and Henikoff JG (2000) Amino acid substitution matrices.
Advances in Protein Chemistry 54: 73–97.
Henikoff S, Greene EA, Pietrokovski S et al. (1997) Gene families: the
taxonomy of protein paralogs and chimeras. Science 278: 609–614.
Hofmann K, Bucher P, Falquet L and Bairoch A (1999) The PROSITE
database, its status in 1999. Nucleic Acids Research 27: 215–219.
Huang H, Xiao C and Wu CH (2000) ProClass protein family database.
Nucleic Acids Research 28: 270–272. [gfserver/]
Krause A, Stoye J and Vingron M (2000) The SYSTERS protein
sequence cluster set. Nucleic Acids Research 28: 270–272.
[http://www.dkfz-heidelberg.de/tbi/services/cluster/systersform]
Lo Conte L, Ailey B, Hubbard TJP et al. (2000) SCOP: a structural
classification of proteins database. Nucleic Acids Research 28: 257–
259.
Murvai J, Vlahovicek K, Barta E, Cataletto B and Pongor S (2000) The
SBASE protein domain library, release 7.0: a collection of annotated
protein sequence segments. [ sbasesrv/]
Pearl FM, Lee D, Bray JE et al. (2000) Assigning genomic sequences to
CATH. Nucleic Acids Research 28: 277–282. [chem
ucl.ac.uk/bsm/cath/]
Pietrokovski S (1996) Searching databases of conserved sequence
regions by aligning protein multiple-alignments. Nucleic Acids
Research 24: 3836–3845.
ProWeb [http://www.proweb.org]
Schultz J, Copley RR, Doerks T, Ponting CP and Bork P (2000)
SMART: a web-based tool for the study of genetically mobile
domains. Nucleic Acids Research 28: 231–234. [l-
heidelberg.de]
Silverstein KA, Shoop E, Johnson JE et al. (2000) The MetaFam server: a
comprehensive protein family resource. Nucleic Acids Research 29: 49–51.
Sonnhammer ELL and Kahn D (1994) Modular arrangement of
proteins as inferred from analysis of homology. Protein Science 3:
482–492.
Tatusov RL, Galperin MY, Natale DA and Koonin EV (2000) The COG
database: a tool for genome-scale analysis of protein functions and
evolution. Nucleic Acids Research 28: 33–36. [
nih.gov/COG/]
Yona G, Linial N and Linial M (2000) ProtoMap: automatic
classification of protein sequences and hierarchy of protein
families. Nucleic Acids Research 28: 49–55.
[http://www.protomap.cs.huji.ac.il]
Further Reading
Baxevanis AD (2000) The Molecular Biology Database Collection: an
online compilation of relevant database resources. Nucleic Acids
Research 28: 1–7.