Computer Methods for Macromolecular Sequence Analysis


Preface
Volume 183 of Methods in Enzymology dealing with the computer
analysis of protein and nucleic acid sequences has proved very popular with
molecular biologists and biochemists. Computers and computer programs
evolve rapidly, however, and can become outmoded very quickly. As a
result, there was pressure to issue an updated volume that covers much
the same general subject areas.
Like the earlier volume, this one is divided into several sections, the
first of which deals with databases and some aspects related to their hold-
ings. Also, there have been some relocations of major databases. GenBank
is now centered at the National Center for Biotechnology Information
(NCBI) at the National Library of Medicine in Bethesda, Maryland, and
the EMBL Database has relocated to the European Bioinformatics Institute
(EBI) at a site just outside Cambridge, England. More than ever, of course,
geographic location is becoming moot, thanks to the World Wide Web
(WWW) and extended hyperlink access.
There is some new vocabulary in this volume that did not appear in
Volume 183. The use of neural nets, for example, is discussed in several
places, including chapters dealing with the classification of sequences, on
the one hand, and with predicting secondary structure, on the other. The
kinds of databases are also changing. For instance, it has been found that
the fragmentary data known as Expressed Sequence Tags (EST) are ex-
tremely useful.
Searching newly determined sequences remains the first order of busi-
ness. More often than not, a simple search of a new sequence provides
both functional and structural information. New pattern searching programs
have greatly extended the power of this approach so that very distant
relatives of well-characterized families can be identified.
The multiple alignment of protein sequences continues to have a prominent
role in protein characterization. Whether the sequences are of the
"same" protein from different organisms or are paralogs that have resulted
from gene duplications, the alignment problems are the same. Interestingly,
the most popular algorithms have not changed much, but the amino acid
substitution tables that support them have. This is chiefly the result of there
being so much comparative data in the current databases that empirical
measures of relationships can be obtained by simply tallying the occurrences
of the amino acids in blocks of obviously aligned sequences. As discussed
in Chapter [6] by Henikoff and Henikoff, these BLOSUM tables have been
remarkably effective.
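The tallying idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the published BLOSUM procedure: it assumes ungapped blocks given as equal-length strings, it omits the clustering of similar sequences that the real tables use, and the function names `pair_counts` and `log_odds` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_counts(block):
    """Tally unordered residue pairs, column by column, in an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds(counts):
    """Turn pair counts into rounded log-odds scores in half-bit units."""
    total = sum(counts.values())
    q = {pair: n / total for pair, n in counts.items()}  # observed pair frequencies
    p = Counter()  # background frequency of each residue, derived from the pairs
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2
    scores = {}
    for (a, b), freq in q.items():
        # Expected frequency of the pair if residues paired at random.
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


if __name__ == "__main__":
    toy_block = ["AAS", "AAS", "ASS"]  # three aligned sequences, three columns
    print(log_odds(pair_counts(toy_block)))
```

In the toy block, identical pairs are observed more often than chance predicts and score positive, whereas the mixed A/S pair is rarer than expected and scores negative.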
Among their many uses, multiple alignments are used to construct
profiles for more sensitive searching than is possible by single-searching.
They are also used in the consensus mode for better predictions of secondary
structure and for three-dimensional searches. And, of course, they are used
in the construction of phylogenetic trees.
Recent advances have led to some changes in emphasis in some of the
sections. Most of the chapters focus on protein sequences, even though the
vast majority of those are determined by DNA sequencing. Accordingly,
a section on RNA folding that appeared in the earlier volume has been
dropped, and instead a number of chapters that relate to the secondary
structure and three-dimensional aspects of proteins have been added.
Indeed, three-dimensional searching is following the course of sequence
searching a decade ago. As a new protein structure is characterized, the
first matter of general interest is to determine whether the fold resembles
that of any that were reported previously. The remarkable thing is that not
only are most new structures falling into well-defined families, but often
there is no hint in advance on the basis of either structure or function. The
problems associated with structure searching are similar to those experi-
enced by sequence searchers in the past: a burgeoning data bank (PDB is the
Protein Data Bank), choices of search programs, and, finally, the problem of
judgment on how significant a resemblance may be. Many of these problems
are addressed in Section V of this volume.
As with Volume 183, authors were encouraged to make their programs
or databases available to readers. Many chapters make reference to a
WWW home page or an Internet email address from which additional
information can be extracted.
Finally, I thank all the authors who wrote such interesting and informa-
tive chapters under a very strict and compressed timetable. Academic Press,
and especially our editor, Shirley Light, outdid themselves in getting the
manuscripts through the publication process in record time. As in the case
of the previous volume dealing with this topic, I must also acknowledge
that the task could not have been accomplished without the help of my
assistant, Karen Anderson. Her relentless but always gentle prodding of
authors to produce manuscripts and her remarkable organizational skills
that kept the courier traffic flowing in the right direction were indispensable.
RUSSELL F. DOOLITTLE
Contributors to Volume 266
Article numbers are in parentheses following the names of contributors.
Affiliations listed are current.
STEPHEN F. ALTSCHUL (27), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
PATRICK ARGOS (8), European Molecular Biology Laboratory, 69117 Heidelberg, Germany
MARCELLA ATTIMONELLI (17), Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, 70125 Bari, Italy
WINONA C. BARKER (3, 4), National Biomedical Research Foundation, Washington, District of Columbia 20007
GEOFFREY J. BARTON (29), Laboratory of Molecular Biophysics, University of Oxford, Oxford OX1 3QU, United Kingdom
PEER BORK (11), European Molecular Biology Laboratory, D-69012 Heidelberg, Germany; and Max-Delbrück-Center for Molecular Medicine, Department of Bioinformatics, D-13122 Berlin-Buch, Germany
JAMES U. BOWIE (35), Department of Chemistry and Biochemistry and DOE Laboratory of Structural Biology and Molecular Medicine, University of California, Los Angeles, Los Angeles, California 90095
STEVEN E. BRENNER (37), Medical Research Council Centre Laboratories of Molecular Biology, Cambridge CB2 2QH, United Kingdom
GRAHAM N. CAMERON (1), European Molecular Biology Laboratory Outstation, the European Bioinformatics Institute, Hinxton, Cambridge CB10 1RQ, United Kingdom
CYRUS CHOTHIA (37), Medical Research Council Centre Laboratories of Molecular Biology and Cambridge Centre for Protein Engineering, Cambridge CB2 2QH, United Kingdom
JEAN-MICHEL CLAVERIE (14), Laboratory of Structural and Genetic Information, E.P. 91 Centre National de la Recherche Scientifique, 13402 Marseille, France
MARC DELARUE (40), Immunologie Structurale, Institut Pasteur, 75015 Paris, France
RUSSELL F. DOOLITTLE (21), Center for Molecular Genetics, University of California, San Diego, La Jolla, California 92093
DAVID EISENBERG (35), Department of Chemistry and Biochemistry and DOE Laboratory of Structural Biology and Molecular Medicine, University of California, Los Angeles, Los Angeles, California 90024
JONATHAN A. EPSTEIN (10), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
THURE ETZOLD (8), European Molecular Biology Laboratory, 69117 Heidelberg, Germany
SCOTT FEDERHEN (33), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
JOSEPH FELSENSTEIN (24), Department of Genetics, University of Washington, Seattle, Washington 98195
DA-FEI FENG (20), Center for Molecular Genetics, University of California, San Diego, La Jolla, California 92093
JEAN GARNIER (32), Unité de Bioinformatique Biotechnologies, INRA, 78352 Jouy-en-Josas, Paris, France
DAVID G. GEORGE (3, 4), National Biomedical Research Foundation, Washington, District of Columbia 20007
JEAN-FRANÇOIS GIBRAT (32), Unité de Bioinformatique Biotechnologies, INRA, 78352 Jouy-en-Josas, Paris, France
TOBY J. GIBSON (11, 22), European Molecular Biology Laboratory, 69012 Heidelberg, Germany
WARREN GISH (27), Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63108
MICHAEL GRIBSKOV (13), San Diego Supercomputer Center, La Jolla, California 92093
XUN GU (26), Human Genetics Center, SPH, University of Texas, Houston, Texas 77225
DANIEL GUSFIELD (28), Computer Science Department, University of California, Davis, Davis, California 95616
ROBERT A. L. HARPER (1), European Molecular Biology Laboratory Outstation, the European Bioinformatics Institute, Hinxton, Cambridge CB10 1RQ, United Kingdom
JOTUN HEIN (23), Department of Ecology and Genetics, Institute of Biological Sciences, Aarhus University, DK-8000 Aarhus, Denmark
JORJA G. HENIKOFF (6), Fred Hutchinson Cancer Research Center, Seattle, Washington 98104
STEVEN HENIKOFF (6), Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, Washington 98104
DESMOND G. HIGGINS (22), European Molecular Biology Laboratory Outstation, the European Bioinformatics Institute, Hinxton, Cambridge CB10 1RQ, United Kingdom
LIISA HOLM (39), European Molecular Biology Laboratory Outstation, the European Bioinformatics Institute, Hinxton, Cambridge CB10 1RQ, United Kingdom
TIMOTHY J. P. HUBBARD (37), Medical Research Council Centre Laboratories of Molecular Biology and Cambridge Centre for Protein Engineering, Cambridge CB2 2QH, United Kingdom
LOIS T. HUNT (3), National Biomedical Research Foundation, Washington, District of Columbia 20007
MARK S. JOHNSON (34), Molecular Modelling and Biocomputing Group, Turku Center for Biotechnology, University of Turku, FIN-20521 Turku, Finland
JONATHAN A. KANS (10), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
ANTHONY R. KERLAVAGE (2), The Institute for Genomic Research, Gaithersburg, Maryland 20850
EUGENE V. KOONIN (18), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
ERIC S. LANDER (19), Whitehead Institute for Biomedical Research and Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02142
WEN-HSIUNG LI (26), Human Genetics Center, SPH, Health Science Center, University of Texas, Houston, Texas 77225
CRAIG D. LIVINGSTONE (29), Genomics Support Group, SmithKline Beecham Pharmaceuticals, New Frontiers Science Park, Harlow, Essex CM19 5AW, United Kingdom
ANDREI LUPAS (30), Abteilung Molekulare Strukturbiologie, Max-Planck-Institut für Biochemie, D-82152 Martinsried, Germany
THOMAS L. MADDEN (9), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
ALEX C. W. MAY (34), Department of Crystallography, Birkbeck College, University of London, London WC1E 7HX, United Kingdom
RICHARD J. MURAL (16), Biology Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831
ALEXEY G. MURZIN (37), Medical Research Council Centre Laboratories of Molecular Biology and Cambridge Centre for Protein Engineering, Cambridge CB2 2QH, United Kingdom
HITOMI OHKAWA (10), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
CHRISTINE A. ORENGO (36), Department of Biochemistry and Molecular Biology, University College, London WC1E 6BT, England
JOHN P. OVERINGTON (34), Computational Chemistry, Pfizer Central Research, Sandwich, Kent CT13 9NJ, United Kingdom
LASZLO PATTHY (12), Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, Budapest H-1113, Hungary
WILLIAM R. PEARSON (15), Department of Biochemistry, University of Virginia, Charlottesville, Virginia 22908
GRAZIANO PESOLE (17), Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, 70125 Bari, Italy
FRIEDHELM PFEIFFER (4), Martinsried Institute for Protein Sequences, Max Planck Institute for Biochemistry, Martinsried 82152, Germany
OLIVIER POCH (40), UPR 9002 du Centre National de la Recherche Scientifique, I.B.M.C. du Centre National de la Recherche Scientifique, 67084 Strasbourg, France
BARRY ROBSON (32), Dirac Foundation, Bioinformatics Laboratory, Royal Veterinary College, University of London, London NW1 0TU, United Kingdom
MICHAEL A. RODIONOV (34), Molecular Modelling and Biocomputing Group, Turku Centre for Biotechnology, University of Turku, FIN-20521 Turku, Finland; and Institute of Bioorganic Chemistry, Belarus Academy of Sciences, Minsk-141, Republic of Belarus 220141
BURKHARD ROST (31), Protein Design Group, European Molecular Biology Laboratory, 69012 Heidelberg, Germany
KENNETH E. RUDD (18), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
CECILIA SACCONE (17), Dipartimento di Biochimica e Biologia Molecolare, Università di Bari and Centro di Studio sui Mitocondri e Metabolismo Energetico, CNR, 70125 Bari, Italy
NARUYA SAITOU (25), Laboratory of Evolutionary Genetics, National Institute of Genetics, Mishima-shi, Shizuoka-ken, 411, Japan
CHRIS SANDER (39), European Molecular Biology Laboratory Outstation, the European Bioinformatics Institute, Hinxton, Cambridge CB10 1RQ, United Kingdom
GREGORY D. SCHULER (10), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
BENNY SHOMER (1), European Molecular Biology Laboratory Outstation, the European Bioinformatics Institute, Hinxton, Cambridge CB10 1RQ, United Kingdom
RODGER STADEN (7), Medical Research Council Centre Laboratories of Molecular Biology, Cambridge CB2 2QH, United Kingdom
P. STELLING (28), Computer Science Department, University of California, Davis, Davis, California 95616
JENS STØVLBÆK (23), Department of Ecology and Genetics, Institute of Biological Sciences, Aarhus University, DK-8000 Aarhus, Denmark
MARK BASIL SWINDELLS (38), Department of Molecular Design, Institute for Drug Discovery Research, Yamanouchi Pharmaceutical Company, Ltd., Tsukuba 305, Japan
ROMAN L. TATUSOV (9, 18), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
WILLIAM R. TAYLOR (20, 36), Division of Mathematical Biology, National Institute for Medical Research, London NW7 1AA, United Kingdom
JULIE D. THOMPSON (22), European Molecular Biology Laboratory, 69012 Heidelberg, Germany
EDWARD C. UBERBACHER (16), Computer Sciences and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831
ANATOLY ULYANOV (8), European Molecular Biology Laboratory, 69117 Heidelberg, Germany
STELLA VERETNIK (13), San Diego Supercomputer Center, La Jolla, California 92093
OWEN WHITE (2), The Institute for Genomic Research, Gaithersburg, Maryland 20850
MATTHIAS WILMANNS (35), European Molecular Biology Laboratory, 69001 Heidelberg, Germany
JOHN C. WOOTTON (33), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
CATHY H. WU (5), Departments of Epidemiology and Biomathematics, University of Texas Health Center at Tyler, Tyler, Texas 75710
YING XU (16), Computer Sciences and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831
TAU-MU YI (19), Whitehead Institute for Biomedical Research and Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02142
JINGHUI ZHANG (9), National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20892
KAM ZHANG (35), Division of Basic Sciences, Fred Hutchinson Cancer Center, Seattle, Washington 98104
[ 1 ] EUROPEAN BIOINFORMATICS INSTITUTE 3
[1] Information Services of the European
Bioinformatics Institute
By BENNY SHOMER, ROBERT A. L. HARPER,
and GRAHAM N. CAMERON
Introduction
The European Bioinformatics Institute (EBI) was established in September
1994 as a new outstation of the European Molecular Biology Laboratory
(EMBL). The new outstation is located at Hinxton Hall, Cambridgeshire,
United Kingdom. Its main tasks are management of databases
for molecular biology, bioinformatics services, and research and development
in these fields.¹

The move of the bioinformatics services from the EMBL headquarters
in Heidelberg, Germany, to the EBI had various implications, including
considerable expansion in the computer power and the number of staff.
The computers are used for management of the principal databases, and
for providing network servers. The outstation provides excellent communi-
cations channels to the scientific and research community throughout Eu-
rope, and a specialized user support group ensures that all the services are
properly maintained and functional.
Various new services (reviewed in this chapter) have been established,
made possible by the increase in both computational power and manpower
at the EBI. The inspiration for these new services has come from the
research and development (R&D) teams now operating at the EBI, which
study the management of sequence databases and the interrelationships
between various kinds of data. The main thrust of this work is to provide
novel ways to access the data and to provide interfaces that are intuitive
and easy to use for the EBI user community.
This chapter is divided into two sections. The first section is devoted
to describing the various current and future databases and resources that
are being developed in-house, and the second section describes the various
interfaces and network connections that EBI provides for the scientific
community globally. A glossary is provided at the end of this chapter that
gives a brief description of common terms.
1 D. B. Emmert, P. J. Stoehr, G. Stoesser, and G. N. Cameron, Nucleic Acids Res. 22, 3445 (1994).
Copyright © 1996 by Academic Press, Inc.
METHODS IN ENZYMOLOGY, VOL. 266 All rights of reproduction in any form reserved.
4 DATABASES AND RESOURCES [1]
EBI Databases and Resources
EMBL Nucleotide Sequence Database
The EMBL Nucleotide Sequence Database is a comprehensive database
of DNA and RNA sequences either collected from the scientific literature
and patent applications or submitted directly by researchers and sequencing
groups.² The database is produced in a collaboration between the
EMBL, GenBank (Washington DC, USA), and the DNA Data Bank of
Japan (DDBJ, Mishima, Japan). Each entry that is created at any of these
databases is automatically exchanged with the other two databases.
This allows almost complete synchronization between the databases.
The nucleotide sequence database is currently growing at an annual rate
of 75%. The total number of entries and bases for the different taxonomic
divisions is given in Table I. With further technological advances,
the growth rate of the databases is expected to increase even more.
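To put the 75% figure in perspective, the short sketch below converts an annual growth rate into a doubling time and a projected size. It assumes the rate stays constant and compounds once per year, which the text does not claim; the function names are illustrative only.

```python
from math import log


def doubling_time_years(annual_growth):
    """Years needed for the database to double at a constant annual growth rate."""
    return log(2) / log(1 + annual_growth)


def projected_size(current, annual_growth, years):
    """Size after the given number of years of compound growth."""
    return current * (1 + annual_growth) ** years


if __name__ == "__main__":
    # A 75% annual rate corresponds to doubling roughly every 1.24 years.
    print(round(doubling_time_years(0.75), 2))
```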
The nucleotide database is maintained in the relational database man-
agement system (RDBMS) ORACLE, running on a DEC Alpha VMS
cluster. Each entry in the database is assigned an accession number, which
is a permanent unique identifier. The entry is represented externally as an
ASCII "flat file." The flat file (see Fig. 1) is composed of lines beginning
with a two-character tag and followed by an associated text. The header
information ("annotation") is followed by the sequence itself. Each
entry ends with the terminator line "//". Table II summarizes the meaning
of the two-character line tags.
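The layout just described (a two-character tag, the associated text, the sequence after the SQ line, and a "//" terminator) lends itself to a very small parser. The following is a minimal sketch rather than a complete reader of the EMBL format: `parse_flat_file` is a hypothetical helper, and the sample entry, including its name HYPO1 and accession X00000, is invented for illustration.

```python
def parse_flat_file(text):
    """Split EMBL-style flat-file text into entries.

    Each entry is a dict with 'fields' (tag -> list of text lines) and
    'sequence' (residues with spacing and position counters removed).
    """
    entries = []
    fields, seq_parts, in_sequence = {}, [], False
    for line in text.splitlines():
        if line.startswith("//"):  # terminator closes the current entry
            entries.append({"fields": fields, "sequence": "".join(seq_parts)})
            fields, seq_parts, in_sequence = {}, [], False
        elif in_sequence:
            # Sequence lines: keep letters, drop blanks and the trailing counter.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
        else:
            tag, value = line[:2], line[2:].strip()
            if tag == "XX" or not tag.strip():  # blank separator or empty line
                continue
            fields.setdefault(tag, []).append(value)
            if tag == "SQ":
                in_sequence = True
    return entries


# Invented example entry; not a real database record.
SAMPLE = """\
ID   HYPO1      standard; DNA; PRO; 20 BP.
XX
AC   X00000;
XX
DE   Hypothetical example entry.
XX
SQ   Sequence 20 BP; 5 A; 5 C; 5 G; 5 T; 0 other;
     acgtacgtac gtacgtacgt        20
//
"""

if __name__ == "__main__":
    entry = parse_flat_file(SAMPLE)[0]
    print(entry["fields"]["AC"], len(entry["sequence"]))
```

Keeping repeated tags as lists preserves multi-line fields such as OC, RT, and FT in their original order.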
The EBI maintains a very high level of quality assurance of the sequence
data in the EMBL database. Each new entry is carefully reviewed by a
team of annotators, and, when necessary, direct communication with the
submitting author is initiated to clarify ambiguities. Rapid data turnaround
is essential; we guarantee to process well-formed submissions within 1 week,
although in practice entries are created within 2-3 days after receipt.
Development of the next generation of the sequence database is one
of the R&D group activities. This group concentrates on various means of
ensuring database integrity and developing state-of-the-art implementa-
tions of the data. The latest release (Release 45, December 1995) contains
622,566 entries, comprising 427,620,278 nucleotides.
SWISS-PROT Protein Sequence Database
The SWISS-PROT Protein Sequence Database is a database of protein
sequences.³ This database is produced and maintained in a collaboration
2 C. M. Rice, R. Fuchs, D. G. Higgins, P. J. Stoehr, and G. N. Cameron, Nucleic Acids Res. 21, 2967 (1993).
3 A. Bairoch and B. Boeckmann, Nucleic Acids Res. 22, 3578 (1994).
TABLE I
NUMBERS OF ENTRIES AND BASES IN EMBL NUCLEOTIDE SEQUENCE DATABASEᵃ
ACCORDING TO TAXONOMIC DIVISION

Divisionᵇ            Entries     Nucleotides
Bacteriophage           1066       1,493,417
EST                  123,526      39,332,522
Fungi                   8420      19,940,449
Invertebrates         13,831      27,610,495
Organelles              8195       9,364,254
Other mammals           6272       6,976,315
Other vertebrates       7041       8,144,622
Plants                11,105      14,145,431
Primates              35,290      36,665,648
Prokaryotes           21,427      37,074,154
Rodents               23,626      26,850,022
STS                     7232       2,288,477
Synthetic               8597       4,295,284
Unclassified            6082       3,577,630
Viruses               21,496      24,801,066
Subtotal             303,206     262,559,786
Other patents           6686       2,507,063
Total                309,892     265,066,849

ᵃ Data are total numbers of entries and bases in the EMBL nucleotide
  database at the time of freezing the database for building Release 42.
ᵇ EST, Expressed sequence tags; STS, sequence tagged sites.
between Dr. Amos Bairoch from the University of Geneva and the EBI.
The data in SWISS-PROT arise from several sources; they are derived from
translations of sequences from the EMBL Nucleotide Sequence Database,
adapted from the Protein Identification Resource (PIR) collection, ex-
tracted from the literature, and directly submitted by researchers. The
database contains high-quality annotation, is nonredundant, and is
cross-referenced to several other databases, notably the EMBL Nucleotide
Sequence Database, PROSITE pattern database, and Protein Data Bank
(PDB). The latest release (Release 32, November 1995) contained 49,340
sequence entries comprising 17,385,503 amino acids abstracted from
43,056 references.
As in the nucleotide sequence database, SWISS-PROT entries are represented
externally as an ASCII flat file. The main difference between the two
flat files is in the feature table, which in SWISS-PROT describes the
ID   standard; DNA; PRO; 1636 BP.
XX
AC   Z11747; S35943;
XX
DT   28-FEB-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   C.symbiosum gdh gene encoding glutamate dehydrogenase.
XX
KW   gdh gene; glutamate dehydrogenase.
OS   Clostridium symbiosum
OC   Prokaryota; Bacteria; Firmicutes; Endospore-forming rods and cocci;
OC   Bacillaceae; Clostridium.
XX
RN   [1]
RP   1-1636
RX   MEDLINE; 92267007.
RA   Teller J.K., Smith R.J., McPherson M.J., Engel P.C., Guest J.R.;
RT   "The glutamate dehydrogenase gene of Clostridium symbiosum.
RT   Cloning by polymerase chain reaction, sequence analysis and
RT   over-expression in Escherichia coli.";
RL   Eur. J. Biochem. 206:151-159(1992).
XX
RN   [2]
RP   1-1636
RA   Teller J.K.;
RT   ;
RL   Submitted (26-FEB-1992) to the EMBL/GenBank/DDBJ databases.
RL   Teller J.K., University of Sheffield, Molecular Biology and
RL   Biotechnology, Western Bank, Sheffield, United Kingdom, S10 2UH
XX
DR   SWISS-PROT; P24295; DHE2_CLOSY.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1636
FT                   /organism="Clostridium symbiosum"
FT                   /clone="pC~516"
FT   RBS             189..194
FT                   /citation=[1]
FT   CDS             204..1556
FT                   /gene="gdh"
FT                   /EC_number="1.4.1.2"
FT                   /product="Glutamate Dehydrogenase"
FT                   /evidence=EXPERIMENTAL
FT                   /citation=[1]
FT                   /note="pid:g49280"
SQ   Sequence 1636 BP; 474 A; 329 C; 416 G; 417 T; 0 other;
     aacgtcgatc gtgcacgttt gcgctgtaac aattataatg ctaattcaat ttc3cttatat       60
     aaQtgaaatg cgttataata aaaccag~c agaaaatttc acaas~cat agat~              120
     < >
     aagaccggca gctattattt aataacaatt gcataagcgg ttgtctg~t gattggggct      1620
     gctgcattaa gtatat                                                      1636
//
TABLE II
TWO-LETTER CODES HEADING EACH LINE OF THE FLAT FILE AND THEIR MEANINGᵃ

Code    Meaning
ID      An identifier line, containing the accession code, type, and length of
        molecule
AC      Accession number(s)
XX      A blank separator
DT      Creation and update dates
DE      Description of the sequence
KW      Keywords
OG      Organelle
OS      Organism species
OC      Organism classification
RN      Reference number
RP      Reference page
RX      Cross-reference
RA      Reference authors
RT      Reference title
RL      Reference location (publication source)
RC      Reference comments
DR      Databases cross-reference line
FH      Features header
FT      Feature table and qualifiers lines
SQ      Sequence
//      Terminator

ᵃ A full descriptive reference can be found in the EMBL user manual, available on request.
characteristics of the protein sequence. The database is currently maintained
using a mix of MS-DOS and UNIX systems, but current research
and development work may result in the migration of the database to a
relational database management system.
A relatively recent development is translation of the EMBL nucleotide
sequence database, which will act as an unannotated supplement to SWISS-PROT.
This new subgroup will consist of several sections containing entries
derived from patent data, synthetic sequences, immunoglobulins, and T-cell
receptors (IMGT database). These are in addition to a section containing
translations of all naturally occurring sequences.
FIG. 1. Typical EMBL database entry, represented in the flat file format. Note that a large
part of the sequence has been omitted (designated by < >).
External Databases Repository
The EBI provides a range of external databases on a caveat emptor
basis. The databases are maintained by scientists throughout the world,
who take responsibility and credit for their accuracy and currency. For a
comprehensive list of the various databases, see Table III.
4 S. Pascarella and P. Argos, Protein Eng. 5, 121 (1992).
5 J. Jurka and T. Smith, Proc. Natl. Acad. Sci. U.S.A. 85, 4775 (1988).
6 T. Specht, et al., Nucleic Acids Res. 19, 2189 (1991).
7 P. Rodriguez-Tome, EMBL-EBI (1995).
8 J. C. Wallace and S. Henikoff, CABIOS 8, 249 (1992).
9 M. Cherry, Massachusetts General Hospital, Boston (1992).
10 F. Larsen, et al., Genomics 13, 1095 (1992).
11 K. Wada, et al., Nucleic Acids Res. 20, 2111 (1992).
12 M. Olson, L. Hood, C. Cantor, and D. Botstein, Science 254, 1434 (1989).
13 M. Kroger, et al., Nucleic Acids Res. 20, 2119 (1992).
14 A. Bairoch, Nucleic Acids Res. 21, 3155 (1993).
15 P. Bucher and E. N. Trifonov, Nucleic Acids Res. 14, 10009 (1986).
16 The FlyBase Consortium, Nucleic Acids Res. 22, 3456 (1994).
17 E. G. D. Tuddenham, Nucleic Acids Res. 22, 3511 (1994).
18 F. Giannelli, P. M. Green, S. S. Sommer, D. P. Lillicrap, M. Ludwig, R. Schwaab, P. H. Reitsma, M. Goossens, A. Yoshioka, and G. G. Brownlee, Nucleic Acids Res. 22, 3534 (1994).
19 J. G. Bodmer, S. G. Marsh, E. D. Albert, W. F. Bodmer, B. Dupont, H. A. Erlich, B. Mach, W. R. Mayr, P. Parham, and T. Sasazuki, Tissue Antigens 44, 1 (1994).
20 M. P. Lefranc, V. Giudicelli, C. Busin, A. Malik, I. Mougenot, P. Dénais, and D. Chaume, Ann. NY Acad. Sci. 764, 47 (1995).
21 E. A. Kabat, et al., Technological Inst., Northwestern University, Evanston, Illinois (1992).
22 G. Keen, G. Redgrave, J. Lawton, M. Sinkosky, S. Mishra, J. Fickett, and G. Burks, Math. Comput. Modelling 16, 93 (1992).
23 R. Dölz, M. D. Mossé, A. Bairoch, P. P. Slonimski, and P. Linder, Nucleic Acids Res. 24, 66 (1994).
24 M. Nelson and M. McClelland, Nucleic Acids Res. 19, 2045 (1991).
25 M. Hollstein, Nucleic Acids Res. 22, 3551 (1994).
26 S. K. Hanks and A. M. Quinn, Methods Enzymol. 200, 38 (1991).
27 T. K. Attwood, M. E. Beck, A. J. Bleasby, and D. J. Parry-Smith, Nucleic Acids Res. 22, 3590 (1994).
28 E. Sonnhammer and D. Kahn, Protein Sci. 3, 482 (1994).
29 A. Bairoch, Nucleic Acids Res. 20, 2013 (1992).
30 B. L. Maidak, et al., Nucleic Acids Res. 22, 3485 (1994).
31 A. Bairoch, University of Geneva, Geneva (1991).
32 R. Eberhard, Genetic Analysis: Techniques and Applications (GATA) 10, 49 (1993).
33 R. J. Roberts and D. Macelis, Nucleic Acids Res. 20, 2167 (1992).
34 J. Jurka, et al., J. Mol. Evol. 35, 286 (1992).
35 H. Lehrach, Genome Analysis 1, 39 (1990).
36 J. M. Neefs, Y. Van de Peer, P. De Rijk, S. Chapelle, and R. De Wachter, Nucleic Acids Res. 21, 3025 (1993).
37 S. Pongor, Z. Hátsági, K. Degtyarenko, P. Fábián, V. Skerl, H. Hegyi, J. Murvai, and V. Bevilacqua, Nucleic Acids Res. 22, 3610 (1994).
38 S. Gupta and R. Reddy, Nucleic Acids Res. 19, 2073 (1991).
39 C. Zwieb and N. Larsen, Nucleic Acids Res. 20, 2207 (1992).
Software Repository
The EBI also maintains a repository of software for molecular biology
applications. The programs are provided by scientists throughout the user
community and are also provided on a caveat emptor basis. That is, the
EBI takes neither responsibility nor credit for their quality. Most programs
are distributed in compressed form, using widely accepted compression
utilities (e.g., zip, GNU zip, compress, StuffIt, and Compact Pro). Most
UNIX programs are archived as tar files, and Macintosh programs are
encoded in BinHex 4.0 format.
The software repository is arranged according to the platform for which
the program is intended. The whole repository is hierarchically arranged
under the subdirectory "software," with subdirectories according to the
platform (DOS, Mac, Unix, VAX, VMS). The programs in the software
repository are included in the software BioCatalog that is now maintained
at the EBI.

BioCatalog
The BioCatalog⁷ is an ongoing project, started in 1993 by Généthon
and the CEPH-Fondation-Jean-Dausset with the support of the RESIG
project (Networks of Computer Servers for Genomes) and a grant from
the GREG (Groupement pour la Recherche et l'Etude des Genomes). The
main aims of the project are to collect and maintain a software directory
of general interest in molecular biology and genetics, and to distribute it
on the Internet.
The catalog is categorized according to common topics (termed do-
mains), as follows: DNA, proteins, alignments, genetics, mapping, molecular
evolution, molecular graphics, database, servers, and miscellaneous. Each
of the domains contains further subdivisions. Each entry in the catalog
contains (where available) information about the program, its description,
bibliographic references, programming languages, and hardware and soft-
ware requirements. The original site from which the program can be down-
loaded is cited, and in the HTML (Hypertext Markup Language) version
it is also linked for a direct ftp session. The author details and means of
contact are included.
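The structure of a catalog entry described above can be pictured as a simple record type. The sketch below is only an illustration of that description: the class `CatalogEntry`, its field names, and the helper `group_by_domain` are assumptions and do not reflect the actual BioCatalog file format.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# The ten top-level domains listed in the text.
DOMAINS = ("DNA", "proteins", "alignments", "genetics", "mapping",
           "molecular evolution", "molecular graphics", "database",
           "servers", "miscellaneous")


@dataclass
class CatalogEntry:
    """One program record; every field except name and domain is optional."""
    name: str
    domain: str
    description: str = ""
    references: list = field(default_factory=list)
    languages: list = field(default_factory=list)
    requirements: str = ""
    source_site: str = ""      # original site the program can be downloaded from
    author_contact: str = ""

    def __post_init__(self):
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")


def group_by_domain(entries):
    """Return a mapping from domain to the program names filed under it."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.domain].append(entry.name)
    return dict(grouped)
```

Validating the domain on construction keeps every record inside the fixed set of topics; the further subdivisions mentioned in the text are left out of this sketch.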
The BioCatalog is now maintained, distributed, and further developed
at the EBI on a collaborative basis. It is available as a full text version for
40 D. Ghosh, Nucleic Acids Res. 20, 2091 (1992).
41 S. Steinberg, A. Misch, and M. Sprinzl, Nucleic Acids Res. 21, 3011 (1993).
42 E. Wingender, J. Biotechnol. 35, 273 (1994).
43 C. Brown, Nucleic Acids Res. 21, 3119 (1993).
44 S. Liebl and E. Sonnhammer, MIPS, Germany, and Sanger Centre, UK (1994).
TABLE III
EXTERNAL DATABASES PROVIDED BY EBI a

Database    Comment                                                            Ref.
3d_ali      Database merging related protein structures and sequences          4
alu         Alu sequence database                                              5
berlin      RNA databank of 5S rRNA and 5S rRNA gene sequences                 6
bio_catal   Catalog of molecular biology programs                              7
blocks      Sequence blocks database                                           8
codonusage  Tables of codon frequencies, calculated for different organisms    9
cpgisle     Human CpG-island database                                          10
cutg        Tables of codon frequencies in a tabulated format                  11
dbEST       EST (expressed sequence tags) database                             12
dbSTS       STS (sequence tagged sites) database                               12
ecdc        Escherichia coli database collection                               13
enzyme      Enzymes database                                                   14
epd         Eukaryotic promoter database                                       15
flybase     Drosophila melanogaster set of databases                           16
haema       Mutations in factor VIII gene associated with hemophilia A         17
haemb       Mutations/deletions associated with hemophilia B                   18
HLA         Alignments of HLA (human leukocyte antigen) class I and II         19
              nucleotide and protein sequences
IMGT        Immunogenetics database                                            20
kabat       Database of sequences of proteins of immunological interest        21
limb        Listing of molecular biology databases                             22
lista       Nucleotide sequences encoding proteins from yeast Saccharomyces    23
methyl      List of effects of site-specific methylation on methylases and     24
              restriction enzymes
p53         Database of p53 somatic mutations in human tumors and cell lines   25
pkcdd       Protein kinase catalytic domain database                           26
prints      Database of protein motif fingerprints                             27
prodom      Homologous domains database of nonfragment protein sequences       28
prosite     Database of known, specific sites in proteins                      29
rdp         Database and programs of the Ribosomal Database Project            30
reflist     Reference lists with relevance to molecular biology                31
relibrary   Different restriction enzyme files for sequence analysis programs  32
rebase      Restriction enzymes database, including commercial sources         33
RepBase     Repetitive elements from different eukaryotic species              34
RLDB        Reference Library DataBase of various sequence libraries           35
rRNA        Databases of small and large ribosomal subunit rRNA sequences      36
sbase       Collection of annotated protein domain sequences                   37
smallrna    Compilation of small RNA sequences                                 38
srp         Signal recognition particle database from eukaryotes and Archaea   39
tfd         Transcription factor database                                      40
trna        tRNA database                                                      41
transfac    Eukaryotic cis-acting regulatory DNA elements and trans-acting     42
              factors
transterm   Translational termination signal database                          43
yeast       Complete DNA sequences of yeast chromosomes                        44

a Through the ftp server, the WWW, and gopher servers and on the CD-ROM releases.
ftp. It is also indexed by the WAIS (Wide Area Information Servers) and SRS
(sequence retrieval system) indexing systems and is thus searchable when
accessed through the EBI World Wide Web (WWW) server.
Immunogenetics Database: IMGT
The IMGT database is an integrated database of immunological interest 20
under development through collaboration coordinated by the Laboratoire
d'Immunogénétique Moléculaire (LIGM). The IMGT database will
contain nucleotide and protein sequences of immunoglobulins (Ig) and
T-cell receptors (TCR), detailed expert annotation of these sequences,
mapping data, and the results of comparative sequence analysis. Further
collaboration with ICRF (Imperial Cancer Research Fund) London (J.
Bodmer) will allow integration of human leukocyte antigen (HLA) proteins
and genes, and that with IFG (Institute for Genetics) Cologne (W. Mueller)
will permit integration of murine alignments in the IMGT database. The
LIGM-DB is part of the IMGT database developed by the LIGM (Montpellier,
France), IFG (Cologne, Germany), ICRF (London, UK), and EMBL
outstation EBI (Cambridge, UK).
The objectives for the IMGT database are to contain information about
immunoglobulins and T-cell receptors from all species, specifically, to con-
tain all sequences and alignments, allele information, sequence tagged sites
(STS) and polymorphism, genomic maps, molecular modeling information,
and information about the relations with diseases and hybridomas. Software

will be developed for facilitating the annotation process, for classification
of sequences, and for molecular modeling. The aims include developing a
user-friendly graphical interface, stabilizing keywords used in immunoge-
netics, and incorporating results of sequence alignments and translation of
sequences to amino acid sequences. The database will provide a detailed
morphological and functional analysis of immunoglobulins and T-cell receptors.
The data are already indexed by the SRS system. The database can be obtained
from the EBI ftp server in the databases section. It can also be obtained
and searched via the EBI WWW server. The database team can
be contacted at the following address:
Interfaces between EBI and User Community
Submission Systems
Submission of Sequence Data. There are three main ways to submit
sequence data to the EBI sequence databases. The first two refer to the
nucleotide sequence and SWISS-PROT databases, while the third one
(WWW submissions) refers only to nucleotide sequences.
MANUAL EDITING OF ELECTRONIC SUBMISSION FORM. A text (ASCII)
submission form can be filled using any text editor. The editing task can be
complex and error prone, especially for inexperienced users. Furthermore,
because no data validation can be carried out in real-time, the user receives
no feedback on possible errors or omissions.
The submission form can be obtained by various methods: (1) by an
E-mail request from

(2) by ftp from ftp.ebi.ac.uk in the directory
/pub/doc/emblsub.form
or (3) from the EBI gopher server gopher.ebi.ac.uk (port 70) from the
menu selection
EMBL Nucleotide Sequence database/
Nucleotide Sequence Submissions/Updates/
When using ftp, the file type must be set to ASCII before downloading.
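The ASCII transfer described above can be sketched in Python with the standard ftplib module. This is a minimal illustration, not part of the EBI tooling: the host and path are the ones quoted in the text and may no longer be served, and the helper names fetch_text_file and download_submission_form are introduced here only for the example.

```python
from ftplib import FTP

HOST = "ftp.ebi.ac.uk"               # host as quoted in the text
FORM_PATH = "/pub/doc/emblsub.form"  # path as quoted in the text


def fetch_text_file(ftp, path):
    """Retrieve a text file in ASCII mode and return its lines."""
    lines = []
    # retrlines() switches the connection to ASCII (TYPE A) before the
    # transfer and calls the callback once per line, with the line
    # terminator already removed.
    ftp.retrlines(f"RETR {path}", lines.append)
    return lines


def download_submission_form(host=HOST, path=FORM_PATH):
    """Open an anonymous ftp session and fetch the form (needs network access)."""
    with FTP(host) as ftp:
        ftp.login()  # with no arguments, ftplib logs in as "anonymous"
        return fetch_text_file(ftp, path)
```

Binary files such as program archives would instead be fetched with retrbinary(), which leaves the bytes untouched.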
Once the text version of the submission has been prepared, it can be
sent by E-mail to , or it can be sent on a diskette via
regular mail to the EBI postal address: The EMBL Outstation, The
European Bioinformatics Institute, Hinxton Hall, Hinxton, Cambridge
CB10 1RQ, United Kingdom.
AUTHORIN PROGRAM. Authorin is an interactive program to help the
user to prepare a submission. The program exists for Macintosh and IBM-
compatible machines. Authorin works interactively with the submitter, to
prepare the submission while validating data as they are entered. At the
end of the submission process, the program produces a text file in a special
format that can be interpreted by software at the EBI. The output from
Authorin can be sent on a diskette or by E-mail the same way as the
submission form is sent.
Currently Authorin is a good way to create automatically processed
direct submissions, but new tools aimed at overcoming some of its disadvan-
tages are under development. In particular we aim to obviate the need to
actually install the program on your own machine, to deal with new data
items that are not handled by Authorin, and to create tools to run on
modern hardware that is at present incompatible with Authorin.
The Authorin program can be downloaded from the EBI ftp server:
ftp.ebi.ac.uk
The version for DOS operated machines is under
/pub/software/dos/authorin.exe
The version for Macintosh computers (not PowerPC) is under
/pub/software/mac/authorin.hqx
WORLD WIDE WEB BASED SEQUENCE SUBMISSION SYSTEM. A complete
data submission system, based on a WWW server, has been developed at
the EBI. The system provides a user with the ability to submit sequence
data in a direct and easy way. The only requirement on the user's side is
to install a WWW browser that can handle forms. The system has a few
major advantages. First, in contrast to a stand-alone program, EBI con-
stantly maintains and updates the program. This means that the user is
always working with the latest version of the program. Second, if the WWW
client is already installed, the user doesn't have to waste time, effort, and
disk space on installation of a program on the computer. Third, the program
uses the EBI database resources (like the list of previous submitters, or
journals) to provide a more user-friendly interface by avoiding the lengthy
business of entering information already available. Finally, the user may
freeze a submission session for a very long time.
The system breaks the complicated task of sequence submission into a
set of interactive forms which check the user's input and present the follow-
ing forms according to the input. The system is compatible with the various
WWW browsers currently available, on all platforms. An effort was made
to reduce the need for typing to a minimum, for example, by providing
mechanisms to load automatically the personal details (where available)
of the submitter according to an accession number of a previously submitted
sequence. If more than one sequence is to be submitted, the system enables
reuse of most data items that had been already typed in. Each submission
cycle can present a practically unlimited number of features and qualifiers
sets. At the end of the submission process the system mails to the submitter
the data entered, formatted into the EMBL flat file format, which can be
reviewed again by the submitter.
The submission system has a "crash recovery" mechanism. If the submitter's
computer (or the WWW browser) has crashed during the submission
process, the system can resume the submission at the stage where it
was abandoned, based on a unique identifier provided with each submission.
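One way to picture such a recovery mechanism is a server-side record, keyed by the unique identifier, that stores the answers to each completed form. The sketch below is only an illustration under that assumption. It is not the EBI implementation; the class name, the step names, and the in-memory storage are all hypothetical.

```python
import uuid


class SubmissionStore:
    """Hypothetical in-memory stand-in for server-side session storage."""

    def __init__(self, steps):
        self.steps = list(steps)   # ordered names of the forms
        self.sessions = {}         # submission id -> saved state

    def start(self):
        """Create a new submission and return its unique identifier."""
        submission_id = uuid.uuid4().hex
        self.sessions[submission_id] = {"next_step": 0, "data": {}}
        return submission_id

    def save_step(self, submission_id, values):
        """Record the answers to the current form and advance to the next."""
        session = self.sessions[submission_id]
        if session["next_step"] >= len(self.steps):
            raise ValueError("submission is already complete")
        step = self.steps[session["next_step"]]
        session["data"][step] = dict(values)
        session["next_step"] += 1

    def resume(self, submission_id):
        """Return the form to show next (None if finished) and the saved data."""
        session = self.sessions[submission_id]
        i = session["next_step"]
        next_step = self.steps[i] if i < len(self.steps) else None
        return next_step, session["data"]
```

After a browser crash, a client that presents the same identifier would be shown the form returned by resume(), with the earlier answers still available.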
The WWW submission system can be accessed from the EBI home
page, or it can be directly accessed at the following URL (uniform re-
source locator):

Submission of Software to the Software Repository
To submit software that has been written or developed for molecular
biology, the author should send an E-mail message to the address allocated
for this purpose:

The message should contain information about the program: what it does,
what platform it is intended to run on, and what the hardware requirements
are. It should note whether the source code is included and whether it
is a demo/shareware/freeware; any known problems and full details of the
submitting author should also be included.
The EBI software team will then contact the author to finalize the
means of providing the program. In most cases, the program is either
UUencoded or converted to BinHex 4.0 and is sent by E-mail. If the
program is very large, EBI will provide the author with a temporary user
login and password to enable upload to the EBI ftp server.
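For readers unfamiliar with UUencoding, the following sketch shows how a binary file can be turned into mail-safe text and back, using the uuencode line codec in Python's binascii module. It is an illustration only: the function names are introduced here, the "begin"/"end" framing follows the conventional uuencode layout, and BinHex 4.0 is not covered.

```python
import binascii


def uuencode(data, filename, mode="644"):
    """Return the uuencoded text for the bytes in data."""
    lines = [f"begin {mode} {filename}"]
    # Each encoded line carries at most 45 bytes of input.
    for i in range(0, len(data), 45):
        chunk = data[i:i + 45]
        encoded = binascii.b2a_uu(chunk, backtick=True)
        lines.append(encoded.decode("ascii").rstrip("\n"))
    lines.append("`")    # zero-length line marks the end of the data
    lines.append("end")
    return "\n".join(lines) + "\n"


def uudecode(text):
    """Return (filename, data) recovered from uuencoded text."""
    lines = text.splitlines()
    start = next(i for i, line in enumerate(lines) if line.startswith("begin "))
    _, _mode, filename = lines[start].split(" ", 2)
    payload = bytearray()
    for line in lines[start + 1:]:
        if line == "end":
            break
        payload += binascii.a2b_uu(line)
    return filename, bytes(payload)
```

Because every output line is short printable ASCII, the encoded program survives transport through mail systems that would corrupt raw binary data.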
The authors should also provide detailed information about the program
to be included in the BioCatalog. Information can be submitted using the
WWW BioCatalog submission form (accessible through the EBI WWW
server), or authors can send the information to
Although staff at the EBI will carry out simple checks on the program
such as for obvious viruses or compilation failures, we have neither the
resources nor the expertise to do detailed quality control. Thus submitting
authors must understand that they are assumed to have tested the software
appropriately and that they may be contacted by users encountering prob-
lems with the software.
Providing Information and Retrieval Systems
CD-ROM Distribution of Databases.
The EBI databases on CD-ROM
provide a snapshot of all the databases at a specified time. Quarterly releases
of the sequence databases are distributed in CD-ROM format. The disks
contain the EMBL database, the SWISS-PROT database, their index files,
and search utilities for Macintosh and IBM-compatible computers. The
disks also contain more than 20 related databases prepared by collabora-
tors.
Usage of the search programs requires the presence of at least one CD-
ROM drive, but it is preferred that the system be equipped with two CD-
ROM drives. If only one drive is present, the system's hard disk must have
(currently) at least 150 Mb free space. As the EMBL database currently
has an annual growth rate of about 70%, the index files of the next releases
are likely to occupy much more disk space. Users can order single CD-
ROM releases or subscribe indefinitely or for several releases.
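The effect of the quoted growth rate on disk requirements can be estimated with simple compound-growth arithmetic. The sketch below assumes, purely for illustration, that the index files grow in proportion to the database; the text does not state this, and the function names are introduced here.

```python
def projected_size_mb(current_mb, annual_growth, years):
    """Size after the given number of years of compound growth."""
    return current_mb * (1.0 + annual_growth) ** years


def quarterly_growth(annual_growth):
    """Per-release (quarterly) growth rate equivalent to an annual rate."""
    return (1.0 + annual_growth) ** 0.25 - 1.0


if __name__ == "__main__":
    for years in (1, 2, 3):
        size = projected_size_mb(150, 0.70, years)
        print(f"after {years} year(s): about {size:.0f} Mb")
    print(f"growth per quarterly release: about {quarterly_growth(0.70):.1%}")
```

Under this assumption the 150 Mb requirement becomes about 255 Mb after one year and about 434 Mb after two, which corresponds to roughly 14% growth per quarterly release.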
To order the EBI CD-ROM set, send an E-mail request to datalib@ebi.ac.uk
or use the special form that appears in various WWW pages (EMBL,
SWISS-PROT, documentation and software) that lets the user subscribe
on-line.
ftp Server.
The ftp server of the EBI can be accessed by opening an
ftp session:
ftp ftp.ebi.ac.uk
Login as "anonymous" (lowercase) and type your E-mail address as a
password.

The session starts by default in the /pub directory. The /pub directory
is organized as follows:
README (file)
/contrib (directory)
/databases (directory)
/doc (directory)
/help (directory)
ls-lR.Z (file)
/software (directory)
The file README contains a general description of the ftp server. The
file ls-lR.Z contains (UNIX) compressed information of all the directories
and files of the ftp system. The directory "databases" contains the updates
of the EMBL and SWISS-PROT databases, and all the external databases
that are provided on CD-ROM. The directory "doc" contains documenta-
tion and forms. The directory "help" contains various information files
about the directories and databases on the ftp server. The directory "soft-
ware" contains various demo, shareware, and freeware programs for DOS,
Macintosh, UNIX, VAX, and VMS platforms in the following directories
accordingly: "dos, mac, unix, vax, and vms." There is also a "tools" subdirec-
tory that contains tools which help the user to communicate with the EBI.
All the ftp directories and files are also accessible through the EBI gopher
and WWW servers.
Gopher Server.
Although the most facile access to EBI services is via
the WWW server, a gopher server provides a last resort for users limited
to text based access. The gopher server provides access to the nucleotide
and SWISS-PROT databases (documentation and data), the ftp server for
databases and software, the BioCatalog software directory (excluding its
search utility), EMBnet gopher servers, and searches in gopherspace using
VERONICA. There is a simple text based program for the WWW called
lynx, and we recommend that users limited to text based systems
use lynx to connect to EBI's WWW server. To connect with EBI's gopher
use the following address:
gopher.ebi.ac.uk
World Wide Web Server.
The World Wide Web (WWW) server is currently
the main interface of the EBI with the scientific community. The
WWW combines text, graphics, and the ability to collect data from the
user by means of forms, which makes it an optimal mechanism for EBI to
provide and collect information. The EBI WWW home page can be logically
divided into several major topics as follows.
MAIN DATABASES
EMBL Nucleotide Sequence Database Area.
The home page introduces
the user to the EMBL Nucleotide Sequence Database. It provides the user
with the updated database release information, information for submitters,
information about the various methods of data submission, contact ad-
dresses, and the feature table definition. There is a link to a form providing
an easy means of updating the database with minor corrections. The correc-
tions are provided in a noninteractive manner, as free text. There is also
a link to the new WWW based sequence submission system described
above. Users who wish to subscribe to the database may do so on-line,
using a WWW based subscription system, linked to the EMBL page.
SWISS-PROT Protein Sequence Database area.
The home page of the
SWISS-PROT Protein Sequence Database provides users with access to
documentation, including release notes and the user manual for the database.
There is a link to the new "protein machine." This is a form based
on a script which translates a nucleic acid sequence to its protein product,
attempting to deal with all the complexities and exceptions such as unusual
translation tables. Users can also subscribe on-line if they wish to receive
the database on CD-ROM.
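The core of such a translation step, including the choice of translation table, can be sketched as follows. This is not the "protein machine" script itself. It is a minimal illustration using the standard genetic code and, as an example of an unusual table, the vertebrate mitochondrial code, each written as the 64 amino acids in TCAG codon order; frame selection, introns, and other exceptions are not handled.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in the order TTT, TTC, TTA, TTG, TCT, ...
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERTEBRATE_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"


def codon_table(amino_acids):
    """Map each codon to its amino acid for a 64-letter table string."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, amino_acids))


def translate(seq, amino_acids=STANDARD):
    """Translate a DNA or RNA sequence in frame 1; '*' marks a stop codon."""
    table = codon_table(amino_acids)
    seq = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        # Codons with ambiguous or unknown bases are reported as 'X'.
        protein.append(table.get(seq[i:i + 3], "X"))
    return "".join(protein)
```

The same sequence can give different products under different tables: TGA is a stop codon in the standard code but encodes tryptophan in vertebrate mitochondria.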
The SWISS-PROT home page provides links to a wide range of retrieval
services, related databases, and search services: retrieval by accession num-
ber or entry name, SRS (sequence retrieval system) access, links to dbEST
and dbSTS (see Table III), and FASTA, BLITZ, BLAST, and PROSITE
searches. A huge advantage of the WWW interface is that a rich range of
services can be offered without making the user interface overcomplex.
SEQUENCE-RELATED OPERATIONS
Sequence query and retrieval.
The simplest and most direct retrieval system
is operated by providing the server with an accession number (e.g., X58929)
or an entry name (e.g., SCARGC). Although this method is limited to
cases where the user knows the identity of the entry (e.g., when an accession
number is cited), it is the fastest method of obtaining a sequence from the
database. Users may retrieve sequences directly from the EMBL, SWISS-
PROT, PROSITE, and PDB databases.
If the sequence is found in the database, it is returned to the user
formatted as a linked HTML document. Where applicable, the MEDLINE
cross-reference is linked to the MEDLINE entry containing the reference
abstract and publication details. When the entry has a database cross-
reference, it is linked to the appropriate database entry as well. For instance,
the nucleotide sequence with accession number J00231 has a cross-refer-
ence line:

DR SWISS-PROT: P01860; GC3_HUMAN.
The SWISS-PROT accession number P01860 appears as a hypertext entry,
linked to the actual SWISS-PROT file of P01860, and then it is a simple
matter to click on this hypertext link to call up the SWISS-PROT entry.
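The conversion of a cross-reference line into a hypertext link can be sketched as below. The example parses the DR line quoted above and wraps the accession number in an HTML anchor. The base URL is a placeholder and the function names are introduced here; the text does not describe how the EBI server builds its links.

```python
import re
from html import escape

# Accepts "DR SWISS-PROT: P01860; GC3_HUMAN." with either ':' or ';'
# after the database name.
DR_LINE = re.compile(
    r"^DR\s+(?P<db>[^:;]+)[:;]\s*(?P<acc>[^;]+);\s*(?P<name>[^.]+)\.\s*$"
)


def parse_dr(line):
    """Return (database, accession, entry name), or None if not a DR line."""
    m = DR_LINE.match(line)
    if m is None:
        return None
    return m.group("db").strip(), m.group("acc").strip(), m.group("name").strip()


def dr_to_html(line, base_url="http://example.org/entry"):
    """Render a DR line with the accession number as a hypertext link."""
    parsed = parse_dr(line)
    if parsed is None:
        return escape(line)
    db, acc, name = parsed
    href = f"{base_url}/{db}/{acc}"   # placeholder URL scheme
    return (f'DR   {escape(db)}: <a href="{escape(href)}">'
            f"{escape(acc)}</a>; {escape(name)}.")
```

Applied to every DR line of an entry, this yields the kind of linked HTML document described above.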
Sequence Retrieval System.
The sequence retrieval system (SRS) is a
robust indexing system, developed by Thure Etzold and Gerald Schiller in
collaboration with Reinhard Dölz from the Biozentrum in Basel. 45,46 The
SRS enables a fast and efficient search for keywords and definitions through
various databases. Currently, there are 33 database systems indexed by the
SRS on the EBI server (see Table IV). An interface to search mechanisms
of the SRS indexes is provided as a WWW form.
The SRS allows flexible selection of which databases to search, which
fields in the database should be searched, the target keywords to be sought
(including trailing wildcards), and the fields to be presented in displaying
the search "hits." Complex searches can be built up using the usual Boolean
operators, rendering the entire system powerful, flexible, and easy to use.
Indeed, SRS is the most popular access method supported by the EBI.
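The combination of field selection, trailing wildcards, and Boolean operators can be illustrated with a small in-memory search. This sketch does not use SRS or its query syntax. The data layout, function names, and sample entries are invented for the example.

```python
def term_matches(term, words):
    """True if the term matches a word; a trailing '*' matches any suffix."""
    term = term.lower()
    if term.endswith("*"):
        prefix = term[:-1]
        return any(word.startswith(prefix) for word in words)
    return term in words


def search(entries, field, terms, operator="AND"):
    """Return ids of entries whose chosen field satisfies the query terms."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    combine = all if operator == "AND" else any
    hits = []
    for entry_id, fields in entries.items():
        words = set(fields.get(field, "").lower().split())
        if combine(term_matches(term, words) for term in terms):
            hits.append(entry_id)
    return hits


# Invented entries, for illustration only.
ENTRIES = {
    "E1": {"description": "human immunoglobulin gamma chain"},
    "E2": {"description": "mouse immunoglobulin kappa chain"},
    "E3": {"description": "human insulin"},
}
```

For instance, search(ENTRIES, "description", ["human", "immuno*"]) selects only the entry that contains both terms, whereas the same query with operator="OR" selects all three.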
Expressed sequence tags and sequence tagged sites.
The two specialist
sequence libraries dbEST (database of expressed sequence tags) and dbSTS
(sequence tagged sites, Ref. 12), developed by the National Center for
Biotechnology Information (NCBI), are mirrored by EBI. dbEST is a
database of sequence and mapping data on expressed sequence tags, which
are partial, "single pass" cDNA sequences, whereas dbSTS contains se-
quence and mapping data on short genomic landmark sequences or se-
quence tagged sites. Both databases are completely searchable by using
the SRS described above.
SEQUENCE SIMILARITY SEARCHES
Nucleic acid homology searches.
The WWW server enables an easy
submission of homology searches of nucleotide and amino acid sequences
in the EMBL and SWISS-PROT databases, by using the FASTA program.
FASTA performs searches of the database for sequence homology against
a provided target, using the FASTA algorithm. 47 The WWW form enables
45 T. Etzold and P. Argos, Comput. Appl. Biosci. 9, 49 (1993).
46 T. Etzold and P. Argos, Comput. Appl. Biosci. 9, 59 (1993).
47 W. R. Pearson and D. J. Lipman, Proc. Natl. Acad. Sci. U.S.A. 85, 2444 (1988).
TABLE IV
DATABASES SEARCHABLE THROUGH THE SEQUENCE RETRIEVAL SYSTEM

Name          Entries a  Library group
EMBL          422,829    Sequence
EMNEW         18,579     Sequence
SWISSPROT     43,470     Sequence
SWISSNEW      3804       Sequence
PIR           71,995     Sequence
NRL3D         4153       Sequence
NRSUB         248        Sequence
PDB           3588       Protein structure
HSSP          3248       Protein structure
DSSP          3143       Protein structure
FSSP          557        Protein structure
ALI           84         Protein structure
SWISSDOM      28,224     Sequence related
PRODOM        23,105     Sequence related
FLYGENE       7126       Sequence related
ECDC          3894       Sequence related
ENZYME        3556       Sequence related
REBASE        2486       Sequence related
EPD           1252       Sequence related
PIRALN        1183       Sequence related
PROSITE       1029       Sequence related
CPGISLE       965        Sequence related
IMGT          885        Sequence related
PROSITEDOC    786        Sequence related
BLOCKS        770        Sequence related
MEDLINE       179,262    Literature
SEQANALREF    2579       Literature
LIMB          120        Others
TFSITE        4042       Transcriptional factors
TFFACTOR      1412       Transcriptional factors
DBEST         241,909    Tagged sites
DBESTNEW      7025       Tagged sites
DBSTS         12,890     Tagged sites
DBSTSNEW      11         Tagged sites

a Data are numbers of entries as of July 1995.
easy selection of the target library for searches, the
level of sensitivity (ktup), the number of matched sequences to be listed,
and the number of aligned sequences to be listed. After typing or pasting
the sequence into the appropriate window, the user initiates the search,
and the results are sent back to the user by E-mail.
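The role of the ktup parameter can be illustrated with the word-lookup step that FASTA-style programs use to find candidate regions: all words of length ktup in the query are indexed, and shared words in the library sequence are counted per diagonal. The sketch below shows only this first step, with names introduced here; the rescoring and alignment stages of the FASTA program are not reproduced.

```python
from collections import Counter, defaultdict


def ktup_index(seq, ktup):
    """Map every word of length ktup in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index


def best_diagonal(query, target, ktup=2):
    """Return (diagonal, shared word count) for the diagonal with most hits.

    The diagonal is the offset (query position - target position); the
    result is (None, 0) when the sequences share no word of length ktup.
    """
    index = ktup_index(query, ktup)
    diagonals = Counter()
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            diagonals[i - j] += 1
    if not diagonals:
        return None, 0
    return diagonals.most_common(1)[0]
```

A larger ktup produces fewer chance word matches and a faster search, at the cost of missing weaker similarities; a smaller ktup is slower but more sensitive.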

Protein sequence homology searches: BLITZ database searches.
The
WWW server enables submission of sequences for a BLITZ search. BLITZ
uses the MPsearch program of Shane Sturrock and John Collins. 48 MPsearch
allows sensitive and extremely fast comparisons of protein sequences
against the SWISS-PROT protein sequence database using the Smith and
Waterman best local similarity algorithm. 49 It runs on the MasPar family
of massively parallel machines. A typical search time for a query sequence
of 400 amino acids is approximately 40 sec, which covers a search of the
entire SWISS-PROT database. Additional time is required to reconstruct
the alignments depending on the number of alignments requested.
MPsearch is the fastest implementation of the Smith and Waterman algo-
rithm currently available on any machine.
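The best local similarity score at the heart of the Smith and Waterman method can be computed with a short dynamic-programming recurrence. The sketch below is a plain serial version with a simple match/mismatch score and a linear gap penalty, chosen here for brevity. It is not the MPsearch implementation, which would use an amino acid substitution matrix and parallel hardware.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)   # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment
                prev[j - 1] + s,      # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

The work grows with the product of the two sequence lengths, which is why a full database scan with this algorithm is demanding on serial machines and benefits from massively parallel hardware.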
PROSITE database searches.
The PROSITE database search is a WWW
interface to Mail-PROSITE based on the ppsearch software derived
from the MacPattern program developed by R. Fuchs. 50 It allows a rapid
comparison of a new protein sequence against all patterns stored in the
PROSITE pattern database. 51 The WWW form is very simple to use. The
user needs to provide only a title for the search and the amino acid sequence
in question. Thus, it avoids the need for an E-mail submission and
retrieval of the search results. Because the database being searched is
relatively small, the results are returned in real time directly to the
WWW client.
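Comparing a sequence against a PROSITE pattern amounts to a pattern match, which can be sketched by translating the PROSITE pattern notation into a regular expression. The converter below is a minimal illustration written for this purpose. It is not the ppsearch code, and it handles only the common elements (residues, x, bracketed alternatives, braced exclusions, repeat counts, and the < and > anchors at the ends of a pattern).

```python
import re

# One pattern element: x, a residue, [alternatives] or {exclusions},
# optionally followed by a repeat count such as (3) or (2,4).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a PROSITE pattern such as 'N-{P}-[ST]-{P}.' to a regex string."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # pattern anchored at the N terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # pattern anchored at the C terminus
            suffix, element = "$", element[:-1]
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                   # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        if low is not None:
            core += "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return (start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]
```

For example, the N-glycosylation pattern N-{P}-[ST]-{P} becomes the regular expression N[^P][ST][^P].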
BLAST searches.
There are two pointers for a form based interface
with the two BLAST search servers. The BLAST program searches in
SBASE 3.1, a collection of annotated protein domains. One server is located
in Trieste, Italy, and the other at the NCBI (Bethesda, MD). The main
difference between the two servers is that the NCBI server provides a very
straightforward search form with predetermined search parameters,
whereas the one in Trieste calls for a thorough knowledge of the program
parameters but enables more freedom of operation. The NCBI server will
return the results of the search directly on-line, and the server at the
International Centre for Genetic Engineering and Biotechnology in Trieste
returns the results by E-mail. The interface provides a convenient manner
of setting the various variables needed for the analysis, including the type
of matrix to be used, the genetic code (for nucleic acid sequences), and
the format of the output to be provided.
48 S. S. Sturrock and J. F. Collins, MPsrch version 1.3. Biocomputing Research Unit, University of Edinburgh, UK (1993).
49 T. F. Smith and M. S. Waterman, J. Mol. Biol. 147, 195 (1981).
50 R. Fuchs, Comput. Appl. Biosci. 10, 171 (1994).
51 A. Bairoch, Nucleic Acids Res. 21, 3097 (1993).
DOCUMENTATION AND VARIOUS SERVICES.
The documentation area of
the WWW server provides some documentation of general interest, like
documentation of the EBI services and a reference list for authors.
BioCatalog.
The BioCatalog is a database of computer programs for
molecular biology and genetics. This project was initiated by Généthon
and the CEPH-Fondation-Jean-Dausset. The EBI now supports the maintenance,
development, and distribution of the BioCatalog as part of the
ongoing research and development scheme.
The BioCatalog is divided logically into various areas of interest, called
domains. The domains available are DNA, proteins, alignments, genetics,
mapping, molecular evolution, molecular graphics, database, servers, and
miscellaneous.
The BioCatalog exists on the EBI server in two versions: a text based
version, available for downloading through the ftp server (under
/pub/databases/bio-catal) and through the gopher server, and a WAIS indexed
version. The indexed version can be searched by using a specialized query
form on the WWW server. The query form supports several kinds of search:
a full text search, or a search by BioCatalog accession number,
by name, by description, by author name, or by bibliographic information.
The user may define the logical operator to be used (either AND or OR),
how many successful search results to display, and whether to display them
as full records or only as short informative headers. An SRS indexed
version also exists, and it is searchable through the WWW SRS
searches interface.
A very important aspect of the BioCatalog is that the users actively
update the database by announcing new programs or updating existing
ones. There is a special WWW form for announcements on new programs.
Not only does the form enable an easy way of providing the information,
but it also enables the database maintainers to direct the submitting authors
to provide the most appropriate information to describe the program.
EBI netnews filtering system.
One of the major problems of modern
scientists is keeping up to date with news in related fields of interest and
maintaining communications with colleagues. The Usenet network news
system helps to overcome this problem. However, the volume of information
that flows through the news groups constantly increases, and it is now
a problem to filter the relevant messages.
The idea behind the EBI Netnews filtering system is to allow users to
provide a search profile that identifies the topics they are interested in. A
special program will scan the Usenet groups and will mark out the articles
with relevance to the user according to the search profile provided. The
profile itself may contain Boolean operators to provide a more stringent
search. The user can set a certain threshold to increase the filtering power
of the program. A higher threshold provides fewer articles, with a higher
index of relevance to the search profile.
The search program runs on a regular basis at predetermined intervals.
It indexes all the Usenet articles and sends results by E-mail. Each search
hit contains the first few lines from the message (the number of lines can
be determined by the user).
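The interplay of search profile, threshold, and preview length can be pictured with a small scoring filter. The sketch below is hypothetical throughout: the relevance index is taken to be the fraction of profile keywords found in an article, which is an assumption made for the example and not the scoring used by the EBI program, and Boolean operators in the profile are not modeled.

```python
import re


def relevance(profile, text):
    """Fraction of profile keywords that occur as words in the text."""
    if not profile:
        return 0.0
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    matched = sum(1 for keyword in profile if keyword.lower() in words)
    return matched / len(profile)


def filter_articles(articles, profile, threshold=0.5, preview_lines=3):
    """Return (score, subject, first lines) for articles at or above threshold."""
    hits = []
    for subject, body in articles:
        score = relevance(profile, subject + "\n" + body)
        if score >= threshold:
            preview = body.splitlines()[:preview_lines]
            hits.append((score, subject, preview))
    # Most relevant articles first.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits
```

Raising the threshold shortens the list of hits, in line with the behavior described above.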
The WWW based form that enables a user to submit a search profile
requires the user to provide a password, which keeps the user's fields of
interest confidential. An end-user can submit as many profiles as
desired to the system, but it is good practice to test run each profile before
submitting it. Test runs can give an estimate of how efficient the keywords
in the profile are before the profile is submitted. Each subscription is given
an ID number. All the ID numbers can be listed, and canceled at any time.
NETWORK NAVIGATION RELATED OPERATIONS.
Several documents and
services are provided to aid the users in finding network resources related
to their fields of interest. The "Bio-wURLd" is a home page that contains
a list of links submitted by biologists. This is an interesting service, because
users have the possibility to add new sites of interest to the list. In essence
Bio-wURLd is actively maintained by the user community. Another method
for the discovery of network resources is to look at "clickable maps." The
EBI WWW server has clickable maps for the whole of Europe and for the
United Kingdom in particular.
In a similar manner there is also "Career Connection," which allows
users to advertise job opportunities. Again, this service is end-user driven
since all the jobs being listed have been contributed through the EBI
WWW server.
EBI-CUSI search.
There are many search engines that allow users to
explore the WWW. The EBI-CUSI interface is a compilation of some of
the best search engines available, and they are all collected under one page
to allow ease of access. Users can find resources by searches. There is a
special multiform page that will help users to submit search requests to
many search servers. There are a few search groups that can be accessed:
searches through selected indexes of WWW pages, searches through indexes
generated by special search robots, other non-WWW based Internet search
engines (e.g., VERONICA, WAIS), various methods of searching for soft-
ware, finding people and places on the network, dictionaries available on
the Internet, and other documents of general interest.
FINDING OUT MORE ABOUT EBI SERVICES. A verbal description of
WWW services, such as given here, does not do justice to their ease of use.
By exploring the EBI home page you will find that all this information is
