a user's guide to the human genome

contents
supplement to nature genetics • september 2002
Cover art by Darryl Leja

editorial
Spreading the word
Alan Packer

foreword
Power to the people
Andreas D Baxevanis & Francis S Collins

perspective
Genomic empowerment: the importance of public databases
Harold Varmus

user’s guide
A user’s guide to the human genome
Tyra G Wolfsberg, Kris A Wetterstrand, Mark S Guyer, Francis S Collins
& Andreas D Baxevanis

Introduction: putting it together

Question 1
How does one find a gene of interest and determine that gene’s structure? Once the
gene has been located on the map, how does one easily examine other genes in that
same region?

Question 2
How can sequence-tagged sites within a DNA sequence be identified?

Question 3
During a positional cloning project aimed at finding a human disease gene, linkage
data have been obtained suggesting that the gene of interest lies between two
sequence-tagged site markers. How can all the known and predicted candidate genes
in this interval be identified? What BAC clones cover that particular region?

Question 4
A user wishes to find all the single nucleotide polymorphisms that lie between two
sequence-tagged sites. Do any of these single nucleotide polymorphisms fall within
the coding region of a gene? Where can any additional information about the
function of these genes be found?

Question 5
Given a fragment of mRNA sequence, how would one find where that piece of DNA
mapped in the human genome? Once its position has been determined, how would
one find alternatively spliced transcripts?

Question 6
How would one retrieve the sequence of a gene, along with all annotated exons and
introns, as well as a certain number of flanking bases for use in primer design?

Question 7
How would an investigator easily find compiled information describing the structure
of a gene of interest? Is it possible to obtain the sequence of any putative promoter
regions?

Question 8
How can one find all the members of a human gene family?

Question 9
Are there ways to customize displays and designate preferences? Can tracks or
features be added to displays by users on the basis of their own research?

Question 10
For a given protein, how can one determine whether it contains any functional
domains of interest? What other proteins contain the same functional domains as
this protein? How can one determine whether there is a similarity to other proteins,
not only at the sequence level, but also at the structural level?

Question 11
An investigator has identified and cloned a human gene, but no corresponding
mouse ortholog has yet been identified. How can a mouse genomic sequence with
similarity to the human gene sequence be retrieved?

Question 12
How does a user find characterized mouse mutants corresponding to human genes?

Question 13
A user has identified an interesting phenotype in a mouse model and has been able
to narrow down the critical region for the responsible gene to approximately 0.5 cM.
How does one find the mouse genes in this region?

Commentary: keeping biology in mind
Acknowledgments
References
Web resources: Internet resources featured in this guide

© 2002 Nature Publishing Group
editorial
There was a time, not too long ago, when the wisdom of
genome-sequencing projects was up for discussion.
Would they be too expensive, draining funds from other
areas of the life sciences? Would they be worth the trou-
ble? Not much more than 15 years have passed since
those early debates, and the importance of sequenced
genomes to biology and medicine has now gained wide
acceptance. This is in part owing to the relatively rapid
fall in the cost of sequencing, followed by the undeniably
important insights gained from the annotation of sev-
eral bacterial genomes, and those of a few of our favorite
eukaryotes. The news has been so relentlessly upbeat
that one might even have expected some ‘genome
fatigue’ to set in, especially given the saturation coverage
of the publication of the drafts of the human genome
sequence 18 months ago. Not so, however; witness the
recent jockeying by different groups for inclusion of
‘their’ model organism in the next round of sequencing
projects. The honeymoon goes on.
And yet there are important issues to be addressed.
One is the concern surrounding any bestseller—that it
will have far fewer actual readers than one might expect.
At first glance, this would seem not to apply to the
human genome. After all, one is hard pressed these days
to pick up a copy of Nature Genetics, or any genetics
journal, and not find evidence that sequenced genomes
inform many of the most important advances. A survey
published last year by the Wellcome Trust, however,
found that only half of the researchers who were using
sequence data were fully conversant with the services
provided by the freely accessible databases.
There is also the concern that genome sequencers
might be victims of their own success. As computa-
tional biologist David Roos recently put it, “We are
swimming in a rapidly rising sea of data…how do we
keep from drowning?” And if geneticists and bioinfor-
maticians are struggling to stay afloat, what of the non-
geneticists who are eager to exploit the sequences but
are relative newcomers to the tools needed to navigate
all of this information?
It is with these questions in mind that we present A
User’s Guide to the Human Genome. Written by Tyra
Wolfsberg, Kris Wetterstrand, Mark Guyer, Francis
Collins and Andreas Baxevanis of the National Human
Genome Research Institute (NHGRI), this peer-
reviewed how-to manual guides the reader through
some of the basic tasks facing anyone whose work might
be facilitated by an improved understanding of the
online resources that make sense of annotated genomes.
The directors of these online resources—Ewan Birney of
Ensembl, David Haussler of the University of California,
Santa Cruz and David Lipman of the National Center for
Biotechnology Information—have served as advisors
during the development of this guide, ensuring a bal-
anced and accurate treatment of their respective web
portals. The online version of the guide will also evolve,
with an initial update scheduled for April 2003.
As noted by Harold Varmus in his eloquent perspec-
tive on A User’s Guide and the public databases it exam-
ines, one of the important legacies of the Human
Genome Project is its ethos of open access to the data. In
this spirit, and with the generous sponsorship of the
NHGRI and the Wellcome Trust, the online version of
this supplement will be freely available on the
Nature Genetics website.
Alan Packer
Nature Genetics
Spreading the word
doi:10.1038/ng961
foreword
Power to the people
doi:10.1038/ng962
The National Human Genome Research Institute of the
National Institutes of Health is delighted to sponsor this
special supplement of Nature Genetics. The primary aim
of this supplement is to provide the reader with an ele-
mentary, hands-on guide for browsing and analyzing
data produced by the International Human Genome
Sequencing Consortium, as well as data found in other
publicly available genome databases. The majority of this
supplement is devoted to a series of worked examples,
providing an overview of the types of data available and
highlighting the most common types of questions that
can be asked by searching and analyzing genomic data-
bases. These examples, which have been set in a variety of
biological contexts, provide step-by-step instructions
and strategies for using many of the most commonly
used tools for sequence-based discovery. It is hoped that
readers will grow in confidence and capability by work-
ing through the examples, understanding the underlying
concepts, and applying the strategies used in the exam-
ples to advance their own research interests.
One of the motivating factors behind the development
of this User’s Guide comes from the general sense that the
most commonly used tools for genomic analysis are still
terra incognita for the majority of biologists. Despite the
large amount of publicity surrounding the Human
Genome Project, a recent survey conducted on behalf of
the Wellcome Trust indicated that only half of biomedical
researchers using genome databases are familiar
with the tools that can be used to actually access the data.
The inherent potential underlying all of these sequence-based
data is tremendous, so the importance of all biologists
having the ability to navigate through and cull
important information from these databases cannot be
overstated.
The study of biology and medicine has truly undergone
a major transition over the last year, with the public avail-
ability of advanced draft sequences of the genomes of
Homo sapiens and Mus musculus, rapidly growing
sequence data on other organisms, and ready access to a
host of other databases on nucleic acids, proteins and
their properties. Yet for the full benefits of this dramatic
revolution to be felt, all scientists on the planet must be
empowered to use these powerful databases to unravel
longstanding scientific mysteries. As pointed out by
Harold Varmus in the Perspective, free accessibility of all
of this basic information, without restrictions, subscrip-
tion fees or other obstacles, is the most critical component
of realizing this potential. It is our modest hope that this
User’s Guide will provide another useful contribution.
Andreas D. Baxevanis and Francis S. Collins
National Human Genome Research Institute
perspective
Genomic empowerment: the importance of
public databases
doi:10.1038/ng963

Over the past twenty-five years, a mere sliver of recorded time, the
world of biology — and indeed the world in general — has been
transformed by the technical tools of a field now known as
genomics. These new methods have had at least two kinds of
effects. First, they have allowed scientists to generate extraordi-
narily useful information, including the nucleotide-by-
nucleotide description of the genetic blueprint of many of the
organisms we care about most—many infectious pathogens; useful
experimental organisms such as mice, the roundworm, the
fruit fly, and two kinds of yeast; and human beings. Second, they
have changed the way science is done: the amount of factual
knowledge has expanded so precipitously that all modern biolo-
gists using genomic methods have become dependent on com-
puter science to store, organize, search, manipulate and retrieve
the new information.
Thus biology has been revolutionized by genomic information
and by the methods that permit useful access to it. Equally
importantly, these revolutionary changes have been dissemi-
nated throughout the scientific community, and spread to other
interested parties, because many of those who practice genomics
have made a concerted effort to ensure that access is simplified
for all, including those who have not been deeply schooled in the
information sciences. The goal of providing genomic informa-
tion widely has also inevitably attracted the interests of those in
the commercial sector, and privately developed versions of vari-
ous genomes are also now available, albeit for a licensing fee.
The operative principle most prominently involved in trans-
mitting the fruits of genomics—the one that has captured the
imagination of the public and served as a standard for the sharing
of results and methods more generally in modern biology—
has been open access. Funding by public and philanthropic
organizations, such as the U.S. National Institutes of Health, the
U.S. Department of Energy, the Wellcome Trust in Britain, and
many other organizations, has made this altruistic behavior pos-
sible and has fostered the idea that genomic information about
biological species should be available to all. (Such information
about individual human beings is, of course, an entirely different
matter and should be protected by privacy rules.) The attitude of
open access to new biological knowledge has also been embodied
in the databases of the International Nucleotide Sequence Database
Collaboration, comprising the DNA Data Bank of Japan, the
European Molecular Biology Laboratory, and GenBank at the US
National Library of Medicine. The same focus on open access is
exemplified by PubMed (operated by the NLM), other gateways
to the scientific literature, and the assemblies of genomic
sequence now found at the several Web portals described in this
guide.
The Human Genome Project (HGP), which has supported the
public genome sequencing effort, has been the mainstay of the
effort to make genomes accessible to the entire community of
scientists and all citizens. This effort has, in fact, been quite natu-
rally extended to instruct the public about many themes in mod-
ern biological science. This has occurred in part because the
human genome itself has been such an exciting concept for the
public; in part because genomes are natural entry points for
teaching many of the principles of biological design, including
evolution, gene organization and expression, organismal devel-
opment, and disease; and in part because those who work on
genomes have been tireless in attempts to explain the meaning of
genes to an eager public. Endless metaphors, artistic creations,
lively journalism, monographs about social and ethical implications,
televised lectures from the White House, and many other
cultural happenings have been among the manifestations of this
fascination. In this way, the HGP has had a strong hand in raising
the public’s awareness of new ideas in biology and of the power-
ful implications of genomics in medicine, law and other societal
institutions.
Some of these cultural effects come as much from the behav-
ioral aspects of the HGP as from the genomic sequences them-
selves. The sharing of new information, even before its assembly
into publishable form, has spurred efforts to share other kinds of
research tools and has encouraged the notion of making the sci-
entific literature freely accessible through the Internet. The con-
tribution of scientists in many countries to the sequencing of
many genomes, including the human genome, has inspired
efforts to develop gene-based sciences—from basic genomics to
biotechnology—throughout the world, including the poorest
developing nations. Indeed, the World Health Organization, the
United Nations, and the World Bank have all contributed
recently to the growth of the ideas that science is both possible
and valuable in all economies and that science can be a means to
help unify the world’s population under a banner of enlighten-
ment, demonstrating a virtue of globalization.
From this perspective, the availability of the sequences of many
genomes through the Internet is a liberating notion, making
extraordinary amounts of essential information freely accessible
to anyone with a desktop computer and a link to the World Wide
Web. But the information itself is not enough to allow efficient
use. Interested people who reside outside the centers for studying
genomes need to be told where best to view the information in a
form suitable for their purposes and how to take advantage of the
software that has been provided for retrieval and analysis.
The manual before us now offers such help to those who might
otherwise have had trouble in attempting to use the products of
genomics. Furthermore, the advice is offered in that spirit of
altruism that has come to characterize the public world of
genomics. The information is provided in a highly inviting and
understandable format by casting it in the form of answers to the
questions most commonly posed when approaching big
genomes. The information, made freely available on the World
Wide Web, has been assembled by some of the best minds in the
HGP, who have generously given their time and intellect to
encourage widespread use of the great bounty that has been cre-
ated over the past two decades.
In other words, the guide to use of genomes provided here is
simply another indication that the HGP should take great pride
in much more than the sequencing of genomes.
Harold Varmus
Memorial Sloan-Kettering Cancer Center
user’s guide
A user’s guide to the human genome
doi:10.1038/ng964
The primary aim of A User’s Guide to the Human Genome is to provide the reader with an elementary hands-on
guide for browsing and analyzing data produced by the International Human Genome Sequencing Consortium
and other systematic sequencing efforts. The majority of this supplement is devoted to a series of worked examples,
providing an overview of the types of data available, details on how these data can be browsed, and step-by-step
instructions for using many of the most commonly used tools for sequence-based discovery. The major
web portals featured throughout include the National Center for Biotechnology Information Map Viewer, the
University of California, Santa Cruz Genome Browser, and the European Bioinformatics Institute’s Ensembl system,
along with many others that are discussed in the individual examples. It is hoped that readers will become more
familiar with these resources, allowing them to apply the strategies used in the examples to advance their own
research programs.
Authors
Tyra G. Wolfsberg
Kris A. Wetterstrand
Mark S. Guyer
Francis S. Collins
Andreas D. Baxevanis
National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.
e-mail:
Introduction: putting it together
doi:10.1038/ng965
In its short history, the Human Genome Project (HGP) has provided
significant advances in the understanding of gene structure
and organization, genetic variation, comparative genomics and
appreciation of the ethical, legal and social issues surrounding
the availability of human sequence data. One of the most significant
milestones in the history of this project was met in February
2001 with the announcement and publication of the draft version
of the human genome sequence¹. The significance of this
milestone cannot be overstated, as it firmly marks the entrance
of modern biology into the genome era (and not the post-genome
era, as many have stated). The potential usefulness of
this rich databank of information should not be lost on any biologist:
it provides the basis for ‘sequence-based biology’, whereby
sequence data can be used more effectively to design and interpret
experiments at the bench. The intelligent use of sequence
data from humans and model organisms, along with recent technological
innovation fostered by the HGP, will lead to important
advances in the understanding of diseases and disorders having a
genetic basis and, more importantly, in how health care is delivered
from this point forward².
Although this flood of data has enormous potential, many
investigators whose research programs stand to benefit in a tan-
gible way from the availability of this information have not
been able to capitalize on its potential. Some have found the
data difficult to use, particularly with respect to incomplete
human genome draft sequence information. Others are simply
not sufficiently conversant with the seeming myriad of data-
bases and analytical tools that have arisen over the last several
years. To assist investigators and students in navigating this
rapidly expanding information space, numerous World Wide
Web sites, courses and textbooks have become available; many
individuals, of course, also turn to their friends and colleagues
for guidance. We have prepared this Guide in that same spirit,
as an additional resource for our fellow scientists who wish to
make use (or better use) of both sequence data and the major
tools that can be used to view these data. The Guide has been
written in a practical, question-and-answer format, with step-by-step
instructions on how to approach a representative set of
problems using publicly available resources. The reader is
encouraged to work through the examples, as this is the best
way to truly learn how to navigate the resources covered and
become comfortable using them on a regular basis. We suggest
that readers keep copies of the Guide next to their computers as
an easy-to-use reference.
Before embarking on this new adventure, it is important to
review a number of basic concepts regarding the generation of
human genome sequence data. This review does not discuss the
chronological development of the HGP or provide an in-depth
treatment of its implications; the reader is referred to Nature’s
Genome Gateway for more information on these topics.
Current status of human genome sequencing
Sequencing of the human genome is nearing completion. The
target date for making the complete, high-accuracy sequence
available is April 2003, the 50th anniversary of the discovery
of the double helix³. As we go to press, however, the work is still
a mosaic of finished and draft sequence. A sequence becomes
finished when it has been determined at an accuracy of at least
99.99% and has no gaps. Sequence data that fall short of that
benchmark but can be positioned along the physical map of the
chromosomes are termed ‘draft’. Currently, 87% of the euchro-
matic fraction of the genome is finished and less than 13% is at
the draft stage.
Even in this incomplete state, the available data are extremely
useful. This usefulness was apparent early on, leading the Inter-
national Human Genome Sequencing Consortium (IHGSC) to
pursue a staged approach in sequencing the human genome. The
first stage generated draft sequence across the entire genome¹.
The project is now well advanced into its second stage, with draft
sequence being improved to ‘finished quality’ across the entire
genome, a necessarily localized process. As a result, and as it has
been presented to date, the human genome sequence is an evolv-
ing mix of both finished and unfinished regions, with the unfin-
ished regions varying in data quality. As the data are initially
made available in raw form, with subsequent refinement and
improvement, and because data of different quality are found in
different places in the genome, users must understand the kinds
of data presented by the various tools available.
Determining the human sequence: a brief overview
As with all systematic sequencing projects, the basic experimental
problem in sequencing lies in the fact that the output of a single
reaction (a ‘read’) yields about 500–800 bp¹,⁴. To determine
the sequence of a DNA molecule that is millions of bases long, it
must first be fragmented into pieces that are within an order of
magnitude of the read size. The sequence at one or both ends of
many such fragments is determined, and the pieces are then
‘assembled’ back into the long linear string from which they were
originally derived. A number of approaches for doing this have
been suggested and tested; the most commonly used is shotgun
sequencing⁴. The application of shotgun sequencing to the
multimegabase- or gigabase-sized genomes of metazoans is still
evolving. A small number of strategies are currently being evaluated,
for example, hierarchical or map-based shotgun sequencing,
whole-genome shotgun sequencing and hybrid approaches.
These approaches are described in detail elsewhere⁴.
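To make the idea of ‘assembling’ reads concrete, here is a minimal Python sketch. It is not the software used by any sequencing center: it assumes error-free reads from a single strand and simply merges the pair of sequences with the longest exact suffix-to-prefix overlap until nothing more can be joined. The function names, the 20-base toy molecule and the `min_len` threshold are all hypothetical choices made for illustration. Real assemblers must also cope with sequencing errors, reads from the opposite strand and repeated sequence.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the longest overlap.

    Returns the contigs that remain when no pair overlaps by `min_len` bases.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # a gap: the remaining contigs cannot be joined
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Toy 20-base "molecule" and three overlapping reads taken from it.
molecule = "ATGCGTACGTTAGCCTAGGA"
reads = ["TTAGCCTAGGA", "ATGCGTACG", "GTACGTTAGC"]
print(greedy_assemble(reads))
```

When every read overlaps a neighbor, the sketch returns a single contig equal to the starting molecule; when one region is not covered by overlapping reads, it returns several contigs, which mirrors the gaps found in draft sequence.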
The IHGSC’s human sequencing effort began as a purely map-based
strategy and evolved into a hybrid strategy¹. The ‘pipeline’
that the IHGSC used to generate the human sequence data
involved the following steps.
1. Bacterial artificial chromosome (BAC) clones were selected,
and a random subclone library was constructed for each one in
either an M13- or a plasmid-based vector.
2. A small number of members of the subclone library (usually
96 or 192) were sequenced to produce very-low-coverage, single-pass
or ‘phase 0’ data. These data were used for quality control
and can be found in the Genome Survey Sequence division of
the DNA Data Bank of Japan (DDBJ), the European Molecular
Biology Laboratory (EMBL) and GenBank (of the National Center
for Biotechnology Information; NCBI).
3. If a BAC clone met the requisite standard, subclones were
derived and sufficient sequence data generated from these to provide
four- to fivefold coverage (that is, enough data to represent
an average base in the BAC clone between four and five times).
This is known as ‘draft-level’ coverage, and permits the assembly
of sequence using computer programs that can detect overlaps
between the random reads from the subclones, yielding longer
‘sequence contigs’. At this stage, the sequence of a BAC clone
could typically exist on between four and ten different contigs,
only some of which were ordered and oriented with respect to
one another. The BAC ‘projects’ were submitted, within 24 hours
of having been assembled, to the High-Throughput Genomic
Sequences (HTGS) division of DDBJ/EMBL/GenBank⁵, where
each was given a unique accession number and identified with
the keyword ‘htgs_draft’. (The DDBJ, EMBL and GenBank are
members of the International Nucleotide Sequence Database
Collaboration, whose members exchange data nightly and ensure
that the sequence data generated by all public sequencing efforts
are made available to all interested parties freely and in a timely
fashion.) Less-complete high-throughput genomic (HTG)
records are also known as ‘phase 1’ records. As the sequence is
refined, it is designated ‘phase 2’. In the context of a BLAST
search at the NCBI, these sequences would be available in the
HTGS database.
4. In late 2000, the draft sequence of the entire human genome
was assembled from the sequence of 30,445 clones (BAC clones
and a relatively small number of other large-insert clones). This
assembled draft human genome sequence was published in February
2001 and made publicly available through three primary
portals: the University of California, Santa Cruz (UCSC),
Ensembl (of the European Bioinformatics Institute; EBI) and the
NCBI. The use of all three of these sites to obtain annotated
information on the human genome sequence is the primary subject
of this guide.
5. Subsequent to the generation and publication of the draft
human genome sequence, work has continued towards finishing
the sequencing. The final stage initially targeted draft-quality
BAC clones. For each of these clones, enough additional shotgun
sequence data are obtained to bring the coverage to eight- to
tenfold, a stage referred to as ‘fully topped-up’. The data from
each fully topped-up BAC are reassembled, typically resulting in
a smaller number of contigs (often in just a single contig) than
at the draft level. The new assembly is again submitted to the
HTGS division as an update of the existing BAC clone, now
identified with the keyword ‘htgs_fulltop’. The accession number
of the clone stays the same, and the version number increases by
one (AC108475.2, for example, becoming AC108475.3).
6. At this stage, there are, even for clones comprising a single
contig, typically some regions that are of insufficient quality for
the clone to be considered finished. If this is the case, the fully
topped-up sequence is analyzed by a sequence finisher (an actual
person) who collects, in a directed manner, the additional data
that are needed to close the few remaining gaps and to bring any
regions of low quality up to the finished sequence standard.
While the clone is worked on by the finisher, the HTGS entry in
GenBank is identified by the keyword ‘htgs_activefin’. Once work
on the clone has been completed, the keyword of the HTG record
is changed to ‘htgs_phase3’, the version number is once again
increased, and the record is moved from the HTGS division to
the primate division of DDBJ/EMBL/GenBank. In the context of
a BLAST search at NCBI, these finished BAC sequences would
now be available in the nr (“non-redundant”) database.
7. The finished clone sequences are then put together into a
finished chromosome sequence. As with the initial draft assem-
blies, there are a number of steps involved in this process that use
map-based and sequence-based information in calculating the
maps. The final assembly process involves identifying overlaps
between the clones and then anchoring the finished sequence
contigs to the map of the genome; details of the process can be
found on the NCBI web site ( />genome/guide/build.html).
Initially, both the UCSC and NCBI groups generated complete
assemblies of the human genome, albeit using different
approaches. As noted on the UCSC web site, the NCBI assembly
tended to have slightly better local order and orientation, whereas
the UCSC assembly tended to track the chromosome-level maps
somewhat better. Rather than having different assemblies based
on the same data, IHGSC, UCSC, Ensembl and NCBI decided
that it would be more productive (and obviously less confusing)
NCBI reference sequences
The data release and distribution practices adopted by the HGP participants have led not
only to very early, pre-publication access to this treasure trove of information, but also to a
potentially confusing variety of formats and sources for the sequence data. To address this and
other issues, the NCBI initiated the RefSeq project ( />locuslink/refseq.html).
The goal of the RefSeq effort is to provide a single reference sequence for each molecule of the
central dogma: DNA, the mRNA transcript, and the protein. The RefSeq project helps to sim-
plify the redundant information in GenBank by providing, for example, a single reference for
human glyceraldehyde-3-phosphate dehydrogenase mRNA and protein, out of the 14 or so full-
length sequences in GenBank. Each alternatively spliced transcript is represented by its own ref-
erence mRNA and protein. The RefSeq project also includes sequences of complete genomes
and whole chromosomes, and genomic sequence contigs. The human genomic contigs that
NCBI assembles, which form the basis of the presentations in the different genome browsers,
are part of the RefSeq project. Most RefSeq entries are considered provisional and are derived by
an automated process from existing GenBank records. Reviewed RefSeq entries are manually
curated and list additional publications, gene function summaries and sometimes sequence
corrections or extensions.
Reference sequences are available through NCBI resources, including Entrez, BLAST and
LocusLink. They can be easily recognized by the distinctive style of their accession numbers.
NM_###### is used to designate mRNAs, NP_###### to designate proteins and NT_###### to
designate genomic contigs. The NCBI and UCSC use alignments of the mRNA RefSeqs with the
genome to annotate the positions of known genes. Ensembl aligns mRNA RefSeqs to the
genome. The NCBI also provides model mRNA RefSeqs produced from genome annotation.
These are derived by aligning the NM_ mRNAs and other GenBank mRNAs to the assembled
genome and then extracting the genomic sequence corresponding to the transcripts. The resulting
model mRNA and model protein sequences have accession numbers of the form XM_###### and
XP_######. As the XM_ and XP_ records are derived from genomic sequence, they may differ from
the original NM_ or GenBank mRNAs because of real sequence polymorphisms, errors in the
genomic or mRNA sequences or problems in the mRNA/genomic sequence alignment. A complete
list of types of RefSeqs, along with details on how they are produced, is available from />
to focus their efforts on a single, definitive assembly. To this end,
and by agreement, the NCBI assembly will be taken as the refer-
ence human genome sequence. It is this NCBI assembly that is
displayed at the three major portals covered in this guide.
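The accession conventions described in the RefSeq box can be checked programmatically. The following is a minimal sketch rather than an NCBI tool: the function name and the regular expression (two-letter prefix, underscore, digits, optional version suffix) are assumptions made for illustration; only the five prefixes and their meanings come from this guide.

```python
import re

# Prefix meanings as given in the RefSeq box of this guide.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein",
    "NT": "genomic contig",
    "XM": "model mRNA",
    "XP": "model protein",
}

# Assumption: prefix, underscore, digits and an optional ".version" suffix.
_ACCESSION = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the RefSeq record type for an accession, or None if unrecognized."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


if __name__ == "__main__":
    for acc in ("NM_001464", "XP_123456", "AF288398"):
        print(acc, "->", classify_refseq(acc))
```

An ordinary GenBank accession such as AF288398 has no RefSeq prefix, so the sketch returns None for it.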
Annotating the assemblies
Once the assemblies have been constructed, the DNA sequence
undergoes a process known as annotation, in which useful
sequence features and other relevant experimental data are cou-
pled to the assembly. The most obvious annotation is that of
known genes. In the case of NCBI, known genes are identified by
simply aligning Reference Sequence (RefSeq) mRNAs (see box),
GenBank mRNAs, or both to the assembly. If the RefSeq or Gen-
Bank mRNA aligns to more than one location, the best align-
ment is selected. If, however, the alignments are of the same
quality, both are marked on the contig, subject to certain rules
(specifically, the transcript alignment must be at least 95% iden-
tical, with the aligned region covering 50% or more of the length,
or at least 1,000 bases). Transcript models are used to refine the
alignments. Ensembl identifies ‘best in genome’ positions for
known genes by performing alignments between all known
human proteins in the SPTREMBL database (ref. 6) and the assembly
using a fast protein-to-DNA sequence matcher (ref. 7). UCSC predicts
the location of known genes and human mRNAs by aligning RefSeq
and other GenBank mRNAs to the genome using the BLAST-like
alignment tool (BLAT) program (ref. 8). In addition to identifying
and placing known genes onto the assemblies, all of the major
genome browser sites provide ab initio gene predictions, using a
variety of prediction programs and approaches.
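The NCBI placement rule quoted above can be written as a small predicate. This is a sketch under one reading of the rule, namely identity of at least 95% together with either 50% coverage of the transcript or at least 1,000 aligned bases; the function and parameter names are hypothetical and do not correspond to any NCBI software.

```python
def passes_placement_rule(identity_pct, aligned_bases, transcript_length):
    """Apply one reading of the rule for marking a transcript alignment on a contig.

    identity_pct      -- percent identity of the alignment (0-100)
    aligned_bases     -- number of transcript bases covered by the alignment
    transcript_length -- full length of the transcript in bases
    """
    if transcript_length <= 0:
        raise ValueError("transcript_length must be positive")
    coverage = aligned_bases / transcript_length
    # Assumed reading: 50% coverage OR at least 1,000 aligned bases.
    long_enough = coverage >= 0.5 or aligned_bases >= 1000
    return identity_pct >= 95.0 and long_enough


if __name__ == "__main__":
    print(passes_placement_rule(96.0, 600, 1000))   # high identity, 60% coverage
    print(passes_placement_rule(99.0, 1200, 5000))  # low coverage, but 1,200 bases
    print(passes_placement_rule(94.9, 900, 1000))   # identity below threshold
```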
Genome annotation goes well beyond noting where known
and predicted genes are. Features found in the Ensembl, NCBI
and UCSC assemblies include, for example, the location and
placement of single-nucleotide polymorphisms, sequence-
tagged sites, expressed sequence tags, repetitive elements and
clones. Full details on the types of annotation available and the
methods underlying sequence annotation for each of these dif-
ferent types of sequence feature can be found by accessing the
URLs listed under Genome Annotation in the Web Resources
section of this guide. At UCSC, many of the annotations are pro-
vided by outside groups, and there may be a significant delay
between the release of the genome assembly and the annotation
of certain features. Furthermore, some tracks are generated for
only a limited number of assemblies. For an in-depth discussion
of genome annotation, the reader is referred to an excellent
review by Stein (ref. 9) and the references cited therein. This review,
along with the Commentary in this guide, also provides cautions
on the possible overinterpretation of genome annotation data.
The data—and sometimes the tools—change every day
The steps outlined in the previous section should emphasize
that the state of the human genome sequence will continue to be
in flux, as it will be updated daily until it has actually been
declared ‘finished’. (Finished sequence is properly defined as the
“complete sequence of a clone or genome, with an accuracy of at
least 99.99% and no gaps” (ref. 2). A more practical definition is that of
“essentially finished sequence,” meaning the complete sequence
of a clone or genome, with an accuracy of at least 99.99% and no
gaps, except those that cannot be closed by any current
method.) The reader should be mindful of this, not just when
reading this guide, but also when referring back to it over time.
Similarly, the tools used to search, visualize and analyze these
sequence data also undergo constant evolution, capitalizing on
new knowledge and new technology in increasing the usefulness
of these data to the user.
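An accuracy of at least 99.99% corresponds to at most one error in every 10,000 bases. The short sketch below turns that threshold into an upper bound on the number of errors in a sequence of a given length; the function name and the 150,000-base example length are illustrative assumptions, not figures from this guide.

```python
def max_expected_errors(length_bp, errors_per_10kb=1):
    """Upper bound on errors in a finished sequence of length_bp bases.

    The default of 1 error per 10,000 bases reflects the 99.99% accuracy threshold.
    """
    if length_bp < 0:
        raise ValueError("length_bp must be non-negative")
    return length_bp * errors_per_10kb / 10_000


if __name__ == "__main__":
    # Hypothetical clone of 150,000 bases.
    print(max_expected_errors(150_000))
```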
Over the next year, sequence producers will continue to add
finished sequence to the nucleotide sequence databases, and the
NCBI will continue to update the human sequence assembly
until its ultimate completion. The human genome sequence will,
however, continue to improve even after April 2003, as new
cloning, mapping and sequencing technologies lead to the clo-
sure of the few gaps that will remain in the euchromatic regions.
It is hoped that such technological advances will also allow for
the sequencing of heterochromatic regions, regions that cannot
be cloned or sequenced using currently available methods.

The sequence-based and functional annotations presented at
the three major genome portals will certainly continue to evolve
long after April 2003. Computational annotation is a highly
active area of research, yielding better methods for identifying
coding regions, noncoding transcribed regions and noncoding,
non-transcribed functional elements contained within the
human sequence.
Accessing human genome sequence data
Although each of the three portals through which users access
genome data has its own distinctive features, coordination
among the three ensures that the most recent version and anno-
tations of the human genome sequence are available.
Ensembl () is the product of a collab-
orative effort between the Wellcome Trust Sanger Institute and
EMBL’s European Bioinformatics Institute and provides a bioin-
formatics framework to organize biology around the sequences
of large genomes (ref. 7). It contains comprehensive human genome
annotation through ab initio gene prediction, as well as infor-
mation on putative gene function and expression. The web site
provides numerous different views of the data, which can be
either map-, gene- or protein-centric. Ensembl is actively build-
ing comparative genome sequence views, and presents data
from human, mouse, mosquito and zebrafish. In addition,
numerous sequence-based search tools are available, and the
Ensembl system itself can be downloaded for use with individ-
ual sequencing projects.
The UCSC Genome Browser () was
originally developed by a relatively small academic research
group that was responsible for the first human genome assemblies.
The genome can be viewed at any scale, and the display is based on
the intuitive idea of overlaying ‘tracks’ onto the human
genome sequence; these annotation tracks include, for exam-
ple, known genes, predicted genes and possible patterns of
alternative splicing. There is also an emphasis on comparative
genomics, with mouse genomic alignments being available.
The browser also provides access to an interactive version of
the BLAT algorithm (ref. 8), which UCSC uses for RNA and comparative
genomic alignments.
Given its Congressional mandate to store and analyze biologi-
cal data and to facilitate the use of databases by the research com-
munity, the NCBI () serves as a
central hub for genome-related resources. NCBI maintains Gen-
Bank, which stores sequence data, including that generated by
the HGP and other systematic sequencing projects. NCBI’s Map
Viewer provides a tool through which information such as exper-
imentally verified genes, predicted genes, genomic markers,
physical maps, genetic maps and sequence variation data can be
visualized. The Map Viewer is linked to other NCBI tools—for
example, Entrez, the integrated information retrieval system that
provides access to numerous component databases.
Although we have chosen to illustrate each example using
resources available at a single site, almost all the questions in this
guide can be answered using any of the three browsers. The
informational sidebars that follow some of the questions provide
pointers on how to format the search at other sites. Furthermore,
the three sites link to each other wherever possible. Examples
presented in this Guide rely on the data and genome browser
interfaces that were available in June 2002. As new versions of the
genome assembly and viewing tools will come online every few
months, the specifics of some of the examples may change over
time. Regardless, the basic strategies behind answering the ques-
tions in the examples will remain the same. This underscores the
importance of readers working through the examples at their
own computers so that they may understand and be able to navigate
these public databases. Readers are encouraged to
explore the alternative methods for answering the questions.
Browser problems?
In following the question-and-answer portion of this guide,
some readers may find that their web browsers are not able
to render the web pages properly. If this occurs, do one or
more of the following:
1. Install the most recent version of either Netscape Navi-
gator or Internet Explorer.
2. Increase the amount of memory available to the web
browser.
3. Try a different web browser. In general, Macintosh users
who seek to gain access to these three genome portals will see
better performance with Internet Explorer.
Question 1

How does one find a gene of interest and determine that gene’s struc-
ture? Once the gene has been located on the map, how does one easily
examine other genes in that same region?
doi:10.1038/ng966
This question serves as a basic introduction to the three major
genome viewers. One gene, ADAM2, will be examined using
all three sites so that the reader can gain an appreciation of
the subtle differences in information presented at each of
these sites.
National Center for Biotechnology Information Map Viewer
The NCBI Human Map Viewer can be accessed from the NCBI’s
home page, at . Follow the hyper-
link in the right-hand column labeled Human map viewer to go
to the Map Viewer home page. The notation at the top of the
page indicates that this is Build 29, or the NCBI’s 29th assembly
of the human genome. Build 29 is based on sequence data from 5
April 2002. The previous genome assembly, Build 28, was based
on sequence data from 24 December 2001. To search for any
mapped element, such as a gene symbol, GenBank accession
number, marker name or disease name, enter that term in the
Search for box and then press Find. For this example, enter
‘ADAM2’ and then press Find. The on chromosome(s) box may be
left blank for text-based searches such as this one.
The resulting overview page shows a schematic of all of the
human chromosomes, pinpointing the position of ADAM2 to
the p arm of chromosome 8 (Fig. 1.1). The search results section
shows that the gene exists on two NCBI maps, Genes_cyto and
Genes_seq. Genes_cyto refers to the cytogenetic map, whereas
Genes_seq refers to the sequence map. Clicking on either of those
two links opens a view of just that map.
Detailed descriptions of these and other NCBI maps are
available at />humansearch.html. To get the most general overview of the
genomic context of ADAM2, including all available maps, click
on the item in the Map element column (in this case, ADAM2).
This view shows ADAM2 and a bit of flanking sequence on chro-
mosome 8p11.2 (Fig. 1.2). Three maps are displayed in this view,
each of which will be discussed below. Additional maps, dis-
cussed in other examples in this guide, can be added to this view
using the Maps & Options link.
The rightmost map is the master map, the map providing the
most detail. The master map in this case is the Genes_seq map,
which depicts the intron/exon organization of ADAM2 and is
created by aligning the ADAM2 mRNA to the genome. The gene
appears to have 14 exons. The vertical arrow next to the ADAM2
gene symbol (within the pink box) shows the direction in which
the gene is transcribed. The gene symbol itself is linked to
LocusLink, an NCBI resource that provides comprehensive
information about the gene, including aliases, nucleotide and
protein sequences, and links to other resources (ref. 10; see Question
10). The links to the right of the gene symbol point to additional
information about the gene.
• sv, or sequence view, shows the position of the gene in the
context of the genomic contig, including the nucleotide and
encoded protein sequences.
• ev brings the user to the evidence viewer, a view that displays
the biological evidence supporting a particular gene model.
This view shows all RefSeq models, GenBank mRNAs, transcripts
(whether annotated, known or potential) and
expressed sequence tags (ESTs) aligning to this genomic con-
tig. More information on the evidence viewer can be found
on the NCBI web site by clicking Evidence Viewer Help on any
ev report page.
• hm is a link to the NCBI’s Human–Mouse Homology Map,
showing genome sequences with predicted orthology
between mouse and human (Fig. 12.2).
• seq allows the user to retrieve the genomic sequence of the
region in text format. The region of sequence displayed can
easily be changed.
• mm is a link to the Model Maker, which shows the exons that
result when GenBank mRNAs, ESTs and gene predictions are
aligned to the genomic sequence. The user can then select
individual exons to create a customized model of the gene.
More information on the Model Maker can be found on the
NCBI web site by clicking help on any mm report page.
The UniG_Hs map shows human UniGene clusters that have
been aligned to the genome. The gray histogram depicts the
number of aligning ESTs and the blue lines show the mapping of
UniGene clusters to the genome. The thick blue bars are regions
of alignment (that is, exons) and the thin blue lines indicate
potential introns. In this example, the mapping of UniGene clus-
ter Hs.177959 to the genome follows that of ADAM2, and all the
exons align.
The Genes_cyto map shows genes that have been mapped
cytogenetically; the orange bar shows the position of the gene.
Although ADAM2 has been finely mapped and is represented by
a short line, other genes, such as the group below it on a longer
line, have been cytogenetically mapped to broader regions of
chromosome 8.
Clicking on the zoom control in the blue sidebar allows the
user to zoom out to view a larger region of chromosome 8.
Zooming out one level shows 1/100th of the chromosome. There
are 20 genes in the region, and all 20 are labeled (displayed) in
this view (Fig. 1.3). The region of ADAM2 is highlighted in red
on all maps. On the basis of the Genes_seq map, ADAM2 is
located between ADAM18 and LOC206849.
University of California, Santa Cruz Genome Browser
The home page for the UCSC Genome Browser is http://genome.ucsc.edu/.
At present, UCSC provides browsers not only for the
most recent version of the mouse and human genome data, but
also for several earlier assemblies. To use the Genome Browser,
select the appropriate organism from the pull-down menu at the
top of the blue sidebar (Human, in this case) and then click the
link labeled Browser. On the resulting page, select the version of
the human assembly to view. The genome browser from August
2001 is based on an assembly of the human genome done by
UCSC using sequence data available on that date. The Dec. 2001
browser displays annotations based on NCBI’s build 28 of the
human genome, and the Apr. 2002 browser displays annotations
on NCBI’s build 29. As the annotations presented in this most
recent human assembly are not yet as comprehensive as those
from the December 2001 assembly, the examples in this text are
based on the earlier assembly. Select Dec. 2001 from the pull-
down menu to access the assembly from that date (Fig. 1.4).
Supported types of queries are listed below the text input
boxes. Enter ‘ADAM2’ in the box labeled position and then
click Submit. The results of this search are presented in two
categories, Known Genes and mRNA Associated Search Results
(Fig. 1.5). The section marked Known Genes shows the map-
ping of the NCBI Reference mRNA sequences to the genome.
The mRNA Associated Search Results represent the mapping of
other GenBank mRNA sequences to the genome. Click on the
Known Genes link for ADAM2 (arrow, Fig. 1.5) to see the
genomic context of the ADAM2 mRNA Reference Sequence
(NM_001464).
The resulting zoomed-in view shows a region of chromosome
8 from base pair 36234934 to 36280132, located within 8p12
(Fig. 1.6). The blue track entitled Known Genes (from RefSeq)
shows the intron–exon structure of known genes. The vertical
boxes indicate exons and the horizontal lines introns. The
ADAM2 gene seems to have 14 exons. The direction of transcrip-
tion is indicated by the arrowheads on the introns. The tracks
labeled Acembly Gene Predictions, Ensembl Gene Predictions
and Fgenesh++ Gene Predictions are the results of gene predic-
tions (see Question 7). Alignments of other database nucleotide
sequences are shown in the Human mRNAs from GenBank,
spliced EST, UniGene and Nonhuman mRNAs from GenBank
tracks. Translated alignments of mouse and Tetraodon genomic
sequence are in the mouse and fish BLAT tracks. Tracks display-
ing single-nucleotide polymorphisms (SNPs), repetitive ele-
ments and microarray data are shown at the bottom. Additional
details about each track are available by selecting the track name
in the Track Controls at the bottom.
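The intron–exon display described above (boxes for exons, connecting lines for introns) maps naturally onto a list of intervals. The sketch below derives intron coordinates from a list of exon coordinates; the function name and the coordinates in the example are hypothetical and are not the actual ADAM2 exon positions.

```python
def introns_from_exons(exons):
    """Return intron intervals lying between consecutive exons.

    exons -- iterable of (start, end) pairs, 1-based and inclusive.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start <= prev_end + 1:
            raise ValueError("exons must be separated by at least one base")
        introns.append((prev_end + 1, next_start - 1))
    return introns


if __name__ == "__main__":
    # Hypothetical three-exon gene model.
    model = [(100, 200), (300, 350), (500, 600)]
    print(introns_from_exons(model))
```

A gene model with 14 exons, such as the one shown for ADAM2, therefore has 13 introns.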
To view the genomic context of ADAM2, zoom out 10× by
clicking on the zoom out 10× box in the upper right corner.
ADAM2 is located between TEM5 and ADAM18 (Fig. 1.7).
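The effect of zooming out on the displayed coordinates can be estimated with simple arithmetic. The sketch below widens a window about its midpoint by a given factor; it is an illustration only, and the browser's own rounding and handling of chromosome ends may differ. The function name is hypothetical; the starting coordinates are those of the zoomed-in ADAM2 view described above.

```python
def zoom_out(start, end, factor):
    """Widen the window [start, end] (1-based, inclusive) about its midpoint."""
    if end < start or factor < 1:
        raise ValueError("expected start <= end and factor >= 1")
    width = end - start + 1
    new_width = width * factor
    center = (start + end) // 2
    new_start = max(1, center - new_width // 2)  # never run past the chromosome start
    new_end = new_start + new_width - 1
    return new_start, new_end


if __name__ == "__main__":
    # Chromosome 8 window from the zoomed-in ADAM2 view.
    print(zoom_out(36234934, 36280132, 10))
```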
Ensembl
The Ensembl project (ref. 7) provides
genome browsers for four species: human, mouse, zebrafish and
mosquito. Click on Human to view the main entry point for the
human genome. The current version of human Ensembl is ver-
sion 6.28.1, based on the NCBI’s 28th build of the genome. To
perform a text search, enter ‘ADAM2’ in the text box, and limit
the search by selecting Gene from the pull-down search. Click on
the upper button labeled Lookup. A single result is returned with
a link to the ADAM2 gene (Fig. 1.8).
Click on either of the ADAM2 links to retrieve the GeneView
window. The returned page contains four sections of data. The
first section (Fig. 1.9) is an overview of ADAM2, including links
to accession numbers and protein domains and families. Links to
the Ensembl view of highly similar mouse sequences are pre-
sented in the Homology Matches section. Some of these fields will
be described in more detail in later examples. The second section
of the GeneView window provides information on the gene tran-
script (Fig. 1.10). The sequence of the cDNA is shown, as is a
graphic of its intron–exon structure. A limited amount of the
genomic context around the gene is shown schematically as well.
Exon sequences are shown in the third section of the GeneView
(Fig. 1.11) and splice sites in the fourth (Fig. 1.12). If more than
one transcript is predicted for the gene, each is allocated its own
transcript, exon and splice-site sections.

The complete genomic context of ADAM2 is viewed by return-
ing to the first section of the GeneView (Fig. 1.9) and clicking on
one of the two links within the Genomic Location box. The top
portion of the resulting ContigView (Fig. 1.13) depicts the chro-
mosome, with the region of interest outlined in red. The
Overview shows the genomic context of the gene, including the
chromosome bands, contigs, markers and genes that map near
8p12. Clicking on any of these items recenters the display around
that item. The section of interest is boxed in red on the
DNA(contigs) map. The genes annotated by Ensembl as being
around ADAM2 are Q96KB2 and ADAM18.
The bottom panel of the ContigView, the Detailed View
(Fig. 1.14), shows a zoomed-in view of the boxed region, high-
lighting all features that have been mapped to this region of the
human genome. The navigator buttons between the Overview
and the Detailed View move the display to the left and right and
zoom in and out. The features to be displayed can be changed
by selecting the Features pull-down menu and then checking
which features to view.
The Features shown in Fig. 1.14 are the defaults. The DNA
(contigs) map separates items on the forward strand (above)
from those on the reverse (below). The only feature on the
reverse strand in this view is a single Genscan transcript, predicted
by the GENSCAN gene prediction program (ref. 11; see Question 7).
The forward strand shows five types of features. Starting
at the bottom, the ADAM2 transcript is shown in red, indicating
that it is a known transcript corresponding to a near-full-length
cDNA sequence, protein sequence or both already available in
the public sequence database. Black transcripts are predicted
based on EST or protein sequence similarity. EST Transcr. links to
individual aligning ESTs, whereas the UniGene track near the top
displays UniGene clusters. The Genscan model on the forward
strand contains many exons found in the known transcript. The
Proteins and Human proteins boxes indicate protein sequences
that align to this version of the genome, whereas NCBI Transcr.
links to the NCBI Map Viewer. Positioning the computer mouse
over any feature brings up the feature’s name and links to more
detailed information.
The NCBI, UCSC and Ensembl sometimes use different sym-
bols for the same genes, so it can be difficult to compare the
views obtained by the different browsers. Furthermore, the
three sites maintain independent annotation pipelines and do
not all attempt to align the same mRNA sequences to the
genome. The NCBI is currently displaying build 29, Ensembl
shows build 28, and UCSC offers both builds 28 (December
2001) and 29 (April 2002), although all examples from UCSC in
this guide will be illustrated using the better-annotated build
28. Because of the differences between the two assemblies, there
are subtle discrepancies between what is shown at the NCBI and
what is available at UCSC and Ensembl. However, it is fairly
easy to navigate among the three sites. The NCBI, for example,
links to Ensembl and UCSC through the black boxes at the top
of LocusLink entries for human genes, and Ensembl directs
users to NCBI and UCSC through the “Jump to” link in its Con-
tigView. Some versions of UCSC’s Genome Browser have links
to Ensembl and NCBI’s Map Viewer in the blue bar at the top of
each browser page.
Figures 1.1–1.14
Question 2
How can sequence-tagged sites within a DNA sequence be identified?
doi:10.1038/ng967
The NCBI’s electronic PCR (e-PCR) tool (ref. 12), which is part of the
UniSTS resource, can be used to find STS markers within a DNA
fragment of interest. UniSTS ( />genome/sts/) contains all the available data on STS markers,
including primer sequences, product size, mapping information
and alternative names. Links to other NCBI resources such as
Entrez, LocusLink and the MapViewer are also provided. e-PCR
looks for potential STSs in a DNA sequence by searching for sub-
sequences with the correct orientation and distance that could
represent the PCR primers used to generate known STSs.
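The search strategy just described can be sketched in a few lines. The code below is a simplified illustration and not the NCBI's e-PCR implementation: it uses exact string matching on one strand only, and the function names, the tolerance parameter and the toy sequence are all assumptions made for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_sts(sequence, forward_primer, reverse_primer, expected_size, tolerance=0):
    """Return (start, end) positions (1-based) of candidate STS amplicons.

    A candidate is a forward-primer match followed downstream by the reverse
    complement of the reverse primer, spanning the expected product size.
    """
    seq = sequence.upper()
    fwd = forward_primer.upper()
    rev_site = reverse_complement(reverse_primer)
    hits = []
    pos = seq.find(fwd)
    while pos != -1:
        rpos = seq.find(rev_site, pos + len(fwd))
        while rpos != -1:
            product_size = rpos + len(rev_site) - pos
            if abs(product_size - expected_size) <= tolerance:
                hits.append((pos + 1, rpos + len(rev_site)))
            rpos = seq.find(rev_site, rpos + 1)
        pos = seq.find(fwd, pos + 1)
    return hits


if __name__ == "__main__":
    # Toy sequence: forward primer site, 8 spacer bases, reverse primer site.
    toy = "TTTT" + "ACGTAC" + "GGGGGGGG" + "GATTCC" + "AAAA"
    print(find_sts(toy, "ACGTAC", "GGAATC", expected_size=20))
```

Requiring both the orientation (forward site upstream of the reverse-complemented site) and the product size keeps chance matches to a single primer from being reported as STSs.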
The e-PCR home page can be found by going to the NCBI
home page, at , and then following
the Electronic PCR link in the right-hand column. On the e-PCR
home page, paste the sequence of interest or enter an accession
number into the large text box at the top of the page. The accession
number of the sequence for this example is AF288398. This
sequence contains only one STS, stSG47693, which is located
between nucleotides (nt) 2102 and 2232 of the sequence under
study (Fig. 2.1).
Click on the marker name to bring up details of the STS from
UniSTS (Fig. 2.2). The primer information and PCR product size
are listed at the top of the page, along with alternative names for
the marker. Often STSs are known by different names on differ-
ent maps. Cross-references to LocusLink, UniGene and the
Genebridge 4 map to which this STS was mapped are shown
next. The mapping information section contains links to the
NCBI’s MapViewer. At the bottom of the page, the Electronic
PCR results show other sequences, including contigs, mRNAs
and ESTs that may contain this STS marker.
To see the genomic context of the STS marker in all maps to
which it has been mapped, click on the link labeled MapViewer
at the top of the Mapping Information section. This map view
(Fig. 2.3) shows two maps. Note that, in this view, the STS
stSG47693 is called RH92759 (highlighted in pink). Gene
Map ’99–Genebridge 4 (GM99_GB4, left) has 46,000 STS mark-
ers mapped onto the GB4 RH panel by the International
Radiation Hybrid Consortium. The STS map (right) shows the
NCBI’s placement of STSs onto the genome sequence assembly
using e-PCR. Gray lines connect markers that appear in both
maps, whereas the red line denotes where the STS RH92759
appears on both maps. In the region shown, there are a total of
211 STSs on the STS map, but only 20 are labeled in this view. To
the right of the STS map, the green and yellow circles show the
maps on which the STS markers have been placed. One can
zoom in or out of this view by clicking on the lines of the zoom
tool in the left sidebar.
Figures 2.1–2.3
Question 3
During a positional cloning project aimed at finding a human disease
gene, linkage data have been obtained suggesting that the gene of
interest lies between two sequence-tagged site markers. How can all
the known and predicted candidate genes in this interval be identified?
What BAC clones cover that particular region?
doi:10.1038/ng968
UCSC
One possible starting point for this search is the UCSC Genome
Browser home page, at . From this page,
select Human from the Organism pull-down menu in the blue
bar at the side of the page, and then click Browser. On the Human
Genome Browser Gateway page, change the assembly pull-down
to Dec. 2001. To view a region of the genome between two query
terms, enter the terms in the search box, separated by a semicolon.
For example, to view the region between STS markers
D10S1676 and D10S1675, enter ‘D10S1676;D10S1675’ in the box
marked position and press Submit. Because both of these markers
map to a single position in the genome, the genome browser for
the region between those markers is returned (Fig. 3.1).
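The logic behind a two-marker query can be illustrated as follows: a region is returned only when each marker has a single placement and both lie on the same chromosome. This is a sketch of that idea, not UCSC code; the function name, the data layout and all coordinates are hypothetical.

```python
def region_between(marker_hits, first, second):
    """Return (chrom, start, end) spanning two uniquely mapped markers, else None.

    marker_hits -- dict mapping marker name to a list of (chrom, start, end) placements.
    """
    first_hits = marker_hits.get(first, [])
    second_hits = marker_hits.get(second, [])
    # Both markers must map to exactly one position.
    if len(first_hits) != 1 or len(second_hits) != 1:
        return None
    (chrom_a, start_a, end_a) = first_hits[0]
    (chrom_b, start_b, end_b) = second_hits[0]
    if chrom_a != chrom_b:
        return None
    return chrom_a, min(start_a, start_b), max(end_a, end_b)


if __name__ == "__main__":
    # Hypothetical placements; real coordinates depend on the assembly.
    hits = {
        "D10S1676": [("chr10", 1000, 1200)],
        "D10S1675": [("chr10", 5000, 5150)],
    }
    query = ";".join(["D10S1676", "D10S1675"])  # the form typed into the position box
    print(query, region_between(hits, "D10S1676", "D10S1675"))
```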
The STS Markers track displays genetically mapped markers in
blue and radiation hybrid–mapped markers in black. Click on
the STS Markers label to expand that track and see each marker
listed individually (Fig. 3.2). The markers of interest are called by
their alternate names (AFMA232YH9 and AFMA230VA9 in this
view) and are at the top and bottom of the interval, respectively
(Fig. 3.2, arrows).
The full list of known genes in this display is shown in the
Known Genes track (Fig. 3.1). These protein-coding genes are
taken from the RefSeq mRNA sequences compiled at the NCBI (ref. 10)
and aligned to the genome assembly using the BLAT program (ref. 8). To
export a list of the genes, or other features, in this region, click the
Tables link in the top blue bar. For more information about a par-
ticular gene (such as MGMT), click on the gene symbol to get a list
of additional links to resources such as Online Mendelian Inheri-
tance in Man (OMIM), PubMed, GeneCards and Mouse Genome
Informatics (MGI; Fig. 3.3). Many tracks, including Acembly
Genes, Ensembl Genes and Fgenesh++ Genes, indicate predicted
genes (see Question 7). To view the full set of features in any of
these categories, click on the title of that track on the left side of the
screen in Fig. 3.1. To view brief descriptions of these tracks, as well
as others not mentioned, click on the gray box to the left of the
track or scroll down to Track Controls and click on the title of a feature
of interest. Explanations of the gene-prediction programs can
be found in Question 7. Reset the browser to its default settings by
clicking on the reset all button below the tracks.
To see the BAC clones used for sequencing, return to the page
illustrated in Fig. 3.1 and click on Coverage at the left side of the
screen to expand that track. Here BAC clones are listed individu-
ally, with finished regions shown in black and draft regions
shown in various shades of gray (Fig. 3.4). For details such as size
and sequence coverage of a specific clone, click on the clone
accession number (such as AL355529.21, arrow). From this
screen, click on the accession number (as shown in Fig. 3.5) to
link to the NCBI Entrez document summary for the clone. The
full GenBank entry can be viewed by clicking on AL355529 on
the Entrez document summary page.
According to NCBI naming conventions, this clone is from the
RP11 library and has been named 85C15. RP11 is the NCBI desig-
nation for RPCI-11, a commonly used human BAC library pro-
duced at the Roswell Park Cancer Institute. More information
on the naming conventions of genomic sequencing libraries
can be found at the NCBI’s Clone Registry (Fig. 3.6; />).
Clone ordering information is also available, at
http://www.ncbi.nlm.nih.gov/genome/clone/ordering.html.
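The identifiers in this example can be taken apart with a short script. The sketch below splits a versioned accession such as AL355529.21 into accession and version, and a clone name such as RP11-85C15 into its library and clone parts. Reading 85C15 as plate, row and column is an assumption based on common clone-naming practice and is not stated in this guide; the function names are hypothetical.

```python
import re

# Assumed clone-name layout: <library>-<plate><row letter><column>, e.g. RP11-85C15.
_CLONE = re.compile(r"^([A-Za-z0-9]+)-(\d+)([A-Z])(\d+)$")


def split_accession_version(identifier):
    """Split 'AL355529.21' into ('AL355529', 21); version is None if absent."""
    accession, dot, version = identifier.strip().partition(".")
    if not dot:
        return accession, None
    if not version.isdigit():
        raise ValueError(f"unexpected version suffix in {identifier!r}")
    return accession, int(version)


def parse_clone_name(name):
    """Split a clone name into library, plate, row and column (assumed layout)."""
    match = _CLONE.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized clone name: {name!r}")
    library, plate, row, column = match.groups()
    return {"library": library, "plate": int(plate), "row": row, "column": int(column)}


if __name__ == "__main__":
    print(split_accession_version("AL355529.21"))
    print(parse_clone_name("RP11-85C15"))
```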
NCBI
The NCBI MapViewer allows for direct viewing of the region
between two markers, as long as both markers are on the master
map. If, for example, the master map is a cytogenetic one, one
can search chromosome 22 for the region between band numbers
22q12.1 and 22q13.2. If the master map is Genes_seq, one
can view the region between two mapped genes.

Access the Map Viewer home page by starting at the NCBI
home page () and clicking
Human map viewer in the list on the right-hand side of the
page. To view multiple hits on the same chromosome, type in
the search terms separated by the word ‘OR’. To see the same
region between the STS markers D10S1676 and D10S1675, for
example, type ‘D10S1676 OR D10S1675’ in the search box, and
hit Find. At the top of the resulting page (Fig. 3.7), two red tick
marks on the chromosome cartoon indicate that the markers
map close to each other on chromosome 10. The search results
at the bottom of the page show the alternative names for the
two markers (AFMA232YH9 and AFMA230VA9) as well as the
maps on which they have been placed. To view both markers at
the same time, click on the link for chromosome 10 in the
chromosome diagram. Fig. 3.8 shows the region around
D10S1676 and D10S1675, with the original queries high-
lighted in pink. Red lines connect the positions of the marker
on the different maps.

One can also search for a region between two STS markers
using the MapView at Ensembl. Start at the Ensembl Human
Genome Browser at />ens/, click on the idiogram of any chromosome to access the
MapView, and enter the marker names in the Jump to Contigview
section. To use Ensembl to obtain a list of genes (or
other annotations) in a defined chromosomal region, click on
Export→Gene List from any ContigView window (Fig. 1.14,
center yellow bar).

The Maps & Options link, in the horizontal blue bar near the
top of the page, allows the user to customize the maps and region
displayed. To view, for example, the known and predicted genes
in this region, as well as the BAC clones from which the sequence
was derived, click on the link to open the Maps & Options win-
dow (Fig. 3.9). First remove all the maps except Gene and STS
from the Maps Displayed box by highlighting them, and selecting
<<REMOVE. Next, add the Transcript (RNA), GenomeScan,
Component and Contig maps by selecting them from the Avail-
able Maps box and selecting ADD>>. Make the STS map the
master by highlighting it, then selecting Make Master/Move to
Bottom. To limit the view such that only the STSs between
D10S1676 and D10S1675 are shown, type the marker names in
the Region Shown boxes. Hit Apply to see the aligned maps. In
some cases, it may be useful to select a page size larger than the
default of 20 to view more data in the browser window.
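The effect of the Region Shown boxes can be pictured as an interval filter: everything that overlaps the stretch bounded by the two markers is kept, and everything else is hidden. The following is a minimal sketch of that idea, not the Map Viewer's own implementation. The `Feature` type, the function name and all coordinates are hypothetical and chosen only for illustration; only the two marker names come from the text.

```python
from typing import NamedTuple

class Feature(NamedTuple):
    name: str
    kind: str   # e.g. "STS", "gene", "BAC"
    start: int  # made-up base-pair coordinates
    end: int

def features_between(features, marker_a, marker_b):
    """Return the features overlapping the interval bounded by two named markers."""
    by_name = {f.name: f for f in features}
    a, b = by_name[marker_a], by_name[marker_b]
    # The markers may be given in either order, so take the outer bounds.
    lo = min(a.start, b.start)
    hi = max(a.end, b.end)
    hits = [f for f in features if f.end >= lo and f.start <= hi]
    return sorted(hits, key=lambda f: f.start)

# Hypothetical features; the coordinates are not real map positions.
FEATURES = [
    Feature("D10S1676", "STS", 1_000, 1_200),
    Feature("GENE_A", "gene", 5_000, 9_000),
    Feature("BAC_1", "BAC", 800, 150_000),
    Feature("D10S1675", "STS", 20_000, 20_250),
    Feature("GENE_B", "gene", 30_000, 42_000),
]

region = features_between(FEATURES, "D10S1675", "D10S1676")
print([f.name for f in region])
```

A clone that extends beyond the markers (here the hypothetical `BAC_1`) is still reported, because it overlaps the interval, whereas a gene lying wholly outside it (`GENE_B`) is not.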
Fig. 3.10 shows the maps, as specified in the Maps & Options
window. The green dots to the right of the STS map show all the
maps on which the markers appear. This is a fairly long region of
chromosome 10, and not every STS marker is shown. In particu-
lar, although there are 611 STSs in this region, only 20 are shown
by name in this view. For each known gene, the Genes_Seq map
shows all the exons that have been mapped to the genome. Exons
for individual known mRNAs are shown on the RNA (Tran-
script) map. Unless a gene is alternatively spliced, the Genes_Seq
and RNA maps will be the same. The GScan (GenomeScan) map
shows the NCBI’s gene predictions. Any of these genes, known or
predicted, are candidates for the disease gene.
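The relationship between the Genes_Seq and RNA maps can be illustrated with a short sketch. It assumes, as a simplification, that the gene-level map is the set of all distinct exons found in any transcript of the gene; the transcript names, coordinates and function names below are hypothetical.

```python
def gene_exon_map(transcripts):
    """Collect every distinct exon seen in any transcript of one gene."""
    return sorted({exon for exons in transcripts.values() for exon in exons})

def is_alternatively_spliced(transcripts):
    """True if at least two transcripts use different exon sets."""
    distinct = {tuple(sorted(exons)) for exons in transcripts.values()}
    return len(distinct) > 1

# Hypothetical gene with two mRNAs; exons are (start, end) pairs.
TRANSCRIPTS = {
    "mRNA_1": [(100, 200), (300, 400), (500, 600)],
    "mRNA_2": [(100, 200), (500, 600)],  # skips the middle exon
}

print(gene_exon_map(TRANSCRIPTS))
print(is_alternatively_spliced(TRANSCRIPTS))
```

For a gene with a single transcript, `gene_exon_map` returns exactly that transcript's exons, which is the case in which the two maps look the same.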
The NCBI’s assembled contigs, also known as the NT contigs,
are found in the Contig map. Blue segments come from finished
sequence, orange from draft. These contigs are constructed from
the individual GenBank sequence entries shown in the Comp
(Component) map. Draft HTG records (phase 1 and 2) are
displayed in orange and finished HTGs in blue. Most of these GenBank entries are
derived from BAC clones. The tiling paths of the BAC clones that
were assembled into contigs are clearly visible. One can obtain
more details about an entry, including the clone name, by click-
ing on the accession number to link to Entrez. The clone name is
visible directly in the MapViewer if the Comp map is the master.
A map can be quickly made the master map by clicking on the
blue arrow next to its name.
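The color convention of the Comp map can be written down as a simple lookup. This is only a sketch of the rule described above. Treating phase 3 as the finished state follows GenBank's HTG phase numbering rather than anything stated in this section, and the entry labels are hypothetical placeholders, not real accession numbers.

```python
# Colors as described in the text: draft HTG records (phase 1 and 2) in
# orange, finished HTGs in blue. Mapping "finished" to phase 3 is an
# assumption based on GenBank's HTG phase numbering.
DRAFT_PHASES = {1, 2}
FINISHED_PHASE = 3

def display_color(htg_phase):
    """Return the display color for a GenBank entry of the given HTG phase."""
    if htg_phase in DRAFT_PHASES:
        return "orange"
    if htg_phase == FINISHED_PHASE:
        return "blue"
    # The text does not describe a color for any other phase.
    raise ValueError(f"no color described for HTG phase {htg_phase}")

# Hypothetical entries and phases, for illustration only.
components = {"ENTRY_1": 3, "ENTRY_2": 1, "ENTRY_3": 2}
for accession, phase in components.items():
    print(accession, display_color(phase))
```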
Because this is a zoomed-out view of the chromosome, indi-
vidual genes and GenBank entries are difficult to visualize.
Zooming in, using the controls in the blue sidebar, will show
the region in more detail. Alternatively, click on the Data As
Table View in the left sidebar to retrieve all data, including
those hidden in this view, as a text-based table (partially shown
in Fig. 3.11).
Figure 3.1
Figure 3.2