Preface
A central challenge of the post-genomic era is to understand how the 30,000 to
40,000 unique genes in the human genome are selectively expressed or silenced
to coordinate cellular growth and differentiation. The packaging of eukaryotic
genomes in a complex of DNA, histones, and nonhistone proteins called
chromatin provides a surprisingly sophisticated system that plays a critical role
in controlling the flow of genetic information. This packaging system has
evolved to index our genomes such that certain genes become readily accessible
to the transcription machinery, while other genes are reversibly silenced.
Moreover, chromatin-based mechanisms of gene regulation, often involving
domains of covalent modifications of DNA and histones, can be inherited from
one generation to the next. The heritability of chromatin states in the absence
of DNA mutation has contributed greatly to the current excitement in the field
of epigenetics.
The past 5 years have witnessed an explosion of new research on chromatin
biology and biochemistry. Chromatin structure and function are now widely
recognized as being critical to regulating gene expression, maintaining genomic
stability, and ensuring faithful chromosome transmission. Moreover, links
between chromatin metabolism and disease are beginning to emerge. The
identification of altered DNA methylation and histone acetylase activity in human
cancers, the use of histone deacetylase inhibitors in the treatment of leukemia,
and the tumor suppressor activities of ATP-dependent chromatin remodeling
enzymes are examples that likely represent just the tip of the iceberg.
As a result, the field is attracting new investigators who enter with little
firsthand experience with the standard assays used to dissect chromatin
structure and function. In addition, even seasoned veterans are overwhelmed by
the rapid introduction of new chromatin technologies. Accordingly, we sought to
bring together a useful “go-to” set of chromatin-based methods that would
update and complement two previous publications in this series, Volume 170
(Nucleosomes) and Volume 304 (Chromatin). While many of the classic protocols
in those volumes remain as timely now as when they were written, it is our
hope that the present series will fill in the gaps for the next several years.
This three-volume set of Methods in Enzymology provides nearly one hundred
procedures covering the full range of tools (bioinformatics, structural
biology, biophysics, biochemistry, genetics, and cell biology) employed in
chromatin research. Volume 375 includes a histone database, methods for
preparation of histones, histone variants, modified histones and defined
chromatin segments, protocols for nucleosome reconstitution and analysis, and
cytological methods for imaging chromatin functions in vivo. Volume 376
includes electron microscopy and biophysical protocols for visualizing
chromatin and detecting chromatin interactions, enzymological assays for
histone modifying enzymes, and immunochemical protocols for the in situ
detection of histone modifications and chromatin proteins. Volume 377 includes
genetic assays of histones and chromatin regulators, methods for the
preparation and analysis of histone modifying and ATP-dependent chromatin
remodeling enzymes, and assays for transcription and DNA repair on chromatin
templates. We are exceedingly grateful to the very large number of colleagues,
representing the field’s leading laboratories, who have taken the time and
effort to make their technical expertise available in this series.
Finally, we wish to take the opportunity to remember Vincent Allfrey,
Andrei Mirzabekov, Harold Weintraub, Abraham Worcel, and especially Alan
Wolffe, co-editor of Volume 304 (Chromatin). All of these individuals had key
roles in shaping the chromatin field into what it is today.
C. David Allis
Carl Wu
Editors’ Note: Additional methods can be found in Methods in Enzymology,
Vol. 371 (RNA Polymerases and Associated Factors, Part D) Section III
Chromatin, Sankar L. Adhya and Susan Garges, Editors.
METHODS IN ENZYMOLOGY
EDITORS-IN-CHIEF
John N. Abelson Melvin I. Simon
DIVISION OF BIOLOGY
CALIFORNIA INSTITUTE OF TECHNOLOGY
PASADENA, CALIFORNIA
FOUNDING EDITORS
Sidney P. Colowick and Nathan O. Kaplan
Contributors to Volume 375
Article numbers are in parentheses following the names of contributors.
Affiliations listed are current.
Chad Alexander (3), The University of
Tennessee-Oak Ridge Graduate School
of Genome Science and Technology,
Oak Ridge National Laboratory, Life
Sciences Division, Oak Ridge, Tennessee
37831-8080
Geneviève Almouzni (8), Institut Curie,
Section de Recherche, F-75248, Paris
Cedex 05, France
Satoshi Ando (18), Department of Mo-
lecular Life Science, School of Medicine,
Tokai University, Kanagawa 259-1193,
Japan
Yunhe Bao (2), Department of Biochemis-
try and Molecular Biology, Colorado
State University, Fort Collins, Colorado
80523-1870
Blaine Bartholomew (13), Department
of Biochemistry & Molecular Biology,
Southern Illinois University School of
Medicine, Carbondale, Illinois
62901-4413
David P. Bazett-Jones (28), Programme
in Cell Biology, Hospital for Sick
Children, Toronto, Ontario M5G 1X8,
Canada
Andrew S. Belmont (23), Department of
Cell and Structural Biology, University of
Illinois at Urbana-Champaign, Urbana,
Illinois 61801
Leise Berven (16), Children’s Medical Re-
search Institute, Westmead, New South
Wales 2415, Australia
Yehudit Birger (21), National Cancer In-
stitute, National Institutes of Health,
Bethesda, Maryland 20892
Hinrich Boeger (11), Department of
Structural Biology, Stanford University
School of Medicine, Stanford, California
94305
William M. Bonner (5), Laboratory of
Molecular Pharmacology, National
Cancer Institute, Bethesda, Maryland
20892
Michael Bruno (14), Division of Gene
Regulation and Expression, The Well-
come Trust Biocentre, Department of
Biochemistry, University of Dundee,
Dundee, DD1 5EH, Scotland, United
Kingdom
Gerard J. Bunick (3), Life Sciences Div-
ision, Oak Ridge National Laboratory,
Oak Ridge, Tennessee 37831-8080
Michael Bustin (21), National Cancer In-
stitute, National Institutes of Health,
Bethesda, Maryland 20892
Anne E. Carpenter (23), Whitehead Institute
for Biomedical Research, Cambridge,
Massachusetts 02142
Gustavo Carrero (26), Department of
Mathematical and Statistical Sciences,
Faculty of Science, University of
Alberta, Edmonton, Alberta T6G 2E1,
Canada
David Carter (29), Laboratory of Chro-
matin and Gene Expression, Babraham
Institute, Cambridge CB2 4AT, United
Kingdom
Frédéric Catez (21), National Cancer Institute,
National Institutes of Health,
Bethesda, Maryland 20892
Lyubomira Chakalova (29), Laboratory
of Chromatin and Gene Expression, Bab-
raham Institute, Cambridge CB2 4AT,
United Kingdom
Srinivas Chakravarthy (2), Department
of Biochemistry and Molecular Biology,
Colorado State University, Fort Collins,
Colorado 80523-1870
Lakshmi N. Changolkar (15), Depart-
ment of Animal Biology, School of Veter-
inary Medicine, University of
Pennsylvania, Philadelphia, Pennsylvania
19104
Lisa Ann Cirillo (9), Department of Cell
Biology, Neurobiology, and Anatomy,
Medical College of Wisconsin,
Milwaukee, Wisconsin 53149
Peter R. Cook (24), The Sir William Dunn
School of Pathology, University of Oxford,
Oxford OX1 3RE, United Kingdom
Ellen Crawford (26), Department of Oncology,
Faculty of Medicine, University of
Alberta and Cross Cancer Institute,
Edmonton, Alberta T6G 2E1, Canada
Wouter de Laat (30), Department of Cell
Biology, ErasmusMC, 3015 GE Rotter-
dam, The Netherlands
Gerda de Vries (26), Department of Math-
ematical and Statistical Sciences, Faculty
of Science, University of Alberta, Edmon-
ton, Alberta T6G 2E1, Canada
Graham Dellaire (28), Programme in
Cell Biology, Hospital for Sick Children,
Toronto, Ontario M5G 1X8, Canada
John D. Diller (10), Department of Bio-
chemistry and Molecular Biology, Center
for Gene Regulation, The Pennsylvania
State University, University Park,
Pennsylvania 16802
Charles E. Ducker (10), Department of
Biochemistry and Molecular Biology,
Center for Gene Regulation, The Pennsyl-
vania State University, University Park,
Pennsylvania 16802
Pamela N. Dyer (2), Department of Bio-
chemistry and Molecular Biology, Color-
ado State University, Fort Collins,
Colorado 80523-1870
Raji S. Edayathumangalam (2), Depart-
ment of Biochemistry and Molecular Biol-
ogy, Colorado State University, Fort
Collins, Colorado 80523-1870
Thomas G. Fazzio (6), Fred Hutchinson
Cancer Research Center, Seattle, Wash-
ington 98109-1024
Andrew Flaus (14), Division of Gene
Regulation and Expression, The Well-
come Trust Biocentre, Department of Bio-
chemistry, University of Dundee, Dundee,
DD1 5EH, Scotland, United Kingdom
Peter Fraser (29), Laboratory of Chroma-
tin and Gene Expression, Babraham Insti-
tute, Cambridge CB2 4AT, United
Kingdom
Susan M. Gasser (22), Department of Mo-
lecular Biology, University of Geneva,
1211 Geneva 4, Switzerland
Stanislaw A. Gorski (25), National
Cancer Institute, National Institutes of
Health, Bethesda, Maryland 20892
Joachim Griesenbeck (11), Department of
Structural Biology, Stanford University
School of Medicine, Stanford, California
94305
Frank Grosveld (30), Department of Cell
Biology, ErasmusMC, 3015 GE Rotter-
dam, The Netherlands
B. Leif Hanson (3), The University of Ten-
nessee-Oak Ridge Graduate School of
Genome Science and Technology, Life
Sciences Division, Oak Ridge National
Laboratory, Oak Ridge, Tennessee
37831-8080
Joel M. Harp (3), Department of Bio-
chemistry and Center for Structural Biol-
ogy, Vanderbilt University, Nashville,
Tennessee 37232-8725
Keiji Hashimoto (17), Core Research for
Evolutional Science and Technology,
Saitama 332-0012, Japan
Jeffrey J. Hayes (12), Department of Bio-
chemistry and Biophysics, University of
Rochester Medical Center, Rochester,
New York 14642
Florence Hediger (22), Department of
Molecular Biology, University of Geneva,
1211 Geneva 4, Switzerland
Michael J. Hendzel (26), Department of
Oncology, University of Alberta and
Cross Cancer Institute, Edmonton,
Alberta T6G 1Z2, Canada
Miki Hieda (24), Sir William Dunn School
of Pathology, University of Oxford,
Oxford OX1 3RE, United Kingdom
Stefan R. Kassabov (13), Department of
Biochemistry & Molecular Biology,
Southern Illinois University School of
Medicine, Carbondale, Illinois
62901-4413
Hiroshi Kimura (24), Horizontal Medical
Research Organization, School of Medi-
cine, Kyoto University, Kyoto 606-8510,
Japan
Roger D. Kornberg (11), Department of
Structural Biology, Stanford University
School of Medicine, Stanford, California
94305
David Landsman (1), National Center for
Biotechnology Information, National Li-
brary of Medicine, National Institutes of
Health, Bethesda, Maryland 20894
Paul J. Laybourn (7), Department of Bio-
chemistry and Molecular Biology, Color-
ado State University, Fort Collins,
Colorado 80523-1870
Jae-Hwan Lim (21), National Cancer Insti-
tute, National Institutes of Health,
Bethesda, Maryland 20892
Karolin Luger (2), Department of Bio-
chemistry and Molecular Biology, Color-
ado State University, Fort Collins,
Colorado 80523-1870
James G. McNally (27), Laboratory of
Receptor Biology and Gene Expression,
National Cancer Institute, National Insti-
tutes of Health, Bethesda, Maryland
20892
Tom Misteli (25), National Cancer Institute,
National Institutes of Health,
Bethesda, Maryland 20892
Craig A. Mizzen (19), Department of Cell
& Structural Biology, University of
Illinois at Urbana-Champaign, Urbana,
Illinois 61801
Setsuo Morishita (17), Department of Molecular
Biology, School of Science, Nagoya
University, Nagoya 464-8601, Japan
Uma M. Muthurajan (2), Department of
Biochemistry and Molecular Biology,
Colorado State University, Fort Collins,
Colorado 80523-1870
Frank R. Neumann (22), Department of
Molecular Biology, University of Geneva,
1211 Geneva 4, Switzerland
Rozalia Nisman (28), Programme in Cell
Biology, Hospital for Sick Children,
Toronto, Ontario M5G 1X8, Canada
Tom Owen-Hughes (14), Division of Gene
Regulation and Expression, The Well-
come Trust Biocentre, Department of Bio-
chemistry, University of Dundee, Dundee,
DD1 5EH, Scotland, United Kingdom
John R. Pehrson (15), Department of
Animal Biology, School of Veterinary
Medicine, University of Pennsylvania,
Philadelphia, Pennsylvania 19104
Craig L. Peterson (4), University of Massachusetts
Medical School, Worcester,
Massachusetts 01605
Robert D. Phair (25), BioInformatics Ser-
vices, Rockville, Maryland 20854
Duane R. Pilch (5), Laboratory of Mo-
lecular Pharmacology, National Cancer
Institute, Bethesda, Maryland 20892
Yuri V. Postnikov (21), National Cancer
Institute, National Institutes of Health,
Bethesda, Maryland 20892
Danny Rangasamy (16), The John Curtin
School of Medical Research, Australian
National University, Canberra, Australian
Capital Territory 2601, Australia
Dominique Ray-Gallet (8), Institut
Curie, Section de Recherche, F-75248,
Paris Cedex 05, France
Christophe Redon (5), Laboratory of Mo-
lecular Pharmacology, National Cancer
Institute, Bethesda, Maryland 20892
Raymond Reeves (20), School of Molecu-
lar Biosciences, Biochemistry/Biophysics,
Washington State University, Pullman,
Washington 99164-4660
Patricia Ridgway (16), The John Curtin
School of Medical Research, Australian
National University, Canberra, Austra-
lian Capital Territory 2601, Australia
Chun Ruan (10), Department of Biochem-
istry and Molecular Biology, Center for
Gene Regulation, The Pennsylvania State
University, University Park, Pennsylvania
16802
Olga A. Sedelnikova (5), Laboratory
of Molecular Pharmacology, National
Cancer Institute, Bethesda, Maryland
20892
Michael A. Shogren-Knaak (4), University
of Massachusetts Medical School,
Worcester, Massachusetts 01605
Robert T. Simpson (10), Department of
Biochemistry and Molecular Biology,
Center for Gene Regulation, The Pennsyl-
vania State University, University Park,
Pennsylvania 16802
Erik Splinter (30), Department of Cell
Biology, ErasmusMC, 3015 GE Rotter-
dam, The Netherlands
Diana A. Stavreva (27), Laboratory of
Receptor Biology and Gene Expression,
National Cancer Institute, National Institutes
of Health, Bethesda, Maryland 20892
J. Seth Strattan (11), Department of
Structural Biology, Stanford University
School of Medicine, Stanford, California
94305
Steven A. Sullivan (1), National Center
for Biotechnology Information, National
Library of Medicine, National Institutes
of Health, Bethesda, Maryland 20894
Ulrica Svensson (16), The John Curtin
School of Medical Research, Australian
National University, Canberra, Australian
Capital Territory 2601, Australia
Angela Taddei (22), Department of Mo-
lecular Biology, University of Geneva,
1211 Geneva 4, Switzerland
John Th’ng (26), Northwestern Ontario
Regional Cancer Centre, Thunder Bay,
Ontario P7A 7T1, Canada
David John Tremethick (16), The John
Curtin School of Medical Research, Aus-
tralian National University, Canberra,
Australian Capital Territory 2601,
Australia
Toshio Tsukiyama (6), Fred Hutchinson
Cancer Research Center, Seattle, Wash-
ington 98109-1024
Jay C. Vary, Jr. (6), Molecular and Cellular
Biology Program, University of
Washington, Seattle, Washington 98195
Cindy L. White (2), Department of Bio-
chemistry and Molecular Biology, Color-
ado State University, Fort Collins,
Colorado 80523-1870
Sriwan Wongwisansri (7), Department of
Biochemistry and Molecular Biology,
Colorado State University, Fort Collins,
Colorado 80523-1870
Kinya Yoda (17, 18), Bioscience and Bio-
technology Center, Nagoya University,
Nagoya, 464-8601, Japan
Kenneth S. Zaret (9), Cell and Devel-
opmental Biology Program, Fox Chase
Cancer Center, Philadelphia, Pennsylva-
nia 19111
Chunyang Zheng (12), Department of
Biochemistry and Biophysics, University
of Rochester Medical Center, Rochester,
New York 14642
[1] Mining Core Histone Sequences from Public
Protein Databases
By Steven A. Sullivan and David Landsman
Introduction
Constructing an online database of histones and histone fold-containing
proteins has allowed our group to analyze histone sequence variation in
some detail.¹,² Here, we describe how we have inventoried core histone
protein sequences as part of this project. The issues involved in such an
undertaking are for the most part not unique to histone sequences. Our
methods and observations should be broadly applicable to studies of
protein families that are highly represented in public sequence databases.
Considerations
Our initial goal was to collect as many reported histone sequences as we
could find. Among the considerations that came into play were the
following.
1. Sourcing of sequences. Several excellent public sequence repositories
make protein sequences available to researchers. We relied on the protein
database maintained by the National Center for Biotechnology Information
(NCBI), which is updated frequently and has been compiled from worldwide
sources, including Swiss-Prot,³ the Protein Information Resource (PIR),⁴ the
Protein Research Foundation (PRF) (http://www.prf.or.jp/en/), the Protein Data
Bank (PDB),⁵ and translations from annotated coding regions in GenBank⁶ and
RefSeq,⁷ a curated, nonredundant set of sequences.
¹ S. Sullivan, D. W. Sink, K. L. Trout, I. Makalowska, P. M. Taylor, A. D. Baxevanis, and
  D. Landsman, Nucleic Acids Res. 30, 341 (2002).
² S. A. Sullivan and D. Landsman, Proteins 52, 454 (2003).
³ B. Boeckmann, A. Bairoch, R. Apweiler, M. C. Blatter, A. Estreicher, E. Gasteiger, M. J.
  Martin, K. Michoud, C. O’Donovan, I. Phan, S. Pilbout, and M. Schneider, Nucleic Acids
  Res. 31, 365 (2003).
⁴ C. H. Wu, L. S. Yeh, H. Huang, L. Arminski, J. Castro-Alvear, Y. Chen, Z. Hu, P.
  Kourtesis, R. S. Ledley, B. E. Suzek, C. R. Vinayaka, J. Zhang, and W. C. Barker, Nucleic
  Acids Res. 31, 345 (2003).
⁵ J. Westbrook, Z. Feng, L. Chen, H. Yang, and H. M. Berman, Nucleic Acids Res. 31, 489
  (2003).
METHODS IN ENZYMOLOGY, VOL. 375 0076-6879/04 $35.00
2. Sequence-harvesting tools. In general, a sequence database search is
a similarity search of either the actual sequence data or its annotation. We
find that both must be targeted in order to maximize the sequence harvest,
because sequence-based searches alone can miss small or ambiguous
sequence fragments that have been deposited in the public databases, and
text-based searches can miss “cryptic” histones, that is, those with
inadequate or incorrect annotation.
For text-based searches of sequence annotation we used the Entrez search
engine at the NCBI Web site (www.ncbi.nlm.nih.gov/Entrez). For sequence-based
searching we used several varieties of the popular Basic Local Alignment
Search Tool (BLAST) pairwise alignment algorithm. The most commonly used
sequence similarity search tools find “hits” based on pairwise alignments of
each sequence in the database to either the query sequence alone, as in
BLAST, or a query profile derived from a previously aligned set of similar
sequences, as in PSI-BLAST or HMMER.⁸,⁹ The latter tools are better at finding
highly divergent members of a protein family but can be expected to return
false positives, requiring further filtering of results. PSI-BLAST is actually
a hybrid tool that performs one round of standard BLAST, using a
user-supplied query sequence, and then builds a profile from the alignment of
the initial BLAST results, which becomes the query for the next round of
BLAST. The process is iterated until “convergence” is reached, that is, until
no new matches are found above the cutoff score. Ideally this should take
fewer than 10 iterations, but convergence can be elusive when the query
sequence matches a diverse and perhaps distantly related set of proteins.
Convergence was harder to judge in searches for nonhistone proteins
containing the histone fold than in searches for core histone sequences. With
the latter we found that seven iterations were sufficient to reach either
convergence or the point at which all the “new” hits appeared by other
criteria to be false positives. PSI-BLAST routinely returned a small number
of true-positive matches to the query sequences that gapped protein BLAST
(BLASTPGP) had missed.
⁶ D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler, Nucleic Acids
  Res. 31, 23 (2003).
⁷ K. D. Pruitt, T. Tatusova, and D. R. Maglott, Nucleic Acids Res. 31, 34 (2003).
⁸ S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J.
  Lipman, Nucleic Acids Res. 25, 3389 (1997).
⁹ S. R. Eddy, Bioinformatics 14, 755 (1998).
Reasonably fast BLASTPGP and PSI-BLAST servers are available at the NCBI Web
site. One advantage of the NCBI Web site PSI-BLAST implementation over a
command-line version is that the user can edit each set of aligned sequences
before it is used to generate a profile. This can redirect a diverging
sequence search back toward convergence. Unfortunately, however, it can also
happen that a valid match from one iteration falls below the noise cutoff in
the next, and in the WWW-based implementation, that match is lost. Therefore
we ran PSI-BLAST (and BLASTPGP) from the command line in a UNIX environment,
which allowed us to save the results from all of the iterations into one file
for subsequent text parsing. Running from the command line also allows
considerable flexibility in setting other BLAST options. Most default values
were adequate for typical BLAST searches, but we commonly increased the
number of displayed description lines and alignments (the -b and -v options)
to 3000, to ensure retrieval of all the possible hits for subsequent
filtering steps.
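
The value of keeping every iteration can be illustrated with a small Python sketch. It is not part of BLAST or of the authors' pipeline: it assumes the hits of each round have already been parsed into sets of identifiers, and the function name `pool_iterations` is hypothetical. The sketch pools hits across rounds, so that a match that drops out in a later round is retained, and it reports the first round that contributes nothing new as a simple stand-in for convergence.

```python
from typing import Iterable, Optional, Set, Tuple


def pool_iterations(rounds: Iterable[Set[str]]) -> Tuple[Set[str], Optional[int]]:
    """Pool hits from successive search rounds.

    Returns (all_hits, converged_round). converged_round is the 1-based
    number of the first round that added no new identifiers, or None if
    every round added at least one.
    """
    all_hits: Set[str] = set()
    converged_round: Optional[int] = None
    for number, hits in enumerate(rounds, start=1):
        new_hits = hits - all_hits
        if not new_hits and converged_round is None:
            converged_round = number
        all_hits |= hits  # keep everything ever seen, even if lost later
    return all_hits, converged_round


# Toy identifiers (not real records): "gi3" appears in round 2 only.
rounds = [{"gi1", "gi2"}, {"gi1", "gi2", "gi3"}, {"gi1", "gi2"}]
pooled, converged = pool_iterations(rounds)
print(sorted(pooled), converged)
```

Pooling first and filtering afterwards mirrors the approach described above, in which all iterations are saved to one file and false positives are removed in later steps.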
3. Query sequences. Histones are ancient proteins, found in all known
eukaryote lineages as well as some archaeal microbes. Using a single query
sequence, there is the possibility that some valid hits might be missed
because of the sequence divergence and extreme biodiversity of the
histones, even using a profile-generating protocol. To maximize the
identification of eukaryote core histones from the protein databases, we
“bracketed” the kingdom evolutionarily by using core histone sequences
from human and yeast as search queries. This proved important for the
more divergent histones, H2A and H2B, but less so for the more conserved
histones, H4 and H3. For example, queries with human or yeast H4 or H3
returned almost the same sets of true-positive hits. In H3 searches, the
most common outliers requiring taxonomic bracketing to capture were
sequence fragments from protists, and members of the centromeric H3
subclass (data not shown).
4. Sequence redundancy. Sequence redundancy is the bane of most database
searches. In most cases, redundant sequences in a large public sequence
repository such as GenBank are the same sequence from the same organism,
automatically harvested from different databases, rather than originating
from discrete sequencing projects in different laboratories. Thus, Web-based
sequence similarity search tools, such as PSI-BLAST at the NCBI Web site, tend
to present results in a convenient, nonredundant fashion, with sequence
identifiers of identical sequences grouped together with an anchored
sequence. To populate the histone database, however, we required every
sequence in FASTA format (i.e., each record consisting of only a unique
definition line and a sequence), one reason being that homologous histones
display remarkable degrees of sequence identity, rather than mere similarity,
across species. It is not uncommon that fully “redundant” histone sequences
in the public database derive from more than one species. We wanted to start
with a set in which such identical sequences are properly resolved. Because
we were attempting an exhaustive search, the well-intentioned nonredundancy
of the public databases was, for us, an obstacle. Our strategy was to extract
all the unique sequence identifiers from the BLAST outputs (in the case of
NCBI records, the unique identifier is the GI number found at the beginning
of the sequence definition line of a FASTA-formatted record) into a file, and
use this file to generate a corresponding library of FASTA records. NCBI
Entrez on the World Wide Web can take a file of GI numbers as input for batch
retrieval of records; alternatively, we used the SEALS software suite to
perform such retrievals in a UNIX environment.¹⁰ SEALS has a tool, fauniq,
for reducing a set of redundant FASTA sequence records to a nonredundant
format, on the basis of either definition-line identifiers such as the GI
number or the sequence itself. This tool proved invaluable for filtering
BLAST outputs to remove GI-based redundancies and for generating nonredundant
sequence sets for alignment and variation analysis.
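
The two kinds of redundancy filtering described above can be sketched in Python. This is not the SEALS fauniq tool, only an illustration of the idea. It assumes definition lines that begin with `gi|<number>|`; the function names and the example records, including their GI numbers and accessions, are invented for the illustration.

```python
import re
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (definition line without ">", sequence)

# Assumed definition-line layout: "gi|12345|..." at the start of the line.
GI_PATTERN = re.compile(r"^gi\|(\d+)\|")


def read_fasta(text: str) -> Iterator[Record]:
    """Yield (definition line, sequence) pairs from FASTA-formatted text."""
    defline: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        elif defline is not None:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def gi_of(defline: str) -> Optional[str]:
    """Return the GI number at the start of a definition line, if present."""
    match = GI_PATTERN.match(defline)
    return match.group(1) if match else None


def unique_by_gi(records: Iterable[Record]) -> List[Record]:
    """Keep the first record for each GI; records without a GI are skipped."""
    seen = set()
    kept: List[Record] = []
    for defline, seq in records:
        gi = gi_of(defline)
        if gi is None or gi in seen:
            continue
        seen.add(gi)
        kept.append((defline, seq))
    return kept


def group_by_sequence(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Map each distinct sequence to all definition lines that carry it."""
    groups: Dict[str, List[str]] = {}
    for defline, seq in records:
        groups.setdefault(seq.upper(), []).append(defline)
    return groups


# Invented records: one duplicate GI, and two species sharing one sequence.
example = """>gi|111|ref|XX_0001| histone H3 [species A]
ARTKQTARKS
>gi|222|ref|XX_0002| histone H3 [species B]
ARTKQTARKS
>gi|111|ref|XX_0001| histone H3 [species A]
ARTKQTARKS
"""
records = list(read_fasta(example))
by_gi = unique_by_gi(records)
by_seq = group_by_sequence(by_gi)
print(len(records), len(by_gi), len(by_seq))
```

Grouping by sequence keeps every definition line instead of discarding all but one, so identical histone sequences from different species remain distinguishable.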
5. Fragmentary, ambiguous, and frameshifted sequences. Some sequences in the
public databases are less than full-length; for example, a few records
annotated as “histone H3” consist of only two or three amino acid residues.
As sequences shorten, their detection becomes more difficult using typical
“flavors” of BLAST when querying a large database because they become less
distinguishable from chance hits. This problem is compounded if, as is the
case with histones, the protein features segments of low sequence complexity,
or if the fragment records contain ambiguous (“X”) residues. To capture
sequence fragments, we first divided the full-length query sequence into
overlapping segments, with a segment window of 20 residues, sampled at
intervals of 10 residues along the length. This was easily done with the
SEALS fenestrate command. We then used these segments as queries against the
public database in a modified gapped BLAST search optimized to capture short,
nearly exact matches (a search option that is also available at the NCBI Web
site cited earlier). For these searches, low-complexity filters were turned
off. The combined results of all the “window BLASTs” for a query sequence
were made nonredundant with respect to GI number.
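
The windowing step can be sketched in Python as follows. This is not the SEALS fenestrate command. The function name `windows` is hypothetical, and the handling of the C-terminal tail (one extra window anchored at the end of the sequence) is an assumption of this sketch, since the text does not say how leftover residues were treated.

```python
from typing import List, Tuple


def windows(sequence: str, size: int = 20, step: int = 10) -> List[Tuple[int, str]]:
    """Return (1-based start, segment) pairs of overlapping segments.

    Segments are `size` residues long and start every `step` residues. If
    that grid leaves C-terminal residues uncovered, one extra end-anchored
    window is added (an assumption of this sketch).
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not sequence:
        return []
    if len(sequence) <= size:
        return [(1, sequence)]
    last_start = len(sequence) - size
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s + 1, sequence[s:s + size]) for s in starts]


# A made-up 45-residue sequence, used only to show the window positions.
toy = "ACDEFGHIKLMNPQRSTVWY" * 2 + "ACDEF"
for start, segment in windows(toy):
    print(start, segment)
```

With a window of 20 and a step of 10, every residue away from the ends falls in two segments, so a short database fragment is likely to sit largely within at least one query segment.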
Frameshifted sequences (either authentic or erroneous) can pose a similar
problem, depending on the size of the frameshifted region. Putative
frameshifts are easily identified by visual inspection of multiple alignments
of query results, for example, using the popular CLUSTAL X program,¹¹ where
they manifest as sudden and extensive loss of sequence similarity. To verify
a frameshift, assuming access to the genomic DNA or cDNA record for the
protein (which is often, but not always, available in public databases), one
should translate the DNA in all frames and add those conceptual translations
to the alignment; the correct frames will be visually evident in a true
frameshift. Several tools exist on the Web for doing such translations; we
commonly use the one at the ExPASy (Expert Protein Analysis) Web site. A
translation tool is also available in the SEALS package.
¹⁰ D. R. Walker and E. V. Koonin, Proc. Int. Conf. Intell. Syst. Mol. Biol. 5, 333 (1997).
¹¹ J. D. Thompson, T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins, Nucleic
   Acids Res. 25, 4876 (1997).
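
For readers who prefer to script the frameshift check, a minimal Python sketch of translation in all six frames follows. It is not the ExPASy or SEALS tool. It assumes the standard genetic code, writes any codon containing an ambiguous base as "X", and uses hypothetical function names.

```python
from itertools import product
from typing import Dict

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement; non-ACGT characters are left as-is."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str, offset: int = 0) -> str:
    """Translate from `offset` (0, 1, or 2); unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna: str) -> Dict[str, str]:
    """Return conceptual translations for frames +1..+3 and -1..-3."""
    rc = reverse_complement(dna)
    frames: Dict[str, str] = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(rc, offset)
    return frames


# A made-up 9-base sequence, used only to show the output layout.
frames = six_frames("ATGGCCAAA")
for name in sorted(frames):
    print(name, frames[name])
```

Adding the relevant forward-frame translations to the multiple alignment should show similarity to the protein family resuming in a different frame downstream of a true frameshift.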
Comparison of Search Strategies
There are many available variations on the basic BLAST search protocol. We
investigated several parameters for their effects in the identification of
histone H3 sequences. Histone H3 is a moderately diverse histone class, with
more than half of the known full-length sequences displaying >80% identity in
their histone fold domains; this figure falls between those for the more
highly conserved H4 class and the more diverse H2A and H2B classes.² The H3
class comprises two subclasses that are markedly distinct in sequence and in
function: replication-dependent H3 (the major H3) and centromeric H3. There
is also a third, replication-independent H3.3 class, although its sequence is
only marginally divergent from that of the major H3.
We first compiled a redundant reference set of H3 sequences, using a variety
of BLAST- and Entrez-based searches, to include as many probable H3 sequence
records as we could find in the NCBI protein database. This set was manually
reviewed to eliminate false positives, yielding a final set of 1742 good
candidate H3 sequences from all three subclasses. We then compared the
results of different individual BLAST and Entrez search strategies with the
reference set, to determine the efficiency (percentage of hits that are true
positives, i.e., that are also found in the reference set) and the success
(percentage of the reference set found by the search method). The results are
shown in Table I. Entrez searches of eukaryotic sequence record annotation
used the queries “H3” or “histon.” The BLAST parameters that we varied were
BLAST flavor (gapped BLAST versus gapped PSI-BLAST versus gapped BLAST for
short, nearly exact window matches); query sequence (human versus yeast);
database size (all versus the eukaryotic subset); and SEG low-complexity
filtering (off versus on).
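
The two measures defined above reduce to simple set arithmetic, sketched here in Python. The function name `score_search` is hypothetical. The example uses synthetic integer identifiers, sized to match the counts reported in Table I for the BLASTPGP search with the human H3 query (1747 records retrieved, 1719 of them in the 1742-member reference set); they are not real GI numbers.

```python
from typing import Dict, Set


def score_search(hits: Set[int], reference: Set[int]) -> Dict[str, float]:
    """Compute efficiency and success (both in percent) for one search.

    efficiency = true positives / records retrieved
    success    = true positives / size of the reference set
    """
    true_positives = hits & reference
    efficiency = 100.0 * len(true_positives) / len(hits) if hits else 0.0
    success = 100.0 * len(true_positives) / len(reference) if reference else 0.0
    return {
        "unique": float(len(hits)),
        "true_positives": float(len(true_positives)),
        "efficiency": round(efficiency, 1),
        "success": round(success, 1),
    }


# Synthetic identifiers: 1742 reference records; the search retrieves 1719
# of them plus 28 records outside the reference set (1747 in total).
reference = set(range(1742))
hits = set(range(1719)) | set(range(10_000, 10_028))
scores = score_search(hits, reference)
print(scores)
```

Efficiency penalizes false positives, and success penalizes missed reference records, so a search can score well on one measure and poorly on the other.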
The Entrez results indicate that about 17% of H3 sequences in the public
database are cryptic, lacking specific annotation as H3 histones (the “H3”
query recovered only 83.4% of the reference set). The search results for
“histon” as a query term recovered 95% of the reference sequences, with a
trade-off of many more false positives, as one would expect. The “histon”
query also captured all of the true-positive “H3” query results (data not
shown).
TABLE I
Comparison of Search Strategies for H3 Histone Sequencesᵃ
Search strategy                          Unique GI     H3  Success (%)  Efficiency (%)
Reference H3 set                              1742   1742
Entrez “eukaryota[ORGN]”                 1,143,461   1742        100.0             0.2
Entrez “H3”                                   3303   1452         83.4            44.0
Entrez “histon”                               9297   1653         94.9            17.8
Entrez “eukaryota[ORGN] and H3”               2703   1452         83.4            53.7
Entrez “eukaryota[ORGN] and histon”           7453   1653         94.9            22.2
BLASTPGP H3human                              1747   1719         98.7            98.4
BLASTPGP H3human +seg                         1747   1719         98.7            98.4
BLASTPGP H3human +eukgi                       1754   1722         98.9            98.2
BLASTPGP H3human +eukgi +seg                  1754   1722         98.9            98.2
BLASTPGP H3yeast                              1777   1718         98.6            96.7
BLASTPGP H3yeast +seg                         1777   1718         98.6            96.7
BLASTPGP H3yeast +eukgi                       1780   1718         98.6            96.5
BLASTPGP H3yeast +eukgi +seg                  1780   1718         98.6            96.5
PSIBLASTPGP H3human                           1897   1726         99.1            91.0
PSIBLASTPGP H3human +seg                      1897   1726         99.1            91.0
PSIBLASTPGP H3human +eukgi                    1949   1727         99.1            88.6
PSIBLASTPGP H3human +eukgi +seg               1949   1727         99.1            88.6
PSIBLASTPGP H3yeast                           2011   1726         99.1            85.8
PSIBLASTPGP H3yeast +seg                      2011   1726         99.1            85.8
PSIBLASTPGP H3yeast +eukgi                    2077   1727         99.1            83.1
PSIBLASTPGP H3yeast +eukgi +seg               2077   1727         99.1            83.1
WINBLASTPGP H3human                         69,678   1730         99.3             2.5
WINBLASTPGP H3human +eukgi                  60,821   1732         99.4             2.8
WINBLASTPGP H3human +eukgi +seg               1697   1646         94.5            97.0
WINBLASTPGP H3yeast                         70,864   1730         99.3             2.4
WINBLASTPGP H3yeast +eukgi                  63,949   1730         99.3             2.7
WINBLASTPGP H3yeast +eukgi +seg               1788   1646         94.5            92.1
ᵃ Entrez queries of the NCBI protein database were conducted from the NCBI Web site
  www.ncbi.nlm.nih.gov/Entrez. BLAST searches using human or yeast histone H3
  sequences were performed from the command line in a UNIX environment:
  BLASTPGP, gapped protein BLAST; PSIBLASTPGP, iterated gapped protein
  BLAST using profiles; WINBLASTPGP, gapped protein BLAST for short, nearly exact
  matches, using sequence windows as queries; eukgi, search restricted to sequences from
  eukaryotes; seg, SEG filtering of low-complexity regions enabled. All results were
  compared with a curated reference H3 set of sequences. Column headers: unique GI,
  number of unique sequence records retrieved; H3, number of retrieved unique GIs
  shared with the reference set; success, percent H3/reference set; efficiency,
  percent H3/unique GI.
Any of the BLAST-based strategies was sufficient to capture at least
94% of the reference set from the public databases. The best combination
of efficiency and success was achieved using gapped BLAST. The effects of
differences in query sequence, database size, and filtering were minor com-
pared with the difference between using BLAST, PSI-BLAST, or
windowed BLAST, because the latter two BLAST implementations return
far more false positives while increasing the success rate only marginally.
Low-entropy sequence filtering appeared to make no difference whatso-
ever except in the case of windowed searches, in which the query sequence
was divided into overlapping segments 20 residues each in length, with sev-
eral gapped BLAST parameters altered to facilitate finding short, nearly
exact matches to the query segments. Using the low-complexity filter here
vastly increased efficiency by greatly reducing false positives, although suc-
cess suffered in comparison with nonfiltered strategies, reflecting the pres-
ence of short, often basic low-complexity regions that are a hallmark of
core histone sequences.
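The windowing step used for these searches can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the text gives the segment length (20 residues) but not the offset between segments, so the step of 10 residues is a hypothetical choice, as are the function name sequence_windows and the placeholder query string, which is not a real histone sequence.

```python
def sequence_windows(seq, size=20, step=10):
    """Split seq into overlapping windows of `size` residues.

    `step` (the offset between window starts) is an assumed parameter;
    a final window is added so that the C-terminal residues are covered.
    Returns a list of (start_index, window_sequence) tuples.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if len(seq) <= size:
        return [(0, seq)]
    last_start = len(seq) - size
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s, seq[s:s + size]) for s in starts]


# Placeholder 45-residue query (not a real protein sequence).
query = "ACDEFGHIKLMNPQRSTVWY" * 2 + "ACDEF"
windows = sequence_windows(query)

for start, window in windows:
    # One FASTA-style record per window, named by its 1-based coordinates.
    print(f">query_{start + 1}-{start + len(window)}")
    print(window)
```

Each window would then be submitted as a separate query, and the hits pooled before computing the unique record count.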
Unfortunately, as these results show, no single method captures all the
relevant sequence records. A combination of strategies was the only way to
achieve 100% success. However, the results of our comparison suggest a ra-
tional way to mine the maximum number of histone sequence records of a
class from a database. The first step is to perform a single-round gapped
BLAST search, making sure that the options for ‘‘number of descriptions’’
and ‘‘number of alignments’’ returned are set high (e.g., several thousand
each). This should return most of the true positives with high efficiency.
This set should be inspected carefully, using a variety of tools including
text-search of the definition lines, multiple alignment, and further
BLAST searches with a different query sequence, to remove false posi-
tives. The resulting validated set becomes most useful in subsequent
searches employing other strategies, such as PSI-BLAST or text-based
searches. The validated set can be used to subtract known positives from
subsequent search results, using difference-finding tools such as the SEALS
fanot command, which finds the logical exclusion of two sets of FASTA
records or definition lines. This leaves a much shorter list of candidates
from the new search results to be examined for new true positives. As these
are identified they are added to the validated set, increasing its usefulness
as a filter. This search strategy has also served us well in harvesting histone
H4, H2A, and H2B sequences, and should work for any well-conserved
class of protein sequences.
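The subtraction step in this strategy amounts to a set difference keyed on record identifiers. The sketch below is a plain Python stand-in for that idea, not the SEALS fanot command and not a description of its options; the function names read_fasta and new_candidates, and the identifiers and sequences in the toy records, are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into {identifier: (definition_line, sequence)}.

    The identifier is taken to be the first whitespace-delimited token
    of each definition line (an assumption made for this sketch).
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            current = header.split()[0]
            records[current] = (header, [])
        elif current is not None:
            records[current][1].append(line)
    return {key: (hdr, "".join(parts)) for key, (hdr, parts) in records.items()}


def new_candidates(search_hits, validated):
    """Return the hits whose identifiers are not yet in the validated set."""
    return {key: rec for key, rec in search_hits.items() if key not in validated}


# Toy records with made-up identifiers and sequences.
validated = read_fasta(">seqA histone H3 example\nARTKQT\n>seqB histone H3 example\nARTKQS\n")
hits = read_fasta(">seqA histone H3 example\nARTKQT\n>seqC unknown protein\nMKLV\nAAG\n")

candidates = new_candidates(hits, validated)
print(sorted(candidates))

# After manual inspection, confirmed true positives are added to the validated
# set so that the next round of subtraction leaves fewer records to examine.
confirmed = {}  # filled in by hand after inspecting `candidates`
validated.update(confirmed)
```

Keying on identifiers rather than on sequence strings keeps distinct records with identical sequences separate, which matches the unique-record counting used in the table.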
Histone Sequence Variants
Histone variants have been divided into ‘‘homomorphous’’ and
‘‘heteromorphous’’ categories.12,13 Homomorphous variants have relatively minor
sequence differences and require high-resolution separation methods to
distinguish them biochemically (reviewed in von Holt et al.14). They are
found in all four core histone classes, and are presumed to be functionally
identical. Heteromorphous variants are readily distinguished by conventional
biochemical separation methods and tend to be distinct from other
histones in their class with respect to function and/or spatiotemporal
localization, as well as sequence. The distinction between the two categories of
variants is not rigid (for example, the ostensibly homomorphous H3.3
appears to be functionally distinct from the major H3) and may become less
so as the functions of more variants are experimentally tested. In clustering
trees made from multiple sequence alignments of each histone class,
heteromorphous variants tend to form biodiverse clades distinct from the major
form, indicating early branching off from major histones, whereas
homomorphous variants tend to commingle with the major form in clades that are
more strongly delineated by phylogeny than by any other factor, suggesting
the variants arose after the founding speciation event (data not shown; see
also Thatcher and Gorovsky15). For all core histone classes, sequence
alignments show clear distinctions between metazoan, plant, fungal, and various
basal eukaryote subclasses. Distinct subclasses within the metazoan
sequences are also common (e.g., insect or echinoderm sequences).
Nomenclature is only occasionally helpful in classifying histone variants. It is not
standardized, and thus ‘‘H3.2’’ in one species may not be similar to ‘‘H3.2’’
in another. The only other constant among aligned histone sequences
apparent in Figs. 1–4 is that there tends to be less variation in the α-helical regions
of the histone fold than in the interhelical loops and the N- and C-terminal
regions flanking the histone fold. This pattern of variation is common in
other α helix-containing protein families.
H2A
The H2A class is the most diverse of the four core histone classes, both
functionally and in terms of sequence, comprising four subclasses of known
or putative functional variants in addition to typical phylogeny-based
12 M. H. West and W. M. Bonner, Biochemistry 19, 3238 (1980).
13 J. Ausio, D. W. Abbott, X. Wang, and S. C. Moore, Biochem. Cell Biol. 79, 693 (2001).
14 C. von Holt, W. F. Brandt, H. J. Greyling, G. G. Lindsey, J. D. Retief, J. D. Rodrigues,
S. Schwager, and B. T. Sewell, Methods Enzymol. 170, 431 (1989).
15 T. H. Thatcher and M. A. Gorovsky, Nucleic Acids Res. 22, 174 (1994).
subclasses (Fig. 1A and B). H2A.X is found in species spanning the
eukaryotic spectrum and features a conserved serine four residues from
the carboxyl terminus (part of an SQ motif, positions 208 and 209 in
Fig. 1A) that is phosphorylated in response to double-stranded DNA
breaks, perhaps marking the site for repair (reviewed in Redon et al.16).
Interestingly, the fungal H2A subclass clusters near the H2A.X subclass,
and also features a conserved SQ motif at its C terminus. H2A.F/Z
sequences constitute another pan-eukaryotic subclass and are necessary
but not sufficient for H2A function in organisms tested. Characteristic
H2A.F/Z residues in a C-terminal, H3-binding portion of the protein
(positions 145–193 in Fig. 1A) have been suggested to impart a specific,
although as yet unknown, function, as have the lysine residues in the
amino-terminal portion (reviewed in Redon et al.16). Of these lysine
16 C. Redon, D. Pilch, E. Rogakou, O. Sedelnikova, K. Newrock, and W. Bonner, Curr. Opin.
Genet. Dev. 12, 162 (2002).
Fig. 1. Summary of H2A subclasses and variants. (A) A consensus sequence of all aligned
H2A sequences is shown at the top. Dots in the sequences below indicate identity to the
consensus. Groups are named on the basis of clustering patterns observed in neighbor-joining
trees of aligned H2A sequences (not shown). Names, a selection of sequence descriptors
found in the definition lines of the sequence records; seq, number of unique sequences in the
group; sp, number of species in the group; max sp/seq, the greatest number of species having
the same sequence in the group. For each group the first line is the consensus sequence for
that group. Variations from the group consensus are indicated below it. Italic indicates a
‘‘singleton,’’ i.e., the residue was found in only one sequence from one species in the group.
An asterisk (*) indicates singleton identity or a gap. Background color key: white, identity to
the anchored consensus; black, gap; orange, aromatic; yellow, aliphatic/hydrophobic; light
green, glycine; green, hydrophilic; light blue, histidine; blue, basic; red, acidic. (B) C-terminal
section of macroH2A.
residues, two (at positions 11 and 42 in Fig. 1A) appear to be specific to
H2A.F/Z and not the major metazoan H2A. MacroH2A is a large bipartite
histone divided into a recognizable H2A portion with many subclass-
characteristic substitutions, and a long C-terminal extension found in
no other histone subclass (residues 227–430 in Fig. 1B). MacroH2A
has been found only in vertebrates and is concentrated in the inactive
female X chromosome (reviewed in Brown17). H2A-Bbd is a highly
divergent subclass, so far found only in mammals, which displays a comple-
mentary localization to macroH2A, that is, it is excluded from inactive
chromosomes.18
17 D. T. Brown, Genome Biol. 2, Reviews 0006 (2001).
18 B. P. Chadwick and H. F. Willard, J. Cell Biol. 152, 375 (2001).
Fig. 2. Summary of H2B subclasses and variants. A consensus sequence of all aligned H2B
sequences is shown at the top. Dots in the sequences below indicate identity to the consensus.
Groups are named on the basis of clustering patterns observed in neighbor-joining trees of
aligned H2B sequences (not shown). Names, a selection of sequence descriptors found in the
definition lines of the sequence records; seq, number of unique sequences in the group; sp,
number of species in the group; max sp/seq, the greatest number of species having the same
sequence in the group. For each group the first line is the consensus sequence for that group.
Variations from the group consensus are indicated below it. Italic indicates a ‘‘singleton,’’ i.e.,
the residue was found in only one sequence from one species in the group. An asterisk (*)
indicates singleton identity or a gap. Background color key: white, identity to the anchored
consensus; black, gap; orange, aromatic; yellow, aliphatic/hydrophobic; light green, glycine;
green, hydrophilic; light blue, histidine; blue, basic; red, acidic.
Fig. 3. Summary of H3 subclasses and variants. A consensus sequence of all aligned H3
sequences is shown at the top. Dots in the sequences below indicate identity to the consensus.
Groups are named on the basis of clustering patterns observed in neighbor-joining trees of
aligned H3 sequences (not shown). Names, a selection of sequence descriptors found in the
definition lines of the sequence records; seq, number of unique sequences in the group; sp,
number of species in the group; max sp/seq, the greatest number of species having the same
sequence in the group. For each group the first line is the consensus sequence for that group.
Variations from the group consensus are indicated below it. Italic indicates a ‘‘singleton,’’ i.e.,
the residue was found in only one sequence from one species in the group. An asterisk (*)
indicates singleton identity or a gap. Background color key: white, identity to the anchored
consensus; black, gap; orange, aromatic; yellow, aliphatic/hydrophobic; light green, glycine;
green, hydrophilic; light blue, histidine; blue, basic; red, acidic.
H2B
Functional subclasses of H2B sequences have not been positively iden-
tified, although at least one tissue-specific form has been identified in mam-
malian testis (Fig. 2). An echinoderm sperm variant featuring a repeating
pentapeptide has also been described (reviewed in von Holt et al.19), indicating
that the echinoderm group in Fig. 2 probably could be subdivided
further. The N-terminal diversity seen within the plant subclass in Fig. 2
suggests that it, too, could be further subdivided.
H3
The H3 class notably contains two subclasses of replication-independent
variants that are differentially localized within the cell. Histone H3.3
is an ostensibly homomorphous metazoan subclass that varies significantly
from the predominant H3 in only four positions (positions 73, 153, 155, and
156 of Fig. 3). Like the major H3, H3.3 can be deposited in nucleosomes of
replicating DNA, but it can also be deposited in nonreplicating DNA,
preferentially in actively transcribed regions.20 The replication
independence of H3.3 may be mediated by any of the three H3.3-specific residues
at positions 153–156.21 Centromere-specific H3 is found in species ranging
from yeast to human, and its deposition has been shown to be replication
independent (reviewed in Smith22). It is thought to help specify centromere
Fig. 4. Summary of H4 subclasses and variants. A consensus sequence of all aligned H4
sequences is shown at the top. Dots in the sequences below indicate identity to the consensus.
Groups are named on the basis of clustering patterns observed in neighbor-joining trees of
aligned H4 sequences (not shown). Names, a selection of sequence descriptors found in the
definition lines of the sequence records; seq, number of unique sequences in the group; sp,
number of species in the group; max sp/seq, the greatest number of species having the same
sequence in the group. For each group the first line is the consensus sequence for that group.
Variations from the group consensus are indicated below it. Italic indicates a ‘‘singleton,’’ i.e.,
the residue was found in only one sequence from one species in the group. An asterisk (*)
indicates singleton identity or a gap. Background color key: white, identity to the anchored
consensus; black, gap; orange, aromatic; yellow, aliphatic/hydrophobic; light green, glycine;
green, hydrophilic; light blue, histidine; blue, basic; red, acidic.
19 C. von Holt, W. N. Strickland, W. F. Brandt, and M. S. Strickland, FEBS Lett. 100, 201
(1979).
20 K. Ahmad and S. Henikoff, Proc. Natl. Acad. Sci. USA 99(Suppl. 4), 16477 (2002).
21 K. Ahmad and S. Henikoff, Mol. Cell 9, 1191 (2002).
22 M. M. Smith, Curr. Opin. Cell Biol. 14, 279 (2002).