Identifying and characterizing cis regulatory elements in the human genome

IDENTIFICATION AND CHARACTERIZATION OF
CONSERVED CIS-REGULATORY ELEMENTS
IN THE HUMAN GENOME




ALISON P. LEE
B. Computing (Computer Science) (Honours)
National University of Singapore, 2004





A THESIS SUBMITTED FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY


INSTITUTE OF MOLECULAR AND CELL BIOLOGY
&
NUS GRADUATE SCHOOL FOR
INTEGRATIVE SCIENCES AND ENGINEERING


NATIONAL UNIVERSITY OF SINGAPORE

2009
Acknowledgements


There are several people I would like to acknowledge:

Firstly, my advisor Professor Byrappa Venkatesh for being an excellent and patient
mentor, and for devoting so much time and energy to helping me improve my
reasoning, writing and presentation skills;

Dr Sydney Brenner and Dr Ng Huck Hui (GIS, Singapore), members of my thesis
advisory committee, for their scientific advice;

Dr Alice Tay for her constant encouragement and discussions at work and
conferences, and her work on the fugu Hox clusters project;

Esther Koh for her work on the fugu Hox clusters project and Yang Yuchen for his
work on the TFCONES project;

Gene Yeo, Eddie Loh, Nidhi Dandona for showing me the ropes in bioinformatics;

Luo Ming, Wang Jianli, Kevin Lam for useful software/hardware discussions and
troubleshooting;

Krish Jon Mathavan, Elizabeth Yeoh, Tay Boon Hui and Sumanty Tohari for
guidance in laboratory techniques;

Patrick Gilligan and Ravi Vydianathan for their valuable comments on manuscripts;

Data Storage and Cluster Computing teams at Bioinformatics Institute, led by Lai
Loong Fong and Stephen Wong respectively, for speedy response to my requests for
server fixes or software and database updates;

Arun Kumar, Zhao Zhiyang, Peng Huiling, and Chua Yiwen at A*STAR Biological
Resource Centre for their skilled work in DNA microinjections and mouse husbandry;

The members of IMCB Histopathology Unit, especially Keith Rogers and Susan
Rogers for histology tips, protocols and equipment;

The staff of A*STAR Graduate Academy and NUS Graduate School for Integrative
Sciences and Engineering for the handling of administrative affairs;

All present and former members of the Comparative Genomics Laboratory and DNA
Sequencing Facility in IMCB, Singapore for making the laboratory a conducive and
pleasant place to work in;

Finally, my greatest appreciation goes to my parents and my brother for their
strongest support.
Table of contents
Acknowledgements
Table of contents
Summary
List of tables
List of figures
List of abbreviations
Chapter 1: Introduction
1.1 The Human Genome Project
1.2 Cis-regulatory elements
1.3 Methods to identify cis-regulatory elements
1.4 Comparative genomics approach for identifying cis-regulatory elements
1.5 Transcription factors (TFs)
1.6 Objectives of my work
Chapter 2: Materials and methods
2.1 Identifying human, mouse and fugu TF-encoding genes
2.2 Identifying CNEs in TF-encoding gene loci
2.3 Computing CNE statistics of human TF-encoding genes
2.4 Enrichment analysis of CNEs within experimentally defined TFBS
2.5 Gene Ontology enrichment analysis of human TF-encoding genes
2.6 Expression analysis of human TF-encoding genes
2.7 Motif finding
2.8 Predicting TFBS in human-fugu CNEs
2.9 Building and implementing the TFCONES database
2.10 Sequencing and annotating the fugu Hox gene clusters
2.11 Functional assay of CNEs in transgenic mice
Chapter 3: Results – CNEs in transcription factor-encoding genes
3.1 TF-encoding genes in human, mouse and fugu
3.2 Conserved clusters of TF-encoding genes in human, mouse and fugu
3.3 Identification of human-mouse and human-fugu CNEs
3.4 Distribution pattern of human-fugu CNEs in the introns and flanking regions of TF-encoding genes
3.5 Presence of experimentally verified TFBS within human-fugu CNEs
3.6 Functional categories of human TF-encoding genes associated with human-fugu CNEs
3.7 Association between human-fugu CNEs and expression profiles of TF-encoding genes
3.8 Over-represented motifs in the human-fugu CNEs of central nervous system-expressing TF-encoding genes
3.9 Database of TFs and CNEs associated with TF-encoding genes
3.10 Discussion
Chapter 4: Results – CNEs in the Hox gene clusters
4.1 The Hox gene clusters
4.2 Fugu Hox loci and conserved syntenic blocks
4.3 CNEs in the Hox loci
4.4 Distribution of CNEs in fugu Hox loci
4.5 CNEs in the human Hox loci
4.6 Discussion
Chapter 5: Results – Functional assay of CNEs associated with a representative TF-encoding gene
5.1 Introduction
5.2 Functions and expression pattern of Lhx2 gene
5.3 Organization of the Lhx2 locus in vertebrates
5.4 CNEs in the Lhx2 gene locus
5.5 Expression patterns of Lhx2, Crb2 and Dennd1a in developing mouse embryos
5.6 Functional assay of Lhx2 CNEs in E11.5 transgenic mouse embryos
5.7 Site-directed mutagenesis of a predicted motif in Lhx2_CNE2/3
5.8 Discussion
Chapter 6: Discussion
Bibliography
Annexes
List of my publications

Summary
Comparative genomics is a powerful approach for identifying conserved cis-
regulatory elements in the human genome. Since functional sequences evolve more
slowly than the surrounding neutrally evolving regions, cis-regulatory elements can be
identified as conserved noncoding elements (CNEs) in comparisons of human and
other vertebrate genomes. In particular, comparison of human with distantly related
vertebrates such as fishes increases the likelihood that most predicted CNEs are
functional sequences. The objectives of my project were to identify all the CNEs
associated with transcription factor (TF)-encoding genes in the human genome by
comparison with pufferfish (fugu) sequences, analyze the characteristics of the CNEs
and assay the functions of selected CNEs in transgenic mice. I started by building a
curated database of all TF-encoding genes in human, mouse and fugu and predicted
CNEs (≥ 65% identity over 50 bp) associated with orthologous genes by locus-by-
locus global alignments. In total, 1,738 human, 1,495 mouse, and 1,762 fugu TF-
encoding genes were identified, with 1,145 genes having orthologs in all three
genomes. Further analyses focused on the set of 816 DNA-binding TF-encoding
orthologous genes. A total of 2,843 human-fugu CNEs (total length ~388 kb) were
found to be associated with these TF-encoding genes. An online database was
constructed to catalog the human, mouse and fugu TF-encoding genes, together with
their associated CNEs, and this database is named TFCONES (Transcription Factor
Genes & Associated COnserved Noncoding ElementS).
The TFCONES database would be useful to researchers interested in studying the
regulation of TF-encoding genes and understanding gene regulatory networks in
vertebrates.
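
As a rough illustration of the CNE criterion quoted above (at least 65% identity over 50 bp), the sketch below slides a 50-column window along a pairwise alignment and reports the merged stretches covered by windows that meet the threshold. It is a minimal sketch under stated assumptions, not the pipeline used in this work: the function name `conserved_windows`, the rule that gap columns never count as matches, and the merging of overlapping windows are all choices made for illustration, and the exclusion of coding and repetitive sequence is omitted.

```python
def conserved_windows(aln_a, aln_b, window=50, min_identity=0.65):
    """Return merged (start, end) alignment-column intervals covered by
    windows of `window` columns with at least `min_identity` identity.

    aln_a and aln_b are the two rows of a pairwise alignment (equal length,
    '-' for gaps). Intervals are 0-based and half-open. Hypothetical helper.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aln_a)
    if n < window:
        return []
    # 1 where both rows carry the same base; gap columns never count.
    match = [int(a == b and a != "-")
             for a, b in zip(aln_a.upper(), aln_b.upper())]
    intervals = []
    count = sum(match[:window])  # matches in the first window
    for start in range(n - window + 1):
        if start > 0:
            # Slide by one column: add the new column, drop the old one.
            count += match[start + window - 1] - match[start - 1]
        if count / window >= min_identity:
            end = start + window
            if intervals and start <= intervals[-1][1]:
                # Overlaps or touches the previous interval: extend it.
                intervals[-1] = (intervals[-1][0], end)
            else:
                intervals.append((start, end))
    return intervals
```

Because a passing window may extend into poorly conserved flanking columns, the reported intervals are approximate; a production method would trim the ends and work from a proper global alignment.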

The human-fugu CNEs identified showed a significant overlap with experimentally
verified transcription factor binding sites (TFBS) of known transcriptional activators
and repressors, confirming that some CNEs function as transcriptional enhancers or
silencers. In addition, functional enrichment analyses indicated that the CNEs are
significantly associated with TF-encoding genes that are involved in regulating
development, particularly the development of the central nervous system. Furthermore,
expression profiling based on publicly available expression data showed that genes
expressed most highly in central nervous system tissues are enriched in human-
fugu CNEs. Motif discovery within the CNEs of TF-encoding genes that are expressed most
highly in the central nervous system revealed four 8-mer motifs that are likely to be
involved in transcriptional enhancer activity in the central nervous system. To verify
the functions of the CNEs and the motifs, I assayed the CNEs of a representative TF-
encoding gene, the LIM homeobox gene Lhx2, in transgenic mice. Four out of eight
CNEs tested demonstrated enhancer activity by recapitulating Lhx2 expression in the
midbrain and hindbrain at embryonic day 11.5. Mutagenesis of a predicted motif in a
selected CNE abolished gene expression in the neural tube and dorsal root ganglia,
demonstrating that the motif is indeed critical for enhancer activity.
List of tables
Table 1. Fugu scaffolds (assembly v3.0) for the seven Hox loci
Table 2. Primer sequences of the eight Lhx2 constructs
Table 3. TF-encoding genes in human, mouse and fugu genomes
Table 4. Conserved clusters of human, mouse and fugu TF-encoding genes
Table 5. Top twenty TF-encoding genes associated with the highest density of human-fugu CNEs in human
Table 6. Top twenty TF-encoding genes associated with the highest density of human-fugu CNEs in fugu
Table 7. Top twenty TF-encoding genes associated with the highest number and total length of human-fugu CNEs
Table 8. Location of human-fugu CNEs in relation to the protein-coding sequence of nearest human TF-encoding genes
Table 9. Significantly over-represented and under-represented Gene Ontology (GO) terms (P < 0.01) of CNE-associated human TF-encoding genes
Table 10. Over-represented 8-mer motifs in human-fugu CNEs of cluster #3 human TF-encoding genes
Table 11. Conserved syntenic fragments at the fugu and human Hox loci
Table 12. Human-fugu CNEs in the fugu Hox loci
Table 13. Human-fugu CNEs in the human Hox loci
Table 14. Details of Lhx2 CNE constructs tested in transgenic mice
Table 15. Summary of the expression patterns directed by the CNEs tested


List of figures
Figure 1. A flowchart of protocols used for identifying human, mouse and fugu TFs, and human-mouse, human-fugu CNEs
Figure 2. Distribution of lengths of human-mouse and human-fugu CNEs
Figure 3. Plot of total length of CNEs associated with DNA-binding TF-encoding genes against the length of non-repetitive noncoding sequence (in kilobases; kb) of human or fugu gene locus
Figure 4. Plot of CNE density against the total length of CNEs associated with DNA-binding TF-encoding genes
Figure 5. CNEs in the MEIS2 gene locus
Figure 6. Distribution of UTR-intronic and internal-intronic human-fugu CNEs
Figure 7. TF-encoding genes predominantly expressed in the central nervous system are enriched with CNEs
Figure 8. List of human-fugu CNEs associated with a representative TF-encoding gene FOXA2
Figure 9. An image of the location of CNEs relative to the associated TF-encoding gene FOXA2
Figure 10. Conserved syntenic blocks at the fugu and human Hox loci
Figure 11. VISTA plot of the MLAGAN alignment of fugu HoxAa locus with human HoxA and mouse HoxA loci
Figure 12. VISTA plot of the MLAGAN alignment of fugu HoxDa locus with human HoxD and mouse HoxD loci
Figure 13. Profiles of CNEs at the human HoxA, HoxB, HoxC, and HoxD loci
Figure 14. Lhx2 gene loci in human, mouse, chicken and fugu
Figure 15. Syntenic genes surrounding Lhx2 in human, mouse and fugu
Figure 16. CNEs in the Lhx2 locus
Figure 17. Ten CNEs associated with the human LHX2 gene locus are located within the introns of the upstream gene DENND1A
Figure 18. Expression of Lhx2, Crb2, Dennd1a in E11.5 mouse embryos
Figure 19. Lhx2_CNE2/3 directs lacZ expression in the neural tube and dorsal root ganglia
Figure 20. Lhx2_CNE5/6 directs lacZ expression in the hindbrain and neural tube
Figure 21. Lhx2_CNE7 directs lacZ expression in the hindbrain and neural tube
Figure 22. Lhx2_CNE10 directs lacZ expression in the midbrain, hindbrain and neural tube
Figure 23. Mutation of the overlapping motifs in Lhx2_CNE2
Figure 24. Site-directed mutagenesis of a predicted motif in Lhx2_CNE2 results in significantly reduced lacZ expression in the neural tube and dorsal root ganglia


List of abbreviations
bp base pair
CNE conserved noncoding element
DEPC diethyl pyrocarbonate
DNase deoxyribonuclease
GCR global control region
GO Gene Ontology
hsp68Prom mouse hsp68 minimal promoter
kb kilobase
Mb megabase
Myr millions of years
PBS phosphate buffered saline
PCR polymerase chain reaction
TF transcription factor
TFBS transcription factor binding site
TSS transcription start site
UTR untranslated region (of an mRNA)
Chapter 1: Introduction
1.1 The Human Genome Project
A major quest in biology is to understand the function and regulation of all the human
genes and their role in human biology and diseases. The Human Genome Project
launched in 1990 was the first step towards accomplishing this ambitious scientific
endeavor. The project had two major goals: (i) to determine the complete sequence of
the human genome and (ii) to identify all the functional elements encoded by it. In
2001, the first goal was achieved when two draft sequences of the human genome
were released (Lander et al. 2001; Venter et al. 2001). Since then, coordinated efforts
were made to close the gaps in the genome and to achieve high sequencing accuracy.
As a result, the human genome sequence is now essentially complete (International
Human Genome Sequencing Consortium 2004). The second goal, that is to map all
the functional elements in the human genome, is however far from completion.
Although about 21,000 protein-coding genes have been identified in the human
genome with high confidence (Clamp et al. 2007), the identification of noncoding
functional elements poses a major challenge. Comparisons of the human and mouse
genomes have revealed that approximately 5% of the human genome is under
purifying selection and therefore represents functional sequence (Waterston et al. 2002). Of
this 5%, about 1.5% of the genome is accounted for by protein-coding sequence and the
remaining 3.5% is functional noncoding sequence. Functional noncoding elements in
the human genome include noncoding RNA genes, cis-regulatory elements involved
in transcriptional regulation, splicing regulatory elements and sequences that dictate
chromatin structure. Protein-coding genes have a characteristic structure
and genetic code that facilitate their identification; in contrast, very little is known about the
structure and organization of noncoding functional elements. Hence, their
identification and verification remain a challenge. The main focus of my project is to
identify and verify functional cis-regulatory elements in the human genome.


1.2 Cis-regulatory elements
Cis-regulatory elements are sequences that regulate the precise spatial and temporal
expression of the target gene. While the genomic content in every cell in the body is
largely the same, important biological processes such as development, differentiation,
proliferation and apoptosis can take place because gene expression is differentially
regulated in various cell types and at different developmental stages. Cis-regulatory
elements comprise core promoters, proximal promoters, enhancers, silencers,
insulators and locus control regions. The core promoter is the minimal region
surrounding the transcription start site of a gene (extending ~35 bp on either side) that
is sufficient to direct the initiation of transcription by the RNA polymerase II
machinery (Butler and Kadonaga 2002). At the core promoter, RNA polymerase II,
general transcription factors and other associated factors congregate and form the pre-
initiation complex. The other classes of cis-regulatory elements contain multiple
binding sites for sequence-dependent DNA-binding transcription factors. The
proximal promoter extends ~250 bp on either side of the transcription start site (Butler
and Kadonaga 2002). Transcriptional enhancers, also known as cis-regulatory
modules, and silencers contain binding sites for transcriptional activators and
repressors respectively, and can be located far away from their target gene (Maston et
al. 2006). Insulators prevent a neighbouring enhancer from acting on a gene promoter
when the insulator is situated between them, or shield a gene from the encroachment
of repressive heterochromatin (West et al. 2002). Locus control regions are composed
of multiple cis-regulatory elements, including enhancers, silencers, insulators and matrix
or scaffold attachment regions (Maston et al. 2006). They are capable of directing the
tissue-specific expression of one or more linked genes to physiological levels in a
copy-number dependent manner at ectopic sites, possibly by opening up a
chromosome domain or by preventing the establishment of heterochromatin (Li et al.
2002). Among the various classes of cis-regulatory elements, transcriptional
enhancers contribute the most to specificity in gene expression. Core and proximal
promoters generally direct basal levels of transcription, while increases and nuances in
gene expression are largely achieved by the highly coordinated interaction between
transcriptional enhancers and their cognate upstream transcription factors. While a
gene can have at most a proximal and a core promoter, it can potentially be associated
with several enhancers, each containing a different combination of binding sites for
different transcription factors. Hence, the total number of gene expression patterns
that can be generated is significantly greater than the number of trans-acting factors
encoded in the human genome (Maston et al. 2006).

Experimental studies on transcriptional enhancers have revealed some characteristic
features of enhancers. Firstly, transcriptional enhancers are modular in nature. This
means that distinct enhancers can act independently upon a single promoter at
different times, in different cell types and in response to different external stimuli.
This feature is best demonstrated by the regulated expression of the pair-rule gene even
skipped (eve) in Drosophila. The expression of eve in seven thin stripes of the early
blastoderm embryo is achieved by five different enhancers, and each enhancer
contributes to expression in exactly one or two stripes (Andrioli et al. 2002).
Secondly, transcriptional enhancers can be located in the 5’ or 3’ flanking regions,
untranslated exonic regions or introns of a gene. They can even be situated at
significant distances from the gene. For example, an enhancer that initiates and directs
specific Sonic hedgehog (Shh) expression in the posterior regions of developing
mammalian forelimbs and hindlimbs was found to be located 1 Mb away in an intron
of a neighbouring gene (Lettice et al. 2003). Thirdly, enhancers act on their target
genes in an orientation-independent way. An SV40 viral enhancer sequence that was
cloned 1.4 kb upstream of a rabbit β-globin gene enhanced globin gene expression, as
did the sequence when cloned 3.3 kb downstream (Banerji et al. 1981). Lastly, the
prevailing model for long-range interaction between sequence-specific transcription
factors at an enhancer and the basic transcriptional machinery at the promoter is a
“looping-out” mechanism by which the DNA between the enhancer and the promoter
“loops out” so that the enhancer is brought close to the promoter (Vilar and Saiz
2005).

In order to understand the regulation of human genes in different tissues and at
different developmental stages, it is important to identify and characterize all the cis-
regulatory elements associated with human genes. A comprehensive catalog of cis-
regulatory elements will not only help improve our understanding of how individual
human genes are transcriptionally regulated, but also lead to an understanding of how
certain mutations in the human gene regulatory regions cause genetic diseases. Many
disease-associated loci fall within noncoding regions that are potentially involved in
gene regulation. For example, patients with preaxial polydactyly, a condition marked
by extra digits on hands and feet, have point mutations in the long-range enhancer of
the Shh gene described above. These point mutations cause Shh expression, which
is usually asymmetrically localized to the posterior limb buds, to expand abnormally
into the anterior region, thus resulting in extra digits (Lettice et al. 2003). In patients
with Van Buchem disease, a homozygous recessive disorder that is characterized
by a gradual increase in bone density and distortions in the face, mandible and head,
entrapment of cranial nerves and excessive weight, a 52-kb deletion has
been uncovered on human chromosome region 17p21. Within this deletion region
resides a 250-bp long-range enhancer that is postulated to direct expression of
sclerostin in the human adult skeleton. Sclerostin is a negative regulator that keeps adult
bone formation in check. Loss of the enhancer results in significant loss of adult
sclerostin expression and uncurtailed bone growth (Loots et al. 2005). Finally, a
genomic deletion in human chromosome region 4q35 has been associated with facio-
scapulo-humeral dystrophy (FSHD), one of the most common forms of muscular
dystrophy. The deletion locus contains a matrix attachment region which blocks a
nearby enhancer from activating the FRG1 (FSHD region gene 1 protein) gene in
normal cells. The FRG1 gene encodes a putative splicing factor that is reportedly
over-expressed in the myoblasts of FSHD patients. The deletion event truncates a
D4Z4 repeat array adjacent to the matrix attachment region and disrupts the latter’s
ability to block the enhancer from activating FRG1, causing an aberrant upregulation
of FRG1 (Petrov et al. 2008). The role of FRG1 in FSHD is still unknown.

The above examples of cis-regulatory mutations that are associated with genetic
diseases underscore the importance of cis-regulatory elements in maintaining the
correct spatio-temporal expression of crucial genes. Besides gleaning new insights
into genetic diseases from cis-regulatory elements, clinicians have begun to use cis-
regulatory elements to direct tissue-specific or sustained expression of transgenes in
gene therapy approaches. For instance, inducible promoters that are sensitive to
external stimuli like ionizing radiation or hypoxia are highly specific and have been
proposed as tools to restrict expression of cytotoxic or tumoricidal genes to cancer
cells (Goverdhana et al. 2005). Furthermore, insulators have been explored as a way
to override chromatin position effects and to ensure sustained expression of
transgenes introduced into the human genome during gene therapy (Recillas-Targa et
al. 2004). Hence, cis-regulatory elements have become useful in understanding and
correcting genetic disorders. Furthermore, the identification of cis-regulatory elements
and mutations that may occur in them has implications for the evolutionary processes
that give rise to population variation. At least 1% of all human genes are postulated to
possess functional cis-regulatory polymorphisms, and these variations lead to
considerable perturbation in gene expression levels and may thus impact various
aspects of phenotype (Rockman and Wray 2002). For the reasons mentioned above,
identifying and characterizing all the cis-regulatory elements in the human genome is
indeed a high-priority endeavor.

1.3 Methods to identify cis-regulatory elements
Cis-regulatory elements such as transcriptional enhancers lack a well-defined
structure similar to that of protein-coding genes. They are typically composed of
clusters of binding sites for several different transcription factors and there is very
limited information on how these binding sites are arranged within the cis-regulatory
element. Transcription factor binding sites are typically only 5 – 12 bp long and are
degenerate, in that variation is tolerated at some positions more than at others.
Without a priori information about which transcription factors regulate a gene of
interest, the identification of all the cis-regulatory elements associated with the gene is
a complex task. Furthermore, cis-regulatory elements are position-independent and
orientation-independent; hence searches for cis-regulatory elements must
cover extensive stretches of the genome and both strands of the DNA. The challenge
is not only to identify a cis-regulatory element, but also to dissect the core sequences
of a cis-regulatory element that are crucial for its function and to determine their
mode of action. The methods that have been used to date in identifying and
characterizing cis-regulatory elements can be broadly categorized into traditional
methods and genomic-era strategies that became possible after the availability of
whole genome sequences of human and other organisms.
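
To make the two points above concrete (binding sites are short and degenerate, and both DNA strands must be searched), the following minimal sketch scans a sequence for a degenerate consensus written in standard IUPAC nucleotide codes. The function names and the example consensus `GATAR` are illustrative assumptions, not motifs or tools taken from this work; practical motif scanners usually use position weight matrices rather than a consensus string.

```python
import re

# Standard IUPAC nucleotide ambiguity codes, expressed as regex classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_both_strands(sequence, consensus):
    """Return sorted (start, strand) hits for a degenerate consensus.

    `start` is the 0-based position of the leftmost base of the site on the
    forward strand, whichever strand the site lies on. Hypothetical helper.
    """
    sequence = sequence.upper()
    k = len(consensus)
    # A lookahead lets overlapping sites be reported as separate hits.
    core = "".join(IUPAC[c] for c in consensus.upper())
    pattern = re.compile("(?=(" + core + "))")
    hits = [(m.start(), "+") for m in pattern.finditer(sequence)]
    # Search the reverse complement and map hits back to forward coordinates.
    rc = reverse_complement(sequence)
    for m in pattern.finditer(rc):
        hits.append((len(sequence) - m.start() - k, "-"))
    return sorted(hits)
```

For example, `scan_both_strands("CCGATAAGGTTATCGG", "GATAR")` reports one site on each strand. With sites this short, matches are expected by chance in any long sequence, which is one reason sequence scanning alone cannot establish function.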

1.3.1 Traditional strategies
Traditional approaches to identifying cis-regulatory elements can be described as
either biochemical or genetic methods. The first biochemical method to study protein-
DNA interaction was the DNA footprinting assay (Galas and Schmitz 1978). The aim of
this method is to identify the binding sites of a particular protein within a DNA
sequence. The protein of interest is added to PCR-amplified DNA that has been singly
end-labeled and the mixture is subjected to cleavage (either by an endonuclease or
chemical reagent). The resulting DNA fragments are separated by gel electrophoresis
alongside the fragments obtained from a control DNA sample without addition of the
protein. The two DNA samples generate different cleavage patterns, with
the protein-bound DNA sample missing bands at DNA sites (“footprints”) where the
protein has bound and protected the DNA from cleavage. To identify the binding sites, the
positions of the footprints are estimated in relation to the radiolabeled end. The
advantage of this method is that binding sites can be mapped at single-nucleotide
resolution when a highly reactive and sequence-independent chemical reagent such as
hydroxyl radical is used to cleave the DNA (Jain and Tullius 2008). Another
biochemical method for characterizing protein-DNA interaction is the electrophoretic
mobility shift assay (EMSA) introduced in the 1980s. EMSA resembles the DNA
footprinting assay in that it detects binding of a protein of interest to an end-labeled
DNA sequence based on movement of the mixture along a polyacrylamide gel. The
difference is that there is no cleavage step, and hence a single DNA band is observed
after gel electrophoresis as opposed to a series of bands as in DNA footprinting. DNA
molecules to which proteins have bound move more slowly in the gel than the control
DNA with no protein, resulting in a ‘band shift’. The advantage of EMSA is that
multiple binding sites can be detected by applying increasing protein concentrations
that will then result in more pronounced band shifts. A limitation of the above two
methods is that both assays require prior knowledge of a putative cis-regulatory
element and a potential DNA-binding protein. In addition, the step of protein-DNA
interaction is carried out in vitro, which may not necessarily reflect in vivo events.
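
The logic of reading a footprint can be sketched in a few lines: cleavage positions that give a band in the control lane but not in the protein-bound lane are collected, and runs of consecutive missing positions are reported as candidate protected sites. This is a simplified illustration under stated assumptions: the function name `footprints`, the representation of each lane as a list of cleavage positions counted from the labeled end, and the `min_run` filter are hypothetical, and real gels are interpreted from band intensities rather than presence or absence.

```python
def footprints(control_bands, bound_bands, min_run=1):
    """Return (first, last) runs of consecutive cleavage positions that
    appear in the control lane but are missing from the protein-bound lane.

    Positions are integers counted from the labeled end. Runs shorter than
    `min_run` positions are discarded as likely noise. Hypothetical helper.
    """
    missing = sorted(set(control_bands) - set(bound_bands))
    runs = []
    for pos in missing:
        if runs and pos == runs[-1][1] + 1:
            # Position is adjacent to the current run: extend the run.
            runs[-1] = (runs[-1][0], pos)
        else:
            runs.append((pos, pos))
    return [run for run in runs if run[1] - run[0] + 1 >= min_run]
```

Setting `min_run` above 1 discards isolated missing bands, on the assumption that a bound protein protects several adjacent positions.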

One example of a biochemical assay that captures the chromatin state of DNA
sequences in vivo is the DNase I hypersensitivity assay (Keene et al. 1981; McGhee et
al. 1981). In the eukaryotic nucleus, DNA is coiled around histone octamer complexes
at regular intervals to form nucleosomes. These nucleosomes serve to pack the large
eukaryotic genome into the nucleus and also influence important events like gene
transcription and DNA replication. Modifications to histone proteins, like
trimethylation of histone H3’s lysine 4 and acetylation of histone H3 (Xi et al. 2007),
can lower the DNA’s affinity for nucleosomes and displace nucleosomes, creating a
state of open chromatin and making the underlying DNA susceptible to DNase I
cleavage. The displacement of nucleosomes also implies that the DNA is accessible to
binding by proteins such as transcription factors. Hence, DNase I hypersensitivity
marks various functional noncoding elements, including cis-regulatory elements,
origins of replication, recombination elements and structural sites of telomeres and
centromeres (Cereghini et al. 1984; Gross and Garrard 1988). The advantage of using
DNase I hypersensitivity to find potential cis-regulatory elements is that DNase I
hypersensitivity is a transient property of DNA sequences, and hence this method is
useful in the study of spatial and temporal patterns of transcriptional activity. The cis-
regulatory elements that are active in a particular cellular phenotype can also be
identified by carrying out the assay on samples derived from that phenotype. Since the
assay was introduced, the level of resolution has greatly improved from ~500 bp on
either side of the nucleosome-free region using Southern transfer followed by indirect
end-labeling (Wu 1980) to near-nucleotide resolution using PCR assessment (Yoo et
al. 1996) and quantitative PCR (McArthur et al. 2001). Nevertheless, DNase I
hypersensitivity implies the presence of a cis-regulatory element but does not
demonstrate its function.

Chromatin immunoprecipitation (ChIP) is a method that identifies the sites of
transcription factor-DNA interaction in vivo. DNA-binding proteins are covalently
cross-linked to genomic DNA by the addition of formaldehyde to living cells. The
genomic DNA is extracted from the cells and fragmented by sonication to lengths in
the range of 100 – 500 bp. An antibody highly specific for the transcription factor of
interest is used to precipitate all the DNA fragments to which the protein is bound.
Protein-DNA crosslinking is reversed and the DNA fragments are then purified. A
ChIP library is constructed whereby the purified DNA fragments are cloned into a
vector and sequencing is carried out using vector primers. Although the ChIP
technique is effectively unbiased in its search space for cis-regulatory elements and it
can be used to detect protein-DNA interactions in a particular cell type, there
remain several disadvantages to this technique. A highly specific antibody must be
raised against the transcription factor of interest, which is not a trivial task, especially
for transcription factors that belong to larger families of structurally similar
transcription factors. In addition, more in-depth analyses have to be carried out to map
the actual bases of contact between the transcription factor and DNA within the larger
DNA fragment. Finally, some of the detected interactions may be indirect
interactions, meaning that another DNA-binding factor has interacted with the
transcription factor being studied.

A more accurate and reliable way to verify a putative cis-regulatory element is to
examine its ability to direct gene expression. This goal is achieved through the
reporter gene assay. In reporter gene assays aimed at verifying enhancers, the
sequence of interest is cloned upstream of a reporter gene linked to a core promoter.
Examples of reporter genes include genes that encode β-galactosidase, green
fluorescent protein and firefly-luciferase. The construct is incorporated into cultured
cells through transient or stable transfection and this is a fast method to measure or
examine the expression of the reporter gene as directed by the cis-regulatory element
being studied (Carey and Smale 2000; Himes and Shannon 2000). By varying the
length of the putative cis-regulatory element, the minimal cis-regulatory element can
be identified. Silencers are tested in a similar way to enhancers, except that the core
promoter used is one that drives strong reporter gene expression so that any resulting
reduction in expression level can be readily detected (Maston et al. 2006). On the
other hand, more complex arrangements are required in order to verify putative
insulators and locus control regions. A putative insulator is cloned between a known
enhancer and a core promoter to demonstrate its enhancer-blocking ability or assessed in
transgenic assays for its heterochromatin-barrier ability (West et al. 2002), while a
putative locus control region must be verified in transgenic assays to see if it can
overcome chromosomal position effects (Maston et al. 2006). Two limitations of
conducting reporter gene assays in cell lines are that developmentally active
enhancers cannot be tested and that cell lines may not be available for all
types of cells. To overcome these limitations, transgenic systems are used in place of
cultured cells. In addition, transgenic systems allow the researchers to show that a cis-
regulatory element indeed functions in living organisms. The plasmid construct is first
linearized and then injected into mouse or zebrafish embryos, after which the
construct becomes randomly integrated into the genome, for example as demonstrated
by Kothary et al. (1989) and Muller et al. (1997). The researcher then monitors the
reporter gene expression during development. The advantage of transgenic reporter
assays is that the effect of putative cis-regulatory elements can be studied in vivo;
to date, this remains the most accurate and effective way to show that an element is
functional. However, this method does have its disadvantages. Firstly, because
integration of the transgene into the host genome occurs at random locations, in vivo
expression of the reporter gene may not reflect the actual in vivo expression driven by
the cis-regulatory element, due to position effects (whereby the reporter gene
inadvertently falls under the control of endogenous cis-regulatory elements or
heterochromatin), thus leading to ectopic gene expression. Hence, it is crucial to
ensure that a similar expression pattern is obtained in several independent transgenic
lines before a conclusion is made about the putative cis-regulatory element being
tested. Secondly, microinjections of the transgene into mouse or zebrafish embryos
are carried out at the one-cell stage so that ideally, all the cells in the resulting
organism contain a copy of the transgene in their genomes. However, mosaicism arises if
cell divisions begin before the transgene can be integrated into the genome, especially
in the rapidly dividing zebrafish embryo. This means that insertion of the transgene
into the host genome occurs in only a subset of all cells and reporter gene expression
can only be observed in subpopulations of cells. Nevertheless, the analysis of cis-
regulatory elements by examining mosaic patterns of reporter gene expression in
microinjected transient transgenic fish embryos is fairly rapid (Muller et al. 2002).
Furthermore, these assays are now aided by techniques such as transposon-based gene
transfer (Grabher and Wittbrodt 2007; Kawakami 2007). After transgenic assays have
been carried out and a cis-regulatory element is shown to direct gene expression to
specific tissues at particular developmental stages, subsequent experiments include
making constructs with serial deletions or site-directed mutagenesis to find the critical
segments of the cis-regulatory element. These experiments are very tedious and
time-consuming and are appropriate for only small sets of genes.
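
The serial-deletion strategy described above can be illustrated with a minimal sketch. It is not a protocol from the cited studies: the function names, the step size, the minimum length and the toy activity test are all hypothetical, and in practice "activity" would come from a reporter readout rather than from a sequence check.

```python
def serial_deletions(element, step, min_len):
    """Enumerate 5' and 3' truncations of a putative element.

    Returns a list of (end_trimmed, start, end, sequence) tuples, where
    start/end are 0-based, end-exclusive coordinates within `element`.
    `step` and `min_len` are illustrative parameters, not published values.
    """
    n = len(element)
    constructs = []
    # 5' deletions: remove `step` bases from the left end each round
    # (the first entry, start=0, is the full-length element).
    for start in range(0, n - min_len + 1, step):
        constructs.append(("5'", start, n, element[start:]))
    # 3' deletions: remove `step` bases from the right end each round.
    for end in range(n - step, min_len - 1, -step):
        constructs.append(("3'", 0, end, element[:end]))
    return constructs


def shortest_active(constructs, is_active):
    """Pick the shortest construct that still scores as active.

    `is_active` stands in for an experimental readout (hypothetical here).
    Returns None if no construct is active.
    """
    active = [c for c in constructs if is_active(c[3])]
    return min(active, key=lambda c: len(c[3])) if active else None


# Toy example: a 20 bp "element" and a made-up activity rule.
element = "ACGT" * 5
constructs = serial_deletions(element, step=5, min_len=10)
best = shortest_active(constructs, lambda seq: "GTAC" in seq)
```

Each tuple corresponds to one construct that would have to be cloned and assayed separately, which is why the approach scales poorly beyond small sets of genes.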

1.3.2 Genomic era strategies
Since the release of the human genome sequence, methods of detecting cis-regulatory
elements have gradually become high-throughput, capable of detecting hundreds to
thousands of elements in one experiment. Typically, these methods make use of the
completed human genome as a reference to which large numbers of experimentally
obtained sequences can be mapped or from which numerous microarray probes can be
designed for large-scale hybridization. The DNase I hypersensitivity assay and ChIP
assay take advantage of the availability of genome sequences in similar ways, to
identify DNase I hypersensitive sites and regions bound by transcription factors on a
whole-genome level. Massively parallel signature sequencing was used to generate
many short sequence tags from a genomic DNase library derived from quiescent
human CD4+ T-cells and ~14,000 DNase I hypersensitive sites were identified
(Crawford et al. 2006b). In the same year, NimbleGen tiling microarrays designed
from the ENCODE regions that make up 1% of the human genome were used to
identify captured DNase I-digested ends in primary and immortalized human cell
types (Crawford et al. 2006a; Sabo et al. 2006). More recently, a high-resolution
whole-genome map of chromatin perturbations in primary human CD4+ T-cells was
constructed using Solexa and 454 sequencing platforms in combination with
NimbleGen tiling arrays and almost 95,000 sites were identified (~2.1% of the
genome) (Boyle et al. 2008).
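
The general idea of mapping many short tags to a reference genome and then looking for positions where tags cluster can be sketched as follows. This is a simplified illustration under stated assumptions, not the analysis used by Crawford et al. or Boyle et al.: the function name and the `max_gap` and `min_tags` thresholds are hypothetical, and the tags are assumed to have been mapped to (chromosome, position) pairs already.

```python
from collections import defaultdict


def call_clusters(tags, max_gap=100, min_tags=3):
    """Group mapped tag positions into candidate sites.

    tags: iterable of (chrom, position) pairs for tags already mapped
          to the reference genome.
    max_gap, min_tags: illustrative thresholds (assumptions, not
          published values).
    Returns a list of (chrom, start, end, n_tags) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in tags:
        by_chrom[chrom].append(pos)

    clusters = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev <= max_gap:
                # Tag lies close to the previous one: extend the cluster.
                count += 1
            else:
                # Gap too large: close the current cluster, start a new one.
                if count >= min_tags:
                    clusters.append((chrom, start, prev, count))
                start, count = pos, 1
            prev = pos
        if count >= min_tags:
            clusters.append((chrom, start, prev, count))
    return clusters


# Toy input: three nearby tags on chr1, one isolated tag, two tags on chr2.
tags = [("chr1", 100), ("chr1", 150), ("chr1", 220),
        ("chr1", 5000), ("chr2", 10), ("chr2", 20)]
sites = call_clusters(tags)
```

A real analysis would also need to account for background tag density and mappability before treating a cluster as a hypersensitive site.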

Recent extensions of the original ChIP assay allow the researcher to map transcription
factor binding sites on a genome-wide scale. In ChIP-on-chip, a method also known
as genome-wide location analysis, the protein-bound enriched DNA fragments are
hybridized to a tiling microarray. This method was applied to human chromosomes 21
and 22 to find the binding sites for three transcription factors – Sp1, c-Myc and p53
(Cawley et al. 2004). Although ChIP-on-chip has become a fairly high-throughput
way to identify bound sequences across the genome owing to the availability of
commercial microarray chips (e.g., NimbleGen Human ChIP-chip 2.1M array, Affymetrix
GeneChip® Human Tiling 2.0R array), there are problems of cross-hybridization,
especially for large mammalian genomes. In addition, the method predicts large
numbers of binding sites and only a fraction is expected to be truly functional, hence
statistical methods are required to attach a statistical significance to each site as an
indicator of its functional relevance (Buck and Lieb 2004). In ChIP-PET, the enriched
DNA sequences are sequenced through the paired-end ditag sequencing method (Loh
