A study on genomic distribution and sequence features
of human long inverted repeats reveals species-specific
intronic inverted repeats
Yong Wang and Frederick C. C. Leung
School of Biological Sciences and Genome Research Centre, The University of Hong Kong, China
Keywords
human; intron; long inverted repeat; primates; stem-loop
Correspondence
Y. Wang, School of Biological Sciences, The University of Hong Kong,
Hong Kong, China
Fax: +852 2857 4672
Tel: +852 2299 0825
E-mail:
(Received 11 December 2008, revised 19 January 2009, accepted
23 January 2009)
doi:10.1111/j.1742-4658.2009.06930.x

The inverted repeats present in a genome play dual roles. They can induce
genomic instability and, on the other hand, regulate gene expression. In
the present study, we report the distribution and sequence features of
recombinogenic long inverted repeats (LIRs) that are capable of forming
stable stem-loops or palindromes within the human genome. A total of 2551
LIRs were identified, and 37% of them were located in long introns
(largely > 10 kb) of genes. Their distribution appears to be random in
introns and is not restricted, even in regions near intron–exon
boundaries. Almost half of them comprise TG/CA-rich repeats, inversely
arranged Alu repeats and MADE1 mariners. The remaining LIRs are mostly
unique in their sequence features. Comparative studies of human,
chimpanzee, rhesus monkey and mouse orthologous genes reveal that human
genes have more recombinogenic LIRs than other orthologs, and over 80%
are human-specific. The human genes associated with the human-specific
LIRs are involved in the pathways of cell communication, development and
the nervous system, based on significantly over-represented Gene Ontology
terms. The functional pathways related to the development and functions
of the nervous system are not enriched in chimpanzee and mouse orthologs.
The findings of the present study provide insight into the role of
intronic LIRs in gene regulation and primate speciation.

Abbreviations
FDR, false discovery rate; GO, Gene Ontology; LIR, long inverted repeat;
siRNA, small interfering RNA; TIR, terminal inverted repeat.

FEBS Journal 276 (2009) 1986–1998 © 2009 The Authors Journal compilation © 2009 FEBS

An inverted repeat consists of two repeat copies (hereafter termed arms)
that are approximately complementary to each other. Generally, there is a
spacer between the arms, and the full structure of an inverted repeat can
form a stem-loop or palindrome. The potential to form a stable stem-loop
is determined by the arm size, spacer size and the matching degree of the
arms [1,2]. For example, a very large spacer makes it difficult for the
two arms to form a stem.
Studies of inverted repeats show that they may cause instability in a
genome and, on the other hand, regulate gene expression in both
prokaryotes and eukaryotes. Being capable of forming secondary structures
[3], inverted repeats can induce genomic instability via gene
amplification, recombination, DNA double-strand breaks and rearrangement
[1,2,4–8]. Moreover, inverted repeats provide sites for the integration
of viruses into eukaryotic genomes [9,10] and also act as replication
stall sites, as shown in a recent study in which evidence obtained in
vivo demonstrated replication stalling by hairpins formed by inverted
repeats in bacteria, yeast and mammalian cells [11]. As a result, they
are restricted in a genome to some extent. For example, neighboring
repetitive elements, such as Alu repeats, are generally found to occur in
the same direction, and head-to-head and tail-to-tail arrangements are
rarely observed, particularly when the spacer between them is tiny
[1,12]. In a mouse transgene experiment, the introduction of a large
palindrome was followed by numerous rearrangements, which were assumed to
attenuate the impact of the palindrome in the progeny [13].
Inverted repeats also regulate gene expression. The
stem-loops and palindromes constructed by inverted
repeats are involved in RNA interference, transcription
initiation of genes, initiation of DNA replication and
alternative splicing of exons. The small interfering (si)RNA genes
active in RNA interference comprise inverted repeats capable of forming
a stem-loop motif longer than 22 bp. Some are derived from miniature
inverted-repeat transposable elements [14]. To date, siRNA genes have
been identified in organisms ranging from Caenorhabditis elegans to
humans. RNA interference was initially
discovered as an efficient mechanism for inhibiting the
expression of specific genes [15,16], and later was
found to be responsible for developmental regulation
[17,18] and heterochromatin maintenance [19,20]. In
promoters, inverted repeats can facilitate the recogni-
tion process and the subsequent binding of RNA poly-
merase during gene transcription [21,22]. Moreover,
the inverted repeats in a cruciform structure will attract mediators
of second messenger-directed transcription, hence altering the
transcriptional response
[23]. Many studies also show that inverted repeats are
essential for the initiation of DNA replication in plas-
mids, bacteria, eukaryotic viruses and mammalian cells
[24]. The inverted repeats in introns are able to affect
the alternative splicing of exons [25,26] and the
removal efficiency of introns [27,28]. For example,
alternative splicing of exon 2 in the COL2A1 gene was
mediated by a stem-loop adjacent to the exon–intron
boundary [25].
Because inverted repeats are both unstable and func-
tional elements in a genome, they are expected to be
distributed in intergenic regions or large introns of
genes. In the yeast genome, almost 100% of large
palindromes (> 25 bp) are far away from coding
regions [29]. Any insertion approaching conserved
transcribed sites will ultimately be erased unless its
presence provides an evolutionary advantage and,
thus, is under positive selection. One line of evidence
for this is that, compared to introns, upstream regions
of genes have more palindromes, which probably
developed for the initiation of transcription [30]. A
recent study shows that Caenorhabditis lineages have
conserved inverted repeats in intergenic regions [31],
which were suggested to be functional and therefore
actively maintained in the lineages. In the human
genome, there are many such motifs, although we have
little knowledge of their fine-scale distribution,
sequence features and potential functions at present [32,33].
Human inverted repeats were investigated in a
previous study, in which a majority of them were
found to be weak with respect to their capacity to
form a simple stem-loop or hairpin in terms of their
structural features [32]. Genome-wide distribution of
human palindromes has also been surveyed, and a
database has been created for public use [30]. How-
ever, the palindromes with mismatches and indels were
not collected in the database.
In the present study, we first located all the long
inverted repeats (LIRs) characterized with long arms,
high arm similarity and a short internal spacer in the
human genome. They were termed recombinogenic
LIRs in our previous study on human chromosomes
21 and 22 [33], although their distribution and fre-
quency had not been fully surveyed in the whole
human genome. The present study aims to provide a
panoramic view of recombinogenic LIRs. On the basis
of evidence obtained in vivo [1,2,11], the LIRs identi-
fied in the present study can easily form stem-loops or
palindromes. Their presence in the human genome by
itself implies that they are functional in some manner.
The results obtained showed that 37% of the LIRs
were located in intronic regions and some were
primate-specific. TG ⁄ CA-rich repeats are the most
frequently observed feature in LIR arms. Considering
that the LIRs probably have essential functions and drive the speciation
of primates, we studied the degree of conservation and species
specificity of the LIRs among orthologous genes from the mouse (Mus
musculus), rhesus monkey (Macaca mulatta), chimpanzee (Pan troglodytes)
and human. The results obtained demonstrate that human orthologs have
relatively more LIRs, most of which are human-specific. These
human-specific LIRs are probably essential for the development of the
advanced functions of the human nervous system in light of the Gene
Ontology (GO) profile of human orthologs.
Results
Characteristics and distribution of LIRs in human
autosomes
We identified 2551 LIRs in human autosomes, and approximately 87% of
them have a short spacer (0–9 bp) and arm (31–59 bp) (Fig. 1). By
contrast, the mismatch rate between the arms of an LIR varies from 0 to
0.15, and the numbers of LIRs in the different mismatch-rate ranges show
a relatively low standard deviation (Fig. 1). These results indicate
that a majority of the LIRs are able to form a stem-loop with a stem of
31–59 bp and a tiny loop (or none for a palindrome).
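The three properties used above to characterize an LIR (arm length, spacer size and arm mismatch rate) can be illustrated with a minimal Python sketch. It is not the search algorithm used in this study, which also tolerates insertions in the stem. The function names, the ungapped arm comparison and the short example sequence are assumptions made purely for illustration.

```python
# Illustrative sketch only: ungapped check of one candidate inverted repeat.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def arm_mismatch_rate(left_arm, right_arm):
    """Fraction of positions where the right arm cannot pair with the left arm."""
    if len(left_arm) != len(right_arm):
        raise ValueError("this ungapped sketch needs arms of equal length")
    # A perfectly pairing right arm equals the reverse complement of the left arm.
    target = reverse_complement(left_arm)
    mismatches = sum(a != b for a, b in zip(target, right_arm))
    return mismatches / len(left_arm)


def is_candidate_lir(seq, arm_len, spacer_len, max_mismatch_rate=0.15):
    """Check whether seq = left arm + spacer + right arm meets the threshold."""
    if len(seq) != 2 * arm_len + spacer_len:
        return False
    left = seq[:arm_len]
    right = seq[arm_len + spacer_len:]
    return arm_mismatch_rate(left, right) <= max_mismatch_rate


# Hypothetical 10-bp arms with a 3-bp spacer (real LIR arms are 31 bp or longer).
example = "GATTACAGGC" + "TTT" + "GCCTGTAATC"
print(is_candidate_lir(example, arm_len=10, spacer_len=3))
```

The default threshold of 0.15 mirrors the upper end of the mismatch-rate range reported above; a realistic search would additionally slide over the genome and allow gaps between the arms.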
The genomic distribution of the LIRs shows that the
density of LIRs selected by our criteria is quite low
(Fig. 2). The highest and lowest LIR densities were
observed in chromosomes 4 (1.2 ⁄ Mb) and 22
(0.44 ⁄ Mb) respectively. Interestingly, the LIR density
negatively co-varies with gene density among the chromosomes
(t = 19.8; P < 10⁻⁴) (Fig. 3). The point denoting chromosome 19 lies
notably far from the regression line, which is explained by gene
clusters that give chromosome 19 a gene density two-fold higher than
the genomic average [34].

Fig. 1. Characteristics of human LIRs. The 2551 LIRs were classified
according to spacer size, mismatch rate and arm length.
Fig. 2. Distribution of LIRs in the human genome. The density is
represented by the number of LIRs per 1 Mb sequence. The shortest bars
denote one LIR per 1 Mb.

We found that the negative correlation is due to the high frequency of
LIRs in long genes. A total of 956 LIRs (in 702 genes) were located in
genic regions, and 1595 in intergenic regions. In other words, 37% of
the LIRs were found within genes. However, our calculation of the
coverage of genes in the human genome was 26.9%, which is consistent
with the value reported previously [35]. When introns were taken into
account, the percentage was 25.1%. This implies that the distribution of
the LIRs is not random. Statistical analysis performed on the results
shows that the presence of LIRs is significantly biased towards genes
(chi-square test; P < 0.0001). The LIRs that have long arms (> 400 bp)
and the associated genes are listed in Table S1. Surprisingly, over half
of them were found within genes. There are two cases of partial overlap
between LIRs and exons. In one case, the left arm of an intronic LIR
extends into an exon of c14orf165 and, in the other case, an LIR on
chromosome 17 partially overlaps an exon of a putative gene. The
frequency is much lower than expected. We did not find any LIRs
overlapping either the start or end site of a gene.
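The bias towards genic regions can be checked with a chi-square goodness-of-fit calculation. The sketch below assumes that the expected genic fraction is the 26.9% gene coverage quoted above; the text does not state exactly how the authors set up their test, so the helper names and this choice of expectation are illustrative rather than a reproduction of their analysis.

```python
import math


def chi_square_gof(observed, expected_fractions):
    """Chi-square goodness-of-fit statistic for observed counts."""
    total = sum(observed)
    expected = [total * fraction for fraction in expected_fractions]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_p_value_df1(statistic):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Counts from the text: 956 genic and 1595 intergenic LIRs.
observed = [956, 1595]
# Assumption: expected genic share equals the 26.9% gene coverage.
gene_coverage = 0.269
statistic = chi_square_gof(observed, [gene_coverage, 1.0 - gene_coverage])
p_value = chi_square_p_value_df1(statistic)
print(f"chi-square = {statistic:.1f}, P < 0.0001: {p_value < 1e-4}")
```

With two categories there is one degree of freedom, which is why the p-value can be written with the complementary error function from the standard library instead of a statistics package.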
Further results confirmed that LIRs tend to reside
in large intronic and intergenic regions. Only five LIRs
were found in introns < 1 kb (the smallest intron was
757 bp), and none in intergenic regions < 2 kb. The
median sizes for the introns and the intergenic regions
are 46 and 386 kb, respectively. Moreover, most of the
LIR-containing intergenic regions are > 10 kb. Corre-
spondingly, a chromosome that has more long genes
will show a lower gene density, in agreement with the
above negative correlation between LIR density and
gene density (Fig. 2).
We then studied the positions of the LIRs in introns and intergenic
regions. A short distance to the exon–intron boundary or transcription
start site is an indication that an LIR is functioning in the gene. A
ratio of 0–0.5 was used to denote the relative distance of each LIR to
the boundaries (or, equivalently, to the center) of the region, and this
scale was divided into five ranges. We calculated the percentage of LIRs
falling within each range and observed only a small difference between
the ranges, suggesting a random distribution of LIRs in both intronic
and intergenic regions (see Fig. S1). We also considered the effect of
the length of these regions on the distribution. The intronic and
intergenic regions were classified on the basis of their lengths. Within
each of the length groups, the numbers of LIRs in the ratio ranges do
not show any significant difference (chi-square tests; d.f. = 4;
P > 0.1) (see Fig. S1). Therefore, the LIRs do not avoid the boundaries
of exonic or genic regions. The median distance to the exon boundaries
is 7.8 kb, and that to the gene boundaries is 69 kb.
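The relative-distance measure can be sketched in Python as follows. The sketch assumes that the ratio is the distance from the LIR to the nearest boundary divided by the length of the intron or intergenic region, so that 0 lies at a boundary and 0.5 at the center; the function names, coordinates and the equal-width binning are hypothetical and not taken from the authors' code.

```python
def relative_distance(position, region_start, region_end):
    """Distance to the nearest region boundary as a fraction of region length."""
    length = region_end - region_start
    if length <= 0 or not region_start <= position <= region_end:
        raise ValueError("position must lie inside a non-empty region")
    nearest = min(position - region_start, region_end - position)
    return nearest / length


def ratio_range(position, region_start, region_end, n_ranges=5):
    """Index (0-based) of the equal-width ratio range between 0 and 0.5."""
    length = region_end - region_start
    nearest = min(position - region_start, region_end - position)
    # Integer arithmetic avoids floating-point errors at range edges.
    index = (2 * n_ranges * nearest) // length
    return min(index, n_ranges - 1)  # the exact center falls in the last range


# Hypothetical intron spanning coordinates 0..10000 with three LIR midpoints.
for midpoint in (700, 7800, 5000):
    print(midpoint, relative_distance(midpoint, 0, 10000),
          ratio_range(midpoint, 0, 10000))
```

Counting how many LIRs fall into each of the five ranges then gives the observed values for a chi-square test with four degrees of freedom, as in the comparison reported above.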
Strikingly, pseudogenes were frequently found
around the intergenic LIRs. A total of 803 intergenic
LIRs (50%) have one or two neighboring pseudogenes,
of which 422 are RNA pseudogenes. According to the
annotation in the Ensembl database (http://www.
ensembl.org), approximately 27% of the human genes
are pseudogenes. The occurrence of pseudogenes
adjacent to LIRs is statistically significant (chi-square
test; P < 0.0001).
Sequence features of the human LIRs
We found that over half (51%) of the identified LIRs could be assigned
to groups of at least three members on the basis of sequence similarity.
The group members comprise simple repeats, known repetitive elements,
amplified genes or duplicated genomic fragments. The largest group
consists of LIRs formed by stretches of TG/CA dinucleotides and
interspersed TA dinucleotides. We defined them as TG/CA-rich LIRs,
accounting for 33% and 39% of all the LIRs in the intronic and
intergenic regions, respectively (Fig. 4). By contrast, TC/GA-rich LIRs
account for only 3% in both regions. Thus, the frequency of TG/CA-rich
LIRs is at least 11-fold higher than that of TC/GA-rich ones (for
intronic LIRs: 11-fold; for intergenic LIRs: 13-fold). The difference is
statistically significant (chi-square test; P < 0.0001). On average,
TG/CA-rich and TC/GA-rich LIRs together account for 38% of the
identified LIRs. Additionally, we could not identify any LIRs formed by
simple repeats with longer repeat units (> 2 bp).
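As a rough illustration of how an arm could be labeled TG/CA-rich or TC/GA-rich, the following Python sketch scores the fraction of overlapping dinucleotides that belong to each repeat type. The text gives no numerical criterion for these classes, so the 0.7 threshold, the function names and the example sequences are assumptions, not the definition used in this study.

```python
# Overlapping windows of a (TG)n tract read TG, GT, TG, ...; its complementary
# strand reads CA, AC, ..., so both phases of each dinucleotide are included.
TG_CA = {"TG", "GT", "CA", "AC"}
TC_GA = {"TC", "CT", "GA", "AG"}


def dinucleotide_fraction(seq, dinucleotides):
    """Fraction of overlapping 2-bp windows of seq found in the given set."""
    windows = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not windows:
        return 0.0
    return sum(window in dinucleotides for window in windows) / len(windows)


def classify_arm(seq, threshold=0.7):
    """Label an arm by its dominant dinucleotide class (hypothetical threshold)."""
    if dinucleotide_fraction(seq, TG_CA) >= threshold:
        return "TG/CA-rich"
    if dinucleotide_fraction(seq, TC_GA) >= threshold:
        return "TC/GA-rich"
    return "other"


# Hypothetical arms: a TG tract with one interspersed TA, a TC tract, and a
# sequence with no dominant dinucleotide.
for arm in ("TGTGTGTGTATGTGTG", "TCTCTCTCTC", "GATTACAGGC"):
    print(arm, classify_arm(arm))
```

A threshold below 1 is needed so that interspersed TA dinucleotides, which the text describes as part of TG/CA-rich LIRs, do not exclude an arm from the class.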
The second largest group comprises the LIRs formed by known human
repetitive elements. We found 145 MADE1 mariners and 108 inverted Alu
repeats in our LIR collection. The mariner has a short spacer and long
terminal inverted repeats (TIRs). In the present study, they were
considered as LIRs in cases of high identity between TIRs. Within both
intronic and intergenic LIRs, they occupy 6% in total. Alu repeats in
the LIRs are mostly partial copies arranged head-to-head or
tail-to-tail. In some large LIRs, more than one Alu was included in one
arm, and the complete structure of Alu could be retained therein. The
proportion of inverted Alus within the LIRs is 6% for intronic regions
and 3% for intergenic regions.

Fig. 3. Negative correlation between gene density and LIR frequency. The
black dots show the correlation between gene density and LIR density in
the 22 chromosomes. [Plot of gene density (/Mb) against S-LIR density
(/Mb); regression line y = –23.8x + 33.6, R² = 0.53; the point for
chr. 19 is labeled.]

The grouping of the LIRs is also a result of gene
amplification or fragmental duplication, although the
numbers of such groups and the members inside the
groups are not large. We identified 20 LIRs in genes
encoding a novel protein similar to septin (NPS-Sep-
tin) and eight in genes encoding POTE. The genes
belong to gene families and their duplication is coupled
with the spread of the LIRs inside the gene. The
remaining LIRs aside from the above groups show
similarity either to one or none of the others. They are
labeled as rare LIRs, accounting for 49% of all the
LIRs in the human genome.
We explored the LIRs in the NPS-Septin gene
family in more detail. A BLAT search in the University of California
Santa Cruz (UCSC) browser (http://genome.ucsc.edu) was used to confirm
the association
of the LIRs with the gene family. The longest gene
LOC400807 is approximately 107 kb, and an NPS-
Septin LIR is positioned at approximately 10.6 kb. In
addition, we also found more NPS-Septin LIRs on the
Y chromosome, although they were not present in
NPS-Septin genes. Sequence alignment displays highly
identical arms but diverse spacers for the 20 NPS-
Septin LIRs (Fig. 5). They are able to form variant
stem-loop structures where both the stem and loop are
of different sizes. Except for those on chromosomes 3,
10 and Y, all the LIRs were located at subtelomeric
regions (see Table S2). The proximal LIRs show simi-
lar spacer motifs; for example, the three LIRs on chro-
mosome 1p (no. 1–3) and the two LIRs on the
Y chromosome (Fig. 5). This is evidence for inverted
duplication of the fragments at these regions. We also
noted that sequence similarity at the flanking regions
of the LIRs declines gradually at all sites.
Species-specific LIRs inside orthologous genes
To assess the species specificity of the LIRs, we detected LIRs in
mouse, rhesus monkey, chimpanzee and human orthologous genes. Among
12 723 groups of orthologous genes, we identified 546 LIRs for human
orthologs, 481 for chimpanzee orthologs, 201 for mouse orthologs and
130 for rhesus monkey orthologs. Of these, 421 (77%) are human-specific,
355 (74%) are chimpanzee-specific, 180 (90%) are mouse-specific and
107 (82%) are rhesus monkey-specific. For the nonspecific LIRs, 13
groups of orthologs from the three primate species all have at least one
LIR, and 104 ortholog pairs from humans and chimpanzees possess LIR(s).
This suggests that most of the nonspecies-specific LIRs are shared by
the primates, and some LIRs were specifically developed in the primate
lineage.
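The species-specific percentages follow directly from the counts given above, as the short Python check below shows. The dictionary layout and function name are illustrative; the counts are those reported in the text.

```python
# (total LIRs in orthologs, species-specific LIRs), as reported in the text.
lir_counts = {
    "human": (546, 421),
    "chimpanzee": (481, 355),
    "mouse": (201, 180),
    "rhesus monkey": (130, 107),
}


def specific_percentage(total, specific):
    """Share of species-specific LIRs, rounded to a whole percentage."""
    return round(100 * specific / total)


percentages = {
    species: specific_percentage(total, specific)
    for species, (total, specific) in lir_counts.items()
}
print(percentages)
```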
We next obtained the biological profile of the human
orthologs that have human-specific LIR(s). Compared
to randomly-selected human genes, the orthologs are
significantly enriched with GO terms within the catego-
ries of development, binding, membrane, cell communi-
cation and signal transduction (Table 1). An important
finding is that a number of the terms are related to the
nervous system, including neurotransmitter receptor
activity (GO:0030594), central nervous system develop-
ment (GO:0007417), GABA receptor activity (GO:
0016917), axonogenesis (GO:0007409), projection,
generation, differentiation and development of neurons
(GO:0043005, GO:0048699, GO:0030182, GO:0048666),
synapse (GO:0045202), and so on. The GO term that is
under-represented in these genes is GO:0006955 for
immune response [false discovery rate (FDR) = 0.048].
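The enrichment analysis summarized in Table 1 combines Fisher's exact test with a false discovery rate correction. The following Python sketch shows one way such a calculation could be set up; it is not the BLAST2GO implementation. The one-sided test, the Benjamini-Hochberg procedure, the term names and the annotation counts are assumptions for illustration, and only the gene totals (352 test genes, 296 reference genes) come from the Table 1 legend.

```python
from math import comb


def fisher_exact_greater(test_with, test_total, ref_with, ref_total):
    """One-sided Fisher's exact test for over-representation in the test set."""
    annotated = test_with + ref_with
    population = test_total + ref_total
    denominator = comb(population, test_total)
    upper = min(annotated, test_total)
    # Hypergeometric tail: probability of test_with or more annotated test genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, test_total - k)
        for k in range(test_with, upper + 1)
    )
    return tail / denominator


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value downwards
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical counts of genes annotated with each term: (test, reference).
hypothetical_terms = {"term_A": (40, 12), "term_B": (25, 20), "term_C": (9, 1)}
p_values = [
    fisher_exact_greater(test, 352, ref, 296)
    for test, ref in hypothetical_terms.values()
]
fdr = benjamini_hochberg(p_values)
for name, q in zip(hypothetical_terms, fdr):
    print(name, "over-represented" if q < 0.05 else "not significant")
```

Terms whose adjusted value falls below 0.05 would be reported as over-represented, which corresponds to the FDR < 0.05 cut-off stated for Table 1.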
We also performed the same test on 104 orthologs with human- and
chimpanzee-specific LIRs. The GO terms from the human orthologs were
used for comparison with those from randomly-selected human genes,
showing that the above over-represented GO terms related to the nervous
system were largely not assigned to these orthologs (see Table S3). Only
one retained term, GO:0045202 (synapse), is related to the nervous
system. Basically, the terms for binding, membrane, signal transduction
and cell communication are retained in the list.

Fig. 4. Composition of LIRs in the human genome. The LIRs in POTE and
NPS-Septin families occupy 3% of all the intronic LIRs. The 'other'
LIRs, occupying 49% of all LIRs, refer to those with unique sequence
features.

As a control, we obtained over-represented
GO terms from the mouse genes associated with
mouse-specific LIRs by comparison with randomly-
selected mouse orthologs. A part of the result shown
in Table S4 is similar to that also shown in Table S3
(e.g. binding and signal transduction). The difference is
that the result for the mouse orthologs includes GO
terms for the regulation of transcription, the RNA bio-
synthetic process and the phosphate metabolic process.
We found one over-represented term (GO:0007399) in
a pathway for nervous system development
(FDR = 0.0368).

Fig. 5. Alignment of the LIRs mostly located in genes encoding a novel
protein similar to septin. The locations of the LIRs are listed in
Table S2. Essentially, the arms of the LIRs are approximately 1–48 and
97–143 bp, and can be extended into the spacers in some LIRs.

The 104 LIRs common in human and chimpanzee orthologs were studied,
aiming to uncover the mechanism of their formation. We searched the arm
sequences of the LIRs in the UCSC genome browser for homologous
fragments in other mammalian genomes. The species specificity of the
LIRs was demonstrated in several cases, where we found half-sized LIRs
in the rhesus monkey genome. One case is the LIR in the human gene
c9orf52 that has 19 ORFs and four transcription variants. Positioned
Table 1. GO terms over-represented in human genes having human-specific LIRs. The genes are human orthologs that have at least one
human-specific LIR. Reference genes are randomly selected from the list of orthologs, and are used for comparison with the test human
genes with specific LIRs. The GO terms in 352 test genes were compared with those in 296 reference genes, using Fisher’s exact test in
BLAST2GO. FDR was applied to obtain significantly over-represented (FDR < 0.05) GO terms in the test genes. Several GO terms belonging to
levels 1 or 2 are not included.
GO term Name FDR GO term Name FDR
GO:0060089 Molecular transducer activity 6.49E-05 GO:0005524 ATP binding 0.011768
GO:0007165 Signal transduction 6.49E-05 GO:0048731 System development 0.013028
GO:0005886 Plasma membrane 1.22E-04 GO:0043412 Biopolymer modification 0.013028
GO:0044425 Membrane part 2.44E-04 GO:0005887 Integral to plasma membrane 0.013136
GO:0007154 Cell communication 2.65E-04 GO:0005230 Extracellular ion channel activity 0.014854
GO:0032501 Multicellular organismal process 2.65E-04 GO:0031175 Neurite development 0.014854
GO:0045202 Synapse 3.10E-04 GO:0031402 Sodium ion binding 0.014854
GO:0004888 Transmembrane receptor activity 7.24E-04 GO:0030030 Cell projection organization 0.016226
GO:0031224 Intrinsic to membrane 7.24E-04 GO:0022804 Active transmembrane transporter 0.016354
GO:0016021 Integral to membrane 8.83E-04 GO:0004672 Protein kinase activity 0.017032
GO:0032502 Developmental process 0.001316 GO:0048856 Anatomical structure development 0.017199
GO:0030695 GTPase regulator activity 0.001397 GO:0043687 Post-translational protein modification 0.017406
GO:0007275 Multicellular organismal development 0.001397 GO:0015075 Ion transmembrane transporter 0.017424
GO:0044459 Plasma membrane part 0.001397 GO:0006464 Protein modification process 0.017424
GO:0051179 Localization 0.001397 GO:0016301 Kinase activity 0.018271
GO:0030182 Neuron differentiation 0.001526 GO:0051234 Establishment of localization 0.018271
GO:0030054 Cell junction 0.001544 GO:0005856 Cytoskeleton 0.018624
GO:0043167 Ion binding 0.001544 GO:0005216 Ion channel activity 0.019349
GO:0007186 G-protein coupled receptor protein 0.002093 GO:0022803 Passive transmembrane transporter 0.019349
GO:0004872 Receptor activity 0.002643 GO:0022838 Substrate-specific channel activity 0.019349
GO:0007166 Cell surface receptor 0.003784 GO:0004713 Protein-tyrosine kinase activity 0.020021
GO:0000166 Nucleotide binding 0.003784 GO:0004930 G-protein coupled receptor activity 0.020021
GO:0031420 Alkali metal ion binding 0.004149 GO:0022857 Transmembrane transporter activity 0.020021
GO:0045211 Postsynaptic membrane 0.004149 GO:0050793 Regulation of developmental process 0.021558
GO:0006811 Ion transport 0.004408 GO:0008509 Anion transmembrane transporter 0.021558
GO:0007399 Nervous system development 0.004741 GO:0007409 Axonogenesis 0.023565
GO:0043169 Cation binding 0.005509 GO:0048812 Neurite morphogenesis 0.023565
GO:0046872 Metal ion binding 0.005536 GO:0019199 Transmembrane protein kinase activity 0.023565
GO:0030554 Adenyl nucleotide binding 0.005536 GO:0048667 Neuron morphogenesis 0.023565
GO:0007155 Cell adhesion 0.005536 GO:0046578 Ras protein signal transduction 0.023565
GO:0048666 Neuron development 0.005868 GO:0016917 GABA receptor activity 0.023565
GO:0005083 Small GTPase regulator activity 0.005868 GO:0004714 Kinase activity 0.023565
GO:0006814 Sodium ion transport 0.005868 GO:0048468 Cell development 0.025037
GO:0005509 Calcium ion binding 0.006231 GO:0022891 Transmembrane transporter activity 0.026669
GO:0016773 Phosphotransferase activity 0.006277 GO:0009790 Embryonic development 0.028903
GO:0017076 Purine nucleotide binding 0.00637 GO:0006793 Phosphorus metabolic process 0.030622
GO:0022008 Neurogenesis 0.006378 GO:0022892 Substrate-specific transporter 0.030713
GO:0048699 Generation of neurons 0.006378 GO:0043005 Neuron projection 0.033582
GO:0000902 Cell morphogenesis 0.007098 GO:0005096 GTPase activator activity 0.033582
GO:0032989 Cellular structure morphogenesis 0.007098 GO:0015698 Inorganic anion transport 0.033582
GO:0030234 Enzyme regulator activity 0.007765 GO:0065007 Biological regulation 0.034756
GO:0051056 Regulation of small GTPase 0.009344 GO:0005089 Rho guanyl-nucleotide exchange factor 0.041165
GO:0048869 Cellular developmental process 0.009344 GO:0005088 Ras guanyl-nucleotide exchange factor 0.041165
GO:0030154 Cell differentiation 0.009344 GO:0007010 Cytoskeleton organization 0.041874
GO:0005215 Transporter activity 0.009344 GO:0007417 Central nervous system development 0.043131
GO:0032559 Adenyl ribonucleotide binding 0.009362 GO:0030594 Neurotransmitter receptor activity 0.043131
GO:0032555 Purine ribonucleotide binding 0.01047 GO:0004674 Protein serine ⁄ threonine kinase 0.045995
GO:0032553 Ribonucleotide binding 0.01047 GO:0008092 Cytoskeletal protein binding 0.046358
GO:0031226 Intrinsic to plasma membrane 0.011022 GO:0005515 Protein binding 0.047541
between exons 17 and 18 (5.3 kb to exon 18; 36.59 kb to exon 17)
(Fig. 6), it has a homolog in the chimpanzee genome. However, all
homologous fragments from the rhesus monkey correspond to one arm of the
LIR. Moreover, motif conservation was exhibited at the flanking
sequences of the LIR in primates (Fig. 6). In other words, the
half-sized LIR represents the ancestral state, and the full-sized LIR
was developed in the chimpanzee and human lineages. We did not find
fragments homologous to the LIR in the mouse genome. Instead, a
half-sized LIR was observed in the dog genome, suggesting that
nonprimate genomes also lack the full-sized LIR. This also serves as
additional evidence for the presence of the half-sized LIR in the rhesus
monkey. These results imply that some LIRs were derived by inverted
duplication of one arm.
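The proposed origin of a full-sized LIR from a half-sized one can be pictured with a toy Python example: appending the reverse complement of an existing arm, optionally separated by a spacer, yields an inverted repeat. The sequences and function names are hypothetical and serve only to illustrate the idea of inverted duplication, not to model the actual c9orf52 locus.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def inverted_duplication(arm, spacer=""):
    """Build a full inverted repeat from a single arm (the half-sized state)."""
    return arm + spacer + reverse_complement(arm)


# Hypothetical ancestral arm and spacer.
half_sized = "GATTACA"
full_sized = inverted_duplication(half_sized, spacer="TT")
print(full_sized)

# Without a spacer the product is a perfect palindrome, that is, a sequence
# identical to its own reverse complement.
palindrome = inverted_duplication("GAA")
print(palindrome, palindrome == reverse_complement(palindrome))
```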
Discussion
A survey of recombinogenic LIRs across the
human genome
In the present study, we identified LIRs in the human
genome, and provide a fine map of the distribution of
human LIRs. Due to a strong capability for forming a
stem-loop, the LIRs are recombinogenic and account
for only approximately 0.4% of all human LIRs, as
suggested previously [33]. Our algorithm allows the
presence of mismatches and insertions in the stem part
of the secondary structure, and also provides settings
for spacer size, arm size and arm similarity. Due to
variant internal structures, inverted repeats are differ-
ent in their efficiency with respect to the induction of
instability. Evidence is available suggesting that arm
size, arm similarity and internal spacer size are all
important factors [1,2]. Therefore, the inverted repeats
identified in the present study are generally associated
with a high potential for stem-loop or palindrome
formation. This is partially supported by the fact that
approximately 87% of our LIRs have a short spacer
of < 10 bp. Nonetheless, we cannot preclude the pos-
sibility that some of the LIRs experience difficulty
regarding the formation of a stem-loop, such as the
reversely duplicated genes and those extremely large LIRs with a
huge spacer (see Table S1). In previous
studies, the methods employed for inverted repeat
identification could not search the inverted repeats by
freely defining arm similarity, spacer size and indels
[30,32,36]. Thus, the map of the LIRs obtained in the
present study provides a more detailed distribution of
stem-loops in the human genome, and confirms that
the LIRs are mostly located in long introns and inter-
genic regions. Furthermore, the inverted repeats in the
present study are more likely to be functional than
those of previous studies because functional inverted
repeats such as siRNA genes are rarely palindromes
showing 100% arm similarity [30,32,36].
Because of the difficulties encountered in the design
of the algorithm for LIR searching at the genome
level and the complex folding structures of inverted
repeats, we could not target all the inverted repeats
with a strong potential to form a stem-loop or a pal-
indrome. Particularly, there are a large number of
AT-rich regions in the human genome, and the frequency of (TA)n repeats
is 19.4 per Mb [37]. The self-complementary (TA)n repeats can by
themselves form variant secondary structures. To remove these simple
repeats, we set the GC content of the arm sequences at > 20%. This step,
however, unavoidably deleted AT-rich LIRs, and some of them have been
implicated as the mediator of constitutional t(11;22) translocation in
humans [38]. Although there are also a large number of (CA)n and (GA)n
repeats in the human genome, the frequency of their complementary
repeats (TG)n and (CT)n is much lower [37]. Therefore, the presence of
the TG/CA-rich and TC/GA-rich LIRs is not a result of the enrichment of
(CA)n and (GA)n repeats.
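The GC-content filter described above is simple to express in code. The Python sketch below assumes that GC content is computed as the fraction of G and C bases in an arm and that arms must exceed 20% to be kept; the function names and example arms are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_gc_filter(arm, min_gc=0.20):
    """Keep an arm only if its GC content is strictly above the cut-off."""
    return gc_content(arm) > min_gc


# Hypothetical arms: a pure (TA)n tract is removed, a (TG)n tract is kept.
for arm in ("TATATATATA", "TGTGTGTGTG"):
    print(arm, gc_content(arm), passes_gc_filter(arm))
```

The example also shows the side effect noted in the text: any AT-rich arm, not only a simple (TA)n repeat, is discarded by this filter.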
Fig. 6. An intronic LIR in c9orf52 and the flanking conserved sequence. The arrow denotes the intronic LIR, positioned between exons 17
and 18. The large arrows with an opposing orientation indicate the two arms of the LIR. Rhesus monkey (Rhesus macaque) and dog
(Canis familiaris) genes possess half-sized LIRs.
Probable functions of the LIRs
The results obtained in the present study show that a
considerable proportion of the LIRs are within genes
and tend to be located in large introns of long genes.
The LIRs in the large introns, although still unstable, will not
greatly disturb the coding parts of the genes.
Knowing the genomic distribution and sequence fea-
tures of the LIRs enables us to speculate about the
biological functions of the LIRs.
First, there are a large number of TG ⁄ CA-rich LIRs
in our collection, and these intronic TG and CA tracts
are probably involved in the alternative splicing of
genes. One study revealed that intronic TG tracts,
particularly in hairpin structure, are important in the
intron knockout process and help to create
complicated splicing patterns [39]. On the other hand,
CA-tracts and CA-rich sequences are confirmed to be
regulators for alternative splicing. One study showed
that the insertion of a CA repeat into different intronic
places will result in variant splicing patterns in a
human gene [40]. Perhaps splicing sites at intron–exon
boundaries can be recognized easily by a signal of secondary
structure. Taken together, these findings allow us to propose
that the TG/CA-rich LIRs are regulators of alternative splicing
in human genes.
Second, approximately half of the LIRs are unique
in sequence features, and some of them are probably
unidentified siRNA genes. In the present study, the
LIRs are longer than the minimal length required for
an siRNA. Although arm similarity is higher than that
observed in most siRNAs, some of the LIRs are still candidates
for siRNA genes. We used emboss sirna to identify the candidates
with a threshold score of 8, and found that 267 of the LIRs are
potential siRNA genes. The validity of these motifs in gene
silencing requires further empirical examination.
LIRs and recombination hotspots are not related
The question remains as to whether the recombino-
genic LIRs identified in the present study are
frequently associated with recombination hotspots in
the human genome. Almost 47% of the human gen-
ome is composed of repeats [37], and direct repeats are
predominant over inverted repeats in the human gen-
ome, partially because inverted repeats are able to
induce instability five-fold more efficiently than direct
repeats [41]. The UCSC browser provides the recombi-
nation rate data for the human genome. Essentially,
recombination hotspots concentrate on subtelomeric
regions [42]. These regions, however, do not have more
LIRs than other regions (Fig. 2) and, instead, some
chromosomal LIR-rich regions are located at the inner
part of the chromosomes. The lack of an association
between LIRs and recombination hotspots is also
suggested by a recent study on human recombination
hotspots on the basis of a computational simulation
using single nucleotide polymorphism data, which
showed that inverted repeats were not overabundant
in the hotspots [43]. In the present study, we
did not detect over-represented LIRs in the hotspots
(results not shown). Therefore, the contribution of LIRs
to recombination hotspots is not supported, and the
recombination-inducing effect of the recombinogenic
LIRs probably acts only on specific genomic regions.
LIRs spreading via fragmental duplications

NPS-Septin genes are spread in the human genome
possibly due to interchromosomal recombination and
fragmental duplication. One study showed that inter-
chromosomal recombination frequently occurs at the
subtelomeric regions in humans [42]. NPS-Septin was
assumed to be one of the gene families that amplified
themselves by this mechanism. The result of gene
amplification is concurrent duplication of the intronic
LIRs, as observed in chromosomes 1 and Y in the
present study. By contrast, only two chimpanzee NPS-
Septin LIRs were identified, which is in accordance
with the low frequency of subtelomeric duplications in
the chimpanzee genome [42]. Regarding the spread of
LIRs in the POTE family, at least those on chromo-
some 2 subtelomeres are most likely the result of intra-
chromosomal recombination, as inferred from genomic
locations. Similarly, the chimpanzee genome has two
POTE homologs: one on chromosome 12 and another
one on chromosome 22. In addition, the LIRs in
POTE and NPS-Septin families were entirely absent
from other mammals in current genomic assemblies.
Probable role of the LIRs in primate speciation
Among the orthologous genes, we found that human
and chimpanzee genes contain more LIRs than rhesus
monkey and mouse orthologs. Our data suggest that
most of the LIRs shared by human and chimpanzee
orthologs were developed and maintained by the common
ancestor of humans and chimpanzees. However, the true
difference in LIR frequency among the primates may be
smaller than observed. Our search for LIRs in rhesus monkey
orthologs probably missed a proportion of the LIRs, even
though the required similarity between arms was lowered to
75%. Where the similarity between arms was lower than 75%, some
of the LIRs shared by all primates could not be detected.
Indeed, we observed higher mismatch rates in the
stems formed by monkey LIRs, and the corresponding
chimpanzee and human LIRs have undergone compen-
sating mutations that help to improve the stability of
the stem-loops for human and chimpanzee LIRs rela-
tive to those for rhesus monkey (results not shown).
The compensating mutations comprise one line of evi-
dence for the functional role and adaptive evolution of
the primate LIRs.
The biological profiling of the orthologs with
human-specific LIRs implies their association with the
development of the central nervous system. Moreover,
GO terms in pathways such as cell communication and
transmembrane signal transduction are enriched in
these orthologs. The number of genes in eukaryotic
genomes does not differ as much as previously thought
[44,45], and the morphological and physiological differences
among eukaryotes are considered to be the result of the
different regulation levels of the existing genes.
The intronic LIRs in the present study are probably
novel, essential regulatory motifs that enable a com-
plex expression profile and the fine regulation of
human genes, as suggested previously [46]. The appearance
of the LIRs probably provides humans with an evolutionary
advantage and contributes to the speciation of primates.
Experimental procedures
Identification of LIRs
The human genome (Build 35) was downloaded from the
NCBI, and the locations of all human genes and their exons
(for protein-coding genes) were obtained from the Ensembl
database (http://www.ensembl.org). From the gene list, we
obtained the locations
of the boundaries of the genic and nongenic regions. Exons
belonging to the same genes were sorted again according to
their genomic locations, and the introns were defined as the
intervals between the exons. From the list, the boundaries
of exons and introns were determined.
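The derivation of introns from sorted exons can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study: the function name introns_from_exons, the 1-based inclusive coordinates and the merging of overlapping or adjacent exons (e.g. from alternative transcripts) are assumptions made here for the example.

```python
from typing import List, Tuple

# Assumed convention: 1-based, inclusive (start, end) genomic coordinates.
Interval = Tuple[int, int]


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the intervals between the exons of one gene, in genomic order."""
    # Sort by genomic position, then merge overlapping or adjacent exons
    # so that every remaining gap is a genuine, non-empty interval.
    merged: List[Interval] = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1] + 1:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    # An intron runs from one base after an exon to one base before the next.
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(merged, merged[1:])]
```

For a gene with exons (100, 200), (150, 250) and (500, 600), the sketch would report a single intron spanning 251 to 499.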
We first searched for inverted repeats across the human
genome using bespoke software [33]. The settings for this
step were: arm length > 30 bp; arm identity > 85%; and
spacer < 2 kb. In addition, inverted repeats with a GC
content of the arms of < 20% were filtered out. This
aimed to exclude the abundance of inverted repeats formed
by (TA)n simple repeats, as shown in our primary study. A
(TA)n tract is by itself an inverted repeat, and can form variant
secondary structures rather than an exclusive and stable
stem-loop. Therefore, (TA)n repeats are not typical inverted
repeats, and we removed them from the dataset in the present
study. Several types of redundancies were removed, as
described previously [33]. To define the recombinogenic LIRs,
we screened the collection with a new criterion: the ratio of
arm length to spacer length must be larger than the mismatch,
where the mismatch is equal to 100% minus identity.
Therefore, the LIRs in our dataset were recombinogenic LIRs [33].
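The filtering steps of this section can be illustrated with a short Python sketch. It is not the bespoke software of [33]; all function names are hypothetical, arms are compared without gaps and are assumed to be of equal length, and the mismatch is assumed to be expressed as a fraction (e.g. 0.15 for 85% identity) when compared with the arm/spacer ratio. The handling of a zero-length spacer is likewise an assumption.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def arm_identity(left_arm: str, right_arm: str) -> float:
    """Fraction of positions at which the left arm matches the reverse
    complement of the right arm (ungapped; equal-length arms assumed)."""
    rc = reverse_complement(right_arm)
    if len(left_arm) != len(rc):
        raise ValueError("this sketch only handles equal-length arms")
    matches = sum(a == b for a, b in zip(left_arm.upper(), rc))
    return matches / len(left_arm)


def is_recombinogenic(arm_len: int, spacer_len: int, identity: float) -> bool:
    """Arm/spacer ratio must exceed the mismatch (1 - identity, as a fraction)."""
    if spacer_len == 0:
        return True  # assumption: a repeat without a spacer always passes
    return arm_len / spacer_len > 1.0 - identity


def passes_filters(left_arm: str, right_arm: str, spacer_len: int) -> bool:
    """Thresholds quoted in the text: arm > 30 bp, identity > 85%,
    spacer < 2 kb, arm GC content not below 20%, recombinogenic ratio."""
    if len(left_arm) <= 30 or spacer_len >= 2000:
        return False
    identity = arm_identity(left_arm, right_arm)
    if identity <= 0.85:
        return False
    if gc_content(left_arm + right_arm) < 0.20:
        return False
    return is_recombinogenic(len(left_arm), spacer_len, identity)
```

Under these assumptions, a (TA)n tract passes the identity test against its own reverse complement but is rejected by the GC filter, which is the behaviour the filter is intended to produce.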
The LIRs within genes were identified and the ratio of
their relative distance to exon–intron boundaries was calculated.
Here, a ratio approaching 0 indicates that the LIR lies close to
the nearest exon–intron boundary, and a ratio close to 0.5 means
that the LIR is positioned close to the centre of an intron, no
matter in what direction. Pseudogenes were
not used in this survey. For those LIRs in intergenic
regions, the same ratios were also measured. The difference
was that the ratios in that case represent the relative dis-
tance to the closest neighboring genes. We also attempted
to identify cases of partial overlapping between LIRs and
genes or exons.
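The relative-distance ratio can be written down compactly. The sketch below is one plausible reading of the description, not the authors' implementation: the position of an LIR is taken to be its midpoint, the enclosing region is either an intron or the gap between two neighbouring genes, and the function name is hypothetical.

```python
def relative_distance(lir_start: int, lir_end: int,
                      region_start: int, region_end: int) -> float:
    """Distance from the LIR midpoint to the nearest region boundary,
    divided by the region length. Values range from 0 (at a boundary)
    to 0.5 (at the centre of the region)."""
    if not (region_start <= lir_start <= lir_end <= region_end):
        raise ValueError("the LIR must lie entirely within the region")
    region_len = region_end - region_start
    if region_len == 0:
        raise ValueError("the region must have non-zero length")
    midpoint = (lir_start + lir_end) / 2
    nearest = min(midpoint - region_start, region_end - midpoint)
    return nearest / region_len
```

Because only the nearer boundary is used, an LIR 20 bp from the upstream boundary and one 20 bp from the downstream boundary of the same region receive the same ratio, consistent with the statement that direction is ignored.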
Classification of LIRs
We selected the LIRs that were basically constructed by
dinucleotide repeats. In the case where TG + TA +
CA > 80% of the arm of an LIR, it was considered as
TG/CA-rich; in the case where TC + TA + GA > 80%,
it was considered as TC/GA-rich. The remaining LIRs
were classified on the basis of similarity. We first used
consensus motifs of common human repetitive elements (from
RepBase) as templates. An LIR
was considered to be formed by a known repetitive element
if the identity of the homologous part (> 20 bp) was
higher than 75%. For the results obtained, LIRs formed by
inverted Alu repeats were further confirmed by repeatmasker.
Second, LIRs homologous
to each other were searched. Similarly, the criteria were:
homologous part > 20 bp and identity of homologous part
> 75%. Put simply, the algorithm for searching the
homologous part aimed to find an identical seed of 5 bp
and then extend the seed at both ends until two consecutive
mismatches occur on each side.
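Both classification rules can be sketched in Python. The code is illustrative only and rests on assumptions that the text does not settle: the dinucleotide percentage is interpreted as the fraction of arm bases covered by any occurrence of the listed dinucleotides, the extension is ungapped, each side stops at the last match before two consecutive mismatches, and all function names are hypothetical.

```python
from collections import defaultdict


def dinucleotide_coverage(arm, dinucleotides):
    """Fraction of arm bases lying inside an occurrence of a listed dinucleotide."""
    arm = arm.upper()
    covered = [False] * len(arm)
    for k in range(len(arm) - 1):
        if arm[k:k + 2] in dinucleotides:
            covered[k] = covered[k + 1] = True
    return sum(covered) / len(arm)


def classify_arm(arm):
    """Apply the two 80% rules in the order given in the text."""
    if dinucleotide_coverage(arm, {"TG", "TA", "CA"}) > 0.80:
        return "TG/CA-rich"
    if dinucleotide_coverage(arm, {"TC", "TA", "GA"}) > 0.80:
        return "TC/GA-rich"
    return "other"


def _extend(a, b, i, j, step):
    """Walk from (i, j) in direction step; return the number of columns kept,
    ending at the last match seen before two consecutive mismatches."""
    kept = walked = run = 0
    while 0 <= i < len(a) and 0 <= j < len(b):
        walked += 1
        if a[i] == b[j]:
            run, kept = 0, walked
        else:
            run += 1
            if run == 2:
                break
        i += step
        j += step
    return kept


def extend_seed(a, b, i, j, seed=5):
    """Extend an exact seed a[i:i+seed] == b[j:j+seed] in both directions.
    Returns (start_in_a, start_in_b, length, identity)."""
    if len(a[i:i + seed]) < seed or a[i:i + seed] != b[j:j + seed]:
        raise ValueError("positions do not mark an identical seed")
    left = _extend(a, b, i - 1, j - 1, -1)
    right = _extend(a, b, i + seed, j + seed, +1)
    start_a, start_b, length = i - left, j - left, left + seed + right
    matches = sum(x == y for x, y in zip(a[start_a:start_a + length],
                                         b[start_b:start_b + length]))
    return start_a, start_b, length, matches / length


def best_homologous_part(a, b, seed=5):
    """Try every shared seed and keep the longest extension (None if no seed)."""
    index = defaultdict(list)
    for j in range(len(b) - seed + 1):
        index[b[j:j + seed]].append(j)
    best = None
    for i in range(len(a) - seed + 1):
        for j in index.get(a[i:i + seed], ()):
            hit = extend_seed(a, b, i, j, seed)
            if best is None or hit[2] > best[2]:
                best = hit
    return best


def is_homologous(a, b):
    """Criteria from the text: homologous part > 20 bp with identity > 75%."""
    hit = best_homologous_part(a, b)
    return hit is not None and hit[2] > 20 and hit[3] > 0.75
```

Trying every shared seed is quadratic in the worst case, which is acceptable for a sketch on arm-sized sequences but would need indexing or seed filtering at genome scale.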
LIRs in mammalian orthologous genes
We obtained orthologous genes for human–chimpanzee,
human–rhesus monkey and human–mouse species pairs
from the BIOMART database ( />biomart/), which employs the Ensembl 42 Homology Data-
base. By searching the same human gene IDs in the three
ortholog tables, we created a new ortholog table containing
12 723 groups of orthologous genes from the four species.
In the BIOMART database, some orthologous genes are of
the types ‘one-to-many’ and ‘many-to-many’ that denote a
multiple orthologous relationship between the genes. In the
present study, we kept the orthologs in the ‘one-to-one’
type to allow an easier comparison among the orthologs in
terms of the absence/presence of the LIRs.
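The construction of the four-species ortholog table can be sketched as an intersection on human gene IDs. This is an illustrative reading only: the gene IDs in the usage note are hypothetical, and the 'one-to-one' type is inferred here from how often each ID occurs, whereas the BIOMART tables state the homology type explicitly.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pairs = List[Tuple[str, str]]  # (human_gene_id, other_species_gene_id)


def one_to_one(pairs: Pairs) -> Dict[str, str]:
    """Keep only pairs in which both gene IDs occur exactly once."""
    human_counts = Counter(h for h, _ in pairs)
    other_counts = Counter(o for _, o in pairs)
    return {h: o for h, o in pairs
            if human_counts[h] == 1 and other_counts[o] == 1}


def build_ortholog_table(chimp: Pairs, rhesus: Pairs,
                         mouse: Pairs) -> Dict[str, Tuple[str, str, str]]:
    """Map each human gene with a one-to-one ortholog in all three
    species to its (chimpanzee, rhesus monkey, mouse) orthologs."""
    c, r, m = one_to_one(chimp), one_to_one(rhesus), one_to_one(mouse)
    shared = c.keys() & r.keys() & m.keys()
    return {h: (c[h], r[h], m[h]) for h in sorted(shared)}
```

With hypothetical inputs in which human gene H3 has two chimpanzee orthologs and mouse gene M2 is paired with two human genes, only the human genes that remain one-to-one in all three tables are retained.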
The criteria for searching inverted repeats among the
orthologs were: arm length > 30 bp; arm identity > 75%;
spacer < 500 bp. Again, the inverted repeats with a GC
content of the arms of < 20% were excluded. We then
selected recombinogenic LIRs among the collection using
the same criterion.
Next, we tried to uncover what is distinctive in the biological
profile of genes having species-specific LIRs. We had three
sets of test genes: the human orthologs having
human-specific LIRs, the human orthologs having human-
and chimpanzee-specific LIRs (i.e. rhesus monkey and mouse
orthologs lack these LIRs) and the mouse orthologs
having mouse-specific LIRs. To make a reference dataset,
we randomly selected human and mouse genes from the
ortholog table. The GO terms were obtained for the genes
from the Ensembl database. In
the reference set, 296 human orthologs and 164 mouse ortho-
logs were assigned with GO term(s) in the Ensembl database;
in the test sets, 352 in the first gene set, 91 in the second gene
set and 164 in the third gene set were found to have GO
terms. We did not perform the test for chimpanzee and
rhesus monkey orthologs with species-specific LIRs due to
concerns of insufficient GO terms being assigned to their
genes. blast2go [47] was used to compare the frequency of
the GO terms in the reference genes and the test genes. Using
gossip software in blast2go,
the significance of the difference in the term frequency was
determined by the FDR (P < 0.05) [48]. Hence, we obtained
a list of GO terms that were significantly over-represented
in the test genes, revealing the pathways in which the LIR-
containing orthologs are involved.
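The comparison of GO term frequencies between test and reference genes can be illustrated with a small Python sketch. It does not reproduce gossip or blast2go: it assumes a one-sided Fisher's exact test for over-representation followed by a Benjamini-Hochberg adjustment, which is one common way of controlling the FDR and may differ from the estimate used by gossip [48]. The function names and the GO term labels in the usage note are hypothetical.

```python
from math import comb
from typing import Dict, List, Tuple


def fisher_over_representation(a: int, n_test: int, b: int, n_ref: int) -> float:
    """One-sided Fisher's exact test. a of n_test test genes and b of n_ref
    reference genes carry the term; returns P(at least a in the test set)."""
    total = n_test + n_ref
    annotated = a + b
    upper = min(annotated, n_test)
    numerator = sum(comb(annotated, x) * comb(total - annotated, n_test - x)
                    for x in range(a, upper + 1))
    return numerator / comb(total, n_test)


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enriched_terms(test_counts: Dict[str, int], n_test: int,
                   ref_counts: Dict[str, int], n_ref: int,
                   alpha: float = 0.05) -> List[Tuple[str, float]]:
    """Terms over-represented in the test genes at the given FDR level."""
    terms = sorted(test_counts)
    pvalues = [fisher_over_representation(test_counts[t], n_test,
                                          ref_counts.get(t, 0), n_ref)
               for t in terms]
    adjusted = benjamini_hochberg(pvalues)
    return [(t, q) for t, q in zip(terms, adjusted) if q < alpha]
```

As a usage example with invented counts, a term carried by 30 of 50 test genes but only 5 of 50 reference genes would be reported, whereas a term carried by 5 genes in each set would not.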
References
1 Lobachev KS, Stenger JE, Kozyreva OG, Jurka J,
Gordenin DA & Resnick MA (2000) Inverted Alu
repeats unstable in yeast are excluded from the human
genome. EMBO J 19, 3822–3830.
2 Lobachev KS, Shor BM, Tran HT, Taylor W, Keen
JD, Resnick MA & Gordenin DA (1998) Factors affect-
ing inverted repeat stimulation of recombination and
deletion in Saccharomyces cerevisiae. Genetics 148,
1507–1524.
3 Gordenin DA, Lobachev KS, Degtyareva NP, Malkova
AL, Perkins E & Resnick MA (1993) Inverted DNA
repeats: a source of eukaryotic genomic instability. Mol
Cell Biol 13, 5315–5322.
4 Tanaka H, Tapscott SJ, Trask BJ & Yao M-C (2002)
Short inverted repeats initiate gene amplification
through the formation of a large DNA palindrome
in mammalian cells. Proc Natl Acad Sci USA 99, 8772–
8777.
5 Lin C-T, Lin W-H, Lyu YL & Whang-Peng J (2001)
Inverted repeats as genetic elements for promoting
DNA inverted duplication: implications in gene amplifi-
cation. Nucleic Acids Res 29, 3529–3538.
6 Gotter AL, Nimmakayalu MA, Jalali GR, Hacker AM,
Vorstman J, Conforto Duffy D, Medne L & Emanuel
BS (2007) A palindrome-driven complex rearrangement
of 22q11.2 and 8q24.1 elucidated using novel technolo-
gies. Genome Res 17, 470–481.
7 Nag DK & Kurst A (1997) A 140-bp-long palindromic
sequence induces double strand breaks during meiosis
in the yeast Saccharomyces cerevisiae. Genetics 146,
835–847.
8 Collick A, Drew J, Penberth J, Bois P, Luckett J, Scaerou
F, Jeffreys A & Reik W (1996) Instability of long
inverted repeats within mouse transgenes. EMBO J 15,
1163–1171.
9 Katz RA, Gravuer K & Skalka AM (1998) A preferred
target DNA structure for retroviral integrase in vitro.
J Biol Chem 273, 24190–24195.
10 Inagaki K, Lewis SM, Wu X, Ma C, Munroe DJ, Fuess
S, Storm TA, Kay MA & Nakai H (2007) DNA palin-
dromes with a modest arm length of >20 base pairs are
a significant target for recombinant adeno-associated
virus vector integration in the liver, muscles, and heart
in mice. J Virol 81, 11290–11303.
11 Voineagu I, Narayanan V, Lobachev KS & Mirkin SM
(2008) Replication stalling at unstable inverted repeats:
interplay between DNA hairpins and fork stabilizing
proteins. Proc Natl Acad Sci USA 105, 9936–9941.
12 Stenger JE, Lobachev KS, Gordenin D, Darden TA,
Jurka J & Resnick MA (2001) Biased distribution of
inverted and direct Alus in the human genome: Implica-
tions for insertion, exclusion, and genome stability.
Genome Res 11, 12–27.
13 Akgun E, Zahn J, Baumes S, Brown G, Liang F,
Romanienko PJ, Lewis S & Jasin M (1997) Palindrome
resolution and recombination in the mammalian germ
line. Mol Cell Biol 17, 5559–5570.
14 Piriyapongsa J (2007) A family of human microRNA
genes from miniature inverted-repeat transposable ele-
ments. PLoS ONE 2, e203.
15 Montgomery MK, Xu S & Fire A (1998) RNA as a
target of double-stranded RNA-mediated genetic interference
in Caenorhabditis elegans. Proc Natl Acad Sci
USA 95, 15502–15507.
16 Bartel DP (2004) MicroRNAs: genomics, biogenesis,
mechanism, and function. Cell 116, 281.
17 Moss EG, Lee RC & Ambros V (1997) The cold shock
domain protein LIN-28 controls developmental timing
in C. elegans and is regulated by the lin-4 RNA. Cell
88, 637–646.
18 Reinhart BJ, Slack FJ, Basson M, Pasquinelli AE,
Bettinger JC, Rougvie AE, Horvitz HR & Ruvkun G
(2000) The 21-nucleotide let-7 RNA regulates developmental
timing in Caenorhabditis elegans. Nature 403,
901–906.
19 Schramke V, Sheedy DM, Denli AM, Bonila C, Ekwall
K, Hannon GJ & Allshire RC (2005) RNA-interfer-
ence-directed chromatin modification coupled to RNA
polymerase II transcription. Nature 435, 1275–1279.
20 Volpe TA, Kidner C, Hall IM, Teng G, Grewal SIS &
Martienssen RA (2002) Regulation of heterochromatic
silencing and histone H3 lysine-9 methylation by RNAi.
Science 297, 1833–1837.
21 Glucksmann MA, Markiewicz P, Malone C & Roth-
man-Denes LB (1992) Specific sequences and a hairpin
structure in the template strand are required for N4
virion RNA polymerase promoter recognition. Cell 70,
491–500.
22 Kim EL, Peng H, Esparza FM, Maltchenko SZ &
Stachowiak MK (1998) Cruciform-extruding regulatory
element controls cell-specific activity of the tyrosine
hydroxylase gene promoter. Nucleic Acids Res 26, 1793–
1800.
23 Spiro C & McMurray CT (1997) Switching of DNA
secondary structure in proenkephalin transcriptional
regulation. J Biol Chem 272, 33145–33152.
24 Pearson CE, Zorbas H, Price GB &
Zannis-Hadjopoulos M (1996) Inverted repeats, stem-
loops, and cruciforms: significance for initiation of
DNA replication. J Cell Biochem 63, 1–22.
25 McAlinden A, Havlioglu N, Liang L, Davies SR &
Sandell LJ (2005) Alternative splicing of type II procol-
lagen exon 2 is regulated by the combination of a weak
5′ splice site and an adjacent intronic stem-loop cis
element. J Biol Chem 280, 32700–32711.
26 Baraniak AP, Lasda EL, Wagner EJ & Garcia-Blanco
MA (2003) A stem structure in fibroblast growth factor
receptor 2 transcripts mediates cell-type-specific splicing
by approximating intronic control elements. Mol Cell
Biol 23, 9327–9337.
27 Miyaso H, Okumura M, Kondo S, Higashide S,
Miyajima H & Imaizumi K (2003) An intronic splicing
enhancer element in survival motor neuron (SMN)
pre-mRNA. J Biol Chem 278, 15825–15831.
28 Chen Y & Stephan W (2003) Compensatory evolution
of a precursor messenger RNA secondary structure in
the Drosophila melanogaster Adh gene. Proc Natl Acad
Sci USA 100, 11499–11504.
29 Lisnić B, Svetec I-K, Šarić H, Nikolić I & Zgaga Z
(2005) Palindrome content of the yeast Saccharomyces
cerevisiae genome. Curr Genet 47, 289–297.
30 Lu L, Jia H, Dröge P & Li J (2007) The human
genome-wide distribution of DNA palindromes. Funct
Integr Genomics 7, 221–227.
31 Zhao G, Chang KY, Varley K & Stormo GD (2007)
Evidence for active maintenance of inverted repeat
structures identified by a comparative genomic
approach. PLoS ONE 2, e262.
32 Warburton PE, Giordano J, Cheung F, Gelfand Y &
Benson G (2004) Inverted repeat structure of the human
genome: the X-chromosome contains a preponderance
of large, highly homologous inverted repeats that con-
tain testes genes. Genome Res 14, 1861–1869.
33 Wang Y & Leung FCC (2006) Long inverted repeats in
eukaryotic genomes: recombinogenic motifs determine
genomic plasticity. FEBS Lett 580, 1277–1284.
34 Grimwood J, Gordon LA, Olsen A, Terry A, Schmutz
J, Lamerdin J, Hellsten U, Goodstein D, Couronne O,
Tran-Gyamfi M et al. (2004) The DNA sequence and
biology of human chromosome 19. Nature 428, 529–539.
35 Hüttenhofer A, Schattner P & Polacek N (2005)
Non-coding RNAs: hope or hype? Trends Genet 21,
289–297.
36 Achaz G, Netter P & Coissac E (2001) Study of
intrachromosomal duplications among the eukaryote
genomes. Mol Biol Evol 18, 2280–2288.
37 International Human Genome Sequencing Consortium (2001)
Initial sequencing and analysis of the human genome.
Nature 409, 860–921.
38 Edelmann L, Spiteri E, Koren K, Pulijaal V, Bialer
MG, Shanske A, Goldberg R & Morrow BE (2001)
AT-rich palindromes mediate the constitutional t(11;22)
translocation. Am J Hum Genet 68, 1–13.
39 Hefferon TW, Groman JD, Yurk CE & Cutting GR
(2004) A variable dinucleotide repeat in the CFTR gene
contributes to phenotype diversity by forming RNA
secondary structures that alter splicing. Proc Natl Acad
Sci USA 101, 3504–3509.
40 Hui J, Hung L-H, Heiner M, Schreiner S, Neumüller
N, Reither G, Haas SA & Bindereif A (2005) Intronic
CA-repeat and CA-rich elements: a new class of regula-
tors of mammalian alternative splicing. EMBO J 24,
1988–1998.
41 Waldman AS, Tran H, Goldsmith EC & Resnick MA
(1999) Long inverted repeats are an at-risk motif for
recombination in mammalian cells. Genetics 153, 1873–
1883.
42 Linardopoulou EV, Williams EM, Fan Y, Friedman C,
Young JM & Trask BJ (2005) Human subtelomeres are
hot spots of interchromosomal recombination and seg-
mental duplication. Nature 437, 94–100.
43 Myers S, Bottolo L, Freeman C, McVean G & Donnelly
P (2005) A fine-scale map of recombination rates and
hotspots across the human genome. Science 310, 321–
324.
44 Bird AP (1995) Gene number, noise reduction and bio-
logical complexity. Trends Genet 11, 94–100.
45 Claverie J-M (2001) Gene number: what if there are
only 30,000 human genes? Science 291, 1255–1257.
46 Bacolla A, Larson JE, Collins JR, Li J, Milosavljevic
A, Stenson PD, Cooper DN & Wells RD (2008)
Abundance and length of simple repeats in vertebrate
genomes are determined by their structural properties.
Genome Res 18, 1545–1553.
47 Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon
M & Robles M (2005) Blast2GO: a universal tool for
annotation, visualization and analysis in functional
genomics research. Bioinformatics 21, 3674–3676.
48 Blüthgen N, Brand K, Cajavec B, Swat M, Herzel H &
Beule D (2005) Biological profiling of gene groups
utilizing gene ontology. Genome Inform 16, 106–115.

Supporting information
The following supplementary material is available:
Fig. S1. Evenly-distributed LIRs in intergenic and
intronic regions.
Table S1. LIRs with arms longer than 400 bp.
Table S2. LIRs in genes encoding novel proteins simi-
lar to septin.
Table S3. GO terms over-represented in human genes
having human- and chimpanzee-specific LIRs.
Table S4. GO terms over-represented in mouse genes
having mouse-specific LIRs.
This supplementary material can be found in the
online version of this article.
Please note: Wiley-Blackwell is not responsible for
the content or functionality of any supplementary
materials supplied by the authors. Any queries (other
than missing material) should be directed to the corre-
sponding author for the article.