A population based study of copy number variations and regions of homozygosity in singapore and swedish populations using genome wide SNP genotyping arrays


A POPULATION-BASED STUDY OF COPY NUMBER VARIATIONS
AND REGIONS OF HOMOZYGOSITY IN SINGAPORE AND
SWEDISH POPULATIONS USING GENOME-WIDE SNP
GENOTYPING ARRAYS



KU CHEE SENG
B. Sc. (Hons.), UM; M. Med. Sc., UM



A THESIS SUBMITTED
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF EPIDEMIOLOGY AND PUBLIC HEALTH
YONG LOO LIN SCHOOL OF MEDICINE
NATIONAL UNIVERSITY OF SINGAPORE


2011





ACKNOWLEDGEMENT
During the four years of my Ph.D. studies (August 2007 – August 2011), many people
contributed to this work in many different ways, and I am grateful to all of them.
Specifically, I would like to thank:

• Chia Kee Seng (main supervisor), Mark Seielstad (co-supervisor) and Yudi
Pawitan (co-supervisor) for their guidance and encouragement throughout my
Ph.D. studies, and for making all the publications possible
• Yudi Pawitan and Agus Salim for their guidance and discussions on data analysis
• Teo Shu Mei and Nasheen Naidoo, my course mates and colleagues: Shu Mei for
helping with the R package analysis, and Nasheen for critically reading and
correcting the English of my manuscripts and thesis
• All my colleagues and friends in the Center for Molecular Epidemiology and
Department of Epidemiology and Public Health, National University of Singapore
for their help and support
I would also like to acknowledge the funding agency. I was funded under the grant
‘Singapore Consortium of Cohort Studies’ from June 2007 to March 2011.








CONTENTS
Chapter 1 – Introduction

Chapter 2 – Background
2.1. Human genetic variations
2.2. Categories of genetic variations
2.3. The evolution of genetic markers in disease gene mapping
2.4. A new era of CNVs discovery through microarrays
2.5. Copy neutral variations - inversions and translocations
2.6. Sequencing-based detection methods – PEM
2.7. Sequencing-based detection methods – DOC
2.8. Choosing a sequencing platform for PEM and DOC
2.9. International effort to characterize structural variations using PEM
2.10. The 1000 Genomes Project
2.11. Associations of CNVs with complex diseases and traits
2.12. Regions of homozygosity (ROHs)
2.13. Methods of detecting ROHs
2.14. Associations of ROHs with complex diseases and traits
2.15. Population history and origin for Singapore and Swedish populations

Chapter 3 – Aims

Chapter 4 – Materials and methods
4.1. Study I (Genomic copy number variations in three Southeast Asian
populations)
4.2. Study II (A population-based study of copy number variants and regions of
homozygosity in healthy Swedish individuals)
4.3. Study III (Copy number polymorphisms in new HapMap III and Singapore
populations)
4.4. Study IV (Regions of homozygosity in three Southeast Asian populations)
4.5. Summary for Study I – IV

Chapter 5 – Results
5.1. Study I
5.2. Study II
5.3. Study III
5.4. Study IV

Chapter 6 – Discussion
6.1. CNV and ROH maps for each population
6.2. Major criticisms from reviewers
6.3. Technological limitations
6.4. Clinical and public health significance

Chapter 7 – Future directions
7.1. Technological developments
7.2. A perspective on a detailed genetic variation map for each population

References

Appendices









SUMMARY
Population-based studies of copy number variations (CNVs) and regions of
homozygosity (ROHs) have received considerable attention over the past few years.
CNVs and ROHs have also been found to be associated with various complex human
diseases and traits such as schizophrenia, autism and height. Genome-wide mapping of
CNVs and ROHs has previously been performed in European, East Asian and African
populations using high-density SNP genotyping arrays. However, no comprehensive
mapping study of CNVs and ROHs had been conducted in the Singapore and Swedish
populations. Therefore, the primary aim of this thesis was to detect and
describe the characteristics of CNVs and ROHs in these two populations. A total of 292
samples from three Singaporean populations (99 Chinese, 98 Malay and 95 Indian
individuals) and 100 samples from the Swedish population were genotyped using the
Affymetrix Genome-Wide Human SNP Array 6.0 and/or the Illumina Human1M BeadChip.
Several hundred CNV loci and ROH loci were found in both populations. Some of these
CNV loci overlapped with known disease-associated or pharmacogenetic-related genes
and showed substantial differences in population frequency. Novel CNV loci that had
not previously been reported in public databases were also identified. Comparisons
between these two populations, and with the International HapMap III populations,
revealed substantial differences in their CNV and ROH profiles. Collectively, these
results highlight the importance of characterizing CNVs and ROHs in individual
populations. The studies in this thesis establish a resource of CNVs and ROHs for
future disease association studies in the Singapore and Swedish populations.


LIST OF PUBLICATIONS
1. Ph.D. publications (see Appendices)
(A) Research papers
1. Ku CS, Pawitan Y, Sim X, Ong RT, Seielstad M, Lee EJ, Teo YY, Chia KS,
Salim A. Genomic copy number variations in three Southeast Asian populations.
Human Mutation 31: 851-857 (2010).
2. Teo SM*, Ku CS*#, Naidoo N, Hall P, Chia KS, Salim A, Pawitan Y. A
population-based study of copy number variants and regions of homozygosity in
healthy Swedish individuals. Journal of Human Genetics 56: 524-533 (2011).
3. Ku CS#, Teo SM, Naidoo N, Sim X, Teo YY, Pawitan Y, Seielstad M, Chia KS,
Salim A. Copy number polymorphisms in new HapMap III and Singapore
populations. Journal of Human Genetics 56: 552-560 (2011).
4. Teo SM*, Ku CS*, Salim A, Naidoo N, Chia KS, Pawitan Y. Regions of
homozygosity in three Southeast Asian populations. Journal of Human Genetics
57: 101-108 (2012).
* Joint first author
# Corresponding author

(B) Review papers
1. Ku CS#, Loy EY, Salim A, Pawitan Y, Chia KS. The discovery of human genetic
variations and their use as disease markers: past, present and future. Journal of
Human Genetics 55:403-415 (2010).
2. Ku CS#, Naidoo N, Teo SM, Pawitan Y. Regions of homozygosity and their
impact on complex diseases and traits. Human Genetics 129:1-15 (2011).
# Corresponding author

(C) Encyclopedia/book chapters
1. Ku, Chee Seng; Naidoo, Nasheen; Teo, Shu Mei; and Pawitan, Yudi (February
2011) Characterising Structural Variation by Means of Next-Generation
Sequencing. In: Encyclopedia of Life Sciences (ELS). John Wiley & Sons, Ltd:
Chichester. DOI: 10.1002/9780470015902.a0023399
2. Chee-Seng, Ku; En Yun, Loy; Yudi, Pawitan; and Kee-Seng, Chia (April 2010)
Next Generation Sequencing Technologies and Their Applications. In:
Encyclopedia of Life Sciences (ELS). John Wiley & Sons, Ltd: Chichester. DOI:
10.1002/9780470015902.a0022508
3. Chee-Seng, Ku; En Yun, Loy; Yudi, Pawitan; and Kee-Seng, Chia (April 2010)
Whole Genome Resequencing and 1000 Genomes Project. In: Encyclopedia of
Life Sciences (ELS). John Wiley & Sons, Ltd: Chichester. DOI:
10.1002/9780470015902.a0022507
4. Chee Seng, Ku; Katherine, Kasiman; and Kee Seng, Chia (September 2009)
High-Throughput Single Nucleotide Polymorphisms Genotyping Technologies.
In: Encyclopedia of Life Sciences (ELS). John Wiley & Sons, Ltd: Chichester.
DOI: 10.1002/9780470015902.a0021631

(D) Technical note
1. Ku Chee Seng, Sim Xueling, Chia Kee Seng. Genome-Wide Mapping of Copy
Number Variations and Loss of Heterozygosity Using the Infinium Human1M
BeadChip. Illumina Technical Note (2008).

2. Other publications during Ph.D. candidature (August 2007 – August 2011)
Publications                  Quantity
Research paper                3
Review paper                  7
Commentary                    1
Encyclopedia/book chapters    6

COMPLETE LIST OF PUBLICATIONS (August 2007 – August 2011)
Research/Review papers
1. Ku CS, Pawitan Y, Sim X, Ong RT, Seielstad M, Lee EJ, Teo YY, Chia KS,
Salim A. Genomic copy number variations in three Southeast Asian populations.
Human Mutation 31: 851-857 (2010).
2. Ku CS*, Teo SM, Naidoo N, Sim X, Teo YY, Pawitan Y, Seielstad M, Chia KS,
Salim A. Copy number polymorphisms in new HapMap III and Singapore
populations. Journal of Human Genetics 56: 552-560 (2011).
3. Teo SM, Ku CS*, Naidoo N, Hall P, Chia KS, Salim A, Pawitan Y. A
population-based study of copy number variants and regions of homozygosity in
healthy Swedish individuals. Journal of Human Genetics 56: 524-533 (2011).
4. Teo SM, Ku CS, Salim A, Naidoo N, Chia KS, Pawitan Y. Regions of
homozygosity in three Southeast Asian populations. Journal of Human Genetics
57: 101-108 (2012).
5. Mei TS, Salim A, Calza S, Ku CS, Chia KS, Pawitan Y. Identification of
recurrent regions of Copy-Number Variants across multiple individuals. BMC
Bioinformatics 11: 147 (2010).
6. Pawitan Y, Ku CS, Magnusson PK. How many genetic variants remain to be
discovered? PLoS One 4: e7969 (2009).
7. Teo YY, Sim X, Ong RT, Tan AK, Chen J, Tantoso E, Small KS, Ku CS, Lee EJ,
Seielstad M, Chia KS. Singapore Genome Variation Project: a haplotype map of
three Southeast Asian populations. Genome Research 19: 2154-2162 (2009).
8. Naidoo N, Pawitan Y, Soong R, Cooper DN, Ku CS*. Human genetics and
genomics a decade after the release of the draft sequence of the human genome.
Human Genomics 5: 577-622 (2011).
9. Ku CS*, Naidoo N, Wu M, Soong R. Studying the epigenome using next
generation sequencing. Journal of Medical Genetics 48: 721-730.
10. Ku CS*, Naidoo N, Teo SM, Pawitan Y. Regions of homozygosity and their
impact on complex diseases and traits. Human Genetics 129:1-15 (2011).
11. Ku CS*, Naidoo N, Pawitan Y. Revisiting Mendelian disorders through exome
sequencing. Human Genetics 129:351-370 (2011).
12. Ku CS*, Loy EY, Salim A, Pawitan Y, Chia KS. The discovery of human genetic
variations and their use as disease markers: past, present and future. Journal of
Human Genetics 55:403-415 (2010).
13. Hartman M, Loy EY, Ku CS, Chia KS. Molecular epidemiology and its current
clinical use in cancer management. Lancet Oncology 11: 383-390 (2010).
14. Ku CS*, Loy EY, Pawitan Y, Chia KS. The pursuit of genome-wide association
studies: where are we now? Journal of Human Genetics 55: 195-206 (2010).
15. Ku CS*, Chia KS. The success of the genome-wide association approach: a brief
story of a long struggle. European Journal of Human Genetics 16: 554-564
(2008).
16. Ku CS, Chia KS. Genome-wide association studies of type 2 diabetes. Asia-
Pacific Journal of Endocrinology (2009).
*Corresponding author

Commentary
1. Polychronakos C, Ku CS. Exome diagnostics: already a reality? Journal of
Medical Genetics 48: 579.

Encyclopedia/book chapters (Encyclopedia of Life Sciences, Publisher: John Wiley &
Sons)
1. Ku Chee-Seng, Loy En Yun, Pawitan Yudi, Chia Kee-Seng. Genome-wide
Association Studies: The Success, Failure and Future. Published online: 15
December, 2009. (*Keynote Article)
2. Chee Seng Ku, Patrik K.E. Magnusson, Kee Seng Chia, Yudi Pawitan. Research
on rare variants for complex diseases. Published online: 15 September, 2010.
(*Keynote Article)
3. Chee-Seng Ku, Yudi Pawitan, Kee-Seng Chia. Genome-Wide Association
Studies. Published online: 15 March, 2009.
4. Ku Chee Seng, Kasiman Katherine, Chia Kee Seng. High-Throughput Single
Nucleotide Polymorphisms Genotyping Technologies. Published online: 15
September, 2009.

5. Jonathan T Tan, Kee Seng Chia, Chee Seng Ku. The Molecular Genetics of Type
2 Diabetes: Past, Present and Future. Published online: 15 September, 2009
6. Ku Chee-Seng, Loy En Yun, Pawitan Yudi, Chia Kee-Seng. Next Generation
Sequencing Technologies and Their Applications. Published online: 19 April,
2010.
7. Ku Chee-Seng, Loy En Yun, Pawitan Yudi, Chia Kee-Seng. Whole Genome
Resequencing and 1000 Genomes Project. Published online: 19 April, 2010.
8. Chee Seng Ku, Nasheen Naidoo, Mikael Hartman, Yudi Pawitan. Genome wide
association studies of cancers. Published online: 15 December 2010
9. Chee Seng Ku, Nasheen Naidoo, Mikael Hartman, Yudi Pawitan. Cancer genome
sequencing. Published online: 15 December 2010
10. Chee Seng Ku, Nasheen Naidoo, Teo Shu Mei, Yudi Pawitan. Characterizing
structural variation by means of next-generation sequencing. Published online: 15
February 2011














LIST OF TABLES

Chapter 2 - Background
Table 1 – Categories of human genetic variations
Table 2 – Summary statistics of the DGV
Table 3 - Summary of the features of NGS technologies
Table 4 - Comparison between microarrays and sequencing-based methods for detecting
structural variations

Chapter 4 – Materials and methods
Table 5 – Summary of samples, genotyping platforms, detection algorithms and data used
and generated by Study I - IV

Chapter 5 - Results
Table 6 – The proportion of deletion and duplication loci overlapping with the UCSC
database with varying population frequencies
Table 7 – Summary statistics of CNV loci constructed from PennCNV output
Table 8 – CNPs that overlap with important and known disease- and pharmacogenetic-
related genes
Table 9 – Correlation between CNPs and GWAS-SNPs at r² > 0.5
Table 10 – CNPs (FDR <0.01) that overlap with known disease-associated or
pharmacogenetic-related genes
Table 11 - The number of CNPs that showed significant differences (FDR <0.01) in the
pairwise comparisons among the 10 populations
Table 12 – Correlation between CNPs and GWAS-SNPs at r² > 0.5 in 10 populations
Table 13 – Characteristics of ROHs in three Singapore populations







LIST OF FIGURES
Chapter 2 - Background
Figure 1 – Types of DNA sequence or genetic variations in the human genome. The
genetic variations can be broadly divided into 5 categories: (a) single nucleotide changes,
(b) tandem repeats, (c) indels, (d) structural variations (copy number variations and copy
neutral variations) and (e) regions of homozygosity.

Figure 2a – Single nucleotide changes (adapted from Ku et al. (2010) J. Hum. Genet.
55:403-415).
Figure 2b – Tandem repeats (adapted from Ku et al. (2010) J. Hum. Genet. 55:403-415).
Figure 2c – Indels (adapted from Ku et al. (2010) J. Hum. Genet. 55:403-415).
Figure 2d - Structural variations (adapted from Ku et al. (2010) J. Hum. Genet. 55:403-
415).

Figure 3a – The proportion of new SNPs identified in whole genome resequencing
studies (adapted from Ku et al. (2010) J. Hum. Genet. 55:403-415).
Figure 3b – The proportion of new indels identified in whole genome resequencing
studies (adapted from Ku et al. (2010) J. Hum. Genet. 55:403-415).

Figure 4 – Different patterns of signal intensity of CNVs for oligonucleotide CGH and
SNP genotyping arrays (adapted from Alkan et al. (2011) Nat. Rev. Genet. 12:363-376).

Figure 5 – Top panel: No discrepancy or discordance in insert size and orientation of the
paired-end sequences aligned to the reference genome. Bottom panel: (a) simple
deletions are predicted when paired-end sequences span a region larger than a specified
cutoff ‘D’ (red region indicates region deleted from sample genome); (b) simple
insertions have a span smaller than a specified cutoff ‘I’ (blue region indicates region
inserted in sample genome); and (c) inversions are seen when ends map to the genome at
different relative orientations (yellow region indicates region inverted in sample
genome) (adapted from Korbel et al. (2007) Science 318:420-426).

Figure 6 – This figure illustrates the difference between ‘sequence coverage’ and
‘physical coverage’. The specific nucleotide position (red arrow) is covered by two
sequence reads, highlighted by red circles (sequence coverage = 2); however, four
paired-end sequence reads span the position (physical coverage = 4) (adapted
from Meyerson et al. (2010) Nat. Rev. Genet. 11:685-696).

Figure 7 – This figure illustrates that changes in sequencing depth (abundance of
sequence reads) are used to identify copy number changes such as homozygous and
hemizygous deletions and duplications.

Figure 8 – Plots of the differences in the LRR and BAF patterns for the ROH (left
panels) and one-copy deletion (right panels) generated from a sample derived from our
previous study (Ku et al. 2010) and genotyped by the Illumina 1M BeadChip (adapted
from Ku et al. (2011) Hum. Genet. 129:1-15).

Chapter 5 - Results
Figure 9 – Number of CNVs per genome and their frequency in each of the three
Singapore populations (adapted from Ku et al. (2010) Hum. Mutat. 31:851-857).

Figure 10 – Number of loci replicated by the Affymetrix platform and novel loci not
found in the DGV.

Figure 11 - PCA comparing the Swedish and HapMap III populations.


Figure 12 - PCA results based on the common ROH loci for three Singapore populations.





LIST OF ABBREVIATIONS
ABI - Applied Biosystems
ADAMTSL3 - ADAMTS-like 3
ASW - people of African ancestry in the southwestern USA
BAC – bacterial artificial chromosome
BAF - B allele frequency
BMI – body mass index
bp - base pair
CCDC60 - coiled-coil domain containing 60
CCL3L1 - chemokine (C-C motif) ligand 3-like 1
CEPH - Centre d'Etude du Polymorphisme Humain
CFH - complement factor H
CFHR1 - complement factor H-related 1
CFHR3 - complement factor H-related 3
CGH - comparative genomic hybridization
CHD - the Chinese community in Metropolitan Denver, Colorado, USA
Chr - chromosome
CNP – copy number polymorphism
CN – copy number
CNV – copy number variation
CTDSPL - CTD (carboxy-terminal domain, RNA polymerase II, polypeptide A) small
phosphatase-like

CYP2A6 – cytochrome P450, family 2, subfamily A, polypeptide 6
CYP2A7 - cytochrome P450, family 2, subfamily A, polypeptide 7
DGV – database of genomic variants
DNA – deoxyribonucleic acid
DOC - depth-of-coverage
ERBB4 - v-erb-a erythroblastic leukemia viral oncogene homolog 4 (avian)
FCGR3A - Fc fragment of IgG, low affinity IIIa, receptor
FCGR3B – Fc fragment of IgG, low affinity IIIb, receptor
FCGR2B - Fc fragment of IgG, low affinity IIb, receptor
FCGR2C - Fc fragment of IgG, low affinity IIc, receptor
FDR - false discovery rate
FISH – fluorescent in situ hybridization
GA - genome analyzer
GIH - Gujarati Indians in Houston, Texas, USA
GLG1 - golgi glycoprotein 1
GS FLX - genome sequencer FLX
GSTM1 - glutathione S-transferase mu 1
GSTM2 - glutathione S-transferase mu 2
GSTT1 - glutathione S-transferase theta 1
GSTT2 - glutathione S-transferase theta 2
GSTT2B - glutathione S-transferase theta 2B
GSTTP1 - glutathione S-transferase theta pseudogene 1
GWAS – genome-wide association studies
HapMap – haplotype map
HIV - human immunodeficiency virus
HLA – human leukocyte antigen
HLA-DRB1 - major histocompatibility complex, class II, DR beta 1
Indels – insertions and deletions

IRGM – immunity-related GTPase family, M
Kb - kilobase
LCE3B - late cornified envelope 3B
LCE3C - late cornified envelope 3C
LD – linkage disequilibrium
LRR - log R ratio
LWK - the Luhya in Webuye, Kenya
Mb - megabase
MEX - people of Mexican ancestry in Los Angeles, California, USA
MHC - major histocompatibility complex
MKK - the Maasai in Kinyawa, Kenya
mRNA – messenger ribonucleic acid
NEGR1 - neuronal growth regulator 1
NGS - next-generation sequencing
NHGRI - National Human Genome Research Institute
NUS-IRB - National University of Singapore-Institutional Review Board
PARK2 - parkinson protein 2, E3 ubiquitin protein ligase (parkin)
PC - principal component
PCA - principal component analysis
PCR – polymerase chain reaction
PEM - paired-end mapping
qPCR – quantitative polymerase chain reaction
RFLP – restriction fragment length polymorphism
ROH – region of homozygosity
ROMA - representational oligonucleotide microarray analysis
SGCD - sarcoglycan, delta
SMRT – single molecule real time
SNP - single nucleotide polymorphism

SNR - signal-to-noise ratio
SOLiD - supported oligonucleotide ligation detection
STR – short tandem repeat
SWED - Swedish
TLR7 - toll-like receptor 7
TGS – third generation sequencing
TMEM57 - transmembrane protein 57
TP63 - tumor protein p63
TSI - the Tuscans in Italy
UCSC – University of California, Santa Cruz
UGT2B17 - UDP glucuronosyltransferase 2 family, polypeptide B17
VNTR – variable number tandem repeat
WDR12 - WD repeat domain 12
WTCCC – Wellcome Trust Case Control Consortium
WWOX - WW domain containing oxidoreductase
YRI - the Yoruba in Ibadan, Nigeria
ZNF510 - zinc finger protein 510





















CHAPTER 1 – INTRODUCTION
A new era of copy number variation (CNV) discovery began when two separate
studies, published concurrently in 2004, identified several hundred deletions and
duplications in the human genome [1, 2]. The comprehensive detection and
characterization of CNVs has begun to lay the foundation for improving our
understanding of human genetic variation and for deciphering the role of CNVs in the
risk of complex diseases. Subsequently, evidence has linked CNVs to various complex
diseases such as cancers, autoimmune diseases, schizophrenia and autism [3-8].

Over the past several years, most CNV data were generated by microarrays [9, 10].
However, a paradigm shift in the discovery of CNVs and copy-neutral variations came
with the development of a sequencing-based method known as paired-end
mapping (PEM). In 2007, this method was first shown to be powerful in detecting
structural variations (CNVs and copy-neutral variations) using next-generation
sequencing (NGS) technologies [11]. Further studies made use of the ability of
NGS to generate several hundred million short sequence reads, with CNV detection
based on the abundance or density of the sequence reads aligned to a reference genome.
This approach is known as depth-of-coverage (DOC) [12].
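
The depth-of-coverage idea can be illustrated with a minimal sketch. It is not the procedure of any published DOC tool: the function names, the window size and the log2-ratio thresholds are hypothetical choices made only for illustration. Reads are counted in fixed windows along a region, and windows whose read count departs markedly from the median are flagged as candidate losses or gains.

```python
import math
from statistics import median


def window_depths(read_starts, region_length, window):
    """Count aligned read start positions (0-based) in consecutive windows.

    If region_length is not a multiple of window, the last window is shorter.
    """
    n_windows = math.ceil(region_length / window)
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < region_length:
            counts[pos // window] += 1
    return counts


def call_copy_number_changes(counts, loss=-0.5, gain=0.4):
    """Label each window by the log2 ratio of its count to the median count.

    The thresholds are illustrative assumptions, not values from the thesis.
    """
    baseline = median(counts)
    if baseline == 0:
        raise ValueError("median depth is zero; no baseline to compare against")
    calls = []
    for count in counts:
        if count == 0:
            calls.append("loss")  # no reads at all: treat as a candidate deletion
            continue
        ratio = math.log2(count / baseline)
        if ratio <= loss:
            calls.append("loss")
        elif ratio >= gain:
            calls.append("gain")
        else:
            calls.append("neutral")
    return calls


depths = [100, 100, 50, 100, 150, 100]
print(call_copy_number_changes(depths))
```

In this toy input, the window with half the median depth is labelled a loss and the window with 1.5 times the median depth a gain. Real data would also need corrections (for example for GC content and mappability) that this sketch leaves out.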

However, when our CNV project started in 2007 as part of the Singapore
Genome Variation Project [13], the sequencing-based methods for detecting CNVs were
still under development and not yet well established. The Singapore Genome Variation
Project aimed to characterize the extent of common single nucleotide polymorphisms
(SNPs) and
the patterns of linkage disequilibrium (LD) and haplotype in the human genome of DNA
samples from each of the three populations in Singapore, i.e., Chinese, Malays and
Indians ( Therefore, two high-density SNP
genotyping arrays were chosen for the project. These arrays were the Affymetrix
Genome-Wide Human SNP Array 6.0 and the Illumina Human1M BeadChip. As a result,
the signal intensity data of these two genotyping arrays were also used for this CNV
detection project. In addition, in collaboration with the Department of Medical
Epidemiology and Biostatistics, Karolinska Institutet, Sweden, DNA samples from the
Swedish population were also genotyped by the Affymetrix Genome-Wide Human SNP
Array 6.0 for the project.

My thesis is divided into four studies (Study I – IV), each with a specific aim. The
primary aim was to identify CNVs and study their population characteristics using
high-density SNP genotyping arrays in the Singapore population (Study I) and the
Swedish population (Study II). The motivation for these studies was that CNV data for
the Singapore and Swedish populations are limited.

Besides our SNP dataset, the CEL-files of the Affymetrix SNP Array 6.0 for the seven
populations in the International HapMap III project were downloaded from the
International HapMap ftp site
( This allowed us to
investigate population differences of CNV profiles between the HapMap III and
Singapore populations (Study III). It is important to study population differences,
particularly for those CNVs that overlap with known disease-associated genes,
pharmacogenetic genes or other medically important genes, which could have different
impacts in different populations [4, 14, 15]. Currently, the amount of data documenting
the differences in CNVs among populations is limited.

In addition to CNVs, regions of homozygosity (ROHs) can also be detected using
high-density SNP genotyping arrays. ROHs are more abundant in the genomes of
outbred human populations than previously thought [16]. In addition, studies have
found ROHs to be associated with complex phenotypes such as schizophrenia, late-onset
Alzheimer’s disease and height [17-19]. This suggests that studying ROHs may be useful
for identifying genetic susceptibility loci harboring recessive variants for complex
diseases and traits. Therefore, the secondary aim of this thesis was to identify ROHs
and study their distribution patterns using the same set of SNP data (the Affymetrix
SNP Array 6.0 and Illumina 1M datasets) in the Singapore population (Study IV). For the
Swedish population, the ROH analysis was included in Study II.

In summary, the four studies in my thesis are:
Study I – Genomic copy number variations in three Southeast Asian populations
Study II – A population-based study of copy number variants and regions of
homozygosity in healthy Swedish individuals
Study III – Copy number polymorphisms in new HapMap III and Singapore populations
Study IV - Regions of homozygosity in three Southeast Asian populations


CHAPTER 2 - BACKGROUND
2.1. Human genetic variations
Human genetic variations are the differences in DNA sequence among the genomes of
individuals in populations. They can take many forms, including single nucleotide
changes or substitutions, tandem repeats, insertions and deletions (indels), additions
or deletions that change the copy number of a larger segment of DNA sequence (i.e.
CNVs), other chromosomal rearrangements such as inversions and translocations (also
known as copy-neutral variations), and ROHs (Figure 1 and Table 1). These genetic
variations span a spectrum of sizes from a single nucleotide to megabases. Single
nucleotide substitutions or alterations involve a change in a single nucleotide at a
particular locus in the DNA sequence; they include restriction fragment length
polymorphisms (RFLPs), single nucleotide polymorphisms (SNPs) and single nucleotide
indels. At the other extreme, CNVs, inversions, translocations and ROHs encompass
larger segments of DNA sequence that range from kilobases to megabases (>1 kb),
whereas tandem repeats and indels fall between these extremes (>1 bp to 1 kb) [20, 21].












Table 1 – Categories of human genetic variations

Category                             Genetic variation                    Size
Single nucleotide changes            RFLP, SNP, single nucleotide indel   Single nucleotide
Tandem repeats                       STR                                  2 – 8 bp
                                     VNTR                                 >8 bp
Indels                               Small indel                          2 – 100 bp
                                     Intermediate indel                   >100 bp – <1 kb
Structural variations                Deletion, duplication, inversion,    >1 kb
                                     translocation
Copy-neutral loss of heterozygosity  ROH                                  >1 Mb
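
The size boundaries in Table 1 for insertion- and deletion-type variants can be written as a small lookup. This is a sketch only: the function name is hypothetical, and because Table 1 leaves a length of exactly 1 kb unassigned, the code assumes that 1,000 bp and above counts as a structural variation.

```python
def classify_by_length(length_bp: int) -> str:
    """Map the length of an insertion/deletion-type variant to a Table 1 category."""
    if length_bp < 1:
        raise ValueError("length must be a positive number of base pairs")
    if length_bp == 1:
        return "single nucleotide change"
    if length_bp <= 100:
        return "small indel"
    if length_bp < 1000:
        return "intermediate indel"
    # Assumption: exactly 1 kb is treated as structural; Table 1 only says ">1 kb".
    return "structural variation"


for size in (1, 35, 400, 25_000):
    print(size, "bp ->", classify_by_length(size))
```

Tandem repeats and ROHs are deliberately left out, since Table 1 separates them from the other categories by their nature (repeat unit, homozygosity) and not by length alone.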


Figure 1 – Types of DNA sequence or genetic variations in the human genome. The
genetic variations can be broadly divided into 5 categories (a) single nucleotide changes,
(b) tandem repeats, (c) indels, (d) structural variations (CNVs and copy-neutral
variations) and (e) ROHs.

In general, these genetic variations occur spontaneously in the human genome, and are
the footprints of alterations that occur in DNA replication during cell division. External
agents, such as viruses and chemical mutagens, can also induce changes in the DNA
sequence. The occurrence of each type of genetic variation is mediated by different
molecular mechanisms, although most of these are currently unclear. For example,
several mechanisms have been proposed to explain the widespread occurrence of CNVs
in the human genome, such as non-allelic homologous recombination and
non-homologous end joining [22]. For ROHs, the homozygosity could have resulted from
uniparental isodisomy or autozygosity [16]. Regardless of the molecular mechanisms
that generated them, these genetic variations can be broadly classified as either
somatic or germline variations depending on whether they arose during mitosis or
meiosis, respectively.

The understanding of human genetic variations has advanced considerably over the past
30 years. Before the new millennium, the physical mapping of genetic variations such as
RFLPs (in the 1980s) [23] and tandem repeats (in the 1990s) [24] was accomplished. By
contrast, other genetic variations such as SNPs [25], indels [26, 27], CNVs [28-30] and
ROHs [16] were identified after the turn of the new millennium. In addition to physical
mapping, their biological functional roles, for example, their effects on or
associations with mRNA expression levels, alternative splicing processes and other
molecular and regulatory processes, are now better understood [31-34]. Furthermore,
these genetic variations were also found to be associated with various human diseases,
including monogenic and complex diseases [4, 17, 34-37]. Presently, research in genetic
variation is drawing much attention and effort from the genetics community, as is
evident from the initiation of the 1000 Genomes Project. A major aim of this project is
to construct the most detailed map of genetic variations in the human genome. The pilot
phase of the project was completed in 2010 (see section 2.10) [38].

2.2. Categories of genetic variations
There is still no clear consensus on how to define and categorize genetic variations. For
example, SNPs are defined as single nucleotide substitutions, although single
nucleotide insertions or deletions are occasionally placed in this category as well
(Figure 2a). Point mutations include both single nucleotide substitutions and single
nucleotide indels with population frequencies of less than 1%. They differ from
polymorphisms, whose population frequency is higher than the arbitrary cutoff of 1%.
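
As a minimal sketch of this frequency-based distinction, the function below labels a single nucleotide variant by its population frequency. The function name is hypothetical, and because the text does not say how a frequency of exactly 1% is handled, the code assumes it counts as a polymorphism.

```python
def frequency_class(population_frequency: float, cutoff: float = 0.01) -> str:
    """Label a single nucleotide variant by its population frequency.

    cutoff is the arbitrary 1% threshold described in the text.
    """
    if not 0.0 <= population_frequency <= 1.0:
        raise ValueError("population frequency must lie between 0 and 1")
    # Assumption: a frequency exactly at the cutoff is counted as a polymorphism.
    if population_frequency >= cutoff:
        return "polymorphism"
    return "point mutation"


print(frequency_class(0.005))  # below the 1% cutoff
print(frequency_class(0.20))   # well above the 1% cutoff
```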


Figure 2a – Single nucleotide changes (adapted from Ku et al. (2010) J. Hum. Genet.
55:403-415) [21]


Tandem repeats can be broadly divided into two classes: short tandem repeats (STRs)
and variable number tandem repeats (VNTRs). STRs usually refer to tandem repeats in
which the length of the repeated sequence is arbitrarily set at eight nucleotides or
less, and VNTRs are longer tandem repeats (Figure 2b). They are also known as
microsatellites and minisatellites, respectively. The most common types of
microsatellites are di-, tri- and tetra-nucleotide repeats. By contrast, runs of an
identical nucleotide that are several bases or longer in length are known as
homopolymer sequences, for example, GGGGG or AAAAA. Although the sequence in a
tandem repeat is simple compared with other more complex DNA sequence changes or
rearrangements, these simple sequences can be repeated up to hundreds of times, thus
creating very high heterozygosity or allelic diversity [20, 21, 39, 40].
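
The distinctions drawn above can be sketched as a short classifier that takes the repeated unit of a tandem repeat. The function name is hypothetical, and the eight-nucleotide boundary is the arbitrary cutoff mentioned in the text.

```python
def tandem_repeat_class(unit: str) -> str:
    """Classify a tandem repeat by its repeated unit (a DNA string such as 'CA')."""
    unit = unit.upper()
    if not unit or set(unit) - set("ACGT"):
        raise ValueError("unit must be a non-empty string of A, C, G and T")
    if len(set(unit)) == 1:
        # A run of one identical nucleotide, e.g. GGGGG or AAAAA.
        return "homopolymer"
    if len(unit) <= 8:
        return "STR (microsatellite)"
    return "VNTR (minisatellite)"


for repeat_unit in ("A", "CA", "GATA", "ACGTTGCAAT"):
    print(repeat_unit, "->", tandem_repeat_class(repeat_unit))
```

The sketch classifies by unit length only. It does not check how many times the unit is repeated, which is what creates the allelic diversity described in the paragraph.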


Figure 2b – Tandem repeats (adapted from Ku et al. (2010) J. Hum. Genet.
55:403-415) [21]


The boundary or distinction between CNVs and indels is even more unclear. In the
Database of Genomic Variants (DGV; deletions and
duplications/insertions larger than 1 kb are classified as ‘CNVs’, whereas those between
100 bp and 1 kb are grouped as ‘InDels’. Table 2 summarizes the number of indels, CNVs
and inversions cataloged in the DGV. As a result, the several hundred thousand
remaining indels in the range of several nucleotides to tens of nucleotides, which were
identified in recent whole-genome resequencing studies, currently do not have their own
category [41-47]. For example, Wang et al. (2008) [43] found approximately 140,000 indels
