Acknowledgements
I would like to acknowledge all who have helped and inspired me during my
years at the National University of Singapore.
I am very grateful to my supervisor, Assistant Professor Lee, Caroline G.L.,
and her husband, Associate Professor Chong, Samuel S. for their invaluable
inspiration, guidance and encouragement throughout the course of my Ph.D study.
I want to thank Miss Wong Li Peng for her diligent and excellent work. I
would also like to thank all the lab members for their kind help. They made my four
years in this “family” fun and exciting. In addition, I would like to specially thank my
friends Mr. Wang, Baoshuang, Mr. Ren, Jianwei, Mr. Wang, Zihua, Mr. Zhang,
Dongwei, Mr. Gwee, PaiChung and Dr. Lee, Alvin T.C. for their constant support and
friendship.
I acknowledge the National University of Singapore for honoring me with a
studentship and financial assistance in the form of a scholarship.
Table of Contents
Acknowledgements ……………………………………………………………… i
Table of Contents………………… ……………………………………………… ii
Summary …………………………………………………………………. vi
Publications and Awards arising during PhD tenure………………………….… viii
List of Tables …………………………………………………………………… xi
List of Figures ……………………………………………………………………. xii
PART I INTRODUCTION
Chapter 1: General Introduction …………………………………….……………. 1
1.1 SNP profiling in candidate genes and genomic regions – the practical
pharmacogenetics approach ……………………………………… …….…2
1.1.1 Introduction of SNP profiling and several concepts related to it.……. 2
1.1.2 Association studies ………………………………………………… 7
1.1.3 Detection of signature of natural selection …………………………. 11
1.2 A drug-response related genomic region around chromosome 7q21.1: the
importance of MDR1, MDR3 and the CYP3A clusters …………………….16
1.2.1 Drug response related genes. The ATP binding cassette (ABC)
transporter super-family and the P450 cytochrome enzyme super-
family ……………………………………………………………… 16
1.2.2 ABCB1 (MDR1) gene and its related functional and genetic studies 21
1.2.3 ABCB4 (MDR3) gene and its related functional and genetic studies 26
1.2.4 An important CYP3A cluster: the CYP3A4, CYP3A5, CYP3A7 and
CYP3A43 genes ……………………………………………………. 27
1.3 Objectives and Significance ……………………………………………… 30
1.4 References …………………………………………………………………. 31
PART II GENOTYPING TECHNOLOGY
Chapter 2: Simultaneous Genotyping of Multiple Single-Nucleotide
Polymorphisms in Candidate Gene by Single-Tube Multiplex
Minisequencing
2.1 Introduction ……………………………………………………………… 43
2.2 Material and methods ………………………………………………………43
2.3 Examples and Discussion …………………………………………………. 47
2.4 References ………………………………………………………………….48
PART III GENETIC CHARACTERIZATION OF ABCB1/4 AND CYP3A
Chapter 3: Distinct Haplotype Profiles and Strong Linkage Disequilibrium at the
MDR1 Multidrug Transporter Gene Locus in Three Ethnic Asian
Populations …………………………………………………………… 49
3.1 Introduction …………………………………………………………… 50
3.2 Materials and Methods ………………………………………… ……… 54
3.3 Results ………………………………………………………… ……… 56
3.3.1 SNPs in the promoter and 5’ UTR of the MDR1 gene …….……. 58
3.3.2 SNPs in the coding region of the MDR1 gene ………… ………… 60
3.3.3 Haplotype profile and linkage disequilibrium of SNPs in the MDR1
gene ………………………………………………………………… ……. 63
3.4 Discussions ….………………………………………………………… 67
3.5 References ……………………………………………………… ………. 74
Chapter 4: Genomic Evidence for Recent Positive Selection at the Human MDR1
Gene Locus …………………………………………………………… 78
4.1 Introduction ……………………………………… …………………… 79
4.2 Materials and Methods ……………………………………………………. 82
4.3 Results …………………………………………… …………………… 88
4.3.1 MDR1 SNP allele frequencies differ among populations ……… … 88
4.3.2 MDR1 haplotype diversity differs among populations …………… 89
4.3.3 Highly variable LD between SNP loci ……………………… …… 93
4.3.4 SNPs 10 and 11 of the MDR1 gene are positively selected ……… 96
4.4 Discussion ……………………………………………… …………… 101
4.4.1 Varied haplotype diversity and long-range LD in the MDR1 gene 101
4.4.2 Evidence of recent positive selection at the MDR1 gene locus …. 103
4.4.3 Implication of recent positive selection with respect to functional
disease association studies ……………………….……………… 104
4.5 Reference …………………………………………… ……………… 107
Chapter 5: An Extended Genetic Study of a Drug-response Related Region:
Differential Selections Detected in Both MDR1 and MDR3 genes 112
5.1 Introduction ………………………………………… ………………… 113
5.2 Materials and methods …………………………………………………… 115
5.3 Results ………………………………………… ……………………… 119
5.3.1 Genotyping results and allele frequencies ……… ………………. 119
5.3.2 Linkage Disequilibrium profiles ………………… …………… 122
5.3.3 Haplotype frequency profiles ……………………… …………… 123
5.3.4 Detection of positive selection ……………………………… … 126
5.4 Discussion ………………………………………… ………………… 133
5.5 Reference ………………………………………………………………… 137
Chapter 6: The CYP3A Gene Shows Strong Evidences of Positive Selection in
Caucasians ……………………………………………………………. 140
6.1 Introduction ………………………………………… …………………. 141
6.2 Materials and methods …………………………………………………… 143
6.3 Results …………………………………………………………… ……. 149
6.3.1 Allele frequency and the Fst, Pexcess tests ………………… ……… 149
6.3.2 Haplotype and Linkage Disequilibrium profiles …… ………… 151
6.3.3 LRH test of positive selection ……………………………… … 156
6.4 Discussions …………………………………………………… ……… 160
6.5 References ……………………………………………………………… 162
PART IV ASSOCIATION STUDY
Chapter 7: MDR1, the Blood–brain Barrier Transporter, Is Associated with
Parkinson’s Disease in Ethnic Chinese ………………… ……… 165
7.1 Introduction ……………………………………………… …………… 165
7.2 Methods ………………………………… …………………………… 168
7.3 Results …………………………………………………………… ……. 172
7.3.1 Association of MDR1 SNPs and their haplotypes with Parkinson’s
disease.…………………………………………………………… 172
7.3.2 Sex differences in risk determination …………………………… 173
7.3.3 Role of SNPs/haplotypes in the MDR1 gene in later onset of
Parkinson’s disease …………………………………………….…. .175
7.4 Discussions………………………… ……………………….………… 177
7.5 Conclusions ……………………………………… …………………. 181
7.6 References ……………………………………………………………… 182
SUMMARY AND CONCLUSIONS ……………………………………… 186
Summary
The ABCB1/MDR1 multidrug transporter is the prototype of drug transporters
and one of the major determinants of drug/xenobiotics response. The MDR1 gene,
together with several other important drug-response genes, namely the
ABCB4/MDR3 gene and the CYP3A gene cluster, maps to a 12 Mb region around
Chromosome 7q21.1. A large number of studies have reported associations between
MDR1/CYP3A genetic polymorphisms and a diversity of functional traits including
gene expression, pharmacokinetic properties as well as susceptibilities to various
diseases. Functional polymorphisms were also identified at the CYP3A4 and
CYP3A5 gene loci. As polymorphisms in genes controlling drug response may
influence an individual’s response to medication, it is necessary to understand this
important drug-response locus, localizing the causative variants and clarifying how its
genetic variants affect function.
This thesis describes a series of studies aimed at addressing the above
questions, using the MDR1 gene and several nearby drug-response genes as models.
Specifically, comprehensive SNP profiling was carried out in and around these genes
in 5 major world populations: Chinese, Malay, Indian, Caucasian and African
American. The relationships between individual markers were described in terms of
linkage disequilibrium (LD) profiles; and the haplotype frequencies were estimated
using Expectation Maximization (EM) approaches and compared amongst the
different populations. We detected substantial, but highly variable and complex LD at
this 12Mb region of Chromosome 7q21.1. Haplotype frequencies vary amongst
populations, with the African population being the most different from the non-
African populations. We further investigated the impact of natural selection at these
gene loci through several tests including a modified Long Range Haplotype (LRH)
test and Fst / Pexcess based tests. The MDR1 and MDR3 genes demonstrated significant
evidence of positive selection for several variants residing on a common extended
haplotype, in the 4 non-African groups. Tests of positive selection, including the LRH
approach and Fst / Pexcess tests together revealed strong signatures of selection at the
CYP3A gene cluster in Caucasians. We further examined the association of
SNPs / haplotypes of SNPs within the MDR1 gene with Parkinson’s disease. Several
MDR1 polymorphisms were found to significantly affect one’s susceptibility to
Parkinson’s disease.
The studies described in this thesis are amongst the first efforts to clarify the
genetic profiles at this important drug-response region at Chromosome 7q21.1. We
presented the genetic relationships of several functionally associated drug-response
genes at this chromosome locus. Our studies provide a basis on which future studies
directed at single loci in different populations can be compared systematically. They
should also facilitate the inference of the genomic location of causative variants. The
evidence of natural selection demonstrated in our studies is among the first to be
reported for genes important for drug response. This evidence strongly supports the
notion that genes controlling drug/xenobiotics responses were under substantial
selection pressures during recent human migrations. Additionally, the approaches for
detecting signatures of natural selection and functional association, applied and
evaluated in our studies could contribute to the identification of other functional
variants in the genome.
Publications and Awards arising during PhD tenure:
Peer-Reviewed Publications:
1. Kun Tang, Soo-Mun Ngoi, Pai-Chung Gwee, John MZ Chua, Edmund JD Lee,
Samuel S. Chong and Caroline G. Lee*. Distinct Haplotype Profiles and Strong
Linkage Disequilibrium in the MDR1 multidrug transporter gene locus in three
Asian populations. Pharmacogenetics 12(6):437-450 (2002). In focus comments
on our article: Kim RB. MDR1 single nucleotide polymorphisms: multiplicity of
haplotypes and functional consequences. Pharmacogenetics 12(6): 425 (2002)
(2003 Impact Factor: 5.851)
2. Pai-Chung Gwee, Kun Tang, John MZ Chua, Edmund JD Lee, Samuel S Chong,
Caroline G. Lee*. Simultaneous Genotyping of Seven Single Nucleotide
Polymorphisms (SNPs) of the MDR1 gene by Single Tube Multiplex
Minisequencing. Clinical Chemistry 49(4):672-676 (2003). (2003 Impact
Factor: 5.538)
3. Caroline GL Lee*, Kun Tang, Yin Bun Cheung, Li Peng Wong, Chris Tan, Hui
Shen, Yi Zhao, R. Pavanni, Meng-Cheong Wong, Samuel S Chong and Eng King
Tan. MDR1, the blood-brain barrier transporter, is associated with Parkinson’s
Disease in Ethnic Chinese. Journal of Medical Genetics 41:e60 (2004). (2003
Impact Factor: 6.368)
4. Kun Tang, Li Peng Wong, Edmund JD Lee, Samuel S. Chong, Caroline G.L.
Lee*. Genomic Evidence for Positive Selection at the MDR1 Gene Locus.
Human Molecular Genetics 13(8): 783-797 (2004). (2003 Impact Factor:
8.597)
5. Eng-King Tan, Marek Drozdzik, Monika Bialecka, Krystyna Honczarenko,
Gabriela Klodowska-Duda, YY Teo, Kun Tang, Li-Peng Wong, Samuel S
Chong, Chris Tan, Kenneth Yew, Yi Zhao, Caroline GL Lee. Analysis of MDR1
Haplotypes in Parkinson’s Disease in a White Population. Neuroscience Letters
372: 240-244 (2004).
6. Eng-King Tan, Daniel Kam-Yin Chan, Ping-Wing Ng, Jean Woo, Y Y Teo, Kun
Tang, Li-Peng Wong, Samuel S Chong, Chris Tan, Hui Shen, Yi Zhao, Caroline
GL Lee. MDR1 Haplotype (e21/2677T and e26/3435T) Modulates Risk of
Parkinson’s Disease. Archives of Neurology (in press) (2003 Impact Factor:
4.684)
7. Pai Chung Gwee, Kun Tang, Pui Hoon Sew, Edmund J.D. Lee, Samuel S.
Chong, and Caroline G.L. Lee*. Strong Linkage Disequilibrium at the
Nucleotide Analogue Transporter ABCC5 Gene Locus. Pharmacogenetics (in
press). (2003 Impact Factor: 5.851)
Awards:
1. 2003 American Association for Cancer Research (AACR) Pfizer Scholar-
in-Training Award to present at the AACR Special Meeting: “SNPs,
Haplotypes, and Cancer: Applications in Molecular Epidemiology” September
13-17 (2003) at the Sonesta Beach Resort Key Biscayne, in Key Biscayne,
Florida (did not go because of visa problems). Details of presentation: Kun
Tang, Li Peng Wong, Edmund JD Lee, Samuel S Chong, and Caroline GL
Lee. e21/2677T and e26/3435T alleles in the MDR1 gene showed evidence
of recent positive selection.
2. 2004 AACR-ITO EN Ltd Scholar-in-Training Award to present at the
AACR 95th Annual Meeting in Orlando, Florida, March 27-31, 2004. Abstract
# 2923. Details of presentation: Kun Tang, Li Peng Wong, Edmund JD Lee,
Samuel S Chong, and Caroline GL Lee. Recent positive selection of SNPs
e21/2677 and e26/3435 in the MDR1 gene.
List of Tables
Chapter 1
Chapter 2
2.1 Conditions for multiplex PCR and minisequencing …………………………….45
Chapter 3
3.1 Conditions for PCR amplification and genotype analyses of the 10 MDR1 SNPs
……………………………………………………………………………… ….57
3.2 Validation of 10 MDR1 SNPs identified from GenBank sequence alignments,
reports in dbSNP……………………………………………………………… 59
3.3 Pairwise allele frequency comparisons of SNPs exon1 -129T > C, exon12 1236C
>T, exon21 2677G > T/A and exon26 3435C > T between the different ethnic
groups in this study and between previously published populations and this
study ……………………………………………………………………………. 62
Chapter 4
4.1 Allele frequency comparisons of the different SNPs in the different populations
………………………………………………………………………………… 83
4.2 Primers, PCR and Minisequencing Conditions for genotyping of 5 SNPs at the
MDR1 gene locus…………………………………………………………… 85
4.3 P-values computed by ranking relative EHH of the observed SNP of interest with
that of all of the simulated data points under specified models of population
history at specified recombination rate ……………………………… …… 105
Chapter 5
5.1 Conditions for PCR amplification and genotype analyses of the 19 MDR3 SNPs
………………………………………………………………………………… 117
5.2 Allele frequency comparisons of the different SNPs in the different populations
…………………………………………………………………………………. 122
5.3 Empirical p-values computed by ranking ReEHHs of the observed SNP of interest
against the distribution of the simulated data points under specified models of
population history at specified recombination rate …………………… ……. 130
Chapter 6
6.1 Conditions for PCR amplification and genotype analyses of the 24 CYP SNPs
…………………… ………………………………………………………… 143
6.2 Allele frequency comparisons of the different SNPs in the different populations
…………………… ………………………………………………………… 144
6.3 Empirical p-values of candidate loci of positive selection against specified
demographic models of selection neutrality at various recombination
rates …………………………………………………………… ……………. 155
Chapter 7
7.1 Characteristics of the study populations ………………………………………. 167
7.2 Association of SNPs / haplotypes of SNPs with Parkinson’s disease ………… 170
7.3 Effect of gender in the association of SNPs / Haplotypes in the MDR1 gene with
Parkinson’s disease ……………………………………………………………. 173
7.4 Effect of Age-of-Onset in the association of SNPs / Haplotypes in the MDR1 gene
with Parkinson’s disease …………………………………………… ………. 175
List of Figures
Chapter 1
1.1 Schematic distribution of drug-response related genes at locus 7q21.1 … … 20
Chapter 2
2.1 Multiplex PCR and genotyping results for the seven MDR1 SNP …………… 46
Chapter 3
3.1 Drawn-to-scale map of the MDR1 gene, mRNA and putative protein secondary
structure…………………….………………………………………………… 55
3.2 Pairwise linkage profiles of the four SNPs present in our population …… … 64
3.3 Haplotype frequencies of the three high-frequency SNP loci present in our
population ………………………………………………………………………. 65
3.4 Linkage disequilibrium profile of the three high frequency MDR1 SNPs in the
three ethnic groups ………………………………… ………………………… 66
Chapter 4
4.1 Distribution of 12 SNPs across the MDR1 gene ……………………………… 84
4.2 Haplotype profiles of the 10 MDR1 SNPs in the five populations …………… 91
4.3 Pairwise linkage disequilibrium profiles for single SNPs ……………………… 94
4.4 HBDs for five selected loci in the five populations…………………………… 97
4.5 EHH and relative EHH tests ……………………………………………………. 99
Chapter 5
5.1 The physical map of MDR1, MDR3 genomic structure, and the distribution of
tested SNPs …………………… …………………………………………… 115
5.2 Pairwise linkage disequilibrium profiles for single SNPs ………… ………… 122
5.3 Haplotype profiles based on 19 SNPs selected from MDR1 and MDR3 (see
methods) in the five populations ……………………………… …………… 124
5.4 Haplotype branching diagrams (HBD) for five selection indicative loci in the
MDR1/MDR3 region in all the five populations ……………………………… 126
Chapter 6
6.1 The physical map of the CYP3A gene cluster, the distribution of the tested SNPs
and the Fst / Pexcess profiles …………………………………………………… 147
6.2 Haplotype profiles based on 19 SNPs (see methods) …………………………. 152
6.3 Pairwise linkage disequilibrium profiles for single SNPs. |D’| and r² were
calculated for each pair of the 24 SNPs in the five populations and represented in
color gradients ………………………………………………………………… 154
6.4 Examples of Haplotype branching diagrams (HBD) for several selection
indicative loci in the CYP3A region in the four non-African groups … …… 158
Chapter 7
7.1 Schematic diagram showing relative positions of the SNP sites in the promoter
and exons of the MDR1 gene …………………………………… ………… 168
PART I INTRODUCTION
Chapter 1: General Introduction
“If it were not for the great variability among individuals, medicine might as
well be a science and not an art”, said Sir William Osler in 1892 (1). His view of
medicine as an art dominated the last 100 years. This notion reflects the fact
that doctors lack a precise basis for judgment when prescribing medicine to individual
patients, even though individuals differ greatly in their responses to medicine. In
addition, great variation in drug response also exists among different ethnic populations.
It is not rare that medicines or dosages tested on one population do not directly apply to
other populations. Fortunately, the practice of medicine is seeing a great change at the
dawn of the new century, as a result of the recent surge in genetic studies. With the
completion of the Human Genome Project and the rapid development of genetic
assays and technologies, geneticists now see the possibility of identifying the
inherited differences between individuals that cause the diverse drug responses. The
study of how genetic differences influence patients’ responses to drugs is defined as
Pharmacogenetics (1). It is hoped that by the time we have a sufficient understanding
of Pharmacogenetics, we can accurately predict one’s response to medication by
analyzing his/her genetic profile; personalize medication to obtain optimal treatment
for each individual; and target the specific gene that causes adverse responses or
diseases.
This thesis aims to shed some light on the understanding of the genetics of
drug-response genes. In this thesis, I will present several genetic studies carried out
on the Multidrug Resistance 1 (MDR1, ABCB1) gene and a few other closely located
drug-response-related genes, namely the ABCB4 (MDR3) gene and the CYP3A gene
cluster, based mainly on the approach of Single Nucleotide Polymorphism (SNP)
profiling. The introduction is given in three sections. The first section is a brief
review of the importance and the problems of SNP profiling approaches in current
Pharmacogenetics studies. The second section introduces the two major gene
super-families related to drug response, ABC and CYP, and their specific members
examined in this thesis: the MDR1 and MDR3 genes and the CYP3A cluster. Previous
genetic and pharmacogenetic studies on these candidate genes are also reviewed in
this section. Finally, the objectives and significance of this study are given in the
last section.
1.1 SNP profiling for candidate genes (regions) – the practical Pharmacogenetics
approach
1.1.1 Introduction to SNP profiling and several concepts related to it.
Modern medical research has made substantial progress during the last
few decades, owing to the accelerated understanding of the basic mechanisms
underlying normal life processes and diseases at both the cellular and molecular levels.
Every year, a significant number of novel, highly efficient medications and
treatments come onto the market. As the choices of therapy grow rapidly, doctors and
researchers are nonetheless faced with a dilemma: which medication is most
appropriate for which individual, given the great variation in treatment outcomes
amongst different patients? Owing to the poor understanding of the nature of
this variation, a large proportion of patients suffer from undesirable responses to
drug therapy, generally categorized as adverse drug reactions and lack of
therapeutic response. Adverse drug reactions have been listed among the five leading
causes of death in Western countries (2). The lack of therapeutic response, on the
other hand, has even greater consequences according to some estimates (3).
It has long been believed that the majority of phenotypic diversity in humans results
from heritable inter-individual differences in genetic composition, and this
notion has received strong experimental support from twin studies. The study of
how genetic differences predict and contribute to the phenotypic variation in drug
response, namely Pharmacogenetics, therefore promises a way towards individualized,
safer, and more efficient drug treatment (4).
The term “Pharmacogenetics” was coined 40 years ago (5). However,
this field has attracted great interest and undergone rapid progress only recently. The
first few pioneering studies in Pharmacogenetics were all based on the simple model
of Mendelian inheritance, i.e. with a single gene controlling the altered phenotype (6).
However, the Mendelian model is applicable only to rare genetic variations that
introduce dramatic effects (6). It is now generally accepted that the majority of
phenotypes in drug response are controlled by multiple genes, resulting in a continuous
distribution of variation (1, 7, 8). A concept related to this idea in genetic epidemiology
is the “Common Disease-Common Variant” (CD-CV) hypothesis, which states that
common diseases in humans are caused mainly by the interaction of common variants in
multiple genes (9). On the other hand, Pritchard et al. proposed the
“Common Disease-Rare Allele” (CD-RA) hypothesis, which emphasizes the alternative
possibility that common diseases are attributable mainly to
heterogeneous rare variants arising during population expansion (10). Whether or not a
certain disease or phenotypic difference is controlled by multiple common variants,
the identification of genetic variants that control drug response remains the central
problem of modern Pharmacogenetics.
The natural history of genetic polymorphisms in the human genome provides
the most important record in the search for functional variants important for drug
responses. On the one hand, genetic variations causing phenotypic changes constitute a
special subset of the whole body of genetic polymorphisms. On the other hand, neutral
loci near the causal variants are strongly shaped by the genetic forces acting on the
causal variants, owing to their physical linkage (which gives rise to Linkage
Disequilibrium, explained in later sections), and thereby provide an informative
footprint of historical genetic events. There are many different forms of polymorphism,
including Single Nucleotide Polymorphisms (SNPs), micro-satellites, short-tandem-repeat
polymorphisms (STRPs), and sequence insertions and deletions. The polymorphic
marker commonly used in early studies was the micro-satellite repeat. Micro-satellites
possess certain desirable properties, including a relatively even distribution across
the genome and multiple alleles per locus. However, their relatively low density limited
their application in fine mapping studies (11). Another important form of genetic
polymorphism, the Single Nucleotide Polymorphism (SNP), drew much attention
only after the completion of the Human Genome Project, but has since generated
great enthusiasm in the community. It is now commonly hailed as a promising route
towards a comprehensive understanding of Pharmacogenetics (12, 13). SNP is the
most abundant form of genetic polymorphisms. In several genome-wide
characterization studies, the informative high-frequency SNPs (minor-allele frequency
higher than 10%) were estimated to occur on average once every kilo-bases (14, 15).
Such a high density is ideal for fine mapping of the genome. The great prevalence of
SNPs among all forms of polymorphisms also suggests that they may be responsible for the
majority of the phenotypic variance (13).
SNPs are also of special importance to the study of Linkage Disequilibrium (LD),
the non-random association between genetic variations, which plays a pivotal
role in many genetic studies including Pharmacogenetics. Linkage Disequilibrium
arises through mutation and decays through recombination. Under the “Infinite
Site” model (16), a newly arisen variant is associated with all the existing variants on
the carrier chromosome via physical connection. Such linkage is maintained through
generations until it is broken down by recombination, resulting in incomplete
association or complete Linkage Equilibrium (LE), with the strength of association
inversely correlated with distance. The pattern of LD over short distances is also
greatly affected by other molecular and demographic factors, such as recurrent mutation,
random genetic drift, population structure and natural selection (17). LD can be viewed
as the extent to which one polymorphism predicts the status of another. This property is
of great importance to Pharmacogenetic studies, as the effects of functional variants can
be detected by examining surrogate markers in strong LD with the causative ones. Patterns
of LD on a fine scale can also provide insights into the demographic processes of human
history (10). As the average extent of considerable LD is around 30~60 kb in the
human genome (15), almost beyond the highest resolution of any other genetic
marker, only SNPs are suitable for LD mapping at both large and fine scales.
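To make the notion of LD as "the extent to which one polymorphism predicts the status of another" concrete, the following is a minimal sketch of the standard pairwise measures D, |D’| and r² for two biallelic loci. The function name and the example frequencies are illustrative assumptions and are not taken from the data in this thesis.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele A (locus 1) and allele B (locus 2)
    p_a  : frequency of allele A at locus 1
    p_b  : frequency of allele B at locus 2
    All inputs are hypothetical frequencies between 0 and 1.
    """
    # D is the deviation of the observed haplotype frequency from the
    # frequency expected if the two loci were independent.
    d = p_ab - p_a * p_b

    # D' rescales D by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the allelic states at the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Illustrative values only: allele A always occurs with allele B,
# but B also occurs on other haplotypes, so |D'| is 1 while r^2 is below 1.
d, d_prime, r2 = ld_measures(p_ab=0.3, p_a=0.3, p_b=0.6)
print(d, d_prime, r2)
```

In this sketch |D’| reaches 1 whenever at least one of the four possible haplotypes is absent, whereas r² reaches 1 only when each allele at one locus perfectly predicts the allele at the other, which is why the two measures are usually reported together.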
Another concept that views the distribution of polymorphisms from a different
angle, and provides further insight into genetic profiles, is the haplotype. A haplotype
is the combination of alleles of multiple polymorphic loci on one chromosome (14).
A haplotype is thus commonly treated as an abstracted chromosome. In the absence of
recombination, genetic forces acting on functional polymorphisms affect the chromosome
as a whole. Such effects can be clearly seen in the non-recombining Y chromosome
and mitochondrial DNA. For recombining chromosomes, haplotype blocks, which define
genetic regions of strong LD with negligible or limited crossovers,
can also be treated as integral genetic units. These haplotype blocks thus serve as
good predictors of functional variants that reside within the regions of strong LD.
Moreover, since a haplotype is defined by the alleles of multiple binary polymorphisms,
e.g. SNPs, it possesses better statistical power than single SNPs (18). The
early use of haplotypes was limited to analyses of sex chromosomes, where
recombination is very low or absent, or of autosomal chromosomes with family data
(18), for which the haplotype phases, i.e. the actual haplotype sequences on each of the
pair of chromosomes, can be inferred exactly from genotyping data. For
autosomal genotyping in a random population sample, however, one faces the problem
of phase uncertainty, where the number of possible ways to allocate the two alleles of
each marker SNP onto a pair of haplotypes increases exponentially with the number
of heterozygous loci (19). Several strategies have been proposed to solve this problem.
One category uses so-called experimental “haplotyping” approaches, which aim to
physically separate the chromosome pairs by means of molecular technologies. These
include cloning, allele-specific polymerase chain reaction and single-molecule
dilution (20-22). Although very precise, these experimental approaches inevitably
suffer from high labor and expense, and major technological improvements are
necessary to facilitate high-throughput studies (18). On the other hand, statisticians
have proposed computational ways of resolving this problem at the cost of a certain
level of precision. Of these, the Expectation Maximization (EM) based algorithm for
haplotype estimation (19, 23, 24) is probably the most commonly used approach
at present. This algorithm estimates the maximum likelihood frequency distribution
based on the assumption of completely random mating within the studied population.
In tests utilizing either experimental or simulated data, EM produced satisfactory
estimates with little overall difference from the actual data under most common
conditions, although with relatively lower accuracy for low-frequency haplotypes
(25-27). Another commonly used approach is the “PHASE” algorithm, originally
proposed by Stephens et al. (28). PHASE is a Bayesian statistical method
incorporating the prior knowledge that unresolved haplotypes will be similar to
phase-certain haplotypes (18). Compared with the EM algorithm, the performance of PHASE
is similar, or in certain cases slightly better (18, 27, 28). A third algorithm that was
once popular in genetic studies is the subtraction method described by Clark (29),
although EM and PHASE now generally outperform it (18).
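The phase-uncertainty problem and its EM solution can be illustrated with the smallest possible case of two biallelic SNPs, where only the double heterozygote is ambiguous. The sketch below is a simplified illustration written for this purpose, assuming random mating; it is not the implementation used in the cited studies, and the function names and genotype counts are hypothetical.

```python
def count_phase_configurations(n_het_loci):
    """Number of distinct haplotype pairs consistent with one unphased genotype.

    An individual heterozygous at k loci (k >= 1) has 2**(k - 1) possible
    phase configurations; with no heterozygous locus there is exactly one.
    """
    return 1 if n_het_loci == 0 else 2 ** (n_het_loci - 1)


def em_two_snp_haplotypes(geno_counts, max_iter=200, tol=1e-9):
    """Estimate frequencies of haplotypes AB, Ab, aB, ab by EM.

    geno_counts[i][j] is the (hypothetical) number of individuals carrying
    i copies of allele A at SNP 1 and j copies of allele B at SNP 2.
    """
    g = geno_counts
    n = sum(sum(row) for row in g)
    if n == 0:
        raise ValueError("no genotypes supplied")

    # Haplotype counts contributed by phase-certain genotypes (AB, Ab, aB, ab).
    fixed = [
        2 * g[2][2] + g[2][1] + g[1][2],
        2 * g[2][0] + g[2][1] + g[1][0],
        2 * g[0][2] + g[1][2] + g[0][1],
        2 * g[0][0] + g[1][0] + g[0][1],
    ]
    double_het = g[1][1]  # the only phase-ambiguous genotype: AB/ab or Ab/aB

    freqs = [0.25, 0.25, 0.25, 0.25]
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is AB/ab rather than Ab/aB.
        cis = freqs[0] * freqs[3]
        trans = freqs[1] * freqs[2]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5

        # M-step: re-estimate frequencies from expected haplotype counts.
        counts = [
            fixed[0] + w * double_het,
            fixed[1] + (1 - w) * double_het,
            fixed[2] + (1 - w) * double_het,
            fixed[3] + w * double_het,
        ]
        new_freqs = [c / (2 * n) for c in counts]
        converged = max(abs(a - b) for a, b in zip(new_freqs, freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return dict(zip(("AB", "Ab", "aB", "ab"), freqs))


# Hypothetical sample: 4 ab/ab, 4 AB/AB and 2 double heterozygotes.
print(count_phase_configurations(4))
print(em_two_snp_haplotypes([[4, 0, 0], [0, 2, 0], [0, 0, 4]]))
```

With more than two loci the same two steps apply, but the E-step has to weigh every haplotype pair compatible with each genotype, which is where the exponential growth in phase configurations becomes the practical bottleneck.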
1.1.2 Association studies
As mentioned in the previous section, the central problem of
Pharmacogenetics is to scan for functional loci that determine the variability in drug
responses. The association study is the most common strategy for mapping candidate
functional variants.
The principle of a genetic association study is to detect statistical association
between genetic variations/mutations and the variance in the traits of interest, such
as differences in individuals’ predisposition to certain diseases and their response to
certain drugs. Statistical dependence between a genetic marker locus and the trait
suggests a possible functional effect of the corresponding locus on the trait, although
other factors, such as population structure, could also contribute to it. Association
studies therefore offer the possibility of detecting causal variants by genetic
analysis. In reality, the association study of phenotypic changes and their corresponding
genetic variants relies strongly on the understanding of Linkage Disequilibrium. On
the one hand, the linkage between genetic polymorphisms means that an observed
association of a genetic marker with a phenotypic variation does not necessarily imply
any direct functional effect of this marker, as it could be an indirect reflection of the
effect of a causative variant in LD with the observed marker. On the other
hand, LD may serve as a shortcut in association studies, as theoretically a small
subset of polymorphisms (tagging SNPs), designed to represent a high
percentage of all polymorphisms that are in LD, can detect most
variant/function associations without greatly compromising test power, and
with a great increase in efficiency (1, 17, 30).
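As a simple illustration of how statistical dependence between a marker and a trait can be tested, the sketch below applies a Pearson chi-square test to a 2 x 2 table of allele counts in cases and controls. This is one common form of allelic association test and is given only as an example under stated assumptions; the function name and the counts are hypothetical and do not correspond to any data in this thesis.

```python
import math


def allelic_association_test(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2 x 2 allele-count table.

    Returns (chi2, p_value, odds_ratio). All counts are hypothetical allele
    counts, i.e. two alleles are counted per genotyped individual.
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    row_cases, row_controls = a + b, c + d
    col_alt, col_ref = a + c, b + d
    if 0 in (row_cases, row_controls, col_alt, col_ref):
        raise ValueError("each row and column of the table must be non-empty")

    # Closed-form Pearson chi-square statistic for a 2 x 2 table.
    chi2 = n * (a * d - b * c) ** 2 / (row_cases * row_controls * col_alt * col_ref)

    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))

    # Odds of carrying the alternative allele in cases relative to controls.
    odds_ratio = (a * d) / (b * c) if b * c > 0 else math.inf
    return chi2, p_value, odds_ratio


# Hypothetical counts: 60/40 alt/ref alleles in cases, 40/60 in controls.
print(allelic_association_test(60, 40, 40, 60))
```

A significant result from such a test indicates only statistical dependence; as discussed above, the tested marker may simply be in LD with the causative variant, and population structure can produce the same signal.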
Despite the theoretical advantage of using a limited marker set to map
genome-wide LD, the feasibility of this idea depends heavily on the behavior of linkage
disequilibrium, such as how far useful LD extends and how variable LD is
among different genomic regions (17). Long and regular LD patterns could
significantly reduce the number of markers necessary for LD mapping, whereas short
LD requires more markers to be genotyped, although it might reduce the effort of
experimentally identifying causative variants because the candidate regions are
shorter. The behavior of LD in the human genome has generated intensive debate in
recent years. Simple computational models assuming constant population size and
uniform recombination rates predicted LD to be a relatively well-defined
function of physical distance, with useful strong LD extending no more than
several thousand base pairs (31). However, large experimental surveys of high-density
SNP profiles in the human genome revealed a very different scenario, in which LD is
far more variable and generally longer than expected under simple evolutionary models
(14, 15). Later experiments confirmed this large variability of LD across different
genes/regions (17, 32-35). Several genetic factors have been proposed to account for the
large LD variability. Reich et al. ruled out stochastic variation as the
only cause of LD variability, proposing that the discrete LD patterns could arise from
genetic events such as severe bottlenecks and selection (35). However, simulations
assuming such genetic events could not fully explain the high nucleotide diversity
observed in modern humans (10, 17, 36). On the other hand, Patil et al. derived a
high-resolution SNP profile of the entire chromosome 21 and found that SNPs follow a
block-wise distribution (37). Several other studies confirmed this finding and
attributed the block-wise LD to recombination hotspots (32, 37-40). Uneven
recombination activity has been confirmed experimentally (39), and many researchers
consider it to account for the majority of LD variation (35, 40). However,
others have questioned whether uneven recombination rates are the major cause of
discrete LD, since simulations showed that block-like LD patterns can also arise under a
uniform recombination rate, given severe bottlenecks or strong selection (39, 41). Furthermore,
extensively long LD was observed when haplotype-specific |D’| was measured (42),
showing a seemingly “jumping” behavior of LD across LD blocks and contradicting the
hotspot hypothesis. In view of this complicated LD profile, researchers have proposed
new measures to describe linkage disequilibrium and to represent the
association information underlying the tested regions, such as tagging SNPs,
LD blocks and haplotype blocks (32, 37, 40, 41, 43, 44). These new approaches may help
handle large data sets in genome-wide studies.
The first commonly used genetic marker in association studies was the
microsatellite (45-54). However, studies using such markers covered only limited
genomic regions because of the low marker density. SNPs provide the opportunity for
fine-scale, genome-wide association tests.
SNP-based association studies can be classified into two major
categories: the genome-wide association study and the candidate gene/region directed
association study. The genome-wide approach is based on
designing a subset of SNP markers that covers most of the LD patterns across the
whole human genome. The search for marker loci correlated with certain
phenotypes, e.g. particular diseases, can then be conducted across the entire
genome. Such genome-wide approaches make no assumptions about the functions of
individual gene regions, and therefore make it possible to detect new drug-related
genes and to investigate multiple-gene interactions (30). This strategy has
shown promise in identifying candidate causative loci for
Alzheimer’s disease (55). However, it also has several problems. The genome-wide
approach requires a complete SNP subset throughout the entire genome, with
sufficient SNP density and largely predictable patterns of LD. While
genome-wide SNP density is increasing rapidly thanks to the efforts of several
international SNP databases (dbSNP, TSC, Celera), the LD profile has been shown to
be anything but simple, as discussed above. Furthermore, genome-wide
SNP profiling is costly and time-consuming, and the large sample sizes required
and the statistical handling of large data sets pose practical difficulties
(56). These problems led to a preference for the alternative approach: the candidate
gene/region directed SNP profiling and association study, which is also the basic
strategy used in this thesis. This approach chooses candidate genes/regions
related to drug response according to experimental or taxonomical evidence.
Within the candidate genes/regions, appropriate SNP markers are then selected for
LD and haplotype analyses, as well as association analyses.
The candidate gene/region directed association study has been widely applied
in epidemiology and pharmacogenetics. The majority of such studies
report associations of single SNPs with certain phenotypic changes. However, most
early SNP-based association studies did not include LD and haplotype analyses,
and are thus limited in identifying causal variants as well as in assessing the validity
of associations with the candidate loci. As the importance of LD and haplotype
profiling has become more apparent, an increasing number of candidate gene
studies have carried out detailed LD and haplotype analyses (33, 34, 42, 57, 58).
Besides tests based on single SNPs, novel association tests using multiple SNPs
and haplotypes have also been proposed. The main debates on this type of approach
concern how to construct tests that properly use phase-uncertain data and how to
increase the statistical power of the tests (59-62). As these tests use genotype data from
population samples of unrelated individuals, they can be easily applied in general
association studies (59).
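The single-SNP association tests mentioned above can be illustrated with a minimal sketch of an allelic case-control test: a Pearson chi-square test on the 2 x 2 table of allele counts, with one degree of freedom. The function name and the counts are hypothetical, and the sketch ignores issues such as population structure and multiple testing that a real study would have to address.

```python
import math


def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square test (1 d.f.) on allele counts in cases and controls.

    case_a, case_b: counts of the two alleles among case chromosomes
    ctrl_a, ctrl_b: counts of the two alleles among control chromosomes
    Returns (chi-square statistic, p-value). All names are illustrative.
    """
    table = [[case_a, case_b], [ctrl_a, ctrl_b]]
    rows = [case_a + case_b, ctrl_a + ctrl_b]
    cols = [case_a + ctrl_a, case_b + ctrl_b]
    total = sum(rows)
    if 0 in rows or 0 in cols:
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts: 100 case and 100 control chromosomes.
    print(allelic_chi_square(60, 40, 40, 60))
```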
1.1.3 Detection of signatures of natural selection
Although association studies provide a promising way to detect functional
genetic variants of medical significance, they sometimes require prohibitively large
sample sizes to achieve enough statistical power for moderate or multi-gene effects.
Recently, the idea has drawn much attention that important genetic variants determining
drug-response variability may be inferred by detecting signatures of natural selection
(63). Since Darwin, it has been known that variants conferring better fitness on their
carriers have a higher probability of being passed on to subsequent generations and
therefore increase in allele frequency in the population. This process is called
positive selection. On the other hand, deleterious mutations are continuously
screened out of the population by environmental pressures as a result of purifying
selection, also called negative selection. It is hypothesized that during harsh
environmental changes, genes that interact strongly with the changing environmental factors
are subject to more intense selection pressure, are differentially selected for their
functional polymorphisms, and leave specific genetic patterns in local genomic regions
(63). The ability to detect such signatures of natural selection thus provides an
attractive way of finding functional variants. Current data strongly suggest an
African origin of modern global human populations (64-67). The dispersal of
early humans from the African continent to the rest of the world, namely Eurasia and
the Americas, challenged the Out-of-Africa groups with quite different environments,
particularly varied climates, pathogens and food sources (63). Given this
migration model, it is very likely that a number of variants were locally selected in
different populations as adaptations to the drastic environmental changes. These functional
variants, which once improved the fitness of ancestral humans in different environments,
should still contribute greatly to today’s inter-individual and inter-population differences
in drug response and susceptibility to disease (63). Hence, the study of selection
events that occurred in recent human history, especially during the period of the
Out-of-Africa migration and the colonization of the new continents by ancient humans,
is very important to pharmacogenetics.
The most direct way of detecting signatures of positive selection is to detect
allele frequency changes, which are the major consequence of a selection event.
However, variability in allele frequency does not come only from selection. Under
selective neutrality, allele frequencies also fluctuate, sometimes greatly, because of the
stochastic process known as genetic drift (68). Moreover, it is now evident that
human population history was far from a single, constant, equilibrium process; rather,
it involved many complicated demographic processes such as bottlenecks, population
sub-structure, population size changes, etc. (63, 69-71). All these demographic factors
influence genetic profiles, including the allelic spectrum, in different ways, and
some of these effects can resemble those caused by selection (63), complicating the
correct identification of evidence of natural selection.
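The effect of genetic drift on allele frequencies under neutrality can be illustrated with a minimal simulation. The sketch below assumes the standard Wright-Fisher model (constant population size, random mating, non-overlapping generations, no selection or mutation); the function name and parameter values are hypothetical and chosen only for illustration.

```python
import random


def wright_fisher(n_individuals, p0, generations, seed=None):
    """Simulate neutral drift of one allele in a diploid Wright-Fisher population.

    n_individuals: constant number of diploid individuals (2N allele copies)
    p0:            starting allele frequency
    generations:   number of generations to simulate
    seed:          optional seed so that a run can be reproduced
    Returns the list of allele frequencies, starting with p0.
    """
    rng = random.Random(seed)
    n_alleles = 2 * n_individuals
    p = p0
    freqs = [p]
    for _ in range(generations):
        # Each allele copy in the next generation is drawn independently
        # from the current generation (binomial sampling).
        count = sum(1 for _ in range(n_alleles) if rng.random() < p)
        p = count / n_alleles
        freqs.append(p)
    return freqs


if __name__ == "__main__":
    # Hypothetical comparison: drift is much stronger in the smaller population.
    for n in (50, 5000):
        trajectory = wright_fisher(n, 0.5, 100, seed=1)
        print(n, min(trajectory), max(trajectory))
```

Running the simulation with a small population size, as would occur during a bottleneck, typically produces large frequency shifts without any selection, which is why demographic history has to be considered when interpreting allele frequency changes.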
The power of a test of selection relies largely on the design of its statistic.
Statistics should be chosen to capture the genetic signatures that best characterize the
occurrence of natural selection, yet be robust to demographic effects
under neutrality. Bamshad and Wooding recently described a variety of selection
signatures (63), including an excess of rare polymorphisms, an excess of high-frequency
derived variants, large allele frequency differences among populations, and unusually
strong LD or high homozygosity for common variants.
Detection of positive selection has long been a central issue in
population genetics, and a number of useful tests for positive selection
are available. Traditional tests are usually classified into two major categories (63). The
first category is based on comparing the variability and divergence of different
types of genetic variation. Commonly used tests in this category include the
dN/dS ratio test (72, 73), the Hudson–Kreitman–Aguadé test (74) and the
McDonald–Kreitman test (75). These tests depend less on assumptions about
demographic processes, and are thus more robust against variability caused by
demographic factors. However, they are not very powerful, especially when testing
candidate genes/regions, as most of them use only information on
variation types. The tests of the second category, on the other hand, aim to recognize
the specific patterns in the allelic spectrum and/or variability levels caused by selective
sweeps (63). Many of them are based on comparisons of estimators of population