Genetic determinants of infectious disease susceptibility

1. Introduction
Since the beginning of the last century, modernization and improved health care have
reduced infectious disease mortality, mostly in developed countries, but the
burden of infectious diseases in developing countries, such as Indonesia, remains
high[14]. Surprisingly, in recent years there has been a resurgence in TB
incidence[15], even among developed nations, and this phenomenon has sparked
renewed interest in epidemiological and other studies of infectious diseases. The World
Health Organization (WHO) and other organizations are reaching out to developing
countries, where infectious diseases are endemic[16, 17]. Despite the massive efforts
made in introducing drugs and vaccines for treatment and prevention, many common
infectious diseases, such as TB and hepatitis B, have yet to be brought under sufficient
control, and the spread continues [15, 18-21].
Surveillance of infections has revealed that certain individuals can be exposed to
large doses of an infectious agent yet remain resistant to infection[22]. Furthermore, even
among those infected, a large percentage have the immunity to clear the infection
naturally without disease symptoms[6]. This suggests that host defense is often
an effective means of controlling infection. Heritability studies also indicate a strong
genetic component in determining the variable degrees of immune response to infection
and disease susceptibility among individuals[1, 2]. The burden of infectious diseases has
been tremendous throughout the history of mankind, not only in economic terms, but
especially as a major selective pressure on our genome[23, 24]. A diverse
immunological response to a wide range of infectious agents translates into an
evolutionary advantage, so it is not surprising that genes involved in immunity are among
the most abundant and diverse in the human genome[25]. Given the multiple adverse social,
economic, and evolutionary effects of infectious diseases, there is strong motivation to
study the role of host genetics in controlling the immune response to infection[5-7].
Global initiatives, such as the Human Genome Project, the International HapMap
Consortium and the recent 1000 Genomes Project, have cataloged most of the genetic
variations that are common in humans[26-30]. These efforts have spurred the advent of
high-throughput genotyping technology for conducting genetic association studies in a
cost-effective and efficient manner[31]. This essential advancement in genomics has
made possible the main objective of the current PhD project, which is to identify
genetic variation that influences susceptibility to infection or the variable outcomes of
infection. Two genetic mapping studies are used to achieve this aim: the first is a
case-control study on pulmonary TB susceptibility, and the second is a study of
post-vaccination antibody titer response against hepatitis B virus infection. The study
populations for both studies were drawn from Indonesia, where these diseases are
endemic[16, 20] (Figures 1 and 2), and there is significant interest in elucidating the
genomic mechanisms of disease control.

Figure 1(A): Annual number of new reported TB cases[16].

Figure 1(B)[16]

Figure 2: Worldwide distribution of chronic HBV infection and annual incidence of
primary hepatocellular carcinoma[20].




2. Literature review

2.1 Human genome
The human genome is a sequence of about 3 billion base pairs of deoxyribonucleic acid
(DNA), packed as 2 sets of 23 chromosomes in a single cell. Only about 3 percent
of this sequence contains the genetic code for genes that function in biological processes.
Although humans are highly complex organisms, we require fewer
than 30,000 genes to function, a number similar to that of simpler organisms.
In addition, although the genomes of any two individuals are surprisingly similar
(~99%), there are vast differences in many phenotypic aspects,
from outward appearance to internal responses to environmental assaults, such as
pathogen infections. Contributing to these differences are, interestingly, small DNA
variations. They make up the remaining small percentage of differences, which
renders each of us unique, and are important in determining our response to diseases and
our well-being.

2.2 Genes and immunity
Over the course of human evolutionary and migratory history, selection and
adaptation to environmental challenges have shaped our genome. Natural selection has
influenced our modern population’s collection of genetic variations, and may have instilled
population-specific genetic responses to environmental triggers[15, 24, 32]. Since the
beginning of our species, we have been constantly confronted with massive numbers of
microbes, and the burden of infectious diseases can be tremendous. Therefore, it is not
surprising that current studies on natural selection in humans have highlighted genes
related to host defense and immunity as among the most strongly selected genes[23, 24].
This implies that immunity and host defense in individuals today are highly influenced by
the historical experience and genetic responses of our ancestors to infections, and by the
consequent selection that shaped our current gene pool[25, 33].

2.3 Types of genetic variation
Genetic variants are primarily ancestral mutations that arose many generations ago,
were successfully passed on to offspring, and thus occur at increased frequency in
the current population. When the alleles of a variant reach a frequency of more than
1 percent in the population, the variant is termed a polymorphism. Although many of these
variants may confer functional changes to proteins, most of them act as markers for mapping
specific genomic loci of interest, and serve to aid the identification of differences
between individuals.
Restriction fragment length polymorphisms (RFLPs)
RFLPs were among the earliest types of DNA variants/polymorphisms used for
genetic studies. An RFLP is detected as an alteration in the gel electrophoresis pattern. When
there is a base change (variant form) in the DNA, the restriction endonuclease can no longer
recognize and cut the specific target sequence, and digestion hence produces fragments of
different lengths, which are identified by gel electrophoresis followed by Southern
blotting. Although a useful tool in the earlier days, RFLPs had many drawbacks: they are
relatively scarce in the genome, and the tedious and time-consuming
laboratory steps are difficult to automate.

Microsatellites
Microsatellites are composed of multiple tandem repeats of a short DNA sequence
motif, in which differences in the number of repeats distinguish
between alleles. Unlike SNPs, they have an extremely high mutation rate, giving rise to
their high variability and making them a highly informative and popular choice for
linkage studies, especially in the 1990s. Owing to their highly polymorphic nature, however,
they are not ideal for population-based association studies, which have now become the
mainstay of genetic epidemiological study design. In addition, they have the disadvantage
of being less amenable to cost-efficient high-throughput genotyping technology, and their
finite number in the genome limits the density of microsatellite-based genetic maps.
Single nucleotide polymorphism (SNP)
Single nucleotide polymorphisms (SNPs) are variants in the form of single base
substitutions in which each allele has a population frequency of at least 1%. SNPs
are the most abundant and well-studied form of genetic variation in our genome. The
Human Genome Project discovered at least 1.8 million SNPs and
attempted to map the physical positions of these SNPs to their specific genomic locations,
which were found to be widespread throughout the genome[26]. Most SNPs are located in
introns or between genes, and are therefore non-coding, while those in protein-coding
regions can be either non-synonymous or synonymous, depending on whether
an amino acid change is involved. Nevertheless, all classes of SNPs could be implicated in
a change in phenotype, since intronic and intergenic SNPs may also affect the regulation and
expression of genes. To date, more than 10 million common SNPs in humans
have been cataloged in databases, of which many were independently validated.
This is a treasure trove of
information, where the data for genetic studies can be mined and utilized. In addition,
SNPs are usually bi-allelic and relatively stable due to their low mutation rate, which makes
them easy to genotype and interpret, and thus well suited to high-throughput
technology[34]. Therefore, they are the variant of choice for the popular population-based
association studies of recent years.

2.4 Infection and host defense
Harmful microorganisms, such as pathogenic bacteria and parasitic viruses, which
we encounter constantly in our environment, cause infectious diseases. The
attack of such infectious agents typically triggers a cascade of tightly regulated
immune responses to control the infection, which often succeeds in
eradicating the pathogens. In natural infections, there are two basic lines of defense:
innate and adaptive immunity.
Although it is usually successful in controlling infections, our immune system is
complex and multi-factorial, and persistent pathogens can sometimes evade it, so that it
fails in its protective role and allows the onset of disease. In addition, the variability
in our genome could cause variation in the immune responses of different
individuals. Although many people may be exposed to the same infectious agents in the
environment, only certain individuals develop disease from the
infection[22, 35]. This is supported by heritability studies on twins and other
familial aggregation and segregation studies, which indicate a genetic component
contributing to the variable immunity between individuals[3, 4, 36, 37]. This suggests a
potential area of study in which identifying genetic variation in genes that may
influence immune responses can further our understanding of the relationship between
pathogen exposure and actual infection[2, 4, 6, 35, 38].

2.5 Human genetic traits
2.5.1 Simple monogenic traits
Mendelian traits are controlled by a single locus and show a simple Mendelian
inheritance pattern. We have been relatively successful in identifying the genetic culprits
for this kind of monogenic trait, which tend to be rare single mutations that often
produce a severe phenotype early in life, and are therefore infrequently found in the
general population[39].
However, many common diseases that most people in the general population suffer
from are far more complex. Even though common traits like diabetes may still
tend to cluster in families, they do not follow simple inheritance patterns, and
hence encompass a different spectrum of genetic architecture[40].

2.5.2 Multi-factorial common complex traits
Most common complex traits have been shown to result from the complex interplay
of multiple genes with environmental factors[41]. Even though each factor contributes
only subtly, together they can effect phenotypic changes if they accumulate sufficiently
to tip the balance. Evidence from archeology and population genetics
suggests that our current population size of more than 6 billion people is the result of a
fairly rapid expansion over just the last 100,000 years or less from a relatively small
number of ancestors who originated in Africa[42, 43]. This implies that most of our
species shares common variants, and that most of the genomic variants found in us are
circulating widely among continental populations. This might also lay the
foundation for common variants to play a role in common traits present in most
people[44-46]. Moreover, in contrast to rare Mendelian diseases, common diseases
typically result from common variants with subtle to moderate effects, possibly because
the late age of onset of the disease reduces the impact of these variants on reproductive
fitness. Hence, these variants could carry through generations more successfully, and are
commonly present in the population to contribute their part towards the trait.
In recent decades, geneticists have relied on the common disease common variant
(CDCV) hypothesis for studying common complex traits[41, 47]. This idea is well
received because it is conceptually straightforward to conduct association tests in search
of common variants. Moreover, the typically high frequency of risk alleles translates to
significant population-attributable risk estimates that have an important impact on public
health.
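The link between risk-allele frequency and population-attributable risk can be illustrated with Levin's formula, PAF = p(RR - 1) / (1 + p(RR - 1)), where p is the prevalence of the exposure and RR its relative risk. The following is a minimal sketch, not part of the study's analysis: the function name, the treatment of risk-allele carriage as a binary exposure, and the example frequencies and relative risks are all hypothetical.

```python
def population_attributable_fraction(exposure_prevalence, relative_risk):
    """Levin's formula: fraction of cases attributable to an exposure."""
    if not 0.0 <= exposure_prevalence <= 1.0:
        raise ValueError("exposure_prevalence must lie in [0, 1]")
    if relative_risk <= 0.0:
        raise ValueError("relative_risk must be positive")
    excess = exposure_prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical inputs: a common variant with a modest effect ...
paf_common = population_attributable_fraction(0.30, 1.3)
# ... versus a rare variant with a much larger effect.
paf_rare = population_attributable_fraction(0.001, 5.0)

print(f"common, modest effect: {paf_common:.3f}")
print(f"rare, large effect:    {paf_rare:.3f}")
```

Under these illustrative numbers the common variant accounts for a larger share of cases than the rare one despite its smaller relative risk, which is the point the CDCV argument relies on.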
To support research in this area, resources such as the Human Genome
Project, the International HapMap Consortium, and state-of-the-art genotyping technologies
have also been made available.




2.6 Genetic mapping for common complex traits: linkage and association methods
2.6.1 Linkage study
Genome-wide linkage analysis is the earliest gene mapping method. When the
study involves human subjects, this method is performed among genetically related
individuals within a family. Each family is designated as a single unit of analysis to
trace the transmission of genetic markers in their genomes. When a marker is
found to accompany the trait of interest more frequently than expected, it suggests that a
gene with functional effects is present close to it[48]. In Mendelian traits, linkage analysis
can trace the simple transmission pattern to a defined site of interest. However, for
common diseases with complex inheritance, the linkage method lacks both the power and
the resolution to achieve such specificity. Instead, it usually maps to a broad region,
typically tens of centimorgans in size, which often contains hundreds of genes among which
the key players are difficult to distinguish. As a result, it is still necessary to employ
extensive fine mapping through candidate gene studies for better identification. Incidentally,
the necessity of a familial design for detecting transmission also gives linkage studies a
propensity to discover highly penetrant single-gene defects of large effect size, which
seldom cover the polygenic spectrum of common complex traits. Therefore, it is not
surprising that most success stories of linkage analysis involve
Mendelian diseases, of which Huntington disease is the most celebrated since the
1980s[49]. Variants found through linkage are usually unique to the studied families. They
explain few of the cases in the general population, and are poorly replicated, if at all,
in subsequent studies[40, 50]. Likewise, this tendency to detect low-frequency
(rare) markers would require a prohibitively large number of families to meet
the required study power, which is both expensive and essentially unrealistic to achieve
for diseases with late onset.

2.6.2 Association study
Hope came from a paradigm shift to association studies, when a landmark report by
Risch and Merikangas in 1996 showed that the association method is more powerful than
linkage analysis for identifying genes of modest effect in common complex traits[47]. Even
though an association study also uses genetic markers such as SNPs, it compares
allele frequencies between response groups, such as cases versus controls,
instead of tracing transmission. The association study has since become the method of
choice, because it is designed to study unrelated individuals from the population at large,
where each sample is considered a single study unit rather than a cluster of family
members. This is important because it eases the collection of more subjects for a larger,
less expensive study, and there is no need to know the mode of inheritance of the
particular trait. Besides, many variants found through association are common in
the population, which makes it more likely that adequate power for detection is retained,
since a higher exposure (variant allele) frequency can translate to greater study power.
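The comparison of allele frequencies between cases and controls can be sketched as a chi-square test on a 2x2 table of allele counts. This is a minimal illustration using only the Python standard library, not the analysis pipeline of this thesis; the function name and the allele counts are hypothetical.

```python
import math


def allelic_association(case_minor, case_major, control_minor, control_major):
    """Pearson chi-square (1 df), p-value and odds ratio for a 2x2 allele table."""
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (a * d) / (b * c) if b * c else math.inf
    return chi2, p_value, odds_ratio


# Hypothetical counts: 1,000 cases and 1,000 controls (2,000 alleles each),
# with minor allele frequencies of 0.30 in cases and 0.25 in controls.
chi2, p_value, odds_ratio = allelic_association(600, 1400, 500, 1500)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2g}, OR = {odds_ratio:.2f}")
```

The test treats each allele as an independent observation, which assumes Hardy-Weinberg proportions; genotype-based tests such as the trend test relax that assumption.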
For decades, the community has advocated the need for large-scale association
analysis, which in essence entails “a large number of polymorphisms on a reasonably
large sample size” for studying the genetics of common complex diseases[47]. This has
become achievable only in recent years, with the availability of high-throughput genotyping
technologies for genome-wide association studies.

2.6.2.1 Direct association
Ideally, direct genotyping of the target polymorphism, which is possibly the causal
variant, is the easiest to analyze and the most powerful for testing association with the
trait of interest. In general, a non-synonymous SNP in the coding region of a gene is a
promising candidate, because an obvious functional role can readily be predicted from
databases[51]. However, due to evolutionary pressure, most of the functional coding sequences
of genes are highly conserved, and variants in them occur at low frequency[52]. In addition,
many of the causal variants in common complex diseases may be located in non-coding
regions of genes. These variants could control regulatory elements and
affect gene expression, and should therefore have important functional roles as well[51].
Nevertheless, current knowledge is still insufficient to classify variants that have key
regulatory roles. Thus, the difficulty of identifying candidate polymorphisms limits our
opportunities to perform direct associations.
2.6.2.2 Indirect association
Frequently, the identity of the causal variant in the target region is unknown, and
a surrogate marker in close proximity to it is used for indirect
association. This is possible because of linkage disequilibrium (LD)
between tightly linked loci, such that information captured at one locus is shared among
them[53, 54].
Concept of linkage disequilibrium
The human genome is inherited in a distinctive block-like pattern that consists of
ancestrally conserved chromosome segments called haplotype blocks, which are
bounded by recombination hot spots[55, 56]. Variants in a
haplotype block tend to be inherited together and are correlated. This phenomenon of
non-random association of alleles between tightly linked loci in a population is called
linkage disequilibrium (LD). Recombination and mutation lead
to a breakdown of LD. D’ and r² are the two basic measures of the extent of
pairwise LD between two loci. Both are derived from the disequilibrium coefficient D, the
difference between the observed haplotype frequency and the product of the corresponding
allele frequencies. D’ is D scaled by its maximum possible value given the allele
frequencies, and reflects the history of recombination between a pair
of variants[57]: a value of 1 indicates absence of recombination, whereas values
less than 1 indicate recombination that has separated the two loci. On the other
hand, r² is determined by dividing the square of D by the product of the four allele
frequencies, which is the square of the correlation coefficient between the two loci; its
value is therefore related to the amount of information provided by one locus about the other.
When r² = 1, the two loci provide identical information: they not only have a D’ value of
1, they also have equal allele frequencies. Together, these measures
describe the level of correlation and redundancy between linked loci.
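The two LD measures can be made concrete with a short calculation from haplotype and allele frequencies. The sketch below uses the standard definitions (D as the observed haplotype frequency minus the product of allele frequencies, D’ as |D| divided by its maximum, r² as D² divided by the product of the four allele frequencies); the function name and the example frequencies are hypothetical.

```python
def pairwise_ld(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two bi-allelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A (locus 1) and allele B (locus 2)
    """
    for p in (p_a, p_b):
        if not 0.0 < p < 1.0:
            raise ValueError("both loci must be polymorphic (0 < p < 1)")
    d = p_ab - p_a * p_b
    # The largest |D| that the allele frequencies allow depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Equal allele frequencies and no recombinant haplotypes.
_, d_prime_equal, r2_equal = pairwise_ld(p_ab=0.3, p_a=0.3, p_b=0.3)
# Unequal allele frequencies and no recombinant haplotypes.
_, d_prime_unequal, r2_unequal = pairwise_ld(p_ab=0.1, p_a=0.5, p_b=0.1)

print(f"equal frequencies:   D' = {d_prime_equal:.2f}, r2 = {r2_equal:.2f}")
print(f"unequal frequencies: D' = {d_prime_unequal:.2f}, r2 = {r2_unequal:.2f}")
```

The second example shows why the two measures are not interchangeable: D’ stays at 1 while r² drops well below 1 because the allele frequencies at the two loci differ.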
The concept of linkage disequilibrium is important for the design and analysis of
genetic disease mapping. A variant found to be associated with a phenotype or disease
state may not necessarily contribute directly to the disease process, but may merely act as a
marker in linkage disequilibrium with a neighboring causal variant that is
usually unknown. Since variants in LD are correlated and provide essentially
the same information, it is not essential to study all variants in an LD
block. This saves cost, as it only requires genotyping the tagging
SNPs that best capture the information for the region instead of every
variant[58, 59]. By the same principle, a large main sample set in which only a subset of
variants has been genotyped can be extended by imputation to the full
set of variants collected on a small reference population, as long as the two share similar
LD patterns[60]. This also potentially helps reduce the genotyping burden.
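The idea of tagging can be illustrated with a simple greedy selection over a matrix of pairwise r² values. This is only a sketch of one possible strategy, not the algorithm used by any particular tagging tool or by this study; the function name, the 0.8 threshold default, and the toy r² matrix are assumptions.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Pick tag SNPs so every SNP is a tag or has r^2 >= threshold with a tag.

    r2 is a symmetric matrix (list of lists) of pairwise r^2 values.
    Returns the indices of the chosen tag SNPs in the order selected.
    """
    untagged = set(range(len(r2)))
    tags = []
    while untagged:
        def coverage(i):
            # Number of still-untagged SNPs that SNP i would capture.
            return sum(1 for j in untagged if j == i or r2[i][j] >= threshold)

        # Ties are broken in favour of the lowest index.
        best = max(sorted(untagged), key=coverage)
        tags.append(best)
        untagged -= {j for j in untagged if j == best or r2[best][j] >= threshold}
    return tags


# Hypothetical region: SNPs 0-2 form one LD block, SNPs 3-4 another.
toy_r2 = [
    [1.00, 0.90, 0.90, 0.10, 0.10],
    [0.90, 1.00, 0.90, 0.10, 0.10],
    [0.90, 0.90, 1.00, 0.10, 0.10],
    [0.10, 0.10, 0.10, 1.00, 0.85],
    [0.10, 0.10, 0.10, 0.85, 1.00],
]
tags = greedy_tag_snps(toy_r2)
print(tags)
```

In this toy region two tag SNPs stand in for five, one per block, which is the genotyping saving the paragraph above describes.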
However, studies have found that linkage disequilibrium patterns vary between
different ethnic populations. As each population may have experienced a unique history of
recombination, which could result in divergent haplotype sizes and allele frequencies,
the degree of correlation and redundancy among SNPs might vary
between distant populations. In recognition of this, resources such as the International
HapMap Consortium have cataloged the genome-wide LD patterns of a number of
reference populations to guide researchers in planning their genotyping resources. This
is especially relevant given the growing importance of association studies in gene mapping
and in understanding disease mechanisms[27-29].

2.7 Hypothesis driven candidate gene approach
There has been substantial interest in association studies since the approach was
established as a more powerful method for studying complex traits[47]. In the earlier days,
when genotyping was low in throughput and expensive, genetic association studies were
affordable only for manageable candidate gene regions that usually had biological merit. The
candidate gene method is solely hypothesis driven, and hence relies on published studies of
the trait of interest for suggestions[51]. Even though immunological studies have been ongoing
for decades, current knowledge of the full complexity of the immune system in an
infected human host is still limited. Besides candidate regions from previous linkage
studies, many hypotheses for infectious diseases are generated indirectly from animal
models and from related infections that may share etiologies[2, 61-63].
In addition to the criticism of its limited scope, care must also be taken in the
design of a candidate gene study and in the interpretation of its associations. Most
population studies are likely to be confounded by population sub-structure, which can lead
to spurious associations unrelated to the trait of interest[64]. This is a more critical
problem for the candidate gene study, as its generally low variant density
makes it less capable of detecting population stratification. Hence, even
though it adds to the genotyping burden, additional genotyping of carefully selected
ancestry informative markers (AIMs) may be necessary for this purpose. In
addition, if ancestry differences are indeed present between response groups, the
association test statistics may require further genomic correction[64].
The customizable platforms currently available for targeted genotyping in candidate
gene studies are small- and medium-scale technologies. Small-scale technologies include the
single-plex Applied Biosystems TaqMan® assay (Life Technologies Corporation, CA,
USA) and the Sequenom iPlex® Gold assay (Sequenom Inc., CA, USA), which can
multiplex up to 40 SNPs per sample. For medium-level multiplexing, Illumina’s
GoldenGate assay (Illumina Inc., CA, USA) allows genotyping of up to 1,536
SNPs per sample. Newer options that allow multiplexing of up to tens of
thousands of SNPs are Illumina’s Infinium iSelect custom panel (Illumina Inc.,
CA, USA) and Affymetrix’s GeneChip® custom SNP kit (Affymetrix Inc., CA, USA).

2.8 Hypothesis generating genome wide association scan (GWAS)
Our imperfect understanding of the biological mechanisms of most complex traits
limits our ability to generate hypotheses for candidate gene associations. Many
genes that have yet to be characterized may harbor unexpected but biologically important
functions, which the current literature has not yet associated with the trait of
interest. To expand our knowledge, we need a hypothesis-free yet extensive search
throughout the genome to identify these novel genes. Acknowledging the greater power
afforded by association studies for common complex traits, many genetic epidemiologists
are motivated to conduct comprehensive genome-wide association scans (GWAS) for genes
related to common traits in human populations[47]. Even though the cost of
genotyping has fallen significantly over the years, it is still too expensive to query all
10 million common SNPs in the human genome. Fortunately, our genome
is structured in haplotype blocks, where variants in LD tend to be inherited together.
This allows us to conduct indirect association through marker tagging, which can
theoretically capture the required information for all variants in the same block or region
[27-29, 56]. This shortcut strategy of marker tagging to reduce genotyping and cost has
been built into microarray genechips, commercialized mainly as the
Affymetrix GeneChip® (Affymetrix Inc., CA, USA) and Illumina’s Infinium
technologies (Illumina Inc., CA, USA). Depending on their generation, these microarrays
have SNP densities that range from 10,000 to a million common SNPs per array, so
that it is now technologically possible to query the entire genome for genes implicated in
the trait of interest[65-68]. However, different commercially available fixed-content
platforms have their own combinations of SNPs tiled on the array, which provide
variable degrees of tagging efficiency in a GWAS[69, 70]. Nevertheless, analyses
of HapMap phase II data revealed that just slightly more than
500,000 SNPs are required to capture all known common SNPs in phase II of HapMap with an
r² of ≥ 0.8 in Asian and Caucasian populations, while African populations need about twice
as many, more than a million SNPs[27].
2.8.1 Multiple-stage design
Most of these genome-wide genotyping products follow the
“common disease common variant” (CDCV) hypothesis in their SNP selection. The
effect size of these common variants tends to be modest, with a typical odds ratio of less
than 2[71]. To have adequate power for detecting associations of this magnitude, it is
essential to have a large number of samples, preferably in the range of
thousands[54]. Even though SNP microarrays have increased throughput and made
high-density genotyping more cost efficient, the price of an array is still fairly
high. When the first-generation microarray was introduced to the market, it
sold for about a thousand dollars each. As older products are superseded by newer chips of
ever-increasing SNP density, their cost has gone
down significantly over the years. Even so, if the entire large sample set is genotyped with
the array, the initial screening alone will still cost millions. This is seldom affordable
for most laboratories, and is also unnecessary if a staged sampling approach is
employed.
The multiple-stage study design is widely adopted for GWAS, with the main aim
of reducing genotyping burden and cost while still maintaining good study power[72, 73].
To achieve this, an affordable fraction of the samples is genotyped with the microarray for
an initial screen of the entire SNP collection. After association analysis, a liberal
significance threshold (alpha) is used to identify a subset of putatively associated SNPs for
validation in the remaining independent samples. It is assumed that SNPs in the first stage
that do not pass even this liberal threshold are unlikely to reach the more stringent
significance threshold of the whole study, and can therefore be safely discarded to
minimize genotyping. On the other hand, the promising variants that are carried into the
subsequent stage are tested on all samples, so the good power of the whole sample set
for detecting associations is retained. Furthermore, any spurious
associations that arise in the initial stage can be checked in
the later stages.
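The saving from a staged design can be illustrated by counting genotypes. The sketch below is a rough back-of-the-envelope calculation under stated assumptions, not a design tool: it assumes that, under the null hypothesis, a fraction of SNPs equal to the stage 1 alpha passes the first stage, it ignores true associations and per-genotype price differences between platforms, and all names and numbers are hypothetical.

```python
def two_stage_genotype_count(n_samples, n_snps, stage1_fraction, stage1_alpha):
    """Compare genotype counts for a one-stage and a two-stage GWAS design.

    Returns (one_stage, two_stage, snps_carried), where snps_carried is the
    number of SNPs expected to pass stage 1 by chance alone.
    """
    n_stage1 = round(n_samples * stage1_fraction)
    n_stage2 = n_samples - n_stage1
    snps_carried = round(n_snps * stage1_alpha)
    one_stage = n_samples * n_snps
    # Stage 1: all SNPs on a subset of samples; stage 2: carried SNPs on the rest.
    two_stage = n_stage1 * n_snps + n_stage2 * snps_carried
    return one_stage, two_stage, snps_carried


# Hypothetical study: 2,000 samples, a 500,000-SNP array,
# 30% of samples in stage 1 and a liberal stage 1 alpha of 0.01.
one_stage, two_stage, snps_carried = two_stage_genotype_count(
    n_samples=2000, n_snps=500_000, stage1_fraction=0.3, stage1_alpha=0.01
)
print(f"SNPs carried to stage 2: {snps_carried}")
print(f"one-stage genotypes: {one_stage:,}")
print(f"two-stage genotypes: {two_stage:,} ({two_stage / one_stage:.1%} of one-stage)")
```

With these illustrative inputs the two-stage design needs under a third of the genotypes of a one-stage design, while the carried SNPs are still tested across all samples.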

3. Objectives
The main objective of my thesis project is to identify genetic variation that
influences susceptibility to infectious disease. This was conducted in two studies to
specifically address the following objectives:

I. To understand host genetic susceptibility to the natural infection of
pulmonary tuberculosis

II. To identify genetic determinants influencing hepatitis B vaccine antibody
titer


4. Study 1
Case-control study on pulmonary tuberculosis susceptibility
4.1 Background
With evidence dating as far back as 9,000 years ago, TB is a disease with a long
history of causing much suffering, and it remains one of the top infectious causes
of mortality[74]. It is primarily an infection of the lungs that is spread by inhaling
droplets from the coughing, sneezing or even speech of an ill individual. TB causes close to
10 million new cases and about 2 million deaths annually[16, 75]. Among the
countries with a high TB burden, Indonesia ranks third. The Southeast Asian
region, in which Indonesia is located, has been estimated to have the largest number of
new cases, accounting for 35% of incident cases globally. Hence, it is necessary for us
to study TB susceptibility in this endemic population.
Although Mycobacterium tuberculosis (M. tuberculosis) has infected around a third
of the world’s population, only 3-10% of those infected develop active disease during
their lifetime[76]. More than 90% of infected individuals remain asymptomatic with a
latent infection. This indicates that host immune / defense pathways are often highly
effective in controlling this disease. Because the infection causes such a burden of disease
in those unable to contain the infection, it is therefore important to discover underlying
mechanisms to aid the development of more effective interventions such as better
vaccines and novel treatments for latent and active infection. Similarly, it is important to
identify predictive biomarkers that might identify individuals who are most susceptible to
developing active TB disease.
Heritability estimates from twin studies and other familial aggregation and segregation
studies have convincingly implicated a notable genetic component contributing to this
outcome[3, 36, 77-79]. Consequently, we chose to use genetic mapping to
discover genes relevant to pulmonary TB susceptibility. For this kind of common
disease, the genetic contribution is frequently attributed to multiple genetic factors,
which individually produce modest effects and are commonly present in the population[41,
46]. In this case, a population-based association study that compares allele frequency
differences between disease and non-disease groups is known to be a more powerful method
of mapping relevant gene regions, which may in turn help in understanding the
mechanisms of TB[47].

4.1.1 Subjects
Main study population
Indonesia cohort
Indonesian TB patients and controls were enrolled from the cities of Jakarta and
Bandung on the island of Java, Indonesia, using a uniform enrollment protocol for all
subjects[80]. We recruited 799 TB patients (mean age 32, range 14-75, 55.8% male, see
Table 1) who had been diagnosed by the local health care service using information about
clinical symptoms, chest X-rays, and sputum smear. For all cases in this study, diagnosis
was further confirmed by sputum culture of M. tuberculosis. Clinical information, as well
as the patients’ age, ethnicity, socio-economic status, and concurrent medical history,
was recorded in structured questionnaires. Patients with extra-pulmonary TB or diabetes
mellitus (fasting blood glucose > 126 mg/dL), and HIV-positive subjects, were excluded
from the genetic study[81, 82]. For controls, we recruited 746 (mean age 33, range 15-70,
52.5% male) sex- and age-matched (+/- 10 years) subjects with no history of TB and
no evidence of TB-related infiltrates in chest X-rays, who were living in the
same or neighboring households as the enrolled cases, but were not known to be
genetically related to them. Self and parental ethnicities recorded during recruitment were
used to identify subjects of Javanese origin from three groups (the Jawa, Betawi, and
Sunda), which together comprised more than 80% of the total sample. Individuals in the
non-Javanese category have both parents from other Indonesian islands, whereas
subjects with one parent of non-Javanese origin were considered to have mixed
parentage (Table 1). Using this information, we made an effort to avoid spurious
associations arising from population stratification by excluding subjects whose self-reported
ethnicity was of non-Indonesian origin from the genetic studies.
We also confirmed in a previous study that the cohort does not have a
significant population stratification problem[83]. In summary, for that
population stratification analysis, 299 SNPs were carefully chosen from genomic
locations more than 10 kb away from any known gene; these SNPs
were also required to be in linkage equilibrium with one another and to have average minor
allele frequencies around 30%, to act as a set of ancestry informative markers (AIMs). We
genotyped these AIMs in a randomly selected subset of 330 cases and 368 controls from
this cohort. One of the SNPs was out of Hardy-Weinberg equilibrium (HWE) and was thus
excluded from the analysis. We used the Devlin and Roeder method to calculate the lambda
inflation factor[84], taken as the median of the chi-square values for all 298 SNPs divided
by 0.456 (that is, 0.675 squared). We arrived at a value close to 1 (0.82), which indicated
that there is no significant population stratification in this Indonesian group[83].
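The inflation factor calculation can be sketched in a few lines. This is a minimal illustration of the genomic control idea rather than the code used in the cited analysis: it takes the median of 1-df chi-square statistics and divides by 0.456, the approximate median of that distribution under the null (0.675 squared is about 0.456, so dividing the median absolute Z statistic by 0.675 and squaring is equivalent). The function names and the toy statistics are hypothetical.

```python
import statistics

# Approximate median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.456


def genomic_inflation_factor(chi2_values):
    """Lambda: observed median chi-square over its expected value under the null."""
    values = list(chi2_values)
    if not values:
        raise ValueError("at least one test statistic is required")
    return statistics.median(values) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_values, inflation):
    """Deflate statistics by lambda; values of lambda below 1 are treated as 1."""
    scale = max(inflation, 1.0)
    return [value / scale for value in chi2_values]


# Hypothetical chi-square statistics for seven null markers.
toy_chi2 = [0.02, 0.10, 0.30, 0.456, 0.90, 2.00, 3.80]
toy_lambda = genomic_inflation_factor(toy_chi2)
print(f"lambda = {toy_lambda:.2f}")
```

A lambda at or below 1, as reported for this cohort, leaves the association statistics unchanged under this convention; only values above 1 lead to deflation.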
This protocol was reviewed and approved by the relevant institutional review
boards in Indonesia and the Netherlands, and written informed consent was voluntarily
signed by all patients and control subjects.








Table 1: Demographic data of the main study population

Indonesian                          TB patients (N=799)   Controls (N=746)
Age in years, range (mean)          14-75 (32)            15-70 (33)
Gender, male : female (%)           55.8% : 44.2%         52.5% : 47.5%
Self-reported ethnicity, n (%)
  Javanese                          675 (84.48%)          617 (82.71%)
  Mixed (either parent Javanese)    26 (3.25%)            43 (5.76%)
  Non-Javanese                      59 (7.38%)            32 (4.29%)
  Unknown                           39 (4.88%)            54 (7.24%)

Other study populations that were used partly in the study
Russian cohort
Russian TB patients and controls were collected in two cities, St. Petersburg
(1,528 patients and 1,609 controls) and Samara (384 patients and 495 controls),
according to the same protocol[85]. The majority of residents in St. Petersburg are
Russian (93%), and the remaining minority ethnic groups consist of Ukrainians (1.5%),
Tatars (0.7%), and Armenians (0.5%), among others. Similarly, in Samara the majority are
Russian (85.6%), and the remaining minority groups consist of Tatars (4.1%),
Ukrainians (1.4%), and Armenians (0.7%), among others. However, based on structured
questionnaires, we recruited only subjects who reported Russian ethnicity.
In total, 1,912 TB patients (mean age 43.8, range 17-86, 73.8% male) were
diagnosed by the local health care service based on clinical symptoms and evidence from
X-rays and sputum smear. For confirmation of diagnosis, sputum cultures of M.
