Human-blind probes and primers for dengue virus
identification
Exhaustive analysis of subsequences present in the human and
83 dengue genome sequences
Catherine Putonti1, Sergei Chumakov2, Rahul Mitra3, George E. Fox4, Richard C. Willson4,5
and Yuriy Fofanov1,4
1 Department of Computer Science, University of Houston, Houston, TX, USA
2 Department of Physics, University of Guadalajara, Guadalajara, Jalisco, Mexico
3 Genomics USA, Houston, TX, USA
4 Department of Biology and Biochemistry, University of Houston, Houston, TX, USA
5 Department of Chemical Engineering, University of Houston, Houston, TX, USA

Keywords
dengue; diagnostic assay; flavivirus;
microarray; pathogen identification
Correspondence
C. Putonti, University of Houston, 218 PGH,
Houston, TX 77204–3058, USA
Fax: +1 713 7431250
Tel: +1 713 7433992
E-mail:
(Received 5 September 2005, revised 22
November 2005, accepted 23 November
2005)
doi:10.1111/j.1742-4658.2005.05074.x



Reliable detection and identification of pathogens in complex biological
samples, in the presence of contaminating DNA from a variety of sources,
is an important and challenging diagnostic problem for the development of
field tests. The problem is compounded by the difficulty of finding a single,
unique genomic sequence that is present simultaneously in all genomes of a
species of closely related pathogens and absent in the genomes of the host
or the organisms that contribute to the sample background. Here we describe ‘host-blind probe design’ – a novel strategy of designing probes based
on highly frequent genomic signatures found in the pathogen genomes of
interest but absent from the host genome. Upon hybridization, an array of
such informative probes will produce a unique pattern that is a genetic fingerprint for each pathogen strain. This multiprobe approach was applied
to 83 dengue virus genome sequences, available in public databases, to
design and perform in silico microarray experiments. The resulting patterns
allow one to unequivocally distinguish the four major serotypes, and within
each serotype to identify the most similar strain among those that have
been completely sequenced. In an environment where dengue is indigenous,
this would allow investigators to determine if a particular isolate belongs
to an ongoing outbreak or is a previously circulating version. Using our
probe set, the probability that misdiagnosis at the serotype level would
occur is approximately $1 : 10^{150}$.

Members of the Flavivirus genus are responsible for a
number of diseases, including yellow fever, West Nile,
St Louis encephalitis, and dengue fever. One or
more of the four serotypes of the dengue virus are
endemic in many parts of the world, including all of
south-east Asia, parts of Africa, and Southern and
Central America. The Aedes aegypti mosquito, which
prefers to feed on humans, is a carrier of the dengue
virus and is commonly found on the US Gulf Coast
according to the CDC (Centers for Disease Control
and Prevention) (…dengue/index.htm). Although the USA has had relatively few reported cases of dengue, epidemics have
occurred in northern Mexico and hence dengue is a
growing concern for bordering states.

FEBS Journal 273 (2006) 398–408 © 2005 The Authors. Journal compilation © 2005 FEBS

As no vaccine or treatments are available for dengue, early detection of the viral infection is critical to
avoid a potential epidemic. Dengue diagnosis has historically relied on either (a) isolation and growth of
the virus in cell cultures in vitro or (b) serological tests.
The former, while able to provide a more definitive
diagnosis, is time consuming and ill-suited for use in
the field. Thus, serology has emerged as the primary
method for dengue diagnosis. Serological tests are easy
to use and able to accommodate a great number of
samples, both necessities when confronting an epidemic. These benefits, however, come at a cost; tests
such as hemagglutination inhibition, IgG-ELISA and
MAC-ELISA cannot easily distinguish dengue at the
serotype level and are likely to misidentify other flaviviruses as dengue [1,2]. Recently, specific tests have
been developed for dengue identification using nucleic
acid-based technologies [3] such as the PCR [4–14] and
nucleic acid sequence-based amplification (NASBA)
assays [15,16], and microarrays of cDNA [17] and
oligonucleotides [18]. These methods are both quick
and easy to use, while offering reliable serotype-specific
detection. The probability of false positives, however,
still remains a concern [4,15,16]. Regardless of which
technology is used, identification is typically based on
the presence of one or a few unique subsequences
[15,16,19] as indicators of the target of interest.
Several inherent problems exist in basing detection
and/or identification on recognition of unique
sequences. First, to select a candidate one must know
the pathogen’s genomic sequence. Moreover, even if
appropriate unique sequences can be found for the
entire group, they will not be able to distinguish the
various subgroups of the target organism. This would
require unique sequences for every subgroup of interest. However, an important observation was made previously, by McGill et al. [20], in that sequences need
not be universally present in a group of interest or
always absent from other groups to be informative
about phylogenetic relationships. Recently it was
shown that large numbers of such ‘characteristic’
sequences exist in the 16S ribosomal RNA [21]. Hence,
an alternative approach [21] is to rely on multiple
sequences that may individually not be uniquely or
universally found in any particular grouping, but
which are highly characteristic of particular groups.
Recognition is then based on a set of such characteristic sequences that together form a signature [19] for a
particular organism or grouping. In either approach,
analysis is further complicated because viruses are obligate intracellular parasites; they are found in conjunction with host cells whose DNA might contain
sequences that would interfere with the test. As separation of viral from host nucleic acids is quite difficult,
it is important that the sequences used for virus detection are absent from any potentially contaminating
DNA.
We have recently developed a set of novel algorithms that make it possible to efficiently calculate
the frequency of all subsequences (n-mers) of length
5–25+ nucleotides in any sequenced genome within no
more than a few hours, depending on the genome size.
This allows exclusion of all subsequences that are
present in a selected host/background genome (e.g.
human) in the PCR primer/microarray probe design
step, which greatly increases speed, predictability
and effectiveness compared with current design methods. The microarray format is particularly attractive as
it permits testing for multiple pathogens simultaneously (e.g. the set of viral pathogens causing similar
symptoms in hosts or those rampant in the same
regions in which the infection has occurred). We refer
to the sequences that are present in the genome of
interest and absent from the host genome as being
‘host-blind’ (human-blind, mosquito-blind, mouse-blind, rat-blind, etc.) sequences. The greater the number of changes necessary to ‘convert’ such a host-blind
sequence to a sequence found in the host genome, the
less likely the host-blind sequence, when used as a
PCR primer and/or microarray probe, is to mispair
with the host’s genomic sequence. Thus, our algorithms can also exclude, in the design step, all host-blind sequences one, two, three, etc. changes away
from the nearest host sequence. We refer to such
sequences as being host-blind and one [two, three, etc.]
change away from the nearest host sequence (Fig. 1).
This new approach can readily be extended to develop
assays that are insensitive to the background of a host,
such as a food (animal or plant) species, pathogen vector, or any other environmental background for which
genomic sequence information is available. By using
sequences three or four changes away, we can reduce
and possibly eliminate false positives in the presence of
background contaminating genomes.
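
The 'changes away' criterion described above can be illustrated with a minimal Python sketch. It assumes that a 'change' is a single-base substitution (as Fig. 1 suggests) and that the host n-mers fit in memory as a set. The function names and the toy host set are hypothetical, and the brute-force scan is for illustration only; it is not the authors' algorithm, which the text does not detail.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_changes_to_host(candidate, host_nmers):
    """Smallest number of substitutions that turns `candidate` into any host n-mer."""
    return min(hamming(candidate, h) for h in host_nmers)


def is_host_blind(candidate, host_nmers, min_changes=1):
    """True if `candidate` is at least `min_changes` substitutions from every host n-mer."""
    return min_changes_to_host(candidate, host_nmers) >= min_changes


# Toy host n-mer set (hypothetical 8-mers, not real human sequence).
host = {"ACGTACGT", "TTTTCCCC"}

one_away = "ACGTACGA"   # differs from ACGTACGT at the last base only
two_away = "ACGAACGA"   # differs from ACGTACGT at two positions
```

With this toy set, `one_away` would pass a one-change filter but fail a two-change filter, which is the kind of sequence the stricter design step is meant to exclude.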


Fig. 1. Sensitivity of host-blind sequences. The pathogen sequences are examples of host-blind sequences that are one possible
change (left) and two possible changes (right) away from the nearest human host sequence (above).


In the work presented here, 83 complete dengue
virus genomes (representative of all four serotypes)
were analyzed in conjunction with the available draft
sequence of the human genome in order to find all
potential probes/primers that could be used to detect
or identify this pathogen in a human-derived sample.
The analysis was conducted for all n-mers up to 22
nucleotides long. Our analysis focuses on those n-mers
present in dengue and human-blind for all possible
changes of one, two, three or four nucleotides. Several
hundred human-blind sequences were identified, including those that were (a) present in each individual viral
strain’s genome, (b) present in all 83 dengue strains
regardless of their serotype, (c) unique to each serotype
of the virus (present in all strains of the serotype), and
(d) unique to each individual viral strain’s genome
(present in the strain and absent from all other
strains).
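
The categories (b) to (d) listed above can be expressed as set operations on the n-mers of each genome. The following Python sketch is one hedged reading of those definitions: the function names, the strain labels and the short toy sequences are hypothetical, the human-blind filter is omitted, and 'unique to a serotype' is assumed to mean present in every strain of that serotype and absent from every strain of the others.

```python
def nmers(seq, n):
    """Set of all subsequences of length n in seq (one strand only)."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def classify_nmers(genomes, serotype_of, n):
    """Return (shared_by_all, serotype_specific, strain_unique) n-mer sets."""
    sets = {name: nmers(seq, n) for name, seq in genomes.items()}

    def union_of(names):
        return set().union(*(sets[x] for x in names))

    # (b) present in every genome, regardless of serotype
    shared_by_all = set.intersection(*sets.values())

    # (d) present in one strain and absent from all other strains
    strain_unique = {
        name: sets[name] - union_of(x for x in sets if x != name)
        for name in sets
    }

    # (c) present in all strains of a serotype, absent from the other serotypes
    serotype_specific = {}
    for sero in set(serotype_of.values()):
        members = [x for x in sets if serotype_of[x] == sero]
        others = [x for x in sets if serotype_of[x] != sero]
        in_all_members = set.intersection(*(sets[x] for x in members))
        serotype_specific[sero] = in_all_members - union_of(others)

    return shared_by_all, serotype_specific, strain_unique


# Toy data (hypothetical 6-base "genomes"), n = 3.
toy_genomes = {"s1a": "AAACGT", "s1b": "AAACGA", "s2a": "CCACGT"}
toy_serotypes = {"s1a": "DENV-1", "s1b": "DENV-1", "s2a": "DENV-2"}
shared, by_serotype, by_strain = classify_nmers(toy_genomes, toy_serotypes, 3)
```

In the toy data only `ACG` occurs in all three sequences, and strain `s1a` has no n-mer of its own, which mirrors the situation where no single sequence can identify a strain.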

The results demonstrate that any method of identification based solely on hybridization with a particular
unique sequence or a small set (typically fewer than six)
of sequences, as used in the existing tests for dengue
diagnosis, would not be able to reliably accommodate
potential mispriming. To minimize the probability of
misdiagnosis, sequences that require three, four or
more bases to be altered for mispriming to occur are
considered ideal for identification purposes. A multiple-probe
approach based on characteristic sequences was therefore
developed for the detection and identification of any
dengue virus strain in the presence of human DNA. A sample
probe set that could be used in a
microarray format was developed and tested by
in silico hybridization. This probe set was designed to
contain the minimal number of probes necessary to
detect and identify dengue at the strain level and
to unequivocally distinguish between the four
major serotypes.

Results
Human n-mers
In order to generate the set of all probes/primers that
are present in the dengue virus genome(s) and absent
in (and distant from) the human genome, analysis of
the human genome was first necessary. An analytical
model was designed to provide us with an estimate of
the absence of subsequences from the complete human
genome given any one, two, three, or four changes. In
addition, calculations were performed for the complete
human genomic sequence for n < 18. (The results are
included in the Supplementary material.) For n-mers
of size less than 15 nucleotides, sequences present in
the human genome lie within two changes of all possible sequences of 14 nucleotides. It is only when n =
16 that the human genome does not include a sequence
within three changes of any selected n-mer. Therefore,
in our search for human-blind n-mers, only values of n
for which some sequences are actually absent from the
human genome should be considered. Furthermore, it
is important that the number of sequences absent from
the human genome is large enough that there is
a reasonable probability that some will occur within
the much smaller viral genome. For large values of n,
however, specificity becomes a greater concern. Thus,
calculations and analysis were confined to n-mers of
size 16–22.

Human-blind sequences present in each
individual viral genome
For each of the 83 dengue virus genomes, calculations
identified each n-mer, for 16 ≤ n ≤ 22, that is at least
one, two, three, or four changes away from the
nearest human sequence (Fig. 2). The results
are provided in the Supplementary data. The presence
or absence of each n-mer was calculated, rather than
the frequency of occurrence. There were no 16-mers
three changes away from the nearest human sequence
in any of the viral genomes and only a single 17-mer
(and its complementary 17-mer), which was found in
just two of the dengue strains. It is not until 19-mers
were considered that all of the dengue genomes were
found to have some human-blind sequences at least
three changes away from the nearest human sequence.
The sequences four changes away are ideal candidates
for use in recognizing dengue because it is unlikely
that mispriming will occur and a false positive will
be reported. Each of the 83 strains had sequences
at least four changes away when considering n-mers
for n ≥ 21.

Fig. 2. Number of unique human-blind sequences found in each of
the 83 complete dengue virus strain genomes considering different
sizes of n and number of changes away from the nearest human
sequence. Also listed is the average number of 22-mers present in
an individual genome and absent from the human sequence given
any one, two, three or four changes. The ideal set of probes would
be 22-mers that are four changes away; 16-, 17- and 18-mers with
one change away will lead to false-positive results because the
mismatches could be tolerated in the hybridization between the
host target and dengue probes.

Sequences present in all 83 dengue genomes
regardless of serotype
Prior to our calculations, we hypothesized that there
would be some human-blind n-mers that are present in
all 83 dengue genomes. Such sequences could then
serve as a reliable indicator of the presence of the virus
in a complex sample (e.g. an infected individual).
Using all 83 genomic sequences, the number of such
n-mers was calculated (Table 1). The number of unique
sequences was quite small. There appear to be several
reasons for this. First, as n increases, the number of
common n-mers decreases. Second, because the
number of human-blind sequences in general is smaller
for small n-mer sizes (no human-blind n-mers for
n < 11 and no human-blind n-mers two changes away
from the nearest human sequence for n ≤ 15), the number of human-blind sequences decreases rapidly. It is
also obvious that by requiring characteristic sequences
be at least two, three, etc., changes away from the
nearest human sequence, one dramatically reduces the
number of available sequences (Table 1). Such
sequences are ideal primers/probes for identification of
dengue because of their decreased probability of a false
positive, yet there are no n-mers present in all of the
83 dengue sequences and absent from human given
any 3+ changes. There is only one 18-mer (and its
complement) two changes away from the nearest
human sequence and shared by all 83 dengue genomes.
This 18-mer could be used to detect the presence of
dengue in a human sample; however, it is possible, and
even likely, that this sequence could mispair to the
host sequence or related flavivirus genomes. Our
results lead us to the conclusion that there are no
human-blind sequences common to all 83 dengue
strains that are at least three changes away from the
nearest human sequence.

Table 1. The number of n-mers present simultaneously in all 83
dengue genomes. The first row does not consider whether the sequences
are absent from the human genome, only that they are present in
all of the dengue genomes.

n                                                        16  17  18  19  20  21  22
Present in all dengue; absence in human not considered   20  14   8   4   2   0   0
Human-blind one change away                               8  12   8   2   2   0   0
Human-blind two changes away                              0   0   2   0   0   0   0
Human-blind three changes away                            0   0   0   0   0   0   0
Human-blind four changes away                             0   0   0   0   0   0   0

Sequences unique for serotypes 1 and 2
We also calculated the number of sequences unique to
the dengue type 1 (DENV-1) and DENV-2 serotypes,
as these types comprise the great majority of the 83
genomes considered (Table 2). It is likely that when a
more extensive sample of DENV-3 and DENV-4
genomes becomes available the results will be similar. While there are far more
human-blind n-mers shared within each group than across all 83 genomes, the number of
common n-mers decreases as the sequence length and stringency increase. In the case of DENV-2,
there are no n-mers four changes away from the nearest human sequence shared amongst all 46 virus
genomes. Further analysis of all serotype-specific
sequences is required to verify that they are unique
with respect to other flavivirus genomes as well. Selecting host-blind primers/probes that are unique to the
serotype and host-blind with the most changes possible
from the nearest human sequence would ensure a more
reliable method of detection with a lower false-positive
rate than the currently available techniques.

Table 2. Human-blind sequences present simultaneously in all
DENV-1 and DENV-2 genomes. DENV-3 and DENV-4 are not included, because the few sequences that are available are so similar to
each other that the vast majority of the n-mers present in the
sequence are unique to the serotype.

n                                                        16   17   18   19   20   21   22
DENV-1 (28 genomes)
Present in all dengue; absence in human not considered  664  558  458  392  336  284  254
Human-blind one change away                             218  372  408  382  334  284  254
Human-blind two changes away                              2   38   94  172  250  252  242
Human-blind three changes away                            0    0    2   12   54  118  182
Human-blind four changes away                             0    0    0    0    0    2   38
DENV-2 (46 genomes)
Present in all dengue; absence in human not considered   62   54   44   34   24   16   10
Human-blind one change away                              24   38   40   34   24   16   10
Human-blind two changes away                              0    0    6    8   16   16   10
Human-blind three changes away                            0    0    0    0    0    0    4
Human-blind four changes away                             0    0    0    0    0    0    0

Sequences unique for each individual viral
genome
For each n-mer present in a dengue genome, the number of other dengue genomes that also contain this
particular n-mer was calculated. On average, 4.4%
(16-mers) to 8.3% (22-mers) of the viral genome is
comprised of n-mers that are not present in any of the
other dengue genomes. For example, in the genome of
the DENV-4 China Guangzhou B5 strain (AF289029),
75.4% of the 22-mers are unique to this genome. Three
genomes (one DENV-1, two DENV-2, and one
DENV-4) have no 16- to 22-mers that are absent
from every other dengue strain’s genomic sequence.
Thus, no single sequence could be used as a primer/probe to identify one of these strains. Figure 3
shows the distribution of the percentage of unique
n-mers per genome for 16- to 22-mers. This analysis
was next extended to those n-mers that are human-blind. The average percentage of host-blind n-mers that
are unique to a particular genome is less than 8%, and
many genomes have no human-blind n-mers at least
two changes away from the nearest human sequence.
Despite this low average, there are several genomes
that have a higher number of human-blind sequences
than would be expected. In AF289029, 30 of the 34
16-mers that are two changes away from the nearest
human sequence are unique, and 16 014 of its 21 248
22-mers one change away from the nearest human
sequence are unique. Figure 4 reflects the distribution
of host-blind 22-mers one, two, three or four changes
away from the nearest human sequence. A complete
report of our calculations of unique n-mers for all 83
genomes is available in the Supplementary data.

Fig. 3. Distribution of the percentage of n-mers per genome that
are unique (i.e. not contained in any of the other dengue genomes
considered).

Fig. 4. Distribution of the percentage of human-blind 22-mers per
genome that are unique (i.e. not contained in any of the other dengue genomes considered).
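
The per-genome uniqueness figures reported in this section follow from counting, for each n-mer, how many genomes contain it. A minimal Python sketch of that bookkeeping is given here; the function name and the toy sequences are hypothetical, only one strand is considered, and the human-blind filter is left out for brevity.

```python
from collections import Counter


def unique_fraction(genomes, n):
    """Fraction of each genome's distinct n-mers found in no other genome."""
    sets = {
        name: {seq[i:i + n] for i in range(len(seq) - n + 1)}
        for name, seq in genomes.items()
    }
    # For every n-mer, count how many genomes contain it at least once.
    containing = Counter(nmer for nmer_set in sets.values() for nmer in nmer_set)
    return {
        name: (sum(containing[k] == 1 for k in nmer_set) / len(nmer_set)
               if nmer_set else 0.0)
        for name, nmer_set in sets.items()
    }


# Toy data (hypothetical 6-base "genomes"), n = 3.
toy_genomes = {"s1a": "AAACGT", "s1b": "AAACGA", "s2a": "CCACGT"}
fractions = unique_fraction(toy_genomes, 3)
```

A genome whose fraction is zero, like `s1a` in the toy data, corresponds to the strains for which no single sequence can serve as a strain-specific primer or probe.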
In silico array hybridization studies
To reduce the likelihood of false positives, host-blind
sequences that are 3+ changes away from the nearest
human sequence are ideal for diagnostic purposes. The
fact that there is no single sequence meeting this
criterion for all of the 83 dengue strains considered
suggests the use of multiple probes in a parallel (e.g.
array) assay in which a particular unique subset will
hybridize with each dengue virus strain. Thus, in the
case of a microarray assay, identification would be
based not on a single unique sequence but rather on a
unique pattern. The set of probes can be designed such
that each serotype, or even each strain, will produce a
unique pattern. The proposed approach can easily be
extended to unsequenced strains of dengue because
novel patterns can be compared with all known patterns and the affinity of the new isolate to the known
strains can be inferred by clustering techniques. Identification of the particular strain of infection would, for
example, allow epidemiologists and public health officials to rapidly determine if an isolate causing hemorrhagic fever represents a new outbreak or belongs to
known circulating versions of the virus. The ability to
quickly, inexpensively, and reliably diagnose dengue at
the strain level in such a manner is not possible with
existing techniques.
Based upon the results presented above for human-blind n-mers in the 83 dengue genomes, a set of 216
probes (22-mers, at least three changes away from the
nearest human sequence) was designed for in silico
experiments. The 216-probe set was computed as the
minimum number of probes possible to uniquely identify each of the 83 genomes such that each genome
was required to contain a subset of at least 28% (in
this case 61) of the 216 22-mers or probe sequences.
Furthermore, for any two strains’ genomes, the subsets
contained in each must differ by at least two
sequences. If two strains differ only by two sequences
and mutations occur in these two sequences, the
strains will be indistinguishable. The likelihood of such
an occurrence can be reduced by demanding more
sequences for distinguishing between any two individual strains; this, however, will necessitate a larger
probe set size. We further stipulated that serotypes are
distinguishable from each other such that any strain in
one serotype differs significantly from any strain in
any of the other three serotypes. To this end, it was
required that serotypes be distinguishable by at least
20% of the 216 22-mers or probe sequences contained
in any of their strain members. For the 216-probe set,
the minimum number of probes differentiating a
DENV-1 strain from all other strains belonging to
one of the three other serotypes was 70; the corresponding minima were 56 for DENV-2, 65 for DENV-3, and 56 for DENV-4. Thus, in the
event that identification is not possible at the strain
level as a result of mutations, identification at the serotype level is possible.
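
The three design constraints stated above (minimum coverage per genome, a minimum difference between any two strains, and a larger minimum difference between serotypes) can be checked on a candidate probe set with a short Python sketch. Everything here is a hedged illustration: the function name, the representation of a hybridization pattern as a set of probe indices, and the reading of 'differ by' as the size of the symmetric difference are assumptions, and the sketch only verifies a given set rather than finding a minimal one.

```python
from itertools import combinations


def check_probe_set(patterns, serotype_of, n_probes,
                    min_cover=0.28, min_strain_diff=2, min_serotype_diff=0.20):
    """Return a list of constraint violations for a candidate probe set.

    patterns: dict mapping strain name -> set of probe indices present in it.
    serotype_of: dict mapping strain name -> serotype label.
    """
    problems = []
    # Each genome must contain at least min_cover of the probes.
    for strain, probes in patterns.items():
        if len(probes) < min_cover * n_probes:
            problems.append(("coverage", strain))
    for a, b in combinations(patterns, 2):
        diff = len(patterns[a] ^ patterns[b])  # probes present in exactly one strain
        # Any two strains must differ by at least min_strain_diff probes.
        if diff < min_strain_diff:
            problems.append(("strain", a, b))
        # Strains of different serotypes must differ by a fraction of the set.
        if serotype_of[a] != serotype_of[b] and diff < min_serotype_diff * n_probes:
            problems.append(("serotype", a, b))
    return problems


# Toy example with a hypothetical 10-probe array.
serotypes = {"A": "DENV-1", "B": "DENV-1", "C": "DENV-2", "D": "DENV-2"}
good = {"A": {0, 1, 2, 3}, "B": {0, 1, 2, 4}, "C": {5, 6, 7}}
bad = dict(good, D={0, 1, 2, 3})  # D is indistinguishable from A

good_report = check_probe_set(good, serotypes, n_probes=10)
bad_report = check_probe_set(bad, serotypes, n_probes=10)
```

Raising `min_strain_diff` makes the patterns more robust to mutations in individual probe targets, at the cost of needing a larger probe set, which is the trade-off described in the text.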
To estimate the probability that a misdiagnosis
occurs at the serotype level, we assume, in the worst-case
scenario, that a target sequence in dengue will no
longer hybridize with its complementary probe if just
one point mutation occurs. For a given sequence of
length $l$, there are $l - n$ n-mers. As the length of the
dengue genomic sequence is significantly larger than
the sizes of $n$ considered here, $l - n \approx l$, such that the
probability that $m$ specific n-mers are mutated can be
estimated as $m!/l^{m}$. To misdiagnose the infection at the
serotype level in the 216-probe set would require at
least 56 mutations ($m = 56$) to occur within a dengue
genome of $\approx 10\,000$ bp ($l = 10\,000$). Thus, the probability that such an event would occur is approximately $1 : 10^{150}$.
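
The order of magnitude quoted above can be reproduced directly from the estimate $m!/l^{m}$. The short Python sketch below works in logarithms to avoid underflow; the function name is hypothetical and the calculation only restates the formula in the text with $m = 56$ and $l = 10\,000$.

```python
import math


def log10_misdiagnosis_probability(m, l):
    """log10 of m! / l**m, computed in log space to avoid underflow."""
    log10_factorial = math.lgamma(m + 1) / math.log(10)  # log10(m!)
    return log10_factorial - m * math.log10(l)


log10_p = log10_misdiagnosis_probability(m=56, l=10_000)
print(f"log10(p) = {log10_p:.1f}")
```

Since $56! \approx 7 \times 10^{74}$ and $10\,000^{56} = 10^{224}$, the ratio is of the order of $10^{-150}$ to $10^{-149}$, consistent with the stated figure.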
The microarray of 216 probes represents what many
researchers can produce in-house at low cost. We
determined the pattern that would appear on the
microarray given a particular genome’s ability to
hybridize with the probe sequences. Figure 5 shows the
overlapping expression patterns for two pairs of
genomes for the set of 216 probes. The distribution of
the number of probes present on the 216-probe set
microarray for each of 83 genomes ranges from 61 to
95. Because dengue infections occur in regions in
which other flaviviruses are also prevalent, it is imperative that a diagnostic tool is able to discriminate
between the different viruses [2]. Considering the close
relative of dengue virus, West Nile virus, we computed
the number of probes expected to be present in 26
publicly available strains. Of the 216 dengue probes, at
most only three would hybridize with a West Nile
strain. In fact, 24 of the West Nile virus strains share
these same three 22-mers with the dengue virus strains.
In the event that the clinical sample contained West
Nile virus and not dengue virus, the expression pattern
is expected to show only 1% of the probes hybridized,
far less than the 28% required during the set design.
Thus, it is highly unlikely that a misidentification of
the presence of dengue will be made using the 216-probe set, even in the presence of another flavivirus.

Fig. 5. Overlapping in silico expression patterns obtained using the
216 probes with pairs of dengue virus genomes. (A) DENV-1 strain
BR/90 AF226685 (green) and DENV-2 M29095 (red); (B) DENV-2
from Cambodia AF309641 (green) and DENV-3 DENCME (red).
Probes present in both genomes are shown in black, while probes
absent in both genomes are shown in white.

To estimate the ability of such arrays to distinguish
between different strains of a virus, as well as its possible genomic modifications, we introduce the distance
($D$) between any two patterns:

$$D = 1 - \frac{n_{12}}{\min(n_1, n_2)},$$

where $n_1$ and $n_2$ are the numbers of probes present in
each of the genomes being compared and $n_{12}$ is the number of
probes present in both genomes simultaneously. While
there are many different ways to define such a distance
between patterns, we chose this definition because of
its simplicity; the distance is 0 if both genomes produce
the same pattern and 1 if they do not share any of the
same probes.
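
The distance defined above translates directly into set operations when each hybridization pattern is stored as the set of probes that light up. The following Python sketch is a minimal illustration; the function names and the set-of-indices representation are assumptions made for the example.

```python
def pattern_distance(probes_a, probes_b):
    """D = 1 - n12 / min(n1, n2) for two non-empty sets of hybridized probes."""
    if not probes_a or not probes_b:
        raise ValueError("each pattern must contain at least one probe")
    n12 = len(probes_a & probes_b)  # probes present in both genomes
    return 1 - n12 / min(len(probes_a), len(probes_b))


def distance_matrix(patterns):
    """Square matrix of pairwise distances, rows and columns in dict order."""
    names = list(patterns)
    return [[pattern_distance(patterns[a], patterns[b]) for b in names]
            for a in names]


# Toy patterns on a hypothetical array.
toy_patterns = {"x": {1, 2, 3, 4}, "y": {3, 4, 5, 6}, "z": {7, 8}}
matrix = distance_matrix(toy_patterns)
```

One property worth noting: because the denominator is the smaller of the two counts, the distance is also 0 whenever one pattern is a subset of the other, not only when the patterns are identical.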
By computing the distances between each pair of 83
patterns (the distance matrix), we were able to group
virus isolates using PHYLIP’s Kitsch program (University of
Washington, Seattle, WA, USA) [22] and visualize these
groups using publicly available software packages
[23,24] based on the distances between the patterns
observed on the microarray (Fig. 6). The trees generated
clearly separate DENV-1 strains from the remainder of
the serotypes. DENV-3 and DENV-4 are most closely
clustered within their own respective serotypes but are
nested within the DENV-2 branch. While this may be
attributed to the fact that there are far fewer DENV-3
and DENV-4 genomes available to be included in this analysis, it
is much more probable that it is a result of the design
process itself. Because each strain must contain a
percentage of the overall probe set, sequences that are
unique to a strain are, in essence, selected against.

Fig. 6. Dengue groupings based on the similarity of the observed
hybridization patterns for the 216-probe hypothetical microarray of
22-mers at least three changes away from the nearest human
sequence.


A second probe set was designed containing a random sampling of 4000 sequences (18-mers, two changes away
from the nearest human sequence). Because members
of this set were chosen at random, many more
sequences unique to a single strain or to just a few
strains are included. A tree displaying similarity
between isolates was also created using this set
(Fig. 7). This allows the dengue strains to be grouped
by their origin and the time at which the samples were
taken, as well as by serotype. Although a true evolutionary history of the virus can probably not be
obtained in this way, the results suggest that an
unknown isolate can be characterized with respect to
its closest relatives by a comparison of hybridization
patterns. Well-developed methods such as k-means
[25], self-organising map (SOM) [26] and hierarchical
clustering [27] might improve determinations of how a
new isolate compares to the previously studied isolates.
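
Characterizing an unknown isolate by its closest relatives, as suggested above, can be sketched as a nearest-pattern lookup using the distance $D$ defined earlier. This Python example is a simple ranking, not the tree-building or clustering methods cited in the text; the function names, strain labels and probe indices are hypothetical.

```python
def pattern_distance(a, b):
    """D = 1 - n12 / min(n1, n2) for two non-empty sets of hybridized probes."""
    return 1 - len(a & b) / min(len(a), len(b))


def closest_strains(isolate, known, top=3):
    """Rank known strains by pattern distance to the isolate, nearest first."""
    ranked = sorted(known, key=lambda name: pattern_distance(isolate, known[name]))
    return [(name, pattern_distance(isolate, known[name])) for name in ranked[:top]]


# Toy reference patterns and an unknown isolate (hypothetical probe indices).
known_patterns = {
    "strain_A": {1, 2, 3, 4},
    "strain_B": {3, 4, 5, 6},
    "strain_C": {7, 8, 9},
}
isolate_pattern = {1, 2, 3, 9}
ranking = closest_strains(isolate_pattern, known_patterns)
```

In the toy data the isolate shares three of its four probes with `strain_A`, so that strain is ranked first; a real analysis would compare against all sequenced strains.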

Discussion
We found no single sequence 16–22 bases in length that is
present in all 83 dengue sequences and at least
three base changes away from the human genome. Therefore, our approach was to use a unique pattern made
by a group of oligos (minimum 216) to identify a particular virus strain. A probe set of 216 human-blind
sequences was designed that can both diagnose dengue and
identify the most similar strain among those whose
genome has previously been sequenced. The tests
currently being used in the field can at best only distinguish between serotypes. With this decreased specificity, the probability of misdiagnosis remains a major
concern. The assay proposed here will essentially
reduce the error in misdiagnosing the serotype of
dengue to $1 : 10^{150}$. With microarray technology, the
216-probe set can easily be accommodated in a single
diagnostic device. Here, just one experiment provides
both diagnosis and phylogenetic tree construction. This
assay will be able, without necessitating viral isolation,
to quickly detect a new pattern signifying a new strain
of dengue, a result almost akin to sequencing the genome.
The ability to identify strains very similar to an
unknown isolate in the data set of sequenced dengue
genomes may be especially valuable in epidemiological
studies where one would like to rapidly understand the
origins of an outbreak of hemorrhagic fever. For
example, if such an outbreak were to occur in a location where dengue fever is indigenous, it may be the
result of a new variant of the virus which is common
in that region, a re-emergence of an earlier version, a
continuation of an outbreak from the previous season
or the introduction of a new strain as the result of
travel. If the needed complete sequences are obtained
initially, hybridization arrays will allow these alternative explanations to be monitored on an ongoing basis.
Such monitoring might be conducted routinely to
detect changes in the local virus population before
cases of hemorrhagic fever occur.

Fig. 7. Dengue groupings obtained from the
similarity of the observed hybridization
patterns for the hypothetical 4000-probe
microarray of randomly sampled 18-mers at
least two changes away from the nearest
human sequence.

If reduction of cost and size of this test are critical,
the ability to identify dengue at the strain level can be
sacrificed such that specificity is available only at the
serotype level. The host-blind technology provides a
much more reliable solution than is currently available
by greatly decreasing the likelihood that a primer/probe
sequence will mispair with the host sequence.
We are confident in the ability of this technology
for reliable detection. Human-blind sequences have
been successfully used as PCR primers generating
amplicons matching those predicted computationally
(M. Anez, R. C. Willson, et al. unpublished results).
The development of host-blind diagnostic microarrays
is underway. The human-blind dengue primer/probe
sequences can be additionally improved to not only be
blind to humans but also to single nucleotide polymorphisms and organisms known to be associated with
humans (e.g. microflora), in addition to pathogens
known to be transmitted by the same vector. Furthermore, the computation-based host-blind approach can
easily be extended to include not only human hosts
but also mouse, rat, chicken, chimpanzee, mosquito,
and any other sequenced host genome.

Experimental procedures
Data
Version 3.2.2 of the human genome was used. This partially
assembled human genome, located in 944 files containing
2 860 215 662 base pairs of sequence, is available
from GenBank ( />search.cgi?taxid=9606). This version contains 794 007
unknown/unidentified bases. For simplicity, all n-mers containing such characters were excluded from the calculations.
Moreover, because the file structure of the genome assembly does not allow each chromosome to be assembled without gaps, all n-mers spanning two files (part of the n-mer in one file and the remainder in another) were not
included in our calculations. All calculations on the human
genome utilized both the original and complementary
strand sequences.
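
The exclusion rules above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `reverse_complement` and `iter_nmers` are hypothetical, and each string in `files` stands in for the sequence stored in one assembly file. Windows containing any character other than A, C, G or T are skipped, no window spans two files, and every retained n-mer is also reported for the complementary strand.

```python
# Illustrative sketch only; names and toy data are assumptions, not from the paper.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary-strand sequence, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def iter_nmers(files, n, both_strands=True):
    """Yield n-mers file by file, so that no n-mer spans two files."""
    for seq in files:
        seq = seq.upper()
        for start in range(len(seq) - n + 1):
            nmer = seq[start:start + n]
            # Skip windows with unknown/unidentified bases (e.g. 'N').
            if any(base not in "ACGT" for base in nmer):
                continue
            yield nmer
            if both_strands:
                yield reverse_complement(nmer)


# Toy input: two "files", the first containing an unknown base.
toy_files = ["AACNGG", "GTT"]
toy_nmers = set(iter_nmers(toy_files, 3))
```

In the toy input, the windows that touch `N` are dropped, and no 3-mer is formed from the end of the first string and the start of the second.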
Eighty-three complete sequences of the dengue virus (28
DENV-1, 46 DENV-2, two DENV-3, and seven DENV-4)
were considered. This set of sequences, including their
accession numbers, is provided in the Supplementary data.
The dengue genome is approximately 10 kb with minor variations in
length. Although dengue is a single-stranded RNA positive-strand virus with no DNA stage, both the original and
complementary strand sequences were used in our calculations as a precautionary measure.

Calculations
We have recently developed a set of novel algorithms that
make it possible to analyze the occurrence frequency of all
short subsequences (n-mers) of length 5–25+ nucleotides
in any sequenced genome within a reasonable time (hours)
[28–30]. The unique properties of this new approach are:
• exact consideration of each subsequence of size n
(n-mer), in contrast to traditional BLAST-based approaches
(no approximate heuristics – no missing cases);
• consideration of all sequences that can be derived from
each n-mer by up to any four changes;
• extremely good time efficiency: calculations for up to
19-mers can be performed on a regular desktop PC; calculations for 20 ≤ n < 25+ can be performed using a standard high-performance cluster;
• a large number of background (or host) genomic
sequences can be taken into consideration in one run, as
needed to avoid possible false positives; and
• it can be used for genomic sequences of all sizes of
practical interest, including the human genome (3 Gb).
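
To make the notion of sequences "derived from each n-mer by up to any four changes" concrete, the following Python sketch enumerates them by brute force. It reads "changes" as single-base substitutions, which is an assumption, and the function names `variants` and `changes_from_host` are hypothetical. It illustrates the definition only and makes no attempt to reproduce the efficiency of the algorithms cited in [28–30].

```python
# Illustrative sketch only; "changes" are assumed to be substitutions.
from itertools import combinations, product

BASES = "ACGT"


def variants(nmer, k):
    """Yield every sequence at Hamming distance exactly k from nmer."""
    for positions in combinations(range(len(nmer)), k):
        # At each chosen position, try the three other bases.
        choices = [[b for b in BASES if b != nmer[p]] for p in positions]
        for replacement in product(*choices):
            chars = list(nmer)
            for p, b in zip(positions, replacement):
                chars[p] = b
            yield "".join(chars)


def changes_from_host(nmer, host_nmers, max_changes=4):
    """Smallest number of substitutions that turns nmer into a host n-mer.

    Returns None when no host n-mer lies within max_changes substitutions.
    """
    for k in range(max_changes + 1):
        if any(v in host_nmers for v in variants(nmer, k)):
            return k
    return None
```

For an n-mer of length n there are C(n, k)·3^k sequences at distance exactly k, so the enumeration grows quickly with both n and k; this is one reason a dedicated algorithm is needed at genome scale.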
The basic idea is to associate each of the
4^n possible n-mers with a particular element of a counting array, A, and
to define a procedure that converts the n-mer character
sequence to an index of an element in such an array. It currently takes less than one minute to find the set of all
16-mers present in dengue and absent in the human genome. For PCR assay applications, the spacing of pairs and
sets of primers is important; however, these can be designed
much faster if consideration can be limited to only the subset of unique n-mers present in the genome of interest;
extension of the algorithms to a PCR primer set design is
now in progress.
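
The counting-array idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the base-4 encoding (A=0, C=1, G=2, T=3) and the names `nmer_index`, `index_to_nmer`, `count_nmers` and `host_blind` are hypothetical choices. A plain Python list of 4^n entries is only practical for small n; a packed array would be required for 16-mers and beyond.

```python
# Illustrative sketch only; the encoding and all names are assumptions.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def nmer_index(nmer):
    """Convert an n-mer to its index by reading it as a base-4 number."""
    index = 0
    for base in nmer:
        index = index * 4 + CODE[base]
    return index


def index_to_nmer(index, n):
    """Inverse of nmer_index for n-mers of length n."""
    bases = []
    for _ in range(n):
        index, digit = divmod(index, 4)
        bases.append("ACGT"[digit])
    return "".join(reversed(bases))


def count_nmers(sequences, n):
    """Fill a counting array with one element per possible n-mer."""
    counts = [0] * (4 ** n)
    for seq in sequences:
        for start in range(len(seq) - n + 1):
            window = seq[start:start + n]
            # Windows with characters other than A, C, G, T are ignored.
            if all(base in CODE for base in window):
                counts[nmer_index(window)] += 1
    return counts


def host_blind(target_sequences, host_sequences, n):
    """Return n-mers present in the target and absent from the host."""
    target = count_nmers(target_sequences, n)
    host = count_nmers(host_sequences, n)
    return [index_to_nmer(i, n)
            for i in range(4 ** n)
            if target[i] > 0 and host[i] == 0]
```

Because every n-mer maps to exactly one array element, comparing a pathogen against a host reduces to a single pass over two arrays of equal length.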

Probe selection
It is our intent to define the minimum optimal set of subsequences, s_min, that can both identify the presence of a particular pathogen and distinguish between different strains
of the pathogen. To ensure the sensitivity needed to properly identify a genomic sequence, each genome under consideration must contain at least α subsequences from s_min.
If applicable, each subclass or type must be distinguishable
from any other subclass or type by at least β subsequences.
The set of subsequences present in each genome must differ
by at least γ subsequences from the set present in every
other genome. Furthermore, for each element k in s_min, its
complement k′ must not be a member of s_min.
In designing this optimal set, an evolutionary programming approach was taken. While many sets, s, may meet
the criteria above, a fitness function is needed to measure
how ‘good’ a particular set s is in order to determine whether it is, in fact, the optimal solution. For instance, a particular set may exceed the minimum values required of α, β
and γ and, in fact, have values A, B and Γ, where A ≥ α,
B ≥ β and Γ ≥ γ. While these values contribute to the fitness of a set, the size of the set plays a much greater role in
an effort to reduce the number of probes needed and thus
the cost of the array. Therefore, we chose to evaluate the
fitness of a particular set as f(s) = (A + B + Γ) / set size,
such that for the optimal set, s_min, there exists no other set
with a greater fitness value. For sets consisting of host-blind sequences, the number of changes away from the host
genome must also be integrated into the assessment of
fitness.
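
The selection criteria and the fitness function can be made concrete with a small Python sketch. Several points are assumptions rather than statements from the text: the three minimum thresholds are named `alpha`, `beta` and `gamma`; the type-level value is taken to be the number of probes present in every genome of one type and in no genome of the other; the genome-level value is the size of the symmetric difference of the two hit sets; and "complement" is read as reverse complement. The evolutionary search itself and the host-distance term are not shown.

```python
# Illustrative sketch only; the interpretation of each criterion is an assumption.
from itertools import combinations

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def set_statistics(probes, genomes, types):
    """Return the achieved values (A, B, G) for a candidate probe set.

    probes  : iterable of probe sequences
    genomes : dict mapping genome name -> sequence
    types   : dict mapping genome name -> subclass/type label
    Requires at least two genomes and two types.
    """
    probes = set(probes)
    hits = {name: {p for p in probes if p in seq}
            for name, seq in genomes.items()}
    # A: fewest probes found in any single genome.
    a_val = min(len(h) for h in hits.values())
    # G: fewest probes separating any two genomes.
    g_val = min(len(hits[x] ^ hits[y]) for x, y in combinations(hits, 2))
    # B: fewest probes present in all genomes of one type and none of another.
    by_type = {}
    for name, label in types.items():
        by_type.setdefault(label, []).append(hits[name])
    b_val = None
    for t1, t2 in combinations(by_type, 2):
        all1, any1 = set.intersection(*by_type[t1]), set.union(*by_type[t1])
        all2, any2 = set.intersection(*by_type[t2]), set.union(*by_type[t2])
        separating = len((all1 - any2) | (all2 - any1))
        b_val = separating if b_val is None else min(b_val, separating)
    return a_val, b_val, g_val


def meets_criteria(probes, genomes, types, alpha, beta, gamma):
    """Check the three thresholds and the complement-exclusion rule."""
    probes = set(probes)
    if any(reverse_complement(p) in probes for p in probes):
        return False
    a_val, b_val, g_val = set_statistics(probes, genomes, types)
    return a_val >= alpha and b_val >= beta and g_val >= gamma


def fitness(probes, genomes, types):
    """f(s) = (A + B + G) / set size."""
    a_val, b_val, g_val = set_statistics(probes, genomes, types)
    return (a_val + b_val + g_val) / len(set(probes))


# Toy data (hypothetical sequences, not real dengue genomes).
toy_probes = {"AAAC", "CCGT", "GGTA"}
toy_genomes = {"d1a": "TTAAACTTCCGT", "d1b": "AAACGG", "d2a": "GGTACCGT"}
toy_types = {"d1a": 1, "d1b": 1, "d2a": 2}
```

Dividing by the set size means that adding a probe only pays off if it raises the summed values by more than the current average contribution per probe, which is how the function favours small arrays.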

Acknowledgements
We would like to express our gratitude to the Texas
Learning and Computation Center (TLCC) and to
NASA (Grant NNJ04HF43G to GEF and RCW) for
partial support of this work. CP’s work was supported
by a training fellowship from the Keck Center for
Computational and Structural Biology of the Gulf
Coast Consortia (NLM Grant no. 5T15LM07093).
The authors would also like to thank Dr R. Pad
Padmanabhan for many interesting discussions and
suggestions.

References
1 De Paula SO & da Fonseca BAL (2004) Dengue: a
review of the laboratory tests a clinician must know to
achieve a correct diagnosis. Braz J Infect Dis 8, 390–398.
2 Kao CL, King CC, Chao DY, Wu HL & Chang GJJ
(2005) Laboratory diagnosis of dengue virus infection:
current and future perspectives in clinical diagnosis and
public health. J Microbiol Immunol Infect 38, 5–16.
3 Relman DA (1998) Detection and identification of previously unrecognized microbial pathogens. Emerg Infect
Dis 4, 382–389.
4 Lanciotti RS, Calisher CH, Gubler DJ, Chang GJ &
Vorndam AV (1992) Rapid detection and typing of dengue viruses from clinical samples by using reverse transcriptase-polymerase chain reaction. J Clin Microbiol
30, 545–551.
5 Harris E, Roberts TG, Smith L, Selle J, Krammer LD,
Valle S, Sandoval E & Balmaseda A (1998) Typing of
dengue viruses in clinical specimens and mosquitoes by
single-tube multiplex reverse transcriptase PCR. J Clin
Microbiol 36, 2634–2639.
6 De Paula SOD, Lima CDM, Torres MP, Pereira MR &
da Fonseca BAL (2004) One-step RT-PCR protocols
improve the rate of dengue diagnosis compared to two-step RT-PCR approaches. J Clin Virol 30, 297–301.
7 Wang WK, Sung TL, Tsai YC, Kao CL, Chang SM &
King CC (2002) Detection of dengue virus replication in
peripheral blood mononuclear cells from dengue virus
type 2-infected patients by a reverse transcription-real-time PCR assay. J Clin Microbiol 40, 4472–4478.
8 Sudiro TM, Zivny J, Ishiko H, Green S, Vaughn DW,
Kalayanorooj S, Nisalak A, Norman JE, Ennis FA &
Rothman AL (2001) Analysis of plasma viral RNA
levels during acute dengue virus infection using quantitative competitor reverse transcription-polymerase chain
reaction. J Med Virol 63, 29–34.
9 Houng HH, Hritz D & Kanesa-thasan N (2000) Quantitative detection of dengue 2 virus using fluorogenic
RT-PCR based on 3′-noncoding sequence. J Virol
Methods 86, 1–11.
10 Drosten C, Gottig S, Schilling S, Asper M, Panning M,
Schmitz H & Gunther S (2002) Rapid detection and
quantification of RNA of Ebola and Marburg viruses,
Lassa virus, Crimean-Congo hemorrhagic fever virus,
Rift Valley fever virus, dengue virus, and yellow fever
virus by real-time reverse transcription-PCR. J Clin
Microbiol 40, 2323–2330.
11 Shu PY, Chang SF, Kuo YC, Yueh YY, Chien LJ,
Sue CL, Lin TH & Huang JH (2003) Development of
group- and serotype-specific one-step SYBR green
I-based real-time reverse transcription-PCR assay for
dengue virus. J Clin Microbiol 41, 2408–2416.
12 Tanaka M (1993) Rapid identification of flavivirus using
the polymerase chain reaction. J Virol Methods 41,
311–322.
13 Figueiredo LT, Batista WC, Kashima S & Nassar ES
(1998) Identification of Brazilian flaviviruses by a
simplified reverse transcription-polymerase chain reaction method using flavivirus universal primers. Am J
Trop Med Hyg 59, 357–362.
14 Ito M, Takasaki T, Yamada KI, Nerome R, Tajima S
& Kurane I (2004) Development and evaluation of
fluorogenic TaqMan reverse-transcriptase PCR assays
for detection of dengue virus types 1–4. J Clin Microbiol
42, 5935–5937.
15 Wu SJL, Lee EM, Pubatana R, Shurtliff RN, Porter
KR, Suharyono W, Watts DM, King CC, Murphey
GS, Hayes CG et al. (2001) Detection of dengue viral
RNA using a nucleic acid sequence-based amplification
assay. J Clin Microbiol 39, 2794–2798.
16 Baeumner AJ, Schlesinger NA, Slutzki NS, Romano J,
Lee EM & Montagna RA (2002) Biosensor for dengue
virus detection: sensitive, rapid, and serotype specific.
Anal Chem 74, 1442–1448.
17 Schena M, Shalon D, Davis RW & Brown PO (1995)
Quantitative monitoring of gene expression patterns
with a complementary DNA microarray. Science 270,
467–470.
18 Lipshutz RJ, Fodor SP, Gingeras TR & Lockhart DJ
(1999) High density synthetic oligonucleotide arrays.
Nat Genet 21, 20–24.
19 Woese CR, Maniloff J & Zablen LB (1980) Phylogenetic analysis of the mycoplasmas. Proc Natl Acad Sci
USA 77, 494–498.
20 McGill TR, Jurka J, Sobieski JM, Pickett MH, Woese
CR & Fox GE (1986) Characteristic Archaebacterial
16S rRNA Oligonucleotides. Syst Appl Microbiol 7,
194–197.
21 Zhang Z, Willson RC & Fox GE (2002) Identification of characteristic oligonucleotides in the 16S
ribosomal RNA sequence dataset. Bioinformatics 18,
244–250.
22 Felsenstein J (2005) PHYLIP (Phylogeny Inference Package), Version 3.6. Distributed by the Author. Department of Genome Sciences. University of Washington,
Seattle.
23 Choi J-H, Jung H-Y, Kim H-S & Cho HG (2000)
PhyloDraw: a phylogenetic tree drawing system.
Bioinformatics 16, 1056–1058.
24 Perrière G & Gouy M (1996) WWW-Query: An on-line
retrieval system for biological sequence banks. Biochimie
78, 364–369.
25 MacQueen J (1967) Methods for classification and analysis of multivariate observations. In Proceedings of the
Fifth Berkeley Symposium on Mathematical Statistics
and Probability (LeCam LM & Neyman J, eds), pp.
281–297. California Press, Berkeley, California.
26 Dopazo J & Carazo JM (1997) Phylogenetic reconstruction using an unsupervised growing neural network that
adopts the topology of a phylogenetic tree. J Mol Evol
44, 226–233.
27 Ward JH (1963) Hierarchical grouping to optimize an
objective function. J Am Stat Assoc 58, 236–244.
28 Fofanov Y, Belapurkar C, Luo Y, Katili C, Wang J,
Belosludtsev Y, Powdrill T, Fofanov V, Li T-B, Chumakov S et al. (2004) How independent are the appearances of n-mers in different genomes? Bioinformatics 20,
2421–2428.
29 Chumakov S, Putonti C, Pettitt BM, Fox GE,
Willson RC & Fofanov Y (2004) Using statistical
properties of short subsequences in microbial identification. In Proceedings of the International Conference
on Mathematics and Engineering Techniques in
Medicine and Biological Sciences (Valafar F &
Valafar H, eds), pp. 363–367. CSREA Press, Las
Vegas, NV.
30 Fofanov V, Putonti C, Chumakov S, Pettitt BM &
Fofanov Y (2005) Fast Algorithm for the Analysis of the
Presence of Short Oligonucleotide Sequences in Genomic
Sequences. UH Technical Report #UH-CS-05–11, University of Houston, Houston, Texas.
[Online http://www.cs.uh.edu/Preprints/preprint/uh-cs-05-11.pdf]

Supplementary material
The following material is available online:
Table S1. Data set of publicly available dengue strains
considered.
Table S2. Estimated number of sequences absent in a
genome of size 2 Gb (the approximate size of the
human genome excluding highly repeated elements).
Table S3. Number of n-mers absent from the human
genome.
Table S4. Number of human-blind n-mers, one, two,
three or four changes away, present in each dengue
genome.
Table S5. Number of unique human-blind n-mers, one,
two, three or four changes away.
This material is available as part of the online article
at
