
ASYMPTOTIC RESULTS IN OVER- AND UNDER-REPRESENTATION OF WORDS IN DNA

WANG RANRAN
(B.Sc., Peking University)

A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF MATHEMATICS
NATIONAL UNIVERSITY OF SINGAPORE
2005


Acknowledgements

Firstly, I would like to express my sincere thanks to my advisor, Professor Chen, Louis H. Y., for his help and guidance over these two years. Professor Chen suggested to me the research topic of over- and under-representation of words in DNA sequences, which is interesting and inspiring. From the research work, I not only gained knowledge of the basics of Computational Biology, but also learned many important modern techniques in probability theory, such as the Chen-Stein method. Meanwhile, Professor Chen gave me precious advice on my research work and taught me how to think rigorously. He always encouraged me to open my mind and discover new methods independently. The academic training I acquired during these two years will greatly benefit my future research.
I would also like to thank Professor Shao Qiman for his inspiring suggestion, which led to Remark 2.4 and finally to the proof of Theorem 3.9, and Associate Professor Choi Kwok Pui, who helped me revise the first draft of my thesis and gave me many valuable suggestions.
My thanks also go to Mr. Chew Soon Huat David, who gave me guidance in conducting computer simulations; Mr. Lin Honghuang, for helping me generate DNA sequences and compute word counts when conducting the simulations; and Mr. Dong Bin, for giving me advice in revising this thesis.

Finally, I would like to thank Mr. Chew Soon Huat David again for providing this wonderful LaTeX thesis template.



Contents

Acknowledgements

Summary

List of Figures

1 Introduction

2 Extrema of Normal Random Variables
2.1 Distribution Functions of Extrema
2.2 Poisson Approximation Approach

3 Asymptotic Results of Words in DNA
3.1 Tail Probabilities of Extrema of Sums of m-dependent Variables
3.2 Asymptotic Normality of Markov Chains

4 Simulation Results
4.1 DNA Sequences under M0
4.2 DNA Sequences under M1


Summary

Identifying over- and under-represented words is often useful in extracting information from DNA sequences. Since the criteria for defining these over- and under-represented words are somewhat ambiguous, we shall focus on the words of maximal and minimal occurrence, which can definitely be regarded as over- and under-represented words respectively. In this thesis, we study the tail probabilities of the extrema over a finite set of standard normal random variables, using techniques such as Bonferroni's inequalities and Poisson approximation associated with the Chen-Stein method. We then combine similar techniques with the moderate deviations of m-dependent random variables to derive the asymptotic tail probabilities of the extrema over a set of word occurrences under the M0 model. The statistical distribution of word counts is also studied: we show the asymptotic normality of word counts under both the M0 and M1 models. Finally, we use computer simulations to study the tail probabilities of the most frequently and most rarely occurring DNA words under both the M0 and M1 models. The asymptotic results under the M1 model are shown to be similar to those for the M0 model.


List of Figures

4.1 Normal Q-Q plot of the sums of ξ scores of all 64 3-tuple words in 20,000 simulated DNA sequences under the M0 model.

4.2 Normal Q-Q plot of the maxima of ξ scores of all 64 3-tuple words in 20,000 simulated DNA sequences under the M0 model.

4.3 Point-to-point plots of values of (a) Fmax(x) versus G(x), where Fmax(x) stands for the estimated probability P(M ≥ x) and G(x) = 64(1 − Φ(x)); (b) Fmin(x) versus G(x), where Fmin(x) stands for the estimated probability P(m ≤ x) with G(x) = 64Φ(x).

4.4 Normal Q-Q plot of the sums of ξ scores of all 64 3-tuple words in 20,000 simulated DNA sequences under the M1 model.

4.5 Point-to-point plots of values of (a) Fmax(x) versus G(x); (b) Fmax(x) versus G(x); (c) Fmin(x) versus G(x).


Chapter 1

Introduction

The analysis of rarity and abundance of words in DNA sequences has always been of interest in biological sequence analysis. A direct way to observe whether a DNA word occurs rarely (or frequently) in a genome is to analyze the number of its occurrences in a DNA sequence. For a DNA sequence $A_1 A_2 \cdots A_n$ with $A_i \in \mathcal{A} = \{A, C, G, T\}$, we define the word count (e.g. Waterman (1995)) $N_u$ for a $k$-tuple word $u$ as follows:
$$N_u = \sum_{i=1}^{n-k+1} I_u(i) = \sum_{i=1}^{n-k+1} I(A_i = u_1, A_{i+1} = u_2, \cdots, A_{i+k-1} = u_k),$$
where $I_u(i)$ is the indicator of the word $u$ occurring starting at position $i$.
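As a concrete illustration of this definition (our addition, not part of the original thesis), the word count can be computed with a minimal Python sketch; note that occurrences are allowed to overlap.

```python
def word_count(seq: str, u: str) -> int:
    """N_u: number of (possibly overlapping) occurrences of the
    k-tuple word u in the sequence seq, as in the definition above."""
    k = len(u)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == u)

# "ATA" occurs twice in "GATATA" (starting positions 2 and 4,
# counting from 1), because overlapping occurrences are counted.
print(word_count("GATATA", "ATA"))  # 2
```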
To determine whether a DNA word $u$ is rare or abundant in a DNA sequence, one first needs to introduce a probability model. Typical models, such as stationary $m$-th order Markov chains, have been widely considered in the literature (Reinert et al. (2000)). In this thesis, two models for DNA sequences will be considered. One is called the M0 model, for which all letters are independently and identically distributed; the other is called the M1 model, for which $\{A_1, A_2, \cdots\}$ forms a stationary Markov chain of order 1.
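For readers who wish to experiment, here is a hedged sketch of how sequences under the two models might be simulated in Python; the letter probabilities and transition matrix below are illustrative placeholders, not parameters used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHABET = np.array(list("ACGT"))

def simulate_m0(n: int, p) -> str:
    """M0: letters drawn i.i.d. from {A, C, G, T} with probabilities p."""
    return "".join(rng.choice(ALPHABET, size=n, p=p))

def simulate_m1(n: int, pi0, P) -> str:
    """M1: stationary first-order Markov chain; pi0 should be the
    stationary distribution of the 4x4 transition matrix P."""
    s = np.empty(n, dtype=int)
    s[0] = rng.choice(4, p=pi0)
    for t in range(1, n):
        s[t] = rng.choice(4, p=P[s[t - 1]])
    return "".join(ALPHABET[s])

# Illustrative parameters only: uniform letters, and a doubly stochastic
# chain that slightly favors repeating the current letter (so the uniform
# distribution is stationary).
p0 = np.full(4, 0.25)
P = np.full((4, 4), 0.2) + 0.2 * np.eye(4)
seq = simulate_m1(1000, p0, P)
```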
In order to analyze the word count $N_u$, we shall naturally first study its possible statistical distribution for a given model of the underlying DNA sequence. We adopt a commonly used standardized score:
$$z_u = \frac{N_u - EN_u}{\sqrt{\mathrm{Var}(N_u)}}, \qquad (1.1)$$

where $EN_u$ and $\mathrm{Var}(N_u)$ are the mean and variance of $N_u$ respectively (Leung et al. (1996)). The statistical distribution of the word count $N_u$ has already been well studied in the literature. Waterman (1995) (Chapter 12) showed that the joint distribution of a finite set of $z$ scores can be well approximated by a multivariate normal distribution under the M1 model. Several research works aim at identifying over- and under-represented words in DNA or palindromes. A word is called over- (or under-)represented if it is observed more (or less) frequently than expected under some specified probability model (Phillips et al. (1987)). Leung et al. (1996) identified over- and under-represented short DNA words by ranking their $z_L$ scores (maximum likelihood plug-in $z$ scores) in a specific genome. Chew et al. (2003) studied the over- and under-representation of the cumulative counts of all palindromes of a certain length by identifying those whose $z$ scores fall in the upper and lower 5% tails of a standard normal distribution. In these studies, the criteria used to identify over- (or under-)representation were different. Indeed, for different purposes in biological studies, the criteria will in general differ. There is no single universal way to determine whether a given word is over- (or under-)represented. However, if we consider the extreme case, i.e. if we only take the words of maximal and minimal occurrence, these two words are surely the over-represented and the under-represented ones respectively (which is exactly what we will do in this thesis).
In this thesis, we shall use ξ scores, which are essentially the same as the $z$ scores defined above in equation (1.1), and analyze the over- and under-representation of a finite set of DNA words as the sequence length goes to infinity, by investigating the behavior of the extrema of their ξ scores.

We shall study the asymptotic behavior of ξ scores. DNA sequences are generally long, and asymptotic results are therefore of direct relevance to the statistical analysis of word counts. For this, we introduce the following notation.
the word counts. For this, we introduce the following notations.
an = O(bn ) :
an = o(bn ) :
an ∼ b n :
an

bn :

|an | ≤ c|bn | (constant), as n → ∞.
an
−→ 0, as n → ∞.
bn
an
−→ 1, as n → ∞.
bn
c1 bn ≤ an ≤ c2 bn (c1 , c2 constants), as n → ∞.

Assuming that the DNA sequence is modelled by M0, we shall show (see Theorem 3.9) that for a finite set of ξ scores $\{\xi_1, \xi_2, \cdots, \xi_d\}$,
$$P(\max_i \xi_i \ge x) \sim d\bigl(1 - \Phi(x)\bigr) \quad \text{and} \quad P(\min_i \xi_i \le -x) \sim d\,\Phi(-x),$$
as $n \to \infty$ and $x \to \infty$ with $1 \le x \le c\sqrt{\ln n}$, provided that the covariance matrix of the word counts is non-singular. Here, $\Phi$ and $\varphi$ respectively denote the distribution function and the density function of a standard normal random variable.
When the DNA sequence is assumed to be M1, we will prove the asymptotic normality of the joint distribution of ξ scores by applying a central limit theorem for random variables under a mixing condition (Billingsley (1995), Section 27). Unfortunately, under the M1 model, the convergence of the ratios
$$\frac{P(\max_i \xi_i > x)}{d\bigl(1 - \Phi(x)\bigr)} \quad \text{and} \quad \frac{P(\min_i \xi_i \le -x)}{d\,\Phi(-x)}$$
to 1 for ξ scores remains unsolved.
This thesis is organized as follows. Chapter 2 shows how the distribution functions of extrema of a finite set of correlated standard normal random variables behave when these extrema tend to extremely large or small values. In Chapter 3, the asymptotic convergence of the tail probabilities of extrema is established for word counts under the M0 model; the chapter is also devoted to studying the asymptotic normality of word counts under the M0 and M1 models. Results of simulations are presented in Chapter 4; they support the asymptotic results given by Theorem 3.8 and show the possibility that similar results hold under the M1 model.



Chapter 2

Extrema of Normal Random Variables

In this chapter, we investigate the distributions of both the maximum and the minimum of a set of standard normal random variables. More precisely, we try to find the probability of the maximum being greater than $c$, and the probability of the minimum being less than $c_0$, for $c, c_0 \in \mathbb{R}$. Our main theorem in this chapter shows that, when $c$ is large enough and $c_0$ is small enough, the asymptotic tail distributions of both extrema follow certain expressions in terms of $c$ and $c_0$ respectively. We present two methods of proving this theorem, one using Bonferroni's inequalities and the other using Poisson approximation associated with the Chen-Stein method.

2.1 Distribution Functions of Extrema

To facilitate the proof of Theorem 2.7, we first need a few lemmas. The first lemma was given by Barbour et al. (1992). To make this thesis self-contained, we shall provide its proof, which is essentially the same as that of Barbour et al. (1992). Throughout this section, we assume the correlation $r$ of the two random variables $X$ and $Y$ to be strictly bounded between $-1$ and $1$, i.e. $-1 < r = \mathrm{corr}(X, Y) < 1$.


Lemma 2.1. Let $(X, Y)$ be jointly normally distributed with mean vector $\mathbf{0}$ and covariance matrix $\begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}$.

(i) If $0 \le r < 1$, then for any positive $a$ and $b$,
$$\bigl(1 - \Phi(a)\bigr)\Bigl(1 - \Phi\Bigl(\tfrac{b - ra}{\sqrt{1 - r^2}}\Bigr)\Bigr) \le P(X > a, Y > b) \le \bigl(1 - \Phi(a)\bigr)\Bigl[1 - \Phi\Bigl(\tfrac{b - ra}{\sqrt{1 - r^2}}\Bigr) + r\,\tfrac{\varphi(b)}{\varphi(a)}\Bigl(1 - \Phi\Bigl(\tfrac{a - rb}{\sqrt{1 - r^2}}\Bigr)\Bigr)\Bigr].$$
If $-1 < r \le 0$, the inequalities are reversed.

(ii) If $0 \le r < 1$, then for any nonpositive $a$ and $b$,
$$\Phi(a)\,\Phi\Bigl(\tfrac{b - ra}{\sqrt{1 - r^2}}\Bigr) \le P(X \le a, Y \le b) \le \Phi(a)\Bigl[\Phi\Bigl(\tfrac{b - ra}{\sqrt{1 - r^2}}\Bigr) + r\,\tfrac{\varphi(b)}{\varphi(a)}\,\Phi\Bigl(\tfrac{a - rb}{\sqrt{1 - r^2}}\Bigr)\Bigr].$$
If $-1 < r \le 0$, the inequalities are reversed.
Proof. For part (i),
$$P(X > a, Y > b) = \int_a^\infty \int_b^\infty \frac{1}{2\pi\sqrt{1 - r^2}}\, e^{-\frac{1}{2(1 - r^2)}(x^2 + y^2 - 2rxy)}\,dy\,dx$$
$$= \int_a^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \int_b^\infty \frac{1}{\sqrt{2\pi}\sqrt{1 - r^2}}\, e^{-\frac{(y - rx)^2}{2(1 - r^2)}}\,dy\,dx$$
$$= \int_a^\infty \varphi(x) \int_{\frac{b - rx}{\sqrt{1 - r^2}}}^\infty \varphi(y)\,dy\,dx = \int_a^\infty \varphi(x)\Bigl(1 - \Phi\Bigl(\tfrac{b - rx}{\sqrt{1 - r^2}}\Bigr)\Bigr)\,dx.$$

Integrating by parts, we get
$$P(X > a, Y > b) = -\int_a^\infty \Bigl(1 - \Phi\Bigl(\tfrac{b - rx}{\sqrt{1 - r^2}}\Bigr)\Bigr)\,d\bigl(1 - \Phi(x)\bigr)$$
$$= \Bigl[-\bigl(1 - \Phi(x)\bigr)\Bigl(1 - \Phi\Bigl(\tfrac{b - rx}{\sqrt{1 - r^2}}\Bigr)\Bigr)\Bigr]_a^\infty + \frac{r}{\sqrt{1 - r^2}}\int_a^\infty \bigl(1 - \Phi(x)\bigr)\,\varphi\Bigl(\tfrac{b - rx}{\sqrt{1 - r^2}}\Bigr)\,dx$$
$$= \bigl(1 - \Phi(a)\bigr)\Bigl(1 - \Phi\Bigl(\tfrac{b - ra}{\sqrt{1 - r^2}}\Bigr)\Bigr) + r\int_a^\infty \bigl(1 - \Phi(x)\bigr)\,\varphi\Bigl(\tfrac{b - rx}{\sqrt{1 - r^2}}\Bigr)\,\frac{dx}{\sqrt{1 - r^2}}. \qquad (2.1)$$


If $0 \le r < 1$, we get the lower bound immediately. Next, we want to prove that the function $\frac{1 - \Phi(x)}{\varphi(x)}$ is decreasing. Let $f(x) = \frac{1 - \Phi(x)}{\varphi(x)}$. Then $f'(x) = -1 + \frac{x\bigl(1 - \Phi(x)\bigr)}{\varphi(x)}$. When $x > 0$, we have
$$1 - \Phi(x) = \int_x^\infty \varphi(y)\,dy \le \frac{1}{x}\int_x^\infty y\,\varphi(y)\,dy = \frac{1}{x}\,\varphi(x).$$
The above inequality gives $f'(x) \le 0$, so $f$ is decreasing; in particular, $\frac{1 - \Phi(x)}{\varphi(x)} \le \frac{1 - \Phi(a)}{\varphi(a)}$ for $x \ge a > 0$. It follows that
$$r\int_a^\infty \bigl(1 - \Phi(x)\bigr)\,\varphi\Bigl(\tfrac{b - rx}{\sqrt{1 - r^2}}\Bigr)\,\frac{dx}{\sqrt{1 - r^2}} \le r\,\frac{1 - \Phi(a)}{\varphi(a)}\int_a^\infty \varphi(x)\,\varphi\Bigl(\tfrac{b - rx}{\sqrt{1 - r^2}}\Bigr)\,\frac{dx}{\sqrt{1 - r^2}}$$
$$= r\,\frac{1 - \Phi(a)}{\varphi(a)}\int_a^\infty \varphi(b)\,\varphi\Bigl(\tfrac{x - rb}{\sqrt{1 - r^2}}\Bigr)\,\frac{dx}{\sqrt{1 - r^2}} = \bigl(1 - \Phi(a)\bigr)\,r\,\frac{\varphi(b)}{\varphi(a)}\int_a^\infty \varphi\Bigl(\tfrac{x - rb}{\sqrt{1 - r^2}}\Bigr)\,d\Bigl(\tfrac{x - rb}{\sqrt{1 - r^2}}\Bigr)$$
$$= \bigl(1 - \Phi(a)\bigr)\,r\,\frac{\varphi(b)}{\varphi(a)}\Bigl(1 - \Phi\Bigl(\tfrac{a - rb}{\sqrt{1 - r^2}}\Bigr)\Bigr),$$
which gives the upper bound; the second equality uses the identity $\varphi(x)\,\varphi\bigl(\tfrac{b - rx}{\sqrt{1 - r^2}}\bigr) = \varphi(b)\,\varphi\bigl(\tfrac{x - rb}{\sqrt{1 - r^2}}\bigr)$, obtained by completing the square in the exponents. Due to equation (2.1), the lower and upper bounds are reversed when $r < 0$. Hence, the same argument can be used to derive the reversed inequalities for $-1 < r \le 0$.
For part (ii), since
$$P(X > a, Y > b) = P(-X < -a, -Y < -b) = P(X < -a, Y < -b) = P(X \le -a, Y \le -b),$$
the same argument works when we take $a$ and $b$ to be nonpositive. The inequalities become
$$P(X \le a, Y \le b) = P(X > -a, Y > -b)$$
$$\le \bigl(1 - \Phi(-a)\bigr)\Bigl[1 - \Phi\Bigl(\tfrac{-b + ra}{\sqrt{1 - r^2}}\Bigr) + r\,\tfrac{\varphi(b)}{\varphi(a)}\Bigl(1 - \Phi\Bigl(\tfrac{-a + rb}{\sqrt{1 - r^2}}\Bigr)\Bigr)\Bigr]$$
$$= \Phi(a)\Bigl[\Phi\Bigl(\tfrac{b - ra}{\sqrt{1 - r^2}}\Bigr) + r\,\tfrac{\varphi(b)}{\varphi(a)}\,\Phi\Bigl(\tfrac{a - rb}{\sqrt{1 - r^2}}\Bigr)\Bigr].$$
As a result, the inequalities are established for $0 \le r < 1$ and nonpositive $a$ and $b$. The same argument works when $-1 < r \le 0$, with the inequalities reversed.
Lemma 2.2. Let $(X, Y)$ be jointly normally distributed with mean vector $\mathbf{0}$ and covariance matrix $\begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}$.

(i) If $0 \le r < 1$, then for any positive $a$,
$$\bigl(1 - \Phi(a)\bigr)\Bigl(1 - \Phi\Bigl(a\sqrt{\tfrac{1 - r}{1 + r}}\Bigr)\Bigr) \le P(X > a, Y > a) \le (1 + r)\bigl(1 - \Phi(a)\bigr)\Bigl(1 - \Phi\Bigl(a\sqrt{\tfrac{1 - r}{1 + r}}\Bigr)\Bigr).$$
If $-1 < r \le 0$, the inequalities are reversed.

(ii) If $0 \le r < 1$, then for any nonpositive $a$,
$$\Phi(a)\,\Phi\Bigl(a\sqrt{\tfrac{1 - r}{1 + r}}\Bigr) \le P(X \le a, Y \le a) \le (1 + r)\,\Phi(a)\,\Phi\Bigl(a\sqrt{\tfrac{1 - r}{1 + r}}\Bigr).$$
If $-1 < r \le 0$, the inequalities are reversed.

Proof. This lemma follows directly from Lemma 2.1 by setting $b = a$ and noting that $\frac{a - ra}{\sqrt{1 - r^2}} = a\sqrt{\frac{1 - r}{1 + r}}$.
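As a quick numerical sanity check (our addition, with arbitrary illustrative values $a = -2$ and $r = 0.5$), the bounds in part (ii) can be compared against the exact bivariate normal orthant probability computed by SciPy:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

a, r = -2.0, 0.5   # arbitrary illustrative values
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, r], [r, 1.0]])
p = joint.cdf([a, a])                                   # P(X <= a, Y <= a)
g = norm.cdf(a) * norm.cdf(a * np.sqrt((1 - r) / (1 + r)))
print(g <= p <= (1 + r) * g)                            # expected: True
```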
The above two lemmas give exact expressions for the lower and upper bounds of the probability $P(X > a, Y > a)$. Next, we would like to find the asymptotic behavior of $P(X > a, Y > a)$ as $a$ tends to infinity. The rate of convergence of $P(X > a, Y > a)$ is also given in the following lemma.

Lemma 2.3. Let $(X, Y)$ be jointly normally distributed with mean vector $\mathbf{0}$ and covariance matrix $\begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}$. We have
$$P(X > a, Y > a) = o\bigl(1 - \Phi(a)\bigr) \quad \text{as } a \to \infty, \qquad (2.2)$$
and
$$P(X \le -a, Y \le -a) = o\bigl(\Phi(-a)\bigr) \quad \text{as } a \to \infty. \qquad (2.3)$$


Proof. When $a \to \infty$, $1 - \Phi\bigl(a\sqrt{\tfrac{1 - r}{1 + r}}\bigr) \to 0$. Immediately, by applying the squeeze theorem to the bounds of Lemma 2.2, we obtain equations (2.2) and (2.3).
Remark 2.4. The upper and lower bounds obtained in Lemma 2.1 are very tight and can be used to refine the error bounds in normal approximation problems. However, it is not necessary to use such tight bounds to prove Lemma 2.3, as can be seen as follows.
Since $X$ and $Y$ are jointly normal, $X + Y$ is normal with $E(X + Y) = 0$ and $\mathrm{Var}(X + Y) = 2(1 + r)$. Hence $\frac{X + Y}{\sqrt{2(1 + r)}}$ is standard normal. Therefore,
$$P(X > a, Y > a) \le P(X + Y > 2a) = P\Bigl(\frac{X + Y}{\sqrt{2(1 + r)}} > \frac{2a}{\sqrt{2(1 + r)}}\Bigr) = 1 - \Phi\Bigl(a\sqrt{\frac{2}{1 + r}}\Bigr). \qquad (2.4)$$

Let $\lambda = \sqrt{\frac{2}{1 + r}}$. Obviously, we have $\lambda > 1$. It suffices to prove that, when $\lambda > 1$, $1 - \Phi(\lambda a) = o\bigl(1 - \Phi(a)\bigr)$ as $a \to \infty$. When $a > 0$,
$$1 - \Phi(a) = \int_a^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx \le \int_a^\infty \frac{x}{a}\,\frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx = \frac{1}{a\sqrt{2\pi}} e^{-a^2/2} = \frac{\varphi(a)}{a}.$$

Also,
$$1 - \Phi(a) = \int_a^\infty \frac{1}{x\sqrt{2\pi}}\,x e^{-x^2/2}\,dx = -\int_a^\infty \frac{1}{x\sqrt{2\pi}}\,d\bigl(e^{-x^2/2}\bigr)$$
$$= \frac{1}{a\sqrt{2\pi}} e^{-a^2/2} + \int_a^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,d(1/x) \ge \varphi(a)/a - \frac{1}{a^2}\int_a^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx$$
$$= \varphi(a)/a - \bigl(1 - \Phi(a)\bigr)/a^2.$$
Therefore, when $a > 0$, we obtain
$$\frac{a}{1 + a^2}\,\varphi(a) \le 1 - \Phi(a) \le \frac{\varphi(a)}{a}.$$

When $\lambda > 1$,
$$\frac{1 - \Phi(\lambda a)}{1 - \Phi(a)} \le \frac{\varphi(\lambda a)/(\lambda a)}{a\,\varphi(a)/(1 + a^2)} = \frac{1 + a^2}{\lambda a^2}\, e^{-\frac{(\lambda^2 - 1)a^2}{2}} \longrightarrow 0, \quad \text{as } a \to \infty. \qquad (2.5)$$

Equation (2.3) is a direct result of equation (2.2), since $P(X \le -a, Y \le -a) = P(-X \ge a, -Y \ge a) = P(X \ge a, Y \ge a)$.
Since we only need the asymptotic convergence of the ratio $P(X > a, Y > a)/\bigl(1 - \Phi(a)\bigr)$, tight bounds for the term $P(X > a, Y > a)$ are unnecessary. Furthermore, Lemma 2.1 applies to standard normal random variables, while the method shown in the proof of the above lemma can also be applied to random variables which converge weakly to standard normal random variables. For example, if we have random variables $X_n \Rightarrow N(0, 1)$ and $Y_n \Rightarrow N(0, 1)$, we may get asymptotic results similar to Lemma 2.3. We will discuss this in later chapters.

We note that to derive the asymptotic convergence of the ratio $P(X > a, Y > a)/\bigl(1 - \Phi(a)\bigr)$, the correlation of $X$ and $Y$ should be strictly bounded between $-1$ and $1$. If we have a sequence of correlated random variables $\{Z_1, Z_2, \cdots, Z_d\}$, it is not practical to check the correlations of every two distinct random variables one by one. In Proposition 2.6 below, we will show that a non-singular covariance matrix of $\{Z_1, Z_2, \cdots, Z_d\}$ implies that the correlation of every two distinct random variables equals neither 1 nor $-1$. To prove this, we recall a well-known fact below.

Theorem 2.5. Let $X$ and $Y$ be two random variables and let $r$ be their correlation. Then $|r| = 1$ if and only if there exist constants $a, b$ such that $Y = aX + b$ with probability 1.

We now give our proposition as follows.
Proposition 2.6. Let $Z_1, Z_2, \cdots, Z_d$ be random variables with mean 0 and variance 1, and let $\Sigma$ and $R$ be their covariance matrix and correlation matrix respectively. If $\Sigma$ is non-singular, then all off-diagonal entries of $R$ are strictly bounded between $-1$ and $1$, i.e. $-1 < r_{ij} < 1$ for $i \ne j$, where $r_{ij}$ is the $(i, j)$-entry of $R$.
Proof. If there exist $Z_i$ and $Z_j$, $i \ne j$, such that $r_{ij} = \mathrm{corr}(Z_i, Z_j) = 1$, Theorem 2.5 implies that there exist constants $a$ and $b$ such that $Z_j = aZ_i + b$ with probability 1. Together with the conditions $EZ_i = EZ_j = 0$ and $\mathrm{Var}(Z_i) = \mathrm{Var}(Z_j) = 1$, we have $Z_i = Z_j$ with probability 1. Consequently, $\mathrm{cov}(Z_i, Z_k) = \mathrm{cov}(Z_j, Z_k)$ for every $k$, so the $i$th and $j$th rows of
$$\Sigma = \begin{pmatrix} \mathrm{cov}(Z_1, Z_1) & \mathrm{cov}(Z_1, Z_2) & \cdots & \mathrm{cov}(Z_1, Z_d) \\ \vdots & \vdots & & \vdots \\ \mathrm{cov}(Z_d, Z_1) & \mathrm{cov}(Z_d, Z_2) & \cdots & \mathrm{cov}(Z_d, Z_d) \end{pmatrix}$$
are identical, and $|\Sigma| = 0$ follows. If $\mathrm{corr}(Z_i, Z_j) = -1$, then $Z_j = -Z_i$, the sum of the $i$th and $j$th rows of $\Sigma$ is a zero vector, and this also yields $|\Sigma| = 0$. This contradicts our assumption that $\Sigma$ is a non-singular matrix.
With the above Lemma 2.3 and Proposition 2.6, we shall introduce the main result of this chapter. This theorem presents the asymptotic tail distributions of both the maximum and the minimum over a finite collection of normal random variables.

Theorem 2.7. Let $(Z_1, \cdots, Z_d)$ be a random vector with a multivariate normal distribution with mean vector $\mathbf{0}$ and a non-singular covariance matrix $\Sigma$. Assume further that the variance of each $Z_i$, $1 \le i \le d$, is 1. Then
$$P\bigl(\max_{1 \le i \le d} Z_i > c\bigr) \sim d\bigl(1 - \Phi(c)\bigr) \quad \text{as } c \to +\infty,$$
and
$$P\bigl(\min_{1 \le i \le d} Z_i \le c_0\bigr) \sim d\,\Phi(c_0) \quad \text{as } c_0 \to -\infty.$$

Proof. We shall give two proofs of this theorem.

(Proof I) Recall Bonferroni's inequalities: for any events $A_1, A_2, \cdots, A_d$,
$$\sum_{i=1}^d P(A_i) - \sum_{1 \le i < j \le d} P(A_i \cap A_j) \le P\Bigl(\bigcup_{i=1}^d A_i\Bigr) \le \sum_{i=1}^d P(A_i).$$
Obviously, the event $\{\max_{1 \le i \le d} Z_i > c\} = \bigcup_{i=1}^d \{Z_i > c\}$. So
$$P\bigl(\max_{1 \le i \le d} Z_i > c\bigr) = P\Bigl(\bigcup_{i=1}^d (Z_i > c)\Bigr).$$
It follows that
$$\sum_{i=1}^d P(Z_i > c) - \sum_{1 \le i < j \le d} P(Z_i > c, Z_j > c) \le P(\max_i Z_i > c) \le \sum_{i=1}^d P(Z_i > c),$$
which is equivalent to
$$d\bigl(1 - \Phi(c)\bigr) - \sum_{1 \le i < j \le d} P(Z_i > c, Z_j > c) \le P(\max_i Z_i > c) \le d\bigl(1 - \Phi(c)\bigr). \qquad (2.6)$$
Equation (2.6) immediately gives
$$d - \sum_{1 \le i < j \le d} \frac{P(Z_i > c, Z_j > c)}{1 - \Phi(c)} \le \frac{P(\max_i Z_i > c)}{1 - \Phi(c)} \le d. \qquad (2.7)$$
Recall that $Z_i$ and $Z_j$, $1 \le i, j \le d$, are standard normal random variables with zero mean, and by Proposition 2.6 their correlations are strictly bounded between $-1$ and $1$. Lemma 2.3 gives
$$P(Z_i > c, Z_j > c) = o\bigl(1 - \Phi(c)\bigr) \quad \text{as } c \to \infty.$$
Due to the finiteness of the index set,
$$\sum_{1 \le i < j \le d} \frac{P(Z_i > c, Z_j > c)}{1 - \Phi(c)} \longrightarrow 0 \quad \text{as } c \to \infty.$$
Applying the squeeze theorem to equation (2.7), we have
$$P\bigl(\max_{1 \le i \le d} Z_i > c\bigr) \sim d\bigl(1 - \Phi(c)\bigr) \quad \text{as } c \to \infty.$$

Since $(Z_1, Z_2, \cdots, Z_d)$ has zero mean, $(-Z_1, -Z_2, \cdots, -Z_d)$ has the same distribution as $(Z_1, Z_2, \cdots, Z_d)$. Hence,
$$P(\min_i Z_i \le c_0) = P\bigl(\min_i(-Z_i) \le c_0\bigr) = P(\max_i Z_i \ge -c_0) \sim d\bigl(1 - \Phi(-c_0)\bigr) = d\,\Phi(c_0).$$
Therefore, Theorem 2.7 is obtained.
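Theorem 2.7 is easy to probe by simulation. The sketch below (our addition; the dimension, common correlation and threshold are arbitrary illustrative values) estimates $P(\max_i Z_i > c)$ by Monte Carlo and compares it with $d(1 - \Phi(c))$; by the union bound the true ratio is at most 1, and it approaches 1 as $c$ grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
d, rho, c = 5, 0.3, 3.0                      # illustrative values only
Sigma = np.full((d, d), rho) + (1 - rho) * np.eye(d)
Z = rng.multivariate_normal(np.zeros(d), Sigma, size=1_000_000)
p_max = (Z.max(axis=1) > c).mean()           # estimate of P(max_i Z_i > c)
print(p_max / (d * norm.sf(c)))              # close to (and typically below) 1
```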

2.2 Poisson Approximation Approach

For a set of independent events, if the probabilities of these events occurring are very small, we call them rare events. Suppose there are $n$ independent events, the $i$th occurring with probability $p_i$, $1 \le i \le n$, where each $p_i$ tends to zero. Then, for $k = 0, 1, \cdots$, the probability that exactly $k$ of these events occur is approximately $e^{-\lambda}\lambda^k/k!$, where $\lambda = \sum_i p_i$. This is known as the Poisson limit theorem. It leads to an important consequence: the probability that at least one event occurs is approximately $1 - e^{-\lambda}$.
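A small simulation (our addition, with arbitrary values of $n$ and $p$) illustrates the Poisson limit theorem by comparing the number of occurrences among $n$ independent rare events with the corresponding Poisson probabilities.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
n, p = 1000, 0.003                         # n independent events, each rare
lam = n * p
counts = rng.binomial(n, p, size=200_000)  # number of events that occur
print((counts >= 1).mean(), 1 - np.exp(-lam))     # P(at least one) vs 1 - e^{-lam}
print((counts == 2).mean(), poisson.pmf(2, lam))  # P(exactly two) vs Poisson pmf
```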
Therefore, it is quite natural to think of using Poisson approximation here. In this section, we provide another method of proving Theorem 2.7, which employs the technique of Poisson approximation associated with the Chen-Stein method. In 1975, Chen first applied Stein's method (Stein (1972)) to Poisson approximation problems, and obtained error bounds when approximating sums of dependent Bernoulli random variables by the Poisson distribution. The Chen-Stein method has been successfully developed over the past 30 years and has resulted in many interesting applications (see e.g. Barbour and Chen (2005a, b)).

In Poisson approximation problems, we use the total variation distance to measure how well one random variable approximates the other. The total variation distance between the distributions of two random variables $X$ and $Y$ is defined as
$$\|\mathcal{L}(X) - \mathcal{L}(Y)\| = \sup_A \bigl|P(X \in A) - P(Y \in A)\bigr|.$$
Suppose $\{X_\alpha : \alpha \in J\}$ are dependent Bernoulli random variables with index set $J$. Denote the probabilities of occurrence by
$$p_\alpha = P(X_\alpha = 1) = 1 - P(X_\alpha = 0).$$
Let $W = \sum_{\alpha \in J} X_\alpha$ be the number of occurrences of the dependent random events, and $\lambda = EW = \sum_{\alpha \in J} p_\alpha$. For every $\alpha \in J$, let $A_\alpha \subset J$ be a neighborhood of $\alpha$ with $\alpha \in A_\alpha$.

To prove Theorem 2.7 using the Chen-Stein method, we shall apply one main result of Poisson approximation (Arratia et al. (1990)), which is given below.
Theorem 2.8. The total variation distance between the distribution of $W$, $\mathcal{L}(W)$, and the Poisson distribution with mean $\lambda$, $\mathrm{Po}(\lambda)$, satisfies
$$\|\mathcal{L}(W) - \mathrm{Po}(\lambda)\| \le \frac{1 - e^{-\lambda}}{\lambda}(b_1 + b_2) + \min(1, 1.4\,\lambda^{-1/2})\,b_3 \le 2(b_1 + b_2 + b_3),$$
and
$$\bigl|P(W = 0) - e^{-\lambda}\bigr| \le (b_1 + b_2 + b_3)\,\frac{1 - e^{-\lambda}}{\lambda} < \Bigl(1 \wedge \frac{1}{\lambda}\Bigr)(b_1 + b_2 + b_3),$$
where
$$b_1 = \sum_{\alpha \in J} \sum_{\beta \in A_\alpha} p_\alpha p_\beta, \qquad b_2 = \sum_{\alpha \in J} \sum_{\alpha \ne \beta \in A_\alpha} E X_\alpha X_\beta, \qquad b_3 = \sum_{\alpha \in J} E\bigl|E(X_\alpha - p_\alpha \mid X_\beta : \beta \in A_\alpha^c)\bigr|. \qquad (2.8)$$


With Theorem 2.8, we now introduce the second proof of Theorem 2.7.

Proof. (Proof II) Let the finite index set be $J = \{1, \cdots, d\}$, and let $X_i = I(Z_i > c)$ be the indicator of the event $\{Z_i > c\}$. Suppose
$$p_i = P(Z_i > c) = P(X_i = 1), \qquad W = \sum_{i=1}^d X_i, \qquad \lambda = \sum_{i=1}^d p_i;$$
then the event
$$\{\max_{1 \le i \le d} Z_i > c\} = \{W \ge 1\}.$$
Next, we shall apply Theorem 2.8. Take $A_i$, the neighborhood of $i$, to be the whole index set $J$. Then $b_3$, given by equation (2.8), becomes 0, and it follows that
$$\bigl|P(W \ge 1) - (1 - e^{-\lambda})\bigr| = \bigl|P(W = 0) - e^{-\lambda}\bigr| \le \Bigl(1 \wedge \frac{1}{\lambda}\Bigr)\Bigl(\sum_{i=1}^d \sum_{j=1}^d p_i p_j + \sum_{i=1}^d \sum_{j=1, j \ne i}^d E(X_i X_j)\Bigr).$$
Obviously, when $c$ tends to infinity, each $p_i$ becomes very small, and so does $\lambda$. Therefore $\lambda \to 0$ as $c \to \infty$. Consequently, $\lambda/(1 - e^{-\lambda}) \to 1$, and eventually $1 \wedge \frac{1}{\lambda} = 1$. We rewrite the inequality above as
$$\Bigl|\frac{P(W \ge 1)}{1 - e^{-\lambda}} - 1\Bigr| \le \frac{1}{1 - e^{-\lambda}}\Bigl(\sum_{i=1}^d \sum_{j=1}^d p_i p_j + \sum_{i=1}^d \sum_{j=1, j \ne i}^d E(X_i X_j)\Bigr). \qquad (2.9)$$
We look at the second term of the upper bound given in (2.9). From Lemma 2.3 and Proposition 2.6, we obtain
$$E(X_i X_j) = P(Z_i > c, Z_j > c) = o\bigl(1 - \Phi(c)\bigr).$$
Since $d$ is finite, it follows that
$$\sum_{i=1}^d \sum_{j=1, j \ne i}^d E(X_i X_j) = o\bigl(1 - \Phi(c)\bigr) \quad \text{as } c \to \infty.$$
Therefore, inequality (2.9) becomes
$$\Bigl|\frac{P(W \ge 1)}{1 - e^{-\lambda}} - 1\Bigr| \le \frac{1}{1 - e^{-\lambda}}\bigl(\lambda^2 + o(\lambda)\bigr) \longrightarrow 0, \quad \text{as } \lambda \to 0.$$
Finally, we get the result that
$$P\bigl(\max_{1 \le i \le d} Z_i > c\bigr) \sim 1 - e^{-\lambda} \sim \lambda = d\bigl(1 - \Phi(c)\bigr), \quad \text{as } c \to +\infty. \qquad (2.10)$$

Similarly to the argument for $\max Z_i$, we define the indicators $Y_i = I(Z_i \le c_0)$, $q_i = P(Y_i = 1)$ and $U = \sum_{i=1}^d Y_i$. Then the event $\{\min_i Z_i \le c_0\}$ becomes $\{U \ge 1\}$. Since $\sum_i q_i$ tends to zero, it follows that
$$P(\min_i Z_i \le c_0) = P(U \ge 1) \sim \sum_i q_i = d\,\Phi(c_0), \quad \text{as } c_0 \to -\infty.$$

Remark 2.9. As one can see, the proof using Bonferroni's inequalities is more approachable and easier to understand. However, we keep the second proof in this section, because it is an interesting application of Poisson approximation associated with the Chen-Stein method.
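To make the quantities in Proof II concrete, the following sketch (our addition; the dimension, common correlation and threshold are illustrative) evaluates $\lambda$, $b_1$ and $b_2$ for equicorrelated standard normals, with the neighborhoods taken to be the whole index set so that $b_3 = 0$, and prints the resulting Chen-Stein bound on $|P(W = 0) - e^{-\lambda}|$.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

d, rho, c = 5, 0.3, 3.0                    # illustrative values only
p_i = norm.sf(c)                           # p_i = P(Z_i > c)
lam = d * p_i
b1 = lam ** 2                              # sum_{i,j} p_i p_j when A_i = J
pair = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])
# E X_i X_j = P(Z_i > c, Z_j > c), by inclusion-exclusion on the joint cdf.
p_ij = 1 - 2 * norm.cdf(c) + pair.cdf([c, c])
b2 = d * (d - 1) * p_ij                    # sum over ordered pairs i != j
bound = (b1 + b2) * (1 - np.exp(-lam)) / lam
print(lam, bound)                          # bound on |P(W = 0) - e^{-lam}|
```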


Chapter 3

Asymptotic Results of Words in DNA

3.1 Tail Probabilities of Extrema of Sums of m-dependent Variables

Let $X_1, X_2, \cdots$ be a sequence of $m$-dependent random variables with $EX_k = 0$ and $EX_k^2 < \infty$, $k = 1, 2, \cdots$. We put $S_n = X_1 + \cdots + X_n$, $B_n^2 = ES_n^2$, and
$$M_{n,p}(h) = \max_{1 \le k \le n} E\bigl(|X_k|^p e^{p|h||X_k|}\bigr), \qquad M_{n,p} = M_{n,p}(0), \qquad L_{n,p} = n M_{n,p} B_n^{-p}.$$
Denote the distribution function of $S_n/B_n$ by $F_n$, that is,
$$F_n(x) = P(S_n < x B_n).$$
Here, we present an important result of Heinrich (1985) on the behavior of the ratios $\bigl(1 - F_n(x)\bigr)/\bigl(1 - \Phi(x)\bigr)$ and $F_n(-x)/\Phi(-x)$ for $x \in [1, c\sqrt{\ln B_n^2}]$, $c > 0$, as $n \to \infty$ (so-called moderate deviations).
Theorem 3.1. Let $X_1, X_2, \cdots$ be a sequence of $m$-dependent random variables with $EX_k = 0$ and $E|X_k|^p < \infty$, $p = 2 + c_0^2$, for some $c_0 > 0$, and let $q = \min(p, 3)$. Take $S_n = X_1 + \cdots + X_n$ and $B_n^2 = ES_n^2$. Then in the interval $1 \le x \le c\sqrt{\ln B_n^2}$, $0 < c \le c_0$, we have
$$\frac{1 - F_n(x)}{1 - \Phi(x)} = \frac{F_n(-x)}{\Phi(-x)} = 1 + O\Biggl(\frac{m^{p+1}}{x^{p-1} B_n^{p - c^2}} \sum_{k=1}^n E|X_k|^p I\Bigl(|X_k| > \frac{B_n}{2(2m+1)x}\Bigr) \times \bigl(1 + m^{q-1} x^{2p-q} L_{n,q}\bigr) + m^{q-1} x^q L_{n,q} + m^{p-1} x^p L_{n,p}\Biggr), \qquad (3.1)$$
if $m^{q-1} x^q L_{n,q} + m^{p-1} x^p L_{n,p} \to 0$ as $n \to \infty$.

The proof of the above theorem can be found in Heinrich (1985); it is technical and shall be omitted here. Interested readers may consult the original paper for more details.
With additional assumptions, the ratio $\bigl(1 - F_n(x)\bigr)/\bigl(1 - \Phi(x)\bigr)$ is asymptotically equal to 1, as shown in the following theorem.

Theorem 3.2. Let $X_1, X_2, \cdots$ be a sequence of $m$-dependent random variables with $EX_k = 0$ and $E|X_k|^{2p} \le C_p < \infty$, $p = 2 + c_0^2$ for some $c_0 > 0$. Then, for $B_n^2 \asymp n$, we have
$$\frac{1 - F_n(x)}{1 - \Phi(x)} = \frac{F_n(-x)}{\Phi(-x)} = 1 + o(1),$$
for $1 \le x \le c\sqrt{\ln n}$ and $0 < c \le c_0$ as $n \to \infty$.

Proof. From $E|X_k|^{2p} \le C_p < \infty$, we know that $M_{n,2p} = \max_k E|X_k|^{2p} \le C_p$ uniformly; by Lyapunov's inequality also $M_{n,p} \le (M_{n,2p})^{1/2} \le C_p^{1/2}$. Recalling Theorem 3.1, what we are interested in is the expression inside $O(\cdot)$ on the right-hand side of equation (3.1). The expression can be written as $R_1 + R_2 + R_3$, where
$$R_1 = \frac{m^{p+1}}{x^{p-1} B_n^{p - c^2}} \sum_{k=1}^n E|X_k|^p I\Bigl(|X_k| > \frac{B_n}{2(2m+1)x}\Bigr), \qquad (3.2)$$
$$R_2 = m^{q-1} x^{2p-q} L_{n,q}\,R_1, \qquad R_3 = m^{q-1} x^q L_{n,q} + m^{p-1} x^p L_{n,p}.$$

We shall first prove that $R_3 \to 0$ as $n \to \infty$. Since $x \le c\sqrt{\ln B_n^2}$ and $B_n^2 \asymp n$,
$$x^p L_{n,p} = x^p\,\frac{n M_{n,p}}{B_n^p} \le C\,\frac{n(\ln B_n^2)^{p/2}}{B_n^p} \asymp C\,\frac{(\ln n)^{p/2}}{n^{p/2 - 1}},$$
and
$$\frac{(\ln n)^{p/2}}{n^{p/2 - 1}} = \Bigl(\frac{\ln n}{n^{1 - 2/p}}\Bigr)^{p/2}.$$
Since $1 - 2/p > 0$, the ratio $(\ln n)/n^{1 - 2/p}$ converges to 0 as $n \to \infty$. Similarly, whether $q$ equals $p$ or 3, the statement $x^q L_{n,q} \to 0$ holds. Therefore, $m^{q-1} x^q L_{n,q} + m^{p-1} x^p L_{n,p} \to 0$ as $n \to \infty$.

Next, we want to show that $R_1 \to 0$ as $n \to \infty$. By the Cauchy-Schwarz inequality and then Chebyshev's inequality,
$$R_1 \le \frac{m^{p+1} B_n^{c^2}}{x^{p-1} B_n^p} \sum_{k=1}^n \bigl(E|X_k|^{2p}\bigr)^{1/2}\,\Bigl(P\Bigl(|X_k| > \frac{B_n}{2(2m+1)x}\Bigr)\Bigr)^{1/2}$$
$$\le \frac{m^{p+1} B_n^{c^2}}{x^{p-1} B_n^p} \cdot n\,(M_{n,2p})^{1/2} \cdot \frac{2(2m+1)x}{B_n}\,\max_{1 \le k \le n}\bigl(E|X_k|^2\bigr)^{1/2} \le C\,\frac{nx}{B_n^{3 + c_0^2 - c^2}} \le C\,\frac{n \ln n}{n^{1 + (1 + c_0^2 - c^2)/2}}.$$
Note that since $c \le c_0$, the exponent satisfies $(1 + c_0^2 - c^2)/2 \ge 1/2 > 0$, so
$$\frac{\ln n}{n^{(1 + c_0^2 - c^2)/2}} \longrightarrow 0 \quad \text{as } n \to \infty.$$
Thus, $R_1 \to 0$ as $n \to \infty$.

Finally, we consider $R_2$. From the above discussion, we only need to explore the convergence of $m^{q-1} x^{2p-q} L_{n,q}$. When $q = p$, $x^{2p-q} L_{n,q} = x^p L_{n,p} \to 0$, as has been proved earlier. When $q = 3$ (which occurs when $p > 3$), a similar argument gives
$$x^{2p-q} L_{n,q} = x^{2p-3}\,\frac{n M_{n,3}}{B_n^3} \le C\,\frac{(\ln n)^{p - 3/2}}{n^{1/2}} \to 0$$
as $n \to \infty$. Therefore, we have shown that under the additional assumption $B_n^2 \asymp n$,
$$\frac{1 - F_n(x)}{1 - \Phi(x)} = 1 + o(1)$$
as $n \to \infty$ with $1 \le x \le c\sqrt{\ln n}$.
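As an informal check of Theorem 3.2 (our addition), one can simulate a simple 1-dependent sequence, say $X_k = \varepsilon_k \varepsilon_{k+1}$ with $\varepsilon_i$ i.i.d. Rademacher, for which $EX_k = 0$, all moments are bounded and $B_n^2 = n$, and compare the simulated tail of $S_n/B_n$ with the standard normal tail at a moderate-deviation point $x$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, reps, x = 200, 400_000, 2.5              # illustrative values only
# 1-dependent sequence: X_k = eps_k * eps_{k+1}, eps_i i.i.d. Rademacher;
# cross terms have mean zero, so B_n^2 = E S_n^2 = n.
eps = rng.integers(0, 2, size=(reps, n + 1), dtype=np.int8) * 2 - 1
S = (eps[:, :-1] * eps[:, 1:]).sum(axis=1)   # S_n; here B_n = sqrt(n)
print((S > x * np.sqrt(n)).mean() / norm.sf(x))  # ratio should be near 1
```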
