Class Notes in Statistics and Econometrics, Part 8

CHAPTER 15
Hypothesis Testing
Imagine you are a business person considering a major investment in order to
launch a new product. The sales prospects of this product are not known with
certainty. You have to rely on the outcome of n marketing surveys that measure
the demand for the product once it is offered. If µ is the actual (unknown) rate of
return on the investment, each of these surveys will be modeled as a random
variable which has a Normal distribution with this mean µ and known variance 1.
Let y_1, y_2, ..., y_n be the observed survey results. How would you decide whether to
build the plant?
The intuitively reasonable thing to do is to go ahead with the investment if
the sample mean of the observations is greater than a given value c, and not to do
it otherwise. This is indeed an optimal decision rule, and we will discuss in what
respect it is, and how c should be picked.
Your decision can be the wrong decision in two different ways: either you decide
to go ahead with the investment although there will be no demand for the product,
or you fail to invest although there would have been demand. There is no decision
rule which eliminates both errors at once; the first error would be minimized by the
rule never to produce, and the second by the rule always to produce. In order to
determine the right tradeoff between these errors, it is important to be aware of their
asymmetry. The error of going ahead with production although there is no demand has
potentially disastrous consequences (loss of a lot of money), while the other error
may cause you to miss a profit opportunity, but there is no actual loss involved, and


presumably you can find other opportunities to invest your money.
To express this asymmetry, the error with the potentially disastrous consequences
is called “error of type one,” and the other “error of type two.” The distinction
between type one and type two errors can also be made in other cases. Locking up
an innocent person is an error of type one, while letting a criminal go unpunished
is an error of type two; publishing a paper with false results is an error of type one,
while foregoing an opportunity to publish is an error of type two (at least this is
what it ought to be).
Such an asymmetric situation calls for an asymmetric decision rule. One needs
strict safeguards against committing an error of type one, and if there are several
decision rules which are equally safe with respect to errors of type one, then one will
select among those that decision rule which minimizes the error of type two.
Let us look here at decision rules of the form: make the investment if ¯y > c.
An error of type one occurs if the decision rule advises you to make the investment
while there is no demand for the product. This will be the case if ¯y > c but µ ≤ 0.
The probability of this error depends on the unknown parameter µ, but it is at most
α = Pr[¯y > c |µ = 0]. This maximum value of the type one error probability is called
the significance level, and you, as the director of the firm, will have to decide on α
depending on how tolerable it is to lose money on this venture, which presumably
depends on the chances to lose money on alternative investments. It is a serious
shortcoming of the classical theory of hypothesis testing that it does not provide
good guidelines how α should be chosen, and how it should change with sample size.
Instead, there is the tradition to choose α to be either 5% or 1% or 0.1%. Given α,
a table of the cumulative standard normal distribution function allows you to find
that c for which Pr[¯y > c |µ = 0] = α.
Problem 213. 2 points Assume each y_i ∼ N(µ, 1), n = 400 and α = 0.05, and the
different y_i are independent. Compute the value c which satisfies Pr[¯y > c | µ = 0] =
α. You should either look it up in a table and include a xerox copy of the table with
the entry circled and the complete bibliographic reference written on the xerox copy,
or do it on a computer, writing exactly which commands you used. In R, the function
qnorm does what you need; find out about it by typing help(qnorm).
Answer. In the case n = 400, ¯y has variance 1/400 and therefore standard deviation 1/20 =
0.05. Therefore 20¯y is a standard normal: from Pr[¯y > c | µ = 0] = 0.05 follows Pr[20¯y > 20c | µ =
0] = 0.05. Therefore 20c = 1.645 can be looked up in a table, perhaps use [JHG+88, p. 986], the
row for ∞ d.f.

Let us do this in R. The p-“quantile” of the distribution of the random variable y is defined
as that value q for which Pr[y ≤ q] = p. If y is normally distributed, this quantile is computed
by the R-function qnorm(p, mean=0, sd=1, lower.tail=TRUE). In the present case we need either
qnorm(p=1-0.05, mean=0, sd=0.05) or qnorm(p=0.05, mean=0, sd=0.05, lower.tail=FALSE), which
gives the value 0.08224268.
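The same computation can be sketched in Python as an alternative to the R command in the text; scipy.stats.norm.ppf plays the role of R's qnorm:

```python
# Sketch: reproduce the Problem 213 computation in Python instead of R.
from scipy.stats import norm

n = 400
alpha = 0.05
sd_ybar = 1 / n**0.5          # standard deviation of the sample mean: 1/20 = 0.05

# c such that Pr[ybar > c | mu = 0] = alpha
c = norm.ppf(1 - alpha, loc=0, scale=sd_ybar)
print(round(c, 8))            # approximately 0.08224268
```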

Choosing a decision which makes a loss unlikely is not enough; your decision
must also give you a chance of success. E.g., the decision rule to build the plant if
−0.06 ≤ ¯y ≤ −0.05 and not to build it otherwise is completely perverse, although
the significance level of this decision rule is approximately 4% (if n = 100). In other
words, the significance level is not enough information for evaluating the performance
of the test. You also need the “power function,” which gives you the probability
with which the test advises you to make the “critical” decision, as a function of
the true parameter values. (Here the “critical” decision is that decision which might
potentially lead to an error of type one.)

Figure 1. Eventually this Figure will show the Power function of
a one-sided normal test, i.e., the probability of error of type one as
a function of µ; right now this is simply the cdf of a Standard Normal

By the definition of the significance level, the
power function does not exceed the significance level for those parameter values for
which going ahead would lead to a type 1 error. But only those tests are “powerful”
whose power function is high for those parameter values for which it would be correct
to go ahead. In our case, the power function must be below 0.05 when µ ≤ 0, and
we want it as high as possible when µ > 0. Figure 1 shows the power function for
the decision rule to go ahead whenever ¯y ≥ c, where c is chosen in such a way that
the significance level is 5%, for n = 100.
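The power function just described can be computed directly; here is a sketch in Python for n = 100 and significance level 5%:

```python
# Sketch: the power function of the one-sided test "go ahead if ybar >= c"
# for n = 100 observations y_i ~ N(mu, 1) and significance level 5%.
from scipy.stats import norm

n = 100
sd_ybar = 1 / n**0.5                       # standard deviation of ybar: 0.1
c = norm.ppf(0.95) * sd_ybar               # critical value, about 0.1645

def power(mu):
    # probability that ybar >= c when the true mean is mu
    return norm.sf(c, loc=mu, scale=sd_ybar)

print(round(power(0.0), 4))   # 0.05: the significance level, attained at mu = 0
print(round(power(0.3), 4))   # high power for mu well above 0
```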
The hypothesis whose rejection, although it is true, constitutes an error of type
one, is called the null hypothesis, and its alternative the alternative hypothesis. (In the
examples the null hypotheses were: the return on the investment is zero or negative,
the defendant is innocent, or the results about which one wants to publish a research
paper are wrong.) The null hypothesis is therefore the hypothesis that nothing is
the case. The test tests whether this hypothesis should be rejected; it safeguards
against erroneously rejecting the hypothesis one wants to reject but is afraid to
reject in error. If you reject the null hypothesis, you don’t want to regret it.
Mathematically, every test can be identified with its null hypothesis, which is
a region in parameter space (often consisting of one point only), and its “critical
region,” which is the event that the test comes out in favor of the “critical decision,”
i.e., rejects the null hypothesis. The critical region is usually an event of the form
that the value of a certain random variable, the “test statistic,” is within a given
range, usually that it is too high. The power function of the test is the probability
of the critical region as a function of the unknown parameters, and the significance
level is the maximum (or, if this maximum depends on unknown parameters, any
upper bound) of the power function over the null hypothesis.
Problem 214. Mr. Jones is on trial for counterfeiting Picasso paintings, and
you are an expert witness who has developed fool-proof statistical significance tests
for identifying the painter of a given painting.
• a. 2 points There are two ways you can set up your test.
a: You can either say: The null hypothesis is that the painting was done by
Picasso, and the alternative hypothesis that it was done by Mr. Jones.
b: Alternatively, you might say: The null hypothesis is that the painting was
done by Mr. Jones, and the alternative hypothesis that it was done by Pi-
casso.
Does it matter which way you do the test, and if so, which way is the correct one?
Give a reason for your answer, i.e., say what would be the consequences of testing in
the incorrect way.
Answer. The determination of what the null and what the alternative hypothesis is depends
on what is considered to be the catastrophic error which is to be guarded against. In a trial, Mr.
Jones is considered innocent until proven guilty. Mr. Jones should not be convicted unless he can be
proven guilty beyond “reasonable doubt.” Therefore the test must be set up in such a way that the
hypothesis that the painting is by Picasso will only be rejected if the chance that it is actually by
Picasso is very small. The error of type one is that the painting is considered counterfeited although
it is really by Picasso. Since the error of type one is always the error of rejecting the null hypothesis
although it is true, solution a. is the correct one. You are not proving, you are testing.
• b. 2 points After the trial a customer calls you who is in the process of acquiring
a very expensive alleged Picasso painting, and who wants to be sure that this painting
is not one of Jones’s falsifications. Would you now set up your test in the same way
as in the trial or in the opposite way?
Answer. It is worse to spend money on a counterfeit painting than to forego purchasing a
true Picasso. Therefore the null hypothesis would be that the painting was done by Mr. Jones, i.e.,
it is the opposite way.

Problem 215. 7 points Someone makes an extended experiment throwing a coin
10,000 times. The relative frequency of heads in these 10,000 throws is a random
variable. Given that the probability of getting a head is p, what are the mean and
standard deviation of the relative frequency? Design a test, at 1% significance level,
of the null hypothesis that the coin is fair, against the alternative hypothesis that
p < 0.5. For this you should use the central limit theorem. If heads showed 4,900
times, would you reject the null hypothesis?
Answer. Let x_i be the random variable that equals one when the i-th throw is a head, and
zero otherwise. The expected value of x is p, the probability of throwing a head. Since x² = x,
var[x] = E[x] − (E[x])² = p(1 − p). The relative frequency of heads is simply the average of all x_i;
call it ¯x. It has mean p and variance σ²_¯x = p(1 − p)/10,000. Given that it is a fair coin, its mean is 0.5 and
its standard deviation is 0.005. Reject if the actual frequency < 0.5 − 2.326 σ_¯x = 0.48837. Another
approach:

(15.0.33) Pr[¯x ≤ 0.49] = Pr[(¯x − 0.5)/0.005 ≤ −2] = 0.0227,

since the fraction is, by the central limit theorem, approximately a standard normal random variable.
Therefore do not reject.
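The rejection threshold and the p-value in this answer can be checked numerically; a sketch in Python:

```python
# Sketch: the coin-tossing test of Problem 215 done numerically.
from scipy.stats import norm

n = 10000
p0 = 0.5
sd = (p0 * (1 - p0) / n) ** 0.5            # 0.005 under the null

# one-sided 1% test against p < 0.5: reject if the relative frequency
# falls below this threshold
threshold = p0 + norm.ppf(0.01) * sd       # 0.5 - 2.326*0.005, about 0.48837
observed = 4900 / n                        # 0.49

p_value = norm.cdf((observed - p0) / sd)   # about 0.0227
print(round(threshold, 5), observed > threshold, round(p_value, 4))
```

Since 0.49 is above the threshold, the null hypothesis is not rejected, agreeing with the answer above.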
15.1. Duality between Significance Tests and Confidence Regions
There is a duality between confidence regions with confidence level 1 − α and
certain significance tests. Let us look at a family of significance tests, which all have
a significance level ≤ α, and which define for every possible value of the parameter
φ_0 ∈ Ω a critical region C(φ_0) for rejecting the simple null hypothesis that the true
parameter is equal to φ_0. The condition that all significance levels are ≤ α means
mathematically

(15.1.1) Pr[C(φ_0) | φ = φ_0] ≤ α for all φ_0 ∈ Ω.
Mathematically, confidence regions and such families of tests are one and the
same thing: if one has a confidence region R(y), one can define a test of the null
hypothesis φ = φ
0
as follows: for an observed outcome y reject the null hypothesis
if and only if φ
0
is not contained in R(y). On the other hand, given a family of tests,
one can build a confidence region by the prescription: R(y) is the set of all those
parameter values which would not be rejected by a test based on observation y.
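This inversion can be illustrated concretely. The following sketch (a hypothetical example, not from the text) inverts a family of two-sided z-tests: for each candidate value φ_0 it tests H_0: µ = φ_0 at level α and keeps the φ_0 that are not rejected, which recovers the familiar interval ¯y ± z·σ_¯y:

```python
# Sketch (hypothetical numbers): build a confidence region by inverting
# a family of z-tests, illustrating the duality described in the text.
from scipy.stats import norm

alpha = 0.05
n, ybar = 25, 1.2                  # assumed sample size and observed mean
sd_ybar = 1 / n**0.5               # each y_i ~ N(mu, 1)
z = norm.ppf(1 - alpha / 2)        # about 1.96 for a two-sided test

def rejected(phi0):
    # two-sided z-test of H0: mu = phi0
    return abs(ybar - phi0) / sd_ybar > z

# R(y) = set of all phi0 that are not rejected; on a grid this is
# the interval ybar +/- z * sd_ybar
grid = [i / 1000 for i in range(0, 3001)]
region = [phi0 for phi0 in grid if not rejected(phi0)]
print(round(min(region), 3), round(max(region), 3))
```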
Problem 216. Show that with these definitions, equations (14.0.5) and (15.1.1)
are equivalent.

Answer. Since φ_0 ∈ R(y) iff y ∈ C′(φ_0) (the complement of the critical region rejecting that
the parameter value is φ_0), it follows that Pr[φ_0 ∈ R(y) | φ = φ_0] = 1 − Pr[C(φ_0) | φ = φ_0] ≥ 1 − α.
This duality is discussed in [BD77, pp. 177–182].
15.2. The Neyman Pearson Lemma and Likelihood Ratio Tests
Look one more time at the example with the fertilizer. Why are we considering
only regions of the form ¯y ≥ µ_0? Why not one of the form µ_1 ≤ ¯y ≤ µ_2, or maybe not
use the mean but decide to build if y_1 ≥ µ_3? Here the µ_1, µ_2, and µ_3 can be chosen
such that the probability of committing an error of type one is still α.

It seems intuitively clear that these alternative decision rules are not reasonable.
The Neyman-Pearson lemma proves this intuition right. It says that the critical
regions of the form ¯y ≥ µ_0 are uniformly most powerful, in the sense that every
other critical region with the same probability of type one error has equal or higher
probability of committing an error of type two, regardless of the true value of µ.
Here are the formulation and proof of the Neyman-Pearson lemma, first for the
case that both null hypothesis and alternative hypothesis are simple: H_0: θ = θ_0,
H_A: θ = θ_1. In other words, we want to determine on the basis of the observations of
the random variables y_1, ..., y_n whether the true θ was θ_0 or θ_1, and a determination
θ = θ_1 when in fact θ = θ_0 is an error of type one. The critical region C is the set of
all outcomes that lead us to conclude that the parameter has value θ_1.

The Neyman-Pearson lemma says that a uniformly most powerful test exists in
this situation. It is a so-called likelihood-ratio test, which has the following critical
region:

(15.2.1) C = {y_1, ..., y_n : L(y_1, ..., y_n; θ_1) ≥ k L(y_1, ..., y_n; θ_0)}.

C consists of those outcomes for which θ_1 is at least k times as likely as θ_0 (where k
is chosen such that Pr[C | θ_0] = α).
To prove that this decision rule is uniformly most powerful, assume D is the critical
region of a different test with the same significance level α, i.e., if the null hypothesis
is correct, then C and D reject (and therefore commit an error of type one) with
equally low probabilities α. In formulas, Pr[C | θ_0] = Pr[D | θ_0] = α. Look at Figure 2
with C = U ∪ V and D = V ∪ W. Since C and D have the same significance level,
it follows that

(15.2.2) Pr[U | θ_0] = Pr[W | θ_0].

Also

(15.2.3) Pr[U | θ_1] ≥ k Pr[U | θ_0],
since U ⊂ C, and C was chosen such that the likelihood (density) function of the
alternative hypothesis is high relative to that of the null hypothesis. Since W lies
outside C, the same argument gives

(15.2.4) Pr[W | θ_1] ≤ k Pr[W | θ_0].

Linking these two inequalities and the equality gives

(15.2.5) Pr[W | θ_1] ≤ k Pr[W | θ_0] = k Pr[U | θ_0] ≤ Pr[U | θ_1],

hence Pr[D | θ_1] ≤ Pr[C | θ_1]. In other words, if θ_1 is the correct parameter value, then
C will discover this and reject at least as often as D. Therefore C is at least as
powerful as D, or the type two error probability of C is at least as small as that of
D.
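A small numerical illustration of the lemma (a hypothetical example, not from the text): for a single observation y ~ N(µ, 1), compare the likelihood-ratio region y ≥ 1.645 with a "perverse" interval region that has the same significance level; the LR region has the higher power under the alternative:

```python
# Sketch: Neyman-Pearson numerically, for one observation y ~ N(mu, 1),
# H0: mu = 0 vs H_A: mu = 1, alpha = 0.05.
from scipy.stats import norm

alpha = 0.05
c = norm.ppf(1 - alpha)                    # about 1.645: Pr[y >= c | mu=0] = 0.05

# an interval (a, b) in the left tail with the same null probability 0.05
a, b = norm.ppf(0.01), norm.ppf(0.06)

power_lr = norm.sf(c, loc=1)               # Pr[y >= c | mu=1]
power_alt = norm.cdf(b, loc=1) - norm.cdf(a, loc=1)
print(round(power_lr, 3), round(power_alt, 3))   # LR region wins by far
```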
Figure 2. Venn Diagram for Proof of Neyman Pearson Lemma

Back to our fertilizer example. To make both null and alternative hypotheses
simple, assume that either µ = 0 (fertilizer is ineffective) or µ = t for some fixed
t > 0. Then the likelihood ratio critical region has the form

(15.2.6) C = {y_1, ..., y_n : (1/√2π)^n e^{−½((y_1−t)² + ··· + (y_n−t)²)} ≥ k (1/√2π)^n e^{−½(y_1² + ··· + y_n²)}}

(15.2.7) = {y_1, ..., y_n : −½((y_1 − t)² + ··· + (y_n − t)²) ≥ ln k − ½(y_1² + ··· + y_n²)}

(15.2.8) = {y_1, ..., y_n : t(y_1 + ··· + y_n) − t²n/2 ≥ ln k}

(15.2.9) = {y_1, ..., y_n : ¯y ≥ ln k/(nt) + t/2},
i.e., C has the form ¯y ≥ some constant. The dependence of this constant on k is not
relevant, since this constant is usually chosen such that the maximum probability of
an error of type one is equal to the given significance level.
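The algebra from (15.2.6) to (15.2.9) can be spot-checked numerically; this sketch (with assumed values of n, t, and k) verifies that the likelihood-ratio condition and the threshold rule agree on random samples:

```python
# Sketch: check that the likelihood ratio condition (15.2.6) is equivalent
# to the threshold rule ybar >= ln(k)/(n t) + t/2 from (15.2.9).
import math, random

random.seed(0)
n, t, k = 5, 0.8, 2.0              # assumed values for the check

def lr_condition(y):
    # log of L(y; mu=t) / L(y; mu=0); the (2*pi) factors cancel
    log_ratio = (-0.5 * sum((yi - t) ** 2 for yi in y)
                 + 0.5 * sum(yi ** 2 for yi in y))
    return log_ratio >= math.log(k)

def threshold_condition(y):
    ybar = sum(y) / n
    return ybar >= math.log(k) / (n * t) + t / 2

for _ in range(1000):
    y = [random.gauss(0, 1) for _ in range(n)]
    assert lr_condition(y) == threshold_condition(y)
print("equivalent on 1000 random samples")
```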
Problem 217. 8 points You have four independent observations y_1, ..., y_4 from
an N(µ, 1), and you are testing the null hypothesis µ = 0 against the alternative
hypothesis µ = 1. For your test you are using the likelihood ratio test with critical
region

(15.2.10) C = {y_1, ..., y_4 : L(y_1, ..., y_4; µ = 1) ≥ 3.633 · L(y_1, ..., y_4; µ = 0)}.

Compute the significance level of this test. (According to the Neyman-Pearson
lemma, this is the uniformly most powerful test for this significance level.) Hints:
In order to show this you need to know that ln 3.633 = 1.290; everything else can be
done without a calculator. Along the way you may want to show that C can also be
written in the form C = {y_1, ..., y_4 : y_1 + ··· + y_4 ≥ 3.290}.
Answer. Here is the equation which determines when y_1, ..., y_4 lie in C:

(15.2.11) (2π)^{−2} exp(−½((y_1 − 1)² + ··· + (y_4 − 1)²)) ≥ 3.633 · (2π)^{−2} exp(−½(y_1² + ··· + y_4²))

(15.2.12) −½((y_1 − 1)² + ··· + (y_4 − 1)²) ≥ ln(3.633) − ½(y_1² + ··· + y_4²)

(15.2.13) y_1 + ··· + y_4 − 2 ≥ 1.290
Since Pr[y_1 + ··· + y_4 ≥ 3.290] = Pr[z = (y_1 + ··· + y_4)/2 ≥ 1.645] and z is a standard normal, one
obtains the significance level of 5% from the standard normal table or the t-table.
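The arithmetic of this answer can be verified in a few lines of Python:

```python
# Sketch: verify the Problem 217 computation numerically.
import math
from scipy.stats import norm

k = 3.633
# the critical region reduces to y1 + ... + y4 >= ln(k) + 2
cutoff = math.log(k) + 2            # about 3.290
# y1 + ... + y4 ~ N(0, 4) under H0, so z = sum/2 is standard normal
significance = norm.sf(cutoff / 2)  # about 0.05
print(round(cutoff, 3), round(significance, 4))
```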
Note that due to the properties of the Normal distribution, this critical region,
for a given significance level, does not depend at all on the value of t. Therefore this
test is uniformly most powerful against the composite hypothesis µ > 0.
One can also write the null hypothesis as the composite hypothesis µ ≤ 0, because
the highest probability of a type one error will still be attained when µ = 0. This
completes the proof that the test given in the original fertilizer example is uniformly
most powerful.
Most other distributions discussed here are equally well behaved; therefore uniformly
most powerful one-sided tests exist not only for the mean of a normal with
known variance, but also for the variance of a normal with known mean, and for the
parameters of the Bernoulli and Poisson distributions.
However the given one-sided hypothesis is the only situation in which a uniformly
most powerful test exists. In other situations, the generalized likelihood ratio test has
good properties even though it is no longer uniformly most powerful. Many known
tests (e.g., the F test) are generalized likelihood ratio tests.
Assume you want to test the composite null hypothesis H_0: θ ∈ ω, where ω is
a subset of the parameter space, against the alternative H_A: θ ∈ Ω, where Ω ⊃ ω
is a more comprehensive subset of the parameter space. ω and Ω are defined by
functions with continuous first-order derivatives. The generalized likelihood ratio
critical region has the form

(15.2.14) C = {x_1, ..., x_n : sup_{θ∈Ω} L(x_1, ..., x_n; θ) / sup_{θ∈ω} L(x_1, ..., x_n; θ) ≥ k}

where k is chosen such that the probability of the critical region when the null
hypothesis is true has as its maximum the desired significance level. It can be shown
that twice the log of this quotient is asymptotically distributed as a χ²_{q−s}, where q
is the dimension of Ω and s the dimension of ω. (Sometimes the likelihood ratio
is defined as the inverse of this ratio, but whenever possible we will define our test
statistics so that the null hypothesis is rejected if the value of the test statistic is
too large.)
In order to perform a likelihood ratio test, the following steps are necessary:
first construct the MLEs for θ ∈ Ω and θ ∈ ω, then take twice the difference of the
attained levels of the log likelihood functions, and compare with the χ² tables.
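These steps can be sketched on a small example (hypothetical data, not from the text): testing H_0: µ = 0 against an unrestricted µ for y_i ~ N(µ, 1), where the MLE over Ω is ¯y, the MLE over ω is 0, and twice the log likelihood difference works out to n¯y²:

```python
# Sketch (assumed data): a generalized likelihood ratio test of H0: mu = 0
# for y_i ~ N(mu, 1).  Here 2*log(lambda) = n*ybar^2, and the asymptotic
# chi-square with q - s = 1 - 0 = 1 degrees of freedom applies.
import math
from scipy.stats import chi2

y = [0.8, -0.2, 1.1, 0.5, 0.9, 1.4, -0.1, 0.6]   # assumed sample
n = len(y)
ybar = sum(y) / n

def loglik(mu):
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - mu) ** 2 for yi in y)

# MLE over Omega is ybar; over omega = {0} it is 0
stat = 2 * (loglik(ybar) - loglik(0))            # equals n * ybar**2
p_value = chi2.sf(stat, df=1)
print(round(stat, 4), round(p_value, 4))
```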
15.3. The Runs Test
[Spr98, pp. 171–175] is a good introductory treatment, similar to the one given
here. More detail in [GC92, Chapter 3] (not in University of Utah Main Library)
and even more in [Bra68, Chapters 11 and 23] (which is in the Library).
Each of your three research assistants has to repeat a certain experiment 9 times,
and record whether each experiment was a success (1) or a failure (0). In all cases, the
experiments happen to have been successful 4 times. Assistant A has the following
sequence of successes and failures: 0, 1, 0, 0, 1, 0, 1, 1, 0; B has 0, 1, 0, 1, 0, 1, 0, 1, 0; and
C has 1, 1, 1, 1, 0, 0, 0, 0, 0.
On the basis of these results, you suspect that the experimental setup used by
B and C is faulty: for C, it seems that something changed over time so that the
first experiments were successful and the later experiments were not. Or perhaps
the fact that a given experiment was a success (failure) made it more likely that
the next experiment would also be a success (failure). For B, the opposite effect seems
to have taken place.

From the pattern of successes and failures you made inferences about whether
the outcomes were independent or followed some regularity. A mathematical
formalization of this inference counts “runs” in each sequence of outcomes. A run is a
succession of several ones or zeros. The first sequence has 7 runs, the second 9, and
the third only 2. Given that the number of successes is 4 and the number of failures
is 5, 9 runs seem too many and 2 runs too few.
The “runs test” (sometimes also called “run test”) exploits this in the following
way: it counts the number of runs, and then asks if this is a reasonable number of
runs to expect given the total number of successes and failures. It rejects whenever
the number of runs is either too large or too low.
The choice of the number of runs as test statistic cannot be derived from a like-
lihood ratio principle, since we did not specify the joint distribution of the outcome
of the experiment. But the above argument says that it will probably detect at least
some of the cases we are interested in.
In order to compute the error of type one, we will first derive the probability
distribution of the number of runs conditionally on the outcome that the number of
successes is 4. This conditional distribution can be computed, even if we do not know
the probability of success of each experiment, as long as their joint distribution has
the following property (which holds under the null hypothesis of statistical independence):
the probability of a given sequence of failures and successes only depends on
the number of failures and successes, not on the order in which they occur. Then the
conditional distribution of the number of runs can be obtained by simple counting.

How many arrangements of 5 zeros and 4 ones are there? The answer is
\binom{9}{4} = (9·8·7·6)/(1·2·3·4) = 126. How many of these arrangements have 9 runs? Only one, i.e., the
probability of having 9 runs (conditionally on observing 4 successes) is 1/126. The
probability of having 2 runs is 2/126, since one can either have the zeros first, or the
ones first.
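The counting argument can be confirmed by brute-force enumeration; a sketch in Python:

```python
# Sketch: enumerate all 126 arrangements of 5 zeros and 4 ones and
# tabulate the number of runs, reproducing the counts in the text.
from itertools import combinations
from collections import Counter

def n_runs(seq):
    # a new run starts at every position where the symbol changes
    return 1 + sum(a != b for a, b in zip(seq, seq[1:]))

counts = Counter()
for ones in combinations(range(9), 4):     # positions of the 4 ones
    seq = tuple(1 if i in ones else 0 for i in range(9))
    counts[n_runs(seq)] += 1

print(sum(counts.values()))                # 126 arrangements in total
print(counts[9], counts[2], counts[7])     # 1, 2, 18
```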
In order to compute the probability of 7 runs, let us first ask: what is the probability
of having 4 runs of ones and 3 runs of zeros? Since there are only 4 ones, each
run of ones must have exactly one element. So the distribution of ones and zeros
must be:

1 − one or more zeros − 1 − one or more zeros − 1 − one or more zeros − 1.

In order to specify the distribution of ones and zeros completely, we must therefore
count how many ways there are to split the sequence of 5 zeros into 3 nonempty
batches. Here are the possibilities:

(15.3.1)
0 0 0 | 0 | 0
0 0 | 0 0 | 0
0 0 | 0 | 0 0
0 | 0 0 0 | 0
0 | 0 0 | 0 0
0 | 0 | 0 0 0

Generally, the number of possibilities is \binom{4}{2} = 6, because there are 4 spaces between those
5 zeros, and we have to put in two dividers.
We have therefore 6 possibilities with 4 runs of ones and 3 runs of zeros. Now
how many possibilities are there with 4 runs of zeros and 3 runs of ones? There
are 4 ways to split the 5 zeros into 4 batches, and there are 3 ways to split the 4 ones
into 3 batches, represented by the schemes

(15.3.2)
0 0 | 0 | 0 | 0        1 1 | 1 | 1
0 | 0 0 | 0 | 0        1 | 1 1 | 1
0 | 0 | 0 0 | 0        1 | 1 | 1 1
0 | 0 | 0 | 0 0

One can combine any of the first with any of the second, i.e., one obtains 12 possibilities.
Together the probability of seven runs is therefore 18/126.
One can do the same thing for all other possibilities of runs and will get a
distribution of runs similar to that depicted in the diagram (which is for 7 instead of
9 trials). Mathematically one gets two different formulas according to whether the
number of runs is odd or even: we have a total of m zeros and n ones (it could also
be the other way round), and r is the number of runs:

(15.3.3) Pr[r = 2s + 1] = (\binom{m−1}{s−1}\binom{n−1}{s} + \binom{m−1}{s}\binom{n−1}{s−1}) / \binom{m+n}{m}

(15.3.4) Pr[r = 2s] = 2\binom{m−1}{s−1}\binom{n−1}{s−1} / \binom{m+n}{m}
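These closed-form probabilities can be implemented directly and checked against the counting results above; a sketch for m = 5 zeros and n = 4 ones:

```python
# Sketch: the closed-form runs probabilities (15.3.3)-(15.3.4) and the
# moments (15.3.5), for m = 5 zeros and n = 4 ones.
from math import comb

m, n = 5, 4
total = comb(m + n, m)                     # 126

def pr_runs(r):
    if r % 2 == 0:                         # even number of runs, r = 2s
        s = r // 2
        return 2 * comb(m - 1, s - 1) * comb(n - 1, s - 1) / total
    s = (r - 1) // 2                       # odd number of runs, r = 2s + 1
    return (comb(m - 1, s - 1) * comb(n - 1, s)
            + comb(m - 1, s) * comb(n - 1, s - 1)) / total

mean = 1 + 2 * m * n / (m + n)
var = 2 * m * n * (2 * m * n - m - n) / ((m + n) ** 2 * (m + n - 1))
print(pr_runs(7), pr_runs(9), pr_runs(2))  # 18/126, 1/126, 2/126
print(round(mean, 3), round(var, 3))
```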
Some computer programs (StatXact, www.cytel.com) compute these probabilities
exactly or by Monte Carlo simulation; but there is also an asymptotic test based on
the facts that

(15.3.5) E[r] = 1 + 2mn/(m + n),   var[r] = 2mn(2mn − m − n) / ((m + n)²(m + n − 1)),

and that the standardized number of runs is asymptotically a Normal distribution
(see [GC92, Section 3.2]).
We would therefore reject when the observed number of runs is in the tails of
this distribution. Since the exact test statistic is discrete, we cannot make tests
for every arbitrary significance level. In the given example, if the critical region is
{r = 9}, then the significance level is 1/126. If the critical region is {r = 2 or 9},
the significance level is 3/126.
We said before that we could not make precise statements about the power of
the test, i.e., the error of type two. But we will show that it is possible to make
precise statements about the error of type one.
Right now we only have the conditional probability of errors of type one, given
that there are exactly 4 successes in our 9 trials. And we have no information about
the probability of having indeed four successes, it might be 1 in a million. However in
certain situations, the conditional significance level is exactly what is needed. And

even if the unconditional significance level is needed, there is one way out. If we
were to specify a decision rule for every number of successes in such a way that the
conditional probability of rejecting is the same in all of them, then this conditional
Figure 3. Distribution of runs in 7 trials, if there are 4 successes
and 3 failures
probability is also equal to the unconditional probability. The only problem here
is that, due to discreteness, we can make the probability of type one errors only
approximately equal; but with increasing sample size this problem disappears.
Problem 218. Write approximately 200 x’s and o’s on a piece of paper, trying
to do it in a random manner. Then use a runs test to check whether these x’s and o’s were
indeed random. Would you want to run a two-sided or one-sided test?
The law of rare events literature can be considered a generalization of the run
test. For epidemiology compare [Cha96], [DH94], [Gri79], and [JL97].
15.4. Pearson’s Goodness of Fit Test.
Given an experiment with r outcomes, which have probabilities p_1, ..., p_r, where
Σ p_i = 1, you make n independent trials, and the i-th outcome occurs x_i times.
The x_1, ..., x_r have the multinomial distribution with parameters n and p_1, ..., p_r.
Their mean and covariance matrix are given in equation (8.4.2) above. How do you
test H_0: p_1 = p^0_1, ..., p_r = p^0_r?
Pearson’s Goodness of Fit test uses as test statistic a weighted sum of the squared
deviations of the observed values from their expected values:

(15.4.1) Σ_{i=1}^{r} (x_i − n p^0_i)² / (n p^0_i).

This test statistic is often called the Chi-Square statistic. It is asymptotically
distributed as a χ²_{r−1}; reject the null hypothesis when the observed value of this statistic
is too big; the critical region can be read off a table of the χ².
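A small worked example (hypothetical data, not from the text): testing whether a die is fair from 60 rolls, computing the statistic (15.4.1) directly and comparing it with the χ² with r − 1 degrees of freedom:

```python
# Sketch (assumed counts): Pearson's goodness-of-fit statistic for a die
# with r = 6 outcomes and n = 60 rolls, testing the fair-die null.
from scipy.stats import chi2

x = [8, 11, 9, 12, 6, 14]                  # observed counts, summing to 60
n, r = sum(x), len(x)
p0 = [1 / r] * r                           # null probabilities

# the statistic (15.4.1): sum of (x_i - n p_i)^2 / (n p_i)
stat = sum((xi - n * pi) ** 2 / (n * pi) for xi, pi in zip(x, p0))
p_value = chi2.sf(stat, df=r - 1)          # chi-square with r - 1 d.f.
print(round(stat, 2), round(p_value, 3))
```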
Why does one get a χ² distribution in the limiting case? Because the x_i themselves
are asymptotically normal, and certain quadratic forms of normal distributions
are χ². The matter is made a little complicated by the fact that the x_i are linearly
dependent, since Σ x_j = n, and therefore their covariance matrix is singular. There
are two ways to deal with such a situation. One is to drop one observation; one
will not lose any information by this, and the remaining r − 1 observations are well
behaved. (This explains, by the way, why one has a χ²_{r−1} instead of a χ²_r.)
We will take an alternative route, namely, use theorems which are valid even
if the covariance matrix is singular. This is preferable because it leads to more
unified theories. In equation (10.4.9), we characterized all the quadratic forms of
multivariate normal variables that are χ²’s. Here it is again: Assume y is a jointly
normal vector random variable with mean vector µ and covariance matrix σ²Ψ, and
Ω is a symmetric nonnegative definite matrix. Then (y − µ)⊤Ω(y − µ) ∼ σ²χ²_k iff
ΨΩΨΩΨ = ΨΩΨ and k is the rank of Ω. If Ψ is singular, i.e., does not have an
inverse, and Ω is a g-inverse of Ψ, then condition (10.4.9) holds. A matrix Ω is a
g-inverse of Ψ iff ΨΩΨ = Ψ. Every matrix has at least one g-inverse, but may have
more than one.
Now back to our multinomial distribution. By the central limit theorem, the x_i
are asymptotically jointly normal; their mean and covariance matrix are given by
equation (8.4.2). This covariance matrix is singular (it has rank r − 1), and a g-inverse
is given by (15.4.2), which has on its diagonal exactly the weighting factors used in
the statistic for the goodness of fit test.
Problem 219. 2 points A matrix Ω is a g-inverse of Ψ iff ΨΩΨ = Ψ. Show
that the following matrix

(15.4.2)
(1/n) ·
⎡ 1/p_1   0      ···  0     ⎤
⎢ 0       1/p_2  ···  0     ⎥
⎢ ⋮       ⋮      ⋱    ⋮     ⎥
⎣ 0       0      ···  1/p_r ⎦

is a g-inverse of the covariance matrix of the multinomial distribution given in
(8.4.2).
Answer. Postmultiplied by the g-inverse given in (15.4.2), the covariance matrix from (8.4.2)
becomes

(15.4.3)
n ·
⎡ p_1 − p_1²   −p_1 p_2     ···  −p_1 p_r   ⎤
⎢ −p_2 p_1     p_2 − p_2²   ···  −p_2 p_r   ⎥
⎢ ⋮            ⋮            ⋱    ⋮          ⎥
⎣ −p_r p_1     −p_r p_2     ···  p_r − p_r² ⎦
· (1/n) ·
⎡ 1/p_1   0      ···  0     ⎤
⎢ 0       1/p_2  ···  0     ⎥
⎢ ⋮       ⋮      ⋱    ⋮     ⎥
⎣ 0       0      ···  1/p_r ⎦
=
⎡ 1 − p_1   −p_1      ···  −p_1    ⎤
⎢ −p_2      1 − p_2   ···  −p_2    ⎥
⎢ ⋮         ⋮         ⋱    ⋮       ⎥
⎣ −p_r      −p_r      ···  1 − p_r ⎦
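The g-inverse property can also be spot-checked numerically; a sketch with assumed values of n and p (not from the text):

```python
# Sketch: numerical check of Problem 219 - the scaled diagonal matrix is a
# g-inverse of the multinomial covariance matrix Psi = n*(diag(p) - p p^T).
import numpy as np

n = 50
p = np.array([0.2, 0.3, 0.1, 0.4])             # assumed probabilities
Psi = n * (np.diag(p) - np.outer(p, p))        # covariance matrix from (8.4.2)
Omega = np.diag(1 / p) / n                     # candidate g-inverse (15.4.2)

PsiOmega = Psi @ Omega                          # entry (i,j) = delta_ij - p_i
print(np.allclose(Psi @ Omega @ Psi, Psi))      # g-inverse condition holds
```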