
CHAPTER 14

Hypothesis testing and confidence regions

The current framework of hypothesis testing is largely due to the work of Neyman and Pearson in the late 1920s and early 1930s, complementing Fisher's work on estimation. As in estimation, we begin by postulating a statistical model, but instead of seeking an estimator of θ in Θ we consider the question whether θ ∈ Θ₀ ⊂ Θ or θ ∈ Θ₁ = Θ − Θ₀ is mostly supported by the observed data. The discussion which follows will proceed in a similar way, though less systematically and formally, to the discussion of estimation. This is due to the complexity of the topic, which arises mainly because one is asked to assimilate too many concepts too quickly just to be able to define the problem properly. This difficulty, however, is inherent in testing if any proper understanding of the topic is to be attempted, and is thus unavoidable. Every effort is made to ensure that the formal definitions are supplemented with intuitive explanations and examples. In Sections 14.1 and 14.2 the concepts needed to define a test and some criteria for 'good' tests are discussed using a simple example. In Section 14.3 the question of constructing 'good' tests is considered, and Section 14.4 discusses the likelihood ratio test procedure. Section 14.5 relates hypothesis testing to confidence estimation, bringing out the duality between the two areas.

14.1

Testing, definitions and concepts

Let X be a random variable (r.v.) defined on the probability space (S, ℱ, P(·)) and consider the statistical model associated with X:

(i) Φ = {f(x; θ), θ ∈ Θ};
(ii) X = (X₁, X₂, ..., Xₙ)′ is a random sample from f(x; θ).

The problem of hypothesis testing is one of deciding whether or not some conjecture about θ of the form 'θ belongs to some subset Θ₀ of Θ' is supported by the data x = (x₁, x₂, ..., xₙ)′. We call such a conjecture the null hypothesis and denote it by H₀: θ ∈ Θ₀. If the sample realisation x ∈ C₀ we accept H₀; if x ∈ C₁ we reject it. The mapping which enables us to define C₀ and C₁ we call a test statistic, τ(X): 𝒳 → ℝ (see Fig. 11.4).
In order to illustrate the concepts introduced so far let us consider the

following example. Let X be the random variable representing the marks

achieved by students in an econometric theory paper and let the statistical
model be:

(i) Φ = { f(x; θ) = (1/(8√(2π))) exp[−½((x − θ)/8)²], θ ∈ Θ = [0, 100] };
(ii) X = (X₁, X₂, ..., Xₙ)′, n = 40, is a random sample from f(x; θ).

The hypothesis to be tested is

H₀: θ = 60 (i.e. X ~ N(60, 64)), Θ₀ = {60}

against

H₁: θ ≠ 60 (i.e. X ~ N(μ, 64), μ ≠ 60), Θ₁ = [0, 100] − {60}.

Common sense suggests that if some 'good' estimator of θ, say X̄ₙ = (1/n) Σᵢ₌₁ⁿ Xᵢ, for the sample realisation x takes a value 'around' 60, then we will be inclined to accept H₀. Let us formalise this argument:

The acceptance region takes the form 60 − ε ≤ X̄ₙ ≤ 60 + ε, i.e.

C₀ = {x: |X̄ₙ − 60| < ε}   is the acceptance region and   C₁ = {x: |X̄ₙ − 60| ≥ ε}

is the rejection region.

The next question is, 'How do we choose ε?' If ε is too small we run the risk of rejecting H₀ when it is true; we call this type I error. On the other hand, if ε is too large we run the risk of accepting H₀ when it is false; we call this type II error. Formally, if x ∈ C₁ (reject H₀) and θ ∈ Θ₀ (H₀ is true) we commit a type I error; if x ∈ C₀ (accept H₀) and θ ∈ Θ₁ (H₀ is false) we commit a type II error (see Table 14.1).

Table 14.1

             H₀ accepted      H₀ rejected
H₀ true      correct          type I error
H₀ false     type II error    correct


The hypothesis to be tested is formally stated as follows:

H₀: θ ∈ Θ₀,   Θ₀ ⊂ Θ.   (14.1)

Against the null hypothesis H₀ we postulate the alternative H₁, which takes the form

H₁: θ ∉ Θ₀   (14.2)

or, equivalently,

H₁: θ ∈ Θ₁ = Θ − Θ₀.   (14.3)

It is important to note at the outset that H₀ and H₁ are in effect hypotheses about the distribution of the sample f(x; θ), i.e.

H₀: f(x; θ), θ ∈ Θ₀,   H₁: f(x; θ), θ ∈ Θ₁.   (14.4)

A hypothesis, H₀ or H₁, is called simple if knowing θ ∈ Θ₀ or θ ∈ Θ₁ specifies f(x; θ) completely; otherwise it is called a composite hypothesis. That is, if f(x; θ), θ ∈ Θ₀, or f(x; θ), θ ∈ Θ₁, contains only one density function, we say that H₀ or H₁, respectively, is a simple hypothesis; otherwise they are said to be composite.

In testing a null hypothesis H₀ against an alternative H₁ the issue is to decide whether the sample realisation x 'supports' H₀ or H₁. In the former case we say that H₀ is accepted, in the latter that H₀ is rejected. In order to be able to make such a decision we need to formulate a mapping which relates Θ₀ to some subset of the observation space 𝒳, say C₀, which we call an acceptance region, and its complement C₁ (C₀ ∪ C₁ = 𝒳, C₀ ∩ C₁ = ∅), which we call the rejection region (see Fig. 11.4). Obviously, in any particular situation we cannot say for certain in which of the four boxes of Table 14.1 we are; at best we can only make a probabilistic statement relating to this. Moreover, if we were to choose ε 'too small' we would run a higher risk of committing a type I error than of committing a type II error, and vice versa. That is, there is a trade-off between the probability of type I error, i.e.

Pr(x ∈ C₁; θ ∈ Θ₀) = α,   (14.5)

and the probability β of type II error, i.e.

Pr(x ∈ C₀; θ ∈ Θ₁) = β.   (14.6)

Ideally we would like α = β = 0 for all θ ∈ Θ, which is not possible for a fixed n. Moreover, we cannot control both simultaneously because of the trade-off between them. 'How do we proceed, then?' In order to help us decide, let us consider the close analogy between this problem and the dilemma facing the jury in a trial of a criminal offence.



The jury in a criminal offence trial are instructed to choose between:

H₀: the accused is not guilty; and
H₁: the accused is guilty;

with their decision based on the evidence presented in the court. This evidence in hypothesis testing comes in the form of Φ and X. The jury are instructed to accept H₀ unless they have been convinced beyond any reasonable doubt otherwise. This requirement is designed to protect an innocent person from being convicted, and it corresponds to choosing a small value for α, the probability of convicting the accused when innocent. By adopting such a strategy, however, they are running the risk of letting a number of 'crooks off the hook'. This corresponds to being prepared to accept a relatively high value of β, the probability of not convicting the accused when guilty, in order to protect an innocent person from conviction. This is based on the moral argument that it is preferable to let off a number of guilty people rather than to sentence an innocent person. However, we can never be sure that an innocent person has not been sent to prison, and the strategy is designed to keep the probability of this happening very low. A similar strategy is also adopted in hypothesis testing, where a small value of α is chosen and, for the given α, β is minimised. Formally, this amounts to choosing α* such that

Pr(x ∈ C₁; θ ∈ Θ₀) = α(θ) ≤ α*   for θ ∈ Θ₀,   (14.7)

and

Pr(x ∈ C₀; θ ∈ Θ₁) = β(θ)   is minimised for θ ∈ Θ₁,   (14.8)

by choosing C₁ or C₀ appropriately.
In the case of the above example, if we were to choose α, say α* = 0.05, then

Pr(|X̄ₙ − 60| ≥ ε; θ = 60) = 0.05.   (14.9)

This represents a probabilistic statement with ε being the only unknown. 'How do we determine ε, then?' Being a probabilistic statement, it must be based on some distribution. The only random variable involved in the statement is X̄ₙ, and hence it has to be its sampling distribution. For the above probabilistic statement to have any operational meaning which would enable us to determine ε, the distribution of X̄ₙ must be known. In the present case we know that

X̄ₙ ~ N(θ, σ²/n),   where σ²/n = 64/40 = 1.6,   (14.10)

which implies that for θ = 60 (i.e. when H₀ is true)

τ(X) = (X̄ₙ − 60)/1.265 ~ N(0, 1),   (14.11)



and thus the distribution of τ(·) is known completely (no unknown parameters). When this is the case, this distribution can be used in conjunction with the above probabilistic statement to determine ε. In order to do this we need to relate |X̄ₙ − 60| to τ(X) (a statistic) whose distribution is known. The obvious way to do this is to standardise the former, i.e. consider |X̄ₙ − 60|/1.265, which is equal to |τ(X)|. This suggests changing the above probabilistic statement to the equivalent statement

Pr(|X̄ₙ − 60|/1.265 ≥ c_α; θ = 60) = 0.05,   where c_α = ε/1.265.   (14.12)

Given that the distribution of the test statistic τ(X) is symmetric and we want to determine c_α such that Pr(|τ(X)| ≥ c_α) = 0.05, we should choose the value of c_α from the tables of N(0, 1) which leaves α*/2 = 0.025 probability on either side of the distribution, as shown in Fig. 14.1. The value of c_α given by the N(0, 1) tables is c_α = 1.96. This in turn implies that the rejection region for the test is

C₁ = {x: |(X̄ₙ − 60)/1.265| > 1.96} = {x: |τ(X)| > 1.96}   (14.13)

or

C₁ = {x: |X̄ₙ − 60| > 2.48}.   (14.14)

That is, for sample realisations x which give rise to X̄ₙ falling outside the interval (57.52, 62.48) we reject H₀.
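The numbers in (14.13) and (14.14) can be checked directly; the following is a minimal sketch using only the Python standard library, with σ = 8, n = 40 and α = 0.05 as in the example:

```python
from statistics import NormalDist

# Marks example: X ~ N(theta, 64), n = 40, H0: theta = 60, size alpha = 0.05.
n, sigma, theta0, alpha = 40, 8.0, 60.0, 0.05
se = sigma / n ** 0.5                          # sqrt(64/40) = 1.265
c_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided N(0, 1) critical value
eps = c_alpha * se                             # half-width of the acceptance region
print(round(se, 3))                            # 1.265
print(round(c_alpha, 2))                       # 1.96
print((round(theta0 - eps, 2), round(theta0 + eps, 2)))  # (57.52, 62.48)
```

The acceptance interval (57.52, 62.48) quoted in the text is recovered exactly.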
Let us summarise the argument so far in order to keep the discussion in perspective. We set out to construct a test for H₀: θ = 60 against H₁: θ ≠ 60, and intuition suggested the rejection region {x: |X̄ₙ − 60| > ε}. In order to determine ε we had to:

(i) choose an α; and then
(ii) define the rejection region in terms of some statistic τ(X).

The latter is necessary to enable us to determine ε via some known distribution. This is the distribution of the test statistic τ(X) under H₀ (i.e. when H₀ is true).
Fig. 14.1. The rejection region (14.13).



Given that C₁ = {x: |τ(X)| > 1.96} defines a test with α = 0.05, the question which naturally arises is: 'What do we need the probability of type II error, β, for?' The answer is that we need β to decide whether the test defined in terms of C₁ is a 'good' or a 'bad' test. As we mentioned at the outset, the way we decided to 'solve' the problem of the trade-off between α and β was to choose a small value for α and define C₁ so as to minimise β. At this stage we do not know whether the test defined above is a 'good' test or not. Let us set up the apparatus which will enable us to consider the question of optimality.
14.2

Optimal tests

Since the acceptance and rejection regions constitute a partition of the observation space 𝒳, i.e. C₀ ∪ C₁ = 𝒳 and C₀ ∩ C₁ = ∅, it follows that Pr(x ∈ C₀) = 1 − Pr(x ∈ C₁) for all θ ∈ Θ. Hence, minimisation of Pr(x ∈ C₀; θ ∈ Θ₁) is equivalent to maximising Pr(x ∈ C₁; θ ∈ Θ₁).

Definition 1

The probability of rejecting H₀ when false at some point θ₁ ∈ Θ₁, i.e. Pr(x ∈ C₁; θ = θ₁), is called the power of the test at θ = θ₁. Note that

Pr(x ∈ C₁; θ = θ₁) = 1 − Pr(x ∈ C₀; θ = θ₁) = 1 − β(θ₁).   (14.15)

In the case of the example above we can define the power of the test at some θ₁ ∈ Θ₁, say θ₁ = 54, to be Pr(|X̄ₙ − 60|/1.265 > 1.96; θ = 54). 'How do we calculate this probability?' The temptation is to suggest using the same distribution as above, i.e. τ(X) = (X̄ₙ − 60)/1.265 ~ N(0, 1). This is, however, wrong, because θ is no longer equal to 60; we assumed that θ = 54, and thus (X̄ₙ − 54)/1.265 ~ N(0, 1). This implies that

τ(X) ~ N((54 − 60)/1.265, 1)   for θ = 54.

Using this we can define the power of the test at θ = 54 to be

Pr(|X̄ₙ − 60|/1.265 > 1.96; θ = 54)
   = Pr((X̄ₙ − 54)/1.265 > 1.96 + (60 − 54)/1.265) + Pr((X̄ₙ − 54)/1.265 < −1.96 + (60 − 54)/1.265)
   = 0.9973.

Hence, the power of the test defined by C₁ above is indeed very high for θ = 54. In order to be able to decide how good such a test is, however, we



need to calculate the power for all θ ∈ Θ₁. Following the same procedure, the power of the test defined by C₁ for θ = 56, 58, 60, 62, 64, 66 is as follows:

Pr(|τ(X)| > 1.96; θ = 56) = 0.8849,
Pr(|τ(X)| > 1.96; θ = 58) = 0.3520,
Pr(|τ(X)| > 1.96; θ = 60) = 0.05,
Pr(|τ(X)| > 1.96; θ = 62) = 0.3520,
Pr(|τ(X)| > 1.96; θ = 64) = 0.8849,
Pr(|τ(X)| > 1.96; θ = 66) = 0.9973.

As we can see, the power of the test increases as we go further away from θ = 60 (H₀), and the power at θ = 60 equals the probability of type I error.
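These table-based power values can be reproduced numerically. The sketch below (standard library only) recomputes the power function of C₁ and agrees with the figures quoted above to about three decimal places, the small discrepancies being due to rounding in the normal tables:

```python
from statistics import NormalDist

Z = NormalDist()
se = 8 / 40 ** 0.5                  # 1.265, as in the example

def power(theta, theta0=60.0, c=1.96):
    # Pr(|tau(X)| > c) when the true mean is theta; tau(X) = (Xbar - theta0)/se
    delta = (theta - theta0) / se   # mean of tau(X) when the true mean is theta
    return 1 - (Z.cdf(c - delta) - Z.cdf(-c - delta))

for th in (54, 56, 58, 60, 62, 64, 66):
    print(th, round(power(th), 4))
```

Note the symmetry of the power about θ = 60 and the value ~0.05 at θ = 60 itself.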

This prompts us to define the power function as follows:

Definition 2

P(θ) = Pr(x ∈ C₁), θ ∈ Θ, is called the power function of the test defined by the rejection region C₁.

Definition 3

α = max_{θ∈Θ₀} P(θ) is defined to be the size (or the significance level) of the test.

In the case where H₀ is simple, say θ = θ₀, then α = P(θ₀). These definitions enable us to define a criterion for 'a best' test of a given size α as the one (if it exists) whose power function P(θ), θ ∈ Θ₁, is maximum at every θ.

Definition 4

A test of H₀: θ ∈ Θ₀ against H₁: θ ∈ Θ₁, as defined by some rejection region C₁, is said to be a uniformly most powerful (UMP) test of size α if:

(i) max_{θ∈Θ₀} P(θ) = α;
(ii) P(θ) ≥ P*(θ) for all θ ∈ Θ₁,

where P*(θ) is the power function of any other test of size α.
As we saw above, in order to be able to determine the power function we need to know the distribution of the test statistic τ(X) (in terms of which C₁ is defined) under H₁ (i.e. when H₀ is false). The concept of a UMP test provides us with the criterion needed to choose between tests for the same H₀.

Let us consider the question of optimality for the size 0.05 test derived


C₁* = {x: τ(X) ≥ 1.645}   (14.16)

C₁** = {x: τ(X) ≤ −1.645}   (14.17)

C₁*** = {x: |τ(X)| ≤ 0.063}   (14.18)

Fig. 14.2. The rejection regions (14.16), (14.17) and (14.18).

above, with rejection region

C₁ = {x: |τ(X)| > 1.96}.   (14.19)

To that end we shall compare the power of this test with the power of the size 0.05 tests defined by the rejection regions (14.16), (14.17) and (14.18) (see Fig. 14.2). All the rejection regions define size 0.05 tests for H₀: θ = 60 against H₁: θ ≠ 60. In order to discriminate between 'bad', 'good' and 'better' tests we have to calculate their power functions and compare them. The power functions P(θ), P*(θ), P**(θ), P***(θ) are illustrated in Fig. 14.3.

Looking at the diagram we can see that only one thing is clear cut: C₁*** defines a very bad test, its power function being dominated by those of the other tests. Comparing the other three tests we can see that C₁* is more powerful than the other two for θ > 60, but P*(θ) < α for θ < 60; C₁** is more powerful than the other two for θ < 60, but P**(θ) < α for θ > 60. None of the tests is more powerful over the whole range. That is, there is no UMP test of size 0.05 for H₀: θ = 60 against H₁: θ ≠ 60. As will be seen in the sequel, no UMP tests exist in most situations of interest in practice. The procedure adopted in such cases is to reduce the class of all tests to some subclass by imposing some more criteria and to consider the question of UMP tests within



Fig. 14.3. The power functions P(θ), P*(θ), P**(θ), P***(θ).

the subclass. One of the most important restrictions used in this context is the criterion of unbiasedness.

Definition 5

A test of H₀: θ ∈ Θ₀ against H₁: θ ∈ Θ₁ is said to be unbiased if

max_{θ∈Θ₀} P(θ) ≤ min_{θ∈Θ₁} P(θ).   (14.20)

In other words, a test is unbiased if it rejects H₀ more often when H₀ is false than when it is true; a minimal but sensible requirement. Another form these added restrictions can take, which reduces the problem to one where UMP tests do exist, is related to the probability model Φ. These include restrictions such as that Φ belongs to the one-parameter exponential family.
In the case of the above example we can see that the test defined by C₁*** is biased, and that C₁ is UMP within the class of unbiased tests. This is because C₁* and C₁** are biased for θ < 60 and θ > 60, respectively. It is obvious, however, that for

H₀: θ = 60

against

H₁′: θ > 60
or
H₁*: θ < 60,

the tests defined by C₁* and C₁** are UMP, respectively. That is, for the one-sided alternatives there exist UMP tests, given by C₁* and C₁**. It is important to note that in the case of H₁′ and H₁* above the parameter space implicitly assumed is different. In the case of H₁′ the parameter space implicitly assumed is Θ = [60, 100] and in the case of H₁*, Θ = [0, 60]. This is needed in order to ensure that Θ₀ and Θ₁ constitute a partition of Θ.



Collecting all the above concepts together, we say that a test has been defined when the following components have been specified:

(T1) a test statistic τ(X);
(T2) the size of the test α;
(T3) the distribution of τ(X) under H₀ and H₁;
(T4) the rejection region C₁ (or, equivalently, the acceptance region C₀).

Let us illustrate this using the marks example above. The test statistic is

τ(X) = √n(X̄ₙ − θ₀)/σ = (X̄ₙ − 60)/1.265;   (14.21)

we call it a statistic because σ is known and θ₀ is known under H₀ and H₁. If we choose the size α = 0.05, the fact that τ(X) ~ N(0, 1) under H₀ enables us to define the rejection region C₁ = {x: |τ(X)| > c_α}, where c_α is determined from Pr(|τ(X)| > c_α; θ = 60) = 0.05 to be 1.96 from the standard normal tables, i.e. if φ(z) denotes the density function of N(0, 1), then

∫ from −c_α to c_α of φ(z) dz = 0.95.   (14.22)


In order to derive the power function we need the distribution of τ(X) under H₁. Under H₁ we know that

τ*(X) = √n(X̄ₙ − θ₁)/σ ~ N(0, 1)   for any θ₁ ∈ Θ₁,   (14.23)

and hence we can relate τ(X) to τ*(X) by

τ(X) = τ*(X) + √n(θ₁ − θ₀)/σ   (14.24)

to deduce that

τ(X) ~ N(√n(θ₁ − θ₀)/σ, 1)   (14.25)

under H₁. This enables us to define the power function as

P(θ₁) = Pr(x: |τ(X)| > c_α)
      = Pr(τ*(X) < −c_α − √n(θ₁ − θ₀)/σ) + Pr(τ*(X) > c_α − √n(θ₁ − θ₀)/σ),   θ₁ ∈ Θ₁.   (14.26)

Using the power function, this test can be shown to be UMP unbiased.
The most important component in defining a test is the test statistic for



which we need to know its distribution under both H₀ and H₁. Hence, constructing an optimal test is largely a matter of being able to find a statistic τ(X) with the following properties:

(i) τ(X) depends on X via a 'good' estimator of θ; and
(ii) the distribution of τ(X) under both H₀ and H₁ does not depend on any unknown parameters.

We call such a statistic a pivot. It is no exaggeration to say that hypothesis testing is based on our ability to construct such pivots. When X is a random sample from N(μ, σ²), pivots are readily available in the form of

√n(X̄ₙ − μ)/σ ~ N(0, 1),   √n(X̄ₙ − μ)/s ~ t(n − 1),   (n − 1)s²/σ² ~ χ²(n − 1),   (14.27)

but in general such pivots are very hard to come by.
The first pivot was used above to construct tests for μ when σ² is known (both one-sided and two-sided tests). The second pivot can be used to set up similar tests for μ when σ² is unknown. For example, for testing H₀: μ = μ₀ against H₁: μ ≠ μ₀ the rejection region can be defined by

C₁ = {x: |τ₁(X)| > c_α},   where τ₁(X) = √n(X̄ₙ − μ₀)/s,   (14.28)

and c_α can be determined by ∫ from −c_α to c_α of f(t) dt = 1 − α, f(t) being the density of Student's t-distribution with n − 1 degrees of freedom. For H₀: μ = μ₀ against H₁: μ < μ₀ the rejection region takes the form

C₁ = {x: τ₁(X) ≤ c_α},   with α = ∫ from −∞ to c_α of f(t) dt   (14.29)

determining c_α.
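As a sketch of how (14.28) is used in practice; the data below are hypothetical, and the critical value 2.262 (for t(9) with α = 0.05, two-sided) is taken from standard t-tables rather than computed:

```python
# Two-sided t-test of H0: mu = mu0 based on tau1(X) = sqrt(n)(Xbar - mu0)/s.
def t_statistic(xs, mu0):
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # sample variance s^2
    return n ** 0.5 * (xbar - mu0) / s2 ** 0.5

data = [52, 55, 61, 58, 64, 60, 57, 59, 63, 56]      # hypothetical marks, n = 10
tau1 = t_statistic(data, mu0=60)
reject = abs(tau1) > 2.262                           # t(9) critical value, alpha = 0.05
print(round(tau1, 3), reject)                        # -1.286 False
```

Here |τ₁| = 1.286 < 2.262, so H₀: μ = 60 is accepted at the 0.05 level for this (made-up) sample.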
The pivot

τ₂(X) = (n − 1)s²/σ₀² ~ χ²(n − 1)   under σ² = σ₀²   (14.30)

can be used to test hypotheses about σ². For example, in the case of a random sample from N(μ, σ²), for testing H₀: σ² ≥ σ₀² against H₁: σ² < σ₀² the rejection region for an optimal test takes the form

C₁ = {x: τ₂(X) ≤ c_α},   where c_α is determined via ∫ from 0 to c_α of f(χ²) dχ² = α.   (14.31)
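A sketch of this one-sided variance test; the sample variance below is hypothetical, and the lower 0.05 critical value 10.85 of χ²(20) is taken from chi-square tables:

```python
# H0: sigma^2 >= 64 against H1: sigma^2 < 64, with n = 21 and s^2 = 40 (made up).
def tau2(n, s2, sigma0_sq):
    # the pivot (n - 1)s^2/sigma0^2, chi-square(n - 1) when sigma^2 = sigma0^2
    return (n - 1) * s2 / sigma0_sq

stat = tau2(n=21, s2=40.0, sigma0_sq=64.0)
print(stat)            # 12.5
print(stat <= 10.85)   # False: do not reject H0 at the 0.05 level
```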



14.3

Constructing optimal tests

In constructing the tests considered so far we used ad hoc intuitive arguments which led us to a pivot. As with estimation, it would be helpful if there were general methods for constructing optimal tests. It turns out that the availability of a method for constructing optimal tests depends crucially on the nature of the hypotheses (H₀ and H₁) and/or the probability model postulated. As far as the nature of H₀ and H₁ is concerned, existence and optimality depend crucially on whether these hypotheses are simple or composite. As mentioned in Section 14.2, a hypothesis H₀ or H₁ is called simple if Θ₀ or Θ₁, respectively, contains just one point. In the case of the 'marks' example above, Θ₀ = {60} and Θ₁ = [0, 60) ∪ (60, 100], i.e. H₀ is simple and H₁ is composite, since it contains more than one point. Care should be exercised when θ is a vector of unknown parameters, because in such a case Θ₀ or Θ₁ must contain single vectors as well in order to be simple. For example, in the case of sampling from N(μ, σ²) where σ² is not known, H₀: μ = μ₀ is not a simple hypothesis, since Θ₀ = {(μ₀, σ²), σ² ∈ ℝ₊}.

(1)   Simple null and simple alternative

The theory concerning two simple hypotheses was fully developed in the 1920s by Neyman and Pearson. Let

Φ = {f(x; θ), θ ∈ Θ}

be the probability model and X = (X₁, X₂, ..., Xₙ)′ the sampling model, and consider the simple null and simple alternative H₀: θ = θ₀ and H₁: θ = θ₁, Θ = {θ₀, θ₁}, i.e. there are only two possible distributions in Φ, namely f(x; θ₀) and f(x; θ₁). Given the available data x we want to choose between the two distributions. The following theorem provides us with sufficient conditions for the existence of a UMP test for this, the simplest of the cases in testing.
Neyman–Pearson theorem

Let X = (X₁, X₂, ..., Xₙ)′ be a sample from a continuous distribution f(x; θ), θ ∈ Θ = {θ₀, θ₁}. If there exists a test with rejection region

C₁ = {x: f(x; θ₀)/f(x; θ₁) ≤ c_α},   (14.32)

for some positive constant c_α, such that

Pr(x ∈ C₁; θ = θ₀) = α,   (14.33)

then C₁ defines a UMP test for H₀: θ = θ₀ against H₁: θ = θ₁ of size α.
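For the normal case discussed below, the ratio in (14.32) depends on x only through the sample mean, which is what makes the theorem operational. A small sketch (θ₀ = 60, θ₁ = 62, σ = 8, n = 40 are illustrative numbers in the spirit of the marks example):

```python
# log of the Neyman-Pearson ratio f(x; th0)/f(x; th1) for X ~ N(theta, sigma^2):
# it is strictly decreasing in the sample mean when th0 < th1, so the region
# {ratio <= c} is the same family of regions as {sample mean >= c'}.
def log_ratio(xbar, n, sigma, th0, th1):
    return -(n / (2 * sigma ** 2)) * ((th0 ** 2 - th1 ** 2) - 2 * xbar * (th0 - th1))

vals = [log_ratio(xb, n=40, sigma=8, th0=60, th1=62) for xb in (58, 60, 62, 64)]
print(vals)                                        # [3.75, 1.25, -1.25, -3.75]
assert all(a > b for a, b in zip(vals, vals[1:]))  # decreasing in xbar
```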



In this simple case

P(θ) = α for θ = θ₀,   P(θ) = 1 − β for θ = θ₁.   (14.34)

The Neyman–Pearson theorem suggests that it is intuitively sensible to base the acceptance or rejection of H₀ on the relative values of the distributions of the sample evaluated at θ = θ₀ and θ = θ₁, i.e. reject H₀ if the ratio f(x; θ₀)/f(x; θ₁) is relatively small. This amounts to rejecting H₀ when the evidence in the form of x favours H₁, giving it a higher 'support'. It is very important to note that the Neyman–Pearson theorem does not solve the problem completely, because the problem of relating the ratio f(x; θ₀)/f(x; θ₁) to a pivotal quantity (test statistic) remains. Consider the case where X ~ N(θ, σ²), σ² known, and we want to test H₀: θ = θ₀ against H₁: θ = θ₁ (θ₀ < θ₁). From the Neyman–Pearson theorem we know that the rejection region defined in terms of the ratio

λ(x; θ₀, θ₁) = f(x; θ₀)/f(x; θ₁) = exp{ −(n/2σ²)[(θ₀² − θ₁²) − 2X̄ₙ(θ₀ − θ₁)] }   (14.35)

can provide us with a UMP test if Pr(x ∈ C₁; θ = θ₀) = α exists for some α. The ratio as it stands is not a proper pivot as we defined it. We know, however, that any monotonic transformation of the ratio generates the same family of rejection regions. Thus we can define
τ(X) = √n(X̄ₙ − θ₀)/σ = −[σ/(√n(θ₁ − θ₀))] log λ(x; θ₀, θ₁) + √n(θ₁ − θ₀)/(2σ),   (14.36)

in terms of which we can define the rejection region as

C₁ = {x: τ(X) ≥ c_α*}.   (14.37)

C₁ defines a UMP test of size α if

Pr(x ∈ C₁; θ = θ₀) = α   (14.38)

exists.

Remark: in the case of a discrete random variable, a prescribed α might not exist, since Pr(x ∈ C₁; θ = θ₀) takes discrete values.



For example, if α = 0.05 then c_α* = 1.645, and the power of the test is

P(θ₁) = Pr(τ₁(X) > c_α* − √n(θ₁ − θ₀)/σ) = 1 − β,   (14.39)

where

τ₁(X) = √n(X̄ₙ − θ₁)/σ ~ N(0, 1)   under H₁.   (14.40)

In this case we can control β if we can increase the sample size, since

1 − β = Pr(τ₁(X) > c_α* − √n(θ₁ − θ₀)/σ),   (14.41)

and the shift √n(θ₁ − θ₀)/σ grows with n.
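The role of the sample size in (14.41) can be made concrete. In this sketch θ₀ = 60, θ₁ = 62 and σ = 8 are illustrative, c* = 1.645 as above, and the power rises towards 1 as n grows:

```python
from statistics import NormalDist

Z = NormalDist()

def power(n, theta0=60.0, theta1=62.0, sigma=8.0, c_star=1.645):
    # 1 - beta = Pr(tau1(X) > c* - sqrt(n)(theta1 - theta0)/sigma), tau1 ~ N(0, 1)
    shift = n ** 0.5 * (theta1 - theta0) / sigma
    return 1 - Z.cdf(c_star - shift)

print([round(power(n), 3) for n in (10, 40, 160)])
```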

For the hypothesis H₀: θ = θ₀ against H₁: θ = θ₁ when θ₁ < θ₀, the test statistic takes the same form

τ(X) = √n(X̄ₙ − θ₀)/σ = −[σ/(√n(θ₁ − θ₀))] log λ(x; θ₀, θ₁) + √n(θ₁ − θ₀)/(2σ),   now with (θ₁ − θ₀) < 0,

which gives rise to the rejection region

C₁ = {x: τ(X) ≤ c_α}.   (14.42)

(2)   Composite null and composite alternative (one-parameter case)

For the hypothesis

H₀: θ ≥ θ₀
against
H₁: θ < θ₀,

being the other extreme from two simple hypotheses, no such result as the Neyman–Pearson theorem exists, and it comes as no surprise that no UMP tests exist in general. The only result of some interest in this case is that if we restrict the probability model by requiring the density functions to have a monotone likelihood ratio in the test statistic τ(X), then UMP tests do exist. This result is of limited value, however, since it does not provide us with a method to derive τ(X).



(3)   Simple H₀ against composite H₁

In the case where we want to test H₀: θ = θ₀ against H₁: θ > θ₀ (or θ < θ₀), uniformly most powerful (UMP) tests do not exist in general. In some particular cases, however, such UMP tests do exist, and the Neyman–Pearson theorem can help us derive them. If the UMP test for the simple H₀: θ = θ₀ against the simple H₁: θ = θ₁ does not depend on θ₁, then the same test is UMP for the one-sided alternative θ > θ₀ (or θ < θ₀). In the example discussed above the tests defined by

C₁ = {x: τ(X) ≥ c_α*}   (14.43)

and

C₁ = {x: τ(X) ≤ c_α}   (14.44)

are also UMP for the hypotheses H₀: θ = θ₀ against H₁: θ > θ₀ and H₀: θ = θ₀ against H₁: θ < θ₀, respectively. This is indeed confirmed by the diagram of the power functions derived for the 'marks' example above.

Another result in the simple class of hypotheses is available in the case where sampling is from a one-parameter exponential family of densities (normal, binomial, Poisson, etc.). In such cases UMP tests do exist for one-sided alternatives.
Two-sided alternatives

For testing H₀: θ = θ₀ against H₁: θ ≠ θ₀ no UMP tests exist in general. This is rather unfortunate, since most tests in practice are of this type. One interesting result in this case is that if we restrict the probability model to the one-parameter exponential family and narrow down the class of tests by imposing unbiasedness, then we know that UMP tests do exist. The test defined by the rejection region

C₁ = {x: |τ(X)| > c_α}   (14.45)

(see the 'marks' example) is indeed UMP unbiased, the one-sided tests being biased over part of Θ₁.

14.4

The likelihood ratio test procedure

The discussion so far suggests that no UMP tests exist for a wide variety of cases which are important in practice. However, the likelihood ratio test procedure yields very satisfactory tests for a great number of cases where none of the above methods is applicable. It is particularly valuable in the case where both hypotheses are composite and θ is a vector of parameters. This procedure not only has a lot of intuitive appeal but also frequently leads to UMP tests or UMP unbiased tests (when such exist).



Consider

H₀: θ ∈ Θ₀
against
H₁: θ ∈ Θ₁.

Let the likelihood function be L(θ; x); then the likelihood ratio is defined by

λ(x) = max_{θ∈Θ₀} L(θ; x) / max_{θ∈Θ} L(θ; x).   (14.46)

The numerator measures the highest 'support' x renders to θ ∈ Θ₀, and the denominator measures the maximum value of the likelihood function (see Fig. 14.4). By definition λ(x) can never exceed unity, and the smaller it is the less H₀ is 'supported' by the data. This suggests that the rejection region based on λ(x) must be of the form

C₁ = {x: λ(x) ≤ k},   (14.47)

with the size defined by max_{θ∈Θ₀} P(θ) = α.

Fig. 14.4. The likelihood ratio test.

α and k, as well as the power function, can only be defined when the



distribution of λ(x) under both H₀ and H₁ is known. This is usually the exception rather than the rule. The exceptions arise when Φ is a normal family of densities and X is a random sample, in which case λ(x) is often a monotone function of some of the pivots we encountered above. Let us illustrate the procedure and the difficulties arising by considering several examples.

Example 1

Let

Φ = { f(x; μ, σ²) = (1/(σ√(2π))) exp[−½((x − μ)/σ)²], θ ≡ (μ, σ²) ∈ ℝ × ℝ₊ }

be the probability model and X = (X₁, X₂, ..., Xₙ)′ a random sample from f(x; μ, σ²), and consider H₀: μ = μ₀ against H₁: μ ≠ μ₀. Since

L(θ; x) = (2πσ²)^(−n/2) exp[ −(1/2σ²) Σᵢ₌₁ⁿ (xᵢ − μ)² ],

maximising over Θ₀ and over Θ gives

λ(x) = [ Σᵢ₌₁ⁿ (xᵢ − μ₀)² / Σᵢ₌₁ⁿ (xᵢ − X̄ₙ)² ]^(−n/2).


At first sight it might seem an impossible task to determine the distribution of λ(x). Note, however, that

Σᵢ₌₁ⁿ (xᵢ − μ₀)² = Σᵢ₌₁ⁿ (xᵢ − X̄ₙ)² + n(X̄ₙ − μ₀)²,

which implies that

λ(x) = [ 1 + n(X̄ₙ − μ₀)²/Σᵢ₌₁ⁿ (xᵢ − X̄ₙ)² ]^(−n/2) = [ 1 + W²/(n − 1) ]^(−n/2),

where W = √n(X̄ₙ − μ₀)/s ~ t(n − 1) under H₀, and W ~ t(n − 1; δ) under H₁, with non-centrality parameter δ = √n(μ₁ − μ₀)/σ, μ₁ ∈ Θ₁.

Since λ(x) is a monotone decreasing function of |W|, the rejection region takes the form

C₁ = {x: |W| > c_α},

and α, c_α and the power function P(θ) can be derived from the distribution of W.
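The monotonicity claim is easy to verify numerically; the sketch below evaluates λ(x) = (1 + W²/(n − 1))^(−n/2) on a grid of W values (n = 40 as in the marks example):

```python
# lambda(x) as a function of W = sqrt(n)(Xbar - mu0)/s: decreasing in |W|,
# so a small lambda(x) corresponds exactly to a large |W|.
def lam(w, n):
    return (1 + w ** 2 / (n - 1)) ** (-n / 2)

n = 40
vals = [lam(w, n) for w in (0.0, 0.5, 1.0, 2.0, 3.0)]
print([round(v, 4) for v in vals])
assert vals[0] == 1.0                              # lambda = 1 at W = 0
assert all(a > b for a, b in zip(vals, vals[1:]))  # strictly decreasing in |W|
```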
Example 2

In the context of the statistical model of Example 1, consider

H₀: σ² = σ₀²
against
H₁: σ² ≠ σ₀²,

with Θ₀ = {(μ, σ₀²), μ ∈ ℝ} and Θ = {(μ, σ²), μ ∈ ℝ, σ² ∈ ℝ₊}. Maximising the likelihood under Θ₀ and under Θ gives

λ(x) = (v/n)^(n/2) exp[(n − v)/2],   where v = Σᵢ₌₁ⁿ (xᵢ − X̄ₙ)²/σ₀² ~ χ²(n − 1)   under H₀

and v ~ χ²(n − 1; δ) under H₁, with δ = σ₁²/σ₀², σ₁² ∈ Θ₁. The inequality λ(x) ≤ k is equivalent to

v ≤ k₁   or   v ≥ k₂,

with k₁ and k₂ defined by

∫ from k₁ to k₂ of dχ²(n − 1) = 1 − α,

e.g. if α = 0.1 and n − 1 = 30, the equal-tail choice gives k₁ = 18.5, k₂ = 43.8.

Hence, the rejection region is C₁ = {x: v ≤ k₁ or v ≥ k₂}. Using the analogy between this and the various tests of μ we encountered so far, we can postulate that in the case of the one-sided hypotheses:

(i) H₀: σ² ≥ σ₀², H₁: σ² < σ₀², the rejection region is C₁ = {x: v ≤ k₁};
(ii) H₀: σ² ≤ σ₀², H₁: σ² > σ₀², the rejection region is C₁ = {x: v ≥ k₂}.
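The quoted cut-offs can be checked with the closed-form χ² distribution function for even degrees of freedom (a standard identity, so only the Python standard library is needed); the values 18.5 and 43.8 do leave roughly 0.05 in each tail of χ²(30):

```python
import math

def chi2_cdf_even(x, df):
    # chi-square CDF for even df via the Poisson-sum identity:
    # P(X <= x) = 1 - exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
    k = df // 2
    s = sum((x / 2) ** j / math.factorial(j) for j in range(k))
    return 1 - math.exp(-x / 2) * s

lo, hi = chi2_cdf_even(18.5, 30), chi2_cdf_even(43.8, 30)
print(round(lo, 3), round(hi, 3))   # approximately 0.05 and 0.95
```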
The question arising at this stage is: 'What use is the likelihood ratio test procedure if the distribution of λ(X) is only known when a well-known pivot exists already?' The answer is that it is reassuring to know that the procedure in these cases leads to certain well-known pivots, because the likelihood ratio test procedure is of considerable importance when no such pivots exist. Under certain conditions we can derive the asymptotic distribution of λ(X). In particular, we can show that under certain regularity conditions

−2 log λ(X) ∼ₐ χ²(r)   under H₀   (14.48)



(where '∼ₐ under H₀' reads 'asymptotically distributed under H₀'), r being the number of parameters tested. This will be pursued further in Section 16.2.
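For Example 1 the asymptotic result can be seen directly: there −2 log λ(X) = n log(1 + W²/(n − 1)), which approaches W² as n grows, and W² has an asymptotic χ²(1) distribution under H₀ (r = 1). A small numerical sketch:

```python
import math

def minus_2_log_lam(w, n):
    # -2 log lambda(x) for Example 1, as a function of W and n
    return n * math.log(1 + w ** 2 / (n - 1))

for n in (10, 100, 1000):
    print(n, round(minus_2_log_lam(1.5, n), 4))   # tends to W^2 = 2.25
```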
14.5

Confidence estimation

In point estimation, when an estimator θ̂ of θ is constructed we usually think of it not just as a point but as a point surrounded by some region of possible error, i.e. θ̂ ± ε, where ε is related to the standard error of θ̂. This can be viewed as a crude form of a confidence interval for θ:

(θ̂ − ε ≤ θ ≤ θ̂ + ε);   (14.49)

crude because there is no guarantee that such an interval will include θ. Indeed, we can show that the probability that θ does not belong to this interval is positive. In order to formalise this argument we need to attach probabilities to such intervals. In general, interval estimation refers to constructing random intervals of the form

(τₗ(X) ≤ θ ≤ τᵤ(X)),   (14.50)

together with an associated probability for such a statement being valid. τₗ(X) and τᵤ(X) are two statistics, referred to as the lower and upper 'bound' respectively; they are in effect stochastic bounds on θ. The associated probability will take the form

Pr(τₗ(X) ≤ θ ≤ τᵤ(X)) = 1 − α,   (14.51)

where the probabilistic statement is based on the distribution of τₗ(X) and τᵤ(X). The main problem is to construct such statistics whose distribution does not depend on the unknown parameter(s) θ. This, however, is the same problem as in hypothesis testing. In that context we 'solved' the problem by seeking what we called pivots, and intuition suggests that the same quantities might be of use in the present context. It turns out that not only is this indeed the case, but the similarity between interval estimation and hypothesis testing does not end here. Any size α test about θ can be transformed directly into an interval estimator of θ with 1 − α confidence level.

Definition 6

The interval (τₗ(X), τᵤ(X)) is called a (1 − α) confidence interval for θ if for all θ ∈ Θ

Pr(τₗ(X) ≤ θ ≤ τᵤ(X)) ≥ 1 − α.   (14.52)

(1 − α) is called the probability of coverage of the interval, and the statement


suggests that in the long run (in repeated experiments) the random interval (τₗ(X), τᵤ(X)) will include the 'true' but unknown θ. For any particular realisation x, however, we do not know 'for sure' whether (τₗ(X), τᵤ(X)) includes the 'true' θ or not; we are only (1 − α) confident that it does. The duality between hypothesis testing and confidence intervals can be seen in the 'marks' example discussed above. For the null hypothesis

H₀: θ = θ₀,   θ₀ ∈ Θ,
against
H₁: θ ≠ θ₀,

we constructed a size α test based on the acceptance region

C₀(θ₀) = {x: θ₀ − c_α σ/√n ≤ X̄ₙ ≤ θ₀ + c_α σ/√n},   (14.53)

with c_α defined by

∫ from −c_α to c_α of φ(z) dz = 1 − α,   Z ~ N(0, 1).   (14.54)

This implies that Pr(x ∈ C₀; θ = θ₀) = 1 − α, and hence by a simple manipulation of C₀ we can define the (1 − α) confidence interval

C(X) = {θ₀: X̄ₙ − c_α σ/√n ≤ θ₀ ≤ X̄ₙ + c_α σ/√n},   (14.55)

for which

Pr(θ₀ ∈ C(X)) = 1 − α.   (14.56)
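In numbers, for the marks example (X̄ₙ = 61 is a hypothetical sample realisation; σ = 8, n = 40 and α = 0.05 as before), inverting the acceptance region gives the interval (14.55) directly:

```python
from statistics import NormalDist

n, sigma, alpha, xbar = 40, 8.0, 0.05, 61.0    # xbar is a made-up realisation
c = NormalDist().inv_cdf(1 - alpha / 2)        # 1.96
half = c * sigma / n ** 0.5                    # c * sigma / sqrt(n)
ci = (round(xbar - half, 2), round(xbar + half, 2))
print(ci)                                      # (58.52, 63.48)
```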


In general, any acceptance region for a size α test can be transformed into a (1 − α) confidence interval for θ by changing C₀, a function of x ∈ 𝒳, into C, a function of θ₀ ∈ Θ. One-sided tests correspond to one-sided confidence intervals of the form

Pr(τₗ(X) ≤ θ) ≥ 1 − α   (14.57)
or
Pr(θ ≤ τᵤ(X)) ≥ 1 − α.   (14.58)

In general, when Θ ⊆ ℝᵐ, m ≥ 1, a family of subsets C(X) of Θ, where C(X) depends on X but not on θ, is called a random region. For example,

C(X) = {θ: τₗ(X) ≤ θ}   or   C(X) = {θ: θ ≤ τᵤ(X)}.   (14.59)

The problem of confidence estimation is one of constructing a random region C(X) such that, for a given α ∈ (0, 1),

Pr(x: θ ∈ C(X); θ) ≥ 1 − α   for all θ ∈ Θ.   (14.60)


