
Comparison of two counts
Suppose that x₁ is a count which can be assumed to follow a Poisson distribution with mean μ₁. Similarly, let x₂ be a count independently following a Poisson distribution with mean μ₂. How might we test the null hypothesis that μ₁ = μ₂?

One approach would be to use the fact that the variance of x₁ − x₂ is μ₁ + μ₂ (by virtue of (3.19) and (4.9)). The best estimate of μ₁ + μ₂ on the basis of the available information is x₁ + x₂. On the null hypothesis E(x₁ − x₂) = μ₁ − μ₂ = 0, and x₁ − x₂ can be taken to be approximately normally distributed unless μ₁ and μ₂ are very small. Hence,

$$z = \frac{x_1 - x_2}{\sqrt{x_1 + x_2}} \qquad (5.7)$$

can be taken as approximately a standardized normal deviate.
A second approach has already been indicated in the test for the comparison of proportions in paired samples (§4.5). Of the total frequency x₁ + x₂, a portion x₁ is observed in the first sample. Writing r = x₁ and n = x₁ + x₂ in (4.17), we have

$$\frac{x_1 - \tfrac{1}{2}(x_1 + x_2)}{\tfrac{1}{2}\sqrt{x_1 + x_2}} = \frac{x_1 - x_2}{\sqrt{x_1 + x_2}}$$

as in (5.7). The two approaches thus lead to exactly the same test procedure.
A third approach uses a rather different application of the χ² test from that described for the 2 × 2 table in §4.5, the total frequency of x₁ + x₂ now being divided into two components rather than four. Corresponding to each observed frequency we can take the expected frequency, on the null hypothesis, to be ½(x₁ + x₂):

               First count        Second count
Observed           x₁                  x₂
Expected      ½(x₁ + x₂)          ½(x₁ + x₂)

Applying the usual formula (4.30) for a χ² statistic, we have

$$X^2 = \frac{[x_1 - \tfrac{1}{2}(x_1 + x_2)]^2}{\tfrac{1}{2}(x_1 + x_2)} + \frac{[x_2 - \tfrac{1}{2}(x_1 + x_2)]^2}{\tfrac{1}{2}(x_1 + x_2)} = \frac{(x_1 - x_2)^2}{x_1 + x_2}. \qquad (5.8)$$

As for (4.30), X² follows the χ²₍₁₎ distribution, which we already know to be the distribution of the square of a standardized normal deviate. It is therefore not surprising that X² given by (5.8) is precisely the square of z given by (5.7). The third approach is thus equivalent to the other two, and forms a particularly useful method of computation since no square root is involved in (5.8).
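The equivalence of (5.7) and (5.8) is easy to verify numerically. The following minimal sketch (Python with SciPy; the function name compare_two_counts is ours, not from the text) computes z, X² and the corresponding two-sided P value for a pair of independent Poisson counts, here the counts 13 and 31 used in Example 5.4 below.

```python
from math import sqrt
from scipy.stats import norm, chi2

def compare_two_counts(x1, x2):
    """Test H0: mu1 = mu2 for two independent Poisson counts x1 and x2,
    using the normal deviate (5.7) and the equivalent chi-square (5.8)."""
    z = (x1 - x2) / sqrt(x1 + x2)            # equation (5.7)
    X2 = (x1 - x2) ** 2 / (x1 + x2)          # equation (5.8), equal to z**2
    p_two_sided = 2 * norm.sf(abs(z))        # same as chi2.sf(X2, df=1)
    return z, X2, p_two_sided

z, X2, p = compare_two_counts(13, 31)
print(f"z = {z:.3f}, X2 = {X2:.3f}, two-sided P = {p:.4f}")
print("check via chi-square(1):", chi2.sf(X2, df=1))   # identical P value
```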



Consider now an estimation problem. What can be said about the ratio μ₁/μ₂? The second approach described above can be generalized, when the null hypothesis is not necessarily true, by noting that x₁ follows a binomial distribution with parameters x₁ + x₂ (the n of §3.7) and μ₁/(μ₁ + μ₂) (the π of §3.6). The methods of §4.4 thus provide confidence limits for π = μ₁/(μ₁ + μ₂), and hence for μ₁/μ₂, which is merely π/(1 − π). The method is illustrated in Example 5.4.

The difference μ₁ − μ₂ is estimated by x₁ − x₂, and the usual normal theory can be applied as an approximation, with the standard error of x₁ − x₂ estimated as in (5.7) by √(x₁ + x₂).
Example 5.4
Equal volumes of two bacterial cultures are spread on nutrient media and after incubation the numbers of colonies growing on the two plates are 13 and 31. We require confidence limits for the ratio of concentrations of the two cultures.

The estimated ratio is 13/31 = 0.4194. From the Geigy tables, a binomial sample with 13 successes out of 44 provides the following 95% confidence limits for π: 0.1676 and 0.4520. Calculating π/(1 − π) for each of these limits gives the following 95% confidence limits for μ₁/μ₂:

0.1676/0.8324 = 0.2013  and  0.4520/0.5480 = 0.8248.

The mid-P limits for π, calculated exactly as described in §4.4, are 0.1752 and 0.4418, leading to mid-P limits for μ₁/μ₂ of 0.2124 and 0.7915.
The normal approximations described in §4.4 can, of course, be used when the frequencies are not too small.
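A minimal sketch of the Example 5.4 calculation is given below (Python with SciPy; the function name ratio_ci_from_counts is ours, not from the text). It uses exact Clopper–Pearson limits for π, which should reproduce, to rounding, the tabulated limits quoted above, and converts them to limits for μ₁/μ₂.

```python
from scipy.stats import beta

def ratio_ci_from_counts(x1, x2, conf=0.95):
    """Confidence limits for mu1/mu2 from two independent Poisson counts,
    via exact (Clopper-Pearson) limits for pi = mu1/(mu1 + mu2)."""
    n = x1 + x2
    a = 1 - conf
    # Clopper-Pearson limits for a binomial proportion x1 out of n
    lo = beta.ppf(a / 2, x1, n - x1 + 1) if x1 > 0 else 0.0
    hi = beta.ppf(1 - a / 2, x1 + 1, n - x1) if x1 < n else 1.0
    # convert the limits for pi into limits for pi/(1 - pi) = mu1/mu2
    return lo / (1 - lo), hi / (1 - hi)

print(ratio_ci_from_counts(13, 31))    # roughly (0.20, 0.82), as in Example 5.4
```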
Example 5.5
Just as the distribution of a proportion, when n is large and π is small, is well approximated by assuming that the number of successes, r, follows a Poisson distribution, so a comparison of two proportions under these conditions can be effected by the methods of this section. Suppose, for example, that, in a group of 1000 men observed during a particular year, 20 incurred a certain disease, whereas, in a second group of 500 men, four cases occurred. Is there a significant difference between these proportions? This question could be answered by the methods of §4.5. As an approximation we could compare the observed proportion of cases falling into group 2, p = 4/24, with the theoretical proportion π = 500/1500 = 0.3333. The equivalent χ² test would run as follows:

                          Group 1                   Group 2              Total
Observed cases               20                        4                   24
Expected cases      1000 × 24/1500 = 16       500 × 24/1500 = 8            24



With continuity correction,

$$X_c^2 = \frac{(3\tfrac{1}{2})^2}{16} + \frac{(3\tfrac{1}{2})^2}{8} = 0.766 + 1.531 = 2.30 \quad (P = 0.13).$$

The difference is not significant. Without the continuity correction, X² = 3.00 (P = 0.083).
If the full analysis for the 2 × 2 table is written out, it will become clear that this abbreviated analysis differs from the full version in omitting the contributions to X² from the non-affected individuals. Since these are much more numerous than the cases, their contributions to X² have large denominators and are therefore negligible in comparison with the terms used above. This makes it clear that the short method described here must be used only when the proportions concerned are very small.
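The arithmetic of Example 5.5 is easy to reproduce; the sketch below (Python with SciPy; the function name is ours, not from the text) computes the expected numbers from the person totals and the χ² statistic with and without the continuity correction.

```python
from scipy.stats import chi2

def poisson_heterogeneity_test(cases, persons, correction=True):
    """Compare counts of cases in two groups with expectations proportional
    to the numbers of persons (or person-years) observed, as in Example 5.5."""
    total_cases, total_persons = sum(cases), sum(persons)
    expected = [total_cases * n / total_persons for n in persons]
    cc = 0.5 if correction else 0.0
    X2 = sum((abs(o - e) - cc) ** 2 / e for o, e in zip(cases, expected))
    return X2, chi2.sf(X2, df=len(cases) - 1)

print(poisson_heterogeneity_test([20, 4], [1000, 500]))                    # X2 = 2.30, P = 0.13
print(poisson_heterogeneity_test([20, 4], [1000, 500], correction=False))  # X2 = 3.00, P = 0.083
```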

Example 5.6
Consider a slightly different version of Example 5.5. Suppose that the first set of 20 cases
occurred during the follow-up of a large group of men for a total of 1000 man-years,
whilst the second set of four cases occurred amongst another large group followed for 500
man-years. Different men may have different risks of disease, but, under the assumptions
that each man has a constant risk during his period of observation and that the lengths of
follow-up are unrelated to the individual risks, the number of cases in each group will
approximately follow a Poisson distribution. As a test of the null hypothesis that the mean
risks per unit time in the two groups are equal, the χ² test shown in Example 5.5 may be
applied.
Note, though, that a significant difference may be due to failure of the assumptions.
One possibility is that the risk varies with time, and that the observations for one group
are concentrated more heavily at the times of high risk than is the case for the other group;
an example would be the comparison of infant deaths, where one group might be observed
for a shorter period after birth, when the risk is high. Another possibility is that lengths of
follow-up are related to individual risk. Suppose, for example, that individuals with high
risk were observed for longer periods than those with low risk; the effect would be to
increase the expected number of cases in that group.

Further methods for analysing follow-up data are described in Chapter 17.

5.3 Ratios and other functions
We saw, in §4.2, that inferences about the population mean are conveniently
made by using the standard error of the sample mean. In §§4.4 and 5.2,
approximate methods for proportions and counts made use of the appropriate
standard errors, invoking the normal approximations to the sampling distributions. Similar normal approximations are widely used in other situations, and it
is therefore useful to obtain formulae for standard errors (or, equivalently, their
squares, the sampling variances) for various other statistics.



Many situations involve functions of one or more simple statistics, such as means or proportions. We have already, in (4.9), given a general formula for the variance of a difference between two independent random variables, and applied it, in §§4.3, 4.5 and 5.2, to comparisons of means, proportions and counts. In the present section we give some other useful formulae for the variances of functions of independent random variables.

Two random variables are said to be independent if the distribution of one is unaffected by the value taken by the other. One important consequence of independence is that mean values can be multiplied. That is, if x₁ and x₂ are independent and y = x₁x₂, then

$$E(y) = E(x_1)E(x_2). \qquad (5.9)$$

Linear function

Suppose x₁, x₂, …, x_k are independent random variables, and

$$y = a_1 x_1 + a_2 x_2 + \ldots + a_k x_k,$$

the as being constants. Then

$$\mathrm{var}(y) = a_1^2\,\mathrm{var}(x_1) + a_2^2\,\mathrm{var}(x_2) + \ldots + a_k^2\,\mathrm{var}(x_k). \qquad (5.10)$$

The result (4.9) is a particular case of (5.10) when k = 2, a₁ = 1 and a₂ = −1.

The independence condition is important. If the xs are not independent, there must be added to the right-hand side of (5.10) a series of terms of the form

$$2a_i a_j\,\mathrm{cov}(x_i, x_j), \qquad (5.11)$$

where 'cov' stands for the covariance of x_i and x_j, which is defined by

$$\mathrm{cov}(x_i, x_j) = E\{[x_i - E(x_i)][x_j - E(x_j)]\}.$$

The covariance is the expectation of the product of the deviations of two random variables from their means. When the variables are independent, the covariance is zero. When all k variables are independent, all the covariance terms vanish and we are left with (5.10).
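Formula (5.10), and the covariance correction (5.11), are easy to check by simulation. The sketch below (Python with NumPy; all variable names and numerical values are illustrative assumptions, not from the text) compares the empirical variance of a linear combination with the value predicted by the formulae.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x1 = rng.normal(10, 2, n)          # independent variable with variance 4
x2 = rng.poisson(5, n)             # independent variable with variance 5
a1, a2 = 3.0, -2.0

y = a1 * x1 + a2 * x2
predicted = a1**2 * 4 + a2**2 * 5                      # equation (5.10): 36 + 20 = 56
print(y.var(ddof=1), predicted)                        # close to 56

# when the variables are correlated, the term 2*a1*a2*cov(x1, x2) of (5.11) is needed
x2_dep = x2 + 0.5 * x1                                 # induce a covariance with x1
y_dep = a1 * x1 + a2 * x2_dep
cov = np.cov(x1, x2_dep)[0, 1]
predicted_dep = (a1**2 * x1.var(ddof=1) + a2**2 * x2_dep.var(ddof=1)
                 + 2 * a1 * a2 * cov)
print(y_dep.var(ddof=1), predicted_dep)                # the two agree
```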
Ratio
In §5.1, we discussed the ratio of two variance estimates and (at least for
normally distributed data) were able to use specific methods based on the F
distribution. In §5.2, we noted that the ratio of two counts could be treated by
using results established for the binomial distribution. In general, though, exact
methods for ratios are not available, and recourse has to be made to normal
approximations.




Let y = x₁/x₂, where again x₁ and x₂ are independent. No general formula can be given for the variance of y; indeed, it may be infinite. However, if x₂ has a small coefficient of variation, the distribution of y will be rather similar to a distribution with a variance given by the following formula:

$$\mathrm{var}(y) = \frac{\mathrm{var}(x_1)}{[E(x_2)]^2} + \frac{[E(x_1)]^2}{[E(x_2)]^4}\,\mathrm{var}(x_2). \qquad (5.12)$$

Note that if x₂ has no variability at all, (5.12) reduces to

$$\mathrm{var}(y) = \frac{\mathrm{var}(x_1)}{x_2^2},$$

which is an exact result when x₂ is a constant.
Approximate confidence limits for a ratio may be obtained from (5.12), with the usual multiplying factors for SE(y) [= √var(y)] based on the normal distribution. However, if x₁ and x₂ are normally distributed, an exact expression for confidence limits is given by Fieller's theorem (Fieller, 1940). This covers a rather more general situation, in which x₁ and x₂ are dependent, with a non-zero covariance. We suppose that x₁ and x₂ are normally distributed with variances and a covariance which are known multiples of some unknown parameter σ², and that σ² is estimated by a statistic s² on f DF. Define E(x₁) = μ₁, E(x₂) = μ₂, var(x₁) = v₁₁σ², var(x₂) = v₂₂σ² and cov(x₁, x₂) = v₁₂σ². Denote the unknown ratio μ₁/μ₂ by ρ, so that μ₁ = ρμ₂. It then follows that the quantity z = x₁ − ρx₂ is distributed as N[0, (v₁₁ − 2ρv₁₂ + ρ²v₂₂)σ²], and so the ratio

$$T = \frac{x_1 - \rho x_2}{s\sqrt{v_{11} - 2\rho v_{12} + \rho^2 v_{22}}} \qquad (5.13)$$

follows a t distribution on f DF. Hence, the probability is 1 − α that

$$-t_{f,\alpha} < T < t_{f,\alpha},$$

or, equivalently,

$$T^2 < t_{f,\alpha}^2. \qquad (5.14)$$

Substitution of (5.13) in (5.14) gives a quadratic inequality for ρ, leading to 100(1 − α)% confidence limits for ρ given by

$$\rho_L,\ \rho_U = \frac{y - \dfrac{g\,v_{12}}{v_{22}} \pm \dfrac{t_{f,\alpha}\,s}{x_2}\left[v_{11} - 2y\,v_{12} + y^2 v_{22} - g\left(v_{11} - \dfrac{v_{12}^2}{v_{22}}\right)\right]^{1/2}}{1 - g}, \qquad (5.15)$$

where

$$g = \frac{t_{f,\alpha}^2\, s^2\, v_{22}}{x_2^2}, \qquad (5.16)$$

and [ ]^{1/2} indicates a square root.
If g is greater than 1, x₂ is not significantly different from zero at the α level, and the data are consistent with a zero value for μ₂ and hence an infinite value for ρ. The confidence set will then be either the two intervals (−∞, ρ_L) and (ρ_U, ∞), excluding the observed value y, or the whole set of values (−∞, ∞). Otherwise, the interval (ρ_L, ρ_U) will include y, and when g is very small the limits will be close to those given by the normal approximation using (5.12). This may be seen by setting g = 0 in (5.15), when the limits become

$$y \pm t_{f,\alpha}\left[\frac{\mathrm{var}(x_1)}{x_2^2} - \frac{2x_1}{x_2^3}\,\mathrm{cov}(x_1, x_2) + \frac{x_1^2}{x_2^4}\,\mathrm{var}(x_2)\right]^{1/2}. \qquad (5.17)$$

Equation (5.17) agrees with (5.12), with the replacement of the expectations of x₁ and x₂ by their observed values, and the inclusion of the covariance term.
The validity of (5.15) depends on the assumption of normality for x₁ and x₂. Important use is made of Fieller's theorem in biological assay (§20.2), where the normality assumption is known to be a good approximation.

A situation commonly encountered is the comparison of two independent samples when the quantity of interest is the ratio of the location parameters rather than their difference. The formulae above may be useful, taking x₁ and x₂ to be the sample means, and using standard formulae for their variances. The use of Fieller's theorem will be problematic if (as is usually the case) the variances are not estimated as multiples of the same s², although approximations may be used. An alternative approach is to work with the logarithms of the individual readings, and make inferences about the difference in the means of the logarithms (which is the logarithm of their ratio), using the standard procedures of §4.3.
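A direct implementation of (5.15) and (5.16) is sketched below (Python with SciPy; the function name and the numerical inputs are illustrative assumptions, not from the text). The user supplies the two estimates, the variance and covariance multipliers v11, v22 and v12, and the estimate s with its degrees of freedom f; the example treats the ratio of two independent sample means with a pooled variance estimate.

```python
from math import sqrt
from scipy.stats import t as t_dist

def fieller_limits(x1, x2, v11, v22, v12, s, f, alpha=0.05):
    """100(1 - alpha)% limits for rho = E(x1)/E(x2) by Fieller's theorem,
    equations (5.15) and (5.16); var(x1) = v11*sigma^2, var(x2) = v22*sigma^2,
    cov(x1, x2) = v12*sigma^2, with sigma^2 estimated by s^2 on f DF."""
    t = t_dist.ppf(1 - alpha / 2, f)
    y = x1 / x2
    g = t**2 * s**2 * v22 / x2**2                 # equation (5.16)
    if g >= 1:
        raise ValueError("g >= 1: x2 not significantly different from zero; "
                         "the confidence set is not a finite interval")
    disc = v11 - 2 * y * v12 + y**2 * v22 - g * (v11 - v12**2 / v22)
    half = (t * s / x2) * sqrt(disc)
    centre = y - g * v12 / v22
    return (centre - half) / (1 - g), (centre + half) / (1 - g)

# hypothetical example: means 10.0 (n1 = 20) and 4.0 (n2 = 25),
# pooled SD s = 2.0 on f = 43 DF, zero covariance
print(fieller_limits(10.0, 4.0, v11=1/20, v22=1/25, v12=0.0, s=2.0, f=43))
```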
Product

Let y = x₁x₂, where x₁ and x₂ are independent. Denote the means of x₁ and x₂ by μ₁ and μ₂, and their variances by σ₁² and σ₂². Then

$$\mathrm{var}(y) = \mu_1^2\sigma_2^2 + \mu_2^2\sigma_1^2 + \sigma_1^2\sigma_2^2. \qquad (5.18)$$

The assumption of independence is crucial.

General function

Suppose we know the mean and variance of the random variable x. Can we calculate the mean and variance of any general function of x, such as 3x³ or √(log x)? There is no simple formula, but again a useful approximation is available when the coefficient of variation of x is small. We have to assume some knowledge of calculus at this point. Denote the function of x by y. Then

$$\mathrm{var}(y) \simeq \left(\frac{dy}{dx}\right)^2_{x = E(x)} \mathrm{var}(x), \qquad (5.19)$$

the symbol ≃ standing for 'approximately equal to'. In (5.19), dy/dx is the differential coefficient (or derivative) of y with respect to x, evaluated at the mean value of x.
If y is a function of two variables, x₁ and x₂,

$$\mathrm{var}(y) \simeq \left(\frac{\partial y}{\partial x_1}\right)^2 \mathrm{var}(x_1) + 2\left(\frac{\partial y}{\partial x_1}\right)\!\left(\frac{\partial y}{\partial x_2}\right)\mathrm{cov}(x_1, x_2) + \left(\frac{\partial y}{\partial x_2}\right)^2 \mathrm{var}(x_2), \qquad (5.20)$$

where ∂y/∂x₁ and ∂y/∂x₂ are the partial derivatives of y with respect to x₁ and x₂, again evaluated at the mean values. The reader with some knowledge of calculus will be able to derive (4.9) as a particular case of (5.20) when cov(x₁, x₂) = 0. An obvious extension of (5.20) to k variables gives (5.10) as a special case. Equations (5.12) and (5.18) are special cases of (5.20) when cov(x₁, x₂) = 0. In (5.18), the last term becomes negligible if the coefficients of variation of x₁ and x₂ are very small; the first two terms agree with (5.20).

The method of approximation by (5.19) and (5.20) is known as the delta method.
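The delta method is readily checked numerically. The sketch below (Python with NumPy; all numerical values are illustrative assumptions, not from the text) compares the approximation (5.19) for y = log x, and the two-variable case y = x₁x₂, with variances observed in simulated data; the agreement is good only when the coefficients of variation are small, as the text requires.

```python
import numpy as np

rng = np.random.default_rng(2)
mean_x, sd_x = 50.0, 2.0                   # coefficient of variation 4%
x = rng.normal(mean_x, sd_x, 500_000)

# y = log(x): dy/dx = 1/x, so (5.19) gives var(y) ~ var(x) / [E(x)]^2
delta_var = sd_x**2 / mean_x**2
print(np.log(x).var(ddof=1), delta_var)    # the two agree closely

# y = x1 * x2, independent: (5.20) gives mu2^2 var(x1) + mu1^2 var(x2),
# which is (5.18) without the (usually negligible) term var(x1)*var(x2)
x1 = rng.normal(20.0, 1.0, 500_000)
x2 = rng.normal(10.0, 0.5, 500_000)
approx = 10.0**2 * 1.0**2 + 20.0**2 * 0.5**2
exact = approx + 1.0**2 * 0.5**2           # equation (5.18)
print((x1 * x2).var(ddof=1), approx, exact)
```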

5.4 Maximum likelihood estimation
In §4.1 we noted several desirable properties of point estimators, and remarked
that many of these were achieved by the method of maximum likelihood. In
Chapter 4 and the earlier sections of the present chapter, we considered the
sampling distributions of various statistics chosen on rather intuitive grounds,

such as the mean of a sample from a normal distribution. Most of these turn out
to be maximum likelihood estimators, and it is useful to reconsider their properties in the light of this very general approach.
In §3.6 we derived the binomial distribution and in §4.4 we used this result to
obtain inferences from a sample proportion. The probability distribution here is
a two-point distribution with probabilities p and 1 À p for the two types of
individual. There is thus one parameter, p, and a maximum likelihood (ML)
estimator is obtained by finding the value that maximizes the probability shown
in (3.12). The answer is p, the sample proportion, which was, of course, the
statistic chosen intuitively. We shall express this result by writing



$$\hat{\pi} = p,$$

the 'hat' symbol indicating the ML estimator.

Two of the properties already noted in §3.6 follow from general properties of ML estimators: first, in large samples (i.e. for large values of n), the distribution of p tends to become closer and closer to a normal distribution; and, secondly, p is a consistent estimator of π because its variance decreases as n increases, and so p fluctuates more and more closely around its mean, π.

A third property of ML estimators is their efficiency: no other estimator would have a smaller variance than p in large samples. One other property of p is its unbiasedness, in that its mean value is π. This can be regarded as a bonus, as not all ML estimators are unbiased, although in large samples any bias must become proportionately small in comparison with the standard error, because of the consistency property.
Since the Poisson distribution is closely linked with the binomial, as explained in §3.7, it is not surprising that similar properties hold. There is again one parameter, μ, and the ML estimator from a sample of n counts is the observed mean count:

$$\hat{\mu} = \bar{x}.$$

An equivalent statement is that the ML estimator of nμ is nx̄, which is the total count Σx. The large-sample normality of ML estimators implies a tendency towards normality of the Poisson distribution with a large mean (nμ here), confirming the decreased skewness noted in connection with Fig. 3.9. The consistency of x̄ is illustrated by the fact that

$$\mathrm{var}(\bar{x}) = \mathrm{var}(x)/n = \mu/n,$$

so, as n increases, the distribution of x̄ becomes more tightly concentrated around its mean μ. Again, the unbiasedness is a bonus.
In Fig. 4.1, the concept of maximum likelihood estimation was illustrated by reference to a single observation x from a normal distribution N(μ, 1). The ML estimator of μ is clearly x. In a sample of size n from the same distribution, the situation would be essentially the same, except that the distributions of x̄ for different values of μ would now have a variance of 1/n rather than 1. The ML estimator would clearly be x̄, which has the usual properties of consistency and efficiency and, as a bonus, unbiasedness.

In practice, if we are fitting a normal distribution to a set of n observations, we shall not usually know the population variance, and the distribution we fit, N(μ, σ²), will have two unknown parameters. The likelihood now has to be maximized simultaneously over all possible values of μ and σ². The resulting ML estimators are

$$\hat{\mu} = \bar{x},$$



as expected, and

$$\hat{\sigma}^2 = \frac{\sum (x_i - \bar{x})^2}{n}.$$

This is the biased estimator of the variance, (2.1), with divisor n, rather than the unbiased estimator s² given by (2.2). As we noted in §2.6, the bias of (2.1) becomes proportionately unimportant as n gets large, and the estimator is consistent, as we should expect.
Proofs that the ML estimators noted here do maximize the likelihood are
easily obtained by use of the differential calculus. That is, in fact, the general
approach for maximum likelihood solutions to more complex problems, many of

which we shall encounter later in the book. In some of these more complex
models, such as logistic regression (§14.2), the solution is obtained by a computer
program, acting iteratively, so that each round of the calculation gets closer and
closer to the final value.
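As a small illustration of this numerical approach, the sketch below (Python with SciPy; the function name and simulated data are our assumptions, not from the text) maximizes a normal log likelihood directly and confirms that the maximizing values agree with x̄ and Σ(xᵢ − x̄)²/n.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(50, 10, size=200)               # simulated sample

def neg_log_lik(params):
    mu, log_sigma = params                     # optimize log(sigma) to keep sigma > 0
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

fit = minimize(neg_log_lik, x0=[40.0, 2.0], method="Nelder-Mead")
mu_hat, sigma2_hat = fit.x[0], np.exp(fit.x[1]) ** 2

print(mu_hat, x.mean())                        # ML estimate of mu agrees with the sample mean
print(sigma2_hat, ((x - x.mean()) ** 2).mean())  # ML variance uses divisor n, not n - 1
```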
Two points may be noted finally:
1 The ML solution depends on the model put forward for the random variation. Choice of an inappropriate model may lead to inefficient or misleading
estimates. For certain non-normal distributions, for instance, the ML estimator of the location parameter may not be (as with the normal distribution)

the sample mean x̄. This corresponds to the point made in §§2.4 and 2.5 that
for skew distributions the median or geometric mean may be a more satisfactory measure than the arithmetic mean.
2 There are some alternative approaches to estimation, other than maximum
likelihood, that also provide large-sample normality, consistency and efficiency. Some of these, such as generalized estimating equations (§12.6), will
be met later in the book.


6 Bayesian methods

6.1 Subjective and objective probability
Our approach to the interpretation of probability, and its application in statistical inference, has hitherto been frequentist. That is, we have regarded the
probability of a random event as being the long-run proportion of occasions
on which it occurs, conditional on some specified hypothesis. Similarly, in
methods of inference, a P value is defined as the proportion of trials in which
some observed result would have been observed on the null hypothesis; and a
confidence interval is characterized by the probability of inclusion of the true
value of a parameter in repeated samples.
Bayes' theorem (§3.3) allowed us to specify prior probabilities for hypotheses,
and hence to calculate posterior probabilities after data had been observed, but
the prior probabilities were, at that stage, justified as representing the long-run
frequencies with which these hypotheses were true. In medical diagnosis, for
example, we could speak of the probabilities of data (symptoms, etc.) on certain

hypotheses (diagnoses), and attribute (at least approximately) probabilities to
the diagnoses according to the relative frequencies seen in past records of similar
patients.
It would be attractive if one could allot probabilities to hypotheses like the
following: `The use of tetanus antitoxin in cases of clinical tetanus reduces the
fatality of the disease by more than 20%,' for which no frequency interpretation
is possible. Such an approach becomes possible only if we interpret the probability of a hypothesis as a measure of our degree of belief in its truth. A
probability of zero would correspond to complete disbelief, a value of one
representing complete certainty. These numerical values could be manipulated
by Bayes' theorem, measures of prior belief being modified in the light of
observations on random variables by multiplication by likelihoods, resulting in
measures of posterior belief.
It is often argued that this is a more `natural' interpretation of probability
than the frequency approach, and that non-specialist users of statistical methods
often erroneously interpret the results of significance tests or confidence intervals
in this subjective way. That is, a significant result may be wrongly interpreted as showing that the null hypothesis has low probability, and a parameter
may be claimed to have a 95% probability of lying inside a confidence interval.



This argument should not be used to justify an incorrect interpretation, but it
does lend force to attempts to develop a coherent approach in terms of degrees of
belief.
Such an approach to probability and statistical inference was, in fact, conventional in the late eighteenth century and most of the nineteenth century,
following the work of T. Bayes and P.-S. Laplace (1749–1827), the 'degrees of
belief ' interpretation being termed `inverse probability' in contrast to the frequentist `direct probability'. As we shall see, there are close parallels between

many results obtained by the two approaches, and the distinction became
blurred during the nineteenth century. The frequentist approach dominated
during the early part of the twentieth century, especially through the influence
of R.A. Fisher (1890–1962), but many writers (Good, 1950; Savage, 1954;
Jeffreys, 1961; Lindley, 1965) have advocated the inverse approach (now normally called `Bayesian') as the basis for statistical inference, and it is at present
very influential.
The main problem is how to determine prior probabilities in situations where
frequency interpretations are meaningless, but where values in between the two
extremes of complete disbelief and complete certainty are needed. One approach
is to ask oneself what odds one would be prepared to accept for a bet on the truth
or falsehood of a particular proposition. If the acceptable odds were judged to be
4 to 1 against, the proposition could be regarded as having a probability of 1/5 or
0.2. However, the contemplation of hypothetical gambles on outcomes that may never be realized is an unattractive prospect, and seems inappropriate for the
large number of probability assessments that would be needed in any realistic
scientific study. It is therefore more convenient to use some more flexible
approach to capture the main features of a prior assessment of the plausibility
of different hypotheses.
Most applications of statistics involve inference about parameters in models.
It is often possible to postulate a family of probability distributions for the
parameter, the various members of which allow sufficient flexibility to meet the
needs of most situations. At one extreme are distributions with a very wide
dispersion, to represent situations where the user has little prior knowledge or
belief. At the other extreme are distributions with very low dispersion, for
situations where the user is confident that the parameter lies within a small
range. We shall see later that there are particular mathematical distributions,
called conjugate priors, that present such flexibility and are especially appropriate
for particular forms of distribution for the data, in that they combine naturally
with the likelihoods in Bayes' theorem.
The first extreme mentioned above, leading to a prior with wide dispersion, is

of particular interest, because there are many situations in which the investigator
has very little basis for an informed guess, especially when a scientific study is
being done for the first time. It is then tempting to suggest that a prior distribution should give equal probabilities, or probability densities, to all the possible
values of the parameter. However, that approach is ambiguous, because a uniform distribution of probability across all values of a parameter would lead to a
non-uniform distribution on a transformed scale of measurement that might be
just as attractive as the original. For example, for a parameter θ representing a proportion of successes in an experiment, a uniform distribution of θ between 0 and 1 would not lead to a uniform distribution of the logit of θ ((14.5), p. 488) between −∞ and ∞. This problem was one of the main objections to Bayesian
methods raised throughout the nineteenth century.
A convenient way out of the difficulty is to use the family of conjugate priors
appropriate for the situation under consideration, and to choose the extreme
member of that family to represent ignorance. This is called a non-informative or
vague prior. A further consideration is that the precise form of the prior distribution is important only for small quantities of data. When the data are
extensive, the likelihood function is tightly concentrated around the maximum
likelihood value, and the only feature of the prior that has much influence in
Bayes' theorem is its behaviour in that same neighbourhood. Any prior distribution will be rather flat in that region unless it is is very concentrated there or
elsewhere. Such a prior will lead to a posterior distribution very nearly proportional to the likelihood, and thus almost independent of the prior. In other
words, as might be expected, large data sets almost completely determine the
posterior distribution unless the user has very strong prior evidence.
The main body of statistical methods described in this book was built on the
frequency view of probability, and we adhere mainly to this approach. Bayesian
methods based on suitable choices of non-informative priors (Lindley, 1965)

often correspond precisely to the more traditional methods, when appropriate
changes of wording are made. We shall indicate many of these points of correspondence in the later sections of this chapter. Nevertheless, there are points at
which conflicts between the viewpoints necessarily arise, and it is wrong to
suggest that they are merely different ways of saying the same thing.
In our view both Bayesian and non-Bayesian methods have their proper
place in statistical methodology. If the purpose of an analysis is to express the
way in which a set of initial beliefs is modified by the evidence provided by the
data, then Bayesian methods are clearly appropriate. Formal introspection of
this sort is somewhat alien to the working practices of most scientists, but the
informal synthesis of prior beliefs and the assessment of evidence from data is
certainly commonplace. Any sensible use of statistical information must take
some account of prior knowledge and of prior assessments about the plausibility
of various hypotheses. In a card-guessing experiment to investigate extrasensory
perception, for example, a score in excess of chance expectation which was just
significant at the 1% level would be regarded by most people with some scepticism: many would prefer to think that the excess had arisen by chance (to say



nothing of the possibility of experimental laxity) rather than by the intervention
of telepathy or clairvoyance. On the other hand, in a clinical trial to compare an
active drug with a placebo, a similarly significant result would be widely accepted
as evidence for a drug effect because such findings are commonly made. The
question, then, is not whether prior beliefs should be taken into account, but
rather whether this should be done formally, through a Bayesian analysis, or
informally, using frequentist methods for data analysis.
The formal approach is particularly appropriate when decisions need to be
taken, for instance about whether a pharmaceutical company should proceed

with the development of a new product. Here, the evidence, subjective and
objective, for the ultimate effectiveness of the product, needs to be assessed
together with the financial and other costs of taking alternative courses of action.
Another argument in favour of Bayesian methods has emerged in recent
decades as a result of research into new models for complex data structures. In
general, Bayesian methods lead to a simplification of computing procedures in
that the calculations require the likelihood function based on the observed data,
whereas frequentist methods using tail-area probabilities require that results
should be integrated over sets of data not actually observed. Nevertheless,
Bayesian calculations for complex problems involve formidable computing
resources, and these are now becoming available in general computer packages
(Goldstein, 1998) as well as in specialist packages such as BUGS (Thomas et al.,
1992; Spiegelhalter et al., 2000); see Chapter 16.
With more straightforward data sets arising in the general run of medical
research, the investigator may have no strong prior beliefs to incorporate into the
analysis, and the emphasis will be on the evidence provided by the data. The
statistician then has two options: either to use frequentist methods such as those
described in this book, or to keep within the Bayesian framework by calculating
likelihoods. The latter can be presented directly, as summarizing the evidence
from the data, enabling the investigator or other workers to incorporate whatever priors they might wish to use. It may sometimes be useful to report a
`sensitivity analysis' in which the effects of different prior assumptions can be
explored.
Bayesian methods for some simple situations are explained in the following
sections, and Bayesian approaches to more complex situations are described in
Chapter 16. Fuller accounts are to be found in books such as Lee (1997), Carlin
and Louis (2000) and, at a rather more advanced level, Box and Tiao (1973).

6.2 Bayesian inference for a mean
The frequentist methods of inference for a mean, described in §4.2, made use of
the fact that, for large sample sizes, the sample mean tends to be normally




distributed. The methods developed for samples from a normal distribution
therefore provide a reliable approximation for samples from non-normal distributions, unless the departure from normality is severe or the sample size is very
small. The same is true in Bayesian inference, and we shall concentrate here on
methods appropriate for samples from normal distributions.
Figure 4.1 describes the likelihood function for a single observation x from a normal distribution with unit variance, N(μ, 1). It is a function of μ which takes the shape of a normal curve with mean x and unit variance. This result can immediately be extended to give the likelihood from a sample mean. Suppose that x̄ is the mean of a sample of size n from a normal distribution N(μ, σ²). From §4.2, we know that x̄ is distributed as N(μ, σ²/n), and the likelihood function is therefore a normal curve N(x̄, σ²/n).

Suppose now that μ follows a normal prior distribution N(μ₀, σ₀²). Then application of Bayes' theorem shows that the posterior distribution of μ is

$$N\!\left(\frac{\bar{x} + \mu_0\,\sigma^2/n\sigma_0^2}{1 + \sigma^2/n\sigma_0^2},\ \frac{\sigma^2/n}{1 + \sigma^2/n\sigma_0^2}\right). \qquad (6.1)$$

The mean of this distribution can be written in the form

$$\frac{\bar{x}\,\dfrac{n}{\sigma^2} + \mu_0\,\dfrac{1}{\sigma_0^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}}, \qquad (6.2)$$

which is a weighted mean of the observed mean x̄ and the prior mean μ₀, the weights being inversely proportional to the two variances of these quantities (the sampling variance of x̄, σ²/n, and the prior variance of μ, σ₀²). Thus, the observed data and the prior information contribute to the posterior mean in proportion to their precision. The fact that the posterior estimate of μ is shifted from the sample mean x̄, in the direction of the prior mean μ₀, is an example of the phenomenon known as shrinkage, to be discussed further in §6.4.

The variance of the posterior distribution (6.1) may be written in the form

$$\frac{(\sigma^2/n)\,\sigma_0^2}{(\sigma^2/n) + \sigma_0^2}, \qquad (6.3)$$

which is less than either of the two separate variances, σ²/n and σ₀². In this sense, precision has been gained by combining the information from the data and the prior information.
These results illustrate various points made in §6.1. First, the family chosen for
the prior distributions, the normal, constitutes the conjugate family for the normal
likelihood. When the prior is chosen from a conjugate family, the posterior
distribution is always another member of the same family, but with parameters
altered by the incorporation of the likelihood. Although this is mathematically
very convenient, it does not follow that the prior should necessarily be chosen


170

Bayesian methods

in this way. For example, in the present problem, the user might believe that the
mean lies in the neighbourhood of either of two values, θ₀ or θ₁. It might then be

appropriate to use a bimodal prior distribution with peaks at these two values. In
that case, the simplicity afforded by the conjugate family would be lost, and the
posterior distribution would no longer take the normal form (6.1).
Secondly, if either n is very large (when the evidence from the data overwhelms the prior information) or σ₀² is very large (when the prior evidence is very weak and the prior distribution is non-informative), the posterior distribution (6.1) tends towards the likelihood N(x̄, σ²/n).
In principle, once the formulations for the prior and likelihood have been
accepted as appropriate, the posterior distribution provides all we need for
inference about m. In practice, as in frequentist inference, it will be useful to
consider ways of answering specific questions about the possible value of m. In
particular, what are the Bayesian analogues of the two principal modes of
inference discussed in §4.1: significance tests and confidence intervals?
Bayesian significance tests
Suppose that, in the formulation leading up to (6.1), we wanted to ask whether there was strong evidence that μ < 0 or μ > 0. In frequentist inference we should test the hypothesis that μ = 0, and see whether it was strongly contradicted by a significant result in either direction. In the present Bayesian formulation there is no point in considering the probability that μ is exactly 0, since that probability is zero (although μ = 0 has a non-zero density). However, we can state directly the probability that, say, μ < 0 by calculating the tail area to the left of zero in the normal distribution (6.1).

It is instructive to note what happens in the limiting case considered above, when the sample size is large or the prior is non-informative and the posterior distribution is N(x̄, σ²/n). The posterior probability that μ < 0 is the probability of a standardized normal deviate less than

$$\frac{0 - \bar{x}}{\sigma/\sqrt{n}} = \frac{-\bar{x}\sqrt{n}}{\sigma},$$

and this is precisely the same as the one-sided P value obtained in a frequentist test of the null hypothesis that μ = 0. The posterior tail area and the one-sided P value are thus numerically the same, although of course their strict interpretations are quite different.
Example 6.1
Example 4.1 described a frequentist significance test based on a sample of n = 100 survival times of patients with a form of cancer. The observed mean was x̄ = 46.9 months, and the hypothesis tested was that the population mean was (in the notation of the present section) μ = 38.3 months, the assumed standard deviation being σ = 43.3 months. (The subscript 0 used in that example is dropped here, since it will be needed for the parameters of the prior distribution.) Although, as noted in Example 4.1, the individual survival times x must be positive, and the large value of σ indicates a highly skew distribution, the normal theory will provide a reasonable approximation for the distribution of the sample mean.

Table 6.1 shows the results of applying (6.1) with various assumptions about the prior distribution N(μ₀, σ₀²). Since μ must be positive, a normal distribution is strictly inappropriate, and a distributional form allowing positive values only would be preferable. However, if (as in Table 6.1) σ₀/μ₀ is small, the normal distribution will assign very little probability to the range μ < 0, and the model provides a reasonable approach.

Case A represents a vague prior centred around the hypothesized value. The usual assumption for a non-informative prior, that σ₀ = ∞, is inappropriate here, as it would assign too much probability to negative values of μ; the value chosen for σ₀ would allow a wide range of positive values, and would be suitable if the investigator had very little preconception of what might occur. The final inference is largely determined by the likelihood from the data. The probability of μ < 38.3 is small, and close to the one-sided P value of 0.023 (which is half the two-sided value quoted in Example 4.1).

Cases B, C and D represent beliefs that the new treatment might have a moderate effect in improving or worsening survival, in comparison with the previous mean of 38.3, with respectively scepticism, agnosticism and enthusiasm. The final inferences reflect these different prior judgements, with modest evidence for an improvement in C and strong evidence, boosted by the prior belief, in D. In B, the evidence from the data in favour of the new treatment is unable to counteract the gloomy view presented by the prior.

Case E represents a strong belief that the new treatment is better than the old, with a predicted mean survival between about 38 and 42 months. This prior belief is supported by the data, although the observed mean of 46.9 is somewhat above the presumed range. The evidence for an improvement is now strong.

Note that in each of these cases the posterior standard deviation is less than the standard error of the mean, 4.33, indicating the additional precision conferred by the prior assumptions.

Table 6.1 Various prior distributions for Example 4.1.

        Prior distribution          Posterior distribution
        μ₀        σ₀                Mean       SD        P(μ < 38.3)
A       38.3      10                45.54      3.97      0.034
B       30         5                39.66      3.27      0.34
C       40         5                43.94      3.27      0.043
D       50         5                48.23      3.27      0.001
E       40         1                40.35      0.97      0.017
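The entries in Table 6.1 follow directly from (6.1)–(6.3). The sketch below (Python with SciPy; the function name is ours, not from the text) reproduces the posterior mean, SD and tail probability for each of the five priors.

```python
from math import sqrt
from scipy.stats import norm

def normal_posterior(xbar, n, sigma, mu0, sigma0):
    """Posterior mean and SD for mu, given xbar ~ N(mu, sigma^2/n)
    and the prior mu ~ N(mu0, sigma0^2)."""
    w_data, w_prior = n / sigma**2, 1 / sigma0**2                   # precisions
    mean = (xbar * w_data + mu0 * w_prior) / (w_data + w_prior)     # equation (6.2)
    sd = sqrt(1 / (w_data + w_prior))                               # square root of (6.3)
    return mean, sd

xbar, n, sigma = 46.9, 100, 43.3                 # data of Example 4.1 / Example 6.1
priors = {"A": (38.3, 10), "B": (30, 5), "C": (40, 5), "D": (50, 5), "E": (40, 1)}
for case, (mu0, sigma0) in priors.items():
    mean, sd = normal_posterior(xbar, n, sigma, mu0, sigma0)
    print(case, round(mean, 2), round(sd, 2), round(norm.cdf(38.3, mean, sd), 3))
```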



In some situations it may be appropriate to assign a non-zero probability to a null hypothesis such as μ = 0. For example, in a clinical trial to study the efficacy of a drug, it might be held that there is a non-negligible probability f₀ that the drug is ineffective, whilst the rest of the prior probability, 1 − f₀, is spread over a range of values. This model departs from the previous one in not using a normal (and hence conjugate) prior distribution, and various possibilities may be considered. For instance, the remaining part of the distribution may be assumed to be normal over an infinite range, or it may be distributed in some other way, perhaps over a finite range. We shall not examine possible models in any detail here, but one or two features should be noted. First, if the observed mean x̄ is sufficiently close to zero, the posterior odds in favour of the null hypothesis, say f₁/(1 − f₁), will tend to be greater than the prior odds f₀/(1 − f₀). That is, the observed mean tends to confirm the null hypothesis. Conversely, an observed mean sufficiently far from zero will tend to refute the null hypothesis, and the posterior odds will be less than the prior odds. However, the close relationship with frequentist methods breaks down. A value of x̄ which is just significantly different from zero at some level α may, in sufficiently large samples, confirm the null hypothesis by producing posterior odds greater than the prior odds. Moreover, the proportionate increase in odds increases with the sample size.
This result, often called Lindley's paradox, has been much discussed (Lindley, 1957; Cox & Hinkley, 1974, §10.5; Shafer, 1982; Senn, 1997, pp. 179–184). It arises because, for large samples, the prior distribution need be considered only in a small neighbourhood of the maximum likelihood estimate x̄, and with a diffuse distribution of the non-null part of the prior the contribution from this neighbourhood is very small and leads to a low posterior probability against the null hypothesis. Lindley's paradox is often used as an argument against the use of frequentist methods, or at least to assert that large samples require very extreme significance levels (i.e. small values of α) before they become convincing. However, it can equally well be argued that, with a sample mean near, but significantly different from, the null value in large samples, the initial choice of a diffuse prior for the non-null hypothesis was inappropriate. A more concentrated distribution around the null value would have removed the difficulty.
This example illustrates the dilemma facing the Bayesian analyst if the
evidence from the data is in some way inconsistent with the prior assumptions.
A purist approach would suggest that the prior distribution represents prior
opinion and should not be changed by hindsight. A more pragmatic approach
would be to recognize that the initial choice was ill-informed, and to consider
analyses using alternative formulations.
Unknown mean and variance
We have assumed so far in this section that, in inferences about the mean m, the
variance s2 is known. In practice, as noted in §4.2, the variance is usually



6.2 Bayesian inference for a mean

173

unknown, and this is taken into account in the frequentist approach by use of the
t distribution.
The Bayesian approach requires a prior distribution for s2 as well as for m,
and in the absence of strong contraindications it is useful to introduce the
conjugate family for the distribution of variance. This turns out to be an inverse
gamma distribution, which means that some multiple of the reciprocal of s2 is
assumed to have a x2 distribution on some appropriate degrees of freedom (see
§5.1). There are two arbitrary constants hereÐthe multiplying factor and the
degrees of freedomÐso the model presents a wide range of possible priors.
The full development is rather complicated, but simplification is achieved by
the use of non-informative priors for the mean and variance, and the further
assumption that these are independent. We assume as before that s2 , the prior
0
variance for m, is infinite; and a non-informative version of the inverse gamma
distribution for s2 (with zero mean and zero `degrees of freedom') is chosen. The

posterior distribution of m is then centred around x, the variation around this
p
mean taking the form of t…nÀ1† times the usual standard error, s= n, where t…nÀ1†
is a variate following the t distribution on n À 1 DF. There is thus an analogy
with frequentist methods similar to that noted for the case with known variance.
In particular, the posterior probability that m < 0 is numerically the same as the
one-sided P value in a frequentist t test of the null hypothesis that m ˆ 0.
The comparison of the means of two independent samples, for which frequentist methods were described in §4.3, requires further assumptions about the
prior distributions for the two pairs of means and variances. If these are all
assumed to be non-informative, as in the one-sample case, and independent, the

posterior distribution of the difference between the two means, m1 À m2 , involves
the Fisher±Behrens distribution referred to in §4.3.
Bayesian estimation
The posterior distribution provides all the information needed for Bayesian
estimation, but, as with frequentist methods, more compact forms of description
will usually be sought.
Point estimation
As noted in §4.1, a single-valued point estimator, without any indication of its
variability, is of limited value. Nevertheless, estimates of important parameters
such as means are often used, for instance in tabulations. A natural suggestion is
that a parameter should be estimated by a measure of location of the posterior
distribution, such as the mean, median or mode. Decision theory suggests that
the choice between these should be based on the loss function: the way in which
the adverse consequences of making an incorrect estimate depend on the



difference between the true and estimated values. The mean is an appropriate
choice if the loss is proportional to the square of this difference; and the median
is appropriate if the loss is proportional to the absolute value of the difference.
These are rather abstruse considerations in the context of simple data analysis,
and it may be wise to choose the mean as being the most straightforward, unless
the distribution has extreme outlying values which affect the mean, in which case
the median might be preferable. The mode is less easy to justify, being appropriate for a loss function which is constant for all incorrect values.
If the posterior distribution is normal, as in the discussion leading up to
Example 6.1, the three measures of location coincide, and there is no ambiguity.
We should emphasize, though, that a Bayesian point estimate will, as in Example

6.1, be influenced by the prior distribution, and may be misleading for many
purposes where the reader is expecting a simple descriptive statement about the
data rather than a summary incorporating the investigator's preconceptions.
Finally, note that for a non-informative prior, when the posterior distribution is proportional to the likelihood, the mode of the posterior distribution
coincides with the maximum likelihood estimator. In Example 6.1, case A, the
prior is almost non-informative, and the posterior mean (coinciding here with the
mode and median) is close to the sample mean of 46.9, the maximum likelihood
value.
Interval estimation
A natural approach is to select an interval containing a specified high proportion, say 1 − α, of the posterior probability. The resulting interval may be termed a Bayesian confidence interval, although it is important to realize that the interpretation is quite different from that of the frequentist confidence interval presented in §4.1. Alternative phrases such as credibility interval or Bayesian probability interval are preferable.

The choice of a 1 − α credibility interval is not unique, as any portion of the posterior distribution covering the required probability of 1 − α could be selected. (A similar feature of frequentist confidence intervals was noted in §4.1.) The simplest, and most natural, approach is to choose the interval with equal tail areas of ½α. In the situation described at the beginning of this section, with a normal sampling distribution with known variance and a non-informative normal prior, the 1 − α credibility interval coincides with the usual symmetric 1 − α confidence interval centred around the sample mean. When the variance is unknown, the non-informative assumptions described earlier lead to the use of the t distribution, and again the credibility interval coincides with the usual confidence interval. In other situations, and with more specific prior assumptions, the Bayesian credibility interval will not coincide with that obtained from a frequentist approach.



6.3 Bayesian inference for proportions and counts
The model described in §6.2, involving a normal likelihood and a normal prior,
will serve as a useful approximation in many situations where these conditions
are not completely satisfied, as in Example 6.1. In particular, it may be adequate
for analyses involving proportions and counts, provided that the normal
approximations to the sampling distributions, described in §3.8, are valid, and
that a normal distribution reasonably represents the prior information.
However, for these two situations, more exact methods are available, based
on the binomial distribution for proportions (§3.6) and the Poisson distribution
for counts (§3.7).
Bayesian inference for a proportion
Consider the estimation of a population proportion π from a random sample of size n in which r individuals are affected in some way. The sampling results, involving the binomial distribution, were discussed in §3.6 and §4.4. In the Bayesian approach we need a prior distribution for π. A normal distribution can clearly provide only a rough approximation, since π must lie between 0 and 1. The most convenient and flexible family of distributions for this purpose, which happens also to be the conjugate family, is that of the beta distributions. The density of a beta distribution takes the form

$$f(\pi) = \frac{\pi^{a-1}(1 - \pi)^{b-1}}{B(a, b)}, \qquad (6.4)$$

where the two parameters a and b must both be positive. The denominator B(a, b) in (6.4), which is needed to ensure that the total probability is 1, is known as the beta function. When a and b are both integers, it can be expressed in terms of factorials (see §3.6), as follows:

$$B(a, b) = \frac{(a-1)!\,(b-1)!}{(a+b-1)!}. \qquad (6.5)$$

We shall refer to (6.4) as the Beta (a, b) distribution. The mean and variance of π are, respectively,

$$E(\pi) = \pi_0 = a/(a+b) \quad \text{and} \quad \mathrm{var}(\pi) = ab/[(a+b)^2(a+b+1)].$$

The mean is <, = or > ½ according to whether a <, = or > b. The variance decreases as a + b increases, so strong prior evidence is represented by high values of a + b.

The shape of the beta distribution is determined by the values of a and b. If a = b = 1, f(π) is constant, and the distribution of π is uniform, all values between 0 and 1 having the same density. If a and b are both greater than 1,



the distribution of π is unimodal with a mode at π = (a − 1)/(a + b − 2). If a and b are both less than 1, the distribution is U-shaped, with modes at 0 and 1. If a > 1 and b < 1, the distribution is J-shaped, with a mode at 1, and the reverse conditions for a and b give a reversed J-shape, with a mode at 0.

With (6.4) as the prior distribution, and the binomial sampling distribution for the observed value r, application of Bayes' theorem shows that the posterior distribution of π is again a beta distribution, Beta (r + a, n − r + b). The posterior mean is thus

$$\tilde{\pi} = \frac{r + a}{n + a + b}, \qquad (6.6)$$

which lies between the observed proportion of affected individuals, p = r/n, and the prior mean π₀. The estimate of the population proportion is thus shrunk from the sample estimate towards the prior mean. For very weak prior evidence and a large sample (a + b small, n large), the posterior estimate will be close to the sample proportion p. For strong prior evidence and a small sample (a + b large, n small) the estimate will be close to the prior mean π₀.

As a representation of prior ignorance, it might seem natural to choose the uniform distribution, which is the member of the conjugate family of beta distributions with a = b = 1. Note, however, from (6.6) that this gives π̃ = (r + 1)/(n + 2), a slightly surprising result. The more expected result with π̃ = p, the sample proportion, would be obtained only with a = b = 0, which is strictly not an allowable combination of parameters for a beta distribution. Theoretical reasons have been advanced for choosing, instead, a = b = ½, although this choice is not normally adopted. The dilemma is of little practical importance, however. The change of parameters in the beta function, in moving from the prior to the posterior, is effectively to add a hypothetical number of a affected individuals to the r observed, and b non-affected to the n − r observed, and unless r or n − r is very small none of the choices mentioned above will have much effect on the posterior distribution.
Statements of the posterior probability for various possible ranges of values of π require calculations of the area under the curve (i.e. the integral) for specified portions of the beta distribution. These involve the incomplete beta function, and can be obtained from suitable tables (e.g. Pearson & Hartley, 1966, Tables 16 and 17 and §8) or from tabulations of the F distribution included in some computer packages. Using the latter approach, the probability that π < π′ in the Beta (a, b) distribution is equal to the probability that F > F′ in the F distribution with 2b and 2a degrees of freedom, where

$$F' = \frac{a(1 - \pi')}{b\pi'}.$$

We illustrate some of the points discussed above, and in §6.2, by reference to the data analysed earlier by frequentist methods in Example 4.6.


6.3 Bayesian inference for proportions and counts

177

Example 6.2
In the clinical trial described in Example 4.6, 100 patients receive two drugs, X and Y, in random order; 65 prefer X and 35 prefer Y. Denote by π the probability that a patient prefers X. Example 4.6 described a frequentist significance test of the null hypothesis that π = ½ and, in the continuation on p. 117, provided 95% confidence limits for π.

Table 6.2 shows the results of Bayesian analyses with various prior beta distributions for π.

In case A, the uniform distribution, Beta (1, 1), represents vague prior knowledge as indicated earlier, and the posterior distribution is determined almost entirely by the data. The central 95% posterior probability region is very similar to the 95% confidence range given in Example 4.6 (continued on p. 117), method 2. The probability that π < 0.5 agrees (to four decimal places) with the one-sided mid-P significance level in a test of the null hypothesis that π = 0.5.

The tighter prior distribution used in case B suggests that the observed proportion p = 0.65 is an overestimate of the true probability π. The posterior mean is shrunk towards 0.5, but the probability that π < 0.5 is still very low.

In case C, the prior distribution is even more tightly packed around 0.5 and the posterior mean is shrunk further. The lower limit of the central 95% posterior probability region barely exceeds 0.5, and P(π < 0.5) is correspondingly only a little short of 0.025. With such strong prior belief the highly significant difference between p and 0.5 (as judged by a frequentist test) is heavily diluted, although still providing a moderate degree of evidence for a verdict in favour of drug X.
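The calculations behind Example 6.2 can be sketched in a few lines (Python with SciPy; the function name is ours, not from the text). The code forms the Beta(r + a, n − r + b) posterior for each prior and evaluates the posterior mean (6.6), an equal-tailed 95% region and P(π < 0.5), which can be compared with the entries of Table 6.2 below.

```python
from scipy.stats import beta

def beta_posterior_summary(r, n, a, b):
    """Posterior Beta(r + a, n - r + b) for a proportion, with prior Beta(a, b)."""
    post = beta(r + a, n - r + b)
    lower, upper = post.ppf(0.025), post.ppf(0.975)     # equal-tailed 95% region
    return post.mean(), (lower, upper), post.cdf(0.5)   # mean (6.6), region, P(pi < 0.5)

r, n = 65, 100                                          # 65 of 100 patients prefer drug X
for case, (a, b) in {"A": (1, 1), "B": (30, 30), "C": (60, 60)}.items():
    mean, region, p_below_half = beta_posterior_summary(r, n, a, b)
    print(case, round(mean, 3), [round(v, 3) for v in region], round(p_below_half, 4))
```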

Bayesian comparison of two proportions
In the comparison of two proportions, discussed from a frequentist standpoint in §4.5, a fully Bayesian approach would require a formulation for the prior distributions of the two parameters π₁ and π₂, allowing for the possibility that their random variation is associated in some way. In most situations, however, progress can be made by concentration on a single measure of the contrast between the two parameters.

For the paired case, treated earlier on p. 121, the analysis may be reduced to that of a single proportion by consideration of the relative proportions of the two types of untied pairs, for which the observed frequencies (p. 121) are r and s.
Table 6.2 Various prior distributions for Example 6.2.

      Prior distribution                          Posterior distribution
      a     b    Mean   Central 95%               a      b    Mean    Central 95%          P(π < 0.5)
                        probability region                            probability region
A     1     1    0.5    (0.025, 0.975)            66     36   0.650   (0.552, 0.736)       0.0013
B    30    30    0.5    (0.375, 0.625)            95     65   0.595   (0.517, 0.668)       0.0085
C    60    60    0.5    (0.411, 0.589)           125     95   0.569   (0.502, 0.633)       0.0212




In the unpaired case (p. 124), one possible simplification is to express the
contrast between p1 and p2 in terms of the log of the odds ratio,
\[ \log C = \log\frac{p_1(1 - p_2)}{(1 - p_1)\,p_2}. \tag{6.7} \]

In the notation used in (4.25), log C may be estimated by the log of the observed
odds ratio ad/bc, the variance of which is given approximately by the square of
the standard error (4.26) divided by (2.3026)² = 5.3020 (to convert from natural
to common logs). Unless some of the frequencies are very small, this statistic
may be assumed to be approximately normally distributed, so the normal theory
for Bayesian estimation of a mean can be applied. The prior distribution of the
parameter (6.7) may also be assumed to be approximately normal, with a mean and
variance reflecting prior opinion, and the normal theory outlined in §6.2 may then
be applied. Note that the formulation in terms of the log of the odds
ratio, rather than the odds ratio itself, makes the normal model more plausible,
since the parameter and its estimate both have an unlimited range in each
direction.
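A sketch of this normal-theory updating is given below (SciPy assumed). It works throughout on the natural-log scale rather than common logs, using the familiar large-sample variance 1/a + 1/b + 1/c + 1/d for the log odds ratio; the 2 × 2 frequencies and the prior settings are illustrative only, not taken from the text.

```python
# Normal prior x (approximately) normal likelihood for the log odds ratio.
import math
from scipy import stats

a, b, c, d = 30, 20, 15, 35            # hypothetical 2 x 2 table frequencies

est = math.log((a * d) / (b * c))      # observed log odds ratio (natural logs)
est_var = 1/a + 1/b + 1/c + 1/d        # usual large-sample variance of the estimate

prior_mean, prior_var = 0.0, 1.0       # prior opinion centred on 'no difference'

# Precision-weighted combination gives the normal posterior
post_var = 1.0 / (1.0 / prior_var + 1.0 / est_var)
post_mean = post_var * (prior_mean / prior_var + est / est_var)

post = stats.norm(post_mean, math.sqrt(post_var))
print(post_mean, post.sf(0.0))         # posterior mean and P(log odds ratio > 0)
```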
Bayesian inference for a count
Frequentist methods of inference from a count x, following a Poisson distribution with mean m, were described in §5.2. The Bayesian approach requires the
formulation of a prior distribution for m, which can take positive values between
0 and ∞. The conjugate family here is that of the gamma distributions, the
density of which is
\[ f(m) = \frac{m^{a-1} e^{-m/b}}{\Gamma(a)\, b^a}, \tag{6.8} \]

where a and b are two adjustable parameters taking positive values. The expression Γ(a) denotes the gamma function; for integral values of a, Γ(a) = (a − 1)!.
We shall refer to (6.8) as the Gamma(a, b) distribution. The mean and
variance of m are, respectively, ab and ab². By variation of the two parameters
a flexible range of possible prior distributions may be obtained. Essentially, a
determines the shape, and b the scale, of the distribution. When a ≤ 1, the
distribution has a reversed J-shape, with a peak at zero; otherwise the distribution is double-tailed. The family of gamma distributions is closely related to the
family of chi-square (χ²) distributions: 2m/b has a χ² distribution on 2a degrees
of freedom.
For an observed count x, and (6.8) as the prior distribution for m, the
posterior distribution is Gamma(x + a, b/(1 + b)). A suitable choice of parameters
for a non-informative prior is a = ½, b = ∞, which has an infinitely
dispersed reversed J-shape. With that assumption, the posterior distribution
becomes Gamma(x + ½, 1), and 2m has a χ² distribution on 2x + 1 DF.

Example 6.3
Example 5.2 described a study in which x = 33 workers died of lung cancer. The number
expected at national death rates was 20.0.
The Bayesian model described above, with a non-informative prior, gives the posterior
distribution Gamma(33.5, 1), so that 2m has a χ² distribution on 67 DF. From computer
tabulations of this distribution we find that P(m < 20.0) is 0.0036, very close to the
frequentist one-sided mid-P significance level of 0.0037 quoted in Example 5.2.
If, on the other hand, it was believed that local death rates varied around the national
rate by relatively small increments, a prior might be chosen to have a mean count of 20
and a small standard deviation of, say, 2. Setting the mean of the prior to be ab = 20 and
its variance to be ab² = 4 gives a = 100 and b = 0.2, and the posterior distribution is
Gamma(133, 0.1667). Thus, 12m has a χ² distribution on 266 DF, and computer tabulations
show that P(m < 20.0) is 0.128. There is now considerably less evidence that the local
death rate is excessive. The posterior estimate of the expected number of deaths is
(133)(0.1667) = 22.17. Note, however, that the observed count is somewhat incompatible
with the prior assumptions. The difference between x and the prior mean is 33 − 20 = 13.
Its variance might be estimated as 33 + 4 = 37 and its standard error as √37 = 6.083. The
difference is thus over twice its standard error, and the investigator might be well advised
to reconsider prior assumptions.
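The two posterior tail probabilities in this example can be reproduced with a short sketch of the gamma–Poisson updating (SciPy assumed):

```python
# Gamma posteriors for the lung-cancer count of Example 6.3.
from scipy import stats

x, null_mean = 33, 20.0

# Non-informative prior (a = 1/2, b -> infinity): posterior Gamma(x + 1/2, scale 1)
post1 = stats.gamma(x + 0.5, scale=1.0)
print(post1.cdf(null_mean))                  # about 0.0036

# Informative prior with mean 20 and SD 2 (a = 100, b = 0.2):
# posterior Gamma(x + a, scale b/(1 + b)) = Gamma(133, scale 0.1667)
post2 = stats.gamma(x + 100, scale=0.2 / 1.2)
print(post2.mean(), post2.cdf(null_mean))    # about 22.17 and 0.128
```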

Analyses involving the ratio of two counts can proceed from the approach
described in §5.2 and illustrated in Examples 5.2 and 5.3. If two counts, x1 and
x2 , follow independent Poisson distributions with means m1 and m2 , respectively,
then, given the total count x1 + x2, the observed count x1 is binomially distributed with mean (x1 + x2)m1/(m1 + m2). The methods described earlier in this
section for the Bayesian analysis of proportions may thus be applied also to
this problem.
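A sketch along these lines (SciPy assumed; the two counts and the Beta(1, 1) prior are illustrative, not from the text) obtains a posterior interval for the ratio m1/m2 by transforming the beta posterior for π = m1/(m1 + m2):

```python
# Ratio of two Poisson means via the conditional binomial argument.
from scipy import stats

x1, x2 = 18, 30        # hypothetical counts
a, b = 1, 1            # vague Beta(1, 1) prior for pi = m1/(m1 + m2)

posterior = stats.beta(a + x1, b + x2)
lo, hi = posterior.ppf([0.025, 0.975])       # central 95% region for pi
print(lo / (1 - lo), hi / (1 - hi))          # corresponding limits for m1/m2 = pi/(1 - pi)
```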

6.4 Further comments on Bayesian methods
Shrinkage
The phenomenon of shrinkage was introduced in §6.2 and illustrated in several of
the situations described in that section and in §6.3. It is a common feature of
parameter estimation in Bayesian analyses. The posterior distribution is determined by the prior distribution and the likelihood based on the data, and its
measures of location will tend to lie between those of the prior distribution and
the central features of the likelihood function. The relative weights of these two
determinants will depend on the variability of the prior and the tightness of the
likelihood function, the latter being a function of the amount of data.



The examples discussed in §6.2 and §6.3 involved the means of the prior and
posterior distributions and the mean of the sampling distribution giving rise to
the likelihood. For unimodal distributions shrinkage will normally also be
observed for other measures of location, such as the median or mode. However,
if the prior had two or more well-separated modes, as might be the case for some
genetic traits, the tendency would be to shrink towards the nearest major mode,
and that might be in the opposite direction to the overall prior mean. An
example, for normally distributed observations with a prior distribution concentrated at just two points, is given by Carlin and Louis (2000, §4.1.1), who refer to
the phenomenon as stretching.
We discuss here two aspects of shrinkage that relate to concepts of linear
regression, a topic dealt with in more detail in Chapter 7. We shall anticipate
some results described in Chapter 7, and the reader unfamiliar with the principles
of linear regression may wish to postpone a reading of this subsection.
First, we take another approach to the normal model described at the start of
§6.2. We could imagine taking random observations, simultaneously, of the two
variables m and x̄. Here, m is chosen randomly from the distribution N(m₀, s₀²).
Then, given this value of m, x̄ is chosen randomly from the conditional distribution N(m, s²/n). If this process is repeated, a series of random pairs (m, x̄) is
generated. These paired observations form a bivariate normal distribution (§7.4,
Fig. 7.6). In this distribution, var(m) = s₀², var(x̄) = s₀² + s²/n (incorporating
both the variation of m and that of x̄ given m), and the correlation (§7.3) between
m and x̄ is
\[ r_0 = \frac{s_0}{\sqrt{s_0^2 + s^2/n}}. \]

The regression equation of x̄ on m is
\[ E(\bar{x} \mid m) = m, \]
so the regression coefficient of x̄ on m is 1. From the relationship between the regression coefficients and the correlation coefficient, shown below (7.11), the other
regression coefficient is
\[ b_{m \cdot \bar{x}} = \frac{r_0^2}{b_{\bar{x} \cdot m}} = r_0^2 = \frac{s_0^2}{s_0^2 + s^2/n}. \]


This result is confirmed by the mean of the distribution (6.1), which can be
written as
\[ E(m \mid \bar{x}) = m_0 + \frac{s_0^2}{s_0^2 + s^2/n}\,(\bar{x} - m_0). \]
The fact that the regression coefficient of m on x̄ (= r₀²) is less than 1 reflects the shrinkage in the posterior
mean. The proportionate shrinkage is therefore 1 − r₀² = (s²/n)/(s₀² + s²/n).
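A small simulation sketch (NumPy assumed; the values of m₀, s₀, s and n are arbitrary) illustrates the bivariate-normal argument: the least-squares slope of m on x̄ should be close to r₀² = s₀²/(s₀² + s²/n).

```python
# Simulated (m, x-bar) pairs: slope of the regression of m on x-bar approximates r0^2.
import numpy as np

rng = np.random.default_rng(1)
m0, s0 = 10.0, 2.0                 # prior mean and SD (illustrative)
s, n = 4.0, 8                      # within-sample SD and sample size (illustrative)

m = rng.normal(m0, s0, size=100_000)        # m drawn from N(m0, s0^2)
xbar = rng.normal(m, s / np.sqrt(n))        # x-bar drawn from N(m, s^2/n), given m

slope = np.polyfit(xbar, m, 1)[0]           # least-squares slope of m on x-bar
print(slope, s0**2 / (s0**2 + s**2 / n))    # both close to 0.667
```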