A Handbook of Statistical Analyses Using R
Second Edition

© 2010 by Taylor and Francis Group, LLC


Downloaded by [King Mongkut's Institute of Technology, Ladkrabang] at 01:45 11 September 2014

Brian S. Everitt and Torsten Hothorn

CRC Press
Taylor & Francis Group
Boca Raton London New York
CRC Press is an imprint of the
Taylor & Francis Group, an informa business
A CHAPMAN & HALL BOOK


Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2010 by Taylor and Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1
International Standard Book Number: 978-1-4200-7933-3 (Paperback)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data

Everitt, Brian.
A handbook of statistical analyses using R / Brian S. Everitt and Torsten Hothorn.
-- 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-4200-7933-3 (pbk. : alk. paper)
1. Mathematical statistics--Data processing--Handbooks, manuals, etc. 2. R
(Computer program language)--Handbooks, manuals, etc. I. Hothorn, Torsten. II. Title.
QA276.45.R3E94 2010
519.50285’5133--dc22                    2009018062
Visit the Taylor & Francis Web site at

and the CRC Press Web site at



Dedication

To our wives, Mary-Elizabeth and Carolin,
for their constant support and encouragement


Preface to Second Edition
Like the first edition, this book is intended as a guide to data analysis with
the R system for statistical computing. New chapters on graphical displays,
generalised additive models and simultaneous inference have been added to
this second edition and a section on generalised linear mixed models completes
the chapter that discusses the analysis of longitudinal data where the response
variable does not have a normal distribution. In addition, new examples and
additional exercises have been added to several chapters. We have also taken
the opportunity to correct a number of errors that were present in the first
edition. Most of these errors were kindly pointed out to us by a variety of people
to whom we are very grateful, especially Guido Schwarzer, Mike Cheung,
Tobias Verbeke, Yihui Xie, Lothar Häberle, and Radoslav Harman.
We learnt that many instructors use our book successfully for introductory
courses in applied statistics. We have had the pleasure to give some courses
based on the first edition of the book ourselves and we are happy to share
slides covering many sections of particular chapters with our readers. LaTeX
sources and PDF versions of slides covering several chapters are available from
the second author upon request.
A new version of the HSAUR package, now called HSAUR2 for obvious
reasons, is available from CRAN. Basically the package vignettes have been
updated to cover the new and modified material as well. Otherwise, the technical infrastructure remains as described in the preface to the first edition,
with two small exceptions: names of R add-on packages are now printed in
bold font and we refrain from showing significance stars in model summaries.
Lastly we would like to thank Thomas Kneib and Achim Zeileis for commenting on the newly added material and again the CRC Press staff, in particular Rob Calver, for their support during the preparation of this second
edition.
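
As a rough sketch of how the updated package might be picked up in practice, the snippet below checks whether HSAUR2 is installed before attaching it and listing its vignettes. The variable names `pkg` and `have_pkg` are illustrative assumptions, not taken from the book, and the install step is left as a message so that nothing is downloaded unasked.

```r
# Name of the second-edition companion package, as given in the preface.
pkg <- "HSAUR2"

# requireNamespace() reports TRUE/FALSE without attaching the package.
have_pkg <- requireNamespace(pkg, quietly = TRUE)

if (have_pkg) {
  library(pkg, character.only = TRUE)  # attach the package
  print(vignette(package = pkg))       # list the chapter vignettes it ships
} else {
  message("Package ", pkg, " is not installed; try install.packages(\"", pkg, "\").")
}
```

Guarding the call with `requireNamespace()` keeps the script usable on machines without CRAN access.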

Brian S. Everitt and Torsten Hothorn
London and München, April 2009



Preface to First Edition
This book is intended as a guide to data analysis with the R system for statistical computing. R is an environment incorporating an implementation of
the S programming language, which is powerful and flexible and has excellent
graphical facilities (R Development Core Team, 2009b). In the Handbook we
aim to give relatively brief and straightforward descriptions of how to conduct
a range of statistical analyses using R. Each chapter deals with the analysis appropriate for one or several data sets. A brief account of the relevant
statistical background is included in each chapter along with appropriate references, but our prime focus is on how to use R and how to interpret results.
We hope the book will provide students and researchers in many disciplines
with a self-contained means of using R to analyse their data.
R is an open-source project developed by dozens of volunteers for more than
ten years now and is available from the Internet under the General Public Licence. R has become the lingua franca of statistical computing. Increasingly,
implementations of new statistical methodology first appear as R add-on packages. In some communities, such as in bioinformatics, R already is the primary
workhorse for statistical analyses. Because the sources of the R system are open
and available to everyone without restrictions and because of its powerful language and graphical capabilities, R has started to become the main computing
engine for reproducible statistical research (Leisch, 2002a,b, 2003, Leisch and
Rossini, 2003, Gentleman, 2005). For a reproducible piece of research, the original observations, all data preprocessing steps, the statistical analysis as well
as the scientific report form a unity and all need to be available for inspection,
reproduction and modification by the readers.
Reproducibility is a natural requirement for textbooks such as the Handbook
of Statistical Analyses Using R and therefore this book is fully reproducible
using an R version greater than or equal to 2.2.1. All analyses and results, including
figures and tables, can be reproduced by the reader without having to retype
a single line of R code. The data sets presented in this book are collected
in a dedicated add-on package called HSAUR accompanying this book. The
package can be installed from the Comprehensive R Archive Network (CRAN)
via
R> install.packages("HSAUR")
and its functionality is attached by
R> library("HSAUR")
The relevant parts of each chapter are available as a vignette, basically a
document including both the R sources and the rendered output of every
analysis contained in the book. For example, the first chapter can be inspected
by
R> vignette("Ch_introduction_to_R", package = "HSAUR")
and the R sources are available for reproducing our analyses by
R> edit(vignette("Ch_introduction_to_R", package = "HSAUR"))
An overview of all chapter vignettes included in the package can be obtained
from
R> vignette(package = "HSAUR")
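
The version requirement stated above can be checked before any chapter code is run. The following is a minimal sketch using base R only; the names `min_version`, `r_ok` and `items` are hypothetical, and the data set listing runs only if HSAUR is already installed.

```r
# Minimum R version stated in the preface.
min_version <- "2.2.1"

# getRversion() returns the running R version as a comparable object.
r_ok <- getRversion() >= min_version
if (!r_ok) {
  stop("R ", min_version, " or later is needed; found ", as.character(getRversion()))
}

# List the data sets shipped with HSAUR, but only if the package is installed.
if (requireNamespace("HSAUR", quietly = TRUE)) {
  items <- data(package = "HSAUR")$results[, "Item"]
  print(head(items))
} else {
  message("HSAUR is not installed; see install.packages(\"HSAUR\") above.")
}
```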
We welcome comments on the R package HSAUR, and where we think these
add to or improve our analysis of a data set we will incorporate them into the
package and, hopefully at a later stage, into a revised or second edition of the
book.
Plots and tables of results obtained from R are all labelled as ‘Figures’ in
the text. For the graphical material, the corresponding figure also contains
the ‘essence’ of the R code used to produce the figure, although this code may
differ a little from that given in the HSAUR package, since the latter may
include some features, for example thicker line widths, designed to make a
basic plot more suitable for publication.
We would like to thank the R Development Core Team for the R system, and
authors of contributed add-on packages, particularly Uwe Ligges and Vince
Carey for helpful advice on scatterplot3d and gee. Kurt Hornik, Ludwig A.
Hothorn, Fritz Leisch and Rafael Weißbach provided good advice with some
statistical and technical problems. We are also very grateful to Achim Zeileis
for reading the entire manuscript, pointing out inconsistencies or even bugs
and for making many suggestions which have led to improvements. Lastly we
would like to thank the CRC Press staff, in particular Rob Calver, for their
support during the preparation of the book. Any errors in the book are, of
course, the joint responsibility of the two authors.

Brian S. Everitt and Torsten Hothorn
London and Erlangen, December 2005


List of Figures

1.1 Histograms of the market value and the logarithm of the market value for the companies contained in the Forbes 2000 list.
1.2 Raw scatterplot of the logarithms of market value and sales.
1.3 Scatterplot with transparent shading of points of the logarithms of market value and sales.
1.4 Boxplots of the logarithms of the market value for four selected countries, the width of the boxes is proportional to the square roots of the number of companies.
2.1 Histogram (top) and boxplot (bottom) of malignant melanoma mortality rates.
2.2 Parallel boxplots of malignant melanoma mortality rates by contiguity to an ocean.
2.3 Estimated densities of malignant melanoma mortality rates by contiguity to an ocean.
2.4 Scatterplot of malignant melanoma mortality rates by geographical location.
2.5 Scatterplot of malignant melanoma mortality rates against latitude.
2.6 Bar chart of happiness.
2.7 Spineplot of health status and happiness.
2.8 Spinogram (left) and conditional density plot (right) of happiness depending on log-income.
3.1 Boxplots of estimates of room width in feet and metres (after conversion to feet) and normal probability plots of estimates of room width made in feet and in metres.
3.2 R output of the independent samples t-test for the roomwidth data.
3.3 R output of the independent samples Welch test for the roomwidth data.
3.4 R output of the Wilcoxon rank sum test for the roomwidth data.
3.5 Boxplot and normal probability plot for differences between the two mooring methods.


3.6 R output of the paired t-test for the waves data.
3.7 R output of the Wilcoxon signed rank test for the waves data.
3.8 Enhanced scatterplot of water hardness and mortality, showing both the joint and the marginal distributions and, in addition, the location of the city by different plotting symbols.
3.9 R output of Pearson’s correlation coefficient for the water data.
3.10 R output of the chi-squared test for the pistonrings data.
3.11 Association plot of the residuals for the pistonrings data.
3.12 R output of McNemar’s test for the rearrests data.
3.13 R output of an exact version of McNemar’s test for the rearrests data computed via a binomial test.
4.1 An approximation for the conditional distribution of the difference of mean roomwidth estimates in the feet and metres group under the null hypothesis. The vertical lines show the negative and positive absolute value of the test statistic T obtained from the original data.
4.2 R output of the exact permutation test applied to the roomwidth data.
4.3 R output of the exact conditional Wilcoxon rank sum test applied to the roomwidth data.
4.4 R output of Fisher’s exact test for the suicides data.
5.1 Plot of mean weight gain for each level of the two factors.
5.2 R output of the ANOVA fit for the weightgain data.
5.3 Interaction plot of type and source.
5.4 Plot of mean litter weight for each level of the two factors for the foster data.
5.5 Graphical presentation of multiple comparison results for the foster feeding data.
5.6 Scatterplot matrix of epoch means for Egyptian skulls data.
6.1 Scatterplot of velocity and distance.
6.2 Scatterplot of velocity and distance with estimated regression line (left) and plot of residuals against fitted values (right).
6.3 Boxplots of rainfall.
6.4 Scatterplots of rainfall against the continuous covariates.
6.5 R output of the linear model fit for the clouds data.
6.6 Regression relationship between S-Ne criterion and rainfall with and without seeding.
6.7 Plot of residuals against fitted values for clouds seeding data.


6.8 Normal probability plot of residuals from cloud seeding model clouds_lm.
6.9 Index plot of Cook’s distances for cloud seeding data.
7.1 Conditional density plots of the erythrocyte sedimentation rate (ESR) given fibrinogen and globulin.
7.2 R output of the summary method for the logistic regression model fitted to ESR and fibrinogen.
7.3 R output of the summary method for the logistic regression model fitted to ESR and both globulin and fibrinogen.
7.4 Bubbleplot of fitted values for a logistic regression model fitted to the plasma data.
7.5 R output of the summary method for the logistic regression model fitted to the womensrole data.
7.6 Fitted (from womensrole_glm_1) and observed probabilities of agreeing for the womensrole data.
7.7 R output of the summary method for the logistic regression model fitted to the womensrole data.
7.8 Fitted (from womensrole_glm_2) and observed probabilities of agreeing for the womensrole data.
7.9 Plot of deviance residuals from logistic regression model fitted to the womensrole data.
7.10 R output of the summary method for the Poisson regression model fitted to the polyps data.
7.11 R output of the print method for the conditional logistic regression model fitted to the backpain data.
8.1 Three commonly used kernel functions.
8.2 Kernel estimate showing the contributions of Gaussian kernels evaluated for the individual observations with bandwidth h = 0.4.
8.3 Epanechnikov kernel for a grid between (−1.1, −1.1) and (1.1, 1.1).
8.4 Density estimates of the geyser eruption data imposed on a histogram of the data.
8.5 A contour plot of the bivariate density estimate of the CYGOB1 data, i.e., a two-dimensional graphical display for a three-dimensional problem.
8.6 The bivariate density estimate of the CYGOB1 data, here shown in a three-dimensional fashion using the persp function.
8.7 Fitted normal density and two-component normal mixture for geyser eruption data.
8.8 Bootstrap distribution and confidence intervals for the mean estimates of a two-component mixture for the geyser data.


9.1 Initial tree for the body fat data with the distribution of body fat in terminal nodes visualised via boxplots.
9.2 Pruned regression tree for body fat data.
9.3 Observed and predicted DXA measurements.
9.4 Pruned classification tree of the glaucoma data with class distribution in the leaves.
9.5 Estimated class probabilities depending on two important variables. The 0.5 cut-off for the estimated glaucoma probability is depicted as a horizontal line. Glaucomateous eyes are plotted as circles and normal eyes are triangles.
9.6 Conditional inference tree with the distribution of body fat content shown for each terminal leaf.
9.7 Conditional inference tree with the distribution of glaucomateous eyes shown for each terminal leaf.
10.1 A linear spline function with knots at a = 1, b = 3 and c = 5.
10.2 Scatterplot of year and winning time.
10.3 Scatterplot of year and winning time with fitted values from a simple linear model.
10.4 Scatterplot of year and winning time with fitted values from a smooth non-parametric model.
10.5 Scatterplot of year and winning time with fitted values from a quadratic model.
10.6 Partial contributions of six exploratory covariates to the predicted SO2 concentration.
10.7 Residual plot of SO2 concentration.
10.8 Spinograms of the three exploratory variables and response variable kyphosis.
10.9 Partial contributions of three exploratory variables with confidence bands.
11.1 ‘Bath tub’ shape of a hazard function.
11.2 Survival times comparing treated and control patients.
11.3 Kaplan-Meier estimates for breast cancer patients who either received a hormonal therapy or not.
11.4 R output of the summary method for GBSG2_coxph.
11.5 Estimated regression coefficient for age depending on time for the GBSG2 data.
11.6 Martingale residuals for the GBSG2 data.
11.7 Conditional inference tree for the GBSG2 data with the survival function, estimated by Kaplan-Meier, shown for every subgroup of patients identified by the tree.
12.1 Boxplots for the repeated measures by treatment group for the BtheB data.


12.2 R output of the linear mixed-effects model fit for the BtheB data.
12.3 R output of the asymptotic p-values for linear mixed-effects model fit for the BtheB data.
12.4 Quantile-quantile plots of predicted random intercepts and residuals for the random intercept model BtheB_lmer1 fitted to the BtheB data.
12.5 Distribution of BDI values for patients that do (circles) and do not (bullets) attend the next scheduled visit.
13.1 Simulation of a positive response in a random intercept logistic regression model for 20 subjects. The thick line is the average over all 20 subjects.
13.2 R output of the summary method for the btb_gee model (slightly abbreviated).
13.3 R output of the summary method for the btb_gee1 model (slightly abbreviated).
13.4 R output of the summary method for the resp_glm model.
13.5 R output of the summary method for the resp_gee1 model (slightly abbreviated).
13.6 R output of the summary method for the resp_gee2 model (slightly abbreviated).
13.7 Boxplots of numbers of seizures in each two-week period post randomisation for placebo and active treatments.
13.8 Boxplots of log of numbers of seizures in each two-week period post randomisation for placebo and active treatments.
13.9 R output of the summary method for the epilepsy_glm model.
13.10 R output of the summary method for the epilepsy_gee1 model (slightly abbreviated).
13.11 R output of the summary method for the epilepsy_gee2 model (slightly abbreviated).
13.12 R output of the summary method for the epilepsy_gee3 model (slightly abbreviated).
13.13 R output of the summary method for the resp_lmer model (abbreviated).
14.1 Distribution of levels of expressed alpha synuclein mRNA in three groups defined by the NACP-REP1 allele lengths.
14.2 Simultaneous confidence intervals for the alpha data based on the ordinary covariance matrix (left) and a sandwich estimator (right).
14.3 Probability of damage caused by roe deer browsing for six tree species. Sample sizes are given in brackets.


14.4 Regression relationship between S-Ne criterion and rainfall with and without seeding. The confidence bands cover the area within the dashed curves.
15.1 R output of the summary method for smokingOR.
15.2 Forest plot of observed effect sizes and 95% confidence intervals for the nicotine gum studies.
15.3 R output of the summary method for BCG_OR.
15.4 R output of the summary method for BCG_DSL.
15.5 R output of the summary method for BCG_mod.
15.6 Plot of observed effect size for the BCG vaccine data against latitude, with a weighted least squares regression fit shown in addition.
15.7 Example funnel plots from simulated data. The asymmetry in the lower plot is a hint that a publication bias might be a problem.
15.8 Funnel plot for nicotine gum data.
16.1 Scatterplot matrix for the heptathlon data (all countries).
16.2 Scatterplot matrix for the heptathlon data after removing observations of the PNG competitor.
16.3 Barplot of the variances explained by the principal components (with observations for PNG removed).
16.4 Biplot of the (scaled) first two principal components (with observations for PNG removed).
16.5 Scatterplot of the score assigned to each athlete in 1988 and the first principal component.
17.1 Two-dimensional solution from classical multidimensional scaling of distance matrix for water vole populations.
17.2 Minimum spanning tree for the watervoles data.
17.3 Two-dimensional solution from non-metric multidimensional scaling of distance matrix for voting matrix.
17.4 The Shepard diagram for the voting data shows some discrepancies between the original dissimilarities and the multidimensional scaling solution.
18.1 Bivariate data showing the presence of three clusters.
18.2 Example of a dendrogram.
18.3 Darwin’s Tree of Life.
18.4 Image plot of the dissimilarity matrix of the pottery data.
18.5 Hierarchical clustering of pottery data and resulting dendrograms.
18.6 3D scatterplot of the logarithms of the three variables available for each of the exoplanets.


18.7 Within-cluster sum of squares for different numbers of clusters for the exoplanet data.
18.8 Plot of BIC values for a variety of models and a range of number of clusters.
18.9 Scatterplot matrix of planets data showing a three-cluster solution from Mclust.
18.10 3D scatterplot of planets data showing a three-cluster solution from Mclust.

List of Tables

2.1 USmelanoma data. USA mortality rates for white males due to malignant melanoma.
2.2 CHFLS data. Chinese Health and Family Life Survey.
2.3 household data. Household expenditure for single men and women.
2.4 suicides2 data. Mortality rates per 100,000 from male suicides.
2.5 USstates data. Socio-demographic variables for ten US states.
2.6 banknote data (package alr3). Swiss bank note data.
3.1 roomwidth data. Room width estimates (width) in feet and in metres (unit).
3.2 waves data. Bending stress (root mean squared bending moment in Newton metres) for two mooring methods in a wave energy experiment.
3.3 water data. Mortality (per 100,000 males per year, mortality) and water hardness for 61 cities in England and Wales.
3.4 pistonrings data. Number of piston ring failures for three legs of four compressors.
3.5 rearrests data. Rearrests of juvenile felons by type of court in which they were tried.
3.6 The general r × c table.
3.7 Frequencies in matched samples data.
4.1 suicides data. Crowd behaviour at threatened suicides.
4.2 Classification system for the response variable.
4.3 Lanza data. Misoprostol randomised clinical trial from Lanza (1987).
4.4 Lanza data. Misoprostol randomised clinical trial from Lanza et al. (1988a).
4.5 Lanza data. Misoprostol randomised clinical trial from Lanza et al. (1988b).


4.6 Lanza data. Misoprostol randomised clinical trial from Lanza et al. (1989).
4.7 anomalies data. Abnormalities of the face and digits of newborn infants exposed to antiepileptic drugs as assessed by a paediatrician (MD) and a research assistant (RA).
4.8 orallesions data. Oral lesions found in house-to-house surveys in three geographic regions of rural India.
5.1 weightgain data. Rat weight gain for diets differing by the amount of protein (type) and source of protein (source).
5.2 foster data. Foster feeding experiment for rats with different genotypes of the litter (litgen) and mother (motgen).
5.3 skulls data. Measurements of four variables taken from Egyptian skulls of five periods.
5.4 schooldays data. Days absent from school.
5.5 students data. Treatment and results of two tests in three groups of students.
6.1 hubble data. Distance and velocity for 24 galaxies.
6.2 clouds data. Cloud seeding experiments in Florida – see above for explanations of the variables.
6.3 Analysis of variance table for the multiple linear regression model.
7.1 plasma data. Blood plasma data.
7.2 womensrole data. Women’s role in society data.
7.3 polyps data. Number of polyps for two treatment arms.
7.4 backpain data. Number of drivers (D) and non-drivers (D̄), suburban (S) and city inhabitants (S̄) either suffering from a herniated disc (cases) or not (controls).
7.5 bladdercancer data. Number of recurrent tumours for bladder cancer patients.
7.6 leuk data (package MASS). Survival times of patients suffering from leukemia.
8.1 faithful data (package datasets). Old Faithful geyser waiting times between two eruptions.
8.2 CYGOB1 data. Energy output and surface temperature of Star Cluster CYG OB1.
8.3 galaxies data (package MASS). Velocities of 82 galaxies.
8.4 birthdeathrates data. Birth and death rates for 69 countries.
8.5 schizophrenia data. Age on onset of schizophrenia for both sexes.


9.1 bodyfat data (package mboost). Body fat prediction by skinfold thickness, circumferences, and bone breadths.
10.1 men1500m data. Olympic Games 1896 to 2004 winners of the men’s 1500m.
10.2 USairpollution data. Air pollution in 41 US cities.
10.3 kyphosis data (package rpart). Children who have had corrective spinal surgery.
11.1 glioma data. Patients suffering from two types of glioma treated with the standard therapy or a novel radioimmunotherapy (RIT).
11.2 GBSG2 data (package ipred). Randomised clinical trial data from patients suffering from node-positive breast cancer. Only the data of the first 20 patients are shown here.
11.3 mastectomy data. Survival times in months after mastectomy of women with breast cancer.
12.1 BtheB data. Data of a randomised trial evaluating the effects of Beat the Blues.
12.2 phosphate data. Plasma inorganic phosphate levels for various time points after glucose challenge.
13.1 respiratory data. Randomised clinical trial data from patients suffering from respiratory illness. Only the data of the first seven patients are shown here.
13.2 epilepsy data. Randomised clinical trial data from patients suffering from epilepsy. Only the data of the first seven patients are shown here.
13.3 schizophrenia2 data. Clinical trial data from patients suffering from schizophrenia. Only the data of the first four patients are shown here.
14.1 alpha data (package coin). Allele length and levels of expressed alpha synuclein mRNA in alcohol-dependent patients.
14.2 trees513 data (package multcomp).
15.1 smoking data. Meta-analysis on nicotine gum showing the number of quitters who have been treated (qt), the total number of treated (tt) as well as the number of quitters in the control group (qc) with total number of smokers in the control group (tc).


15.2 BCG data. Meta-analysis on BCG vaccine with the following data: the number of TBC cases after a vaccination with BCG (BCGTB), the total number of people who received BCG (BCG) as well as the number of TBC cases without vaccination (NoVaccTB) and the total number of people in the study without vaccination (NoVacc).
15.4 aspirin data. Meta-analysis on aspirin and myocardial infarct, the table shows the number of deaths after placebo (dp), the total number subjects treated with placebo (tp) as well as the number of deaths after aspirin (da) and the total number of subjects treated with aspirin (ta).
15.5 toothpaste data. Meta-analysis on trials comparing two toothpastes, the number of individuals in the study, the mean and the standard deviation for each study A and B are shown.
16.1 heptathlon data. Results Olympic heptathlon, Seoul, 1988.
16.2 meteo data. Meteorological measurements in an 11-year period.
16.3 Correlations for calculus measurements for the six anterior mandibular teeth.
17.1 watervoles data. Water voles data – dissimilarity matrix.
17.2 voting data. House of Representatives voting data.
17.3 eurodist data (package datasets). Distances between European cities, in km.
17.4 gardenflowers data. Dissimilarity matrix of 18 species of gardenflowers.
18.1 pottery data. Romano-British pottery data.
18.2 planets data. Jupiter mass, period and eccentricity of exoplanets.
18.3 Number of possible partitions depending on the sample size n and number of clusters k.

Contents

1 An Introduction to R
1.1 What is R?
1.2 Installing R
1.3 Help and Documentation
1.4 Data Objects in R
1.5 Data Import and Export
1.6 Basic Data Manipulation
1.7 Computing with Data
1.8 Organising an Analysis
1.9 Summary

2 Data Analysis Using Graphical Displays
2.1 Introduction
2.2 Initial Data Analysis
2.3 Analysis Using R
2.4 Summary

3 Simple Inference
3.1 Introduction
3.2 Statistical Tests
3.3 Analysis Using R
3.4 Summary

4 Conditional Inference
4.1 Introduction
4.2 Conditional Test Procedures
4.3 Analysis Using R
4.4 Summary

5 Analysis of Variance
5.1 Introduction
5.2 Analysis of Variance
5.3 Analysis Using R
5.4 Summary

6 Simple and Multiple Linear Regression
6.1 Introduction
6.2 Simple Linear Regression
6.3 Multiple Linear Regression
6.4 Analysis Using R
6.5 Summary

7 Logistic Regression and Generalised Linear Models
7.1 Introduction
7.2 Logistic Regression and Generalised Linear Models
7.3 Analysis Using R
7.4 Summary

8 Density Estimation
8.1 Introduction
8.2 Density Estimation
8.3 Analysis Using R
8.4 Summary

9 Recursive Partitioning
9.1 Introduction
9.2 Recursive Partitioning
9.3 Analysis Using R
9.4 Summary

10 Smoothers and Generalised Additive Models
10.1 Introduction
10.2 Smoothers and Generalised Additive Models
10.3 Analysis Using R

11 Survival Analysis
11.1 Introduction
11.2 Survival Analysis
11.3 Analysis Using R
11.4 Summary

12 Analysing Longitudinal Data I
12.1 Introduction
12.2 Analysing Longitudinal Data
12.3 Linear Mixed Effects Models
12.4 Analysis Using R
12.5 Prediction of Random Effects
12.6 The Problem of Dropouts
12.7 Summary

13 Analysing Longitudinal Data II
13.1 Introduction
13.2 Methods for Non-normal Distributions
13.3 Analysis Using R: GEE
13.4 Analysis Using R: Random Effects
13.5 Summary

14 Simultaneous Inference and Multiple Comparisons
14.1 Introduction
14.2 Simultaneous Inference and Multiple Comparisons
14.3 Analysis Using R
14.4 Summary

15 Meta-Analysis
15.1 Introduction
15.2 Systematic Reviews and Meta-Analysis
15.3 Statistics of Meta-Analysis
15.4 Analysis Using R
15.5 Meta-Regression
15.6 Publication Bias
15.7 Summary

16 Principal Component Analysis
16.1 Introduction
16.2 Principal Component Analysis
16.3 Analysis Using R
16.4 Summary

17 Multidimensional Scaling
17.1 Introduction
17.2 Multidimensional Scaling
17.3 Analysis Using R
17.4 Summary

18 Cluster Analysis
18.1 Introduction
18.2 Cluster Analysis
18.3 Analysis Using R
18.4 Summary

Bibliography