2012 categorical data analysis using SAS


Categorical Data Analysis Using SAS®

Third Edition

Maura E. Stokes
Charles S. Davis
Gary G. Koch

Stokes, Maura E., Charles S. Davis, and Gary G. Koch. Categorical Data Analysis Using SAS®, Third Edition. Copyright © 2012,
SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED. For additional SAS resources, visit support.sas.com/bookstore.

The correct bibliographic citation for this manual is as follows: Stokes, Maura E., Charles S. Davis, and
Gary G. Koch. 2012. Categorical Data Analysis Using SAS®, Third Edition. Cary, NC: SAS Institute Inc.

Categorical Data Analysis Using SAS®, Third Edition
Copyright © 2012, SAS Institute Inc., Cary, NC, USA
ISBN 978-1-61290-090-2 (electronic book)
ISBN 978-1-60764-664-8
All rights reserved. Produced in the United States of America.
For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission
of the publisher, SAS Institute Inc.
For a Web download or e-book: Your use of this publication shall be governed by the terms established by the
vendor at the time you acquire this publication.
The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of
the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not
participate in or encourage electronic piracy of copyrighted materials. Your support of others’ rights is appreciated.
U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related
documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in
FAR 52.227-19, Commercial Computer Software-Restricted Rights (June 1987).
SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513-2414
1st printing, July 2012
SAS Institute Inc. provides a complete selection of books and electronic products to help customers use SAS software
to its fullest potential. For more information about our e-books, e-learning products, CDs, and hard-copy books, visit
the SAS Books Web site at support.sas.com/bookstore or call 1-800-727-3228.
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.


Contents
Chapter 1. Introduction
Chapter 2. The 2 × 2 Table
Chapter 3. Sets of 2 × 2 Tables
Chapter 4. Sets of 2 × r and s × 2 Tables
Chapter 5. The s × r Table
Chapter 6. Sets of s × r Tables
Chapter 7. Nonparametric Methods
Chapter 8. Logistic Regression I: Dichotomous Response
Chapter 9. Logistic Regression II: Polytomous Response
Chapter 10. Conditional Logistic Regression
Chapter 11. Quantal Response Data Analysis
Chapter 12. Poisson Regression and Related Loglinear Models
Chapter 13. Categorized Time-to-Event Data
Chapter 14. Weighted Least Squares
Chapter 15. Generalized Estimating Equations
References
Index


Preface to the Third Edition
The third edition accomplishes several purposes. First, it updates the use of SAS® software to
current practices. Since the last edition was published more than 10 years ago, numerous sets of
example statements have been modified to reflect best applications of SAS/STAT® software.
Second, the material has been expanded to take advantage of the many graphs now provided by
SAS/STAT software through ODS Graphics. Beginning with SAS/STAT 9.3, these graphs are
available with SAS/STAT—no other product license is required (a SAS/GRAPH® license was
required for previous releases). Graphs displayed in this edition include:
mosaic plots
effect plots

odds ratio plots
predicted cumulative proportions plot
regression diagnostic plots
agreement plots
Third, the book has been updated and reorganized to reflect the evolution of categorical data
analysis strategies. The previous Chapter 14, “Repeated Measurements Using Weighted Least
Squares,” has been combined with the previous Chapter 13, “Weighted Least Squares,” to create
the current Chapter 14, “Weighted Least Squares.” The material previously in Chapter 16,
“Loglinear Models,” is found in the current Chapter 12, “Poisson Regression and Related Loglinear
Models.” The material in Chapter 10, “Conditional Logistic Regression,” has been rewritten, and
Chapter 8, “Logistic Regression I: Dichotomous Response,” and Chapter 9, “Logistic Regression
II: Polytomous Response,” have been expanded. In addition, the previous Chapter 16, “Categorized
Time-to-Event Data” is the current Chapter 13.
Numerous additional techniques are covered in this edition, including:
incidence density ratios and their confidence intervals
additional confidence intervals for difference of proportions
exact Poisson regression
difference measures to reflect direction of association in sets of tables
partial proportional odds model
use of the QIC statistic in GEE analysis
odds ratios in the presence of interactions
Firth penalized likelihood approach for logistic regression
In addition, miscellaneous revisions and additions have been incorporated throughout the book.
However, the scope of the book remains the same as described in Chapter 1, “Introduction.”

Computing Details
The examples in this third edition were executed with SAS/STAT 12.1, although the revision was
largely based on SAS/STAT 9.3. The features specific to SAS/STAT 12.1 are:
mosaic plots in the FREQ procedure
partial proportional odds model in the LOGISTIC procedure
Miettinen-Nurminen confidence limits for proportion differences in PROC FREQ
headings for the estimates from the FIRTH option in PROC LOGISTIC
Because of limited space, not all of the output that is produced with the example SAS code is shown.
Generally, only the output pertinent to the discussion is displayed. An ODS SELECT statement is
sometimes used in the example code to limit the tables produced. The ODS GRAPHICS ON and
ODS GRAPHICS OFF statements are used when graphs are produced. However, these statements
are not needed when graphs are produced as part of the SAS windowing environment beginning
with SAS 9.3. Also, the graphs produced for this book were generated with the STYLE=JOURNAL
option of ODS because the book does not feature color.

For More Information
The website contains further information that pertains to
topics in the book, including data (where possible) and errata.

Acknowledgments
We are grateful to the many people who have contributed to this revision. Bob Derr, Amy Herring,
Michael Hussey, Diana Lam, Siying Li, Michela Osborn, Ashley Lauren Paynter, Margaret
Polinkovsky, John Preisser, David Schlotzhauer, Todd Schwartz, Valerie Smith, Daniela Soltres-Alvarez,
Donna Watts, Catherine Wiener, Laura Elizabeth Weiner, and Laura Zhou provided
reviews, suggestions, proofing, and numerous other contributions that are greatly appreciated.


And, of course, we remain thankful to those persons who contributed to the earlier editions. They
include Diane Catellier, Sonia Davis, Bob Derr, William Duckworth II, Suzanne Edwards, Stuart
Gansky, Greg Goodwin, Wendy Greene, Duane Hayes, Allison Kinkead, Gordon Johnston, Lisa
LaVange, Antonio Pedroso-de-Lima, Annette Sanders, John Preisser, David Schlotzhauer, Todd
Schwartz, Dan Spitzner, Catherine Tangen, Lisa Tomasko, Donna Watts, Greg Weier, and Ozkan
Zengin.
Anne Baxter and Ed Huddleston edited this book.
Tim Arnold provided documentation programming support.


Chapter 1

Introduction
Contents
1.1 Overview
1.2 Scale of Measurement
1.3 Sampling Frameworks
1.4 Overview of Analysis Strategies
1.4.1 Randomization Methods
1.4.2 Modeling Strategies
1.5 Working with Tables in SAS Software
1.6 Using This Book

1.1 Overview

Data analysts often encounter response measures that are categorical in nature; their outcomes
reflect categories of information rather than the usual interval scale. Frequently, categorical data are
presented in tabular form, known as contingency tables. Categorical data analysis is concerned with
the analysis of categorical response measures, regardless of whether any accompanying explanatory
variables are also categorical or are continuous. This book discusses hypothesis testing strategies
for the assessment of association in contingency tables and sets of contingency tables. It also
discusses various modeling strategies available for describing the nature of the association between
a categorical response measure and a set of explanatory variables.
An important consideration in determining the appropriate analysis of categorical variables is their
scale of measurement. Section 1.2 describes the various scales and illustrates them with data sets
used in later chapters. Another important consideration is the sampling framework that produced
the data; it determines the possible analyses and the possible inferences. Section 1.3 describes the
typical sampling frameworks and their ramifications. Section 1.4 introduces the various analysis
strategies discussed in this book and describes how they relate to one another. It also discusses the
target populations generally assumed for each type of analysis and what types of inferences you
are able to make to them. Section 1.5 reviews how SAS software handles contingency tables and
other forms of categorical data. Finally, Section 1.6 provides a guide to the material in the book for
various types of readers, including indications of the difficulty level of the chapters.


1.2 Scale of Measurement

The scale of measurement of a categorical response variable is a key element in choosing an
appropriate analysis strategy. By taking advantage of the methodologies available for the particular
scale of measurement, you can choose a well-targeted strategy. If you do not take the scale of
measurement into account, you may choose an inappropriate strategy that could lead to erroneous
conclusions. Recognizing the scale of measurement and using it properly are very important in
categorical data analysis.
Categorical response variables can be
dichotomous
ordinal
nominal
discrete counts
grouped survival times
Dichotomous responses are those that have two possible outcomes—most often they are yes and no.
Did the subject develop the disease? Did the voter cast a ballot for the Democratic or Republican
candidate? Did the student pass the exam? For example, the objective of a clinical trial for a
new medication for colds is to determine whether patients obtained relief from their pain-producing ailment.
Consider Table 1.1, which is analyzed in Chapter 2, “The 2 × 2 Table.”
Table 1.1 Respiratory Outcomes

Treatment   Favorable   Unfavorable   Total
Placebo     16          48            64
Test        40          20            60

The placebo group contains 64 patients, and the test medication group contains 60 patients. The
columns contain the information concerning the categorical response measure: 40 patients in the
Test group had a favorable response to the medication, and 20 subjects did not. The outcome in this
example is thus dichotomous, and the analysis investigates the relationship between the response
and the treatment.
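
As a quick illustration of how the counts in Table 1.1 summarize a dichotomous response, the following sketch computes the proportion of favorable responses in each treatment group and their difference. It is written in Python rather than SAS purely for illustration; the names `table_1_1` and `favorable_proportion` are assumptions introduced here and are not part of the book's examples.

```python
# Illustrative sketch (not from the book): row proportions for Table 1.1.
# Counts are (favorable, unfavorable) for each treatment group.
table_1_1 = {
    "Placebo": (16, 48),
    "Test": (40, 20),
}


def favorable_proportion(favorable, unfavorable):
    """Return the proportion of subjects with a favorable response."""
    return favorable / (favorable + unfavorable)


proportions = {
    group: favorable_proportion(*counts) for group, counts in table_1_1.items()
}
difference = proportions["Test"] - proportions["Placebo"]

for group, p in proportions.items():
    print(f"{group}: {p:.3f}")
print(f"Difference (Test - Placebo): {difference:.3f}")
```

A difference of proportions is only one possible summary of a 2 × 2 table; the odds ratio is another.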
Frequently, categorical data responses represent more than two possible outcomes, and often these
possible outcomes take on some inherent ordering. Such response variables have an ordinal scale
of measurement. Did the new school curriculum produce little, some, or high enthusiasm among
the students? Does the water exhibit low, medium, or high hardness? In the former case, the order
of the response levels is clear, but there is no clue as to the relative distances between the levels.
In the latter case, there is a possible distance between the levels: medium might have twice the
hardness of low, and high might have three times the hardness of low. Sometimes the distance is
even clearer: a 50% potency dose versus a 100% potency dose versus a 200% potency dose. All
three cases are examples of ordinal data.
An example of an ordinal measure occurs in data displayed in Table 1.2, which is analyzed in
Chapter 9, “Logistic Regression II: Polytomous Response.” A clinical trial investigated a treatment
for rheumatoid arthritis. Male and female patients were given either the active treatment or a
placebo; the outcome measured was whether they showed marked, some, or no improvement at the
end of the clinical trial. The analysis uses the proportional odds model to assess the relationship
between the response variable and gender and treatment.
Table 1.2 Arthritis Data

                     Improvement
Sex      Treatment   Marked   Some   None   Total
Female   Active      16       5      6      27
Female   Placebo     6        7      19     32
Male     Active      5        2      7      14
Male     Placebo     1        0      10     11

Note that categorical response variables can often be managed in different ways. You could
combine the Marked and Some columns in Table 1.2 to produce a dichotomous outcome: No
Improvement versus Improvement. Grouping categories is often done during an analysis if the
resulting dichotomous response is also of interest.
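
The grouping described above can be sketched in a few lines. This is a minimal Python illustration, not SAS code from the book; `table_1_2` and `collapse_to_dichotomous` are hypothetical names, and the counts are those of Table 1.2.

```python
# Illustrative sketch (not from the book): collapse Marked and Some into a
# single Improvement category. Keys are (sex, treatment); values are
# (marked, some, none) counts from Table 1.2.
table_1_2 = {
    ("Female", "Active"): (16, 5, 6),
    ("Female", "Placebo"): (6, 7, 19),
    ("Male", "Active"): (5, 2, 7),
    ("Male", "Placebo"): (1, 0, 10),
}


def collapse_to_dichotomous(marked, some, none):
    """Return (improvement, no_improvement) counts for one row."""
    return marked + some, none


dichotomous = {
    key: collapse_to_dichotomous(*counts) for key, counts in table_1_2.items()
}

for (sex, treatment), (improved, not_improved) in dichotomous.items():
    print(f"{sex:6s} {treatment:7s} improved={improved:2d} none={not_improved:2d}")
```

Collapsing leaves each row total unchanged, so the same subjects are simply classified on a coarser scale.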
If you have more than two outcome categories, and there is no inherent ordering to the categories,
you have a nominal measurement scale. Which of four candidates did you vote for in the town
council election? Do you prefer the beach, mountains, or lake for a vacation? There is no
underlying scale for such outcomes and no apparent way in which to order them.
Consider Table 1.3, which is analyzed in Chapter 5, “The s × r Table.” Residents in one town
were asked their political party affiliation and their neighborhood. Researchers were interested in
the association between political affiliation and neighborhood. Unlike ordinal response levels, the
classifications Bayside, Highland, Longview, and Sheffeld lie on no conceivable underlying scale.
However, you can still assess whether there is association in the table, which is done in Chapter 5.
Table 1.3 Distribution of Parties in Neighborhoods

                            Neighborhood
Party         Bayside   Highland   Longview   Sheffeld
Democrat      221       160        360        140
Independent   200       291        160        311
Republican    208       106        316        97
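
One common way to assess association in a table such as Table 1.3 is the Pearson chi-square statistic, which compares each observed count with the count expected if party and neighborhood were unrelated. The text here does not name a specific statistic, so the choice of Pearson chi-square is an assumption made for this sketch, as are the Python names used; it is an illustration rather than the book's SAS analysis.

```python
# Illustrative sketch (not from the book): Pearson chi-square for Table 1.3.
# Rows are parties; columns are Bayside, Highland, Longview, Sheffeld.
observed = [
    [221, 160, 360, 140],  # Democrat
    [200, 291, 160, 311],  # Independent
    [208, 106, 316, 97],   # Republican
]


def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for an s x r table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Expected count under no association between rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (count - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


q_pearson, df = pearson_chi_square(observed)
print(f"Pearson chi-square = {q_pearson:.2f} with {df} degrees of freedom")
```

Because the statistic uses only the cell counts and margins, it does not require any ordering of the neighborhoods.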

Categorical response variables sometimes contain discrete counts. Instead of falling into categories
that are labeled (yes, no) or (low, medium, high), the outcomes are numbers themselves. Was the
litter size 1, 2, 3, 4, or 5 members? Did the house contain 1, 2, 3, or 4 air conditioners? While the
usual strategy would be to analyze the mean count, the assumptions required for the standard linear
model for continuous data are often not met with discrete counts that have small range; the counts
are not distributed normally and may not have homogeneous variance.
For example, researchers examining respiratory disease in children visited children in different
regions two times and determined whether they showed symptoms of respiratory illness. The
response measure was whether the children exhibited symptoms in 0, 1, or 2 periods. Table 1.4
contains these data, which are analyzed in Chapter 14, “Weighted Least Squares.”


Table 1.4 Colds in Children

                     Periods with Colds
Sex      Residence   0     1     2     Total
Female   Rural       45    64    71    180
Female   Urban       80    104   116   300
Male     Rural       84    124   82    290
Male     Urban       106   117   87    310

The table represents a cross-classification of gender, residence, and number of periods with colds.
The analysis is concerned with modeling mean colds as a function of gender and residence.
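
To make the idea of a mean count concrete, the sketch below computes the mean number of periods with colds for each group in Table 1.4 by weighting the scores 0, 1, and 2 by their counts. It is a Python illustration under assumed names (`table_1_4`, `mean_periods`) and covers only the descriptive step, not the weighted least squares modeling that the book applies to these data.

```python
# Illustrative sketch (not from the book): mean number of periods with colds
# for each sex-by-residence group in Table 1.4.
# Values are counts of children with colds in 0, 1, and 2 periods.
table_1_4 = {
    ("Female", "Rural"): (45, 64, 71),
    ("Female", "Urban"): (80, 104, 116),
    ("Male", "Rural"): (84, 124, 82),
    ("Male", "Urban"): (106, 117, 87),
}


def mean_periods(counts):
    """Return the mean of the scores 0, 1, 2, ... weighted by the counts."""
    total = sum(counts)
    return sum(score * count for score, count in enumerate(counts)) / total


means = {group: mean_periods(counts) for group, counts in table_1_4.items()}

for (sex, residence), mean in means.items():
    print(f"{sex:6s} {residence:5s} mean periods with colds = {mean:.3f}")
```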
Finally, another type of response variable in categorical data analysis is one that represents survival
times. With survival data, you are tracking the number of patients with certain outcomes (possibly
death) over time. Often, the times of the condition are grouped together so that the response
variable represents the number of patients who fail during a specific time interval. Such data are
called grouped survival times. For example, the data displayed in Table 1.5 are from Chapter 13,
“Categorized Time-to-Event Data.” A clinical condition is treated with an active drug for some
patients and with a placebo for others. The response categories are whether there are recurrences,
no recurrences, or whether the patients withdrew from the study. The entries correspond to the time
intervals 0–1 years, 1–2 years, and 2–3 years, which make up the rows of the table.
Table 1.5 Life Table Format for Clinical Condition Data

Controls
Interval    No Recurrences   Recurrences   Withdrawals   At Risk
0–1 Years   50               15            9             74
1–2 Years   30               13            7             50
2–3 Years   17               7             6             30

Active
Interval    No Recurrences   Recurrences   Withdrawals   At Risk
0–1 Years   69               12            9             90
1–2 Years   59               7             3             69
2–3 Years   45               10            4             59

1.3 Sampling Frameworks

Categorical data arise from different sampling frameworks. The nature of the sampling framework
determines the assumptions that can be made for the statistical analyses and in turn influences the
type of analysis that can be applied. The sampling framework also determines the type of inference
that is possible. Study populations are limited to target populations, those populations to which
inferences can be made, by assumptions justified by the sampling framework.
Generally, data fall into one of three sampling frameworks: historical data, experimental data,
and sample survey data. Historical data are observational data, which means that the study
population has a geographic or circumstantial definition. These may include all the occurrences of
an infectious disease in a multicounty area, the children attending a particular elementary school,
or those persons appearing in court during a specified time period. Highway safety data concerning
injuries in motor vehicles are another example of historical data.
Experimental data are drawn from studies that involve the random allocation of subjects to different
treatments of one sort or another. Examples include studies where types of fertilizer are applied to
agricultural plots and studies where subjects are administered different dosages of drug therapies.
In the health sciences, experimental data may include patients randomly administered a placebo or
treatment for their medical condition.
In sample survey studies, subjects are randomly chosen from a larger study population. Investigators
may randomly choose students from their school IDs and survey them about social behavior;
national health care studies may randomly sample Medicare users and investigate physician
utilization patterns. In addition, some sampling designs may be a combination of sample survey
and experimental data processes. Researchers may randomly select a study population and then
randomly assign treatments to the resulting study subjects.
The major difference in the three sampling frameworks described in this section is the use of
randomization to obtain them. Historical data involve no randomization, and so it is often difficult
to assume that they are representative of a convenient population. Experimental data have good
coverage of the possibilities of alternative treatments for the restricted protocol population, and
sample survey data have very good coverage of the larger population from which they were
selected.
Note that the unit of randomization can be a single subject or a cluster of subjects. In addition,
randomization may be applied within subsets, called strata or blocks, with equal or unequal
probabilities. In sample surveys, all of this can lead to more complicated designs, such as stratified
random samples, or even multistage cluster random samples. In experimental design studies, such
considerations lead to repeated measurements (or split-plot) studies.

1.4 Overview of Analysis Strategies

Categorical data analysis strategies can be classified into those that are concerned with hypothesis
testing and those that are concerned with modeling. Many questions about a categorical data set
can be answered by addressing a specific hypothesis concerning association. Such hypotheses
are often investigated with randomization methods. In addition to making statements about
association, you may also want to describe the nature of the association in the data set. Statistical
modeling techniques using maximum likelihood estimation or weighted least squares estimation
are employed to describe patterns of association or variation in terms of a parsimonious statistical
model. Imrey (2011) includes a historical perspective on numerous methods described in this book.
Most often the hypothesis of interest is whether association exists between the rows of a contingency
table and its columns. The only assumption that is required is randomized allocation of subjects,
either through the study design (experimental design) or through the hypothesis itself (necessary
for historical data). In addition, particularly for the use of historical data, you often want to control
for other explanatory variables that may have influenced the observed outcomes.


1.4.1 Randomization Methods

Table 1.1, the respiratory outcomes data, contains information obtained as part of a randomized
allocation process. The hypothesis of interest is whether there is an association between treatment
and outcome. For these data, the randomization is accomplished by the study design.
Table 1.6 contains data from a similar study. The main difference is that the study was conducted
in two medical centers. The hypothesis of interest is whether there is an association between
treatment and outcome, controlling for any effect of center.
Table 1.6 Respiratory Improvement

Center   Treatment   Yes   No   Total
1        Test        29    16   45
1        Placebo     14    31   45
Total                43    47   90
2        Test        37    8    45
2        Placebo     24    21   45
Total                61    29   90
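
One widely used randomization-based statistic for testing association while controlling for a stratification variable is the Mantel-Haenszel statistic, which pools the observed and expected counts of one cell across the strata. The text at this point does not name the statistic, so its use here is an assumption for illustration. The sketch below applies it to the two centers of Table 1.6 in Python; the names `centers` and `mantel_haenszel` are hypothetical.

```python
# Illustrative sketch (not from the book's code): Mantel-Haenszel statistic
# for Table 1.6. Each stratum (center) is a 2 x 2 table laid out as
# ((test_yes, test_no), (placebo_yes, placebo_no)).
centers = [
    ((29, 16), (14, 31)),  # Center 1
    ((37, 8), (24, 21)),   # Center 2
]


def mantel_haenszel(strata):
    """Return the Mantel-Haenszel chi-square (1 df) for a set of 2 x 2 tables."""
    observed_sum = 0.0
    expected_sum = 0.0
    variance_sum = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        observed_sum += a
        expected_sum += row1 * col1 / n
        # Hypergeometric variance of the (1, 1) cell under no association.
        variance_sum += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
    return (observed_sum - expected_sum) ** 2 / variance_sum


q_mh = mantel_haenszel(centers)
print(f"Mantel-Haenszel chi-square = {q_mh:.2f}")
```

The variance term comes from treating the margins of each center's table as fixed, which is the randomization argument this section describes.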

Chapter 2, “The 2 × 2 Table,” is primarily concerned with the association in 2 × 2 tables; in
addition, it discusses measures of association, that is, statistics designed to evaluate the strength of
the association. Chapter 3, “Sets of 2 × 2 Tables,” discusses the investigation of association in sets
of 2 × 2 tables. When the table of interest has more than two rows and two columns, the analysis
is further complicated by the consideration of scale of measurement. Chapter 4, “Sets of 2 × r and
s × 2 Tables,” considers the assessment of association in sets of tables where the rows (columns)
have more than two levels.
Chapter 5 describes the assessment of association in the general s × r table, and Chapter 6, “Sets of
s × r Tables,” describes the assessment of association in sets of s × r tables. The investigation of
association in tables and sets of tables is further discussed in Chapter 7, “Nonparametric Methods,”
which discusses traditional nonparametric tests that have counterparts among the strategies for
analyzing contingency tables.
Another consideration in data analysis is whether you have enough data to support the asymptotic
theory required for many tests. Often, you may have an overall table sample size that is too small or
a number of zero or small cell counts that make the asymptotic assumptions questionable. Recently,
exact methods have been developed for a number of association statistics that permit you to address
the same hypotheses for these types of data. The above-mentioned chapters illustrate the use of
exact methods for many situations.

1.4.2 Modeling Strategies

Often, you are interested in describing the variation of your response variable in your data with a
statistical model. In the continuous data setting, you frequently fit a model to the expected mean
response. However, with categorical outcomes, there are a variety of response functions that you
can model. Depending on the response function that you choose, you can use weighted least
squares or maximum likelihood methods to estimate the model parameters.


Perhaps the most common response function modeled for categorical data is the logit. If you have
a dichotomous response and represent the proportion of those subjects with an event (versus no
event) outcome as p, then the logit can be written
$$\log\left(\frac{p}{1-p}\right)$$

Logistic regression is a modeling strategy that relates the logit to a set of explanatory variables
with a linear model. One of its benefits is that estimates of odds ratios, important measures of
association, can be obtained from the parameter estimates. Maximum likelihood estimation is used
to provide those estimates.
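
The link between logits and odds ratios can be checked numerically. The sketch below uses the Table 1.1 counts to compute the logit of the favorable proportion in each treatment group and then exponentiates the difference of the two logits. It is a Python illustration with assumed names (`logit`, `p_test`, `p_placebo`) and works from raw proportions rather than from a fitted logistic regression model.

```python
import math

# Illustrative sketch (not from the book): logits and the odds ratio for
# Table 1.1, where p is the proportion of favorable responses in a group.


def logit(p):
    """Return log(p / (1 - p)) for a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))


p_test = 40 / 60      # favorable proportion, Test group
p_placebo = 16 / 64   # favorable proportion, Placebo group

# A difference of logits is a log odds ratio, so exponentiating recovers
# the odds ratio comparing Test with Placebo.
log_odds_ratio = logit(p_test) - logit(p_placebo)
odds_ratio = math.exp(log_odds_ratio)

print(f"logit(Test) = {logit(p_test):.3f}")
print(f"logit(Placebo) = {logit(p_placebo):.3f}")
print(f"odds ratio (Test vs. Placebo) = {odds_ratio:.2f}")
```

With a single dichotomous explanatory variable, this difference of logits plays the role of the slope parameter in a logistic model, which is why odds ratios follow directly from parameter estimates.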
Chapter 8, “Logistic Regression I: Dichotomous Response,” discusses logistic regression for
a dichotomous outcome variable. Chapter 9, “Logistic Regression II: Polytomous Response,”
discusses logistic regression for the situation where there are more than two outcomes for the
response variable. Logits called generalized logits can be analyzed when the outcomes are nominal.
And logits called cumulative logits can be analyzed when the outcomes are ordinal. Chapter
10, “Conditional Logistic Regression,” describes a specialized form of logistic regression that is
appropriate when the data are highly stratified or arise from matched case-control studies. These
chapters describe the use of exact conditional logistic regression for those situations where you
have limited or sparse data, and the asymptotic requirements for the usual maximum likelihood
approach are not met.
Poisson regression is a modeling strategy that is suitable for discrete counts, and it is discussed in
Chapter 12, “Poisson Regression and Related Loglinear Models.” Most often the log of the count
is used as the response function.
Some application areas have features that led to the development of special statistical techniques.
One of these areas for categorical data is bioassay analysis. Bioassay is the process of determining
the potency or strength of a reagent or stimulus based on the response it elicits in biological
organisms. Logistic regression is a technique often applied in bioassay analysis, where its
parameters take on specific meaning. Chapter 11, “Quantal Response Data Analysis,” discusses the use
of categorical data methods for quantal bioassay. Another special application area for categorical
data analysis is the analysis of grouped survival data. Chapter 13, “Categorized Time-to-Event
Data,” discusses some features of survival analysis that are pertinent to grouped survival data,
including how to model them with the piecewise exponential model.
In logistic regression, the objective is to predict a response outcome from a set of explanatory
variables. However, sometimes you simply want to describe the structure of association in a set of
variables for which there are no obvious outcome or predictor variables. This occurs frequently for
sociological studies. The loglinear model is a traditional modeling strategy for categorical data and
is appropriate for describing the association in such a set of variables. It is closely related to logistic
regression, and the parameters in a loglinear model are also estimated with maximum likelihood
estimation. Chapter 12, “Poisson Regression and Related Loglinear Models,” includes a discussion
of the loglinear model, including a typical application.
Besides the logit and log counts, other useful response functions that can be modeled include
proportions, means, and measures of association. Weighted least squares estimation is a method of
analyzing such response functions, based on large sample theory. These methods are appropriate
when you have sufficient sample size and when you have a randomly selected sample, either
directly through study design or indirectly via assumptions concerning the representativeness
of the data. Not only can you model a variety of useful functions, but weighted least squares
estimation also provides a useful framework for the analysis of repeated categorical measurements,
particularly those limited to a small number of repeated values. Chapter 14, “Weighted Least
Squares,” addresses modeling categorical data with weighted least squares methods, including the
analysis of repeated measurements data.
Generalized estimating equations (GEE) is a widely used method for the analysis of correlated
responses, particularly for the analysis of categorical repeated measurements. The GEE method
applies to a broad range of repeated measurements situations, such as those including
time-dependent covariates and continuous explanatory variables, that weighted least squares doesn’t
handle. Chapter 15, “Generalized Estimating Equations,” discusses the GEE approach and
illustrates its application with a number of examples.

1.5 Working with Tables in SAS Software

This section discusses some considerations for managing tables with SAS. If you are already
familiar with the FREQ procedure, you may want to skip this section.
Many times, categorical data are presented to the researcher in the form of tables, and other times,
they are presented in the form of case record data. SAS procedures can handle either type of data.
In addition, many categorical data have ordered categories, so that the order of the levels of the
rows and columns takes on special meaning. There are numerous ways that you can specify a
particular order to SAS procedures.
Consider the following SAS DATA step that inputs the data displayed in Table 1.1.
data respire;
   input treat $ outcome $ count;
   datalines;
placebo f 16
placebo u 48
test    f 40
test    u 20
;
proc freq;
   weight count;
   tables treat*outcome;
run;

The data set RESPIRE contains three variables: TREAT is a character variable containing values
for treatment, OUTCOME is a character variable containing values for the outcome (f for favorable
and u for unfavorable), and COUNT contains the number of observations that have the respective
TREAT and OUTCOME values. Thus, COUNT effectively takes values corresponding to the cells
of Table 1.1. The PROC FREQ statements request that a table be constructed using TREAT as
the row variable and OUTCOME as the column variable. By default, PROC FREQ orders the
values of the rows (columns) in alphanumeric order. The WEIGHT statement is necessary to tell
the procedure that the data are count data, or frequency data; the variable listed in the WEIGHT
statement contains the values of the count variable.
Output 1.1 contains the resulting frequency table.
Output 1.1 Frequency Table

Table of treat by outcome
Cell contents: Frequency, Percent, Row Pct, Col Pct

                  outcome
treat             f         u     Total

placebo          16        48        64
              12.90     38.71     51.61
              25.00     75.00
              28.57     70.59

test             40        20        60
              32.26     16.13     48.39
              66.67     33.33
              71.43     29.41

Total            56        68       124
              45.16     54.84    100.00
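To see what the WEIGHT statement contributes, it can help to rerun the request without it. The following is a minimal sketch that assumes the count-form RESPIRE data set created above; the DATA= option and the NOROW, NOCOL, and NOPCT options are added here only to name the input explicitly and keep the output short.

```sas
/* Sketch: the same table request WITHOUT the WEIGHT statement.       */
/* Each of the four data lines is then treated as one observation,    */
/* so every cell has frequency 1 and the table total is 4, not 124.   */
proc freq data=respire;
   tables treat*outcome / norow nocol nopct;
run;
```

Because COUNT is ignored in this version, the resulting table describes the four data lines rather than the 124 subjects, which is why the WEIGHT statement is required for count data.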

Suppose that a different sample produced the numbers displayed in Table 1.7.
Table 1.7 Respiratory Outcomes

Treatment    Favorable    Unfavorable    Total
Placebo              5             10       15
Test                 8             20       28

These data may be stored in case record form, which means that each individual is represented
by a single observation. You can also use this type of input with the FREQ procedure. The only
difference is that the WEIGHT statement is not required.
The following statements create a SAS data set for these data and invoke PROC FREQ for case
record data. The @@ symbol in the INPUT statement means that the data lines contain multiple
observations.
data respire;
   input treat $ outcome $ @@;
   datalines;
placebo f placebo f placebo f
placebo f placebo f
placebo u placebo u placebo u
placebo u placebo u placebo u
placebo u placebo u placebo u
placebo u
test    f test    f test    f
test    f test    f test    f
test    f test    f
test    u test    u test    u
test    u test    u test    u
test    u test    u test    u
test    u test    u test    u
test    u test    u test    u
test    u test    u test    u
test    u test    u
;
proc freq;
   tables treat*outcome;
run;

Output 1.2 displays the resulting frequency table.
Output 1.2 Frequency Table

Table of treat by outcome
Cell contents: Frequency, Percent, Row Pct, Col Pct

                  outcome
treat             f         u     Total

placebo           5        10        15
              11.63     23.26     34.88
              33.33     66.67
              38.46     33.33

test              8        20        28
              18.60     46.51     65.12
              28.57     71.43
              61.54     66.67

Total            13        30        43
              30.23     69.77    100.00

In this book, the data are generally presented in count form.
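The two input forms are interchangeable, and it is sometimes convenient to move between them. The following is a minimal sketch, not taken from the book, that uses the Table 1.7 counts. The data set names RESPIRE_COUNTS, RESPIRE_CASES, and CELLCOUNTS and the loop index I are hypothetical names chosen for illustration.

```sas
/* Count form: one observation per table cell (Table 1.7) */
data respire_counts;
   input treat $ outcome $ count;
   datalines;
placebo f 5
placebo u 10
test    f 8
test    u 20
;

/* Case record form: write each cell's observation COUNT times */
data respire_cases;
   set respire_counts;
   do i = 1 to count;
      output;
   end;
   drop i count;
run;

/* Back to count form: the OUT= option saves the cell frequencies */
/* in a data set with the variables COUNT and PERCENT             */
proc freq data=respire_cases;
   tables treat*outcome / out=cellcounts;
run;
```

Running PROC FREQ on RESPIRE_CASES without a WEIGHT statement should give the same table as running it on RESPIRE_COUNTS with WEIGHT COUNT.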
When ordinal data are considered, it becomes quite important to ensure that the levels of the rows
and columns are sorted correctly. By default, the data are going to be sorted alphanumerically. If
this isn’t suitable, then you need to alter the default behavior.
Consider the data displayed in Table 1.2. Variable IMPROVE is the outcome, and the values
marked, some, and none are listed in decreasing order. Suppose that the data set ARTHRITIS is
created with the following statements.

data arthritis;
   length treatment $7. sex $6. ;
   input sex $ treatment $ improve $ count @@;
   datalines;
female active  marked 16 female active  some 5 female active  none  6
female placebo marked  6 female placebo some 7 female placebo none 19
male   active  marked  5 male   active  some 2 male   active  none  7
male   placebo marked  1 male   placebo some 0 male   placebo none 10
;

If you invoked PROC FREQ for this data set and used the default sort order, the levels of the
columns would be ordered marked, none, and some, which would be incorrect. One way to change
this default sort order is to use the ORDER=DATA option in the PROC FREQ statement. This
specifies that the sort order is the same order in which the values are encountered in the data set.
Thus, since ‘marked’ comes first, it is first in the sort order. Since ‘some’ is the second value for
IMPROVE encountered in the data set, it is second in the sort order, and ‘none’ is third.
This is the desired sort order. The following PROC FREQ statements produce a
table displaying the sort order resulting from the ORDER=DATA option.
proc freq order=data;
   weight count;
   tables treatment*improve;
run;

Output 1.3 displays the frequency table for the cross-classification of treatment and improvement
for these data; the values for IMPROVE are in the correct order.
Output 1.3 Frequency Table from ORDER=DATA Option

Table of treatment by improve
Cell contents: Frequency, Percent, Row Pct, Col Pct

                  improve
treatment    marked      some      none     Total

active           21         7        13        41
              25.00      8.33     15.48     48.81
              51.22     17.07     31.71
              75.00     50.00     30.95

placebo           7         7        29        43
               8.33      8.33     34.52     51.19
              16.28     16.28     67.44
              25.00     50.00     69.05

Total            28        14        42        84
              33.33     16.67     50.00    100.00

Other possible values for the ORDER= option include FORMATTED, which means sort by the
formatted values. The ORDER= option is also available with the CATMOD, LOGISTIC, and
GENMOD procedures. For information on the ORDER= option for the FREQ procedure, refer to
the SAS/STAT User’s Guide. This option is used frequently in this book.
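As an illustration of ORDER=FORMATTED, the following minimal sketch assigns a format whose labels sort in the desired order and then asks PROC FREQ to order the levels by those formatted values. It assumes the ARTHRITIS data set created above; the format name $IMPFMT and the numbered labels are hypothetical choices made for this example, not something the book prescribes.

```sas
/* Hypothetical character format: the numeric prefixes make the */
/* formatted values sort as marked, some, none                  */
proc format;
   value $impfmt 'marked' = '1: marked'
                 'some'   = '2: some'
                 'none'   = '3: none';
run;

/* ORDER=FORMATTED orders the levels by their formatted values. */
/* TREATMENT has no format, so it is ordered by its raw values. */
proc freq data=arthritis order=formatted;
   weight count;
   format improve $impfmt.;
   tables treatment*improve;
run;
```

Unlike ORDER=DATA, this approach does not depend on the order in which the values happen to appear in the data set, at the cost of displaying the prefixed labels in the table.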
Often, you want to analyze sets of tables. For example, you may want to analyze the
cross-classification of treatment and improvement for both males and females. You do this in PROC
FREQ by using a three-way crossing of the variables SEX, TREATMENT, and IMPROVE.
proc freq order=data;
   weight count;
   tables sex*treatment*improve / nocol nopct;
run;

The two rightmost variables in the TABLES statement determine the rows and columns of the
table, respectively. Separate tables are produced for the unique combination of values of the other
variables in the crossing. Since SEX has two levels, one table is produced for males and one table
is produced for females. If there were four variables in this crossing, with the two variables on the
left having two levels each, then four tables would be produced, one for each unique combination
of the two leftmost variables in the TABLES statement.
Note also that the options NOCOL and NOPCT are included. These options suppress the printing
of column percentages and cell percentages, respectively. Since generally you are interested in row
percentages, these options are often specified in the code displayed in this book.
Output 1.4 contains the two tables produced with the preceding statements.
Output 1.4 Producing Sets of Tables

Table 1 of treatment by improve
Controlling for sex=female
Cell contents: Frequency, Row Pct

                  improve
treatment    marked      some      none     Total

active           16         5         6        27
              59.26     18.52     22.22

placebo           6         7        19        32
              18.75     21.88     59.38

Total            22        12        25        59

Table 2 of treatment by improve
Controlling for sex=male
Cell contents: Frequency, Row Pct

                  improve
treatment    marked      some      none     Total

active            5         2         7        14
              35.71     14.29     50.00

placebo           1         0        10        11
               9.09      0.00     90.91

Total             6         2        17        25
This section reviewed some of the basic table management necessary for using the FREQ procedure.
Other related options are discussed in the appropriate chapters.

1.6 Using This Book

This book is intended for a variety of audiences, including novice readers with some statistical
background (solid understanding of regression analysis), those readers with substantial statistical
background, and those readers with a background in categorical data analysis. Therefore, not all of
this material will have the same importance to all readers. Some chapters include a good deal of
tutorial material, while others have a good deal of advanced material. This book is not intended to
be a comprehensive treatment of categorical data analysis, so some topics are mentioned briefly for
completeness and some other topics are emphasized because they are not well documented.
The data used in this book come from a variety of sources and represent a wide breadth of
application. However, due to the biostatistical background of all three authors, there is a certain
inevitable weighting of biostatistical examples. Most of the data come from practice, and
the original sources are cited when this is true; however, due to confidentiality concerns and
pedagogical requirements, some of the data are altered or created. However, they still represent
realistic situations.
Chapters 2–4 are intended to be accessible to all readers, as is most of Chapter 5. Chapter 6 is
an integration of Mantel-Haenszel methods at a more advanced level, but scanning it is probably
a good idea for any reader interested in the topic. In particular, the discussion about the analysis
of repeated measurements data with extended Mantel-Haenszel methods is useful material for all
readers comfortable with the Mantel-Haenszel technique.
Chapter 7 is a special interest chapter relating Mantel-Haenszel procedures to traditional nonparametric methods used for continuous data outcomes.
Chapters 8 and 9 on logistic regression are intended to be accessible to all readers, particularly
Chapter 8. The last section of Chapter 8 describes the statistical methodology more completely
for the advanced reader. Most of the material in Chapter 9 should be accessible to most readers.
Chapter 10 is a specialized chapter that discusses conditional logistic regression and requires
somewhat more statistical expertise. Chapter 11 discusses the use of logistic regression in
analyzing bioassay data.
Parts of the subsequent chapters discuss more advanced topics and are necessarily written at a
higher statistical level. Chapter 12 describes Poisson regression and loglinear models; much of the
Poisson regression should be fairly accessible but the loglinear discussion is somewhat advanced.
Chapter 13 discusses the analysis of categorized time-to-event data and most of it should be fairly
accessible.
Chapter 14 discusses weighted least squares and is written at a somewhat higher statistical level
than Chapters 8 and 9, but most readers should find this material useful, particularly the examples.
Chapter 15 describes the use of generalized estimating equations. The opening section includes a
basic example that is intended to be accessible to a wide range of readers.
All of the examples were executed with SAS/STAT 12.1, and the few exceptions where options and
results are only available with SAS/STAT 12.1 are noted in the “Preface to the Third Edition.”


Chapter 2

The 2 × 2 Table

Contents
2.1 Introduction
2.2 Chi-Square Statistics
2.3 Exact Tests
    2.3.1 Exact p-values for Chi-Square Statistics
2.4 Difference in Proportions
2.5 Odds Ratio and Relative Risk
    2.5.1 Exact Confidence Limits for the Odds Ratio
2.6 Sensitivity and Specificity
2.7 McNemar’s Test
2.8 Incidence Densities
2.9 Sample Size and Power Computations

2.1 Introduction

The 2 × 2 contingency table is one of the most common ways to summarize categorical data.
Categorizing patients by their favorable or unfavorable response to two different drugs, asking
health survey participants whether they have regular physicians and regular dentists, and asking
residents of two cities whether they desire more environmental regulations all result in data that can
be summarized in a 2 × 2 table.
Generally, interest lies in whether there is an association between the row variable and the column
variable that produce the table; sometimes there is further interest in describing the strength of that
association. The data can arise from several different sampling frameworks, and the interpretation
of the hypothesis of no association depends on the framework. Data in a 2 × 2 table can represent
the following:

• simple random samples from two groups that yield two independent binomial distributions
  for a binary response

  Asking residents from two cities whether they desire more environmental regulations is an
  example of this framework. This is a stratified random sampling setting, since the subjects
  from each city represent two independent random samples. Because interest lies in whether
  the proportion favoring regulation is the same for the two cities, the hypothesis of interest is
  the hypothesis of homogeneity. Is the distribution of the response the same in both groups?

• a simple random sample from one group that yields a single multinomial distribution for the
  cross-classification of two binary responses

  Taking a random sample of subjects and asking whether they see both a regular physician
  and a regular dentist is an example of this framework. The hypothesis of interest is one of
  independence. Are having a regular dentist and having a regular physician independent of
  each other?

• randomized assignment of patients to two equivalent treatments, resulting in the
  hypergeometric distribution

  This framework occurs when patients are randomly allocated to one of two drug treatments,
  regardless of how they are selected, and their response to that treatment is the binary outcome.
  Under the null hypothesis that the effects of the two treatments are the same for each patient,
  a hypergeometric distribution applies to the response distributions for the two treatments.

• incidence densities for counts of subjects who responded with some event versus the extent
  of exposure for the event

  These counts represent independent Poisson processes. This framework occurs less
  frequently than the others but is still important.

Table 2.1 summarizes the information from a randomized clinical trial that compared two treatments
(test and placebo) for a respiratory disorder.
Table 2.1 Respiratory Outcomes

Treatment    Favorable    Unfavorable    Total
Placebo             16             48       64
Test                40             20       60

The question of interest is whether the rates of favorable response for test (67%) and placebo
(25%) are the same. You can address this question by investigating whether there is a statistical
association between treatment and outcome. The null hypothesis is stated
H0: There is no association between treatment and outcome.
There are several ways of testing this hypothesis; many of the tests are based on the chi-square
statistic. Section 2.2 discusses these methods. However, sometimes the counts in the table cells are
too small to meet the sample size requirements necessary for the chi-square distribution to apply,
and exact methods based on the hypergeometric distribution are used to test the hypothesis of no
association. Exact methods are discussed in Section 2.3.
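As a minimal sketch of how this hypothesis might be examined in SAS, the following statements enter the Table 2.1 counts and add the CHISQ option to the TABLES statement. The data set and variable names follow the RESPIRE example in Section 1.5; the NOCOL and NOPCT options are an assumption about which percentages are of interest. The statistics themselves are the subject of Section 2.2.

```sas
/* Count-form input for Table 2.1 */
data respire;
   input treat $ outcome $ count;
   datalines;
placebo f 16
placebo u 48
test    f 40
test    u 20
;

/* CHISQ requests chi-square tests of no association;         */
/* the row percentages give the favorable rates by treatment  */
proc freq data=respire;
   weight count;
   tables treat*outcome / chisq nocol nopct;
run;
```

The row percentages in the resulting table correspond to the favorable response rates quoted above (about 67% for test and 25% for placebo).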
In addition to testing the hypothesis concerning the presence of association, you may be interested
in describing the association or gauging its strength. Section 2.4 discusses the estimation of the
difference in proportions from 2 × 2 tables. Section 2.5 discusses measures of association, which
assess strength of association, and Section 2.6 discusses measures called sensitivity and specificity,
which are useful when the two responses correspond to two different methods for determining
whether a particular disorder is present. And 2 × 2 tables often display data for matched pairs;
Section 2.7 discusses McNemar’s test for assessing association for matched pairs data. Finally,
Section 2.8 discusses computing incidence density ratios when the 2 × 2 table represents counts
from Poisson processes.
