

Regression Analysis by Example


WILEY SERIES IN PROBABILITY AND STATISTICS
Established by WALTER A. SHEWHART and SAMUEL S. WILKS
Editors: David J. Balding, Noel A. C. Cressie, Nicholas I. Fisher,
Iain M. Johnstone, J. B. Kadane, Geert Molenberghs, Louise M. Ryan,
David W. Scott, Adrian F. M. Smith, Jozef L. Teugels
Editors Emeriti: Vic Barnett, J. Stuart Hunter, David G. Kendall
A complete list of the titles in this series appears at the end of this volume.


Regression Analysis by Example
Fourth Edition

SAMPRIT CHATTERJEE
Department of Health Policy
Mount Sinai School of Medicine
New York, NY

ALI S. HADI
Department of Mathematics
The American University in Cairo
Cairo, Egypt

WILEY-INTERSCIENCE

A JOHN WILEY & SONS, INC., PUBLICATION




Copyright © 2006 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to
the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax
(978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should
be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials. The advice and strategies contained herein may not be
suitable for your situation. You should consult with a professional where appropriate. Neither the
publisher nor author shall be liable for any loss of profit or any other commercial damages, including
but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our
Customer Care Department within the United States at (800) 762-2974, outside the United States at
(317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may
not be available in electronic format. For information about Wiley products, visit our web site at
www.wiley.com.
Library of Congress Cataloging-in-Publication Data:

Chatterjee, Samprit, 1938-
  Regression analysis by example. - 4th ed. / Samprit Chatterjee, Ali S. Hadi.
    p. cm.
  Includes bibliographical references and index.
  ISBN-13 978-0-471-74696-6 (cloth : acid-free paper)
  ISBN-10 0-471-74696-7 (cloth : acid-free paper)
  1. Regression analysis. I. Title.
  QA278.2.C5 2006
  519.5'36-dc22          2006044595

Printed in the United States of America.
10 9 8 7 6 5 4 3 2 1


Dedicated to:
Allegra, Martha, and Rima - S. C.
My mother and the memory of my father - A. S. H.

It’s a gift to be simple.

Old Shaker hymn

True knowledge is knowledge of why things are
as they are, and not merely what they are.
Isaiah Berlin


CONTENTS

Preface  xiii

1  Introduction  1
   1.1  What Is Regression Analysis?  1
   1.2  Publicly Available Data Sets  2
   1.3  Selected Applications of Regression Analysis  3
        1.3.1  Agricultural Sciences  3
        1.3.2  Industrial and Labor Relations  3
        1.3.3  History  4
        1.3.4  Government  6
        1.3.5  Environmental Sciences  6
   1.4  Steps in Regression Analysis  7
        1.4.1  Statement of the Problem  11
        1.4.2  Selection of Potentially Relevant Variables  11
        1.4.3  Data Collection  11
        1.4.4  Model Specification  12
        1.4.5  Method of Fitting  14
        1.4.6  Model Fitting  14
        1.4.7  Model Criticism and Selection  16
        1.4.8  Objectives of Regression Analysis  16
   1.5  Scope and Organization of the Book  17
   Exercises  18

2  Simple Linear Regression  21
   2.1  Introduction  21
   2.2  Covariance and Correlation Coefficient  21
   2.3  Example: Computer Repair Data  26
   2.4  The Simple Linear Regression Model  28
   2.5  Parameter Estimation  29
   2.6  Tests of Hypotheses  32
   2.7  Confidence Intervals  37
   2.8  Predictions  37
   2.9  Measuring the Quality of Fit  39
   2.10  Regression Line Through the Origin  42
   2.11  Trivial Regression Models  44
   2.12  Bibliographic Notes  45
   Exercises  45

3  Multiple Linear Regression  53
   3.1  Introduction  53
   3.2  Description of the Data and Model  53
   3.3  Example: Supervisor Performance Data  54
   3.4  Parameter Estimation  57
   3.5  Interpretations of Regression Coefficients  58
   3.6  Properties of the Least Squares Estimators  60
   3.7  Multiple Correlation Coefficient  61
   3.8  Inference for Individual Regression Coefficients  62
   3.9  Tests of Hypotheses in a Linear Model  64
        3.9.1  Testing All Regression Coefficients Equal to Zero  66
        3.9.2  Testing a Subset of Regression Coefficients Equal to Zero  69
        3.9.3  Testing the Equality of Regression Coefficients  71
        3.9.4  Estimating and Testing of Regression Parameters Under Constraints  73
   3.10  Predictions  74
   3.11  Summary  75
   Exercises  75
   Appendix: Multiple Regression in Matrix Notation  82

4  Regression Diagnostics: Detection of Model Violations  85
   4.1  Introduction  85
   4.2  The Standard Regression Assumptions  86
   4.3  Various Types of Residuals  88
   4.4  Graphical Methods  90
   4.5  Graphs Before Fitting a Model  93
        4.5.1  One-Dimensional Graphs  93
        4.5.2  Two-Dimensional Graphs  93
        4.5.3  Rotating Plots  96
        4.5.4  Dynamic Graphs  96
   4.6  Graphs After Fitting a Model  97
   4.7  Checking Linearity and Normality Assumptions  97
   4.8  Leverage, Influence, and Outliers  98
        4.8.1  Outliers in the Response Variable  100
        4.8.2  Outliers in the Predictors  100
        4.8.3  Masking and Swamping Problems  100
   4.9  Measures of Influence  103
        4.9.1  Cook's Distance  103
        4.9.2  Welsch and Kuh Measure  104
        4.9.3  Hadi's Influence Measure  105
   4.10  The Potential-Residual Plot  107
   4.11  What to Do with the Outliers?  108
   4.12  Role of Variables in a Regression Equation  109
        4.12.1  Added-Variable Plot  109
        4.12.2  Residual Plus Component Plot  110
   4.13  Effects of an Additional Predictor  114
   4.14  Robust Regression  115
   Exercises  115

5  Qualitative Variables as Predictors  121
   5.1  Introduction  121
   5.2  Salary Survey Data  122
   5.3  Interaction Variables  125
   5.4  Systems of Regression Equations  128
        5.4.1  Models with Different Slopes and Different Intercepts  130
        5.4.2  Models with Same Slope and Different Intercepts  137
        5.4.3  Models with Same Intercept and Different Slopes  138
   5.5  Other Applications of Indicator Variables  139
   5.6  Seasonality  140
   5.7  Stability of Regression Parameters Over Time  141
   Exercises  143

6  Transformation of Variables  151
   6.1  Introduction  151
   6.2  Transformations to Achieve Linearity  153
   6.3  Bacteria Deaths Due to X-Ray Radiation  155
        6.3.1  Inadequacy of a Linear Model  156
        6.3.2  Logarithmic Transformation for Achieving Linearity  158
   6.4  Transformations to Stabilize Variance  159
   6.5  Detection of Heteroscedastic Errors  164
   6.6  Removal of Heteroscedasticity  166
   6.7  Weighted Least Squares  167
   6.8  Logarithmic Transformation of Data  168
   6.9  Power Transformation  169
   6.10  Summary  173
   Exercises  174

7  Weighted Least Squares  179
   7.1  Introduction  179
   7.2  Heteroscedastic Models  180
        7.2.1  Supervisors Data  180
        7.2.2  College Expense Data  182
   7.3  Two-Stage Estimation  183
   7.4  Education Expenditure Data  185
   7.5  Fitting a Dose-Response Relationship Curve  194
   Exercises  196

8  The Problem of Correlated Errors  197
   8.1  Introduction: Autocorrelation  197
   8.2  Consumer Expenditure and Money Stock  198
   8.3  Durbin-Watson Statistic  200
   8.4  Removal of Autocorrelation by Transformation  202
   8.5  Iterative Estimation With Autocorrelated Errors  204
   8.6  Autocorrelation and Missing Variables  205
   8.7  Analysis of Housing Starts  206
   8.8  Limitations of Durbin-Watson Statistic  210
   8.9  Indicator Variables to Remove Seasonality  211
   8.10  Regressing Two Time Series  214
   Exercises  216

9  Analysis of Collinear Data  221
   9.1  Introduction  221
   9.2  Effects on Inference  222
   9.3  Effects on Forecasting  228
   9.4  Detection of Multicollinearity  233
   9.5  Centering and Scaling  239
        9.5.1  Centering and Scaling in Intercept Models  240
        9.5.2  Scaling in No-Intercept Models  241
   9.6  Principal Components Approach  243
   9.7  Imposing Constraints  246
   9.8  Searching for Linear Functions of the β's  248
   9.9  Computations Using Principal Components  252
   9.10  Bibliographic Notes  254
   Exercises  254
   Appendix: Principal Components  255

10  Biased Estimation of Regression Coefficients  259
   10.1  Introduction  259
   10.2  Principal Components Regression  260
   10.3  Removing Dependence Among the Predictors  262
   10.4  Constraints on the Regression Coefficients  264
   10.5  Principal Components Regression: A Caution  265
   10.6  Ridge Regression  268
   10.7  Estimation by the Ridge Method  269
   10.8  Ridge Regression: Some Remarks  272
   10.9  Summary  275
   Exercises  275
   Appendix: Ridge Regression  277

11  Variable Selection Procedures  281
   11.1  Introduction  281
   11.2  Formulation of the Problem  282
   11.3  Consequences of Variables Deletion  282
   11.4  Uses of Regression Equations  284
        11.4.1  Description and Model Building  284
        11.4.2  Estimation and Prediction  284
        11.4.3  Control  284
   11.5  Criteria for Evaluating Equations  285
        11.5.1  Residual Mean Square  285
        11.5.2  Mallows Cp  286
        11.5.3  Information Criteria: Akaike and Other Modified Forms  287
   11.6  Multicollinearity and Variable Selection  288
   11.7  Evaluating All Possible Equations  288
   11.8  Variable Selection Procedures  289
        11.8.1  Forward Selection Procedure  289
        11.8.2  Backward Elimination Procedure  290
        11.8.3  Stepwise Method  290
   11.9  General Remarks on Variable Selection Methods  291
   11.10  A Study of Supervisor Performance  292
   11.11  Variable Selection With Collinear Data  296
   11.12  The Homicide Data  296
   11.13  Variable Selection Using Ridge Regression  299
   11.14  Selection of Variables in an Air Pollution Study  300
   11.15  A Possible Strategy for Fitting Regression Models  307
   11.16  Bibliographic Notes  308
   Exercises  308
   Appendix: Effects of Incorrect Model Specifications  313

12  Logistic Regression  317
   12.1  Introduction  317
   12.2  Modeling Qualitative Data  318
   12.3  The Logit Model  318
   12.4  Example: Estimating Probability of Bankruptcies  320
   12.5  Logistic Regression Diagnostics  323
   12.6  Determination of Variables to Retain  324
   12.7  Judging the Fit of a Logistic Regression  327
   12.8  The Multinomial Logit Model  329
        12.8.1  Multinomial Logistic Regression  329
        12.8.2  Example: Determining Chemical Diabetes  330
        12.8.3  Ordered Response Category: Ordinal Logistic Regression  334
        12.8.4  Example: Determining Chemical Diabetes Revisited  335
   12.9  Classification Problem: Another Approach  336
   Exercises  337

13  Further Topics  341
   13.1  Introduction  341
   13.2  Generalized Linear Model  341
   13.3  Poisson Regression Model  342
   13.4  Introduction of New Drugs  343
   13.5  Robust Regression  345
   13.6  Fitting a Quadratic Model  346
   13.7  Distribution of PCB in U.S. Bays  348
   Exercises  352

Appendix A: Statistical Tables  353

References  363

Index  371



PREFACE

Regression analysis has become one of the most widely used statistical tools for
analyzing multifactor data. It is appealing because it provides a conceptually
simple method for investigating functional relationships among variables. The
standard approach in regression analysis is to take data, fit a model, and then
evaluate the fit using statistics such as t, F, and R2. Our approach is much broader.
We view regression analysis as a set of data analytic techniques that examine the
interrelationships among a given set of variables. The emphasis is not on formal
statistical tests and probability calculations. We argue for an informal analysis
directed towards uncovering patterns in the data.
We utilize most standard and some not so standard summary statistics on the
basis of their intuitive appeal. We rely heavily on graphical representations of the
data, and employ many variations of plots of regression residuals. We are not overly
concerned with precise probability evaluations. Graphical methods for exploring
residuals can suggest model deficiencies or point to troublesome observations.
Upon further investigation into their origin, the troublesome observations often
turn out to be more informative than the well-behaved observations. We notice
often that more information is obtained from a quick examination of a plot of
residuals than from a formal test of statistical significance of some limited null hypothesis. In short, the presentation in the chapters of this book is guided by the
principles and concepts of exploratory data analysis.
Our presentation of the various concepts and techniques of regression analysis
relies on carefully developed examples. In each example, we have isolated one
or two techniques and discussed them in some detail. The data were chosen to
highlight the techniques being presented. Although when analyzing a given set of
data it is usually necessary to employ many techniques, we have tried to choose the
various data sets so that it would not be necessary to discuss the same technique
more than once. Our hope is that after working through the book, the reader will be
ready and able to analyze his/her data methodically, thoroughly, and confidently.
The emphasis in this book is on the analysis of data rather than on formulas,
tests of hypotheses, or confidence intervals. Therefore no attempt has been made
to derive the techniques. Techniques are described, the required assumptions are
given, and finally, the success of the technique in the particular example is assessed.
Although derivations of the techniques are not included, we have tried to refer the
reader in each case to sources in which such discussion is available. Our hope is
that some of these sources will be followed up by the reader who wants a more
thorough grounding in theory.
We have taken for granted the availability of a computer and a statistical package.
Recently there has been a qualitative change in the analysis of linear models, from
model fitting to model building, from overall tests to clinical examinations of
data, from macroscopic to microscopic analysis. To do this kind of analysis
a computer is essential and we have assumed its availability. Almost all of the
analyses we use are now available in software packages. We are particularly
heartened by the arrival of the package R, available on the Internet under the
General Public License (GPL). The package has excellent computing and graphical
features. It is also free!
The material presented is intended for anyone who is involved in analyzing data.
The book should be helpful to those who have some knowledge of the basic concepts
of statistics. In the university, it could be used as a text for a course on regression
analysis for students whose specialization is not statistics, but, who nevertheless,
use regression analysis quite extensively in their work. For students whose major

emphasis is statistics, and who take a course on regression analysis from a book
at the level of Rao (1973), Seber (1977), or Sen and Srivastava (1990), this book
can be used to balance and complement the theoretical aspects of the subject with
practical applications. Outside the university, this book can be profitably used
by those people whose present approach to analyzing multifactor data consists of
looking at standard computer output (t, F, R2, standard errors, etc.), but who want
to go beyond these summaries for a more thorough analysis.
The book has a Web site: Thadi/RABE4. This Web
site contains, among other things, all the data sets that are included in this book and
more.
Several new topics have been introduced in this edition. The discussion in Section
2.10 about the regression line through the origin has been considerably expanded. In
the chapter on variable selection (Chapter 11), we introduce information measures
and illustrate their use. The information criteria help in variable selection by
balancing the conflicting requirements of accuracy and complexity. They are useful
tools for arriving at parsimonious models.
The chapter on logistic regression (Chapter 12) has been considerably expanded.
This reflects the increased use of the logit models in statistical analysis. In addition
to binary logistic regression, we have now included a discussion of multinomial
logistic regression. This extends the application of logistic regression to more
diverse situations. The categories in some multinomial data are ordered, for example,
in attitude surveys. We also discuss the application of the logistic model to ordered
response variables.
A new chapter titled Further Topics (Chapter 13) has been added to this edition.

This chapter is intended to be an introduction to a more advanced study of regression
analysis. The topics discussed are generalized linear models (GLM) and robust
regression. We introduce the concept of GLM and discuss how the linear regression
and logistic regression models can be regarded as special cases from a large family
of linear models. This provides a unifying view of linear models. We discuss
Poisson regression in the context of GLM, and its use for modeling count data.
We have attempted to write a book for a group of readers with diverse backgrounds. We have also tried to put emphasis on the art of data analysis rather than
on the development of statistical theory.
We are fortunate to have had assistance and encouragement from several friends,
colleagues, and associates. Some of our colleagues at New York University and
Cornell University have used portions of the material in their courses and have
shared with us their comments and comments of their students. Special thanks
are due to our friend and former colleague Jeffrey Simonoff (New York University) for comments, suggestions, and general help. The students in our classes on
regression analysis have all contributed by asking penetrating questions and demanding meaningful and understandable answers. Our special thanks go to Nedret
Billor (Cukurova University, Turkey) and Sahar El-Sheneity (Cornell University)
for their very careful reading of an earlier edition of this book. We also thank Amy
Hendrickson for preparing the LaTeX style files and for responding to our LaTeX
questions, and Dean Gonzalez for help with the production of some of the figures.

SAMPRIT CHATTERJEE
ALI S. HADI
Brooksville, Maine
Cairo, Egypt


CHAPTER 1

INTRODUCTION


1.1 WHAT IS REGRESSION ANALYSIS?

Regression analysis is a conceptually simple method for investigating functional relationships among variables. A real estate appraiser may wish to relate the sale price
of a home to selected physical characteristics of the building and taxes (local,
school, county) paid on the building. We may wish to examine whether cigarette
consumption is related to various socioeconomic and demographic variables such
as age, education, income, and price of cigarettes. The relationship is expressed in
the form of an equation or a model connecting the response or dependent variable
and one or more explanatory or predictor variables. In the cigarette consumption
example, the response variable is cigarette consumption (measured by the number
of packs of cigarette sold in a given state on a per capita basis during a given year)
and the explanatory or predictor variables are the various socioeconomic and demographic variables. In the real estate appraisal example, the response variable is
the price of a home and the explanatory or predictor variables are the characteristics
of the building and taxes paid on the building.
We denote the response variable by Y and the set of predictor variables by
X1, X2, ..., Xp, where p denotes the number of predictor variables. The true
relationship between Y and X1, X2, ..., Xp can be approximated by the regression



model

Y = f(X1, X2, ..., Xp) + ε,                                    (1.1)

where ε is assumed to be a random error representing the discrepancy in the
approximation. It accounts for the failure of the model to fit the data exactly. The
function f(X1, X2, ..., Xp) describes the relationship between Y and X1, X2, ...,
Xp. An example is the linear regression model

Y = β0 + β1X1 + β2X2 + ... + βpXp + ε,                         (1.2)

where β0, β1, ..., βp, called the regression parameters or coefficients, are unknown
constants to be determined (estimated) from the data. We follow the commonly
used notational convention of denoting unknown parameters by Greek letters.
The predictor or explanatory variables are also called by other names such as
independent variables, covariates, regressors, factors, and carriers. The name
independent variable, though commonly used, is the least preferred, because in
practice the predictor variables are rarely independent of each other.
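As a quick illustration of estimating the coefficients in model (1.2), the following sketch fits a linear regression by least squares with NumPy. The data and variable names here are ours, made up for illustration; they are not from the book, and the estimation method itself is developed in later chapters.

```python
import numpy as np

# Hypothetical data: response y and two predictors x1, x2 (n = 5 observations)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 1.0 + 2.0 * x1 - 0.5 * x2  # exact linear relationship, no error term

# Design matrix with a leading column of ones for the intercept beta_0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares estimates of (beta_0, beta_1, beta_2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -0.5]
```

Because y here is generated with no error term, the estimates recover the true coefficients; with real data the estimates would differ from the true parameters by sampling variation.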

1.2 PUBLICLY AVAILABLE DATA SETS
Regression analysis has numerous areas of applications. A partial list would include
economics, finance, business, law, meteorology, medicine, biology, chemistry,
engineering, physics, education, sports, history, sociology, and psychology. A few
examples of such applications are given in Section 1.3. Regression analysis is
learned most effectively by analyzing data that are of direct interest to the reader.
We invite the readers to think about questions (in their own areas of work, research,

or interest) that can be addressed using regression analysis. Readers should collect
the relevant data and then apply the regression analysis techniques presented in this
book to their own data. To help the reader locate real-life data, this section provides
some sources and links to a wealth of data sets that are available for public use.
A number of data sets are available in books and on the Internet. The book by
Hand et al. (1994) contains data sets from many fields. These data sets are small
in size and are suitable for use as exercises. The book by Chatterjee, Handcock,
and Simonoff (1995) provides numerous data sets from diverse fields. The data are
included in a diskette that comes with the book and can also be found in the World
Wide Web site.'
Data sets are also available on the Internet at many other sites. Some of the Web
sites given below allow the direct copying and pasting into the statistical package
of choice, while others require downloading the data file and then importing them
into a statistical package. Some of these sites also contain further links to yet other
data sets or statistics-related Web sites.
The Data and Story Library (DASL, pronounced "dazzle") is one of the most
interesting sites; it contains a number of data sets accompanied by the "story" or
background associated with each data set. DASL is an online library2 of data files
and stories that illustrate the use of basic statistical methods. The data sets cover
a wide variety of topics. DASL comes with a powerful search engine to locate the
story or data file of interest.

' r jsirnonoWCasebook
Another Web site, which also contains data sets arranged by the method used in
the analysis, is the Electronic Dataset Service.3 The site also contains many links
to other data sources on the Internet.
Finally, this book has a Web site: Thadi/RABE4.
This site contains, among other things, all the data sets that are included in this
book and more. These and other data sets can be found in the book's Web site.

1.3 SELECTED APPLICATIONS OF REGRESSION ANALYSIS
Regression analysis is one of the most widely used statistical tools because it
provides simple methods for establishing a functional relationship among variables.
It has extensive applications in many subject areas. The cigarette consumption and
the real estate appraisal, mentioned above, are but two examples. In this section, we
give a few additional examples demonstrating the wide applicability of regression
analysis in real-life situations. Some of the data sets described here will be used
later in the book to illustrate regression techniques or in the exercises at the end of
various chapters.

1.3.1 Agricultural Sciences

The Dairy Herd Improvement Cooperative (DHI) in Upstate New York collects
and analyzes data on milk production. One question of interest here is how to
develop a suitable model to predict current milk production from a set of measured
variables. The response variable (current milk production in pounds) and the
predictor variables are given in Table 1.1. Samples are taken once a month during
milking. The period that a cow gives milk is called lactation. Number of lactations is
the number of times a cow has calved or given milk. The recommended management
practice is to have the cow produce milk for about 305 days and then allow a 60-day rest period before beginning the next lactation. The data set, consisting of
199 observations, was compiled from the DHI milk production records. The Milk

Production data can be found in the book's Web site.

1.3.2 Industrial and Labor Relations
In 1947, the United States Congress passed the Taft-Hartley Amendments to the
Wagner Act. The original Wagner Act had permitted the unions to use a Closed
*DASL'SWeb site is: />3rstatdata/



Table 1.1 Variables for the Milk Production Data

Variable    Definition
Current     Current month milk production in pounds
Previous    Previous month milk production in pounds
Fat         Percent of fat in milk
Protein     Percent of protein in milk
Days        Number of days since present lactation
Lactation   Number of lactations
I79         Indicator variable (0 if Days ≤ 79 and 1 if Days > 79)
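The indicator I79 in Table 1.1 is a simple function of Days. A minimal sketch of constructing such an indicator (the Days values below are invented for illustration, not taken from the DHI records):

```python
import numpy as np

# Hypothetical Days values for a few cows
days = np.array([35, 79, 80, 150, 305])

# I79 = 0 if Days <= 79, and 1 if Days > 79
i79 = (days > 79).astype(int)
print(i79)  # [0 0 1 1 1]
```

The same comparison-then-convert pattern builds any binary indicator variable, such as RTWL in Table 1.2.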

Table 1.2 Variables for the Right-To-Work Laws Data

Variable   Definition
COL        Cost of living for a four-person family
PD         Population density (persons per square mile)
URate      State unionization rate in 1978
Pop        Population in 1975
Taxes      Property taxes in 1972
Income     Per capita income in 1974
RTWL       Indicator variable (1 if there is a right-to-work law in the state
           and 0 otherwise)

Shop Contract4 unless prohibited by state law. The Taft-Hartley Amendments
made the use of Closed Shop Contract illegal and gave individual states the right
to prohibit union shops5 as well. These right-to-work laws have caused a wave
of concern throughout the labor movement. A question of interest here is: What
are the effects of these laws on the cost of living for a four-person family living
on an intermediate budget in the United States? To answer this question a data set

consisting of 38 geographic locations has been assembled from various sources.
The variables used are defined in Table 1.2. The Right-To-Work Laws data are
given in Table 1.3 and can also be found in the book's Web site.

1.3.3 History
A question of historical interest is how to estimate the age of historical objects
based on some age-related characteristics of the objects. For example, the variables
4Under a Closed Shop Contract provision, all employees must be union members at the time of hire
and must remain members as a condition of employment.
5Under a Union Shop clause, employees are not required to be union members at the time of hire,
but must become a member within two months, thus allowing the employer complete discretion in
hiring decisions.



Table 1.3 The Right-To-Work Laws Data

City
Atlanta
Austin
Bakersfield
Baltimore
Baton Rouge
Boston
Buffalo
Champaign-Urbana
Cedar Rapids

Chicago
Cincinnati
Cleveland
Dallas
Dayton
Denver
Detroit
Green Bay
Hartford
Houston
Indianapolis
Kansas City
Lancaster, PA
Los Angeles
Milwaukee
Minneapolis, St. Paul
Nashville
New York
Orlando
Philadelphia
Pittsburgh
Portland
St. Louis
San Diego
San Francisco
Seattle
Washington
Wichita
Raleigh-Durham


COL

PD

169 414
143 239
339
43
173 95 1
99
255
363 1257
253
834
117
162
294
229
29 1 1886
170 643
239 1295
174 302
183 489
227
304
255 1130
323
249
326
696

194
337
25 1
37 1
20 1
386
124
362
340 1717
328
968
265
433
120
183
323 6908
117
230
182 1353
169 762
267
20 1
184 480
256
372
381 1266
195 333
205 1073
157
206

126 302

URate

13.6
11
23.7
21
16
24.4
39.2
31.5
18.2
31.5
29.5
29.5
11
29.5
15.2
34.6
27.8
21.9
11
29.3
30
34.2
23.7
27.8
24.4
17.7

39.2
11.7
34.2
34.2
23.1
30
23.7
23.7
33.1
21
12.8
6.5

Pop

1790128
39689 1
349874
2147850
411725
3914071
1326848
162304
164145
701525 1
1381196
1966725
2527224
835708
14133 18

4424382
169467
1062565
2286247
1 138753
1290110
342797
6986898
1409363
2010841
748493
9561089
582664
4807001
2322224
228417
2366542
1584583
3 140306
1406746
3021801
384920
468512

Taxes Income

5128
4303
4166
5001

3965
4928
447 1
4813
4839
5408
4637
5138
4923
4787
5386
5246
4289
5134
5084
4837
5052
4377
528 1
5176
5206
4454
5260
4613
4877
4677
4123
4721
4837
5940

5416
6404
4796
4614

296 1
1711
2122
4654
1620
5634
7213
5535
7224
61 13
4806
6432
2363
5606
5982
6275
8214
6235
1278
5699
4868
5205
1349
7635
8392

3578
4862
782
5144
5987
751 1
4809
1458
3015
4424
4224
4620
3393

RTWL

1
1
0
0
1
0
0
0
1
0
0
0
1
0

0
0
0
0
1
0
0
0

0
0
0
1
0
1
0
0
0
0
0
0
0
0
1
1




Table 1.4 Variables for the Egyptian Skulls Data

Variable   Definition
Year       Approximate Year of Skull Formation (negative = B.C.; positive = A.D.)
MB         Maximum Breadth of Skull
BH         Basibregmatic Height of Skull
BL         Basialveolar Length of Skull
NH         Nasal Height of Skull

in Table 1.4 can be used to estimate the age of Egyptian skulls. Here the response
variable is Year and the other four variables are possible predictors. The original
source of the data is Thomson and Randall-Maciver (1905), but they can be found
in Hand et al. (1994), pp. 299-301. An analysis of the data can be found in Manly
(1986). The Egyptian Skulls data can be found in the book’s Web site.

1.3.4 Government


Information about domestic immigration (the movement of people from one state
or area of a country to another) is important to state and local governments. It
is of interest to build a model that predicts domestic immigration or to answer
the question of why people leave one place to go to another. There are many
factors that influence domestic immigration, such as weather conditions, crime, tax,
and unemployment rates. A data set for the 48 contiguous states has been created.
Alaska and Hawaii are excluded from the analysis because the environments of these
states are significantly different from the other 48, and their locations present certain
barriers to immigration. The response variable here is net domestic immigration,
which represents the net movement of people into and out of a state over the period
1990-1994 divided by the population of the state. Eleven predictor variables
thought to influence domestic immigration are defined in Table 1.5. The data are
given in Tables 1.6 and 1.7, and can also be found in the book’s Web site.

1.3.5 Environmental Sciences

In a 1976 study exploring the relationship between water quality and land use, Haith
(1976) obtained the measurements (shown in Table 1.8) on 20 river basins in New
York State. A question of interest here is how the land use around a river basin
contributes to the water pollution as measured by the mean nitrogen concentration
(mg/liter). The data are shown in Table 1.9 and can also be found in the book's
Web site.




Table 1.5 Variables for the Study of Domestic Immigration

Variable   Definition
State      State name
NDIR       Net domestic immigration rate over the period 1990-1994
Unemp      Unemployment rate in the civilian labor force in 1994
Wage       Average hourly earnings of production workers in manufacturing
           in 1994
Crime      Violent crime rate per 100,000 people in 1993
Income     Median household income in 1994
Metrop     Percentage of state population living in metropolitan areas
           in 1992
Poor       Percentage of population who fall below the poverty level
           in 1994
Taxes      Total state and local taxes per capita in 1993
Educ       Percentage of population 25 years or older who have a high school
           degree or higher in 1990
BusFail    The number of business failures divided by the population of the
           state in 1993
Temp       Average of the 12 monthly average temperatures (in degrees
           Fahrenheit) for the state in 1993
Region     Region in which the state is located (northeast, south, midwest, west)

1.4 STEPS IN REGRESSION ANALYSIS

Regression analysis includes the following steps:

• Statement of the problem
• Selection of potentially relevant variables
• Data collection
• Model specification
• Choice of fitting method
• Model fitting
• Model validation and criticism
• Using the chosen model(s) for the solution of the posed problem

These steps are examined below.
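To make the later steps concrete, the following Python sketch walks a simulated data set through specification, fitting, validation, and use; the first three steps appear only as comments, since they happen before any computation. The linear model and least squares method used here are assumptions chosen for illustration — in a real application each step involves a substantive choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Statement of the problem / variable selection / data collection:
# here we simply simulate a response y that depends linearly on a
# single collected predictor x.
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)

# Model specification: y = b0 + b1*x + error (a linear model).
X = np.column_stack([np.ones_like(x), x])

# Choice of fitting method and model fitting: ordinary least squares.
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = beta

# Model validation and criticism: inspect the residuals; for an
# adequate fit with an intercept they average to (essentially) zero
# and should show no systematic pattern when plotted against x.
residuals = y - X @ beta
print(f"fitted intercept {b0:.2f}, slope {b1:.2f}")
print(f"mean residual {residuals.mean():.2e}")

# Using the chosen model: predict the response at a new x value.
print(f"prediction at x=5: {b0 + b1 * 5:.2f}")
```

Because the data were generated from a known model, the fitted intercept and slope land close to the true values 2.0 and 0.5; with real data the validation step is where one judges whether the specification was adequate at all.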



Table 1.6 First Six Variables of the Domestic Immigration Data

State             NDIR   Unemp   Wage   Crime   Income   Metrop
Alabama          17.47    6.0   10.75    780    27196     67.4
Arizona          49.60    6.4   11.17    715    31293     84.7
Arkansas         23.62    5.3    9.65    593    25565     44.7
California      -37.21    8.6   12.44   1078    35331     96.7
Colorado         53.17    4.2   12.27    567    37833     81.8
Connecticut     -38.41    5.6   13.53    456    41097     95.7
Delaware         22.43    4.9   13.90    686    35873     82.7
Florida          39.73    6.6    9.97   1206    29294     93.0
Georgia          39.24    5.2   10.35    723    31467     67.7
Idaho            71.41    5.6   11.88    282    31536     30.0
Illinois        -20.87    5.7   12.26    960    35081     84.0
Indiana           9.04    4.9   13.56    489    27858     71.6
Iowa              0.00    3.7   12.47    326    33079     43.8
Kansas           -1.25    5.3   12.14    469    28322     54.6
Kentucky         13.44    5.4   11.82    463    26595     48.5
Louisiana       -13.94    8.0   13.13   1062    25676     75.0
Maine            -9.77    7.4   11.68    126    30316     35.7
Maryland         -1.55    5.1   13.15    998    39198     92.8
Massachusetts   -30.46    6.0   12.59    805    40500     96.2
Michigan        -13.19    5.9   16.13    792    35284     82.7
Minnesota         9.46    4.0   12.60    327    33644     69.3
Mississippi       5.33    6.6    9.40    434    25400     34.6
Missouri          6.97    4.9   11.78    744    30190     68.3
Montana          41.50    5.1   12.50    178    27631     24.0
Nebraska         -0.62    2.9   10.94    339    31794     50.6
Nevada          128.52    6.2   11.83    875    35871     84.8
New Hampshire    -8.72    4.6   11.73    138    35245     59.4
New Jersey      -24.90    6.8   13.38    627    42280    100.0
New Mexico       29.05    6.3   10.14    930    26905     56.0
New York        -45.46    6.9   12.19   1074    31899     91.7
North Carolina   29.46    4.4   10.19    679    30114     66.3
North Dakota    -26.47    3.9   10.19     82    28278     41.6
Ohio             -3.27    5.5   14.38    504    31855     81.3
Oklahoma          7.37    5.8   11.41    635    26991     60.1
Oregon           49.63    5.4   12.31    503    31456     70.0
Pennsylvania     -4.30    6.2   12.49    418    32066     84.8
Rhode Island    -35.32    7.1   10.35    402    31928     93.6
South Carolina   11.88    6.3    9.99   1023    29846     69.8
South Dakota     13.71    3.3    9.19    208    29733     32.6
Tennessee        32.11    4.8   10.51    766    28639     67.7
Texas            13.00    6.4   11.14    762    30775     83.9
Utah             31.25    3.7   11.26    301    35716     77.5
Vermont           3.94    4.7   11.54    114    35802     27.0
Virginia          6.94    4.9   11.25    372    37647     77.5
Washington       44.66    6.4   14.42    515    33533     83.0
West Virginia    10.75    8.9   12.60    208    23564     41.8
Wisconsin        11.73    4.7   12.41    264    35388     68.1
Wyoming          11.95    5.3   11.81    286    33140     29.7



Table 1.7 Last Six Variables of the Domestic Immigration Data

State            Poor   Taxes   Educ   BusFail    Temp   Region
Alabama          16.4    1553   66.9     0.20    62.77   South
Arizona          15.9    2122   78.7     0.51    61.09   West
Arkansas         15.3    1590   66.3     0.08    59.57   South
California       17.9    2396   76.2     0.63    59.25   West
Colorado          9.0    2092   84.4     0.42    43.43   West
Connecticut      10.8    3334   79.2     0.33    48.63   Northeast
Delaware          8.3    2336   77.5     0.19    54.58   South
Florida          14.9    2048   74.4     0.36    70.64   South
Georgia          14.0    1999   70.9     0.33    63.54   South
Idaho            12.0    1916   79.7     0.31    42.35   West
Illinois         12.4    2332   76.2     0.18    50.98   Midwest
Indiana          13.7    1919   75.6     0.19    50.88   Midwest
Iowa             10.7    2200   80.1     0.18    45.83   Midwest
Kansas           14.9    2126   81.3     0.42    52.03   Midwest
Kentucky         18.5    1816   64.6     0.22    55.36   South
Louisiana        25.7    1685   68.3     0.15    65.91   South
Maine             9.4    2281   78.8     0.31    40.23   Northeast
Maryland         10.7    2565   78.4     0.31    54.04   South
Massachusetts     9.7    2664   80.0     0.45    47.35   Northeast
Michigan         14.1    2371   76.8     0.27    43.68   Midwest
Minnesota        11.7    2673   82.4     0.20    39.30   Midwest
Mississippi      19.9    1535   64.3     0.12    63.18   South
Missouri         15.6    1721   73.9     0.23    53.41   Midwest
Montana          11.5    1853   81.0     0.20    40.40   West
Nebraska          8.8    2128   81.8     0.25    46.01   Midwest
Nevada           11.1    2289   78.8     0.39    48.23   West
New Hampshire     7.7    2305   82.2     0.54    43.53   Northeast
New Jersey        9.2    3051   76.7     0.36    52.72   Northeast
New Mexico       21.1    2131   75.1     0.27    53.37   West
New York         17.0    3655   74.8     0.38    44.85   Northeast
North Carolina   14.2    1975   70.0     0.17    59.36   South
North Dakota     10.4    1986   76.7     0.23    38.53   Midwest
Ohio             14.1    2059   75.7     0.19    50.87   Midwest
Oklahoma         16.7    1777   74.6     0.44    58.36   South
Oregon           11.8    2169   81.5     0.31    46.55   West
Pennsylvania     12.5    2260   74.7     0.26    49.01   Northeast
Rhode Island     10.3    2405   72.0     0.35    49.99   Northeast
South Carolina   13.8    1736   68.3     0.11    62.53   South
South Dakota     14.5    1668   77.1     0.24    42.89   Midwest
Tennessee        14.6    1684   67.1     0.23    57.75   South
Texas            19.1    1932   72.1     0.39    64.40   South
Utah              8.0    1806   85.1     0.18    46.32   West
Vermont           7.6    2379   80.8     0.30    42.46   Northeast
Virginia         10.7    2073   75.2     0.27    55.55   South
Washington       11.7    2433   83.8     0.38    46.93   West
West Virginia    18.6    1752   66.0     0.17    52.25   South
Wisconsin         9.0    2524   78.6     0.24    42.20   Midwest
Wyoming           9.3    2295   83.0     0.19    43.68   West



Table 1.8 Variables for Study of Water Pollution in New York Rivers

Variable   Definition
Y          Mean nitrogen concentration (mg/liter) based on samples taken
           at regular intervals during the spring, summer, and fall months
X1         Agriculture: percentage of land area currently in agricultural use
X2         Forest: percentage of forest land
X3         Residential: percentage of land area in residential use
X4         Commercial/Industrial: percentage of land area in either
           commercial or industrial use

Table 1.9 The New York Rivers Data

Row   River           Y      X1   X2    X3     X4
1     Olean          1.10    26   63    1.2   0.29
2     Cassadaga      1.01    29   57    0.7   0.09
3     Oatka          1.90    54   26    1.8   0.58
4     Neversink      1.00     2   84    1.9   1.98
5     Hackensack     1.99     3   27   29.4   3.11
6     Wappinger      1.42    19   61    3.4   0.56
7     Fishkill       2.04    16   60    5.6   1.11
8     Honeoye        1.65    40   43    1.3   0.24
9     Susquehanna    1.01    28   62    1.1   0.15
10    Chenango       1.21    26   60    0.9   0.23
11    Tioughnioga    1.33    26   53    0.9   0.18
12    West Canada    0.75    15   75    0.7   0.16
13    East Canada    0.73     6   84    0.5   0.12
14    Saranac        0.80     3   81    0.8   0.35
15    Ausable        0.76     2   89    0.7   0.35
16    Black          0.87     6   82    0.5   0.15
17    Schoharie      0.80    22   70    0.9   0.22
18    Raquette       0.87     4   75    0.4   0.18
19    Oswegatchie    0.66    21   56    0.5   0.13
20    Cohocton       1.25    40   49    1.1   0.13


