

Multiple Regression
and Beyond

Multiple Regression and Beyond offers a conceptually oriented introduction to multiple
regression (MR) analysis and structural equation modeling (SEM), along with analyses that
flow naturally from those methods. By focusing on the concepts and purposes of MR and
related methods, rather than the derivation and calculation of formulae, this book introduces
material to students more clearly, and in a less threatening way. In addition to illuminating
content necessary for coursework, the accessibility of this approach means students are more
likely to be able to conduct research using MR or SEM—and more likely to use the methods
wisely.
• Covers both MR and SEM, while explaining their relevance to one another
• Also includes path analysis, confirmatory factor analysis, and latent growth modeling
• Figures and tables throughout provide examples and illustrate key concepts and
techniques
Timothy Z. Keith is Professor and Program Director of School Psychology at the University of Texas at Austin.




Multiple Regression
and Beyond
An Introduction to
Multiple Regression
and Structural
Equation Modeling
2nd Edition
Timothy Z. Keith




Second edition published 2015
by Routledge
711 Third Avenue, New York, NY 10017
and by Routledge
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2015 Taylor & Francis
The right of Timothy Z. Keith to be identified as author of this work has been asserted by
him in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced or utilized in any
form or by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying and recording, or in any information storage or retrieval system,
without permission in writing from the publishers.
Trademark Notice: Product or corporate names may be trademarks or registered
trademarks, and are used only for identification and explanation without intent to infringe.
First edition published by Pearson Education, Inc. 2006
Library of Congress Cataloging-in-Publication Data
Library of Congress Control Number: 2014956124
ISBN: 978-1-138-81194-2 (hbk)
ISBN: 978-1-138-81195-9 (pbk)
ISBN: 978-1-315-74909-9 (ebk)
Typeset in Minion
by Apex CoVantage, LLC


Contents

Preface  vii
Acknowledgments  xi

Part I: Multiple Regression  1

1  Introduction: Simple (Bivariate) Regression  3
2  Multiple Regression: Introduction  26
3  Multiple Regression: More Detail  44
4  Three and More Independent Variables and Related Issues  57
5  Three Types of Multiple Regression  77
6  Analysis of Categorical Variables  108
7  Categorical and Continuous Variables  129
8  Continuous Variables: Interactions and Curves  161
9  Multiple Regression: Summary, Assumptions, Diagnostics, Power, and Problems  182
10  Related Methods: Logistic Regression and Multilevel Modeling  213

Part II: Beyond Multiple Regression: Structural Equation Modeling  241

11  Path Modeling: Structural Equation Modeling With Measured Variables  243
12  Path Analysis: Dangers and Assumptions  267
13  Analyzing Path Models Using SEM Programs  282
14  Error: The Scourge of Research  318
15  Confirmatory Factor Analysis I  332
16  Putting It All Together: Introduction to Latent Variable SEM  371
17  Latent Variable Models: More Advanced Topics  391
18  Latent Means in SEM  424
19  Confirmatory Factor Analysis II: Invariance and Latent Means  455
20  Latent Growth Models  493
21  Summary: Path Analysis, CFA, SEM, and Latent Growth Models  514

Appendices
Appendix A: Data Files  537
Appendix B: Review of Basic Statistics Concepts  539
Appendix C: Partial and Semipartial Correlation  557
Appendix D: Symbols Used in This Book  565
Appendix E: Useful Formulae  567

References  569
Author Index  579
Subject Index  583


Preface

Multiple Regression and Beyond is designed to provide a conceptually oriented introduction to
multiple regression along with more complex methods that flow naturally from multiple regression: path analysis, confirmatory factor analysis, and structural equation modeling. Multiple
regression (MR) and related methods have become indispensable tools for modern social science
researchers. MR closely implements the general linear model and thus subsumes methods, such
as analysis of variance (ANOVA), that have traditionally been more commonplace in psychological and educational research. Regression is especially appropriate for the analysis of nonexperimental research, and with the use of dummy variables and modern computer packages, it is
often more appropriate or easier to use MR to analyze the results of complex quasi-experimental
or even experimental research. Extensions of multiple regression—particularly structural equation modeling (SEM)—partially obviate threats due to the unreliability of the variables used in
research and allow the modeling of complex relations among variables. A quick perusal of the full
range of social science journals demonstrates the wide applicability of the methods.
Despite their importance, MR-based analyses are too often poorly conducted and poorly reported. I believe one reason for this incongruity is a mismatch between how the material is typically presented and how most students best learn.
Anyone who teaches (or has ever taken) courses in statistics and research methodology
knows that many students, even those who may become gifted researchers, do not always
gain conceptual understanding through numerical presentation. Although many who teach
statistics understand the processes underlying a sequence of formulas and gain conceptual
understanding through these formulas, many students do not. Instead, such students often
need a thorough conceptual explanation to gain such understanding, after which a numerical presentation may make more sense. Unfortunately, many multiple regression textbooks
assume that students will understand multiple regression best by learning matrix algebra,
wading through formulas, and focusing on details.
At the same time, methods such as structural equation modeling (SEM) and confirmatory factor analysis (CFA) are easily taught as extensions of multiple regression. If structured properly, multiple regression flows naturally into these more complex topics, with nearly complete carry-over of concepts. Path models (simple SEMs) illustrate and help deal with some of the problems of MR, CFA does the same for path analysis, and latent variable SEM combines all the previous topics into a powerful, flexible methodology.
I have taught courses including these topics at four universities (the University of Iowa, Virginia Polytechnic Institute & State University, Alfred University, and the University of Texas). These courses included faculty and students in architecture, engineering, educational psychology, educational research and statistics, kinesiology, management, political science, psychology, social work, and sociology, among others. This experience leads me to believe
that it is possible to teach these methods by focusing on the concepts and purposes of MR
and related methods, rather than the derivation and calculation of formulas (what my wife
calls the “plug and chug” method of learning statistics). Students generally find such an
approach clearer, more conceptual, and less threatening than other approaches. As a result
of this conceptual approach, students become interested in conducting research using MR,
CFA, or SEM and are more likely to use the methods wisely.

THE ORIENTATION OF THIS BOOK
My overriding bias in this book is that these complex methods can be presented and learned
in a conceptual, yet rigorous, manner. I recognize that not all topics are covered in the depth
or detail presented in other texts, but I will direct you to other sources for topics for which
you may want additional detail. My style is also fairly informal; I’ve written this book as if I
were teaching a class.

Data
I also believe that one learns these methods best by doing, and the more interesting and relevant that "doing," the better. For this reason, there are numerous example analyses throughout this book that I encourage you to reproduce as you read. To make this task easier, the Web site that accompanies the book (www.tzkeith.com) includes the data in a form that can be used in common statistical analysis programs. Many of the examples are taken from actual research in the social sciences, and I've tried to sample from research from a variety of areas. In most cases simulated data are provided that mimic the actual data used in the research.
You can reproduce the analyses of the original researchers and, perhaps, improve on them.
And the data feast doesn’t end there! The Web site also includes data from a major federal
data set: 1000 cases from the National Education Longitudinal Study (NELS) from the National
Center for Education Statistics. NELS was a nationally representative sample of 8th-grade students first surveyed in 1988 and resurveyed in 10th and 12th grades and then twice after leaving high school. The students’ parents, teachers, and school administrators were also surveyed.
The Web site includes student and parent data from the base year (8th grade) and student data
from the first follow-up (10th grade). Don't be led astray by the word Education in NELS; the students were asked an incredible variety of questions, from drug use to psychological well-being to plans for the future. Anyone with an interest in youth will find something interesting
in these data. Appendix A includes more information about the data at www.tzkeith.com.

Computer Analysis
Finally, I firmly believe that any book on statistics or research methods should be closely related
to statistical analysis software. Why plug and chug—plug numbers into formulas and chug out
the answers on a calculator—when a statistical program can do the calculations more quickly
and accurately with, for most people, no loss of understanding? Freed from the drudgery of
hand calculations, you can then concentrate on asking and answering important research questions, rather than on the intricacies of calculating statistics. This bias toward computer calculations is especially important for the methods covered in this book, which quickly become
unmanageable by hand. Use a statistical analysis program as you read this book; do the examples with me and the problems at the end of the chapters, using that program.
Which program? I use SPSS as my general statistical analysis program, and you can get the program for a reasonable price as a student in a university (approximately $100–$125 per year for the "Grad Pack" as this is written). But you need not use SPSS; any of the common packages will do (e.g., SAS or SYSTAT). The output in the text has a generic look to it, which should be easily translatable to any major statistical package output. In addition, the website (www.tzkeith.com) includes sample multiple regression and SEM output from various statistical packages.
For the second half of the book, you will need access to a structural equation modeling program. Fortunately, student or tryout versions of many such programs are available online. Student pricing for the program used extensively in this book, Amos, is available, at this writing, for approximately $50 per year as an SPSS add-on. Although programs (and pricing) change, one current limitation is that there is no Mac OS version of Amos; if you want to use Amos, you need to be able to run Windows. Amos is, in my opinion, the easiest SEM program to use (and it produces really nifty pictures). The other SEM program that I will frequently reference is Mplus. We'll talk more about SEM in Part 2 of this book. The website for this text has many examples of SEM input and output using Amos and Mplus.

Overview of the Book
This book is divided into two parts. Part 1 focuses on multiple regression analysis. We begin
by focusing on simple, bivariate regression and then expand that focus into multiple regression with two, three, and four independent variables. We will concentrate on the analysis
and interpretation of multiple regression as a way of answering interesting and important
research questions. Along the way, we will also deal with the analytic details of multiple
regression so that you understand what is going on when we do a multiple regression analysis.
We will focus on three different types, or flavors, of multiple regression that you will encounter in the research literature, their strengths and weaknesses, and their proper interpretation.
Our next step will be to add categorical independent variables to our multiple regression
analyses, at which point the relation of multiple regression and ANOVA will become clearer.
We will learn how to test for interactions and curves in the regression line and to apply these
methods to interesting research questions.
The penultimate chapter of Part 1 is a review chapter that summarizes and integrates
what we have learned about multiple regression. Besides serving as a review for those who
have gone through Part 1, it also serves as a useful introduction for those who are interested
primarily in the material in Part 2. In addition, this chapter introduces several important
topics not covered completely in previous chapters. The final chapter in Part 1 presents two
related methods, logistic regression and multilevel modeling, in a conceptual fashion using
what we have learned about multiple regression.
Part 2 focuses on structural equation modeling—the “Beyond” portion of the book’s
title. We begin by discussing path analysis, or structural equation modeling with measured
variables. Simple path analyses are easily estimated via multiple regression analysis, and
many of our questions about the proper use and interpretation of multiple regression will
be answered with this heuristic aid. We will deal in some depth with the problem of valid
versus invalid inferences of causality in these chapters. The problem of error ("the scourge of research") serves as our jumping-off place for the transition from path analysis to methods that incorporate latent variables (confirmatory factor analysis and latent variable structural equation modeling). Confirmatory factor analysis (CFA) approaches more closely the constructs of primary interest in our research by separating measurement error from variation due to these constructs. Latent variable structural equation modeling (SEM) incorporates the advantages of path analysis with those of confirmatory factor analysis into a powerful and flexible analytic system that partially obviates many of the problems we discuss as the book progresses. As we progress to more advanced SEM topics, we will learn how to test for interactions in SEM models and for differences in means of latent constructs. SEM allows powerful analysis of change over time via methods such as latent growth models. Even when we discuss fairly sophisticated SEMs, we reiterate the possible dangers of nonexperimental research in general and SEM in particular.

CHANGES TO THE SECOND EDITION
If you are coming to the second edition from the first, thank you! There are changes throughout the book, including quite a few new topics, especially in Part 2. Briefly, these include:

Changes to Part 1
All chapters have been updated to add, I hope, additional clarity. In some chapters the examples used to illustrate particular points have been replaced with new ones. In most chapters I
have added additional exercises and have tried to sample these from a variety of disciplines.
New to Part 1 is a chapter on Logistic Regression and Multilevel Modeling (Chapter 10). This brief chapter is not intended as a full treatment of these important topics but instead as a bridge to assist students who are interested in pursuing them in more depth in subsequent coursework. When I teach MR classes I consistently get questions about these methods, how to think about them, and where to go for more information. The chapter focuses on using what students have learned so far in MR, especially categorical variables and interactions, to bridge the gap between an MR class and ones that focus in more detail on LR and MLM.

Changes to Part 2

What is considered introductory material in SEM has expanded a great deal since I wrote the first edition of Multiple Regression and Beyond, and thus new chapters have been added to address these additional topics.
A chapter on Latent Means in SEM (Chapter 18) introduces the topic of mean structures
in SEM, which is required for understanding the next three chapters and which has increasingly become a part of introductory classes in SEM. The chapter uses a research example to
illustrate two methods of incorporating mean structures in SEM: MIMIC-type models and
multi-group mean and covariance structure models.
A second chapter on Confirmatory Factor Analysis has been added (Chapter 19). Now
that latent means have been introduced, this chapter revisits CFA, with the addition of latent
means. The topic of invariance testing across groups, hinted at in previous chapters, is covered in more depth.
Chapter 20 focuses on Latent Growth Models. Longitudinal models and data have been
covered in several places in the text. Here latent growth models are introduced as a method
of more directly studying the process of change.
Along with these additions, Chapter 17 (Latent Variable Models: More Advanced Topics)
and the final SEM summary chapter (Chapter 21) have been extensively modified as well.

Changes to the Appendices
Appendix A, which focused on the data sets used for the text, is considerably shortened, with
the majority of the material transferred to the web (www.tzkeith.com). Likewise, the information previously contained in appendices illustrating output from statistics programs and
SEM programs has been transferred to the web, so that I can update it regularly. There are
still appendices focused on a review of basic statistics (Appendix B) and on understanding
partial and semipartial correlations (Appendix C). The tables showing the symbols used in
the book and useful formulae are now included in appendices as well.


Acknowledgments

This project could not have been completed without the help of many people. I was amazed
by the number of people who wrote to me about the first edition with questions, compliments, and suggestions (and corrections!). Thank you! I am very grateful to the students
who have taken my classes on these topics over the years. Your questions and comments have helped me understand what aspects of the previous edition of the book worked well and which needed improvement or additional explanation. I owe a huge debt to the former and current students who "test drove" the new chapters in various forms.
I am grateful to the colleagues and students who graciously read and commented on various new sections of the book: Jacqueline Caemmerer, Craig Enders, Larry Greil, and Keenan
Pituch. I am especially grateful to Matthew Reynolds, who read and commented on every
one of the new chapters and who is a wonderful source of new ideas for how to explain difficult concepts.
I thank my hard-working editor, Rebecca Novack, and her assistants at Routledge for all
of their assistance. Rebecca’s zest and humor, and her commitment to this project, were key
to its success. None of these individuals is responsible for any remaining deficiencies of the
book, however.
Finally, a special thank you to my wife and to my sons and their families. Davis, Scotty, and
Willie, you are a constant source of joy and a great source of research ideas! Trisia provided
advice, more loving encouragement than I deserve, and the occasional nudge, all as needed.
Thank you, my love, I really could not have done this without you!





Part I
Multiple Regression




1
Introduction: Simple (Bivariate) Regression

Simple (Bivariate) Regression  4
Example: Homework and Math Achievement  4
Regression in Perspective  15
Relation of Regression to Other Statistical Methods  15
Explaining Variance  17
Advantages of Multiple Regression  18
Other Issues  19
Prediction Versus Explanation  19
Causality  19
Review of Some Basics  20
Variance and Standard Deviation  20
Correlation and Covariance  20
Working With Extant Data Sets  21
Summary  23
Exercises  24
Notes  24
This book is designed to provide a conceptually oriented introduction to multiple regression along with more complex methods that flow naturally from multiple regression: path analysis, confirmatory factor analysis, and structural equation modeling. In this introductory chapter, we begin with a discussion and example of simple, or bivariate, regression. For many readers, this will be a review, but, even then, the example and computer output should provide a transition to subsequent chapters and to multiple regression. The chapter also reviews several other related concepts and introduces several issues (prediction and explanation, causality) that we will return to repeatedly in this book. Finally, the chapter relates regression to other approaches with which you may be more familiar, such as analysis of variance (ANOVA). I will demonstrate that ANOVA and regression are fundamentally the same process and that, in fact, regression subsumes ANOVA.
As I suggested in the Preface, we start this journey by jumping right into an example and explaining it as we go. In this introduction, I have assumed that you are fairly familiar with the topics of correlation and statistical significance testing and that you have some familiarity with statistical procedures such as the t test for comparing means and analysis of variance. If these concepts are not familiar to you, a quick review is provided in Appendix B. This appendix reviews basic statistics, distributions, standard errors and confidence intervals, correlations, t tests, and ANOVA.

SIMPLE (BIVARIATE) REGRESSION
Let's start our adventure into the wonderful world of multiple regression with a review of simple, or bivariate, regression; that is, regression with only one influence (independent variable) and one outcome (dependent variable).1 Pretend that you are the parent of an adolescent. As a parent, you are interested in the influences on adolescents' school performance: what's important and what's not? Homework is of particular interest because you see your daughter Lisa struggle with it nightly and hear her complain about it daily. A quick search of the Internet reveals conflicting evidence. You may find books (Kohn, 2006) and articles (Wallis, 2006) critical of homework and homework policies. On the other hand, you may find links to research suggesting homework improves learning and achievement (Cooper, Robinson, & Patall, 2006). So you wonder: is homework just busywork, or is it a worthwhile learning experience?

Example: Homework and Math Achievement

The Data
Fortunately for you, your good friend is an 8th-grade math teacher and you are a researcher; you have the means, motive, and opportunity to find the answer to your question. Without going into the levels of permission you'd need to collect such data, pretend that you devise a quick survey that you give to all 8th-graders. The key question on this survey is:

Think about your math homework over the last month. Approximately how much time did you spend, per week, doing your math homework? Approximately ____ (fill in the blank) hours per week.

A month later, standardized achievement tests are administered; when they are available, you record the math achievement test score for each student. You now have a report of average amount of time spent on math homework and math achievement test scores for 100 8th-graders.
A portion of the data is shown in Figure 1.1. The complete data are on the website that accompanies this book, www.tzkeith.com, under Chapter 1, in several formats: as an SPSS System file (homework & ach.sav), as a Microsoft Excel file (homework & ach.xls), and as an ASCII, or plain text, file (homework & ach.txt). The values for time spent on Math Homework are in hours, ranging from zero for those who do no math homework to some upper value limited by the number of free hours in a week. The Math Achievement test scores have a national mean of 50 and a standard deviation of 10 (these are known as T scores, which have nothing to do with t tests).2

Let's turn to the analysis. Fortunately, you have good data analytic habits: you check basic descriptive data prior to doing the main regression analysis. Here's my rule: Always, always, always, always, always, always check your data prior to conducting analyses! The frequencies and descriptive statistics for the Math Homework variable are shown in Figure 1.2. Reported Math Homework ranged from no time, or zero hours, reported by 19 students, to 10 hours per week. The range of values looks reasonable, with no excessively high or impossible values. For example, if someone had reported spending 40 hours per week on Math Homework, you might be a little suspicious and would check your original data to make sure you entered the data correctly (e.g., you may have entered a "4" as a "40"). You might be a little surprised that the average amount of time spent on Math Homework per week is only 2.2 hours, but this value is certainly plausible.


[Figure 1.1 data listing omitted: two columns of raw data, Math Homework (hours per week) and Math Achievement (test score), for a portion of the sample.]

Figure 1.1  Portion of the Math Homework and Achievement data. The complete data are on the website under Chapter 1.
MATHHOME  Time Spent on Math Homework per Week

Value    Frequency   Percent   Valid Percent   Cumulative Percent
 .00         19        19.0        19.0              19.0
1.00         19        19.0        19.0              38.0
2.00         25        25.0        25.0              63.0
3.00         16        16.0        16.0              79.0
4.00         11        11.0        11.0              90.0
5.00          6         6.0         6.0              96.0
6.00          2         2.0         2.0              98.0
7.00          1         1.0         1.0              99.0
10.00         1         1.0         1.0             100.0
Total       100       100.0       100.0

Statistics: N (valid) = 100, Missing = 0, Mean = 2.2000, Median = 2.0000, Mode = 2.00, Std. Deviation = 1.8146, Variance = 3.2929, Minimum = .00, Maximum = 10.00, Sum = 220.00

Figure 1.2  Frequencies and descriptive statistics for Math Homework.



(As noted in the Preface, the regression and other results shown are portions of an SPSS printout, but the information displayed is easily generalizable to the output produced by other statistical programs.)
Next, turn to the descriptive statistics for the Math Achievement test (Figure 1.3). Again, given that the national mean for this test is 50, the 8th-grade school mean of 51.41 is reasonable, as is the range of scores from 22 to 75. In contrast, if the descriptive statistics had shown a high of, for example, 90 (four standard deviations above the mean), further investigation would be called for. The data appear to be in good shape.
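If you are following along in Python rather than SPSS, a minimal sketch of the same data check with pandas might look like this. It assumes you have saved the website's data with the variable names shown in the output (MATHHOME and MATHACH); the CSV file name is hypothetical.

import pandas as pd

# Load the homework data (hypothetical file name; the website provides
# SPSS, Excel, and plain-text versions of the same data).
data = pd.read_csv("homework_ach.csv")

# Frequencies for homework time, as in Figure 1.2.
print(data["MATHHOME"].value_counts().sort_index())

# Descriptive statistics for both variables (cf. Figures 1.2 and 1.3).
print(data[["MATHHOME", "MATHACH"]].describe())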

The Regression Analysis
Next, we conduct regression: we regress Math Achievement scores on time spent on Homework
(notice the structure of this statement: we regress the outcome on the influence or influences).
Figure 1.4 shows the means, standard deviations, and correlation between the two variables.
Descriptive Statistics

                                        N    Range   Minimum   Maximum      Sum      Mean    Std. Deviation   Variance
MATHACH  Math Achievement Test Score   100   53.00    22.00     75.00    5141.00   51.4100      11.2861       127.376
Valid N (listwise)                     100

Figure 1.3  Descriptive statistics for Math Achievement test scores.
Descriptive Statistics

                                                    Mean    Std. Deviation    N
MATHACH   Math Achievement Test Score             51.4100      11.2861       100
MATHHOME  Time Spent on Math Homework per Week     2.2000       1.8146       100

Correlations
                                        MATHACH    MATHHOME
Pearson Correlation   MATHACH            1.000       .320
                      MATHHOME            .320      1.000
Sig. (1-tailed)       MATHACH              .         .001
                      MATHHOME            .001        .
N                     MATHACH             100        100
                      MATHHOME            100        100

Figure 1.4  Results of the regression of Math Achievement on Math Homework: descriptive statistics and correlation coefficients.


The descriptive statistics match those presented earlier, without the detail. The correlation between the two variables is .320, not large, but certainly statistically significant (p < .01) with this sample of 100 students. As you read articles that use multiple regression, you may see this ordinary correlation coefficient referred to as a zero-order correlation (which distinguishes it from first-, second-, or multiple-order partial correlations, topics discussed in Appendix C).
Next, we turn to the regression itself; although we have conducted a simple regression, the computer output is in the form of multiple regression to allow a smooth transition. First, look at the model summary in Figure 1.5. It lists the R, which normally is used to designate the multiple correlation coefficient, but which, with one predictor, is the same as the simple Pearson correlation (.320).3 Next is the R², which denotes the variance explained in the outcome variable by the predictor variables. Homework time explains, accounts for, or predicts .102 (proportion) or 10.2% of the variance in Math test scores. As you run this regression yourself, your output will probably show some additional statistics (e.g., the adjusted R²); we will ignore these for the time being.
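For readers working in Python, here is a minimal sketch of the same regression using statsmodels, again assuming the hypothetical CSV file from the earlier snippet:

import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("homework_ach.csv")  # hypothetical file name

# Regress Math Achievement on Math Homework.
model = smf.ols("MATHACH ~ MATHHOME", data=data).fit()
print(model.summary())   # R-squared, F test, and coefficients
print(model.rsquared)    # should be about .102 with these data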
Is the regression, that is, the multiple R and R², statistically significant? We know it is, because we already noted the statistical significance of the zero-order correlation, and this "multiple" regression is actually a simple regression with only one predictor. But, again, we'll check the output for consistency with subsequent examples. Interestingly, we use an F test, as in ANOVA, to test the statistical significance of the regression equation:

F = (ss_regression / df_regression) / (ss_residual / df_residual)

The term ss_regression stands for the sums of squares for the regression and is a measure of the variation in the dependent variable that is explained by the independent variable(s); the ss_residual is the variance unexplained by the regression. If you are interested in knowing how to calculate these values by hand, turn to Note 4 at the end of this chapter; here, we will use the values from the statistical output in Figure 1.5.4
Model Summary
Model     R       R Square
1         .320a     .102
a. Predictors: (Constant), MATHHOME Time Spent on Math Homework per Week

ANOVAb
Model 1        Sum of Squares    df    Mean Square      F       Sig.
Regression         1291.231       1      1291.231     11.180    .001a
Residual          11318.959      98       115.500
Total             12610.190      99
a. Predictors: (Constant), MATHHOME Time Spent on Math Homework per Week
b. Dependent Variable: MATHACH Math Achievement Test Score

Figure 1.5  Results of the regression of Math Achievement on Math Homework: statistical significance of the regression.


The sums of squares for the regression versus the residual are shown in the ANOVA table. In regression, the degrees of freedom (df) for the regression are equal to the number of independent variables (k), and the df for the residual, or error, are equal to the sample size minus the number of independent variables in the equation minus 1 (N − k − 1); the df are also shown in the ANOVA table. We'll double-check the numbers:

F = (1291.231 / 1) / (11318.959 / 98) = 1291.231 / 115.500 = 11.179

which is the same value shown in the table, within errors of rounding. What is the probability of obtaining a value of F as large as 11.179 if these two variables were in fact unrelated in the population? According to the table (in the column labeled "Sig."), such an occurrence would occur only 1 time in 1,000 (p = .001); it would seem logical that these two variables are indeed related. We can double-check this probability by referring to an F table under 1 and 98 df; is the value 11.179 greater than the tabled value? Instead, however, I suggest that you use a computer program to calculate these probabilities. Excel, for example, will find the probability for values of all the distributions discussed in this text. Simply put the calculated value of F (11.179) in one cell, the degrees of freedom for the regression (1) in the next, and the df for the residual (98) in the next. Go to the next cell; then click on Insert, Function, select the category Statistical, and scroll down until you find FDIST, for the F distribution. Click on it and point to the cells containing the required information. Alternatively, you could go directly to Function and FDIST and simply type in these numbers, as was done in Figure 1.6. Excel returns a value of .001172809, or .001, as shown in the figure. Although I present this method of determining probabilities as a way of double-checking the computer output at this point, at times your computer program will not display the probabilities you are interested in, and this method will be useful.
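The same check is a one-liner in Python; scipy's f.sf returns the upper-tail probability, the quantity Excel's FDIST reports:

from scipy import stats

# Probability of F >= 11.179 with 1 and 98 degrees of freedom.
p = stats.f.sf(11.179, 1, 98)
print(round(p, 9))  # approximately .001172809, or .001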

Figure 1.6  Using Excel to calculate probability: statistical significance of an F (1, 98) of 11.179.


There is another formula you can use to calculate F, an extension of which will come in handy later:

F = (R² / k) / ((1 − R²) / (N − k − 1))

This formula compares the proportion of variance explained by the regression (R²) with the proportion of variance left unexplained by the regression (1 − R²). This formula may seem quite different from the one presented previously until you remember that (1) k is equal to the df for the regression and N − k − 1 is equal to the df for the residual, and (2) the sums of squares from the previous formula are also estimates of variance. Try this formula to make sure you get the same results (within rounding error).
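Here is that check in Python, using the rounded R² from the output; the small discrepancy from 11.179 is due to rounding R² to three decimals:

# F computed from R-squared: F = (R2/k) / ((1 - R2)/(N - k - 1)).
R2, k, N = 0.102, 1, 100

F = (R2 / k) / ((1 - R2) / (N - k - 1))
print(round(F, 2))  # about 11.13, matching 11.179 within rounding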
I noted that the ss_regression is a measure of the variance explained in the dependent variable by the independent variables, and also that R² denotes the variance explained. Given these descriptions, you may expect that the two concepts should be related. They are, and we can calculate the R² from the ss_regression:

R² = ss_regression / ss_total

We can put this formula into words: There is a certain amount of variance in the dependent variable (total variance), and the independent variables can explain a portion of this variance (variance due to the regression). The R² is the proportion of the total variance in the dependent variable that is explained by the independent variables. For the current example, the total variance in the dependent variable, Math Achievement (ss_total), was 12610.190 (Figure 1.5), and Math Homework explained 1291.231 of this variance. Thus,

R² = ss_regression / ss_total = 1291.231 / 12610.190 = .102

and Homework explains .102 or 10.2% of the variance in Math Achievement. Obviously, R² can vary between 0 (no variance explained) and 1 (100% explained).
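And the same computation in Python, using the sums of squares from the ANOVA table in Figure 1.5:

# R-squared as the proportion of total variation explained.
ss_regression, ss_total = 1291.231, 12610.190

R2 = ss_regression / ss_total
print(round(R2, 3))  # .102, or 10.2% of the variance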

The Regression Equation
Next, let's take a look at the coefficients for the regression equation, the notable parts of which are shown in Figure 1.7. The general formula for a regression equation is Y = a + bX + e, which, translated into English, says that a person's score on the dependent variable (in this case, Math Achievement) is a result of a constant (a), plus a coefficient (b) times his or her value on the independent variable (Math Homework), plus error. Values for both a and b are shown in the second column of the table in Figure 1.7 (Unstandardized Coefficients, B; SPSS uses the uppercase B rather than the lowercase b). a is a constant, called the intercept, and its value is 47.032 for this homework–achievement example. The intercept is the predicted score on the dependent variable for someone with a score of zero on the independent variable. b, the unstandardized regression coefficient, is 1.990. Because we don't have a direct estimate of the error, we'll focus on a different form of the regression equation: Y′ = a + bX, in which Y′ is the predicted value of Y. The completed equation is Y′ = 47.032 + 1.990X, meaning that to predict a person's Math Achievement score we can multiply his or her report of time spent on Math Homework by 1.990 and add 47.032. Thus, the predicted score for a student who does no homework would be 47.032, the predicted score for an 8th-grader who does 1 hour of homework is 49.022


Coefficientsa

                                    Unstandardized     Standardized
                                     Coefficients      Coefficients                       95% Confidence Interval for B
Model 1                              B      Std. Error     Beta          t       Sig.     Lower Bound    Upper Bound
Intercept (Constant)               47.032     1.694                    27.763    .000       43.670         50.393
MATHHOME Time Spent on
  Math Homework per Week            1.990      .595        .320         3.344    .001         .809          3.171
a. Dependent Variable: MATHACH Math Achievement Test Score

Figure 1.7  Results of the regression of Math Achievement on Math Homework: regression coefficients.

(1 × 1.990 + 47.032), the predicted score for a student who does 2 hours of homework is 51.012 (2 × 1.990 + 47.032), and so on.
Several questions may spring to mind after these last statements. Why, for example, would we want to predict a student's Achievement score (Y′) when we already know the student's real Achievement score? The answer is that we want to use this formula to summarize the relation between homework and achievement for all students at the same time. We may also be able to use the formula for other purposes: to predict scores for another group of students or, to return to the original purpose, to predict Lisa's likely future math achievement, given her time spent on math homework. Or we may want to know what would likely happen if a student or group of students were to increase or decrease the time they spent on math homework.
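The prediction equation is also easy to express as a short function; a sketch in Python, using the intercept and slope from Figure 1.7:

# Predicted Math Achievement: Y' = a + bX.
a, b = 47.032, 1.990

def predicted_achievement(homework_hours):
    # Predicted Math Achievement T score for weekly homework hours.
    return a + b * homework_hours

for hours in (0, 1, 2, 5):
    print(hours, round(predicted_achievement(hours), 3))
# Prints 47.032, 49.022, 51.012, and 56.982, matching the text.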

Interpretation
But to get back to our original question, we now have some very useful information for Lisa, contained within the regression coefficient (b = 1.99), because this coefficient tells us the amount we can expect the outcome variable (Math Achievement) to change for each 1-unit change in the independent variable (Math Homework). Because the Homework variable is in hours spent per week, we can make this statement: "For each additional hour students spend on Mathematics Homework every week, they can expect to see close to a 2-point increase in Math Achievement test scores." Now, Achievement test scores are not that easy to change; it is much easier, for example, to improve grades than test scores (Keith, Diamond-Hallam, & Fine, 2004), so this represents an important effect. Given the standard deviation of the test scores (10 points), a student should be able to improve his or her scores by a standard deviation by studying a little more than 5 extra hours a week; this could mean moving from average-level to high-average-level achievement. Of course, this proposition might be more interesting to a student who is currently spending very little time studying than to one who is already spending a lot of time working on math homework.


The Regression Line
The regression equation may be used to graph the relation between Math Homework and
Achievement, and this graph can also illustrate nicely the predictions made in the previous
paragraph. The intercept (a) is the value on the Y (Achievement) axis for a value of zero for
X (Homework); in other words, the intercept is the value on the Achievement test we would
expect for someone who does no homework. We can use the intercept as one data point for
drawing the regression line (X = 0, Y = 47.032). The second data point is simply the point
defined by the mean of X (Mx = 2.200) and the mean of Y (My = 51.410). The graph, with
these two data points highlighted, is shown in Figure 1.8. We can use the graph and data to


[Figure 1.8 plot omitted: regression line of Achievement Test Scores (Y axis, 20 to 80) on Time Spent on Math Homework per Week (X axis, 0 to 12), with the intercept (0, 47.032) and the joint means (2.200, 51.410) highlighted.]

Figure 1.8  Regression line for Math Achievement on Math Homework. The line is drawn through the intercept and the joint means of X and Y.

check the calculation of the value of b, which is equal to the slope of the regression line. The slope is equal to the increase in Y for each unit increase in X (or the rise of the line divided by the run of the line); we can use the two data points plotted to calculate the slope:

b = rise / run = (My − a) / (Mx − 0) = (51.410 − 47.032) / 2.200 = 1.990

Let's consider for a few moments the graph and these formulas. The slope represents the predicted increase in Y for each unit increase in X. For this example, this means that for each unit—in this case, each hour—increase in Homework, Achievement scores increase, on average, 1.990 points. This, then, is the interpretation of an unstandardized coefficient: It is the predicted increase in Y expected for each unit increase in X. When the independent variable has a meaningful metric, like hours spent studying Mathematics every week, the interpretation of b is easy and straightforward. We can also generalize from this group-generated equation to individuals (to the extent that they are similar to the group that generated the regression equation). Thus the graph and b can be used to make predictions for others, such as Lisa. She can check her current level of homework time and see how much payoff she might expect for additional time (or how much she can expect to lose if she studies less). The intercept is also worth noting; it shows that the average Achievement test score for students who do no studying is 47.032, slightly below the national average.
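The slope arithmetic is simple enough to verify in a few lines of Python:

# Slope from two points on the line: the intercept (0, 47.032)
# and the joint means of X and Y (2.200, 51.410).
a = 47.032
mean_x, mean_y = 2.200, 51.410

b = (mean_y - a) / (mean_x - 0)  # rise over run
print(round(b, 3))  # 1.990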
Because we are using a modern statistical package, there is no need to draw the plot of the
regression line ourselves; any such program will do it for us. Figure 1.9 shows the data points
and regression line drawn using SPSS (a scatterplot was created using the graph feature; see
www.tzkeith.com for examples). The small circles in this figure are the actual data points;



[Figure 1.9 plot omitted: scatterplot of Math Achievement Test Score (Y axis, 20.00 to 80.00) against Time Spent on Math Homework per Week (X axis, 0.00 to 10.00), with the fitted regression line.]

Figure 1.9  Regression line, with data points, as produced by the SPSS Scatter/Dot graph command.

notice how variable they are. If the R were larger, the data points would cluster more closely
around the regression line. We will return to this topic in a subsequent chapter.
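If you prefer to draw the plot in Python, here is a minimal matplotlib sketch, again assuming the hypothetical CSV file from the earlier snippets:

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("homework_ach.csv")  # hypothetical file name
a, b = 47.032, 1.990                    # intercept and slope from Figure 1.7

plt.scatter(data["MATHHOME"], data["MATHACH"], s=15)
xs = [data["MATHHOME"].min(), data["MATHHOME"].max()]
plt.plot(xs, [a + b * x for x in xs])   # the fitted regression line
plt.xlabel("Time Spent on Math Homework per Week")
plt.ylabel("Math Achievement Test Score")
plt.show()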

Statistical Significance of Regression Coefficients
There are a few more details to study for this regression analysis before stepping back and further considering the meaning of the results. With multiple regression, we will also be interested in whether each regression coefficient is statistically significant. Return to the table of regression coefficients (Figure 1.7), and note the columns labeled t and Sig. The values corresponding to the regression coefficient are simply the results of a t test of the statistical significance of the regression coefficient (b). The formula for t is one of the most ubiquitous in statistics (Kerlinger, 1986):

t = statistic / standard error of the statistic

or, in this case,

t = b / SE_b = 1.990 / .595 = 3.345

As shown in Figure 1.7, the value of t is 3.344, with N − k − 1 degrees of freedom (98).

If we look up this value in Excel (using the function TDIST), we find the probability of obtaining such a t by chance is .001171 (a two-tailed test), rounded off to .001 (the value shown in Figure 1.7).
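The same check in Python; scipy's t.sf gives a one-tailed probability, so we double it for the two-tailed test:

from scipy import stats

# t = b / SE(b), with N - k - 1 = 98 degrees of freedom.
t = 1.990 / 0.595
p = 2 * stats.t.sf(abs(t), 98)
print(round(t, 3), round(p, 6))  # about 3.345 and .001171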