
Bowerman−O’Connell: Business Statistics in Practice, Third Edition. © The McGraw−Hill Companies, 2003

Chapter 11
Simple Linear Regression Analysis

Chapter Outline
11.1 The Simple Linear Regression Model
11.2 The Least Squares Estimates, and Point Estimation and Prediction
11.3 Model Assumptions and the Standard Error
11.4 Testing the Significance of the Slope and y Intercept
11.5 Confidence and Prediction Intervals
11.6 Simple Coefficients of Determination and Correlation
11.7 An F Test for the Model
*11.8 Residual Analysis
*11.9 Some Shortcut Formulas

*Optional section



Managers often make decisions by studying the relationships between variables, and process improvements can often be made by understanding how changes in one or more variables affect the process output. Regression analysis is a statistical technique in which we use observed data to relate a variable of interest, which is called the dependent (or response) variable, to one or more independent (or predictor) variables. The objective is to build a regression model, or prediction equation, that can be used to describe, predict, and control the dependent variable on the basis of the independent variables. For example, a company might wish to improve its marketing process. After collecting data concerning the demand for a product, the product’s price, and the advertising expenditures made to promote the product, the company might use regression analysis to develop an equation to predict demand on the basis of price and advertising

expenditure. Predictions of demand for various
price–advertising expenditure combinations can then
be used to evaluate potential changes in the company’s
marketing strategies. As another example, a
manufacturer might use regression analysis to describe
the relationship between several input variables and
an important output variable. Understanding the
relationships between these variables would allow the
manufacturer to identify control variables that can be
used to improve the process performance.
In the next two chapters we give a thorough
presentation of regression analysis. We begin in this
chapter by presenting simple linear regression analysis.
Using this technique is appropriate when we are
relating a dependent variable to a single independent
variable and when a straight-line model describes the
relationship between these two variables. We explain
many of the methods of this chapter in the context of
two new cases:

The Fuel Consumption Case: A management
consulting firm uses simple linear regression

analysis to predict the weekly amount of fuel (in
millions of cubic feet of natural gas) that will be
required to heat the homes and businesses in a
small city on the basis of the week’s average
hourly temperature. A natural gas company
uses these predictions to improve its gas
ordering process. One of the gas company’s
objectives is to reduce the fines imposed by its
pipeline transmission system when the

company places inaccurate natural gas
orders.
The QHIC Case: The marketing department at
Quality Home Improvement Center (QHIC) uses
simple linear regression analysis to predict home
upkeep expenditure on the basis of home value.
Predictions of home upkeep expenditures are used
to help determine which homes should be sent
advertising brochures promoting QHIC’s products
and services.

11.1 ■ The Simple Linear Regression Model
The simple linear regression model assumes that the relationship between the dependent
variable, which is denoted y, and the independent variable, denoted x, can be approximated
by a straight line. We can tentatively decide whether there is an approximate straight-line relationship between y and x by making a scatter diagram, or scatter plot, of y versus x. First,
data concerning the two variables are observed in pairs. To construct the scatter plot, each value
of y is plotted against its corresponding value of x. If the y values tend to increase or decrease
in a straight-line fashion as the x values increase, and if there is a scattering of the (x, y) points
around the straight line, then it is reasonable to describe the relationship between y and x by
using the simple linear regression model. We illustrate this in the following case study, which

shows how regression analysis can help a natural gas company improve its gas ordering
process.

EXAMPLE 11.1 The Fuel Consumption Case: Reducing Natural Gas
Transmission Fines
When the natural gas industry was deregulated in 1993, natural gas companies became responsible for acquiring the natural gas needed to heat the homes and businesses in the cities they serve.
To do this, natural gas companies purchase natural gas from marketers (usually through long-term contracts) and periodically (daily, weekly, monthly, or the like) place orders for natural gas
to be transmitted by pipeline transmission systems to their cities. There are hundreds of pipeline
transmission systems in the United States, and many of these systems supply a large number of




cities. For instance, the map on pages 448 and 449 illustrates the pipelines of and the cities served
by the Columbia Gas System.
To place an order (called a nomination) for an amount of natural gas to be transmitted to its
city over a period of time (day, week, month), a natural gas company makes its best prediction of
the city’s natural gas needs for that period. The natural gas company then instructs its marketer(s)
to deliver this amount of gas to its pipeline transmission system. If most of the natural gas companies being supplied by the transmission system can predict their cities’ natural gas needs with
reasonable accuracy, then the overnominations of some companies will tend to cancel the undernominations of other companies. As a result, the transmission system will probably have enough
natural gas to efficiently meet the needs of the cities it supplies.
In order to encourage natural gas companies to make accurate transmission nominations and
to help control costs, pipeline transmission systems charge, in addition to their usual fees, transmission fines. A natural gas company is charged a transmission fine if it substantially undernominates natural gas, which can lead to an excessive number of unplanned transmissions, or if it
substantially overnominates natural gas, which can lead to excessive storage of unused gas. Typically, pipeline transmission systems allow a certain percentage nomination error before they
impose a fine. For example, some systems do not impose a fine unless the actual amount of natural gas used by a city differs from the nomination by more than 10 percent. Beyond the allowed
percentage nomination error, fines are charged on a sliding scale—the larger the nomination
error, the larger the transmission fine. Furthermore, some transmission systems evaluate nomination errors and assess fines more often than others. For instance, some transmission systems do
this as frequently as daily, while others do this weekly or monthly (this frequency depends on the
number of storage fields to which the transmission system has access, the system’s accounting
practices, and other factors). In any case, each natural gas company needs a way to accurately
predict its city’s natural gas needs so it can make accurate transmission nominations.
Suppose we are analysts in a management consulting firm. The natural gas company serving a
small city has hired the consulting firm to develop an accurate way to predict the amount of fuel
(in millions of cubic feet—MMcf—of natural gas) that will be required to heat the city. Because
the pipeline transmission system supplying the city evaluates nomination errors and assesses fines
weekly, the natural gas company wants predictions of future weekly fuel consumptions.¹ Moreover, since the pipeline transmission system allows a 10 percent nomination error before assessing a fine, the natural gas company would like the actual and predicted weekly fuel consumptions
to differ by no more than 10 percent. Our experience suggests that weekly fuel consumption
substantially depends on the average hourly temperature (in degrees Fahrenheit) measured in the
city during the week. Therefore, we will try to predict the dependent (response) variable weekly
fuel consumption (y) on the basis of the independent (predictor) variable average hourly temperature (x) during the week. To this end, we observe values of y and x for eight weeks. The data

are given in Table 11.1. In Figure 11.1 we give an Excel output of a scatter plot of y versus x. This plot shows

1  A tendency for the fuel consumption to decrease in a straight-line fashion as the temperatures increase.
2  A scattering of points around the straight line.
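These two characteristics can also be checked numerically with the Table 11.1 data; a minimal sketch in Python, using the sample correlation coefficient (covered formally in Section 11.6) as a rough measure of how tightly the points cluster around a straight line:

```python
# Fuel consumption data from Table 11.1.
temps = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]   # x, average hourly temperature (deg F)
fuel  = [12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5]       # y, weekly fuel consumption (MMcf)

n = len(temps)
mx, my = sum(temps) / n, sum(fuel) / n

# Sums of squares and cross products about the means.
sxy = sum((x - mx) * (y - my) for x, y in zip(temps, fuel))
sxx = sum((x - mx) ** 2 for x in temps)
syy = sum((y - my) ** 2 for y in fuel)

r = sxy / (sxx * syy) ** 0.5
print(f"sample correlation r = {r:.3f}")
# A value of r close to -1 reflects both characteristics: a decreasing
# straight-line tendency (negative sign) and fairly tight scatter around
# the line (magnitude near 1).
```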

A regression model describing the relationship between y and x must represent these two characteristics. We now develop such a model.²
We begin by considering a specific average hourly temperature x. For example, consider the
average hourly temperature 28°F, which was observed in week 1, or consider the average hourly
temperature 45.9°F, which was observed in week 5 (there is nothing special about these two
average hourly temperatures, but we will use them throughout this example to help explain the
idea of a regression model). For the specific average hourly temperature x that we consider, there
are, in theory, many weeks that could have this temperature. However, although these weeks

¹ For whatever period of time a transmission system evaluates nomination errors and charges fines, a natural gas company is free to actually make nominations more frequently. Sometimes this is a good strategy, but we will not further discuss it.

² Generally, the larger the sample size is—that is, the more combinations of values of y and x that we have observed—the more accurately we can describe the relationship between y and x. Therefore, as the natural gas company observes values of y and x in future weeks, the new data should be added to the data in Table 11.1.



TABLE 11.1  The Fuel Consumption Data    FuelCon1

Week    Average Hourly Temperature, x (°F)    Weekly Fuel Consumption, y (MMcf)
1       28.0                                  12.4
2       28.0                                  11.7
3       32.5                                  12.4
4       39.0                                  10.8
5       45.9                                   9.4
6       57.8                                   9.5
7       58.1                                   8.0
8       62.5                                   7.5
FIGURE 11.1  Excel Output of a Scatter Plot of y versus x

[Figure: Excel worksheet listing TEMP and FUELCONS, with a scatter plot of FUEL (5 to 15) versus TEMP (20 to 70).]


each have the same average hourly temperature, other factors that affect fuel consumption could vary from week to week. For example, these weeks might have different average hourly wind velocities, different thermostat settings, and so forth. Therefore, the weeks could have different fuel consumptions. It follows that there is a population of weekly fuel consumptions that could be observed when the average hourly temperature is x. Furthermore, this population has a mean, which we denote as μy|x (pronounced mu of y given x).
We can represent the straight-line tendency we observe in Figure 11.1 by assuming that μy|x is related to x by the equation

μy|x = β0 + β1x

This equation is the equation of a straight line with y-intercept β0 (pronounced beta zero) and slope β1 (pronounced beta one). To better understand the straight line and the meanings of β0 and β1, we must first realize that the values of β0 and β1 determine the precise value of the mean weekly fuel consumption μy|x that corresponds to a given value of the average hourly temperature x. We cannot know the true values of β0 and β1, and in the next section we learn how to estimate these values. However, for illustrative purposes, let us suppose that the true value of β0 is 15.77 and the true value of β1 is −.1281. It would then follow, for example, that the mean of the population of all weekly fuel consumptions that could be observed when the average hourly temperature is 28°F is

μy|28 = β0 + β1(28)
      = 15.77 − .1281(28)
      = 12.18 MMcf of natural gas

As another example, it would also follow that the mean of the population of all weekly fuel consumptions that could be observed when the average hourly temperature is 45.9°F is

μy|45.9 = β0 + β1(45.9)
        = 15.77 − .1281(45.9)
        = 9.89 MMcf of natural gas

Note that, as the average hourly temperature increases from 28°F to 45.9°F, mean weekly fuel consumption decreases from 12.18 MMcf to 9.89 MMcf of natural gas. This makes sense because we would expect to use less fuel if the average hourly temperature increases. Of course, because we do not know the true values of β0 and β1, we cannot actually calculate these mean weekly fuel consumptions. However, when we learn in the next section how to estimate β0 and β1, we will then be able to estimate the mean weekly fuel consumptions. For now, when we say that the equation μy|x = β0 + β1x is the equation of a straight line, we mean that the different mean weekly fuel consumptions that correspond to different average hourly temperatures lie exactly on a straight line. For example, consider the eight mean weekly fuel consumptions that correspond to the eight average hourly temperatures in Table 11.1. In Figure 11.2(a) we depict these mean weekly fuel consumptions as triangles that lie exactly on the straight line defined by



[Map: Columbia Gas System. Shows pipelines (Columbia Gas Transmission, Columbia Gulf Transmission, Cove Point LNG), storage fields, independent power projects, the distribution service territory, corporate headquarters, and communities served by Columbia companies and by companies supplied by Columbia across Michigan, Ohio, Indiana, Kentucky, and Tennessee. Source: Columbia Gas System 1995 Annual Report.]


[Map continued: communities served in New York, Pennsylvania, New Jersey, Maryland, Delaware, West Virginia, Virginia, and North Carolina, including the Cove Point Terminal. © Reprinted courtesy of Columbia Gas System.]


FIGURE 11.2  The Simple Linear Regression Model Relating Weekly Fuel Consumption (y) to Average Hourly Temperature (x)

[Figure: three panels.
(a) The line of means and the error terms: the straight line defined by the equation μy|x = β0 + β1x, with μy|28 (mean weekly fuel consumption when x = 28) and the positive error term for the first week (observed fuel consumption 12.4), and μy|45.9 (mean weekly fuel consumption when x = 45.9) and the negative error term for the fifth week (observed fuel consumption 9.4).
(b) The slope of the line of means: β1 is the change in mean weekly fuel consumption associated with a one-degree increase in average hourly temperature, from β0 + β1c at x = c to β0 + β1(c + 1) at x = c + 1.
(c) The y-intercept of the line of means: β0 is the mean weekly fuel consumption when the average hourly temperature is 0°F.]

the equation μy|x = β0 + β1x. Furthermore, in this figure we draw arrows pointing to the triangles that represent the previously discussed means μy|28 and μy|45.9. Sometimes we refer to the straight line defined by the equation μy|x = β0 + β1x as the line of means.
In order to interpret the slope β1 of the line of means, consider two different weeks. Suppose that for the first week the average hourly temperature is c. The mean weekly fuel consumption for all such weeks is

β0 + β1(c)

For the second week, suppose that the average hourly temperature is (c + 1). The mean weekly fuel consumption for all such weeks is

β0 + β1(c + 1)

It is easy to see that the difference between these mean weekly fuel consumptions is β1. Thus, as illustrated in Figure 11.2(b), the slope β1 is the change in mean weekly fuel consumption that is associated with a one-degree increase in average hourly temperature. To interpret the meaning of



the y-intercept β0, consider a week having an average hourly temperature of 0°F. The mean weekly fuel consumption for all such weeks is

β0 + β1(0) = β0

Therefore, as illustrated in Figure 11.2(c), the y-intercept β0 is the mean weekly fuel consumption when the average hourly temperature is 0°F. However, because we have not observed any weeks with temperatures near 0, we have no data to tell us what the relationship between mean weekly fuel consumption and average hourly temperature looks like for temperatures near 0. Therefore, the interpretation of β0 is of dubious practical value. More will be said about this later.
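Using the illustrative values β0 = 15.77 and β1 = −.1281 assumed earlier in this example, the line of means and the interpretations of the slope and the y-intercept can be checked with a few lines of code (a sketch; these are assumed values, not estimates):

```python
# Line of means for the fuel consumption example, using the
# illustrative parameter values assumed in the text.
BETA0 = 15.77    # y-intercept: mean weekly fuel consumption at 0 deg F
BETA1 = -0.1281  # slope: change in mean consumption per one-degree increase

def mean_fuel(x):
    """Mean weekly fuel consumption (MMcf) at average hourly temperature x."""
    return BETA0 + BETA1 * x

print(round(mean_fuel(28.0), 2))   # mean consumption when x = 28
print(round(mean_fuel(45.9), 2))   # mean consumption when x = 45.9

# The slope is the change in the mean for a one-degree increase,
# no matter which temperature c we start from.
c = 40.0
print(round(mean_fuel(c + 1) - mean_fuel(c), 4))

# The y-intercept is the mean when x = 0 (of dubious practical value here).
print(mean_fuel(0.0))
```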
Now recall that the observed weekly fuel consumptions are not exactly on a straight line. Rather, they are scattered around a straight line. To represent this phenomenon, we use the simple linear regression model

y = μy|x + ε
  = β0 + β1x + ε

This model says that the weekly fuel consumption y observed when the average hourly temperature is x differs from the mean weekly fuel consumption μy|x by an amount equal to ε (pronounced epsilon). Here ε is called an error term. The error term describes the effect on y of all factors other than the average hourly temperature. Such factors would include the average hourly wind velocity and the average hourly thermostat setting in the city. For example, Figure 11.2(a) shows that the error term for the first week is positive. Therefore, the observed fuel consumption y = 12.4 in the first week was above the corresponding mean weekly fuel consumption for all weeks when x = 28. As another example, Figure 11.2(a) also shows that the error term for the fifth week was negative. Therefore, the observed fuel consumption y = 9.4 in the fifth week was below the corresponding mean weekly fuel consumption for all weeks when x = 45.9. More generally, Figure 11.2(a) illustrates that the simple linear regression model says that the eight observed fuel consumptions (the dots in the figure) deviate from the eight mean fuel consumptions (the triangles in the figure) by amounts equal to the error terms (the line segments in the figure). Of course, since we do not know the true values of β0 and β1, the relative positions of the quantities pictured in the figure are only hypothetical.
With the fuel consumption example as background, we are ready to define the simple linear regression model relating the dependent variable y to the independent variable x. We suppose that we have gathered n observations—each observation consists of an observed value of x and its corresponding value of y. Then:

The Simple Linear Regression Model

The simple linear (or straight line) regression model is:  y = μy|x + ε = β0 + β1x + ε

Here

1  μy|x = β0 + β1x is the mean value of the dependent variable y when the value of the independent variable is x.
2  β0 is the y-intercept. β0 is the mean value of y when x equals 0.³
3  β1 is the slope. β1 is the change (amount of increase or decrease) in the mean value of y associated with a one-unit increase in x. If β1 is positive, the mean value of y increases as x increases. If β1 is negative, the mean value of y decreases as x increases.
4  ε is an error term that describes the effects on y of all factors other than the value of the independent variable x.

This model is illustrated in Figure 11.3 (note that x0 in this figure denotes a specific value of the independent variable x). The y-intercept β0 and the slope β1 are called regression parameters. Because we do not know the true values of these parameters, we must use the sample data to

³ As implied by the discussion of Example 11.1, if we have not observed any values of x near 0, this interpretation is of dubious practical value.
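A small simulation can make the model in the box concrete. The sketch below generates observations y = β0 + β1x + ε around a line of means, using the illustrative fuel-consumption parameters from this section; the normally distributed error term and its standard deviation of 0.5 are assumptions chosen for illustration (the model assumptions are discussed in Section 11.3):

```python
import random

# Simulate observations from the simple linear regression model
#   y = beta0 + beta1 * x + epsilon
# using the illustrative fuel-consumption parameters from the text.
BETA0, BETA1 = 15.77, -0.1281
random.seed(11)

temps = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]
observations = []
for x in temps:
    epsilon = random.gauss(0.0, 0.5)      # assumed error standard deviation
    y = BETA0 + BETA1 * x + epsilon       # observed value = mean value + error term
    observations.append((x, y, epsilon))

for x, y, eps in observations:
    mean_y = BETA0 + BETA1 * x
    # Each observation deviates from its line-of-means value by epsilon.
    print(f"x={x:5.1f}  mean={mean_y:6.2f}  y={y:6.2f}  error={eps:+.2f}")
```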


FIGURE 11.3  The Simple Linear Regression Model (Here β1 > 0)

[Figure: the straight line defined by the equation μy|x = β0 + β1x, with y-intercept β0, the slope β1 shown as the change in the mean value of y for a one-unit change in x, and, at a specific value x0 of the independent variable x, an observed value of y deviating from the mean value of y by the error term.]
estimate these values. We see how this is done in the next section. In later sections we show how
to use these estimates to predict y.
The fuel consumption data in Table 11.1 were observed sequentially over time (in eight
consecutive weeks). When data are observed in time sequence, the data are called time series
data. Many applications of regression utilize such data. Another frequently used type of data
is called cross-sectional data. This kind of data is observed at a single point in time.


EXAMPLE 11.2 The QHIC Case

Quality Home Improvement Center (QHIC) operates five stores in a large metropolitan area. The
marketing department at QHIC wishes to study the relationship between x, home value (in thousands of dollars), and y, yearly expenditure on home upkeep (in dollars). A random sample of
40 homeowners is taken and asked to estimate their expenditures during the previous year on the
types of home upkeep products and services offered by QHIC. Public records of the county
auditor are used to obtain the previous year’s assessed values of the homeowners’ homes. The
resulting x and y values are given in Table 11.2. Because the 40 observations are for the same
year (for different homes), these data are cross-sectional.
The MINITAB output of a scatter plot of y versus x is given in Figure 11.4. We see that the observed values of y tend to increase in a straight-line (or slightly curved) fashion as x increases. Assuming that μy|x and x have a straight-line relationship, it is reasonable to relate y to x by using the simple linear regression model having a positive slope (β1 > 0)

y = β0 + β1x + ε

The slope β1 is the change (increase) in mean dollar yearly upkeep expenditure that is associated with each $1,000 increase in home value. In later examples the marketing department at QHIC will use predictions given by this simple linear regression model to help determine which homes should be sent advertising brochures promoting QHIC’s products and services.
We have interpreted the slope β1 of the simple linear regression model to be the change in the mean value of y associated with a one-unit increase in x. We sometimes refer to this change as the
effect of the independent variable x on the dependent variable y. However, we cannot prove that



TABLE 11.2  The QHIC Upkeep Expenditure Data    QHIC

Home   Value of Home, x (Thousands of Dollars)   Upkeep Expenditure, y (Dollars)
1      237.00                                    1,412.08
2      153.08                                      797.20
3      184.86                                      872.48
4      222.06                                    1,003.42
5      160.68                                      852.90
6       99.68                                      288.48
7      229.04                                    1,288.46
8      101.78                                      423.08
9      257.86                                    1,351.74
10      96.28                                      378.04
11     171.00                                      918.08
12     231.02                                    1,627.24
13     228.32                                    1,204.78
14     205.90                                      857.04
15     185.72                                      775.00
16     168.78                                      869.26
17     247.06                                    1,396.00
18     155.54                                      711.50
19     224.20                                    1,475.18
20     202.04                                    1,413.32
21     153.04                                      849.14
22     232.18                                    1,313.84
23     125.44                                      602.06
24     169.82                                      642.14
25     177.28                                    1,038.80
26     162.82                                      697.00
27     120.44                                      324.34
28     191.10                                      965.10
29     158.78                                      920.14
30     178.50                                      950.90
31     272.20                                    1,670.32
32      48.90                                      125.40
33     104.56                                      479.78
34     286.18                                    2,010.64
35      83.72                                      368.36
36      86.20                                      425.60
37     133.58                                      626.90
38     212.86                                    1,316.94
39     122.02                                      390.16
40     198.02                                    1,090.84

FIGURE 11.4  MINITAB Plot of Upkeep Expenditure versus Value of Home for the QHIC Data

[Figure: scatter plot of UPKEEP (0 to 2000) versus VALUE (100 to 300).]

a change in an independent variable causes a change in the dependent variable. Rather, regression can be used only to establish that the two variables move together and that the independent
variable contributes information for predicting the dependent variable. For instance, regression
analysis might be used to establish that as liquor sales have increased over the years, college professors’ salaries have also increased. However, this does not prove that increases in liquor sales
cause increases in college professors’ salaries. Rather, both variables are influenced by a third
variable—long-run growth in the national economy.
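The increasing straight-line tendency in Figure 11.4 can likewise be confirmed numerically. This sketch computes the sample correlation between home value and upkeep expenditure for the 40 observations in Table 11.2 (the sample correlation coefficient is covered formally in Section 11.6):

```python
# QHIC data from Table 11.2: home value x (thousands of dollars)
# and yearly upkeep expenditure y (dollars).
x = [237.00, 153.08, 184.86, 222.06, 160.68, 99.68, 229.04, 101.78,
     257.86, 96.28, 171.00, 231.02, 228.32, 205.90, 185.72, 168.78,
     247.06, 155.54, 224.20, 202.04, 153.04, 232.18, 125.44, 169.82,
     177.28, 162.82, 120.44, 191.10, 158.78, 178.50, 272.20, 48.90,
     104.56, 286.18, 83.72, 86.20, 133.58, 212.86, 122.02, 198.02]
y = [1412.08, 797.20, 872.48, 1003.42, 852.90, 288.48, 1288.46, 423.08,
     1351.74, 378.04, 918.08, 1627.24, 1204.78, 857.04, 775.00, 869.26,
     1396.00, 711.50, 1475.18, 1413.32, 849.14, 1313.84, 602.06, 642.14,
     1038.80, 697.00, 324.34, 965.10, 920.14, 950.90, 1670.32, 125.40,
     479.78, 2010.64, 368.36, 425.60, 626.90, 1316.94, 390.16, 1090.84]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
r = sxy / (sxx * syy) ** 0.5
print(f"sample correlation r = {r:.3f}")  # strongly positive, as the plot suggests
```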

Exercises for Section 11.1
CONCEPTS
11.1 When does the scatter plot of the values of a dependent variable y versus the values of an independent variable x suggest that the simple linear regression model

y = μy|x + ε
  = β0 + β1x + ε

might appropriately relate y to x?



11.2 In the simple linear regression model, what are y, μy|x, and ε?
11.3 In the simple linear regression model, define the meanings of the slope β1 and the y-intercept β0.
11.4 What is the difference between time series data and cross-sectional data?
METHODS AND APPLICATIONS
11.5 THE STARTING SALARY CASE

StartSal

The chairman of the marketing department at a large state university undertakes a study to relate
starting salary (y) after graduation for marketing majors to grade point average (GPA) in major
courses. To do this, records of seven recent marketing graduates are randomly selected.

Marketing Graduate   GPA, x   Starting Salary, y (Thousands of Dollars)
1                    3.26     33.8
2                    2.60     29.8
3                    3.35     33.5
4                    2.86     30.4
5                    3.82     36.4
6                    2.21     27.6
7                    3.47     35.3

[Figure: MINITAB scatter plot of starting salary (27 to 37) versus GPA (2 to 4).]

Using the scatter plot (from MINITAB) of y versus x, explain why the simple linear regression model

y = μy|x + ε
  = β0 + β1x + ε

might appropriately relate y to x.
11.6 THE STARTING SALARY CASE    StartSal

Consider the simple linear regression model describing the starting salary data of Exercise 11.5.
a  Explain the meaning of μy|x=4.00 = β0 + β1(4.00).
b  Explain the meaning of μy|x=2.50 = β0 + β1(2.50).
c  Interpret the meaning of the slope parameter β1.
d  Interpret the meaning of the y-intercept β0. Why does this interpretation fail to make practical sense?
e  The error term ε describes the effects of many factors on starting salary y. What are these factors? Give two specific examples.
11.7 THE SERVICE TIME CASE

SrvcTime

Accu-Copiers, Inc., sells and services the Accu-500 copying machine. As part of its standard
service contract, the company agrees to perform routine service on this copier. To obtain
information about the time it takes to perform routine service, Accu-Copiers has collected data for
11 service calls. The data are as follows:

Service Call   Number of Copiers Serviced, x   Number of Minutes Required, y
1              4                               109
2              2                                58
3              5                               138
4              7                               189
5              1                                37
6              3                                82
7              4                               103
8              5                               134
9              2                                68
10             4                               112
11             6                               154

[Figure: Excel scatter plot of Minutes (0 to 200) versus Copiers (0 to 8).]


Using the scatter plot (from Excel) of y versus x, discuss why the simple linear regression model
might appropriately relate y to x.

11.8 THE SERVICE TIME CASE    SrvcTime

Consider the simple linear regression model describing the service time data in Exercise 11.7.
a  Explain the meaning of μy|x=4 = β0 + β1(4).
b  Explain the meaning of μy|x=6 = β0 + β1(6).
c  Interpret the meaning of the slope parameter β1.
d  Interpret the meaning of the y-intercept β0. Does this interpretation make practical sense?
e  The error term ε describes the effects of many factors on service time. What are these factors? Give two specific examples.
11.9 THE FRESH DETERGENT CASE    Fresh

Enterprise Industries produces Fresh, a brand of liquid laundry detergent. In order to study the relationship between price and demand for the large bottle of Fresh, the company has gathered data concerning demand for Fresh over the last 30 sales periods (each sales period is four weeks). Here, for each sales period,

y = demand for the large bottle of Fresh (in hundreds of thousands of bottles) in the sales period
x1 = the price (in dollars) of Fresh as offered by Enterprise Industries in the sales period
x2 = the average industry price (in dollars) of competitors’ similar detergents in the sales period
x4 = x2 − x1 = the “price difference” in the sales period

Note: We denote the “price difference” as x4 (rather than, for example, x3) to be consistent with other notation to be introduced in the Fresh detergent case in Chapter 12.

Fresh Detergent Demand Data

Sales Period   x1     x2     x4 = x2 − x1    y
1              3.85   3.80   −.05            7.38
2              3.75   4.00    .25            8.51
3              3.70   4.30    .60            9.52
4              3.70   3.70    0              7.50
5              3.60   3.85    .25            9.33
6              3.60   3.80    .20            8.28
7              3.60   3.75    .15            8.75
8              3.80   3.85    .05            7.87
9              3.80   3.65   −.15            7.10
10             3.85   4.00    .15            8.00
11             3.90   4.10    .20            7.89
12             3.90   4.00    .10            8.15
13             3.70   4.10    .40            9.10
14             3.75   4.20    .45            8.86
15             3.75   4.10    .35            8.90
16             3.80   4.10    .30            8.87
17             3.70   4.20    .50            9.26
18             3.80   4.30    .50            9.00
19             3.70   4.10    .40            8.75
20             3.80   3.75   −.05            7.95
21             3.80   3.75   −.05            7.65
22             3.75   3.65   −.10            7.27
23             3.70   3.90    .20            8.00
24             3.55   3.65    .10            8.50
25             3.60   4.10    .50            8.75
26             3.65   4.25    .60            9.21
27             3.70   3.65   −.05            8.27
28             3.75   3.75    0              7.67
29             3.80   3.85    .05            7.93
30             3.70   4.25    .55            9.26

Using the scatter plot (from MINITAB) of y versus x4 shown below, discuss why the simple linear regression model might appropriately relate y to x4.

[Figure: MINITAB scatter plot of Demand, y (7.0 to 9.5) versus PriceDif, x4 (−0.2 to 0.6).]



11.10 THE FRESH DETERGENT CASE    Fresh

Consider the simple linear regression model relating demand, y, to the price difference, x4, and the Fresh demand data of Exercise 11.9.
a  Explain the meaning of μy|x4=.10 = β0 + β1(.10).
b  Explain the meaning of μy|x4=−.05 = β0 + β1(−.05).
c  Explain the meaning of the slope parameter β1.
d  Explain the meaning of the intercept β0. Does this explanation make practical sense?
e  What factors are represented by the error term in this model? Give two specific examples.

Direct Labor Cost Data (for Exercise 11.11)    DirLab

Direct Labor Cost, y ($100s)   Batch Size, x
71                             5
663                            62
381                            35
138                            12
861                            83
145                            14
493                            46
548                            52
251                            23
1024                           100
435                            41
772                            75

Real Estate Sales Price Data (for Exercise 11.13)    RealEst

Sales Price (y)   Home Size (x)
180               23
98.1              11
173.1             20
136.5             17
141               15
165.9             21
193.5             24
127.8             13
163.5             19
172.5             25

Source: Reprinted with permission from The Real Estate Appraiser and Analyst Spring 1986 issue. Copyright 1986 by the Appraisal Institute, Chicago, Illinois.
11.11 THE DIRECT LABOR COST CASE

DirLab

An accountant wishes to predict direct labor cost (y) on the basis of the batch size (x) of a product
produced in a job shop. Data for 12 production runs are given in the table in the margin.
a Construct a scatter plot of y versus x.
b Discuss whether the scatter plot suggests that a simple linear regression model might
appropriately relate y to x.
11.12 THE DIRECT LABOR COST CASE

DirLab

Consider the simple linear regression model describing the direct labor cost data of Exercise 11.11.
a Explain the meaning of my͉xϭ60 ϭ b0 ϩ b1(60).
b Explain the meaning of my͉xϭ30 ϭ b0 ϩ b1(30).
c Explain the meaning of the slope parameter b1.
d Explain the meaning of the intercept b0. Does this explanation make practical sense?
e What factors are represented by the error term in this model? Give two specific examples of
these factors.
11.13 THE REAL ESTATE SALES PRICE CASE


RealEst

A real estate agency collects data concerning y ϭ the sales price of a house (in thousands of
dollars), and x ϭ the home size (in hundreds of square feet). The data are given in the margin.
a Construct a scatter plot of y versus x.
b Discuss whether the scatter plot suggests that a simple linear regression model might
appropriately relate y to x.
11.14 THE REAL ESTATE SALES PRICE CASE

RealEst

Consider the simple linear regression model describing the sales price data of Exercise 11.13.
a Explain the meaning of my͉xϭ20 ϭ b0 ϩ b1(20).
b Explain the meaning of my͉xϭ18 ϭ b0 ϩ b1(18).
c Explain the meaning of the slope parameter b1.
d Explain the meaning of the intercept b0. Does this explanation make practical sense?
e What factors are represented by the error term in this model? Give two specific examples.

11.2 ■ The Least Squares Estimates, and Point Estimation
and Prediction

The true values of the y-intercept (β0) and slope (β1) in the simple linear regression model are
unknown. Therefore, it is necessary to use observed data to compute estimates of these regression
parameters. To see how this is done, we begin with a simple example.

EXAMPLE 11.3  The Fuel Consumption Case

Consider the fuel consumption problem of Example 11.1. The scatter plot of y (fuel consumption)
versus x (average hourly temperature) in Figure 11.1 suggests that the simple linear regression
model appropriately relates y to x. We now wish to use the data in Table 11.1 to estimate the
intercept β0 and the slope β1 of the line of means. To do this, it might be reasonable to estimate the
line of means by “fitting” the “best” straight line to the plotted data in Figure 11.1. But how do we fit the
best straight line? One approach would be to simply “eyeball” a line through the points. Then we
could read the y-intercept and slope off the visually fitted line and use these values as the estimates
of β0 and β1. For example, Figure 11.5 shows a line that has been visually fitted to the plot of the



F I G U R E 11.5  Visually Fitting a Line to the Fuel Consumption Data
[Scatter plot of the fuel consumption data with a visually fitted line. The line crosses the y axis at 15.
Its slope is computed from two points on the line: slope = change in y / change in x =
(12.8 − 13.8)/(20 − 10) = −.1. The deviation 12.4 − 12.2 = .2 for the first observation is shown
as a vertical line segment between the point and the line.]

F I G U R E 11.6  Using the Visually Fitted Line to Predict When x = 28
[The visually fitted line, with the prediction ŷ = 12.2 marked at x = 28.]

fuel consumption data. We see that this line intersects the y axis at y = 15. Therefore, the
y-intercept of the line is 15. In addition, the figure shows that the slope of the line is

slope = change in y / change in x = (12.8 − 13.8)/(20 − 10) = −1/10 = −.1

Therefore, based on the visually fitted line, we estimate that β0 is 15 and that β1 is −.1.
In order to evaluate how “good” our point estimates of β0 and β1 are, consider using the
visually fitted line to predict weekly fuel consumption. Denoting such a prediction as ŷ (pronounced
y hat), a prediction of weekly fuel consumption when average hourly temperature is x is

ŷ = 15 − .1x

For instance, when temperature is 28°F, predicted fuel consumption is

ŷ = 15 − .1(28) = 15 − 2.8 = 12.2

Here ŷ is simply the point on the visually fitted line corresponding to x = 28 (see Figure 11.6).
We can evaluate how well the visually determined line fits the points on the scatter plot by

T A B L E 11.3  Calculation of SSE for a Line Visually Fitted to the Fuel Consumption Data

 y      x      ŷ = 15 − .1x             y − ŷ                     (y − ŷ)²
12.4   28.0   15 − .1(28.0) = 12.2     12.4 − 12.2 = .2          (.2)² = .04
11.7   28.0   15 − .1(28.0) = 12.2     11.7 − 12.2 = −.5         (−.5)² = .25
12.4   32.5   15 − .1(32.5) = 11.75    12.4 − 11.75 = .65        (.65)² = .4225
10.8   39.0   15 − .1(39.0) = 11.1     10.8 − 11.1 = −.3         (−.3)² = .09
 9.4   45.9   15 − .1(45.9) = 10.41    9.4 − 10.41 = −1.01       (−1.01)² = 1.0201
 9.5   57.8   15 − .1(57.8) = 9.22     9.5 − 9.22 = .28          (.28)² = .0784
 8.0   58.1   15 − .1(58.1) = 9.19     8.0 − 9.19 = −1.19        (−1.19)² = 1.4161
 7.5   62.5   15 − .1(62.5) = 8.75     7.5 − 8.75 = −1.25        (−1.25)² = 1.5625

SSE = Σ(y − ŷ)² = .04 + .25 + · · · + 1.5625 = 4.8796
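The SSE calculation in Table 11.3 is easy to check with a few lines of code. The sketch below (plain Python; the variable names are ours, not the book’s) recomputes the squared deviations for the visually fitted line ŷ = 15 − .1x:

```python
# Fuel consumption data from Table 11.1:
# x = average hourly temperature (deg F), y = weekly fuel consumption (MMcf).
x = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]
y = [12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5]

# Predictions from the visually fitted line y-hat = 15 - .1x.
y_hat = [15 - 0.1 * xi for xi in x]

# SSE = sum of squared deviations between observed and predicted y values.
sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
print(round(sse, 4))  # 4.8796, matching Table 11.3
```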

comparing each observed value of y with the corresponding predicted value ŷ given by the fitted
line. We do this by computing the deviation y − ŷ. For instance, looking at the first observation
in Table 11.1 (page 447), we observed y = 12.4 and x = 28.0. Since the predicted fuel
consumption when x equals 28 is ŷ = 12.2, the deviation y − ŷ equals 12.4 − 12.2 = .2. This
deviation is illustrated in Figure 11.5. Table 11.3 gives the values of y, x, ŷ, and y − ŷ for each
observation in Table 11.1. The deviations (or prediction errors) are the vertical distances between
the observed y values and the predictions obtained using the fitted line; that is, they are
the line segments depicted in Figure 11.5.
If the visually determined line fits the data well, the deviations (errors) will be small. To obtain
an overall measure of the quality of the fit, we compute the sum of squared deviations or sum
of squared errors, denoted SSE. Table 11.3 also gives the squared deviations and the SSE for our
visually fitted line. We find that SSE = 4.8796.
Clearly, the line shown in Figure 11.5 is not the only line that could be fitted to the observed
fuel consumption data. Different people would obtain somewhat different visually fitted lines.
However, it can be shown that there is exactly one line that gives a value of SSE that is smaller
than the value of SSE that would be given by any other line that could be fitted to the data. This
line is called the least squares regression line or the least squares prediction equation. To
show how to find the least squares line, we first write the general form of a straight-line prediction
equation as

ŷ = b0 + b1x

Here b0 (pronounced b zero) is the y-intercept and b1 (pronounced b one) is the slope of the line.
In addition, ŷ denotes the predicted value of the dependent variable when the value of the
independent variable is x. Now suppose we have collected n observations (x1, y1), (x2, y2), . . . ,
(xn, yn). If we consider a particular observation (xi, yi), the predicted value of yi is

ŷi = b0 + b1xi

Furthermore, the prediction error (also called the residual) for this observation is

ei = yi − ŷi = yi − (b0 + b1xi)

Then the least squares line is the line that minimizes the sum of the squared prediction errors
(that is, the sum of squared residuals):

SSE = Σ (yi − (b0 + b1xi))², summed over i = 1, . . . , n

To find this line, we find the values of the y-intercept b0 and the slope b1 that minimize SSE. These
values of b0 and b1 are called the least squares point estimates of β0 and β1. Using calculus, it


can be shown that these estimates are calculated as follows:⁴

The Least Squares Point Estimates

For the simple linear regression model:

1 The least squares point estimate of the slope β1 is b1 = SSxy/SSxx, where

  SSxy = Σ(xi − x̄)(yi − ȳ) = Σxiyi − (Σxi)(Σyi)/n   and   SSxx = Σ(xi − x̄)² = Σxi² − (Σxi)²/n

2 The least squares point estimate of the y-intercept β0 is b0 = ȳ − b1x̄, where

  ȳ = Σyi/n   and   x̄ = Σxi/n

Here n is the number of observations (an observation is an observed value of x and its corresponding
value of y).

The following example illustrates how to calculate these point estimates and how to use these
point estimates to estimate mean values and predict individual values of the dependent variable.
Note that the quantities SSxy and SSxx used to calculate the least squares point estimates are also
used throughout this chapter to perform other important calculations.

EXAMPLE 11.4  The Fuel Consumption Case

Part 1: Calculating the least squares point estimates  Again consider the fuel consumption
problem. To compute the least squares point estimates of the regression parameters β0
and β1, we first calculate the following preliminary summations:
 yi      xi      xi²                     xiyi
12.4    28.0    (28.0)² = 784           (28.0)(12.4) = 347.2
11.7    28.0    (28.0)² = 784           (28.0)(11.7) = 327.6
12.4    32.5    (32.5)² = 1,056.25      (32.5)(12.4) = 403
10.8    39.0    (39.0)² = 1,521         (39.0)(10.8) = 421.2
 9.4    45.9    (45.9)² = 2,106.81      (45.9)(9.4) = 431.46
 9.5    57.8    (57.8)² = 3,340.84      (57.8)(9.5) = 549.1
 8.0    58.1    (58.1)² = 3,375.61      (58.1)(8.0) = 464.8
 7.5    62.5    (62.5)² = 3,906.25      (62.5)(7.5) = 468.75

Σyi = 81.7    Σxi = 351.8    Σxi² = 16,874.76    Σxiyi = 3,413.11

Using these summations, we calculate SSxy and SSxx as follows.

SSxy = Σxiyi − (Σxi)(Σyi)/n = 3,413.11 − (351.8)(81.7)/8 = −179.6475

SSxx = Σxi² − (Σxi)²/n = 16,874.76 − (351.8)²/8 = 1,404.355

⁴In order to simplify notation, we will often drop the limits on summations in this and subsequent chapters. That is,
instead of using the summation Σ with limits i = 1 to n, we will simply write Σ.
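These hand calculations can be reproduced with a short script. Below is a sketch in plain Python (our variable names) that computes SSxy, SSxx, and the least squares point estimates for the fuel consumption data:

```python
# Fuel consumption data (Table 11.1): x = temperature (deg F), y = fuel use (MMcf).
x = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]
y = [12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5]
n = len(x)

# Preliminary summations, as in Example 11.4.
sum_x, sum_y = sum(x), sum(y)
sum_xx = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Shortcut formulas for SSxy and SSxx.
ss_xy = sum_xy - sum_x * sum_y / n   # about -179.6475
ss_xx = sum_xx - sum_x ** 2 / n      # about 1,404.355

# Least squares point estimates of the slope and y-intercept.
b1 = ss_xy / ss_xx                   # about -.1279
b0 = sum_y / n - b1 * (sum_x / n)    # about 15.84
print(b1, b0)
```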


T A B L E 11.4  Calculation of SSE Obtained by Using the Least Squares Point Estimates

 yi      xi      ŷi = 15.84 − .1279xi             yi − ŷi = residual            (yi − ŷi)²
12.4    28.0    15.84 − .1279(28.0) = 12.2588    12.4 − 12.2588 = .1412        (.1412)² = .0199374
11.7    28.0    15.84 − .1279(28.0) = 12.2588    11.7 − 12.2588 = −.5588       (−.5588)² = .3122574
12.4    32.5    15.84 − .1279(32.5) = 11.68325   12.4 − 11.68325 = .71675      (.71675)² = .5137306
10.8    39.0    15.84 − .1279(39.0) = 10.8519    10.8 − 10.8519 = −.0519       (−.0519)² = .0026936
 9.4    45.9    15.84 − .1279(45.9) = 9.96939    9.4 − 9.96939 = −.56939       (−.56939)² = .324205
 9.5    57.8    15.84 − .1279(57.8) = 8.44738    9.5 − 8.44738 = 1.05262       (1.05262)² = 1.1080089
 8.0    58.1    15.84 − .1279(58.1) = 8.40901    8.0 − 8.40901 = −.40901       (−.40901)² = .1672892
 7.5    62.5    15.84 − .1279(62.5) = 7.84625    7.5 − 7.84625 = −.34625       (−.34625)² = .1198891

SSE = Σ(yi − ŷi)² = .0199374 + .3122574 + · · · + .1198891 = 2.5680112

It follows that the least squares point estimate of the slope β1 is

b1 = SSxy/SSxx = −179.6475/1,404.355 = −.1279

Furthermore, because

ȳ = Σyi/8 = 81.7/8 = 10.2125   and   x̄ = Σxi/8 = 351.8/8 = 43.98

the least squares point estimate of the y-intercept β0 is

b0 = ȳ − b1x̄ = 10.2125 − (−.1279)(43.98) = 15.84

Since b1 = −.1279, we estimate that mean weekly fuel consumption decreases (since b1 is
negative) by .1279 MMcf of natural gas when average hourly temperature increases by 1 degree.
Since b0 = 15.84, we estimate that mean weekly fuel consumption is 15.84 MMcf of natural gas
when average hourly temperature is 0°F. However, we have not observed any weeks with
temperatures near 0, so making this interpretation of b0 might be dangerous. We discuss this point
more fully after this example.
Table 11.4 gives predictions of fuel consumption for each observed week obtained by using
the least squares line (or prediction equation)

ŷ = b0 + b1x = 15.84 − .1279x

The table also gives each of the residuals and squared residuals and the sum of squared residuals
(SSE = 2.5680112) obtained by using this prediction equation. Notice that the SSE here, which
was obtained using the least squares point estimates, is smaller than the SSE of Table 11.3, which
was obtained using the visually fitted line ŷ = 15 − .1x. In general, it can be shown that the SSE
obtained by using the least squares point estimates is smaller than the value of SSE that would be
obtained by using any other estimates of β0 and β1. Figure 11.7(a) illustrates the eight observed
fuel consumptions (the dots in the figure) and the eight predicted fuel consumptions (the squares
in the figure) given by the least squares line. The distances between the observed and predicted
fuel consumptions are the residuals. Therefore, when we say that the least squares point estimates
minimize SSE, we are saying that these estimates position the least squares line so as to minimize
the sum of the squared distances between the observed and predicted fuel consumptions. In this
sense, the least squares line is the best straight line that can be fitted to the eight observed fuel
consumptions. Figure 11.7(b) gives the MINITAB output of this best fit line. Note that this output
gives the least squares estimates b0 = 15.8379 and b1 = −.127922. In general, we will rely
on MINITAB, Excel, and MegaStat to compute the least squares estimates (and to perform many
other regression calculations).

Part 2: Estimating a mean fuel consumption and predicting an individual fuel
consumption  We define the experimental region to be the range of the previously observed
values of the average hourly temperature x. Because we have observed average hourly
temperatures between 28°F and 62.5°F (see Table 11.4), the experimental region consists of the range of
average hourly temperatures from 28°F to 62.5°F. The simple linear regression model relates


F I G U R E 11.7  The Least Squares Line for the Fuel Consumption Data
(a) [Plot of the observed fuel consumptions (dots) and the predicted fuel consumptions (squares)
given by the least squares line ŷ = 15.84 − .1279x; the residual at x = 45.9 is shown as the
vertical distance between the observed and predicted values.]
(b) [MINITAB output of the best fit line for the fuel consumption data, FUELCONS versus TEMP:
Y = 15.8379 − 0.127922X, R-Squared = 0.899.]

weekly fuel consumption y to average hourly temperature x for values of x that are in the
experimental region. For such values of x, the least squares line is the estimate of the line of means.
This implies that the point on the least squares line that corresponds to the average hourly
temperature x,

ŷ = b0 + b1x = 15.84 − .1279x

is the point estimate of the mean of all the weekly fuel consumptions that could be observed when
the average hourly temperature is x:

μy|x = β0 + β1x

Note that ŷ is an intuitively logical point estimate of μy|x. This is because the expression b0 + b1x
used to calculate ŷ has been obtained from the expression β0 + β1x for μy|x by replacing the
unknown values of β0 and β1 by their least squares point estimates b0 and b1.
The quantity ŷ is also the point prediction of the individual value

y = β0 + β1x + ε

which is the amount of fuel consumed in a single week when average hourly temperature equals
x. To understand why ŷ is the point prediction of y, note that y is the sum of the mean β0 + β1x
and the error term ε. We have already seen that ŷ = b0 + b1x is the point estimate of β0 + β1x.
We will now reason that we should predict the error term ε to be 0, which implies that ŷ is also
the point prediction of y. To see why we should predict the error term to be 0, note that in the next
section we discuss several assumptions concerning the simple linear regression model. One
implication of these assumptions is that the error term has a 50 percent chance of being positive and
a 50 percent chance of being negative. Therefore, it is reasonable to predict the error term to be
0 and to use ŷ as the point prediction of a single value of y when the average hourly temperature
equals x.
Now suppose a weather forecasting service predicts that the average hourly temperature in the
next week will be 40°F. Because 40°F is in the experimental region,

ŷ = 15.84 − .1279(40) = 10.72 MMcf of natural gas

is (1) the point estimate of the mean weekly fuel consumption when the average hourly
temperature is 40°F and (2) the point prediction of an individual weekly fuel consumption when the


F I G U R E 11.8  Point Estimation and Point Prediction in the Fuel Consumption Problem
[Graph of the least squares line ŷ = 15.84 − .1279x and the true line of means μy|x = β0 + β1x,
showing at x = 40 the point estimate of mean fuel consumption (ŷ = 10.72, a square on the least
squares line), the true mean fuel consumption (a triangle on the line of means), and an individual
value of fuel consumption (a dot).]

F I G U R E 11.9  The Danger of Extrapolation Outside the Experimental Region
[Graph of the least squares line ŷ = 15.84 − .1279x extrapolated below the experimental region
(28 to 62.5). The relationship between mean fuel consumption and x might become curved at low
temperatures, so the estimated mean fuel consumption at x = −10 obtained by extrapolating the
least squares line can differ badly from the true mean fuel consumption at x = −10.]

average hourly temperature is 40°F. This says that (1) we estimate that the average of all possible weekly fuel consumptions that could potentially be observed when the average hourly
temperature is 40°F equals 10.72 MMcf of natural gas, and (2) we predict that the fuel consumption in a single week when the average hourly temperature is 40°F will be 10.72 MMcf of

natural gas.
Figure 11.8 illustrates (1) the point estimate of mean fuel consumption when x is 40°F
(the square on the least squares line), (2) the true mean fuel consumption when x is 40°F (the



triangle on the true line of means), and (3) an individual value of fuel consumption when x is
40°F (the dot in the figure). Of course this figure is only hypothetical. However, it illustrates that
the point estimate of the mean value of y (which is also the point prediction of the individual value
of y) will (unless we are extremely fortunate) differ from both the true mean value of y and the
individual value of y. Therefore, it is very likely that the point prediction ŷ = 10.72, which is the
natural gas company’s transmission nomination for next week, will differ from next week’s
actual fuel consumption, y. It follows that we might wish to predict the largest and smallest that y
might reasonably be. We will see how to do this in Section 11.5.
To conclude this example, note that Figure 11.9 illustrates the potential danger of using the
least squares line to predict outside the experimental region. In the figure, we extrapolate the least
squares line far beyond the experimental region to obtain a prediction for a temperature of
−10°F. As shown in Figure 11.1, for values of x in the experimental region the observed values
of y tend to decrease in a straight-line fashion as the values of x increase. However, for temperatures
lower than 28°F the relationship between y and x might become curved. If it does, extrapolating
the straight-line prediction equation to obtain a prediction for x = −10 might badly
underestimate mean weekly fuel consumption (see Figure 11.9).
The previous example illustrates that when we are using a least squares regression line, we
should not estimate a mean value or predict an individual value unless the corresponding value
of x is in the experimental region, that is, the range of the previously observed values of x. Often the
value x = 0 is not in the experimental region. For example, consider the fuel consumption problem.
Figure 11.9 illustrates that the average hourly temperature 0°F is not in the experimental
region. In such a situation, it would not be appropriate to interpret the y-intercept b0 as the
estimate of the mean value of y when x equals 0. For example, in the fuel consumption problem it
would not be appropriate to use b0 = 15.84 as the point estimate of the mean weekly fuel
consumption when average hourly temperature is 0. Therefore, because it is not meaningful to
interpret the y-intercept in many regression situations, we often omit such interpretations.
We now present a general procedure for estimating a mean value and predicting an individual value:

Point Estimation and Point Prediction in Simple Linear Regression

Let b0 and b1 be the least squares point estimates of the y-intercept β0 and the slope β1 in the simple
linear regression model, and suppose that x0, a specified value of the independent variable x, is inside
the experimental region. Then

ŷ = b0 + b1x0

is the point estimate of the mean value of the dependent variable when the value of the independent
variable is x0. In addition, ŷ is the point prediction of an individual value of the dependent variable
when the value of the independent variable is x0. Here we predict the error term to be 0.
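As a small illustration, the boxed procedure can be coded directly. In the sketch below (plain Python; the explicit guard against leaving the experimental region is our addition, motivated by the discussion of Figure 11.9), the same value ŷ serves as both the point estimate and the point prediction:

```python
def point_estimate(x0, b0, b1, x_min, x_max):
    """Return y-hat = b0 + b1*x0, refusing to extrapolate outside
    the experimental region [x_min, x_max]."""
    if not (x_min <= x0 <= x_max):
        raise ValueError(f"x0 = {x0} lies outside the experimental region "
                         f"[{x_min}, {x_max}]; refusing to extrapolate")
    return b0 + b1 * x0

# Fuel consumption case: the experimental region is 28 to 62.5 deg F.
y_hat = point_estimate(40, b0=15.84, b1=-0.1279, x_min=28.0, x_max=62.5)
print(round(y_hat, 2))  # 10.72 MMcf, as computed in the text
```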

EXAMPLE 11.5  The QHIC Case

Consider the simple linear regression model relating yearly home upkeep expenditure, y, to home
value, x. Using the data in Table 11.2 (page 453), we can calculate the least squares point
estimates of the y-intercept β0 and the slope β1 to be b0 = −348.3921 and b1 = 7.2583. Since
b1 = 7.2583, we estimate that mean yearly upkeep expenditure increases by $7.26 for each
additional $1,000 increase in home value. Consider a home worth $220,000, and note that x0 =
220 is in the range of previously observed values of x: 48.9 to 286.18 (see Table 11.2). It follows
that

ŷ = b0 + b1x0 = −348.3921 + 7.2583(220) = 1,248.43 (or $1,248.43)

is the point estimate of the mean yearly upkeep expenditure for all homes worth $220,000 and is
the point prediction of a yearly upkeep expenditure for an individual home worth $220,000.



The marketing department at QHIC wishes to determine which homes should be sent advertising
brochures promoting QHIC’s products and services. The prediction equation ŷ = b0 + b1x
implies that the home value x corresponding to a predicted upkeep expenditure of ŷ is

x = (ŷ − b0)/b1 = (ŷ − (−348.3921))/7.2583 = (ŷ + 348.3921)/7.2583

Therefore, for example, if QHIC wishes to send an advertising brochure to any home that has a
predicted upkeep expenditure of at least $500, then QHIC should send this brochure to any home
that has a value of at least

x = (500 + 348.3921)/7.2583 = 116.886 ($116,886)
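QHIC’s cutoff can also be computed by inverting the prediction equation in code. A sketch (plain Python; the function name is ours):

```python
b0, b1 = -348.3921, 7.2583  # QHIC least squares point estimates (Example 11.5)

def home_value_for_predicted_upkeep(y_hat):
    """Invert y-hat = b0 + b1*x to get the home value x (in $1,000s)
    whose predicted yearly upkeep expenditure equals y_hat (in dollars)."""
    return (y_hat - b0) / b1

threshold = home_value_for_predicted_upkeep(500)
print(round(threshold, 3))  # 116.886, i.e., homes worth at least $116,886
```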

Exercises for Section 11.2
CONCEPTS
11.15 What does SSE measure?
11.16 What is the least squares regression line, and what are the least squares point estimates?
11.17 How do we obtain a point estimate of the mean value of the dependent variable and a point
prediction of an individual value of the dependent variable?


11.18 Why is it dangerous to extrapolate outside the experimental region?
METHODS AND APPLICATIONS
Exercises 11.19, 11.20, and 11.21 are based on the following MINITAB and Excel output. At the left is
the output obtained when MINITAB is used to fit a least squares line to the starting salary data given in
Exercise 11.5 (page 454). In the middle is the output obtained when Excel is used to fit a least squares line
to the service time data given in Exercise 11.7 (page 454). The rightmost output is obtained when
MINITAB is used to fit a least squares line to the Fresh detergent demand data given in Exercise 11.9
(page 455).

[Regression plot of the starting salary data (StartSal versus GPA): Y = 14.8156 + 5.70657X, R-Sq = 0.977]

[Copiers line fit plot of the service time data (Minutes versus Copiers): Y = 11.4641 + 24.6022X]

[Regression plot of the Fresh demand data (Demand versus PriceDif): Y = 7.81409 + 2.66522X, R-Sq = 0.792]

11.19 THE STARTING SALARY CASE    StartSal

Using the leftmost output
a Identify and interpret the least squares point estimates b0 and b1. Does the interpretation of b0
make practical sense?
b Use the least squares line to obtain a point estimate of the mean starting salary for all marketing graduates having a grade point average of 3.25 and a point prediction of the starting salary
for an individual marketing graduate having a grade point average of 3.25.



11.20 THE SERVICE TIME CASE

SrvcTime

Using the middle output
a Identify and interpret the least squares point estimates b0 and b1. Does the interpretation of b0
make practical sense?
b Use the least squares line to obtain a point estimate of the mean time to service four copiers
and a point prediction of the time to service four copiers on a single call.
11.21 THE FRESH DETERGENT CASE

Fresh

Using the rightmost output
a Identify and interpret the least squares point estimates b0 and b1. Does the interpretation of b0
make practical sense?
b Use the least squares line to obtain a point estimate of the mean demand in all sales periods
when the price difference is .10 and a point prediction of the actual demand in an individual
sales period when the price difference is .10.
c If Enterprise Industries wishes to maintain a price difference that corresponds to a
predicted demand of 850,000 bottles (that is, ŷ = 8.5), what should this price
difference be?

11.22 THE DIRECT LABOR COST CASE    DirLab

Consider the direct labor cost data given in Exercise 11.11 (page 456), and suppose that a simple
linear regression model is appropriate.
a Verify that b0 = 18.4880 and b1 = 10.1463 by using the formulas illustrated in Example 11.4
(pages 459–460).
b Interpret the meanings of b0 and b1. Does the interpretation of b0 make practical sense?
c Write the least squares prediction equation.
d Use the least squares line to obtain a point estimate of the mean direct labor cost for all
batches of size 60 and a point prediction of the direct labor cost for an individual batch of
size 60.

11.23 THE REAL ESTATE SALES PRICE CASE    RealEst

Consider the sales price data given in Exercise 11.13 (page 456), and suppose that a simple linear
regression model is appropriate.
a Verify that b0 = 48.02 and b1 = 5.7003 by using the formulas illustrated in Example 11.4
(pages 459–460).
b Interpret the meanings of b0 and b1. Does the interpretation of b0 make practical sense?
c Write the least squares prediction equation.
d Use the least squares line to obtain a point estimate of the mean sales price of all houses
having 2,000 square feet and a point prediction of the sales price of an individual house
having 2,000 square feet.
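The “verify” parts of Exercises 11.22 and 11.23 can be checked with the shortcut formulas from Example 11.4. Below is a sketch (plain Python; the helper function and variable names are ours):

```python
def least_squares(x, y):
    """Least squares point estimates (b0, b1) via the shortcut formulas."""
    n = len(x)
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    b1 = ss_xy / ss_xx
    b0 = sum(y) / n - b1 * sum(x) / n
    return b0, b1

# Direct labor cost data (Exercise 11.11): x = batch size, y = cost ($100s).
x_lab = [5, 62, 35, 12, 83, 14, 46, 52, 23, 100, 41, 75]
y_lab = [71, 663, 381, 138, 861, 145, 493, 548, 251, 1024, 435, 772]
print(least_squares(x_lab, y_lab))  # roughly (18.49, 10.15)

# Real estate data (Exercise 11.13): x = size (100s sq ft), y = price ($1,000s).
x_re = [23, 11, 20, 17, 15, 21, 24, 13, 19, 25]
y_re = [180, 98.1, 173.1, 136.5, 141, 165.9, 193.5, 127.8, 163.5, 172.5]
print(least_squares(x_re, y_re))    # roughly (48.02, 5.70)
```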

11.3 ■ Model Assumptions and the Standard Error

Model assumptions  In order to perform hypothesis tests and set up various types of intervals
when using the simple linear regression model

y = μy|x + ε = β0 + β1x + ε

we need to make certain assumptions about the error term ε. At any given value of x, there is a
population of error term values that could potentially occur. These error term values describe the
different potential effects on y of all factors other than the value of x. Therefore, these error term
values explain the variation in the y values that could be observed when the independent variable
is x. Our statement of the simple linear regression model assumes that μy|x, the mean of the
population of all y values that could be observed when the independent variable is x, is β0 + β1x.
This model also implies that ε = y − (β0 + β1x), so this is equivalent to assuming that the mean
of the corresponding population of potential error term values is 0. In total, we make four
assumptions, called the regression assumptions, about the simple linear regression model.




These assumptions can be stated in terms of potential y values or, equivalently, in terms of
potential error term values. Following tradition, we begin by stating these assumptions in terms
of potential error term values:

The Regression Assumptions

1 At any given value of x, the population of potential error term values has a mean equal to 0.

2 Constant Variance Assumption
At any given value of x, the population of potential error term values has a variance that
does not depend on the value of x. That is, the different populations of potential error term
values corresponding to different values of x have equal variances. We denote the constant
variance as σ².

3 Normality Assumption
At any given value of x, the population of potential error term values has a normal distribution.

4 Independence Assumption
Any one value of the error term ε is statistically independent of any other value of ε. That is, the
value of the error term ε corresponding to an observed value of y is statistically independent
of the value of the error term corresponding to any other observed value of y.

Taken together, the first three assumptions say that, at any given value of x, the population of potential error term values is normally distributed with mean zero and a variance σ² that does not depend on the value of x. Because the potential error term values cause the variation in the potential y values, these assumptions imply that the population of all y values that could be observed when the independent variable is x is normally distributed with mean β0 + β1x and a variance σ² that does not depend on x. These three assumptions are illustrated in Figure 11.10 in the context of the fuel consumption problem. Specifically, this figure depicts the populations of weekly fuel consumptions corresponding to two values of average hourly temperature: 32.5 and 45.9. Note that these populations are shown to be normally distributed with different means (each of which is on the line of means) and with the same variance (or spread).
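These equal-spread normal populations are easy to mimic in a short simulation sketch. The values of β0, β1, and σ below are hypothetical illustrative choices, not the fuel consumption estimates; only the two x values come from Figure 11.10.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical parameter values, for illustration only
beta0, beta1, sigma = 10.0, -0.1, 0.6

for x in (32.5, 45.9):
    # Population of potential y values at this x: normal, centered on the
    # line of means beta0 + beta1*x, with the same variance sigma^2 at every x
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=100_000)
    print(f"x = {x}: mean(y) = {y.mean():.3f} "
          f"(line of means = {beta0 + beta1 * x:.3f}), sd(y) = {y.std():.3f}")
```

The printed means fall on the line of means and the two standard deviations agree, which is exactly the picture of Figure 11.10: different centers, identical spread.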
The independence assumption is most likely to be violated when time series data are being utilized in a regression study. Intuitively, this assumption says that there is no pattern of positive error terms being followed (in time) by other positive error terms, and there is no pattern of positive error terms being followed by negative error terms. That is, there is no pattern of higher-than-average y values being followed by other higher-than-average y values, and there is no pattern of higher-than-average y values being followed by lower-than-average y values.

It is important to point out that the regression assumptions very seldom, if ever, hold exactly in any practical regression problem. However, it has been found that regression results are not extremely sensitive to mild departures from these assumptions. In practice, only pronounced

FIGURE 11.10  An Illustration of the Model Assumptions

[Figure: normal curves centered on the line of means μy|x = β0 + β1x, showing the population of y values when x = 32.5 (observed value y = 12.4) and the population of y values when x = 45.9 (observed value y = 9.4). The two populations have different means, each on the line of means, but the same spread.]



departures from these assumptions require attention. In optional Section 11.8 we show how to
check the regression assumptions. Prior to doing this, we will suppose that the assumptions are
valid in our examples.
In Section 11.2 we stated that, when we predict an individual value of the dependent variable,
we predict the error term to be 0. To see why we do this, note that the regression assumptions
state that, at any given value of the independent variable, the population of all error term values
that can potentially occur is normally distributed with a mean equal to 0. Since we also assume
that successive error terms (observed over time) are statistically independent, each error term has
a 50 percent chance of being positive and a 50 percent chance of being negative. Therefore, it is
reasonable to predict any particular error term value to be 0.
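The 50/50 claim is a direct consequence of the symmetric, mean-zero normal distribution assumed for the error terms, which a quick simulation confirms (the value of σ used here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Mean-zero normal error terms; the choice of sigma is arbitrary
errors = rng.normal(loc=0.0, scale=0.65, size=1_000_000)

# By symmetry about 0, about half of all potential error terms are positive
print(f"fraction positive = {(errors > 0).mean():.4f}")
```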
The mean square error and the standard error

To present statistical inference formulas in later sections, we need to be able to compute point estimates of σ² and σ, the constant variance and standard deviation of the error term populations. The point estimate of σ² is called the mean square error and the point estimate of σ is called the standard error. In the following box, we show how to compute these estimates:

The Mean Square Error and the Standard Error

If the regression assumptions are satisfied and SSE is the sum of squared residuals, then:

1  The point estimate of σ² is the mean square error

   s² = SSE/(n − 2)

2  The point estimate of σ is the standard error

   s = √( SSE/(n − 2) )

In order to understand these point estimates, recall that σ² is the variance of the population of y values (for a given value of x) around the mean value μy|x. Because ŷ is the point estimate of this mean, it seems natural to use

SSE = Σ (yi − ŷi)²

to help construct a point estimate of σ². We divide SSE by n − 2 because it can be proven that doing so makes the resulting s² an unbiased point estimate of σ². Here we call n − 2 the number of degrees of freedom associated with SSE.

EXAMPLE 11.6  The Fuel Consumption Case

Consider the fuel consumption situation, and recall that in Table 11.4 (page 460) we have calculated the sum of squared residuals to be SSE = 2.568. It follows, because we have observed n = 8 fuel consumptions, that the point estimate of σ² is the mean square error

s² = SSE/(n − 2) = 2.568/(8 − 2) = .428

This implies that the point estimate of σ is the standard error

s = √s² = √.428 = .6542

As another example, it can be verified that the standard error for the simple linear regression model describing the QHIC data is s = 146.8970.
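The arithmetic of Example 11.6 can be reproduced in a few lines; SSE and n are taken directly from the example.

```python
import math

sse = 2.568  # sum of squared residuals from Table 11.4
n = 8        # number of observed fuel consumptions

mse = sse / (n - 2)  # mean square error: point estimate of sigma^2
s = math.sqrt(mse)   # standard error: point estimate of sigma

print(f"s^2 = {mse:.3f}, s = {s:.4f}")  # s^2 = 0.428, s = 0.6542
```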
To conclude this section, note that in optional Section 11.9 we present a shortcut formula for
calculating SSE. The reader may study Section 11.9 now or at any later point.




Exercises for Section 11.3
CONCEPTS
11.24 What four assumptions do we make about the simple linear regression model?
11.25 What is estimated by the mean square error, and what is estimated by the standard error?

METHODS AND APPLICATIONS
11.26 THE STARTING SALARY CASE  StartSal
Refer to the starting salary data of Exercise 11.5 (page 454). Given that SSE = 1.438, calculate s² and s.

11.27 THE SERVICE TIME CASE  SrvcTime
Refer to the service time data in Exercise 11.7 (page 454). Given that SSE = 191.70166, calculate s² and s.

11.28 THE FRESH DETERGENT CASE  Fresh
Refer to the Fresh detergent data of Exercise 11.9 (page 455). Given that SSE = 2.8059, calculate s² and s.

11.29 THE DIRECT LABOR COST CASE  DirLab
Refer to the direct labor cost data of Exercise 11.11 (page 456). Given that SSE = 747, calculate s² and s.

11.30 THE REAL ESTATE SALES PRICE CASE  RealEst
Refer to the sales price data of Exercise 11.13 (page 456). Given that SSE = 896.8, calculate s² and s.

11.31 Ten sales regions of equal sales potential for a company were randomly selected. The advertising expenditures (in units of $10,000) in these 10 sales regions were purposely set during July of last year at, respectively, 5, 6, 7, 8, 9, 10, 11, 12, 13, and 14. The sales volumes (in units of $10,000) were then recorded for the 10 sales regions and found to be, respectively, 89, 87, 98, 110, 103, 114, 116, 110, 126, and 130. Assuming that the simple linear regression model is appropriate, it can be shown that b0 = 66.2121, b1 = 4.4303, and SSE = 222.8242. Calculate s² and s.  SalesVol
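The values reported in Exercise 11.31 can be checked directly with the least squares formulas from Section 11.2; this sketch verifies the given numbers and then computes s² and s.

```python
import numpy as np

# Data from Exercise 11.31
x = np.array([5, 6, 7, 8, 9, 10, 11, 12, 13, 14], dtype=float)  # advertising
y = np.array([89, 87, 98, 110, 103, 114, 116, 110, 126, 130], dtype=float)  # sales

n = len(x)
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
sse = ((y - (b0 + b1 * x)) ** 2).sum()  # sum of squared residuals

s2 = sse / (n - 2)  # mean square error
s = s2 ** 0.5       # standard error

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, SSE = {sse:.4f}")
print(f"s^2 = {s2:.4f}, s = {s:.4f}")
```

The first line of output reproduces b0 = 66.2121, b1 = 4.4303, and SSE = 222.8242 as stated in the exercise.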

11.4 ■ Testing the Significance of the Slope and y Intercept

Testing the significance of the slope   A simple linear regression model is not likely to be useful unless there is a significant relationship between y and x. In order to judge the significance of the relationship between y and x, we test the null hypothesis

H0: β1 = 0

which says that there is no change in the mean value of y associated with an increase in x, versus the alternative hypothesis

Ha: β1 ≠ 0

which says that there is a (positive or negative) change in the mean value of y associated with an increase in x. It would be reasonable to conclude that x is significantly related to y if we can be quite certain that we should reject H0 in favor of Ha.
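Whether we can reject H0 hinges on how much the least squares estimate b1 varies from one sample to another. A small simulation sketches this sampling variability; the true slope, intercept, error standard deviation, and x values below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Hypothetical true model and fixed x values (n = 8 observations per sample)
beta0, beta1, sigma = 10.0, -0.1, 0.6
x = np.array([28.0, 30.0, 32.5, 39.0, 45.9, 52.0, 58.0, 62.5])

def slope_estimate(x, y):
    """Least squares point estimate b1 of the slope."""
    return ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()

# Each simulated sample of 8 y values yields a different estimate b1
b1_estimates = np.array([
    slope_estimate(x, beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size))
    for _ in range(5_000)
])

print(f"mean of b1 estimates = {b1_estimates.mean():.4f} (true slope = {beta1})")
print(f"sd of b1 estimates   = {b1_estimates.std():.4f}")
```

The estimates scatter around the true slope; a significance test for β1 must account for exactly this sample-to-sample variability.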
In order to test these hypotheses, recall that we compute the least squares point estimate b1 of the true slope β1 by using a sample of n observed values of the dependent variable y. A different sample of n observed y values would yield a different least squares point estimate b1. For example, consider the fuel consumption problem, and recall that we have observed eight average hourly temperatures. Corresponding to each temperature there is a (theoretically) infinite population of fuel consumptions that could potentially be observed at that temperature [see Table 11.5(a)]. Sample 1 in Table 11.5(b) is the sample of eight fuel consumptions that we have actually observed from these populations (these are the same fuel consumptions originally given

