

Bruce L. Bowerman
Miami University

Richard T. O’Connell
Miami University

Emily S. Murphree
Miami University

Business Statistics in Practice
Using Modeling, Data, and Analytics
EIGHTH EDITION

with major contributions by

Steven C. Huchendorf
University of Minnesota

Dawn C. Porter

University of Southern California

Patrick J. Schur
Miami University

bow49461_fm_i–xxi.indd 1

20/11/15 4:06 pm



BUSINESS STATISTICS IN PRACTICE: USING DATA, MODELING, AND ANALYTICS, EIGHTH EDITION
Published by McGraw-Hill Education, 2 Penn Plaza, New York, NY 10121. Copyright © 2017 by McGraw-Hill
Education. All rights reserved. Printed in the United States of America. Previous editions © 2014, 2011, and
2009. No part of this publication may be reproduced or distributed in any form or by any means, or stored in a
database or retrieval system, without the prior written consent of McGraw-Hill Education, including, but not
limited to, in any network or other electronic storage or transmission, or broadcast for distance learning.
Some ancillaries, including electronic and print components, may not be available to customers outside the
United States.
This book is printed on acid-free paper.
1 2 3 4 5 6 7 8 9 0 DOW/DOW 1 0 9 8 7 6
ISBN 978-1-259-54946-5
MHID 1-259-54946-1
Senior Vice President, Products & Markets: Kurt L. Strand
Vice President, General Manager, Products & Markets: Marty Lange
Vice President, Content Design & Delivery: Kimberly Meriwether David
Managing Director: James Heine
Senior Brand Manager: Dolly Womack
Director, Product Development: Rose Koos
Product Developer: Camille Corum
Marketing Manager: Britney Hermsen
Director of Digital Content: Doug Ruby
Digital Product Developer: Tobi Philips
Director, Content Design & Delivery: Linda Avenarius
Program Manager: Mark Christianson
Content Project Managers: Harvey Yep (Core) / Bruce Gin (Digital)
Buyer: Laura M. Fuller
Design: Srdjan Savanovic
Content Licensing Specialists: Ann Marie Jannette (Image) / Beth Thole (Text)
Cover Image: ©Sergei Popov, Getty Images and ©teekid, Getty Images
Compositor: MPS Limited

Printer: R. R. Donnelley
All credits appearing on page or at the end of the book are considered to be an extension of the copyright page.
Library of Congress Control Number: 2015956482
The Internet addresses listed in the text were accurate at the time of publication. The inclusion of a website does
not indicate an endorsement by the authors or McGraw-Hill Education, and McGraw-Hill Education does not
guarantee the accuracy of the information presented at these sites.
    

www.mhhe.com



ABOUT THE AUTHORS
Bruce L. Bowerman   Bruce L. Bowerman is emeritus professor of information systems and analytics at Miami University in Oxford, Ohio. He received his Ph.D. degree in statistics from Iowa State University in 1974, and he has over 40 years of experience teaching basic statistics, regression analysis, time series forecasting, survey sampling, and design of experiments to both undergraduate and graduate students. In 1987 Professor Bowerman received an Outstanding Teaching award from the Miami University senior class, and in 1992 he received an Effective Educator award from the Richard T. Farmer School of Business Administration. Together with Richard T. O’Connell, Professor Bowerman has written 23 textbooks. These include Forecasting, Time Series, and Regression: An Applied Approach (also coauthored with Anne B. Koehler); Linear Statistical Models: An Applied Approach; Regression Analysis: Unified Concepts, Practical Applications, and Computer Implementation (also coauthored with Emily S. Murphree); and Experimental Design: Unified Concepts, Practical Applications, and Computer Implementation (also coauthored with Emily S. Murphree). The first edition of Forecasting and Time Series earned an Outstanding Academic Book award from Choice magazine. Professor Bowerman has also published a number of articles in applied stochastic processes, time series forecasting, and statistical education. In his spare time, Professor Bowerman enjoys watching movies and sports, playing tennis, and designing houses.
Richard T. O’Connell   Richard T. O’Connell is emeritus professor of information systems and analytics at Miami University in Oxford, Ohio. He has more than 35 years of experience teaching basic statistics, statistical quality control and process improvement, regression analysis, time series forecasting, and design of experiments to both undergraduate and graduate business students. He also has extensive consulting experience and has taught workshops dealing with statistical process control and process improvement for a variety of companies in the Midwest. In 2000 Professor O’Connell received an Effective Educator award from the Richard T. Farmer School of Business Administration. Together with Bruce L. Bowerman, he has written 23 textbooks. These include Forecasting, Time Series, and Regression: An Applied Approach (also coauthored with Anne B. Koehler); Linear Statistical Models: An Applied Approach; Regression Analysis: Unified Concepts, Practical Applications, and Computer Implementation (also coauthored with Emily S. Murphree); and Experimental Design: Unified Concepts, Practical Applications, and Computer Implementation (also coauthored with Emily S. Murphree). Professor O’Connell has published a number of articles in the area of innovative statistical education. He is one of the first college instructors in the United States to integrate statistical process control and process improvement methodology into his basic business statistics course. He (with Professor Bowerman) has written several articles advocating this approach. He has also given presentations on this subject at meetings such as the Joint Statistical Meetings of the American Statistical Association and the Workshop on Total Quality Management: Developing Curricula and Research Agendas (sponsored by the Production and Operations Management Society). Professor O’Connell received an M.S. degree in decision sciences from Northwestern University in 1973. In his spare time, Professor O’Connell enjoys fishing, collecting 1950s and 1960s rock music, and following the Green Bay Packers and Purdue University sports.
Emily S. Murphree   Emily S. Murphree is emerita professor of statistics at Miami University in Oxford, Ohio. She received her Ph.D. degree in statistics from the University of North Carolina and does research in applied probability. Professor Murphree received Miami’s College of Arts and Science Distinguished Educator Award in 1998. In 1996, she was named one of Oxford’s Citizens of the Year for her work with Habitat for Humanity and for organizing annual Sonia Kovalevsky Mathematical Sciences Days for area high school girls. In 2012 she was recognized as “A Teacher Who Made a Difference” by the University of Kentucky.



AUTHORS’ PREVIEW
Business Statistics in Practice: Using Data, Modeling, and Analytics, Eighth Edition, provides a unique and flexible framework for teaching the introductory course in business statistics. This framework features:

• A new theme of statistical modeling introduced in Chapter 1 and used throughout the text.

• A substantial and innovative presentation of business analytics and data mining that provides instructors with a choice of different teaching options.

• Improved and easier to understand discussions of probability, probability modeling, traditional statistical inference, and regression and time series modeling.

• Continuing case studies that facilitate student learning by presenting new concepts in the context of familiar situations.

• Business improvement conclusions—highlighted in yellow and designated by BI icons in the page margins—that explicitly show how statistical analysis leads to practical business decisions.

• Use of Excel (including the Excel add-in MegaStat) and Minitab to carry out traditional statistical analysis and descriptive analytics, and use of JMP and the Excel add-in XLMiner to carry out predictive analytics.

• Many new exercises, with increased emphasis on students doing complete statistical analyses on their own.

We now discuss how these features are implemented in the book's 18 chapters.

Chapters 1, 2, and 3: Introductory concepts and statistical modeling. Graphical and numerical descriptive methods. In Chapter 1 we discuss data, variables, populations, and how to select random and other types of samples (a topic formerly discussed in Chapter 7). A new section introduces statistical modeling by defining what a statistical model is and by using The Car Mileage Case to preview specifying a normal probability model describing the mileages obtained by a new midsize car model (see Figure 1.6):

[Figure 1.6, "A Histogram of the 50 Mileages and the Normal Probability Curve": panel (a) shows a histogram of the 50 mileages (percent versus mpg, roughly 29.5 to 33.5 mpg); panel (b) shows the bell-shaped normal probability curve. A margin note reads: "The exact reasoning behind and meaning of this statement is given in Chapter 8, which discusses confidence intervals."]

The new section's discussion, excerpted from Section 1.4 (Random Sampling, Three Case Studies That Illustrate Statistical Inference), reads in part:

...all mileages achieved by the new midsize cars, the population histogram would look "bell-shaped." This leads us to "smooth out" the sample histogram and represent the population of all mileages by the bell-shaped probability curve in Figure 1.6(b). One type of bell-shaped probability curve is a graph of what is called the normal probability distribution (or normal probability model), which is discussed in Chapter 6. Therefore, we might conclude that the statistical model describing the sample of 50 mileages in Table 1.7 states that this sample has been (approximately) randomly selected from a population of car mileages that is described by a normal probability distribution. We will see in Chapters 7 and 8 that this statistical model and probability theory allow us to conclude that we are "95 percent" confident that the sampling error in estimating the population mean mileage by the sample mean mileage is no more than .23 mpg. Because we have seen in Example 1.4 that the mean of the sample of n = 50 mileages in Table 1.7 is 31.56 mpg, this implies that we are 95 percent confident that the true population mean EPA combined mileage for the new midsize model is between 31.56 - .23 = 31.33 mpg and 31.56 + .23 = 31.79 mpg. Because we are 95 percent confident that the population mean EPA combined mileage is at least 31.33 mpg, we have strong statistical evidence that this not only meets, but slightly exceeds, the tax credit standard of 31 mpg and thus that the new midsize model deserves the tax credit.

Throughout this book we will encounter many situations where we wish to make a statistical inference about one or more populations by using sample data. Whenever we make assumptions about how the sample data are selected and about the population(s) from which the sample data are selected, we are specifying a statistical model that will lead to making what we hope are valid statistical inferences. In Chapters 13, 14, and 15 these models become complex and not only specify the probability distributions describing the sampled populations but also specify how the means of the sampled populations are related to each other through one or more predictor variables. For example, we might relate mean, or expected, sales of a product to the predictor variables advertising expenditure and price. In order to relate a response variable such as sales to one or more predictor variables so that we can explain and predict values of the response variable, we sometimes use a statistical technique called regression analysis and specify a regression model.

The idea of building a model to help explain and predict is not new. Sir Isaac Newton's equations describing motion and gravitational attraction help us understand bodies in motion and are used today by scientists plotting the trajectories of spacecraft. Despite their successful use, however, these equations are only approximations to the exact nature of motion. Seventeenth-century Newtonian physics has been superseded by the more sophisticated twentieth-century physics of Einstein and Bohr. But even with the refinements of...

In Chapters 2 and 3 we begin to formally discuss the statistical analysis used in statistical modeling and the statistical inferences that can be made using statistical models. For example, in Chapter 2 (graphical descriptive methods) we show how to construct the histogram of car mileages shown in Chapter 1, and in Chapter 3 (numerical descriptive methods) we use this histogram to help explain the Empirical Rule. As illustrated in Figure 3.15, this rule gives tolerance intervals providing estimates of the "lowest" and "highest" mileages that the new midsize car model should be expected to get in combined city and highway driving:

[Figure 3.15, "Estimated Tolerance Intervals in the Car Mileage Case": the histogram of the 50 mileages with three estimated tolerance intervals shown below it: [30.8, 32.4] for the mileages of 68.26 percent of all individual cars, [30.0, 33.2] for 95.44 percent, and [29.2, 34.0] for 99.73 percent.]
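The interval arithmetic previewed here can be reproduced directly from the summary statistics the text quotes (x̄ = 31.56 mpg, s = .7977, n = 50). A minimal sketch; the only input not quoted in the text is the t critical value 2.0096 (the .025 point of a t distribution with 49 degrees of freedom, taken from a t table, a Chapter 8 topic):

```python
import math

# Summary statistics quoted in the text for the sample of n = 50 mileages
n, xbar, s = 50, 31.56, 0.7977

# Empirical Rule tolerance intervals xbar +/- k*s for k = 1, 2, 3,
# rounding the endpoints only at the end, as the text recommends
intervals = {k: (round(xbar - k * s, 1), round(xbar + k * s, 1)) for k in (1, 2, 3)}

# 95 percent confidence margin for the population mean: t * s / sqrt(n),
# with t ≈ 2.0096 for n - 1 = 49 degrees of freedom (assumed from a t table)
margin = 2.0096 * s / math.sqrt(n)
ci = (round(xbar - margin, 2), round(xbar + margin, 2))

print(intervals)              # {1: (30.8, 32.4), 2: (30.0, 33.2), 3: (29.2, 34.0)}
print(round(margin, 2), ci)   # 0.23 (31.33, 31.79)
```

Rounding only at the end reproduces the three tolerance intervals in Figure 3.15 and the margin of error of .23 mpg quoted in the excerpt.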

Figure 3.15 depicts these estimated tolerance intervals, which are shown below the histogram. Because the difference between the upper and lower limits of each estimated tolerance interval is fairly small, we might conclude that the variability of the individual car mileages around the estimated mean mileage of 31.6 mpg is fairly small. Furthermore, the interval [x̄ ± 3s] = [29.2, 34.0] implies that almost any individual car that a customer might purchase this year will obtain a mileage between 29.2 mpg and 34.0 mpg. BI

Before continuing, recall that we have rounded x̄ and s to one decimal point accuracy in order to simplify our initial example of the Empirical Rule. If, instead, we calculate the Empirical Rule intervals by using x̄ = 31.56 and s = .7977 and then round the interval endpoints to one decimal place accuracy at the end of the calculations, we obtain the same intervals as obtained above. In general, however, rounding intermediate calculated results can lead to inaccurate final results. Because of this, throughout this book we will avoid greatly rounding intermediate results.

We next note that if we actually count the number of the 50 mileages in Table 3.1 that are contained in each of the intervals [x̄ ± s] = [30.8, 32.4], [x̄ ± 2s] = [30.0, 33.2], and [x̄ ± 3s] = [29.2, 34.0], we find that these intervals contain, respectively, 34, 48, and 50 of the 50 mileages. The corresponding sample percentages—68 percent, 96 percent, and 100 percent—are close to the theoretical percentages—68.26 percent, 95.44 percent, and 99.73 percent—that apply to a normally distributed population. This is further evidence that the population of all mileages is (approximately) normally distributed and thus that the Empirical Rule holds for this population.

To conclude this example, we note that the automaker has studied the combined city and highway mileages of the new model because the federal tax credit is based on these combined mileages. When reporting fuel economy estimates for a particular car model to the public, however, the EPA realizes that the proportions of city and highway driving vary from purchaser to purchaser. Therefore, the EPA reports both a combined mileage estimate and separate city and highway mileage estimates to the public (see Table 3.1(b) on page 137).

Chapters 1, 2, and 3: Six optional sections discussing business analytics and data mining. The Disney Parks Case is used in an optional section of Chapter 1 to introduce how business analytics and data mining are used to analyze big data. This case considers how Walt Disney World in Orlando, Florida, uses MagicBands worn by many of its visitors to collect massive amounts of real-time location, riding pattern, and purchase history data. These data help Disney improve visitor experiences and tailor its marketing messages to different types of visitors.

[Figure 2.35, "A Dashboard of the Key Performance Indicators for an Airline": "flights on time" bullet graphs of arrival and departure percentages for the Midwest, Northeast, Pacific, and South regions; a monthly line graph of average load factor versus breakeven load factor (70 to 90 percent); fleet utilization gauges for the short-haul, international, and regional fleets; and a monthly graph of fuel costs and total costs (in $100,000,000).]
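The waiting-time bullet graph described in the excerpt that follows bins each predicted time into one of five color bands. The text names only the endpoints (dark green for short, 0 to 20 minutes; red for very long, 80 to 100 minutes), so the three middle labels in this sketch are assumptions:

```python
# Five waiting-time bands for a bullet graph like Figure 2.36. Only the
# dark-green and red endpoints are stated in the text; the middle labels
# are illustrative assumptions.
BANDS = [(20, "dark green (short)"), (40, "green"), (60, "yellow"),
         (80, "orange"), (100, "red (very long)")]

def band(minutes):
    """Return the color band for a predicted waiting time in minutes."""
    for upper, label in BANDS:
        if minutes <= upper:
            return label
    return "off scale"

for wait in (5, 45, 85):
    print(wait, "->", band(wait))
```

A dashboard app would use the same lookup to shade each ride's bar.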

From the optional descriptive analytics section of Chapter 2:

[Figure 2.36, "Excel Output of a Bullet Graph of Disney's Predicted Waiting Times (in minutes) for the Seven Epcot Rides Posted at 3 p.m. on February 21, 2015" DS DisneyTimes]

...graphs compare the single primary measure to a target, or objective, which is represented by a symbol on the bullet graph. The bullet graph of Disney's predicted waiting times uses five colors ranging from dark green to red and signifying short (0 to 20 minutes) to very long (80 to 100 minutes) predicted waiting times. This bullet graph does not compare the predicted waiting times to an objective. However, the bullet graphs located in the upper left of the dashboard in Figure 2.35 (representing the percentages of on-time arrivals and departures for the airline) do display objectives represented by short vertical black lines. For example, consider the bullet graphs representing the percentages of on-time arrivals and departures in the Midwest, which are shown below. The airline's objective was to have 80 percent of midwestern arrivals be on time. The approximately 75 percent of actual midwestern arrivals that were on time is in the airline's light brown "satisfactory" region of the bullet graph, but this 75 percent does not reach the 80 percent objective.

Treemaps   We next discuss treemaps, which help visualize two variables. Treemaps display information in a series of clustered rectangles, which represent a whole. The sizes of the rectangles represent a first variable, and treemaps use color to characterize the various rectangles within the treemap according to a second variable. For example, suppose (as a purely hypothetical example) that Disney gave visitors at Epcot the voluntary opportunity to use their personal computers or smartphones to rate as many of the seven Epcot rides as desired on a scale from 0 to 5. Here, 0 represents "poor," 1 represents "fair," 2 represents "good," 3 represents "very good," 4 represents "excellent," and 5 represents "superb." Figure 2.37(a) gives the number of ratings and the mean rating for each ride on a particular day. (These data are completely fictitious.) Figure 2.37(b) shows the Excel output of a treemap, where the size and color of the rectangle for a particular ride represent, respectively, the total number of ratings and the mean rating for the ride. The colors range from dark green (signifying a mean rating near the "superb," or 5, level) to white (signifying a mean rating near the "fair," or 1, level), as shown by the color scale on the treemap. Note that six of the seven rides are rated to be at least "good," four of the seven rides are rated to be at least "very good," and one ride is rated as "fair." Many treemaps use a larger range of colors (ranging, say, from dark green to red), but the Excel app we used to obtain Figure 2.37(b) gave the range of colors shown in that figure. Also, note that treemaps are frequently used to display hierarchical information (information that could be displayed as a tree, where different branchings would be used to show the hierarchical information). For example, Disney could have visitors voluntarily rate the rides in each of its four Orlando parks—Disney's Magic Kingdom, Epcot, Disney's Animal Kingdom, and Disney's Hollywood Studios. A treemap would be constructed by breaking a large...

Figure 2.37   The Number of Ratings and the Mean Rating for Each of Seven Rides at Epcot (0 = Poor, 1 = Fair, 2 = Good, 3 = Very Good, 4 = Excellent, 5 = Superb) and an Excel Output of a Treemap of the Numbers of Ratings and the Mean Ratings

(a) The number of ratings and the mean ratings   DS DisneyRatings

Ride                                  Number of Ratings   Mean Rating
Soarin'                               2572                4.815
Test Track presented by Chevrolet     2045                4.247
Spaceship Earth                       697                 1.319
Living With The Land                  725                 2.186
Mission: Space orange                 1589                3.408
Mission: Space green                  467                 3.116
The Seas with Nemo & Friends          1157                2.712

(b) Excel output of the treemap: [rectangle sizes proportional to the numbers of ratings; colors on a scale from dark green (mean rating near 5) down to white (mean rating near 1)]
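The two-variable encoding used by the treemap in Figure 2.37(b) (rectangle size for the number of ratings, color for the mean rating) can be sketched numerically. This is an illustrative computation, not the Excel app the book used; the white and dark-green RGB endpoints are assumptions:

```python
# Fictitious Epcot ratings from Figure 2.37(a): ride -> (number of ratings, mean rating)
rides = {
    "Soarin'": (2572, 4.815),
    "Test Track presented by Chevrolet": (2045, 4.247),
    "Spaceship Earth": (697, 1.319),
    "Living With The Land": (725, 2.186),
    "Mission: Space orange": (1589, 3.408),
    "Mission: Space green": (467, 3.116),
    "The Seas with Nemo & Friends": (1157, 2.712),
}

total_ratings = sum(n for n, _ in rides.values())

def shade(mean_rating, lo=1.0, hi=5.0):
    """Interpolate from white (rating near 1, 'fair') to dark green (rating near 5, 'superb')."""
    t = (mean_rating - lo) / (hi - lo)
    white, dark_green = (255, 255, 255), (0, 100, 0)  # assumed RGB endpoints
    return tuple(round(w + t * (d - w)) for w, d in zip(white, dark_green))

for ride, (n, mean) in sorted(rides.items(), key=lambda kv: -kv[1][0]):
    share = 100 * n / total_ratings  # rectangle area as a percent of the whole treemap
    print(f"{ride:34s} {share:5.1f}%  rgb{shade(mean)}")
```

Soarin', with 2572 of the 9252 total ratings, would occupy about 27.8 percent of the treemap's area and receive the darkest green shade.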

At its Epcot park, Disney helps visitors choose their next ride by continuously summarizing predicted waiting times for seven popular rides on large screens in the park. Disney management also uses the riding pattern data it collects to make planning decisions, as is shown by the following business improvement conclusion from Chapter 1: "…As a matter of fact, Channel 13 News in Orlando reported on March 6, 2015—during the writing of this case—that Disney had announced plans to add a third 'theatre' for Soarin' (a virtual ride) in order to shorten visitor waiting times." BI

The Disney Parks Case is also used in an optional section of Chapter 2 to help discuss descriptive analytics. Specifically, Figure 2.36 shows a bullet graph summarizing predicted waiting times for seven Epcot rides posted by Disney on February 21, 2015, at 3 p.m., and Figure 2.37 shows a treemap illustrating fictitious visitor ratings of the seven Epcot rides. Other graphics discussed in the optional section on descriptive analytics include gauges, sparklines, data drill-down graphics, and dashboards combining graphics illustrating a business's key performance indicators. For example, Figure 2.35 is a dashboard showing eight "flights on time" bullet graphs and three "fleet utilization" gauges for an airline.

Chapter 3 contains four optional sections that discuss six methods of predictive analytics. The methods discussed are explained in an applied and practical way by using the numerical descriptive statistics previously discussed in Chapter 3. These methods are:

• Classification tree modeling and regression tree modeling (see Section 3.7 and the following figures):
[Figure 3.26, "JMP Output of a Classification Tree for the Card Upgrade Data" DS CardUpgrade: a partition plot and classification tree for Upgrade that splits on Purchases (at 26.185, 32.45, and 39.925) and on PlatProfile(0)/PlatProfile(1), from Section 3.7, Decision Trees: Classification Trees and Regression Trees (Optional).]

[XLMiner outputs: (e) a classification tree built from coupon redemption training data, splitting on Card and Purchases; (f) growing the tree in (e); (g) the best pruned classification tree using the validation data, with its error report and classification confusion matrix; (h) pruning the tree in (e), with error rates by number of decision nodes.]

[Figure 3.28: an XLMiner regression tree for Fresh demand, splitting on AdvExp and PriceDif, with predicted values for Exercise 3.56(d) and (e), and a best pruned tree for Exercise 3.57(a). DS Fresh2]
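A classification tree such as the one in Figure 3.26 turns threshold splits on predictor variables into a prediction rule. The sketch below illustrates the idea using the split values visible in the figure (Purchases at 26.185, 32.45, and 39.925, plus PlatProfile); which leaves predict an upgrade is a simplified illustrative reading, not the exact JMP output:

```python
def predict_upgrade(purchases, plat_profile):
    """Illustrative tree rule: return 1 if an upgrade is predicted, else 0.

    Thresholds come from Figure 3.26; the leaf decisions are assumptions.
    """
    if purchases >= 32.45:
        if purchases >= 39.925:
            return 1                          # highest-spending leaf
        return 1 if plat_profile == 1 else 0  # depends on platinum profile
    if purchases < 26.185:
        return 0                              # lowest-spending leaf
    return 1 if plat_profile == 1 else 0

examples = [(45.0, 0), (35.0, 1), (30.0, 0), (20.0, 1)]
print([predict_upgrade(p, pp) for p, pp in examples])  # [1, 1, 0, 0]
```

Software such as JMP or XLMiner chooses the thresholds automatically to maximize the purity of each resulting node.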





• Hierarchical clustering and k-means clustering (see Section 3.8 and the following figures):

[(a) Minitab dendrogram (complete linkage) and (b) JMP hierarchical clustering output grouping the sports—boxing, basketball, golf, swimming, skiing, baseball, ping-pong, hockey, handball, track & field, bowling, tennis, and football—into clusters by perceived similarity. DS SportsRatings]

[XLMiner output for Exercise 3.61: a k-means clustering giving each sport's cluster ID, the centroids of each cluster (that is, the six mean values on the six perception scales of the cluster's members), the average distance of each cluster's members from the cluster centroid, and the distances between the cluster centroids. The exercise asks students to (a) use the output to summarize the members of each cluster, and (b) by using the members of each cluster and the cluster centroids, discuss the basic differences between the clusters and discuss how this k-means cluster analysis leads to the same practical conclusions about how to improve the popularities of baseball and tennis that were obtained using the previously discussed hierarchical clustering.]


196

4.289699
5.649341
4.599382
6.914365

5.417842
1.349154
1.04255
6.027006

2.382945
5.353059
5.130782
1.315692

2.3

4.546991
7.91777
3.223697
6.290393

4.5

2.382945
6.107439
5.052507

3.928169

2.3

3.434243
6.005379
5.110051
0.593558

3.4

2.643109
4.707396
2.647626
4.604437

2.6

4.257925
7.005408
2.710078
5.417842

4.2

5.712401
1.04255

5.7


The JMP k-means output also gives the cluster centers and a cluster summary:

Cluster      Fast    Compl      Team       Easy    Ncon       Opp
Cluster-1    4.78    4.18       2.16       3.33    3.6        2.67
Cluster-2    5.6     4.825      5.99       3.475   1.71       3.92
Cluster-3    2.858   4.796      5.078      3.638   2.418      3.022
Cluster-4    1.99    3.253333   1.606667   4.62    5.773333   2.363333
Cluster-5    2.6     4.61       6.29       5       4.265      3.22

Cluster      #Obs   Avg. Distance
Cluster-1    1      0
Cluster-2    2      0.960547
Cluster-3    5      1.319782
Cluster-4    3      0.983933
Cluster-5    2      2.382945
Overall      13     1.249053

[A further table gives the distances between the cluster centers.]
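The k-means procedure that produces output like this can be sketched in a few lines of Python. This is a toy illustration on synthetic two-dimensional points (not the actual ratings data, and not the JMP or XLMiner implementation, which adds smarter initialization and multiple restarts):

```python
# A bare-bones k-means (Lloyd's algorithm) sketch on synthetic 2-D points.
def kmeans(points, centroids, iters=20):
    for _ in range(iters):
        # assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # update step: move each centroid to the mean of its members
        centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# two obvious groups of "customers"
pts = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (7.8, 8.2), (8.1, 7.9)]
cents, cls = kmeans(pts, centroids=[(0, 0), (10, 10)])
print(cents)
```

After convergence each centroid sits at the mean of its cluster's members, which is exactly what the "cluster centers" table above reports for the six perception scales.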
We will illustrate k-means clustering by using a real data mining project. For confidentiality purposes, we will consider a fictional grocery chain. However, the conclusions reached are real. Consider, then, the Just Right grocery chain, which has 2.3 million store loyalty card holders. Store managers are interested in clustering their customers into various subgroups whose shopping habits tend to be similar. They expect to find that certain customers tend to buy many cooking basics like oil, flour, eggs, rice, and raw chickens, while others are buying prepared items from the deli, salad bar, and frozen food aisle. Perhaps there are other important categories like calorie-conscious, vegetarian, or premium-quality shoppers. The executives don't know what the clusters are and hope the data will enlighten them. They choose to concentrate on 100 important products offered in their stores. Suppose that product 1 is fresh strawberries, product 2 is olive oil, product 3 is hamburger buns, and product 4 is potato chips. For each customer having a Just Right loyalty card, they will know the

In the real world, companies such as Amazon and Netflix sell or rent thousands or even millions of items and find association rules based on millions of customers. In order to make obtaining meaningful association rules manageable, these companies break products into various categories (for example, comedies or thrillers) and hierarchies (for example, a hierarchy related to how new the product is).

Exercises for Section 3.10
LO3-10

CONCEPTS

3.66  What is the purpose of association rules?

3.67  Discuss the meanings of the terms support percentage, confidence percentage, and lift ratio.

METHODS AND APPLICATIONS

3.68  In the previous XLMiner output, show how the lift ratio of 1.1111 (rounded) for the recommendation of C to renters of B has been calculated. Interpret this lift ratio.

3.69  The XLMiner output of an association rule analysis of the DVD renters data using a specified support percentage of 40 percent and a specified confidence percentage of 70 percent is shown below.  DS DVDRent
a  Summarize the recommendations based on a lift ratio greater than 1.
b  Consider the recommendation of DVD B based on having rented C & E. (1) Identify and interpret the support for C & E. Do the same for the support for C & E & B. (2) Show how the Confidence% of 80 has been calculated. (3) Show how the Lift Ratio of 1.1429 (rounded) has been calculated.
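The support, confidence, and lift arithmetic asked about in Exercises 3.68 and 3.69 can be checked directly. A sketch; note that the total of 10 transactions is inferred from the lift values in the output rather than stated in this excerpt:

```python
# Support, confidence, and lift for an association rule x -> y,
# computed from counts of qualifying transactions.
def rule_metrics(n, count_x, count_y, count_xy):
    support_x = count_x / n          # fraction of transactions containing all of x
    support_y = count_y / n          # fraction containing all of y
    support_xy = count_xy / n        # fraction containing x and y together
    confidence = count_xy / count_x  # P(y | x), XLMiner's Confidence%
    lift = confidence / support_y    # confidence relative to buying y at random
    return support_xy, confidence, lift

# The rule C & E -> B from Exercise 3.69: 5 renters of C & E,
# 7 renters of B, 4 renters of all three, out of (inferred) 10.
s, conf, lift = rule_metrics(n=10, count_x=5, count_y=7, count_xy=4)
print(s, conf, lift)   # support .4, confidence .8, lift about 1.1429
```

A lift above 1 means renters of the antecedent are more likely than a randomly chosen customer to rent the consequent, which is the basis for part a of Exercise 3.69.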
3.9 Factor Analysis (Optional and Requires Section 3.4)
• Factor analysis and association rule mining (see Sections 3.9 and 3.10 and the following figures):

Factor analysis starts with a large number of correlated variables and attempts to find fewer underlying uncorrelated factors that describe the "essential aspects" of the large number of correlated variables. To illustrate factor analysis, suppose that a personnel officer has interviewed and rated 48 job applicants for sales positions on the following 15 variables:

1 Form of application letter    6 Lucidity        11 Ambition
2 Appearance                    7 Honesty         12 Grasp
3 Academic ability              8 Salesmanship    13 Potential
4 Likability                    9 Experience      14 Keenness to join
5 Self-confidence               10 Drive          15 Suitability

Figure 3.35  Minitab Output of a Factor Analysis of the Applicant Data (4 Factors Used)

[Principal component factor analysis of the correlation matrix: the unrotated factor loadings and communalities for Var 1 through Var 15, and the rotated (Varimax rotation) factor loadings and communalities. The unrotated factors have variances 7.5040, 2.0615, 1.4677, and 1.2091 out of 12.2423, with % Var 0.500, 0.137, 0.098, and 0.081; the rotated factors have variances 5.7455, 2.7351, 2.4140, and 1.3478, with % Var 0.383, 0.182, 0.161, and 0.090, for a total of 0.816 in each case.]

The four factors might be interpreted as follows: Factor 1, "extroverted personality"; Factor 2, "experience"; Factor 3, "agreeable personality"; Factor 4, "academic ability." Variable 2 (appearance) does not load heavily on any factor and thus is its own factor, as Factor 6 on the Minitab output in Figure 3.34 indicated is true. Variable 1 (form of application letter) loads heavily on Factor 2 ("experience"). In summary, there is not much difference between the 7-factor and 4-factor solutions. We might therefore conclude that the 15 variables can be reduced to the following five uncorrelated factors: "extroverted personality," "experience," "agreeable personality," "academic ability," and "appearance." This conclusion helps the personnel officer focus on the "essential characteristics" of a job applicant. Moreover, if a company analyst wishes at a later date to use a tree diagram or regression analysis to predict sales performance on the basis of the characteristics of salespeople, the analyst can simplify the prediction modeling procedure by using the five uncorrelated factors instead of the original 15 correlated variables as potential predictor variables.

In general, in a data mining project where we wish to predict a response variable and in which there are an extremely large number of potential correlated predictor variables, it can be useful to first employ factor analysis to reduce the large number of potential correlated predictor variables to fewer uncorrelated factors that we can use as potential predictor variables.

Rule: If all Antecedent items are purchased, then with Confidence percentage Consequent items will also be purchased.

Row ID   Confidence%    Antecedent (x)   Consequent (y)   Support for x   Support for y   Support for x & y   Lift Ratio
1        71.42857143    B                A                7               7               5                   1.020408163
2        71.42857143    A                B                7               7               5                   1.020408163
3        85.71428571    A                C                7               9               6                   0.952380952
4        77.77777778    C                B                9               7               7                   1.111111111
5        100            B                C                7               9               7                   1.111111111
6        71.42857143    B & C            A                7               7               5                   1.020408163
7        83.33333333    A & C            B                6               7               5                   1.19047619
8        100            A & B            C                5               9               5                   1.111111111
9        71.42857143    B                A & C            7               6               5                   1.19047619
10       71.42857143    A                B & C            7               7               5                   1.020408163
11       83.33333333    E                C                6               9               5                   0.925925926
12       80             C & E            B                5               7               4                   1.142857143
13       100            B & E            C                4               9               4                   1.111111111

Chapter Summary

We began this chapter by presenting and comparing several measures of central tendency. We defined the population mean and we saw how to estimate the population mean by using a sample mean. We also defined the median and mode, and we compared the mean, median, and mode for symmetrical distributions and for distributions that are skewed to the right or left. We then studied measures of variation (or spread). We defined the range, variance, and standard deviation, and we saw how to estimate a population variance and standard deviation by using a sample. We learned that a good way to interpret the standard deviation when a population is (approximately) normally distributed is to use the Empirical Rule, and we studied Chebyshev's Theorem, which gives us intervals containing reasonably large fractions of the population units no matter what the population's shape might be. We also saw that, when a data set is highly skewed, it is best to use percentiles and quartiles to measure variation, and we learned how to construct a box-and-whiskers plot by using the quartiles. After learning how to measure and depict central tendency and variability, we presented various optional topics. First, we discussed several numerical measures of the relationship between two variables. These included the covariance, the correlation coefficient, and the least squares line. We then introduced the concept of a weighted mean and also explained how to compute descriptive statistics for grouped data. In addition, we showed how to calculate the geometric mean and demonstrated its interpretation. Finally, we used the numerical methods of this chapter to give an introduction to four important techniques of predictive analytics: decision trees, cluster analysis, factor analysis, and association rules.
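The factor-reduction idea of Section 3.9 can be illustrated numerically. This principal-component sketch uses synthetic data standing in for the applicant ratings, and it omits the Varimax rotation that Minitab applies:

```python
# Principal-component sketch of the factor-analysis idea:
# many correlated variables driven by a few underlying factors.
# Synthetic data: six observed variables generated from two
# uncorrelated latent factors plus noise.
import numpy as np

rng = np.random.default_rng(0)
n = 500
f1, f2 = rng.normal(size=(2, n))                  # two latent factors
noise = 0.3 * rng.normal(size=(6, n))
X = np.vstack([f1, f1, f1, f2, f2, f2]) + noise   # six observed variables

R = np.corrcoef(X)                                # 6x6 correlation matrix
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
explained = eigvals / eigvals.sum()               # like Minitab's "% Var" row

# the first two components capture almost all the variance,
# so the six correlated variables reduce to two factors
print(explained[:2].sum())
```

The drop-off in the "% Var" values after the first few factors is what justifies keeping only a small number of factors as potential predictor variables.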
We believe that an early introduction to predictive analytics (in Chapter 3) will make statistics seem more useful and relevant from the beginning and thus motivate students to be more interested in the entire course. However, our presentation gives instructors various choices. This is because, after covering the introduction to business analytics in Chapter 1, the five optional sections on descriptive analytics and predictive analytics in Chapters 2 and 3 can be covered in any order without loss of continuity. Therefore, the instructor can choose which of the six optional business analytics sections to cover early, as part of the main flow of Chapters 1–3, and which to discuss later. We recommend that sections chosen to be discussed later be covered after Chapter 14, which presents the further predictive analytics topics of multiple linear regression, logistic regression, and neural networks.

Chapters 4–8: Probability and probability modeling. Discrete and continuous probability distributions. Sampling distributions and confidence intervals. Chapter 4 discusses probability by featuring a new discussion of probability modeling and by using motivating examples—The Crystal Cable Case and a real-world example of gender discrimination at a pharmaceutical company—to illustrate the probability rules. Chapters 5 and 6 give more concise discussions of discrete and continuous probability distributions (models) and feature practical examples illustrating the "rare event approach" to making a statistical inference. In Chapter 7, The Car Mileage Case is used to introduce sampling distributions and motivate the Central Limit Theorem (see Figures 7.1, 7.3, and 7.5). In Chapter 8, the automaker in The Car Mileage Case uses a confidence interval procedure specified by the Environmental Protection Agency (EPA) to find the EPA estimate of a new midsize model's true mean mileage and determine if the new midsize model deserves a federal tax credit (see Figure 8.2).


This sample mean is the point estimate of the mean mileage m for the population of six preproduction cars and is the preliminary mileage estimate for the new midsize model that was reported at the auto shows. When the auto shows were over, the automaker decided to further study the new midsize model by subjecting the four auto show cars to various tests. When the EPA mileage test was performed, the four cars obtained mileages of 29 mpg, 31 mpg, 33 mpg, and 34 mpg. Thus, the mileages obtained by the six preproduction cars were 29 mpg, 30 mpg, 31 mpg, 32 mpg, 33 mpg, and 34 mpg. The probability distribution of this population of six individual car mileages is given in Table 7.1 and graphed in Figure 7.1(a). The mean of the population of

C EXAMPLE 7.2  The Car Mileage Case: Estimating Mean Mileage

Part 1: Basic Concepts

Consider the infinite population of the mileages of all of the new midsize cars that could potentially be produced by this year's manufacturing process. If we assume that this population is normally distributed with mean μ and standard deviation σ. Because the automaker has been working to improve gas mileages, we cannot assume that we know the true value of μ for the new midsize model. However, engineering data might indicate that the spread of individual car mileages for the automaker's midsize cars is the same from model to model and year to year. Therefore, if the mileages for previous models had a standard deviation equal to .8 mpg, it might be reasonable to assume that the standard deviation of the mileages for the new model will also equal .8 mpg. Such an assumption would, of course, be questionable, and in most real-world situations there would probably not be an actual basis for knowing σ. However, assuming that σ is known will help us to illustrate sampling distributions, and in later chapters we will see what to do when σ is unknown.

Table 7.1  A Probability Distribution Describing the Population of Six Individual Car Mileages

Individual Car Mileage   29    30    31    32    33    34
Probability              1/6   1/6   1/6   1/6   1/6   1/6

Figure 7.1  (a) A graph of the probability distribution describing the population of six individual car mileages  (b) A graph of the probability distribution describing the population of 15 sample means

Table 7.2  The Population of Sample Means

(a) The population of the 15 samples of n = 2 car mileages and corresponding sample means

Sample   Car Mileages   Sample Mean
1        29, 30         29.5
2        29, 31         30
3        29, 32         30.5
4        29, 33         31
5        29, 34         31.5
6        30, 31         30.5
7        30, 32         31
8        30, 33         31.5
9        30, 34         32
10       31, 32         31.5
11       31, 33         32
12       31, 34         32.5
13       32, 33         32.5
14       32, 34         33
15       33, 34         33.5

(b) A probability distribution describing the population of 15 sample means: the sampling distribution of the sample mean

Sample Mean   Frequency   Probability
29.5          1           1/15
30            1           1/15
30.5          2           2/15
31            2           2/15
31.5          3           3/15
32            2           2/15
32.5          2           2/15
33            1           1/15
33.5          1           1/15

7.1  The Sampling Distribution of the Sample Mean

Figure 7.3  A Comparison of (1) the Population of All Individual Car Mileages, (2) the Sampling Distribution of the Sample Mean x̄ When n = 5, and (3) the Sampling Distribution of the Sample Mean x̄ When n = 50

(a) The population of individual mileages: the normal distribution describing the population of all individual car mileages, which has mean μ and standard deviation σ = .8

(b) The sampling distribution of the sample mean x̄ when n = 5: the normal distribution describing the population of all possible sample means when the sample size is 5, where μx̄ = μ and σx̄ = σ/√n = .8/√5 = .358

(c) The sampling distribution of the sample mean x̄ when n = 50: the normal distribution describing the population of all possible sample means when the sample size is 50, where μx̄ = μ and σx̄ = σ/√n = .8/√50 = .113

Figure 7.5  The Central Limit Theorem Says That the Larger the Sample Size Is, the More Nearly Normally Distributed Is the Population of All Possible Sample Means

(a) Several sampled populations  (b) Corresponding populations of all possible sample means for sample sizes n = 2, n = 6, and n = 30

8.1  z-Based Confidence Intervals for a Population Mean: σ Known

Figure 8.2  Three 95 Percent Confidence Intervals for μ

[The figure shows the population of all individual car mileages and three samples of n = 50 car mileages. The probability is .95 that x̄ will be within plus or minus 1.96σx̄ = .22 of μ, so the sample means x̄ = 31.56, x̄ = 31.2, and x̄ = 31.68 give the intervals [31.34, 31.78], [30.98, 31.42], and [31.46, 31.90].]

How large must the sample size be for the sampling distribution of x̄ to be approximately normal? In general, the more skewed the probability distribution of the sampled population, the larger the sample size must be for the population of all possible sample means to be approximately normally distributed. For some sampled populations, particularly those described by symmetric distributions, the population of all possible sample means is approximately normally distributed for a fairly small sample size. In addition, studies indicate that, if the sample size is at least 30, then for most sampled populations the population of all possible sample means is approximately normally distributed. In this book, whenever the sample size n is at least 30, we will assume that the sampling distribution of x̄ is approximately a normal distribution. Of course, if the sampled population is exactly normally distributed, the sampling distribution of x̄ is exactly normal for any sample size.
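The small-scale illustration of Tables 7.1 and 7.2 can be verified by brute-force enumeration; a sketch:

```python
# Enumerate all 15 samples of n = 2 from the population of six
# preproduction car mileages (Table 7.2) and tabulate the
# sampling distribution of the sample mean.
from itertools import combinations
from collections import Counter

population = [29, 30, 31, 32, 33, 34]
means = [sum(pair) / 2 for pair in combinations(population, 2)]

freq = Counter(means)                  # frequency of each sample mean
mu = sum(population) / len(population)
mu_xbar = sum(means) / len(means)      # mean of all possible sample means

# the mean of the sampling distribution equals the population mean
print(len(means), mu, mu_xbar, freq[31.5])
```

The tabulated frequencies (1, 1, 2, 2, 3, 2, 2, 1, 1) match part (b) of Table 7.2, and the mean of the 15 sample means equals the population mean, illustrating that x̄ is an unbiased point estimate of μ.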

Chapters 9–12: Hypothesis testing. Two-sample procedures. Experimental design and analysis of variance. Chi-square tests. Chapter 9 discusses hypothesis testing and begins with a new section on formulating statistical hypotheses. Three cases—The e-billing Case, The Trash Bag Case, and The Valentine's Day Chocolate Case—are then used in a new section that explains the critical value and p-value approaches to testing a hypothesis about a population mean.

C EXAMPLE 7.3  The e-billing Case: Reducing Mean Bill Payment Time

Recall that a management consulting firm has installed a new computer-based electronic billing system in a Hamilton, Ohio, trucking company. Because of the previously discussed advantages of the new billing system, and because the trucking company's clients are receptive to using this system, the management consulting firm believes that the new system will reduce the mean bill payment time by more than 50 percent. The mean payment time using the old billing system was approximately equal to, but no less than, 39 days. Therefore, if μ denotes the new mean payment time, the consulting firm believes that μ will be less than 19.5 days. To assess whether μ is less than 19.5 days, the consulting firm has randomly selected a sample of n = 65 invoices processed using the new billing system and has determined the payment times for these invoices. The mean of the 65 payment times is x̄ = 18.1077 days, which is less than 19.5 days. Therefore, we ask the following question: If

31.42

_

3 In statement 1 we showed that the probability is .95 that the sample mean x will be

2 we showed
within
plus or minus 1.96s_x 5 .22 of the population mean m. In statement
_
_
that x being within plus or minus .22 of m is the same as the interval [x 6 .22] containing m. Combining these results, we see that the probability is .95 that the sample mean

_
x will be such that the interval
_

_

[x 6 1.96s_x] 5 [x 6 .22]
contains
the population
mean m.
approaches
is presented

in the middle of this section
editions) so that
Statement
3 says
that,
before
we randomly
select
the sample,to
theredeveloping
is a .95 probability
that
more
of
the
section
can

be
devoted
the
_
we will obtain an interval [x 6 .22] that contains the population mean m. In other words,
summary
boxthat
and
showing
how
it. ofInthese
addition,
m,to
anduse
5 percent
intervals do
95 percent
of all intervals
we might
obtain contain
_
not contain m. For this reason, we call the interval [x 6 .22] a 95 percent confidence interval
a
five-step
hypothesis
testing
procedure
emphasizes
for m. To better understand this interval, we must realize that, when we actually select the
sample,

will observe one particular
extremely
large number
of possible
thatwesuccessfully
usingsample
anyfrom
ofthethe
book’s
hypothesis
samples. Therefore, we will obtain one particular confidence interval from the extremely large
testing
summary
boxes
requires
simply
identifying
number
of possible
confidence intervals.
For example,
recall that
when the automaker
randomly
selected
sample of n 5 50
cars and tested them
as prescribed
the EPA,
the automaker

the the
alternative
hypothesis
being
testedbyand
then
look_
obtained the sample of 50 mileages given in Table 1.7. The mean of this sample is x 5
ing
inandthe
summary
box
the corresponding
critical
31.56
mpg,
a histogram
constructed
usingfor
this sample
(see Figure 2.9 on page 66)
indicates
that the population of all individual car mileages is normally distributed. It follows that a
value rule and/or p-value (see the next page).
95 percent confidence interval for the population mean mileage m of the new midsize model is
(rather
than
at the interval
end, asfor
in mprevious

A 95
percent
confidence

_

[ x 6 .22] 5 [31.56 6 .22]
5 [31.34, 31.78]

vii

Because we do not know the true value of m, we do not know for sure whether this interval
contains m. However, we are 95 percent confident that this interval contains m. That is, we
are 95 percent confident that m is between 31.34 mpg and 31.78 mpg. What we mean by
“95 percent confident” is that we hope that the confidence interval [31.34, 31.78] is one of
the 95 percent of all confidence intervals that contain m and not one of the 5 percent of all
confidence intervals that do not contain m. Here, we say that 95 percent is the confidence 20/11/15 4:06 pm
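The interval arithmetic in the mileage example is easy to reproduce. In the sketch below, σ = 0.8 is an assumption inferred from the quoted numbers (it makes 1.96σx̄ round to the .22 used in the passage for n = 50); it is not stated in this excerpt.

```python
import math

# Numbers quoted in the passage: n = 50 mileages, x-bar = 31.56 mpg.
# sigma = 0.8 is an inferred assumption, not stated in this excerpt.
n = 50
x_bar = 31.56
sigma = 0.8

sigma_xbar = sigma / math.sqrt(n)   # standard error of the sample mean
half_width = 1.96 * sigma_xbar      # rounds to the .22 used in the text

lower, upper = x_bar - half_width, x_bar + half_width
print(f"[{lower:.2f}, {upper:.2f}]")
```

The printed interval matches the [31.34, 31.78] worked out above.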


Chapter 9   Hypothesis Testing

p-value is a right-tailed p-value. This p-value, which we have previously computed, is the area under the standard normal curve to the right of the computed test statistic value z. In the next two subsections we will discuss using the critical value rules and p-values in the summary box to test a "less than" alternative hypothesis (Ha: μ < μ0) and a "not equal to" alternative hypothesis (Ha: μ ≠ μ0). Moreover, throughout this book we will (formally or informally) use the five steps below to implement the critical value and p-value approaches to hypothesis testing.

The Five Steps of Hypothesis Testing

1 State the null hypothesis H0 and the alternative hypothesis Ha.
2 Specify the level of significance α.
3 Plan the sampling procedure and select the test statistic.

Using a critical value rule:

4 Use the summary box to find the critical value rule corresponding to the alternative hypothesis.
5 Collect the sample data, compute the value of the test statistic, and decide whether to reject H0 by using the critical value rule. Interpret the statistical results.

Using a p-value rule:

4 Use the summary box to find the p-value corresponding to the alternative hypothesis. Collect the sample data, compute the value of the test statistic, and compute the p-value.
5 Reject H0 at level of significance α if the p-value is less than α. Interpret the statistical results.

9.3 t Tests about a Population Mean: σ Unknown

LO9-4 Use critical values and p-values to perform a t test about a population mean when σ is unknown.

If we do not know σ (which is usually the case), we can base a hypothesis test about μ on the sampling distribution of

(x̄ − μ) / (s/√n)

If the sampled population is normally distributed (or if the sample size is large—at least 30), then this sampling distribution is exactly (or approximately) a t distribution having n − 1 degrees of freedom. This leads to the following results:
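Steps 4 and 5 of the p-value approach need only a tail area for the computed test statistic. For a left-tailed ("less than") alternative, the p-value is the area under the standard normal curve to the left of z. A minimal standard-library sketch (the book reads these areas from the normal table); as a sanity check, the area to the left of −2.33 is about .01, which is why −z.01 = −2.33.

```python
import math

# Standard normal CDF via the complementary error function, so no
# packages outside the standard library are needed.
def phi(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

# Left-tailed p-values for two z values that appear in this chapter's
# examples: the .01 critical point and a strongly negative statistic.
for z in (-2.33, -3.90):
    print(z, phi(z))
```

The second value is the kind of tiny left-tail area (about .00005) that leads to rejecting H0 at any conventional significance level.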


A t Test about a Population Mean: σ Unknown

Null hypothesis: H0: μ = μ0
Test statistic: t = (x̄ − μ0) / (s/√n) with df = n − 1
Assumptions: Normal population or large sample size

Critical value rule (reject H0 at level of significance α if):
Ha: μ > μ0: t > tα
Ha: μ < μ0: t < −tα
Ha: μ ≠ μ0: |t| > tα/2

p-value (reject H0 if p-value < α):
Ha: μ > μ0: p-value = area under the t curve to the right of t
Ha: μ < μ0: p-value = area under the t curve to the left of t
Ha: μ ≠ μ0: p-value = twice the area under the t curve to the right of |t|

Testing a "less than" alternative hypothesis

We have seen in the e-billing case that to study whether the new electronic billing system reduces the mean bill payment time by more than 50 percent, the management consulting firm will test H0: μ = 19.5 versus Ha: μ < 19.5 (step 1). A Type I error (concluding that Ha: μ < 19.5 is true when H0: μ = 19.5 is true) would result in the consulting firm overstating the benefits of the new billing system, both to the company in which it has been installed and to other companies that are considering installing such a system. Because the consulting firm desires to have only a 1 percent chance of doing this, the firm will set α equal to .01 (step 2). To perform the hypothesis test, we will randomly select a sample of n = 65 invoices paid using the new billing system and calculate the mean x̄ of the payment times of these invoices. Then, because the sample size is large, we will utilize the test statistic in the summary box (step 3):

z = (x̄ − 19.5) / (s/√n)

A value of the test statistic z that is less than zero results when x̄ is less than 19.5. This provides evidence to support rejecting H0 in favor of Ha because the point estimate x̄ indicates that μ might be less than 19.5. To decide how much less than zero the value of the test statistic must be to reject H0 in favor of Ha at level of significance α, we note that Ha: μ < 19.5 is of the form Ha: μ < μ0, and we look in the summary box under the critical value rule heading Ha: μ < μ0. The critical value rule that we find is a left-tailed critical value rule and says to do the following:

Place the probability of a Type I error, α, in the left-hand tail of the standard normal curve and use the normal table to find the critical value −zα. Here −zα is the negative of the normal point zα. That is, −zα is the point on the horizontal axis under the standard normal curve that gives a left-hand tail area equal to α.

Reject H0: μ = 19.5 in favor of Ha: μ < 19.5 if and only if the computed value of the test statistic z is less than the critical value −zα (step 4). Because α equals .01, the critical value −zα is −z.01 = −2.33 [see Table A.3 and Figure 9.3(a)].

EXAMPLE 9.4 The Commercial Loan Case: Mean Debt-to-Equity Ratio

One measure of a company's financial health is its debt-to-equity ratio. This quantity is defined to be the ratio of the company's corporate debt to the company's equity. If this ratio is too high, it is one indication of financial instability. For obvious reasons, banks often monitor the financial health of companies to which they have extended commercial loans. Suppose that, in order to reduce risk, a large bank has decided to initiate a policy limiting the mean debt-to-equity ratio for its portfolio of commercial loans to being less than 1.5. In order to assess whether the mean debt-to-equity ratio μ of its (current) commercial loan portfolio is less than 1.5, the bank will test the null hypothesis H0: μ = 1.5 versus the alternative hypothesis Ha: μ < 1.5. In this situation, a Type I error (rejecting H0: μ = 1.5 when H0: μ = 1.5 is true) would result in the bank concluding that the mean debt-to-equity ratio of its commercial loan portfolio is less than 1.5 when it is not. Because the bank wishes to be very sure that it does not commit this Type I error, it will perform the hypothesis test by using a .01 level of significance. To perform the hypothesis test, the bank randomly selects a sample of 15 of its commercial loan accounts. Audits of these companies result in the following debt-to-equity ratios (arranged in increasing order): 1.05, 1.11, 1.19, 1.21, 1.22, 1.29, 1.31, 1.32, 1.33, 1.37, 1.41, 1.45, 1.46, 1.65, and 1.78. DS DebtEq

[Margin: mound-shaped stem-and-leaf display of the 15 debt-to-equity ratios.]

The mound-shaped stem-and-leaf display of these ratios is given in the page margin and indicates that the population of all debt-to-equity ratios is (approximately) normally distributed. It follows that it is appropriate to calculate the value of the test statistic t in the summary box. Furthermore, because the alternative hypothesis Ha: μ < 1.5 says to use

9.4 z Tests about a Population Proportion

In order to see how to test this kind of hypothesis, remember that when n is large, the sampling distribution of

(p̂ − p0) / √(p0(1 − p0)/n)

is approximately a standard normal distribution. Let p0 denote a specified value between 0 and 1 (its exact value will depend on the problem), and consider testing the null hypothesis H0: p = p0. We then have the following result:

A Large Sample Test about a Population Proportion

Null hypothesis: H0: p = p0
Test statistic: z = (p̂ − p0) / √(p0(1 − p0)/n)
Assumptions: np0 ≥ 5 and n(1 − p0) ≥ 5³

Critical value rule (reject H0 at level of significance α if):
Ha: p > p0: z > zα
Ha: p < p0: z < −zα
Ha: p ≠ p0: |z| > zα/2

p-value (reject H0 if p-value < α):
Ha: p > p0: p-value = area under the standard normal curve to the right of z
Ha: p < p0: p-value = area to the left of z
Ha: p ≠ p0: p-value = twice the area to the right of |z|
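The test statistics in the surrounding examples can be reproduced from the summary box formulas. This sketch uses the 15 ratios quoted in Example 9.4 and the counts quoted in the cheese spread example; the critical values are still read from the t and z tables as in the text.

```python
import math
import statistics

# Example 9.4: the 15 audited debt-to-equity ratios quoted in the text.
ratios = [1.05, 1.11, 1.19, 1.21, 1.22, 1.29, 1.31, 1.32, 1.33,
          1.37, 1.41, 1.45, 1.46, 1.65, 1.78]

# t statistic for H0: mu = 1.5 versus Ha: mu < 1.5 (df = n - 1 = 14).
n = len(ratios)
x_bar = statistics.mean(ratios)
s = statistics.stdev(ratios)
t = (x_bar - 1.5) / (s / math.sqrt(n))
print(round(t, 2))

# Cheese spread example: large-sample z test of H0: p = .10 versus
# Ha: p < .10 with p-hat = 63 / 1,000.
p_hat, p0, m = 63 / 1000, 0.10, 1000
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / m)
print(round(z, 2))
```

Both statistics are well below the −2.33 critical value used at the .01 significance level, so both null hypotheses are rejected, as in the text.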
EXAMPLE 9.6 The Cheese Spread Case: Improving Profitability

We have seen that the cheese spread producer wishes to test H0: p = .10 versus Ha: p < .10, where p is the proportion of all current purchasers who would stop buying the cheese spread if the new spout were used. The producer will use the new spout if H0 can be rejected in favor of Ha at the .01 level of significance. To perform the hypothesis test, we will randomly select n = 1,000 current purchasers of the cheese spread, find the proportion (p̂) of these purchasers who would stop buying the cheese spread if the new spout were used, and calculate the value of the test statistic z in the summary box. Then, because the alternative hypothesis Ha: p < .10 says to use the left-tailed critical value rule in the summary box, we will reject H0: p = .10 if the value of z is less than −zα = −z.01 = −2.33. (Note that using this procedure is valid because np0 = 1,000(.10) = 100 and n(1 − p0) = 1,000(1 − .10) = 900 are both at least 5.) Suppose that when the sample is randomly selected, we find that 63 of the 1,000 current purchasers say they would stop buying the cheese spread if the new spout were used. Because p̂ = 63/1,000 = .063, the value of the test statistic is

z = (p̂ − p0) / √(p0(1 − p0)/n) = (.063 − .10) / √(.10(1 − .10)/1,000) = −3.90

Because z = −3.90 is less than −z.01 = −2.33, we reject H0: p = .10 in favor of Ha: p < .10. That is, we conclude (at an α of .01) that the proportion of all current purchasers who would stop buying the cheese spread if the new spout were used is less than .10. It follows that the company will use the new spout. Furthermore, the point estimate p̂ = .063 says we estimate that 6.3 percent of all current customers would stop buying the cheese spread if the new spout were used.

[Margin figure: the rejection point −z.01 = −2.33 and the p-value = .00005 corresponding to z = −3.90.]

³ Some statisticians suggest using the more conservative rule that both np0 and n(1 − p0) must be at least 10.

Hypothesis testing summary boxes are featured throughout Chapter 9, Chapter 10 (two-sample procedures), Chapter 11 (one-way, randomized block, and two-way analysis of variance), Chapter 12 (chi-square tests of goodness of fit and independence), and the remainder of the book. In addition, emphasis is placed throughout on estimating practical importance after testing for statistical significance.

Chapters 13–18: Simple and multiple regression analysis. Model building. Logistic regression and neural networks. Time series forecasting. Control charts. Nonparametric statistics. Decision theory. Chapters 13–15 present predictive analytics methods that are based on parametric regression and time series models. Specifically, Chapter 13 and the first seven sections of Chapter 14 discuss simple and basic multiple regression analysis by using a more streamlined organization and The Tasty Sub Shop (revenue prediction) Case (see Figure 14.4). The next five sections of Chapter 14 present five advanced modeling topics that can be covered in any order without loss of continuity: dummy variables (including a discussion of interaction); quadratic and quantitative interaction variables; model building and the effects of multicollinearity; residual analysis and diagnosing


outlying and influential observations; and logistic regression (see Figure 14.36). The last section of Chapter 14 discusses neural networks and has logistic regression as a prerequisite. This section shows why neural network modeling is particularly useful when analyzing big data and how neural network models are used to make predictions (see Figures 14.37 and 14.38). Chapter 15 discusses time series forecasting, including Holt–Winters' exponential smoothing models, and refers readers to Appendix B (at the end of the book), which succinctly discusses the Box–Jenkins methodology. The book concludes with Chapter 16 (a clear discussion of control charts and process capability), Chapter 17 (nonparametric statistics), and Chapter 18 (decision theory, another useful predictive analytics topic).

Chapter 14   Multiple Regression and Model Building

Figure 14.4: Excel and Minitab Outputs of a Regression Analysis of the Tasty Sub Shop Revenue Data in Table 14.1 Using the Model y = β0 + β1x1 + β2x2 + ε

[Figure 14.4 reproduces annotated Excel and Minitab outputs for the Tasty Sub Shop revenue model. The annotations identify the least squares point estimates b0 = 125.289, b1 = 14.1996, and b2 = 22.8107, together with their standard errors, t statistics, and p-values; the standard error s = 36.6856; R² = 98.10% and adjusted R² = 97.56%; the explained variation 486,356, unexplained variation (SSE) 9,421, and total variation 495,777; the F(model) statistic 180.69 with p-value 9.46E-07; 95% confidence intervals for the βj; and, for the setting x1 = 47.3 and x2 = 7, the point prediction ŷ = 956.606 with standard error 15.0476, 95% confidence interval (921.024, 992.188), and 95% prediction interval (862.844, 1050.37). Figure 14.36 reproduces Minitab output of a logistic regression of the credit card upgrade data, including the deviance table (sources Regression, Purchases, PlatProfile, Error, and Total), coefficient estimates, goodness-of-fit tests, odds ratio estimates, and fitted upgrade probabilities.]
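The point prediction reported in the output can be reproduced directly from the printed least squares estimates. This is a sketch; the tiny last-digit difference from the output's 956.606 arises because the printed coefficients are rounded.

```python
# Least squares point estimates quoted from the Excel output in Figure 14.4.
b0, b1, b2 = 125.289, 14.1996, 22.8107

# Point prediction of yearly revenue at the setting used in the output:
# population x1 = 47.3 (thousands of residents), business rating x2 = 7.
x1, x2 = 47.3, 7
y_hat = b0 + b1 * x1 + b2 * x2
print(round(y_hat, 1))
```

The corresponding confidence and prediction intervals require the standard error of ŷ from the output and cannot be recomputed from the coefficients alone.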

residual—the difference between the restaurant's observed and predicted yearly revenues—fairly small (in magnitude). We define the least squares point estimates to be the values of b0, b1, and b2 that minimize SSE, the sum of squared residuals for the 10 restaurants.

The formula for the least squares point estimates of the parameters in a multiple regression model is expressed using a branch of mathematics called matrix algebra. This formula is presented in Bowerman, O'Connell, and Koehler (2005). In the main body of this book, we will rely on Excel and Minitab to compute the needed estimates. For example, consider the Excel and Minitab outputs in Figure 14.4. The Excel output tells us that the least squares point estimates of b0, b1, and b2 in the Tasty Sub Shop revenue model are b0 = 125.289, b1 = 14.1996, and b2 = 22.8107 (see 1, 2, and 3). The point estimate b1 = 14.1996 of b1 says that we estimate that mean yearly revenue increases by $14,199.60 when the population size increases by 1,000 residents and the business rating does not change. The point estimate

Figure 14.37: The Single Layer Perceptron. [Diagram: an input layer x1, x2, . . . , xk feeds a hidden layer of nodes H1(ℓ1), H2(ℓ2), . . . , Hm(ℓm), where ℓv = hv0 + hv1x1 + hv2x2 + . . . + hvkxk and Hv(ℓv) = (e^ℓv − 1)/(e^ℓv + 1); the output layer forms L = β0 + β1H1(ℓ1) + β2H2(ℓ2) + . . . + βmHm(ℓm) and applies g(L) = 1/(1 + e^−L) if the response variable is qualitative, or g(L) = L if the response variable is quantitative.]

1 An input layer consisting of the predictor variables x1, x2, . . . , xk under consideration.
2 A single hidden layer consisting of m hidden nodes. At the vth hidden node, for v = 1, 2, . . . , m, we form a linear combination ℓv of the k predictor variables:

ℓv = hv0 + hv1x1 + hv2x2 + . . . + hvkxk

Here, hv0, hv1, . . . , hvk are unknown parameters that must be estimated from the sample data. Having formed ℓv, we then specify a hidden node function Hv(ℓv) of ℓv. This hidden node function, which is also called an activation function, is usually nonlinear. The activation function used by JMP is

Hv(ℓv) = (e^ℓv − 1)/(e^ℓv + 1)

[Noting that (e^2x − 1)/(e^2x + 1) is the hyperbolic tangent function of the variable x, it follows that Hv(ℓv) is the hyperbolic tangent function of x = .5ℓv.] For example, at nodes 1, 2, . . . , m, we specify
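The bracketed remark about the hyperbolic tangent can be checked numerically: the activation (e^ℓ − 1)/(e^ℓ + 1) is exactly tanh(ℓ/2). The value ℓ = 1.132562 below is the one that reappears in the Figure 14.38 calculations.

```python
import math

# Hidden node (activation) function as given above.
def H(l):
    return (math.exp(l) - 1.0) / (math.exp(l) + 1.0)

# Verify the identity H(l) = tanh(0.5 * l) on a grid of values.
for l in (-3.0, -0.5, 0.0, 1.132562, 4.0):
    assert abs(H(l) - math.tanh(0.5 * l)) < 1e-12

# Value that reappears in the Figure 14.38 worked example:
print(round(H(1.132562), 4))
```

Like tanh, this activation is bounded between −1 and 1 and is approximately linear near ℓ = 0, which is why it behaves well as a hidden node function.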

Figure 14.36: Minitab Output of a Logistic Regression of the Credit Card Upgrade Data. [The output includes the coefficient estimates (Constant −10.68, Purchases 0.2264 with standard error 0.0921, PlatProfile 3.84), their Z-values and p-values, deviance R² = 65.10%, goodness-of-fit tests (deviance, Pearson, and Hosmer-Lemeshow), the odds ratio estimate 1.2541 with 95% CI (1.0469, 1.5024) for Purchases, the odds ratio estimate 46.7564 for PlatProfile, and fitted upgrade probabilities 0.943012 when Purchases = 42.571 and PlatProfile = 1 and 0.742486 when Purchases = 51.835 and PlatProfile = 0.]

The odds ratio estimate of 1.25 for Purchases says that for each increase of $1,000 in last year's purchases by a Silver card holder, we estimate that the Silver card holder's odds of upgrading increase by 25 percent. The odds ratio estimate of 46.76 for PlatProfile says that we estimate that the odds of upgrading for a Silver card holder who conforms to the bank's Platinum profile are 46.76 times larger than the odds of upgrading for a Silver card holder who does not conform to the bank's Platinum profile, if both Silver card holders had the same amount of purchases last year. Finally, the bottom of the Minitab output says that we estimate that

• The upgrade probability for a Silver card holder who had purchases of $42,571 last year and conforms to the bank's Platinum profile is

e^(−10.68 + .2264(42.571) + 3.84(1)) / (1 + e^(−10.68 + .2264(42.571) + 3.84(1))) = .9430

• The upgrade probability for a Silver card holder who had purchases of $51,835 last year and does not conform to the bank's Platinum profile is

e^(−10.68 + .2264(51.835) + 3.84(0)) / (1 + e^(−10.68 + .2264(51.835) + 3.84(0))) = .7425

Figure 14.38: JMP Output of Neural Network Estimation for the Credit Card Upgrade Data (DS CardUpgrade; Validation: Random Holdback; Model: NTanH(3)). [The output lists the parameter estimates and shows how they are used to estimate the upgrade probability for Silver card holder 42 (Purchases = 51.835, JDPlatProfile = 1):

ℓ̂1 = ĥ10 + ĥ11(Purchases) + ĥ12(JDPlatProfile) = −4.34324 + .113579(51.835) + .495872(1) = 2.0399995, so H1(ℓ̂1) = (e^2.0399995 − 1)/(e^2.0399995 + 1) = .7698664933

ℓ̂2 = ĥ20 + ĥ21(Purchases) + ĥ22(JDPlatProfile) = −2.28505 + .062612(51.835) + .172119(1) = 1.132562, so H2(ℓ̂2) = (e^1.132562 − 1)/(e^1.132562 + 1) = .5126296505

ℓ̂3 = ĥ30 + ĥ31(Purchases) + ĥ32(JDPlatProfile) = −1.1118 + .023852(51.835) + .93322(1) = 1.0577884, so H3(ℓ̂3) = (e^1.0577884 − 1)/(e^1.0577884 + 1) = .4845460029

L̂ = b0 + b1H1(ℓ̂1) + b2H2(ℓ̂2) + b3H3(ℓ̂3) = −7.26818 − 201.382(.7698664933) + 236.2743(.5126296505) + 81.97204(.4845460029) = −1.464996

g(L̂) = 1/(1 + e^−L̂) = 1/(1 + e^1.464996) = .1877344817

The bottom of the output lists, for Silver card holders 1, 17, 33, 40, 41, and 42, the estimated probabilities of Upgrade = 0 and Upgrade = 1, the hidden node values, and the resulting "most likely" classification.]

card holders who have not yet been sent an upgrade offer and for whom we wish to estimate the probability of upgrading. Silver card holder 42 had purchases last year of $51,835 (Purchases = 51.835) and did not conform to the bank's Platinum profile (PlatProfile = 0). Because PlatProfile = 0, we have JDPlatProfile = 1. Figure 14.38 shows the parameter estimates for the neural network model based on the training data set and how they are used to estimate the probability that Silver card holder 42 would upgrade. Note that because the response variable Upgrade is qualitative, the output layer function is g(L) = 1/(1 + e^−L). The final result obtained in the calculations, g(L̂) = .1877344817, is an estimate of the probability that Silver card holder 42 would not upgrade (Upgrade = 0). This implies that the estimate of the probability that Silver card holder 42 would upgrade is 1 − .1877344817 = .8122655183. If we predict a Silver card holder would upgrade if and only if his or her upgrade probability is at least .5, then Silver card holder 42 is predicted to upgrade (as is Silver card holder 41). JMP uses the model fit to the training data set to calculate an upgrade probability estimate for each of the 67 percent of the Silver card holders in the training data set and for each of the 33 percent of the Silver card holders in the validation data set. If a particular Silver card holder's upgrade probability estimate is at least .5, JMP predicts an upgrade for the card holder and assigns a "most likely" qualitative value of 1 to the card holder. Otherwise, JMP assigns a "most likely" qualitative value of 0 to the card holder. At the bottom of Figure 14.38, we show the results of JMP doing this for Silver card holders 1, 17, 33, and 40. Specifically, JMP predicts an upgrade (1) for card holders 17 and 33, but only card holder 33 did upgrade. JMP predicts a nonupgrade (0) for card holders 1 and 40, and neither of these card holders upgraded. The "confusion matrices" in Figure 14.39 summarize

Neural Networks (Optional)

The idea behind neural network modeling is to represent the response variable as a nonlinear function of linear combinations of the predictor variables. The simplest but most widely used neural network model is called the single-hidden-layer, feedforward neural network. This model, which is also sometimes called the single-layer perceptron, is motivated (like all neural network models) by the connections of the neurons in the human brain. As illustrated in Figure 14.37, this model involves:
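The forward pass summarized in Figure 14.38 can be reproduced from the printed parameter estimates alone. This is a sketch; because the printed estimates are rounded, the result can differ from the output's .1877344817 in later decimal places.

```python
import math

# JMP activation from the passage: H(l) = (e**l - 1) / (e**l + 1).
def H(l):
    return (math.exp(l) - 1.0) / (math.exp(l) + 1.0)

# Parameter estimates printed in Figure 14.38, applied to Silver card
# holder 42 (Purchases = 51.835; PlatProfile = 0, so JDPlatProfile = 1).
purchases, jd = 51.835, 1.0

l1 = -4.34324 + 0.113579 * purchases + 0.495872 * jd
l2 = -2.28505 + 0.062612 * purchases + 0.172119 * jd
l3 = -1.1118 + 0.023852 * purchases + 0.93322 * jd

# Output layer: linear combination for Upgrade = 0, then logistic link.
L = -7.26818 - 201.382 * H(l1) + 236.2743 * H(l2) + 81.97204 * H(l3)
p_no_upgrade = 1.0 / (1.0 + math.exp(-L))
p_upgrade = 1.0 - p_no_upgrade

print(round(p_no_upgrade, 4))  # estimated P(Upgrade = 0)
print(round(p_upgrade, 4))     # estimated P(Upgrade = 1)
```

Because the estimated upgrade probability exceeds .5, card holder 42 is predicted to upgrade, matching the classification shown at the bottom of the output.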


WHAT SOFTWARE IS AVAILABLE

MEGASTAT® FOR MICROSOFT EXCEL® 2003, 2007, AND 2010 (AND EXCEL: MAC 2011)
MegaStat is a full-featured Excel add-in by J. B. Orris of Butler University that is available
with this text. It performs statistical analyses within an Excel workbook. It does basic
functions such as descriptive statistics, frequency distributions, and probability calculations,
as well as hypothesis testing, ANOVA, and regression.
MegaStat output is carefully formatted. Ease-of-use features include AutoExpand for quick
data selection and Auto Label detect. Since MegaStat is easy to use, students can focus on
learning statistics without being distracted by the software. MegaStat is always available
from Excel’s main menu. Selecting a menu item pops up a dialog box. MegaStat works with
all recent versions of Excel.

MINITAB®
Minitab® Student Version 17 is available to help students solve the business statistics exercises in the text. This software is available in the student version and can be packaged with
any McGraw-Hill business statistics text.

TEGRITY CAMPUS: LECTURES 24/7
Tegrity Campus is a service that makes class time available 24/7. With Tegrity Campus, you
can automatically capture every lecture in a searchable format for students to review when
they study and complete assignments. With a simple one-click start-and-stop process, you
capture all computer screens and corresponding audio. Students can replay any part of any
class with easy-to-use browser-based viewing on a PC or Mac.
Educators know that the more students can see, hear, and experience class resources, the
better they learn. In fact, studies prove it. With Tegrity Campus, students quickly recall key
moments by using Tegrity Campus’s unique search feature. This search helps students efficiently find what they need, when they need it, across an entire semester of class recordings.
Help turn all your students’ study time into learning moments immediately supported by your
lecture. To learn more about Tegrity, watch a two-minute Flash demo at http://tegritycampus.mhhe.com.





ACKNOWLEDGMENTS
We wish to thank many people who have helped to
make this book a reality. As indicated on the title page,
we thank Professor Steven C. Huchendorf, University
of Minnesota; Dawn C. Porter, University of Southern
California; and Patrick J. Schur, Miami University; for
major contributions to this book. We also thank Susan
Cramer of Miami University for very helpful advice on
writing this new edition.
We also wish to thank the people at McGraw-Hill
for their dedication to this book. These people include senior brand manager Dolly Womack, who is
extremely helpful to the authors; senior development
editor Camille Corum, who has shown great dedication
to the improvement of this book; content project manager Harvey Yep, who has very capably and diligently
guided this book through its production and who has
been a tremendous help to the authors; and our former
executive editor Steve Scheutz, who always greatly
supported our books. We also thank executive editor

Michelle Janicek for her tremendous help in developing this new edition; our former executive editor Scott
Isenberg for the tremendous help he has given us in
developing all of our McGraw-Hill business statistics
books; and our former executive editor Dick Hercher,
who persuaded us to publish with McGraw-Hill.
We also wish to thank Sylvia Taylor and Nicoleta
Maghear, Hampton University, for accuracy checking Connect content; Patrick Schur, Miami University,

for developing learning resources; Ronny Richardson,
Kennesaw State University, for revising the instructor
PowerPoints and developing new guided examples and
learning resources; Denise Krallman, Miami University,
for updating the Test Bank; and James Miller, Dominican University, and Anne Drougas, Dominican University, for developing learning resources for the new
business analytics content. Most importantly, we wish
to thank our families for their acceptance, unconditional
love, and support.




DEDICATION
Bruce L. Bowerman
To my wife, children, sister, and other family members:
Drena
Michael, Jinda, Benjamin, and Lex
Asa, Nicole, and Heather
Susan
Barney, Fiona, and Radeesa
Daphne, Chloe, and Edgar
Gwyneth and Tony
Callie, Bobby, Marmalade, Randy, and Penney
Clarence, Quincy, Teddy, Julius, Charlie, Sally, Milo, Zeke,

Bunch, Big Mo, Ozzie, Harriet, Sammy, Louise, Pat, Taylor,
and Jamie
Richard T. O’Connell
To my children and grandchildren:
Christopher, Bradley, Sam, and Joshua
Emily S. Murphree
To Kevin and the Math Ladies




CHAPTER-BY-CHAPTER REVISIONS FOR 8TH EDITION
Chapter 1

• Initial example made clearer.
• Two new graphical examples added to better introduce quantitative and qualitative variables.
• How to select random (and other types of) samples moved from Chapter 7 to Chapter 1 and combined with examples introducing statistical inference.
• New subsection on statistical modeling added.
• More on surveys and errors in surveys moved from Chapter 7 to Chapter 1.
• New optional section introducing business analytics and data mining added.
• Sixteen new exercises added.

Chapter 2

• Thirteen new data sets added for this chapter on
graphical descriptive methods.
• Fourteen new exercises added.
• New optional section on descriptive analytics
added.
Chapter 3

• Twelve new data sets added for this chapter on numerical descriptive methods.
• Twenty-three new exercises added.
• Four new optional sections on predictive analytics added:
  one section on classification trees and regression trees;
  one section on hierarchical clustering and k-means clustering;
  one section on factor analysis;
  one section on association rule mining.

Chapter 4

• New subsection on probability modeling added.
• Exercises updated in this and all subsequent
chapters.

Chapter 5

• Discussion of general discrete probability distributions, the binomial distribution, the Poisson distribution, and the hypergeometric distribution simplified and shortened.
Chapter 6

• Discussion of continuous probability distributions
and normal plots simplified and shortened.

Chapter 7

• This chapter covers the sampling distribution of

the sample mean and the sampling distribution of
the sample proportion; as stated above, the material
on how to select samples and errors in surveys has
been moved to Chapter 1.

Chapter 8


• No significant changes when discussing confidence
intervals.

Chapter 9

• Discussion of formulating the null and alternative
hypotheses completely rewritten and expanded.
• Discussion of using critical value rules and
p-values to test a population mean completely
rewritten; development of and instructions
for using hypothesis testing summary boxes
improved.
• Short presentation of the logic behind finding
the probability of a Type II error when testing
a two-sided alternative hypothesis now
accompanies the general formula for calculating
this probability.
Chapter 10

• Statistical inference for a single population variance
and comparing two population variances moved
from its own chapter (the former Chapter 11) to
Chapter 10.
• More explicit examples of using hypothesis testing
summary boxes when comparing means, proportions, and variances.
Chapter 11

• New exercises for one-way, randomized block, and

two-way analysis of variance, with added emphasis

on students doing complete statistical analyses.

Chapter 12

• No significant changes when discussing chi-square
tests.

Chapter 13

• Discussion of basic simple linear regression analysis streamlined, with discussion of r² moved up and discussions of t and F tests combined into one section.
• Section on residual analysis significantly shortened and improved.
• New exercises, with emphasis on students doing complete statistical analyses on their own.

Chapter 14

• Discussion of R² moved up.
• Discussion of backward elimination added.
• New subsection on model validation and PRESS added.
• Section on logistic regression expanded.
• New section on neural networks added.
• New exercises, with emphasis on students doing complete statistical analyses on their own.

Chapter 15

• Discussion of the Box–Jenkins methodology slightly expanded and moved to Appendix B (at the end of the book).
• New time series exercises, with emphasis on students doing complete statistical analyses on their own.

Chapters 16, 17, and 18

• No significant changes. (These were the former Chapters 17, 18, and 19 on control charts, nonparametrics, and decision theory.)


BRIEF CONTENTS
Chapter 1  An Introduction to Business Statistics and Analytics  2
Chapter 2  Descriptive Statistics: Tabular and Graphical Methods and Descriptive Analytics  54
Chapter 3  Descriptive Statistics: Numerical Methods and Some Predictive Analytics  134
Chapter 4  Probability and Probability Models  220
Chapter 5  Discrete Random Variables  254
Chapter 6  Continuous Random Variables  288
Chapter 7  Sampling Distributions  326
Chapter 8  Confidence Intervals  346
Chapter 9  Hypothesis Testing  382
Chapter 10  Statistical Inferences Based on Two Samples  428
Chapter 11  Experimental Design and Analysis of Variance  464
Chapter 12  Chi-Square Tests  504
Chapter 13  Simple Linear Regression Analysis  530
Chapter 14  Multiple Regression and Model Building  590
Chapter 15  Time Series Forecasting and Index Numbers  680
Chapter 16  Process Improvement Using Control Charts  726
Chapter 17  Nonparametric Methods  778
Chapter 18  Decision Theory  808
Appendix A  Statistical Tables  828
Appendix B  An Introduction to Box–Jenkins Models  852
Answers to Most Odd-Numbered Exercises  863
References  871
Photo Credits  873
Index  875


CONTENTS
Chapter 1
An Introduction to Business Statistics
and Analytics
1.1 ■ Data  3
1.2 ■ Data Sources, Data Warehousing, and Big
Data  6

1.3 ■ Populations, Samples, and Traditional
Statistics  8
1.4 ■ Random Sampling, Three Case Studies That
Illustrate Statistical Inference, and Statistical
Modeling  10
1.5 ■ Business Analytics and Data Mining
(Optional)  21
1.6 ■ Ratio, Interval, Ordinal, and Nominative Scales
of Measurement (Optional)  25
1.7 ■ Stratified Random, Cluster, and Systematic
Sampling (Optional)  27
1.8 ■ More about Surveys and Errors in Survey
Sampling (Optional)  29
Appendix 1.1 ■ Getting Started with Excel  36
Appendix 1.2 ■ Getting Started with MegaStat  43
Appendix 1.3 ■ Getting Started with Minitab  46

Chapter 2
Descriptive Statistics: Tabular and Graphical
Methods and Descriptive Analytics
2.1 ■ Graphically Summarizing Qualitative Data  55
2.2 ■ Graphically Summarizing Quantitative Data  61
2.3 ■ Dot Plots  75
2.4 ■ Stem-and-Leaf Displays  76
2.5 ■ Contingency Tables (Optional)  81
2.6 ■ Scatter Plots (Optional)  87
2.7 ■ Misleading Graphs and Charts (Optional)  89
2.8 ■ Descriptive Analytics (Optional)  92
Appendix 2.1 ■ Tabular and Graphical Methods Using
Excel  103

Appendix 2.2 ■ Tabular and Graphical Methods Using
MegaStat  121

Appendix 2.3 ■ Tabular and Graphical Methods Using
Minitab  125

Chapter 3
Descriptive Statistics: Numerical Methods
and Some Predictive Analytics
Part 1 ■ Numerical Methods of Descriptive Statistics
3.1 ■ Describing Central Tendency  135
3.2 ■ Measures of Variation  145
3.3 ■ Percentiles, Quartiles, and Box-and-Whiskers
Displays  155
3.4 ■ Covariance, Correlation, and the Least Squares
Line (Optional)  161
3.5 ■ Weighted Means and Grouped Data
(Optional)  166
3.6 ■ The Geometric Mean (Optional)  170
Part 2 ■ Some Predictive Analytics (Optional)
3.7 ■ Decision Trees: Classification Trees and
Regression Trees (Optional)  172
3.8 ■ Cluster Analysis and Multidimensional Scaling
(Optional)  184
3.9 ■ Factor Analysis (Optional and Requires
Section 3.4)  192
3.10 ■ Association Rules (Optional)  198
Appendix 3.1 ■ Numerical Descriptive Statistics
Using Excel  207
Appendix 3.2 ■ Numerical Descriptive Statistics

Using MegaStat  210
Appendix 3.3 ■ Numerical Descriptive Statistics
Using Minitab  212
Appendix 3.4 ■ Analytics Using JMP  216

Chapter 4
Probability and Probability Models
4.1 ■ Probability, Sample Spaces, and Probability
Models  221
4.2 ■ Probability and Events  224
4.3 ■ Some Elementary Probability Rules  229
4.4 ■ Conditional Probability and Independence  235


4.5 ■ Bayes’ Theorem (Optional)  243
4.6 ■ Counting Rules (Optional)  247

Chapter 5
Discrete Random Variables
5.1 ■ Two Types of Random Variables  255
5.2 ■ Discrete Probability Distributions  256
5.3 ■ The Binomial Distribution  263
5.4 ■ The Poisson Distribution (Optional)  274
5.5 ■ The Hypergeometric Distribution (Optional)  278
5.6 ■ Joint Distributions and the Covariance (Optional)  280
Appendix 5.1 ■ Binomial, Poisson, and Hypergeometric Probabilities Using Excel  284
Appendix 5.2 ■ Binomial, Poisson, and Hypergeometric Probabilities Using MegaStat  286
Appendix 5.3 ■ Binomial, Poisson, and Hypergeometric Probabilities Using Minitab  287

Chapter 6
Continuous Random Variables
6.1 ■ Continuous Probability Distributions  289
6.2 ■ The Uniform Distribution  291
6.3 ■ The Normal Probability Distribution  294
6.4 ■ Approximating the Binomial Distribution by Using the Normal Distribution (Optional)  310
6.5 ■ The Exponential Distribution (Optional)  313
6.6 ■ The Normal Probability Plot (Optional)  316
Appendix 6.1 ■ Normal Distribution Using Excel  321
Appendix 6.2 ■ Normal Distribution Using MegaStat  322
Appendix 6.3 ■ Normal Distribution Using Minitab  323

Chapter 7
Sampling Distributions
7.1 ■ The Sampling Distribution of the Sample Mean  327
7.2 ■ The Sampling Distribution of the Sample Proportion  339
7.3 ■ Derivation of the Mean and the Variance of the Sample Mean (Optional)  342

Chapter 8
Confidence Intervals
8.1 ■ z-Based Confidence Intervals for a Population Mean: σ Known  347
8.2 ■ t-Based Confidence Intervals for a Population Mean: σ Unknown  355
8.3 ■ Sample Size Determination  364
8.4 ■ Confidence Intervals for a Population Proportion  367
8.5 ■ Confidence Intervals for Parameters of Finite Populations (Optional)  373
Appendix 8.1 ■ Confidence Intervals Using Excel  379
Appendix 8.2 ■ Confidence Intervals Using MegaStat  380
Appendix 8.3 ■ Confidence Intervals Using Minitab  381

Chapter 9
Hypothesis Testing
9.1 ■ The Null and Alternative Hypotheses and Errors in Hypothesis Testing  383
9.2 ■ z Tests about a Population Mean: σ Known  390
9.3 ■ t Tests about a Population Mean: σ Unknown  402
9.4 ■ z Tests about a Population Proportion  406
9.5 ■ Type II Error Probabilities and Sample Size Determination (Optional)  411
9.6 ■ The Chi-Square Distribution  417
9.7 ■ Statistical Inference for a Population Variance (Optional)  418
Appendix 9.1 ■ One-Sample Hypothesis Testing Using Excel  424
Appendix 9.2 ■ One-Sample Hypothesis Testing Using MegaStat  425
Appendix 9.3 ■ One-Sample Hypothesis Testing Using Minitab  426

Chapter 10
Statistical Inferences Based on Two Samples
10.1 ■ Comparing Two Population Means by Using Independent Samples  429
10.2 ■ Paired Difference Experiments  439
10.3 ■ Comparing Two Population Proportions by Using Large, Independent Samples  445
10.4 ■ The F Distribution  451
10.5 ■ Comparing Two Population Variances by Using Independent Samples  453
Appendix 10.1 ■ Two-Sample Hypothesis Testing Using Excel  459
Appendix 10.2 ■ Two-Sample Hypothesis Testing Using MegaStat  460
Appendix 10.3 ■ Two-Sample Hypothesis Testing Using Minitab  462

Chapter 11
Experimental Design and Analysis
of Variance
11.1 ■ Basic Concepts of Experimental Design  465
11.2 ■ One-Way Analysis of Variance  467

11.3 ■ The Randomized Block Design  479
11.4 ■ Two-Way Analysis of Variance  485
Appendix 11.1 ■ Experimental Design and Analysis
of Variance Using Excel  497
Appendix 11.2 ■ Experimental Design and Analysis
of Variance Using MegaStat  498
Appendix 11.3 ■ Experimental Design and Analysis
of Variance Using Minitab  500

Chapter 12
Chi-Square Tests
12.1 ■ Chi-Square Goodness-of-Fit Tests  505
12.2 ■ A Chi-Square Test for Independence  514
Appendix 12.1 ■ Chi-Square Tests Using Excel  523
Appendix 12.2 ■ Chi-Square Tests Using
MegaStat  525
Appendix 12.3 ■ Chi-Square Tests Using
Minitab  527

Chapter 13
Simple Linear Regression Analysis
13.1 ■ The Simple Linear Regression Model and
the Least Squares Point Estimates  531
13.2 ■ Simple Coefficients of Determination and
Correlation  543
13.3 ■ Model Assumptions and the Standard
Error  548
13.4 ■ Testing the Significance of the Slope and
y-Intercept  551
13.5 ■ Confidence and Prediction Intervals  559

13.6 ■ Testing the Significance of the Population
Correlation Coefficient (Optional)  564
13.7 ■ Residual Analysis  565

Appendix 13.1 ■ Simple Linear Regression Analysis
Using Excel  583
Appendix 13.2 ■ Simple Linear Regression Analysis
Using MegaStat  585
Appendix 13.3 ■ Simple Linear Regression Analysis
Using Minitab  587

Chapter 14
Multiple Regression and Model Building
14.1 ■ The Multiple Regression Model and the Least
Squares Point Estimates  591
14.2 ■ R2 and Adjusted R2  601
14.3 ■ Model Assumptions and the Standard
Error  603
14.4 ■ The Overall F Test  605
14.5 ■ Testing the Significance of an Independent
Variable  607
14.6 ■ Confidence and Prediction Intervals  611
14.7 ■ The Sales Representative Case: Evaluating
Employee Performance  614
14.8 ■ Using Dummy Variables to Model Qualitative
Independent Variables (Optional)  616
14.9 ■ Using Squared and Interaction Variables
(Optional)  625
14.10 ■ Multicollinearity, Model Building, and Model
Validation (Optional)  631

14.11 ■ Residual Analysis and Outlier Detection in
Multiple Regression (Optional)  642
14.12 ■ Logistic Regression (Optional)  647
14.13 ■ Neural Networks (Optional)  653
Appendix 14.1 ■ Multiple Regression Analysis Using
Excel  666
Appendix 14.2 ■ Multiple Regression Analysis Using
MegaStat  668
Appendix 14.3 ■ Multiple Regression Analysis Using
Minitab  671
Appendix 14.4 ■ Neural Network Analysis in
JMP  677

Chapter 15
Time Series Forecasting and Index Numbers
15.1 ■ Time Series Components and Models  681
15.2 ■ Time Series Regression  682
15.3 ■ Multiplicative Decomposition  691
15.4 ■ Simple Exponential Smoothing  699
15.5 ■ Holt–Winters’ Models  704



15.6 ■ Forecast Error Comparisons  712
15.7 ■ Index Numbers  713
Appendix 15.1 ■ Time Series Analysis Using
Excel  722
Appendix 15.2 ■ Time Series Analysis Using
MegaStat  723
Appendix 15.3 ■ Time Series Analysis Using
Minitab  725

Chapter 16
Process Improvement Using Control Charts
16.1 ■ Quality: Its Meaning and a Historical Perspective  727
16.2 ■ Statistical Process Control and Causes of Process Variation  731
16.3 ■ Sampling a Process, Rational Subgrouping, and Control Charts  734
16.4 ■ x̄ and R Charts  738
16.5 ■ Comparison of a Process with Specifications: Capability Studies  754
16.6 ■ Charts for Fraction Nonconforming  762
16.7 ■ Cause-and-Effect and Defect Concentration Diagrams (Optional)  768
Appendix 16.1 ■ Control Charts Using MegaStat  775
Appendix 16.2 ■ Control Charts Using Minitab  776

Chapter 17
Nonparametric Methods
17.1 ■ The Sign Test: A Hypothesis Test about the Median  780
17.2 ■ The Wilcoxon Rank Sum Test  784
17.3 ■ The Wilcoxon Signed Ranks Test  789
17.4 ■ Comparing Several Populations Using the Kruskal–Wallis H Test  794
17.5 ■ Spearman’s Rank Correlation Coefficient  797
Appendix 17.1 ■ Nonparametric Methods Using MegaStat  802
Appendix 17.2 ■ Nonparametric Methods Using Minitab  805

Chapter 18
Decision Theory
18.1 ■ Introduction to Decision Theory  809
18.2 ■ Decision Making Using Posterior Probabilities  815
18.3 ■ Introduction to Utility Theory  823

Appendix A
Statistical Tables  828

Appendix B
An Introduction to Box–Jenkins Models  852
Answers to Most Odd-Numbered
Exercises  863
References  871
Photo Credits  873

Index  875


Business Statistics in Practice
Using Modeling, Data, and Analytics
EIGHTH EDITION

CHAPTER 1

An Introduction to Business Statistics and Analytics
Learning Objectives
When you have mastered the material in this chapter, you will be able to:
LO1-1 Define a variable.
LO1-2 Describe the difference between a quantitative variable and a qualitative variable.
LO1-3 Describe the difference between cross-sectional data and time series data.
LO1-4 Construct and interpret a time series (runs) plot.
LO1-5 Identify the different types of data sources: existing data sources, experimental studies, and observational studies.
LO1-6 Explain the basic ideas of data warehousing and big data.
LO1-7 Describe the difference between a population and a sample.
LO1-8 Distinguish between descriptive statistics and statistical inference.
LO1-9 Explain the concept of random sampling and select a random sample.
LO1-10 Explain the basic concept of statistical modeling.
LO1-11 Explain some of the uses of business analytics and data mining (Optional).
LO1-12 Identify the ratio, interval, ordinal, and nominative scales of measurement (Optional).
LO1-13 Describe the basic ideas of stratified random, cluster, and systematic sampling (Optional).
LO1-14 Describe basic types of survey questions, survey procedures, and sources of error (Optional).

Chapter Outline
1.1 Data
1.2 Data Sources, Data Warehousing, and Big Data
1.3 Populations, Samples, and Traditional Statistics
1.4 Random Sampling, Three Case Studies That Illustrate Statistical Inference, and Statistical Modeling
1.5 Business Analytics and Data Mining (Optional)
1.6 Ratio, Interval, Ordinal, and Nominative Scales of Measurement (Optional)
1.7 Stratified Random, Cluster, and Systematic Sampling (Optional)
1.8 More about Surveys and Errors in Survey Sampling (Optional)


The subject of statistics involves the study of how to collect, analyze, and interpret data. Data are facts and figures from which conclusions can be drawn. Such conclusions are important to the decision making of many professions and organizations. For example, economists use conclusions drawn from the latest data on unemployment and inflation to help the government make policy decisions. Financial planners use recent trends in stock market prices and economic conditions to make investment decisions. Accountants use sample data concerning a company’s actual sales revenues to assess whether the company’s claimed sales revenues are valid. Marketing professionals and data miners help businesses decide which products to develop and market and which consumers to target in marketing campaigns by using data that reveal consumer preferences. Production supervisors use manufacturing data to evaluate, control, and improve product quality. Politicians rely on data from public opinion polls to formulate legislation and to devise campaign strategies. Physicians and hospitals use data on the effectiveness of drugs and surgical procedures to provide patients with the best possible treatment.

In this chapter we begin to see how we collect and analyze data. As we proceed through the chapter, we introduce several case studies. These case studies (and others to be introduced later) are revisited throughout later chapters as we learn the statistical methods needed to analyze them. Briefly, we will begin to study four cases:

The Cell Phone Case: A bank estimates its cellular phone costs and decides whether to outsource management of its wireless resources by studying the calling patterns of its employees.

The Marketing Research Case: A beverage company investigates consumer reaction to a new bottle design for one of its popular soft drinks.

The Car Mileage Case: To determine if it qualifies for a federal tax credit based on fuel economy, an automaker studies the gas mileage of its new midsize model.

The Disney Parks Case: Walt Disney World Parks and Resorts in Orlando, Florida, manages Disney parks worldwide and uses data gathered from its guests to give these guests a more “magical” experience and increase Disney revenues and profits.

1.1 Data

LO1-1 Define a variable.

Data sets, elements, and variables

We have said that data are facts and figures from which conclusions can be drawn. Together,
the data that are collected for a particular study are referred to as a data set. For example,
Table 1.1 is a data set that gives information about the new homes sold in a Florida luxury
home development over a recent three-month period. Potential home buyers could choose
either the “Diamond” or the “Ruby” home model design and could have the home built on
either a lake lot or a treed lot (with no water access).
In order to understand the data in Table 1.1, note that any data set provides information
about some group of individual elements, which may be people, objects, events, or other
entities. The information that a data set provides about its elements usually describes one or
more characteristics of these elements.
Any characteristic of an element is called a variable.
Table 1.1  A Data Set Describing Five Home Sales  (DS HomeSales)

Home   Model Design   Lot Type   List Price   Selling Price
1      Diamond        Lake       $494,000     $494,000
2      Ruby           Treed      $447,000     $398,000
3      Diamond        Treed      $494,000     $440,000
4      Diamond        Treed      $494,000     $469,000
5      Ruby           Lake       $447,000     $447,000
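To make the element/variable structure concrete, the five home sales in Table 1.1 can be represented in a short program. This sketch is ours, not the book’s: the variable names and the list-of-dictionaries representation are illustrative choices, with each dictionary an element (a sold home) and each key a variable.

```python
# Illustrative sketch (not from the text): Table 1.1 as a small data set.
# Each dictionary is an element (a sold home); each key is a variable.
home_sales = [
    {"home": 1, "model": "Diamond", "lot": "Lake",  "list_price": 494_000, "selling_price": 494_000},
    {"home": 2, "model": "Ruby",    "lot": "Treed", "list_price": 447_000, "selling_price": 398_000},
    {"home": 3, "model": "Diamond", "lot": "Treed", "list_price": 494_000, "selling_price": 440_000},
    {"home": 4, "model": "Diamond", "lot": "Treed", "list_price": 494_000, "selling_price": 469_000},
    {"home": 5, "model": "Ruby",    "lot": "Lake",  "list_price": 447_000, "selling_price": 447_000},
]

# The executive's observation, computed: homes on lake lots sold at list
# price, while homes on treed lots sold below list.
lake_discounts  = [h["list_price"] - h["selling_price"] for h in home_sales if h["lot"] == "Lake"]
treed_discounts = [h["list_price"] - h["selling_price"] for h in home_sales if h["lot"] == "Treed"]

print(lake_discounts)   # [0, 0]
print(treed_discounts)  # [49000, 54000, 25000]
```

Computed this way, the data echo the executive’s observation: the two lake-lot homes sold at their list prices (discounts of $0), while the three treed-lot homes sold $25,000 to $54,000 below list.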


LO1-2 Describe the difference between a quantitative variable and a qualitative variable.

Table 1.2  2014 MLB Payrolls  (DS MLB)

Team                     2014 Payroll (millions of dollars)
Los Angeles Dodgers      235
New York Yankees         204
Philadelphia Phillies    180
Boston Red Sox           163
Detroit Tigers           162
Los Angeles Angels       156
San Francisco Giants     154
Texas Rangers            136
Washington Nationals     135
Toronto Blue Jays        133
Arizona Diamondbacks     113
Cincinnati Reds          112
St. Louis Cardinals      111
Atlanta Braves           111
Baltimore Orioles        107
Milwaukee Brewers        104
Colorado Rockies          96
Seattle Mariners          92
Kansas City Royals        92
Chicago White Sox         91
San Diego Padres          90
New York Mets             89
Chicago Cubs              89
Minnesota Twins           86
Oakland Athletics         83
Cleveland Indians         83
Pittsburgh Pirates        78
Tampa Bay Rays            77
Miami Marlins             48
Houston Astros            45

Source: ut.com/od/newsrumors/fl/2014-Major-League-Baseball-Team-Payrolls.htm (accessed January 14, 2015).


For the data set in Table 1.1, each sold home is an element, and four variables are used to
describe the homes. These variables are (1) the home model design, (2) the type of lot on
which the home was built, (3) the list (asking) price, and (4) the (actual) selling price. Moreover, each home model design came with “everything included”—specifically, a complete,
luxury interior package and a choice (at no price difference) of one of three different architectural exteriors. The builder made the list price of each home solely dependent on the model
design. However, the builder gave various price reductions for homes built on treed lots.

The data in Table 1.1 are real (with some minor changes to protect privacy) and were provided by a business executive—a friend of the authors—who recently received a promotion
and needed to move to central Florida. While searching for a new home, the executive and his
family visited the luxury home community and decided they wanted to purchase a Diamond
model on a treed lot. The list price of this home was $494,000, but the developer offered to
sell it for an “incentive” price of $469,000. Intuitively, the incentive price’s $25,000 savings
off list price seemed like a good deal. However, the executive resisted making an immediate decision. Instead, he decided to collect data on the selling prices of new homes recently
sold in the community and use the data to assess whether the developer might accept a lower
offer. In order to collect “relevant data,” the executive talked to local real estate professionals
and learned that new homes sold in the community during the previous three months were
a good indicator of current home value. Using real estate sales records, the executive also
learned that five of the community’s new homes had sold in the previous three months. The
data given in Table 1.1 are the data that the executive collected about these five homes.
When the business executive examined Table 1.1, he noted that homes on lake lots had sold
at their list price, but homes on treed lots had not. Because the executive and his family wished
to purchase a Diamond model on a treed lot, the executive also noted that two Diamond models on treed lots had sold in the previous three months. One of these Diamond models had
sold for the incentive price of $469,000, but the other had sold for a lower price of $440,000.
Hoping to pay the lower price for his family’s new home, the executive offered $440,000 for
the Diamond model on the treed lot. Initially, the home builder turned down this offer, but two
days later the builder called back and accepted the offer. The executive had used data to buy
the new home for $54,000 less than the list price and $29,000 less than the incentive price!

Quantitative and qualitative variables
For any variable describing an element in a data set, we carry out a measurement to assign
a value of the variable to the element. For example, in the real estate example, real estate
sales records gave the actual selling price of each home to the nearest dollar. As another
example, a credit card company might measure the time it takes for a cardholder’s bill to be
paid to the nearest day. Or, as a third example, an automaker might measure the gasoline
mileage obtained by a car in city driving to the nearest one-tenth of a mile per gallon by
conducting a mileage test on a driving course prescribed by the Environmental Protection
Agency (EPA). If the possible values of a variable are numbers that represent quantities (that

is, “how much” or “how many”), then the variable is said to be quantitative. For example,
(1) the actual selling price of a home, (2) the payment time of a bill, (3) the gasoline mileage of a car, and (4) the 2014 payroll of a Major League Baseball team are all quantitative
variables. Considering the last example, Table 1.2 in the page margin gives the 2014 payroll
(in millions of dollars) for each of the 30 Major League Baseball (MLB) teams. Moreover,
Figure 1.1 portrays the team payrolls as a dot plot. In this plot, each team payroll is shown

Figure 1.1  A Dot Plot of 2014 MLB Payrolls (Payroll Is a Quantitative Variable)
[Dot plot of the 30 team payrolls; horizontal axis: 2014 payroll (in millions of dollars), with tick marks from 40 to 240.]
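Because payroll is a quantitative variable, arithmetic summaries of its values are meaningful in a way they would not be for a qualitative variable such as a team’s home city. As an illustrative sketch (our code, not the book’s), the 30 payrolls from Table 1.2 can be summarized directly:

```python
# Illustrative sketch: the 30 team payrolls from Table 1.2, in millions
# of dollars. Payroll is quantitative (its values answer "how much"),
# so min, max, and mean are all meaningful.
payrolls = [235, 204, 180, 163, 162, 156, 154, 136, 135, 133,
            113, 112, 111, 111, 107, 104, 96, 92, 92, 91,
            90, 89, 89, 86, 83, 83, 78, 77, 48, 45]

lowest, highest = min(payrolls), max(payrolls)
mean_payroll = sum(payrolls) / len(payrolls)

print(lowest, highest)          # 45 235 (the span visible in the dot plot)
print(round(mean_payroll, 1))   # 115.2
```

The smallest and largest values (45 and 235) match the range spanned by the dots in Figure 1.1.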

