Chapter 5
Association between
Categorical Variables
Copyright © 2011 Pearson Education, Inc.
5.1 Contingency Tables
Which hosts send more buyers to
Amazon.com?
To answer this question, we must gather data on
two categorical variables: Host and Purchase.
Host identifies the originating site (MSN,
RecipeSource, or Yahoo); Purchase indicates
whether or not the visit results in a sale.
5.1 Contingency Tables
Consider Two Categorical Variables
Simultaneously
Contingency table: a table that shows counts of cases on one
categorical variable contingent on the value of
another (for every combination of both variables)
Cells in a contingency table are mutually
exclusive: each case is counted in exactly one cell
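As a rough sketch of how such a table can be tallied, the Python below counts every (Host, Purchase) combination from a list of cases. The records in `visits` and the helper name `contingency_table` are hypothetical and chosen only for illustration; they are not the counts from the web shopping data.

```python
from collections import Counter

# Hypothetical visit records as (host, purchase) pairs; these are
# illustrative values, not the counts shown on the slides.
visits = [
    ("MSN", "No"), ("MSN", "Yes"), ("MSN", "No"),
    ("RecipeSource", "No"), ("RecipeSource", "No"),
    ("Yahoo", "Yes"), ("Yahoo", "No"), ("Yahoo", "No"),
]


def contingency_table(records):
    """Return {purchase: {host: count}} covering every combination."""
    counts = Counter(records)  # each case lands in exactly one cell
    hosts = sorted({host for host, _ in records})
    purchases = sorted({purchase for _, purchase in records})
    # Counter returns 0 for missing keys, so empty cells show up as 0.
    return {p: {h: counts[(h, p)] for h in hosts} for p in purchases}


table = contingency_table(visits)
for purchase, row in table.items():
    print(purchase, row)
```

Because the cells are mutually exclusive, the cell counts add up to the number of cases.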
5.1 Contingency Tables
Contingency Table for Web Shopping
5.1 Contingency Tables
Marginal and Conditional Distributions
• Marginal distributions appear in the “margins” of a
  contingency table and represent the totals
  (frequencies) for each categorical variable
  separately
• Conditional distributions refer to counts within a
  row or column of a contingency table (restricted to
  cases satisfying a condition)
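The two definitions above can be sketched in Python as follows. The counts in `table` are made up for illustration (rows are Purchase, columns are Host), and the function names are hypothetical.

```python
# Hypothetical counts: rows are Purchase, columns are Host.
table = {
    "No":  {"MSN": 90, "RecipeSource": 196, "Yahoo": 85},
    "Yes": {"MSN": 10, "RecipeSource": 4,   "Yahoo": 15},
}


def row_margins(t):
    """Marginal distribution of the row variable (row totals)."""
    return {row: sum(cells.values()) for row, cells in t.items()}


def column_margins(t):
    """Marginal distribution of the column variable (column totals)."""
    totals = {}
    for cells in t.values():
        for col, n in cells.items():
            totals[col] = totals.get(col, 0) + n
    return totals


def conditional_on_column(t, column):
    """Distribution of the row variable restricted to one column."""
    total = sum(cells[column] for cells in t.values())
    return {row: cells[column] / total for row, cells in t.items()}


print(row_margins(table))
print(column_margins(table))
for host in column_margins(table):
    print(host, conditional_on_column(table, host))
```

In this made-up table the share of "Yes" differs from host to host, which is the pattern that signals association between the two variables.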
5.1 Contingency Tables
Conditional Distribution of Purchase for each
Host (Column Counts and Percentages)
5.1 Contingency Tables
Conditional Distribution
• Reveals that the percentage of purchases
  among visitors from RecipeSource is
  much lower than for MSN and Yahoo
• Host and Purchase are associated: the
  purchase rate depends on the host
5.1 Contingency Tables
Segmented Bar Charts
• Used to display conditional distributions
• Divides the bars in a bar chart into
  segments that are proportional to the
  percentage in each category of a second
  variable
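To make the idea of proportional segments concrete, here is a small text-only sketch in Python. The function `segmented_bar`, the symbols, and the shares are all hypothetical; a real chart would normally be drawn with plotting software.

```python
def segmented_bar(shares, width=40):
    """Render one bar whose segments are proportional to the shares.

    shares: dict mapping category -> proportion (proportions sum to 1).
    Each category is drawn with its own symbol; the last segment takes
    whatever width remains so the bar always has the requested length.
    """
    symbols = "#.o+*"
    items = list(shares.items())
    bar = ""
    for i, (_, share) in enumerate(items):
        if i == len(items) - 1:
            n = width - len(bar)
        else:
            n = round(share * width)
        bar += symbols[i % len(symbols)] * n
    return bar


# Hypothetical conditional distributions of Purchase within two groups.
groups = {
    "Group 1": {"Yes": 0.25, "No": 0.75},
    "Group 2": {"Yes": 0.50, "No": 0.50},
}
for name, shares in groups.items():
    print(f"{name:8s} {segmented_bar(shares)}")
```

Bars of equal length with differently sized segments indicate that the conditional distributions differ between groups.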
5.1 Contingency Tables
Contingency Table of Purchase by Region
5.1 Contingency Tables
Segmented Bar Chart Shows Association
5.1 Contingency Tables
Mosaic Plots
Alternative to segmented bar chart
A plot in which the size of each “tile” is
proportional to the count in a cell of a
contingency table
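One common way to lay out the tiles, sketched below under assumptions, gives each column a width proportional to its marginal share and each tile a height proportional to the conditional share within that column, so tile area matches the cell's share of all cases. The counts and the name `mosaic_tiles` are hypothetical, and the sketch assumes every column has at least one case.

```python
def mosaic_tiles(table):
    """Return {(column, row): (width, height)} for a mosaic layout.

    table: dict mapping column category -> {row category: count}.
    width  = column total / grand total (marginal share)
    height = cell count / column total  (conditional share)
    """
    grand = sum(sum(cells.values()) for cells in table.values())
    tiles = {}
    for col, cells in table.items():
        col_total = sum(cells.values())
        width = col_total / grand
        for row, n in cells.items():
            tiles[(col, row)] = (width, n / col_total)
    return tiles


# Hypothetical counts of shirt size within each style.
shirts = {
    "Style A": {"Small": 10, "Medium": 30, "Large": 20},
    "Style B": {"Small": 20, "Medium": 10, "Large": 10},
}
for key, (w, h) in mosaic_tiles(shirts).items():
    print(key, f"width={w:.2f} height={h:.2f} area={w * h:.2f}")
```

Unlike a segmented bar chart, the column widths differ, so the plot also shows the marginal distribution of the column variable.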
5.1 Contingency Tables
Contingency Table of Shirt Size by Style
5.1 Contingency Tables
Mosaic Plot Shows Association
4M Example 5.1: CAR THEFT
Motivation
Should insurance companies vary the
premiums for different car models (are
some cars more likely to be stolen than
others)?
4M Example 5.1: CAR THEFT
Method
Data obtained from the National Highway Traffic
Safety Administration (NHTSA) on car theft for
seven popular models (two categorical variables:
type of car and whether the car was stolen).
4M Example 5.1: CAR THEFT
Mechanics
4M Example 5.1: CAR THEFT
Mechanics
4M Example 5.1: CAR THEFT
Message
The Dodge Intrepid is more likely to be stolen than
other popular models. The data suggest that
higher premiums for theft insurance should be
charged for models that are more likely to be
stolen.
5.2 Lurking Variables
and Simpson’s Paradox
Association Not Necessarily Causation
Lurking Variable: a concealed variable that
affects the apparent relationship between two
other variables
Simpson’s Paradox: a change in the association
between two variables when data are separated
into groups defined by a third variable
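A small numerical sketch can show how the reversal arises. The flight counts below are invented for illustration and are not the Bureau of Transportation Statistics data; the airline and destination labels are placeholders.

```python
# Hypothetical (on_time, total) flight counts by airline and destination.
flights = {
    "Airline A": {"Dest X": (90, 100), "Dest Y": (700, 1000)},
    "Airline B": {"Dest X": (850, 1000), "Dest Y": (60, 100)},
}


def on_time_rate(on_time, total):
    """Share of flights that arrived on time."""
    return on_time / total


def overall_rate(by_destination):
    """On-time rate after pooling all destinations together."""
    on_time = sum(o for o, _ in by_destination.values())
    total = sum(t for _, t in by_destination.values())
    return on_time / total


for airline, by_dest in flights.items():
    per_dest = {d: round(on_time_rate(*c), 3) for d, c in by_dest.items()}
    print(airline, per_dest, "overall:", round(overall_rate(by_dest), 3))
```

In these made-up numbers Airline A has the higher on-time rate at each destination, yet Airline B has the higher rate overall, because Airline A flies mostly to the destination where delays are common. Destination plays the role of the lurking variable.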
4M Example 5.2: AIRLINE ARRIVALS
Motivation
Does it matter which of two airlines a
corporate CEO chooses when flying to
meetings if he wants to avoid delays?
4M Example 5.2: AIRLINE ARRIVALS
Method
Data obtained from the US Bureau of
Transportation Statistics on flight delays for
two airlines (two categorical variables:
airline and whether the flight arrived on
time).
4M Example 5.2: AIRLINE ARRIVALS
Mechanics
4M Example 5.2: AIRLINE ARRIVALS
Mechanics –
Is destination a lurking variable?
4M Example 5.2: AIRLINE ARRIVALS
Mechanics –
This is Simpson’s Paradox