Data Analysis Using
SQL and Excel®
Second Edition
Gordon S. Linoff
Data Analysis Using SQL and Excel®, Second Edition
Published by
John Wiley & Sons, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com
Copyright © 2016 by John Wiley & Sons, Inc., Indianapolis, Indiana
Published simultaneously in Canada
ISBN: 978-1-119-02143-8
ISBN: 978-1-119-02145-2 (ebk)
ISBN: 978-1-119-02144-5 (ebk)
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written
permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the
Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600.
Requests to the Publisher for permission should be addressed to the Permissions Department, John
Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all
warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be
created or extended by sales or promotional materials. The advice and strategies contained herein may not
be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in
rendering legal, accounting, or other professional services. If professional assistance is required, the services
of a competent professional person should be sought. Neither the publisher nor the author shall be liable for
damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation
and/or a potential source of further information does not mean that the author or the publisher endorses
the information the organization or website may provide or recommendations it may make. Further, readers
should be aware that Internet websites listed in this work may have changed or disappeared between when
this work was written and when it is read.
For general information on our other products and services please contact our Customer Care Department
within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included
with standard print versions of this book may not be included in e-books or in print-on-demand. If this book
refers to media such as a CD or DVD that is not included in the version you purchased, you may download
this material at . For more information about Wiley products, visit
www.wiley.com.
Library of Congress Control Number: 2015950486
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc.
and/or its affiliates, in the United States and other countries, and may not be used without written permission.
Excel is a registered trademark of Microsoft Corporation. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
To Giuseppe—for twenty five years, five books, and counting . . .
About the Author
Gordon S. Linoff has been working with databases, big data, and data mining for
almost longer than he can remember. With decades of experience in the practice
of using data effectively, he is a recognized expert in the field of data mining.
Gordon started using spreadsheets while a student at MIT, on the original
Compaq Portable, the world’s first luggable computer. Not very many years
later, he managed a development group at the now‐defunct Thinking Machines
Corporation, tasked with building a massively parallel relational database for
decision support.
After Thinking Machines’ demise, he founded Data Miners in 1998 with his
friend and former colleague Michael J. A. Berry (who left in 2012). Since then, he
has worked on a wide diversity of projects across many different companies. He
has taught hundreds of classes around the world on data mining and survival
analysis through SAS Institute, a leader in statistical and business analytics
software. He is also an avid contributor to Stack Overflow, particularly on questions related to databases, having the highest score in 2014.
Together with Michael Berry, Gordon has written several influential books on
data mining, including Data Mining Techniques for Marketing, Sales, and Customer
Support, the first book on data mining to achieve a third edition.
Gordon lives in New York with Giuseppe Scalia, his partner of 25 years.
Credits
Project Editor
John Sleeva
Business Manager
Amy Knies
Technical Editor
Michael Berry
Associate Publisher
Jim Minatel
Production Editor
Dassi Zeidel
Project Coordinator, Cover
Brent Savage
Copy Editor
Mike La Bonne
Proofreader
Sara Wilson
Manager of Content Development & Assembly
Mary Beth Wakefield
Indexer
Johnna VanHoose Dinse
Marketing Director
David Mayhew
Marketing Manager
Carrie Sherrill
Cover Designer
Wiley
Cover Image
©iStock.com/Nobi_Prizue
Professional Technology & Strategy Director
Barry Pruett
Acknowledgments
Although this book has only one name on the cover, many people have helped
me both specifically on this book and more generally in understanding data,
analysis, and presentation.
I first met Michael Berry in 1990. We later founded Data Miners together, and
he has been helpful on all fronts. He reviewed the chapters, tested the SQL code
in the examples, and helped anonymize the data. His insights have been helpful
and his debugging skills have made the examples much more accurate. His wife,
Stephanie Jack, also deserves special praise for her patience and willingness to
share Michael’s time.
The original idea for the book came from Nick Drake, who then worked at
Datran Media. A statistician by training, Nick was looking for a book that would
help him use databases for data analysis. Bob Elliott, at the time my editor at
Wiley, liked the idea.
Throughout the chapters, the understanding of data processing is based on
dataflows, which Craig Stanfill of Ab Initio Corporation first introduced me to
long ago when we worked together at Thinking Machines Corporation.
Along the way, I have learned a lot from many people. Anne Milley of SAS
Institute first suggested that I learn survival analysis. Will Potts, now working at CapitalOne, then taught me much of what I know about the subject. Brij
Masand helped extend the ideas to practical forecasting applications. Chi Kong
Ho and his team at the New York Times provided valuable feedback for applying
survival analysis to customer value calculations.
Stuart Ward from the New York Times and Zaiying Huang spent countless
hours explaining and discussing statistical concepts. Harrison Sohmer, also of
the New York Times, taught me many Excel tricks, some of which I’ve been able
to include in the book.
Jamie MacLennan and the SQL Server team at Microsoft have been helpful
in answering my questions about the product.
Over the past few years, I have been a major contributor to Stack Overflow.
Along the way, I have learned an incredible amount about SQL and about
how to explain concepts. A handful of people whom I’ve never met in person
have helped in various ways. Richard Stallman invented emacs and the Free
Software Foundation; emacs provided the basis for the calendar table. Rob Bovey
of Applications Professional, Inc. created the X‐Y chart labeler used in several
chapters. The Census data set was created by the folks at the Missouri Census
Data Center. Juice Analytics inspired the example for Worksheet bar charts in
Chapter 5 (and thanks to Alex Wimbush, who pointed me in their direction).
Edwin Straver of Frontline Systems answered several questions about Solver.
Over the years, many colleagues, friends, and students have provided inspiration, questions, and answers. There are too many to list them all, but I want to
particularly thank Eran Abikhzer, Christian Albright, Michael Benigno, Emily
Cohen, Carol D’Andrea, Sonia Dubin, Lounette Dyer, Victor Fu, Josh Goff,
Richard Greenburg, Gregory Lampshire, Mikhail Levdanski, Savvas Mavridis,
Fiona McNeill, Karen Kennedy McConlogue, Steven Mullaney, Courage Noko,
Laura Palmer, Alan Parker, Ashit Patel, Ronnie Rowton, Vishal Santoshi, Adam
Schwebber, Kent Taylor, John Trustman, John Wallace, David Wang, and Zhilang
Zhao. I would also like to thank the folks in the SAS Institute Training group
who have organized, reviewed, and sponsored my data mining classes for many
years, giving me the opportunity to meet many interesting and diverse people
involved with data mining.
I also thank all those friends and family I’ve visited while writing this book
and who (for the most part) allowed me the space and time to work—my mother,
my father, my sister Debbie, my brother Joe, my in‐laws Raimonda Scalia, Ugo
Scalia, and Terry Sparacio, and my friends Jon Mosley, Paul Houlihan, Leonid
Poretsky, Anthony DiCarlo, and Maciej Zworski. On the other hand, my cat
Luna, who spent many hours curled up next to me, will miss my writing.
Finally, acknowledgments would be incomplete without thanking Giuseppe
Scalia, my partner through seven books, who has managed to maintain my
sanity through all of them.
Thank you, everyone!
Contents at a Glance
Foreword
Introduction
Chapter 1 A Data Miner Looks at SQL
Chapter 2 What’s in a Table? Getting Started with Data Exploration
Chapter 3 How Different Is Different?
Chapter 4 Where Is It All Happening? Location, Location, Location
Chapter 5 It’s a Matter of Time
Chapter 6 How Long Will Customers Last? Survival Analysis to Understand Customers and Their Value
Chapter 7 Factors Affecting Survival: The What and Why of Customer Tenure
Chapter 8 Customer Purchases and Other Repeated Events
Chapter 9 What’s in a Shopping Cart? Market Basket Analysis
Chapter 10 Association Rules and Beyond
Chapter 11 Data Mining Models in SQL
Chapter 12 The Best-Fit Line: Linear Regression Models
Chapter 13 Building Customer Signatures for Further Analysis
Chapter 14 Performance Is the Issue: Using SQL Effectively
Appendix Equivalent Constructs Among Databases
Index
Contents
Foreword
Introduction
Chapter 1 A Data Miner Looks at SQL
Databases, SQL, and Big Data
What Is Big Data?
Relational Databases
Hadoop and Hive
NoSQL and Other Types of Databases
SQL
Picturing the Structure of the Data
What Is a Data Model?
What Is a Table?
Allowing NULL Values
Column Types
What Is an Entity-Relationship Diagram?
The Zip Code Tables
Subscription Dataset
Purchases Dataset
Tips on Naming Things
Picturing Data Analysis Using Dataflows
What Is a Dataflow?
READ: Reading a Database Table
OUTPUT: Outputting a Table (or Chart)
SELECT: Selecting Various Columns in the Table
FILTER: Filtering Rows Based on a Condition
APPEND: Appending New Calculated Columns
UNION: Combining Multiple Datasets into One
AGGREGATE: Aggregating Values
LOOKUP: Looking Up Values in One Table in Another
CROSSJOIN: Generating the Cartesian Product of Two Tables
JOIN: Combining Two Tables Using a Key Column
SORT: Ordering the Results of a Dataset
Dataflows, SQL, and Relational Algebra
SQL Queries
What to Do, Not How to Do It
The SELECT Statement
A Basic SQL Query
A Basic Summary SQL Query
What It Means to Join Tables
Cross-Joins: The Most General Joins
Lookup: A Useful Join
Equijoins
Nonequijoins
Outer Joins
Other Important Capabilities in SQL
UNION ALL
CASE
IN
Window Functions
Subqueries and Common Table Expressions Are Our Friends
Subqueries for Naming Variables
Subqueries for Handling Summaries
Subqueries and IN
Rewriting the “IN” as a JOIN
Correlated Subqueries
NOT IN Operator
EXISTS and NOT EXISTS Operators
Subqueries for UNION ALL
Lessons Learned
Chapter 2 What’s in a Table? Getting Started with Data Exploration
What Is Data Exploration?
Excel for Charting
A Basic Chart: Column Charts
Inserting the Data
Creating the Column Chart
Formatting the Column Chart
Bar Charts in Cells
Character-Based Bar Charts
Conditional Formatting-Based Bar Charts
Useful Variations on the Column Chart
A New Query
Side-by-Side Columns
Stacked Columns
Stacked and Normalized Columns
Number of Orders and Revenue
Other Types of Charts
Line Charts
Area Charts
X-Y Charts (Scatter Plots)
Sparklines
What Values Are in the Columns?
Histograms
Histograms of Counts
Cumulative Histograms of Counts
Histograms (Frequencies) for Numeric Values
Ranges Based on the Number of Digits, Using Numeric Techniques
Ranges Based on the Number of Digits, Using String Techniques
More Refined Ranges: First Digit Plus Number of Digits
Breaking Numeric Values into Equal-Sized Groups
More Values to Explore—Min, Max, and Mode
Minimum and Maximum Values
The Most Common Value (Mode)
Calculating Mode Using Basic SQL
Calculating Mode Using Window Functions
Exploring String Values
Histogram of Length
Strings Starting or Ending with Spaces
Handling Upper- and Lowercase
What Characters Are in a String?
Exploring Values in Two Columns
What Are Average Sales by State?
How Often Are Products Repeated within a Single Order?
Direct Counting Approach
Comparison of Distinct Counts to Overall Counts
Which State Has the Most American Express Users?
From Summarizing One Column to Summarizing All Columns
Good Summary for One Column
Query to Get All Columns in a Table
Using SQL to Generate Summary Code
Lessons Learned
Chapter 3 How Different Is Different?
Basic Statistical Concepts
The Null Hypothesis
Confidence and Probability
Normal Distribution
How Different Are the Averages?
The Approach
Standard Deviation for Subset Averages
Three Approaches
Estimation Based on Two Samples
Estimation Based on Difference
Sampling from a Table
Random Sample
Repeatable Random Sample
Proportional Stratified Sample
Balanced Sample
Counting Possibilities
How Many Men?
How Many Californians?
Null Hypothesis and Confidence
How Many Customers Are Still Active?
Given the Count, What Is the Probability?
Given the Probability, What Is the Number of Stops?
The Rate or the Number?
Ratios and Their Statistics
Standard Error of a Proportion
Confidence Interval on Proportions
Difference of Proportions
Conservative Lower Bounds
Chi-Square
Expected Values
Chi-Square Calculation
Chi-Square Distribution
Chi-Square in SQL
What States Have Unusual Affinities for Which Types of Products?
Data Investigation
SQL to Calculate Chi-Square Values
Affinity Results
What Months and Payment Types Have Unusual Affinities for Which Types of Products?
Multidimensional Chi-Square
Using a SQL Query
The Results
Lessons Learned
Chapter 4 Where Is It All Happening? Location, Location, Location
Latitude and Longitude
Definition of Latitude and Longitude
Degrees, Minutes, Seconds, and All That
Distance between Two Locations
Euclidian Method
Accurate Method
Finding All Zip Codes within a Given Distance
Finding Nearest Zip Code in Excel
Pictures with Zip Codes
The Scatter Plot Map
Who Uses Solar Power for Heating?
Where Are the Customers?
Census Demographics
The Extremes: Richest and Poorest
Median Income
Proportion of Wealthy and Poor
Income Similarity and Dissimilarity Using Chi-Square
Comparison of Zip Codes with and without Orders
Zip Codes Not in Census File
Profiles of Zip Codes with and without Orders
Classifying and Comparing Zip Codes
Geographic Hierarchies
Wealthiest Zip Code in a State?
Zip Code with the Most Orders in Each State
Interesting Hierarchies in Geographic Data
Counties
Designated Marketing Areas
Census Hierarchies
Other Geographic Subdivisions
Geography on the Web
Calculating County Wealth
Identifying Counties
Measuring Wealth
Distribution of Values of Wealth
Which Zip Code Is Wealthiest Relative to Its County?
County with Highest Relative Order Penetration
Mapping in Excel
Why Create Maps?
It Can’t Be Mapped
Mapping on the Web
State Boundaries on Scatter Plots of Zip Codes
Plotting State Boundaries
Pictures of State Boundaries
Lessons Learned
Chapter 5 It’s a Matter of Time
Dates and Times in Databases
Some Fundamentals of Dates and Times in Databases
Extracting Components of Dates and Times
Converting to Standard Formats
Intervals (Durations)
Time Zones
Calendar Table
Starting to Investigate Dates
Verifying That Dates Have No Times
Comparing Counts by Date
Order Lines Shipped and Billed
Customers Shipped and Billed
Number of Different Bill and Ship Dates per Order
Counts of Orders and Order Sizes
Items as Measured by Number of Units
Items as Measured by Distinct Products
Size as Measured by Dollars
Days of the Week
Billing Date by Day of the Week
Changes in Day of the Week by Year
Comparison of Days of the Week for Two Dates
How Long Between Two Dates?
Duration in Days
Duration in Weeks
Duration in Months
How Many Mondays?
A Business Problem about Days of the Week
Outline of a Solution
Solving It in SQL
Using a Calendar Table Instead
When Is the Next Anniversary (or Birthday)?
First Year Anniversary This Month
First Year Anniversary Next Month
Manipulating Dates to Calculate the Next Anniversary
Year-over-Year Comparisons
Comparisons by Day
Adding a Moving Average Trend Line
Comparisons by Week
Comparisons by Month
Month-to-Date Comparison
Extrapolation by Days in Month
Estimation Based on Day of Week
Estimation Based on Previous Year
Counting Active Customers by Day
How Many Customers on a Given Day?
How Many Customers Every Day?
How Many Customers of Different Types?
How Many Customers by Tenure Segment?
Calculating Actives Entirely Using SQL
Simple Chart Animation in Excel
Order Date to Ship Date
Order Date to Ship Date by Year
Querying the Data
Creating the One-Year Excel Table
Creating and Customizing the Chart
Lessons Learned
Chapter 6 How Long Will Customers Last? Survival Analysis to Understand Customers and Their Value
Background on Survival Analysis
Life Expectancy
Medical Research
Examples of Hazards
The Hazard Calculation
Data Investigation
Stop Flag
Tenure
Hazard Probability
Visualizing Customers: Time versus Tenure
Censoring
Survival and Retention
Point Estimate for Survival
Calculating Survival for All Tenures
Calculating Survival in SQL
Calculating the Product of Column Values
Adding in More Dimensions
A Simple Customer Retention Calculation
Comparison between Retention and Survival
Simple Example of Hazard and Survival
Constant Hazard
What Happens to a Mixture?
Constant Hazard Corresponding to Survival
Comparing Different Groups of Customers
Summarizing the Markets
Stratifying by Market
Survival Ratio
Conditional Survival
Comparing Survival over Time
How Has a Particular Hazard Changed over Time?
What Is Customer Survival by Year of Start?
What Did Survival Look Like in the Past?
Important Measures Derived from Survival
Point Estimate of Survival
Median Customer Tenure
Average Customer Lifetime
Confidence in the Hazards
Using Survival for Customer Value Calculations
Estimated Revenue
Estimating Future Revenue for One Future Start
Value in the First Year
SQL Day-by-Day Approach
Estimated Revenue for a Group of Existing Customers
Estimated Second Year Revenue for a Homogenous Group
Estimated Future Revenue for All Customers
Forecasting
Existing Base Forecast
Existing Base Calculation
Calculating Survival on July 1st
Calculating the Number of Existing Customers on July 1st
How Good Is It?
Estimating the Long-Term Hazard
New Start Forecast
Lessons Learned
Chapter 7 Factors Affecting Survival: The What and Why of Customer Tenure
Which Factors Are Important and When
Explanation of the Approach
Using Averages to Compare Numeric Variables
The Answer
Answering the Question in SQL and Excel
Answering the Question Entirely in SQL
Extension to Include Confidence Bounds
Hazard Ratios
Interpreting Hazard Ratios
Calculating Hazard Ratios Using SQL and Excel
Calculating Hazard Ratios in SQL
Why the Hazard Ratio?
Left Truncation
Recognizing Left Truncation
Effect of Left Truncation
How to Fix Left Truncation, Conceptually
Estimating Hazard Probability for One Tenure
Estimating Hazard Probabilities for All Tenures
Doing the Calculation in SQL
Time Windowing
A Business Problem
Time Windows = Left Truncation + Right Censoring
Calculating One Hazard Probability Using a Time Window
All Hazard Probabilities for a Time Window
Comparison of Hazards by Stops in Year in Excel
Comparison of Hazards by Stops in Year in SQL
Competing Risks
Examples of Competing Risks
I=Involuntary Churn
V=Voluntary Churn
M=Migration
Other
Competing Risk “Hazard Probability”
Competing Risk “Survival”
What Happens to Customers over Time
Example
A Cohort-Based Approach
The Survival Analysis Approach
Before and After
Three Scenarios
A Billing Mistake
A Loyalty Program
Raising Prices
Using Survival Forecasts to Understand One-Time Events
Forecasting Identified Customers Who Stopped
Estimating Excess Stops
Before and After Comparison
Cohort-Based Approach
Cohort-Based Approach: Full Cohorts
Direct Estimation of Event Effect
Approach to the Calculation
Time-Dependent Covariate Survival Using SQL and Excel
Doing the Calculation in SQL
Lessons Learned
Chapter 8 Customer Purchases and Other Repeated Events
Identifying Customers
Who Is the Customer?
How Many?
How Many Genders in a Household?
Investigating First Names
Other Customer Information
First and Last Names
Addresses
Email Addresses
Other Identifying Information
How Many New Customers Appear Each Year?
Counting Customers
Span of Time Making Purchases
Average Time between Orders
Purchase Intervals
How Many Days in a Row Do Customers Make Purchases?