Data Analysis Using
SQL and Excel®



Second Edition

Gordon S. Linoff


Data Analysis Using SQL and Excel®, Second Edition
Published by
John Wiley & Sons, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com
Copyright © 2016 by John Wiley & Sons, Inc., Indianapolis, Indiana
Published simultaneously in Canada
ISBN: 978-1-119-02143-8
ISBN: 978-1-119-02145-2 (ebk)
ISBN: 978-1-119-02144-5 (ebk)
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written
permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the
Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600.


Requests to the Publisher for permission should be addressed to the Permissions Department, John
Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all
warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be
created or extended by sales or promotional materials. The advice and strategies contained herein may not
be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in
rendering legal, accounting, or other professional services. If professional assistance is required, the services
of a competent professional person should be sought. Neither the publisher nor the author shall be liable for
damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation
and/or a potential source of further information does not mean that the author or the publisher endorses
the information the organization or website may provide or recommendations it may make. Further, readers
should be aware that Internet websites listed in this work may have changed or disappeared between when
this work was written and when it is read.
For general information on our other products and services please contact our Customer Care Department
within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included
with standard print versions of this book may not be included in e-books or in print-on-demand. If this book
refers to media such as a CD or DVD that is not included in the version you purchased, you may download
this material at . For more information about Wiley products, visit
www.wiley.com.
Library of Congress Control Number: 2015950486
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc.
and/or its affiliates, in the United States and other countries, and may not be used without written permission.
Excel is a registered trademark of Microsoft Corporation. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.


To Giuseppe—for twenty five years, five books, and counting . . .




About the Author 

Gordon S. Linoff has been working with databases, big data, and data mining for
almost longer than he can remember. With decades of experience on the practice
of using data effectively, he is a recognized expert in the field of data mining.
Gordon started using spreadsheets while a student at MIT, on the original
Compaq Portable, the world’s first luggable computer. Not very many years
later, he managed a development group at the now‐defunct Thinking Machines
Corporation, tasked with building a massively parallel relational database for
decision support.
After Thinking Machines’ demise, he founded Data Miners in 1998 with his
friend and former colleague Michael J. A. Berry (who left in 2012). Since then, he
has worked on a wide diversity of projects across many different companies. He
has taught hundreds of classes around the world on data mining and survival
analysis through SAS Institute, a leader in statistical and business analytics
software. He is also an avid contributor to Stack Overflow, particularly on questions related to databases, having the highest score in 2014.
Together with Michael Berry, Gordon has written several influential books on
data mining, including Data Mining Techniques for Marketing, Sales, and Customer
Support, the first book on data mining to achieve a third edition.
Gordon lives in New York with Giuseppe Scalia, his partner of 25 years.




Credits

Project Editor
John Sleeva


Business Manager
Amy Knies

Technical Editor
Michael Berry

Associate Publisher
Jim Minatel

Production Editor
Dassi Zeidel

Project Coordinator, Cover
Brent Savage

Copy Editor
Mike La Bonne

Proofreader
Sara Wilson

Manager of Content Development
& Assembly
Mary Beth Wakefield

Indexer
Johnna VanHoose Dinse

Marketing Director
David Mayhew

Marketing Manager
Carrie Sherrill

Cover Designer
Wiley

Cover Image
©iStock.com/Nobi_Prizue

Professional Technology &
Strategy Director
Barry Pruett




Acknowledgments

Although this book has only one name on the cover, many people have helped
me both specifically on this book and more generally in understanding data,
analysis, and presentation.
I first met Michael Berry in 1990. We later founded Data Miners together, and
he has been helpful on all fronts. He reviewed the chapters, tested the SQL code
in the examples, and helped anonymize the data. His insights have been helpful
and his debugging skills have made the examples much more accurate. His wife,
Stephanie Jack, also deserves special praise for her patience and willingness to
share Michael’s time.
The original idea for the book came from Nick Drake, who then worked at
Datran Media. A statistician by training, Nick was looking for a book that would
help him use databases for data analysis. Bob Elliott, at the time my editor at
Wiley, liked the idea.
Throughout the chapters, the understanding of data processing is based on
dataflows, which Craig Stanfill of Ab Initio Corporation first introduced me to
long ago when we worked together at Thinking Machines Corporation.
Along the way, I have learned a lot from many people. Anne Milley of SAS
Institute first suggested that I learn survival analysis. Will Potts, now working at CapitalOne, then taught me much of what I know about the subject. Brij
Masand helped extend the ideas to practical forecasting applications. Chi Kong
Ho and his team at the New York Times provided valuable feedback for applying
survival analysis to customer value calculations.
Stuart Ward from the New York Times and Zaiying Huang spent countless
hours explaining and discussing statistical concepts. Harrison Sohmer, also of
the New York Times, taught me many Excel tricks, some of which I’ve been able
to include in the book.
Jamie MacLennan and the SQL Server team at Microsoft have been helpful
in answering my questions about the product.
Over the past few years, I have been a major contributor to Stack Overflow.
Along the way, I have learned an incredible amount about SQL and about
how to explain concepts. A handful of people whom I’ve never met in person
have helped in various ways. Richard Stallman invented emacs and the Free
Software Foundation; emacs provided the basis for the calendar table. Rob Bovey
of Applications Professional, Inc. created the X‐Y chart labeler used in several
chapters. The Census data set was created by the folks at the Missouri Census
Data Center. Juice Analytics inspired the example for Worksheet bar charts in
Chapter 5 (and thanks to Alex Wimbush, who pointed me in their direction).
Edwin Straver of Frontline Systems answered several questions about Solver.
Over the years, many colleagues, friends, and students have provided inspiration, questions, and answers. There are too many to list them all, but I want to
particularly thank Eran Abikhzer, Christian Albright, Michael Benigno, Emily
Cohen, Carol D’Andrea, Sonia Dubin, Lounette Dyer, Victor Fu, Josh Goff,
Richard Greenburg, Gregory Lampshire, Mikhail Levdanski, Savvas Mavridis,
Fiona McNeill, Karen Kennedy McConlogue, Steven Mullaney, Courage Noko,
Laura Palmer, Alan Parker, Ashit Patel, Ronnie Rowton, Vishal Santoshi, Adam
Schwebber, Kent Taylor, John Trustman, John Wallace, David Wang, and Zhilang
Zhao. I would also like to thank the folks in the SAS Institute Training group
who have organized, reviewed, and sponsored my data mining classes for many
years, giving me the opportunity to meet many interesting and diverse people
involved with data mining.
I also thank all those friends and family I’ve visited while writing this book
and who (for the most part) allowed me the space and time to work—my mother,
my father, my sister Debbie, my brother Joe, my in‐laws Raimonda Scalia, Ugo
Scalia, and Terry Sparacio, and my friends Jon Mosley, Paul Houlihan, Leonid
Poretsky, Anthony DiCarlo, and Maciej Zworski. On the other hand, my cat
Luna, who spent many hours curled up next to me, will miss my writing.
Finally, acknowledgments would be incomplete without thanking Giuseppe
Scalia, my partner through seven books, who has managed to maintain my
sanity through all of them.
Thank you, everyone!


Contents at a Glance

Foreword
Introduction
Chapter 1 A Data Miner Looks at SQL
Chapter 2 What’s in a Table? Getting Started with Data Exploration
Chapter 3 How Different Is Different?
Chapter 4 Where Is It All Happening? Location, Location, Location
Chapter 5 It’s a Matter of Time
Chapter 6 How Long Will Customers Last? Survival Analysis to Understand Customers and Their Value
Chapter 7 Factors Affecting Survival: The What and Why of Customer Tenure
Chapter 8 Customer Purchases and Other Repeated Events
Chapter 9 What’s in a Shopping Cart? Market Basket Analysis
Chapter 10 Association Rules and Beyond
Chapter 11 Data Mining Models in SQL
Chapter 12 The Best-Fit Line: Linear Regression Models
Chapter 13 Building Customer Signatures for Further Analysis
Chapter 14 Performance Is the Issue: Using SQL Effectively
Appendix Equivalent Constructs Among Databases
Index



Contents

Foreword
Introduction

Chapter 1 A Data Miner Looks at SQL
Databases, SQL, and Big Data
What Is Big Data?
Relational Databases
Hadoop and Hive
NoSQL and Other Types of Databases
SQL

Picturing the Structure of the Data
What Is a Data Model?
What Is a Table?
Allowing NULL Values
Column Types
What Is an Entity-Relationship Diagram?
The Zip Code Tables
Subscription Dataset
Purchases Dataset
Tips on Naming Things

Picturing Data Analysis Using Dataflows
What Is a Dataflow?
READ: Reading a Database Table
OUTPUT: Outputting a Table (or Chart)
SELECT: Selecting Various Columns in the Table
FILTER: Filtering Rows Based on a Condition
APPEND: Appending New Calculated Columns
UNION: Combining Multiple Datasets into One
AGGREGATE: Aggregating Values
LOOKUP: Looking Up Values in One Table in Another
CROSSJOIN: Generating the Cartesian Product of Two Tables
JOIN: Combining Two Tables Using a Key Column
SORT: Ordering the Results of a Dataset
Dataflows, SQL, and Relational Algebra

SQL Queries
What to Do, Not How to Do It
The SELECT Statement
A Basic SQL Query
A Basic Summary SQL Query
What It Means to Join Tables
Cross-Joins: The Most General Joins
Lookup: A Useful Join
Equijoins
Nonequijoins
Outer Joins
Other Important Capabilities in SQL
UNION ALL
CASE
IN
Window Functions

Subqueries and Common Table Expressions Are Our Friends
Subqueries for Naming Variables
Subqueries for Handling Summaries
Subqueries and IN
Rewriting the “IN” as a JOIN
Correlated Subqueries
NOT IN Operator
EXISTS and NOT EXISTS Operators
Subqueries for UNION ALL

Lessons Learned

Chapter 2 What’s in a Table? Getting Started with Data Exploration
What Is Data Exploration?
Excel for Charting
A Basic Chart: Column Charts
Inserting the Data
Creating the Column Chart
Formatting the Column Chart
Bar Charts in Cells
Character-Based Bar Charts
Conditional Formatting-Based Bar Charts
Useful Variations on the Column Chart
A New Query
Side-by-Side Columns
Stacked Columns
Stacked and Normalized Columns
Number of Orders and Revenue
Other Types of Charts

Line Charts
Area Charts
X-Y Charts (Scatter Plots)

Sparklines
What Values Are in the Columns?
Histograms
Histograms of Counts
Cumulative Histograms of Counts
Histograms (Frequencies) for Numeric Values
Ranges Based on the Number of Digits, Using Numeric Techniques
Ranges Based on the Number of Digits, Using String Techniques
More Refined Ranges: First Digit Plus Number of Digits
Breaking Numeric Values into Equal-Sized Groups

More Values to Explore—Min, Max, and Mode
Minimum and Maximum Values
The Most Common Value (Mode)
Calculating Mode Using Basic SQL
Calculating Mode Using Window Functions

Exploring String Values
Histogram of Length
Strings Starting or Ending with Spaces
Handling Upper- and Lowercase
What Characters Are in a String?

Exploring Values in Two Columns

What Are Average Sales by State?
How Often Are Products Repeated within a Single Order?
Direct Counting Approach
Comparison of Distinct Counts to Overall Counts
Which State Has the Most American Express Users?

From Summarizing One Column to Summarizing All Columns
Good Summary for One Column
Query to Get All Columns in a Table
Using SQL to Generate Summary Code

Lessons Learned

Chapter 3 How Different Is Different?

Basic Statistical Concepts
The Null Hypothesis
Confidence and Probability
Normal Distribution

How Different Are the Averages?
The Approach
Standard Deviation for Subset Averages
Three Approaches
Estimation Based on Two Samples
Estimation Based on Difference

Sampling from a Table
Random Sample
Repeatable Random Sample
Proportional Stratified Sample

Balanced Sample

Counting Possibilities
How Many Men?
How Many Californians?
Null Hypothesis and Confidence
How Many Customers Are Still Active?
Given the Count, What Is the Probability?
Given the Probability, What Is the Number of Stops?
The Rate or the Number?

Ratios and Their Statistics
Standard Error of a Proportion
Confidence Interval on Proportions
Difference of Proportions
Conservative Lower Bounds

Chi-Square
Expected Values
Chi-Square Calculation
Chi-Square Distribution
Chi-Square in SQL
What States Have Unusual Affinities for Which Types of Products?
Data Investigation
SQL to Calculate Chi-Square Values
Affinity Results

What Months and Payment Types Have Unusual Affinities for Which Types of Products?

Multidimensional Chi-Square
Using a SQL Query
The Results

Lessons Learned

Chapter 4 Where Is It All Happening? Location, Location, Location
Latitude and Longitude
Definition of Latitude and Longitude
Degrees, Minutes, Seconds, and All That
Distance between Two Locations
Euclidian Method
Accurate Method
Finding All Zip Codes within a Given Distance
Finding Nearest Zip Code in Excel
Pictures with Zip Codes
The Scatter Plot Map
Who Uses Solar Power for Heating?
Where Are the Customers?

Census Demographics
The Extremes: Richest and Poorest
Median Income
Proportion of Wealthy and Poor
Income Similarity and Dissimilarity Using Chi-Square
Comparison of Zip Codes with and without Orders
Zip Codes Not in Census File
Profiles of Zip Codes with and without Orders
Classifying and Comparing Zip Codes


Geographic Hierarchies
Wealthiest Zip Code in a State?
Zip Code with the Most Orders in Each State
Interesting Hierarchies in Geographic Data
Counties
Designated Marketing Areas
Census Hierarchies
Other Geographic Subdivisions
Geography on the Web
Calculating County Wealth
Identifying Counties
Measuring Wealth
Distribution of Values of Wealth
Which Zip Code Is Wealthiest Relative to Its County?
County with Highest Relative Order Penetration

Mapping in Excel
Why Create Maps?
It Can’t Be Mapped
Mapping on the Web
State Boundaries on Scatter Plots of Zip Codes
Plotting State Boundaries
Pictures of State Boundaries

Lessons Learned

Chapter 5 It’s a Matter of Time
Dates and Times in Databases
Some Fundamentals of Dates and Times in Databases
Extracting Components of Dates and Times
Converting to Standard Formats
Intervals (Durations)
Time Zones
Calendar Table

Starting to Investigate Dates
Verifying That Dates Have No Times
Comparing Counts by Date
Order Lines Shipped and Billed
Customers Shipped and Billed
Number of Different Bill and Ship Dates per Order
Counts of Orders and Order Sizes
Items as Measured by Number of Units
Items as Measured by Distinct Products
Size as Measured by Dollars
Days of the Week
Billing Date by Day of the Week
Changes in Day of the Week by Year
Comparison of Days of the Week for Two Dates

How Long Between Two Dates?
Duration in Days
Duration in Weeks
Duration in Months
How Many Mondays?
A Business Problem about Days of the Week
Outline of a Solution
Solving It in SQL

Using a Calendar Table Instead
When Is the Next Anniversary (or Birthday)?
First Year Anniversary This Month
First Year Anniversary Next Month
Manipulating Dates to Calculate the Next Anniversary

Year-over-Year Comparisons
Comparisons by Day
Adding a Moving Average Trend Line
Comparisons by Week
Comparisons by Month
Month-to-Date Comparison
Extrapolation by Days in Month
Estimation Based on Day of Week
Estimation Based on Previous Year

Counting Active Customers by Day
How Many Customers on a Given Day?
How Many Customers Every Day?
How Many Customers of Different Types?
How Many Customers by Tenure Segment?
Calculating Actives Entirely Using SQL

Simple Chart Animation in Excel
Order Date to Ship Date
Order Date to Ship Date by Year
Querying the Data
Creating the One-Year Excel Table
Creating and Customizing the Chart

Lessons Learned

Chapter 6 How Long Will Customers Last? Survival Analysis to Understand Customers and Their Value
Background on Survival Analysis
Life Expectancy
Medical Research
Examples of Hazards

The Hazard Calculation
Data Investigation
Stop Flag
Tenure
Hazard Probability
Visualizing Customers: Time versus Tenure
Censoring

Survival and Retention
Point Estimate for Survival
Calculating Survival for All Tenures
Calculating Survival in SQL
Calculating the Product of Column Values
Adding in More Dimensions

A Simple Customer Retention Calculation
Comparison between Retention and Survival
Simple Example of Hazard and Survival
Constant Hazard
What Happens to a Mixture?
Constant Hazard Corresponding to Survival

Comparing Different Groups of Customers
Summarizing the Markets
Stratifying by Market
Survival Ratio
Conditional Survival

Comparing Survival over Time
How Has a Particular Hazard Changed over Time?
What Is Customer Survival by Year of Start?
What Did Survival Look Like in the Past?

Important Measures Derived from Survival
Point Estimate of Survival
Median Customer Tenure
Average Customer Lifetime
Confidence in the Hazards

Using Survival for Customer Value Calculations
Estimated Revenue
Estimating Future Revenue for One Future Start
Value in the First Year
SQL Day-by-Day Approach
Estimated Revenue for a Group of Existing Customers
Estimated Second Year Revenue for a Homogenous Group
Estimated Future Revenue for All Customers

Forecasting
Existing Base Forecast
Existing Base Calculation
Calculating Survival on July 1st
Calculating the Number of Existing Customers on July 1st

How Good Is It?
Estimating the Long-Term Hazard
New Start Forecast

Lessons Learned

Chapter 7 Factors Affecting Survival: The What and Why of Customer Tenure
Which Factors Are Important and When
Explanation of the Approach
Using Averages to Compare Numeric Variables
The Answer
Answering the Question in SQL and Excel
Answering the Question Entirely in SQL
Extension to Include Confidence Bounds
Hazard Ratios
Interpreting Hazard Ratios
Calculating Hazard Ratios Using SQL and Excel
Calculating Hazard Ratios in SQL
Why the Hazard Ratio?

Left Truncation
Recognizing Left Truncation
Effect of Left Truncation
How to Fix Left Truncation, Conceptually
Estimating Hazard Probability for One Tenure
Estimating Hazard Probabilities for All Tenures
Doing the Calculation in SQL

Time Windowing
A Business Problem

Time Windows = Left Truncation + Right Censoring
Calculating One Hazard Probability Using a Time Window
All Hazard Probabilities for a Time Window
Comparison of Hazards by Stops in Year in Excel
Comparison of Hazards by Stops in Year in SQL

Competing Risks

Examples of Competing Risks
I=Involuntary Churn
V=Voluntary Churn
M=Migration
Other
Competing Risk “Hazard Probability”
Competing Risk “Survival”
What Happens to Customers over Time
Example
A Cohort-Based Approach
The Survival Analysis Approach

Before and After
Three Scenarios
A Billing Mistake
A Loyalty Program
Raising Prices
Using Survival Forecasts to Understand One-Time Events
Forecasting Identified Customers Who Stopped
Estimating Excess Stops
Before and After Comparison
Cohort-Based Approach
Cohort-Based Approach: Full Cohorts
Direct Estimation of Event Effect
Approach to the Calculation
Time-Dependent Covariate Survival Using SQL and Excel
Doing the Calculation in SQL

Lessons Learned

Chapter 8 Customer Purchases and Other Repeated Events
Identifying Customers
Who Is the Customer?
How Many?
How Many Genders in a Household?
Investigating First Names
Other Customer Information
First and Last Names
Addresses
Email Addresses
Other Identifying Information
How Many New Customers Appear Each Year?
Counting Customers
Span of Time Making Purchases
Average Time between Orders
Purchase Intervals
How Many Days in a Row Do Customers Make Purchases?