www.it-ebooks.info
99513ffirs.qxd:WileyRed
8/27/07
4:15 PM
Page iii
Data Analysis Using
SQL and Excel®
Gordon S. Linoff
Wiley Publishing, Inc.
Data Analysis Using SQL and Excel®
Published by
Wiley Publishing, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com
Copyright © 2008 by Wiley Publishing, Inc., Indianapolis, Indiana
Published simultaneously in Canada
ISBN: 978-0-470-09951-3
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise,
except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without
either the prior written permission of the Publisher, or authorization through payment of the
appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA
01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be
addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN
46256, (317) 572-3447, fax (317) 572-4355.
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations
or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular
purpose. No warranty may be created or extended by sales or promotional materials. The advice
and strategies contained herein may not be suitable for every situation. This work is sold with the
understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising
herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a
potential source of further information does not mean that the author or the publisher endorses the
information the organization or Website may provide or recommendations it may make. Further,
readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.
For general information on our other products and services or to obtain technical support, please
contact our Customer Care Department within the U.S. at (800) 762-2974, outside the U.S. at (317)
572-3993, or fax (317) 572-4002.
Library of Congress Cataloging-in-Publication Data:
Linoff, Gordon.
Data analysis using SQL and Excel / Gordon S. Linoff.
p. cm.
Includes index.
ISBN 978-0-470-09951-3 (paper/website)
1. SQL (Computer program language) 2. Querying (Computer science) 3. Data mining. 4.
Microsoft Excel (Computer file) I. Title.
QA76.73.S67L56 2007
005.75'85--dc22
2007026313
Trademarks: Wiley, the Wiley logo, and related trade dress are trademarks or registered trademarks
of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not
be used without written permission. Excel is a registered trademark of Microsoft Corporation in the
United States and/or other countries. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print
may not be available in electronic books.
To Giuseppe for sixteen years, five books, and counting . . .
About the Author
Gordon Linoff is a recognized expert in the field
of data mining. He has more than twenty-five years of experience working
with companies large and small to analyze customer data and to help design
data warehouses. His passion for SQL and relational databases dates to the
early 1990s, when he was building a relational database engine designed for
large corporate data warehouses at the now-defunct Thinking Machines Corporation. Since then, he has had the opportunity to work with all the leading
database vendors, including Microsoft, Oracle, and IBM.
With his colleague Michael Berry, Gordon has written three of the most popular books on data mining, starting with Data Mining Techniques for Marketing,
Sales, and Customer Support. In addition to writing books on data mining, he
also teaches courses on data mining, and has taught thousands of students on
four continents.
Gordon is currently a principal at Data Miners, a consulting company he
and Michael Berry founded in 1998. Data Miners is devoted to doing and
teaching data mining and customer-centric data analysis.
Credits
Acquisitions Editor
Robert Elliott
Vice President and Executive Publisher
Joseph B. Wikert
Development Editor
Ed Connor
Project Coordinator, Cover
Lynsey Osborn
Technical Editor
Michael J. A. Berry
Copy Editor
Kim Cofer
Graphics and Production Specialists
Craig Woods, Happenstance Type-O-Rama
Oso Rey, Happenstance Type-O-Rama
Editorial Manager
Mary Beth Wakefield
Proofreading
Ian Golder, Word One
Production Manager
Tim Tate
Indexing
Johnna VanHoose Dinse
Vice President and Executive Group Publisher
Richard Swadley
Anniversary Logo Design
Richard Pacifico
Production Editor
William A. Barton
99513ftoc.qxd:WileyRed
8/24/07
11:15 AM
Page ix
Contents
Foreword
Acknowledgments
Introduction
Chapter 1 A Data Miner Looks at SQL
Picturing the Structure of the Data
What Is a Data Model?
What Is a Table?
Allowing NULL Values
Column Types
What Is an Entity-Relationship Diagram?
The Zip Code Tables
Subscription Dataset
Purchases Dataset
Picturing Data Analysis Using Dataflows
What Is a Dataflow?
Dataflow Nodes (Operators)
READ: Reading a Database Table
OUTPUT: Outputting a Table (or Chart)
SELECT: Selecting Various Columns in the Table
FILTER: Filtering Rows Based on a Condition
APPEND: Appending New Calculated Columns
UNION: Combining Multiple Datasets into One
AGGREGATE: Aggregating Values
LOOKUP: Looking Up Values in One Table in Another
CROSSJOIN: General Join of Two Tables
JOIN: Join Two Tables Together Using a Key Column
SORT: Ordering the Results of a Dataset
Dataflows, SQL, and Relational Algebra
SQL Queries
What to Do, Not How to Do It
A Basic SQL Query
A Basic Summary SQL Query
What It Means to Join Tables
Cross-Joins: The Most General Joins
Lookup: A Useful Join
Equijoins
Nonequijoins
Outer Joins
Other Important Capabilities in SQL
UNION ALL
CASE
IN
Subqueries Are Our Friend
Subqueries for Naming Variables
Subqueries for Handling Summaries
Subqueries and IN
Rewriting the “IN” as a JOIN
Correlated Subqueries
The NOT IN Operator
Subqueries for UNION ALL
Lessons Learned
Chapter 2 What’s In a Table? Getting Started with Data Exploration
What Is Data Exploration?
Excel for Charting
A Basic Chart: Column Charts
Inserting the Data
Creating the Column Chart
Formatting the Column Chart
Useful Variations on the Column Chart
A New Query
Side-by-Side Columns
Stacked Columns
Stacked and Normalized Columns
Number of Orders and Revenue
Other Types of Charts
Line Charts
Area Charts
X-Y Charts (Scatter Plots)
What Values Are in the Columns?
Histograms
Histograms of Counts
Cumulative Histograms of Counts
Histograms (Frequencies) for Numeric Values
Ranges Based on the Number of Digits, Using Numeric Techniques
Ranges Based on the Number of Digits, Using String Techniques
More Refined Ranges: First Digit Plus Number of Digits
Breaking Numerics into Equal-Sized Groups
More Values to Explore — Min, Max, and Mode
Minimum and Maximum Values
The Most Common Value (Mode)
Calculating Mode Using Standard SQL
Calculating Mode Using SQL Extensions
Calculating Mode Using String Operations
Exploring String Values
Histogram of Length
Strings Starting or Ending with Spaces
Handling Upper- and Lowercase
What Characters Are in a String?
Exploring Values in Two Columns
What Are Average Sales By State?
How Often Are Products Repeated within a Single Order?
Direct Counting Approach
Comparison of Distinct Counts to Overall Counts
Which State Has the Most American Express Users?
From Summarizing One Column to Summarizing All Columns
Good Summary for One Column
Query to Get All Columns in a Table
Using SQL to Generate Summary Code
Lessons Learned
Chapter 3 How Different Is Different?
Basic Statistical Concepts
The Null Hypothesis
Confidence and Probability
Normal Distribution
How Different Are the Averages?
The Approach
Standard Deviation for Subset Averages
Three Approaches
Estimation Based on Two Samples
Estimation Based on Difference
Counting Possibilities
How Many Men?
How Many Californians?
Null Hypothesis and Confidence
How Many Customers Are Still Active?
Given the Count, What Is the Probability?
Given the Probability, What Is the Number of Stops?
The Rate or the Number?
Ratios, and Their Statistics
Standard Error of a Proportion
Confidence Interval on Proportions
Difference of Proportions
Conservative Lower Bounds
Chi-Square
Expected Values
Chi-Square Calculation
Chi-Square Distribution
Chi-Square in SQL
What States Have Unusual Affinities for Which Types of Products?
Data Investigation
SQL to Calculate Chi-Square Values
Affinity Results
Lessons Learned
Chapter 4 Where Is It All Happening? Location, Location, Location
Latitude and Longitude
Definition of Latitude and Longitude
Degrees, Minutes, Seconds, and All That
Distance between Two Locations
Euclidian Method
Accurate Method
Finding All Zip Codes within a Given Distance
Finding Nearest Zip Code in Excel
Pictures with Zip Codes
The Scatter Plot Map
Who Uses Solar Power for Heating?
Where Are the Customers?
Census Demographics
The Extremes: Richest and Poorest
Median Income
Proportion of Wealthy and Poor
Income Similarity and Dissimilarity Using Chi-Square
Comparison of Zip Codes with and without Orders
Zip Codes Not in Census File
Profiles of Zip Codes with and without Orders
Classifying and Comparing Zip Codes
Geographic Hierarchies
Wealthiest Zip Code in a State?
Zip Code with the Most Orders in Each State
Interesting Hierarchies in Geographic Data
Counties
Designated Marketing Areas (DMAs)
Census Hierarchies
Other Geographic Subdivisions
Calculating County Wealth
Identifying Counties
Measuring Wealth
Distribution of Values of Wealth
Which Zip Code Is Wealthiest Relative to Its County?
County with Highest Relative Order Penetration
Mapping in Excel
Why Create Maps?
It Can’t Be Done
Mapping on the Web
State Boundaries on Scatter Plots of Zip Codes
Plotting State Boundaries
Pictures of State Boundaries
Lessons Learned
Chapter 5 It’s a Matter of Time
Dates and Times in Databases
Some Fundamentals of Dates and Times in Databases
Extracting Components of Dates and Times
Converting to Standard Formats
Intervals (Durations)
Time Zones
Calendar Table
Starting to Investigate Dates
Verifying that Dates Have No Times
Comparing Counts by Date
Orderlines Shipped and Billed
Customers Shipped and Billed
Number of Different Bill and Ship Dates per Order
Counts of Orders and Order Sizes
Items as Measured by Number of Units
Items as Measured by Distinct Products
Size as Measured by Dollars
Days of the Week
Billing Date by Day of the Week
Changes in Day of the Week by Year
Comparison of Days of the Week for Two Dates
How Long between Two Dates?
Duration in Days
Duration in Weeks
Duration in Months
How Many Mondays?
A Business Problem about Days of the Week
Outline of a Solution
Solving It in SQL
Using a Calendar Table Instead
Year-over-Year Comparisons
Comparisons by Day
Adding a Moving Average Trend Line
Comparisons by Week
Comparisons by Month
Month-to-Date Comparison
Extrapolation by Days in Month
Estimation Based on Day of Week
Estimation Based on Previous Year
Counting Active Customers by Day
How Many Customers on a Given Day?
How Many Customers Every Day?
How Many Customers of Different Types?
How Many Customers by Tenure Segment?
Simple Chart Animation in Excel
Order Date to Ship Date
Order Date to Ship Date by Year
Querying the Data
Creating the One-Year Excel Table
Creating and Customizing the Chart
Lessons Learned
Chapter 6 How Long Will Customers Last? Survival Analysis to Understand Customers and Their Value
Background on Survival Analysis
Life Expectancy
Medical Research
Examples of Hazards
The Hazard Calculation
Data Investigation
Stop Flag
Tenure
Hazard Probability
Visualizing Customers: Time versus Tenure
Censoring
Survival and Retention
Point Estimate for Survival
Calculating Survival for All Tenures
Calculating Survival in SQL
Step 1. Create the Survival Table
Step 2: Load POPT and STOPT
Step 3: Calculate Cumulative Population
Step 4: Calculate the Hazard
Step 5: Calculate the Survival
Step 6: Fix ENDTENURE and NUMDAYS in Last Row
Generalizing the SQL
A Simple Customer Retention Calculation
Comparison between Retention and Survival
Simple Example of Hazard and Survival
Constant Hazard
What Happens to a Mixture
Constant Hazard Corresponding to Survival
Comparing Different Groups of Customers
Summarizing the Markets
Stratifying by Market
Survival Ratio
Conditional Survival
Comparing Survival over Time
How Has a Particular Hazard Changed over Time?
What Is Customer Survival by Year of Start?
What Did Survival Look Like in the Past?
Important Measures Derived from Survival
Point Estimate of Survival
Median Customer Tenure
Average Customer Lifetime
Confidence in the Hazards
Using Survival for Customer Value Calculations
Estimated Revenue
Estimating Future Revenue for One Future Start
SQL Day-by-Day Approach
SQL Summary Approach
Estimated Revenue for a Simple Group of Existing Customers
Estimated Second Year Revenue for a Homogenous Group
Pre-calculating Yearly Revenue by Tenure
Estimated Future Revenue for All Customers
Lessons Learned
Chapter 7 Factors Affecting Survival: The What and Why of Customer Tenure
What Factors Are Important and When
Explanation of the Approach
Using Averages to Compare Numeric Variables
The Answer
Answering the Question in SQL
Extension to Include Confidence Bounds
Hazard Ratios
Interpreting Hazard Ratios
Calculating Hazard Ratios
Why the Hazard Ratio
Left Truncation
Recognizing Left Truncation
Effect of Left Truncation
How to Fix Left Truncation, Conceptually
Estimating Hazard Probability for One Tenure
Estimating Hazard Probabilities for All Tenures
Time Windowing
A Business Problem
Time Windows = Left Truncation + Right Censoring
Calculating One Hazard Probability Using a Time Window
All Hazard Probabilities for a Time Window
Comparison of Hazards by Stops in Year
Competing Risks
Examples of Competing Risks
I=Involuntary Churn
V=Voluntary Churn
M=Migration
Other
Competing Risk “Hazard Probability”
Competing Risk “Survival”
What Happens to Customers over Time
Example
A Cohort-Based Approach
The Survival Analysis Approach
Before and After
Three Scenarios
A Billing Mistake
A Loyalty Program
Raising Prices
Using Survival Forecasts
Forecasting Identified Customers Who Stopped
Estimating Excess Stops
Before and After Comparison
Cohort-Based Approach
Direct Estimation of Event Effect
Approach to the Calculation
Time-Varying Covariate Survival Using SQL and Excel
Lessons Learned
Chapter 8 Customer Purchases and Other Repeated Events
Identifying Customers
Who Is the Customer?
How Many?
How Many Genders in a Household
Investigating First Names
Other Customer Information
First and Last Names
Addresses
Other Identifying Information
How Many New Customers Appear Each Year?
Counting Customers
Span of Time Making Purchases
Average Time between Orders
Purchase Intervals
RFM Analysis
The Dimensions
Recency
Frequency
Monetary
Calculating the RFM Cell
Utility of RFM
A Methodology for Marketing Experiments
Customer Migration
RFM Limits
Which Households Are Increasing Purchase Amounts Over Time?
Comparison of Earliest and Latest Values
Calculating the Earliest and Latest Values
Comparing the First and Last Values
Comparison of First Year Values and Last Year Values
Trend from the Best Fit Line
Using the Slope
Calculating the Slope
Time to Next Event
Idea behind the Calculation
Calculating Next Purchase Date Using SQL
From Next Purchase Date to Time-to-Event
Stratifying Time-to-Event
Lessons Learned
Chapter 9 What’s in a Shopping Cart? Market Basket Analysis and Association Rules
Exploratory Market Basket Analysis
Scatter Plot of Products
Duplicate Products in Orders
Histogram of Number of Units
Products Associated with One-Time Customers
Products Associated with the Best Customers
Changes in Price
Combinations (Item Sets)
Combinations of Two Products
Number of Two-Way Combinations
Generating All Two-Way Combinations
Examples of Combinations
Variations on Combinations
Combinations of Product Groups
Multi-Way Combinations
Households Not Orders
Combinations within a Household
Investigating Products within Households but Not within Orders
Multiple Purchases of the Same Product
The Simplest Association Rules
Associations and Rules
Zero-Way Association Rules
What Is the Distribution of Probabilities?
What Do Zero-Way Associations Tell Us?
One-Way Association Rules
Example of One-Way Association Rules
Generating All One-Way Rules
One-Way Rules with Evaluation Information
One-Way Rules on Product Groups
Calculating Product Group Rules Using an Intermediate Table
Calculating Product Group Rules Using Window Functions
Two-Way Associations
Calculating Two-Way Associations
Using Chi-Square to Find the Best Rules
Applying Chi-Square to Rules
Applying Chi-Square to Rules in SQL
Comparing Chi-Square Rules to Lift
Chi-Square for Negative Rules
Heterogeneous Associations
Rules of the Form “State Plus Product”
Rules Mixing Different Types of Products
Extending Association Rules
Multi-Way Associations
Rules Using Attributes of Products
Rules with Different Left- and Right-Hand Sides
Before and After: Sequential Associations
Lessons Learned
Chapter 10 Data Mining Models in SQL
Introduction to Directed Data Mining
Directed Models
The Data in Modeling
Model Set
Score Set
Prediction Model Sets versus Profiling Model Sets
Examples of Modeling Tasks
Similarity Models
Yes-or-No Models (Binary Response Classification)
Yes-or-No Models with Propensity Scores
Multiple Categories
Estimating Numeric Values
Model Evaluation
Look-Alike Models
What Is the Model?
What Is the Best Zip Code?
A Basic Look-Alike Model
Look-Alike Using Z-Scores
Example of Nearest Neighbor Model
Lookup Model for Most Popular Product
Most Popular Product
Calculating Most Popular Product Group
Evaluating the Lookup Model
Using a Profiling Lookup Model for Prediction
Using Binary Classification Instead
Lookup Model for Order Size
Most Basic Example: No Dimensions
Adding One Dimension
Adding More Dimensions
Examining Nonstationarity
Evaluating the Model Using an Average Value Chart
Lookup Model for Probability of Response
The Overall Probability as a Model
Exploring Different Dimensions
How Accurate Are the Models?
Adding More Dimensions
Naïve Bayesian Models (Evidence Models)
Some Ideas in Probability
Probabilities
Odds
Likelihood
Calculating the Naïve Bayesian Model
An Intriguing Observation
Bayesian Model of One Variable
Bayesian Model of One Variable in SQL
The “Naïve” Generalization
Naïve Bayesian Model: Scoring and Lift
Scoring with More Attributes
Creating a Cumulative Gains Chart
Comparison of Naïve Bayesian and Lookup Models
Lessons Learned
Chapter 11 The Best-Fit Line: Linear Regression Models
The Best-Fit Line
Tenure and Amount Paid
Properties of the Best-fit Line
What Does Best-Fit Mean?
Formula for Line
Expected Value
Error (Residuals)
Preserving the Averages
Inverse Model
Beware of the Data
Trend Lines in Charts
Best-fit Line in Scatter Plots
Logarithmic, Power, and Exponential Trend Curves
Polynomial Trend Curves
Moving Average
Best-fit Using LINEST() Function
Returning Values in Multiple Cells
Calculating Expected Values
LINEST() for Logarithmic, Exponential, and Power Curves
Measuring Goodness of Fit Using R²
The R² Value
Limitations of R²
What R² Really Means
Direct Calculation of Best-Fit Line Coefficients
Doing the Calculation
Calculating the Best-Fit Line in SQL
Price Elasticity
Price Frequency
Price Frequency for $20 Books
Price Elasticity Model in SQL
Price Elasticity Average Value Chart
Weighted Linear Regression
Customer Stops during the First Year
Weighted Best Fit
Weighted Best-Fit Line in a Chart
Weighted Best-Fit in SQL
Weighted Best-Fit Using Solver
The Weighted Best-Fit Line
Solver Is Better Than Guessing
More Than One Input Variable
Multiple Regression in Excel
Getting the Data
Investigating Each Variable Separately
Building a Model with Three Input Variables
Using Solver for Multiple Regression
Choosing Input Variables One-By-One
Multiple Regression in SQL
Lessons Learned
Chapter 12 Building Customer Signatures for Further Analysis
What Is a Customer Signature?
What Is a Customer?
Sources of Data for the Customer Signature
Current Customer Snapshot
Initial Customer Information
Self-Reported Information
External Data (Demographic and So On)
About Their Neighbors
Transaction Summaries
Using Customer Signatures
Predictive and Profile Modeling
Ad Hoc Analysis
Repository of Customer-Centric Business Metrics
Designing Customer Signatures
Column Roles
Identification Columns
Input Columns
Target Columns
Foreign Key Columns
Cutoff Date
Profiling versus Prediction
Time Frames
Naming of Columns
Eliminating Seasonality
Adding Seasonality Back In
Multiple Time Frames
Operations to Build a Customer Signature
Driving Table
Using an Existing Table as the Driving Table
Derived Table as the Driving Table
Looking Up Data
Fixed Lookup Tables
Customer Dimension Lookup Tables
Initial Transaction
Without Window Functions
With Window Functions
Pivoting
Payment Type Pivot
Channel Pivot
Year Pivot
Order Line Information Pivot
Summarizing
Basic Summaries
More Complex Summaries
Extracting Features
Geographic Location Information
Date Time Columns
Patterns in Strings
Email Addresses
Addresses
Product Descriptions
Credit Card Numbers
Summarizing Customer Behaviors
Calculating Slope for Time Series
Calculating Slope from Pivoted Time Series
Calculating Slope for a Regular Time Series
Calculating Slope for an Irregular Time Series
Weekend Shoppers
Declining Usage Behavior
Lessons Learned
Appendix Equivalent Constructs Among Databases
String Functions
Searching for Position of One String within Another
IBM
Microsoft
mysql
Oracle
SAS proc sql
String Concatenation
IBM
Microsoft
mysql
Oracle
SAS proc sql
String Length Function
IBM
Microsoft
mysql
Oracle
SAS proc sql
Substring Function
IBM
Microsoft
mysql
Oracle
SAS proc sql
Replace One Substring with Another
IBM
Microsoft