Data Analysis Using SQL and Excel

www.it-ebooks.info


99513ffirs.qxd:WileyRed

8/27/07

4:15 PM

Page iii

Data Analysis Using
SQL and Excel®
Gordon S. Linoff

Wiley Publishing, Inc.


Data Analysis Using SQL and Excel®
Published by
Wiley Publishing, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com

Copyright © 2008 by Wiley Publishing, Inc., Indianapolis, Indiana
Published simultaneously in Canada
ISBN: 978-0-470-09951-3

Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise,
except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without
either the prior written permission of the Publisher, or authorization through payment of the
appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA
01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be
addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN
46256, (317) 572-3447, fax (317) 572-4355, or online at
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations
or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular
purpose. No warranty may be created or extended by sales or promotional materials. The advice
and strategies contained herein may not be suitable for every situation. This work is sold with the
understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising
herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a
potential source of further information does not mean that the author or the publisher endorses the
information the organization or Website may provide or recommendations it may make. Further,
readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.
For general information on our other products and services or to obtain technical support, please
contact our Customer Care Department within the U.S. at (800) 762-2974, outside the U.S. at (317)
572-3993, or fax (317) 572-4002.
Library of Congress Cataloging-in-Publication Data:
Linoff, Gordon.
Data analysis using SQL and Excel / Gordon S. Linoff.
p. cm.
Includes index.
ISBN 978-0-470-09951-3 (paper/website)
1. SQL (Computer program language) 2. Querying (Computer science) 3. Data mining. 4.
Microsoft Excel (Computer file) I. Title.
QA76.73.S67L56 2007

005.75'85--dc22
2007026313
Trademarks: Wiley, the Wiley logo, and related trade dress are trademarks or registered trademarks
of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not
be used without written permission. Excel is a registered trademark of Microsoft Corporation in the
United States and/or other countries. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print
may not be available in electronic books.


To Giuseppe for sixteen years, five books, and counting . . .


About the Author

Gordon Linoff is a recognized expert in the field
of data mining. He has more than twenty-five years of experience working
with companies large and small to analyze customer data and to help design
data warehouses. His passion for SQL and relational databases dates to the
early 1990s, when he was building a relational database engine designed for
large corporate data warehouses at the now-defunct Thinking Machines Corporation. Since then, he has had the opportunity to work with all the leading
database vendors, including Microsoft, Oracle, and IBM.
With his colleague Michael Berry, Gordon has written three of the most popular books on data mining, starting with Data Mining Techniques for Marketing,
Sales, and Customer Support. In addition to writing books on data mining, he
also teaches courses on data mining, and has taught thousands of students on
four continents.
Gordon is currently a principal at Data Miners, a consulting company he
and Michael Berry founded in 1998. Data Miners is devoted to doing and
teaching data mining and customer-centric data analysis.


Credits

Acquisitions Editor
Robert Elliott

Vice President and Executive Publisher
Joseph B. Wikert

Development Editor
Ed Connor

Project Coordinator, Cover
Lynsey Osborn

Technical Editor
Michael J. A. Berry

Copy Editor
Kim Cofer

Graphics and Production Specialists
Craig Woods, Happenstance Type-O-Rama
Oso Rey, Happenstance Type-O-Rama


Editorial Manager
Mary Beth Wakefield

Proofreading
Ian Golder, Word One

Production Manager
Tim Tate

Indexing
Johnna VanHoose Dinse

Vice President and Executive Group Publisher
Richard Swadley

Anniversary Logo Design
Richard Pacifico

Production Editor
William A. Barton


Contents

Foreword

Acknowledgments

Introduction

Chapter 1 A Data Miner Looks at SQL

Picturing the Structure of the Data
What Is a Data Model?
What Is a Table?
Allowing NULL Values
Column Types
What Is an Entity-Relationship Diagram?
The Zip Code Tables
Subscription Dataset
Purchases Dataset

Picturing Data Analysis Using Dataflows
What Is a Dataflow?
Dataflow Nodes (Operators)
READ: Reading a Database Table
OUTPUT: Outputting a Table (or Chart)
SELECT: Selecting Various Columns in the Table
FILTER: Filtering Rows Based on a Condition
APPEND: Appending New Calculated Columns
UNION: Combining Multiple Datasets into One
AGGREGATE: Aggregating Values
LOOKUP: Looking Up Values in One Table in Another
CROSSJOIN: General Join of Two Tables
JOIN: Join Two Tables Together Using a Key Column
SORT: Ordering the Results of a Dataset
Dataflows, SQL, and Relational Algebra

SQL Queries

What to Do, Not How to Do It
A Basic SQL Query
A Basic Summary SQL Query
What It Means to Join Tables
Cross-Joins: The Most General Joins
Lookup: A Useful Join
Equijoins
Nonequijoins
Outer Joins
Other Important Capabilities in SQL
UNION ALL
CASE
IN

Subqueries Are Our Friend
Subqueries for Naming Variables
Subqueries for Handling Summaries
Subqueries and IN

Rewriting the “IN” as a JOIN
Correlated Subqueries
The NOT IN Operator
Subqueries for UNION ALL

Lessons Learned

Chapter 2 What’s In a Table? Getting Started with Data Exploration
What Is Data Exploration?
Excel for Charting

A Basic Chart: Column Charts
Inserting the Data
Creating the Column Chart
Formatting the Column Chart
Useful Variations on the Column Chart
A New Query
Side-by-Side Columns
Stacked Columns
Stacked and Normalized Columns
Number of Orders and Revenue
Other Types of Charts
Line Charts
Area Charts
X-Y Charts (Scatter Plots)

What Values Are in the Columns?
Histograms
Histograms of Counts

Cumulative Histograms of Counts
Histograms (Frequencies) for Numeric Values
Ranges Based on the Number of Digits, Using Numeric Techniques
Ranges Based on the Number of Digits, Using String Techniques
More Refined Ranges: First Digit Plus Number of Digits
Breaking Numerics into Equal-Sized Groups

More Values to Explore — Min, Max, and Mode

Minimum and Maximum Values
The Most Common Value (Mode)
Calculating Mode Using Standard SQL
Calculating Mode Using SQL Extensions
Calculating Mode Using String Operations

Exploring String Values
Histogram of Length
Strings Starting or Ending with Spaces
Handling Upper- and Lowercase
What Characters Are in a String?

Exploring Values in Two Columns
What Are Average Sales By State?
How Often Are Products Repeated within a Single Order?
Direct Counting Approach
Comparison of Distinct Counts to Overall Counts
Which State Has the Most American Express Users?

From Summarizing One Column to Summarizing All Columns
Good Summary for One Column
Query to Get All Columns in a Table
Using SQL to Generate Summary Code

Lessons Learned

Chapter 3 How Different Is Different?
Basic Statistical Concepts

The Null Hypothesis
Confidence and Probability
Normal Distribution

How Different Are the Averages?
The Approach
Standard Deviation for Subset Averages
Three Approaches
Estimation Based on Two Samples
Estimation Based on Difference

Counting Possibilities
How Many Men?
How Many Californians?
Null Hypothesis and Confidence
How Many Customers Are Still Active?
Given the Count, What Is the Probability?
Given the Probability, What Is the Number of Stops?
The Rate or the Number?

Ratios, and Their Statistics
Standard Error of a Proportion
Confidence Interval on Proportions
Difference of Proportions
Conservative Lower Bounds

Chi-Square

Expected Values
Chi-Square Calculation
Chi-Square Distribution
Chi-Square in SQL
What States Have Unusual Affinities for Which Types of Products?
Data Investigation
SQL to Calculate Chi-Square Values
Affinity Results

Lessons Learned

Chapter 4 Where Is It All Happening? Location, Location, Location
Latitude and Longitude

Definition of Latitude and Longitude
Degrees, Minutes, Seconds, and All That
Distance between Two Locations
Euclidian Method
Accurate Method
Finding All Zip Codes within a Given Distance
Finding Nearest Zip Code in Excel
Pictures with Zip Codes
The Scatter Plot Map
Who Uses Solar Power for Heating?
Where Are the Customers?

Census Demographics
The Extremes: Richest and Poorest
Median Income
Proportion of Wealthy and Poor
Income Similarity and Dissimilarity Using Chi-Square
Comparison of Zip Codes with and without Orders
Zip Codes Not in Census File
Profiles of Zip Codes with and without Orders
Classifying and Comparing Zip Codes

Geographic Hierarchies

Wealthiest Zip Code in a State?
Zip Code with the Most Orders in Each State
Interesting Hierarchies in Geographic Data
Counties
Designated Marketing Areas (DMAs)
Census Hierarchies
Other Geographic Subdivisions

Calculating County Wealth
Identifying Counties
Measuring Wealth
Distribution of Values of Wealth
Which Zip Code Is Wealthiest Relative to Its County?
County with Highest Relative Order Penetration

Mapping in Excel
Why Create Maps?

It Can’t Be Done
Mapping on the Web
State Boundaries on Scatter Plots of Zip Codes
Plotting State Boundaries
Pictures of State Boundaries

Lessons Learned

Chapter 5 It’s a Matter of Time
Dates and Times in Databases


Some Fundamentals of Dates and Times in Databases
Extracting Components of Dates and Times
Converting to Standard Formats
Intervals (Durations)
Time Zones
Calendar Table

Starting to Investigate Dates
Verifying that Dates Have No Times
Comparing Counts by Date
Orderlines Shipped and Billed
Customers Shipped and Billed
Number of Different Bill and Ship Dates per Order
Counts of Orders and Order Sizes
Items as Measured by Number of Units
Items as Measured by Distinct Products
Size as Measured by Dollars
Days of the Week
Billing Date by Day of the Week
Changes in Day of the Week by Year
Comparison of Days of the Week for Two Dates

How Long between Two Dates?
Duration in Days
Duration in Weeks
Duration in Months
How Many Mondays?

A Business Problem about Days of the Week
Outline of a Solution
Solving It in SQL
Using a Calendar Table Instead

Year-over-Year Comparisons
Comparisons by Day
Adding a Moving Average Trend Line
Comparisons by Week
Comparisons by Month
Month-to-Date Comparison
Extrapolation by Days in Month

Estimation Based on Day of Week
Estimation Based on Previous Year

Counting Active Customers by Day
How Many Customers on a Given Day?
How Many Customers Every Day?
How Many Customers of Different Types?
How Many Customers by Tenure Segment?

Simple Chart Animation in Excel
Order Date to Ship Date
Order Date to Ship Date by Year
Querying the Data
Creating the One-Year Excel Table
Creating and Customizing the Chart

Lessons Learned

Chapter 6 How Long Will Customers Last? Survival Analysis to Understand Customers and Their Value
Background on Survival Analysis

Life Expectancy
Medical Research
Examples of Hazards

The Hazard Calculation
Data Investigation
Stop Flag
Tenure

Hazard Probability
Visualizing Customers: Time versus Tenure
Censoring

Survival and Retention
Point Estimate for Survival
Calculating Survival for All Tenures
Calculating Survival in SQL
Step 1: Create the Survival Table
Step 2: Load POPT and STOPT
Step 3: Calculate Cumulative Population
Step 4: Calculate the Hazard
Step 5: Calculate the Survival
Step 6: Fix ENDTENURE and NUMDAYS in Last Row
Generalizing the SQL

A Simple Customer Retention Calculation
Comparison between Retention and Survival
Simple Example of Hazard and Survival
Constant Hazard
What Happens to a Mixture
Constant Hazard Corresponding to Survival

Comparing Different Groups of Customers


Summarizing the Markets
Stratifying by Market
Survival Ratio
Conditional Survival

Comparing Survival over Time

How Has a Particular Hazard Changed over Time?
What Is Customer Survival by Year of Start?
What Did Survival Look Like in the Past?

Important Measures Derived from Survival
Point Estimate of Survival
Median Customer Tenure
Average Customer Lifetime
Confidence in the Hazards

Using Survival for Customer Value Calculations
Estimated Revenue
Estimating Future Revenue for One Future Start

SQL Day-by-Day Approach
SQL Summary Approach
Estimated Revenue for a Simple Group of Existing Customers
Estimated Second Year Revenue for a Homogenous Group
Pre-calculating Yearly Revenue by Tenure
Estimated Future Revenue for All Customers

Lessons Learned

Chapter 7 Factors Affecting Survival: The What and Why of Customer Tenure
What Factors Are Important and When

Explanation of the Approach
Using Averages to Compare Numeric Variables
The Answer
Answering the Question in SQL
Extension to Include Confidence Bounds
Hazard Ratios
Interpreting Hazard Ratios
Calculating Hazard Ratios
Why the Hazard Ratio

Left Truncation
Recognizing Left Truncation
Effect of Left Truncation


How to Fix Left Truncation, Conceptually
Estimating Hazard Probability for One Tenure
Estimating Hazard Probabilities for All Tenures

Time Windowing

A Business Problem
Time Windows = Left Truncation + Right Censoring
Calculating One Hazard Probability Using a Time Window
All Hazard Probabilities for a Time Window
Comparison of Hazards by Stops in Year

Competing Risks

Examples of Competing Risks
I=Involuntary Churn
V=Voluntary Churn
M=Migration
Other

Competing Risk “Hazard Probability”
Competing Risk “Survival”
What Happens to Customers over Time
Example
A Cohort-Based Approach
The Survival Analysis Approach

Before and After

Three Scenarios
A Billing Mistake
A Loyalty Program
Raising Prices
Using Survival Forecasts
Forecasting Identified Customers Who Stopped
Estimating Excess Stops

Before and After Comparison
Cohort-Based Approach
Direct Estimation of Event Effect
Approach to the Calculation
Time-Varying Covariate Survival Using SQL and Excel

Lessons Learned

Chapter 8 Customer Purchases and Other Repeated Events

Identifying Customers

Who Is the Customer?
How Many?
How Many Genders in a Household
Investigating First Names
Other Customer Information
First and Last Names
Addresses
Other Identifying Information

How Many New Customers Appear Each Year?
Counting Customers
Span of Time Making Purchases
Average Time between Orders
Purchase Intervals

RFM Analysis

The Dimensions
Recency
Frequency
Monetary
Calculating the RFM Cell
Utility of RFM
A Methodology for Marketing Experiments
Customer Migration
RFM Limits

Which Households Are Increasing Purchase Amounts Over Time?
Comparison of Earliest and Latest Values
Calculating the Earliest and Latest Values
Comparing the First and Last Values

Comparison of First Year Values and Last Year Values
Trend from the Best Fit Line
Using the Slope
Calculating the Slope

Time to Next Event
Idea behind the Calculation
Calculating Next Purchase Date Using SQL
From Next Purchase Date to Time-to-Event
Stratifying Time-to-Event

Lessons Learned

Chapter 9 What’s in a Shopping Cart? Market Basket Analysis and Association Rules
Exploratory Market Basket Analysis

Scatter Plot of Products
Duplicate Products in Orders
Histogram of Number of Units
Products Associated with One-Time Customers
Products Associated with the Best Customers
Changes in Price


Combinations (Item Sets)
Combinations of Two Products
Number of Two-Way Combinations
Generating All Two-Way Combinations
Examples of Combinations
Variations on Combinations
Combinations of Product Groups
Multi-Way Combinations

Households Not Orders
Combinations within a Household
Investigating Products within Households but Not within Orders
Multiple Purchases of the Same Product

The Simplest Association Rules
Associations and Rules
Zero-Way Association Rules
What Is the Distribution of Probabilities?
What Do Zero-Way Associations Tell Us?

One-Way Association Rules
Example of One-Way Association Rules
Generating All One-Way Rules
One-Way Rules with Evaluation Information
One-Way Rules on Product Groups
Calculating Product Group Rules Using an Intermediate Table
Calculating Product Group Rules Using Window Functions

Two-Way Associations
Calculating Two-Way Associations
Using Chi-Square to Find the Best Rules
Applying Chi-Square to Rules
Applying Chi-Square to Rules in SQL
Comparing Chi-Square Rules to Lift
Chi-Square for Negative Rules
Heterogeneous Associations
Rules of the Form “State Plus Product”
Rules Mixing Different Types of Products

Extending Association Rules
Multi-Way Associations
Rules Using Attributes of Products
Rules with Different Left- and Right-Hand Sides
Before and After: Sequential Associations

Lessons Learned

Chapter 10 Data Mining Models in SQL

Introduction to Directed Data Mining
Directed Models
The Data in Modeling
Model Set
Score Set
Prediction Model Sets versus Profiling Model Sets
Examples of Modeling Tasks
Similarity Models
Yes-or-No Models (Binary Response Classification)

Yes-or-No Models with Propensity Scores
Multiple Categories
Estimating Numeric Values
Model Evaluation

Look-Alike Models

What Is the Model?
What Is the Best Zip Code?
A Basic Look-Alike Model
Look-Alike Using Z-Scores
Example of Nearest Neighbor Model

Lookup Model for Most Popular Product


Most Popular Product
Calculating Most Popular Product Group
Evaluating the Lookup Model
Using a Profiling Lookup Model for Prediction
Using Binary Classification Instead

Lookup Model for Order Size
Most Basic Example: No Dimensions
Adding One Dimension
Adding More Dimensions
Examining Nonstationarity
Evaluating the Model Using an Average Value Chart

Lookup Model for Probability of Response
The Overall Probability as a Model
Exploring Different Dimensions
How Accurate Are the Models?
Adding More Dimensions

Naïve Bayesian Models (Evidence Models)
Some Ideas in Probability
Probabilities
Odds
Likelihood
Calculating the Naïve Bayesian Model
An Intriguing Observation
Bayesian Model of One Variable

Bayesian Model of One Variable in SQL
The “Naïve” Generalization
Naïve Bayesian Model: Scoring and Lift
Scoring with More Attributes
Creating a Cumulative Gains Chart
Comparison of Naïve Bayesian and Lookup Models

Lessons Learned

Chapter 11 The Best-Fit Line: Linear Regression Models
The Best-Fit Line
Tenure and Amount Paid

Properties of the Best-fit Line
What Does Best-Fit Mean?
Formula for Line
Expected Value
Error (Residuals)
Preserving the Averages
Inverse Model
Beware of the Data
Trend Lines in Charts
Best-fit Line in Scatter Plots
Logarithmic, Power, and Exponential Trend Curves
Polynomial Trend Curves
Moving Average
Best-fit Using LINEST() Function
Returning Values in Multiple Cells
Calculating Expected Values
LINEST() for Logarithmic, Exponential, and Power Curves

Measuring Goodness of Fit Using R²
The R² Value
Limitations of R²
What R² Really Means


Direct Calculation of Best-Fit Line Coefficients
Doing the Calculation
Calculating the Best-Fit Line in SQL
Price Elasticity
Price Frequency
Price Frequency for $20 Books
Price Elasticity Model in SQL
Price Elasticity Average Value Chart

Weighted Linear Regression
Customer Stops during the First Year
Weighted Best Fit
Weighted Best-Fit Line in a Chart
Weighted Best-Fit in SQL
Weighted Best-Fit Using Solver
The Weighted Best-Fit Line
Solver Is Better Than Guessing

More Than One Input Variable
Multiple Regression in Excel
Getting the Data
Investigating Each Variable Separately
Building a Model with Three Input Variables
Using Solver for Multiple Regression
Choosing Input Variables One-By-One
Multiple Regression in SQL

Lessons Learned

Chapter 12 Building Customer Signatures for Further Analysis
What Is a Customer Signature?

What Is a Customer?
Sources of Data for the Customer Signature
Current Customer Snapshot
Initial Customer Information
Self-Reported Information
External Data (Demographic and So On)
About Their Neighbors
Transaction Summaries
Using Customer Signatures
Predictive and Profile Modeling
Ad Hoc Analysis
Repository of Customer-Centric Business Metrics

Designing Customer Signatures

Column Roles
Identification Columns
Input Columns
Target Columns
Foreign Key Columns
Cutoff Date
Profiling versus Prediction
Time Frames
Naming of Columns
Eliminating Seasonality
Adding Seasonality Back In
Multiple Time Frames

Operations to Build a Customer Signature
Driving Table
Using an Existing Table as the Driving Table
Derived Table as the Driving Table
Looking Up Data
Fixed Lookup Tables
Customer Dimension Lookup Tables
Initial Transaction
Without Window Functions
With Window Functions
Pivoting
Payment Type Pivot
Channel Pivot
Year Pivot
Order Line Information Pivot
Summarizing
Basic Summaries
More Complex Summaries

Extracting Features


Geographic Location Information
Date Time Columns
Patterns in Strings
Email Addresses
Addresses
Product Descriptions
Credit Card Numbers

Summarizing Customer Behaviors

Calculating Slope for Time Series
Calculating Slope from Pivoted Time Series
Calculating Slope for a Regular Time Series
Calculating Slope for an Irregular Time Series
Weekend Shoppers
Declining Usage Behavior


Lessons Learned

Appendix Equivalent Constructs Among Databases
String Functions

Searching for Position of One String within Another
IBM
Microsoft
mysql
Oracle
SAS proc sql
String Concatenation
IBM
Microsoft
mysql
Oracle

SAS proc sql
String Length Function
IBM
Microsoft
mysql
Oracle
SAS proc sql
Substring Function
IBM
Microsoft
mysql
Oracle
SAS proc sql
Replace One Substring with Another
IBM
Microsoft
