Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management - Second Edition

Data Mining Techniques
For Marketing, Sales, and
Customer Relationship
Management
Second Edition
Michael J.A. Berry
Gordon S. Linoff
Vice President and Executive Group Publisher: Richard Swadley
Vice President and Executive Publisher: Bob Ipsen
Vice President and Publisher: Joseph B. Wikert
Executive Editorial Director: Mary Bednarek
Executive Editor: Robert M. Elliott
Editorial Manager: Kathryn A. Malm
Senior Production Editor: Fred Bernardi
Development Editor: Emilie Herman, Erica Weinstein
Production Editor: Felicia Robinson
Media Development Specialist: Laura Carpenter VanWinkle
Text Design & Composition: Wiley Composition Services
Copyright © 2004 by Wiley Publishing, Inc., Indianapolis, Indiana
All rights reserved.
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by
any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted
under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission
of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8700. Requests to the
Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint
Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4447, E-mail:
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or completeness
of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for
a particular purpose. No warranty may be created or extended by sales representatives or written sales
materials. The advice and strategies contained herein may not be suitable for your situation. You should consult
with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit
or any other commercial damages, including but not limited to special, incidental, consequential, or other
damages.
For general information on our other products and services please contact our Customer Care Department
within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Trademarks: Wiley, the Wiley Publishing logo, are trademarks or registered trademarks of John Wiley & Sons,
Inc. and/or its affiliates in the United States and other countries. All other trademarks are the property of their
respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this
book.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not
be available in electronic books.
Library of Congress Cataloging-in-Publication Data:
Berry, Michael J. A.
Data mining techniques : for marketing, sales, and customer
relationship management / Michael J.A. Berry, Gordon Linoff.— 2nd ed.
p. cm.
Includes index.
ISBN 0-471-47064-3 (paper/website)
1. Data mining. 2. Marketing—Data processing. 3. Business—Data
processing. I. Linoff, Gordon. II. Title.

HF5415.125 .B47 2004
658.8’02—dc22
2003026693
ISBN: 0-471-47064-3
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
To Stephanie, Sasha, and Nathaniel. Without your patience and
understanding, this book would not have been possible.
— Michael
To Puccio. Grazie per essere paziente con me.
Ti amo.
— Gordon
Acknowledgments
We are fortunate to be surrounded by some of the most talented data miners
anywhere, so our first thanks go to our colleagues at Data Miners, Inc. from
whom we have learned so much: Will Potts, Dorian Pyle, and Brij Masand.
There are also clients with whom we work so closely that we consider them
our colleagues as well: Harrison Sohmer and Stuart E. Ward, III are in that
category. Our Editor, Bob Elliott, Editorial Assistant, Erica Weinstein, and
Development Editor, Emilie Herman, kept us (more or less) on schedule and helped
us maintain a consistent style. Lauren McCann, a graduate student at M.I.T.
and intern at Data Miners, prepared the census data used in some examples
and created some of the illustrations.
We would also like to acknowledge all of the people we have worked with
in scores of data mining engagements over the years. We have learned
something from every one of them. The many whose data mining projects have
influenced the second edition of this book include:

Al Fan
Herb Edelstein
Nick Gagliardo
Alan Parker
Jill Holtz
Nick Radcliffe
Anne Milley
Joan Forrester
Patrick Surry
Brian Guscott
John Wallace
Ronny Kohavi
Bruce Rylander
Josh Goff
Sheridan Young
Corina Cortes
Karen Kennedy
Susan Hunt Stevens
Daryl Berry
Kurt Thearling
Ted Browne
Daryl Pregibon
Lynne Brennen
Terri Kowalchuk
Doug Newell
Mark Smith
Victor Lo
Ed Freeman
Mateus Kehder
Yasmin Namini

Erin McCarthy
Michael Patrick
Zai Ying Huang
And, of course, all the people we thanked in the first edition are still
deserving of acknowledgement:
Bob Flynn
Jim Flynn
Paul Berry
Bryan McNeely
Kamran Parsaye
Rakesh Agrawal
Claire Budden
Karen Stewart
Ric Amari
David Isaac
Larry Bookman
Rich Cohen
David Waltz
Larry Scroggins
Robert Groth
Dena d’Ebin
Lars Rohrberg
Robert Utzschnieder
Diana Lin
Lounette Dyer
Roland Pesch
Don Peppers

Marc Goodman
Stephen Smith
Ed Horton
Marc Reifeis
Sue Osterfelt
Edward Ewen
Marge Sherold
Susan Buchanan
Fred Chapman
Mario Bourgoin
Syamala Srinivasan
Gary Drescher
Prof. Michael Jordan
Wei-Xing Ho
Gregory Lampshire
Patsy Campbell
William Petefish
Janet Smith
Paul Becker
Yvonne McCollin
Jerry Modes
About the Authors
Michael J. A. Berry and Gordon S. Linoff are well known in the data mining
field. They have jointly authored three influential and widely read books on
data mining that have been translated into many languages. They each have
close to two decades of experience applying data mining techniques to
business problems in marketing and customer relationship management.
Michael and Gordon first worked together during the 1980s at Thinking
Machines Corporation, which was a pioneer in mining large databases. In
1996, they collaborated on a data mining seminar, which soon evolved into the
first edition of this book. The success of that collaboration gave them the
courage to start Data Miners, Inc., a respected data mining consultancy, in
1998. As data mining consultants, they have worked with a wide variety of
major companies in North America, Europe, and Asia, turning customer
databases, call detail records, Web log entries, point-of-sale records, and billing
files into useful information that can be used to improve the customer
experience. The authors’ years of hands-on data mining experience are reflected in
every chapter of this extensively updated and revised edition of their first
book, Data Mining Techniques.
When not mining data at some distant client site, Michael lives in
Cambridge, Massachusetts, and Gordon lives in New York City.
Introduction
The first edition of Data Mining Techniques for Marketing, Sales, and Customer
Support appeared on book shelves in 1997. The book actually got its start in
1996 as Gordon and I were developing a 1-day data mining seminar for
NationsBank (now Bank of America). Sue Osterfelt, a vice president at
NationsBank and the author of a book on database applications with Bill
Inmon, convinced us that our seminar material ought to be developed into a
book. She introduced us to Bob Elliott, her editor at John Wiley & Sons, and
before we had time to think better of it, we signed a contract.
Neither of us had written a book before, and drafts of early chapters clearly
showed this. Thanks to Bob’s help, though, we made a lot of progress, and the
final product was a book we are still proud of. It is no exaggeration to say that
the experience changed our lives — first by taking over every waking hour
and some when we should have been sleeping; then, more positively, by
providing the basis for the consulting company we founded, Data Miners, Inc.
The first book, which has become a standard text in data mining, was followed
by others, Mastering Data Mining and Mining the Web.
So, why a revised edition? The world of data mining has changed a lot since
we started writing in 1996. For instance, back then, Amazon.com was still
new; U.S. mobile phone calls cost on average 56 cents per minute, and fewer
than 25 percent of Americans even owned a mobile phone; and the KDD data
mining conference was in its second year. Our understanding has changed
even more. For the most part, the underlying algorithms remain the same,
although the software in which the algorithms are embedded, the data to
which they are applied, and the business problems they are used to solve have
all grown and evolved.
Even if the technological and business worlds had stood still, we would
have wanted to update Data Mining Techniques because we have learned so
much in the intervening years. One of the joys of consulting is the constant
exposure to new ideas, new problems, and new solutions. We may not be any
smarter than when we wrote the first edition, but we do have more experience
and that added experience has changed the way we approach the material. A
glance at the Table of Contents may suggest that we have reduced the amount
of business-related material and increased the amount of technical material.
Instead, we have folded some of the business material into the technical
chapters so that the data mining techniques are introduced in their business
context. We hope this makes it easier for readers to see how to apply the
techniques to their own business problems.
It has also come to our attention that a number of business school courses
have used this book as a text. Although we did not write the book as a text, in
the second edition we have tried to facilitate its use as one by using more
examples based on publicly available data, such as the U.S. census, and by
making some recommended reading and suggested exercises available at the
companion Web site, www.data-miners.com/companion.
The book is still divided into three parts. The first part talks about the
business context of data mining, starting with a chapter that introduces data
mining and explains what it is used for and why. The second chapter introduces
the virtuous cycle of data mining: the ongoing process by which data
mining is used to turn data into information that leads to actions, which in turn
create more data and more opportunities for learning. Chapter 3 is a
much-expanded discussion of data mining methodology and best practices. This
chapter benefits more than any other from our experience since writing the
first book. The methodology introduced here is designed to build on the
successful engagements we have been involved in. Chapter 4, which has no
counterpart in the first edition, is about applications of data mining in marketing
and customer relationship management, the fields where most of our own
work has been done.
The second part consists of the technical chapters about the data mining
techniques themselves. All of the techniques described in the first edition are
still here although they are presented in a different order. The descriptions
have been rewritten to make them clearer and more accurate while still
retaining nontechnical language wherever possible.
In addition to the seven techniques covered in the first edition — decision
trees, neural networks, memory-based reasoning, association rules, cluster
detection, link analysis, and genetic algorithms — there is now a chapter on
data mining using basic statistical techniques and another new chapter on
survival analysis. Survival analysis is a technique that has been adapted from the
small samples and continuous time measurements of the medical world to the
large samples and discrete time measurements found in marketing data. The
chapter on memory-based reasoning now also includes a discussion of
collaborative filtering, another technique based on nearest neighbors that has
become popular with Web retailers as a way of generating recommendations.
The third part of the book talks about applying the techniques in a business
context, including a chapter on finding customers in data, one on the
relationship of data mining and data warehousing, another on the data mining
environment (both corporate and technical), and a final chapter on putting data
mining to work in an organization. A new chapter in this part covers
preparing data for data mining, an extremely important topic since most data miners
report that transforming data takes up the majority of time in a typical data
mining project.
Like the first edition, this book is aimed at current and future data mining
practitioners. It is not meant for software developers looking for detailed
instructions on how to implement the various data mining algorithms nor for
researchers trying to improve upon those algorithms. Ideas are presented in
nontechnical language with minimal use of mathematical formulas and arcane
jargon. Each data mining technique is shown in a real business context with
examples of its use taken from real data mining engagements. In short, we
have tried to write the book that we would have liked to read when we began
our own data mining careers.
— Michael J. A. Berry, October, 2003
Contents
Acknowledgments xix
About the Authors xxi
Introduction xxiii
Chapter 1 Why and What Is Data Mining? 1
Analytic Customer Relationship Management 2
The Role of Transaction Processing Systems 3
The Role of Data Warehousing 4
The Role of Data Mining 5
The Role of the Customer Relationship Management Strategy 6
What Is Data Mining? 7

What Tasks Can Be Performed with Data Mining? 8
Classification 8
Estimation 9
Prediction 10
Affinity Grouping or Association Rules 11
Clustering 11
Profiling 12
Why Now? 12
Data Is Being Produced 12
Data Is Being Warehoused 13
Computing Power Is Affordable 13
Interest in Customer Relationship Management Is Strong 13
Every Business Is a Service Business 14
Information Is a Product 14
Commercial Data Mining Software Products
Have Become Available 15
How Data Mining Is Being Used Today 15
A Supermarket Becomes an Information Broker 15
A Recommendation-Based Business 16
Cross-Selling 17
Holding on to Good Customers 17
Weeding out Bad Customers 18
Revolutionizing an Industry 18
And Just about Anything Else 19
Lessons Learned 19
Chapter 2 The Virtuous Cycle of Data Mining 21
A Case Study in Business Data Mining 22

Identifying the Business Challenge 23
Applying Data Mining 24
Acting on the Results 25
Measuring the Effects 25
What Is the Virtuous Cycle? 26
Identify the Business Opportunity 27
Mining Data 28
Take Action 30
Measuring Results 30
Data Mining in the Context of the Virtuous Cycle 32
A Wireless Communications Company Makes
the Right Connections 34
The Opportunity 34
How Data Mining Was Applied 35
Defining the Inputs 37
Derived Inputs 37
The Actions 38
Completing the Cycle 39
Neural Networks and Decision Trees Drive SUV Sales 39
The Initial Challenge 39
How Data Mining Was Applied 40
The Data 40
Down the Mine Shaft 40
The Resulting Actions 41
Completing the Cycle 42
Lessons Learned 42
Chapter 3 Data Mining Methodology and Best Practices 43
Why Have a Methodology? 44
Learning Things That Aren’t True 44

Patterns May Not Represent Any Underlying Rule 45
The Model Set May Not Reflect the Relevant Population 46
Data May Be at the Wrong Level of Detail 47
Learning Things That Are True, but Not Useful 48
Learning Things That Are Already Known 49
Learning Things That Can’t Be Used 49
Hypothesis Testing 50
Generating Hypotheses 51
Testing Hypotheses 51
Models, Profiling, and Prediction 51
Profiling 53
Prediction 54
The Methodology 54
Step One: Translate the Business Problem
into a Data Mining Problem 56
What Does a Data Mining Problem Look Like? 56
How Will the Results Be Used? 57
How Will the Results Be Delivered? 58
The Role of Business Users and Information Technology 58
Step Two: Select Appropriate Data 60
What Is Available? 61
How Much Data Is Enough? 62
How Much History Is Required? 63
How Many Variables? 63
What Must the Data Contain? 64
Step Three: Get to Know the Data 64
Examine Distributions 65

Compare Values with Descriptions 66
Validate Assumptions 67
Ask Lots of Questions 67
Step Four: Create a Model Set 68
Assembling Customer Signatures 68
Creating a Balanced Sample 68
Including Multiple Timeframes 70
Creating a Model Set for Prediction 70
Partitioning the Model Set 71
Step Five: Fix Problems with the Data 72
Categorical Variables with Too Many Values 73
Numeric Variables with Skewed Distributions and Outliers 73
Missing Values 73
Values with Meanings That Change over Time 74
Inconsistent Data Encoding 74
Step Six: Transform Data to Bring Information to the Surface 74
Capture Trends 75
Create Ratios and Other Combinations of Variables 75
Convert Counts to Proportions 75
Step Seven: Build Models 77
Step Eight: Assess Models 78
Assessing Descriptive Models 78
Assessing Directed Models 78
Assessing Classifiers and Predictors 79
Assessing Estimators 79
Comparing Models Using Lift 81
Problems with Lift 83
Step Nine: Deploy Models 84

Step Ten: Assess Results 85
Step Eleven: Begin Again 85
Lessons Learned 86
Chapter 4 Data Mining Applications in Marketing and
Customer Relationship Management 87
Prospecting 87
Identifying Good Prospects 88
Choosing a Communication Channel 89
Picking Appropriate Messages 89
Data Mining to Choose the Right Place to Advertise 90
Who Fits the Profile? 90
Measuring Fitness for Groups of Readers 93
Data Mining to Improve Direct Marketing Campaigns 95
Response Modeling 96
Optimizing Response for a Fixed Budget 97
Optimizing Campaign Profitability 100
How the Model Affects Profitability 103
Reaching the People Most Influenced by the Message 106
Differential Response Analysis 107
Using Current Customers to Learn About Prospects 108
Start Tracking Customers before They Become Customers 109
Gather Information from New Customers 109
Acquisition-Time Variables Can Predict Future Outcomes 110
Data Mining for Customer Relationship Management 110
Matching Campaigns to Customers 110
Segmenting the Customer Base 111
Finding Behavioral Segments 111
Tying Market Research Segments to Behavioral Data 113
Reducing Exposure to Credit Risk 113

Predicting Who Will Default 113
Improving Collections 114
Determining Customer Value 114
Cross-selling, Up-selling, and Making Recommendations 115
Finding the Right Time for an Offer 115
Making Recommendations 116
Retention and Churn 116
Recognizing Churn 116
Why Churn Matters 117
Different Kinds of Churn 118
Different Kinds of Churn Model 119
Predicting Who Will Leave 119
Predicting How Long Customers Will Stay 119
Lessons Learned 120
Chapter 5 The Lure of Statistics: Data Mining Using Familiar Tools 123
Occam’s Razor 124
The Null Hypothesis 125
P-Values 126
A Look at Data 126
Looking at Discrete Values 127
Histograms 127
Time Series 128
Standardized Values 129
From Standardized Values to Probabilities 133
Cross-Tabulations 136
Looking at Continuous Variables 136
Statistical Measures for Continuous Variables 137
Variance and Standard Deviation 138

A Couple More Statistical Ideas 139
Measuring Response 139
Standard Error of a Proportion 139
Comparing Results Using Confidence Bounds 141
Comparing Results Using Difference of Proportions 143
Size of Sample 145
What the Confidence Interval Really Means 146
Size of Test and Control for an Experiment 147
Multiple Comparisons 148
The Confidence Level with Multiple Comparisons 148
Bonferroni’s Correction 149
Chi-Square Test 149
Expected Values 150
Chi-Square Value 151
Comparison of Chi-Square to Difference of Proportions 153
An Example: Chi-Square for Regions and Starts 155
Data Mining and Statistics 158
No Measurement Error in Basic Data 159
There Is a Lot of Data 160
Time Dependency Pops Up Everywhere 160
Experimentation Is Hard 160
Data Is Censored and Truncated 161
Lessons Learned 162
Chapter 6 Decision Trees 165
What Is a Decision Tree? 166
Classification 166
Scoring 169
Estimation 170
Trees Grow in Many Forms 170
How a Decision Tree Is Grown 171
Finding the Splits 172
Splitting on a Numeric Input Variable 173
Splitting on a Categorical Input Variable 174
Splitting in the Presence of Missing Values 174
Growing the Full Tree 175
Measuring the Effectiveness of a Decision Tree 176
Tests for Choosing the Best Split 176
Purity and Diversity 177
Gini or Population Diversity 178
Entropy Reduction or Information Gain 179
Information Gain Ratio 180
Chi-Square Test 180
Reduction in Variance 183
F Test 183
Pruning 184
The CART Pruning Algorithm 185
Creating the Candidate Subtrees 185
Picking the Best Subtree 189
Using the Test Set to Evaluate the Final Tree 189
The C5 Pruning Algorithm 190
Pessimistic Pruning 191
Stability-Based Pruning 191
Extracting Rules from Trees 193
Taking Cost into Account 195
Further Refinements to the Decision Tree Method 195
Using More Than One Field at a Time 195
Tilting the Hyperplane 197
Neural Trees 199

Piecewise Regression Using Trees 199
Alternate Representations for Decision Trees 199
Box Diagrams 199
Tree Ring Diagrams 201
Decision Trees in Practice 203
Decision Trees as a Data Exploration Tool 203
Applying Decision-Tree Methods to Sequential Events 205
Simulating the Future 206
Case Study: Process Control in a Coffee-Roasting Plant 206
Lessons Learned 209
Chapter 7 Artificial Neural Networks 211
A Bit of History 212
Real Estate Appraisal 213
Neural Networks for Directed Data Mining 219
What Is a Neural Net? 220
What Is the Unit of a Neural Network? 222
Feed-Forward Neural Networks 226
How Does a Neural Network Learn Using
Back Propagation? 228
Heuristics for Using Feed-Forward,
Back Propagation Networks 231
Choosing the Training Set 232
Coverage of Values for All Features 232
Number of Features 233
Size of Training Set 234
Number of Outputs 234
Preparing the Data 235
Features with Continuous Values 235
Features with Ordered, Discrete (Integer) Values 238
Features with Categorical Values 239
Other Types of Features 241

Interpreting the Results 241
Neural Networks for Time Series 244
How to Know What Is Going on Inside a Neural Network 247
Self-Organizing Maps 249
What Is a Self-Organizing Map? 249
Example: Finding Clusters 252
Lessons Learned 254
Chapter 8 Nearest Neighbor Approaches: Memory-Based
Reasoning and Collaborative Filtering 257
Memory Based Reasoning 258
Example: Using MBR to Estimate Rents in Tuxedo, New York 259
Challenges of MBR 262
Choosing a Balanced Set of Historical Records 262
Representing the Training Data 263
Determining the Distance Function, Combination
Function, and Number of Neighbors 265
Case Study: Classifying News Stories 265
What Are the Codes? 266
Applying MBR 267
Choosing the Training Set 267
Choosing the Distance Function 267
Choosing the Combination Function 267
Choosing the Number of Neighbors 270
The Results 270
Measuring Distance 271
What Is a Distance Function? 271
Building a Distance Function One Field at a Time 274
Distance Functions for Other Data Types 277

When a Distance Metric Already Exists 278
The Combination Function: Asking the Neighbors
for the Answer 279
The Basic Approach: Democracy 279
Weighted Voting 281
Collaborative Filtering: A Nearest Neighbor Approach to
Making Recommendations 282
Building Profiles 283
Comparing Profiles 284
Making Predictions 284
Lessons Learned 285
Chapter 9 Market Basket Analysis and Association Rules 287
Defining Market Basket Analysis 289
Three Levels of Market Basket Data 289
Order Characteristics 292
Item Popularity 293
Tracking Marketing Interventions 293
Clustering Products by Usage 294
Association Rules 296
Actionable Rules 296
Trivial Rules 297
Inexplicable Rules 297
How Good Is an Association Rule? 299
Building Association Rules 302
Choosing the Right Set of Items 303
Product Hierarchies Help to Generalize Items 305
Virtual Items Go beyond the Product Hierarchy 307
Data Quality 308
Anonymous versus Identified 308
Generating Rules from All This Data 308
Calculating Confidence 309
Calculating Lift 310
The Negative Rule 311
Overcoming Practical Limits 311
The Problem of Big Data 313
Extending the Ideas 315
Using Association Rules to Compare Stores 315
Dissociation Rules 317
Sequential Analysis Using Association Rules 318
Lessons Learned 319
Chapter 10 Link Analysis 321
Basic Graph Theory 322
Seven Bridges of Königsberg 325
Traveling Salesman Problem 327
Directed Graphs 330
Detecting Cycles in a Graph 330
A Familiar Application of Link Analysis 331
The Kleinberg Algorithm 332
The Details: Finding Hubs and Authorities 333
Creating the Root Set 333
Identifying the Candidates 334
Ranking Hubs and Authorities 334
Hubs and Authorities in Practice 336
Case Study: Who Is Using Fax Machines from Home? 336
Why Finding Fax Machines Is Useful 336
The Data as a Graph 337
The Approach 338
Some Results 340
Case Study: Segmenting Cellular Telephone Customers 343
The Data 343
Analyses without Graph Theory 343
A Comparison of Two Customers 344
The Power of Link Analysis 345
Lessons Learned 346
Chapter 11 Automatic Cluster Detection 349
Searching for Islands of Simplicity 350
Star Light, Star Bright 351
Fitting the Troops 352
K-Means Clustering 354
Three Steps of the K-Means Algorithm 354
What K Means 356
Similarity and Distance 358
Similarity Measures and Variable Type 359
Formal Measures of Similarity 360
Geometric Distance between Two Points 360
Angle between Two Vectors 361
Manhattan Distance 363
Number of Features in Common 363
Data Preparation for Clustering 363
Scaling for Consistency 363
Use Weights to Encode Outside Information 365
Other Approaches to Cluster Detection 365

Gaussian Mixture Models 365
Agglomerative Clustering 368
An Agglomerative Clustering Algorithm 368
Distance between Clusters 368
Clusters and Trees 370
Clustering People by Age: An Example of
Agglomerative Clustering 370
Divisive Clustering 371
Self-Organizing Maps 372
Evaluating Clusters 372
Inside the Cluster 373
Outside the Cluster 373
Case Study: Clustering Towns 374
Creating Town Signatures 374
The Data 375
Creating Clusters 377
Determining the Right Number of Clusters 377
Using Thematic Clusters to Adjust Zone Boundaries 380
Lessons Learned 381
Chapter 12 Knowing When to Worry: Hazard Functions and
Survival Analysis in Marketing 383
Customer Retention 385
Calculating Retention 385
What a Retention Curve Reveals 386
Finding the Average Tenure from a Retention Curve 387
Looking at Retention as Decay 389
Hazards 394
The Basic Idea 394
Examples of Hazard Functions 397
Constant Hazard 397
Bathtub Hazard 397
A Real-World Example 398
Censoring 399
Other Types of Censoring 402
From Hazards to Survival 404
Retention 404
Survival 405
Proportional Hazards 408
Examples of Proportional Hazards 409
Stratification: Measuring Initial Effects on Survival 410
Cox Proportional Hazards 410
Limitations of Proportional Hazards 411
Survival Analysis in Practice 412
Handling Different Types of Attrition 412
When Will a Customer Come Back? 413
Forecasting 415
Hazards Changing over Time 416
Lessons Learned 418
Chapter 13 Genetic Algorithms 421
How They Work 423
Genetics on Computers 424
Selection 429
Crossover 430
Mutation 431
Representing Data 432

Case Study: Using Genetic Algorithms for
Resource Optimization 433
Schemata: Why Genetic Algorithms Work 435
More Applications of Genetic Algorithms 438
Application to Neural Networks 439
Case Study: Evolving a Solution for Response Modeling 440
Business Context 440
Data 441
The Data Mining Task: Evolving a Solution 442
Beyond the Simple Algorithm 444
Lessons Learned 446
