Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. Shmueli, Patel, Bruce, 2010-10-26

Contents
Foreword
Preface to the second edition
Preface to the first edition
Acknowledgments
PART I PRELIMINARIES
Chapter 1 Introduction
1.1 What Is Data Mining?
1.2 Where Is Data Mining Used?
1.3 Origins of Data Mining
1.4 Rapid Growth of Data Mining
1.5 Why Are There So Many Different Methods?
1.6 Terminology and Notation
1.7 Road Maps to This Book
Chapter 2 Overview of the Data Mining Process
2.1 Introduction
2.2 Core Ideas in Data Mining
2.3 Supervised and Unsupervised Learning
2.4 Steps in Data Mining
2.5 Preliminary Steps
2.6 Building a Model: Example with Linear Regression
2.7 Using Excel for Data Mining
PROBLEMS
PART II DATA EXPLORATION AND DIMENSION REDUCTION
Chapter 3 Data Visualization
3.1 Uses of Data Visualization
3.2 Data Examples
3.3 Basic Charts: bar charts, line graphs, and scatterplots
3.4 Multidimensional Visualization
3.5 Specialized Visualizations
3.6 Summary of major visualizations and operations, according to data mining goal
PROBLEMS
Chapter 4 Dimension Reduction
4.1 Introduction
4.2 Practical Considerations
4.3 Data Summaries
4.4 Correlation Analysis
4.5 Reducing the Number of Categories in Categorical Variables
4.6 Converting a Categorical Variable to a Numerical Variable
4.7 Principal Components Analysis
4.8 Dimension Reduction Using Regression Models
4.9 Dimension Reduction Using Classification and Regression Trees
PROBLEMS
PART III PERFORMANCE EVALUATION
Chapter 5 Evaluating Classification and Predictive Performance
5.1 Introduction
5.2 Judging Classification Performance
5.3 Evaluating Predictive Performance
PROBLEMS
PART IV PREDICTION AND CLASSIFICATION METHODS
Chapter 6 Multiple Linear Regression
6.1 Introduction
6.2 Explanatory versus Predictive Modeling
6.3 Estimating the Regression Equation and Prediction
6.4 Variable Selection in Linear Regression
PROBLEMS
Chapter 7 k-Nearest Neighbors (k-NN)
7.1 k-NN Classifier (categorical outcome)
7.2 k-NN for a Numerical Response
7.3 Advantages and Shortcomings of k-NN Algorithms
PROBLEMS
Chapter 8 Naive Bayes
8.1 Introduction
8.2 Applying the Full (Exact) Bayesian Classifier
8.3 Advantages and Shortcomings of the Naive Bayes Classifier
PROBLEMS
Chapter 9 Classification and Regression Trees
9.1 Introduction
9.2 Classification Trees
9.3 Measures of Impurity
9.4 Evaluating the Performance of a Classification Tree
9.5 Avoiding Overfitting
9.6 Classification Rules from Trees
9.7 Classification Trees for More Than Two Classes
9.8 Regression Trees
9.9 Advantages, Weaknesses, and Extensions
PROBLEMS
Chapter 10 Logistic Regression
10.1 Introduction
10.2 Logistic Regression Model
10.3 Evaluating Classification Performance
10.4 Example of Complete Analysis: Predicting Delayed Flights
10.5 Appendix: Logistic Regression for Profiling
PROBLEMS
Chapter 11 Neural Nets
11.1 Introduction
11.2 Concept And Structure Of A Neural Network
11.3 Fitting A Network To Data
11.4 Required User Input
11.5 Exploring The Relationship Between Predictors And Response
11.6 Advantages And Weaknesses Of Neural Networks
PROBLEMS
Chapter 12 Discriminant Analysis
12.1 Introduction
12.2 Distance of an Observation from a Class
12.3 Fisher’s Linear Classification Functions
12.4 Classification Performance of Discriminant Analysis
12.5 Prior Probabilities
12.6 Unequal Misclassification Costs
12.7 Classifying More Than Two Classes
12.8 Advantages and Weaknesses
PROBLEMS
PART V MINING RELATIONSHIPS AMONG RECORDS
Chapter 13 Association Rules
13.1 Introduction
13.2 Discovering Association Rules in Transaction Databases
13.3 Generating Candidate Rules
13.4 Selecting Strong Rules
13.5 Summary
PROBLEMS
Chapter 14 Cluster Analysis
14.1 Introduction
14.2 Measuring Distance Between Two Records
14.3 Measuring Distance Between Two Clusters
14.4 Hierarchical (Agglomerative) Clustering
14.5 Nonhierarchical Clustering: The k-Means Algorithm
PROBLEMS
PART VI FORECASTING TIME SERIES
Chapter 15 Handling Time Series
15.1 Introduction
15.2 Explanatory versus Predictive Modeling
15.3 Popular Forecasting Methods in Business
15.4 Time Series Components
15.5 Data Partitioning
PROBLEMS
Chapter 16 Regression-Based Forecasting
16.1 Model With Trend
16.2 Model With Seasonality
16.3 Model With Trend And Seasonality
16.4 Autocorrelation And ARIMA Models
PROBLEMS
Chapter 17 Smoothing Methods
17.1 Introduction
17.2 Moving Average
17.3 Simple Exponential Smoothing
17.4 Advanced Exponential Smoothing
PROBLEMS
PART VII CASES
Chapter 18 Cases
18.1 Charles Book Club
18.2 German Credit
18.3 Tayko Software Cataloger
18.4 Segmenting Consumers of Bath Soap
18.5 Direct-Mail Fundraising
18.6 Catalog Cross Selling
18.7 Predicting Bankruptcy
18.8 Time Series Case: Forecasting Public Transportation Demand
References
Index

To our families
Boaz and Noa
Tehmi, Arjun, and in memory of Aneesh
Liz, Lisa, and Allison

Copyright 2010 by John Wiley & Sons, Inc. All rights
reserved
Published by John Wiley & Sons, Inc., Hoboken, New
Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a
retrieval system, or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording,
scanning, or otherwise, except as permitted under Section
107 or 108 of the 1976 United States Copyright Act,
without either the prior written permission of the
Publisher, or authorization through payment of the
appropriate per-copy fee to the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923,
(978) 750-8400, fax (978) 750-4470, or on the web at
www.copyright.com. Requests to the Publisher for
permission should be addressed to the Permissions
Department, John Wiley & Sons, Inc., 111 River Street,
Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008.
Limit of Liability/Disclaimer of Warranty: While the
publisher and author have used their best efforts in
preparing this book, they make no representations or
warranties with respect to the accuracy or completeness of
the contents of this book and specifically disclaim any
implied warranties of merchantability or fitness for a
particular purpose. No warranty may be created or
extended by sales representatives or written sales
materials. The advice and strategies contained herein may
not be suitable for your situation. You should consult with
a professional where appropriate. Neither the publisher nor
author shall be liable for any loss of profit or any other
commercial damages, including but not limited to special,
incidental, consequential, or other damages.
For general information on our other products and services
or for technical support, please contact our Customer Care
Department within the United States at (800) 762-2974,
outside the United States at (317) 572-3993 or fax
(317) 572-4002.
Wiley also publishes its books in a variety of electronic
formats. Some content that appears in print may not be
available in electronic formats. For more information
about Wiley products, visit our web site at
www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Shmueli, Galit, 1971-
Data mining for business intelligence: concepts,
techniques, and applications in Microsoft Office Excel
with XLMiner / Galit Shmueli, Nitin R. Patel, Peter C.
Bruce. – 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-52682-8 (cloth)
1. Business–Data processing. 2. Data mining. 3. Microsoft
Excel (Computer file) I. Patel, Nitin R. (Nitin Ratilal) II.
Bruce, Peter C., 1953- III. Title.
HF5548.2.S44843 2010
005.54–dc22
2010005152

Foreword
Data mining—the art of extracting useful information from
large amounts of data—is of growing importance in
today’s world. Your e-mail spam filter relies at least in part
on rules that a data mining algorithm has learned from
examining millions of e-mail messages that have been
classified as spam or not spam. Real-time data mining
methods enable Web-based merchants to tell you that
“customers who purchased x are also likely to purchase y.”
Data mining helps banks determine which applicants are
likely to default on loans, helps tax authorities identify
which tax returns are most likely to be fraudulent, and
helps catalog merchants target those customers most likely
to purchase.
And data mining is not just about numbers—text mining
techniques help search engines like Google and Yahoo
find what you are looking for by ordering documents
according to their relevance to your query. In the process
they have effectively monetized search by ordering
sponsored ads that are relevant to your query.
The amount of data flowing from, to, and through
enterprises of all sorts is enormous, and growing
rapidly—more rapidly than the capabilities of
organizations to use it. Successful enterprises are those that
make effective use of the abundance of data to which they
have access: to make better predictions, better decisions,
and better strategies. The margin over a competitor may be
small (they, after all, have access to the same methods for
making effective use of information), hence the need to
take advantage of every possible avenue to advantage.
At no time has the need been greater for quantitatively
skilled managerial expertise. Successful managers now
need to know about the possibilities and limitations of data
mining. But at what level? A high-level overview can
provide a general idea of what data mining can do for the
enterprise but fails to provide the intuition that could be
attained by actually building models with real data. A very
technical approach from the computer science, database, or
statistical standpoint can get bogged down in detail that
has little bearing on decision making.
It is essential that managers be able to translate business or
other functional problems into the appropriate statistical
problem before it can be “handed off” to a technical team.
But it is difficult for managers to do this with confidence
unless they have actually had hands-on experience
developing models for a variety of real problems using real
data. That is the perspective of this book—the use of real
data, actual cases, and an Excel-based program to build
and compare models with a minimal learning curve.
DARYL PREGIBON
Google Inc, 2006

Preface to the Second Edition
Since the book’s appearance in early 2007, it has been
used in many classes, ranging from dedicated data mining
classes to more general business intelligence courses.
Following feedback from instructors teaching both MBA
and undergraduate courses, as well as students, we revised
some of the existing chapters as well as covered two new
topics that are central in data mining: data visualization
and time series forecasting.
We have added a set of three chapters on time series
forecasting (Chapters 15–17), which present the most
commonly used forecasting tools in the business world.
They include a set of new datasets and exercises, and a
new case (in Chapter 18).
The chapter on data visualization provides comprehensive
coverage of basic and advanced visualization techniques
that support the exploratory step of data mining. We also
provide a discussion of interactive visualization principles
and tools, and the chapter exercises include assignments to
familiarize readers with interactive visualization in
practice.
In the new edition we have created separate chapters for
the k-nearest-neighbor and naive Bayes methods. The
explanation of the naive Bayes classifier is now clearer,
and additional exercises have been added to both chapters.
Another addition is a brief summary at the
beginning of each chapter.
We have also reorganized the order of some chapters,
following readers’ feedback. The chapters are now
grouped into seven parts: Preliminaries, Data Exploration
and Dimension Reduction, Performance Evaluation,
Prediction and Classification Methods, Mining
Relationships Among Records, Forecasting Time Series,
and Cases. The new organization is aimed at helping
instructors of various types of courses to choose subsets of
topics to teach.
Two-semester data mining courses could cover in detail
data exploration and dimension reduction and supervised
learning in one term (choosing the type and amount of
prediction and classification methods according to the
course flavor and the audience interest). Forecasting time
series and unsupervised learning can be covered in the
second term.
Single-semester data mining courses would do best to
concentrate on the first parts of the book, and only
introduce time series forecasting as time allows. This is
especially true if a dedicated forecasting course is offered
in the program.
General business intelligence courses would best focus
on the first three parts, then choose a small number of
prediction/classification methods for illustration, and
present the mining relationships chapters. All these can be
covered via a few cases, where students read the relevant
chapters that support the analysis done in the case.
A set of data mining courses that constitute a
concentration can be built according to the sequence of
parts in the book. The first three parts (Preliminaries, Data
Exploration and Dimension Reduction, and Performance
Evaluation) should serve as requirements for the next
courses. Cases can be used either within appropriate topic
courses or as project-type courses.

In all courses, we strongly recommend including a project
component, where data are either collected by students
according to their interest or provided by the instructor
(e.g., from the many data mining competition datasets
available). In our experience and that of other
instructors, such projects enhance learning and provide
students with an excellent opportunity to understand the
strengths of data mining and the challenges that arise in the
process.

Preface to the First Edition
This book arose out of a data mining course at MIT’s
Sloan School of Management and was refined during its
use in data mining courses at the University of Maryland’s
R. H. Smith School of Business and at statistics.com.
Preparation for the course revealed that there are a number
of excellent books on the business context of data mining,
but their coverage of the statistical and machine-learning
algorithms that underlie data mining is not sufficiently
detailed to provide a practical guide if the instructor’s goal
is to equip students with the skills and tools to implement
those algorithms. On the other hand, there are also a
number of more technical books about data mining
algorithms, but these are aimed at the statistical researcher
or more advanced graduate student, and do not provide the
case-oriented business focus that is successful in teaching
business students.

Hence, this book is intended for the business student (and
practitioner) of data mining techniques, and its goal is
threefold:
1. To provide both a theoretical and a practical
understanding of the key methods of classification,
prediction, reduction, and exploration that are at the heart
of data mining.
2. To provide a business decision-making context for these
methods.
3. Using real business cases, to illustrate the application
and interpretation of these methods.
The presentation of the cases in the book is structured so
that the reader can follow along and implement the
algorithms on his or her own with a very low learning
hurdle.
Just as a natural science course without a lab component
would seem incomplete, a data mining course without
practical work with actual data is missing a key ingredient.
The MIT data mining course that gave rise to this book
followed an introductory quantitative course that relied on
Excel—this made its practical work universally accessible.
Using Excel for data mining seemed a natural progression.
An important feature of this book is the use of Excel, an
environment familiar to business analysts. All required
data mining algorithms (plus illustrative datasets) are
provided in an Excel add-in, XLMiner. Data for both the
cases and exercises are available at
www.dataminingbook.com.
Although the genesis for this book lay in the need for a
case-oriented guide to teaching data mining, analysts and
consultants who are considering the application of data
mining techniques in contexts where they are not currently
in use will also find this a useful, practical guide.

Acknowledgments
The authors thank the many people who helped us improve
the first edition and refine it further in the
second edition. Anthony Babinec, who has been using
drafts of this book for years in his data mining courses at
statistics.com, provided us with detailed and expert
corrections. Similarly, Dan Toy and John Elder IV greeted
our project with enthusiasm and provided detailed and
useful comments on earlier drafts. Boaz Shmueli and
Raquelle Azran gave detailed editorial comments and
suggestions on both editions; Bruce McCullough and
Adam Hughes did the same for the first edition. Ravi
Bapna, who used an early draft in a data mining course at
the Indian School of Business, provided invaluable
comments and helpful suggestions. Useful comments and
feedback have also come from the many instructors, too
numerous to mention, who have used the book in their
classes.
From the Smith School of Business at the University of
Maryland, colleagues Shrivardhan Lele, Wolfgang Jank,
and Paul Zantek provided practical advice and comments.
We thank Robert Windle, and MBA students Timothy
Roach, Pablo Macouzet, and Nathan Birckhead for
invaluable datasets. We also thank MBA students Rob
Whitener and Daniel Curtis for the heatmap and map
charts. And we thank the many MBA students for fruitful
discussions and interesting data mining projects that have
helped shape and improve the book.

This book would not have seen the light of day without the
nurturing support of the faculty at the Sloan School of
Management at MIT. Our special thanks to Dimitris
Bertsimas, James Orlin, Robert Freund, Roy Welsch,
Gordon Kaufmann, and Gabriel Bitran. As teaching
assistants for the data mining course at Sloan, Adam
Mersereau gave detailed comments on the notes and cases
that were the genesis of this book, Romy Shioda helped
with the preparation of several cases and exercises used
here, and Mahesh Kumar helped with the material on
clustering. We are grateful to the MBA students at Sloan
for stimulating discussions in the class that led to
refinement of the notes as well as XLMiner.
Chris Albright, Gregory Piatetsky-Shapiro, Wayne
Winston, and Uday Karmarkar gave us helpful advice on
the use of XLMiner. Anand Bodapati provided both data
and advice. Suresh Ankolekar and Mayank Shah helped
develop several cases and provided valuable pedagogical
comments. Vinni Bhandari helped write the Charles Book
Club case.
We would like to thank Marvin Zelen, L. J. Wei, and
Cyrus Mehta at Harvard, as well as Anil Gore at Pune
University, for thought-provoking discussions on the
relationship between statistics and data mining. Our thanks
to Richard Larson of the Engineering Systems Division,
MIT, for sparking many stimulating ideas on the role of
data mining in modeling complex systems. They helped us
develop a balanced philosophical perspective on the
emerging field of data mining.
