
Applied Data Mining
Statistical Methods for Business and Industry

PAOLO GIUDICI
Faculty of Economics
University of Pavia
Italy





Copyright © 2003 John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester,
West Sussex PO19 8SQ, England
Telephone (+44) 1243 779777

Email (for orders and customer service enquiries):
Visit our Home Page on www.wileyeurope.com or www.wiley.com


All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or
transmitted in any form or by any means, electronic, mechanical, photocopying, recording,
scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or
under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court
Road, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to the
Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The
Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to
, or faxed to (+44) 1243 770620.
This publication is designed to provide accurate and authoritative information in regard to the
subject matter covered. It is sold on the understanding that the Publisher is not engaged in
rendering professional services. If professional advice or other expert assistance is required, the
services of a competent professional should be sought.
Other Wiley Editorial Offices
John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA
Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA
Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany
John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809
John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada M9W 1L1
Wiley also publishes its books in a variety of electronic formats. Some content that appears
in print may not be available in electronic books.
Library of Congress Cataloging-in-Publication Data
Giudici, Paolo.
Applied data mining : statistical methods for business and industry / Paolo Giudici.
p. cm.
Includes bibliographical references and index.
ISBN 0-470-84678-X (alk. paper) – ISBN 0-470-84679-8 (pbk.)
1. Data mining. 2. Business – Data processing. 3. Commercial statistics. I. Title.
QA76.9.D343G75 2003
2003050196

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN 0-470-84678-X (Cloth)
ISBN 0-470-84679-8 (Paper)
Typeset in 10/12pt Times by Laserwords Private Limited, Chennai, India
Printed and bound in Great Britain by Biddles Ltd, Guildford, Surrey
This book is printed on acid-free paper responsibly manufactured from sustainable forestry
in which at least two trees are planted for each one used for paper production.


Contents

Preface

1 Introduction
  1.1 What is data mining?
    1.1.1 Data mining and computing
    1.1.2 Data mining and statistics
  1.2 The data mining process
  1.3 Software for data mining
  1.4 Organisation of the book
    1.4.1 Chapters 2 to 6: methodology
    1.4.2 Chapters 7 to 12: business cases
  1.5 Further reading

Part I Methodology

2 Organisation of the data
  2.1 From the data warehouse to the data marts
    2.1.1 The data warehouse
    2.1.2 The data webhouse
    2.1.3 Data marts
  2.2 Classification of the data
  2.3 The data matrix
    2.3.1 Binarisation of the data matrix
  2.4 Frequency distributions
    2.4.1 Univariate distributions
    2.4.2 Multivariate distributions
  2.5 Transformation of the data
  2.6 Other data structures
  2.7 Further reading

3 Exploratory data analysis
  3.1 Univariate exploratory analysis
    3.1.1 Measures of location
    3.1.2 Measures of variability
    3.1.3 Measures of heterogeneity
    3.1.4 Measures of concentration
    3.1.5 Measures of asymmetry
    3.1.6 Measures of kurtosis
  3.2 Bivariate exploratory analysis
  3.3 Multivariate exploratory analysis of quantitative data
  3.4 Multivariate exploratory analysis of qualitative data
    3.4.1 Independence and association
    3.4.2 Distance measures
    3.4.3 Dependency measures
    3.4.4 Model-based measures
  3.5 Reduction of dimensionality
    3.5.1 Interpretation of the principal components
    3.5.2 Application of the principal components
  3.6 Further reading

4 Computational data mining
  4.1 Measures of distance
    4.1.1 Euclidean distance
    4.1.2 Similarity measures
    4.1.3 Multidimensional scaling
  4.2 Cluster analysis
    4.2.1 Hierarchical methods
    4.2.2 Evaluation of hierarchical methods
    4.2.3 Non-hierarchical methods
  4.3 Linear regression
    4.3.1 Bivariate linear regression
    4.3.2 Properties of the residuals
    4.3.3 Goodness of fit
    4.3.4 Multiple linear regression
  4.4 Logistic regression
    4.4.1 Interpretation of logistic regression
    4.4.2 Discriminant analysis
  4.5 Tree models
    4.5.1 Division criteria
    4.5.2 Pruning
  4.6 Neural networks
    4.6.1 Architecture of a neural network
    4.6.2 The multilayer perceptron
    4.6.3 Kohonen networks
  4.7 Nearest-neighbour models
  4.8 Local models
    4.8.1 Association rules
    4.8.2 Retrieval by content
  4.9 Further reading

5 Statistical data mining
  5.1 Uncertainty measures and inference
    5.1.1 Probability
    5.1.2 Statistical models
    5.1.3 Statistical inference
  5.2 Non-parametric modelling
  5.3 The normal linear model
    5.3.1 Main inferential results
    5.3.2 Application
  5.4 Generalised linear models
    5.4.1 The exponential family
    5.4.2 Definition of generalised linear models
    5.4.3 The logistic regression model
    5.4.4 Application
  5.5 Log-linear models
    5.5.1 Construction of a log-linear model
    5.5.2 Interpretation of a log-linear model
    5.5.3 Graphical log-linear models
    5.5.4 Log-linear model comparison
    5.5.5 Application
  5.6 Graphical models
    5.6.1 Symmetric graphical models
    5.6.2 Recursive graphical models
    5.6.3 Graphical models versus neural networks
  5.7 Further reading

6 Evaluation of data mining methods
  6.1 Criteria based on statistical tests
    6.1.1 Distance between statistical models
    6.1.2 Discrepancy of a statistical model
    6.1.3 The Kullback–Leibler discrepancy
  6.2 Criteria based on scoring functions
  6.3 Bayesian criteria
  6.4 Computational criteria
  6.5 Criteria based on loss functions
  6.6 Further reading

Part II Business cases

7 Market basket analysis
  7.1 Objectives of the analysis
  7.2 Description of the data
  7.3 Exploratory data analysis
  7.4 Model building
    7.4.1 Log-linear models
    7.4.2 Association rules
  7.5 Model comparison
  7.6 Summary report

8 Web clickstream analysis
  8.1 Objectives of the analysis
  8.2 Description of the data
  8.3 Exploratory data analysis
  8.4 Model building
    8.4.1 Sequence rules
    8.4.2 Link analysis
    8.4.3 Probabilistic expert systems
    8.4.4 Markov chains
  8.5 Model comparison
  8.6 Summary report

9 Profiling website visitors
  9.1 Objectives of the analysis
  9.2 Description of the data
  9.3 Exploratory analysis
  9.4 Model building
    9.4.1 Cluster analysis
    9.4.2 Kohonen maps
  9.5 Model comparison
  9.6 Summary report

10 Customer relationship management
  10.1 Objectives of the analysis
  10.2 Description of the data
  10.3 Exploratory data analysis
  10.4 Model building
    10.4.1 Logistic regression models
    10.4.2 Radial basis function networks
    10.4.3 Classification tree models
    10.4.4 Nearest-neighbour models
  10.5 Model comparison
  10.6 Summary report

11 Credit scoring
  11.1 Objectives of the analysis
  11.2 Description of the data
  11.3 Exploratory data analysis
  11.4 Model building
    11.4.1 Logistic regression models
    11.4.2 Classification tree models
    11.4.3 Multilayer perceptron models
  11.5 Model comparison
  11.6 Summary report

12 Forecasting television audience
  12.1 Objectives of the analysis
  12.2 Description of the data
  12.3 Exploratory data analysis
  12.4 Model building
  12.5 Model comparison
  12.6 Summary report

Bibliography
Index



Preface
The increasing availability of data in the current information society has led to
the need for valid tools for its modelling and analysis. Data mining and applied
statistical methods are the appropriate tools to extract knowledge from such data.
Data mining can be defined as the process of selection, exploration and modelling
of large databases in order to discover models and patterns that are unknown a
priori. It differs from applied statistics mainly in terms of its scope; whereas
applied statistics concerns the application of statistical methods to the data at
hand, data mining is a whole process of data extraction and analysis aimed at
the production of decision rules for specified business goals. In other words, data
mining is a business intelligence process.
Although data mining is a very important and growing topic, there is insufficient coverage of it in the literature, especially from a statistical viewpoint.
Most of the available books on data mining are either too technical and computer science oriented or too applied and marketing driven. This book aims to
establish a bridge between data mining methods and applications in the fields of
business and industry by adopting a coherent and rigorous approach to statistical
modelling.
Not only does it describe the methods employed in data mining, typically coming from the fields of machine learning and statistics, but it describes them in
relation to the business goals that have to be achieved, hence the word ‘applied’
in the title. The second part of the book is a set of case studies that compare the
methods of the first part in terms of their performance and usability. The first part
gives a broad coverage of all methods currently used for data mining and puts
them into a functional framework. Methods are classified as being essentially
computational (e.g. association rules, decision trees and neural networks) or statistical (e.g. regression models, generalised linear models and graphical models).
Furthermore, each method is classified in terms of the business intelligence goals
it can achieve, such as discovery of local patterns, classification and prediction.
The book is primarily aimed at advanced undergraduate and graduate students
of business management, computer science and statistics. The case studies give
guidance to professionals working in industry on projects involving large volumes
of data, such as in customer relationship management, web analysis, risk management and, more broadly, marketing and finance. No unnecessary formalisms or mathematical tools are introduced. Those who wish to know more should
consult the bibliography; specific pointers are given at the end of Chapters 2 to 6.
The book is the result of a learning process that began in 1989, when I
was a graduate student of statistics at the University of Minnesota. Since then
my research activity has always been focused on the interplay between computational and multivariate statistics. In 1998 I began building a group of data mining
statisticians and it has evolved into a data mining laboratory at the University
of Pavia. There I have had many opportunities to interact and learn from industry experts and my own students working on data mining projects and doing
internships within the industry. Although it is not possible to name them all, I
thank them and hope they recognise their contribution in the book. A special
mention goes to the University of Pavia, in particular to the Faculty of Business
and Economics, where I have been working since 1993. It is a very stimulating
and open environment to do research and teaching.
I acknowledge Wiley for having proposed and encouraged this effort, in particular the statistics and mathematics editor and assistant editor, Sian Jones and
Rob Calver. I also thank Greg Ridgeway, who revised the final manuscript and
suggested several improvements. Finally, the most important acknowledgement
goes to my wife, Angela, who has constantly encouraged the development of my
research in this field. The book is dedicated to her and to my son Tommaso, born
on 24 May 2002, when I was revising the manuscript.
I hope people will enjoy reading the book and eventually use it in their work.

I will be very pleased to receive comments, and I will consider
any suggestions for a subsequent edition.
Paolo Giudici
Pavia, 28 January 2003


CHAPTER 1

Introduction
Nowadays each individual and organisation – business, family or institution –
can access a large quantity of data and information about itself and its environment. This data has the potential to predict the evolution of interesting variables
or trends in the outside environment, but so far that potential has not been fully
exploited. This is particularly true in the business field, the subject of this book.
There are two main problems. First, information is scattered across different archive systems that are not connected with one another, producing an inefficient organisation of the data. Second, there is a lack of awareness about statistical tools and their potential for elaborating information, which interferes with the production of efficient and relevant data synthesis.
Two developments could help to overcome these problems. First, software and
hardware continually offer more power at lower cost, allowing organisations to
collect and organise data in structures that give easier access and transfer. Second,
methodological research, particularly in the field of computing and statistics, has
recently led to the development of flexible and scalable procedures that can be
used to analyse large data stores. These two developments have meant that data
mining is rapidly spreading through many businesses as an important intelligence
tool for backing up decisions.
This chapter introduces the ideas behind data mining. It defines data mining
and compares it with related topics in statistics and computer science. It describes
the process of data mining and gives a brief introduction to data mining software.
The last part of the chapter outlines the organisation of the book and suggests
some further reading.


1.1 What is data mining?
To understand the term ‘data mining’ it is useful to look at the literal translation
of the word: to mine in English means to extract. The verb usually refers to mining operations that extract from the Earth her hidden, precious resources. The
association of this word with data suggests an in-depth search to find additional
information which previously went unnoticed in the mass of data available. From
the viewpoint of scientific research, data mining is a relatively new discipline that
has developed mainly from studies carried out in other disciplines such as computing, marketing, and statistics. Many of the methodologies used in data mining
come from two branches of research, one developed in the machine learning
community and the other developed in the statistical community, particularly in
multivariate and computational statistics.
Machine learning is connected to computer science and artificial intelligence
and is concerned with finding relations and regularities in data that can be translated into general truths. The aim of machine learning is the reproduction of the
data-generating process, allowing analysts to generalise from the observed data to
new, unobserved cases. Rosenblatt (1962) introduced the first machine learning
model, called the perceptron. Following on from this, neural networks developed in the second half of the 1980s. During the same period, some researchers
perfected the theory of decision trees used mainly for dealing with problems of
classification. Statistics has always been about creating models for analysing data,
and now there is the possibility of using computers to do it. From the second half
of the 1980s, given the increasing importance of computational methods as the
basis for statistical analysis, there was also a parallel development of statistical
methods to analyse real multivariate applications. In the 1990s statisticians began
showing interest in machine learning methods as well, which led to important
developments in methodology.
Towards the end of the 1980s machine learning methods started to be used
beyond the fields of computing and artificial intelligence. In particular, they were
used in database marketing applications where the available databases were used
for elaborate and specific marketing campaigns. The term knowledge discovery
in databases (KDD) was coined to describe all those methods that aimed to
find relations and regularity among the observed data. Gradually the term KDD
was expanded to describe the whole process of extrapolating information from a
database, from the identification of the initial business aims to the application of
the decision rules. The term ‘data mining’ was used to describe the component
of the KDD process where the learning algorithms were applied to the data.
This terminology was first formally put forward by Usama Fayyad at the
First International Conference on Knowledge Discovery and Data Mining, held
in Montreal in 1995 and still considered one of the main conferences on this
topic. It was used to refer to a set of integrated analytical techniques divided into
several phases with the aim of extrapolating previously unknown knowledge from
massive sets of observed data that do not appear to have any obvious regularity
or important relationships. As the term ‘data mining’ slowly established itself, it
became a synonym for the whole process of extrapolating knowledge. This is the
meaning we shall use in this text. The previous definition omits one important
aspect – the ultimate aim of data mining. In data mining the aim is to obtain
results that can be measured in terms of their relevance for the owner of the
database – business advantage. Here is a more complete definition of data mining:
Data mining is the process of selection, exploration, and modelling of large quantities of data to discover regularities or relations that are at first unknown with the
aim of obtaining clear and useful results for the owner of the database.

In a business context the utility of the result becomes a business result in
itself. Therefore what distinguishes data mining from statistical analysis is not



INTRODUCTION

3

so much the amount of data we analyse or the methods we use but that we
integrate what we know about the database, the means of analysis and the business
knowledge. To apply a data mining methodology means following an integrated
methodological process that involves translating the business needs into a problem
which has to be analysed, retrieving the database needed to carry out the analysis,
and applying a statistical technique implemented in a computer algorithm with the
final aim of achieving important results useful for taking a strategic decision. The
strategic decision will itself create new measurement needs and consequently new
business needs, setting off what has been called ‘the virtuous circle of knowledge’
induced by data mining (Berry and Linoff, 1997).
Data mining is not just about the use of a computer algorithm or a statistical
technique; it is a process of business intelligence that can be used together with
what is provided by information technology to support company decisions.
1.1.1 Data mining and computing
The emergence of data mining is closely connected to developments in computer
technology, particularly the evolution and organisation of databases, which have
recently made great leaps forward. I am now going to clarify a few terms.
Query and reporting tools are simple and very quick to use; they help us
explore business data at various levels. Query tools retrieve the information and
reporting tools present it clearly. They allow the results of analyses to be transmitted across a client-server network, an intranet or even the internet. The networks
allow sharing, so that the data can be analysed by the most suitable platform.
This makes it possible to exploit the analytical potential of remote servers and
receive an analysis report on local PCs. A client-server network must be flexible
enough to satisfy all types of remote requests, from a simple reordering of data
to ad hoc queries using Structured Query Language (SQL) for extracting and
summarising data in the database.
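
To make this concrete, here is a minimal sketch of such an ad hoc summary query, written in Python against a SQLite database. The database file and the "orders" table, with its "region" and "amount" columns, are hypothetical examples rather than any real company schema.

```python
import sqlite3

# Hypothetical ad hoc summary query: order counts and totals by region.
# "company.db" and the "orders" table are assumed examples, not a real schema.
conn = sqlite3.connect("company.db")
rows = conn.execute(
    """
    SELECT region, COUNT(*) AS n_orders, SUM(amount) AS total_amount
    FROM orders
    GROUP BY region
    ORDER BY total_amount DESC
    """
).fetchall()
for region, n_orders, total_amount in rows:
    print(region, n_orders, total_amount)
conn.close()
```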

Data retrieval, like data mining, extracts interesting data and information from
archives and databases. The difference is that, unlike data mining, the criteria for
extracting information are decided beforehand, so they are exogenous to the
extraction itself. A classic example is a request from the marketing department of
a company to retrieve all the personal details of clients who have bought product
A and product B at least once in that order. This request may be based on the
idea that there is some connection between having bought A and B together at
least once but without any empirical evidence. The names obtained from this
exploration could then be the targets of the next publicity campaign. In this way
the success percentage (i.e. the customers who will actually buy the products
advertised compared to the total customers contacted) will definitely be much
higher than otherwise. Once again, without a preliminary statistical analysis of
the data, it is difficult to predict the success percentage and it is impossible to
establish whether having better information about the customers’ characteristics
would give improved results with a smaller campaign effort.
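
As an illustration only, the marketing request described above could be expressed as a self-join in SQL, run here from Python. The "purchases" table and its "client_id", "product" and "purchase_date" columns are assumed names, not part of any real system; note that the selection criterion is fixed in advance, which is exactly what makes this data retrieval rather than data mining.

```python
import sqlite3

# Hypothetical retrieval: clients who bought product 'A' at least once
# before buying product 'B'. Table and column names are assumed.
conn = sqlite3.connect("company.db")
clients = conn.execute(
    """
    SELECT DISTINCT a.client_id
    FROM purchases AS a
    JOIN purchases AS b ON a.client_id = b.client_id
    WHERE a.product = 'A'
      AND b.product = 'B'
      AND a.purchase_date < b.purchase_date
    """
).fetchall()
print([c[0] for c in clients])
conn.close()
```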
Data mining is different from data retrieval because it looks for relations and
associations between phenomena that are not known beforehand. It also allows


4

APPLIED DATA MINING

the effectiveness of a decision to be judged on the data, which allows a rational
evaluation to be made, and on the objective data available. Do not confuse data
mining with methods used to create multidimensional reporting tools, e.g. online
analytical processing (OLAP). OLAP is usually a graphical instrument used to
highlight relations between the variables available following the logic of a two-dimensional report. Unlike OLAP, data mining brings together all the variables
available and combines them in different ways. It also means we can go beyond
the visual representation of the summaries in OLAP applications, creating useful
models for the business world. Data mining is not just about analysing data; it
is a much more complex process where data analysis is just one of the aspects.
OLAP is an important tool for business intelligence. The query and reporting
tools describe what a database contains (in the widest sense this includes the
data warehouse), but OLAP is used to explain why certain relations exist. The
user makes his own hypotheses about the possible relations between the variables
and he looks for confirmation of his opinion by observing the data. Suppose he
wants to find out why some debts are not paid back; first he might suppose
that people with a low income and lots of debts are high-risk categories. So that
he can check his hypothesis, OLAP gives him a graphical representation (called a
multidimensional hypercube) of the empirical relation between the income, debt
and insolvency variables. An analysis of the graph can confirm his hypothesis.
Therefore OLAP also allows the user to extract information that is useful for
business databases. Unlike data mining, the research hypotheses are suggested by
the user and are not uncovered from the data. Furthermore, the extrapolation is a
purely computerised procedure; no use is made of modelling tools or summaries
provided by the statistical methodology. OLAP can provide useful information
for databases with a small number of variables, but problems arise when there are
tens or hundreds of variables. Then it becomes increasingly difficult and time-consuming to find a good hypothesis and analyse the database with OLAP tools
to confirm or deny it.
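
For readers who want a concrete picture, the following sketch builds a small OLAP-style summary in Python with pandas: a cross-tabulation of the observed insolvency rate by income and debt bands, as in the example above. The data frame and its values are invented purely for illustration.

```python
import pandas as pd

# Invented data: one row per customer, with income band, debt band and
# an indicator of insolvency (1 = debt not paid back).
df = pd.DataFrame({
    "income":    ["low", "low", "high", "high", "low", "high"],
    "debt":      ["high", "low", "high", "low", "high", "low"],
    "insolvent": [1, 0, 0, 0, 1, 0],
})

# Each cell of the "hypercube" holds the observed insolvency rate for one
# income/debt combination, mirroring what an OLAP tool would display.
cube = pd.pivot_table(df, values="insolvent", index="income",
                      columns="debt", aggfunc="mean")
print(cube)
```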
OLAP is not a substitute for data mining; the two techniques are complementary and used together they can create useful synergies. OLAP can be used
in the preprocessing stages of data mining. This makes understanding the data
easier, because it becomes possible to focus on the most important data, identifying special cases or looking for principal interrelations. The final data mining
results, expressed using specific summary variables, can be easily represented in
an OLAP hypercube.
We can summarise what we have said so far in a simple sequence that shows
the evolution of business intelligence tools used to extrapolate knowledge from
a database:
QUERY AND REPORTING → DATA RETRIEVAL → OLAP → DATA MINING

Query and reporting has the lowest information capacity and data mining has the
highest information capacity. Query and reporting is easiest to implement and data
mining is hardest to implement. This suggests a trade-off between information
capacity and ease of implementation. The choice of tool must also consider the
specific needs of the business and the characteristics of the company’s information
system. Lack of information is one of the greatest obstacles to achieving efficient
data mining. Very often a database is created for reasons that have nothing to do
with data mining, so the important information may be missing. Incorrect data is
another problem.
The creation of a data warehouse can eliminate many of these problems.
Efficient organisation of the data in a data warehouse coupled with efficient and
scalable data mining allows the data to be used correctly and efficiently to support
company decisions.
1.1.2 Data mining and statistics
Statistics has always been about creating methods to analyse data. The main
difference between statistical methods and machine learning methods is that statistical methods are usually developed in relation to the data being analysed but
also according to a conceptual reference paradigm. Although this has made the
statistical methods coherent and rigorous, it has also limited their ability to adapt
quickly to the new methodologies arising from new information technology and
new machine learning applications. Statisticians have recently shown an interest
in data mining and this could help its development.
For a long time statisticians saw data mining as synonymous with 'data
fishing’, ‘data dredging’ or ‘data snooping’. In all these cases data mining had
negative connotations. This idea came about because of two main criticisms.

First, there is not just one theoretical reference model but several models in
competition with each other; these models are chosen depending on the data
being examined. The criticism of this procedure is that it is always possible to
find a model, however complex, which will adapt well to the data. Second, the
great amount of data available may lead to non-existent relations being found
among the data.
Although these criticisms are worth considering, we shall see that the modern
methods of data mining pay great attention to the possibility of generalising
results. This means that when choosing a model, the predictive performance is
considered and the more complex models are penalised. It is difficult to ignore
the fact that many important findings are not known beforehand and cannot be
used in developing a research hypothesis. This happens in particular when there
are large databases.
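
As a small illustration of this principle, the following Python sketch (using scikit-learn, with simulated data) compares a shallow and an unconstrained decision tree by cross-validated predictive performance. The flexible model can adapt perfectly to the training data, but evaluation on held-out observations reveals, and so effectively penalises, spurious complexity.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0).astype(int)

# Score each model on data held out by 5-fold cross-validation: the
# unconstrained tree fits the training folds perfectly, but its score on
# unseen folds reveals (and so penalises) any spurious complexity.
for depth in (2, None):          # None lets the tree grow without limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()
    print("max_depth =", depth, "->", round(score, 3))
```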
This last aspect is one of the characteristics that distinguishes data mining
from statistical analysis. Whereas statistical analysis traditionally concerns itself
with analysing primary data that has been collected to check specific research
hypotheses, data mining can also concern itself with secondary data collected for
other reasons. This is the norm, for example, when analysing company data that
comes from a data warehouse. Furthermore, statistical data can be experimental
data (perhaps the result of an experiment which randomly allocates all the statistical units to different kinds of treatment), but in data mining the data is typically
observational data.



Berry and Linoff (1997) distinguish two analytical approaches to data mining. They differentiate top-down analysis (confirmative) and bottom-up analysis
(explorative). Top-down analysis aims to confirm or reject hypotheses and tries to
widen our knowledge of a partially understood phenomenon; it achieves this principally by using the traditional statistical methods. Bottom-up analysis is where
the user looks for useful information previously unnoticed, searching through the
data and looking for ways of connecting it to create hypotheses. The bottom-up
approach is typical of data mining. In reality the two approaches are complementary. In fact, the information obtained from a bottom-up analysis, which identifies
important relations and tendencies, cannot explain why these discoveries are useful and to what extent they are valid. The confirmative tools of top-down analysis
can be used to confirm the discoveries and evaluate the quality of decisions based
on those discoveries.
There are at least three other aspects that distinguish statistical data analysis
from data mining. First, data mining analyses great masses of data. This implies
new considerations for statistical analysis. For many applications it is impossible
to analyse or even access the whole database for reasons of computer efficiency.
Therefore it becomes necessary to have a sample of the data from the database
being examined. This sampling must take account of the data mining aims, so it
cannot be performed using traditional statistical theory. Second, many databases
do not lead to the classic forms of statistical data organisation, for example,
data that comes from the internet. This creates a need for appropriate analytical
methods from outside the field of statistics. Third, data mining results must be of
some consequence. This means that constant attention must be given to business
results achieved with the data analysis models.
In conclusion there are reasons for believing that data mining is nothing new
from a statistical viewpoint. But there are also reasons to support the idea that,
because of their nature, statistical methods should be able to study and formalise
the methods used in data mining. This means that on one hand we need to look
at the problems posed by data mining from a viewpoint of statistics and utility,
while on the other hand we need to develop a conceptual paradigm that allows
the statisticians to lead the data mining methods back to a scheme of general and
coherent analysis.

1.2 The data mining process
Data mining is a series of activities from defining objectives to evaluating results.
Here are its seven phases:

A. Definition of the objectives for analysis
B. Selection, organisation and pretreatment of the data
C. Exploratory analysis of the data and subsequent transformation
D. Specification of the statistical methods to be used in the analysis phase
E. Analysis of the data based on the chosen methods
F. Evaluation and comparison of the methods used and the choice of the final model for analysis
G. Interpretation of the chosen model and its subsequent use in decision processes

Definition of the objectives
Definition of the objectives involves defining the aims of the analysis. It is not
always easy to define the phenomenon we want to analyse. In fact, the company
objectives that we are aiming for are usually clear, but the underlying problems
can be difficult to translate into detailed objectives that need to be analysed.
A clear statement of the problem and the objectives to be achieved are the
prerequisites for setting up the analysis correctly. This is certainly one of the most
difficult parts of the process since what is established at this stage determines
how the subsequent method is organised. Therefore the objectives must be clear
and there must be no room for doubts or uncertainties.
Organisation of the data
Once the objectives of the analysis have been identified, it is necessary to select
the data for the analysis. First of all it is necessary to identify the data sources.
Usually data is taken from internal sources that are cheaper and more reliable.
This data also has the advantage of being the result of experiences and procedures
of the company itself. The ideal data source is the company data warehouse, a
storeroom of historical data that is no longer subject to changes and from which
it is easy to extract topic databases, or data marts, of interest. If there is no
data warehouse then the data marts must be created by overlapping the different
sources of company data.
In general, the creation of data marts to be analysed provides the fundamental
input for the subsequent data analysis. It leads to a representation of the data,
usually in a tabular form known as a data matrix, that is based on the analytical
needs and the previously established aims. Once a data matrix is available it is
often necessary to carry out a preliminary cleaning of the data. In other words, a
quality control is carried out on the available data, known as data cleansing. It is a
formal process used to highlight any variables that exist but which are not suitable
for analysis. It is also an important check on the contents of the variables and
the possible presence of missing, or incorrect data. If any essential information is
missing, it will then be necessary to review the phase that highlights the source.
Finally, it is often useful to set up an analysis on a subset or sample of the
available data. This is because the quality of the information collected from the
complete analysis across the whole available data mart is not always better than
the information obtained from an investigation of the samples. In fact, in data
mining the analysed databases are often very large, so using a sample of the data
reduces the analysis time. Working with samples allows us to check the model’s
validity against the rest of the data, giving an important diagnostic tool. It also
reduces the risk that the statistical method might adapt to irregularities and lose
its ability to generalise and forecast.
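
A minimal sketch of this sampling strategy, assuming the data mart has been exported to a CSV file with the columns needed for the analysis (both the file and the sample size are placeholders), might look as follows in Python, using scikit-learn for the holdout split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed input: the data mart exported to a CSV file. The file name and
# the sample size are placeholders for a real extraction.
data_mart = pd.read_csv("data_mart.csv")

# Analyse a manageable random sample rather than the whole data mart ...
sample = data_mart.sample(n=10_000, random_state=42)

# ... and hold part of it back, so that the chosen model can later be
# checked against data it has never seen.
train, test = train_test_split(sample, test_size=0.3, random_state=42)
print(len(train), len(test))
```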




Exploratory analysis of the data
This phase involves a preliminary exploration of the data, very similar to OLAP techniques. An initial evaluation of the data's
importance can lead to a transformation of the original variables to better understand the phenomenon or it can lead to statistical methods based on satisfying
specific initial hypotheses. Exploratory analysis can highlight any anomalous
data – items that are different from the rest. These items will not necessarily be eliminated because they might contain information that is important to
achieve the objectives of the analysis. I think that an exploratory analysis of the
data is essential because it allows the analyst to predict which statistical methods
might be most appropriate in the next phase of the analysis. This choice must
obviously bear in mind the quality of the data obtained from the previous phase.
The exploratory analysis might also suggest the need for new extraction of data
because the data collected is considered insufficient to achieve the set aims. The
main exploratory methods for data mining will be discussed in Chapter 3.
Specification of statistical methods
There are various statistical methods that can be used and there are also many
algorithms, so it is important to have a classification of the existing methods.
The choice of method depends on the problem being studied or the type of data
available. The data mining process is guided by the applications. For this reason
the methods used can be classified according to the aim of the analysis. Then we
can distinguish three main classes:
• Descriptive methods: aim to describe groups of data more briefly; they are
also called symmetrical, unsupervised or indirect methods. Observations may
be classified into groups not known beforehand (cluster analysis, Kohonen
maps); variables may be connected among themselves according to links
unknown beforehand (association methods, log-linear models, graphical models). In this way all the variables available are treated at the same level and
there are no hypotheses of causality. Chapters 4 and 5 give examples of
these methods.
• Predictive methods: aim to describe one or more of the variables in relation
to all the others; they are also called asymmetrical, supervised or direct methods. This is done by looking for rules of classification or prediction based on
the data. These rules help us to predict or classify the future result of one or
more response or target variables in relation to what happens to the explanatory or input variables. The main methods of this type are those developed
in the field of machine learning such as the neural networks (multilayer perceptrons) and decision trees but also classic statistical models such as linear
and logistic regression models. Chapters 4 and 5 both illustrate examples of
these methods.
• Local methods: aim to identify particular characteristics related to subsets of interest in the database; descriptive methods and predictive methods are global rather than local. Examples of local methods are association rules for analysing transactional data, which we shall look at in Chapter 4, and the identification of anomalous observations (outliers), also discussed in Chapter 4.
I think this classification is exhaustive, especially from a functional viewpoint.
Further distinctions are discussed in the literature. Each method can be used on
its own or as one stage in a multistage analysis.
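
The distinction between descriptive and predictive methods can be made concrete with a short Python sketch using scikit-learn. The data here is simulated, and k-means clustering and logistic regression simply stand in for the wider families of unsupervised and supervised methods described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                          # explanatory variables
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # response variable

# Descriptive (unsupervised): group observations without any target variable.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Predictive (supervised): learn a classification rule for the response
# from the explanatory variables.
model = LogisticRegression().fit(X, y)
print(clusters[:10], round(model.score(X, y), 3))
```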
Data analysis
Once the statistical methods have been specified, they must be translated into
appropriate computational algorithms that help us synthesise the
results we need from the available database. The wide range of specialised and
non-specialised software for data mining means that for most standard applications it is not necessary to develop ad hoc algorithms; the algorithms that come
with the software should be sufficient. Nevertheless, those managing the data
mining process should have a sound knowledge of the different methods as well
as the software solutions, so they can adapt the process to the specific needs of
the company and interpret the results correctly when taking decisions.
Evaluation of statistical methods
To produce a final decision it is necessary to choose the best model of data
analysis from the statistical methods available. Therefore the choice of the model
and the final decision rule are based on a comparison of the results obtained with
the different methods. This is an important diagnostic check on the validity of
the specific statistical methods that are then applied to the available data. It is
possible that none of the methods used permits the set of aims to be achieved
satisfactorily. Then it will be necessary to go back and specify a new method
that is more appropriate for the analysis.
When evaluating the performance of a specific method, as well as diagnostic
measures of a statistical type, other things must be considered such as time
constraints, resource constraints, data quality and data availability. In data mining
it is rarely a good idea to use just one statistical method to analyse the data.
Different methods have the potential to highlight different aspects, aspects which
might otherwise have been ignored.
To choose the best final model it is necessary to apply and compare various
techniques quickly and simply, to compare the results produced and then give a
business evaluation of the different rules created.
Implementation of the methods
Data mining is not just an analysis of the data, it is also the integration of
the results into the decision process of the company. Business knowledge, the
extraction of rules and their participation in the decision process allow us to
move from the analytical phase to the production of a decision engine. Once
the model has been chosen and tested with a data set, the classification rule
can be applied to the whole reference population. For example we will be able
to distinguish beforehand which customers will be more profitable or we can
calibrate differentiated commercial policies for different target consumer groups,
thereby increasing the profits of the company.
Having seen the benefits we can get from data mining, it is crucial to implement the process correctly to exploit its full potential. The inclusion of the data
mining process in the company organisation must be done gradually, setting out
realistic aims and looking at the results along the way. The final aim is for data
mining to be fully integrated with the other activities that are used to back up
company decisions.
This process of integration can be divided into four phases:
• Strategic phase: in this first phase we study the business procedure being
used in order to identify where data mining could give most benefits. The
results at the end of this phase are the definition of the business objectives
for a pilot data mining project and the definition of criteria to evaluate the
project itself.
• Training phase: this phase allows us to evaluate the data mining activity
more carefully. A pilot project is set up and the results are assessed using
the objectives and the criteria established in the previous phase. The choice
of the pilot project is a fundamental aspect. It must be simple and easy to
use but important enough to create interest. If the pilot project is positive,
there are two possible results: the preliminary evaluation of the utility of
the different data mining techniques and the definition of a prototype data
mining system.
• Creation phase: if the positive evaluation of the pilot project results in implementing a complete data mining system, it will then be necessary to establish
a detailed plan to reorganise the business procedure to include the data mining activity. More specifically, it will be necessary to reorganise the business
database with the possible creation of a data warehouse; to develop the previous data mining prototype until we have an initial operational version; and
to allocate personnel and time to follow the project.
• Migration phase: at this stage all we need to do is prepare the organisation appropriately so the data mining process can be successfully integrated. This means teaching likely users the potential of the new system and increasing their trust in the benefits it will bring. It also means constantly evaluating (and communicating) the results obtained from the data mining process.
For data mining to be considered a valid process within a company, it needs to
involve at least three different professionals with strong communication and interpersonal
skills:
– Business experts, to set the objectives and interpret the results of data mining
– Information technology experts, who know about the data and technologies needed
– Experts in statistical methods for the data analysis phase



1.3 Software for data mining
A data mining project requires adequate software to perform the analysis. Most
software systems only implement specific techniques; they can be seen as specialised software systems for statistical data analysis. But because the aim of
data mining is to look for relations that are previously unknown and to compare
the available methods of analysis, I do not think these specialised systems are
suitable.
Valid data mining software should create an integrated data mining system that
allows the use and comparison of different techniques; it should also integrate
with complex database management software. Few such systems exist. Most of
the available options are listed on the website www.kdnuggets.com/.
This book makes many references to the SAS software, so here is a brief
description of the integrated SAS data mining software called Enterprise Miner
(SAS Institute, 2001). Most of the processing presented in the case studies is
carried out using this system as well as other SAS software models.
To plan, implement and successfully set up a data mining project it is necessary to have an integrated software solution that includes all the phases of
the analytical process. These go from sampling the data, through the analytical
and modelling phases, and up to the publication of the resulting business information. Furthermore, the ideal solution should be user-friendly, intuitive and
flexible enough to allow the user with little experience in statistics to understand
and use it.
The SAS Enterprise Miner software is a solution of this kind. It comes from
SAS’s long experience in the production of software tools for data analysis, and
since it appeared on the market in 1998 it has become the worldwide leader in this
field. It brings together the system of statistical analysis and SAS reporting with
a graphical user interface (GUI) that is easy to use and can be understood by
company analysts and statistics experts.
The GUI elements can be used to implement the data mining method developed by the SAS Institute, the SEMMA method. This method sets out some basic
data mining elements without imposing a rigid and predetermined route for the
project. It provides a logical process that allows business analysts and statistics
experts to achieve the aims of the data mining projects by choosing the elements
of the GUI they need. The visual representation of this structure is a process flow
diagram (PFD) that graphically illustrates the steps taken to complete a single
data mining project.
The SEMMA method defined by the SAS Institute is a general reference
structure that can be used to organise the phases of the data mining project.
Schematically, the SEMMA method set out by SAS consists of a series of
‘steps’ that must be followed to complete the data analysis, steps which are
perfectly integrated with SAS Enterprise Miner. SEMMA is an acronym that
stands for 'sample, explore, modify, model and assess':
• Sample: this extracts a part of the data that is large enough to contain important information and small enough to be analysed quickly.

