

Analytics in a Big Data World


Wiley & SAS Business Series

The Wiley & SAS Business Series presents books that help senior‐level
managers with their critical management decisions.
Titles in the Wiley & SAS Business Series include:
Activity‐Based Management for Financial Institutions: Driving Bottom‐Line Results by Brent Bahnub
Bank Fraud: Using Technology to Combat Losses by Revathi Subramanian
Big Data Analytics: Turning Big Data into Big Money by Frank Ohlhorst
Branded! How Retailers Engage Consumers with Social Media and Mobility by Bernie Brennan and Lori Schafer
Business Analytics for Customer Intelligence by Gert Laursen
Business Analytics for Managers: Taking Business Intelligence beyond Reporting by Gert Laursen and Jesper Thorlund
The Business Forecasting Deal: Exposing Bad Practices and Providing Practical Solutions by Michael Gilliland
Business Intelligence Applied: Implementing an Effective Information and Communications Technology Infrastructure by Michael Gendron
Business Intelligence in the Cloud: Strategic Implementation Guide by Michael S. Gendron
Business Intelligence Success Factors: Tools for Aligning Your Business in the Global Economy by Olivia Parr Rud
CIO Best Practices: Enabling Strategic Value with Information Technology, second edition by Joe Stenzel
Connecting Organizational Silos: Taking Knowledge Flow Management to the Next Level with Social Media by Frank Leistner
Credit Risk Assessment: The New Lending System for Borrowers, Lenders, and Investors by Clark Abrahams and Mingyuan Zhang
Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring by Naeem Siddiqi
The Data Asset: How Smart Companies Govern Their Data for Business Success by Tony Fisher
Delivering Business Analytics: Practical Guidelines for Best Practice by Evan Stubbs
Demand‐Driven Forecasting: A Structured Approach to Forecasting, Second Edition by Charles Chase
Demand‐Driven Inventory Optimization and Replenishment: Creating a More Efficient Supply Chain by Robert A. Davis
The Executive’s Guide to Enterprise Social Media Strategy: How Social Networks Are Radically Transforming Your Business by David Thomas and Mike Barlow
Economic and Business Forecasting: Analyzing and Interpreting Econometric Results by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah Watt, and Sam Bullard
Executive’s Guide to Solvency II by David Buckham, Jason Wahl, and Stuart Rose
Fair Lending Compliance: Intelligence and Implications for Credit Risk Management by Clark R. Abrahams and Mingyuan Zhang
Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide to Fundamental Concepts and Practical Applications by Robert Rowan
Health Analytics: Gaining the Insights to Transform Health Care by Jason Burke
Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World by Carlos Andre Reis Pinheiro and Fiona McNeill
Human Capital Analytics: How to Harness the Potential of Your Organization’s Greatest Asset by Gene Pease, Boyce Byerly, and Jac Fitz‐enz
Implement, Improve and Expand Your Statewide Longitudinal Data System: Creating a Culture of Data in Education by Jamie McQuiggan and Armistead Sapp
Information Revolution: Using the Information Evolution Model to Grow Your Business by Jim Davis, Gloria J. Miller, and Allan Russell
Killer Analytics: Top 20 Metrics Missing from Your Balance Sheet by Mark Brown
Manufacturing Best Practices: Optimizing Productivity and Product Quality by Bobby Hull
Marketing Automation: Practical Steps to More Effective Direct Marketing by Jeff LeSueur
Mastering Organizational Knowledge Flow: How to Make Knowledge Sharing Work by Frank Leistner
The New Know: Innovation Powered by Analytics by Thornton May
Performance Management: Integrating Strategy Execution, Methodologies, Risk, and Analytics by Gary Cokins
Predictive Business Analytics: Forward‐Looking Capabilities to Improve Business Performance by Lawrence Maisel and Gary Cokins
Retail Analytics: The Secret Weapon by Emmett Cox
Social Network Analysis in Telecommunications by Carlos Andre Reis Pinheiro
Statistical Thinking: Improving Business Performance, second edition by Roger W. Hoerl and Ronald D. Snee
Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics by Bill Franks
Too Big to Ignore: The Business Case for Big Data by Phil Simon
The Value of Business Analytics: Identifying the Path to Profitability by Evan Stubbs
Visual Six Sigma: Making Data Analysis Lean by Ian Cox, Marie A. Gaudard, Philip J. Ramsey, Mia L. Stephens, and Leo Wright
Win with Advanced Business Analytics: Creating Business Value from Your Data by Jean Paul Isson and Jesse Harriott
For more information on any of the above titles, please visit www.wiley.com.


Analytics in a Big Data World
The Essential Guide to Data Science and Its Applications

Bart Baesens


Cover image: ©iStockphoto/vlastos
Cover design: Wiley
Copyright © 2014 by Bart Baesens. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, electronic, mechanical,
photocopying, recording, scanning, or otherwise, except as permitted under
Section 107 or 108 of the 1976 United States Copyright Act, without either the
prior written permission of the Publisher, or authorization through payment
of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222
Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or
on the Web at www.copyright.com. Requests to the Publisher for permission
should be addressed to the Permissions Department, John Wiley & Sons, Inc.,
111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have
used their best efforts in preparing this book, they make no representations
or warranties with respect to the accuracy or completeness of the contents of
this book and specifically disclaim any implied warranties of merchantability
or fitness for a particular purpose. No warranty may be created or extended
by sales representatives or written sales materials. The advice and strategies
contained herein may not be suitable for your situation. You should consult
with a professional where appropriate. Neither the publisher nor author shall
be liable for any loss of profit or any other commercial damages, including but
not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical
support, please contact our Customer Care Department within the United
States at (800) 762-2974, outside the United States at (317) 572-3993 or
fax (317) 572-4002.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book
may not be included in e-books or in print-on-demand. If this book refers to
media such as a CD or DVD that is not included in the version you purchased,
you may download this material at . For more
information about Wiley products, visit www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Baesens, Bart.
Analytics in a big data world : the essential guide to data science and its
applications / Bart Baesens.
1 online resource. — (Wiley & SAS business series)
Description based on print version record and CIP data provided by publisher;
resource not viewed.
ISBN 978-1-118-89271-8 (ebk); ISBN 978-1-118-89274-9 (ebk);
ISBN 978-1-118-89270-1 (cloth) 1. Big data. 2. Management—Statistical
methods. 3. Management—Data processing. 4. Decision making—Data
processing. I. Title.
HD30.215

658.4’038 dc23
2014004728
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1


To my wonderful wife, Katrien, and my kids,
Ann-Sophie, Victor, and Hannelore.
To my parents and parents-in-law.



Contents

Preface
Acknowledgments
Chapter 1 Big Data and Analytics
  Example Applications
  Basic Nomenclature
  Analytics Process Model
  Job Profiles Involved
  Analytics
  Analytical Model Requirements
  Notes
Chapter 2 Data Collection, Sampling, and Preprocessing
  Types of Data Sources
  Sampling
  Types of Data Elements
  Visual Data Exploration and Exploratory Statistical Analysis
  Missing Values
  Outlier Detection and Treatment
  Standardizing Data
  Categorization
  Weights of Evidence Coding
  Variable Selection
  Segmentation
  Notes
Chapter 3 Predictive Analytics
  Target Definition
  Linear Regression
  Logistic Regression
  Decision Trees
  Neural Networks
  Support Vector Machines
  Ensemble Methods
  Multiclass Classification Techniques
  Evaluating Predictive Models
  Notes
Chapter 4 Descriptive Analytics
  Association Rules
  Sequence Rules
  Segmentation
  Notes
Chapter 5 Survival Analysis
  Survival Analysis Measurements
  Kaplan Meier Analysis
  Parametric Survival Analysis
  Proportional Hazards Regression
  Extensions of Survival Analysis Models
  Evaluating Survival Analysis Models
  Notes
Chapter 6 Social Network Analytics
  Social Network Definitions
  Social Network Metrics
  Social Network Learning
  Relational Neighbor Classifier
  Probabilistic Relational Neighbor Classifier
  Relational Logistic Regression
  Collective Inferencing
  Egonets
  Bigraphs
  Notes
Chapter 7 Analytics: Putting It All to Work
  Backtesting Analytical Models
  Benchmarking
  Data Quality
  Software
  Privacy
  Model Design and Documentation
  Corporate Governance
  Notes
Chapter 8 Example Applications
  Credit Risk Modeling
  Fraud Detection
  Net Lift Response Modeling
  Churn Prediction
  Recommender Systems
  Web Analytics
  Social Media Analytics
  Business Process Analytics
  Notes
About the Author
Index



Preface

Companies are being flooded with tsunamis of data collected in a
multichannel business environment, leaving an untapped potential for analytics to better understand, manage, and strategically
exploit the complex dynamics of customer behavior. In this book, we
will discuss how analytics can be used to create strategic leverage and
identify new business opportunities.
The focus of this book is not on the mathematics or theory, but on
the practical application. Formulas and equations will only be included
when absolutely needed from a practitioner’s perspective. It is also not
our aim to provide exhaustive coverage of all analytical techniques
previously developed, but rather to cover the ones that really provide
added value in a business setting.
The book is written in a condensed, focused way because it is targeted at the business professional. A reader’s prerequisite knowledge
should consist of some basic exposure to descriptive statistics (e.g.,
mean, standard deviation, correlation, confidence intervals, hypothesis
testing), data handling (using, for example, Microsoft Excel, SQL, etc.),
and data visualization (e.g., bar plots, pie charts, histograms, scatter
plots). Throughout the book, many examples of real‐life case studies
will be included in areas such as risk management, fraud detection,
customer relationship management, web analytics, and so forth. The
author will also integrate both his research and consulting experience
throughout the various chapters. The book is aimed at senior data analysts, consultants, analytics practitioners, and PhD researchers starting
to explore the field.
Chapter 1 discusses big data and analytics. It starts with some
example application areas, followed by an overview of the analytics
process model and job profiles involved, and concludes by discussing
key analytic model requirements. Chapter 2 provides an overview of
data collection, sampling, and preprocessing. Data is the key ingredient to any analytical exercise, hence the importance of this chapter.
It discusses sampling, types of data elements, visual data exploration
and exploratory statistical analysis, missing values, outlier detection
and treatment, standardizing data, categorization, weights of evidence
coding, variable selection, and segmentation. Chapter 3 discusses predictive analytics. It starts with an overview of the target definition
and then continues to discuss various analytics techniques such as
linear regression, logistic regression, decision trees, neural networks,
support vector machines, and ensemble methods (bagging, boosting, random forests). In addition, multiclass classification techniques
are covered, such as multiclass logistic regression, multiclass decision trees, multiclass neural networks, and multiclass support vector
machines. The chapter concludes by discussing the evaluation of predictive models. Chapter 4 covers descriptive analytics. First, association
rules are discussed that aim at discovering intratransaction patterns.
This is followed by a section on sequence rules that aim at discovering
intertransaction patterns. Segmentation techniques are also covered.
Chapter 5 introduces survival analysis. The chapter starts by introducing some key survival analysis measurements. This is followed by a
discussion of Kaplan Meier analysis, parametric survival analysis, and
proportional hazards regression. The chapter concludes by discussing
various extensions and evaluation of survival analysis models. Chapter 6 covers social network analytics. The chapter starts by discussing
example social network applications. Next, social network definitions
and metrics are given. This is followed by a discussion on social network
learning. The relational neighbor classifier and its probabilistic variant
together with relational logistic regression are covered next. The chapter ends by discussing egonets and bigraphs. Chapter 7 provides an
overview of key activities to be considered when putting analytics to
work. It starts with a recapitulation of the analytic model requirements
and then continues with a discussion of backtesting, benchmarking,
data quality, software, privacy, model design and documentation, and
corporate governance. Chapter 8 concludes the book by discussing various example applications such as credit risk modeling, fraud detection,
net lift response modeling, churn prediction, recommender systems,
web analytics, social media analytics, and business process analytics.


Acknowledgments

I would like to acknowledge all my colleagues who contributed to
this text: Seppe vanden Broucke, Alex Seret, Thomas Verbraken,
Aimée Backiel, Véronique Van Vlasselaer, Helen Moges, and Barbara
Dergent.






CHAPTER 1
Big Data and Analytics

Data are everywhere. IBM projects that every day we generate 2.5
quintillion bytes of data [1]. In relative terms, this means 90 percent
of the data in the world has been created in the last two years.
Gartner projects that by 2015, 85 percent of Fortune 500 organizations
will be unable to exploit big data for competitive advantage and about
4.4 million jobs will be created around big data [2]. Although these estimates
should not be interpreted in an absolute sense, they are a strong
indication of the ubiquity of big data and the strong need for analytical
skills and resources: as the data pile up, managing and analyzing
them well becomes a critical success factor in creating competitive
advantage and strategic leverage.

Figure 1.1 shows the results of a KDnuggets [3] poll conducted in April 2013
about the largest data sets analyzed. The total number of respondents was 322,
and the number per category is indicated in parentheses. The median was
estimated to be in the 40 to 50 gigabyte (GB) range, which was about double
the median answer for a similar poll run in 2012 (20 to 40 GB). This clearly
shows the quick increase in the size of data that analysts are working on.
A further regional breakdown of the poll showed that U.S. data miners lead
other regions in big data, with about 28% of them working with terabyte (TB)
size databases.

Less than 1 MB (12)      3.7%
1.1 to 10 MB (8)         2.5%
11 to 100 MB (14)        4.3%
101 MB to 1 GB (50)      15.5%
1.1 to 10 GB (59)        18%
11 to 100 GB (52)        16%
101 GB to 1 TB (59)      18%
1.1 to 10 TB (39)        12%
11 to 100 TB (15)        4.7%
101 TB to 1 PB (6)       1.9%
1.1 to 10 PB (2)         0.6%
11 to 100 PB (0)         0%
Over 100 PB (6)          1.9%

Figure 1.1 Results from a KDnuggets Poll about Largest Data Sets Analyzed
Source: www.kdnuggets.com/polls/2013/largest‐dataset‐analyzed‐data‐mined‐2013.html.

A main obstacle to fully harnessing the power of big data using analytics is
the lack of skilled resources and “data scientist” talent required to
exploit big data. In another poll run by KDnuggets in July 2013, a strong
need emerged for analytics/big data/data mining/data science education [4].
It is the purpose of this book to try to fill this gap by providing a
concise and focused overview of analytics for the business practitioner.
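As a small illustration of how the poll figures can be read, the sketch below recomputes the percentage shares from the respondent counts in Figure 1.1 and locates the size category that contains the median respondent. The counts are copied from the figure; the function and variable names are illustrative assumptions, and the result is only the category holding the median, not the finer 40 to 50 GB estimate quoted in the text.

```python
from itertools import accumulate

# Respondent counts per size category, copied from Figure 1.1.
POLL = [
    ("Less than 1 MB", 12),
    ("1.1 to 10 MB", 8),
    ("11 to 100 MB", 14),
    ("101 MB to 1 GB", 50),
    ("1.1 to 10 GB", 59),
    ("11 to 100 GB", 52),
    ("101 GB to 1 TB", 59),
    ("1.1 to 10 TB", 39),
    ("11 to 100 TB", 15),
    ("101 TB to 1 PB", 6),
    ("1.1 to 10 PB", 2),
    ("11 to 100 PB", 0),
    ("Over 100 PB", 6),
]


def median_category(poll):
    """Return the label of the category that contains the median respondent."""
    total = sum(count for _, count in poll)
    if total == 0:
        raise ValueError("poll has no respondents")
    halfway = total / 2
    running_totals = accumulate(count for _, count in poll)
    for (label, _), running in zip(poll, running_totals):
        # The first category whose cumulative count reaches half the sample
        # is the one holding the median respondent.
        if running >= halfway:
            return label


total = sum(count for _, count in POLL)
shares = {label: round(100 * count / total, 1) for label, count in POLL}
median_label = median_category(POLL)

print(total, median_label)
```

The category found this way, 11 to 100 GB, is consistent with the median range reported for the poll.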

EXAMPLE APPLICATIONS
Analytics is everywhere and strongly embedded into our daily lives. As I
write this, I have already been the subject of various analytical models today.
When I checked my physical mailbox this morning, I found a catalogue
sent to me most probably as a result of a response modeling analytical
exercise that indicated that, given my characteristics and previous purchase behavior, I am likely to buy one or more products from it. Today,
I was the subject of a behavioral scoring model of my financial institution. This is a model that will look at, among other things, my checking account balance from the past 12 months and my credit payments
during that period, together with other kinds of information available
to my bank, to predict whether I will default on my loan during the
next year. My bank needs to know this for provisioning purposes. Also
today, my telephone services provider analyzed my calling behavior
and my account information to predict whether I will churn during the
next three months. As I logged on to my Facebook page, the social ads
appearing there were based on analyzing all information (posts, pictures,
my friends and their behavior, etc.) available to Facebook. My Twitter
posts will be analyzed (possibly in real time) by social media analytics to
understand both the subject of my tweets and the sentiment of them.
As I checked out in the supermarket, my loyalty card was scanned first,
followed by all my purchases. This will be used by my supermarket to
analyze my market basket, which will help it decide on product bundling, next best offer, improving shelf organization, and so forth. As I
made the payment with my credit card, my credit card provider used
a fraud detection model to see whether it was a legitimate transaction.
When I receive my credit card statement later, it will be accompanied by
various vouchers that are the result of an analytical customer segmentation exercise to better understand my expense behavior.
To summarize, the relevance, importance, and impact of analytics
are now bigger than ever before and, given that more and more data
are being collected and that there is strategic value in knowing what
is hidden in data, analytics will continue to grow. Without claiming to
be exhaustive, Table 1.1 presents some examples of how analytics is
applied in various settings.
Table 1.1 Example Analytics Applications

Marketing               Risk Management            Government             Web                     Logistics               Other
Response modeling       Credit risk modeling       Tax avoidance          Web analytics           Demand forecasting      Text analytics
Net lift modeling       Market risk modeling       Social security fraud  Social media analytics  Supply chain analytics  Business process analytics
Retention modeling      Operational risk modeling  Money laundering       Multivariate testing
Market basket analysis  Fraud detection            Terrorism detection
Recommender systems
Customer segmentation


It is the purpose of this book to discuss the underlying techniques
and key challenges to work out the applications shown in Table 1.1
using analytics. Some of these applications will be discussed in further
detail in Chapter 8.

BASIC NOMENCLATURE
In order to start doing analytics, some basic vocabulary needs to be
defined. A first important concept here concerns the basic unit of analysis. Customers can be considered from various perspectives. Customer
lifetime value (CLV) can be measured either for individual customers
or at the household level. Another alternative is to look at account
behavior. For example, consider a credit scoring exercise for which
the aim is to predict whether the applicant will default on a particular
mortgage loan account. The analysis can also be done at the transaction level. For example, in insurance fraud detection, one usually performs the analysis at the insurance claim level. Also, in web analytics, the
basic unit of analysis is usually a web visit or session.
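To make the choice of unit of analysis concrete, the following minimal sketch aggregates the same transaction records at the transaction, account, customer, and household levels. The records, identifiers, and function name are hypothetical and serve only to show that the unit chosen determines what one row of the analysis data set represents.

```python
from collections import defaultdict

# Hypothetical transactions:
# (transaction_id, account_id, customer_id, household_id, amount)
TRANSACTIONS = [
    ("t1", "a1", "c1", "h1", 40.0),
    ("t2", "a1", "c1", "h1", 60.0),
    ("t3", "a2", "c2", "h1", 25.0),
    ("t4", "a3", "c3", "h2", 10.0),
]

# Position of the identifier to group by for each unit of analysis.
LEVELS = {"transaction": 0, "account": 1, "customer": 2, "household": 3}


def total_by_unit(transactions, unit):
    """Sum transaction amounts per entity at the requested unit of analysis."""
    key_index = LEVELS[unit]
    totals = defaultdict(float)
    for row in transactions:
        totals[row[key_index]] += row[4]
    return dict(totals)


per_customer = total_by_unit(TRANSACTIONS, "customer")
per_household = total_by_unit(TRANSACTIONS, "household")

print(per_customer)
print(per_household)
```

Customers c1 and c2 are separate rows at the customer level but collapse into a single row at the household level, which is why the unit has to be fixed before any data are aggregated.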
It is also important to note that customers can play different roles.
For example, parents can buy goods for their kids, such that there is
a clear distinction between the payer and the end user. In a banking
setting, a customer can be primary account owner, secondary account
owner, main debtor of the credit, codebtor, guarantor, and so on. It
is very important to clearly distinguish between those different roles
when defining and/or aggregating data for the analytics exercise.
Finally, in case of predictive analytics, the target variable needs to
be appropriately defined. For example, when is a customer considered
to be a churner or not, a fraudster or not, a responder or not, or how
should the CLV be appropriately defined?
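The sketch below illustrates why the target definition matters, using churn as the example. It assumes one possible definition, namely that a customer is a churner when no activity has been observed for more than a given number of days before a snapshot date. This rule, the thresholds, and the customer data are illustrative assumptions, not a definition taken from the text.

```python
from datetime import date, timedelta


def is_churner(last_activity, snapshot, inactivity_days):
    """Flag a customer as churner if inactive for more than inactivity_days."""
    return (snapshot - last_activity) > timedelta(days=inactivity_days)


# Hypothetical snapshot date and last observed activity per customer.
SNAPSHOT = date(2013, 12, 31)
LAST_SEEN = {
    "c1": date(2013, 12, 1),   # 30 days before the snapshot
    "c2": date(2013, 8, 15),   # 138 days before the snapshot
}

# The same customers labeled under two candidate definitions.
labels_90 = {c: is_churner(d, SNAPSHOT, 90) for c, d in LAST_SEEN.items()}
labels_180 = {c: is_churner(d, SNAPSHOT, 180) for c, d in LAST_SEEN.items()}

print(labels_90)
print(labels_180)
```

Customer c2 is a churner under the 90-day rule but not under the 180-day rule, so any model trained on these labels would be predicting a different target in each case.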

ANALYTICS PROCESS MODEL
Figure 1.2 gives a high‐level overview of the analytics process model [5].
As a first step, a thorough definition of the business problem to be
solved with analytics is needed. Next, all source data that could be of potential interest need to be identified. This is a very important step, as
data is the key ingredient to any analytical exercise and the selection of
data will have a decisive impact on the analytical models that will
be built in a subsequent step. All data will then be gathered in a staging area, which could be, for example, a data mart or data warehouse.
Some basic exploratory analysis can be considered here using, for
example, online analytical processing (OLAP) facilities for multidimensional data analysis (e.g., roll‐up, drill down, slicing and dicing). This
will be followed by a data cleaning step to get rid of all inconsistencies,
such as missing values, outliers, and duplicate data. Additional transformations may also be considered, such as binning, alphanumeric to
numeric coding, geographical aggregation, and so forth. In the analytics step, an analytical model will be estimated on the preprocessed and
transformed data. Different types of analytics can be considered here
(e.g., to do churn prediction, fraud detection, customer segmentation,
market basket analysis). Finally, once the model has been built, it will
be interpreted and evaluated by the business experts. Usually, many
trivial patterns will be detected by the model. For example, in a market
basket analysis setting, one may find that spaghetti and spaghetti sauce
are often purchased together. These patterns are interesting because
they provide some validation of the model. But of course, the key issue
here is to find the unexpected yet interesting and actionable patterns
(sometimes also referred to as knowledge diamonds) that can provide
added value in the business setting. Once the analytical model has
been appropriately validated and approved, it can be put into production as an analytics application (e.g., decision support system, scoring
engine). It is important to consider here how to represent the model
output in a user‐friendly way, how to integrate it with other applications (e.g., campaign management tools, risk engines), and how to
make sure the analytical model can be appropriately monitored and
backtested on an ongoing basis.
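The spaghetti and spaghetti sauce pattern mentioned above can be quantified with two common association rule measures, support and confidence. The sketch below computes both on a handful of hypothetical baskets; the basket contents and function names are assumptions chosen for illustration.

```python
# Hypothetical market baskets, one set of items per checkout.
BASKETS = [
    {"spaghetti", "spaghetti sauce", "wine"},
    {"spaghetti", "spaghetti sauce"},
    {"spaghetti", "bread"},
    {"milk", "bread"},
]


def support(baskets, items):
    """Fraction of baskets that contain every item in items."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    base = support(baskets, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs in the baskets")
    return support(baskets, set(antecedent) | set(consequent)) / base


pair_support = support(BASKETS, {"spaghetti", "spaghetti sauce"})
rule_confidence = confidence(BASKETS, {"spaghetti"}, {"spaghetti sauce"})

print(pair_support, rule_confidence)
```

A rule with high support and confidence, such as this one, is exactly the kind of trivial pattern that mainly validates the model; the business value lies in rules that score well and were not already known.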
It is important to note that the process model outlined in Figure 1.2 is iterative in nature, in the sense that one may have to go back
to previous steps during the exercise. For example, during the analytics step, the need for additional data may be identified, which may
necessitate additional cleaning, transformation, and so forth. Also, the
most time‐consuming step is the data selection and preprocessing step;
this usually takes around 80% of the total effort needed to build an
analytical model.
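The data cleaning step described in the process model (duplicates, missing values, outliers) can be sketched as a short chain of functions. The records, the median imputation, and the fixed income cap of 10,000 are illustrative assumptions; in practice the treatment of each issue depends on the data and the business context.

```python
from statistics import median

# Hypothetical raw customer records; field names and values are made up.
RAW = [
    {"id": 1, "age": 34, "income": 2500.0},
    {"id": 2, "age": None, "income": 3100.0},    # missing age
    {"id": 2, "age": None, "income": 3100.0},    # duplicate of the row above
    {"id": 3, "age": 51, "income": 250000.0},    # extreme income
    {"id": 4, "age": 45, "income": 2800.0},
]


def drop_duplicates(rows, key="id"):
    """Keep the first row seen for each key value."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def impute_median(rows, field):
    """Replace missing (None) values of a field by the median of observed values."""
    fill = median(r[field] for r in rows if r[field] is not None)
    return [dict(r, **{field: fill if r[field] is None else r[field]}) for r in rows]


def cap_upper(rows, field, upper):
    """Truncate values of a field at an assumed upper limit."""
    return [dict(r, **{field: min(r[field], upper)}) for r in rows]


clean = cap_upper(impute_median(drop_duplicates(RAW), "age"), "income", 10000.0)

for row in clean:
    print(row)
```

Each function returns new records rather than modifying its input, so individual steps can be rerun when the process loops back, as the iterative nature of the process model suggests.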


