
Practical Data Analysis

Second Edition

A practical guide to obtaining, transforming, exploring, and
analyzing data using Python, MongoDB, and Apache Spark

Hector Cuesta
Dr. Sampath Kumar

BIRMINGHAM - MUMBAI



Practical Data Analysis
Second Edition
Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the authors, nor Packt Publishing, and its
dealers and distributors will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the


companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2013
Second edition published: September 2016
Production reference: 1260916

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78528-971-2
www.packtpub.com



Credits

Authors
Hector Cuesta
Dr. Sampath Kumar

Reviewers
Chandana N. Athauda
Mark Kerzner

Commissioning Editor
Amarabha Banarjee

Acquisition Editor
Denim Pinto

Content Development Editor
Divij Kotian

Technical Editor
Rutuja Vaze

Copy Editor
Safis Editing

Project Coordinator
Ritika Manoj

Proofreader
Safis Editing

Indexer
Tejal Daruwale Soni

Production Coordinator
Melwyn Dsa

Cover Work
Melwyn Dsa



About the Authors
Hector Cuesta is the founder and Chief Data Scientist at Dataxios, a machine intelligence research company. He holds a BA in Informatics and an M.Sc. in Computer Science. He provides consulting services for data-driven product design, with experience in a variety of industries including financial services, retail, fintech, e-learning, and human resources. He is a robotics enthusiast in his spare time.
You can follow him on Twitter.
I would like to dedicate this book to my wife Yolanda, and to my wonderful children Damian and Isaac for all the joy they bring into my life. To my parents Elena and Miguel for their constant support and love.

Dr. Sampath Kumar works as an assistant professor and head of the Department of Applied Statistics at Telangana University. He has completed an M.Sc., M.Phil., and Ph.D. in statistics. He has five years of teaching experience on PG courses and more than four years of experience in the corporate sector. His expertise is in statistical data analysis using SPSS, SAS, R, Minitab, MATLAB, and so on. He is an advanced programmer in SAS and MATLAB software. He has teaching experience in different applied and pure statistics subjects, such as forecasting models, applied regression analysis, multivariate data analysis, and operations research, for M.Sc. students. He is currently supervising Ph.D. scholars.




About the Reviewers
Chandana N. Athauda is currently employed at BAG (Brunei Accenture Group)
Networks—Brunei and he serves as a technical consultant. He mainly focuses on Business
Intelligence, Big Data and Data Visualization tools and technologies.
He has been working professionally in the IT industry for more than 15 years (Ex-Microsoft
Most Valuable Professional (MVP) and Microsoft Ranger for TFS). His roles in the IT
industry have spanned the entire spectrum from programmer to technical consultant.
Technology has always been a passion for him.
If you would like to talk to Chandana about this book, feel free to write to him at info@inzeek.net or send him a tweet at @inzeek.

Mark Kerzner is a Big Data architect and trainer. Mark is a founder and principal at
Elephant Scale, offering Big Data training and consulting. Mark has written HBase Design
Patterns for Packt.
I would like to acknowledge my co-founder Sujee Maniyam and his colleague Tim Fox, as well as all
the students and teachers. Last but not least, thanks to my multi-talented family.



www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com.

eBooks, discount offers, and more
Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a
print book customer, you are entitled to a discount on the eBook copy. Get in touch with us
at for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters and receive exclusive discounts and offers on Packt books and
eBooks.

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book
library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders
Get notified! Find out when new books are published by following @PacktEnterprise on
Twitter or the Packt Enterprise Facebook page.



Table of Contents

Preface
Chapter 1: Getting Started
    Computer science
    Artificial intelligence
    Machine learning
    Statistics
    Mathematics
    Knowledge domain
    Data, information, and knowledge
        Inter-relationship between data, information, and knowledge
    The nature of data
    The data analysis process
        The problem
        Data preparation
        Data exploration
        Predictive modeling
        Visualization of results
    Quantitative versus qualitative data analysis
    Importance of data visualization
    What about big data?
    Quantified self
        Sensors and cameras
    Social network analysis
    Tools and toys for this book
        Why Python?
        Why mlpy?
        Why D3.js?
        Why MongoDB?
    Summary
Chapter 2: Preprocessing Data
    Data sources
        Open data
        Text files
        Excel files
        SQL databases
        NoSQL databases
        Multimedia
        Web scraping
    Data scrubbing
        Statistical methods
        Text parsing
        Data transformation
    Data formats
        Parsing a CSV file with the CSV module
        Parsing a CSV file using NumPy
        JSON
            Parsing a JSON file using the JSON module
        XML
            Parsing XML in Python using the XML module
        YAML
    Data reduction methods
        Filtering and sampling
        Binned algorithm
        Dimensionality reduction
    Getting started with OpenRefine
        Text facet
        Clustering
        Text filters
        Numeric facets
        Transforming data
        Exporting data
        Operation history
    Summary
Chapter 3: Getting to Grips with Visualization
    What is visualization?
        Working with web-based visualization
        Exploring scientific visualization
        Visualization in art
        The visualization life cycle
    Visualizing different types of data
        HTML
        DOM
        CSS
        JavaScript
        SVG
    Getting started with D3.js
        Bar chart
        Pie chart
        Scatter plots
        Single line chart
        Multiple line chart
        Interaction and animation
    Data from social networks
    An overview of visual analytics
    Summary
Chapter 4: Text Classification
    Learning and classification
    Bayesian classification
        Naïve Bayes
    E-mail subject line tester
        The data
        The algorithm
        Classifier accuracy
    Summary
Chapter 5: Similarity-Based Image Retrieval
    Image similarity search
    Dynamic time warping
    Processing the image dataset
    Implementing DTW
    Analyzing the results
    Summary
Chapter 6: Simulation of Stock Prices
    Financial time series
    Random walk simulation
    Monte Carlo methods
        Generating random numbers
    Implementation in D3.js
    Quantitative analyst
    Summary
Chapter 7: Predicting Gold Prices
    Working with time series data
        Components of a time series
        Smoothing time series
    Linear regression
    The data – historical gold prices
    Nonlinear regressions
        Kernel ridge regressions
    Smoothing the gold prices time series
        Predicting in the smoothed time series
        Contrasting the predicted value
    Summary
Chapter 8: Working with Support Vector Machines
    Understanding the multivariate dataset
    Dimensionality reduction
        Linear Discriminant Analysis (LDA)
        Principal Component Analysis (PCA)
    Getting started with SVM
        Kernel functions
        The double spiral problem
        SVM implemented on mlpy
    Summary
Chapter 9: Modeling Infectious Diseases with Cellular Automata
    Introduction to epidemiology
        The epidemiology triangle
    The epidemic models
        The SIR model
        Solving the ordinary differential equation for the SIR model with SciPy
        The SIRS model
    Modeling with Cellular Automaton
        Cell, state, grid, neighborhood
        Global stochastic contact model
        Simulation of the SIRS model in CA with D3.js
    Summary
Chapter 10: Working with Social Graphs
    Structure of a graph
        Undirected graph
        Directed graph
    Social networks analysis
    Acquiring the Facebook graph
    Working with graphs using Gephi
    Statistical analysis
        Male to female ratio
        Degree distribution
        Histogram of a graph
        Centrality
    Transforming GDF to JSON
    Graph visualization with D3.js
    Summary
Chapter 11: Working with Twitter Data
    The anatomy of Twitter data
        Tweet
        Followers
        Trending topics
    Using OAuth to access the Twitter API
    Getting started with Twython
        Simple search using Twython
        Working with timelines
        Working with followers
        Working with places and trends
        Working with user data
    Streaming API
    Summary
Chapter 12: Data Processing and Aggregation with MongoDB
    Getting started with MongoDB
        Database
        Collection
        Document
        Mongo shell
            Insert/Update/Delete
            Queries
    Data preparation
        Data transformation with OpenRefine
        Inserting documents with PyMongo
    Group
    Aggregation framework
        Pipelines
        Expressions
    Summary
Chapter 13: Working with MapReduce
    An overview of MapReduce
        Programming model
    Using MapReduce with MongoDB
        Map function
        Reduce function
        Using mongo shell
        Using Jupyter
        Using PyMongo
    Filtering the input collection
    Grouping and aggregation
    Counting the most common words in tweets
    Summary
Chapter 14: Online Data Analysis with Jupyter and Wakari
    Getting started with Wakari
    Creating an account in Wakari
    Getting started with IPython notebook
    Data visualization
    Introduction to image processing with PIL
        Opening an image
        Working with an image histogram
        Filtering
        Operations
        Transformations
    Getting started with pandas
        Working with time series
        Working with multivariate datasets with DataFrame
        Grouping, aggregation, and correlation
    Sharing your Notebook
    The data
    Summary
Chapter 15: Understanding Data Processing using Apache Spark
    Platform for data processing
    The Cloudera platform
    Installing Cloudera VM
    An introduction to the distributed file system
        First steps with Hadoop Distributed File System – HDFS
        File management with HUE – web interface
    An introduction to Apache Spark
        The Spark ecosystem
        The Spark programming model
        An introductory working example of Apache Spark
    Summary
Index


Preface
Practical Data Analysis provides a series of practical projects in order to turn data into insight. It covers a wide range of data analysis tools and algorithms for classification, clustering, visualization, simulation, and forecasting. The goal of this book is to help you understand your data and find patterns, trends, relationships, and insights.
This book contains practical projects that take advantage of MongoDB, D3.js, and the Python language and its ecosystem to present the concepts using code snippets and detailed descriptions.

What this book covers
Chapter 1, Getting Started, discusses the principles of data analysis and the data analysis process.

Chapter 2, Preprocessing Data, explains how to scrub and prepare your data for analysis, and introduces OpenRefine, a data cleansing tool.

Chapter 3, Getting to Grips with Visualization, shows how to visualize different kinds of data using D3.js, a JavaScript visualization framework.

Chapter 4, Text Classification, introduces binary classification using a Naïve Bayes algorithm to classify spam.

Chapter 5, Similarity-Based Image Retrieval, presents a project to find the similarity between images using a dynamic time warping approach.

Chapter 6, Simulation of Stock Prices, explains how to simulate a stock price using a random walk algorithm, visualized with a D3.js animation.

Chapter 7, Predicting Gold Prices, introduces how kernel ridge regression works, and how to use it to predict the gold price using a time series.

Chapter 8, Working with Support Vector Machines, describes how to use support vector machines as a classification method.

Chapter 9, Modeling Infectious Diseases with Cellular Automata, introduces the basic concepts of computational epidemiology simulation and explains how to implement a cellular automaton to simulate an epidemic outbreak using D3.js and JavaScript.

Chapter 10, Working with Social Graphs, explains how to obtain and visualize your social media graph from Facebook using Gephi.

Chapter 11, Working with Twitter Data, explains how to use the Twitter API to retrieve data from Twitter. We also see how to improve text classification to perform sentiment analysis using the Naïve Bayes algorithm implemented in the Natural Language Toolkit (NLTK).

Chapter 12, Data Processing and Aggregation with MongoDB, introduces the basic operations in MongoDB as well as methods for grouping, filtering, and aggregation.

Chapter 13, Working with MapReduce, illustrates how to use the MapReduce programming model implemented in MongoDB.

Chapter 14, Online Data Analysis with Jupyter and Wakari, explains how to use the Wakari platform and introduces the basic use of pandas and PIL with IPython.

Chapter 15, Understanding Data Processing using Apache Spark, explains how to use a distributed file system with the Cloudera VM and how to get started with a data environment. Finally, we describe the main features of Apache Spark with a practical example.

What you need for this book

The basic requirements for this book are as follows:
Python
OpenRefine
D3.js
mlpy
Natural Language Toolkit (NLTK)
Gephi
MongoDB




Who this book is for
This book is for software developers, analysts, and computer scientists who want to implement data analysis and visualization in a practical way. The book is also intended to provide a self-contained set of practical projects in order to get insight from different kinds of data, such as time series, numerical, multidimensional, social media graphs, and text.
You are not required to have previous knowledge of data analysis, but some basic knowledge of statistics and a general understanding of Python programming are assumed.

Conventions
In this book, you will find a number of text styles that distinguish between different kinds
of information. Here are some examples of these styles and an explanation of their
meaning. Code words in text, database table names, folder names, filenames, file
extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as
follows: "For this example, we will use the BeautifulSoup library version 4."
A block of code is set as follows:

from bs4 import BeautifulSoup
import urllib.request
from time import sleep
from datetime import datetime

Any command-line input or output is written as follows:
>>>
>>> readers
>>> packt.com

New terms and important words are shown in bold. Words that you see on the screen, for
example, in menus or dialog boxes, appear in the text like this: "Now, just click on the OK
button to apply the transformation."
Warnings or important notes appear in a box like this.




Tips and tricks appear like this.

Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book—what you liked or disliked. Reader feedback is important for us as it helps us
develop titles that you will really get the most out of.
To send us general feedback, simply e-mail , and mention the
book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide at www.packtpub.com/authors.


Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.

Downloading the example code
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.





You can also download the code files by clicking on the Code Files button on the book's
webpage at the Packt Publishing website. This page can be accessed by entering the book's
name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Data-Analysis-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from the Packt Publishing website (downloads/B 4227_PracticalDataAnalysisSecondEdition_ColorImages.pdf).

Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.





Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy.
Please contact us at with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable
content.

Questions
If you have a problem with any aspect of this book, you can contact us at
, and we will do our best to address the problem.



1

Getting Started
Data analysis is the process in which raw data is ordered and organized to be used in methods that help to evaluate and explain the past and predict the future. Data analysis is not about the numbers; it is about asking questions, developing explanations, and testing hypotheses based on logical and analytical methods. Data analysis is a multidisciplinary field that combines computer science, artificial intelligence, machine learning, statistics, mathematics, and knowledge of the business domain, as shown in the following figure:

All of these skills are important for gaining a good understanding of the problem and its optimal solution, so let's define those fields.

Computer science
Computer science creates the tools for data analysis. The vast amount of data generated has
made computational analysis critical and has increased the demand for skills like
programming, database administration, network administration, and high-performance
computing. Some programming experience in Python (or any high-level programming
language) is needed to follow the chapters in this book.



Artificial intelligence
According to Stuart Russell and Peter Norvig:
“Artificial intelligence has to do with smart programs, so let's get on and write some”.
In other words, Artificial intelligence (AI) studies algorithms that can simulate intelligent behavior. In data analysis, we use AI to perform activities that require intelligence, such as inference, similarity search, or unsupervised classification. Fields like deep learning rely on artificial intelligence algorithms; some of their current uses are chatbots, recommendation engines, image classification, and so on.

Machine learning
Machine learning (ML) is the study of computer algorithms to learn how to react in a
certain situation or recognize patterns. According to Arthur Samuel (1959):
“Machine Learning is a field of study that gives computers the ability to learn without
being explicitly programmed”.
ML comprises a large number of algorithms, generally split into three groups depending on how the algorithms are trained:
Supervised learning
Unsupervised learning
Reinforcement learning
A relevant number of these algorithms are used throughout the book, combined with practical examples that lead the reader from the initial data problem to its programming solution.
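The supervised case can be illustrated with a deliberately tiny sketch: a one-nearest-neighbor classifier written in plain Python (not the mlpy library used later in the book). The fruit measurements and labels below are invented for this example.

```python
import math

def nearest_neighbor(train, query):
    """Return the label of the training point closest to the query.

    train: list of (features, label) pairs; features are numeric tuples.
    query: a feature tuple to classify.
    """
    def distance(a, b):
        # Euclidean distance between two feature tuples
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Supervised learning in miniature: the labeled examples are the model
    _, label = min(train, key=lambda pair: distance(pair[0], query))
    return label

# Hypothetical labeled data: (weight in grams, diameter in cm) -> fruit
training_data = [
    ((150, 7.0), "apple"),
    ((170, 7.5), "apple"),
    ((120, 6.0), "orange"),
    ((130, 6.2), "orange"),
]

print(nearest_neighbor(training_data, (160, 7.2)))
```

The point is that the "model" here is nothing more than the labeled examples, and prediction is a lookup of the closest one; the classifiers used later in the book generalize well beyond this.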

Statistics
In January 2009, Google's Chief Economist Hal Varian said:
“I keep saying the sexy job in the next ten years will be statisticians. People think I'm
joking, but who would've guessed that computer engineers would've been the sexy job of
the 1990s?”




Statistics is the development and application of methods to collect, analyze, and interpret
data. Data analysis encompasses a variety of statistical techniques such as simulation,
Bayesian methods, forecasting, regression, time-series analysis, and clustering.
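For the descriptive end of this toolbox, Python's built-in statistics module already covers the basics; a minimal sketch, with temperature readings invented for the example:

```python
import statistics

# Hypothetical daily temperature readings (degrees Celsius)
temps = [21.0, 22.5, 19.8, 23.1, 20.4, 22.0, 21.7]

mean = statistics.mean(temps)      # central tendency
median = statistics.median(temps)  # robust central tendency
stdev = statistics.stdev(temps)    # sample standard deviation (spread)

print(f"mean={mean:.2f}, median={median:.2f}, stdev={stdev:.2f}")
```

Techniques such as regression, forecasting, and clustering build on exactly these kinds of summary quantities.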

Mathematics
Data analysis makes use of many mathematical techniques in its algorithms, such as linear algebra (vectors and matrices, factorization, eigenvalues), numerical methods, and conditional probability. In this book, all the chapters are self-contained and include the necessary math involved.

Knowledge domain
One of the most important activities in data analysis is asking questions, and a good understanding of the knowledge domain can give you the expertise and intuition needed to ask good questions. Data analysis is used in almost every domain, including finance, administration, business, social media, government, and science.

Data, information, and knowledge
Data is facts of the world. Data represents a fact or statement of an event without relation to other things. Data comes in many forms, such as web pages, sensors, devices, audio, video, networks, log files, social media, transactional applications, and much more. Most of this data is generated in real time and on a very large scale. Although it is generally alphanumeric (text, numbers, and symbols), it can consist of images or sound. Data consists of raw facts and figures. It does not have any meaning until it is processed. For example, financial transactions, age, temperature, and the number of steps from my house to my office are simply numbers. Information appears when we work with those numbers and can find their value and meaning.
Information can be considered as an aggregation of data. Information usually has some meaning and purpose, and can make decisions easier. After processing the data, we can put the information within a context in order to give it proper meaning. In computer jargon, a relational database makes information from the data stored within it.
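That last remark can be made concrete with Python's built-in sqlite3 module: individual rows are raw data, and an aggregating query turns them into information within a context. The table and sales figures here are invented for the example.

```python
import sqlite3

# In-memory database holding hypothetical sales transactions
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")

# Raw data: isolated facts that carry no meaning on their own
rows = [("apples", 10.0), ("oranges", 5.5), ("apples", 7.5), ("oranges", 4.5)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Information: the same data aggregated within a context (revenue per product)
totals = dict(conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product"))
for product, total in sorted(totals.items()):
    print(f"{product}: {total}")

conn.close()
```

The GROUP BY query is where the transformation happens: four disconnected numbers become two figures a decision maker can act on.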




Knowledge is information with meaning. Knowledge happens only when human experience and insight are applied to data and information. We can talk about knowledge when data and information turn into a set of rules that assist decisions. In fact, we can't store knowledge, because it implies the theoretical or practical understanding of a subject. The ultimate purpose of knowledge is value creation.

Inter-relationship between data, information, and knowledge
We can observe that the relationship between data, information, and knowledge exhibits cyclical behavior. The following diagram demonstrates the relationship between them. It also explains the transformation of data into information and vice versa, and similarly between information and knowledge. If we apply valuable information based on context and purpose, it reflects knowledge. At the same time, processed and analyzed data gives us information. When looking at the transformation of data to information and information to knowledge, we should concentrate on the context, purpose, and relevance of the task.




Now I would like to discuss these relationships with a real-life example:
Our students conducted a survey for their project with the purpose of collecting data related to customer satisfaction with a product, and to evaluate the effect of reducing the price of that product. As it was a real project, our students got to make the final decision on how to satisfy the customers. The data collected by the survey was processed and a final report was prepared. Based on the project report, the manufacturer of that product has since reduced the cost.
Let's take a look at the following:
Data: Facts from the survey.
For example: the number of customers who purchased the product, satisfaction levels, competitor information, and so on.
Information: The project report.
For example: the satisfaction level related to price, based on the competitor's product.
Knowledge: The manufacturer learned what to do to achieve customer satisfaction and increase product sales.
For example: the manufacturing cost of the product, transportation cost, quality of the product, and so on.
Finally, we can say that the data-information-knowledge hierarchy seems like a great idea. Moreover, by using predictive analytics we can simulate intelligent behavior and provide a good approximation. The following image is an example of how to turn data into knowledge:


