Python: End-to-end
Data Analysis

Leverage the power of Python to clean, scrape,
analyze, and visualize your data

A course in three modules

BIRMINGHAM - MUMBAI


Python: End-to-end Data Analysis
Copyright © 2016 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this course to ensure the accuracy
of the information presented. However, the information contained in this course
is sold without warranty, either express or implied. Neither the authors, nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this course.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this course by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.

Published on: May 2017
Production reference: 1050517


Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78839-469-7
www.packtpub.com


Credits

Authors
Phuong Vo.T.H
Martin Czygan
Ivan Idris
Magnus Vilhelm Persson
Luiz Felipe Martins

Reviewers
Dong Chao
Hai Minh Nguyen
Kenneth Emeka Odoh
Bill Chambers
Alexey Grigorev
Dr. Vahid Mirjalili
Michele Usuelli
Hang (Harvey) Yu
Laurie Lugrin
Chris Morgan
Michele Pratusevich

Content Development Editor
Aishwarya Pandere

Graphics
Jason Monteiro

Production Coordinator
Deepika Naik


Preface
The use of Python for data analysis and visualization has grown steadily in
popularity over the last few years.
The aim of this book is to develop the skills to approach almost any data
analysis problem effectively and to extract all of the available information. This is done by
introducing a range of techniques and methods such as uni- and multivariate linear regression, cluster finding, Bayesian analysis, machine learning, and
time series analysis. Exploratory data analysis is key to getting a sense of what
can be done and to maximizing the insights gained from the data. Additionally,
emphasis is put on presentation-ready figures that are clear and easy to interpret.

What this learning path covers

Module 1, Getting Started with Python Data Analysis, shows how to work with time-oriented data in Pandas. It addresses how to clean, inspect, reshape, merge, and group data.
Pandas is again the library of choice.
Module 2, Python Data Analysis Cookbook, demonstrates how to visualize
data and mentions frequently encountered pitfalls. It also discusses
statistical probability distributions and correlation between two variables.
Module 3, Mastering Python Data Analysis, introduces linear, multiple, and logistic
regression with in-depth examples of using the SciPy and statsmodels packages to test
various hypotheses about relationships between variables.



What you need for this learning path
Module 1:

There are not too many requirements to get started. You will need a Python
programming environment installed on your system. Under Linux and Mac OS X,
Python is usually installed by default. Installation on Windows is supported by an
excellent installer provided and maintained by the community. This book uses a
recent Python 2, but many examples will work with Python 3 as well.
The versions of the libraries used in this book are the following: NumPy 1.9.2, Pandas
0.16.2, matplotlib 1.4.3, tables 3.2.2, pymongo 3.0.3, redis 2.10.3, and scikit-learn
0.16.1. As these packages are all hosted on PyPI, the Python package index, they can
be easily installed with pip. To install NumPy, you would write:
$ pip install numpy
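As a rough sketch of how the versions listed above could be pinned, the following snippet turns them into pip requirement specifiers. The dictionary name and the helper function are made up for illustration and are not part of the course's code; the version numbers are the ones quoted in the text.

```python
# Versions quoted in the text for Module 1 (PyPI project name -> version).
# MODULE1_VERSIONS and requirement_lines are illustrative names only.
MODULE1_VERSIONS = {
    "numpy": "1.9.2",
    "pandas": "0.16.2",
    "matplotlib": "1.4.3",
    "tables": "3.2.2",
    "pymongo": "3.0.3",
    "redis": "2.10.3",
    "scikit-learn": "0.16.1",
}


def requirement_lines(versions):
    """Return pip specifiers such as 'numpy==1.9.2', sorted by project name."""
    return ["{}=={}".format(name, version)
            for name, version in sorted(versions.items())]


if __name__ == "__main__":
    # Print one pinned requirement per line.
    print("\n".join(requirement_lines(MODULE1_VERSIONS)))
```

Saving the printed lines to a file such as requirements.txt and running pip install -r requirements.txt would request exactly these releases. Releases this old may not install on current Python interpreters, so the pins are best read as a record of what was tested.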
If you are not using them already, we suggest you take a look at virtual environments
for managing isolated Python environments on your computer. For Python 2, there
are two packages of interest there: virtualenv and virtualenvwrapper. Since Python
3.3, there is a tool in the standard library called pyvenv (https://docs.python.org/3/library/venv.html),
which serves the same purpose.
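For readers on Python 3, a minimal sketch of creating such an environment from Python itself is shown here. It uses the standard library venv module, which the documentation page linked above describes; the function name and the directory name are assumptions chosen for illustration.

```python
import os
import tempfile
import venv


def create_environment(env_dir):
    """Create an isolated environment in env_dir and return its config file path."""
    # with_pip=False keeps the sketch fast; pass with_pip=True to bootstrap pip too.
    venv.create(env_dir, with_pip=False)
    return os.path.join(env_dir, "pyvenv.cfg")


if __name__ == "__main__":
    # "analysis-env" is a hypothetical directory name used only for this example.
    target = os.path.join(tempfile.mkdtemp(), "analysis-env")
    print(create_environment(target))
```

The command-line equivalent is python3 -m venv followed by the target directory. Once the environment is activated, packages installed with pip stay inside that directory instead of the system-wide Python.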
Most libraries have an attribute for the version, so if you already have
a library installed, you can quickly check its version:
>>> import redis
>>> redis.__version__
'2.10.3'
This works well for most libraries. A few, such as pymongo, use a different attribute
(pymongo uses just version, without the underscores). While all the examples can
be run interactively in a Python shell, we recommend using IPython. IPython
started as a more versatile Python shell, but has since evolved into a powerful tool
for exploration and sharing. We used IPython 4.0.0 with Python 2.7.10. IPython is a
great way to work interactively with Python, be it in the terminal or in the browser.
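The version check described above can be wrapped in a small helper that tries both attribute names. This is only a sketch: get_version is a hypothetical function name, and the fallback order (first __version__, then version) is an assumption based on the two conventions mentioned in the text.

```python
import importlib


def get_version(module_name):
    """Return the version string of an importable module, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The library is not installed in the current environment.
        return None
    # Most libraries expose __version__; a few (pymongo, per the text) use version.
    for attribute in ("__version__", "version"):
        value = getattr(module, attribute, None)
        if value is not None:
            return str(value)
    return None


if __name__ == "__main__":
    # Libraries that are not installed are reported as None.
    for name in ("numpy", "pandas", "redis", "pymongo"):
        print(name, get_version(name))
```

Comparing the reported values against the versions listed for each module is a quick way to spot mismatches before running the examples.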
Module 2:
First, you need a Python 3 distribution. I recommend the full Anaconda distribution
as it comes with the majority of the software we need. I tested the code with Python
3.4 and the following packages:


joblib 0.8.4
IPython 3.2.1
NetworkX 1.9.1
NLTK 3.0.2
Numexpr 2.3.1
pandas 0.16.2
SciPy 0.16.0
seaborn 0.6.0
sqlalchemy 0.9.9
statsmodels 0.6.1
matplotlib 1.5.0
NumPy 1.10.1
scikit-learn 0.17
dautil 0.0.1a29

For some recipes, you need to install extra software, but this is explained whenever
the software is required.
Module 3:
All you need to follow through the examples in this book is a computer running
any recent version of Python. While the examples use Python 3, they can easily be
adapted to work with Python 2, with only minor changes. The packages used in the
examples are NumPy, SciPy, matplotlib, Pandas, statsmodels, PyMC, and scikit-learn.
Optionally, the packages basemap and cartopy are used to plot coordinate points
on maps. The easiest way to obtain and maintain a Python environment that meets
all the requirements of this book is to download a prepackaged Python distribution.
In this book, we have checked all the code against Continuum Analytics' Anaconda
Python distribution and Ubuntu Xenial Xerus (16.04) running Python 3.
To download the example data and code, an Internet connection is needed.

Who this learning path is for

This learning path is for developers, analysts, and data scientists who want to learn
data analysis from scratch. This course will provide you with a solid foundation
from which to analyze data with varying complexity. A working knowledge of
Python (and a strong interest in playing with your data) is recommended.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about
this course—what you liked or disliked. Reader feedback is important for us as it
helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail , and mention
the course's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing
or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt course, we have a number of things
to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this course from your account at
. If you purchased this course elsewhere, you can visit
and register to have the files e-mailed directly
to you.
You can download the code files by following these steps:
1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the course in the Search box.
5. Select the course for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this course from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the
course's webpage at the Packt Publishing website. This page can be accessed by
entering the course's name in the Search box. Please note that you need to be logged
into your Packt account.


Once the file is downloaded, please make sure that you unzip or extract the folder
using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the course is also hosted on GitHub at
PacktPublishing/Python-End-to-end-Data-Analysis. We also have other code
bundles from our rich catalog of books, videos, and courses available at
https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you find a mistake in one of our courses—maybe a mistake in the
text or the code—we would be grateful if you could report this to us. By doing
so, you can save other readers from frustration and help us improve subsequent
versions of this course. If you find any errata, please report them by visiting
http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata
Submission Form link, and entering the details of your errata. Once your errata are
verified, your submission will be accepted and the errata will be uploaded to our
website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to />content/support and enter the name of the course in the search field. The required
information will appear under the Errata section.


Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all
media. At Packt, we take the protection of our copyright and licenses very seriously.
If you come across any illegal copies of our works in any form on the Internet, please
provide us with the location address or website name immediately so that we can
pursue a remedy.


Please contact us at with a link to the suspected pirated
material.
We appreciate your help in protecting our authors and our ability to bring you
valuable content.

Questions

If you have a problem with any aspect of this course, you can contact us at
, and we will do our best to address the problem.



Module 1: Getting Started with Python Data Analysis
Chapter 1: Introducing Data Analysis and Libraries
Data analysis and processing
An overview of the libraries in data analysis
Python libraries in data analysis
NumPy
Pandas
Matplotlib
PyMongo
The scikit-learn library
Summary

Chapter 2: NumPy Arrays and Vectorized Computation
NumPy arrays
Data types
Array creation
Indexing and slicing
Fancy indexing
Numerical operations on arrays
Array functions
Data processing using arrays
Loading and saving data
Saving an array
Loading an array
Linear algebra with NumPy
NumPy random numbers
Summary


Chapter 3: Data Analysis with Pandas
An overview of the Pandas package
The Pandas data structure
Series
The DataFrame
The essential basic functionality
Reindexing and altering labels
Head and tail
Binary operations
Functional statistics
Function application
Sorting
Indexing and selecting data
Computational tools
Working with missing data
Advanced uses of Pandas for data analysis
Hierarchical indexing
The Panel data
Summary

Chapter 4: Data Visualization
The matplotlib API primer
Line properties
Figures and subplots
Exploring plot types
Scatter plots
Bar plots
Contour plots
Histogram plots
Legends and annotations
Plotting functions with Pandas
Additional Python data visualization tools
Bokeh
MayaVi
Summary

Chapter 5: Time Series
Time series primer
Working with date and time objects
Resampling time series
Downsampling time series data
Upsampling time series data
Time zone handling
Timedeltas
Time series plotting
Summary

Chapter 6: Interacting with Databases
Interacting with data in text format
Reading data from text format
Writing data to text format
Interacting with data in binary format
HDF5
Interacting with data in MongoDB
Interacting with data in Redis
The simple value
List
Set
Ordered set
Summary

Chapter 7: Data Analysis Application Examples
Data munging
Cleaning data
Filtering
Merging data
Reshaping data
Data aggregation
Grouping data
Summary

Chapter 8: Machine Learning Models with scikit-learn
An overview of machine learning models
The scikit-learn modules for different models
Data representation in scikit-learn
Supervised learning – classification and regression
Unsupervised learning – clustering and dimensionality reduction
Measuring prediction performance
Summary


Module 2: Python Data Analysis Cookbook
Chapter 1: Laying the Foundation for Reproducible Data Analysis
Introduction
Setting up Anaconda
  Getting ready
  How to do it...
  There's more...
  See also
Installing the Data Science Toolbox
  Getting ready
  How to do it...
  How it works...
  See also
Creating a virtual environment with virtualenv and virtualenvwrapper
  Getting ready
  How to do it...
  See also
Sandboxing Python applications with Docker images
  Getting ready
  How to do it...
  How it works...
  See also
Keeping track of package versions and history in IPython Notebook
  Getting ready
  How to do it...
  How it works...
  See also
Configuring IPython
  Getting ready
  How to do it...
  See also
Learning to log for robust error checking
  Getting ready
  How to do it...
  How it works...
  See also
Unit testing your code
  Getting ready
  How to do it...
  How it works...
  See also
Configuring pandas
  Getting ready
  How to do it...
Configuring matplotlib
  Getting ready
  How to do it...
  How it works...
  See also
Seeding random number generators and NumPy print options
  Getting ready
  How to do it...
  See also
Standardizing reports, code style, and data access
  Getting ready
  How to do it...
  See also

Chapter 2: Creating Attractive Data Visualizations

Introduction
Graphing Anscombe's quartet
  How to do it...
  See also
Choosing seaborn color palettes
  How to do it...
  See also
Choosing matplotlib color maps
  How to do it...
  See also
Interacting with IPython Notebook widgets
  How to do it...
  See also
Viewing a matrix of scatterplots
  How to do it...
Visualizing with d3.js via mpld3
  Getting ready
  How to do it...
Creating heatmaps
  Getting ready
  How to do it...
  See also
Combining box plots and kernel density plots with violin plots
  How to do it...
  See also
Visualizing network graphs with hive plots
  Getting ready
  How to do it...
Displaying geographical maps
  Getting ready
  How to do it...
Using ggplot2-like plots
  Getting ready
  How to do it...
Highlighting data points with influence plots
  How to do it...
  See also

Chapter 3: Statistical Data Analysis and Probability

Introduction
Fitting data to the exponential distribution
  How to do it...
  How it works...
  See also
Fitting aggregated data to the gamma distribution
  How to do it...
  See also
Fitting aggregated counts to the Poisson distribution
  How to do it...
  See also
Determining bias
  How to do it...
  See also
Estimating kernel density
  How to do it...
  See also
Determining confidence intervals for mean, variance, and standard deviation
  How to do it...
  See also
Sampling with probability weights
  How to do it...
  See also
Exploring extreme values
  How to do it...
  See also
Correlating variables with Pearson's correlation
  How to do it...
  See also
Correlating variables with the Spearman rank correlation
  How to do it...
  See also
Correlating a binary and a continuous variable with the point biserial correlation
  How to do it...
  See also
Evaluating relations between variables with ANOVA
  How to do it...
  See also

Chapter 4: Dealing with Data and Numerical Issues

Introduction
Clipping and filtering outliers
  How to do it...
  See also
Winsorizing data
  How to do it...
  See also
Measuring central tendency of noisy data
  How to do it...
  See also
Normalizing with the Box-Cox transformation
  How to do it...
  How it works
  See also
Transforming data with the power ladder
  How to do it...
Transforming data with logarithms
  How to do it...
Rebinning data
  How to do it...
Applying logit() to transform proportions
  How to do it...
Fitting a robust linear model
  How to do it...
  See also
Taking variance into account with weighted least squares
  How to do it...
  See also
Using arbitrary precision for optimization
  Getting ready
  How to do it...
  See also
Using arbitrary precision for linear algebra
  Getting ready
  How to do it...
  See also

Chapter 5: Web Mining, Databases, and Big Data

Introduction
Simulating web browsing
  Getting ready
  How to do it...
  See also
Scraping the Web
  Getting ready
  How to do it...
Dealing with non-ASCII text and HTML entities
  Getting ready
  How to do it...
  See also
Implementing association tables
  Getting ready
  How to do it...
Setting up database migration scripts
  Getting ready
  How to do it...
  See also
Adding a table column to an existing table
  Getting ready
  How to do it...
Adding indices after table creation
  Getting ready
  How to do it...
  How it works...
  See also
Setting up a test web server
  Getting ready
  How to do it...
Implementing a star schema with fact and dimension tables
  How to do it...
  See also
Using HDFS
  Getting ready
  How to do it...
  See also
Setting up Spark
  Getting ready
  How to do it...
  See also
Clustering data with Spark
  Getting ready
  How to do it...
  How it works...
  There's more...
  See also

Chapter 6: Signal Processing and Timeseries

Introduction
Spectral analysis with periodograms
  How to do it...
  See also
Estimating power spectral density with the Welch method
  How to do it...
  See also
Analyzing peaks
  How to do it...
  See also
Measuring phase synchronization
  How to do it...
  See also
Exponential smoothing
  How to do it...
  See also
Evaluating smoothing
  How to do it...
  See also
Using the Lomb-Scargle periodogram
  How to do it...
  See also
Analyzing the frequency spectrum of audio
  How to do it...
  See also
Analyzing signals with the discrete cosine transform
  How to do it...
  See also
Block bootstrapping time series data
  How to do it...
  See also
Moving block bootstrapping time series data
  How to do it...
  See also
Applying the discrete wavelet transform
  Getting started
  How to do it...
  See also

Chapter 7: Selecting Stocks with Financial Data Analysis

Introduction
Computing simple and log returns
  How to do it...
  See also
Ranking stocks with the Sharpe ratio and liquidity
  How to do it...
  See also
Ranking stocks with the Calmar and Sortino ratios
  How to do it...
  See also
Analyzing returns statistics
  How to do it...
Correlating individual stocks with the broader market
  How to do it...
Exploring risk and return
  How to do it...
  See also
Examining the market with the non-parametric runs test
  How to do it...
  See also
Testing for random walks
  How to do it...
  See also
Determining market efficiency with autoregressive models
  How to do it...
  See also
Creating tables for a stock prices database
  How to do it...
Populating the stock prices database
  How to do it...
Optimizing an equal weights two-asset portfolio
  How to do it...
  See also

Chapter 8: Text Mining and Social Network Analysis

Introduction
Creating a categorized corpus
  Getting ready
  How to do it...
  See also
Tokenizing news articles in sentences and words
  Getting ready
  How to do it...
  See also
Stemming, lemmatizing, filtering, and TF-IDF scores
  Getting ready
  How to do it...
  How it works
  See also
Recognizing named entities
  Getting ready
  How to do it...
  How it works
  See also
Extracting topics with non-negative matrix factorization
  How to do it...
  How it works
  See also
Implementing a basic terms database
  How to do it...
  How it works
  See also
Computing social network density
  Getting ready
  How to do it...
  See also
Calculating social network closeness centrality
  Getting ready
  How to do it...
  See also
Determining the betweenness centrality
  Getting ready
  How to do it...
  See also
Estimating the average clustering coefficient
  Getting ready
  How to do it...
  See also
Calculating the assortativity coefficient of a graph
  Getting ready
  How to do it...
  See also
Getting the clique number of a graph
  Getting ready
  How to do it...
  See also
Creating a document graph with cosine similarity
  How to do it...
  See also

Chapter 9: Ensemble Learning and Dimensionality Reduction

Introduction
Recursively eliminating features
  How to do it...
  How it works
  See also
Applying principal component analysis for dimension reduction
  How to do it...
  See also
Applying linear discriminant analysis for dimension reduction
  How to do it...
  See also
Stacking and majority voting for multiple models
  How to do it...
  See also
Learning with random forests
  How to do it...
  There's more...
  See also
Fitting noisy data with the RANSAC algorithm
  How to do it...
  See also
Bagging to improve results
  How to do it...
  See also
Boosting for better learning
  How to do it...
  See also
Nesting cross-validation
  How to do it...
  See also
Reusing models with joblib
  How to do it...
  See also
Hierarchically clustering data
  How to do it...
  See also
Taking a Theano tour
  Getting ready
  How to do it...
  See also

Chapter 10: Evaluating Classifiers, Regressors, and Clusters

Introduction
Getting classification straight with the confusion matrix
  How to do it...
  How it works
  See also
Computing precision, recall, and F1-score
  How to do it...
  See also
Examining a receiver operating characteristic and the area under a curve
  How to do it...
  See also
Visualizing the goodness of fit
  How to do it...
  See also
Computing MSE and median absolute error
  How to do it...
  See also
Evaluating clusters with the mean silhouette coefficient
  How to do it...
  See also
Comparing results with a dummy classifier
  How to do it...
  See also
Determining MAPE and MPE
  How to do it...
  See also
Comparing with a dummy regressor
  How to do it...
  See also
Calculating the mean absolute error and the residual sum of squares
  How to do it...
  See also
Examining the kappa of classification
  How to do it...
  How it works
  See also
Taking a look at the Matthews correlation coefficient
  How to do it...
  See also

Chapter 11: Analyzing Images

Introduction
Setting up OpenCV
  Getting ready
  How to do it...
  How it works
  There's more
Applying Scale-Invariant Feature Transform (SIFT)
  Getting ready
  How to do it...
  See also
Detecting features with SURF
  Getting ready
  How to do it...
  See also
Quantizing colors
  Getting ready
  How to do it...
  See also
Denoising images
  Getting ready
  How to do it...
  See also
Extracting patches from an image
  Getting ready
  How to do it...
  See also
Detecting faces with Haar cascades
  Getting ready
  How to do it...
  See also
Searching for bright stars
  Getting ready