Hands-On Data Science and Python Machine Learning
Perform data mining and machine learning efficiently using Python
and Spark
Frank Kane
BIRMINGHAM - MUMBAI
Hands-On Data Science and Python
Machine Learning
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a
retrieval system, or transmitted in any form or by any means, without the
prior written permission of the publisher, except in the case of brief
quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the
accuracy of the information presented. However, the information contained in
this book is sold without warranty, either express or implied. Neither the
author, nor Packt Publishing, and its dealers and distributors will be held
liable for any damages caused or alleged to be caused directly or indirectly by
this book.
Packt Publishing has endeavored to provide trademark information about all
of the companies and products mentioned in this book by the appropriate use
of capitals. However, Packt Publishing cannot guarantee the accuracy of this
information.
First published: July 2017
Production reference: 1300717
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78728-074-8
www.packtpub.com
Credits
Author: Frank Kane
Proofreader: Safis Editing
Acquisition Editor: Ben Renow-Clarke
Indexer: Tejal Daruwale Soni
Content Development Editor: Khushali Bhangde
Graphics: Jason Monteiro
Technical Editor: Nidhisha Shetty
Production Coordinator: Arvindkumar Gupta
Copy Editor: Tom Jacob
About the Author
My name is Frank Kane. I spent nine years at amazon.com and imdb.com,
wrangling millions of customer ratings and customer transactions to produce
things such as personalized recommendations for movies and products and
"people who bought this also bought." I tell you, I wish we had Apache Spark
back then, when I spent years trying to solve these problems there. I hold 17
issued patents in the fields of distributed computing, data mining, and
machine learning. In 2012, I left to start my own successful company,
Sundog Software, which focuses on virtual reality environment technology,
and teaching others about big data analysis.
www.PacktPub.com
For support files and downloads related to your book, please visit
www.PacktPub.com. Did you know that Packt offers eBook versions of every book
published, with PDF and ePub files available? You can upgrade to the eBook
version at www.PacktPub.com and, as a print book customer, you are entitled to
a discount on the eBook copy. Get in touch with us for more details.
At www.PacktPub.com, you can also read a collection of free technical articles,
sign up for a range of free newsletters and receive exclusive discounts and
offers on Packt books and eBooks.
Get the most in-demand software skills with Mapt. Mapt gives you full
access to all Packt books and video courses, as well as industry-leading tools
to help you plan your personal development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our
editorial process. To help us improve, please leave us an honest review on
this book's Amazon page.
If you'd like to join our team of regular reviewers, you can email us.
We reward our regular reviewers with free eBooks
and videos in exchange for their valuable feedback. Help us be relentless in
improving our products!
Table of Contents
Preface
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Getting Started
Installing Enthought Canopy
Giving the installation a test run
If you occasionally get problems opening your IPYNB files
Using and understanding IPython (Jupyter) Notebooks
Python basics - Part 1
Understanding Python code
Importing modules
Data structures
Experimenting with lists
Pre colon
Post colon
Negative syntax
Adding list to list
The append function
Complex data structures
Dereferencing a single element
The sort function
Reverse sort
Tuples
Dereferencing an element
List of tuples
Dictionaries
Iterating through entries
Python basics - Part 2
Functions in Python
Lambda functions - functional programming
Understanding boolean expressions
The if statement
The if-else loop
Looping
The while loop
Exploring activity
Running Python scripts
More options than just the IPython/Jupyter Notebook
Running Python scripts in command prompt
Using the Canopy IDE
Summary
2. Statistics and Probability Refresher, and Python Practice
Types of data
Numerical data
Discrete data
Continuous data
Categorical data
Ordinal data
Mean, median, and mode
Mean
Median
The factor of outliers
Mode
Using mean, median, and mode in Python
Calculating mean using the NumPy package
Visualizing data using matplotlib
Calculating median using the NumPy package
Analyzing the effect of outliers
Calculating mode using the SciPy package
Some exercises
Standard deviation and variance
Variance
Measuring variance
Standard deviation
Identifying outliers with standard deviation
Population variance versus sample variance
The Mathematical explanation
Analyzing standard deviation and variance on a histogram
Using Python to compute standard deviation and variance
Try it yourself
Probability density function and probability mass function
The probability density function and probability mass functions
Probability density functions
Probability mass functions
Types of data distributions
Uniform distribution
Normal or Gaussian distribution
The exponential probability distribution or Power law
Binomial probability mass function
Poisson probability mass function
Percentiles and moments
Percentiles
Quartiles
Computing percentiles in Python
Moments
Computing moments in Python
Summary
3. Matplotlib and Advanced Probability Concepts
A crash course in Matplotlib
Generating multiple plots on one graph
Saving graphs as images
Adjusting the axes
Adding a grid
Changing line types and colors
Labeling axes and adding a legend
A fun example
Generating pie charts
Generating bar charts
Generating scatter plots
Generating histograms
Generating box-and-whisker plots
Try it yourself
Covariance and correlation
Defining the concepts
Measuring covariance
Correlation
Computing covariance and correlation in Python
Computing correlation – The hard way
Computing correlation – The NumPy way
Correlation activity
Conditional probability
Conditional probability exercises in Python
Conditional probability assignment
My assignment solution
Bayes' theorem
Summary
4. Predictive Models
Linear regression
The ordinary least squares technique
The gradient descent technique
The coefficient of determination or r-squared
Computing r-squared
Interpreting r-squared
Computing linear regression and r-squared using Python
Activity for linear regression
Polynomial regression
Implementing polynomial regression using NumPy
Computing the r-squared error
Activity for polynomial regression
Multivariate regression and predicting car prices
Multivariate regression using Python
Activity for multivariate regression
Multi-level models
Summary
5. Machine Learning with Python
Machine learning and train/test
Unsupervised learning
Supervised learning
Evaluating supervised learning
K-fold cross validation
Using train/test to prevent overfitting of a polynomial regression
Activity
Bayesian methods - Concepts
Implementing a spam classifier with Naïve Bayes
Activity
K-Means clustering
Limitations to k-means clustering
Clustering people based on income and age
Activity
Measuring entropy
Decision trees - Concepts
Decision tree example
Walking through a decision tree
Random forests technique
Decision trees - Predicting hiring decisions using Python
Ensemble learning – Using a random forest
Activity
Ensemble learning
Support vector machine overview
Using SVM to cluster people by using scikit-learn
Activity
Summary
6. Recommender Systems
What are recommender systems?
User-based collaborative filtering
Limitations of user-based collaborative filtering
Item-based collaborative filtering
Understanding item-based collaborative filtering
How item-based collaborative filtering works?
Collaborative filtering using Python
Finding movie similarities
Understanding the code
The corrwith function
Improving the results of movie similarities
Making movie recommendations to people
Understanding movie recommendations with an example
Using the groupby command to combine rows
Removing entries with the drop command
Improving the recommendation results
Summary
7. More Data Mining and Machine Learning Techniques
K-nearest neighbors - concepts
Using KNN to predict a rating for a movie
Activity
Dimensionality reduction and principal component analysis
Dimensionality reduction
Principal component analysis
A PCA example with the Iris dataset
Activity
Data warehousing overview
ETL versus ELT
Reinforcement learning
Q-learning
The exploration problem
The simple approach
The better way
Fancy words
Markov decision process
Dynamic programming
Summary
8. Dealing with Real-World Data
Bias/variance trade-off
K-fold cross-validation to avoid overfitting
Example of k-fold cross-validation using scikit-learn
Data cleaning and normalisation
Cleaning web log data
Applying a regular expression on the web log
Modification one - filtering the request field
Modification two - filtering post requests
Modification three - checking the user agents
Filtering the activity of spiders/robots
Modification four - applying website-specific filters
Activity for web log data
Normalizing numerical data
Detecting outliers
Dealing with outliers
Activity for outliers
Summary
9. Apache Spark - Machine Learning on Big Data
Installing Spark
Installing Spark on Windows
Installing Spark on other operating systems
Installing the Java Development Kit
Installing Spark
Spark introduction
It's scalable
It's fast
It's young
It's not difficult
Components of Spark
Python versus Scala for Spark
Spark and Resilient Distributed Datasets (RDD)
The SparkContext object
Creating RDDs
Creating an RDD using a Python list
Loading an RDD from a text file
More ways to create RDDs
RDD operations
Transformations
Using map()
Actions
Introducing MLlib
Some MLlib Capabilities
Special MLlib data types
The vector data type
LabeledPoint data type
Rating data type
Decision Trees in Spark with MLlib
Exploring decision trees code
Creating the SparkContext
Importing and cleaning our data
Creating a test candidate and building our decision tree
Running the script
K-Means Clustering in Spark
Within set sum of squared errors (WSSSE)
Running the code
TF-IDF
TF-IDF in practice
Using TF-IDF
Searching Wikipedia with Spark MLlib
Import statements
Creating the initial RDD
Creating and transforming a HashingTF object
Computing the TF-IDF score
Using the Wikipedia search engine algorithm
Running the algorithm
Using the Spark 2.0 DataFrame API for MLlib
How Spark 2.0 MLlib works
Implementing linear regression
Summary
10. Testing and Experimental Design
A/B testing concepts
A/B tests
Measuring conversion for A/B testing
How to attribute conversions
Variance is your enemy
T-test and p-value
The t-statistic or t-test
The p-value
Measuring t-statistics and p-values using Python
Running A/B test on some experimental data
When there's no real difference between the two groups
Does the sample size make a difference?
Sample size increased to six-digits
Sample size increased to seven-digits
A/A testing
Determining how long to run an experiment for
A/B test gotchas
Novelty effects
Seasonal effects
Selection bias
Auditing selection bias issues
Data pollution
Attribution errors
Summary
Preface
Being a data scientist in the tech industry is one of the most rewarding
careers on the planet today. I studied actual job descriptions for data
scientist roles at tech companies and distilled those requirements down into
the topics that you'll see in this book.
Hands-On Data Science and Python Machine Learning is really
comprehensive. We'll start with a crash course on Python and a review of
some basic statistics and probability, and then we'll dive right into
over 60 topics in data mining and machine learning. These include
Bayes' theorem, clustering, decision trees, regression analysis, and
experimental design; we'll look at them all. Some of these topics are really
fun.
We're going to develop an actual movie recommendation system using real
user movie rating data. We're going to create a search engine that actually
works for Wikipedia data. We're going to build a spam classifier that can
correctly classify spam and nonspam emails in your email account, and we
also have a whole section on scaling this work up to a cluster that runs on big
data using Apache Spark.
If you're a software developer or programmer looking to transition into a
career in data science, this book will teach you the hottest skills without all
the mathematical notation and pretense that come along with these topics.
We're just going to explain these concepts and show you some working Python
code that you can dive into and mess around with to make those concepts sink
in. If you're working as a data analyst in the finance industry, this book can
also help you make the transition into the tech industry. All you need is some
prior experience in programming or scripting and you should be good to go.
The general format of this book is as follows. I'll start with each concept,
explaining it in a bunch of sections and graphical examples. I will introduce
you to some of the notation and fancy terminology that data scientists like to
use so you can talk the same language, but the concepts themselves are
generally pretty simple. After that, I'll throw you into some working Python
code that we can run and mess around with, and that will show you how to apply
these ideas to real data. The code is presented as IPython Notebook files, a
format where I can intermix code with notes that explain what's going on. You
can take these notebook files with you after going through this book and use
them as a handy quick reference later on in your career. At the end of each
concept, I'll encourage you to dive into that Python code, make some
modifications, mess around with it, and gain more familiarity by getting
hands-on and seeing the effects your changes have.
Who this book is for
If you are a budding data scientist or a data analyst who wants to analyze and
gain actionable insights from data using Python, this book is for you.
Programmers with some experience in Python who want to enter the lucrative
world of data science will also find this book very useful.
Conventions
In this book, you will find a number of text styles that distinguish between
different kinds of information. Here are some examples of these styles and an
explanation of their meaning.
Code words in text, database table names, folder names, filenames, file
extensions, pathnames, dummy URLs, user input, and Twitter handles are
shown as follows: "We can measure that using the r2_score() function from
sklearn.metrics."
A block of code is set as follows:
import numpy as np
import pandas as pd
from sklearn import tree
input_file = "c:/spark/DataScience/PastHires.csv"
df = pd.read_csv(input_file, header = 0)
When we wish to draw your attention to a particular part of a code block, the
relevant lines or items are set in bold:
import numpy as np
import pandas as pd
from sklearn import tree
input_file = "c:/spark/DataScience/PastHires.csv"
df = pd.read_csv(input_file, header = 0)
Any command-line input or output is written as follows:
spark-submit SparkKMeans.py
New terms and important words are shown in bold. Words that you see on the
screen, for example, in menus or dialog boxes, appear in the text like this:
"On Windows 10, you'll need to open up the Start menu and go to Windows
System | Control Panel to open up Control Panel."
Warnings or important notes appear like this.