Title Page
Learning Data Mining with Python


Second Edition

Use Python to manipulate data and build predictive models

Robert Layton

BIRMINGHAM - MUMBAI


Copyright


Learning Data Mining with Python
Second Edition
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a
retrieval system, or transmitted in any form or by any means, without the
prior written permission of the publisher, except in the case of brief
quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the
accuracy of the information presented. However, the information contained in
this book is sold without warranty, either express or implied. Neither the
author, nor Packt Publishing, and its dealers and distributors will be held
liable for any damages caused or alleged to be caused directly or indirectly by
this book.
Packt Publishing has endeavored to provide trademark information about all
of the companies and products mentioned in this book by the appropriate use
of capitals. However, Packt Publishing cannot guarantee the accuracy of this
information.
First published: July 2015
Second edition: April 2017
Production reference: 1250417

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78712-678-7
www.packtpub.com


Credits

Author: Robert Layton
Reviewer: Asad Ahamad
Commissioning Editor: Veena Pagare
Acquisition Editor: Divya Poojari
Content Development Editor: Tejas Limkar
Technical Editor: Danish Shaikh
Copy Editor: Vikrant Phadkay
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Tania Dutta
Production Coordinator: Aparna Bhagat





About the Author
Robert Layton is a data scientist investigating data-driven applications to
businesses across a number of sectors. He received a PhD investigating
cybercrime analytics from the Internet Commerce Security Laboratory at
Federation University Australia, before moving into industry, starting his
own data analytics company dataPipeline (www.datapipeline.com.au). Next, he
created Eureaktive (www.eureaktive.com.au), which works with tech-based
startups on developing their proof-of-concepts and early-stage prototypes.
Robert also runs www.learningtensorflow.com, which is one of the world's premier
tutorial websites for Google's TensorFlow library.
Robert is an active member of the Python community, having used Python
for more than 8 years. He has presented at PyConAU for the last four years
and works with Python Charmers to provide Python-based training for
businesses and professionals from a wide range of organisations.
Robert is best reached via Twitter at @robertlayton.
Thank you to my family for supporting me on this journey, thanks to all the
readers of revision 1 for making it a success, and thanks to Matty for his
assistance behind the scenes with the book.


About the Reviewer
Asad Ahamad is a data enthusiast who loves working with data to solve
challenging problems.
He completed his master's in Industrial Mathematics with Computer Application
at Jamia Millia Islamia, New Delhi. He admires mathematics and always tries
to use it to maximize profit for businesses.
He has solid experience in data mining, machine learning, and data science,
and has worked for various multinationals in India. He mainly uses R and
Python for data wrangling and modeling, and is fond of using open source
tools for data analysis.
He is an active social media user. Feel free to connect with him on Twitter
at @asadtaj88.


www.PacktPub.com
For support files and downloads related to your book, please visit
www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with
PDF and ePub files available? You can upgrade to the eBook version at
www.PacktPub.com and, as a print book customer, you are entitled to a
discount on the eBook copy. Get in touch with us for more details.
At www.PacktPub.com, you can also read a collection of free technical articles,
sign up for a range of free newsletters and receive exclusive discounts and
offers on Packt books and eBooks.

Get the most in-demand software skills with Mapt. Mapt gives you full
access to all Packt books and video courses, as well as industry-leading tools
to help you plan your personal development and advance your career.


Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser


Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our
editorial process. To help us improve, please leave us an honest review on
this book's Amazon page.
If you'd like to join our team of regular reviewers, you can e-mail us.
We award our regular reviewers with free eBooks
and videos in exchange for their valuable feedback. Help us be relentless in
improving our products!


Table of Contents
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions


1. Getting Started with Data Mining
Introducing data mining
Using Python and the Jupyter Notebook
Installing Python
Installing Jupyter Notebook
Installing scikit-learn
A simple affinity analysis example
What is affinity analysis?
Product recommendations
Loading the dataset with NumPy
Downloading the example code
Implementing a simple ranking of rules
Ranking to find the best rules
A simple classification example
What is classification?
Loading and preparing the dataset
Implementing the OneR algorithm
Testing the algorithm
Summary

2. Classifying with scikit-learn Estimators
scikit-learn estimators
Nearest neighbors
Distance metrics
Loading the dataset



Moving towards a standard workflow
Running the algorithm
Setting parameters
Preprocessing
Standard pre-processing
Putting it all together
Pipelines
Summary

3. Predicting Sports Winners with Decision Trees
Loading the dataset
Collecting the data
Using pandas to load the dataset
Cleaning up the dataset
Extracting new features
Decision trees
Parameters in decision trees
Using decision trees
Sports outcome prediction
Putting it all together
Random forests
How do ensembles work?
Setting parameters in Random Forests
Applying random forests
Engineering new features
Summary


4. Recommending Movies Using Affinity Analysis
Affinity analysis
Algorithms for affinity analysis
Overall methodology
Dealing with the movie recommendation problem
Obtaining the dataset
Loading with pandas
Sparse data formats
Understanding the Apriori algorithm and its implementation
Looking into the basics of the Apriori algorithm
Implementing the Apriori algorithm
Extracting association rules
Evaluating the association rules


Summary

5. Features and scikit-learn Transformers
Feature extraction
Representing reality in models
Common feature patterns
Creating good features
Feature selection
Selecting the best individual features
Feature creation

Principal Component Analysis
Creating your own transformer
The transformer API
Implementing a Transformer
Unit testing
Putting it all together
Summary

6. Social Media Insight using Naive Bayes
Disambiguation
Downloading data from a social network
Loading and classifying the dataset
Creating a replicable dataset from Twitter
Text transformers
Bag-of-words models
n-gram features
Other text features
Naive Bayes
Understanding Bayes' theorem
Naive Bayes algorithm
How it works
Applying of Naive Bayes
Extracting word counts
Converting dictionaries to a matrix
Putting it all together
Evaluation using the F1-score
Getting useful features from models
Summary


7. Follow Recommendations Using Graph Mining
Loading the dataset


Classifying with an existing model
Getting follower information from Twitter
Building the network
Creating a graph
Creating a similarity graph
Finding subgraphs
Connected components
Optimizing criteria
Summary

8. Beating CAPTCHAs with Neural Networks
Artificial neural networks
An introduction to neural networks
Creating the dataset
Drawing basic CAPTCHAs
Splitting the image into individual letters
Creating a training dataset
Training and classifying
Back-propagation
Predicting words
Improving accuracy using a dictionary

Ranking mechanisms for word similarity
Putting it all together
Summary

9. Authorship Attribution
Attributing documents to authors
Applications and use cases
Authorship attribution
Getting the data
Using function words
Counting function words
Classifying with function words
Support Vector Machines
Classifying with SVMs
Kernels
Character n-grams
Extracting character n-grams
The Enron dataset
Accessing the Enron dataset
Creating a dataset loader


Putting it all together
Evaluation
Summary

10. Clustering News Articles
Trending topic discovery
Using a web API to get data
Reddit as a data source
Getting the data
Extracting text from arbitrary websites
Finding the stories in arbitrary websites
Extracting the content
Grouping news articles
The k-means algorithm
Evaluating the results
Extracting topic information from clusters
Using clustering algorithms as transformers
Clustering ensembles
Evidence accumulation
How it works
Implementation
Online learning
Implementation
Summary

11. Object Detection in Images using Deep Neural Networks
Object classification
Use cases
Application scenario
Deep neural networks
Intuition
Implementing deep neural networks

An Introduction to TensorFlow
Using Keras
Convolutional Neural Networks
GPU optimization
When to use GPUs for computation
Running our code on a GPU
Setting up the environment
Application


Getting the data
Creating the neural network
Putting it all together
Summary

12. Working with Big Data
Big data
Applications of big data
MapReduce
The intuition behind MapReduce
A word count example
Hadoop MapReduce
Applying MapReduce
Getting the data
Naive Bayes prediction
The mrjob package
Extracting the blog posts
Training Naive Bayes

Putting it all together
Training on Amazon's EMR infrastructure
Summary

13. Next Steps...
Getting Started with Data Mining
Scikit-learn tutorials
Extending the Jupyter Notebook
More datasets
Other Evaluation Metrics
More application ideas
Classifying with scikit-learn Estimators
Scalability with the nearest neighbor
More complex pipelines
Comparing classifiers
Automated Learning
Predicting Sports Winners with Decision Trees
More complex features
Dask
Research
Recommending Movies Using Affinity Analysis
New datasets


The Eclat algorithm
Collaborative Filtering
Extracting Features with Transformers
Adding noise

Vowpal Wabbit
word2vec
Social Media Insight Using Naive Bayes
Spam detection
Natural language processing and part-of-speech tagging
Discovering Accounts to Follow Using Graph Mining
More complex algorithms
NetworkX
Beating CAPTCHAs with Neural Networks
Better (worse?) CAPTCHAs
Deeper networks
Reinforcement learning
Authorship Attribution
Increasing the sample size
Blogs dataset
Local n-grams
Clustering News Articles
Clustering Evaluation
Temporal analysis
Real-time clusterings
Classifying Objects in Images Using Deep Learning
Mahotas
Magenta
Working with Big Data
Courses on Hadoop
Pydoop
Recommendation engine
W.I.L.L
More resources
Kaggle competitions

Coursera


Preface
The second revision of Learning Data Mining with Python was written with
the programmer in mind. It aims to introduce data mining to a wide range of
programmers, as I feel that this is critically important to all those in the
computer science field. Data mining is quickly becoming the building block
of the next generation of Artificial Intelligence systems. Even if you don't
find yourself building these systems, you will be using them, interfacing with
them, and being guided by them. Understanding the process behind them is
important and helps you get the best out of them.
The second revision builds upon the first. Many of the chapters and exercises
are similar, although new concepts are introduced and exercises are expanded
in scope. Those who have read the first revision should be able to move
quickly through the book, pick up new knowledge along the way, and engage
with the extra activities proposed. Those new to the book are encouraged to
take their time, do the exercises, and experiment. Feel free to break the code
to understand it, and reach out if you have any questions.
As this is a book aimed at programmers, we assume that you have some
knowledge of programming and of Python itself. For this reason, there is
little explanation of what the Python code itself is doing, except in cases
where it is ambiguous.


What this book covers
Chapter 1, Getting Started with Data Mining, introduces the technologies we
will be using, along with implementing two basic algorithms to get started.

Chapter 2, Classifying with scikit-learn Estimators, covers classification, a
key form of data mining. You’ll also learn about some structures for making
your data mining experimentation easier to perform.

Chapter 3, Predicting Sports Winners with Decision Trees, introduces two new
algorithms, Decision Trees and Random Forests, and uses them to predict
sports winners by creating useful features.

Chapter 4, Recommending Movies Using Affinity Analysis, looks at the problem
of recommending products based on past experience, and introduces the
Apriori algorithm.

Chapter 5, Features and scikit-learn Transformers, introduces more types of
features you can create, and how to work with different datasets.

Chapter 6, Social Media Insight using Naive Bayes, uses the Naive Bayes
algorithm to automatically parse text-based information from the social
media website Twitter.

Chapter 7, Follow Recommendations Using Graph Mining, applies cluster
analysis and network analysis to find good people to follow on social media.

Chapter 8, Beating CAPTCHAs with Neural Networks, looks at extracting
information from images, and then training neural networks to find words
and letters in those images.

Chapter 9, Authorship Attribution, looks at determining who wrote a given
document, by extracting text-based features and using Support Vector
Machines.

Chapter 10, Clustering News Articles, uses the k-means clustering algorithm
to group together news articles based on their content.

Chapter 11, Object Detection in Images using Deep Neural Networks,
determines what type of object is being shown in an image, by applying deep
neural networks.

Chapter 12, Working with Big Data, looks at workflows for applying algorithms
to big data and how to get insight from it.

Appendix, Next Steps, goes through each chapter, giving hints on where to go
next for a deeper understanding of the concepts introduced.


What you need for this book
It should come as no surprise that you’ll need a computer, or access to one, to
complete the book. The computer should be reasonably modern, but it
doesn’t need to be overpowered. Any modern processor (from about 2010
onwards) and 4 gigabytes of RAM will suffice, and you can probably run
almost all of the code on a slower system too.
The exception here is with the final two chapters. In these chapters, I step
through using Amazon’s web services (AWS) for running the code. This will
probably cost you some money, but the advantage is less system setup than
running the code locally. If you don’t want to pay for those services, the tools
used can all be set up on a local computer, but you will definitely need a
modern system to run them. A processor built in 2012 or later and more than
4 GB of RAM are necessary.
I recommend the Ubuntu operating system, but the code should work well on
Windows, Macs, or any other Linux variant. You may need to consult the
documentation for your system to get some things installed though.
In this book, I use pip for installing code, which is a command line tool for
installing Python libraries. Another option is to use Anaconda, which is
available online.
I have also tested all code using Python 3. Most of the code examples work
on Python 2 with no changes. If you run into any problems and can’t get
around them, send an email and we can offer a solution.
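Before starting, it can help to confirm that your interpreter and libraries are in place. The following is a minimal sketch, not part of the book's own code: the helper name check_environment and the package list are assumptions, chosen from the libraries named in the table of contents (NumPy, pandas, and scikit-learn, which is imported as sklearn).

```python
import importlib.util
import sys

# Assumed package list, based on libraries named in the table of contents.
# "sklearn" is the import name of scikit-learn.
PACKAGES = ("numpy", "pandas", "sklearn")


def check_environment(packages=PACKAGES):
    """Report the Python version and whether each package can be imported.

    check_environment is a hypothetical helper written for this sketch.
    """
    report = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "python3": sys.version_info[0] >= 3,
    }
    for name in packages:
        # find_spec returns None when a top-level module cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(key, value)
```

Any package reported as False can then be installed with pip or through Anaconda, as described above.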

