Statistical Application Development
with R and Python - Second Edition


Table of Contents
Statistical Application Development with R and Python - Second Edition
Credits
About the Author
Acknowledgment
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Data Characteristics
Questionnaire and its components
Understanding the data characteristics in an R environment
Experiments with uncertainty in computer science
Installing and setting up R
Using R packages
RSADBE – the book's R package
Python installation and setup
Using pip for packages
IDEs for R and Python
The companion code bundle
Discrete distributions
Discrete uniform distribution
Binomial distribution
Hypergeometric distribution
Negative binomial distribution
Poisson distribution
Continuous distributions
Uniform distribution
Exponential distribution
Normal distribution
Summary
2. Import/Export Data
Packages and settings – R and Python
Understanding data.frame and other formats
Constants, vectors, and matrices
Time for action – understanding constants, vectors, and basic arithmetic
What just happened?
Doing it in Python
Time for action – matrix computations
What just happened?
Doing it in Python
The list object
Time for action – creating a list object
What just happened?
The data.frame object
Time for action – creating a data.frame object
What just happened?
Have a go hero
The table object
Time for action – creating the Titanic dataset as a table object
What just happened?
Have a go hero
Using utils and the foreign packages
Time for action – importing data from external files
What just happened?
Doing it in Python
Importing data from MySQL
Doing it in Python
Exporting data/graphs
Exporting R objects
Exporting graphs
Time for action – exporting a graph
What just happened?
Managing R sessions
Time for action – session management
What just happened?
Doing it in Python
Pop quiz
Summary
3. Data Visualization
Packages and settings – R and Python
Visualization techniques for categorical data
Bar chart
Going through the built-in examples of R
Time for action – bar charts in R
What just happened?
Doing it in Python
Have a go hero
Dot chart
Time for action – dot charts in R
What just happened?
Doing it in Python
Spine and mosaic plots
Time for action – spine plot for the shift and operator data
What just happened?
Time for action – mosaic plot for the Titanic dataset
What just happened?
Pie chart and the fourfold plot
Visualization techniques for continuous variable data
Boxplot
Time for action – using the boxplot
What just happened?
Doing it in Python
Histogram
Time for action – understanding the effectiveness of histograms
What just happened?
Doing it in Python
Have a go hero
Scatter plot
Time for action – plot and pairs R functions
What just happened?
Doing it in Python
Have a go hero
Pareto chart
A brief peek at ggplot2
Time for action – qplot
What just happened?
Time for action – ggplot
What just happened?
Pop quiz
Summary
4. Exploratory Analysis
Packages and settings – R and Python
Essential summary statistics
Percentiles, quantiles, and median
Hinges
Interquartile range
Time for action – the essential summary statistics for The Wall dataset
What just happened?
Techniques for exploratory analysis
The stem-and-leaf plot
Time for action – the stem function in play
What just happened?
Letter values
Data re-expression
Have a go hero
Bagplot – a bivariate boxplot
Time for action – the bagplot display for multivariate datasets
What just happened?
Resistant line
Time for action – resistant line as a first regression model
What just happened?
Smoothing data
Time for action – smoothening the cow temperature data
What just happened?
Median polish
Time for action – the median polish algorithm
What just happened?
Have a go hero
Summary
5. Statistical Inference
Packages and settings – R and Python
Maximum likelihood estimator
Visualizing the likelihood function
Time for action – visualizing the likelihood function
What just happened?
Doing it in Python
Finding the maximum likelihood estimator
Using the fitdistr function
Time for action – finding the MLE using mle and fitdistr functions
What just happened?
Confidence intervals
Time for action – confidence intervals
What just happened?
Doing it in Python
Hypothesis testing
Binomial test
Time for action – testing probability of success
What just happened?
Tests of proportions and the chi-square test
Time for action – testing proportions
What just happened?
Tests based on normal distribution – one sample
Time for action – testing one-sample hypotheses
What just happened?
Have a go hero
Tests based on normal distribution – two sample
Time for action – testing two-sample hypotheses
What just happened?
Have a go hero
Doing it in Python
Summary
6. Linear Regression Analysis
Packages and settings - R and Python
The essence of regression
The simple linear regression model
What happens to the arbitrary choice of parameters?
Time for action - the arbitrary choice of parameters
What just happened?
Building a simple linear regression model
Time for action - building a simple linear regression model
What just happened?
Have a go hero
ANOVA and the confidence intervals
Time for action - ANOVA and the confidence intervals
What just happened?
Model validation
Time for action - residual plots for model validation
What just happened?
Doing it in Python
Have a go hero
Multiple linear regression model
Averaging k simple linear regression models or a multiple linear regression model
Time for action - averaging k simple linear regression models
What just happened?
Building a multiple linear regression model
Time for action - building a multiple linear regression model
What just happened?
The ANOVA and confidence intervals for the multiple linear regression model
Time for action - the ANOVA and confidence intervals for the multiple linear regression model
What just happened?
Have a go hero
Useful residual plots
Time for action - residual plots for the multiple linear regression model
What just happened?
Regression diagnostics
Leverage points
Influential points
DFFITS and DFBETAS
The multicollinearity problem
Time for action - addressing the multicollinearity problem for the gasoline data
What just happened?
Doing it in Python
Model selection
Stepwise procedures
The backward elimination
The forward selection
The stepwise regression
Criterion-based procedures
Time for action - model selection using the backward, forward, and AIC criteria
What just happened?
Have a go hero
Summary
7. Logistic Regression Model
Packages and settings – R and Python
The binary regression problem
Time for action – limitation of linear regression model
What just happened?
Probit regression model
Time for action – understanding the constants
What just happened?
Doing it in Python
Logistic regression model
Time for action – fitting the logistic regression model
What just happened?
Doing it in Python
Hosmer-Lemeshow goodness-of-fit test statistic
Time for action – Hosmer-Lemeshow goodness-of-fit statistic
What just happened?
Model validation and diagnostics
Residual plots for the GLM
Time for action – residual plots for logistic regression model
What just happened?
Doing it in Python
Have a go hero
Influence and leverage for the GLM
Time for action – diagnostics for the logistic regression
What just happened?
Have a go hero
Receiving operator curves
Time for action – ROC construction
What just happened?
Doing it in Python
Logistic regression for the German credit screening dataset
Time for action – logistic regression for the German credit dataset
What just happened?
Doing it in Python
Have a go hero
Summary
8. Regression Models with Regularization
Packages and settings – R and Python
The overfitting problem
Time for action – understanding overfitting
What just happened?
Doing it in Python
Have a go hero
Regression spline
Basis functions
Piecewise linear regression model
Time for action – fitting piecewise linear regression models
What just happened?
Natural cubic splines and the general B-splines
Time for action – fitting the spline regression models
What just happened?
Ridge regression for linear models
Protecting against overfitting
Time for action – ridge regression for the linear regression model
What just happened?
Doing it in Python
Ridge regression for logistic regression models
Time for action – ridge regression for the logistic regression model
What just happened?
Another look at model assessment
Time for action – selecting iteratively and other topics
What just happened?
Pop quiz
Summary
9. Classification and Regression Trees
Packages and settings – R and Python
Understanding recursive partitions
Time for action – partitioning the display plot
What just happened?
Splitting the data
The first tree
Time for action – building our first tree
What just happened?
Constructing a regression tree
Time for action – the construction of a regression tree
What just happened?
Constructing a classification tree
Time for action – the construction of a classification tree
What just happened?
Doing it in Python
Classification tree for the German credit data
Time for action – the construction of a classification tree
What just happened?
Doing it in Python
Have a go hero
Pruning and other finer aspects of a tree
Time for action – pruning a classification tree
What just happened?
Pop quiz
Summary
10. CART and Beyond
Packages and settings – R and Python
Improving the CART
Time for action – cross-validation predictions
What just happened?
Understanding bagging
The bootstrap
Time for action – understanding the bootstrap technique
What just happened?
How the bagging algorithm works
Time for action – the bagging algorithm
What just happened?
Doing it in Python
Random forests
Time for action – random forests for the German credit data
What just happened?
Doing it in Python
The consolidation
Time for action – random forests for the low birth weight data
What just happened?
Summary
Index


Statistical Application Development
with R and Python - Second Edition
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a
retrieval system, or transmitted in any form or by any means, without the
prior written permission of the publisher, except in the case of brief
quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the
accuracy of the information presented. However, the information contained in
this book is sold without warranty, either express or implied. Neither the
author, nor Packt Publishing, and its dealers and distributors will be held
liable for any damages caused or alleged to be caused directly or indirectly by
this book.
Packt Publishing has endeavored to provide trademark information about all
of the companies and products mentioned in this book by the appropriate use
of capitals. However, Packt Publishing cannot guarantee the accuracy of this
information.
First published: July 2013
Second edition: August 2017
Production reference: 1290817
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.


ISBN 978-1-78862-119-9
www.packtpub.com


Credits
Author
Prabhanjan Narayanachar Tattar
Reviewers
Dr. Ratnadip Adhikari
Ajay Ohri
Abhinav Rai
Commissioning Editor
Adarsh Ranjan
Acquisition Editor
Tushar Gupta
Content Development Editor
Snehal Kolte
Technical Editor
Dharmendra Yadav
Copy Editor
Safis Editing
Project Coordinator
Manthan Patel
Proofreader
Safis Editing
Indexer
Tejal Daruwale Soni
Graphics
Tania Dutta
Production Coordinator
Nilesh Mohite
Cover Work
Nilesh Mohite


About the Author
Prabhanjan Narayanachar Tattar has a combined twelve years of
experience with R and Python software. He has also authored the books A
Course in Statistics with R, Wiley, and Practical Data Science Cookbook,
Packt. The author has built three packages in R titled gpk, RSADBE, and
ACSWR. He obtained a PhD (statistics) from Bangalore University in
the broad area of survival analysis and has published several articles in
peer-reviewed journals. During the PhD program, the author received Young
Statistician honors, namely the IBS(IR)-GK Shukla Young Biometrician Award
(2005) and the Dr. U.S. Nair Award for Young Statistician (2007). He also
held a Junior and Senior Research Fellowship at CSIR-UGC.
Prabhanjan has worked in various positions in the analytical industry and
has nearly 10 years of experience in using statistical and machine learning
techniques.


Acknowledgment
I would like to thank the readers and reviewers of the first edition; it is
thanks to their constructive criticism that a second edition has been possible.
The R and Python open source community deserves huge applause for making the
software so complete that using it is almost akin to rubbing a magical lamp.
I continue to express my gratitude to all the people mentioned in the previous
edition. My family has been at the forefront as always in extending their
cooperation, and whenever I am working on a book, they understand that the
weekends will have to be spent on the idiot box.
Profs. D. D. Pawar and V. A. Jadhav were my first two Statistics teachers, and
I learnt my first craft from them during 1996-99 at the Department of Statistics,
Science College, Nanded. Prof. Pawar has been very kind and generous
towards me and invited me in March 2015 to deliver some R talks from the first
edition. Even 20 years later they are the flag-bearers of the subject in the
Marathwada region, and it is with profound love and affection that I express
my gratitude to both of them. Thank you a lot, sirs.
It was a mere formal dinner meeting with Tushar Gupta in Chennai a month
ago, and we thought of bringing out the second edition. We were both convinced
that if we worked in sync and ran the publication processes in parallel, we
would finish this task within a month. It has been a roller-coaster ride with
Menka Bohra, Snehal Kolte, and Dharmendra Yadav, and the book is a finished
product in record time. My special thanks to this wonderful Packt team.


About the Reviewers
Dr. Ratnadip Adhikari received his B.Sc degree with Mathematics Honors
from Assam University, India, in 2004 and M.Sc in applied mathematics
from Indian Institute of Technology, Roorkee, in 2006. After that he obtained
M.Tech in Computer Science and Technology and Ph.D. in Computer
Science, both from Jawaharlal Nehru University, New Delhi, India, in 2009
and 2014, respectively.
He worked as an Assistant Professor in the Computer Science &
Engineering (CSE) Dept. of the LNM Institute of Information
Technology (LNMIIT), Jaipur, Rajasthan, India. At present, he works as a
Senior Data Scientist at Fractal Analytics, Bangalore, India. His primary
research interests include pattern recognition, time series forecasting, data
stream classification, and hybrid modeling. The research works of Dr.
Adhikari have been published in various reputed international journals and at
conferences. He has attended a number of conferences and workshops
throughout his academic career.
Ajay Ohri is the founder of Decisionstats.com and has 14 years of work
experience as a data scientist. He advises multiple startups in analytics
offshoring, analytics services, and analytics education, as well as in using social
media to enhance buzz for analytics products. Mr. Ohri's research interests
include spreading open source analytics, analyzing social media manipulation
with mechanism design, simpler interfaces for cloud computing, and investigating
climate change and knowledge flows.
He founded Decisionstats.com in 2007, a blog that has gathered more than
100,000 views annually for the past 7 years.
His other books include R for Business Analytics (Springer 2012), R for
Cloud Computing (Springer 2014), and Python for R Users (Wiley 2017).
Abhinav Rai has been working as a Data Scientist for nearly a decade and is
currently working at Microsoft. He has experience working in telecom, retail
marketing, and online advertisement. His areas of interest include the
evolving techniques of machine learning and the associated technologies. He
is especially interested in analyzing very large datasets and
likes to generate deep insights in such scenarios. Academically, he holds a
double master's degree: in Mathematics from Deendayal Upadhyay
Gorakhpur University with an NBHM scholarship, and in Computer Science
from the Indian Statistical Institute. Rigor and sophistication are a surety in his
analytical deliveries.


www.PacktPub.com


eBooks, discount offers, and more
Did you know that Packt offers eBook versions of every book published, with
PDF and ePub files available? You can upgrade to the eBook version at
www.PacktPub.com and as a print book customer, you are entitled to a
discount on the eBook copy. Get in touch with us for more details.
At www.PacktPub.com, you can also read a collection of free technical
articles, sign up for a range of free newsletters and receive exclusive
discounts and offers on Packt books and eBooks.

Get the most in-demand software skills with Mapt. Mapt gives you full
access to all Packt books and video courses, as well as industry-leading tools
to help you plan your personal development and advance your career.


Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser


Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our
editorial process. To help us improve, please leave us an honest review on
this book's Amazon page.
If you'd like to join our team of regular reviewers, you can e-mail us.
We award our regular reviewers with free
eBooks and videos in exchange for their valuable feedback. Help us be
relentless in improving our products!

