

Machine Learning with Spark
Second Edition
Develop intelligent machine learning systems with Spark 2.x

Rajdeep Dua
Manpreet Singh Ghotra
Nick Pentreath

BIRMINGHAM - MUMBAI


Machine Learning with Spark
Second Edition
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the authors, nor Packt Publishing and its
dealers and distributors, will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2015
Second edition: April 2017
Production reference: 1270417
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78588-993-6
www.packtpub.com


Credits

Authors: Rajdeep Dua, Manpreet Singh Ghotra, Nick Pentreath

Reviewer: Brian O'Neill

Commissioning Editor: Akram Hussain

Acquisition Editor: Tushar Gupta

Content Development Editor: Rohit Kumar Singh

Technical Editors: Nirant Carvalho, Kunal Mali

Copy Editors: Safis Editing, Sonia Mathur

Project Coordinator: Vaidehi Sawant

Proofreader: Safis Editing

Indexer: Francy Puthiry

Production Coordinator: Deepika Naik


About the Authors
Rajdeep Dua has over 16 years of experience in the Cloud and Big Data space. He worked
in the advocacy team for Google's big data tool, BigQuery. He worked on the Greenplum
big data platform at VMware in the developer evangelist team. He also worked closely with
a team on porting Spark to run on VMware's public and private cloud as a feature set. He
has taught Spark and Big Data at some of the most prestigious tech schools in India: IIIT
Hyderabad, ISB, IIIT Delhi, and College of Engineering Pune.

Currently, he leads the developer relations team at Salesforce India. He also works with the
data pipeline team at Salesforce, which uses Hadoop and Spark to expose big data
processing tools for developers.

He has published Big Data and Spark tutorials. He has also presented BigQuery and
Google App Engine at the W3C conference in Hyderabad
(http://wwwconference.org/proceedings/www2011/schedule/www2011_Program.pdf). He
has led the developer relations teams at Google, VMware, and Microsoft, and he has
spoken at hundreds of other conferences on the cloud. His contributions to the open source
community are related to Docker, Kubernetes, Android, OpenStack, and Cloud Foundry.

You can connect with him on LinkedIn.

Manpreet Singh Ghotra has more than 12 years of experience in software development for
both enterprise and big data software. He is currently working on developing a machine
learning platform using Apache Spark at Salesforce. He has worked on a sentiment analyzer
using the Apache stack and machine learning.

He was part of the machine learning group at one of the largest online retailers in the world,
where he worked on transit time calculations and an R-based recommendation system, both
using Apache Mahout.

With a master's and postgraduate degree in machine learning, he has contributed to, and
worked for, the machine learning community.

You can find him on GitHub and connect with him on LinkedIn.
Nick Pentreath has a background in financial markets, machine learning, and software
development. He has worked at Goldman Sachs Group, Inc.; as a research scientist at the
online ad targeting start-up Cognitive Match Limited, London; and led the data science and
analytics team at Mxit, Africa's largest social network.

He is a cofounder of Graphflow, a big data and machine learning company focused on
user-centric recommendations and customer intelligence. He is passionate about combining
commercial focus with machine learning and cutting-edge technology to build intelligent
systems that learn from data to add value to the bottom line.

Nick is a member of the Apache Spark Project Management Committee.


About the Reviewer
Brian O'Neill is the principal architect at Monetate, Inc. Monetate's personalization
platform leverages Spark and machine learning algorithms to process millions of events per
second, leveraging real-time context and analytics to create personalized brand experiences
at scale. Brian is a perennial DataStax Cassandra MVP and has also won InfoWorld's
Technology Leadership award. Previously, he was CTO of Health Market Science (HMS),
now a LexisNexis company. He is a graduate of Brown University and holds patents in
artificial intelligence and data management.

Prior to this publication, Brian authored a book on distributed computing, Storm Blueprints:
Patterns for Distributed Real-time Computation, and contributed to Learning Cassandra for
Administrators.
All the thanks in the world to my wife, Lisa, and my sons, Collin and Owen, for their
understanding, patience, and support. They know all my shortcomings and love me
anyway. Together always and forever, I love you more than you know and more than I will
ever be able to express.


www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a
print book customer, you are entitled to a discount on the eBook copy. Get in touch with us
at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters, and receive exclusive discounts and offers on Packt books and
eBooks.

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt
books and video courses, as well as industry-leading tools to help you plan your personal
development and advance your career.

Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser


Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial
process. To help us improve, please leave us an honest review on this book's Amazon page.

If you'd like to join our team of regular reviewers, you can e-mail us at
customerreviews@packtpub.com. We award our regular reviewers with free eBooks
and videos in exchange for their valuable feedback. Help us be relentless in improving our
products!


Table of Contents

Preface

Chapter 1: Getting Up and Running with Spark
    Installing and setting up Spark locally
        Spark clusters
    The Spark programming model
        SparkContext and SparkConf
        SparkSession
        The Spark shell
        Resilient Distributed Datasets
            Creating RDDs
            Spark operations
            Caching RDDs
        Broadcast variables and accumulators
    SchemaRDD
    Spark data frame
    The first step to a Spark program in Scala
    The first step to a Spark program in Java
    The first step to a Spark program in Python
    The first step to a Spark program in R
        SparkR DataFrames
    Getting Spark running on Amazon EC2
        Launching an EC2 Spark cluster
    Configuring and running Spark on Amazon Elastic Map Reduce
    UI in Spark
    Supported machine learning algorithms by Spark
    Benefits of using Spark ML as compared to existing libraries
    Spark Cluster on Google Compute Engine - DataProc
        Hadoop and Spark Versions
        Creating a Cluster
        Submitting a Job
    Summary

Chapter 2: Math for Machine Learning
    Linear algebra
        Setting up the Scala environment in Intellij
        Setting up the Scala environment on the Command Line
        Fields
            Real numbers
            Complex numbers
        Vectors
            Vector spaces
            Vector types
                Vectors in Breeze
                Vectors in Spark
            Vector operations
            Hyperplanes
            Vectors in machine learning
        Matrix
            Types of matrices
            Matrix in Spark
            Distributed matrix in Spark
            Matrix operations
                Determinant
            Eigenvalues and eigenvectors
            Singular value decomposition
            Matrices in machine learning
    Functions
        Function types
        Functional composition
        Hypothesis
    Gradient descent
    Prior, likelihood, and posterior
    Calculus
        Differential calculus
        Integral calculus
        Lagranges multipliers
    Plotting
    Summary

Chapter 3: Designing a Machine Learning System
    What is Machine Learning?
    Introducing MovieStream
    Business use cases for a machine learning system
        Personalization
        Targeted marketing and customer segmentation
        Predictive modeling and analytics
    Types of machine learning models
    The components of a data-driven machine learning system
        Data ingestion and storage
        Data cleansing and transformation
        Model training and testing loop
        Model deployment and integration
        Model monitoring and feedback
        Batch versus real time
        Data Pipeline in Apache Spark
    An architecture for a machine learning system
    Spark MLlib
    Performance improvements in Spark ML over Spark MLlib
    Comparing algorithms supported by MLlib
        Classification
        Clustering
        Regression
    MLlib supported methods and developer APIs
        Spark Integration
    MLlib vision
    MLlib versions compared
        Spark 1.6 to 2.0
    Summary

Chapter 4: Obtaining, Processing, and Preparing Data with Spark
    Accessing publicly available datasets
        The MovieLens 100k dataset
    Exploring and visualizing your data
        Exploring the user dataset
            Count by occupation
        Movie dataset
        Exploring the rating dataset
            Rating count bar chart
            Distribution of number ratings
    Processing and transforming your data
        Filling in bad or missing data
    Extracting useful features from your data
        Numerical features
        Categorical features
        Derived features
            Transforming timestamps into categorical features
            Extract time of Day
            Extract time of day
        Text features
            Simple text feature extraction
            Sparse Vectors from Titles
        Normalizing features
            Using ML for feature normalization
        Using packages for feature extraction
            TFID
            IDF
            Word2Vector
                Skip-gram model
            Standard scalar
    Summary

Chapter 5: Building a Recommendation Engine with Spark
    Types of recommendation models
        Content-based filtering
        Collaborative filtering
            Matrix factorization
                Explicit matrix factorization
                Implicit Matrix Factorization
                Basic model for Matrix Factorization
                Alternating least squares
    Extracting the right features from your data
        Extracting features from the MovieLens 100k dataset
    Training the recommendation model
        Training a model on the MovieLens 100k dataset
        Training a model using Implicit feedback data
    Using the recommendation model
        ALS Model recommendations
        User recommendations
            Generating movie recommendations from the MovieLens 100k dataset
                Inspecting the recommendations
        Item recommendations
            Generating similar movies for the MovieLens 100k dataset
                Inspecting the similar items
    Evaluating the performance of recommendation models
        ALS Model Evaluation
            Mean Squared Error
            Mean Average Precision at K
        Using MLlib's built-in evaluation functions
            RMSE and MSE
            MAP
    FP-Growth algorithm
        FP-Growth Basic Sample
        FP-Growth Applied to Movie Lens Data
    Summary

Chapter 6: Building a Classification Model with Spark
    Types of classification models
        Linear models
            Logistic regression
                Multinomial logistic regression
            Visualizing the StumbleUpon dataset
            Extracting features from the Kaggle/StumbleUpon evergreen classification dataset
                StumbleUponExecutor
            Linear support vector machines
        The naive Bayes model
        Decision trees
        Ensembles of trees
            Random Forests
            Gradient-Boosted Trees
        Multilayer perceptron classifier
    Extracting the right features from your data
    Training classification models
        Training a classification model on the Kaggle/StumbleUpon evergreen classification dataset
    Using classification models
        Generating predictions for the Kaggle/StumbleUpon evergreen classification dataset
    Evaluating the performance of classification models
        Accuracy and prediction error
        Precision and recall
        ROC curve and AUC
    Improving model performance and tuning parameters
        Feature standardization
        Additional features
        Using the correct form of data
        Tuning model parameters
            Linear models
                Iterations
                Step size
                Regularization
            Decision trees
                Tuning tree depth and impurity
            The naive Bayes model
        Cross-validation
    Summary

Chapter 7: Building a Regression Model with Spark
    Types of regression models
        Least squares regression
        Decision trees for regression
    Evaluating the performance of regression models
        Mean Squared Error and Root Mean Squared Error
        Mean Absolute Error
        Root Mean Squared Log Error
        The R-squared coefficient
    Extracting the right features from your data
        Extracting features from the bike sharing dataset
    Training and using regression models
        BikeSharingExecutor
        Training a regression model on the bike sharing dataset
            Linear regression
            Generalized linear regression
            Decision tree regression
        Ensembles of trees
            Random forest regression
            Gradient boosted tree regression
    Improving model performance and tuning parameters
        Transforming the target variable
            Impact of training on log-transformed targets
        Tuning model parameters
            Creating training and testing sets to evaluate parameters
            Splitting data for Decision tree
            The impact of parameter settings for linear models
                Iterations
                Step size
                L2 regularization
                L1 regularization
                Intercept
            The impact of parameter settings for the decision tree
                Tree depth
                Maximum bins
            The impact of parameter settings for the Gradient Boosted Trees
                Iterations
                MaxBins
    Summary

Chapter 8: Building a Clustering Model with Spark
    Types of clustering models
        k-means clustering
            Initialization methods
        Mixture models
        Hierarchical clustering
    Extracting the right features from your data
        Extracting features from the MovieLens dataset
    K-means - training a clustering model
        Training a clustering model on the MovieLens dataset
    K-means - interpreting cluster predictions on the MovieLens dataset
        Interpreting the movie clusters
    K-means - evaluating the performance of clustering models
        Internal evaluation metrics
        External evaluation metrics
        Computing performance metrics on the MovieLens dataset
    Effect of iterations on WSSSE
    Bisecting KMeans
        Bisecting K-means - training a clustering model
        WSSSE and iterations
    Gaussian Mixture Model
        Clustering using GMM
        Plotting the user and item data with GMM clustering
        GMM - effect of iterations on cluster boundaries
    Summary

Chapter 9: Dimensionality Reduction with Spark
    Types of dimensionality reduction
        Principal components analysis
        Singular value decomposition
        Relationship with matrix factorization
        Clustering as dimensionality reduction
    Extracting the right features from your data
        Extracting features from the LFW dataset
            Exploring the face data
            Visualizing the face data
            Extracting facial images as vectors
                Loading images
                Converting to grayscale and resizing the images
                Extracting feature vectors
            Normalization
    Training a dimensionality reduction model
        Running PCA on the LFW dataset
            Visualizing the Eigenfaces
            Interpreting the Eigenfaces
    Using a dimensionality reduction model
        Projecting data using PCA on the LFW dataset
        The relationship between PCA and SVD
    Evaluating dimensionality reduction models
        Evaluating k for SVD on the LFW dataset
            Singular values
    Summary

Chapter 10: Advanced Text Processing with Spark
    What's so special about text data?
    Extracting the right features from your data
        Term weighting schemes
        Feature hashing
        Extracting the tf-idf features from the 20 Newsgroups dataset
            Exploring the 20 Newsgroups data
            Applying basic tokenization
            Improving our tokenization
            Removing stop words
            Excluding terms based on frequency
            A note about stemming
            Feature Hashing
            Building a tf-idf model
            Analyzing the tf-idf weightings
    Using a tf-idf model
        Document similarity with the 20 Newsgroups dataset and tf-idf features
        Training a text classifier on the 20 Newsgroups dataset using tf-idf
    Evaluating the impact of text processing
        Comparing raw features with processed tf-idf features on the 20 Newsgroups dataset
    Text classification with Spark 2.0
    Word2Vec models
        Word2Vec with Spark MLlib on the 20 Newsgroups dataset
        Word2Vec with Spark ML on the 20 Newsgroups dataset
    Summary

Chapter 11: Real-Time Machine Learning with Spark Streaming
    Online learning
    Stream processing
        An introduction to Spark Streaming
            Input sources
            Transformations
                Keeping track of state
                General transformations
            Actions
            Window operators
        Caching and fault tolerance with Spark Streaming
    Creating a basic streaming application
        The producer application
        Creating a basic streaming application
        Streaming analytics
        Stateful streaming
    Online learning with Spark Streaming
        Streaming regression
        A simple streaming regression program
            Creating a streaming data producer
            Creating a streaming regression model
        Streaming K-means
    Online model evaluation
        Comparing model performance with Spark Streaming
    Structured Streaming
    Summary

Chapter 12: Pipeline APIs for Spark ML
    Introduction to pipelines
        DataFrames
        Pipeline components
            Transformers
            Estimators
    How pipelines work
    Machine learning pipeline with an example
        StumbleUponExecutor
    Summary

Index


Preface
In recent years, the volume of data being collected, stored, and analyzed has exploded, in
particular in relation to activity on the Web and mobile devices, as well as data from the
physical world collected via sensor networks. While large-scale data storage, processing,
analysis, and modeling were previously the domain of the largest institutions, such as
Google, Yahoo!, Facebook, Twitter, and Salesforce, increasingly, many organizations are
being faced with the challenge of how to handle a massive amount of data.
When faced with this quantity of data and the common requirement to utilize it in real time,
human-powered systems quickly become infeasible. This has led to a rise in so-called big
data and machine learning systems that learn from this data to make automated decisions.
In answer to the challenge of dealing with ever larger-scale data without prohibitive cost,
new open source technologies emerged at companies such as Google, Yahoo!, Amazon,
and Facebook, aimed at making it easier to handle massive data volumes by distributing
data storage and computation across a cluster of computers.
The most widespread of these is Apache Hadoop, which made it significantly easier and
cheaper to both store large amounts of data (via the Hadoop Distributed File System, or
HDFS) and run computations on this data (via Hadoop MapReduce, a framework to
perform computation tasks in parallel across many nodes in a computer cluster).
However, MapReduce has some important shortcomings, including high overheads to
launch each job and reliance on storing intermediate data and results of the computation to
disk, both of which make Hadoop relatively ill-suited for use cases of an iterative or
low-latency nature. Apache Spark is a new framework for distributed computing that is
designed from the ground up to be optimized for low-latency tasks and to store
intermediate data and results in memory, thus addressing some of the major drawbacks of
the Hadoop framework. Spark provides a clean, functional, and easy-to-understand API to
write applications, and is fully compatible with the Hadoop ecosystem.
Furthermore, Spark provides native APIs in Scala, Java, Python, and R. The Scala and
Python APIs allow all the benefits of the Scala or Python language, respectively, to be used
directly in Spark applications, including using the relevant interpreter for real-time,
interactive exploration. Spark itself now provides a toolkit (Spark MLlib in 1.6 and Spark
ML in 2.0) of distributed machine learning and data mining models that is under heavy
development and already contains high-quality, scalable, and efficient algorithms for many
common machine learning tasks, some of which we will delve into in this book.
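To give a concrete flavor of this API before Chapter 1 covers installation in detail, the
following is a minimal sketch in Scala. It assumes only a local Spark 2.x setup with the
spark-core and spark-sql dependencies on the classpath; the object name and the tiny
in-line dataset are illustrative placeholders rather than code from the book:

import org.apache.spark.sql.SparkSession

object SparkTaste {
  def main(args: Array[String]): Unit = {
    // Spark 2.x entry point; "local[4]" runs Spark locally with 4 threads.
    val spark = SparkSession.builder()
      .appName("Spark Taste")
      .master("local[4]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A tiny in-memory dataset standing in for data that would normally
    // be loaded from HDFS, S3, or another distributed store.
    val ratings = sc.parallelize(Seq(3.0, 4.5, 5.0, 2.0, 4.0))

    // Transformations are functional and lazy; cache() keeps the
    // intermediate result in memory instead of writing it to disk,
    // which is what makes iterative algorithms fast on Spark.
    val centered = ratings.map(r => r - 3.0).cache()

    println(s"count = ${centered.count()}, mean = ${centered.mean()}")
    spark.stop()
  }
}

The same few lines can be typed almost verbatim into the interactive Scala or Python shells,
which is what makes the real-time, interactive exploration mentioned above practical.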



Applying machine learning techniques to massive datasets is challenging, primarily
because most well-known machine learning algorithms are not designed for parallel
architectures. In many cases, designing such algorithms is not an easy task. The nature of
machine learning models is generally iterative, hence the strong appeal of Spark for this use
case. While there are many competing frameworks for parallel computing, Spark is one of
the few that combines speed, scalability, in-memory processing, and fault tolerance with
ease of programming and a flexible, expressive, and powerful API design.
Throughout this book, we will focus on real-world applications of machine learning
technology. While we may briefly delve into some theoretical aspects of machine learning
algorithms and the required math for machine learning, the book will generally take a
practical, applied approach, with a focus on using examples and code to illustrate how to
effectively use the features of Spark and MLlib, as well as other well-known and freely
available packages for machine learning and data analysis, to create a useful machine
learning system.

What this book covers
Chapter 1, Getting Up and Running with Spark, shows how to install and set up a local
development environment for the Spark framework, as well as how to create a Spark cluster
in the cloud using Amazon EC2. The Spark programming model and API will be
introduced, and a simple Spark application will be created using Scala, Java, and Python.
Chapter 2, Math for Machine Learning, provides a mathematical introduction to machine
learning. Understanding math and many of its techniques is important to get a good grasp
of the inner workings of the algorithms and to get the best results.
Chapter 3, Designing a Machine Learning System, presents an example of a real-world use
case for a machine learning system. We will design a high-level architecture for an
intelligent system in Spark based on this illustrative use case.

Chapter 4, Obtaining, Processing, and Preparing Data with Spark, details how to go about
obtaining data for use in a machine learning system, in particular from various freely and
publicly available sources. We will learn how to process, clean, and transform the raw data
into features that may be used in machine learning models, using available tools, libraries,
and Spark's functionality.
Chapter 5, Building a Recommendation Engine with Spark, deals with creating a
recommendation model based on the collaborative filtering approach. This model will be
used to recommend items to a given user, as well as to create lists of items that are similar
to a given item. Standard metrics to evaluate the performance of a recommendation model
will be covered here.


Chapter 6, Building a Classification Model with Spark, details how to create a model for
binary classification, as well as how to utilize standard performance-evaluation metrics for
classification tasks.
Chapter 7, Building a Regression Model with Spark, shows how to create a model for
regression, extending the classification model created in Chapter 6, Building a Classification
Model with Spark. Evaluation metrics for the performance of regression models will be
detailed here.
Chapter 8, Building a Clustering Model with Spark, explores how to create a clustering model
and how to use related evaluation methodologies. You will learn how to analyze and
visualize the clusters that are generated.

Chapter 9, Dimensionality Reduction with Spark, takes us through methods to extract the
underlying structure from, and reduce the dimensionality of, our data. You will learn some
common dimensionality-reduction techniques and how to apply and analyze them. You
will also see how to use the resulting data representation as an input to another machine
learning model.
Chapter 10, Advanced Text Processing with Spark, introduces approaches to deal with
large-scale text data, including techniques for feature extraction from text and dealing with
the very high-dimensional features typical of text data.

Chapter 11, Real-Time Machine Learning with Spark Streaming, provides an overview of Spark
Streaming and how it fits in with the online and incremental learning approaches to
applying machine learning on data streams.

Chapter 12, Pipeline APIs for Spark ML, introduces the uniform set of APIs, built on top of
DataFrames, that helps the user create and tune machine learning pipelines.

What you need for this book
Throughout this book, we assume that you have some basic experience with programming
in Scala or Python and have some basic knowledge of machine learning, statistics, and data
analysis.


Who this book is for
This book is aimed at entry-level to intermediate data scientists, data analysts, software
engineers, and practitioners involved in machine learning or data mining with an interest in
large-scale machine learning approaches, but who are not necessarily familiar with Spark.
You may have some experience of statistics or machine learning software (perhaps
including MATLAB, scikit-learn, Mahout, R, Weka, and so on) or distributed systems
(including some exposure to Hadoop).

Conventions
In this book, you will find a number of text styles that distinguish between different kinds
of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Spark
places user scripts to run Spark in the bin directory."
A block of code is set as follows:
val conf = new SparkConf()
  .setAppName("Test Spark App")
  .setMaster("local[4]")
val sc = new SparkContext(conf)

Any command-line input or output is written as follows:
>tar xfvz spark-2.1.0-bin-hadoop2.7.tgz
>cd spark-2.1.0-bin-hadoop2.7

New terms and important words are shown in bold. Words that you see on the screen, for
example, in menus or dialog boxes, appear in the text like this: "These can be obtained from
the AWS homepage by clicking Account | Security Credentials | Access Credentials."
Warnings or important notes appear in a box like this.

Tips and tricks appear like this.


Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book: what you liked or disliked. Reader feedback is important for us as it helps us develop
titles that you will really get the most out of. To send us general feedback, simply
e-mail feedback@packtpub.com, and mention the book's title in the subject of your
message. If there is a topic that you have expertise in and you are interested in either
writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.

Downloading the example code
You can download the example code files for this book from your account at
http://www.packtpub.com. If you purchased this book elsewhere, you can visit
http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux


The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/Machine-Learning-with-Spark-Second-Edition. We
also have other code bundles from our rich catalog of books and videos available at
https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used
in this book. The color images will help you better understand the changes in the output.
You can download this file from
https://www.packtpub.com/sites/default/files/downloads/MachineLearningwithSparkSecondEdition_ColorImages.pdf.

Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books (maybe a mistake in the text or the code),
we would be grateful if you could report this to us. By doing so, you can save other readers
from frustration and help us improve subsequent versions of this book. If you find any
errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting
your book, clicking on the Errata Submission Form link, and entering the details of your
errata. Once your errata are verified, your submission will be accepted and the errata will
be uploaded to our website or added to any list of existing errata under the Errata section of
that title.

To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the
search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy.


Please contact us at copyright@packtpub.com with a link to the suspected pirated
material.
We appreciate your help in protecting our authors and our ability to bring you valuable
content.

Questions
If you have a problem with any aspect of this book, you can contact us
at questions@packtpub.com, and we will do our best to address the problem.
