

Scala and Spark for Big Data Analytics

Tame big data with Scala and Apache Spark!

Md. Rezaul Karim
Sridhar Alla

BIRMINGHAM - MUMBAI



Scala and Spark for Big Data Analytics
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the publisher, except
in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information
presented. However, the information contained in this book is sold without warranty, either express
or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held
liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and
products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot
guarantee the accuracy of this information.
First published: July 2017
Production reference: 1210717

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham


B3 2PB, UK.

ISBN 978-1-78528-084-9

www.packtpub.com


Credits

Authors: Md. Rezaul Karim, Sridhar Alla
Reviewers: Andrea Bessi, Sumit Pal
Commissioning Editor: Aaron Lazar
Acquisition Editor: Nitin Dasan
Content Development Editor: Vikas Tiwari
Technical Editor: Subhalaxmi Nadar
Copy Editor: Safis Editing
Project Coordinator: Ulhas Kambali
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Coordinator: Melwyn Dsa
Cover Work: Melwyn Dsa



About the Authors
Md. Rezaul Karim is a research scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at
RWTH Aachen University, Aachen, Germany. He holds a BSc and an MSc in computer science.
Before joining Fraunhofer FIT, he worked as a researcher at the Insight Centre for Data
Analytics, Ireland. Previously, he worked as a lead engineer with Samsung Electronics' distributed
R&D centers in Korea, India, Vietnam, Turkey, and Bangladesh. Earlier, he worked as a research
assistant in the Database Lab at Kyung Hee University, Korea, and as an R&D engineer with
BMTech21 Worldwide, Korea. Even before that, he worked as a software engineer with
i2SoftTechnology, Dhaka, Bangladesh.
He has more than 8 years of experience in research and development, with a solid
knowledge of algorithms and data structures in C/C++, Java, Scala, R, and Python; big data
technologies such as Spark, Kafka, DC/OS, Docker, Mesos, Zeppelin, Hadoop, and MapReduce; and deep
learning technologies such as TensorFlow, DeepLearning4j, and H2O Sparkling Water. His research
interests include machine learning, deep learning, semantic web, linked data, big data, and
bioinformatics. He is the author of the following book titles with Packt:
Large-Scale Machine Learning with Spark
Deep Learning with TensorFlow
I am very grateful to my parents, who have always encouraged me to pursue knowledge. I also
want to thank my wife Saroar, son Shadman, elder brother Mamtaz, elder sister Josna, and
friends, who have endured my long monologues about the subjects in this book, and have always
been encouraging and listening to me. Writing this book was made easier by the amazing efforts of
the open source community and the great documentation of many projects out there related to
Apache Spark and Scala. Furthermore, I would like to thank the acquisition, content development,
and technical editors of Packt (and others who were involved in this book title) for their sincere
cooperation and coordination. Additionally, without the work of numerous researchers and data
analytics practitioners who shared their expertise in publications, lectures, and source code, this
book might not exist at all!
Sridhar Alla is a big data expert helping small and big companies solve complex problems, such as
data warehousing, governance, security, real-time processing, high-frequency trading, and
establishing large-scale data science practices. He is an agile practitioner as well as a certified agile
DevOps practitioner and implementer. He started his career as a storage software engineer at
Network Appliance, Sunnyvale, and then worked as the chief technology officer at a cybersecurity
firm, eIQNetworks, Boston. He has also served as the director of data science and engineering
at Comcast, Philadelphia. He is an avid presenter at numerous Strata, Hadoop World,
Spark Summit, and other conferences. He also provides onsite/online training on several
technologies. He has several patents filed with the US PTO on large-scale computing and distributed
systems. He holds a bachelor's degree in computer science from JNTU, Hyderabad, India, and lives
with his wife in New Jersey.


Sridhar has over 18 years of experience writing code in Scala, Java, C, C++, Python, R, and Go. He
also has extensive hands-on knowledge of Spark, Hadoop, Cassandra, HBase, MongoDB, Riak,
Redis, Zeppelin, Mesos, Docker, Kafka, ElasticSearch, Solr, H2O, machine learning, text analytics,
distributed computing, and high-performance computing.
I would like to thank my wonderful wife, Rosie Sarkaria, for all the love and patience during the
many months I spent writing this book, as well as reviewing the countless edits I made. I would also
like to thank my parents, Ravi and Lakshmi Alla, for all the support and encouragement they continue
to bestow upon me. I am very grateful to my many friends, especially Abrar Hashmi and Christian
Ludwig, who helped me bounce ideas and get clarity on the various topics. Writing this book would
not have been possible without the fantastic larger Apache community and the Databricks folks who are making
Spark so powerful and elegant. Further, I would like to thank the acquisition, content development,
and technical editors of Packt Publishing (and others who were involved in this book title) for
their sincere cooperation and coordination.


About the Reviewers
Andre Baianov is an economist-turned-software developer, with a keen interest in data
science. After a bachelor's thesis on data mining and a master's thesis on business
intelligence, he started working with Scala and Apache Spark in 2015. He is currently
working as a consultant for national and international clients, helping them build
reactive architectures, machine learning frameworks, and functional programming
backends.
To my wife: beneath our superficial differences, we share the same soul.

Sumit Pal is the author of SQL on Big Data - Technology, Architecture and Innovations, published
by Apress. He has more than 22 years of experience in the software industry in various roles,
spanning companies from start-ups to enterprises.
Sumit is an independent consultant working with big data, data visualization, and data science, and a
software architect building end-to-end, data-driven analytic systems.
He has worked for Microsoft (SQL Server development team), Oracle (OLAP development team),
and Verizon (big data analytics team) in a career spanning 22 years.
Currently, he works for multiple clients, advising them on their data architectures and big data
solutions, and does hands-on coding with Spark, Scala, Java, and Python.
Sumit has spoken at the following big data conferences: Data Summit NY, May 2017; Big Data
Symposium, Boston, May 2017; Apache Linux Foundation, May 2016, in Vancouver, Canada; and
Data Center World, March 2016, in Las Vegas.


www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files
available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer,
you are entitled to a discount on the eBook copy. Get in touch with us for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free
newsletters and receive exclusive discounts and offers on Packt books and eBooks.

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and

video courses, as well as industry-leading tools to help you plan your personal development and
advance your career.


Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser


Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To
help us improve, please leave us an honest review on this book's Amazon page at www.amazon.com/dp/1785280848.
If you'd like to join our team of regular reviewers, you can e-mail us. We
award our regular reviewers with free eBooks and videos in exchange for their valuable feedback.
Help us be relentless in improving our products!


Table of Contents
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy

Questions

1. Introduction to Scala
History and purposes of Scala
Platforms and editors
Installing and setting up Scala
Installing Java
Windows
Mac OS
Using Homebrew installer
Installing manually
Linux
Scala: the scalable language
Scala is object-oriented
Scala is functional
Scala is statically typed
Scala runs on the JVM
Scala can execute Java code
Scala can do concurrent and synchronized processing
Scala for Java programmers
All types are objects
Type inference
Scala REPL
Nested functions
Import statements
Operators as methods
Methods and parameter lists
Methods inside methods

Constructor in Scala
Objects instead of static methods
Traits
Scala for the beginners
Your first line of code


I'm the hello world program, explain me well!
Run Scala interactively!
Compile it!
Execute it with Scala command
Summary

2. Object-Oriented Scala
Variables in Scala
Reference versus value immutability
Data types in Scala
Variable initialization
Type annotations
Type ascription
Lazy val
Methods, classes, and objects in Scala
Methods in Scala
The return in Scala
Classes in Scala
Objects in Scala
Singleton and companion objects
Companion objects

Comparing and contrasting: val and final
Access and visibility
Constructors
Traits in Scala
A trait syntax
Extending traits
Abstract classes
Abstract classes and the override keyword
Case classes in Scala
Packages and package objects
Java interoperability
Pattern matching
Implicit in Scala
Generic in Scala
Defining a generic class
SBT and other build systems
Build with SBT
Maven with Eclipse
Gradle with Eclipse
Summary

3. Functional Programming Concepts
Introduction to functional programming
Advantages of functional programming
Functional Scala for the data scientists
Why FP and Scala for learning Spark?
Why Spark?



Scala and the Spark programming model
Scala and the Spark ecosystem
Pure functions and higher-order functions
Pure functions
Anonymous functions
Higher-order functions
Function as a return value
Using higher-order functions
Error handling in functional Scala
Failure and exceptions in Scala
Throwing exceptions
Catching exception using try and catch
Finally
Creating an Either
Future
Run one task, but block
Functional programming and data mutability
Summary

4. Collection APIs
Scala collection APIs
Types and hierarchies
Traversable
Iterable
Seq, LinearSeq, and IndexedSeq
Mutable and immutable
Arrays

Lists
Sets
Tuples
Maps
Option
Exists
Forall
Filter
Map
Take
GroupBy
Init
Drop
TakeWhile
DropWhile
FlatMap
Performance characteristics
Performance characteristics of collection objects


Memory usage by collection objects
Java interoperability
Using Scala implicits
Implicit conversions in Scala
Summary

5. Tackle Big Data – Spark Comes to the Party
Introduction to data analytics

Inside the data analytics process
Introduction to big data
4 Vs of big data
Variety of Data
Velocity of Data
Volume of Data
Veracity of Data
Distributed computing using Apache Hadoop
Hadoop Distributed File System (HDFS)
HDFS High Availability
HDFS Federation
HDFS Snapshot
HDFS Read
HDFS Write
MapReduce framework
Here comes Apache Spark
Spark core
Spark SQL
Spark streaming
Spark GraphX
Spark ML
PySpark
SparkR
Summary

6. Start Working with Spark – REPL and RDDs
Dig deeper into Apache Spark
Apache Spark installation

Spark standalone
Spark on YARN
YARN client mode
YARN cluster mode
Spark on Mesos
Introduction to RDDs
RDD Creation
Parallelizing a collection
Reading data from an external source
Transformation of an existing RDD
Streaming API
Using the Spark shell


Actions and Transformations
Transformations
General transformations
Math/Statistical transformations
Set theory/relational transformations
Data structure-based transformations
map function
flatMap function
filter function
coalesce
repartition
Actions
reduce
count
collect
Caching

Loading and saving data
Loading data
textFile
wholeTextFiles
Load from a JDBC Datasource
Saving RDD
Summary

7. Special RDD Operations
Types of RDDs
Pair RDD
DoubleRDD
SequenceFileRDD
CoGroupedRDD
ShuffledRDD
UnionRDD
HadoopRDD
NewHadoopRDD
Aggregations
groupByKey
reduceByKey
aggregateByKey
combineByKey
Comparison of groupByKey, reduceByKey, combineByKey, and aggregateByKey
Partitioning and shuffling
Partitioners
HashPartitioner
RangePartitioner

Shuffling
Narrow Dependencies
Wide Dependencies


Broadcast variables
Creating broadcast variables
Cleaning broadcast variables
Destroying broadcast variables
Accumulators
Summary

8. Introduce a Little Structure - Spark SQL
Spark SQL and DataFrames
DataFrame API and SQL API
Pivots
Filters
User-Defined Functions (UDFs)
Schema - structure of data
Implicit schema
Explicit schema
Encoders
Loading and saving datasets
Loading datasets
Saving datasets
Aggregations
Aggregate functions
Count

First
Last
approx_count_distinct
Min
Max
Average
Sum
Kurtosis
Skewness
Variance
Standard deviation
Covariance
groupBy
Rollup
Cube
Window functions
ntiles
Joins
Inner workings of join
Shuffle join
Broadcast join
Join types
Inner join
Left outer join


Right outer join
Outer join
Left anti join
Left semi join

Cross join
Performance implications of join
Summary

9. Stream Me Up, Scotty - Spark Streaming
A brief introduction to streaming
At least once processing
At most once processing
Exactly once processing
Spark Streaming
StreamingContext
Creating StreamingContext
Starting StreamingContext
Stopping StreamingContext
Input streams
receiverStream
socketTextStream
rawSocketStream
fileStream
textFileStream
binaryRecordsStream
queueStream
textFileStream example
twitterStream example
Discretized streams
Transformations
Window operations
Stateful/stateless transformations

Stateless transformations
Stateful transformations
Checkpointing
Metadata checkpointing
Data checkpointing
Driver failure recovery
Interoperability with streaming platforms (Apache Kafka)
Receiver-based approach
Direct stream
Structured streaming
Handling Event-time and late data
Fault tolerance semantics
Summary


10. Everything is Connected - GraphX
A brief introduction to graph theory
GraphX
VertexRDD and EdgeRDD
VertexRDD
EdgeRDD
Graph operators
Filter
MapValues
aggregateMessages
TriangleCounting
Pregel API

ConnectedComponents
Traveling salesman problem
ShortestPaths
PageRank
Summary

11. Learning Machine Learning - Spark MLlib and Spark ML
Introduction to machine learning
Typical machine learning workflow
Machine learning tasks
Supervised learning
Unsupervised learning
Reinforcement learning
Recommender system
Semisupervised learning
Spark machine learning APIs
Spark machine learning libraries
Spark MLlib
Spark ML
Spark MLlib or Spark ML?
Feature extraction and transformation
CountVectorizer
Tokenizer
StopWordsRemover
StringIndexer
OneHotEncoder
Spark ML pipelines
Dataset abstraction

Creating a simple pipeline
Unsupervised machine learning
Dimensionality reduction
PCA
Using PCA
Regression Analysis - a practical use of PCA
Dataset collection and exploration


What is regression analysis?
Binary and multiclass classification
Performance metrics
Binary classification using logistic regression
Breast cancer prediction using logistic regression of Spark ML
Dataset collection
Developing the pipeline using Spark ML
Multiclass classification using logistic regression
Improving classification accuracy using random forests
Classifying MNIST dataset using random forest
Summary

12. Advanced Machine Learning Best Practices
Machine learning best practices
Beware of overfitting and underfitting
Stay tuned with Spark MLlib and Spark ML
Choosing the right algorithm for your application
Considerations when choosing an algorithm
Accuracy

Training time
Linearity
Inspect your data when choosing an algorithm
Number of parameters
How large is your training set?
Number of features
Hyperparameter tuning of ML models
Hyperparameter tuning
Grid search parameter tuning
Cross-validation
Credit risk analysis – An example of hyperparameter tuning
What is credit risk analysis? Why is it important?
The dataset exploration
Step-by-step example with Spark ML
A recommendation system with Spark
Model-based recommendation with Spark
Data exploration
Movie recommendation using ALS
Topic modelling - A best practice for text clustering
How does LDA work?
Topic modeling with Spark MLlib
Scalability of LDA
Summary

13. My Name is Bayes, Naive Bayes
Multinomial classification
Transformation to binary
Classification using One-Vs-The-Rest approach

Exploration and preparation of the OCR dataset
Hierarchical classification


Extension from binary
Bayesian inference
An overview of Bayesian inference
What is inference?
How does it work?
Naive Bayes
An overview of Bayes' theorem
My name is Bayes, Naive Bayes
Building a scalable classifier with NB
Tune me up!
The decision trees
Advantages and disadvantages of using DTs
Decision tree versus Naive Bayes
Building a scalable classifier with DT algorithm
Summary

14. Time to Put Some Order - Cluster Your Data with Spark MLlib
Unsupervised learning
Unsupervised learning example
Clustering techniques
Unsupervised learning and the clustering
Hierarchical clustering
Centroid-based clustering
Distribution-based clustering

Centroid-based clustering (CC)
Challenges in CC algorithm
How does K-means algorithm work?
An example of clustering using K-means of Spark MLlib
Hierarchical clustering (HC)
An overview of HC algorithm and challenges
Bisecting K-means with Spark MLlib
Bisecting K-means clustering of the neighborhood using Spark MLlib
Distribution-based clustering (DC)
Challenges in DC algorithm
How does a Gaussian mixture model work?
An example of clustering using GMM with Spark MLlib
Determining number of clusters
A comparative analysis between clustering algorithms
Submitting Spark job for cluster analysis
Summary

15. Text Analytics Using Spark ML
Understanding text analytics
Text analytics
Sentiment analysis
Topic modeling
TF-IDF (term frequency - inverse document frequency)
Named entity recognition (NER)
Event extraction
Transformers and Estimators



Standard Transformer
Estimator Transformer
Tokenization
StopWordsRemover
NGrams
TF-IDF
HashingTF
Inverse Document Frequency (IDF)
Word2Vec
CountVectorizer
Topic modeling using LDA
Implementing text classification
Summary

16. Spark Tuning
Monitoring Spark jobs
Spark web interface
Jobs
Stages
Storage
Environment
Executors
SQL
Visualizing Spark application using web UI
Observing the running and completed Spark jobs
Debugging Spark applications using logs
Logging with log4j with Spark
Spark configuration

Spark properties
Environmental variables
Logging
Common mistakes in Spark app development
Application failure
Slow jobs or unresponsiveness
Optimization techniques
Data serialization
Memory tuning
Memory usage and management
Tuning the data structures
Serialized RDD storage
Garbage collection tuning
Level of parallelism
Broadcasting
Data locality
Summary

17. Time to Go to ClusterLand - Deploying Spark on a Cluster
Spark architecture in a cluster


Spark ecosystem in brief
Cluster design
Cluster management
Pseudocluster mode (aka Spark local)
Standalone
Apache YARN

Apache Mesos
Cloud-based deployments
Deploying the Spark application on a cluster
Submitting Spark jobs
Running Spark jobs locally and in standalone
Hadoop YARN
Configuring a single-node YARN cluster
Step 1: Downloading Apache Hadoop
Step 2: Setting the JAVA_HOME
Step 3: Creating users and groups
Step 4: Creating data and log directories
Step 5: Configuring core-site.xml
Step 6: Configuring hdfs-site.xml
Step 7: Configuring mapred-site.xml
Step 8: Configuring yarn-site.xml
Step 9: Setting Java heap space
Step 10: Formatting HDFS
Step 11: Starting the HDFS
Step 12: Starting YARN
Step 13: Verifying on the web UI
Submitting Spark jobs on YARN cluster
Advanced job submissions in a YARN cluster
Apache Mesos
Client mode
Cluster mode
Deploying on AWS
Step 1: Key pair and access key configuration
Step 2: Configuring Spark cluster on EC2
Step 3: Running Spark jobs on the AWS cluster
Step 4: Pausing, restarting, and terminating the Spark cluster

Summary

18. Testing and Debugging Spark
Testing in a distributed environment
Distributed environment
Issues in a distributed system
Challenges of software testing in a distributed environment
Testing Spark applications
Testing Scala methods
Unit testing
Testing Spark applications


Method 1: Using Scala JUnit test
Method 2: Testing Scala code using FunSuite
Method 3: Making life easier with Spark testing base
Configuring Hadoop runtime on Windows
Debugging Spark applications
Logging with log4j with Spark recap
Debugging the Spark application
Debugging Spark application on Eclipse as Scala debug
Debugging Spark jobs running as local and standalone mode
Debugging Spark applications on YARN or Mesos cluster
Debugging Spark application using SBT
Summary

19. PySpark and SparkR
Introduction to PySpark
Installation and configuration
By setting SPARK_HOME
Using Python shell
By setting PySpark on Python IDEs
Getting started with PySpark
Working with DataFrames and RDDs
Reading a dataset in Libsvm format
Reading a CSV file
Reading and manipulating raw text files
Writing UDF on PySpark
Let's do some analytics with k-means clustering
Introduction to SparkR
Why SparkR?
Installing and getting started
Getting started
Using external data source APIs
Data manipulation
Querying SparkR DataFrame
Visualizing your data on RStudio
Summary

20. Accelerating Spark with Alluxio
The need for Alluxio
Getting started with Alluxio
Downloading Alluxio
Installing and running Alluxio locally

Overview
Browse
Configuration
Workers
In-Memory Data
Logs

