www.it-ebooks.info
Scala for Machine Learning
Leverage Scala and Machine Learning to construct and
study systems that can learn from data
Patrick R. Nicolas
BIRMINGHAM - MUMBAI
Scala for Machine Learning
Copyright © 2014 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author, nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2014
Production reference: 1121214
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78355-874-2
www.packtpub.com
Credits
Author
Patrick R. Nicolas

Reviewers
Subhajit Datta
Rui Gonçalves
Patricia Hoffman, PhD
Md Zahidul Islam

Commissioning Editor
Owen Roberts

Acquisition Editor
Owen Roberts

Content Development Editor
Mohammed Fahad

Technical Editors
Madhuri Das
Taabish Khan

Copy Editors
Janbal Dharmaraj
Vikrant Phadkay

Project Coordinator
Danuta Jones

Proofreaders
Simran Bhogal
Maria Gould
Paul Hindle
Elinor Perry-Smith
Chris Smith

Indexer
Mariammal Chettiyar

Graphics
Sheetal Aute
Valentina D'silva
Disha Haria
Abhinash Sahu

Production Coordinator
Arvindkumar Gupta

Cover Work
Arvindkumar Gupta
About the Author
Patrick R. Nicolas is a lead R&D engineer at Dell in Santa Clara, California.
He has 25 years of experience in software engineering and building large-scale
applications in C++, Java, and Scala, and has held several managerial positions.
His interests include real-time analytics, modeling, and optimization.
Special thanks to the Packt Publishing team: Mohammed Fahad for
his patience and encouragement, Owen Roberts for the opportunity,
and the reviewers for their guidance and dedication.
About the Reviewers
Subhajit Datta is a passionate software developer.
He earned his Bachelor of Engineering in Information Technology (BE in IT) from the Indian
Institute of Engineering Science and Technology, Shibpur (IIEST, Shibpur), formerly
known as Bengal Engineering and Science University, Shibpur.
He completed his Master of Technology in Computer Science and Engineering
(MTech CSE) at the Indian Institute of Technology Bombay (IIT Bombay); his
thesis focused on topics in natural language processing.
He has experience in the investment banking and web application
domains, and is a polyglot who has worked with Java, Scala, Python, Unix shell scripting,
VBScript, JavaScript, C#.Net, and PHP. He is interested in learning and applying
new and different technologies.
He believes that choosing the right programming language, tool, and framework for
the problem at hand is more important than trying to fit all problems into one technology.
He has also worked with both Waterfall and Agile processes, and is enthusiastic
about Agile software development.
Rui Gonçalves is an all-round, hardworking, and dedicated software engineer.
He is an enthusiast of software architecture, programming paradigms, algorithms,
and data structures with the ambition of developing products and services that
have a great impact on society.
He currently works at ShiftForward, where he is a software engineer in the online
advertising field. He is focused on designing and implementing highly efficient,
concurrent, and scalable systems as well as machine learning solutions. In order
to achieve this, he uses Scala as the main development language of these systems
on a day-to-day basis.
Patricia Hoffman, PhD, is a consultant at iCube Consulting Service Inc., with
over 25 years of experience in modeling and simulation, of which the last six years
concentrated on machine learning and data mining technologies. Her software
development experience ranges from modeling stochastic partial differential equations
to image processing. She is currently an adjunct faculty member at International
Technical University, teaching machine learning courses. She also teaches machine
learning and data mining at the University of California, Santa Cruz—Silicon Valley
Campus. She chaired the Association for Computing Machinery's Data Mining
Special Interest Group for the San Francisco Bay Area for five years, organizing monthly
lectures and five data mining conferences with over 350 participants.
Patricia has a long list of significant accomplishments. She developed the architecture
and software development plan for a collaborative recommendation system
while consulting as a data mining expert for Quantum Capital. While consulting
for Revolution Analytics, she developed training materials for interfacing the R
statistical language with IBM's Netezza data warehouse appliance.
She has also set up the systems used for communication and software development
along with technical coordination for GTECH, a medical device start-up.
She has also technically directed, produced, and managed operations concepts
and architecture analysis for hardware, software, and firmware. She has performed
risk assessments and has written qualification letters, proposals, system specs, and
interface control documents. Also, she has coordinated with subcontractors, associate
contractors, and various Lockheed departments to produce analysis, documents,
technology demonstrations, and integrated systems. She was the Chief Systems
Engineer for a $12 million image processing workstation development effort, which
received a score of 100 percent from the customer.
Patricia's contributions to publications include the following:
• A unified view on the rotational symmetry of equilibria of nematic polymers, dipolar
nematic polymers, and polymers in higher dimensional space, Communications in
Mathematical Sciences, Volume 6, pages 949-974
• Technical editor for the book Machine Learning in Action, Peter
Harrington, Manning Publications Co.
• A Distributed Architecture for the C3I (Command, Control, Communications,
and Intelligence) Collection Management Expert System, with Allen Rude,
AIC Lockheed
• A book review of computer-supported cooperative work, ACM/SIGCHI
Bulletin, Volume 21, Issue 2, pages 125-128, ISSN: 0736-6906, 1989
Md Zahidul Islam is a software developer working for HSI Health and lives in
Concord, California, with his wife.
He has a passion for functional programming, machine learning, and working
with data. He is currently working with Scala, Apache Spark, MLlib, Ruby on Rails,
ElasticSearch, MongoDB, and Backbone.js. Earlier in his career, he worked with C#,
ASP.NET, and everything around the .NET ecosystem.
I would like to thank my wife, Sandra, who lovingly supports me in
everything I do. I'd also like to thank Packt Publishing and its staff
for the opportunity to contribute to this book.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF
and ePub files available? You can upgrade to the eBook version at www.PacktPub.com
and as a print book customer, you are entitled to a discount on the eBook copy. Get in
touch with us at for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign
up for a range of free newsletters and receive exclusive discounts and offers on Packt
books and eBooks.
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital
book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access
PacktLib today and view 9 entirely free books. Simply use your login credentials for
immediate access.
To Jennifer, for her kindness and support throughout this long journey.
Table of Contents
Preface
Chapter 1: Getting Started
Mathematical notation for the curious
Why machine learning?
Classification
Prediction
Optimization
Regression
Why Scala?
Abstraction
Scalability
Configurability
Maintainability
Computation on demand
Model categorization
Taxonomy of machine learning algorithms
Unsupervised learning
Clustering
Dimension reduction
Supervised learning
Generative models
Discriminative models
Reinforcement learning
Tools and frameworks
Java
Scala
Apache Commons Math
Description
Licensing
Installation
JFreeChart
Description
Licensing
Installation
Other libraries and frameworks
Source code
Context versus view bounds
Presentation
Primitives and implicits
Primitive types
Type conversions
Operators
Immutability
Performance of Scala iterators
Let's kick the tires
Overview of computational workflows
Writing a simple workflow
Selecting a dataset
Loading the dataset
Preprocessing the dataset
Creating a model (learning)
Classify the data
Summary
Chapter 2: Hello World!
Modeling
A model by any other name
Model versus design
Selecting a model's features
Extracting features
Designing a workflow
The computational framework
The pipe operator
Monadic data transformation
Dependency injection
Workflow modules
The workflow factory
Examples of workflow components
The preprocessing module
The clustering module
Assessing a model
Validation
Key metrics
Implementation
K-fold cross-validation
Bias-variance decomposition
Overfitting
Summary
Chapter 3: Data Preprocessing
Time series
Moving averages
The simple moving average
The weighted moving average
The exponential moving average
Fourier analysis
Discrete Fourier transform (DFT)
DFT-based filtering
Detection of market cycles
The Kalman filter
The state space estimation
The transition equation
The measurement equation
The recursive algorithm
Prediction
Correction
Kalman smoothing
Experimentation
Alternative preprocessing techniques
Summary
Chapter 4: Unsupervised Learning
Clustering
K-means clustering
Measuring similarity
Overview of the K-means algorithm
Step 1 – cluster configuration
Step 2 – cluster assignment
Step 3 – iterative reconstruction
Curse of dimensionality
Experiment
Tuning the number of clusters
Validation
Expectation-maximization (EM) algorithm
Gaussian mixture model
EM overview
Implementation
Testing
Online EM
Dimension reduction
Principal components analysis (PCA)
Algorithm
Implementation
Test case
Evaluation
Other dimension reduction techniques
Performance considerations
K-means
EM
PCA
Summary
Chapter 5: Naïve Bayes Classifiers
Probabilistic graphical models
Naïve Bayes classifiers
Introducing the multinomial Naïve Bayes
Formalism
The frequentist perspective
The predictive model
The zero-frequency problem
Implementation
Software design
Training
Classification
Labeling
Results
Multivariate Bernoulli classification
Model
Implementation
Naïve Bayes and text mining
Basics of information retrieval
Implementation
Extraction of terms
Scoring of terms
Testing
Retrieving textual information
Evaluation
Pros and cons
Summary
Chapter 6: Regression and Regularization
Linear regression
One-variate linear regression
Implementation
Test case
Ordinary least squares (OLS) regression
Design
Implementation
Test case 1 – trending
Test case 2 – features selection
Regularization
Ln roughness penalty
The ridge regression
Implementation
The test case
Numerical optimization
The logistic regression
The logit function
Binomial classification
Software design
The training workflow
Configuring the least squares optimizer
Computing the Jacobian matrix
Defining the exit conditions
Defining the least squares problem
Minimizing the loss function
Test
Classification
Summary
Chapter 7: Sequential Data Models
Markov decision processes
The Markov property
The first-order discrete Markov chain
The hidden Markov model (HMM)
Notation
The lambda model
HMM execution state
Evaluation (CF-1)
Alpha class (the forward variable)
Beta class (the backward variable)
Training (CF-2)
Baum-Welch estimator (EM)
Decoding (CF-3)
The Viterbi algorithm
Putting it all together
Test case
The hidden Markov model for time series analysis
Conditional random fields
Introduction to CRF
Linear chain CRF
CRF and text analytics
The feature functions model
Software design
Implementation
Building the training set
Generating tags
Extracting data sequences
CRF control parameters
Putting it all together
Tests
The training convergence profile
Impact of the size of the training set
Impact of the L2 regularization factor
Comparing CRF and HMM
Performance consideration
Summary
Chapter 8: Kernel Models and Support Vector Machines
Kernel functions
Overview
Common discriminative kernels
The support vector machine (SVM)
The linear SVM
The separable case (hard margin)
The nonseparable case (soft margin)
The nonlinear SVM
Max-margin classification
The kernel trick
Support vector classifier (SVC)
The binary SVC
LIBSVM
Software design
Configuration parameters
SVM implementation
C-penalty and margin
Kernel evaluation
Application to risk analysis
Anomaly detection with one-class SVC
Support vector regression (SVR)
Overview
SVR versus linear regression
Performance considerations
Summary
Chapter 9: Artificial Neural Networks
Feed-forward neural networks (FFNN)
The biological background
The mathematical background
The multilayer perceptron (MLP)
The activation function
The network architecture
Software design
Model definition
Layers
Synapses
Connections
Training cycle/epoch
Step 1 – input forward propagation
Step 2 – sum of squared errors
Step 3 – error backpropagation
Step 4 – synapse/weights adjustment
Step 5 – convergence criteria
Configuration
Putting all together
Training strategies and classification
Online versus batch training
Regularization
Model instantiation
Prediction
Evaluation
Impact of learning rate
Impact of the momentum factor
Test case
Implementation
Models evaluation
Impact of hidden layers architecture
Benefits and limitations
Summary
Chapter 10: Genetic Algorithms
Evolution
The origin
NP problems
Evolutionary computing
Genetic algorithms and machine learning
Genetic algorithm components
Encodings
Value encoding
Predicate encoding
Solution encoding
The encoding scheme
Genetic operators
Selection
Crossover
Mutation
Fitness score
Implementation
Software design
Key components
Selection
Controlling population growth
GA configuration
Crossover
Population
Chromosomes
Genes
Mutation
Population
Chromosomes
Genes
The reproduction cycle
GA for trading strategies
Definition of trading strategies
Trading operators
The cost/unfitness function
Trading signals
Trading strategies
Signal encoding
Test case
Data extraction
Initial population
Configuration
GA instantiation
GA execution
Tests
Advantages and risks of genetic algorithms
Summary
Chapter 11: Reinforcement Learning
Introduction
The problem
A solution – Q-learning
Terminology
Concept
Value of policy
Bellman optimality equations
Temporal difference for model-free learning
Action-value iterative update
Implementation
Software design
States and actions
Search space
Policy and action-value
The Q-learning training
Tail recursion to the rescue
Prediction
Option trading using Q-learning
Option property
Option model
Function approximation
Constrained state-transition
Putting it all together
Evaluation
Pros and cons of reinforcement learning
Learning classifier systems
Introduction to LCS
Why LCS
Terminology
Extended learning classifier systems (XCS)
XCS components
Application to portfolio management
XCS core data
XCS rules
Covering
Example of implementation
Benefits and limitation of learning classifier systems
Summary
Chapter 12: Scalable Frameworks
Overview
Scala
Controlling object creation
Parallel collections
Processing a parallel collection
Benchmark framework
Performance evaluation
Scalability with Actors
The Actor model
Partitioning
Beyond actors – reactive programming
Akka
Master-workers
Messages exchange
Worker actors
The workflow controller
The master Actor
Master with routing
Distributed discrete Fourier transform
Limitations
Futures
The Actor life cycle
Blocking on futures
Handling future callbacks
Putting all together
Apache Spark
Why Spark
Design principles
In-memory persistency
Laziness
Transforms and Actions
Shared variables
Experimenting with Spark
Deploying Spark
Using Spark shell
MLlib
RDD generation
K-means using Spark
Performance evaluation
Tuning parameters
Tests
Performance considerations
Pros and cons
0xdata Sparkling Water
Summary
Appendix A: Basic Concepts
Scala programming
List of libraries
Format of code snippets
Encapsulation
Class constructor template
Companion objects versus case classes
Enumerations versus case classes
Overloading
Design template for classifiers
Data extraction
Data sources
Extraction of documents
Matrix class
Mathematics
Linear algebra
QR Decomposition
LU factorization
LDL decomposition
Cholesky factorization
Singular value decomposition
Eigenvalue decomposition
Algebraic and numerical libraries
First order predicate logic
Jacobian and Hessian matrices
Summary of optimization techniques
Gradient descent methods
Quasi-Newton algorithms
Nonlinear least squares minimization
Lagrange multipliers
Overview of dynamic programming
Finances 101
Fundamental analysis
Technical analysis
Terminology
Trading signals and strategy
Price patterns
Options trading
Financial data sources
Suggested online courses
References
Index