
www.it-ebooks.info


Scala for Machine Learning

Leverage Scala and Machine Learning to construct and
study systems that can learn from data

Patrick R. Nicolas

BIRMINGHAM - MUMBAI


Scala for Machine Learning
Copyright © 2014 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author, nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.


First published: December 2014

Production reference: 1121214

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78355-874-2
www.packtpub.com


Credits

Author
Patrick R. Nicolas

Reviewers
Subhajit Datta
Rui Gonçalves
Patricia Hoffman, PhD
Md Zahidul Islam

Commissioning Editor
Owen Roberts

Acquisition Editor
Owen Roberts

Content Development Editor
Mohammed Fahad

Technical Editors
Madhuri Das
Taabish Khan

Copy Editors
Janbal Dharmaraj
Vikrant Phadkay

Project Coordinator
Danuta Jones

Proofreaders
Simran Bhogal
Maria Gould
Paul Hindle
Elinor Perry-Smith
Chris Smith

Indexer
Mariammal Chettiyar

Graphics
Sheetal Aute
Valentina D'silva
Disha Haria
Abhinash Sahu

Production Coordinator
Arvindkumar Gupta

Cover Work
Arvindkumar Gupta


About the Author
Patrick R. Nicolas is a lead R&D engineer at Dell in Santa Clara, California.
He has 25 years of experience in software engineering and building large-scale
applications in C++, Java, and Scala, and has held several managerial positions.
His interests include real-time analytics, modeling, and optimization.

Special thanks to the Packt Publishing team: Mohammed Fahad for
his patience and encouragement, Owen Roberts for the opportunity,
and the reviewers for their guidance and dedication.


About the Reviewers

Subhajit Datta is a passionate software developer.
He earned his Bachelor of Engineering in Information Technology (BE in IT) from the
Indian Institute of Engineering Science and Technology, Shibpur (IIEST, Shibpur),
formerly known as Bengal Engineering and Science University, Shibpur.
He completed his Master of Technology in Computer Science and Engineering
(MTech CSE) at the Indian Institute of Technology Bombay (IIT Bombay); his
thesis focused on topics in natural language processing.
He has worked in the investment banking and web application domains, and is a
polyglot who has worked with Java, Scala, Python, Unix shell scripting,
VBScript, JavaScript, C#.Net, and PHP. He is interested in learning and applying
new and different technologies.
He believes that choosing the right programming language, tool, and framework for
the problem at hand is more important than trying to fit all problems into one
technology.
He has worked with both the Waterfall and Agile processes, and is particularly
enthusiastic about Agile software development.

Rui Gonçalves is an all-round, hardworking, and dedicated software engineer.
He is an enthusiast of software architecture, programming paradigms, algorithms,
and data structures with the ambition of developing products and services that
have a great impact on society.
He currently works at ShiftForward, where he is a software engineer in the online
advertising field. He is focused on designing and implementing highly efficient,
concurrent, and scalable systems as well as machine learning solutions. In order
to achieve this, he uses Scala as the main development language of these systems
on a day-to-day basis.


Patricia Hoffman, PhD, is a consultant at iCube Consulting Service Inc., with
over 25 years of experience in modeling and simulation, the last six of which have
concentrated on machine learning and data mining technologies. Her software
development experience ranges from modeling stochastic partial differential equations
to image processing. She is currently an adjunct faculty member at International
Technical University, teaching machine learning courses. She also teaches machine
learning and data mining at the University of California, Santa Cruz—Silicon Valley
Campus. She chaired the Association for Computing Machinery's Data Mining
Special Interest Group for the San Francisco Bay Area for 5 years, organizing monthly
lectures and five data mining conferences with over 350 participants.
Patricia has a long list of significant accomplishments. She developed the architecture
and software development plan for a collaborative recommendation system
while consulting as a data mining expert for Quantum Capital. While consulting
for Revolution Analytics, she developed training materials for interfacing the R
statistical language with IBM's Netezza data warehouse appliance.
She has also set up the systems used for communication and software development,
along with technical coordination, for GTECH, a medical device start-up.
She has technically directed, produced, and managed operations concepts
and architecture analysis for hardware, software, and firmware. She has performed
risk assessments and has written qualification letters, proposals, system specs, and
interface control documents. She has also coordinated with subcontractors, associate
contractors, and various Lockheed departments to produce analysis, documents,
technology demonstrations, and integrated systems. She was the Chief Systems
Engineer for a $12 million image processing workstation development, and
received a score of 100 percent from the customer.
Patricia's publications and editorial contributions are as follows:
• A unified view on the rotational symmetry of equilibria of nematic polymers, dipolar
nematic polymers, and polymers in higher dimensional space, Communications in
Mathematical Sciences, Volume 6, 949-974
• Technical editor of the book Machine Learning in Action, Peter
Harrington, Manning Publications Co.
• A Distributed Architecture for the C3I (Command, Control, Communications,
and Intelligence) Collection Management Expert System, with Allen Rude,
AIC Lockheed
• A book review of computer-supported cooperative work, ACM/SIGCHI
Bulletin, Volume 21, Issue 2, pages 125-128, ISSN:0736-6906, 1989


Md Zahidul Islam is a software developer working for HSI Health and lives in
Concord, California, with his wife.
He has a passion for functional programming, machine learning, and working
with data. He is currently working with Scala, Apache Spark, MLlib, Ruby on Rails,
ElasticSearch, MongoDB, and Backbone.js. Earlier in his career, he worked with C#,
ASP.NET, and everything around the .NET ecosystem.
I would like to thank my wife, Sandra, who lovingly supports me in
everything I do. I'd also like to thank Packt Publishing and its staff
for the opportunity to contribute to this book.


www.PacktPub.com
Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF
and ePub files available? You can upgrade to the eBook version at www.PacktPub.com
and as a print book customer, you are entitled to a discount on the eBook copy. Get in
touch with us at for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign
up for a range of free newsletters and receive exclusive discounts and offers on Packt
books and eBooks.
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital
book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access
PacktLib today and view 9 entirely free books. Simply use your login credentials for
immediate access.


To Jennifer, for her kindness and support throughout this long journey.


Table of Contents

Preface

Chapter 1: Getting Started
Mathematical notation for the curious
Why machine learning?
Classification
Prediction
Optimization
Regression
Why Scala?
Abstraction
Scalability
Configurability
Maintainability
Computation on demand
Model categorization
Taxonomy of machine learning algorithms
Unsupervised learning
Clustering
Dimension reduction
Supervised learning
Generative models
Discriminative models
Reinforcement learning
Tools and frameworks
Java
Scala
Apache Commons Math
Description
Licensing
Installation
JFreeChart
Description
Licensing
Installation
Other libraries and frameworks
Source code
Context versus view bounds
Presentation
Primitives and implicits
Primitive types
Type conversions
Operators
Immutability
Performance of Scala iterators
Let's kick the tires
Overview of computational workflows
Writing a simple workflow
Selecting a dataset
Loading the dataset
Preprocessing the dataset
Creating a model (learning)
Classify the data
Summary

Chapter 2: Hello World!
Modeling
A model by any other name
Model versus design
Selecting a model's features
Extracting features
Designing a workflow
The computational framework
The pipe operator
Monadic data transformation
Dependency injection
Workflow modules
The workflow factory
Examples of workflow components
The preprocessing module
The clustering module
Assessing a model
Validation
Key metrics
Implementation
K-fold cross-validation
Bias-variance decomposition
Overfitting
Summary

Chapter 3: Data Preprocessing
Time series
Moving averages
The simple moving average
The weighted moving average
The exponential moving average
Fourier analysis
Discrete Fourier transform (DFT)
DFT-based filtering
Detection of market cycles
The Kalman filter
The state space estimation
The transition equation
The measurement equation
The recursive algorithm
Prediction
Correction
Kalman smoothing
Experimentation
Alternative preprocessing techniques
Summary

Chapter 4: Unsupervised Learning
Clustering
K-means clustering
Measuring similarity
Overview of the K-means algorithm
Step 1 – cluster configuration
Step 2 – cluster assignment
Step 3 – iterative reconstruction
Curse of dimensionality
Experiment
Tuning the number of clusters
Validation
Expectation-maximization (EM) algorithm
Gaussian mixture model
EM overview
Implementation
Testing
Online EM
Dimension reduction
Principal components analysis (PCA)
Algorithm
Implementation
Test case
Evaluation
Other dimension reduction techniques
Performance considerations
K-means
EM
PCA
Summary

Chapter 5: Naïve Bayes Classifiers
Probabilistic graphical models
Naïve Bayes classifiers
Introducing the multinomial Naïve Bayes
Formalism
The frequentist perspective
The predictive model
The zero-frequency problem
Implementation
Software design
Training
Classification
Labeling
Results
Multivariate Bernoulli classification
Model
Implementation
Naïve Bayes and text mining
Basics of information retrieval
Implementation
Extraction of terms
Scoring of terms
Testing
Retrieving textual information
Evaluation
Pros and cons
Summary

Chapter 6: Regression and Regularization
Linear regression
One-variate linear regression
Implementation
Test case
Ordinary least squares (OLS) regression
Design
Implementation
Test case 1 – trending
Test case 2 – features selection
Regularization
Ln roughness penalty
The ridge regression
Implementation
The test case
Numerical optimization
The logistic regression
The logit function
Binomial classification
Software design
The training workflow
Configuring the least squares optimizer
Computing the Jacobian matrix
Defining the exit conditions
Defining the least squares problem
Minimizing the loss function
Test
Classification
Summary

Chapter 7: Sequential Data Models
Markov decision processes
The Markov property
The first-order discrete Markov chain
The hidden Markov model (HMM)
Notation
The lambda model
HMM execution state
Evaluation (CF-1)
Alpha class (the forward variable)
Beta class (the backward variable)
Training (CF-2)
Baum-Welch estimator (EM)
Decoding (CF-3)
The Viterbi algorithm
Putting it all together
Test case
The hidden Markov model for time series analysis
Conditional random fields
Introduction to CRF
Linear chain CRF
CRF and text analytics
The feature functions model
Software design
Implementation
Building the training set
Generating tags
Extracting data sequences
CRF control parameters
Putting it all together
Tests
The training convergence profile
Impact of the size of the training set
Impact of the L2 regularization factor
Comparing CRF and HMM
Performance consideration
Summary

Chapter 8: Kernel Models and Support Vector Machines
Kernel functions
Overview
Common discriminative kernels
The support vector machine (SVM)
The linear SVM
The separable case (hard margin)
The nonseparable case (soft margin)
The nonlinear SVM
Max-margin classification
The kernel trick
Support vector classifier (SVC)
The binary SVC
LIBSVM
Software design
Configuration parameters
SVM implementation
C-penalty and margin
Kernel evaluation
Application to risk analysis
Anomaly detection with one-class SVC
Support vector regression (SVR)
Overview
SVR versus linear regression
Performance considerations
Summary

Chapter 9: Artificial Neural Networks
Feed-forward neural networks (FFNN)
The Biological background
The mathematical background
The multilayer perceptron (MLP)
The activation function
The network architecture
Software design
Model definition
Layers
Synapses
Connections
Training cycle/epoch
Step 1 – input forward propagation
Step 2 – sum of squared errors
Step 3 – error backpropagation
Step 4 – synapse/weights adjustment
Step 5 – convergence criteria
Configuration
Putting all together
Training strategies and classification
Online versus batch training
Regularization
Model instantiation
Prediction
Evaluation
Impact of learning rate
Impact of the momentum factor
Test case
Implementation
Models evaluation
Impact of hidden layers architecture
Benefits and limitations
Summary

Chapter 10: Genetic Algorithms
Evolution
The origin
NP problems
Evolutionary computing
Genetic algorithms and machine learning
Genetic algorithm components
Encodings
Value encoding
Predicate encoding
Solution encoding
The encoding scheme
Genetic operators
Selection
Crossover
Mutation
Fitness score
Implementation
Software design
Key components
Selection
Controlling population growth
GA configuration
Crossover
Population
Chromosomes
Genes
Mutation
Population
Chromosomes
Genes
The reproduction cycle
GA for trading strategies
Definition of trading strategies
Trading operators
The cost/unfitness function
Trading signals
Trading strategies
Signal encoding
Test case
Data extraction
Initial population
Configuration
GA instantiation
GA execution
Tests
Advantages and risks of genetic algorithms
Summary

Chapter 11: Reinforcement Learning
Introduction
The problem
A solution – Q-learning
Terminology
Concept
Value of policy
Bellman optimality equations
Temporal difference for model-free learning
Action-value iterative update
Implementation
Software design
States and actions
Search space
Policy and action-value
The Q-learning training
Tail recursion to the rescue
Prediction
Option trading using Q-learning
Option property
Option model
Function approximation
Constrained state-transition
Putting it all together
Evaluation
Pros and cons of reinforcement learning
Learning classifier systems
Introduction to LCS
Why LCS
Terminology
Extended learning classifier systems (XCS)
XCS components
Application to portfolio management
XCS core data
XCS rules
Covering
Example of implementation
Benefits and limitation of learning classifier systems
Summary

Chapter 12: Scalable Frameworks
Overview
Scala
Controlling object creation
Parallel collections
Processing a parallel collection
Benchmark framework
Performance evaluation
Scalability with Actors
The Actor model
Partitioning
Beyond actors – reactive programming
Akka
Master-workers
Messages exchange
Worker actors
The workflow controller
The master Actor
Master with routing
Distributed discrete Fourier transform
Limitations
Futures
The Actor life cycle
Blocking on futures
Handling future callbacks
Putting all together
Apache Spark
Why Spark
Design principles
In-memory persistency
Laziness
Transforms and Actions
Shared variables
Experimenting with Spark
Deploying Spark
Using Spark shell
MLlib
RDD generation
K-means using Spark
Performance evaluation
Tuning parameters
Tests
Performance considerations
Pros and cons
0xdata Sparkling Water
Summary


Appendix A: Basic Concepts
Scala programming
List of libraries
Format of code snippets
Encapsulation
Class constructor template
Companion objects versus case classes
Enumerations versus case classes
Overloading
Design template for classifiers
Data extraction
Data sources
Extraction of documents
Matrix class
Mathematics
Linear algebra
QR Decomposition
LU factorization
LDL decomposition
Cholesky factorization
Singular value decomposition
Eigenvalue decomposition
Algebraic and numerical libraries
First order predicate logic
Jacobian and Hessian matrices
Summary of optimization techniques
Gradient descent methods
Quasi-Newton algorithms
Nonlinear least squares minimization
Lagrange multipliers
Overview of dynamic programming
Finances 101
Fundamental analysis
Technical analysis
Terminology
Trading signals and strategy
Price patterns
Options trading
Financial data sources
Suggested online courses
References

Index
