
Linear Algebra and Optimization for Machine Learning
A Textbook

Charu C. Aggarwal
Distinguished Research Staff Member
IBM T.J. Watson Research Center
Yorktown Heights, NY, USA

ISBN 978-3-030-40343-0
ISBN 978-3-030-40344-7 (eBook)
© Springer Nature Switzerland AG 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on
microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be
true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or
implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher
remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland


To my wife Lata, my daughter Sayani,
and all my mathematics teachers


Contents

1 Linear Algebra and Optimization: An Introduction
   1.1 Introduction
   1.2 Scalars, Vectors, and Matrices
      1.2.1 Basic Operations with Scalars and Vectors
      1.2.2 Basic Operations with Vectors and Matrices
      1.2.3 Special Classes of Matrices
      1.2.4 Matrix Powers, Polynomials, and the Inverse
      1.2.5 The Matrix Inversion Lemma: Inverting the Sum of Matrices
      1.2.6 Frobenius Norm, Trace, and Energy
   1.3 Matrix Multiplication as a Decomposable Operator
      1.3.1 Matrix Multiplication as Decomposable Row and Column Operators
      1.3.2 Matrix Multiplication as Decomposable Geometric Operators
   1.4 Basic Problems in Machine Learning
      1.4.1 Matrix Factorization
      1.4.2 Clustering
      1.4.3 Classification and Regression Modeling
      1.4.4 Outlier Detection
   1.5 Optimization for Machine Learning
      1.5.1 The Taylor Expansion for Function Simplification
      1.5.2 Example of Optimization in Machine Learning
      1.5.3 Optimization in Computational Graphs
   1.6 Summary
   1.7 Further Reading
   1.8 Exercises

2 Linear Transformations and Linear Systems
   2.1 Introduction
      2.1.1 What Is a Linear Transform?
   2.2 The Geometry of Matrix Multiplication
   2.3 Vector Spaces and Their Geometry
      2.3.1 Coordinates in a Basis System
      2.3.2 Coordinate Transformations Between Basis Sets
      2.3.3 Span of a Set of Vectors
      2.3.4 Machine Learning Example: Discrete Wavelet Transform
      2.3.5 Relationships Among Subspaces of a Vector Space
   2.4 The Linear Algebra of Matrix Rows and Columns
   2.5 The Row Echelon Form of a Matrix
      2.5.1 LU Decomposition
      2.5.2 Application: Finding a Basis Set
      2.5.3 Application: Matrix Inversion
      2.5.4 Application: Solving a System of Linear Equations
   2.6 The Notion of Matrix Rank
      2.6.1 Effect of Matrix Operations on Rank
   2.7 Generating Orthogonal Basis Sets
      2.7.1 Gram-Schmidt Orthogonalization and QR Decomposition
      2.7.2 QR Decomposition
      2.7.3 The Discrete Cosine Transform
   2.8 An Optimization-Centric View of Linear Systems
      2.8.1 Moore-Penrose Pseudoinverse
      2.8.2 The Projection Matrix
   2.9 Ill-Conditioned Matrices and Systems
   2.10 Inner Products: A Geometric View
   2.11 Complex Vector Spaces
      2.11.1 The Discrete Fourier Transform
   2.12 Summary
   2.13 Further Reading
   2.14 Exercises

3 Eigenvectors and Diagonalizable Matrices
   3.1 Introduction
   3.2 Determinants
   3.3 Diagonalizable Transformations and Eigenvectors
      3.3.1 Complex Eigenvalues
      3.3.2 Left Eigenvectors and Right Eigenvectors
      3.3.3 Existence and Uniqueness of Diagonalization
      3.3.4 Existence and Uniqueness of Triangulization
      3.3.5 Similar Matrix Families Sharing Eigenvalues
      3.3.6 Diagonalizable Matrix Families Sharing Eigenvectors
      3.3.7 Symmetric Matrices
      3.3.8 Positive Semidefinite Matrices
      3.3.9 Cholesky Factorization: Symmetric LU Decomposition
   3.4 Machine Learning and Optimization Applications
      3.4.1 Fast Matrix Operations in Machine Learning
      3.4.2 Examples of Diagonalizable Matrices in Machine Learning
      3.4.3 Symmetric Matrices in Quadratic Optimization
      3.4.4 Diagonalization Application: Variable Separation for Optimization
      3.4.5 Eigenvectors in Norm-Constrained Quadratic Programming
   3.5 Numerical Algorithms for Finding Eigenvectors
      3.5.1 The QR Method via Schur Decomposition
      3.5.2 The Power Method for Finding Dominant Eigenvectors
   3.6 Summary
   3.7 Further Reading
   3.8 Exercises

4 Optimization Basics: A Machine Learning View
   4.1 Introduction
   4.2 The Basics of Optimization
      4.2.1 Univariate Optimization
         4.2.1.1 Why We Need Gradient Descent
         4.2.1.2 Convergence of Gradient Descent
         4.2.1.3 The Divergence Problem
      4.2.2 Bivariate Optimization
      4.2.3 Multivariate Optimization
   4.3 Convex Objective Functions
   4.4 The Minutiae of Gradient Descent
      4.4.1 Checking Gradient Correctness with Finite Differences
      4.4.2 Learning Rate Decay and Bold Driver
      4.4.3 Line Search
         4.4.3.1 Binary Search
         4.4.3.2 Golden-Section Search
         4.4.3.3 Armijo Rule
      4.4.4 Initialization
   4.5 Properties of Optimization in Machine Learning
      4.5.1 Typical Objective Functions and Additive Separability
      4.5.2 Stochastic Gradient Descent
      4.5.3 How Optimization in Machine Learning Is Different
      4.5.4 Tuning Hyperparameters
      4.5.5 The Importance of Feature Preprocessing
   4.6 Computing Derivatives with Respect to Vectors
      4.6.1 Matrix Calculus Notation
      4.6.2 Useful Matrix Calculus Identities
         4.6.2.1 Application: Unconstrained Quadratic Programming
         4.6.2.2 Application: Derivative of Squared Norm
      4.6.3 The Chain Rule of Calculus for Vectored Derivatives
         4.6.3.1 Useful Examples of Vectored Derivatives
   4.7 Linear Regression: Optimization with Numerical Targets
      4.7.1 Tikhonov Regularization
         4.7.1.1 Pseudoinverse and Connections to Regularization
      4.7.2 Stochastic Gradient Descent
      4.7.3 The Use of Bias
         4.7.3.1 Heuristic Initialization
   4.8 Optimization Models for Binary Targets
      4.8.1 Least-Squares Classification: Regression on Binary Targets
         4.8.1.1 Why Least-Squares Classification Loss Needs Repair
      4.8.2 The Support Vector Machine
         4.8.2.1 Computing Gradients
         4.8.2.2 Stochastic Gradient Descent
      4.8.3 Logistic Regression
         4.8.3.1 Computing Gradients
         4.8.3.2 Stochastic Gradient Descent
      4.8.4 How Linear Regression Is a Parent Problem in Machine Learning
   4.9 Optimization Models for the MultiClass Setting
      4.9.1 Weston-Watkins Support Vector Machine
         4.9.1.1 Computing Gradients
      4.9.2 Multinomial Logistic Regression
         4.9.2.1 Computing Gradients
         4.9.2.2 Stochastic Gradient Descent
   4.10 Coordinate Descent
      4.10.1 Linear Regression with Coordinate Descent
      4.10.2 Block Coordinate Descent
      4.10.3 K-Means as Block Coordinate Descent
   4.11 Summary
   4.12 Further Reading
   4.13 Exercises

5 Advanced Optimization Solutions
   5.1 Introduction
   5.2 Challenges in Gradient-Based Optimization
      5.2.1 Local Optima and Flat Regions
      5.2.2 Differential Curvature
         5.2.2.1 Revisiting Feature Normalization
      5.2.3 Examples of Difficult Topologies: Cliffs and Valleys
   5.3 Adjusting First-Order Derivatives for Descent
      5.3.1 Momentum-Based Learning
      5.3.2 AdaGrad
      5.3.3 RMSProp
      5.3.4 Adam
   5.4 The Newton Method
      5.4.1 The Basic Form of the Newton Method
      5.4.2 Importance of Line Search for Non-quadratic Functions
      5.4.3 Example: Newton Method in the Quadratic Bowl
      5.4.4 Example: Newton Method in a Non-quadratic Function
   5.5 Newton Methods in Machine Learning
      5.5.1 Newton Method for Linear Regression
      5.5.2 Newton Method for Support-Vector Machines
      5.5.3 Newton Method for Logistic Regression
      5.5.4 Connections Among Different Models and Unified Framework
   5.6 Newton Method: Challenges and Solutions
      5.6.1 Singular and Indefinite Hessian
      5.6.2 The Saddle-Point Problem
      5.6.3 Convergence Problems and Solutions with Non-quadratic Functions
         5.6.3.1 Trust Region Method
   5.7 Computationally Efficient Variations of Newton Method
      5.7.1 Conjugate Gradient Method
      5.7.2 Quasi-Newton Methods and BFGS
   5.8 Non-differentiable Optimization Functions
      5.8.1 The Subgradient Method
         5.8.1.1 Application: L1-Regularization
         5.8.1.2 Combining Subgradients with Coordinate Descent
      5.8.2 Proximal Gradient Method
         5.8.2.1 Application: Alternative for L1-Regularized Regression
      5.8.3 Designing Surrogate Loss Functions for Combinatorial Optimization
         5.8.3.1 Application: Ranking Support Vector Machine
      5.8.4 Dynamic Programming for Optimizing Sequential Decisions
         5.8.4.1 Application: Fast Matrix Multiplication
   5.9 Summary
   5.10 Further Reading
   5.11 Exercises

6 Constrained Optimization and Duality
   6.1 Introduction
   6.2 Primal Gradient Descent Methods
      6.2.1 Linear Equality Constraints
         6.2.1.1 Convex Quadratic Program with Equality Constraints
         6.2.1.2 Application: Linear Regression with Equality Constraints
         6.2.1.3 Application: Newton Method with Equality Constraints
      6.2.2 Linear Inequality Constraints
         6.2.2.1 The Special Case of Box Constraints
         6.2.2.2 General Conditions for Projected Gradient Descent to Work
         6.2.2.3 Sequential Linear Programming
      6.2.3 Sequential Quadratic Programming
   6.3 Primal Coordinate Descent
      6.3.1 Coordinate Descent for Convex Optimization Over Convex Set
      6.3.2 Machine Learning Application: Box Regression
   6.4 Lagrangian Relaxation and Duality
      6.4.1 Kuhn-Tucker Optimality Conditions
      6.4.2 General Procedure for Using Duality
         6.4.2.1 Inferring the Optimal Primal Solution from Optimal Dual Solution
      6.4.3 Application: Formulating the SVM Dual
         6.4.3.1 Inferring the Optimal Primal Solution from Optimal Dual Solution
      6.4.4 Optimization Algorithms for the SVM Dual
         6.4.4.1 Gradient Descent
         6.4.4.2 Coordinate Descent
      6.4.5 Getting the Lagrangian Relaxation of Unconstrained Problems
         6.4.5.1 Machine Learning Application: Dual of Linear Regression
   6.5 Penalty-Based and Primal-Dual Methods
      6.5.1 Penalty Method with Single Constraint
      6.5.2 Penalty Method: General Formulation
      6.5.3 Barrier and Interior Point Methods
   6.6 Norm-Constrained Optimization
   6.7 Primal Versus Dual Methods
   6.8 Summary
   6.9 Further Reading
   6.10 Exercises

7 Singular Value Decomposition
   7.1 Introduction
   7.2 SVD: A Linear Algebra Perspective
      7.2.1 Singular Value Decomposition of a Square Matrix
      7.2.2 Square SVD to Rectangular SVD via Padding
      7.2.3 Several Definitions of Rectangular Singular Value Decomposition
      7.2.4 Truncated Singular Value Decomposition
         7.2.4.1 Relating Truncation Loss to Singular Values
         7.2.4.2 Geometry of Rank-k Truncation
         7.2.4.3 Example of Truncated SVD
      7.2.5 Two Interpretations of SVD
      7.2.6 Is Singular Value Decomposition Unique?
      7.2.7 Two-Way Versus Three-Way Decompositions
   7.3 SVD: An Optimization Perspective
      7.3.1 A Maximization Formulation with Basis Orthogonality
      7.3.2 A Minimization Formulation with Residuals
      7.3.3 Generalization to Matrix Factorization Methods
      7.3.4 Principal Component Analysis
   7.4 Applications of Singular Value Decomposition
      7.4.1 Dimensionality Reduction
      7.4.2 Noise Removal
      7.4.3 Finding the Four Fundamental Subspaces in Linear Algebra
      7.4.4 Moore-Penrose Pseudoinverse
         7.4.4.1 Ill-Conditioned Square Matrices
      7.4.5 Solving Linear Equations and Linear Regression
      7.4.6 Feature Preprocessing and Whitening in Machine Learning
      7.4.7 Outlier Detection
      7.4.8 Feature Engineering
   7.5 Numerical Algorithms for SVD
   7.6 Summary
   7.7 Further Reading
   7.8 Exercises

8 Matrix Factorization
   8.1 Introduction
   8.2 Optimization-Based Matrix Factorization
      8.2.1 Example: K-Means as Constrained Matrix Factorization
   8.3 Unconstrained Matrix Factorization
      8.3.1 Gradient Descent with Fully Specified Matrices
      8.3.2 Application to Recommender Systems
         8.3.2.1 Stochastic Gradient Descent
         8.3.2.2 Coordinate Descent
         8.3.2.3 Block Coordinate Descent: Alternating Least Squares
   8.4 Nonnegative Matrix Factorization
      8.4.1 Optimization Problem with Frobenius Norm
         8.4.1.1 Projected Gradient Descent with Box Constraints
      8.4.2 Solution Using Duality
      8.4.3 Interpretability of Nonnegative Matrix Factorization
      8.4.4 Example of Nonnegative Matrix Factorization
      8.4.5 The I-Divergence Objective Function
   8.5 Weighted Matrix Factorization
      8.5.1 Practical Use Cases of Nonnegative and Sparse Matrices
      8.5.2 Stochastic Gradient Descent
         8.5.2.1 Why Negative Sampling Is Important
      8.5.3 Application: Recommendations with Implicit Feedback Data
      8.5.4 Application: Link Prediction in Adjacency Matrices
      8.5.5 Application: Word-Word Context Embedding with GloVe
   8.6 Nonlinear Matrix Factorizations
      8.6.1 Logistic Matrix Factorization
         8.6.1.1 Gradient Descent Steps for Logistic Matrix Factorization
      8.6.2 Maximum Margin Matrix Factorization
   8.7 Generalized Low-Rank Models
      8.7.1 Handling Categorical Entries
      8.7.2 Handling Ordinal Entries
   8.8 Shared Matrix Factorization
      8.8.1 Gradient Descent Steps for Shared Factorization
      8.8.2 How to Set Up Shared Models in Arbitrary Scenarios
   8.9 Factorization Machines
   8.10 Summary
   8.11 Further Reading
   8.12 Exercises

9 The Linear Algebra of Similarity
   9.1 Introduction
   9.2 Equivalence of Data and Similarity Matrices
      9.2.1 From Data Matrix to Similarity Matrix and Back
      9.2.2 When Is Data Recovery from a Similarity Matrix Useful?
      9.2.3 What Types of Similarity Matrices Are “Valid”?
      9.2.4 Symmetric Matrix Factorization as an Optimization Model
      9.2.5 Kernel Methods: The Machine Learning Terminology
   9.3 Efficient Data Recovery from Similarity Matrices
      9.3.1 Nyström Sampling
      9.3.2 Matrix Factorization with Stochastic Gradient Descent
      9.3.3 Asymmetric Similarity Decompositions
   9.4 Linear Algebra Operations on Similarity Matrices
      9.4.1 Energy of Similarity Matrix and Unit Ball Normalization
      9.4.2 Norm of the Mean and Variance
      9.4.3 Centering a Similarity Matrix
         9.4.3.1 Application: Kernel PCA
      9.4.4 From Similarity Matrix to Distance Matrix and Back
         9.4.4.1 Application: ISOMAP
   9.5 Machine Learning with Similarity Matrices
      9.5.1 Feature Engineering from Similarity Matrix
         9.5.1.1 Kernel Clustering
         9.5.1.2 Kernel Outlier Detection
         9.5.1.3 Kernel Classification
      9.5.2 Direct Use of Similarity Matrix
         9.5.2.1 Kernel K-Means
         9.5.2.2 Kernel SVM
   9.6 The Linear Algebra of the Representer Theorem
   9.7 Similarity Matrices and Linear Separability
      9.7.1 Transformations That Preserve Positive Semi-definiteness
   9.8 Summary
   9.9 Further Reading
   9.10 Exercises

10 The Linear Algebra of Graphs
   10.1 Introduction
   10.2 Graph Basics and Adjacency Matrices
   10.3 Powers of Adjacency Matrices
   10.4 The Perron-Frobenius Theorem
   10.5 The Right Eigenvectors of Graph Matrices
      10.5.1 The Kernel View of Spectral Clustering
         10.5.1.1 Relating Shi-Malik and Ng-Jordan-Weiss Embeddings
      10.5.2 The Laplacian View of Spectral Clustering
         10.5.2.1 Graph Laplacian
         10.5.2.2 Optimization Model with Laplacian
      10.5.3 The Matrix Factorization View of Spectral Clustering
         10.5.3.1 Machine Learning Application: Directed Link Prediction
      10.5.4 Which View of Spectral Clustering Is Most Informative?
   10.6 The Left Eigenvectors of Graph Matrices
      10.6.1 PageRank as Left Eigenvector of Transition Matrix
      10.6.2 Related Measures of Prestige and Centrality
      10.6.3 Application of Left Eigenvectors to Link Prediction
   10.7 Eigenvectors of Reducible Matrices
      10.7.1 Undirected Graphs
      10.7.2 Directed Graphs
   10.8 Machine Learning Applications
      10.8.1 Application to Vertex Classification
      10.8.2 Applications to Multidimensional Data
   10.9 Summary
   10.10 Further Reading
   10.11 Exercises

11 Optimization in Computational Graphs
   11.1 Introduction
   11.2 The Basics of Computational Graphs
      11.2.1 Neural Networks as Directed Computational Graphs
   11.3 Optimization in Directed Acyclic Graphs
      11.3.1 The Challenge of Computational Graphs
      11.3.2 The Broad Framework for Gradient Computation
      11.3.3 Computing Node-to-Node Derivatives Using Brute Force
      11.3.4 Dynamic Programming for Computing Node-to-Node Derivatives
         11.3.4.1 Example of Computing Node-to-Node Derivatives
      11.3.5 Converting Node-to-Node Derivatives into Loss-to-Weight Derivatives
         11.3.5.1 Example of Computing Loss-to-Weight Derivatives
      11.3.6 Computational Graphs with Vector Variables
   11.4 Application: Backpropagation in Neural Networks
      11.4.1 Derivatives of Common Activation Functions
      11.4.2 Vector-Centric Backpropagation
      11.4.3 Example of Vector-Centric Backpropagation
   11.5 A General View of Computational Graphs
   11.6 Summary
   11.7 Further Reading
   11.8 Exercises

Bibliography

Index



Preface

“Mathematics is the language with which God wrote the universe.”– Galileo
A frequent challenge faced by beginners in machine learning is the extensive background
required in linear algebra and optimization. One problem is that the existing linear algebra
and optimization courses are not specific to machine learning; therefore, one would typically
have to complete more course material than is necessary to pick up machine learning.
Furthermore, certain types of ideas and tricks from optimization and linear algebra recur
more frequently in machine learning than other application-centric settings. Therefore, there
is significant value in developing a view of linear algebra and optimization that is better
suited to the specific perspective of machine learning.
It is common for machine learning practitioners to pick up missing bits and pieces of linear algebra and optimization via “osmosis” while studying the solutions to machine learning
applications. However, this type of unsystematic approach is unsatisfying, because the primary focus on machine learning gets in the way of learning linear algebra and optimization
in a generalizable way across new situations and applications. Therefore, we have inverted
the focus in this book, with linear algebra and optimization as the primary topics of interest
and solutions to machine learning problems as the applications of this machinery. In other
words, the book goes out of its way to teach linear algebra and optimization with machine
learning examples. By using this approach, the book focuses on those aspects of linear algebra and optimization that are more relevant to machine learning and also teaches the
reader how to apply them in the machine learning context. As a side benefit, the reader
will pick up knowledge of several fundamental problems in machine learning. At the end
of the process, the reader will become familiar with many of the basic linear-algebra- and
optimization-centric algorithms in machine learning. Although the book is not intended to
provide exhaustive coverage of machine learning, it serves as a “technical starter” for the key
models and optimization methods in machine learning. Even for seasoned practitioners of
machine learning, a systematic introduction to fundamental linear algebra and optimization
methodologies can be useful in terms of providing a fresh perspective.
The chapters of the book are organized as follows:
1. Linear algebra and its applications: The chapters focus on the basics of linear algebra together with their common applications to singular value decomposition, matrix factorization, similarity matrices (kernel methods), and graph analysis. Numerous
machine learning applications have been used as examples, such as spectral clustering, kernel-based classification, and outlier detection. The tight integration of linear algebra methods with examples from machine learning differentiates this book from generic
volumes on linear algebra. The focus is clearly on the most relevant aspects of linear
algebra for machine learning and to teach readers how to apply these concepts.
2. Optimization and its applications: Much of machine learning is posed as an optimization problem in which we try to maximize the accuracy of regression and classification models. The “parent problem” of optimization-centric machine learning is
least-squares regression. Interestingly, this problem arises in both linear algebra and
optimization and is one of the key connecting problems of the two fields. Least-squares
regression is also the starting point for support vector machines, logistic regression,
and recommender systems. Furthermore, the methods for dimensionality reduction
and matrix factorization also require the development of optimization methods. A
general view of optimization in computational graphs is discussed together with its
applications to backpropagation in neural networks.
This book contains exercises both within the text of the chapter and at the end of the
chapter. The exercises within the text of the chapter should be solved as one reads the
chapter in order to solidify the concepts. This will lead to slower progress, but a better
understanding. For in-chapter exercises, hints for the solution are given in order to help the
reader along. The exercises at the end of the chapter are intended to be solved as refreshers
after completing the chapter.
Throughout this book, a vector or a multidimensional data point is annotated with a bar, such as \overline{X} or \overline{y}. A vector or multidimensional point may be denoted by either small letters or capital letters, as long as it has a bar. Vector dot products are denoted by centered dots, such as \overline{X} · \overline{Y}. A matrix is denoted in capital letters without a bar, such as R. Throughout the book, the n × d matrix corresponding to the entire training data set is denoted by D, with n data points and d dimensions. The individual data points in D are therefore d-dimensional row vectors and are often denoted by \overline{X}_1 ... \overline{X}_n. Conversely, vectors with one component for each data point are usually n-dimensional column vectors. An example is the n-dimensional column vector \overline{y} of class variables of n data points. An observed value y_i is distinguished from a predicted value \hat{y}_i by a circumflex at the top of the variable.
Yorktown Heights, NY, USA

Charu C. Aggarwal


Acknowledgments

I would like to thank my family for their love and support during the busy time spent in
writing this book. Knowledge of the very basics of optimization (e.g., calculus) and linear
algebra (e.g., vectors and matrices) starts in high school and increases over the course of
many years of undergraduate/graduate education as well as during the postgraduate years
of research. As such, I feel indebted to a large number of teachers and collaborators over
the years. This section is, therefore, a rather incomplete attempt to express my gratitude.
My initial exposure to vectors, matrices, and optimization (calculus) occurred during my
high school years, where I was ably taught these subjects by S. Adhikari and P. C. Pathrose.
Indeed, my love of mathematics started during those years, and I feel indebted to both these
individuals for instilling the love of these subjects in me. During my undergraduate study
in computer science at IIT Kanpur, I was taught several aspects of linear algebra and
optimization by Dr. R. Ahuja, Dr. B. Bhatia, and Dr. S. Gupta. Even though linear algebra
and mathematical optimization are distinct (but interrelated) subjects, Dr. Gupta’s teaching
style often provided an integrated view of these topics. I was able to fully appreciate the value
of such an integrated view when working in machine learning. For example, one can approach
many problems such as solving systems of equations or singular value decomposition either
from a linear algebra viewpoint or from an optimization viewpoint, and both perspectives
provide complementary views in different machine learning applications. Dr. Gupta’s courses
on linear algebra and mathematical optimization had a profound influence on me in choosing

mathematical optimization as my field of study during my PhD years; this choice was
relatively unusual for undergraduate computer science majors at that time. Finally, I had
the good fortune to learn about linear and nonlinear optimization methods from several
luminaries on these subjects during my graduate years at MIT. In particular, I feel indebted
to my PhD thesis advisor James B. Orlin for his guidance during my early years. In addition,
Nagui Halim has provided a lot of support for all my book-writing projects over the course
of a decade and deserves a lot of credit for my work in this respect. My manager, Horst
Samulowitz, has supported my work over the past year, and I would like to thank him for
his help.
I also learned a lot from my collaborators in machine learning over the years. One
often appreciates the true usefulness of linear algebra and optimization only in an applied
setting, and I had the good fortune of working with many researchers from different areas
on a wide range of machine learning problems. A lot of the emphasis in this book to specific
aspects of linear algebra and optimization is derived from these invaluable experiences and
collaborations. In particular, I would like to thank Tarek F. Abdelzaher, Jinghui Chen, Jing
Gao, Quanquan Gu, Manish Gupta, Jiawei Han, Alexander Hinneburg, Thomas Huang,
Nan Li, Huan Liu, Ruoming Jin, Daniel Keim, Arijit Khan, Latifur Khan, Mohammad
M. Masud, Jian Pei, Magda Procopiuc, Guojun Qi, Chandan Reddy, Saket Sathe, Jaideep
Srivastava, Karthik Subbian, Yizhou Sun, Jiliang Tang, Min-Hsuan Tsai, Haixun Wang,
Jianyong Wang, Min Wang, Suhang Wang, Wei Wang, Joel Wolf, Xifeng Yan, Wenchao Yu,
Mohammed Zaki, ChengXiang Zhai, and Peixiang Zhao.
Several individuals have also reviewed the book. Quanquan Gu provided suggestions
on Chapter 6. Jiliang Tang and Xiaorui Liu examined several portions of Chapter 6 and

pointed out corrections and improvements. Shuiwang Ji contributed Problem 7.2.3. Jie Wang
reviewed several chapters of the book and pointed out corrections. Hao Liu also provided
several suggestions.
Last but not least, I would like to thank my daughter Sayani for encouraging me to
write this book at a time when I had decided to hang up my boots on the issue of book
writing. She encouraged me to write this one. I would also like to thank my wife for fixing
some of the figures in this book.


Author Biography

Charu C. Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM
T. J. Watson Research Center in Yorktown Heights, New York. He completed his undergraduate degree in Computer Science from the Indian Institute of Technology at Kanpur in 1993 and his Ph.D. from the Massachusetts Institute of Technology in 1996.
He has worked extensively in the field of data mining. He has published more than 400 papers in refereed conferences and journals
and authored more than 80 patents. He is the author or editor
of 19 books, including textbooks on data mining, recommender
systems, and outlier analysis. Because of the commercial value of
his patents, he has thrice been designated a Master Inventor at
IBM. He is a recipient of an IBM Corporate Award (2003) for his
work on bioterrorist threat detection in data streams, a recipient
of the IBM Outstanding Innovation Award (2008) for his scientific
contributions to privacy technology, and a recipient of two IBM
Outstanding Technical Achievement Awards (2009, 2015) for his
work on data streams/high-dimensional data. He received the EDBT 2014 Test of Time
Award for his work on condensation-based privacy-preserving data mining. He is also a
recipient of the IEEE ICDM Research Contributions Award (2015) and the ACM SIGKDD
Innovation Award (2019), which are the two highest awards for influential research contributions in data mining.
He has served as the general cochair of the IEEE Big Data Conference (2014) and as
the program cochair of the ACM CIKM Conference (2015), the IEEE ICDM Conference
(2015), and the ACM KDD Conference (2016). He served as an associate editor of the IEEE

Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate
editor of the IEEE Transactions on Big Data, an action editor of the Data Mining and
Knowledge Discovery Journal, and an associate editor of the Knowledge and Information
Systems Journal. He serves as the editor-in-chief of the ACM Transactions on Knowledge
Discovery from Data as well as the ACM SIGKDD Explorations. He serves on the advisory
board of the Lecture Notes on Social Networks, a publication by Springer. He has served
as the vice president of the SIAM Activity Group on Data Mining and is a member of the
SIAM Industry Committee. He is a fellow of the SIAM, ACM, and IEEE, for “contributions
to knowledge discovery and data mining algorithms.”


Chapter 1

Linear Algebra and Optimization: An Introduction
“No matter what engineering field you’re in, you learn the same basic science
and mathematics. And then maybe you learn a little bit about how to apply
it.”–Noam Chomsky

1.1 Introduction

Machine learning builds mathematical models from data containing multiple attributes (i.e.,
variables) in order to predict some variables from others. For example, in a cancer prediction application, each data point might contain the variables obtained from running clinical
tests, whereas the predicted variable might be a binary diagnosis of cancer. Such models are
sometimes expressed as linear and nonlinear relationships between variables. These relationships are discovered in a data-driven manner by optimizing (maximizing) the “agreement”
between the models and the observed data. This is an optimization problem.
Linear algebra is the study of linear operations in vector spaces. An example of a vector
space is the infinite set of all possible Cartesian coordinates in two dimensions in relation to
a fixed point referred to as the origin, and each vector (i.e., a 2-dimensional coordinate) can
be viewed as a member of this set. This abstraction fits in nicely with the way data is represented in machine learning as points with multiple dimensions, albeit with dimensionality
that is usually greater than 2. These dimensions are also referred to as attributes in machine
learning parlance. For example, each patient in a medical application might be represented
by a vector containing many attributes, such as age, blood sugar level, inflammatory markers, and so on. It is common to apply linear functions to these high-dimensional vectors in
many application domains in order to extract their analytical properties. The study of such
linear transformations lies at the heart of linear algebra.
While it is easy to visualize the spatial geometry of points/operations in 2 or 3 dimensions, it becomes harder to do so in higher dimensions. For example, it is simple to visualize
a 2-dimensional rotation of an object, but it is hard to visualize a 20-dimensional object and
its corresponding rotation. This is one of the primary challenges associated with linear algebra. However, with some practice, one can transfer spatial intuitions to higher dimensions.
Linear algebra can be viewed as a generalized form of the geometry of Cartesian coordinates
in d dimensions. Just as one can use analytical geometry in two dimensions in order to find
the intersection of two lines in the plane, one can generalize this concept to any number of
dimensions. The resulting method is referred to as Gaussian elimination for solving systems
of equations, and it is one of the fundamental cornerstones of linear algebra. Indeed, the
problem of linear regression, which is fundamental to linear algebra, optimization, and machine learning, is closely related to solving systems of equations. This book will introduce
linear algebra and optimization with a specific focus on machine learning applications.
This chapter is organized as follows. The next section introduces the definitions of vectors
and matrices and important operations. Section 1.3 closely examines the nature of matrix
multiplication with vectors and its interpretation as the composition of simpler transformations on vectors. In Section 1.4, we will introduce the basic problems in machine learning
that are used as application examples throughout this book. Section 1.5 will introduce the
basics of optimization, and its relationship with the different types of machine learning
problems. A summary is given in Section 1.6.

1.2 Scalars, Vectors, and Matrices

We start by introducing the notions of scalars, vectors, and matrices, which are the fundamental structures associated with linear algebra.
1. Scalars: Scalars are individual numerical values that are typically drawn from the real
domain in most machine learning applications. For example, the value of an attribute
such as Age in a machine learning application is a scalar.
2. Vectors: Vectors are arrays of numerical values (i.e., arrays of scalars). Each such
numerical value is also referred to as a coordinate. The individual numerical values of
the arrays are referred to as entries, components, or dimensions of the vector, and the
number of components is referred to as the vector dimensionality. In machine learning,
a vector might contain components (associated with a data point) corresponding to
numerical values like Age, Salary, and so on. A 3-dimensional vector representation
of a 25-year-old person making 30 dollars an hour, and having 5 years of experience
might be written as the array of numbers [25, 30, 5].
3. Matrices: Matrices can be viewed as rectangular arrays of numerical values containing
both rows and columns. In order to access an element in the matrix, one must
specify its row index and its column index. For example, consider a data set in a
machine learning application containing d properties of n individuals. Each individual
is allocated a row, and each property is allocated a column. In such a case, we can
define a data matrix, in which each row is a d-dimensional vector containing the
properties of one of the n individuals. The size of such a matrix is denoted by the
notation n × d. An element of the matrix is accessed with the pair of indices (i, j), where
the first element i is the row index, and the second element j is the column index.
The row index increases from top to bottom, whereas the column index increases from
left to right. The value of the (i, j)th entry of the matrix is therefore equal to the jth
property of the ith individual. When we define a matrix A = [a_{ij}], it refers to the fact
that the (i, j)th element of A is denoted by a_{ij}. Furthermore, defining A = [a_{ij}]_{n×d}
refers to the fact that the size of A is n × d. When a matrix has the same number of
rows as columns, it is referred to as a square matrix. Otherwise, it is referred to as a
rectangular matrix. A rectangular matrix with more rows than columns is referred to
as tall, whereas a matrix with more columns than rows is referred to as wide or fat.
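
As a concrete illustration of these definitions, the short sketch below builds a scalar, a vector, and an n × d data matrix, and accesses an entry by its row and column indices. Python with NumPy is assumed here purely for illustration; the text itself does not prescribe any programming environment, and the data values are hypothetical.

    import numpy as np

    # A scalar: a single numerical value, such as the attribute Age.
    age = 25.0

    # A vector: an array of scalars; here a 3-dimensional point [Age, Wage, Experience].
    x = np.array([25.0, 30.0, 5.0])

    # A data matrix with n = 4 individuals (rows) and d = 3 properties (columns).
    D = np.array([
        [25.0, 30.0, 5.0],
        [47.0, 52.0, 21.0],
        [33.0, 41.0, 9.0],
        [29.0, 38.0, 6.0],
    ])

    n, d = D.shape                  # the size of the matrix is n x d
    value = D[1, 2]                 # entry in the 2nd row and 3rd column (0-based indexing)
    row_vector = D[1, :]            # all d properties of one individual (a row vector)
    column_vector = D[:, 2]         # one property of all n individuals (conceptually a column vector)

    print(n, d, value)              # 4 3 21.0
    print(D.shape[0] > D.shape[1])  # True: more rows than columns, i.e., a "tall" matrix

Note that the (i, j) indices in the text are 1-based mathematical indices, whereas most programming languages, including the one used in this sketch, index rows and columns starting from 0.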
It is possible for scalars, vectors, and matrices to contain complex numbers. This book will
occasionally discuss complex-valued vectors when they are relevant to machine learning.
Vectors are special cases of matrices, and scalars are special cases of both vectors and
matrices. For example, a scalar is sometimes viewed as a 1 × 1 “matrix.” Similarly, a d-dimensional vector can be viewed as a 1 × d matrix when it is treated as a row vector. It
can also be treated as a d × 1 matrix when it is a column vector. The addition of the word
“row” or “column” to the vector definition is indicative of whether that vector is naturally
a row of a larger matrix or whether it is a column of a larger matrix. By default, vectors
are assumed to be column vectors in linear algebra, unless otherwise specified. We always
use an overbar on a variable to indicate that it is a vector, although we do not do so for
matrices or scalars. For example, the row vector [y1 , . . . , yd ] of d values can be denoted by y
or Y . In this book, scalars are always represented by lower-case variables like a or δ, whereas
matrices are always represented by upper-case variables like A or Δ.
In the sciences, a vector is often geometrically visualized as a quantity, such as the velocity, that has a magnitude as well as a direction. Such vectors are referred to as geometric
vectors. For example, imagine a situation where the positive direction of the X-axis corresponds to the eastern direction, and the positive direction of the Y -axis corresponds to
the northern direction. Then, a person that is simultaneously moving at 4 meters/second
in the eastern direction and at 3 meters/second in the northern direction is really moving
in the north-eastern direction in a straight line at \sqrt{4^2 + 3^2} = 5 meters/second (based on
the Pythagorean theorem). This is also the length of the vector. The vector of the velocity
of this person can be written as a directed line from the origin to [4, 3]. This vector is
shown in Figure 1.1(a). In this case, the tail of the vector is at the origin, and the head of
the vector is at [4, 3]. Geometric vectors in the sciences are allowed to have arbitrary tails.
For example, we have shown another example of the same vector [4, 3] in Figure 1.1(a) in
which the tail is placed at [1, 4] and the head is placed at [5, 7]. In contrast to geometric
vectors, only vectors that have tails at the origin are considered in linear algebra (although
the mathematical results, principles, and intuition remain the same). This does not lead to
any loss of expressivity. All vectors, operations, and spaces in linear algebra use the origin
as an important reference point.

1.2.1 Basic Operations with Scalars and Vectors

Vectors of the same dimensionality can be added or subtracted. For example, consider two
d-dimensional vectors x = [x1 . . . xd ] and y = [y1 . . . yd ] in a retail application, where the
ith component defines the volume of sales for the ith product. In such a case, the vector of
aggregate sales is x + y, and its ith component is x_i + y_i:

x + y = [x_1 ... x_d] + [y_1 ... y_d] = [x_1 + y_1 ... x_d + y_d]

Vector subtraction is defined in the same way:

x − y = [x_1 ... x_d] − [y_1 ... y_d] = [x_1 − y_1 ... x_d − y_d]


Figure 1.1: Examples of vector definition and basic operations. (a) Non-origin vectors (not allowed); (b) vector addition; (c) vector normalization.
Vector addition is commutative (like scalar addition) because x + y = y + x. When two vectors, x and y, are added, the origin, x, y, and x + y represent the vertices of a parallelogram.
For example, consider the vectors A = [4, 3] and B = [1, 4]. The sum of these two vectors
is A + B = [5, 7]. The addition of these two vectors is shown in Figure 1.1(b). It is easy to
show that the four points [0, 0], [4, 3], [1, 4], and [5, 7] form a parallelogram in 2-dimensional
space, and the addition of the vectors is one of the diagonals of the parallelogram. The
other diagonal can be shown to be parallel to either A − B or B − A, depending on the
direction of the vector. Note that vector addition and subtraction follow the same rules
in linear algebra as for geometric vectors, except that the tails of the vectors are always
origin rooted. For example, the vector (A − B) should no longer be drawn as a diagonal of
the parallelogram, but as an origin-rooted vector with the same direction as the diagonal.
Nevertheless, the diagonal abstraction still helps in the computation of (A − B). One way of
visualizing vector addition (in terms of the velocity abstraction) is that if a platform moves
on the ground with velocity [1, 4], and if the person walks on the platform (relative to it)
with velocity [4, 3], then the overall velocity of the person relative to the ground is [5, 7].
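
The parallelogram example above can be checked numerically. The sketch below (again assuming NumPy purely for illustration) adds and subtracts the vectors A = [4, 3] and B = [1, 4] componentwise and confirms that vector addition is commutative.

    import numpy as np

    A = np.array([4.0, 3.0])
    B = np.array([1.0, 4.0])

    # Component-wise addition of same-dimensional vectors: one diagonal of the
    # parallelogram with vertices [0, 0], A, B, and A + B.
    print(A + B)                          # [5. 7.]

    # Component-wise subtraction: parallel to the other diagonal of the parallelogram.
    print(A - B)                          # [ 3. -1.]
    print(B - A)                          # [-3.  1.]

    # Vector addition is commutative, like scalar addition.
    print(np.array_equal(A + B, B + A))   # True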
It is possible to multiply a vector with a scalar by multiplying each component of the
vector with the scalar. Consider a vector x = [x_1, ..., x_d], which is scaled by a factor of a:

x' = a x = [a x_1 ... a x_d]

For example, if the vector x contains the number of units sold of each product, then one
can use a = 10^{-6} to convert units sold into number of millions of units sold. The scalar
multiplication operation simply scales the length of the vector, but does not change its
direction (i.e., relative values of different components). The notion of “length” is defined
more formally in terms of the norm of the vector, which is discussed below.
Vectors can be multiplied with the notion of the dot product. The dot product between
two vectors, x = [x_1, ..., x_d] and y = [y_1, ..., y_d], is the sum of the element-wise multiplication
of their individual components. The dot product of x and y is denoted by x · y (with a dot
in the middle) and is formally defined as follows:

x · y = \sum_{i=1}^{d} x_i y_i        (1.1)



Consider a case where we have x = [1, 2, 3] and y = [6, 5, 4]. In such a case, the dot product
of these two vectors can be computed as follows:
x · y = (1)(6) + (2)(5) + (3)(4) = 28        (1.2)

The dot product is a special case of a more general operation, referred to as the inner
product, and it preserves many fundamental rules of Euclidean geometry. The space of
vectors that includes a dot product operation is referred to as a Euclidean space. The dot
product is a commutative operation:
x · y = \sum_{i=1}^{d} x_i y_i = \sum_{i=1}^{d} y_i x_i = y · x

The dot product also inherits the distributive property of scalar multiplication:

x · (y + z) = x · y + x · z
The dot product of a vector, x = [x_1, ..., x_d], with itself is referred to as its squared norm
or Euclidean norm. The norm defines the vector length and is denoted by \| · \|:

\|x\|^2 = x · x = \sum_{i=1}^{d} x_i^2

The norm of the vector is the Euclidean distance of its coordinates from the origin. In
the case of Figure 1.1(a), the norm of the vector [4, 3] is \sqrt{4^2 + 3^2} = 5. Often, vectors are
normalized to unit length by dividing them with their norm:

x' = \frac{x}{\|x\|} = \frac{x}{\sqrt{x · x}}

Scaling a vector by its norm does not change the relative values of its components, which
define the direction of the vector. For example, the Euclidean distance of [4, 3] from the
origin is 5. Dividing each component of the vector by 5 results in the vector [4/5, 3/5],
which changes the length of the vector to 1, but not its direction. This shortened vector is
shown in Figure 1.1(c), and it overlaps with the vector [4, 3]. The resulting vector is referred
to as a unit vector.
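
The dot product, norm, and normalization operations above can be reproduced in a few lines of code. The following sketch (NumPy is assumed here only as an illustrative tool) verifies the dot product of Equation 1.2 and the normalization of the vector [4, 3] to [4/5, 3/5].

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([6.0, 5.0, 4.0])

    # Dot product of Equation 1.1: sum of the element-wise products of the components.
    print(np.dot(x, y))                   # 28.0, matching Equation 1.2
    print(np.dot(x, y) == np.dot(y, x))   # True: the dot product is commutative

    v = np.array([4.0, 3.0])

    # The squared norm is the dot product of the vector with itself.
    print(np.dot(v, v))                   # 25.0
    print(np.linalg.norm(v))              # 5.0, the Euclidean length of [4, 3]

    # Dividing by the norm yields a unit vector with the same direction.
    unit_v = v / np.linalg.norm(v)
    print(unit_v)                         # [0.8 0.6], i.e., [4/5, 3/5]
    print(np.linalg.norm(unit_v))         # 1.0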
A generalization of the Euclidean norm is the L_p-norm, which is denoted by \| · \|_p:

\|x\|_p = \left( \sum_{i=1}^{d} |x_i|^p \right)^{1/p}        (1.3)

Here, | · | indicates the absolute value of a scalar, and p is a positive integer. For example,
when p is set to 1, the resulting norm is referred to as the Manhattan norm or the L1 -norm.
The (squared) Euclidean distance between x = [x_1, ..., x_d] and y = [y_1, ..., y_d] can be
shown to be the dot product of x − y with itself:

\|x − y\|^2 = (x − y) · (x − y) = \sum_{i=1}^{d} (x_i − y_i)^2 = Euclidean(x, y)^2
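
Both the L_p-norm of Equation 1.3 and the dot-product form of the Euclidean distance can be checked directly. The sketch below (NumPy again being only an assumed choice) computes the L1 and L2 norms of a vector and the squared distance between two points in three equivalent ways.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([6.0, 5.0, 4.0])

    # Lp-norm of Equation 1.3: the sum of |x_i|^p, raised to the power 1/p.
    def lp_norm(v, p):
        return np.sum(np.abs(v) ** p) ** (1.0 / p)

    print(lp_norm(x, 1))                  # 6.0, the Manhattan (L1) norm
    print(lp_norm(x, 2))                  # 3.7416..., the Euclidean (L2) norm
    print(np.linalg.norm(x))              # 3.7416..., the same value

    # Squared Euclidean distance as the dot product of (x - y) with itself.
    diff = x - y
    print(np.dot(diff, diff))             # 35.0
    print(np.sum((x - y) ** 2))           # 35.0, written as a sum of squared differences
    print(np.linalg.norm(x - y) ** 2)     # 35.0, up to floating-point rounding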


Figure 1.2: The angular geometry of vectors A and B (a vector of length 2 at 60° above the X-axis and a vector of length 1 at 15° below it).
Dot products satisfy the Cauchy-Schwarz inequality, according to which the dot product
between a pair of vectors is bounded above by the product of their lengths:

\left| \sum_{i=1}^{d} x_i y_i \right| = |x · y| ≤ \|x\| \|y\|        (1.4)

The Cauchy-Schwarz inequality can be proven by first showing that |x · y| ≤ 1 when x and y
are unit vectors (i.e., the result holds when the arguments are unit vectors). This is because
both \|x − y\|^2 = 2 − 2 x · y and \|x + y\|^2 = 2 + 2 x · y are nonnegative. This is possible only when
|x · y| ≤ 1. One can then generalize this result to arbitrary length vectors by observing that
the dot product scales up linearly with the norms of the underlying arguments. Therefore,
one can scale up both sides of the inequality with the norms of the vectors.
Problem 1.2.1 (Triangle Inequality) Consider the triangle formed by the origin, x, and
y. Use the Cauchy-Schwarz inequality to show that the side length \|x − y\| is no greater than
the sum \|x\| + \|y\| of the other two sides.
A hint for solving the above problem is that both sides of the triangle inequality are nonnegative. Therefore, the inequality is true if and only if it holds after squaring both sides.
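
Neither the Cauchy-Schwarz inequality nor the triangle inequality of Problem 1.2.1 is established by numerical experiments, but a quick randomized spot check can build intuition. The sketch below (assuming NumPy for illustration) tests both inequalities on randomly generated vectors.

    import numpy as np

    rng = np.random.default_rng(0)

    for _ in range(1000):
        x = rng.normal(size=5)
        y = rng.normal(size=5)

        # Cauchy-Schwarz inequality (Equation 1.4): |x . y| <= ||x|| ||y||.
        assert abs(np.dot(x, y)) <= np.linalg.norm(x) * np.linalg.norm(y) + 1e-12

        # Triangle inequality (Problem 1.2.1): ||x - y|| <= ||x|| + ||y||.
        assert np.linalg.norm(x - y) <= np.linalg.norm(x) + np.linalg.norm(y) + 1e-12

    print("No counterexamples found in 1000 random trials.")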
The Cauchy-Schwarz inequality shows that the dot product between a pair of vectors is
no greater than the product of vector lengths. In fact, the ratio between these two quantities
is the cosine of the angle between the two vectors (which is always less than 1). For example,
one often represents the coordinates of a 2-dimensional vector in polar form as [a, θ], where
a is the length of the vector, and θ is the counter-clockwise angle the vector makes with
the X-axis. The Cartesian coordinates are [a cos(θ), a sin(θ)], and the dot product of this
Cartesian coordinate vector with [1, 0] (the X-axis) is a cos(θ). As another example, consider
two vectors with lengths 2 and 1, respectively, which make (counter-clockwise) angles of
60° and −15° with respect to the X-axis in a 2-dimensional setting. These vectors are
shown in Figure 1.2. The coordinates of these vectors are [2 cos(60°), 2 sin(60°)] = [1, \sqrt{3}] and
[cos(−15°), sin(−15°)] = [0.966, −0.259].
The cosine function between two vectors x = [x_1 ... x_d] and y = [y_1 ... y_d] is algebraically
defined by the dot product between the two vectors after scaling them to unit norm:

\cos(x, y) = \frac{x · y}{\sqrt{x · x} \sqrt{y · y}} = \frac{x · y}{\|x\| \|y\|}        (1.5)

The algebraically computed cosine function over x and y has the normal trigonometric
interpretation of being equal to cos(θ), where θ is the angle between the vectors x and y.
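
Equation 1.5 can be evaluated for the two vectors of Figure 1.2. Since they make angles of 60° and −15° with the X-axis, the angle between them is 75°, and the algebraic cosine should equal cos(75°) ≈ 0.259. The sketch below (NumPy is again only an assumed illustration) confirms this.

    import numpy as np

    # The two vectors of Figure 1.2: length 2 at 60 degrees and length 1 at -15 degrees.
    a = np.array([2 * np.cos(np.radians(60)), 2 * np.sin(np.radians(60))])  # [1.0, 1.732]
    b = np.array([np.cos(np.radians(-15)), np.sin(np.radians(-15))])        # [0.966, -0.259]

    # Algebraic cosine of Equation 1.5: dot product divided by the product of the norms.
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine)                          # 0.2588..., equal to cos(75 degrees)
    print(np.cos(np.radians(75)))          # 0.2588...
    print(np.degrees(np.arccos(cosine)))   # 75.0, the angle between the two vectors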

