Sets and Functions: Sets

Programming in Julia

1 A set is an unordered collection of objects. The objects in a set are called elements.

min(m, n)
1 A vector in Rn is a column of n real numbers, also written

2 The cardinality of a set is the number of elements it con-

by a program. Values have types; for example, 5 is an Int and
"Hello world!" is a String.

as [v1 , . . . , vn ]. A vector may be depicted as an arrow from
the origin in n-dimensional space. The norm of a vector v is

2 A variable is a name used to refer to a value. We can
assign a value 5 to a variable x using x = 5.

the length

3 If every element of A is also an element of B, then we say

tains. The empty set ∅ is the set with no elements.

A is a subset of B and write A ⊂ B. If A ⊂ B and B ⊂ A,
then we say that A = B.
4 Set operations:

(i) An element is in the union A ∪ B of two sets A and B if
it is in A or B.
(ii) An element is in the intersection A ∩ B of two sets A
and B if it is in A and B.
(iii) An element is in the set difference A \ B if it is in A but
not B.
(iv) Given a set Ω and a set A ⊂ Ω, the complement of A
with respect to Ω is A^c = Ω \ A.
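The set operations listed above can be tried directly in Julia. This is a minimal sketch using hypothetical example sets (the names Ω, A, and B and their contents are chosen only for illustration):

```julia
# Hypothetical example sets; Ω plays the role of the universe.
Ω = Set(1:10)
A = Set([1, 2, 3, 4])
B = Set([3, 4, 5])

A_union_B = union(A, B)         # A ∪ B
A_inter_B = intersect(A, B)     # A ∩ B
A_minus_B = setdiff(A, B)       # A \ B
A_complement = setdiff(Ω, A)    # complement of A with respect to Ω

println(length(A))              # cardinality of A
println(issubset(A, Ω))         # whether A ⊂ Ω
```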


Linear Algebra: Vector Spaces

1 A value is a fundamental entity that may be manipulated










3 A function performs a particular task. You prompt a
function to perform its task by calling it. Values supplied to a
function are called arguments. For example, in the function
call print(1,2), 1 and 2 are arguments.

5 Two sets A and B are disjoint if A ∩ B = ∅ (in other words, if they have no elements in common).

6 A partition of a set is a collection of nonempty disjoint
subsets whose union is the whole set.
7 The Cartesian product of A and B is

7 A numerical value can be either an integer or a float. The basic operations are +, -, *, /, ^, and expressions are evaluated according to the order of operations.
9 Textual data is represented using strings. length(s) returns the number of characters in s. The * operator concatenates strings.

10 A boolean is a value which is either true or false.


2 The set A is called the domain of f and B is called the
codomain of f .
3 Given a subset A′ of A, we define the image of A′ under f, denoted f(A′), to be the set of elements of B which are mapped to from some element in A′.

if x > 0
    "x is positive"
elseif x == 0
    "x is zero"
else
    "x is negative"
end

13 The scope of a variable is the region in the program where it is accessible. Variables defined in the body of a function are not accessible outside the body of the function.

14 Array is a compound data type for storing lists of objects. Entries of an array may be accessed with square bracket
syntax using an index or using a range object a:b.
A = [-5,3,2,1]

15 An array comprehension can be used to generate new
arrays: [k^2 for k=1:10 if mod(k,2) == 0]
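As a small illustration of indexing and comprehensions, the sketch below reuses the array A from the text; the variable names are chosen only for this example:

```julia
A = [-5, 3, 2, 1]

first_entry = A[1]      # Julia arrays are 1-indexed
middle = A[2:3]         # a range object selects a sub-array
last_entry = A[end]     # `end` refers to the final index

# Squares of the even numbers between 1 and 10
even_squares = [k^2 for k = 1:10 if mod(k, 2) == 0]
```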

6 The identity function on a set A is the function f : A →
A which maps each element to itself.

16 A dictionary encodes a discrete function by storing
input-output pairs and looking up input values when indexed. This expression returns [0,0,1.0]:
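A minimal sketch of one such lookup follows. The color names used as keys are hypothetical and chosen only for illustration:

```julia
# Keys are inputs, values are outputs of the discrete function.
colors = Dict("blue" => [0, 0, 1.0], "red" => [1.0, 0, 0])
blue = colors["blue"]   # look up the value stored under the key "blue"
```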

codomain of f .

9 A function f is bijective if it is both injective and surjective. If f is bijective, then the function from B to A that maps b ∈ B to the element a ∈ A that satisfies f(a) = b is called the inverse of f.
10 If f : A → B is bijective, then the function f⁻¹ ∘ f is equal to the identity function on A, and f ∘ f⁻¹ is the identity function on B.

can be written as a linear combination of the vectors in L.

5 A list of vectors is linearly independent if and only if the
only linear combination which yields the zero vector is the
one with all weights zero.


17 A while loop takes a conditional expression and a body and evaluates them alternately until the conditional expression returns false. A for loop evaluates its body once for each entry in a given iterator (for example, a range, array, or dictionary). Each value in the iterator is assigned to a loop variable which can be referenced in the body of the loop.
while x > 0
    x -= 1
end

9 A linear transformation L is a function from a vector space V to a vector space W which satisfies L(cv + w) = cL(v) + L(w) for all c ∈ R and v, w ∈ V. These are “flat maps”: equally spaced lines are mapped to equally spaced lines or points. Examples: scaling, rotation, projection, reflection.
10 Given two vector spaces V and W, a basis {v1 , . . . , vn }

g : B → C is the function g ◦ f which maps a ∈ A to
g( f ( a)) ∈ C.

8 A function f is surjective if the range of f is equal to the

4 The span of a list L of vectors is the set of all vectors which

called a basis of that vector space. The number of vectors in
a basis of a vector space is called the dimension of the space.

function f(x,y; shift=0)
    3x + 2y + shift
end

5 The composition of two functions f : A → B and

map to the same element in the codomain.

where c1 , . . . , ck are real numbers. The c’s are called the
weights of the linear combination.

8 A linearly independent spanning list of a vector space is

12 Functions may be defined using the familiar math notation: f(x,y) = 3x + 2y or using a function block (shift is
a keyword argument):

4 The range of f is the image of the domain of f .

7 A function f is injective if no two elements in the domain

11 The transpose distributes across matrix multiplication but with an order reversal: (AB)′ = B′A′ if A and B are matrices for which AB is defined.

7 A list of vectors in a vector space is a spanning list of that
vector space if every vector in the vector space can be written
as a linear combination of the vectors in that list.

Sets and Functions: Functions
1 If A and B are sets, then a function f : A → B is an
assignment of some element of B to each element of A.

c1 v1 + c2 v2 + · · · + ck vk,

11 Code blocks can be executed conditionally:

∩ Bc .

Lists differ from sets in that (i) order matters, (ii) repetition
matters, and (iii) the cardinality is restricted.

10 The transpose is a linear operator: (cA + B)′ = cA′ + B′ if c is a constant and A and B are matrices.

6 A vector space is a nonempty set of vectors which is
closed under the vector space operations.

(i) (A ∩ B)^c = A^c ∪ B^c, and
9 A list is an ordered collection of finitely many objects.

3 A linear combination of a list of vectors v1 , . . . , vk is an
expression of the form

Booleans can be combined with the operators && (and), ||
(or), ! (not).

8 (De Morgan’s laws) If A, B ⊂ Ω, then


9 The transpose A′ of a matrix A is defined so that the rows of A′ are the columns of A (and vice versa).

5 A statement is an instruction to be executed (like x = -3).
6 An expression is a combination of values, variables, operators, and function calls that a language interprets and
evaluates to a value.

for i=1:10

7 If the columns of a square matrix A are linearly independent, then it has a unique inverse matrix A⁻¹ with the property that Ax = b implies x = A⁻¹b for all x and b.
8 Matrix inversion satisfies (AB)⁻¹ = B⁻¹A⁻¹ if A and B are both invertible.

way. For example, * is an operator since we can call the multiplication function with the syntax 3 * 5.

A × B = {( a, b) : a ∈ A and b ∈ B}.

(ii) (A ∪ B)^c = A^c ∩ B^c.

2 The fundamental vector space operations are vector addition and scalar multiplication.

4 An operator is a function that can be called in a special

8 Numbers can be compared using <,>,==,≤ or ≥.


√(v1² + · · · + vn²) of its arrow.

6 Ax = b has a solution x if and only if b is in the span of the columns of A. If Ax = b does have a solution, then the solution is unique if and only if the columns of A are linearly independent. If Ax = b does not have a solution, then there is a vector x which minimizes |Ax − b|², and this minimizer is unique when the columns of A are linearly independent.
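The least squares statement above can be checked numerically. The sketch below assumes a small hypothetical system with independent columns and a right-hand side outside the column span; for a tall matrix, Julia's backslash operator returns the least squares solution:

```julia
using LinearAlgebra

A = [1.0 0.0; 1.0 1.0; 1.0 2.0]   # hypothetical 3 × 2 matrix with independent columns
b = [1.0, 2.0, 2.0]               # not in the span of the columns of A

x = A \ b                         # least squares minimizer of |Ax - b|²
residual = A * x - b              # nonzero, since Ax = b has no solution
```

At the minimizer the residual is orthogonal to every column of A, which is the same as saying that x solves A′Ax = A′b.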

of V, and a list {w1 , . . . , wn } of vectors in W, there exists one
and only one linear transformation which maps v1 to w1 , v2
to w2 , and so on.
11 The rank of a linear transformation from one vector
space to another is the dimension of its range.
12 The null space of a linear transformation is the set of vectors which are mapped to the zero vector by the linear transformation.
13 The rank of a transformation plus the dimension of its null space is equal to the dimension of its domain (the rank-nullity theorem).

Linear Algebra: Matrix Algebra
1 The matrix-vector product Ax is the linear combination
of the columns of A with weights given by the entries of x.
2 Linear transformations from Rn to Rm are in one-to-one
correspondence with m × n matrices.

12 A matrix A is symmetric if A = A′.
13 A linear transformation T from R^n to R^n scales all n-dimensional volumes by the same factor: the (absolute value of the) determinant of T.
14 The sign of the determinant tells us whether T reverses orientation.
15 det AB = det A det B and det A⁻¹ = (det A)⁻¹.
16 A square matrix is invertible if and only if its determinant is nonzero.

Linear Algebra: Orthogonality
1 The dot product of two vectors in R^n is defined by

x · y = x1 y1 + x2 y2 + · · · + xn yn.
2 x · y = |x||y| cos θ, where x, y ∈ R^n and θ is the angle between the vectors.
3 x · y = 0 if and only if x and y are orthogonal.
4 The dot product is linear: x · (cy + z) = cx · y + x · z.
5 The orthogonal complement of a subspace V ⊂ R^n is the set of vectors which are orthogonal to every vector in V.
6 The orthogonal complement of the span of the columns of a matrix A is equal to the null space of A′.
7 rank A = rank A′A for any matrix A.
8 A list of vectors satisfying vi · vj = 0 for i ≠ j is orthogonal. An orthogonal list of unit vectors is orthonormal.
9 Every orthogonal list of nonzero vectors is linearly independent.
10 A matrix U has orthonormal columns if and only if U′U = I. A square matrix with orthonormal columns is called orthogonal. An orthogonal matrix and its transpose are inverses.
11 Orthogonal matrices represent rigid transformations (ones which preserve lengths and angles).
12 If U has orthonormal columns, then UU′ is the matrix which represents projection onto the span of the columns of U.

3 The identity transformation corresponds to the identity
matrix, which has entries of 1 along the diagonal and zero
entries elsewhere.

Linear Algebra: Spectral Analysis

4 Matrix multiplication corresponds to composition of
the corresponding linear transformations: AB is the matrix
for which ( AB)(x) = A( Bx) for all x.

1 An eigenvector v of an n × n matrix A is a nonzero vector with the property that Av = λv for some λ ∈ R. We call
λ an eigenvalue.

5 An m × n matrix is full rank if its rank is equal to min(m, n).

If v is an eigenvector of A, then A maps the line span({v})
to itself:

Multivariable calculus
2 1
0 3

occupying the region D and having mass density f ( x, y) at
each point ( x, y).

2 (Squeeze theorem) If an ≤ bn ≤ cn for all n ≥ 1 and if
limn→∞ an = limn→∞ cn = b, then bn → b as n → ∞.

14 Double integration over D: the bounds for the outer integral are the smallest and largest values of y for any point
in D, and the bounds for the inner integral are the smallest
and largest values of x for any point in a given “y = constant”
slice of the region.

verges to a number x ∈ R if the distance from xn to x on the
number line can be made as small as desired by choosing n
sufficiently large. We say limn→∞ xn = x or xn → x.

2 Eigenvectors of A with distinct eigenvalues are linearly independent.


3 Not every n × n matrix A has n linearly independent
eigenvectors. If A does have n linearly independent eigenvectors, we can make a matrix V with these eigenvectors as
columns and get

AV = VΛ ⟹ A = VΛV⁻¹

(diagonalization of A)

4 If A = VΛV⁻¹, then A^n = VΛ^n V⁻¹.
5 If A is a symmetric matrix, then A is orthogonally diagonalizable:

A = VΛV′,
where V is an orthogonal matrix (the spectral theorem).
6 A symmetric matrix is positive semidefinite if its eigenvalues are all nonnegative. We define the square root of a positive semidefinite matrix A = VΛV′ to be V√Λ V′, where √Λ is obtained by applying the square root function elementwise.
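The diagonalization facts above can be checked numerically with the LinearAlgebra standard library. This is a minimal sketch on a hypothetical symmetric matrix chosen for illustration (its eigenvalues are 1 and 3):

```julia
using LinearAlgebra

A = [2.0 1.0; 1.0 2.0]                      # hypothetical symmetric matrix
F = eigen(Symmetric(A))                     # eigenvalues in increasing order
V = F.vectors                               # orthogonal matrix of eigenvectors
Λ = Diagonal(F.values)                      # diagonal matrix of eigenvalues

reconstructed = V * Λ * V'                  # spectral theorem: A = V Λ V′
A_cubed = V * Diagonal(F.values .^ 3) * V'  # powers via the eigenvalues
root = V * Diagonal(sqrt.(F.values)) * V'   # square root of a PSD matrix
```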
Linear Algebra: SVD

1 The Gram matrix A′A of any m × n matrix A is positive semidefinite. Furthermore, |√(A′A) x| = |Ax| for all x ∈ R^n.

2 The singular value decomposition is the factorization of any rectangular m × n matrix A as UΣV′, where U and V are orthogonal and Σ is an m × n diagonal matrix (with diagonal entries in decreasing order).



3 (Comparison test) If ∑_{n=1}^∞ b_n converges and if |a_n| ≤ b_n for all n, then ∑_{n=1}^∞ a_n converges.

Conversely, if ∑_{n=1}^∞ b_n does not converge and 0 ≤ b_n < a_n, then ∑_{n=1}^∞ a_n also does not converge.
4 The series ∑_{n=1}^∞ n^p converges if and only if p < −1. The series ∑_{n=1}^∞ a^n converges if and only if −1 < a < 1.

5 The Taylor series, centered at c, of an infinitely differentiable function f is defined to be

f(c) + f′(c)(x − c) + (f″(c)/2!)(x − c)² + (f‴(c)/3!)(x − c)³ + · · ·

where Λ is a diagonal matrix of eigenvalues.

1 A sequence of real numbers ( xn )∞
n=1 = x1 , x2 , . . . con-

6 We can multiply or add Taylor series term-by-term, integrate or differentiate a Taylor series term-by-term, and substitute one Taylor series into another to obtain a Taylor series for the composition.

7 The partial derivative ∂f/∂x (x0, y0) of a function f(x, y) at a point (x0, y0) is the slope of the graph of f in the x-direction at the point (x0, y0).
8 Given f : R^n → R^m, we define ∂f/∂x to be the matrix whose (i, j)th entry is ∂f_i/∂x_j. Then

∂/∂x (Ax) = A
∂/∂x (x′A) = A′
∂/∂v (u′v) = u′
9 A function of two variables is differentiable at a point if its graph looks like a plane when you zoom in sufficiently around the point. More generally, a function f : R^n → R^m is differentiable at x if it is well-approximated by its derivative near x:

lim_{∆x→0} |f(x + ∆x) − (f(x) + (∂f/∂x)(x) ∆x)| / |∆x| = 0.
10 The Hessian H of f : R^n → R is the matrix of its second order derivatives: H_{i,j}(x) = ∂/∂x_i ∂/∂x_j f(x).




The quadratic approximation of f at the origin is f(0) + (∂f/∂x)(0) x + ½ x′H(0)x.


(i) f realizes an absolute maximum and absolute minimum
on D (the extreme value theorem).

3 The diagonal entries of Σ are the singular values of A, and the columns of U and V are called left singular vectors and right singular vectors, respectively. A maps each right singular vector v_i to the corresponding left singular vector u_i scaled by σ_i.
4 The vectors in R^n stretched the most by A are the ones which run in the direction of the column or columns of V corresponding to the greatest singular value. Likewise, the vectors stretched the least run in the direction of the column or columns of V corresponding to the least singular value.
5 For k ≥ 1, the k-dimensional vector space with minimal sum of squared distances to the columns of A (interpreted as points in R^m) is the span of the first k columns of U.
6 The absolute value of the determinant of a square matrix
is equal to the product of its singular values.
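The SVD properties above can be verified with the LinearAlgebra standard library. The sketch below uses a hypothetical 2 × 2 matrix chosen for illustration:

```julia
using LinearAlgebra

A = [2.0 1.0; 0.0 3.0]        # hypothetical example matrix
F = svd(A)                    # factorization A = U * Diagonal(S) * V'
U, S, V = F.U, F.S, F.V       # S holds the singular values in decreasing order

v1 = V[:, 1]                  # first right singular vector
u1 = U[:, 1]                  # first left singular vector
stretched = A * v1            # equals S[1] * u1
```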

Numerical Computation: machine arithmetic
1 Computers store numerical values as sequences of bits. The type of a numeric value specifies how to interpret the underlying sequence of bits as a number.
2 The Int64 type uses 64 bits to represent the integers from −2^63 to 2^63 − 1. For 0 ≤ n ≤ 2^63 − 1, we represent n using its binary representation, and for 1 ≤ n ≤ 2^63, we represent −n using the binary representation of 2^64 − n. Int64 arithmetic is performed modulo 2^64.
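The wraparound behavior of Int64 arithmetic can be observed directly. This is a short sketch; the variable names are chosen only for illustration:

```julia
largest = typemax(Int64)                    # 2^63 - 1
smallest = typemin(Int64)                   # -2^63
wrapped = largest + 1                       # overflows silently, modulo 2^64
bits_of_minus_one = bitstring(Int64(-1))    # binary representation of 2^64 - 1
```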
3 The Float64 type uses 64 bits to represent real numbers. We call the first bit σ, the next 11 bits (interpreted as a binary integer) e ∈ [0, 2047], and the final 52 bits f ∈ [0, 2^52 − 1]. If e ∉ {0, 2047}, then the number represented by (σ, e, f) is

x = (−1)^σ 2^(e−1023) (1 + f (1/2)^52).

The representable numbers between consecutive powers of 2 are the ones obtained by 52 recursive iterations of binary subdivision. The value of e indicates the powers of 2 that x is between, and the value of f indicates the position of x between those powers of 2.
The Float64 exponent value e = 2047 is reserved for Inf and NaN, while e = 0 is reserved for the subnormal numbers: (σ, 0, f) represents (−1)^σ f / 2^1074.
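The (σ, e, f) decomposition can be extracted from the bits of a Float64 and used to rebuild the number. This is a minimal sketch; the helper names decode and rebuild are hypothetical, and rebuild only covers the normal case e ∉ {0, 2047}:

```julia
# Split a Float64 into its sign bit, 11 exponent bits, and 52 fraction bits.
function decode(x::Float64)
    bits = reinterpret(UInt64, x)
    σ = Int(bits >> 63)                     # sign bit
    e = Int((bits >> 52) & 0x7ff)           # exponent field
    f = Int(bits & 0x000fffffffffffff)      # fraction field
    return σ, e, f
end

# Rebuild a normal Float64 from its fields using the formula in the text.
rebuild(σ, e, f) = (-1.0)^σ * 2.0^(e - 1023) * (1 + f / 2.0^52)
```

For example, decode(1.0) gives σ = 0, e = 1023, f = 0, since 1 = 2^0 · (1 + 0).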

[Figure: the Float64 number line, with 2^52 representable values between each pair of consecutive powers of 2, up to the largest finite representable value.]

4 The BigInt and BigFloat types use an arbitrary number of bits and can handle very large numbers or very high precision. Computations are much slower than for 64-bit types.
11 Suppose that f is a continuous function defined on a

closed and bounded subset D of Rn . Then:


15 Polar integration over D: the outer integral bounds are
the least and greatest values of θ for a point in D, and the
inner integral bounds are the least and greatest values of r
for any point in D along each given “θ = constant” ray. The
area element is dA = r dr dθ .

(ii) Any point where f realizes an extremum is either a
critical point—meaning that ∇ f = 0 or f is nondifferentiable at that point—or at a point on the boundary.
(iii) (Lagrange multipliers) If f realizes an extremum at a
point on a portion of the boundary which is the level set
of a differentiable function g with non-vanishing gradient ∇ g, then either f is non-differentiable at that point

or the equation
∇ f = λ∇ g
is satisfied at that point, for some λ ∈ R.
12 If r : R^1 → R^2 and f : R^2 → R^1, then

∂/∂t (f ∘ r)(t) = (∂f/∂r)(r(t)) (∂r/∂t)(t).   (chain rule)

13 Integrating a function is a way of totaling up its values.
∬_D f(x, y) dx dy can be interpreted as the mass of an object

Numerical Computation: Error
1 If Â is an approximation for A, then the relative error is (Â − A)/A.

7 The condition number of a ↦ a^n is κ(a) = |n|, and the condition number of a ↦ a − b is |a/(a − b)| (so subtracting b is ill-conditioned near b; this is called catastrophic cancellation).
8 The relative roundoff error between a non-extreme real number and the nearest T-representable value is no more than the machine epsilon (ε_mach) of the floating point type T.
9 An algorithm which solves a problem with error much greater than κ ε_mach is unstable. An algorithm is unstable if at least one of the steps it performs is ill-conditioned. If every step of an algorithm is well-conditioned, then the algorithm is stable.
10 The condition number of a matrix A is defined to be the maximum condition number of the function x ↦ Ax over its domain. The condition number is equal to the ratio of the largest to the smallest singular value of A.
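The quantities in this section can be computed directly. The sketch below assumes a hypothetical 2 × 2 matrix, and the helper κ_subtract is a name introduced here for the condition number of subtraction:

```julia
using LinearAlgebra

ε_mach = eps(Float64)                  # machine epsilon of Float64, 2^-52

# Condition number of a ↦ a - b, which blows up as a approaches b.
κ_subtract(a, b) = abs(a / (a - b))

A = [1.0 2.0; 3.0 4.0]                 # hypothetical example matrix
σ = svdvals(A)                         # singular values, largest first
κ_matrix = σ[1] / σ[end]               # ratio of largest to smallest
```

Subtracting nearly equal numbers, as in κ_subtract(1.000001, 1.0), gives a condition number on the order of a million, which is the catastrophic cancellation described above.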

Numerical Computation: PRNGs
1 A pseudorandom number generator (PRNG) is an algorithm for generating a deterministic sequence of numbers
which is intended to share properties with a sequence of random numbers. The PRNG’s initial value is called its seed.
2 The linear congruential generator: fix positive integers M, a, and c, and consider a seed X_0 ∈ {0, 1, . . . , M − 1}. We return the sequence X_0, X_1, X_2, . . ., where X_n = mod(aX_{n−1} + c, M) for n ≥ 1.
3 The period of a PRNG is the minimum length of a repeating block. A long period is a desirable property of a PRNG, and a very short period is typically unacceptable.
4 Frequency tests check whether blocks of terms appear with the appropriate frequency (for example, we can check whether a_{2n} > a_{2n−1} for roughly half of the values of n).
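The linear congruential generator described above can be sketched in a few lines. The function name lcg and the tiny parameters M = 16, a = 5, c = 3 are assumptions made for illustration; they are far too small for practical use:

```julia
# Return X₀, X₁, ..., Xₙ for the linear congruential generator
# Xₖ = mod(a * Xₖ₋₁ + c, M).
function lcg(seed, a, c, M, n)
    xs = Vector{Int}(undef, n + 1)
    xs[1] = seed
    for k in 1:n
        xs[k + 1] = mod(a * xs[k] + c, M)
    end
    return xs
end

sequence = lcg(1, 5, 3, 16, 16)   # 17 terms starting from the seed 1
```

With these toy parameters the generator visits all 16 residues before returning to the seed, so its period is 16.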

Numerical Computation: Automatic Differentiation
1 A dual number is an object that can be substituted into
a function f to yield both the value of the function and its
derivative at a point x.

2 If f is a function which can act on matrices, then the matrix [x 1; 0 x] represents a dual number at x, since

f([x 1; 0 x]) = [f(x) f′(x); 0 f(x)].

(This identity is true for any function f which can be defined as a limit of polynomial functions, since it can be checked to hold for f + g and f g whenever it holds for f and g, and it holds for the identity function.)

3 To find the derivative of f with automatic differentiation, every step in the computation of f must be dual-number-aware. See the packages ForwardDiff (for Julia) and autograd (for Python).
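The matrix representation of dual numbers can be tried without any packages. This is a minimal sketch that assumes a hypothetical polynomial f(x) = x² + 3x, which acts on square matrices as well as on numbers; it is an illustration of the identity, not how ForwardDiff is implemented:

```julia
f(X) = X^2 + 3X                 # hypothetical polynomial; works on numbers and square matrices

x = 2.0
D = [x 1.0; 0.0 x]              # matrix representing a dual number at x
F = f(D)                        # equals [f(x) f′(x); 0 f(x)]

value = F[1, 1]                 # f(2) = 10
derivative = F[1, 2]            # f′(2) = 2 * 2 + 3 = 7
```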

2 Roundoff error comes from rounding numbers to fit them into a floating point representation.

3 Truncation error comes from using approximate mathematical formulas or algorithms.

4 Statistical error arises from using randomness in an approximation.

5 The condition number of a function measures how it stretches or compresses relative error. The condition number of a problem is the condition number of the map from the problem’s initial data a to its solution S(a):

κ(a) = |a| |dS/da (a)| / |S(a)|

6 A problem is well-conditioned if its condition number is
modest and ill-conditioned if the condition number is large.

Numerical Computation: Optimization
1 Gradient descent seeks to minimize f : R^n → R by repeatedly stepping in f’s direction of maximum decrease. We begin with a value x_0 ∈ R^n and repeatedly update using the rule x_{n+1} = x_n − ε∇f(x_n), where ε is the learning rate. We fix a small number τ > 0 and stop when |∇f(x_n)| < τ.
2 A function is strictly convex if its Hessian is positive definite everywhere. A strictly convex function has at most one local minimum, and any local minimum is also a global minimum. Gradient descent (with a suitably small learning rate) will find the global minimum for a convex function, but for non-convex functions it can get stuck in a local minimum.
3 Algorithms similar to gradient descent but with usually faster convergence: conjugate gradient, BFGS, L-BFGS.
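The gradient descent update rule and stopping criterion can be sketched as follows. The function name gradient_descent, its keyword arguments, and the quadratic test function are assumptions made for illustration:

```julia
using LinearAlgebra

# Repeat x ← x - learning_rate * ∇f(x) until the gradient norm drops below tol.
function gradient_descent(grad_f, x0; learning_rate = 0.1, tol = 1e-8, maxiter = 10_000)
    x = copy(x0)
    for _ in 1:maxiter
        g = grad_f(x)
        norm(g) < tol && break
        x -= learning_rate * g
    end
    return x
end

# Hypothetical strictly convex function f(x) = (x₁ - 1)² + 2(x₂ + 3)²,
# whose unique minimizer is (1, -3).
grad_f(x) = [2 * (x[1] - 1), 4 * (x[2] + 3)]
xmin = gradient_descent(grad_f, [0.0, 0.0])
```

The learning rate here is small enough for this particular function; a rate that is too large can make the iterates diverge even when the function is convex.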

Probability: Counting

1 Fundamental principle of counting: If one experiment has m possible outcomes, and if a second experiment has n possible outcomes for each of the outcomes in the first experiment, then there are mn possible outcomes for the pair of experiments.

2 The number of ways to arrange n objects in order is n! = 1 · 2 · 3 · · · · · n (read n factorial).

3 Permutations: if S is a set with n elements, then there are n!/(n − r)! ordered r-tuples of distinct elements of S.

4 Combinations: The number of r-element subsets of an n-element set is (n choose r) = n!/(r!(n − r)!).

Probability: Expectation and Variance

1 The expectation E[X] (or mean µ_X) of a random variable X is the probability-weighted average of X:

E[X] = ∑_{ω∈Ω} X(ω) m(ω)

2 The expectation E[X] may be thought of as the value of a random game with payout X, or as the long-run average of X over many independent runs of the underlying experiment. The Monte Carlo approximation of E[X] is obtained by simulating the experiment many times and averaging the value of X.

3 The expectation is the center of mass of the distribution of X.

3 The cumulative distribution function (CDF) of a random variable X is the function F_X(x) = P(X ≤ x).

2 The function f is called a density, because it measures
the amount of probability mass per unit volume at each point
(2D volume = area, 1D volume = length).
3 If (X, Y) is a pair of random variables whose joint distribution has density f_{X,Y} : R² → R, then the conditional distribution of Y given the event {X = x} has density f_{Y | X = x} defined by

f_{Y | X = x}(y) = f_{X,Y}(x, y) / f_X(x),

where f_X(x) = ∫ f_{X,Y}(x, y) dy is the PDF of X.

4 If a random variable X has density f_X on R, then

E[g(X)] = ∫_R g(x) f_X(x) dx.
1 Given a random experiment, the set of possible outcomes
is called the sample space Ω, like {(H, H), (H, T), (T, H), (T, T)}.






2 We associate with each outcome ω ∈ Ω a probability mass, denoted m(ω). For example, m((H, T)) = 1/4.
3 In a random experiment, an event is a predicate that can be determined based on the outcome of the experiment (like “first flip turned up heads”). Mathematically, an event is a subset of Ω (like {(H, H), (H, T)}).
4 Basic set operations ∪, ∩, and ^c correspond to disjunction, conjunction, and negation of events:

(i) The event that E happens or F happens is E ∪ F.
(ii) The event that E happens and F happens is E ∩ F.
(iii) The event that E does not happen is E^c.
5 If E and F cannot both occur (that is, E ∩ F = ∅), we say that E and F are mutually exclusive or disjoint.

6 If E’s occurrence implies F’s occurrence, then E ⊂ F.

mental probability space properties are

(i) P(Ω) = 1 — “something has to happen”
(ii) P( E) ≥ 0 — “probabilities are non-negative”
(iii) P( E ∪ F ) = P( E) + P( F ) if E and F are mutually exclusive — “probability is additive”.

9 Other properties which follow from the fundamental ones:

(i) P(∅) = 0
(ii) P( Ec ) = 1 − P( E)
(iii) E ⊂ F =⇒ P( E) ≤ P( F ) (monotonicity)





4 The joint distribution of two random variables X and
Y is the probability measure on R2 which maps A ⊂ R2 to
P(( X, Y ) ∈ A). The probability mass function of the joint
distribution is m( X,Y ) ( x, y) = P( X = x and Y = y).

Probability: Conditional Probability
1 Given a probability space Ω and an event E ⊂ Ω, the conditional probability measure given E is an updated probability measure on Ω which accounts for the information that the result ω of the random experiment falls in E:

P(F | E) = P(F ∩ E) / P(E)

2 The conditional probability mass function of Y given {X = x} is m_{Y | X = x}(y) = m_{X,Y}(x, y)/m_X(x).
3 Bayes’ theorem tells us how to update beliefs in light of new evidence. It relates the conditional probabilities P(A | E) and P(E | A):

P(A | E) = P(E | A)P(A) / P(E) = P(E | A)P(A) / (P(E | A)P(A) + P(E | A^c)P(A^c))

4 Two events E and F are independent if P(E ∩ F) = P(E)P(F).

5 Two random variables X and Y are independent if every pair of events of the form {X ∈ A} and {Y ∈ B} is independent, where A ⊂ R and B ⊂ R.
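Bayes' theorem from this section can be evaluated numerically. The sketch below uses a hypothetical helper named posterior and made-up probabilities (a 1% prior, with the evidence appearing 90% of the time when A holds and 5% of the time otherwise):

```julia
# P(A | E) from the prior P(A) and the likelihoods P(E | A) and P(E | Aᶜ).
function posterior(prior, p_E_given_A, p_E_given_Ac)
    p_E = p_E_given_A * prior + p_E_given_Ac * (1 - prior)   # total probability of E
    return p_E_given_A * prior / p_E
end

p = posterior(0.01, 0.9, 0.05)   # probability of A after observing E
```

With these numbers the posterior is only about 15%, because the event A is rare to begin with.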

Probability: Random Variables

6 The PMF of the joint distribution of a pair of independent
random variables factors as m X,Y ( x, y) = m X ( x )mY (y):

result of a random experiment (one’s lottery winnings, for
example). Mathematically, a random variable is a function
X from the sample space Ω to R.

2 The distribution of a random variable X is the probability measure on R which maps each set A ⊂ R to P( X ∈ A).
The probability mass function of the distribution of X may
be obtained by pushing forward the probability mass from
each ω ∈ Ω:

E[ g( X, Y )] =

( x,y)∈R2

g( x ) f X ( x ) dx.

5 CDF sampling: F⁻¹(U) has CDF F if U is uniformly distributed on [0, 1] (that is, f_U = 1_{[0,1]}).
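CDF sampling can be sketched for a distribution whose CDF has a closed-form inverse. The example below assumes the exponential distribution with rate λ, for which F(x) = 1 − e^(−λx) and F⁻¹(u) = −log(1 − u)/λ; the function name inverse_cdf and the seed are chosen for illustration:

```julia
using Random, Statistics

# Inverse of the exponential CDF F(x) = 1 - exp(-λx).
inverse_cdf(u, λ) = -log(1 - u) / λ

rng = MersenneTwister(0)           # seeded generator for reproducibility
λ = 2.0
samples = [inverse_cdf(rand(rng), λ) for _ in 1:100_000]
sample_mean = mean(samples)        # should be close to 1/λ
```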

Probability: Conditional Expectation

g( x )m X ( x )

x ∈R

(iv) P(E ∪ F) = P(E) + P(F) − P(E ∩ F) (principle of inclusion-exclusion)

1 A random variable is a number which depends on the

able (or two random variables) may be expressed in terms of
the PMF m X of the distribution of X (or the PMF m( X,Y ) of

the joint distribution of X and Y):

E[ g( X )] =

7 The probability P(E) of an event E is the sum of the probability masses of the outcomes in that event. The domain of P is 2^Ω, the set of all subsets of Ω.
8 The pair (Ω, P) is called a probability space. The funda-


4 The expectation of a function of a discrete random vari-

Probability: Probability Spaces

g( x, y)m( X,Y ) ( x, y).

5 Expectation is linear: if c ∈ R and X and Y are random variables defined on the same probability space, then

E[cX + Y] = c E[X] + E[Y]
6 The variance of a random variable is its average squared deviation from its mean. The variance measures how spread out the distribution of X is. The standard deviation σ(X) is the square root of the variance.
7 Variance satisfies the following properties if X and Y are independent random variables and a ∈ R:

Var(aX) = a² Var X
Var(X + Y) = Var(X) + Var(Y)
8 The covariance of two random variables X and Y is the expected product of their deviations from their respective means µ_X = E[X] and µ_Y = E[Y]:

Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = E[XY] − E[X]E[Y].
9 The covariance of two independent random variables is zero, but zero covariance does not imply independence.

10 The correlation of two random variables is their normalized covariance:

Corr(X, Y) = Cov(X, Y) / (σ(X) σ(Y)) ∈ [−1, 1]

11 The covariance matrix of a vector X = [X1, . . . , Xn] of random variables defined on the same probability space is defined to be the matrix Σ whose (i, j)th entry is equal to Cov(Xi, Xj). If E[X] = 0, then Σ = E[XX′].

1 The conditional expectation of a random variable given an event is the expectation of the random variable calculated with respect to the conditional probability measure given that event: if (X, Y) has PMF m_{X,Y}, then

E[Y | X = x] = ∑_{y∈R} y m_{Y | X = x}(y),

where m_{Y | X = x}(y) = m_{X,Y}(x, y)/m_X(x). If (X, Y) has pdf f_{X,Y}, then

E[Y | X = x] = ∫ y f_{Y | X = x}(y) dy.

2 The conditional expectation of a random variable Y given another random variable X is obtained by substituting X for x in the expression for the conditional expectation of Y given X = x. Thus E[Y | X] is a random variable.
3 If X and Y are independent, then E[Y | X] = E[Y]. If Z is a function of X, then E[ZY | X] = Z E[Y | X].
4 The law of iterated expectation: E[E[Y | X]] = E[Y].

Probability: Continuous Distributions

1 If Ω ⊂ R^n and P(A) = ∫_A f, where f ≥ 0 and ∫_{R^n} f = 1, then we call (Ω, P) a continuous probability space.

Probability: Common Distributions

1 Bernoulli (Ber(p)): A weighted coin flip.

m(k) = p if k = 1, and 1 − p if k = 0
µ = p
σ² = p(1 − p)

2 Binomial (Bin(n, p)): A sum of n independent Ber(p)’s.

m(k) = (n choose k) p^k (1 − p)^(n−k)
µ = np
σ² = np(1 − p)

3 Geometric (Geom(p)): Time to first success (1) in a sequence of independent Ber(p)’s.

m(k) = p(1 − p)^(k−1)
µ = 1/p
σ² = (1 − p)/p²

4 Poisson distribution (Poiss(λ)): Limit as n → ∞ of Binomial(n, λ/n).
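The binomial formulas listed among the common distributions can be checked numerically. This is a minimal sketch with hypothetical parameters n = 10 and p = 0.3; the function name binomial_pmf is introduced here for illustration:

```julia
# PMF of Bin(n, p): probability of exactly k successes in n trials.
binomial_pmf(k, n, p) = binomial(n, k) * p^k * (1 - p)^(n - k)

n, p = 10, 0.3
masses = [binomial_pmf(k, n, p) for k in 0:n]

mean_k = sum(k * binomial_pmf(k, n, p) for k in 0:n)               # should equal n * p
var_k = sum((k - mean_k)^2 * binomial_pmf(k, n, p) for k in 0:n)   # should equal n * p * (1 - p)
```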

1 A sequence of random variables X1 , X2 , . . . , converges

bility space with an unknown probability measure, we seek
to draw conclusions about the measure.

2 A sequence ν1 , ν2 , . . . of probability measures on Rn converges to a probability measure ν if νn ( A) → ν( A) for every
set A ⊂ Rn with the property that ν(∂A) = 0 (intuitively,
two measures are close if they put approximately the same
amount of mass in approximately the same places). We say
Xn converges in distribution to ν if the distribution of Xn
converges to ν.

2 Supervised learning: (X, Y) is drawn from an unknown probability measure P on a product space X × Y, and we aim to predict Y given X, based on an i.i.d. collection of samples from P (the training data).

3 Chebyshev’s inequality: if X is a random variable with variance σ² < ∞, then X differs from its mean by more than k standard deviations with probability at most k⁻²:

P(|X − E[X]| > kσ) ≤ k⁻²


X1 + · · · + X n

/ [µ − , µ + ]

3 We call the components of X features, predictors, or input
variables, and we call Y the response variable or output variable.

6 Normal distribution (N(µ, σ²)): Limit as n → ∞ of the distribution of µ + (X1 + X2 + · · · + Xn − nµ)/√n, for any independent sequence X1, . . . , Xn of identically distributed random variables (i.i.d.) with E[X1] = µ and Var(X1) = σ² < ∞ (see Central Limit Theorem).

→ 0,

variance distribution looks increasingly bell-shaped as n increases, regardless of the distribution being sampled from.

PMFs for sums of n fair coin flips















f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²))




6 We define the standardized running sum of X1, X2, . . . to have zero mean and unit variance for all n ≥ 1:

S_n* = (X1 + X2 + · · · + Xn − nµ) / (σ√n)

7 Multivariate normal distribution (N(µ, Σ)): if Z = (Z1, Z2, . . . , Zn) is a vector of independent N(0, 1)’s, A is an m × n matrix of constants, and µ ∈ R^m, then the vector

X = AZ + µ

is multivariate normal. The covariance matrix of X is Σ = AA′.

7 Central limit theorem: the sequence of standardized sums of an i.i.d. sequence of finite-variance random variables converges in distribution to N(0, 1): for any interval [a, b], we have

P(S_n* ∈ [a, b]) → ∫_a^b (1/√(2π)) e^(−t²/2) dt

D(u) =

(i) a space H of candidate functions, and

total mass = 1


3 The width of each pile is specified by a bandwidth λ: D_λ(u) = (1/λ) D(u/λ).

L(h) = E[1_{h(X) ≠ Y}],

4 The kernel density estimator with bandwidth λ is the
sum of the piles at each sample:

8 The empirical probability measure on X × Y is the measure which assigns a probability mass of 1/n to the location of each training sample (X1, Y1), (X2, Y2), . . . , (Xn, Yn). The empirical risk of a candidate function h is the risk functional evaluated with respect to the empirical measure of the training data. The empirical risk minimizer (ERM) is the function which minimizes empirical risk.
9 Generalization error (or test error) is the difference between empirical risk and the actual value of the risk functional.

9 The central limit theorem explains the ubiquity of the
normal distribution in statistics: many random quantities
may be realized as a sum of a multitude of independent contributions.
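
As a rough numerical illustration of the central limit theorem, the sketch below computes the exact probability that the standardized sum of n fair coin flips lands in [a, b] and compares it with the standard normal integral over the same interval. The function names and the choices n = 1000 and [a, b] = [−1, 1] are assumptions made for this example only.

```python
import math

def standardized_sum_probability(n, a, b):
    """Exact P(a <= S*_n <= b), where S_n counts heads in n fair coin flips."""
    mu, sigma = 0.5, 0.5              # mean and standard deviation of one flip
    count = 0
    for k in range(n + 1):
        s_star = (k - n * mu) / (sigma * math.sqrt(n))
        if a <= s_star <= b:
            count += math.comb(n, k)  # number of flip sequences with k heads
    return count / 2 ** n

def normal_probability(a, b):
    """Integral of the standard normal density over [a, b], via the error function."""
    def cdf(t):
        return 0.5 * (1 + math.erf(t / math.sqrt(2)))
    return cdf(b) - cdf(a)

exact = standardized_sum_probability(1000, -1.0, 1.0)
limit = normal_probability(-1.0, 1.0)
print(exact, limit)
```

The two numbers should be close but not equal, since the coin-flip sum is discrete while the limit is continuous.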

f̂λ(x) = (1/n) ∑_{i=1}^n Dλ(x − Xi).
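
A minimal sketch of the kernel density estimator, assuming the tricube kernel D(u) = (70/81)(1 − |u|³)³ on [−1, 1] as the pile shape. The sample values, the bandwidth, and the function names are made up for illustration.

```python
def tricube(u):
    """Tricube kernel: (70/81)(1 - |u|^3)^3 on [-1, 1], zero elsewhere."""
    return 70 / 81 * (1 - abs(u) ** 3) ** 3 if abs(u) <= 1 else 0.0

def kde(samples, bandwidth):
    """Return x -> (1/n) * sum_i D_lambda(x - X_i), with D_lambda(u) = D(u/lambda)/lambda."""
    n = len(samples)
    def estimate(x):
        return sum(tricube((x - xi) / bandwidth) / bandwidth for xi in samples) / n
    return estimate

samples = [0.1, 0.4, 0.45, 0.9]       # made-up data
f_hat = kde(samples, bandwidth=0.5)

# Riemann-sum check that the estimate carries total mass close to 1
step = 0.001
mass = sum(f_hat(-1 + i * step) * step for i in range(3000))
print(mass)
```

Each sample contributes a pile of mass 1/n, so the estimate integrates to 1 whatever the bandwidth; the bandwidth only controls how far each pile spreads.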

5 To choose a suitable bandwidth, we seek to minimize the
integrated squared error (ISE) L(f̂) = ∫_R (f̂ − f)².

6 We approximate the minimizer of L with the minimizer of the cross-validation loss estimator

J(f̂λ) = ∫ f̂λ² − (2/n) ∑_{i=1}^n f̂λ^(−i)(Xi),

where f̂λ^(−i) is the KDE with the ith sample omitted.

7 If f is a density on R², then we use the KDE

f̂λ(x, y) = (1/n) ∑_{i=1}^n Dλ(x − Xi) Dλ(y − Yi).

Example: if H is the space of polynomials and no two training
samples have the same x values, then there are functions in H
which have zero empirical risk.

10 The ERM can overfit, meaning that test error and L( h)
are large despite small empirical risk.



empirical risk

8 Stone’s theorem says that the ratio of the CV ISE to the optimal-λ ISE converges to 1 in probability as n → ∞. Also, the optimal λ goes to 0 like 1/n^(1/5), and the minimal ISE goes

risk minimizer

and H contains r(x) = E[Y | X = x], then r is the target function. If the loss functional for a classification problem is

as n → ∞.

8 Multivariate central limit theorem: If X1, X2, . . . is a sequence of independent random vectors whose common distribution has mean µ and covariance matrix Σ, then

(X1 + X2 + ··· + Xn − nµ)/√n

converges in distribution to N(0, Σ).

D(u) = (70/81)(1 − |u|³)³ for |u| ≤ 1

L(h) = E[(h( X ) − Y )2 ]

7 Since P is unknown, we must approximate the target
function with a function h whose values can be computed
from the training data. A learner is a function which takes a
set of training data and returns a prediction function h.


2 We choose a kernel function for the shape of each pile:

and H contains G (x) = argmaxc P(Y = c | X = x), then G is
the target function.

n = 10
n = 13
n = 16


1 Given n samples X1 , . . . , Xn from a distribution with
density f on R, we can estimate the PDF of the distribution by placing 1/n units of probability mass in a small pile
around each sample.

6 If the loss functional for a regression problem is

− ( x −µ2)

Statistical Learning: Kernel density estimation

The target function is argminh∈H L(h).

5 The PDF of a sum of n independent samples from a finite-


σ² = 1/λ²

13 No-free-lunch theorem: all learners are equal on average (over all possible problems), so inductive bias appropriate to a given type of problem is essential to have an effective
learner for that type of problem.

(ii) a loss (or risk) functional L from H to R.

f(x) = λe^(−λx)
µ = 1/λ

12 Inductive bias can lead to underfitting: relevant relations are missed, so both training and test error are larger
than necessary. The tension between the costs of high inductive bias and the costs of low inductive bias is called the
bias-complexity (or bias-variance) tradeoff.

5 To choose a prediction function h : X → Y , we specify a

as n → ∞.

P ( Sn = k )

dependent samples from a finite-variance distribution with
mean µ, then the sequence’s running average converges in
probability to µ: for all ε > 0,


Example: X = [ X1 , X2 ], where X1 is the color of a banana, X2
is the weight of the banana, and Y is a measure of deliciousness.
Values of X1 , X2 , and Y are recorded for many bananas, and they
are used to predict Y for other bananas whose X values are known.

4 A supervised learning problem is a regression problem
if Y is quantitative (Y ⊂ R) and a classification problem if
Y is a set of labels.


4 Law of large numbers: if X1 , X2 , . . . is a sequence of in-

5 Exponential distribution (Exp(λ)): Limit as n → ∞ of
distribution of 1/n times a Geometric(λ/n).

1 Statistical learning: Given some samples from a proba-

in probability to X if P(|Xn − X| > ε) → 0 as n → ∞, for
any ε > 0.


m(k) = (λ^k / k!) e^(−λ)
σ² = λ

Statistical Learning: Theory

Probability: Central Limit Theorem


11 Mitigate overfitting with inductive bias:

(i) Use a restrictive class H of candidate functions.
(ii) Regularize: add a term to the loss functional which
penalizes complexity.


to 0 like 1/n^(4/5).

9 The Nadaraya-Watson nonparametric regression estimator r̂(x) computes E[Y | X = x] with respect to the estimated density f̂λ. Equivalently, we average the Yi’s, weighted according to horizontal distance from x:

r̂(x) = ∑_{i=1}^n Yi Dλ(x − Xi) / ∑_{i=1}^n Dλ(x − Xi).

tions which is indexed by finitely many parameters.

2 Linear regression uses the set of affine functions: H =
{ x → β0 + [ β1 , . . . , β p ] · x : β 0 , . . . β p ∈ R } .

Statistical Learning: QDA, LDA, Naive Bayes
tion algorithm which uses the training data to estimate the
mean µy and covariance matrix Σy of each class conditional distribution:
µ̂y = mean({xi : yi = y})

3 We choose the parameters to minimize a risk function, customarily the residual sum of squares:

RSS(β) = ∑_{i=1}^n (yi − β0 − β · xi)² = |y − Xβ|²,

where y = [y1, . . . , yn], β = [β0, . . . , βp], and X is an
n × (p + 1) matrix whose ith row is a 1 followed by the components of xi.


where [u]+ denotes max(0, u), the positive part of u.

1 Quadratic discriminant analysis (QDA) is a classifica-

4 The parameters β and α encode both H and a distance, called the margin, from H to a parallel hyperplane where we begin penalizing for lack of decisively correct classification. The margin is 1/|β| (and can be adjusted without changing H by scaling β and α).

Σ̂y = mean({(xi − µ̂y)(xi − µ̂y)ᵀ : yi = y}).

proposed (where { py : y ∈ Y } are the class proportions
from the training data).

| ˛|

2 Linear discriminant analysis (LDA) is the same as QDA

˛·x−α = 1

We have

˛ · x − α = −1

conditionally independent given Y:

5 Example assumption-satisfying data sets:

4 The RSS minimizer is β̂ = (XᵀX)⁻¹Xᵀy.
5 We can use the linear regression framework to do polynomial regression, since a polynomial is linear in its coefficients: we supplement the list of regressors with products of
the original regressors.
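
The polynomial-regression remark can be sketched with the normal equations (XᵀX)β = Xᵀy, solved here by a small hand-written Gaussian elimination so that the example needs no external libraries. The helper names and the noiseless quadratic test data are assumptions for illustration; a numerical linear algebra library would normally be used instead.

```python
def solve(A, b):
    """Solve the square system A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [b0, ..., b_degree] from the normal equations."""
    X = [[x ** j for j in range(degree + 1)] for x in xs]   # row: 1, x, x^2, ...
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(degree + 1)]
           for i in range(degree + 1)]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(degree + 1)]
    return solve(XtX, Xty)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]   # noiseless quadratic, for checking
coeffs = fit_polynomial(xs, ys, degree=2)
print(coeffs)
```

The model is nonlinear in x but linear in the coefficients, which is why the same RSS minimizer applies once the powers of x are added as regressors.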



Naive Bayes

6 Kernelization: mapping the feature vectors to a higher

Statistical Learning: Optimal classification
1 Consider a classification problem with feature set X and
class set Y . For each y ∈ Y , we define py = P(Y = y) and
let f y be the conditional PMF or PDF of X given {Y = y} (y’s
class conditional distribution).
2 Given a prediction function (or classifier) h and an enumeration of the elements of Y as {y1 , y2 , . . . , y|Y | }, we define
the (normalized) confusion matrix of h to be the |Y | × |Y |
matrix whose (i, j)th entry is P(h(X) = yi | Y = y j ).
3 If Y = {−, +}, the conditional probability of correct classification given a positive sample is the detection rate (DR), while the conditional probability of incorrect classification given a negative sample is the false alarm rate (FAR).
4 The precision of a classifier is the conditional probability that a sample is positive given that the classifier predicts positive, and recall is a synonym of detection rate.
5 The Bayes classifier G(x) = argmaxy py fy(x) minimizes the misclassification probability but gives equal weight to both types of misclassification.
6 The likelihood ratio test generalizes the Bayes classifier by allowing a variable tradeoff between false-alarm rate
and detection rate: given t > 0, we say ht (x) = −1 if
f + (x)/ f − (x) < t and ht (x) = 1 otherwise.
7 The Neyman-Pearson lemma says that no classifier does
better on both false alarm rate and detection rate than ht .
8 The receiver operating characteristic of ht is the curve


Statistical Learning: Logistic regression
1 Logistic regression for binary classification estimates

false alarm rate 1

tion of affine transformations and componentwise applications of a function K : R → R.

(i) We call K the activation function. Common choices:
(a) u → max(0, u) (rectifier, or ReLU)
(b) u → 1/(1 + e−u ) (logistic)


(ii) Component-wise application of K on Rt refers to the
function K.( x1 , . . . , xt ) = (K ( x1 ), . . . , K ( xt )).
(iii) An affine transformation from Rt to Rs is a map of
the form A(x) = Wx + b, where W is an s × t matrix and b ∈ Rs . Entries of W are called weights and
entries of b are called biases.


2 We choose α and β1, . . . , βp to minimize the risk

L(r) = ∑_{i=1}^n [ yi log(1/r(xi)) + (1 − yi) log(1/(1 − r(xi))) ],

which applies a large penalty if yi = 1 and r(xi) is close to
zero or if yi = 0 and r(xi) is close to 1.
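
The risk above can be evaluated directly. The sketch below assumes r(x) = σ(α + β · x) with σ the logistic function, and uses a made-up one-feature data set; the function names are hypothetical.

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def logistic_risk(alpha, beta, xs, ys):
    """Sum over samples of y*log(1/r) + (1 - y)*log(1/(1 - r)), r = sigmoid(alpha + beta . x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        r = sigmoid(alpha + sum(b * xj for b, xj in zip(beta, x)))
        total += y * math.log(1 / r) + (1 - y) * math.log(1 / (1 - r))
    return total

xs = [[-2.0], [-1.0], [1.0], [2.0]]          # made-up one-feature data
ys = [0, 0, 1, 1]
good = logistic_risk(0.0, [2.0], xs, ys)     # slope agrees with the labels
bad = logistic_risk(0.0, [-2.0], xs, ys)     # slope contradicts the labels
print(good, bad)
```

A coefficient whose sign matches the labels gives a much smaller risk than one whose sign contradicts them, which is the penalty behavior described above.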

[Figure: neural network diagram; labels (W1, b1), (W2, b2), A2, input (xi ∈ R^p)]

Statistical Learning: Support vector machines
1 A support vector machine
(SVM) chooses a hyperplane
H ⊂ R p and predicts classification (Y = {−1, +1}) based on
which side of H the feature vector x lies on.

H = {x ∈ R^p : β · x − α = 0}.
3 We train the SVM with the loss function
L(β, α) = λ|β|² + (1/n) ∑_{i=1}^n [1 − yi(β · xi − α)]+
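
A minimal sketch of this training loss and the sign-based prediction rule, with [u]+ = max(0, u). The data points, the candidate hyperplanes, and the value of λ are made up for illustration.

```python
def svm_loss(beta, alpha, xs, ys, lam):
    """lam*|beta|^2 + (1/n) * sum_i max(0, 1 - y_i*(beta . x_i - alpha))."""
    penalty = lam * sum(b * b for b in beta)
    hinge = 0.0
    for x, y in zip(xs, ys):
        score = sum(b * xj for b, xj in zip(beta, x)) - alpha
        hinge += max(0.0, 1 - y * score)
    return penalty + hinge / len(xs)

def predict(beta, alpha, x):
    """Sign of beta . x - alpha (points exactly on the hyperplane get +1 here)."""
    return 1 if sum(b * xj for b, xj in zip(beta, x)) - alpha >= 0 else -1

xs = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]   # made-up points
ys = [1, 1, -1, -1]
loss = svm_loss([1.0, 0.0], 0.0, xs, ys, lam=0.1)          # all points beyond the margin
shrunk = svm_loss([0.25, 0.0], 0.0, xs, ys, lam=0.1)       # wider margin, points inside it
print(loss, shrunk)
```

Scaling β down widens the margin and lowers the λ|β|² term, but it also pulls correctly classified points inside the margin, where the hinge term starts charging for them.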



Statistical Learning: Dimension reduction

desired output (yi )

(W3 , b3 )

 1 

 2 


3 Structure may be taken to mean variation about the center,
in which case we use principal component analysis:

(i) Store the points’ components in an n × p matrix,
(ii) de-mean each column,
(iii) compute the SVD UΣVᵀ of the resulting matrix, and
(iv) let W be the first k columns of V.
Then WWᵀ : R^p → R^p is the rank-k projection matrix which
minimizes the sum of squared projection distances of the
points, and Wᵀ : R^p → R^k maps each point to its coordinates
in that k-dimensional subspace (with respect to the columns
of W).








output ( N (xi ) ∈ Rq )



MNIST handwritten
digit images




3 L is convex, so it can be reliably minimized using numerical optimization algorithms.

= I and v

2 Dimension reduction can be used as a visualization aid
or as a feature pre-processing step in a machine learning

1 + e− x


(v) Stochastic gradient descent: repeat (ii)–(iv) for each
sample in a randomly chosen subset of the training set and
determine the average desired change in weights and biases to reduce the cost function. Update the weights and
biases accordingly and iterate to convergence.

Statistical Learning: Neural networks

σ( t )

σ( x ) =

(v) .

1 The goal of dimension reduction is to map a set of n
points in R p to a lower-dimensional space Rk while retaining
as much of the data’s structure as possible.

r(x) = P(Y = 1 | X = x) as a logistic function of a linear
function of x:
r(x) = σ(α + β · x),


dimensional space allows us to find nonlinear separating
surfaces in the original feature space.

1 A neural network function N : R p → Rq is a composi-

2 x ↦ sgn(β · x − α) is the
prediction function, where

{(FAR(ht ), DR(ht )) : t ∈ [0, ∞]}.
The AUROC (area under the ROC) is close to 1 for an excellent classifier and close to 1/2 for a worthless one. The Neyman-Pearson lemma says that no classifier is above the ROC. We choose a point on the ROC curve based on context-specific
detection rate

5 If λ is small, then the optimization prioritizes the correctness term and uses a small margin if necessary. If λ is
large, the optimization must minimize a large-margin incorrectness penalty. A value for λ may be chosen by cross-validation.

fy(x1, . . . , xp) = fy,1(x1) · · · fy,p(xp),


(iv) Compute the change in cost per small change in the
weights and biases at each blue node. Each such gradient is equal to the gradient stored at the next purple
node times the derivative of the intervening affine map.

˛·x−α = 0

4 A Naive Bayes classifier assumes that the features are

for some f y,1 , . . . f y,p .

(i) Start with random weights and a training input xi .
(ii) Forward propagation: apply each successive map and
store the vectors at each green or purple node. The vectors stored at the green nodes are called activations.

is Wj , and the derivative of K. is v → diag

except the class covariance matrices are assumed to be equal
and are estimated using all of the data, not just class-specific

3 QDA and LDA are so named because they yield class
prediction boundaries which are quadric surfaces and hyperplanes, respectively.


5 When the weight matrices are large, they have many parameters to tune. We use a custom optimization scheme:

(iii) Backpropagation: starting with the last green node
and working left, compute the change in cost per small
change in the vector at each green or purple node. By
the chain rule, each such gradient is equal to the gradient computed at the right-adjacent node times the derivative of the map between the two nodes. The derivative of Aj

for samples
of class −1

Each distribution is assumed to be multivariate normal
(N(µy, Σy)) and the classifier h(x) = argmaxy p̂y fy(x) is

1 Parametric regression uses a family H of candidate func-


Statistical Learning: Parametric regression

2 The architecture of a neural network is the sequence of
dimensions of the domains and codomains of its affine maps.
For example, a neural net with W1 ∈ R5×3 , W2 ∈ R4×5 , and
W3 ∈ R1×4 has architecture [3, 5, 4, 1].
3 Given training samples {(xi, yi)}, i = 1, . . . , N, we obtain a
neural net regression function by minimizing L(N) = ∑_{i=1}^N C(N(xi), yi), where C(y, yi) = |y − yi|².

first two principal components






4 For classification, we

(i) let yi = [0, . . . , 0, 1, 0, . . . , 0] ∈ R^|Y|, with the location of
the nonzero entry indicating class (this is called one-hot encoding),
(ii) replace the identity map in the diagram with the softmax function u ↦ [e^(u_j) / ∑_{k=1}^n e^(u_k)] for j = 1, . . . , |Y|, and
(iii) replace the cost function with C(y, yi) = −log(y · yi).
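
The three classification changes can be sketched in a few lines. The score vector and class count below are made up, and subtracting the maximum score before exponentiating is a standard numerical safeguard added here rather than something the text specifies.

```python
import math

def softmax(u):
    """Map a score vector u to probabilities e^{u_j} / sum_k e^{u_k}."""
    shift = max(u)                        # subtracting the max avoids overflow
    exps = [math.exp(v - shift) for v in u]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(label, num_classes):
    """Vector with a 1 in position `label` and 0 elsewhere."""
    return [1.0 if j == label else 0.0 for j in range(num_classes)]

def cost(y, y_i):
    """C(y, y_i) = -log(y . y_i): negative log of the probability given to the true class."""
    return -math.log(sum(a * b for a, b in zip(y, y_i)))

probs = softmax([2.0, 1.0, 0.1])          # made-up network outputs for 3 classes
c = cost(probs, one_hot(0, 3))
print(probs, c)
```

Because the one-hot vector picks out a single entry, the cost depends only on the probability the network assigns to the correct class.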

4 Structure may be taken to mean pairwise proximity of
points, which stochastic neighbor embedding attempts to
preserve. Given the data points x1, . . . , xn and a parameter
ρ called the perplexity of the model, we define

Pi,j(σ) = e^(−|xi − xj|²/(2σ²)) / ∑_{k≠j} e^(−|xk − xj|²/(2σ²)),

and for each j we define σj to be the solution σ of the equation

2^(−∑_{i≠j} Pi,j(σ) log2 Pi,j(σ)) = ρ.

We define pi,j = (1/(2n))(Pi,j(σj) + Pj,i(σi)), which describes the
similarity of xi and xj. Given y1, . . . , yn in R^k, we define

qi,j = (1 + |yi − yj|²)⁻¹ / ∑_{k≠j} (1 + |yk − yj|²)⁻¹.

We choose y1, . . . , yn to minimize

C(y1, . . . , yn) = ∑_{1≤i≠j≤n} pi,j log2 (pi,j / qi,j).


Statistics: Confidence intervals
1 Consider an unknown probability distribution ν from

which we get n independent samples X1 , . . . , Xn , and suppose that θ is the value of some statistical functional of ν.
A confidence interval for θ is an interval-valued function of
the sample data X1 , . . . , Xn . A confidence interval has confidence level 1 − α if it contains θ with probability at least
1 − α.
2 If θ̂ is unbiased, then

(θ̂ − k se(θ̂), θ̂ + k se(θ̂))

is a 1 −

then (θ̂ − 1.96 se(θ̂), θ̂ + 1.96 se(θ̂)) is an approximate 95%
confidence interval, since 95% of the mass of the standard
normal distribution is in the interval [−1.96, 1.96].

MNIST handwritten
digit images





4 Let I ⊂ R, and suppose that T is a function from the set of
distributions to the set of real-valued functions on I. A 1 − α
confidence band for T(ν) is a pair of random functions ymin
and ymax from I to R defined in terms of n independent samples from ν and having ymin ≤ T(ν) ≤ ymax everywhere on
I with probability at least 1 − α.

Statistics: Empirical CDF convergence




Statistics: Point estimation
1 The central problem of statistics is to make inferences

about a population or data-generating process based on the
information in a finite sample drawn from the population.

2 Parametric estimation involves an assumption that the
distribution of the data-generating process comes from a
family of distributions parameterized by finitely many real
numbers, while nonparametric estimation does not. Examples: QDA is a parametric density estimator, while kernel density
estimation is nonparametric.
3 Point estimation is the inference of a single real-valued

1 Statistics is predicated on the idea that a distribution is
well-approximated by independent samples therefrom. The
Glivenko-Cantelli theorem is one formalization of this idea: If
F is the CDF of a distribution ν and F̂n is the CDF of the empirical distribution ν̂n of n samples from ν, then F̂n converges to F
along the whole number line:

max_{x ∈ R} |F(x) − F̂n(x)| → 0 as n → ∞,

[Figure labels: F̂20, Unif([0,1])]

2 The Dvoretzky-Kiefer-Wolfowitz inequality (DKW)
says that the graph of F̂n lies in the ε-band around the graph
of F with probability at least 1 − 2e^(−2nε²).
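
The DKW bound can be inverted to get a band half-width: setting 2e^(−2nε²) = α gives ε = √(log(2/α)/(2n)). The sketch below applies this to uniform samples, where F(x) = x. The seed, the sample size, and the evaluation grid are arbitrary choices for illustration.

```python
import math
import random

def ecdf(samples):
    """Return the empirical CDF x -> (number of samples <= x) / n."""
    ordered = sorted(samples)
    n = len(ordered)
    return lambda x: sum(1 for s in ordered if s <= x) / n

def dkw_epsilon(n, alpha):
    """Half-width eps solving 2*exp(-2*n*eps^2) = alpha."""
    return math.sqrt(math.log(2 / alpha) / (2 * n))

random.seed(0)                                    # arbitrary seed for reproducibility
samples = [random.random() for _ in range(500)]   # Unif([0,1]) draws, so F(x) = x
F_n = ecdf(samples)
eps = dkw_epsilon(len(samples), alpha=0.05)

# Largest gap between the empirical and true CDF on a grid of 1001 points
max_gap = max(abs(F_n(x / 1000) - x / 1000) for x in range(1001))
print(eps, max_gap)
```

With probability at least 95% over the draw of the samples, the gap stays below eps; any single run may or may not satisfy that.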

4 A statistical functional is any function T from the set of
distributions to [−∞, ∞]. An estimator θ̂ is a random variable defined in terms of n i.i.d. random variables, the purpose of which is to approximate some statistical functional of
the random variables’ common distribution. Example: Suppose that T(ν) = the mean of ν, and that θ̂ = (X1 + · · · + Xn)/n.

Statistics: Bootstrapping

6 Given a distribution ν and a statistical functional T, let
θ = T(ν). The bias of an estimator of θ is the difference between the estimator’s expected value and θ. Example: The
expectation of the sample mean θ̂ = (X1 + · · · + Xn)/n is
E(X1 + · · · + Xn)/n = E[ν], so the bias of the sample mean
is zero.

1 Bootstrapping is the use of simulation to approximate
the value of the plug-in estimator of a statistical functional
which is expressed in terms of independent samples from ν.
Example: if θ = T(ν) is the variance of the median of 3 independent samples from ν, then the bootstrap estimate of θ is obtained as
a Monte Carlo approximation of T(ν̂): we sample 3 times (with replacement) from {X1, . . . , Xn}, record the median, repeat B times
for B large, and take the sample variance of the resulting list of B medians.
2 The bootstrap approximation of T(ν̂) may be made as
close to T(ν̂) as desired by choosing B large enough. The
difference between T(ν) and T(ν̂) is likely to be small if n is
large (that is, if many samples from ν are available).
3 The bootstrap is useful for computing standard errors,
since the standard error of an estimator is often infeasible to
compute analytically but conducive to Monte Carlo approximation.
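
The median-of-3 example can be sketched as follows. The observations, the value of B, and the seed are made up for illustration.

```python
import random
import statistics

def bootstrap_variance_of_median(data, B=2000, k=3, seed=0):
    """Monte Carlo estimate of the variance of the median of k draws from the empirical measure."""
    rng = random.Random(seed)             # arbitrary seed, only for reproducibility
    medians = []
    for _ in range(B):
        resample = [rng.choice(data) for _ in range(k)]   # sampling with replacement
        medians.append(statistics.median(resample))
    return statistics.variance(medians)   # sample variance of the B recorded medians

data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9]   # made-up observations
estimate = bootstrap_variance_of_median(data)
print(estimate)
```

Increasing B reduces the Monte Carlo error, while the gap between the plug-in value and the true θ depends on how many observations are in `data`.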

8 An estimator is consistent if θ̂ → θ in probability as
n → ∞.

Statistics: Maximum likelihood estimation

7 The standard error se(θ̂) of an estimator θ̂ is its standard deviation.

9 The mean squared error of an estimator is defined to be

MSE(θ̂) = E[(θ̂ − θ)²].
10 MSE is equal to variance plus squared bias. Therefore,
MSE converges to zero as the number of samples goes to ∞
if and only if variance and bias both converge to zero.

2 The maximum likelihood estimator is

θ̂MLE = argmax_{θ ∈ R^d} LX(θ).

Equivalently, θ̂MLE = argmax_{θ ∈ R^d} ℓX(θ), where ℓx(θ) denotes the logarithm of Lx(θ).
Example: Suppose that x ↦ f(x; µ, σ²) is the normal density
with mean µ and variance σ². Then the maximum likelihood estimator is the maximizer of the log-likelihood

−(n/2) log 2π − n log σ − (X1 − µ)²/(2σ²) − · · · − (Xn − µ)²/(2σ²).

Setting the derivatives with respect to µ and σ² equal to zero, we
find µ̂ = X̄ = (1/n)(X1 + · · · + Xn) and σ̂² = (1/n)((X1 − X̄)² +
· · · + (Xn − X̄)²). So the maximum likelihood estimators agree
with the plug-in estimators.
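
A short numerical check of this example: the closed-form estimators should give a log-likelihood at least as large as nearby parameter values. The data and the comparison points are made up for illustration.

```python
import math

def normal_log_likelihood(xs, mu, sigma):
    """-(n/2) log(2 pi) - n log(sigma) - sum_i (x_i - mu)^2 / (2 sigma^2)."""
    n = len(xs)
    return (-n / 2 * math.log(2 * math.pi) - n * math.log(sigma)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

def normal_mle(xs):
    """Closed-form maximizers: the sample mean and the 1/n (not 1/(n-1)) variance."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, var_hat

xs = [1.0, 2.0, 4.0, 5.0]                 # made-up observations
mu_hat, var_hat = normal_mle(xs)
best = normal_log_likelihood(xs, mu_hat, math.sqrt(var_hat))
print(mu_hat, var_hat, best)
```

Note the divisor n in the variance: the maximum likelihood estimator matches the plug-in estimator rather than the unbiased sample variance.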
3 MLE enjoys several nice properties: under certain regularity conditions, we have (stated for θ ∈ R¹):

(i) Consistency: E[(θ̂ − θ)²] → 0 as the number of samples
goes to ∞.
(ii) Asymptotic normality: (θ̂ − θ)/√(Var θ̂) converges to
N(0, 1) as the number of samples goes to ∞.
(iii) Asymptotic optimality: the MSE of the MLE converges
to 0 approximately as fast as the MSE of any other consistent estimator.
4 Potential difficulties with the MLE:

in probability.

feature of the distribution of the data-generating process
(such as its mean, variance, or median).

5 The empirical measure ν̂ of X1, . . . , Xn is the probability measure which assigns mass 1/n to each sample’s location.
The plug-in estimator of θ = T(ν) is obtained by applying
T to the empirical measure: θ̂ = T(ν̂).

Example: Suppose x ↦ f(x; θ) is the density of a uniform random
variable on [0, θ]. We observe four samples drawn from this distribution: 1.41, 2.45, 6.12, and 4.9. Then the likelihood of θ = 5 is
zero, and the likelihood of θ = 10⁶ is very small.

1/k² confidence interval, by Chebyshev’s inequality.

3 If θ̂ is unbiased and approximately normally distributed,


If X is a vector of n independent samples drawn from fθ(x),
then LX(θ) is small or zero when θ is not in accordance with
the observed data.

1 Maximum likelihood estimation is a general approach
for proposing an estimator. Consider a parametric family
{fθ(x) : θ ∈ R^d} of PDFs or PMFs. Given x ∈ R^n, the
likelihood Lx : R^d → R is defined by

Lx(θ) = fθ(x1) fθ(x2) · · · fθ(xn).

(i) Computational challenges. It might be hard to work
out where the maximum of the likelihood occurs, either
analytically or numerically.

(ii) Misspecification. The MLE may be inaccurate if the distribution of the samples is not in the specified parametric family.
(iii) Unbounded likelihood. If the likelihood function is not
bounded, then θ̂ is not well-defined.

pothesis. The corresponding p-value is defined to be the
minimum α-value which would have resulted in rejecting the null hypothesis, with the critical region chosen
in the same way*.
Example: Muriel Bristol claims that she can tell by taste whether
the tea or the milk was poured into the cup first. She is given eight
cups of tea, four poured milk-first and four poured tea-first.
We posit a null hypothesis that she isn’t able to discern the pouring method, under which the number of cups identified correctly is
4 with probability 1/(8 choose 4) = 1/70 ≈ 1.4% and at least 3 with probability
17/70 ≈ 24%. Therefore, at the 5% significance level, only a correct identification of all the cups would give us grounds to reject
the null hypothesis. The p-value in that case would be 1.4%.
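
The two probabilities in the tea-tasting example follow from counting: under the null hypothesis, each of the (8 choose 4) ways of picking four cups as "milk-first" is equally likely. The sketch below counts them with exact fractions; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def prob_correct_milk_first(k, cups=8, milk_first=4):
    """P(exactly k of the cups she labels milk-first truly are), under random guessing."""
    ways = comb(milk_first, k) * comb(cups - milk_first, milk_first - k)
    return Fraction(ways, comb(cups, milk_first))

p_all = prob_correct_milk_first(4)
p_at_least_3 = prob_correct_milk_first(3) + prob_correct_milk_first(4)
print(p_all, float(p_all))                  # 1/70, about 1.4%
print(p_at_least_3, float(p_at_least_3))    # 17/70, about 24%
```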
3 Failure to reject the null hypothesis is not necessarily evidence for the null hypothesis. The power of a hypothesis test
is the conditional probability of rejecting the null hypothesis
given that the alternative hypothesis is true. A p-value may
be high either because the null hypothesis is true or because
the test has low power.
4 The Wald test is based on the normal approximation.
Consider a null hypothesis θ = 0 and the alternative hypothesis θ ≠ 0, and suppose that θ̂ is approximately normally
distributed. The Wald test rejects the null hypothesis at the
5% significance level if |θ̂| > 1.96 se(θ̂).
5 The random permutation test is applicable when the
null hypothesis is that the mean of a given random variable
is equal for two populations.

(i) We compute the difference between the sample means
for the two groups.
(ii) We randomly re-assign the group labels and compute
the resulting sample mean differences. Repeat many times.
(iii) We check where the original difference falls in the sorted
list of re-sampled differences.
Example: Suppose the heights of the Romero sons are 72, 69, 68,
and 66 inches, and the heights of the Larsen sons are 70, 65, and 64
inches. Consider the null hypothesis that the expected heights are
the same for the two families, and the alternative hypothesis that
the Romero sons are taller on average (with α = 5%). We find that
the sample mean difference of about 2.4 inches is larger than 88.5%
of the mean differences obtained by resampling many times. Since
88.5% < 95%, we retain the null hypothesis.
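
With only seven sons there are just 35 ways to relabel four of them as "Romero", so the sketch below enumerates every relabeling exactly instead of resampling at random as the procedure above does. This is a swapped-in variant chosen so the output is deterministic; the variable names are made up.

```python
from fractions import Fraction
from itertools import combinations

romero = [72, 69, 68, 66]
larsen = [70, 65, 64]

def mean_difference(group_a, group_b):
    """Difference of sample means, kept as an exact fraction."""
    return Fraction(sum(group_a), len(group_a)) - Fraction(sum(group_b), len(group_b))

observed = mean_difference(romero, larsen)

# Enumerate every way to relabel four of the seven sons as "Romero"
pooled = romero + larsen
differences = []
for chosen in combinations(range(len(pooled)), len(romero)):
    a = [pooled[i] for i in chosen]
    b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    differences.append(mean_difference(a, b))

share_at_most = Fraction(sum(d <= observed for d in differences), len(differences))
p_value = Fraction(sum(d >= observed for d in differences), len(differences))
print(float(observed), float(share_at_most), float(p_value))
```

The observed difference is at least as large as 31 of the 35 relabeled differences (about 88.6%), in line with the figure quoted in the example, and the one-sided p-value of 7/35 = 20% is well above 5%.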

1 Hypothesis testing is a disciplined framework for adjudicating whether observed data do not support a given hypothesis.

6 If we conduct many hypothesis tests, then the probability of obtaining some false rejections is high (xkcd.com/882).
This is called the multiple testing problem. The Bonferroni
method is to reject the null hypothesis only for those tests
whose p-values are less than α divided by the number of hypothesis tests being run. This ensures that the probability
of having even one false rejection is less than α, so it is very

2 Consider an unknown distribution from which we will
observe n samples X1 , . . . Xn .

Statistics: dplyr and ggplot2

Statistics: Hypothesis testing

(i) We state a hypothesis H0 –called the null hypothesis–about the distribution.

(ii) We come up with a test statistic T, which is a function
of the data X1 , . . . Xn , for which we can evaluate the distribution of T assuming the null hypothesis.
(iii) We give an alternative hypothesis Ha under which T is
expected to be significantly different from its value under H0 .
(iv) We give a significance level α (like 5% or 1%), and based
on Ha we determine a set of values for T—called the
critical region—which T would be in with probability at
most α under the null hypothesis.

(v) After setting H0, Ha, α, T, and the critical region,
we run the experiment, evaluate T on the samples we
get, and record the result as tobs.

(vi) If tobs falls in the critical region, we reject the null hy-

1 dplyr is an R package for manipulating data frames. The
following functions filter rows, sort rows, select columns,
add columns, group, and aggregate the columns of a
grouped data frame.
library(dplyr)
library(nycflights13)  # assumed source of the flights data frame
flights %>%
  filter(month == 1, day < 5) %>%                   # keep matching rows
  arrange(day, distance) %>%                        # sort rows
  select(month, day, distance, air_time) %>%        # keep these columns
  mutate(speed = distance / air_time * 60) %>%      # add a column
  group_by(day) %>%
  summarise(avgspeed = mean(speed, na.rm = TRUE))   # aggregate per day

2 ggplot2 is an R package for data visualization. Graphics
are built as a sum of layers, which consist of a data frame,
a geom, a stat, and a mapping from the data to the geom’s

aesthetics (like x, y, color, or size). The appearance of the
plot can be customized with scales, coords, and themes.

Model Evaluation

Logistic Regression

Discrete Distributions

Prediction Error = Bias² + Variance + Irreducible Noise
Bias - wrong assumptions when training → can’t capture
underlying patterns → underfit
Variance - sensitive to fluctuations when training → can’t
generalize on unseen data → overfit

Predicts probability that Y belongs to a binary class (1 or 0).
Fits a logistic (sigmoid) function to the data that maximizes
the likelihood that the observations follow the curve.
Regularization can be added in the exponent.
P(Y = 1) = 1 / (1 + e^(−(β0 + βx)))
Odds - output probability can be transformed using
Odds(Y = 1) = P(Y = 1) / (1 − P(Y = 1)), where a probability of 1/3 corresponds to 1:2 odds
– Linear relationship between X and log-odds of Y
– Independent observations
– Low multicollinearity

Binomial Bin(n, p) - number of successes in n events, each
with p probability. If n = 1, this is the Bernoulli distribution.
Geometric Geom(p) - number of failures before success
Negative Binomial NBin(r, p) - number of failures before r successes
Hypergeometric HGeom(N, k, n) - number of k successes in
a size N population with n draws, without replacement
Poisson Pois(λ) - number of successes in a fixed time interval,
where successes occur independently at an average rate λ

Continuous Distributions
Normal/Gaussian N (µ, σ), Standard Normal Z ∼ N (0, 1)
Central Limit Theorem - sample mean of i.i.d. data
approaches normal distribution
Exponential Exp(λ) - time between independent events
occurring at an average rate λ
Gamma Gamma(n, λ) - time until n independent events
occurring at an average rate λ

Hypothesis Testing
Significance Level α - probability of Type 1 error
p-value - probability, assuming the null hypothesis is true, of
getting results at least as extreme as those observed. If p-value < α,
or if test statistic > critical value, then reject the null.

Type I Error (False Positive) - null true, but reject
Type II Error (False Negative) - null false, but fail to reject
Power - probability of avoiding a Type II Error, and rejecting
the null when it is indeed false
Z-Test - tests whether population means/proportions are
different. Assumes test statistic is normally distributed and is
used when n is large and variances are known. If not, then use
a t-test. Paired tests compare the mean at different points in
time, and two-sample tests compare means for two groups.
ANOVA - analysis of variance, used to compare 3+ samples
with a single test
Chi-Square Test - checks relationship between categorical
variables (age vs. income). Or, can check goodness-of-fit
between observed data and expected population distribution.

– Supervised - labeled data
– Unsupervised - unlabeled data
– Reinforcement - actions, states, and rewards
Cross Validation - estimate test error with a portion of
training data to validate accuracy and model parameters
– k-fold - divide data into k groups, and use one to validate
– leave-p-out - use p samples to validate and the rest to train
Parametric - assume data follows a function form with a
fixed number of parameters
Non-parametric - no assumptions on the data and an
unbounded number of parameters



Mean Squared Error (MSE) = (1/n) ∑ (yi − ŷi)²
Mean Absolute Error (MAE) = (1/n) ∑ |yi − ŷi|
Residual Sum of Squares = ∑ (yi − ŷi)²
Total Sum of Squares = ∑ (yi − ȳ)²
R² = 1 − SSres / SStot
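
A small sketch of these regression metrics, using made-up true and predicted values; the function name and the dictionary keys are hypothetical.

```python
def regression_metrics(y_true, y_pred):
    """Return MSE, MAE and R^2 for paired true and predicted values."""
    n = len(y_true)
    y_bar = sum(y_true) / n
    rss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))   # residual sum of squares
    tss = sum((y - y_bar) ** 2 for y in y_true)               # total sum of squares
    return {
        "mse": rss / n,
        "mae": sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n,
        "r2": 1 - rss / tss,
    }

metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
print(metrics)
```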

Decision Trees

              Predict Yes             Predict No
Actual Yes    True Positives (TP)     False Negatives (FN)
Actual No     False Positives (FP)    True Negatives (TN)

– Precision = TP / (TP + FP), percent correct when predict positive
– Recall, Sensitivity = TP / (TP + FN), percent of actual positives
identified correctly (True Positive Rate)
– Specificity = TN / (TN + FP), percent of actual negatives identified
correctly, also 1 - FPR (True Negative Rate)
– F1 = 2 · (precision · recall) / (precision + recall), useful when classes are imbalanced
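
The four metrics can be computed directly from the confusion-matrix counts. The counts below are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, specificity and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity or true positive rate
    specificity = tn / (tn + fp)     # true negative rate, equal to 1 - FPR
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1

precision, recall, specificity, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(precision, recall, specificity, f1)
```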

Classification and Regression Tree
CART for regression minimizes SSE by splitting data into
sub-regions and predicting the average value at leaf nodes.
Trees are prone to high variance, so tune through CV.
– Complexity parameter, to only keep splits that improve
SSE by at least cp (most influential, small cp → deep tree)
– Minimum number of samples at a leaf node
– Minimum number of samples to consider a split

ROC Curve - plots TPR vs. FPR for every threshold α.
AUC measures how likely the model differentiates positives
and negatives. Perfect AUC = 1, Baseline = 0.5
Precision-Recall Curve - focuses on the correct prediction
of class 1, useful when data or FP/FN costs are imbalanced

Linear Regression
Models linear relationships between a continuous response and
explanatory variables
Ordinary Least Squares - find β̂ for ŷ = β̂0 + β̂X + ε by
solving β̂ = (XᵀX)⁻¹XᵀY, which minimizes the SSE
– Linear relationship and independent observations
– Homoscedasticity - error terms have constant variance
– Errors are uncorrelated and normally distributed
– Low multicollinearity
Add a penalty for large coefficients to reduce overfitting
Subset (L0): λ||β̂||0 = λ(number of non-zero variables)
– Computationally slow, need to fit 2^k models
– Alternatives: forward and backward stepwise selection
LASSO (L1): λ||β̂||1 = λ ∑ |β̂|
– Coefficients shrunk to zero
Ridge (L2): λ||β̂||2 = λ ∑ (β̂)²

– Reduces effects of multicollinearity
Combining LASSO and Ridge gives Elastic Net. In all cases,
as λ grows, bias increases and variance decreases.
Regularization can also be applied to many other algorithms.

CART for classification minimizes the sum of region impurity,
where p̂i is the probability of a sample being in category i.
Possible measures (for two classes, Gini impurity is at most 0.5 and entropy at most 1):
– Gini impurity = 1 − ∑i (p̂i)²
– Entropy = ∑i −(p̂i) log2(p̂i)
At each leaf node, CART predicts the most frequent category,
assuming false negative and false positive costs are the same.
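
A minimal sketch of the two impurity measures, assuming the standard definitions Gini = 1 − ∑ p̂i² and entropy = −∑ p̂i log2 p̂i; the class proportions passed in are made up.

```python
import math

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1 - sum(p * p for p in probs)

def entropy(probs):
    """Entropy in bits; classes with zero proportion contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # most impure two-class node
print(gini([0.9, 0.1]), entropy([0.9, 0.1]))   # mostly pure node
```

Both measures are zero for a pure node and largest when the classes are evenly mixed.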

Random Forest
Trains an ensemble of trees that vote for the final prediction
Bootstrapping - sampling with replacement (will contain
duplicates), until the sample is as large as the training set
Bagging - training independent models on different subsets of
the data, which reduces variance. Each tree is trained on
∼63% of the data, so the out-of-bag 37% can estimate
prediction error without resorting to CV.
Additional Hyperparameters (no cp):
– Number of trees to build
– Number of variables considered at each split
Deep trees increase accuracy, but at a high computational
cost. Model bias is always equal to that of one of its individual trees.
Variable Importance - RF ranks variables by their ability to
minimize error when split upon, averaged across all trees



Classifies data using the label with the highest conditional
probability, given data a and classes c. Naive because it
assumes variables are independent.
Bayes' Theorem: $P(c_i \mid a) = \dfrac{P(a \mid c_i)\,P(c_i)}{P(a)}$
Gaussian Naive Bayes - calculates conditional probability
for continuous data by assuming a normal distribution

Support Vector Machines
Separates data between two classes by maximizing the margin
between the hyperplane and the nearest data points of any
class. Relies on the following:



Unsupervised, non-parametric methods that group similar
data points together based on distance

Principal Component Analysis


Randomly place k centroids across normalized data, and assign
observations to the nearest centroid. Recalculate centroids as
the mean of assignments and repeat until convergence. Using
the median or medoid (actual data point) may be more robust
to noise and outliers.
k-means++ - improves selection of initial clusters
1. Pick the first center randomly
2. Compute distance between points and the nearest center
3. Choose new center using a weighted probability
distribution proportional to distance
4. Repeat until k centers are chosen
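
The seeding steps above can be sketched as follows. This is a minimal illustration rather than a reference implementation: it weights candidates by squared distance to the nearest chosen center (the usual k-means++ convention, which is an assumption relative to the wording above), and it assumes the data contains at least k distinct points. The function and variable names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from points using k-means++ style seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # step 1: first center chosen at random
    while len(centers) < k:         # step 4: repeat until k centers exist
        # step 2: squared distance from every point to its nearest center
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        # step 3: sample the next center with probability proportional to weight
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0)]
centers = kmeans_pp_init(points, k=3)
print(centers)
```

Points already chosen as centers have weight zero, so they cannot be selected twice.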
Evaluating the number of clusters and performance:
Silhouette Value - measures how similar a data point is to
its own cluster compared to other clusters, and ranges from 1
(best) to -1 (worst).
Davies-Bouldin Index - ratio of within cluster scatter to
between cluster separation, where lower values are

Hierarchical Clustering
Support Vector Classifiers - account for outliers by
allowing misclassifications on the support vectors (points in or
on the margin)
Kernel Functions - solve nonlinear problems by computing
the similarity between points a, b and mapping the data to a
higher dimension. Common functions:
– Polynomial: $(ab + r)^d$
– Radial: $e^{-\gamma (a-b)^2}$

Hinge Loss - $\max(0,\, 1 - y_i(w^T x_i - b))$, where w is the weight
vector normal to the hyperplane (the margin width is $2/\|w\|$), b is the
offset bias, and classes are labeled ±1. Note,
even a correct prediction inside the margin gives loss > 0.
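
To make the last remark concrete, the sketch below evaluates the hinge loss for a point outside the margin, a correctly classified point inside the margin, and a misclassified point. The weight vector, offset, and sample points are made-up values for illustration only.

```python
def hinge_loss(w, b, x, y):
    """Hinge loss max(0, 1 - y * (w . x - b)) for one sample with label y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    return max(0.0, 1.0 - y * score)


w, b = [1.0, 0.0], 0.0  # hypothetical separating hyperplane x_1 = 0

outside_margin = hinge_loss(w, b, [2.0, 0.0], 1)   # correct and confident
inside_margin = hinge_loss(w, b, [0.5, 0.0], 1)    # correct but inside the margin
misclassified = hinge_loss(w, b, [-1.0, 0.0], 1)   # wrong side of the hyperplane

print(outside_margin, inside_margin, misclassified)
```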

Clusters data into groups using a predominant hierarchy
Agglomerative Approach
1. Each observation starts in its own cluster
2. Iteratively combine the most similar cluster pairs
3. Continue until all points are in the same cluster
Divisive Approach - all points start in one cluster and splits
are performed recursively down the hierarchy
Linkage Metrics - measure dissimilarity between clusters
and combine them using the minimum linkage value over all
pairwise points in different clusters by comparing:
– Single - the distance between the closest pair of points
– Complete - the distance between the farthest pair of points
– Ward’s - the increase in within-cluster SSE if two clusters
were to be combined


Projects data onto orthogonal vectors that maximize variance.
Remember, given an n × n matrix A, a nonzero vector x, and
a scalar λ, if Ax = λx then x and λ are an eigenvector and
eigenvalue of A. In PCA, the eigenvectors are uncorrelated
and represent principal components.
1. Start with the covariance matrix of standardized data
2. Calculate eigenvalues and eigenvectors using SVD or
3. Rank the principal components by their proportion of
variance explained = $\lambda_i / \sum \lambda$
For p-dimensional data, there will be p principal components
Sparse PCA - constrains the number of non-zero values in
each component, reducing susceptibility to noise and
improving interpretability

Linear Discriminant Analysis
Maximizes separation between classes and minimizes variance
within classes for a labeled dataset
1. Compute the mean and variance of each independent
variable for every class
2. Calculate the within-class ($\sigma_w^2$) and between-class ($\sigma_b^2$) variance
3. Find the matrix $W = (\sigma_w^2)^{-1}(\sigma_b^2)$ that maximizes Fisher's
signal-to-noise ratio
4. Rank the discriminant components by their signal-to-noise
ratio λ
– Independent variables are normally distributed
– Homoscedasticity - constant variance of error
– Low multicollinearity

Factor Analysis
Describes data using a linear combination of k latent factors.
Given a normalized matrix X, it follows the form $X = Lf + \epsilon$,

with factor loadings L and hidden factors f .

Dendrogram - plots the full hierarchy of clusters, where the
height of a node indicates the dissimilarity between its children

k-Nearest Neighbors
Non-parametric method that calculates yˆ using the average
value or most common class of its k-nearest points. For
high-dimensional data, information is lost through equidistant
vectors, so dimension reduction is often applied prior to k-NN.
Minkowski Distance = $\left(\sum |a_i - b_i|^p\right)^{1/p}$
– p = 1 gives Manhattan distance $\sum |a_i - b_i|$
– p = 2 gives Euclidean distance $\sqrt{\sum (a_i - b_i)^2}$
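
A short sketch of the Minkowski distance, showing how the two special cases fall out of the same function. The function name and the sample points are illustrative assumptions.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski_distance(a, b, p=1)
euclidean = minkowski_distance(a, b, p=2)
print(manhattan, euclidean)
```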

– $E(X) = E(f) = E(\epsilon) = 0$
– $\mathrm{Cov}(f) = I$ → uncorrelated factors
– $\mathrm{Cov}(f, \epsilon) = 0$
Since $\mathrm{Cov}(X) = \mathrm{Cov}(Lf) + \mathrm{Cov}(\epsilon)$, then $\mathrm{Cov}(Lf) = LL'$

Scree Plot - graphs the eigenvalues of factors (or principal
components) and is used to determine the number of factors to
retain. The ’elbow’ where values level off is often used as the

Hamming Distance - count of the differences between two
vectors, often used to compare categorical variables

Language Processing

Transforms human language into machine-usable code
Processing Techniques




Feeds inputs through different hidden layers and relies on
weights and nonlinear functions to reach an output

– Tokenization - splitting text into individual words (tokens)
– Lemmatization - reduces words to their base form based on
dictionary definition (am, are, is → be)
– Stemming - reduces words to their base form without context
(ended → end)
– Stop words - remove common and irrelevant words (the, is)
Markov Chain - stochastic and memoryless process that
predicts future events based only on the current state
n-gram - predicts the next term in a sequence of n terms
based on Markov chains

Bag-of-words - represents text using word frequencies,
without context or order
tf-idf - measures word importance for a document in a
collection (corpus), by multiplying the term frequency
(occurrences of a term in a document) with the inverse
document frequency (penalizes common terms across a corpus)
Cosine Similarity - measures similarity between vectors,
calculated as $\cos(\theta) = \dfrac{A \cdot B}{\|A\|\,\|B\|}$, which ranges from 0 to 1
for non-negative vectors such as word counts
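
As a minimal sketch, cosine similarity is the dot product divided by the product of the vector norms. The example vectors below are made-up word-count vectors, and the function assumes neither vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


orthogonal = cosine_similarity([1, 0], [0, 1])   # no shared terms
parallel = cosine_similarity([1, 2], [2, 4])     # same direction, different length
print(orthogonal, parallel)
```

Because only the direction matters, a document and a copy of it repeated twice have similarity 1.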

Perceptron - the foundation of a neural network that
multiplies inputs by weights, adds bias, and feeds the result z
to an activation function
Activation Function - defines a node’s output




ReLU: $\max(0, z)$
Tanh: $\dfrac{e^z - e^{-z}}{e^z + e^{-z}}$

– Continuous bag-of-words (CBOW) - predicts the word

given its context
– skip-gram - predicts the context given a word
GloVe - combines both global and local word co-occurrence
data to learn word similarity
BERT - accounts for word order and trains on subwords, and
unlike word2vec and GloVe, BERT outputs different vectors
for different uses of words (cell phone vs. blood cell)

Sentiment Analysis
Extracts the attitudes and emotions from text
Polarity - measures positive, negative, or neutral opinions
– Valence shifters - capture amplifiers or negators such as
’really fun’ or ’hardly fun’
Sentiment - measures emotional states such as happy or sad
Subject-Object Identification - classifies sentences as
either subjective or objective

Topic Modelling
Captures the underlying themes that appear in documents
Latent Dirichlet Allocation (LDA) - generates k topics by
first assigning each word to a random topic, then iteratively
updating assignments based on parameters α, the mix of topics
per document, and β, the distribution of words per topic
Latent Semantic Analysis (LSA) - identifies patterns using
tf-idf scores and reduces data to k dimensions through SVD

Pooling - downsamples convolution layers to reduce
dimensionality and maintain spatial invariance, allowing
detection of features even if they have shifted slightly.
Common techniques return the max or average value in the

pooling window.
The general CNN architecture is as follows:
1. Perform a series of convolution, ReLU, and pooling
operations, extracting important features from the data
2. Feed output into a fully-connected layer for classification,
object detection, or other structural analyses

Word Embedding
Maps words and phrases to numerical vectors
word2vec - trains iteratively over local word context
windows, places similar words close together, and embeds
sub-relationships directly into vectors, such that
king − man + woman ≈ queen
Relies on one of the following:

Neural Network

Analyzes structural or visual data by extracting local features
Convolutional Layers - iterate over windows of the image,
applying weights, bias, and an activation function to create
feature maps. Different weights lead to different feature maps.

Recurrent Neural Network
Since a system of linear activation functions can be simplified
to a single perceptron, nonlinear functions are commonly used
for more accurate tuning and meaningful gradients

Predicts sequential data using a temporally connected system
that captures both new inputs and previous outputs using
hidden states

Loss Function - measures prediction error using functions
such as MSE for regression and binary cross-entropy for
probability-based classification
Gradient Descent - minimizes the average loss by moving
iteratively in the direction of steepest descent, controlled by
the learning rate γ (step size). Note, γ can be updated
adaptively for better performance. For neural networks,
finding the best set of weights involves:
1. Initialize weights W randomly with near-zero values
2. Loop until convergence:
– Calculate the average network loss J(W )
– Backpropagation - iterate backwards from the last
layer, computing the gradient $\partial J(W)/\partial W$ and updating the
weight $W \leftarrow W - \gamma\,\partial J(W)/\partial W$

3. Return the minimum loss weight matrix W
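
The update rule in the loop above can be illustrated on a one-dimensional problem. The sketch below minimizes the toy loss $J(w) = (w - 3)^2$, whose gradient is $2(w - 3)$; the loss, learning rate, and step count are assumptions chosen only to keep the example small, and a fixed number of steps stands in for a convergence check.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly apply w <- w - learning_rate * grad(w), starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


# Toy loss J(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_star)
```

In a neural network the scalar w becomes the weight matrices and the gradient comes from backpropagation, but the update has the same shape.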
To prevent overfitting, regularization can be applied by:
– Stopping training when validation performance drops
– Dropout - randomly drop some nodes during training to
prevent over-reliance on a single node
– Embedding weight penalties into the objective function
Stochastic Gradient Descent - only uses a single point to
compute gradients, leading to noisier convergence but faster
compute speeds. Alternatively, mini-batch gradient descent
trains on small subsets of the data, striking a balance between
the approaches.

RNNs can model various input-output scenarios, such as
many-to-one, one-to-many, and many-to-many. Relies on
parameter (weight) sharing for efficiency. To avoid redundant
calculations during backpropagation, downstream gradients
are found by chaining previous gradients. However, repeatedly
multiplying values greater than or less than 1 leads to:
– Exploding gradients - model instability and overflows
– Vanishing gradients - loss of learning ability
This can be solved using:
– Gradient clipping - cap the maximum value of gradients
– ReLU - its derivative prevents gradient shrinkage for x > 0
– Gated cells - regulate the flow of information
Long Short-Term Memory - learns long-term dependencies
using gated cells and maintains a separate cell state from what
is outputted. Gates in LSTM perform the following:

Forget and filter out irrelevant info from previous layers
Store relevant info from current input
Update the current cell state
Output the hidden state, a filtered version of the cell state

LSTMs can be stacked to improve performance.




Ensemble method that learns by sequentially fitting many
simple models. As opposed to bagging, boosting trains on all
the data and combines weak models using the learning rate α.
Boosting can be applied to many machine learning problems.
AdaBoost - uses sample weighting and decision ’stumps’
(one-level decision trees) to classify samples
1. Build decision stumps for every feature, choosing the one
with the best classification accuracy
2. Assign more weight to misclassified samples and reward
trees that differentiate them, where $\alpha = \frac{1}{2} \ln \frac{1 - \text{TotalError}}{\text{TotalError}}$
3. Continue training and weighting decision stumps until
Gradient Boost - trains sequential models by minimizing a
given loss function using gradient descent at each step
1. Start by predicting the average value of the response
2. Build a tree on the errors, constrained by depth or the
number of leaf nodes
3. Scale decision trees by a constant learning rate α
4. Continue training and weighting decision trees until

XGBoost - fast gradient boosting method that utilizes
regularization and parallelization
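
For AdaBoost, the weight given to each stump depends only on its total weighted error. The sketch below computes $\alpha = \frac{1}{2}\ln\frac{1 - \text{TotalError}}{\text{TotalError}}$ for a few error levels. The clipping constant `eps` is an assumption added to avoid dividing by zero and is not part of the formula itself.

```python
import math


def stump_weight(total_error, eps=1e-10):
    """AdaBoost stump weight, 0.5 * ln((1 - error) / error), with clipped error."""
    total_error = min(max(total_error, eps), 1.0 - eps)
    return 0.5 * math.log((1.0 - total_error) / total_error)


good = stump_weight(0.1)     # accurate stump: large positive weight
chance = stump_weight(0.5)   # no better than guessing: zero weight
bad = stump_weight(0.9)      # worse than guessing: negative weight
print(good, chance, bad)
```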

Maximizes future rewards by learning through state-action
pairs. That is, an agent performs actions in an environment,
which updates the state and provides a reward.

Identifies unusual patterns that differ from the majority of the
data, and can be applied in supervised, unsupervised, and
semi-supervised scenarios. Assumes that anomalies are:


Recommender Systems
Suggests relevant items to users by predicting ratings and
preferences, and is divided into two main types:
– Content Filtering - recommends similar items
– Collaborative Filtering - recommends what similar users like
The latter is more common, and includes methods such as:
Memory-based Approaches - finds neighborhoods by using
rating data to compute user and item similarity, measured
using correlation or cosine similarity
– User-User - similar users also liked...
– Leads to more diverse recommendations, as opposed to
just recommending popular items
– Suffers from sparsity, as the number of users who rate
items is often low
– Item-Item - similar users who liked this item also liked...
– Efficient when there are more users than items, since the
item neighborhoods update less frequently than users

– Similarity between items is often more reliable than
similarity between users
Model-based Approaches - predict ratings of unrated
items, through methods such as Bayesian networks, SVD, and
clustering. Handles sparse data better than memory-based
– Matrix Factorization - decomposes the user-item rating
matrix into two lower-dimensional matrices representing the
users and items, each with k latent factors
Recommender systems can also be combined through ensemble
methods to improve performance.

Reinforcement Learning

Anomaly Detection

– Rare - the minority class that occurs rarely in the data
– Different - have feature values that are very different from
normal observations

Multi-armed Bandit Problem - a gambler plays slot
machines with unknown probability distributions and must
decide the best strategy to maximize reward. This exemplifies
the exploration-exploitation tradeoff, as the best long-term
strategy may involve short-term sacrifices.
RL is divided into two types, with the former being more
– Model-free - learn through trial and error in the
– Model-based - access to the underlying (approximate)

state-reward distribution
Q-Value Q(s, a) - captures the expected discounted total
future reward given a state and action
Policy - chooses the best actions for an agent at various states
π(s) = arg max Q(s, a)

Deep RL algorithms can further be divided into two main
types, depending on their learning objective
Value Learning - aims to approximate Q(s, a) for all actions
the agent can take, but is restricted to discrete action spaces.
Can use the ε-greedy method, where ε measures the
probability of exploration. If chosen, the next action is
selected uniformly at random.
– Q-Learning - simple value iteration model that maximizes
the Q-value using a table on states and actions
– Deep Q Network - finds the best action to take by
minimizing the Q-loss, the squared error between the target
Q-value and the prediction
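
A minimal sketch of ε-greedy action selection over a list of Q-values for a single state. The Q-values and the function name are hypothetical. With probability ε the agent explores uniformly at random, and otherwise it exploits the action with the highest Q-value.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Return an action index: random with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # made-up Q(s, a) for three actions in one state

greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
explored_actions = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(20)]
print(greedy_action, explored_actions)
```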
Policy Gradient Learning - directly optimize the policy
π(s) through a probability distribution of actions, without the
need for a value function, allowing for continuous action
Actor-Critic Model - hybrid algorithm that relies on two
neural networks, an actor π(s, a, θ) which controls agent
behavior and a critic Q(s, a, w) that measures how good an
action is. Both run in parallel to find the optimal weights θ, w
to maximize expected reward. At each step:
1. Pass the current state into the actor and critic
2. The critic evaluates the action’s Q-value, and the actor

updates its weight θ
3. The actor takes the next action leading to a new state, and
the critic updates its weight w

Anomaly detection techniques span a wide range, including
methods based on:
Statistics - relies on various statistical methods to identify
outliers, such as Z-tests, boxplots, interquartile ranges, and
variance comparisons
Density - useful when data is grouped around dense
neighborhoods, measured by distance. Methods include
k-nearest neighbors, local outlier factor, and isolation forest.
– Isolation Forest - tree-based model that labels outliers
based on an anomaly score
1. Select a random feature and split value, dividing the
dataset in two
2. Continue splitting randomly until every point is isolated
3. Calculate the anomaly score for each observation, based
on how many iterations it took to isolate that point.
4. If the anomaly score is greater than a threshold, mark it
as an outlier
Intuitively, outliers are easier to isolate and should have
shorter path lengths in the tree
Clusters - data points outside of clusters could potentially be
marked as anomalies
Autoencoders - unsupervised neural networks that compress
data and reconstruct it. The network has two parts: an
encoder that embeds data to a lower dimension, and a decoder

that produces a reconstruction. Autoencoders do not
reconstruct the data perfectly, but rather focus on capturing
important features in the data.

Upon decoding, the model will accurately reconstruct normal
patterns but struggle with anomalous data. The reconstruction
error is used as an anomaly score to detect outliers.
Autoencoders are applied to many problems, including image
processing, dimension reduction, and information retrieval.

Summary of Machine Learning Algorithms descriptions,


We want to learn a target function f that maps input
𝑌 = 𝑓 𝑋 + 𝑒
Linear, Nonlinear

shape and structure of f, thus the need of testing several

- Non parametric (or Nonlinear): free to learn any
functional form from the training data, while maintaining
Linear algorithms are usually simpler, faster and requires
less data, while Nonlinear can be are more flexible, more
Supervised, Unsupervised

Supervised learning methods learn to predict Y from X

The goal of parameterization is to achieve a low bias
(underlying pattern not too simplified) and low variance

In statistics, fit refers to how well the target function is




Overfitting refers to learning the training data detail and


Ordinary Least Squares

of squared residuals: $\sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 = \|y - X\beta\|^2$

Gradient Descent


Using linear algebra such that we have $\beta = (X^T X)^{-1} X^T y$


Maximum Likelihood Estimation

→ Initialization $\theta = 0$ (coefficients to 0 or random)

MLE is used to find the estimators that minimizes the

→ Calculate cost $J(\theta) = \mathrm{evaluate}(f(\text{coefficients}))$



→ Update coeff $\theta_j = \theta_j - \alpha\,\partial J(\theta)/\partial \theta_j$



The cost updating process is repeated until convergence

$\mathcal{L}(\theta \mid x) = f_\theta(x)$, density function of the data distribution

Linear Algorithms
Linear Regression


Bias-Variance trade-off





Underfitting, Overfitting


Bias refers to simplifying assumptions made to learn the


$y = \beta_0 + \beta_1 x_1 + \dots + \beta_i x_i$

Batch Gradient Descent does summing/averaging of the
Stochastic Gradient Descent applies the procedure of

𝛽@ is usually called intercept or bias coefficient. The
dimension of the hyperplane of the regression is its


Logistic Regression

Linear Discriminant Analysis


For multiclass classification, LDA is the preferred linear

Logistic regression is a linear method but predictions are

LDA representation consists of statistical properties

𝜇Z =



?CD 𝑥? and𝜎





− 𝜇Z )

Learning a LR means estimating the coefficients from the
There are extensions of LR training called regularization




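Lasso (L1): $\sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = RSS + \lambda \sum_{j=1}^{p} |\beta_j|$

Ridge (L2): $\sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = RSS + \lambda \sum_{j=1}^{p} \beta_j^2$

where $\lambda \geq 0$ is a tuning parameter to be determined.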


-.Call-center waiting-time prediction according to the



𝑝 𝑋 =
= 𝑝 𝑌 = 1 𝑋
Learning the Logistic regression coefficients is done using

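LDA assumes Gaussian data and attributes of same $\sigma^2$.
$P(Y = k \mid X = x) = \dfrac{P(k) \times P(x \mid k)}{\sum_{l=1}^{K} P(l) \times P(x \mid l)}$

$D_k(x) = x \times \dfrac{\mu_k}{\sigma^2} - \dfrac{\mu_k^2}{2\sigma^2} + \ln(P(k))$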














- Standardize data to $\mu = 0$, $\sigma = 1$ to have same variance











𝑝Z (1 − 𝑝Z )



Nonlinear Algorithms
All Nonlinear Algorithms are non-parametric and more
Classification and Regression Trees

The model representation is a binary tree, where each

The most common Stopping Criterion for splitting is a
The simplest form of pruning is Reduced Error Pruning:





− 𝜇 𝑥 )F

𝑓 𝑥 𝜇 𝑥 ,𝜎 =




F€ •






Naive Bayes is a classification algorithm interested in



K-Nearest Neighbors

𝑃 ℎ𝑑 =



with naive hypothesis $P(d \mid h) = P(x_1 \mid h) \times \dots \times P(x_i \mid h)$


(𝑦? − 𝑦)






?CD 𝑥? and𝜎






At each step, the best predictor $X_j$ and the best cutpoint $s$
are selected such that $\{X \mid X_j < s\}$ and $\{X \mid X_j \geq s\}$



𝜇 𝑥 =


Learning of a CART is done by a greedy approach called



Naive Bayes Classifier

The model actually split the input space into (hyper)

Gaussian Naive Bayes can extend to numerical attributes

$MAP(h) = \max P(h \mid d) = \max(P(d \mid h) \times P(h))$



Training is fast because only probabilities need to be
𝑃 ℎ =


and𝑃 𝑥 ℎ =



Ensemble Algorithms


Ensemble methods use multiple, simpler algorithms

𝑑 𝑎, 𝑏 =


Bagging and Random Forest


|𝑎? − 𝑏? |

𝑑 𝑎, 𝑏 =

Random Forest is part of a bigger type of ensemble

(𝑎? − 𝑏? )F




$f(x) = \langle w, x \rangle + \rho = w^T x + \rho$ with $\rho$ the bias
Which gives for linear kernel, with $x_i$ the support vectors:


It uses the Bootstrap statistical procedure: estimate a
quantity from a sample by creating many random

$f(x) = \sum_i \left(a_i \times (x \times x_i)\right) + \rho$



$\max(0,\, 1 - y_i (w \cdot x_i - b)) + \lambda \|w\|^2$










- Linear (dot-product): $K(x, x_i) = (x \times x_i)$


- Polynomial: $K(x, x_i) = (1 + (x \times x_i))^d$

Random Forest is a tweaked version of bagged decision

Support Vector Machines

- Radial: $K(x, x_i) = e^{-\gamma \sum (x - x_i)^2}$




- SVM assumes numeric inputs, may require dummy

During learning, each sub-tree can only access a random


A good default is $\sqrt{p}$ for classification and $p/3$ for regression.

the input variable space by their class, with the largest









)• )

However, combining models works best if submodels are





Bagged method can provide feature importance, by
calculating and averaging the error function drop for
individual variables (depending of samples where a




Weak models are added sequentially using the training


Interesting Resources












Boosting and AdaBoost

AdaBoost was the first successful boosting algorithm








$F_T(x) = \sum_{t=1}^{T} f_t(x)$

where each $f_t$ is a weak learner correcting the errors of the


Adaboost is commonly used with decision trees with one

Predictions are made using the weighted average of the


Each training set instance is initially weighted $w(x_i) = 1/n$

One decision stump is prepared using the weighted

?CD(? ìvããxã? )


Which is the weighted sum of the misclassification rates,
where w is the training instance i weight and 𝑝v••x•? its

𝑠𝑡𝑎𝑔𝑒 = ln(


= ì rstãvìã

Convolutional Neural Networks


❒ Architecture of a traditional CNN – Convolutional neural networks, also known as CNNs,
are a specific type of neural networks that are generally composed of the following layers:

1 Convolutional Neural Networks
1.1 Overview
1.2 Types of layer
1.3 Filter hyperparameters
1.4 Tuning hyperparameters
1.5 Commonly used activation functions
1.6 Object detection
1.6.1 Face verification and recognition
1.6.2 Neural style transfer
1.6.3 Architectures using computational tricks















The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters
that are described in the next sections.


2 Recurrent Neural Networks
2.1 Overview
2.2 Handling long term dependencies
2.3 Learning word representation
2.3.1 Motivation and notations
2.3.2 Word embeddings
2.4 Comparing words
2.5 Language model
2.6 Machine translation
2.7 Attention




























3 Deep Learning Tips and Tricks
3.1 Data processing
3.2 Training a neural network
3.2.1 Definitions
3.2.2 Finding optimal weights
3.3 Parameter tuning
3.3.1 Weights initialization
3.3.2 Optimizing convergence
3.4 Regularization
3.5 Good practices




























Types of layer

❒ Convolutional layer (CONV) – The convolution layer (CONV) uses filters that perform
convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or
activation map.

Remark: the convolution step can be generalized to the 1D and 3D cases as well.
❒ Pooling (POOL) – The pooling layer (POOL) is a downsampling operation, typically applied
after a convolution layer, which provides some spatial invariance. In particular, max and average
pooling are special kinds of pooling where the maximum and average value is taken, respectively.


Max pooling
- Each pooling operation selects the maximum value of the current view
- Preserves detected features
- Most commonly used

Average pooling
- Each pooling operation averages the values of the current view
- Downsamples feature map
- Used in LeNet

❒ Fully Connected (FC) – The fully connected layer (FC) operates on a flattened input where
each input is connected to all neurons. If present, FC layers are usually found towards the end
of CNN architectures and can be used to optimize objectives such as class scores.

Filter hyperparameters

The convolution layer contains filters for which it is important to know the meaning behind its
hyperparameters.

❒ Dimensions of a filter – A filter of size F × F applied to an input containing C channels is
a F × F × C volume that performs convolutions on an input of size I × I × C and produces an
output feature map (also called activation map) of size O × O × 1.

Remark: the application of K filters of size F × F results in an output feature map of size
O × O × K.

❒ Stride – For a convolutional or a pooling operation, the stride S denotes the number of pixels
by which the window moves after each operation.

❒ Zero-padding – Zero-padding denotes the process of adding P zeroes to each side of the
boundaries of the input. This value can either be manually specified or automatically set through
one of the three modes detailed below:

Valid: $P = 0$
- No padding
- Drops last convolution if dimensions do not match

Same: $P_{\text{start}} = \left\lfloor \dfrac{S \lceil I/S \rceil - I + F - S}{2} \right\rfloor$, $P_{\text{end}} = \left\lceil \dfrac{S \lceil I/S \rceil - I + F - S}{2} \right\rceil$
- Padding such that feature map size has size $\lceil I/S \rceil$
- Output size is mathematically convenient
- Also called 'half' padding

Full: $P_{\text{start}} \in [[0, F - 1]]$, $P_{\text{end}} = F - 1$
- Maximum padding such that end convolutions are applied on the limits of the input
- Filter 'sees' the input end-to-end

Tuning hyperparameters

❒ Parameter compatibility in convolution layer – By noting I the length of the input
volume size, F the length of the filter, P the amount of zero padding, S the stride, then the
output size O of the feature map along that dimension is given by:

$O = \dfrac{I - F + P_{\text{start}} + P_{\text{end}}}{S} + 1$

Remark: oftentimes, Pstart = Pend = P, in which case we can replace Pstart + Pend by 2P in
the formula above.
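
The output size of a convolution along one dimension can be checked with a few lines of code. This sketch assumes the common form $O = (I - F + P_{\text{start}} + P_{\text{end}})/S + 1$ with the division rounded down when the stride does not divide evenly. The function name and the example sizes are illustrative.

```python
def conv_output_size(i, f, s, p_start=0, p_end=0):
    """Output length along one dimension for input i, filter f, stride s, and padding."""
    return (i - f + p_start + p_end) // s + 1


no_padding = conv_output_size(32, 5, 1)            # 5-wide filter on 32 inputs
same_padding = conv_output_size(32, 3, 1, 1, 1)    # padding of 1 on each side keeps size
strided = conv_output_size(7, 3, 2)                # stride 2 roughly halves the size
print(no_padding, same_padding, strided)
```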

❒ Understanding the complexity of the model – In order to assess the complexity of a
model, it is often useful to determine the number of parameters that its architecture will have.
In a given layer of a convolutional neural network, it is done as follows:



CONV
- Input size: I × I × C
- Output size: O × O × K
- Number of parameters: (F × F × C + 1) · K
- Remarks: one bias parameter per filter; in most cases, S < F; a common choice for K is 2C

POOL
- Input size: I × I × C
- Output size: O × O × C
- Number of parameters: 0
- Remarks: pooling operation done channel-wise; in most cases, S = F

FC
- Input size: Nin
- Output size: Nout
- Number of parameters: (Nin + 1) × Nout
- Remarks: input is flattened; one bias parameter per neuron; the number of FC neurons is free
of structural constraints

ReLU variants:

ReLU: $g(z) = \max(0, z)$
- Non-linearity complexities biologically interpretable

Leaky ReLU: $g(z) = \max(\epsilon z, z)$ with $\epsilon \ll 1$
- Addresses dying ReLU issue for negative values

ELU: $g(z) = \max(\alpha(e^z - 1), z)$ with $\alpha \ll 1$
- Differentiable everywhere

❒ Softmax – The softmax step can be seen as a generalized logistic function that takes as input
a vector of scores x ∈ Rn and outputs a vector of output probability p ∈ Rn through a softmax
function at the end of the architecture. It is defined as follows:

$p = (p_1, \dots, p_n)^T$ where $p_i = \dfrac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$
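
A small sketch of the softmax function, assuming the standard definition $p_i = e^{x_i} / \sum_j e^{x_j}$. Subtracting the maximum score before exponentiating is a common numerical-stability trick that leaves the result unchanged; it is an implementation choice of this example.

```python
import math


def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
uniform = softmax([0.0, 0.0])
print(probs, uniform)
```

The largest score always receives the largest probability, and equal scores receive equal probabilities.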

❒ Receptive field – The receptive field at layer k is the area denoted Rk × Rk of the input
that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and
Si the stride value of layer i and with the convention S0 = 1, the receptive field at layer k can
be computed with the formula:

$R_k = 1 + \sum_{j=1}^{k} (F_j - 1) \prod_{i=0}^{j-1} S_i$

For example, with F1 = F2 = 3 and S1 = S2 = 1, this gives R2 = 1 + 2 · 1 + 2 · 1 = 5.

Object detection

❒ Types of models – There are 3 main types of object recognition algorithms, for which the
nature of what is predicted is different. They are described in the table below:

Image classification
- Classifies a picture
- Predicts probability of object
- Method: Traditional CNN

Classification w. localization
- Detects object in a picture
- Predicts probability of object and where it is
- Method: Simplified YOLO, R-CNN

Detection
- Detects up to several objects in a picture
- Predicts probabilities of objects and where they are located
- Method: YOLO, R-CNN


Commonly used activation functions

❒ Rectified Linear Unit – The rectified linear unit layer (ReLU) is an activation function g
that is used on all elements of the volume. It aims at introducing non-linearities to the network.
Its variants include Leaky ReLU and ELU.

❒ Detection – In the context of object detection, different methods are used depending on
whether we just want to locate the object or detect a more complex shape in the image. The
two main ones are summed up in the table below:

Bounding box detection
- Detects the part of the image where the object is located
- Box of center (bx, by), height bh and width bw

Landmark detection
- Detects a shape or characteristics of an object (e.g. eyes)
- More granular
- Reference points (l1x, l1y), ..., (lnx, lny)

❒ YOLO – You Only Look Once (YOLO) is an object detection algorithm that performs the
following steps:
• Step 1: Divide the input image into a G × G grid.
• Step 2: For each grid cell, run a CNN that predicts y of the following form:

$y = [\underbrace{p_c, b_x, b_y, b_h, b_w, c_1, c_2, \dots, c_p}_{\text{repeated } k \text{ times}}, \dots]^T \in \mathbb{R}^{G \times G \times k \times (5+p)}$

where pc is the probability of detecting an object, bx, by, bh, bw are the properties of the
detected bounding box, c1, ..., cp is a one-hot representation of which of the p classes were
detected, and k is the number of anchor boxes.
• Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.

Remark: when pc = 0, then the network does not detect any object. In that case, the corresponding predictions bx, ..., cp have to be ignored.

❒ Intersection over Union – Intersection over Union, also known as IoU, is a function that
quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding
box Ba. It is defined as:

$\text{IoU}(B_p, B_a) = \dfrac{B_p \cap B_a}{B_p \cup B_a}$

Remark: we always have IoU ∈ [0,1]. By convention, a predicted bounding box Bp is considered
as being reasonably good if $\text{IoU}(B_p, B_a) \geq 0.5$.
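
Intersection over Union can be sketched for axis-aligned boxes as follows. The corner format `(x1, y1, x2, y2)` and the function name are assumptions made for this example; the text above describes boxes by center, height, and width, which would need converting first.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) with x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


overlapping = iou((0, 0, 2, 2), (1, 1, 3, 3))   # overlap area 1, union area 7
identical = iou((0, 0, 2, 2), (0, 0, 2, 2))
disjoint = iou((0, 0, 1, 1), (5, 5, 6, 6))
print(overlapping, identical, disjoint)
```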

❒ R-CNN – Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then runs the detection algorithm to find the most probable objects in those bounding boxes.

Remark: although the original algorithm is computationally expensive and slow, newer architectures such as Fast R-CNN and Faster R-CNN enabled the algorithm to run faster.

❒ Anchor boxes – Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.

❒ Non-max suppression – The non-max suppression technique aims at removing duplicate overlapping bounding boxes of the same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:
• Step 1: Pick the box with the largest prediction probability.
• Step 2: Discard any box having an $\textrm{IoU} \geqslant 0.5$ with the previous box.
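The non-max suppression steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the box encoding `(x_min, y_min, x_max, y_max)`, the helper `iou` and the parameter names `score_threshold` and `iou_threshold` are assumptions made for the example, with defaults taken from the 0.6 and 0.5 values in the text.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max) (assumed encoding)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the boxes kept, most confident first."""
    # Preliminary filter: drop boxes with a low prediction probability.
    remaining = [i for i, s in enumerate(scores) if s >= score_threshold]
    remaining.sort(key=lambda i: scores[i], reverse=True)

    kept = []
    while remaining:
        # Step 1: pick the box with the largest prediction probability.
        best = remaining.pop(0)
        kept.append(best)
        # Step 2: discard any box whose IoU with the picked box is >= threshold.
        remaining = [i for i in remaining
                     if iou(boxes[best], boxes[i]) < iou_threshold]
    return kept


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
    scores = [0.9, 0.8, 0.7, 0.3]
    print(non_max_suppression(boxes, scores))
```

In the usage example, the second box overlaps the first with an IoU of 81/119 and is discarded, the third box is far away and is kept, and the fourth is removed by the probability filter.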


Face verification and recognition

❒ Types of models – The two main types of models are summed up in the table below:

Face verification
- Is this the correct person?
- One-to-one lookup

Face recognition
- Is this one of the K persons in the database?
- One-to-many lookup
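
The difference between the two lookups can be sketched as follows. The sketch assumes that each image has already been encoded into a vector of numbers and that two encodings are compared with a Euclidean distance against a threshold. The names `verify`, `recognize`, the `threshold` value and the toy database are hypothetical and only serve the example.

```python
import math


def distance(u, v):
    """Euclidean distance between two encodings (assumed comparison metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def verify(encoding, claimed_encoding, threshold=0.7):
    """Face verification: one-to-one lookup against a single claimed identity."""
    return distance(encoding, claimed_encoding) <= threshold


def recognize(encoding, database, threshold=0.7):
    """Face recognition: one-to-many lookup over the K persons in the database.

    Returns the closest name, or None if nobody is close enough.
    """
    best_name, best_dist = None, float("inf")
    for name, reference in database.items():
        d = distance(encoding, reference)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= threshold else None


if __name__ == "__main__":
    database = {"alice": [0.0, 1.0], "bob": [1.0, 0.0]}  # toy encodings
    query = [0.1, 0.9]
    print(verify(query, database["bob"]))   # compares against one person only
    print(recognize(query, database))       # searches the whole database
```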

❒ One Shot Learning – One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted $d(\text{image 1}, \text{image 2})$.

❒ Siamese Network – Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image $x^{(i)}$, the encoded output is often noted as $f(x^{(i)})$.

❒ Triplet loss – The triplet loss $\ell$ is a loss function computed on the embedding representation of a triplet of images $A$ (anchor), $P$ (positive) and $N$ (negative). The anchor and the positive example belong to the same class, while the negative example belongs to another one. By calling $\alpha \in \mathbb{R}^+$ the margin parameter, this loss is defined as follows:
$$\ell(A,P,N) = \max\left(d(A,P) - d(A,N) + \alpha, 0\right)$$

❒ Activation – In a given layer $l$, the activation is noted $a^{[l]}$ and is of dimensions $n_H \times n_w \times n_c$.

❒ Content cost function – The content cost function $J_{\text{content}}(C,G)$ is used to determine how the generated image $G$ differs from the original content image $C$. It is defined as follows:
$$J_{\text{content}}(C,G) = \frac{1}{2} \left\| a^{[l](C)} - a^{[l](G)} \right\|^2$$

❒ Style matrix – The style matrix $G^{[l]}$ of a given layer $l$ is a Gram matrix where each of its elements $G_{kk'}^{[l]}$ quantifies how correlated the channels $k$ and $k'$ are. It is defined with respect to activations $a^{[l]}$ as follows:
$$G_{kk'}^{[l]} = \sum_{i=1}^{n_H^{[l]}} \sum_{j=1}^{n_w^{[l]}} a_{ijk}^{[l]} a_{ijk'}^{[l]}$$

Remark: the style matrices for the style image and the generated image are noted $G^{[l](S)}$ and $G^{[l](G)}$ respectively.

❒ Style cost function – The style cost function $J_{\text{style}}(S,G)$ is used to determine how the generated image $G$ differs from the style $S$. It is defined as follows:
$$J_{\text{style}}(S,G) = \frac{1}{(2 n_H n_w n_c)^2} \left\| G^{[l](S)} - G^{[l](G)} \right\|_F^2 = \frac{1}{(2 n_H n_w n_c)^2} \sum_{k,k'=1}^{n_c} \left( G_{kk'}^{[l](S)} - G_{kk'}^{[l](G)} \right)^2$$

❒ Overall cost function – The overall cost function is defined as a combination of the content and style cost functions, weighted by parameters $\alpha, \beta$, as follows:
$$J(G) = \alpha J_{\text{content}}(C,G) + \beta J_{\text{style}}(S,G)$$

Remark: a higher value of $\alpha$ will make the model care more about the content, while a higher value of $\beta$ will make it care more about the style.
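
The content, style and overall cost functions of neural style transfer can be sketched as follows. This is a minimal illustration under stated assumptions: activations are plain nested lists indexed as `a[i][j][k]` (height, width, channel), the same layer is used for both costs, and the function names (`content_cost`, `gram_matrix`, `style_cost`, `overall_cost`) are hypothetical. A practical implementation would rely on a tensor library instead of Python loops.

```python
def content_cost(a_c, a_g):
    """J_content = 1/2 * squared norm of the activation difference."""
    return 0.5 * sum(
        (a_c[i][j][k] - a_g[i][j][k]) ** 2
        for i in range(len(a_c))
        for j in range(len(a_c[0]))
        for k in range(len(a_c[0][0]))
    )


def gram_matrix(a):
    """Style matrix: G[k][kp] = sum over i, j of a[i][j][k] * a[i][j][kp]."""
    n_h, n_w, n_c = len(a), len(a[0]), len(a[0][0])
    return [
        [sum(a[i][j][k] * a[i][j][kp] for i in range(n_h) for j in range(n_w))
         for kp in range(n_c)]
        for k in range(n_c)
    ]


def style_cost(a_s, a_g):
    """Squared Frobenius distance between Gram matrices, normalised
    by (2 * n_H * n_w * n_c)^2."""
    n_h, n_w, n_c = len(a_s), len(a_s[0]), len(a_s[0][0])
    g_s, g_g = gram_matrix(a_s), gram_matrix(a_g)
    squared = sum(
        (g_s[k][kp] - g_g[k][kp]) ** 2
        for k in range(n_c)
        for kp in range(n_c)
    )
    return squared / (2 * n_h * n_w * n_c) ** 2


def overall_cost(a_c, a_s, a_g, alpha=1.0, beta=1.0):
    """J(G) = alpha * J_content + beta * J_style (alpha, beta chosen by the user)."""
    return alpha * content_cost(a_c, a_g) + beta * style_cost(a_s, a_g)


if __name__ == "__main__":
    a_content = [[[1.0, 2.0]]]    # toy activation of shape 1 x 1 x 2
    a_style = [[[1.0, 2.0]]]
    a_generated = [[[0.0, 0.0]]]
    print(overall_cost(a_content, a_style, a_generated, alpha=1.0, beta=1.0))
```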


Neural style transfer

❒ Motivation – The goal of neural style transfer is to generate an image $G$ based on a given content $C$ and a given style $S$.


Architectures using computational tricks

❒ Generative Adversarial Network – Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model. The generative model aims at generating the most truthful output, which is fed into the discriminative model, which in turn aims at differentiating the generated image from the true image.

Remark: use cases using variants of GANs include text to image, music generation and synthesis.

❒ ResNet – The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:
$$a^{[l+2]} = g(a^{[l]} + z^{[l+2]})$$

❒ Inception Network – This architecture uses inception modules and tries different convolutions in order to increase its performance. In particular, it uses the $1 \times 1$ convolution trick to lower the burden of computation.


Recurrent Neural Networks

❒ Architecture of a traditional RNN – Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:

For each timestep $t$, the activation $a^{<t>}$ and the output $y^{<t>}$ are expressed as follows:
$$a^{<t>} = g_1(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a)$$
$$y^{<t>} = g_2(W_{ya} a^{<t>} + b_y)$$
where $W_{ax}, W_{aa}, W_{ya}, b_a, b_y$ are coefficients that are shared temporally and $g_1, g_2$ are activation functions.
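
The two recurrence equations can be sketched as a forward pass over a sequence. The sketch assumes $g_1 = \tanh$ and $g_2$ = sigmoid, which the text does not specify, and stores vectors and matrices as plain lists. The helper names `matvec`, `add` and `rnn_forward` are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def rnn_forward(xs, a0, W_aa, W_ax, W_ya, b_a, b_y):
    """Run a traditional RNN over the inputs xs, starting from activation a0.

    Assumptions of this sketch: g1 is tanh and g2 is the sigmoid function.
    The same weights are reused at every timestep.
    """
    a = a0
    ys = []
    for x in xs:
        # a<t> = g1(W_aa a<t-1> + W_ax x<t> + b_a)
        a = [math.tanh(z) for z in add(matvec(W_aa, a), matvec(W_ax, x), b_a)]
        # y<t> = g2(W_ya a<t> + b_y)
        y = [1.0 / (1.0 + math.exp(-z)) for z in add(matvec(W_ya, a), b_y)]
        ys.append(y)
    return ys, a


if __name__ == "__main__":
    xs = [[1.0], [0.5], [-1.0]]          # three timesteps, input size 1
    ys, a_last = rnn_forward(
        xs, a0=[0.0, 0.0],
        W_aa=[[0.1, 0.0], [0.0, 0.1]],   # hidden size 2
        W_ax=[[1.0], [-1.0]],
        W_ya=[[1.0, 1.0]],               # output size 1
        b_a=[0.0, 0.0], b_y=[0.0],
    )
    print(len(ys), a_last)
```

Because the loop reuses the same matrices at each step, the number of parameters does not depend on the length of `xs`.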

The pros and cons of a typical RNN architecture are summed up in the table below:


Advantages
- Possibility of processing input of any length
- Model size not increasing with size of input
- Computation takes into account historical information
- Weights are shared across time

Drawbacks
- Computation being slow
- Difficulty of accessing information from a long time ago
- Cannot consider any future input for the current state

❒ Applications of RNNs – RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:

- Traditional neural network: $T_x = T_y = 1$
- Music generation: $T_x = 1$, $T_y > 1$
- Sentiment classification: $T_x > 1$, $T_y = 1$
- Named entity recognition: $T_x = T_y$
- Machine translation: $T_x \neq T_y$


Handling long term dependencies

❒ Commonly used activation functions – The most common activation functions used in RNN modules are described below:

- Sigmoid: $g(z) = \frac{1}{1 + e^{-z}}$
- Tanh: $g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
- ReLU: $g(z) = \max(0,z)$

❒ Vanishing/exploding gradient – The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. They make it difficult to capture long term dependencies, because the gradient is a product of terms that can decrease or increase exponentially with the number of layers.

❒ Gradient clipping – Gradient clipping is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. Capping the maximum value of the gradient keeps this phenomenon under control in practice.
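
Gradient clipping can be sketched in two common ways. The first caps each component of the gradient, which follows the description above most directly. The second rescales the whole gradient when its norm exceeds a maximum, a frequently used variant that is not described in the text. The function names and the list-based gradient representation are assumptions made for this example.

```python
import math


def clip_by_value(grad, max_value):
    """Cap every component of the gradient to the range [-max_value, max_value]."""
    return [max(-max_value, min(max_value, g)) for g in grad]


def clip_by_norm(grad, max_norm):
    """Rescale the gradient so that its Euclidean norm is at most max_norm.

    Variant shown for comparison: the direction of the gradient is preserved.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


if __name__ == "__main__":
    exploding = [30.0, -40.0, 0.5]
    print(clip_by_value(exploding, max_value=1.0))
    print(clip_by_norm(exploding, max_norm=1.0))
```

Capping by value can change the direction of the update, while rescaling by norm only shortens it. Which one fits better depends on the training setup.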

❒ Loss function – In the case of a recurrent neural network, the loss function $\mathcal{L}$ of all time steps is defined based on the loss at every time step as follows:
$$\mathcal{L}(\hat{y}, y) = \sum_{t=1}^{T_y} \mathcal{L}(\hat{y}^{<t>}, y^{<t>})$$

❒ Backpropagation through time – Backpropagation is done at each point in time. At timestep $T$, the derivative of the loss $\mathcal{L}$ with respect to weight matrix $W$ is expressed as follows:
$$\frac{\partial \mathcal{L}^{(T)}}{\partial W} = \sum_{t=1}^{T} \left. \frac{\partial \mathcal{L}^{(T)}}{\partial W} \right|_{(t)}$$

❒ Types of gates – In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted $\Gamma$ and are equal to:
$$\Gamma = \sigma(W x^{<t>} + U a^{<t-1>} + b)$$
where $W, U, b$ are coefficients specific to the gate and $\sigma$ is the sigmoid function. The main ones are summed up in the table below:


