

Logistic Regression Explained from Scratch (Visually, Mathematically and Programmatically)
Hands-on Vanilla Modelling Part III
Abhibhav Sharma




A plethora of results appears on even a quick Google search for "Logistic Regression".
It can get very confusing for beginners in data science to get around the main idea
behind logistic regression, and why wouldn't they be confused? Every tutorial, article,
or forum has a different narration of logistic regression (not counting the legitimately
verbose textbooks, because those would defeat the entire purpose of these "quick
sources" of mastery). Some sources call it a "classification algorithm", while
more sophisticated ones call it a "regressor"; either way, the underlying idea and
utility often remain unrevealed. Remember that logistic regression is the basic
building block of artificial neural networks, and a missing or fallacious
understanding of it can make the more advanced formalisms of data science really
difficult to grasp.
Here, I will try to shed some light on and inside the logistic regression model and
its formalisms in a very basic manner, in order to give the reader a sense of
understanding (hopefully without confusing them). The simplicity offered here comes
at the cost of skipping the in-depth details of some crucial aspects; getting into the
nitty-gritty of every aspect of logistic regression would be like diving into a
fractal (there would be no end to the discussion). However, for each such concept, I
will point to eminent readings/sources that one should refer to.
There are two major branches in the study of logistic regression: (i) modelling and
(ii) post-modelling analysis (using the logistic regression results). While the latter
measures effects from the fitted coefficients, I believe the black-box aspect of
logistic regression has always been in its modelling.

My aim here is to:
1. Explain logistic regression in the most layman way possible.
2. Discuss the underlying mathematics of two popular optimizers that are employed in
logistic regression (gradient descent and Newton's method).
3. Create a logistic-regression module from scratch in R for each type of optimizer.
One last thing before we proceed: this entire article is written with the binary
classification problem in mind, in order to avoid complexity.

1. Logistic Regression is NOT A CLASSIFIER
Yes, it is not. At the core of its heart, it is a regression model. I will describe
what logistic regression is and why, while preserving its resonance with the linear
regression model. Assuming that my readers are somewhat aware of the basics of linear
regression, it is easy to say that a linear regression
predicts a "value" of the target variable through a linear combination of the given
features, while a logistic regression predicts a "probability value" by plugging that
linear combination of the given features into a logistic function (aka inverse-logit),
given as eq(1):

Equation 1: σ(z) = 1/(1 + e⁻ᶻ)

Logistic Function (Image by author)

Hence the name logistic regression. This logistic function is a simple strategy to map
the linear combination "z", lying in the (-inf, inf) range, to the probability interval
[0,1] (in the context of logistic regression, this z is called the log(odds), or logit,
or log(p/(1-p)); see the plot above). Consequently, logistic regression is a type of
regression whose range is confined to [0,1], unlike simple linear regression models
where both the domain and the range can take any real value.
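To make the mapping concrete, here is a minimal R sketch of eq(1); the helper name
logistic is my own, purely for illustration.

logistic <- function(z) 1/(1 + exp(-z))   # eq(1): squeezes any real z into (0, 1)

z <- c(-10, -1, 0, 1, 10)                 # values from anywhere on the real line
round(logistic(z), 4)                     # 0.0000 0.2689 0.5000 0.7311 1.0000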

A small sample of the data (Image by author)

Consider simple data with one variable and its corresponding binary class, either 0 or
1. The scatter plot of this data looks something like Fig A (left). We see that the
data points sit in two extreme clusters. Now, for our prediction modelling, a naive
regression line in this scenario will give a nonsensical fit (red line in Fig A right);
what we actually need to fit is a squiggly line (the curvy "S"-shaped blue rule in Fig
A right) that explains (or correctly separates) the maximum number of data points.

(Image by author) Fig A

Logistic regression is a scheme to search for this most optimum blue squiggly line.
First, let's understand what each point on this squiggly line represents: given any
variable value projected onto this line, the squiggly line gives the probability of
falling in class 1 (say "p") for that variable value. Accordingly, the line says that
all the bottom points lying on it have zero chance (p=0) of being in class 1, and the
top points lying on it have probability 1 (p=1) of the same. Now, remember that the
logistic (aka inverse-logit) is a strategy to map the infinitely stretching space
(-inf, inf) to the probability space [0,1]; conversely, the logit function transforms
the probability space [0,1] to a space stretching over (-inf, inf), eq(2) & Fig B.

Equation 2: logit(p) = log(p/(1-p))



Fig B. The logit function is given by log(p/(1-p)); it maps each probability value to a
point on the number line {ℝ} stretching from -infinity to infinity (Image by author)
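As a quick numeric check of eq(1) and eq(2) (hypothetical helper names, not from any
package), the two functions undo each other:

logit    <- function(p) log(p/(1 - p))    # eq(2): (0,1) -> the whole real line
logistic <- function(z) 1/(1 + exp(-z))   # eq(1): the real line -> (0,1)

p <- c(0.01, 0.5, 0.99)
logit(p)                                  # approximately -4.595  0.000  4.595
logistic(logit(p))                        # recovers 0.01 0.50 0.99: they are inverses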

Keeping this in mind, here comes the mantra of logistic regression modelling. Logistic
regression starts with Ⓐ transforming the space of class probability [0,1] vs
variable {ℝ} (as in Fig A right) to the space of logit {ℝ} vs variable {ℝ}, where a
"regression-like" fitting is performed by adjusting the intercept and slope in order
to maximize the likelihood (a fancy piece of machinery that I will elaborate on in a
coming section). Ⓑ Once the tweaking and tuning are done, the logit {ℝ} vs variable
{ℝ} space is remapped to class probability [0,1] vs variable {ℝ} using the
inverse-logit (aka logistic function). Performing this cycle iteratively (Ⓐ → Ⓑ → Ⓐ)
eventually results in the most optimum squiggly line, i.e. the most discriminating rule.
WOW!!!




Well, you may (and should) ask: (i) why and how do we do this transformation?, (ii)
what the heck is likelihood?, and (iii) how would this scheme lead to the most optimum
squiggle? As for (i), the idea of transforming a confined probability space [0,1] into
an infinitely stretching real space (-inf, inf) is that it brings the fitting problem
very close to solving a linear regression, for which we have plenty of optimizers and
techniques to fit the most optimum line. The latter questions will be answered as we go.
Now, coming back to our search for the best classifying blue squiggly line: the idea is
to first draw an initial regression line with arbitrary coefficients in the ⚠ logit vs
variable space ⚠ and then adjust the coefficients of this fit so as to maximize the
likelihood (relax!! I will explain the "likelihood" when it is needed).
In our one-variable case, we can write equation 3:
logit(p) = log(p/(1-p)) = β₀ + β₁·v ……………………………….(eq 3)



(Image by author) Fig C

In Fig C (I), the red line is our arbitrarily chosen regression line fitted for the
data points, mapped into a different coordinate system with β₀ (intercept) = -20 and
β₁ (slope) = 3.75.
⚠ Note that the coordinate space is not class {0,1} vs variable {ℝ} but logit {ℝ} vs
variable {ℝ}. Also notice that the transformation from Fig A (right) to Fig C (I) has
no effect on the positional preferences of the points, i.e. the extremes: as in
equation 2 above, logit(0) = -infinity and logit(1) = +infinity.
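As a tiny R sketch of what Fig C illustrates with these arbitrary coefficients (the
grid of variable values below is mine, purely for illustration):

beta0 <- -20; beta1 <- 3.75
v <- seq(2, 9, by = 0.5)                    # a grid of variable values
logit_line <- beta0 + beta1 * v             # eq(3): the straight line in logit space
p <- 1/(1 + exp(-logit_line))               # eq(1): mapped back, the squiggle of Fig C(III)
round(cbind(v, logit_line, p), 3)[c(1, 8, 15), ]   # low, middle and high variable values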
At this point, let me reiterate our objective: we want to fit a straight line to the
data points in the logit vs variable plot in such a way that it explains (correctly
separates) the maximum number of data points once it is converted into the blue
squiggly line through the inverse-logit (aka logistic function), eq(1). To achieve the
best fit, a strategy similar to simple linear regression comes into play, but instead
of minimizing the squared residuals, the idea is to maximize the likelihood (relax!!).
Since points scattered at infinity make it difficult to proceed with an orthodox linear
regression method, the trick is to project these points onto the logit line (the
initially chosen/fitted line with arbitrary coefficients), Fig C (II). In this way,
each data point projected onto the line corresponds to a logit value. When these logit
values are plugged into the logistic function, eq(1), we get their probabilities of
falling in class 1, Fig C (III).
Note: this can also be proven mathematically, the logit being the inverse of the
logistic function: logit(p) = logistic⁻¹(p)



This probability can be represented mathematically as equation 4, which is nothing but
a Bernoulli distribution, isn't it?
P(Y = y | X = x) = σ(βᵀx)ʸ · [1 − σ(βᵀx)]⁽¹⁻ʸ⁾ ; where y is either 0 or 1 ……(eq 4)
The equation reads: for a given data instance x, the probability of the label Y being y
(where y is either 0 or 1) is equal to the logistic of the logit when y = 1, and equal
to (1 − logistic of the logit) when y = 0. These new probability values are illustrated
in our class {0,1} vs variable {ℝ} space as blue dots in Fig C (III). This new
probability value of a data point is what we call the LIKELIHOOD of that data point. In
simple terms, the likelihood is the probability value of the data point, indicating how
LIKELY the point is to fall in the class of its observed label. And the likelihood of
the training labels for the fitted weight vector β is nothing but the product of each
of these newfound probability values, equations 5 & 6.
L(β) = ∏ᵢ₌₁ⁿ P(Y = y⁽ⁱ⁾ | X = x⁽ⁱ⁾) ………………….……….eq(5)
Substituting equation 4 in equation 5 we get,



L(β) = ∏ᵢ₌₁ⁿ σ(βᵀx⁽ⁱ⁾)ʸ⁽ⁱ⁾ · [1 − σ(βᵀx⁽ⁱ⁾)]⁽¹⁻ʸ⁽ⁱ⁾⁾ ………………eq(6)
The idea is to estimate the parameters β such that they maximize L(β). However, for
mathematical convenience, we maximize the log of L(β) instead and call it the
log-likelihood, equation 7.
LL(β) = ∑ᵢ₌₁ⁿ y⁽ⁱ⁾ log σ(βᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log[1 − σ(βᵀx⁽ⁱ⁾)] ……..eq(7)
So at this point, I hope our earlier stated objective is clearer: find the best-fitting
parameters β in logit vs variable space such that LL(β) in probability vs variable
space is maximal. There is no closed-form solution for this, so in the next section I
will touch upon two optimization methods, (1) gradient descent (here, ascent) and (2)
Newton's method, to find the optimum parameters.
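Before moving on to the optimizers, here is a small R sketch of eq(5) to eq(7) on a toy
one-variable dataset (the data values and names are mine, purely for illustration),
evaluated at the arbitrary Fig C coefficients:

x <- c(1, 2, 3, 7, 8, 9)                       # toy predictor
y <- c(0, 0, 0, 1, 1, 1)                       # toy binary labels
X <- cbind(1, x)                               # design matrix with an intercept column
beta <- c(-20, 3.75)                           # the arbitrary coefficients from Fig C

p  <- as.vector(1/(1 + exp(-(X %*% beta))))    # sigma(beta'x) for every data point
L  <- prod(ifelse(y == 1, p, 1 - p))           # eq(5)/(6): likelihood as a product
LL <- sum(y*log(p) + (1 - y)*log(1 - p))       # eq(7): log-likelihood
c(likelihood = L, log_likelihood = LL)         # the closer to 1 (resp. 0), the better the fit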

2. Optimizers
Gradient Ascent


Our optimization first requires the partial derivative of the log-likelihood function.
A very eminent lecture note elucidates the derivation beautifully (it uses θ in place
of β to denote the parameters); the result is:
∇LL(β) = ∑ᵢ₌₁ⁿ [y⁽ⁱ⁾ − σ(βᵀx⁽ⁱ⁾)]·x⁽ⁱ⁾ = Xᵀ(Y − P)

To update the parameters, the step toward the global maximum is:
βₙ₊₁ = βₙ + η·∇LL(βₙ)
where η is the learning rate.

So the algorithm is:
Initialize β and set likelihood = 0
While likelihood < threshold (and the epoch limit is not reached) {
Calculate logit(p) = Xβ
Calculate P = logistic(Xβ) = 1/(1 + exp(-Xβ))
Calculate likelihood L(β) = ∏ᵢ₌₁ⁿ ifelse(y⁽ⁱ⁾ = 1, p⁽ⁱ⁾, 1 − p⁽ⁱ⁾)
Calculate first derivative ∇LL(β) = Xᵀ(Y − P)
Update: β(new) = β(old) + η·∇LL(β)
}
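Below is a minimal, self-contained R sketch of this loop on a toy dataset (the data,
learning rate and epoch count are mine, purely for illustration; the full training
function on the real dataset follows in section 3):

x <- c(1, 2, 3, 4, 5, 6, 7, 8)                 # toy predictor with some class overlap
y <- c(0, 0, 1, 0, 1, 0, 1, 1)                 # so the maximum-likelihood beta is finite
X <- cbind(1, x)                               # design matrix with intercept
beta <- rep(0, ncol(X))                        # initialize beta
eta  <- 0.01                                   # learning rate

for (epoch in 1:5000) {
  p    <- as.vector(1/(1 + exp(-(X %*% beta))))   # P = logistic(X beta)
  grad <- as.vector(t(X) %*% (y - p))             # first derivative  X'(Y - P)
  beta <- beta + eta*grad                         # ascent step toward the maximum
}
beta                                              # fitted intercept and slope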

Newton Method
Newton's method is another strong candidate among the available optimizers. We learned
Newton's method in our early lessons as an algorithm that steps toward the
maximum/minimum point of a concave/convex function:
xₙ₊₁ = xₙ − [∇∇f(xₙ)]⁻¹·∇f(xₙ)
In the context of our log-likelihood function, ∇f(xₙ) is replaced by the gradient of
LL(β) (i.e. ∇LL(β)) and ∇∇f(xₙ) by the Hessian H, i.e. the matrix of second-order
partial derivatives of LL(β). I will skip the details here, but your curious brain
should refer to this. So the ultimate expression to update the parameters in this case
is:
βₙ₊₁ = βₙ − H⁻¹·∇LL(βₙ)
Here, in the case of logistic regression, the calculation of H is super easy because:
H = ∇∇LL(β) = ∇ ∑ᵢ₌₁ⁿ [y⁽ⁱ⁾ − σ(βᵀx⁽ⁱ⁾)]·x⁽ⁱ⁾
= ∇ ∑ᵢ₌₁ⁿ [y⁽ⁱ⁾ − pᵢ]·x⁽ⁱ⁾
= −∑ᵢ₌₁ⁿ x⁽ⁱ⁾(∇pᵢ)ᵀ = −∑ᵢ₌₁ⁿ pᵢ(1 − pᵢ)·x⁽ⁱ⁾(x⁽ⁱ⁾)ᵀ
Thus H = −XᵀWX, where X is the design matrix whose rows are the (x⁽ⁱ⁾)ᵀ and
W = diag(P·(1 − P)) is the diagonal matrix of the pᵢ(1 − pᵢ) terms.
So the algorithm is:
Initialize β and set likelihood = 0
While likelihood < threshold (and the epoch limit is not reached) {
Calculate logit(p) = Xβ
Calculate P = logistic(Xβ) = 1/(1 + exp(-Xβ))
Calculate likelihood L(β) = ∏ᵢ₌₁ⁿ ifelse(y⁽ⁱ⁾ = 1, p⁽ⁱ⁾, 1 − p⁽ⁱ⁾)
Calculate first derivative ∇LL(β) = Xᵀ(Y − P)
Calculate second derivative (Hessian) H = −XᵀWX, with W = diag(P·(1 − P))
Update: β(new) = β(old) − H⁻¹·∇LL(β)
}
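And a matching minimal Newton sketch in R (same illustrative toy data as above); note
how few iterations it needs compared to the gradient loop:

x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0, 0, 1, 0, 1, 0, 1, 1)
X <- cbind(1, x); beta <- rep(0, ncol(X))

for (step in 1:10) {
  p    <- as.vector(1/(1 + exp(-(X %*% beta))))
  grad <- as.vector(t(X) %*% (y - p))          # gradient of the log-likelihood
  W    <- diag(p*(1 - p))                      # diagonal weight matrix
  H    <- -t(X) %*% W %*% X                    # Hessian: -X'WX
  beta <- as.vector(beta - solve(H) %*% grad)  # Newton step: beta - H^(-1) grad
}
beta                                           # converges in a handful of steps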

3. The Scratch Code
The data
The dataset that I am going to use for training and testing my binary classification
model can be downloaded from here. It is the Algerian Forest Fires dataset; you can
check out its details here.

The Code
For readers who hopped over the entire article above to play around with the code, I
would recommend a quick eyeballing of the second section, as I have given a step-wise
algorithm for both optimizers and my code strictly follows that order.
The Training Function

setwd("C:/Users/Dell/Desktop")   # working directory (adjust to your own path)
set.seed(111)                    # to generate the same results as mine

#-------------Training Function---------------------------------#
logistic.train <- function(train_data, method, lr, verbose){
  b0 <- rep(1, nrow(train_data))                                    # intercept column of 1s
  x  <- as.matrix(cbind(b0, train_data[, 1:(ncol(train_data)-1)]))  # design matrix
  y  <- train_data[, ncol(train_data)]                              # binary labels (last column)
  beta <- as.matrix(rep(0.5, ncol(x))); likelihood <- 0; epoch <- 0 # initiate
  beta_all <- NULL
  beta_at <- c(1,10,50,100,110,150,180,200,300,500,600,800,1000,
               1500,2000,4000,5000,6000,10000) # checkpoints (epochs at which I record the betas)

  #-----------------------------Gradient Descent--------------------#
  if(method == "Gradient"){
    while((likelihood < 0.95) & (epoch <= 35000)){
      logit <- x %*% beta                   # Calculate logit(p) = X beta
      p <- 1/(1 + exp(-(logit)))            # Calculate P = logistic(X beta)
      # Likelihood: product over data points of p (if y=1) or 1-p (if y=0)
      likelihood <- 1
      for(i in 1:length(p)){
        likelihood <- likelihood*(ifelse(y[i] == 1, p[i], (1 - p[i])))
      }
      first_d <- t(x) %*% (y - p)           # first derivative of the log-likelihood
      beta <- beta + lr*first_d             # update: a step toward the maximum
      # to see inside the steps of learning (irrelevant to the main working algo)
      if(verbose == T){
        ifelse(epoch %% 200 == 0,
               print(paste0(epoch, "th Epoch",
                            "---------Likelihood=", round(likelihood, 4),
                            "---------log-likelihood=", round(log(likelihood), 4),
                            collapse = "")), NA)}
      if(epoch %in% beta_at){beta_all <- cbind(beta_all, beta)}
      epoch <- epoch + 1
    }
  }
  #--------------Newton second order diff method-------------#
  else if(method == "Newton"){
    while((likelihood < 0.95) & (epoch <= 35000)){
      logit <- x %*% beta                   # Calculate logit(p) = X beta
      p <- 1/(1 + exp(-(logit)))            # Calculate P = logistic(X beta)
      # Likelihood: product over data points of p (if y=1) or 1-p (if y=0)
      likelihood <- 1
      for(i in 1:length(p)){
        likelihood <- likelihood*(ifelse(y[i] == 1, p[i], (1 - p[i])))
      }
      first_d <- t(x) %*% (y - p)           # first derivative of the log-likelihood
      w <- matrix(0, ncol = nrow(x), nrow = nrow(x)) # initializing the p(1-p) diagonal matrix
      diag(w) <- p*(1 - p)
      hessian <- -t(x) %*% w %*% x          # Hessian matrix
      hessian <- diag(ncol(x)) - hessian    # Levenberg-Marquardt style: add an identity
                                            # matrix to avoid singularity issues
      k <- solve(hessian) %*% first_d       # the Newton step
      beta <- beta + k                      # update: a step toward the maximum
      if(verbose == T){
        ifelse(epoch %% 200 == 0,
               print(paste0(epoch, "th Epoch",
                            "---------Likelihood=", round(likelihood, 4),
                            "---------log-likelihood=", round(log(likelihood), 4),
                            collapse = "")), NA)}
      if(epoch %in% beta_at){beta_all <- cbind(beta_all, beta)} # just to look inside the learning
      epoch <- epoch + 1
    }
  }
  else stop("method must be either 'Gradient' or 'Newton'")
  beta_all <- cbind(beta_all, beta)
  colnames(beta_all) <- c(beta_at[1:(ncol(beta_all)-1)], epoch - 1)
  mylist <- list(as.matrix(beta), likelihood, beta_all)
  names(mylist) <- c("Beta", "likelihood", "Beta_all")
  return(mylist)
} # fitting of the logistic model


The Prediction Function

logistic.pred <- function(model, test_data){
  # add an intercept column of 1s and keep only the feature columns of the test data
  test_new <- cbind(rep(1, nrow(test_data)), test_data[, 1:(ncol(test_data)-1)])
  beta     <- as.matrix(model$Beta)   # extract the best fitting beta (the beta at the final epoch)
  beta_all <- model$Beta_all          # extract all the betas at the different checkpoints
  ll       <- model$likelihood        # extract the highest likelihood obtained
  log_odd  <- as.matrix(test_new) %*% beta                  # logit(p)
  probability <- 1/(1 + exp(-log_odd))                      # p = logistic(logit(p))
  predicted_label <- ifelse(probability >= 0.5, 1, 0)       # discrimination rule
  k <- cbind(test_data[, ncol(test_data)], predicted_label) # actual label vs predicted label
  colnames(k) <- c("Actual", "Predicted")
  k <- as.data.frame(k)
  tp <- length(which(k$Actual == 1 & k$Predicted == 1)) # true positives
  tn <- length(which(k$Actual == 0 & k$Predicted == 0)) # true negatives
  fp <- length(which(k$Actual == 0 & k$Predicted == 1)) # false positives
  fn <- length(which(k$Actual == 1 & k$Predicted == 0)) # false negatives
  cf <- matrix(c(tp, fn, fp, tn), 2, 2, byrow = F)      # confusion matrix
  rownames(cf) <- c("1", "0")
  colnames(cf) <- c("1", "0")
  p_list <- list(k, cf, beta, ll, beta_all)
  names(p_list) <- c("predicted", "confusion matrix", "beta", "likelihood", "Beta_all")
  return(p_list)
} # to make predictions from the trained model

Data Parsing

#importing data
data <- read.csv("fire.csv", header = T)   # import
# numeric conversion of the class label to 1 or 0 (one-hot); the "fire"/"not fire"
# label strings are assumed here, adjust to your copy of the data
data$Classes <- as.numeric(trimws(data$Classes) == "fire")

# balancing the classes: keep every class-0 row plus an equal-sized sample of class-1 rows
data <- rbind(data[which(data$Classes == 0), ],
              data[sample(which(data$Classes == 1),
                          size = length(which(data$Classes == 0))), ])
data <- data[sample(1:nrow(data)), ]        # shuffling
data <- as.data.frame(data)                 # data to data frame
data <- lapply(data, as.numeric)
data <- as.data.frame(data)

# missing data handling: drop any row that contains an NA
if(any(is.na(data))){
  data <- data[-unique(which(is.na(data), arr.ind = T)[, 1]), ]
}

# test/train partition (~80/20)
partition <- sample(c(0, 1), size = nrow(data), prob = c(0.8, 0.2), replace = T)
train <- data[which(partition == 0), ]
test  <- data[which(partition == 1), ]
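A couple of quick, optional sanity checks after parsing (the exact counts depend on the
CSV and the random seed):

table(data$Classes)                          # the two classes should now be roughly balanced
c(train = nrow(train), test = nrow(test))    # roughly an 80/20 split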

Training → Testing → Results

#-------------------------TRAINING---------------------------------#
mymodel_newton   <- logistic.train(train, "Newton", 0.01, verbose = T)   # fit using Newton's method
mymodel_gradient <- logistic.train(train, "Gradient", 0.01, verbose = T) # fit using gradient ascent
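To close the loop, here is a sketch of how the fitted models might be evaluated with
the prediction function defined above (the accuracy line is my own addition; the actual
numbers depend on the random split):

#-------------------------TESTING-----------------------------------#
pred_newton   <- logistic.pred(mymodel_newton, test)
pred_gradient <- logistic.pred(mymodel_gradient, test)

pred_newton$`confusion matrix`                       # confusion matrix on the test set
sum(diag(pred_newton$`confusion matrix`)) /
  sum(pred_newton$`confusion matrix`)                # simple accuracy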
