1. Logistic Regression
Logistic regression is a regression model in which
the response variable (dependent variable) takes
categorical values such as True/False or 0/1. It
measures the probability of a binary response as a
function of one or more predictor variables, through
a mathematical equation relating the response to the
predictors.
Logistic regression estimates the probability of an
event occurring, such as voted or didn't vote, based
on a given dataset of independent variables.
Since the outcome is a probability, the dependent
variable is bounded between 0 and 1.
2. • Linear: When there is a linear relationship
between the independent and dependent variables,
it is known as linear regression.
• Logistic: When the dependent variable is
categorical in nature, it is known as logistic
regression.
• Polynomial: When the power of the independent
variables is greater than 1, it is referred to as
polynomial regression.
4. Why Logistic Regression
• Whenever the outcome of the dependent variable
(Y) is discrete, like 0/1, we use logistic regression.
• In linear regression Y's value lies in a continuous
range, but in our case Y is discrete, i.e., the value
will be either 0 or 1.
• Logistic regression gives a probability: what are
the chances that Y will be 1?
• Ex: scoring in basketball (if the model predicts a
scoring probability of 0.8 and the threshold is 0.5,
the value is above the threshold, so Y is 1;
otherwise Y is 0).
5. • The general mathematical equation for logistic
regression is −
• y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))
• Following is the description of the parameters
used −
• y is the response variable.
• x1, x2, x3, ... are the predictor variables.
• a and b1, b2, b3, ... are the coefficients, which are
numeric constants.
• The function used to create the regression model
is the glm() function.
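• As a minimal sketch of the equation above (the coefficient
values a and b1 below are made up for illustration), the
logistic function maps any linear predictor into a
probability between 0 and 1:

logistic <- function(z) 1 / (1 + exp(-z))  # y = 1/(1+e^-z)
a <- -1; b1 <- 0.5                         # illustrative coefficients
x1 <- seq(-10, 10, by = 2)
round(logistic(a + b1 * x1), 3)            # every value lies in (0, 1)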
6. • Syntax
• The basic syntax for glm() function in logistic
regression is −
• glm(formula, data, family)
• Following is the description of the parameters used −
• formula is the symbol representing the relationship
between the variables.
• data is the data set giving the values of these
variables.
• family is an R object specifying the details of the
model. Its value is binomial for logistic regression.
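• As a hedged template of this syntax (the data frame
mydata and columns y, x1, x2 are placeholders, not from
the slides):

# y must be coded 0/1 for family = binomial
model <- glm(y ~ x1 + x2, data = mydata, family = binomial)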
7. Example
• The built-in data set "mtcars" describes different
models of cars along with their engine
specifications. In the "mtcars" data set, the
transmission mode (automatic or manual) is
described by the column am, which holds a binary
value (0 or 1). We can create a logistic regression
model between the column "am" and 3 other
columns: hp, wt and cyl.
• # Select some columns from mtcars.
input <- mtcars[, c("am", "cyl", "hp", "wt")]
print(head(input))
8. • When we execute the above code, it produces
the following result −
                  am cyl  hp    wt
Mazda RX4          1   6 110 2.620
Mazda RX4 Wag      1   6 110 2.875
Datsun 710         1   4  93 2.320
Hornet 4 Drive     0   6 110 3.215
Hornet Sportabout  0   8 175 3.440
Valiant            0   6 105 3.460
9. • Create Regression Model
• We use the glm() function to create the
regression model and get its summary for
analysis.
• input <- mtcars[, c("am", "cyl", "hp", "wt")]
am.data <- glm(formula = am ~ cyl + hp + wt,
               data = input, family = binomial)
print(summary(am.data))
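• As a sketch of how the fitted model can then be used (the
0.5 cutoff echoes the threshold idea from slide 4 and is a
choice, not part of the original example):

# predicted probabilities that am = 1 (manual transmission)
probs <- predict(am.data, type = "response")
head(round(probs, 3))

# classify with a 0.5 threshold
head(ifelse(probs > 0.5, 1, 0))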
10. Advantages
• Logistic regression is easy to implement and interpret,
and very efficient to train.
• It makes no assumptions about the distributions of
classes in feature space.
• It extends easily to multiple classes (multinomial
regression) and provides a natural probabilistic view of
class predictions.
• It provides not only a measure of how relevant a
predictor is (coefficient size), but also its direction of
association (positive or negative).
• It is very fast at classifying unknown records.
11. Uses of Logistic Regression
• Classification problems
An important category of problems in which a decision
maker classifies customers into two or more categories.
• Discrete choice models
Estimating the probability that a customer selects a
particular brand when several brands are available.
• Probability
Measuring the probability of the occurrence of an event.
12. Generalized Linear Model
• Generalized Linear Model (GLiM, or GLM) is an
advanced statistical modelling technique
formulated by John Nelder and Robert
Wedderburn in 1972.
• It is an umbrella term that encompasses many
other models, allowing the response variable y to
have an error distribution other than the normal
distribution.
• The models include Linear Regression, Logistic
Regression, and Poisson Regression.
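• As a hedged sketch of this umbrella (a data frame d with
columns y and x is assumed; only the family argument
changes between the member models):

fit_linear  <- glm(y ~ x, data = d, family = gaussian)  # linear regression
fit_logit   <- glm(y ~ x, data = d, family = binomial)  # logistic regression
fit_poisson <- glm(y ~ x, data = d, family = poisson)   # Poisson regression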
13. Why GLM?
• The Linear Regression model is not suitable if:
• The relationship between X and y is not linear; there exists
some non-linear relationship between them. For example,
y increases exponentially as X increases.
• The variance of the errors in y is not constant and varies
with X (violating the constant-variance, or homoscedasticity,
assumption of Linear Regression).
• The response variable is not continuous, but
discrete/categorical. Linear Regression assumes a normal
distribution of the response variable, which applies only to
continuous data.
• If we try to build a linear regression model on a
discrete/binary y variable, the model will predict values
outside the [0, 1] range, including negative values, which is
inappropriate for a probability (see the sketch below).
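• A minimal sketch of that failure mode, reusing the mtcars
am column from the earlier example:

# ordinary least squares on a binary outcome
bad_fit <- lm(am ~ hp + wt, data = mtcars)
range(fitted(bad_fit))  # fitted "probabilities" can fall outside [0, 1]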
14. Assumptions of GLM
• As with the Linear Regression model, there are some basic
assumptions for Generalized Linear Models; most are shared
with Linear Regression, while some Linear Regression
assumptions are modified.
• The data should be independent and random (each random
variable has the same probability distribution).
• The response variable y does not need to be normally
distributed, but the distribution is from an exponential
family (e.g. binomial, Poisson, multinomial, normal)
• The original response variable need not have a linear
relationship with the independent variables, but the
transformed response variable (through the link function)
is linearly dependent on the independent variables
15. Binomial Logistic Regression
• A binomial logistic regression (often referred to simply as
logistic regression) predicts the probability that an
observation falls into one of two categories of a
dichotomous dependent variable, based on one or more
independent variables that can be either continuous or
categorical.
• For example, you could use binomial logistic regression to
understand whether exam performance can be predicted
based on revision time, test anxiety and lecture
attendance (i.e., where the dependent variable is "exam
performance", measured on a dichotomous scale –
"passed" or "failed" – and you have three independent
variables: "revision time", "test anxiety" and "lecture
attendance").
16. Logistic Function
• It is a function used to estimate the model parameters
and to check whether they are statistically significant
and influence the probability of an event.
• The logit function
• One of the big assumptions of linear models is that
the residuals are normally distributed.
• This doesn't mean that Y, the response variable, has
to be normally distributed as well, but it does have to
be continuous, unbounded, and measured on an
interval or ratio scale.
• Unfortunately, categorical response variables are
none of these.
17. • The Logit Link Function
• A link function is simply a function of the mean of the
response variable Y that we use as the response instead
of Y itself.
• All that means is that when Y is categorical, we use the
logit of Y as the response in our regression equation
instead of Y itself:
• logit(P) = ln(P / (1 - P)) = a + b1x1 + b2x2 + ...
• The logit function is the natural log of the odds that Y
equals one of the categories. For mathematical simplicity,
we assume Y has only two categories and code them as
0 and 1.
• This is entirely arbitrary; we could have used any
numbers. But these make the math work out nicely, so
let's stick with them.
• P is defined as the probability that Y = 1. So, for example,
the Xs could be specific risk factors, like age, high blood
pressure, and cholesterol level, and P would be the
probability that a patient develops heart disease.
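• A tiny sketch of the link in R (qlogis and plogis are the
built-in logit and inverse logit):

p <- c(0.1, 0.5, 0.9)   # probabilities that Y = 1
odds <- p / (1 - p)
log(odds)               # the logit; identical to qlogis(p)
plogis(log(odds))       # the inverse link recovers p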
18. The optim Function
• The function optim provides algorithms for general-
purpose optimisation, and the documentation is perfectly
reasonable, but I remember that it took me a little while to
get my head around how to pass data and parameters to
optim.
• fn, the function to be minimized, should return a scalar
result. gr, a function returning the gradient, is used by the
"BFGS", "CG" and "L-BFGS-B" methods; if it is NULL, a
finite-difference approximation will be used. For the "SANN"
method it specifies a function to generate a new candidate
point.
• optim(par, fn, data, ...)
• where:
• par: initial values for the parameters to be optimized over
• fn: a function to be minimized (or maximized)
• data: the name of the object in R that contains the data
(it is passed through ... to fn)
19. • Example: fitting a simple linear model by minimizing
the residual sum of squares with optim.

df <- data.frame(x = c(1, 3, 3, 5, 6, 7, 9, 12),
                 y = c(4, 5, 8, 6, 9, 10, 13, 17))

# define function to minimize: the residual sum of squares
min_residuals <- function(data, par) {
  with(data, sum((par[1] + par[2] * x - y)^2))
}

# find coefficients of linear regression model
optim(par = c(0, 1), fn = min_residuals, data = df)

• The returned list contains:
• $par: the best parameter estimates found (intercept and slope)
• $value: the residual sum of squares at those estimates
• $counts: the number of function (and gradient) evaluations used
• $convergence: 0 indicates successful convergence
• $message: any additional information from the optimizer
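• As a hedged sketch tying optim back to logistic regression
(the function neg_loglik is mine, not from the slides; it
minimizes the negative log-likelihood of the slide 9 model,
so fit$par should approximate coef(am.data)):

input <- mtcars[, c("am", "cyl", "hp", "wt")]

neg_loglik <- function(par, data) {
  # linear predictor: a + b1*cyl + b2*hp + b3*wt
  eta <- par[1] + par[2] * data$cyl + par[3] * data$hp + par[4] * data$wt
  p <- 1 / (1 + exp(-eta))  # logistic function from slide 5
  -sum(data$am * log(p) + (1 - data$am) * log(1 - p))
}

fit <- optim(par = c(0, 0, 0, 0), fn = neg_loglik, data = input,
             method = "BFGS")
fit$par  # compare with coef(am.data) from the glm() fit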
20. Maximum likelihood Estimator
• We can use MLE to obtain robust parameter
estimates. MLE can be defined as a method for
estimating population parameters (such as the
mean and variance for the Normal distribution,
or the rate lambda for the Poisson) from sample
data such that the probability (likelihood) of
obtaining the observed data is maximized.
21. • Simulate the data:

n <- 1000
x <- rnorm(n, 2, 3)  # with mean = 2, sd = 3

• The essential part of MLE is to specify the likelihood
function. In R, you can easily use dnorm to obtain the
density and specify log = TRUE. The objective is then to
minimize the negative sum of the log-likelihood, which is
equivalent to maximizing the positive sum.

LL <- function(beta, sigma) {
  R <- dnorm(x, beta, sigma, log = TRUE)
  -sum(R)
}

• Following that, the parameters to be estimated can be
passed to the mle2 function, available in the package
bbmle, which uses an optimization technique to find the
solution.

library(bbmle)
fit_norm <- mle2(LL, start = list(beta = 0, sigma = 1),
                 lower = c(-Inf, 0), upper = c(Inf, Inf),
                 method = 'L-BFGS-B')
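• To inspect the fit, the usual accessors apply; the estimates
should be close to the true values beta = 2 and sigma = 3
used to simulate x above:

summary(fit_norm)
coef(fit_norm)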