3. Object-wise Analysis
Steps to select appropriate statistical test
Define clearly the objective of the study.
Define the level of measurement (metric/non-metric) of each variable to be included in the analysis.
5. Selecting the appropriate technique
Bivariate techniques

Explanatory Variable (IDV) | Response Variable (DV): Metric    | Response Variable (DV): Non-metric
Metric                     | Regression                        | Logistic Regression / LDA
Non-metric                 | Dummy Var Reg. / Hypothesis Test* | Chi-square test

Make sure to check all assumptions before applying any statistical
technique.
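
The bivariate grid on this slide can be read as a simple lookup from the measurement level of the two variables to a candidate technique. The sketch below is only an illustration of that reading; the dictionary and the function name `suggest_technique` are hypothetical, and the result is a starting point that still requires checking the assumptions of the chosen technique.

```python
# Bivariate lookup: (IDV level, DV level) -> technique named on the slide.
# The names below are illustrative and not part of any library.
BIVARIATE_TECHNIQUES = {
    ("metric", "metric"): "Regression",
    ("metric", "non-metric"): "Logistic Regression / LDA",
    ("non-metric", "metric"): "Dummy Var Reg. / Hypothesis Test",
    ("non-metric", "non-metric"): "Chi-square test",
}


def suggest_technique(idv_level, dv_level):
    """Return the candidate technique for one IDV and one DV."""
    key = (idv_level.strip().lower(), dv_level.strip().lower())
    if key not in BIVARIATE_TECHNIQUES:
        raise ValueError("levels must be 'metric' or 'non-metric'")
    return BIVARIATE_TECHNIQUES[key]


print(suggest_technique("Metric", "Non-metric"))
```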
6. Selecting the appropriate technique

Explanatory Variable(s) (IDVs)    | One DV: Metric           | One DV: Non-metric                       | More than one DV: Metric
One IDV, Metric                   | Simple Regression        | Binary/Multinomial (Logistic) Reg        | Path Analysis
One IDV, Non-metric               | t test/Anova             | Chi Square Test                          | Manova
More than one IDV, All Metric     | Multiple Reg             | Multiple Logit Reg/Multiple Multinomial  | Path Analysis
More than one IDV, All Non-metric | n-way Anova              | Complex Crosstab/Log-linear analysis     | n-way Manova
More than one IDV, Mixed          | n-way Ancova/Dummy var   | Multiple Logit Reg/Multiple Multinomial  | n-way Mancova
8. Types of Logistic Regression
Binary
• Response has only two possible outcomes.
• E.g.: Spam or Not
Multinomial
• Three or more categories without ordering.
• E.g.: Predicting which food is preferred more (Veg, Non-Veg, Vegan)
Ordinal
• Three or more categories with ordering.
• E.g.: Movie rating from 1 to 5
11. Classification Problem
To predict in advance whether a product launch will be successful or not
An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent
Benign or malignant tumor
Spam detection
Movie genre classification
12. Box-Tidwell Test
In the model, include interactions between the continuous predictors and
their logs.
If such an interaction is significant, then the assumption has been
violated.
If any interaction is significant, try adding to the model powers of the
predictor (that is, going polynomial)
Caution:
The test is not very robust because it is affected by sample size.
You should not be too concerned about a barely significant interaction
when sample sizes are large.
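
As a minimal sketch of the first step of the test, the interaction between a continuous predictor and its own log is just the extra column x * ln(x) added to the model. The predictor values and the helper name `box_tidwell_term` below are hypothetical; fitting the model and testing the coefficient of the extra column is left to whichever statistical package is in use. Note that ln(x) is only defined for positive x.

```python
import math


def box_tidwell_term(x):
    """Return x * ln(x), the interaction of a predictor with its own log."""
    if x <= 0:
        raise ValueError("x must be positive for ln(x) to be defined")
    return x * math.log(x)


# Hypothetical values of one continuous predictor (for example, age in years).
ages = [21.0, 35.0, 48.0, 62.0]

# Each row holds: intercept, the predictor, and its Box-Tidwell term.
# The coefficient of the third column is the one tested for significance.
design_rows = [[1.0, a, box_tidwell_term(a)] for a in ages]

for row in design_rows:
    print(row)
```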
13. Assumptions of Logit Regression
Binary response variable with mutually exclusive and exhaustive
categories.
One or more predictor variable(s)
Independent Observations
Linear relationship between continuous independent variable(s) and
the logit transformation of the dependent variable.
This assumption can be tested using the Box-Tidwell test: include in
the model interactions between the continuous predictors and their
logs. If such an interaction is significant, then the assumption has
been violated.
14. Assumptions of Logit Regression
What about collinearity, perfect collinearity, and multicollinearity?
https://stats.stackexchange.com/a/432543/79100
15. More Discussion on Multicollinearity
• What happens when you have multicollinearity?
Multicollinearity isn't as deleterious for prediction but may affect the variables' significance
https://stats.stackexchange.com/questions/168622/why-is-multicollinearity-not-checked-in-modern-statistics-machine-learning
• Can you safely ignore multicollinearity?
https://statisticalhorizons.com/multicollinearity
• How to handle multicollinearity?
https://www.researchgate.net/post/how_to_deal_with_multicolinearity#view=580ef132ed99e1c1046fcf01
• Why not use the stepwise method?
http://www.philender.com/courses/linearmodels/notes4/swprobs.html
http://www.danielezrajohnson.com/stepwise.pdf
17. Assumptions: More Considerations
Logistic regression typically requires a large sample size because it
uses maximum likelihood estimation techniques. [Maximum likelihood
estimates are less powerful at small sample sizes than ordinary least
squares estimates.]
It is also important to keep in mind that when the outcome is rare, even
if the overall dataset is large, it can be difficult to estimate a logit model.
Empty cells or small cells: You should check for empty or small cells
by doing a crosstab between categorical predictors and the outcome
variable. If a cell has very few cases (a small cell), the model may
become unstable or it might not run at all.
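
The crosstab check described above can be sketched in a few lines. The data, the function names, and the cut-off of 5 cases per cell are assumptions made for illustration only; the slide does not state a specific minimum cell count.

```python
from collections import Counter


def crosstab(predictor, outcome):
    """Count cases in every (predictor level, outcome level) cell."""
    counts = Counter(zip(predictor, outcome))
    levels = sorted(set(predictor))
    outcomes = sorted(set(outcome))
    return {(lvl, out): counts.get((lvl, out), 0)
            for lvl in levels for out in outcomes}


def small_cells(table, min_count=5):
    """List cells with fewer than min_count cases (min_count is an assumed cut-off)."""
    return [cell for cell, n in table.items() if n < min_count]


# Hypothetical toy data: a categorical predictor and a binary outcome.
region = ["north"] * 12 + ["south"] * 10 + ["east"] * 8
event = [1] * 6 + [0] * 6 + [1] * 5 + [0] * 5 + [0] * 8

table = crosstab(region, event)
flagged = small_cells(table)
print(table)
print("Small or empty cells:", flagged)
```

Here the "east" group has no cases with the event, which is the kind of empty cell that can make the model unstable.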
18. Why can’t we use Linear Regression for Classification Problems?
21. What is Logistic Regression?
The logistic regression curve is called the “Sigmoid Curve”, also
known as the S-Curve.
How to decide whether the value is 0 or 1 from this curve?
Set a threshold.
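
A minimal sketch of the idea: the sigmoid maps any real-valued score to a probability between 0 and 1, and a threshold turns that probability into a class label. The function names and the example scores are hypothetical; 0.5 is used only as the conventional default.

```python
import math


def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    # Two algebraically equal forms are used to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def classify(z, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0


for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3), classify(score))
```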
22. How to set a threshold?
Default: 0.5
Based on group sizes (as we do in LDA)
Based on performance evaluation metrics using cross validation
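
The performance-based option can be sketched as a search over candidate thresholds. This is a simplified illustration: it scores each candidate by accuracy on a single held-out set rather than by full cross validation, and the scores, labels, candidate list, and function names are all hypothetical.

```python
def accuracy_at(threshold, scores, labels):
    """Accuracy obtained when scores >= threshold are classified as 1."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def best_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest accuracy."""
    return max(candidates, key=lambda t: accuracy_at(t, scores, labels))


# Hypothetical predicted probabilities and true classes from a held-out set.
scores = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
candidates = [0.3, 0.5, 0.7]

chosen = best_threshold(scores, labels, candidates)
print("Chosen threshold:", chosen)
```

Accuracy is only one possible criterion; another metric can be substituted in `accuracy_at` without changing the search.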
23. Logistic Regression Equation
Rather than modeling this response Y directly, logistic regression
models the probability that Y belongs to a particular category.
P(Y = 1 | X) or P(X) can take values from 0 to 1.
$$\log\left(\frac{P(X)}{1 - P(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$$
How to interpret the β coefficients?
25. Logistic Regression Equation
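Alternatively, we can write
$$\frac{P(X)}{1 - P(X)} = e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}$$
Or
$$P(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}}$$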
27. Understanding the Odds
Exp(B) represents the ratio change in the odds of the event of interest for a one-unit
change in the predictor.
$$\frac{P(X)}{1 - P(X)} = e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}$$
29. Understanding Odds
Logit = log(Odds) = log(p / (1 - p))
= log(probability of event happening / probability of event not happening)
$$\text{Odds Ratio (OR)} = \frac{\text{Odds in favor of event } Y \mid X = X_0 + 1}{\text{Odds in favor of event } Y \mid X = X_0}$$
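
A short numerical check can make the link between the odds ratio and Exp(B) concrete: for a one-unit increase in a predictor, the ratio of the two odds equals the exponential of that predictor's coefficient. The coefficient values and the starting point `x0` below are hypothetical and chosen only for illustration.

```python
import math


def probability(x, b0, b1):
    """P(X) for a one-predictor logistic model with intercept b0 and slope b1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def odds(p):
    """Odds in favor of the event: p / (1 - p)."""
    return p / (1.0 - p)


# Hypothetical coefficients and starting value of the predictor.
b0, b1 = -1.0, 0.4
x0 = 2.0

odds_ratio = odds(probability(x0 + 1, b0, b1)) / odds(probability(x0, b0, b1))

print("Odds ratio for a one-unit increase:", odds_ratio)
print("exp(b1):", math.exp(b1))
```

The two printed values agree up to floating-point rounding, and the result does not depend on the choice of `x0`.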
30. Interpreting the coefficients
Response: default [Y/N] Predictor: [Account] balance
Estimated coefficients of the logistic regression model that predicts the
probability of default using balance.
A one-unit increase in balance is associated with
an increase in the log odds of default by 0.0055 units, OR
a multiplication of the odds of default by exp(0.0055), i.e., about 1.0055.
31. Interpreting the coefficients
Probability of default for an individual with a balance of $1,000 is
$$\frac{e^{-10.6513 + 0.0055 \times 1000}}{1 + e^{-10.6513 + 0.0055 \times 1000}} = 0.00576 = 0.576\%$$
Probability of default for an individual with a balance of $2,000 is
$$\frac{e^{-10.6513 + 0.0055 \times 2000}}{1 + e^{-10.6513 + 0.0055 \times 2000}} = 0.586 = 58.6\%$$
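
The two probabilities on this slide can be reproduced with a few lines of arithmetic. The sketch assumes an intercept of -10.6513 and a balance coefficient of 0.0055; the negative sign on the intercept is inferred from the stated results (0.576% and 58.6%), and the function name is hypothetical.

```python
import math

# Coefficients as read from the slide; the intercept's sign is an inference.
INTERCEPT = -10.6513
BALANCE_COEF = 0.0055


def default_probability(balance):
    """Predicted probability of default for a given account balance."""
    z = INTERCEPT + BALANCE_COEF * balance
    return math.exp(z) / (1.0 + math.exp(z))


p_1000 = default_probability(1000)
p_2000 = default_probability(2000)

print(f"Balance $1,000: {p_1000:.5f}")
print(f"Balance $2,000: {p_2000:.3f}")
```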
33. Maximum Likelihood Estimation
The objective: Not to “correctly” estimate the logit, but to make better
classification.
The parameters should take values that yield a score [the probability p]
that allows a good cutoff, meaning this “score” should be high for one
class and low for the other.
If P(Yi = 1|Xi) = P(Xi) = Pi, then the likelihood of observation i is
$$L_i = P_i^{Y_i} (1 - P_i)^{1 - Y_i}$$
To maximize the collective form of this function over all observations:
$$\max L = \max \prod_{i=1}^{n} L_i$$
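
To illustrate what maximizing the likelihood means in practice, the sketch below fits a one-predictor logistic model by plain gradient ascent on the log-likelihood (maximizing the log of L gives the same parameters as maximizing L, since the log is increasing). Gradient ascent is chosen here only for simplicity and is not claimed to be the method any particular package uses; the data, learning rate, and step count are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1, xs, ys):
    """Sum over observations of Y*log(P) + (1 - Y)*log(1 - P)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit_by_gradient_ascent(xs, ys, lr=0.01, steps=20000):
    """Climb the log-likelihood surface starting from b0 = b1 = 0."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += lr * sum(residuals)                              # d logL / d b0
        b1 += lr * sum(r * x for r, x in zip(residuals, xs))   # d logL / d b1
    return b0, b1


# Hypothetical toy data with overlapping classes (not perfectly separable).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_by_gradient_ascent(xs, ys)
print("Fitted intercept and slope:", b0, b1)
print("Log-likelihood at start:", log_likelihood(0.0, 0.0, xs, ys))
print("Log-likelihood at fit:", log_likelihood(b0, b1, xs, ys))
```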
36. Note Points
For a standard logistic regression you should ignore
the Previous and Next buttons because they are for sequential (hierarchical)
logistic regression.
The Method: option needs to be kept at the default value, which is Enter Method.
The "Enter" method is the name given by SPSS Statistics to standard regression
analysis.
SPSS Statistics requires you to define all the categorical predictor values in the
logistic regression model. It does not do this automatically.
The default behaviour in SPSS Statistics is for the last category (numerically) to
be selected as the reference category.
If we change the method from Enter to Forward: Wald, the quality of the logistic
regression improves: now only the significant coefficients are included in the
logistic regression equation.
https://statistics.laerd.com/
54. How to report the results SPSS
A logistic regression was performed to ascertain the effects of x, y, and gender on the
likelihood that participants have the event (positive response).
1. The logistic regression model was statistically significant, χ2(df) = 28.605, p < 0.05
[Omnibus test]
2. A non-significant result (p = 0.78) of the Hosmer-Lemeshow test is an indicator of
good model fit.
3. The pseudo R2 measures of explained variation are 56.4% (Cox & Snell R2) and
67.8% (Nagelkerke R2) [For validation data, pseudo R2 …]
… … … … … … … … … … … … … … … …
55. How to report the results SPSS
1. The model correctly classified 81.0% of cases (model accuracy) for the training
data set and 76% of cases for the validation data set.
[The data set was randomly divided into training and validation sets, with 70% of
the observations in the training set and the rest in the validation set.]
2. The model specificity
3. Sensitivity
At cut-off value
4. ROC curve was used to optimize the cut-off point
… … … … … … … … … … … … … … … …
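
The accuracy, sensitivity, and specificity reported above all come from the same classification table. A minimal sketch, using hypothetical true and predicted classes (1 = event, 0 = no event) and a hypothetical function name:

```python
def classification_report(y_true, y_pred):
    """Return accuracy, sensitivity, and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of actual events detected
        "specificity": tn / (tn + fp),  # share of non-events correctly rejected
    }


# Hypothetical validation-set labels and predictions at some cut-off value.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

report = classification_report(y_true, y_pred)
print(report)
```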
56. How to report the results SPSS
The results from the "Variables in the Equation" table, including which of the
predictor variables were statistically significant and what predictions can be made
based on the use of odds ratios. E.g.,
Males were 6.02 times more likely to do this (event) than females.
Increasing x was associated with an increase in the likelihood of the event, but increasing y
was associated with a reduction in the likelihood of the event.
… … … … … … … … … … … … … … … …
57. How to report the results SPSS
Box-Tidwell (1962) Test:
62. Other Considerations
Categorical Predictors
Accuracy Paradox
Balanced, Unbalanced & Rare Event Data
Complete or Quasi-separation
Pseudo R2 Measures
Multinomial and Ordinal Logistic Regression
Note: Homoskedasticity is not an assumption in logistic regression.
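
As one illustration of the accuracy paradox listed above, consider rare-event data where a model that never predicts the event still scores high accuracy. The class proportions below (5 events in 100 cases) are hypothetical.

```python
# Hypothetical rare-event data: 95 non-events and 5 events.
y_true = [0] * 95 + [1] * 5

# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
sensitivity = detected / sum(y_true)

print("Accuracy:", accuracy)        # high, despite a useless model
print("Sensitivity:", sensitivity)  # no event is ever detected
```

This is why accuracy alone can be misleading on unbalanced data and is usually read together with sensitivity and specificity.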
66. My Interesting answers/posts
To understand results of logistic regression or other classifiers
https://learnerworld.tumblr.com/post/152327498485/enjoystatisticswithmebinaryclassifierperformance
Hypothesis testing in layman’s terms
https://learnerworld.tumblr.com/search/hypothesis
Understanding mediation effect
https://learnerworld.tumblr.com/post/146541892120/mediation-effectenjoystatisticswtihme
67. My Interesting answers/posts
Dependence vs Correlation
https://www.quora.com/What-is-the-difference-between-dependence-and-correlation/answer/Nisha-Arora-9
Collinearity & Correlation
https://www.quora.com/In-statistics-what-is-the-difference-between-collinearity-and-correlation/answer/Nisha-Arora-9
68. My Expertise
Technical Topics:
Python for Data Science or Data Analysis
R Programming
Data Visualization & Storytelling
Machine Learning/Data Science
Statistics [For researchers/Data Science practitioners/ university
students] _Theory/mathematical proofs/application based/using
interactive tools/playing with data using some software
Data Analysis using SPSS
Mathematics [Don't want to write too much but depends on what is the
requirement]
Excel [Basic to intermediate/tools for data analysis/operations
research/operations management/specific course for academicians, etc]
To know more about these,
click here
69. My Expertise
Non-technical Topics:
Interactive pedagogical tools/web resources
The art of effective use of Information & Communication Tools (ICT)
Tools/Platform for hosting online lectures/meetings/live sessions
Effective Googling for finding the right resources (books/ research
papers/ answers)
Leveraging online research communities, Q/A sites, groups, meet-ups
to dive deep in a particular topic of interest
Bridging the gap between industry & academia
Creating a personal brand by leveraging power of social media
Getting smart with MS Office (Word, Excel, Power Point, etc.)
Learning Google products (mentioned in the slide)
Learning how to learn
Learning how to teach
Note Taking
Effective Communication & Presentation
Dr.aroranisha@gmail.com
75. Reach Out to Me
http://stats.stackexchange.com/users/79100/nisha-arora
https://www.researchgate.net/profile/Nisha_Arora2/contributions
https://www.quora.com/profile/Nisha-Arora-9
http://learnerworld.tumblr.com/
nishaarora4@gmail.com