This document provides an overview of regression analysis, including simple linear regression, multiple regression, and the assessment of model assumptions. It defines regression as a technique for investigating relationships between variables: simple linear regression involves one predictor and one response variable, while multiple regression extends this to several predictors. Key steps are outlined, such as assessing the fit of regression models using R-squared, testing the significance of individual predictors, and checking the assumptions of normality, linearity and equal variance. Examples demonstrate how to evaluate these assumptions and interpret regression results.
3. ◉ Linear Regression is a supervised modelling technique for continuous data that generates a response based on a set of input features.
◉ It is used to explain the linear relationship between a single variable Y, called the response (output or dependent variable), and one or more predictors (input, independent or explanatory variables).
◉ It is a simple regression problem if only a single variable X is considered; if more than one predictor is used, it takes the form of a multiple regression problem.
4. ◉ Statistical Modelling is the process of obtaining a statistical model that adequately describes the relationships between the variables involved.
◉ The model takes the form of a prediction equation: the values of a dependent variable (DV) are predicted by a set of independent variables (IVs).
◉ Simplest model: simple linear regression.
6. Simple Linear Regression
◉ SLR investigates the linear relation between two variables Y (DV) and X (IV or explanatory variable).
◉ “Linear”: the population mean of Y is represented as a linear, straight-line function of X.
◉ “Simple”: there is only one independent variable.
◉ Examples:
• air quality and lung function
• medication dose and outcome of a blood test
7. Explore the Relationship Between Two Continuous Variables
◉ Step 1: Scatterplot
The shape of the scatterplot gives the form of the relation:
• linear
• quadratic
• more complex
◉ Step 2: Correlation coefficient
The strength of a linear relation is given by the correlation coefficient r:
• r ranges from –1 to +1
• –1: perfect negative linear relationship
• +1: perfect positive linear relationship
• 0: no linear relationship
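As a minimal illustration of these two steps in R, the sketch below uses made-up height and lung-function values (hypothetical, not data from the slides):

#hypothetical data: height (cm) and lung function FEV (litres)
height <- c(150, 155, 160, 165, 170, 175, 180)
fev <- c(2.1, 2.4, 2.6, 2.9, 3.2, 3.4, 3.8)
#Step 1: the scatterplot gives the form of the relation
plot(height, fev)
#Step 2: the correlation coefficient gives the strength of the linear relation
cor(height, fev)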
9. ◉ Step 3: Simple linear regression
The population line is Y = α + βX + e, where:
• Y = dependent or response variable; must be continuous
• X = independent / predictor / explanatory variable or covariate
• α = population regression parameter / intercept: the point where the line crosses the vertical axis
• β = population regression parameter / slope: the change in the mean value of Y for each increase of one unit in the value of X
• e = model error term (residual): the deviation between the predicted value of Y and the actual value of Y, assumed Normally distributed with mean 0 and standard deviation σ
11. Objective of SLR
◉ Objective: to predict or estimate the value of the DV Y corresponding to a given value of the IV X, through the estimated regression line.
◉ Sample: the observed values are Xi and Yi, i = 1, 2, …, n.
◉ Build up an estimated regression line from the sample: the regression line from the observed data is an estimate of the relationship between X and Y in the population.
12. Estimated Regression Equation
◉ The estimated line is Ŷ = a + bX, where:
◉ a = regression coefficient
= the estimate of the parameter α
= the intercept of the estimated regression line
= the value of Y where the line crosses the Y axis
◉ b = regression coefficient
= the estimate of the parameter β
= the slope of the estimated regression line
= the change in the mean value of Y for each one-unit increase in the value of X
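A minimal sketch of obtaining a and b in R, continuing the hypothetical height/fev data above; coef() returns the intercept and slope:

#fit the estimated regression line: predicted fev = a + b * height
fit <- lm(fev ~ height)
coef(fit)   #first element = a (intercept), second = b (slope)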
13. Residual
◉ For any subject i, i = 1, 2, 3, …, n, the original observed values are Xi and Yi.
◉ For any given Xi, the Y value given by the line is called the predicted value, denoted by Ŷi.
◉ The residual ei is the difference between the observed and predicted values: ei = Yi − Ŷi.
14. Least Squares Estimation
◉ Least squares estimation is the method of estimating the equation / fitting the model to the data in an optimal way.
◉ The sum of squares of the vertical distances of the observations from the line is minimized.
◉ Least squares estimation minimizes Σ ei² = Σ (Yi − Ŷi)², as sketched below.
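To make the estimation concrete, this sketch computes the least squares estimates from their standard closed-form solutions, b = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and a = Ȳ − b X̄ (textbook results, not stated on the slides), and checks them against lm(), reusing the hypothetical data above:

#closed-form least squares estimates
b <- sum((height - mean(height)) * (fev - mean(fev))) /
  sum((height - mean(height))^2)
a <- mean(fev) - b * mean(height)
c(a, b)
coef(fit)   #should match the lm() coefficients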
16. Is X a Significant Predictor of Y?
◉ The association between X and Y is given by the regression coefficient for the slope.
◉ A zero slope means X has no “impact” on Y, whereas a large value indicates large changes in Y when X changes.
17. Coefficient of Determination
◉ Denoted by R²
◉ Measures the goodness of fit of the model
◉ Assesses the usefulness or predictive value of the model
◉ Interpreted as the proportion of variability in the observed values of Y explained by the regression of Y on X
◉ E.g. R² = 71.9%: almost 72% of the variation in lung function (FEV) is explained by the regression of FEV on height
◉ R² = SSR/SST (e.g. 78.34/109.01 = 0.719 = 71.9%)
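In R, R² can be read from summary() or computed as SSR/SST; a sketch with the hypothetical fit above:

#R-squared from the model summary
summary(fit)$r.squared
#or from the sums of squares: SSR / SST
ssr <- sum((fitted(fit) - mean(fev))^2)
sst <- sum((fev - mean(fev))^2)
ssr / sst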
18. R² and b
◉ The coefficient of determination R² describes how well the regression equation summarizes the data.
◉ The regression coefficient b gives the nature of the relationship between X and Y: the degree of change in Y for given changes in X.
◉ Two data sets may have the same slope b but different R² values, and vice versa.
21. Assumptions
1) The observations must be independent.
2) The values of the dependent variable Y should be Normally distributed (normality).
3) The variability (variance) of Y should be the same for each value of X (homoscedasticity or constant variation).
4) If X is continuous, the relation between X and Y should be linear (linearity).
Note
• X need not be a random variable nor have a Normal distribution.
• Strictly, the assumptions need to hold for the residuals, but they can equivalently be tested on Y or on the residuals.
• A transformation of Y may be required.
22. Assumptions - Strategies for Testing
◉ Normality
• Test the Y values or the standardized residuals
• using five measures (histogram, Normal Q-Q plot, boxplot, skewness and kurtosis statistics)
◉ Linearity
• Assess from a scatterplot of X vs. Y
◉ Constant variation
• In a plot of the standardized residuals vs. X, the points should scatter randomly (without any pattern) and evenly (with the same vertical spread); see the sketch after this list
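These three checks might look as follows in R, continuing the hypothetical simple model above; rstandard() gives the standardized residuals:

#normality: histogram, Normal Q-Q plot and boxplot of standardized residuals
z <- rstandard(fit)
hist(z)
qqnorm(z); qqline(z)
boxplot(z)
#linearity: scatterplot of X vs. Y
plot(height, fev)
#constant variation: standardized residuals vs. X, look for random, even scatter
plot(height, z)
abline(h = 0)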
33. Multiple Regression
◉ Simple linear regression describes the linear relationship between a dependent variable Y and a single explanatory variable X.
◉ Multiple regression is an extension to the case of one dependent variable and two or more explanatory variables.
34. Reasons for Performing Multiple Regression
◉ Predictions based on a number of variables will be better than those based on only one explanatory variable.
◉ When testing the effect of a primary variable of interest, e.g. a treatment effect or exposure, one needs to account for all other extraneous influences.
• This is the need to ‘control’ or ‘adjust’ for the possible effects of ‘nuisance’ explanatory variables (known as confounders).
◉ The relationships may be complex, e.g. variables may have combined or synergistic effects on the dependent variable.
35. Reasons for Performing Multiple Regression
◉ It is almost always better to perform one comprehensive analysis including all the relevant variables than a series of two-way comparisons.
• This reduces the chance of inflating the Type I error rate beyond 5%.
◉ In multiple regression a linear model is fitted for the dependent variable, which is expressed as a linear combination of the independent variables.
36. Importance of Predictors
◉ The regression coefficient bi represents the effect of that independent variable on the DV Y, after controlling for all the other variables in the model.
◉ The importance of each individual variable is tested by a t test or an F test, as for SLR.
◉ The significance of an explanatory variable depends on which other variables are included in the regression model.
◉ A confidence interval gives further information.
37. Multiple Regression Models
◉ Multiple linear regression
• predictors all continuous and linearly related to the dependent variable
◉ Analysis of covariance (ANCOVA)
• both continuous and categorical predictors
◉ Analysis of variance (e.g. two-way ANOVA)
• predictors all categorical
◉ Polynomial regression
• quadratic or higher-order terms included
38. Categorical Predictors
◉ The association between a continuous DV Y and a categorical IV X is assessed by comparing the mean Y values in each category of X.
◉ A reference category is chosen to compare the other category/ies with.
◉ The regression coefficient for a comparison represents the difference in the mean of Y for the given category vs. the reference category, as in the sketch below.
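A sketch of a categorical predictor in R, assuming a hypothetical three-level smoking-status factor alongside the height/fev data above; relevel() fixes the reference category, and each reported coefficient is then a mean difference vs. that reference:

#hypothetical categorical predictor with three levels
smoke <- factor(c("never", "never", "ex", "ex", "current", "current", "never"))
smoke <- relevel(smoke, ref = "never")   #choose the reference category
fit2 <- lm(fev ~ height + smoke)
summary(fit2)   #each smoke coefficient = mean difference in fev vs. "never"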
39. Assessing the Fit of the Model
◉ R² measures the usefulness or predictive value of the model.
◉ R² is interpreted as the proportion of the total variability explained by the model.
◉ R² increases in value as each additional variable is added to the model.
◉ Adjusted R² (the preferred measure) takes into account the number of explanatory variables included in the model.
◉ E.g. R² = 0.482, adjusted R² = 0.462.
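Adjusted R² applies the standard penalty R²adj = 1 − (1 − R²)(n − 1)/(n − p − 1) for p predictors and n observations; a quick check in R against summary(), using the hypothetical fit2 above:

#adjusted R-squared by hand vs. the value reported by summary()
n <- length(fev)
p <- 3   #height plus two smoking-status dummy variables
r2 <- summary(fit2)$r.squared
1 - (1 - r2) * (n - 1) / (n - p - 1)
summary(fit2)$adj.r.squared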
40. ◉ Also assess fit by inspection of the standardized residuals:
• if the model fits well, these should follow a Normal distribution
• values > 3 or < −3 are large
◉ A large residual means the model does not fit well for that subject.
◉ Some large residuals will occur by chance; many large residuals are a concern.
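Flagging large standardized residuals in R, for the hypothetical fit2 above:

#standardized residuals; |value| > 3 suggests the model fits that subject poorly
z2 <- rstandard(fit2)
which(abs(z2) > 3)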
41. Assumptions of Multiple Regression
◉ The observations must be independent.
◉ The relation between each continuous X and the dependent variable should be linear.
◉ The values of the dependent variable Y should have a Normal distribution.
◉ The variability of Y should be the same for any set of values of the explanatory variables (homoscedasticity).
42. How to Assess the Assumptions
◉ Assess the Normality of Y (or of the standardized residuals).
◉ Obtain scatterplots of Y (or of the standardized residuals) against each continuous X, primarily to assess linearity.
◉ Obtain:
• Levene’s test for Y (or the standardized residuals), if categorical predictors are included in the model, to assess equal variance
• a plot of the standardized residuals against each X, if continuous predictors are included in the model, primarily to assess constant variation (see the sketch after this list)
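A sketch of both checks in R for the hypothetical data above; Levene's test is available as leveneTest() in the car package:

#equal variance of Y across the categories of a factor
library(car)
leveneTest(fev ~ smoke)
#constant variation against a continuous predictor
plot(height, rstandard(fit2))
abline(h = 0)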
43. Example - Assess the Assumptions
◉ DV: FEV1
◉ Explanatory variables:
• height (in cm)
• gender (binary)
• smoking status (3 categories)
◉ Normality of FEV1 (5 measures):
• skewness = −0.11
• kurtosis − 3 (excess kurtosis) = −0.80
• Assumed
45. Constant Variation
◉ Standardized residuals vs. height (scatterplot): no clear pattern
◉ Assumed
46. Equality of Variances
• Levene’s test (robust): p = 0.937 > 0.05
• Assumed
Conclusion: all the assumptions are met.
Note: the tests could equivalently be run on the standardized residuals.
47. R Code for MLR
#loading of the data
data("mtcars")
#viewing the data
mtcars
head(mtcars)
names(mtcars)
#attach() is used so that we need not reference the data frame every time
attach(mtcars)
#checking the relationship between the variables
plot(mpg,cyl)
plot(mpg,disp)
plot(mpg,hp)
plot(mpg,drat)
plot(mpg,wt)
plot(mpg,qsec)
plot(mpg,vs)
plot(mpg,am)
plot(mpg,gear)
plot(mpg,carb)
#creating the multiple linear regression model
model <- lm(mpg~cyl+disp+hp+drat+wt+qsec+vs+am+gear+carb)
model
#checking the summary of the model
summary(model)
#various parameters to check the fitness of the model
#residual standard error = sqrt(RSS / df); here df = 32 observations - 11 parameters = 21
sqrt(sum((model$residuals)^2)/21)
summary(model)
#hypothesis testing: t-test
#the test statistic is the point estimate of a coefficient (slope) divided by the standard error of that coefficient
#example 1
tstat <- coef(summary(model))[3,1]/coef(summary(model))[3,2]
tstat
2*pt(abs(tstat), 21, lower.tail=FALSE)   #two-sided p-value
#example 2
tstat2 <- coef(summary(model))[1,1]/coef(summary(model))[1,2]
tstat2
2*pt(abs(tstat2), 21, lower.tail=FALSE)  #two-sided p-value
summary(model)
#F-test
summary(model)
#Coefficient Confidence Intervals
confint(model, level=.95)
#testing the various assumptions of the model
#1 checking whether the residuals are normally distributed or not
#histogram
resid<- model$residuals
hist(resid)
#quantile plot
qqnorm(resid)
qqline(resid)
#2 checking the homoscedasticity
plot(model$residuals ~ disp)
abline(0,0)
#residual analysis: built-in diagnostic plots (residuals vs. fitted, Q-Q, scale-location, leverage)
plot(model)
#transformations
model1 = lm(mpg ~cyl+log(disp)+log(hp)+drat+wt+qsec+vs+am+gear+carb)
summary(model1)
plot(model1)
#reducing the model
#calling of library
library(MASS)
#running stepwise AIC on the initial model
stepAIC(model)
#running the AIC on the transformed model
stepAIC(model1)
#constructing new models with reduced variables
model2<-lm(mpg~qsec+wt+am)
summary(model2)
model6<-lm(mpg~log(disp)+gear+carb)
summary(model6)
#partial F-test: compare the reduced (nested) model with the full model
nestmodel = lm(mpg ~ wt + qsec + am)
anova(model,nestmodel)
#Multicollinearity
plot(mtcars)
#checking the correlation
cor(qsec, wt)
cor(am, wt)
cor(am, qsec)
#variance inflation factor (requires the car package)
install.packages("car")   #run once if not already installed
library(car)
vif(model2)
#Polynomial Model
plot(model2$residuals ~ model2$fitted.values, xlab = "Fitted Values", ylab = "Residuals")
abline(0,0)
quadmod = lm(mpg ~ qsec + I(qsec^2)+ wt + am)
plot(quadmod$residuals ~ quadmod$fitted.values, xlab = "Fitted Values", ylab = "Residuals")
abline(0,0)
summary(quadmod)
AIC(quadmod)
#Interaction Model
model3<-lm(mpg~qsec+wt*am)
summary(model3)
AIC(model3)
resid3<- model3$residuals
hist(resid3)
qqnorm(resid3)
qqline(resid3)
plot(model3$residuals ~ disp)
abline(0,0)
plot(model3)