REGRESSION ANALYSIS

Presented by –
Alichy Sowmya
Parth Prajapati
Vikrant Ratnakar

Department of Pharmacoinformatics, NIPER S.A.S. Nagar
What is regression?

◉ Linear regression is a supervised modelling technique for continuous data that generates a response based on a set of input features.
◉ It is used to explain the linear relationship between a single variable Y, called the response (output or dependent variable), and one or more predictors (input, independent or explanatory variables).
◉ It is a simple regression problem if only a single predictor X is considered; it becomes a multiple regression problem if more than one predictor is used in the model.
◉ Statistical modelling is the process of obtaining a statistical model that adequately describes the relationships between the variables involved
◉ The model takes the form of a prediction equation: the values of a dependent variable (DV) are predicted by a set of independent variables (IVs)
◉ The simplest such model is simple linear regression
Simple Linear Regression
◉ SLR investigates the linear relation between two variables: Y (the DV) and X (the IV or explanatory variable)
◉ “Linear”: the population mean of Y is represented as a linear (straight-line) function of X
◉ “Simple”: there is only one independent variable
◉ Examples:
• air quality and lung function
• medication dose and outcome of a blood test
Exploring the Relationship Between Two Continuous Variables

◉ Step 1: Scatterplot
The shape of the scatterplot gives the form of the relation:
• linear
• quadratic
• more complex
◉ Step 2: Correlation coefficient
The strength of a linear relation is given by the correlation coefficient r:
• r ranges from –1 to +1
• –1 : perfect negative linear relationship
• +1 : perfect positive linear relationship
• 0 : no linear relationship
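The range of r can be illustrated with a short R sketch (the toy vectors below are assumed for illustration, not data from the slides):

```r
# Toy vectors (assumed for illustration) showing the extremes of r
x <- c(1, 2, 3, 4, 5)

r_pos  <- cor(x,  2 * x + 1)   # exact straight line, positive slope: r = +1
r_neg  <- cor(x, -3 * x + 10)  # exact straight line, negative slope: r = -1
r_zero <- cor(x, (x - 3)^2)    # symmetric curve: a strong relation, but r = 0

c(r_pos, r_neg, r_zero)
```

Note that r measures only linear association: the third pair is perfectly related, yet r = 0.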
◉ Step 3: Simple linear regression
The population regression line is

Y = α + βX + e

• Y = dependent or response variable; must be continuous
• X = independent / predictor / explanatory variable or covariate
• α = population regression parameter (intercept): the point where the line crosses the vertical axis
• β = population regression parameter (slope): the change in the mean value of Y for each increase of one unit in the value of X
• e = model error term (residual): the deviation between the predicted value of Y and the actual value of Y; assumed Normally distributed with mean 0 and standard deviation σ
Objective of SLR

◉ Objective: to predict or estimate the value of the DV Y corresponding to a given value of the IV X, through the estimated regression line
◉ Sample: the observed values are Xi and Yi, i = 1, 2, …, n
◉ Build-up: an estimated regression line is constructed from the sample; the regression line from the observed data is an estimate of the relationship between X and Y in the population
Estimated Regression Equation

Ŷ = a + bX

◉ a = regression coefficient
= the estimate of the parameter α
= the intercept of the estimated regression line
= the value of Y where the line crosses the Y axis
◉ b = regression coefficient
= the estimate of the parameter β
= the slope of the estimated regression line
= the change in the mean value of Y for each one-unit increase in the value of X
Residual

◉ For any subject i, i = 1, 2, 3, …, n
◉ The original observed values are Xi and Yi
◉ For any given Xi, the Y value given by the line is called the predicted value, denoted Ŷi
◉ The residual ei is the difference between the observed value and the predicted value: ei = Yi − Ŷi
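In R the predicted values and residuals of a fitted line can be extracted directly; a minimal sketch with simulated data (the variables below are assumed for illustration):

```r
set.seed(1)
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20)   # simulated data (assumed for illustration)

fit  <- lm(y ~ x)
yhat <- fitted(fit)             # predicted values Yhat_i
e    <- resid(fit)              # residuals e_i = Y_i - Yhat_i

all.equal(unname(e), unname(y - yhat))  # observed minus predicted
sum(e)                                  # residuals of a fitted line with intercept sum to 0
```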
Least Squares Estimation

◉ Least squares estimation is the method of estimating the equation / fitting the model to the data in an optimal way
◉ The sum of squares of the vertical distances of the observations from the line is minimized
◉ Least squares estimation minimizes Σ ei² = Σ (Yi − Ŷi)²
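For SLR the least-squares minimisation has a closed-form solution, b = Sxy/Sxx and a = ȳ − b·x̄, which `lm()` reproduces. A sketch with simulated data (values assumed for illustration):

```r
set.seed(42)
x <- runif(30, 0, 10)
y <- 2 + 1.5 * x + rnorm(30)   # simulated data (assumed for illustration)

# closed-form least-squares estimates
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

fit <- lm(y ~ x)               # lm() minimises the same sum of squares
c(a, b)
coef(fit)                      # intercept and slope agree with a and b
```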
Is X a significant predictor of Y?

◉ The association between X and Y is given by the regression coefficient for the slope
◉ A zero slope means X has no “impact” on Y, whereas a large value indicates large changes in Y when X changes
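The slope's significance is tested with a t statistic, t = b / se(b), reported in the coefficient table of `summary()`. A sketch with simulated data (values assumed for illustration):

```r
set.seed(7)
x <- 1:25
y <- 4 + 0.8 * x + rnorm(25, sd = 2)   # simulated data (assumed for illustration)

fit <- lm(y ~ x)
tab <- summary(fit)$coefficients        # one row per coefficient

b      <- tab["x", "Estimate"]
se_b   <- tab["x", "Std. Error"]
t_stat <- tab["x", "t value"]           # t = b / se(b)
p_val  <- tab["x", "Pr(>|t|)"]          # small p: slope differs from zero

c(t_stat, b / se_b, p_val)
```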
Coefficient of Determination

◉ Denoted by R²
◉ Measures the goodness of fit of the model
◉ Assesses the usefulness or predictive value of the model
◉ Interpreted as the proportion of variability in the observed values of Y explained by the regression of Y on X
◉ E.g. R² = 71.9%: almost 72% of the variation in lung function (FEV) is explained by the regression of FEV on height
◉ R² = SSR/SST (e.g. 78.34/109.01 = 0.719 = 71.9%)
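The decomposition behind R² = SSR/SST can be checked directly in R (toy height/FEV-style values assumed for illustration, not the slides' data):

```r
set.seed(3)
x <- seq(150, 190, length.out = 40)          # e.g. height in cm (toy values)
y <- -5 + 0.05 * x + rnorm(40, sd = 0.4)     # e.g. FEV (toy values)

fit <- lm(y ~ x)
SST <- sum((y - mean(y))^2)                  # total sum of squares
SSR <- sum((fitted(fit) - mean(y))^2)        # regression (explained) sum of squares
SSE <- sum(resid(fit)^2)                     # residual sum of squares

SSR / SST                                    # equals summary(fit)$r.squared
```

The identity SST = SSR + SSE is what makes R² interpretable as a proportion of variability.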
R² and b
◉ The coefficient of determination R² describes how well the regression equation summarises the data
◉ The regression coefficient b gives the nature of the relationship between X and Y: the degree of change in Y for certain changes in X
◉ Two data sets may have the same slope b but different R² values, and vice versa
Assumptions

1) The observations must be independent
2) The values of the dependent variable Y should be Normally distributed (normality)
3) The variability (variance) of Y should be the same for each value of X - homoscedasticity or constant variation
4) If X is continuous, the relation between X and Y should be linear (linearity)
Note
• X need not be a random variable, nor have a Normal distribution
• The assumptions in fact need to hold for the residuals, but can equivalently be tested for Y or for the residuals
• A transformation of Y may be required
Assumptions - Strategies for testing

◉ Normality
• Test the Y values or the standardized residuals
• using 5 measures (histogram, Normal Q-Q plot, boxplot, skewness and kurtosis statistics)
◉ Linearity
• Assess from a scatterplot of X vs. Y
◉ Constant variation
• Plot the standardized residuals vs. X
• The points should scatter randomly (without any pattern) and evenly (vertical spread the same)
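These checks can be sketched in a few lines of R; the Shapiro-Wilk test below is one common formal complement to the five graphical measures (simulated data assumed for illustration):

```r
set.seed(9)
x <- runif(50, 1, 10)
y <- 1 + 2 * x + rnorm(50)     # simulated data meeting the assumptions

fit <- lm(y ~ x)
z <- rstandard(fit)            # standardized residuals

# Normality: formal test alongside the graphical checks
sw <- shapiro.test(z)
sw$p.value                     # a large p-value gives no evidence against Normality

# Constant variation: standardized residuals vs X, look for random, even scatter
plot(z ~ x)
abline(h = 0, lty = 2)
```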
An Example: FEV (Y) and height (X)

◉ Normality of FEV:
• skewness = 0.867
• kurtosis - 3 = 1.028
◉ Linearity: assessed from the scatterplot of FEV vs. height (plot omitted)
◉ Constant variance?
• Constant variance does not hold
• FEV needs a natural logarithm transformation
After transformation, reassess the assumptions for the transformed variable ln(FEV):
◉ Normality:
• skewness = 0.040
• kurtosis - 3 = -0.433
◉ Constant variation: reassessed for ln(FEV) (plot omitted)
R Code for SLR

dataset = read.csv("SLR.csv", header = TRUE,
                   colClasses = c("numeric", "numeric", "numeric"))
head(dataset, 5)

#///// Simple Regression /////
simple.fit = lm(Sales ~ Spend, data = dataset)
summary(simple.fit)

# Loading the necessary libraries
library(lmtest)   # dwtest
library(fBasics)  # jarqueberaTest

# Testing normal distribution and independence assumptions
jarqueberaTest(simple.fit$resid)  # Test residuals for normality
# Null hypothesis: skewness and kurtosis are equal to zero
dwtest(simple.fit)                # Test for independence of residuals
# Null hypothesis: errors are serially UNcorrelated

# Simple regression residual plots
layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))

# Spend x Residuals plot
plot(simple.fit$resid ~ dataset$Spend,
     main = "Spend x Residuals\nfor Simple Regression",
     xlab = "Marketing Spend", ylab = "Residuals")
abline(h = 0, lty = 2)

# Histogram of residuals
hist(simple.fit$resid, main = "Histogram of Residuals",
     xlab = "Residuals")

# Q-Q plot
qqnorm(simple.fit$resid)
qqline(simple.fit$resid)
Multiple Regression
◉ Simple linear regression describes the linear relationship between a dependent variable Y and a single explanatory variable X
◉ Multiple regression is the extension to one dependent variable and two or more explanatory variables
Reasons for performing Multiple Regression

◉ Predictions based on a number of variables will be better than those based on only one explanatory variable
◉ When testing the effect of a primary variable of interest, e.g. a treatment effect / exposure, one needs to account for all other extraneous influences
• The need to ‘control’ or ‘adjust’ for the possible effects of ‘nuisance’ explanatory variables (known as confounders)
◉ The relationships may be complex, e.g. variables may have combined or synergistic effects on the dependent variable
Reasons for performing Multiple Regression (continued)

◉ It is almost always better to perform one comprehensive analysis including all the relevant variables than a series of two-way comparisons
• This reduces the chance of inflating the Type I error rate beyond 5%
◉ In multiple regression a linear model is fitted for the dependent variable, which is expressed as a linear combination of the independent variables
Importance of Predictors

◉ The regression coefficient bi represents the effect of that independent variable on the DV Y, after controlling for all the other variables in the model
◉ The importance of each individual variable is tested by a t test or an F test, as for SLR
◉ The significance of an explanatory variable depends on which other variables are included in the regression model
◉ A confidence interval gives further information
Multiple regression models

◉ Multiple linear regression
• predictors all continuous and linearly related to the dependent variable
◉ Analysis of covariance (ANCOVA)
• both continuous and categorical predictors
◉ Analysis of variance (e.g. two-way ANOVA)
• predictors all categorical
◉ Polynomial regression
• quadratic or higher-order terms included
Categorical predictors

◉ The association between a continuous DV Y and a categorical IV X is assessed by comparing the mean Y values in each category of X
◉ A reference category is chosen to compare the other category/ies with
◉ The regression coefficient for a comparison represents the difference in the mean of Y for the given category vs the reference category
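In R a categorical predictor enters the model as a factor, and `relevel()` sets the reference category. A sketch with toy FEV-style data in three smoking categories (all values assumed for illustration):

```r
set.seed(5)
# Toy FEV-style data in three smoking categories (values assumed for illustration)
smoke <- factor(rep(c("never", "former", "current"), each = 20))
fev   <- c(rnorm(20, mean = 3.2), rnorm(20, mean = 3.0), rnorm(20, mean = 2.7))

smoke <- relevel(smoke, ref = "never")   # make "never" the reference category

fit <- lm(fev ~ smoke)
coef(fit)
# each slope is the difference in mean FEV: that category minus the reference
group_means <- tapply(fev, smoke, mean)
group_means["current"] - group_means["never"]   # matches coef(fit)["smokecurrent"]
```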
Assessing the fit of the model

◉ R² measures the usefulness or predictive value of the model
◉ R² is interpreted as the proportion of the total variability explained by the model
◉ R² increases in value as each additional variable is added to the model
◉ Adjusted R² (the preferred measure) takes into account the number of explanatory variables included in the model
◉ E.g. R² = 0.482, adjusted R² = 0.462
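The difference between R² and adjusted R² shows up when an irrelevant variable is added; a sketch with simulated data (values assumed for illustration):

```r
set.seed(11)
n  <- 40
x1 <- rnorm(n)
noise <- rnorm(n)               # a predictor unrelated to y
y  <- 1 + 2 * x1 + rnorm(n)     # simulated data (assumed for illustration)

s1 <- summary(lm(y ~ x1))
s2 <- summary(lm(y ~ x1 + noise))

c(s1$r.squared,     s2$r.squared)      # plain R^2 never decreases when a term is added
c(s1$adj.r.squared, s2$adj.r.squared)  # adjusted R^2 penalises the extra term
```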
◉ Also assess fit by inspection of the standardized residuals
• If the model fits well, these follow a Normal distribution
• Any value > 3 or < -3 is large
◉ A large residual means the model does not fit well for that subject
◉ Some large residuals will occur by chance; many large residuals are of concern
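Standardized residuals are available via `rstandard()`, so poorly fitted subjects can be flagged directly; a sketch with one planted outlier (simulated data assumed for illustration):

```r
set.seed(2)
x <- 1:30
y <- 5 + x + rnorm(30)
y[30] <- y[30] + 15            # plant one gross outlier (assumed for illustration)

fit <- lm(y ~ x)
z <- rstandard(fit)            # standardized residuals

which(abs(z) > 3)              # subjects the model fits poorly
```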
Assumptions of multiple regression
◉ The observations must be independent
◉ The relation between each continuous X and the dependent
variable should be linear
◉ The values of the dependent variable Y should have a Normal
distribution
◉ The variability of Y should be the same for any set of values of the
explanatory variables – homoscedasticity
How to assess the assumptions

◉ Assess the Normality of Y (or the standardized residuals)
◉ Obtain scatterplots of Y (or the standardized residuals) against each continuous X, primarily to assess linearity
◉ Obtain
• Levene's test for Y (or the standardized residuals) (if categorical predictors are included in the model), to assess equal variance
• a plot of the standardized residuals against each X (if continuous predictors are included in the model), primarily to assess constant variation
Example - assess the assumptions

◉ DV: FEV1
◉ Explanatory variables:
• Height (in cm)
• Gender (binary)
• Smoking status (3 categories)
◉ Normality of FEV1 (5 measures)
• skewness = -0.11
• kurtosis - 3 = -0.80
• Assumed
◉ Linearity: FEV1 vs height (scatterplot)
• Assumed
◉ Constant variation:
• standardized residuals vs height (scatterplot): no clear pattern
• Assumed
◉ Equality of variances:
• Levene's test (robust): p = 0.937 > 0.05
• Assumed
Conclusion: all the assumptions are met
Note: the tests could be done using the standardized residuals
R Code for MLR

# loading the data
data("mtcars")
# viewing the data
mtcars
head(mtcars)
names(mtcars)
# attach is used so that we need not reference the data frame every time
attach(mtcars)
# checking the relationship between the variables
plot(mpg, cyl)
plot(mpg, disp)
plot(mpg, hp)
plot(mpg, drat)
plot(mpg, wt)
plot(mpg, qsec)
plot(mpg, vs)
plot(mpg, am)
plot(mpg, gear)
plot(mpg, carb)
# creating the multiple linear model
model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb)
model
# checking the summary of the model
summary(model)

# various parameters to check the fit of the model
# residual standard error (root mean squared error with 21 residual df)
sqrt(sum((model$residuals)^2) / 21)

# hypothesis testing: t-test
# the test statistic is the point estimate of a coefficient divided by the
# standard error of that coefficient
# example 1
tstat <- coef(summary(model))[3, 1] / coef(summary(model))[3, 2]
tstat
2 * pt(abs(tstat), 21, lower.tail = FALSE)   # two-sided p-value
# example 2
tstat2 <- coef(summary(model))[1, 1] / coef(summary(model))[1, 2]
tstat2
2 * pt(abs(tstat2), 21, lower.tail = FALSE)  # two-sided p-value
# F-test (overall significance, reported by summary)
summary(model)
# coefficient confidence intervals
confint(model, level = .95)

# testing the various assumptions of the model
# 1. checking whether the residuals are normally distributed
# histogram
resid <- model$residuals
hist(resid)
# quantile plot
qqnorm(resid)
qqline(resid)
# 2. checking the homoscedasticity
plot(model$residuals ~ disp)
abline(0, 0)
# residual analysis
plot(model)

# transformations
model1 = lm(mpg ~ cyl + log(disp) + log(hp) + drat + wt + qsec + vs + am + gear + carb)
summary(model1)
plot(model1)

# reducing the model
library(MASS)
# running stepwise AIC on the initial model
stepAIC(model)
# running stepwise AIC on the transformed model
stepAIC(model1)
# constructing new models with reduced variables
model2 <- lm(mpg ~ qsec + wt + am)
summary(model2)
model6 <- lm(mpg ~ log(disp) + gear + carb)
summary(model6)

# partial F-test
nestmodel = lm(mpg ~ wt + qsec + am)
anova(model, nestmodel)

# multicollinearity
plot(mtcars)
# checking the correlations
cor(qsec, wt)
cor(am, wt)
cor(am, qsec)
# variance inflation factor
install.packages("car")
library(car)
vif(model2)

# polynomial model
plot(model2$residuals ~ model2$fitted.values, xlab = "Fitted Values", ylab = "Residuals")
abline(0, 0)
quadmod = lm(mpg ~ qsec + I(qsec^2) + wt + am)
plot(quadmod$residuals ~ quadmod$fitted.values, xlab = "Fitted Values", ylab = "Residuals")
abline(0, 0)
summary(quadmod)
AIC(quadmod)

# interaction model
model3 <- lm(mpg ~ qsec + wt * am)
summary(model3)
AIC(model3)
resid3 <- model3$residuals
hist(resid3)
qqnorm(resid3)
qqline(resid3)
plot(model3$residuals ~ disp)
abline(0, 0)
plot(model3)
# using the model for prediction
newdata <- data.frame(wt = 2.92, qsec = 20.1, am = 1)
predy <- predict(model3, newdata, interval = "predict")
predy
confy <- predict(model3, newdata, interval = "confidence")
confy
# interval widths: the prediction interval is wider than the confidence interval
confy %*% c(0, -1, 1)
predy %*% c(0, -1, 1)
# both intervals are centred on the same fitted value
confy[1] == predy[1]

# sample prediction by hand (observation 20: Toyota Corolla)
mtcars[20, ]
pred <- coef(summary(model3))[1, 1] +
        coef(summary(model3))[2, 1] * 19.9 +
        coef(summary(model3))[3, 1] * 1.835 +
        coef(summary(model3))[4, 1] * 1 +
        coef(summary(model3))[5, 1] * 1.835 * 1
pred
# observed minus predicted mpg
33.9 - 31.0523

THANK YOU…..
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 

Regression analysis in R

  • 1. REGRESSION ANALYSIS. Presented by: Alichy Sowmya, Parth Prajapati, Vikrant Ratnakar. Department of Pharmacoinformatics, NIPER S.A.S. Nagar
  • 3. ◉ Linear Regression is a supervised modeling technique for continuous data that generates a response based on a set of input features. ◉ It is used to explain the linear relationship between a single variable Y, called the response (output or dependent variable), and one or more predictors (input, independent or explanatory variables). ◉ It is a simple regression problem if only a single variable X is considered; it becomes a multiple regression problem if more than one predictor is used in the model.
  • 4. ◉ Statistical Modelling is the process of obtaining a statistical model which adequately describes the relationships between the variables involved ◉ The model takes the form of a prediction equation: the values of a dependent variable (DV) are predicted by a set of independent variables (IV) ◉ Simplest model: simple linear regression
  • 6. Simple Linear Regression ◉ SLR investigates the linear relation between two variables Y (DV) and X (IV or explanatory variable) ◉ “Linear”: used because the population mean of Y is represented as a linear or straight-line function of X ◉ “Simple”: refers to the fact that there is only one independent variable ◉ Examples: • air quality and lung function • medication dose and outcome of blood test
  • 7. Explore the Relationship Between Two Continuous Variables ◉ Step 1: Scatterplot. The shape of the scatterplot gives the form of the relation • linear • quadratic • more complex ◉ Step 2: Correlation coefficient. The strength of the linear relation is given by the correlation coefficient • r ranges from –1 to +1 • –1: perfect negative linear relationship • +1: perfect positive linear relationship • 0: no linear relationship
  • 8. [figure]
  • 9. ◉ Step 3: Simple linear regression. The population line: Y = α + βX + e • Y = dependent or response variable; must be continuous • X = independent / predictor / explanatory variable or covariate • α = population regression parameter / intercept: the point where the line crosses the vertical axis • β = population regression parameter / slope: the change in the mean value of Y for each increase of one unit in the value of X • e = model error term (residual): the deviation between the predicted value of Y and the actual value of Y, assumed Normally distributed with mean 0 and standard deviation σ
  • 10. [figure]
  • 11. Objective of SLR ◉ Objective: to predict or estimate the value of the DV Y corresponding to a given value of the IV X, through the estimated regression line ◉ Sample: the observed values are Xi and Yi, i = 1, 2, …, n ◉ Build up an estimated regression line using the sample. The regression line from the observed data is an estimate of the relationship between X and Y in the population
  • 12. Estimated Regression Equation: Ŷ = a + bX ◉ a = regression coefficient = the estimate of the parameter α = the intercept of the estimated regression line = the value of Y where the line crosses the Y axis ◉ b = regression coefficient = the estimate of the parameter β = the slope of the estimated regression line = the change in the mean value of Y for each one-unit increase in the value of X
  • 13. Residual ◉ For any subject i, i = 1, 2, 3, …, n ◉ The original observed values are Xi and Yi ◉ For any given Xi, the Y value given by the line is called the predicted value and denoted by Ŷi ◉ The residual ei = Yi − Ŷi is the difference between the observed value and the predicted value
  • 14. Least Squares Estimation ◉ Least squares estimation is the method of estimating the equation / fitting the model to the data in an optimal way ◉ The sum of squares of the vertical distances of the observations from the line is minimized ◉ Least squares estimation minimizes Σ ei² = Σ (Yi − Ŷi)²
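The deck's code examples are in R, but the closed-form least-squares estimates above are easy to check numerically. A minimal Python sketch with invented data (the x and y values are made up for illustration):

```python
import numpy as np

# Invented sample: predictor (x) and response (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least-squares estimates:
# b = Sxy / Sxx,  a = ybar - b * xbar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(a, b)  # intercept and slope of the fitted line
```

The same estimates come out of R's `lm(y ~ x)` or NumPy's `np.polyfit(x, y, 1)`.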
  • 15. [figure]
  • 16. Is X a significant predictor of Y? ◉ The association between X and Y is given by the regression coefficient for the slope ◉ A zero slope means X has no “impact” on Y, whereas a large value indicates large changes in Y when X changes
  • 17. Coefficient of Determination ◉ Denoted by R2 ◉ Measures the goodness of fit of the model ◉ Assesses the usefulness or predictive value of the model ◉ Interpreted as the proportion of variability in the observed values of Y explained by the regression of Y on X ◉ E.g. R2 = 71.9%: almost 72% of the variation in lung function (FEV) is explained by the regression of FEV on height ◉ R2 = SSR/SST (e.g. 78.34/109.01 = 0.719 = 71.9%)
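The decomposition R2 = SSR/SST can be verified directly. A Python sketch on invented data (these numbers are not the FEV example on the slide):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the least-squares line and form predictions
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
sse = np.sum((y - y_hat) ** 2)      # residual (error) sum of squares
ssr = sst - sse                     # regression sum of squares
r2 = ssr / sst
print(r2)
```

For SLR, r2 also equals the squared correlation coefficient, `np.corrcoef(x, y)[0, 1] ** 2`.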
  • 18. R2 and b ◉ The coefficient of determination R2 describes how well the regression equation summarizes the data ◉ The regression coefficient b gives the nature of the relationship between X and Y: the degree of change in Y for given changes in X ◉ Two data sets may have the same slope b but different R2 values, and vice versa
  • 19. [figure]
  • 20. [figure]
  • 21. Assumptions 1) The observations must be independent 2) The values of the dependent variable Y should be Normally distributed (normality) 3) The variability (variance) of Y should be the same for each value of X (homoscedasticity or constant variation) 4) If X is continuous, the relation between X and Y should be linear (linearity) Note • X need not be a random variable nor have a Normal distribution • Strictly, the assumptions need to hold for the residuals, but they can equivalently be tested on Y or on the residuals • A transformation of Y may be required
  • 22. Assumptions – strategies for testing ◉ Normality • Test Y values or the standardized residuals • using 5 measures (histogram, Normal Q-Q plot, boxplot, skewness and kurtosis statistics) ◉ Linearity • Assess from a scatterplot of X vs. Y ◉ Constant variation • Plot the standardized residuals vs. X • the points should scatter randomly (without any pattern) and evenly (vertical spread the same)
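The skewness and excess-kurtosis statistics listed among the 5 measures can be computed by hand. A Python sketch using the standard moment-based formulas (the large Normal sample is generated only so that both statistics land near zero):

```python
import numpy as np

def skewness(v):
    # Third standardized moment; zero for a symmetric distribution
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 3))

def excess_kurtosis(v):
    # Fourth standardized moment minus 3; zero for a Normal distribution
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(42)
normal_sample = rng.normal(size=100_000)
print(skewness(normal_sample), excess_kurtosis(normal_sample))
```

A clearly right-skewed variable (e.g. `rng.lognormal(size=100_000)`) gives a large positive skewness, which is the kind of signal that suggests a transformation of Y.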
  • 23. [figure]
  • 24. [figure]
  • 25. [figure]
  • 26. An Example: FEV (Y) and height (X) ◉ Normality of FEV: • skewness = 0.867 • kurtosis − 3 = 1.028
  • 28. Constant variance? • The constant-variance assumption does not hold • FEV needs a natural logarithm transformation
  • 29. After transformation, reassess the assumptions for the transformed variable ln(FEV) ◉ Normality: • skewness = 0.040 • kurtosis − 3 = −0.433
  • 31. R Code for SLR

dataset = read.csv("SLR.csv", header = T, colClasses = c("numeric", "numeric", "numeric"))
head(dataset, 5)

##### Simple Regression #####
simple.fit = lm(Sales ~ Spend, data = dataset)
summary(simple.fit)

# Loading the necessary libraries
library(lmtest)   # dwtest
library(fBasics)  # jarqueberaTest

# Testing the normal distribution and independence assumptions
jarqueberaTest(simple.fit$resid)  # Test residuals for normality
# Null hypothesis: skewness and excess kurtosis are equal to zero
dwtest(simple.fit)                # Test for independence of residuals
# Null hypothesis: errors are serially UNcorrelated

# Simple regression residual plots
layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = T))

# Spend x Residuals plot
plot(simple.fit$resid ~ dataset$Spend[order(dataset$Spend)],
     main = "Spend x Residuals\nfor Simple Regression",
     xlab = "Marketing Spend", ylab = "Residuals")
abline(h = 0, lty = 2)

# Histogram of residuals
hist(simple.fit$resid, main = "Histogram of Residuals", ylab = "Residuals")

# Q-Q plot
qqnorm(simple.fit$resid)
qqline(simple.fit$resid)
  • 33. Multiple Regression ◉ Simple linear regression describes the linear relationship between a dependent variable Y and a single explanatory variable X ◉ Multiple regression is the extension to one dependent variable and two or more explanatory variables
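Moving from one predictor to several amounts to solving the normal equations for a design matrix. A Python sketch on simulated data (the true coefficients 1.0, 2.0 and −0.5 are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Simulate y from a known linear model plus small Normal noise
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 0.1, n)

# Design matrix with an intercept column; beta_hat = (X'X)^-1 X'y
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # estimates of the intercept and the two slopes
```

In R this is simply `lm(y ~ x1 + x2)`.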
  • 34. Reasons for performing Multiple Regression ◉ Predictions made on the basis of a number of variables will be better than those based on only one explanatory variable ◉ When testing the effect of a primary variable of interest, e.g. a treatment effect or an exposure, one needs to account for all other extraneous influences • the need to ‘control’ or ‘adjust’ for the possible effects of ‘nuisance’ explanatory variables (known as confounders) ◉ The relationships may be complex, e.g. variables may have combined or synergistic effects on the dependent variable
  • 35. Reasons for performing Multiple Regression ◉ It is almost always better to perform one comprehensive analysis including all the relevant variables than a series of two-way comparisons • reduces the chance of inflating the Type I error rate beyond 5% ◉ In multiple regression a linear model is fitted for the dependent variable, which is expressed as a linear combination of the independent variables
  • 36. Importance of Predictors ◉ The regression coefficient bi represents the effect of that independent variable on the DV Y, after controlling for all the other variables in the model ◉ The importance of each individual variable is tested by a t test or an F test, as for SLR ◉ The significance of an explanatory variable depends on which other variables are included in the regression model ◉ A confidence interval gives further information
  • 37. Multiple regression models ◉ Multiple linear regression • predictors all continuous and linearly related to the dependent variable ◉ Analysis of covariance (ANCOVA) • both continuous and categorical predictors ◉ Analysis of variance (e.g. two-way ANOVA) • predictors all categorical ◉ Polynomial regression • quadratic or higher-order terms included
  • 38. Categorical predictors ◉ The association between a continuous DV Y and a categorical IV X is assessed by comparing the mean Y values in each category of X ◉ A reference category is chosen against which to compare the other categories ◉ The regression coefficient for a comparison represents the difference in the mean of Y for the given category vs the reference category
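Reference-category coding can be made concrete with dummy (indicator) variables. A Python sketch with an invented 3-level smoking-status factor ("never" as the reference); the fitted coefficients recover the group means relative to the reference:

```python
import numpy as np

status = np.array(["never", "current", "former", "never", "current", "former"])
y = np.array([3.0, 2.0, 2.5, 3.2, 1.8, 2.7])

levels = ["never", "current", "former"]   # first level = reference category
dummies = np.stack([(status == lev).astype(float) for lev in levels[1:]], axis=1)

# Design matrix: intercept + one dummy per non-reference category
X = np.column_stack([np.ones(len(y)), dummies])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# coef[0] = mean of the reference group;
# coef[1], coef[2] = each group's mean minus the reference mean
print(coef)
```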
  • 39. Assessing the fit of the model ◉ R2 measures the usefulness or predictive value of the model ◉ R2 is interpreted as the proportion of the total variability explained by the model ◉ R2 increases in value as each additional variable is added to the model ◉ Adjusted R2 (the preferred measure) takes into account the number of explanatory variables included in the model ◉ E.g. R2 = 0.482, adjusted R2 = 0.462
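Adjusted R2 applies the penalty 1 − (1 − R2)(n − 1)/(n − p − 1), where n is the sample size and p the number of predictors. A Python sketch (n = 100 and p = 4 are invented; they are not the values behind the slide's 0.482 / 0.462 example):

```python
def adjusted_r2(r2, n, p):
    # Penalizes the raw R^2 for the number of fitted predictors p
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.482, 100, 4))   # slightly below the raw 0.482
```

Unlike R2, this quantity can fall when a useless predictor is added, which is why it is preferred for comparing models of different sizes.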
  • 40. ◉ Also assess fit by inspection of the standardized residuals • if the model fits well, these follow a Normal distribution • values > 3 or < −3 are large ◉ A large residual means the model does not fit well for that subject ◉ Some large residuals will occur by chance; many large residuals are of concern
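The screening rule above (standardized residuals beyond ±3 are large) can be sketched as follows. A crude Python version: it divides by the overall residual standard deviation rather than computing leverage-adjusted standardized residuals, and the data, including the injected outlier, are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, 50)
y[10] += 8.0                               # inject one gross outlier

# Fit the least-squares line
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

resid = y - (a + b * x)
std_resid = resid / resid.std(ddof=2)      # ddof=2: two estimated parameters

outliers = np.where(np.abs(std_resid) > 3)[0]
print(outliers)                            # indices flagged by the +/-3 rule
```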
  • 41. Assumptions of multiple regression ◉ The observations must be independent ◉ The relation between each continuous X and the dependent variable should be linear ◉ The values of the dependent variable Y should have a Normal distribution ◉ The variability of Y should be the same for any set of values of the explanatory variables (homoscedasticity)
  • 42. How to assess the assumptions ◉ Assess the Normality of Y (or of the standardized residuals) ◉ Obtain scatterplots of Y (or of the standardized residuals) against each continuous X, primarily to assess linearity ◉ Obtain • Levene’s test for Y (or the standardized residuals), if categorical predictors are included in the model, to assess equal variance • a plot of the standardized residuals against each X, if continuous predictors are included in the model, primarily to assess constant variation
  • 43. Example – assess the assumptions ◉ DV: FEV1 ◉ Explanatory variables: • height (in cm) • gender (binary) • smoking status (3 categories) ◉ Normality of FEV1 (5 measures): • skewness = −0.11 • kurtosis − 3 = −0.80 • Assumed
  • 44. Linearity ◉ Linearity: FEV1 vs height (scatterplot) • Assumed
  • 45. Constant variation ◉ Standardized residuals vs height (scatterplot): no clear pattern • Assumed
  • 46. Equality of variances • Levene’s test (robust): p = 0.937 > 0.05 • Assumed. Conclusion: all the assumptions are met. Note: the test could also be done using the standardized residuals
  • 47. R Code for MLR

# loading the data
data("mtcars")
# viewing the data
mtcars
head(mtcars)
names(mtcars)
# attach is used so that we need not reference the data frame every time
attach(mtcars)

# checking the relationship between the variables
plot(mpg, cyl)
plot(mpg, disp)
plot(mpg, hp)
plot(mpg, drat)
plot(mpg, wt)
plot(mpg, qsec)
plot(mpg, vs)
plot(mpg, am)
plot(mpg, gear)
plot(mpg, carb)

# creating the multiple linear model
model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb)
model
# checking the summary of the model
summary(model)

# various parameters to check the fitness of the model
# root mean square error (residual degrees of freedom = 21)
sqrt(sum((model$residuals)^2) / 21)
summary(model)

# hypothesis testing: t-test
# the test statistic is the point estimate of a coefficient divided by its standard error
# example 1
tstat <- coef(summary(model))[3, 1] / coef(summary(model))[3, 2]
tstat
2 * pt(tstat, 21, lower.tail = FALSE)
# example 2
tstat2 <- coef(summary(model))[1, 1] / coef(summary(model))[1, 2]
tstat2
2 * pt(tstat2, 21, lower.tail = FALSE)
summary(model)
  • 48.

# F-test
summary(model)

# coefficient confidence intervals
confint(model, level = .95)

# testing the various assumptions of the model
# 1. checking whether the residuals are normally distributed
resid <- model$residuals
hist(resid)       # histogram
qqnorm(resid)     # quantile plot
qqline(resid)

# 2. checking homoscedasticity
plot(model$residuals ~ disp)
abline(0, 0)

# residual analysis
plot(model)

# transformations
model1 = lm(mpg ~ cyl + log(disp) + log(hp) + drat + wt + qsec + vs + am + gear + carb)
summary(model1)
plot(model1)

# reducing the model
library(MASS)
stepAIC(model)    # running the AIC on the initial model
stepAIC(model1)   # running the AIC on the transformed model

# constructing new models with reduced variables
model2 <- lm(mpg ~ qsec + wt + am)
summary(model2)
model6 <- lm(mpg ~ log(disp) + gear + carb)
summary(model6)

# partial F-test
nestmodel = lm(mpg ~ wt + qsec + am)
anova(model, nestmodel)

# multicollinearity
plot(mtcars)
# checking the correlations
cor(qsec, wt)
cor(am, wt)
cor(am, qsec)
# variance inflation factor
install.packages("car")
library(car)
vif(model2)

# polynomial model
plot(model2$residuals ~ model2$fitted.values, xlab = "Fitted Values", ylab = "Residuals")
abline(0, 0)
quadmod = lm(mpg ~ qsec + I(qsec^2) + wt + am)
plot(quadmod$residuals ~ quadmod$fitted.values, xlab = "Fitted Values", ylab = "Residuals")
abline(0, 0)
summary(quadmod)
AIC(quadmod)

# interaction model
model3 <- lm(mpg ~ qsec + wt * am)
summary(model3)
AIC(model3)
resid3 <- model3$residuals
hist(resid3)
qqnorm(resid3)
qqline(resid3)
plot(model3$residuals ~ disp)
abline(0, 0)
plot(model3)
  • 49.

# using the model
newdata <- data.frame(wt = 2.92, qsec = 20.1, am = 1)
predy <- predict(model3, newdata, interval = "predict")
predy
confy <- predict(model3, newdata, interval = "confidence")
confy
confy %*% c(0, -1, 1)   # width of the confidence interval
predy %*% c(0, -1, 1)   # width of the (wider) prediction interval
confy[1] == predy[1]    # both intervals share the same point estimate

# sample prediction
mtcars[20, ]
pred <- coef(summary(model3))[1, 1] +
        coef(summary(model3))[2, 1] * 19.9 +
        coef(summary(model3))[3, 1] * 1.835 +
        coef(summary(model3))[4, 1] * 1 +
        coef(summary(model3))[5, 1] * 1.835 * 1
pred
33.9 - 31.0523   # observed mpg minus the prediction