REGRESSION ANALYSIS
Presented by –
Alichy Sowmya
Parth Prajapati
Vikrant Ratnakar
Department of Pharmacoinformatics , NIPER S.A.S. Nagar
What is regression?
◉ Linear regression is a supervised modeling technique for continuous data that generates a
response from a set of input features.
◉ It is used to explain the linear relationship between a single variable Y, called the response
(output or dependent variable), and one or more predictor (input, independent or explanatory)
variables.
◉ It is a simple regression problem if only a single variable X is considered; if more than one
predictor is used in the model, it is a multiple regression problem.
◉ Statistical Modelling is the process of obtaining a
statistical model which adequately describes the
relationships between the variables involved
◉ The model: takes the form of a prediction equation - the
values of a dependent variable (DV) are predicted by a
set of independent variables (IV)
◉ Simplest model: simple linear regression
Simple Linear Regression
◉ SLR is investigating the linear relation between two variables
Y (DV) and X (IV or explanatory variable)
◉ “Linear”: used because the population mean of Y is
represented as a linear or straight-line function of X
◉ “Simple”: refers to the fact that there is only one independent variable
◉ Examples:
• air quality and lung function
• medication dose and outcome of blood test
Explore the Relationship Between Two Continuous Variables
◉ Step 1: Scatterplot
Shape of scatterplot gives form of relation
• linear
• quadratic
• more complex
◉ Step 2: Correlation coefficient
Strength of linear relation given by correlation coefficient
• r: ranges from –1 to +1
• –1 : perfect negative linear relationship
• +1 : perfect positive linear relationship.
• 0 : no linear relationship.
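Steps 1 and 2 can be sketched in R as follows. This is a minimal illustration on simulated data; the variable names and the simulated relationship are assumptions, not data from the slides.

```r
# Simulated illustration data (assumed, not from the slides)
set.seed(1)
x <- runif(50, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)

# Step 1: scatterplot to judge the form of the relation
plot(x, y, main = "Scatterplot of Y against X")

# Step 2: Pearson correlation coefficient, always between -1 and +1
r <- cor(x, y)
print(r)
```

A value of r near 0 only rules out a linear relation; a quadratic pattern in the scatterplot can still give r close to 0.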
◉ Step 3: Simple linear regression
$Y = \alpha + \beta X + e$
This is the population line.
• Y = dependent or response variable. Must be continuous.
• X = independent / predictor / explanatory variable or covariate
• α = population regression parameter / intercept: the point where the
line crosses the vertical axis
• β = population regression parameter / slope: the change in the
mean value of Y for each one-unit increase in the value of X
• e = model error term
= deviation between the actual value of Y and the value given by the
population line (its sample counterpart is the residual)
• e is assumed Normally distributed with mean 0 and
standard deviation σ
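To make the population model concrete, the sketch below simulates data from a straight line plus Normal error. The values chosen for the intercept, slope and error standard deviation are assumptions for illustration only.

```r
# Assumed illustration values for the population parameters
alpha <- 1.5   # intercept
beta  <- 0.8   # slope
sigma <- 2     # standard deviation of the error term

set.seed(42)
n <- 200
x <- seq(0, 20, length.out = n)       # X need not be random
e <- rnorm(n, mean = 0, sd = sigma)   # errors: Normal, mean 0, sd sigma
y <- alpha + beta * x + e             # population line plus error

# The population mean of Y at X = 10 lies on the line alpha + beta * X
mean_y_at_10 <- alpha + beta * 10
print(mean_y_at_10)
```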
Objective of SLR
◉ Objective: to predict or estimate the value of DV Y
corresponding to a given value of IV X, through the estimated
regression line
◉ Sample: the observed values are Xi and Yi, i = 1, 2, …, n
◉ Build up: an estimated regression line using the sample. The
regression line from the observed data is an estimate of the
relationship between X and Y in the population
Estimated Regression Equation
$\hat{Y} = a + bX$
◉ a = regression coefficient
= the estimate of the parameter α
= the intercept of the estimated regression line
= the value of Y where the line crosses the Y axis
◉ b = regression coefficient
= the estimate of the parameter β
= the slope of the estimated regression line
= the change in the mean value of Y for each one-unit increase in the value of X
◉ For any subject i, i = 1, 2, 3, …, n
◉ The original observed values are Xi and Yi
◉ For any given Xi, the Y value given by the line is called the
predicted value and is denoted by $\hat{Y}_i = a + bX_i$
◉ The residual ei is the difference between the observed value
and the predicted value: $e_i = Y_i - \hat{Y}_i$
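The estimated intercept, slope, predicted values and residuals can be read off a fitted model in R. The sketch below uses the built-in `cars` dataset purely as an illustration; it is not the data behind the slides.

```r
# Fit the estimated regression line on the built-in cars data (illustrative choice)
fit <- lm(dist ~ speed, data = cars)

a <- coef(fit)[["(Intercept)"]]   # estimate of alpha
b <- coef(fit)[["speed"]]         # estimate of beta

y_hat <- fitted(fit)              # predicted values from the line
e     <- resid(fit)               # residuals: observed minus predicted

# Check the definitions against the fitted object
print(all.equal(unname(y_hat), a + b * cars$speed))
print(all.equal(unname(e), cars$dist - unname(y_hat)))
```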
Least Squares Estimation
◉ Least squares estimation is the method of estimating the equation /
fitting the model to the data in an optimal way
◉ The sum of squares of the vertical distances of the observations from
the line is minimized
◉ Least squares estimation minimizes $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - a - bX_i)^2$
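For simple linear regression the least squares estimates have a closed form: the slope is the sum of cross-products divided by the sum of squares of X, and the intercept makes the line pass through the point of means. The sketch below checks this against `lm()` on the built-in `cars` data (an illustrative choice; the helper `sse` is a hypothetical name).

```r
x <- cars$speed
y <- cars$dist

# Closed-form least squares estimates
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

# Sum of squared vertical distances for any candidate line
sse <- function(a, b) sum((y - a - b * x)^2)

fit <- lm(y ~ x)
print(c(a = a, b = b))
print(coef(fit))

# Moving away from (a, b) can only increase the sum of squares
print(sse(a, b) <= sse(a + 0.1, b))
print(sse(a, b) <= sse(a, b + 0.1))
```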
Is X a significant predictor of Y?
◉ The association between X and Y is given by the
regression coefficient for the slope
◉ A zero slope means X has no “impact” on Y,
whereas a slope far from zero indicates large changes in Y
when X changes
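Whether the slope differs from zero is usually judged with a t-test on the slope estimate. The sketch below reproduces the test by hand on the built-in `cars` data (an illustrative dataset, not the one in the slides) and compares it with the value reported by `summary()`.

```r
fit   <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(fit))

# t statistic: slope estimate divided by its standard error
t_slope <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# Two-sided p-value on the residual degrees of freedom (n - 2 for SLR)
df      <- df.residual(fit)
p_slope <- 2 * pt(abs(t_slope), df, lower.tail = FALSE)

print(c(t = t_slope, df = df, p = p_slope))
```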
Coefficient of Determination
◉ Denoted by R²
◉ Measures the goodness of fit of the model
◉ Assesses the usefulness or predictive value of the model
◉ Is interpreted as the proportion of variability in the observed
values of Y explained by the regression of Y on X
◉ E.g. R² = 71.9%: almost 72% of the variation in lung function
(FEV) is explained by the regression of FEV on height
◉ R² = SSR/SST (e.g. 78.34/109.01 = 0.719 = 71.9%)
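The ratio SSR/SST can be computed directly from a fitted model. The sketch below does so on the built-in `cars` data (illustrative only) and also repeats the arithmetic of the FEV example quoted on the slide.

```r
fit <- lm(dist ~ speed, data = cars)
y   <- cars$dist

sst <- sum((y - mean(y))^2)             # total sum of squares
ssr <- sum((fitted(fit) - mean(y))^2)   # sum of squares explained by the regression
sse <- sum(resid(fit)^2)                # residual sum of squares

r2 <- ssr / sst
print(r2)
print(summary(fit)$r.squared)

# Arithmetic from the slide's FEV example
print(round(78.34 / 109.01, 3))
```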
R² and b
◉ The coefficient of determination R² describes how well the
regression equation summarizes the data
◉ The regression coefficient b gives the nature of the
relationship between X and Y: the degree of change in Y for
a given change in X
◉ Two data sets may have the same slope b but different R²
values, and vice versa
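The difference between b and R² shows up when two data sets share the same underlying slope but differ in scatter. The sketch below simulates such a pair; all numbers are assumed for illustration.

```r
set.seed(7)
x <- seq(1, 20, length.out = 100)

# Same underlying line (slope 2), different amounts of noise
y_tight <- 3 + 2 * x + rnorm(100, sd = 1)
y_noisy <- 3 + 2 * x + rnorm(100, sd = 15)

fit_tight <- lm(y_tight ~ x)
fit_noisy <- lm(y_noisy ~ x)

r2_tight <- summary(fit_tight)$r.squared
r2_noisy <- summary(fit_noisy)$r.squared

# Estimated slopes are both near 2, but R-squared differs sharply
print(c(slope_tight = coef(fit_tight)[["x"]], slope_noisy = coef(fit_noisy)[["x"]]))
print(c(r2_tight = r2_tight, r2_noisy = r2_noisy))
```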
Assumptions
1) The observations must be independent
2) The values of the dependent variable Y should be Normally distributed
3) The variability (variance) of Y should be the same for each value of X -
homoscedasticity or constant variation
4) If X is continuous, the relation between X and Y should be linear (linearity)
• X need not be a random variable nor have a Normal distribution
• Strictly, the assumptions need to hold for the residuals, but they can be tested on either Y or the residuals
• A transformation of Y may be required
Assumptions - Strategies for testing
◉ Normality
• Test for Y values or for standardized residuals
• using 5 measures (histogram, Normal Q-Q plot, boxplot,
skewness and kurtosis statistics)
◉ Linearity
• Assess from scatterplot of X vs.Y
◉ Constant variation
• Plot of standardized residuals vs. X
• In the plot of standardized residuals vs. X, the points should scatter
randomly (without any pattern) and evenly (the vertical spread should be
about the same across the range of X)
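The checks listed above can be sketched in base R. The helper functions `skewness` and `excess_kurtosis` are hypothetical names defined here (moment-based versions, which may differ slightly from the statistics a given package reports), and the `cars` data is used only for illustration.

```r
# Moment-based skewness and excess kurtosis (kurtosis - 3); hypothetical helpers
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}
excess_kurtosis <- function(v) {
  m <- mean(v)
  mean((v - m)^4) / mean((v - m)^2)^2 - 3
}

fit <- lm(dist ~ speed, data = cars)
z   <- rstandard(fit)   # standardized residuals

# Normality: histogram, Normal Q-Q plot, boxplot, skewness, kurtosis
hist(z, main = "Standardized residuals")
qqnorm(z)
qqline(z)
boxplot(z)
print(c(skewness = skewness(z), excess_kurtosis = excess_kurtosis(z)))

# Linearity: scatterplot of X vs. Y
plot(cars$speed, cars$dist)

# Constant variation: standardized residuals vs. X
plot(cars$speed, z)
abline(h = 0, lty = 2)
```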
An Example: FEV (Y) and height (X)
◉ Normality of FEV:
• skewness=0.867,
• kurtosis -3 =1.028
Constant variance?
• Constant variance cannot be assumed
• FEV needs a natural logarithm transformation
After transformation, reassess the assumptions for the
transformed variable: ln(FEV)
◉ Normality:
• skew=0.040
• kurtosis - 3= -0.433
Constant variation
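The effect of a natural logarithm transformation on a right-skewed variable can be sketched with simulated data. The FEV data from the slides is not available here, so the log-normal sample below is an assumed stand-in, and `skewness` is a hypothetical helper.

```r
# Moment-based skewness; hypothetical helper
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Simulated right-skewed response (assumed stand-in for a variable like FEV)
set.seed(123)
y <- rlnorm(500, meanlog = 1, sdlog = 0.5)

skew_raw <- skewness(y)
skew_log <- skewness(log(y))

# The log-transformed values are much closer to symmetric
print(c(raw = skew_raw, log = skew_log))
```

After transforming, the model is fitted to the transformed variable and all assumptions are checked again on that scale.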
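R Code for SLR
# Load the data (SLR.csv is assumed to hold three numeric columns, including Spend and Sales)
dataset <- read.csv("SLR.csv", header = TRUE,
                    colClasses = c("numeric", "numeric", "numeric"))
# Simple regression (the object name simple.fit is assumed; it was lost from the slide)
simple.fit <- lm(Sales ~ Spend, data = dataset)
summary(simple.fit)
# Loading the necessary libraries
library(lmtest)   # dwtest
library(fBasics)  # jarqueberaTest
# Testing normal distribution and independence assumptions
jarqueberaTest(simple.fit$residuals)  # Test residuals for normality
# Null hypothesis: skewness and excess kurtosis are equal to zero
dwtest(simple.fit)  # Test for independence of residuals
# Null hypothesis: errors are serially uncorrelated
# Simple regression residual plots
# Spend x Residuals plot
plot(dataset$Spend, simple.fit$residuals,
     main = "Spend x Residuals\nfor Simple Regression",
     xlab = "Marketing Spend", ylab = "Residuals")
abline(h = 0, lty = 2)
# Histogram of residuals
hist(simple.fit$residuals, main = "Histogram of Residuals",
     xlab = "Residuals")
# Q-Q plot
qqnorm(simple.fit$residuals)
qqline(simple.fit$residuals)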
Multiple Regression
◉ Simple linear regression describes the linear relationship
between a dependent variable Y and a single explanatory
variable X
◉ Multiple regression is an extension to the case of one
dependent variable and two or more explanatory variables
Reasons for performing Multiple Regression
◉ Predictions on the basis of a number of variables will usually be better
than those based on only one explanatory variable
◉ When testing the effect of a primary variable of interest e.g.
treatment effect / exposure, one needs to account for all other
extraneous influences
• The need to ‘control’ or ‘adjust’ for the possible effects of
‘nuisance’ explanatory variables (known as confounders)
◉ The relationships may be complex e.g. variables may have
combined or synergistic effects on the dependent variable
Reasons for performing Multiple Regression
◉ It is almost always better to perform one comprehensive
analysis including all the relevant variables than a series of two-
way comparisons
• Reduce chances of increasing Type I error rate beyond 5%
• In multiple regression a linear model is fitted for the dependent
variable, which is expressed as a linear combination of the
independent variables
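The "linear combination" can be made explicit by rebuilding the fitted values from the coefficients. The sketch below uses the built-in `mtcars` data with `wt` and `hp` as predictors; the choice of predictors is an assumption for illustration.

```r
# Multiple regression of mpg on two explanatory variables
fit <- lm(mpg ~ wt + hp, data = mtcars)
b   <- coef(fit)

# Fitted value = intercept + b1 * wt + b2 * hp
manual <- b[["(Intercept)"]] + b[["wt"]] * mtcars$wt + b[["hp"]] * mtcars$hp

print(all.equal(manual, unname(fitted(fit))))
```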
Importance of Predictors
◉ The regression coefficient bi represents the effect of that
independent variable on DV Y, after controlling for all
the other variables in the model
◉ The importance of each individual variable is tested by a
t test or an F test as for SLR
◉ Significance of an explanatory variable depends on
which other variables are included in the regression model
◉ A confidence interval gives further information
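How a coefficient changes once another variable is controlled for can be seen by fitting the model with and without that variable. The sketch below uses `hp` and `wt` from the built-in `mtcars` data as an assumed example.

```r
# hp alone, then hp after controlling for wt
fit_hp    <- lm(mpg ~ hp, data = mtcars)
fit_hp_wt <- lm(mpg ~ hp + wt, data = mtcars)

b_alone    <- coef(fit_hp)[["hp"]]
b_adjusted <- coef(fit_hp_wt)[["hp"]]
print(c(hp_alone = b_alone, hp_adjusted_for_wt = b_adjusted))

# t tests for each coefficient, and 95% confidence intervals
print(coef(summary(fit_hp_wt)))
ci <- confint(fit_hp_wt, level = 0.95)
print(ci)
```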
Multiple regression models
◉ Multiple linear regression
• predictors all continuous and linearly related to the
dependent variable
◉ Analysis of covariance (ANCOVA)
• both continuous and categorical predictors
◉ Analysis of variance (e.g. two-way ANOVA)
• predictors all categorical
◉ Polynomial regression
• quadratic or higher order terms included
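In R all four model types are fitted with `lm()`; only the formula and the type of each predictor change. The sketch below is one way to write them on the built-in `mtcars` data, with `am` and `cyl` treated as categorical (an assumed choice, as are the object names).

```r
# Treat am and cyl as categorical predictors
cars_df <- transform(mtcars, am = factor(am), cyl = factor(cyl))

mlr    <- lm(mpg ~ wt + hp, data = cars_df)        # multiple linear regression: continuous predictors
ancova <- lm(mpg ~ wt + am, data = cars_df)        # ANCOVA: continuous and categorical predictors
anova2 <- lm(mpg ~ am + cyl, data = cars_df)       # two-way ANOVA: categorical predictors only
poly2  <- lm(mpg ~ wt + I(wt^2), data = cars_df)   # polynomial regression: quadratic term

print(names(coef(anova2)))
```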
Categorical predictors
◉ Association between a continuous DV Y and a categorical IV
X is assessed by comparing the mean Y values in each
category of X
◉ A reference category is chosen to compare the other
category/ies with
◉ The regression coefficient for a comparison represents the
difference in the mean for Y for the given category vs the
reference category
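With a single categorical predictor, each regression coefficient equals the difference between that category's mean and the reference category's mean. The sketch below checks this on the built-in `mtcars` data, taking 4 cylinders as the reference category (an assumed choice).

```r
df     <- transform(mtcars, cyl = factor(cyl))
df$cyl <- relevel(df$cyl, ref = "4")   # choose the reference category

fit   <- lm(mpg ~ cyl, data = df)
means <- tapply(df$mpg, df$cyl, mean)  # mean mpg in each category

# Coefficient for 6 cylinders vs. difference in means (6 vs. reference 4)
print(coef(fit)[["cyl6"]])
print(means[["6"]] - means[["4"]])
```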
Assessing the fit of the model
◉ R² measures the usefulness or predictive value of the model
◉ R² is interpreted as the proportion of the total variability
explained by the model
◉ R² never decreases as additional variables are added to
the model
◉ Adjusted R² (preferred measure) takes into account the
number of explanatory variables included in the model
◉ E.g. R² = 0.482, adjusted R² = 0.462
◉ Also assess fit by inspection of standardized residuals
• If the model fits, these follow approximately a standard Normal distribution
• Values > 3 or < -3 are large
◉ Large residual: the model does not fit well for that subject
◉ Some large residuals will occur by chance; many large
residuals are of concern
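Adjusted R² and the standardized residuals are both available from a fitted model. The sketch below uses an assumed model on the built-in `mtcars` data and recomputes adjusted R² from its usual formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of explanatory variables.

```r
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
s   <- summary(fit)

n <- nrow(mtcars)
p <- length(coef(fit)) - 1   # number of explanatory variables

adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
print(c(r2 = s$r.squared, adj_r2 = s$adj.r.squared, adj_manual = adj_manual))

# Standardized residuals; values beyond +/- 3 are flagged as large
z     <- rstandard(fit)
large <- z[abs(z) > 3]
print(large)
```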
Assumptions of multiple regression
◉ The observations must be independent
◉ The relation between each continuous X and the dependent
variable should be linear
◉ The values of the dependent variable Y should have a Normal distribution
◉ The variability of Y should be the same for any set of values of the
explanatory variables – homoscedasticity
How to assess assumptions
◉ Assess the Normality of Y (or the standardized residuals)
◉ Obtain scatterplots of Y (or the standardized residuals)
against each continuous X, primarily to assess linearity
◉ Obtain
• Levene’s test for Y (or the standardized residuals) (if
categorical predictors are included in the model) to assess
equal variance
• a plot of the standardized residuals against each X (if
continuous predictors are included in the model), primarily to
assess constant variation
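Levene's test amounts to a one-way ANOVA on the absolute deviations of Y from each group's centre. The sketch below implements the median-centred variant in base R; the function name `levene_median` is hypothetical, the grouping by `cyl` in `mtcars` is an assumed example, and the slides do not say which software or variant was used.

```r
# Median-centred Levene-type test: ANOVA on absolute deviations from group medians
levene_median <- function(y, group) {
  group <- factor(group)
  dev   <- abs(y - ave(y, group, FUN = median))
  anova(lm(dev ~ group))
}

res     <- levene_median(mtcars$mpg, mtcars$cyl)
p_value <- res[["Pr(>F)"]][1]
print(res)

# A p-value above 0.05 gives no evidence against equal variances
print(p_value)
```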
Example - assess assumptions
◉ DV: FEV1
◉ Explanatory variables:
• Height (in cm)
• Gender (binary)
• Smoking status (3 categories)
◉ Normality of FEV1 (5 measures)
• skewness= -0.11,
• kurtosis -3 = -0.80
• Assumed
◉ Linearity: FEV1 vs height (scatterplot)
• Assumed
◉ Constant variation:
• standardized residuals vs height (scatterplot): no clear pattern
• Assumed
◉ Equality of variances:
• Levene’s test (robust): p = 0.937 > 0.05
• Assumed
Conclusion: All the assumptions are met
Note: the test could be done using standardized residuals
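R Code for MLR
# Loading the data (assumed to be the built-in mtcars dataset, matching the variable names below)
data(mtcars)
# Viewing the data
head(mtcars)
# attach() is used so that the columns can be referenced without the mtcars$ prefix
attach(mtcars)
# Checking the relationship between the variables
pairs(mtcars)
# Creating a simple linear regression (wt as the single predictor is an assumed choice)
slr <- lm(mpg ~ wt)
# Creating the multiple linear model
model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb)
# Checking the summary of the model
summary(model)
# Various parameters to check the fit of the model
# Mean square error (residual sum of squares divided by the residual degrees of freedom)
sum(model$residuals^2) / model$df.residual
# Hypothesis testing: t-test
# The test statistic is the point estimate of a coefficient divided by its standard error
# Example 1: row 3 of the coefficient table (disp); 21 = 32 observations - 11 coefficients
tstat <- coef(summary(model))[3, 1] / coef(summary(model))[3, 2]
2 * pt(abs(tstat), 21, lower.tail = FALSE)  # abs() keeps the two-sided p-value valid when tstat < 0
# Example 2: row 1 of the coefficient table (intercept)
tstat2 <- coef(summary(model))[1, 1] / coef(summary(model))[1, 2]
2 * pt(abs(tstat2), 21, lower.tail = FALSE)
# Coefficient confidence intervals
confint(model, level = 0.95)
# Testing the various assumptions of the model
# 1. Checking whether the residuals are normally distributed
resid <- model$residuals
hist(resid)
# Quantile plot
qqnorm(resid)
qqline(resid)
# 2. Checking homoscedasticity
plot(model$residuals ~ disp)
# Residual analysis: refit with log-transformed disp and hp
model1 <- lm(mpg ~ cyl + log(disp) + log(hp) + drat + wt + qsec + vs + am + gear + carb)
# Reducing the model
# Calling the library (MASS::stepAIC is an assumed choice; the original call is missing)
library(MASS)
# Running the AIC on the initial model
stepAIC(model)
# Running the AIC on the transformed model
stepAIC(model1)
# Constructing new models with reduced variables (model2 is an assumed definition)
model2 <- lm(mpg ~ wt + qsec + am)
# Partial F-test: reduced model against the full model
nestmodel <- lm(mpg ~ wt + qsec + am)
anova(nestmodel, model)
# Checking the correlation
cor(qsec, wt)
cor(am, wt)
cor(am, qsec)
# Variance inflation factor (car::vif is an assumed choice; requires the car package)
library(car)
vif(model2)
# Polynomial model
plot(model2$residuals ~ model2$fitted.values, xlab = "Fitted Values", ylab = "Residuals")
quadmod <- lm(mpg ~ qsec + I(qsec^2) + wt + am)
plot(quadmod$residuals ~ quadmod$fitted.values, xlab = "Fitted Values", ylab = "Residuals")
# Interaction model (model3 is an assumed definition; the wt:am interaction is illustrative)
model3 <- lm(mpg ~ wt + qsec + am + wt:am)
resid3 <- model3$residuals
plot(model3$residuals ~ disp)
# Using the model
newdata <- data.frame(wt = 2.92, qsec = 20.1, am = 1)
predy <- predict(model3, newdata, interval = "prediction")
confy <- predict(model3, newdata, interval = "confidence")
# Interval widths (upper limit minus lower limit)
confy %*% c(0, -1, 1)
predy %*% c(0, -1, 1)
# Both intervals share the same point prediction
confy[1] == predy[1]
# Sample prediction: compare with an observed car
mtcars[20, ]
detach(mtcars)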

Regression analysis in R

  • 1. REGRESSION ANALYSISPresented by – Alichy Sowmya Parth Prajapati Vikrant Ratnakar Department of Pharmacoinformatics , NIPER S.A.S. Nagar
  • 3. “◉ Linear Regression is a supervised modeling technique for continuous data that generates a response based on the set of input features. ◉ It is used for explaining the linear relationship between a single variable Y, called the response (output or dependent variable), and one or more predictor (input, independent or explanatory variables). ◉ It’s a simple regression problem if only a single variable X is considered, otherwise it takes the form of a multiple regression problem, that is if more than one predictor is used in the model. 3
  • 4. ◉ Statistical modelling is the process of obtaining a statistical model which adequately describes the relationships between the variables involved
    ◉ The model takes the form of a prediction equation: the values of a dependent variable (DV) are predicted by a set of independent variables (IV)
    ◉ Simplest model: simple linear regression
  • 6. Simple Linear Regression
    ◉ SLR investigates the linear relation between two variables, Y (DV) and X (IV or explanatory variable)
    ◉ “Linear”: used because the population mean of Y is represented as a linear or straight-line function of X
    ◉ “Simple”: refers to the fact that there is only one independent variable
    ◉ Examples:
      • air quality and lung function
      • medication dose and outcome of blood test
  • 7. Explore the Relationship Between Two Continuous Variables
    ◉ Step 1: Scatterplot. The shape of the scatterplot gives the form of the relation
      • linear
      • quadratic
      • more complex
    ◉ Step 2: Correlation coefficient. The strength of the linear relation is given by the correlation coefficient
      • r: ranges from –1 to +1
      • –1: perfect negative linear relationship
      • +1: perfect positive linear relationship
      • 0: no linear relationship
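Steps 1 and 2 can be sketched in R as follows. This is only an illustration: it assumes the built-in `mtcars` data (weight against fuel economy) as a stand-in, since the deck's own data for this slide is not available.

```r
# Illustrative data: the built-in mtcars set (an assumption, not the slide's data).
x <- mtcars$wt   # car weight (1000 lbs)
y <- mtcars$mpg  # fuel economy (miles per gallon)

# Step 1: scatterplot to judge the form of the relation (linear, quadratic, ...).
plot(x, y, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Scatterplot of mpg against weight")

# Step 2: Pearson correlation coefficient r, which always lies in [-1, +1].
r <- cor(x, y)
print(r)

# cor.test() adds a test of the null hypothesis of no linear relationship.
r_test <- cor.test(x, y)
print(r_test$p.value)
```

A value of r near zero only rules out a linear relation, so the scatterplot should still be inspected for curved patterns.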
  • 9. ◉ Step 3: Simple linear regression
    The population line is Y = α + βX + e, where:
      • Y = dependent or response variable; must be continuous
      • X = independent / predictor / explanatory variable or covariate
      • α = population regression parameter (intercept): the point where the line crosses the vertical axis
      • β = population regression parameter (slope): the change in the mean value of Y for each one-unit increase in the value of X
      • e = model error term (residual): the deviation between the actual value of Y and the value predicted by the line; assumed Normally distributed with mean 0 and standard deviation σ
  • 11. Objective of SLR
    ◉ Objective: to predict or estimate the value of the DV Y corresponding to a given value of the IV X, through the estimated regression line
    ◉ Sample: the observed values are Xi and Yi, i = 1, 2, …, n
    ◉ Build up: an estimated regression line using the sample. The regression line from the observed data is an estimate of the relationship between X and Y in the population
  • 12. Estimated Regression Equation
    The estimated regression line is Ŷ = a + bX, where:
    ◉ a = regression coefficient = the estimate of the parameter α = the intercept of the estimated regression line = the value of Y where the line crosses the Y axis
    ◉ b = regression coefficient = the estimate of the parameter β = the slope of the estimated regression line = the change in the mean value of Y for each one-unit increase in the value of X
  • 13. Residual
    ◉ For any subject i, i = 1, 2, 3, …, n
    ◉ The original observed values are Xi and Yi
    ◉ For any given Xi, the Y value given by the line is called the predicted value and is denoted by Ŷi
    ◉ The residual ei is the difference between the observed value and the predicted value: ei = Yi − Ŷi
  • 14. Least Squares Estimation
    ◉ Least squares estimation is the method of estimating the equation, that is, fitting the model to the data in an optimal way
    ◉ The sum of squares of the vertical distances of the observations from the line is minimized
    ◉ Least squares estimation minimizes Σ ei² = Σ (Yi − a − bXi)²
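As a minimal sketch of what least squares does for one predictor, the closed-form estimates can be computed by hand and compared with `lm()`. The `mtcars` data and the variable names here are illustrative assumptions, not taken from the slides.

```r
# Illustrative data (assumption): mpg as Y, wt as X from the built-in mtcars set.
x <- mtcars$wt
y <- mtcars$mpg

x_bar <- mean(x)
y_bar <- mean(y)

# Closed-form least squares estimates for simple linear regression:
# slope b = Sxy / Sxx, intercept a = mean(Y) - b * mean(X).
b <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
a <- y_bar - b * x_bar

# Residual sum of squares at the least squares solution.
sse <- sum((y - (a + b * x))^2)

# lm() minimizes the same criterion, so its coefficients should agree.
fit <- lm(mpg ~ wt, data = mtcars)
print(c(a = a, b = b))
print(coef(fit))
```

Any other line, for example one with a slightly different slope, gives a larger sum of squared residuals than `sse`.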
  • 16. Is X a significant predictor of Y?
    ◉ The association between X and Y is given by the regression coefficient for the slope
    ◉ A zero slope means X has no “impact” on Y
    ◉ A slope that is large in absolute value indicates large changes in Y when X changes
  • 17. Coefficient of Determination
    ◉ Denoted by R²
    ◉ Measures the goodness of fit of the model
    ◉ Assesses the usefulness or predictive value of the model
    ◉ Is interpreted as the proportion of variability in the observed values of Y explained by the regression of Y on X
    ◉ E.g. R² = 71.9%: almost 72% of the variation in lung function (FEV) is explained by the regression of FEV on height
    ◉ R² = SSR/SST (e.g. 78.34/109.01 = 0.719 = 71.9%)
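The ratio SSR/SST can be reproduced by hand in R. The sketch below assumes the built-in `mtcars` data as a stand-in for the FEV example, so its numbers will differ from those on the slide.

```r
# Illustrative model (assumption): mpg regressed on wt from mtcars.
fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg

sst <- sum((y - mean(y))^2)            # total sum of squares
ssr <- sum((fitted(fit) - mean(y))^2)  # sum of squares explained by the regression
sse <- sum(residuals(fit)^2)           # residual (error) sum of squares

# Coefficient of determination as the explained share of total variability.
r2 <- ssr / sst
print(r2)

# The same value is reported by summary(); SST also splits into SSR + SSE.
print(summary(fit)$r.squared)
print(all.equal(sst, ssr + sse))
```

With a single predictor, R² also equals the square of the correlation coefficient r between X and Y.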
  • 18. R² and b
    ◉ The coefficient of determination R² describes how well the regression equation summarizes the data
    ◉ The regression coefficient b gives the nature of the relationship between X and Y: the degree of change in Y for a given change in X
    ◉ Two data sets may have the same slope b but different R² values, and vice versa
  • 21. Assumptions
    1) The observations must be independent
    2) The values of the dependent variable Y should be Normally distributed (normality)
    3) The variability (variance) of Y should be the same for each value of X (homoscedasticity, or constant variation)
    4) If X is continuous, the relation between X and Y should be linear (linearity)
    Note
      • X need not be a random variable nor have a Normal distribution
      • Strictly, the assumptions need to hold for the residuals; in practice they are checked either on Y or on the residuals
      • A transformation of Y may be required
  • 22. Assumptions: Strategies for testing
    ◉ Normality
      • Test for Y values or for standardized residuals
      • Use 5 measures (histogram, Normal Q-Q plot, boxplot, skewness and kurtosis statistics)
    ◉ Linearity
      • Assess from a scatterplot of X vs. Y
    ◉ Constant variation
      • Plot standardized residuals vs. X
      • In this plot the points should scatter randomly (without any pattern) and evenly (the vertical spread the same across X)
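The graphical checks listed above can be sketched with base R. This assumes a simple model on the built-in `mtcars` data purely for illustration; `rstandard()` returns the standardized residuals of a fitted `lm` object.

```r
# Illustrative model (assumption): mpg regressed on wt from mtcars.
fit <- lm(mpg ~ wt, data = mtcars)
z <- rstandard(fit)  # standardized residuals

op <- par(mfrow = c(2, 2))  # 2 x 2 grid of diagnostic plots

# Normality: histogram, Normal Q-Q plot and boxplot of standardized residuals.
hist(z, main = "Histogram of standardized residuals",
     xlab = "Standardized residual")
qqnorm(z)
qqline(z)
boxplot(z, main = "Boxplot of standardized residuals")

# Constant variation: standardized residuals against X should show
# no pattern and a similar vertical spread across the range of X.
plot(mtcars$wt, z, xlab = "Weight (1000 lbs)",
     ylab = "Standardized residual",
     main = "Standardized residuals vs. X")
abline(h = 0, lty = 2)

par(op)  # restore the previous plotting layout
```

A funnel shape in the last plot (spread growing or shrinking with X) would suggest that constant variation does not hold.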
  • 26. An Example: FEV (Y) and height (X)
    ◉ Normality of FEV:
      • skewness = 0.867
      • kurtosis − 3 = 1.028
  • 28. Constant variance?
      • Constant variance cannot be assumed
      • FEV needs a natural logarithm transformation
  • 29. After transformation, reassess the assumptions for the transformed variable: ln(FEV)
    ◉ Normality:
      • skewness = 0.040
      • kurtosis − 3 = −0.433
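The skewness and excess kurtosis (kurtosis − 3) statistics used in this example can be computed before and after a log transformation. The sketch below makes two assumptions: it uses simple moment-based estimators (statistical packages often apply small-sample corrections, so their values can differ slightly), and it uses `mtcars$hp` as a hypothetical positive-valued stand-in because the FEV data is not available.

```r
# Moment-based skewness: third central moment over variance^(3/2).
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over variance^2, minus 3.
excess_kurtosis <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2 - 3
}

# Hypothetical stand-in for FEV (assumption): horsepower from mtcars.
y <- mtcars$hp

before <- c(skewness = skewness(y), excess_kurtosis = excess_kurtosis(y))
after  <- c(skewness = skewness(log(y)), excess_kurtosis = excess_kurtosis(log(y)))

# Values near 0 on both measures are consistent with a Normal distribution.
print(rbind(before, after))
```

After a transformation, the other assumptions (linearity and constant variation) should be reassessed on the transformed variable as well.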
  • 31. R Code for SLR
      dataset <- read.csv("SLR.csv", header = TRUE,
                          colClasses = c("numeric", "numeric", "numeric"))
      head(dataset, 5)

      # ///// Simple Regression /////
      # The model object's name did not survive on the slide; slr_fit is a placeholder.
      slr_fit <- lm(Sales ~ Spend, data = dataset)
      summary(slr_fit)

      # Loading the necessary libraries
      library(lmtest)   # dwtest
      library(fBasics)  # jarqueberaTest

      # Testing the normal distribution and independence assumptions
      jarqueberaTest(residuals(slr_fit))  # Test residuals for normality
      # Null hypothesis: skewness and excess kurtosis are equal to zero
      dwtest(slr_fit)                     # Test for independence of residuals
      # Null hypothesis: errors are serially uncorrelated

      # Simple regression residual plots
      layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))
      # Spend x Residuals plot (residuals and Spend are kept in the same row order)
      plot(residuals(slr_fit) ~ dataset$Spend,
           main = "Spend x Residuals\nfor Simple Regression",
           xlab = "Marketing Spend", ylab = "Residuals")
      abline(h = 0, lty = 2)
      # Histogram of residuals
      hist(residuals(slr_fit), main = "Histogram of Residuals", xlab = "Residuals")
      # Q-Q plot
      qqnorm(residuals(slr_fit))
      qqline(residuals(slr_fit))
  • 33. Multiple Regression
    ◉ Simple linear regression describes the linear relationship between a dependent variable Y and a single explanatory variable X
    ◉ Multiple regression is an extension to the case of one dependent variable and two or more explanatory variables
  • 34. Reasons for performing multiple regression
    ◉ Predictions on the basis of a number of variables will be better than those based on only one explanatory variable
    ◉ When testing the effect of a primary variable of interest (e.g. treatment effect or exposure), one needs to account for all other extraneous influences
      • The need to ‘control’ or ‘adjust’ for the possible effects of ‘nuisance’ explanatory variables (known as confounders)
    ◉ The relationships may be complex, e.g. variables may have combined or synergistic effects on the dependent variable
  • 35. Reasons for performing multiple regression (continued)
    ◉ It is almost always better to perform one comprehensive analysis including all the relevant variables than a series of two-way comparisons
      • This reduces the chance of increasing the Type I error rate beyond 5%
    ◉ In multiple regression a linear model is fitted for the dependent variable, which is expressed as a linear combination of the independent variables
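Written out, the linear combination mentioned above takes the following form. The notation is an assumption that extends the simple regression slides: $a$ is the estimated intercept, $b_1, \dots, b_k$ are the estimated regression coefficients, and $k$ is the number of independent variables.

```latex
\hat{Y} = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k
```

With $k = 1$ this reduces to the simple linear regression line $\hat{Y} = a + bX$.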
  • 36. Importance of Predictors
    ◉ The regression coefficient bi represents the effect of that independent variable on the DV Y, after controlling for all the other variables in the model
    ◉ The importance of each individual variable is tested by a t test or an F test, as for SLR
    ◉ The significance of an explanatory variable depends on which other variables are included in the regression model
    ◉ A confidence interval gives further information
  • 37. Multiple regression models
    ◉ Multiple linear regression
      • predictors all continuous and linearly related to the dependent variable
    ◉ Analysis of covariance (ANCOVA)
      • both continuous and categorical predictors
    ◉ Analysis of variance (e.g. two-way ANOVA)
      • predictors all categorical
    ◉ Polynomial regression
      • quadratic or higher order terms included
  • 38. Categorical predictors
    ◉ The association between a continuous DV Y and a categorical IV X is assessed by comparing the mean Y values in each category of X
    ◉ A reference category is chosen, against which the other categories are compared
    ◉ The regression coefficient for a comparison represents the difference in the mean of Y for the given category vs. the reference category
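A short R sketch can confirm that, with a single categorical predictor, each coefficient is a difference in group means from the reference category. The data is an illustrative assumption: `mtcars` with the number of cylinders treated as a three-level factor, so the first level (4 cylinders) becomes the reference.

```r
# Illustrative data (assumption): cylinders as a categorical predictor of mpg.
d <- transform(mtcars, cyl = factor(cyl))  # levels "4", "6", "8"; "4" is the reference

fit_cat <- lm(mpg ~ cyl, data = d)
print(coef(fit_cat))

# Mean mpg within each category.
group_means <- tapply(d$mpg, d$cyl, mean)
print(group_means)

# Intercept = mean of the reference category;
# each other coefficient = that category's mean minus the reference mean.
by_hand <- c(group_means[["4"]],
             group_means[["6"]] - group_means[["4"]],
             group_means[["8"]] - group_means[["4"]])
print(by_hand)
```

A different reference category can be chosen with `relevel()`, which changes the comparisons reported but not the fitted values.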
  • 39. Assessing the fit of the model
    ◉ R² measures the usefulness or predictive value of the model
    ◉ R² is interpreted as the proportion of the total variability explained by the model
    ◉ R² increases (it never decreases) as each additional variable is added to the model
    ◉ Adjusted R² (the preferred measure) takes into account the number of explanatory variables included in the model
    ◉ E.g. R² = 0.482, adjusted R² = 0.462
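The difference between R² and adjusted R² can be illustrated as below. The sketch assumes the usual adjustment formula, 1 − (1 − R²)(n − 1)/(n − p − 1) with p predictors, and uses `mtcars` plus a pure-noise column named `noise`, both of which are illustrative choices rather than part of the slides.

```r
# Illustrative model (assumption): three predictors of mpg from mtcars.
fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

n  <- nrow(mtcars)          # number of observations
p  <- length(coef(fit)) - 1 # number of predictors (excluding the intercept)
r2 <- summary(fit)$r.squared

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(c(r2 = r2, adj_r2 = adj_r2))

# Adding a pure-noise predictor cannot lower R-squared,
# which is why adjusted R-squared is preferred for comparing models.
set.seed(1)
d <- transform(mtcars, noise = rnorm(nrow(mtcars)))
fit_noise <- lm(mpg ~ wt + qsec + am + noise, data = d)
print(c(r2_noise = summary(fit_noise)$r.squared,
        adj_r2_noise = summary(fit_noise)$adj.r.squared))
```

Comparing the two printed rows shows how each measure reacts to a predictor that carries no real information.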
  • 40. ◉ Also assess fit by inspection of standardized residuals
      • If these follow a Normal distribution, any values > 3 or < −3 are large
    ◉ Large residual: the model does not fit well for that subject
    ◉ Some large residuals will occur by chance; many large residuals are of concern
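Flagging large standardized residuals takes only a few lines in R. The model below is an illustrative assumption on the built-in `mtcars` data; the cut-off of 3 follows the slide.

```r
# Illustrative model (assumption): three predictors of mpg from mtcars.
fit <- lm(mpg ~ wt + qsec + am, data = mtcars)
z <- rstandard(fit)  # standardized residuals, one per observation

# Observations whose standardized residual is beyond +/- 3.
large <- abs(z) > 3
print(z[large])
print(sum(large))

# Under a Normal distribution, the share of values expected beyond +/- 3
# by chance alone is small (about 0.27%).
expected_share <- 2 * pnorm(-3)
print(expected_share)
```

If many more observations are flagged than `expected_share` would suggest for the sample size, the fit of the model deserves a closer look.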
  • 41. Assumptions of multiple regression
    ◉ The observations must be independent
    ◉ The relation between each continuous X and the dependent variable should be linear
    ◉ The values of the dependent variable Y should have a Normal distribution
    ◉ The variability of Y should be the same for any set of values of the explanatory variables (homoscedasticity)
  • 42. How to assess assumptions
    ◉ Assessing the Normality of Y (or the standardized residuals)
    ◉ Obtaining scatterplots of Y (or the standardized residuals) against each continuous X, primarily to assess linearity
    ◉ Obtaining
      • Levene’s test for Y (or the standardized residuals), if categorical predictors are included in the model, to assess equal variance
      • a plot of the standardized residuals against each X, if continuous predictors are included in the model, primarily to assess constant variation
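Levene's test can be sketched in base R without extra packages. The version below is the median-centered (Brown-Forsythe) variant: a one-way ANOVA on the absolute deviations of the standardized residuals from their group medians. The model, the grouping variable `am` and the column name `abs_dev` are illustrative assumptions on the `mtcars` data.

```r
# Illustrative data (assumption): transmission type as the categorical predictor.
d <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

fit <- lm(mpg ~ wt + am, data = d)
z <- rstandard(fit)  # standardized residuals

# Absolute deviation of each standardized residual from its group median.
d$abs_dev <- abs(z - ave(z, d$am, FUN = median))

# Levene's test (median-centered): one-way ANOVA on the absolute deviations.
levene <- anova(lm(abs_dev ~ am, data = d))
print(levene)

# A p-value above 0.05 gives no evidence against equal variances.
p_value <- levene[["Pr(>F)"]][1]
print(p_value)
```

The `car` package, which the deck loads later for `vif()`, provides `leveneTest()` for the same purpose.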
  • 43. Example: assess assumptions
    ◉ DV: FEV1
    ◉ Explanatory variables:
      • Height (in cm)
      • Gender (binary)
      • Smoking status (3 categories)
    ◉ Normality of FEV1 (5 measures)
      • skewness = −0.11
      • kurtosis − 3 = −0.80
      • Assumed
  • 44. Linearity
    ◉ Linearity: FEV1 vs. height (scatterplot)
      • Assumed
  • 45. Constant variation
    ◉ Standardized residuals vs. height (scatterplot): no clear pattern
    ◉ Assumed
  • 46. Equality of Variances
      • Levene’s test (robust): p = 0.937 > 0.05
      • Assumed
    Conclusion: all the assumptions are met
    Note: the test could also be done using standardized residuals
  • 47. R Code for MLR
      # Loading the data
      data("mtcars")
      # Viewing the data
      mtcars
      head(mtcars)
      names(mtcars)
      # attach() lets the columns be used without the mtcars$ prefix
      attach(mtcars)
      # Checking the relationship between mpg and each of the other variables
      for (v in setdiff(names(mtcars), "mpg")) {
        plot(mtcars[[v]], mtcars$mpg, xlab = v, ylab = "mpg")
      }
      # Creating the multiple linear model
      model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
                  data = mtcars)
      model
      # Checking the summary of the model
      summary(model)
      # Residual standard error (root mean square error) on the
      # residual degrees of freedom (32 observations - 11 coefficients = 21)
      sqrt(sum(model$residuals^2) / df.residual(model))
      # Hypothesis testing: t-test
      # The test statistic is the coefficient estimate divided by its standard error
      # Example 1: coefficient of disp (row 3 of the coefficient table)
      tstat <- coef(summary(model))[3, 1] / coef(summary(model))[3, 2]
      tstat
      # Two-sided p-value; abs() keeps it valid for negative statistics
      2 * pt(abs(tstat), df.residual(model), lower.tail = FALSE)
      # Example 2: intercept (row 1 of the coefficient table)
      tstat2 <- coef(summary(model))[1, 1] / coef(summary(model))[1, 2]
      tstat2
      2 * pt(abs(tstat2), df.residual(model), lower.tail = FALSE)
      summary(model)
  • 48. R Code for MLR (continued)
      # F-test
      summary(model)
      # Coefficient confidence intervals
      confint(model, level = 0.95)
      # Testing the various assumptions of the model
      # 1. Checking whether the residuals are normally distributed
      # Histogram
      resid <- model$residuals
      hist(resid)
      # Quantile plot
      qqnorm(resid)
      qqline(resid)
      # 2. Checking homoscedasticity
      plot(model$residuals ~ mtcars$disp)
      abline(0, 0)
      # Residual analysis
      plot(model)
      # Transformations
      model1 <- lm(mpg ~ cyl + log(disp) + log(hp) + drat + wt + qsec + vs + am + gear + carb,
                   data = mtcars)
      summary(model1)
      plot(model1)
      # Reducing the model
      library(MASS)
      # Running stepwise AIC selection on the initial model
      stepAIC(model)
      # Running stepwise AIC selection on the transformed model
      stepAIC(model1)
      # Constructing new models with reduced variables
      model2 <- lm(mpg ~ qsec + wt + am, data = mtcars)
      summary(model2)
      model6 <- lm(mpg ~ log(disp) + gear + carb, data = mtcars)
      summary(model6)
      # Partial F-test: the reduced (nested) model first, then the full model
      nestmodel <- lm(mpg ~ wt + qsec + am, data = mtcars)
      anova(nestmodel, model)
      # Multicollinearity: scatterplot matrix of all variables
      plot(mtcars)
      # Checking the correlation between the retained predictors
      cor(mtcars$qsec, mtcars$wt)
      cor(mtcars$am, mtcars$wt)
      cor(mtcars$am, mtcars$qsec)
      # Variance inflation factor
      # install.packages("car")  # run once if the package is not installed
      library(car)
      vif(model2)
      # Polynomial model
      plot(model2$residuals ~ model2$fitted.values,
           xlab = "Fitted Values", ylab = "Residuals")
      abline(0, 0)
      quadmod <- lm(mpg ~ qsec + I(qsec^2) + wt + am, data = mtcars)
      plot(quadmod$residuals ~ quadmod$fitted.values,
           xlab = "Fitted Values", ylab = "Residuals")
      abline(0, 0)
      summary(quadmod)
      AIC(quadmod)
      # Interaction model
      model3 <- lm(mpg ~ qsec + wt * am, data = mtcars)
      summary(model3)
      AIC(model3)
      resid3 <- model3$residuals
      hist(resid3)
      qqnorm(resid3)
      qqline(resid3)
      plot(model3$residuals ~ mtcars$disp)
      abline(0, 0)
      plot(model3)
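  • 49. R Code for MLR (continued)
      # Using the model
      newdata <- data.frame(wt = 2.92, qsec = 20.1, am = 1)
      # Prediction interval for a single new observation
      predy <- predict(model3, newdata, interval = "prediction")
      predy
      # Confidence interval for the mean response
      confy <- predict(model3, newdata, interval = "confidence")
      confy
      # Interval widths (upper limit minus lower limit)
      confy %*% c(0, -1, 1)
      predy %*% c(0, -1, 1)
      # Both intervals share the same point estimate
      all.equal(confy[1], predy[1])
      # Sample prediction for row 20 of mtcars (qsec = 19.9, wt = 1.835, am = 1)
      mtcars[20, ]
      b <- unname(coef(model3))  # order: intercept, qsec, wt, am, wt:am
      pred <- b[1] + b[2] * 19.9 + b[3] * 1.835 + b[4] * 1 + b[5] * 1.835 * 1
      pred
      # Residual: observed mpg (33.9) minus the prediction (31.0523 on the slide)
      33.9 - pred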