Regression and
Correlation
Analysis
1
Objectives
To determine the relationship
between a response variable
and one or more independent
variables for prediction purposes
2
• Compute a simple linear regression model
• Interpret the slope and intercept in a linear
regression model
• Check model adequacy
• Use the model for prediction purposes
3
Contents
1. Introduction
- Regression and correlation
2. Simple Linear Regression
- Simple linear regression model (deals
with one independent variable)
- Least-squares estimation of parameters
- Hypothesis testing on the parameters
- Interpretation
4
3. Correlation
- Correlation coefficient
- Coefficient of determination and its
interpretation
5
Learning Outcomes
Students will be able to:
• Identify the nature of the association
between a given pair of variables
• Fit a suitable regression model to a given
set of data on two variables
• Check the model assumptions
• Interpret the parameters of the fitted
model
• Predict or estimate Y values for given X
values
6
Reference
1. Introduction to Linear Regression Analysis (3rd
edition), D.C. Montgomery, E.A. Peck and G.G.
Vining, John Wiley (2004)
2. Applied Regression Analysis (3rd edition), N.R.
Draper and H. Smith, John Wiley (1998)
7
Introduction
Regression and correlation are important
statistical tools used to identify and
quantify the relationship between two
or more variables.
Regression is applied in almost every
field, including engineering, the physical
and chemical sciences, economics, the life and
biological sciences, and the social sciences.
8
Regression analysis was first developed by Sir
Francis Galton (1822-1911).
Regression and correlation are two different but
closely related concepts.
Regression is a quantitative expression of the basic
nature of the relationship between the dependent
and independent variables.
Correlation is the strength of the relationship. That
is, correlation measures how strong the
relationship between two variables is.
9
Dependent variable
• In a research study, the dependent variable
is the variable that you believe might be
influenced or modified by some treatment
or exposure. It may also represent the
variable you are trying to predict.
Sometimes the dependent variable is called
the outcome variable. This definition
depends on the context of the study
10
If one variable depends on another, we can say that
one variable is a function of the other:
Y = ƒ (X)
Here Y depends on X in some manner.
As Y depends on X, Y is called the dependent
variable, criterion variable, or response variable.
11
Independent variable
In a research study, an independent variable
is a variable that you believe might
influence your outcome measure.
X is called the independent variable,
predictor variable, regressor, or explanatory
variable.
12
This might be a variable that you control, like
a treatment, or a variable not under your
control, like an exposure.
It also might represent a demographic factor
like age or gender
13
Regression
• Simple: Y = ƒ (X)
– Linear
– Non linear
• Multiple: Y = ƒ (X1, X2, …, Xk)
– Linear
– Non linear
14
CONTENTS
• Coefficients of correlation
–meaning
–values
–role
–significance
• Regression
–line of best fit
–prediction
–significance
15
• Correlation
–the strength of the linear relationship
between two variables
• Regression analysis
–determines the nature of the relationship
Ex : Is there a relationship between the
number of units of alcohol consumed
and the likelihood of developing
cirrhosis of the liver?
16
Correlation and Covariance
Correlation is the standardized covariance:
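In symbols, the standard definition is corr(X, Y) = cov(X, Y) / (s_X s_Y), where s_X and s_Y are the standard deviations of X and Y. The sketch below is one way to compute it by hand; the height and weight values are hypothetical and the function names are chosen only for illustration. It also re-expresses the heights in centimetres to show that the covariance depends on the units while the correlation does not.

```python
import math

def covariance(x, y):
    """Sample covariance of two equal-length sequences (divisor n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson r: covariance divided by the product of the standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical heights (inches) and weights (lbs), for illustration only.
heights_in = [60, 62, 64, 66, 68, 70]
weights_lb = [105, 112, 121, 130, 136, 150]

r_inches = correlation(heights_in, weights_lb)

# Re-express the heights in centimetres: the covariance changes, r does not.
heights_cm = [h * 2.54 for h in heights_in]
r_cm = correlation(heights_cm, weights_lb)

print(round(r_inches, 4), round(r_cm, 4))
```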
17
Measures the relative strength of the linear
relationship between two variables
The correlation is scale invariant and the
units of measurement don't matter (unitless).
This gives the direction (- or +) and strength
(0 to 1) of the linear relationship between X
and Y.
18
• It is always true that -1 ≤ corr(X, Y) ≤ 1. That is, the
correlation ranges between –1 and 1
• The closer to –1, the stronger the negative linear
relationship
• The closer to 1, the stronger the positive linear
relationship
• The closer to 0, the weaker any linear relationship
Though a value close to zero indicates almost no
linear association, it does not mean there is no relationship.
19
Scatter Plots of Data with Various
Correlation Coefficients
[Figure: six scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = +.3, r = +1, and r = 0]
20
Linear Correlation
[Figure: scatter plots of Y against X, contrasting linear relationships with curvilinear relationships]
21
Linear Correlation
[Figure: scatter plots of Y against X, contrasting strong relationships with weak relationships]
22
Linear Correlation
[Figure: scatter plots of Y against X showing no relationship]
23
Interpreting the Pearson correlation
coefficient
• The value of r for this data is 0.39, indicating a weak positive linear association.
• Omitting the last observation, r is 0.96.
• Thus, r is sensitive to extreme observations.
[Figure: Scatterplot of Weight (lbs) vs Height (inches), with one extreme observation marked]
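The sensitivity of r to a single extreme observation can be reproduced with a few lines of code. The sketch below uses hypothetical height and weight values (not the data behind the slide's plot) and computes r with and without one added extreme point.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: weight rises steadily with height.
height = [60, 62, 64, 66, 68, 70, 72]
weight = [100, 108, 117, 125, 133, 142, 150]
r_clean = pearson_r(height, weight)

# Add one extreme observation: a tall person with a very low weight.
r_with_extreme = pearson_r(height + [76], weight + [95])

print(round(r_clean, 2), round(r_with_extreme, 2))
```

One added point is enough to pull a near-perfect correlation down to a weak one, which is why the scatter plot should be inspected before r is interpreted.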
24
• The value of r here is 0.94.
• However, a straight line model may not be suitable.
• The relationship appears curvilinear.
[Figure: scatterplot of Response against Predictor]
25
Extreme Observation (continued)
• The value of r is -0.07.
• But the plot indicates positive linear association.
• Again, this anomaly is due to extreme data values.
[Figure: Scatterplot of Final marks vs OBT marks]
26
• The value of r is around 0.006, indicating almost no linear association.
• However, from the plot, we find a strong relationship between the two variables.
• This exemplifies that r does not provide evidence of all relationships.
• These examples highlight the importance of looking at scatter plots of data prior to deciding on a model function.
[Figure: Scatterplot of Reaction time in Seconds vs Age in years]
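A small numerical illustration of the same point, using made-up data rather than the reaction time data: when y depends on x exactly through a symmetric curve, the relationship is perfectly predictable and yet r is zero.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical U-shaped relationship: y is completely determined by x.
x = list(range(-5, 6))
y = [v * v for v in x]

r = pearson_r(x, y)
print(r)
```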
27
Coefficient of Determination
R² has a value of 0.6483. This means 64.83% of
the variation in the auction selling prices (y) is
explained by your regression model. The
remaining 35.17% is unexplained, i.e. due to
error.
28
Unlike the value of a test statistic, the
coefficient of determination does not have
a critical value that enables us to draw
conclusions.
In general, the higher the value of R², the better
the model fits the data.
R² = 1: Perfect match between the line and the
data points.
R² = 0: There is no linear relationship
between x and y.
29
Coefficient of determination
Two data points (x1, y1) and (x2, y2)
of a certain sample are shown.
[Figure: the two data points plotted against the x and y axes]
$$(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 = (\hat{y}_1 - \bar{y})^2 + (\hat{y}_2 - \bar{y})^2 + (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2$$
Total variation in y = Variation explained by the
regression line
+ Unexplained variation (error)
Variation in y = SSR + SSE
30
Coefficient of Determination
• How "strong" is the relationship between predictor and
outcome? (Fraction of the observed variance of the
outcome variable explained by the predictor
variables.)
• Relationship among SST, SSR, SSE:
SST = SSR + SSE
$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
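The decomposition SST = SSR + SSE can be checked numerically. The sketch below fits a least squares line to a small hypothetical data set, computes the three sums of squares, and forms the coefficient of determination as SSR / SST. The data and the helper name `fit_line` are assumptions made for the example.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]

b0, b1 = fit_line(x, y)
y_hat = [b0 + b1 * xi for xi in x]
y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)             # explained by the line
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))    # unexplained (error)

r_squared = ssr / sst
print(round(sst, 4), round(ssr + sse, 4), round(r_squared, 4))
```

The identity holds for a least squares line fitted with an intercept; it is not guaranteed for an arbitrary line drawn through the data.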
31
REGRESSION
32
Estimation Process
1. Regression model: $y = \beta_0 + \beta_1 x + \varepsilon$
2. Regression equation: $E(y) = \beta_0 + \beta_1 x$
3. Unknown parameters: β0, β1
4. Sample data: (x1, y1), …, (xn, yn)
5. Estimated regression equation: $\hat{y} = b_0 + b_1 x$
6. Sample statistics: b0, b1
7. b0 and b1 provide estimates of β0 and β1
33
Introduction
• We will examine the relationship between
quantitative variables x and y via a
mathematical equation.
• The motivation for using the technique:
– Forecast the value of a dependent variable (y)
from the values of independent variables (x1, x2,
…, xk).
– Analyze the specific relationships between the
independent variables and the dependent variable.
34
For a continuous variable X the easiest way
of checking for a linear relationship with Y
is by means of a scatter plot of Y against X.
Hence, regression analysis can be started
with a scatter plot.
35
Least Squares
• 1. 'Best Fit' means the difference between
actual Y values and predicted Y values is
a minimum. But positive differences offset
negative ones, so square the errors.
• 2. LS minimizes the sum of the squared
differences (errors) (SSE):
$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2$$
36
Coefficient Equations
• Prediction equation:
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
• Sample slope:
$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
• Sample Y-intercept:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
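The slope and intercept formulas can be applied directly. The sketch below works through them on five hypothetical points; the variable names `ss_xy` and `ss_xx` mirror SSxy and SSxx, and the data are made up for the example.

```python
# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.2, 7.9, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# SSxy: sum of cross-products of deviations; SSxx: sum of squared x deviations.
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_xx = sum((xi - x_bar) ** 2 for xi in x)

slope = ss_xy / ss_xx                  # sample slope
intercept = y_bar - slope * x_bar      # sample Y-intercept

def predict(x_new):
    """Prediction equation: intercept + slope * x."""
    return intercept + slope * x_new

print(round(slope, 2), round(intercept, 2), round(predict(3), 2))
```

Because the intercept is defined as the mean of y minus slope times the mean of x, the fitted line always passes through the point of means, so `predict(x_bar)` returns `y_bar`.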
37
Interpreting regression
coefficients
You should interpret the slope and the
intercept of this line as follows:
–The slope represents the estimated
average change in Y when X increases by one
unit.
–The intercept represents the estimated
average value of Y when X equals zero
38
Interpretation of Coefficients
• 1. Slope ($\hat{\beta}_1$)
– Estimated Y changes by $\hat{\beta}_1$ for each 1 unit
increase in X
• If $\hat{\beta}_1$ = 2, then Y is expected to increase by 2 for
each 1 unit increase in X
• 2. Y-Intercept ($\hat{\beta}_0$)
– Average value of Y when X = 0
• If $\hat{\beta}_0$ = 4, then the average of Y is expected to be
4 when X is 0
39
The Model
• The first order linear model
$y = \beta_0 + \beta_1 x + \varepsilon$
y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line
ε = error variable
[Figure: the line plotted on x and y axes, with intercept β0 and slope β1 = Rise/Run]
β0 and β1 are unknown population
parameters and are therefore estimated
from the data.
40
The Least Squares (Regression)
Line
A good line is one that minimizes
the sum of squared differences between the
points and the line.
41
Model adequacy checking
When conducting linear regression, it is important
to make sure the assumptions behind the model
are met. It is also important to verify that the
estimated linear regression model is a good fit for
the data (often a linear regression line can be
estimated by SAS, SPSS, MINITAB etc. even if
it’s not appropriate—in this case it is up to you to
judge whether the model is a good one).
42
Assumptions
• The relationship between the explanatory
variable and the outcome variable is linear.
In other words, each increase by one unit in
an explanatory variable is associated with a
fixed increase in the outcome variable.
• The regression equation describes the mean
value of the dependent variable for given
values of the independent variable.
43
• The individual data points of Y (the
response variable) for each value of the
explanatory variable are normally
distributed about the line of means
(regression line).
• The variance of the data points about the
line of means is the same for each value of
the explanatory variable.
44
Assumptions About the Error
Term ε
1. The error ε is a random variable with mean of zero.
2. The variance of ε, denoted by σ², is the same for
all values of the independent variable.
3. The values of ε are independent (randomly distributed).
4. The error ε is a normally distributed random
variable with mean zero and variance σ².
45
Testing the assumptions for
regression - 2
• Normality (interval level variables)
– Skewness & Kurtosis must lie within acceptable limits
(-1 to +1)
• How to test?
• You can examine a histogram. Normality of distribution of
Y data points can be checked by plotting a histogram of
the residuals.
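As a rough numerical companion to the histogram, skewness and kurtosis of the residuals can be computed from their central moments. The sketch below uses the simple moment-based definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2² - 3); statistical packages typically report small-sample adjusted versions, so their values may differ slightly. The residuals listed are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment, using divisor n."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Moment-based skewness: m3 / m2^1.5 (0 for a symmetric sample)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2^2 - 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

# Hypothetical residuals from a fitted regression, for illustration only.
residuals = [-3, -1, -1, 0, 0, 0, 1, 1, 3]

skew = skewness(residuals)
kurt = excess_kurtosis(residuals)

# The -1 to +1 rule of thumb from the slide, applied to both statistics.
within_limits = -1 <= skew <= 1 and -1 <= kurt <= 1
print(round(skew, 3), round(kurt, 3), within_limits)
```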
46
• If condition violated?
– Regression procedure can overestimate significance, so
should add a note of caution to the interpretation of
results (increases type I error rate)
47
Testing the assumptions -
normality
To compute skewness and
kurtosis for the included
cases, select Descriptive
Statistics|Descriptives…
from the Analyze menu.
48
Testing the assumptions -
normality
First, mark the
checkboxes for Kurtosis
and Skewness.
Second, click on
the Continue
button to complete
the options.
49
Analysis of Residual
• To examine whether the regression model is
appropriate for the data being analyzed, we can
check the residual plots.
• Residual plots are:
– Plot a histogram of the residuals
– Plot residuals against the fitted values.
– Plot residuals against the independent variable.
– Plot residuals over time if the data are chronological.
50
Analysis of Residual
• A histogram of the residuals provides a check on
the normality assumption. A Normal quantile plot
of the residuals can also be used to check the
Normality assumptions.
• Regression Inference is robust against moderate
lack of Normality. On the other hand, outliers and
influential observations can invalidate the results
of inference for regression
• Plot of residuals against fitted values or the
independent variable can be used to check the
assumption of constant variance and the aptness
of the model.
51
Analysis of Residual
• Plot of residuals against time provides a
check on the independence of the error
terms assumption.
• Assumption of independence is the most
critical one.
52
Residual plots
• The residuals should have no systematic pattern.
• The residual plot to the right shows a scatter of the points with no unusual individual observations and no systematic change as x increases.
[Figure: Degree Days Residual Plot; residuals (-1 to 1) plotted against degree days (0 to 60)]
53
Residual plots
• The points in this
residual plot have a
curved pattern, so a
straight line fits poorly.
54
Residual plots
• The points in this plot
show more spread for
larger values of the
explanatory variable x,
so prediction will be
less accurate when x is
large.
55
Heteroscedasticity
• When the requirement of a constant variance is violated we
have a condition of heteroscedasticity.
• Diagnose heteroscedasticity by plotting the residual
against the predicted y.
[Figure: two plots of the residual against ŷ. The spread increases with ŷ.]
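A rough numeric companion to the plot: order the residuals by predicted value, split them into a lower and an upper half, and compare the mean squared residual in each half. This is only an informal check in the spirit of a Goldfeld-Quandt comparison, not a formal test and not a method stated in these slides. The fitted values and residuals below are hypothetical.

```python
def mean_square(values):
    """Mean of the squared values (spread about zero)."""
    return sum(v * v for v in values) / len(values)

# Hypothetical fitted values and residuals; the residuals fan out as y-hat grows.
fitted = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
residuals = [0.1, -0.1, 0.2, -0.2, 0.3, -0.6, 0.9, -1.2, 1.5, -1.8]

# Order the residuals by fitted value, then split into lower and upper halves.
ordered = [e for _, e in sorted(zip(fitted, residuals))]
half = len(ordered) // 2
low_spread = mean_square(ordered[:half])
high_spread = mean_square(ordered[half:])

# A ratio far from 1 suggests the variance is not constant.
spread_ratio = high_spread / low_spread
print(round(spread_ratio, 1))
```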
56
Non Independence of Error Variables
Patterns in the appearance of the residuals indicate that
autocorrelation exists.
[Figure: two plots of residuals against time, each with a reference line at 0]
• Note the runs of positive residuals,
replaced by runs of negative residuals.
• Note the oscillating behavior of the
residuals around zero.
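The two patterns can also be summarized with a single number. The sketch below computes the Durbin-Watson statistic, d = Σ(e_t - e_{t-1})² / Σ e_t², which is a standard diagnostic but is not mentioned in these slides, so treat it as a supplementary illustration. Values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation (runs), and values well above 2 suggest negative autocorrelation (oscillation). Both residual series are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Hypothetical residuals in time order.
runs = [1.0] * 4 + [-1.0] * 4 + [1.0] * 4                        # long runs of one sign
oscillating = [1.0 if t % 2 == 0 else -1.0 for t in range(12)]   # sign flips every step

d_runs = durbin_watson(runs)
d_oscillating = durbin_watson(oscillating)
print(round(d_runs, 3), round(d_oscillating, 3))
```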
57
Outliers
• An outlier is an observation that is unusually small or
large.
• Several possibilities need to be investigated when an
outlier is observed:
– There was an error in recording the value.
– The point does not belong in the sample.
– The observation is valid.
• Identify outliers from the scatter diagram.
• It is customary to suspect an observation is an outlier
if its |standardized residual| > 2
58
• or if the DFFITS value of the data point is > 2
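The rule of thumb on standardized residuals can be sketched as follows. This version divides each residual by s = sqrt(SSE / (n - 2)); software usually reports studentized residuals that also adjust for leverage, so the values will not match exactly. The data are hypothetical, with the fifth y value deliberately corrupted, and DFFITS is not computed here.

```python
import math

def fit_line(x, y):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: y = 2x except for the fifth point, which is far too large.
x = list(range(1, 11))
y = [2, 4, 6, 8, 25, 12, 14, 16, 18, 20]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Residual standard error with n - 2 degrees of freedom.
s = math.sqrt(sum(e * e for e in residuals) / (len(x) - 2))
standardized = [e / s for e in residuals]

# Zero-based positions of observations whose |standardized residual| exceeds 2.
suspects = [i for i, z in enumerate(standardized) if abs(z) > 2]
print(suspects)
```

A flagged observation is only a candidate: as the slide notes, it may be a recording error, a point that does not belong in the sample, or a valid observation.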
59
Variable transformations
• If the residual plot suggests that the variance is not
constant, a transformation can be used to stabilize
the variance.
• If the residual plot suggests a non linear
relationship between x and y, a transformation
may reduce it to one that is approximately linear.
• Common linearizing transformations are: $1/x$, $\log(x)$
• Variance stabilizing transformations are: $1/y$, $\log(y)$, $\sqrt{y}$, $y^2$
60
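A short sketch of how a transformation can linearize a relationship. It uses hypothetical data that follow an exponential curve exactly, and takes the logarithm of y (for a curve of this kind the log is applied to the response); the correlation with x is compared before and after.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y grows exponentially with x, so log(y) is linear in x.
x = list(range(10))
y = [2.0 * math.exp(0.5 * v) for v in x]

r_raw = pearson_r(x, y)
r_log = pearson_r(x, [math.log(v) for v in y])
print(round(r_raw, 3), round(r_log, 3))
```

After the transformation, a straight line is fitted to (x, log y), and predictions are transformed back with the exponential function.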
Example
• The following observations come from a
study of the relationship between marks on a
mathematics placement test conducted at a
faculty and the final grades of 20 students.
The faculty had decided not to admit
students who scored below 35 on the
placement test.
63
Table
Placement test    Final grade
50                53
35                41
35                51
40                62
55                68
65                63
35                22
60                70
90                85
35                40
90                75
80                91
60                58
60                71
60                71
40                49
55                58
50                57
65                77
50                59
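The 20 observations in the table can be used to work through the whole calculation: the least squares line, the correlation coefficient, and the coefficient of determination. The sketch below is one way to do this by hand in code; the prediction at a placement mark of 70 is an arbitrary illustration, chosen inside the observed range of 35 to 90.

```python
import math

# Data from the table: placement test marks and final grades of 20 students.
placement = [50, 35, 35, 40, 55, 65, 35, 60, 90, 35,
             90, 80, 60, 60, 60, 40, 55, 50, 65, 50]
final = [53, 41, 51, 62, 68, 63, 22, 70, 85, 40,
         75, 91, 58, 71, 71, 49, 58, 57, 77, 59]

n = len(placement)
x_bar = sum(placement) / n
y_bar = sum(final) / n

ss_xx = sum((x - x_bar) ** 2 for x in placement)
ss_yy = sum((y - y_bar) ** 2 for y in final)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(placement, final))

slope = ss_xy / ss_xx                   # estimated change in grade per extra mark
intercept = y_bar - slope * x_bar       # estimated grade at a placement mark of 0
r = ss_xy / math.sqrt(ss_xx * ss_yy)    # Pearson correlation coefficient
r_squared = r ** 2                      # coefficient of determination

# Predict only inside the observed range of placement marks (35 to 90).
x_new = 70
predicted_grade = intercept + slope * x_new

print(round(slope, 3), round(intercept, 2), round(r, 3), round(r_squared, 3))
print(round(predicted_grade, 1))
```

Since no student with a placement mark below 35 was admitted, the fitted line says nothing reliable about such students, and the intercept (the fitted value at a mark of 0) lies far outside the data.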
64
Scatter plot
[Figure: Scatterplot of Final grade vs placement test]
65
Correlations: Daily RF (0.01cm),
Particle weight (µg/m3)
• Pearson correlation of Daily RF (0.01cm)
and Particle weight (µg/m3) = 0.726
• P-Value = 0.011
66
SAS For Regression and Correlation
67
PROC REG
Submit the following program in SAS. In addition to the
PROC REG and MODEL statements, with which you are familiar,
the PLOT statements request a plot of level by weight with the
fitted line, a plot of the residuals by weight, and a plot of the
studentized (standardized) residuals by weight:
PROC REG DATA = blood;
MODEL level = weight;
PLOT level * weight;       /* response against predictor, with the fitted line */
PLOT residual. * weight;   /* residuals against weight */
PLOT student. * weight;    /* studentized residuals against weight */
RUN;
68
Interpreting Output
Notice that the overall F-test has a p-value of
0.2160, which is greater than 0.05.
Therefore, we fail to reject Ho: β1 = 0. The data
provide no evidence of a linear relationship
between blood level and weight.
Now look at the following plots:
69
Plot of Regression Line: Notice it is the same plot as the one
you created from PROC GPLOT, except the fitted regression line
has been added to it.
70
Plot of residuals * weight: you want an even spread of
points above and below the dashed line. This is a good way
to eyeball the data for potential outliers.
71
Plot of studentized residuals * weight: look for
values with an absolute value larger than 2.6 to
determine if there are any outliers.
72
You can see from the plot that the observation
with weight = 128 (observation #4) is an
outlier.
The residual plots also help you determine
whether the assumption of constant variance is
met. Because the residuals appear to be
randomly scattered without any definite
pattern, this suggests that the data are
independent with constant variance.
73
The Normality Assumption
A convenient way to check for normality is by
constructing a "Normal Quantile-Quantile" (NQQ)
plot. This plots the residuals you would see
under normality versus the residuals that are
actually observed. If the data are completely
normal, the residuals will follow a 45° line.
Use the following code in SAS to make the
NQQ plot:
PROC REG DATA = blood;
MODEL level = weight;
PLOT residual. * nqq.;   /* residuals against normal quantiles */
RUN;
74
Residual vs. NQQ Plot
75
Interpreting the NQQ Plot
The residuals do not clearly follow a 45° line.
Because the tails of this line seem curved,
this suggests that the data may be skewed,
not normally distributed.
76
Recommendations
• It is extremely important to look at plots of raw
data prior to selecting a tentative model
• Need to be cautious in interpreting the correlation
coefficient r.
• Proper model assessment should be done prior to
using the fitted model for predictions.
• Need to focus on the range of x values used for
building the model prior to making predictions at
a desired x value.
77
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 

Dernier (20)

Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 

Regression and Co-Relation

  • 2. Objectives: To determine the relationship between a response variable and independent variables for prediction purposes.
  • 3. • Compute a simple linear regression model • Interpret the slope and intercept in a linear regression model • Check model adequacy • Use the model for prediction purposes
  • 4. Contents 1. Introduction: regression and correlation 2. Simple Linear Regression - Simple linear regression model (deals with one independent variable) - Least-squares estimation of parameters - Hypothesis testing on the parameters - Interpretation
  • 5. 3. Correlation - Correlation coefficient - Coefficient of determination and its interpretation
  • 6. Learning Outcomes • Students will be able to identify the nature of the association between a given pair of variables • Find a suitable regression model for a given set of data on two variables • Check model assumptions • Interpret the parameters of the fitted model • Predict or estimate Y values for given X values
  • 7. Reference 1. Introduction to Linear Regression Analysis (3rd edition), D.C. Montgomery, E.A. Peck and G.G. Vining, John Wiley (2004) 2. Applied Regression Analysis (3rd edition), N.R. Draper and H. Smith, John Wiley (1998)
  • 8. Introduction Regression and correlation are important statistical tools used to identify and quantify the relationship between two or more variables. Regression is applied in almost every field, including engineering, the physical and chemical sciences, economics, the life and biological sciences, and the social sciences.
  • 9. Regression analysis was first developed by Sir Francis Galton (1822-1911). Regression and correlation are two different but closely related concepts. Regression is a quantitative expression of the basic nature of the relationship between the dependent and independent variables. Correlation measures the strength of that relationship, that is, how strongly two variables are related.
  • 10. Dependent variable • In a research study, the dependent variable is the variable that you believe might be influenced or modified by some treatment or exposure. It may also represent the variable you are trying to predict. The dependent variable is sometimes called the outcome variable. Which variable plays this role depends on the context of the study.
  • 11. If one variable depends on another, we can say that it is a function of the other: Y = ƒ(X). Here Y depends on X in some manner. Because Y depends on X, Y is called the dependent variable, criterion variable, or response variable.
  • 12. Independent variable In a research study, an independent variable is a variable that you believe might influence your outcome measure. X is called the independent variable, predictor variable, regressor, or explanatory variable.
  • 13. This might be a variable that you control, like a treatment, or a variable not under your control, like an exposure. It might also represent a demographic factor like age or gender.
  • 14. Types of regression: Simple, Y = ƒ(X), which may be linear or nonlinear; Multiple, Y = ƒ(X1, X2, …, Xk), which may be linear or nonlinear.
  • 15. CONTENTS • Coefficients of correlation – meaning – values – role – significance • Regression – line of best fit – prediction – significance
  • 16. • Correlation – the strength of the linear relationship between two variables • Regression analysis – determines the nature of the relationship. Example: Is there a relationship between the number of units of alcohol consumed and the likelihood of developing cirrhosis of the liver?
  • 17. Correlation and Covariance Correlation is the standardized covariance: $\mathrm{corr}(X, Y) = \dfrac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$, where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.
  • 18. Correlation measures the relative strength of the linear relationship between two variables. It is scale invariant, so the units of measurement do not matter (it is unitless). Its sign gives the direction (- or +) and its magnitude (0 to 1) gives the strength of the linear relationship between X and Y.
  • 19. • It is always true that -1 ≤ corr(X, Y) ≤ 1, that is, the correlation ranges between –1 and 1 • The closer to –1, the stronger the negative linear relationship • The closer to 1, the stronger the positive linear relationship • The closer to 0, the weaker any linear relationship. A value close to zero indicates almost no linear association, but it does not mean there is no relationship.
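As a rough illustration of the definition above, the following Python sketch computes r as the sample covariance divided by the product of the sample standard deviations. The function name `pearson_r` and the height and weight values are made up for illustration; they are not data from the slides.

```python
import math


def pearson_r(x, y):
    """Pearson correlation as the standardized sample covariance."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and sample standard deviations (divisor n - 1).
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov_xy / (sd_x * sd_y)


# Hypothetical heights (inches) and weights (lbs), for illustration only.
heights_in = [60, 62, 64, 66, 68]
weights_lb = [100, 112, 118, 131, 140]

r_inches = pearson_r(heights_in, weights_lb)

# Changing the unit of measurement (inches to centimetres) leaves r unchanged.
heights_cm = [h * 2.54 for h in heights_in]
r_cm = pearson_r(heights_cm, weights_lb)
```

Because the covariance and both standard deviations are rescaled by the same factor, `r_inches` and `r_cm` agree up to floating-point rounding, which is the scale invariance described above.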
  • 20. Scatter Plots of Data with Various Correlation Coefficients [Figure: six scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = +.3, r = +1, and a second panel with r = 0]
  • 21. Linear Correlation [Figure: scatter plots of Y against X contrasting linear relationships with curvilinear relationships]
  • 22. Linear Correlation [Figure: scatter plots of Y against X contrasting strong relationships with weak relationships]
  • 24. Interpreting the Pearson correlation coefficient • The value of r for these data is 0.39, indicating a weak positive linear association. • Omitting the last observation, r is 0.96. • Thus, r is sensitive to extreme observations. [Figure: scatterplot of Weight (lbs) vs Height (inches), with the extreme observation marked]
  • 25. • The value of r here is 0.94. • However, a straight-line model may not be suitable. • The relationship appears curvilinear. [Figure: scatterplot of Response vs Predictor]
  • 26. Extreme observation, continued • The value of r is -0.07. • But the plot indicates a positive linear association. • Again, this anomaly is due to extreme data values. [Figure: scatterplot of Final marks vs OBT marks]
  • 27. • The value of r is around 0.006, indicating almost no linear association. • However, the plot shows a strong relationship between the two variables. • This illustrates that r does not capture every kind of relationship. • These examples highlight the importance of looking at scatter plots of the data before deciding on a model function. [Figure: scatterplot of Reaction time in seconds vs Age in years]
  • 28. Coefficient of Determination R² has a value of .6483. This means 64.83% of the variation in the auction selling prices (y) is explained by the regression model. The remaining 35.17% is unexplained, i.e. due to error.
  • 29. Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions. In general, the higher the value of R², the better the model fits the data. R² = 1: perfect match between the line and the data points. R² = 0: there is no linear relationship between x and y.
  • 30. Coefficient of determination [Figure: two data points (x1, y1) and (x2, y2) of a sample, shown with the regression line and the mean of y] $(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 = (\hat{y}_1 - \bar{y})^2 + (\hat{y}_2 - \bar{y})^2 + (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2$ Total variation in y = variation explained by the regression line + unexplained variation (error), that is, variation in y = SSR + SSE.
  • 31. Coefficient of Determination • How “strong” is the relationship between predictor and outcome? (The fraction of the observed variance of the outcome variable explained by the predictor variables.) • Relationship among SST, SSR, and SSE: SST = SSR + SSE, that is, $\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$, where SST = total sum of squares, SSR = sum of squares due to regression, SSE = sum of squares due to error.
  • 33. Estimation Process Regression model: $y = \beta_0 + \beta_1 x + \varepsilon$. Regression equation: $E(y) = \beta_0 + \beta_1 x$. Unknown parameters: $\beta_0$, $\beta_1$. Sample data: pairs $(x_1, y_1), \ldots, (x_n, y_n)$. Sample statistics: $b_0$, $b_1$. Estimated regression equation: $\hat{y} = b_0 + b_1 x$. $b_0$ and $b_1$ provide estimates of $\beta_0$ and $\beta_1$.
  • 34. Introduction • We will examine the relationship between quantitative variables x and y via a mathematical equation. • The motivation for using the technique: – Forecast the value of a dependent variable (y) from the values of independent variables (x1, x2, …, xk). – Analyze the specific relationships between the independent variables and the dependent variable.
  • 35. For a continuous variable X, the easiest way of checking for a linear relationship with Y is a scatter plot of Y against X. Hence, regression analysis can begin with a scatter plot.
  • 36. Least Squares • 1. ‘Best fit’ means the differences between actual Y values and predicted Y values are a minimum. But positive differences offset negative ones, so square the errors. • 2. Least squares (LS) minimizes the sum of the squared differences (errors), SSE: $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2$
  • 37. Coefficient Equations • Prediction equation: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ • Sample slope: $\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_{xx}} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ • Sample Y-intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
  • 38. Interpreting regression coefficients Interpret the slope and the intercept of the fitted line as follows: – The slope represents the estimated average change in Y when X increases by one unit. – The intercept represents the estimated average value of Y when X equals zero.
  • 39. Interpretation of Coefficients • 1. Slope ($\hat{\beta}_1$) – Estimated Y changes by $\hat{\beta}_1$ for each 1-unit increase in X • If $\hat{\beta}_1$ = 2, then Y is expected to increase by 2 for each 1-unit increase in X • 2. Y-intercept ($\hat{\beta}_0$) – Average value of Y when X = 0 • If $\hat{\beta}_0$ = 4, then the average of Y is expected to be 4 when X is 0
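The coefficient equations and their interpretation can be sketched in a few lines of Python. The function name `fit_simple_linear` is hypothetical, and the data are constructed to lie exactly on y = 4 + 2x so that the fitted values match the slope of 2 and intercept of 4 used in the interpretation above.

```python
def fit_simple_linear(x, y):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    if ss_xx == 0:
        raise ValueError("x values must not all be identical")
    b1 = ss_xy / ss_xx        # sample slope
    b0 = y_bar - b1 * x_bar   # sample Y-intercept
    return b0, b1


# Made-up data lying exactly on y = 4 + 2x.
x = [0, 1, 2, 3, 4]
y = [4, 6, 8, 10, 12]
b0, b1 = fit_simple_linear(x, y)

# Each 1-unit increase in x changes the prediction by b1.
change_per_unit = (b0 + b1 * 3) - (b0 + b1 * 2)
```

With these data the sketch recovers an intercept of 4 (the predicted Y at X = 0) and a slope of 2 (the predicted change in Y per unit of X).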
  • 40. The Model • The first-order linear model: $y = \beta_0 + \beta_1 x + \varepsilon$, where y = dependent variable, x = independent variable, β0 = y-intercept, β1 = slope of the line (rise/run), ε = error variable. β0 and β1 are unknown population parameters and are therefore estimated from the data.
  • 41. The Least Squares (Regression) Line A good line is one that minimizes the sum of squared differences between the points and the line.
  • 42. Model adequacy checking When conducting linear regression, it is important to make sure the assumptions behind the model are met. It is also important to verify that the estimated linear regression model is a good fit for the data. Software such as SAS, SPSS, or MINITAB can often estimate a regression line even when it is not appropriate; in that case it is up to you to judge whether the model is a good one.
  • 43. Assumptions • The relationship between the explanatory variable and the outcome variable is linear. In other words, each one-unit increase in an explanatory variable is associated with a fixed change in the outcome variable. • The regression equation describes the mean value of the dependent variable for given values of the independent variable.
  • 44. • The individual data points of Y (the response variable) for each value of the explanatory variable are normally distributed about the line of means (regression line). • The variance of the data points about the line of means is the same for each value of the explanatory variable.
  • 45. Assumptions About the Error Term ε 1. The error ε is a random variable with a mean of zero. 2. The variance of ε, denoted by σ², is the same for all values of the independent variable. 3. The values of ε are independent (randomly distributed). 4. The error ε is a normally distributed random variable with mean zero and variance σ².
  • 46. Testing the assumptions for regression - 2 • Normality (interval-level variables) – Skewness and kurtosis must lie within acceptable limits (-1 to +1) • How to test? • You can examine a histogram. Normality of the distribution of the Y data points can be checked by plotting a histogram of the residuals.
  • 47. • If the condition is violated? – The regression procedure can overestimate significance (an increased type I error rate), so add a note of caution to the interpretation of the results.
  • 48. Testing the assumptions - normality To compute skewness and kurtosis for the included cases, select Descriptive Statistics | Descriptives… from the Analyze menu.
  • 49. Testing the assumptions - normality First, mark the checkboxes for Kurtosis and Skewness. Second, click the Continue button to complete the options.
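For readers without access to the menu steps above, the skewness and kurtosis check can be approximated in Python. This is a sketch under stated assumptions: it uses simple moment-based estimators, which can differ slightly from the bias-adjusted values that statistical packages report, and the residual values and function name are hypothetical.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (kurtosis minus 3)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - m) ** 4 for v in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("values must not all be identical")
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skew, excess_kurtosis


# Hypothetical residuals, symmetric around zero.
residuals = [-3, -1, -1, 0, 0, 0, 0, 1, 1, 3]
skew, kurt = skewness_and_excess_kurtosis(residuals)

# The rule of thumb from the slides: both statistics between -1 and +1.
within_limits = -1 <= skew <= 1 and -1 <= kurt <= 1
```

For these made-up residuals the skewness is zero and the excess kurtosis falls inside the -1 to +1 band, so the rule of thumb would not raise a concern.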
  • 50. Analysis of Residuals • To examine whether the regression model is appropriate for the data being analyzed, we can check residual plots. • Useful residual plots are: – A histogram of the residuals – Residuals against the fitted values – Residuals against the independent variable – Residuals over time, if the data are chronological
  • 51. Analysis of Residuals • A histogram of the residuals provides a check on the normality assumption. A normal quantile plot of the residuals can also be used to check the normality assumption. • Regression inference is robust against moderate lack of normality. On the other hand, outliers and influential observations can invalidate the results of inference for regression. • A plot of residuals against fitted values or the independent variable can be used to check the assumption of constant variance and the aptness of the model.
  • 52. Analysis of Residuals • A plot of residuals against time provides a check on the assumption that the error terms are independent. • The assumption of independence is the most critical one.
  • 53. Residual plots • The residuals should have no systematic pattern. • The residual plot shows a scatter of points with no unusual individual observations and no systematic change as x increases. [Figure: Degree Days residual plot, residuals (-1 to 1) against degree days (0 to 60)]
  • 54. Residual plots • The points in this residual plot follow a curved pattern, so a straight line fits poorly.
  • 55. Residual plots • The points in this plot show more spread for larger values of the explanatory variable x, so prediction will be less accurate when x is large.
  • 56. Heteroscedasticity • When the requirement of a constant variance is violated, we have a condition of heteroscedasticity. • Diagnose heteroscedasticity by plotting the residuals against the predicted y. [Figure: residuals against ŷ; the spread increases with ŷ]
  • 57. Non-Independence of Error Variables Patterns in the appearance of the residuals over time indicate that autocorrelation exists. [Figure: two plots of residuals against time. In the first, note the runs of positive residuals replaced by runs of negative residuals. In the second, note the oscillating behavior of the residuals around zero.]
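The two residual patterns described above can also be summarized numerically. The following Python sketch computes the lag-1 autocorrelation of a residual series; this statistic is not named in the slides and is offered only as one simple way to quantify the patterns. The function name and residual values are hypothetical.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one immediately before it."""
    n = len(residuals)
    if n < 2:
        raise ValueError("need at least two residuals")
    mean_e = sum(residuals) / n
    denom = sum((e - mean_e) ** 2 for e in residuals)
    if denom == 0:
        raise ValueError("residuals must not all be identical")
    numer = sum(
        (residuals[t] - mean_e) * (residuals[t - 1] - mean_e)
        for t in range(1, n)
    )
    return numer / denom


# Hypothetical residuals in time order.
runs = [1, 1, 1, 1, -1, -1, -1, -1]          # a run of positives, then negatives
oscillating = [1, -1, 1, -1, 1, -1, 1, -1]   # alternating sign around zero

r_runs = lag1_autocorrelation(runs)
r_oscillating = lag1_autocorrelation(oscillating)
```

Runs of same-signed residuals give a clearly positive value, oscillation gives a clearly negative value, and independent errors would be expected to give a value near zero.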
  • 58. Outliers • An outlier is an observation that is unusually small or large. • Several possibilities need to be investigated when an outlier is observed: – There was an error in recording the value. – The point does not belong in the sample. – The observation is valid. • Identify outliers from the scatter diagram. • It is customary to suspect that an observation is an outlier if its |standardized residual| > 2
  • 59. • An observation is also suspect if the DFFITS value of the data point is > 2
  • 60. Variable transformations • If the residual plot suggests that the variance is not constant, a transformation can be used to stabilize the variance. • If the residual plot suggests a nonlinear relationship between x and y, a transformation may reduce it to one that is approximately linear. • Common linearizing transformations are: $\log(x)$, $1/x$ • Variance-stabilizing transformations are: $1/y$, $\log(y)$, $\sqrt{y}$, $y^2$
  • 63. Example • The following observations come from a study of the relationship between scores on a mathematics placement test conducted at a faculty and the final grades of 20 students. The faculty decided not to admit students who scored below 35 on the placement test.
  • 64. Table: placement test score and final grade for the 20 students
      Placement test   Final grade
      50               53
      35               41
      35               51
      40               62
      55               68
      65               63
      35               22
      60               70
      90               85
      35               40
      90               75
      80               91
      60               58
      60               71
      60               71
      40               49
      55               58
      50               57
      65               77
      50               59
  • 65. [Figure: scatterplot of Final grade vs placement test]
  • 66. Correlations: Daily RF(0.01cm), Particle weight (µg/m3) • Pearson correlation of Daily RF(0.01cm) and Particle weight (µg/m3) = 0.726 • P-Value = 0.011
  • 67. SAS For Regression and Correlation
  • 68. PROC REG Submit the following program in SAS. In addition to the PROC REG and MODEL statements with which you are familiar, the PLOT statements request a plot of level by weight, a plot of the residuals by weight, and a plot of the studentized (standardized) residuals by weight: PROC REG DATA = blood; MODEL level = weight; PLOT level * weight; PLOT residual. * weight; PLOT student. * weight; RUN;
  • 69. Interpreting Output Notice that the overall F-test has a p-value of 0.2160, which is greater than 0.05. Therefore, we fail to reject H0: β1 = 0; the data provide no significant evidence of a linear relationship between blood level and weight. Now look at the following plots:
  • 70. Plot of Regression Line: Notice that it is the same plot as the one you created with PROC GPLOT, except that the fitted regression line has been added to it.
  • 71. Plot of residuals * weight: You want an even spread of points above and below the dashed line. This is a good way to eyeball the data for potential outliers.
  • 72. Plot of studentized residuals * weight: Look for values with an absolute value larger than 2.6 to determine whether there are any outliers.
  • 73. You can see from the plot that the observation with weight = 128 (observation #4) is an outlier. The residual plots also help you determine whether the assumption of constant variance is met. Because the residuals appear to be randomly scattered without any definite pattern, the plots suggest that the errors are independent with constant variance.
  • 74. The Normality Assumption A convenient way to check for normality is to construct a “Normal Quantile-Quantile” (NQQ) plot. This plots the residuals you would expect under normality against the residuals that are actually observed. If the residuals are normally distributed, the points will follow a 45° line. Use the following code in SAS to make the NQQ plot: PLOT residual. * nqq.; RUN;
  • 75. [Figure: Residual vs. NQQ plot]
  • 76. Interpreting the NQQ Plot The residuals do not clearly follow a 45° line. Because the tails of the plot seem curved, the residuals may be skewed rather than normally distributed.
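The coordinates behind a normal quantile-quantile plot can be sketched in Python with the standard library. This is an illustration, not a reproduction of the SAS output: the plotting positions (i - 0.5) / n are an assumption, other software may use slightly different positions, and the residual values and function name are hypothetical.

```python
from statistics import NormalDist


def normal_qq_pairs(residuals):
    """Pair each sorted residual with its theoretical standard normal quantile."""
    n = len(residuals)
    if n == 0:
        raise ValueError("need at least one residual")
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical residuals.
pairs = normal_qq_pairs([0.3, -1.2, 0.8, -0.1, 1.5])
```

Plotting the second element of each pair against the first gives the quantile plot. Points that bend away from a straight line in the tails suggest skewness or heavy tails, as in the interpretation above.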
  • 77. Recommendations • It is extremely important to look at plots of the raw data before selecting a tentative model. • Be cautious in interpreting the correlation coefficient r. • Assess the model properly before using the fitted model for predictions. • Check the range of x values used to build the model before making predictions at a desired x value.