2. The Multiple Regression Model
Idea: examine the linear relationship between
one dependent variable (Y) and two or more independent variables (Xi)
Multiple Regression Model with k Independent Variables:
Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

where β0 is the Y-intercept, β1, …, βk are the population slopes,
and εi is the random error.
3. Assumptions of Regression
Use the acronym LINE:
• Linearity
– The underlying relationship between X and Y is linear
• Independence of Errors
– Error values are statistically independent
• Normality of Error
– Error values (ε) are normally distributed for any given value of X
• Equal Variance (Homoscedasticity)
– The probability distribution of the errors has constant variance
4. Regression Statistics
Multiple R          0.998368
R Square            0.996739
Adjusted R Square   0.995808
Standard Error      1.350151
Observations        28

r2 = SSR/SST = 11701.72/11740 = 0.996739: 99.674% of the variation in Y
is explained by the independent variables.

ANOVA
              df   SS         MS         F          Significance F
Regression     6   11701.72   1950.286   1069.876   5.54E-25
Residual      21   38.28108   1.822908
Total         27   11740
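The accident data set behind this output is not included with the slides, so the sketch below uses synthetic data with the same dimensions (n = 28, k = 6); it shows how the same summary quantities can be produced with statsmodels.

```python
# A minimal sketch, assuming synthetic data (the slide's accident
# data set is not available). Variable dimensions match the slide.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k = 28, 6
X = rng.normal(size=(n, k))                   # six hypothetical predictors
beta = np.array([1.0, 2.0, -1.5, 0.5, 0.0, 3.0])
y = 10 + X @ beta + rng.normal(size=n)        # linear model plus random error

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(np.sqrt(fit.rsquared))                  # Multiple R
print(fit.rsquared, fit.rsquared_adj)         # R Square, Adjusted R Square
print(np.sqrt(fit.mse_resid))                 # Standard Error of the estimate
print(fit.fvalue, fit.f_pvalue)               # ANOVA F and Significance F
```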
5. Adjusted r2
• r2 never decreases when a new X variable is
added to the model
– This can be a disadvantage when comparing models
• What is the net effect of adding a new variable?
– We lose a degree of freedom when a new X variable
is added
– Did the new X variable add enough explanatory
power to offset the loss of one degree of freedom?
6. Adjusted r2
• Shows the proportion of variation in Y explained
by all X variables adjusted for the number of X
variables used
r2adj = 1 – (1 – r2) × (n – 1) / (n – k – 1)
(where n = sample size, k = number of independent variables)
– Penalizes excessive use of unimportant independent
variables
– Smaller than r2
– Useful for comparing models
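As a check, plugging the values from the regression output above (r2 = 0.996739, n = 28, k = 6) into the formula reproduces the reported Adjusted R Square:

```python
# Adjusted r^2 computed from the slide's own numbers.
r2, n, k = 0.996739, 28, 6
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2_adj, 6))   # 0.995807 -- matches the reported 0.995808 up to rounding
```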
8. Is the Model Significant?
• F Test for Overall Significance of the Model
• Shows if there is a linear relationship between all of the
X variables considered together and Y
• Use F-test statistic
• Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent
variable affects Y)
9. F Test for Overall Significance
• Test statistic:
F = MSR / MSE = (SSR / k) / (SSE / (n – k – 1))

where F has k numerator degrees of freedom and
(n – k – 1) denominator degrees of freedom
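Recomputing F from the ANOVA table on the Regression Statistics slide (SSR = 11701.72, SSE = 38.28108, n = 28, k = 6) reproduces both the F statistic and Significance F:

```python
# F test recomputed from the ANOVA table; scipy supplies the p-value.
from scipy import stats

SSR, SSE, n, k = 11701.72, 38.28108, 28, 6
MSR = SSR / k                      # 1950.286
MSE = SSE / (n - k - 1)            # 1.822908
F = MSR / MSE                      # ~1069.876
p = stats.f.sf(F, k, n - k - 1)    # Significance F, ~5.5e-25
print(F, p)
```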
11. Multiple Regression Assumptions
Errors (residuals) from the regression model:
ei = (Yi – Ŷi)
Assumptions:
• The errors are normally distributed
• Errors have a constant variance
• The model errors are independent
12. Error terms and coefficient estimates
• Once we think of the error term as a random
variable, it becomes clear that the estimates
b1, b2, … (as distinguished from the true values
β1, β2, …) are also random variables: the estimates
produced by minimizing SSE depend on the particular
error values e that nature draws for each individual
in the data set.
13. Statistical Inference and Goodness of Fit
• The parameter estimates are themselves random
variables, dependent upon the random variables e.
• Thus, each estimate can be thought of as a draw
from some underlying probability distribution, the
nature of that distribution as yet unspecified.
• If we assume that the error terms e are all drawn
from the same normal distribution, it is possible to
show that the parameter estimates have a normal
distribution as well.
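A small simulation makes this concrete: refitting the same model on repeated error draws shows that the slope estimate b1 is itself a random variable, centered near the true β1 with a roughly normal spread. The data and parameter values below are illustrative assumptions.

```python
# Sketch: b1 varies across error draws (true beta1 = 2 here).
import numpy as np

rng = np.random.default_rng(1)
n, beta0, beta1 = 50, 5.0, 2.0
x = rng.uniform(0, 10, size=n)          # X values held fixed across draws

b1_draws = []
for _ in range(2000):
    e = rng.normal(scale=3.0, size=n)   # nature draws a new set of errors
    y = beta0 + beta1 * x + e
    b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # OLS slope estimate
    b1_draws.append(b1)

print(np.mean(b1_draws), np.std(b1_draws))   # centered near 2, bell-shaped spread
```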
14. T Statistic and P value
• t = (b1 – β1) / Sb1
• You can hypothesize a value for β1 (e.g., H0: β1 equals some
specified value), substitute it in the numerator, and carry
out the t test.
15. Are Individual Variables Significant?
• Use t tests of individual variable slopes
• Shows if there is a linear relationship between the
variable Xj and Y
• Hypotheses:
– H0: βj = 0 (no linear relationship)
– H1: βj ≠ 0 (linear relationship does exist
between Xj and Y)
16. Are Individual Variables Significant?
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist
between Xj and Y)
Test Statistic:
t = (bj – 0) / Sbj        (df = n – k – 1)
17. Regression Output: Coefficients

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept     -59.0661       11.28404       -5.23448   3.45E-05   -82.5325    -35.5996
OFF            -0.00696       0.04619       -0.15068   0.881663    -0.10302     0.089097
BAR             0.041988      0.005271       7.966651  8.81E-08     0.031028    0.052949
YNG             0.002716      0.000999       2.717326  0.012904     0.000637    0.004794
VEH             0.00147       0.000265       5.540878  1.69E-05     0.000918    0.002021
INV            -0.00274       0.001336      -2.05135   0.052914    -0.00552     3.78E-05
SPD            -0.2682        0.068418      -3.92009   0.000786    -0.41049    -0.12592

t statistics have n – (k + 1) degrees of freedom.
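As a check on one row of this table, the BAR statistics can be recomputed from the rounded coefficient and standard error, with df = n – k – 1 = 21:

```python
# t statistic and two-tailed P-value for BAR, from the table's rounded values.
from scipy import stats

b, s, df = 0.041988, 0.005271, 21
t = b / s                        # ~7.97 (table: 7.966651 before rounding)
p = 2 * stats.t.sf(abs(t), df)   # ~8.8e-08 (table: 8.81E-08)
print(t, p)
```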
18. Confidence Interval Estimate
for the Slope
• Confidence interval for the population slope βj
• bj ± t(n–k–1) × Sbj,    where t has (n – k – 1) d.f.
Example: Form a 95% confidence interval for the effect of
changes in BAR (bars) on fatal accidents:
0.041988 ± (2.079614)(0.005271)
So the interval is (0.031028, 0.052949)
(This interval does not contain zero, so BAR has a significant
effect on accidents.)
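The same interval can be checked in a couple of lines; scipy's t.ppf supplies the critical value 2.079614 used above:

```python
# 95% confidence interval for the BAR slope, df = 21.
from scipy import stats

b, s, df = 0.041988, 0.005271, 21
t_crit = stats.t.ppf(0.975, df)        # 2.079614
print(b - t_crit * s, b + t_crit * s)  # ~(0.031028, 0.052949), up to rounding
```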
20. Using Dummy Variables
• A dummy variable is a categorical explanatory
variable with two levels:
– yes or no, on or off, male or female
– coded as 0 or 1
• Regression intercepts are different if the
variable is significant
• Assumes equal slopes for other variables
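A minimal sketch of the idea, using synthetic data: the 0/1 dummy shifts the intercept for one group, while the slope on the other X variable is held equal across groups.

```python
# Sketch: dummy variable regression on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 60
x1 = rng.uniform(0, 10, size=n)
d = (rng.uniform(size=n) > 0.5).astype(float)    # dummy: 0 or 1
y = 3 + 1.5 * x1 + 4.0 * d + rng.normal(size=n)  # group effect = +4 on intercept

X = sm.add_constant(np.column_stack([x1, d]))
fit = sm.OLS(y, X).fit()
print(fit.params)   # [intercept, slope of x1, dummy coefficient]
```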
21. Interaction Between
Independent Variables
• Hypothesizes interaction between pairs of X
variables
– Response to one X variable may vary at different
levels of another X variable
• Contains cross-product term
Ŷ = b0 + b1X1 + b2X2 + b3X3
  = b0 + b1X1 + b2X2 + b3(X1X2)    (where X3 = X1X2)
22. Effect of Interaction
• Given:
Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
• Without interaction term, effect of X1 on Y is
measured by β1
• With interaction term, effect of X1 on Y is
measured by β1 + β3 X2
• Effect changes as X2 changes
23. Interaction Example
Suppose X2 is a dummy variable and the estimated regression equation is
Ŷ = 1 + 2X1 + 3X2 + 4X1X2

X2 = 1: Ŷ = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
X2 = 0: Ŷ = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1

[Figure: both lines plotted against X1 from 0 to 1.5; the X2 = 1 line has
the higher intercept (4 vs. 1) and the steeper slope (6 vs. 2).]

The slopes differ: the effect of X1 on Y depends on the value of X2.
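The two lines can be reproduced directly from the estimated equation; the slope on X1 is 2 when X2 = 0 and 2 + 4 = 6 when X2 = 1:

```python
# Fitted equation from the example, evaluated at both dummy levels.
def y_hat(x1, x2):
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

for x2 in (0, 1):
    print(x2, [y_hat(x1, x2) for x1 in (0.0, 0.5, 1.0, 1.5)])
# x2 = 0 -> [1.0, 2.0, 3.0, 4.0]    (intercept 1, slope 2)
# x2 = 1 -> [4.0, 7.0, 10.0, 13.0]  (intercept 4, slope 6)
```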
24. Residual Analysis
ei = Yi – Ŷi
• The residual for observation i, ei, is the difference between
its observed and predicted value
• Check the assumptions of regression by examining the
residuals
– Examine for linearity assumption
– Evaluate independence assumption
– Evaluate normal distribution assumption
– Examine for constant variance for all levels of X (homoscedasticity)
• Graphical Analysis of Residuals
– Can plot residuals vs. X
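A sketch of such a plot, with synthetic data standing in for a real data set: when the model is adequate, the residuals form a patternless band around zero.

```python
# Sketch: residuals vs. X for a simple fitted line (synthetic data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(size=100)

b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # OLS slope
b0 = y.mean() - b1 * x.mean()                    # OLS intercept
residuals = y - (b0 + b1 * x)                    # e_i = Y_i - Yhat_i

plt.scatter(x, residuals)
plt.axhline(0, color="gray")
plt.xlabel("X"); plt.ylabel("residuals")
plt.show()   # a patternless band around zero supports the assumptions
```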
25. Residual Analysis for
Independence
[Figure: residuals plotted against X. A cyclical or trending pattern
indicates the errors are not independent; a random scatter indicates
independence.]
26. Residual Analysis for
Equal Variance
[Figure: Y-vs-X scatter plots and the corresponding residual plots.
A fan shape, with residual spread growing as X increases, shows
non-constant variance; an even band around zero shows constant variance.]
27. Linear vs. Nonlinear Fit
[Figure: Y-vs-X scatter plots with linear and nonlinear fits and the
corresponding residual plots. A linear fit to curved data does not give
random residuals; a nonlinear fit gives random residuals.]
28. Quadratic Regression Model
Yi = β0 + β1X1i + β2X1i² + εi
Quadratic models may be considered when the scatter diagram takes on one of
the following shapes:
[Figure: four scatter shapes where a quadratic model may be appropriate:
β1 < 0, β2 > 0; β1 > 0, β2 > 0; β1 < 0, β2 < 0; β1 > 0, β2 < 0]
β1 = the coefficient of the linear term
β2 = the coefficient of the squared term
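Because the quadratic model is still linear in its parameters, it can be fit by ordinary least squares after adding the squared term as a second regressor. A minimal sketch with synthetic data (here β1 > 0, β2 > 0):

```python
# Sketch: quadratic regression via OLS on [x, x^2] (synthetic data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, size=80)
y = 1 + 0.5 * x + 2.0 * x**2 + rng.normal(size=80)   # beta2 > 0: opens upward

X = sm.add_constant(np.column_stack([x, x**2]))      # linear and squared terms
fit = sm.OLS(y, X).fit()
print(fit.params)   # estimates of beta0, beta1, beta2
```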