2. The Multiple Regression Model
Idea: examine the linear relationship between
one dependent variable (Y) and two or more independent variables (Xi)
Multiple Regression Model with k Independent Variables:
Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

where β0 is the Y-intercept, β1, …, βk are the population slopes,
and εi is the random error.
3. Assumptions of Regression
Use the acronym LINE:
• Linearity
– The underlying relationship between X and Y is linear
• Independence of Errors
– Error values are statistically independent
• Normality of Error
– Error values (ε) are normally distributed for any given value of X
• Equal Variance (Homoscedasticity)
– The probability distribution of the errors has constant variance
4. Regression Statistics
Multiple R          0.998368
R Square            0.996739
Adjusted R Square   0.995808
Standard Error      1.350151
Observations        28

r2 = SSR/SST = 11701.72/11740 = 0.996739: 99.674% of the variation in Y
is explained by the independent variables.

ANOVA
              df   SS         MS         F          Significance F
Regression     6   11701.72   1950.286   1069.876   5.54E-25
Residual      21   38.28108   1.822908
Total         27   11740
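The accident data set behind this output is not included with the slides, so the sketch below uses synthetic data with the same dimensions (n = 28, k = 6); it shows how the same summary quantities can be produced with statsmodels.

```python
# A minimal sketch, assuming synthetic data (the slide's accident
# data set is not available). Variable dimensions match the slide.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k = 28, 6
X = rng.normal(size=(n, k))                   # six hypothetical predictors
beta = np.array([1.0, 2.0, -1.5, 0.5, 0.0, 3.0])
y = 10 + X @ beta + rng.normal(size=n)        # linear model plus random error

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(np.sqrt(fit.rsquared))                  # Multiple R
print(fit.rsquared, fit.rsquared_adj)         # R Square, Adjusted R Square
print(np.sqrt(fit.mse_resid))                 # Standard Error of the estimate
print(fit.fvalue, fit.f_pvalue)               # ANOVA F and Significance F
```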
5. Adjusted r2
• r2 never decreases when a new X variable is
added to the model
– This can be a disadvantage when comparing models
• What is the net effect of adding a new variable?
– We lose a degree of freedom when a new X variable
is added
– Did the new X variable add enough explanatory
power to offset the loss of one degree of freedom?
6. Adjusted r2
• Shows the proportion of variation in Y explained
by all X variables adjusted for the number of X
variables used
r2adj = 1 – (1 – r2) × (n – 1) / (n – k – 1)
(where n = sample size, k = number of independent variables)
– Penalizes excessive use of unimportant independent
variables
– Smaller than r2
– Useful for comparing models
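As a check, plugging the values from the regression output above (r2 = 0.996739, n = 28, k = 6) into the formula reproduces the reported Adjusted R Square:

```python
# Adjusted r^2 computed from the slide's own numbers.
r2, n, k = 0.996739, 28, 6
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2_adj, 6))   # 0.995807 -- matches the reported 0.995808 up to rounding
```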
8. Is the Model Significant?
• F Test for Overall Significance of the Model
• Shows if there is a linear relationship between all of the
X variables considered together and Y
• Use F-test statistic
• Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent
variable affects Y)
9. F Test for Overall Significance
• Test statistic:
F = MSR / MSE = (SSR / k) / (SSE / (n – k – 1))

where F has k numerator degrees of freedom and
(n – k – 1) denominator degrees of freedom
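Recomputing F from the ANOVA table on the Regression Statistics slide (SSR = 11701.72, SSE = 38.28108, n = 28, k = 6) reproduces both the F statistic and Significance F:

```python
# F test recomputed from the ANOVA table; scipy supplies the p-value.
from scipy import stats

SSR, SSE, n, k = 11701.72, 38.28108, 28, 6
MSR = SSR / k                      # 1950.286
MSE = SSE / (n - k - 1)            # 1.822908
F = MSR / MSE                      # ~1069.876
p = stats.f.sf(F, k, n - k - 1)    # Significance F, ~5.5e-25
print(F, p)
```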
11. Multiple Regression Assumptions
Errors (residuals) from the regression model:
ei = (Yi – Ŷi)
Assumptions:
• The errors are normally distributed
• Errors have a constant variance
• The model errors are independent
12. Error terms and coefficient estimates
• Once we think of the error term as a random
variable, it becomes clear that the estimates
b1, b2, … (as distinguished from the true values
β1, β2, …) are also random variables: the estimates
produced by minimizing SSE depend on the particular
error values e that nature draws for each individual
in the data set.
13. Statistical Inference and Goodness of Fit
• The parameter estimates are themselves random
variables, dependent upon the random variables e.
• Thus, each estimate can be thought of as a draw
from some underlying probability distribution, the
nature of that distribution as yet unspecified.
• If we assume that the error terms e are all drawn
from the same normal distribution, it is possible to
show that the parameter estimates have a normal
distribution as well.
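A small simulation makes this concrete: refitting the same model on repeated error draws shows that the slope estimate b1 is itself a random variable, centered near the true β1 with a roughly normal spread. The data and parameter values below are illustrative assumptions.

```python
# Sketch: b1 varies across error draws (true beta1 = 2 here).
import numpy as np

rng = np.random.default_rng(1)
n, beta0, beta1 = 50, 5.0, 2.0
x = rng.uniform(0, 10, size=n)          # X values held fixed across draws

b1_draws = []
for _ in range(2000):
    e = rng.normal(scale=3.0, size=n)   # nature draws a new set of errors
    y = beta0 + beta1 * x + e
    b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # OLS slope estimate
    b1_draws.append(b1)

print(np.mean(b1_draws), np.std(b1_draws))   # centered near 2, bell-shaped spread
```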
14. T Statistic and P value
• t = (b1 – β1) / Sb1
• You can hypothesize a value for β1 (e.g., H0: β1 equals some
specified value), substitute it in the numerator, and carry
out the t test.
15. Are Individual Variables Significant?
• Use t tests of individual variable slopes
• Shows if there is a linear relationship between the
variable Xj and Y
• Hypotheses:
– H0: βj = 0 (no linear relationship)
– H1: βj ≠ 0 (linear relationship does exist
between Xj and Y)
16. Are Individual Variables Significant?
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist
between Xj and Y)
Test Statistic:
t = (bj – 0) / Sbj        (df = n – k – 1)
17. Regression Output: Coefficients

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept     -59.0661       11.28404       -5.23448   3.45E-05   -82.5325    -35.5996
OFF            -0.00696       0.04619       -0.15068   0.881663    -0.10302     0.089097
BAR             0.041988      0.005271       7.966651  8.81E-08     0.031028    0.052949
YNG             0.002716      0.000999       2.717326  0.012904     0.000637    0.004794
VEH             0.00147       0.000265       5.540878  1.69E-05     0.000918    0.002021
INV            -0.00274       0.001336      -2.05135   0.052914    -0.00552     3.78E-05
SPD            -0.2682        0.068418      -3.92009   0.000786    -0.41049    -0.12592

t statistics have n – (k + 1) degrees of freedom.
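As a check on one row of this table, the BAR statistics can be recomputed from the rounded coefficient and standard error, with df = n – k – 1 = 21:

```python
# t statistic and two-tailed P-value for BAR, from the table's rounded values.
from scipy import stats

b, s, df = 0.041988, 0.005271, 21
t = b / s                        # ~7.97 (table: 7.966651 before rounding)
p = 2 * stats.t.sf(abs(t), df)   # ~8.8e-08 (table: 8.81E-08)
print(t, p)
```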
18. Confidence Interval Estimate
for the Slope
• Confidence interval for the population slope βj
• bj ± t(n–k–1) × Sbj,    where t has (n – k – 1) d.f.
Example: Form a 95% confidence interval for the effect of
changes in BAR (bars) on fatal accidents:
0.041988 ± (2.079614)(0.005271)
So the interval is (0.031028, 0.052949)
(This interval does not contain zero, so BAR has a significant
effect on accidents.)
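The same interval can be checked in a couple of lines; scipy's t.ppf supplies the critical value 2.079614 used above:

```python
# 95% confidence interval for the BAR slope, df = 21.
from scipy import stats

b, s, df = 0.041988, 0.005271, 21
t_crit = stats.t.ppf(0.975, df)        # 2.079614
print(b - t_crit * s, b + t_crit * s)  # ~(0.031028, 0.052949), up to rounding
```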
20. Using Dummy Variables
• A dummy variable is a categorical explanatory
variable with two levels:
– yes or no, on or off, male or female
– coded as 0 or 1
• Regression intercepts are different if the
variable is significant
• Assumes equal slopes for other variables
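A minimal sketch of the idea, using synthetic data: the 0/1 dummy shifts the intercept for one group, while the slope on the other X variable is held equal across groups.

```python
# Sketch: dummy variable regression on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 60
x1 = rng.uniform(0, 10, size=n)
d = (rng.uniform(size=n) > 0.5).astype(float)    # dummy: 0 or 1
y = 3 + 1.5 * x1 + 4.0 * d + rng.normal(size=n)  # group effect = +4 on intercept

X = sm.add_constant(np.column_stack([x1, d]))
fit = sm.OLS(y, X).fit()
print(fit.params)   # [intercept, slope of x1, dummy coefficient]
```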
21. Interaction Between
Independent Variables
• Hypothesizes interaction between pairs of X
variables
– Response to one X variable may vary at different
levels of another X variable
• Contains cross-product term
Ŷ = b0 + b1X1 + b2X2 + b3X3
  = b0 + b1X1 + b2X2 + b3(X1X2)    (where X3 = X1X2)
22. Effect of Interaction
• Given:
Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
• Without interaction term, effect of X1 on Y is
measured by β1
• With interaction term, effect of X1 on Y is
measured by β1 + β3 X2
• Effect changes as X2 changes
23. Interaction Example
Suppose X2 is a dummy variable and the estimated regression equation is
Ŷ = 1 + 2X1 + 3X2 + 4X1X2

X2 = 1: Ŷ = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
X2 = 0: Ŷ = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1

[Figure: both lines plotted against X1 from 0 to 1.5; the X2 = 1 line has
the higher intercept (4 vs. 1) and the steeper slope (6 vs. 2).]

The slopes differ: the effect of X1 on Y depends on the value of X2.
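The two lines can be reproduced directly from the estimated equation; the slope on X1 is 2 when X2 = 0 and 2 + 4 = 6 when X2 = 1:

```python
# Fitted equation from the example, evaluated at both dummy levels.
def y_hat(x1, x2):
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

for x2 in (0, 1):
    print(x2, [y_hat(x1, x2) for x1 in (0.0, 0.5, 1.0, 1.5)])
# x2 = 0 -> [1.0, 2.0, 3.0, 4.0]    (intercept 1, slope 2)
# x2 = 1 -> [4.0, 7.0, 10.0, 13.0]  (intercept 4, slope 6)
```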
24. Residual Analysis
ei = Yi – Ŷi
• The residual for observation i, ei, is the difference between
its observed and predicted value
• Check the assumptions of regression by examining the
residuals
– Examine for linearity assumption
– Evaluate independence assumption
– Evaluate normal distribution assumption
– Examine for constant variance for all levels of X (homoscedasticity)
• Graphical Analysis of Residuals
– Can plot residuals vs. X
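A sketch of such a plot, with synthetic data standing in for a real data set: when the model is adequate, the residuals form a patternless band around zero.

```python
# Sketch: residuals vs. X for a simple fitted line (synthetic data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(size=100)

b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # OLS slope
b0 = y.mean() - b1 * x.mean()                    # OLS intercept
residuals = y - (b0 + b1 * x)                    # e_i = Y_i - Yhat_i

plt.scatter(x, residuals)
plt.axhline(0, color="gray")
plt.xlabel("X"); plt.ylabel("residuals")
plt.show()   # a patternless band around zero supports the assumptions
```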
25. Residual Analysis for
Independence
[Figure: residuals plotted against X. A cyclical or trending pattern
indicates the errors are not independent; a random scatter indicates
independence.]
26. Residual Analysis for
Equal Variance
[Figure: Y-vs-X scatter plots and the corresponding residual plots.
A fan shape, with residual spread growing as X increases, shows
non-constant variance; an even band around zero shows constant variance.]
27. Linear vs. Nonlinear Fit
[Figure: Y-vs-X scatter plots with linear and nonlinear fits and the
corresponding residual plots. A linear fit to curved data does not give
random residuals; a nonlinear fit gives random residuals.]
28. Quadratic Regression Model
Yi = β0 + β1X1i + β2X1i² + εi
Quadratic models may be considered when the scatter diagram takes on one of
the following shapes:
[Figure: four scatter shapes where a quadratic model may be appropriate:
β1 < 0, β2 > 0; β1 > 0, β2 > 0; β1 < 0, β2 < 0; β1 > 0, β2 < 0]
β1 = the coefficient of the linear term
β2 = the coefficient of the squared term
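Because the quadratic model is still linear in its parameters, it can be fit by ordinary least squares after adding the squared term as a second regressor. A minimal sketch with synthetic data (here β1 > 0, β2 > 0):

```python
# Sketch: quadratic regression via OLS on [x, x^2] (synthetic data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, size=80)
y = 1 + 0.5 * x + 2.0 * x**2 + rng.normal(size=80)   # beta2 > 0: opens upward

X = sm.add_constant(np.column_stack([x, x**2]))      # linear and squared terms
fit = sm.OLS(y, X).fit()
print(fit.params)   # estimates of beta0, beta1, beta2
```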