1. Simple Linear Regression 1. review of least squares procedure 2. inference for least squares lines
2.
3. The Model House size House Cost Most lots sell for $25,000 Building a house costs about $75 per square foot. House cost = 25000 + 75(Size) The model has a deterministic and a probabilistic components
4. The Model House cost = 25000 + 75(Size) House size House Cost Most lots sell for $25,000 However, house cost vary even among same size houses! Since cost behave unpredictably, we add a random component.
5.
6.
7. The Least Squares (Regression) Line A good line is one that minimizes the sum of squared differences between the points and the line.
8. The Least Squares (Regression) Line 3 3 4 4 (1,2) 2 2 (2,4) (3,1.5) Sum of squared differences = (2 - 1) 2 + (4 - 2) 2 + (1.5 - 3) 2 + (4,3.2) (3.2 - 4) 2 = 6.89 2.5 Let us compare two lines The second line is horizontal The smaller the sum of squared differences the better the fit of the line to the data. 1 1 Sum of squared differences = (2 -2.5) 2 + (4 - 2.5) 2 + (1.5 - 2.5) 2 + (3.2 - 2.5) 2 = 3.99
9. The Estimated Coefficients To calculate the estimates of the slope and intercept of the least squares line , use the formulas: The regression equation that estimates the equation of the first order linear model is: Alternate formula for the slope b 1
14. Interpreting the Linear Regression -Equation This is the slope of the line. For each additional mile on the odometer, the price decreases by an average of $0.0623 The intercept is b 0 = $17067. 0 No data Do not interpret the intercept as the “ Price of cars that have not been driven” 17067
15.
16. The Normality of From the first three assumptions we have: y is normally distributed with mean E(y) = 0 + 1 x, and a constant standard deviation x 1 x 2 x 3 The standard deviation remains constant, but the mean value changes with x 0 + 1 x 1 0 + 1 x 2 0 + 1 x 3 E(y|x 2 ) E(y|x 3 ) E(y|x 1 )
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28. Coefficient of determination x 1 x 2 y 1 y 2 Two data points (x 1 ,y 1 ) and (x 2 ,y 2 ) of a certain sample are shown. Total variation in y = Variation explained by the regression line + Unexplained variation (error) Variation in y = SSR + SSE y
29.
30.
31.
32.
33.
34.
35.
36.
37.
38.
39.
40.
41.
42.
43. For each residual we calculate the standard deviation as follows: A Partial list of Standard residuals Residual Analysis Standardized residual ‘i’ = Residual ‘i’ Standard deviation
44. It seems the residual are normally distributed with mean zero Residual Analysis
45.
46.
47.
48. Patterns in the appearance of the residuals over time indicates that autocorrelation exists. + + + + + + + + + + + + + + + + + + + + + + + + + Time Residual Residual Time + + + Note the runs of positive residuals, replaced by runs of negative residuals Note the oscillating behavior of the residuals around zero. 0 0 Non Independence of Error Variables
49.
50. + + + + + + + + + + + + + + + + The outlier causes a shift in the regression line … but, some outliers may be very influential An outlier An influential observation + + + + + + + + + + +