Lots of neat examples of how to use and interpret dummy variables in regression analysis. Created by Professor Marsh for his introductory statistics course at the University of Notre Dame, Notre Dame, Indiana.
2. Multiple Regression. Sandy: pages 603-613. Also read the paper titled “Using Dummy Variables in Wage Discrimination Cases”.
3. Are Male Nurses Discriminated Against? [Figure: scatter plot of wages of male and female nurses against years of experience, $X_i$, comparing the male-female difference in mean wages not adjusted for experience with the difference adjusted for experience.]
4. I. Dummy Variables: adjusting the intercept; adjusting the slope; adjusting both intercept and slope.
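5. Intercept Dummy Variables. Dummy variables are binary (0, 1): $D_t = 1$ if red car, $D_t = 0$ otherwise. Model: $y_t = \beta_1 + \beta_2 X_t + \beta_3 D_t + e_t$, where $y_t$ = speed of car in miles per hour and $X_t$ = age of car in years. Police: red cars travel faster. $H_0: \beta_3 = 0$ vs. $H_1: \beta_3 > 0$.
6. $y_t = \beta_1 + \beta_2 X_t + \beta_3 D_t + e_t$. Red cars: $y_t = (\beta_1 + \beta_3) + \beta_2 X_t + e_t$. Other cars: $y_t = \beta_1 + \beta_2 X_t + e_t$. [Figure: miles per hour ($y_t$) against age in years ($X_t$); two parallel lines with common slope $\beta_2$, intercept $\beta_1 + \beta_3$ for red cars and $\beta_1$ for other cars.]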
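The red-car model can be tried on simulated data. The following is only a minimal sketch: the coefficient values, sample size, and noise level are assumptions chosen for illustration, not figures from the lecture. It fits the intercept-dummy regression by least squares and reads off the two intercepts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(0, 15, n)        # X_t: age of car in years
red = rng.integers(0, 2, n)        # D_t: 1 if red car, 0 otherwise

# Assumed "true" coefficients, made up for this illustration only.
beta1, beta2, beta3 = 70.0, -1.0, 5.0
speed = beta1 + beta2 * age + beta3 * red + rng.normal(0, 2, n)

# Design matrix: column of ones, age, and the intercept dummy.
X = np.column_stack([np.ones(n), age, red])
b, *_ = np.linalg.lstsq(X, speed, rcond=None)

intercept_other = b[0]             # estimate of beta1
intercept_red = b[0] + b[2]        # estimate of beta1 + beta3
print("other cars intercept:", intercept_other)
print("red cars intercept:  ", intercept_red)
print("common slope:        ", b[1])
```

The dummy shifts only the intercept, so both groups share the single slope estimate `b[1]`.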
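7. Slope Dummy Variables. $y_t = \beta_1 + \beta_2 X_t + \beta_3 D_t X_t + e_t$. Stock portfolio ($D_t = 1$): $y_t = \beta_1 + (\beta_2 + \beta_3) X_t + e_t$. Bond portfolio ($D_t = 0$): $y_t = \beta_1 + \beta_2 X_t + e_t$. $\beta_1$ = initial investment. [Figure: value of portfolio ($y_t$) against years ($X_t$); both lines start at $\beta_1$, with slope $\beta_2 + \beta_3$ for stocks and $\beta_2$ for bonds.]
8. Different Intercepts & Slopes. $y_t = \beta_1 + \beta_2 X_t + \beta_3 D_t + \beta_4 D_t X_t + e_t$. “Miracle” seed ($D_t = 1$): $y_t = (\beta_1 + \beta_3) + (\beta_2 + \beta_4) X_t + e_t$. Regular seed ($D_t = 0$): $y_t = \beta_1 + \beta_2 X_t + e_t$. [Figure: harvest weight of corn ($y_t$) against rainfall ($X_t$); intercept $\beta_1 + \beta_3$ and slope $\beta_2 + \beta_4$ for “miracle” seed, intercept $\beta_1$ and slope $\beta_2$ for regular seed.]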
9. Testing for discrimination in starting wage. $y_t = \beta_1 + \beta_2 X_t + \beta_3 D_t + e_t$. For men $D_t = 1$; for women $D_t = 0$. Men: $y_t = (\beta_1 + \beta_3) + \beta_2 X_t + e_t$. Women: $y_t = \beta_1 + \beta_2 X_t + e_t$. $H_0: \beta_3 = 0$ vs. $H_1: \beta_3 > 0$. [Figure: wage rate ($y_t$) against years of experience ($X_t$); parallel lines with slope $\beta_2$, intercept $\beta_1 + \beta_3$ for men and $\beta_1$ for women.]
10. $y_t = \beta_1 + \beta_5 X_t + \beta_6 D_t X_t + e_t$. For men $D_t = 1$; for women $D_t = 0$. Men: $y_t = \beta_1 + (\beta_5 + \beta_6) X_t + e_t$. Women: $y_t = \beta_1 + \beta_5 X_t + e_t$. Men and women have the same starting wage, $\beta_1$, but their wage rates increase at different rates (difference = $\beta_6$). $\beta_6 > 0$ means that men's wage rates are increasing faster than women's wage rates. [Figure: wage rate against years of experience; common intercept $\beta_1$, slope $\beta_5 + \beta_6$ for men and $\beta_5$ for women.]
11. An Ineffective Affirmative Action Plan. $y_t = \beta_1 + \beta_2 X_t + \beta_3 D_t + \beta_4 D_t X_t + e_t$. Men: $y_t = (\beta_1 + \beta_3) + (\beta_2 + \beta_4) X_t + e_t$. Women: $y_t = \beta_1 + \beta_2 X_t + e_t$. Women are given a higher starting wage, $\beta_1$, while men get the lower starting wage, $\beta_1 + \beta_3$ (note: $\beta_3 < 0$). But men get a faster rate of increase in their wages, $\beta_2 + \beta_4$, which is higher than the rate of increase for women, $\beta_2$ (since $\beta_4 > 0$). [Figure: wage rate against years of experience; the women's line starts higher but the men's line is steeper.]
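13. $Y_t = \beta_1 + \beta_2 X_t + \beta_3 D_t + \beta_4 D_t X_t + e_t$, with men: $D_t = 1$; women: $D_t = 0$. Testing for discrimination in starting wage (intercept): $H_0: \beta_3 = 0$ vs. $H_1: \beta_3 > 0$, using $\dfrac{b_3}{\sqrt{\text{Est. Var}(b_3)}} \sim t_{n-4}$. Testing for discrimination in wage increases (slope): $H_0: \beta_4 = 0$ vs. $H_1: \beta_4 > 0$, using $\dfrac{b_4}{\sqrt{\text{Est. Var}(b_4)}} \sim t_{n-4}$.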
14. Why NOW wants one-sided test and Chauvinist Industries wants two-sided.
15. Are Two Regressions Equal? Variations of “The Chow Test”. I. Assuming equal variances (pooling): $y_t = \beta_1 + \beta_2 X_t + \beta_3 D_t + \beta_4 D_t X_t + e_t$, where $y_t$ = wage rate and $X_t$ = years of experience; men: $D_t = 1$; women: $D_t = 0$. $H_0: \beta_3 = \beta_4 = 0$ vs. $H_1$: otherwise. This model assumes equal wage rate variance.
16. Testing $H_0: \beta_3 = \beta_4 = 0$ (intercept and slope) vs. $H_1$: otherwise. $SSE_R = \sum_{t=1}^{T} (y_t - b_1 - b_2 X_t)^2$ and $SSE_U = \sum_{t=1}^{T} (y_t - b_1 - b_2 X_t - b_3 D_t - b_4 D_t X_t)^2$. Test statistic: $\dfrac{(SSE_R - SSE_U)/2}{SSE_U/(T-4)} \sim F_{2,\,T-4}$.
17. II. Allowing for unequal variances (running three regressions). Everyone: $y_t = \beta_1 + \beta_2 X_t + e_t$, giving $SSE_R$ (forcing men and women to have the same $\beta_1$, $\beta_2$). Men only: $y_{tm} = \delta_1 + \delta_2 X_{tm} + e_{tm}$, giving $SSE_m$. Women only: $y_{tw} = \gamma_1 + \gamma_2 X_{tw} + e_{tw}$, giving $SSE_w$ (allowing men and women to be different). $F = \dfrac{(SSE_R - SSE_U)/J}{SSE_U/(T-K)}$, where $SSE_U = SSE_m + SSE_w$, $J$ = number of restrictions = 2, and $K$ = number of unrestricted coefficients = 4.
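A minimal sketch of the Chow-type F statistic on simulated wage data follows. The data-generating values and the helper name `sse` are assumptions made up for this example. It computes the unrestricted sum of squared errors two ways: from the single regression with intercept and slope dummies, and as the sum from separate men-only and women-only regressions. The two routes give the same number, because the fully interacted dummy model fits each group with its own line.

```python
import numpy as np

def sse(X, y):
    """Sum of squared least-squares residuals from regressing y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)

rng = np.random.default_rng(1)
T = 120
x = rng.uniform(0, 20, T)                  # years of experience
d = (np.arange(T) % 2).astype(float)       # men: 1, women: 0
# Assumed coefficients, for illustration only.
y = 10 + 0.5 * x + 2 * d + 0.3 * d * x + rng.normal(0, 1, T)

ones = np.ones(T)
sse_r = sse(np.column_stack([ones, x]), y)             # restricted: same line for all
sse_u = sse(np.column_stack([ones, x, d, d * x]), y)   # unrestricted: dummy model

men = d == 1
sse_m = sse(np.column_stack([ones[men], x[men]]), y[men])      # men only
sse_w = sse(np.column_stack([ones[~men], x[~men]]), y[~men])   # women only

J, K = 2, 4                                # restrictions, unrestricted coefficients
F = ((sse_r - sse_u) / J) / (sse_u / (T - K))
print("SSE_U from dummy model:", sse_u)
print("SSE_m + SSE_w:         ", sse_m + sse_w)
print("F statistic:           ", F)
```

The F value would then be compared with a critical value from the F distribution with J and T - K degrees of freedom.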
18. Polynomial Terms. Polynomial regression: $y_t = \beta_1 + \beta_2 X_t + \beta_3 X_t^2 + \beta_4 X_t^3 + e_t$, with $y_t$ = income and $X_t$ = age. Linear in parameters but nonlinear in variables. People retire at different ages or not at all. [Figure: income ($y_t$) against age ($X_t$), with the age axis running from 20 to 90.]
19. Polynomial Regression. $y_t = \beta_1 + \beta_2 X_t + \beta_3 X_t^2 + \beta_4 X_t^3 + e_t$, with $y_t$ = income and $X_t$ = age. Rate at which income is changing as we age: $\dfrac{\partial y_t}{\partial X_t} = \beta_2 + 2\beta_3 X_t + 3\beta_4 X_t^2$. The slope changes as $X_t$ changes.
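As a quick check on the slope formula for the cubic income profile, the sketch below compares the analytic derivative with a numerical one. The coefficient values and function names are hypothetical, picked only so the arithmetic is easy to follow.

```python
def income(age, beta1, beta2, beta3, beta4):
    """Cubic income profile (systematic part, without the error term)."""
    return beta1 + beta2 * age + beta3 * age ** 2 + beta4 * age ** 3

def income_slope(age, beta2, beta3, beta4):
    """Analytic derivative of the cubic profile with respect to age."""
    return beta2 + 2.0 * beta3 * age + 3.0 * beta4 * age ** 2

# Hypothetical coefficients, for illustration only.
coefs = dict(beta1=-20.0, beta2=3.0, beta3=-0.02, beta4=-0.0001)

age0, h = 40.0, 1e-5
analytic = income_slope(age0, coefs["beta2"], coefs["beta3"], coefs["beta4"])
# Central finite difference as an independent check.
numeric = (income(age0 + h, **coefs) - income(age0 - h, **coefs)) / (2 * h)
print(analytic, numeric)
```

With these made-up numbers the slope at age 40 is 3 - 1.6 - 0.48 = 0.92, and it declines as age rises because the squared and cubed terms carry negative coefficients.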
20. Continuous Interaction. $y_t = \beta_1 + \beta_2 Z_t + \beta_3 B_t + \beta_4 Z_t B_t + e_t$. Exam grade = f(sleep: $Z_t$, study time: $B_t$). Sleep and study time do not act independently. More study time will be more effective when combined with more sleep and less effective when combined with less sleep.
21. Continuous interaction. $y_t = \beta_1 + \beta_2 Z_t + \beta_3 B_t + \beta_4 Z_t B_t + e_t$. Exam grade = f(sleep: $Z_t$, study time: $B_t$). Your mind sorts things out while you sleep (when you have things to sort out). Your studying is more effective with more sleep: $\dfrac{\partial y_t}{\partial B_t} = \beta_3 + \beta_4 Z_t$ and $\dfrac{\partial y_t}{\partial Z_t} = \beta_2 + \beta_4 B_t$.
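22. $y_t = \beta_1 + \beta_2 Z_t + \beta_3 B_t + \beta_4 Z_t B_t + e_t$. Exam grade = f(sleep: $Z_t$, study time: $B_t$). If $Z_t + B_t = 24$ hours, then $B_t = (24 - Z_t)$, so $y_t = \beta_1 + \beta_2 Z_t + \beta_3 (24 - Z_t) + \beta_4 Z_t (24 - Z_t) + e_t$, which simplifies to $y_t = (\beta_1 + 24\beta_3) + (\beta_2 - \beta_3 + 24\beta_4) Z_t - \beta_4 Z_t^2 + e_t$. Write this as $y_t = \delta_1 + \delta_2 Z_t + \delta_3 Z_t^2 + e_t$, where $\delta_2 > 0$ and $\delta_3 < 0$. Sleep needed to maximize your exam grade: $\dfrac{\partial y_t}{\partial Z_t} = \delta_2 + 2\delta_3 Z_t = 0$, so $Z_t = \dfrac{-\delta_2}{2\delta_3}$.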
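The substitution of study time = 24 minus sleep can be turned into a small calculation. This is a sketch only: the function name `optimal_sleep`, the names `delta2` and `delta3` for the linear and quadratic coefficients of the reduced equation, and the example coefficient values are all assumptions introduced for illustration.

```python
def optimal_sleep(beta2, beta3, beta4, hours=24.0):
    """Hours of sleep that maximize the fitted exam grade when
    sleep + study time = hours (the 24-hour constraint from the slide)."""
    delta2 = beta2 - beta3 + hours * beta4   # linear coefficient after substitution
    delta3 = -beta4                          # quadratic coefficient after substitution
    if delta3 >= 0:
        raise ValueError("need a negative quadratic coefficient for a maximum")
    return -delta2 / (2.0 * delta3)

# Hypothetical coefficients: beta2 (sleep), beta3 (study), beta4 (interaction).
z_star = optimal_sleep(beta2=1.0, beta3=2.0, beta4=0.5)
print(z_star)
```

With these made-up values the linear coefficient is 1 - 2 + 12 = 11 and the quadratic coefficient is -0.5, so the formula gives 11 hours. The check on the sign of the quadratic coefficient mirrors the slide's condition that it be negative.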
24. Let yi represent the ith person's wage rate and Xi represent their months of work experience in the equation: yi = b1 + b2 Xi + ei (1), where b1 = intercept (starting wage), b2 = increase in the person's wage for each additional month of work experience, and ei = error term with mean zero and estimated variance s^2.
25. yi = b1 + b2 Xi + b3 Mi + b4 Fi + ei (2) Fi = 1 if female Fi = 0 if male . Mi = 1 if male Mi = 0 if female .
26. yi = b1 + b2 Xi + b3 Mi + b4 Fi + ei (2) Unfortunately this equation contains an underidentified set of parameters (b1, b3, and b4) and cannot be estimated without some restriction on the coefficients.
27. To see this point, separate out the men's equation implied by equation (2) from the women's equation. For the men's equation Mi =1 and Fi =0. For men , equation (2) becomes: yi = (b1 + b3) + b2 Xi + ei (3) yi = b1 + b2 Xi + b3 Mi + b4 Fi + ei (2)
28. For women , Mi =0 and Fi =1. For women , equation (2) becomes: yi = (b1 + b4) + b2 Xi + ei (4)
29. Unfortunately, although we get estimates of the intercepts (b1 + b3) and (b1 + b4), the value of b1 cannot be separated from the values of b3 and b4. Some restriction is needed to achieve identification of b1, b3 and b4.
30. One such restriction is b1 = 0. We can drop the original intercept term, b1, since men and women already have their own intercept terms, b3 and b4 , respectively.
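31. Underidentification of equation (2) can also be expressed in matrix terms. First, rewrite equation (2) putting the explanatory variables in a row vector multiplied by the corresponding column vector of their respective coefficients:
$$ y_i = \begin{bmatrix} 1 & X_i & M_i & F_i \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix} + e_i \qquad (5) $$
32. This only represents the ith observation where i = 1, ..., n. To represent the entire set of n observations at once, we need to "pull the window shade down" as follows:
$$ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & X_1 & M_1 & F_1 \\ 1 & X_2 & M_2 & F_2 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & X_n & M_n & F_n \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} \qquad (6) $$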
33. Equation (6) presents us with an X matrix whose first column (the column of ones) is an exact linear combination of the last two columns (the M and F columns). Since Mi is always zero when Fi is equal to one and Mi is always one when Fi is equal to zero, then it always holds that Mi + Fi = 1. Therefore, the first column is equal to the sum of the last two columns.
34. Since Mi is always zero when Fi is equal to one and Mi is always one when Fi is equal to zero, then it always holds that Mi + Fi = 1:
$$ \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} = \begin{bmatrix} M_1 \\ M_2 \\ \vdots \\ M_n \end{bmatrix} + \begin{bmatrix} F_1 \\ F_2 \\ \vdots \\ F_n \end{bmatrix} \qquad (9) $$
35. Equation (6) and, therefore, equation (2) represent a case of perfect multicollinearity. This means that a restriction must be introduced that drops one of these columns out of the regression. One such restriction is b1 = 0, which means dropping the original intercept out of the regression model to provide the following reduced model: yi = b2 Xi + b3 Mi + b4 Fi + ei (10). Now men and women have separate intercepts and no common intercept is necessary.
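The perfect multicollinearity can be seen numerically by checking the column rank of the X matrix. The six observations below are invented for illustration. With the column of ones, experience, M, and F together, the matrix has four columns but only rank three. Dropping the column of ones leaves three linearly independent columns.

```python
import numpy as np

# Invented data: 1 = male in M, and F is its complement, so M + F = 1.
M = np.array([1, 0, 1, 1, 0, 0], dtype=float)
F = 1.0 - M
X = np.array([2, 5, 7, 1, 3, 8], dtype=float)   # months of experience

full = np.column_stack([np.ones(6), X, M, F])   # intercept plus both dummies
reduced = np.column_stack([X, M, F])            # original intercept dropped

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print("columns:", full.shape[1], "rank:", rank_full)
print("columns:", reduced.shape[1], "rank:", rank_reduced)
```

A rank below the number of columns means the cross-product matrix of the regressors cannot be inverted, which is why the coefficients on the column of ones, M, and F cannot be separated without a restriction.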
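36. $y_i = b_2 X_i + b_3 M_i + b_4 F_i + e_i$. For males $M_i = 1$ and $F_i = 0$: $y_i = b_3 + b_2 X_i + e_i$. For females $M_i = 0$ and $F_i = 1$: $y_i = b_4 + b_2 X_i + e_i$. Males and females have different starting salaries, $b_3 > b_4$, but their salaries increase at the same rate, $b_2$. [Figure: $y_i$ against years of experience ($X_i$); parallel lines with slope $b_2$, intercept $b_3$ for males and $b_4$ for females.]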
38. $y_i = b_1 + b_5 M_i X_i + b_6 F_i X_i + e_i$. For males $M_i = 1$ and $F_i = 0$: $y_i = b_1 + b_5 X_i + e_i$. For females $M_i = 0$ and $F_i = 1$: $y_i = b_1 + b_6 X_i + e_i$. Males and females have the same starting salary, $b_1$, but their salaries increase at different rates ($b_5$ vs. $b_6$). $b_5 > b_6$ means that men's salaries are increasing faster than women's salaries. [Figure: $y_i$ against years of experience ($X_i$); common intercept $b_1$, slope $b_5$ for males and $b_6$ for females.]
39. Chauvinist Industries Affirmative Action Plan. $y_i = b_3 M_i + b_4 F_i + b_5 M_i X_i + b_6 F_i X_i + e_i$. For males $M_i = 1$ and $F_i = 0$: $y_i = b_3 + b_5 X_i + e_i$. For females $M_i = 0$ and $F_i = 1$: $y_i = b_4 + b_6 X_i + e_i$. Females start with a higher starting salary, $b_4$, while men get the lower starting salary, $b_3$. But men get a faster rate of increase in their salaries, $b_5$, which is higher than the rate of increase for females, $b_6$ ($b_5 > b_6$). [Figure: $y_i$ against years of experience ($X_i$); the female line starts higher, the male line is steeper.]
40. Back to our basic model: $y_i = b_2 X_i + b_3 M_i + b_4 F_i + e_i$. For males $M_i = 1$ and $F_i = 0$: $y_i = b_3 + b_2 X_i + e_i$. For females $M_i = 0$ and $F_i = 1$: $y_i = b_4 + b_2 X_i + e_i$. Males and females have different starting salaries, $b_3 > b_4$, but their salaries increase at the same rate, $b_2$. [Figure: $y_i$ against years of experience ($X_i$); parallel lines with intercepts $b_3$ (male) and $b_4$ (female).]
41. Since under our null hypothesis the raw score test statistic, $b_3 - b_4$, has a mean of zero and a variance $\mathrm{Var}(b_3 - b_4)$, we can standardize by subtracting the mean (zero) and dividing by the standard deviation (square root of the variance) to get the standardized test statistic: $\dfrac{b_3 - b_4}{\sqrt{\mathrm{Var}(b_3 - b_4)}}$.
42. To test the null hypothesis: $Z = \dfrac{(b_3 - b_4) - 0}{\sqrt{\mathrm{Var}(b_3 - b_4)}} \sim N(0, 1)$.
43. If the variance of the $y_i$, $\sigma^2$, is unknown, then $\mathrm{Var}(b_3 - b_4)$ is also unknown and must be estimated from the expression: $\text{Est. Var}(b_3 - b_4) = \text{Est. Var}(b_3) + \text{Est. Var}(b_4) - 2\,\text{Est. Cov}(b_3, b_4)$.
44. Use the sample variance, $s^2$, as an estimator of the population variance, $\sigma^2$.
45. The values for the following expression are obtained in practice from the diagonal and off-diagonal elements of the estimated variance-covariance matrix: $\text{Est. Var}(b_3 - b_4) = \text{Est. Var}(b_3) + \text{Est. Var}(b_4) - 2\,\text{Est. Cov}(b_3, b_4)$.
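The sketch below shows one way the diagonal and off-diagonal elements might be pulled out in practice. The data are simulated with assumed coefficients, and the estimated variance-covariance matrix is computed as the sample error variance times the inverse cross-product matrix of the regressors, which is the usual least-squares formula. The regressors are ordered X, M, F, so positions 1 and 2 hold the male and female intercepts.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 120, n)                      # months of experience
m = (rng.uniform(size=n) < 0.5).astype(float)   # 1 if male
f = 1.0 - m                                     # 1 if female
# Assumed coefficients, for illustration only.
y = 0.05 * x + 12.0 * m + 10.0 * f + rng.normal(0, 1.5, n)

X = np.column_stack([x, m, f])                  # no common intercept
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                           # b[0]: slope, b[1]: male, b[2]: female
resid = y - X @ b
s2 = resid @ resid / (n - X.shape[1])           # sample variance of the errors
cov_b = s2 * XtX_inv                            # estimated variance-covariance matrix

# Est.Var(b3 - b4) = Est.Var(b3) + Est.Var(b4) - 2 Est.Cov(b3, b4)
var_diff = cov_b[1, 1] + cov_b[2, 2] - 2.0 * cov_b[1, 2]
t_stat = (b[1] - b[2]) / np.sqrt(var_diff)
print("estimated variance of the difference:", var_diff)
print("standardized test statistic:         ", t_stat)
```

Because the variance is estimated rather than known, the standardized statistic would be compared with a t distribution rather than the standard normal.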
46. Alternative: make women the default group. $\hat{y}_i = b_1 + b_2 X_i + b_3 M_i$. Male: $\hat{y}_i = (b_1 + b_3) + b_2 X_i$. Female: $\hat{y}_i = b_1 + b_2 X_i$. Males and females have different starting salaries, $b_3 > 0$, but their salaries increase at the same rate, $b_2$. [Figure: $y_i$ against years of experience ($X_i$); parallel lines with intercept $b_1 + b_3$ for males and $b_1$ for females.]
47. Characteristic dummy variables: $\hat{y}_i = b_1 + b_2 X_i + b_3 M_i + b_4 D_i$. Male college grad: $\hat{y}_i = (b_1 + b_3 + b_4) + b_2 X_i$. Female college grad: $\hat{y}_i = (b_1 + b_4) + b_2 X_i$. Male not a grad: $\hat{y}_i = (b_1 + b_3) + b_2 X_i$. Female not a grad: $\hat{y}_i = b_1 + b_2 X_i$.
48. $\hat{y}_i = b_1 + b_2 X_i + b_3 M_i + b_4 D_i$: a very restrictive assumption. [Figure: wage rate ($y_i$) against years of experience ($X_i$); four parallel lines with intercepts $b_1$ for F-N (female, no degree), $b_1 + b_3$ for M-N (male, no degree), $b_1 + b_4$ for F-D (female, degree), and $b_1 + b_3 + b_4$ for M-D (male, degree).] Very rigid!
50. Karnaugh map for gender vs. status of job (S = supervisor, I = individual):
Gender \ Job      S     I   Total
M (men)          15    25      40
F (women)        13    27      40
Total            28    52      80
51. Occupation vs. Job vs. Gender (C = Computer, T = Other Technical, U = Untechnical):
Occupation:        C          T          U
Job:             S    I     S    I     S    I   Total
M                2    4     3    5    10   16      40
F                1    6     0    7    12   14      40
Total            3   10     3   12    22   30      80
52. Karnaugh Map for Occupation, Job Status, Gender, and Degree Status (D = Degree, N = No Degree):
Occupation:        C          T          U
Job:             S    I     S    I     S    I   Total
D   M            1    3     2    5     6   13      30
D   F            0    3     0    6     7    8      24
N   M            1    1     1    0     4    3      10
N   F            1    3     0    1     5    6      16
Total            3   10     3   12    22   30      80
53. Composite dummy variables: $\hat{y}_i = b_1 + b_2 X_i + b_3 MN_i + b_4 FD_i + b_5 MD_i$. This defines combined (instead of separate) general characteristics. [Figure: wage rate ($y_i$) against years of experience ($X_i$); four parallel lines with intercepts $b_1$ for F-N (female, no degree), $b_1 + b_3$ for M-N (male, no degree), $b_1 + b_4$ for F-D (female, degree), and $b_1 + b_5$ for M-D (male, degree).]
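Composite dummies can be built directly from the two characteristic dummies. The sketch below uses eight invented people and hypothetical array names. Every person falls into exactly one of the four gender-by-degree cells, and the female, no-degree cell is left out of the design matrix as the base group.

```python
import numpy as np

# Invented characteristic dummies for eight people.
male = np.array([1, 1, 0, 0, 1, 0, 1, 0])
degree = np.array([1, 0, 1, 0, 1, 1, 0, 0])
experience = np.array([3, 10, 5, 1, 8, 12, 2, 7], dtype=float)

# One composite dummy per gender-by-degree cell.
MD = male * degree                  # male, degree
MN = male * (1 - degree)            # male, no degree
FD = (1 - male) * degree            # female, degree
FN = (1 - male) * (1 - degree)      # female, no degree (base group, omitted)

# Design matrix for the composite model: ones, experience, MN, FD, MD.
design = np.column_stack([np.ones(8), experience, MN, FD, MD])
print(design)
```

In the characteristic-dummy model the male-degree intercept is forced to equal the base intercept plus the male shift plus the degree shift. Giving that cell its own coefficient removes the restriction.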
54. Multiple Regression Analysis: value of residential property (buying a home).
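55. $A_i$ = bathrooms; $X_i$ = sq. ft. living space. $\hat{Y}_i = b_1 + b_2 X_i + b_3 A_i + b_4 A_i X_i$. For the $H_0$ vs. $H_1$ test on the intercept shift: $\dfrac{b_3}{\sqrt{\text{Est. Var}(b_3)}} \sim t_{n-4}$. For the $H_0$ vs. $H_1$ test on the slope shift: $\dfrac{b_4}{\sqrt{\text{Est. Var}(b_4)}} \sim t_{n-4}$.
56. Testing $H_0: \beta_3 = \beta_4 = 0$ vs. $H_1$: otherwise. $SSE_R = \sum_{i=1}^{n} (y_i - b_1 - b_2 X_i)^2$ and $SSE_U = \sum_{i=1}^{n} (y_i - b_1 - b_2 X_i - b_3 A_i - b_4 A_i X_i)^2$.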
57. Sale of House with Bed and Bath Dummies. PRICE = f(SQFEET, D2BED, D3BED, A2BATH)
I. SQFEET = square feet of living space
II. D2BED = dummy = 1 if two-bedroom house
III. D3BED = dummy = 1 if three-bedroom house
IV. A2BATH = dummy = 1 if two-bathroom house
SQFEET   D2BED   D3BED   A2BATH   PRICE (thousands)
   800       0       0        0     10.000
  1000       0       0        1     20.000
  1200       1       0        0     30.000
  1500       1       0        0     40.000
  1800       1       0        1     50.000
  2000       1       0        1     60.000
  2200       0       1        0     70.000
  2500       0       1        0     80.000
  3000       0       1        1     90.000
  3500       0       1        1    100.000
58. Sale of House with Bed and Bath Dummies. PRICE = f(SQFEET, D2BED, D3BED, A2BATH)
DEP VAR: PRICE   N: 10   MULTIPLE R: 0.996   SQUARED MULTIPLE R: 0.993
ADJUSTED SQUARED MULTIPLE R: 0.987   STD ERROR OF ESTIMATE: 3.40
ANALYSIS OF VARIANCE
SOURCE       SUM-OF-SQUARES   DF    MEAN-SQ   F-RATIO       P
REGRESSION         8191.943    4   2047.986   176.378   0.000
RESIDUAL             58.057    5     11.611
DURBIN-WATSON D STATISTIC: 2.216
FIRST ORDER AUTOCORRELATION COEFF: -0.153
59. Sale of House with Bed and Bath Dummies. PRICE = f(SQFEET, D2BED, D3BED, A2BATH)
VARIABLE      COEFF   STD ERR        T   P(2-TAIL)
INTERCEPT    -6.482     4.112   -1.576       0.176
SQFEET        0.021     0.005    3.958       0.011
D2BED        14.662     4.871    3.010       0.030
D3BED        29.803    10.575    2.818       0.037
A2BATH        4.883     3.953    1.235       0.272
(For 1,000 square feet with all dummies equal to zero: 21 - 6.482 = 14.518, or $14,518.)
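The regression can be rerun from the ten observations listed with the bed and bath dummies. The following is a sketch using plain least squares in NumPy rather than the statistics package that produced the printout, so formatting and rounding will differ. It assumes the data rows are read in the column order SQFEET, D2BED, D3BED, A2BATH, PRICE.

```python
import numpy as np

sqfeet = np.array([800, 1000, 1200, 1500, 1800, 2000, 2200, 2500, 3000, 3500], dtype=float)
d2bed = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
d3bed = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
a2bath = np.array([0, 1, 0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
price = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], dtype=float)  # thousands

# Columns: intercept, SQFEET, D2BED, D3BED, A2BATH.
X = np.column_stack([np.ones(10), sqfeet, d2bed, d3bed, a2bath])
b, *_ = np.linalg.lstsq(X, price, rcond=None)

resid = price - X @ b
sse = float(resid @ resid)                           # residual sum of squares
sst = float(((price - price.mean()) ** 2).sum())     # total sum of squares
r2 = 1.0 - sse / sst

for name, coef in zip(["INTERCEPT", "SQFEET", "D2BED", "D3BED", "A2BATH"], b):
    print(f"{name:10s} {coef:10.3f}")
print("residual SS:", sse, "total SS:", sst, "R squared:", r2)
```

The total sum of squares for these prices is 8250, which matches the sum of the regression and residual sums of squares in the printout (8191.943 + 58.057).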
61. Sales Value of Residential Property. y = sales value of the property (dollars); X = square feet of living space; D1 = dummy variable for one-bedroom home; D2 = dummy variable for two-bedroom home; D3 = dummy variable for three-bedroom home; A1 = dummy variable for one-bathroom home; A2 = dummy variable for two-bathroom home. $\hat{y}_i = b_1 + b_2 X_i + b_3 D2_i + b_4 D3_i + b_5 A2_i$. For a one-bedroom, one-bathroom home, such that D2 = 0, D3 = 0, and A2 = 0, we have: $\hat{y}_i = b_1 + b_2 X_i$ (1 bedroom, 1 bathroom).
62. Sales Value of Residential Property. $\hat{y}_i = b_1 + b_2 X_i + b_3 D2_i + b_4 D3_i + b_5 A2_i$. For a 2-bedroom, 1-bathroom home, we have D2 = 1, D3 = 0, and A2 = 0: $\hat{y}_i = (b_1 + b_3) + b_2 X_i$ (2 bedroom, 1 bathroom).
63. Sales Value of Residential Property. $\hat{y}_i = b_1 + b_2 X_i + b_3 D2_i + b_4 D3_i + b_5 A2_i$. For a 1-bedroom, 2-bathroom home, we have D2 = 0, D3 = 0, and A2 = 1: $\hat{y}_i = (b_1 + b_5) + b_2 X_i$ (1 bedroom, 2 bathroom).
64. Sales Value of Residential Property. $\hat{y}_i = b_1 + b_2 X_i + b_3 D2_i + b_4 D3_i + b_5 A2_i$. For a 2-bedroom, 2-bathroom home, we have D2 = 1, D3 = 0, and A2 = 1: $\hat{y}_i = (b_1 + b_3 + b_5) + b_2 X_i$ (2 bedroom, 2 bathroom). Likewise: $\hat{y}_i = (b_1 + b_4 + b_5) + b_2 X_i$ (3 bedroom, 2 bathroom) and $\hat{y}_i = (b_1 + b_4) + b_2 X_i$ (3 bedroom, 1 bathroom).
65. House Sales Model with Restricted Intercepts. $\hat{y}_i = b_1 + b_2 X_i + b_3 D2_i + b_4 D3_i + b_5 A2_i$. [Figure: selling price ($y_i$) against square feet of living space ($X_i$); six parallel lines with intercepts $b_1$ for D1-A1 (one bed, one bath), $b_1 + b_5$ for D1-A2 (one bed, two bath), $b_1 + b_3$ for D2-A1 (two bed, one bath), $b_1 + b_3 + b_5$ for D2-A2 (two bed, two bath), $b_1 + b_4$ for D3-A1 (three bed, one bath), and $b_1 + b_4 + b_5$ for D3-A2 (three bed, two bath).] Rigid!
68. Composite dummy variables are created for each nonempty cell. Create six composite dummy variables:
D1A1 = 1 if one bed and one bath, otherwise D1A1 = 0
D1A2 = 1 if one bed and two bath, otherwise D1A2 = 0
D2A1 = 1 if two bed and one bath, otherwise D2A1 = 0
D2A2 = 1 if two bed and two bath, otherwise D2A2 = 0
D3A1 = 1 if three bed and one bath, otherwise D3A1 = 0
D3A2 = 1 if three bed and two bath, otherwise D3A2 = 0
69. Sales Value of Residential Property. y = sales value of the property (dollars); X = square feet of living space; D1A1 = interaction one-bed & one-bath; D1A2 = interaction one-bed & two-bath; D2A1 = interaction two-bed & one-bath; D2A2 = interaction two-bed & two-bath; D3A1 = interaction three-bed & one-bath; D3A2 = interaction three-bed & two-bath. $\hat{y}_i = b_1 + b_2 X_i + b_3 D1A2_i + b_4 D2A1_i + b_5 D2A2_i + b_6 D3A1_i + b_7 D3A2_i$.
70. This one equation with all these dummy variables actually represents six equations. You must substitute in for each of the dummy variables to generate the six equations that are implied by this one dummy variable equation: $\hat{y}_i = b_1 + b_2 X_i + b_3 D1A2_i + b_4 D2A1_i + b_5 D2A2_i + b_6 D3A1_i + b_7 D3A2_i$. For a one-bedroom, one-bathroom home, since D1A1 = 1 while the others are zero: $\hat{y}_i = b_1 + b_2 X_i$ (1 bedroom, 1 bathroom).
71. [Figure: House Sales Model with Unrestricted Intercepts. Selling price ($y_i$) against square feet of living space ($X_i$); line drawn so far: D1-A1 (one bed, one bath) with intercept $b_1$.]
72. $\hat{y}_i = b_1 + b_2 X_i + b_3 D1A2_i + b_4 D2A1_i + b_5 D2A2_i + b_6 D3A1_i + b_7 D3A2_i$. One-bedroom, two-bathroom: D1A2 = 1, while the others are zero: $\hat{y}_i = (b_1 + b_3) + b_2 X_i$ (1 bedroom, 2 bathroom). Now graph it!
73. [Figure: House Sales Model with Unrestricted Intercepts. Selling price ($y_i$) against square feet of living space ($X_i$); lines drawn so far: D1-A1 (one bed, one bath) with intercept $b_1$ and D1-A2 (one bed, two bath) with intercept $b_1 + b_3$.]
74. $\hat{y}_i = b_1 + b_2 X_i + b_3 D1A2_i + b_4 D2A1_i + b_5 D2A2_i + b_6 D3A1_i + b_7 D3A2_i$. Two-bedroom, one-bathroom: D2A1 = 1, while the others are zero: $\hat{y}_i = (b_1 + b_4) + b_2 X_i$ (2 bedroom, 1 bathroom). Now graph it!
75. [Figure: House Sales Model with Unrestricted Intercepts. Selling price ($y_i$) against square feet of living space ($X_i$); lines drawn so far: D1-A1 with intercept $b_1$, D1-A2 with intercept $b_1 + b_3$, and D2-A1 (two bed, one bath) with intercept $b_1 + b_4$.]
76. $\hat{y}_i = b_1 + b_2 X_i + b_3 D1A2_i + b_4 D2A1_i + b_5 D2A2_i + b_6 D3A1_i + b_7 D3A2_i$. Two-bedroom, two-bathroom: D2A2 = 1, while the others are zero: $\hat{y}_i = (b_1 + b_5) + b_2 X_i$ (2 bedroom, 2 bathroom). Now graph it!
77. [Figure: House Sales Model with Unrestricted Intercepts. Selling price ($y_i$) against square feet of living space ($X_i$); lines drawn so far: D1-A1 with intercept $b_1$, D1-A2 with intercept $b_1 + b_3$, D2-A1 with intercept $b_1 + b_4$, and D2-A2 (two bed, two bath) with intercept $b_1 + b_5$.]
78. [Figure: House Sales Model with Unrestricted Intercepts, completed. Selling price ($y_i$) against square feet of living space ($X_i$); six parallel lines with intercepts $b_1$ for D1-A1 (one bed, one bath), $b_1 + b_3$ for D1-A2 (one bed, two bath), $b_1 + b_4$ for D2-A1 (two bed, one bath), $b_1 + b_5$ for D2-A2 (two bed, two bath), $b_1 + b_6$ for D3-A1 (three bed, one bath), and $b_1 + b_7$ for D3-A2 (three bed, two bath).]