3. R-SQUARED
• The regression R² measures the fraction of the variance of Y that is
explained by X; it is unitless and ranges between zero (no fit) and one
(perfect fit)
• The R-squared of a single regression does not tell us much on its own; we need
at least two regressions to compare
• All else equal we want to be able to explain a higher share of variance in Y
• Stata:
• Use California school dataset
• Regress test scores on class size. The R-squared is 0.0512. This means that class size
explains about 5% of the variance in test scores.
• Regress test scores on expenditure per student. What is the R-squared? How would
you interpret it?
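The definition of R-squared can be sketched in a few lines of code. The snippet below is a minimal illustration in Python rather than Stata. The helper names `ols_fit` and `r_squared` and the simulated class sizes and test scores are assumptions made for the example, not the California school data. It computes R-squared as 1 - SSR/TSS and checks that multiplying Y by 10 leaves it unchanged, which is what "unitless" means.

```python
import random


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(x, y):
    """R-squared = 1 - SSR/TSS for the regression of y on x."""
    intercept, slope = ols_fit(x, y)
    y_bar = sum(y) / len(y)
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ssr / tss


# Synthetic data (hypothetical numbers, NOT the California school dataset)
rng = random.Random(0)
class_size = [rng.gauss(19.6, 1.9) for _ in range(420)]
test_score = [700 - 2.3 * s + rng.gauss(0, 18.6) for s in class_size]

r2 = r_squared(class_size, test_score)
r2_rescaled = r_squared(class_size, [10 * t for t in test_score])

print(f"R-squared: {r2:.4f}")
print(f"R-squared after multiplying Y by 10: {r2_rescaled:.4f}")
```

In Stata the same number appears in the header of the `regress` output, so no manual calculation is needed there.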
4. INTERPRETING R-SQUARED
• An increase in R-squared does not necessarily mean that an added variable
is statistically significant
• A high R-squared does not mean that the regressors are a true cause of the
dependent variable
• A high R-squared does not mean that the estimated coefficients on the regressors are
correct
• A low R-squared does not mean that the coefficients on the regressors are
wrong
• A high R-squared does not necessarily mean that you have the most
appropriate set of regressors, nor does a low R-squared necessarily mean
that you have an inappropriate set of regressors.
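The first point above can be illustrated with a small simulation. This is a sketch in Python on synthetic data, with all names and numbers assumed for the example. A regressor that is pure noise (`noise_var`, unrelated to Y by construction) is added to a regression, and R-squared still does not fall. To avoid matrix algebra, the two-regressor fit is obtained by partialling out (the Frisch-Waugh-Lovell result) rather than by a multiple-regression routine.

```python
import random


def residuals(x, y):
    """Residuals from the least-squares regression of y on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    intercept = y_bar - slope * x_bar
    return [b - intercept - slope * a for a, b in zip(x, y)]


def sum_sq(values):
    return sum(v * v for v in values)


# Synthetic data (hypothetical numbers)
rng = random.Random(1)
n = 200
x1 = [rng.gauss(20, 2) for _ in range(n)]
noise_var = [rng.gauss(0, 1) for _ in range(n)]  # irrelevant by construction
y = [700 - 2 * a + rng.gauss(0, 18) for a in x1]

y_bar = sum(y) / n
tss = sum_sq([b - y_bar for b in y])

# Short regression: y on x1
e_short = residuals(x1, y)
r2_short = 1 - sum_sq(e_short) / tss

# Long regression: y on x1 and noise_var. Its residuals equal those from
# regressing e_short on the part of noise_var not explained by x1.
e_long = residuals(residuals(x1, noise_var), e_short)
r2_long = 1 - sum_sq(e_long) / tss

print(f"R-squared with x1 only:          {r2_short:.4f}")
print(f"R-squared with x1 and noise_var: {r2_long:.4f}")
```

The increase is mechanical: least squares can always set the new coefficient to zero, so the sum of squared residuals cannot go up when a regressor is added.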
5. R-SQUARED RULES
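• Compare R-squared only across regressions with the same dependent variable
• Compare R-squared only across regressions with the same number of independent variables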
• Stata:
• generate ltestscr=ln(testscr)
• Generate a table with the regressions below (use outreg):
• Regress test score on class size and expenditure per student
• Regress natural log of test score on class size
• Regress test score on class size and average district income
• Regress expenditure per student on average district income
• Regress natural log of test score on average district income
• Regress natural log of test score on class size and expenditure per student
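One reason the dependent variable must be the same is that R-squared is a share of the total variation in that variable, and the total changes when the variable changes. The sketch below uses Python with simulated test scores (hypothetical numbers, not the California data) to compare the total sum of squares (TSS) of test scores with that of their natural log.

```python
import math
import random


def total_sum_of_squares(values):
    """TSS: sum of squared deviations from the sample mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


# Synthetic test scores (hypothetical numbers)
rng = random.Random(2)
test_score = [rng.gauss(654, 19) for _ in range(420)]
log_test_score = [math.log(s) for s in test_score]

tss_level = total_sum_of_squares(test_score)
tss_log = total_sum_of_squares(log_test_score)

print(f"TSS of test score:     {tss_level:.1f}")
print(f"TSS of ln(test score): {tss_log:.4f}")
```

Because the two regressions explain shares of different totals, their R-squared values answer different questions and should not be ranked against each other.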
6. STANDARD ERROR OF THE REGRESSION
• The SER is a measure of the spread of the observations around the regression line
measured in the units of the dependent variable
• SER is an estimator of the standard deviation of the regression error u_i
• All else equal we want to have a smaller spread of the observations around the
regression line
• In Stata SER is called root MSE (mean squared error)
• In the regression of test score on class size the SER is about 18.6. This means that the
standard deviation of the regression residuals around the regression line is 18.6 points.
• We can use SER to compare models
• What are the SERs for the rest of the regressions you have run? What does that mean?
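For a regression with one regressor, the SER is the square root of SSR/(n - 2). The sketch below is in Python on synthetic data; the function name `ser` and the simulated numbers are assumptions for illustration. It computes the SER and shows that, unlike R-squared, it changes when the units of Y change.

```python
import math
import random


def ser(x, y):
    """Standard error of the regression of y on x: sqrt(SSR / (n - 2))."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    intercept = y_bar - slope * x_bar
    ssr = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return math.sqrt(ssr / (n - 2))


# Synthetic data (hypothetical numbers); the error term has std. dev. 18.6
rng = random.Random(3)
class_size = [rng.gauss(19.6, 1.9) for _ in range(420)]
test_score = [700 - 2.3 * s + rng.gauss(0, 18.6) for s in class_size]

ser_points = ser(class_size, test_score)
# Same scores measured in tenths of a point
ser_tenths = ser(class_size, [10 * t for t in test_score])

print(f"SER in points:            {ser_points:.2f}")
print(f"SER in tenths of a point: {ser_tenths:.2f}")
```

This is why the SER, like R-squared, is only useful for comparing models that share the same dependent variable measured in the same units.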
7. CAUSAL EFFECTS AND IDEALIZED EXPERIMENTS
• Most of our questions concern causal relationships among variables, e.g.
does lower class size lead to higher test scores?
• Causality means that a specific action leads to a specific measurable
consequence
• The best way to measure a causal effect is by conducting an experiment
• In a randomized controlled experiment there is both a control group and a
treatment group. Assignment to a group happens randomly
• We would like to be able to show that the only systematic reason for
differences in outcomes between the treatment and control groups is the
treatment itself
• In practice, it is not possible to perform an ideal experiment. The ideal experiment
nevertheless gives us a benchmark.
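A short simulation shows why random assignment matters. This is a sketch in Python in which every number is hypothetical, including the assumed true effect of 5 points. When a coin flip decides who is treated, the difference in mean outcomes recovers the true effect. When stronger students are more likely to end up in the treatment group, the same comparison mixes the treatment effect with pre-existing differences.

```python
import random

rng = random.Random(42)
TRUE_EFFECT = 5.0  # hypothetical effect of a small class, in test-score points
n = 20000

# Unobserved student ability, which also affects test scores
ability = [rng.gauss(0, 10) for _ in range(n)]


def mean(values):
    return sum(values) / len(values)


def difference_in_means(treated):
    """Mean outcome of the treatment group minus mean outcome of the controls."""
    outcome = [650 + a + TRUE_EFFECT * t for a, t in zip(ability, treated)]
    treated_mean = mean([y for y, t in zip(outcome, treated) if t])
    control_mean = mean([y for y, t in zip(outcome, treated) if not t])
    return treated_mean - control_mean


# Randomized controlled experiment: a coin flip decides treatment
random_assignment = [rng.random() < 0.5 for _ in range(n)]

# Non-random assignment: higher-ability students are more likely to be treated
self_selection = [a + rng.gauss(0, 10) > 0 for a in ability]

randomized_estimate = difference_in_means(random_assignment)
selected_estimate = difference_in_means(self_selection)

print(f"True effect:                  {TRUE_EFFECT:.2f}")
print(f"Estimate, random assignment:  {randomized_estimate:.2f}")
print(f"Estimate, self-selection:     {selected_estimate:.2f}")
```

With random assignment, ability is on average the same in both groups, so the treatment is the only systematic difference between them.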
8. NONRANDOM SAMPLE EXAMPLE - 1
• In 1936 the Literary Digest polled a “random” sample of households chosen
from telephone records and automobile registrations
• In 1936 many households did not have cars or telephones, and those that
did tended to be richer – and were also more likely to be Republican
• The results of the poll indicated that Landon (the Republican presidential
candidate) would defeat the incumbent (Roosevelt) by a landslide, 57% to
43%
• Roosevelt ended up winning by 59% to 41%
• Do you think surveys conducted using social media might have a similar
problem with bias?
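The mechanism behind the poll's failure can be reproduced in a toy simulation. In this Python sketch the population shares and voting probabilities are invented for illustration and are not historical estimates. Richer households are assumed to be more likely both to own a telephone and to vote for Landon, so a poll drawn from telephone owners overstates his support while a truly random poll does not.

```python
import random

rng = random.Random(1936)

# Build a hypothetical electorate of (has_phone, votes_landon) pairs
population = []
for _ in range(100000):
    rich = rng.random() < 0.25
    has_phone = rng.random() < (0.9 if rich else 0.2)
    votes_landon = rng.random() < (0.65 if rich else 0.33)
    population.append((has_phone, votes_landon))


def landon_share(voters):
    """Fraction of the given voters who support Landon."""
    return sum(1 for _, landon in voters if landon) / len(voters)


true_share = landon_share(population)

# Poll 1: a simple random sample of the whole electorate
random_poll = landon_share(rng.sample(population, 5000))

# Poll 2: the same sample size, but drawn only from telephone owners
phone_owners = [voter for voter in population if voter[0]]
phone_poll = landon_share(rng.sample(phone_owners, 5000))

print(f"Landon share, whole electorate: {true_share:.3f}")
print(f"Landon share, random poll:      {random_poll:.3f}")
print(f"Landon share, phone-owner poll: {phone_poll:.3f}")
```

A larger sample would not help the second poll: the error comes from who can be sampled, not from how many are sampled.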
9. NONRANDOM SAMPLE EXAMPLE - 2
• Some mutual funds simply track the market, some are actively managed by full-time
professionals.
• Do the latter mutual funds outperform the former?
• One way to answer the question is to use historical data on funds currently available
for purchase. However, this means that the worst-performing funds, which have been
closed, would not be represented.
• The sample is selected based on the value of the dependent variable, returns, because
funds with the lowest returns are eliminated
• The mean return of all funds would then be lower than the mean return of those still in
existence. This is called survivorship bias.
• When corrected for survivorship bias it turns out actively managed funds do not
outperform the market
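Survivorship bias can be shown with a few lines of simulation. The Python sketch below uses invented numbers: every fund draws its return from the same distribution, so no fund has any skill, and funds whose return falls below a hypothetical cutoff are closed. The funds that survive still look better than the full set.

```python
import random

rng = random.Random(7)

# Every fund has the same expected return (hypothetical 6%, std. dev. 10%)
all_returns = [rng.gauss(0.06, 0.10) for _ in range(5000)]

# Funds with a return below this hypothetical cutoff are closed
CLOSE_BELOW = -0.05
survivors = [r for r in all_returns if r >= CLOSE_BELOW]

mean_all = sum(all_returns) / len(all_returns)
mean_survivors = sum(survivors) / len(survivors)

print(f"Funds started:           {len(all_returns)}")
print(f"Funds still in existence: {len(survivors)}")
print(f"Mean return, all funds:  {mean_all:.4f}")
print(f"Mean return, survivors:  {mean_survivors:.4f}")
```

The gap between the two means is created entirely by the selection rule, which depends on the dependent variable (returns).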
10. NONRANDOM SAMPLE EXAMPLE - 3
• Can we estimate the effect of class size on test scores using only districts where
average class size is above 20 students?
• What is the average height of a GU student measured outside of a
basketball locker room?
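One way to explore these two questions is to simulate both kinds of selection. The Python sketch below uses synthetic districts with an assumed true slope of -2; all numbers are hypothetical. It compares the estimated slope in the full sample, in a sample selected on the regressor (class size above 20), and in a sample selected on the dependent variable (test score above 660), which is the kind of selection at work in the locker-room example.

```python
import random


def slope(pairs):
    """Least-squares slope from regressing the second element on the first."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    return sxy / sxx


rng = random.Random(20)
TRUE_SLOPE = -2.0  # hypothetical effect of one more student per class

# Synthetic (class size, test score) pairs
districts = []
for _ in range(20000):
    size = rng.gauss(20, 2)
    score = 700 + TRUE_SLOPE * size + rng.gauss(0, 18)
    districts.append((size, score))

slope_all = slope(districts)
# Selection on the regressor: keep only districts with large classes
slope_large_classes = slope([d for d in districts if d[0] > 20])
# Selection on the dependent variable: keep only high-scoring districts
slope_high_scores = slope([d for d in districts if d[1] > 660])

print(f"True slope:                      {TRUE_SLOPE:.2f}")
print(f"Full sample:                     {slope_all:.2f}")
print(f"Only class size above 20:        {slope_large_classes:.2f}")
print(f"Only test score above 660:       {slope_high_scores:.2f}")
```

In this setup, selecting on the regressor makes the estimate noisier but leaves it centered on the true slope, while selecting on the dependent variable pulls the estimate toward zero.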