1. Stats I, II and III
Frequencies, crosstabs, correlation, ANOVA,
regression
Jodi Upton and Crina Boros
CIJ Summer 2017
2. The Data Ladder -- categorical
I. One type of response (yes or no)
Frequencies:
            Count    Percent
Yes         432      45.3%
No          521      54.7%
Crosstabs:
            Live in Texas
Like Bush   Yes      No
Yes         382      200
No          125      307
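A minimal sketch of how the percentages behind a frequency table and a crosstab can be worked out, using the counts from this slide. The helper name `column_percentages` is made up for illustration, and reading the crosstab by column (share of Texans vs. non-Texans who like Bush) is one choice among several.

```python
# Counts copied from the slide's frequency table.
freq = {"Yes": 432, "No": 521}
total = sum(freq.values())
freq_pct = {answer: round(100 * n / total, 1) for answer, n in freq.items()}

# Crosstab from the slide: rows are "Like Bush", columns are "Live in Texas".
crosstab = {
    "Yes": {"Yes": 382, "No": 200},
    "No": {"Yes": 125, "No": 307},
}

def column_percentages(table):
    """Return each cell as a percentage of its column total (hypothetical helper)."""
    columns = list(next(iter(table.values())).keys())
    col_totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        r: {c: round(100 * row[c] / col_totals[c], 1) for c in columns}
        for r, row in table.items()
    }

col_pct = column_percentages(crosstab)
print(freq_pct)
print(col_pct)
```

Read by column, roughly three in four of the Texas residents in this table like Bush, against about two in five of the non-residents.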
3. The Data Ladder -- categorical
II. Two or more types of responses (race)
Frequencies:
Race        Frequency
Asian       4,766
Black       12,807
White       9,766
Hispanic    7,236
Crosstabs:
Race        Warning   Ticket   None
Black       1         6        0
White       4         3        1
Hispanic    0         1        2
Unknown     3         2        2
4. The Data Ladder -- categorical
III. Ordinal Data (use crosstabs and frequencies)
When the value doesn’t mean much, but the order
does:
Grade levels
Age categories
Income categories
5. The Data Ladder -- continuous data
Examples:
Income
Housing prices
Response time (police and fire)
Distance travelled (commute)
What you can do:
Mean
Median
Range
Rank
Correlation
ANOVA
Regression
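As a quick illustration of the first few items on that list, here is a sketch using Python's standard library. The income figures are invented for the example; the one very large value is there to show why mean and median can tell different stories.

```python
from statistics import mean, median

# Hypothetical household incomes in dollars, including one very high earner.
incomes = [28_000, 35_000, 41_000, 52_000, 64_000, 1_200_000]

avg = mean(incomes)                      # pulled upward by the outlier
mid = median(incomes)                    # middle value, barely affected by it
spread = max(incomes) - min(incomes)     # range
ranked = sorted(incomes, reverse=True)   # rank 1 is the highest income

print(round(avg), mid, spread, ranked[0])
```

With a single outlier in the data, the mean lands several times higher than the median, which is usually the first thing to check before quoting an "average".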
8. In traditional statistics, the normal curve means about 95% of observations will fall within
roughly two standard deviations of the mean (the wide middle of this curve)
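The 95% figure can be checked against the standard normal distribution. This is only a sketch with Python's built-in `statistics.NormalDist`; the helper name `share_within` is an assumption made for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(z):
    """Share of a normal distribution within z standard deviations of the mean."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)

within_1 = share_within(1)        # about 68%
within_196 = share_within(1.96)   # about 95%
within_2 = share_within(2)        # slightly more than 95%

print(round(within_1, 3), round(within_196, 3), round(within_2, 3))
```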
11. Independent vs. Dependent variable
Independent
Comes first in time
Can be more than one
variable
Dependent
What you are measuring
12. Polling
A March 9, 2016 Quinnipiac poll
found the following results, with a
margin of error of +/- 3.7 percentage
points at the 95% confidence level.
Who is really ahead?
What’s the MOE for women? White
males?
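One way to reason about the subgroup question: the margin of error grows as the sample shrinks. The sketch below uses the textbook formula for a simple random sample at 95% confidence. The poll's actual sample and subgroup sizes are not in these slides, so the "half" and "quarter" subgroups are purely hypothetical, and real polls also adjust for weighting.

```python
from math import sqrt

Z_95 = 1.96  # z-score for a 95% confidence level

def margin_of_error(sample_size, proportion=0.5):
    """Margin of error, in percentage points, for a simple random sample."""
    return 100 * Z_95 * sqrt(proportion * (1 - proportion) / sample_size)

def sample_size_for(moe_points, proportion=0.5):
    """Approximate sample size implied by a margin of error in points."""
    return (Z_95 / (moe_points / 100)) ** 2 * proportion * (1 - proportion)

implied_n = sample_size_for(3.7)                     # roughly 700 respondents
half_sample_moe = margin_of_error(implied_n / 2)     # hypothetical subgroup: half the sample
quarter_sample_moe = margin_of_error(implied_n / 4)  # hypothetical subgroup: a quarter of it

print(round(implied_n), round(half_sample_moe, 1), round(quarter_sample_moe, 1))
```

Under these assumptions a subgroup that is half the sample carries a margin of a little over 5 points, and a quarter of the sample about 7.4 points, so a lead that looks solid overall can vanish inside a subgroup.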
14. CORRELATION
AKA: Pearson’s r or coefficient of correlation
● Between 1 and -1
● If both variables move in the same direction → positive relationship
● If variables move in opposite direction → negative relationship
-1                         0                         +1
Strong relationship     weak     weak     strong
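A short sketch of how Pearson's r is calculated, written out by hand so each step is visible. The data are invented: one pair of lists moves in the same direction, the other in exactly opposite directions.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together, and how much each moves on its own.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / sqrt(spread_x * spread_y)

# Hypothetical data.
x = [1, 2, 3, 4, 5]
same_direction = [2, 4, 5, 4, 5]
opposite_direction = [10, 8, 6, 4, 2]

r_pos = pearson_r(x, same_direction)       # positive, fairly strong
r_neg = pearson_r(x, opposite_direction)   # perfect negative relationship

print(round(r_pos, 2), round(r_neg, 2))
```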
16. ANOVA
What it assumes:
Normal distribution
Independence of errors
Outliers removed*
Equal variance
(*but journalists love those!)
What it measures:
Whether the difference
within the group is greater
than the difference between
the groups
17. ANOVA needs a hypothesis
Null hypothesis: the treatment has no impact
F = (the treatment variance + the random variance) / (the random variance)
18. What you’re looking for:
The F statistic is 0 or greater (if it’s negative, you’ve
made a mistake); a value near 1 is what chance alone would produce
If F > F crit, you must reject the null hypothesis (treatment had an impact)
If F < F crit, you can’t rule out the null hypothesis
The p value
If the p value is less than alpha (.05) then the result is statistically significant (unlikely to be chance alone)
If the p value is greater than alpha, the results are not significant
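To make the F comparison concrete, here is a sketch of a one-way ANOVA done by hand. The response times are invented, the function name is an assumption, and the critical value 3.89 is the standard F-table entry for alpha = .05 with 2 and 12 degrees of freedom. In practice Excel or a statistics package would report F, F crit and the p value directly.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group variation: the "treatment" part of the ratio.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group variation: the "random" part of the ratio.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical response times in minutes for three districts.
north = [4, 5, 6, 5, 5]
south = [6, 7, 8, 7, 7]
east = [5, 6, 7, 6, 6]

f_stat, df_between, df_within = one_way_anova_f([north, south, east])
F_CRIT_2_12 = 3.89  # F table value for alpha = .05, df = (2, 12)
reject_null = f_stat > F_CRIT_2_12

print(f_stat, df_between, df_within, reject_null)
```

Here F comes out well above F crit, so the null hypothesis (district makes no difference) would be rejected for this made-up data.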
20. What you still don’t know
What accounts for the difference?
For that you need a t-test, regression or other tool.
21. ‘HOW TO CHOOSE’ MADE EASY
THE 2 MOST ESSENTIAL QUESTIONS:
1. DO YOU HAVE CATEGORICAL OR CONTINUOUS DATA IN THE VARIABLES?
2. WHAT IS YOUR INDEPENDENT VARIABLE AND DEPENDENT VARIABLE?
INDEPENDENT    DEPENDENT      STATISTICS
Categorical    Categorical    CROSS-TAB
Continuous     Continuous     LINEAR REGRESSION / MULTIPLE REGRESSION
Categorical    Continuous     ANALYSIS OF VARIANCE / ANOVA
Continuous     Categorical    LOGISTIC REGRESSION
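The table above boils down to a lookup on the two questions. A toy sketch, where the names `CHOICES` and `choose_test` are made up for illustration:

```python
# Keys are (independent variable type, dependent variable type).
CHOICES = {
    ("categorical", "categorical"): "cross-tab",
    ("continuous", "continuous"): "linear regression / multiple regression",
    ("categorical", "continuous"): "analysis of variance (ANOVA)",
    ("continuous", "categorical"): "logistic regression",
}

def choose_test(independent, dependent):
    """Look up the suggested technique for a pair of variable types."""
    return CHOICES[(independent.lower(), dependent.lower())]

# Example: does district (categorical) affect response time (continuous)?
suggestion = choose_test("Categorical", "Continuous")
print(suggestion)
```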
22. IT’S A FINE DAY FOR LINEAR REGRESSION!
Image by Paul Wesley
23. Linear Regression
I. Does the data fit
the 1st assumption:
is there a linear relationship?
1. Scatter plot
2. Trendline
3. Create a new variable
II. The last assumption:
the data should approximate
a Bell curve (normal distribution).
1. Data Analysis ToolPak -
Descriptive Statistics
2. Tick Summary Statistics
3. Tick Confidence level >> 95%
4. Mean and median should be
close to each other
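A rough way to run the same bell-curve sanity check outside Excel is to compare the mean with the median. In the sketch below the function name, the 0.1 standard deviation tolerance and both data sets are assumptions chosen for illustration; this is a crude screen, not a formal normality test.

```python
from statistics import mean, median, stdev

def looks_roughly_symmetric(values, tolerance=0.1):
    """Crude check: are mean and median within `tolerance` standard deviations?"""
    gap = abs(mean(values) - median(values))
    return gap <= tolerance * stdev(values)

# Hypothetical data: one bell-shaped set, one with a long right tail.
bell_shaped = [48, 50, 51, 52, 49, 50, 53, 47, 50, 50]
skewed = [20, 22, 23, 25, 26, 28, 30, 35, 60, 150]

bell_ok = looks_roughly_symmetric(bell_shaped)
skewed_ok = looks_roughly_symmetric(skewed)

print(bell_ok, skewed_ok)
```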
27. Linear Regression
Conditions met? Run the Regression from
the Data Analysis ToolPak:
Y Range - Dependent
X Range - Independent
Turn on LABELS
CONFIDENCE LEVEL 95%
NEW WORKSHEET - REGRESSION
RESIDUALS
ADJUSTED R SQUARE 0 TO 1.0. The
closer it gets to 1, the closer the fit is to
perfect.
SIGNIFICANCE F
THE RESIDUAL STORY - Sort!
THE LINEAR REGRESSION IS
JUST THE BEGINNING OF
THE REPORTING
Conrad Carlberg - Statistical
Analysis
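For readers working outside Excel, the same steps (fit the line, check R square, then sort the residuals) can be sketched in a few lines. The town names and numbers are invented, and the function names are assumptions for the example. The residual sort is where the story leads tend to be: the places that sit furthest above or below what the line predicts.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one independent variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(ys, residuals):
    """Share of the variation in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: x is the independent variable, y the dependent one.
towns = ["A", "B", "C", "D", "E"]
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 16]

slope, intercept = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
r2 = r_squared(y, residuals)
# Adjusted R square with one independent variable (p = 1).
adj_r2 = 1 - (1 - r2) * (len(x) - 1) / (len(x) - 1 - 1)

# Sort by residual: the top is furthest above the line, the bottom furthest below.
ranked = sorted(zip(towns, residuals), key=lambda pair: pair[1], reverse=True)
print(slope, intercept, round(r2, 2), round(adj_r2, 2), ranked)
```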
29. Thank you!
Jodi Upton: jodi.upton@gmail.com and @jodiupton
Crina Boros: crinaboros@gmail.com
Special thanks to: Jennifer LaFleur, Center for Investigative
Reporting/Reveal
Steve Doig, Arizona State University
Editor's notes
Starting at the bottom...
Heard continuous referred to as ‘infinite’ but it’s really not. Income and prices, for example, are limited to two decimal places.
Another way that may help: Continuous (measured) vs discrete (counted)
If it would take ‘forever’ to count, it’s probably continuous
Amy Poehler
Skew = body (positive or negative); kurtosis = tail
All you really need to know: is my data evenly distributed
The margin of error does not depend on the size of the population; it depends on the size of the sample.
(In astronomy, the margin of error is 4.24 light years -- the distance to Proxima Centauri)
Important: 1 in 20 observations will NOT!
Bayesian: start with a different hypothesis, 100% within curve, but may be ‘off’
ANOVA was created by an evolutionary biologist and statistician, who wanted to be able to tell if two groups were the same or different, i.e., were they the same species or not?
In other words, is there enough randomness within the sample, that it outweighs any variance between the samples -- and any measured difference is the result of chance?
“F” stands for Sir Ronald Fisher, who invented this
“P” stands for the probability that -- if the null hypothesis is true -- the results would be at least this ‘extreme’ (with alpha at .05, a one in 20 chance of a false alarm)
A paper ‘suicide suit’ worn by a model