Stats I, II and III
Frequencies, crosstabs, correlation, ANOVA,
regression
Jodi Upton and Crina Boros
CIJ Summer 2017
The Data Ladder -- categorical
I. One type of response (yes or no)
Frequencies:
Yes   432   45.3%
No    521   54.7%
Crosstabs:
            Live in Texas
Like Bush   Yes   No
Yes         382   200
No          125   307
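As a rough sketch, the percentages in the frequency table can be reproduced from the raw counts. The counts come from the slide; the function name `percentages` is only illustrative.

```python
# Counts taken from the slide's yes/no frequency table.
counts = {"Yes": 432, "No": 521}

def percentages(counts):
    """Return each category's share of the total, as a percentage rounded to 0.1."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

shares = percentages(counts)
print(shares)
```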
The Data Ladder-- categorical
II. Two or more types of responses (race)
Frequencies:
Race       Frequency
Asian      4,766
Black      12,807
White      9,766
Hispanic   7,236
Crosstabs:
Race       Warning   Ticket   None
Black      1         6        0
White      4         3        1
Hispanic   0         1        2
Unknown    3         2        2
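A crosstab like the one above is just a count of how often each pair of categories occurs together. A minimal sketch, assuming the data arrives as one record per traffic stop; the records below are hypothetical and are not the slide's data.

```python
from collections import Counter

# Hypothetical traffic-stop records as (race, outcome) pairs; not the slide's data.
stops = [
    ("Black", "Ticket"), ("Black", "Ticket"), ("Black", "Warning"),
    ("White", "Warning"), ("White", "Ticket"), ("White", "None"),
]

def crosstab(records):
    """Count how often each (row category, column category) pair occurs."""
    return Counter(records)

table = crosstab(stops)
rows = sorted({race for race, _ in stops})
cols = sorted({outcome for _, outcome in stops})

# Print one line per row category, with counts in column order.
print("Race", cols)
for race in rows:
    print(race, [table[(race, outcome)] for outcome in cols])
```

Pairs that never occur simply count as zero, which is why an empty cell in the table shows 0.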
The Data Ladder-- categorical
III. Ordinal Data (use crosstabs and frequencies)
When the value doesn’t mean much, but the order
does:
Grade levels
Age categories
Income categories
The Data Ladder-- continuous data
Examples:
Income
Housing prices
Response time (police and fire)
Distance travelled (commute)
What you can do:
Mean
Median
Range
Rank
Correlation
ANOVA
Regression
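The first four items on this list can be sketched with Python's standard library. The commute distances are made-up values for illustration only.

```python
import statistics

# Hypothetical commute distances in miles; illustrative only.
commutes = [3.5, 12.0, 7.25, 30.0, 7.25, 18.5, 5.0]

mean = statistics.mean(commutes)              # arithmetic average
median = statistics.median(commutes)          # middle value once sorted
value_range = max(commutes) - min(commutes)   # spread from lowest to highest
ranked = sorted(commutes, reverse=True)       # longest commute first

print(mean, median, value_range, ranked)
```

One long commute pulls the mean well above the median here, which is the kind of gap worth noticing before reporting an "average".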
Go to Kahoot.it
(on your phone or computer)
In traditional statistics, the normal curve means about 95% of observations will fall within
two standard deviations of the mean (most of this curve)
Independent vs. Dependent variable
Independent
Comes first in time
Can be more than one
variable
Dependent
What you are measuring
Polling
A March 9, 2016 Quinnipiac poll
found the following results, with a
±3.7 percentage point margin of error,
at the 95% confidence level.
Who is really ahead?
What’s the MOE for women? White
males?
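To see why subgroups have a larger margin of error, here is a minimal sketch using the standard formula for a proportion from a simple random sample, MOE = z * sqrt(p(1 - p) / n). The sample sizes are assumptions: the slide does not state them, although about 700 respondents is consistent with the ±3.7 figure.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    z=1.96 corresponds to the 95% confidence level; p=0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical sample sizes: the slide does not give them.
full_sample = 700
women_only = 350

print(round(100 * margin_of_error(full_sample), 1))
print(round(100 * margin_of_error(women_only), 1))
```

Halving the sample does not double the margin of error, but it does widen it noticeably, so a lead that looks solid overall may not hold inside a subgroup.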
CORRELATION
AKA: Pearson’s r or coefficient of correlation
● Between 1 and -1
● If both variables move in the same direction → positive relationship
● If variables move in opposite direction → negative relationship
Strength of relationship:
-1 (strong)   0 (weak)   +1 (strong)
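Pearson's r can be computed by hand from two lists of paired values. The sketch below uses the textbook formula; the sample numbers in the usage lines are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together, and how much each varies on its own.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)

# Hypothetical values: both rise together, then one rises as the other falls.
same_direction = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
opposite_direction = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(same_direction, opposite_direction)
```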
Got it, so far?
ANOVA
What it assumes:
Normal distribution
Independence of errors
Outliers removed*
Equal variance
(*but journalists love those!)
What it measures:
Whether the difference
within the group is greater
than the difference between
the groups
ANOVA needs a hypothesis
Null hypothesis: the treatment has no impact
F = (the treatment variance + the random variance) / the random variance
What you’re looking for:
The F statistic is 0 or greater; a value near 1 suggests the treatment had
no impact (if it’s negative, you’ve made a mistake)
If F > F crit, you must reject the null hypothesis (treatment had an impact)
If F < F crit, you can’t rule out the null hypothesis
The p value
If the p value is less than alpha (.05) then the result is significant (it matters)
If the p value is greater than alpha, the results are not significant
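A minimal sketch of how the F statistic is computed in a one-way ANOVA: the variance between group means divided by the variance within groups. The three groups are hypothetical, and the function name `f_statistic` is illustrative.

```python
def f_statistic(groups):
    """One-way ANOVA F: between-group variance divided by within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)

    # Variation of each group's mean around the overall mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group's mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical measurements for three groups.
sample_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = f_statistic(sample_groups)
print(f_value)
```

The result still has to be compared with F crit (from a table or from the spreadsheet's ANOVA output) before deciding whether to reject the null hypothesis.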
In Massachusetts, are there more
suicides in local jails or in the
prison system?
What you still don’t know
What accounts for the difference?
For that you need a t-test, regression or other tool.
‘HOW TO CHOOSE’ MADE EASY
THE 2 MOST ESSENTIAL QUESTIONS:
1. DO YOU HAVE CATEGORICAL OR CONTINUOUS DATA IN THE VARIABLES?
2. WHAT IS YOUR INDEPENDENT VARIABLE AND DEPENDENT VARIABLE?
INDEPENDENT   DEPENDENT     STATISTICS
Categorical   Categorical   CROSS-TAB
Continuous    Continuous    LINEAR REGRESSION / MULTIPLE REGRESSION
Categorical   Continuous    ANALYSIS OF VARIANCE / ANOVA
Continuous    Categorical   LOGISTIC REGRESSION
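The table above amounts to a simple lookup keyed on the two answers. A sketch, with illustrative names (`TEST_BY_TYPES`, `choose_test`) that are not from the slides:

```python
# The slide's table, keyed by (independent variable type, dependent variable type).
TEST_BY_TYPES = {
    ("categorical", "categorical"): "cross-tab",
    ("continuous", "continuous"): "linear regression / multiple regression",
    ("categorical", "continuous"): "analysis of variance (ANOVA)",
    ("continuous", "categorical"): "logistic regression",
}

def choose_test(independent, dependent):
    """Look up the suggested technique for a pair of variable types."""
    return TEST_BY_TYPES[(independent.lower(), dependent.lower())]

# Example: facility type (categorical) vs. a rate (continuous).
print(choose_test("Categorical", "Continuous"))
```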
IT’S A FINE DAY FOR LINEAR REGRESSION!
Image by Paul Wesley
Linear Regression
I. Does the data fit the 1st assumption: is there a linear relationship?
1. Scatter plot
2. Trendline
3. Create a new variable
II. The last assumption: the data should approximate a bell curve (normal distribution).
1. Data Analysis ToolPak - Descriptive Statistics
2. Mean and median should be close to each other
3. Tick Summary Statistics
4. Tick Confidence level >> 95%
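The "mean and median should be close" check can be sketched outside the spreadsheet too. The gap is measured in standard deviations so it does not depend on the units; the 0.2 cut-off is an arbitrary assumption for illustration, not a rule from the slides.

```python
import statistics

def mean_median_gap(values):
    """Distance between mean and median, expressed in standard deviations."""
    gap = statistics.mean(values) - statistics.median(values)
    return abs(gap) / statistics.stdev(values)

def looks_roughly_symmetric(values, threshold=0.2):
    """Rough screen only; the threshold is a hypothetical cut-off."""
    return mean_median_gap(values) < threshold

# Hypothetical data: one balanced set, one with a single extreme value.
balanced = [1, 2, 3, 4, 5]
skewed = [1, 1, 1, 1, 100]
print(looks_roughly_symmetric(balanced), looks_roughly_symmetric(skewed))
```

A scatter plot or histogram is still the better check; this only flags data where one tail is dragging the mean away from the median.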
X vs. Y
Source: http://www.gradeamathhelp.com/x-axis-and-y-axis.html
Source: assetinsights.net
Source: Indian Journal of Dermatology
https://tinyurl.com/ydad546c
Linear Regression
Conditions met? Run the Regression from
the Data Analysis ToolPak:
Y Range - Dependent
X Range - Independent
Turn on LABELS
CONFIDENCE LEVEL 95%
NEW WORKSHEET - REGRESSION
RESIDUALS
ADJUSTED R SQUARE 0 TO 1.0. The
closer it gets to 1, the closer the fit is to
perfect.
SIGNIFICANCE F
THE RESIDUAL STORY - Sort!
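For readers working outside Excel, the same steps can be sketched in plain Python: fit a least-squares line, compute R square and adjusted R square, then sort the residuals to find the observations the line explains worst. The data points and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Simple least-squares regression of y on a single x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x

    # Residual = actual value minus the value the line predicts.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R square with one predictor: penalizes small samples.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return slope, intercept, r2, adj_r2, residuals

# Hypothetical data; x is the independent variable, y the dependent one.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 12.0]
slope, intercept, r2, adj_r2, residuals = fit_line(xs, ys)

# Sort by size of residual: the biggest misses are where the reporting starts.
ranked = sorted(zip(xs, residuals), key=lambda pair: abs(pair[1]), reverse=True)
print(slope, intercept, adj_r2, ranked[0])
```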
THE LINEAR REGRESSION IS
JUST THE BEGINNING OF
THE REPORTING
Conrad Carlberg - Statistical
Analysis
Thank you!
Jodi Upton: jodi.upton@gmail.com and @jodiupton
Crina Boros: crinaboros@gmail.com
Special thanks to: Jennifer LaFleur, Center for Investigative
Reporting/Reveal
Steve Doig, Arizona State University

Editor's notes

  1. Starting at the bottom...
  2. Heard continuous referred to as ‘infinite’, but it’s really not. Income and prices, for example, are limited to two decimal places. Another way that may help: continuous (measured) vs. discrete (counted). If it would take ‘forever’ to count, it’s probably continuous.
  3. Amy Poehler
  4. Skew = body (positive or negative); kurtosis = tail
  5. All you really need to know: is my data evenly distributed?
  6. The margin of error does not depend on the size of the population; it depends on the size of the sample. (In astronomy, the margin of error is about 4.24 light years -- the distance to Proxima Centauri)
  7. Important: 1 in 20 observations will NOT! Bayesian: start with a different hypothesis, 100% within curve, but may be ‘off’
  8. ANOVA was created by an evolutionary biologist and statistician, who wanted to be able to tell if two groups were the same or different, i.e., were they the same species or not?
  9. In other words, is there enough randomness within the sample, that it outweighs any variance between the samples -- and any measured difference is the result of chance?
  10. “F” stands for Sir Ronald Fisher, who invented this. “P” stands for the probability that -- if the null hypothesis is true -- the results are ‘extreme’ (one in 500 chance of being wrong)
  11. A paper ‘suicide suit’ worn by a model