Introduction to statistical concepts (population, sample, sampling, central tendency, spread). Mainly aimed at language teachers in advanced studies programmes (e.g., Master's courses).
2. OBJECTIVES OF THIS SESSION
You will learn how to construct a sample.
You will learn how to describe your sample using statistical methods.
You will learn how to find connections between different phenomena in your data.
3. OUTLINE OF THIS SESSION
1. Populations and samples
2. Different types of data
3. Univariate analysis
Central tendency
Spread
4. Bivariate analysis
Cross-tabulations
T-Tests
Correlations
5. POPULATION
The total number of people (or events, or things) whose
properties or behaviour we are interested in
understanding
e.g. University students in Austria
Symbol: N
[Diagram labels: Population, Sampling frame, Sample]
6. SAMPLING FRAME
The total number of people (or events, or things) that we
have access to for our research
e.g. Students currently present in this classroom
7. SAMPLE
The total number of people who were contacted and
agreed to participate in the study
Symbol: n
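As a rough illustration of how N, the sampling frame, and n relate, the sketch below draws a sample in Python. The student IDs, the frame size of 30, and the use of simple random sampling are assumptions made for this example; the slides do not prescribe a sampling method.

```python
import random

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical population: 1,000 student IDs (N = 1000).
population = [f"student_{i:04d}" for i in range(1000)]

# Hypothetical sampling frame: the 30 students we can actually reach.
sampling_frame = population[:30]

# Simple random sample of n = 10, drawn from the sampling frame only.
sample = random.sample(sampling_frame, k=10)

N = len(population)
n = len(sample)
print(f"N = {N}, sampling frame = {len(sampling_frame)}, n = {n}")
```

Note that the sample can only ever contain members of the sampling frame, which is why a frame that differs from the population limits what the sample can tell us about the population.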
13. CATEGORICAL / NOMINAL DATA
A variable is nominal if values do not have a numerical relation to each other.
Examples:
Gender (M / F / Other)
Place of Birth
14. ORDINAL DATA
Ordinal variables are like categorical ones, but we can rank the values according to order, size,
frequency, etc.
Examples:
Level of education (High School, BA, MA/Mag., Doctorate)
Attitudes (strongly disagree, disagree, neutral, agree, strongly agree)
15. CONTINUOUS / SCALE DATA
A variable is continuous if it contains an infinite number of values that can be mathematically
manipulated.
Examples:
Age (12, 13, 13:2, 14…)
Height (165cm, 167cm, 183cm…)
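The three data types differ in what we are allowed to do with the values. The sketch below is one possible way to show this in Python; the place names are hypothetical, and storing the rank order in a list is just a convenient assumption for the example.

```python
from collections import Counter

# Nominal: categories with no inherent order; we can only count them.
place_of_birth = ["Graz", "Wien", "Graz", "Linz"]  # hypothetical values
counts = Counter(place_of_birth)

# Ordinal: categories with a rank order, listed here from lowest to highest.
EDUCATION_ORDER = ["High School", "BA", "MA/Mag.", "Doctorate"]
education = ["MA/Mag.", "High School", "Doctorate", "BA"]
ranked = sorted(education, key=EDUCATION_ORDER.index)  # ranking is meaningful

# Continuous: numbers we can do arithmetic on.
heights_cm = [165, 167, 183]
height_difference = max(heights_cm) - min(heights_cm)  # subtraction is meaningful

print(counts)
print(ranked)
print(height_difference)
```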
17. MEASURES OF CENTRAL TENDENCY
Mode (the most common value)
Median (the middle value)
Mean (the sum of all values divided by the number of values)
18. EXAMPLE (RAW DATA)
Case Height Gender Loves Statistics
1 167 M Strongly disagree
2 178 M Strongly agree
3 189 F Agree
4 201 F Agree
5 182 M Disagree
6 175 F Strongly agree
7 162 M Strongly disagree
8 180 F Disagree
9 187 M Agree
20. CENTRAL TENDENCY: THE MODE
Gender N %
--Male 5 56
--Female 4 44
--Total 9 100
Case Gender (1 = Male, 2 = Female)
1 1
2 1
3 2
4 2
5 1
6 2
7 1
8 2
9 1
“The majority of respondents were male (n = 5, 56%).”
“Respondents were almost evenly split between male (n = 5, 56%) and female (n = 4, 44%).”
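A minimal sketch of how the mode and the frequency table for gender could be obtained in Python. The list is copied from the Gender column of the raw-data slide; the variable names are chosen for this example only.

```python
from collections import Counter
from statistics import mode

# Gender column copied from the raw-data table (cases 1 to 9).
gender = ["M", "M", "F", "F", "M", "F", "M", "F", "M"]

counts = Counter(gender)      # frequency of each category
most_common = mode(gender)    # the mode: the most frequent category

# Share of the sample in each category, as a percentage with one decimal.
percentages = {g: round(100 * c / len(gender), 1) for g, c in counts.items()}

print(most_common, dict(counts), percentages)
```

The mode is the only measure of central tendency that works for nominal data, because it needs nothing more than counting.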
21. CENTRAL TENDENCY: THE MEDIAN
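Coding: 4 = Strongly agree, 3 = Agree, 2 = Disagree, 1 = Strongly disagree
Case Loves statistics (coded)
1 1
2 4
3 3
4 3
5 2
6 4
7 1
8 2
9 3
Sorted from highest to lowest (the fifth, middle value is the median):
Case Loves statistics (coded)
2 4
6 4
3 3
4 3
9 3
5 2
8 2
1 1
7 1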
I love statistics N %
--Strongly agree 2 22
--Agree 3 33
--Disagree 2 22
--Strongly disagree 2 22
--Total 9 100*
*Percentages do not add up to 100 because of rounding.
“As can be seen in Table 1, attitudes towards statistics were largely positive (Mdn = 3).”
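The median of an ordinal variable can be found by coding the answers as ranks, sorting them, and taking the middle value. The sketch below does this for the "Loves Statistics" column of the raw-data slide. The numeric coding (1 = Strongly disagree up to 4 = Strongly agree) is an assumption made for this example.

```python
from statistics import median

# Assumed coding of the answer categories, from lowest to highest agreement.
CODES = {"Strongly disagree": 1, "Disagree": 2, "Agree": 3, "Strongly agree": 4}

# "Loves Statistics" column copied from the raw-data table (cases 1 to 9).
answers = [
    "Strongly disagree", "Strongly agree", "Agree", "Agree", "Disagree",
    "Strongly agree", "Strongly disagree", "Disagree", "Agree",
]

coded = [CODES[a] for a in answers]
ordered = sorted(coded)        # the median is the middle of the sorted values
mdn = median(coded)            # with 9 cases, this is the 5th sorted value

print(ordered, mdn)
```

With nine cases the median is the fifth sorted value, which here corresponds to "Agree".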
22. CENTRAL TENDENCY: THE MEAN
1,621 / 9 = 180.1
Case Height
1 167
2 178
3 189
4 201
5 182
6 175
7 162
8 180
9 187
Total 1,621
“Respondents were rather tall (M = 180.1)”
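The mean is simply the sum of the values divided by the number of cases. A short sketch using the Height column of the raw-data slide (values assumed to be in centimetres):

```python
from statistics import mean

# Height column copied from the raw-data table (cases 1 to 9).
heights = [167, 178, 189, 201, 182, 175, 162, 180, 187]

total = sum(heights)
n = len(heights)
m = total / n                 # the mean, computed by hand

print(total, n, round(m, 1))
print(round(mean(heights), 1))  # same result from the standard library
```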
24. COMPARE THESE TWO SCHOOLS
[Chart comparing School A and School B. Vertical axis: 0 to 8; horizontal axis: score bands 40-49, 50-59, 60-69, 70-79, 80-89, 90-100. Based on Muijs 2007]
25. COMPARE THESE TWO SCHOOLS
Case School A School B
1 45 60
2 50 65
3 55 65
4 60 70
5 65 70
6 70 70
7 70 70
8 75 70
9 80 70
10 85 75
11 90 75
12 95 80
Median 70 70
Mean 70 70
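The sketch below recomputes the median and mean for both schools from the table above. It is only meant to show that two quite different sets of scores can share the same central tendency; the variable names are chosen for this example.

```python
from statistics import mean, median

# Test scores copied from the table (cases 1 to 12).
school_a = [45, 50, 55, 60, 65, 70, 70, 75, 80, 85, 90, 95]
school_b = [60, 65, 65, 70, 70, 70, 70, 70, 70, 75, 75, 80]

for name, scores in (("School A", school_a), ("School B", school_b)):
    print(name, "median:", median(scores), "mean:", mean(scores))
```

Both schools have a median and a mean of 70, so a measure of spread is needed to tell them apart.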
26. MEASURES OF SPREAD
Range (the difference between the highest and the lowest value)
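Interquartile range (the difference between the third and the first quartile, i.e. the range of the middle 50% of the values, which leaves out the extremes)
Standard deviation (roughly, the typical distance of the values from the mean)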
27. MEASURES OF SPREAD: RANGE
Case School A School B
1 45 60
2 50 65
3 55 65
4 60 70
5 65 70
6 70 70
7 70 70
8 75 70
9 80 70
10 85 75
11 90 75
12 95 80
Median 70 70
Mean 70 70
Range
Range (School A): 95 – 45 = 50
Range (School B): 80 – 60 = 20
“The test scores in School A ranged from 45 to 95 (M = 70). Scores in School B were more tightly distributed, ranging from 60 to 80 (M = 70).”
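The range calculation can be reproduced in a few lines. This is a minimal sketch; the helper name `value_range` is made up for this example.

```python
# Test scores copied from the table (cases 1 to 12).
school_a = [45, 50, 55, 60, 65, 70, 70, 75, 80, 85, 90, 95]
school_b = [60, 65, 65, 70, 70, 70, 70, 70, 70, 75, 75, 80]


def value_range(values):
    """Range: highest value minus lowest value."""
    return max(values) - min(values)


range_a = value_range(school_a)
range_b = value_range(school_b)
print(range_a, range_b)
```

Because the range depends only on the two most extreme values, a single unusual score can change it considerably.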
28. MEASURES OF SPREAD: INTERQUARTILE RANGE
Case School A School B
1 45 60
2 50 65
3 55 65
4 60 70
5 65 70
6 70 70
7 70 70
8 75 70
9 80 70
10 85 75
11 90 75
12 95 80
IQR
IQR (School A): 82.5 – 57.5 = 25
IQR (School B): 72.5 – 67.5 = 5
“Although the average test performance in both schools was similar (M = 70), the test scores in School A were much more widely distributed than those in School B (IQR_A = 25, IQR_B = 5).”
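There are several conventions for locating quartiles, and software packages may return slightly different values. The quartiles on this slide (57.5 and 82.5 for School A) match the "median of each half" rule, so the sketch below assumes that rule. The function name is hypothetical.

```python
from statistics import median

# Test scores copied from the table (cases 1 to 12).
school_a = [45, 50, 55, 60, 65, 70, 70, 75, 80, 85, 90, 95]
school_b = [60, 65, 65, 70, 70, 70, 70, 70, 70, 75, 75, 80]


def iqr_median_of_halves(values):
    """Return (Q1, Q3, IQR) using the median of the lower and upper halves.

    Assumes at least two values. With an odd number of values, the middle
    value is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])    # median of the lower half
    q3 = median(ordered[-half:])   # median of the upper half
    return q1, q3, q3 - q1


print(iqr_median_of_halves(school_a))
print(iqr_median_of_halves(school_b))
```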
29. MEASURES OF SPREAD: STANDARD DEVIATION
Case School A School B
1 45 60
2 50 65
3 55 65
4 60 70
5 65 70
6 70 70
7 70 70
8 75 70
9 80 70
10 85 75
11 90 75
12 95 80
Median 70 70
Mean 70 70
SD (School A) / SD_A: 15.81
SD (School B) / SD_B: 5.22
“The test scores in School A were satisfactory (M = 70, SD = 15.81). School B reported similar results, which clustered more tightly around the average (M = 70, SD = 5.22).”
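The values on this slide correspond to the sample standard deviation, which divides the sum of squared deviations by n - 1. The sketch below computes it once step by step and once with the standard library; the helper name `sample_sd` is made up for this example.

```python
from math import sqrt
from statistics import mean, stdev

# Test scores copied from the table (cases 1 to 12).
school_a = [45, 50, 55, 60, 65, 70, 70, 75, 80, 85, 90, 95]
school_b = [60, 65, 65, 70, 70, 70, 70, 70, 70, 75, 75, 80]


def sample_sd(values):
    """Sample standard deviation, computed step by step."""
    m = mean(values)
    squared_deviations = [(v - m) ** 2 for v in values]
    return sqrt(sum(squared_deviations) / (len(values) - 1))


sd_a = sample_sd(school_a)
sd_b = sample_sd(school_b)
print(round(sd_a, 2), round(sd_b, 2))

# statistics.stdev uses the same n - 1 definition.
print(round(stdev(school_a), 2), round(stdev(school_b), 2))
```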
31. POP QUIZ
Average and mean are the same thing
In daily use, the words average and mean are interchangeable. In statistics, the mean is one type of average. The mode and median are also types of average.
We must always use the median with ordinal variables
Technically, we can use both the median and the mode, but the median is a more powerful metric. The third option, the
mean, cannot be used with ordinal data.
We must always use the mean with continuous variables
It is usually the best option. However, if we have unusual data (with one or two very high or very low values) it may be better
to use the median.
We can calculate mean values in a Likert scale (1: Strongly Agree; 2: Agree; 3: Disagree; 4: Strongly Disagree)
Some people do, but you shouldn’t. Likert scales produce ordinal data. You should not use the mean when your data is
ordinal.
The appropriate spread metric for nominal variables is the IQR
No. Nominal data cannot be ranked in any sensible way, so they do not have a spread.
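To see why the median can be the safer choice when a continuous variable contains an extreme value, consider the sketch below. The numbers are hypothetical and chosen only to make the effect visible.

```python
from statistics import mean, median

# Hypothetical continuous data without an extreme value.
typical = [30, 32, 35, 38, 40]

# The same data with one very high value added.
with_outlier = typical + [400]

print(mean(typical), median(typical))
print(round(mean(with_outlier), 2), median(with_outlier))
```

One extreme value pulls the mean from 35 to roughly 96, while the median barely moves (35 to 36.5).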
33. CROSSTABULATIONS
We use a cross-tabulation when we want to compare two ordinal or nominal variables
Examples:
Gender x Favourite colour
School type x Attitudes towards mathematics
37. CHI-SQUARE
“A statistically significant difference was found in the toy preferences of boys and girls. As can be seen in Table 1, boys were much more likely than girls to prefer action figures (χ² = 36.068, df = 1, p < .001).”
Gender AF BD Total
--Male 48 2 50
--Female 25 35 60
--Total 73 37 110
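The chi-square statistic compares the observed counts in the cross-tabulation with the counts we would expect if gender and toy preference were unrelated. The sketch below recomputes it from the table using only the standard library. It assumes no continuity correction is applied, and the p-value shortcut used here is valid for df = 1 only.

```python
from math import erfc, sqrt

# Observed counts from the table: rows = Male, Female; columns = AF, BD.
observed = [[48, 2], [25, 35]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if the two variables were independent.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 1 only, the chi-square p-value equals erfc(sqrt(chi2 / 2)).
p_value = erfc(sqrt(chi2 / 2))

print(round(chi2, 3), df, p_value < 0.001)
```

For larger tables, a statistics package would normally be used to obtain the p-value.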
39. T-TESTS
We use a t-test to see if there is any connection between a nominal variable with two values (the independent variable) and a continuous one (the dependent variable).
The t-test splits your sample into two groups (e.g., boys and girls), calculates the mean value of the dependent variable for each group, and then compares the two means.
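As a rough sketch of what happens inside a t-test, the code below splits the heights from the raw-data slide by gender and computes the t statistic by hand. It assumes the classic pooled-variance (Student's) version of the test, which the slides do not specify. The p-value is not computed here, because it requires the t distribution from a table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance

# Heights from the raw-data table, split by gender.
heights_m = [167, 178, 182, 162, 187]   # cases 1, 2, 5, 7, 9
heights_f = [189, 201, 175, 180]        # cases 3, 4, 6, 8

n1, n2 = len(heights_m), len(heights_f)
mean1, mean2 = mean(heights_m), mean(heights_f)

# Pooled variance: a weighted average of the two sample variances.
pooled_var = ((n1 - 1) * variance(heights_m)
              + (n2 - 1) * variance(heights_f)) / (n1 + n2 - 2)

standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (mean1 - mean2) / standard_error    # difference in means, in SE units
df = n1 + n2 - 2

print(round(mean1, 2), round(mean2, 2), round(t, 3), df)
```

With only nine cases this is an illustration of the mechanics, not a meaningful test.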
43. CORRELATIONS
We use a correlation (e.g., Spearman's or Pearson's coefficient) to see if there is any connection between two continuous variables (e.g. weight and height).
Correlations range from -1 to 1. A value close to 1 or -1 means that the two variables are strongly connected (positively or negatively). A value close to 0 means that there is little or no linear connection.
We can depict correlations visually with a scatterplot.
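Pearson's coefficient can be computed by hand as the sum of the products of the deviations from the two means, divided by the square root of the product of the two sums of squared deviations. The sketch below implements this; the function name and the height and weight figures are hypothetical and serve only as an illustration.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equally long lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mx, my = mean(x), mean(y)
    dx = [a - mx for a in x]              # deviations from the mean of x
    dy = [b - my for b in y]              # deviations from the mean of y
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


# Hypothetical heights (cm) and weights (kg) for five people.
height = [160, 165, 170, 180, 190]
weight = [55, 62, 66, 80, 85]

r = pearson_r(height, weight)
print(round(r, 2))
```

A value near 1 indicates that taller people in this made-up sample also tend to be heavier; it says nothing about whether one causes the other.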
50. SUMMARY
             Nominal                         Ordinal                         Continuous
Nominal      Crosstabs / χ²                  Crosstabs / χ²                  T-Test (if it has two values)
Ordinal      Crosstabs / χ²                  Crosstabs / χ²                  T-Test (if it has two values)
Continuous   T-Test (if it has two values)   T-Test (if it has two values)   Correlation
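The summary table can be read as a simple lookup: given the types of two variables, it names the procedure to use. The sketch below expresses that lookup as a small function. The function name and the returned labels are made up for this example.

```python
CATEGORICAL = {"nominal", "ordinal"}
VALID_TYPES = CATEGORICAL | {"continuous"}


def choose_test(type1, type2):
    """Suggest a bivariate procedure for two variable types, per the table."""
    if type1 not in VALID_TYPES or type2 not in VALID_TYPES:
        raise ValueError("types must be 'nominal', 'ordinal' or 'continuous'")
    if type1 in CATEGORICAL and type2 in CATEGORICAL:
        return "crosstab / chi-square"
    if type1 == "continuous" and type2 == "continuous":
        return "correlation"
    # One categorical and one continuous variable.
    return "t-test (if the categorical variable has two values)"


print(choose_test("nominal", "ordinal"))
print(choose_test("ordinal", "continuous"))
print(choose_test("continuous", "continuous"))
```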
51. POP QUIZ
If I want to test whether there is a connection between music preferences and gender, I must use a cross-tab
That is correct. Music preferences and gender are both nominal variables. The correct procedure for pairing nominal variables is a crosstab (and chi-square).
A p value of 0.045 shows that something is statistically significant.
That is correct. The usual threshold of statistical significance in educational research is 0.05, and anything lower than
that is considered significant.
I can prove that something is causing something else using a Pearson's correlation coefficient.
No, you cannot. Correlation does not imply causation.