As a school leader or head of subject you are required to analyse attainment data relating to whole cohorts of learners. From this analysis you need to produce timely interventions and measurable initiatives to improve the very performance you are monitoring. This requires data analysis - something that most teachers and leaders either find daunting or only address in a superficial manner. Covering chapters on Data Analysis, Problems With The Mean, Comparative Statistics, Analysis Of Variance and the incredibly powerful General Linear Model (GLM) - this is a text book for real teachers faced with real issues in real classrooms.
This edition published by LULU, February 2012
ISBN: 978‐1‐4716‐1146‐9
This work is licensed under a Creative Commons Attribution‐NonCommercial‐
ShareAlike 3.0 Unported License (CC BY‐NC‐SA 3.0).
To view a copy of this license, visit http://creativecommons.org/licenses/by‐nc‐sa/3.0/
or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco,
California 94105, USA. Whilst the Creative Commons License for this book entitles you
to distribute / modify the work for non‐commercial use, without additional
permissions, we kindly request that you inform the authors of any intention to re‐
publish / remix this title. Send an email to mean@goingbeyond.co.uk
Every effort has been made to contact perceived copyright holders for material
reproduced in this publication. Any omissions or oversights will be rectified in
subsequent editions if written notice is given to the author. All trademarks are the
property of their respective owners. The authors are not associated with any product
or vendor mentioned in this book except where stated. Unless otherwise stated, any
third‐party quotes, images and screenshots, or portions thereof, are included under
‘fair use’ for comment, news reporting, teaching, scholarship, and research.
Acknowledgements
The authors would like to thank Michelle Gilchrist for her help, support and tireless
proofreading skills, without which this book would not have seen the light of day.
Disclaimer
This is a book aimed at those readers wanting to explore data as used to drive decisions
in schools. It is not a comprehensive guide to statistics – no responsibility is assumed
or accepted for your decisions based on your data. Using the techniques detailed in
this text provides an aid to decision making, whereas the decision to act is left to the
discretion of the reader. No liability can be placed with the authors of this text.
By using the material contained within this guide, you acknowledge that you have read
and accept this disclaimer.
Contents
Introduction
It’s easy to see why data is mishandled and unsafe conclusions drawn.
Essential definitions
A word about software
Minitab
Final note
Chapter 1
DATA ANALYSIS THAT SCHOOLS “DO”
Why we use the mean average
Factors we can compare
Central tendency
The mean ‐ a point statistic
More sophisticated analysis
Complementing the mean – bar charts
Using the mean to compare “segments” of data
Using the language of statistics
The wider school picture
Call to action
Conclusions
Chapter 2
THE PROBLEMS WITH THE MEAN
Statistics in action
Call to action
Problems with the mean
Call to action
The dangers of presumption – pre analyzing the data
Call to action
What do your bar charts show?
Ethics, politics and “getting your own way”
Call to action
How big an effect / difference is “big enough” to matter?
Extra information in a “modified” bar chart
Call to Action
Looking at a whole cohort
Preconceptions again
Conclusions
Chapter 3
COMPARATIVE STATISTICS
What does significant mean?
T‐tests and p values
Calculating significance using Excel
Excel command for T‐testing
Call to action
Conclusions
Chapter 4
FACTORS WITH MULTIPLE LEVELS
Multi level factors
Combine levels to make a binary solution
Calculating t‐test for “binned” data
Limits of the t‐test
Multi level factors
ANALYSIS OF VARIANCE
Does attendance affect attainment?
Fitting a trend line to Excel data
Using R² to check for “goodness” of fit
One way Analysis of Variance (ANOVA)
Non numeric multi level factors
Call to action
Pause for breath ……..
Questions to reflect on
Conclusions
Chapter 5
GENERAL LINEAR MODEL (GLM)
Constructing a GLM
Deeper analysis
Extending the GLM
Building interactions into the GLM
Big implications of the GLM
Call to action
Conclusions
Chapter 6
MAIN EFFECTS
Main Effects Plot
Interactions Plot
Call to action
Conclusions
Chapter 7
FINAL REMARKS
Tools you’ll need:
Introduction
Every school leader, head of subject and classroom teacher will recognize the
following scenario:
It’s a school INSET, and what wonderful pedagogical expertise is going to be shared
with you, the willing staff? Yes, you’ve guessed it: “Addressing the gender differential”.
The very name sends waves of déjà‐vu through the staff and the authors of this book
develop an instant migraine.
We’re not denying that there is a difference between the genders and their approach
to education; nor are we suggesting that, as teachers and leaders, you don’t need
to monitor whether the situation is improving or deteriorating. What brings
us to the point of tears is that the statement is based on poorly and superficially
analyzed data.
As we will show in this book, it’s easy to assume that responses will be different for a
certain factor, and when you just look at the mean of a data set, this “difference” is often
seen – you’ve then proved your initial assumption and you don’t look for a more
fundamental root cause. In our experience, this is the case with the gender
differential, and I bet you’ve fallen into it too.
When we came into teaching, for the first time in our professional lives we became
aware of the situation of being “data rich but information poor”. Education abounds
with numbers, and schools, students & teachers have never been “measured” as much
as they are in 2010‐2011 [1].
But which numbers do you use and which demand that you take them seriously?
[1] Whilst this appears to be particularly true of the English / Welsh systems, all educational infrastructures
constantly battle with league tables, “banding” and other lists.
It’s easy to see why data is mishandled and unsafe conclusions drawn.
Until very recently, use of correct descriptive statistics was the preserve of the
statistician, often resulting in the calculation of arcane numbers, utilizing impenetrable
mathematics. Indeed, pick up anything but the most basic of statistics text books and
the reader will soon be swimming in a sea of mathematical notation, far beyond the
readability of those without degrees in mathematics.
But with the change in responsibilities, the TLR structure, and the reduction in
extraneous funding, the expectation is that as a subject/school leader, you undertake
data analysis and draw conclusions.
I doubt you’re trained in statistics (and why should you be?) ‐ so instead of carrying out
statistically valid analysis you have returned to that most basic of measures – the
“average” – after all, it’s easy to calculate and means something, doesn’t it?
Throughout the text of this book, we will look at analysing the data a typical
department in a school might produce – initially by calculating “means” and developing
this into a more rigorous assessment of data.
So dear reader, this book is aimed at classroom practitioners, heads of department and
school leaders seeking a deeper understanding of what your data actually shows.
In a nutshell, we’re going to take you “beyond the mean”.
Glen Gilchrist & Alexavier Fareheed
2012
Essential definitions
We need to define three vital terms that will be used throughout this text:
Factor: A factor is a variable whose values are independent of changes in the values of
other variables. Traditionally factors are the groups into which we split our
data – gender, SEN, free school meals are examples of educational factors.
Level: Factors can be split into different values. Statistically, these values are called
levels.
Levels can be numerical, quantitative or qualitative, binary or multi level.
Binary Levels
Levels can be binary in nature “boy or girl”, “SEN or not” and can be
represented numerically “1=boy, 2=girl” or remain as text.
Multilevel Levels
Levels are not always binary; “originating primary school”, for example, could be
one of 10 or more levels, with each school either referred to by name or by a
coded “number”: 1=School A, 2=School B etc.
For continuous levels (age and attendance are good examples) levels
themselves might be grouped together to make analysis easier. These
groupings are often called “bins” and reference will be made to “bin size”.
Attendance for example could be binned as:
‐1 = less than 80%
0= 80% to 89.9%
1 = 90% and greater
The numerical value of the groups (‐1, 0, 1) is not important and the labels are
used to identify the grouped levels. Some consideration needs to be given to
the size / range of the groupings as this choice can affect subsequent
data analysis – however this is outside the scope of this text, and for
the analysis undertaken in schools, just ensure that the bins are
“sensible”.
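For readers comfortable with a little code, the attendance bins above can be sketched as follows. This is only an illustration in Python (a language this text does not otherwise use); the function name `bin_attendance` and the attendance figures are our own invention.

```python
def bin_attendance(attendance_pct: float) -> int:
    """Return the bin label (-1, 0 or 1) for an attendance percentage."""
    if attendance_pct < 80:
        return -1  # less than 80%
    if attendance_pct < 90:
        return 0   # 80% to 89.9%
    return 1       # 90% and greater


# Hypothetical attendance figures, invented purely for illustration
attendance = {
    "Learner 1": 72.5,
    "Learner 2": 85.0,
    "Learner 3": 90.0,
    "Learner 4": 97.3,
}

bins = {name: bin_attendance(pct) for name, pct in attendance.items()}
print(bins)
```

The labels carry no numerical meaning; any three distinct labels would serve equally well.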
Response: The response is the output that you are measuring. For school based
data, average or total points score and number of “C’s” are the typical
responses measured.
A word about software
MS Excel is referred to throughout this text and is used as convenient shorthand for
“spreadsheet”. We acknowledge that other spreadsheets such as OpenOffice and
GoogleDocs are available and can be used fairly interchangeably for MS Excel (except
where indicated). Each has its strengths / weaknesses, but all process statistical
information in much the same manner. There is no need to change your spreadsheet
package to complete the numerical analysis undertaken in the majority of this text.
Some of the more advanced statistics require the use of a dedicated statistics tool.
Recently the cost of these tools has fallen dramatically and academic licenses can be
obtained for less than £50. We cannot recommend strongly enough the value in
obtaining the correct tool to analyze your data.
A great list is maintained at Wikipedia, which compares different statistical tools, their
costs and licenses: http://en.wikipedia.org/wiki/Comparison_of_statistical_packages.
Minitab
Throughout this book the authors make use of Minitab, a conveniently easy tool to
get to grips with and available at an excellent price (from sub £20)
(http://www.minitab.com/en‐GB/academic/licensing‐options.aspx). The publisher also
makes available a free 30 day trial – more than enough time to learn the ropes and to
process data for your self evaluation.
Final note
The authors are practicing teachers, currently heads of subject in maintained
secondary schools and have no association with any of the tools / software / publishers
mentioned in this text.
“Data analysis is a journey whose only destination is enlightenment – get ready for
the ride of your life.” Glen & Alexavier – February 2012
Chapter 1
Data analysis that schools “do”
One of the biggest challenges in getting data used correctly in schools used to be the
actual collection and manual processing of the “numbers”. Now with tools such as MS
Excel, OpenOffice and GoogleDocs available to all, the challenge has shifted to the
actual processing and analysis that turns “numbers” into “data”.
Courses abound in educational circles about the “use” of data, but in our
experience they all focus on three areas:
1. Sources of baseline data (CATs, FFT, Government, Feeder Primaries)
2. Segmenting the data (gender, free school meals, SEN)
3. Monitoring, assessing and explaining student performance against (1) and (2)
Valuable as these courses are (and a significant improvement on not using data), they
all focus on basic statistics – the mean average, range and a cursory diversion into
drawing and formatting bar / line graphs; and whilst this is encouraged, reliance on
these measures alone can lead to poorly drawn and costly conclusions.
Why we use the mean average
Whilst Excel et al have democratized the collection and analysis of data, they have also
exposed the fact that most users of these tools are unaware how to use them at a high
enough level to process statistical information. As a result, most users are content with
tabulation, calculation of “averages” of data sets and with drawing basic, overly
coloured bar charts.
These “averages” are then used to draw conclusions, usually in the form of
comparisons; Boys vs Girls, free school meals vs non free school, English vs Maths,
2009 vs 2010, one school vs another.
Factors we can compare
The candidate list for comparison is long: special educational needs, ethnicity, “looked
after”, target group, literacy “booster” support or a hundred‐and‐one other
educational imperatives. This is a situation that I am certain occurs in your school. Indeed the
schools inspection framework [2] demands that schools use data to “identify, plan and
monitor” the attainment of “groups” of learners. Without extensive use of such data,
schools cannot hope to achieve a coveted “Grade 1” status.
We will expose in this chapter the dangers of using just the mean to represent a data
set, and show how drawing conclusions from the mean alone can lead to costly and
unnecessary interventions.
Central tendency
Used in this context, the mean is a “measure of central tendency” [3].
The two most widely used measures of "central tendency" of data are the mean
(average) and the median. For example, to calculate the mean weight of 50 people,
add the 50 weights together and divide by 50. To find the median weight of the 50
people, order the data and find the number that splits the data into two equal parts.
The median is generally a better measure of the centre when there are extreme values
or outliers because it is not affected by the precise numerical values of the outliers
themselves. (The median is often used to describe “average” earnings in a population, as
it is not affected by a small number of very large (or small) salaries.)
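To see the effect of an outlier on the two measures, here is a minimal sketch in Python using the standard `statistics` module. The salary figures are hypothetical, invented for illustration: nine typical earners and one very large salary.

```python
from statistics import mean, median

# Hypothetical annual salaries in pounds, invented for illustration only
salaries = [21000, 22000, 23000, 24000, 25000,
            26000, 27000, 28000, 29000, 500000]

mean_salary = mean(salaries)      # pulled upwards by the single large salary
median_salary = median(salaries)  # the middle of the ordered data

print(f"Mean salary:   {mean_salary:,.0f}")
print(f"Median salary: {median_salary:,.0f}")
```

In this invented data set the mean is higher than nine of the ten salaries, whilst the median sits in the middle of the typical earners.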
The mean ‐ a point statistic
The mean is a “point” statistic – that is, it reduces an entire data set to a single value,
useful to succinctly describe the data. (However, you lose any sense of the spread and
variability of the numbers). As a result, the mean is the most widely used measure of
central tendency, but as we will see, not always the most useful.
[2] UK wide, but certainly heavily endorsed in England and Wales.
[3] There are three measures of central tendency used to describe data sets – mean, mode and median. If
you are unfamiliar with these terms or just need a recap, remember – Google is your friend.
For example, the Average Points Score for 5 schools in 2011 was:
School   Average Points Score
A        435
B        403
C        440
D        427
E        438
What conclusions can be drawn from this data?
School “C” is the best performing
School “B” is the lowest performing
Schools “A”, “C” and “E” all have similar points scores
School “B” needs to do “something” as its performance is very different to the
other schools.
It’s likely that such analysis is undertaken at this level in both your department and
whole school self evaluations.
The consequences of such analysis are likely to be some form of change, intervention
or closer monitoring. In short, money, time and effort will be expended acting on this
analysis of means. A situation that we are sure has happened in your school or
department.
More sophisticated analysis
Further and seemingly more sophisticated analysis will have you looking at the same
data over a period of 3 or 5 years:
School 2008‐2009 2009‐2010 2010‐2011
A 425 430 435
B 440 420 403
C 411 424 440
D 425 430 427
E 430 438 438
What does this show?
School “C” is the most improved over the 3 years
School “B” has fallen 37 points over 3 years
Schools “D” and “E” have shown little improvement over the three years
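The statements above come from simple subtraction of the first year from the last. A minimal sketch in Python (not a tool this text relies on; the variable names are ours) using the figures from the table:

```python
# Average points scores from the table: 2008-2009, 2009-2010, 2010-2011
scores = {
    "A": [425, 430, 435],
    "B": [440, 420, 403],
    "C": [411, 424, 440],
    "D": [425, 430, 427],
    "E": [430, 438, 438],
}

# Change between the first and last year for each school
changes = {school: years[-1] - years[0] for school, years in scores.items()}

most_improved = max(changes, key=changes.get)
biggest_fall = min(changes, key=changes.get)

for school, change in changes.items():
    print(f"School {school}: {change:+d} points")
print("Most improved:", most_improved)
print("Biggest fall:", biggest_fall)
```

Note that this is still only a comparison of means; it says nothing about the spread of scores behind each figure.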
As part of your self evaluation / action plan – you will have undoubtedly looked at 3
year trends in mean data. You’re likely to have compared your results to that of other
departments, between local, national and family of schools and made pronouncements
on how well you are doing compared to last year.
To try and unravel some of the mystery about what your data is showing you, chances
are you’ll draw a bar chart of the means.
Complementing the mean – bar charts
Let’s complete the analysis and draw a bar chart of the data for the schools over three
years:
What does this chart show us?
It emphasizes the fall in performance of school “B”
The performance gains of school “C” look incredible
School “D” looks all but static over the past three years
Overall, what conclusions can be drawn about schools “A” to “E”?
School “A” is doing something that is improving performance
School “C” is clearly doing something “better” than the other schools and
better than school “A”
School “D” appears not to be doing anything and performance is static
School “E” looks like something happened during 2009‐2010, but these gains
have stopped and the school has not improved since.
School “B” looks like it’s in free fall and standards are falling rapidly
No doubt such analysis is regularly completed by you and/or your senior leadership
team. And if our personal experiences are reflected in your school, stress levels and
anxiety rise in proportion to the preparation and analysis of such data.
Using the mean to compare “segments” of data
As a teacher, administrator or policy maker we often need to compare the means of
two or more populations – essentially to test whether or not an intervention or
observation produces a measurable difference. For example, the average points score
for Year 11 students upon receiving their L2 qualifications is often segmented into data
for males and females.
Average Points Score
Boy 402
Girl 448
As a result of this basic analysis, decisions will be made and policy set.
In this case, “clearly” there is a sex linked differential between Boys and Girls – with
Girls outperforming Boys by some 10%. From this analysis of means an intervention
will be planned – possibly grouping next year’s cohort into separate sex classes,
planning boy friendly lessons and tweaking the seating plans.
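The size of the quoted percentage depends on which group is used as the baseline. A minimal sketch of the arithmetic in Python (our choice of language, not one used elsewhere in this text), using the two means from the table:

```python
boy_mean = 402
girl_mean = 448

gap = girl_mean - boy_mean

# The same gap expressed against each group's mean
gap_vs_boys = gap / boy_mean * 100
gap_vs_girls = gap / girl_mean * 100

print(f"Gap: {gap} points")
print(f"Relative to the boys' mean:  {gap_vs_boys:.1f}%")
print(f"Relative to the girls' mean: {gap_vs_girls:.1f}%")
```

Either way, the figure is still a comparison of two point statistics.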
Again, we’re sure that you’re familiar with such segmentation of data and are certain
that your self evaluation contains statements about the gender differential and how
you intend to tackle it.
Using the language of statistics
At this point, let’s start to use the language of statistics more fully.
In the case above for boys / girls L2 performance:
We have one factor, SEX, split into two levels (Boy and Girl) – we say we have a
binary factor.
Our response is the Average Points Score
From now on, we will use factor, level and response to describe our data.
The wider school picture
Such analysis is extended across the wider school, comparing the differentials in your
subject to those in English, French and DT [4] ‐ as a direct result of this analysis a working
party or even a PLC [5] will be created to tackle the clear differences between subject
areas.
(Whilst written here in a tongue‐in‐cheek manner, I suspect that your school has at
some point created a working party to contemplate differences in responses when
factors are analyzed for mean differences.)
[4] Insert the high performing subject areas in your school.
[5] PLC – Professional Learning Community, school based collaborative action research – for more details
see: http://www.centerforcsri.org/plc/program.html
What can we conclude from this chart?
French has the smallest sex differential
Science has the widest differential
In DT, boys outperform girls
The temptation in this case is to view the French differential (low) as in some way
“better” than the Science differential (high) and to invest time and resources in solving
the “problem”.
We’re not suggesting that this does not need to be solved; just that the data analysis
performed so far does not demand such investigations, merely hints at it.
Call to action
1. Do you know the three measures of central tendency and when to use each
one? Do you know how to get Excel to calculate each?
2. Find your self evaluation and identify where you have used the mean of a data
set to draw a conclusion about segmentation of data
3. Look at the charts and graphs you have created for your exam analysis
meeting. Are they based on means of data? What conclusions did you draw
from them?
4. Look at whole school, local and national data – how often is an entire data set
reduced to a point statistic?
5. How well can you use your spreadsheet tools?
a. Can you enter a formula to calculate the average of a data set?
b. What about counting the numbers in a column when the value in a
different column is a particular value? (COUNTIF() – used to
automatically count data, say based on a column containing the sex of
a learner)
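For readers who prefer code to spreadsheets, the counting described in 5(b) can be sketched in Python. This is an illustrative equivalent only, not the spreadsheet function itself: the helper `count_if` is our own name, and the register is hypothetical data invented for the example.

```python
from collections import Counter

# Hypothetical class register, invented for illustration only
register = [
    {"learner": "Learner 1", "sex": "F", "score": 61},
    {"learner": "Learner 2", "sex": "M", "score": 55},
    {"learner": "Learner 3", "sex": "F", "score": 70},
    {"learner": "Learner 4", "sex": "M", "score": 48},
    {"learner": "Learner 5", "sex": "F", "score": 66},
]


def count_if(rows, column, value):
    """Count the rows whose `column` equals `value`."""
    return sum(1 for row in rows if row[column] == value)


girls = count_if(register, "sex", "F")
boys = count_if(register, "sex", "M")

# Counter tallies every level of the factor in one pass
counts_by_sex = Counter(row["sex"] for row in register)

print(girls, boys, dict(counts_by_sex))
```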
Conclusions
During this chapter we have shown the basic data analysis undertaken by schools. As a
subject team leader, you have probably laboured over such figures yourself,
painstakingly entering figures into MS Excel, creating comparison bar / pie charts and
drawing conclusions based on the mean average of data sets.
You’ve likely taken such figures into exam analysis meetings with your head teacher
and drawn conclusions about why students who obtain free school meals do “less well”
in your subject than, say, Spanish.
All of these things are a step in the road to understanding how to use data effectively
and the fact that you are reading this title demonstrates a clear desire to take your use
of data to a higher, more effective level.
In the coming chapters I’ll show you why data analysis based solely on the mean of a
population is dangerously superficial and can lead to misdirected effort and the
potential to miss a more fundamental underlying truth.
Chapter 2
The problems with the mean
Demonstrating that there are “issues” with using the mean of a data set is often the
most instructive way forward.
Consider the following data obtained for a group of year 10 Maths students.
Student L / R Hand Score
A R 80
B R 78
C R 82
D R 84
E R 76
F L 82
G R 81
H L 79
I L 79
J R 81
K L 84
L R 76
M R 81
N R 78
If we take the average of the left handed and the right handed students, we obtain:
Hand Average Score
Left 81
Right 79.7
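The two averages can be reproduced from the table of scores. A minimal sketch in Python (our choice of language; a spreadsheet does the same job), with the data copied from the table above:

```python
from statistics import mean

# (student, hand, score) copied from the table above
scores = [
    ("A", "R", 80), ("B", "R", 78), ("C", "R", 82), ("D", "R", 84),
    ("E", "R", 76), ("F", "L", 82), ("G", "R", 81), ("H", "L", 79),
    ("I", "L", 79), ("J", "R", 81), ("K", "L", 84), ("L", "R", 76),
    ("M", "R", 81), ("N", "R", 78),
]

# Segment the scores by the level of the factor (L or R)
by_hand = {}
for _, hand, score in scores:
    by_hand.setdefault(hand, []).append(score)

means = {hand: round(mean(values), 1) for hand, values in by_hand.items()}
sizes = {hand: len(values) for hand, values in by_hand.items()}

print(means)
print(sizes)
```

Printing the group sizes alongside the means is a useful habit: here one segment contains only four learners.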
From this, we conclude that right handed students underperform compared to left
handed – we might even plan further monitoring, investigate the scheme of work to
look for bias and set up a far reaching working party.
Statistics in action
If you take any data set, made up from “real” data – and by real, I mean measured from
real people / events, not simulated on a computer, and segment that data into two –
you are likely to see a difference between one group and the other.
In this case, we looked at L and R hands, but the argument holds for any segmentation,
regardless of how ridiculous it sounds.
Call to action
1. The next time you teach any class, survey them for one of the following:
o Xbox or Playstation
o Blackberry or iPhone
o Eastenders vs Coronation Street
o Family Guy vs American Dad
(The choices don’t need to be binary, but at this stage, it will help with the data
analysis)
2. Add this segmentation to the class register.
3. The next time you “test” your learners, split the data into the segments that you
have just defined and calculate the mean for each: (for example)
Console Average Score
Xbox 67
Playstation 83
Ask yourselves the following question – does this show anything meaningful?
Have we just uncovered the route to educational success – “buy everyone a
Playstation” or is there something else going on?
Whilst a contrived example, I am sure from your own experience that this
segmentation and superficial analysis has been undertaken – possibly with the gender
differentials cited in the previous chapter.
Problems with the mean
From the previous example, what exactly are the problems with using the mean?
Some observations stand out:
1. The difference between the left and right handed means is small, just 1.3.
a. The question we should ask is:
“Is this difference big enough to matter?”
2. There are only 4 left handed students – does this affect the conclusions?
“How much data do you need to draw realistic inferences?”
These issues aside, we are sure that you have drawn conclusions using similarly
analyzed data.
Call to action
Before you read on, use your favourite spreadsheet to draw a bar chart of a set of
results that can be split into two segments: either your own data or the data presented
previously, split into Left and Right handedness. For the purposes of this text, I’ll
assume that you’ve used my data.
The dangers of presumption – pre analyzing the data
The analysis of data by using just the mean is not the only concern for rigorous data
analysis.
When we presume there is a difference between two segments of data, we are
unsurprised when we find it, and are then more likely to accept that difference as
meaningful. After all boys and girls are different, so when your data shows this, it must
be true – right?
Call to action
What presumptions do you make in your data analysis?
1. Would you have expected left and right handed segmentation to produce
different means?
a. Can you think of a pseudo‐pedagogical reason why this might be true?
2. What about other splits of data?
a. Everyone knows that free school meals, linked to poverty, affect
attainment – right? Does your data show this difference?
When you analyze your data and find a difference, you are ready to accept it as real
and meaningful. The same is true with gender, SEN and a host of other factors that we
assess.
What do your bar charts show?
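Let me show you my plots of the mean data for handedness as a series of bar charts, all
showing the same data:
[Figure: four bar charts, labelled A, B, C and D, of the mean scores for left and right handed students]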
Firstly, let me assure you that these charts all show the same “numbers” for the left
and right hand segmentation of the data.
Chart “B” is the default MS Excel and OpenOffice formatting of the data as entered.
The only difference between each chart is the scale of the y‐axis.
Chart “A” shows 79.5≤ y ≤81.1, with each division being equal to 0.2
Chart “B” shows 79 ≤ y ≤81.5, with each division being equal to 0.5
Chart “C” shows 0 ≤ y ≤80, with each division being equal to 20
Chart “D” shows 0 ≤ y ≤100, with each division being equal to 20
Quite dramatically charts “A” and “B” emphasize the differences between L and R,
whilst charts “C” and “D” seem to imply the difference is almost nonexistent.
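One way to put a number on this visual effect is to ask what share of the visible y-axis the gap between the two bars occupies. The sketch below, in Python, is our own illustration rather than a standard statistic; it uses the two means from earlier in the chapter and the axis limits listed above.

```python
mean_left, mean_right = 81.0, 79.7
difference = mean_left - mean_right

# y-axis limits (minimum, maximum) quoted for each chart
axes = {
    "A": (79.5, 81.1),
    "B": (79.0, 81.5),
    "C": (0.0, 80.0),
    "D": (0.0, 100.0),
}

shares = {}
for chart, (y_min, y_max) in axes.items():
    # Percentage of the visible axis height taken up by the gap between the bars
    shares[chart] = difference / (y_max - y_min) * 100
    print(f"Chart {chart}: the difference fills {shares[chart]:.0f}% of the y-axis")
```

The same 1.3 point difference fills most of the axis in chart “A” and only a sliver of it in chart “D”.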
Call to action
1. What did your bar chart look like? Which of my examples was it closest to?
2. As part of the self evaluation and action planning process you will have
certainly either constructed or interpreted charts showing results – often
segmented into different groups. Those groups will have likely shown a
difference. Have you presented data to SLT by using either the default or
custom scales – to “make your point clearer”?
What we’ve just demonstrated is that the apparent importance of differences can be
manipulated by just how you construct your charts.
3. How have you constructed charts for last year’s examination analysis meeting
with your SLT? Have you emphasized or played down an effect to influence a
decision or opinion?
However well meaning your intentions, I suspect that you will have exerted some
“influence” on the data – even if it was just by using the default settings in Excel –
which in this case seem to imply that there is a huge difference between L and R.
Extra information in a “modified” bar chart
What this chart clearly shows is the spread of data for each segment. You can see that
the entire L data sits within the R data.
The range of the R data is more than the L data
There are no values of L that are higher than R
There are low values of R, lower than any of the L data
What can be concluded from this chart, is that whilst the means are different, with L
being higher than R, the spread of the data and the low values of R have influenced the
mean value.
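The spread described above can be checked directly against the handedness scores from the start of this chapter. A minimal sketch in Python (the helper name `spread` is ours):

```python
# Scores from the handedness table at the start of this chapter
left = [82, 79, 79, 84]
right = [80, 78, 82, 84, 76, 81, 81, 76, 81, 78]


def spread(scores):
    """Return the lowest score, the highest score and the range."""
    return min(scores), max(scores), max(scores) - min(scores)


left_low, left_high, left_range = spread(left)
right_low, right_high, right_range = spread(right)

# True if every L score lies between the lowest and highest R scores
l_within_r = right_low <= left_low and left_high <= right_high

print("L:", left_low, left_high, left_range)
print("R:", right_low, right_high, right_range)
print("L sits within R:", l_within_r)
```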
What if those learners with the lowest right hand score just happened to be the SEN
learners in the class? Or, what if those lowest R scores correspond to learners who
have been long term sick, incomers to school, EAL learners?
Call to Action
1. Find a data set that you can segment into two (boy / girl splits work well and
are a constant political/educational debate). You need the actual score for a
class, broken down into learners / gender.
2. Plot the scores as a modified bar chart, one column for each segment (boy /
girl)
3. What does this show you for your data?
Looking at a whole cohort
[Figure: Average points score vs sex. Y-axis: points (0 to 900); x-axis: sex (F, M)]
The figure above shows 2010 data for the average point score for a secondary school,
split by sex. As before, the line joins the means.
The chart shows that girls have a higher average points score than boys (as the line slopes
down, from left to right).
From analysis of the means, the following was presented to SLT for the annual exam
analysis meeting:
Average Points Score
Boy 406
Girl 443
From the mean analysis, it appears that there is a real and big difference between the
boy and girl average point scores.
The modified bar chart starts to add more meaning:
The spread, or range, of the boy data is greater than that of the girl data
The boy data has far more low scores than the girl data
The girl data has the highest performing students.
Preconceptions again
Again whilst sex is a convenient (and presumptuous) way of explaining difference – and
indeed the means substantiate a conclusion, might it just be that the lowest scoring
learners (who happen to be boys) also happen to be the EAL students? Might it be
equally true that the highest performing girls receive tuition outside of school?
We come back to the question:
How big an effect / difference is “big enough” to matter?
and we add:
How do we tell what the real cause of something is?
Chapter 3
Comparative statistics
Over the previous two chapters we’ve been talking about the mean of data being a
poor summary tool and incomplete when used to compare two segments of data.
We’ve shown how we can draw a chart to help illustrate the difference between means
and how, by tweaking the scales of bar charts, you can magnify or minimize apparent
differences. Ultimately, all of these techniques are qualitative and assessing whether
or not data sets are different has been a matter of choice.
Whilst this might be satisfactory when deciding what the most popular games console
is, surely we can apply more forethought to decisions that are likely to have
profound implications for the education of young people.
What we are looking for is a way to quantify how different sets of data are, and an
agreed upon set of standards for assessing whether or not a measured difference is
significant – hence, if the difference is significant it demands attention and solution.
What does significant mean?
It’s important at this point to clarify that a difference is statistically significant if the
observed difference is greater than can be accounted for by random error alone.
T‐tests and p values
For the professional statistician there are a number of measures that can be used to
assess the significance of measurements being different. If we intended to compare a
response to one factor only (say gender), we would use the t‐test, which returns a
probability that the difference between the data sets cannot be distinguished from
random occurrences or accounted for by other factors.
That mouthful (presented for statistical correctness) can be reduced to:
The probability (%) that the data sets are not really different. This is often referred to
as the p value, and is either a decimal in the range 0.000 to 1.000 or a percentage. The
higher the p value, the less sure we are that the data sets are different.
For example:
If p=0.000 or 0% we would have zero concern that the means were the same.
Or put the other way, we would be totally certain that the means are different.
We would be (1‐p) or 100% confident that the means were different.
If p=0.001 or 0.1%, we would be slightly concerned and not totally confident
that the means were different. We would be (1‐p) or 99.9% confident that the
means are different.
If p=0.005 or 0.5%, we would be more concerned that the means were not
different – We would be (1‐p) or 99.5% confident that the means are different.
If p=0.10 or 10%, we would be quite concerned that the means were not
different. We would be (1‐p) or 90% confident that the means are different.
If p=0.50 or 50%, we would be totally unsure and (1‐p) = 50% would show that
it was 50/50 that the means are different.
Consider the following question – if you wanted me to invest £1,000,000 in your idea to
cure cancer, and you had tested it against a placebo, what value of p would you accept
as sensible evidence for “proving” your cure worked?
Would you accept p=0.10 or only 90% sure that your cure worked?
Would you accept p=0.005 or p=0.001?
Statisticians agree that a p value of 0.005 or less is needed for “proof” that a
difference is real and hence defined as significant.
P values in the range p=0.01 to p=0.006 show increasing evidence that a
difference might be real and probably warrant further analysis.
P values in the range p=0.05 to 0.01 show a hint that there is a real difference.
At p=0.05, we would be 95% sure there is a real difference, or there’s a 5%
chance that the means are actually the same. This p=0.05 value corresponds
to the limit of “significance” – a p‐value of p=0.05 or less indicates
significance of a difference between two levels of a factor.
P values greater than p=0.05 are rejected as we are less than 95% sure the
data sets are different.
This might sound draconian, but these levels of significance are used by drug
companies to “prove” a cure works, by the courts and police to convict those accused
of crimes and by all serious scientists trying to prove that A caused B or C worked
better than D – so if it works for them, it should work for us.
The p value of 0.414 indicates a 41.4% chance that the means are actually the same.
Or as we discussed previously, a 1‐p or 58.6% chance that the means are different.
(Remember what we are talking about here – this almost represents a 50/50 case –
that the data is different OR not)
This is well above the value of statistical significance (p=0.05) and the p‐value
demands that we treat the means of these data sets as “not different”.
Contrast the usefulness of a numerical value with the previous charts we created:
Whilst we might have concluded that the means were the same or “not likely to be
different”, clearly this was open to interpretation / bias and depended on my decision
over how we drew the charts.
Now we have a numerical value to assess just how different a difference actually
is.
Call to action
1. Revisit the data you collected previously.
2. For the factors that you were considering, put one value of the response
corresponding to one level of a factor (boy) in one column and the other level
(girl) into another column.
3. Calculate the TTEST value, using the ranges for the data, “2” for the tails and
“3” for the type.
4. What is the p value?
5. Does this show a significant difference between the data sets or do you
conclude that they are the same?
6. Does this disagree with any analysis you previously undertook?
7. Next time you split a data set into two groups, calculate a t‐test to see if the
means really are different.
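For readers who want to check the spreadsheet result outside Excel, steps 2 to 4 can be sketched in Python. This is only a sketch under stated assumptions, not the method used in this book: it assumes that TTEST with tails “2” and type “3” corresponds to a two‐tailed t‐test for two samples with unequal variances (often called Welch's t‐test). The function names and the points scores are invented for illustration, and the p value is found by numerically integrating the t distribution rather than by calling a statistics package.

```python
import math
from statistics import mean, variance


def two_sided_p(t, df, steps=20000):
    """Two-tailed p value for a t statistic with df degrees of freedom.

    Integrates the Student t density from 0 to |t| with Simpson's rule,
    then treats whatever is left in the two tails as the p value.
    """
    t = abs(t)
    if t == 0:
        return 1.0
    # Log of the normalising constant of the t density.
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # area under the curve between 0 and |t|
    return max(0.0, 1.0 - 2.0 * area)


def welch_t_test(a, b):
    """Return (t, df, p) for two independent samples with unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df, two_sided_p(t, df)


# Invented points scores, one list per level of the factor (illustration only).
boys = [410, 455, 390, 520, 470, 430]
girls = [480, 450, 530, 495, 440, 510]
t, df, p = welch_t_test(boys, girls)
print(f"t = {t:.3f}, df = {df:.1f}, p = {p:.3f}")
```

As in the text, a printed p above 0.05 would mean treating the two means as “not different”.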
Conclusions
In this chapter we have introduced the concept of calculating a value that shows
whether or not the difference between two means is caused by the factor being
measured or could be down to random chance or some other, unmeasured factors.
We introduced the concept of the p‐value, which corresponds to a probability or
percentage that the difference between means is real or just down to chance.
P values less than p=0.001 show a better than 99.9% chance that the means really are
different and the factor you are measuring is responsible.
P values of p=0.05 are considered the critical value and correspond to a 95% chance
that the factor you are measuring is responsible.
P values greater than p=0.05 are rejected as we are less than 95% certain that the
factor being measured is responsible.
The t‐test can be calculated in Excel with the TTEST(range 1, range 2, tails, type)
formula entered into a cell (Excel 2010 and later also offer it as T.TEST). Tails is
normally “2” and type “3”.
In the next chapter we’ll look at a more useful test that allows you to look at factors at
more than two levels, such as previous primary school.
If we take the means of the bins, we conclude:
Bin Mean
"1" 449
"2" 492
Surely a 43 point difference between the average points score for the two different
reading age “bins” represents something that we must take seriously?
Let’s look at the data:
Looks encouraging, that difference of 43 surely looks impressive and stands out.
Remember what we said about scales? If we draw the same chart on axes starting at 0:
Now, the difference between the two groups looks less impressive than before –
maybe they’re not that different.
Calculating t‐test for “binned” data
As before, let’s reorganize the data and get Excel to calculate the t‐test.
The t‐test p value of 0.1987 indicates a 19.87%, say 20%, chance that the means are
actually the same and there is no difference between the reading age bins. Put another
way, there is a 1‐p or nearly 80% chance that the means are actually different, and we
cannot conclude that the factor we are assessing is solely responsible for the difference.
Now 80% sounds positive – but remember we agreed that p=0.05 was the upper limit,
above which we cannot be certain that the factor is causing the difference in the
response.
Limits of the t‐test
I know that sounds like a bunch of statistical waffle, but the wording is important. The
t‐test does not rule out reading age having an effect on points score, but the low
significance of p=0.1987 points to some other factor either jointly being responsible or
(as is likely) more significant in explaining the difference between the data.
In our case, it means we should keep analyzing the data to find a more fundamental
difference.
As before, let’s plot a modified bar chart for the bins “1” and “2”, joining the means for
each level. In this case, it proves a particularly useful chart as it clearly shows that the
mean for level “2” of reading age is pulled upward by the three high points scores.
[Figure: Boxplot of Points Score vs Re-coded. Vertical axis: Points Score (300 to 650); horizontal axis: Re-coded (levels 1 and 2).]
Multi level factors
We can use the same idea of binning‐up factor levels to ease analysis of other factors –
such as attendance data for example.
However, what if we don’t want to combine factors into just two levels? In the case of
attendance data, we might want:
‐1 = less than 80
0 = 80 to 89.99
1 = 90 to 94.99
2 = 95+
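The four attendance levels above can be sketched as a small Python function, followed by the mean points score per bin. This is an illustrative sketch only: the function names and the sample pairs are invented, and it assumes that every percentage below 90 (for example 89.995) belongs to the lower bin, which the text's “89.99” boundary does not strictly settle.

```python
from statistics import mean


def bin_attendance(percent):
    """Map an attendance percentage to the four bins used in the text."""
    if percent < 80:
        return -1
    if percent < 90:
        return 0
    if percent < 95:
        return 1
    return 2


def mean_by_bin(records):
    """Average the points scores that fall into each attendance bin.

    `records` is a list of (attendance_percent, points_score) pairs.
    """
    groups = {}
    for attendance, points in records:
        groups.setdefault(bin_attendance(attendance), []).append(points)
    return {level: mean(scores) for level, scores in sorted(groups.items())}


# Invented (attendance %, points score) pairs, for illustration only.
sample = [(72.5, 400), (78.0, 440), (85.1, 420), (91.3, 500),
          (93.0, 520), (96.4, 470), (99.0, 450)]
print(mean_by_bin(sample))
```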
We can’t use the t‐test as it only works to discriminate between factors that are in two
levels. We need a different statistical tool – analysis of variance.
Analysis of variance
You’ve arrived at the point in the statistics journey where you are about to leave the
“core” functions of Excel behind. Whilst it’s true that you can get Excel to calculate
analysis of variance, it’s not an easy process, the preparation of the data can be
confusing and the results leave a lot to be desired.
At this point I strongly suggest that you get hold of a copy of Minitab [7] or download the
excellent Daniels XL Toolbox [8] – a free add‐in to Excel that will enhance its native
statistics capability.
However, even Daniels XL Toolbox will run out of steam in the next chapter, so maybe
it’s time to break the Excel apron strings ‐ ;‐)
[7] Or an alternative statistics package. See the preface to this book for how to obtain
Minitab for a reasonable price.
[8] http://xltoolbox.sourceforge.net/
Does attendance affect attainment?
Anyway, let’s push on and look at a continuous variable, attendance, and try to
answer the question: “Does attendance affect attainment?” Received wisdom is,
“surely yes, attendance affects attainment and the more you attend the higher the
attainment” – but ask yourself whether you’ve actually tested this “wisdom”.
As we have two data sets that are continuous, we can get a feel for what’s going on by
plotting a traditional scatter graph of attendance (x) against points score (y).
Does that help? Is there a link between attendance and attainment?
Fitting a trend line to Excel data
Excel allows us to fit a line between the data points that “best” represents the data.
How well that line fits is shown by the R² value – the closer it is to 1, the better the fit,
with anything above 0.8 indicating a “good” fit to the data.
Create a scatter graph as normal. Once created, right click on a data point to bring up
the context menu:
Select “Add Trendline”.
From the next context menu, you can choose what kind of line to fit – in this case we
are looking for a straight line, so choose “linear”:
Leave most of the settings to the default, but at the bottom, before you click the CLOSE
button, put a check as indicated:
From our data, the following linear trend line is fitted.
Using R² to check for “goodness” of fit
The R² value of 0.0093 indicates that the line does not represent the data well – in fact
anything below 0.80 is regarded as “poor”.
When R² = 0, the line fits the data no better than a horizontal line drawn through
the mean “y” value.
The closer R² is to 1, the better we can use the line and its equation to predict values –
in this case, if R² = 1 we could predict a points score from the attendance with complete
certainty. Clearly this is not the case for our data.
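To see where an R² figure comes from, here is a minimal Python sketch of a least‐squares straight line and its R² value. It assumes the trend line is an ordinary least‐squares fit with an intercept; the function name and the data are invented for illustration.

```python
from statistics import mean


def linear_fit(xs, ys):
    """Least-squares straight line y = slope * x + intercept, plus R squared."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Variation left over after fitting the line ...
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # ... compared with the variation around a flat line at the mean of y.
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Invented attendance (%) and points scores, for illustration only.
attendance = [78.0, 85.5, 90.0, 92.5, 96.0, 99.0]
points = [440, 395, 520, 410, 480, 450]
slope, intercept, r_squared = linear_fit(attendance, points)
print(f"points = {slope:.2f} * attendance + {intercept:.1f}, R squared = {r_squared:.4f}")
```

When the leftover variation equals the variation around the mean, R² is 0, which is the “horizontal line” case described above.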
Let’s calculate the means of each bin to assess if there is any variation between
attendance figures:
Binned Mean Points
‐1 435.8
0 422.7
1 550.6
2 460.7
What the mean analysis shows is a difference of 25 points in going from the lowest
(sub‐80%) attendance bin to the highest (95%+) attendance bin. But is this a big enough
effect to conclude that attendance matters?
If we plot the binned attendance against points score, we can see that “something” is
going on, and the connected means show some variation.
[Figure: Modified Bar Chart of Points vs Binned Attendance. Vertical axis: Points (0 to 900); horizontal axis: Binned attendance (levels -1, 0, 1, 2).]
At this point, the observant reader might ask “Doesn’t all this depend on the size of
the bins?” – Let’s see....
If we re‐bin the data into ‐1 (less than 90) and +1 (90 and greater), we find:
Binned Mean Points
‐1 427.9
1 486.9
This time, there’s nearly 60 points of difference between the lowest and highest
attendance – surely this is significant?
At this point we’ve reduced the factors to a binary split, so we can use the t‐test to see
if the difference between the means is real and significant.
The preparation of the data is left as an exercise for the reader, but by binning into ‐1
and +1, separating the data into columns and running the Excel TTEST function, we
obtain a value of p=0.243.
This p value is well above the p=0.05 limit, so we cannot consider the means to be
statistically different, and we conclude that there is no statistically significant
difference between the average points scores when we consider the factor “attendance”.
However, this is not where we wanted to be – we’ve reduced a factor to a binary
split.
We’re going to stick with the original binned data, as they correspond to how we track
learners in school:
‐1 = less than 80
0 = 80 to 89.99
1 = 90 to 94.99
2 = 95+
You’ll need Daniels XL toolbox or Minitab at this stage. Download a copy for MS
Excel from: http://xltoolbox.sourceforge.net/
One way Analysis of Variance (ANOVA)
The statistical test that we’re going to perform is called the One‐way analysis of
variance or, as it’s usually referred to, ANOVA.
ANOVA is similar in function (but mathematically much more complex) to the t‐test,
except ANOVA can test whether or not two or more means are different. ANOVA tests
produce a p value which can be interpreted in the same manner as the t‐test.
This is ideal for our case – ANOVA will reduce our problem of determining if attendance
matters to the familiar task of interpreting a p‐value.
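For the curious, the calculation hidden behind a one‐way ANOVA can be sketched in Python. This is a hedged illustration rather than a description of what XL Toolbox or Minitab do internally: it compares the variation between the level means with the variation within each level to give an F value, then turns that into a p value by numerically integrating the F distribution (assuming at least two more learners than levels). The function names and the scores are invented.

```python
import math
from statistics import mean


def f_survival(f, d1, d2, steps=20000):
    """P(F >= f) for an F distribution with (d1, d2) degrees of freedom.

    Uses P(F >= f) = I_x(d2/2, d1/2) with x = d2 / (d2 + d1 * f), where I_x
    is the regularised incomplete beta function, evaluated by Simpson's rule.
    Assumes d2 >= 2 so that the integrand is finite at zero.
    """
    if f <= 0:
        return 1.0
    a, b = d2 / 2, d1 / 2
    x = d2 / (d2 + d1 * f)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def integrand(u):
        return u ** (a - 1) * (1 - u) ** (b - 1)

    h = x / steps  # steps must be even for Simpson's rule
    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return min(1.0, total * h / 3 / math.exp(log_beta))


def one_way_anova(groups):
    """Return (F, p) for a dict mapping each factor level to its responses."""
    samples = list(groups.values())
    means = [mean(s) for s in samples]
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total
    # Variation between the level means, weighted by group size.
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # Variation of individual learners around their own level mean.
    ss_within = sum(sum((v - m) ** 2 for v in s) for s, m in zip(samples, means))
    df_between = len(samples) - 1
    df_within = n_total - len(samples)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, f_survival(f, df_between, df_within)


# Invented points scores for three hypothetical factor levels.
scores = {"low": [400, 430, 415, 390], "mid": [450, 470, 440, 480],
          "high": [460, 455, 500, 475]}
f_value, p_value = one_way_anova(scores)
print(f"F = {f_value:.3f}, p = {p_value:.4f}")
```

The p value is read exactly as for the t‐test: at or below 0.05 the factor is treated as significant.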
As we’re going to use Daniels XL toolbox or Minitab, data this time can be laid out as
you would receive it from your examinations officer, without further processing.
That is, a list of information with headings across the top and one row per pupil. No
preparation will be required, which is much easier to deal with than before.
From the Add‐In menu in Excel, select XL Toolbox, and navigate to the Statistics > ANOVA
menu.
From the One‐Way Analysis of Variance (ANOVA) menu that appears, select the ranges for
the input data.
Click in the box once and then drag down over the range of the bins – not including
the heading.
Click in the box once and then drag down over the range of the data – not including
the heading.
Non numeric multi level factors
We started this text by looking at gender and handedness; both were binary non
numeric factors (either one value or another). Some factors under consideration can
be non numerical and text based – originating primary school [9] for example.
Our fictional secondary school has 4 feeder primaries: Elm Tree, Everymans, Oldberry
and St Judes.
The average points score at the end of Year 11 for a group of learners is:
Primary    Points   Primary    Points   Primary    Points   Primary    Points
St Judes   314      St Judes   698      Elm Tree   509      St Judes   494
St Judes   695      St Judes   440      St Judes   614      Elm Tree   440
St Judes   389      St Judes   566      St Judes   426      St Judes   597
Elm Tree   269      Oldberry   631      St Judes   467      St Judes   698
St Judes   410      Oldberry   440      Elm Tree   413      St Judes   440
Elm Tree   400      Everymans  501      St Judes   502      Everymans  566
St Judes   314      Oldberry   469      Oldberry   290      Everymans  631
St Judes   614      St Judes   342      Elm Tree   425      St Judes   440
St Judes   426      Oldberry   400      Elm Tree   158      St Judes   284
St Judes   467      Oldberry   626      St Judes   509      Oldberry   469
Oldberry   413      Oldberry   552      St Judes   479      St Judes   342
Everymans  695      Oldberry   519      Everymans  490      St Judes   400
Everymans  502      St Judes   548      St Judes   626      St Judes   548
St Judes   389      Oldberry   440      Elm Tree   401      Oldberry   440
St Judes   290      Everymans  752      Oldberry   519      Oldberry   752
St Judes   269      Everymans  834      St Judes   834      Oldberry   292
Elm Tree   425      St Judes   292      Oldberry   494      Oldberry   262
Everymans  410      Oldberry   262      Oldberry   350      Oldberry   612
St Judes   538      Elm Tree   612      Oldberry   440      Elm Tree   80
St Judes   158      Everymans  540      Oldberry   597
[9] At this point, I need to be clear – I’m not suggesting a blame culture between Primary
and Secondary, more the fact that we have this data in secondary and it can be instructive
to see if and where a response can be split by a factor.
Firing up Excel and the XL Toolbox we place the data in two columns, one for feeder
primary and the other for points score. Navigating through XL Toolbox we run an
ANOVA:
What this ANOVA shows us, with a p value of p=0.0089, is that feeder primary is more
than 99% certain to have an effect upon the average points score at the end of Year 11.
What it doesn’t show is where this variation actually is. Are all the schools different, or
just one school different from the rest?
Let’s plot a modified bar chart to see:
[Figure: Modified Bar Chart of Points vs Primary. Vertical axis: Points (0 to 900); horizontal axis: Primary (Elm Tree, Everymans, Oldberry, St Judes).]
The “difference” is likely to be between Elm Tree and Everymans. But, being the good
statistician we now want to ask more rounded questions:
Is Everymans different to Oldberry & St Judes?
Is Elm Tree different to Oldberry?
Fortunately, tests exist to quantify this difference.
On this screen, click on “Produce report”, which will summarise this test in an easy to
read table.
Posthoc test: Bonferroni‐Holm
Group 1    Group 2     Critical      P             Significant?
Elm Tree   Everymans   0.008333333   0.002662327   Yes
Oldberry   Everymans   0.01          0.017707646   No
St Judes   Everymans   0.0125        0.01989173    No
St Judes   Elm Tree    0.016666667   0.074365767   No
Elm Tree   Oldberry    0.025         0.082440719   No
St Judes   Oldberry    0.05          0.96789046    No
(Here, the significance of the P value is slightly different than before – if the value of p
is less than the displayed “critical value”, the difference is significant.)
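The “critical values” in the report follow a simple pattern: 0.05 divided by 6, then by 5, and so on down to 1, applied to the p values sorted from smallest to largest. A minimal Python sketch of that procedure is given below, using the p values from the report above. The function name is invented. The sketch also assumes the usual Holm stopping rule (once one comparison fails, all later ones are treated as not significant), which the report above does not state explicitly.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure.

    `p_values` maps each comparison to its p value. Returns a list of
    (comparison, critical_value, p, significant) rows, smallest p first.
    """
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    rows = []
    still_rejecting = True
    for rank, (name, p) in enumerate(ordered):
        critical = alpha / (m - rank)
        # Once one comparison fails, every larger p value also fails.
        still_rejecting = still_rejecting and p < critical
        rows.append((name, critical, p, still_rejecting))
    return rows


# P values copied from the Bonferroni-Holm report in the text.
report = {
    ("Elm Tree", "Everymans"): 0.002662327,
    ("Oldberry", "Everymans"): 0.017707646,
    ("St Judes", "Everymans"): 0.01989173,
    ("St Judes", "Elm Tree"): 0.074365767,
    ("Elm Tree", "Oldberry"): 0.082440719,
    ("St Judes", "Oldberry"): 0.96789046,
}
for (g1, g2), critical, p, significant in holm_bonferroni(report):
    print(f"{g1:9} {g2:9} {critical:.6f} {p:.6f} {'Yes' if significant else 'No'}")
```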
We can see that for our data, only the Elm Tree – Everymans difference is significant,
whilst the Oldberry – Everymans and St Judes – Everymans differences are approaching
significance.
Whilst our modified bar chart hinted at this before, we now have a hard and fast figure
that describes the difference between the primary schools.
Call to action
Now that we’ve got some real statistical tests in our tool kit, go and find your master
data set for your school / department / class.
Most schools will have spreadsheets of such data, and they probably look something
like this:
Name        Sex  SEN  FSM  CATs  Att%   Feeder    Read   Maths  English  Science  Overall Points
Adams, Jon  M    NA   N    119   90.35  St Judes  14.02  30     35       40       440
See if you can answer the following questions from your own data:
1. Are the overall results for your school different for gender? Is this a significant
difference?
a. (TTEST and P value)
b. Repeat the analysis for free school meals (FSM)
2. How well does CATs (or other baseline data), attendance or reading age
predict Maths, English, Science (insert subjects that you have data for)?
a. (Scatter graph for continuous data and fit a trend line. Check R² value)
3. Create some binned data (CATs, Feeder School) and use ANOVA to check the
significance of a multi level factor.
a. Use Bonferroni‐Holm to check for differences between levels of a
factor
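If your master data set is exported as a CSV file, the first step for each of these questions (splitting a response by the levels of a factor) can be sketched in Python as follows. Everything here is illustrative: the function name is invented, the column layout is a cut‐down version of the one shown above, and every row except the first is made up.

```python
import csv
import io
from statistics import mean

# Invented rows in a cut-down version of the column layout shown in the text;
# only the first row comes from the book's own example.
MASTER_CSV = """Name,Sex,FSM,Att%,Feeder,Overall Points
"Adams, Jon",M,N,90.35,St Judes,440
"Baker, Amy",F,N,96.10,Elm Tree,510
"Clark, Sam",M,Y,82.40,Oldberry,395
"Dean, Ella",F,Y,88.75,St Judes,465
"""


def split_by_factor(rows, factor, response):
    """Group the numeric response values by each level of a factor column."""
    levels = {}
    for row in rows:
        levels.setdefault(row[factor], []).append(float(row[response]))
    return levels


rows = list(csv.DictReader(io.StringIO(MASTER_CSV)))
by_sex = split_by_factor(rows, "Sex", "Overall Points")
for level, scores in by_sex.items():
    print(level, len(scores), mean(scores))
```

The resulting lists, one per level, are the “columns” that a t‐test (two levels) or an ANOVA (more than two levels) would then compare.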
Pause for breath ……..
At this point, you’ve come a long way. Instead of using the means of responses to
describe (possibly erroneous) differences between the effects of factor levels, you’ve
just used some real statistical tests (TTEST and ANOVA) to provide you with evidence
that is more than just a “hunch”.
Questions to reflect on
1. Did any of your analysis contradict your preconceptions?
2. Did you show that gender was statistically significant overall? What about
gender for Maths, English, Science?
3. Do learners from any of your feeder primaries perform significantly differently
from learners from the others? Does this surprise you?
This is the beauty of simple statistical tests – you can ask the “What if” questions and
very quickly get an answer.
But, and isn’t there always a but – from the factors listed how do you decide which is
the most important and most significant in driving a response?
Name        Sex  SEN  FSM  CATs  Att%   Feeder    Read   Overall Points
Adams, Jon  M    NA   N    119   90.35  St Judes  14.02  440
And for that, we need yet another tool – this time, the final one we’ll introduce and the
“most useful”, generic test available. Say hello to the General Linear Model.
Conclusions
We’ve covered a lot of ground in this chapter. Starting with the t‐test previously
described we’ve looked at:
Grouping or binning factor levels to allow us to continue to use the t‐test and the
familiar p value for significance.
How we can use Excel and trend lines to explore the relationship between continuous
data.
We looked at the R² value and used it to decide how “well” a trend line matched the
data. R² = 0.80 is the rule‐of‐thumb limit used here; below this the fit is described as
“poor”.
How continuous data can also be binned up to allow t‐tests to differentiate between
two‐level (binary) factors.
We’ve introduced the concept of One‐way analysis of variance (ANOVA), which allows
us to test for significance between multi level factors.
We looked at extending this ANOVA to explore differences between the levels of
factors and how to assess the significance of these differences.
We explored Daniels XL Toolbox, a free add‐in to Excel which makes calculating ANOVA
much more straightforward.