1. Reliability and Dependability
by Neil Jones
The Routledge Handbook of Language Testing
edited by
Glenn Fulcher and Fred Davidson
Prepared By: Amirhamid Foroughameri
ahfameri@gmail.com
November 2015
2. Reliability as an aspect of test quality
• Reliability and validity are classically cited as the two most
important properties of a test.
• Bachman (1990) identified four key qualities – validity, reliability,
impact and practicality.
• He proposed that in any testing situation validity and reliability
should be maximised to produce the most useful results for test users,
within practical constraints that always exist.
• Here, reliability will be presented rather as an integral component of
validity, and approaches to estimating reliability as potential sources
of evidence for the construct validity of a test.
3. Measurement
• The idea that quantification is the way to understanding was
memorably expressed by Kelvin in 1883:
• … when you can measure what you are speaking about, and express
it in numbers you know something about it; but when you cannot
measure it, when you cannot express it in numbers, your knowledge
is of a meagre and unsatisfactory kind.
• (Kelvin, quoted by Stellman, 1998: 1973)
4. Does this apply to the case of language proficiency?
The answer could be No for two reasons:
• First, it suggests that language proficiency is an enduring real
property that resides in a person’s head and can be quantified, like
their height or weight.
• Second, the metaphor implies that language proficiency, like
temperature, has a single unique meaning, and can be precisely
quantified.
We cannot take a one-size-fits-all approach to language
assessment.
5. The concept of reliability
• Reliability equals consistency
• Reliability in assessment means something rather different from its everyday use as a
synonym of ‘trustworthy’ or ‘accurate’: in testing it has the narrower meaning of ‘consistent’.
• A reliable test is consistent in that it produces the same or similar result on repeated use;
that is, it would rank-order a group of test takers in nearly the same way.
• But the result need not be a correct or accurate measure of what the test claims to
measure.
• Just as a train service can run consistently late, a test may provide an incorrect result in a
consistent manner.
• High reliability does not necessarily imply that a test is good, i.e., valid.
• Nonetheless, a valid test must have acceptable reliability, because without it the results
can never be meaningful.
• Thus a degree of reliability is a necessary but not sufficient condition of validity.
6. Reliability and error
• When a group of learners takes a test their scores will differ, reflecting
their relative ability.
• Reliability is defined as the proportion of variation in scores caused by
the ability measured, and not by other factors.
• This proportion is typically described as a correlation (or correlation-like)
coefficient.
• Depending on the type of reliability being analysed, what is correlated
with what will change.
• A perfectly reliable test would have a reliability coefficient (r) of 1.
• The variability caused by other factors is called error.
8. Replications and generalizability
‘A person with one watch knows what time it is; a person with two
watches is never quite sure.’
Thus Brennan (2001: 295) introduces a presentation of reliability
from the perspective of replications.
Information from a single observation can easily deceive, because it
cannot be verified; direct information about consistency (i.e.,
reliability) requires at least two instances.
Replications in some form are necessary to estimate reliability.
9. Even more importantly, Brennan argues, ‘they are required for an
unambiguous conceptualization of the very notion of reliability.’
Specifying exactly what would constitute a replication of a
measurement procedure is necessary to provide any meaningful
statement about its reliability.
The individual variation in test-takers from one day to another is
difficult to measure, because the test is taken only once.
Thus its impact is very likely ignored, leading to an overestimate of
reliability, unless we can do specific experiments to replicate the
testing event in a way that will provide evidence.
10. Reliability and dependability
• Dependability is a term sometimes used (in preference to reliability) to refer to the
consistency of a classification – that is, of a test-taker receiving the same grade or score
interpretation on repeated testing.
• The way the term is used relates to the distinction made between norm-referenced and
criterion-referenced approaches to testing.
• Taken literally, norm-referencing means interpreting a learner’s performance relative to other
learners, i.e., as better or worse, while criterion-referencing interprets performance relative to
some fixed external criterion, such as a specified level of a proficiency framework like the
CEFR.
• The term dependability is used in a criterion-referencing context where the aim is to classify
learners, for example as masters or non-masters of a domain of knowledge.
11. • But if dependability relates to a particular criterion-referenced approach
to interpretation we should not conclude that reliability relates only to
norm-referenced interpretations.
• It is true that reliability is defined in terms of the consistency with which
individuals are ranked relative to each other, but in many testing
applications it is no less concerned with consistency of classification
relative to cut-off points that have well-defined criterion interpretations.
Item response theory (IRT) has the particular advantage that it models a
learner’s ability in terms of probable performance on specific tasks.
Henning (1987: 111) argues that IRT reconciles norm- and criterion-
referencing.
12. The standard error of measurement
• The standard error of measurement (SEM) is a transformation of
reliability in terms of test scores, which is useful in considering
consistency of classification.
• While reliability refers to a group of test-takers, the SEM shows the
impact of reliability on the likely score of an individual: it indicates how
close a test-taker’s score is likely to be to their ‘true score’.
One difference often cited between classical test theory (CTT) and IRT is that the CTT SEM is a
single value applied to all possible scores in a test, while the IRT SEM is
conditional on each possible score, and is probably of greater technical
value.
However, as Haertel (2006: 82) points out, CTT also has techniques for
estimating SEM conditional on score.
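The classical (single-value) SEM follows directly from the reliability coefficient and the score standard deviation: SEM = SD × √(1 − r). The figures below are illustrative, not taken from any real test.

```python
import math

# Classical test theory SEM, with invented illustrative values.
sd = 10.0          # standard deviation of observed scores
reliability = 0.91
sem = sd * math.sqrt(1 - reliability)  # = 10 * sqrt(0.09) = 3.0

# A test-taker's true score has roughly a 95% chance of lying
# within about +/- 2 SEM of the observed score.
score = 65
low, high = score - 2 * sem, score + 2 * sem
print(f"SEM = {sem:.1f}; 95% band: {low:.1f} to {high:.1f}")
```

This is what makes the SEM useful for classification: if a cut score falls inside a test-taker's band, the pass/fail decision is uncertain even when the group-level reliability looks high.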
13. Internal consistency as the definition of a trait
• It is important to note that internal consistency is conceptually quite
unrelated to the definition of reliability.
• Think of a short test consisting of items on, say: your shoe size,
visual acuity, the number of children you have, and the distance from
your house to work. Assume that with appropriate procedures each of
these can be found without error, for a group of candidates. The
reliability of this error-free test will be a perfect 1.
• But these items are completely unrelated to each other, and so an
internal consistency estimate of their reliability would be about zero.
For this reason too, it is impossible to put a name to this test, that is,
to say what it is actually a test of.
14. Internal consistency as the definition of a trait
• Now suppose the test contained, say, items on shoe size, height,
gender. This time it is likely that on administering the test the
internal consistency estimate of reliability would be found to be
considerably higher than zero.
• The difference is that this time the items are related to each other.
• Study them and you could probably name what it measures:
something like ‘physical build’.
• So the trait which a test actually measures is whatever explains its
internal consistency.
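The contrast between the two imagined tests can be made concrete with Cronbach's alpha, the usual internal-consistency estimate. The simulation below is a sketch with invented data: four unrelated items give an alpha near zero, while four items that all load on a shared factor (the 'physical build' trait) give a high alpha.

```python
import random
import statistics

random.seed(7)

def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score lists (one list per item)."""
    k = len(items)
    sum_item_vars = sum(statistics.variance(i) for i in items)
    totals = [sum(vals) for vals in zip(*items)]
    return k / (k - 1) * (1 - sum_item_vars / statistics.variance(totals))

n = 500
# Unrelated items (shoe size, visual acuity, ...): independent values.
unrelated = [[random.gauss(0, 1) for _ in range(n)] for _ in range(4)]

# Related items ('physical build'): each loads on a shared factor.
factor = [random.gauss(0, 1) for _ in range(n)]
related = [[f + random.gauss(0, 0.5) for f in factor] for _ in range(4)]

print(f"alpha (unrelated): {cronbach_alpha(unrelated):.2f}")
print(f"alpha (related):   {cronbach_alpha(related):.2f}")
```

The high alpha in the second case reflects exactly the point in the text: the trait a test measures is whatever shared factor explains its internal consistency.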
15. Reliability and validity
• Validity nowadays tends to be judged in terms of whether the uses
made of test results are justified (Messick, 1989). This implies a
complex set of arguments that go well beyond the older and purely
psychometric issue of whether the test measures what it is believed to
measure.
16. Reliability and validity
• Coherent measurement and construct definition
• In the trait-based, unidimensional approaches, conceptions of validity and
reliability emerge as rather closely linked. Both relate to the same notion of
coherent measurement: focusing on ‘one thing’ at a time.
• Typically this means identifying skills such as Reading, Writing, Listening and
Speaking as distinct traits, and testing them separately.
• Each of these traits requires definition: what do we understand by ‘Reading’ or
‘Listening’ ability, and how is it to be tested?
• Such construct definition provides the basis of a validity argument for how test
results can be interpreted.
• Defining constructs encourages test developers to identify explicit models of
language competence, enables useful profiling of an individual learner’s strengths
or weaknesses, and helps to interpret test performance in meaningful terms.
17. Focusing on specific contexts
• The conclusion is thus that the trait-based measurement
models presented here enable approaches to language
proficiency testing which can work well, achieving a useful
blend of reliability, validity and practicality.
• However, there is a condition: each testing context must be
treated on its own terms, and tests designed for one context
may not be readily comparable with tests designed for
another context.
18. • Mislevy (1992: 22) identifies four possible levels at which tests can be compared:
• Equating – the strongest level: refers to testing the same thing in the same way, e.g. two
tests constructed from the same test specification to the same blueprint. Equating such
tests allows them to be used interchangeably.
• Calibration – refers to testing the same thing in a different way, e.g. two tests
constructed from the same specification but to a different blueprint, which thus have
different measurement characteristics.
• Projection – refers to testing a different thing in a different way, e.g. where constructs
are differently specified. It predicts learners’ scores on one test from another, with
accuracy dependent on the degree of similarity. It is relevant where both tests target the
same basic population of learners.
• Moderation – the weakest level: can be applied where performance on one test does not
predict performance on the other for an individual learner, e.g. tests of French and
German.
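The strongest two levels involve placing scores from one form onto the scale of another. One simple textbook technique for this is mean-sigma linear equating; the sketch below uses invented scores and assumes equivalent groups took the two forms, so it is illustrative rather than a recipe for operational equating.

```python
import statistics

# Invented scores from two forms given to equivalent groups.
form_x = [52, 61, 47, 70, 58, 66, 55, 63]
form_y = [x - 4 for x in form_x]   # form Y scores run 4 points lower

mx, sx = statistics.mean(form_x), statistics.stdev(form_x)
my, sy = statistics.mean(form_y), statistics.stdev(form_y)

def equate(y):
    """Mean-sigma linear transformation: form-Y score onto the form-X scale."""
    return mx + (sx / sy) * (y - my)

print(f"form-Y score 60 maps to form-X score {equate(60):.1f}")
```

With real data the two group distributions would differ in both mean and spread, and the transformation would adjust for both; here the forms differ only by a constant, so the mapping is a simple 4-point shift.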
19. Issues with reliability
In practice language testing seeks to achieve both reliability and validity within
the practical constraints which limit every testing context.
The aim should be to optimise both, rather than prioritise one over the other.
If reliability is prioritised, then indeed it may conflict with validity.
Internal consistency estimates of reliability make it possible to drive up the
reliability of tests over time, simply by weeding out items which correlate less
highly with the others.
This, as Ennis (1999) points out, is potentially a serious threat to the validity of a
test, as it leads to a progressive narrowing of what is tested, without explicit
consideration of how the content of the test is being modified.
A classic way of narrowing the testing focus is to restrict the range of task types
used and select items primarily on psychometric quality – the discrete item
multiple-choice test format which Spolsky questioned.
20. Trait-based measures versus cognitive models
The trait-based measurement approach is most useful in summative
assessment, where at the end of a course of study the learner’s
achievements can be summarised as a simple grade or proficiency
level.
Formative assessment, which aims to feed forward into future
learning, needs to provide more information, not simply about how
much a learner knows, but about the nature of that knowledge.
As Mislevy (1992: 15) states: ‘Contemporary conceptions of
learning do not describe developing competence in terms of
increasing trait values, but in terms of alternative constructs.’