Reliability and Dependability
by Neil Jones
The Routledge Handbook of Language Testing
edited by
Glenn Fulcher and Fred Davidson
Prepared By: Amirhamid Foroughameri
ahfameri@gmail.com
November 2015
Reliability as an aspect of test quality
• Reliability and validity are classically cited as the two most
important properties of a test.
• Bachman (1990) identified four key qualities – validity, reliability,
impact and practicality.
• He proposed that in any testing situation validity and reliability
should be maximised to produce the most useful results for test users,
within practical constraints that always exist.
• Here, reliability will be presented rather as an integral component of
validity, and approaches to estimating reliability as potential sources
of evidence for the construct validity of a test.
Measurement
• The idea that quantification is the way to understanding was
memorably expressed by Kelvin in 1883:
• … when you can measure what you are speaking about, and express
it in numbers you know something about it; but when you cannot
measure it, when you cannot express it in numbers, your knowledge
is of a meagre and unsatisfactory kind.
• (Kelvin, quoted by Stellman, 1998: 1973)
Does this apply to the case of language proficiency?
The answer could be No for two reasons:
• First, it suggests that language proficiency is an enduring real
property that resides in a person’s head and can be quantified, like
their height or weight.
• Second, the metaphor implies that language proficiency, like
temperature, has a single unique meaning, and can be precisely
quantified.
We cannot take a one-size-fits-all approach to language
assessment.
The concept of reliability
• Reliability equals consistency
• Reliability in assessment means something rather different to its everyday use as a
synonym of ‘trustworthy’ or ‘accurate’.
• In testing, reliability has the narrower meaning of ‘consistent’.
• A reliable test is consistent in that it produces the same or similar result on repeated use;
that is, it would rank-order a group of test takers in nearly the same way.
• But the result need not be a correct or accurate measure of what the test claims to
measure.
• Just as a train service can run consistently late, a test may provide an incorrect result in a
consistent manner.
• High reliability does not necessarily imply that a test is good, i.e., valid.
• Nonetheless, a valid test must have acceptable reliability, because without it the results
can never be meaningful.
• Thus a degree of reliability is a necessary but not sufficient condition of validity.
• Reliability and error
• When a group of learners takes a test their scores will differ, reflecting
their relative ability.
• Reliability is defined as the proportion of variation in scores caused by
the ability measured, and not by other factors.
• This proportion is typically described as a correlation (or correlation-like)
coefficient.
• Depending on the type of reliability being analysed, what is correlated
with what will change.
• A perfectly reliable test would have a reliability coefficient (r) of 1.
• The variability caused by other factors is called error.
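The definition above can be illustrated with a small simulation. This is a sketch under assumed values, not material from the chapter: abilities and errors are drawn from normal distributions with made-up standard deviations (10 and 5), and the names `simulate_reliability` and `pearson` are hypothetical. It estimates reliability in two ways: as the share of observed-score variance due to ability, and as the correlation between two parallel forms that share the same abilities but have independent errors.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_reliability(n=5000, sd_true=10.0, sd_error=5.0, seed=1):
    """Return (variance ratio, parallel-forms correlation) for simulated scores.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Ability is the part of each score we want to measure.
    true = [rng.gauss(50.0, sd_true) for _ in range(n)]
    # Two replications: same ability, independent error each time.
    form_a = [t + rng.gauss(0.0, sd_error) for t in true]
    form_b = [t + rng.gauss(0.0, sd_error) for t in true]
    # Proportion of observed-score variance caused by ability.
    variance_ratio = statistics.pvariance(true) / statistics.pvariance(form_a)
    return variance_ratio, pearson(form_a, form_b)


variance_ratio, parallel_r = simulate_reliability()
print(f"variance ratio: {variance_ratio:.2f}, parallel-forms r: {parallel_r:.2f}")
```

With these assumed values both estimates should come out near 100 / (100 + 25) = 0.8, and setting `sd_error=0.0` gives the perfectly reliable case with a coefficient of 1.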
Replications and generalizability
‘A person with one watch knows what time it is; a person with two
watches is never quite sure.’
Thus Brennan (2001: 295) introduces a presentation of reliability
from the perspective of replications.
Information from only one observation may easily deceive, because it is
unverifiable; to get direct information about consistency (i.e.,
reliability) at least two instances are required.
Replications in some form are necessary to estimate reliability.
Even more importantly, Brennan argues, ‘they are required for an
unambiguous conceptualization of the very notion of reliability.’
Specifying exactly what would constitute a replication of a
measurement procedure is necessary to provide any meaningful
statement about its reliability.
The individual variation in test-takers from one day to another is
difficult to measure, because the test is taken only once.
Thus its impact is very likely ignored, leading to an overestimate of
reliability, unless we can do specific experiments to replicate the
testing event in a way that will provide evidence.
• Reliability and dependability
• Dependability is a term sometimes used (in preference to reliability) to refer to the
consistency of a classification – that is, of a test-taker receiving the same grade or score
interpretation on repeated testing.
• The way the term is used relates to the distinction made between norm-referenced and
criterion-referenced approaches to testing.
• Taken literally, norm-referencing means interpreting a learner’s performance relative to other
learners, i.e., as better or worse, while criterion-referencing interprets performance relative to
some fixed external criterion, such as a specified level of a proficiency framework like the
CEFR.
• The term dependability is used in a criterion-referencing context where the aim is to classify
learners, for example as masters or non-masters of a domain of knowledge.
• But if dependability relates to a particular criterion-referenced approach
to interpretation we should not conclude that reliability relates only to
norm-referenced interpretations.
• It is true that reliability is defined in terms of the consistency with which
individuals are ranked relative to each other, but in many testing
applications it is no less concerned with consistency of classification
relative to cut-off points that have well-defined criterion interpretations.
• Item response theory has the particular advantage that it models a
learner’s ability in terms of probable performance on specific tasks.
Henning (1987: 111) argues that IRT reconciles norm- and criterion-
referencing.
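Consistency of classification relative to a cut-off can be made concrete with a short sketch. The scores and the cut score of 60 below are invented for illustration, and raw agreement with Cohen's kappa is used here as one common pair of indices; the slides do not name a specific dependability coefficient, so this should not be read as the chapter's method.

```python
def classify(scores, cut):
    """True means the test-taker is classified as a master (score at or above cut)."""
    return [s >= cut for s in scores]


def classification_consistency(first, second, cut):
    """Return (proportion of identical classifications, Cohen's kappa)."""
    a, b = classify(first, cut), classify(second, cut)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each administration's pass rate.
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical scores for the same eight learners on two administrations.
first_sitting = [72, 58, 65, 40, 61, 59, 80, 55]
second_sitting = [70, 62, 63, 45, 57, 56, 78, 61]
agreement, kappa = classification_consistency(first_sitting, second_sitting, cut=60)
print(f"agreement = {agreement:.3f}, kappa = {kappa:.2f}")
```

In this invented data the rank order of the learners is largely preserved, yet three of the eight change classification because their scores lie close to the cut-off. That is why consistency of classification deserves attention separately from consistency of ranking.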
• The standard error of measurement
• The standard error of measurement (SEM) is a transformation of
reliability in terms of test scores, which is useful in considering
consistency of classification.
• While reliability refers to a group of test-takers, the SEM shows the
impact of reliability on the likely score of an individual: it indicates how
close a test-taker’s score is likely to be to their ‘true score’.
• One difference often cited between CTT and IRT is that CTT SEM is a
single value applied to all possible scores in a test, while the IRT SEM is
conditional on each possible score, and is probably of greater technical
value.
• However, as Haertel (2006: 82) points out, CTT also has techniques for
estimating SEM conditional on score.
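As a worked illustration of the single-value CTT case, the sketch below uses the standard classical formula SEM = SD × √(1 − reliability). The formula is not stated on the slides, and the standard deviation (10), reliability (0.91), observed score (62) and function names are assumptions chosen for the example.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical (single-value) SEM: SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed, sem, z=1.0):
    """Band of z SEMs either side of an observed score."""
    return observed - z * sem, observed + z * sem


# Assumed values: score SD of 10 and a reliability coefficient of 0.91.
sem = standard_error_of_measurement(score_sd=10.0, reliability=0.91)
low, high = score_band(observed=62.0, sem=sem)
print(f"SEM = {sem:.1f}; band = {low:.1f} to {high:.1f}")
```

Under these assumptions the SEM is about 3 score points, so an observed score of 62 corresponds to a band of roughly 59 to 65. If errors are taken to be normally distributed, a band of one SEM either side is commonly read as covering the true score about two times in three. If a cut-off fell at 60, this learner's classification would be uncertain even though the test as a whole is highly reliable.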
Internal consistency as the definition of a trait
• It is important to note that internal consistency is conceptually quite
unrelated to the definition of reliability.
• Think of a short test consisting of items on, say: your shoe size,
visual acuity, the number of children you have, and the distance from
your house to work. Assume that with appropriate procedures each of
these can be found without error, for a group of candidates. The
reliability of this error-free test will be a perfect 1.
• But these items are completely unrelated to each other, and so an
internal consistency estimate of their reliability would be about zero.
For this reason too, it is impossible to put a name to this test, that is,
to say what it is actually a test of.
• Now suppose the test contained, say, items on shoe size, height,
gender. This time it is likely that on administering the test the
internal consistency estimate of reliability would be found to be
considerably higher than zero.
• The difference is that this time the items are related to each other.
• Study them and you could probably name what it measures:
something like ‘physical build’.
• So the trait which a test actually measures is whatever explains its
internal consistency.
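The contrast between the two imaginary tests can be sketched numerically. The example below uses Cronbach's alpha as the internal consistency estimate (the slides do not name a particular coefficient) and simulated standardised item scores rather than real shoe sizes or heights. The shared `build` factor, the sample size and the noise levels are all assumptions.

```python
import random
import statistics


def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of k lists, one value per test-taker."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    item_variance = sum(statistics.pvariance(column) for column in items)
    return (k / (k - 1)) * (1.0 - item_variance / statistics.pvariance(totals))


rng = random.Random(7)
n = 5000

# Test 1: four items with nothing in common (independent values).
unrelated = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]

# Test 2: three items that each reflect a shared underlying trait plus noise.
build = [rng.gauss(0, 1) for _ in range(n)]
related = [[b + rng.gauss(0, 1) for b in build] for _ in range(3)]

alpha_unrelated = cronbach_alpha(unrelated)
alpha_related = cronbach_alpha(related)
print(f"unrelated items: {alpha_unrelated:.2f}, related items: {alpha_related:.2f}")
```

The unrelated items should give an alpha close to zero, while the items driven by the shared factor should give a value well above it (about 0.75 in expectation under these assumptions). Nothing in the calculation says what the shared factor is. Naming it remains a matter of studying the items.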
Reliability and validity
• Validity nowadays tends to be judged in terms of whether the uses
made of test results are justified (Messick, 1989). This implies a
complex set of arguments that go well beyond the older and purely
psychometric issue of whether the test measures what it is believed to
measure.
• Coherent measurement and construct definition
• In the trait-based, unidimensional approaches, conceptions of validity and
reliability emerge as rather closely linked. They both relate to the same notion of
coherent measurement: focusing in on ‘one thing’ at a time.
• Typically this means identifying skills such as Reading, Writing, Listening and
Speaking as distinct traits, and testing them separately.
• Each of these traits requires definition: what do we understand by ‘Reading’ or
‘Listening’ ability, and how is it to be tested?
• Such construct definition provides the basis of a validity argument for how test
results can be interpreted.
• Defining constructs encourages test developers to identify explicit models of
language competence, enables useful profiling of an individual learner’s strengths
or weaknesses, and helps to interpret test performance in meaningful terms.
• Focusing on specific contexts
• The conclusion is thus that the trait-based measurement
models presented here enable approaches to language
proficiency testing which can work well, achieving a useful
blend of reliability, validity and practicality.
• However, there is a condition: each testing context must be
treated on its own terms, and tests designed for one context
may not be readily comparable with tests designed for
another context.
• Mislevy (1992: 22) identifies four possible levels at which tests can be compared:
• Equating – the strongest level: refers to testing the same thing in the same way, e.g. two
tests constructed from the same test specification to the same blueprint. Equating such
tests allows them to be used interchangeably.
• Calibration – refers to testing the same thing in a different way, e.g. two tests
constructed from the same specification but to a different blueprint, which thus have
different measurement characteristics.
• Projection – refers to testing a different thing in a different way, e.g. where constructs
are differently specified. It predicts learners’ scores on one test from another, with
accuracy dependent on the degree of similarity. It is relevant where both tests target the
same basic population of learners.
• Moderation – the weakest level: can be applied where performance on one test does not
predict performance on the other for an individual learner, e.g. tests of French and
German.
Issues with reliability
In practice language testing seeks to achieve both reliability and validity within
the practical constraints which limit every testing context.
The aim should be to optimise both, rather than prioritise one over the other.
If reliability is prioritised, then indeed it may conflict with validity.
Internal consistency estimates of reliability make it possible to drive up the
reliability of tests over time, simply by weeding out items which correlate less
highly with the others.
This, as Ennis (1999) points out, is potentially a serious threat to the validity of a
test, as it leads to a progressive narrowing of what is tested, without explicit
consideration of how the content of the test is being modified.
A classic way of narrowing the testing focus is to restrict the range of task types
used and select items primarily on psychometric quality – the discrete item
multiple-choice test format which Spolsky questioned.
Trait-based measures versus cognitive models
The trait-based measurement approach is most useful in summative
assessment, where at the end of a course of study the learner’s
achievements can be summarised as a simple grade or proficiency
level.
Formative assessment, which aims to feed forward into future
learning, needs to provide more information, not simply about how
much a learner knows, but about the nature of that knowledge.
As Mislevy (1992: 15) states: ‘Contemporary conceptions of
learning do not describe developing competence in terms of
increasing trait values, but in terms of alternative constructs.’
Thank You

Reliability and dependability by neil jones

  • 1. Reliability and Dependability by Neil Jones The Routledge Handbook of Language Testing by Glenn Fulcher and Fred Davidson Prepared By: Amirhamid Foroughameri ahfameri@gmail.com November 2015
  • 2. Reliability as an aspect of test quality • Reliability and validity are classically cited as the two most important properties of a test. • Bachman (1990) identified four key qualities – validity, reliability, impact and practicality. • He proposed that in any testing situation validity and reliability should be maximised to produce the most useful results for test users, within practical constraints that always exist. • Here, reliability will be presented rather as an integral component of validity, and approaches to estimating reliability as potential sources of evidence for the construct validity of a test.
  • 3. Measurement • The idea that quantification is the way to understanding was memorably expressed by Kelvin in 1883: • … when you can measure what you are speaking about, and express it in numbers you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind. • (Kelvin, quoted by Stellman, 1998: 1973)
  • 4. Does this apply to the case of language proficiency? The answer could be No for two reasons: • First, it suggests that language proficiency is an enduring real property that resides in a person’s head and can be quantified, like their height or weight. • Second the metaphor implies that language proficiency, like temperature, has a single unique meaning, and can be precisely quantified. We cannot take a one-size-fits-all approach to language assessment.
  • 5. The concept of reliability • Reliability equals consistency • Reliability in assessment means something rather different to its everyday use as a synonym of ‘trustworthy’ or ‘accurate’. • However, in testing reliability has the narrower meaning of ‘consistent’. • A reliable test is consistent in that it produces the same or similar result on repeated use; that is, it would rank-order a group of test takers in nearly the same way. • But the result need not be a correct or accurate measure of what the test claims to measure. • Just as a train service can run consistently late, a test may provide an incorrect result in a consistent manner. • High reliability does not necessarily imply that a test is good, i.e., valid. • Nonetheless, a valid test must have acceptable reliability, because without it the results can never be meaningful. • Thus a degree of reliability is a necessary but not sufficient condition of validity.
  • 6. • Reliability and error • When a group of learners takes a test their scores will differ, reflecting their relative ability. • Reliability is defined as the proportion o f variation in scores caused by the ability measured, and not by other factors. • This proportion is typically described as a correlation (or correlation-like) coefficient. • Depending on the type of reliability being analysed, what is correlated with what will change. • A perfectly reliable test would have a reliability coefficient (r) of 1. • The variability caused by other factors is called error.
  • 7.
  • 8. Replications and generalizability ‘A person with one watch knows what time it is; a person with two watches is never quite sure.’ Thus Brennan (2001: 295) introduces a presentation of reliability from the perspective of replications. Information from only one observation may easily deceive, because unverifiable, while to get direct information about consistency (i.e., reliability) at least two instances are required. Replications in some form are necessary to estimate reliability.
  • 9. Even more importantly, Brennan argues, ‘they are required for an unambiguous conceptualization of the very notion of reliability.’ Specifying exactly what would constitute a replication of a measurement procedure is necessary to provide any meaningful statement about its reliability. The individual variation in test-takers from one day to another is difficult to measure, because the test is taken only once. Thus its impact is very likely ignored, leading to an overestimate of reliability, unless we can do specific experiments to replicate the testing event in a way that will provide evidence.
  • 10. • Reliability and dependability • Dependability is a term sometimes used (in preference to reliability) to refer to the consistency of a classification – that is, of a test-taker receiving the same grade or score interpretation on repeated testing. • The way the term is used relates to the distinction made between norm-referenced and criterion-referenced approaches to testing. • Taken literally, norm-referencing means interpreting a learner’s performance relative to other learners, i.e., as better or worse, while criterion-referencing interprets performance relative to some fixed external criterion, such as a specified level of a proficiency framework like the CEFR. • The term dependability is used in a criterion-referencing context where the aim is to classify learners, for example as masters or non-masters of a domain of knowledge.
  • 11. • But if dependability relates to a particular criterion-referenced approach to interpretation we should not conclude that reliability relates only to norm-referenced interpretations. • It is true that reliability is defined in terms of the consistency with which individuals are ranked relative to each other, but in many testing applications it is no less concerned with consistency of classification relative to cut-off points that have well-defined criterion interpretations. • Item response theory has the particular advantage that it models a learner’s ability in terms of probable performance on specific tasks. • Henning (1987: 111) argues that IRT reconciles norm- and criterion-referencing.
  • 12. • The standard error of measurement • The standard error of measurement (SEM) is a transformation of reliability in terms of test scores, which is useful in considering consistency of classification. • While reliability refers to a group of test-takers, the SEM shows the impact of reliability on the likely score of an individual: it indicates how close a test-taker’s score is likely to be to their ‘true score’. • One difference often cited between CTT and IRT is that CTT SEM is a single value applied to all possible scores in a test, while the IRT SEM is conditional on each possible score, and is probably of greater technical value. • However, as Haertel (2006: 82) points out, CTT also has techniques for estimating SEM conditional on score.
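As a rough illustration of how the SEM translates reliability into score units, the sketch below uses the standard classical test theory formula, SEM = SD × √(1 − r). The slide does not give this formula, and the function names and the numbers (a score standard deviation of 10, reliability 0.91, observed score 60) are assumptions chosen purely for the example.

```python
import math


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Single-value CTT SEM: score SD times sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed: float, sem: float, width: float = 1.0) -> tuple:
    """Band of +/- width * SEM around an observed score."""
    return (observed - width * sem, observed + width * sem)


# Hypothetical test: score SD of 10 points, reliability of 0.91.
sem = standard_error_of_measurement(score_sd=10.0, reliability=0.91)
low, high = score_band(observed=60.0, sem=sem)
print(round(sem, 2), round(low, 2), round(high, 2))  # 3.0 57.0 63.0
```

If a cut-off point fell inside the 57 to 63 band, the classification of this hypothetical test-taker could easily change on a repeated testing, which is why the SEM is useful when considering consistency of classification. This is the single-value CTT form; the conditional SEM mentioned above is not covered by the sketch.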
  • 13. Internal consistency as the definition of a trait • It is important to note that internal consistency is conceptually quite unrelated to the definition of reliability. • Think of a short test consisting of items on, say: your shoe size, visual acuity, the number of children you have, and the distance from your house to work. Assume that with appropriate procedures each of these can be found without error, for a group of candidates. The reliability of this error-free test will be a perfect 1. • But these items are completely unrelated to each other, and so an internal consistency estimate of their reliability would be about zero. For this reason too, it is impossible to put a name to this test, that is, to say what it is actually a test of.
  • 14. Internal consistency as the definition of a trait • Now suppose the test contained, say, items on shoe size, height, gender. This time it is likely that on administering the test the internal consistency estimate of reliability would be found to be considerably higher than zero. • The difference is that this time the items are related to each other. • Study them and you could probably name what it measures: something like ‘physical build’. • So the trait which a test actually measures is whatever explains its internal consistency.
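The contrast between the two imaginary tests can be sketched numerically. The slides do not say which internal consistency estimate is meant; the code below assumes Cronbach's alpha, one common choice, and uses small invented data sets rather than real measurements. In the first data set the items are deliberately uncorrelated, and in the second they move together perfectly.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha; scores[p][i] is person p's score on item i."""
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across persons.
    item_vars = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Variance of the total (summed) scores across persons.
    total_var = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1.0 - sum(item_vars) / total_var)


# Invented data: two items with zero covariance (items unrelated).
unrelated = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

# Invented data: three items that rank every person identically.
related = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]

print(round(cronbach_alpha(unrelated), 3))  # 0.0
print(round(cronbach_alpha(related), 3))    # 1.0
```

Both invented tests are measured without error, yet only the second yields a high coefficient, because only there do the items share something that could be named as a trait.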
  • 15. Reliability and validity • Validity nowadays tends to be judged in terms of whether the uses made of test results are justified (Messick, 1989). This implies a complex set of arguments that go well beyond the older and purely psychometric issue of whether the test measures what it is believed to measure.
  • 16. Reliability and validity • Coherent measurement and construct definition • In the trait-based, unidimensional approaches, conceptions of validity and reliability emerge as rather closely linked. They both relate to the same notion of coherent measurement: of focusing in on ‘one thing’ at a time. • Typically this means identifying skills such as Reading, Writing, Listening and Speaking as distinct traits, and testing them separately. • Each of these traits requires definition: what do we understand by ‘Reading’ or ‘Listening’ ability, and how is it to be tested? • Such construct definition provides the basis of a validity argument for how test results can be interpreted. • Defining constructs encourages test developers to identify explicit models of language competence, enables useful profiling of an individual learner’s strengths or weaknesses, and helps to interpret test performance in meaningful terms.
  • 17. • Focusing on specific contexts • The conclusion is thus that the trait-based measurement models presented here enable approaches to language proficiency testing which can work well, achieving a useful blend of reliability, validity and practicality. • However, there is a condition: each testing context must be treated on its own terms, and tests designed for one context may not be readily comparable with tests designed for another context.
  • 18. • Mislevy (1992: 22) identifies four possible levels at which tests can be compared: • Equating – the strongest level: refers to testing the same thing in the same way, e.g. two tests constructed from the same test specification to the same blueprint. Equating such tests allows them to be used interchangeably. • Calibration – refers to testing the same thing in a different way, e.g. two tests constructed from the same specification but to a different blueprint, which thus have different measurement characteristics. • Projection – refers to testing a different thing in a different way, e.g. where constructs are differently specified. It predicts learners’ scores on one test from another, with accuracy dependent on the degree of similarity. It is relevant where both tests target the same basic population of learners. • Moderation – the weakest level: can be applied where performance on one test does not predict performance on the other for an individual learner, e.g. tests of French and German.
  • 19. Issues with reliability In practice language testing seeks to achieve both reliability and validity within the practical constraints which limit every testing context. The aim should be to optimise both, rather than prioritise one over the other. If reliability is prioritised, then indeed it may conflict with validity. Internal consistency estimates of reliability make it possible to drive up the reliability of tests over time, simply by weeding out items which correlate less highly with the others. This, as Ennis (1999) points out, is potentially a serious threat to the validity of a test, as it leads to a progressive narrowing of what is tested, without explicit consideration of how the content of the test is being modified. A classic way of narrowing the testing focus is to restrict the range of task types used and select items primarily on psychometric quality – the discrete item multiple-choice test format which Spolsky questioned.
  • 20. Trait-based measures versus cognitive models The trait-based measurement approach is most useful in summative assessment, where at the end of a course of study the learner’s achievements can be summarised as a simple grade or proficiency level. Formative assessment, which aims to feed forward into future learning, needs to provide more information, not simply about how much a learner knows, but about the nature of that knowledge. • As Mislevy (1992: 15) states: ‘Contemporary conceptions of learning do not describe developing competence in terms of increasing trait values, but in terms of alternative constructs.’