Reliability refers to the consistency of test scores: a reliable test produces similar results over multiple administrations. There are several methods for determining reliability, including internal consistency, test-retest reliability, inter-rater reliability, and split-half reliability. Validity refers to how well a test measures what it intends to measure; it can be established through face validity, construct validity, content validity, and criterion validity. Both reliability and validity are important for a high-quality test, as a test can be reliable without being valid.
2. Reliability
• Reliability is synonymous with consistency. It is
the degree to which test scores for an individual
test taker or group of test takers are consistent
over repeated applications.
• No psychological test is completely consistent; however, a measurement that is unreliable is worthless.
3. Would you keep using these
measurement tools?
The consistency of test scores is critically
important in determining whether a test
can provide good measurement.
4. When someone says you are a
‘reliable’ person, what do they really
mean?
Are you a reliable person?
5. Reliability (cont.)
* Because no unit of measurement is exact, any time you
measure something (observed score), you are really
measuring two things.
1. True Score - the amount of observed score that truly
represents what you are intending to measure.
2. Error Component - the amount of other variables that can impact the observed score.
Observed Test Score = True Score + Errors of
Measurement
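To make the model concrete, here is a minimal Python sketch of the relationship; the true score and the error spread (error_sd) are hypothetical values chosen purely for illustration:

```python
import random

random.seed(1)

def observed_score(true_score, error_sd=3.0):
    """One administration: observed = true score + random error.
    error_sd is a hypothetical spread for the error component."""
    return true_score + random.gauss(0, error_sd)

true = 50                                 # hypothetical true score
scores = [observed_score(true) for _ in range(5)]
print([round(s, 1) for s in scores])      # observed scores vary around 50
```

Repeated administrations scatter around the true score; the smaller the error component, the more consistent (reliable) the observed scores.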
6. Measurement Error
• Any fluctuation in test scores that results from
factors related to the measurement process that
are irrelevant to what is being measured.
• The difference between the observed score and the true score is called the error score:
S_true = S_observed − S_error
7. Measurement Error is Reduced By:
- Writing items clearly
- Making instructions easily understood
- Adhering to proper test administration
- Providing consistent scoring
8. Determining Reliability
• There are several ways that reliability can be determined, depending on the type of measurement and the supporting data required. They include:
- Internal Consistency
- Test-retest Reliability
- Inter-rater Reliability
- Split-half Methods
- Odd-even Reliability
- Alternate Forms Methods
9. Internal Consistency
• Measures the reliability of a test based solely on the number of items on the test and the intercorrelation among the items. Therefore, it compares each item to every other item.
Cronbach’s Alpha: .80 to .95 (Excellent)
.70 to .80 (Very Good)
.60 to .70 (Satisfactory)
<.60 (Suspect)
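For concreteness, here is a minimal sketch of how Cronbach's alpha is computed from an item-by-examinee score matrix; the item scores below are hypothetical:

```python
def cronbach_alpha(items):
    """items: one inner list per item, each with one score per examinee."""
    k = len(items)                       # number of items
    n = len(items[0])                    # number of examinees

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = sum(variance(item) for item in items)
    totals = [sum(item[p] for item in items) for p in range(n)]
    # alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    return (k / (k - 1)) * (1 - item_vars / variance(totals))

# Hypothetical data: 4 items scored for 5 examinees
items = [
    [3, 4, 2, 5, 4],
    [2, 4, 3, 5, 3],
    [3, 5, 2, 4, 4],
    [2, 4, 2, 5, 3],
]
print(round(cronbach_alpha(items), 2))   # 0.93 -> "Excellent" on the scale above
```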
10. Split-Half & Odd-Even Reliability
Split Half - refers to determining a correlation between the first
half of the measurement and the second half of the measurement
(i.e., we would expect answers to the first half to be similar to the
second half).
Odd-Even - refers to the correlation between even items and odd
items of a measurement tool.
• In this sense, we are using a single test to create two tests,
eliminating the need for additional items and multiple
administrations.
• Since only one administration is needed in both of these types, and the groups are determined by the internal components of the test, they are referred to as internal consistency measures.
11. Split-half reliability
[error due to differences in item content between the halves of the test]
• Typically, responses on odd versus even items are employed
• Correlate total scores on odd items with the scores obtained
on even items
Person   Odd   Even
1        36    43
2        44    40
3        42    37
4        33    40
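A minimal sketch of the odd-even correlation using the four persons tabled above. Note that four examinees are far too few for a stable estimate; these toy numbers in fact correlate negatively, which would flag the two halves as inconsistent:

```python
from statistics import correlation  # Python 3.10+

odd  = [36, 44, 42, 33]   # odd-item totals from the table above
even = [43, 40, 37, 40]   # even-item totals

r_half = correlation(odd, even)
# In practice the half-test r is stepped up to full-test length with the
# Spearman-Brown formula (not shown on the slide): r_full = 2r / (1 + r).
print(round(r_half, 2))   # -0.48 for this toy data
```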
13. Test-retest Reliability (cont.)
• The amount of time allowed between measures is critical.
• The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation, because real change in the test taker and intervening events accumulate over time.
• Optimum time between administrations is 2 to 4 weeks.
• The rationale behind this method is that any difference between the scores on the test and the retest should be due solely to measurement error.
14. Inter-rater Reliability
• Whenever you use humans as a part of your measurement
procedure, you have to worry about whether the results you get
are reliable or consistent. People are notorious for their
inconsistency. We are easily distractible. We get tired of doing
repetitive tasks. We daydream. We misinterpret.
15. Inter-rater Reliability (cont.)
• For some scales it is important to assess interrater
reliability.
• Interrater reliability means that if two different raters
scored the scale using the scoring rules, they should
attain the same result.
• Interrater reliability is usually measured by computing
the correlation coefficient between the scores of two
raters for the set of respondents.
• Here the criterion of acceptability is pretty high (e.g., a
correlation of at least .9), but what is considered
acceptable will vary from situation to situation.
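A minimal sketch of the computation described above, with hypothetical ratings from two raters and the .9 guideline applied:

```python
from statistics import correlation  # Python 3.10+

rater_a = [4, 3, 5, 2, 4, 5]   # hypothetical scores from rater A
rater_b = [4, 3, 5, 3, 4, 5]   # hypothetical scores from rater B

r = correlation(rater_a, rater_b)
print(round(r, 2), "acceptable" if r >= 0.9 else "below the .9 guideline")
# prints: 0.96 acceptable
```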
16. Parallel/Alternate Forms Method
Parallel/Alternate Forms Method - refers to the
administration of two alternate forms of the
same measurement device and then comparing
the scores.
• Both forms are administered to the same person and
the scores are correlated. If the two produce the
same results, then the instrument is considered
reliable.
17. Parallel/Alternate Forms Method (cont.)
• A correlation between these two forms is computed just as in the test-retest method.
Advantages
• Eliminates the problem of memory effect.
• Reactivity effects (i.e., experience of taking the test) are
also partially controlled.
18. Factors Affecting Reliability
• Administrator Factors
• Number of Items on the instrument
• The Instrument Taker
• Heterogeneity of the Items
• Heterogeneity of the Group Members
• Length of Time between Test and Retest
19. How High Should Reliability Be?
• A highly reliable test is always preferable to a test with
lower reliability.
.80 or greater (Excellent)
.70 to .80 (Very Good)
.60 to .70 (Satisfactory)
<.60 (Suspect)
• A reliability coefficient of .80 indicates that 20% of the
variability in test scores is due to measurement error.
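In classical test theory terms, the reliability coefficient is the proportion of observed-score variance attributable to true scores, so the claim above follows directly:

```latex
r_{xx} = \frac{\sigma^2_{\text{true}}}{\sigma^2_{\text{observed}}} = .80
\quad\Longrightarrow\quad
\frac{\sigma^2_{\text{error}}}{\sigma^2_{\text{observed}}} = 1 - .80 = .20
```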
20. Reliability deals with consistency. Reliability is the quality that assures us that we will get similar results when administering the same test to the same population every time.
Consider this ruler…
27. Validity
Depends on the PURPOSE
E.g. a ruler may be a valid measuring device
for length, but isn’t very valid for measuring
volume
Measuring what ‘it’ is supposed to
Matter of degree (how valid?)
Specific to a particular purpose!
Learning outcomes
1. Content coverage (relevance?)
2. Level & type of student engagement
(cognitive, affective, psychomotor) –
appropriate?
28. Types of validity measures
Face validity
Construct validity
Content validity
Criterion validity
29. Face Validity
Does it appear to measure what it is supposed to measure?
Example: Let’s say you are interested in measuring,
‘Propensity towards violence and aggression’. By simply
looking at the following items, state which ones qualify to
measure the variable of interest:
Have you been arrested?
Have you been involved in physical fighting?
Do you get angry easily?
Do you sleep with your socks on?
Is it hard to control your anger?
Do you enjoy playing sports?
30. Construct Validity
Does the test measure the ‘human’ theoretical construct or trait?
Examples
Mathematical reasoning
Verbal reasoning or fluency
Musical ability
Spatial ability
Motivation
Applicable to authentic assessment
Each construct is broken down into its component parts
E.g. ‘motivation’ can be broken down to:
Interest
Attention span
Hours spent
Assignments undertaken and submitted, etc.
All of these sub-constructs put together – measure ‘motivation’
31. Content Validity
How well do elements of the test relate to the content domain?
How closely does the content of the questions in the test relate to the content of the curriculum?
Directly relates to instructional objectives and the fulfillment of the same!
Major concern for achievement tests (where content is emphasized)
Can you test students on things they have not been taught?
32. How to establish Content Validity?
Instructional objectives (looking at your list)
Table of Specification
E.g.
At the end of the chapter, the student will be able to
do the following:
1. Explain what ‘stars’ are
2. Discuss the type of stars and galaxies in our universe
3. Categorize different constellations by looking at the stars
4. Differentiate between our stars, the sun, and all other stars
33. Table of Specification (An Example)
Rows are content areas; columns are categories of performance (mental skills).

Content areas          Knowledge   Comprehension   Analysis   Total
1. What are ‘stars’?
2. Our star, the Sun
3. Constellations
4. Galaxies
Total                                                          Grand Total
34. Criterion Validity
The degree to which content on a test (predictor)
correlates with performance on relevant criterion
measures (concrete criterion in the "real" world?)
If they do correlate highly, it means that the test
(predictor) is a valid one!
E.g. if you taught skills relating to ‘public speaking’ and
had students do a test on it, the test can be validated by
looking at how it relates to actual performance (public
speaking) of students inside or outside of the
classroom
35. Factors that can lower Validity
Unclear directions
Difficult reading vocabulary and sentence structure
Ambiguity in statements
Inadequate time limits
Inappropriate level of difficulty
Poorly constructed test items
Test items inappropriate for the outcomes being measured
Tests that are too short
Improper arrangement of items (complex to easy?)
Identifiable patterns of answers
Teaching
Administration and scoring
Students
Nature of criterion
36. Validity and Reliability
[Four target diagrams illustrate the combinations:]
• Neither valid nor reliable
• Reliable but not valid
• Valid & reliable
• Fairly valid but not very reliable
Think in terms of ‘the purpose of tests’ and the ‘consistency’ with which the purpose is fulfilled/met.
37. Objectivity
the state of being fair, without bias or external
influence.
if the test is marked by different people, the score will be the same. In other words, the marking process should not be affected by the marker's personality.
Not influenced by emotion or personal
prejudice. Based on observable phenomena;
presented factually: an objective appraisal.
The questions and answers should be clear
38. An objective test measures an individual's characteristics in a way that is independent of the rater's bias or the examiner's own beliefs.
It gauges the test taker's conscious thoughts and feelings without regard to the test administrator's beliefs or biases.
This helps greatly in determining the test taker's personality.
39. Understanding Norms
a list of scores and corresponding percentile ranks,
standard scores, or other transformed scores of a
group of examinees on whom a test was
standardized.
In a psychometric context, “norms are the test performance data of a particular group of test takers that are designed for use as a reference for evaluating or interpreting individual test scores” (Cohen & Swerdlik, 2002, p. 100).
40. TYPES OF NORMS
•Percentiles
- refer to a distribution divided into 100
equal parts.
- refer to the score at or below which a
specific percentage of scores fall.
Ex. A student got a percentile rank of 90 on the NAT exam. What does this mean?
41. It means that 90% of his classmates scored lower than his score, or that 10% of his classmates scored above it.
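A minimal sketch of the percentile-rank computation, using the “at or below” definition from slide 40; the class scores are hypothetical:

```python
def percentile_rank(score, group_scores):
    """Percentage of scores in the group at or below the given score."""
    at_or_below = sum(1 for s in group_scores if s <= score)
    return 100.0 * at_or_below / len(group_scores)

# Hypothetical class of ten scores
scores = [55, 60, 62, 65, 70, 72, 75, 80, 85, 90]
print(percentile_rank(85, scores))   # 90.0 -> the 90th percentile
```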
42. Age Norms (age-equivalent scores)
–“indicate the average performance of
different samples of test takers who were at
various ages at the time the test was
administered” (Cohen & Swerdlik, 2002, p.
105).
Grade Norms
–Used to indicate the average test performance of test takers in a specific grade.
–Based on a ten-month scale; refers to grade and month (e.g., 7.3 is equivalent to seventh grade, third month).
43. •National Norms
–Derived from a standardization sample nationally
representative of the population of interest.
Subgroup Norms
–Are created when narrowly defined groups are
sampled.
Ex. •Socioeconomic status
•Handedness
•Education level
44. Local Norms
–Are derived from the local population’s performance
on a measure.
- Typically created locally (i.e., by guidance counselor,
personnel director, etc.)
Fixed Reference Group Scoring Systems
•Calculation of test scores is based on a fixed
reference group that was tested in the past.
45. •Norm-referenced tests consider the individual's score relative to the scores of test takers in the normative sample.
•Criterion-referenced tests consider the individual's score relative to a specified standard or criterion (cut score).
–Licensure exams
–Proficiency tests
46. Item Analysis
A name given to a variety of statistical techniques
designed to analyze individual items on a test
It involves examining class-wide performance on
individual test items.
It sometimes suggests why an item has not
functioned effectively and how it might be
improved
A test composed of items revised and selected on the basis of item analysis is almost certain to be more reliable than one composed of an equal number of untested items.
47. Difficulty index
The proportion of students in the class who got an item correct. The larger the proportion, the more students have learned the content measured by the item.
48. Discrimination index
A basic measure of the validity of an item.
A measure of an item’s ability to
discriminate between those who scored high
on the total test and those who scored low.
It can be interpreted as an indication of the
extent to which overall knowledge of the
content area or mastery of the skill is related
to the response on an item
49. Analysis of response options/distracter
analysis
In addition to examining the performance of a test
item, teachers are often interested in examining
the performance of individual distracters
(incorrect answer options) on multiple-choice
items
By calculating the proportion of students who
chose each answer option, teachers can identify
which distracters are working and appear to be
attractive to students who do not know the correct
answer, and which distracters are simply taking up
space and not being chosen by many students
50. To reduce blind guessing, which can produce a correct answer purely by chance (and hurts the validity of a test item), teachers want as many plausible distracters as is feasible.
51. The process of item analysis
1. Arrange the test scores from highest to lowest
2. Select the criterion groups
Identify a High group and a Low group. The High
group is the highest-scoring 27% of the group and the Low
group is the lowest scoring 27%
The highest and lowest 27% of the examinees are called the criterion groups. Using 27% provides the best compromise between two desirable but inconsistent aims: to make the extreme groups as large as possible and as different as possible.
We can then say with confidence that those in the High group are superior in the ability measured by the test to those in the Low group.
52. 3. For each item, count the number of
examinees in the High group who have correct
responses. Do a separate, similar procedure for the
low group
4. Solve for the difficulty index of each item.
The larger the value of the index, the easier the item; the smaller the value, the more difficult the item.
Scale for interpreting the difficulty index of an item:
Below 0.25    item is very difficult
0.25 – 0.75   item is of average difficulty (rightly difficult)
Above 0.75    item is very easy
53. Example: Item analysis
1. Count and arrange the scores from highest to
lowest.
Ex. n=43 scores
2. Calculate the size of each criterion group (N), which is 27% of the total number of scores.
Ex. N = 27% of 43 = (0.27)(43) = 11.61 ≈ 12
3. Take 12 scores from the highest down and take 12
scores from the lowest up, call these High group and
Low group respectively.
4. Tabulate the number of responses to each option from the High and Low groups for the particular item under analysis.
54. 5. Solve for the difficulty index of each item
The larger the value of the index, the easier the
item. The smaller, the more difficult.
Scale for interpreting the difficulty index of an item:
Below 0.25    item is very difficult
0.25 – 0.75   item is of average difficulty (rightly difficult)
Above 0.75    item is very easy
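Steps 1–3 of the worked example on slide 53 can be sketched as follows; the score list here is a hypothetical stand-in for real data:

```python
# Hypothetical list of 43 test scores (replace with real data)
scores = sorted(range(58, 101), reverse=True)   # 43 scores, highest first

n = round(0.27 * len(scores))    # criterion-group size: 0.27 * 43 = 11.61 -> 12
high_group = scores[:n]          # highest-scoring 27%
low_group  = scores[-n:]         # lowest-scoring 27%
print(n, high_group[0], low_group[-1])   # 12 100 58
```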
55. Ex: Item #5 of the multiple-choice test; D is the correct option.

              A   B   C   D*   E   Total
Upper Group   1   1   0   9    1   12
Lower Group   3   1   4   4    0   12
56. The following can be used to interpret the index of discrimination.

Idis Index    Description   Interpretation
0.40 – 1.0    High          The item is very good
0.30 – 0.39   Moderate      Reasonably good, can be improved
0.20 – 0.29   Moderate      In need of improvement
< 0.20        Low           Poor, to be discarded
57. Interpreting the results by giving a value judgment:

Idis            Idif             Item category
High            Easy             Good
High            Easy/difficult   Fair
Moderate        Easy/difficult   Fair
High/moderate   Easy/difficult   Fair
Low             At any level     Poor (discard the item)
58. Index of difficulty = (Hc + Lc) / 2N = (9 + 4) / (2 × 12) = .54
→ the item is rightly difficult
Index of discrimination = (Hc − Lc) / N = (9 − 4) / 12 = .42
→ high index of discrimination; the item has the power to discriminate
Hence, item number 5 is to be retained.
Distracter analysis: A and C are good distracters.
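The worked figures above can be reproduced with a short sketch; the option counts are taken from the table on slide 55:

```python
# Response counts for item #5; D is the correct option (the key)
upper = {"A": 1, "B": 1, "C": 0, "D": 9, "E": 1}   # High group
lower = {"A": 3, "B": 1, "C": 4, "D": 4, "E": 0}   # Low group
N = 12                                             # examinees per group
key = "D"

difficulty = (upper[key] + lower[key]) / (2 * N)   # (9 + 4) / 24 = .54
discrimination = (upper[key] - lower[key]) / N     # (9 - 4) / 12 = .42
print(round(difficulty, 2), round(discrimination, 2))

# Distracter analysis: a working distracter attracts more of the Low
# group than the High group; here A and C do, while B and E do not.
for option in "ABCE":
    print(option, upper[option], lower[option])
```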