Reliability

Test/Retest Reliability Estimates

A second problem with the Test/Retest method is the length
of time required to conduct the two test administrations.

A short delay between Time 1 and Time 2 increases the
potential for carry-over effects due to memory, fatigue,
practice, etc.

But a long delay between Time 1 and Time 2 increases the
potential for carry-over effects due to mood, developmental
change, etc.

Consequently, the Test/Retest method is most appropriate in
contexts wherein the test is not susceptible to carry-over
effects.

Parallel-Forms & Alternate-Forms Reliability Estimates

Internal Consistency Estimates of Reliability

We have seen that reliability estimates can be obtained by
administering the same test to the same examinees twice and
correlating the results: the Test/Retest method.

We have also seen that reliability estimates can be obtained by
administering two parallel or alternate forms of a test and then
correlating those results: the Parallel- & Alternate-Forms methods.

In both of the above cases, the researcher must administer two
exams, sometimes at different times, which makes the
estimates susceptible to carry-over effects.

Here, we will see that it is possible to obtain a reliability estimate
using only a single test.

The most common way to obtain a reliability estimate using a
single test is through the Split-half approach.

Split-Half approach to Reliability

When using the Split-Half approach, one gives a single test to
a group of examinees.

Later, the test is divided into two parts, which may be
considered to be alternate forms of one another.
• In fact, the split is not so arbitrary; an attempt is made to
choose the two halves so that they are parallel or
essentially τ-equivalent.
• If the halves are considered parallel, then the reliability
of the whole test is estimated using the Spearman-
Brown formula.

• If the halves are essentially τ-equivalent, then the
coefficient α can be used to estimate reliability.


Correlation between halves    Spearman-Brown estimate of whole-test reliability
0.00                          0.00
0.20                          0.33
0.40                          0.57
0.60                          0.75
0.80                          0.89
1.00                          1.00
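
The paired values above agree with the Spearman-Brown formula for a test made of two parallel halves, reliability = 2r / (1 + r), where r is the correlation between the two half-test scores. The short Python sketch below reproduces them; the function name is illustrative and does not come from the original slides.

```python
def spearman_brown_split_half(r_halves: float) -> float:
    """Estimate whole-test reliability from the correlation between two
    halves, assuming the halves are parallel."""
    return 2.0 * r_halves / (1.0 + r_halves)


# Each printed row pairs a half-test correlation with the estimated
# reliability of the whole test.
table = [(r, spearman_brown_split_half(r))
         for r in (0.00, 0.20, 0.40, 0.60, 0.80, 1.00)]
for r, whole in table:
    print(f"{r:.2f} {whole:.2f}")
```

Note that the estimate for the whole test is never lower than the correlation between the halves, because the whole test is twice as long as either half.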


On the other hand, the two test halves may not be (and likely
are not) parallel forms.

Unequal variances on the two halves confirm that the halves
are not parallel.

In these situations, it is best to use a different approach to
estimating reliability.
• Cronbach’s coefficient α

α can be used to estimate the reliability of the entire test.
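
For two halves, coefficient α is commonly written as α = 2[σ²_X − (σ²_Y1 + σ²_Y2)] / σ²_X, where σ²_X is the variance of the total scores and σ²_Y1 and σ²_Y2 are the variances of the two half scores. A minimal Python sketch under that assumption follows; the function name and the half scores are hypothetical illustration data.

```python
from statistics import pvariance


def alpha_two_halves(half1, half2):
    """Coefficient alpha for a test split into two halves.

    half1 and half2 hold each examinee's score on the respective half,
    in the same examinee order.
    """
    totals = [a + b for a, b in zip(half1, half2)]
    var_total = pvariance(totals)
    return 2.0 * (var_total - pvariance(half1) - pvariance(half2)) / var_total


# Hypothetical half scores for five examinees.
half1 = [3, 5, 2, 4, 1]
half2 = [2, 5, 3, 4, 1]
alpha = alpha_two_halves(half1, half2)
print(f"alpha = {alpha:.3f}")
```

Population variances are used here; sample variances give the same α because the common factor cancels in the ratio.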


If the test halves are not essentially τ-equivalent, then
coefficient α will give a lower bound for the test’s reliability.
• In other words, the test’s reliability must be greater than,
or equal to, the value produced by Cronbach’s α.

• If α is a high value, then you know that the test
reliability is also high.

• If α is a low value, then you may not know whether the
test actually has low reliability or whether the halves of
the test are simply not essentially τ-equivalent.


If the variances of the two test halves are equal, then the
Spearman-Brown formula and Cronbach’s α will produce
identical results.

If the variances of the two test halves are equal, but the
halves are not Essentially τ-Equivalent, then both the
Spearman-Brown formula and Cronbach’s α will
underestimate the test’s reliability.
• Lower bound estimate

If the observed-score variances of the test halves are equal
and the tests are Essentially τ-Equivalent, then the Spearman-
Brown formula and Cronbach’s α will both equal the test’s
reliability.
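
The agreement between the two estimates when the half variances are equal can be checked numerically. The sketch below works directly from assumed half-score variances and their covariance; the function names and the numbers are hypothetical and chosen only for illustration.

```python
def spearman_brown_from_moments(var1, var2, cov):
    """Split-half Spearman-Brown estimate from half variances and covariance."""
    r = cov / (var1 * var2) ** 0.5          # correlation between the halves
    return 2.0 * r / (1.0 + r)


def alpha_from_moments(var1, var2, cov):
    """Two-half coefficient alpha from the same quantities."""
    var_total = var1 + var2 + 2.0 * cov     # variance of the summed halves
    return 2.0 * (var_total - var1 - var2) / var_total


# Equal half variances: the two estimates coincide.
sb_equal = spearman_brown_from_moments(4.0, 4.0, 3.0)
alpha_equal = alpha_from_moments(4.0, 4.0, 3.0)

# Unequal half variances: the two estimates differ.
sb_unequal = spearman_brown_from_moments(4.0, 9.0, 3.0)
alpha_unequal = alpha_from_moments(4.0, 9.0, 3.0)

print(sb_equal, alpha_equal)
print(sb_unequal, alpha_unequal)
```

In the unequal-variance case of this example, α comes out below the Spearman-Brown value.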


The major advantage of internal-consistency reliability
estimates is that the test need only be given once to
obtain such an estimate.

This approach is limited, however, to tests that can be
divided into two parts that are either parallel or essentially
τ-equivalent. It cannot be used when no such division is
possible, or when the test lacks independent items that can be
separated from one another.
• In these situations, one must use the test/retest, parallel-forms,
or alternate-forms reliability approaches.

Assuming one is able to use the Split-Half approach, however,
how does one go about forming two test halves?


Forming Test Halves:

There are 3 commonly used methods for forming test halves:
1. The Odd/Even method

2. The Ordered method

3. The Matched Random Subsets method

Odd/Even approach to Test Halves

The Odd/Even method classifies items by their position on
the test, as either odd-numbered or even-numbered.
• In other words, all odd-numbered test items form the first
half, and all even-numbered test items form the second
half.

After the two halves are formed, a score for each half is
obtained for each examinee.

These scores are used to obtain an estimate of reliability.

This is a fairly simple and straightforward approach to
forming two test halves.
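
The Odd/Even procedure can be sketched in a few lines of Python. The response matrix below is hypothetical (rows are examinees, columns are items scored 0 or 1), the helper names are illustrative, and the final step assumes the two halves are parallel so that the Spearman-Brown formula applies.

```python
def odd_even_half_scores(responses):
    """Return each examinee's score on the odd- and even-numbered items."""
    odd = [sum(row[0::2]) for row in responses]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]   # items 2, 4, 6, ...
    return odd, even


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data: six examinees, six items.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 1],
]
odd, even = odd_even_half_scores(responses)
r_halves = pearson(odd, even)
reliability = 2.0 * r_halves / (1.0 + r_halves)   # Spearman-Brown step
print(odd, even, round(reliability, 2))
```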

Ordered approach to Test Halves

The Ordered method requires that a test be divided prior to its
administration.

From this point, there are several ways to administer
the Ordered method.
1. Every examinee can be given the same test, and then one can
compare scores from the first half to scores from the second
half.
• Carry-over effects may be a concern.

2. The halves are labeled, say A and B, and are then given in
different orders to different examinees.
• In other words, half the examinees will be randomly
assigned order A-B, and the other half will be assigned
order B-A.

The Ordered method is generally considered to be less
satisfactory than the Odd/Even method because of the increased
potential for carry-over effects.

The Matched Random Subsets approach to Test Halves

The Matched Random Subsets method is much more
sophisticated than the two aforementioned methods.

This process involves several steps:
1. For each test item, two statistics are computed:
• The proportion of examinees passing the item – a
measure of the item’s “difficulty.”
• The biserial or point-biserial correlation between the
item score and the total test score.

2. Each item is plotted on a graph using the above two
statistics.
• Items that are close together on the graph are paired,
and one item from each pair is randomly assigned to
one half of the test.
• The remaining items form the other half of the test.

For example, consider a plot of test items A, B, C, D, E & F.

Test items A & B lie close together and are therefore paired.
Likewise, C is paired with D, and E with F.
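
The steps of the Matched Random Subsets method can be sketched in Python. The slides do not say how "close together" items are paired, so the sketch assumes a greedy rule (repeatedly pair the two closest unpaired items by straight-line distance on the plot). All function names and the statistics for items A to F are hypothetical.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; with a 0/1 item score this is the point-biserial."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat a constant score as uncorrelated
    return sxy / (sxx * syy) ** 0.5


def item_statistics(responses):
    """Step 1: (difficulty, point-biserial with total score) for each item."""
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        stats.append((sum(item) / len(item), pearson(item, totals)))
    return stats


def matched_random_halves(stats, rng):
    """Step 2: greedily pair the closest items, then split each pair at random."""
    def distance(pair):
        (p1, r1), (p2, r2) = stats[pair[0]], stats[pair[1]]
        return ((p1 - p2) ** 2 + (r1 - r2) ** 2) ** 0.5

    unpaired = set(range(len(stats)))
    pairs = []
    for i, j in sorted(combinations(range(len(stats)), 2), key=distance):
        if i in unpaired and j in unpaired:
            unpaired -= {i, j}
            pairs.append((i, j))
    half_a, half_b = [], []
    for pair in pairs:
        first, second = rng.sample(pair, 2)   # random order within the pair
        half_a.append(first)
        half_b.append(second)
    return pairs, sorted(half_a), sorted(half_b)


# Hypothetical (difficulty, point-biserial) values for items A to F.
names = "ABCDEF"
stats = [(0.80, 0.30), (0.78, 0.32), (0.50, 0.45),
         (0.52, 0.47), (0.30, 0.60), (0.28, 0.58)]
pairs, half_a, half_b = matched_random_halves(stats, random.Random(0))
print([names[i] + names[j] for i, j in pairs])
print([names[i] for i in half_a], [names[i] for i in half_b])
```

With these made-up coordinates the pairs are A with B, C with D, and E with F, and each half receives exactly one item from every pair. In practice `item_statistics` would supply the coordinates from an examinee-by-item response matrix.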

Internal-Consistency Reliability – The General Case

In our previous examples, we divided a given test into two equal halves.

Here, we consider dividing a given test into more than two equal components.

Even in these cases, we can apply the basic principles of each of the methods for dividing
a test.
• For example, the odd/even method can be modified to divide a nine-item test into
thirds by taking every third item in sequence to form a given component.

• The Matched Random Subsets method would involve forming triplets, rather than
pairs, and then randomly assigning one item from each triplet to each component.


Let us assume that a given test is divided into N components.

The variances of the scores on each component and the variance of the total test scores are
used to estimate the reliability of the test.

If the components are essentially τ-equivalent, then formulas presented herein will
provide good estimates of the test’s reliability.

If, however, the components are not essentially τ-equivalent, then the formulas
presented herein will underestimate (i.e., provide a lower bound for) the test’s reliability.
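
For N components, coefficient α is commonly written as α = [N / (N − 1)] · [1 − (Σ σ²_Yi) / σ²_X], where σ²_Yi is the variance of scores on component i and σ²_X is the variance of total scores. The Python sketch below assumes that form; the function name and the component scores are hypothetical.

```python
from statistics import pvariance


def coefficient_alpha(component_scores):
    """Coefficient alpha for a test divided into N components.

    component_scores is a list of N lists; list i holds every examinee's
    score on component i, in the same examinee order.
    """
    n = len(component_scores)
    totals = [sum(scores) for scores in zip(*component_scores)]
    sum_component_var = sum(pvariance(c) for c in component_scores)
    return (n / (n - 1)) * (1.0 - sum_component_var / pvariance(totals))


# Hypothetical scores of five examinees on three components.
components = [
    [3, 5, 2, 4, 1],
    [2, 5, 3, 4, 1],
    [3, 4, 2, 5, 1],
]
alpha_three = coefficient_alpha(components)
alpha_two = coefficient_alpha(components[:2])   # reduces to the two-half case
print(round(alpha_three, 3), round(alpha_two, 3))
```

With N = 2 the expression reduces to the two-half form of α, which is why the same function can be reused for split halves.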

Furthermore, it is important that any test divided into components measure only a single
trait (i.e., be homogeneous in content).
• Intelligence tests are a classic example of a heterogeneous test, because they measure
a broad spectrum of traits.


The Spearman-Brown Formula: The General Case

If the component tests are not parallel, then the Spearman-
Brown formula will either over- or underestimate the
reliability of a longer test.

An example scenario of overestimation:
• Suppose one has a ten-item test with a reliability of 0.60.
• The Spearman-Brown formula predicts that adding a
parallel ten-item test will raise the total reliability to
0.75.
• But suppose the test that is added is a faulty test that has
no variance.
• Effectively, we’ve only added a constant to every
examinee’s score, which does not contribute to the test’s
reliability.
• In this case, the total test reliability would still be 0.60.


An example scenario of underestimation:
• Suppose a ten-item test has a reliability of 0.00.
• The Spearman-Brown formula predicts that doubling
the test length with a parallel component would produce a
reliability of 0.00.
• However, if a non-parallel test with a reliability of, say,
0.70 is added instead, then the resultant reliability of the
lengthened test will be greater than 0.00.
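
The predictions quoted in the two scenarios above follow from the general Spearman-Brown formula, commonly written as ρ_new = Nρ / (1 + (N − 1)ρ), where ρ is the reliability of the current test and N is the factor by which its length is multiplied using parallel components. A minimal Python sketch, with an illustrative function name:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability after multiplying test length by length_factor,
    assuming every added component is parallel to the original test."""
    return (length_factor * reliability
            / (1.0 + (length_factor - 1.0) * reliability))


# Doubling a ten-item test (length_factor = 2) in the two scenarios.
predicted_from_060 = spearman_brown(0.60, 2)
predicted_from_000 = spearman_brown(0.00, 2)
print(round(predicted_from_060, 2), round(predicted_from_000, 2))
```

The formula only describes what happens when the added material is parallel. It cannot account for the faulty zero-variance component or the more reliable non-parallel component described in the scenarios.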

Comparison of Methods of Estimating Reliabilities

So far, we have learned several different ways to estimate the reliability of a
given test.

Here is a summary of the basic principles to apply when deciding which
method is appropriate for estimating the reliability of one’s test:
1. When using Test/Retest methods, one should use Parallel- or Alternate-
Forms reliability estimates because most internal-consistency measures
would be inaccurate.

2. Use of Cronbach’s α or the Kuder-Richardson methods produces a lower
bound for a given test’s reliability.
• If the components happen to be essentially τ-equivalent, then the estimated
reliability is the test’s reliability.
• But these methods should only be used with homogeneous tests.

3. When using the Split-Half method, the Spearman-Brown formula can over-
or underestimate a test’s reliability if the components are not parallel.
• When the components are parallel, then the estimate provided is very
good for judging the effects of changing test length.

Standard Errors of Measurement & Confidence Intervals for True Scores

Consider an approximately normal distribution of observed
scores obtained from many independent testings of a single
examinee.

The scores vary, but they tend to cluster around the
examinee’s true score.


The confidence intervals for true scores can be interpreted in
either of two ways:
1. The interval can be expected to contain a given
examinee’s true score a specified percentage of the time
when the interval is constructed using observed scores
that are the result of repeated independent testings of the
examinee using the same test (or parallel tests).

2. The intervals can be expected to cover a specified
percentage of the examinees’ true scores when many
examinees are tested once with the same test (or parallel
tests) and a confidence interval is calculated for each
examinee.


Tests with a high degree of measurement error will produce
confidence intervals that are necessarily wider.

Less reliable tests tend to have a high degree of
measurement error.

Therefore, wide confidence intervals are an indication that
the observed scores are not very good estimates of true
scores.

If a test has good reliability, then the confidence intervals will
be narrow, indicating good estimates of true scores.
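
The link between reliability and interval width can be made concrete with the usual classical-test-theory expressions, which are assumed here: the standard error of measurement is SEM = σ_X · √(1 − ρ_XX'), and an interval around an observed score X is X ± z · SEM. The Python sketch below uses hypothetical values (an observed score of 100 and a score standard deviation of 10); the function names are illustrative.

```python
from statistics import NormalDist


def standard_error_of_measurement(sd_observed, reliability):
    """SEM from the observed-score standard deviation and the reliability."""
    return sd_observed * (1.0 - reliability) ** 0.5


def true_score_interval(observed, sd_observed, reliability, confidence=0.95):
    """Symmetric interval around an observed score, assuming normal errors."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # e.g. about 1.96 for 95%
    sem = standard_error_of_measurement(sd_observed, reliability)
    return observed - z * sem, observed + z * sem


# Same observed score and score spread, two different reliabilities.
sem_high = standard_error_of_measurement(10.0, 0.91)
sem_low = standard_error_of_measurement(10.0, 0.64)
interval_high = true_score_interval(100.0, 10.0, 0.91)
interval_low = true_score_interval(100.0, 10.0, 0.64)
print(round(sem_high, 2), [round(v, 2) for v in interval_high])
print(round(sem_low, 2), [round(v, 2) for v in interval_low])
```

In this example the less reliable test has twice the standard error of measurement, so its interval is twice as wide.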
