Item Analysis
Table of Contents
Major Uses of Item Analysis
Item Analysis Reports
Item Analysis Response Patterns
Basic Item Analysis Statistics
Interpretation of Basic Statistics
Other Item Statistics
Summary Data
Report Options
Item Analysis Guidelines

Major Uses of Item Analysis
Item analysis can be a powerful technique available to instructors for the guidance and
improvement of instruction. For this to be so, the items to be analyzed must be valid measures of
instructional objectives. Further, the items must be diagnostic, that is, knowledge of which
incorrect options students select must be a clue to the nature of the misunderstanding, and thus
prescriptive of appropriate remediation.
In addition, instructors who construct their own examinations may greatly improve the
effectiveness of test items and the validity of test scores if they select and rewrite their items on
the basis of item performance data. Such data is available to instructors who have their
examination answer sheets scored at the Computer Laboratory Scoring Office.
[ Top ]

Item Analysis Reports
As the answer sheets are scored, records are written which contain each student's score and his
or her response to each item on the test. These records are then processed and an item analysis
report file is generated. An instructor may obtain test score distributions and a list of students'
scores, in alphabetic order, in student number order, in percentile rank order, and/or in order of
percentage of total points. Instructors are sent their item analysis reports as e-mail
attachments. The item analysis report is contained in the file IRPT####.RPT, where the four
digits indicate the instructor's GRADER III file. A sample of an individual long-form item
analysis listing is shown below.

Item 10 of 125. The correct option is 5.

                       Item Response Pattern
               1      2      3      4      5   Omit  Error  Total
Upper 27%      2      8      0      1     19     0      0     30
              7%    27%     0%     3%    63%    0%     0%   100%
Middle 46%     3     20      3      3     23     0      0     52
              6%    38%     6%     6%    44%    0%     0%   100%
Lower 27%      6      5      8      2      9     0      0     30
             20%    17%    27%     7%    30%    0%     0%   101%
Total         11     33     11      6     51     0      0    112
             10%    29%    11%     5%    46%    0%     0%   100%
[ Top ]

Item Analysis Response Patterns
Each item is identified by number and the correct option is indicated. The group of students
taking the test is divided into upper, middle and lower groups on the basis of students' scores on
the test. This division is essential if information is to be provided concerning the operation of
distracters (incorrect options) and to compute an easily interpretable index of discrimination. It
has long been accepted that optimal item discrimination is obtained when the upper and lower
groups each contain twenty-seven percent of the total group.
The number of students who selected each option or omitted the item is shown for each of the
upper, middle, lower and total groups. The number of students who marked more than one
option to the item is indicated under the "error" heading. The percentage of each group who
selected each of the options, omitted the item, or erred, is also listed. Note that the total
percentage for each group may be other than 100%, since the percentages are rounded to the
nearest whole number before totaling.
The sample item listed above appears to be performing well. About two-thirds of the upper
group but only one-third of the lower group answered the item correctly. Ideally, the students
who answered the item incorrectly should select each incorrect response in roughly equal
proportions, rather than concentrating on a single incorrect option. Option two seems to be the
most attractive incorrect option, especially to the upper and middle groups. It is most
undesirable for a greater proportion of the upper group than of the lower group to select an
incorrect option. The item writer should examine such an option for possible ambiguity. For the
sample item shown above, option four was selected by only five percent of the total
group. An attempt might be made to make this option more attractive.
Item analysis provides the item writer with a record of student reaction to items. It gives us little
information about the appropriateness of an item for a course of instruction. The appropriateness
or content validity of an item must be determined by comparing the content of the item with the
instructional objectives.
[ Top ]

Basic Item Analysis Statistics
A number of item statistics are reported which aid in evaluating the effectiveness of an item.
The first of these is the index of difficulty which is the proportion of the total group who got the
item wrong. Thus a high index indicates a difficult item and a low index indicates an easy item.
Some item analysts prefer an index of difficulty which is the proportion of the total group who
got an item right. This index may be obtained by marking the PROPORTION RIGHT option on
the item analysis header sheet. Whichever index is selected is shown as the INDEX OF
DIFFICULTY on the item analysis print-out. For classroom achievement tests, most test
constructors desire items with indices of difficulty no lower than 20 nor higher than 80, with an
average index of difficulty from 30 or 40 to a maximum of 60.
The INDEX OF DISCRIMINATION is the difference between the proportion of the upper
group who got an item right and the proportion of the lower group who got the item right. This
index is dependent upon the difficulty of an item. It may reach a maximum value of 100 for an
item with an index of difficulty of 50, that is, when 100% of the upper group and none of the
lower group answer the item correctly. For items of less than or greater than 50 difficulty, the
index of discrimination has a maximum value of less than 100. The Interpreting the Index of
Discrimination document contains a more detailed discussion of the index of discrimination.
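To make these two indices concrete, here is a minimal Python sketch (an illustration, not the scoring office's actual program) that computes a proportion-wrong difficulty index and the upper-lower index of discrimination for one item. The 27% group split and the 0/1 response coding are assumptions made for the example; the printed report expresses both indices on a 0-100 scale rather than as proportions.

```python
def item_statistics(scores, item_correct, group_fraction=0.27):
    """Difficulty (proportion wrong) and upper-lower discrimination for one item.

    scores         -- list of total test scores, one per student
    item_correct   -- parallel list of 1 (right) / 0 (wrong) for this item
    group_fraction -- share of students in each of the upper and lower groups
    """
    n = len(scores)
    k = max(1, int(round(n * group_fraction)))           # size of each extreme group

    # Rank students by total score, highest first.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    upper, lower = order[:k], order[-k:]

    p_right = sum(item_correct) / n
    difficulty = 1 - p_right                              # proportion-wrong form described above
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    discrimination = p_upper - p_lower                    # upper minus lower proportions
    return difficulty, discrimination

# Invented example: 10 students' total scores and 0/1 results on one item.
scores = [95, 90, 88, 80, 75, 70, 65, 60, 50, 40]
item   = [ 1,  1,  1,  1,  0,  1,  0,  0,  0,  0]
print(item_statistics(scores, item))   # (0.5, 1.0); multiply by 100 for the report's scale
```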
[ Top ]
Interpretation of Basic Statistics
To aid in interpreting the index of discrimination, the maximum discrimination value and the
discriminating efficiency are given for each item. The maximum discrimination is the highest
possible index of discrimination for an item at a given level of difficulty. For example, an item
answered correctly by 60% of the group would have an index of difficulty of 40 and a maximum
discrimination of 80. This would occur when 100% of the upper group and 20% of the lower
group answered the item correctly. The discriminating efficiency is the index of discrimination
divided by the maximum discrimination. For example, an item with an index of discrimination
of 40 and a maximum discrimination of 50 would have a discriminating efficiency of 80. This
may be interpreted to mean that the item is discriminating at 80% of the potential of an item of
its difficulty. For a more detailed discussion of the maximum discrimination and discriminating
efficiency concepts, see the Interpreting the Index of Discrimination document.
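The worked numbers above are consistent with the standard relationship max D = 2 × min(P, 1 − P), where P is the proportion of the whole group answering correctly. The sketch below (an assumption about how the values could be computed, not the report's own code) applies that relationship and derives the discriminating efficiency.

```python
def max_discrimination(pct_correct):
    """Highest possible discrimination index (0-100 scale) for an item
    answered correctly by pct_correct percent of the whole group."""
    p = pct_correct / 100.0
    return 200.0 * min(p, 1.0 - p)

def discriminating_efficiency(discrimination, pct_correct):
    """Observed discrimination as a percentage of the maximum possible
    for an item of this difficulty."""
    max_d = max_discrimination(pct_correct)
    return 100.0 * discrimination / max_d if max_d else 0.0

# Examples from the text: 60% correct -> maximum discrimination 80.
print(max_discrimination(60))             # 80.0
# An observed discrimination of 40 against a maximum of 50 gives efficiency 80;
# 75% correct is an assumed difficulty that yields a maximum of 50.
print(discriminating_efficiency(40, 75))  # 80.0
```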
[ Top ]

Other Item Statistics
Some test analysts may desire more complex item statistics. Two correlations which are
commonly used as indicators of item discrimination are shown on the item analysis report. The
first is the biserial correlation, which is the correlation between a student's performance on an
item (right or wrong) and his or her total score on the test. This correlation assumes that the
distribution of test scores is normal and that there is a normal distribution underlying the
right/wrong dichotomy. The biserial correlation has the characteristic, disconcerting to some, of
having maximum values greater than unity. There is no exact test for the statistical significance
of the biserial correlation coefficient.
The point biserial correlation is also a correlation between student performance on an item (right
or wrong) and test score. It assumes that the test score distribution is normal and that the
division on item performance is a natural dichotomy. The possible range of values for the point
biserial correlation is +1 to -1. The Student's t test for the statistical significance of the point
biserial correlation is given on the item analysis report. Enter a table of Student's t values with N
- 2 degrees of freedom at the desired percentile point; N, in this case, is the total number of
students appearing in the item analysis.
The mean scores for students who got an item right and for those who got it wrong are also
shown. These values are used in computing the biserial and point biserial coefficients of
correlation and are not generally used as item analysis statistics.
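As a rough illustration of these quantities, the sketch below computes the point-biserial correlation, its t statistic on N − 2 degrees of freedom, and the biserial correlation for a single item. It implements the standard textbook formulas rather than the report program itself, and the sample data are invented.

```python
import math
from statistics import NormalDist

def point_biserial(item_correct, total_scores):
    """Point-biserial correlation between a 1/0 item and total score,
    plus the Student's t statistic with N - 2 degrees of freedom."""
    n = len(total_scores)
    p = sum(item_correct) / n                       # proportion answering correctly
    q = 1.0 - p
    mean_all = sum(total_scores) / n
    mean_right = sum(s for s, c in zip(total_scores, item_correct) if c) / sum(item_correct)
    sd = math.sqrt(sum((s - mean_all) ** 2 for s in total_scores) / n)

    r_pb = (mean_right - mean_all) / sd * math.sqrt(p / q)
    t = r_pb * math.sqrt((n - 2) / (1.0 - r_pb ** 2))
    return r_pb, t

def biserial(item_correct, total_scores):
    """Biserial correlation: the point-biserial rescaled by the normal ordinate at
    the p/q split, reflecting the assumed underlying normal distribution."""
    n = len(total_scores)
    p = sum(item_correct) / n
    r_pb, _ = point_biserial(item_correct, total_scores)
    y = NormalDist().pdf(NormalDist().inv_cdf(p))   # ordinate at the cut point
    return r_pb * math.sqrt(p * (1.0 - p)) / y

# Invented data: total scores and right/wrong results on one item.
scores = [95, 90, 88, 80, 75, 70, 65, 60, 50, 40]
item   = [ 1,  1,  1,  1,  0,  1,  0,  0,  0,  0]
print(point_biserial(item, scores))
print(biserial(item, scores))
```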
Generally, item statistics will be somewhat unstable for small groups of students. Perhaps fifty
students might be considered a minimum number if item statistics are to be stable. Note that for
a group of fifty students, the upper and lower groups would contain only thirteen students each.
The stability of item analysis results will improve as the group of students is increased to one
hundred or more. An item analysis for very small groups must not be considered a stable
indication of the performance of a set of items.
[ Top ]

Summary Data
The item analysis data are summarized on the last page of the item analysis report. The
distribution of item difficulty indices is a tabulation showing the number and percentage of
items whose difficulties are in each of ten categories, ranging from a very easy category (00-10)
to a very difficult category (91-100). The distribution of discrimination indices is tabulated in
the same manner, except that a category is included for negatively discriminating items.
The mean item difficulty is determined by adding all of the item difficulty indices and dividing
the total by the number of items. The mean item discrimination is determined in a similar
manner.
Test reliability, estimated by the Kuder-Richardson formula number 20, is given. If the test is
speeded, that is, if some of the students did not have time to consider each test item, the
reliability estimate may be spuriously high.
The final test statistic is the standard error of measurement. This statistic is a common device for
interpreting the absolute accuracy of the test scores. The size of the standard error of
measurement depends on the standard deviation of the test scores as well as on the estimated
reliability of the test.
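For readers who want to reproduce these two summary statistics by hand, the sketch below computes KR-20 and the standard error of measurement (using the usual relation SEM = SD × sqrt(1 − reliability)) from a small invented matrix of right/wrong scores. It is only a sketch of the standard formulas, not the scoring office's implementation.

```python
import math

def kr20(item_matrix):
    """Kuder-Richardson formula 20 reliability for dichotomously scored items.
    item_matrix is a list of student rows, each a list of 1/0 item scores."""
    n_students = len(item_matrix)
    k = len(item_matrix[0])                               # number of items
    totals = [sum(row) for row in item_matrix]
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students

    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n_students
        sum_pq += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - sum_pq / var_total)

def standard_error_of_measurement(sd_scores, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd_scores * math.sqrt(1.0 - reliability)

# Invented example: 4 students x 3 items (1 = right, 0 = wrong).
responses = [[1, 1, 1],
             [1, 1, 0],
             [1, 0, 0],
             [0, 0, 0]]
rel = kr20(responses)
totals = [sum(r) for r in responses]
mean = sum(totals) / len(totals)
sd = math.sqrt(sum((t - mean) ** 2 for t in totals) / len(totals))
print(rel, standard_error_of_measurement(sd, rel))
```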
Occasionally, a test writer may wish to omit certain items from the analysis although these items
were included in the test as it was administered. Such items may be omitted by leaving them
blank on the test key. The response patterns for omitted items will be shown but the keyed
options will be listed as OMIT. The statistics for these items will be omitted from the Summary
Data.
[ Top ]

Report Options
A number of report options are available for item analysis data. The long-form item analysis
report contains three items per page. A standard-form item analysis report is available in which
the data on each item are summarized on one line. A sample report is shown below.

ITEM ANALYSIS   Test 4482   125 Items   112 Students
Percentages: Upper 27% - Middle - Lower 27%

Item  Key      1          2          3          4         5      Omit   Error  Diff  Disc
  1    4    7-23-57    0- 4- 7   28- 8-36   64-62- 0   0- 0- 0   0-0-0  0-0-0   54    64
  2    2    7-12- 7   64-42-29   14- 4-21   14-42-36   0- 0- 0   0-0-0  0-0-0   56    35

The standard form shows the item number, key (number of the correct option), the percentage of
the upper, middle, and lower groups who selected each option, omitted the item or erred, the
index of difficulty, and the index of discrimination. For example, in item 1 above, option 4 was
the correct answer and it was selected by 64% of the upper group, 62% of the middle group and
0% of the lower group. The index of difficulty, based on the total group, was 54 and the index of
discrimination was 64.
[ Top ]

Item Analysis Guidelines
Item analysis is a completely futile process unless the results help instructors improve their
classroom practices and item writers improve their tests. Let us suggest a number of points of
departure in the application of item analysis data.
1. Item analysis gives necessary but not sufficient information concerning the
appropriateness of an item as a measure of intended outcomes of instruction. An item
may perform beautifully with respect to item analysis statistics and yet be quite
irrelevant to the instruction whose results it was intended to measure. A most common
error is to teach for behavioral objectives such as analysis of data or situations, ability to
discover trends, ability to infer meaning, etc., and then to construct an objective test
measuring mainly recognition of facts. Clearly, the objectives of instruction must be kept
in mind when selecting test items.
2. An item must be of appropriate difficulty for the students to whom it is administered. If
possible, items should have indices of difficulty no less than 20 and no greater than 80. It
is desirable to have most items in the 30 to 50 range of difficulty. Very hard or very easy
items contribute little to the discriminating power of a test.
3. An item should discriminate between upper and lower groups. These groups are usually
based on total test score, but they could be based on some other criterion such as grade-point
average, scores on other tests, etc. Sometimes an item will discriminate negatively, that is,
a larger proportion of the lower group than of the upper group selected the correct
option. This often means that the students in the upper group were misled by an
ambiguity that the students in the lower group, and the item writer, failed to discover.
Such an item should be revised or discarded.
4. All of the incorrect options, or distracters, should actually be distracting. Preferably,
each distracter should be selected by a greater proportion of the lower group than of the
upper group. If, in a five-option multiple-choice item, only one distracter is effective, the
item is, for all practical purposes, a two-option item. Existence of five options does not
automatically guarantee that the item will operate as a five-choice item.
[ Top ]

Item analysis is a general term that refers to the specific methods used in education to evaluate test
items, typically for the purpose of test construction and revision. Regarded as one of the most
important aspects of test construction and increasingly receiving attention, it is an approach
incorporated into item response theory (IRT), which serves as an alternative to classical
measurement theory (CMT) or classical test theory (CTT). Classical measurement theory considers a
score to be the direct result of a person's true score plus error. It is this error that is of interest as
previous measurement theories have been unable to specify its source. However, item response
theory uses item analysis to differentiate between types of error in order to gain a clearer
understanding of any existing deficiencies. Particular attention is given to individual test items, item
characteristics, probability of answering items correctly, overall ability of the test taker, and degrees
or levels of knowledge being assessed.

THE PURPOSE OF ITEM ANALYSIS
There must be a match between what is taught and what is assessed. However, there must also be an effort to test for more
complex levels of understanding, with care taken to avoid over-sampling items that assess only basic levels of knowledge. Tests
that are too difficult (and have an insufficient floor) tend to lead to frustration and lead to deflated scores, whereas tests that are
too easy (and have an insufficient ceiling) facilitate a decline in motivation and lead to inflated scores. Tests can be improved by
maintaining and developing a pool of valid items from which future tests can be drawn and that cover a reasonable span of
difficulty levels.
Item analysis helps improve test items and identify unfair or biased items. Results should be used to refine test item wording. In
addition, closer examination of items will also reveal which questions were most difficult, perhaps indicating a concept that
needs to be taught more thoroughly. If a particular distracter (that is, an incorrect answer choice) is the most often chosen
answer, and especially if that distracter positively correlates with a high total score, the item must be examined more closely for
correctness. This situation also provides an opportunity to identify and examine common misconceptions among students about
a particular concept.
In general, once test items have been created, the value of these items can be systematically assessed using several methods
representative of item analysis: a) a test item's level of difficulty, b) an item's capacity to discriminate, and c) the item
characteristic curve. Difficulty is assessed by examining the number of persons correctly endorsing the answer. Discrimination
can be examined by comparing the number of persons getting a particular item correct with the total test score. Finally, the item
characteristic curve can be used to plot the likelihood of answering correctly with the level of success on the test.

ITEM DIFFICULTY
In test construction, item difficulty is determined by the number of people who answer a particular test item correctly. For
example, if the first question on a test was answered correctly by 76% of the class, then the difficulty level (p or percentage
passing) for that question is p = .76. If the second question on a test was answered correctly by only 48% of the class, then the
difficulty level for that question is p = .48. The higher the percentage of people who answer correctly, the easier the item, so that
a difficulty level of .48 indicates that question two was more difficult than question one, which had a difficulty level of .76.
Many educators find themselves wondering how difficult a good test item should be. Several things must be taken into
consideration in order to determine appropriate difficulty level. The first task of any test maker should be to determine the
probability of answering an item correctly by chance alone, also referred to as guessing or luck. For example, a true-false item,
because it has only two choices, could be answered correctly by chance half of the time. Therefore, a true-false item with a
demonstrated difficulty level of only p = .50 would not be a good test item because that level of success could be achieved
through guessing alone and would not be an actual indication of knowledge or ability level. Similarly, a multiple-choice item
with five alternatives could be answered correctly by chance 20% of the time. Therefore, an item difficulty greater than .20
would be necessary in order to discriminate between respondents' ability to guess correctly and respondents' level of knowledge.
Desirable difficulty levels usually can be estimated as halfway between 100 percent and the percentage of success expected by
guessing. So, the desirable difficulty level for a true-false item, for example, should be around p = .75, which is halfway between
100% and 50% correct.
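The "halfway between chance and perfect" rule is easy to compute directly; the short sketch below assumes only the number of answer choices and is offered purely as an illustration.

```python
def optimal_difficulty(n_choices):
    """Desirable difficulty (proportion passing) for an item with n_choices options:
    halfway between the chance level and a perfect score."""
    chance = 1.0 / n_choices
    return chance + (1.0 - chance) / 2.0

print(optimal_difficulty(2))   # true-false item              -> 0.75
print(optimal_difficulty(4))   # four-option multiple choice  -> 0.625
print(optimal_difficulty(5))   # five-option multiple choice  -> 0.60
```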
In most instances, it is desirable for a test to contain items of various difficulty levels in order to distinguish between students
who are not prepared at all, students who are fairly prepared, and students who are well prepared. In other words, educators do
not want the same level of success for those students who did not study as for those who studied a fair amount, or for those who
studied a fair amount and those who studied exceptionally hard. Therefore, it is necessary for a test to be composed of items of
varying levels of difficulty. As a general rule for norm-referenced tests, items in the difficulty range of .30 to .70 yield
important differences between individuals' level of knowledge, ability, and preparedness. There are a few exceptions to this,
however, with regard to the purpose of the test and the characteristics of the test takers. For instance, if the test is to help
determine entrance into graduate school, the items should be more difficult to be able to make finer distinctions between test
takers. For a criterion-referenced test, most of the item difficulties should be clustered around the criterion cut-off score or
higher. For example, if a passing score is 70%, the vast majority of items should have percentage passing values of

Figure 1ILLUSTRATION BY GGS INFORMATION SERVICES. CENGAGE LEARNING, GALE.
p = .60 or higher, with a number of items in the p > .90 range to enhance motivation and test for mastery of certain essential
concepts.

DISCRIMINATION INDEX
According to Wilson (2005), item difficulty is the most essential component of item analysis. However, it is not the only way to
evaluate test items. Discrimination goes beyond determining the proportion of people who answer correctly and looks more
specifically at who answers correctly. In other words, item discrimination determines whether those who did well on the entire
test did well on a particular item. An item should in fact be able to discriminate between upper and lower scoring groups.
Membership in these groups is usually determined based on their total test score, and it is expected that those scoring higher on
the overall test will also be more likely to endorse the correct response on a particular item. Sometimes an item will
discriminate negatively, that is, a larger proportion of the lower group select the correct response, as compared to those in the
higher scoring group. Such an item should be revised or discarded.
One way to determine an item's power to discriminate is to compare those who have done very well with those who have done
very poorly, known as the extreme group method. First, identify the students who scored in the top one-third as well as those in
the bottom one-third of the class. Next, calculate the proportion of each group that answered a particular test item correctly (i.e.,
percentage passing for the high and low groups on each item). Finally, subtract the p of the bottom performing group from the p
for the top performing group to yield an item discrimination index (D). Item discriminations of D = .50 or higher are considered
excellent. D = 0 means the item has no discrimination ability, while D = 1.00 means the item has perfect discrimination ability.
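A small sketch of the extreme group method just described follows; top and bottom thirds are used here as in the text, but any fraction can be substituted, and the data are invented for illustration.

```python
def discrimination_index(scores, item_correct, fraction=1/3):
    """Extreme-group discrimination index D: percentage passing in the top
    group minus percentage passing in the bottom group."""
    n = len(scores)
    k = max(1, int(n * fraction))                          # size of each extreme group
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top, bottom = ranked[:k], ranked[-k:]
    p_top = sum(item_correct[i] for i in top) / k
    p_bottom = sum(item_correct[i] for i in bottom) / k
    return p_top - p_bottom

# Invented class of 9 students: total scores and 1/0 results on one item.
scores = [98, 95, 91, 80, 78, 75, 60, 55, 50]
item   = [ 1,  1,  1,  1,  0,  1,  0,  1,  0]
print(round(discrimination_index(scores, item), 2))        # 1.00 - 0.33 -> 0.67
```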
In Figure 1, it can be seen that Item 1 discriminates well, with those in the top performing group obtaining the correct response
far more often (p = .92) than those in the low performing group (p = .40), thus resulting in an index of .52 (i.e., .92 - .40 = .52).
Next, Item 2 is not difficult enough, with a discriminability index of only .04, meaning this particular item was not useful in
discriminating between the high and low scoring individuals. Finally, Item 3 is in need of revision or discarding as it
discriminates negatively, meaning low performing group members actually obtained the correct keyed answer more often than
high performing group members.

[Figure 2. Illustration by GGS Information Services. Cengage Learning, Gale.]
Another way to determine the discriminability of an item is to determine the correlation coefficient between performance on an
item and performance on a test, or the tendency of students selecting the correct answer to have high overall scores. This
coefficient is reported as the item discrimination coefficient, or the point-biserial correlation between item score (usually scored
right or wrong) and total test score. This coefficient should be positive, indicating that students answering correctly tend to have
higher overall scores or that students answering incorrectly tend to have lower overall scores. Also, the higher the magnitude,
the better the item discriminates. The point-biserial correlation can be computed with procedures outlined in Figure 2.
In Figure 2, the point-biserial correlation between item score and total score is evaluated similarly to the extreme group
discrimination index. If the resulting value is negative or low, the item should be revised or discarded. The closer the value is to
1.0, the stronger the item's discrimination power; the closer the value is to 0, the weaker the power. Items that are very easy
and answered correctly by the majority of respondents will have poor point-biserial correlations.

[Figure 3. Illustration by GGS Information Services. Cengage Learning, Gale.]

CHARACTERISTIC CURVE
A third parameter used to conduct item analysis is known as the item characteristic curve (ICC). This is a graphical or pictorial
depiction of the characteristics of a particular item, or taken collectively, can be representative of the entire test. In the item
characteristic curve the total test score is represented on the horizontal axis and the proportion of test takers passing the item
within that range of test scores is scaled along the vertical axis.
For Figure 3, three separate item characteristic curves are shown. Line A is considered a flat curve and indicates that test takers
at all score levels were equally likely to get the item correct. This item was therefore not a useful discriminating item. Line B
demonstrates a troublesome item as it gradually rises and then drops for those scoring highest on the overall test. Though this is
unusual, it can sometimes result from those who studied most having ruled out the answer that was keyed as correct. Finally,
Line C shows the item characteristic curve for a good test item. The gradual and consistent positive slope shows that the
proportion of people passing the item gradually increases as test scores increase. Though it is not depicted here, if an ICC was
seen in the shape of a backward S, negative item discrimination would be evident, meaning that those who scored lowest were
most likely to endorse a correct response on the item.
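An empirical ICC can be tabulated by banding the total scores and computing the proportion passing the item within each band, which is what the hypothetical helper below does; the band count and data are illustrative assumptions, not taken from the source.

```python
from collections import defaultdict

def empirical_icc(total_scores, item_correct, n_bands=5):
    """Proportion of examinees passing the item within each band of total test
    scores (low to high).  Plotting these proportions against the band midpoints
    gives curves like those described for Figure 3."""
    lo, hi = min(total_scores), max(total_scores)
    width = (hi - lo) / n_bands or 1
    passed, counted = defaultdict(int), defaultdict(int)
    for score, correct in zip(total_scores, item_correct):
        band = min(int((score - lo) / width), n_bands - 1)
        counted[band] += 1
        passed[band] += correct
    return [passed[b] / counted[b] if counted[b] else None for b in range(n_bands)]

# Invented data: total scores and 1/0 results on one item.
scores = [42, 45, 50, 55, 58, 63, 67, 72, 80, 88]
item   = [ 0,  0,  0,  1,  0,  1,  1,  1,  1,  1]
print(empirical_icc(scores, item))   # rises with score band: a "Line C" shape
```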

Eight Simple Steps to Item Analysis

1. Score each answer sheet, write score total on the corner
o obviously have to do this anyway
2. Sort the pile into rank order from top to bottom score
(1 minute, 30 seconds tops)
3. If normal class of 30 students, divide class in half
o same number in top and bottom group
o toss middle paper if odd number (put aside)
4. Take 'top' pile, count number of students who responded to each alternative
o fast way is simply to sort piles into "A", "B", "C", "D" // or true/false or type of error you get for short answer, fill-in-the-blank
o OR set up on a spreadsheet if you're familiar with computers

ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS
CLASS SIZE = 30

ITEM     UPPER   LOWER   DIFFERENCE    D    TOTAL   DIFFICULTY
 1. A      0
   *B      4
    C      1
    D      1
    O

*=Keyed Answer
o repeat for lower group

ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS
CLASS SIZE = 30

ITEM     UPPER   LOWER   DIFFERENCE    D    TOTAL   DIFFICULTY
 1. A      0
   *B      4       2
    C      1
    D      1
    O

*=Keyed Answer
o this is the time-consuming part --> but not that bad, can do it while watching TV, because you're just sorting piles

THREE POSSIBLE SHORT CUTS HERE (STEP 4)

(A) If you have a large sample of around 100 or more, you can cut down the sample you work with
o take top 27% (27 out of 100); bottom 27% (so only dealing with 54, not all 100)
o put middle 46 aside for the moment
o larger the sample, more accurate, but have to trade off against labour; using top 1/3 or so is probably good enough by the time you get to 100; 27% is the magic figure statisticians tell us to use
o I'd use halves at 30, but you could just use a sample of top 10 and bottom 10 if you're pressed for time
  - but it means a single student changes stats by 10% -- trading off speed for accuracy...
  - but I'd rather have you doing ten and ten than nothing

(B) Second short cut, if you have access to a photocopier (budgets)
o photocopy answer sheets, cut off identifying info (can't use if handwriting is distinctive)
o colour code high and low groups --> dab of marker pen colour
o distribute randomly to students in your class so they don't know whose answer sheet they have
o get them to raise their hands
  - for #6, how many have "A" on blue sheet?
  - how many have "B"; how many "C"
  - for #6, how many have "A" on red sheet....
o some reservations because they can screw you up if they don't take it seriously
o another version of this would be to hire the kid who cuts your lawn to do the counting, provided you've removed all identifying information
  - I actually did this for a bunch of teachers at one high school in Edmonton when I was in university for pocket money

(C) Third shortcut, IF you can't use a separate answer sheet, sometimes faster to type than to sort

SAMPLE OF TYPING FORMAT FOR ITEM ANALYSIS

ITEM #    1 2 3 4 5 6 7 8 9 10
KEY       T F T F T A D C A B

STUDENT
Kay       T T T F T A D C A D
Jane      T T T F F A D D A C
John      F F T F T A D C A B

o type name; then T or F, or A, B, C, D == all left hand on the typewriter, leaving right hand free to turn pages (from Sax)
o IF you have a computer program -- some kicking around -- it will give you all the stats you need, plus bunches more you don't -- automatically after this stage

OVERHEAD: SAMPLE ITEM ANALYSIS FOR CLASS OF 30 (PAGE #1) (in text)

5. Subtract the number of students in lower group who got question right from number
of high group students who got it right
o quite possible to get a negative number
ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS
CLASS SIZE = 30

ITEM     UPPER   LOWER   DIFFERENCE    D    TOTAL   DIFFICULTY
 1. A      0
   *B      4       2         2
    C      1
    D      1
    O

*=Keyed Answer
6. Divide the difference by number of students in upper or lower group
o in this case, divide by 15
o this gives you the "discrimination index" (D)
ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS
CLASS SIZE = 30

ITEM     UPPER   LOWER   DIFFERENCE     D      TOTAL   DIFFICULTY
 1. A      0
   *B      4       2         2       0.333
    C      1
    D      1
    O

*=Keyed Answer
7. Total number who got it right
ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS
CLASS SIZE = 30

ITEM     UPPER   LOWER   DIFFERENCE     D      TOTAL   DIFFICULTY
 1. A      0
   *B      4       2         2       0.333      6
    C      1
    D      1
    O

*=Keyed Answer
8. If you have a large class and were only using the 1/3 sample for top and bottom groups,
then you have to NOW count number of middle group who got each question right (not
each alternative this time, just right answers)
9. Sample Form Class Size= 100.
o if class of 30, upper and lower half, no other column here
10. Divide total by total number of students
o difficulty = (proportion who got it right (p) )
ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS
CLASS SIZE = 30

ITEM     UPPER   LOWER   DIFFERENCE     D      TOTAL   DIFFICULTY
 1. A      0
   *B      4       2         2       0.333      6         .42
    C      1
    D      1
    O

*=Keyed Answer
11. You will NOTE the complete lack of complicated statistics --> counting, adding, dividing -->
no tricky formulas required for this
o not going to worry about corrected point biserials etc.
o one of the advantages of using a fixed number of alternatives
o (if you do want to automate the tally, a short script such as the sketch below will do it)
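For anyone who would rather hand the counting to a computer (the spreadsheet option mentioned in step 4), the sketch below mirrors steps 3 through 10 for a single item. The class data and the half/half split are invented for illustration; this is not any particular commercial program.

```python
from collections import Counter

def tally_item(answers, scores, key, group_size=None):
    """Reproduce the hand tally for one item: count each alternative in the upper
    and lower halves, then compute difference, D, total and difficulty.

    answers -- each student's letter response to this item (e.g. "A".."D")
    scores  -- each student's total test score
    key     -- the keyed (correct) alternative
    """
    n = len(answers)
    half = group_size or n // 2                        # step 3: split class in half
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    upper = Counter(answers[i] for i in ranked[:half])         # step 4
    lower = Counter(answers[i] for i in ranked[-half:])

    difference = upper[key] - lower[key]               # step 5
    d_index = difference / half                        # step 6
    total_right = sum(1 for a in answers if a == key)  # step 7
    difficulty = total_right / n                       # step 10: proportion right (p)
    return upper, lower, difference, round(d_index, 2), total_right, round(difficulty, 2)

# Invented class of 10 for illustration.
answers = ["B", "B", "A", "B", "C", "B", "D", "A", "C", "A"]
scores  = [ 48,  45,  44,  40,  38,  35,  30,  28,  25,  20]
print(tally_item(answers, scores, key="B"))
```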

Interpreting Item Analysis

Let's look at what we have and see what we can see
90% of item analysis is just common sense...
1. Potential miskey
2. Identifying ambiguous items
3. Equal distribution to all alternatives
4. Alternatives are not working
5. Distracter too attractive
6. Question not discriminating
7. Negative discrimination
8. Too easy
9. Omit
10. & 11. Relationship between D index and difficulty (p)

Item Analysis of Computer Printouts


1. What do we see looking at this first one? [Potential Miskey]

          Upper   Low   Difference    D    Total   Difficulty
 1. *A      1      4       -3       -.2      5        .17
     B      1      3
     C     10      5
     D      3      3
     O
(O means omit or no answer)

o #1, more high group students chose C than A, even though A is supposedly the correct answer
o more low group students chose A than high group, so you got negative discrimination; only about 17% of the class got it right
o most likely you just wrote the wrong answer key down
  --> this is an easy and very common mistake for you to make
  --> better you find out now, before you hand back, than when kids complain
  --> OR WORSE, they don't complain, and teach themselves your miskey as the "correct" answer
o so check it out and rescore that question on all the papers before handing them back
  --> makes it 10-5: Difference = 5; D = .34; Total = 15; difficulty = .50 --> nice item



OR:
o you check and find that you didn't miskey it --> that is the answer you thought
o two possibilities:
  1. one possibility is that you made a slip of the tongue and taught them the wrong answer
     - anything you say in class can be taken down and used against you on an examination....
  2. more likely it means even "good" students are being tricked by a common misconception -->
     you're not supposed to have trick questions, so you may want to dump it
     --> give those who got it right their point, but total the rest of the marks out of 24 instead of 25
o if scores are high, or you want to make a point, you might let it stand, and then teach to it -->
  sometimes if they get caught, it will help them to remember better in future, such as:
     - very fine distinctions
     - crucial steps which are often overlooked
o REVISE it for next time to weaken "B"
  -- alternatives are not supposed to draw more than the keyed answer
  -- almost always an item flaw, rather than a useful distinction

What can we see with #2: [Can identify ambiguous items]

          Upper   Low   Difference    D    Total   Difficulty
 2.  A      6      5
     B      1      2
    *C      7      5        2       .13     12        .40
     D      1      3
     O

o #2, about equal numbers of top students went for A and D
o suggests they couldn't tell which was correct
  - either students didn't know this material (in which case you can reteach it)
  - or the item was defective --->
o look at their favourite alternative again, and see if you can find any reason they could be choosing it
  - often items that look perfectly straightforward to adults are ambiguous to students
  - favourite examples of ambiguous items
o if you NOW realize that D was a defensible answer, rescore before you hand it back to give everyone credit for either A or D -- avoids arguing with you in class
o if it's clearly a wrong answer, then you now know which error most of your students are making to get the wrong answer
  - useful diagnostic information on their learning, your teaching

Equal distribution to all alternatives

          Upper   Low   Difference    D    Total   Difficulty
 3.  A      4      3
     B      3      4
    *C      5      4        1       .06      9        .30
     D      3      4
     O

o item #3, students respond about equally to all alternatives
o usually means they are guessing
o three possibilities:
  1. may be material you didn't actually get to yet
     - you designed the test in advance (because I've convinced you to plan ahead) but didn't actually get everything covered before the holidays....
     - or an item on a common exam that you didn't stress in your class
  2. item so badly written students have no idea what you're asking
  3. item so difficult students are just completely baffled
o review the item:
  - if badly written (by another teacher) or on material your class hasn't taken, toss it out and rescore the exam out of the lower total
    - BUT give credit to those that got it, to a total of 100%
  - if it seems well written, but too hard, then you know to (re)teach this material for the rest of the class....
  - maybe the 3 who got it are the top three students
    - tough but valid item: OK, if the item tests a valid objective
    - want to provide an occasional challenging question for top students
    - but make sure you haven't defined "top 3 students" as "those able to figure out what the heck I'm talking about"

Alternatives aren't working

          Upper   Low   Difference    D    Total   Difficulty
 4.  A      1      5
    *B     14      7        7       .47     21        .77
     C      0      2
     D      0      0
     O

o example #4 --> no one fell for D --> so it is not a plausible alternative
o question is fine for this administration, but revise the item for next time
  - toss alternative D, replace it with something more realistic
o each distracter has to attract at least 5% of the students
  - class of 30, should get at least two students
  - or you might accept one if you positively can't think of another fourth alternative -- otherwise, do not reuse the item
o if two alternatives don't draw any students --> might consider redoing as true/false

Distracter too attractive

          Upper   Low   Difference    D    Total   Difficulty
 5.  A      7     10
     B      1      2
     C      1      1
    *D      5      2        3       .20      7        .23
     O

sample #5 --> too many going for A
--> no ONE distracter should get more than key
--> no one distracter should pull more than about half of students
-- doesn't leave enough for correct answer and five percent for each alternative
keep for this time
weaken it for next time

Question not discriminating

          Upper   Low   Difference    D    Total   Difficulty
 6. *A      7      7        0       .00     14        .47
     B      3      2
     C      2      1
     D      3      5
     O

o sample #6: the low group gets it as often as the high group
o on norm-referenced tests, the point is to rank students from best to worst
  - so individual test items should have good students get the question right, poor students get it wrong
  - the test overall decides who is a good or poor student on this particular topic
  - those who do well have more information and skills than those who do less well
  - so if on a particular question those with more skills and knowledge do NOT do better, something may be wrong with the question
o the question may be VALID, but off topic
  - e.g., the rest of the test tests thinking skill, but this is a memorization question, so skilled and unskilled are equally likely to recall the answer
  - should have a homogeneous test --> don't have a math item in with social studies
  - if you wanted to get really fancy, you should do a separate item analysis for each cell of your blueprint... as long as you had six items per cell
o the question may be VALID, on topic, but not RELIABLE
  - it addresses the specified objective, but isn't a useful measure of individual differences
  - asking Grade 10s the Capital of Canada is on topic, but since they will all get it right, it won't show individual differences -- gives you a low D

Negative Discrimination

          Upper   Low   Difference    D     Total   Difficulty
 7. *A      7     10       -3      -.20      17        .57
     B      3      3
     C      2      1
     D      3      1
     O

o the D (discrimination) index is just upper group minus lower group
o varies from +1.0 to -1.0
  - if all of the top got it right and all of the lower got it wrong = 100% = +1
  - if more of the bottom group get it right than the top group, you get a negative D index
o if you have a negative D, it means that students with less skill and knowledge overall are getting it right more often than those who the test says are better overall
  - in other words, the better you are, the more likely you are to get it wrong
o WHAT COULD ACCOUNT FOR THAT? Two possibilities:
  1. usually means an ambiguous question
     - one that is confusing good students, while weak students are too weak to see the problem
     - look at the question again, look at the alternatives good students are going for, to see if you've missed something
  2. OR it might be off topic
     --> something weaker students are better at (like rote memorization) than good students
     --> not part of the same set of skills as the rest of the test --> suggests a design flaw with the table of specifications, perhaps
     (if you end up with a whole bunch of negative D indices on the same test, it must mean you actually have two different distinct skills, because by definition the low group is the high group on that bunch of questions --> end up treating them as two separate tests)
o if you have a large enough sample (like the provincial exams) then we toss the item and either don't count it or give everyone credit for it
o with a sample of 100 students or less, it could just be random chance, so basically ignore it in terms of THIS administration
  - the kids wrote it, give them the mark they got
o furthermore, if you keep dropping questions, you may find that you're starting to develop serious holes in your blueprint coverage -- a problem for sampling
  - but you want to track this stuff FOR NEXT TIME
o if it's negative on administration after administration, consistently, it's likely not random chance; it's screwing up in some way
o want to build your future tests out of those items with high positive D indices
  - the higher the average D indices on the test, the more RELIABLE the test as a whole will be
o revise items to increase D
  --> if good students are selecting one particular wrong alternative, make it less attractive
  --> or increase the probability of their selecting the right answer by making it more attractive
o may have to include some items with negative Ds if those are the only items you have for that specification, and it's an important specification
  - what this means is that there are some skills/knowledge in this unit which are unrelated to the rest of the skills/knowledge --> but may still be important
  - e.g., the statistics part of this course may be terrible for those students who are the best item writers, since writing tends to be associated with the opposite hemisphere of the brain than math, right... but still an important objective in this course
  - may lower the reliability of the test, but increases content validity

Too Easy

          Upper   Low   Difference    D    Total   Difficulty
 8.  A      0      1
    *B     14     13        1       .06     27        .90
     C      0      1
     D      1      1
     O

o too easy or too difficult won't discriminate well either
o difficulty (p, for proportion) varies from +1.0 (everybody got it right) to 0 (nobody)
o REMEMBER: THE HIGHER THE DIFFICULTY INDEX, THE EASIER THE QUESTION
o if the item is NOT miskeyed or some other glaring problem, it's too late to change after it's administered --> everybody got it right, OK, give them the mark
o TOO DIFFICULT = 30 to 35% (used to be the rule in the Branch, now not...)
o if the item is too difficult, don't drop it just because everybody missed it --> you must have thought it was an important objective or it wouldn't have been on there;
  - and unless literally EVERYONE missed it, what do you do with the students who got it right? give them bonus marks? cheat them of a mark they got?
  - furthermore, if you drop too many questions, you lose content validity (specs)
  --> if two or three got it right, it may just be random chance, so why should they get a bonus mark

o however, DO NOT REUSE questions with too high or too low difficulty (p) values in future
  - if difficulty is over 85%, you're wasting space on a limited-item test
  - asking Grade 10s the Capital of Canada is probably a waste of their time and yours --> unless this is a particularly vital objective
  - the same applies to items which are too difficult --> no use asking Grade 3s to solve a quadratic equation
  - but you may want to revise the question to make it easier or harder rather than just toss it out cold
o OR SOME EXCEPTIONS HERE:
  - You may have consciously decided to develop a "Mastery" style test
    --> will often have very easy questions -- expect everyone to get everything; trying to identify only those who are not ready to go on
    --> in which case, don't use any question which does NOT have a difficulty level of at least 85% or whatever
  - Or you may want a test to identify the top people in class, the reach-for-the-top team, and design a whole test of really tough questions
    --> have low difficulty values (i.e., very hard)
o so it depends a bit on what you intend to do with the test in question; this is what makes the difficulty index (proportion) so handy
13. you create a bank of items over the years
  --> using item analysis you get better questions all the time, until you have a whole bunch that work great
  --> you can then tailor-make a test for your class
  - you want to create an easier test this year, you pick questions with higher difficulty (p) values;
  - you want to make a challenging test for your gifted kids, choose items with low difficulty (p) values
  --> for most applications you will want to set the difficulty level so that it gives you average marks, a nice bell curve
  - the government uses 62.5 --> four-option multiple choice, middle of the bell curve
14. start tests with an easy question or two to give students a running start
15. make sure that the difficulty levels are spread out over the examination blueprint
  - not all hard geography questions and easy history
    - unfair to kids who are better at geography, worse at history
    - turns the class off geography if they equate it with tough questions
  --> REMEMBER here that difficulty is different than complexity (Bloom)
    - so you can have a difficult recall-knowledge question and an easy synthesis question
    - synthesis and evaluation items will tend to be harder than recall questions, so if you find the higher levels are more difficult, OK, but try to balance cells as much as possible
    - certainly content cells should be roughly the same

OMIT

          Upper   Low   Difference    D    Total   Difficulty
 9.  A      2      1
     B      3      4
    *C      7      3        4       .26     10        .33
     D      1      1
     O      2      4

If near the end of the test:
  --> they didn't find it because it was on the next page -- format problem
  OR
  --> your test is too long; 6 of them (20%) didn't get to it

OR, if in the middle of the test:
  --> it totally baffled them because:
     - way too difficult for these guys
     - or, since 2 from the high group omitted it too: ambiguous wording

10. & 11. RELATIONSHIP BETWEEN D INDEX AND DIFFICULTY (p)

           Upper   Low   Difference    D    Total   Difficulty
 10.  A      0      5
     *B     15      0       15       1.0     15        .50
      C      0      5
      D      0      5
      O
 --------------------------------------------------------------
 11.  A      3      2
     *B      8      7        1       .06     15        .50
      C      2      3
      D      2      3
      O

o 10 is a perfect item --> each distracter gets at least 5
  - discrimination index is +1.0
  - (ACTUALLY, A PERFECT ITEM WOULD HAVE A DIFFICULTY OF 65% TO ALLOW FOR GUESSING)
o high discrimination (D) indices require optimal levels of difficulty
o but optimal levels of difficulty do not assure high levels of D
o 11 has the same difficulty level, different D
  - on a four-option multiple-choice item, a student answering totally by chance will get 25%

Item analysis
An item analysis involves many statistics that can provide useful information for improving the quality
and accuracy of multiple-choice or true/false items (questions). Some of these statistics are:
Item difficulty: the percentage of students that correctly answered the item.
Also referred to as the p-value.
The range is from 0% to 100%, or more typically written as a proportion of 0.0 to 1.00.
The higher the value, the easier the item.
Calculation: Divide the number of students who got an item correct by the total number of students
who answered it.
Ideal value: Slightly higher than midway between chance (1.00 divided by the number of choices) and
a perfect score (1.00) for the item. For example, on a four-alternative, multiple-choice item, the
random guessing level is 1.00/4 = 0.25; therefore, the optimal difficulty level is .25 + (1.00 - .25) / 2 =
0.62. On a true-false question, the guessing level is (1.00/2 = .50) and, therefore, the optimal difficulty
level is .50+(1.00-.50)/2 = .75
P-values above 0.90 indicate very easy items and should be carefully reviewed based on the instructor's
purpose. For example, if the instructor is using easy "warm-up" questions or aiming for student
mastery, then some items with p-values above .90 may be warranted. In contrast, if an instructor is
mainly interested in differences among students, these items may not be worth testing.
P-values below 0.20 are very difficult items and should be reviewed for possible confusing language,
removed from subsequent exams, and/or identified as an area for re-instruction. If almost all of the
students get the item wrong, there is either a problem with the item or students were not able to learn
the concept. However, if an instructor is trying to determine the top percentage of students that
learned a certain concept, this highly difficult item may be necessary.
Item discrimination: the relationship between how well students did on the item and their total exam
score.
Also referred to as the Point-Biserial correlation (PBS)
The range is from –1.00 to 1.00.
The higher the value, the more discriminating the item. A highly discriminating item indicates that the
students who had high exam scores got the item correct whereas students who had low exam scores
got the item incorrect.
Items with discrimination values near or less than zero should be removed from the exam. This
indicates that students who overall did poorly on the exam did better on that item than students who
overall did well. The item may be confusing for your better scoring students in some way.
Acceptable range: 0.20 or higher
Ideal value: The closer to 1.00 the better
Calculation:

    r_pbs = [(X̄C − X̄T) / S.D. Total] × √(p / q)

where
X̄C = the mean total score for persons who have responded correctly to the item
X̄T = the mean total score for all persons
p = the difficulty value for the item
q = (1 − p)
S.D. Total = the standard deviation of total exam scores
Reliability coefficient: a measure of the amount of measurement error associated with an exam score.
The range is from 0.0 to 1.0.
The higher the value, the more reliable the overall exam score.
Typically, the internal consistency reliability is measured. This indicates how well the items are
correlated with one another.
High reliability indicates that the items are all measuring the same thing, or general construct (e.g.
knowledge of how to calculate integrals for a Calculus course).
With multiple-choice items that are scored correct/incorrect, the Kuder-Richardson formula 20 (KR-20) is often used to
calculate the internal consistency reliability:

    KR-20 = [K / (K − 1)] × [1 − (Σ p·q) / σ²x]

where
K = number of items
p = proportion of persons who responded correctly to an item (i.e., difficulty value)
q = proportion of persons who responded incorrectly to an item (i.e., 1 − p)
σ²x = total score variance
Three ways to improve the reliability of the exam are to 1) increase the number of items in the exam, 2) use items that have
high discrimination values in the exam, or 3) perform an item-total statistics analysis.
Acceptable range: 0.60 or higher
Ideal value: 1.00
Item-total statistics: measure the relationship of individual exam items to the overall exam score.
Currently, the University of Texas does not perform this analysis for faculty. However, one can calculate
these statistics using SPSS or SAS statistical software.
1. Corrected item-total correlation
o This is the correlation between an item and the rest of the exam, without that item considered part of the exam.
o If the correlation is low for an item, this means the item isn't really measuring the same thing the rest of the exam is trying to measure.
2. Squared multiple correlation
o This measures how much of the variability in the responses to this item can be predicted from the other items on the exam.
o If an item does not predict much of the variability, then the item should be considered for deletion.
3. Alpha if item deleted
o The change in Cronbach's alpha if the item is deleted.
o When the alpha value is higher than the current alpha with the item included, one should consider deleting this item to improve the overall reliability of the exam.
EXAMPLE

Item-total statistic table

Summary for scale: Mean = 46.1100   S.D. = 8.26444   Valid n = 100
Cronbach alpha = .794313   Standardized alpha = .800491
Average inter-item correlation = .297818

Variable   Mean if    Var. if    S.D. if    Corrected    Squared      Alpha if
           deleted    deleted    deleted    item-total   multiple     deleted
                                            correlation  correlation
ITEM1      41.61000   51.93790   7.206795   .656298      .507160      .752243
ITEM2      41.37000   53.79310   7.334378   .666111      .533015      .754692
ITEM3      41.41000   54.86190   7.406882   .549226      .363895      .766778
ITEM4      41.63000   56.57310   7.521509   .470852      .305573      .776015
ITEM5      41.52000   64.16961   8.010593   .054609      .057399      .824907
ITEM6      41.56000   62.68640   7.917474   .118561      .045653      .817907
ITEM7      41.46000   54.02840   7.350401   .587637      .443563      .762033
ITEM8      41.33000   53.32110   7.302130   .609204      .446298      .758992
ITEM9      41.44000   55.06640   7.420674   .502529      .328149      .772013
ITEM10     41.66000   53.78440   7.333785   .572875      .410561      .763314

By investigating the item-total correlations, we can see that the correlations of items 5 and 6 with
the overall exam are .05 and .12, while all other items correlate at .45 or better. By investigating
the squared multiple correlations, we can see that again items 5 and 6 are significantly lower
than the rest of the items. Finally, by exploring the alpha if deleted, we can see that the reliability
of the scale (alpha) would increase to .82 if either of these two items were to be deleted. Thus, we
would probably delete these two items from this exam.
Deleting item process: To delete these items, we would delete one item at a time, preferably
item 5 because it produces a higher exam reliability coefficient if deleted, and re-run the
item-total statistics report before deleting item 6 to ensure we do not lower the overall alpha of the
exam. After deleting item 5, if item 6 still appears as an item to delete, then we would repeat
this deletion process for the latter item.
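As an alternative to SPSS or SAS, the sketch below shows one way to compute Cronbach's alpha and the "alpha if item deleted" column with plain Python. The item scores are invented and the function names are ad hoc illustrations, not the output of any particular statistical package.

```python
def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score columns (one list per item)."""
    k = len(items)
    n = len(items[0])
    totals = [sum(col[i] for col in items) for i in range(n)]

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_var_sum = sum(variance(col) for col in items)
    return (k / (k - 1)) * (1.0 - item_var_sum / variance(totals))

def alpha_if_deleted(items):
    """Alpha recomputed with each item left out in turn, as in the table above."""
    return [cronbach_alpha(items[:j] + items[j + 1:]) for j in range(len(items))]

# Invented scores for 6 students on 4 items (dichotomous here, but ratings work too).
items = [
    [1, 1, 1, 1, 0, 0],   # item 1
    [1, 1, 1, 0, 0, 0],   # item 2
    [1, 1, 0, 1, 0, 0],   # item 3
    [0, 1, 0, 1, 1, 0],   # item 4 (runs against the others; deleting it raises alpha)
]
print(round(cronbach_alpha(items), 3))
print([round(a, 3) for a in alpha_if_deleted(items)])
```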
Distractor evaluation: Another useful item review technique to use.
The distractor should be considered an important part of the item. Nearly 50 years of research shows that
there is a relationship between the distractors students choose and total exam score. The quality of the
distractors influences student performance on an exam item. Although the correct answer must be truly
correct, it is just as important that the distractors be incorrect.
who have not mastered the material whereas high scorers should infrequently select the distractors.
Reviewing the options can reveal potential errors of judgment and inadequate performance of distractors.
These poor distractors can be revised, replaced, or removed.
One way to study responses to distractors is with a frequency table. This table tells you the number and/or
percent of students that selected a given distractor. Distractors that are selected by a few or no students
should be removed or replaced. These kinds of distractors are likely to be so implausible to students that
hardly anyone selects them.
Definition: The incorrect alternatives in a multiple-choice item.
Reported as: The frequency (count), or number of students, that selected each incorrect alternative
Acceptable Range: Each distractor should be selected by at least a few students
Ideal Value: Distractors should be equally popular
Interpretation:
o Distractors that are selected by few or no students should be removed or replaced
o One distractor that is selected by as many or more students than the correct answer may indicate a confusing item and/or options
The number of people choosing a distractor can be lower or higher than expected because of:
o Partial knowledge
o A poorly constructed item
o A distractor that is outside of the area being tested
(A simple frequency count by high- and low-scoring groups, as in the sketch below, is enough to flag these patterns.)
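A distractor frequency table of the kind described here can be produced with a few lines of code. The sketch below splits respondents at the median total score and counts how often each alternative was chosen; the data, the median split, and the helper name are illustrative assumptions, not part of the source.

```python
from collections import Counter

def distractor_frequencies(responses, key):
    """Print how often each alternative was chosen by students scoring above
    and below the median total score.  responses is a list of
    (chosen_option, total_score) pairs; key is the correct alternative."""
    scores = sorted(s for _, s in responses)
    median = scores[len(scores) // 2]
    high = Counter(opt for opt, s in responses if s >= median)
    low = Counter(opt for opt, s in responses if s < median)
    options = sorted(set(opt for opt, _ in responses) | {key})
    print("Option  High  Low")
    for opt in options:
        flag = "*" if opt == key else " "           # mark the keyed answer
        print(f"  {opt}{flag}     {high[opt]:>3}  {low[opt]:>3}")

# Invented responses: (option chosen, student's total exam score).
data = [("A", 55), ("B", 90), ("B", 85), ("C", 40), ("B", 78),
        ("D", 35), ("B", 65), ("A", 45), ("C", 60), ("B", 88)]
distractor_frequencies(data, key="B")
```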

  • 2. 10% 29% 11% 5% 46% 0% 0% 100% [ Top ] Item Analysis Response Patterns Each item is identified by number and the correct option is indicated. The group of students taking the test is divided into upper, middle and lower groups on the basis of students' scores on the test. This division is essential if information is to be provided concerning the operation of distracters (incorrect options) and to compute an easily interpretable index of discrimination. It has long been accepted that optimal item discrimination is obtained when the upper and lower groups each contain twenty-seven percent of the total group. The number of students who selected each option or omitted the item is shown for each of the upper, middle, lower and total groups. The number of students who marked more than one option to the item is indicated under the "error" heading. The percentage of each group who selected each of the options, omitted the item, or erred, is also listed. Note that the total percentage for each group may be other than 100%, since the percentages are rounded to the nearest whole number before totaling. The sample item listed above appears to be performing well. About two-thirds of the upper group but only one-third of the lower group answered the item correctly. Ideally, the students who answered the item incorrectly should select each incorrect response in roughly equal proportions, rather than concentrating on a single incorrect option. Option two seems to be the most attractive incorrect option, especially to the upper and middle groups. It is most undesirable for a greater proportion of the upper group than of the lower group to select an incorrect option. The item writer should examine such an option for possible ambiguity. For the sample item on the previous page, option four was selected by only five percent of the total group. An attempt might be made to make this option more attractive. Item analysis provides the item writer with a record of student reaction to items. It gives us little information about the appropriateness of an item for a course of instruction. The appropriateness or content validity of an item must be determined by comparing the content of the item with the instructional objectives. [ Top ] Basic Item Analysis Statistics A number of item statistics are reported which aid in evaluating the effectiveness of an item. The first of these is the index of difficulty which is the proportion of the total group who got the item wrong. Thus a high index indicates a difficult item and a low index indicates an easy item. Some item analysts prefer an index of difficulty which is the proportion of the total group who got an item right. This index may be obtained by marking the PROPORTION RIGHT option on the item analysis header sheet. Whichever index is selected is shown as the INDEX OF DIFFICULTY on the item analysis print-out. For classroom achievement tests, most test constructors desire items with indices of difficulty no lower than 20 nor higher than 80, with an average index of difficulty from 30 or 40 to a maximum of 60. The INDEX OF DISCRIMINATION is the difference between the proportion of the upper group who got an item right and the proportion of the lower group who got the item right. This index is dependent upon the difficulty of an item. It may reach a maximum value of 100 for an item with an index of difficulty of 50, that is, when 100% of the upper group and none of the lower group answer the item correctly. 
For items of less than or greater than 50 difficulty, the index of discrimination has a maximum value of less than 100. The Interpreting the Index of Discrimination document contains a more detailed discussion of the index of discrimination. [ Top ]
  • 3. Interpretation of Basic Statistics To aid in interpreting the index of discrimination, the maximum discrimination value and the discriminating efficiency are given for each item. The maximum discrimination is the highest possible index of discrimination for an item at a given level of difficulty. For example, an item answered correctly by 60% of the group would have an index of difficulty of 40 and a maximum discrimination of 80. This would occur when 100% of the upper group and 20% of the lower group answered the item correctly. The discriminating efficiency is the index of discrimination divided by the maximum discrimination. For example, an item with an index of discrimination of 40 and a maximum discrimination of 50 would have a discriminating efficiency of 80. This may be interpreted to mean that the item is discriminating at 80% of the potential of an item of its difficulty. For a more detailed discussion of the maximum discrimination and discriminating efficiency concepts, see the Interpreting the Index of Discrimination document. [ Top ] Other Item Statistics Some test analysts may desire more complex item statistics. Two correlations which are commonly used as indicators of item discrimination are shown on the item analysis report. The first is the biserial correlation, which is the correlation between a student's performance on an item (right or wrong) and his or her total score on the test. This correlation assumes that the distribution of test scores is normal and that there is a normal distribution underlying the right/wrong dichotomy. The biserial correlation has the characteristic, disconcerting to some, of having maximum values greater than unity. There is no exact test for the statistical significance of the biserial correlation coefficient. The point biserial correlation is also a correlation between student performance on an item (right or wrong) and test score. It assumes that the test score distribution is normal and that the division on item performance is a natural dichotomy. The possible range of values for the point biserial correlation is +1 to -1. The Student's t test for the statistical significance of the point biserial correlation is given on the item analysis report. Enter a table of Student's t values with N - 2 degrees of freedom at the desired percentile point N, in this case, is the total number of students appearing in the item analysis. The mean scores for students who got an item right and for those who got it wrong are also shown. These values are used in computing the biserial and point biserial coefficients of correlation and are not generally used as item analysis statistics. Generally, item statistics will be somewhat unstable for small groups of students. Perhaps fifty students might be considered a minimum number if item statistics are to be stable. Note that for a group of fifty students, the upper and lower groups would contain only thirteen students each. The stability of item analysis results will improve as the group of students is increased to one hundred or more. An item analysis for very small groups must not be considered a stable indication of the performance of a set of items. [ Top ] Summary Data The item analysis data are summarized on the last page of the item analysis report. The distribution of item difficulty indices is a tabulation showing the number and percentage of items whose difficulties are in each of ten categories, ranging from a very easy category (00-10) to a very difficult category (91-100). 
The distribution of discrimination indices is tabulated in the same manner, except that a category is included for negatively discriminating items.
  • 4. The mean item difficulty is determined by adding all of the item difficulty indices and dividing the total by the number of items. The mean item discrimination is determined in a similar manner. Test reliability, estimated by the Kuder-Richardson formula number 20, is given. If the test is speeded, that is, if some of the students did not have time to consider each test item, the reliability estimate may be spuriously high. The final test statistic is the standard error of measurement. This statistic is a common device for interpreting the absolute accuracy of the test scores. The size of the standard error of measurement depends on the standard deviation of the test scores as well as on the estimated reliability of the test. Occasionally, a test writer may wish to omit certain items from the analysis although these items were included in the test as it was administered. Such items may be omitted by leaving them blank on the test key. The response patterns for omitted items will be shown but the keyed options will be listed as OMIT. The statistics for these items will be omitted from the Summary Data. [ Top ] Report Options A number of report options are available for item analysis data. The long-form item analysis report contains three items per page. A standard-form item analysis report is available where data on each item is summarized on one line. A sample reprot is shown below. Item Key 1 4 2 2 ITEM ANALYSIS Test 4482 125 Items 112 Students Percentages: Upper 27% - Middle - Lower 27% 1 2 3 4 5 Omit Error Diff Disc 7-23-57 0- 4- 7 28- 8-36 64-62- 0 0-0-0 0-0-0 0-0-0 54 64 7-12- 7 64-42-29 14- 4-21 14-42-36 0-0-0 0-0-0 0-0-0 56 35 The standard form shows the item number, key (number of the correct option), the percentage of the upper, middle, and lower groups who selected each option, omitted the item or erred, the index of difficulty, and the index of discrimination. For example, in item 1 above, option 4 was the correct answer and it was selected by 64% of the upper group, 62% of the middle group and 0% of the lower group. The index of difficulty, based on the total group, was 54 and the index of discrimination was 64. [ Top ] Item Analysis Guidelines Item analysis is a completely futile process unless the results help instructors improve their classroom practices and item writers improve their tests. Let us suggest a number of points of departure in the application of item analysis data. 1. Item analysis gives necessary but not sufficient information concerning the appropriateness of an item as a measure of intended outcomes of instruction. An item may perform beautifully with respect to item analysis statistics and yet be quite irrelevant to the instruction whose results it was intended to measure. A most common error is to teach for behavioral objectives such as analysis of data or situations, ability to discover trends, ability to infer meaning, etc., and then to construct an objective test measuring mainly recognition of facts. Clearly, the objectives of instruction must be kept in mind when selecting test items. 2. An item must be of appropriate difficulty for the students to whom it is administered. If possible, items should have indices of difficulty no less than 20 and no greater than 80. lt
  • 5. is desirable to have most items in the 30 to 50 range of difficulty. Very hard or very easy items contribute little to the discriminating power of a test. 3. An item should discriminate between upper and lower groups. These groups are usually based on total test score but they could be based on some other criterion such as gradepoint average, scores on other tests, etc. Sometimes an item will discriminate negatively, that is, a larger proportion of the lower group than of the upper group selected the correct option. This often means that the students in the upper group were misled by an ambiguity that the students in the lower group, and the item writer, failed to discover. Such an item should be revised or discarded. 4. All of the incorrect options, or distracters, should actually be distracting. Preferably, each distracter should be selected by a greater proportion of the lower group than of the upper group. If, in a five-option multiple-choice item, only one distracter is effective, the item is, for all practical purposes, a two-option item. Existence of five options does not automatically guarantee that the item will operate as a five-choice item. [ Top ] Item analysis is a general term that refers to the specific methods used in education to evaluate test items, typically for the purpose of test construction and revision. Regarded as one of the most important aspects of test construction and increasingly receiving attention, it is an approach incorporated into item response theory (IRT), which serves as an alternative to classical measurement theory (CMT) or classical test theory (CTT). Classical measurement theory considers a score to be the direct result of a person's true score plus error. It is this error that is of interest as previous measurement theories have been unable to specify its source. However, item response theory uses item analysis to differentiate between types of error in order to gain a clearer understanding of any existing deficiencies. Particular attention is given to individual test items, item characteristics, probability of answering items correctly, overall ability of the test taker, and degrees or levels of knowledge being assessed. THE PURPOSE OF ITEM ANALYSIS There must be a match between what is taught and what is assessed. However, there must also be an effort to test for more complex levels of understanding, with care taken to avoid over-sampling items that assess only basic levels of knowledge. Tests that are too difficult (and have an insufficient floor) tend to lead to frustration and lead to deflated scores, whereas tests that are too easy (and have an insufficient ceiling) facilitate a decline in motivation and lead to inflated scores. Tests can be improved by maintaining and developing a pool of valid items from which future tests can be drawn and that cover a reasonable span of difficulty levels. Item analysis helps improve test items and identify unfair or biased items. Results should be used to refine test item wording. In addition, closer examination of items will also reveal which questions were most difficult, perhaps indicating a concept that needs to be taught more thoroughly. If a particular distracter (that is, an incorrect answer choice) is the most often chosen answer, and especially if that distracter positively correlates with a high total score, the item must be examined more closely for correctness. 
This situation also provides an opportunity to identify and examine common misconceptions among students about a particular concept. In general, once test items have been created, the value of these items can be systematically assessed using several methods representative of item analysis: a) a test item's level of difficulty, b) an item's capacity to discriminate, and c) the item characteristic curve. Difficulty is assessed by examining the number of persons correctly endorsing the answer. Discrimination can be examined by comparing the number of persons getting a particular item correct with the total test score. Finally, the item characteristic curve can be used to plot the likelihood of answering correctly with the level of success on the test. ITEM DIFFICULTY In test construction, item difficulty is determined by the number of people who answer a particular test item correctly. For example, if the first question on a test was answered correctly by 76% of the class, then the difficulty level (p or percentage passing) for that question is p = .76. If the second question on a test was answered correctly by only 48% of the class, then the
  • 6. difficulty level for that question is p = .48. The higher the percentage of people who answer correctly, the easier the item, so that a difficulty level of .48 indicates that question two was more difficult than question one, which had a difficulty level of .76. Many educators find themselves wondering how difficult a good test item should be. Several things must be taken into consideration in order to determine appropriate difficulty level. The first task of any test maker should be to determine the probability of answering an item correctly by chance alone, also referred to as guessing or luck. For example, a true-false item, because it has only two choices, could be answered correctly by chance half of the time. Therefore, a true-false item with a demonstrated difficulty level of only p = .50 would not be a good test item because that level of success could be achieved through guessing alone and would not be an actual indication of knowledge or ability level. Similarly, a multiple-choice item with five alternatives could be answered correctly by chance 20% of the time. Therefore, an item difficulty greater than .20 would be necessary in order to discriminate between respondents' ability to guess correctly and respondents' level of knowledge. Desirable difficulty levels usually can be estimated as halfway between 100 percent and the percentage of success expected by guessing. So, the desirable difficulty level for a true-false item, for example, should be aroundp = .75, which is halfway between 100% and 50% correct. In most instances, it is desirable for a test to contain items of various difficulty levels in order to distinguish between students who are not prepared at all, students who are fairly prepared, and students who are well prepared. In other words, educators do not want the same level of success for those students who did not study as for those who studied a fair amount, or for those who studied a fair amount and those who studied exceptionally hard. Therefore, it is necessary for a test to be composed of items of varying levels of difficulty. As a general rule for norm-referenced tests, items in the difficulty range of .30 to .70 yield important differences between individuals' level of knowledge, ability, and preparedness. There are a few exceptions to this, however, with regard to the purpose of the test and the characteristics of the test takers. For instance, if the test is to help determine entrance into graduate school, the items should be more difficult to be able to make finer distinctions between test takers. For a criterion-referenced test, most of the item difficulties should be clustered around the criterion cut-off score or higher. For example, if a passing score is 70%, the vast majority of items should have percentage passing values of Figure 1ILLUSTRATION BY GGS INFORMATION SERVICES. CENGAGE LEARNING, GALE. p = .60 or higher, with a number of items in the p > .90 range to enhance motivation and test for mastery of certain essential concepts. DISCRIMINATION INDEX According to Wilson (2005), item difficulty is the most essential component of item analysis. However, it is not the only way to evaluate test items. Discrimination goes beyond determining the proportion of people who answer correctly and looks more specifically at who answers correctly. In other words, item discrimination determines whether those who did well on the entire test did well on a particular item. An item should in fact be able to discriminate between upper and lower scoring groups. 
Membership in these groups is usually determined based on their total test score, and it is expected that those scoring higher on the overall test will also be more likely to endorse the correct response on a particular item. Sometimes an item will discriminate negatively, that is, a larger proportion of the lower group select the correct response, as compared to those in the higher scoring group. Such an item should be revised or discarded. One way to determine an item's power to discriminate is to compare those who have done very well with those who have done very poorly, known as the extreme group method. First, identify the students who scored in the top one-third as well as those in the bottom one-third of the class. Next, calculate the proportion of each group that answered a particular test item correctly (i.e., percentage passing for the high and low groups on each item). Finally, subtract the p of the bottom performing group from the p for the top performing group to yield an item discrimination index (D). Item discriminations of D = .50 or higher are considered excellent. D = 0 means the item has no discrimination ability, while D = 1.00 means the item has perfect discrimination ability. In Figure 1, it can be seen that Item 1 discriminates well with those in the top performing group obtaining the correct response far more often (p = .92) than those in the
  • 7. Figure 2ILLUSTRATION BY GGS INFORMATION SERVICES. CENGAGE LEARNING, GALE. low performing group (p = .40), thus resulting in an index of .52 (i.e., .92 - .40 = .52). Next, Item 2 is not difficult enough with a discriminability index of only .04, meaning this particular item was not useful in discriminating between the high and low scoring individuals. Finally, Item 3 is in need of revision or discarding as it discriminates negatively, meaning low performing group members actually obtained the correct keyed answer more often than high performing group members. Another way to determine the discriminability of an item is to determine the correlation coefficient between performance on an item and performance on a test, or the tendency of students selecting the correct answer to have high overall scores. This coefficient is reported as the item discrimination coefficient, or the point-biserial correlation between item score (usually scored right or wrong) and total test score. This coefficient should be positive, indicating that students answering correctly tend to have higher overall scores or that students answering incorrectly tend to have lower overall scores. Also, the higher the magnitude, the better the item discriminates. The point-biserial correlation can be computed with procedures outlined in Figure 2. In Figure 2, the point-biserial correlation between item score and total score is evaluated similarly to the extreme group discrimination index. If the resulting value is negative or low, the item should be revised or discarded. The closer the value is to 1.0, the stronger the item's discrimination power; the closer the value is to 0, Figure 3ILLUSTRATION BY GGS INFORMATION SERVICES. CENGAGE LEARNING, GALE. the weaker the power. Items that are very easy and answered correctly by the majority of respondents will have poor pointbiserial correlations. CHARACTERISTIC CURVE A third parameter used to conduct item analysis is known as the item characteristic curve (ICC). This is a graphical or pictorial depiction of the characteristics of a particular item, or taken collectively, can be representative of the entire test. In the item characteristic curve the total test score is represented on the horizontal axis and the proportion of test takers passing the item within that range of test scores is scaled along the vertical axis.
  • 8. For Figure 3, three separate item characteristic curves are shown. Line A is considered a flat curve and indicates that test takers at all score levels were equally likely to get the item correct. This item was therefore not a useful discriminating item. Line B demonstrates a troublesome item as it gradually rises and then drops for those scoring highest on the overall test. Though this is unusual, it can sometimes result from those who studied most having ruled out the answer that was keyed as correct. Finally, Line C shows the item characteristic curve for a good test item. The gradual and consistent positive slope shows that the proportion of people passing the item gradually increases as test scores increase. Though it is not depicted here, if an ICC was seen in the shape of a backward S, negative item discrimination would be evident, meaning that those who scored lowest were most likely to endorse a correct response on the item. Eight Simple Steps to Item Analysis 1. Score each answer sheet, write score total on the corner o obviously have to do this anyway 2. Sort the pile into rank order from top to bottom score (1 minute, 30 seconds tops) 3. If normal class of 30 students, divide class in half o same number in top and bottom group: o toss middle paper if odd number (put aside) 4. Take 'top' pile, count number of students who responded to each alternative o fast way is simply to sort piles into "A", "B", "C", "D" // or true/false or type of error you get for short answer, fill-in-the-blank OR set up on spread sheet if you're familiar with computers ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS CLASS SIZE = 30 ITEM UPPER 1. A *B C D O LOWER DIFFERENCE D TOTAL DIFFICULTY 0 4 1 1 *=Keyed Answer o repeat for lower group ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS CLASS SIZE = 30 ITEM UPPER 1. A *B C D O 0 4 1 1 LOWER DIFFERENCE D TOTAL DIFFICULTY 2 *=Keyed Answer o this is the time consuming part --> but not that bad, can do it while watching TV, because you're just sorting piles
  • 9. THREE POSSIBLE SHORT CUTS HERE (STEP 4) (A) If you have a large sample of around 100 or more, you can cut down the sample you work with o take top 27% (27 out of 100); bottom 27% (so only dealing with 54, not all 100) o put middle 46 aside for the moment  o larger the sample, more accurate, but have to trade off against labour; using top 1/3 or so is probably good enough by the time you get to 100; --27% magic figure statisticians tell us to use I'd use halves at 30, but you could just use a sample of top 10 and bottom 10 if you're pressed for time   o but it means a single student changes stats by 10% trading off speed for accuracy... but I'd rather have you doing ten and ten than nothing (B) Second short cut, if you have access to photocopier (budgets) o photocopy answer sheets, cut off identifying info (can't use if handwriting is distinctive) o o o o o colour code high and low groups --> dab of marker pen color distribute randomly to students in your class so they don't know whose answer sheet they have get them to raise their hands  for #6, how many have "A" on blue sheet? how many have "B"; how many "C"  for #6, how many have "A" on red sheet.... some reservations because they can screw you up if they don't take it seriously another version of this would be to hire kid who cuts your lawn to do the counting, provided you've removed all identifying information  I actually did this for a bunch of teachers at one high school in Edmonton when I was in university for pocket money (C) Third shortcut, IF you can't use separate answer sheet, sometimes faster to type than to sort SAMPLE OF TYPING FORMAT FOR ITEM ANALYSIS ITEM # KEY 1 2 3 4 5 6 7 8 9 10 T F T F T A D C A B STUDENT Kay Jane o T T T F T A D C A D John o T T T F F A D D A C F F T F T A D C A B type name; then T or F, or A,B,C,D == all left hand on typewriter, leaving right hand free to turn pages (from Sax) IF you have a computer program -- some kicking around -- will give you all stats you need, plus bunches more you don't-- automatically after this stage
  • 10. OVERHEAD: SAMPLE ITEM ANALYSIS FOR CLASS OF 30 (PAGE #1) (in text) 5. Subtract the number of students in lower group who got question right from number of high group students who got it right o quite possible to get a negative number ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS CLASS SIZE = 30 ITEM UPPER 1. A *B C D O 0 4 1 1 LOWER 2 DIFFERENCE D TOTAL DIFFICULTY 2 *=Keyed Answer 6. Divide the difference by number of students in upper or lower group o in this case, divide by 15 o this gives you the "discrimination index" (D) ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS CLASS SIZE = 30 ITEM UPPER 1. A *B C D O 0 4 1 1 LOWER 2 2 DIFFERENCE D TOTAL DIFFICULTY 0.333 *=Keyed Answer 7. Total number who got it right ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS CLASS SIZE = 30 ITEM UPPER 1. A *B C D O 0 4 1 1 LOWER 2 2 DIFFERENCE 0.333 D TOTAL DIFFICULTY 6 *=Keyed Answer 8. If you have a large class and were only using the 1/3 sample for top and bottom groups, then you have to NOW count number of middle group who got each question right (not each alternative this time, just right answers) 9. Sample Form Class Size= 100. o if class of 30, upper and lower half, no other column here 10. Divide total by total number of students o difficulty = (proportion who got it right (p) ) ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS CLASS SIZE = 30 ITEM UPPER LOWER DIFFERENCE D TOTAL DIFFICULTY
  • 11. 1. A *B C D O 0 4 1 1 2 2 0.333 6 .42 *=Keyed Answer 11. You will NOTE the complete lack of complicated statistics --> counting, adding, dividing --> no tricky formulas required for this o not going to worry about corrected point biserials etc. o one of the advantages of using fixed number of alternatives Interpreting Item Analysis Let's look at what we have and see what we can see 90% of item analysis is just common sense... 1. Potential Miskey 2. Identifying Ambiguous Items 3. EqualDistribution to all alternatives. 4. Alternatives are not working 5. Distracter too atractive. 6. Question not discriminating. 7. Negative discrimination. 8. Too Easy. 9. Omit. 10. &11. Relationship between D index and Difficulty (p). Item Analysis of Computer Printouts o . 1. What do we see looking at this first one? [Potential Miskey] Upper 1. *A B C D O o o o o Low Difference D Total 1 4 -3 -.2 5 1 3 10 5 3 3 <----means omit or no answer Difficulty .17 #1, more high group students chose C than A, even though A is supposedly the correct answer more low group students chose A than high group so got negative discrimination; only .16% of class got it right most likely you just wrote the wrong answer key down --> this is an easy and very common mistake for you to make better you find out now before you hand back then when kids complain OR WORSE, they don't complain, and teach themselves that your miskey as the "correct" answer so check it out and rescore that question on all the papers before handing them back Makes it 10-5 Difference = 5; D=.34; Total = 15; difficulty=.50 --> nice item   o o
  • 12. OR: o you check and find that you didn't miskey it --> that is the answer you thought two possibilities: 1. one possibility is that you made slip of the tongue and taught them the wrong answer  anything you say in class can be taken down and used against you on an examination.... 2. more likely means even "good" students are being tricked by a common misconception --> You're not supposed to have trick questions, so may want to dump it --> give those who got it right their point, but total rest of the marks out of 24 instead of 25 If scores are high, or you want to make a point, might let it stand, and then teach to it --> sometimes if they get caught, will help them to remember better in future such as: very fine distinctions crucial steps which are often overlooked REVISE it for next time to weaken "B" -- alternatives are not supposed to draw more than the keyed answer -- almost always an item flaw, rather than useful distinction What can we see with #2: [Can identify ambiguous items] Upper 2. A B *C D O 6 1 7 1 Low 5 2 5 3 Difference 2 .13 D 12 Total Difficulty .40 #2, about equal numbers of top students went for A and D. Suggests they couldn't tell which was correct  either, students didn't know this material (in which case you can reteach it)  or the item was defective ---> look at their favorite alternative again, and see if you can find any reason they could be choosing it often items that look perfectly straight forward to adults are ambiguous to students FavoriteExamples of ambiguous items. if you NOW realize that D was a defensible answer, rescore before you hand it back to give everyone credit for either A or D -- avoids arguing with you in class if it's clearly a wrong answer, then you now know which error most of your students are making to get wrong answer useful diagnostic information on their learning, your teaching Equally to all alternatives Upper Low Difference D Total Difficulty
  • 13. 3. A B *C D O 4 3 5 3 3 4 4 4 1 .06 9 .30 item #3, students respond about equally to all alternatives usually means they are guessing Three possibilities: 0. may be material you didn't actually get to yet  you designed test in advance (because I've convinced you to plan ahead) but didn't actually get everything covered before holidays....  or item on a common exam that you didn't stress in your class 1. item so badly written students have no idea what you're asking 2. item so difficult students just completely baffled review the item:  if badly written ( by other teacher) or on material your class hasn't taken, toss it out, rescore the exam out of lower total  BUT give credit to those that got it, to a total of 100%  if seems well written, but too hard, then you know to (re)teach this material for rest of class....  maybe the 3 who got it are top three students,  tough but valid item:  OK, if item tests valid objective  want to provide occasional challenging question for top students  but make sure you haven't defined "top 3 students" as "those able to figure out what the heck I'm talking about" Alternatives aren't working Upper 4. A *B C D O 1 14 0 0 Low 5 7 2 0 Difference 7 .47 D Total 21 Difficulty .77 example #4 --> no one fell for D --> so it is not a plausible alternative question is fine for this administration, but revise item for next time toss alternative D, replace it with something more realistic each distracter has to attract at least 5% of the students
  • 14. class of 30, should get at least two students  or might accept one if you positively can't think of another fourth alternative -- otherwise, do not reuse the item if two alternatives don't draw any students --> might consider redoing as true/false Distracter too attractive Upper 5. A B C *D O 7 1 1 5 Low 10 2 1 2 Difference 3 D .20 Total 7 Difficulty .23 sample #5 --> too many going for A --> no ONE distracter should get more than key --> no one distracter should pull more than about half of students -- doesn't leave enough for correct answer and five percent for each alternative keep for this time weaken it for next time Question not discriminating Upper Low 6. *A 7 B 3 C 2 D 3 O 7 2 1 5 Difference 0 .00 D Total 14 Difficulty .47 sample #6: low group gets it as often as high group on norm-referenced tests, point is to rank students from best to worst so individual test items should have good students get question right, poor students get it wrong test overall decides who is a good or poor student on this particular topic  those who do well have more information, skills than those who do less well  so if on a particular question those with more skills and knowledge do NOT do better, something may be wrong with the question question may be VALID, but off topic  E.G.: rest of test tests thinking skill, but this is a memorization question, skilled and unskilled equally as likely to recall the answer
  • 15.   should have homogeneous test --> don't have a math item in with social studies if wanted to get really fancy, should do separate item analysis for each cell of your blueprint...as long as you had six items per cell question is VALID, on topic, but not RELIABLE  addresses the specified objective, but isn't a useful measure of individual differences  asking Grade 10s Capital of Canada is on topic, but since they will all get it right, won't show individual differences -- give you low D Negative Discrimination Upper 7. *A 7 B 3 C 2 D 3 O Low 10 3 1 1 Difference -3 -.20 D Total 17 Difficulty .57 D (discrimination) index is just upper group minus lower group varies from +1.0 to -1.0 if all top got it right, all lower got it wrong = 100% = +1 if more of the bottom group get it right than the top group, you get a negative D index if you have a negative D, means that students with less skills and knowledge overall, are getting it right more often than those who the test says are better overall in other words, the better you are, the more likely you are to get it wrong WHAT COULD ACCOUNT FOR THAT? Two possibilities: usually means an ambiguous question  that is confusing good students, but weak students too weak to see the problem  look at question again, look at alternatives good students are going for, to see if you've missed something OR: or it might be off topic --> something weaker students are better at (like rote memorization) than good students --> not part of same set of skills as rest of test--> suggests design flaw with table of specifications perhaps ((-if you end up with a whole bunch of -D indices on the same test, must mean you actually have two different distinct skills, because by definition, the low group is the high group on that bunch of questions --> end up treating them as two separate tests))
  • 16. if you have a large enough sample (like the provincial exams) then we toss the item and either don't count it or give everyone credit for it with sample of 100 students or less, could just be random chance, so basically ignore it in terms of THIS administration  kids wrote it, give them mark they got furthermore, if you keep dropping questions, may find that you're starting to develop serious holes in your blueprint coverage -- problem for sampling  but you want to track stuff this FOR NEXT TIME if it's negative on administration after administration, consistently, likely not random chance, it's screwing up in some way want to build your future tests out of those items with high positive D indices the higher the average D indices on the test, the more RELIABLE the test as a whole will be revise items to increase D -->if good students are selecting one particular wrong alternative, make it less attractive -->or increase probability of their selecting right answer by making it more attractive may have to include some items with negative Ds if those are the only items you have for that specification, and it's an important specification  what this means is that there are some skills/knowledge in this unit which are unrelated to rest of the skills/knowledge --> but may still be important e.g., statistics part of this course may be terrible on those students who are the best item writers, since writing tends to be associated with the opposite hemisphere in the brain than math, right... but still important objective in this course  may lower reliability of test, but increases content validity Too Easy Upper 8. A *B C D O 0 14 0 1 Low 1 13 1 1 Difference 1 .06 D Total 27 Difficulty .90 too easy or too difficult won't discriminate well either difficulty (p) (for proportion) varies from +1.0 (everybody got it right) to 0 (nobody) REMEMBER: THE HIGHER THE DIFFICULTY INDEX, THE EASIER THE QUESTION if the item is NOT miskeyed or some other glaring problem, it's too late to change after administered --> everybody got it right, OK, give them the mark TOO DIFFICULT = 30 to 35% (used to be rule in Branch, now not...)
  • 17. if the item is too difficult, don't drop it, just because everybody missed it --> you must have thought it was an important objective or it wouldn't have been on there; and unless literally EVERYONE missed it, what do you do with the students who got it right? give them bonus marks? cheat them of a mark they got? furthermore, if you drop too many questions, lose content validity (specs) --> if two or three got it right may just be random chance, so why should they get a bonus mark however, DO NOT REUSE questions with too high or low difficulty (p) values in future if difficulty is over 85%, you're wasting space on limited item test asking Grade 10s the Capital of Canada is probably waste of their time and yours --> unless this is a particularly vital objective same applies to items which are too difficult --> no use asking Grade 3s to solve quadratic equation but you may want to revise question to make it easier or harder rather than just toss it out cold OR SOME EXCEPTIONS HERE: You may have consciously decided to develop a "Mastery" style tests --> will often have very easy questions -& expect everyone to get everything trying to identify only those who are not ready to go on --> in which case, don't use any question which DOES NOT have a difficulty level below 85% or whatever Or you may want a test to identify the top people in class, the reach for the top team, and design a whole test of really tough questions --> have low difficulty values (i.e., very hard) so depends a bit on what you intend to do with the test in question this is what makes the difficulty index (proportion) so handy 13. you create a bank of items over the years --> using item analysis you get better questions all the time, until you have a whole bunch that work great -->can then tailor-make a test for your class you want to create an easier test this year, you pick questions with higher difficulty (p) values; you want to make a challenging test for your gifted kids, choose items with low difficulty (p) values --> for most applications will want to set difficulty level so that it gives you average marks, nice bell curve  government uses 62.5 --> four item multiple choice, middle of bell curve,
  • 18. 14. Start tests with an easy question or two to give students a running start.
  15. Make sure that the difficulty levels are spread out over the examination blueprint: not all hard geography questions and easy history questions. That is unfair to kids who are better at geography and worse at history, and it turns the class off geography if they equate it with tough questions.
  --> REMEMBER here that difficulty is different from complexity (Bloom): you can have a difficult recall-knowledge question and an easy synthesis one. Synthesis and evaluation items will tend to be harder than recall questions, so if you find the higher levels are more difficult, OK, but try to balance cells as much as possible; content cells, certainly, should be roughly the same.
  OMIT
  Item 9.      A    B    *C    D    O
  Upper        2    3     7    1    2
  Low          1    4     3    1    4
  Difference = 4, D = .26; Total correct = 10, Difficulty = .33
  If the omits are near the end of the test:
  --> students didn't find the item because it was on the next page -- a format problem, OR
  --> your test is too long: 6 of them (20%) didn't get to it.
  OR, if the omits are in the middle of the test, the item totally baffled them because:
  --> it is way too difficult for these students, or
  --> (since 2 of the omits came from the high group too) the wording is ambiguous.
  RELATIONSHIP BETWEEN D INDEX AND DIFFICULTY (p)
  Item 10.     Upper   Low   Difference    D     Total   Difficulty
  A              0      5
  *B            15      0        15       1.0     15       .50
  C              0      5
  D              0      5
  O              0      0
  Item 11.     Upper   Low   Difference    D     Total   Difficulty
  A              3      2
  *B             8      7         1       .06     15       .50
  C              2      3
  D              2      3
  O              0      0
  • 19. Item 10 is a perfect item --> each distracter gets at least 5 responses and the discrimination index is +1.0. (ACTUALLY, A PERFECT ITEM WOULD HAVE A DIFFICULTY OF ABOUT 65% TO ALLOW FOR GUESSING.)
  o High discrimination (D) indices require optimal levels of difficulty, but optimal levels of difficulty do not assure high levels of D.
  o Item 11 has the same difficulty level but a different D.
  o On a four-option multiple-choice test, a student answering totally by chance will get 25%.

  Item analysis
  An item analysis involves many statistics that can provide useful information for improving the quality and accuracy of multiple-choice or true/false items (questions). Some of these statistics are:
  Item difficulty: the percentage of students who correctly answered the item. Also referred to as the p-value. The range is from 0% to 100%, or more typically written as a proportion of 0.0 to 1.00. The higher the value, the easier the item.
  Calculation: Divide the number of students who got an item correct by the total number of students who answered it. (See the short sketch below.)
  Ideal value: Slightly higher than midway between chance (1.00 divided by the number of choices) and a perfect score (1.00) for the item. For example, on a four-alternative multiple-choice item, the random guessing level is 1.00/4 = 0.25; therefore, the optimal difficulty level is .25 + (1.00 - .25)/2 = 0.625. On a true/false question, the guessing level is 1.00/2 = .50 and, therefore, the optimal difficulty level is .50 + (1.00 - .50)/2 = .75.
  P-values above 0.90 indicate very easy items and should be carefully reviewed in light of the instructor's purpose. For example, if the instructor is using easy "warm-up" questions or aiming for student mastery, then some items with p-values above .90 may be warranted. In contrast, if an instructor is mainly interested in differences among students, these items may not be worth testing.
  P-values below 0.20 indicate very difficult items and should be reviewed for possible confusing language, removed from subsequent exams, and/or identified as an area for re-instruction. If almost all of the students get the item wrong, there is either a problem with the item or the students were not able to learn the concept. However, if an instructor is trying to identify the top percentage of students who learned a certain concept, such a highly difficult item may be necessary.
  Item discrimination: the relationship between how well students did on the item and their total exam score. Also referred to as the point-biserial correlation (PBS). The range is from -1.00 to 1.00. The higher the value, the more discriminating the item. A highly discriminating item indicates that the students who had high exam scores got the item correct, whereas the students who had low exam scores got the item incorrect. Items with discrimination values near or less than zero should be removed from the exam: this indicates that students who did poorly on the exam overall did better on that item than students who did well overall. The item may be confusing for your better-scoring students in some way.
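  As a worked example of the difficulty calculation and the "ideal value" formula just described, the following short Python sketch (my own code, not part of the source) computes an item's p-value and the optimal difficulty for a given number of answer choices.

  def difficulty(num_correct, num_answered):
      """p-value: proportion of students who answered the item correctly."""
      return num_correct / num_answered

  def optimal_difficulty(num_choices):
      """Midpoint between the guessing level (1/num_choices) and a perfect score of 1.00."""
      chance = 1.0 / num_choices
      return chance + (1.0 - chance) / 2

  print(difficulty(75, 100))      # 0.75
  print(optimal_difficulty(4))    # 0.625 for a four-option multiple-choice item
  print(optimal_difficulty(2))    # 0.75 for a true/false item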
  • 20. Acceptable range: 0.20 or higher
  Ideal value: The closer to 1.00, the better.
  Calculation: PBS = ((XC − XT) / SDtotal) × √(p / q)
  where XC = the mean total score for persons who responded correctly to the item, XT = the mean total score for all persons, p = the difficulty value for the item, q = (1 − p), and SDtotal = the standard deviation of total exam scores.
  Reliability coefficient: a measure of the amount of measurement error associated with an exam score. The range is from 0.0 to 1.0. The higher the value, the more reliable the overall exam score. Typically, internal consistency reliability is measured; this indicates how well the items are correlated with one another. High reliability indicates that the items are all measuring the same thing, or general construct (e.g., knowledge of how to calculate integrals for a calculus course). With multiple-choice items that are scored correct/incorrect, the Kuder-Richardson formula 20 (KR-20) is often used to calculate the internal consistency reliability:
  KR-20 = (K / (K − 1)) × (1 − Σ(p × q) / σ²x)
  where K = number of items, p = proportion of persons who responded correctly to an item (i.e., the difficulty value), q = proportion of persons who responded incorrectly to an item (i.e., 1 − p), and σ²x = the total score variance.
  Three ways to improve the reliability of the exam are to 1) increase the number of items in the exam, 2) use items that have high discrimination values, or 3) perform an item-total statistical analysis.
  Acceptable range: 0.60 or higher
  Ideal value: 1.00
  Item-total statistics: measure the relationship of individual exam items to the overall exam score. Currently, the University of Texas does not perform this analysis for faculty; however, one can calculate these statistics using SPSS or SAS statistical software.
  1. Corrected item-total correlation
  o This is the correlation between an item and the rest of the exam, without that item considered part of the exam.
  o If the correlation is low for an item, the item isn't really measuring the same thing the rest of the exam is trying to measure.
  2. Squared multiple correlation
  o This measures how much of the variability in the responses to this item can be predicted from the other items on the exam.
  o If an item does not predict much of the variability, then the item should be considered for deletion.
  3. Alpha if item deleted
  o The change in Cronbach's alpha if the item is deleted.
  o When the alpha value is higher than the current alpha with the item included, one should consider deleting this item to improve the overall reliability of the exam.
  EXAMPLE: Item-total statistics table
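  A compact sketch of the two formulas given above, assuming a 0/1-scored students-by-items matrix; the function names and the use of NumPy are my own choices rather than anything prescribed by the source.

  import numpy as np

  def point_biserial(item_scores, total_scores):
      """PBS = ((mean score of those correct - mean score of all) / SD of totals) * sqrt(p/q)."""
      item_scores = np.asarray(item_scores)
      total_scores = np.asarray(total_scores)
      p = item_scores.mean()                     # item difficulty value
      q = 1 - p
      mean_correct = total_scores[item_scores == 1].mean()
      mean_all = total_scores.mean()
      sd_total = total_scores.std()              # population SD of total exam scores
      return (mean_correct - mean_all) / sd_total * np.sqrt(p / q)

  def kr20(score_matrix):
      """KR-20 = K/(K-1) * (1 - sum(p*q) / variance of total scores)."""
      X = np.asarray(score_matrix)               # rows = students, columns = items
      k = X.shape[1]
      p = X.mean(axis=0)
      q = 1 - p
      var_total = X.sum(axis=1).var()
      return k / (k - 1) * (1 - (p * q).sum() / var_total)

  # Quick usage with fabricated 0/1 responses for 100 students on 20 items:
  scores = np.random.binomial(1, 0.7, size=(100, 20))
  totals = scores.sum(axis=1)
  print(point_biserial(scores[:, 0], totals), kr20(scores))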
  • 21. Item-total statistics
  Summary for scale: Mean = 46.1100   S.D. = 8.26444   Valid n = 100
  Cronbach alpha = .794313   Standardized alpha = .800491   Average inter-item correlation = .297818

  Variable   Mean if deleted   Var. if deleted   S.D. if deleted   Corrected item-total correlation   Squared multiple correlation   Alpha if deleted
  ITEM1      41.61000          51.93790          7.206795          .656298                            .507160                        .752243
  ITEM2      41.37000          53.79310          7.334378          .666111                            .533015                        .754692
  ITEM3      41.41000          54.86190          7.406882          .549226                            .363895                        .766778
  ITEM4      41.63000          56.57310          7.521509          .470852                            .305573                        .776015
  ITEM5      41.52000          64.16961          8.010593          .054609                            .057399                        .824907
  ITEM6      41.56000          62.68640          7.917474          .118561                            .045653                        .817907
  ITEM7      41.46000          54.02840          7.350401          .587637                            .443563                        .762033
  ITEM8      41.33000          53.32110          7.302130          .609204                            .446298                        .758992
  ITEM9      41.44000          55.06640          7.420674          .502529                            .328149                        .772013
  ITEM10     41.66000          53.78440          7.333785          .572875                            .410561                        .763314

  By investigating the corrected item-total correlations, we can see that the correlations of items 5 and 6 with the overall exam are .05 and .12, while all other items correlate at .45 or better. By investigating the squared multiple correlations, we can see that, again, items 5 and 6 are significantly lower than the rest of the items. Finally, by exploring the alpha if deleted, we can see that the reliability of the scale (alpha) would increase to .82 if either of these two items were deleted. Thus, we would probably delete these two items from this exam.
  Deleting item process: To delete these items, we would delete one item at a time -- preferably item 5 first, because deleting it produces the higher exam reliability coefficient -- and re-run the item-total statistics report before deleting item 6, to ensure we do not lower the overall alpha of the exam. After deleting item 5, if item 6 still appears as an item to delete, then we would repeat the deletion process for that item.
  Distractor evaluation: Another useful item review technique. The distractor should be considered an important part of the item. Nearly 50 years of research shows that there is a relationship between the distractors students choose and total exam score. The quality of the distractors influences student performance on an exam item. Although the correct answer must be truly correct, it is just as important that the distractors be incorrect. Distractors should appeal to low scorers who have not mastered the material, whereas high scorers should infrequently select them. Reviewing the options can reveal potential errors of judgment and inadequate performance of distractors. These poor distractors can be revised, replaced, or removed. One way to study responses to distractors is with a frequency table: this table tells you the number and/or percent of students who selected a given distractor. Distractors that are selected by few or no students should be removed or replaced; such distractors are likely so implausible that hardly anyone selects them.
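  For instructors without SPSS or SAS, a rough NumPy sketch of two of these item-total statistics -- the corrected item-total correlation and "alpha if deleted" -- might look like the code below. This is my own illustration, not the software that produced the table above, and the squared multiple correlation is omitted for brevity.

  import numpy as np

  def cronbach_alpha(X):
      """Cronbach's alpha for a students-by-items score matrix."""
      X = np.asarray(X, dtype=float)
      k = X.shape[1]
      item_vars = X.var(axis=0, ddof=1)
      total_var = X.sum(axis=1).var(ddof=1)
      return k / (k - 1) * (1 - item_vars.sum() / total_var)

  def item_total_report(X):
      """(item number, corrected item-total r, alpha if deleted) for each item."""
      X = np.asarray(X, dtype=float)
      report = []
      for j in range(X.shape[1]):
          rest = np.delete(X, j, axis=1).sum(axis=1)       # exam total without item j
          corrected_r = np.corrcoef(X[:, j], rest)[0, 1]
          alpha_if_deleted = cronbach_alpha(np.delete(X, j, axis=1))
          report.append((j + 1, corrected_r, alpha_if_deleted))
      return report

  # Items whose "alpha if deleted" exceeds the full-scale alpha (like items 5 and 6
  # in the table above) are candidates for removal, one item at a time.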
  • 22. Definition: The incorrect alternatives in a multiple-choice item.
  Reported as: The frequency (count), or number of students, who selected each incorrect alternative.
  Acceptable range: Each distractor should be selected by at least a few students.
  Ideal value: Distractors should be equally popular.
  Interpretation:
  o Distractors that are selected by few or no students should be removed or replaced.
  o One distractor that is selected by as many or more students than the correct answer may indicate a confusing item and/or options.
  The number of people choosing a distractor can be lower or higher than expected because of:
  o Partial knowledge
  o A poorly constructed item
  o A distractor that is outside of the area being tested
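  Finally, a minimal sketch (with invented responses and answer key) of the distractor frequency table described above: tally how many students chose each option and flag any distractor that nobody selects.

  from collections import Counter

  responses = ["B", "C", "B", "B", "A", "B", "B", "C", "B", "B"]   # one item's answers
  key = "B"                                                        # correct option
  options = ["A", "B", "C", "D"]

  counts = Counter(responses)
  for option in options:
      tag = "(correct)" if option == key else ""
      print(option, counts.get(option, 0), tag)

  # Distractors chosen by no one are probably too implausible and should be revised or replaced.
  unused = [opt for opt in options if opt != key and counts.get(opt, 0) == 0]
  print("distractors to revise or replace:", unused)               # e.g. ['D'] here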