UNIT - 6
PHILOSOPHY AND PSYCHOMETRIC PROPERTIES OF TESTS
Written by Dr. Muhammad Azeem
INTRODUCTION
Psychometrics is the field of study concerned with the theory and technique of
psychological measurement. It deals with the construction and validation of assessment
tools and instruments for testing, assessment, and related activities, and is usually
concerned with assessing an individual’s knowledge, ability, personality, and types of
behavior. Reliability and validity are the two major psychometric properties of assessment
instruments. An assessment instrument with excellent psychometric properties is one that
is both reliable and valid.
A reliable assessment instrument consistently measures the same construct, while a valid
assessment instrument measures what it claims to measure. If an instrument is valid, it is
also reliable; however, an instrument can be reliable without being valid.
Objectives of Unit
After reading this unit, students will be able to:
1. Understand the philosophy of tests
2. Understand Classical Test Theory
3. Understand Item Response Theory
4. Understand psychometric properties of items and tests
6.1. Philosophy of Testing
The traditional philosophy of testing holds that the ability to learn is randomly distributed
in the population. This means that if a learning task is assigned to a class and a test is then
administered to study performance, the students’ performance scores will be normally
distributed; that is, not all students benefit equally from the teaching-learning process. The
new philosophy is that all students can attain mastery of any learning task, provided they
are given sufficient opportunity and time. This means that absolute standards can be set
for measuring performance.
In testing, measurement is a process whereby the numerical relation between a
magnitude of a quantitative attribute and a unit of the same attribute is estimated using
some procedure, often a standardized set of operations. Realists believe that one can know
the real world as it truly is. As a philosophy of testing and measurement, realism is
characterized by behaviorally stated objectives, measurement-driven instruction, and report
cards, along with the use of programmed materials. Realism emphasizes that measurable
results from students can be obtained to show precisely how well a student is achieving.
Existentialists believe that each person should learn to make choices from the alternatives
available in society. As a philosophy of education, existentialism does not advocate
predetermined objectives for student achievement or testing to determine achievement.
Individual motivation is central, and feelings are recognized as the most important part of
the human condition. Experimentalists believe that one cannot know ultimate reality, but
one can experience it. Experimentalists believe in integrating school and society. Students
should be assessed for their problem-solving ability, and evaluation consists of using
relevant sources of information to solve a problem and to test hypotheses. Idealists believe
in a subject-centered curriculum, and idealism emphasizes a coherence theory in testing.
Students should be able to use reason and logic effectively and to honor eternal values.
Perennialism is a philosophy directly related to idealism that calls for a curriculum based
on the "Great Books of the Western World", with a standardized curriculum that regards
childhood and youth as obstacles to be overcome through education.
Key Points
1. The traditional philosophy of testing holds that students’ performance scores are
distributed normally because not all students benefit equally from the teaching-learning
process.
2. The new philosophy is that all students can attain mastery of any learning task,
provided they are given sufficient opportunity and time.
6.2. Theories of Test Development (CTT, IRT)
i. Testing Theory
A theory is a system of rules, procedures, and assumptions used to produce a result. The
theory of psychological tests and measurement, or, as typically referred to, test theory or
psychometric theory, offers a general framework and a set of techniques for evaluating the
development and use of psychological tests.
ii. Purpose of Test Theories
Testing is viewed as a systematic method of sampling one or more human characteristics
and representing these results for an individual in the form of descriptive statements
(Bloom, 1967).
The main purposes of testing theories are:
1. to formulate mathematical relationships between test properties so that some of them
can be manipulated to optimize the desired target properties, usually the diagnostically
important ones, e.g. validity and reliability;
2. to improve the quality of tests by ensuring their validity and reliability;
3. to predict outcomes of psychological testing.
iii. Psychometrics
Psychometrics is the field of study concerned with the theory and technique of
psychological measurement, which includes the measurement of knowledge, abilities,
attitudes, personality traits, and educational measurement.
iv. Theoretical Approaches
Psychometric theories predict outcomes of psychological testing, such as the difficulty of
items or the ability of test-takers. Generally speaking, the aim of these theories is to
understand and improve the reliability and validity of psychological tests.
1. Classical Theory
Classical test theory (CTT), based on the true score model, is also known as "true score
theory." It assumes that each individual has a true score, which would be obtained if there
were no error in measurement. The observed score for each person may therefore differ
from the individual’s true score:
X = T + E
where X is the observed score, T the true score, and E the error score.
a) Role of error in estimating test scores
The basic measure of error is the standard deviation of the error scores, which is called the
standard error of measurement (SEM). The larger the standard error of measurement, the
less certain we are of the accuracy of an observed score. Conversely, a small standard error
of measurement indicates that an individual’s observed score is close to the true score
(Kaplan & Saccuzzo, 1997).
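The role of error in X = T + E can be illustrated with a small simulation: one examinee is "retested" many times by drawing observed scores around a fixed true score, and the spread of those observed scores recovers the SEM. This is only a sketch; the true score (70), SEM (4.0), and replication count are hypothetical values, not from the text.

```python
import random
import statistics

random.seed(42)

TRUE_SCORE = 70          # hypothetical examinee's true score T
SEM = 4.0                # hypothetical standard error of measurement
N_REPLICATIONS = 10_000  # imagined repeated testings

# Under CTT, each observed score is X = T + E, with E a random error.
observed = [TRUE_SCORE + random.gauss(0, SEM) for _ in range(N_REPLICATIONS)]

mean_x = statistics.mean(observed)   # should be close to the true score T
sd_x = statistics.stdev(observed)    # should be close to the SEM

print(f"mean observed score: {mean_x:.2f}")
print(f"spread of observed scores (estimated SEM): {sd_x:.2f}")
```

A small SEM would make the observed scores cluster tightly around T, exactly as the paragraph above describes.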
b) Item Difficulty
The item-difficulty index is denoted by (p). It is determined by calculating the proportion
of examinees that answer the item correctly. According to Linn and Gronlund (2005), the
difficulty of an item indicates the percentage of students who get the item right.
In classical test theory (CTT), item difficulty:
- compares an examinee’s ability to the probability of success on a particular item;
- considers a pool of examinees collectively and empirically examines their success rate
on a particular item;
- calculates the success rate of a particular pool of examinees on an item.
The formula for the item-difficulty index is
p = (number of students with the correct answer) / (total number of students)
As an example, assume that 50 people take a test and 30 test-takers answer the item
correctly. Then
p = 30 / 50 = 0.6
If no examinee answers the item correctly, then
p = 0 / 50 = 0.0
If all examinees answer the item correctly, then
p = 50 / 50 = 1.0
Thus the item difficulty index (p) has a range of 0 to 1.
General Interpretation of Difficulty Index

Interpretation         Difficulty Index Range
Very Easy Item         above 0.81
Easy Item              0.61 - 0.80
Average Item           0.41 - 0.60
Difficult Item         0.21 - 0.40
Very Difficult Item    less than 0.20
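The difficulty index and the interpretation bands above can be written as a short helper (a sketch; the function names are illustrative):

```python
def difficulty_index(n_correct: int, n_total: int) -> float:
    """p = number answering correctly / total number of examinees."""
    return n_correct / n_total

def interpret_difficulty(p: float) -> str:
    """Interpretation bands from the table above."""
    if p > 0.80:
        return "Very Easy Item"
    if p > 0.60:
        return "Easy Item"
    if p > 0.40:
        return "Average Item"
    if p > 0.20:
        return "Difficult Item"
    return "Very Difficult Item"

# The worked example from the text: 30 of 50 examinees answer correctly.
p = difficulty_index(30, 50)
print(p, "->", interpret_difficulty(p))   # 0.6 -> Average Item
```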
Optimal Difficulty of an Item
The optimal level (D max) for an acceptable p value depends on the number of options per
item. A formula that can be used to compute the optimal level is:
D max = (1 + g) / 2
where g is the probability of selecting each option by chance (g = 1 / number of options).

Examples of Types of Questions              Options per Item    Optimal Difficulty D max
Single-stem question                        1                   1
Yes/No, Fill-in-blanks, True/False,
Matching, etc.                              2                   0.75
MCQs, Matching, Fill-in-blanks              3                   0.667
MCQs, Matching                              4                   0.625
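The optimal-difficulty formula can be computed directly, taking g = 1 / (number of options) as the chance probability of selecting each option (a sketch; values are rounded to three decimals):

```python
def optimal_difficulty(n_options: int) -> float:
    """D_max = (1 + g) / 2, where g = 1 / number of options."""
    g = 1 / n_options
    return (1 + g) / 2

# Reproduce the table above for 2-, 3-, and 4-option items.
for k in (2, 3, 4):
    print(f"{k} options: D max = {optimal_difficulty(k):.3f}")
```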
c) Item Discrimination
This index tells test developers how well an item functions to discriminate between
students who have achieved the objectives and those who have not. In the context of
psychometrics, discrimination is a desirable quality of test items. The discrimination index
is denoted by (d).
In classical test theory (CTT), item discrimination:
- is the ability of an item to differentiate between higher-ability examinees and
lower-ability examinees;
- is often computed statistically as the Pearson product-moment correlation coefficient
between the scores on the item (e.g., 0 and 1 on an item scored right-wrong) and the
scores on the total test.
For this calculation, the test developer divides the test-takers into the following three
groups according to their scores on the test as a whole:
Upper Group = an upper group consisting of the 27% highest achievers
Lower Group = a lower group consisting of the 27% lowest achievers
Middle Group = a middle group consisting of the remaining 46%
Then calculate the following:
a. p_upper = proportion in the upper group who got the item right
   = (number in the upper group who got it right) / (number of students in the upper
   group who answered the item)
b. p_lower = proportion in the lower group who got the item right
   = (number in the lower group who got it right) / (number of students in the lower
   group who answered the item)
The difference between the two p values is the item discrimination:
d = p_upper - p_lower
Range of Discrimination Index
a. Maximum
When all examinees in the upper group answer the item correctly and all examinees in the
lower group answer it incorrectly, then
p_upper = 1 and p_lower = 0
d = p_upper - p_lower = 1 - 0 = 1
and the item has maximum positive discrimination between groups. This means that the
item is helping you identify those students who have achieved your objectives (and those
who have not!).
b. Minimum
When all examinees in the upper group answer the item incorrectly and all examinees in
the lower group answer it correctly, then
p_upper = 0 and p_lower = 1
d = p_upper - p_lower = 0 - 1 = -1
and the item has maximum negative discrimination between groups, which means that
people with low scores got it right and people with high scores missed it (a bad thing).
c. Zero discrimination
When the same proportion of examinees in the upper group and the lower group answer
the item correctly, then
p_upper = p_lower = q (say)
d = p_upper - p_lower = q - q = 0
and the item is not helping us sort people into the two performance groups at all. This can
happen when everyone gets the item right or everyone misses it; such items are
non-discriminators. The range of the discrimination index is therefore -1 to +1.
General Interpretation of Discrimination Index

Interpretation    Discrimination Index Range
Ideal Item        about 0.5
Very Good Item    0.40 - 0.49
Good Item         0.30 - 0.39
Fair Item         0.20 - 0.29
Poor Item         less than 0.20
Activity
Ten students have taken an objective test comprising 10 items. In the table below, the
students’ scores are listed from high to low. There are five students in the upper half and
five students in the lower half. A "1" indicates a correct answer on the item; a "0"
indicates an incorrect answer.

Student  Total      Item  Item  Item  Item  Item  Item  Item  Item  Item  Item
ID       Score (%)  1     2     3     4     5     6     7     8     9     10
AA       100        1     1     1     1     1     1     1     1     1     1
BA       90         1     1     1     1     1     1     1     1     0     1
CD       80         1     1     0     1     1     1     1     1     0     0
DR       70         0     1     1     1     1     1     0     1     0     1
EF       70         1     1     1     0     1     1     1     0     0     1
FG       60         1     1     1     0     1     1     0     1     0     0
GH       60         0     1     1     0     1     1     0     1     0     1
HM       50         0     1     1     1     0     0     1     0     1     0
IK       40         1     1     1     0     1     0     0     0     0     1
JN       30         0     1     0     0     0     1     0     0     1     0
Calculate the Difficulty Index (p) and the Discrimination Index (D) for each question.

Item     # Correct       # Correct       Difficulty  Discrimination
         (Upper group)   (Lower group)   (p)         (D)
Item 1
Item 2
Item 3
Item 4
Item 5
Item 6
Item 7
Item 8
Item 9
Item 10
Answer the following items:
1. Which item is the easiest?
2. Which item is the most difficult?
3. Which item has the poorest discrimination?
4. Which items would you eliminate first (if any) – why?
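The activity can be checked with a short script. The item matrix below is transcribed from the table above, and the upper and lower groups are the top five and bottom five students, as the activity specifies:

```python
# Item-score matrix from the activity table: rows are students (sorted
# high to low by total score), columns are Items 1-10.
scores = {
    "AA": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "BA": [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    "CD": [1, 1, 0, 1, 1, 1, 1, 1, 0, 0],
    "DR": [0, 1, 1, 1, 1, 1, 0, 1, 0, 1],
    "EF": [1, 1, 1, 0, 1, 1, 1, 0, 0, 1],
    "FG": [1, 1, 1, 0, 1, 1, 0, 1, 0, 0],
    "GH": [0, 1, 1, 0, 1, 1, 0, 1, 0, 1],
    "HM": [0, 1, 1, 1, 0, 0, 1, 0, 1, 0],
    "IK": [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
    "JN": [0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
}

students = list(scores)
upper, lower = students[:5], students[5:]   # top half vs bottom half

p_values, d_values = [], []
for item in range(10):
    correct_upper = sum(scores[s][item] for s in upper)
    correct_lower = sum(scores[s][item] for s in lower)
    p = (correct_upper + correct_lower) / len(students)   # difficulty index
    d = correct_upper / len(upper) - correct_lower / len(lower)  # discrimination
    p_values.append(p)
    d_values.append(d)
    print(f"Item {item + 1}: p = {p:.1f}, D = {d:+.1f}")

easiest = p_values.index(max(p_values)) + 1   # highest p
hardest = p_values.index(min(p_values)) + 1   # lowest p
poorest = d_values.index(min(d_values)) + 1   # lowest D
print("easiest:", easiest, "hardest:", hardest, "poorest D:", poorest)
```

Running this answers the questions: item 2 is the easiest (p = 1.0), item 9 is the most difficult (p = 0.3), and item 9 also has the poorest discrimination (D = -0.2, negative), which makes it the first candidate for elimination.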
Output of Iteman Software (a CTT-based Software)

Item Statistics
  N = 357   P = 0.465   Rpbis = 0.055   Rbis = 0.069
  Alpha w/o item = 0.832   Mantel-Haenszel = 2.933   p = 0.034   Bias against: Urban

Option Statistics
  Option               Prop.    Rpbis    Rbis     Mean     SD       N
  A (Maroon) **KEY**   0.465    0.055    0.069    24.114   7.400    166
  B (Green)            0.314    0.270    0.354    26.054   9.494    112
  C (Blue)             0.067   -0.241   -0.463    14.875   5.367     24
  D (Olive)            0.132   -0.212   -0.335    17.915   7.590     47
  Omit                 0.022   -0.084   -0.233    14.500   3.586      8
  Not Administered                                                   13
  Total                                           18.231   8.594

[Quantile plot data: the proportion of examinees choosing each option within five
total-score groups (0-20%, 20-40%, 40-60%, 60-80%, 80-100%), shown graphically
in the original output.]
Explanation of Output
Item information
It tells us that the item ID is 35 and that the item appears at serial number 35 in the test.
The item is an MCQ with four options; the key is A, and it relates to "Algebra". The flag
box shows that the item is biased and that there is a problem with the key.
Item Statistics
It tells us that 357 students attempted this item. The difficulty index (P) is 0.465 and the
discrimination index (Rbis) is 0.069. The item is biased against urban students.
Option Statistics
This is the distractor analysis. The discrimination index (Rbis) of options A and B is
positive. A is the key, so its discrimination should be positive; B is a distractor, so its
discrimination index should be negative, but it is also behaving like a key. This can also
be seen in the graph.
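The Rpbis statistic that Iteman reports is the point-biserial (Pearson) correlation between a 0/1 indicator for an item or option and examinees' total scores. A minimal sketch of that computation; the eight response/total-score pairs are hypothetical, not taken from the output above:

```python
import math

def point_biserial(item_scores: list, total_scores: list) -> float:
    """Pearson correlation between a dichotomous (0/1) item score and
    total test scores -- the Rpbis discrimination statistic."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_scores, total_scores)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in item_scores) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in total_scores) / n)
    return cov / (sd_x * sd_y)

# Hypothetical data: 1 = answered the item correctly, paired with total scores.
got_it_right = [1, 1, 0, 1, 0, 0, 1, 0]
totals       = [28, 25, 15, 22, 12, 18, 30, 14]
r = point_biserial(got_it_right, totals)
print(f"Rpbis = {r:.3f}")
```

A strongly positive value, as here, is what one expects for a correctly keyed option; a positive Rpbis on a distractor (like option B above) signals a keying problem.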
2. Item Response Theory (IRT)
Item response theory (IRT) is a latent trait theory proposed in the field of psychometrics
for the measurement of ability, skills, proficiency, learning, performance, etc. It is widely
used in testing to calibrate and evaluate items in assessment instruments and to score
subjects on their abilities, attitudes, or other latent traits. During the last several decades,
educational assessment has made increasing use of IRT-based techniques to develop tests,
because the methodology can significantly improve measurement accuracy and reliability
while providing potentially significant reductions in assessment time and effort, especially
via computerized adaptive testing (Hays, Morales, & Reise, 2000; Edelen & Reeve, 2007;
Holman, Glas, & de Haan, 2003; Reise & Waller, 2009).
The most popular IRT model, the Rasch model, specifies a single latent trait to account for
all statistical dependencies among test items as well as all differences among test-takers. It
is this underlying trait, typically denoted by theta (θ), that distinguishes items with respect
to difficulty and test-takers with respect to proficiency. The Rasch model expresses the
probability of a test-taker’s specific response to an item as a function of the test-taker’s
location on θ and the relationship of the item to θ. The model is probabilistic; if the
assumption of local item independence is violated, distorted estimates of item parameters,
test statistics, and examinee proficiency may result (Fennessy, 1995; Sireci, Thissen, &
Wainer, 1991; Thissen, Steinberg, & Mooney, 1989). Item response theory advances the
concepts of item and test information to replace reliability. In place of reliability, IRT
offers the test information function, which shows the degree of precision at different
values of theta. Plots of item information are used to see how much information an item
contributes. Because of local independence, item information functions are additive: the
test information function is simply the sum of the information functions of the items on
the exam, and with a large item bank, test information functions help in judging
measurement error very precisely. In item response theory, the Rasch model is used
because, as a probabilistic model, it offers a way to model the probability that a person
with a certain ability will be able to perform at a certain level; that is, it measures a
person’s ability and an item’s difficulty on a single continuum. It helps in checking how
well the data fit the model, diagnoses very quickly where the misfit is worst, and helps us
understand this misfit in terms of the construction of the items and the theoretical
development of the variable (Rasch Analysis, 2012). Mean square, t, infit, and outfit
values are used for making fit decisions, as Bond and Fox (2001) considered them
important, with more emphasis on infit values.
Item response theory is related to:
- latent trait theory;
- an approach in which each item on a test has its own item characteristic curve,
indicating the probability of getting that particular item right or wrong;
- the association between an individual's response to an item and the underlying
latent variable ("ability" or "trait") being measured by the instrument.
The latent variable, expressed as theta (θ), is a continuous one-dimensional construct that
explains the covariance among item responses. Individuals at higher levels of θ have a
higher probability of responding correctly to an item.
a) Item Characteristics Curve
It provides useful information about the behavior of an item (its discrimination index,
its difficulty index, and guessing).
b) Item difficulty
The one-parameter (Rasch) model gives the probability of a correct response as
P(θ) = 1 / (1 + e^-(θ - b))
where
• each learner has ability θ
• each item has difficulty b
This equation is called the one-parameter model.
- b is defined as the point on the ability (logit) scale at which the probability of
success on the item is 0.5 (50%).
- an item's level of difficulty, b, is another factor affecting an individual's
probability of responding in a particular way.
- θ and b are on the same scale, i.e. on the x-axis.
The figure above shows that the difficulty of an item and the ability of an individual both
increase from left to right along the scale.
Activity
Let’s try the following values:
θ = 0, b = 0? θ = 3, b = 0? θ = -3, b = 0?
θ = 0, b = 3? θ = 0, b = -3? θ = 3, b = 3?
θ = -3, b = -3?
What is P(θ)?
Let's enter these values into Excel and create the item characteristic curve using the above
equation of the one-parameter model.
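Instead of (or before) building the Excel sheet, the same values can be evaluated with a few lines of code, assuming the standard 1PL logistic form P(θ) = 1 / (1 + e^-(θ - b)):

```python
import math

def p_1pl(theta: float, b: float) -> float:
    """One-parameter (Rasch) model: P = 1 / (1 + e^-(theta - b))."""
    return 1 / (1 + math.exp(-(theta - b)))

# The (theta, b) pairs from the activity above.
for theta, b in [(0, 0), (3, 0), (-3, 0), (0, 3), (0, -3), (3, 3), (-3, -3)]:
    print(f"theta = {theta:+d}, b = {b:+d}  ->  P(theta) = {p_1pl(theta, b):.3f}")
```

Note that whenever θ = b, P(θ) = 0.5, which is exactly the definition of item difficulty in this model.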
c) Item discrimination
The two-parameter model gives the probability of a correct response as
P(θ) = 1 / (1 + e^-(a(θ - b)))
where
• each learner has ability θ
• each item has difficulty b
• each item has discrimination a
This equation is called the two-parameter model.
• The items on a test might also differ in the degree to which they can
differentiate individuals who have high trait levels from individuals who have
low trait levels.
The figure referred to above shows that item 1 is more discriminatory than item 2: the
steepness of the ICC represents the discrimination of the item.
Activity
Let's try the following values:
θ = 0, b = 0, a = 5?   θ = 3, b = 0, a = 2?   θ = -3, b = 0, a = 3?
θ = 0, b = 3, a = 1?   θ = 0, b = -3, a = 1?  θ = 3, b = 3, a = 5?
θ = -3, b = -3, a = 10?
What is P(θ)?
Let's enter these values into Excel and create the item characteristic curve using the above
equation of the two-parameter model.
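As before, the same values can be checked in code, assuming the standard 2PL logistic form with discrimination a:

```python
import math

def p_2pl(theta: float, b: float, a: float) -> float:
    """Two-parameter model: P = 1 / (1 + e^-(a * (theta - b)))."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# The (theta, b, a) triples from the activity above.
cases = [(0, 0, 5), (3, 0, 2), (-3, 0, 3), (0, 3, 1),
         (0, -3, 1), (3, 3, 5), (-3, -3, 10)]
for theta, b, a in cases:
    print(f"theta={theta:+d}, b={b:+d}, a={a:2d}  ->  P = {p_2pl(theta, b, a):.3f}")
```

Whenever θ = b, P(θ) is still 0.5 regardless of a; the discrimination a only controls how steeply the curve rises around that point.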
d) Guessing
The three-parameter model gives the probability of a correct response as
P(θ) = c + (1 - c) / (1 + e^-(a(θ - b)))
where
• each learner has ability θ
• each item has difficulty b
• each item has discrimination a
• each item has guessing chance c
This equation is called the three-parameter model.
The figure referred to above shows that item 2 has a greater guessing chance than item 1.
Activity
Let's try the following values:
θ = 0, b = 0, a = 5, c = 0.5?   θ = 3, b = 0, a = 2, c = 1?    θ = -3, b = 0, a = 3, c = 0.75?
θ = 0, b = 3, a = 1, c = 2?     θ = 0, b = -3, a = 1, c = 0?   θ = 3, b = 3, a = 5, c = 3?
θ = -3, b = -3, a = 10, c = 5?
What is P(θ)?
Let's enter these values into Excel and create the item characteristic curve using the above
equation of the three-parameter model.
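The 3PL formula can be evaluated the same way. One caveat worth noticing while working the activity: the guessing parameter c is a probability (the lower asymptote of the ICC), so values such as c = 2, 3, or 5 in the activity list fall outside the valid range [0, 1], and the sketch below rejects them:

```python
import math

def p_3pl(theta: float, b: float, a: float, c: float) -> float:
    """Three-parameter model: P = c + (1 - c) / (1 + e^-(a * (theta - b))).
    c is the chance that a very low-ability examinee answers correctly
    by guessing, so it must lie between 0 and 1."""
    if not 0 <= c <= 1:
        raise ValueError("guessing parameter c must be in [0, 1]")
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# First activity case: theta = 0, b = 0, a = 5, c = 0.5.
# At theta = b the logistic part equals 0.5, so P = c + (1 - c) / 2.
print(f"P = {p_3pl(0, 0, 5, 0.5):.3f}")
```

For very low θ the curve flattens out at c rather than 0, which is exactly the "guessing floor" the figure illustrates.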
Outputs of IRT-Based Software ConQuest for Psychometric Properties of a Test
Item-Person Map, Sample Test 1
================================================================================
SAMPLE TEST 1                                              Fri Mar 18 01:23 2016
MAP OF LATENT DISTRIBUTIONS AND RESPONSE MODEL PARAMETER ESTIMATES
===========================================================Build: Jan 8 2016===
Terms in the Model (excl Step terms): person + item
[Item-person map: persons are plotted as columns of 'X' on the left of the
vertical line and item numbers on the right, both on a common logit scale.
Each 'X' represents 0.6 cases. Some parameters could not be fitted on the
display.]
================================================================================
The item-person map output of sample test 1 shows the distribution of items (on the right
side of the map, represented by numbers 1, 2, 3, etc.) and persons (on the left side of the
map, represented by X). The map shows that the test is difficult, because most of the items
lie above most of the persons; in other words, there is no person for most of the items.
Item-Person Map, Sample Test 2
================================================================================
SAMPLE TEST 2                                              Fri Aug 05 09:20 2016
MAP OF LATENT DISTRIBUTIONS AND RESPONSE MODEL PARAMETER ESTIMATES
===========================================================Build: Jan 8 2016===
Terms in the Model (excl Step terms): person + item
[Item-person map: persons are plotted as columns of 'X' on the left of the
vertical line and item numbers on the right, both on a common logit scale.
Each 'X' represents 3.6 cases.]
================================================================================
The item-person map output of sample test 2 shows the distribution of items (on the right
side of the map, represented by numbers 1, 2, 3, etc.) and persons (on the left side of the
map, represented by X). The map shows that the test is easy, because most of the items lie
below most of the persons; in other words, there is no item for most of the persons.
Fit Statistics
================================================================================
Sample of Fit Statistics
TABLES OF RESPONSE MODEL PARAMETER ESTIMATES
===========================================================Build: Jan 8 2016===
TERM 1: item
                               UNWEIGHTED FIT              WEIGHTED FIT
item  ESTIMATE (b)  ERROR^  MNSQ  CI            T      MNSQ  CI            T
--------------------------------------------------------------------------------
  1    0.202        0.062   0.95  (0.94, 1.06)  -1.8   1.03  (0.92, 1.08)   0.7
  2    0.217        0.061   0.84  (0.94, 1.06)  -5.6   0.97  (0.93, 1.07)  -0.9
  3   -0.382        0.072   0.84  (0.94, 1.06)  -5.7   0.94  (0.90, 1.10)  -1.2
  4   -0.265        0.070   0.84  (0.94, 1.06)  -5.8   0.93  (0.91, 1.09)  -1.4
  5   -0.676        0.079   0.92  (0.94, 1.06)  -2.7   0.97  (0.89, 1.11)  -0.6
  6    2.010        0.048   1.25  (0.94, 1.06)   7.9   1.13  (0.96, 1.04)   6.7
  7   -0.599        0.078   0.83  (0.94, 1.06)  -6.2   0.94  (0.89, 1.11)  -1.0
  8    1.200        0.051   1.08  (0.94, 1.06)   2.6   1.07  (0.95, 1.05)   3.0
  9    0.088        0.064   1.01  (0.94, 1.06)   0.5   1.05  (0.92, 1.08)   1.2
 10    0.298        0.060   0.90  (0.94, 1.06)  -3.6   0.96  (0.93, 1.07)  -1.1
 11    0.897        0.053   1.02  (0.94, 1.06)   0.5   1.03  (0.95, 1.05)   1.2
 12    1.533        0.049   1.01  (0.94, 1.06)   0.3   1.02  (0.96, 1.04)   1.1
 13   -0.440        0.074   0.92  (0.94, 1.06)  -2.9   0.96  (0.90, 1.10)  -0.8
 14    0.854        0.054   0.99  (0.94, 1.06)  -0.3   1.04  (0.94, 1.06)   1.4
 15   -0.798        0.083   0.79  (0.94, 1.06)  -7.8   0.91  (0.88, 1.12)  -1.4
 16   -0.826        0.084   1.03  (0.94, 1.06)   1.0   0.93  (0.88, 1.12)  -1.1
 17    0.413        0.059   0.92  (0.94, 1.06)  -2.9   0.98  (0.93, 1.07)  -0.5
 18   -2.270        0.144   0.51  (0.94, 1.06) -20.2   0.91  (0.75, 1.25)  -0.6
 19   -0.996        0.088   0.88  (0.94, 1.06)  -4.4   0.99  (0.87, 1.13)  -0.1
 20   -1.284        0.098   0.75  (0.94, 1.06)  -9.2   0.85  (0.85, 1.15)  -2.0
 21   -1.599        0.110   0.72  (0.94, 1.06) -10.3   0.85  (0.82, 1.18)  -1.7
 22    1.398        0.050   1.10  (0.94, 1.06)   3.4   1.08  (0.96, 1.04)   3.4
 23   -1.500        0.106   0.57  (0.94, 1.06) -17.5   0.86  (0.83, 1.17)  -1.7
 24   -1.088        0.091   0.89  (0.94, 1.06)  -3.9   0.91  (0.86, 1.14)  -1.3
 25    1.145        0.052   0.99  (0.94, 1.06)  -0.2   1.00  (0.95, 1.05)   0.1
 26   -0.916        0.086   0.90  (0.94, 1.06)  -3.4   0.91  (0.87, 1.13)  -1.4
 27    2.314        0.048   1.22  (0.94, 1.06)   7.0   1.09  (0.97, 1.03)   5.2
 28    1.502        0.050   1.15  (0.94, 1.06)   4.8   1.11  (0.96, 1.04)   4.9
 29    0.561        0.057   1.32  (0.94, 1.06)   9.8   1.10  (0.94, 1.06)   3.0
 30   -0.116        0.067   0.96  (0.94, 1.06)  -1.5   0.95  (0.91, 1.09)  -1.0
 31   -0.550        0.076   0.86  (0.94, 1.06)  -4.9   0.90  (0.89, 1.11)  -1.8
 32   -0.327*       0.071   0.99  (0.94, 1.06)  -0.2   0.99  (0.90, 1.10)  -0.3
--------------------------------------------------------------------------------
An asterisk next to a parameter estimate indicates that it is constrained
Separation Reliability = 0.996
Chi-square test of parameter equality = 10108.22, df = 31, Sig Level = 0.000
^ Empirical standard errors have been used
================================================================================
The fit statistics show how well the model fits the data. If an item's mean square (MNSQ)
value lies within the corresponding confidence interval (CI), the item fits the IRT model.
The MNSQ values show that items 6, 8, 22, 27, 28, and 29 misfit the IRT model. The
Estimate column in the fit statistics table gives the difficulty level of each item; for
example, item 30 is of average difficulty (b ≈ 0), as shown in both the item-person map of
sample test 2 and the table of fit statistics. The separation reliability index (0.996) tells us
that the test is measuring a wide range of abilities.
Comparison of IRT and CTT
CTT and IRT differ in many respects, although the two testing theories also have much in
common. A crucial similarity is that both are models of performance: if the model
assumptions are not met, conclusions and interpretations will not be supportable, and the
investigator will not necessarily be able to test the assumptions. In the case of IRT,
however, there are statistical procedures to help determine whether the construct is causal
or emergent (Tractenberg, 2010). Classical test theory (CTT) has been extremely popular
in the development, characterization, and sometimes selection of outcome measures in the
field of testing. IRT is powerful and offers options to measure outcomes more accurately
than CTT does, but IRT modeling is complex.
Following is a general comparison of CTT and IRT.

Area                        CTT                             IRT
Nature                      Traditional                     Modern
Complexity                  Simple                          Complex
Model                       Linear                          Nonlinear
Scores                      True score                      Ability scores
Inferences                  Student's ability               Student's expected scores
Scope                       Narrow                          Broader
Focus                       More on test                    More on item
Assumptions                 Weak                            Strong
Item-ability relationship   Not specified                   Item characteristic function
Ability                     Test scores or estimated        Ability scores are reported on
                            true scores are reported        the scale -∞ to +∞ or a
                            on the test score scale         transformed scale
Invariance of item and      No - item and person            Yes - item and person
person parameters           parameters are sample           parameters are sample
                            dependent                       independent if the model
                                                            fits the test
Sample size (for item       200 to 500 in general           Depends on the IRT model, but
parameter estimation)                                       larger samples, i.e. over 500
                                                            in general, are needed
Dependency                  An item's properties depend     An item's properties do not
                            on a representative sample      depend on a representative
                                                            sample
Item difficulty (IDF)       Sample dependent                Sample independent
6.4. Standard Setting
Changes in educational assessment are currently being called for within the fields of both
measurement and evaluation. Traditional forms of assessment of knowledge provide a
standard-setting method for assigning numerical scores to determine letter grades, but
rarely reveal information about how students actually understand and reason with
acquired ideas or apply their knowledge to solving problems. Students' achievement of
curriculum objectives and institutional standards is reflected in the grading process; the
model by which the grading process is carried out is what is in question. There are three
grading models most commonly employed in educational settings and institutions. The
first model is norm-referenced: norm-referenced grading refers to an evaluation in which
students are assessed in relation to each other. The second is criterion-referenced:
criterion-referenced grading is the process by which students are evaluated in a
noncompetitive atmosphere, with the emphasis placed on the learning objectives and
standards. The third is self-referenced: it is based on comparing a learner's performance
with the instructor's perceptions of the learner's ability. Learners performing above the
level the instructor perceives them capable of receive higher grades than learners the
instructor perceives as not having improved as much. There is thus a great need for
appropriate grading methods for assigning letters to students' performance. This section
summarizes current trends in academic grading and relates them to the assessment of
student outcomes in a specific course. A review of these grading models shows a
noticeable shift toward the criterion-referenced grading model.
Criterion Referenced Model
This model's framework is based on a curriculum, course, or lesson. By establishing
absolute standards, grades are assigned by comparing a learner's performance to a set of
standards. Learners meeting the learning targets receive higher grades than those learners
not meeting the targets. This method presumes the learning targets are appropriately
designed for the particular learner population and the instructor is focusing instruction on
the learning targets.
Norm Referenced Model
This model's framework is based on a comparison among learners. Establishing relative
standards means making comparisons relative to the group, such that a learner's
performance is compared to that of others in the group.
Advantages and Disadvantages of Norm-referenced Grading
According to Nitko (2001), norm-referenced grading ranks learners from highest to lowest,
and these systems of grading are easy for instructors to use and articulate. The Center for
Teaching and Learning Services at the University of Minnesota (2003) explains that this
form of grading works well in situations requiring rigid differentiation among students,
where restrictions are imposed; for example, when only a small number of students can be
selected, this technique works well. Norm-referencing requires close scrutiny of the
actual group that will be used as a reference for the comparison. This could foster
further insight into the course's subject area and help improve instruction. This form of
grading is most appropriate in a large classroom setting.
Two primary objections surround the norm-referenced form of grading. First, an
individual's grade is determined not only by his or her achievements and efforts but also by
the achievements and efforts of others. Popham (2002) illustrates this: when a teacher
asserts that a student "scored at the 90th percentile on a test," it means that the student's
test performance has exceeded the performance of 90% of the students in the test's norm
group. The second objection is that norm-referenced grading promotes competition rather
than cooperation. When students are knowingly pitted against each other, they are less
likely to help their fellow classmates, although this may also be a key to eliminating
cheating in tests and examinations.
Advantages and Disadvantages of Criterion-referenced Grading
Criterion-referenced grading provides feedback relative to learning targets and/or
performance standards. This form of grading emphasizes the objectives of the curriculum.
The student's grade is not affected by the rest of the class. Under this form, if improvement
is needed, a student can simply consult the identified learning targets to know what areas
to work on. Unlike norm-referenced grading, this system is adaptable to any size of
classroom setting.
Two disadvantages present themselves as hurdles for the criterion-referenced form of
grading. First, learning targets and/or performance standards must be established, which is
difficult to do well. Second, teachers set the criteria, standards, or targets based on what
they know about how students usually perform, which gives the supposedly absolute
standards a normative element.
Self-Referenced Model
The growth-based grading framework is based on comparing a learner's performance with
the instructor's perceptions of the learner's ability. Learners who perform above the level
the instructor perceives them capable of receive higher grades than those the instructor
perceives as having made less improvement. Thus, a
learner who has made more improvement may receive a higher grade than another learner
regardless of their absolute levels of attainment. It is therefore essential that the instructor
maintain rigorous records so as to reduce the potentially unreliable nature of judgments of
capability. While the self-referencing method reduces the overall competitiveness of
grades, it presents an irony in that learners coming in the course with the highest levels of
achievement tend to have the lowest levels of change even though their final absolute
levels of achievement remain the highest (Nitko, 2001).
Grading Methods
Absolute grading methods produce grades that share some general shortcomings,
independent of the particular method that generated the grades. For example, unless they
are accompanied by a description of the performance standards or the content domains that
have been studied, the meaning of an absolute grade is difficult to understand.
Furthermore, no criterion-referenced grading method produces grades that are strictly
absolute in meaning. Such grades are based on performance standards that nearly always
have a normative basis. A "B writer" should be able to use correct referencing techniques,
the teacher may say, but if most college students do not and cannot, the standard is likely
to be lowered to reflect reality (the norm). Note that adjusting grades instead of modifying
the standards would contribute to meaningless grades.
Fixed Percent Scale
This method uses fixed ranges of percent-correct scores as the basis for assigning grades to
the components of a final grade. A grading scale used by many
institutions/universities is the following: 93-100 = A, 85-92 = B, 78-84 = C, etc. These
ranges are fixed at the beginning of the reporting period and are applied to the scores
from each grading component -- written tests, demonstrations, papers and performance
assessments. Component grades are then weighted and averaged to get the final grade.
Unfortunately, a percent score will be meaningless unless the domain of tasks, behaviors,
or knowledge upon which the assessment was based is defined explicitly. That is, a test
score of 100% should mean that the student has complete or thorough attainment of the
key elements of the area of knowledge and mastered the basic skills that were sampled by
the test. But if an assessment is developed in such a way that the underlying content
domain is ill-defined or vague, the percent-correct scores from it will have no meaning
beyond the specific tasks that comprise the assessment. Scores of 80% on a math test and
75% on a speech say little about performance unless we know the difficulty of the domain
of math problems and which important criteria were used to score the speech. In sum,
percent scores cannot provide a reference to absolute performance standards unless
the underlying knowledge domain and desired basic skills to be mastered are adequately
described.
Another serious drawback of this grading method is the fact that the percent-score ranges
for each grade symbol are fixed for all grading components. For example, the fact that 93%
is needed for an A places severe and unnecessary restrictions on the teacher when he or she
is developing each assessment tool. If the teacher believes there should be some A grades,
a 20-point test must be easy enough so that some students will score 19 or higher;
otherwise there will be no A grades. This circumstance creates two major problems for the
teacher as the assessment developer. First, it requires that assessment tasks be chosen more
for their anticipated easiness than for their content representativeness. As a result, there
may be an over representation of easy concepts and ideas, an overemphasis on facts and
knowledge, and an under representation of tasks that require higher order thinking skills.
The teacher may need to "fudge" on the domain definition to accommodate the fixed
grading scale.
A further limitation of this method relates to the accuracy of the assessment information
obtained. Since the grade cutoff scores usually are located between the 60% and 100%
points on the percent scale, most of the scale points (0-60) are of no value in describing
the different absolute levels of achievement. For example, if A and B performance must
be in the range of 85%-100%, the very best B achievement and the very worst B
achievement are separated by only eight points (85-92), as are the very best and very
worst A achievements (93-100). These are fairly narrow score ranges, especially
considering the fact that a 100-point scale is available for use. Because these ranges are
narrow and fixed, they will contribute to fairly inaccurate grades when the scores of any
single grading component are not very dependable. If the grade ranges could be made
larger when the scores of certain components are fairly inaccurate, then more accurate
grades would probably result.
The fixed percent scale method usually produces grades that have little meaning in terms of
content standards, and it often yields grades that are of questionable accuracy. The
percent cutoffs for each grade are arbitrary and, thus, not defensible. Why should the
cutoff for an A be 93, 92, or 90? Further, why shouldn't the A cutoff be 88% for a certain
test, 91% for another, and 83% for a certain simulation exercise? Is there any reason why
the same numerical standards must be applied to every grading component when those
standards are arbitrary and void of absolute meaning?
Total Point Method
When a teacher accumulates points earned by students throughout a reporting period and
then assigns grades to the point total at the end of the period, the method is known as the
total point method. First the teacher decides which components will figure into the final
grade and what the maximum point value of each component will be (this is done before
tests are developed and before the scoring criteria for projects are established). That is, the
teacher formulates the grading procedure before the start of the program. For example, you may
decide to use two tests (50 points each), two papers (40 points each), and a report (20
points) for a maximum of 200 points for the quarter. Then the grade cutoffs might be set as
follows: 180-200 = A, 160-179 = B, 140-159 = C, 120-139 = D and 0-119 = F. Implicit in
this set of ranges is a percent scale with grade cutoffs of 90%, 80%, 70%, and 60%. These
cutoffs depend on the teacher's own preference; there is no hard and fast rule or rationale
for the percentage cutoffs. They are as arbitrary, and nearly as meaningless, as those derived from
the fixed percent scale method. Unlike the fixed percent scale method, however, grades are
not assigned to components with the total point method. And unlike grading on the curve,
the arbitrary cutoff points are established at the beginning of the reporting period, before
assessment results are known.
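The bookkeeping of the total point method can be sketched as follows; the component values and cutoffs are taken from the 200-point example above, and the particular scores are hypothetical.

```python
def total_point_grade(component_scores):
    """Total point method: sum the points earned across all grading
    components, then apply cutoffs fixed at the start of the period.
    Cutoffs follow the 200-point example in the text:
    180-200 = A, 160-179 = B, 140-159 = C, 120-139 = D, 0-119 = F.
    """
    total = sum(component_scores)
    for cutoff, grade in ((180, "A"), (160, "B"), (140, "C"), (120, "D")):
        if total >= cutoff:
            return total, grade
    return total, "F"

# Two 50-point tests, two 40-point papers, one 20-point report:
total, grade = total_point_grade([45, 47, 36, 38, 18])  # 184 points -> A
```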
One of the difficulties of using this method is that often a decision has to be made about
the maximum score on a project or test before the teacher has had ample time to think
about the key ingredients of the assessment. Here's how this circumstance can contribute to
poor assessment development practices: Suppose I need a 50-point test to fit my grading
scheme, but I find as I build the test that I need 32 multiple-choice items to sample the
content domain thoroughly. I find this unsatisfactory (or inconvenient) because 32 does not
divide into 50 evenly (50/32 = 1.5625!). To make life simpler, I could drop 7 items and use a
25-item test with 2 points per item. If I did that, my point totals would be in fine shape, but
my test would be an incomplete measure of the important unit objectives. The fact that I had
to commit to 50 points prematurely dealt a serious blow to obtaining meaningful
assessment results.
Another potential drawback to the total point method is the ease with which extra credit
points can be incorporated to beef up low point totals. This practice can simultaneously
distort the meaning of the content domain and final grade. When the extra tasks are
challenging and relevant to current instruction, this seems like a reasonable way to
individualize and motivate high achieving students. In such cases, the outcome is likely to
make high point totals even higher. But extra credit that simply allows students to
compensate for low test scores or inadequate papers is not reasonable, especially if the
extra work does not help them overcome demonstrated deficiencies. The point here is that
this method of grading makes it convenient for teachers to allow extra credit work of the
latter form to compensate for low achievement. When that happens, the grades take on a
new meaning because the relevant domain of knowledge and skills gets redefined by the
nature of the extra credit tasks.
Content Based Method
This method involves assigning a grade to each component and then weighting the separate
grades to obtain the final one. The teacher develops brief descriptions of the achievement
levels (standards) associated with each grading symbol. These standards for "A work" and
"B work" and so on are then used to establish the grade cutoff scores for every component.
Compared to the fixed percent scale method, which keeps cutoff scores constant for all
components, this method keeps the performance standards for a grade constant but lets the
cutoff scores change. Here is an example of how the method might be used:
Suppose you have prepared a 30-item test to measure the achievement of most of the
objectives in a unit of instruction. Assuming that grades A through F will be assigned to
test scores, you will need to develop a brief description of the performance levels you
expect students to reach for each of the five possible grades. For example, you might
describe C expectations as "knows basic concepts and can do the most important skills;
lacks some prerequisites for later learning." Using descriptions like these, you can begin
an item-by-item review of the test.
For question No. 1, ask whether a student with only minimum achievement (D) should be
able to answer correctly. If so, record a D next to the item; if not, pose the same question
for grade C achievement. This process continues until the first item has been classified. For
items that the teacher believes most A students will not necessarily answer correctly, a
symbol such as N can be used to indicate that no grade level applies. After you have
classified each item with a symbol, the D-F cutoff score is found by adding the number of
D symbols. Then the C-D cutoff is obtained by adding the number of D and C symbols.
The B-C cutoff is the sum of D, C, and B symbols, and the A cutoff is the sum of the D, C,
B, and A symbols. To account for negative errors of measurement, you should lower each
grade cutoff by one or two points. Such adjustments for error at this stage of grading would
make it unnecessary to review borderline cases at a later time.
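The item-by-item review described above amounts to a cumulative count. The sketch below assumes a hypothetical 30-item test and a one-point error allowance; both the item labels and the allowance are illustrative, not prescribed by the text.

```python
def content_based_cutoffs(item_labels, error_allowance=1):
    """Derive grade cutoff scores from an item-by-item review.

    item_labels holds one symbol per item: "D", "C", "B", "A", or
    "N" (no grade level applies).  The D-F cutoff is the count of
    D items, the C-D cutoff adds the C items, and so on; each
    cutoff is then lowered by error_allowance points to account
    for negative errors of measurement.
    """
    cutoffs, running = {}, 0
    for grade in ("D", "C", "B", "A"):
        running += item_labels.count(grade)
        cutoffs[grade] = max(running - error_allowance, 0)
    return cutoffs  # minimum raw score needed for each grade

# A hypothetical 30-item test: 8 D items, 8 C, 7 B, 5 A, and 2
# items (N) that even A students may miss:
labels = ["D"] * 8 + ["C"] * 8 + ["B"] * 7 + ["A"] * 5 + ["N"] * 2
cutoffs = content_based_cutoffs(labels)
```

Note that the N items raise the test's maximum score without raising any cutoff, so even the A range has room for measurement error.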
All grading methods involve subjectivity, and this one requires two main types of
subjective decisions. The first type entails the development of explicit expectations for the
achievers at each of the letter-grade levels. What is B achievement like and how is it
different from C achievement? Good teachers might disagree with one another about how to
define these performance standards. The other subjective decision making occurs when
items are reviewed to determine the grade category to which each one belongs. Again,
good teachers may disagree about whether a "B student" should be able to answer a
particular item correctly. Notice that these two types of judgments do not require that
subjective decisions be made about individual students. There is no need to decide, for
example, whether Jana is a C student or whether Matt could answer a certain question
correctly. The judgment required here is about standards and about the particular tasks that
students at each level should be expected to do.
Some Relative Grading Methods
Grades derived from any of the relative grading methods will have certain shortcomings
that are inherent in any grading intended to have a norm-referenced meaning. For example,
unless the person interpreting the grade knows which reference group was used, the grade
means very little. Was it the student's class, a combination of classes, or classes from the
past two years? Further, by definition, a norm-referenced grade does not tell what a student
can do; there is no content basis other than the name of the subject area associated with the
grade.
Grading on the Curve
The curve referred to in the name of this method is the normal bell-shaped curve that is
often used to describe the achievements of individuals in a large heterogeneous group.
The idea behind this method is that the grades in a class should follow a normal
distribution, or one nearly like it. Under this assumption, the teacher determines the
percentage of students who should be assigned each grade symbol so that the distribution
is normal in appearance. For example, the teacher may decide that the percentages of A
through F grades in the class should be 10%, 20%, 40%, 20%, and 10%, respectively.
Since some teachers who use the method rightly believe that classroom groups are too
small for their achievement scores to resemble a normal curve, they choose percentages
that, in their judgment, are more realistic. So they may decide on 20%, 35%, 30%, 10%,
and 5%. The percentages are selected arbitrarily and are treated like grade quotas so that
the top 20% of students in terms of their composite scores will earn an A, the next 35%
would be assigned a B, and so on.
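The quota procedure just described can be sketched as follows. The default quotas are the 20/35/30/10/5 split from the example above; tie handling is deliberately naive and a real grader would treat tied scores more carefully.

```python
def grade_on_curve(scores, quotas=(("A", 0.20), ("B", 0.35), ("C", 0.30),
                                   ("D", 0.10), ("F", 0.05))):
    """Assign grades by rank using fixed percentage quotas (grading
    on the curve).  Tied scores are broken by their position in the
    list, which a real grader would want to handle more carefully.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    grades = [quotas[-1][0]] * len(scores)  # leftovers get the lowest grade
    start = 0
    for grade, fraction in quotas:
        count = round(fraction * len(scores))
        for i in ranked[start:start + count]:
            grades[i] = grade
        start += count
    return grades

# For a class of 20, the quotas yield 4 A's, 7 B's, 6 C's, 2 D's, 1 F:
class_grades = grade_on_curve(list(range(20, 0, -1)))
```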
Grading on the curve is a simple method to use, but it has serious drawbacks. The fixed
percentages are nearly always determined arbitrarily, and the percentages do not account
for the possibility that some classes are superior and others are inferior relative to the
phantom "typical" group the percentages are intended to represent. In addition, the use of
the normal curve to model achievement in a single classroom is generally inappropriate,
except in large required courses at the high school and college levels.
Distribution Gap Method
When the composite scores of a class are ranked from high to low, there will usually be
several short intervals in the score range where no student actually scored. These are gaps.
This method of grade assignment involves finding the gaps in the distribution and drawing
grade cutoffs at those places. For example, if the highest composite scores in a class were
211, 209, 209, 205, 197, 196, ... then the teacher might use the gap between 205 and 197 to
separate the A and B grades. The gap between 211 and 209 is too small and might
produce too few A grades. The one between 209 and 205 might be large enough, but 205
seems more like 209 than 197.
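Finding the widest gap in a ranked score list is simple to sketch; the function below uses the score list from the example above and returns only the single widest gap, whereas a teacher would inspect several candidate gaps.

```python
def widest_gap(scores):
    """Distribution gap method, sketched: rank the distinct composite
    scores and return the adjacent pair separated by the widest gap,
    where a grade cutoff could be drawn."""
    ranked = sorted(set(scores), reverse=True)
    gaps = [(hi - lo, hi, lo) for hi, lo in zip(ranked, ranked[1:])]
    return max(gaps)

# The example from the text: the widest gap (8 points) falls
# between 205 and 197, separating the A and B grades.
gap, upper, lower = widest_gap([211, 209, 209, 205, 197, 196])
```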
In some score distributions there are many wide gaps; in others there are only a few
narrow gaps. The sizes and locations of the gaps are determined by random errors of
measurement as well as by actual differences among students in achievement. For
example, Mike's 197 might have been 203 (if there had been less error in his
scores), and Theo's 205 might have been 200. Under those circumstances, the A-B
gap would be less obvious, and many final grade decisions would have had to be made by
reviewing borderline cases.
When gaps are wide enough, this method helps the teacher avoid disputes with students
about near misses. But when the gaps are narrow, too much emphasis is placed on the
borderline information that the teacher had decided was not relevant enough or accurate
enough to be included among the set of grading components that formed the composite.
Only occasionally will the gap distribution method yield results that are comparable to
those obtained with more dependable and defensible methods.
Standard Deviation Method
This relative method is the most complicated computationally, but it is also the fairest at
producing grades objectively. It uses the standard deviation, a statistic that tells the
average number of points by which the scores of students differ from their class average. It
is a number that describes the dispersion, variability, or spread of scores around the
average score. In this method, the standard deviation is used like a ruler to identify grade
cutoff points.
Suppose you have formed composite scores for your class of 25 students and that the
average was 129 and the standard deviation was 10. (Consult an introductory measurement
or statistics book to see how to compute these statistics.) Assuming C to be the
average grade, we can find the cutoff between B and C by adding, for example, one-half of
the standard deviation to the average (129 + (0.5)(10) = 134). Then the A-B cutoff is
found by adding 1.5 standard deviations (for example) to the average (129 + (1.5)(10) =
144). By subtracting corresponding values from the average score, the C-D cutoff is found
to be 124, and the D-F cutoff is 114. (Can you verify these values?) The ranges for each
grade are the following: A = 145 and up, B = 135 - 144, C = 124 - 134, D = 114 - 123, and
F = 113 and below. These ranges can be made smaller or larger for groups of higher or
lower ability level by adjusting the number of standard deviations used to find the cutoffs.
For a particularly able class, for example, the A-B cutoff might be only one standard
deviation above the average and the B-C cutoff might be 0.3 above, rather than 0.5.
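The computation above can be sketched directly. The default offsets reproduce the worked example (A-B at +1.5 SD, B-C at +0.5, C-D at -0.5, D-F at -1.5), and the three-score list in the usage line is a toy input chosen only because its average is 129 and its standard deviation is 10.

```python
from statistics import mean, stdev

def sd_cutoffs(scores, offsets=(("A-B", 1.5), ("B-C", 0.5),
                                ("C-D", -0.5), ("D-F", -1.5))):
    """Standard deviation method: place each grade cutoff a chosen
    number of sample standard deviations above or below the class
    average.  The default offsets follow the worked example in the
    text; a teacher would adjust them for the group's ability level.
    """
    avg, sd = mean(scores), stdev(scores)
    return {boundary: avg + k * sd for boundary, k in offsets}

# A toy score list with average 129 and standard deviation 10
# reproduces the cutoffs in the text: 144, 134, 124, and 114.
cutoffs = sd_cutoffs([119, 129, 139])
```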
Unlike grading on the curve, this method requires no fixed percentages in advance, and
unlike the distribution gap method, the cutoff points are not tied to random error. When
the teacher has some notion of what the grade distribution should be like, some trial and
error might be needed to decide how many standard deviations each grade cutoff should
be from the composite average. When a relative grading method is desired, the standard
deviation method is most attractive, despite its computational requirements.
References
Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory.
Englewood Cliffs, N.J.: Prentice-Hall.
Bloom, B. S. (1967). Toward a theory of testing which includes measurement,
evaluation and assessment. In Proceedings of the Symposium on Problems in the
Evaluation of Instruction. University of California, Los Angeles.
Center for Teaching and Learning Services. (2003). Grading systems. Retrieved
November 30, 2004, from http://www.teaching.umn.edu
Davis, B. G., Wood, L., & Wilson, R. (1983). The ABCs of teaching excellence. Berkeley:
Office of Educational Development, University of California.
DeVellis, R. F. (2011). Scale development: Theory and applications (3rd ed.).
Thousand Oaks, CA: Sage Publications.
Ebel, R. L. (1979). Essentials of educational measurement (2nd ed.). Englewood Cliffs,
NJ: Prentice Hall.
Eble, K. E. (1988). The craft of teaching (2nd ed.). San Francisco: Jossey-Bass.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists.
Mahwah, NJ: Lawrence Erlbaum Associates.
Erickson, B. L., & Strommer, D. W. (1991). Teaching college freshmen. San Francisco:
Jossey-Bass.
Gronlund, N. E., & Linn, R. L. (2005). Measurement and assessment in teaching. New
Delhi: Baba Barkha Nath Printers.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item
response theory. Newbury Park, CA: Sage Publications.
Hays, R. D., Morales, L. S., & Reise, S. P. (2000). Item response theory and health
outcomes measurement in the twenty-first century. Medical Care, 38(9, Suppl. II),
II28–II42.
Holman, R., Glas, C. A. W., & de Haan, R. J. (2003). Power analysis in randomized
clinical trials based on item response theory. Controlled Clinical Trials, 24,
390–410.
Martuza, V. R. (1977). Applying norm-referenced and criterion-referenced measurement
in education. Boston, MA: Allyn and Bacon.
Nitko, A. J. (2001). Educational assessment of students (3rd ed.). Englewood Cliffs, NJ:
Prentice Hall.
Oosterhof, A.C. (1987). Obtaining intended weights when combining students' scores.
Educational Measurement: Issues and Practice, 6(4), 29-37.
Popham, W. J. (2002). Classroom assessment: What teachers need to know. Boston, MA:
Allyn and Bacon.
Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement.
Annual Review of Clinical Psychology, 5, 27–48.
Scriven, M. (1974). Evaluation of students. Unpublished manuscript.
de Ayala, R. J. (2009). The theory and practice of item response theory. New York:
Guilford Press.
Edelen, M. O., & Reeve, B. B. (2007). Applying item response theory (IRT) modeling to
questionnaire development, evaluation, and refinement. Quality of Life Research,
16, 5–18.
Tractenberg, R. E. (2010). Classical and modern measurement theories, patient reports,
and clinical outcomes. Contemporary Clinical Trials, 31(1), 1–3.
http://doi.org/10.1016/S1551-7144(09)00212-2