UNIT - 6
PHILOSOPHY AND PSYCHOMETRIC PROPERTIES OF TESTS
Written by Dr. Muhammad Azeem
INTRODUCTION
Psychometrics is a field of study concerned with the theory and technique of
psychological measurement. Psychometrics deals with the construction and validation of
assessment tools/instruments for testing, assessment, and related activities. It is usually
concerned with assessing an individual's knowledge, ability, personality, and behaviour.
Reliability and validity are the two major psychometric properties of assessment
tools/instruments: an instrument that can claim excellent psychometric properties is one
that is both reliable and valid.
A reliable assessment instrument consistently assesses/measures the same construct,
whereas a valid assessment tool measures what it claims to measure. If an instrument is
valid, it is always reliable; however, an instrument can be reliable without being valid.
Objectives of Unit
After reading this unit, students will be able to:
1. Understand the philosophy of tests
2. Understand Classical Test Theory
3. Understand Item Response Theory
4. Understand the psychometric properties of items and tests
6.1. Philosophy of Testing
The traditional philosophy of testing is that the ability to learn is randomly distributed in
the population. This implies that if a learning task is assigned to a class and a test is then
administered to study performance, the students' performance scores will be distributed
normally; that is, all students cannot benefit equally from the teaching-learning process.
The new philosophy is that all students can attain mastery of any learning task, provided
they are given sufficient opportunity and time. This means that absolute standards can be
set for measuring performance.
In testing, measurement is a process whereby the numerical relation between a
magnitude of a quantitative attribute and a unit of the same attribute is estimated using
some procedure, often a standardized set of operations. Realists believe that one can know
the real world as it truly is. As a philosophy of testing and measurement, realism is
characterized by behaviorally stated objectives, measurement-driven instruction, and report
cards, along with the use of programmed materials. Realism emphasizes that measurable
results from students can be obtained to show precisely how well a student is achieving.
Existentialists believe that each person should learn to make choices from the alternatives
available in society. As a philosophy of education, existentialism does not advocate
predetermined objectives for student achievement or testing to determine achievement.
Individual motivation is central, and feelings are recognized as the most important part of
the human condition. Experimentalists believe that one cannot know ultimate reality, but
one can experience it. Experimentalists believe in integrating school and society. Students
should be assessed for their problem-solving ability, and evaluation consists of using
relevant sources of information to solve a problem and to test hypotheses. Idealists believe
in a subject-centered curriculum, and idealism emphasizes a coherence theory in testing.
Students should be able to use reason and logic effectively
and to honor eternal values. Perennialism is a philosophy directly related to idealism that
calls for a curriculum based on the "Great Books of the Western World", with a
standardized curriculum that regards childhood and youth as obstacles to be overcome
through education.
Key Points
1. The traditional philosophy of tests holds that students' performance scores are
distributed normally, because all students cannot benefit equally from the
teaching-learning process.
2. The new philosophy is that all students can attain mastery of any learning task
subject to the provision of opportunity and time.
6.2. Theories of Test Development (CTT, IRT)
i. Testing Theory
Theory is a system of rules, procedures, and assumptions used to produce a result. The
theory of psychological tests and measurement, or, as typically referred to, test theory or
psychometric theory, offers a general framework and a set of techniques for evaluating the
development and use of psychological tests.
ii. Purpose of test theories
Testing is viewed as a systematic method of sampling one or more human characteristics
and the representation of these results for an individual in the form of descriptive
statements (Bloom, 1967).
Following are the main purposes of testing theories:
 to formulate mathematical relationships between test properties so that some of them
can be manipulated to optimize the desired target properties, usually those that are
diagnostically important, e.g. validity and reliability;
 to improve the quality of a test by ensuring its validity and reliability;
 to predict outcomes of psychological testing.
iii. Psychometrics
Psychometrics is the field of study concerned with the theory and technique of
psychological measurement, which includes the measurement of knowledge, abilities,
attitudes, personality traits, and educational measurement.
iv. Theoretical Approaches
Psychometric theories predict outcomes of psychological testing, such as the difficulty of
items or the ability of test-takers. Generally speaking, the aim of these theories is to
understand and improve the reliability and validity of psychological tests.
1. Classical Theory
Classical test theory is based on the true score model and is therefore also known as
"true score theory." It assumes that each individual has a true score, which would be
obtained if there were no error in measurement. The observed score for each person may
differ from the individual's true score:
X = T + E
where X is the observed score, T is the true score, and E is the error score.
a) Role of error in estimating test scores
 The basic measure of error is the standard deviation of the error scores, which is
called the standard error of measurement.
 The larger the standard error of measurement, the less certain we can be about the
accuracy of an observed score.
 On the other hand, a small standard error of measurement indicates that an
individual's observed score is close to the true score (Kaplan & Saccuzzo, 1997).
b) Item Difficulty
The item-difficulty index, denoted by p, expresses the difficulty of an item (or test).
According to Linn and Gronlund (2005), the difficulty of an item indicates the percentage
of students who get the item right.
In the classical test theory CTT, item difficulty:
 compares an examinee’s ability to the probability of success on a particular
item.
 considers a pool of examinees collectively and empirically examines their
success rate on a particular item.
 calculates the success rate of a particular pool of examinees on an item.
The formula for the item-difficulty index is
p = (number of students with the correct answer) / (total number of students)
As an example, assume that 50 people take a test and 30 test-takers answer the item
correctly. Then
p = 30 / 50 = 0.6
If no examinee answers the item correctly, then
p = 0 / 50 = 0.0
If all examinees answer the item correctly, then
p = 50 / 50 = 1.0
Thus the item difficulty index p, the proportion of examinees that answer the item
correctly, has a range of 0 to 1.
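For readers who prefer to script the calculation, a minimal Python sketch of the difficulty index is given below; the data simply mirror the 30-out-of-50 example above.

def difficulty_index(responses):
    """Proportion of examinees answering the item correctly (p)."""
    return sum(responses) / len(responses)

# 50 examinees, 30 of whom answered correctly (1 = correct, 0 = incorrect)
responses = [1] * 30 + [0] * 20
print(difficulty_index(responses))  # 0.6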
General Interpretation of Difficulty Index
Interpretation          Difficulty Index Range
Very Easy Item          Above 0.81
Easy Item               0.61 - 0.80
Average Item            0.41 - 0.60
Difficult Item          0.21 - 0.40
Very Difficult Item     Less than 0.20
Optimal Difficulty of an Item
The optimal level (D max) for an acceptable p value depends on the number of options per
item. A formula that can be used to compute the optimal level is:
D max = (1 + g) / 2
where g is the probability of selecting each option by chance (g = 1 divided by the number
of options).
Examples of Type of Questions            Number of Options per Item    Optimal Difficulty D max = (1 + g)/2
Single Stem Question                     1                             1
Yes/No, Fill-in-Blanks, True/False,
Matching, etc.                           2                             0.75
MCQs, Matching, Fill-in-Blanks           3                             0.665
MCQs, Matching                           4                             0.625
MCQs, Matching                           5                             0.60
MCQs, Matching                           6                             0.583
Extended Matching Questions (EMQs)       Large number of options       About 0.5
In general, as the number of options increases, the optimal p value decreases; we would
expect questions with more options to be more difficult to answer. A test developer should
not include too many items with a difficulty index above the optimal difficulty, because
such items are too easy for the examinees and therefore will not help to discriminate
among them.
Minimal Difficulty Level
The lower bound (minimum difficulty) for item difficulty can also be calculated. A test
developer should not include too many items with a difficulty index below the minimal
difficulty, because such items are too difficult for the examinees and therefore will not
help to discriminate among them.
D min = 1/k + 1.645 × √[ (1/k) × ((k − 1)/k) / n ]
where k and n represent the number of options of the MCQ and the number of examinees,
respectively.
For example, the minimal difficulty for an MCQ with four options and 100 examinees is:
D min = 1/4 + 1.645 × √[ (1/4) × (3/4) / 100 ]
D min = 0.321
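Both bounds can be computed directly from the formulas above. A small Python sketch follows; the helper names are illustrative, not taken from any particular package.

import math

def optimal_difficulty(k):
    """D_max = (1 + g) / 2, where g = 1/k is the chance of guessing one option."""
    g = 1.0 / k
    return (1 + g) / 2

def minimal_difficulty(k, n):
    """D_min = 1/k + 1.645 * sqrt((1/k) * (1 - 1/k) / n)."""
    g = 1.0 / k
    return g + 1.645 * math.sqrt(g * (1 - g) / n)

print(optimal_difficulty(4))                  # 0.625 for a 4-option MCQ
print(round(minimal_difficulty(4, 100), 3))   # 0.321 for 100 examinees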
c) Item Discrimination
This index tells test developers how well an item functions to discriminate between
students who have achieved the objectives and those who have not. In the context of
psychometrics, discrimination is a desirable quality of test items. The discrimination index
is denoted by (d).
In classical test theory (CTT), item discrimination:
 refers to the ability of an item to differentiate between higher-ability examinees
and lower-ability examinees;
 is often expressed statistically as the Pearson product-moment correlation
coefficient between the scores on the item (e.g., 0 and 1 on an item scored
right/wrong) and the scores on the total test.
For this calculation, the test developer divides the test takers into the following three
groups according to their scores on the test as a whole:
Upper Group = an upper group consisting of the 27% highest achievers
Lower Group = a lower group consisting of the 27% lowest achievers
Average Group = a middle group consisting of the remaining 46%.
Calculate the following
a. pupper = “Proportion in upper group who got it right”
“# in the upper group who got it right / # of students in upper group who
answered the item”
b. pLower = “Proportion in lower group who got it right”
“# in lower group who got it right / # of students in lower group who answered
the item”
Find the difference between the two p values which is called item discrimination.
d = pupper - pLower
Range of Discrimination Index
a. Maximum
When all the examinees of the upper group answer the item correctly and all the examinees
of the lower group answer the item incorrectly, then
pupper = 1 and pLower = 0
d = pupper − pLower = 1 − 0 = 1
and the item has maximum positive discrimination between the groups. This means that the
item is helping you identify those students who have achieved your objectives (and those
who have not!).
b. Minimum
When all the examinees of the upper group answer the item incorrectly and all the
examinees of the lower group answer the item correctly, then
pupper = 0 and pLower = 1
d = pupper − pLower = 0 − 1 = −1
and the item has maximum negative discrimination between the groups, which means that
people with low scores got it right and people with high scores missed it (a bad thing).
c. Zero discrimination
When the same number of examinees in the upper group and the lower group answer the
item correctly, then
pupper = pLower = q (say)
d = pupper − pLower = q − q = 0
and the item is not helping us sort people into the two performance groups at all. This can
happen when either everyone gets the item right or everyone misses it. Such items are
non-discriminators. So the range of the discrimination index is −1 to +1.
General Interpretation of Discrimination Index
Interpretation     Discrimination Index Range
Ideal Item         About 0.5
Very Good Item     0.40 - 0.49
Good Item          0.30 - 0.39
Fair Item          0.20 - 0.29
Poor Item          Less than 0.19
Activity
Ten students have taken an objective test. The test comprises 10 items. In the table
below, the students' scores are listed from high to low; there are five students in
the upper half and five students in the lower half. A "1" indicates a correct
answer on the question; a "0" indicates an incorrect answer.
Student ID   Total Score (%)   Item 1  Item 2  Item 3  Item 4  Item 5  Item 6  Item 7  Item 8  Item 9  Item 10
AA 100 1 1 1 1 1 1 1 1 1 1
BA 90 1 1 1 1 1 1 1 1 0 1
CD 80 1 1 0 1 1 1 1 1 0 0
DR 70 0 1 1 1 1 1 0 1 0 1
EF 70 1 1 1 0 1 1 1 0 0 1
FG 60 1 1 1 0 1 1 0 1 0 0
GH 60 0 1 1 0 1 1 0 1 0 1
HM 50 0 1 1 1 0 0 1 0 1 0
IK 40 1 1 1 0 1 0 0 0 0 1
JN 30 0 1 0 0 0 1 0 0 1 0
Calculate the Difficulty Index (p) and the Discrimination Index (D) for each question
Item       # Correct (Upper group)   # Correct (Lower group)   Difficulty (p)   Discrimination (D)
Item 1
Item 2
Item 3
Item 4
Item 5
Item 6
Item 7
Item 8
Item 9
Item 10
Answer the following items:
1. Which item is the easiest?
2. Which item is the most difficult?
3. Which item has the poorest discrimination?
4. Which items would you eliminate first (if any) – why?
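The calculation requested in this activity can also be scripted and used to check your hand computations. A Python sketch using the response table above (upper half = top five scorers, lower half = bottom five):

# Rows are students ordered from highest to lowest total score;
# columns are items 1-10 (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # AA
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],  # BA
    [1, 1, 0, 1, 1, 1, 1, 1, 0, 0],  # CD
    [0, 1, 1, 1, 1, 1, 0, 1, 0, 1],  # DR
    [1, 1, 1, 0, 1, 1, 1, 0, 0, 1],  # EF
    [1, 1, 1, 0, 1, 1, 0, 1, 0, 0],  # FG
    [0, 1, 1, 0, 1, 1, 0, 1, 0, 1],  # GH
    [0, 1, 1, 1, 0, 0, 1, 0, 1, 0],  # HM
    [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],  # IK
    [0, 1, 0, 0, 0, 1, 0, 0, 1, 0],  # JN
]

upper, lower = responses[:5], responses[5:]
for item in range(10):
    correct_upper = sum(student[item] for student in upper)
    correct_lower = sum(student[item] for student in lower)
    p = (correct_upper + correct_lower) / len(responses)          # difficulty index
    d = correct_upper / len(upper) - correct_lower / len(lower)   # discrimination index
    print(f"Item {item + 1}: p = {p:.1f}, D = {d:.1f}")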
Output of Iteman Software (CTT based Software)
[Iteman item analysis output for one item (a four-option MCQ):

Item statistics: N = 357, P = 0.465, Rpbis = 0.055, Rbis = 0.069, Alpha without item = 0.832,
Mantel-Haenszel statistic = 2.933, p (bias) = 0.034, flagged as biased against urban examinees.

Option statistics (Prop., Rpbis, Rbis, Mean, SD):
  A (KEY)   0.465    0.055    0.069   24.114   7.400
  B         0.314    0.270    0.354   26.054   9.494
  C         0.067   -0.241   -0.463   14.875   5.367
  D         0.132   -0.212   -0.335   17.915   7.590
  Omit      0.022   -0.084   -0.233   14.500   3.586

Option counts: A = 166, B = 112, C = 24, D = 47, Omit = 8, Not Administered = 13.
Quantile plot data: the proportion of examinees choosing each option within five ability
groups (0-20%, 20-40%, 40-60%, 60-80%, 80-100%), used to draw the quantile plot.]
Explanation of Output
Item information
It tells us that the item ID is 35 and that it appears at serial number 35 in the test. The item
is an MCQ with four options, the key is A, and it is related to "Algebra". The flag box shows
that the item is flagged as biased and that there is a problem with the key.
Item Statistics
These tell us that 357 students attempted the item. The difficulty index (P) is 0.465 and the
discrimination index (Rbis) is 0.069. The item is biased against urban students (i.e. it does
not favour urban students).
Option Statistics
These give the distractor analysis. The discrimination index (Rbis) of options A and B is
positive. A is the key, so its discrimination should be positive; B, however, is a distractor,
so its discrimination index should be negative, yet it is also behaving like a key. This can
also be seen in the quantile plot.
2. Item Response Theory (IRT)
Item response theory (IRT) is a latent trait theory proposed in the field of psychometrics
for the measurement of ability, skills, proficiency, learning, performance, etc. It is widely
used in testing to calibrate and evaluate items in assessment instruments and to score
subjects on their abilities, attitudes, or other latent traits. During the last several decades,
educational assessment has used more and more IRT-based techniques to develop tests,
because the methodology can significantly improve measurement accuracy and reliability
while providing potentially significant reductions in assessment time and effort, especially
via computerized adaptive testing (Hays, Morales, and Reise 2000; Edelen and Reeve 2007;
Holman, Glas, and de Haan 2003; Reise and Waller 2009).
The most popular IRT model, the Rasch model, specifies a single latent trait to account for
all statistical dependencies among test items as well as all differences among test takers. It
is this underlying trait, typically denoted by theta (θ), that distinguishes items with respect
to difficulty and distinguishes test takers with respect to proficiency. The Rasch model
expresses the probability of a test taker's specific response to an item as a function of the
test taker's location on θ and of the relationship of the item to θ. The Rasch model is
probabilistic; if local item dependence is present, inaccurate estimation of item parameters,
test statistics, and examinee proficiency may result (Fennessy, 1995; Sireci, Thissen, &
Wainer, 1991; Thissen, Steinberg & Mooney, 1989). Item response theory advances the
concepts of item and test information to replace reliability: in place of reliability, IRT
offers the test information function, which shows the degree of precision at different values
of theta. Plots of item information are used to see how much information an item
contributes. Because of local independence, item information functions are additive, so the
test information function is simply the sum of the information functions of the items on the
exam; with a large item bank, test information functions help in judging measurement error
very precisely. In item response theory the Rasch model is used because, being
probabilistic, it offers a way to model the probability that a person with a "certain" ability
will be able to perform at a "certain" level, i.e. it measures a person's ability and
performance on a single continuum. It helps in checking how well the data fit the model,
diagnoses very quickly where the misfit is worst, and helps in understanding this misfit in
terms of the construction of the items and of the variable's theoretical development (Rasch
Analysis, 2012). Mean square, t, infit, and outfit values are used, as Bond and Fox (2001)
considered them important for making fit decisions, with more emphasis on infit values.
Item response theory is related to:
 latent trait theory;
 an approach in which each item on a test has its own item characteristic curve
that indicates the probability of getting that particular item right or wrong;
 the association between an individual's response to an item and the underlying
latent variable ("ability" or "trait") being measured by the instrument;
 a latent variable, expressed as theta (θ), that is a continuous one-dimensional
construct explaining the covariance among item responses;
 the fact that individuals at higher levels of θ have a higher probability of
responding correctly to an item.
a) Item Characteristics Curve
 It provides useful information about the behaviour of an item (item discrimination
index, item difficulty index, and guessing).
b) Item difficulty
In the one-parameter (Rasch) model the probability of a correct response is
P(θ) = 1 / (1 + e^(−(θ − b)))
where
• Each learner has ability θ
• Each item has difficulty b
This equation is called the one-parameter model.
 "b" is defined as the ability at which the probability of success on the item is 0.5
(50%), on a logit scale.
 "b", the item's level of difficulty, is a factor affecting an individual's probability of
responding in a particular way.
 "θ" and "b" are on the same scale, shown on the x-axis.
The figure above shows an item characteristic curve; the difficulty of the item and the
ability of an individual both increase from left to right.
Activity
Let’s try the following values:
θ = 0, b = 0? θ = 3, b = 0? θ = -3, b = 0?
θ = 0, b = 3? θ = 0, b = -3? θ = 3, b = 3?
θ = -3, b = -3?
What is P(θ)?
Enter these values into Excel and create the item characteristic curve using the
one-parameter model equation above.
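If Excel is not at hand, the same activity can be done in a few lines of Python using the one-parameter logistic form given above:

import math

def p_1pl(theta, b):
    """One-parameter (Rasch) model: P(theta) = 1 / (1 + exp(-(theta - b)))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

for theta, b in [(0, 0), (3, 0), (-3, 0), (0, 3), (0, -3), (3, 3), (-3, -3)]:
    print(f"theta = {theta:+d}, b = {b:+d}  ->  P = {p_1pl(theta, b):.3f}")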
c) Item discrimination
In the two-parameter model the probability of a correct response is
P(θ) = 1 / (1 + e^(−a(θ − b)))
where
• Each learner has ability θ
• Each item has difficulty b
• Each item has discrimination a
This equation is called the two-parameter model.
• The items on a test might also differ in the degree to which they can
differentiate individuals who have high trait levels from individuals who have
low trait levels.
The figure above shows that item 1 is more discriminating than item 2; the steepness of the
ICC represents the discrimination of the item.
Activity
Let’s try the following values:
θ = 0, b = 0, a = 5? θ = 3, b = 0, a = 2? θ = -3, b = 0, a = 3? θ
= 0, b = 3, a = 1? θ = 0, b = -3, a = 1? θ = 3, b = 3, a = 5? θ =
-3, b = -3. a = 10?
What is P(θ)?
Enter these values into Excel and create the item characteristic curve using the
two-parameter model equation above.
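A corresponding Python sketch for the two-parameter model, evaluated at the values listed in the activity:

import math

def p_2pl(theta, b, a):
    """Two-parameter model: P(theta) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

cases = [(0, 0, 5), (3, 0, 2), (-3, 0, 3), (0, 3, 1), (0, -3, 1), (3, 3, 5), (-3, -3, 10)]
for theta, b, a in cases:
    print(f"theta = {theta:+d}, b = {b:+d}, a = {a}  ->  P = {p_2pl(theta, b, a):.3f}")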
d) Guessing
In the three-parameter model the probability of a correct response is
P(θ) = c + (1 − c) / (1 + e^(−a(θ − b)))
where
• Each learner has ability θ
• Each item has difficulty b
• Each item has discrimination a
• Each item has guessing chance c
This equation is called the three-parameter model.
The figure above shows that item 2 has a higher guessing chance than item 1.
Activity
Let's try the following values:
θ = 0, b = 0, a = 5, c = 0.5?   θ = 3, b = 0, a = 2, c = 1?   θ = -3, b = 0, a = 3, c = 0.75?
θ = 0, b = 3, a = 1, c = 2?   θ = 0, b = -3, a = 1, c = 0?   θ = 3, b = 3, a = 5, c = 3?
θ = -3, b = -3, a = 10, c = 5?
What is P(θ)?
Enter these values into Excel and create the item characteristic curve using the
three-parameter model equation above.
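And a Python sketch for the three-parameter model with the activity values; note that c is interpretable as a guessing probability only when 0 ≤ c ≤ 1, so some of the listed values lie outside the usual range.

import math

def p_3pl(theta, b, a, c):
    """Three-parameter model: P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1 - c) / (1.0 + math.exp(-a * (theta - b)))

# A guessing parameter only makes sense for 0 <= c <= 1; values such as c = 2
# in the activity will produce "probabilities" outside [0, 1].
cases = [(0, 0, 5, 0.5), (3, 0, 2, 1), (-3, 0, 3, 0.75),
         (0, 3, 1, 2), (0, -3, 1, 0), (3, 3, 5, 3), (-3, -3, 10, 5)]
for theta, b, a, c in cases:
    print(f"theta = {theta:+d}, b = {b:+d}, a = {a}, c = {c}  ->  P = {p_3pl(theta, b, a, c):.3f}")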
Outputs of IRT Based Software ConQuest for Psychometric Properties of Test
Item Person Map Sample Test 1
[ConQuest output, Sample Test 1 (MAP OF LATENT DISTRIBUTIONS AND RESPONSE MODEL PARAMETER
ESTIMATES, terms in the model: person + item): an item-person (Wright) map with the person
distribution plotted on the left of the vertical line (each 'X' represents 0.6 cases) and the
item numbers on the right. Most item locations sit above the bulk of the person distribution.
The output notes that some parameters could not be fitted on the display.]
The item-person map output of Sample Test 1 shows the distribution of items (right side of
the map, represented by item numbers 1, 2, 3, etc.) and of persons (left side of the map,
represented by X). The map shows that the test is difficult, because most of the items lie
higher on the scale than the persons; in other words, there are no persons matched to most
of the items.
Item Person Map Sample Test 2
[ConQuest output, Sample Test 2 (MAP OF LATENT DISTRIBUTIONS AND RESPONSE MODEL PARAMETER
ESTIMATES, terms in the model: person + item): an item-person (Wright) map with the person
distribution plotted on the left of the vertical line (each 'X' represents 3.6 cases) and the
item numbers on the right. Most item locations sit below the bulk of the person distribution.]
The item-person map output of Sample Test 2 shows the distribution of items (right side of
the map, represented by item numbers 1, 2, 3, etc.) and of persons (left side of the map,
represented by X). The map shows that the test is easy, because most of the items lie lower
on the scale than the persons; in other words, there are no items matched to most of the
persons.
Fit Statistics
================================================================================
Sample of Fit Statistics
TABLES OF RESPONSE MODEL PARAMETER ESTIMATES
TERM 1: item
                                    UNWEIGHTED FIT                WEIGHTED FIT
 item   ESTIMATE (b)   ERROR^   MNSQ   CI            T      MNSQ   CI            T
--------------------------------------------------------------------------------
  1        0.202       0.062    0.95  (0.94, 1.06)  -1.8    1.03  (0.92, 1.08)   0.7
  2        0.217       0.061    0.84  (0.94, 1.06)  -5.6    0.97  (0.93, 1.07)  -0.9
  3       -0.382       0.072    0.84  (0.94, 1.06)  -5.7    0.94  (0.90, 1.10)  -1.2
  4       -0.265       0.070    0.84  (0.94, 1.06)  -5.8    0.93  (0.91, 1.09)  -1.4
  5       -0.676       0.079    0.92  (0.94, 1.06)  -2.7    0.97  (0.89, 1.11)  -0.6
  6        2.010       0.048    1.25  (0.94, 1.06)   7.9    1.13  (0.96, 1.04)   6.7
  7       -0.599       0.078    0.83  (0.94, 1.06)  -6.2    0.94  (0.89, 1.11)  -1.0
  8        1.200       0.051    1.08  (0.94, 1.06)   2.6    1.07  (0.95, 1.05)   3.0
  9        0.088       0.064    1.01  (0.94, 1.06)   0.5    1.05  (0.92, 1.08)   1.2
 10        0.298       0.060    0.90  (0.94, 1.06)  -3.6    0.96  (0.93, 1.07)  -1.1
 11        0.897       0.053    1.02  (0.94, 1.06)   0.5    1.03  (0.95, 1.05)   1.2
 12        1.533       0.049    1.01  (0.94, 1.06)   0.3    1.02  (0.96, 1.04)   1.1
 13       -0.440       0.074    0.92  (0.94, 1.06)  -2.9    0.96  (0.90, 1.10)  -0.8
 14        0.854       0.054    0.99  (0.94, 1.06)  -0.3    1.04  (0.94, 1.06)   1.4
 15       -0.798       0.083    0.79  (0.94, 1.06)  -7.8    0.91  (0.88, 1.12)  -1.4
 16       -0.826       0.084    1.03  (0.94, 1.06)   1.0    0.93  (0.88, 1.12)  -1.1
 17        0.413       0.059    0.92  (0.94, 1.06)  -2.9    0.98  (0.93, 1.07)  -0.5
 18       -2.270       0.144    0.51  (0.94, 1.06) -20.2    0.91  (0.75, 1.25)  -0.6
 19       -0.996       0.088    0.88  (0.94, 1.06)  -4.4    0.99  (0.87, 1.13)  -0.1
 20       -1.284       0.098    0.75  (0.94, 1.06)  -9.2    0.85  (0.85, 1.15)  -2.0
 21       -1.599       0.110    0.72  (0.94, 1.06) -10.3    0.85  (0.82, 1.18)  -1.7
 22        1.398       0.050    1.10  (0.94, 1.06)   3.4    1.08  (0.96, 1.04)   3.4
 23       -1.500       0.106    0.57  (0.94, 1.06) -17.5    0.86  (0.83, 1.17)  -1.7
 24       -1.088       0.091    0.89  (0.94, 1.06)  -3.9    0.91  (0.86, 1.14)  -1.3
 25        1.145       0.052    0.99  (0.94, 1.06)  -0.2    1.00  (0.95, 1.05)   0.1
 26       -0.916       0.086    0.90  (0.94, 1.06)  -3.4    0.91  (0.87, 1.13)  -1.4
 27        2.314       0.048    1.22  (0.94, 1.06)   7.0    1.09  (0.97, 1.03)   5.2
 28        1.502       0.050    1.15  (0.94, 1.06)   4.8    1.11  (0.96, 1.04)   4.9
 29        0.561       0.057    1.32  (0.94, 1.06)   9.8    1.10  (0.94, 1.06)   3.0
 30       -0.116       0.067    0.96  (0.94, 1.06)  -1.5    0.95  (0.91, 1.09)  -1.0
 31       -0.550       0.076    0.86  (0.94, 1.06)  -4.9    0.90  (0.89, 1.11)  -1.8
 32       -0.327*      0.071    0.99  (0.94, 1.06)  -0.2    0.99  (0.90, 1.10)  -0.3
--------------------------------------------------------------------------------
An asterisk next to a parameter estimate indicates that it is constrained.
Separation Reliability = 0.996
Chi-square test of parameter equality = 10108.22, df = 31, Sig Level = 0.000
^ Empirical standard errors have been used
================================================================================
The fit statistics show how well the model fits the data. If the mean square (MNSQ) value
of an item lies within the corresponding confidence interval (CI), the item fits the IRT
model. Items 6, 8, 22, 27, 28, and 29 have MNSQ values outside their confidence intervals
and therefore misfit the IRT model. The Estimate column in the fit statistics table gives the
difficulty level of each item; for example, item 30 is of average difficulty (b ≈ 0), as is also
shown in the item-person map of Sample Test 2. The separation reliability index of 0.996
tells us that the test is measuring a wide range of abilities.
Comparison of IRT and CTT
CTT and IRT differ in many respects, although the two testing theories also have much in
common. A crucial similarity is that both are models of performance: if the model
assumptions are not met, conclusions and interpretations will not be supportable, and the
investigator will not necessarily be able to test the assumptions. However, in the case of
IRT, there are statistical procedures to help determine whether the construct is causal or
emergent (Tractenberg, 2010). Classical test theory (CTT) has been extremely popular in
the development, characterization, and sometimes selection of outcome measures in the
field of testing. IRT is powerful and offers options for measuring outcomes more accurately
that CTT does not provide; however, IRT modeling is complex.
The following is a general comparison of CTT and IRT:
Area                          CTT                                     IRT
Nature                        Traditional                             Modern
Complexity                    Simple                                  Complex
Model                         Linear                                  Nonlinear
Scores                        True score                              Ability scores
Inferences                    Student's ability                       Student's expected scores
Scope                         Narrow                                  Broader
Focus                         More on the test                        More on the item
Assumptions                   Weak                                    Strong
Item-ability relationship     Not specified                           Item characteristic function
Ability                       Test scores or estimated true scores    Ability scores are reported on the
                              are reported on the test score scale    scale -∞ to +∞ or a transformed scale
Invariance of item and        No: item and person parameters are      Yes: item and person parameters are
person parameters             sample dependent                        sample independent if the model fits
                                                                      the test
Sample size (for item         200 to 500 in general                   Depends on the IRT model, but larger
parameter estimation)                                                 samples (over 500 in general) are needed
Dependency                    Item properties depend on a             Item properties do not depend on a
                              representative sample                   representative sample
IDF                           Sample independent                      Sample dependent
6.4. Standard Setting
Changes in educational assessment are currently being called for within both the fields of
measurement and evaluation. Traditional forms of assessment of knowledge provide a
standard-setting method for assigning numerical scores to determine letter grades, but they
rarely reveal how students actually understand and can reason with acquired ideas or apply
their knowledge to solving problems. Students' achievement of curriculum objectives and
institutional standards is reflected through the grading process; what is in question is the
model by which the grading process is carried out. There are three grading models most
commonly employed in educational settings and institutions. The first model is
norm-referenced: norm-referenced grading refers to an evaluation where students are
assessed in relation to each other. The second is criterion-referenced: criterion-referenced
grading is the process whereby students are evaluated in a non-competitive atmosphere,
with the emphasis placed on the learning objectives and standards. The third is
self-referenced: it is based on comparing a learner's performance with the instructor's
perceptions of the learner's ability, so that learners performing above the level the
instructor perceives them capable of receive higher grades than learners the instructor
perceives as not having made as much improvement. There is an even greater need for
appropriate grading methods for assigning letters to students' performance. This section
summarizes current trends in academic grading and relates them to the assessment of
student outcomes in a specific course. After discussing these grading models, a noticeable
shift toward the criterion-referenced grading model is evident.
Criterion Referenced Model
This model's framework is based on a curriculum, course, or lesson. By establishing
absolute standards, grades are assigned by comparing a learner's performance to a set of
standards. Learners meeting the learning targets receive higher grades than those learners
not meeting the targets. This method presumes the learning targets are appropriately
designed for the particular learner population and the instructor is focusing instruction on
the learning targets.
Norm Referenced Model
This model's framework is based on comparisons among learners. Establishing relative
standards means making comparisons relative to the group, such that a learner's
performance is compared to that of others in the group.
Advantages and Disadvantages of Norm-referenced Grading
Norm-referenced grading ranks learners from highest to lowest, according to Nitko (2001)
and these systems of grading are easy for instructors to use and articulate. The Center for
Teaching and Learning Services at the University of Minnesota (2003) explains that this
form of grading works well in situations requiring rigid differentiation among students,
where restrictions are imposed. For example, when only a small number of students is to be
selected, this technique works well. Norm-referencing requires close scrutiny of the
actual group that will be used as a reference for the comparison. This could possibly foster
further insight into the course’s subject area and help improve instruction. This form of
grading is most appropriate in a large classroom setting.
Two primary objections surround the norm-referenced form of grading. First, an
individual’s grade is determined not only by his/her achievements and efforts but also by
achievements and efforts of others. Popham (2002) illustrates this by saying that when a
teacher asserts that a student "scored at the 90th percentile on a test," they mean that the
student's test performance has exceeded the performance of 90% of the students in the
test's norm group. The second objection is that norm-referenced grading promotes
competition rather than cooperation: when students are knowingly pitted against each other,
they are less likely to be helpful to their fellow classmates, although this may also help to
discourage cheating in tests and examinations.
Advantages and Disadvantages of Criterion-referenced Grading
Criterion-referenced grading provides feedback relative to learning targets and/or
performance standards. This form of grading emphasizes the objectives of the curriculum.
The student’s grade is not affected by the class. Under this form, if improvement is
needed, a student can simply observe the identified learning targets to know what areas
they should work on. Unlike norm-referenced grading this system is adaptable to any size
classroom setting.
There are two disadvantages that present themselves as hurdles for the
criterion-referenced form of grading.
 Establishment of learning targets and/or performance standards.
 Teachers set the criteria, standards, or targets based on what they know about how
students will usually perform.
Self-Referenced Model
The growth-based grading framework is based on comparing a learner's performance with
the instructor's perceptions of the learner's ability. Learners performing above the level the
instructor perceives them capable of receive higher grades than learners the instructor
perceives as not having made as much improvement. Thus, a
learner who has made more improvement may receive a higher grade than another learner
regardless of their absolute levels of attainment. It is therefore essential that the instructor
maintain rigorous records so as to reduce the potentially unreliable nature of judgments of
capability. While the self-referencing method reduces the overall competitiveness of
grades, it presents an irony in that learners coming in the course with the highest levels of
achievement tend to have the lowest levels of change even though their final absolute
levels of achievement remain the highest (Nitko, 2001).
Grading Methods
Absolute grading methods produce grades that share some general shortcomings,
independent of the particular method that generated the grades. For example, unless they
are accompanied by a description of the performance standards or the content domains that
have been studied, the meaning of an absolute grade is difficult to understand.
Furthermore, no criterion-referenced grading method produces grades that are strictly
absolute in meaning. Such grades are based on performance standards that nearly always
have a normative basis. A "B writer" should be able to use correct referencing techniques,
the teacher may say, but if most college students do not and cannot, the standard is likely
to be lowered to reflect reality (the norm). Note that adjusting grades instead of modifying
the standards would contribute to meaningless grades.
Fixed Percent Scale
This method uses fixed ranges of percent-correct scores as the basis for assigning grades to
the component of a final grade. A grading scale used by most of the
institutions/universities is the following: 93-100 = A, 85-92=B, 78-84=C, etc. These
ranges are fixed at the beginning of the reporting period and are applied to the scores
from each grading component -- written tests, demonstrations, papers and performance
assessments. Component grades are then weighted and averaged to get the final grade.
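A small Python sketch of how such a fixed scale might be applied is given below; the D cutoff of 70 is an assumption, since the text lists only the A-C ranges explicitly.

def letter_grade(percent):
    """Fixed percent scale: 93-100 = A, 85-92 = B, 78-84 = C (cutoffs as in the text)."""
    if percent >= 93: return "A"
    if percent >= 85: return "B"
    if percent >= 78: return "C"
    if percent >= 70: return "D"   # assumed D cutoff; the text only lists A-C and "etc."
    return "F"

for score in (96, 88, 80, 72, 55):   # hypothetical component percent scores
    print(score, letter_grade(score))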
Unfortunately, a percent score will be meaningless unless the domain of tasks, behaviors,
or knowledge upon which the assessment was based is defined explicitly. That is, a test
score of 100% should mean that the student has complete or thorough attainment of the
key elements of the area of knowledge and mastered the basic skills that were sampled by
the test. But if an assessment is developed in such a way that the underlying content
domain is ill-defined or vague, the percent-correct scores from it will have no meaning
beyond the specific tasks that comprise the assessment. Scores of 80% on a math test and
75% on a speech say little about performance unless we know the difficulty of the domain
of math problems and which important criteria were used to score the speech. In sum,
percent scores cannot provide a reference to absolute performance standards unless
the underlying knowledge domain and desired basic skills to be mastered are adequately
described.
Another serious drawback of this grading method is the fact that the percent-score ranges
for each grade symbol are fixed for all grading components. For example, the fact that 93%
is needed for an A places severe and unnecessary restrictions on the teacher when he or she
is developing each assessment tool. If the teacher believes there should be some A grades,
a 20-point test must be easy enough so that some students will score 19 or higher;
otherwise there will be no A grades. This circumstance creates two major problems for the
teacher as the assessment developer. First, it requires that assessment tasks be chosen more
for their anticipated easiness than for their content representativeness. As a result, there
may be an over representation of easy concepts and ideas, an overemphasis on facts and
knowledge, and an under representation of tasks that require higher order thinking skills.
The teacher may need to "fudge" on the domain definition to accommodate the fixed
grading scale.
A further limitation of this method relates to the accuracy of the assessment information
obtained. Since the grade cutoff scores usually are located between the 60% and 100%
points on the percent scale, most of the scale points (0-60) are of no value in describing
the different absolute levels of achievement. For example, if A and B performance must
be in the range of 85%-100%, the very best B achievement and the very worst B
achievement are separated by only eight points (85-92), as are the very best and very
worst A achievements (93-100). These are fairly narrow score ranges, especially
considering the fact that a 100-point scale is available for use. Because these ranges are
narrow and fixed, they will contribute to fairly inaccurate grades when the scores of any
single grading component are not very dependable. If the grade ranges could be made
larger when the scores of certain components are fairly inaccurate, then more accurate
grades would probably result.
The fixed percent scale method usually produces grades that have little meaning in terms of
content standards, and it often yields grades that are of questionable accuracy. The
percent cutoffs for each grade are arbitrary and, thus, not defensible. Why should the
cutoff for an A be 93, 92, or 90? Further, why shouldn't the A cutoff be 88% for a certain
text, 91% for another, and 83% for a certain simulation exercise? Is there any reason why
the same numerical standards must be applied to every grading component when those
standards are arbitrary and void of absolute meaning?
Total Point Method
When a teacher accumulates points earned by students throughout a reporting period and
then assigns grades to the point total at the end of the period, the method is known as the
total point method. First the teacher decides which components will figure into the final grade
and what the maximum point value of each component will be. (This is done before tests
are developed and before the scoring criteria for projects are established). That is, the
teacher formulates the grading procedure before the start of the program. For example, you may
decide to use two tests (50 points each), two papers (40 points each), and a report (20
points) for a maximum of 200 points for the quarter. Then the grade cutoffs might be set as
follows: 180-200 = A, 160-179 = B, 140-159 = C, 120-139 = D and 0-119 = F. Implicit in
this set of ranges is a percent scale with grade cutoffs of 90%, 80%, 70%, and 60%. These
cutoffs depend upon the teacher's own preference; there is no hard and fast rule or rationale for
percentage cutoffs. They are as arbitrary, and nearly as meaningless, as those derived from
the fixed percent scale method. Unlike the fixed percent scale method, however, grades are
not assigned to components with the total point method. And unlike grading on the curve,
the arbitrary cutoff points are established at the beginning of the reporting period, before
assessment results are known.
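The bookkeeping for this example is easy to script. A Python sketch using the 200-point scheme and cutoffs above (the points earned are hypothetical):

def total_point_grade(total):
    """Assign a grade from accumulated points: 180-200 = A, 160-179 = B, ... (as above)."""
    if total >= 180: return "A"
    if total >= 160: return "B"
    if total >= 140: return "C"
    if total >= 120: return "D"
    return "F"

# Hypothetical points earned: two 50-point tests, two 40-point papers, one 20-point report
earned = [44, 47, 33, 36, 15]
print(sum(earned), total_point_grade(sum(earned)))   # 175 -> B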
One of the difficulties of using this method is that often a decision has to be made about
the maximum score on a project or test before the teacher has had ample time to think
about the key ingredients of the assessment. Here's how this circumstance can contribute to
poor assessment development practices: Suppose I need a 50-point test to fit my grading
scheme, but I find as I build the test that I need 32 multiple-choice items to sample the
content domain thoroughly. I find this unsatisfactory (or inconvenient) because 32 does not
divide into 50 very nicely (1.56 points per item!). To make life simpler, I could drop 7 items and use a
25-item test with 2 points per item. If I did that, my point totals would be in fine shape, but
my test would be an incomplete measure of the important unit objectives. The fact that I had
to commit to 50 points prematurely dealt a serious blow to obtaining meaningful
assessment results.
Another potential drawback to the total point method is the ease with which extra credit
points can be incorporated to beef up low point totals. This practice can simultaneously
distort the meaning of the content domain and final grade. When the extra tasks are
challenging and relevant to current instruction, this seems like a reasonable way to
individualize and motivate high achieving students. In such cases, the outcome is likely to
make high point totals even higher. But extra credit that simply allows students to
compensate for low test scores or inadequate papers is not reasonable, especially if the
extra work does not help them overcome demonstrated deficiencies. The point here is that
this method of grading makes it convenient for teachers to allow extra credit work of the
latter form to compensate for low achievement. When that happens, the grades take on a
new meaning because the relevant domain of knowledge and skills gets redefined by the
nature of the extra credit tasks.
Content Based Method
This method involves assigning a grade to each component and then weighting the separate
grades to obtain the final one. The teacher develops brief descriptions of the achievement
levels (standards) associated with each grading symbol. These standards for "A work" and
"B work" and so on are then used to establish the grade cutoff scores for every component.
Compared to the fixed percent scale method, which keeps cutoff scores constant for all
components, this method keeps the performance standards for a grade constant but lets the
cutoff scores change. Here is an example of how the method might be used:
Suppose you have prepared a 30-item test to measure the achievement of most of the
objectives in a unit of instruction. Assuming that grades A through F will be assigned to
test scores, you will need to develop a brief description of the performance levels you
expect students to reach for each of the five possible grades. For example, you might
describe C expectations as "knows basic concepts and can do the most important skills;
lacks some prerequisites for later learning." Using descriptions like these, you can begin
an item-by-item review of the test.
For question No. 1, ask whether a student with only minimum achievement (D) should be
able to answer correctly. If so, record a D next to the item; if not, pose the same question
for grade C achievement. This process continues until the first item has been classified. For
items that the teacher believes most A students will not necessarily answer correctly, a
symbol such as N can be used to indicate that no grade level applies. After you have
classified each item with a symbol, the D-F cutoff score is found by adding the number of
D symbols. Then the C-D cutoff is obtained by adding the number of D and C symbols.
The B-C cutoff is the sum of D, C, and B symbols, and the A cutoff is the sum of the D, C,
B, and A symbols. To account for negative errors of measurement, you should lower each
grade cutoff by one or two points. Such adjustments for error at this stage of grading would
make it unnecessary to review borderline cases at a later time.
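The cutoff arithmetic described above is easy to automate once each item has been classified. A Python sketch, assuming a hypothetical classification of the 30 items and a one-point adjustment for measurement error:

from collections import Counter

# Hypothetical classification of 30 items by the minimum grade level expected to answer each
labels = ["D"] * 8 + ["C"] * 9 + ["B"] * 7 + ["A"] * 4 + ["N"] * 2
counts = Counter(labels)

adjust = 1  # lower each cutoff by a point or two to allow for measurement error
cutoffs = {
    "D-F": counts["D"] - adjust,
    "C-D": counts["D"] + counts["C"] - adjust,
    "B-C": counts["D"] + counts["C"] + counts["B"] - adjust,
    "A-B": counts["D"] + counts["C"] + counts["B"] + counts["A"] - adjust,
}
print(cutoffs)  # e.g. {'D-F': 7, 'C-D': 16, 'B-C': 23, 'A-B': 27}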
All grading methods involve subjectivity, and this one requires two main types of
subjective decisions. The first type entails the development of explicit expectations for the
achievers at each of the letter-grade levels. What is B achievement like and how is it
different from C achievement? Good teachers might disagree with one another about how to
define these performance standards. The other subjective decision making occurs when
items are reviewed to determine the grade category to which each one belongs. Again,
good teachers may disagree about whether a "B student" should be able to answer a
particular item correctly. Notice that these two types of judgments do not require that
subjective decision be made about individual students. There is no need to decide, for
example, whether Jana is a C student or whether Matt could answer a certain question
correctly. The judgment required here is about standards and about the particular tasks that
students at each level should be expected to do.
Some Relative Grading Methods
Grades derived from any of the relative grading methods will have certain shortcomings
that are inherent in any grading intended to have a norm-referenced meaning. For example,
unless the person interpreting the grade knows which reference group was used, the grade
means very little. Was it the student's class, a combination of classes, or classes from the
past two years? Further, by definition, a norm-referenced grade does not tell what a student
can do; there is no content basis other than the name of the subject area associated with the
grade.
Grading on the Curve
The curve referred to in the name of this method is the normal bell-shaped curve that is
often used to describe the achievements of individuals in a large heterogeneous group.
The idea behind this method is that the grades in a class should follow a normal
distribution, or one nearly like it. Under this assumption, the teacher determines the
percentage of students who should be assigned each grade symbol so that the distribution
is normal in appearance. For example, the teacher may decide that the percentages of A
through F grades in the class should be 10%, 20%, 40%, 20%, and 10%, respectively.
Since some teachers who use the method rightly believe that classroom groups are too
small for their achievement score to resemble a normal curve, they choose percentages
that, in their judgment, are more realistic. So they may decide on 20%, 35%, 30%, 10%,
and 5%. The percentages are selected arbitrarily and are treated like grade quotas so that
the top 20% of students in terms of their composite scores will earn an A, the next 35%
would be assigned a B, and so on.
Grading on the curve is a simple method to use, but it has serious drawbacks. The fixed
percentages are nearly always determined arbitrarily, and the percentages do not account
for the possibility that some classes are superior and others are inferior relative to the
phantom "typical" group the percentages are intended to represent. In addition, the use of
the normal curve to model achievement in a single classroom is generally inappropriate,
except in large required courses at the high school and college levels.
Distribution Gap Method
When the composite scores of a class are ranked from high to low, there will usually be
several short intervals in the score range where no student actually scored. These are gaps.
This method of grade assignment involves finding the gaps in the distribution and drawing
grade cutoffs at those places. For example, if the highest composite scores in a class were
211, 209, 209, 205, 197, 196..., then the teacher might use the gap between 205 and 197 to
separate the A and B grades. The gap between 211 and 209 is too small and might
produce too few A grades. The one between 209 and 205 might be large enough, but 205
seems more like 209 than 197.
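The gap-finding step can be sketched in a few lines of Python using the composite scores from the example above:

scores = sorted([211, 209, 209, 205, 197, 196], reverse=True)

# List the gap between each pair of adjacent composite scores
gaps = [(hi, lo, hi - lo) for hi, lo in zip(scores, scores[1:])]
for hi, lo, width in gaps:
    print(f"gap between {hi} and {lo}: {width} points")

# The widest gap (here 205 -> 197) is a candidate place to draw a grade cutoff
hi, lo, _ = max(gaps, key=lambda g: g[2])
print("candidate A-B cutoff between", hi, "and", lo)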
In some score distributions there are many wide gaps; in others there are only a few
narrow gaps. The sizes and locations of the gaps are determined by random errors of
measurement as well as by actual differences among students in achievement. For
example, Mike's 197 might have been 203 (if there had been less error in his scores), and
Theo's 205 might have been 200. Under those circumstances, the A-B gap would be less
obvious, and too many final grade decisions would have been made by
reviewing borderline cases.
When gaps are wide enough, this method helps the teacher avoid disputes with students
about near misses. But when the gaps are narrow, too much emphasis is placed on the
borderline information that the teacher had decided was not relevant enough or accurate
enough to be included among the set of grading components that formed the composite.
Only occasionally will the gap distribution method yield results that are comparable to
those obtained with more dependable and defensible methods.
Standard Deviation Method
This relative method is the most complicated computationally, but it is also the fairest in
producing grades objectively. It uses the standard deviation, a statistic that tells the
average number of points by which the scores of students differ from their class average. It
is a number that describes the dispersion, variability, or spread of scores around the
average score. In this method, the standard deviation is used like a ruler to identify grade
cutoff points.
Suppose you have formed composite scores for your class of 25 students and that the
average was 129 and the standard deviation was 10. (Consult an introductory measurement
or statistics book to see how to compute these statistics simply.) Assuming C to be the
average grade, we can find the cutoff between B and C by adding, for example, one-half of
the standard deviation to the average (129 + (0.5) (10) = 134). Then the A-B cutoff is
found by adding 1.5 standard deviations (for example) to the average (129 + (1.5) (10) =
144). By subtracting corresponding values from the average score, the C-D cutoff is found
to be 124, and the D-F cutoff is 114. (Can you verify these values?) The ranges for each
grade are the following: A = 145 and up, B = 135 - 144, C = 124 - 134, D = 114 - 123, and
F = 113 and below. These ranges can be made smaller or larger for groups of higher or
lower ability level by adjusting the number of standard deviations used to find the cutoffs.
For a particularly able class, for example, the A-B cutoff might be only one standard
deviation above the average and the B-C cutoff might be 0.3 above, rather than 0.5.
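A Python sketch of the cutoff arithmetic from this worked example (mean 129, SD 10, with the same multiples of the standard deviation):

def sd_cutoffs(mean, sd, a_b=1.5, b_c=0.5, c_d=-0.5, d_f=-1.5):
    """Grade cutoffs placed a chosen number of standard deviations from the class mean."""
    return {
        "A-B": mean + a_b * sd,
        "B-C": mean + b_c * sd,
        "C-D": mean + c_d * sd,
        "D-F": mean + d_f * sd,
    }

print(sd_cutoffs(129, 10))
# {'A-B': 144.0, 'B-C': 134.0, 'C-D': 124.0, 'D-F': 114.0}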
Unlike grading on the curve, this method requires no fixed percentages in advance, and
unlike the distribution gap method, the cutoff points are not tied to random error. When
the teacher has some notion of what the grade distribution should be like, some trial and
error might be needed to decide how many standard deviations each grade cutoff should
be from the composite average. When a relative grading method is desired, the standard
deviation method is most attractive, despite its computational requirements.
References
Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory.
Englewood Cliffs, NJ: Prentice-Hall.
Bloom, B. S. (1967). Towards a theory of testing which includes measurement, evaluation
and assessment. Proceedings of the Symposium on Problems in the Evaluation of
Instruction. University of California, Los Angeles.
Center for Teaching and Learning Services. (2003). Grading systems. Retrieved
November 30, 2004, from http://www.teaching.umn.edu
Davis, B. G., Wood, L., & Wilson, R. (1983). The ABCs of teaching excellence. Berkeley:
Office of Educational Development, University of California.
DeVellis, R. F. (2011). Scale development: Theory and applications (3rd ed.). Thousand
Oaks, CA: Sage Publications.
Ebel, R. L. (1979). Essentials of educational measurement (2nd ed.). Englewood Cliffs,
NJ: Prentice Hall.
Eble, K. E. (1988). The craft of teaching (2nd ed.). San Francisco: Jossey-Bass.
Edelen, M. O., & Reeve, B. B. (2007). Applying item response theory (IRT) modeling to
questionnaire development, evaluation, and refinement. Quality of Life Research,
16, 5–18.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists.
Mahwah, NJ: Lawrence Erlbaum Associates.
Erickson, B. L., & Strommer, D. W. (1991). Teaching college freshmen. San Francisco:
Jossey-Bass.
Gronlund, N. E., & Linn, R. L. (2005). Measurement and assessment in teaching. New
Delhi: Baba Barkha Nath Printers.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item
response theory. Newbury Park, CA: Sage Publications.
Hays, R. D., Morales, L. S., & Reise, S. P. (2000). Item response theory and health
outcomes measurement in the twenty-first century. Medical Care, 38(Suppl. 9),
1128–1142.
Holman, R., Glas, C. A. W., & de Haan, R. J. (2003). Power analysis in randomized
clinical trials based on item response theory. Controlled Clinical Trials, 24,
390–410.
Martuza, V. R. (1977). Applying norm-referenced and criterion-referenced measurement
in education. Boston, MA: Allyn and Bacon, Inc.
Nitko, A. J. (2001). Educational assessment of students (3rd ed.). Englewood Cliffs, NJ:
Prentice Hall.
Oosterhof, A. C. (1987). Obtaining intended weights when combining students' scores.
Educational Measurement: Issues and Practice, 6(4), 29–37.
Popham, W. J. (2002). Classroom assessment: What teachers need to know. Boston, MA:
Allyn and Bacon, Inc.
Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement.
Annual Review of Clinical Psychology, 5, 27–48.
Scriven, M. (1974). Evaluation of students. Unpublished manuscript.
The theory and practice of item response theory. New York: Guilford Press.
Tractenberg, R. E. (2010). Classical and modern measurement theories, patient reports,
and clinical outcomes. Contemporary Clinical Trials, 31(1), 1–3.
http://doi.org/10.1016/S1551-7144(09)00212-2

  • 1. UNIT - 6 PHILOSOPHY AND PSYCHOMETRIC PROPERTIES OF TESTS Written by Dr. Muhammad Azeem
  • 2. INTRODUCTION Psychometrics is a field of study concerned with the theory and technique of psychological measurement. Psychometrics deals with the construction and validation of assessment tools/instruments for testing, assessment and related activities. In is usually concerned with assessing individual’s knowledge, ability, personality, and types of behaviors. Reliability and validity are two major psychometric properties of assessment tools/instruments. Any assessment tools/instrument being able to state that they have excellent psychometric properties, meaning a scale is both reliable and valid. A reliable assessment instrument consistently assesses/measures/evaluates the same construct. So a valid assessment tool measures what it says it is going to measure. If something is valid, it is always reliable. However, something can be reliable without being valid. Objectives of Unit 1. After reading this unit, students will be able to: 2. Understand Philosophy of Tests 3. Understand Classical Test Theory 4. Understand Item Response Theory 5. Understand psychometric properties of items and tests
  • 3. 6.1. Philosophy of Testing Traditional philosophy of tests is that ability to learn is randomly distributed in the population. It means that if some learning task is assigned to a class and then a test is administering to study performance, the result of test is that students’ performance scores are distributed normally. It means that all students cannot benefited equally from teaching learning process. The new philosophy is that all students can attain mastery of any learning task subject to the provision of opportunity and time. It means absolute standards can be set for measuring performance. In testing, measurement is a process whereby the numerical relation between a magnitude of a quantitative attribute and a unit of the same attribute is estimated using some procedure, often a standardized set of operations. Realists believe that one can know the real world as it truly is. As a philosophy of testing and measurement, realism is characterized by behaviorally stated objectives, measurement-driven instruction, and report cards, along with the use of programmed materials. Realism emphasizes that measurable results from students can be obtained to show precisely how well a student is achieving. Existentialists believe that each person should learn to make choices from the alternatives available in society. As a philosophy of education, existentialism does not advocate predetermined objectives for student achievement or testing to determine achievement. Individual motivation is central, and feelings are recognized as the most important part of the human condition. Experimentalists believe that one cannot know ultimate reality, but one can experience it. Experimentalists believe in integrating school and society. Students should be assessed for their problem-solving ability, and evaluation consists of using relevant sources of information to solve a problem and to test hypotheses. Idealists believe in a subject-centered curriculum, and idealism emphasizes a coherence theory in testing. Students should be able to use reason and logic effectively
  • 4. and to honor eternal values. Perennialism is a philosophy directly related to idealism that calls for a curriculum based on the "Great Books of the Western World", with a standardized curriculum that regards childhood and youth as obstacles to be overcome through education. Key Points 1. Traditional philosophy of tests has the view point that students’ performance scores are distributed normally because all students cannot have benefited equally from teaching learning process 2. The new philosophy is that all students can attain mastery of any learning task subject to the provision of opportunity and time. 6.2. Theories of Test Development (CT, IRT) i. Testing Theory Theory is a system of rules, procedures, and assumptions used to produce a result. The theory of psychological tests and measurement, or, as typically referred to, test theory or psychometric theory, offers a general framework and a set of techniques for evaluating the development and use of psychological tests. ii. Purpose of test theories Testing is viewed as a systematic method of sampling one or more human characteristics and the representation of these results for an individual in the form of descriptive statements (Bloom, 1967). Following are the main purposes of testing theories  to formulate mathematical relationship between test properties to enable manipulating some of them to optimize the desired target properties, usually the diagnostically important, e.g. validity and reliability etc
  • 5.  to improve the quality of test by ensuring validity and reliability of test
  • 6. proportion of examinees that answer the item correctly. The percentage of difficulty  to predict outcomes of psychological testing iii. Psychometrics Psychometrics is the field of study concerned with the theory and technique of psychological measurement, which includes the measurement of knowledge, abilities, attitudes, personality traits, and educational measurement. iv. Theoretical Approaches Psychometric theories that predict outcomes of psychological testing such as the difficulty of items or the ability of test-takers. Generally speaking, the aim of these theories is to understand and improve the reliability and validity of psychological tests. 1. Classical Theory Classical test theory is based on true score model is also known as the “true score theory.” It assumes that each individual has a true score which would be obtained if there is no error in measurement. The observed score for each person may differ from an individual’s true ability. X = T + E observed score true score error score a) Role of error in estimating test scores  Basic measure of error is identified by using the standard deviation of error that is called standard error of measurement.  The larger the standard error of measurement, the less certain of the accuracy.  On the other hand, small standard error of measurement identifies that an individual score is closer to the true score (Kalpan & Sacuzzo, 1997). b) Item Difficulty The item-difficulty index is denoted by (p). This index is determined by calculating the
  • 7. index of item/test p is called difficulty of item/test. According to Linn and Gronlund (2005) the difficulty of an item indicates the percentage of students who get the item right. In the classical test theory CTT, item difficulty:  compares an examinee’s ability to the probability of success on a particular item.  considers a pool of examinees collectively and empirically examines their success rate in a particular item.  Calculate success rate of a particular pool of examinees on an item The formula for the item-difficulty index is p = No. students with correct answer total students As an example, assume that 50 people take a test. For the difficulty index, 30 test-takers answers the item correctly. p = No. students with correct answer total students 30 50 p = p = 0.6 If no one examinee answers the item correctly, then p = No. students with correct answer total students 0 50 p = p = 0.0 If all examinees answer the item correctly, then p = No. students with correct answer total students 50 50 p =
  • 8. proportion of examinees that answer the item correctly. The percentage of difficulty p = 1.0
  • 9. Thus item difficulty index (p) has a range of 0 to 1. General Interpretation of Difficulty Index Interpretation Difficulty Index Range Very Easy Item Above 0.81 Easy Item 0.61-0.80 Average Item 0.41-0.60 Difficult Item 0.21-0.40 Very Difficult Item Less than 0.20 Optimal Difficulty of an Item The optimal level (D max.) for an acceptable p value depends on the number of options per item. A formula that can be used to compute the optimal level is: 1+ g D max. = _______ where g is probability of selection each option 2 Examples of Type of Questions Number of Options per Item Optimal Difficulty 1+ g (D ) max. = 2 Single Stem Question 1 1 Yes/No, Fill-in-Blanks. True/False, Matching, etc. 2 0,75 MCQs, Matching, Fill-in-Blanks 3 0.665 MCQs, Matching 4 0.625
  • 10. MCQs, Matching 5 0.60 MCQs, Matching 6 0.5 83 Extended Matching Questions EMQs Large number of options About 0.5 In general, as the number of options increases, the optimum p value decreases; we would expect questions with more options to also be more difficult to answer. Test developer should not include too many items to have a difficulty index above optimal difficulty because it means that these items are too easy for the examinees and therefore will not help in discriminate among examinees. Minimal Difficulty Level The lower bound (minimum difficulty) for item difficulty can also be calculated. Test developer should not include too many items to have a difficulty index below the minimal difficulty because it means that these items are too difficult for the examinees and therefore will not help in discriminate among examinees. ª D min. = + 1.1.645 «¬ Where k and n represents number of options of MCQs and number of examinees respectively. For example, the minimal difficulty for MCQs of four option for 100 examinees is: § - ( 1)· º D k k / min. = + ª «¬ 1 . 1 .645»¼ n ¨© § - ( 4 1 ) · D /4 min. = + ª ¹¸ º «¬ 1.1.645»¼ ¨© 100 D min. = 0.321
  • 11. c) Item Discrimination This index tells the test developers how well the item functions to discriminate students, those who have achieved the objectives and those who have not. In the context of Psychometrics, discrimination is a desirable quality of test items. The discrimination index is denoted by (d). In the classical test theory CTT, item discrimination:  Tells about ability of an item to differentiate between higher ability examinees and lower ability examinees is known as item discrimination  which is often called statistically as the Pearson product-moment correlation coefficient between the scores on the item (e.g., 0 and 1 on an item scored right-wrong) and the scores of the total test For this calculation, test developer divides the test takers into flowing three groups according to their scores on the test as a whole: Upper Group = an upper group consisting of the 27% of the highest achievers Lower Group = a lower group consisting of the 27% lowest achievers Average Group= a middle group consisting of the remaining 46%. Calculate the following a. pupper = “Proportion in upper group who got it right” “# in the upper group who got it right / # of students in upper group who answered the item” b. pLower = “Proportion in lower group who got it right”
  • 12. “# in lower group who got it right / # of students in lower group who answered the item” Find the difference between the two p values which is called item discrimination. d = pupper - pLower Range of Discrimination Index a. Maximum When all the examinee of upper group answers the item correctly and all the examinees of lower group answer the item incorrectly then pupper = 1 and pupper = 0 d = p upper - pLower = 1 – 0 =1 the item has maximum positive discrimination between groups. This means that the item is helping you identify those students who have achieved your objectives (and those who have not!) b. Minimum When all the examinee of upper group answers the item incorrectly and all the examinees of lower group answer the item correctly then pupper = 0 and pupper = 1 d = p upper - pLower = 0 – 1 = – 1 the item has maximum negative discrimination between groups which means that people with low scores got it right and people with high scores missed the item (a bad thing)
  • 13. c. Zero discrimination When same number of the examinees of both upper group and lower group answers the item correctly then pupper = p upper p upper = pupper = q (say) d = p upper - pLower = q – q = 0 then the item is not helping us sort people into those two performance groups at all. This can happen when either everyone gets the item right or everyone misses it. These are non-discriminators. so the range of discrimination index is (– 1 to + 1). General Interpretation of Difficulty Index Interpretation Difficulty Index Range Ideal Item About 0.5 Very Good Item 0.4-0.49 Good Item 0.30-0.39 Fair Item 0.20-0.29 Poor Item Less than 0.19 Activity Ten students have taken an objective test. The test comprises of 10 items. In the table
  • 14. below, the students’ scores have been listed from high to low. There are five students in the upper half and five students in the lower half. The number “1” indicates a correct answer on the question; a “0” indicates an incorrect answer. Student ID Total Score Questions Item Item tem Item Item Item Item Item Item Item (%) 1 2 3 4 5 6 7 8 9 10 AA 100 1 1 1 1 1 1 1 1 1 1 BA 90 1 1 1 1 1 1 1 1 0 1 CD 80 1 1 0 1 1 1 1 1 0 0 DR 70 0 1 1 1 1 1 0 1 0 1 EF 70 1 1 1 0 1 1 1 0 0 1 FG 60 1 1 1 0 1 1 0 1 0 0 GH 60 0 1 1 0 1 1 0 1 0 1 HM 50 0 1 1 1 0 0 1 0 1 0 IK 40 1 1 1 0 1 0 0 0 0 1 JN 30 0 1 0 0 0 1 0 0 1 0 Calculate the Difficulty Index (p) and the Discrimination Index (D) for each question Item # Correct (Upper group) # Correct (Lower group) Difficulty (p) Discrimination (D) Item 1 Item 2 Item 3 Item 4 Item 5 Item 6 Item 7 Item 8 Item 9 Item 10 Answer the following items: 1 .Which item is the easiest? 2. Which item is the most difficult? 3. Which item has the poorest discrimination? 4. Which items would you eliminate first (if any) – why?
  • 15. Output of Iteman Software (CTT based Software) Total Total Rpbis Rbis Alpha w/o M-H p Bias Against 0.055 0.069 0.832 2.933 0.034 Urban Prop. Rpbis Rbis Mean SD Color 0.465 0.055 0.069 24.114 7.400 Maroon **KEY** 0.314 0.270 0.354 26.054 9.494 Green 0.067 -0.241 -0.463 14.875 5.367 Blue 0.132 -0.212 -0.335 17.915 7.590 Olive 0.022 -0.084 -0.233 14.500 3.586 18.231 8.594 20-40% 0.571 80-100% 0.488 0-20% 40-60% 60-80% Color 0.233 0.571 0.478 Maroon **KEY** 0.233 0.159 0.312 0.388 0.463 Green 0.200 0.095 0.052 0.030 0.000 Blue 0.333 0.175 0.065 0.104 0.049 Olive N P 357 0.465 Option statistics Option N A _________ __________ 166 B 112 C 24 D 47 Omit 8 Not Admin 13 Quantile plot data Option A N 166 B 112 C 24 D 47
  • 16. Explanation of Output Item information It tells that item ID is 35 and it is at sr. # 35 in test. this item is MCQ with four options. Key is A. it is related to “Algebra”. Flag box shows that it is biased items and there is problem in key. Item Statistics It tells that 357 students attempted this item. The difficulty index (P)is 0.465 and discrimination index (Rbis.) is 0,069. It is biased (not favoring urban students) against urban students. Option Statistics It tells about distractor analysis. The discrimination index (Rbis) of option A and B is positive. A is key so its discrimination must be positive but B is distractor and its discrimination index must be negative but it is also behaving as key. It can also see in the graph. 2. Item Response Theory (IRT) Item response theory (IRT) is latent trait theory and proposed in the field of psychometrics for the purpose of measurement of ability, skills, proficiency, learning, performance etc. It is widely used in testing to calibrate and evaluate items in assessment instruments and to score subjects on their abilities, attitudes, or other latent traits. During the last several decades, educational assessment has used more and more IRT-based techniques to develop tests because the methodology can significantly improve measurement accuracy and reliability while providing potentially significant reductions in assessment time and effort, especially via computerized adaptive testing. (Hays,
  • 17. Morales, and Reise 2000; Edelen and Reeve 2007; Holman, Glas, and de Haan 2003; Reise and Waller 2009). The most popular IRT Rasch model specify a single latent trait to account for all statistical dependencies among test items as well as all differences among test takers. It is this underlying trait, typically denoted by theta (~) that distinguishes items with respect to difficulty, and distinguishes test takers with respect to proficiency. Rasch model explores probability of test taker’s specific response to an item as a function of the test taker's location on (~) describing the relationship of the item to ~. IRT based Rasch model is probabilistic, local item dependence estimation of item parameters, test statistics, and examinee proficiency may result (Fennessy, 1995; Sireci, Thissen, & Wainer, 1991; Thissen Steinberg & Mooney, 1989). Item response theory advances the concept of item and test information to replace reliability. In the place of reliability, IRT offers the test information function which shows the degree of precision at different values of theta. Plots of item information are used to see how much information an item contributes. Since local independence, item information functions are additive, the test information function is simply the sum of the information functions of the items on the exam and with a large item bank, test information functions helpe in judging measurement error very precisely. In Item Response Theory, Rasch model is used because it is probabilistic model offers a way to model probability that a person with “certain” ability will be able to perform at a “certain” level i.e. it measure a person’s ability and its performance on a single continuum. It helps in checking how well the data fit the model and diagnoses very quickly where the misfit is the worst, and helps to understand this misfit in terms of the construction of the items and the variable in terms of its theoretical development (Rasch Analysis, 2012). Means square, t, infit, and outfit values are used as Bond and Fox (2001) considered them important for making fit
  • 18. decisions with more emphasize on infit values. Item response theory is related to:  the latent trait theory.  approach focuses that each item on a test is having its own item characteristic curve that indicates the probability of getting each particular item either right or wrong.  association between an individual's response to an item and the underlying latent variable ("ability" or "trait") being measured by the instrument  latent variable, expressed as theta (θ) , is a continuous one-dimensional construct that explains the covariance among item responses.  Individuals at higher levels of θ have a higher probability of responding correctly an item. a) Item Characteristics Curve  It provides the useful information about the behavior (item discrimination index, item difficulty index and guessing) of an item. b) Item difficulty
  • 19. where • Each learner has ability θ • Each item has difficulty b The equation is called one parametric model  θ is defined as the ability at which the probability of success on the item is 0.5 (50%) on a logit scale.  “b” item’s level of difficulty is another factor affecting an individual’s probability of responding in a particular way.  “θ" and “b” are on same scale that is on x-axis
  • 20. Above figure shows that as the difficulty of item or ability of an individual increases from left to right. Activity Let’s try the following values: θ = 0, b = 0? θ = 3, b = 0? θ = -3, b = 0? θ = 0, b = 3? θ = 0, b = -3? θ = 3, b = 3? θ = -3, b = -3? what is P(θ)? Excel Let’s enter these into Excel, and create the item characteristic curve using above equation of one parametric model. c) Item discrimination
  • 21.
  • 22. Activity where • Each learner has ability θ • Each item has difficulty b • Each item has discrimination a The equation is called two parametric model • The items on a test might also differ in terms of the degree to which they can differentiate individuals who have high trait levels from individuals who have low trait levels. Above figure shows that item 1 is more discriminatory than item 2. So the steepness of ICC represents discrimination of item.
  • 23. Let’s try the following values: θ = 0, b = 0, a = 5? θ = 3, b = 0, a = 2? θ = -3, b = 0, a = 3? θ = 0, b = 3, a = 1? θ = 0, b = -3, a = 1? θ = 3, b = 3, a = 5? θ = -3, b = -3. a = 10? what is P(θ)? Let’s enter this into Excel, and create the item characteristic curve using above equation of two parametric model. d) Guessing Where • Each learner has ability θ • Each item has difficulty b • Each item has discrimination a • Each item has guessing chance c The equation is called three parametric model
  • 24. Above figure shows that item 2 show more guessing chance than item 1. Activity Let’s try the following values: θ = 0, b = 0, a = 5, c = 0.5? θ = 3, b = 0, a θ = 0, b = 3, a = 1, c = 2? θ = 0, b = -3, a = 1, c = 0? θ = 3, b = 3, a = 5, c = 3? θ = -3, b = -3. a = 10, c = 5? what is P(θ)? Let’s enter this into Excel, and create the item characteristic curve using above equation of three parametric model = 2, c = 1? θ = -3, b = 0, a = 3, c = 0.75?
  • 25. Outputs of IRT Based Software ConQuest for Psychometric Properties of Test Item Person Map Sample Test 1 ================================================================================ SAMPLE TEST 1 Fri Mar 18 0 1 : 2 3 2016 MAP OF LATENT DI STRI BUTIONS AND RESPONSE MODEL PARAMETER ESTI MATES ===========================================================Build: J a n 8 2016=== Terms i n t h e Model ( e x c l S t e p t e r m s ) p e r s o n + i t e m - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | | | | | X| | | X| | | XXXXX| XXXXX| XXXX| XXXXXXX| XXXXXXXX| XXXXXXX| XXXXXXXXXXXX| XXXXXXXXXXX| XXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | 11 12 28 33 34 36 43 46 51 54 58 2 29 32 35 63 64 65 67 68 77 78 40 104 122 134 142 148 156 164 175 197 100 215 93 106 199 44 92 121 162 25 39 53 91 128 179 180 190 117 98 126 208 30 131 192 110 203 7 38 45 138 141 157 20 37 55 73 97 171 71 112 118 153 87 116 209 61 62 69 124 182 57 31 149 174 185 22 66 90 130 189 26 72 183 198 48 80 81 83 114 168 170 202 79 82 85 172 119 163 42 47 169 178 188 210 147 186 200 206 211 16 50 52 74 125 139 160 15 137 184 191 18 27 140 14 59 194 75 166 56 111 76 115 205 70 167 24 207 1 49 129 10 17 23 41 201 152 173 196 8 123 21 13 19 195 181 5 60 6 9 4 | | | ======================================================================================= Each 'X' r e p r e s e n t s 0 . 6 c a s e s Some par amet e r s c o u l d not be f i t t e d on t h e di s p l a y ======================================================================================= The item-person map output of sample test 1 explores distribution of items (right side of map representing by numbers 1,2,3 etc.) and persons (lift side of the map representing by x). The below map explores that the test is difficult because most of the items are upside 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXX| XXXXXXXXXXXXXX| XXXXXX| XXXXX| XXX| XX| X| | | X| | | | | | 3 |
  • 26. as compare to person. In other words, there is no person for most of the items.
  • 27. Item Person Map Sample Test 2 ================================================================================ SAMPLE TEST 2 Fri Aug 05 09 : 2 0 2016 MAP OF LATENT DISTRI BUTIONS AND RESPONSE MODEL PARAMETER ESTIMATES ===========================================================Build: J a n 8 2016=== Terms i n t h e Model ( e x c l S t e p t e r m s ) p e r s o n + i t e m - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 6 5 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|27 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| 6 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|12 28 XXXXXXXXXXXXXXXXXXXXXXXXXXX|22 XXXXXXXXXXXXXXXXXXXXXXXX|8 25 XXXXXXXXXXXXXXXXX| XXXXXXXXXXXXXXXX|11 14 XXXXXXXXXXXXXX| XXXXXXXXXXXX|29 XXXXXXXX|17 XXXXXXX|1 2 10 XXXX|9 XXX|30 XXX|4 XX|3 13 32 XX|7 31 | 5 X|15 16 26 XXX|19 24 | X|20 XXX|23 | 2 1 | XX| | | 1 8 | ======================================================================================= Each 'X' r e p r e s e n t s 3 . 6 c a s e s ======================================================================================= The item-person map output of sample test 2 explores distribution of items (right side of map representing by numbers 1,2,3 etc.) and persons (lift side of the map representing by x). The above map explores that the test is easy because most of the items are downside as compare to persons. In other words, there is no item for most of the persons. 3 2 1 0 -2 4
  • 28. Fit Statistics ================================================================================ Sa mp le of Fi t S t a t i s t i c s TAB LES O F R ESPONSE MODE L P AR AM ETER ESTI M ATES ===========================================================Bui ld: Jan 8 2016=== TER M 1 : i t e m - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -T ERM 1 : i t e m VAR I AB LES UNWE I GH TED F IT WE I GH TED FIT ---------------------------------------------------------------------------------------------------------------------------------------------------------------- i t em ESTI M ATE (b ) ERR OR ^ MNSQ C I T M NSQ C I T - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ---- 1 1 0 .2 02 0 .0 62 0 .9 5 ( 0 .9 4, 1 .0 6 ) -1 . 8 1. 03 ( 0 . 92 , 1 . 08 ) 0 . 7 2 2 0 .2 17 0 .0 61 0 .8 4 ( 0 .9 4, 1.06) -5.6 0 .9 7 ( 0 .9 3, 1 .0 7 ) - 0 . 9 3 3 -0 . 3 82 0 . 07 2 0. 84 ( 0 . 94 , 1. 06) -5. 7 0. 94 ( 0 . 90 , 1 . 10 ) -1. 2 4 4 -0 . 2 65 0 . 07 0 0. 84 ( 0 . 94 , 1. 06) -5. 8 0. 93 ( 0 . 91 , 1 . 09 ) -1. 4 5 5 -0 . 6 76 0 .0 79 0 .9 2 ( 0 .9 4, 1.06) -2.7 0. 97 ( 0 . 89 , 1. 11) -0. 6 6 6 2 . 01 0 0 . 04 8 1 .2 5 ( 0 . 94 , 1. 06) 7. 9 1. 13 ( 0 . 96 , 1 . 04 ) 6 . 7 7 7 -0 . 5 99 0 . 07 8 0. 83 ( 0 . 94 , 1. 06) -6. 2 0. 94 ( 0 . 89 , 1. 11) -1. 0 8 8 1 .2 00 0 . 05 1 1. 08 ( 0 . 94 , 1. 06) 2. 6 1. 07 ( 0 . 95 , 1 . 05 ) 3 . 0 9 9 0 . 08 8 0 . 06 4 1. 01 ( 0 . 94 , 1. 06) 0. 5 1. 05 ( 0 . 92 , 1 . 08 ) 1 . 2 1 0 1 0 0 .2 98 0 . 06 0 0. 90 ( 0 . 94 , 1. 06) -3. 6 0. 96 ( 0 . 93 , 1 . 07 ) -1. 1 1 1 1 1 0 . 89 7 0 . 05 3 1. 02 ( 0 . 94 , 1. 06) 0. 5 1. 03 ( 0 . 95 , 1 . 05 ) 1 . 2 1 2 1 2 1 . 53 3 0 . 04 9 1. 01 ( 0 . 94 , 1. 06) 0. 3 1. 02 ( 0 . 96 , 1 . 04 ) 1. 1 1 3 1 3 -0 . 4 4 0 0 .0 74 0 .9 2 ( 0 .9 4, 1.06) -2.9 0. 96 ( 0 .9 0, 1 . 10 ) -0. 8 1 4 1 4 0 .8 54 0 .0 54 0 .9 9 ( 0 .9 4, 1.06) -0.3 1 .0 4 ( 0 .9 4, 1 .0 6 ) 1 . 4 1 5 1 5 -0 . 7 98 0 . 08 3 0. 79 ( 0 . 94 , 1. 06) -7. 8 0. 91 ( 0 . 88 , 1 . 12 ) -1. 4 1 6 1 6 -0 . 8 26 0 .0 84 1 .0 3 ( 0 .9 4, 1 .0 6 ) 1 . 0 0. 93 ( 0 . 88 , 1 . 12 ) -1. 1 1 7 1 7 0 .4 13 0 . 05 9 0 .9 2 ( 0 .9 4, 1.06) -2.9 0. 98 ( 0 .9 3, 1 .0 7 ) -0. 5 1 8 1 8 -2 . 2 70 0 . 14 4 0 .5 1 ( 0 .9 4, 1. 06)-20. 2 0. 91 ( 0 .7 5, 1 .2 5 ) -0. 6 1 9 1 9 -0 . 9 96 0 .0 88 0. 88 ( 0 .9 4, 1.06) -4.4 0. 99 ( 0 . 87 , 1 . 13 ) -0. 1 2 0 2 0 -1 . 2 84 0 . 09 8 0. 75 ( 0 . 94 , 1. 06) -9.2 0. 85 ( 0 . 85 , 1 . 15 ) -2. 0 2 1 2 1 -1 . 5 99 0 .1 10 0 .7 2 ( 0 .9 4, 1.06)- 10.3 0 .8 5 ( 0 .8 2, 1 .1 8 ) - 1 . 7 2 2 2 2 1 . 39 8 0 . 05 0 1 .1 0 ( 0 .9 4, 1 .0 6 ) 3 .4 1. 08 ( 0 . 96 , 1 . 04 ) 3 . 4 2 3 2 3 -1 . 5 00 0 . 10 6 0 .5 7 ( 0 .9 4, 1.06)- 17.5 0. 86 ( 0 .8 3, 1 . 17 ) -1. 7 2 4 2 4 -1 . 0 88 0 . 09 1 0. 89 ( 0 . 94 , 1. 06) -3. 9 0. 91 ( 0 . 86 , 1 . 14 ) -1. 3 2 5 2 5 1 .1 45 0 .0 52 0 .9 9 ( 0 .9 4, 1.06) -0.2 1 .0 0 ( 0 .9 5, 1 .0 5 ) 0 . 1 2 6 2 6 -0 . 9 16 0 . 08 6 0. 90 ( 0 . 94 , 1. 06) -3.4 0. 91 ( 0 . 87 , 1. 13) -1. 4 2 7 2 7 2 . 31 4 0 . 04 8 1 .2 2 ( 0 .9 4, 1 .0 6 ) 7 .0 1. 09 ( 0 . 97 , 1 . 03 ) 5 . 2 2 8 2 8 1 .5 02 0 . 05 0 1 .1 5 ( 0 .9 4, 1 .0 6 ) 4 .8 1. 11 ( 0 .9 6, 1 .0 4 ) 4 . 9 2 9 2 9 0 . 56 1 0 . 05 7 1. 32 ( 0 . 94 , 1. 06) 9. 8 1. 10 ( 0 . 94 , 1 . 06 ) 3 . 0 3 0 3 0 -0 . 1 16 0 .0 67 0 .9 6 ( 0 .9 4, 1.06) -1.5 0. 95 ( 0 .9 1, 1 .0 9 ) -1. 0 3 1 3 1 -0 . 5 50 0 .0 76 0. 86 ( 0 .9 4, 1.06) -4.9 0. 
90 ( 0 . 89 , 1. 11) -1. 8 3 2 3 2 -0 . 3 27 * 0 .0 71 0 .9 9 ( 0 .9 4, 1.06) -0.2 0 .9 9 ( 0 .9 0, 1 .1 0 ) - 0 . 3 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - An a s t e r i s k n ext t o a p a ra met er e s t i m a t e i n d i c a t e s th at i t i s c o n s t r a i n e d Sep a rat i on R e l i a b i l i t y = 0 . 9 96 C hi -squ a re t e s t of p a ra met er e q u a l i t y = 1 010 8. 22 , d f = 31 , Si g Lev el = 0 . 000 ^ Empi ri ca l s t a n d a r d e r r o r s ha ve b een u sed ================================================================================ Fit statistics explores the fitting of the models with the data. If Mean Square MNSQ values of items exist in corresponding class interval CI then those items fit with the IRT model. Red MNSQ values explores that item 6,8,22,27,28 and 29 are miss fit with the IRT model. Estimate column in fit statistics table tells the difficulty level of the items. For example, item 30 is of average (b § 0) difficulty as mention by red values in item - person map of sample test 2 and table of fit statistics. Separation reliability index (green values in fit statistics table) tells us that test is measuring wide range of abilities. Comparison of IRT and CTT
  • 29. CTT and IRT differ in many respects although these testing theories have many commonalities. A crucial similarity in both testing theories are that models of performance; if the model assumptions are not met, conclusions and interpretations will not be supportable and the investigator will not necessarily be able to test the assumptions. However, in the case of IRT, there are statistical procedures to help determine whether the construct is causal or emergent (Tractenberg, 2010). Classical test theory (CTT) has been extremely popular in the development, characterization, and sometimes selection of outcome measures in the field of testing. IRT is powerful and offers options to measure outcomes more accurately that CTT does not provide. However, IRT modeling is complex.
  • 30. Following are the general comparison of CTT and IRT Area CTT IRT Nature Traditional Modern Complexity Simple Complex Model Linear Nonlinear Scores True score Ability scores Inferences Student’s ability Student’s expected scores Scope Narrow Broader Focus More on test More on item Assumption Weak Strong Item ability relationship Not specified Item characteristic function Ability Test scores or estimated true scores are reported on the test score scale Ability scores are reported on the scale -00 to +00 or a transformed scale Invariance of item and person No—item and person parameters are sample dependent Yes—item and person parameters are sample independent if model fit the test Sample size(for item parameter estimate) 200 to 500 in general Depend on the IRT model but larger sample i.e. over 500 in general are needed Dependency Item’s properties depends on representative sample Item’s properties do not depends on representative sample IDF Sample Independent Sample Dependent 6.4. Standard Setting Changes in educational assessment are currently being called for, both within the fields of measurement and evaluation. Traditional forms of assessment of knowledge provide a standard setting method for assigning numerical scores to determine letter grades but rarely reveal information about how students actually understand and can reason with acquired ideas or apply their knowledge to solving problems. The reflection of the
  • 31. achievement of curriculum objectives and institutional standards, by students, are indicated by the grading process. The model by which the grading process is carried out is what is in question. There are three, most commonly used, grading models employed in most educational settings and institutions. The first model is norm-referenced. Norm-referenced grading refers to an evaluation where students are assessed in relationship to each other. The second is criterion-referenced. Criterion-referenced grading is the process where students are evaluated in a noncompetitive atmosphere; the emphasis is placed on the learning objects and standards. Third is self-referenced. It is based on comparing a learner's performance with the instructor's perceptions of the learner's ability. Learners performing above the level of performance that the instructor perceives them capable receive higher grades than those learners the instructor perceives as having not made as much of an improvement. There is an even greater need for appropriate grading methods for assigning letters to students' performance. This paper summarizes current trends in academic grading and relates these to the assessment of student outcomes in a specific course. After discussing these grading models the findings were that there is a noticeable shift to the criterion-referenced grading model. Criterion Referenced Model This model's framework is based on a curriculum, course, or lesson. By establishing absolute standards, grades are assigned by comparing a learner's performance to a set of standards. Learners meeting the learning targets receive higher grades than those learners not meeting the targets. This method presumes the learning targets are appropriately designed for the particular learner population and the instructor is focusing instruction on the learning targets. Norm Referenced Model This model's framework is based on a comparison of among learners. Establishing relative standards means making comparisons that are relative to the group such that a learner's
  • 32. performance is compared to others in the group. Advantages and Disadvantages of Norm-referenced Grading Norm-referenced grading ranks learners from highest to lowest, according to Nitko (2001) and these systems of grading are easy for instructors to use and articulate. The Center for Teaching and Learning Services at the University of Minnesota (2003) explains that this form of grading works well in situations requiring rigid differentiation among students, where restrictions are imposed. For example, when less number of students is to be selected then this technique works better. Norm-referencing requires close scrutiny of the actual group that will be used as a reference for the comparison. This could possibly foster further insight into the course’s subject area and help improve instruction. This form of grading is most appropriate in a large classroom setting. Two primary objections surround the norm-referenced form of grading. First, an individual’s grade is determined not only by his/her achievements and efforts but also by achievements and efforts of others. Popham (2002) illustrates by saying, when a teacher asserts that a student “scored at the 90th percentile on test,” they mean that the student’s test performance has exceeded the performance of 90% of the students in the test’s norm group. The second objection is that norm-referenced grading promotes competition rather than cooperation. When students are knowingly paired against each other they are less likely to be helpful to their fellow classmate this may also is a key to eliminate cheating in tests and examinations. Advantages and Disadvantages of Criterion-referenced Grading Criterion-referenced grading provides feedback relative to leaning targets and/or performance standards. This form of grading emphasizes the objectives of the curriculum. The student’s grade is not affected by the class. Under this form, if improvement is needed, a student can simply observe the identified learning targets to know what areas
  • 33. they should work on. Unlike norm-referenced grading this system is adaptable to any size classroom setting. There are two disadvantages that present themselves as hurdles for the criterion-referenced form of grading.  Establishment of learning targets and/or performance standards.  Teachers set the criteria, standards, or targets based on what they know about how students will usually perform. Self-Referenced Model The growth-based grading framework is based on comparing a learner's performance with the instructor's perceptions of the learner's ability. Learners performing above the level of performance that the instructor perceives them capable receive higher grades than those learners the instructor perceives as having not made as much of an improvement. Thus, a learner who has made more improvement may receive a higher grade than another learner regardless of their absolute levels of attainment. It is therefore essential that the instructor maintain rigorous records so as to reduce the potentially unreliable nature of judgments of capability. While the self-referencing method reduces the overall competitiveness of grades, it presents an irony in that learners coming in the course with the highest levels of achievement tend to have the lowest levels of change even though their final absolute levels of achievement remain the highest (Nitko, 2001). Grading Methods Absolute grading methods produce grades that share some general shortcomings, independent of the particular method that generated the grades. For example, unless they are accompanied by a description of the performance standards or the content domains that have been studied, the meaning of an absolute grade is difficult to understand. Furthermore, no criterion-referenced grading method produces grades that are strictly absolute in meaning. Such grades are based on performance standards that nearly always have normative basis. A "B writer" should be able to use correct referencing techniques,
  • 34. the teacher may say, but if most college students do not and cannot, the standard is likely to be lowered to reflect reality (the norm). Note that adjusting grades instead of modifying the standards would contribute to meaningless grades. Fixed Percent Scale This method uses fixed ranges of percent-correct scores as the basis for assigning grades to the component of a final grade. A grading scale used by most of the institutions/universities is the following: 93-100 = A, 85-92=B, 78-84=C, etc. These ranges are fixed at the beginning of the reporting period and are applied to the scores from each grading component -- written tests, demonstrations, papers and performance assessments. Component grades are then weighted and averaged to get the final grade. Unfortunately, a percent score will be meaningless unless the domain of tasks, behaviors, or knowledge upon which the assessment was based is defined explicitly. That is, a test score of 100% should mean that the student has complete or thorough attainment of the key elements of the area of knowledge and mastered the basic skills that were sampled by the test. But if an assessment is developed in such a way that the underlying content domain is ill-defined or vague, the percent-correct scores from it will have no meaning beyond the specific tasks that comprise the assessment. Scores of 80% on a math test and 75% on a speech say little about performance unless we know the difficulty of the domain of math problems and which important criteria were used to score the speech. In sum, percent scores cannot provide a reference to absolute performance standards unless the underlying knowledge domain and desired basic skills to be mastered are adequately described. Another serious drawback of this grading method is the fact that the percent-score ranges for each grade symbol are fixed for all grading components. For example, the fact that 93% is needed for an A places severe and unnecessary restrictions on the teacher when he or she is developing each assessment tool. If the teacher believes there should be some A grades, a 20-point test must be easy enough so that some students will score 19 or higher; otherwise there will be no A grades. This circumstance creates two major problems for the teacher as the assessment developer. First, it requires that assessment tasks be chosen more for their anticipated easiness than for their content representativeness. As a result, there may be an over representation of easy concepts and ideas, an overemphasis on facts and
A further limitation of this method relates to the accuracy of the assessment information obtained. Since the grade cutoff scores usually are located between the 60% and 100% points on the percent scale, most of the scale points (0-60) are of no value in describing the different absolute levels of achievement. For example, if A and B performance must fall in the range of 85%-100%, the very best B achievement and the very worst B achievement are separated by only eight points (85-92), as are the very best and very worst A achievements (93-100). These are fairly narrow score ranges, especially considering that a 100-point scale is available for use. Because these ranges are narrow and fixed, they will contribute to fairly inaccurate grades whenever the scores of any single grading component are not very dependable. If the grade ranges could be made larger when the scores of certain components are fairly inaccurate, then more accurate grades would probably result. In short, the fixed percent scale method usually produces grades that have little meaning in terms of content standards, and it often yields grades of questionable accuracy. The percent cutoffs for each grade are arbitrary and, thus, not defensible. Why should the cutoff for an A be 93, 92, or 90? Further, why shouldn't the A cutoff be 88% for a certain test, 91% for another, and 83% for a certain simulation exercise? Is there any reason why the same numerical standards must be applied to every grading component when those standards are arbitrary and void of absolute meaning?

Total Point Method

When a teacher accumulates the points earned by students throughout a reporting period and then assigns grades to the point totals at the end of the period, the method is known as the total point method. First the teacher decides which components will figure into the final grade and what the maximum point value of each component will be. (This is done before tests are developed and before the scoring criteria for projects are established.) That is, the teacher formulates the grading procedure before the program begins. For example, you may decide to use two tests (50 points each), two papers (40 points each), and a report (20 points), for a maximum of 200 points for the quarter. Then the grade cutoffs might be set as follows: 180-200 = A, 160-179 = B, 140-159 = C, 120-139 = D, and 0-119 = F. Implicit in this set of ranges is a percent scale with grade cutoffs of 90%, 80%, 70%, and 60%.
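As a rough sketch of the arithmetic, the 200-point example above could be computed as follows; the student's component scores in the example call are invented for illustration.

```python
# Illustrative sketch of the total point method using the 200-point example above
# (two 50-point tests, two 40-point papers, one 20-point report).

CUTOFFS = [(180, "A"), (160, "B"), (140, "C"), (120, "D"), (0, "F")]

def total_point_grade(component_points):
    """Sum the points earned across all components and apply the fixed total cutoffs."""
    total = sum(component_points)
    for cutoff, letter in CUTOFFS:
        if total >= cutoff:
            return letter
    return "F"

print(total_point_grade([45, 41, 36, 38, 17]))   # 177 of 200 points -> "B"
```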
These cutoffs depend on the teacher's own preference; there is no hard and fast rule or rationale for the percentage cutoffs. They are as arbitrary, and nearly as meaningless, as those derived from the fixed percent scale method. Unlike the fixed percent scale method, however, grades are not assigned to components with the total point method. And unlike grading on the curve, the arbitrary cutoff points are established at the beginning of the reporting period, before assessment results are known. One of the difficulties of using this method is that a decision often has to be made about the maximum score on a project or test before the teacher has had ample time to think about the key ingredients of the assessment. Here is how this circumstance can contribute to poor assessment development practices: Suppose I need a 50-point test to fit my grading scheme, but I find as I build the test that I need 32 multiple-choice items to sample the content domain thoroughly. I find this unsatisfactory (or inconvenient) because 32 does not divide into 50 very nicely (50/32 is about 1.56). To make life simpler, I could drop 7 items and use a 25-item test with 2 points per item. If I did that, my point totals would be in fine shape, but my test would be an incomplete measure of the important unit objectives. The fact that I had to commit to 50 points prematurely dealt a serious blow to obtaining meaningful assessment results. Another potential drawback to the total point method is the ease with which extra credit points can be incorporated to beef up low point totals. This practice can simultaneously distort the meaning of the content domain and of the final grade. When the extra tasks are challenging and relevant to current instruction, extra credit seems like a reasonable way to individualize and motivate high-achieving students. In such cases, the outcome is likely to make high point totals even higher. But extra credit that simply allows students to compensate for low test scores or inadequate papers is not reasonable, especially if the extra work does not help them overcome demonstrated deficiencies. The point here is that this method of grading makes it convenient for teachers to allow extra credit work of the latter form to compensate for low achievement. When that happens, the grades take on a new meaning because the relevant domain of knowledge and skills gets redefined by the nature of the extra credit tasks.

Content Based Method

This method involves assigning a grade to each component and then weighting the separate grades to obtain the final one.
The teacher develops brief descriptions of the achievement levels (standards) associated with each grading symbol. These standards for "A work," "B work," and so on are then used to establish the grade cutoff scores for every component. Compared to the fixed percent scale method, which keeps cutoff scores constant for all components, this method keeps the performance standards for a grade constant but lets the cutoff scores change. Here is an example of how the method might be used: Suppose you have prepared a 30-item test to measure the achievement of most of the objectives in a unit of instruction. Assuming that grades A through F will be assigned to test scores, you will need to develop a brief description of the performance level you expect students to reach for each of the five possible grades. For example, you might describe C expectations as "knows basic concepts and can do the most important skills; lacks some prerequisites for later learning." Using descriptions like these, you can begin an item-by-item review of the test. For question No. 1, ask whether a student with only minimum achievement (D) should be able to answer correctly. If so, record a D next to the item; if not, pose the same question for grade C achievement. This process continues until the first item has been classified. For items that the teacher believes even most A students will not necessarily answer correctly, a symbol such as N can be used to indicate that no grade level applies. After you have classified each item with a symbol, the D-F cutoff score is found by counting the D symbols. Then the C-D cutoff is obtained by adding the numbers of D and C symbols. The B-C cutoff is the sum of the D, C, and B symbols, and the A cutoff is the sum of the D, C, B, and A symbols. To account for negative errors of measurement, you should lower each grade cutoff by one or two points. Such adjustments for error at this stage of grading make it unnecessary to review borderline cases at a later time. All grading methods involve subjectivity, and this one requires two main types of subjective decisions. The first type entails the development of explicit expectations for the achievers at each of the letter-grade levels. What is B achievement like, and how is it different from C achievement? Good teachers might disagree with one another about how to define these performance standards. The other type of subjective decision making occurs when items are reviewed to determine the grade category to which each one belongs. Again, good teachers may disagree about whether a "B student" should be able to answer a particular item correctly. Notice that these two types of judgments do not require that subjective decisions be made about individual students. There is no need to decide, for example, whether Jana is a C student or whether Matt could answer a certain question correctly. The judgment required here is about standards and about the particular tasks that students at each level should be expected to do.
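The cutoff arithmetic described above can be sketched as follows, assuming one grade symbol has already been recorded for every item; the item classifications in the example are invented, and N marks items to which no grade level applies.

```python
# Illustrative sketch of turning item-by-item grade classifications into raw-score
# cutoffs for a single test, following the counting procedure described in the text.
from collections import Counter

def grade_cutoffs(classifications, error_allowance=1):
    """Return cumulative cutoff scores for D, C, B, and A on one test."""
    counts = Counter(classifications)
    d_cutoff = counts["D"]
    c_cutoff = d_cutoff + counts["C"]
    b_cutoff = c_cutoff + counts["B"]
    a_cutoff = b_cutoff + counts["A"]
    cutoffs = {"D": d_cutoff, "C": c_cutoff, "B": b_cutoff, "A": a_cutoff}
    # Lower each cutoff slightly to allow for negative errors of measurement.
    return {grade: score - error_allowance for grade, score in cutoffs.items()}

# Example: a 30-item test with 8 D-level, 10 C-level, 7 B-level, 3 A-level, and 2 N items
items = ["D"] * 8 + ["C"] * 10 + ["B"] * 7 + ["A"] * 3 + ["N"] * 2
print(grade_cutoffs(items))   # {'D': 7, 'C': 17, 'B': 24, 'A': 27}
```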
Some Relative Grading Methods

Grades derived from any of the relative grading methods will have certain shortcomings that are inherent in any grading intended to have a norm-referenced meaning. For example, unless the person interpreting the grade knows which reference group was used, the grade means very little. Was it the student's class, a combination of classes, or classes from the past two years? Further, by definition, a norm-referenced grade does not tell what a student can do; there is no content basis other than the name of the subject area associated with the grade.

Grading on the Curve

The curve referred to in the name of this method is the normal, bell-shaped curve that is often used to describe the achievements of individuals in a large heterogeneous group. The idea behind this method is that the grades in a class should follow a normal distribution, or one nearly like it. Under this assumption, the teacher determines the percentage of students who should be assigned each grade symbol so that the distribution is normal in appearance. For example, the teacher may decide that the percentages of A through F grades in the class should be 10%, 20%, 40%, 20%, and 10%, respectively. Since some teachers who use the method rightly believe that classroom groups are too small for their achievement scores to resemble a normal curve, they choose percentages that, in their judgment, are more realistic. So they may decide on 20%, 35%, 30%, 10%, and 5%. The percentages are selected arbitrarily and are treated like grade quotas, so that the top 20% of students in terms of their composite scores earn an A, the next 35% a B, and so on. Grading on the curve is a simple method to use, but it has serious drawbacks. The fixed percentages are nearly always determined arbitrarily, and the percentages do not account for the possibility that some classes are superior and others inferior relative to the phantom "typical" group the percentages are intended to represent. In addition, the use of the normal curve to model achievement in a single classroom is generally inappropriate, except in large required courses at the high school and college levels.
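A minimal sketch of the quota logic follows, using the 20/35/30/10/5 percentages mentioned above; the composite scores and the way rounding leftovers are handled are illustrative assumptions.

```python
# Illustrative sketch of grading on the curve: arbitrary grade quotas are applied
# to the class ranked by composite score.

def curve_grades(scores, quotas=(("A", 0.20), ("B", 0.35), ("C", 0.30), ("D", 0.10), ("F", 0.05))):
    """Assign a letter grade to each score by rank, according to fixed quotas."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    grades = [None] * len(scores)
    start = 0
    for letter, share in quotas:
        count = round(share * len(scores))
        for i in order[start:start + count]:
            grades[i] = letter
        start += count
    for i in order[start:]:          # any students left over by rounding
        grades[i] = quotas[-1][0]    # receive the lowest grade in the scheme
    return grades

composites = [92, 85, 78, 74, 70, 69, 66, 60, 55, 41]
print(list(zip(composites, curve_grades(composites))))
```

The sketch also makes the method's main drawback easy to see: the same quotas would be imposed on an unusually strong class and an unusually weak one alike.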
Distribution Gap Method

When the composite scores of a class are ranked from high to low, there will usually be several short intervals in the score range where no student actually scored. These are gaps. This method of grade assignment involves finding the gaps in the distribution and drawing grade cutoffs at those places. For example, if the highest composite scores in a class were 211, 209, 209, 205, 197, 196, ..., then the teacher might use the gap between 205 and 197 to separate the A and B grades. The gap between 211 and 209 is too small and might produce too few A grades. The one between 209 and 205 might be large enough, but 205 seems more like 209 than like 197. In some score distributions there are many wide gaps; in others there are only a few narrow gaps. The sizes and locations of the gaps are determined by random errors of measurement as well as by actual differences among students in achievement. For example, Mike's 197 might have been 203 (if there had been less error in his scores), and Theo's 205 might have been 200. Under those circumstances, the A-B gap would be less obvious, and many final grade decisions would have to be made by reviewing borderline cases. When gaps are wide enough, this method helps the teacher avoid disputes with students about near misses. But when the gaps are narrow, too much emphasis is placed on the borderline information that the teacher had decided was not relevant enough or accurate enough to be included among the set of grading components that formed the composite. Only occasionally will the distribution gap method yield results that are comparable to those obtained with more dependable and defensible methods.
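For concreteness, gap-finding can be sketched as below; the example extends the 211, 209, 209, 205, 197, 196 list with two invented lower scores, and the minimum width that counts as a "real" gap is an assumption, since the teacher still judges where to cut.

```python
# Illustrative sketch of locating gaps in a ranked set of composite scores.

def find_gaps(scores, min_gap=5):
    """Return (upper, lower, width) for adjacent distinct scores at least min_gap apart."""
    ranked = sorted(set(scores), reverse=True)
    return [(upper, lower, upper - lower)
            for upper, lower in zip(ranked, ranked[1:])
            if upper - lower >= min_gap]

print(find_gaps([211, 209, 209, 205, 197, 196, 188, 185]))
# [(205, 197, 8), (196, 188, 8)] -> candidate places to draw grade cutoffs
```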
Standard Deviation Method

This relative method is the most complicated computationally, but it is also the fairest, because it assigns grades the most objectively. It uses the standard deviation, a statistic that tells the average number of points by which students' scores differ from their class average. It is a number that describes the dispersion, variability, or spread of scores around the average score. In this method, the standard deviation is used like a ruler to identify grade cutoff points. Suppose you have formed composite scores for your class of 25 students, and that the average was 129 and the standard deviation was 10. (Consult an introductory measurement or statistics book to see how to compute these statistics simply.) Assuming C to be the average grade, we can find the cutoff between B and C by adding, for example, one-half of the standard deviation to the average (129 + (0.5)(10) = 134). Then the A-B cutoff is found by adding 1.5 standard deviations (for example) to the average (129 + (1.5)(10) = 144). By subtracting the corresponding values from the average score, the C-D cutoff is found to be 124, and the D-F cutoff is 114. (Can you verify these values?) The ranges for each grade are then the following: A = 145 and up, B = 135-144, C = 124-134, D = 114-123, and F = 113 and below. These ranges can be made smaller or larger for groups of higher or lower ability by adjusting the number of standard deviations used to find the cutoffs. For a particularly able class, for example, the A-B cutoff might be only one standard deviation above the average and the B-C cutoff might be 0.3 above, rather than 0.5. Unlike grading on the curve, this method requires no fixed percentages in advance, and unlike the distribution gap method, the cutoff points are not tied to random error. When the teacher has some notion of what the grade distribution should be like, some trial and error might be needed to decide how many standard deviations from the composite average each grade cutoff should be. When a relative grading method is desired, the standard deviation method is the most attractive, despite its computational requirements.
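As a final illustrative sketch, the cutoff arithmetic from the example above can be expressed as a small function; the multipliers are the ones used in the text and are, as noted, a matter of teacher judgment. In practice the teacher would first compute the average and standard deviation from the class's composite scores.

```python
# Illustrative sketch of the standard deviation method, using the numbers in the text:
# class average 129, standard deviation 10, and C centered on the average.

def sd_cutoffs(average, sd, a_b=1.5, b_c=0.5, c_d=-0.5, d_f=-1.5):
    """Return grade cutoffs placed a chosen number of SDs above or below the average."""
    return {
        "A-B": average + a_b * sd,   # 129 + 1.5 * 10 = 144
        "B-C": average + b_c * sd,   # 129 + 0.5 * 10 = 134
        "C-D": average + c_d * sd,   # 129 - 0.5 * 10 = 124
        "D-F": average + d_f * sd,   # 129 - 1.5 * 10 = 114
    }

print(sd_cutoffs(129, 10))
# For an especially able class, shift the multipliers, e.g. sd_cutoffs(129, 10, a_b=1.0, b_c=0.3)
```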
References

Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory. Englewood Cliffs, NJ: Prentice-Hall.

Bloom, B. S. (1967). Towards a theory of testing which includes measurement, evaluation and assessment. Proceedings of the Symposium on Problems in the Evaluation of Instruction. Los Angeles: University of California.

Center for Teaching and Learning Services. (2003). Grading systems. Retrieved November 30, 2004, from http://www.teaching.umn.edu

Davis, B. G., Wood, L., & Wilson, R. (1983). The ABCs of teaching excellence. Berkeley: Office of Educational Development, University of California.

de Ayala, R. J. (2009). The theory and practice of item response theory. New York: Guilford Press.

DeVellis, R. F. (2011). Scale development: Theory and applications (3rd ed.). Thousand Oaks, CA: Sage Publications.

Ebel, R. L. (1979). Essentials of educational measurement (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.

Eble, K. E. (1988). The craft of teaching (2nd ed.). San Francisco: Jossey-Bass.

Edelen, M. O., & Reeve, B. B. (2007). Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement. Quality of Life Research, 16, 5-18.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.

Erickson, B. L., & Strommer, D. W. (1991). Teaching college freshmen. San Francisco: Jossey-Bass.

Gronlund, N. E., & Linn, R. L. (2005). Measurement and assessment in teaching. New Delhi: Baba Barkha Nath Printers.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage Publications.

Hays, R. D., Morales, L. S., & Reise, S. P. (2000). Item response theory and health outcomes measurement in the twenty-first century. Medical Care, 38(9, Suppl.), II28-II42.

Holman, R., Glas, C. A. W., & de Haan, R. J. (2003). Power analysis in randomized clinical trials based on item response theory. Controlled Clinical Trials, 24, 390-410.

Martuza, V. R. (1977). Applying norm-referenced and criterion-referenced measurement in education. Boston, MA: Allyn and Bacon.

Nitko, A. J. (2001). Educational assessment of students (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.

Oosterhof, A. C. (1987). Obtaining intended weights when combining students' scores. Educational Measurement: Issues and Practice, 6(4), 29-37.

Popham, W. J. (2002). Classroom assessment: What teachers need to know. Boston, MA: Allyn and Bacon.

Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.

Scriven, M. (1974). Evaluation of students. Unpublished manuscript.

Tractenberg, R. E. (2010). Classical and modern measurement theories, patient reports, and clinical outcomes. Contemporary Clinical Trials, 31(1), 1-3. http://doi.org/10.1016/S1551-7144(09)00212-2