The document discusses the key principles of language assessment: practicality, reliability, validity, authenticity, and washback. It defines each principle and provides examples. Practicality means a test is cost-effective, time-efficient, and easy to administer. Reliability refers to a test producing consistent results. Validity concerns whether a test accurately measures what it claims to measure. Authenticity refers to how well a test simulates real-world language tasks. Washback concerns a test's influence on teaching and learning; a test has positive washback if it encourages effective instruction and learning.
COMPONENTS OF LANGUAGE ASSESSMENT
1. Practicality
2. Reliability
3. Validity
4. Authenticity
5. Washback
1. PRACTICALITY
An effective test is practical. This means that it is not excessively expensive, stays within appropriate time constraints, is relatively easy to administer, and has a scoring/evaluation procedure that is specific and time-efficient.
A test that is prohibitively expensive is impractical. A test of language proficiency that takes a student five hours to complete is impractical: it consumes more time (and money) than necessary to accomplish its objective. A test that requires individual one-on-one proctoring is impractical for a group of several hundred test-takers and only a handful of examiners. A test that takes a few minutes for a student to take and several hours for an examiner to evaluate is impractical for most classroom situations.
2. RELIABILITY
A reliable test is consistent and dependable. If you give the same test to the same student or matched students on two different occasions, the test should yield similar results. The issue of reliability may best be addressed by considering a number of factors that can contribute to a test's unreliability. Consider the following possibilities (adapted from Mousavi, 2002, p. 804): fluctuations in the student, in scoring, in test administration, and in the test itself.
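To make the idea of consistency concrete, reliability is often estimated statistically. The short sketch below is not from the source text; the scores are hypothetical. It computes a simple test-retest reliability estimate as the Pearson correlation between two administrations of the same test:

```python
# Illustrative sketch (not from the source): estimating test-retest
# reliability as the Pearson correlation between two administrations of
# the same test to the same (hypothetical) group of students.
from statistics import mean, stdev

first_sitting  = [72, 85, 60, 91, 78, 66, 83]   # scores, occasion 1
second_sitting = [70, 88, 63, 89, 80, 64, 85]   # same students, occasion 2

def pearson_r(xs, ys):
    """+1.0 means the two sittings rank students identically."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

print(f"test-retest reliability estimate: {pearson_r(first_sitting, second_sitting):.2f}")
```

A coefficient near 1.0 suggests the two sittings yield similar results; a low coefficient points to one of the sources of unreliability discussed next.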
2.1 STUDENT-RELATED RELIABILITY
The most common learner-related unreliability stems from temporary illness, fatigue, a “bad day,” anxiety, and other physical or psychological factors, which may make an “observed” score deviate from one’s “true” score. Also included in this category are such factors as a test-taker’s “test-wiseness,” or strategies for efficient test taking (Mousavi, 2002, p. 804).
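The observed/true distinction is the core identity of classical test theory: observed = true + error. The following sketch is illustrative only; the true score and the size of the day-to-day error are assumptions, not figures from the source:

```python
# Illustrative sketch (not from the source): classical test theory views an
# observed score as a stable "true" score plus random error contributed by
# fatigue, anxiety, a "bad day," and similar factors: observed = true + error.
import random

random.seed(42)                      # fixed seed so the example is repeatable
TRUE_SCORE = 80                      # hypothetical stable ability level

for occasion in range(1, 6):
    error = random.gauss(0, 5)       # day-to-day fluctuation, sd = 5 points
    observed = TRUE_SCORE + error
    print(f"occasion {occasion}: observed score = {observed:.1f}")
```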
2.2 RATER RELIABILITY
Human error, subjectivity, and bias may enter into the scoring process. Inter-rater unreliability occurs when two or more scorers yield inconsistent scores on the same test, possibly because of lack of attention to scoring criteria, inexperience, inattention, or even preconceived biases. In one placement-test example, the initial scoring plan for the dictations was found to be unreliable; that is, the two scorers were not applying the same standards.
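One common way to quantify how consistently two scorers apply the same standards is an agreement index such as Cohen's kappa, which corrects raw agreement for chance. The sketch below is illustrative; the two raters' pass/fail judgments are hypothetical:

```python
# Illustrative sketch (not from the source): Cohen's kappa, a common index of
# inter-rater agreement that corrects for agreement expected by chance.
# The two raters' pass/fail judgments below are hypothetical.
from collections import Counter

rater_a = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

def cohen_kappa(a, b):
    """kappa = 1.0 is perfect agreement; 0.0 is no better than chance."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n            # raw agreement
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[lab] * freq_b[lab]
                   for lab in set(a) | set(b)) / n ** 2         # chance agreement
    return (observed - expected) / (1 - expected)

print(f"kappa = {cohen_kappa(rater_a, rater_b):.2f}")   # ~0.47 for these data
```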
2.3 TEST ADMINISTRATION RELIABILITY
Unreliability may also result from the conditions in
which the test is administered. I once witnessed the
administration of a test of aural comprehension in
which a tape recorder played items for comprehension,
but because of street noise outside the building,
students sitting next to windows could not hear the
tape accurately. This was a clear case of unreliability
caused by the conditions of the test administration.
Other sources of unreliability are found in
photocopying variations, the amount of light in
different parts of the room, variations in temperature,
and even the condition of desks and chairs.
2.4 TEST RELIABILITY
Sometimes the nature of the test itself can cause measurement errors. If a test is too long, test-takers may become fatigued by the time they reach the later items and hastily respond incorrectly. Timed tests may discriminate against students who do not perform well on a test with a time limit. We all know people (and you may be included in this category!) who “know” the course material perfectly but who are adversely affected by the presence of a clock ticking away. Poorly written test items (that are ambiguous or that have more than one correct answer) may be a further source of test unreliability.
3. VALIDITY
By far the most complex criterion of an effective test, and arguably the most important principle, is validity, “the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment” (Gronlund, 1998, p. 226). A valid test of reading ability actually measures reading ability, not 20/20 vision, previous knowledge of a subject, or some other variable of questionable relevance. To measure writing ability, one might ask students to write as many words as they can in 15 minutes, then simply count the words for the final score. Such a test would be easy to administer (practical), and the scoring quite dependable (reliable). But it would not constitute a valid test of writing ability without some consideration of comprehensibility, rhetorical discourse elements, and the organization of ideas, among other factors.
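The word-count test described above is easy to express in code, which makes its validity problem vivid. The scorer below is an illustrative sketch, not an actual assessment instrument: it is perfectly practical and reliable, yet it rewards unreadable text:

```python
# Illustrative sketch (not from the source): the hypothetical word-count
# "writing test" described above. The scorer is practical and reliable,
# but the score ignores comprehensibility and organization entirely.
def word_count_score(essay: str) -> int:
    return len(essay.split())

gibberish = "cat cat cat " * 100        # unreadable, yet 300 "words"
print(word_count_score(gibberish))      # high score despite no writing ability
```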
3.1 CONTENT-RELATED EVIDENCE
If a test actually samples the subject matter about which conclusions are to be drawn, and if it requires the test-takers to perform the behavior that is being measured, it can claim content-related evidence of validity, often popularly referred to as content validity (e.g., Mousavi, 2002; Hughes, 2003). You can usually identify content-related evidence observationally if you can clearly define the achievement that you are measuring.
3.2 CRITERION-RELATED EVIDENCE
A second form of evidence of the validity of a test may be found in what is called criterion-related evidence, also referred to as criterion-related validity, or the extent to which the “criterion” of the test has actually been reached. You will recall that in Chapter 1 it was noted that most classroom-based assessment with teacher-designed tests fits the concept of criterion-referenced assessment. In such tests, specified classroom objectives are measured, and implied predetermined levels of performance are expected to be reached (e.g., 80 percent as a minimal passing grade).
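As a worked example of a predetermined performance level, the sketch below (illustrative; the item counts are hypothetical) checks a raw score against an 80 percent criterion:

```python
# Illustrative sketch (not from the source): criterion-referenced scoring
# against a predetermined performance level. The 80 percent cutoff follows
# the passage above; the item counts are hypothetical.
PASSING_CUTOFF = 0.80

def meets_criterion(items_correct: int, total_items: int) -> bool:
    return items_correct / total_items >= PASSING_CUTOFF

print(meets_criterion(41, 50))   # 82% -> True, criterion reached
print(meets_criterion(39, 50))   # 78% -> False
```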
3.3 CONSTRUCT-RELATED EVIDENCE
A third kind of evidence that can support validity, but one that does not play as large a role for classroom teachers, is construct-related validity, commonly referred to as construct validity. A construct is any theory, hypothesis, or model that attempts to explain observed phenomena in our universe of perceptions. Constructs may or may not be directly or empirically measured; their verification often requires inferential data.
3.4 CONSEQUENTIAL VALIDITY
As well as the above three widely accepted forms of evidence that may be introduced to support the validity of an assessment, two other categories may be of some interest and utility in your own quest for validating classroom tests. Messick (1989), Gronlund (1998), McNamara (2000), and Brindley (2001), among others, underscore the potential importance of the consequences of using an assessment. Consequential validity encompasses all the consequences of a test, including such considerations as its accuracy in measuring intended criteria, its impact on the preparation of test-takers, its effect on the learner, and the (intended and unintended) social consequences of a test’s interpretation and use.
3.5 FACE VALIDITY
An important facet of consequential validity is the extent to which “students view the assessment as fair, relevant, and useful for improving learning” (Gronlund, 1998, p. 210), or what is popularly known as face validity. “Face validity refers to the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers” (Mousavi, 2002, p. 244).
4. AUTHENTICITY
A fourth major principle of language testing is authenticity, a concept that is a little slippery to define, especially within the art and science of evaluating and designing tests. Bachman and Palmer (1996, p. 23) define authenticity as “the degree of correspondence of the characteristics of a given language test task to the features of a target language task,” and then suggest an agenda for identifying those target language tasks and for transforming them into valid test items.
5. WASHBACK
A facet of consequential validity, discussed above, is “the effect of testing on teaching and learning” (Hughes, 2003, p. 1), otherwise known among language-testing specialists as washback. In large-scale assessment, washback generally refers to the effects a test has on instruction in terms of how students prepare for the test. “Cram” courses and “teaching to the test” are examples of such washback. Another form of washback that occurs more in classroom assessment is the information that “washes back” to students in the form of useful diagnoses of strengths and weaknesses. Washback also includes the effects of an assessment on teaching and learning prior to the assessment itself, that is, on preparation for the assessment.
5.1 WASHBACK/BACKWASH
The term washback is commonly used in applied linguistics, yet it is rarely found in dictionaries. However, the word backwash can be found in certain dictionaries, where it is defined by the Cambridge Advanced Learner’s Dictionary as “an effect that is not the direct result of something.”
In dealing with principles of language assessment, the two words can be used interchangeably. Washback (Brown, 2004) or backwash (Heaton, 1990) refers to the influence of testing on teaching and learning. The influence itself can be positive or negative (Cheng et al., 2008, pp. 7-11).
5.2 POSITIVE WASHBACK
Positive washback has a beneficial influence on teaching and learning. It means teachers and students have a positive attitude toward the examination or test, and work willingly and collaboratively towards its objective (Cheng & Curtis, 2008, p. 10). A good test should have a good effect.
5.3 NEGATIVE WASHBACK
Negative washback has no beneficial influence on teaching and learning (Cheng & Curtis, 2008, p. 9). Tests with negative washback are considered to exert a harmful influence on how teachers teach and how students learn.