5. Please God may I not fail
Please God may I get over sixty per cent
Please God may I get a high place
Please God may all those likely to beat me get killed in
road accidents and may they die roaring.
Irish novelist McGahern
6. Overview
Types of language tests
Ways of describing tests
Evaluating the usefulness of language tests
Overview of common language tests:
TOEFL, TOEIC, IELTS, and CAEL
Impact of testing on learning and teaching
Critical use of language tests
Testing Questions
7. Testing Questions
What is actually being tested by the test
we are using?
What is the “best” test to use?
What relevant information does the test
provide?
How is testing affecting teaching and
learning behaviour?
Is language testing “fair”?
8. Validity, reliability, feasibility
Reliability relates to the consistency of an
assessment.
A reliable assessment is one which
consistently achieves the same results with
the same (or similar) cohort of students.
A valid assessment is one which
measures what it is intended to
measure
Totally valid or reliable/Driving test
9. The process of observing and objectively
accumulating evidence about each
student's individual learning process.
- How to assess?
−Checklist
−Informal teaching observation
Assessment
10. Consider the following:
o You apply for a part-time job to work your way through
school. You learn that as part of the application process,
you must take a test of word-processing speed and a
personality test.
o Mr. and Mrs. Gómez receive a call from their child’s
third-grade teacher, who says she is concerned about
Luis’ performance on a reading test. She would like to
refer Luis for further testing to see whether Luis has a
learning disability.
o Mr. and Mrs. Torres tell you that their son is not eligible
for special-education services because he scored “too
high” on an intelligence test.
12. Assessment – The process of collecting data for the purpose of
making decisions about individuals and groups, and this
decision-making role is the reason that assessment touches
so many people’s lives.
People react strongly when test scores are used to make
interpersonal comparisons in which they or those they
love look inferior.
Power of Testing
14. Testing – Consists of administering a particular set of questions to an
individual or group of individuals to obtain a score. The score is the
end product of testing.
Testing may be part of the larger process
Testing and assessment are not synonymous.
Assessment is a multifaceted process that involves far
more than just administering a test.
High-quality assessment procedures take into account that anyone’s
performance on any task is influenced by (1) the
demands of the task itself, (2) the history and
characteristics the individual brings to the task, and (3)
the factors inherent in the context in which the
assessment is carried out.
Facts
17. • Standard test: TOEFL – IELTS – PET- CAE
• Placement test: Licenciatura test for freshmen
students
• Proficiency test: TOEFL - IELTS
• Achievement test: Parciales – workshops in ALx
Types of tests in language
education
18. Goal: the aim expected at the end of
the learning process.
Standard: accurate conceptual mastery of a
topic.
Descriptors: achievements stated by
competence; they are written in the present tense with
closed, specific characteristics.
Indicators: the “regulator” of the curriculum;
not a final result, because it is subject to
19. GOAL •To use English in common situations.
STANDARD •Students will use English to involve themselves in social
circumstances.
DESCRIPTOR •To recognize social codes.
•To assume a critical position on current events.
INDICATORS •Students recognize social codes.
•Students recognize social codes with difficulty.
•Students have great difficulty recognizing social
codes.
Avoid the use of
not.
NO
20. How do you create an evaluation?
1. Formulate the descriptors
2. Design a plan
3. Observe the learning process
4. Evaluate
5. Determine the efficiency of
pedagogies.
21. Evaluation in Colombian settings
National standard for evaluation:
ICFES Saber 5 – 9 - 11 ECAES
Saber pro
National standard for grading:
LAW 230: E S A I D
Decreto 1290:
1 – 5 /10 - 100
22. EVALUATION AT “INITIAL
SCHOOL”
1. Goals, standards, descriptors and indicators
based on the “Unified” Standards.
2. Strategies for evaluating in the five skills.
3. Continuous assessment of students' development
4. Supportive strategies for solving academic and
personal problems
1. Scales to compare national standards with
the school’s scales
2. Explicit self-evaluation
3. Participation of the educational community
24. Types of Language Tests
Achievement test
associated with process of instruction
assesses where progress has been
made
should support the teaching to which it
relates
Alternative Assessment
need for assessment to be integrated
with the goals of the curriculum
25. Proficiency test
aims to establish a test taker’s
readiness for a particular
communicative role
general measure of “language ability”
measures a relatively stable trait
used to make predictions about future
language performance (Hamp-Lyons,
1998)
high-stakes test
26. Some ways of describing tests
Objective Subjective
Indirect Direct
Discrete-point Integrative
Aptitude / Achievement/
Proficiency Performance
External Internal
Norm-Referenced Criterion-Referenced
27. Evaluating the usefulness of a
language test
Usefulness = reliability + validity + impact +
authenticity + interactiveness + practicality
(Bachman and Palmer, 1996)
[Diagram: test usefulness and its six qualities: reliability, validity, impact, authenticity, practicality, interactiveness]
28. Evaluating the usefulness of a
language test
Essential measurement qualities
reliability
construct validity
Evaluation: test taker - test task - Target
Language Use (TLU)
[Diagram: TLU, test task, test taker]
30. Test of English as a Foreign
Language
One million test takers per
year
P&P 310-677/ CBT 0-300
Three sections:
Listening
Structure and Written
Expression
Reading
Comprehension
TWE
31. Test of English as a Foreign
Language
Objective Subjective
Discrete-point Integrative
Proficiency Achievement
discord between test and understanding of
language and communication
passive recognition of language
cutoff scores are very problematic
general proficiency ≠ academic proficiency
32. Test of English for International
Communication
TOEFL equivalent for
workplace setting
two sections, 200 q.
listening
reading
entertainment,
manufacturing, health,
travel, finance, etc.
“objective and cost-efficient”
33. Test of English for International
Communication
Objective Subjective
Discrete-point Integrative
Proficiency Achievement
lack of correspondence with TLU
34. International English Language
Testing System
Academic/General
Results reported in
band scores 1-9
Listening
General Reading / Academic Reading
General Writing / Academic Writing
Speaking
35. International English Language
Testing System
Objective Subjective
Discrete-point Integrative
Proficiency Achievement
test tasks reflective of academic
tasks
36. Canadian Academic English
Language Assessment
Mirrors language
use in university
Topic-based, integrated
reading, listening,
and writing tasks
provides specific
diagnostic
information
scores are reported
in bands 10-90
37. Canadian Academic English
Language Assessment
Objective Subjective
Discrete-point Integrative
Proficiency Achievement
tests performance and use
diminished gap between test and classroom
validity is supported by teacher evaluations
studies on predicting academic success
38. Washback: The Impact of Tests on
Teaching and Learning
“The power of tests has a strong influence on
curriculum and learning outcomes”
(Shohamy, 1993)
good test ≠ positive washback
form of test impact depends on
antecedent: educational context and condition
process
consequences (Wall,
2000)
39. Critical Language Testing
Focus on consequence and ethics of test
use
Tests are embedded in cultural,
educational, and political arenas
whose agenda?
Questions traditional testing knowledge
English proficiency= academic success?
English: got it or get it!
Responsible test use (Hamp-Lyons, 2000)
40. Testing Questions
What is actually being tested by the test we
are using?
What is the “best” test to use?
What relevant information does the test
provide?
How is testing affecting teaching and
learning behaviour?
Is language testing “fair”?
41. Test design criteria
Usefulness = reliability + validity + impact +
authenticity + interactiveness + practicality
reliability= consistency of measurement
validity= the extent to which the inferences that we make
on the basis of the test are valid given the target language
use situation
authenticity= how closely does the test resemble the
actual language use situation
interactiveness= to what extent is the test taker involved in
active communication
impact= what is the effect of the test on test takers, test
users, teachers etc.
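As a rough illustration of reliability as consistency of measurement, the sketch below estimates test-retest reliability as the Pearson correlation between two sittings of the same test by the same students. The function name `pearson_r` and all scores are invented for illustration; correlating two sittings is one common way to estimate reliability, not a procedure taken from these slides.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary within each sitting")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same five students on two sittings.
first_sitting = [52, 61, 70, 78, 90]
second_sitting = [55, 60, 72, 75, 92]

# A value close to 1 means the test ranks students consistently.
reliability = pearson_r(first_sitting, second_sitting)
print(f"test-retest reliability estimate: {reliability:.2f}")
```

A high value here says only that the test is consistent; it says nothing about whether the test measures what it is intended to measure, which is the question validity addresses.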
42. Time – language level – design
Layout
Theoretical support (one page to explain
the test: why your test is useful,
the type of test)
Score 1 – 5 (create bands for scores)
Make copies for the whole group
15 minutes per skill (except - speaking)
Editor's notes
I would like to begin today’s presentation with a quote which, taken to an extreme, illustrates the effect that high-stakes testing can have on students.
1. It is a common assumption that well-known testing tools are useful for our purposes, because we believe that they are technically sound and that ongoing research is being carried out. However…
2. What is the “best” test to use?
3. Given that we use the results of language proficiency tests to determine, in large part, the academic future of our students, we must ask what relevant information the test provides that would justify this use.
4.
5.
These are questions that I will begin to address in today’s presentation, and we will return to them at the end.
REFER AUDIENCE TO HANDOUT
1. Examples: end-of-course tests, portfolio assessments.
2. We accumulate evidence during, or at the end of, a course of study in order to see whether and where progress has been made in terms of the goals of learning.
3. Achievement tests are designed to measure how much of a syllabus a learner has mastered; thus they are valid only to the extent that the content of the test matches the content of the syllabus.
4. The use of achievement tests allows instructors to be innovative and to reflect progressive aspects of the curriculum. They are thus associated with some interesting new developments, a movement known as alternative assessment.
5. This approach stresses the need for assessment to be integrated with the goals of the curriculum.
- Learners may be encouraged to share responsibility in assessment and be trained to evaluate their own capacities; this is known as self-assessment.
Refer to Brown and Hudson for a detailed discussion of alternative assessment.
1. This is established for university admission, professional certification, the workplace, etc.
2. “Language ability”: consequently not reflective of a specific syllabus.
3. “Stable trait”: scores tend not to change within a short period of time; thus this type of test would not be useful for assessing learning over a few weeks.
- Indeed, such a change would mainly indicate statistical variance.
- However, programs are often pressured to employ such tests in order to determine the effectiveness of teaching.
4. “Predictions”: this is why such tests are used for admissions decisions and consequently are high stakes.
- They determine in great part a student’s academic and economic future.
- Interestingly, Hamp-Lyons notes that the vast majority of people who interpret test scores are neither teachers nor testing professionals; they are administrators.
Objective = no human interference, very highly reliable
Subjective = individuals are involved in the evaluation process
Indirect = we make inferences from the test tasks, e.g. using a sentence-structure question to infer the writing ability of a test taker
Direct = no gap between test task and target language situation, e.g. assessing speaking skills in an interview
Discrete-point = multiple-choice, often isolated items
Integrative = different skills are not separated but assessed holistically
External / Internal
Norm-referenced = a test taker’s performance is evaluated against the range of performances typical of a population of similar test takers
Criterion-referenced = performances are compared to one or more descriptions of adequate performance at a given level, e.g. band scores
Describing and evaluating tests on a continuum allows us to steer away from a black-and-white judgment.
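The contrast between norm-referenced and criterion-referenced interpretation can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: the function names `percentile_rank` and `band_for`, the norm group, and the band cutoffs are all hypothetical and do not come from any of the tests discussed here.

```python
def percentile_rank(score, norm_group):
    """Norm-referenced: percentage of the norm group scoring below `score`."""
    below = sum(1 for s in norm_group if s < score)
    return 100 * below / len(norm_group)


def band_for(score, cutoffs):
    """Criterion-referenced: highest band whose minimum score is met."""
    band = 0
    for level, minimum in sorted(cutoffs.items()):
        if score >= minimum:
            band = level
    return band


# Hypothetical raw scores of ten comparable test takers (the norm group).
norm_group = [40, 55, 60, 62, 70, 75, 81, 88, 90, 95]

# Hypothetical minimum raw score for each band on a 1-5 scale.
cutoffs = {1: 0, 2: 50, 3: 65, 4: 80, 5: 90}

score = 75
print(percentile_rank(score, norm_group))  # position relative to other test takers
print(band_for(score, cutoffs))            # level reached against fixed descriptions
```

The same raw score yields two different kinds of information: the percentile changes whenever the comparison group changes, while the band changes only if the performance descriptions or cutoffs are revised.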
REFER AUDIENCE TO HANDOUT
1. In order to determine which test is “best” for a given assessment situation, we need to evaluate its overall usefulness.
2. Bachman and Palmer include six qualities in their definition of usefulness:
- reliability = consistency of measurement
- validity = the extent to which the inferences that we make on the basis of the test are valid given the target language use situation
- authenticity = how closely the test resembles the actual language use situation
- interactiveness = the extent to which the test taker is involved in active communication
- impact = the effect of the test on test takers, test users, teachers, etc.
3. These qualities are not all granted equal regard, but they must all be considered in order to achieve a desired balance; consequently, the balance will vary from one testing situation to another.
- These elements cannot be evaluated independently but must be looked at in terms of their combined effect; it is OVERALL usefulness that needs to be emphasized rather than individual qualities.
- Evaluation of test usefulness is essentially subjective because it is based on judgements on the part of the test user.
REFER AUDIENCE TO QUESTIONS FOR EVALUATION ON HANDOUT
1. Two essential considerations in the evaluation of test usefulness are reliability and validity.
2. Reliability is necessary because we want to ensure that tests are scored in a reliable and consistent manner. However, strong reliability without validity tells us essentially nothing.
3. Therefore, construct validity is of specific interest to us because it is concerned with the extent to which we can interpret a given test score as an indicator of the ability we want to measure; thus, it addresses the meaningfulness and appropriateness of the interpretations that we make.
4. Threats to construct validity can occur when real requirements of the TLU domain are not fully represented in the test. We frequently hear people complain that even though students score very high on the TOEFL, they lack basic communication skills. This is probably because interaction is not required by the test. The TSE is sometimes employed to remedy this; however, describing how a tourist can find the way to the train station will not necessarily translate into the ability to take part in round-table discussions.
Threats to content validity: the issue is to what extent the test content forms a satisfactory basis for the inferences to be made from performance, e.g. using the TOEFL to make inferences about the ability of an international student to act as a teaching assistant.
If we want to use the scores from a language test to make inferences about individuals’ language ability, and possibly make various types of decisions, we must be able to demonstrate how performance on that language test is related to language use in specific situations other than the language test itself.
That is why, when considering the six qualities just addressed, we always need to examine them in connection with the test taker, the test task, and the Target Language Use. Ideally there should be a seamless connection between these three elements; the greater the distance, the less useful the inferences that we can make.
1. The greatest language test prep industry has developed around this test. The introduction to one test prep book states: “You are well aware that the TOEFL is one of the most important examinations that you will ever take. Your entire future may well depend on your performance in the TOEFL. The results of this test will determine whether you will be admitted to the school of your choice.”
2. One million test takers.
3. The TOEFL is 100% multiple choice.
- It uses “generic, or neutral” language and does not specify a context.
4. Four sections.
- Listening section: test takers are not given the opportunity to preview questions, to see them while listening, or to take notes.
5. Research at TOEFL places heavy emphasis on reliability but provides inadequate validity evidence. New developments include automatic essay scoring, done by computer analysis of written structures, and the TOEFL 2000 project, which aims to make changes to the construct of the test, which dates back to the 1960s.
1. Does not reflect current teaching and learning practices and could thus have negative effects on students and teachers because it is in conflict with those practices.
2. Passive recognition: students who “pass” the test are often unable to communicate. However, institutions and other TOEFL score recipients that note inconsistencies such as high TOEFL scores and apparent weak English proficiency should refer to the photo on the Official Score Report for evidence of impersonation.
3. Cutoff scores: CPA called upon Canadian universities to refrain from using TOEFL as a standard for university admission.
- Contrary to recommendations, decisions are often based solely on the score.
- Interpretation of scores is difficult because the test is norm-referenced and simply provides a number.
- Many have increased TOEFL cutoffs, ranging from 580-600.
- Many who would otherwise be qualified for university admission are denied access.
- After an 8-week summer university orientation program given in English, students’ scores on the TOEFL itself increased from an average of 570 to 601.
- The mean score of native speakers reported by ETS is 590.
4. General proficiency: in his critique of language tests and admission procedures, Elson quoted several studies that have found that merely knowing how a student scored on TOEFL will tell us practically nothing we need to know to predict the student’s academic performance.
5. Dissatisfaction has led to disuse of TOEFL by some, e.g. Australia.
- Misuse of the TOEFL, cycles of raising and lowering requirements.
- TOEFL is used as an initial screen, but other tests have to be taken upon arrival.
1. The listening section includes a variety of statements, questions, and short conversations.
2. The reading section includes incomplete sentences, error recognition, and reading comprehension.
3. Content is drawn from a wide variety of areas.
4. The test is tailored to provide rapid, affordable, and convenient service; it therefore measures only listening and reading, since these can be tested objectively. Testing writing and speaking requires time and expense and is “less objective and less reliable”.
1. Concern with lack of correspondence between test tasks and target language use. The test does not measure speaking: how do you know that a person will be able to communicate in a business setting?
2. It measures only listening and reading but makes inferences about communicative ability.
3. The test content is extremely broad and may in the end not provide any useful information to any of the fields that use this test.
1. 205 test centers in over 100 countries.
2. The test is divided into four modules, which have no central theme or topic but offer separate reading and writing tasks for either general or academic English use.
3. Listening: a number of recorded texts, which increase in difficulty as the test progresses; a mixture of conversations and dialogues. Previewing is allowed.
4. Readings are taken from books, magazines, and journals.
5. Writing includes two tasks:
- A 150-word report based on material found in a table or diagram, demonstrating ability to describe and explain.
- A short essay of 250 words in response to an opinion or a problem, expected to demonstrate ability to discuss issues, construct an argument, and use appropriate tone and register.
6. Speaking is assessed during a 10-15 minute one-on-one interview, which requires the test taker to describe, narrate, and provide explanations on a variety of personal and general-interest topics.
- The listening and reading components are marked with an objective key; the speaking and writing components are marked on a subjective key.
7. The test includes a variety of task and response types.
1. The actual tasks are reflective of academic tasks.
2. The comprehensive scoring structure has the advantage of giving students knowledge of what specific area of language needs special attention.
- When asked whether the subjective component of the assessment procedure might introduce a degree of unfairness into the testing process, Jill Richardson said that if the test is truly to be regarded as a communication-oriented process, personal interaction is a necessary ingredient without which it is difficult to truly establish a person’s capacity to use language.
3. Need for more reliability research.
- The emphasis for UCLES has been on validity, and this is also reflected in their certificate exams. It comes from a tradition where teaching professionals are trusted to make fair judgements.
4. It is one of the two tests accepted by the Canadian government for immigration purposes.
1. Was designed by Carleton U. in response to the perceived failure of standardized tests to effectively identify students who were able to use English at levels required for university study.
2. The test is grounded in day-to-day use of language within first-year courses at the university.
- It is designed not for global knowledge of English but for English-medium academic contexts.
- It attempts to recreate for the test taker the experience of joining an introductory first-year course.
3. Integrated, criterion-referenced, topic-based test for EAP.
- Uses constructed response rather than multiple-choice items.
- There is direct overlap between taking a CAEL assessment, taking an academically oriented ESL course, and taking a first-year course at a university. The overlap is clear in the tasks and activities of the test.
- In this way the test aims to promote positive and useful learning.
- When completing practice tests, students are provided with a conversion key that states which skill is tested by each question.
1. The nature of the test tasks encourages students to make use of their language knowledge and actively engages them.
2. The language skills that are promoted by the test are in line with what a teacher would use in an EAP classroom.
3. Research has shown that teachers evaluate their students’ in-class performances similarly.
4. There is an ongoing tracking study that aims to link test performance with future academic performance.
5. Even though the test was designed to create positive washback for language learners and teachers, some students reportedly have the same studying habits as for the TOEFL: staying at home for independent cramming. This demonstrates that a “positive” test does not have the same impact on all students.
1. Bailey: “There is a natural tendency for both teachers and students to tailor their classroom activities to the demands of the test, especially when the test is very important to the future of the students.”
2. Washback can be either positive or negative to the extent that it promotes or hinders achievement of language learning goals held by learners and educators.
3. Complex interaction of factors.
4. The more information is available to teachers, learners, and test users, and the more they are involved in the testing process, the more likely we are to create positive impact.
- Considering that proficiency tests are the most powerful indicator for determining the academic future of ESL students, discussion needs to start focusing on the ethics and consequences of test use.
- Shohamy introduced the concept of critical language testing.
- This concept builds on a critical pedagogy perspective and emphasizes that the act of testing is both a product and an agent of cultural, social, and political agendas.
- Consequently, the notion of “just a test” does not exist.
- What sort of vision of society does the test create? This question puts at the center the responsibility that test users carry with regard to the consequences of test use.
- We need to examine the extent to which test agendas reflect the interests of the field of language teaching and learning.
- It calls into question traditional testing knowledge that views numbers as symbols of objectivity and truth. These numbers are powerful not only because those who use them consider them truthful but also because they allow classification, quantification, and judgement. Success and failure are determined by arbitrary cutting scores, and all test takers are judged according to the same yardstick.
- There is research to suggest that academic achievement in selected disciplines is hardly affected by degree of English language proficiency. How much do we actually know about the degree of English facility that is required for successful completion? Test developers and experts cannot agree on what indeed the tests measure, and they do not have a clear sense of
We must accept responsibility for all the consequences that we are aware of.
1. There is what the receiving institution wants to know from a test, and there is what the test actually tests; these interests are not necessarily compatible.
2. There is no “best” test. We need to consider all variables to make an appropriate choice.
3. Different tests produce different information. What connection is there between test items that measure surface structure recognition and the ability to be a successful student? If a test is isolated from the reality that the student will experience as a learner, it becomes accordingly less relevant.
4. Impact: many of us may have encountered the answer to this question in our classrooms, when students demand to be taught to the test.
5. Language testing is used as a basis for refusing or admitting a student and thus shifts responsibility away from the institution itself. If students meet the admission requirements to which native speakers are subject, then they should be admitted on the same basis. The provision of opportunities to continue developing English facility is part of a commitment to learning.
6. AERA standards state that test developers should provide information on the strengths and the weaknesses of their instruments. However, the ultimate responsibility for appropriate test use and interpretation lies predominantly with the test user.
7. I hope that this brief overview of language proficiency testing will lead to further reflection on language testing and that these testing questions remain with us.