Chile's university admission test, the PSU, has been sold as a test that can do anything one might want a test to do. The result is that it does none of those things well. It should be scrapped in favor of new tests modeled on the old system: a universal aptitude test, with highly focused content tests for those faculties that want them.
2. Large-scale testing: Uses and abuses
1. 3 types of large-scale tests
2. Measuring test quality
3. A chronology of mistakes
4. Economists misunderstand testing
5. How SIMCE is affected
3. 1. Three types of large-scale tests
Achievement
Aptitude
Non-cognitive
4. Achievement tests
Historically, were larger versions of classroom tests
~ 1900 - “scientific” achievement tests developed
(Germany & USA)
J.M. Rice - systematically analyzed test structures & effects
E.L. Thorndike - developed scoring scales
SOURCE: Phelps, Standardized Testing Primer, 2007
5. Achievement tests
Purpose: to measure how much you know and can recall
Developed using: content coverage analysis
How validated: retrospective or concurrent validity
(correlation with past measures, such as high school grades)
Requires a mastery of content prior to test.
Fairness assumes that all have same opportunity to learn content
Coachable – specific content is known in advance
SOURCE: Phelps, Standardized Testing Primer, 2007
6. Aptitude tests
1890s – A. Binet & T. Simon (France)
- Pre-school children with mental disabilities
- achievement test not possible
- developed content-free test of mental abilities
(association, attention, memory, motor skills, reasoning)
1917 – Adapted by U.S. Army to select, assign soldiers in World War 1
1930s – Harvard University president J. Conant
- wanted new admission test to identify students from lower social
classes with the potential to succeed at Harvard
- developed the first Scholastic Aptitude Test (SAT)
SOURCE: Phelps, Standardized Testing Primer, 2007
7. Aptitude tests
Purpose: predict how much can be learned
Developed using: skills/job analysis
How validated: predictive validity, correlation with future activity (e.g.,
university or job evaluations)
Content independent. Measures:
… what student does with content provided
… how student applies skills & abilities developed over a lifetime
Not easily coachable – the content is either…
… not known in advance,
… basic, broad, commonly known by all, curriculum-free;
… less dependent on the quality of schools
SOURCE: Phelps, Standardized Testing Primer, 2007
8. Aptitude tests
Aptitude tests can identify:
- Students bored in school who study what interests them on their own
- Students not well adapted to high school, but well adapted to university
- Students of high ability stuck in poor schools
SOURCE: Phelps, Standardized Testing Primer, 2007
9. Comparing Achievement & Aptitude tests
              Achievement        Aptitude
Measure       past learning      potential
Development   content analysis   job/skills analysis
Validation    retrospective      predictive
Content       dependent          independent
Coachable?    very much          not much
10. Non-cognitive tests
More recently developed
– measure values, attitudes, preferences
Types:
integrity tests
career exploration
matchmaking
employment “fit”
11. Non-cognitive tests
Purpose: to identify “fit” with others or a situation
Developed using: surveys, personal interviews
How validated? success rate in future activities
Content is personal, not learned
“Faking” can be an issue (e.g., “honesty” tests)
12. Comparing Achievement, Aptitude, & Non-Cognitive Tests
              Achievement        Aptitude             Non-Cognitive
Measure       past learning      potential            attitudes, values, preferences
Development   content analysis   job/skills analysis  surveys
Validation    retrospective      predictive           predictive
Content       dependent          independent          independent
Coachable?    very much          very little          can be faked
13. 2. Measuring test quality
Test reports can be “data dumps”
3 measures are important:
1. Predictive validity
2. Content coverage
3. Sub-group differences
14. Predictive validity
(values from -1.0 to +1.0)
…measures how well higher scores on the admission test match better outcomes at university (e.g., grades, completion)
A test with low predictive validity provides little information.
15. A positive correlation between two measures
Source: NIST, Engineering Statistics Handbook
16. A negative correlation between two measures
Source: NIST, Engineering Statistics Handbook
18. How does one measure predictive capacity?
Correlation coefficient: a scale running from -1 through 0 to +1
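As a minimal sketch of how such a coefficient might be computed, the Python below calculates a Pearson correlation between admission test scores and a later university outcome. The function name `pearson_r` and the sample numbers are hypothetical, chosen only for illustration; they are not taken from the Pearson or CTA reports.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: admission test scores and first-year university grades
test_scores = [450, 520, 580, 610, 700, 740]
first_year_grades = [4.1, 4.0, 4.8, 5.2, 5.0, 6.1]

r = pearson_r(test_scores, first_year_grades)
print(f"predictive validity (r) = {r:.2f}")
```

A value near +1 would mean that higher test scores go with better outcomes, a value near 0 that the test says little about outcomes, and a value near -1 that higher scores go with worse outcomes.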
19. Predictive validities: SAT and PSU
[Bar chart. Vertical axis: predictive validity, 0 to 0.6. Categories: Language, Mathematics, SAT Writing, PSU Social Science. Series: SAT, PSU 2010.]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
20. Predictive validities: SAT and PSU
(faculty: Administracion)
[Bar chart. Vertical axis: predictive validity, 0 to 0.6. Categories: Language, Mathematics, SAT Writing, PSU Social Science. Series: SAT, PSU Administracion.]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
21. Predictive validities: SAT and PSU
(faculty: Arquitectura)
[Bar chart. Vertical axis: predictive validity, 0 to 0.6. Categories: Language, Mathematics, SAT Writing, PSU Social Science. Series: SAT, PSU Arquitectura.]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
22. Predictive validities: SAT and PSU
(faculty: Educacion)
[Bar chart. Vertical axis: predictive validity, 0 to 0.6. Categories: Language, Mathematics, SAT Writing, PSU Social Science. Series: SAT, PSU Educacion.]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
23. Predictive validities: ACT and PSU
[Bar chart. Vertical axis: predictive validity, 0 to 0.6. Categories: Language, Mathematics, Social Science, Science. Series: ACT, PSU.]
SOURCE: ACT, Research Summary Services, 1997-1998; Pearson, Final Report Evaluation of the Chile PSU, January 2013
24. Predictive validities of the PSU
(CTA v Pearson estimates)
[Bar chart. Vertical axis: predictive validity, 0 to 0.6. Categories: Language, Mathematics. Series: CTA, Pearson.]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013; CTA
25. Incremental Predictive validities (engineering):
(controlling for NEM)
[Bar chart. Vertical axis: 0 to 35. Groups: Language & Math (U. Chile, PUC); Language & Math + subject test (U. Chile, PUC). Series: PAA, PSU.]
SOURCE: S.A. Prado, Estudio de Validez Predictiva de la PSU y Comparacion con el Sistema
PAA, Universidad de Chile
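One common way to express incremental predictive validity is the gain in explained variance (R²) when test scores are added to a model that already contains high school grades (NEM). The sketch below uses the standard two-predictor formula for R². It is an illustration only: the function names and data are hypothetical, and the Prado study may have used a different statistic or scale.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def incremental_r2(nem, test, outcome):
    """R^2 gained by adding `test` to a model that already contains `nem`."""
    r_y1 = pearson_r(outcome, nem)   # outcome vs. high school grades
    r_y2 = pearson_r(outcome, test)  # outcome vs. admission test
    r_12 = pearson_r(nem, test)      # overlap between the two predictors
    if 1 - r_12 ** 2 < 1e-12:
        raise ValueError("predictors are collinear; the gain is undefined")
    r2_base = r_y1 ** 2
    # Squared multiple correlation for a two-predictor linear model
    r2_full = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return r2_full - r2_base


# Hypothetical data for six students
nem = [5.0, 5.5, 5.8, 6.0, 6.3, 6.7]
test = [520, 610, 560, 700, 640, 760]
university_grades = [4.2, 4.9, 4.5, 5.6, 5.1, 6.0]

gain = incremental_r2(nem, test, university_grades)
print(f"R^2 added by the test beyond NEM: {gain:.3f}")
```

The more a test overlaps with grades students already bring (a high `r_12`), the less new information it can add, even when its simple correlation with university outcomes looks respectable.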
26. Content coverage (values from 0% to 100%)
…how much of the content domain of a test has been taught in the schools.
It is not fair to expect students to master content to which they have not been exposed…
…or to compare students who have been exposed to that content with students who have not.
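A minimal sketch of the coverage idea: treat the tested content domain and the content a school has taught as sets, and report the share of the former found in the latter. The topic names, school records, and function names are hypothetical; the Mineduc study's actual method is not described in these slides.

```python
from statistics import mean


def coverage_pct(tested_topics, taught_topics):
    """Percentage of the tested content domain that a school has taught."""
    tested = set(tested_topics)
    if not tested:
        raise ValueError("the tested content domain is empty")
    return 100.0 * len(tested & set(taught_topics)) / len(tested)


def summarize_by_type(schools, tested_topics):
    """Return {school_type: (min, mean, max)} of coverage percentages."""
    by_type = {}
    for school_type, taught in schools:
        by_type.setdefault(school_type, []).append(
            coverage_pct(tested_topics, taught)
        )
    return {t: (min(v), mean(v), max(v)) for t, v in by_type.items()}


# Hypothetical content domain and schools (unit of analysis: the school)
TESTED = {"algebra", "geometry", "functions", "probability"}
SCHOOLS = [
    ("Municipal", {"algebra"}),
    ("Municipal", {"algebra", "geometry", "functions"}),
    ("Pagado", {"algebra", "geometry", "functions", "probability"}),
]

for school_type, (lo, avg, hi) in summarize_by_type(SCHOOLS, TESTED).items():
    print(f"{school_type}: min {lo:.0f}%, mean {avg:.0f}%, max {hi:.0f}%")
```

Reporting the range alongside the mean matters here, because two school types with similar average coverage can still differ greatly in how many of their schools teach almost none of the tested content.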
27. Percentage curricular coverage in Chilean high schools, by type of school: 2012
Mathematics, Level 1
[Chart. Vertical axis: percentage coverage, 0 to 100. Categories: Municipal, Subvencionado, Pagado.]
SOURCE: Centro de Estudios Mineduc, Cobertura Curricular en Ensenanza Media
Lenguaje y Comunicacion – Matematica, Septiembre 2012
28. Percentage curricular coverage in Chilean high schools, by type of school: 2012
Language & Communication, Level 2
[Chart. Vertical axis: percentage coverage, 0 to 100. Categories: Municipal, Subvencionado, Pagado.]
SOURCE: Centro de Estudios Mineduc, Cobertura Curricular en Ensenanza Media
Lenguaje y Comunicacion – Matematica, Septiembre 2012
29. Percentage curricular coverage in Chilean high schools, by type of school: 2012
Mathematics, Level 3
[Chart. Vertical axis: percentage coverage, 0 to 100. Categories: Municipal, Subvencionado, Pagado.]
SOURCE: Centro de Estudios Mineduc, Cobertura Curricular en Ensenanza Media
Lenguaje y Comunicacion – Matematica, Septiembre 2012
30. Percentage curricular coverage in Chilean high schools, by type of school: 2012
Language & Communication, Level 4
[Chart. Vertical axis: percentage coverage, 0 to 100. Categories: Municipal, Subvencionado, Pagado.]
SOURCE: Centro de Estudios Mineduc, Cobertura Curricular en Ensenanza Media
Lenguaje y Comunicacion – Matematica, Septiembre 2012
31. Percentage curricular coverage in Chilean high schools, by type of curriculum: 2012
Mathematics, Level 4
[Chart. Vertical axis: percentage coverage, 0 to 100. Categories: Humanista Cientifica, Tecnico Profesional, Polivalente.]
SOURCE: Centro de Estudios Mineduc, Cobertura Curricular en Ensenanza Media
Lenguaje y Comunicacion – Matematica, Septiembre 2012
32. Percentage curricular coverage in Chilean high schools, by type of curriculum: 2012
Language & Communication, Level 4
[Chart. Vertical axis: percentage coverage, 0 to 100. Categories: Humanista Cientifica, Tecnico Profesional, Polivalente.]
SOURCE: Centro de Estudios Mineduc, Cobertura Curricular en Ensenanza Media
Lenguaje y Comunicacion – Matematica, Septiembre 2012
33. Percentage of Chilean high schools with full curricular coverage, by subject area: 2012
Levels 1--4
[Stacked bar chart. Vertical axis: percentage of schools, 0% to 100%. Categories: Mathematics, Language & Communication. Series: Cover 100%, Do NOT Cover 100%.]
SOURCE: Centro de Estudios Mineduc, Cobertura Curricular en Ensenanza Media
Lenguaje y Comunicacion – Matematica, Septiembre 2012
34. Subgroup differences
Differences in test scores among subgroups (e.g., gender, ethnic, school type) should be due only to differences in the attribute measured by the test and not to systematic biases in the test.
38. Growing gaps in PSU Mathematics raw & adjusted scores, by type of curriculum: 2002-2010
[Chart, titled “Brechas PSU Matemáticas para toda la muestra” (PSU Mathematics gaps, full sample). Vertical axis: gap (“Brechas”), 0 to 200. Horizontal axis: year, 2002 to 2010. Series: PP Muni-TP unadjusted gap (Brecha Sin Ajustar), PP Muni-TP adjusted gap (Brecha Ajustada), PP Muni-CH unadjusted gap, PP Muni-CH adjusted gap.]
SOURCE: Koljatic, Silva, & Phelps, Consequential Tests and Conflicts of Interest: The Case of Chile’s PSU, forthcoming, 2014
39. Growing gaps in PSU Language & Communication raw & adjusted scores, by type of curriculum: 2002-2010
[Chart, titled “Brechas PSU Lenguaje para toda la muestra” (PSU Language gaps, full sample). Vertical axis: gap (“Brechas”), 0 to 200. Horizontal axis: year, 2002 to 2010. Series: PP Muni-TP unadjusted gap (Brecha Sin Ajustar), PP Muni-TP adjusted gap (Brecha Ajustada), PP Muni-CH unadjusted gap, PP Muni-CH adjusted gap.]
SOURCE: Koljatic, Silva, & Phelps, Consequential Tests and Conflicts of Interest: The Case of Chile’s PSU, forthcoming, 2014
40. 3. A chronology of mistakes
2000, initial proposal, SIES/PSU project
This proposal attempts a redesign of the tests currently used to select
students for higher education in Chile. It is expected that [this new test
will] have a positive impact in the efficiency of the selection
process, improving the psychometric properties of the measuring
instruments, and establishing a better articulation between the
selection system and the secondary education curriculum.
SOURCE: Proyecto FONDEF, Reformulacion de las Pruebas de Seleccion a la Educacion Superior
41. A chronology of mistakes (cont.)
2001 (World Bank & MINEDUC)
…the Academic Aptitude Test for entry to the university system is under
revision, together with the universities belonging to the Council of Rectors.
This instrument of entry selection, needs also to be aligned with the new
curriculum and may become an exit exam from the secondary
education system.
SOURCE: World Bank, Implementation Completion Report on a Loan in the Amount of $35 million to the Republic of
Chile for Secondary Education, 2001
42. A chronology of mistakes (cont.)
2005 (World Bank)
…The new law adopted in May 2005 (Bulletin 3223-04) established a system of
student loans available to all students achieving a threshold score in the
University Admission Exam (PSU). …the new system does not impede students
unable to provide collateral from financing their studies. The new system
promises to improve equity further by increasing options for talented
students from non-affluent families to access higher education.
SOURCE: IMPLEMENTATION COMPLETION REPORT (TF-25378 SCL-44040 PPFB-P3360) ON A LOAN IN THE AMOUNT
OF US$145.45 MILLION TO THE REPUBLIC OF CHILE FOR THE HIGHER EDUCATION IMPROVEMENT
PROJECT, December 2005
43. A chronology of mistakes (cont.)
2009 (OECD & World Bank)
[One option for revising admission testing] would be for Chile to move
away from a university entry test towards a national school leaving
test or set of tests – ideally, not simple multiple choice tests but
longer exams, which test both knowledge and candidates’ ability to
think and to apply knowledge. Such school leaving exams or tests
could also remove the need for a separate school leaving
certificate, by having two pass levels, the lower level equivalent to the NEM
and the higher level setting the minimum standard for entry to an academic or
professional degree course.
SOURCE: OECD & World Bank, Tertiary Education in Chile, 2009
44. A chronology of mistakes (cont.)
2009 (OECD & World Bank)
The second option [to revising admissions testing] would be to reform the
PSU by incorporating elements other countries consider useful and
important in identifying the students most likely to benefit from HE. These
elements would include extended essays and questions designed to
test reasoning ability and learning potential. They could also include
personal statements which could cover non-curricular
experience, personal motivation and interest in the programme.
Again, there should be a variant for vocational secondary school students.
SOURCE: OECD & World Bank, Tertiary Education in Chile, 2009
45. A chronology of mistakes (cont.)
2010 (World Bank)
Over time the government should consider replacing the university
entry exam with a national school leaving exam as the prime
criterion for entry into tertiary education institutions. This could
establish a closer link between test results and the school that is responsible
for them, making it easier to reach the goal that has been pursued with the
introduction of the PSU.
SOURCE: N. Brandt, CHILE: CLIMBING ON GIANTS' SHOULDERS: BETTER SCHOOLS FOR ALL CHILEAN CHILDREN;
ECONOMICS DEPARTMENT WORKING PAPERS No. 784
46. A chronology of mistakes (cont.)
2010 (World Bank)
There is evidence that central curriculum based exit exams are strongly and
positively related to student academic performance (Wößmann, 2005;
Bishop, 2006). To allow students to show in more detail their knowledge and
their ability to apply it, the school exit exam could be a bit more in-depth
than the multiple-choice PSU, including verbal and nonverbal
reasoning.
SOURCE: N. Brandt, CHILE: CLIMBING ON GIANTS' SHOULDERS: BETTER SCHOOLS FOR ALL CHILEAN CHILDREN;
ECONOMICS DEPARTMENT WORKING PAPERS No. 784
48. Testing & Measurement PhD program
(University of Massachusetts, USA, 2013-2014)
EDUC 501 Classroom Assessment
EDUC 553 Construction, Validation, and Uses of Criterion-Referenced Tests
EDUC 555 Introduction to Statistics & Computer Analysis I
EDUC 632 Principles of Educational & Psychological Testing
EDUC 637 Non-Parametric Statistics Analysis
EDUC 656 Introduction to Statistical & Computer Analysis II
EDUC 661 Educational Research Methods I
EDUC 727 Scale and Instrument Development
EDUC 731 Structural Equation Modeling
EDUC 735 Advanced Theory & Practice of Testing I
EDUC 736 Advanced Theory & Practice of Testing II
EDUC 771 Application of Applied Multivariate Statistics I
EDUC 772 Application of Applied Multivariate Statistics II
EDUC 821 Advanced Validity Theory & Test Validation
49. How economists misunderstand testing - 1
Increasing an admission test’s correlation with high school work can decrease its correlation with university work
50. How economists misunderstand testing - 2
Incentives aren’t all that matter in improving efficiency;
…also important: more and better information, better classification & allocation
51. How economists misunderstand testing - 3
Incentives generally work best when applied to the actor responsible for the target behavior;
…currently, students bear the consequences when schools do not teach the curriculum tested on the PSU
52. How economists misunderstand testing - 4
Many useful and successful tests serve multiple purposes.
But some purposes are compatible and some are not.
Responsible authorities have argued that the PSU will:
1. Measure the implementation of a new curriculum;
2. Fairly measure mastery of two very different curricula;
3. Incentivize high schools to implement the new curriculum;
4. Incentivize high school students to study more;
5. Predict success in university generally;
6. Predict success across very different types of university programs;
7. Reduce socio-economic disparities.
53. The PSU: A test at war with itself
(a science-humanities exit exam, sold originally as a science-humanities
curriculum coverage survey, that is used as an entry exam for all students)
Expected to do too many things…
…it does none of them well,
…and makes some of them worse.
54. You cannot get there from here
The PSU cannot be “fixed”; it is fundamentally flawed.
A non-cognitive test, used as a high-stakes admission test, will exacerbate the problems. It is easily faked. Wealthier students will pay for coaching and the scores will be invalid.
The old system – PAA + PCEs – was a sensible system.
55. Other options to consider
Option for Technical-Professional Graduates:
As is done in Germany, offer a short course on the scientific-humanistic 11th & 12th grade curricula, with an exam at the end, for technical-professional graduates who decide after graduation that they wish to change careers.
Create a separate test for technical-professional graduates to enter university.
ETS & Pearson recommendations:
Lessen the content in the PSU to the common level (10th grade) and to that which is genuinely necessary for a good prediction.
56. How the PSU Runs:
• CRUCh: "owners" of the PSU
• Comité Técnico Asesor (CTA) para la PSU: designated by CRUCh as supervisors of DEMRE and official evaluators of the PSU
• DEMRE: responsible for developing test items, test assembly, test administration, test scoring, the application system for CRUCh and associated universities, etc.
• Ministry of Education: has funded the system since 2007 (fee waivers)
1/23/2014
58. 5. How SIMCE is affected
What does this have to do with SIMCE?
Most people do not see the differences among tests. In public perception, one bad test makes all tests look bad.
SIMCE’s largest challenge may be the loss of public goodwill towards all testing.
59. “If a thing exists, it exists in some amount. If it exists in some amount, then it is capable of being measured.”
--Rene Descartes, Principles of Philosophy, 1644
Editor's notes
Sorry, but this presentation will be in English. However, these slides will be made available later in both English and Spanish.
Used by many North American universities to test students AFTER admission, on first day at school. Used for advising and placement.
Scatterplot shows the relationship between two factors – for example, high school grades and university grades
From the Ministry of Education content coverage study. The unit of analysis is the school. The vertical line shows the range of coverage: some schools teach NONE of the required curriculum; some teach all. The horizontal line is the mean coverage.
I assume that schools with 0% coverage are behind in the curriculum because their students are behind, and are still teaching lower-grade content.
Now we compare by curriculum.
Scatterplot of student performance against the impact of socio-economic background on that performance. Chile is in the below-average performance / above-average effect of SES quadrant, along with Mexico, Brazil, and the USA.
Scatterplot of PISA scores and PISA SES index shows clear correlation for Chile by type of school: municipal, subsidized, & private fee-based.
Trends in enrollment in Chile up to 2006 – high school and university showing opposite trends. SES gap in high school narrowed. SES gap in university widened.
Converting from classical to IRT may “modernize” a test, but does not necessarily improve its psychometric properties. The articulation now seems worse than it was with the PAA and the PCEs.
When a test becomes an exit exam from the previous level of schooling, it is no longer an “academic aptitude test”; it becomes an achievement test based on the previous curriculum. In some states of the USA, ACT administers tests that are highly predictive of future university work, and also good surveys of a student’s mastery of the high school curriculum. But the two functions are accomplished by separate items among the tests administered: subject-area achievement tests are added to the regular ACT test, and the collection of tests is administered over two or three days.
Obviously, the new system has not improved equity; in fact, it has decreased it.
It is ironic, because the World Bank and the OECD advocated and helped to finance the move away from tests that measured “candidates’ ability to think and to apply knowledge.”
Again, the World Bank and OECD are now recommending ability testing, which they earlier helped to eliminate. But, they are also recommending using non-cognitive tests – that are easily faked – as part of the university entrance requirement. A non-persevering student can easily claim to be persevering if he knows it will help get him into university.
Actually, the PSU is a curricular-based achievement test – the type typically appropriate for use as an exit exam – but it is being used as an entry exam. This recommendation is directly opposite the previous two World Bank recommendations.
There are hundreds of studies by psychologists and psychometricians the World Bank could have cited. Instead, they cite two studies by economists that are only superficially appropriate for this issue. The fact alone that test items are constructed response rather than multiple-choice does not make them any more in-depth. One can write very superficial constructed-response items, and very deep, complex multiple-choice items.
Suppose you have a leaky faucet and would like it to be fixed. Which of these professionals would you call?
The courses taken by a testing and measurement student in one US PhD program.
No single test can possibly do all of this.
My opinion.
It can be very discouraging if a decision a student (or the student’s parents) makes at age 14 will determine the student’s career …forever, and cannot ever be altered.