2. Major developments in England, France, Germany, and the USA
1836: matriculation examinations
1845: first in the USA, the superiority of written examinations over oral quizzes
1853: India Act for impartial selection for the civil service
1858: local examinations at Oxford and Cambridge
Development of the statistical approach in Britain, such as Spearman's contribution
3. Major developments in England, France, Germany, and the USA
1904: Binet in France, development of a series of tests to discriminate unmotivated and incapable children from the others
USA: Yerkes et al., development of intelligence tests for army recruits
Purpose: bring scientific methods to the study of education, such as achievement tests or the development of mental tests
Problem: growing discontent regarding the unreliability of marks and unfair evaluation by human minds
4. The personal equation concern:
Solution: sentence completion, true/false items, multiple-choice selection…
Development of objective and standards-based assessment (its first roots were in the USA, and so validity is a product of North America)
Led to the mushrooming publication of standardized tests and research into tests and testing from 1910 to 1920
5. The outcomes of the pre-1921 era
Structured and objective assessment
Distinction between sub-domains of educational and psychological measurement:
1. Professional communities: diagnosis, achievement, selection
2. Scientific communities: exploring personality characteristics and innate differences
Distinction between different types of tests (linguistic vs. performance, individual vs. group, written and standardized tests)
Recognition of the correlation coefficient as a tool for judging the quality of tests
6. The post-1921 era
The term "validity" began to take root in the lexicon of researchers and practitioners.
1911, Freeman: the technique and validity of test methods
1915, Terman: evaluated the validity of intelligence and IQ tests
1916, Starch: referred to the validity or fairness of measures
1916, Thorndike: the essentials of a valid scale
1919: APA attempts at professional certification, in response to the use of mental tests by unqualified individuals
7. The post-1921 era
1921, NADER (National Association of Directors of Educational Research): sought standardization and consistency among concepts and procedures (similar to APA attempts in 1895 and 1906).
Regulations proposed by them:
1. Preparation and selection
2. Experimental organization of test and instruction
3. Trial of the tentative test
4. Final organization of the test
5. Final conduct of the test (scoring, tabulation, and interpretation)
6. Determine validity
7. Determine reliability
8. Determine norms
8. The first official definition of validity
By NADER
Challenged to promote and develop new methods
The first classic definition of validity:
"The degree to which a test or examination measures what it purports to measure"
The idea of a criterion was central to this, and the dominant approaches were predictive or concurrent ones.
Content considerations existed yet were not significant or robust.
1915–1930 boom period: new tests multiplied like rabbits, with users uncritical of the instruments and the results.
9. Early years:
Oversimplistic descriptions
Elaboration of insights that had been established before
Elevation of empirical evidence at the expense of logical analysis ("dust-bowl empiricism")
According to Shepard, 1920–1950: deference to test-criterion correlations
1940s: validity = the predictive correlation coefficient
According to Kane: the criterion phase
According to Cronbach: the whole of validity theory was prediction
10. Some issues regarding the early years:
1. We cannot ignore the early years
The theory of prediction; descriptive and explanatory investigations
Omitting the early years from the discussion is counterproductive, and we shouldn't teach validity from the baseline of 1954.
Only with reference to the baseline of 1921 can the transition from the trinitarian conception of validity to present-day theory be understood.
11. Some issues regarding the early years:
2. Too many seminal works in the early years
There were too many seminal works, which made it impossible for a coherent tradition to emerge.
Each came with new perspectives.
The 1920s were prolific for educational measurement.
There were differences in perspective among authors within sub-domains as well as across different sub-domains.
12. Some issues regarding the early years:
3. Validity in different ways and phases
Both world wars influenced testing and validation.
Large-scale implementation of mental testing, and a method of scoring by stencil for rapid marking, by Otis during the First World War
The Army Alpha and Beta military aptitude tests gave mental testing publicity and prestige.
Mechanical test construction to predict criterion measures (blindly empirical)
This is only one side of a complex story running from the mid-19th century into the 20th (to 1952).
13. The prediction phase, a caricature:
1) Widespread adoption of blindly empirical methods, specifically aptitude testing for the army
2) The degradation of the classic definition over time: the method for measuring validity was mistaken for the definition of validity. This consists of three stages:
a) Quality of measurement
b) Degree of correlation between test and criterion
c) Correlation coefficient between the test and criterion
14. From a to b: 1922, McCall: only by correlations do we know what a test measures
Classic definition: discrete validity and validation
It was a conceptual abstraction.
A hypothetical true proficiency rank served as an absolute criterion.
There is no single true proficiency rank but a range of ranks.
No sense of prediction, just correlation between actual test results and hypothetical proficiency
15. From a to b: 1922, McCall: only by correlations do we know what a test measures
Two methods to determine the correspondence:
1. Prolonged, careful observation in real-life situations to determine true proficiency and use it as the criterion; rank students on the test; correlate the two sets of ranks
2. Rank pupils with known proficiency on the test; correlate the ranks
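McCall's rank-and-correlate procedure amounts to computing Spearman's rank correlation between the two orderings. A minimal sketch, with hypothetical pupils and scores (none of the data come from McCall):

```python
def ranks(scores):
    """Rank scores from highest (rank 1) to lowest; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """Spearman's rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical true-proficiency scores (from prolonged observation)
# and test scores for five pupils.
true_proficiency = [90, 75, 85, 60, 70]
test_scores      = [88, 70, 80, 55, 72]
print(spearman_rho(true_proficiency, test_scores))  # → 0.9
```

The closer rho is to 1, the better the test ordering reproduces the criterion ordering of the pupils.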
16. Other approaches to developing a criterion:
Expert or teacher judgment
Results of multiple existing tests that measure the same thing
Results from specific tests
17. From b to c: change of criteria from conceptual abstraction to more concrete and pragmatic measures
Coefficient of validity = the correlation coefficient between the test scores and the criterion scores
Validity = observed agreement, rather than a hypothetical agreement between test scores and true proficiency
Validity = empirical correlation
There was no question of the validity of the criterion scores!
Fusion of definition and method
This underscored the use of the test: each test has a different validity with regard to each use.
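The coefficient of validity described here is just the Pearson correlation between test scores and criterion scores. A minimal sketch (all scores are hypothetical):

```python
import math

def validity_coefficient(test, criterion):
    """Pearson correlation between test scores and criterion scores:
    r = cov(test, criterion) / (sd(test) * sd(criterion))."""
    n = len(test)
    mt = sum(test) / n
    mc = sum(criterion) / n
    cov = sum((t - mt) * (c - mc) for t, c in zip(test, criterion))
    var_t = sum((t - mt) ** 2 for t in test)
    var_c = sum((c - mc) ** 2 for c in criterion)
    return cov / math.sqrt(var_t * var_c)

# Hypothetical test scores and criterion scores for six examinees.
test      = [52, 61, 70, 75, 80, 92]
criterion = [50, 60, 65, 78, 82, 95]
print(round(validity_coefficient(test, criterion), 3))
```

On this view the whole question of validity reduces to this one observed number, which is exactly why the unexamined validity of the criterion scores became a problem.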
18. From b to c: change of criteria from conceptual abstraction to more concrete and pragmatic measures
Dominance of an atheoretical definition
Distinction between practical validity and factorial validity
Practical validity: "a test is valid for anything with which it correlates" (Guilford, 1946)
There are two kinds of validity, and practical validity addresses the fundamental question of validity.
Undue emphasis on empirical evidence; the problems: the inadequacy of the definition, and the criterion problem
19. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
1. School achievement – Walter Monroe
Validity as a multifaceted concept based on correlation; a conceptual definition of validity was expressed:
a. Objectivity in describing the performances (rater)
b. Reliability (coefficient of reliability, index of reliability, error of measurement, coefficient of correspondence, overlapping of grade groups)
c. Discrimination (agreement with the normal curve)
d. Comparison with criterion measures
e. Validity inference based on test structure and administration
20. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
Six threats to valid interpretation:
1. Do the tasks require other abilities?
2. Can the tasks be answered in a variety of ways (other than the intended one)?
3. Is the test administered under a variety of conditions?
4. Do students continue to exercise their ability across all tasks?
5. Are the tasks representative of the field of ability being measured?
6. Are all students given this opportunity?
21. Unitary conception of validity:
Integration of multiple sources of empirical evidence and logical analysis
Two primary categories of sources of evidence:
1. Expert opinion vs. experimental – Ruch, 1929
2. Curricular vs. statistical – Ruch, 1933
Three approaches to logical analysis (Ruch, 1929):
1. A competent person's judgment on the appropriateness of content
2. Alignment of content with the textbook
3. Alignment of content with the recommendations of national education committees
22. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
Fundamental role: extensive sampling in school achievement tests, random sampling from the field, or representation of the most important elements, measuring the same thing or attribute
Tests parallel to actual teaching
Centrality of logical analysis
Problem: no field is perfectly homogeneous, so there will always be a certain degree of compromise
Major innovation: scaling, tests with different levels of difficulty; items of a test were no longer selected on the basis of content and effective representation
Problem: tension between discrimination and sampling
23. From random sampling to restricted sampling
It is not possible to construct a robust measure of overall achievement based on weighted sampling of behavior across the entire achievement domain.
So instead of a representative sample, we should tap the essence of achievement.
So those items with a high correlation to general achievement must be selected; each item plays a role in contributing to the essence of the general achievement attribute.
Items that discriminate between high and low students correlate highly with the criterion.
24. From random sampling to restricted sampling
Validity from the curriculum viewpoint and validity from the general achievement viewpoint need to arrive at a compromise.
A large unresolved tension can be detected throughout the study by Lindquist (1936).
25. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
Tyler (1931): validity in terms of the usefulness of the test in measuring the attainment of course objectives
He was not opposed to the empirical approach, but he was not impressed by the use of teacher marks as the empirical criterion.
His suggestion: the development of preliminary tests for each course objective, to help with
1) creating comprehensive criterion measures
2) diagnostic purposes
Then the preparation of some practical tests to be validated by correlation
26. Tyler's concerns:
1. Sampling
2. Test construction
3. Validity
4. Mental processes: no distinction between the content of a subject and the required mental process, and items test information, not the interpretation or application of principles
5. Negative impacts of tests on instruction and the reform of the curriculum: studying and teaching were adapted to the emphases of tests
27. Tension between the empirical and the logical, 1930s–1940s
Overemphasis on the empirical: the inadequacy of criteria for establishing validity, and the backwash effect on teaching and learning
Overemphasis on the logical: the impossibility of representative sampling and the fallibility of human judgment
Tyler: rational hypotheses in test construction
The pendulum swings against empirical considerations (the technician viewpoint)
28. Two key principles in the evaluation movement
1. Evaluation could not begin until the curriculum had been defined in terms of behavioral objectives.
2. Any useful device might be employed in producing an account of pupil growth:
Teacher judgment
Essay examinations
Objective tests
29. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
Logical approach: raw brain power; the Binet-Simon scales were extended.
Problem: a thorough description of the universe of intelligent behavior was not straightforward; there was no clear definition.
Binet: the faculties are different from general intelligence; a single test can be a test of intelligence.
Post-Binet: not a single test, but combined tests (manifold and heterogeneous); performance on a test is the product of both the faculties and general intelligence.
30. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
Solution: permissive sampling, assessing considerably more than the essence of intelligence
Validity can be maximized by intentional construct under-representation or intentional construct-irrelevance.
Assumption: random irrelevant item variance cancels out by the law of averages.
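The cancellation assumption can be illustrated with a small simulation (all numbers are hypothetical): each item mixes the intended attribute with independent irrelevant variance, and the more items are summed, the more closely the total score tracks the attribute.

```python
import random

random.seed(0)

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def total_score_validity(n_examinees=200, n_items=40, noise_sd=2.0):
    """Each item score = true ability + independent irrelevant noise.
    Return the correlation of the total test score with true ability."""
    truth = [random.gauss(0, 1) for _ in range(n_examinees)]
    totals = [sum(t + random.gauss(0, noise_sd) for _ in range(n_items))
              for t in truth]
    return pearson(truth, totals)

# Item-level noise largely cancels over a long test, far less over a short one.
print(total_score_validity(n_items=40))
print(total_score_validity(n_items=2))
```

With 40 items the total correlates above 0.9 with the simulated attribute even though each single item is noisy, which is the "law of averages" at work.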
31. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
Empirical approach:
A criterion measure of intelligence is needed.
During the First World War, a number of reputable tests of higher quality were adopted as yardsticks:
Otis Group test: most valid
Terman Group test
Miller Group test: least valid
Army Alpha
Cattell (1943): promoted factor analysis as an important validation technique and transformed it from a lay activity into a scientific praxis.
32. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
For the purpose of vocational guidance and selection
First assumption: aptitudes are stable, if not innate.
Second assumption: aptitudes differ across and within individuals along continua.
The difference in aptitude measurement: the criterion was not something of the present but of the future.
Successful performance in a vocation = the exercise of skills and abilities that had not yet been developed.
Problem: how should it be validated?
33. The empirical approach to aptitude tests:
The idea of sampling is meaningless here, which led to the elevation of empirical approaches in four stages:
1) Administer the aptitude test.
2) Wait until the required skills and abilities are acquired.
3) Assess job proficiency in situ.
4) Correlate the test results with the assessment of job proficiency.
34. The empirical approach to aptitude tests:
Absence of clear rational principles
Development based on a haphazard, trial-and-error search for effective predictors, with minimal rationality
Large lists of preferences to discriminate between professions
Selection of items with a high correlation to the criterion in a successive fashion (the multiple regression challenge): low inter-item correlation and high correlation with the criterion (a weakness of aptitude tests)
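The successive selection logic described above, keeping items that correlate highly with the criterion but little with the items already chosen, can be sketched as a greedy loop. The function names and the data below are purely illustrative:

```python
def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def select_items(items, criterion, max_inter=0.5):
    """Greedy, blindly empirical item selection.
    items: dict mapping item name -> list of examinee scores."""
    # Rank candidates by absolute correlation with the criterion.
    ranked = sorted(items, key=lambda k: -abs(pearson(items[k], criterion)))
    chosen = []
    for name in ranked:
        # Skip items that overlap too much with what is already selected.
        if all(abs(pearson(items[name], items[c])) < max_inter for c in chosen):
            chosen.append(name)
    return chosen

criterion = [1, 2, 3, 4, 5]
items = {
    "item_a": [1, 2, 3, 5, 4],   # r = 0.9 with the criterion
    "item_b": [2, 1, 3, 4, 5],   # r = 0.9, but r = 0.8 with item_a
    "item_c": [3, 5, 1, 4, 2],   # r = -0.3 with the criterion
}
print(select_items(items, criterion))  # → ['item_a', 'item_c']
```

Note how item_b, a strong predictor, is dropped purely because it duplicates item_a: the procedure optimizes prediction with no rational account of what the surviving items measure.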
35. The Achilles heel of aptitude testing
Robust criterion measures
Validity of the criterion measures
Two major components of the criterion problem:
1. The definition of the criterion: subjective judgment and a widespread lack of agreement over occupational success
2. The development of a procedure to measure the criterion
36. Thorndike (1949): 3 categories of criteria
1. Ultimate category: the complete final goal of a particular type of selection; multifaceted and not available for direct study
2. Intermediate category
3. Immediate category
Validation will fall back on categories 2 and 3.
Blind empiricism is fragile and dangerous; this was repeatedly said by Messick from the 1970s to the 1990s.
37. Mid-1940s: Paul Meehl and Lee Cronbach, construct validity
Paul Meehl:
Dissatisfied with client self-ratings
A self-rating should not be used as a behavior surrogate but as an indirect sign of something deeper,
because it requires:
1. An appropriate level of self-understanding
2. Willingness to disclose
38. Mid-1940s: Paul Meehl and Lee Cronbach, construct validity
Lee Cronbach:
The impact of item format
Response set: the tendency to respond to items in characteristic ways
Six kinds of response sets: giving many responses, speed, accuracy, gambling…
A threat to validity: different individuals demonstrate different response sets on the same set of items
Solution: use true/false items less and multiple-choice items more
39. Cronbach (1949): 5 technical criteria of a good test
1. Validity
2. Reliability
3. Objectivity
4. Norms
5. Good items
Two approaches: logical analysis (psychological understanding of the attribute) and empirical evidence
40. Cronbach (1949): validity as the correspondence of the test to the definition of the attribute
There are items that correspond to the definition of the attribute yet bring in irrelevant variables that make the items impure:
1. Items that different test takers answer using different methods
2. Items with limited access for some test takers from certain cultural groups
3. Items that are vulnerable to response sets
4. Items that correspond to the content yet fail to assess the desired processes
41. Cronbach (1949): ultimate considerations
1. Logical analysis is inferior to empirical evidence.
2. The most frequently used criteria: instructor or supervisor ratings, and other tests of the same attribute
3. Discussed the criterion problem in depth
4. The rise of a particular empirical approach: factorial validity, the degree to which a test purely measures one type of ability