2. Major developments in England, France, Germany, and the USA
1836: matriculation examinations
1845: first in the USA, the superiority of written examinations over oral quizzes
1853: India Act for impartial selection for the civil service
1858: local examinations at Oxford and Cambridge
Development of the statistical approach in Britain, such as Spearman's contribution
3. Major developments in England, France, Germany, and the USA
1904: Binet in France, development of a series of tests to discriminate unmotivated and incapable children from the others
USA: Yerkes et al., development of intelligence tests for army recruits
Purpose: bring scientific methods to the study of education, such as achievement tests or the development of mental tests
Problem: growing discontent regarding the unreliability of marks and unfair evaluation by human minds
4. The personal equation concern:
Solution: sentence completion, true/false items, multiple-choice selection…
Development of objective and standards-based assessment (its first roots were in the USA, and so validity is a product of North America)
Led to the mushrooming publication of standardized tests and research into tests and testing from 1910 to 1920
5. The outcomes of the pre-1921 era
Structured and objective assessment
Distinction between sub-domains of educational and psychological measurement:
1. Professional communities: diagnosis, achievement, selection
2. Scientific communities: exploring personality characteristics and innate differences
Distinction between different types of tests (linguistic vs. performance, individual vs. group, written and standardized tests)
Recognition of the correlation coefficient as a tool for judging the quality of tests
6. The post-1921 era
The term "validity" began to take root in the lexicon of researchers and practitioners.
1911, Freeman: the technique and validity of test methods
1915, Terman: evaluated the validity of intelligence and IQ tests
1916, Starch: referred to the validity or fairness of measures
1916, Thorndike: the essentials of a valid scale
1919: APA attempts at professional certification, in response to the use of mental tests by unqualified individuals
7. The post-1921 era
1921, NADER (National Association of Directors of Educational Research): sought standardization and consistency among concepts and procedures (similar to APA attempts in 1895 and 1906).
Regulations proposed by them:
1. Preparation and selection
2. Experimental organization of test and instruction
3. Trial of the tentative test
4. Final organization of the test
5. Final conduct of the test (scoring, tabulation, and interpretation)
6. Determine validity
7. Determine reliability
8. Determine norms
8. The first official definition of validity
By NADER
Challenged to promote and develop new methods
The first classic definition of validity:
"The degree to which a test or examination measures what it purports to measure"
The idea of a criterion was central to this, and the dominant approaches were predictive or concurrent ones.
Content considerations existed yet were not significant or robust.
1915–1930 boom period: new tests multiplied like rabbits, with users uncritical of the instruments and the results.
9. Early years:
Oversimplistic descriptions
Elaboration of insights that had been established before
Elevation of empirical evidence at the expense of logical analysis ("dust-bowl empiricism")
According to Shepard, 1920–1950: deference to test-criterion correlations
1940s: validity = the predictive correlation coefficient
According to Kane: the criterion phase
According to Cronbach: the whole of validity theory was prediction
10. Some issues regarding the early years:
1. We cannot ignore the early years
The theory of prediction; descriptive and explanatory investigations
Omitting the early years from the discussion is counterproductive, and we shouldn't teach validity from the baseline of 1954.
Only with reference to the baseline of 1921 can the transition from the trinitarian conception of validity to present-day theory be understood.
11. Some issues regarding the early years:
2. Too many seminal works in the early years
There were too many seminal works, which made it impossible for a coherent tradition to emerge.
Each came with new perspectives.
The 1920s were prolific for educational measurement.
There were differences in perspective among authors within sub-domains as well as across different sub-domains.
12. Some issues regarding the early years:
3. Validity in different ways and phases
Both world wars influenced testing and validation.
Large-scale implementation of mental testing, and a method of scoring by stencil for rapid marking, by Otis during the First World War
The Army Alpha and Beta military aptitude tests gave mental testing publicity and prestige.
Mechanical test construction to predict criterion measures (blindly empirical)
This is only one side of a complex story running from the mid-19th century into the 20th (to 1952).
13. The prediction phase, a caricature:
1) Widespread adoption of blindly empirical methods, specifically aptitude testing for the army
2) The degradation of the classic definition over time: the method for measuring validity was mistaken for the definition of validity. This consists of three stages:
a) Quality of measurement
b) Degree of correlation between test and criterion
c) Correlation coefficient between the test and criterion
14. From a to b: 1922, McCall: only by correlations do we know what a test measures
Classic definition: discrete validity and validation
It was a conceptual abstraction.
A hypothetical true proficiency rank served as an absolute criterion.
There is no single true proficiency rank but a range of ranks.
No sense of prediction, just correlation between actual test results and hypothetical proficiency
15. From a to b: 1922, McCall: only by correlations do we know what a test measures
Two methods to determine the correspondence:
1. Prolonged, careful observation in real-life situations to determine true proficiency and use it as the criterion; rank students on the test; correlate the two sets of ranks
2. Rank pupils with known proficiency on the test; correlate the ranks
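McCall's rank-and-correlate procedure amounts to computing Spearman's rank correlation between the two orderings. A minimal sketch, with hypothetical pupils and scores (none of the data come from McCall):

```python
def ranks(scores):
    """Rank scores from highest (rank 1) to lowest; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """Spearman's rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical true-proficiency scores (from prolonged observation)
# and test scores for five pupils.
true_proficiency = [90, 75, 85, 60, 70]
test_scores      = [88, 70, 80, 55, 72]
print(spearman_rho(true_proficiency, test_scores))  # → 0.9
```

The closer rho is to 1, the better the test ordering reproduces the criterion ordering of the pupils.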
16. Other approaches to developing a criterion:
Expert or teacher judgment
Results of multiple existing tests that measure the same thing
Results from specific tests
17. From b to c: change of criteria from conceptual abstraction to more concrete and pragmatic measures
Coefficient of validity = the correlation coefficient between the test scores and the criterion scores
Validity = observed agreement, rather than a hypothetical agreement between test scores and true proficiency
Validity = empirical correlation
There was no question of the validity of the criterion scores!
Fusion of definition and method
This underscored the use of the test: each test has a different validity with regard to each use.
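The coefficient of validity described here is just the Pearson correlation between test scores and criterion scores. A minimal sketch (all scores are hypothetical):

```python
import math

def validity_coefficient(test, criterion):
    """Pearson correlation between test scores and criterion scores:
    r = cov(test, criterion) / (sd(test) * sd(criterion))."""
    n = len(test)
    mt = sum(test) / n
    mc = sum(criterion) / n
    cov = sum((t - mt) * (c - mc) for t, c in zip(test, criterion))
    var_t = sum((t - mt) ** 2 for t in test)
    var_c = sum((c - mc) ** 2 for c in criterion)
    return cov / math.sqrt(var_t * var_c)

# Hypothetical test scores and criterion scores for six examinees.
test      = [52, 61, 70, 75, 80, 92]
criterion = [50, 60, 65, 78, 82, 95]
print(round(validity_coefficient(test, criterion), 3))
```

On this view the whole question of validity reduces to this one observed number, which is exactly why the unexamined validity of the criterion scores became a problem.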
18. From b to c: change of criteria from conceptual abstraction to more concrete and pragmatic measures
Dominance of an atheoretical definition
Distinction between practical validity and factorial validity
Practical validity: "a test is valid for anything with which it correlates" (Guilford, 1946)
There are two kinds of validity, and practical validity addresses the fundamental question of validity.
Undue emphasis on empirical evidence; the problems: the inadequacy of the definition, and the criterion problem
19. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
1. School achievement – Walter Monroe
Validity as a multifaceted concept based on correlation; a conceptual definition of validity was expressed:
a. Objectivity in describing the performances (rater)
b. Reliability (coefficient of reliability, index of reliability, error of measurement, coefficient of correspondence, overlapping of grade groups)
c. Discrimination (agreement with the normal curve)
d. Comparison with criterion measures
e. Validity inference based on test structure and administration
20. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
Six threats to valid interpretation:
1. Do the tasks require other abilities?
2. Can the tasks be answered in a variety of ways (other than the intended one)?
3. Is the test administered under a variety of conditions?
4. Do students continue to exercise their ability across all tasks?
5. Are the tasks representative of the field of ability being measured?
6. Are all students given this opportunity?
21. Unitary conception of validity:
Integration of multiple sources of empirical evidence and logical analysis
Two primary categories of sources of evidence:
1. Expert opinion vs. experimental – Ruch, 1929
2. Curricular vs. statistical – Ruch, 1933
Three approaches to logical analysis (Ruch, 1929):
1. A competent person's judgment on the appropriateness of content
2. Alignment of content with the textbook
3. Alignment of content with the recommendations of national education committees
22. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
Fundamental role: extensive sampling in school achievement tests, random sampling from the field, or representation of the most important elements, measuring the same thing or attribute
Tests parallel to actual teaching
Centrality of logical analysis
Problem: no field is perfectly homogeneous, so there will always be a certain degree of compromise
Major innovation: scaling, tests with different levels of difficulty; items of a test were no longer selected on the basis of content and effective representation
Problem: tension between discrimination and sampling
23. From random sampling to restricted sampling
It is not possible to construct a robust measure of overall achievement based on weighted sampling of behavior across the entire achievement domain.
So instead of a representative sample, we should tap the essence of achievement.
So those items with a high correlation to general achievement must be selected; each item plays a role in contributing to the essence of the general achievement attribute.
Items that discriminate between high and low students correlate highly with the criterion.
24. From random sampling to restricted sampling
Validity from the curriculum viewpoint and validity from the general achievement viewpoint need to arrive at a compromise.
A large unresolved tension can be detected throughout the study by Lindquist (1936).
25. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
Tyler (1931): validity in terms of the usefulness of the test in measuring the attainment of course objectives
He was not opposed to the empirical approach, but he was not impressed by the use of teacher marks as the empirical criterion.
His suggestion: the development of preliminary tests for each course objective, to help with
1) creating comprehensive criterion measures
2) diagnostic purposes
Then the preparation of some practical tests to be validated by correlation
26. Tyler's concerns:
1. Sampling
2. Test construction
3. Validity
4. Mental processes: no distinction between the content of a subject and the required mental process, and items test information, not the interpretation or application of principles
5. Negative impacts of tests on instruction and the reform of the curriculum: studying and teaching were adapted to the emphases of tests
27. Tension between the empirical and the logical, 1930s–1940s
Overemphasis on the empirical: the inadequacy of criteria for establishing validity, and the backwash effect on teaching and learning
Overemphasis on the logical: the impossibility of representative sampling and the fallibility of human judgment
Tyler: rational hypotheses in test construction
The pendulum swings against empirical considerations (the technician viewpoint)
28. Two key principles in the evaluation movement
1. Evaluation could not begin until the curriculum had been defined in terms of behavioral objectives.
2. Any useful device might be employed in producing an account of pupil growth:
Teacher judgment
Essay examinations
Objective tests
29. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
Logical approach: raw brain power; the Binet-Simon scales were extended.
Problem: a thorough description of the universe of intelligent behavior was not straightforward; there was no clear definition.
Binet: the faculties are different from general intelligence; a single test can be a test of intelligence.
Post-Binet: not a single test, but combined tests (manifold and heterogeneous); performance on a test is the product of both the faculties and general intelligence.
30. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
Solution: permissive sampling, assessing considerably more than the essence of intelligence
Validity can be maximized by intentional construct under-representation or intentional construct-irrelevance.
Assumption: random irrelevant item variance cancels out by the law of averages.
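The cancellation assumption can be illustrated with a small simulation (all numbers are hypothetical): each item mixes the intended attribute with independent irrelevant variance, and the more items are summed, the more closely the total score tracks the attribute.

```python
import random

random.seed(0)

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def total_score_validity(n_examinees=200, n_items=40, noise_sd=2.0):
    """Each item score = true ability + independent irrelevant noise.
    Return the correlation of the total test score with true ability."""
    truth = [random.gauss(0, 1) for _ in range(n_examinees)]
    totals = [sum(t + random.gauss(0, noise_sd) for _ in range(n_items))
              for t in truth]
    return pearson(truth, totals)

# Item-level noise largely cancels over a long test, far less over a short one.
print(total_score_validity(n_items=40))
print(total_score_validity(n_items=2))
```

With 40 items the total correlates above 0.9 with the simulated attribute even though each single item is noisy, which is the "law of averages" at work.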
31. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
Empirical approach:
A criterion measure of intelligence is needed.
During the First World War, a number of reputable tests of higher quality were adopted as yardsticks:
Otis Group test: most valid
Terman Group test
Miller Group test: least valid
Army Alpha
Cattell (1943): promoted factor analysis as an important validation technique and transformed it from a lay activity into a scientific praxis.
32. Terman (1928): 3 primary concerns of educational and psychological measurement
1. achievement 2. intelligence 3. aptitudes
For the purpose of vocational guidance and selection
First assumption: aptitudes are stable, if not innate.
Second assumption: aptitudes differ across and within individuals along continua.
The difference in aptitude measurement: the criterion was not something of the present but of the future.
Successful performance in a vocation = the exercise of skills and abilities that had not yet been developed.
Problem: how should it be validated?
33. The empirical approach to aptitude tests:
The idea of sampling is meaningless here, which led to the elevation of empirical approaches in four stages:
1) Administer the aptitude test.
2) Wait until the required skills and abilities are acquired.
3) Assess job proficiency in situ.
4) Correlate the test results with the assessment of job proficiency.
34. The empirical approach to aptitude tests:
Absence of clear rational principles
Development based on a haphazard, trial-and-error search for effective predictors, with minimal rationality
Large lists of preferences to discriminate between professions
Selection of items with a high correlation to the criterion in a successive fashion (the multiple regression challenge): low inter-item correlation and high correlation with the criterion (a weakness of aptitude tests)
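The successive selection logic described above, keeping items that correlate highly with the criterion but little with the items already chosen, can be sketched as a greedy loop. The function names and the data below are purely illustrative:

```python
def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def select_items(items, criterion, max_inter=0.5):
    """Greedy, blindly empirical item selection.
    items: dict mapping item name -> list of examinee scores."""
    # Rank candidates by absolute correlation with the criterion.
    ranked = sorted(items, key=lambda k: -abs(pearson(items[k], criterion)))
    chosen = []
    for name in ranked:
        # Skip items that overlap too much with what is already selected.
        if all(abs(pearson(items[name], items[c])) < max_inter for c in chosen):
            chosen.append(name)
    return chosen

criterion = [1, 2, 3, 4, 5]
items = {
    "item_a": [1, 2, 3, 5, 4],   # r = 0.9 with the criterion
    "item_b": [2, 1, 3, 4, 5],   # r = 0.9, but r = 0.8 with item_a
    "item_c": [3, 5, 1, 4, 2],   # r = -0.3 with the criterion
}
print(select_items(items, criterion))  # → ['item_a', 'item_c']
```

Note how item_b, a strong predictor, is dropped purely because it duplicates item_a: the procedure optimizes prediction with no rational account of what the surviving items measure.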
35. The Achilles heel of aptitude testing
Robust criterion measures
Validity of the criterion measures
Two major components of the criterion problem:
1. The definition of the criterion: subjective judgment and a widespread lack of agreement over occupational success
2. The development of a procedure to measure the criterion
36. Thorndike (1949): 3 categories of criteria
1. Ultimate category: the complete final goal of a particular type of selection; multifaceted and not available for direct study
2. Intermediate category
3. Immediate category
Validation will fall back on categories 2 and 3.
Blind empiricism is fragile and dangerous; this was repeatedly said by Messick from the 1970s to the 1990s.
37. Mid-1940s: Paul Meehl and Lee Cronbach, construct validity
Paul Meehl:
Dissatisfied with client self-ratings
A self-rating should not be used as a behavior surrogate but as an indirect sign of something deeper,
because it requires:
1. An appropriate level of self-understanding
2. Willingness to disclose
38. Mid-1940s: Paul Meehl and Lee Cronbach, construct validity
Lee Cronbach:
The impact of item format
Response set: the tendency to respond to items in characteristic ways
Six kinds of response sets: giving many responses, speed, accuracy, gambling…
A threat to validity: different individuals demonstrate different response sets on the same set of items
Solution: use true/false items less and multiple-choice items more
39. Cronbach (1949): 5 technical criteria of a good test
1. Validity
2. Reliability
3. Objectivity
4. Norms
5. Good items
Two approaches: logical analysis (psychological understanding of the attribute) and empirical evidence
40. Cronbach (1949): validity as the correspondence of the test to the definition of the attribute
There are items that correspond to the definition of the attribute yet bring in irrelevant variables that make the items impure:
1. Items that different test takers answer using different methods
2. Items with limited access for some test takers from certain cultural groups
3. Items that are vulnerable to response sets
4. Items that correspond to the content yet fail to assess the desired processes
41. Cronbach (1949): ultimate considerations
1. Logical analysis is inferior to empirical evidence.
2. The most frequently used criteria: instructor or supervisor ratings, and other tests of the same attribute
3. Discussed the criterion problem in depth
4. The rise of a particular empirical approach: factorial validity, the degree to which a test purely measures one type of ability