De champlain agm_sunday_2012

The Top 10 Myths on Standard Setting
André De Champlain, PhD
Consulting Chief Research Psychometrician &
Interim Director of R&D, MCC

The Need to Make Decisions
• The need to make classifications
permeates many aspects of daily life
• Classifications required by law
• E.g.: Passing an examination to obtain a driver’s
license requires meeting a certain level of
proficiency with regard to knowledge of traffic
laws and performance (passing, parallel parking,
etc.)

• Keeps unsafe motorists from getting behind the wheel!

• The need to make classifications permeates
many aspects of daily life
• Classifications required by law
• E.g.: Jury rendering an (impartial) verdict in a
criminal trial classifies a defendant as “guilty” or
“not guilty” after weighing the evidence of a case,
i.e., analyzing the facts

• Sentence meted out for incapacitation (“protection
of the public”), deterrence, denunciation,
rehabilitation, etc.


• The need to make classifications permeates
many aspects of professional life
• Classifications required within a
profession
• Medical licensing/registration examination programs
• LMCC®, USMLE®, PLAB®, AMC®, etc.
• Medical specialty board examination programs
• ABMS®
• Medical membership organizations
• RCP(UK), RCPSC, CFPC, RACP, RACGP, etc.

Aspect | Jury | Standard Setting Panel
Composition | Impartial panel that represents the population | Impartial panel that represents a profession
Size | Randomly selected group of citizens (to satisfy social decision rules) | Randomly selected panel that is sufficiently large and representative to define the standard
Task | Renders a verdict | “Renders” a decision
Purpose | Incapacitation, rehabilitation | Protection of the public, remediation

• A number (the “cut-score”) can be used to
differentiate between several “states or
degrees of performance”
• Pass / Fail
• Grant / Withhold a Credential
• Award / Deny a license
• Grant / Deny membership
• Basic / Proficient / Advanced (Honors)
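
As a minimal sketch of how cut-scores partition a score scale into categories, the snippet below maps a score to one of the three levels named above. The cut-score values (60 and 80) and the rule that a score at or above a cut-score falls into the higher category are assumptions chosen purely for illustration.

```python
from bisect import bisect_right

# Hypothetical cut-scores on a 0-100 scale (illustrative values only).
CUT_SCORES = [60, 80]
LABELS = ["Basic", "Proficient", "Advanced"]


def classify(score, cuts=CUT_SCORES, labels=LABELS):
    """Return the performance category for a score.

    A score at or above a cut-score is placed in the higher category.
    """
    # bisect_right counts how many cut-scores are <= score.
    return labels[bisect_right(cuts, score)]


if __name__ == "__main__":
    for s in (59, 60, 79, 80):
        print(s, classify(s))
```

With a single cut-score and the labels "Fail" and "Pass", the same function produces a pass/fail decision.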

• To make “sensible” decisions, information is
needed
• Decision makers need relevant and accurate information
• Standard setting is the process by which these informed
decisions are arrived at

• Standard setting can be defined as the proper
following of a prescribed, rational system of
rules or procedures resulting in the
assignment of a number to differentiate
between two or more states or degrees of
performance (Cizek, 1993)
• Addresses “procedural due process” (legal framework)

• “Procedural due process”
• Was the standard setting exercise well
documented?
• Description of the standard setting exercise
• Selection of judges
• Overview of training
• Definition of the borderline candidate
• Judges’ assessment of each phase of the
exercise and overall cut-score
• What did the judges think of the exercise?

• Consequential impact of standard
setting?
• What are the outcomes of the process?
• Substantive aspect of standard setting
• Did the process lead to a “fair” decision?
• Consequential aspect of test score use (Messick,
1989)
• What are the intended and unintended
consequences of implementing a standard?
• Several sources of (empirical) evidence can be
presented to support the fairness and
appropriateness of the decision

• How popular is standard setting?
• MEDLINE/PUBMED database
• Nearly 600 articles published in this topic area
• ERIC (Educational Resources Information
Center)
• Over 750 articles published in this domain
• Despite the immense popularity of standard
setting, basic misconceptions still persist
• What are common “myths”
surrounding standard setting?

Myth #1: Standard = Cut-score
• Performance standard (Kane, 2001)
• Qualitative description of an acceptable level of
performance and knowledge required in practice
• “Conceptual” definition of competence
• Performance standard is a construct
• Example – MCCQE Part I
• The candidate who passes the Medical Council
of Canada’s Qualifying Examination Part I
(MCCQE Part I) has demonstrated knowledge,
clinical skills, and attitudes necessary for entry
into supervised clinical practice, as outlined by
the Medical Council of Canada’s Objectives

• Passing score or cut-score (Kane, 2001)
• Selected point on the score scale that
corresponds to this performance standard
• “Operational” definition of competence
• Cut-score is a number

• Example – MCCQE Part I
• A candidate who scores at or above 390 has
met the performance standard defined for the
Medical Council of Canada’s Qualifying
Examination Part I (MCCQE Part I)

Myth #2: There is a “Gold Standard”
• Standard setting entails eliciting
judgments on what cut-score best
represents “competence”
• All cut-scores are intrinsically subjective in
nature

• Cut-scores can, and will vary as a function of
several factors including, but not limited to,
the method selected to set the standard and
the panel of participating judges

• Cut-scores do not exist externally
• The aim of standard setting is not to
“discover” some true or preexisting cut-score
that separates candidates into mutually
exclusive categories (e.g.: competent vs.
incompetent)

• Standard setting is a process that
synthesizes human judgment in a rational
and defensible way to facilitate the
partitioning of a score scale into 2 or more
categories

• Cut-scores do not exist externally
• Standards do not exist externally, i.e., outside
the realm of human opinion

• “a right answer [in standard setting] does not
exist, except, perhaps, in the minds of those
providing judgment” (Jaeger, 1989)

• Empirical evidence can help standard setting
panels translate (policy-based) judgment onto
a score scale in a defensible manner

Myth #3: Standard Setting is a
Psychometric Exercise
• Standard setting lies at the
“intersection” of science and art
• While we can facilitate the standard setting
process using psychometric models, a cut-
score is ultimately based on human judgment
• Our statistical models can help us to
systematize the process, i.e., to translate a
policy decision into a cut-score using
defensible, well-defined procedures; however,
they cannot be used to estimate some “true”
cut-score that separates masters from non-
masters

• Standards for Educational and
Psychological Testing (1999; p.54)
• “Cut-scores embody value judgments as well
as technical and empirical considerations”

• Given that human judgment and opinion play
such a significant role in this process, a cut-
score can be regarded as a composite that
incorporates considerations that originate
from a number of arenas including medical,
statistical, educational, social, political and
economic

Myth #4: The Cut-score is Set by a
Standard Setting Panel
• A standard setting panel does not “set a cut-
score” but rather recommends a cut-score
value or standard

• The actual cut-score is set by the governing
body that legitimizes the process and the
use of the cut-score to make pass/fail
decisions
• e.g.: Legislative body, academy, certification
specialty board, a college, etc.

• The role of the standard setting panel is to provide
guidance & information to those bodies that
actually are responsible for implementing a given
cut-score value or standard

• The goal of periodic standard setting exercises is to
revisit the appropriateness of a cut-score (not
necessarily change it) based on replicated exercises
and informed expert judgment

Myth #5: Some Standard Setting
Methods Are Better Than Others
• “There can be no single method for determining
cut-scores for all tests or for all purposes, nor
can there be any single set of procedures for
establishing their defensibility.”
• Angoff (1988; p.219)
• “[Regarding] the problem of setting cut-scores,
we have observed that the several judgmental
methods not only fail to yield results that agree
with one another, they even fail to yield the
same results on repeated application.”

• No standard setting method yields an “optimal” cut-
score (standards don’t exist outside of the minds of
judges)
• Extent to which a standard setting process is
properly followed has the most impact on the
cut-score
• Was the purpose of the exam and the standard setting
exercise clearly defined?
• Were the judges qualified to perform the task?
• Was adequate training offered to panelists? Etc.

• Factors to consider when selecting a
standard setting method
• A. What is the purpose of the examination?
• With professional exams, norm-referenced
approaches are appropriate in instances where a
limited number of candidates can meet the cut-score
• Placement, promotion, awards, etc.

• In most instances, criterion-referenced approaches
are more suitable
• Medical licensure/certification decisions, passing a clerkship/
internship, etc.

• B. How complex is the examination?
• For knowledge-based exams (e.g.: dichotomously-
scored MCQs), test-centered methods (Angoff, Ebel,
Bookmark, etc.) are appropriate given the task that
candidates are required to complete
• For performance assessments (OSCEs, workplace-
based assessments, etc.), examinee-centered
approaches (borderline groups, contrasting-groups,
body of work methods) are better suited given the
complex, multidimensional nature of the performance

• C. What is the test format?
• Certain standard setting methods were developed
solely for use with MCQs (e.g.: Nedelsky).

• While other methods can be used with different
formats (e.g. Angoff methods), they rest on
assumptions that may or may not hold for a given exam
(e.g.: Angoff assumes a compensatory model)

• Other methods (Hofstee, contrasting-groups) were
developed to be test-format invariant

• D. What resources are available?
• In very high-stakes settings (e.g. medical licensing
exam), a complex standard setting exercise which
includes several panels of judges, extensive training,
multiple rounds of judgments, etc., might be preferable

• In lower-stakes settings (elective clerkship
examination), less intensive models might be
appropriate

• What makes the most sense given the intended
use of the information?

• Why not combine several standard
setting procedures?
• Standard setting and the selection of a
cut-score are policy decisions
• There’s little empirical evidence to suggest that
combining multiple methods will lead to a “better”
standard
• There is no “correct” cut-score, so how can policy
makers synthesize results from multiple approaches?
• Also requires significantly more resources

• It is always better to systematically
implement one standard setting method
than to provide results from several
(poorly) implemented approaches
• Properly document all phases of standard
setting
• Objective, selection of participants, training, etc.
• Provide empirical evidence to support use
of cut-score
• Impact of sources of variability (judges, panels, etc.)
• Consequences of implementing a cut-score
• Surveys, etc.

Myth #6: Expert Clinicians are de
facto Expert Standard Setting Judges
• Selection and training of judges are most critical
to the success of any standard setting
exercise
• However, being a content expert is not
synonymous with being an expert standard setting
judge

• Participating standard setting judges need to
be carefully trained to ensure that they
understand the task and to minimize biases

• “Care must be taken to assure that judges
understand what they are to do. The process
must be such that well-qualified judges can
apply their knowledge and experience to
reach meaningful and relevant judgments
that accurately reflect their understandings
and intentions.”

• Training usually includes the following
steps:
1. Provision of sample materials (test specifications,
blueprint, sample items/stations, etc.)
2. Clear presentation of the purpose of standard
setting and what we are asking of participants
3. Discussion and definition of what constitutes a
borderline candidate
4. Judgments on a set of exemplars
5. Discussion and clarification of any misconceptions
amongst participants
6. Survey participants on all aspects of training

Myth #7: We “Know” Who the
Truly Competent Candidates Are
• Classification errors are always present
in standard setting
• High-quality exams and well-implemented standard
setting exercises can significantly reduce the
proportion of misclassifications
• False positive misclassification
• Candidate who “truly” lacks the knowledge, skill and/or
ability necessary to pass the examination, but actually
passes
• False negative misclassification
• Candidate who “truly” possesses the knowledge, skill
and/or ability necessary to pass the examination, but
actually fails

• Why do classification errors occur?
• Cut-scores represent inferences about the “real”
or “true” level of knowledge and skill possessed by
candidates
• The quality of those inferences is related to
a number of factors:
• The number of items/cases sampled for the standard
setting exercise
• The number of judges selected and their degree of
representativeness, etc.
• Consequently, pass/fail classifications of
candidates will always be somewhat
imperfect

• We can’t actually identify individual false positive
and false negative misclassifications
• If we knew a candidate was a false negative,
we’d do something about it!
• We can estimate misclassification errors using
a host of statistical indices (Brennan, 2004)
• In medicine, protection of the public is a
prime concern of examinations
• Minimizing false positive misclassifications
is generally of greater interest

Myth #8: All Decisions Are Created
Equal
• For fairness reasons, failing candidates are
generally allowed to retake an examination
(sometimes repeatedly)

• Millman (1989) showed that the greater the
number of (repeat) attempts to pass an
exam, the greater the likelihood that a
candidate who does not possess the level of
knowledge or skill needed to pass, will indeed
pass (false positive)

• This phenomenon can be attributed to a
number of reasons including:
• Possible re-exposure of material (security issue)
• (Compounded) measurement errors
associated with each test score
• The more times a candidate repeats, the more likely
their score will be sufficiently high (overestimated) to
result in a false positive decision
• This could significantly impact safe and effective
patient care given the link between medical licensing
exam scores and future egregious acts in practice
(Tamblyn et al. research)

• How serious is the effect of repeat
attempts on false positive rates?

• Millman example (1989)
• Let’s assume that a cut-score is 70% on an
exam
• A candidate with a true ability of 65% (should
fail) has a greater than 50/50 chance of passing
the exam after 5 attempts due to measurement
error (even with a highly reliable MCQ exam)
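
The arithmetic behind this kind of example can be sketched as follows. This is not Millman's own calculation: it assumes a simple binomial error model (each of 100 hypothetical items answered correctly with probability equal to the true ability), independent attempts, and no learning or item re-exposure between attempts. The function names and the 100-item test length are illustrative assumptions.

```python
from math import comb


def pass_probability(true_ability, min_correct, n_items):
    """P(number correct >= min_correct) on one attempt (binomial model)."""
    return sum(
        comb(n_items, k) * true_ability**k * (1 - true_ability) ** (n_items - k)
        for k in range(min_correct, n_items + 1)
    )


def pass_within_attempts(p_single, attempts):
    """P(at least one pass) across independent attempts."""
    return 1 - (1 - p_single) ** attempts


if __name__ == "__main__":
    # Cut-score of 70% on an assumed 100-item exam; true ability of 65%.
    p_once = pass_probability(0.65, 70, 100)
    for attempts in range(1, 6):
        print(attempts, round(pass_within_attempts(p_once, attempts), 3))
```

Under these assumptions the chance of a false positive on any single attempt is modest, but it accumulates with each retake, which is the pattern the example above describes.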

• How might we control for this effect?
• Increase the size of the item/station bank to
reduce the likelihood that previously seen material
will appear on repeat test attempts
• Incorporate item/station exposure as a constraint
when assembling test forms
• Adjust the cut-score to minimize misclassifications
• A panel sets the standard at 65%
• We can adjust the cut-score so that a candidate
with a true ability level of 65% (true master) has
a near zero probability of being misclassified

Myth #9: A Cut-score/Standard
Does Not Need to Be Evaluated
• A cut-score reflects the (informed) judgments
of a small sample of experts, based on a
sample of items/stations, at a specific point
in time, using one or only a few methods
• Cut-scores can and will vary as a function of
these factors that need to be evaluated
• Evidence to support both the “internal” and
“external” validity of your cut-score should be
collected and presented to support its intended
use

• Evaluating your standard
• Internal validation
• How reproducible is the cut-score across
facets?
• Judges (inter-rater consistency)?
• Sample of stations?
• Panels of judges? Etc.
• Generalizability analysis and rater models
(IRT) are useful to help us assess how variable
the cut-score is across these facets
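
As a first, purely descriptive look at reproducibility across judges, the sketch below summarizes how much individual judges' cut-scores vary around the panel value. It is far simpler than a generalizability analysis or an IRT rater model and is offered only as an illustration; the function name and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def cut_score_summary(judge_cut_scores):
    """Return (panel cut-score, SD across judges, standard error of the mean)."""
    n_judges = len(judge_cut_scores)
    if n_judges < 2:
        raise ValueError("at least two judges are needed to estimate variability")
    panel_cut = mean(judge_cut_scores)
    sd = stdev(judge_cut_scores)   # sample SD across judges
    se = sd / sqrt(n_judges)       # uncertainty of the panel mean
    return panel_cut, sd, se


if __name__ == "__main__":
    # Made-up cut-scores (percent correct) recommended by four judges.
    panel_cut, sd, se = cut_score_summary([62, 65, 68, 65])
    print(panel_cut, round(sd, 2), round(se, 2))
```

A large standard error relative to the score scale would suggest that a different panel of judges could have recommended a noticeably different cut-score.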

• Evaluating your standard
• External validation
• How do the decisions relate to other measures?
• If scores on two exams are highly related, but
decision consistency is low, perhaps the cut-
score on one assessment is not appropriate?
• Impact
• How comparable are P/F rates to historical
trends?
• Does the cut-score lead to “acceptable”
results?
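
One way to quantify how pass/fail decisions on two exams relate is sketched below: the proportion of candidates who receive the same decision on both exams, together with Cohen's kappa to correct for chance agreement. The choice of these two indices and the data are assumptions made for illustration.

```python
def decision_agreement(decisions_a, decisions_b):
    """Return (observed agreement, Cohen's kappa) for paired pass/fail decisions.

    Each argument is a list of booleans (True = pass), one per candidate.
    Kappa is returned as None when chance agreement equals 1.
    """
    n = len(decisions_a)
    if n == 0 or n != len(decisions_b):
        raise ValueError("inputs must be non-empty and of equal length")
    observed = sum(a == b for a, b in zip(decisions_a, decisions_b)) / n
    pass_a = sum(decisions_a) / n
    pass_b = sum(decisions_b) / n
    # Agreement expected if the two sets of decisions were independent.
    expected = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    if expected == 1:
        return observed, None
    return observed, (observed - expected) / (1 - expected)


if __name__ == "__main__":
    exam_a = [True, True, True, False, False, True, True, False]
    exam_b = [True, True, False, False, False, True, True, True]
    print(decision_agreement(exam_a, exam_b))
```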

Myth #10: The Angoff Method Was
Developed by Angoff
• Angoff did not formally develop the
(Angoff) standard setting method
• Origin can be traced back to a footnote in a
chapter on scales, norms and equivalent
scores that Angoff wrote in 1971
• Angoff ascribed the procedure to Tucker
• Method was a “systematic procedure for
deciding on the minimum raw scores for
passing and honors”

“a slight variation of this procedure is to
ask each judge to state the probability that
the ‘minimally acceptable person’ would
answer each item correctly. In effect, the
judges would think of a number of
minimally acceptable persons, instead of
only one such person, and would estimate
the proportion of minimally acceptable
persons who would answer each item
correctly. The sum of the probabilities, or
proportions, would then represent the
minimally acceptable score” (Angoff, 1971, p. 515).
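
The procedure described in the footnote can be sketched in a few lines. Averaging each item's ratings across judges before summing is an assumption reflecting common practice, not something stated in the quotation; the function name and the ratings are hypothetical.

```python
def angoff_cut_score(ratings):
    """Return the raw cut-score from a judges-by-items table of probabilities.

    ratings[j][i] is judge j's estimate of the probability that a
    minimally acceptable person answers item i correctly.
    """
    n_judges = len(ratings)
    n_items = len(ratings[0])
    if any(len(judge) != n_items for judge in ratings):
        raise ValueError("every judge must rate every item")
    # Average the judges' estimates for each item, then sum over items.
    item_means = [
        sum(judge[i] for judge in ratings) / n_judges for i in range(n_items)
    ]
    return sum(item_means)


if __name__ == "__main__":
    # Two judges rating four items (made-up values).
    ratings = [
        [0.5, 0.75, 1.0, 0.25],
        [0.5, 0.25, 1.0, 0.75],
    ]
    print(angoff_cut_score(ratings))
```

The result is on the raw-score scale: the expected number of items a minimally acceptable person would answer correctly.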
