Dr. Serkan Toy (Children's Mercy Hospital Kansas City) summarizes current literature on assessment, evaluations, rubrics, and Global Assessment Scales from the perspective of Psychometrics.
1. Psychometrics for Clinical Skills Assessment
Serkan Toy, PhD
Director of Evaluation and Program Development
Graduate Medical Education
Children’s Mercy Hospital – Kansas City
2. Outline
• What is psychometrics?
– Measurement
– Construct development
• Reliability
• Validity
• A few other issues to consider
– Checklists vs. Global ratings
3. Psychometrics
• Educational & Psychological Measurement
– measurement of knowledge, abilities, attitudes, and personality traits
– mainly concerned with the construction and validation of measurement instruments (e.g., cognitive tests, surveys/questionnaires, and personality assessments)
4. Measurement
• Assigning numerals to observations based on
some pre-defined criteria
• Or assigning a value to one object/observation
in relation to another
– Intelligence - IQ
– Personality - Big Five
– Academic or procedural performance
7. Construct
Well-defined vs. Ill-defined
What do we already know about the construct in
question?
• Epistemological Beliefs
• Self efficacy
• Academic Performance
• Procedural Competency
10. Reliability
“The more consistent the scores are over
different raters and occasions, the more reliable
the assessment is thought to be” (Moskal & Leydens,
2000 as cited in Jonsson & Svingby, 2007)
• Test-retest
• Inter-rater
• Binary vs. Likert scale
• Rubrics and calibration process
Jonsson, A. & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational
consequences. Educational Research Review, 2, 130-144.
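As a rough illustration of inter-rater consistency on binary (performed / not performed) items, the sketch below computes raw agreement corrected for chance using Cohen's kappa. Kappa is not named on the slide; it is swapped in here as one common choice for two raters and a categorical scale. The function name and the ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items on a categorical scale."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same score
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (observed - expected) / (1 - expected)


# Hypothetical binary checklist scores from two raters (1 = performed)
rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_2 = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")
```

For Likert-type ratings, a weighted kappa or an intraclass correlation is usually more appropriate, since plain kappa treats a 1-vs-5 disagreement the same as a 4-vs-5 one.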
11. Rubrics for Assessment
• “An assessment tool that describes levels of
performance on a particular task” (Jonsson & Svingby,
2007)
• Analytic, topic-specific rubrics seem to enhance reliable
scoring of performance especially when accompanied by
examples and/or rater training
• Example:
– Objective assessment of surgical competence in
gynaecological laparoscopy: development and validation of
a procedure-specific rating scale Larsen et al. (2008)
Larsen C.R., Soerensen J.L., Grantcharov T.P., Dalsgaard T, Schouenborg L., Ottosen C.,
Schroeder T.V., Ottesen B.S. (2008) Effect of virtual reality training on laparoscopic surgery:
randomised controlled trial. BMJ, 14, 338, b1802.
12. Larsen et al. 2008
Example rating scale (each item scored 1 2 3 4 5; descriptors listed from lowest to highest anchor)
• Economy of movements
– Many unnecessary movements
– Efficient motion but some unnecessary motion
– Maximum economy of movements
• Economy of time
– Too long time used to perform sufficiently
– Intermediate time used to perform sufficiently
– Minimal time used to perform sufficiently
• Errors: respect for tissue
– … … …
• Flow of operation
– … … …
14. Conceptual Change in Validity
“Validity is not a property of the test or
assessment as such, but rather of the
meaning of the test scores. These scores
are a function not only of the items or
stimulus conditions, but also of the persons
responding as well as the context of the
assessment. In particular, what needs to be
valid is the meaning or interpretation of the
score; as well as any implications for action
that this meaning entails” (p. 741).
Messick, S. (1995) Validity of psychological assessment: validation of inferences from
persons’ responses and performance as scientific inquiry into score meaning. American
Psychologist, 50, 741–9.
15. Validity
• Construct Validity
– Content / Face
– Convergent
– Discriminant
– Predictive
The question is: What can we validly conclude
about a trainee who receives a score of “X”
vs. one who receives a score of “Y”?
16. Construct Validity
• In competency assessment instruments, validity
is usually established by examining whether
they distinguish between groups logically
presumed to differ on the competency being
measured
– Experienced practitioners vs. trainees or
– Peer nominated superior performers vs. average
performers
Scofield, M. E., & Yoxtheimer, L. Y. (1983). Psychometric issues in the
assessment of clinical competencies. Journal of Counseling Psychology.
30, 413-420.
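The known-groups logic above can be sketched as a comparison of mean scores between two groups. The example below computes Welch's t statistic using only the Python standard library. The choice of Welch's test, the function name, and the scores are all assumptions made for illustration; none of them come from the cited paper.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent groups (unequal variances allowed)."""
    # Standard error of the difference in means, using sample variances
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical total scores on a competency instrument
experienced = [24, 26, 22, 27, 25, 23]
trainees = [18, 21, 17, 20, 19, 22]

t = welch_t(experienced, trainees)
print(f"Welch t = {t:.2f}")
```

A large positive t suggests the instrument separates the groups in the expected direction. Turning t into a p-value needs the t distribution, which a statistics package would normally supply.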
17. Other Issues to Consider
• Global ratings vs. Checklist scores
• Required sample size for validity testing
• Training the trainees
• Scoring the videotaped performances
– Individual procedural vs. qualitative skills
– Team performance
• Formative and summative assessment at the
same time
– Training (intervention) or assessment
(measurement)?
18. Global ratings vs. Checklist scores
• High correlation between global ratings
and checklist scores
• Both seem to differentiate similarly
between more experienced trainees and
novices
Examples
• Kim J., Neilipovitz D., Cardinal P., Chiu M. (2009) A comparison of global rating scale
and checklist scores in the validation of an evaluation tool to assess performance in
the resuscitation of critically ill patients during simulated emergencies (abbreviated as
“CRM simulator study IB”). Simulation in Healthcare, 4, 6–16.
• Morgan P.J., Cleave-Hogg D., Guest C.B. (2001) A comparison of global ratings and
checklist scores from an undergraduate assessment using an anesthesia simulator.
Academic Medicine, 76, 1053–5.
19. A Comparison of Global Rating Scale and Checklist Scores in the
Validation of an Evaluation Tool to Assess Performance in the
Resuscitation of Critically Ill Patients During Simulated Emergencies
(Abbreviated as "CRM Simulator Study IB")
Kim, John MD, MEd, FRCPC; Neilipovitz, David MD, FRCPC; Cardinal, Pierre MD, FRCPC; Chiu, Michelle MD,
FRCPC
• 32 PGY-1 & 28 PGY-3 residents: 2 simulation scenarios on Crisis Resource
Management (CRM)
• Ottawa Global Rating Scale: 7-point anchored ordinal scale with 5 CRM
categories and an overall performance score
• Ottawa CRM Checklist: 12 items in 5 CRM categories, maximum 30 points
• 3 raters blinded to year of training rated each videotaped
performance (no order specified for use of evaluation tools)
20. Kim et al. 2009 - Continued
Reliability: inter-rater reliability, Intraclass Correlation Coefficient (ICC)
• Ottawa GRS: S1 = 0.59 & S2 = 0.61
– Subcategories showed similar ICCs except “resource utilization &
communication” (range 0.24 to 0.38)
• Cumulative CRM checklist: S1 = 0.63 & S2 = 0.55
– Again, subcategories showed similar ICCs except “resource utilization &
communication” (again range 0.24 to 0.38)
Validity:
• Content validation (face validity): Delphi process
• Response process: resident orientation & rater training
• Comparison of scores by PGY: t test (ANOVA)
– Both the checklist and GRS overall & subcategory scores showed
statistically significant differences between PGY-1 and PGY-3 (more
experienced residents receiving higher scores)
– ANOVA showed similar results by each scenario as well as per rater
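The slide reports ICCs but does not say which form of the ICC was used. As a sketch only, the code below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) from a complete performances-by-raters table. That form is an assumption, and the function name and ratings are hypothetical rather than taken from Kim et al.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per performance, each holding one
    score per rater (every rater scores every performance).
    """
    n = len(ratings)      # number of performances
    k = len(ratings[0])   # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("Need a complete n x k table with n >= 2 and k >= 2")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: performances, raters, residual
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical 7-point global ratings: 5 videotaped performances x 3 raters
ratings = [
    [5, 4, 5],
    [3, 3, 4],
    [6, 5, 6],
    [2, 3, 2],
    [4, 4, 5],
]
print(f"ICC(2,1) = {icc_2_1(ratings):.2f}")
```

Other ICC forms (for example consistency rather than absolute agreement, or the average of several raters) use the same mean squares in a different formula and can give noticeably different values for the same data.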
21. A Comparison of Global Ratings and Checklist Scores from an
Undergraduate Assessment Using an Anesthesia Simulator
Morgan, Pamela J. MD; Cleave-Hogg, Doreen PhD; Guest, Cameron B. MD, MEd
• 140 final-year medical students: 15-minute faculty-facilitated simulation
scenario
– Conducted in the 2nd week of the 2-week anesthesia rotation
– Faculty followed a script and each session was videotaped
– Each student completed 1 of 6 scenarios (each with similar learning
objectives)
• 25-point criterion checklist for each scenario (not performed=0;
performed=1)
• 5-point global rating (clear failure=1 to superior performance=5)
• 10 faculty attended a workshop on performance protocols
– Randomly assigned to a rating pair to evaluate 25 to 34 videotaped
performances
22. Morgan et al. 2001 - Continued
• Correlation between checklist scores and global ratings: Pearson
r = 0.74
• Global ratings correlated more highly with technical skills and
judgment than with knowledge
– Knowledge r = 0.24
– Technical skills r = 0.51
– Judgment r = 0.53
• Single-rater reliability (consistency)
– Mean ICC for Checklist scores 0.77 (range 0.58 to 0.93)
– Mean ICC for Global ratings 0.62 (range 0.40 to 0.77)
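To make the checklist-versus-global-rating correlation concrete, the sketch below computes a Pearson r between checklist totals and global ratings. The function name and all scores are hypothetical illustrations, not data from Morgan et al.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("Need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations about the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for 8 students:
# checklist totals out of 25 (each item scored 0 or 1) and 5-point global ratings
checklist_totals = [12, 15, 18, 20, 14, 22, 17, 19]
global_ratings = [2, 3, 3, 4, 2, 5, 3, 4]

print(f"r = {pearson_r(checklist_totals, global_ratings):.2f}")
```

Because a 5-point global rating is ordinal, a rank-based coefficient such as Spearman's rho is a reasonable alternative; Pearson is shown here because it is the statistic reported on the slide.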
23. Other Issues to Consider
• Required sample size for validity testing
• Training the trainees
• Scoring the videotaped performances
– Individual procedural vs. qualitative skills
– Team performance
• Formative and summative assessment at the same time
– Training (intervention) or assessment (measurement)?
• Assessment tools to link simulated performance to actual
patient outcomes