1. Surveying the Landscape: An Overview of a Framework and Tools for Direct Observation and Assessment
André F. De Champlain, PhD
Acting Director, R&D
Assessment – Evolving Beyond the Comfort Zone
102nd Annual Meeting of the Council
Sunday, September 14th, 2014
3. Historical Reminder –
Driving Force Behind MCC Projects
• Assessment Review Task Force
(ARTF)
• Created in 2009 to undertake “a strategic
review of the MCC’s assessment processes
with a clear focus on their purposes and
objectives, their structure and their
alignment with MCC’s major stakeholder
requirements”
4. The ARTF Report (2011)
• Recommendation 1
• LMCC becomes ultimate credential (legislation
issue)
• Recommendation 2
• Validate & update blueprint for MCC examinations
• Recommendation 3
• More frequent scheduling of the exams and
associated automation
• Recommendation 4
• IMG assessment enhancement and national
standardization (NAC & Practice Ready
Assessment)
• Recommendation 5
• Physician performance enhancement assessments
• Recommendation 6
• Implementation oversight, including the R&D
Committee priorities and R&D Budget
5. ARTF Recommendation 2
• That the content of MCC examinations be expanded
by:
• Defining the knowledge and behaviours in all the CanMEDS
roles that demonstrate competency of the physician about to
enter independent practice.
• Reviewing the adequacy of content and skill coverage on the
blueprints for all MCC examinations.
• Revising the examination blueprints and reporting systems
with the aim of demonstrating that the appropriate
assessment of all core competencies is covered and fulfills
the purpose of each examination.
• Determining whether any general core competencies
considered essential cannot be tested employing the
current MCC examinations, and exploring the
development of new tools to assess these specific
competencies when current examinations cannot.
7. Why a New MCCQE Blueprint?
“When test content is a primary source of validity
evidence in support of the interpretation for the use
of a test for credentialing, a close link between test
content and the professional requirements should be
demonstrated”. (Standard 11.3, 2014)
• If the test content samples tasks with considerable
fidelity (e.g.: actual “job” samples), or in the judgment
of experts (examiners), correctly simulates job task
content (e.g.: OSCE stations), or if the test samples
specific job knowledge, then content-related
evidence can be offered as the principal form of
evidence of validity (p.178)
8. What “Validity” Is….
“Validity is an overall evaluative judgment of the
degree to which empirical evidence and theoretical
rationales support the adequacy and appropriateness
of interpretations and actions based on test scores or
other modes of assessment”. (Messick, 1989)
• E.g.: The admissions dean at your medical school has
asked you to develop an MCQ exam that will be used to
admit students to your undergraduate program
• Evaluative judgment
• Score on the admissions exam is a good predictor of MD Year 1
GPA
• Empirical evidence
• High correlation between exam scores and MD Year 1 GPA
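To make the empirical-evidence step concrete, here is a minimal sketch computing that correlation; the exam scores and GPAs below are invented for illustration only:

```python
# Minimal sketch of gathering predictive-validity evidence:
# correlate admissions exam scores with MD Year 1 GPA.
# All data here are hypothetical, for illustration only.
from statistics import correlation  # Pearson's r; Python 3.10+

exam_scores = [62, 71, 75, 80, 84, 88, 90, 93]           # invented admissions scores
year1_gpa = [2.4, 2.9, 3.0, 3.2, 3.5, 3.4, 3.7, 3.9]     # invented Year 1 GPAs

r = correlation(exam_scores, year1_gpa)
print(f"Pearson r = {r:.2f}")  # a high r supports the predictive interpretation
```

A single correlation is of course only one strand of evidence; the evaluative judgment rests on the accumulated body of such evidence.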
9. What “Validity” Is Not….
• There is no such thing as a valid or
invalid test
• Statements such as “my test shows
construct validity” are completely devoid of
meaning
• Validity refers to the appropriateness of
inferences or judgments based on test
scores, given supporting empirical
evidence
10. Our Main “Validity” Argument
Based on MCC QE Scores
• Does the performance (score) on the sample of
items, stations, observations, etc., included in any
assessment, as reflected by the MCC Qualifying
Examination blueprint, correspond to my true
competency level in those domains?
• How accurately do my scores on a restricted sample
of MCQs, OSCE stations, direct observations, etc.
correspond to what I would have obtained on a
much larger collection of tasks?
11. Our Main “Validity” Argument
Based on MCCQE Scores
• One way to ensure that our judgments are as accurate as
possible is to develop a blueprint, via a practice analysis,
that dictates as clearly as possible what should appear on
the MCC Qualifying Examination at each decision point
• Practice analysis: A study conducted to determine the
frequency and criticality of the tasks performed in practice
• Blueprint: A plan which outlines the areas (domains) to
be assessed in the exam (with weightings)
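As an illustration of how a practice analysis can feed blueprint weightings, the sketch below rolls frequency and criticality ratings up into domain weights; the tasks, ratings, and the frequency-times-criticality rule are all assumptions for illustration, not the MCC's actual method:

```python
# Hypothetical sketch: turning practice-analysis ratings into blueprint weights.
# Each task is rated for frequency and criticality (1-5 scales assumed here);
# a domain's weight is its share of the summed frequency * criticality.
from collections import defaultdict

tasks = [
    # (domain, frequency, criticality) -- all ratings invented for illustration
    ("Assessment/Diagnosis", 5, 5),
    ("Assessment/Diagnosis", 4, 3),
    ("Management", 5, 4),
    ("Communication", 4, 4),
    ("Professional Behaviours", 3, 5),
]

raw = defaultdict(int)
for domain, freq, crit in tasks:
    raw[domain] += freq * crit  # simple multiplicative weighting rule (assumed)

total = sum(raw.values())
weights = {d: round(100 * v / total, 1) for d, v in raw.items()}
print(weights)  # domain weights in %, summing to ~100
```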
12. Sources of Evidence That Informed MCC QE Blueprint
• Current Issues in Health Professional and Health Professional Trainee Assessment (MEAAC)
• Supervising PGY-1 Residents (EPAs)
• Incidence and Prevalence of Diseases in Canada
• National Survey of Physicians and the Public
13. SME Panel Meeting - Defining the Task
• Candidate Descriptions (D1 & D2)
• Blueprint
• Test Specifications (D1 & D2)
14. Who Were Our SMEs? (all contributing to the Blueprint)
• MRA Rep of Council
• University Rep of Council
• Central Examination Committee
• Objectives Committee
• Test Committees
• RCPSC
• CFPC
• PGME Deans
• UGME Deans
15. Proposed Common Blueprint
• Dimensions of Care: Health Promotion and Illness Prevention; Acute; Chronic; Psychosocial Aspects
• Physician Activities: Assessment/Diagnosis; Management; Communication; Professional Behaviours
16. Anticipated Gaps
• Based on a thorough classification exercise
(micro-gap analysis), a number of
anticipated gaps were identified based on
the current MCCQE Part I and II exams
• Physician activities: Communication, Professional Behaviours (PBs)
• Dimensions of care: Health Promotion
and Illness Prevention, Psychosocial
Aspects of Care
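The micro-gap classification exercise can be pictured as a simple tally over the two-axis blueprint grid. In the sketch below the cell labels follow the proposed blueprint, but the item classifications and the coverage threshold are invented for illustration:

```python
# Hypothetical micro-gap analysis: tally existing exam items into the two-axis
# blueprint grid and flag under-covered cells. All classifications are invented.
from collections import Counter

ACTIVITIES = ["Assessment/Diagnosis", "Management",
              "Communication", "Professional Behaviours"]
DIMENSIONS = ["Acute", "Chronic",
              "Health Promotion and Illness Prevention", "Psychosocial Aspects"]

# Each existing MCQ/OSCE station is classified into one (activity, dimension) cell.
items = [
    ("Assessment/Diagnosis", "Acute"),
    ("Assessment/Diagnosis", "Chronic"),
    ("Management", "Acute"),
    ("Management", "Chronic"),
    # ... classifications for the full MCCQE Part I and II banks would go here
]

counts = Counter(items)
MIN_ITEMS = 1  # hypothetical coverage threshold per cell
for act in ACTIVITIES:
    for dim in DIMENSIONS:
        if counts[(act, dim)] < MIN_ITEMS:
            print(f"Anticipated gap: {act} x {dim}")
```

Run on the invented classifications above, this flags exactly the kinds of cells the slide names: the Communication and Professional Behaviours rows, and the Health Promotion and Psychosocial columns.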
17. Anticipated Gaps
• But at a higher, systems level (macro-gap
level), what are some anticipated gaps in the
MCC QE program?
• Initial “philosophical” impetus for reviewing
where gaps might occur given current and
future directions in medical education
assessments was provided by the Medical
Education Assessment Advisory Committee
(MEAAC)
18. Thinking Outside the Box:
An Overview of the Medical Education
Assessment Advisory Committee (MEAAC)
19. MEAAC
• Mandate
• The Medical Education Assessment Advisory Committee
(MEAAC) is an external panel of experts in their field
who advise the Medical Council of Canada (MCC)
directorate (Office of the CEO) and through them,
the relevant committees of Council, on ongoing and
future assessment processes that are necessary to
enable MCC to meet its mission.
• Critical Role
• Prepare report and make recommendations through
the directorate, on new approaches to assessment
which will enable MCC to meet its mission
20. MEAAC Members
• Kevin Eva (Chair)
• Georges Bordage
• Craig Campbell
• Robert Galbraith
• Shiphra Ginsburg
• Eric Holmboe
• Glenn Regehr
21. 3 MEAAC Themes
1. The need to overcome unintended consequences of
competency-based assessment (CBA)
• CBA offers a focused framework that educators & MRAs can use for
evaluative purposes in the service of improved patient care
• CBA tends to decontextualize competency and oversimplify what
competence truly means (e.g.: context specificity)
• Labeling a passing candidate as “competent”
• CBA downplays the physician as an active, dynamic, multidimensional
learner
2. Implement QA efforts to facilitate learning and
performance improvement
• Emphasizes the notion of the physician as a continuous learner
3. Authenticity in assessment practices
22. Recommendations
1. Integrated & continuous model of assessment
• E-portfolio
• Linked assessments
• Breaks “learning/exam” cycle
2. Practice-tailored modules
• Tailoring assessments, feedback & learning to current
educational/practice reality (GME)
• Linking EHRs to tailored exams for formative assessment &
feedback purposes
• Track impact of feedback on longitudinal performance
3. Authentic assessment
• Create “unfolding” OSCE stations that mimic real practice
• Determine how MCC might broaden assessments to incorporate
WBAs that have the potential to further inform MCC QE decisions
23. Some Challenges
• Generalizability of performance
• Balancing “authenticity” (validity) with reproducibility
• What can we conclude from a limited sample of performances in
a comprehensive scenario?
• Better integration of performance data
• How can low- and high-stakes information be better
integrated to promote more accurate decisions as well as
more competent physicians?
• Will require collaborative models with
partners
• MRAs, CFPC, RCPSC, medical schools, etc.
24. Direct Observation and
Assessment in a
High-stakes Setting
• Growing call for the inclusion of performance data from
assessment methods based on direct observation of
candidates (“workplace-based assessments” – WBAs) not
only in low-stakes but also high-stakes decisions
• Congruent with outcomes-based education and
licensure/certification
• WBAs are appealing in that they offer the promise of a
more complete evaluation of the candidate
• “Does” in Miller’s pyramid
• The feedback component is appealing and part and parcel of
WBAs
25. Challenges in Incorporating
WBAs in High-stakes
Assessment
• Procedural challenges - Sampling
• Direct observation of undergraduate trainees occurs
infrequently and is often done poorly
• Structured, observed assessments of undergraduates done
across clerkships for only 7.4%-23.1% of students
(Kassebaum & Eaglen, 1999)
• During any core clerkship, between 17% and 39% of students
were never observed (AAMC, 2004)
• Good news: Class of 2013 more frequently observed than
any past cohort, but still not satisfactory (AAMC, 2014)
• E.g.: 14.2% of 2013 undergrads reported never being observed
taking a history (Hx); 11.6% never observed doing a physical exam (PE)
26. Challenges of Incorporating
WBAs in High-stakes
Assessment
• Procedural challenges - Training
• Observers are infrequently trained to use WBAs
• High inter-rater variability (Boulet, Norcini, McKinley & Whelan
2002)
• Rater training sessions occurred for <50% of WBAs
surveyed (Kogan, Holmboe & Hauer, 2009)
• Training occurred only once and was brief (10-180 minutes)
• Systematic training (videotapes, competency levels, practice,
feedback) occurred for <15% of WBAs surveyed
• Stringent training protocol can significantly improve the
quality of WBA data (Holmboe et al., 2004)
27. Commonalities Amongst
Assessment Methods
• The data that we collect from all assessments, whether
MCQs, CDMs, OSCE stations or WBAs, are
observations that purportedly measure the
competencies (domains) underlying the
examination
• These data points differ in terms of standardization,
complexity and reproducibility
• Regardless, all of these data points are “snapshots”
that hopefully contribute to better defining our
candidates in terms of their proficiency level(s)
32. Psychometric Evidence to
Support the Use of WBAs in a
High-stakes Context
• Mini-CEX (Norcini et al., 1995)
• Reliability
• 12-14 encounters are required for reliability of .80
(Norcini et al., 1995; 2003; Holmboe, 2003)
• Findings vary as a function of examiner training, setting,
patient problem, etc.
• Cronbach’s alpha values reported in these studies
overestimate true reliability
• Few, if any, reliability studies on the use of the Mini-CEX
in a high-stakes setting
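For context, the 12-14 encounter figure follows the usual Spearman-Brown projection: if a single encounter has reliability $\rho_1$, the mean over $k$ encounters has projected reliability

$$\rho_k = \frac{k\,\rho_1}{1 + (k-1)\,\rho_1}, \qquad \text{so} \qquad k = \frac{\rho_k\,(1-\rho_1)}{\rho_1\,(1-\rho_k)}.$$

Assuming, purely for illustration, a single-encounter reliability of about .23 (a value chosen here to match the slide, not one reported on it), a target of $\rho_k = .80$ gives $k = (.80 \times .77)/(.23 \times .20) \approx 13$ encounters, consistent with the 12-14 range.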
33. Psychometric Evidence to
Support the Use of WBAs in a
High-stakes Context
• Mini-CEX (Norcini et al., 1995)
• Validity aspects
• Mini-CEX ratings correlate moderately with OSCE scores
(Boulet et al., 2002)
• Mini-CEX ratings correlate moderately with written in-training
exams (Durning et al., 2002; Hatala et al., 2006)
• Mini-CEX ratings are useful in discriminating between
candidates scripted to reflect various competency levels
(Holmboe et al., 2004)
• Mostly “opportunistic” validity research
• Stronger, theory-driven validity research is needed
34. Psychometric Evidence to
Support the Use of WBAs in a
High-Stakes Context
• Multisource Feedback Tools (360°)
• Reliability
• G-coefficient of 0.70 with 10-11 peer physician raters and
11-item global instrument (Ramsey et al., 1993)
• G-coefficient of 0.75 with 13 peers and 25 patients
(ACP-360; Lelliott et al., 2008)
• Cronbach’s alpha in the low .80s for PAR with 10 peer
raters
• Significant variability in examiner training and
standardization (intra-/inter-rater biases)
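Such G-coefficients come from generalizability theory decision studies. As a rough sketch of the logic (the variance ratio below is invented to match the reported figure), for a persons-crossed-with-raters design the projected coefficient with $n_r$ raters is

$$E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{res}/n_r},$$

so a residual-to-person variance ratio of roughly 4.3 gives $E\rho^2 = 1/(1 + 4.3/10) \approx .70$ with $n_r = 10$ raters, in line with the Ramsey et al. (1993) figure.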
35. Psychometric Evidence to
Support the Use of WBAs in a
High-Stakes Context
• Multisource Feedback Tools (360°)
• Validity
• Evidence provided on how the attributes assessed in
the questionnaires were determined
• PAR (CPSA, 2007; Lockyer, 2003; Violato et al., 1994; Hall
et al., 1999; Violato et al., 2003; Violato et al., 2006)
• SPRAT, Mini-PAT (GMC-UK; Archer et al., 2005; Davies
et al., 2005)
• Program director ratings correlated with peer and patient
ratings (Lipner et al., 2002)
• Virtually no other solid validity evidence!
36. Psychometric Evidence to
Support the Use of WBAs in a
High-Stakes Context
• Kogan, Holmboe, Hauer (2009) – Review
of 55 WBAs
• Reliability of WBA scores and decisions
• Internal consistency reliability reported for 8 WBAs
(<15%)
• Generalizability/dependability coefficients reported
for 8 WBAs (<15%)
• Interrater reliability reported for 22 WBAs (40%)
37. Psychometric Evidence to
Support the Use of WBAs in a
High-Stakes Context
• Kogan, Holmboe, Hauer (2009) – Review
of 55 WBAs
• Validity of WBA scores and decisions
• “Weak” validity evidence (relationships to other
variables) reported for 17 (31%) WBAs
• Correlations generally low or modest (0.1-0.3)
• Content validity evidence reported for 20 WBAs
(36%)
38. Psychometric Evidence to
Support the Use of WBAs in a
High-Stakes Context
• Kogan, Holmboe, Hauer (2009) – Review of 55
WBAs
• Validity of WBA scores and decisions
(outcomes)
• Impact of feedback on trainees’ knowledge, attitudes
and skills reported for 9 WBAs (16%)
• Transfer of trainee learning reported for 5 WBAs (<10%)
• Caveat: Studies were nonblinded and did not control
for baseline clinical skill level of participants
• No study conducted to assess whether WBAs
improved patient care outcomes
39. The Argument for Joint
Attestation of WBAs
• Insufficient evidence to support the incorporation of
“local” (e.g.: medical school) based scores (ratings)
obtained from direct observation in a high-stakes
examination program such as the MCC QE
• Issues pertaining to examiner training, context, and patient
problem variability probably preclude the use of such
scores in a high-stakes setting
• However, providing an attestation for these direct
observation data sources, based on strict criteria,
might not only better inform the MCC QE decision
points but also target gaps in the current blueprint
(e.g.: Communication, Professionalism)
40. The Argument for Joint
Attestation of WBAs
• Jointly certifying the credibility of assessment
programs (with partners) based on direct
observation would require meeting a number of
agreed upon standards pertaining to:
• Selection of specific rating tool(s) (e.g.: Mini-CEX)
• Adherence to a strict examiner training protocol
• Attestation that examiners have successfully met training
targets (online video training module)
• Sampling strategy (patient mix) based on an agreed
upon list of common problems (and blueprint)
• Adherence to a common scoring model and criteria for
mastery (attestation) by each participant
• Attestation process would allow for a level of
rigour to be met while still permitting some
local flexibility
41. Putting the Pieces Together:
The Need for a Macroscopic
Assessment of Competence
• The MEAAC report clearly called on the MCC to consider
implementing an integrated & continuous model of
assessment
• The advent of PPE, revalidation, MOL further underscores
the need to view assessment and learning (& feedback) as
interlinked components of the (assessment) continuum
• There is a clear desire to move beyond episodic, discrete
assessment “hurdles”
• How do we evolve towards this system?
42. Putting the Pieces Together:
The Need for a Macroscopic
Assessment of Competence
• Programmatic Assessment (Van der Vleuten et al.,
2012)
• Calls for a programmatic approach to assessment with a
“deliberately” arranged set of longitudinal assessment
activities
• Joint attestation of all data points for decision and remediation
purposes
• The input of expert professional judgment is a cornerstone of
this model
• Applying a program evaluation framework to assessment
• Systematic collection of data to answer specific questions
about a program
43. Putting the Pieces Together:
The Need for a Macroscopic
Assessment of Competence
• Programmatic Assessment (Van der Vleuten
et al., 2012)
• Caveat
• “The model is limited to programmatic assessment in the
educational context, and consequently licensing
assessment programmes are not considered” (p. 206)
• Despite this caveat, how might we apply a
macroscopic lens to the MCC QE and its two
decision points?
44. The Need for a Macroscopic Assessment of Competence: An Example - Decision Point I
[Diagram] Educational activities run across Curricular Blocks I, II and III; in parallel, assessment activities (Joint Attestation of School OSCE, Joint Attestation of Direct Observation Assessments, and the MCCQE I) feed into Decision Point I (entry into residency).
45. Putting the Pieces Together:
The Need for a Macroscopic
Assessment of Competence
• An example – Joint Attestation of a pan-
Canadian graduation OSCE
• Can we collaboratively agree, with partners, on the
components that would comprise a pan-Canadian
graduation OSCE?
• Possible “parameters”
• Shared blueprint
• Shared pool of cases
• Development of common form assembly, examiner training,
scoring and standard setting protocols to be strictly adhered
to by all schools
• Attestation of OSCE exam program by the MCC
46. Putting the Pieces Together to
Arrive at a Defensible Decision:
A Parallel to Standard Setting
• At a macroscopic level, how can we aggregate this combination
of low- and high-stakes data to arrive at a defensible decision
both for entry into residency and independent practice?
• As a starting point, the (examination) standard setting process
offers a defensible model that would allow expert judgment to
be applied towards the development of a policy that would
factor in all sources of data
• The ultimate standard to grant a “pass/fail” standing at either
decision point would reflect the judgments of (highly) trained
panels of physicians from across Canada, who would agree on
what constitutes an acceptable “portfolio” of performance at each
particular decision point, based on highly standardized elements
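As a purely hypothetical illustration of how a panel-derived policy might combine such portfolio elements, the sketch below applies conjunctive cut scores per component; the component names and cut scores are invented, not MCC policy:

```python
# Hypothetical conjunctive decision policy over a candidate's "portfolio".
# Component names and cut scores are invented for illustration; an actual
# policy would be set by trained physician panels via standard setting.
CUT_SCORES = {
    "MCCQE_part1": 70.0,        # high-stakes written examination score
    "graduation_OSCE": 60.0,    # jointly attested school OSCE score
    "direct_observation": 3.0,  # attested WBA rating (e.g. a mini-CEX scale)
}

def decision(portfolio: dict[str, float]) -> str:
    """Pass only if every attested component meets its panel-set standard."""
    fails = [name for name, cut in CUT_SCORES.items()
             if portfolio.get(name, 0.0) < cut]
    return "pass" if not fails else "fail (" + ", ".join(fails) + ")"

# Usage with a hypothetical candidate portfolio:
print(decision({"MCCQE_part1": 74.2,
                "graduation_OSCE": 63.5,
                "direct_observation": 3.4}))  # -> pass
```

A compensatory policy (a weighted composite against a single cut) is the obvious alternative; which model the panels adopt is itself a standard-setting judgment.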
47. Next Steps
• Procedural and Policy Issues
• Clearly define the elements that would
constitute assessments leading up to each
decision point (e.g.: Graduation OSCE)
• Agree on attestation criteria as they
relate to:
• A common blueprint and shared pool to be
adopted by all schools
• Training standards and clear outcomes for
examiners
• Scoring and standard setting frameworks
48. Next Steps
• Research Issues
• Conduct targeted research investigations with partners
to clearly address key issues needed to support a
macroscopic (programmatic) approach for the MCC QE
program
• How much variance do sources of measurement error
(examiners, patients, etc.) contribute to the final decision, and
how can we best control for these effects?
• Do these ancillary observations target the competencies that
they purport to measure (e.g. Communication and Professional
Behaviours)?
• Do these additional data points improve both decisions?
• Can we gather evidence to support a programmatic approach
to assessment?
• …. And many more!
Speaker notes:
In addition to the expert judgment that the panel brought to the discussion, we'll highlight three of the reports that were instrumental in guiding the development of the proposed blueprint and test specifications.
As a reminder, our key outputs from our workshop in mid-May were clearly identified candidate descriptions (D1 & D2) and a common blueprint and test specifications for assessments leading up to the two decision points.
As with any development work, we engaged a wide range of physicians as subject matter experts from various stakeholder organizations, and many played multiple roles. Internal and external views were represented on the panel: from an internal MCC perspective, we had representatives from University and MRA Council, the Central Examination Committee, and the Objectives and Test Committees; externally, the RCPSC, the CFPC, and the UGME and PGME Deans. They represented a diverse group who effectively came together to collaborate.
The common blueprint was developed first and then used to propose test specifications.
Dimensions of Care: the focus of care for the patient, family, community and/or population. Physician Activities: reflects the scope of practice and behaviours of a physician practicing in Canada. I'll walk through some of the key points in the definitions.