Development and validation of
prediction models
Statistics in Practice II
Ben Van Calster
Department of Development and Regeneration, KU Leuven (B)
Department of Biomedical Data Sciences, LUMC (NL)
Epi-Centre, KU Leuven (B)
Biometrisches Kolloquium, 16th of March 2021
1. Model performance and validation
(focus on binary outcomes)
3
Model validation is important!
Altman & Royston, Stat Med 2000. Altman et al, BMJ 2009.
Steyerberg et al, Epidemiology 2010. Koenig et al, Stat Med 2007.
Model validation?
It is not primarily about statistical significance of
1. The predictors
2. A goodness-of-fit test like the Hosmer-Lemeshow test or its variants
Although: a GOF test does give a gist of what is important
→ agreement between predictions and observations
→ CALIBRATION
4
Calibration is complicated
5
Calibration is complicated
6
Level 1: Mean calibration
Mean estimated risk = observed proportion of event
“On average, risks are not over- or underestimated.”
O:E ratio: observed events / expected events
If violated: adjust intercept.
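As a rough sketch (mine, not from the original slides), mean calibration and the intercept update can be checked in R, assuming hypothetical vectors p (estimated risks) and y (observed 0/1 outcomes):
# sketch: O:E ratio and intercept-only recalibration (hypothetical p and y)
oe_ratio <- sum(y) / sum(p)                        # observed events / expected events
lp <- qlogis(p)                                    # linear predictor = logit of the estimated risk
update <- glm(y ~ offset(lp), family = binomial)   # refit intercept only, slope fixed at 1
coef(update)                                       # intercept adjustment; about 0 if mean calibration holds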
7
Level 2: Weak calibration
Mean AND variability/spread of estimated risks are fine
“On average, risks are not over- or underestimated, nor too extreme/modest.”
Logistic calibration model: $\log\frac{\pi}{1-\pi} = a + b_L \cdot LP$
Calculate - calibration slope: $b_L$ (cf. spread) → should be 1
- calibration intercept: $a \mid b_L = 1$ (cf. mean) → should be 0
If violated: adjust intercept and coefficients using $a + b_L \cdot LP$
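A minimal R sketch of this logistic calibration model, assuming lp is the model's linear predictor in the validation data and y the observed 0/1 outcome:
# sketch: calibration slope and calibration intercept (weak calibration)
fit_slope <- glm(y ~ lp, family = binomial)           # slope b_L = coefficient of lp (target 1)
fit_int   <- glm(y ~ offset(lp), family = binomial)   # intercept a given b_L = 1 (target 0)
c(slope = unname(coef(fit_slope)["lp"]), intercept = unname(coef(fit_int)[1]))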
8
Level 3: Moderate calibration
Estimated risks are correct, conditional on estimated risk.
“Among patients with estimated risk x, the proportion of events is x.”
Flexible calibration model: $\log\frac{\pi}{1-\pi} = a + f(LP)$
Calibration curve.
If violated: refit model, perhaps re-evaluate functional form.
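One way to draw such a flexible calibration curve is to smooth the outcome against the estimated risk; a loess-based sketch (restricted cubic splines on the linear predictor are a common alternative), again with hypothetical p and y:
# sketch: flexible calibration curve via loess
ord <- order(p)
smooth_fit <- loess(y ~ p)
plot(p[ord], predict(smooth_fit)[ord], type = "l", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Estimated risk", ylab = "Observed proportion")
abline(0, 1, lty = 2)   # ideal calibration: the 45-degree line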
9
Level 4: Strong calibration
Estimated risks are correct, conditional on covariate pattern.
“Per covariate pattern, the proportion of events equals the estimated risk.”
In short, the model is fully correct.
10
Vach W. Regression models as a tool in medical research. Boca Raton: Chapman and Hall/CRC; 2013.
11
[Figure: examples of miscalibration at the mean, weak, moderate, and strong levels]
Van Calster and Steyerberg. Wiley StatsRef 2018. https://doi.org/10.1002/9781118445112.stat08078.
12
Van Calster et al. J Clin Epidemiol 2016;74:167-176.
Calibration assessment
Discrimination
Calibration: ~ absolute performance
Discrimination: ~ relative performance
Do I estimate higher risks for patients with vs without an event?
C-statistic, equal to AUROC for binary outcomes.
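As a small illustration (not from the slides), the c-statistic can be computed directly as a concordance probability, assuming hypothetical p (estimated risks) and y (0/1 outcomes):
# sketch: c-statistic = P(higher estimated risk for an event than for a non-event), ties count 1/2
cstat <- function(p, y) {
  pairs_gt <- outer(p[y == 1], p[y == 0], ">")
  pairs_eq <- outer(p[y == 1], p[y == 0], "==")
  mean(pairs_gt + 0.5 * pairs_eq)
}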
13
Discrimination
C or AUROC is interesting, ROC not so much.
ROC shows classification results but ‘integrated over’ threshold. But
classification critically depends on threshold! ROC = bizarre.
14
(Pun intended.)
Verbakel et al. J Clin Epidemiol 2020;126:207-216.
15
Discrimination
If you want a plot...
16
Karandeep Singh. Runway r package. https://github.com/ML4LHS/runway.
Overall measures
E.g. Brier score, R-squared measures (including IDI).
For model validation, I prefer to keep discrimination and calibration
separate.
17
Statistical validity = clinical utility?
Discrimination and calibration: how valid is the model?
That is not the full story. In the end, clinical prediction models are made to
support clinical decisions!
To make decisions about giving ‘treatment’ or not based on the model, you
have to classify patients as high and low risk.
Classifications can be right or wrong: the classical 2x2 table
18
            EVENT            NO EVENT
HIGH RISK   True positive    False positive
LOW RISK    False negative   True negative
Clinical utility
A risk threshold is needed. Which one?
Myth! The statistician can calculate the threshold from the data.
Myth! A threshold is part of the model.
→ “Model BLABLA has sensitivity of 80%” makes little sense to me.
19
Wynants et al. BMC Medicine 2019;17:192.
Clinical utility
ISSUE 1 – A RISK THRESHOLD HAS A CLINICAL MEANING
What are the utilities (U) of TP, FP, TN, FN?
Then threshold $T = \frac{1}{1+\frac{U_{TP}-U_{FN}}{U_{TN}-U_{FP}}} = \frac{1}{1+\frac{\text{Benefit of TP}}{\text{Harm of FP}}} = \frac{\text{Harm}}{\text{Harm}+\text{Benefit}}$
Hard to determine.
Instead of fixing U values, start with T: at which threshold are you uncertain
about ‘treatment’?
20
Clinical utility
You might say: I would give ‘treatment’ if risk >20%, not if <20%
You may also say that you are willing to ‘treat’ up to 5 patients to correctly
treat one (1 TP)
Then T = 0.2
You accept up to 4 FP per TP, so Benefit is 4 times the Harm.
21
Clinical utility
Net Benefit (Vickers & Elkin 2006), even Charles Sanders Peirce (1884)
When you classify patients, TPs are good and FPs are bad
Net result: good stuff minus bad stuff: TP − FP.
But: relative importance of TP and FP depends on T!
o If T = 0.5, Harm = Benefit; otherwise < or >.
o Harm of an FP is odds(T) times the Benefit of a TP
$NB = \frac{TP - \text{odds}(T) \times FP}{N}$
22
Vickers & Elkin. Med Decis Making 2006;26:565-574.
Peirce. Science 1884;4:453-454.
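A minimal R sketch of this Net Benefit formula (hypothetical p and y; treat-none has NB = 0 by definition):
# sketch: Net Benefit of classifying patients as high risk when estimated risk >= T
net_benefit <- function(p, y, T) {
  tp <- sum(p >= T & y == 1)
  fp <- sum(p >= T & y == 0)
  (tp - (T / (1 - T)) * fp) / length(y)    # odds(T) = T / (1 - T)
}
nb_treat_all <- function(y, T) mean(y) - (T / (1 - T)) * (1 - mean(y))   # classify everyone as high risk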
Clinical utility
ISSUE 2 – THERE IS NO SINGLE CORRECT THRESHOLD
Preferences and settings differ, so Harm:Benefit and T differ.
Specify a ‘reasonable range’ for T.
Calculate NB for this range, and plot results in a decision curve
→ Decision Curve Analysis
23
Vickers et al. BMJ 2016;352:i6.
Van Calster et al. Eur Urol 2018;74:796-804.
Vickers et al. Diagnostic and Prognostic Research 2019;3:18.
Clinical utility
DCA is a simple research method. You don’t need anything special. It is useful to
assess whether a model can improve clinical decision making. It does not replace a
cost-effectiveness analysis, but can inform on further steps.
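A sketch of a basic decision curve in R, reusing the net_benefit() and nb_treat_all() helpers sketched earlier over a range of reasonable thresholds:
# sketch: decision curve = Net Benefit of model, treat-all and treat-none across thresholds
thresholds <- seq(0.05, 0.50, by = 0.01)
nb_model <- sapply(thresholds, function(T) net_benefit(p, y, T))
nb_all   <- sapply(thresholds, function(T) nb_treat_all(y, T))
plot(thresholds, nb_model, type = "l", xlab = "Risk threshold", ylab = "Net Benefit")
lines(thresholds, nb_all, lty = 2)   # treat all
abline(h = 0, lty = 3)               # treat none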
24
Clinical utility
ISSUE 3 – CPM ONLY ESTIMATES WHETHER RISK>T OR NOT
T is independent of any model.
If the model is miscalibrated, you may select too many (risk overestimated)
or too few (risk underestimated) patients for treatment!
Miscalibration reduces clinical utility, and may make the model worse than
treating all/none! (Van Calster & Vickers 2015)
NB nicely punishes for miscalibration.
25
Van Calster & Vickers. Medical Decision Making 2015;35:162-169.
Clinical utility
Models can be miscalibrated!
→ So pick any threshold with a nice combination of sensitivity and
specificity!
That does not solve the problem in my opinion.
You may need a different threshold in a new hospital.
So you may just as well recalibrate the model.
26
Types of validation: evidence pyramid…
(I love evidence pyramids)
27
APPARENT: exact same data (no real validation, optimistic)
INTERNAL: same dataset, hence same population
EXTERNAL: different dataset, hence different population
Only apparent validation?
28
Criminal behavior!
https://lacapitale.sudinfo.be/188875/article/2018-02-08/arrestation-braine-lalleud-dun-cambrioleur-recherche
Internal validation
• Only assessment of overfitting: c-statistic and slope sufficient.
• Skepticism: who will publish a paper with poor internal validation?
• Which approach?
29
Internal validation
• Train – test split
o Inefficient: we split up the data into smaller parts
o “Split sample approaches only work when not needed” (Steyerberg & Harrell 2016)
o But: it evaluates the fitted MODEL!
• Resampling is preferred
o Enhanced bootstrap (statistics) or CV (ML) recommended
o Efficient: you can use all data to develop the model
o But: it evaluates the modeling PROCEDURE! (Dietterich 1998)
o It evaluates expected overfitting from the procedure.
30
Steyerberg & Harrell. Journal of Clinical Epidemiology 2016;69:245-247.
Dietterich. Neural Computation 1998;10:1895-1923.
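An optimism-corrected ('enhanced') bootstrap can be sketched in R as follows, assuming a data frame dat with outcome y, a fixed model formula form, and the pROC package for the c-statistic (rms::validate offers a packaged alternative):
# sketch: optimism-corrected bootstrap for the c-statistic of a logistic regression
library(pROC)
c_index <- function(fit, newdata)
  as.numeric(auc(newdata$y, predict(fit, newdata, type = "response"), quiet = TRUE))
fit_full <- glm(form, data = dat, family = binomial)
apparent <- c_index(fit_full, dat)
optimism <- replicate(200, {
  boot  <- dat[sample(nrow(dat), replace = TRUE), ]
  fit_b <- glm(form, data = boot, family = binomial)
  c_index(fit_b, boot) - c_index(fit_b, dat)   # bootstrap performance minus performance on original data
})
apparent - mean(optimism)                      # optimism-corrected estimate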
Calibration slope based on bootstrapping
• Negative correlation with true slope, variability underestimated
• Example (Van Calster 2020)
o 5 predictors, development N=500, 10% event rate, 1000 repetitions
o Every repetition: slope estimated using bootstrapping
o Estimated slope: median 0.91, IQR 0.89-0.92, range 0.78-0.96
o True slope: median 0.88, IQR 0.78-0.99, range 0.53-1.53
o Spearman rho -0.72
31
Van Calster et al. Statistical Methods in Medical Research 2020;29:3166-3178.
External validation
• Results affected by overfitting and by differences in setting (case-mix,
procedures, …)
• The real test: including ‘an external validation’ settles the issue, right?
32
Nevin & PLoS Med Editors. PLoS Medicine 2018;15:e1002708.
2. Expect heterogeneity
Expect heterogeneity
• Between settings: case-mix, definitions and procedures, etc…
34
Luijken et al. J Clin Epidemiol 2020;119:7-18.
Expect heterogeneity
• Across time: population drift (Davis et al, 2017)
35
Davis et al. Journal of the American Medical Informatics Association 2017;24:1052-1061.
THERE IS NO SUCH THING AS A VALIDATED MODEL
Validation is important!
But it’s complicated
o Internal and external validations added to model development study
may be tampered with (cynical take)
o Internal validation has limitations
o An external validation (whether good or bad) in one setting does not
settle the issue
o Performance tomorrow may not be the same as yesterday
So: assess/address/monitor heterogeneity
36
Address heterogeneity
• Multicenter studies
• IPD
37
Address heterogeneity
38
Debray et al. Statistics in Medicine 2013;32:3158-3180.
Development: random effects models
Address heterogeneity in baseline risk using random cluster intercepts
$\log\frac{\pi}{1-\pi} = \alpha + \alpha_j + \boldsymbol{\beta}^{\mathbf{T}}\mathbf{X}$, with $\alpha_j \sim N(0, \sigma^2)$.
Heterogeneity in predictor effects can be assessed with random slopes
$\log\frac{\pi}{1-\pi} = \alpha + \alpha_j + (\beta_1 + \beta_{1,j})X_1 + (\beta_2 + \beta_{2,j})X_2$,
with
$\begin{pmatrix} \alpha_j \\ \beta_{1,j} \\ \beta_{2,j} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_\alpha^2 & \sigma_{\alpha,\beta_1} & \sigma_{\alpha,\beta_2} \\ \sigma_{\alpha,\beta_1} & \sigma_{\beta_1}^2 & \sigma_{\beta_1,\beta_2} \\ \sigma_{\alpha,\beta_2} & \sigma_{\beta_1,\beta_2} & \sigma_{\beta_2}^2 \end{pmatrix} \right)$
39
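A hedged sketch of these models with lme4 (hypothetical variable names y, x1, x2 and a cluster identifier center in a data frame dat):
# sketch: random-intercept logistic regression per cluster, plus a version with random slopes
library(lme4)
fit_ri <- glmer(y ~ x1 + x2 + (1 | center), data = dat, family = binomial)
fit_rs <- glmer(y ~ x1 + x2 + (1 + x1 + x2 | center), data = dat, family = binomial)
summary(fit_ri)   # fixed effects = overall intercept and coefficients; variance of the random intercepts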
Development: fixed cluster-level effects
Add fixed parameters to model that describe the cluster
E.g. type of hospital, country
Can be done in combination with random effects
40
Development: Internal-external CV
41
[Figure: k-fold cross-validation vs internal-external cross-validation (IECV), illustrated with four centers: Leuven, London, Berlin, Rome]
Royston et al. Statistics in Medicine 2004;23:906-926.
Debray et al. Statistics in Medicine 2013;32:3158-3180.
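A sketch of the internal-external cross-validation loop in R (hypothetical data frame dat with outcome y, predictors x1, x2 and a center variable; pROC assumed for the AUC):
# sketch: leave one center out, develop on the remaining centers, validate on the left-out center
centers <- unique(dat$center)
iecv <- lapply(centers, function(cl) {
  dev <- subset(dat, center != cl)
  val <- subset(dat, center == cl)
  fit <- glm(y ~ x1 + x2, data = dev, family = binomial)
  p   <- predict(fit, val, type = "response")
  data.frame(center = cl,
             auc = as.numeric(pROC::auc(val$y, p, quiet = TRUE)),
             oe  = sum(val$y) / sum(p))
})
do.call(rbind, iecv)   # center-specific performance, ready for meta-analysis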
Model validation in clustered data
42
Riley et al. BMJ 2016;353:i3140.
Debray et al. Statistical Methods in Medical Research 2019;28:2768-2786.
Model validation in clustered data
1. Obtain cluster-specific performance
2. Perform random-effects meta-analysis (Snell 2018, Debray 2019)
c statistic: $\text{logit}(c_j) \sim N(\text{logit}(c),\, v_j + \tau^2)$
O:E ratio: $\log(OE_j) \sim N(\log(OE),\, v_j + \tau^2)$
Calibration intercept: $a_j \sim N(a,\, v_j + \tau^2)$
Calibration slope: $b_j \sim N(b,\, v_j + \tau^2)$
43
Snell et al. Statistical Methods in Medical Research 2018;28:3505-3522.
Debray et al. Statistical Methods in Medical Research 2019;28:2768-2786.
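For example, the random-effects meta-analysis of center-specific c-statistics can be sketched with metafor (the package also used later in this deck), assuming logit_c and its standard error are available per center in a data frame percenter:
# sketch: random-effects meta-analysis of logit(c) across centers, with prediction interval
library(metafor)
ma_c <- rma(yi = logit_c, sei = se_logit_c, data = percenter, method = "REML", knha = TRUE)
predict(ma_c, transf = transf.ilogit)   # summary c-statistic, 95% CI and 95% prediction interval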
Model validation in clustered data
One-stage model for calibration slopes:
$\text{logit}\big(P(Y=1)\big) = a + a_j + b \cdot \text{logit}(\text{risk}) + b_j \cdot \text{logit}(\text{risk})$, where
$\begin{pmatrix} a_j \\ b_j \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \tau_a^2 & \tau_{ab} \\ \tau_{ab} & \tau_b^2 \end{pmatrix} \right)$.
$b$ is the overall calibration slope.
$b_j$ are used to estimate center-specific slopes.
(Bouwmeester 2013; Wynants 2018)
44
Bouwmeester et al. BMC Medical Research Methodology 2013;13:19.
Wynants et al. Statistical Methods in Medical Research 2018;27:1723-1736.
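A sketch of this one-stage model with lme4, assuming a validation data frame val with outcome y, the estimated risk, and a center variable (lp = logit of the estimated risk):
# sketch: one-stage random-effects recalibration model for the calibration slope
library(lme4)
val$lp <- qlogis(val$risk)
fit_slope <- glmer(y ~ lp + (1 + lp | center), data = val, family = binomial)
fixef(fit_slope)["lp"]    # overall calibration slope b
coef(fit_slope)$center    # center-specific intercepts and slopes (a + a_j, b + b_j)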
Model validation in clustered data
• For calibration intercepts, the slopes are set to 1:
$\text{logit}\big(P(Y=1)\big) = a' + a'_j + \text{logit}(\text{risk})$, where $a'_j \sim N(0, \tau_{a'}^2)$.
$a'$ is the overall calibration intercept.
$a'_j$ are used to estimate center-specific intercepts.
45
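The corresponding sketch for the calibration intercept, with the slope fixed at 1 via an offset (same hypothetical val data frame as above):
# sketch: one-stage random-effects model for the calibration intercept (offset fixes the slope at 1)
library(lme4)
fit_int <- glmer(y ~ 1 + (1 | center), offset = lp, data = val, family = binomial)
fixef(fit_int)            # overall calibration intercept a'
ranef(fit_int)$center     # a'_j, used for center-specific intercepts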
Model validation in clustered data
• Net Benefit at several reasonable thresholds (Wynants 2018):
Bayesian trivariate random-effects MA of prevalence, sensitivity, specificity.
$\begin{pmatrix} \text{logit}(p_j) \\ \text{logit}(Se_j) \\ \text{logit}(Sp_j) \end{pmatrix} \sim N\left( \begin{pmatrix} \gamma_1 \\ \gamma_2 \\ \gamma_3 \end{pmatrix}, \begin{pmatrix} \tau_1^2 & \rho_{12}\tau_1\tau_2 & \rho_{13}\tau_1\tau_3 \\ \rho_{12}\tau_1\tau_2 & \tau_2^2 & \rho_{23}\tau_2\tau_3 \\ \rho_{13}\tau_1\tau_3 & \rho_{23}\tau_2\tau_3 & \tau_3^2 \end{pmatrix} \right)$
Optimal specification of (vague) priors: after reparameterization
- Vague normal priors for $\gamma$
- Weak Fisher priors for correlations
- Weak half-normal priors for variances
46
Wynants et al. Statistics in Medicine 2018;37:2034-2052.
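The reason this trivariate meta-analysis is convenient is that Net Benefit can be written directly in terms of prevalence, sensitivity and specificity; a small sketch of that identity in R (my notation, not from the slides):
# sketch: NB = Se * prev - (1 - Sp) * (1 - prev) * odds(T), so pooled prev/Se/Sp give a pooled NB
nb_from_summary <- function(prev, se, sp, T) {
  se * prev - (1 - sp) * (1 - prev) * T / (1 - T)
}
nb_treat_all_from_prev <- function(prev, T) prev - (1 - prev) * T / (1 - T)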
Meta-regression & subgroup analysis
• Largely exploratory undertaking, unless many big clusters
• Meta-regression: add cluster-level predictors to meta-analysis
o Mean, SD or prevalence of predictors per cluster
o Can also be added to one-stage calibration analysis
• Subgroup analysis?
o Conduct the meta-analysis for important subgroups on cluster-level
(Deeks 2008; Debray 2019)
47
Deeks et al. Cochrane Handbook for Systematic Reviews of Interventions (Chapter 9). Cochrane, 2008.
Debray et al. Statistical Methods in Medical Research 2019;28:2768-2786.
Local and dynamic updating
• Advisable to validate and update an interesting model locally
• And keep it fit by dynamic updating over time (periodic refitting, moving
window, …) (Hickey 2013, Strobl 2015, Su 2018, Davis 2020)
48
Hickey et al. Circulation: Cardiovascular Quality and Outcomes 2013;6:649-658. Strobl et al. Journal of Biomedical Informatics 2015;56:87-93.
Su et al. Statistical Methods in Medical Research 2018;27:185-197. Davis et al. Journal of Biomedical Informatics 2020;112:103611.
Van Calster et al. BMC Medicine 2019;17:230.
3. Applied example: ADNEX
Case study: the ADNEX model
• Disclaimer: I developed this model, but it is by no means perfect.
• Model to estimate the risk that an ovarian tumor is malignant
• Multinomial model for the following 5-level tumor outcome: benign,
borderline malignant, stage I invasive, stage II-IV invasive, secondary
metastatic. We focus on estimated risk of malignancy
• Population: patients selected for surgery (outcome based on histology)
• Data: 3506 patients from 1999-2007 (21 centers), 2403 patients from 2009-2012 (18 centers, mostly overlapping).
50
Van Calster et al. BMJ 2014;349:g5920.
51
IOTA data:
5909 patients
24 centers
10 countries
Leuven (930)
Rome (787)
Malmö (776)
Genk (428)
Monza (401)
Prague (354)
Bologna (348)
Milan (311)
Lublin (285)
Cagliari (261)
Milan 2 (223)
Stockholm (120)
London (119)
Naples (103)
Paris
Lund
Beijing, Maurepas,
Udine, Barcelona,
Florence, Milan 3,
Naples 2, Ontario
Case study: the ADNEX model
• Strategy: develop on first 3506, validate on 2403, then refit on all 5909
o 2-year gap between datasets, makes sense to do the split
o IECV might have been an option in hindsight?
• A priori selection of 10 predictors, then backward elimination (in
combination with multivariable fractional polynomials to decide on
transformations). (Royston & Sauerbrei 2008)
• Heterogeneity:
o 1. random intercepts
o 2. fixed binary predictor indicating type of center: oncology referral
center vs other center
52
Royston & Sauerbrei. Multivariable Model‐Building: A pragmatic approach to regression analysis based on fractional polynomials for modelling continuous variables. Wiley 2008.
53
Type of center as a predictor?
[Figure: percentage of malignant tumours per centre]
Case study: the ADNEX model
• SAS PROC GLIMMIX to fit ri-MLR (took me ages)
proc glimmix data=iotamc method=rspl;
class center NewIntercept;
model resp=NewIntercept NewIntercept*oncocenter NewIntercept*l2lesdmax
NewIntercept*ascites NewIntercept*age NewIntercept*propsol
NewIntercept*propsol*propsol NewIntercept*loc10 NewIntercept*papnr
NewIntercept*shadows NewIntercept*l2ca125
/ noint dist=binomial link=logit solution;
random NewIntercept / subject=Center type=un(1);
nloptions tech=nrridg;
output out=iotamcpqlpredca&imp pred(noblup ilink)=p_pqlca pred(noblup noilink)=lp_pqlca;
ods output ParameterEstimates=iotamcpqlca;
run;
Thanks to: Kuss & McLerran. Comput Methods Programs Biomed 2007;87:262-269.
54
Case study: the ADNEX model
• Shrinkage: uniform shrinkage factor per equation in the model
• 10 variables, 17 parameters (MFP), 120 cases in smallest group
• Multinomial EPP (de Jong et al 2019): 120 / ((5-1)*17) = 2
• But 52 patients per parameter…
• Sample size for multinomial outcomes needs ‘further research’
• Also, many MFP parameters: less ‘expensive’? Christodoulou et al 2021
• Refit on all data: Multinomial EPP 246 / 68 = 3.6
• 87 patients per parameter
55
de Jong et al. Statistics in Medicine 2019;38:1601-1608.
Christodoulou et al. Diagnostic and Prognostic Research 2021;forthcoming.
Case study: the ADNEX model
• Validation on 2403 patients (no meta-analysis, only pooled assessment)
56
First small external validation
57
Center | N | Events (%) | AUC | 95% CI
London (UK) | 318 | 104 (33) | 0.942 | 0.913-0.962
Southampton (UK) | 175 | 56 (32) | 0.900 | 0.841-0.938
Catania (IT) | 117 | 22 (19) | 0.990 | 0.959-0.998
Sayasneh et al. British Journal of Cancer 2016;115:542-548.
Large external validation
58
Van Calster et al. BMJ 2020;370:m2614.
Large external validation
• 4905 patients, 2012-2015, 17 centers
• First large study on the REAL population: any new patient with an
ovarian tumor, whether operated or not (2489 were operated)
• If not operated, patients are followed conservatively. Then, outcome was
based on clinical and ultrasound information.
o Differential verification!
o If insufficient or inconsistent FU information: missing data (MI)
59
Large external validation: discrimination
• AUC
o Analysis based on logit(AUC) and its SE (Snell 2019) – auRoc package
o Random effects meta-analysis – metafor package, rma and predict functions
o 95% prediction interval
• Calibration
o One-stage calibration model for calibration intercept and slope
o We used this model for calibration curves (i.e. no flexible calibration curves)
o Overall curve: a_j and b_j set to 0
o Center-specific curves: random terms used
60
Large external validation: utility
• Decision: referral for specialized oncological care
• Reasonable range of risk thresholds: 5 to 50%
• Per threshold, Net Benefit per center was combined using Bayesian
trivariate random effects meta-analysis
• Weak realistic priors
WinBUGS
61
Large external validation: results
62
Large external validation: results
63
Large external validation: results
64
Large external validation: results
65
Large external validation: results
66
Large external validation: results
67
Large external validation: subgroups
68
Operated (n=2489)
38% malignant
At least 1 FU visit (n=1958)
1% malignant
Large external validation: subgroups
69
Oncology centers (n=3094)
26% malignant
Other centers (n=1811)
10% malignant
(Interestingly, LR2 (blue line) is the only model without type of center as a predictor)
Large external validation: meta-regression
70
rma.uni(logit_auc, sei = se_logit_auc, mods = logit(EventRate), data = iota5cs, method = "REML", knha=TRUE)
Results across the four panels: R² 42% (p 0.02), R² 34% (p 0.04), R² 11% (p 0.24), R² 16% (p 0.09)
Large external validation: meta-regression
71
rma.uni(lnOE, sei = se_lnOE, mods = logit(EventRate), data = iota5cs, method = "REML", knha=TRUE)
Results across the four panels: R² 41% (p 0.29), R² 0% (p 0.50), R² 0% (p 0.52), R² 0% (p 0.31)
Independent external validation studies
72
Study | Location | N | ER (%) | AUC | Calibration | Utility | Missing data | Reporting guideline mentioned
Joyeux 2016 | Dijon/Chalon (FR) | 284 | 11 | 0.938 | No | No | CCA | None
Szubert 2016 | Poznan (PL) | 204 | 34 | 0.907 | No | No | CCA | None
 | Pamplona (SP) | 123 | 28 | 0.955 | No | No | CCA | None
Araujo 2017 | Sao Paulo (BR) | 131 | 52 | 0.925 | No | No | CCA | STARD
Meys 2017 | Maastricht (NL) | 326 | 35 | 0.930 | No | No | MI | STARD
Chen 2019 | Shanghai (CN) | 278 | 27 | 0.940 | No | No | CCA | None
Stukan 2019 | Gdynia (PL) | 100 | 52 | 0.972 | No | No | CCA | STARD, TRIPOD
Jeong 2020 | Seoul (KR) | 59 | 17 | 0.924 | No | No | No info | None
Viora 2020 | Turin (IT) | 577 | 25 | 0.911 | No | No | CCA | None
Nam 2021 | Seoul (KR) | 353 | 4 | 0.920 | No | No | No info | None
Poonyakanok 2021 | Bangkok (TH) | 357 | 17 | 0.975 | No | No | CCA | None
Independent external validation studies
73
metagen(logitc, logitc_se, data = adnex, studlab = Publication, sm = "PLOGIT", backtransf = T,
level = 0.95, level.comb = 0.95, comb.random = T, prediction = T, level.predict = 0.95,
method.tau = "REML", hakn = TRUE)
Heterogeneity?
74
4. Machine learning
Regression vs Machine Learning
76
Breiman. Stat Sci 2001;16:199-231.
Regression vs Machine Learning
77
Shmueli. Keynote talk at 2019 ISBIS conference, Kuala Lumpur; taken from slideshare.net
Bzdok. Nature Methods 2018;15:233-4.
??
Reason for popularity
78
“Typical machine learning algorithms are highly flexible
So will uncover associations we could not find before
Hence better predictions and management decisions”
→ One of the master keys, with guaranteed success!
1. Study design trumps algorithm choice
79
Do you have a clear research question?
Do you have data that help you answer the question?
What is the quality of the data?
Dilbert.com
Riley. Nature 2019;275:27-9.
One simple example: EHR data were not collected for research purposes
Example
80
Uses ANN, Random Forests, Naïve Bayes, and Logistic Regression.
(Logistic regression performed best, AUC 0.92)
Hernandez-Suarez et al. JACC Cardiovasc Interv 2019;12:1328-38.
Example (contd)
81
The model also uses postoperative information.
David J Cohen, MD: “The model can’t be run properly until you know about both the presence and the absence of
those complications, but you don’t know about the absence of a complication until the patient has left the hospital.”
https://www.tctmd.com/news/machine-learning-helps-predict-hospital-mortality-post-tavr-skepticism-abounds
Example 2
82
Winkler et al. JAMA Dermatol 2019; in press.
Example 3
83
Khazendar et al. Facts Views Vis ObGyn 2015;7:7-15.
2. Flexible algorithms are data hungry
84
http://www.portlandsports.com/hot-dog-eating-champ-kobayashi-hits-psu/
Flexible algorithms are data hungry
85
Van der Ploeg et al. BMC Med Res Methodol 2014;14:137.
Flexible algorithms are data hungry
86
https://simplystatistics.org/2017/05/31/deeplearning-vs-leekasso/
Marcus. arXiv 2018; arXiv:1801.00631
3. Signal-to-noise ratio (SNR) is often low
87
There is support that flexible algorithms work best with high SNR, not with low
SNR.
How can methods that look everywhere be better when you do not know
where to look or what you look for?
Hand. Stat Sci 2006;21:1-14.
Algorithm flexibility and SNR
88
Makridakis et al. PLoS One 2018;13:e0194889.
Ennis et al. Stat Med 1998;17:2501-8.
Goldstein et al. Stat Med 2017;36:2750-63.
So I find this hard to believe
89
Rajkomar et al. NEJM 2019;380:1347-58.
What if you don’t know?
90
If you have no knowledge on what variables could be good
predictors (and what variables not), are you ready to make a
good prediction model?
I do not think that using flexible algorithms will bypass this.
I like a clear distinction between
- predictor finding studies
- prediction modeling studies
Machine learning: success guaranteed?
• Question 1: which algorithm works when?
o Medical data: often limited signal:noise ratio
o High-dimensional data vs CPM
o Direct prediction from medical images using DL: different ballgame?
• Question 2: ML for low-dimensional CPMs?
91
ML vs LR: systematic review
92
Christodoulou et al. J Clin Epidemiol 2019;110:12-22.
ML vs LR: systematic review
93
Christodoulou et al. J Clin Epidemiol 2019;110:12-22.
ML vs LR: systematic review
94
Christodoulou et al. J Clin Epidemiol 2019;110:12-22.
ML vs LR: systematic review
95
Christodoulou et al. J Clin Epidemiol 2019;110:12-22.
Poor modeling and unclear reporting
96
What was done about missing data? 45% fully unclear, 100% poor or unclear
How were continuous predictors modeled? 20% unclear, 25% categorized
How were hyperparameters tuned? 66% unclear, 19% tuned with information
How was performance validated? 68% unclear or biased approach
Was calibration of risk estimates studied? 79% not at all, HL test common
Prognosis: time horizon often ignored completely
Christodoulou et al. J Clin Epidemiol 2019;110:12-22.
ML vs LR: comparison of AUC
97
Christodoulou et al. J Clin Epidemiol 2019;110:12-22.
A few anecdotic observations
98
Christodoulou et al. J Clin Epidemiol 2019;110:12-22.
Conclusions
99
Validation is important, but there is no such thing as a validated model.
Expect and address heterogeneity in model development and validation.
Ideally, models should be locally and dynamically updated. Realistic?
Study methodology trumps algorithm choice.
Machine learning: no success guaranteed.
Reporting is key. TRIPOD extensions are underway (Cluster and AI).
100
TG6 - Evaluating diagnostic tests and prediction models
Chairs: Ewout Steyerberg, Ben Van Calster
Members: Patrick Bossuyt, Tom Boyles (clinician), Gary Collins, Kathleen Kerr,
Petra Macaskill, David McLernon, Carl Moons, Maarten van Smeden, Andrew
Vickers, Laure Wynants
Papers:
- Van Calster et al. Calibration: the Achilles heel of predictive analytics. BMC Medicine 2019.
- Wynants et al. Three myths about risk thresholds for prediction models. BMC Medicine 2019.
- McLernon et al. Assessing the performance of prediction models for survival outcomes: guidance for
validation and updating with a new prognostic factor using a breast cancer case study. Near final!
- A few others in preparation.
Contenu connexe

Tendances

A plea for good methodology when developing clinical prediction models
A plea for good methodology when developing clinical prediction modelsA plea for good methodology when developing clinical prediction models
A plea for good methodology when developing clinical prediction modelsBenVanCalster
 
Is it causal, is it prediction or is it neither?
Is it causal, is it prediction or is it neither?Is it causal, is it prediction or is it neither?
Is it causal, is it prediction or is it neither?Maarten van Smeden
 
Prediction research in a pandemic: 3 lessons from a living systematic review ...
Prediction research in a pandemic: 3 lessons from a living systematic review ...Prediction research in a pandemic: 3 lessons from a living systematic review ...
Prediction research in a pandemic: 3 lessons from a living systematic review ...Laure Wynants
 
Prediction, Big Data, and AI: Steyerberg, Basel Nov 1, 2019
Prediction, Big Data, and AI: Steyerberg, Basel Nov 1, 2019Prediction, Big Data, and AI: Steyerberg, Basel Nov 1, 2019
Prediction, Big Data, and AI: Steyerberg, Basel Nov 1, 2019Ewout Steyerberg
 
How to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - StatsworkHow to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - StatsworkStats Statswork
 
Evaluation of the clinical value of biomarkers for risk prediction
Evaluation of the clinical value of biomarkers for risk predictionEvaluation of the clinical value of biomarkers for risk prediction
Evaluation of the clinical value of biomarkers for risk predictionEwout Steyerberg
 
Introduction to prediction modelling - Berlin 2018 - Part II
Introduction to prediction modelling - Berlin 2018 - Part IIIntroduction to prediction modelling - Berlin 2018 - Part II
Introduction to prediction modelling - Berlin 2018 - Part IIMaarten van Smeden
 
Thoughts on Machine Learning and Artificial Intelligence
Thoughts on Machine Learning and Artificial IntelligenceThoughts on Machine Learning and Artificial Intelligence
Thoughts on Machine Learning and Artificial IntelligenceMaarten van Smeden
 
Big Data Analytics for Healthcare
Big Data Analytics for HealthcareBig Data Analytics for Healthcare
Big Data Analytics for HealthcareChandan Reddy
 
Regression shrinkage: better answers to causal questions
Regression shrinkage: better answers to causal questionsRegression shrinkage: better answers to causal questions
Regression shrinkage: better answers to causal questionsMaarten van Smeden
 
Introduction to prediction modelling - Berlin 2018 - Part I
Introduction to prediction modelling - Berlin 2018 - Part IIntroduction to prediction modelling - Berlin 2018 - Part I
Introduction to prediction modelling - Berlin 2018 - Part IMaarten van Smeden
 
QUANTIFYING THE IMPACT OF DIFFERENT APPROACHES FOR HANDLING CONTINUOUS PREDIC...
QUANTIFYING THE IMPACT OF DIFFERENT APPROACHES FOR HANDLING CONTINUOUS PREDIC...QUANTIFYING THE IMPACT OF DIFFERENT APPROACHES FOR HANDLING CONTINUOUS PREDIC...
QUANTIFYING THE IMPACT OF DIFFERENT APPROACHES FOR HANDLING CONTINUOUS PREDIC...GaryCollins74
 
How to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - StatsworkHow to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - StatsworkStats Statswork
 
Day 1 (Lecture 3): Predictive Analytics in Healthcare
Day 1 (Lecture 3): Predictive Analytics in HealthcareDay 1 (Lecture 3): Predictive Analytics in Healthcare
Day 1 (Lecture 3): Predictive Analytics in HealthcareAseda Owusua Addai-Deseh
 
Development and evaluation of prediction models: pitfalls and solutions
Development and evaluation of prediction models: pitfalls and solutionsDevelopment and evaluation of prediction models: pitfalls and solutions
Development and evaluation of prediction models: pitfalls and solutionsMaarten van Smeden
 
Why the EPV≥10 sample size rule is rubbish and what to use instead
Why the EPV≥10 sample size rule is rubbish and what to use instead Why the EPV≥10 sample size rule is rubbish and what to use instead
Why the EPV≥10 sample size rule is rubbish and what to use instead Maarten van Smeden
 
Data science in health care
Data science in health careData science in health care
Data science in health careChetan Khanzode
 

Tendances (20)

A plea for good methodology when developing clinical prediction models
A plea for good methodology when developing clinical prediction modelsA plea for good methodology when developing clinical prediction models
A plea for good methodology when developing clinical prediction models
 
Is it causal, is it prediction or is it neither?
Is it causal, is it prediction or is it neither?Is it causal, is it prediction or is it neither?
Is it causal, is it prediction or is it neither?
 
P-values in crisis
P-values in crisisP-values in crisis
P-values in crisis
 
Clinical prediction models
Clinical prediction modelsClinical prediction models
Clinical prediction models
 
Prediction research in a pandemic: 3 lessons from a living systematic review ...
Prediction research in a pandemic: 3 lessons from a living systematic review ...Prediction research in a pandemic: 3 lessons from a living systematic review ...
Prediction research in a pandemic: 3 lessons from a living systematic review ...
 
Prediction, Big Data, and AI: Steyerberg, Basel Nov 1, 2019
Prediction, Big Data, and AI: Steyerberg, Basel Nov 1, 2019Prediction, Big Data, and AI: Steyerberg, Basel Nov 1, 2019
Prediction, Big Data, and AI: Steyerberg, Basel Nov 1, 2019
 
How to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - StatsworkHow to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - Statswork
 
Evaluation of the clinical value of biomarkers for risk prediction
Evaluation of the clinical value of biomarkers for risk predictionEvaluation of the clinical value of biomarkers for risk prediction
Evaluation of the clinical value of biomarkers for risk prediction
 
Introduction to prediction modelling - Berlin 2018 - Part II
Introduction to prediction modelling - Berlin 2018 - Part IIIntroduction to prediction modelling - Berlin 2018 - Part II
Introduction to prediction modelling - Berlin 2018 - Part II
 
Thoughts on Machine Learning and Artificial Intelligence
Thoughts on Machine Learning and Artificial IntelligenceThoughts on Machine Learning and Artificial Intelligence
Thoughts on Machine Learning and Artificial Intelligence
 
Big Data Analytics for Healthcare
Big Data Analytics for HealthcareBig Data Analytics for Healthcare
Big Data Analytics for Healthcare
 
Regression shrinkage: better answers to causal questions
Regression shrinkage: better answers to causal questionsRegression shrinkage: better answers to causal questions
Regression shrinkage: better answers to causal questions
 
Introduction to prediction modelling - Berlin 2018 - Part I
Introduction to prediction modelling - Berlin 2018 - Part IIntroduction to prediction modelling - Berlin 2018 - Part I
Introduction to prediction modelling - Berlin 2018 - Part I
 
QUANTIFYING THE IMPACT OF DIFFERENT APPROACHES FOR HANDLING CONTINUOUS PREDIC...
QUANTIFYING THE IMPACT OF DIFFERENT APPROACHES FOR HANDLING CONTINUOUS PREDIC...QUANTIFYING THE IMPACT OF DIFFERENT APPROACHES FOR HANDLING CONTINUOUS PREDIC...
QUANTIFYING THE IMPACT OF DIFFERENT APPROACHES FOR HANDLING CONTINUOUS PREDIC...
 
How to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - StatsworkHow to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - Statswork
 
Day 1 (Lecture 3): Predictive Analytics in Healthcare
Day 1 (Lecture 3): Predictive Analytics in HealthcareDay 1 (Lecture 3): Predictive Analytics in Healthcare
Day 1 (Lecture 3): Predictive Analytics in Healthcare
 
Development and evaluation of prediction models: pitfalls and solutions
Development and evaluation of prediction models: pitfalls and solutionsDevelopment and evaluation of prediction models: pitfalls and solutions
Development and evaluation of prediction models: pitfalls and solutions
 
Why the EPV≥10 sample size rule is rubbish and what to use instead
Why the EPV≥10 sample size rule is rubbish and what to use instead Why the EPV≥10 sample size rule is rubbish and what to use instead
Why the EPV≥10 sample size rule is rubbish and what to use instead
 
Clinical data analytics
Clinical data analyticsClinical data analytics
Clinical data analytics
 
Data science in health care
Data science in health careData science in health care
Data science in health care
 

Similaire à Development and evaluation of prediction models: pitfalls and solutions (Part II)

VALIDITY AND RELIABLITY OF A SCREENING TEST seminar 2.pptx
VALIDITY AND RELIABLITY OF A SCREENING TEST seminar 2.pptxVALIDITY AND RELIABLITY OF A SCREENING TEST seminar 2.pptx
VALIDITY AND RELIABLITY OF A SCREENING TEST seminar 2.pptxShaliniPattanayak
 
Epidemiological Approaches for Evaluation of diagnostic tests.pptx
Epidemiological Approaches for Evaluation of diagnostic tests.pptxEpidemiological Approaches for Evaluation of diagnostic tests.pptx
Epidemiological Approaches for Evaluation of diagnostic tests.pptxBhoj Raj Singh
 
NY Prostate Cancer Conference - A. Vickers - Session 1: Traditional statistic...
NY Prostate Cancer Conference - A. Vickers - Session 1: Traditional statistic...NY Prostate Cancer Conference - A. Vickers - Session 1: Traditional statistic...
NY Prostate Cancer Conference - A. Vickers - Session 1: Traditional statistic...European School of Oncology
 
UAB Pulmonary board review study design and statistical principles
UAB Pulmonary board review study  design and statistical principles UAB Pulmonary board review study  design and statistical principles
UAB Pulmonary board review study design and statistical principles Terry Shaneyfelt
 
observational analytical study
observational analytical studyobservational analytical study
observational analytical studyDr. Partha Sarkar
 
P-values the gold measure of statistical validity are not as reliable as many...
P-values the gold measure of statistical validity are not as reliable as many...P-values the gold measure of statistical validity are not as reliable as many...
P-values the gold measure of statistical validity are not as reliable as many...David Pratap
 
Common statistical pitfalls & errors in biomedical research (a top-5 list)
Common statistical pitfalls & errors in biomedical research (a top-5 list)Common statistical pitfalls & errors in biomedical research (a top-5 list)
Common statistical pitfalls & errors in biomedical research (a top-5 list)Evangelos Kritsotakis
 
Em score-medical-decision-making
Em score-medical-decision-makingEm score-medical-decision-making
Em score-medical-decision-makingSuperCoder LLC
 
Statistics tests and Probablity
Statistics tests and ProbablityStatistics tests and Probablity
Statistics tests and ProbablityAbdul Wasay Baloch
 
Epistemic problems 12_18_15
Epistemic problems 12_18_15Epistemic problems 12_18_15
Epistemic problems 12_18_15Scott aberegg
 
Evidence-based diagnosis
Evidence-based diagnosisEvidence-based diagnosis
Evidence-based diagnosisHesham Gaber
 
Essay On Ultrasound
Essay On UltrasoundEssay On Ultrasound
Essay On UltrasoundLori Bowie
 
1559221358_week 6-MCQ in EBP-1.pdf
1559221358_week 6-MCQ in EBP-1.pdf1559221358_week 6-MCQ in EBP-1.pdf
1559221358_week 6-MCQ in EBP-1.pdfSajidaCheema1
 
K7 - Critical Appraisal.pdf
K7 - Critical Appraisal.pdfK7 - Critical Appraisal.pdf
K7 - Critical Appraisal.pdfJeslynTengkawan1
 
Laboratory Medicine Curriculum
Laboratory Medicine Curriculum Laboratory Medicine Curriculum
Laboratory Medicine Curriculum mazen khaled
 
MSN 5600L Adult Wellness Check up Diagnosis Lab.pdf
MSN 5600L Adult Wellness Check up Diagnosis Lab.pdfMSN 5600L Adult Wellness Check up Diagnosis Lab.pdf
MSN 5600L Adult Wellness Check up Diagnosis Lab.pdfbkbk37
 
Evaluating the Medical Literature
Evaluating the Medical LiteratureEvaluating the Medical Literature
Evaluating the Medical LiteratureClista Clanton
 

Similaire à Development and evaluation of prediction models: pitfalls and solutions (Part II) (20)

VALIDITY AND RELIABLITY OF A SCREENING TEST seminar 2.pptx
VALIDITY AND RELIABLITY OF A SCREENING TEST seminar 2.pptxVALIDITY AND RELIABLITY OF A SCREENING TEST seminar 2.pptx
VALIDITY AND RELIABLITY OF A SCREENING TEST seminar 2.pptx
 
Research by MAGIC
Research by MAGICResearch by MAGIC
Research by MAGIC
 
Clinical epidemiology
Clinical epidemiologyClinical epidemiology
Clinical epidemiology
 
Epidemiological Approaches for Evaluation of diagnostic tests.pptx
Epidemiological Approaches for Evaluation of diagnostic tests.pptxEpidemiological Approaches for Evaluation of diagnostic tests.pptx
Epidemiological Approaches for Evaluation of diagnostic tests.pptx
 
NY Prostate Cancer Conference - A. Vickers - Session 1: Traditional statistic...
NY Prostate Cancer Conference - A. Vickers - Session 1: Traditional statistic...NY Prostate Cancer Conference - A. Vickers - Session 1: Traditional statistic...
NY Prostate Cancer Conference - A. Vickers - Session 1: Traditional statistic...
 
UAB Pulmonary board review study design and statistical principles
UAB Pulmonary board review study  design and statistical principles UAB Pulmonary board review study  design and statistical principles
UAB Pulmonary board review study design and statistical principles
 
observational analytical study
observational analytical studyobservational analytical study
observational analytical study
 
Evidence Based Diagnosis
Evidence Based DiagnosisEvidence Based Diagnosis
Evidence Based Diagnosis
 
P-values the gold measure of statistical validity are not as reliable as many...
P-values the gold measure of statistical validity are not as reliable as many...P-values the gold measure of statistical validity are not as reliable as many...
P-values the gold measure of statistical validity are not as reliable as many...
 
Common statistical pitfalls & errors in biomedical research (a top-5 list)
Common statistical pitfalls & errors in biomedical research (a top-5 list)Common statistical pitfalls & errors in biomedical research (a top-5 list)
Common statistical pitfalls & errors in biomedical research (a top-5 list)
 
Em score-medical-decision-making
Em score-medical-decision-makingEm score-medical-decision-making
Em score-medical-decision-making
 
Statistics tests and Probablity
Statistics tests and ProbablityStatistics tests and Probablity
Statistics tests and Probablity
 
Epistemic problems 12_18_15
Epistemic problems 12_18_15Epistemic problems 12_18_15
Epistemic problems 12_18_15
 
Evidence-based diagnosis
Evidence-based diagnosisEvidence-based diagnosis
Evidence-based diagnosis
 
Essay On Ultrasound
Essay On UltrasoundEssay On Ultrasound
Essay On Ultrasound
 
1559221358_week 6-MCQ in EBP-1.pdf
1559221358_week 6-MCQ in EBP-1.pdf1559221358_week 6-MCQ in EBP-1.pdf
1559221358_week 6-MCQ in EBP-1.pdf
 
K7 - Critical Appraisal.pdf
K7 - Critical Appraisal.pdfK7 - Critical Appraisal.pdf
K7 - Critical Appraisal.pdf
 
Laboratory Medicine Curriculum
Laboratory Medicine Curriculum Laboratory Medicine Curriculum
Laboratory Medicine Curriculum
 
MSN 5600L Adult Wellness Check up Diagnosis Lab.pdf
MSN 5600L Adult Wellness Check up Diagnosis Lab.pdfMSN 5600L Adult Wellness Check up Diagnosis Lab.pdf
MSN 5600L Adult Wellness Check up Diagnosis Lab.pdf
 
Evaluating the Medical Literature
Evaluating the Medical LiteratureEvaluating the Medical Literature
Evaluating the Medical Literature
 

Dernier

MARKER ASSISTED SELECTION IN CROP IMPROVEMENT
MARKER ASSISTED SELECTION IN CROP IMPROVEMENTMARKER ASSISTED SELECTION IN CROP IMPROVEMENT
MARKER ASSISTED SELECTION IN CROP IMPROVEMENTjipexe1248
 
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...marwaahmad357
 
Role of herbs in hair care Amla and heena.pptx
Role of herbs in hair care  Amla and  heena.pptxRole of herbs in hair care  Amla and  heena.pptx
Role of herbs in hair care Amla and heena.pptxVaishnaviAware
 
CW marking grid Analytical BS - M Ahmad.docx
CW  marking grid Analytical BS - M Ahmad.docxCW  marking grid Analytical BS - M Ahmad.docx
CW marking grid Analytical BS - M Ahmad.docxmarwaahmad357
 
3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...
3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...
3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...PirithiRaju
 
Main Exam Applied biochemistry final year
Main Exam Applied biochemistry final yearMain Exam Applied biochemistry final year
Main Exam Applied biochemistry final yearmarwaahmad357
 
Bureau of Indian Standards Specification of Shampoo.pptx
Bureau of Indian Standards Specification of Shampoo.pptxBureau of Indian Standards Specification of Shampoo.pptx
Bureau of Indian Standards Specification of Shampoo.pptxkastureyashashree
 
M.Pharm - Question Bank - Drug Delivery Systems
M.Pharm - Question Bank - Drug Delivery SystemsM.Pharm - Question Bank - Drug Delivery Systems
M.Pharm - Question Bank - Drug Delivery SystemsSumathi Arumugam
 
Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...
Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...
Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...Sérgio Sacani
 
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...Sérgio Sacani
 
IB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptxIB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptxUalikhanKalkhojayev1
 
MARSILEA notes in detail for II year Botany.ppt
MARSILEA  notes in detail for II year Botany.pptMARSILEA  notes in detail for II year Botany.ppt
MARSILEA notes in detail for II year Botany.pptaigil2
 
World Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlabWorld Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlabkiyorndlab
 
geometric quantization on coadjoint orbits
geometric quantization on coadjoint orbitsgeometric quantization on coadjoint orbits
geometric quantization on coadjoint orbitsHassan Jolany
 
Principles & Formulation of Hair Care Products
Principles & Formulation of Hair Care  ProductsPrinciples & Formulation of Hair Care  Products
Principles & Formulation of Hair Care Productspurwaborkar@gmail.com
 
Contracts with Interdependent Preferences (2)
Contracts with Interdependent Preferences (2)Contracts with Interdependent Preferences (2)
Contracts with Interdependent Preferences (2)GRAPE
 
biosynthesis of the cell wall and antibiotics
biosynthesis of the cell wall and antibioticsbiosynthesis of the cell wall and antibiotics
biosynthesis of the cell wall and antibioticsSafaFallah
 
Pests of tenai_Identification,Binomics_Dr.UPR
Pests of tenai_Identification,Binomics_Dr.UPRPests of tenai_Identification,Binomics_Dr.UPR
Pests of tenai_Identification,Binomics_Dr.UPRPirithiRaju
 

Dernier (20)

MARKER ASSISTED SELECTION IN CROP IMPROVEMENT
MARKER ASSISTED SELECTION IN CROP IMPROVEMENTMARKER ASSISTED SELECTION IN CROP IMPROVEMENT
MARKER ASSISTED SELECTION IN CROP IMPROVEMENT
 
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
 
Role of herbs in hair care Amla and heena.pptx
Role of herbs in hair care  Amla and  heena.pptxRole of herbs in hair care  Amla and  heena.pptx
Role of herbs in hair care Amla and heena.pptx
 
CW marking grid Analytical BS - M Ahmad.docx
CW  marking grid Analytical BS - M Ahmad.docxCW  marking grid Analytical BS - M Ahmad.docx
CW marking grid Analytical BS - M Ahmad.docx
 
3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...
3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...
3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...
 
Main Exam Applied biochemistry final year
Main Exam Applied biochemistry final yearMain Exam Applied biochemistry final year
Main Exam Applied biochemistry final year
 
Bureau of Indian Standards Specification of Shampoo.pptx
Bureau of Indian Standards Specification of Shampoo.pptxBureau of Indian Standards Specification of Shampoo.pptx
Bureau of Indian Standards Specification of Shampoo.pptx
 
M.Pharm - Question Bank - Drug Delivery Systems
M.Pharm - Question Bank - Drug Delivery SystemsM.Pharm - Question Bank - Drug Delivery Systems
M.Pharm - Question Bank - Drug Delivery Systems
 
Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...
Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...
Digitized Continuous Magnetic Recordings for the August/September 1859 Storms...
 
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...
Legacy Analysis of Dark Matter Annihilation from the Milky Way Dwarf Spheroid...
 
IB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptxIB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptx
 
MARSILEA notes in detail for II year Botany.ppt
MARSILEA  notes in detail for II year Botany.pptMARSILEA  notes in detail for II year Botany.ppt
MARSILEA notes in detail for II year Botany.ppt
 
World Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlabWorld Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlab
 
geometric quantization on coadjoint orbits
geometric quantization on coadjoint orbitsgeometric quantization on coadjoint orbits
geometric quantization on coadjoint orbits
 
Principles & Formulation of Hair Care Products
Principles & Formulation of Hair Care  ProductsPrinciples & Formulation of Hair Care  Products
Principles & Formulation of Hair Care Products
 
Contracts with Interdependent Preferences (2)
Contracts with Interdependent Preferences (2)Contracts with Interdependent Preferences (2)
Contracts with Interdependent Preferences (2)
 
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
 
biosynthesis of the cell wall and antibiotics
biosynthesis of the cell wall and antibioticsbiosynthesis of the cell wall and antibiotics
biosynthesis of the cell wall and antibiotics
 
Pests of tenai_Identification,Binomics_Dr.UPR
Pests of tenai_Identification,Binomics_Dr.UPRPests of tenai_Identification,Binomics_Dr.UPR
Pests of tenai_Identification,Binomics_Dr.UPR
 
Cheminformatics tools supporting dissemination of data associated with US EPA...
Cheminformatics tools supporting dissemination of data associated with US EPA...Cheminformatics tools supporting dissemination of data associated with US EPA...
Cheminformatics tools supporting dissemination of data associated with US EPA...
 

Development and evaluation of prediction models: pitfalls and solutions (Part II)

  • 1. Development and validation of prediction models Statistics in Practice II Ben Van Calster Department of Development and Regeneration, KU Leuven (B) Department of Biomedical Data Sciences, LUMC (NL) Epi-Centre, KU Leuven (B) Biometrisches Kolloquium, 16th of March 2021
  • 2. 1. Model performance and validation (focus on binary outcomes)
  • 3. 3 Model validation is important! Altman & Royston, Stat Med 2000. Altman et al, BMJ 2009. Steyerberg et al, Epidemiology 2010. Koenig et al, Stat Med 2007.
  • 4. Model validation? It is not primarily about statistical significance of 1. The predictors 2. A goodness-of-fit test like the Hosmer-Lemeshow test or its variants Although: a GOF test does give a gist of what is important  agreement between predictions and observations  CALIBRATION 4
  • 7. Level 1: Mean calibration Mean estimated risk = observed proportion of event “On average, risks are not over- or underestimated.” O:E ratio: observed events / expected events If violated: adjust intercept. 7
  • 8. Level 2: Weak calibration Mean AND variability/spread of estimated risk is fine “On average, risks are not over- or underestimated, nor too extreme/modest.” Logistic calibration model: 𝑙𝑙𝑙𝑙𝑙𝑙 𝜋𝜋 1−𝜋𝜋 = 𝑎𝑎 + 𝑏𝑏𝐿𝐿 ∗ 𝐿𝐿𝐿𝐿 Calculate - calibration slope: 𝑏𝑏𝐿𝐿 (cf spread) → should be 1 - calibration intercept: 𝑎𝑎|𝑏𝑏𝐿𝐿=1 (cf mean) → should be 0 If violated: adjust intercept and coefficients using 𝑎𝑎 + 𝑏𝑏𝐿𝐿 ∗ 𝐿𝐿𝐿𝐿 8
  • 9. Level 3: Moderate calibration Estimated risks are correct, conditional on estimated risk. “Among patients with estimated risk 𝑥𝑥, the proportion of events is 𝑥𝑥.” Flexible calibration model: 𝑙𝑙𝑙𝑙𝑙𝑙 𝜋𝜋 1−𝜋𝜋 = 𝑎𝑎 + 𝑓𝑓 𝐿𝐿𝐿𝐿 Calibration curve. If violated: refit model, perhaps re-evaluate functional form. 9
  • 10. Level 4: Strong calibration Estimated risks are correct, conditional on covariate pattern. “Per covariate pattern, the proportion of events equals the estimated risk.” In short, the model is fully correct. 10 Vach W. Regression models as a tool in medical research. Boca Raton: Chapman and Hall/CRC; 2013.
  • 11. 11 Miscalibration Mean Weak Moderate Strong Van Calster and Steyerberg. Wiley StatsRef 2018. https://doi.org/10.1002/9781118445112.stat08078.
  • 12. 12 Van Calster et al. J Clin Epidemiol 2016;74:167-176. Calibration assessment
  • 13. Discrimination Calibration: ~ absolute performance Discrimination: ~ relative performance Do I estimate higher risks for patients with vs without an event? C-statistic, equal to AUROC for binary outcomes. 13
  • 14. Discrimination C or AUROC is interesting, ROC not so much. ROC shows classification results but ‘integrated over’ threshold. But classification critically depends on threshold! ROC = bizarre. 14 (Pun intended.) Verbakel et al. J Clin Epidemiol 2020;126:207-216.
  • 15. 15
  • 16. Discrimination If you want a plot... 16 Karandeep Singh. Runway r package. https://github.com/ML4LHS/runway.
  • 17. Overall measures E.g. Brier score, R-squared measures (including IDI). For model validation, I prefer to keep discrimination and calibration separate. 17
  • 18. Statistical validity = clinical utility? Discrimination and calibration: how valid is the model? That is not the full story. In the end, clinical prediction models are made to support clinical decisions! To make decisions about giving ‘treatment’ or not based on the model, you have to classify patients as high and low risk. Classifications can be right or wrong: the classical 2x2 table 18 EVENT NO EVENT HIGH RISK True Positive False positive LOW RISK False Negative True negative
  • 19. Clinical utility A risk threshold is needed. Which one? Myth! The statistician can calculate the threshold from the data. Myth! A threshold is part of the model. → “Model BLABLA has sensitivity of 80%” makes little sense to me. 19 Wynants et al. BMC Medicine 2019;17:192.
  • 20. Clinical utility ISSUE 1 – A RISK THRESHOLD HAS A CLINICAL MEANING What are the utilities (U) of TP, FP, TN, FN? Then threshold 𝑇𝑇 = 1 1+ 𝑈𝑈𝑇𝑇𝑇𝑇−𝑈𝑈𝐹𝐹𝐹𝐹 𝑈𝑈𝑇𝑇𝑁𝑁−𝑈𝑈𝐹𝐹𝑃𝑃 = 1 1+ 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 𝑜𝑜𝑜𝑜 𝑇𝑇𝑇𝑇 𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻 𝑜𝑜𝑜𝑜 𝐹𝐹𝐹𝐹 = 𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻 𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻+𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 Hard to determine. Instead of fixing U values, start with T: at which threshold are you uncertain about ‘treatment’? 20
  • 21. Clinical utility You might say: I would give ‘treatment’ if risk >20%, not if <20% You may also say that you are willing to ‘treat’ up to 5 patients to correctly treat one (1 TP) Then T = 0.2 You accept up to 4 FP per TP, so Benefit is 4 times the Harm. 21
  • 22. Clinical utility Net Benefit (Vickers & Elkin 2006), even Charles Sanders Peirce (1884) When you classify patients, TPs are good and FPs are bad Net result: good stuff minus bad stuff: 𝑇𝑇𝑇𝑇𝑇𝑇 − 𝐹𝐹𝐹𝐹𝐹𝐹. But: relative importance of TP and FP depends on T! o If T=0.5, 𝐻𝐻𝐻𝐻𝐻𝐻𝐻𝐻 = 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵; else < or >. o Harm of FP is odds(T) times the Benefit of a TP 𝑁𝑁𝑁𝑁 = 𝑇𝑇𝑇𝑇𝑇𝑇 − 𝑜𝑜𝑜𝑜𝑜𝑜𝑜𝑜 𝑇𝑇 ∗ 𝐹𝐹𝐹𝐹𝐹𝐹 𝑁𝑁 22 Vickers & Elkin. Med Decis Making 2006;26:565-574. Peirce. Science 1884;4:453-454.
  • 23. Clinical utility ISSUE 2 – THERE IS NO SINGLE CORRECT THRESHOLD Preferences and settings differ, so Harm:Benefit and T differ. Specify a ‘reasonable range’ for T. Calculate NB for this range, and plot results in a decision curve → Decision Curve Analysis 23 Vickers et al. BMJ 2016;352:i6. Van Calster et al. Eur Urol 2018;74:796-804. Vickers et al. Diagnostic and Prognostic Research 2019;3:18.
  • 24. Clinical utility DCA is a simple research method. You don’t need anything special. It is useful to assess whether a model can improve clinical decision making. It does not replace a cost-effectiveness analysis, but can inform on further steps. 24
  • 25. Clinical utility ISSUE 3 – CPM ONLY ESTIMATES WHETHER RISK>T OR NOT T is independent of any model. If the model is miscalibrated, you may select too many (risk overestimated) or too few (risk underestimated) patients for treatment! Miscalibration reduces clinical utility, and may make the model worse than treating all/none! (Van Calster & Vickers 2015) NB nicely punishes for miscalibration. 25 Van Calster & Vickers. Medical Decision Making 2015;35:162-169.
  • 26. Clinical utility Models can be miscalibrated! → So pick any threshold with a nice combination of sensitivity and specificity! That does not solve the problem in my opinion. You may need a different threshold in a new hospital. So you may just as well recalibrate the model. 26
  • 27. Types of validation: evidence pyramid… (I love evidence pyramids) 27 APPARENT INTERNAL EX- TERNAL Exact same data No real validation Optimistic Same dataset Hence: same population Different dataset Hence: different population
  • 28. Only apparent validation? 28 Criminal behavior! https://lacapitale.sudinfo.be/188875/article/2018-02-08/arrestation-braine-lalleud-dun-cambrioleur-recherche
  • 29. Internal validation • Only assessment of overfitting: c-statistic and slope sufficient. • Skepticism: who will publish a paper with poor internal validation? • Which approach? 29
  • 30. Internal validation • Train – test split o Inefficient: we split up the data into smaller parts o “Split sample approaches only work when not needed” (Steyerberg & Harrell 2016) o But: it evaluates the fitted MODEL! • Resampling is preferred o Enhanced bootstrap (statistics) or CV (ML) recommended o Efficient: you can use all data to develop the model o But: it evaluates the modeling PROCEDURE! (Dietterich 1998) o It evaluates expected overfitting from the procedure. 30 Steyerberg & Harrell. Journal of Clinical Epidemiology 2016;69:245-247. Dietterich. Neural Computation 1998;10:1895-1923.
  • 31. Calibration slope based on bootstrapping • Negative correlation with true slope, variability underestimated • Example (Van Calster 2020) o 5 predictors, development N=500, 10% event rate, 1000 repetitions o Every repetition: slope estimated using bootstrapping o Estimated slope: median 0.91, IQR 0.89-0.92, range 0.78-0.96 o True slope: median 0.88, IQR 0.78-0.99, range 0.53-1.53 o Spearman rho -0.72 31 Van Calster et al. Statistical Methods in Medical Research 2020;29:3166-3178.
  • 32. External validation • Results affected by overfitting and by differences in setting (case-mix, procedures, …) • The real test: including ‘an external validation’ settles the issue, right? 32 Nevin & PLoS Med Editors. PLoS Medicine 2018;15:e1002708.
  • 34. Expect heterogeneity • Between settings: case-mix, definitions and procedures, etc… 34 Luijken et al. J Clin Epidemiol 2020;119:7-18.
  • 35. Expect heterogeneity • Across time: population drift (Davis et al, 2017) 35 Davis et al. Journal of the American Medical Informatics Association 2017;24:1052-1061.
  • 36. THERE IS NO SUCH THING AS A VALIDATED MODEL Validation is important! But it’s complicated o Internal and external validations added to model development study may be tampered with (cynical take) o Internal validation has limitations o An external validation (whether good or bad) in one setting does not settle the issue o Performance tomorrow may not be the same as yesterday So: assess/address/monitor heterogeneity 36
  • 38. Address heterogeneity 38 Debray et al. Statistics in Medicine 2013;32:3158-3180.
• 39. Development: random effects models Address heterogeneity in baseline risk using random cluster intercepts
$\log\frac{\pi}{1-\pi} = \alpha + \alpha_j + \boldsymbol{\beta}^{T}\mathbf{X}$, with $\alpha_j \sim N(0, \sigma^2)$.
Heterogeneity in predictor effects can be assessed with random slopes
$\log\frac{\pi}{1-\pi} = \alpha + \alpha_j + (\beta_1 + \beta_{1,j})X_1 + (\beta_2 + \beta_{2,j})X_2$,
with $\begin{pmatrix} \alpha_j \\ \beta_{1,j} \\ \beta_{2,j} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{\alpha}^2 & \sigma_{\alpha,\beta_1} & \sigma_{\alpha,\beta_2} \\ \sigma_{\alpha,\beta_1} & \sigma_{\beta_1}^2 & \sigma_{\beta_1,\beta_2} \\ \sigma_{\alpha,\beta_2} & \sigma_{\beta_1,\beta_2} & \sigma_{\beta_2}^2 \end{pmatrix} \right)$. 39
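For illustration, a minimal lme4 sketch of these two models (variable names y, x1, x2, center and data frame dat are hypothetical; the ADNEX case study below used SAS PROC GLIMMIX):

library(lme4)
# random cluster intercepts only
m_ri <- glmer(y ~ x1 + x2 + (1 | center), data = dat, family = binomial)
# random intercepts plus random slopes for x1 and x2 (correlated random effects)
m_rs <- glmer(y ~ x1 + x2 + (1 + x1 + x2 | center), data = dat, family = binomial)
VarCorr(m_rs)   # estimated variances and covariances of the random effects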
  • 40. Development: fixed cluster-level effects Add fixed parameters to model that describe the cluster E.g. type of hospital, country Can be done in combination with random effects 40
• 41. Development: Internal-external CV 41
[Figure comparing k-fold CV with internal-external CV (IECV), illustrated with four centres: Leuven, London, Berlin, Rome. In IECV each centre is left out in turn as the validation set.]
Royston et al. Statistics in Medicine 2004;23:906-926. Debray et al. Statistics in Medicine 2013;32:3158-3180.
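A minimal R sketch of IECV for a logistic model (hypothetical names; each centre serves once as the validation set):

library(pROC)
centres <- unique(dat$center)
iecv <- lapply(centres, function(cl) {
  dev <- dat[dat$center != cl, ]   # develop on all other centres
  val <- dat[dat$center == cl, ]   # validate on the left-out centre
  fit <- glm(y ~ x1 + x2, data = dev, family = binomial)
  val$lp <- predict(fit, newdata = val)   # linear predictor in the left-out centre
  data.frame(centre = cl,
             auc   = as.numeric(auc(val$y, plogis(val$lp))),
             slope = unname(coef(glm(y ~ lp, data = val, family = binomial))["lp"]))
})
do.call(rbind, iecv)   # centre-specific performance, ready for meta-analysis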
  • 42. Model validation in clustered data 42 Riley et al. BMJ 2016;353:i3140. Debray et al. Statistical Methods in Medical Research 2019;28:2768-2786.
• 43. Model validation in clustered data 1. Obtain cluster-specific performance 2. Perform random-effects meta-analysis (Snell 2018, Debray 2019)
c-statistic: $\mathrm{logit}(c_j) \sim N(\mathrm{logit}(c), v_j + \tau^2)$
O:E ratio: $\log(OE_j) \sim N(\log(OE), v_j + \tau^2)$
Calibration intercept: $a_j \sim N(a, v_j + \tau^2)$
Calibration slope: $b_j \sim N(b, v_j + \tau^2)$
43 Snell et al. Statistical Methods in Medical Research 2018;28:3505-3522. Debray et al. Statistical Methods in Medical Research 2019;28:2768-2786.
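A minimal two-stage sketch for the c-statistic with metafor (not the study code; perf is a hypothetical data frame with centre-specific c-statistics c and standard errors se_c), consistent with the rma/predict approach described for the large external validation below:

library(metafor)
perf$logit_c    <- qlogis(perf$c)
perf$se_logit_c <- perf$se_c / (perf$c * (1 - perf$c))   # delta method for the SE on the logit scale
ma <- rma(yi = logit_c, sei = se_logit_c, data = perf, method = "REML", knha = TRUE)
predict(ma, transf = transf.ilogit)   # summary c with 95% CI and 95% prediction interval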
• 44. Model validation in clustered data One-stage model for calibration slopes:
$\mathrm{logit}(P(Y=1)) = a + a_j + b \cdot \mathrm{logit}(\mathrm{risk}) + b_j \cdot \mathrm{logit}(\mathrm{risk})$,
where $\begin{pmatrix} a_j \\ b_j \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \tau_a^2 & \tau_{ab} \\ \tau_{ab} & \tau_b^2 \end{pmatrix} \right)$.
$b$ is the overall calibration slope. The $b_j$ are used to estimate center-specific slopes. (Bouwmeester 2013; Wynants 2018)
44 Bouwmeester et al. BMC Medical Research Methodology 2013;13:19. Wynants et al. Statistical Methods in Medical Research 2018;27:1723-1736.
• 45. Model validation in clustered data • For calibration intercepts, the slopes are set to 1:
$\mathrm{logit}(P(Y=1)) = a' + a'_j + \mathrm{logit}(\mathrm{risk})$, where $a'_j \sim N(0, \tau_{a'}^2)$.
$a'$ is the overall calibration intercept. The $a'_j$ are used to estimate center-specific intercepts. 45
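A minimal lme4 sketch of both one-stage models (hypothetical names; p_hat are the estimated risks of the model under validation, center is the cluster variable):

library(lme4)
dat$lp <- qlogis(dat$p_hat)   # logit of the estimated risk
# calibration slope model: overall slope b plus centre-specific deviations b_j
m_slope <- glmer(y ~ lp + (1 + lp | center), data = dat, family = binomial)
# calibration intercept model: slope fixed at 1 by entering lp as an offset
m_int <- glmer(y ~ 1 + offset(lp) + (1 | center), data = dat, family = binomial)
fixef(m_slope)["lp"]    # overall calibration slope b
fixef(m_int)            # overall calibration intercept a'
coef(m_slope)$center    # centre-specific intercepts and slopes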
• 46. Model validation in clustered data • Net Benefit at several reasonable thresholds (Wynants 2018): Bayesian trivariate random-effects MA of prevalence, sensitivity, specificity.
$\begin{pmatrix} \mathrm{logit}(p_j) \\ \mathrm{logit}(Se_j) \\ \mathrm{logit}(Sp_j) \end{pmatrix} \sim N\left( \begin{pmatrix} \gamma_1 \\ \gamma_2 \\ \gamma_3 \end{pmatrix}, \begin{pmatrix} \tau_1^2 & \rho_{12}\tau_1\tau_2 & \rho_{13}\tau_1\tau_3 \\ \rho_{12}\tau_1\tau_2 & \tau_2^2 & \rho_{23}\tau_2\tau_3 \\ \rho_{13}\tau_1\tau_3 & \rho_{23}\tau_2\tau_3 & \tau_3^2 \end{pmatrix} \right)$
Optimal specification of (vague) priors, after reparameterization:
- Vague normal priors for $\gamma$
- Weak Fisher priors for correlations
- Weak half-normal priors for variances
46 Wynants et al. Statistics in Medicine 2018;37:2034-2052.
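For reference, net benefit at a threshold T can be written directly in terms of prevalence, sensitivity and specificity, which is what makes the trivariate meta-analysis usable for decision curves; a minimal sketch with purely illustrative numbers:

nb_from_se_sp <- function(prev, se, sp, t) {
  se * prev - (1 - sp) * (1 - prev) * t / (1 - t)   # NB = Se*prev - (1-Sp)*(1-prev)*odds(t)
}
nb_from_se_sp(prev = 0.30, se = 0.90, sp = 0.80, t = 0.20)   # illustrative numbers only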
  • 47. Meta-regression & subgroup analysis • Largely exploratory undertaking, unless many big clusters • Meta-regression: add cluster-level predictors to meta-analysis o Mean, SD or prevalence of predictors per cluster o Can also be added to one-stage calibration analysis • Subgroup analysis? o Conduct the meta-analysis for important subgroups on cluster-level (Deeks 2008; Debray 2019) 47 Deeks et al. Cochrane Handbook for Systematic Reviews of Interventions (Chapter 9). Cochrane, 2008. Debray et al. Statistical Methods in Medical Research 2019;28:2768-2786.
  • 48. Local and dynamic updating • Advisable to validate and update an interesting model locally • And keep it fit by dynamic updating over time (periodic refitting, moving window, …) (Hickey 2013, Strobl 2015, Su 2018, Davis 2020) 48 Hickey et al. Circulation: Cardiovascular Quality and Outcomes 2013;6:649-658. Strobl et al. Journal of Biomedical Informatics 2015;56:87-93. Su et al. Statistical Methods in Medical Research 2018;27:185-197. Davis et al. Journal of Biomedical Informatics 2020;112:103611. Van Calster et al. BMC Medicine 2019;17:230.
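A minimal sketch of one simple dynamic update, re-estimating the intercept on a moving window (hypothetical names; the cited papers describe more refined strategies):

recent <- dat[dat$date >= max(dat$date) - 365, ]   # moving one-year window of recent cases
recent$lp <- qlogis(recent$p_hat)                  # linear predictor of the current model
upd <- glm(y ~ 1 + offset(lp), data = recent, family = binomial)
coef(upd)[1]   # correction to add to the model intercept (recalibration-in-the-large)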
  • 50. Case study: the ADNEX model • Disclaimer: I developed this model, but it is by no means perfect. • Model to estimate the risk that an ovarian tumor is malignant • Multinomial model for the following 5-level tumor outcome: benign, borderline malignant, stage I invasive, stage II-IV invasive, secondary metastatic. We focus on the estimated risk of malignancy • Population: patients selected for surgery (outcome based on histology) • Data: 3506 patients from 1999-2007 (21 centers), 2403 patients from 2009-2012 (18 centers, mostly overlapping). 50 Van Calster et al. BMJ 2014;349:g5920.
  • 51. 51 IOTA data: 5909 patients, 24 centers, 10 countries. Patients per center: Leuven (930), Rome (787), Malmö (776), Genk (428), Monza (401), Prague (354), Bologna (348), Milan (311), Lublin (285), Cagliari (261), Milan 2 (223), Stockholm (120), London (119), Naples (103), Paris, Lund, Beijing, Maurepas, Udine, Barcelona, Florence, Milan 3, Naples 2, Ontario
  • 52. Case study: the ADNEX model • Strategy: develop on first 3506, validate on 2403, then refit on all 5909 o 2-year gap between datasets, makes sense to do the split o IECV might have been an option in hindsight? • A priori selection of 10 predictors, then backward elimination (in combination with multivariable fractional polynomials to decide on transformations). (Royston & Sauerbrei 2008) • Heterogeneity: o 1. random intercepts o 2. fixed binary predictor indicating type of center: oncology referral center vs other center 52 Royston & Sauerbrei. Multivariable Model‐Building: A pragmatic approach to regression analysis based on fractional polynomials for modelling continuous variables. Wiley 2008.
• 53. 53 Type of center as a predictor?
[Figure: centres on the x-axis, percentage of malignant tumours (0.00 to 0.80) on the y-axis.]
• 54. Case study: the ADNEX model • SAS PROC GLIMMIX to fit ri-MLR (took me ages)
proc glimmix data=iotamc method=rspl;
  class center NewIntercept;
  /* multinomial logistic model fitted via binary logits, using the 'NewIntercept' trick of Kuss & McLerran */
  model resp = NewIntercept NewIntercept*oncocenter NewIntercept*l2lesdmax
               NewIntercept*ascites NewIntercept*age NewIntercept*propsol
               NewIntercept*propsol*propsol NewIntercept*loc10 NewIntercept*papnr
               NewIntercept*shadows NewIntercept*l2ca125
               / noint dist=binomial link=logit solution;
  random NewIntercept / subject=Center type=un(1);   /* random intercepts per center */
  nloptions tech=nrridg;
  output out=iotamcpqlpredca&imp pred(noblup ilink)=p_pqlca pred(noblup noilink)=lp_pqlca;
  ods output ParameterEstimates=iotamcpqlca;
run;
Thanks to: Kuss & McLerran. Comput Methods Programs Biomed 2007;87:262-269. 54
  • 55. Case study: the ADNEX model • Shrinkage: uniform shrinkage factor per equation in the model • 10 variables, 17 parameters (MFP), 120 cases in smallest group • Multinomial EPP (de Jong et al 2019): 120 / ((5-1)*17) ≈ 2 • But 52 patients per parameter… • Sample size for multinomial outcomes needs ‘further research’ • Also, many MFP parameters: less ‘expensive’? Christodoulou et al 2021 • Refit on all data: Multinomial EPP 246 / 68 = 3.6 • 87 patients per parameter 55 de Jong et al. Statistics in Medicine 2019;38:1601-1608. Christodoulou et al. Diagnostic and Prognostic Research 2021;forthcoming.
  • 56. Case study: the ADNEX model • Validation on 2403 patients (no meta-analysis, only pooled assessment) 56
• 57. First small external validation 57
Center            N     Events (%)   AUC     95% CI
London (UK)       318   104 (33)     0.942   0.913-0.962
Southampton (UK)  175   56 (32)      0.900   0.841-0.938
Catania (IT)      117   22 (19)      0.990   0.959-0.998
Sayasneh et al. British Journal of Cancer 2016;115:542-548.
  • 58. Large external validation 58 Van Calster et al. BMJ 2020;370:m2614.
  • 59. Large external validation • 4905 patients, 2012-2015, 17 centers • First large study on the REAL population: any new patient with an ovarian tumor, whether operated or not (2489 were operated) • If not operated, patients are followed conservatively. Then, outcome was based on clinical and ultrasound information. o Differential verification! o If insufficient or inconsistent FU information: missing data (MI) 59
  • 60. Large external validation: discrimination • AUC o Analysis based on logit(AUC) and its SE (Snell 2019) – auRoc package o Random effects meta-analysis – metafor package, rma and predict functions o 95% prediction interval • Calibration o One-stage calibration model for calibration intercept and slope o We used this model for calibration curves (i.e. no flexible calibration curves) o Overall curve: 𝑎𝑎𝑗𝑗 and 𝑏𝑏𝑗𝑗 set to 0 o Center-specific curves: random terms used 60
  • 61. Large external validation: utility • Decision: referral for specialized oncological care • Reasonable range of risk thresholds: 5 to 50% • Per threshold, Net Benefit per center was combined using Bayesian trivariate random effects meta-analysis • Weak realistic priors • WinBUGS 61
• 68. Large external validation: subgroups 68
[Figure: results by subgroup. Operated (n=2489): 38% malignant. At least 1 FU visit (n=1958): 1% malignant.]
• 69. Large external validation: subgroups 69
[Figure: results by type of center. Oncology centers (n=3094): 26% malignant. Other centers (n=1811): 10% malignant.]
(Interestingly, LR2 (blue line) is the only model without type of center as a predictor)
• 70. Large external validation: meta-regression 70
rma.uni(logit_auc, sei = se_logit_auc, mods = logit(EventRate),
        data = iota5cs, method = "REML", knha = TRUE)
[Figure: meta-regression results: R2 42% (p 0.02), R2 34% (p 0.04), R2 11% (p 0.24), R2 16% (p 0.09).]
• 71. Large external validation: meta-regression 71
rma.uni(lnOE, sei = se_lnOE, mods = logit(EventRate),
        data = iota5cs, method = "REML", knha = TRUE)
[Figure: meta-regression results: R2 41% (p 0.29), R2 0% (p 0.50), R2 0% (p 0.52), R2 0% (p 0.31).]
• 72. Independent external validation studies 72
Study             Location            N    ER (%)  AUC    Calibration  Utility  Missing data  Reporting guideline mentioned
Joyeux 2016       Dijon/Chalon (FR)   284  11      0.938  No           No       CCA           None
Szubert 2016      Poznan (PL)         204  34      0.907  No           No       CCA           None
                  Pamplona (SP)       123  28      0.955  No           No       CCA           None
Araujo 2017       Sao Paulo (BR)      131  52      0.925  No           No       CCA           STARD
Meys 2017         Maastricht (NL)     326  35      0.930  No           No       MI            STARD
Chen 2019         Shanghai (CN)       278  27      0.940  No           No       CCA           None
Stukan 2019       Gdynia (PL)         100  52      0.972  No           No       CCA           STARD, TRIPOD
Jeong 2020        Seoul (KR)          59   17      0.924  No           No       No info       None
Viora 2020        Turin (IT)          577  25      0.911  No           No       CCA           None
Nam 2021          Seoul (KR)          353  4       0.920  No           No       No info       None
Poonyakanok 2021  Bangkok (TH)        357  17      0.975  No           No       CCA           None
• 73. Independent external validation studies 73
metagen(logitc, logitc_se, data = adnex, studlab = Publication,
        sm = "PLOGIT", backtransf = T, level = 0.95, level.comb = 0.95,
        comb.random = T, prediction = T, level.predict = 0.95,
        method.tau = "REML", hakn = TRUE)
  • 76. Regression vs Machine Learning 76 Breiman. Stat Sci 2001;16:199-231.
  • 77. Regression vs Machine Learning 77 Shmueli. Keynote talk at 2019 ISBIS conference, Kuala Lumpur; taken from slideshare.net Bzdok. Nature Methods 2018;15:233-4. ??
  • 78. Reason for popularity 78 “Typical machine learning algorithms are highly flexible So will uncover associations we could not find before Hence better predictions and management decisions” → One of the master keys, with guaranteed success!
  • 79. 1. Study design trumps algorithm choice 79 Do you have a clear research question? Do you have data that help you answer the question? What is the quality of the data? Dilbert.com Riley. Nature 2019;572:27-9. One simple example: EHR data were not collected for research purposes
  • 80. Example 80 Uses ANN, Random Forests, Naïve Bayes, and Logistic Regression. (Logistic regression performed best, AUC 0.92) Hernandez-Suarez et al. JACC Cardiovasc Interv 2019;12:1328-38.
  • 81. Example (contd) 81 The model also uses postoperative information. David J Cohen, MD: “The model can’t be run properly until you know about both the presence and the absence of those complications, but you don’t know about the absence of a complication until the patient has left the hospital.” https://www.tctmd.com/news/machine-learning-helps-predict-hospital-mortality-post-tavr-skepticism-abounds
  • 82. Example 2 82 Winkler et al. JAMA Dermatol 2019; in press.
  • 83. Example 3 83 Khazendar et al. Facts Views Vis ObGyn 2015;7:7-15.
  • 84. 2. Flexible algorithms are data hungry 84 http://www.portlandsports.com/hot-dog-eating-champ-kobayashi-hits-psu/
  • 85. Flexible algorithms are data hungry 85 Van der Ploeg et al. BMC Med Res Methodol 2014;14:137.
  • 86. Flexible algorithms are data hungry 86 https://simplystatistics.org/2017/05/31/deeplearning-vs-leekasso/ Marcus. arXiv 2018; arXiv:1801.00631
  • 87. 3. Signal-to-noise ratio (SNR) is often low 87 There is support that flexible algorithms work best with high SNR, not with low SNR. How can methods that look everywhere be better when you do not know where to look or what you look for? Hand. Stat Sci 2006;21:1-14.
  • 88. Algorithm flexibility and SNR 88 Makridakis et al. PLoS One 2018;13:e0194889. Ennis et al. Stat Med 1998;17:2501-8. Goldstein et al. Stat Med 2017;36:2750-63.
  • 89. So I find this hard to believe 89 Rajkomar et al. NEJM 2019;380:1347-58.
  • 90. What if you don’t know? 90 If you have no knowledge on what variables could be good predictors (and what variables not), are you ready to make a good prediction model? I do not think that using flexible algorithms will bypass this. I like a clear distinction between - predictor finding studies - prediction modeling studies
  • 91. Machine learning: success guaranteed? • Question 1: which algorithm works when? o Medical data: often limited signal:noise ratio o High-dimensional data vs CPM o Direct prediction from medical images using DL: different ballgame? • Question 2: ML for low-dimensional CPMs? 91
  • 92. ML vs LR: systematic review 92 Christodoulou et al. J Clin Epidemiol 2019;110:12-22.
  • 93. ML vs LR: systematic review 93 Christodoulou et al. J Clin Epidemiol 2019;110:12-22.
  • 94. ML vs LR: systematic review 94 Christodoulou et al. J Clin Epidemiol 2019;110:12-22.
  • 95. ML vs LR: systematic review 95 Christodoulou et al. J Clin Epidemiol 2019;110:12-22.
• 96. Poor modeling and unclear reporting 96
What was done about missing data? 45% fully unclear, 100% poor or unclear
How were continuous predictors modeled? 20% unclear, 25% categorized
How were hyperparameters tuned? 66% unclear, 19% tuned with information
How was performance validated? 68% unclear or biased approach
Was calibration of risk estimates studied? 79% not at all, HL test common
Prognosis: time horizon often ignored completely
Christodoulou et al. J Clin Epidemiol 2019;110:12-22.
  • 97. ML vs LR: comparison of AUC 97 Christodoulou et al. J Clin Epidemiol 2019;110:12-22.
  • 98. A few anecdotic observations 98 Christodoulou et al. J Clin Epidemiol 2019;110:12-22.
  • 99. Conclusions 99 Validation is important, but there is no such thing as a validated model. Expect and address heterogeneity in model development and validation. Ideally, models should be locally and dynamically updated. Realistic? Study methodology trumps algorithm choice. Machine learning: no success guaranteed. Reporting is key. TRIPOD extensions are underway (Cluster and AI).
  • 100. 100 TG6 - Evaluating diagnostic tests and prediction models Chairs: Ewout Steyerberg, Ben Van Calster Members: Patrick Bossuyt, Tom Boyles (clinician), Gary Collins, Kathleen Kerr, Petra Macaskill, David McLernon, Carl Moons, Maarten van Smeden, Andrew Vickers, Laure Wynants Papers: - Van Calster et al. Calibration: the Achilles heel of predictive analytics. BMC Medicine 2019. - Wynants et al. Three myths about risk thresholds for prediction models. BMC Medicine 2019. - McLernon et al. Assessing the performance of prediction models for survival outcomes: guidance for validation and updating with a new prognostic factor using a breast cancer case study. Near final! - A few others in preparation.