1. Str-AI-ght to heaven?
Pitfalls for clinical decision support based on AI
Ben Van Calster
Department Development and Regeneration and EPI-centre, KU Leuven
Department Biomedical Data Sciences, LUMC Leiden
Research Ethics Committee, UZ Leuven
ben.vancalster@kuleuven.be; @BenVanCalster
ISUOG World Congress, 16 October 2021
2. Disclaimer
• Talk last year: “a plea for good methodology”
• This talk builds on that, in the context of AI and machine learning
• There is a lot of hype surrounding AI/ML. It may have potential, but we had better get real!
https://lawtomated.com/enough-with-the-a-i-hype-and-why/
Lawtomated
3. Do not celebrate too early…
Copyright Bas Czerwinski / Getty Images
Julian Alaphilippe, Liège-Bastogne-Liège (Oct. 4th, 2020)
Real winner: Primož Roglič
4. Deep learning on medical images
Topol. Nat Med 2019;25:44-56. Zhu et al. Front Neurol 2019;10:869.
Titano et al. Nat Med 2018;24:1337-41; Nam et al. Radiology 2019;290:218-28; Ehteshami Bejnordi et al. JAMA 2017;318:2199-210;
Esteva et al. Nature 2017;542:115-8; De Fauw et al. Nat Med 2018;24:1342-50; Raman et al. Eye 2019;33:97-109.
5. Machine Learning for ‘EHR’ data
Rajkomar et al. Npj Digit Med 2018;1:18.
Rose. JAMA Netw Open 2018;1:e181404.
6. Reason for popularity?
“Very complex machine learning algorithms are highly flexible,
and hence find relationships we could not see before.
Therefore we make better predictions and better decisions.”
→ Guaranteed success!
Right?
7. Pitfalls for “predictive analytics”
1. Poor methodology
2. Lack of evidence
3. Considerable heterogeneity
4. (Financial) conflicts of interest
5. Actual implementation in clinical practice
8. 1. Methodology matters, not impact factors
Altman DG. BMJ 1994;308:283-284.
Van Calster et al. J Clin Epidemiol 2021 (our own ‘frustration paper’).
9. ‘Predictive analytics’: COVID-19
Wynants et al. BMJ 2020;369:m1328.
The review found more than 1 paper a day (!)
Results not trustworthy for 97% of the 231 models
Median sample size: 338
Non-representative sample: 42%
Representativity unclear: 25%
Data analysis problematic: 94%
No model validation at all: 22%
10. Predictive analytics for COVID-19
Wynants et al. BMJ 2020;369:m1328
Deep learning models for COVID-19 diagnosis using CT or X-ray (RX)
- No discussion of target population or setting
- Control group (without COVID-19):
Images from pediatric population
Images from a different country
Images from different time periods
Barely defined, e.g. ‘healthy persons’
- Images from online repository, without further information
- Often no demographic description at all (not even age or sex!)
12. Public COVID-19 RX datasets
Santa Cruz et al. Med Image Analysis 2021;74:102225.
13. Complex algorithms are data hungry
So you dream of having a Porsche?
If you cannot (or don’t want to) pay for it, you may get this...
This also holds for predictive analytics: the fancier the model, the higher the price.
The currency: GOOD data.
14. Measurement and data quality
Missing values: the tricky importance of the invisible
Measurement: timing and procedure matter
Outcome: quality labels are key (see e.g. deep learning on medical images)
Beam & Kohane. JAMA 2018;319:1317-1318.
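The point about label quality can be made concrete with a small simulation (my own illustration, not from the slides; it assumes numpy and scikit-learn are available): even a risk score that is perfect by construction loses apparent discrimination once a fraction of the outcome labels is wrong.

```python
# Illustrative sketch (not from the talk): even a perfect risk score loses
# apparent discrimination when outcome labels are noisy.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 20_000
risk = rng.uniform(0.05, 0.95, n)   # the "true" risks, known exactly here
y = rng.binomial(1, risk)           # clean outcome labels

aucs = {}
for noise in (0.0, 0.1, 0.2):
    flip = rng.random(n) < noise            # mislabel this fraction of outcomes
    y_noisy = np.where(flip, 1 - y, y)
    aucs[noise] = roc_auc_score(y_noisy, risk)
    print(f"label noise {noise:.0%}: apparent AUC = {aucs[noise]:.3f}")
```

No modelling step can recover what sloppy labels throw away: the ceiling on measurable performance drops with every mislabeled outcome.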
15. 2. Wanted: evidence
• Kleinrouweler (AJOG 2016): 263 models in obstetrics
• Only 23 of these (9%) had been externally validated…
Other examples of model overload:
• 1060 models predicting outcomes after CVD (1990-2015) (Wessler et al, 2017)
• 363 models predicting CVD (Damen et al, 2016)
• 231 models related to COVID-19 (Wynants et al, 2020), and counting!
• 116 models to diagnose ovarian malignancy (Kaijser et al, 2014)
Wessler et al. Diagn Progn Res 2017;1:20. Damen et al. BMJ 2016;353:i2416. Wynants et al. BMJ 2020;369:m1328.
Kleinrouweler et al. AJOG 2016;214:79-90. Kaijser et al. Hum Reprod Update 2014;20:229-62.
16. Smartphone apps for skin lesions
Freeman et al. BMJ 2020;368:m127
• 9 validation studies covering 6 apps
• 1132 lesions in total (average 126 per study)
• Methodological quality was poor
o Selective inclusions (non-representative)
o Images were taken and selected by clinicians
o Lots of unusable images
Scarce and poor evidence
17. Radiology AI
Van Leeuwen et al. Eur Radiol 2021;31:3797-3804
• 64/100: no evidence
• 18/100: evidence of diagnostic performance
• 18/100: evidence of potential impact
• Half of the studies were independent, the other half had conflicts of interest
18. 3. Expect (a lot of) heterogeneity
• Changes in care over time
• Differences in care between healthcare systems
• Differences in populations between practices/hospitals/regions
• Differences in hardware, software, and measurement procedures
• Differences in performance between patient subgroups (cf. fairness)
Futoma et al. Lancet Digit Health 2020;2:e489-e492.
21. Hardware/software
Badgeley et al. npj Digit Med 2019;2:31.
Deep learning was better at predicting scanner model and brand
(AUC ≥ 0.98) than at predicting hip fracture (AUC 0.78)
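Badgeley et al.’s finding can be mimicked with a toy simulation (my own sketch, not their analysis; numpy and scikit-learn assumed): when the ‘image features’ carry a strong scanner fingerprint and the scanner correlates with the outcome, a model finds the scanner far more easily than the disease.

```python
# Toy shortcut-learning simulation (illustration only, not Badgeley et al.'s data):
# features contain a weak fracture signal plus a strong scanner fingerprint,
# and scanner choice correlates with fracture prevalence.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 10_000
scanner = rng.binomial(1, 0.5, n)   # 0 = portable scanner, 1 = fixed scanner
# Bed-bound (sicker) patients get the portable scanner and fracture more often:
fracture = rng.binomial(1, np.where(scanner == 0, 0.6, 0.2))
X = np.column_stack([
    fracture + rng.normal(0.0, 2.0, n),   # weak disease signal
    scanner + rng.normal(0.0, 0.1, n),    # near-noiseless scanner fingerprint
])

def auc_for(target):
    """In-sample AUC of a logistic regression fitted to predict `target`."""
    model = LogisticRegression().fit(X, target)
    return roc_auc_score(target, model.predict_proba(X)[:, 1])

frac_auc, scan_auc = auc_for(fracture), auc_for(scanner)
print(f"AUC for fracture: {frac_auc:.2f}")   # modest
print(f"AUC for scanner:  {scan_auc:.2f}")   # near-perfect
```

Because the fracture model also leans on the fingerprint, its apparent performance partly rides on the scanner mix, and can be expected to degrade when the scanner–outcome association changes at a new site.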
22. Where do DL datasets come from anyway?
Kaushal et al. JAMA 2020;324:1212-1213.
24. DL research (Sep 2021)
Perkonigg et al. Nat Comm 2021;12:5678.
25. 4. Proprietary datasets and models
Van Calster et al. JAMIA 2019;26:1651-1654.
https://hai.stanford.edu/news/flying-dark-hospital-ai-tools-arent-well-documented.
Not necessarily bad in principle: financial resources are needed
But it may hamper openness, availability, and independent validation
COVID review: companies often did not respond, yet claimed that the model was used on thousands of patients
27. Google’s Dermatology Assist (CE label)
https://www.statnews.com/2021/06/02/machine-learning-ai-methodology-research-flaws/.
Roxana Daneshjou (Stanford):
- No evaluation on external dataset.
- Insufficient variation in skin types.
- Outcome rarely based on biopsy.
- “I haven't seen data that makes me feel comfortable with putting this in the hands of patients or physicians.”
28. External validation of EPIC sepsis model
Wong et al. JAMA Intern Med 2021;181:1065-1070.
Model: penalized logistic regression with 80 variables
Data: 3 healthcare organizations, 2013-2015
AUC according to internal documentation: 0.78-0.83
Validation: 1 academic center, 2018-2019
AUC 0.63, calibration poor (predicted risks far too high)
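The kind of check Wong et al. ran can be sketched in a few lines (simulated data, not the Epic model or its data; numpy and scikit-learn assumed): report discrimination via the AUC and a simple calibration summary such as the observed/expected (O/E) event ratio, where O/E well below 1 means the predicted risks are systematically too high.

```python
# Sketch of an external-validation check on simulated data (not the Epic
# sepsis model): discrimination via AUC, calibration via the O/E ratio.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 5_000
true_risk = rng.beta(2, 18, n)               # low-prevalence outcome (~10%)
y = rng.binomial(1, true_risk)               # observed outcomes at validation
pred_risk = np.clip(2.5 * true_risk, 0, 1)   # model overestimates every risk

auc = roc_auc_score(y, pred_risk)
oe_ratio = y.sum() / pred_risk.sum()         # O/E < 1: risks are too high

print(f"AUC: {auc:.2f}")
print(f"O/E ratio: {oe_ratio:.2f}")
```

This is exactly the pattern the external validation exposed: discrimination can look acceptable on paper while the absolute risks handed to clinicians are far off.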
29. 5. Actual implementation
Logistical/practical issues to fit model in clinical workflow
Psychological issues regarding model use by healthcare staff
Medicolegal: who is responsible when a prediction is wrong?
https://www.statnews.com/2020/03/09/can-you-sue-artificial-intelligence-algorithm-for-malpractice/
Panch et al. npj Digit Med 2019;2:77.
30. Lack of evidence revisited: impact?
Clinical impact studies: scarce, difficult
Clinical decision support is a complex intervention (Kappen et al, 2018)
Endpoints of impact studies?
- Process-related endpoints: ‘easy’, but only intermediate
- Long-term patient outcomes: difficult, and lower effect sizes are expected
Kappen et al. Diagn Progn Res 2018.
31. So, does medical AI ‘work’?
We still often don’t know!
Trust jeopardized by
- poor methodology
- lack of evidence
- lack of openness.
It may have potential if done well and evidence is gathered.
The AI community and academia often shoot themselves in the foot, which is a pity
Academia: wrong incentives (publish or perish)!
Companies: financial conflicts of interest!
32. That’s (not) all folks…
https://www.technologyreview.com/2019/06/06/239031/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/.
https://spectrum.ieee.org/deep-learning-computational-cost
Thompson et al. IEEE Spectrum 2021.
Hao. MIT Technology Review 2019.