Data Con LA 2019 - Best Practices for Prototyping Machine Learning Models for Healthcare by Lorenzo Rossi

Medical institutions, universities and software giants like Google and Microsoft are dedicating increasing resources to machine learning for healthcare. This is a very exciting but relatively young field. However, best practices for methods and reporting of results are not yet fully established. I have 2.5 years of experience as a data scientist at a national cancer center working on clinical data, evaluating external vendors and peer reviewing machine learning in healthcare papers. The talk gives an overview of best practices in prototyping machine learning models on data from the patient electronic health record (EHR). The topics addressed are:
1. Introduction to the EHR
2. Overview of machine learning applications to the EHR
3. Cohort definition for survival problems
4. Data cleaning
5. Performance metrics
Excerpts of papers from renowned institutions will be critically reviewed. The material is intended to be useful not only to machine learning for healthcare professionals, but to practitioners dealing with very unbalanced datasets in the temporal domain. For example, customer churn prediction can be modeled as a survival problem.

Lorenzo Rossi, PhD
Data Scientist
City of Hope National Medical Center
DataCon LA, August 2019
Best Practices for Prototyping Machine Learning Models for Healthcare

Machine learning in healthcare is growing fast, but best practices are not well established yet
[Towards Guidelines for ML in Health (8.2018, Stanford)]

Motivations for ML in Healthcare
1. Lots of information about patients, but not enough time for clinicians to process it
2. Physicians spend too much time typing information about patients during encounters
3. Overwhelming amount of false alerts (e.g. in the ICU)

Topics
1. The electronic health record (EHR)
2. Cohort definition
3. Data quality
4. Training - testing split
5. Performance metrics and reporting
6. Survival analysis

1. The Electronic Health Record (EHR)

EHR data are very heterogeneous:
• Laboratory tests [multi-dimensional time series]
• Vitals [multi-dimensional time series]
• Diagnoses [text, codes]
• Medications [text, codes, numeric]
• X-rays, CT scans, EKGs, … [2D - 3D images, time series, …]
• Notes [text]

Time is a key aspect of EHR data
[Timeline figure: labs, vitals, notes, … for patients p01, p02, p03 over time]
Temporal resolution varies a lot:
• ICU patient [minutes]
• Hospital patient [hours]
• Outpatient [weeks]
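One common way to handle these very different temporal resolutions is to resample each stream onto a fixed time grid before feature extraction. The pandas sketch below is an illustrative assumption (the talk does not prescribe any particular grid): an irregular vitals stream for one ward patient is aggregated to hourly values.

```python
import pandas as pd

# Hypothetical irregular vitals stream for one hospital patient.
vitals = pd.DataFrame({
    "time": pd.to_datetime(["2019-03-01 08:05", "2019-03-01 08:40",
                            "2019-03-01 11:20", "2019-03-02 07:55"]),
    "heart_rate": [92, 95, 101, 88],
}).set_index("time")

# Resample to an hourly grid, a common choice for ward patients; ICU streams
# might instead stay at minute resolution, outpatient data at days or weeks.
hourly = vitals["heart_rate"].resample("1H").mean().ffill()
print(hourly.head())
```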
Events hospitals want to predict from EHR data:
• Unplanned 30 day readmission
• Length of stay
• Mortality
• Sepsis
• ICU admission
• Surgical complications
Goals: improve capacity, optimize decisions

Consider only binary prediction tasks for simplicity
Prediction algorithm gives a score from 0 to 1
– E.g. close to 1 → high risk of readmission within 30 days
Trade-off between falsely detected and missed targets
2. Cohort Definition

Cohort: individuals “who experienced particular event during specific period of time”
Given a prediction task, select a clinically relevant cohort
E.g. for surgery complication prediction, patients who had one or more surgeries between 2011 and 2018.

A. Pick records of a subset of patients
[Timeline figure: labs, vitals, notes, … for patients p01, p02, p03 over time]

B. Pick a prediction time for each patient. Records after the prediction time are discarded
[Timeline figure: patient records kept only up to each prediction time]
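Below is a minimal pandas sketch of steps A and B, assuming a hypothetical long-format event table with patient_id and event_time columns and a made-up prediction_times lookup; it illustrates the truncation idea, not the speaker's actual pipeline.

```python
import pandas as pd

# Hypothetical long-format EHR event table: one row per lab / vital / note.
events = pd.DataFrame({
    "patient_id": ["p01", "p01", "p02", "p02", "p03"],
    "event_time": pd.to_datetime(["2018-01-05", "2018-03-10", "2018-02-01",
                                  "2018-06-20", "2018-04-02"]),
    "feature":    ["albumin", "heart_rate", "albumin", "note", "creatinine"],
    "value":      [3.6, 88.0, 2.9, None, 1.1],
})

# A. Pick records of a subset of patients (the cohort).
cohort_ids = {"p01", "p02"}
cohort_events = events[events["patient_id"].isin(cohort_ids)]

# B. Pick a prediction time for each patient and discard later records, so no
#    information from after the prediction time leaks into the features.
prediction_times = pd.Series(pd.to_datetime(["2018-02-01", "2018-05-01"]),
                             index=["p01", "p02"])
cutoff = cohort_events["patient_id"].map(prediction_times)
history = cohort_events[cohort_events["event_time"] <= cutoff]
print(history)
```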
3. Data Quality
[Image source: Salesforce]

EHR data are challenging in many different ways

Example: most common non-numeric entries for lab values in a legacy EHR system
• pending
• “>60”
• see note
• not done
• “<2”
• normal
• “1+”
• “2 to 5”
• “<250”
• “<0.1”
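One way to make such entries usable is sketched below: plain numbers are parsed, censored values like “>60” or “<2” are mapped to their bound, and uninformative strings become missing values. The parse_lab_value helper is hypothetical and the mapping rules are illustrative assumptions, not clinical guidance.

```python
import re
import numpy as np

def parse_lab_value(raw):
    """Return a float for numeric-like lab entries, NaN otherwise (illustrative rule)."""
    if raw is None:
        return np.nan
    s = str(raw).strip().strip('"').lower()
    try:
        return float(s)                        # plain numeric string, e.g. "3.4"
    except ValueError:
        pass
    m = re.fullmatch(r"[<>]\s*(\d+(\.\d+)?)", s)
    if m:                                      # censored value, e.g. "<2" or ">60"
        return float(m.group(1))               # or keep the bound as a separate flag
    # "pending", "see note", "not done", "normal", "1+", "2 to 5", ... -> missing
    return np.nan

print([parse_lab_value(v) for v in ["pending", ">60", "<0.1", "3.4", "2 to 5"]])
```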
Example: discrepancies in dates of death between hospital records and Social Security (~4.8% of shared patients)

Anomalies vs. Outliers

Distinguish between anomalies and outliers
Outlier: legitimate data point far away from the mean/median of the distribution
Anomaly: illegitimate data point generated by a process different from the one generating the rest of the data
Need domain knowledge to differentiate
E.g.: Albumin level in blood. Normal range: 3.4 – 5.4 g/dL. µ = 3.5, σ = 0.65 over the cohort.
– ρ = -1 → anomaly (treat as missing value)
– ρ = 1 → possibly an outlier (clinically relevant)
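A minimal sketch of this distinction for the albumin example, assuming an illustrative plausibility range (the bounds below are placeholders, not clinical reference values): implausible readings are treated as anomalies and set to missing, while extreme but plausible readings are kept as outliers.

```python
import numpy as np
import pandas as pd

albumin = pd.Series([3.6, 4.1, -1.0, 1.0, 5.2, 12.0])  # g/dL, toy values

# Assumed plausibility bounds for serum albumin (illustrative only):
# anything outside cannot be a real measurement.
PLAUSIBLE_LOW, PLAUSIBLE_HIGH = 0.5, 7.0

# Anomalies (e.g. -1.0 or 12.0): illegitimate values -> treat as missing.
cleaned = albumin.where(albumin.between(PLAUSIBLE_LOW, PLAUSIBLE_HIGH), np.nan)

# Outliers (e.g. 1.0): far from the cohort mean but legitimate and clinically
# relevant -> keep them; do NOT clip or drop.
print(cleaned.tolist())   # [3.6, 4.1, nan, 1.0, 5.2, nan]
```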
4. Training - Testing Split

Guidelines
• Machine learning models are evaluated on their ability to make predictions on new (unseen) data
• Split train (cross-validation) and test sets based on temporal criteria
– e.g. no records in the train set after the prediction dates in the test set
– random splits, even if stratified, could include records virtually from the ‘future’ to train the model
• In retrospective studies, also avoid records of the same patients across train and test
– the model could just learn to recognize patients
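A sketch of a split that follows both guidelines, assuming a feature table with hypothetical prediction_time and patient_id columns: the latest prediction times form the test set, and any patient who appears in the test set is dropped from the training set.

```python
import pandas as pd

def temporal_patient_split(df, time_col="prediction_time",
                           patient_col="patient_id", test_frac=0.2):
    """Temporal split: the latest prediction times form the test set, and any
    patient present in the test set is removed from the training set."""
    df = df.sort_values(time_col)
    split_idx = int(len(df) * (1 - test_frac))
    train, test = df.iloc[:split_idx], df.iloc[split_idx:]
    # No records from the 'future' in train, and no shared patients, so the
    # model cannot simply learn to recognize individual patients.
    train = train[~train[patient_col].isin(set(test[patient_col]))]
    return train, test

# Usage (hypothetical feature table):
# train_df, test_df = temporal_patient_split(features, "prediction_time", "patient_id")
```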
5. Performance Metrics and Reporting

Background
Generally highly imbalanced problems:
• 15% unplanned 30 day readmissions
• < 10% sepsis cases
• < 1% 30 day mortality

Types of Performance Metrics
1. Measure trade-offs
– (ROC) AUC, average precision / PR AUC
– good for global performance characterization and (intra-)model comparisons
2. Measure error rate at a specific decision point
– false positive and false negative rates, precision, recall, F1, accuracy
– possibly good for interpretation of specific clinical costs and benefits

Don’t use accuracy unless the dataset is balanced

ROC AUC can be misleading too: ROC AUC (1 year) > ROC AUC (5 years), but PR AUC (1 year) < PR AUC (5 years)! The latter prediction task is easier.
[Avati, Ng et al., Countdown Regression: Sharp and Calibrated Survival Predictions. arXiv, 2018]
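A small synthetic illustration of the same point (not from the talk): on a task with roughly 1% prevalence, a model can post a healthy-looking ROC AUC while average precision remains low.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n, prevalence = 100_000, 0.01              # ~1% positives, e.g. 30 day mortality
y = (rng.random(n) < prevalence).astype(int)
# Toy risk scores: positives get a modest shift over negatives.
scores = rng.normal(size=n) + 1.5 * y

print("ROC AUC:          ", round(roc_auc_score(y, scores), 3))            # looks strong
print("Average precision:", round(average_precision_score(y, scores), 3))  # much lower
```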
Performance should be reported with both types of metrics
• 1 or 2 metrics for trade-off evaluation
– ROC AUC
– average precision
• 1 metric for performance at a clinically meaningful decision point
– e.g. recall @ 90% precision
• Plus a comparison with a known benchmark (baseline)
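A minimal scikit-learn sketch of this reporting recipe; report_metrics is a made-up helper that returns ROC AUC, average precision, and the best recall achievable at or above the target precision.

```python
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             precision_recall_curve)

def report_metrics(y_true, y_score, target_precision=0.90):
    """ROC AUC, average precision, and recall at the target precision
    (hypothetical reporting helper)."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    achievable = recall[precision >= target_precision]
    recall_at_p = achievable.max() if achievable.size else 0.0
    return {
        "roc_auc": roc_auc_score(y_true, y_score),
        "average_precision": average_precision_score(y_true, y_score),
        f"recall@{int(target_precision * 100)}%_precision": recall_at_p,
    }
```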
Metrics in a Stanford 2017 paper on mortality prediction: AUC, average precision, recall @ 90% precision

Benchmarks

Main paper [Google, Nature, 2018] only reports deep learning results with no benchmark comparison
Comparison only in the supplemental online file (not in the Nature paper): deep learning only 1-2% better than the logistic regression benchmark

Plot scales can be deceiving [undisclosed vendor, 2017]!
Same TP, FP plots rescaled

6. Survival Analysis

B. Pick a prediction time for each patient. Records after the prediction time are discarded
[Timeline figure: patient records kept only up to each prediction time]

C. Plot survival curves
• Consider binary classification tasks
– Event of interest (e.g. death) either happens or not before the censoring time
• Survival curve: distribution of time to event and time to censoring
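One way to plot such curves is the Kaplan-Meier estimator; below is a minimal sketch using the lifelines package (an assumption on my part, since the talk names no library), with a made-up per-patient table of time from the prediction time to death or censoring.

```python
import pandas as pd
from lifelines import KaplanMeierFitter  # assumed library; the talk names none

# Hypothetical per-patient table: time from prediction time to death or censoring.
cohort = pd.DataFrame({
    "days_to_event_or_censor": [12, 45, 90, 30, 365, 200],
    "death_observed":          [1, 0, 1, 1, 0, 0],   # 1 = event, 0 = censored
})

kmf = KaplanMeierFitter()
kmf.fit(cohort["days_to_event_or_censor"],
        event_observed=cohort["death_observed"])
kmf.plot_survival_function()   # survival profile implied by the chosen prediction times
```

Comparing these curves across different choices of prediction time shows at a glance when a cohort is dominated by imminently dying patients, which is the overestimation scenario described on the next slides.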
Different selections of prediction times lead to different survival profiles over the same cohort

Example: high percentage of patients deceased within 30 days. The model is trained to distinguish mostly between relatively healthy and moribund patients → performance overestimate

Final Remarks
• Outliers should not be treated like anomalies
• Split train (CV) and test sets temporally
• Metrics:
– ROC AUC alone could be misleading
– Precision-Recall curve often more useful than ROC
– Compare with meaningful benchmarks
• Performance possibly overestimated for cohorts with unrealistic survival curves

Thank You!
Twitter: @LorenzoARossi

Supplemental Material

Example: ROC Curve
Very high detection rate, but also high false alarm rate
