This talk begins by showing how accurately 4000 different diagnoses can be predicted in advance for any patient, from one month to twenty years before first occurrence in the patient, using high-throughput machine learning. Shortcomings of this approach motivate ways to turn prediction methods into algorithms for finding causal associations; the resulting algorithms attain high accuracy in tasks of drug repurposing and discovery of adverse drug events, but they do not come with provable guarantees of making correct causal inferences. We then introduce variants of probably approximately correct (PAC) learning for finding causal associations, which can provide weaker but useful guarantees for algorithms such as these, motivated by our experiences with EHR data.
2019 Triangle Machine Learning Day - Machine Learning from De-Identified Coded Electronic Health Records (EHRs) - David Page, September 20, 2019
1. New Machine Learning Algorithms for
Electronic Health Records (EHRs)
David Page
Chair, Department of Biostatistics & Bioinformatics
Duke University
2. Acknowledgements
NIH BD2K, NLM, NIGMS, FDA
Office of Generic Drugs
Center for Predictive
Computational Phenotyping
Ross Kleiman
Charles Kuang
Xiayuan Huang
Wei Zhang
Collin Engstrom
Scott Hebbring
Paul Bennett
Yujia Bao
Jon Badger
Robert Elston
Guillerme Rosa
Aubrey Barnard
Michael Caldwell
Jesse Davis
Eric Lantz
Jie Liu
Houssam Nassif
Peggy Peissig
Vitor Santos Costa
Humberto Vidaillet
Jeremy Weiss
Becca Willett
Wisconsin Genomics Initiative
3. The Electronic Health Record (EHR)
Demographics
ID | Year of Birth | Gender
P1 | 3.22.1963     | M

Diagnoses
ID | Date     | Diagnosis | Sign/Symptom
P1 | 6.2.1990 | PVC       | Palpitations
4. The Electronic Health Record (EHR)
Demographics
ID | Year of Birth | Gender
P1 | 3.22.1963     | M

Diagnoses
ID | Date       | Diagnosis           | Sign/Symptom
P1 | 7.3.1997   | Elevated BP         |
P1 | 2011.06.02 | Atrial fibrillation | Dizzy, discomfort
5. The Electronic Health Record (EHR)
Demographics
ID | Year of Birth | Gender
P1 | 3.22.1963     | M

Diagnoses
ID | Date       | Diagnosis           | Sign/Symptom
P1 | 9.1.1998   | Atrial fibrillation | Shortness of breath, panic
P1 | 2011.06.02 | Atrial fibrillation | Dizzy, discomfort
6. EHR Data in a Data Warehouse
Patient ID | Gender | Birthdate
P1         | M      | 3/22/1963

Patient ID | Date     | Physician | Symptoms     | Diagnosis
P1         | 1/1/2001 | Smith     | palpitations | hypoglycemic
P1         | 2/1/2001 | Jones     | fever, aches | influenza

Patient ID | Date     | Lab Test      | Result
P1         | 1/1/2001 | blood glucose | 42
P1         | 1/9/2001 | blood glucose | 45

Patient ID | Date     | Observation | Result
P1         | 1/1/2001 | Height      | 5'11"
P2         | 1/9/2001 | BMI         | 34.5

Patient ID | Date Prescribed | Date Filled | Physician | Medication | Dose | Duration
P1         | 5/17/1998       | 5/18/1998   | Jones     | Prilosec   | 10mg | 3 months
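As a minimal illustration of the warehouse layout above, the sketch below (table contents copied from the slide; record and field names are my own, hypothetical choices) stores each table as a list of records and assembles a per-patient view by joining on patient ID:

```python
# Hypothetical sketch: EHR warehouse tables as lists of dicts,
# joined into a single per-patient view keyed on patient ID.
demographics = [{"pid": "P1", "gender": "M", "birthdate": "3/22/1963"}]
diagnoses = [
    {"pid": "P1", "date": "1/1/2001", "symptoms": "palpitations", "dx": "hypoglycemic"},
    {"pid": "P1", "date": "2/1/2001", "symptoms": "fever, aches", "dx": "influenza"},
]
labs = [
    {"pid": "P1", "date": "1/1/2001", "test": "blood glucose", "result": 42},
    {"pid": "P1", "date": "1/9/2001", "test": "blood glucose", "result": 45},
]

def patient_view(pid):
    """Collect one patient's rows from every table in the warehouse."""
    return {
        "demographics": [r for r in demographics if r["pid"] == pid],
        "diagnoses": [r for r in diagnoses if r["pid"] == pid],
        "labs": [r for r in labs if r["pid"] == pid],
    }

view = patient_view("P1")
print(len(view["diagnoses"]), len(view["labs"]))  # 2 2
```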
8. Sample Input Data Set for Typical Machine Learning
Patient | Gender | Age | Hypertension within last year | ... | Average LDL last 5 years | Statin | MI in next 5 years
P1      | F      | 32  | No                            | ... | 120                      | No     | No
P2      | F      | 45  | Yes                           | ... | 154                      | No     | No
P3      | M      | 24  | No                            | ... | 136                      | No     | No
P4      | M      | 58  | Yes                           | ... | 210                      | No     | Yes
...     | ...    | ... | ...                           | ... | ...                      | ...    | ...
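Rows like those above are built by summarizing each patient's longitudinal records up to a reference date. Here is a hedged sketch of that featurization step (the 1-year diagnosis window, 5-year lab window, and all field names are assumptions for illustration, not the talk's actual pipeline):

```python
from datetime import date

# Hypothetical sketch: turn longitudinal records into one flat
# feature row per patient, as in the table above.
def feature_row(pid, gender, birth, dxs, labs, ref):
    age = ref.year - birth.year  # crude age: year difference only
    # Diagnosis flag: any hypertension code within the last 365 days.
    htn_last_year = any(d["dx"] == "hypertension" and (ref - d["date"]).days <= 365
                        for d in dxs if d["pid"] == pid)
    # Lab summary: average LDL over the last 5 years, None if no values.
    ldl = [l["result"] for l in labs
           if l["pid"] == pid and l["test"] == "LDL" and (ref - l["date"]).days <= 5 * 365]
    avg_ldl = sum(ldl) / len(ldl) if ldl else None
    return {"patient": pid, "gender": gender, "age": age,
            "htn_last_year": htn_last_year, "avg_ldl_5y": avg_ldl}

dxs = [{"pid": "P2", "dx": "hypertension", "date": date(2001, 3, 1)}]
labs = [{"pid": "P2", "test": "LDL", "result": 150, "date": date(1999, 1, 1)},
        {"pid": "P2", "test": "LDL", "result": 158, "date": date(2000, 1, 1)}]
row = feature_row("P2", "F", date(1956, 5, 2), dxs, labs, date(2001, 6, 1))
print(row["htn_last_year"], row["avg_ldl_5y"])  # True 154.0
```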
10. Marshfield Clinic EMR
• Marshfield Clinic
  − Health system in North Central Wisconsin
• 1.5M patient records spanning 40 years
  − Demographics
  − Diagnoses (ICD-9)
  − Labs
  − Procedures
  − Vitals
11. ROC Curves for Afib-related Predictions
(Davis et al., 2009; Lantz et al., 2012)
[Figure 1: ROC curves for the four target prediction tasks: predicting AF/F onset and, given AF/F onset, predicting each of stroke, 1-year mortality, and 3-year mortality, all using roughly 25,000 extracted features.]
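Such curves are scored by the area under the ROC curve (AUC). A minimal sketch of that computation, using the standard rank interpretation (AUC is the probability a random positive case outscores a random negative case); the scores and labels here are made up for illustration:

```python
# Compute AUC from model scores via the pairwise-ranking definition:
# fraction of (positive, negative) pairs the model orders correctly,
# counting ties as half credit.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
print(round(auc(scores, labels), 4))  # 0.8333 (5 of 6 pairs ordered correctly)
```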
12. Motivation and Next Steps
• If we can predict atrial fibrillation accurately, in
advance, then we can intervene to prevent it in some
cases
−aspirin or stronger blood thinner
−more rest, less stress
−beta blocker to control BP, etc.
• Some other prediction tasks we’ve addressed:
− Myocardial infarction (heart attack)
− Warfarin dosing
− Age-related macular degeneration
− Breast cancer (from mammograms and genetics)
• How accurately can everything be predicted?
• How far in advance?
22. MSCCS for Numeric Response (Kuang et al., KDD 2016)
[Figure: example EHR timelines for two patients, showing drug exposures (Drug 1, Drug 2) and conditions (Diabetes, Bleeding) plotted against age.]
Properties: longitudinal, observational.
Applications: Adverse Drug Reaction (ADR) discovery; Computational Drug Repositioning (CDR).
23. Time-Varying Baseline (Kuang et al., IJCAI 2016)
[Figure: one patient's timeline, with diabetes onset and two insulin exposures at ages 30, 45, and 60.]
Time-Varying Baseline: add regularization to minimize change in proximal, consecutive t_ij values:

  y_ij | x_ij = t_ij + β⊤x_ij + ε_ij,  ε_ij ~ N(0, σ²)

The regularization term penalizes consecutive baseline values that are close together in time but far apart in numerical value.
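A hedged sketch of the resulting penalized objective: squared-error fit for the model above, plus a smoothness penalty on consecutive baseline values t_ij, scaled by the time gap so that baselines close in time but far apart in value are penalized most. The exact penalty form, the weight `lam`, and all numbers here are illustrative assumptions, not the paper's formulation:

```python
# Hypothetical penalized objective for a time-varying baseline:
#   sum_j (y_ij - t_ij - beta * x_ij)^2
#   + lam * sum_j (t_i,j+1 - t_ij)^2 / (time_j+1 - time_j)
def loss(y, x, times, t, beta, lam):
    fit = sum((yi - ti - beta * xi) ** 2 for yi, xi, ti in zip(y, x, t))
    penalty = lam * sum((t[j + 1] - t[j]) ** 2 / (times[j + 1] - times[j])
                        for j in range(len(t) - 1))
    return fit + penalty

# One patient: response y, binary drug exposure x, measurement ages.
y, x, times = [150, 120, 160], [1, 1, 0], [30, 45, 60]
val = loss(y, x, times, t=[140, 140, 150], beta=-15, lam=1.0)
print(round(val, 2))  # 756.67
```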
25. Time-Varying Baseline for Binary Response (Kuang et al.,
KDD 2017)
• Response variable: heart attack occurrence on a
particular day (binary)
• Features: drug exposure statuses on a particular day
(binary) + baseline occurrence rates
28. Thus Far
• Charles Kuang: time-varying patient-specific baseline into
linear regression (KDD 2016), time-varying baseline into
Poisson regression (KDD 2017)
• Yujia Bao (with Becca Willett): patient-specific baseline into
point process models (MLHC 2017)
• Wei Zhang: patient-specific baseline into deep neural nets
• Sinong Geng and Charles Kuang: patient-specific baseline
into graphical models (temporal Poisson square root
models), ICML 2018
• Can we bring time-varying baselines into deep neural
networks or random forests? May need other types of
regularization, EM rather than convex optimization.
29. Major Challenge: Evaluation
• Real-world data: limited causal ground truth
– Just 9 true drug-event pair ADEs in OMOP
– Are we now overfitting OMOP ground truth?
– Some criticism of each ground truth available
• Synthetic data: self-fulfilling prophecy?
• Theory: very strong assumptions, proofs usually
about guaranteed correct causal inferences
30. Probably Approximately Correct (PAC)
learning [Valiant, CACM 1984]
• Consider a class C of possible target concepts defined over a set of
  instances X of length n, and a learner L using hypothesis space H
• C is PAC learnable by L using H if, for all
  − c ∈ C,
  − probability distributions D over X,
  − ε such that 0 < ε < 0.5, and
  − δ such that 0 < δ < 0.5,
• learner L will, with probability at least (1 − δ), output a hypothesis
  h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in
  1/ε, 1/δ, n, and size(c)
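To make the definition concrete, here is the standard sample-complexity bound for a finite hypothesis space: a consistent learner needs m ≥ (1/ε)(ln|H| + ln(1/δ)) examples to be probably (with probability 1 − δ) approximately (error at most ε) correct. This is a textbook corollary of the PAC definition, not a result from the talk:

```python
from math import ceil, log

# Sample size sufficient for a consistent learner over a finite H:
#   m >= (1/eps) * (ln|H| + ln(1/delta))
def pac_sample_size(h_size, eps, delta):
    return ceil((log(h_size) + log(1 / delta)) / eps)

# E.g., |H| = 2^20 hypotheses, eps = delta = 0.05:
print(pac_sample_size(h_size=2 ** 20, eps=0.05, delta=0.05))  # 338
```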
31. Proposal: PAC Causal Inference
• Randomized Turing Machines M1 and M2 generate strings over an alphabet A
• M2 sees string generated by M1 before generating its own
• Training protocol P governs what else learner L may be told and
what part of the full training set it sees
• learner L then sees a test set T drawn entirely from either M1 or M2
(chosen uniformly), and must accurately identify the source of T
• Family F of pairs <M1, M2> is PPACC inferable under protocol P by
learner L if for all 0 < ε < 0.5, given time, training data, and test data
polynomial in 1/ε and any other relevant parameters of P, learner L
will, with probability at least (1-ε), correctly identify the source of the
test set T
• Intuition: M1 and M2 might differ only by one causal association
• Since M2 is a TM, it can learn; positive (negative) result for PPACC
inference is negative (positive) result for adversarial training (GAN)
32. Concrete Example
• Let M1 and M2 be causal models (Bayesian networks where edges
capture direct causation) over same set of variables
• M1 and M2 identical except that M2 drops one edge, from A to B
(definition allows hardest conditional probability settings)
• Different protocols P let us inherit different existing results
• Easy case: A binary and uniformly random, and the influence of A on B in M1 is big
• If P provides to L the shared structure of M1 and M2 and leaves
open only the question of an edge from A to B, then for some
shared structures we can employ do-calculus using the test data
• If P properly constrains the shared structure of M1 and M2 including
the size of the largest separator-set, we can use the PC algorithm
on the test data only
[Diagrams: two causal graphs over A, B, C, identical except that one includes the edge A → B and the other does not.]
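A hedged simulation of the easy binary case above: A is a uniform coin; in M1, B copies A with probability 0.9, while in M2, B is an independent fair coin. The learner checks the empirical agreement of A and B in the test set. The 0.9 effect size and 0.7 threshold are illustrative choices, not the talk's:

```python
import random

# Draw n (A, B) pairs from M1 (edge A -> B) or M2 (no edge).
def sample(model, n, rng):
    data = []
    for _ in range(n):
        a = rng.random() < 0.5
        if model == "M1":
            b = a if rng.random() < 0.9 else not a   # B strongly follows A
        else:
            b = rng.random() < 0.5                   # B independent of A
        data.append((a, b))
    return data

def identify(data):
    agree = sum(1 for a, b in data if a == b) / len(data)
    # Under M1, Pr(A == B) = 0.9; under M2 it is 0.5: threshold between them.
    return "M1" if agree > 0.7 else "M2"

rng = random.Random(1)
print(identify(sample("M1", 500, rng)), identify(sample("M2", 500, rng)))
```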
33. Summary of Other Results So Far
• Positive result for self-controlled case series (SCCS) if EHR data
generated by piecewise-constant conditional intensity model
(PCIM) with different baseline risk of event (intensity) for each
patient, the only influences are of drugs on events, drugs and drug
effects are all independent of one another, and effects big enough
• Open: more complex dependencies, time-varying baseline,
baseline regularization
34. Summary of Other Results So Far
• Positive result for causal models where causal variable G is known
and has no parents (e.g., genetic variation such as the FXS
premutation), and we want to know if there is any causal effect
representable by a PAC-learnable model class on observed features
• Practical utility: Movaghar et al., 2019 (Science Advances), showing
the FXS premutation has clinical effects
• Requires an "Agnostic" variant, like agnostic PAC learning
• Can be extended to allow causal variable D with parents if in addition
we can partition remaining variables into ancestors and non-
ancestors of D (e.g., by date of measurement in EHR data), Pr(D) is
linear (e.g., linear propensity model for D), and protocol P also lets us
sample from IPW distribution based on learned approximation to this
model
35. Another Application
• My health insurance just switched me from a Brand to a Generic drug
• Is this Generic safe and effective?
• In other words, is there a combination of clinical variables that occurs
more in patients after taking Generic than Brand?
• Generic vs. Brand should eliminate confounding by indication
• Using only switchers should also control for patient-specific
confounders, even if unmeasured or completely unconsidered
• The same patients are on brand, then generic, for the same indication
• The problem: in practice, time-varying confounding is rampant
• Example: generic drug becomes available in 2005; electronic
transmission of prescriptions starts in 2005: looks like generic drug
causes electronic transmission of prescription!
36. Two Possible Approaches
• Suppose everyone on the drug was on Brand for all 2004, then
switched to Generic for all 2005
• Run Supervised ML
• Patient is a Positive Example while on Generic (2005)
• Same Patient is a Negative Example while on Brand (2004)
• Take controls not on drug at all, and use to offset time-varying effects
• Patient is a Positive Example during 2004
• Patient is a Negative Example during 2005
• Alternative approach: Use controls to learn to predict 2005 vs. 2004
(temporal propensity), use IPW to weight brand and generic patients
• PPACC Inference model says the first approach needs four times as
much data and will fail at lower event rates than the second approach
• Synthetic data further supports this message
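The second (temporal propensity) approach can be sketched as follows: estimate Pr(period = 2005 | covariates) from controls not on the drug, then weight each brand-period and generic-period record by its inverse propensity so that time-varying effects cancel. The propensity model here is a trivial lookup by a single covariate and all numbers are made up; a real model would be learned from the controls:

```python
# Inverse-probability-weighted event rate: each record is weighted by
# 1 / Pr(its period | covariates), with the propensity estimated on controls.
def ipw_rate(records, propensity):
    num = den = 0.0
    for x, period, event in records:
        p = propensity[x] if period == 2005 else 1 - propensity[x]
        w = 1.0 / p
        num += w * event
        den += w
    return num / den

# Hypothetical covariate: whether the prescription arrived electronically.
# Controls show e-scripts are far more common in 2005 than 2004.
propensity = {"e-script": 0.8, "paper": 0.4}   # Pr(period = 2005 | x)
generic_2005 = [("e-script", 2005, 1), ("e-script", 2005, 0), ("paper", 2005, 0)]
brand_2004 = [("paper", 2004, 0), ("paper", 2004, 1), ("e-script", 2004, 0)]
print(round(ipw_rate(generic_2005, propensity), 3),
      round(ipw_rate(brand_2004, propensity), 3))  # 0.25 0.2
```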
38. Conclusion
• A wide variety of clinical events can be accurately
predicted by ML from EHRs, but…
• Models tempt us to incorrect causal inferences
• Adding latent patient-specific and time-varying
baselines to ML greatly improves this
• We don't demand perfection in prediction, and
shouldn't demand it in causal inference, so…
• We need a PAC-like theory of causal inference even
more than of prediction (where ground truth exists)