This talk begins by showing how accurately 4000 different diagnoses can be predicted in advance for any patient, from one month to twenty years before first occurrence in the patient, using high-throughput machine learning. Shortcomings of this approach motivate ways to turn prediction methods into algorithms for finding causal associations; the resulting algorithms attain high accuracy in tasks of drug repurposing and discovery of adverse drug events, but they do not come with provable guarantees of making correct causal inferences. We then introduce variants of probably approximately correct (PAC) learning for finding causal associations, which can provide weaker but useful guarantees for algorithms such as these, motivated by our experiences with EHR data.
2019 Triangle Machine Learning Day - Machine Learning from De-Identified Coded Electronic Health Records (EHRs) - David Page, September 20, 2019
1. New Machine Learning Algorithms for
Electronic Health Records (EHRs)
David Page
Chair, Department of Biostatistics & Bioinformatics
Duke University
2. Acknowledgements
NIH BD2K, NLM, NIGMS, FDA
Office of Generic Drugs
Center for Predictive
Computational Phenotyping
Ross Kleiman
Charles Kuang
Xiayuan Huang
Wei Zhang
Collin Engstrom
Scott Hebbring
Paul Bennett
Yujia Bao
Jon Badger
Robert Elston
Guillerme Rosa
Aubrey Barnard
Michael Caldwell
Jesse Davis
Eric Lantz
Jie Liu
Houssam Nassif
Peggy Peissig
Vitor Santos Costa
Humberto Vidaillet
Jeremy Weiss
Becca Willett
Wisconsin Genomics Initiative
3. The Electronic Health Record (EHR)
Demographics
ID | Year of Birth | Gender
P1 | 3.22.1963     | M

Diagnoses
ID | Date     | Diagnosis | Sign/Symptom
P1 | 6.2.1990 | PVC       | Palpitations
4. The Electronic Health Record (EHR)
Demographics
ID | Year of Birth | Gender
P1 | 3.22.1963     | M

Diagnoses
ID | Date       | Diagnosis           | Sign/Symptom
P1 | 7.3.1997   | Elevated BP         |
P1 | 2011.06.02 | Atrial fibrillation | Dizzy, discomfort
5. The Electronic Health Record (EHR)
Demographics
ID | Year of Birth | Gender
P1 | 3.22.1963     | M

Diagnoses
ID | Date       | Diagnosis           | Sign/Symptom
P1 | 9.1.1998   | Atrial fibrillation | Shortness of breath, panic
P1 | 2011.06.02 | Atrial fibrillation | Dizzy, discomfort
6. EHR Data in a Data Warehouse
Patient ID | Gender | Birthdate
P1         | M      | 3/22/1963

Patient ID | Date     | Physician | Symptoms     | Diagnosis
P1         | 1/1/2001 | Smith     | palpitations | hypoglycemic
P1         | 2/1/2001 | Jones     | fever, aches | influenza

Patient ID | Date     | Lab Test      | Result
P1         | 1/1/2001 | blood glucose | 42
P1         | 1/9/2001 | blood glucose | 45

Patient ID | Date     | Observation | Result
P1         | 1/1/2001 | Height      | 5'11"
P2         | 1/9/2001 | BMI         | 34.5

Patient ID | Date Prescribed | Date Filled | Physician | Medication | Dose | Duration
P1         | 5/17/1998       | 5/18/1998   | Jones     | Prilosec   | 10mg | 3 months
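As a minimal illustration of the warehouse layout above, the sketch below (table contents copied from the slide; record and field names are my own, hypothetical choices) stores each table as a list of records and assembles a per-patient view by joining on patient ID:

```python
# Hypothetical sketch: EHR warehouse tables as lists of dicts,
# joined into a single per-patient view keyed on patient ID.
demographics = [{"pid": "P1", "gender": "M", "birthdate": "3/22/1963"}]
diagnoses = [
    {"pid": "P1", "date": "1/1/2001", "symptoms": "palpitations", "dx": "hypoglycemic"},
    {"pid": "P1", "date": "2/1/2001", "symptoms": "fever, aches", "dx": "influenza"},
]
labs = [
    {"pid": "P1", "date": "1/1/2001", "test": "blood glucose", "result": 42},
    {"pid": "P1", "date": "1/9/2001", "test": "blood glucose", "result": 45},
]

def patient_view(pid):
    """Collect one patient's rows from every table in the warehouse."""
    return {
        "demographics": [r for r in demographics if r["pid"] == pid],
        "diagnoses": [r for r in diagnoses if r["pid"] == pid],
        "labs": [r for r in labs if r["pid"] == pid],
    }

view = patient_view("P1")
print(len(view["diagnoses"]), len(view["labs"]))  # 2 2
```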
8. Sample Input Data Set for Typical Machine Learning
Patient | Gender | Age | Hypertension within last year | ... | Average LDL last 5 years | Statin | MI in next 5 years
P1      | F      | 32  | No                            | ... | 120                      | No     | No
P2      | F      | 45  | Yes                           | ... | 154                      | No     | No
P3      | M      | 24  | No                            | ... | 136                      | No     | No
P4      | M      | 58  | Yes                           | ... | 210                      | No     | Yes
...     | ...    | ... | ...                           | ... | ...                      | ...    | ...
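Rows like those above are built by summarizing each patient's longitudinal records up to a reference date. Here is a hedged sketch of that featurization step (the 1-year diagnosis window, 5-year lab window, and all field names are assumptions for illustration, not the talk's actual pipeline):

```python
from datetime import date

# Hypothetical sketch: turn longitudinal records into one flat
# feature row per patient, as in the table above.
def feature_row(pid, gender, birth, dxs, labs, ref):
    age = ref.year - birth.year  # crude age: year difference only
    # Diagnosis flag: any hypertension code within the last 365 days.
    htn_last_year = any(d["dx"] == "hypertension" and (ref - d["date"]).days <= 365
                        for d in dxs if d["pid"] == pid)
    # Lab summary: average LDL over the last 5 years, None if no values.
    ldl = [l["result"] for l in labs
           if l["pid"] == pid and l["test"] == "LDL" and (ref - l["date"]).days <= 5 * 365]
    avg_ldl = sum(ldl) / len(ldl) if ldl else None
    return {"patient": pid, "gender": gender, "age": age,
            "htn_last_year": htn_last_year, "avg_ldl_5y": avg_ldl}

dxs = [{"pid": "P2", "dx": "hypertension", "date": date(2001, 3, 1)}]
labs = [{"pid": "P2", "test": "LDL", "result": 150, "date": date(1999, 1, 1)},
        {"pid": "P2", "test": "LDL", "result": 158, "date": date(2000, 1, 1)}]
row = feature_row("P2", "F", date(1956, 5, 2), dxs, labs, date(2001, 6, 1))
print(row["htn_last_year"], row["avg_ldl_5y"])  # True 154.0
```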
10. Marshfield Clinic EMR
• Marshfield Clinic
  − Health system in North Central Wisconsin
• 1.5M patient records spanning 40 years
  − Demographics
  − Diagnoses (ICD-9)
  − Labs
  − Procedures
  − Vitals
11. ROC Curves for Afib-related Predictions
(Davis et al., 2009; Lantz et al., 2012)
[Figure 1: ROC curves for the four target prediction tasks: predicting AF/F onset and, given AF/F onset, predicting each of stroke, 1-year mortality, and 3-year mortality, all using roughly 25,000 extracted features.]
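Such curves are scored by the area under the ROC curve (AUC). A minimal sketch of that computation, using the standard rank interpretation (AUC is the probability a random positive case outscores a random negative case); the scores and labels here are made up for illustration:

```python
# Compute AUC from model scores via the pairwise-ranking definition:
# fraction of (positive, negative) pairs the model orders correctly,
# counting ties as half credit.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
print(round(auc(scores, labels), 4))  # 0.8333 (5 of 6 pairs ordered correctly)
```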
12. Motivation and Next Steps
• If we can predict atrial fibrillation accurately, in
advance, then we can intervene to prevent it in some
cases
−aspirin or stronger blood thinner
−more rest, less stress
−beta blocker to control BP, etc.
• Some other prediction tasks we’ve addressed:
− Myocardial infarction (heart attack)
− Warfarin dosing
− Age-related macular degeneration
− Breast cancer (from mammograms and genetics)
• How accurately can everything be predicted?
• How far in advance?
22. MSCCS for Numeric Response (Kuang et al., KDD 2016)
[Figure: example EHR timelines for two patients, showing drug exposures (Drug 1, Drug 2) and conditions (Diabetes, Bleeding) plotted against age.]
Properties: longitudinal, observational.
Applications: Adverse Drug Reaction (ADR) discovery; Computational Drug Repositioning (CDR).
23. Time-Varying Baseline (Kuang et al., IJCAI 2016)
[Figure: one patient's timeline, with diabetes onset and two insulin exposures at ages 30, 45, and 60.]
Time-Varying Baseline: add regularization to minimize change in proximal, consecutive t_ij values:

  y_ij | x_ij = t_ij + β⊤x_ij + ε_ij,  ε_ij ~ N(0, σ²)

The regularization term penalizes consecutive baseline values that are close together in time but far apart in numerical value.
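A hedged sketch of the resulting penalized objective: squared-error fit for the model above, plus a smoothness penalty on consecutive baseline values t_ij, scaled by the time gap so that baselines close in time but far apart in value are penalized most. The exact penalty form, the weight `lam`, and all numbers here are illustrative assumptions, not the paper's formulation:

```python
# Hypothetical penalized objective for a time-varying baseline:
#   sum_j (y_ij - t_ij - beta * x_ij)^2
#   + lam * sum_j (t_i,j+1 - t_ij)^2 / (time_j+1 - time_j)
def loss(y, x, times, t, beta, lam):
    fit = sum((yi - ti - beta * xi) ** 2 for yi, xi, ti in zip(y, x, t))
    penalty = lam * sum((t[j + 1] - t[j]) ** 2 / (times[j + 1] - times[j])
                        for j in range(len(t) - 1))
    return fit + penalty

# One patient: response y, binary drug exposure x, measurement ages.
y, x, times = [150, 120, 160], [1, 1, 0], [30, 45, 60]
val = loss(y, x, times, t=[140, 140, 150], beta=-15, lam=1.0)
print(round(val, 2))  # 756.67
```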
25. Time-Varying Baseline for Binary Response (Kuang et al.,
KDD 2017)
• Response variable: heart attack occurrence on a
particular day (binary)
• Features: drug exposure statuses on a particular day
(binary) + baseline occurrence rates
28. Thus Far
• Charles Kuang: time-varying patient-specific baseline into
linear regression (KDD 2016), time-varying baseline into
Poisson regression (KDD 2017)
• Yujia Bao (with Becca Willett): patient-specific baseline into
point process models (MLHC 2017)
• Wei Zhang: patient-specific baseline into deep neural nets
• Sinong Geng and Charles Kuang: patient-specific baseline
into graphical models (temporal Poisson square root
models), ICML 2018
• Can we bring time-varying baselines into deep neural
networks or random forests? May need other types of
regularization, EM rather than convex optimization.
29. Major Challenge: Evaluation
• Real-world data: limited causal ground truth
– Just 9 true drug-event pair ADEs in OMOP
– Are we now overfitting OMOP ground truth?
– Some criticism of each ground truth available
• Synthetic data: self-fulfilling prophecy?
• Theory: very strong assumptions, proofs usually
about guaranteed correct causal inferences
30. Probably Approximately Correct (PAC)
learning [Valiant, CACM 1984]
• Consider a class C of possible target concepts defined over a set of
  instances X of length n, and a learner L using hypothesis space H
• C is PAC learnable by L using H if, for all
  − c ∈ C,
  − probability distributions D over X,
  − ε such that 0 < ε < 0.5, and
  − δ such that 0 < δ < 0.5,
• learner L will, with probability at least (1 − δ), output a hypothesis
  h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in
  1/ε, 1/δ, n, and size(c)
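To make the definition concrete, here is the standard sample-complexity bound for a finite hypothesis space: a consistent learner needs m ≥ (1/ε)(ln|H| + ln(1/δ)) examples to be probably (with probability 1 − δ) approximately (error at most ε) correct. This is a textbook corollary of the PAC definition, not a result from the talk:

```python
from math import ceil, log

# Sample size sufficient for a consistent learner over a finite H:
#   m >= (1/eps) * (ln|H| + ln(1/delta))
def pac_sample_size(h_size, eps, delta):
    return ceil((log(h_size) + log(1 / delta)) / eps)

# E.g., |H| = 2^20 hypotheses, eps = delta = 0.05:
print(pac_sample_size(h_size=2 ** 20, eps=0.05, delta=0.05))  # 338
```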
31. Proposal: PAC Causal Inference
• Randomized Turing Machines M1 and M2 generate strings over an alphabet A
• M2 sees string generated by M1 before generating its own
• Training protocol P governs what else learner L may be told and
what part of the full training set it sees
• learner L then sees a test set T drawn entirely from either M1 or M2
(chosen uniformly), and must accurately identify the source of T
• Family F of pairs <M1, M2> is PPACC inferable under protocol P by
learner L if for all 0 < ε < 0.5, given time, training data, and test data
polynomial in 1/ε and any other relevant parameters of P, learner L
will, with probability at least (1-ε), correctly identify the source of the
test set T
• Intuition: M1 and M2 might differ only by one causal association
• Since M2 is a TM, it can learn; positive (negative) result for PPACC
inference is negative (positive) result for adversarial training (GAN)
32. Concrete Example
• Let M1 and M2 be causal models (Bayesian networks where edges
capture direct causation) over same set of variables
• M1 and M2 identical except that M2 drops one edge, from A to B
(definition allows hardest conditional probability settings)
• Different protocols P let us inherit different existing results
• Easy case: A binary and uniformly random, and the influence of A on B in M1 is big
• If P provides to L the shared structure of M1 and M2 and leaves
open only the question of an edge from A to B, then for some
shared structures we can employ do-calculus using the test data
• If P properly constrains the shared structure of M1 and M2 including
the size of the largest separator-set, we can use the PC algorithm
on the test data only
[Diagrams: two causal graphs over A, B, C, identical except that one includes the edge A → B and the other does not.]
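A hedged simulation of the easy binary case above: A is a uniform coin; in M1, B copies A with probability 0.9, while in M2, B is an independent fair coin. The learner checks the empirical agreement of A and B in the test set. The 0.9 effect size and 0.7 threshold are illustrative choices, not the talk's:

```python
import random

# Draw n (A, B) pairs from M1 (edge A -> B) or M2 (no edge).
def sample(model, n, rng):
    data = []
    for _ in range(n):
        a = rng.random() < 0.5
        if model == "M1":
            b = a if rng.random() < 0.9 else not a   # B strongly follows A
        else:
            b = rng.random() < 0.5                   # B independent of A
        data.append((a, b))
    return data

def identify(data):
    agree = sum(1 for a, b in data if a == b) / len(data)
    # Under M1, Pr(A == B) = 0.9; under M2 it is 0.5: threshold between them.
    return "M1" if agree > 0.7 else "M2"

rng = random.Random(1)
print(identify(sample("M1", 500, rng)), identify(sample("M2", 500, rng)))
```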
33. Summary of Other Results So Far
• Positive result for self-controlled case series (SCCS) if EHR data
generated by piecewise-constant conditional intensity model
(PCIM) with different baseline risk of event (intensity) for each
patient, the only influences are of drugs on events, drugs and drug
effects are all independent of one another, and effects big enough
• Open: more complex dependencies, time-varying baseline,
baseline regularization
34. Summary of Other Results So Far
• Positive result for causal models where causal variable G is known
and has no parents (e.g., genetic variation such as the FXS
premutation), and we want to know if there is any causal effect
representable by a PAC-learnable model class on observed features
• Practical utility: Movaghar et al., 2019 (Science Advances), showing
the FXS premutation has clinical effects
• Requires an "Agnostic" variant, like agnostic PAC learning
• Can be extended to allow causal variable D with parents if in addition
we can partition remaining variables into ancestors and non-
ancestors of D (e.g., by date of measurement in EHR data), Pr(D) is
linear (e.g., linear propensity model for D), and protocol P also lets us
sample from IPW distribution based on learned approximation to this
model
35. Another Application
• My health insurance just switched me from a Brand to a Generic drug
• Is this Generic safe and effective?
• In other words, is there a combination of clinical variables that occurs
more in patients after taking Generic than Brand?
• Generic vs. Brand should eliminate confounding by indication
• Using only switchers should also control for patient-specific
confounders, even if unmeasured or completely unconsidered
• The same patients are on brand, then generic, for the same indication
• The problem: in practice, time-varying confounding is rampant
• Example: generic drug becomes available in 2005; electronic
transmission of prescriptions starts in 2005: looks like generic drug
causes electronic transmission of prescription!
36. Two Possible Approaches
• Suppose everyone on the drug was on Brand for all 2004, then
switched to Generic for all 2005
• Run Supervised ML
• Patient is a Positive Example while on Generic (2005)
• Same Patient is a Negative Example while on Brand (2004)
• Take controls not on drug at all, and use to offset time-varying effects
• Patient is a Positive Example during 2004
• Patient is a Negative Example during 2005
• Alternative approach: Use controls to learn to predict 2005 vs. 2004
(temporal propensity), use IPW to weight brand and generic patients
• PPACC Inference model says the first approach needs four times as
much data and will fail at lower event rates than the second approach
• Synthetic data further supports this message
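The second (temporal propensity) approach can be sketched as follows: estimate Pr(period = 2005 | covariates) from controls not on the drug, then weight each brand-period and generic-period record by its inverse propensity so that time-varying effects cancel. The propensity model here is a trivial lookup by a single covariate and all numbers are made up; a real model would be learned from the controls:

```python
# Inverse-probability-weighted event rate: each record is weighted by
# 1 / Pr(its period | covariates), with the propensity estimated on controls.
def ipw_rate(records, propensity):
    num = den = 0.0
    for x, period, event in records:
        p = propensity[x] if period == 2005 else 1 - propensity[x]
        w = 1.0 / p
        num += w * event
        den += w
    return num / den

# Hypothetical covariate: whether the prescription arrived electronically.
# Controls show e-scripts are far more common in 2005 than 2004.
propensity = {"e-script": 0.8, "paper": 0.4}   # Pr(period = 2005 | x)
generic_2005 = [("e-script", 2005, 1), ("e-script", 2005, 0), ("paper", 2005, 0)]
brand_2004 = [("paper", 2004, 0), ("paper", 2004, 1), ("e-script", 2004, 0)]
print(round(ipw_rate(generic_2005, propensity), 3),
      round(ipw_rate(brand_2004, propensity), 3))  # 0.25 0.2
```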
38. Conclusion
• A wide variety of clinical events can be accurately
predicted by ML from EHRs, but…
• Models tempt us to incorrect causal inferences
• Adding latent patient-specific and time-varying
baselines to ML greatly improves this
• We don't demand perfection in prediction, and
shouldn't demand it in causal inference, so…
• We need a PAC-like theory of causal inference even
more than of prediction (where ground truth exists)