Elsevier Health Analytics is developing the Medical Knowledge Graph, which represents correlations between diseases and between diseases and treatments. On a dataset of six million anonymized patients, observable over six years, we built more than 2,000 models that predict the development of diseases. Each model is adjusted for more than 3,000 covariates. For this, a boosting algorithm with variable selection was used. The betas of the selected variables were extracted, tested for causality and significance, and from these the first version of the Medical Graph, with more than 2,000 disease nodes and 25,000 effect edges, was built. The graph is currently being tested in practice, with the goal of giving physicians patient-individual decision support for treatment.
Elsevier Medical Graph – with Machine Learning towards Precision Medicine
1.
Elsevier Health Analytics
Medical Graph v1
Empowering Knowledge™
Towards
• A map of medicine
• Personalized decision support in a clinical setting
Paul Hellwig
Director Research & Development
p.hellwig@elsevier.com
https://www.linkedin.com/in/paulhellwig
Nov, 2016
2.
Elsevier
• Publisher & world-leading provider of information solutions
• 6,700 people worldwide, € 2.8 billion revenues1
• >2,200 journals, >25,000 book titles
• ScienceDirect, Scopus, ClinicalKey and Nursing Consult
• Health Analytics Team in Berlin

LexisNexis
• Helps predict and manage risk for industry and government
• 7,200 people, € 2.2 billion revenues1
• 35 years' experience in managing big data, currently >5 petabytes
• Have developed the HPCC2 supercomputer platform
1: 2015 2: High Performance Computing Cluster
Elsevier Health Analytics combines RELX Group's medical and big data analytics expertise.
4.
Trends driving changes in physician–patient interaction…

1. medical data explosion
• 4500 tests for gene disorders available (2013: 3200, +20% CAGR)
• $1245 cost to sequence a full genome (10/2014: $5730)
• 105 mm ECG biosensor: high ECG quality, heart rate, respiration, body temperature, activity, body position; watertight, induction-charged, Bluetooth, continuous data feed

2. patient empowerment
• patientslikeme has 400,000+ members donating data: 31 million data points covering 2,500+ conditions

3. information explosion
• 25 million biomed articles referenced on PubMed
• 1.2 million new biomed articles p.a.
5.
…and the real challenge
< 10 minutes1

1 Europe; US up to 20 mins: Ray KN, Chari AV, Engberg J, Bertolet M, Mehrotra A. Disparities in Time Spent Seeking Medical Care in the United States. JAMA Intern Med. 2015;175(12):1983-1986. doi:10.1001/jamainternmed.2015.4468.
6.
Medical Graph – Research Goal A:
Risk predictions: which diseases will you likely get within 4 years?
From Electronic Health Record…
…to Top Risks
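At serving time, Research Goal A reduces to scoring a patient's record with each fitted disease model and sorting by predicted risk. A minimal sketch (the probabilities below are illustrative stand-ins, not outputs of the real models):

```python
# Illustrative predicted 4-year risks for one patient, keyed by ICD code,
# as they would come out of the per-disease models:
risks = {"I50": 0.31, "E11": 0.12, "I10": 0.44, "C71": 0.01}

def top_risks(risks, k=3):
    """Rank diseases by predicted probability -- the 'Top Risks' view."""
    return sorted(risks.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(top_risks(risks))  # [('I10', 0.44), ('I50', 0.31), ('E11', 0.12)]
```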
7.
Medical Graph – Research Goal B:
Map: How are diseases, medications and other data connected?

Example subgraph (…for 1600 target diseases):
• I65 – Occlusion and stenosis of precerebral arteries
• G40 – Epilepsy
• I61 – Intracerebral haemorrhage
• C71 – Malignant neoplasm of the brain
• further covariates
The nodes are connected by has_successor1 edges carrying effect sizes, e.g. an odds ratio of 1.12.

1 Criteria based on: Jensen et al.: Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nature Communications, 2014 Jun 24;5:4022. doi: 10.1038/ncomms5022.
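A subgraph like this can be sketched as a plain adjacency mapping from disease to (successor, odds ratio) pairs. This is an illustration, not the actual graph store; which edge carries the odds ratio of 1.12 is assumed here, and the other edge values are invented:

```python
# Medical-graph-style adjacency: ICD code -> list of (successor, odds_ratio).
# Only the 1.12 value appears on the slide; its edge and all other
# numbers are illustrative.
graph = {
    "G40": [("I61", 1.12)],   # epilepsy -> intracerebral haemorrhage
    "I65": [("I61", 1.30)],
    "I61": [("C71", 1.05)],
}

def successors(icd, min_or=1.0):
    """has_successor edges leaving one disease, filtered by effect size."""
    return [(t, oddsr) for t, oddsr in graph.get(icd, []) if oddsr >= min_or]

print(successors("G40"))  # [('I61', 1.12)]
```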
9.
Example: Model to predict "I50 – Heart Failure"

Timeline (figure): the "PAST" window, 2009–2010, provides the covariates, with the patient free of I50; the "FUTURE" window, 2011–2014, provides the target: I50 absent is coded as 0, I50 present as 1.

Covariates
• Age
• Gender
• Other diseases
• Medications
• Other

Analysis Design
Predict 4-year long-term effects, balanced for all covariates
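The coding scheme above can be sketched for a single patient record. A minimal sketch; the field names `diagnoses_past` and `diagnoses_future` are illustrative, not from the actual pipeline:

```python
def code_target(patient, target_icd="I50"):
    """Binary target for the 4-year prediction window: patients already
    diagnosed with the target in the PAST window are excluded; otherwise
    the target is 1 if the diagnosis appears in the FUTURE window, else 0."""
    if target_icd in patient["diagnoses_past"]:     # 2009-2010
        return None  # exclude: disease already present in the past
    return 1 if target_icd in patient["diagnoses_future"] else 0  # 2011-2014

# Illustrative records:
patients = [
    {"diagnoses_past": {"E11"}, "diagnoses_future": {"I50", "I10"}},  # new heart failure
    {"diagnoses_past": {"I10"}, "diagnoses_future": {"E11"}},         # stays free of I50
    {"diagnoses_past": {"I50"}, "diagnoses_future": {"I50"}},         # prevalent case
]
targets = [code_target(p) for p in patients]
print(targets)  # [1, 0, None]
```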
10.
Billing data flow (60+ sickness funds; anonymized)
• Primary care: visits & diagnoses
• Secondary care: visits, diagnoses & procedures
• Medication: drug prescriptions
• Other data: further cooperations have just started; will enable analysis of vital and laboratory parameters

Feature extraction
Our observation / feature matrix: 3943 features for 3.8m patients
• 1623 targets, 2011–2014
• 2320 covariates, 2010
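A matrix of 3.8m patients × 3,943 mostly binary features is only workable in a sparse layout. A back-of-the-envelope sketch; the non-zeros-per-patient figure is an assumption for illustration:

```python
# Dense storage of the full observation matrix:
n_patients, n_features = 3_800_000, 3_943
dense_gb = n_patients * n_features * 8 / 1e9  # float64 bytes -> GB
print(f"dense float64: {dense_gb:.0f} GB")    # ~120 GB

# Assuming ~40 non-zero features per patient, a CSR-style layout needs
# one value + one int32 column index per non-zero plus a row pointer:
nnz = n_patients * 40
sparse_gb = (nnz * (8 + 4) + (n_patients + 1) * 4) / 1e9
print(f"sparse CSR: {sparse_gb:.1f} GB")      # ~1.8 GB
```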
11.
Predictive Modeling for ~1600 target diseases
Multiple attempts – no software is perfect

Attempt #1 (on server)
• Machine learning algorithm: component-wise gradient boosting (mboost); GLM for p-values
• Did it work for the full dataset? Worked for 100k patients. Failure reason: RAM (extensive dataset copying)
• Runtime: ~7 min / target model (on 100k patients)

Attempt #2 (on cluster)
• Machine learning algorithm: logistic regression with LASSO; GLM for p-values
• Did it work for the full dataset? Worked for 138 models. Failure reason: memory leak every 30–40 models
• Runtime: ~8 min / target model (on 3.8m patients)

Attempt #3 (on server)
• Machine learning algorithm: linear gradient boosting (sklearn + xgboost); F-test for p-values
• Did it work for the full dataset? Worked for 800k patients. Failure reason: int32 as index for sparse matrices
• Runtime: ~7 min / target model (on 800k patients)
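The int32 failure in attempt #3 is easy to reproduce on paper: a sparse-matrix index stored as a signed 32-bit integer overflows once the matrix holds more than 2^31 − 1 non-zeros. A sketch of the arithmetic (the non-zeros-per-patient figure is an assumption for illustration):

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit index can address

n_features = 3_943

# Worst case: every feature stored per row (fully dense rows).
max_dense_rows = INT32_MAX // n_features
print(max_dense_rows)  # 544631 -- only ~545k patients before overflow

# With sparser rows (assume ~600 non-zeros each) the limit moves,
# but the full 3.8m patients can still exceed it:
nnz_full = 3_800_000 * 600
print(nnz_full > INT32_MAX)  # True
```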
12.
Code for model building

# attempt 1: component-wise linear boosting (R, mboost)
boost_train_ds <- glmboost(as.formula(paste(icd_atc_use_names[i],"~.")),
    data=data[ins,][c(which_one,sample(which_zero,(length(which_one)),replace=F)),],
    family=Binomial(), control=boost_control(mstop=400,trace=T,center=F))
...

# attempt 2: GLM with ElasticNet (Python, H2O)
model1 = H2OGeneralizedLinearEstimator(model_id=post_col, family='binomial', solver='IRLSM',
    alpha=0.99,  # mainly LASSO
    lambda_search=True, standardize=True, intercept=True)
model1.train(x=index_cols, y=post_col, training_frame=training, validation_frame=val)
...

# attempt 3: linear gradient boosting (Python, xgboost)
params = {'silent': 0, 'nthread': 4,
    'eval_metric': ['error','map','map@'+str(top1percent_train),'map@'+str(top1percent_eval),'auc'],
    'objective': 'binary:logistic', 'booster': 'gblinear',
    'lambda': 0,   # no L2 regularization (Ridge)
    'alpha': 500}  # L1 regularization (LASSO)
booster = xgb.train(params, dtrain, num_boost_round=settings.boosting_iterations,
    evals=[(dtrain,'train'),(dtest,'eval')], early_stopping_rounds=10, evals_result=quality)
...
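As the abstract describes, the betas of the L1-selected variables become graph edges; for a logistic model, exponentiating a beta gives the odds ratio used as the edge weight. A minimal sketch with illustrative coefficient values (not taken from the fitted models):

```python
import math

# Illustrative fitted coefficients from a logistic model for one target
# disease; with L1 regularization, unselected covariates have beta == 0.
betas = {"G40": 0.113, "I65": 0.0, "age": 0.021}

edges = {cov: math.exp(b)      # beta -> odds ratio
         for cov, b in betas.items()
         if b != 0.0}          # keep only selected covariates
print(edges)  # exp(0.113) ~ 1.12, an edge weight of the kind shown on slide 7
```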
17.
Key learnings from working 5 years with medical data
• Physicians want explanations. Otherwise they will not trust the predictions. Typical best-in-class classification methods (deep learning, random forest) do not yet deliver explainable models. This won't do.
• Open source tools have failures (as have proprietary tools). Debugging can be a nightmare.
• In practice, you need to save the user's processing time, not add to it. Visualization is key.
• Building a classification model using open source tools is simple. Scaling input data size is also manageable. Building 1000+ models is complex.
• Implementing, applying and maintaining a security framework to keep personal health information secure is a substantial effort.
• Feature engineering is not dead. If you want explainable effects, you most probably need linear models, so you need to engineer non-linear effects, e.g. using clusters.
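The last point can be illustrated with the classic age example: a linear model cannot capture a non-linear risk curve over age unless age is first engineered into clusters, each getting its own beta. A minimal sketch; the bin edges are illustrative:

```python
def age_cluster_features(age, edges=(18, 40, 65, 80)):
    """One-hot encode age into clusters so a linear model can assign
    each cluster its own coefficient -- a non-linear effect of age."""
    n_bins = len(edges) + 1
    idx = sum(age >= e for e in edges)  # index of the cluster age falls into
    return [1 if i == idx else 0 for i in range(n_bins)]

print(age_cluster_features(25))  # [0, 1, 0, 0, 0]
print(age_cluster_features(72))  # [0, 0, 0, 1, 0]
```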