Elsevier Health Analytics is developing the Medical Knowledge Graph, which represents correlations between diseases and between diseases and treatments. On a dataset of six million anonymized patients, observable over six years, we built more than 2,000 models that predict the development of diseases. Each model is adjusted for more than 3,000 covariates. For this, a boosting algorithm with variable selection was used. The betas of the selected variables were extracted, tested for causality and significance, and from these the first version of the Medical Graph, with more than 2,000 disease nodes and 25,000 effect edges, was built. The graph is currently being tested in practice, with the goal of giving physicians patient-individual decision support for treatment.
Elsevier Medical Graph – with Machine Learning towards Precision Medicine
1.
Elsevier Health Analytics
Medical Graph v1
Empowering Knowledge™
Towards
• A map of medicine
• Personalized decision support in a clinical setting
Paul Hellwig
Director Research & Development
p.hellwig@elsevier.com
https://www.linkedin.com/in/paulhellwig
Nov, 2016
2.
Elsevier
• Publisher & world-leading provider of information solutions
• 6,700 people worldwide, € 2.8 billion revenues1
• >2,200 journals, >25,000 book titles
• ScienceDirect, Scopus, ClinicalKey and Nursing Consult
• Health Analytics Team in Berlin

LexisNexis
• Helps predict and manage risk for industry and government
• 7,200 people, € 2.2 billion revenues1
• 35 years' experience in managing big data, currently >5 petabytes
• Have developed the HPCC2 supercomputer platform
1: 2015 2: High Performance Computing Cluster
Elsevier Health Analytics combines RELX Group's medical and big data analytics expertise.
4.
Trends driving changes in physician–patient interaction…

1. medical data explosion
• 4500 tests for gene disorders available (2013: 3200, +20% CAGR)
• $1245 cost to sequence a full genome (10/2014: $5730)
• 105 mm ECG biosensor: high ECG quality, heart rate, respiration, body temperature, activity, body position; watertight, induction-charged, Bluetooth, continuous data feed

2. patient empowerment
• patientslikeme has 400,000+ members donating data: 31 million data points covering 2,500+ conditions

3. information explosion
• 25 million biomed articles referenced on PubMed
• 1.2 million new biomed articles p.a.
5.
…and the real challenge
< 10 minutes1

1 Europe; US up to 20 mins: Ray KN, Chari AV, Engberg J, Bertolet M, Mehrotra A. Disparities in Time Spent Seeking Medical Care in the United States. JAMA Intern Med. 2015;175(12):1983-1986. doi:10.1001/jamainternmed.2015.4468.
6.
Medical Graph – Research Goal A:
Risk predictions: which diseases will you likely get within 4 years?
From Electronic Health Record…
…to Top Risks
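At serving time, Research Goal A reduces to scoring a patient's record with each fitted disease model and sorting by predicted risk. A minimal sketch (the probabilities below are illustrative stand-ins, not outputs of the real models):

```python
# Illustrative predicted 4-year risks for one patient, keyed by ICD code,
# as they would come out of the per-disease models:
risks = {"I50": 0.31, "E11": 0.12, "I10": 0.44, "C71": 0.01}

def top_risks(risks, k=3):
    """Rank diseases by predicted probability -- the 'Top Risks' view."""
    return sorted(risks.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(top_risks(risks))  # [('I10', 0.44), ('I50', 0.31), ('E11', 0.12)]
```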
7.
Medical Graph – Research Goal B:
Map: How are diseases, medications and other data connected?

Example subgraph (…for 1600 target diseases):
• I65 – Occlusion and stenosis of precerebral arteries
• G40 – Epilepsy
• I61 – Intracerebral haemorrhage
• C71 – Malignant neoplasm of the brain
• further covariates
The nodes are connected by has_successor1 edges carrying effect sizes, e.g. an odds ratio of 1.12.

1 Criteria based on: Jensen et al.: Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nature Communications, 2014 Jun 24;5:4022. doi: 10.1038/ncomms5022.
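A subgraph like this can be sketched as a plain adjacency mapping from disease to (successor, odds ratio) pairs. This is an illustration, not the actual graph store; which edge carries the odds ratio of 1.12 is assumed here, and the other edge values are invented:

```python
# Medical-graph-style adjacency: ICD code -> list of (successor, odds_ratio).
# Only the 1.12 value appears on the slide; its edge and all other
# numbers are illustrative.
graph = {
    "G40": [("I61", 1.12)],   # epilepsy -> intracerebral haemorrhage
    "I65": [("I61", 1.30)],
    "I61": [("C71", 1.05)],
}

def successors(icd, min_or=1.0):
    """has_successor edges leaving one disease, filtered by effect size."""
    return [(t, oddsr) for t, oddsr in graph.get(icd, []) if oddsr >= min_or]

print(successors("G40"))  # [('I61', 1.12)]
```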
9.
Example: Model to predict "I50 – Heart Failure"

Timeline (figure): the "PAST" window, 2009–2010, provides the covariates, with the patient free of I50; the "FUTURE" window, 2011–2014, provides the target: I50 absent is coded as 0, I50 present as 1.

Covariates
• Age
• Gender
• Other diseases
• Medications
• Other

Analysis Design
Predict 4-year long-term effects, balanced for all covariates
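The coding scheme above can be sketched for a single patient record. A minimal sketch; the field names `diagnoses_past` and `diagnoses_future` are illustrative, not from the actual pipeline:

```python
def code_target(patient, target_icd="I50"):
    """Binary target for the 4-year prediction window: patients already
    diagnosed with the target in the PAST window are excluded; otherwise
    the target is 1 if the diagnosis appears in the FUTURE window, else 0."""
    if target_icd in patient["diagnoses_past"]:     # 2009-2010
        return None  # exclude: disease already present in the past
    return 1 if target_icd in patient["diagnoses_future"] else 0  # 2011-2014

# Illustrative records:
patients = [
    {"diagnoses_past": {"E11"}, "diagnoses_future": {"I50", "I10"}},  # new heart failure
    {"diagnoses_past": {"I10"}, "diagnoses_future": {"E11"}},         # stays free of I50
    {"diagnoses_past": {"I50"}, "diagnoses_future": {"I50"}},         # prevalent case
]
targets = [code_target(p) for p in patients]
print(targets)  # [1, 0, None]
```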
10.
Billing data flow (60+ sickness funds; anonymized)
• Primary care: visits & diagnoses
• Secondary care: visits, diagnoses & procedures
• Medication: drug prescriptions
• Other data: further cooperations have just started; will enable analysis of vital and laboratory parameters

Feature extraction
Our observation / feature matrix: 3943 features for 3.8m patients
• 1623 targets, 2011–2014
• 2320 covariates, 2010
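A matrix of 3.8m patients × 3,943 mostly binary features is only workable in a sparse layout. A back-of-the-envelope sketch; the non-zeros-per-patient figure is an assumption for illustration:

```python
# Dense storage of the full observation matrix:
n_patients, n_features = 3_800_000, 3_943
dense_gb = n_patients * n_features * 8 / 1e9  # float64 bytes -> GB
print(f"dense float64: {dense_gb:.0f} GB")    # ~120 GB

# Assuming ~40 non-zero features per patient, a CSR-style layout needs
# one value + one int32 column index per non-zero plus a row pointer:
nnz = n_patients * 40
sparse_gb = (nnz * (8 + 4) + (n_patients + 1) * 4) / 1e9
print(f"sparse CSR: {sparse_gb:.1f} GB")      # ~1.8 GB
```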
11.
Predictive Modeling for ~1600 target diseases
Multiple attempts – no software is perfect

Attempt #1 (on server)
• Machine learning algorithm: component-wise gradient boosting (mboost); GLM for p-values
• Did it work for the full dataset? Worked for 100k patients. Failure reason: RAM (extensive dataset copying)
• Runtime: ~7 min / target model (on 100k patients)

Attempt #2 (on cluster)
• Machine learning algorithm: logistic regression with LASSO; GLM for p-values
• Did it work for the full dataset? Worked for 138 models. Failure reason: memory leak every 30–40 models
• Runtime: ~8 min / target model (on 3.8m patients)

Attempt #3 (on server)
• Machine learning algorithm: linear gradient boosting (sklearn + xgboost); F-test for p-values
• Did it work for the full dataset? Worked for 800k patients. Failure reason: int32 as index for sparse matrices
• Runtime: ~7 min / target model (on 800k patients)
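The int32 failure in attempt #3 is easy to reproduce on paper: a sparse-matrix index stored as a signed 32-bit integer overflows once the matrix holds more than 2^31 − 1 non-zeros. A sketch of the arithmetic (the non-zeros-per-patient figure is an assumption for illustration):

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit index can address

n_features = 3_943

# Worst case: every feature stored per row (fully dense rows).
max_dense_rows = INT32_MAX // n_features
print(max_dense_rows)  # 544631 -- only ~545k patients before overflow

# With sparser rows (assume ~600 non-zeros each) the limit moves,
# but the full 3.8m patients can still exceed it:
nnz_full = 3_800_000 * 600
print(nnz_full > INT32_MAX)  # True
```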
12.
Code for model building

# attempt 1: component-wise linear boosting (R, mboost)
boost_train_ds <- glmboost(as.formula(paste(icd_atc_use_names[i],"~.")),
    data=data[ins,][c(which_one,sample(which_zero,(length(which_one)),replace=F)),],
    family=Binomial(), control=boost_control(mstop=400,trace=T,center=F))
...

# attempt 2: GLM with ElasticNet (Python, H2O)
model1 = H2OGeneralizedLinearEstimator(model_id=post_col, family='binomial', solver='IRLSM',
    alpha=0.99,  # mainly LASSO
    lambda_search=True, standardize=True, intercept=True)
model1.train(x=index_cols, y=post_col, training_frame=training, validation_frame=val)
...

# attempt 3: linear gradient boosting (Python, xgboost)
params = {'silent': 0, 'nthread': 4,
    'eval_metric': ['error','map','map@'+str(top1percent_train),'map@'+str(top1percent_eval),'auc'],
    'objective': 'binary:logistic', 'booster': 'gblinear',
    'lambda': 0,   # no L2 regularization (Ridge)
    'alpha': 500}  # L1 regularization (LASSO)
booster = xgb.train(params, dtrain, num_boost_round=settings.boosting_iterations,
    evals=[(dtrain,'train'),(dtest,'eval')], early_stopping_rounds=10, evals_result=quality)
...
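As the abstract describes, the betas of the L1-selected variables become graph edges; for a logistic model, exponentiating a beta gives the odds ratio used as the edge weight. A minimal sketch with illustrative coefficient values (not taken from the fitted models):

```python
import math

# Illustrative fitted coefficients from a logistic model for one target
# disease; with L1 regularization, unselected covariates have beta == 0.
betas = {"G40": 0.113, "I65": 0.0, "age": 0.021}

edges = {cov: math.exp(b)      # beta -> odds ratio
         for cov, b in betas.items()
         if b != 0.0}          # keep only selected covariates
print(edges)  # exp(0.113) ~ 1.12, an edge weight of the kind shown on slide 7
```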
17.
Key learnings from working 5 years with medical data
• Physicians want explanations. Otherwise they will not trust the predictions. Typical best-in-class classification methods (deep learning, random forest) do not yet deliver explainable models. This won't do.
• Open source tools have failures (as have proprietary tools). Debugging can be a nightmare.
• In practice, you need to save the user's processing time, not add to it. Visualization is key.
• Building a classification model using open source tools is simple. Scaling input data size is also manageable. Building 1000+ models is complex.
• Implementing, applying and maintaining a security framework to keep personal health information secure is a substantial effort.
• Feature engineering is not dead. If you want explainable effects, you most probably need linear models, so you need to engineer non-linear effects, e.g. using clusters.
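The last point can be illustrated with the classic age example: a linear model cannot capture a non-linear risk curve over age unless age is first engineered into clusters, each getting its own beta. A minimal sketch; the bin edges are illustrative:

```python
def age_cluster_features(age, edges=(18, 40, 65, 80)):
    """One-hot encode age into clusters so a linear model can assign
    each cluster its own coefficient -- a non-linear effect of age."""
    n_bins = len(edges) + 1
    idx = sum(age >= e for e in edges)  # index of the cluster age falls into
    return [1 if i == idx else 0 for i in range(n_bins)]

print(age_cluster_features(25))  # [0, 1, 0, 0, 0]
print(age_cluster_features(72))  # [0, 0, 0, 1, 0]
```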