AI for Precision Medicine (Pragmatic preclinical data science)
1. AI FOR PRECISION MEDICINE
PRAGMATIC PRECLINICAL DATA SCIENCE
Paul Agapow <p.agapow@imperial.ac.uk>
Data Science Institute, Imperial College London
Pharma AI & IoT (London, July 2018)
2. MLMH2018 - KDD Workshop on Machine
Learning for Medicine and Healthcare
August 20, 2018, London, UK
Topics of interest:
• Data Standards for Translational
Medicine Informatics
• Analysis of large scale electronic
health records or patient-
generated health data records
• Visualisation of complex and
dynamic biomedical networks
• Disease Subtype Discovery for
Precision Medicine
• Interpretable Machine Learning for
biomedicine and healthcare
• Deep learning for biomedicine
Important Dates
• Submission deadline:
May 25, 2018
• Notification of acceptance:
June 8, 2018
• Workshop date:
August 20, 2018
Meet our Panel!
T. Roy (Ph.D), University of
Southampton, UK
A. Teredesai (PhD), University of
Washington, Tacoma
S. Wagers (MD), CEO/Founder
BioSci Consulting, Belgium
Join us during the KDD Health Day!
Win an IBM $1,000 travel grant for the best
selected student paper!
Follow us!
https://mlmhworkshop.github.io/mlmh-2018
Twitter:
Contact us:
mlmhworkshop@googlegroups.com
Organizers:
M. Saqi, Imperial College London, UK
P. Chakraborty, IBM Research, USA
I. Balaur, EISBM, Lyon, France
P. Agapow, Imperial College London, UK
S. Wagers, BioSci Consulting, Belgium
P.Y. S. Hsueh, IBM Research, USA
F. Rahmanian, Geneia, USA
M.A. Ahmad, Kensci Inc. and University of
Washington - Tacoma, USA
3. BACKGROUND & DISCLOSURE
➤ Data Science Institute (Imperial
College London)
➤ Novel & advanced computation over
large rich biomedical datasets for
translational research & precision
medicine
➤ Patient subtype discovery &
mechanistic insight
➤ Scientific Advisor to PangaeaData.ai
7. BIG BIOMEDICAL DATA USUALLY ISN’T
➤ Average trial size on
ClinicalTrials.gov < 100
➤ Average #samples per GEO
dataset < 100
➤ Average GWAS cohort size
~9000 (median ~2500)
➤ 1,064 ICU admissions for flu in
UK 2016/2017 season
➤ Curse of dimensionality
➤ Deep learning requires
“thousands” of samples for
training (at least p², where p is
the number of features?)
➤ GWAS needs 3K+ for large
effects, 10K or more for small
effects …
➤ Sub-populations & rare diseases
will be smaller
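The “curse of dimensionality” above can be seen in a few lines: as the number of features grows, distances between random points concentrate, so nearest and farthest neighbours become nearly indistinguishable. A minimal illustrative sketch (the function name and sample sizes are arbitrary, not from the talk):

```python
import math
import random

def nn_distance_ratio(n_points: int, dim: int, seed: int = 0) -> float:
    """Ratio of the smallest to the largest distance-from-origin for
    random points in the unit hypercube. As dim grows the ratio
    approaches 1: distances 'concentrate' and neighbours stop being
    meaningfully closer than anything else."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in p)))
    return min(dists) / max(dists)

ratio_low = nn_distance_ratio(200, 2)      # low-dimensional: wide spread
ratio_high = nn_distance_ratio(200, 1000)  # high-dimensional: ratio near 1
```

With 1,000 features the ratio sits close to 1, which is one reason small-n, large-p biomedical datasets defeat naive distance-based methods.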
8. MAKE BIGGER DATASETS
➤ “Allow” reuse & combining, rather than “build” new datasets
➤ FAIR
➤ Use standards like CDISC, HPO …
➤ eTRIKS
➤ Data intensive translational research
➤ Sharing data (standards, starter kit)
➤ Data catalog of ~70 studies
➤ EHDEN
➤ European Health Data and Evidence
Network
➤ Harmonised model for accessing health data
9. WE NEED MORE ETL
➤ Too damn slow and expensive
➤ Tools are poor
➤ Humans are inconsistent
➤ Standards are complex
➤ Harmonisation by ML is the only
answer
➤ Learn from data examples
➤ Corrected by humans
➤ “Discover” schema if need be
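A minimal sketch of the “learn from data examples, corrected by humans” loop, using plain string similarity as a stand-in for the learned models; the standard field names and the `suggest_mapping` helper are illustrative, not any real CDISC tooling:

```python
import difflib

# Toy target schema (CDISC-like variable names, for illustration only)
STANDARD_FIELDS = ["SUBJID", "AGE", "SEX", "VISITNUM", "WEIGHT", "HEIGHT"]

def suggest_mapping(source_columns, corrections=None):
    """Suggest source-column -> standard-field mappings by string
    similarity, then apply human corrections on top -- a crude
    stand-in for the pattern-mining / word2vec / classifier models
    described above."""
    corrections = corrections or {}
    mapping = {}
    for col in source_columns:
        matches = difflib.get_close_matches(
            col.upper(), STANDARD_FIELDS, n=1, cutoff=0.4)
        mapping[col] = matches[0] if matches else None
    mapping.update(corrections)  # human-in-the-loop overrides win
    return mapping

mapping = suggest_mapping(
    ["subj_id", "age_years", "gender"],
    corrections={"gender": "SEX"},  # a correction fed back by a curator
)
```

The point is the loop shape, not the matcher: suggestions come cheap from the model, curators correct the residue, and corrections feed the next round.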
[Diagram: ML-based harmonisation pipeline for text and tabular data, from PangaeaData.AI. Pre-classified data and master-data mappings feed the data extractors.]
§ Frequent-pattern mining (FP-Growth) algorithms to determine schema association rules
§ Word2Vec to condense information of text sequence and context
§ Graph-theoretical algorithms to determine logical sequences, followers, associations, matchings
§ Decision trees, neural nets and support vector machines for training the model
§ Custom algorithms to prepare data and check data quality
10. EXAMPLE: U-BIOPRED
➤ Unbiased BIOmarkers in PREDiction of
respiratory disease outcomes
➤ 900+ patients, 16 clinical centres +
other studies combined via standards
➤ Outputs:
➤ Analyses largely on small subsets
(~100)
➤ Subtyping of asthmatics
➤ 40+ academic publications
13. THE REALITY OF DEEP LEARNING
➤ Deep learning is still a work in progress
➤ Usually insufficient (good, labelled)
data
➤ Interpretability issues
➤ Legal & ethical issues, federated
analysis
➤ Tells you what you’ve told it
➤ Bias towards images
➤ For now …
14. DEEP LEARNING WITH LESS DATA
➤ Pre-training (data without labels)
➤ Initial training with mediocre data
➤ Adapt
➤ Transfer learning (labels / output changes)
➤ Domain adaptation (data / input changes)
➤ Data augmentation
➤ Interpretability coming slowly (LIME)
Dieleman 2015
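Data augmentation on tabular data can be sketched as label-preserving jitter, a rough analogue of the image rotations and flips used in deep-learning pipelines; the function and parameters below are illustrative, not from the talk:

```python
import random

def augment(samples, labels, n_copies=5, noise_sd=0.05, seed=0):
    """Expand a small labelled dataset by appending jittered copies of
    each numeric feature vector. The noise is label-preserving: each
    copy keeps the label of the sample it was derived from."""
    rng = random.Random(seed)
    aug_x, aug_y = list(samples), list(labels)
    for x, y in zip(samples, labels):
        for _ in range(n_copies):
            aug_x.append([v + rng.gauss(0.0, noise_sd) for v in x])
            aug_y.append(y)
    return aug_x, aug_y

x = [[0.2, 1.1], [0.9, 0.3]]
y = ["case", "control"]
big_x, big_y = augment(x, y)
# 2 originals + 2 * 5 jittered copies = 12 samples
```

Whether jitter of this kind is biologically defensible depends on the data; the technique only buys anything when the perturbation stays within the class.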
15. “80% of the time, you can get 80% of the way
with a simple decision tree.”
- Doug McIlwraith (paraphrased)
16. EXAMPLE: TEXT CLASSIFICATION FOR SYSTEMATIC REVIEWS
➤ Aim: find similar or related
publications within corpus
➤ Actual aim: find which
method of text
classification is
“best” (validation)
➤ Data: 15 Drug Control
Reviews & Neuropathic
Pain dataset
➤ Classify with random forest,
naïve Bayes, SVM & CNNs
Conclusion
Dataset                    WSS   Classifier
ACE Inhibitors             0.26  SVM
ADHD                       0.35  MNB
Antihistamines             0.19  MNB
Atypical Antipsychotics    0.12  SVM
Beta Blockers              0.13  SVM
CCB                        0.21  SVM
Estrogen                   0.25  SVM
Neuropathic Pain           0.61  CNN
NSAIDs                     0.14  SVM
Opioids                    0.23  SVM
Oral Hypoglycemics         0.21  SVM
PPI                        0.17  SVM
Skeletal Muscle Relaxants  0.21  SVM
Statins                    0.19  SVM
Triptans                   0.22  SVM
Urinary Incontinence       0.25  SVM
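The WSS column above is presumably “work saved over sampling”; assuming the usual WSS@95% definition from the systematic-review screening literature, it is the fraction of screening work the classifier saves over random screening at the same recall:

```python
def wss(tn: int, fn: int, n: int, recall: float = 0.95) -> float:
    """Work Saved over Sampling at a given recall level:
    WSS@R = (TN + FN) / N - (1 - R)."""
    return (tn + fn) / n - (1.0 - recall)

# Illustrative numbers: 2,600 true negatives and 40 false negatives
# out of 3,000 screened abstracts, at 95% recall
saved = wss(2600, 40, 3000)  # (2640 / 3000) - 0.05 = 0.83
```

On this scale a WSS of 0.26 (ACE Inhibitors) means roughly a quarter of the manual screening was avoided, while 0.61 (Neuropathic Pain) is a substantial saving.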
18. OMICS IS ONLY ONE TYPE OF INFORMATION
➤ We don’t have enough data
➤ Methods may not work
➤ Results may be artefactual
➤ But there is other information …
➤ EHR, interactome, devices, RWE, social media, chemistry, evolution / phylogeny, etc.
19. MULTI-OMICS OR INTEGRATED ANALYSIS
➤ Why?
➤ One way to get more data
➤ Statistical power
➤ Multiple defects required to drive
endogenous disease
➤ Multiple “views” on condition
➤ How?
➤ Cluster / network individual data
layers
➤ Fuse together for consensus
Nemutlu 2012
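The “cluster individual layers, fuse for consensus” step can be sketched as evidence accumulation: build a co-association matrix counting how often each pair of patients lands in the same cluster across layers. This is a toy stand-in for methods like SNF or NNMF, with made-up data:

```python
from itertools import combinations

def coassociation(labelings):
    """Fuse several clusterings of the same samples into a consensus
    similarity: for each pair of samples, the fraction of clusterings
    that put the pair in the same cluster."""
    n = len(labelings[0])
    sim = {}
    for i, j in combinations(range(n), 2):
        agree = sum(1 for lab in labelings if lab[i] == lab[j])
        sim[(i, j)] = agree / len(labelings)
    return sim

# Three 'omics layers, each clustering the same 4 patients
layers = [[0, 0, 1, 1],
          [0, 0, 1, 2],
          [1, 1, 0, 0]]
sim = coassociation(layers)
# Patients 0 and 1 co-cluster in every layer; 2 and 3 in two of three
```

A final clustering over the co-association matrix then yields the consensus subtypes, with disagreements between layers softened into fractional similarity.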
20. EXAMPLE: ASTHMA ENDOTYPING
➤ Asthma is highly heterogeneous
➤ Symptoms
➤ Response to interventions
➤ Multiple mechanisms
➤ 3 or 4 or 7 clusters …
➤ Carefully curated data from U-
BIOPRED (~100)
➤ Multi-method, multi-data analysis
21. ASTHMA ENDOTYPES
➤ Use a variety of clustering approaches
over asthma cohort ‘omics data
(Bayesian, spectral, iCluster)
➤ Use multi-omics approaches (SNF,
NNMF)
➤ Assess agreement / coherence
➤ Validate in pathways, in other cohorts
and in other data types
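“Assess agreement” between clusterings can be done with a pair-counting score such as the (plain, unadjusted) Rand index; a self-contained sketch with made-up labelings, not the study's actual method:

```python
from itertools import combinations

def rand_index(a, b):
    """Plain Rand index: the fraction of sample pairs on which two
    clusterings agree -- pairs placed together in both, or apart in
    both. 1.0 means identical partitions (up to relabelling)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum(1 for i, j in pairs
                if (a[i] == a[j]) == (b[i] == b[j]))
    return agree / len(pairs)

# Two hypothetical clusterings of the same 5 patients
spectral = [0, 0, 1, 1, 2]
bayesian = [1, 1, 0, 0, 0]
score = rand_index(spectral, bayesian)  # agrees on 8 of 10 pairs: 0.8
```

In practice an adjusted score (e.g. adjusted Rand index) is preferable, since it corrects for chance agreement; the plain version is shown here to keep the sketch dependency-free.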
22. CONCLUSIONS
➤ Big biomedical data is often not big, but we can make it bigger
➤ Sometimes [Big | Deep | Advanced] approaches are useful, sometimes not: choose
wisely
➤ Contextual information is vital, both for primary analysis and for validation