Big biomedical data is often not large enough for advanced machine learning techniques. Data-sharing platforms are also problematic: each operates independently, with its own data formats and tools, and computing power alone cannot close the gap. The talk proposes solutions: assembling larger datasets through shared standards, developing interoperable "meta-platforms", and choosing machine learning approaches to suit the problem and the dataset. It gives examples of projects that have successfully combined datasets and used techniques such as Spark-based clustering to make better use of the available data.
1. Big Biomedical Data is a Lie
Taming large datasets for translational research
Paul Agapow
Data Science Institute
Imperial College London
<p.agapow@imperial.ac.uk>
2018/1/31
2. Disclosure / About me
• Data Science Institute
(Imperial College London)
• Big rich biomedical datasets
for translational research &
precision medicine
• Novel & advanced
computation for research
• No actual or potential
conflict of interest in relation
to this presentation
4. Biomedical big data is often not big enough
• Average trial size on
ClinicalTrials.gov < 100
• Average #samples per
GEO dataset < 100
• Average GWAS cohort
size ~9000 (median
~2500)
• 1,064 ICU admissions for
flu in UK 2016/2017
season
• Curse of dimensionality
• Deep learning requires
“thousands” of samples
for training (at least p²?)
• GWAS needs 3K+ for
large effects, 10K or
more for small effects …
• Sub-populations will be
smaller
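The “at least p²” rule of thumb above (a hedged heuristic, not a firm bound) makes the gap concrete; a quick back-of-the-envelope in Python:

```python
# Back-of-the-envelope check of the (hedged) heuristic that training
# needs on the order of p**2 samples for p features.
def samples_needed(p: int) -> int:
    return p ** 2

genes = 20_000                        # a typical transcriptome-wide panel
print(samples_needed(genes))          # 400,000,000 samples
print(samples_needed(genes) // 100)   # how many average GEO datasets (<100 samples) that is
```

Even if the true requirement is orders of magnitude smaller, it still dwarfs typical trial and GEO cohort sizes.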
5. Platforms are a problem not a panacea
• Biomedical data lakes / warehouses aren’t working
• Each is an island unto itself
• Tools can’t understand data formats
• High demands on user (meaning, context)
• Poor standardisation / harmonisation tools (curation effort == analysis effort)
• A world of distributed data
• A world of many computational idioms
• (Self) lock-in
6. Computers are not getting faster
• Data is embiggening
• Can’t rely on cheap
computation to get us out
of a hole
• Many HPC idioms, most
awkward (e.g. Map-
Reduce)
• DB schemas struggle at
scale
7. What if every gene affects every other gene?
• Pritchard’s omnigenics
(2017):
• Kevin Bacon effect
• Implicated genes are a
few drivers and an
enormous number of
“related” loci
• How do we pick the
“important” genes?
8. Statisticians hate us
• P-hacking
• Garden of forking paths
• Reversion to mean
• Multiple hypothesis testing
• False discovery
• P-values
• Which method is best?
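Of the complaints above, multiple hypothesis testing and false discovery at least have a standard remedy; a minimal sketch of Benjamini–Hochberg FDR control (the p-values are illustrative):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for FDR control:
    reject the k smallest p-values, where k is the largest rank
    with p_(k) <= (k / m) * alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff = rank           # keep the largest qualifying rank
    rejected = {order[r] for r in range(cutoff)}
    return [i in rejected for i in range(m)]

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals))    # only the first two survive correction
```

Note how tests that look “significant” at p < 0.05 in isolation fail once the family of tests is accounted for.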
9. In summary
• Data isn’t big (enough)
• Platforms are a problem
• Computation isn’t saving us
• Diseases are complicated
• We don’t know what we’re doing
12. Allow bigger datasets
• “Allow” reuse & combining
not “build”
• Assemble datasets
according to standards
(CDISC, EDAM, HPO)
• Poor tools but getting
better: trmk / Arborist, eHS
• Issue of trust
[Poster: “From Excel to tranSMART in five simple steps” — tmtk (Python library) and The Arborist (visual editor)]
• tmtk steps: Import (start the import wizard to create a study based on your study data), Validate (let the toolkit check the tranSMART-specific requirements), Edit, Save, Load
• The Arborist: collaborate on data modelling with non-technical data experts in the secure Arborist web application — restructure the tranSMART tree with drag and drop, rename variables and values, add and edit metadata for any tree node, work with both low and high dimensional data
• The main object in the tmtk workflow is the Study; it provides an API for modifying …
• Try it at http://arborist-test-trait.thehyve.net/demo; code at https://github.com/thehyve/arborist under the GPL v3 license
13. eTRIKS project
• Via IMI: Europe’s largest public-private initiative
• Data intensive translational research
• Sharing data (standards, starter kit)
• Open knowledge platform
• Sustainable service
14. Example: U-BIOPRED
• Unbiased BIOmarkers in PREDiction
of respiratory disease outcomes
• 900+ patients, 16 clinical centres +
other studies combined via
standards
• Outputs:
• Common tranSMART db
• 40+ academic publications
• Subtyping of asthmatics
15. Use your data better
• Pre-training (data without labels)
• Initial training with mediocre data
• Adapt
• Transfer learning (labels / output changes)
• Domain adaptation (data / input changes)
• Don’t use deep learning
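As an illustration of the pre-train-then-adapt idea (a toy sketch on synthetic data, not any specific pipeline from the talk): a representation is “pre-trained” on a large unlabelled cohort and frozen, and only a small logistic head is fitted on the trial-sized labelled set.

```python
import math
import random

rng = random.Random(0)

# "Pre-train" on a large unlabelled cohort: learn a frozen representation
# (here just normalisation statistics, standing in for a learned encoder).
big = [rng.gauss(5.0, 2.0) for _ in range(5000)]
mu = sum(big) / len(big)
sd = math.sqrt(sum((x - mu) ** 2 for x in big) / len(big))

def features(x):
    return (x - mu) / sd            # frozen feature map, reused downstream

# "Adapt": fit only a tiny logistic head on the scarce labelled set.
small_x = [rng.gauss(3.0, 1.0) for _ in range(40)] + \
          [rng.gauss(7.0, 1.0) for _ in range(40)]
small_y = [0] * 40 + [1] * 40

w, b = 0.0, 0.0
for _ in range(500):                # plain stochastic gradient ascent
    for x, t in zip(small_x, small_y):
        p = 1 / (1 + math.exp(-(w * features(x) + b)))
        w += 0.1 * (t - p) * features(x)
        b += 0.1 * (t - p)

acc = sum((1 / (1 + math.exp(-(w * features(x) + b))) > 0.5) == bool(t)
          for x, t in zip(small_x, small_y)) / len(small_y)
print(f"training accuracy: {acc:.2f}")
```

The point is the division of labour: the expensive part of the model comes from abundant data, and the scarce labels only have to fit a handful of parameters.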
16. Example: text extraction
• Aim: extract biological relationships from publications to
build asthma knowledge base
• Using BEL statements
• Domain expert time is prohibitive
• Use previous efforts as training
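As a toy illustration of the idea (the pattern, gene names, and sentence are invented for this example; real BEL extraction pipelines are far richer): map a sentence to a BEL-style statement with a regular expression.

```python
import re

# Hypothetical minimal relation extractor: one verb pattern, two entities.
PATTERN = re.compile(r"(\w+) (inhibits|activates) (\w+)")

def to_bel(sentence):
    """Turn 'A activates B' / 'A inhibits B' into a BEL-style triple."""
    m = PATTERN.search(sentence)
    if not m:
        return None
    subj, verb, obj = m.groups()
    rel = "decreases" if verb == "inhibits" else "increases"
    return f"p(HGNC:{subj}) {rel} p(HGNC:{obj})"

print(to_bel("IL13 activates STAT6 in airway epithelium"))
# p(HGNC:IL13) increases p(HGNC:STAT6)
```

The scarce resource is not the transformation itself but labelled sentence/statement pairs, which is why reusing previous curation efforts as training data matters.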
17. Example: text classification for systematic reviews
• Aim: find similar or related publications within corpus
• Actual aim: find which method of text classification
is “best” (validation)
• Data: 15 Drug Control Reviews & Neuropathic Pain
dataset
• Classify with random forests, naive Bayes, SVMs & CNNs
• Which has best recall?
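Recall (the fraction of relevant papers the classifier actually finds) is the metric that matters for screening, since a missed paper is worse than an extra one to skim. A minimal hand-rolled computation, with hypothetical predictions from two classifiers:

```python
def recall(y_true, y_pred):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

y_true   = [1, 1, 1, 0, 0, 1]   # which papers are actually relevant
clf_a    = [1, 0, 1, 0, 0, 1]   # misses one relevant paper
clf_b    = [1, 1, 1, 1, 0, 1]   # finds them all, at the cost of a false positive
print(recall(y_true, clf_a))    # 0.75
print(recall(y_true, clf_b))    # 1.0
```

For systematic reviews the second behaviour is usually preferable, even though its precision is lower.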
19. Not platforms but meta-platforms
• The monolithic platform is dead
• We live in a world of
distributed data
• Avoid lock-in
• Don’t try to do everything
• Interoperability
• Allow different computational
idioms
20. tranSMART redevelopment
• eTRIKS enhancements
• i2b2 merger
• Next-generation tranSMART
• Major refactoring & performance fixes
• Additional tools & visualisation
• Component architecture
• Just a warehouse with API
21. Better HPC idioms
• Spark
• Like Map-Reduce, but doesn’t
persist to disk between steps
• Better for iterative
processing
• Does less violence to
problem
• Graphs & ML
22. Example: Spark for clustering
• Subtyping / stratification
• Popular methods are
computationally prohibitive
on rich data
• (Also ground truth unclear)
• “Sparkify”, compare, validate
on asthma cohort
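Clustering is exactly the kind of iterative job Spark suits: the dataset stays cached in memory across passes instead of being rewritten between map and reduce steps. A plain-Python sketch of one such algorithm (Lloyd's k-means on toy 1-D data — illustrative, not the project's actual pipeline):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's k-means: repeated assign ("map") / recompute ("reduce")
    passes over the same data, which Spark would keep cached in memory."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point goes to its nearest centre
        clusters = [[] for _ in range(k)]
        for x in points:
            i = min(range(k), key=lambda i: abs(x - centres[i]))
            clusters[i].append(x)
        # update step: recompute centres from the cached points
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

data = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
print(kmeans(data, 2))   # centres near 1.0 and 10.0
```

In classic Map-Reduce each of those iterations would round-trip through storage; Spark's in-memory caching is what makes such methods tractable on rich cohorts.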
23. Hypothesis generation vs validation
• Generating leads vs.
testing
• Machine learning for:
• hypothesis generation
/ exploration
• streamlining of
laborious manual
tasks
• Validate!
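One generic way to “validate!” a machine-generated lead (an illustrative sketch, not a method named in the talk) is a permutation test: shuffle the labels many times and ask how often chance alone matches the observed association.

```python
import random

def permutation_p_value(score_fn, X, y, n_perm=1000, seed=0):
    """Fraction of label shuffles scoring at least as well as the
    real labels (with the standard +1 correction)."""
    rng = random.Random(seed)
    observed = score_fn(X, y)
    hits = 0
    for _ in range(n_perm):
        y_shuf = y[:]
        rng.shuffle(y_shuf)
        if score_fn(X, y_shuf) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy score: agreement between a single feature and binary labels
def score(X, y):
    return abs(sum(x * (2 * t - 1) for x, t in zip(X, y)))

X = [0.9, 1.1, 1.0, -1.0, -0.8, -1.2]
y = [1, 1, 1, 0, 0, 0]
print(permutation_p_value(score, X, y))
```

A lead whose score is routinely matched by shuffled labels is exploration noise, not a hypothesis worth testing.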
24. Conclusions
• Big biomedical data is often not big, but we can make it
bigger
• We don't need more platforms, we need platforms that
work together
• Sometimes Big Data approaches are useful, sometimes
not: choose wisely
• Trust but verify (especially machine learning)
25. Thanks
• Data Science Institute, ICL
• Fayzal Ghantiwala (Bloomberg)
• Nazanin Zounemat Kermani (ICL)
• Mansoor Saqi (EISBM / ICL)
• Jose Saray (EISBM)
• eTRIKS consortium
• U-BIOPRED consortium