Big Biomedical Data is a Lie

Taming large datasets for translational research
Paul Agapow

Data Science Institute

Imperial College London

<p.agapow@imperial.ac.uk>

2018/1/31
Disclosure / About me
• Data Science Institute
(Imperial College London)
• Big rich biomedical datasets
for translational research &
precision medicine
• Novel & advanced
computation for research
• No actual or potential
conflict of interest in relation
to this presentation
“Nice training set. Where’s your data?”
– An analyst
Biomedical big data is often not big enough
• Average trial size on
ClinicalTrials.gov < 100
• Average #samples per
GEO dataset < 100
• Average GWAS cohort
size ~9000 (median
~2500)
• 1,064 ICU admissions for
flu in UK 2016/2017
season
• Curse of dimensionality
• Deep learning requires
“thousands” of samples
for training (at least p²?)
• GWAS needs 3K+ for
large effects, 10K or
more for small effects …
• Sub-populations will be
smaller
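The arithmetic behind these bullets can be sketched in a few lines. This is an illustration only: it reads the slide's "p2" as p squared (a rule of thumb, not a guarantee), and the stratification factors below (sex, age band, disease subtype) are hypothetical, not taken from the talk.

```python
import math


def samples_per_stratum(cohort_size, levels_per_factor):
    """Average samples per cell when a cohort is split evenly across
    every combination of the given factors (an optimistic assumption)."""
    cells = math.prod(levels_per_factor)
    return cohort_size / cells


def max_features_p_squared(n_samples):
    """Largest feature count p such that p**2 <= n_samples, i.e. the
    'at least p^2 samples' heuristic read backwards."""
    return math.isqrt(n_samples)


# Median GWAS cohort from the slide (~2500), split by hypothetical factors:
# 2 sexes x 5 age bands x 4 subtypes = 40 cells.
per_cell = samples_per_stratum(2500, [2, 5, 4])

# A dataset of 100 samples supports about 10 features under the heuristic.
features_for_100 = max_features_p_squared(100)
```

Even a cohort that looks large shrinks to tens of samples per cell once it is stratified, which is the point of the last bullet.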
Platforms are a problem not a panacea
• Biomedical data lakes / warehouses aren’t working
• Each is an island unto itself
• Tools can’t understand data formats
• High demands on user (meaning, context)
• Poor standardisation / harmonisation tools (curation effort == analysis
effort)
• A world of distributed data
• A world of many computational idioms
• (Self) lock-in
Computers are not getting faster
• Data is embiggening
• Can’t rely on cheap
computation to get us out
of a hole
• Many HPC idioms, most
awkward (e.g. Map-Reduce)
• Database schemas struggle at
scale
What if every gene affects every other gene?
• Pritchard’s omnigenics
(2017):
• Kevin Bacon effect
• Implicated genes are a
few drivers and an
enormous number of
“related” loci
• How do we pick the
“important” genes?
Statisticians hate us
• P-hacking
• Garden of forking paths
• Reversion to mean
• Multiple hypothesis testing
• False discovery
• P-values
• Which method is best?
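To make the multiple-testing and false-discovery bullets concrete, here is a minimal sketch comparing an uncorrected threshold, a Bonferroni correction and the Benjamini-Hochberg procedure. The p-values are made up for illustration, and the function names are this sketch's own rather than any library's API.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only where p <= alpha / m (controls family-wise error)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * alpha (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten tests on the same dataset.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive_hits = sum(p < 0.05 for p in p_values)     # no correction
bonf_hits = sum(bonferroni(p_values))            # strictest
bh_hits = sum(benjamini_hochberg(p_values))      # in between
```

On these numbers the uncorrected threshold reports five "discoveries", Benjamini-Hochberg two and Bonferroni one, so the choice of method changes the result.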
In summary
• Data isn’t big (enough)
• Platforms are a problem
• Computation isn’t saving us
• Diseases are complicated
• We don’t know what we’re doing
Solutions Responses
Allow bigger datasets
• “Allow” reuse & combining
not “build”
• Assemble datasets
according to standards
(CDISC, EDAM, HPO)
• Poor tools but getting
better: tmtk / Arborist, eHS
• Issue of trust
[Figure: poster, “From Excel to tranSMART in five simple steps”; step text partly truncated in extraction]
• Import: start the import wizard to create a study based on your study data
• Validate: let the toolkit check the tranSMART-specific requirements
• Edit, Save, Load: (text truncated)
• tmtk: Python library; the main object in the tmtk workflow is the Study
• The Arborist: visual editor; collaborate on data modelling with non-technical data experts in the secure Arborist web application
● Restructure the tranSMART tree with drag and drop
● Rename variables and values
● Add and edit metadata for any tree node
● Work with both low and high dimensional data
• Try it at http://arborist-test-trait.thehyve.net/demo.
• Code at https://github.com/thehyve/arborist under GPL v3 license.
eTRIKS project
• Via IMI: Europe’s largest public-private initiative
• Data intensive translational research
• Sharing data (standards, starter kit)
• Open knowledge platform
• Sustainable service
Example: U-BIOPRED
• Unbiased BIOmarkers in PREDiction
of respiratory disease outcomes
• 900+ patients, 16 clinical centres +
other studies combined via
standards
• Outputs:
• Common tranSMART db
• 40+ academic publications
• Subtyping of asthmatics
Use your data better
• Pre-training (data without labels)
• Initial training with mediocre data
• Adapt
• Transfer learning (labels / output changes)
• Domain adaptation (data / input changes)
• Don’t use deep learning
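The "initial training with mediocre data, then adapt" idea can be shown with a deliberately tiny model. The sketch below is a toy under stated assumptions: a one-parameter linear model, a made-up plentiful source dataset and a made-up scarce target dataset. It is not the method used in the talk.

```python
def fit(xs, ys, w=0.0, steps=100, lr=0.01):
    """Gradient descent on mean squared error for the model y = w * x,
    starting from the supplied weight w."""
    n = len(xs)
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w


# Hypothetical "mediocre" source data: related to the target, but not identical.
source_x = [1.0, 2.0, 3.0, 4.0]
source_y = [2.0 * x for x in source_x]

# Hypothetical scarce target data with a slightly different relationship.
target_x = [1.0, 2.0, 3.0]
target_y = [2.2 * x for x in target_x]

pretrained = fit(source_x, source_y, steps=200)                 # initial training
adapted = fit(target_x, target_y, w=pretrained, steps=5)        # brief adaptation
from_scratch = fit(target_x, target_y, w=0.0, steps=5)          # same budget, no head start
```

With the same five update steps on the target data, the adapted weight ends up much closer to the target slope of 2.2 than the weight trained from scratch.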
Example: text extraction
• Aim: extract biological relationships from publications to
build asthma knowledge base
• Using BEL statements
• Domain expert time is prohibitive
• Use previous efforts as training
Example: text classification for systematic reviews
• Aim: find similar or related publications within corpus
• Actual aim: find which method of text classification
is “best” (Validation)
• Data: 15 Drug Control Reviews & Neuropathic Pain
dataset
• Classify with random forest, naive Bayes, SVM & CNNs
• Which has best recall?
When you don’t know what to use, use SVMs
Conclusion
Dataset                    WSS   Classifier
ACE Inhibitors             0.26  SVM
ADHD                       0.35  MNB
Antihistamines             0.19  MNB
Atypical Antipsychotics    0.12  SVM
Beta Blockers              0.13  SVM
CCB                        0.21  SVM
Estrogen                   0.25  SVM
Neuropathic Pain           0.61  CNN
NSAIDS                     0.14  SVM
Opioids                    0.23  SVM
Oral Hypoglycemics         0.21  SVM
PPI                        0.17  SVM
Skeletal Muscle Relaxants  0.21  SVM
Statins                    0.19  SVM
Triptans                   0.22  SVM
Urinary Incontinence       0.25  SVM
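The slide does not define WSS. Assuming it means "work saved over sampling", as commonly used when evaluating classifiers for systematic reviews, it can be computed from a confusion matrix as sketched below. The counts in the example are invented for illustration.

```python
def recall(tp, fn):
    """Fraction of truly relevant publications that the classifier found."""
    if tp + fn == 0:
        raise ValueError("recall is undefined when there are no relevant items")
    return tp / (tp + fn)


def work_saved_over_sampling(tp, fp, tn, fn):
    """WSS = (TN + FN) / N - (1 - recall).

    (TN + FN) / N is the share of the corpus a reviewer no longer has to
    read; (1 - recall) is what random sampling would save at the same recall.
    """
    n = tp + fp + tn + fn
    return (tn + fn) / n - (1.0 - recall(tp, fn))


# Hypothetical corpus of 1000 abstracts, 100 of them relevant.
example_wss = work_saved_over_sampling(tp=95, fp=300, tn=600, fn=5)
```

In this invented example the classifier reaches 95% recall while letting the reviewer skip about 60% of the corpus, for a WSS of roughly 0.555.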
Not platforms but meta-platforms
• The monolithic platform is dead
• We live in a world of
distributed data
• Avoid lock-in
• Don’t try to do everything
• Interoperability
• Allow different computational
idioms
tranSMART redevelopment
• eTRIKS enhancements
• i2b2 merger
• Next-generation tranSMART
• Major refactoring & performance fixes
• Additional tools & visualisation
• Component architecture
• Just a warehouse with API
Better HPC idioms
• Spark
• Like Map-Reduce, but doesn’t
persist to disk between steps
• Better for iterative
processing
• Does less violence to
problem
• Graphs & ML
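The iterative pattern these bullets describe can be sketched without Spark. The plain-Python example below runs a one-dimensional k-means in which every iteration is a map step followed by a reduce step, with the data held in memory between iterations. It only illustrates the idiom: it uses no Spark API, and the points and starting centroids are made up.

```python
from functools import reduce


def assign(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    return nearest, (point, 1)


def combine(acc, pair):
    """Reduce step: accumulate a running (sum, count) per cluster."""
    key, (value, count) = pair
    total, n = acc.get(key, (0.0, 0))
    acc[key] = (total + value, n + count)
    return acc


def kmeans_1d(points, centroids, iterations=10):
    """Repeat map + reduce over the same in-memory data. A disk-based
    Map-Reduce job would re-read and re-write the data on every pass."""
    centroids = list(centroids)
    for _ in range(iterations):
        mapped = map(lambda p: assign(p, centroids), points)
        sums = reduce(combine, mapped, {})
        # Keep the old centroid if a cluster received no points.
        centroids = [sums[i][0] / sums[i][1] if i in sums else c
                     for i, c in enumerate(centroids)]
    return centroids


points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centres = kmeans_1d(points, centroids=[0.0, 5.0])
```

Because the algorithm makes many passes over the same points, keeping them in memory between passes is where an engine of this style saves time on iterative work such as clustering.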
Example: Spark for clustering
• Subtyping / stratification
• Popular methods are
computationally prohibitive
on rich data
• (Also ground truth unclear)
• “Sparkify”, compare, validate
on asthma cohort
Hypothesis generation vs validation
• Generating leads vs.
testing
• Machine learning for:
• hypothesis generation
/ exploration
• streamlining of
laborious manual
tasks
• Validate!
Conclusions
• Big biomedical data is often not big, but we can make it
bigger
• We don't need more platforms, we need platforms that
work together
• Sometimes Big Data approaches are useful, sometimes
not: choose wisely
• Trust but verify (especially machine learning)
Thanks
• Data Science Institute, ICL
• Fayzal Ghantiwala (Bloomberg)
• Nazanin Zounemat Kermani (ICL)
• Mansoor Saqi (EISBM / ICL)
• Jose Saray (EISBM)
• eTRIKS consortium
• U-BIOPRED consortium
