Big Data & Machine Learning for
Clinical Data
Paul Agapow <p.agapow@imperial.ac.uk>
Data Science Institute, Imperial College London
 Biomedical science is now data
science
 I was a biochemist, immunologist,
and then an infectious disease
bioinformatician
 I’m now a “biomedical data
scientist”
 I will be a Health Informatics
Director at AstraZeneca
About me & these lectures
WikiMedia Commons
 We increasingly use & need:
 Lots of complex data
 Real world evidence (outside RCTs)
 Computational tools
 Statistical analysis
 Complex interactions
 Precision medicine: prediction &
(sub)typing
 Also:
 Cheap
 Successful in other domains
 But lots of hype and jargon
Biomedical science is now data science
WikiMedia Commons
 The world is increasingly
“datafied” – we make more and
bigger datasets
 Devices
 Routine collection
 Aggregation & integration
 Big Data is “too big” for
conventional approaches
Part 1: Big Data
WikiMedia Commons
 “Quantity has a quality of its
own”
 Often free
 Real
 Rich, deep, interactions
 Needed for ML and other
assumption-light approaches
Why Big Data?
By Ender005 - Own work, CC BY-SA 4.0,
https://commons.wikimedia.org/w/index.php?curid=49888192
 Many diseases with the same clinical presentation have different
molecular phenotypes
 Several overlapping terms
 stratified: separate patients into groups for treatment
 precision:
 tailor treatment to individual
 improved targeted therapies with fewer side effects
 “Right medication, right dose, right patient, right time, right route”
 Also personalised, P4 …
 E.g. asthma
Why Big Data? Precision medicine
 Volume
 Velocity
 Variety
 Veracity
 Value
The 3 / 4 / 5 Vs of Big Data
By MuhammadAbuHijleh - Own work, CC BY-SA 4.0,
https://commons.wikimedia.org/w/index.php?curid=46431834
 Limits are labile: they shift with
technological progress
 Memory
 Compute
 Data schema
 Solutions: distributed & parallel
computation, new high-end
databases
The problem with volume: tools & platforms
WikiMedia Commons
 Multiple hypothesis testing
and false discovery
 Bias: a sample is not the
population
 The Past is not the Present
 Observation without
understanding
 The curse of dimensionality
 Privacy
 Some ML-specific issues
The problem with volume: methodology
From KDNuggets
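As an illustration of the multiple-testing point above, here is a minimal sketch of the Benjamini-Hochberg procedure, one common way to control the false discovery rate. The slides do not name a specific correction, so the choice of method, the function name and the p-values are all assumptions made for this example.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per test: True where the null is rejected at FDR alpha."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose p-value is <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from eight tests on the same dataset.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
naive = [p <= 0.05 for p in p_values]
corrected = benjamini_hochberg(p_values, alpha=0.05)
print(sum(naive), "naive discoveries;", sum(corrected), "after correction")
```

With these made-up values, five tests pass the naive 0.05 threshold but only two survive the correction.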
 Many, many types of data
 How do we use multiple types?
 Which type do we use?
 Disease is systemic
 Interactions
 Evidence
 Solutions: integrated analysis,
independent analysis with
validation
The problem with variety
Wu, Sanin, Wang (2016) Clinical Applications and Systems
Biomedicine
 Much biodata is uncertain
 Noise
 Mistakes
 People lie
 A sample is not a population
 Incompatible systems
 Most analyses are not reproducible
 Solutions: imputation, standards,
cross-validation etc.
The problem with veracity
By Khaydock - Own work, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=25102900
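Two of the solutions named above, imputation and cross-validation, can be sketched in a few lines. This is a deliberately simple version (mean imputation, contiguous folds); the helper names and the data are hypothetical.

```python
def impute_mean(column):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n records."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


# Hypothetical lab measurements with two missing entries.
column = [4.0, None, 6.0, 5.0, None]
filled = impute_mean(column)
folds = list(k_fold_indices(len(filled), k=2))
```

In a real analysis the imputation would normally be fitted on each training fold only, so that information from the test fold does not leak into the model.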
 How do we
 Re-use data
 Compare data
 Store data from multiple sources
 Even know what data is
 FAIR, OHDSI / OMOP, HPO
 Even just metadata helps for
cataloguing
 But: multiple & incomplete
standards, translation, complexity
Solution: Standards & ontologies
WikiMedia Commons
 Much data cannot leave its
home institution
 Hospitals
 Registries
 Insurance companies
 Governance is hard & slow
 So take the analysis to the data
 Data looks the same but may
be internally different
Solution: Federated analysis
International Collaboration for Autism Registry Epidemiology
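A minimal sketch of "taking the analysis to the data": each site computes aggregate summaries locally, and only those aggregates are pooled centrally. The function names and the two sites are hypothetical, and real federated systems add governance, disclosure control and harmonisation steps that are not shown here.

```python
def local_summary(values):
    """Run inside each institution: return aggregates only, never patient rows."""
    return {
        "n": len(values),
        "sum": sum(values),
        "sum_sq": sum(v * v for v in values),
    }


def pooled_mean_variance(summaries):
    """Combine site-level aggregates into an overall mean and population variance."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    # E[x^2] - E[x]^2; simple, but numerically fragile for very large values.
    variance = total_sq / n - mean ** 2
    return mean, variance


# Hypothetical measurements held at two separate sites.
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]
mean, variance = pooled_mean_variance([local_summary(site_a), local_summary(site_b)])
```

The pooled result matches what a central analysis of all five values would give, without any row leaving its home site.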
 In a vast sea of biodata, how do you
discover anything? How do you avoid
cherry-picking?
 Solutions:
 Distinguish discovery from
exploration
 Non-parametric methods (e.g.
machine learning)
 Some problems don’t have a single
solution but many (e.g. prediction)
The problem with it all: discoverability
EnterpriseKnowledge.com
 Write analyses as recipes
 Snakemake, Nextflow, Flowr
 Use recreatable computational
systems
 Docker
 “Your biggest collaborator is
you, six months ago”
 But: it’s work
Solution: Reproducibility
From RevolutionR
 Big Data is “too big” for current conventional tools & practices
 But it’s ideal for solving many biomedical problems
 There are problems with valid discovery and just handling the data
 Standards, distributed databases and analysis and
Summary: Big Data
 “a field of Artificial Intelligence”
 “(the science of) getting computers to learn and act like humans do”
 “getting computers to act without being explicitly programmed”
 “computer systems that automatically improve with experience”
 “neural networks”
 “using statistical techniques to give computer systems the ability to
learn”
Part 2: Machine Learning
In practice:
 broadly-defined set of
algorithms that recognise &
generalise patterns in data
 “non-parametric” or
assumption-light
 may require training over
initial dataset
What is Machine Learning?
By Chire - Own work, Public Domain,
https://commons.wikimedia.org/w/index.php?curid=11711077
 Enough data
 Enough compute
 Technical progress
 Need 'good enough'
solutions
 Prediction & forecasting
 Categorization
 Pattern recognition
 Early, startling success
Why now?
Ray Kurzweil The Singularity is Near
How is ML different to stats?
              Statistics       Machine learning
Assumptions   strong           weak
Data          small            large
Optimize by   fitting          training
Solutions     “the best”       “good enough”
Hypothesis    proof            exploration
Test          p-values etc.    validation
In practice:
 a field of scientific research
 machine learning
 neural networks
 deep learning
 more of an objective than a methodology
 computational systems that duplicate / emulate / replace human effort
What is Artificial Intelligence
• Many methods
• Broadly split into:
• Unsupervised: finds structure within data
• e.g. (most) clustering, self-organised maps, principal component
analysis
• Supervised: trained using labelled examples
• e.g. regression, decision trees, naive bayes, neural networks
• Categories can blur
• e.g. k-means, nearest neighbour?
• Which is better?
What are ML methods?
• (Train a model from data)
• This model encapsulates or generalizes the data
• (Validate the model against test data)
• This model transforms features into labels
• Continuous outputs (e.g. real numbers) are regressions
• Discrete outputs (e.g. categories) are classifications
ML terms & process
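The train / validate cycle above can be sketched with a very small classifier. A nearest-centroid model is used here only because it fits in a few lines; it is an assumption of this example, not a method from the slides, and the feature values and labels are made up.

```python
def train_nearest_centroid(features, labels):
    """Train: summarise each label by the mean (centroid) of its feature vectors."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(model, x):
    """Transform features into a label: pick the label with the closest centroid."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(model, key=lambda y: sq_dist(model[y], x))


def accuracy(model, features, labels):
    """Validate: fraction of held-out records whose predicted label is correct."""
    hits = sum(predict(model, x) == y for x, y in zip(features, labels))
    return hits / len(labels)


# Hypothetical two-feature records; the output is discrete, so this is classification.
train_x = [[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]]
train_y = ["healthy", "healthy", "disease", "disease"]
test_x = [[0.9, 1.1], [4.1, 3.9]]
test_y = ["healthy", "disease"]

model = train_nearest_centroid(train_x, train_y)
test_accuracy = accuracy(model, test_x, test_y)
```

The model generalises the training data into two centroids, and the test records are kept apart so that validation measures performance on unseen data.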
• Take gene expression profiles from patients and cluster to:
• See genes with similar expression profiles
• Similar patients
• Train a model on radiographs with tumours labelled, use to diagnose
unlabelled images
• Find patients with similar symptoms & signs (computational
phenotypes) in EHRs
• Train on histories of patients to forecast their future condition
• Find out how terms in a medical corpus relate to each other
Examples of ML
It’s everywhere
Unsupervised learning: clustering
 What does ‘similar’ mean? How
do we measure it?
 Which features & how weighted?
 Noise & overlapping clusters
 Non-numeric, non-ordered data
 What shapes can clusters be?
 How many clusters? When do we
stop?
 …
Clustering isn’t simple
By Chire - Own work, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=17085331
Varies but:
 Start with record-feature matrix
 Normalise data
 (“Supervised”: select number of
clusters)
 Run algorithm
 Validate
Clustering process
WikiMedia Commons
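The steps above can be sketched with k-means, which is one algorithm among many and is chosen here as an assumption. The record-feature matrix is made up, and the number of clusters is fixed in advance, which is the "supervised" element the slide mentions.

```python
import random
from statistics import mean, pstdev


def zscore_columns(matrix):
    """Normalise: rescale each feature to mean 0 and standard deviation 1."""
    columns = list(zip(*matrix))
    stats = [(mean(c), pstdev(c) or 1.0) for c in columns]  # guard constant columns
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in matrix]


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Run the algorithm: return one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [mean(col) for col in zip(*members)]
    return assignment


# Hypothetical record-feature matrix: rows are patients, columns are two measurements
# on very different scales, which is why normalisation comes first.
records = [[1.0, 100.0], [1.2, 110.0], [0.9, 95.0],
           [5.0, 500.0], [5.2, 520.0], [4.8, 480.0]]
assignment = k_means(zscore_columns(records), k=2)
```

Without the normalisation step the second column would dominate every distance. Validation of the resulting partition is a separate step.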
How not to do it
 A cluster partitioning is a hypothesis
 How do we assess? Validate:
 External: compare against external label or data
 e.g. accuracy, entropy
 Internal: goodness of clustering
 e.g. sum squared errors, cluster cohesion & separation,
silhouette
 Relative: against another clustering scheme
 e.g. is this better with 3 or 4 clusters
Validating clusters
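For external validation, one possible entropy-based measure is the weighted average entropy of the external labels within each cluster. The exact definition below is an assumption (the slide only says "entropy"), and the labels are made up.

```python
from collections import Counter
from math import log2


def cluster_entropy(cluster_ids, true_labels):
    """Weighted mean entropy (in bits) of the external labels inside each cluster.

    0 means every cluster contains a single label; higher means more mixing.
    """
    n = len(cluster_ids)
    total = 0.0
    for c in set(cluster_ids):
        members = [t for cid, t in zip(cluster_ids, true_labels) if cid == c]
        counts = Counter(members)
        h = -sum((k / len(members)) * log2(k / len(members)) for k in counts.values())
        total += len(members) / n * h
    return total


# Hypothetical clinical labels compared against two candidate partitions.
labels = ["A", "A", "A", "B", "B", "B"]
pure = cluster_entropy([0, 0, 0, 1, 1, 1], labels)
mixed = cluster_entropy([0, 0, 1, 1], ["A", "B", "A", "B"])
```

The same function also supports relative validation: run it on two competing partitions and compare the scores.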
Average over each point:
1. Calculate the average distance to all
other members of its cluster, a
2. For each other cluster, calculate the
average distance to every member.
The minimum of these is b
3. The silhouette width is (b−a) /
max(a,b), the higher the better
Clustering process
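The three steps above translate fairly directly into code. This sketch assumes Euclidean distance, at least two clusters, and a score of 0 for a point that is alone in its cluster; the points are made up.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_width(points, labels):
    """Mean of (b - a) / max(a, b) over all points; higher is better."""
    clusters = sorted(set(labels))
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Step 1: a = average distance to the other members of the point's cluster.
        same = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == own and j != i]
        if not same:
            scores.append(0.0)  # assumed convention for a singleton cluster
            continue
        a = sum(same) / len(same)
        # Step 2: b = smallest average distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in clusters if other != own
        )
        # Step 3: silhouette width for this point (0 if all distances are zero).
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical points: two tight groups far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_width(points, [0, 0, 1, 1])   # groups match the clusters
bad = silhouette_width(points, [0, 1, 0, 1])    # clusters cut across the groups
```

A partition that follows the visible groups scores close to 1, while one that cuts across them scores below 0.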
What if there are sub-clusters or
structure?
• Use hierarchical clustering
• Use homogeneity or
completeness metrics to
compare
Nesting & hierarchies
• Complex, heterogeneous
disease
• Many attempts at clustering
• Use transcriptomic &
proteomic data
• Validate with clinical
• 4 clusters with characteristic
genes & clinical behaviour
Example: asthma
 a.k.a. deep learning, (artificial)
neural networks, “AI”
 A series of layers of nodes, each of
which transforms the previous layer.
 Training sets weights on
transformations
 Capable of learning representations
Supervised learning: deep networks
WikiMedia Commons
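The idea of layers of nodes, each transforming the previous layer, can be sketched as a forward pass through a tiny fully connected network. The weights below are hand-picked for illustration rather than learned; training would set them from data. The sigmoid activation and all names are assumptions of this example.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def dense_layer(inputs, weights, biases):
    """One layer: each node outputs sigmoid(weighted sum of the previous layer + bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the inputs through each layer in turn."""
    activation = inputs
    for weights, biases in layers:
        activation = dense_layer(activation, weights, biases)
    return activation


# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output node.
layers = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.0, -0.1]),  # hidden layer: weights, biases
    ([[1.2, -0.7]], [0.05]),                   # output layer: weights, biases
]
output = forward([1.0, 2.0], layers)
```

Each hidden node is a small learned combination of the inputs, which is the sense in which intermediate layers hold representations.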
 There’s little information in an
individual pixel (gene, data point …)
 But individual data points make up
more complete entities
 Each layer takes the layer below and
creates higher-level entities
(representations) from it.
 The system “recognises” higher-
level features that can appear
anywhere in the data.
What’s a representation?
WikiMedia Commons
 Radiologists are overwhelmed
 Want to catch errors &
double-check
 Train ANN over medical
imagery with tumour labelled
 Accuracy similar to humans
Example: diagnosis from medical imagery
From Nvidia
• The model is right but learns
the wrong thing (from our
point of view)
• Solutions:
• Interpreting models
• Better (more examined) data
Problem: useless solutions
Ribeiro et al. (2016) Why Should I Trust You?
 Reversing the model & asking “why”
 What features are important
 Mechanistic insight
 But many ML models are tangled & horribly complex
 And ML community often uninterested
 Solutions:
 Choose an interpretable model
 Software that explores feature space (LIME, Lift, IML)
Problem: interpretability
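One simple way to ask which features are important is permutation importance: shuffle one feature at a time and see how much the model's accuracy drops. This is a different and cruder technique than the tools named above (it is not LIME), and the toy model and data are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, seed=0):
    """For each feature, return the drop in accuracy after shuffling that feature."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for f in range(len(rows[0])):
        column = [r[f] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only feature f replaced by its shuffled values.
        shuffled = [r[:f] + [v] + r[f + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(model, shuffled, labels))
    return importances


def toy_model(row):
    """Stand-in for a trained model: it only ever looks at the first feature."""
    return 1 if row[0] > 0.5 else 0


rows = [[0.1, 7.0], [0.2, 3.0], [0.8, 5.0], [0.9, 1.0], [0.3, 9.0], [0.7, 2.0]]
labels = [toy_model(r) for r in rows]
importances = permutation_importance(toy_model, rows, labels)
```

The second feature gets an importance of exactly zero because the model never uses it. On a real model the same check can reveal that it leans on a feature it should not.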
• Bias (systematic error) vs. Variance
(random error)
• Want a model that captures the
regularities in training data AND
generalizes to unseen data.
• This is impossible
• Solutions:
• Use a variety of data
• Feature selection
• Regularization
Problem: how do models get it wrong?
From KDNuggets
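Regularization, one of the solutions listed above, can be shown on the smallest possible model: a single slope fitted through the origin with an L2 (ridge) penalty. Minimising sum((y - w*x)^2) + penalty * w^2 gives w = sum(x*y) / (sum(x*x) + penalty). The data and penalty values are made up.

```python
def ridge_slope(xs, ys, penalty):
    """Least-squares slope through the origin with an L2 penalty on the slope."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


# Hypothetical training data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unpenalised = ridge_slope(xs, ys, penalty=0.0)   # fits the training data exactly
penalised = ridge_slope(xs, ys, penalty=14.0)    # shrunk towards zero
```

The penalty trades some fit to the training data (more bias) for a model that reacts less to noise in any one sample (less variance).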
• What do we want from our ML
models?
• Power / accuracy
• Insight
• Error tolerance
• e.g. drug discovery vs drug safety
Problem: how good do models have to be?
After Harel
• Much (most) data has few positives
• Results in an imbalanced model
• Solutions:
• Over- and under-sampling
• Pre-train with poor data
• Ensemble methods
Problem: imbalanced data & lack of data
DataScience.com
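A minimal sketch of random over-sampling, the simplest form of the first solution above: minority-class records are duplicated at random until every class is as large as the biggest one. The function name and records are hypothetical.

```python
import random
from collections import Counter


def random_oversample(records, labels, seed=0):
    """Duplicate randomly chosen minority records until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_records, out_labels = list(records), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(records, labels) if l == label]
        for _ in range(target - count):
            out_records.append(rng.choice(pool))
            out_labels.append(label)
    return out_records, out_labels


# Hypothetical dataset: five negatives and a single positive.
records = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
labels = ["neg", "neg", "neg", "neg", "neg", "pos"]
balanced_records, balanced_labels = random_oversample(records, labels)
```

Resampling is normally applied to the training data only, so that the test set still reflects the real class balance.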
 Machine learning uses large amounts of data with few assumptions to
make models that generalise that data
 This is useful for situations where we don’t have an explicit model and
just need ‘a’ solution.
 But this means we need to examine our data and validate our
solutions
 A ‘bad’ solution can be useful, depending on what you want to
achieve.
Summary: Machine Learning

Introduction to SnakemakeIntroduction to Snakemake
Introduction to Snakemake
 
Analysing biomedical data (ers october 2017)
Analysing biomedical data (ers  october 2017)Analysing biomedical data (ers  october 2017)
Analysing biomedical data (ers october 2017)
 
Interpreting transcriptomics (ers berlin 2017)
Interpreting transcriptomics (ers berlin 2017)Interpreting transcriptomics (ers berlin 2017)
Interpreting transcriptomics (ers berlin 2017)
 

Dernier

biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY1301aanya
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformationAreesha Ahmad
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learninglevieagacer
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxseri bangash
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flyPRADYUMMAURYA1
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspectsmuralinath2
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Silpa
 
An introduction on sequence tagged site mapping
An introduction on sequence tagged site mappingAn introduction on sequence tagged site mapping
An introduction on sequence tagged site mappingadibshanto115
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...Monika Rani
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxRenuJangid3
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Silpa
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.Silpa
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
Introduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxIntroduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxrohankumarsinghrore1
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptxSilpa
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfSumit Kumar yadav
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Serviceshivanisharma5244
 

Dernier (20)

biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
An introduction on sequence tagged site mapping
An introduction on sequence tagged site mappingAn introduction on sequence tagged site mapping
An introduction on sequence tagged site mapping
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptx
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Introduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxIntroduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptx
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 

Big Data & ML for Clinical Data

  • 1. Big Data & Machine Learning for Clinical Data Paul Agapow <p.agapow@imperial.ac.uk> Data Science Institute, Imperial College London
  • 2. About me & these lectures
      • Biomedical science is now data science
      • I was a biochemist, immunologist, and then an infectious disease bioinformatician
      • I’m now a “biomedical data scientist”
      • I will be a Health Informatics Director at AstraZeneca
      Image credit: WikiMedia Commons
  • 3. Biomedical science is now data science
      • We increasingly use & need:
          • Lots of complex data
          • Real world evidence (outside RCTs)
          • Computational tools
          • Statistical analysis
          • Complex interactions
          • Precision medicine: prediction & (sub)typing
      • Also:
          • Cheap
          • Successful in other domains
      • But lots of hype and jargon
      Image credit: WikiMedia Commons
  • 4. Part 1: Big Data
      • The world is increasingly “datafied” – we make more and bigger datasets
      • Devices
      • Routine collection
      • Aggregation & integration
      • Big Data is “too big” for conventional approaches
      Image credit: WikiMedia Commons
  • 5. Why Big Data?
      • “Quantity has a quality of its own”
      • Often free
      • Real
      • Rich, deep, interactions
      • Needed for ML and other assumption-light approaches
      Image credit: By Ender005 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=49888192
  • 6. Why Big Data? Precision medicine
      • Many diseases with the same clinical presentation have different molecular phenotypes
      • Several overlapping terms
          • stratified: separate patients into groups for treatment
          • precision:
              • tailor treatment to individual
              • improved targeted therapies with fewer side effects
              • “Right medication, right dose, right patient, right time, right route”
          • Also personalised, P4 …
      • E.g. asthma
  • 7. The 3 / 4 / 5 Vs of Big Data
      • Volume
      • Velocity
      • Variety
      • Veracity
      • Value
      Image credit: By MuhammadAbuHijleh - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=46431834
  • 8. The problem with volume: tools & platforms
      • Limits are labile: they shift with technological progress
          • Memory
          • Compute
          • Data schema
      • Solutions: distributed & parallel computation, new high-end databases
      Image credit: WikiMedia Commons
  • 9. The problem with volume: methodology
      • Multiple hypothesis testing and false discovery
      • Bias: a sample is not the population
      • The Past is not the Present
      • Observation without understanding
      • The curse of dimensionality
      • Privacy
      • Some ML-specific issues
      Image credit: From KDNuggets
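To illustrate the first point on the methodology slide, here is a minimal sketch of one common false-discovery correction, the Benjamini-Hochberg procedure. The slides do not name a specific correction, so the choice of method, the function name and the p-values are all assumptions made for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per test: True where it is rejected at FDR level alpha."""
    m = len(p_values)
    # Rank the p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose p-value is <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff is declared a discovery.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten tests run on the same dataset.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
naive = [x <= 0.05 for x in p]          # uncorrected threshold
corrected = benjamini_hochberg(p, 0.05)
print(sum(naive), "hits uncorrected;", sum(corrected), "after correction")
```

With many tests, the uncorrected threshold lets through results that the corrected one does not, which is the false-discovery risk the slide refers to.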
  • 10. The problem with variety
      • Many, many types of data
      • How do we use multiple types?
      • Which type do we use?
      • Disease is systemic
      • Interactions
      • Evidence
      • Solutions: integrated analysis, independent analysis with validation
      Image credit: Wu, Sanin, Wang (2016) Clinical Applications and Systems Biomedicine
  • 11. The problem with veracity
      • Much biodata is uncertain
      • Noise
      • Mistakes
      • People lie
      • A sample is not a population
      • Incompatible systems
      • Most analyses are not reproducible
      • Solutions: imputation, standards, cross-validation etc.
      Image credit: By Khaydock - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=25102900
  • 12. Solution: Standards & ontologies
      • How do we
          • Re-use data
          • Compare data
          • Store data from multiple sources
          • Even know what data is
      • FAIR, OHDSI / OMOP, HPO
      • Even just metadata helps for cataloguing
      • But: multiple & incomplete standards, translation, complexity
      Image credit: WikiMedia Commons
  • 13. Solution: Federated analysis
      • Much data cannot leave its home institution
          • Hospitals
          • Registries
          • Insurance companies
      • Governance is hard & slow
      • So take the analysis to the data
      • Data looks the same but may be internally different
      Image credit: International Collaboration for Autism Registry Epidemiology
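A minimal sketch of "take the analysis to the data": each institution runs a local function and shares only aggregate statistics, which a coordinator then pools. The function names, the two sites and their values are hypothetical; real federated platforms add governance, authentication and disclosure controls that are not shown here.

```python
import math


def local_summary(values):
    """Runs inside each institution: only aggregates leave the site."""
    return {
        "n": len(values),
        "sum": sum(values),
        "sum_sq": sum(v * v for v in values),
    }


def pool(summaries):
    """Runs at the coordinator: combine site summaries into a pooled mean and SD."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)  # sample variance
    return mean, math.sqrt(variance)


# Hypothetical measurements that never leave their home institution.
hospital_a = [5.1, 6.0, 5.5]
hospital_b = [7.2, 6.8]

pooled_mean, pooled_sd = pool([local_summary(hospital_a), local_summary(hospital_b)])
print(round(pooled_mean, 2), round(pooled_sd, 2))
```

The pooled result matches what a single analysis of the combined records would give, but only if both sites encode the variable in the same way, which is the caveat in the slide's last bullet.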
  • 14. The problem with it all: discoverability
      • In a vast sea of biodata, how do you discover anything? How do you avoid cherry-picking?
      • Solutions:
          • Distinguish discovery from exploration
          • Non-parametric methods (e.g. machine learning)
          • Some problems don’t have a single solution but many (e.g. prediction)
      Image credit: EnterpriseKnowledge.com
  • 15. Solution: Reproducibility
      • Write analyses as recipes
          • Snakemake, Nextflow, Flowr
      • Use recreatable computational systems
          • Docker
      • “Your biggest collaborator is you, six months ago”
      • But: it’s work
      Image credit: From RevolutionR
  • 16. Summary: Big Data
      • Big Data is “too big” for current conventional tools & practices
      • But it’s ideal for solving many biomedical problems
      • There are problems with valid discovery and just handling the data
      • Standards, distributed databases and analysis and
  • 17. Part 2: Machine Learning
      • “a field of Artificial Intelligence”
      • “(the science of) getting computers to learn and act like humans do”
      • “getting computers to act without being explicitly programmed”
      • “computer systems that automatically improve with experience”
      • “neural networks”
      • “using statistical techniques to give computer systems the ability to learn”
  • 18. What is Machine Learning?
      In practice:
      • broadly-defined set of algorithms that recognise & generalise patterns in data
      • “non-parametric” or assumption-light
      • may require training over initial dataset
      Image credit: By Chire - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=11711077
  • 19. Why now?
      • Enough data
      • Enough compute
      • Technical progress
      • Need 'good enough' solutions
      • Prediction & forecasting
      • Categorization
      • Pattern recognition
      • Early, startling success
      Image credit: Ray Kurzweil, The Singularity is Near
  • 20. How is ML different to stats?
  • 21. How is ML different to stats?

                      Statistical      Machine
      Assumptions     strong           weak
      Data            small            large
      Optimize by     fitting          training
      Solutions       “the best”       “good enough”
      Hypothesis      proof            exploration
      Test            p-values etc.    validation
  • 22. What is Artificial Intelligence?
      In practice:
      • a field of scientific research
      • machine learning
      • neural networks
      • deep learning
      • more of an objective than a methodology
      • computational systems that duplicate / emulate / replace human effort
  • 23. What are ML methods?
      • Many methods
      • Broadly split into:
          • Unsupervised: finds structure within data
              • e.g. (most) clustering, self-organised maps, principal component analysis
          • Supervised: trained using labelled examples
              • e.g. regression, decision trees, naive Bayes, neural networks
      • Categories can blur
          • e.g. k-means, nearest neighbour?
      • Which is better?
  • 24. ML terms & process
      • (Train a model from data)
      • This model encapsulates or generalizes the data
      • (Validate the model against test data)
      • This model transforms features into labels
      • Continuous outputs (e.g. real numbers) are regressions
      • Discrete outputs (e.g. categories) are classifications
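The train / validate cycle on this slide can be sketched with a deliberately simple classifier. The nearest-centroid model, the two-feature records and the "case" / "control" labels below are illustrative assumptions, not anything taken from the lectures.

```python
def train(features, labels):
    """Fit a nearest-centroid model: one mean feature vector per label."""
    grouped = {}
    for x, y in zip(features, labels):
        grouped.setdefault(y, []).append(x)
    return {
        y: [sum(col) / len(rows) for col in zip(*rows)]
        for y, rows in grouped.items()
    }


def predict(model, x):
    """Transform a feature vector into a label: the closest centroid wins."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda y: sq_dist(model[y]))


def accuracy(model, features, labels):
    hits = sum(predict(model, x) == y for x, y in zip(features, labels))
    return hits / len(labels)


# Hypothetical training data (features) with known outcomes (labels).
train_x = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]]
train_y = ["control", "control", "control", "case", "case", "case"]
# Held-out test data the model never saw during training.
test_x = [[1.0, 1.0], [3.0, 3.0], [0.9, 1.3], [2.0, 2.2]]
test_y = ["control", "case", "control", "control"]

model = train(train_x, train_y)
training_accuracy = accuracy(model, train_x, train_y)
held_out_accuracy = accuracy(model, test_x, test_y)
print(training_accuracy, held_out_accuracy)
```

The model fits its own training data perfectly but misclassifies one borderline held-out record, which is why validation uses data kept separate from training.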
  • 25. Examples of ML
      • Take gene expression profiles from patients and cluster to:
          • See genes with similar expression profiles
          • Similar patients
      • Train a model on radiographs with tumours labelled, use to diagnose unlabelled images
      • Find patients with similar symptoms & signs (computational phenotypes) in EHR (electronic health records)
      • Train on histories of patients to forecast their future condition
      • Find out how terms in a medical corpus relate to each other
  • 28. Clustering isn’t simple
      • What does ‘similar’ mean? How do we measure it?
      • Which features & how weighted?
      • Noise & overlapping clusters
      • Non-numeric, non-ordered data
      • What shapes can clusters be?
      • How many clusters? When do we stop?
      • …
      Image credit: By Chire - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17085331
  • 29. Clustering process
      Varies but:
      • Start with record-feature matrix
      • Normalise data
      • (“Supervised”: select number of clusters)
      • Run algorithm
      • Validate
      Image credit: WikiMedia Commons
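The clustering process above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions: k-means (Lloyd's algorithm) stands in for "the algorithm", z-scoring for "normalise", and within-cluster sum of squared errors for "validate". The patient matrix, the helper names and the rule of seeding centroids from the first k rows are all hypothetical choices.

```python
import statistics


def zscore_columns(rows):
    """Normalise each feature (column) to mean 0 and SD 1.
    Assumes no column is constant, otherwise the SD is zero."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, max_iter=100):
    """Plain Lloyd's algorithm; the first k rows seed the centroids (arbitrary)."""
    centroids = [list(r) for r in rows[:k]]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(r, centroids[c])) for r in rows
        ]
        if new_assignment == assignment:
            break  # converged: no record changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [r for r, a in zip(rows, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# 1. Record-feature matrix: hypothetical patients x (age, biomarker level).
matrix = [[30, 1.0], [62, 5.1], [32, 1.2], [60, 4.9], [29, 0.9], [65, 5.3]]
# 2. Normalise so that age does not dominate purely because of its scale.
scaled = zscore_columns(matrix)
# 3. Select the number of clusters, 4. run the algorithm.
cluster_ids, centroids = kmeans(scaled, k=2)
# 5. Validate, here with the within-cluster sum of squared errors.
sse = sum(sq_dist(r, centroids[c]) for r, c in zip(scaled, cluster_ids))
print(cluster_ids, round(sse, 3))
```

Different seeds or a different k can give different partitions, which is one reason the validation step matters.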
  • 30. How not to do it
  • 31. Validating clusters
      • A cluster partitioning is a hypothesis
      • How do we assess? Validate:
          • External: compare against external label or data
              • e.g. accuracy, entropy
          • Internal: goodness of clustering
              • e.g. sum squared errors, cluster cohesion & separation, silhouette
          • Relative: against another clustering scheme
              • e.g. is this better with 3 or 4 clusters
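As a sketch of external validation, the entropy measure mentioned on this slide can be computed by asking how mixed the known labels are within each cluster. The exact weighting used in the lectures is not stated; the size-weighted mean below is one common convention, and the labels are made up.

```python
import math
from collections import Counter


def cluster_entropy(cluster_ids, true_labels):
    """Size-weighted mean entropy (in bits) of the external labels within
    each cluster. 0 means every cluster contains a single label."""
    n = len(cluster_ids)
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, []).append(y)
    total = 0.0
    for members in by_cluster.values():
        counts = Counter(members)
        h = -sum(
            (k / len(members)) * math.log2(k / len(members))
            for k in counts.values()
        )
        total += len(members) / n * h
    return total


# Hypothetical clusterings compared against an external clinical label.
clean = cluster_entropy([0, 0, 0, 1, 1, 1], ["mild", "mild", "mild", "severe", "severe", "severe"])
mixed = cluster_entropy([0, 0, 1, 1], ["mild", "severe", "mild", "severe"])
print(clean, mixed)
```

A lower value means the clusters agree better with the external label; it says nothing about whether that label is the right thing to recover.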
  • 32. Clustering process
      Average over each point:
      1. Calculate the average distance to all other members of its cluster, a
      2. For each other cluster, calculate the average distance to every member. The minimum of these is b
      3. The silhouette width is (b−a) / max(a,b), the higher the better
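The three silhouette steps on this slide translate almost directly into code. This is a minimal sketch using Euclidean distance (an assumption, since the slide does not fix a distance measure) and toy points; the treatment of singleton clusters as width 0 is a common convention rather than something stated in the slides.

```python
import math


def silhouette_width(points, labels):
    """Mean silhouette width over all points, following the three steps."""
    def mean_dist(p, others):
        return sum(math.dist(p, q) for q in others) / len(others)

    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    widths = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Step 1: a = average distance to the other members of its own cluster.
        own = [q for j, (q, lab2) in enumerate(zip(points, labels))
               if lab2 == lab and j != i]
        if not own:
            widths.append(0.0)  # singleton cluster, by convention
            continue
        a = mean_dist(p, own)
        # Step 2: b = smallest average distance to any other cluster.
        b = min(mean_dist(p, members)
                for other, members in clusters.items() if other != lab)
        # Step 3: silhouette width for this point.
        denom = max(a, b)
        widths.append((b - a) / denom if denom else 0.0)
    return sum(widths) / len(widths)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_width(points, [0, 0, 1, 1])  # groups the nearby points
bad = silhouette_width(points, [0, 1, 0, 1])   # splits the nearby points
print(round(good, 3), round(bad, 3))
```

Values near 1 indicate tight, well-separated clusters; values below 0 suggest points sit closer to another cluster than to their own.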
  • 33. Nesting & hierarchies
      What if there are sub-clusters or structure?
      • Use hierarchical clustering
      • Use homogeneity or completeness metrics to compare
  • 34. Example: asthma
      • Complex, heterogeneous disease
      • Many attempts at clustering
      • Use transcriptomic & proteomic data
      • Validate with clinical data
      • 4 clusters with characteristic genes & clinical behaviour
  • 35. Supervised learning: deep networks
      • a.k.a. deep learning, (artificial) neural networks, “AI”
      • A series of layers of nodes, each of which transforms the previous layer.
      • Training sets weights on transformations
      • Capable of learning representations
      Image credit: WikiMedia Commons
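"A series of layers of nodes, each of which transforms the previous layer" can be sketched as a forward pass through a tiny network. The weights below are hand-picked for illustration; in practice training would set them. The layer sizes and the ReLU and sigmoid activations are assumptions, not details from the lectures.

```python
import math


def dense(inputs, weights, biases):
    """One layer: each node is a weighted sum of the previous layer plus a bias."""
    return [
        sum(w * x for w, x in zip(node_weights, inputs)) + b
        for node_weights, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, layers):
    """Pass x through every hidden layer, then squash the single output to (0, 1)."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return sigmoid(dense(x, weights, biases)[0])


# Hand-picked (untrained) weights: 2 inputs -> 2 hidden nodes -> 1 output score.
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 1.0]], [-1.0]),                    # output layer
]

score_unequal = forward([2.0, 0.0], layers)
score_equal = forward([1.0, 1.0], layers)
print(round(score_unequal, 3), round(score_equal, 3))
```

With these weights the hidden layer responds to the difference between the two inputs, so the network scores unequal inputs higher than equal ones. This is a toy version of a hidden layer building a higher-level feature from raw values.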
  • 36. What’s a representation?
      • There’s little information in an individual pixel (gene, data point …)
      • But individual data points make up more complete entities
      • Each layer takes the layer below and creates higher-level entities (representations) from it.
      • The system “recognises” higher-level features that can appear anywhere in the data.
      Image credit: WikiMedia Commons
  • 37. Example: diagnosis from medical imagery
      • Radiologists are overwhelmed
      • Want to catch errors & double-check
      • Train ANN over medical imagery with tumour labelled
      • Accuracy similar to humans
      Image credit: From Nvidia
  • 38. Problem: useless solutions
      • The model is right but learns the wrong thing (from our point of view)
      • Solutions:
          • Interpreting models
          • Better (more examined) data
      Image credit: Ribeiro et al. (2016) Why Should I Trust You?
  • 39. Problem: interpretability
      • Reversing the model & asking “why”
      • What features are important
      • Mechanistic insight
      • But many ML models are tangled & horribly complex
      • And ML community often uninterested
      • Solutions:
          • Choose an interpretable model
          • Software that explores feature space (LIME, Lift, IML)
  • 40. Problem: how do models get it wrong?
      • Bias (systematic error) vs. Variance (random error)
      • Want a model that captures the regularities in training data AND generalizes to unseen data.
      • Doing both perfectly is impossible
      • Solutions:
          • Use a variety of data
          • Feature selection
          • Regularization
      Image credit: From KDNuggets
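Regularization, the last solution listed on this slide, can be shown with the smallest possible case: a one-feature linear fit with an L2 (ridge) penalty. For a no-intercept model y ≈ w·x, minimising Σ(y − w·x)² + λ·w² gives w = Σxy / (Σx² + λ). The data and the penalty value are hypothetical, and ridge is only one of several regularizers.

```python
def ridge_slope(xs, ys, lam):
    """Slope of a no-intercept linear fit with an L2 penalty lam on the slope."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unpenalised = ridge_slope(xs, ys, lam=0.0)   # ordinary least squares
penalised = ridge_slope(xs, ys, lam=14.0)    # slope shrunk towards zero
print(unpenalised, penalised)
```

The penalty deliberately adds bias (the slope is pulled towards zero) in exchange for lower variance, so the fit changes less when the training sample changes.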
  • 41. Problem: how good do models have to be?
      • What do we want from our ML models?
          • Power / accuracy
          • Insight
          • Error tolerance
      • e.g. drug discovery vs drug safety
      Image credit: After Harel
  • 42. Problem: imbalanced data & lack of data
      • Much (most) data has few positives
      • Results in an imbalanced model
      • Solutions:
          • Over- and under-sampling
          • Pre-train with poor data
          • Ensemble methods
      Image credit: DataScience.com
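A minimal sketch of the over-sampling solution: randomly duplicate minority-class records until the classes are the same size. The function name, the seed and the toy records are assumptions; more elaborate schemes exist that synthesise new records rather than copying them.

```python
import random


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until every class
    is as large as the biggest one."""
    rng = random.Random(seed)
    by_label = {}
    for row, y in zip(rows, labels):
        by_label.setdefault(y, []).append(row)
    target = max(len(members) for members in by_label.values())
    out_rows, out_labels = [], []
    for y, members in by_label.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels


# Hypothetical dataset: 8 negatives, 2 positives.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

balanced_rows, balanced_labels = oversample_minority(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Resampling should be applied to the training data only. If duplicated records leak into the test set, the validation score will look better than it really is.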
  • 43. Summary: Machine Learning
      • Machine learning uses large amounts of data with few assumptions to make models that generalise that data
      • This is useful for situations where we don’t have an explicit model and just need ‘a’ solution.
      • But this means we need to examine our data and validate our solutions
      • A ‘bad’ solution can be useful, depending on what you want to achieve.