Big Data & ML for Clinical Data
1. Big Data & Machine Learning for
Clinical Data
Paul Agapow <p.agapow@imperial.ac.uk>
Data Science Institute, Imperial College London
2. Biomedical science is now data
science
I was a biochemist, immunologist,
and then an infectious disease
bioinformatician
I’m now a “biomedical data
scientist”
I will be a Health Informatics
Director at AstraZeneca
About me & these lectures
WikiMedia Commons
3. We increasingly use & need:
Lots of complex data
Real world evidence (outside RCTs)
Computational tools
Statistical analysis
Complex interactions
Precision medicine: prediction &
(sub)typing
Also:
Cheap
Successful in other domains
But lots of hype and jargon
Biomedical science is now data science
WikiMedia Commons
4. The world is increasingly
“datafied” – we make more and
bigger datasets
Devices
Routine collection
Aggregation & integration
Big Data is “too big” for
conventional approaches
Part 1: Big Data
WikiMedia Commons
5. “Quantity has a quality of its
own”
Often free
Real
Rich, deep, interactions
Needed for ML and other
assumption-light approaches
Why Big Data?
By Ender005 - Own work, CC BY-SA 4.0,
https://commons.wikimedia.org/w/index.php?curid=49888192
6. Many diseases with the same clinical presentation have different
molecular phenotypes
Several overlapping terms
stratified: separate patients into groups for treatment
precision:
tailor treatment to individual
improved targeted therapies with fewer side effects
“Right medication, right dose, right patient, right time, right route”
Also personalised, P4 …
E.g. asthma
Why Big Data? Precision medicine
7. Volume
Velocity
Variety
Veracity
Value
The 3 / 4 / 5 Vs of Big Data
By MuhammadAbuHijleh - Own work, CC BY-SA 4.0,
https://commons.wikimedia.org/w/index.php?curid=46431834
Limits are labile to technological
progress
Memory
Compute
Data schema
Solutions: distributed & parallel
computation, new high-end
databases
The problem with volume: tools & platforms
WikiMedia Commons
9. Multiple hypothesis testing
and false discovery
Bias: a sample is not the
population
The Past is not the Present
Observation without
understanding
The curse of dimensionality
Privacy
Some ML-specific issues
The problem with volume: methodology
From KDNuggets
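The multiple-testing problem above has standard mitigations. As an illustrative sketch (toy p-values, not from any real study), the Benjamini-Hochberg step-up procedure controls the false discovery rate:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected at false-discovery rate alpha
    (Benjamini-Hochberg step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose p-value is under the sliding threshold (k/m)*alpha
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    return sorted(order[:k_max])

# Five tests: a naive p < 0.05 cut-off would accept four; FDR control keeps two
p = [0.001, 0.008, 0.039, 0.041, 0.9]
rejected = benjamini_hochberg(p)
print(rejected)  # [0, 1]
```

Testing many hypotheses against a big dataset without such a correction is exactly how false discoveries happen.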
10. Many, many types of data
How do we use multiple types?
Which type do we use?
Disease is systemic
Interactions
Evidence
Solutions: integrated analysis,
independent analysis with
validation
The problem with variety
Wu, Sanin, Wang (2016) Clinical Applications and Systems
Biomedicine
11. Much biodata is uncertain
Noise
Mistakes
People lie
A sample is not a population
Incompatible systems
Most analyses are not reproducible
Solutions: imputation, standards,
cross-validation etc.
The problem with veracity
By Khaydock - Own work, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=25102900
12. How do we
Re-use data
Compare data
Store data from multiple sources
Even know what data is
FAIR, OHDSI / OMOP, HPO
Even just metadata helps for
cataloguing
But: multiple & incomplete
standards, translation, complexity
Solution: Standards & ontologies
WikiMedia Commons
13. Much data cannot leave its
home institution
Hospitals
Registries
Insurance companies
Governance is hard & slow
So take the analysis to the data
Data looks the same but may
be internally different
Solution: Federated analysis
International Collaboration for Autism Registry Epidemiology
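Federated analysis can be as simple as sharing aggregates instead of records. A minimal sketch, assuming each site computes a count, sum and sum of squares locally (function names are illustrative):

```python
def site_summary(values):
    """Each site shares only aggregate statistics, never raw patient records."""
    return {"n": len(values),
            "sum": sum(values),
            "sum_sq": sum(v * v for v in values)}

def pooled_mean_variance(summaries):
    """Combine per-site summaries into a global mean and (population) variance."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean ** 2
    return mean, variance

# Two "hospitals" analyse locally; only the summaries travel
a = site_summary([1.0, 2.0, 3.0])
b = site_summary([4.0, 5.0])
mean, variance = pooled_mean_variance([a, b])
print(mean, variance)  # identical to analysing all five values together
```

The same pattern scales up to regressions and model training, but the caveat on the slide applies: each site's data may look the same yet be coded or collected differently.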
14. In a vast sea of biodata, how do you
discover anything? How do you avoid
cherry-picking?
Solutions:
Distinguish discovery from
exploration
Non-parametric methods (e.g.
machine learning)
Some problems don’t have a single
solution but many (e.g. prediction)
The problem with it all: discoverability
EnterpriseKnowledge.com
15. Write analyses as recipes
Snakemake, Nextflow, Flowr
Use recreatable computational
systems
Docker
“Your biggest collaborator is
you, six months ago”
But: it’s work
Solution: Reproducibility
From RevolutionR
16. Big Data is “too big” for current conventional tools & practices
But it’s ideal for solving many biomedical problems
There are problems both with making valid discoveries and with simply
handling the data
Standards, distributed databases, federated analysis and reproducible
practice all help
Summary: Big Data
17. “a field of Artificial Intelligence”
“(the science of) getting computers to learn and act like humans do”
“getting computers to act without being explicitly programmed”
“computer systems that automatically improve with experience”
“neural networks”
“using statistical techniques to give computer systems the ability to
learn”
Part 2: Machine Learning
18. In practice:
broadly-defined set of
algorithms that recognise &
generalise patterns in data
“non-parametric” or
assumption-light
may require training over
initial dataset
What is Machine Learning?
By Chire - Own work, Public Domain,
https://commons.wikimedia.org/w/index.php?curid=11711077
19. Enough data
Enough compute
Technical progress
Need 'good enough'
solutions
Prediction & forecasting
Categorization
Pattern recognition
Early, startling success
Why now?
Ray Kurzweil The Singularity is Near
21. How is ML different to stats?
                Statistical       Machine Learning
Assumptions     strong            weak
Data            small             large
Optimize by     fitting           training
Solutions       “the best”        “good enough”
Hypothesis      proof             exploration
Test            p-values etc.     validation
22. In practice:
a field of scientific research
machine learning
neural networks
deep learning
more of an objective than a methodology
computational systems that duplicate / emulate / replace human effort
What is Artificial Intelligence?
23. • Many methods
• Broadly split into:
• Unsupervised: finds structure within data
• e.g. (most) clustering, self-organised maps, principal component
analysis
• Supervised: trained using labelled examples
• e.g. regression, decision trees, naive Bayes, neural networks
• Categories can blur
• e.g. k-means, nearest neighbour?
• Which is better?
What are ML methods?
24. • (Train a model from data)
• This model encapsulates or generalizes the data
• (Validate the model against test data)
• This model transforms features into labels
• Continuous outputs (e.g. real numbers) are regressions
• Discrete outputs (e.g. categories) are classifications
ML terms & process
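A toy illustration of the train/validate cycle for a classifier, here a 1-nearest-neighbour model on made-up one-feature data (everything below is illustrative, not from the lecture):

```python
def nearest_neighbour(train_x, train_y, x):
    """Classify x with the label of its closest training example (1-NN)."""
    return min((abs(tx - x), y) for tx, y in zip(train_x, train_y))[1]

# Train: features paired with labels (a discrete output, so a classification)
train_x, train_y = [1.0, 1.2, 4.0, 4.1], ["low", "low", "high", "high"]

# Validate: measure accuracy on held-out test data the model never saw
test_x, test_y = [0.9, 4.3], ["low", "high"]
preds = [nearest_neighbour(train_x, train_y, x) for x in test_x]
accuracy = sum(p == t for p, t in zip(preds, test_y)) / len(test_y)
print(accuracy)  # 1.0 on this toy split
```

Swapping the discrete labels for real numbers (and averaging neighbours) would turn the same scheme into a regression.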
25. • Take gene expression profiles from patients and cluster to:
• See genes with similar expression profiles
• Find similar patients
• Train a model on radiographs with tumours labelled, use to diagnose
unlabelled images
• Find patients with similar symptoms & signs (computational
phenotypes) in EHRs
• Train on histories of patients to forecast their future condition
• Find out how terms in a medical corpus relate to each other
Examples of ML
28. What does ‘similar’ mean? How
do we measure it?
Which features & how weighted?
Noise & overlapping clusters
Non-numeric, non-ordered data
What shapes can clusters be?
How many clusters? When do we
stop?
…
Clustering isn’t simple
By Chire - Own work, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=17085331
29. Varies but:
Start with record-feature matrix
Normalise data
(“Supervised”: select number of
clusters)
Run algorithm
Validate
Clustering process
WikiMedia Commons
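The steps above can be sketched as a minimal k-means implementation (toy 2-D points; a real analysis would normalise the record-feature matrix first and use a library such as scikit-learn):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign points to the nearest centre, move centres."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: nearest centre by squared Euclidean distance
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centres[c])))
            clusters[i].append(p)
        # Update step: each centre becomes the mean of its assigned points
        for i, cl in enumerate(clusters):
            if cl:
                centres[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centres, clusters

# Two well-separated blobs of three points each
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centres, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Note that k (the number of clusters) is supplied up front, which is exactly the "supervised" step flagged on the slide; the validation step follows below.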
31. A cluster partitioning is a hypothesis
How do we assess? Validate:
External: compare against external label or data
e.g. accuracy, entropy
Internal: goodness of clustering
e.g. sum squared errors, cluster cohesion & separation,
silhouette
Relative: against another clustering scheme
e.g. is this better with 3 or 4 clusters
Validating clusters
32. Average over each point:
1. Calculate the average distance to all
other members of its cluster, a
2. For each other cluster, calculate the
average distance to every member.
The minimum of these is b
3. The silhouette width is (b−a) /
max(a,b), the higher the better
Clustering process
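The three steps above translate directly into code; a minimal sketch with toy 2-D points and Euclidean distance assumed:

```python
def euclid(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

def silhouette(point, own_cluster, other_clusters, dist=euclid):
    """Silhouette width of one point: (b - a) / max(a, b)."""
    # Step 1: average distance to the rest of the point's own cluster (a)
    others = [p for p in own_cluster if p != point]
    a = sum(dist(point, p) for p in others) / len(others)
    # Step 2: smallest average distance to any other cluster (b)
    b = min(sum(dist(point, p) for p in cl) / len(cl)
            for cl in other_clusters)
    # Step 3: silhouette width, the higher the better (max is 1)
    return (b - a) / max(a, b)

tight = [(0, 0), (0, 1), (1, 0)]
far = [(9, 9), (9, 10)]
s = silhouette((0, 0), tight, [far])
print(round(s, 3))  # close to 1: the point sits deep in its own cluster
```

Averaging this width over all points gives an internal validation score for the whole partitioning, usable to compare, say, 3 clusters against 4.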
33. What if there are sub-clusters or
structure?
• Use hierarchical clustering
• Use homogeneity or
completeness metrics to
compare
Nesting & hierarchies
34. • Complex, heterogeneous
disease
• Many attempts at clustering
• Use transcriptomic &
proteomic data
• Validate with clinical data
• 4 clusters with characteristic
genes & clinical behaviour
Example: asthma
35. a.k.a. deep learning, (artificial)
neural networks, “AI”
A series of layers of nodes, each of
which transforms the previous layer.
Training sets weights on
transformations
Capable of learning representations
Supervised learning: deep networks
WikiMedia Commons
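A minimal sketch of such a network's forward pass, with hand-picked weights standing in for trained ones (sigmoid activation assumed; real deep networks have many more layers and units):

```python
import math

def layer(inputs, weights, biases):
    """One dense layer: weighted sum of the previous layer, then a sigmoid."""
    return [1 / (1 + math.exp(-(sum(w * x for w, x in zip(ws, inputs)) + b)))
            for ws, b in zip(weights, biases)]

# A tiny 2-input -> 2-hidden -> 1-output network; training would set these
# weights by adjusting them to reduce error on labelled examples
hidden = layer([1.0, 0.0],
               weights=[[2.0, -1.0], [-1.0, 2.0]],
               biases=[0.0, 0.0])
output = layer(hidden,
               weights=[[1.5, 1.5]],
               biases=[-1.5])
print(output)  # a single value between 0 and 1
```

Each call to `layer` is one transformation of the previous layer, which is all "a series of layers of nodes" means in code.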
36. There’s little information in an
individual pixel (gene, data point …)
But individual data points make up
more complete entities
Each layer takes the layer below and
creates higher-level entities
(representations) from it.
The system “recognises” higher-
level features that can appear
anywhere in the data.
What’s a representation?
WikiMedia Commons
37. Radiologists are overwhelmed
Want to catch errors &
double-check
Train ANN over medical
imagery with tumours labelled
Accuracy similar to humans
Example: diagnosis from medical imagery
From Nvidia
38. • The model is right but learns
the wrong thing (from our
point of view)
• Solutions:
• Interpreting models
• Better (more examined) data
Problem: useless solutions
Ribeiro et al. (2016) Why Should I Trust You?
39. Reversing the model & asking “why”
What features are important
Mechanistic insight
But many ML models are tangled & horribly complex
And ML community often uninterested
Solutions:
Choose an interpretable model
Software that explores feature space (LIME, Lift, IML)
Problem: interpretability
40. • Bias (systematic error) vs. Variance
(random error)
• Want a model that captures the
regularities in training data AND
generalizes to unseen data.
• Doing both perfectly is impossible (the bias-variance trade-off)
• Solutions:
• Use a variety of data
• Feature selection
• Regularization
Problem: how do models get it wrong?
From KDNuggets
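Regularization can be illustrated with ridge (L2-penalised) regression fitted by gradient descent; the penalty shrinks the weight towards zero, trading a little bias for lower variance (toy data, illustrative `fit_ridge` helper):

```python
def fit_ridge(xs, ys, lam=0.1, lr=0.01, steps=2000):
    """Fit y ~ w*x by gradient descent with an L2 penalty lam * w**2."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of mean squared error plus the penalty term
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n + 2 * lam * w
        w -= lr * grad
    return w

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # true slope is 2
w_plain = fit_ridge(xs, ys, lam=0.0)  # unregularized: recovers ~2
w_reg = fit_ridge(xs, ys, lam=1.0)    # penalised: pulled below 2
print(w_plain, w_reg)
```

On noisy, high-dimensional biodata the deliberately "too small" weights generalize better than the exact fit, which is the point of the slide.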
41. • What do we want from our ML
models?
• Power / accuracy
• Insight
• Error tolerance
• e.g. drug discovery vs drug safety
Problem: how good do models have to be?
After Harel
42. • Much (most) data has few positives
• Results in an imbalanced model
• Solutions:
• Over- and under-sampling
• Pre-train with poor data
• Ensemble methods
Problem: imbalanced data & lack of data
DataScience.com
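Random oversampling, the simplest of the solutions above, just duplicates minority-class records until the classes balance (an illustrative sketch; libraries such as imbalanced-learn offer subtler methods like SMOTE):

```python
import random

def oversample(records, labels, seed=0):
    """Duplicate minority-class records until every class matches the
    largest one. (Under-sampling would instead drop majority records.)"""
    rng = random.Random(seed)
    by_class = {}
    for r, y in zip(records, labels):
        by_class.setdefault(y, []).append(r)
    target = max(len(rs) for rs in by_class.values())
    out_records, out_labels = [], []
    for y, rs in by_class.items():
        extra = [rng.choice(rs) for _ in range(target - len(rs))]
        for r in rs + extra:
            out_records.append(r)
            out_labels.append(y)
    return out_records, out_labels

records = ["a", "b", "c", "d", "e"]
labels = [0, 0, 0, 0, 1]  # only one positive case
r2, y2 = oversample(records, labels)
print(y2.count(0), y2.count(1))  # 4 4
```

Without this, a model can score 80% accuracy here by always predicting the majority class, which is the imbalance problem in miniature.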
43. Machine learning uses large amounts of data with few assumptions to
make models that generalise that data
This is useful for situations where we don’t have an explicit model and
just need ‘a’ solution.
But this means we need to examine our data and validate our
solutions
A ‘bad’ solution can be useful, depending on what you want to
achieve.
Summary: Machine Learning