13. Case Study: Harvard Med - Proteomics of Alzheimer’s
“The Tau protein is a biomarker of Alzheimer’s Disease (AD). It acts like a cast that holds a neuron together. Its degradation spreads from the stem of the brain to other regions. No one knows why, and there is no diagnosis process and no drug to stop it.
“We aggregated healthy and diseased Tau samples from 5 institutes to study AD progression. Using mass spectrometry, the sites within each sample have been scanned for post-translational modifications (PTMs).”
Which PTMs at which sites are driving the disease?
14. Ranks the type and location of the post-translational modifications (PTMs) that drive Alzheimer’s.
It’s largely phosphorylation & ubiquitination sites in the middle of the peptide. This insight can be used to design treatments that help prevent the degradation of the Tau protein.
[Chart: Feature Importance. Each feature is the site & type of a protein modification; the most important modifications are annotated.]
Pho=P02662:115
Pho=Q14195:622
Gly=P37837:277
Ace=P04406:215
Pho=P10636:282
Pho=Q16555:485
Pho=P29966:101
Pho=P10636:231
Gly=P0CG48:63
Pho=P10636:217
Pho=P10636:181
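The feature labels in the chart appear to follow the pattern <modification>=<protein accession>:<site>, for example Pho=P10636:231, where P10636 is the UniProt accession for human Tau. That reading of the label layout is an assumption, and the names PtmFeature and parse_ptm_label are hypothetical helpers rather than part of any tool shown in the deck. Under those assumptions, a minimal Python sketch for splitting the labels into their parts and tallying them:

```python
import re
from collections import Counter
from typing import NamedTuple

class PtmFeature(NamedTuple):
    modification: str  # abbreviated modification type, e.g. "Pho"
    accession: str     # protein identifier, e.g. "P10636"
    site: int          # residue position of the modification

# Assumed label layout: <modification>=<accession>:<site>
LABEL_PATTERN = re.compile(r"^(?P<mod>[A-Za-z]+)=(?P<acc>[A-Z0-9]+):(?P<site>\d+)$")

def parse_ptm_label(label: str) -> PtmFeature:
    """Split one feature label into modification type, accession and site."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"Unrecognized PTM label: {label!r}")
    return PtmFeature(match["mod"], match["acc"], int(match["site"]))

# Labels copied from the slide, in the order they are listed.
labels = [
    "Pho=P02662:115", "Pho=Q14195:622", "Gly=P37837:277", "Ace=P04406:215",
    "Pho=P10636:282", "Pho=Q16555:485", "Pho=P29966:101", "Pho=P10636:231",
    "Gly=P0CG48:63", "Pho=P10636:217", "Pho=P10636:181",
]
features = [parse_ptm_label(label) for label in labels]

# How many listed features belong to each modification type.
counts_by_type = Counter(f.modification for f in features)

# Sites that fall on the accession P10636.
tau_sites = sorted(f.site for f in features if f.accession == "P10636")

print(counts_by_type)
print(tau_sites)
```

Grouping by accession this way makes it easy to see which of the listed modifications fall on Tau itself and which fall on other proteins.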
17. Limitations of Association Studies (GWAS)
■ Not multi-modal 📷
■ Not multi-label (subtypes, phases) 🐁
■ Not longitudinal ⏱
■ Not unified model: “Many hypotheses” 🍂
■ No predictive algorithm (although PRS possible) 🔮
■ Not designed for parallelization 🔀
18. Neural Networks are Flexible
Information: 📷 ⏱ 🧮
Turing Award-Winning Architectures + Automatic Differentiation
Information: 🔠 🧮 🔢
Versus the latest task-specific statistical tools (e.g. nth fine-mapping tool)
19. Deep Learning Answers Deeper Questions
What is it? How much of it?
Binary
■ Survival
■ Malignancy
Multi-Label
■ Subtyping
■ Progression
Regression
■ Expression
■ Toxicity
Forecast
■ Remission
■ Age of Onset
20. 🔨 Workflows vary based on data and analysis type
❄ Each team member manually patches together their own glue code
21. 🪤 Pitfalls to Prevent with Quality Control (QC)
■ Data Leakage 🚰
■ Model Overfitting 🐍
■ Evaluation Bias
■ Pipeline Not Reusable ❄
■ Data Drift 🌊
■ Model Rot 🍄
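Data leakage, the first pitfall listed, often creeps in when preprocessing statistics are computed on all samples before the data is split. The sketch below is one illustration of the idea, not AIQC's API: the function names and the measurements are made up, and standardization stands in for any fitted preprocessing step.

```python
import statistics

def fit_standardizer(train_values):
    """Learn the mean and standard deviation from the training split only."""
    mean = statistics.fmean(train_values)
    stdev = statistics.pstdev(train_values)
    return mean, stdev

def apply_standardizer(values, mean, stdev):
    """Scale any split with statistics that were learned from the training split."""
    return [(value - mean) / stdev for value in values]

# Made-up measurements; `test` plays the role of the held-out split.
train = [2.0, 4.0, 6.0, 8.0]
test = [10.0, 12.0]

# Correct: fit on the training split, then reuse those statistics everywhere.
mean, stdev = fit_standardizer(train)
train_scaled = apply_standardizer(train, mean, stdev)
test_scaled = apply_standardizer(test, mean, stdev)

# Leaky: statistics computed on train + test let held-out values shape training.
leaky_mean, leaky_stdev = fit_standardizer(train + test)

print(mean, leaky_mean)
```

The two means differ because the leaky version has already "seen" the held-out values, which tends to make evaluation scores look better than they would on truly new data.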
22. 🎪 Data Juggling Demands Systematic Approach
■ Encoding multiple stratified splits & cross-validation folds
■ Sliding time series windows
■ Multiple array dimensions (sklearn designed for 2D)
■ Training & evaluating many models with many hyperparameters
■ Multiple preprocesses each with multiple column filters
■ Pre/post-processing during inference 6 months later
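Two of the items above, stratified folds and sliding time series windows, can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the function names stratified_folds and sliding_windows are hypothetical, the labels and series are made up, and a real project would more likely rely on an established library.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, seed=0):
    """Assign sample indices to folds so class proportions stay similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        # (When class sizes are not divisible by n_folds, earlier folds get the extras.)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

def sliding_windows(series, width, step=1):
    """Return overlapping fixed-width windows over a time series."""
    return [series[start:start + width]
            for start in range(0, len(series) - width + 1, step)]

# Made-up labels: 6 "diseased" and 3 "healthy" samples split into 3 folds.
labels = ["diseased"] * 6 + ["healthy"] * 3
folds = stratified_folds(labels, n_folds=3)

windows = sliding_windows([1, 2, 3, 4, 5, 6], width=3, step=1)
print(windows)  # [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
```

Each fold here ends up with two "diseased" and one "healthy" sample, mirroring the 2:1 ratio of the full set.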
33. Example: Tumor Classification based on Gene Expression Profiles in TCGA
■ Cohort of 800 participants with
expression profiles of 20,532 genes.
■ Predict the type of tumor observed:
BRCA, KIRC, LUAD, or PRAD.
■ Rank the genes.
[notebook, data]
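The slide does not say how the genes are ranked. One common, model-agnostic option is permutation importance: shuffle one feature at a time and measure how much held-out accuracy drops. The sketch below assumes that technique and uses synthetic stand-in data with 3 "genes" (not the TCGA cohort) and a deliberately simple nearest-centroid classifier; every function name is hypothetical.

```python
import random

TUMOR_TYPES = ["BRCA", "KIRC", "LUAD", "PRAD"]

def make_synthetic_cohort(n_per_class, rng):
    """Synthetic stand-in data: gene 0 tracks the tumor type, genes 1-2 are noise."""
    X, y = [], []
    for class_index, tumor in enumerate(TUMOR_TYPES):
        for _ in range(n_per_class):
            informative = class_index * 10 + rng.uniform(-1, 1)
            X.append([informative, rng.uniform(-1, 1), rng.uniform(-1, 1)])
            y.append(tumor)
    return X, y

def fit_centroids(X, y):
    """Nearest-centroid 'model': the per-class mean of every feature."""
    grouped = {}
    for row, label in zip(X, y):
        grouped.setdefault(label, []).append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()}

def predict(centroids, row):
    """Return the label whose centroid is closest to the row."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

def accuracy(centroids, X, y):
    return sum(predict(centroids, row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(centroids, X, y, rng, n_repeats=5):
    """Mean drop in accuracy when one feature column is shuffled at a time."""
    baseline = accuracy(centroids, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_shuffled = [row[:j] + [value] + row[j + 1:]
                          for row, value in zip(X, column)]
            drops.append(baseline - accuracy(centroids, X_shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances

rng = random.Random(0)
X_train, y_train = make_synthetic_cohort(40, rng)
X_test, y_test = make_synthetic_cohort(40, rng)  # held-out samples for scoring
model = fit_centroids(X_train, y_train)
importances = permutation_importance(model, X_test, y_test, rng)

# Feature indices ordered from most to least important.
ranking = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
print(ranking[0])
```

Scoring on held-out samples rather than the training samples keeps the ranking from rewarding features the model merely memorized.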
37. Other Biomedical Examples
Dataset.Image
Detecting brain tumors from MRI scans
[notebook]
Dataset.Sequence
Detecting epileptic seizures from EEG time series
[notebook]
38. Example: Compound Classification based on High Throughput Screening
■ Screened 60K compounds for 200
structural characteristics.
■ Predict whether the compound is
effective (active vs inactive).
Imbalanced: only 0.6% active.
■ Rank the structural characteristics.
■ Simulate new compounds by tweaking those characteristics.
[notebook, data]
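The 0.6% active rate has two practical consequences that a little arithmetic makes concrete. The sketch below assumes "60K" means exactly 60,000 compounds, and it uses the common "balanced" weighting heuristic, n_samples / (n_classes * class_count), as one possible remedy; the slide does not state which approach the notebook takes, and the function name is hypothetical.

```python
n_compounds = 60_000          # assumption: "60K" taken as exactly 60,000
active_fraction = 0.006       # 0.6% active, from the slide

n_active = round(n_compounds * active_fraction)
n_inactive = n_compounds - n_active

# A model that always predicts "inactive" is right for every inactive compound,
# so plain accuracy looks excellent even though no active compound is found.
always_inactive_accuracy = n_inactive / n_compounds

def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

weights = balanced_class_weights({"active": n_active, "inactive": n_inactive})

print(n_active, always_inactive_accuracy)
print(weights)
```

With these numbers there are 360 active compounds, a trivial "always inactive" model scores 99.4% accuracy, and each active example is weighted roughly 166 times more heavily than an inactive one. This is why metrics other than accuracy, and stratified splits, matter for a screen like this.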
49. Partner with Cloud Platforms to Bring ML to Genomics
Process Omics & Design Cohort + Analyze Cohort
50. Big Pharma is Partnering with Startups to Gain AI Capabilities
PRESCIENT DESIGN
Presents barrier (ML hurdle) for early-stage labs/biotechs
51. AIQC is the Seed Around which Labs/Biotechs can Develop ML Capabilities
Problem: Competing for ML talent
Problem: Budgeting for ML talent
Problem: Bioinformaticians aren’t ML experts
Problem: Expensive to build in-house ML platform
Long-Term Solution: As the biotech company scales, adopt AIQC platform & depend less on professional services
Problem: How to adopt ML to accelerate research?
Near-Term Solution: AIQC tool + AIQC services