AIQC - ISCB 2022.pdf

Accelerating research with an open-source, declarative framework for deep learning experiments

Agenda
1. Overview
2. Origin
3. Pain points
4. Framework solution
5. Genes & drugs
6. Big picture
Links
■ Slides: bit.ly/aiqc-bosc
■ github.com/AIQC/aiqc
■ docs.aiqc.io
■ layne@aiqc.io

Prepare Data
■ Register datasets
■ Preprocess data
■ Stratify samples
Track Experiments
■ Train models
■ Evaluate models
■ Tune parameters
Gain Insight
■ Rank features
■ Decode results
■ Make predictions
User interface
Zero setup SQLite database

Unifies the PyData Ecosystem with the Deep Learning Ecosystem

Cohort Analysis Experience
GORdb

Case Study: Harvard Med - Proteomics of Alzheimer’s
“The Tau protein is a biomarker of Alzheimer’s Disease (AD). It acts like a cast that holds a neuron together. Its degradation spreads from the stem of the brain to other regions. No one knows why, there is no diagnostic process, and there is no drug to stop it.”
“We aggregated healthy and diseased Tau samples from 5 institutes to study AD progression. Using mass spectrometry, the sites within each sample have been scanned for post-translational modifications (PTMs). Which PTMs at which sites are driving the disease?”

Ranks the type and location of the post-translational modifications (PTMs) that drive Alzheimer’s. It’s largely phosphorylation & ubiquitination sites in the middle of the peptide. This insight can be used to design treatments that help prevent the degradation of the Tau protein.
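[Figure: feature importance chart. Axis labels: “Feature Importance” and “Site & type of protein modification”. Callout: “Most important modifications”.]
Modifications shown on the chart, in the order listed:
■ Pho=P02662:115
■ Pho=Q14195:622
■ Gly=P37837:277
■ Ace=P04406:215
■ Pho=P10636:282
■ Pho=Q16555:485
■ Pho=P29966:101
■ Pho=P10636:231
■ Gly=P0CG48:63
■ Pho=P10636:217
■ Pho=P10636:181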
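The slide does not say how feature importance was computed. One common, model-agnostic way to rank features is permutation importance: shuffle one feature column at a time and measure how much the prediction error grows. The sketch below is illustrative only; the toy `predict` function, the data, and the helper name `permutation_importance` are assumptions, not the method or API used in the case study.

```python
import random

def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Score each feature by the increase in mean squared error
    observed after that feature's column is shuffled."""
    rng = random.Random(seed)

    def error(data):
        return sum((predict(r) - t) ** 2 for r, t in zip(data, targets)) / len(targets)

    baseline = error(rows)
    scores = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        scores.append(error(shuffled) - baseline)
    return scores

# Hypothetical toy model: depends on feature 0 and ignores feature 1.
def predict(row):
    return 3.0 * row[0]

rows = [[float(i), float(i % 3)] for i in range(20)]
targets = [predict(r) for r in rows]

scores = permutation_importance(predict, rows, targets, n_features=2)
ranking = sorted(range(2), key=lambda j: scores[j], reverse=True)
print(ranking)  # feature indices, most important first
```

A feature the model ignores gets a score of zero, because shuffling it cannot change any prediction.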

Galton’s regression of pea size inheritance

Limitations of Association Studies (GWAS)
■ Not multi-modal 📷
■ Not multi-label (subtypes, phases) 🐁
■ Not longitudinal ⏱
■ Not a unified model (“many hypotheses”) 🍂
■ No predictive algorithm (although PRS possible) 🔮
■ Not designed for parallelization 🔀

Neural Networks are Flexible
[Diagram: varied inputs (📷 ⏱ 🧮) and outputs (🔠 🧮 🔢), each labeled “Information”]
Turing Award-winning architectures + automatic differentiation
Versus the latest task-specific statistical tools (e.g. the nth fine-mapping tool)

Deep Learning Answers Deeper Questions
What is it? How much of it?
Binary
■ Survival
■ Malignancy
Multi-Label
■ Subtyping
■ Progression
Regression
■ Expression
■ Toxicity
Forecast
■ Remission
■ Age of Onset

🔨 Workflows vary based on data and analysis type
❄ Each team member manually patches together their own glue code

🪤 Pitfalls to Prevent with Quality Control (QC)
■ Data Leakage 🚰
■ Model Overfitting 🐍
■ Evaluation Bias
■ Pipeline Not Reusable ❄
■ Data Drift 🌊
■ Model Rot 🍄
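Data leakage, the first pitfall, often comes from fitting a preprocessing step on the full dataset before splitting it. A minimal sketch of the leakage-free pattern, assuming a simple standardization step: the statistics are learned from the training split only and then reused on the held-out split. The helper names `fit_scaler` and `apply_scaler` are hypothetical and the numbers are toy data.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn mean and standard deviation from the training values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 if the variance is zero
    return mu, sigma

def apply_scaler(values, mu, sigma):
    """Standardize values with statistics learned elsewhere."""
    return [(v - mu) / sigma for v in values]

train = [2.0, 4.0, 6.0, 8.0]
test = [10.0, 12.0]

# Fit on train, then apply the same parameters to both splits.
mu, sigma = fit_scaler(train)
train_scaled = apply_scaler(train, mu, sigma)
test_scaled = apply_scaler(test, mu, sigma)
```

Fitting on `train + test` instead would let information about the held-out samples influence training, which inflates evaluation metrics.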

🎪 Data Juggling Demands a Systematic Approach
■ Encoding multiple stratified splits & cross-validation folds
■ Sliding time series windows
■ Multiple array dimensions (sklearn designed for 2D)
■ Training & evaluating many models with many hyperparameters
■ Multiple preprocesses, each with multiple column filters
■ Pre/post-processing during inference 6 months later
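As an illustration of one item above, sliding time series windows, here is a minimal sketch in plain Python. The function name `sliding_windows` and its parameters are hypothetical and are not taken from AIQC.

```python
def sliding_windows(series, window_size, step=1):
    """Return overlapping windows of length window_size, advancing by step."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    last_start = len(series) - window_size
    return [series[i:i + window_size] for i in range(0, last_start + 1, step)]

readings = [0, 1, 2, 3, 4, 5]
windows = sliding_windows(readings, window_size=3)

# One way to use the windows for forecasting: earlier points are the
# features and the final point of each window is the target.
features = [w[:-1] for w in windows]
targets = [w[-1] for w in windows]
```

A series shorter than `window_size` yields no windows, so callers should check for an empty result.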

Skillset Trifecta
■ Bioinformatics
■ Data Science
■ Software Engineering

Building Blocks for Machine Learning
Goodbye X_train, y_test
[Diagram of objects: Dataset, Feature, Label, Encoder, Splitset, Algorithm, Params, Queue, Job, Predictor, Prediction]

Example: Tumor Classification based on Gene Expression Profiles in TCGA
■ Cohort of 800 participants with expression profiles of 20,532 genes.
■ Predict the type of tumor observed: BRCA, KIRC, LUAD, or PRAD.
■ Rank the genes.
[notebook, data]

Cross-validation is just `fold_count=n`
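To show the kind of bookkeeping that a single `fold_count` setting spares the user, here is a hand-rolled sketch of stratified cross-validation folds in plain Python. It does not use or reproduce the AIQC API; the function `stratified_folds` and the toy label counts are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_folds(labels, fold_count, seed=0):
    """Split sample indices into fold_count folds with similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(fold_count)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % fold_count].append(index)
    return folds

# Toy labels only; these counts are not the TCGA cohort.
labels = ["BRCA"] * 6 + ["KIRC"] * 6 + ["LUAD"] * 3 + ["PRAD"] * 3
folds = stratified_folds(labels, fold_count=3)

for held_out in range(len(folds)):
    validation = folds[held_out]
    training = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
    # A model would be trained on `training` and evaluated on `validation` here.
```

Each sample lands in exactly one validation fold, and every fold keeps roughly the class proportions of the whole dataset.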

Other Biomedical Examples
■ Dataset.Image: Detecting brain tumors from MRI scans [notebook]
■ Dataset.Sequence: Detecting epileptic seizures from EEG time series [notebook]

Example: Compound Classification based on High Throughput Screening
■ Screened 60K compounds for 200 structural characteristics.
■ Predict whether the compound is effective (active vs inactive). Imbalanced: only 0.6% active.
■ Rank the structural characteristics.
■ Simulate new compounds by tweaking those characteristics.
[notebook, data]
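With only 0.6% of compounds active, a classifier can score well by always predicting "inactive". One common countermeasure, shown here as a sketch and not necessarily what the notebook does, is to weight each class inversely to its frequency. The counts assume exactly 60,000 compounds (so 360 active), and the helper `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by total / (number_of_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * count) for label, count in counts.items()}

# Assumed counts: 0.6% of 60,000 compounds are active.
labels = ["active"] * 360 + ["inactive"] * 59_640
weights = balanced_class_weights(labels)

for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

The resulting mapping can typically be passed to a training routine that accepts per-class weights, so that errors on the rare active class cost more than errors on the inactive class.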

Partner with Cloud Platforms to Bring ML to Genomics
+
Process Omics & Design Cohort
Analyze Cohort

Big Pharma is Partnering with Startups to Gain AI Capabilities
PRESCIENT DESIGN
Presents a barrier (ML hurdle) for early-stage labs/biotechs

AIQC is the Seed Around which Labs/Biotechs can Develop ML Capabilities
■ Problem: How to adopt ML to accelerate research?
■ Problem: Competing for ML talent
■ Problem: Budgeting for ML talent
■ Problem: Bioinformaticians aren’t ML experts
■ Problem: Expensive to build in-house ML platform
■ Near-Term Solution: AIQC tool + AIQC services
■ Long-Term Solution: As the biotech company scales, adopt AIQC platform & depend less on professional services

Links
■ Slides: bit.ly/aiqc-bosc
■ github.com/aiqc/aiqc
■ docs.aiqc.io
■ layne@aiqc.io
