SlideShare une entreprise Scribd logo
1  sur  30
Télécharger pour lire hors ligne
Building useful models for
imbalanced datasets
(without resampling)
Greg Landrum
Feb 2019
T5 Informatics GmbH
greg.landrum@t5informatics.com
@dr_greg_landrum
2T5 Informatics
First things first
● RDKit blog post with initial work:
http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html
● The notebooks I used for this presentation are all in Github:
○ Original notebook: https://bit.ly/2UY2u2K
○ Using the balanced random forest: https://bit.ly/2tuafSc
○ Plotting: https://bit.ly/2GJSeHH
● I have a KNIME workflow that does the same thing. Let me know if you're
interested
● Download links for the datasets are in the blog post
3T5 Informatics
The problem
● Typical datasets for bioactivity prediction tend to have way more inactives
than actives
● This leads to a couple of pathologies:
○ Overall accuracy is really not a good metric for how useful a model is
○ Many learning algorithms produce way too many false negatives
4T5 Informatics
The accuracy problem
Assay: CHEMBL1614421, PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation
Predicted inactive Predicted active
Measured inactive 8681 4
Measured active 1102 22
Accuracy is high, but the model is pretty useless for predicting actives
Overall accuracy: 88.7%
5T5 Informatics
The accuracy problem
Assay: CHEMBL1614421, PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation
Predicted inactive Predicted active
Measured inactive 8681 4
Measured active 1102 22
Overall accuracy: 88.7%
kappa: 0.033
This one has an easy solution: use Cohen's kappa
https://en.wikipedia.org/wiki/Cohen%27s_kappa
https://link.springer.com/article/10.1007/s10822-014-9759-6
https://www.researchgate.net/publication/258240105_TheKappaStatistic_PaulCzodrowski
kappa makes the problem obvious
6T5 Informatics
Is that model actually useless?
Assay: CHEMBL1614421, PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation
Predicted inactive Predicted active
Measured inactive 8681 4
Measured active 1102 22
Overall accuracy: 88.7%
kappa: 0.033
AUC: 0.850
The ROC curve shows that the model has actually
learned something about the bioactivity. Why
aren't we seeing that in the predictions?
7T5 Informatics
Quick diversion on bag classifiers
When making predictions, each tree in the
classifier votes on the result.
Majority wins
In scikit-learn the predicted class probabilities are
the means of the predicted probabilities from the
individual trees
We construct the ROC curve by sorting the
predictions in decreasing order of predicted
probability of being active.
Note that actual predictions are irrelevant for a ROC curve.
As long as true actives tend to have a higher predicted
probability of being active than true inactives the AUC will
be good.
8T5 Informatics
Ok, but who cares?
9T5 Informatics
An idea
● The standard decision rule for a random forest (or any bag classifier) is that
the majority wins1
, i.e. at the predicted probability of being active must be
>=0.5 in order for the model to predict "active"
● Shift that threshold to a lower value for models built on highly imbalanced
datasets2
1
This is only strictly true for binary classifiers
2
F. Provost, T. Fawcett. "Robust classification for imprecise environments." Machine learning 42:203-31 (2001).
10T5 Informatics
How do we come up with a new decision threshold?
1. Generate a random forest for the dataset using a training set
2. Generate out-of-bag predicted probabilities using the training set
3. Try a number of different decision thresholds1
and pick the one that gives the
best kappa
Once we have the decision threshold, use it to generate predictions for the test
set.
1
Here we use: [0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ]
11T5 Informatics
Changing the decision threshold
Assay: CHEMBL1614421, PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation
Threshold = 0.5 Predicted inactive Predicted active
Measured inactive 8681 4
Measured active 1101 22
Overall accuracy: 88.7%
kappa: 0.033
Threshold = 0.2 Predicted inactive Predicted active
Measured inactive 8261 424
Measured active 595 528
Overall accuracy: 89.6%
kappa: 0.451kappa has improved dramatically
12T5 Informatics
Why does it work?
Prob(active)
Superimpose Prob(active) on the ROC curve
13T5 Informatics
Why does it work?
The predicted probabilities of being active are clearly carrying
significant information about activity
Prob(active)
Superimpose Prob(active) on the ROC curve Plot TPR and FPR vs -Prob(active)
14T5 Informatics
It works!
● By shifting the decision threshold from 0.5->0.2 we managed to improve the
predictivity of the model, as measured by kappa, from 0.033 to 0.451
● We did not need to retrain/change the model at all.
Yay! We're done!
15T5 Informatics
It works!
● By shifting the decision threshold from 0.5->0.2 we managed to improve the
predictivity of the model, as measured by kappa, from 0.033 to 0.451
● We did not need to retrain/change the model at all.
Not quite done yet!
● This is just one model. Does this approach work broadly?
● Let's try a broad collection of datasets
16T5 Informatics
The datasets (all extracted from ChEMBL_24)
● "Serotonin": 6 datasets with >900 Ki values for human serotonin receptors
○ Active: pKi > 9.0, Inactive: pKi < 8.5
○ If that doesn't yield at least 50 actives: Active: pKi > 8.0, Inactive: pKi < 7.5
● "DS1": 80 "Dataset 1" sets.1
○ Active: 100 diverse measured actives ("standard_value<10uM"); Inactive: 2000 random
compounds from the same property space
● "PubChem": 8 HTS Validation assays with at least 3K "Potency" values
○ Active: "active" in dataset. Inactive: "inactive", "not active", or "inconclusive" in dataset
● "DrugMatrix": 44 DrugMatrix assays with at least 40 actives
○ Active: "active" in dataset. Inactive: "not active" in dataset
1
S. Riniker, N. Fechner, G. A. Landrum. "Heterogeneous classifier fusion for ligand-based virtual screening: or,
how decision making by committee can be a good thing." Journal of chemical information and modeling
53:2829-36 (2013).
17T5 Informatics
Model building and validation
● Fingerprints: 2048 bit MorganFP radius=2
● 80/20 training/test split
● Random forest parameters:
○ cls = RandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2,
n_jobs=4, oob_score=True)
● Try threshold values of [0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ]
with out-of-bag predictions and pick the best based on kappa
● Generate initial kappa value for the test data using threshold = 0.5
● Generate "balanced" kappa value for the test data with the optimized
threshold
18T5 Informatics
Does shifting the threshold actually help?
Point coloring is based on AUC for
the test set
Yes, it definitely helps
19T5 Informatics
It works!
● When looking across a range of different assays, shifting the decision
threshold improved the predictivity of the model, as measured by kappa, in
most cases. Often significantly so
● We did not need to retrain/change the model at all.
Yay! We're done!
20T5 Informatics
It works!
● When looking across a range of different assays, shifting the decision
threshold improved the predictivity of the model, as measured by kappa, in
most cases. Often significantly so
● We did not need to retrain/change the model at all.
No, wait, there's still more...
● What about other approaches for handling imbalanced datasets?
21T5 Informatics
Quick diversion: How do bag classifiers end up with
different models?
Each tree is built
with a different
dataset
22T5 Informatics
Another approach: Balanced Random Forests1
● Take advantage of the structure of the classifier.
● Learn each tree with a balanced dataset:
○ Select a bootstrap sample of the minority class (actives)
○ Randomly select, with replacement, the same number of points from the majority class
(inactives)
● Prediction works the same as with a normal random forest
● Easy to do in scikit-learn using the imbalanced-learn contrib package:
https://imbalanced-learn.readthedocs.io/en/stable/ensemble.html#forest-of-ra
ndomized-trees
○ cls = BalancedRandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2, n_jobs=4,
oob_score=True)
1
C. Chen, A. Liaw, L. Breiman. “Using random forest to learn imbalanced data.” University of California,
Berkeley (2004), https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
23T5 Informatics
Does BRF help?
Point coloring is based on AUC for
the test set
Yes, it definitely helps
24T5 Informatics
How does BRF compare to shifting the threshold?
Point coloring is based on AUC for
the test set
Shifting the threshold
tends to do better
25T5 Informatics
Does shifting the threshold with BRF models help?
Point coloring is based on AUC for
the test set
Nope, that doesn't do
anything
26T5 Informatics
It works!
● When looking across a range of different assays, shifting the decision
threshold improved the predictivity of the model, as measured by kappa, in
most cases. Often significantly so
● The approach was generally better than using Balanced Random Forests
● We did not need to retrain/change the model at all.
Yay! We're done!
27T5 Informatics
What comes next
● Try the same thing with other learning methods like logistic regression and
stochastic gradient boosting
○ These are more complicated since they can't do out-of-bag classification
○ We need to add another data split and loop to do calibration and find the best threshold
● More datasets! I need *your* help with this
○ I can put together a script for you to run that takes sets of compounds with activity labels and
outputs summary statistics like what I'm using here
28T5 Informatics
Prelim results: logistic regression
29T5 Informatics
Acknowledgements
● Dean Abbott (Abbott Analytics, @deanabb)
● Daria Goldmann (KNIME)
30T5 Informatics
More info
● RDKit blog post with initial work:
http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html
● The notebooks I used for this presentation are all in Github:
○ Original notebook: https://bit.ly/2UY2u2K
○ Using the balanced random forest: https://bit.ly/2tuafSc
○ Plotting: https://bit.ly/2GJSeHH
● I have a KNIME workflow that does the same thing. Let me know if you're
interested
● Download links for the datasets are in the blog post

Contenu connexe

Similaire à Building useful models for imbalanced datasets (without resampling)

Optimal Learning for Fun and Profit with MOE
Optimal Learning for Fun and Profit with MOEOptimal Learning for Fun and Profit with MOE
Optimal Learning for Fun and Profit with MOE
Yelp Engineering
 
Sample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdfSample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdf
AaryanArora10
 

Similaire à Building useful models for imbalanced datasets (without resampling) (20)

Random forest sgv_ai_talk_oct_2_2018
Random forest sgv_ai_talk_oct_2_2018Random forest sgv_ai_talk_oct_2_2018
Random forest sgv_ai_talk_oct_2_2018
 
Model selection and tuning at scale
Model selection and tuning at scaleModel selection and tuning at scale
Model selection and tuning at scale
 
report
reportreport
report
 
Machine learning in the life sciences with knime
Machine learning in the life sciences with knimeMachine learning in the life sciences with knime
Machine learning in the life sciences with knime
 
Introduction to simulating data to improve your research
Introduction to simulating data to improve your researchIntroduction to simulating data to improve your research
Introduction to simulating data to improve your research
 
Top 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfTop 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdf
 
Scientific Benchmarking of Parallel Computing Systems
Scientific Benchmarking of Parallel Computing SystemsScientific Benchmarking of Parallel Computing Systems
Scientific Benchmarking of Parallel Computing Systems
 
Optimal Learning for Fun and Profit with MOE
Optimal Learning for Fun and Profit with MOEOptimal Learning for Fun and Profit with MOE
Optimal Learning for Fun and Profit with MOE
 
Using SigOpt to Tune Deep Learning Models with Nervana Cloud
Using SigOpt to Tune Deep Learning Models with Nervana CloudUsing SigOpt to Tune Deep Learning Models with Nervana Cloud
Using SigOpt to Tune Deep Learning Models with Nervana Cloud
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)
 
Faster and cheaper, smart ab experiments - public ver.
Faster and cheaper, smart ab experiments - public ver.Faster and cheaper, smart ab experiments - public ver.
Faster and cheaper, smart ab experiments - public ver.
 
XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ Fyber
 
Strong Baselines for Neural Semi-supervised Learning under Domain Shift
Strong Baselines for Neural Semi-supervised Learning under Domain ShiftStrong Baselines for Neural Semi-supervised Learning under Domain Shift
Strong Baselines for Neural Semi-supervised Learning under Domain Shift
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
 
Handling Imbalanced Data: SMOTE vs. Random Undersampling
Handling Imbalanced Data: SMOTE vs. Random UndersamplingHandling Imbalanced Data: SMOTE vs. Random Undersampling
Handling Imbalanced Data: SMOTE vs. Random Undersampling
 
Causal reasoning and Learning Systems
Causal reasoning and Learning SystemsCausal reasoning and Learning Systems
Causal reasoning and Learning Systems
 
Jay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIJay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AI
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
 
lec1.ppt
lec1.pptlec1.ppt
lec1.ppt
 
Sample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdfSample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdf
 

Plus de Greg Landrum

How Do You Build and Validate 1500 Models and What Can You Learn from Them?
How Do You Build and Validate 1500 Models and What Can You Learn from Them? How Do You Build and Validate 1500 Models and What Can You Learn from Them?
How Do You Build and Validate 1500 Models and What Can You Learn from Them?
Greg Landrum
 
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Greg Landrum
 

Plus de Greg Landrum (15)

Chemical registration
Chemical registrationChemical registration
Chemical registration
 
Mike Lynch Award Lecture, ICCS 2022
Mike Lynch Award Lecture, ICCS 2022Mike Lynch Award Lecture, ICCS 2022
Mike Lynch Award Lecture, ICCS 2022
 
ACS San Diego - The RDKit: Open-source cheminformatics
ACS San Diego - The RDKit: Open-source cheminformaticsACS San Diego - The RDKit: Open-source cheminformatics
ACS San Diego - The RDKit: Open-source cheminformatics
 
Let’s talk about reproducible data analysis
Let’s talk about reproducible data analysisLet’s talk about reproducible data analysis
Let’s talk about reproducible data analysis
 
How Do You Build and Validate 1500 Models and What Can You Learn from Them?
How Do You Build and Validate 1500 Models and What Can You Learn from Them? How Do You Build and Validate 1500 Models and What Can You Learn from Them?
How Do You Build and Validate 1500 Models and What Can You Learn from Them?
 
Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...
 
Processing malaria HTS results using KNIME: a tutorial
Processing malaria HTS results using KNIME: a tutorialProcessing malaria HTS results using KNIME: a tutorial
Processing malaria HTS results using KNIME: a tutorial
 
Big (chemical) data? No Problem!
Big (chemical) data? No Problem!Big (chemical) data? No Problem!
Big (chemical) data? No Problem!
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical research
 
Some "challenges" on the open-source/open-data front
Some "challenges" on the open-source/open-data frontSome "challenges" on the open-source/open-data front
Some "challenges" on the open-source/open-data front
 
Large scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent dataLarge scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent data
 
Open-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitOpen-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKit
 
Open-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databasesOpen-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databases
 
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...
 
Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...
 

Dernier

Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
MohamedFarag457087
 
Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.
Cherry
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
Cherry
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
ANSARKHAN96
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
Cherry
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 

Dernier (20)

Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.Phenolics: types, biosynthesis and functions.
Phenolics: types, biosynthesis and functions.
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
Genome organization in virus,bacteria and eukaryotes.pptx
Genome organization in virus,bacteria and eukaryotes.pptxGenome organization in virus,bacteria and eukaryotes.pptx
Genome organization in virus,bacteria and eukaryotes.pptx
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptx
 
Cot curve, melting temperature, unique and repetitive DNA
Cot curve, melting temperature, unique and repetitive DNACot curve, melting temperature, unique and repetitive DNA
Cot curve, melting temperature, unique and repetitive DNA
 
FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
Kanchipuram Escorts 🥰 8617370543 Call Girls Offer VIP Hot Girls
Kanchipuram Escorts 🥰 8617370543 Call Girls Offer VIP Hot GirlsKanchipuram Escorts 🥰 8617370543 Call Girls Offer VIP Hot Girls
Kanchipuram Escorts 🥰 8617370543 Call Girls Offer VIP Hot Girls
 
Role of AI in seed science Predictive modelling and Beyond.pptx
Role of AI in seed science  Predictive modelling and  Beyond.pptxRole of AI in seed science  Predictive modelling and  Beyond.pptx
Role of AI in seed science Predictive modelling and Beyond.pptx
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLGwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 

Building useful models for imbalanced datasets (without resampling)

  • 1. Building useful models for imbalanced datasets (without resampling) Greg Landrum Feb 2019 T5 Informatics GmbH greg.landrum@t5informatics.com @dr_greg_landrum
  • 2. 2T5 Informatics First things first ● RDKit blog post with initial work: http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html ● The notebooks I used for this presentation are all in Github: ○ Original notebook: https://bit.ly/2UY2u2K ○ Using the balanced random forest: https://bit.ly/2tuafSc ○ Plotting: https://bit.ly/2GJSeHH ● I have a KNIME workflow that does the same thing. Let me know if you're interested ● Download links for the datasets are in the blog post
  • 3. 3T5 Informatics The problem ● Typical datasets for bioactivity prediction tend to have way more inactives than actives ● This leads to a couple of pathologies: ○ Overall accuracy is really not a good metric for how useful a model is ○ Many learning algorithms produce way too many false negatives
  • 4. 4T5 Informatics The accuracy problem Assay: CHEMBL1614421, PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation Predicted inactive Predicted active Measured inactive 8681 4 Measured active 1102 22 Accuracy is high, but the model is pretty useless for predicting actives Overall accuracy: 88.7%
  • 5. 5T5 Informatics The accuracy problem Assay: CHEMBL1614421, PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation Predicted inactive Predicted active Measured inactive 8681 4 Measured active 1102 22 Overall accuracy: 88.7% kappa: 0.033 This one has an easy solution: use Cohen's kappa https://en.wikipedia.org/wiki/Cohen%27s_kappa https://link.springer.com/article/10.1007/s10822-014-9759-6 https://www.researchgate.net/publication/258240105_TheKappaStatistic_PaulCzodrowski kappa makes the problem obvious
  • 6. 6T5 Informatics Is that model actually useless? Assay: CHEMBL1614421, PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation Predicted inactive Predicted active Measured inactive 8681 4 Measured active 1102 22 Overall accuracy: 88.7% kappa: 0.033 AUC: 0.850 The ROC curve shows that the model has actually learned something about the bioactivity. Why aren't we seeing that in the predictions?
  • 7. 7T5 Informatics Quick diversion on bag classifiers When making predictions, each tree in the classifier votes on the result. Majority wins In scikit-learn the predicted class probabilities are the means of the predicted probabilities from the individual trees We construct the ROC curve by sorting the predictions in decreasing order of predicted probability of being active. Note that actual predictions are irrelevant for a ROC curve. As long as true actives tend to have a higher predicted probability of being active than true inactives the AUC will be good.
  • 9. 9T5 Informatics An idea ● The standard decision rule for a random forest (or any bag classifier) is that the majority wins1 , i.e. at the predicted probability of being active must be >=0.5 in order for the model to predict "active" ● Shift that threshold to a lower value for models built on highly imbalanced datasets2 1 This is only strictly true for binary classifiers 2 F. Provost, T. Fawcett. "Robust classification for imprecise environments." Machine learning 42:203-31 (2001).
  • 10. 10T5 Informatics How do we come up with a new decision threshold? 1. Generate a random forest for the dataset using a training set 2. Generate out-of-bag predicted probabilities using the training set 3. Try a number of different decision thresholds1 and pick the one that gives the best kappa Once we have the decision threshold, use it to generate predictions for the test set. 1 Here we use: [0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ]
  • 11. 11T5 Informatics Changing the decision threshold Assay: CHEMBL1614421, PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation Threshold = 0.5 Predicted inactive Predicted active Measured inactive 8681 4 Measured active 1101 22 Overall accuracy: 88.7% kappa: 0.033 Threshold = 0.2 Predicted inactive Predicted active Measured inactive 8261 424 Measured active 595 528 Overall accuracy: 89.6% kappa: 0.451kappa has improved dramatically
  • 12. 12T5 Informatics Why does it work? Prob(active) Superimpose Prob(active) on the ROC curve
  • 13. 13T5 Informatics Why does it work? The predicted probabilities of being active are clearly carrying significant information about activity Prob(active) Superimpose Prob(active) on the ROC curve Plot TPR and FPR vs -Prob(active)
  • 14. 14T5 Informatics It works! ● By shifting the decision threshold from 0.5->0.2 we managed to improve the predictivity of the model, as measured by kappa, from 0.033 to 0.451 ● We did not need to retrain/change the model at all. Yay! We're done!
  • 15. 15T5 Informatics It works! ● By shifting the decision threshold from 0.5->0.2 we managed to improve the predictivity of the model, as measured by kappa, from 0.033 to 0.451 ● We did not need to retrain/change the model at all. Not quite done yet! ● This is just one model. Does this approach work broadly? ● Let's try a broad collection of datasets
  • 16. 16T5 Informatics The datasets (all extracted from ChEMBL_24) ● "Serotonin": 6 datasets with >900 Ki values for human serotonin receptors ○ Active: pKi > 9.0, Inactive: pKi < 8.5 ○ If that doesn't yield at least 50 actives: Active: pKi > 8.0, Inactive: pKi < 7.5 ● "DS1": 80 "Dataset 1" sets.1 ○ Active: 100 diverse measured actives ("standard_value<10uM"); Inactive: 2000 random compounds from the same property space ● "PubChem": 8 HTS Validation assays with at least 3K "Potency" values ○ Active: "active" in dataset. Inactive: "inactive", "not active", or "inconclusive" in dataset ● "DrugMatrix": 44 DrugMatrix assays with at least 40 actives ○ Active: "active" in dataset. Inactive: "not active" in dataset 1 S. Riniker, N. Fechner, G. A. Landrum. "Heterogeneous classifier fusion for ligand-based virtual screening: or, how decision making by committee can be a good thing." Journal of chemical information and modeling 53:2829-36 (2013).
  • 17. 17T5 Informatics Model building and validation ● Fingerprints: 2048 bit MorganFP radius=2 ● 80/20 training/test split ● Random forest parameters: ○ cls = RandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2, n_jobs=4, oob_score=True) ● Try threshold values of [0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ] with out-of-bag predictions and pick the best based on kappa ● Generate initial kappa value for the test data using threshold = 0.5 ● Generate "balanced" kappa value for the test data with the optimized threshold
  • 18. 18T5 Informatics Does shifting the threshold actually help? Point coloring is based on AUC for the test set Yes, it definitely helps
  • 19. 19T5 Informatics It works! ● When looking across a range of different assays, shifting the decision threshold improved the predictivity of the model, as measured by kappa, in most cases. Often significantly so ● We did not need to retrain/change the model at all. Yay! We're done!
  • 20. 20T5 Informatics It works! ● When looking across a range of different assays, shifting the decision threshold improved the predictivity of the model, as measured by kappa, in most cases. Often significantly so ● We did not need to retrain/change the model at all. No, wait, there's still more... ● What about other approaches for handling imbalanced datasets?
  • 21. 21T5 Informatics Quick diversion: How do bag classifiers end up with different models? Each tree is built with a different dataset
  • 22. 22T5 Informatics Another approach: Balanced Random Forests1 ● Take advantage of the structure of the classifier. ● Learn each tree with a balanced dataset: ○ Select a bootstrap sample of the minority class (actives) ○ Randomly select, with replacement, the same number of points from the majority class (inactives) ● Prediction works the same as with a normal random forest ● Easy to do in scikit-learn using the imbalanced-learn contrib package: https://imbalanced-learn.readthedocs.io/en/stable/ensemble.html#forest-of-ra ndomized-trees ○ cls = BalancedRandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2, n_jobs=4, oob_score=True) 1 C. Chen, A. Liaw, L. Breiman. “Using random forest to learn imbalanced data.” University of California, Berkeley (2004), https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
  • 23. 23T5 Informatics Does BRF help? Point coloring is based on AUC for the test set Yes, it definitely helps
  • 24. 24T5 Informatics How does BRF compare to shifting the threshold? Point coloring is based on AUC for the test set Shifting the threshold tends to do better
  • 25. 25T5 Informatics Does shifting the threshold with BRF models help? Point coloring is based on AUC for the test set Nope, that doesn't do anything
  • 26. 26T5 Informatics It works! ● When looking across a range of different assays, shifting the decision threshold improved the predictivity of the model, as measured by kappa, in most cases. Often significantly so ● The approach was generally better than using Balanced Random Forests ● We did not need to retrain/change the model at all. Yay! We're done!
  • 27. 27T5 Informatics What comes next ● Try the same thing with other learning methods like logistic regression and stochastic gradient boosting ○ These are more complicated since they can't do out-of-bag classification ○ We need to add another data split and loop to do calibration and find the best threshold ● More datasets! I need *your* help with this ○ I can put together a script for you to run that takes sets of compounds with activity labels and outputs summary statistics like what I'm using here
  • 28. 28T5 Informatics Prelim results: logistic regression
  • 29. 29T5 Informatics Acknowledgements ● Dean Abbott (Abbott Analytics, @deanabb) ● Daria Goldmann (KNIME)
  • 30. 30T5 Informatics More info ● RDKit blog post with initial work: http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html ● The notebooks I used for this presentation are all in Github: ○ Original notebook: https://bit.ly/2UY2u2K ○ Using the balanced random forest: https://bit.ly/2tuafSc ○ Plotting: https://bit.ly/2GJSeHH ● I have a KNIME workflow that does the same thing. Let me know if you're interested ● Download links for the datasets are in the blog post