Building useful models for imbalanced datasets (without resampling)
Greg Landrum
Presentation from the SF COMP Together event
1.
© 2019 KNIME AG. All Rights Reserved.
Building useful models for imbalanced datasets (without resampling)
Greg Landrum (greg.landrum@knime.com)
COMP Together, UCSF, 22 Aug 2019
2.
First things first
• RDKit blog post with initial work: http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html
• The notebooks I used for this presentation are all on GitHub:
– Original notebook: https://bit.ly/2UY2u2K
– Using the balanced random forest: https://bit.ly/2tuafSc
– Plotting: https://bit.ly/2GJSeHH
• I have a KNIME workflow that does the same thing. Let me know if you're interested.
• Download links for the datasets are in the blog post.
3.
The problem
• Typical datasets for bioactivity prediction tend to have way more inactives than actives
• This leads to a couple of pathologies:
– Overall accuracy is really not a good metric for how useful a model is
– Many learning algorithms produce way too many false negatives
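To make the first pathology concrete, here is a minimal sketch in plain Python. The 90/10 class split and the "always predict inactive" model are made-up illustrations, not data from the talk, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = active."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical imbalanced set: 100 actives, 900 inactives.
y_true = [1] * 100 + [0] * 900
# A useless model that calls everything inactive.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
print(f"accuracy={acc:.2f} kappa={kappa:.2f}")
```

The model finds no actives at all, yet its accuracy simply mirrors the fraction of inactives; kappa corrects for that chance agreement.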
4.
Example dataset
• Assay CHEMBL1614421 (PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation, Thioflavin T Binding. (Class of assay: confirmatory))
– https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL1614166/
– https://pubchem.ncbi.nlm.nih.gov/bioassay/1460
• 43345 inactives, 5602 actives (using the annotations from PubChem)
5.
Data Preparation
• Structures are taken from ChEMBL
– Already some standardization done
– Processed with RDKit
• Fingerprints: RDKit Morgan-2, 2048 bits
6.
Modeling
• Stratified 80-20 training/holdout split
• KNIME random forest classifier
– 500 trees
– Max depth 15
– Min node size 2
This is a first pass through the cycle; we will try other fingerprints, learning algorithms, and hyperparameters in future iterations.
7.
Results
CHEMBL1614421: holdout data
8.
Evaluation
CHEMBL1614421: holdout data
AUROC=0.75
9.
Taking stock
• Model has:
– Good overall accuracies (because of imbalance)
– Decent AUROC values
– Terrible Cohen kappas
Now what?
10.
Quick diversion on bag classifiers
When making predictions, each tree in the classifier votes on the result. Majority wins.
The predicted class probabilities are often the means of the predicted probabilities from the individual trees.
We construct the ROC curve by sorting the predictions in decreasing order of predicted probability of being active.
Note that the actual predictions are irrelevant for an ROC curve. As long as true actives tend to have a higher predicted probability of being active than true inactives, the AUC will be good.
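A small plain-Python sketch of these points follows. The probabilities are invented for illustration and the function names are hypothetical; the AUC is computed as the fraction of (active, inactive) pairs that are ranked correctly, which depends only on the ordering of the scores.

```python
def forest_vote(tree_probs):
    """Majority vote: a tree votes 'active' if its own probability is >= 0.5."""
    votes = sum(1 for p in tree_probs if p >= 0.5)
    return 1 if votes > len(tree_probs) / 2 else 0


def forest_probability(tree_probs):
    """Ensemble probability as the mean of the per-tree probabilities."""
    return sum(tree_probs) / len(tree_probs)


def auroc(y_true, scores):
    """Fraction of (active, inactive) pairs ranked correctly; ties count 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented ensemble probabilities for 2 actives and 4 inactives.
y_true = [1, 1, 0, 0, 0, 0]
probs = [0.40, 0.30, 0.35, 0.10, 0.05, 0.20]

auc = auroc(y_true, probs)
# With the default rule nothing reaches 0.5, so no compound is called active.
n_predicted_active = sum(1 for p in probs if p >= 0.5)
# Rescaling the scores leaves the ranking, and therefore the AUC, unchanged.
auc_rescaled = auroc(y_true, [p / 2 for p in probs])
```

Here the ranking is mostly right, so the AUC looks decent even though the 0.5 rule predicts no actives at all.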
11.
Handling imbalanced data
• The standard decision rule for a random forest (or any bag classifier) is that the majority wins [1], i.e. the predicted probability of being active must be >= 0.5 in order for the model to predict "active"
• Shift that threshold to a lower value for models built on highly imbalanced datasets [2]
[1] This is only strictly true for binary classifiers
[2] Chen, J. J., et al. “Decision Threshold Adjustment in Class Prediction.” SAR and QSAR in Environmental Research 17 (2006): 337–52.
12.
Picking a new decision threshold: approach 1
• Generate a random forest for the dataset using the training set
• Generate out-of-bag predicted probabilities using the training set
• Try a number of different decision thresholds [1] and pick the one that gives the best kappa
• Once we have the decision threshold, use it to generate predictions for the test set.
[1] Here we use: [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
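The steps above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the notebook from the talk: the data come from `make_classification` as a synthetic stand-in for fingerprints, the forest uses 100 trees to keep the run short, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a fingerprint matrix with roughly 10% actives (assumption).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# oob_score=True makes scikit-learn keep the out-of-bag class probabilities.
cls = RandomForestClassifier(n_estimators=100, max_depth=15, min_samples_leaf=2,
                             oob_score=True, random_state=0)
cls.fit(X_train, y_train)

# Out-of-bag probability of the active class for every training compound.
oob_probs = cls.oob_decision_function_[:, 1]

# Scan the candidate thresholds and keep the one with the best out-of-bag kappa.
thresholds = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
oob_kappas = [cohen_kappa_score(y_train, (oob_probs >= t).astype(int))
              for t in thresholds]
best_threshold = thresholds[int(np.argmax(oob_kappas))]

# Apply both the default and the selected threshold to the held-out test set.
test_probs = cls.predict_proba(X_test)[:, 1]
kappa_default = cohen_kappa_score(y_test, (test_probs >= 0.5).astype(int))
kappa_shifted = cohen_kappa_score(y_test, (test_probs >= best_threshold).astype(int))
print(best_threshold, kappa_default, kappa_shifted)
```

Because the threshold is chosen from out-of-bag predictions, the test set is never used for the selection.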
13.
Results CHEMBL1614421
• Balanced confusion matrix
Previously 0.005
Nice! But does it work in general?
14.
Validation experiment
15.
The datasets (all extracted from ChEMBL_24)
• "Serotonin": 6 datasets with >900 Ki values for human serotonin receptors
– Active: pKi > 9.0, Inactive: pKi < 8.5
– If that doesn't yield at least 50 actives: Active: pKi > 8.0, Inactive: pKi < 7.5
• "DS1": 80 "Dataset 1" sets [1]
– Active: 100 diverse measured actives ("standard_value<10uM"); Inactive: 2000 random compounds from the same property space
• "PubChem": 8 HTS Validation assays with at least 3K "Potency" values
– Active: "active" in dataset. Inactive: "inactive", "not active", or "inconclusive" in dataset
• "DrugMatrix": 44 DrugMatrix assays with at least 40 actives
– Active: "active" in dataset. Inactive: "not active" in dataset
[1] S. Riniker, N. Fechner, G. A. Landrum. "Heterogeneous classifier fusion for ligand-based virtual screening: or, how decision making by committee can be a good thing." Journal of chemical information and modeling 53:2829-36 (2013).
16.
Model building and validation
• Fingerprints: 2048 bit MorganFP radius=2
• 80/20 training/test split
• Random forest parameters:
– cls = RandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2, n_jobs=4, oob_score=True)
• Try threshold values of [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5] with out-of-bag predictions and pick the best based on kappa
• Generate initial kappa value for the test data using threshold = 0.5
• Generate "balanced" kappa value for the test data with the optimized threshold
17.
Does it work in general?
ChEMBL data, random-split validation
18.
Does it work in general?
Proprietary data, time-split validation
19.
Picking a new decision threshold: approach 2
• Generate a random forest for the dataset using the training set
• Generate out-of-bag predicted probabilities using the training set
• Pick the threshold corresponding to the point on the ROC curve that’s closest to the upper left corner
• Once we have the decision threshold, use it to generate predictions for the test set.
Chen, J. J., et al. “Decision Threshold Adjustment in Class Prediction.” SAR and QSAR in Environmental Research 17 (2006): 337–52.
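One way to read "closest to the upper left corner" is the Euclidean distance from each ROC point to (FPR = 0, TPR = 1); that distance measure is an assumption here, as are the invented out-of-bag probabilities and the helper names. A minimal plain-Python sketch:

```python
import math


def roc_point(y_true, probs, threshold):
    """Return (FPR, TPR) when predicting 'active' for probs >= threshold."""
    tp = sum(1 for t, p in zip(y_true, probs) if t == 1 and p >= threshold)
    fp = sum(1 for t, p in zip(y_true, probs) if t == 0 and p >= threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos


def closest_to_corner(y_true, probs):
    """Pick the threshold whose ROC point is nearest to (FPR=0, TPR=1)."""
    best_t, best_d = None, float("inf")
    # Every distinct predicted probability is a candidate threshold.
    for t in sorted(set(probs), reverse=True):
        fpr, tpr = roc_point(y_true, probs, t)
        d = math.hypot(fpr, 1.0 - tpr)
        if d < best_d:
            best_t, best_d = t, d
    return best_t


# Invented out-of-bag probabilities: 3 actives, 7 inactives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
oob_probs = [0.45, 0.30, 0.12, 0.35, 0.10, 0.08, 0.05, 0.20, 0.02, 0.15]

threshold = closest_to_corner(y_true, oob_probs)
print(threshold)  # 0.3 for this toy data
```

Unlike approach 1, this needs no predefined grid of thresholds; the candidates come straight from the predicted probabilities.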
20.
Does it work in general?
ChEMBL data, random-split validation
21.
Does it work in general?
ChEMBL data, random-split validation
22.
Other evaluation metrics: F1 score
ChEMBL data, random-split validation
23.
Does it work in general?
Proprietary data, time-split validation
24.
Compare to balanced random forests
• Resampling strategy that still uses the entire training set
• Idea: train each tree on a balanced bootstrap sample of the training data
Chen, C., Liaw, A. & Breiman, L. Using Random Forest to Learn Imbalanced Data. https://statistics.berkeley.edu/tech-reports/666 (2004).
25.
How do bag classifiers end up with different models?
Each tree is built with a different dataset
26.
Balanced random forests
• Take advantage of the structure of the classifier.
• Learn each tree with a balanced dataset:
– Select a bootstrap sample of the minority class (actives)
– Randomly select, with replacement, the same number of points from the majority class (inactives)
• Prediction works the same as with a normal random forest
• Easy to do in scikit-learn using the imbalanced-learn contrib package: https://imbalanced-learn.readthedocs.io/en/stable/ensemble.html#forest-of-randomized-trees
– cls = BalancedRandomForestClassifier(n_estimators=500, max_depth=15, min_samples_leaf=2, n_jobs=4, oob_score=True)
Chen, C., Liaw, A. & Breiman, L. Using Random Forest to Learn Imbalanced Data. https://statistics.berkeley.edu/tech-reports/666 (2004).
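The per-tree sampling step can be sketched in plain Python. This only illustrates how one balanced sample is drawn; it is not how imbalanced-learn implements it internally, and the function name, class counts, and seed are hypothetical.

```python
import random


def balanced_bootstrap(labels, rng):
    """Return row indices for one tree: a bootstrap sample of the minority
    class plus an equally sized with-replacement draw from the majority class."""
    actives = [i for i, y in enumerate(labels) if y == 1]
    inactives = [i for i, y in enumerate(labels) if y == 0]
    if len(actives) <= len(inactives):
        minority, majority = actives, inactives
    else:
        minority, majority = inactives, actives
    n = len(minority)
    sample = [rng.choice(minority) for _ in range(n)]
    sample += [rng.choice(majority) for _ in range(n)]
    return sample


# Made-up training labels: 50 actives, 950 inactives.
labels = [1] * 50 + [0] * 950
rng = random.Random(42)

# Each tree would get its own independently drawn balanced sample.
per_tree_samples = [balanced_bootstrap(labels, rng) for _ in range(5)]
for sample in per_tree_samples:
    n_active = sum(labels[i] for i in sample)
    print(len(sample), n_active)
```

Every tree sees only 100 rows here, but across many trees most of the inactives still get used, which is why the strategy "still uses the entire training set".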
27.
Comparing to resampling: balanced random forests
ChEMBL data, random-split validation
28.
Comparing to resampling: balanced random forests
ChEMBL data, random-split validation
29.
What comes next
• Try the same thing with other learning methods like logistic regression and stochastic gradient boosting
– These are more complicated since they can't do out-of-bag classification
– We need to add another data split and loop to do calibration and find the best threshold
• More datasets! I need *your* help with this
– I have a script for you to run that takes sets of compounds with activity labels and outputs the summary statistics that I'm using here
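For a learner without out-of-bag predictions, the extra split might look like the sketch below. It is an assumption about how this could be done, not the author's planned method: it uses a single extra split rather than a loop, synthetic data from `make_classification`, and hypothetical variable names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data with roughly 10% actives (assumption).
X, y = make_classification(n_samples=3000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# Hold out the test set first, then carve a threshold-selection set
# out of what remains.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_fit, y_fit)

# The validation predictions play the role the out-of-bag ones played before.
val_probs = model.predict_proba(X_val)[:, 1]
thresholds = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
val_kappas = [cohen_kappa_score(y_val, (val_probs >= t).astype(int))
              for t in thresholds]
best_threshold = thresholds[int(np.argmax(val_kappas))]

# Final evaluation on data that took no part in fitting or threshold choice.
test_probs = model.predict_proba(X_test)[:, 1]
test_kappa = cohen_kappa_score(y_test, (test_probs >= best_threshold).astype(int))
print(best_threshold, test_kappa)
```

The cost relative to the random forest approach is that the model is fitted on less data, which is one reason repeating the split in a loop may be preferable.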
30.
Acknowledgements
• Dean Abbott (Abbott Analytics)
• Daria Goldmann (KNIME)
• NIBR:
– Nik Stiefl
– Nadine Schneider
– Niko Fechner