How Do You Build and Validate 1500 Models and What Can You Learn from Them?
Greg Landrum
ICCS 2018 presentation
1.
How Do You Build and Validate 1500 Models and What Can You Learn from Them?
Greg Landrum*, Anna Martin, Daria Goldmann
KNIME AG
2018 ICCS
@dr_greg_landrum
© 2018 KNIME AG. All Rights Reserved.
2.
The Monster Model Factory
Greg Landrum*, Anna Martin, Daria Goldmann
KNIME AG
2018 ICCS
@dr_greg_landrum
3.
Who cares?
• I have >1500 datasets from ChEMBL that I would like to build models for
• I want to actually use the models, so they need to be deployed
• The whole process needs to be automated and reproducible so that I can do it again when ChEMBL is updated
• Maybe we can learn something interesting from the models themselves
4.
Back to the beginning
5.
The model process
CRISP-DM (CRoss Industry Standard Process for Data Mining) is a standard process for data mining solutions. wikipedia://CRISP-DM
Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
6.
The model process
Steps: Init, Load, Transform, Learn, Score, Evaluate, Deploy
Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
7.
The model process, multiple models …
8.
The model process, multiple models …
9.
The model process, multiple models …
Image: https://commons.wikimedia.org/wiki/File:Jabberwocky.jpg
10.
The model process, multiple models …
It’s not feasible to manually do this for a daunting number of models!
Image: https://commons.wikimedia.org/wiki/File:Jabberwocky.jpg
11.
Image: https://www.publicdomainpictures.net/view-image.php?image=155188
12.
Automation: the model process factory
13.
Automation: the model process factory
[Figure: the Init, Load, Transform, Learn, Score, Evaluate, Deploy steps, repeated once per model]
• Make each step a separate workflow.
• Use KNIME to orchestrate calling those workflows.
KNIME blog post: https://goo.gl/LvESqB
White paper: https://goo.gl/d6UpUu
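The orchestration idea on this slide (one workflow per step, with a controller calling them in order for every model) can be sketched in plain Python. This is only an analogy, not KNIME's API: `run_factory`, the `state` dictionary, and the no-op steps are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List

# Stage names are taken from the slide; everything else here is hypothetical.
STAGES = ["Init", "Load", "Transform", "Learn", "Score", "Evaluate", "Deploy"]

Step = Callable[[dict], dict]


def run_factory(assay_ids: List[str], steps: Dict[str, Step]) -> Dict[str, dict]:
    """Call every stage, in order, once per assay and collect the final state."""
    results = {}
    for assay_id in assay_ids:
        state = {"assay_id": assay_id, "log": []}
        for name in STAGES:
            # Each stage is an independent callable, like a separate workflow.
            state = steps[name](state)
            state["log"].append(name)
        results[assay_id] = state
    return results


# Placeholder steps that do nothing; real steps would load data, train, etc.
noop_steps = {name: (lambda state: state) for name in STAGES}
demo = run_factory(["assay_1", "assay_2"], noop_steps)
```

Keeping each stage behind the same call signature is what lets a single controller loop over an arbitrary number of assays.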
14.
Model Factory
Steps: Init, Load, Transform, Learn, Score, Evaluate, Deploy
15.
The heart of the factory: Call Local Workflow [1]
• Executes another workflow in the same local repository
[1] Call Remote Workflow when run on the KNIME Server
Image: https://pixabay.com/en/heart-veins-arteries-anatomy-152594/
16.
Model Factory
Steps: Init, Load, Transform, Learn, Score, Evaluate, Deploy
17.
Model Factory
Steps: Init, Load, Transform, Learn, Score, Evaluate, Deploy
18.
Details
19.
Extracting the data
• Data source: ChEMBL 23
• Activity types: ('GI50', 'IC50', 'Ki', 'MIC', 'EC50', 'AC50', 'ED50', 'GI', 'Kd', 'CC50', 'LC50', 'MIC90', 'MIC50', 'ID50') -> 6.5 million points
• Define active: standard_value < 100 nM -> 1.3 million actives
• Define inactive: standard_value > 1 µM
• Define an interesting assay: at least 50 actives -> 1556 assays
• Final dataset size: 2.5 million data points, 1.5 million compounds
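The activity thresholds and the "interesting assay" filter above can be sketched as follows. This is a minimal illustration, not the authors' workflow: the function names and the `(assay_id, value_nm)` row layout are hypothetical, and treating values between 100 nM and 1 µM as unlabeled is an assumption, since the slide does not say what happens to them.

```python
from collections import defaultdict

ACTIVE_BELOW_NM = 100.0      # active: standard_value < 100 nM
INACTIVE_ABOVE_NM = 1000.0   # inactive: standard_value > 1 uM (1000 nM)
MIN_ACTIVES = 50             # an "interesting" assay has at least 50 actives


def label(value_nm):
    """Return 'active', 'inactive', or None for the in-between range (assumed)."""
    if value_nm < ACTIVE_BELOW_NM:
        return "active"
    if value_nm > INACTIVE_ABOVE_NM:
        return "inactive"
    return None


def interesting_assays(rows, min_actives=MIN_ACTIVES):
    """rows: iterable of (assay_id, value_nm) pairs. Returns qualifying assay ids."""
    active_counts = defaultdict(int)
    for assay_id, value_nm in rows:
        if label(value_nm) == "active":
            active_counts[assay_id] += 1
    return sorted(a for a, n in active_counts.items() if n >= min_actives)
```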
20.
Finding more inactives
• The ChEMBL datasets almost all have an unrealistically high ratio of actives to inactives
• “Fix” that by adding enough assumed inactives to each dataset to get a 1:10 active:inactive ratio
• Pick those assumed inactives to be roughly similar to the actives: Tanimoto similarity between 0.35 and 0.6 using RDKit Morgan 2 fingerprints
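A rough sketch of the decoy selection described above, using plain Python sets of feature identifiers in place of RDKit Morgan 2 fingerprints so that the example has no dependencies. Two points are assumptions rather than details from the slides: a candidate counts as "roughly similar" if its similarity to at least one active falls in the window, and the number of decoys needed is the 1:10 target minus the measured inactives already present. All function and parameter names are hypothetical.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity of two feature sets: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def pick_assumed_inactives(actives, candidates, n_known_inactives=0,
                           ratio=10, lo=0.35, hi=0.6, seed=0):
    """Pick candidates whose similarity to some active lies in [lo, hi]."""
    needed = max(0, ratio * len(actives) - n_known_inactives)
    eligible = [c for c in candidates
                if any(lo <= tanimoto(c, a) <= hi for a in actives)]
    rng = random.Random(seed)   # fixed seed keeps the selection reproducible
    rng.shuffle(eligible)
    return eligible[:needed]
```

The similarity window keeps the assumed inactives from being trivially different from the actives (below 0.35) or so close that they are plausibly active themselves (above 0.6).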
21.
Extracting the data
22.
Transform
• Convert SMILES from database into chemical structures
• Clean up the chemical structures
http://www.rdkit.org
23.
Transform
• Convert SMILES from database into chemical structures
• Clean up the chemical structures
• Generate five chemical fingerprints for each structure
http://www.rdkit.org
24.
Transform
• Convert SMILES from database into chemical structures
• Clean up the chemical structures
• Generate five chemical fingerprints for each structure
  – Morgan 3 counts (ECFC6), 4K “bits”
  – Morgan 3 (ECFP6), 4K bits
  – Morgan 2 (ECFP4), 2K bits
  – RDKit FP, length 1-5, 2K bits
  – Atom pairs, distances 1-20, 4K bits
http://www.rdkit.org
25.
Learn and Score
10 different stratified random training/holdout splits generated for each assay
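Repeated stratified splitting, as described on this slide, can be sketched in standard-library Python. The slides do not give the holdout fraction, so the 0.3 default below is purely a placeholder, and the function names are hypothetical.

```python
import random


def stratified_split(labels, holdout_fraction, rng):
    """Split indices so each class keeps roughly the same share in both parts."""
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train, holdout = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        n_hold = round(len(idxs) * holdout_fraction)
        holdout.extend(idxs[:n_hold])
        train.extend(idxs[n_hold:])
    return sorted(train), sorted(holdout)


def repeated_splits(labels, n_repeats=10, holdout_fraction=0.3, seed=42):
    """Generate n_repeats different stratified splits from one seeded generator."""
    rng = random.Random(seed)
    return [stratified_split(labels, holdout_fraction, rng)
            for _ in range(n_repeats)]
```

Stratifying matters here because of the 1:10 active:inactive ratio: a purely random split of a small assay could leave very few actives in the holdout set.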
26.
Learn
Learning:
• Fingerprint Bayes (NB)
• Logistic Regression (LR)
• Random Forest (RF): 200 trees, max_depth = 10, min_leaf_size = 3, min_node_size = 6
• Gradient Boosting (H2O): 100 trees, max_depth = 5, learning_rate = 0.05
Model Selection:
• Pick best model based on Enrichment factor at 5% (EF5)
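For readers unfamiliar with the selection metric, here is a minimal sketch of the enrichment factor using its common definition: the active rate among the top-scored fraction divided by the active rate in the whole set. The slides do not spell out the exact formula or tie handling used in the study, so treat this as an assumption; `enrichment_factor` is a hypothetical name.

```python
def enrichment_factor(scores, labels, fraction=0.05):
    """EF at `fraction`: active rate in the top-ranked slice / overall active rate.

    scores: predicted scores (higher = more likely active)
    labels: 1 for active, 0 for inactive
    """
    n = len(scores)
    n_actives = sum(labels)
    if n == 0 or n_actives == 0:
        raise ValueError("need at least one compound and one active")
    n_top = max(1, round(n * fraction))
    # Sort by score only; ties keep their input order (stable sort).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(lab for _, lab in ranked[:n_top])
    return (hits / n_top) / (n_actives / n)
```

With a 1:10 active:inactive ratio, about 9% of compounds are active, so under this definition EF5 tops out near 11 for a perfect ranking.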
27.
Learn
Where did these parameters come from?
Learning:
• Fingerprint Bayes (NB)
• Logistic Regression (LR)
• Random Forest (RF): 200 trees, max_depth = 10, min_leaf_size = 3, min_node_size = 6
• Gradient Boosting (H2O): 100 trees, max_depth = 5, learning_rate = 0.05
28.
Parameter Optimization
• Full parameter optimization done for each method+fingerprint on 70 assays
• Results used to pick “standard” parameter sets:
  – Random Forest: 200 trees, max_depth = 10, min_leaf_size = 3, min_node_size = 6
  – Gradient Boosting: 100 trees, max_depth = 5, learning_rate = 0.05
29.
Parameter Optimization
30.
Parameter Optimization
The optimization and model selection workflow is presented in detail in Daria’s KNIME blog post: https://www.knime.com/blog/stuck-in-the-nine-circles-of-hell-try-parameter-optimization-a-cup-of-tea
The workflow is available in the EXAMPLES folder inside KNIME: 04_Analytics/11_Optimization/08_Model_Optimization_and_Selection
31.
Making it all run
32.
Execution
• In total >310K models were built [1]
[1] ~1550 assays * 4 methods * 5 FPs * 10 repeats
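The model count in the footnote is easy to check. The snippet below also plugs in the 1556 assays reported on the data-extraction slide; combining the two numbers is done here for illustration and is consistent with the ">310K" figure.

```python
n_methods = 4      # NB, LR, RF, gradient boosting
n_fingerprints = 5
n_repeats = 10     # stratified training/holdout splits per assay

models_per_assay = n_methods * n_fingerprints * n_repeats   # 200

approx_total = 1550 * models_per_assay   # footnote's rounded assay count: 310,000
exact_total = 1556 * models_per_assay    # assay count from the extraction slide: 311,200
```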
33.
Execution
• KNIME Analytics Platform: build/test workflows
• KNIME Server: run model factory
• Distributed executors: run individual assays
65-70 load-balanced distributed executors
34.
Are the models any good?
35.
Performance on validation sets
• AUC: mean=0.958 s=0.070
• Cohen’s kappa: mean=0.690 s=0.382
36.
Performance on validation sets
• AUC: mean=0.958 s=0.070 (Yeah!)
• Cohen’s kappa: mean=0.690 s=0.382
37.
Performance on validation sets
• AUC: mean=0.958 s=0.070 (Yeah!)
• Cohen’s kappa: mean=0.690 s=0.382 (Uh oh…)
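As a reminder of what the second metric measures, here is a small sketch of Cohen's kappa for binary labels, using the standard definition (observed agreement corrected for agreement expected by chance). Unlike AUC it is computed from thresholded predictions, so it reacts to where the active/inactive cutoff falls. The function name and the example data are illustrative, not taken from the study.

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels (1 = active, 0 = inactive)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of truth and prediction.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)


# With a 1:10 class ratio, predicting "inactive" for everything is 91%
# accurate but earns a kappa of zero.
truth = [1] * 10 + [0] * 100
all_inactive = [0] * 110
```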
38.
Image: https://www.publicdomainpictures.net/view-image.php?image=155188
39.
An experiment to check model generalizability
• Pick assays where standard_type is Ki
• Group them by target ID
• Limit to targets where Ki was measured in at least 5 assays -> 11 targets, 73 assays
• Use the model built on one assay from a target ID to predict activity across the other assays
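The experiment design above amounts to enumerating (train assay, test assay) pairs within each target. A minimal sketch, assuming assays are given as a mapping from assay ID to target ID; `cross_assay_pairs` and the input layout are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def cross_assay_pairs(assay_to_target, min_assays=5):
    """Return (target, train_assay, test_assay) triples for well-covered targets."""
    by_target = defaultdict(list)
    for assay_id, target_id in assay_to_target.items():
        by_target[target_id].append(assay_id)
    pairs = []
    for target_id in sorted(by_target):
        assays = sorted(by_target[target_id])
        if len(assays) >= min_assays:
            # Ordered pairs: every assay serves as the training set in turn.
            pairs.extend((target_id, train, test)
                         for train, test in permutations(assays, 2))
    return pairs
```

A target with k assays contributes k * (k - 1) train/test pairs, so a target with 5 assays yields 20 cross-assay evaluations.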
40.
An experiment to check model generalizability
• The targets:
TargetID      Name                                            Num Assays
CHEMBL205     Carbonic anhydrase II                           7
CHEMBL224     Serotonin 2a (5-HT2a) receptor                  8
CHEMBL234     Dopamine D3 receptor                            10
CHEMBL243     Human immunodeficiency virus type 1 protease    6
CHEMBL244     Coagulation factor X                            5
CHEMBL253     Cannabinoid CB2 receptor                        7
CHEMBL281     Carbonic anhydrase IV                           5
CHEMBL3371    Serotonin 6 (5-HT6) receptor                    8
CHEMBL344     Melanin-concentrating hormone receptor 1        5
CHEMBL4550    5-lipoxygenase activating protein               5
CHEMBL4908    Trace amine-associated receptor 1               7
41.
[Figure panels: Carbonic Anhydrase IV, Carbonic Anhydrase II, HIV Protease, Factor X, 5-HT6, TAAR1]
42.
[Figure panels: Carbonic Anhydrase IV, Carbonic Anhydrase II, HIV Protease, Factor X, 5-HT6, TAAR1]
43.
An Example
Target: CHEMBL3371 (5-HT6)
Train on Assay ID: 448716
Test with Assay ID: 1366806
AUROC: 0.38
EF5: 0
44.
An Example
[Figure: Assay_ID 448716 vs. Assay_ID 1366806]
45.
An Example
Target: CHEMBL3371 (5-HT6)
Train on Assay ID: 448716
Test with Assay ID: 659849
AUROC: 0.99
EF5: 8.8
46.
An Example
[Figure: Assay_ID 448716 vs. Assay_ID 659849]
47.
An Example
Target: CHEMBL3371 (5-HT6)
Train on Assay ID: 448716
Test with Assay ID: 1528679
AUROC: 0.83
EF5: 0.4
48.
An Example
[Figure: Assay_ID 448716 vs. Assay_ID 1528679]
49.
Intermediate conclusion
• Many/most of the models have likely overfit the training data
• Alternative interpretation: we’ve actually built models to predict whether or not a compound is taken from a particular paper
• Unfortunately these are functionally the same if you want to predict activity
50.
Image: https://www.publicdomainpictures.net/view-image.php?image=155188
51.
Look for frequent algorithm + fingerprint combinations
• For each of the ~1550 assays * 4 learning algorithms * 10 repeats, look at which fingerprint performed best (as measured by EF5)
52.
Look for frequent algorithm + fingerprint combinations
• For each of the ~1550 assays * 4 learning algorithms * 10 repeats, look at which fingerprint performed best (as measured by EF5)
53.
Which method/FP pair is best for each assay?
• For each of the ~1550 assays * 10 repeats, look at which algorithm + fingerprint performed best (as measured by EF5 [1], AUC [2], and algorithm complexity [3])
[1] Rounded to 1 decimal place
[2] Rounded to 2 decimal places
[3] Random Forest > Gradient Boosting > Fingerprint Bayes > Logistic Regression
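The three-level selection rule above can be sketched as a sort key: rounded EF5 first, rounded AUC second, then the algorithm ordering as the final tie-break. The slide does not say which direction the ">" in the algorithm ordering favors, so this sketch assumes algorithms listed earlier win ties and exposes the order as a parameter. The record layout and function name are hypothetical.

```python
# Order copied from the slide footnote; treating earlier entries as preferred
# in a tie is an assumption made for this sketch.
ALGORITHM_ORDER = ["Random Forest", "Gradient Boosting",
                   "Fingerprint Bayes", "Logistic Regression"]


def select_best(results, algorithm_order=ALGORITHM_ORDER):
    """results: list of dicts with 'algorithm', 'fingerprint', 'ef5', 'auc'."""
    def sort_key(result):
        return (
            -round(result["ef5"], 1),   # higher EF5 first, 1 decimal place
            -round(result["auc"], 2),   # then higher AUC, 2 decimal places
            algorithm_order.index(result["algorithm"]),  # then tie-break order
        )
    return min(results, key=sort_key)
```

Rounding before comparing keeps negligible metric differences from deciding the winner, which is why the later criteria come into play at all.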
54.
Which method/FP pair is best for each assay?
Select best model using EF5, AUC, algorithm complexity
55.
Wrapping up
• We have automated the construction and evaluation of >1500 models for bioassays using data pulled from ChEMBL
• We’ve got some strong evidence that the models themselves are significantly overfit
• We were able to start to draw some general conclusions about fingerprints and methods
56.
There’s still a lot left to do
• Verify the repeatability of the process by updating when the next version of ChEMBL is released
• Put some more thought into combining assays to get around the “one series per paper” problem
• Look into doing the full optimization run
• Come up with a good way of presenting the predictions
57.
More details…
• Model process factory blog post: https://goo.gl/LvESqB
• Model process factory white paper: https://goo.gl/d6UpUu
• Model process factory workflow: knime://EXAMPLES/50_Applications/26_Model_Process_Management
• Daria’s blog post on the model optimization workflow: https://www.knime.com/blog/stuck-in-the-nine-circles-of-hell-try-parameter-optimization-a-cup-of-tea
• Accompanying workflow: knime://EXAMPLES/04_Analytics/11_Optimization/08_Model_Optimization_and_Selection
• When we’re done cleaning up, there will be a blog post/sample workflow for the monster model factory too.
58.
7th RDKit UGM: 19-21 September
• Hosted by Andreas Bender, Cambridge University
• Free registration: https://goo.gl/VVvHUH (or get it on http://www.rdkit.org)
59.
KNIME Fall Summit 2018
November 6-9 at AT&T Executive Education and Conference Center, Austin, Texas
• Tuesday & Wednesday: One-day courses
• Thursday & Friday: Summit sessions
Use the code ICCS-2018 for 10% off tickets.
Register at: knime.com/fall-summit2018
60.
The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME.com AG under license from KNIME GmbH, and are registered in the United States. KNIME® is also registered in Germany.