Moving from Artisanal to Industrial Machine Learning
Greg Landrum
Presentation from 2019 CADD GRC
1.
Moving from Artisanal to Industrial Machine Learning
Greg Landrum (greg.landrum@knime.com)
© 2019 KNIME AG. All Rights Reserved.
2.
This talk
• Motivation
• Creating a reproducible/industrial artisan
• An artisanal side trip into working with imbalanced data
3.
Context
Artisanal / Industrial
Images: https://flic.kr/p/RJ5xEs (License: CC BY 2.0); https://flic.kr/p/a3LLdm (License: CC BY 2.0)
4.
Context
Artisanal
• Creative/Exploratory
• Flexible
Industrial
• Automated
• Reproducible
• Repeatable
• Quality control
5.
Motivation: utility
• Thinking about the models that are useful in the design-make-test cycle of a med-chem project
• Perhaps something project-specific for the main target + important anti-targets
• Likely a host of additional global models that could be used (solubility, pKa, hERG, CYPs, synthetic accessibility, etc.)
6.
Aspirations
• Can we figure out how to help the artisan be more reproducible/repeatable?
• Can we provide an “industrial” framework the artisan can work within?
• Can this somehow be practical?
7.
A process for data mining
8.
Cross-industry standard process for data mining
• An EU-funded project from the late ’90s run by Integral Solutions (bought by SPSS, bought by IBM), Teradata, Daimler-Benz, NCR, and OHRA.
9.
Cross-industry standard process for data mining
• An EU-funded project from the late ’90s run by Integral Solutions (bought by SPSS, bought by IBM), Teradata, Daimler-Benz, NCR, and OHRA.
I can guess what you’re thinking…
10.
Cross-industry standard process for data mining
• An EU-funded project from the late ’90s run by Integral Solutions (bought by SPSS, bought by IBM), Teradata, Daimler-Benz, NCR, and OHRA.
I can guess what you’re thinking…
11.
Cross-industry standard process for data mining
• An EU-funded project from the late ’90s run by Integral Solutions (bought by SPSS, bought by IBM), Teradata, Daimler-Benz, NCR, and OHRA.
Shockingly, this actually produced something useful
12.
The CRISP-DM Process
CRISP-DM (CRoss Industry Standard Process for Data Mining) is a standard process for data mining solutions.
Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
13.
Establishing context
• Business understanding
– What problem are we trying to solve?
– What would a solution look like?
• Data understanding
– What data do we have available?
– Is it any good?
– What might be useful for this problem?
Domain expertise required here
Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
14.
The problem
• Build predictive models for bioactivity based on the data in screening assays
15.
The datasets we’ll be working with
• qHTS data from eight PubChem assays captured in ChEMBL
• The assays have very different numbers of actives in them, so to get started we’ll use two at different ends of the spectrum
16.
The datasets we’ll be working with
• Assay CHEMBL1614166 (PubChem BioAssay. qHTS Assay for Inhibitors of MBNL1-poly(CUG) RNA binding. (Class of assay: confirmatory))
– https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL1614166/
– https://pubchem.ncbi.nlm.nih.gov/bioassay/2675
• 34018 inactives, 98 actives (using the annotations from PubChem)
17.
Nature of the actives (CHEMBL1614166)
18.
Nature of the actives (CHEMBL1614166)
19.
The datasets we’ll be working with
• Assay CHEMBL1614421 (PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation, Thioflavin T Binding. (Class of assay: confirmatory))
– https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL1614421/
– https://pubchem.ncbi.nlm.nih.gov/bioassay/1460
• 43345 inactives, 5602 actives (using the annotations from PubChem)
20.
Model building
• Data Preparation
– Making it machine-useable
– Cleanup
– Feature engineering
• Modeling
– The cool ML/AI stuff
Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
21.
Data Preparation
• Structures are taken from ChEMBL
– Already some standardization done
– Processed with RDKit
• Fingerprints: RDKit Morgan-2, 2048 bits
22.
Modeling
• Stratified 80-20 training/holdout split
• KNIME random forest classifier
– 500 trees
– Max depth 15
– Min node size 2
This is a first pass through the cycle; we will try other fingerprints, learning algorithms, and hyperparameters in future iterations
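The stratified 80-20 split can be illustrated with a minimal, standard-library sketch. This is not the KNIME partitioning node used in the talk; the function name `stratified_split`, the seed, and the rounding rule are assumptions made for illustration. Only the class counts come from the CHEMBL1614166 slide.

```python
import random
from collections import defaultdict


def stratified_split(labels, holdout_fraction=0.2, seed=42):
    """Return (train_idx, holdout_idx), keeping class ratios the same in both parts.

    Hypothetical helper: each class is shuffled and split separately, so a rare
    class cannot end up entirely in one partition by chance.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, holdout_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_holdout = round(len(indices) * holdout_fraction)
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)


# Class counts from the CHEMBL1614166 slide: 34018 inactives (0), 98 actives (1).
labels = [0] * 34018 + [1] * 98
train_idx, holdout_idx = stratified_split(labels)
n_holdout_actives = sum(labels[i] for i in holdout_idx)
print(len(train_idx), len(holdout_idx), n_holdout_actives)
```

With only 98 actives, an unstratified random split could leave the holdout set with very few of them, which would make any holdout metric for the active class unstable.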
23.
Evaluation
• Does the model work?
• Does it actually solve the problem?
• Was the problem well posed?
• Is it implying data problems?
Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
24.
Evaluation
• AUROC, overall accuracy and Cohen’s kappa on the holdout data
Many, many, many options here. I’m using global metrics because in the end I want to use the “active/inactive” predictions made by the model
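For reference, overall accuracy and Cohen's kappa can both be computed from the four cells of a binary confusion matrix. The sketch below uses the textbook definitions; the function name and the example counts are hypothetical and are not results from the talk.

```python
def accuracy_and_kappa(tp, fp, fn, tn):
    """Overall accuracy and Cohen's kappa for a binary confusion matrix.

    tp/fn count true actives predicted active/inactive;
    fp/tn count true inactives predicted active/inactive.
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: for each class, (fraction predicted) * (fraction true).
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        # Degenerate case: only one class present in both truth and predictions.
        return observed, 0.0
    return observed, (observed - expected) / (1.0 - expected)


# Hypothetical holdout counts for a heavily imbalanced assay (not from the talk).
acc, kappa = accuracy_and_kappa(tp=2, fp=1, fn=18, tn=6803)
print(f"accuracy={acc:.4f} kappa={kappa:.3f}")
```

Kappa corrects the observed agreement for the agreement expected by chance, which is why it can stay low even when accuracy looks excellent.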
25.
Using
• Deployment
– How do you actually use the model?
– How do you keep it up to date?
– How do you get people to accept the results?
Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
26.
Deployment: technical
• Easy since I’m using KNIME
• Deploy as a web service
– Easy to validate/test
• Automated rebuild/re-evaluate when new data are available
27.
Deployment: practical
• Providing “active/inactive” classifications and predicted probabilities likely not enough
• Similar compounds from training set?
• Applicability domain?
• Conformal prediction?
• “Explanation” of the prediction (i.e. similarity maps)?
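One way to surface "similar compounds from the training set" alongside a prediction is a nearest-neighbour lookup on the fingerprints. The sketch below assumes Tanimoto similarity over sets of on-bit indices, a common choice for bit-vector fingerprints but not something the slides specify; the compound names and bit sets are invented for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def nearest_training_compounds(query_bits, training_set, k=3):
    """Return the k most similar training compounds as (similarity, name) pairs."""
    scored = [(tanimoto(query_bits, bits), name) for name, bits in training_set.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical fingerprints; real ones would have up to 2048 possible bits.
training_set = {
    "cmpd_A": {1, 5, 9, 42},
    "cmpd_B": {1, 5, 77},
    "cmpd_C": {300, 301},
}
query = {1, 5, 9, 100}
neighbors = nearest_training_compounds(query, training_set, k=2)
print(neighbors)
```

Showing the closest training compounds with their measured activity gives a chemist something concrete to judge the prediction against.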
28.
Results
29.
Evaluation CHEMBL1614166: holdout data
30.
Evaluation CHEMBL1614166: test data
AUROC=0.72
31.
Results CHEMBL1614421: holdout data
32.
Evaluation CHEMBL1614421: holdout data
AUROC=0.75
33.
Taking stock
• Both models have:
– Good overall accuracies (because of imbalance)
– Decent AUROC values
– Terrible Cohen kappas
Now what?
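A short worked calculation shows why imbalance alone produces good overall accuracy. It uses the full-dataset class counts quoted on the dataset slides and a hypothetical model that predicts "inactive" for every compound; here p_o is the observed agreement and p_e the agreement expected by chance.

```latex
\begin{align*}
\text{accuracy}_{\text{CHEMBL1614166}} &= \frac{34018}{34018 + 98} = \frac{34018}{34116} \approx 0.997 \\
\text{accuracy}_{\text{CHEMBL1614421}} &= \frac{43345}{43345 + 5602} = \frac{43345}{48947} \approx 0.886 \\
\kappa &= \frac{p_o - p_e}{1 - p_e}, \qquad
p_e = 1 \cdot p_{\text{inactive}} + 0 \cdot p_{\text{active}} = p_o
\;\Rightarrow\; \kappa = 0
\end{align*}
```

A model that never finds an active therefore reaches roughly 99.7% and 88.6% accuracy on these two assays while having a kappa of zero, so accuracy by itself says little here.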
34.
Let’s get artisanal…
35.
Quick diversion on bag classifiers
When making predictions, each tree in the classifier votes on the result. Majority wins.
The predicted class probabilities are often the means of the predicted probabilities from the individual trees.
We construct the ROC curve by sorting the predictions in decreasing order of predicted probability of being active.
Note that the actual predictions are irrelevant for an ROC curve. As long as true actives tend to have a higher predicted probability of being active than true inactives, the AUC will be good.
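The point that the AUC depends only on the ranking can be made concrete with a small sketch. It computes the AUROC as the fraction of (active, inactive) pairs in which the active has the higher predicted probability, which is equivalent to the area under the ROC curve. The scores and labels are invented for illustration.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (active, inactive) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of compounds,
    so this is meant for illustration rather than large datasets.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Hypothetical predicted probabilities of being active; none reaches 0.5.
scores = [0.30, 0.22, 0.10, 0.08, 0.05, 0.02]
labels = [1, 1, 0, 0, 0, 0]
predicted_active = [s >= 0.5 for s in scores]

print(auroc(scores, labels))   # every active outranks every inactive
print(any(predicted_active))   # but no compound is predicted active at 0.5
```

In this toy case the ranking is perfect even though the default decision rule labels every compound inactive, which is the combination of decent AUROC and poor kappa seen in the results.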
36.
Handling imbalanced data
• The standard decision rule for a random forest (or any bag classifier) is that the majority wins¹, i.e. the predicted probability of being active must be >= 0.5 in order for the model to predict "active"
• Shift that threshold to a lower value for models built on highly imbalanced datasets²
¹ This is only strictly true for binary classifiers
² Chen, J. J., et al. “Decision Threshold Adjustment in Class Prediction.” SAR and QSAR in Environmental Research 17 (2006): 337–52.
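As a minimal sketch of the shifted decision rule, the function below averages per-tree probabilities and compares the mean with a configurable threshold. The function name and the ten-tree example are hypothetical; a real forest in this talk has 500 trees.

```python
def predict_active(tree_probs, threshold=0.5):
    """Average the per-tree probabilities of 'active' and apply the threshold.

    Returns (is_active, mean_probability). With threshold=0.5 this matches
    the usual majority rule for a binary classifier.
    """
    mean_prob = sum(tree_probs) / len(tree_probs)
    return mean_prob >= threshold, mean_prob


# Hypothetical compound: 3 of 10 trees vote for "active".
tree_probs = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

default_call, prob = predict_active(tree_probs)                 # threshold 0.5
shifted_call, _ = predict_active(tree_probs, threshold=0.2)     # lowered threshold
print(prob, default_call, shifted_call)
```

Lowering the threshold changes only the final label; the forest and its predicted probabilities are untouched, so the ROC curve and AUROC stay the same.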
37.
Picking a new decision threshold
• Generate a random forest for the dataset using the training set
• Generate out-of-bag predicted probabilities using the training set
• Try a number of different decision thresholds¹ and pick the one that gives the best kappa
• Once we have the decision threshold, use it to generate predictions for the test set.
¹ Here we use: [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
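The threshold-selection steps above can be sketched in plain Python as follows. The threshold grid is the one listed on the slide; the out-of-bag probabilities and labels are invented, and the helper names are assumptions rather than the code behind the KNIME workflow or the RDKit blog post.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels encoded as 0 (inactive) and 1 (active)."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = sum(
        (y_true.count(cls) / n) * (y_pred.count(cls) / n) for cls in (0, 1)
    )
    return 0.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)


def pick_threshold(oob_probs, y_true, thresholds):
    """Return (threshold, kappa) for the candidate giving the highest kappa.

    On ties the lowest threshold listed first wins.
    """
    best = None
    for t in thresholds:
        y_pred = [1 if p >= t else 0 for p in oob_probs]
        k = cohen_kappa(y_true, y_pred)
        if best is None or k > best[1]:
            best = (t, k)
    return best


THRESHOLDS = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]

# Hypothetical out-of-bag probabilities for a small imbalanced training set.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
oob_probs = [0.42, 0.31, 0.12, 0.22, 0.08, 0.05, 0.04, 0.03, 0.02, 0.02, 0.01, 0.0]

threshold, kappa = pick_threshold(oob_probs, y_true, THRESHOLDS)
print(threshold, round(kappa, 3))
```

Because the threshold is chosen from out-of-bag predictions on the training set, the test set stays untouched until the final evaluation, where the selected threshold is simply applied to the test-set probabilities.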
38.
Results CHEMBL1614166
• Balanced confusion matrix
Previously 0.181
39.
Results CHEMBL1614421
• Balanced confusion matrix
Previously 0.005
40.
Does it work in general?
ChEMBL data, random-split validation
41.
Does it work in general?
Proprietary data, time-split validation
42.
Coming back to validation
• CHEMBL1614166:
– Overall accuracy: 99.8%
– Kappa: 0.53
– AUROC: 0.72
• CHEMBL1614421:
– Overall accuracy: 89.6%
– Kappa: 0.30
– AUROC: 0.75
43.
Wrapping up
Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
44.
Maybe useful…
• “Practical Machine Learning Canvas”
45.
Data/Scripts
• KNIME workflow for adjusting the decision threshold: https://kni.me/w/HRDmzyQy0UL0k7H2
• RDKit blog post about adjusting the decision threshold (includes links to code): http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html
• Practical ML Canvas: https://bit.ly/2JLLsRC
46.
Acknowledgements
• Dean Abbott (Abbott Analytics)
• KNIME:
– Daria Goldmann
– Rosaria Silipo
• NIBR:
– Nik Stiefl
– Nadine Schneider
– Niko Fechner
For more amazing car pictures: do an image search for “rat rod”