• Features are acid, base, hydrogen bond donor, acceptor, hydrophobe, aromatic attachment, aliphatic attachment and halogen. Definitions are highly engineered.†
• Feature 1 - topological distance - Feature 2
• Engineered for chemical relevance: features can be superimposed or directly linked, e.g. a group can be both a hydrogen bond acceptor and a base
• A bit identifies a pharmacophore pair, e.g. Aromatic - 3 bonds - Base
• Used as unfolded 280-bit fingerprints
• Regression Forest as ML method
• Build models with 10-fold CV; report CV Pearson's R² and CV RMSE
• Build RF error model to generate a predicted error for each compound using the same descriptors
†Taylor, R.; Cole, J. C.; Cosgrove, D. A.; Gardiner, E. J.; Gillet, V. J.; Korb, O. J Comput Aided Mol Des 2012, 26 (4), 451–472.
†Acid and base definitions are SMARTS patterns (MedChemica definitions): acids include C, N and heteroaromatic acids; bases exclude weak aniline bases and include amidines and guanidines.
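The pharmacophore pair bits described above can be illustrated with a minimal sketch. It assumes atoms have already been labelled with features by hand (the actual SMARTS definitions are not reproduced here), uses a toy four-atom chain, and treats superimposed features on one atom as a pair at distance 0, which is an assumption of this sketch. The function names are hypothetical, and how the poster arrives at 280 bits is not shown.

```python
from collections import deque
from itertools import combinations


def bond_distances(adjacency, start):
    """Breadth-first search: number of bonds from `start` to each reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return dist


def pharmacophore_pair_keys(adjacency, atom_features):
    """Return the set of (feature, bonds, feature) keys found in one molecule.

    adjacency: {atom index: [neighbouring atom indices]}
    atom_features: {atom index: set of feature names} (hand-assigned here)
    """
    keys = set()
    labelled = sorted(atom_features)
    for a, b in combinations(labelled, 2):
        dist = bond_distances(adjacency, a)
        if b not in dist:
            continue  # atoms in disconnected fragments have no topological distance
        for fa in atom_features[a]:
            for fb in atom_features[b]:
                f1, f2 = sorted((fa, fb))  # order-independent key
                keys.add((f1, dist[b], f2))
    # Assumption: two features on the same atom count as a pair at distance 0.
    for a in labelled:
        for fa, fb in combinations(sorted(atom_features[a]), 2):
            keys.add((fa, 0, fb))
    return keys


# Toy molecule: a chain 0-1-2-3 with an aromatic attachment at atom 0 and
# an atom 3 that is both a base and a hydrogen bond acceptor.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
features = {0: {"Aromatic"}, 3: {"Base", "Acceptor"}}
keys = pharmacophore_pair_keys(chain, features)
```

Because the fingerprint is unfolded, each key maps to exactly one bit, so a feature importance from the model can be read back directly as a pair such as Aromatic - 3 bonds - Base.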
Regression forest models
| Strategy | Number of compounds generated | Number of matches to D2 known set | Maximum pIC50 (actual) | Maximum pIC50 (predicted [error]) |
|---|---|---|---|---|
| Hit-to-Lead | 682 | 10 | 7.8 | 5.5 [0.21] |
| Dopamine class | 469 | 8 | 7.9 | 5.5 [0.23] |
| Solubility | 10148 | 10 | 7.8 | 5.5 [0.21] |
| Metabolism | 12729 | 19 | 7.9 | 5.5 [0.21] |
| Permutative MMPA (env = 4) | 5 | 3 | 7.9 | 6.1 [?] |
Accelerating lead optimisation with active learning by exploiting MMPA based
ADMET knowledge with regression forest potency models
A. G. Dossetter•, E. Griffen•, A. Leach•+, P. de Sousa•
• MedChemica Ltd, Macclesfield, UK; + Pharmacy and Biomolecular Sciences, Liverpool John Moores University
Problem
How can we reduce the number of compounds made in going from a small set of confirmed hits to
compounds we can test in vivo? For example: can we go from 30 hits to potent in vivo available leads in 10
rounds of synthesizing 30 compounds?
Learning
Combining focused generative approaches with explainable QSAR models shows initial promise.
The pinch point is the second set of compounds.
MedChemica
contact@medchemica.com
Approach Case Study
Dopamine D2 dataset
• Well-studied target, ligand-based design
• >5200 measured compounds known
• Simulate hit optimization process
• Use known compounds as validation
The Startpoints
30 compounds: 5 <= pIC50 <= 6, -1 < AlogP < 3.5, selected by LLE (lipophilic ligand efficiency) sort
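A minimal sketch of this startpoint selection follows. It assumes LLE is computed as pIC50 minus AlogP and that "LLE sort" means taking the highest-LLE compounds first; the poster does not spell out either detail. The compound records and the function name are invented for illustration.

```python
def select_startpoints(compounds, n=30):
    """Apply the potency and lipophilicity windows, then rank by LLE."""
    in_window = [
        c for c in compounds
        if 5.0 <= c["pIC50"] <= 6.0 and -1.0 < c["AlogP"] < 3.5
    ]
    # Assumption: LLE = pIC50 - AlogP, higher is better.
    ranked = sorted(in_window, key=lambda c: c["pIC50"] - c["AlogP"], reverse=True)
    return ranked[:n]


# Invented toy data, not from the D2 data set.
toy_compounds = [
    {"id": "A", "pIC50": 5.5, "AlogP": 2.0},
    {"id": "B", "pIC50": 5.9, "AlogP": 3.4},
    {"id": "C", "pIC50": 6.5, "AlogP": 1.0},   # too potent to count as a hit
    {"id": "D", "pIC50": 5.2, "AlogP": -0.5},
    {"id": "E", "pIC50": 5.8, "AlogP": 4.0},   # too lipophilic
]
startpoints = [c["id"] for c in select_startpoints(toy_compounds, n=2)]
```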
Generate virtual compounds from MedChemica Knowledge database
• Hit-to-Lead transformations: the most frequently used medicinal chemistry transformations
• ADMET transformations for metabolism and solubility
• Target class transformations learning from target analogues
Permutative MMPA
• Generate compounds from data already gained
Regression forest models
• Accurate pharmacophore features with topological distance
• Unfolded fingerprints connect feature importance to pharmacophores
• Error models give accuracy of prediction for each compound
Active Learning
• Explore from predicted high potency, high error
• Exploit from predicted high potency, low error
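The explore and exploit picks can be sketched as below. This is one possible reading, not the authors' selection code: the potency cutoff, the batch sizes, and the tie-breaking rule (lowest error first, then highest potency) are assumptions, and the candidate values are invented.

```python
def select_batch(candidates, n_explore, n_exploit, potency_cutoff):
    """Split a synthesis batch into explore and exploit picks.

    candidates: list of (compound id, predicted pIC50, predicted error).
    """
    # Both modes start from compounds predicted to be potent.
    potent = [c for c in candidates if c[1] >= potency_cutoff]

    # Explore: among potent predictions, take the highest predicted error.
    explore = sorted(potent, key=lambda c: c[2], reverse=True)[:n_explore]
    chosen = {c[0] for c in explore}

    # Exploit: among the rest, take the lowest predicted error,
    # breaking ties by higher predicted potency (an assumption).
    remaining = [c for c in potent if c[0] not in chosen]
    exploit = sorted(remaining, key=lambda c: (c[2], -c[1]))[:n_exploit]

    return [c[0] for c in explore], [c[0] for c in exploit]


# Invented predictions for five virtual compounds.
toy_candidates = [
    ("v1", 6.8, 0.9),
    ("v2", 6.5, 0.2),
    ("v3", 5.1, 1.2),  # below the potency cutoff, never selected
    ("v4", 6.9, 0.3),
    ("v5", 6.2, 0.7),
]
explore_ids, exploit_ids = select_batch(
    toy_candidates, n_explore=1, n_exploit=2, potency_cutoff=6.0
)
```

Keeping the two counts as separate parameters makes it easy to shift the explore-to-exploit ratio from round to round.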
• Take all compounds in a data set
• Find all matched pairs; extract ΔpIC50 and the transforms between them
• Aggregate transformations with median ΔpIC50 and count of pairs
• Apply all transformations back to the initial data set (at what environment level?)
• Predicted pIC50 = substrate pIC50 + median ΔpIC50
• Remove existing compounds
• Prioritise new compounds by pIC50 estimate
Permutative MMPA
[Diagram: M1 → M2 and M3 → M4 both via transform t1; applying t1 to M5 gives M*]
• M1 → M2: transform t1
• M3 → M4: transform t1
• M5 matches t1 and generates M*
• Predict pIC50: pIC50(M*) = pIC50(M5) + median ΔpIC50(t1)
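The M1 to M5 example can be worked through with a toy sketch. It assumes each molecule is already split into a (core, substituent) pair, so a matched pair is simply two molecules sharing a core; real matched-pair finding needs a fragmentation step and an environment level, both omitted here. All names and pIC50 values are invented.

```python
from collections import defaultdict
from itertools import permutations
from statistics import median


def permutative_mmpa(data):
    """data: {(core, substituent): pIC50}. Return ranked predictions for new molecules."""
    # 1. Find matched pairs (same core, different substituent) and their delta pIC50.
    deltas = defaultdict(list)
    for (mol_a, p_a), (mol_b, p_b) in permutations(data.items(), 2):
        if mol_a[0] == mol_b[0] and mol_a[1] != mol_b[1]:
            deltas[(mol_a[1], mol_b[1])].append(p_b - p_a)

    # 2. Aggregate each transform as (median delta pIC50, number of pairs).
    stats = {t: (median(d), len(d)) for t, d in deltas.items()}

    # 3. Apply every transform back to every matching substrate.
    predictions = {}
    for (core, sub), pic50 in data.items():
        for (sub_from, sub_to), (med, count) in stats.items():
            product = (core, sub_to)
            if sub != sub_from or product in data:
                continue  # transform does not match, or compound already exists
            estimate = pic50 + med
            # Assumption: if several routes give the same product, keep the highest estimate.
            if product not in predictions or estimate > predictions[product][0]:
                predictions[product] = (estimate, count)

    # 4. Prioritise new compounds by estimated pIC50.
    return sorted(predictions.items(), key=lambda kv: kv[1][0], reverse=True)


toy_data = {
    ("coreA", "H"): 5.2,  # M1
    ("coreA", "F"): 5.9,  # M2, so t1 = H -> F with delta 0.7
    ("coreB", "H"): 5.5,  # M3
    ("coreB", "F"): 6.0,  # M4, t1 again with delta 0.5
    ("coreC", "H"): 5.8,  # M5, matches t1
}
ranked = permutative_mmpa(toy_data)
```

Here the only new molecule is M* = ("coreC", "F"), predicted at roughly 5.8 + 0.6 = 6.4 from two supporting pairs.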
[Diagram: substrate molecules and the MedChemica transformation database feed a generator that outputs virtual molecules]
Generate molecules from Knowledge Database
• Hit-to-Lead transformations: 689 transformations with >= 250 example pairs
• Dopamine receptor transformations (not D2!): 1027 transformations
• Solubility: 6320 transformations
• Metabolism: 12719 transformations
Generating new structures is not an issue…
Conclusions
• Good starting points are key(!)
• There is no free lunch – good models need data
• Make best use of the data you already have – focused permutative MMPA finds SAR you may have missed by eye
• Target class based enumeration is most efficient, but still need a better method for round 2 synthesis
• The first set of compounds after the hits is critical if you want to move fast…
Experiment: Fully automated active learning
• Build RF model: CV-R2 -0.26 on a small data set. Is it useful?
• Enumerate from all compounds:
• What’s the best enumeration strategy?
• How to pick the (few) compounds to make from the enumerated set?
90% of predictions within 0.5 log of measured
• Enumeration generates high-potency compounds, but early models are too coarse to correctly prioritize the best small set for synthesis, either by high error or by high potency
7.9!
• Permutative MMPA with a tight definition of the MMPA environment generates an excellent first set of follow-up compounds, learning from the SAR within the hits
• The second batch of compounds is more of a challenge…
Most potent compound (measured) from HtL enumeration
Active Learning
[Flow diagram: Hits → Build model with error estimates → Enumerate → Select for Explore and Exploit → Synthesise & Test → Compounds with data → Compounds meet criteria? (Yes / No)]
Explore: prioritize high error
Exploit: prioritize high potency & low error
Ratio of explore to exploit varies with stage
Select enumeration strategy by stage: Hit-to-Lead, target class, solubility, metabolism
For in silico simulation, match to known and measured compounds
