Biological, chemical and physical properties of molecules are encoded in their molecular structure. The challenge lies in discovering the relationships between the molecular graphs and the measured activity. Where data is measured, collected and curated for a series of compounds there is an opportunity to find the hidden relationships.
Chemical structures come in various shapes and sizes, depending on the scientists or even algorithms that create them. Although this variability may seem subtle to a trained chemist’s eye, it can introduce inconsistencies that impair chemical search algorithms and model building. Structure normalization is a key component of any cheminformatics workflow, and its significance is often underestimated. Finding relationships between chemical structures and their measured properties relies primarily on the representation of the chemical matter. Variability in the features and descriptors calculated for these representations can influence data analysis and the accuracy of predictions. In the first part of the presentation we show the effect of chemical normalization on investigating correlations and building predictive models.
The second part of the talk covers the results of model building on 163 ChEMBL targets extracted from the bioactivity benchmark set (Lenselink et al., 2017). Results with different descriptor generation methods, including ECFP fingerprints, MACCS keys, structural properties, geometry properties and phys-chem properties, will be discussed in detail. This part summarizes the results of more than 3000 Random Forest models.
Finally, model development for ADMET targets will be highlighted, including hERG cardiotoxicity prediction, permeability and blood-brain barrier penetration. We will describe how these models can be built, analyzed, optimized and deployed using our new machine learning platform.
7. Effect of standardization
- Simple descriptors (Mw, Fsp3, HBDA, etc.)
- Phys-chem (logD, pKa)
- Molecular graph, fingerprints
Examples: salts, solvates (imipramine pamoate); tautomerism (furan-2-ol)
“Overall and despite our efforts to use open software wherever possible, we find that ChemAxon Tautomers node outperforms the other approaches we tested.”
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00606-7
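To make the salt and solvate case concrete, here is a minimal sketch of one common standardization step: keeping only the largest fragment of a multi-component SMILES string. It is not the standardizer used in the talk. The function names `heavy_atom_count` and `strip_counter_ions` are hypothetical, the atom counting is a rough regular-expression heuristic rather than a real SMILES parser, and charges are not neutralized.

```python
import re

# Crude heavy-atom tokenizer for SMILES (an assumption, not a full parser):
# bracket atoms first, then two-letter halogens, then the single-letter
# organic subset in aliphatic and aromatic form.
_ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
_HYDROGEN = re.compile(r"\[\d*H[+-]?\d*\]")


def heavy_atom_count(fragment: str) -> int:
    """Approximate number of non-hydrogen atoms in one SMILES fragment."""
    atoms = _ATOM.findall(fragment)
    return sum(1 for atom in atoms if not _HYDROGEN.fullmatch(atom))


def strip_counter_ions(smiles: str) -> str:
    """Keep the largest dot-separated fragment (first one wins on ties)."""
    fragments = smiles.split(".")
    return max(fragments, key=heavy_atom_count)


# Sodium acetate: the sodium counter-ion is dropped, the acetate is kept.
print(strip_counter_ions("CC(=O)[O-].[Na+]"))
# Ethanol written with a water molecule: the water is dropped.
print(strip_counter_ions("O.CCO"))
```

Descriptors such as Mw or a fingerprint computed on the stripped fragment no longer depend on which counter-ion or solvent happened to be drawn. Tautomer handling needs a real cheminformatics toolkit and is outside this sketch.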
8. Small molecule retention time (SMRT) dataset: Tautomerization
https://www.nature.com/articles/s41467-019-13680-7
9. SMRT Tautomer example, edge case
Training set: single tautomer random 7k cases
Test set: tautomerization affected 252 cases
Run with and without tautomerization
13. Activity dataset: the ‘ChEMBL bioactivity benchmark set’
Data source: Journal of Cheminformatics, 9, 45 (2017) by Eelke B. Lenselink, Niels ten Dijke, Brandon Bongers, George Papadatos, Herman W. T. van Vlijmen, Wojtek Kowalczyk, Adriaan P. IJzerman, Gerard J. P. van Westen
- ChEMBL database (version 20)
- Activities were selected that met the following criteria:
- at least 30 compounds tested per protein and from at least 2 separate publications
- assay confidence score of 9
- ‘single protein’ target type
- assigned pCHEMBL value
- no flags on potential duplicate or data validity comment
- originating from scientific literature
- data points with activity comments ‘not active’, ‘inactive’, ‘inconclusive’, and ‘undetermined’ were removed
- MED value was chosen
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0232-0
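The selection criteria listed on this slide can be sketched as a record filter followed by aggregation. This is an illustration, not the benchmark authors' code: the dictionary keys (`confidence_score`, `target_type`, `pchembl_value` and so on) are hypothetical field names, and reading "MED value" as the median over repeated measurements of the same compound-target pair is an assumption.

```python
from collections import defaultdict
from statistics import median

EXCLUDED_COMMENTS = {"not active", "inactive", "inconclusive", "undetermined"}


def passes_filters(rec: dict) -> bool:
    """Apply the per-record criteria; all field names are hypothetical."""
    return (
        rec["confidence_score"] == 9
        and rec["target_type"] == "SINGLE PROTEIN"
        and rec["pchembl_value"] is not None
        and not rec["potential_duplicate"]
        and rec["data_validity_comment"] is None
        and rec["source"] == "Scientific Literature"
        and (rec["activity_comment"] or "").lower() not in EXCLUDED_COMMENTS
    )


def curate(records, min_compounds=30, min_documents=2):
    """Return {target: {compound: median pChEMBL}} for targets with enough data."""
    values_by_pair = defaultdict(list)
    documents_by_target = defaultdict(set)
    for rec in filter(passes_filters, records):
        pair = (rec["target_id"], rec["compound_id"])
        values_by_pair[pair].append(rec["pchembl_value"])
        documents_by_target[rec["target_id"]].add(rec["document_id"])

    # Assumed reading of "MED value": median over repeated measurements.
    compounds_by_target = defaultdict(dict)
    for (target, compound), values in values_by_pair.items():
        compounds_by_target[target][compound] = median(values)

    return {
        target: compounds
        for target, compounds in compounds_by_target.items()
        if len(compounds) >= min_compounds
        and len(documents_by_target[target]) >= min_documents
    }
```

The thresholds default to the values on the slide (at least 30 compounds per protein, from at least 2 publications) and can be lowered for experimenting with small toy inputs.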
14. Application Study on ChEMBL
- Data points in range: 500-4703 (med:776)
- 161 ChEMBL targets, pAct
- Sorted by Document Year, last 30 points reserved as External set: Ext
[Diagram: last 30 points form Ext]
15. Application Study on ChEMBL
- Data points in range: 500-4703 (med:776)
- 161 ChEMBL targets, pAct
- Sorted by Document Year, last 30 points reserved as External set: Ext
- 10-90% test-training set split: Test
- ~160k total training size
- ~18k total test size
- 20 different descriptor configurations
- Random Forest
[Diagram: random 90% Train, random 10% Test, last 30 points Ext]
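The splitting scheme on this slide can be sketched per target as follows. This is a minimal illustration under stated assumptions: the function name `temporal_then_random_split`, the `document_year` key and the fixed seed are hypothetical, and ties in document year are resolved by input order because Python's sort is stable.

```python
import random


def temporal_then_random_split(records, n_external=30, test_fraction=0.10, seed=0):
    """Reserve the most recent records as Ext, then split the rest at random.

    `records` is a list of dicts with a hypothetical "document_year" key.
    Returns (train, test, external).
    """
    ordered = sorted(records, key=lambda rec: rec["document_year"])
    cut = len(ordered) - n_external
    external = ordered[cut:]      # last 30 points by document year
    remaining = ordered[:cut]

    rng = random.Random(seed)     # fixed seed keeps the split reproducible
    rng.shuffle(remaining)
    n_test = round(len(remaining) * test_fraction)
    test, train = remaining[:n_test], remaining[n_test:]
    return train, test, external


# Toy target with 500 data points spread over 25 document years.
toy = [{"id": i, "document_year": 1990 + i % 25} for i in range(500)]
train, test, ext = temporal_then_random_split(toy)
print(len(train), len(test), len(ext))
```

The external set imitates prospective use, since every Ext compound comes from a document at least as recent as anything in Train or Test, while the random Test set measures interpolation within known chemistry.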
...
20. Is it a hard task? Original results random split
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0232-0
“The best method overall is the DNN_MC with an MCC of 0.57 (±0.07)”
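For readers unfamiliar with the metric in the quote, the Matthews correlation coefficient (MCC) is computed from the four cells of a binary confusion matrix and ranges from -1 to +1, with 0 corresponding to chance-level prediction. The sketch below uses the standard formula; the function name and the convention of returning 0.0 when the denominator vanishes are choices made for this example.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from confusion-matrix counts; returns 0.0 if undefined."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, chosen only to show the call.
print(round(matthews_corrcoef(tp=40, tn=45, fp=5, fn=10), 3))
```

Unlike plain accuracy, MCC stays informative when actives and inactives are imbalanced, which is common for bioactivity data.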
21. Is it a hard task? Original results temporal split
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0232-0
32. Conformal prediction
[Diagram labels: Training Set, Proper Training Set, Calibration set, Model, Error model, Error Prediction, P(80%) calibration factor (ɑ)]
https://www.jmlr.org/papers/volume9/shafer08a/shafer08a.pdf
https://pubs.acs.org/doi/10.1021/ci5001168
33. Conformal prediction
Test: 14233 / 17661 80.6% within the error bound
Ext: 3344 / 4890 68.4% within the error bound
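The calibration step behind these numbers can be sketched as split (inductive) conformal regression with normalized residuals. This is a generic illustration of the technique, not the Trainer Engine implementation: it assumes the model predictions and the error-model predictions for the calibration set are already available as plain lists, and all function names are hypothetical.

```python
import math


def calibration_factor(y_true, y_pred, err_pred, confidence=0.80):
    """Calibration factor (alpha) at the requested confidence level.

    Nonconformity score: |y - prediction| / predicted error, computed on
    the calibration set. Alpha is the ceil((n + 1) * confidence)-th
    smallest score.
    """
    scores = sorted(abs(y - p) / e for y, p, e in zip(y_true, y_pred, err_pred))
    n = len(scores)
    k = math.ceil((n + 1) * confidence)
    if k > n:
        raise ValueError("calibration set too small for this confidence level")
    return scores[k - 1]


def prediction_interval(pred, err, alpha):
    """Error bound around one prediction: pred +/- alpha * predicted error."""
    half_width = alpha * err
    return pred - half_width, pred + half_width


def coverage(y_true, y_pred, err_pred, alpha):
    """Fraction of measured values that fall within the error bound."""
    hits = sum(
        1 for y, p, e in zip(y_true, y_pred, err_pred) if abs(y - p) <= alpha * e
    )
    return hits / len(y_true)
```

If the calibration data and new data are exchangeable, coverage should land near the chosen confidence level, as it does for the random Test set above. The lower figure on Ext is consistent with the temporally split compounds differing from the calibration data.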
37. Drill down to the lowest performance set
TC_key: CHEMBL247 - CHEMBL2373969
BIOACT_PCHEMBL_VALUE: 7.54
TGT_CHEMBL_ID: CHEMBL247
TGT_ORGANISM: Human immunodeficiency virus 1
38. Feature engineering is the process of using domain knowledge to extract features from raw data.
44. Protonation de-tour: Fentanyl F-derivative: FF3
https://www.nature.com/articles/s41598-019-55886-1
No difference at pH 6.5
10x MOR IC50 difference at pH 7.4
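Why protonation matters here can be illustrated with the Henderson-Hasselbalch relation, which gives the protonated fraction of a basic amine at a given pH. The sketch below is a generic calculation, not taken from the slide or the cited paper, and the two pKa values are hypothetical placeholders chosen only to contrast a strongly basic amine with a weakly basic one.

```python
def fraction_protonated(pka: float, ph: float) -> float:
    """Protonated fraction of a monoprotic base (Henderson-Hasselbalch)."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Hypothetical pKa values for illustration only, not measured data.
for label, pka in [("strongly basic amine", 9.0), ("weakly basic amine", 7.0)]:
    for ph in (6.5, 7.4):
        share = fraction_protonated(pka, ph)
        print(f"{label}, pKa {pka}: {share:.1%} protonated at pH {ph}")
```

A compound whose pKa lies between the two pH values changes its protonation state substantially across that range, while a strongly basic one barely changes. This is why pKa-aware descriptors can carry signal that purely graph-based features miss.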
45. Statistical Assessment of the Modeling of Proteins and Ligands, SAMPL6 challenge logP
https://chemaxon.com/news/cxn-logp-prediction-sampl-6
66. Discovery teams
Fill the gap
[Diagram labels: Production Models, Design Hub, Services, Series, H1 H2 H3 H4, Trainer GUI, Training / Analysis, Comp. Chem, Trainer Engine, REST API]
67. Discovery teams
Multi parameter optimization
[Diagram labels: Production Models, Design Hub, Services, Series, H1 H2 H3 H4, Trainer GUI, Training / Analysis, Comp. Chem, Trainer Engine]
70. Translate data to reliable models
Centralize model management
Connect project team members and resources
Track and manage discovery
Design Hub
Lower the barrier to adopt AI models in design
Trainer Engine
71. Take away
- Chemical standardization to reduce noise
- Role of protonation and partitioning descriptors
- Successful model building on large and diverse set of targets
- ML inference, delivery to medicinal chemists