Gregory Landrum
NIBR Informatics, Basel
Novartis Institutes for BioMedical Research
10th International Conference on Chemical Structures/
10th German Conference on Chemoinformatics
Large scale classification of chemical
reactions from patent data
Outline
§ Public data sources and reactions
§ Fingerprints for reactions
§ Validation:
•  Machine learning
•  Clustering
§ Application: models for predicting yield
Public data sources in cheminformatics
an aside at the beginning
§ Publicly available data sources for small molecules and
their biological activities/interactions:
•  PDB, PubChem, ChEMBL, etc.
§ Publicly available data sources for the chemistry behind
how those molecules were actually made (i.e. reactions):
•  pretty much nothing until recently
§ Plenty of data is locked up in large commercial databases
and pharmaceutical companies’ ELNs; very, very little is in
the open
The “public/open” point is important for
collaboration and reproducibility
A large, public source of chemical reactions
Not just what we made, but how we made it
§ Text-mining applied to open patent data to extract chemical reactions: 1.12 million reactions [1]
§ Reactions classified using namerxn, when possible, into 318 standard types: >599,000 classified reactions [2]
[1] Lowe DM: “Extraction of chemical structures and reactions from the literature.” PhD thesis. University of Cambridge: Cambridge, UK; 2012.
[2] Reaction classification from Roger Sayle and Daniel Lowe (NextMove Software): http://nextmovesoftware.com/blog/2014/02/27/unleashing-over-a-million-reactions-into-the-wild/
More about the classes
Frequency of reaction classes (20 most common):
Count   Class   Name
44675   2.1.2   Carboxylic acid + amine reaction
39297   1.7.9   Williamson ether synthesis
28194   2.1.1   Amide Schotten-Baumann
26739   1.3.7   Chloro N-arylation
22400   1.6.2   Bromo N-alkylation
20465   7.1.1   Nitro to amino
20405   1.6.4   Chloro N-alkylation
17226   6.2.2   CO2H-Me deprotection
16602   6.1.1   N-Boc deprotection
16021   6.2.1   CO2H-Et deprotection
12952   1.2.1   Aldehyde reductive amination
12250   2.2.3   Sulfonamide Schotten-Baumann
10659   11.9    Separation
 8538   3.1.5   Bromo Suzuki-type coupling
 7261   1.7.7   Mitsunobu aryl ether synthesis
 7102   6.3.7   Methoxy to hydroxy
 7071   3.3.1   Sonogashira coupling
 6472   3.1.1   Bromo Suzuki coupling
 6383   1.8.5   Thioether synthesis
 5791   9.1.6   Hydroxy to chloro
Got the reactions, what about reaction fingerprints?
Criteria for them to be useful
§ Question 1: do they contain bits that are helpful in
distinguishing reactions from one another?
Test: can we use them with a machine-learning approach to build a
reaction classifier?
§ Question 2: are similar reactions similar according to the
fingerprints?
Test: do related reactions cluster together?
Our toolbox: the RDKit
§  Open-source C++ toolkit for cheminformatics
§  Wrappers for Python (2.x), Java, C#
§  Functionality:
•  2D and 3D molecular operations
•  Descriptor generation for machine learning
•  PostgreSQL database cartridge for substructure and similarity searching
•  Knime nodes
•  IPython integration
•  Lucene integration (experimental)
•  Supports Mac/Windows/Linux
§  Releases every 6 months
§  business-friendly BSD license
§  Code: https://github.com/rdkit
§  http://www.rdkit.org
Similarity and reactions
What are we talking about?
§  These two reactions are both type: “1.2.5 Ketone reductive amination”
It’s obvious that these are the same, right?
Got the reactions, what about reaction fingerprints?
Start simple: use difference fingerprints:
$$\mathrm{FP}_{\mathrm{Reacts}} = \sum_{i \in \mathrm{Reactants}} \mathrm{FP}_i$$
$$\mathrm{FP}_{\mathrm{Prods}} = \sum_{i \in \mathrm{Products}} \mathrm{FP}_i$$
$$\mathrm{FP}_{\mathrm{Rxn}} = \mathrm{FP}_{\mathrm{Prods}} - \mathrm{FP}_{\mathrm{Reacts}}$$
Similar idea here:
1) Ridder, L. & Wagener, M. SyGMa: Combining Expert Knowledge and Empirical Scoring in the Prediction of Metabolites. ChemMedChem 3, 821–832 (2008).
2) Patel, H., Bodkin, M. J., Chen, B. & Gillet, V. J. Knowledge-Based Approach to de Novo Design Using Reaction Vectors. J. Chem. Inf. Model. 49, 1163–1184 (2009).
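A minimal sketch of the difference fingerprint defined on this slide, assuming each molecule's fingerprint is a sparse count vector stored as a dict (feature id to count). The feature names in the toy example are made up for illustration; in practice the counts would come from a cheminformatics toolkit.

```python
from collections import Counter


def sum_fingerprints(fingerprints):
    """Sum sparse count fingerprints (feature id -> count) over several molecules."""
    total = Counter()
    for fp in fingerprints:
        total.update(fp)  # Counter.update adds counts key by key
    return total


def difference_fingerprint(reactant_fps, product_fps):
    """FP_Rxn = sum of product FPs minus sum of reactant FPs (zero entries dropped)."""
    reacts = sum_fingerprints(reactant_fps)
    prods = sum_fingerprints(product_fps)
    diff = {}
    for key in set(reacts) | set(prods):
        delta = prods.get(key, 0) - reacts.get(key, 0)
        if delta != 0:
            diff[key] = delta
    return diff


# Toy example with hypothetical feature names.
reactants = [{"C=O": 1, "C-C": 2}, {"N-H": 2, "C-N": 1}]
products = [{"C-N": 2, "C-C": 2, "N-H": 1}]
rxn_fp = difference_fingerprint(reactants, products)
print(rxn_fp)
```

Features that are unchanged by the reaction cancel out, so only the features created or destroyed remain in the result.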
Refine the fingerprints a bit
Text-mined reactions often include catalysts,
reagents, or solvents in the reactants
Explore two options for handling this:
1.  Decrease the weight of reactant molecules where too many
of the bits are not present in the product fingerprint
2.  Decrease the weight of reactant molecules where too many
atoms are unmapped
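Option 1 could look roughly like the sketch below. The slide does not give the threshold or the reduced weight, so `max_missing` and `low_weight` are placeholder assumptions, as are the function names and the toy feature names.

```python
def reactant_weight(reactant_fp, product_fp, max_missing=0.5, low_weight=0.1):
    """Down-weight a reactant when too many of its features are absent from the products.

    max_missing and low_weight are illustrative values, not taken from the slides.
    """
    if not reactant_fp:
        return low_weight
    missing = sum(1 for key in reactant_fp if key not in product_fp)
    if missing / len(reactant_fp) > max_missing:
        return low_weight  # probably a reagent, catalyst, or solvent
    return 1.0


def weighted_reactant_sum(reactant_fps, product_fp, **kwargs):
    """Sum the reactant fingerprints, scaling each one by its weight."""
    total = {}
    for fp in reactant_fps:
        weight = reactant_weight(fp, product_fp, **kwargs)
        for key, count in fp.items():
            total[key] = total.get(key, 0.0) + weight * count
    return total


# Toy example: the second "reactant" shares no features with the product.
product_fp = {"C-N": 2, "C-C": 2}
acid = {"C-C": 2, "C=O": 1}
solvent = {"Cl-C": 2, "Cl": 2}
print(weighted_reactant_sum([acid, solvent], product_fp))
```

Option 2 would follow the same pattern, with the fraction of unmapped atoms in each reactant replacing the fraction of missing features.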
Machine learning and chemical reactions
§ Validation set:
•  The 68 reaction types with at least 2000 instances from the patent
data set
-  “Resolution” reaction types removed (e.g. 11.9 Separation and 11.1 Chiral
separation)
-  Final: 66 reaction types
§ Process:
•  Training set is 200 random instances of each reaction type
•  Test set is 800 random instances of each reaction type
•  Learning: random forest (scikit-learn)
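The sampling step of this process can be sketched as below. It assumes the training and test instances of a class are disjoint, which the slide implies but does not state; the function name and the seed are illustrative.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train=200, n_test=800, seed=0):
    """Pick n_train training and n_test test indices at random from each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) < n_train + n_test:
            raise ValueError(f"class {label!r} has only {len(members)} instances")
        picked = rng.sample(members, n_train + n_test)  # sampling without replacement
        train.extend(picked[:n_train])
        test.extend(picked[n_train:])
    return train, test


# Toy labels: two classes with enough members each.
labels = ["1.2.5"] * 1000 + ["7.1.1"] * 1200
train_idx, test_idx = split_per_class(labels)
print(len(train_idx), len(test_idx))
```

The fingerprints at the training indices would then be passed to a random forest implementation such as scikit-learn's `RandomForestClassifier`.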
Learning reaction classes
Results for test data
Overall:
•  Recall: 0.94
•  Precision: 0.94
•  Accuracy: 0.94
For a 66-class classifier, this looks pretty good!
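For reference, the three numbers can be computed from a confusion matrix as sketched below. The slide does not say how recall and precision were averaged across the 66 classes; this sketch assumes an unweighted (macro) average, and the 2x2 matrix is made-up toy data.

```python
def summarize_confusion(matrix):
    """matrix[i][j] = number of test examples of true class i predicted as class j."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))

    recalls, precisions = [], []
    for k in range(n):
        true_k = sum(matrix[k])                        # row sum: examples of class k
        pred_k = sum(matrix[i][k] for i in range(n))   # column sum: predicted as k
        recalls.append(matrix[k][k] / true_k if true_k else 0.0)
        precisions.append(matrix[k][k] / pred_k if pred_k else 0.0)

    return {
        "accuracy": correct / total,
        "recall": sum(recalls) / n,        # macro average (assumption)
        "precision": sum(precisions) / n,  # macro average (assumption)
    }


stats = summarize_confusion([[8, 2], [1, 9]])
print(stats)
```

With equal-sized test sets per class, as used here (800 each), macro-averaged recall and overall accuracy coincide.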
Learning reaction classes
Confusion matrix for test data
[Figure: confusion matrix; labeled classes include Bromo Suzuki coupling, Bromo Suzuki-type coupling, and Bromo N-arylation]
~94% accuracy; much of the confusion is between related types
Clustering reactions
§ Reaction similarity validation set:
•  The 66 most common reaction types from the patent data set
•  Look at the homogeneity of clusters with at least 10 members
[Figure: three example clusters, each labeled “1.2.5 Ketone reductive amination”; plot label: “Integration”]
Interpretation: <30% of clusters are <90% homogeneous
Interpretation: <40% of clusters are <80% homogeneous
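A sketch of how cluster homogeneity might be measured, assuming homogeneity means the fraction of a cluster's members that belong to its most common reaction type. The slides do not define the term, so this definition and the toy clusters are assumptions.

```python
from collections import Counter


def cluster_homogeneity(cluster_labels):
    """Fraction of cluster members that share the cluster's most common label."""
    most_common_count = Counter(cluster_labels).most_common(1)[0][1]
    return most_common_count / len(cluster_labels)


def fraction_below(clusters, threshold, min_size=10):
    """Fraction of clusters with at least min_size members whose homogeneity is below threshold."""
    big = [c for c in clusters if len(c) >= min_size]
    if not big:
        return 0.0
    return sum(1 for c in big if cluster_homogeneity(c) < threshold) / len(big)


# Toy clusters: each list holds the reaction-type labels of one cluster.
clusters = [
    ["1.2.5"] * 10,                # fully homogeneous
    ["1.2.5"] * 8 + ["1.2.1"] * 2, # 80% homogeneous
    ["1.2.5"] * 3,                 # ignored: fewer than 10 members
]
print(fraction_below(clusters, 0.9), fraction_below(clusters, 0.8))
```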
Using the fingerprints
Can we help classify the remaining 600K reactions?
§  Apply the 66-class random forest to generate class predictions for the
unclassified reactions in order to find reactions we missed
§  Cluster the unclassified reactions, look for big clusters, and
(manually) assign classes to them.
§  Both of these approaches have been successful
Predicting yields
§  The data set includes text-mined yield information as well as
calculated yields.
§  For modeling: prefer the text-mined value, but take the calculated one
if that’s the only thing available
§  Look at stats for the 93 reaction classes that have at least 500
members with yields, a min yield > 0 and a max yield < 110 %:
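The selection rules above might be sketched as follows. This reads the yield limits as a filter on individual reactions (keep yields strictly between 0 and 110%), which is one possible interpretation of the slide; the record layout and function names are hypothetical.

```python
from collections import defaultdict


def choose_yield(text_mined, calculated):
    """Prefer the text-mined yield; fall back to the calculated one; None if neither exists."""
    return text_mined if text_mined is not None else calculated


def yields_by_class(records, low=0.0, high=110.0):
    """Group usable yields by reaction class.

    records: iterable of (reaction_class, text_mined_yield, calculated_yield).
    Filtering per reaction with low < yield < high is an assumption.
    """
    grouped = defaultdict(list)
    for rxn_class, text_mined, calculated in records:
        value = choose_yield(text_mined, calculated)
        if value is not None and low < value < high:
            grouped[rxn_class].append(value)
    return dict(grouped)


def well_populated(grouped, min_members=500):
    """Keep only classes with at least min_members usable yields."""
    return {c: ys for c, ys in grouped.items() if len(ys) >= min_members}


# Toy records (made-up numbers).
records = [
    ("7.1.1", 85.0, 80.0),   # text-mined value wins
    ("7.1.1", None, 60.0),   # only a calculated value
    ("7.1.1", None, None),   # no yield at all: skipped
    ("1.2.1", 150.0, 90.0),  # preferred value out of range: skipped
]
grouped = yields_by_class(records)
print(grouped, well_populated(grouped, min_members=2))
```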
Predicting yields
§  Look at the most populated classes:
Try building models for yield
§ Start with class 7.1.1 “nitro to amino”
§ Break into low-yield (<50%) and high-yield (>70%)
classes.
14% are low-yield
§ Try building a random forest using the atom-pair based
reaction fingerprints
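The low/high split can be written as a small labeling function. The sketch assumes that reactions with yields between 50% and 70% are left out of the two-class problem; the slide gives only the two cutoffs.

```python
def yield_label(yield_percent, low_cut=50.0, high_cut=70.0):
    """Return "low", "high", or None for yields in the excluded middle band (assumption)."""
    if yield_percent < low_cut:
        return "low"
    if yield_percent > high_cut:
        return "high"
    return None


# Toy yields (made-up numbers).
yields = [30.0, 85.0, 60.0, 95.0, 72.0]
labeled = [(y, yield_label(y)) for y in yields if yield_label(y) is not None]
low_fraction = sum(1 for _, label in labeled if label == "low") / len(labeled)
print(labeled, low_fraction)
```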
Try building models for yield
things that don’t work
That’s performance on the training set
§ Try building a random forest using the atom-pair based
reactant fingerprints
Try building models for yield
things that don’t work
That’s performance on the training set
§ Look at the ROC curve for the training-set data
Try building models for yield
things that don’t work?
first wrong “low-yield” prediction
nine wrong “low-yield” predictions
The model is doing a great job
of ordering compounds, but a
bad job of classifying
compounds
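The gap between ordering and classifying can be shown with a toy example. The scores below are made-up fractions of trees voting “low-yield”: the ranking is perfect (ROC AUC of 1), yet no low-yield example is found when the cutoff is 0.5.

```python
def roc_auc(scores, labels):
    """Probability that a random positive scores above a random negative (ties count half)."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def recall_at(scores, labels, threshold):
    """Fraction of positives whose score reaches the threshold."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    return sum(1 for s in pos if s >= threshold) / len(pos)


# Toy data: 1 marks a low-yield example, 0 a high-yield one.
scores = [0.45, 0.40, 0.35, 0.20, 0.15, 0.10, 0.05, 0.05]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
print(roc_auc(scores, labels), recall_at(scores, labels, 0.5), recall_at(scores, labels, 0.3))
```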
Unbalanced data and ensemble classifiers
an aside
§ Usual decision rule for a two-class ensemble classifier:
take the result that the majority of the models (decision
trees for random forests) vote for.
§ That’s a decision boundary = 0.5
§ If the dataset is unbalanced, why should we expect
balanced behavior from the classifier?
§ Idea: use the composition of the training set to decide
what the decision boundary should be.
For example: if the data set is ~20% “low yield”, then assign “low
yield” to any example where at least 20% of the trees say “low yield”
§ Try building a random forest using the atom-pair based
reactant fingerprints
§ What about moving the decision boundary to 0.2 to reflect
the unbalanced data set ?
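A minimal sketch of the shifted decision boundary, assuming the per-example fraction of trees voting “low yield” is already available (for instance from the class-probability output of a random forest). The function names and toy numbers are illustrative.

```python
def minority_fraction(train_labels, minority="low"):
    """Share of the training set that belongs to the minority class."""
    return sum(1 for label in train_labels if label == minority) / len(train_labels)


def predict_with_boundary(vote_fractions, boundary, minority="low", majority="high"):
    """Assign the minority class when at least `boundary` of the trees vote for it."""
    return [minority if v >= boundary else majority for v in vote_fractions]


# Toy training set that is 20% low-yield, and made-up vote fractions.
train_labels = ["low"] * 2 + ["high"] * 8
boundary = minority_fraction(train_labels)
votes = [0.10, 0.25, 0.60]

default_calls = predict_with_boundary(votes, 0.5)
shifted_calls = predict_with_boundary(votes, boundary)
print(boundary, default_calls, shifted_calls)
```

The second example (25% of trees voting “low yield”) switches from “high” to “low” once the boundary moves from 0.5 to the training-set composition.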
Try building models for yield
Getting close to working
That’s performance on the training set
Starting to look ok. What about the test set?
§ Results from a random forest using the atom-pair based
reactant fingerprints with the shifted decision boundary
Try building models for yield
Getting close to working
Not too terrible.
test set
§ Aldehyde reductive amination (no shift):
§ Williamson ether synthesis (boundary 0.3)
Try building models for yield
Some more models
test set
test set
§ Chloro N-Alkylation (no shift):
§ Chloro N-Alkylation (0.4 shift)
Try building models for yield
Some more models
test set
test set
Wrapping up
§ Dataset: 1+ million reactions text mined from patents
(publicly available) with reaction classes assigned
§ Fingerprints: weighted atom-pair delta and functional-group
delta fingerprints implemented using the RDKit
§ Fingerprint Validation:
•  Multiclass random-forest classifier ~94% accurate
•  Similarity measure works: similar reactions cluster together
§ Combination of clustering + functional group analysis
allows identification of new reaction classes
§ We’re also able to use the fingerprints to build reasonable
models for yield
Acknowledgements
§ NextMove Software:
• Roger Sayle
• Daniel Lowe
§ NIBR:
• Anna Pelliccioli
• Sereina Riniker
• Mike Tarselli
Advertising
3rd RDKit User Group Meeting
22-24 October 2014
Merck KGaA, Darmstadt, Germany
Talks, “talktorials”, lightning talks, social activities, and a hackathon on
the 24th.
Registration: http://goo.gl/z6QzwD
Full announcement: http://goo.gl/ZUm2wm
We’re looking for speakers. Please contact greg.landrum@gmail.com

Large scale classification of chemical reactions from patent data

  • 1. Large scale classification of chemical reactions from patent data. Gregory Landrum, NIBR Informatics, Basel, Novartis Institutes for BioMedical Research. 10th International Conference on Chemical Structures / 10th German Conference on Chemoinformatics.
  • 2. Outline
      § Public data sources and reactions
      § Fingerprints for reactions
      § Validation:
        • Machine learning
        • Clustering
      § Application: models for predicting yield
  • 3. Public data sources in cheminformatics (an aside at the beginning)
      § Publicly available data sources for small molecules and their biological activities/interactions:
        • PDB, PubChem, ChEMBL, etc.
      § Publicly available data sources for the chemistry behind how those molecules were actually made (i.e. reactions):
        • pretty much nothing until recently
      § Plenty of data is locked up in large commercial databases and pharmaceutical companies’ ELNs; very, very little is in the open
      The “public/open” point is important for collaboration and reproducibility.
  • 4. A large, public source of chemical reactions: not just what we made, but how we made it
      § Text-mining applied to open patent data to extract chemical reactions: 1.12 million reactions [1]
      § Reactions classified using namerxn, when possible, into 318 standard types: >599000 classified reactions [2]
      [1] Lowe DM: “Extraction of chemical structures and reactions from the literature.” PhD thesis. University of Cambridge: Cambridge, UK; 2012.
      [2] Reaction classification from Roger Sayle and Daniel Lowe (NextMove Software) http://nextmovesoftware.com/blog/2014/02/27/unleashing-over-a-million-reactions-into-the-wild/
  • 5. More about the classes
      Frequency of reaction classes (20 most common classes):
      Count   Class   Name
      44675   2.1.2   Carboxylic acid + amine reaction
      39297   1.7.9   Williamson ether synthesis
      28194   2.1.1   Amide Schotten-Baumann
      26739   1.3.7   Chloro N-arylation
      22400   1.6.2   Bromo N-alkylation
      20465   7.1.1   Nitro to amino
      20405   1.6.4   Chloro N-alkylation
      17226   6.2.2   CO2H-Me deprotection
      16602   6.1.1   N-Boc deprotection
      16021   6.2.1   CO2H-Et deprotection
      12952   1.2.1   Aldehyde reductive amination
      12250   2.2.3   Sulfonamide Schotten-Baumann
      10659   11.9    Separation
       8538   3.1.5   Bromo Suzuki-type coupling
       7261   1.7.7   Mitsunobu aryl ether synthesis
       7102   6.3.7   Methoxy to hydroxy
       7071   3.3.1   Sonogashira coupling
       6472   3.1.1   Bromo Suzuki coupling
       6383   1.8.5   Thioether synthesis
       5791   9.1.6   Hydroxy to chloro
  • 6. Got the reactions, what about reaction fingerprints? Criteria for them to be useful:
      § Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
        Test: can we use them with a machine-learning approach to build a reaction classifier?
      § Question 2: are similar reactions similar with the fingerprints?
        Test: do related reactions cluster together?
  • 7. Our toolbox: the RDKit
      § Open-source C++ toolkit for cheminformatics
      § Wrappers for Python (2.x), Java, C#
      § Functionality:
        • 2D and 3D molecular operations
        • Descriptor generation for machine learning
        • PostgreSQL database cartridge for substructure and similarity searching
        • Knime nodes
        • IPython integration
        • Lucene integration (experimental)
        • Supports Mac/Windows/Linux
      § Releases every 6 months
      § Business-friendly BSD license
      § Code: https://github.com/rdkit
      § http://www.rdkit.org
  • 8. Similarity and reactions: what are we talking about?
      § These two reactions are both type “1.2.5 Ketone reductive amination”
      It’s obvious that these are the same, right?
  • 9. Similarity and reactions: what are we talking about?
      § These two reactions are both type “1.2.5 Ketone reductive amination”
      It’s obvious that these are the same, right?
  • 10. Got the reactions, what about reaction fingerprints?
      Start simple: use difference fingerprints:
      $FP_{\mathrm{Reacts}} = \sum_{i \in \mathrm{Reactants}} FP_i$
      $FP_{\mathrm{Products}} = \sum_{i \in \mathrm{Products}} FP_i$
      $FP_{\mathrm{Rxn}} = FP_{\mathrm{Products}} - FP_{\mathrm{Reacts}}$
      Similar idea here:
      1) Ridder, L. & Wagener, M. SyGMa: Combining Expert Knowledge and Empirical Scoring in the Prediction of Metabolites. ChemMedChem 3, 821–832 (2008).
      2) Patel, H., Bodkin, M. J., Chen, B. & Gillet, V. J. Knowledge-Based Approach to de Novo Design Using Reaction Vectors. J. Chem. Inf. Model. 49, 1163–1184 (2009).
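The difference-fingerprint idea on slide 10 can be illustrated with a minimal sketch. It assumes each molecule is already described by a list of counted features; the feature strings below are hypothetical stand-ins for the atom-pair features used in the talk, and the helper names are illustrative rather than RDKit API.

```python
from collections import Counter


def sum_fps(feature_lists):
    """Sum count fingerprints over a set of molecules (reactants or products)."""
    total = Counter()
    for features in feature_lists:
        total.update(features)  # adds one count per occurrence of each feature
    return total


def difference_fp(reactants, products):
    """FP_Rxn = FP_Products - FP_Reacts, keeping only non-zero entries."""
    reacts = sum_fps(reactants)
    prods = sum_fps(products)
    diff = {}
    for key in set(reacts) | set(prods):
        delta = prods[key] - reacts[key]  # Counter returns 0 for missing keys
        if delta != 0:
            diff[key] = delta
    return diff


# Toy amide coupling; the feature names are hypothetical, not real atom pairs.
acid = ["C=O", "O-H", "C-C"]
amine = ["N-H", "N-H", "C-N"]
amide = ["C=O", "C-N", "C-N", "N-H", "C-C"]
rxn_fp = difference_fp([acid, amine], [amide])
```

Features that are unchanged by the reaction cancel out, so the resulting vector only describes what was made and what was lost.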
  • 11. Refine the fingerprints a bit
      Text-mined reactions often include catalysts, reagents, or solvents in the reactants. Explore two options for handling this:
      1. Decrease the weight of reactant molecules where too many of the bits are not present in the product fingerprint
      2. Decrease the weight of reactant molecules where too many atoms are unmapped
  • 12. Are the fingerprints useful?
      § Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
        Test: can we use them with a machine-learning approach to build a reaction classifier?
      § Question 2: are similar reactions similar with the fingerprints?
        Test: do related reactions cluster together?
  • 13. Machine learning and chemical reactions
      § Validation set:
        • The 68 reaction types with at least 2000 instances from the patent data set
          - “Resolution” reaction types removed (e.g. 11.9 Separation and 11.1 Chiral separation)
          - Final: 66 reaction types
      § Process:
        • Training set is 200 random instances of each reaction type
        • Test set is 800 random instances of each reaction type
        • Learning: random forest (scikit-learn)
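The sampling step on slide 13 could look roughly like the sketch below. It assumes the class label of every reaction is available as a list, and that the training and test instances are drawn without overlap; `split_per_class` and its parameters are illustrative names, not code from the talk.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train=200, n_test=800, seed=0):
    """Pick n_train + n_test random instances per class, without overlap.

    Returns two lists of indices into `labels`.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) < n_train + n_test:
            continue  # assumption: classes that are too small are skipped
        picked = rng.sample(members, n_train + n_test)
        train.extend(picked[:n_train])
        test.extend(picked[n_train:])
    return train, test


# Toy usage with small sizes; class "c" is too small and is dropped.
toy_labels = ["a"] * 10 + ["b"] * 10 + ["c"] * 3
toy_train, toy_test = split_per_class(toy_labels, n_train=2, n_test=5, seed=1)
```

The index lists can then be used to slice the fingerprint matrix before fitting a classifier such as a random forest.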
  • 14. Learning reaction classes
      Results for test data, overall:
        • Recall: 0.94
        • Precision: 0.94
        • Accuracy: 0.94
      For a 66-class classifier, this looks pretty good!
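As a reminder of what the three numbers on slide 14 measure, here is a small sketch that computes accuracy together with class-averaged precision and recall. The slide does not say how the "overall" values were averaged, so the macro average used here is an assumption.

```python
from collections import Counter


def classification_summary(y_true, y_pred):
    """Accuracy plus macro-averaged precision and recall for a multiclass task."""
    true_count = Counter(y_true)
    pred_count = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    classes = sorted(set(y_true) | set(y_pred))

    # Classes that were never predicted (or never occur) are left out of the mean.
    precision = [correct[c] / pred_count[c] for c in classes if pred_count[c]]
    recall = [correct[c] / true_count[c] for c in classes if true_count[c]]
    return {
        "accuracy": sum(correct.values()) / len(y_true),
        "macro_precision": sum(precision) / len(precision),
        "macro_recall": sum(recall) / len(recall),
    }


summary = classification_summary(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

With equally sized test classes, as in the 800-per-class test set, macro-averaged recall and accuracy coincide.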
  • 15. Learning reaction classes
      Confusion matrix for test data: ~94% accuracy; much of the confusion is between related types.
      Classes labelled on the matrix: Bromo Suzuki coupling, Bromo Suzuki-type coupling, Bromo N-arylation
  • 16. Are the fingerprints useful?
      § Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
        Test: can we use them with a machine-learning approach to build a reaction classifier?
      § Question 2: are similar reactions similar with the fingerprints?
        Test: do related reactions cluster together?
  • 17. Clustering reactions
      § Reaction similarity validation set:
        • The 66 most common reaction types from the patent data set
        • Look at the homogeneity of clusters with at least 10 members
      Figure labels: 1.2.5 Ketone reductive amination (three example reactions); Integration
      Interpretation: <30% of clusters are <90% homogeneous
      Interpretation: <40% of clusters are <80% homogeneous
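The homogeneity statistic on slide 17 is not defined in the transcript. A minimal sketch, assuming homogeneity means the fraction of a cluster that belongs to its most common reaction class, is:

```python
from collections import Counter


def cluster_homogeneity(class_labels):
    """Fraction of cluster members that share the most common class."""
    counts = Counter(class_labels)
    return counts.most_common(1)[0][1] / len(class_labels)


def fraction_below(clusters, threshold, min_size=10):
    """Fraction of clusters with >= min_size members whose homogeneity is below threshold."""
    big = [c for c in clusters if len(c) >= min_size]
    if not big:
        return 0.0
    below = sum(1 for c in big if cluster_homogeneity(c) < threshold)
    return below / len(big)


# Toy clusters of reaction-class labels; the last one is too small to count.
toy_clusters = [["1.2.5"] * 10, ["1.2.5"] * 8 + ["1.2.1"] * 2, ["1.2.5"] * 3]
```

Here `fraction_below(toy_clusters, 0.9)` considers two clusters, with homogeneities 1.0 and 0.8, so half of them fall below the 90% line.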
  • 18. Using the fingerprints: can we help classify the remaining 600K reactions?
      § Apply the 66-class random forest to generate class predictions for the unclassified reactions in order to find reactions we missed
      § Cluster the unclassified reactions, look for big clusters, and (manually) assign classes to them
      § Both of these approaches have been successful
  • 19. Predicting yields
      § The data set includes text-mined yield information as well as calculated yields.
      § For modeling: prefer the text-mined value, but take the calculated one if that’s the only thing available
      § Look at stats for the 93 reaction classes that have at least 500 members with yields, a min yield > 0 and a max yield < 110%
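The yield-selection rule on slide 19 can be sketched as below. The record layout (`rxn_class`, `text_yield`, `calc_yield`) is hypothetical, and reading the "min yield > 0, max yield < 110%" condition as a per-reaction filter is an assumption; the slide could also mean a per-class criterion.

```python
from collections import defaultdict


def pick_yield(text_mined, calculated):
    """Prefer the text-mined yield; fall back to the calculated one."""
    return text_mined if text_mined is not None else calculated


def usable_yields(records, min_members=500):
    """Group plausible yields by reaction class and keep well-populated classes."""
    by_class = defaultdict(list)
    for rec in records:
        value = pick_yield(rec.get("text_yield"), rec.get("calc_yield"))
        # Assumed filter: drop missing, zero, or implausibly large yields.
        if value is not None and 0 < value < 110:
            by_class[rec["rxn_class"]].append(value)
    return {cls: ys for cls, ys in by_class.items() if len(ys) >= min_members}


toy_records = [
    {"rxn_class": "7.1.1", "text_yield": 80.0, "calc_yield": 75.0},
    {"rxn_class": "7.1.1", "text_yield": None, "calc_yield": 60.0},
    {"rxn_class": "7.1.1", "text_yield": 150.0, "calc_yield": 90.0},
    {"rxn_class": "1.2.1", "calc_yield": 50.0},
]
toy_yields = usable_yields(toy_records, min_members=2)
```

In the toy data the third record is discarded because its preferred (text-mined) value is out of range, and class 1.2.1 is dropped for having too few members.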
  • 20. Predicting yields
      § Look at the most populated classes
  • 21. Try building models for yield
      § Start with class 7.1.1 “nitro to amino”
      § Break into low-yield (<50%) and high-yield (>70%) classes. 14% are low-yield
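The two-class split on slide 21 amounts to a simple labelling rule. In this sketch, reactions with yields between the two cut-offs are left out of the model, which is an assumption: the slide does not say what happens to the 50-70% range.

```python
def yield_label(yield_percent, low_cut=50.0, high_cut=70.0):
    """Return "low", "high", or None for yields between the two cut-offs."""
    if yield_percent < low_cut:
        return "low"
    if yield_percent > high_cut:
        return "high"
    return None  # assumed: intermediate yields are not used for modelling


def label_yields(yields):
    """Labels for the reactions that fall into one of the two classes."""
    labels = [yield_label(y) for y in yields]
    return [lab for lab in labels if lab is not None]


toy_labels = label_yields([30, 60, 85, 95, 50, 70])
```

Leaving a gap between the classes keeps borderline reactions, whose reported yields are noisy, from blurring the two labels.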
  • 22. Try building models for yield (things that don’t work)
      § Try building a random forest using the atom-pair based reaction fingerprints
      That’s performance on the training set
  • 23. Try building models for yield (things that don’t work)
      § Try building a random forest using the atom-pair based reactant fingerprints
      That’s performance on the training set
  • 24. Try building models for yield (things that don’t work?)
      § Look at the ROC curve for the training-set data
      ROC curve annotations: first wrong “low-yield” prediction; nine wrong “low-yield” predictions
      The model is doing a great job of ordering compounds, but a bad job of classifying compounds
  • 25. Unbalanced data and ensemble classifiers (an aside)
      § Usual decision rule for a two-class ensemble classifier: take the result that the majority of the models (decision trees for random forests) vote for.
      § That’s a decision boundary = 0.5
      § If the dataset is unbalanced, why should we expect balanced behavior from the classifier?
      § Idea: use the composition of the training set to decide what the decision boundary should be. For example: if the data set is ~20% “low yield”, then assign “low yield” to any example where at least 20% of the trees say “low yield”
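The shifted decision rule from slide 25 can be written down in a few lines. This sketch works on explicit per-tree votes; the function names are illustrative, and with a library such as scikit-learn the same idea would typically be applied to the predicted class probability instead.

```python
def low_yield_fraction(train_labels):
    """Fraction of "low" examples in the training set, used as the boundary."""
    return sum(1 for lab in train_labels if lab == "low") / len(train_labels)


def ensemble_decision(tree_votes, boundary=0.5):
    """Assign "low" when at least `boundary` of the trees vote "low"."""
    frac_low = sum(1 for v in tree_votes if v == "low") / len(tree_votes)
    return "low" if frac_low >= boundary else "high"


# Ten trees, three of which vote "low".
votes = ["low"] * 3 + ["high"] * 7
default_call = ensemble_decision(votes)                 # boundary 0.5
shifted_call = ensemble_decision(votes, boundary=0.2)   # boundary from a ~20% "low" data set
```

The model and its ranking of examples are unchanged; only the threshold that turns the vote fraction into a class label moves.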
  • 26. Try building models for yield (getting close to working)
      § Try building a random forest using the atom-pair based reactant fingerprints
      § What about moving the decision boundary to 0.2 to reflect the unbalanced data set?
      That’s performance on the training set. Starting to look OK. What about the test set?
  • 27. Try building models for yield (getting close to working)
      § Results from a random forest using the atom-pair based reactant fingerprints with the shifted decision boundary (test set)
      Not too terrible.
  • 28. Try building models for yield (some more models)
      § Aldehyde reductive amination (no shift): test set
      § Williamson ether synthesis (boundary 0.3): test set
  • 29. Try building models for yield (some more models)
      § Chloro N-Alkylation (no shift): test set
      § Chloro N-Alkylation (0.4 shift): test set
  • 30. Wrapping up
      § Dataset: 1+ million reactions text-mined from patents (publicly available) with reaction classes assigned
      § Fingerprints: weighted atom-pair delta and functional-group delta fingerprints implemented using the RDKit
      § Fingerprint validation:
        • Multiclass random-forest classifier ~94% accurate
        • Similarity measure works: similar reactions cluster together
      § Combination of clustering + functional group analysis allows identification of new reaction classes
      § We’re also able to use the fingerprints to build reasonable models for yield
  • 31. Acknowledgements
      § NextMove Software:
        • Roger Sayle
        • Daniel Lowe
      § NIBR:
        • Anna Pelliccioli
        • Sereina Riniker
        • Mike Tarselli
  • 32. Advertising
      3rd RDKit User Group Meeting, 22-24 October 2014, Merck KGaA, Darmstadt, Germany
      Talks, “talktorials”, lightning talks, social activities, and a hackathon on the 24th.
      Registration: http://goo.gl/z6QzwD
      Full announcement: http://goo.gl/ZUm2wm
      We’re looking for speakers. Please contact greg.landrum@gmail.com