Gregory Landrum
NIBR Informatics
Novartis Institutes for BioMedical Research
UK QSAR 2014
Open-source tools for querying and
organizing large reaction databases
Outline
§ Public data sources and reactions
§ Handling reactions with the RDKit
§ Fingerprints for reactions
§ Validation:
•  Machine learning
•  Clustering
§ Application: Identifying interesting clusters of reactions
Public data sources in cheminformatics
an aside at the beginning
Protein data bank
the exception
•  Crystal structures of proteins
•  Deposition is mandatory for publishing protein crystal structures
PubChem
Evolution
Compounds
Assays
(non-ChEMBL)
Collection of molecules from vendors and patents together with
some assay data, primarily from NIH-funded screening centers.
ChEMBL
Evolution
Compounds
Activities
2009
Collection of molecules and assay data curated (primarily) from the
literature
What about how we made those molecules?
Public reaction data?
§ The literature:
§ Plenty of data locked up in large commercial databases,
very very little in the open
Yan, L. et al. SAR studies of 3-arylpropionic acids as potent and selective agonists of sphingosine-1-phosphate receptor-1 (S1P1) with enhanced pharmacokinetic properties. Bioorganic & Medicinal Chemistry Letters 17, 828–831 (2007).
An emerging area: chemical reactions
Not just what we made, but how we made it
§  Text-mining applied to open patent data to extract chemical reactions: 1.12 million reactions [1]
§  Reactions classified using namerxn, when possible, into 318 standard types: >599,000 classified reactions [2]
[1] Lowe DM: “Extraction of chemical structures and reactions from the literature.” PhD thesis. University of Cambridge: Cambridge, UK; 2012.
[2] Reaction classification from Roger Sayle and Daniel Lowe (NextMove Software) http://nextmovesoftware.com/blog/2014/02/27/unleashing-over-a-million-reactions-into-the-wild/
Lots of reactions,
lots of repeats
More about the classes
Frequency of classes, revisited:
20 most common classes:
Count   Class   Name
44675   2.1.2   Carboxylic acid + amine reaction
39297   1.7.9   Williamson ether synthesis
28194   2.1.1   Amide Schotten-Baumann
26739   1.3.7   Chloro N-arylation
22400   1.6.2   Bromo N-alkylation
20465   7.1.1   Nitro to amino
20405   1.6.4   Chloro N-alkylation
17226   6.2.2   CO2H-Me deprotection
16602   6.1.1   N-Boc deprotection
16021   6.2.1   CO2H-Et deprotection
12952   1.2.1   Aldehyde reductive amination
12250   2.2.3   Sulfonamide Schotten-Baumann
10659   11.9    Separation
 8538   3.1.5   Bromo Suzuki-type coupling
 7261   1.7.7   Mitsunobu aryl ether synthesis
 7102   6.3.7   Methoxy to hydroxy
 7071   3.3.1   Sonogashira coupling
 6472   3.1.1   Bromo Suzuki coupling
 6383   1.8.5   Thioether synthesis
 5791   9.1.6   Hydroxy to chloro
RDKit: What is it?
§  Open-source C++ toolkit for cheminformatics
§  Wrappers for Python (2.x), Java, C#
§  Functionality:
•  2D and 3D molecular operations
•  Descriptor generation for machine learning
•  PostgreSQL database cartridge for substructure and similarity searching
•  Knime nodes
•  IPython integration
•  Lucene integration (experimental)
•  Supports Mac/Windows/Linux
§  Releases every 6 months
§  business-friendly BSD license
§  Code: https://github.com/rdkit
§  http://www.rdkit.org
RDKit: Some features
§  Input/Output: SMILES/SMARTS, SDF, TDT, PDB,
SLN [1], Corina mol2 [1]
§  “Cheminformatics”:
•  Substructure searching
•  Canonical SMILES
•  Chirality support (i.e. R/S or E/Z labeling)
•  Chemical transformations (e.g. remove matching
substructures)
•  Chemical reactions
§  2D depiction, including constrained depiction
§  2D->3D conversion/conformational analysis via
distance geometry
§  UFF and MMFF94 implementation for cleaning up
structures
§  Fingerprinting: Daylight-like, atom pairs, topological
torsions, Morgan algorithm, “MACCS keys”, etc.
§  Similarity/diversity picking
§  2D pharmacophores [1]
§  Gasteiger-Marsili charges
§  Hierarchical subgraph/fragment analysis
§  Bemis and Murcko scaffold determination
§  RECAP and BRICS implementations
§  Multi-molecule maximum common substructure
§  Feature maps
§  Shape-based similarity
§  Fraggle similarity (from GSK)
§  Molecule-molecule alignment
§  Open3DAlign implementation
§  Integration with PyMOL for 3D visualization
§  Functional group filtering
§  Salt stripping
§  Molecular descriptor library:
Topological (κ3, Balaban J, etc.), Compositional (Number
of Rings, Number of Aromatic Heterocycles, etc.),
EState, SlogP/SMR (Wildman and Crippen approach),
“MOE like” VSA descriptors, Feature-map vectors
§  Machine Learning:
•  Clustering (hierarchical)
•  Information theory (Shannon entropy, information
gain, etc.)
§  Tight integration with the IPython notebook and
pandas
§  Integration with the InChI library
[1] These implementations are functional but are not necessarily
the best, fastest, or most complete.
RDKit reaction handling
Basics
From an rxn file:
RDKit reaction handling
Virtual Protecting groups
The problem:
Introducing the protecting group on amide Ns:
The result:
Another approach for tuning specificity
start with the problem again
Another approach for tuning specificity
and now the solution
Thanks to Holger Claussen (BioSolveIT) for the idea to use atom values for this
Query definitions added as atom values
Got the reactions, what about reaction fingerprints?
Criteria for them to be useful
§ Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
Test: can we use them with a machine-learning approach to build a reaction classifier?
§ Question 2: are similar reactions similar with the fingerprints?
Test: do related reactions cluster together?
Similarity applied to reactions
What are we talking about?
§  These two reactions are both type: “1.2.5 Ketone reductive amination”
It’s obvious that these are the same, right?
Got the reactions, what about reaction fingerprints?
Start simple: use difference fingerprints:
Similar idea here:
1) Ridder, L. & Wagener, M. SyGMa: Combining Expert Knowledge and Empirical Scoring in the Prediction of Metabolites. ChemMedChem 3, 821–832 (2008).
2) Patel, H., Bodkin, M. J., Chen, B. & Gillet, V. J. Knowledge-Based Approach to de Novo Design Using Reaction Vectors. J. Chem. Inf. Model. 49, 1163–1184 (2009).
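$$FP_{\mathrm{Reacts}} = \sum_{i \in \mathrm{Reactants}} FP_i$$
$$FP_{\mathrm{Prods}} = \sum_{i \in \mathrm{Products}} FP_i$$
$$FP_{\mathrm{Rxn}} = FP_{\mathrm{Prods}} - FP_{\mathrm{Reacts}}$$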
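The difference-fingerprint definition above can be sketched in plain Python. This is a minimal illustration, not the RDKit implementation used in the talk: fingerprints are represented as sparse count dictionaries, and the feature names and the helpers `sum_fingerprints` and `difference_fingerprint` are hypothetical.

```python
from collections import Counter


def sum_fingerprints(fps):
    """Sum sparse count fingerprints (feature -> count) over a set of molecules."""
    total = Counter()
    for fp in fps:
        total.update(fp)  # Counter.update adds counts rather than replacing them
    return total


def difference_fingerprint(reactant_fps, product_fps):
    """Summed product fingerprints minus summed reactant fingerprints."""
    diff = sum_fingerprints(product_fps)
    diff.subtract(sum_fingerprints(reactant_fps))  # subtract() keeps negative counts
    # Drop features whose counts cancel; they say nothing about the change.
    return {feature: count for feature, count in diff.items() if count != 0}


# Toy example with made-up feature names (not real atom-pair codes).
ketone = {"C=O": 1, "C-C": 2}
amine = {"C-N": 1, "N-H": 2}
product = {"C-N": 2, "C-C": 2, "N-H": 1}

rxn_fp = difference_fingerprint([ketone, amine], [product])
```

Features that are unchanged between reactants and products (here `C-C`) cancel out, so the result describes only what the reaction changed.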
Refine the fingerprints a bit
Text-mined reactions often include reagents or
solvents in the reactants
Explore two options for handling this:
1.  Decrease the weight of reactant molecules where too many
of the bits are not present in the product fingerprint
2.  Decrease the weight of reactant molecules where too many
atoms are unmapped
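Option 2 might look roughly like the sketch below. The talk does not give the threshold or the reduced weight, so `min_mapped_fraction` and `reduced_weight` are assumptions chosen only for illustration, as are the function names and the toy fingerprints.

```python
def reactant_weight(n_mapped_atoms, n_atoms, min_mapped_fraction=0.2, reduced_weight=0.1):
    """Return 1.0 for a reactant with enough mapped atoms, else a reduced weight.

    Both the threshold and the reduced weight are illustrative assumptions.
    """
    if n_atoms == 0:
        return 0.0
    mapped_fraction = n_mapped_atoms / n_atoms
    return 1.0 if mapped_fraction >= min_mapped_fraction else reduced_weight


def weighted_sum(fps, weights):
    """Weighted sum of sparse count fingerprints (feature -> count)."""
    total = {}
    for fp, weight in zip(fps, weights):
        for feature, count in fp.items():
            total[feature] = total.get(feature, 0.0) + weight * count
    return total


# A likely reagent (no mapped atoms) contributes far less than a true reactant.
substrate = {"C-Br": 1, "C-C": 3}
reagent = {"K-O": 2}
weights = [reactant_weight(8, 10), reactant_weight(0, 5)]
reactant_fp = weighted_sum([substrate, reagent], weights)
```

The weighted reactant sum would then replace the plain sum when the difference fingerprint is formed.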
Another reaction analysis scheme
Looking at functional group changes
§ Similar idea to the fingerprint analysis: count the numbers
of common functional groups in the reactants and
products and subtract the one from the other:
	
  	
  	
  	
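import numpy as np
from rdkit.Chem import FunctionalGroups

# rxn: an RDKit reaction object
# fgh: the functional-group hierarchy, assumed here to come from
#      FunctionalGroups.BuildFuncGroupHierarchy()

# Sum the functional-group fingerprints of all reactants
rfp = None
for ri in range(rxn.GetNumReactantTemplates()):
    m = rxn.GetReactantTemplate(ri)
    fp = np.array(FunctionalGroups.CreateMolFingerprint(m, fgh))
    if rfp is None:
        rfp = fp
    else:
        rfp += fp

# Sum the functional-group fingerprints of all products
pfp = None
for pi in range(rxn.GetNumProductTemplates()):
    m = rxn.GetProductTemplate(pi)
    fp = np.array(FunctionalGroups.CreateMolFingerprint(m, fgh))
    if pfp is None:
        pfp = fp
    else:
        pfp += fp

# Functional-group change: products minus reactants
fp = pfp - rfp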
Functional groups considered
acidchloride
acidchloride_aromatic
acidchloride_aliphatic
carboxylicacid
carboxylicacid_aromatic
carboxylicacid_aliphatic
carboxylicacid_alphaamino
sulfonylchloride
sulfonylchloride_aromatic
sulfonylchloride_aliphatic
amine
amine_primary
amine_primary_aromatic
amine_primary_aliphatic
amine_secondary
amine_secondary_aromatic
amine_secondary_aliphatic
amine_tertiary
amine_tertiary_aromatic
amine_tertiary_aliphatic
amine_aromatic
amine_aliphatic
amine_cyclic
boronicacid
boronicacid_aromatic
boronicacid_aliphatic
isocyanate
isocyanate_aromatic
isocyanate_aliphatic
alcohol
alcohol_aromatic
alcohol_aliphatic
aldehyde
aldehyde_aromatic
aldehyde_aliphatic
halogen
halogen_aromatic
halogen_aliphatic
halogen_notfluorine
halogen_notfluorine_aliphatic
halogen_notfluorine_aromatic
halogen_bromine
halogen_bromine_aliphatic
halogen_bromine_aromatic
halogen_bromine_bromoketone
azide
azide_aromatic
azide_aliphatic
nitro
nitro_aromatic
nitro_aliphatic
terminalalkyne
Functional group changes analyzed
Do the results make sense at all?
Functional Group                 Avg in Reaction   Overall Average
halogen                          -0.98             -0.3
alcohol                          -0.95             -0.12
halogen_notfluorine              -0.89             -0.27
alcohol_aromatic                 -0.67             -0.04
halogen_aliphatic                -0.62             -0.15
halogen_notfluorine_aliphatic    -0.62             -0.14
carboxylicacid                   -0.5              -0.23
halogen_bromine                  -0.42             -0.11
halogen_bromine_aliphatic        -0.39             -0.06
halogen_aromatic                 -0.36             -0.16
alcohol_aliphatic                -0.28             -0.08
halogen_notfluorine_aromatic     -0.27             -0.13
amine                            -0.04             -0.3
amine_aliphatic                  -0.04             -0.27
carboxylicacid_aliphatic         -0.04             -0.08
halogen_bromine_aromatic         -0.03             -0.05
amine_tertiary                   -0.02             -0.06
amine_tertiary_aliphatic         -0.02             -0.08
carboxylicacid_aromatic          -0.02             -0.03
amine_cyclic                     -0.01             -0.02
halogen_bromine_bromoketone      -0.01             0
acidchloride                     0                 -0.07
acidchloride_aliphatic           0                 -0.05
acidchloride_aromatic            0                 -0.02
aldehyde                         0                 -0.04
aldehyde_aliphatic               0                 -0.01
aldehyde_aromatic                0                 -0.03
amine_aromatic                   0                 -0.03
amine_primary                    0                 -0.15
amine_primary_aliphatic          0                 -0.07
amine_primary_aromatic           0                 -0.07
amine_secondary                  0                 -0.04
amine_secondary_aliphatic        0                 -0.07
amine_secondary_aromatic         0                 0.03
amine_tertiary_aromatic          0                 0
azide                            0                 0
azide_aliphatic                  0                 0
azide_aromatic                   0                 0
boronicacid                      0                 -0.03
boronicacid_aliphatic            0                 0
boronicacid_aromatic             0                 -0.03
carboxylicacid_alphaamino        0                 0
isocyanate                       0                 -0.01
isocyanate_aliphatic             0                 0
isocyanate_aromatic              0                 0
nitro                            0                 -0.03
nitro_aliphatic                  0                 0
nitro_aromatic                   0                 -0.03
sulfonylchloride                 0                 -0.02
sulfonylchloride_aliphatic       0                 -0.01
sulfonylchloride_aromatic        0                 -0.01
terminalalkyne                   0                 -0.01
Compare the average deltas for the >39K instances of
Williamson ether synthesis
These look sensible
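A comparison like the one in the table could be computed along these lines. This is a sketch under assumptions: each reaction's functional-group delta is stored as a dictionary, `average_deltas` is a hypothetical helper, and the toy reactions are invented rather than taken from the patent data.

```python
def average_deltas(delta_fps, group_names):
    """Average each functional-group delta over a collection of reactions."""
    if not delta_fps:
        raise ValueError("need at least one reaction")
    n = len(delta_fps)
    return {name: sum(fp.get(name, 0) for fp in delta_fps) / n for name in group_names}


groups = ["halogen", "alcohol", "amine"]

# Invented deltas: two ether syntheses that each consume a halogen and an alcohol,
# plus two unrelated reactions that consume an amine.
williamson = [{"halogen": -1, "alcohol": -1}, {"halogen": -1, "alcohol": -1}]
everything = williamson + [{"amine": -1}, {"halogen": -1, "amine": -1}]

in_class = average_deltas(williamson, groups)   # the "Avg in Reaction" column
overall = average_deltas(everything, groups)    # the "Overall Average" column
```

A group whose in-class average is far from its overall average (here `alcohol`) is characteristic of the reaction class.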
Are the fingerprints useful?
§ Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
Test: can we use them with a machine-learning approach to build a reaction classifier?
§ Question 2: are similar reactions similar with the fingerprints?
Test: do related reactions cluster together?
Machine learning and chemical reactions
§ Validation set:
•  The 68 reaction types with at least 2000 instances from the patent
data set
-  “Resolution” reaction types removed (e.g. 11.9 Separation and 11.1 Chiral
separation)
-  Final: 66 reaction types
§ Process:
•  Training set is 200 random instances of each reaction type
•  Test set is 800 random instances of each reaction type
•  Learning: random forest (scikit-learn)
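The sampling step of this process can be sketched with the standard library alone. The function name, the seed, and the toy labels are assumptions. Fitting the random forest itself (scikit-learn in the talk) is left out.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train=200, n_test=800, seed=0):
    """Pick disjoint random train/test index sets of fixed size for every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        if len(idxs) < n_train + n_test:
            raise ValueError(f"class {label!r} has too few instances")
        picked = rng.sample(idxs, n_train + n_test)  # sampling without replacement
        train.extend(picked[:n_train])
        test.extend(picked[n_train:])
    return train, test


# Toy labels: two reaction types with enough instances each.
labels = ["1.2.5"] * 1200 + ["3.1.1"] * 1000
train_idx, test_idx = split_per_class(labels)
```

Drawing the same number of instances per class keeps the very common reaction types from dominating the training set.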
Learning reaction classes
Results for test data
Overall:
•  Recall: 0.94
•  Precision: 0.94
•  Accuracy: 0.94
For a 66-class classifier, this looks pretty good!
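For reference, overall numbers of this kind can be derived from true and predicted labels as sketched below. The talk does not say how precision and recall were averaged across the 66 classes. Macro-averaging (an unweighted mean over classes) is an assumption here, and the labels are toy data.

```python
def classification_summary(y_true, y_pred):
    """Macro-averaged precision and recall, plus overall accuracy."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        n_pred = sum(1 for p in y_pred if p == c)
        n_true = sum(1 for t in y_true if t == c)
        precisions.append(true_pos / n_pred if n_pred else 0.0)
        recalls.append(true_pos / n_true if n_true else 0.0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {
        "precision": sum(precisions) / len(classes),
        "recall": sum(recalls) / len(classes),
        "accuracy": accuracy,
    }


# Toy data: one 3.1.1 reaction is misclassified as the related 3.1.5 type.
y_true = ["3.1.1", "3.1.1", "3.1.5", "3.1.5"]
y_pred = ["3.1.1", "3.1.5", "3.1.5", "3.1.5"]
summary = classification_summary(y_true, y_pred)
```

With equally sized test sets per class, as in the validation set above, macro-averaged recall and accuracy coincide.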
Learning reaction classes
Confusion matrix for test data
~94% accuracy; much of the confusion is between related types
Types labeled in the matrix: Bromo Suzuki coupling, Bromo Suzuki-type coupling, Bromo N-arylation
Are the fingerprints useful?
§ Question 1: do they contain bits that are helpful in distinguishing reactions from one another?
Test: can we use them with a machine-learning approach to build a reaction classifier?
§ Question 2: are similar reactions similar with the fingerprints?
Test: do related reactions cluster together?
Clustering reactions
§ Reaction similarity validation set:
•  The 66 most common reaction types from the patent data set
•  Look at the homogeneity of clusters with at least 10 members
[Figure panels labeled: 1.2.5 Ketone reductive amination (three panels); Integration]
Interpretation: <30% of clusters are <90% homogeneous
Interpretation: <40% of clusters are <80% homogeneous
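One way to arrive at statements like the two above is sketched here. The talk does not define homogeneity. Taking it as the fraction of a cluster's members that belong to its most common reaction class is an assumption, and the function names and toy clusters are hypothetical.

```python
from collections import Counter


def cluster_homogeneity(labels):
    """Fraction of cluster members that share the most common reaction class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def fraction_below(clusters, cutoff, min_size=10):
    """Fraction of clusters with at least min_size members whose homogeneity is below cutoff."""
    large = [c for c in clusters if len(c) >= min_size]
    if not large:
        raise ValueError("no cluster reaches the minimum size")
    below = sum(1 for c in large if cluster_homogeneity(c) < cutoff)
    return below / len(large)


# Toy clusters given as lists of reaction-class labels.
clusters = [
    ["1.2.5"] * 10,                  # pure cluster
    ["1.2.5"] * 8 + ["1.2.1"] * 2,   # 80% homogeneous
    ["1.2.1"] * 3,                   # ignored: fewer than 10 members
]
frac_90 = fraction_below(clusters, cutoff=0.9)
frac_80 = fraction_below(clusters, cutoff=0.8)
```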
Similarity applied to reactions
Can we help classify the remaining 600K reactions?
§  Starting point: we have a similarity measure that clusters related reactions together
§  We can apply the machine-learning model to the unclassified reactions and see if the original assignment missed any instances
§  We can then look for big clusters of unclassified reactions and (manually) assign classes to them.
Finding related unclassified reactions
§  Process:
1.  Pick 10K random unclassified reactions
2.  Cluster using the same fingerprint described above
3.  Characterize clusters by average functional-group profile
4.  Pick clusters where there is a clear signal
§  An example:
Cluster 12
    amine                      -0.68
    amine_secondary            -0.35
    amine_secondary_aliphatic  -0.35
    amine_aliphatic            -0.61
    aldehyde                   -0.58
    aldehyde_aromatic          -0.58
Example reactions from cluster 12
•  Clearly related reactions
•  Using this approach we’ve identified a number of reaction classes
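Step 4 of the process ("pick clusters where there is a clear signal") could be automated roughly as follows, using the cluster 12 profile from the example. The cutoff of 0.5 and the helper name `clear_signal_groups` are assumptions; the talk does not say how a clear signal was judged.

```python
def clear_signal_groups(profile, threshold=0.5):
    """Return the functional groups whose average delta is at least threshold in magnitude.

    Groups are ordered from the most negative delta (most strongly consumed) upwards.
    """
    strong = [name for name, delta in profile.items() if abs(delta) >= threshold]
    return sorted(strong, key=lambda name: profile[name])


# Average functional-group deltas for cluster 12, as given in the example.
cluster_12 = {
    "amine": -0.68,
    "amine_secondary": -0.35,
    "amine_secondary_aliphatic": -0.35,
    "amine_aliphatic": -0.61,
    "aldehyde": -0.58,
    "aldehyde_aromatic": -0.58,
}
strong = clear_signal_groups(cluster_12)
```

For cluster 12 this keeps the amine and aromatic aldehyde groups, both of which are consumed in most of the cluster's reactions.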
Wrapping up
§ Dataset: 1+ million reactions text-mined from patents (publicly available) with reaction classes assigned
§ Fingerprint: weighted atom-pair delta fingerprints implemented using the RDKit
§ Fingerprint validation:
•  Multiclass random-forest classifier ~94% accurate
•  Similarity measure works: similar reactions cluster together
§ Combination of clustering + functional group analysis allows identification of new reaction classes
Acknowledgements
§ NIBR:
• Anna Pelliccioli
• Sereina Riniker
• Mike Tarselli
§ NextMove Software:
• Roger Sayle
• Daniel Lowe
Advertising
3rd RDKit User Group Meeting
22-24 October 2014
Merck KGaA, Darmstadt, Germany
Talks, “talktorials”, lightning talks, social activities, and a hackathon on
the 24th.
Registration: http://goo.gl/z6QzwD
Full announcement: http://goo.gl/ZUm2wm
We’re looking for speakers. Please contact greg.landrum@gmail.com

Contenu connexe

Tendances

Accelerating lead optimisation with active learning by exploiting MMPA based ...
Accelerating lead optimisation with active learning by exploiting MMPA based ...Accelerating lead optimisation with active learning by exploiting MMPA based ...
Accelerating lead optimisation with active learning by exploiting MMPA based ...Ed Griffen
 
Griffen MedChemica Virtual Tox Panel
Griffen MedChemica Virtual Tox PanelGriffen MedChemica Virtual Tox Panel
Griffen MedChemica Virtual Tox PanelEd Griffen
 
Determining stable ligand orientation
Determining stable ligand orientationDetermining stable ligand orientation
Determining stable ligand orientationijaia
 
Open chemistry registry and mapping platform based on open source cheminforma...
Open chemistry registry and mapping platform based on open source cheminforma...Open chemistry registry and mapping platform based on open source cheminforma...
Open chemistry registry and mapping platform based on open source cheminforma...Valery Tkachenko
 
Ai in drug design webinar 26 feb 2019
Ai in drug design webinar 26 feb 2019Ai in drug design webinar 26 feb 2019
Ai in drug design webinar 26 feb 2019Pistoia Alliance
 
Structure based drug design- kiranmayi
Structure based drug design- kiranmayiStructure based drug design- kiranmayi
Structure based drug design- kiranmayiKiranmayiKnv
 
Stable Drug Designing by Minimizing Drug Protein Interaction Energy Using PSO
Stable Drug Designing by Minimizing Drug Protein Interaction Energy Using PSO Stable Drug Designing by Minimizing Drug Protein Interaction Energy Using PSO
Stable Drug Designing by Minimizing Drug Protein Interaction Energy Using PSO csandit
 
Deep learning methods applied to physicochemical and toxicological endpoints
Deep learning methods applied to physicochemical and toxicological endpointsDeep learning methods applied to physicochemical and toxicological endpoints
Deep learning methods applied to physicochemical and toxicological endpointsValery Tkachenko
 
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsMining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsSean Ekins
 
MOLECULAR DOCKING AND RELATED DRUG DESIGN ACHIEVEMENTS
MOLECULAR DOCKING AND RELATED DRUG DESIGN ACHIEVEMENTS MOLECULAR DOCKING AND RELATED DRUG DESIGN ACHIEVEMENTS
MOLECULAR DOCKING AND RELATED DRUG DESIGN ACHIEVEMENTS santosh Kumbhar
 
Resolving cryptic needles to molecular structures: The GtoPdb experience
Resolving cryptic needles to molecular structures: The GtoPdb experienceResolving cryptic needles to molecular structures: The GtoPdb experience
Resolving cryptic needles to molecular structures: The GtoPdb experienceChris Southan
 
molecular docking
molecular dockingmolecular docking
molecular dockingKOUSHIK DEB
 
RSC Hatfield 2018 Kinase meeting : potency patents MMPA approaches
RSC Hatfield 2018  Kinase meeting : potency patents MMPA approachesRSC Hatfield 2018  Kinase meeting : potency patents MMPA approaches
RSC Hatfield 2018 Kinase meeting : potency patents MMPA approachesEd Griffen
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
 
Molecular Docking Using Autodock Tools
Molecular Docking Using Autodock ToolsMolecular Docking Using Autodock Tools
Molecular Docking Using Autodock ToolsVikram Aditya
 
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...NextMove Software
 
Qsar Studies on Gallic Acid Derivatives and Molecular Docking Studies of Bace...
Qsar Studies on Gallic Acid Derivatives and Molecular Docking Studies of Bace...Qsar Studies on Gallic Acid Derivatives and Molecular Docking Studies of Bace...
Qsar Studies on Gallic Acid Derivatives and Molecular Docking Studies of Bace...bioejjournal
 
Qsar studies on gallic acid derivatives and molecular docking studies of bace...
Qsar studies on gallic acid derivatives and molecular docking studies of bace...Qsar studies on gallic acid derivatives and molecular docking studies of bace...
Qsar studies on gallic acid derivatives and molecular docking studies of bace...bioejjournal
 

Tendances (20)

Accelerating lead optimisation with active learning by exploiting MMPA based ...
Accelerating lead optimisation with active learning by exploiting MMPA based ...Accelerating lead optimisation with active learning by exploiting MMPA based ...
Accelerating lead optimisation with active learning by exploiting MMPA based ...
 
Griffen MedChemica Virtual Tox Panel
Griffen MedChemica Virtual Tox PanelGriffen MedChemica Virtual Tox Panel
Griffen MedChemica Virtual Tox Panel
 
Molecular docking
Molecular dockingMolecular docking
Molecular docking
 
Determining stable ligand orientation
Determining stable ligand orientationDetermining stable ligand orientation
Determining stable ligand orientation
 
Open chemistry registry and mapping platform based on open source cheminforma...
Open chemistry registry and mapping platform based on open source cheminforma...Open chemistry registry and mapping platform based on open source cheminforma...
Open chemistry registry and mapping platform based on open source cheminforma...
 
Ai in drug design webinar 26 feb 2019
Ai in drug design webinar 26 feb 2019Ai in drug design webinar 26 feb 2019
Ai in drug design webinar 26 feb 2019
 
Structure based drug design- kiranmayi
Structure based drug design- kiranmayiStructure based drug design- kiranmayi
Structure based drug design- kiranmayi
 
Stable Drug Designing by Minimizing Drug Protein Interaction Energy Using PSO
Stable Drug Designing by Minimizing Drug Protein Interaction Energy Using PSO Stable Drug Designing by Minimizing Drug Protein Interaction Energy Using PSO
Stable Drug Designing by Minimizing Drug Protein Interaction Energy Using PSO
 
Deep learning methods applied to physicochemical and toxicological endpoints
Deep learning methods applied to physicochemical and toxicological endpointsDeep learning methods applied to physicochemical and toxicological endpoints
Deep learning methods applied to physicochemical and toxicological endpoints
 
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsMining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
 
MOLECULAR DOCKING AND RELATED DRUG DESIGN ACHIEVEMENTS
MOLECULAR DOCKING AND RELATED DRUG DESIGN ACHIEVEMENTS MOLECULAR DOCKING AND RELATED DRUG DESIGN ACHIEVEMENTS
MOLECULAR DOCKING AND RELATED DRUG DESIGN ACHIEVEMENTS
 
Resolving cryptic needles to molecular structures: The GtoPdb experience
Resolving cryptic needles to molecular structures: The GtoPdb experienceResolving cryptic needles to molecular structures: The GtoPdb experience
Resolving cryptic needles to molecular structures: The GtoPdb experience
 
Molecular docking
Molecular dockingMolecular docking
Molecular docking
 
molecular docking
molecular dockingmolecular docking
molecular docking
 
RSC Hatfield 2018 Kinase meeting : potency patents MMPA approaches
RSC Hatfield 2018  Kinase meeting : potency patents MMPA approachesRSC Hatfield 2018  Kinase meeting : potency patents MMPA approaches
RSC Hatfield 2018 Kinase meeting : potency patents MMPA approaches
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Molecular Docking Using Autodock Tools
Molecular Docking Using Autodock ToolsMolecular Docking Using Autodock Tools
Molecular Docking Using Autodock Tools
 
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
 
Qsar Studies on Gallic Acid Derivatives and Molecular Docking Studies of Bace...
Qsar Studies on Gallic Acid Derivatives and Molecular Docking Studies of Bace...Qsar Studies on Gallic Acid Derivatives and Molecular Docking Studies of Bace...
Qsar Studies on Gallic Acid Derivatives and Molecular Docking Studies of Bace...
 
Qsar studies on gallic acid derivatives and molecular docking studies of bace...
Qsar studies on gallic acid derivatives and molecular docking studies of bace...Qsar studies on gallic acid derivatives and molecular docking studies of bace...
Qsar studies on gallic acid derivatives and molecular docking studies of bace...
 

En vedette

GAMETIME Center, LLC. Exec.Summary
GAMETIME Center, LLC. Exec.SummaryGAMETIME Center, LLC. Exec.Summary
GAMETIME Center, LLC. Exec.SummaryRobert L Edwards
 
Counterfeiting and Semiconductor Value Chain Economics - COG 2013, Mr. Rory King
Counterfeiting and Semiconductor Value Chain Economics - COG 2013, Mr. Rory KingCounterfeiting and Semiconductor Value Chain Economics - COG 2013, Mr. Rory King
Counterfeiting and Semiconductor Value Chain Economics - COG 2013, Mr. Rory KingIHS
 
LWV US VG Nov 2016 I Web final
LWV US VG Nov 2016 I Web finalLWV US VG Nov 2016 I Web final
LWV US VG Nov 2016 I Web finalSarah Robinson
 
Wargaming.net: Secrets of YouTube
Wargaming.net: Secrets of YouTubeWargaming.net: Secrets of YouTube
Wargaming.net: Secrets of YouTubeDevGAMM Conference
 
Latin America
Latin AmericaLatin America
Latin AmericaMrO97
 
Hack the MOOC: alternative MOOC use
Hack the MOOC: alternative MOOC useHack the MOOC: alternative MOOC use
Hack the MOOC: alternative MOOC useInge de Waard
 
Unidad I Economia
Unidad I EconomiaUnidad I Economia
Unidad I Economiaeddith ruiz
 
Evaluation: Question 1- IN WHAT WAYS DOES YOUR MEDIA PRODUCT USE, DEVELOP OR ...
Evaluation: Question 1- IN WHAT WAYS DOES YOUR MEDIA PRODUCT USE, DEVELOP OR ...Evaluation: Question 1- IN WHAT WAYS DOES YOUR MEDIA PRODUCT USE, DEVELOP OR ...
Evaluation: Question 1- IN WHAT WAYS DOES YOUR MEDIA PRODUCT USE, DEVELOP OR ...ViPeRz16
 
151012 visioning workshop findings empathy map_ccd
151012 visioning workshop findings empathy map_ccd151012 visioning workshop findings empathy map_ccd
151012 visioning workshop findings empathy map_ccdMKThink Strategy
 
A Hokkien Poem That Teaches
A Hokkien Poem That TeachesA Hokkien Poem That Teaches
A Hokkien Poem That TeachesOH TEIK BIN
 
What Sphere Digital Recruitment Does
What Sphere Digital Recruitment DoesWhat Sphere Digital Recruitment Does
What Sphere Digital Recruitment DoesNiomi Cowling
 
Results of 2015 Summer GiveTogether
Results of 2015 Summer GiveTogetherResults of 2015 Summer GiveTogether
Results of 2015 Summer GiveTogetherErica Klinger
 
Johnston Press' trasformation strategy
Johnston Press' trasformation strategyJohnston Press' trasformation strategy
Johnston Press' trasformation strategyiamrobertandrews
 
lafa.su презентация
lafa.su презентацияlafa.su презентация
lafa.su презентацияArtem Malyutin
 

En vedette (17)

Steve Jobs
Steve JobsSteve Jobs
Steve Jobs
 
GAMETIME Center, LLC. Exec.Summary
GAMETIME Center, LLC. Exec.SummaryGAMETIME Center, LLC. Exec.Summary
GAMETIME Center, LLC. Exec.Summary
 
Presentación1 mama
Presentación1 mamaPresentación1 mama
Presentación1 mama
 
Counterfeiting and Semiconductor Value Chain Economics - COG 2013, Mr. Rory King
Counterfeiting and Semiconductor Value Chain Economics - COG 2013, Mr. Rory KingCounterfeiting and Semiconductor Value Chain Economics - COG 2013, Mr. Rory King
Counterfeiting and Semiconductor Value Chain Economics - COG 2013, Mr. Rory King
 
LWV US VG Nov 2016 I Web final
LWV US VG Nov 2016 I Web finalLWV US VG Nov 2016 I Web final
LWV US VG Nov 2016 I Web final
 
Wargaming.net: Secrets of YouTube
Wargaming.net: Secrets of YouTubeWargaming.net: Secrets of YouTube
Wargaming.net: Secrets of YouTube
 
Latin America
Latin AmericaLatin America
Latin America
 
Hack the MOOC: alternative MOOC use
Hack the MOOC: alternative MOOC useHack the MOOC: alternative MOOC use
Hack the MOOC: alternative MOOC use
 
Unidad I Economia
Unidad I EconomiaUnidad I Economia
Unidad I Economia
 
Evaluation: Question 1- IN WHAT WAYS DOES YOUR MEDIA PRODUCT USE, DEVELOP OR ...
Evaluation: Question 1- IN WHAT WAYS DOES YOUR MEDIA PRODUCT USE, DEVELOP OR ...Evaluation: Question 1- IN WHAT WAYS DOES YOUR MEDIA PRODUCT USE, DEVELOP OR ...
Evaluation: Question 1- IN WHAT WAYS DOES YOUR MEDIA PRODUCT USE, DEVELOP OR ...
 
151012 visioning workshop findings empathy map_ccd
151012 visioning workshop findings empathy map_ccd151012 visioning workshop findings empathy map_ccd
151012 visioning workshop findings empathy map_ccd
 
A Hokkien Poem That Teaches
A Hokkien Poem That TeachesA Hokkien Poem That Teaches
A Hokkien Poem That Teaches
 
What Sphere Digital Recruitment Does
What Sphere Digital Recruitment DoesWhat Sphere Digital Recruitment Does
What Sphere Digital Recruitment Does
 
Trastorno de personalidad
Trastorno de personalidadTrastorno de personalidad
Trastorno de personalidad
 
Results of 2015 Summer GiveTogether
Results of 2015 Summer GiveTogetherResults of 2015 Summer GiveTogether
Results of 2015 Summer GiveTogether
 
Johnston Press' trasformation strategy
Johnston Press' trasformation strategyJohnston Press' trasformation strategy
Johnston Press' trasformation strategy
 
lafa.su презентация
lafa.su презентацияlafa.su презентация
lafa.su презентация
 

Similaire à Open-source tools for querying and organizing large reaction databases

SF and PE CTR-IN 2016 Poster_FInal
SF and PE CTR-IN 2016 Poster_FInalSF and PE CTR-IN 2016 Poster_FInal
SF and PE CTR-IN 2016 Poster_FInalSteve Flynn
 
Drug properties (ADMET) prediction using AI
Drug properties (ADMET) prediction using AIDrug properties (ADMET) prediction using AI
Open-source tools for querying and organizing large reaction databases

  • 1. Gregory Landrum NIBR Informatics Novartis Institutes for BioMedical Research UK QSAR 2014 Open-source tools for querying and organizing large reaction databases
  • 2. Outline: Public data sources and reactions; Handling reactions with the RDKit; Fingerprints for reactions; Validation (machine learning, clustering); Application: identifying interesting clusters of reactions
  • 3. Public data sources in cheminformatics an aside at the beginning
  • 4. Protein Data Bank: the exception. Crystal structures of proteins; deposition is mandatory for publishing protein crystal structures.
  • 5. PubChem. [Plot: Evolution; series: Compounds, Assays (non-ChEMBL)] Collection of molecules from vendors and patents together with some assay data, primarily from NIH-funded screening centers.
  • 6. ChEMBL. [Plot: Evolution; series: Compounds, Activities; axis tick: 2009] Collection of molecules and assay data curated (primarily) from the literature.
  • 7. What about how we made those molecules? Public reaction data? The literature; plenty of data locked up in large commercial databases, very little in the open. Yan, L. et al. SAR studies of 3-arylpropionic acids as potent and selective agonists of sphingosine-1-phosphate receptor-1 (S1P1) with enhanced pharmacokinetic properties. Bioorganic & Medicinal Chemistry Letters 17, 828–831 (2007).
  • 8. An emerging area: chemical reactions. Not just what we made, but how we made it. Text-mining applied to open patent data to extract chemical reactions: 1.12 million reactions [1]. Reactions classified using namerxn, when possible, into 318 standard types: >599000 classified reactions [2]. Lots of reactions, lots of repeats. [1] Lowe DM: “Extraction of chemical structures and reactions from the literature.” PhD thesis. University of Cambridge: Cambridge, UK; 2012. [2] Reaction classification from Roger Sayle and Daniel Lowe (NextMove Software) http://nextmovesoftware.com/blog/2014/02/27/unleashing-over-a-million-reactions-into-the-wild/
  • 9. More about the classes. Frequency of classes, revisited. 20 most common classes (count, class number, class name):
        44675  2.1.2  Carboxylic acid + amine reaction
        39297  1.7.9  Williamson ether synthesis
        28194  2.1.1  Amide Schotten-Baumann
        26739  1.3.7  Chloro N-arylation
        22400  1.6.2  Bromo N-alkylation
        20465  7.1.1  Nitro to amino
        20405  1.6.4  Chloro N-alkylation
        17226  6.2.2  CO2H-Me deprotection
        16602  6.1.1  N-Boc deprotection
        16021  6.2.1  CO2H-Et deprotection
        12952  1.2.1  Aldehyde reductive amination
        12250  2.2.3  Sulfonamide Schotten-Baumann
        10659  11.9   Separation
         8538  3.1.5  Bromo Suzuki-type coupling
         7261  1.7.7  Mitsunobu aryl ether synthesis
         7102  6.3.7  Methoxy to hydroxy
         7071  3.3.1  Sonogashira coupling
         6472  3.1.1  Bromo Suzuki coupling
         6383  1.8.5  Thioether synthesis
         5791  9.1.6  Hydroxy to chloro
  • 10. RDKit: What is it? §  Open-source C++ toolkit for cheminformatics §  Wrappers for Python (2.x), Java, C# §  Functionality: •  2D and 3D molecular operations •  Descriptor generation for machine learning •  PostgreSQL database cartridge for substructure and similarity searching •  Knime nodes •  IPython integration •  Lucene integration (experimental) •  Supports Mac/Windows/Linux §  Releases every 6 months §  business-friendly BSD license §  Code: https://github.com/rdkit §  http://www.rdkit.org
  • 11. RDKit: Some features §  Input/Output: SMILES/SMARTS, SDF, TDT, PDB, SLN [1], Corina mol2 [1] §  “Cheminformatics”: •  Substructure searching •  Canonical SMILES •  Chirality support (i.e. R/S or E/Z labeling) •  Chemical transformations (e.g. remove matching substructures) •  Chemical reactions §  2D depiction, including constrained depiction §  2D->3D conversion/conformational analysis via distance geometry §  UFF and MMFF94 implementation for cleaning up structures §  Fingerprinting: Daylight-like, atom pairs, topological torsions, Morgan algorithm, “MACCS keys”, etc. §  Similarity/diversity picking §  2D pharmacophores [1] §  Gasteiger-Marsili charges §  Hierarchical subgraph/fragment analysis §  Bemis and Murcko scaffold determination §  RECAP and BRICS implementations §  Multi-molecule maximum common substructure §  Feature maps §  Shape-based similarity §  Fraggle similarity (from GSK) §  Molecule-molecule alignment §  Open3DAlign implementation §  Integration with PyMOL for 3D visualization §  Functional group filtering §  Salt stripping §  Molecular descriptor library: Topological (κ3, Balaban J, etc.), Compositional (Number of Rings, Number of Aromatic Heterocycles, etc.), EState, SlogP/SMR (Wildman and Crippen approach), “MOE like” VSA descriptors, Feature-map vectors §  Machine Learning: •  Clustering (hierarchical) •  Information theory (Shannon entropy, information gain, etc.) §  Tight integration with the IPython notebook and pandas §  Integration with the InChI library [1] These implementations are functional but are not necessarily the best, fastest, or most complete.
  • 13. RDKit reaction handling Virtual Protecting groups The problem: Introducing the protecting group on amide Ns: The result:
  • 14. Another approach for tuning specificity start with the problem again
  • 15. Another approach for tuning specificity and now the solution Thanks to Holger Claussen (BioSolveIT) for the idea to use atom values for this Query definitions added as atom values
  • 16. Got the reactions, what about reaction fingerprints? Criteria for them to be useful. Question 1: do they contain bits that are helpful in distinguishing reactions from one another? Test: can we use them with a machine-learning approach to build a reaction classifier? Question 2: are similar reactions similar with the fingerprints? Test: do related reactions cluster together?
  • 17. Similarity applied to reactions: what are we talking about? These two reactions are both of type “1.2.5 Ketone reductive amination”. It’s obvious that these are the same, right?
  • 18. Got the reactions, what about reaction fingerprints? Start simple: use difference fingerprints: $FP_{\mathrm{Reacts}} = \sum_{i \in \mathrm{Reactants}} FP_i$, $FP_{\mathrm{Products}} = \sum_{i \in \mathrm{Products}} FP_i$, $FP_{\mathrm{Rxn}} = FP_{\mathrm{Products}} - FP_{\mathrm{Reacts}}$. Similar idea here: 1) Ridder, L. & Wagener, M. SyGMa: Combining Expert Knowledge and Empirical Scoring in the Prediction of Metabolites. ChemMedChem 3, 821–832 (2008). 2) Patel, H., Bodkin, M. J., Chen, B. & Gillet, V. J. Knowledge-Based Approach to de Novo Design Using Reaction Vectors. J. Chem. Inf. Model. 49, 1163–1184 (2009).
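The difference-fingerprint formulas on this slide can be illustrated with a minimal sketch. It assumes each molecule's fingerprint is already available as a mapping from feature to count; the feature names and the helper names `sum_fingerprints` and `difference_fingerprint` are hypothetical and are not part of the RDKit or of the talk.

```python
from collections import Counter


def sum_fingerprints(fingerprints):
    """Add up count fingerprints (feature -> count) over several molecules."""
    total = Counter()
    for fp in fingerprints:
        total.update(fp)  # Counter.update adds counts feature by feature
    return total


def difference_fingerprint(reactant_fps, product_fps):
    """FP_Rxn = FP_Products - FP_Reacts, keeping only non-zero entries."""
    reacts = sum_fingerprints(reactant_fps)
    prods = sum_fingerprints(product_fps)
    features = set(reacts) | set(prods)
    # A Counter returns 0 for missing features, so plain subtraction is safe.
    diff = {f: prods[f] - reacts[f] for f in features}
    return {f: v for f, v in diff.items() if v != 0}


# Toy amide coupling with made-up feature names (illustration only).
acid = {"carboxylic_acid": 1, "aryl": 1}
amine = {"amine": 1}
amide = {"amide": 1, "aryl": 1}
rxn_fp = difference_fingerprint([acid, amine], [amide])
```

Features that are unchanged by the reaction (here the aryl ring) cancel out, so the remaining entries describe only what was consumed (negative) and what was formed (positive).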
  • 19. Refine the fingerprints a bit. Text-mined reactions often include reagents or solvents in the reactants. Explore two options for handling this: 1. Decrease the weight of reactant molecules where too many of the bits are not present in the product fingerprint. 2. Decrease the weight of reactant molecules where too many atoms are unmapped.
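Option 1 on this slide can be sketched as follows. The slide gives neither the cutoff nor the reduced weight, so `max_missing_fraction` and `low_weight` below are placeholder assumptions, as are the function names and the toy feature names; this is one possible reading of the idea, not the implementation used in the talk.

```python
def reactant_weight(reactant_fp, product_fp,
                    max_missing_fraction=0.5, low_weight=0.1):
    """Return a reduced weight when too many reactant features are absent
    from the product fingerprint (both thresholds are assumed values)."""
    total = sum(reactant_fp.values())
    if total == 0:
        return low_weight
    missing = sum(count for feature, count in reactant_fp.items()
                  if feature not in product_fp)
    return low_weight if missing / total > max_missing_fraction else 1.0


def weighted_reactant_sum(reactant_fps, product_fp, **kwargs):
    """Sum reactant fingerprints, down-weighting likely reagents/solvents."""
    total = {}
    for fp in reactant_fps:
        weight = reactant_weight(fp, product_fp, **kwargs)
        for feature, count in fp.items():
            total[feature] = total.get(feature, 0.0) + weight * count
    return total


# Toy example: the solvent shares no features with the product.
product = {"amide": 1, "aryl": 1}
acid = {"carboxylic_acid": 1, "aryl": 1}
solvent = {"chloroalkane": 2}
weights = [reactant_weight(fp, product) for fp in (acid, solvent)]
summed = weighted_reactant_sum([acid, solvent], product)
```

Option 2 would follow the same pattern, with the fraction of unmapped atoms taking the place of the fraction of missing features.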
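  • 20. Another reaction analysis scheme: looking at functional group changes. Similar idea to the fingerprint analysis: count the numbers of common functional groups in the reactants and products and subtract the one from the other:

        # Fragment from the slide. It assumes `import numpy as np` and
        # `from rdkit.Chem import FunctionalGroups`, that `rxn` is an RDKit
        # reaction, and that `fgh` is the functional-group hierarchy to use.
        rfp = None  # summed functional-group counts over the reactants
        for ri in range(rxn.GetNumReactantTemplates()):
            m = rxn.GetReactantTemplate(ri)
            fp = np.array(FunctionalGroups.CreateMolFingerprint(m, fgh))
            if rfp is None:
                rfp = fp
            else:
                rfp += fp
        pfp = None  # summed functional-group counts over the products
        for ri in range(rxn.GetNumProductTemplates()):
            m = rxn.GetProductTemplate(ri)
            fp = np.array(FunctionalGroups.CreateMolFingerprint(m, fgh))
            if pfp is None:
                pfp = fp
            else:
                pfp += fp
        fp = pfp - rfp  # functional-group change: products minus reactants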
  • 21. Functional groups considered: acidchloride acidchloride_aromatic acidchloride_aliphatic carboxylicacid carboxylicacid_aromatic carboxylicacid_aliphatic carboxylicacid_alphaamino sulfonylchloride sulfonylchloride_aromatic sulfonylchloride_aliphatic amine amine_primary amine_primary_aromatic amine_primary_aliphatic amine_secondary amine_secondary_aromatic amine_secondary_aliphatic amine_tertiary amine_tertiary_aromatic amine_tertiary_aliphatic amine_aromatic amine_aliphatic amine_cyclic boronicacid boronicacid_aromatic boronicacid_aliphatic isocyanate isocyanate_aromatic isocyanate_aliphatic alcohol alcohol_aromatic alcohol_aliphatic aldehyde aldehyde_aromatic aldehyde_aliphatic halogen halogen_aromatic halogen_aliphatic halogen_notfluorine halogen_notfluorine_aliphatic halogen_notfluorine_aromatic halogen_bromine halogen_bromine_aliphatic halogen_bromine_aromatic halogen_bromine_bromoketone azide azide_aromatic azide_aliphatic nitro nitro_aromatic nitro_aliphatic terminalalkyne
  • 22. Functional group changes analyzed
    Do the results make sense at all?
    Compare the average deltas for the >39K instances of Williamson ether synthesis:

    Functional Group              Avg in Reaction   Overall Average
    halogen                       -0.98             -0.3
    alcohol                       -0.95             -0.12
    halogen_nofluorine            -0.89             -0.27
    alcohol_aromatic              -0.67             -0.04
    halogen_aliphatic             -0.62             -0.15
    halogen_nofluorine_aliphatic  -0.62             -0.14
    carboxylicacid                -0.5              -0.23
    halogen_bromine               -0.42             -0.11
    halogen_bromine_aliphatic     -0.39             -0.06
    halogen_aromatic              -0.36             -0.16
    alcohol_aliphatic             -0.28             -0.08
    halogen_nofluorine_aromatic   -0.27             -0.13
    amine                         -0.04             -0.3
    amine_aliphatic               -0.04             -0.27
    carboxylicacid_aliphatic      -0.04             -0.08
    halogen_bromine_aromatic      -0.03             -0.05
    amine_tertiary                -0.02             -0.06
    amine_tertiary_aliphatic      -0.02             -0.08
    carboxylicacid_aromatic       -0.02             -0.03
    amine_cyclic                  -0.01             -0.02
    halogen_bromine_bromoketone   -0.01             0
    acidchloride                  0                 -0.07
    acidchloride_aliphatic        0                 -0.05
    acidchloride_aromatic         0                 -0.02
    aldehyde                      0                 -0.04
    aldehyde_aliphatic            0                 -0.01
    aldehyde_aromatic             0                 -0.03
    amine_aromatic                0                 -0.03
    amine_primary                 0                 -0.15
    amine_primary_aliphatic       0                 -0.07
    amine_primary_aromatic        0                 -0.07
    amine_secondary               0                 -0.04
    amine_secondary_aliphatic     0                 -0.07
    amine_secondary_aromatic      0                 0.03
    amine_tertiary_aromatic       0                 0
    azide                         0                 0
    azide_aliphatic               0                 0
    azide_aromatic                0                 0
    boronicacid                   0                 -0.03
    boronicacid_aliphatic         0                 0
    boronicacid_aromatic          0                 -0.03
    carboxylicacid_alphaamino     0                 0
    isocyanate                    0                 -0.01
    isocyanate_aliphatic          0                 0
    isocyanate_aromatic           0                 0
    nitro                         0                 -0.03
    nitro_aliphatic               0                 0
    nitro_aromatic                0                 -0.03
    sulfonylchloride              0                 -0.02
    sulfonylchloride_aliphatic    0                 -0.01
    sulfonylchloride_aromatic     0                 -0.01
    terminalalkyne                0                 -0.01

    These look sensible.
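A minimal sketch of how per-group averages like those in the table above could be computed. It assumes each reaction has already been reduced to functional-group counts for its reactants and its products, and that a delta means "count in products minus count in reactants" (which matches the negative halogen and alcohol values for an ether synthesis). The function names and the toy counts are hypothetical and are not taken from the talk or from the RDKit.

```python
from collections import Counter


def group_delta(reactant_counts, product_counts):
    """Per-reaction change in functional-group counts: products minus reactants."""
    groups = set(reactant_counts) | set(product_counts)
    return {g: product_counts.get(g, 0) - reactant_counts.get(g, 0) for g in groups}


def average_deltas(reactions):
    """Average the per-reaction deltas over (reactant_counts, product_counts) pairs."""
    totals = Counter()
    for reactant_counts, product_counts in reactions:
        for group, delta in group_delta(reactant_counts, product_counts).items():
            totals[group] += delta
    n = len(reactions)
    return {group: total / n for group, total in totals.items()}


# Toy, made-up counts for two ether-forming reactions (not from the patent data set).
toy_reactions = [
    ({"halogen": 1, "alcohol": 1}, {}),
    ({"halogen": 2, "alcohol": 1}, {"halogen": 1}),
]
toy_avg = average_deltas(toy_reactions)
```

Running the same averaging over all reactions, rather than over one reaction class, would give the "Overall Average" column to compare against.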
  • 23. Are the fingerprints useful?
    § Question 1: do they contain bits that help distinguish reactions from one another?
      Test: can we use them with a machine-learning approach to build a reaction classifier?
    § Question 2: are similar reactions similar according to the fingerprints?
      Test: do related reactions cluster together?
  • 24. Machine learning and chemical reactions
    § Validation set:
      • The 68 reaction types with at least 2000 instances from the patent data set
        - "Resolution" reaction types removed (e.g. 11.9 Separation and 11.1 Chiral separation)
        - Final: 66 reaction types
    § Process:
      • Training set is 200 random instances of each reaction type
      • Test set is 800 random instances of each reaction type
      • Learning: random forest (scikit-learn)
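The data-splitting part of the process above can be sketched as follows. This covers only the sampling step (200 training and 800 test instances per reaction type); the random-forest training itself is not shown. It assumes the training and test picks are disjoint, which the slide does not state explicitly. `split_per_class`, the seed, and the placeholder class names are hypothetical.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train=200, n_test=800, seed=0):
    """Return (train_idx, test_idx): random, disjoint picks of each size per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) < n_train + n_test:
            continue  # too few instances to fill both sets; skip this class
        picked = rng.sample(members, n_train + n_test)  # sampling without replacement
        train_idx.extend(picked[:n_train])
        test_idx.extend(picked[n_train:])
    return train_idx, test_idx


# Placeholder labels: two classes with 2000 instances each, one class that is too small.
toy_labels = ["class_a"] * 2000 + ["class_b"] * 2000 + ["class_c"] * 50
train_idx, test_idx = split_per_class(toy_labels)
```

Sampling the same number of instances from every class keeps the training and test sets balanced, so no single common reaction type dominates the classifier or the reported scores.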
  • 25. Learning reaction classes
    Results for test data, overall:
      • Recall: 0.94
      • Precision: 0.94
      • Accuracy: 0.94
    For a 66-class classifier, this looks pretty good!
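For reference, one way overall scores like these can be derived from a confusion matrix. The slide does not say how the per-class values were combined; this sketch assumes a plain (macro) average over classes. Under that assumption, and with equally sized test classes, average recall and accuracy are the same number by construction. The function name and the toy matrix are made up for illustration.

```python
def summarize(confusion):
    """Summarize confusion[i][j] = count of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    recalls, precisions = [], []
    for i in range(n):
        true_count = sum(confusion[i])                          # row sum
        predicted_count = sum(confusion[r][i] for r in range(n))  # column sum
        recalls.append(confusion[i][i] / true_count if true_count else 0.0)
        precisions.append(confusion[i][i] / predicted_count if predicted_count else 0.0)

    return {
        "accuracy": correct / total,
        "recall": sum(recalls) / n,        # macro average (an assumption)
        "precision": sum(precisions) / n,  # macro average (an assumption)
    }


# Toy 3-class matrix with 10 test instances per class (not the talk's data).
toy_confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 9],
]
toy_metrics = summarize(toy_confusion)
```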
  • 26. Learning reaction classes
    ~94% accuracy; much of the confusion is between related types.
    [Figure: confusion matrix for test data. Labeled classes: Bromo Suzuki coupling, Bromo Suzuki-type coupling, Bromo N-arylation]
  • 27. Are the fingerprints useful?
    § Question 1: do they contain bits that help distinguish reactions from one another?
      Test: can we use them with a machine-learning approach to build a reaction classifier?
    § Question 2: are similar reactions similar according to the fingerprints?
      Test: do related reactions cluster together?
  • 28. Clustering reactions
    § Reaction similarity validation set:
      • The 66 most common reaction types from the patent data set
      • Look at the homogeneity of clusters with at least 10 members
    [Figure labels: 1.2.5 Ketone reductive amination (shown three times); Integration]
    Interpretation: <30% of clusters are <90% homogeneous
    Interpretation: <40% of clusters are <80% homogeneous
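A small sketch of the homogeneity check described above. The slide does not define homogeneity; here it is assumed to be the fraction of a cluster's members that carry the cluster's most common reaction class. The function names and the toy clusters are hypothetical.

```python
from collections import Counter


def cluster_homogeneity(class_labels):
    """Fraction of cluster members that carry the cluster's most common class label."""
    counts = Counter(class_labels)
    return counts.most_common(1)[0][1] / len(class_labels)


def fraction_below(clusters, threshold, min_size=10):
    """Fraction of clusters with at least min_size members whose homogeneity is below threshold."""
    big = [c for c in clusters if len(c) >= min_size]
    if not big:
        return 0.0
    return sum(cluster_homogeneity(c) < threshold for c in big) / len(big)


# Toy clusters of reaction-class labels (made up, not from the patent data set).
toy_clusters = [
    ["1.2.5"] * 12,                 # fully homogeneous
    ["1.2.5"] * 9 + ["other"] * 3,  # 75% homogeneous
    ["1.2.5"] * 4,                  # ignored: fewer than 10 members
]
toy_fraction = fraction_below(toy_clusters, threshold=0.9)
```

Calling `fraction_below` with thresholds of 0.9 and 0.8 gives the kind of numbers quoted in the two interpretation lines.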
  • 29. Similarity applied to reactions
    Can we help classify the remaining 600K reactions?
    § Starting point: we have a similarity measure that clusters related reactions together
    § We can apply the machine-learning model to the unclassified reactions and see if the original assignment missed any instances
    § We can then look for big clusters of unclassified reactions and (manually) assign classes to them
  • 30. Finding related unclassified reactions
    § Process:
      1. Pick 10K random unclassified reactions
      2. Cluster using the same fingerprint described above
      3. Characterize clusters by average functional-group profile
      4. Pick clusters where there is a clear signal
    § An example:
      Cluster 12
        amine                       -0.68
        amine_secondary             -0.35
        amine_secondary_aliphatic   -0.35
        amine_aliphatic             -0.61
        aldehyde                    -0.58
        aldehyde_aromatic           -0.58
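Step 4 above could be approximated with a simple magnitude cutoff on the cluster's average functional-group profile. The slide does not say how a "clear signal" was judged, so the 0.5 threshold and the `clear_signals` helper are assumptions for illustration only; the profile values are the Cluster 12 numbers from the slide.

```python
def clear_signals(profile, threshold=0.5):
    """Return (group, average delta) pairs whose magnitude meets the threshold, strongest first."""
    hits = [(group, delta) for group, delta in profile.items() if abs(delta) >= threshold]
    return sorted(hits, key=lambda item: abs(item[1]), reverse=True)


# Average functional-group deltas for Cluster 12, as listed on the slide.
cluster_12 = {
    "amine": -0.68,
    "amine_secondary": -0.35,
    "amine_secondary_aliphatic": -0.35,
    "amine_aliphatic": -0.61,
    "aldehyde": -0.58,
    "aldehyde_aromatic": -0.58,
}
signals = clear_signals(cluster_12)
```

With this cutoff the amine and aldehyde groups stand out as being consumed in most reactions of the cluster, which is the kind of pattern a chemist can then inspect and name by hand.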
  • 31. Example reactions from cluster 12
      • Clearly related reactions
      • Using this approach we've identified a number of reaction classes
  • 32. Wrapping up
    § Dataset: 1+ million reactions text-mined from patents (publicly available) with reaction classes assigned
    § Fingerprint: weighted atom-pair delta fingerprints implemented using the RDKit
    § Fingerprint validation:
      • Multiclass random-forest classifier is ~94% accurate
      • Similarity measure works: similar reactions cluster together
    § Combining clustering with functional-group analysis allows identification of new reaction classes
  • 33. Acknowledgements
    § NIBR:
      • Anna Pelliccioli
      • Sereina Riniker
      • Mike Tarselli
    § NextMove Software:
      • Roger Sayle
      • Daniel Lowe
  • 34. Advertising
    3rd RDKit User Group Meeting
    22-24 October 2014
    Merck KGaA, Darmstadt, Germany
    Talks, "talktorials", lightning talks, social activities, and a hackathon on the 24th.
    Registration: http://goo.gl/z6QzwD
    Full announcement: http://goo.gl/ZUm2wm
    We're looking for speakers. Please contact greg.landrum@gmail.com