Extracting Synthetic Knowledge from Reaction Databases
Orr Ravitz
SimBioSys Inc.
246th ACS National Meeting
ARChem – main concepts
A computer-aided synthesis design system.
The Approach:
• Comprehensive rule- and precedent-based retrosynthetic analysis back to available starting materials.
• Automated rule generation with manual rule curation.
• Generate many alternatives.
• Provide supporting literature examples.
• Allow user guidance and control.
Solution Display
Exploring Alternative Paths
Supporting Examples
Chemical Interference
Functional groups that may interfere with transformations are highlighted.
Functional Group Tolerance
A breakdown of the example set by the presence of functional groups beyond the reaction center provides evidence for compatibility.
Examples can be exported to the database's web interface for further analysis.
Stereochemistry
Currently:
• Exact matches
• Starting materials
Coming soon:
• Rule-based
Essential Information
Automated extraction of knowledge
• Reaction rules
• Yield values
• Chemical interference - functional group tolerance
• Regioselectivity
• Stereochemistry
Data → (Perceive) → Information → (Generalize) → Knowledge
System Design
Reactions
Reaction Rules
Starting Materials
Expert Knowledge-bases
Target
Rule Extraction
Reactions → Reaction Rules
Source reactions: esterification examples, other examples
Reaction rules: esterification rule, other rule
Reaction Perception
Reactions → Reaction Rules
Source reaction: reaction file with atom mapping.
Extracted core: atoms attached to bonds changed, made or broken in the reaction.
Extended core: includes all structural motifs that are essential for the reaction to occur.
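The extracted-core definition above can be sketched in a few lines of Python. This is only an illustration, not ARChem's implementation: the bond dictionaries keyed by pairs of atom-map numbers, the function name `extract_core` and the esterification numbering are all assumptions made for the example.

```python
def extract_core(reactant_bonds, product_bonds):
    """Return the atom-map numbers attached to bonds changed, made or broken.

    Each argument maps a frozenset of two atom-map numbers to a bond order.
    This representation is an assumption chosen for the sketch.
    """
    core = set()
    for bond in reactant_bonds.keys() | product_bonds.keys():
        # A bond missing on one side (made or broken) or with a different
        # order (changed) puts both of its atoms in the core.
        if reactant_bonds.get(bond) != product_bonds.get(bond):
            core |= bond
    return core


# Hypothetical mapped esterification: C1(=O2)-O3 + O4-C5 -> C1(=O2)-O4-C5
reactants = {frozenset({1, 2}): 2, frozenset({1, 3}): 1, frozenset({4, 5}): 1}
products = {frozenset({1, 2}): 2, frozenset({1, 4}): 1, frozenset({4, 5}): 1}

core = extract_core(reactants, products)
print(sorted(core))
```

Here the C1-O3 bond is broken and the C1-O4 bond is made, so atoms 1, 3 and 4 form the extracted core, while the carbonyl oxygen and the alkyl carbon stay outside it until the core is extended.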
Extending the Core: Passengers vs Drivers
The goal of chemical perception is to discriminate between structural features that are essential for the reaction, and those that are passengers.
Shell-based approach: 1st shell, 2nd shell
Graph-based methods are inappropriate.
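For reference, the shell-based approach named above amounts to growing the core outward by a fixed number of bonds. The sketch below shows that baseline only; the adjacency-dictionary representation and the function name `extend_by_shells` are assumptions. It selects atoms purely by graph distance, with no way to tell a driver from a passenger, which may be why the slide calls graph-based methods inappropriate.

```python
def extend_by_shells(adjacency, core, n_shells):
    """Grow the core outward by n_shells bonds (breadth-first expansion)."""
    selected = set(core)
    frontier = set(core)
    for _ in range(n_shells):
        # Next shell: neighbours of the current frontier not yet selected.
        frontier = {nbr for atom in frontier for nbr in adjacency[atom]} - selected
        selected |= frontier
    return selected


# Hypothetical linear chain 1-2-3-4-5 with atom 3 as the extracted core
adjacency = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

first_shell = extend_by_shells(adjacency, {3}, 1)
second_shell = extend_by_shells(adjacency, {3}, 2)
print(sorted(first_shell), sorted(second_shell))
```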
Mechanism-Dependent Core Extension
Nucleophilic aromatic substitution:
Addition/elimination mechanism: requires a π acceptor group in the ortho or para position
Via organometallic intermediate
Rule Extraction
Reactions → Reaction Rules
Similar extended cores
Completed reaction rule
Common extracted core
Nucleofuge (NF): a leaving group which carries away the bonding electron pair.
Generalized rule
Generalized group (NF) is replaced by the most common group.
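Replacing a generalized group by its most common instance is a simple frequency count. A minimal sketch, assuming the leaving groups observed across a rule's examples are available as a list of labels (the list contents and the helper name `most_common_group` are hypothetical):

```python
from collections import Counter


def most_common_group(observed_groups):
    """Return the group label that occurs most often in the example set."""
    counts = Counter(observed_groups)
    group, _ = counts.most_common(1)[0]
    return group


# Hypothetical nucleofuges seen in the examples behind one generalized rule
leaving_groups = ["Cl", "Br", "Cl", "F", "Cl", "Br"]
print(most_common_group(leaving_groups))
```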
Interfering Functionality
Following rule abstraction, compatible functionality is detected by examining the examples:
Compatible
Interfering
• Moieties outside the extended core are listed as compatible.
• Other functional groups are inferred as 'possibly interfering'.
• Possibly interfering functionality is penalized in scoring and highlighted to the user.
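The three points above reduce to set operations. The sketch below is an assumption-laden illustration rather than the ARChem scoring function: the functional-group names, the catalogue of known groups and the flat per-group penalty are all hypothetical.

```python
def classify_groups(example_groups, known_groups):
    """Split known functional groups into compatible and possibly interfering.

    example_groups holds one set per example, listing the groups found
    outside the extended core of that example.
    """
    compatible = set().union(*example_groups)
    possibly_interfering = set(known_groups) - compatible
    return compatible, possibly_interfering


def interference_penalty(target_groups, possibly_interfering, per_group=1.0):
    """Penalize a target for each possibly interfering group it carries."""
    return per_group * len(set(target_groups) & possibly_interfering)


# Hypothetical data for one rule
examples = [{"ester", "ether"}, {"nitrile"}]
catalogue = {"ester", "ether", "nitrile", "aldehyde", "thiol"}

compatible, interfering = classify_groups(examples, catalogue)
print(sorted(compatible), sorted(interfering))
print(interference_penalty({"aldehyde", "ether"}, interfering))
```

A group is flagged only because no example shows it surviving the reaction, which is weaker than evidence that it interferes. That matches the hedged wording 'possibly interfering'.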
Regioselectivity – Main Steps
• Recognize the rule's reaction type: electrophilic substitution, nucleophilic addition etc.
• Only reactions prone to regioselectivity are subject to regio calculations.
• Identify competing sites.
• Identify substituents and other structural motifs that may influence the directionality.
• Collect statistics from the example set regarding selectivity in the reaction core as well as elsewhere in the molecule (chemoselectivity).
• Assign regioselectivity to the rule if predefined statistical requirements are met.
Collecting Statistics
Electrophilic aromatic substitutions
For each example in DB:
• Evaluate ring activation, including for heteroaromatic rings and fused rings
• Evaluate location, type and neighborhood of ring substituents
• Identify symmetry
• Compute environment signatures that include all aromatic features plus relevant substituents
For each rule:
• Cluster reacting vs. non-reacting signature-equivalent sites for reactions with yield > 20%
• Define regioselectivity if the ratio of examples reaches 10:1
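The per-rule statistics can be sketched as a pairwise count of which site signature reacted in preference to which. This is one possible reading of the steps above, not the actual algorithm: the tuple layout of an example, the signature strings, and the treatment of a zero count on the losing side (counted as one) are assumptions; only the 20% yield cut-off and the 10:1 ratio come from the slide.

```python
from collections import defaultdict


def regioselective_pairs(examples, min_yield=20.0, min_ratio=10.0):
    """Find signature pairs where one site reacts in preference to the other.

    examples: iterable of (yield_percent, reacting_signature,
    non_reacting_signatures) tuples, one per database example.
    """
    wins = defaultdict(int)
    for yield_percent, reacted, others in examples:
        if yield_percent <= min_yield:
            continue  # low-yield examples are rejected as evidence
        for other in others:
            wins[(reacted, other)] += 1

    preferences = {}
    for (site_a, site_b), n_ab in wins.items():
        n_ba = wins.get((site_b, site_a), 0)
        # Assumption: with no counter-examples, require at least min_ratio wins.
        if n_ab >= min_ratio * max(n_ba, 1):
            preferences[(site_a, site_b)] = (n_ab, n_ba)
    return preferences


# Hypothetical example set for one electrophilic aromatic substitution rule
examples = (
    [(80.0, "para-to-OMe", ["meta-to-OMe"])] * 12
    + [(50.0, "meta-to-OMe", ["para-to-OMe"])]
    + [(5.5, "meta-to-OMe", ["para-to-OMe"])] * 3
)
print(regioselective_pairs(examples))
```

The three 5.5% examples fall below the yield cut-off and are ignored, so the para site wins 12:1 and the rule would be marked regioselective for that pair.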
Regio Example
X=Cl, 84% X=Cl, 5.5%Rejected
Misinterpreted yield
value provided
positive evidence
Stereochemistry – the challenge
• Efficient machine perception and representation of a broad range of synthetically important stereogenic types, including tetrahedral C, S, N and P, as well as alkenes, allenes and atropisomers
• Representation of stereochemical reaction rules and stereochemical strategies
• Develop a versatile stereochemical substructure algorithm to support retron matching
• Efficient discovery of symmetry in stereochemically defined molecules and rules, to avoid duplicate routes
• Stereoselectivity is captured inaccurately and inconsistently across common databases.
The Data
Database content                                                        Portion of data   Notes
Number of unmapped examples                                             14%               Reaction type unknown
Number of examples belonging to reactions with 5000 or more examples    4%                Ubiquitous protection / deprotection reactions
Number of examples belonging to reactions with 20 or fewer examples     16%               Bad atom maps (database errors); multistep reaction sequences
Generally usable examples                                               65%
[Chart: Examples with quoted selectivity values. x-axis: selectivity metric (yield, cs, de, ee); y-axis: % of database, 0 to 100]
[Chart: Examples with selectivity above a threshold. x-axis: threshold selectivity values (> 0%, > 25%, > 50%, > 75%, > 90%, > 95%, > 98%); y-axis: % of available, 0 to 100; series: yield, cs, de, ee]
Stereo-Rules Generation – A Different Approach
• Manually code rules for a diverse set of useful enantioselective and generally selective reaction types.
• Mine supporting examples from existing large reaction databases to discover reaction scope and limitations for each rule.
• Find effective strategies to aid planning of a stereocontrolled synthesis.
Reactions: Diels-Alder, Sharpless, reduction of C=C, reduction of C=O
Designing a Rule-Set
70 reaction types with ee > 95% and more than 50 examples
Reaction type                        Bond alterations                  Examples with ee ≥95%   Notes
Addition of C nucleophiles to C=C    CH + C=C → CCCH                   1603                    Mostly conjugate additions
Reduction of C=O                     C=O → HCOH                        1553                    Any type of carbonyl
Addition of C nucleophiles to C=O    CH + C=O → CCOH                   1265                    Includes mostly aldols + alkynylations
Reduction of C=C                     C=C → HCCH                        1120                    Wide variety of environments
Addition of C nucleophiles to C=N    CH + C=N → CCNH                   639                     Any type of C=N
Epoxidation of C=C                   C=C → C1CO1                       415                     Sharpless, Jacobsen, Shi etc.
Addition via R3B to C=C              C-B + C=C → CCCH                  329                     Mostly conjugate addition to enones
Addition via R2Zn to C=O             C-Zn + C=O → CCOH                 306
Dihydroxylation of C=C               C=C → HOCCOH                      266
Reduction of C=N                     C=N → HCNH                        256                     Any type of C=N
Diels-Alder                          C=C + C=CC=C → C1CCC=CC1          222                     Carbocyclic Diels-Alder
Cyclopropanation of C=C              C=N + C=C → C1CC1                 222                     Via diazo precursor (carbene)
Mukaiyama Aldol                      SiOC=C + C=O → O=CCCOH            210
C substitution of Br                 CH + CBr → CC                     199
[2+3] azomethine cycloaddition       C=NCH + C=C → N1CCCC1             198
Addition via R2Zn to C=C             CZn + C=C → CCCH                  162                     Mostly conjugate addition to enones
Addition via R3B to C=O              CB + C=O → CCOH                   141
Oxidation of sulphides               S → S=O                           137                     Chiral sulphoxides
Perception of stereochemistry in structural diagrams
Enabling Technology
Stereocenter manipulation and stereo descriptors
Conceptual Model   Stereo Descriptor
Rotations (E + 8C3 + 3C2):
     Op      1 2 3 4
A    E       1 2 3 4
B    C3^2    1 3 4 2
C    C3^1    1 4 2 3
D    C2      2 1 4 3
E    C3^1    2 3 1 4
F    C3^2    2 4 3 1
G    C3^2    3 1 2 4
H    C3^1    3 2 4 1
J    C2      3 4 1 2
K    C3^1    4 1 3 2
L    C3^2    4 2 1 3
M    C2      4 3 2 1
Reflection:
     Op      1 2 3 4
     σ       2 1 3 4
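The twelve rotations in the table are exactly the even permutations of the four ligand positions, while a reflection is an odd permutation. A short Python check of that fact follows; the inversion-count parity test is a standard technique used here for illustration and is not taken from the slides.

```python
from itertools import permutations


def parity(perm):
    """Return 0 for an even permutation and 1 for an odd one."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2


# Even permutations of four ligands: the rotations of a tetrahedral centre
rotations = [p for p in permutations((1, 2, 3, 4)) if parity(p) == 0]

# The reflection listed on the slide swaps ligands 1 and 2
reflection = (2, 1, 3, 4)

print(len(rotations), parity(reflection))
```

Two ligand orderings therefore describe the same stereoisomer when they differ by an even permutation and opposite configurations when they differ by an odd one, which is what lets a stereo descriptor be reduced to a parity.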
Chemical constraints layer of representation
Enabling Technology
CONNECTIONS=1,2,3 FUSION=BIARYL
RINGS=5+6,6+7 BRIDGEHEAD=YES
DIFFRING=1 EPS=0,1
SAMERING=1 HETS=0,1,2
DIFF=1 NONAROMHETS=0,1,2
SAME=1 HALOGENS=0,1,2
ARYL=YES FGS=ALCOHOL
SPCENTRE=1,2,3 FGNOT=CARBONYL
CHARGE=YES PROP=EWG
HS=0,1,2 PROPNOT=Lg
Substructure search/match
Reduction of Ketones to Secondary Alcohols
Level 0: Bond change constraints only
Level 1: + Environment constraints
Level 2: + Stereochemical constraints
Base ARChem rule
Hits      ee       de       (screen)
10,004                      (10,004)    Not unique to ketone → secondary alcohol conversion
8,442                       (10,004)    Unique to ketone → secondary alcohol conversion; 140 tolerated functional groups
6,525     3,457    4,711    (6,765)     Enantioselective and diastereoselective examples
Dihydroxylation of Alkenes
Level 1: Bond changes with environment constraints
Level 2: + Stereochemical constraints
Level 3: + Substitution patterns
2253 examples (2416 screened)
Hits ee de (screen)
1,428 1,008 1,151 (1,634)
428 117 352 (444)
Hits ee de
681 578 552
526 289 418
206 131 168
12 10 11
236 89 191
123 51 103
51 27 41
8 4 7
Conclusions
• Useful chemical knowledge can be extracted algorithmically from reaction databases.
• Automation is crucial given the size and growth of databases.
• Different layers of knowledge are tightly entangled: regioselectivity, chemoselectivity and stereoselectivity overlap considerably.
• The extracted knowledge can be applied effectively in computer-aided synthesis design, and can empower chemists by offering new ideas and a broader perspective on the literature.
But...
The quality of extracted knowledge depends heavily on the accuracy and scope of the source data!
The Rule-Set
Cut-off threshold
Useful reactions
Noise
Distractions
Low utility reactions
Bad atom maps (avoid)
Rare multistep reaction sequences (low utility)
Multiple concurrent reactions on substrate (very low utility)
Exotic heterocycle formation (promote)
Ubiquitous protection / deprotection FGIs such as alcohol/ester, amine/amide etc. (demote)
Conclusions
• A significant portion of the data is being lost due to mapping errors and other problems.
• Yield and selectivity information is captured inconsistently.
What can be done:
• Metadata perception can be improved (in progress).
• Mapping algorithms should reflect contemporary mechanistic understanding of reactions.
• Systematic mapping errors can be manually fixed (planned).
• Extracted rules can be manually curated (continuous).
Acknowledgements
SimBioSys
James Law - Regioselectivity
Victoria Lubitch
Yasamin Salmasi
Aniko Simon
Zsolt Zsoldos
Reaction Data
Elsevier – Reaxys
Wiley - CIRX
RSC - MOS
Accelrys - RefLib
University of Leeds
Tony Cook - Stereochemistry
Peter Johnson
Steve Marsden
Other Collaborators
ChemAxon
And…
ARChem users! THANK YOU!
