Universal SMILES
            Finally, a canonical SMILES string?



                          Noel M. O’Boyle
Analytical and Biological Chemistry Research Facility, University College
                             Cork, Ireland
         (Current address: NextMove Software, Cambridge, UK)

                                Apr 2013
                        245th ACS National Meeting
                              New Orleans


                                                               Open Babel




Introduction to Canonical SMILES

         How to create a SMILES string
(1) Pick a starting atom
(2) Traverse the molecular graph in a Depth-First manner
(3) Encode the atoms and bonds traversed as a text string

• Let’s assume that step (3) is done in a standard manner

• Variation in steps (1) and (2) leads to many different
  possible SMILES


[Figure: ethanol drawn twice, once read from the carbon end (C C O) and once from the oxygen end (O C C)]

• Ethanol as CCO or OCC (among others)
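
A minimal sketch of steps (1) to (3), assuming a toy adjacency-list model of ethanol with implicit hydrogens and single bonds only. The names ATOMS, NEIGHBOURS and write_smiles are illustrative and do not come from any toolkit. It shows that the choice of starting atom alone changes the string:

```python
# Toy molecular graph for ethanol: atom index -> element, plus an adjacency list.
# Hydrogens are implicit and every bond is single, so no bond symbols are written.
ATOMS = {0: "C", 1: "C", 2: "O"}
NEIGHBOURS = {0: [1], 1: [0, 2], 2: [1]}


def write_smiles(start):
    """Depth-first traversal from `start`; all branches but the last are parenthesised."""
    def visit(atom, parent):
        children = [n for n in NEIGHBOURS[atom] if n != parent]
        text = ATOMS[atom]
        for child in children[:-1]:
            text += "(" + visit(child, atom) + ")"
        if children:
            text += visit(children[-1], atom)
        return text

    return visit(start, None)


for start in ATOMS:
    print(start, write_smiles(start))
```

Starting from atoms 0, 1 and 2 this prints CCO, C(C)O and OCC respectively.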

  How to create a canonical SMILES string
(1) Give each atom a canonical label (“canonicalize”)
(2) Pick as starting atom the one with the smallest label*
(3) Traverse the molecular graph in a Depth-First manner
    following the atom with the smallest label at each branch
    point*
(4) Encode the atoms and bonds traversed as a text string
• The same SMILES string will always be generated
   – The “canonical SMILES”

[Figure: ethanol in two input atom orders (C C O and O C C), each atom annotated with its canonical label (1, 2 or 3)]

• Ethanol always* as CCO

* For example.
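
The four steps can be sketched as follows for acyclic, single-bonded molecules. This assumes the canonical labels are already available (here they are simply supplied by hand as C = 1, C = 2, O = 3, which is an illustrative choice); canonical_smiles is a hypothetical helper, not a toolkit function.

```python
def canonical_smiles(elements, bonds, labels):
    """Write SMILES for an acyclic, single-bonded molecule using canonical labels.

    elements: element symbol per input atom
    bonds:    pairs of input atom indices
    labels:   canonical label per input atom (assumed to come from a canonicalizer)
    """
    neighbours = {i: [] for i in range(len(elements))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    def visit(atom, parent):
        # At each branch point, follow the neighbour with the smallest label first.
        children = sorted((n for n in neighbours[atom] if n != parent),
                          key=lambda n: labels[n])
        text = elements[atom]
        for child in children[:-1]:
            text += "(" + visit(child, atom) + ")"
        if children:
            text += visit(children[-1], atom)
        return text

    # Start at the atom carrying the smallest canonical label.
    start = min(range(len(elements)), key=lambda i: labels[i])
    return visit(start, None)


# Ethanol entered in two different atom orders; the labels sit on the same atoms.
print(canonical_smiles(["C", "C", "O"], [(0, 1), (1, 2)], [1, 2, 3]))
print(canonical_smiles(["O", "C", "C"], [(0, 1), (1, 2)], [3, 2, 1]))
```

Both calls print CCO, because the traversal depends only on the labels and not on the input order.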

      Why is a canonical SMILES useful?
• Check identity
   – Graph isomorphism is faster, but less convenient
• Find/avoid duplicates
• Find overlap of two databases
• Check that a structure remains unchanged
   – E.g. after some transformation


• Canonical SMILES retains the features of regular
  SMILES
   – Although slower to calculate

  Why are there different canonical SMILES?

• There is no published canonical SMILES implementation
  for the general case
    – Neither Weininger, Weininger nor Weininger [1] described how to
      handle stereochemistry


• Canonicalization is difficult
    – Not a simple algorithm, many corner cases
    – Trade secret


• End result: Each cheminformatics toolkit generates its
  own canonical SMILES

[1] Weininger D, Weininger A, Weininger JL. SMILES. 2. Algorithm for generation of
    unique SMILES notation. J. Chem. Inf. Comput. Sci. 1989, 29, 97.

       Why a “Universal” canonical SMILES?
• All the benefits of a globally unique identifier (like the
  InChI)
   – Can link databases
   – Of benefit to the average chemist, as having different SMILES for
     the same molecule is confusing
   – Can immediately see if the Wikipedia SMILES is in agreement
     with the PubChem SMILES


• Finally possible to compare SMILES strings from
  different toolkits
   –   Identify bugs
   –   Explore underlying chemical models (e.g. aromatic models)
   –   Explore underlying stereochemistry perception
   –   Lead to improvements in quality and standards

Why base a canonical SMILES on the InChI?
• Canonicalization is complicated
   – Devising and describing a general canonicalization procedure
     that others could implement exactly may not be possible
• Better to build on existing work
   – Take advantage of the stellar work by the InChI team
   – The InChI has already solved the canonicalization problem for a
     broad section of chemistry
• It’s ubiquitous
   – The InChI is available in almost all cheminformatics toolkits


• Finally, all toolkits will be able to create the same
  canonical SMILES string
   – The “Universal SMILES” string!




How to use the InChI to create a Universal SMILES string

  How to get canonical labels from the InChI
• Use the Auxiliary Information, Luke
      $ obabel -:"ClCC(=O)Br" -oinchi -xa
      InChI=1S/C2H2BrClO/c3-2(5)1-4/h1H2
      AuxInfo=1/0/N:2,3,5,1,4/rA:5ClCCOBr/rB:s1;s2;d3;s3;/rC:;;;;;
• /N section gives the canonical labels
   – Canonical labels 1 through 5 correspond to input atoms 2, 3, 5, 1
     and 4, respectively
   – E.g. canonical label 3 is applied to input atom 5, the Bromine


• For Universal SMILES, I used two non-standard options
   – /FixedH: Enable the correct application of canonical labels in
     cases involving molecular symmetry broken by protonation states
   – /RecMet: Do not disconnect metals, as the labels for ligands will
     not be canonical
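
Reading the /N section can be sketched in a few lines. This assumes a single connected structure (so the /N layer is one comma-separated list) and ignores the extra layers produced by the non-standard options; canonical_labels_from_auxinfo is an illustrative name.

```python
def canonical_labels_from_auxinfo(auxinfo):
    """Return {input atom number: canonical label} from the /N: layer of an AuxInfo string.

    Sketch for single-component structures only.
    """
    n_layer = next((layer for layer in auxinfo.split("/")
                    if layer.startswith("N:")), None)
    if n_layer is None:
        raise ValueError("no /N: layer found in AuxInfo")
    # Position k (1-based) in the list holds the input atom that received label k.
    input_atoms = [int(field) for field in n_layer[2:].split(",")]
    return {atom: label for label, atom in enumerate(input_atoms, start=1)}


aux = "AuxInfo=1/0/N:2,3,5,1,4/rA:5ClCCOBr/rB:s1;s2;d3;s3;/rC:;;;;;"
labels = canonical_labels_from_auxinfo(aux)
print(labels)     # {2: 1, 3: 2, 5: 3, 1: 4, 4: 5}
print(labels[5])  # 3, the canonical label of input atom 5 (the bromine)
```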

   Walk this way: Rules for graph traversal
• Start the graph traversal at the atom with the lowest
  canonical label
   – For disconnected structures, visit each structure in order of its
     lowest canonical label
• Visit atoms in a depth-first manner
   – At each branch point, multiple bonds are favoured over single or
     aromatic bonds, and lower canonical labels over higher.


[Figure: the acid chloride drawn three times; canonical labels 1 and 2 on the carbons, 3 on Cl and 4 on O]
• Universal SMILES for this acid chloride: CC(=O)Cl
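
A sketch of these traversal rules for acyclic molecules, leaving out ring closures and aromatic bonds. The labels (C = 1, C = 2, Cl = 3, O = 4) are taken from the figure as far as it can be read, and traverse is a hypothetical helper rather than Open Babel's code.

```python
BOND_SYMBOL = {1: "", 2: "=", 3: "#"}


def traverse(atoms, bonds):
    """atoms: {canonical label: element}; bonds: {(label_a, label_b): bond order}.

    Acyclic structures only; ring closures and aromatic bonds are not handled.
    """
    neighbours = {label: [] for label in atoms}
    for (a, b), order in bonds.items():
        neighbours[a].append((b, order))
        neighbours[b].append((a, order))

    visited = set()

    def visit(atom):
        visited.add(atom)
        # Multiple bonds first (False sorts before True), then lower labels.
        branches = sorted(((n, o) for n, o in neighbours[atom] if n not in visited),
                          key=lambda item: (item[1] == 1, item[0]))
        text = atoms[atom]
        for i, (nbr, order) in enumerate(branches):
            piece = BOND_SYMBOL[order] + visit(nbr)
            is_last = i == len(branches) - 1
            text += piece if is_last else "(" + piece + ")"
        return text

    # Disconnected structures: start each one at its lowest unvisited label.
    parts = [visit(label) for label in sorted(atoms) if label not in visited]
    return ".".join(parts)


acid_chloride = traverse({1: "C", 2: "C", 3: "Cl", 4: "O"},
                         {(1, 2): 1, (2, 3): 1, (2, 4): 2})
print(acid_chloride)  # CC(=O)Cl
```

Ordering by label alone would give CC(Cl)=O; preferring the double bond is what puts =O in the first branch.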

          Corner case: Explicit hydrogens
• Sometimes a SMILES string contains explicit hydrogens
   – Hydrogen isotopes, dihydrogen, hydrogen atoms, hydrogen ions
• Sometimes the InChI labels hydrogens
   – Hydrogen atoms, bridging hydrogens


• The problem:
   – What to do about explicit hydrogens unlabelled by the InChI?
• A solution:
   – Consider these to have a low canonical label
   – That is, in the traversal visit these hydrogens prior to other singly-
     bonded branches


         C([2H])([3H])Cl rather than C(Cl)([3H])[2H]
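
One way to express this rule is as a sort key over the branches leaving an atom. In the sketch below the tuple layout, the label 2 given to Cl, and the tie-break by isotope mass between two unlabelled hydrogens are all assumptions made for illustration; the slide only states that unlabelled hydrogens come before other singly-bonded branches.

```python
def branch_priority(neighbour):
    """Sort key for a branch: (element, isotope, bond_order, canonical_label).

    canonical_label is None for an explicit hydrogen that the InChI left unlabelled.
    """
    element, isotope, bond_order, label = neighbour
    return (bond_order == 1,                # multiple bonds are visited first
            0 if label is None else label,  # unlabelled hydrogens rank below label 1
            isotope)                        # assumed tie-break: lighter isotope first


def atom_text(element, isotope):
    """Bracket an atom only when it carries an isotope."""
    return f"[{isotope}{element}]" if isotope else element


# Branches around the carbon of C([2H])([3H])Cl, given in arbitrary input order.
neighbours = [("Cl", 0, 1, 2), ("H", 3, 1, None), ("H", 2, 1, None)]
ordered = sorted(neighbours, key=branch_priority)
pieces = [atom_text(element, isotope) for element, isotope, _, _ in ordered]
smiles = "C" + "".join(f"({p})" for p in pieces[:-1]) + pieces[-1]
print(smiles)  # C([2H])([3H])Cl
```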

     A standard way to encode the SMILES
• The graph traversal gives us a canonical atom order
• However, despite this, many different SMILES strings
  may be written for the same molecule

The following SMILES strings for ethanol all have the same atom order:
             CCO, C-C-O, C1.C12.O2, C(C(O)), [CH3]CO


• For Universal SMILES, one particular form must be
  adopted
   – The standard form described by the Open SMILES specification
       Ref: Craig James et al, The Open SMILES specification, http://opensmiles.org
   – E.g. Don’t write single bonds explicitly, only use parentheses if
     there is a branch

 Encoding cis/trans stereochemistry symbols
• Question:
   – How do I know that the following SMILES string was not
     generated by Open Babel?
                                C\C=C\Cl
• There are two possible ways to write symbols for any
  double bond system
• For Universal SMILES, the first stereochemistry bond
  symbol should be a forward slash
   – i.e. C/C=C/Cl not C\C=C\Cl
   – Minimises backslashes (can cause problems at commandline)
   – Useful aid if reading SMILES: If you see a backslash, there must
     be a corresponding forward slash preceding it
• Show cis/trans symbols on all substituents
   – i.e. Cl/C=C(Br)/I not C/C=C(Br)I
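
The forward-slash-first rule can be sketched as a string operation. This assumes the SMILES contains a single double bond system, so that swapping every / and \ gives the equivalent alternative form; a molecule with several independent systems would need each one handled separately. forward_slash_first is an illustrative name.

```python
def forward_slash_first(smiles):
    """If the first cis/trans bond symbol is a backslash, swap all '/' and '\\'.

    Sketch for a SMILES string with a single double bond system.
    """
    first = next((ch for ch in smiles if ch in "/\\"), None)
    if first != "\\":
        return smiles  # no stereo bond symbols, or already starts with '/'
    return smiles.translate(str.maketrans("/\\", "\\/"))


print(forward_slash_first(r"C\C=C\Cl"))  # C/C=C/Cl
print(forward_slash_first(r"C/C=C\Cl"))  # unchanged: C/C=C\Cl
```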




Does it work?

       Datasets for testing implementation
• Universal SMILES was added to Open Babel v2.3.2
        $ obabel -:"c1(cc(ccc1)[N+](=O)[O-])/C=C/F" -osmi -xU
        c1cc(/C=C/F)cc(c1)[N+](=O)[O-]


• ChEMBL Release 13
   – 1.14 million compounds as 2D MOL
   – Highly curated, and normalised


• PubChem Substance subset
   – 1.04 million compounds as 2D or 3D MOL (those with SIDs from 0
     to 2 million)
   – As deposited from a variety of sources
   – Duplicates exist as well as errors
   – 1.1% were discarded as InChIs could not be generated for them

                              Shuffle Test
• Does the Universal SMILES procedure generate a
  canonical identifier?
   – A canonical identifier should be invariant to the input order of atoms
   – So…let’s shuffle the atoms and check whether the Universal
     SMILES changes

• For each structure, I generated
  10 “anti-canonical” SMILES
  strings using Open Babel
   – The “xC” SMILES output option


• For each of these, the
  Universal SMILES was
  generated
   – If all identical, the test is passed
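
The pass/fail logic of the test can be sketched independently of any toolkit. Here a toy "molecule" is just a tuple of component SMILES, and the two join functions stand in for a non-canonical and a canonical writer; the real test used 10 shuffled SMILES per structure from Open Babel, whereas this sketch simply enumerates every ordering of a small example.

```python
from itertools import permutations


def passes_shuffle_test(variants, canonicalize):
    """True if every input variant of one structure gives the same output string."""
    return len({canonicalize(v) for v in variants}) == 1


# Toy structure: three disconnected components in every possible input order.
components = ("[Na+]", "[Cl-]", "O")
variants = list(permutations(components))


def join_in_input_order(parts):
    """Stand-in for a writer whose output depends on the input order."""
    return ".".join(parts)


def join_sorted(parts):
    """Stand-in for a canonical writer: output is independent of the input order."""
    return ".".join(sorted(parts))


print(passes_shuffle_test(variants, join_in_input_order))  # False
print(passes_shuffle_test(variants, join_sorted))          # True
```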

                       Shuffle Test Results
• ChEMBL dataset
   – 2,425 canonicalization failures (0.21%)
   – 2,248 excluding failures for Open Babel’s own canonical SMILES
       • These failures are mainly due to kekulization problems

• Differences in the stereochemical model used (81% of the 2,248)
   – 722 failures due to disagreement on the number of tetrahedral
     stereocenters (typically a fault in Open Babel)
   – 1,105 failures for stereogenic double bonds
• Handling of delocalized charges
   – Where molecular graph symmetry is broken only by
     charge states in a delocalised system, the InChI will
     regard as equivalent atoms which appear as different
     charge states in the SMILES string.
   – Two different Universal SMILES for the example:
       • C[n+]1ccn(C)c1 and Cn1cc[n+](C)c1
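
As a quick check, the ChEMBL percentages on this slide can be reproduced from the counts. The dataset size is the rounded 1.14 million quoted for ChEMBL Release 13, so the failure rate is approximate; the variable names are illustrative.

```python
chembl_size = 1_140_000        # rounded dataset size quoted for ChEMBL Release 13
failures_all = 2_425           # all canonicalization failures
failures_excl_ob = 2_248       # excluding Open Babel's own canonical SMILES failures
stereo_failures = 722 + 1_105  # tetrahedral plus double-bond disagreements

failure_rate = 100 * failures_all / chembl_size
stereo_share = 100 * stereo_failures / failures_excl_ob

print(f"failure rate: {failure_rate:.2f}%")  # 0.21%
print(f"stereo share: {stereo_share:.0f}%")  # 81%
```

The 81% only comes out when the stereochemistry failures are divided by the 2,248 remaining failures, not by the full 2,425.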

                      Shuffle Test Results
• PubChem dataset
   – 2,410 canonicalization failures (0.23%)
   – 2,183 excluding failures for Open Babel’s own canonical SMILES
• Differences in the stereochemical model used (72%)

• 56 cases of non-canonicalization of isotopes
   – Bug in InChI auxiliary information (they are aware of this)

• Interesting failure case, SID 425526
   – InChI regards ring as aromatic, and then
     identifies two-fold graph symmetry
   – Open Babel does not treat ring as aromatic
       • Series of double and single bonds
   – Two different Universal SMILES generated

                            Duplicate Test
• Use the Universal SMILES to find duplicates
   – True duplicates
   – False duplicates
       • A shortcoming of Universal SMILES or its implementation
       • A normalization of distinct structures


• ChEMBL dataset
   – There should not be any duplicates
   – 63 sets of duplicates according to InChI
       • Errors in database which had already been corrected in development version

• PubChem dataset
   – 143,157 sets of duplicates


• Duplicates according to InChI removed from further
  consideration
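
The duplicate test amounts to grouping records by an identifier. The sketch below uses made-up placeholder records (the ids and identifier strings are not real data) and assumes that whole sets of InChI duplicates are dropped before the Universal SMILES comparison; duplicate_sets is a hypothetical helper.

```python
from collections import defaultdict


def duplicate_sets(records, key):
    """Group record ids by key(record) and return the groups with more than one member."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Placeholder records: identifier strings are dummies, not real InChIs or SMILES.
records = [
    {"id": "r1", "inchi": "inchi-A", "usmiles": "smi-1"},
    {"id": "r2", "inchi": "inchi-A", "usmiles": "smi-1"},  # duplicate by InChI
    {"id": "r3", "inchi": "inchi-B", "usmiles": "smi-2"},
    {"id": "r4", "inchi": "inchi-C", "usmiles": "smi-2"},  # same SMILES, distinct InChI
    {"id": "r5", "inchi": "inchi-D", "usmiles": "smi-3"},
]

# Step 1: duplicates according to InChI are removed from further consideration.
inchi_dups = duplicate_sets(records, key=lambda r: r["inchi"])
excluded = {rid for ids in inchi_dups for rid in ids}
remaining = [r for r in records if r["id"] not in excluded]

# Step 2: anything the Universal SMILES still groups together needs inspection.
usmiles_dups = duplicate_sets(remaining, key=lambda r: r["usmiles"])
print(inchi_dups)    # [['r1', 'r2']]
print(usmiles_dups)  # [['r3', 'r4']]
```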

                   Duplicate Test Results
• ChEMBL dataset
   – 29 duplicates found
   – The majority appear to be true duplicates which the InChI considers
     as distinct due to the specific coordinates in the Mol file




• The InChI regards the stereochemistry in (b) to be undefined




• Identical according to Universal SMILES but distinct InChIs
   – The InChIs differ in the double bond stereochemistry layer:
                      /b31-27+,32-28?   versus   /b31-27-,32-28+

                 Duplicate Test Results
• PubChem dataset
   – 47 duplicates found


• In 44 cases the InChI regarded as undefined the
  tetrahedral stereochemistry at a chiral center
   – The three non-H atoms were almost in the same plane as the
     center




                               SID 855468




Discussion and conclusions

                     Overview of results
• Universal SMILES can generate canonical identifiers…
   – for 99.79% of the ChEMBL database
   – for 99.77% of a subset of the PubChem Substance database
   – Disagreements between InChI and the underlying stereochemical
     model used by Open Babel, and the handling of delocalized
     charges


• Performance could be improved further
   – Improvements in stereochemistry perception in Open Babel, or
     somehow use the stereochemistry perception from the InChI
• Outstanding issues:
   – Failures due to delocalized charges
   – The Daylight aromaticity model is not well-described and so
     different Universal SMILES implementations will vary in what is
     treated as an aromatic system

                     Overview of results

• The InChI is quite sensitive to the specific geometry used
  at stereocenters
   – Some structures in databases may need to be redrawn


• These ideas could be applied to other chemical file
  formats
   – Canonical forms of other line notations
   – Canonicalization of atom order in Mol files

                What I didn’t talk about…

• Inchified SMILES
   – A way to include the InChI normalizations into the SMILES string,
     by roundtripping through the InChI
   – A SMILES string representation of the InChI string
   – Available as Open Babel SMILES output option “I”
   – For more info see the paper (J. Cheminf., 2012, 4, 22)

Universal SMILES: Finally, a canonical SMILES string?


   J. Cheminf., 2012, 4, 22               baoilleach@gmail.com
blueobelisk-smiles@lists.sf.net        http://baoilleach.blogspot.com

Acknowledgements
Craig James (eMolecules): For OpenSMILES and the SMILES writer in
Open Babel




Funding
Health Research Board: Career Development Fellowship

 
De novo design of molecular wires with optimal properties for solar energy co...
De novo design of molecular wires with optimal properties for solar energy co...De novo design of molecular wires with optimal properties for solar energy co...
De novo design of molecular wires with optimal properties for solar energy co...
 
Cinfony - Bring cheminformatics toolkits into tune
Cinfony - Bring cheminformatics toolkits into tuneCinfony - Bring cheminformatics toolkits into tune
Cinfony - Bring cheminformatics toolkits into tune
 
Density functional theory calculations on Ruthenium polypyridyl complexes inc...
Density functional theory calculations on Ruthenium polypyridyl complexes inc...Density functional theory calculations on Ruthenium polypyridyl complexes inc...
Density functional theory calculations on Ruthenium polypyridyl complexes inc...
 
Application of Density Functional Theory to Scanning Tunneling Microscopy
Application of Density Functional Theory to Scanning Tunneling MicroscopyApplication of Density Functional Theory to Scanning Tunneling Microscopy
Application of Density Functional Theory to Scanning Tunneling Microscopy
 
Towards Practical Molecular Devices
Towards Practical Molecular DevicesTowards Practical Molecular Devices
Towards Practical Molecular Devices
 
Why multiple scoring functions can improve docking performance - Testing hypo...
Why multiple scoring functions can improve docking performance - Testing hypo...Why multiple scoring functions can improve docking performance - Testing hypo...
Why multiple scoring functions can improve docking performance - Testing hypo...
 
Why multiple scoring functions can improve docking performance - Testing hypo...
Why multiple scoring functions can improve docking performance - Testing hypo...Why multiple scoring functions can improve docking performance - Testing hypo...
Why multiple scoring functions can improve docking performance - Testing hypo...
 
Improving enrichment rates
Improving enrichment ratesImproving enrichment rates
Improving enrichment rates
 
The Blue Obelisk community
The Blue Obelisk communityThe Blue Obelisk community
The Blue Obelisk community
 
Interoperability and the Blue Obelisk
Interoperability and the Blue ObeliskInteroperability and the Blue Obelisk
Interoperability and the Blue Obelisk
 
Goslar2010 poster
Goslar2010 posterGoslar2010 poster
Goslar2010 poster
 
Open Babel 2.3 Quick Reference
Open Babel 2.3 Quick ReferenceOpen Babel 2.3 Quick Reference
Open Babel 2.3 Quick Reference
 
Classification of Enzyme Reaction Mechanisms
Classification of Enzyme Reaction MechanismsClassification of Enzyme Reaction Mechanisms
Classification of Enzyme Reaction Mechanisms
 
Digging deep for GOLD: How buriedness may be used to discriminate between act...
Digging deep for GOLD: How buriedness may be used to discriminate between act...Digging deep for GOLD: How buriedness may be used to discriminate between act...
Digging deep for GOLD: How buriedness may be used to discriminate between act...
 

Dernier

Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterMateoGardella
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Shubhangi Sonawane
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 

Dernier (20)

Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 

Universal SMILES: Finally, a canonical SMILES string?

  • 1. Universal SMILES: Finally, a canonical SMILES string?
      Noel M. O’Boyle, Analytical and Biological Chemistry Research Facility, University College Cork, Ireland (current address: NextMove Software, Cambridge, UK)
      Apr 2013, 245th ACS National Meeting, New Orleans
      Open Babel
  • 3. How to create a SMILES string
      (1) Pick a starting atom
      (2) Traverse the molecular graph in a depth-first manner
      (3) Encode the atoms and bonds traversed as a text string
      • Let’s assume that step (3) is done in a standard manner
      • Variation in steps (1) and (2) leads to many different possible SMILES
      • Ethanol as CCO or OCC (among others)
  • 4. How to create a canonical SMILES string
      (1) Give each atom a canonical label (“canonicalize”)
      (2) Pick as starting atom the one with the smallest label¹
      (3) Traverse the molecular graph in a depth-first manner, following the atom with the smallest label at each branch point¹
      (4) Encode the atoms and bonds traversed as a text string
      • The same SMILES string will always be generated
        – The “canonical SMILES”
      [Figure: ethanol drawn in two atom orders, with canonical labels on the atoms]
      • Ethanol always¹ as CCO
      ¹ For example.
  • 5. Why is a canonical SMILES useful?
      • Check identity
        – Graph isomorphism is faster, but less convenient
      • Find/avoid duplicates
      • Find overlap of two databases
      • Check that a structure remains unchanged
        – E.g. after some transformation
      • Canonical SMILES retains the features of regular SMILES
        – Although slower to calculate
  • 6. Why are there different canonical SMILES?
      • There is no published canonical SMILES implementation for the general case
        – Neither Weininger, Weininger nor Weininger [1] described how to handle stereochemistry
      • Canonicalization is difficult
        – Not a simple algorithm, many corner cases
        – Trade secret
      • End result: each cheminformatics toolkit generates its own canonical SMILES
      [1] Weininger D, Weininger A, Weininger JL. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 1989, 29, 97.
  • 7. Why a “Universal” canonical SMILES?
      • All the benefits of a globally unique identifier (like the InChI)
        – Can link databases
        – Of benefit to the average chemist, as having different SMILES for the same molecule is confusing
        – Can immediately see if the Wikipedia SMILES is in agreement with the PubChem SMILES
      • Finally possible to compare SMILES strings from different toolkits
        – Identify bugs
        – Explore underlying chemical models (e.g. aromatic models)
        – Explore underlying stereochemistry perception
        – Lead to improvements in quality and standards
  • 8. Why base a canonical SMILES on the InChI?
      • Canonicalization is complicated
        – Devising and describing a general canonicalization procedure that others could implement exactly may not be possible
      • Better to build on existing work
        – Take advantage of the stellar work by the InChI team
        – The InChI has already solved the canonicalization problem for a broad section of chemistry
      • It’s ubiquitous
        – The InChI is available in almost all cheminformatics toolkits
      • Finally, all toolkits will be able to create the same canonical SMILES string
        – The “Universal SMILES” string!
  • 9. How to use the InChI to create a Universal SMILES string
  • 10. How to get canonical labels from the InChI
      • Use the Auxiliary Information, Luke
          $ obabel -:"ClCC(=O)Br" -oinchi -xa
          InChI=1S/C2H2BrClO/c3-2(5)1-4/h1H2
          AuxInfo=1/0/N:2,3,5,1,4/rA:5ClCCOBr/rB:s1;s2;d3;s3;/rC:;;;;;
      • The /N section gives the canonical labels
        – Canonical labels 1 through 5 correspond to input atoms 2, 3, 5, 1 and 4, respectively
        – E.g. canonical label 3 is applied to input atom 5, the bromine
      • For Universal SMILES, I used two non-standard options
        – /FixedH: Enable the correct application of canonical labels in cases involving molecular symmetry broken by protonation states
        – /RecMet: Do not disconnect metals, as the labels for ligands will not be canonical
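The mapping described on slide 10 can be read off the AuxInfo string with a few lines of Python. This is only a minimal sketch: the function name `canonical_labels` is hypothetical, and it assumes a single connected component and an AuxInfo string laid out like the one on the slide (it does not try to handle the extra layers that options such as /FixedH may add).

```python
def canonical_labels(auxinfo: str) -> dict[int, int]:
    """Return {input atom number: canonical label}, both 1-based.

    Assumption: one connected component, so the /N: layer is a plain
    comma-separated list of input atom numbers in canonical order.
    """
    for layer in auxinfo.split("/"):
        if layer.startswith("N:"):
            input_atoms = [int(tok) for tok in layer[2:].split(",")]
            # Position in the list (counting from 1) is the canonical label.
            return {atom: label for label, atom in enumerate(input_atoms, start=1)}
    raise ValueError("no /N: layer found in AuxInfo")


aux = "AuxInfo=1/0/N:2,3,5,1,4/rA:5ClCCOBr/rB:s1;s2;d3;s3;/rC:;;;;;"
labels = canonical_labels(aux)
print(labels[5])  # canonical label of input atom 5 (the bromine)
```

For the slide's example this gives label 3 for input atom 5, matching the statement that canonical label 3 is applied to the bromine.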
  • 11. Walk this way: Rules for graph traversal
      • Start the graph traversal at the atom with the lowest canonical label
        – For disconnected structures, visit each structure in order of its lowest canonical label
      • Visit atoms in a depth-first manner
        – At each branch point, multiple bonds are favoured over single or aromatic bonds, and lower canonical labels over higher
      [Figure: an acid chloride with canonical labels 1 and 2 on the carbons, 3 on Cl and 4 on O]
      • Universal SMILES for this acid chloride: CC(=O)Cl
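The traversal rules on slide 11 can be illustrated with a toy writer. This is a sketch under strong assumptions, not the Open Babel implementation: it handles only acyclic, connected molecules made of organic-subset atoms (no ring closures, charges, aromaticity or stereochemistry), and the function name and data layout are invented for the example. The labels for the acid chloride are taken as read from the slide's figure.

```python
BOND_SYMBOL = {1: "", 2: "=", 3: "#"}


def write_smiles(elements, bonds, labels):
    """Toy SMILES writer for acyclic, connected molecules only.

    elements: {atom id: element symbol}
    bonds:    {(atom id, atom id): bond order}
    labels:   {atom id: canonical label}
    """
    neighbours = {atom: [] for atom in elements}
    for (a, b), order in bonds.items():
        neighbours[a].append((b, order))
        neighbours[b].append((a, order))

    def visit(atom, parent):
        out = elements[atom]
        children = [(n, o) for n, o in neighbours[atom] if n != parent]
        # Branch rule: multiple bonds first, then lower canonical labels.
        children.sort(key=lambda c: (c[1] == 1, labels[c[0]]))
        for i, (child, order) in enumerate(children):
            piece = BOND_SYMBOL[order] + visit(child, atom)
            is_last = i == len(children) - 1
            # Every branch except the last one visited goes in parentheses.
            out += piece if is_last else "(" + piece + ")"
        return out

    start = min(elements, key=lambda atom: labels[atom])  # lowest label first
    return visit(start, None)


# Acid chloride, atoms deliberately given in a scrambled input order.
elements = {0: "Cl", 1: "O", 2: "C", 3: "C"}  # atom 2 is the carbonyl carbon
bonds = {(0, 2): 1, (1, 2): 2, (2, 3): 1}
labels = {3: 1, 2: 2, 0: 3, 1: 4}
print(write_smiles(elements, bonds, labels))  # CC(=O)Cl
```

Note that the oxygen (label 4) is visited before the chlorine (label 3) because the double bond takes priority over the lower label, which is why the result is CC(=O)Cl rather than CC(Cl)=O.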
  • 12. Corner case: Explicit hydrogens
      • Sometimes a SMILES string contains explicit hydrogens
        – Hydrogen isotopes, dihydrogen, hydrogen atoms, hydrogen ions
      • Sometimes the InChI labels hydrogens
        – Hydrogen atoms, bridging hydrogens
      • The problem:
        – What to do about explicit hydrogens unlabelled by the InChI?
      • A solution:
        – Consider these to have a low canonical label
        – That is, in the traversal visit these hydrogens prior to other singly-bonded branches
      C([2H])([3H])Cl rather than C(Cl)([3H])[2H]
  • 13. A standard way to encode the SMILES
      • The graph traversal gives us a canonical atom order
      • However, despite this, many different SMILES strings may be written for the same molecule
        – The following SMILES strings for ethanol all have the same atom order: CCO, C-C-O, C1.C12.O2, C(C(O)), [CH3]CO
      • For Universal SMILES, one particular form must be adopted
        – The standard form described by the Open SMILES specification
          Ref: Craig James et al, The Open SMILES specification, http://opensmiles.org
        – E.g. don’t write single bonds explicitly, only use parentheses if there is a branch
  • 14. Encoding cis/trans stereochemistry symbols
      • Question:
        – How do I know that the following SMILES string was not generated by Open Babel? C\C=C\Cl
      • There are two possible ways to write symbols for any double bond system
      • For Universal SMILES, the first stereochemistry bond symbol should be a forward slash
        – i.e. C/C=C/Cl not C\C=C\Cl
        – Minimises backslashes (can cause problems at the command line)
        – Useful aid if reading SMILES: if you see a backslash, there must be a corresponding forward slash preceding it
      • Show cis/trans symbols on all substituents
        – i.e. Cl/C=C(Br)/I not C/C=C(Br)I
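The "first symbol is a forward slash" rule on slide 14 can be sketched as a string operation. This is an illustration only, with a hypothetical function name: it assumes the SMILES contains a single double bond system, in which case swapping every `/` with `\` gives an equivalent string. A molecule with several independent double bond systems would need each system handled separately, which this sketch does not attempt.

```python
def normalise_slashes(smiles: str) -> str:
    """If the first cis/trans bond symbol is a backslash, flip all of them.

    Assumption: a single double bond system, so flipping every symbol
    leaves the encoded stereochemistry unchanged.
    """
    first = next((ch for ch in smiles if ch in "/\\"), None)
    if first != "\\":
        return smiles  # no stereo symbols, or already starts with '/'
    swap = str.maketrans({"/": "\\", "\\": "/"})
    return smiles.translate(swap)


print(normalise_slashes("C\\C=C\\Cl"))  # C/C=C/Cl
```

A string that already starts with a forward slash, such as `F/C=C\F`, is returned unchanged.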
  • 16. Datasets for testing implementation
      • Universal SMILES was added to Open Babel v2.3.2
          $ obabel -:"c1(cc(ccc1)[N+](=O)[O-])/C=C/F" -osmi -xU
          c1cc(/C=C/F)cc(c1)[N+](=O)[O-]
      • ChEMBL Release 13
        – 1.14 million compounds as 2D MOL
        – Highly curated, and normalised
      • PubChem Substance subset
        – 1.04 million compounds as 2D or 3D MOL (those with SIDs from 0 to 2 million)
        – As deposited from a variety of sources
        – Duplicates exist as well as errors
        – 1.1% were discarded as InChIs could not be generated for them
  • 17. Shuffle Test
      • Does the Universal SMILES procedure generate a canonical identifier?
        – A canonical identifier should be invariant to the input order of atoms
        – So… let’s shuffle the atoms and check whether the Universal SMILES changes
      • For each structure, I generated 10 “anti-canonical” SMILES strings using Open Babel
        – The “xC” SMILES output option
      • For each of these, the Universal SMILES was generated
        – If all identical, the test is passed
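The idea behind the shuffle test can be sketched as a small harness. Note the swapped-in technique: the talk shuffled atoms by writing "anti-canonical" SMILES with Open Babel, whereas this sketch permutes atom indices directly on a toy (elements, bonds) representation. The function name, data layout and the two toy identifier functions are all hypothetical; a real test would plug in an actual Universal SMILES generator.

```python
import random


def shuffle_test(elements, bonds, identifier, n_shuffles=10, seed=0):
    """Return True if `identifier` gives the same result for every shuffle.

    elements:   list of element symbols, indexed by atom number
    bonds:      list of (atom a, atom b, bond order) tuples
    identifier: function (elements, bonds) -> string
    """
    rng = random.Random(seed)
    reference = identifier(elements, bonds)
    n = len(elements)
    for _ in range(n_shuffles):
        perm = list(range(n))
        rng.shuffle(perm)  # perm[old index] = new index
        new_elements = [None] * n
        for old, new in enumerate(perm):
            new_elements[new] = elements[old]
        new_bonds = [(perm[a], perm[b], order) for a, b, order in bonds]
        rng.shuffle(new_bonds)  # the bond order in the input is shuffled too
        if identifier(new_elements, new_bonds) != reference:
            return False
    return True


atoms = ["C", "N", "O", "S", "Cl"]
bond_list = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1)]

# Toy identifiers: one ignores input order, the other depends on it.
order_free = lambda e, b: "".join(sorted(e))
order_bound = lambda e, b: "".join(e)

print(shuffle_test(atoms, bond_list, order_free))
print(shuffle_test(atoms, bond_list, order_bound))
```

The order-free toy identifier passes by construction; the order-bound one fails as soon as any shuffle changes the atom sequence, which is the kind of failure the shuffle test is designed to expose.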
  • 18. Shuffle Test Results
      • ChEMBL dataset
        – 2,425 canonicalization failures (0.21%)
        – 2,248 excluding failures for Open Babel’s own canonical SMILES
          • These failures are mainly due to kekulization problems
      • Differences in the stereochemical model used (81%)
        – 722 failures due to disagreement on the number of tetrahedral stereocenters (fault with OB typically)
        – 1,105 failures for stereogenic double bonds
      • Handling of delocalized charges
        – Where molecular graph symmetry is broken only by charge states in a delocalized system, the InChI will regard as equivalent atoms which appear as different charge states in the SMILES string
        – Two different Universal SMILES for the example: C[n+]1ccn(C)c1 and Cn1cc[n+](C)c1
  • 19. Shuffle Test Results
      • PubChem dataset
        – 2,410 canonicalization failures (0.23%)
        – 2,183 excluding failures for Open Babel’s own canonical SMILES
      • Differences in the stereochemical model used (72%)
      • 56 cases of non-canonicalization of isotopes
        – Bug in InChI auxiliary information (they are aware of this)
      • Interesting failure case, SID 425526
        – InChI regards ring as aromatic, and then identifies two-fold graph symmetry
        – Open Babel does not treat ring as aromatic
          • Series of double and single bonds
        – Two different Universal SMILES generated
  • 20. Duplicate Test
      • Use the Universal SMILES to find duplicates
        – True duplicates
        – False duplicates
          • A shortcoming of Universal SMILES or its implementation
          • A normalization of distinct structures
      • ChEMBL dataset
        – There should not be any duplicates
        – 63 sets of duplicates according to InChI
          • Errors in database which had already been corrected in development version
      • PubChem dataset
        – 143,157 sets of duplicates
      • Duplicates according to InChI removed from further consideration
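Finding duplicates with a canonical identifier amounts to grouping records by that identifier. A minimal sketch, assuming the identifiers have already been computed; the function name and the record ids below are made up for illustration and are not data from the talk.

```python
from collections import defaultdict


def find_duplicates(records):
    """Group record ids by identifier and keep groups with more than one id.

    records: iterable of (record id, identifier string) pairs
    """
    groups = defaultdict(list)
    for record_id, identifier in records:
        groups[identifier].append(record_id)
    return {ident: ids for ident, ids in groups.items() if len(ids) > 1}


# Made-up example records: (id, canonical SMILES)
records = [("rec1", "CCO"), ("rec2", "CC(=O)Cl"), ("rec3", "CCO")]
print(find_duplicates(records))  # {'CCO': ['rec1', 'rec3']}
```

Running the same grouping once keyed on the InChI and once keyed on the Universal SMILES is one way to separate the sets both identifiers agree on from the cases where only one of them reports a duplicate.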
  • 21. Duplicate Test Results
      • ChEMBL dataset
        – 29 duplicates found
        – The majority appear to be true duplicates which the InChI considers as distinct due to the specific coordinates in the Mol file
          • The InChI regards the stereochemistry in (b) to be undefined
  • 22. Identical according to Universal SMILES but distinct InChIs
      – The InChIs differ in the double bond stereochemistry layer: /b31-27+,32-28? versus /b31-27-,32-28+
  • 23. Duplicate Test Results
      • PubChem dataset
        – 47 duplicates found
      • In 44 cases the InChI regarded as undefined the tetrahedral stereochemistry at a chiral center
        – The three non-H atoms were almost in the same plane as the center (e.g. SID 855468)
  • 25. Overview of results
      • Universal SMILES can generate canonical identifiers…
        – for 99.79% of the ChEMBL database
        – for 99.77% of a subset of the PubChem Substance database
        – The remaining failures come from disagreements between the InChI and the underlying stereochemical model used by Open Babel, and from the handling of delocalized charges
      • Performance could be improved further
        – Improvements in stereochemistry perception in Open Babel, or somehow use the stereochemistry perception from the InChI
      • Outstanding issues:
        – Failures due to delocalized charges
        – The Daylight aromaticity model is not well-described and so different Universal SMILES implementations will vary in what is treated as an aromatic system
  • 26. Overview of results
      • The InChI is quite sensitive to the specific geometry used at stereocenters
        – Some structures in databases may need to be redrawn
      • These ideas could be applied to other chemical file formats
        – Canonical forms of other line notations
        – Canonicalization of atom order in Mol files
  • 27. What I didn’t talk about…
      • Inchified SMILES
        – A way to include the InChI normalizations into the SMILES string, by roundtripping through the InChI
        – A SMILES string representation of the InChI string
        – Available as Open Babel SMILES output option “I”
        – For more info see the paper (J. Cheminf., 2012, 4, 22)
  • 28. Universal SMILES: Finally, a canonical SMILES string?
      J. Cheminf., 2012, 4, 22
      baoilleach@gmail.com
      blueobelisk-smiles@lists.sf.net
      http://baoilleach.blogspot.com
      Acknowledgements
        – Craig James (eMolecules): for OpenSMILES and the SMILES writer in Open Babel
      Funding
        – Health Research Board: Career Development Fellowship