Digital chemical
representations
Roger Sayle
NextMove Software, Cambridge, UK
Status and Future of IUPAC InChI, Bethesda, Maryland, USA, Wednesday 16th August 2017
Too much history!
• William J. Wiswesser, “107 Years of Line-Formula Notations
(1861-1968)”, Journal of Chemical Documentation, Vol. 8, No.
3, pp. 146-150, 1968.
• Bonnie Lawlor, “Chemical Structure Association (CSA) Trust:
Advancing Scientific Discovery for Fifty Years”, Chemical
Information (CINF) Bulletin, Vol. 67, No. 4, Winter 2015.
• Andrew Dalke, “Weininger’s Realization”, blog 2016/12/02
– http://www.dalkescientific.com/writings/diary/archive/2016/12/02/Weiningers_realization.html
• Committee on Modern Methods of Handling Chemical
Information, National Academy of Sciences & National
Research Council, “Survey of Chemical Notation Systems”,
Publication 1150, Washington DC, 1964.
A little history (Bonnie Lawlor)
The emergence of punch-card technologies during the middle of the last century
renewed interest in these notations, and in 1949 the International Union of Pure and
Applied Chemistry (IUPAC) invited the submission of simple notations that would be
suitable for international adoption. They ultimately chose a notation submitted by G.
Malcolm Dyson, but it was one of the other seven notations that were submitted that
caught the attention of those working in the field [1]
[1] It should be noted that the selection of the Dyson notation was criticized, and a
petition was signed by about 1,000 chemists, including several who had submitted
notations for consideration, stating that the Wiswesser Notation had not been given
adequate consideration. The appeal was taken to the American Chemical Society and
the National Academy of Sciences - National Research Council who requested that the
National Science Foundation do a study, the results of which showed that more testing
of both notations should be done before any decision was made. This was not done
and the Dyson Notation was selected. A cloud hung over the decision because Dyson
was the chair of the IUPAC Commission that called for the submission of notations.
1964: Dyson/IUPAC vs. Wiswesser
B62CQ3
C5C23Q3
B6CX1N3
A63513b7C38EQ13
L66J B1Q
QZ2&2&2
ZR CVQ
L E5 B666 OV AHTTT&J A E
Why use InChI?
• “InChI is a unique representation/identifier for
defined chemical structures. Probably
marginally better than previous ones”
ISO 11238 §B.2.2 InChI in XML Example
<STRUCTURAL_REPRESENTATION_TYPE>INCHI</STRUCTURAL_REPRESENTATION_TYPE>
<STRUCTURAL_REPRESENTATION>1S/C2H5NO2.AL.CLH.2H2O.ZR/C3-1-
2(4)5;;;;;/H1,3H2,(H,4,5);;1H;2*1H2;/Q;+3;;;;+4/P-
2</STRUCTURAL_REPRESENTATION>
• Missing InChI=
• Standard and Non-Standard InChI?
• Converted to upper case
• Indentation
• Spurious Spaces
• Line Breaks
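The transport defects annotated on this slide can be detected mechanically. The following is a minimal sketch, assuming only that a well-formed InChI starts with `InChI=`, contains no whitespace, and introduces its layers with lower-case letters (`/c`, `/h`, `/q`, `/p`). The function name `inchi_transport_problems` is hypothetical, and this is not a validator for the InChI grammar.

```python
import re

def inchi_transport_problems(text):
    """Return a list of the transport defects found in `text`."""
    problems = []
    if re.search(r"\s", text):
        problems.append("whitespace or line breaks inside the string")
    compact = re.sub(r"\s+", "", text)
    has_prefix = compact.startswith("InChI=")
    if not has_prefix:
        problems.append("missing 'InChI=' prefix")
    body = compact[len("InChI="):] if has_prefix else compact
    # Skip the version and formula; later layers should start with a lower-case letter.
    layers = body.split("/")[2:]
    if any(layer[:1].isupper() for layer in layers):
        problems.append("layer prefixes converted to upper case")
    return problems

# The string from the XML example, with its line breaks preserved.
example = (
    "1S/C2H5NO2.AL.CLH.2H2O.ZR/C3-1-\n"
    "2(4)5;;;;;/H1,3H2,(H,4,5);;1H;2*1H2;/Q;+3;;;;+4/P-\n"
    "2"
)
for problem in inchi_transport_problems(example):
    print(problem)
```

Upper-casing is the one defect that cannot be repaired safely afterwards, because element symbols such as `Cl` and `Zr` lose their case along with the layer prefixes.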
§B.2.4 V2000 Mol File in XML Example
<STRUCTURAL_REPRESENTATION_TYPE>MOL</STRUCTURAL_REPRESENTATION_TYPE>
<STRUCTURAL_REPRESENTATION>30 29 0 0 0 0 0 0 0 0999 V2000 9.9563 -7.3055 0.0000 Y
1 1 0 0 0 0 0 0 0 0 0 0 15.0355 -4.8847 0.0000 * 0 0 0 0 0 0 0 0 0 0 0 0 13.3609 -
8.0134 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 13.8867 -9.9869 0.0000 O 0 5 0 0 0 0 0 0 0 0 0
0 6.4178 -6.8678 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 5.8872 -4.8955 0.0000 O 0 5 0 0 0 0
0 0 0 0 0 0 6.7218 -5.7285 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 13.0541 -9.1519 0.0000 C
0 0 0 0 0 0 0 0 0 0 0 0 13.3408 -6.8634 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 13.8599 -
4.8881 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 13.0301 -5.7260 0.0000 C 0 0 0 0 0 0 0 0 0 0 0
0 5.9099 -9.9441 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0 6.4492 -7.9743 0.0000 O 0 0 0 0 0 0
0 0 0 0 0 0 6.7482 -9.1149 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 7.8605 -5.4221 0.0000 C 0
0 0 0 0 0 0 0 0 0 0 0 11.8897 -5.4263 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 11.9147 -9.4555
0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 7.8855 -9.4263 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.6897 -8.0305 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 7.6897 -6.8513 0.0000 C 0 0 0 0 0 0 0
0 0 0 0 0 8.7018 -6.2618 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 9.2908 -5.2506 0.0000 C 0 0
0 0 0 0 0 0 0 0 0 0 10.4700 -5.2524 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 11.0577 -6.2664
0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 12.0761 -6.8427 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
12.0891 -8.0218 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 8.7257 -8.5952 0.0000 N 0 0 0 0 0 0
0 0 0 0 0 0 11.0839 -8.6223 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 10.4848 -9.6275 0.0000
C 0 0 0 0 0 0 0 0 0 0 0 0 9.3057 -9.6139 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 10 2 1 0 0 0 0
8 3 2 0 0 0 0 25 24 1 0 0 0 0 8 4 1 0 0 0 0 27 18 1 0 0 0 0 7 5 2 0 0 0 0 26 28 1 0 0 0 0
7 6 1 0 0 0 0 19 27 1 0 0 0 0 15 7 1 0 0 0 0 20 21 1 0 0 0 0 17 8 1 0 0 0 0 30 27 1 0 0 0
0 11 9 2 0 0 0 0 30 29 1 0 0 0 0 11 10 1 0 0 0 0 20 19 1 0 0 0 0 16 11 1 0 0 0 0 22 21 1
0 0 0 0 14 12 1 0 0 0 0 23 24 1 0 0 0 0 14 13 2 0 0 0 0 18 14 1 0 0 0 0 26 25 1 0 0 0 0
21 15 1 0 0 0 0 29 28 1 0 0 0 0 24 16 1 0 0 0 0 23 22 1 0 0 0 0 28 17 1 0 0 0 0 M CHG 4
1 3 4 -1 6 -1 12 -1 M ISO 1 1 90 M END </STRUCTURAL_REPRESENTATION>
Where to begin?
Cheminformatics’ cutting edge
1. Mesomers and Valence Representations
2. Protonation States
3. Tautomers (prototropic, C-type, ring-chain, ring-ring)
4. Stereochemistry, Configuration and Conformation
5. Biomolecules (Peptides, Nucleic Acids, Sugars and Lipids)
6. Reactions
7. Inorganics, organometallics and intermetallic alloys.
8. Polymers
9. Mixtures (Salts, Solvents, Alloys, Formulations, etc.)
10. Patterns and Transformations.
11. Part, Position, Count and Class Variation (Markush)
12. Physical States and Forms.
13. Radicals, Excited and Metastable nuclear states.
Tautomers
• Tautomers are molecular isomers that easily
interconvert by migration of hydrogen atoms [1].
[1] R. Sayle, “So you think you understand tautomerism?”, JCAMD, 24(6-7):485-496, June 2010.
InChI=1S/C16H12N2O/c19-16-11-10-15(13-8-4-5-9-14(13)16)18-17-12-6-2-1-3-7-12/h1-11,19H
InChI=1S/C16H12N2O/c19-16-11-10-15(13-8-4-5-9-14(13)16)18-17-12-6-2-1-3-7-12/h1-11,17H
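The two strings on this slide share a formula and connectivity and differ only in where the hydrogen is recorded (atom 19, the oxygen, versus atom 17, a nitrogen). A small sketch that splits each string on `/` and reports the layers that differ makes this visible. The helper name `differing_layers` is hypothetical, and the formula is written here as C16H12N2O.

```python
def differing_layers(inchi_a, inchi_b):
    """Return (layer_a, layer_b) pairs for the '/'-separated layers that differ."""
    layers_a = inchi_a.split("/")
    layers_b = inchi_b.split("/")
    if len(layers_a) != len(layers_b):
        raise ValueError("strings have different numbers of layers")
    return [(a, b) for a, b in zip(layers_a, layers_b) if a != b]

skeleton = "c19-16-11-10-15(13-8-4-5-9-14(13)16)18-17-12-6-2-1-3-7-12"
oh_form = "InChI=1S/C16H12N2O/" + skeleton + "/h1-11,19H"  # hydrogen on O19
nh_form = "InChI=1S/C16H12N2O/" + skeleton + "/h1-11,17H"  # hydrogen on N17

print(differing_layers(oh_form, nh_form))
```

Only the `/h` layer is reported, which is why the two tautomers receive different identifiers.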
InChI issues (Pwn2Own at InChI Con)
• Some recent limitations with InChI stereochemistry
Polymers
• The InChI Trust has added polymer support to InChI.
• Or has it?
InChI=1B/C2H4O/c1-2-3-1/h1-2H2/z101-1-3(1,2,1,3,2,3)
InChIKey=IAYPIBMASNFSPL-GCGQHNKHBA-N
InChI=1B/C4H8O2/c1-2-6-4-3-5-1/h1-4H2/z101-1-6(1,2,1,5,2,6,3,4,3,5,4,6)
InChIKey=RYHBNJHYFVUHQT-UEXHMUNTBA-N
Experimental Beta Option -Polymers
InChI polymers
Capping Group Frame Shift
InChI=1B/C4H8N2O3/c5-1-3(7)6-2-4(8)9/h1-2,5H2,(H,6,7)(H,8,9)/z101-1,3,6-7(2-6,5-1)
InChIKey=YMAWOPBAYDPSLA-NPFRMHEKBA-N
InChI=1B/C4H8N2O3/c5-1-3(7)6-2-4(8)9/h1-2,5H2,(H,6,7)(H,8,9)/z101-2-3,6-7(1-3,4-2)
InChIKey=YMAWOPBAYDPSLA-LCMLVUGIBA-N
InChI=1B/C4H8N2O3/c5-1-3(7)6-2-4(8)9/h1-2,5H2,(H,6,7)(H,8,9)/z101-2,4,6,8(3-6,9-4)
InChIKey=YMAWOPBAYDPSLA-YYSJEQPSBA-N
Neutral component duplication
• Duplicated components lead to different InChI
– Water (O)
• InChI=1S/H2O/h1H2
• XLYOFNOQVPJJNP-UHFFFAOYSA-N
– Wet water (O.O)
• InChI=1S/2H2O/h2*1H2
• JEGUKCSWCFPDGT-UHFFFAOYSA-N
– Dilute water (O.O.O)
• InChI=1S/3H2O/h3*1H2
• JLFVIEQMRKMAIT-UHFFFAOYSA-N
• Goodman’s Hypothesis: How many InChI-Keys?
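The duplication shows up as a leading multiplier in the formula layer (`2H2O`, `3H2O`). As a sketch of one possible normalisation, the code below reads those multipliers and divides them by their greatest common divisor. It touches only the formula layer, the function names are hypothetical, and it is not something the InChI software itself does.

```python
import re
from functools import reduce
from math import gcd

def formula_components(inchi):
    """Return [(count, formula), ...] from the formula layer of an InChI."""
    formula_layer = inchi.split("/")[1]
    components = []
    for part in formula_layer.split("."):
        # A Hill formula starts with an element symbol, so leading digits are a multiplier.
        match = re.fullmatch(r"(\d*)(.+)", part)
        count = int(match.group(1)) if match.group(1) else 1
        components.append((count, match.group(2)))
    return components

def reduced_components(inchi):
    """Divide every component count by the greatest common divisor of all counts."""
    components = formula_components(inchi)
    divisor = reduce(gcd, (count for count, _ in components))
    return [(count // divisor, formula) for count, formula in components]

for inchi in ("InChI=1S/H2O/h1H2", "InChI=1S/2H2O/h2*1H2", "InChI=1S/3H2O/h3*1H2"):
    print(inchi, "->", reduced_components(inchi))
```

All three water strings reduce to a single `H2O` component under this rule, although their InChIKeys differ.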
Metallic oxides (and friends)
• One problem area for chemical representation is
compounds that have no discrete chemical structure
but are defined by the ratios of their elements.
• One of John Dalton’s case studies (in his 1808 “A New
System of Chemical Philosophy”) was on tin oxides,
SnO and SnO2, often represented as O=[Sn],
O=[Sn]=O
Representations of ratios
• And these are just the neutral binary metal oxides; there are
even more permutations for ions (permanganates,
perchlorates), halides (aluminium chloride) and so on.
• Fortunately, a defining feature of a substance is that it has
zero net charge.
Dalton SMILES and Dalton InChI
– A molecular representation that correctly captures the
ratio of elements, but not necessarily connectivity.
• O=[Si]=O Silicon Dioxide (c.f. Wikipedia history)
• O=P(=O)OP(=O)=O Phosphorus pentoxide
• [C] Diamond, Graphene, Fullerenes
• [C]=[C] Graphene, Fullerene
– Extension to mixtures, where each component is listed,
but not necessarily the relative amounts of each.
• Cl.O Hydrochloric acid
• Cu1OS(=O)(=O)O1.O Copper(II) sulfate hydrate
• [Fe].[Cl] Iron chloride
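The idea of capturing only the ratio of elements can be illustrated by reducing a molecular formula to its smallest whole-number ratio, so that P4O10 and the P2O5 drawn above compare equal. This is a minimal sketch under stated assumptions: the helper name `element_ratio` is hypothetical, and the parser handles only flat formulas without parentheses, charges or hydrates.

```python
import re
from functools import reduce
from math import gcd

def element_ratio(formula):
    """Return {element: count} reduced to the smallest whole-number ratio."""
    counts = {}
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(digits) if digits else 1)
    divisor = reduce(gcd, counts.values())
    return {symbol: n // divisor for symbol, n in counts.items()}

for formula in ("P4O10", "P2O5", "SiO2", "SnO", "SnO2"):
    print(formula, element_ratio(formula))
```

Comparing the reduced dictionaries gives an equality test on composition that ignores connectivity, which is the sense in which the representation is "correct" for these materials.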
Example CSD names from CCDC
• catena(Tetra-aqua-tetrakis(mu!2$-formato-O,O')-bis(formato-O)-di-barium-copper)
• (mu!2$-2,5-bis((Phenylimino)methyl)benzene-1,4-diyl-C,C',N,N')-bis(eta$5!-pentamethylcyclopentadienyl)-dichloro-di-iridium
Identifiers vs. representations
• Compounds are composed of atoms in defined
whole-number ratios, where all atoms of an element
are identical.
• It is this statement that allows us to claim that two
compounds (or crystals) are identical, and can be
captured by a canonical form or universal identifier.
• Without it, substances or mixtures of arbitrary
composition are each unique, and one can only talk
of similarity and equivalence, not of equality.
Metal alloys
• AdmiraltyBrass
– Cu 69 %
– Zn 30 %
– Sn 1 %
• RollsRoyceTurbineAlloy1
– Ni 29.2-37 %mass
– Co 29.2-37 %mass
– Cr 10-16 %mass
– Al 4-6 %mass
– Zr 0.04-0.07 %mass
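An alloy specified by ranges cannot be reduced to a single canonical string. What can be tested is whether a given sample falls inside the specification. The sketch below assumes a simple data model (a hypothetical `Range` class and `out_of_spec` helper); the ranges are the ones listed on the slide and the two samples are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    low: float
    high: float

    def contains(self, value):
        return self.low <= value <= self.high

# Mass-percent ranges copied from the slide; any remaining elements are not listed there.
TURBINE_ALLOY_SPEC = {
    "Ni": Range(29.2, 37.0),
    "Co": Range(29.2, 37.0),
    "Cr": Range(10.0, 16.0),
    "Al": Range(4.0, 6.0),
    "Zr": Range(0.04, 0.07),
}

def out_of_spec(sample, spec):
    """Return the elements whose mass percent falls outside the specified range."""
    return [el for el, rng in spec.items() if not rng.contains(sample.get(el, 0.0))]

# Two hypothetical samples with different compositions.
sample_a = {"Ni": 30.0, "Co": 36.0, "Cr": 12.0, "Al": 5.0, "Zr": 0.05}
sample_b = {"Ni": 36.0, "Co": 30.0, "Cr": 15.0, "Al": 4.5, "Zr": 0.06}

print(out_of_spec(sample_a, TURBINE_ALLOY_SPEC))
print(out_of_spec(sample_b, TURBINE_ALLOY_SPEC))
print(sample_a == sample_b)
```

Both samples satisfy the specification while being unequal to each other, so membership in the specification behaves as an equivalence or similarity test rather than an identity.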
Sea water composition
• SeaWater
– Water 1 liter
– Salts 41.953 g
• NaCl 58.490 %
• MgCl2.6H2O 26.460 %
• Na2SO4 9.750 %
• CaCl2 2.765 %
• KCl 1.645 %
• NaHCO3 0.477 %
• KBr 0.238 %
• H3BO3 0.071 %
• SrCl2.6H2O 0.095 %
• NaF 0.007 %
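This recipe is hierarchical: an absolute amount of salts per litre, then a percentage breakdown of those salts. A short sketch of the arithmetic converts the percentages into grams per litre and checks that they sum to roughly 100 %. The variable and function names are hypothetical, and the percentages are assumed to be by mass.

```python
# Figures copied from the slide: 41.953 g of salts per litre of water.
TOTAL_SALTS_G = 41.953
SALT_PERCENT = {
    "NaCl": 58.490, "MgCl2.6H2O": 26.460, "Na2SO4": 9.750, "CaCl2": 2.765,
    "KCl": 1.645, "NaHCO3": 0.477, "KBr": 0.238, "H3BO3": 0.071,
    "SrCl2.6H2O": 0.095, "NaF": 0.007,
}

def grams_per_litre(total_g, percent_by_mass):
    """Scale each percentage by the total mass of salts."""
    return {name: total_g * pct / 100.0 for name, pct in percent_by_mass.items()}

grams = grams_per_litre(TOTAL_SALTS_G, SALT_PERCENT)
print(f"percentages sum to {sum(SALT_PERCENT.values()):.3f}")
for name, mass in grams.items():
    print(f"{name:12s} {mass:8.3f} g")
```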
Atmospheric composition
• Air
– Nitrogen 78.084 %v
– Oxygen 20.946 %v
– Argon 0.9340 %v
– Carbon dioxide 0.04 %v
– Neon 0.001818 %v
– Helium 0.000524 %v
– Methane 0.00018 %v
– Krypton 0.000114 %v
– Hydrogen 0.000055 %v
• Martian Atmosphere
– Carbon dioxide 95.97 %v
– Argon 1.93 %v
– Nitrogen 1.89 %v
– Oxygen 0.146 %v
– Carbon monoxide 0.0557 %v
IBM’s bad similarity
• Vidal05 and Grant06 described a fast chemical
similarity measure based upon SMILES strings.
• Typically, similar molecules have similar SMILES.
• In US20080002810, IBM claim a patent for “System
and Method for Identifying Similar Molecules” that
also uses n-gram similarity but on InChI strings.
• Unfortunately, similar molecules don’t have similar
(similar runs of characters in) InChI strings!
• Benchmarking shows the IBM method to be one of
the worst chemical similarity measures to date.
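The effect can be reproduced with a few lines of code. The sketch below computes a Tanimoto coefficient over character n-grams and applies it to one pair of similar molecules (toluene and ethylbenzene) written as SMILES and as standard InChI. It is a simplified stand-in, assuming plain 3-grams on the raw strings; it is neither the published SMILES-based procedure nor the patented method, and the example molecules are not from the slides.

```python
def ngrams(text, n=3):
    """Return the set of overlapping character n-grams in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_tanimoto(a, b, n=3):
    """Tanimoto (Jaccard) coefficient between the n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0

# Toluene vs. ethylbenzene, as SMILES and as standard InChI.
smiles_pair = ("Cc1ccccc1", "CCc1ccccc1")
inchi_pair = (
    "InChI=1S/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3",
    "InChI=1S/C8H10/c1-2-8-6-4-3-5-7-8/h3-7H,2H2,1H3",
)
print("SMILES:", round(ngram_tanimoto(*smiles_pair), 2))
print("InChI: ", round(ngram_tanimoto(*inchi_pair), 2))
```

Adding one carbon renumbers the atoms in the InChI connection layer, so few runs of characters survive, whereas the SMILES strings keep the shared ring substring intact.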
Conclusions
• It’s fantastic to be here and be part of the ongoing
process (which has a long legacy).
• Even when flawed, chemical line notations
help frame our understanding of chemistry.
• Extending discrete compounds to continuous
compounds is the current state-of-the-art.
• But for mixtures and beyond, the future is in
both “similarity” and canonical forms.
Acknowledgements
• The team at NextMove Software
– John Mayfield
– Noel O’Boyle
• And the Cheminformatics Community (including)
– Daniel Lowe
– Leah McEwen
– Philip Skinner
– Evan Bolton
– Greg Landrum
– Ian Bruno
– Jonathan Goodman
– Andrew Dalke
• And many thanks for your time!
Definitions
• A substance has constant chemical composition.
• A mixture is two or more different substances.
• A chemical compound has two or more atoms, with
a fixed proportion/ratio of its elements.
• Non-stoichiometric compounds = mixtures.
• Intermetallic alloys = compounds.
InChI vs. IUPAC
• Physically impossible charge
– [C+7] InChI=1S/C/q+7
• Physically impossible isotope
– [5C] InChI=1S/C/i1-7
• Mononuclidic elements
– [23Na][19F] InChI=1S/FH.Na/h1H;/q;+1/p-1/i2*1+0
– [Na]F InChI=1S/FH.Na/h1H;/q;+1/p-1
• Formula Unit
– [Na]Cl InChI=1S/ClH.Na/h1H;/q;+1/p-1
– [Na]Cl.[Na]Cl InChI=1S/2ClH.2Na/h2*1H;;/q;;2*+1/p-2
– [Na]1Cl[Na]Cl1 InChI=1S/2ClH.2Na/h2*1H;;
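The isotope examples follow from how the `/i` layer records a mass: as a signed shift from a per-element reference mass rather than as the mass number itself. A minimal sketch, assuming reference masses of 12, 19 and 23 for C, F and Na; the table and function names are hypothetical, and the InChI software keeps its own per-element table.

```python
# Assumed reference masses for the elements used on the slide.
REFERENCE_MASS = {"C": 12, "F": 19, "Na": 23}

def isotope_shift(symbol, mass_number):
    """Mass number minus the reference mass, as written in the InChI /i layer."""
    return mass_number - REFERENCE_MASS[symbol]

def format_shift(shift):
    """Render the shift with an explicit sign, e.g. -7 or +0."""
    return f"{shift:+d}"

for symbol, mass_number in (("C", 5), ("Na", 23), ("F", 19)):
    print(symbol, mass_number, format_shift(isotope_shift(symbol, mass_number)))
```

Nothing in this arithmetic rejects a mass number that no real nuclide has, and an explicitly labelled isotope of a mononuclidic element still produces a `+0` entry, so the labelled and unlabelled forms end up with different strings.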
InChI issues
• Some important molecules are indistinguishable
CCC=[S+]/[O-] CCC=S=O CCC=[S+][O-]
InChI=1S/C3H6OS/c1-2-3-5-4/h3H,2H2,1H3
InChI issues
• Some important molecules are indistinguishable
[NH3][Pt]([NH3])(Cl)Cl [NH3][Pt]([NH3])(Cl)Cl
[NH3+][Pt-2]([NH3+])(Cl)Cl [NH3+][Pt-2]([NH3+])(Cl)Cl
[NH3][Pt@SP1]([NH3])(Cl)Cl [NH3][Pt@SP2]([NH3])(Cl)Cl
InChI=1S/2ClH.2H3N.Pt/h2*1H;2*1H3;/q;;;;+2/p-2
Mol file standardization
• RDKit’s rdkit/Docs/Book/data/actives_5ht3.sdf
– Contains 180 connection tables
• RDKit outputs 180 molecules, with no warnings.
• OEChem outputs 38 molecules, with 142 warnings.
• ChemAxon outputs 10 molecules, with 1 warning.
• OpenBabel outputs 180 molecules, with 142 warnings.
• InChI outputs 180 molecules, with 23 warnings.
– The counts line of an offending record is “ 21 24999 V2000”.
– Hence, aaa is “ 21”, bbb is “ 24”, lll is “999”, ccc (chiral flag)
is “000” (i.e. not chiral).
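The reading given above follows from the V2000 counts line being fixed-width: eleven 3-character fields (`aaabbblllfffcccsssxxxrrrpppiiimmm`) followed by a 6-character version tag. A minimal sketch of a strict fixed-width parser shows how the offending line is interpreted; the function name `parse_counts_line_fixed` is hypothetical, and real toolkits differ in how strictly they apply the column rules.

```python
COUNTS_FIELDS = ("aaa", "bbb", "lll", "fff", "ccc", "sss", "xxx", "rrr", "ppp", "iii", "mmm")

def parse_counts_line_fixed(line):
    """Slice a V2000 counts line into its 3-character fields plus the version tag."""
    fields = {name: line[3 * i:3 * i + 3] for i, name in enumerate(COUNTS_FIELDS)}
    fields["vvvvvv"] = line[33:39]
    return fields

offending = " 21 24999 V2000"
parsed = parse_counts_line_fixed(offending)
print({name: parsed[name] for name in ("aaa", "bbb", "lll", "fff", "ccc")})

# A line with all columns present, for comparison.
well_formed = " 21" + " 24" + "  0" * 8 + "999" + " V2000"
print(parse_counts_line_fixed(well_formed)["vvvvvv"])
```

Under strict slicing the version tag of the offending line is empty, which is one plausible reason some toolkits warn about or reject the record while others read it without complaint.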
MDL molfile-ageddon
• BIOVIA 2017 changes the interpretation of MDL files.
• This affects over 213,097 CIDs in PubChem!
SMILES standardization
• CHEMBL 423544
– ChEMBL uses BIOVIA Pipeline Pilot for its SMILES
• CCc1n[c]#[c]n1CC2CC(C(=O)O2)(c3ccccc3)c4ccccc4
• CCc1nc#cn1CC2CC(C(=O)O2)(c3ccccc3)c4ccccc4
• CCC1=NC#CN1CC2CC(C(=O)O2)(c3ccccc3)c4ccccc4
John Dalton’s legacy
• In 1808, John Dalton published “A New System of
Chemical Philosophy”, in which he described his
atomic theory (based upon the law of multiple
proportions) that revolutionized/defined molecular
chemistry.
• Compounds are composed of atoms in defined
whole-number ratios, where all atoms of an element
are identical.
• Interestingly, 209 years later, boundary cases of this
rule define the frontiers of cheminformatics.

Contenu connexe

Similaire à Digital Chemical Representations

Formation SiO2 Mass-Independent Oxygen Isotopic Partitioning During Gas-Phase
 Formation SiO2 Mass-Independent Oxygen Isotopic Partitioning During Gas-Phase Formation SiO2 Mass-Independent Oxygen Isotopic Partitioning During Gas-Phase
Formation SiO2 Mass-Independent Oxygen Isotopic Partitioning During Gas-Phase
Carlos Bella
 
Macka Talk Pittcon 2016-v2016-03-06-GIVEN-WATERMARKED
Macka Talk Pittcon 2016-v2016-03-06-GIVEN-WATERMARKEDMacka Talk Pittcon 2016-v2016-03-06-GIVEN-WATERMARKED
Macka Talk Pittcon 2016-v2016-03-06-GIVEN-WATERMARKED
Mirek Macka
 
draft 7 luchinsky_CV
draft 7 luchinsky_CVdraft 7 luchinsky_CV
draft 7 luchinsky_CV
luchinskyd
 
Furrey Undergraduate Research Symposium v3 Public
Furrey Undergraduate Research Symposium v3 PublicFurrey Undergraduate Research Symposium v3 Public
Furrey Undergraduate Research Symposium v3 Public
Patrick Furrey
 

Similaire à Digital Chemical Representations (20)

opn200903-dl
opn200903-dlopn200903-dl
opn200903-dl
 
cplu_cover profile issue 10
cplu_cover profile issue 10cplu_cover profile issue 10
cplu_cover profile issue 10
 
Solar roads final selected beni
Solar roads final selected beniSolar roads final selected beni
Solar roads final selected beni
 
Data integration and building a profile for yourself as an online scientist
Data integration and building a profile for yourself as an online scientistData integration and building a profile for yourself as an online scientist
Data integration and building a profile for yourself as an online scientist
 
Formation SiO2 Mass-Independent Oxygen Isotopic Partitioning During Gas-Phase
 Formation SiO2 Mass-Independent Oxygen Isotopic Partitioning During Gas-Phase Formation SiO2 Mass-Independent Oxygen Isotopic Partitioning During Gas-Phase
Formation SiO2 Mass-Independent Oxygen Isotopic Partitioning During Gas-Phase
 
Reference
ReferenceReference
Reference
 
Efficient Perception of Proteins and Nucleic Acids from Atomic Connectivity
Efficient Perception of Proteins and Nucleic Acids from Atomic ConnectivityEfficient Perception of Proteins and Nucleic Acids from Atomic Connectivity
Efficient Perception of Proteins and Nucleic Acids from Atomic Connectivity
 
Optics of Solar Cells, OSA’s 93rd Annual Meeting - San José, CA, USA
Optics of Solar Cells, OSA’s 93rd Annual Meeting - San José, CA, USAOptics of Solar Cells, OSA’s 93rd Annual Meeting - San José, CA, USA
Optics of Solar Cells, OSA’s 93rd Annual Meeting - San José, CA, USA
 
Macka Talk Pittcon 2016-v2016-03-06-GIVEN-WATERMARKED
Macka Talk Pittcon 2016-v2016-03-06-GIVEN-WATERMARKEDMacka Talk Pittcon 2016-v2016-03-06-GIVEN-WATERMARKED
Macka Talk Pittcon 2016-v2016-03-06-GIVEN-WATERMARKED
 
Workshops escrita modulos_3_4
Workshops escrita  modulos_3_4Workshops escrita  modulos_3_4
Workshops escrita modulos_3_4
 
Workshops escrita modulos_3_4
Workshops escrita  modulos_3_4Workshops escrita  modulos_3_4
Workshops escrita modulos_3_4
 
Proquest Burye Thesis
Proquest Burye ThesisProquest Burye Thesis
Proquest Burye Thesis
 
CV 2015
CV 2015CV 2015
CV 2015
 
draft 7 luchinsky_CV
draft 7 luchinsky_CVdraft 7 luchinsky_CV
draft 7 luchinsky_CV
 
Chemical structure representation in PubChem
Chemical structure representation in PubChemChemical structure representation in PubChem
Chemical structure representation in PubChem
 
SmallWorld : Efficient Maximum Common Subgraph Searching of Large Chemical Da...
SmallWorld : Efficient Maximum Common Subgraph Searching of Large Chemical Da...SmallWorld : Efficient Maximum Common Subgraph Searching of Large Chemical Da...
SmallWorld : Efficient Maximum Common Subgraph Searching of Large Chemical Da...
 
US7909971
US7909971US7909971
US7909971
 
Sayir - Aerospace Materials for Extreme Environments - Spring Review 2013
Sayir - Aerospace Materials for Extreme Environments - Spring Review 2013Sayir - Aerospace Materials for Extreme Environments - Spring Review 2013
Sayir - Aerospace Materials for Extreme Environments - Spring Review 2013
 
Furrey Undergraduate Research Symposium v3 Public
Furrey Undergraduate Research Symposium v3 PublicFurrey Undergraduate Research Symposium v3 Public
Furrey Undergraduate Research Symposium v3 Public
 
How to write great papers and get published
How to write great papers and get publishedHow to write great papers and get published
How to write great papers and get published
 

Plus de NextMove Software

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
NextMove Software
 

Plus de NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information Exchange
 
Automatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsAutomatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patents
 

Dernier

Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Sérgio Sacani
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
seri bangash
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 

Dernier (20)

Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
chemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfchemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdf
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Introduction to Viruses
Introduction to VirusesIntroduction to Viruses
Introduction to Viruses
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 

Digital Chemical Representations

  • 1. Digital chemical representations Roger Sayle NextMove Software, Cambridge, UK Status and Future of IUPAC InChI, Bethesda, Maryland, USA, Wednesday 16th August 2017
  • 2. Too much history! • William J. Wiswesser, “107 Years of Line-Formula Notations (1861-1968)”, Journal of Chemical Documentation, Vol. 8, No. 3, pp. 146-150, 1968. • Bonnie Lawlor, “Chemical Structure Association (CSA) Trust: Advancing Scientific Discovery for Fifty Years”, Chemical Information (CINF) Bulletin, Vol. 67, No. 4, Winter 2015. • Andrew Dalke, “Weininger’s Realization”, blog 2016/12/02 – http://www.dalkescientific.com/writings/diary/archive/2016/12/02/Weiningers_realization.html • Committee on Modern Methods of Handling Chemical Information, National Academy of Science & National Research Council, “Survey of Chemical Notation Systems”, Publication 1150, Washington DC, 1964. Status and Future of IUPAC InChI, Bethesda, Maryland, USA, Wednesday 16th August 2017
  • 3. A little history (Bonnie Lawlor)
      The emergence of punch-card technologies during the middle of the last century renewed interest in these notations, and in 1949 the International Union of Pure and Applied Chemistry (IUPAC) invited the submission of simple notations that would be suitable for international adoption. They ultimately chose a notation submitted by G. Malcolm Dyson, but it was one of the other seven notations that were submitted that caught the attention of those working in the field [1].
      [1] It should be noted that the selection of the Dyson notation was criticized, and a petition was signed by about 1,000 chemists, including several who had submitted notations for consideration, stating that the Wiswesser Notation had not been given adequate consideration. The appeal was taken to the American Chemical Society and the National Academy of Sciences - National Research Council, who requested that the National Science Foundation do a study, the results of which showed that more testing of both notations should be done before any decision was made. This was not done and the Dyson Notation was selected. A cloud hung over the decision because Dyson was the chair of the IUPAC Commission that called for the submission of notations.
  • 4. 1964: Dyson/IUPAC vs. Wiswesser
      B62CQ3 C5C23Q3 B6CX1N3 A63513b7C38EQ13 L66J B1Q QZ2&2&2 ZR CVQ L E5 B666 OV AHTTT&J A E
  • 5. Why use InChI?
      • “InChI is a unique representation/identifier for defined chemical structures. Probably marginally better than previous ones”
  • 6. ISO 11238 §B.2.2 InChI in XML Example
      <STRUCTURAL_REPRESENTATION_TYPE>INCHI</STRUCTURAL_REPRESENTATION_TYPE>
      <STRUCTURAL_REPRESENTATION>1S/C2H5NO2.AL.CLH.2H2O.ZR/C3-1- 2(4)5;;;;;/H1,3H2,(H,4,5);;1H;2*1H2;/Q;+3;;;;+4/P- 2</STRUCTURAL_REPRESENTATION>
      Slide annotations:
      • Missing InChI=
      • Standard and Non-Standard InChI?
      • Converted to upper case
      • Indentation
      • Spurious Spaces
      • Line Breaks
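Three of the problems annotated on slide 6 (the missing prefix, the upper-casing and the stray whitespace) can be detected mechanically. The following is a minimal sketch in plain Python; `lint_inchi` is a hypothetical helper written for this illustration, not part of any InChI toolkit, and it assumes that every layer after the formula starts with a lower-case letter.

```python
import re

# The string exactly as it appears inside the XML element on slide 6.
FROM_SLIDE = ("1S/C2H5NO2.AL.CLH.2H2O.ZR/C3-1- 2(4)5;;;;;"
              "/H1,3H2,(H,4,5);;1H;2*1H2;/Q;+3;;;;+4/P- 2")


def lint_inchi(text):
    """Return a list of problems of the kind annotated on the slide."""
    problems = []
    if re.search(r"\s", text):
        problems.append("contains whitespace or line breaks")
    compact = re.sub(r"\s+", "", text)
    if compact.startswith("InChI="):
        compact = compact[len("InChI="):]
    else:
        problems.append("missing 'InChI=' prefix")
    # parts[0] is the version (e.g. "1S"), parts[1] the formula layer;
    # the remaining layers are assumed to start with a lower-case letter.
    parts = compact.split("/")
    if any(part[:1].isupper() for part in parts[2:]):
        problems.append("layer prefixes are upper case")
    return problems


for problem in lint_inchi(FROM_SLIDE):
    print(problem)
```

A check like this only flags the damage; undoing the upper-casing is not a simple string operation, because element symbols such as AL, CLH and ZR have lost their original case as well.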
  • 7. §B.2.4 V2000 Mol File in XML Example <STRUCTURAL_REPRESENTATION_TYPE>MOL</STRUCTURAL_REPRESENTATION_TYPE> <STRUCTURAL_REPRESENTATION>30 29 0 0 0 0 0 0 0 0999 V2000 9.9563 -7.3055 0.0000 Y 1 1 0 0 0 0 0 0 0 0 0 0 15.0355 -4.8847 0.0000 * 0 0 0 0 0 0 0 0 0 0 0 0 13.3609 - 8.0134 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 13.8867 -9.9869 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0 6.4178 -6.8678 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 5.8872 -4.8955 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0 6.7218 -5.7285 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 13.0541 -9.1519 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 13.3408 -6.8634 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 13.8599 - 4.8881 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 13.0301 -5.7260 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 5.9099 -9.9441 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0 6.4492 -7.9743 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 6.7482 -9.1149 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 7.8605 -5.4221 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 11.8897 -5.4263 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 11.9147 -9.4555 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 7.8855 -9.4263 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 7.6897 -8.0305 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 7.6897 -6.8513 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 8.7018 -6.2618 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 9.2908 -5.2506 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 10.4700 -5.2524 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 11.0577 -6.2664 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 12.0761 -6.8427 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 12.0891 -8.0218 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 8.7257 -8.5952 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 11.0839 -8.6223 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 10.4848 -9.6275 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 9.3057 -9.6139 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 10 2 1 0 0 0 0 8 3 2 0 0 0 0 25 24 1 0 0 0 0 8 4 1 0 0 0 0 27 18 1 0 0 0 0 7 5 2 0 0 0 0 26 28 1 0 0 0 0 7 6 1 0 0 0 0 19 27 1 0 0 0 0 15 7 1 0 0 0 0 20 21 1 0 0 0 0 17 8 1 0 0 0 0 30 27 1 0 0 0 0 11 9 2 0 0 0 0 30 29 1 0 0 0 0 11 10 1 0 0 0 0 20 19 1 0 0 0 0 16 11 1 0 0 0 0 22 21 1 0 0 0 0 14 12 1 0 0 0 0 23 24 1 0 0 0 0 14 13 2 0 0 0 0 18 14 1 0 0 0 0 26 25 1 0 0 0 0 
21 15 1 0 0 0 0 29 28 1 0 0 0 0 24 16 1 0 0 0 0 23 22 1 0 0 0 0 28 17 1 0 0 0 0 M CHG 4 1 3 4 -1 6 -1 12 -1 M ISO 1 1 90 M END </STRUCTURAL_REPRESENTATION> Where to begin?
  • 8. Cheminformatics’ cutting edge
      1. Mesomers and Valence Representations
      2. Protonation States
      3. Tautomers (prototropic, C-type, ring-chain, ring-ring)
      4. Stereochemistry, Configuration and Conformation
      5. Biomolecules (Peptides, Nucleic Acids, Sugars and Lipids)
      6. Reactions
      7. Inorganics, organometallics and intermetallic alloys
      8. Polymers
      9. Mixtures (Salts, Solvents, Alloys, Formulations, etc.)
      10. Patterns and Transformations
      11. Part, Position, Count and Class Variation (Markush)
      12. Physical States and Forms
      13. Radicals, Excited and Metastable nuclear states
  • 9. Tautomers
      • Tautomers are molecular isomers that easily interconvert by migration of hydrogen atoms [1].
      [1] R. Sayle, “So you think you understand tautomerism?”, JCAMD, 24(6-7):485-496, June 2010.
      InChI=1S/C16H12N2O/c19-16-11-10-15(13-8-4-5-9-14(13)16)18-17-12-6-2-1-3-7-12/h1-11,19H
      InChI=1S/C16H12N2O/c19-16-11-10-15(13-8-4-5-9-14(13)16)18-17-12-6-2-1-3-7-12/h1-11,17H
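The two InChI strings on slide 9 differ only in where one hydrogen sits. A small sketch that splits each string into its layers and reports the layers that differ makes this visible. The helper names are hypothetical, and the parser assumes each layer prefix occurs once, which holds for these two strings but not for every InChI (fixed-H and reconnected layers can repeat prefixes).

```python
def inchi_layers(inchi):
    """Split an InChI into {'version', 'formula', <layer letter>: text}."""
    body = inchi.split("=", 1)[1]
    version, formula, *rest = body.split("/")
    layers = {"version": version, "formula": formula}
    for part in rest:
        # Assumption: one layer per prefix letter (true for the slide examples).
        layers[part[0]] = part[1:]
    return layers


def differing_layers(a, b):
    """Return {layer: (value in a, value in b)} for layers that differ."""
    la, lb = inchi_layers(a), inchi_layers(b)
    return {key: (la.get(key), lb.get(key))
            for key in sorted(set(la) | set(lb))
            if la.get(key) != lb.get(key)}


# The two strings from the slide (formula written as C16H12N2O).
FORM_A = ("InChI=1S/C16H12N2O/c19-16-11-10-15(13-8-4-5-9-14(13)16)"
          "18-17-12-6-2-1-3-7-12/h1-11,19H")
FORM_B = ("InChI=1S/C16H12N2O/c19-16-11-10-15(13-8-4-5-9-14(13)16)"
          "18-17-12-6-2-1-3-7-12/h1-11,17H")

print(differing_layers(FORM_A, FORM_B))
```

Only the hydrogen layer differs: the last hydrogen is attached to atom 19 in one form and to atom 17 in the other, while the formula and connectivity layers are identical.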
  • 10. InChI issues (Pwn2Own at InChI Con)
      • Some recent limitations with InChI stereochemistry
  • 11. Polymers
      • The InChI Trust has added polymer support to InChI.
      • Or has it?
      InChI=1B/C2H4O/c1-2-3-1/h1-2H2/z101-1-3(1,2,1,3,2,3)
      InChIKey=IAYPIBMASNFSPL-GCGQHNKHBA-N
      InChI=1B/C4H8O2/c1-2-6-4-3-5-1/h1-4H2/z101-1-6(1,2,1,5,2,6,3,4,3,5,4,6)
      InChIKey=RYHBNJHYFVUHQT-UEXHMUNTBA-N
      Experimental Beta Option -Polymers
  • 12. InChI polymers: capping group frame shift
      InChI=1B/C4H8N2O3/c5-1-3(7)6-2-4(8)9/h1-2,5H2,(H,6,7)(H,8,9)/z101-1,3,6-7(2-6,5-1)
      InChIKey=YMAWOPBAYDPSLA-NPFRMHEKBA-N
      InChI=1B/C4H8N2O3/c5-1-3(7)6-2-4(8)9/h1-2,5H2,(H,6,7)(H,8,9)/z101-2-3,6-7(1-3,4-2)
      InChIKey=YMAWOPBAYDPSLA-LCMLVUGIBA-N
      InChI=1B/C4H8N2O3/c5-1-3(7)6-2-4(8)9/h1-2,5H2,(H,6,7)(H,8,9)/z101-2,4,6,8(3-6,9-4)
      InChIKey=YMAWOPBAYDPSLA-YYSJEQPSBA-N
  • 13. Neutral component duplication
      • Duplicated components lead to different InChIs
          – Water (O)
              • InChI=1S/H2O/h1H2
              • XLYOFNOQVPJJNP-UHFFFAOYSA-N
          – Wet water (O.O)
              • InChI=1S/2H2O/h2*1H2
              • JEGUKCSWCFPDGT-UHFFFAOYSA-N
          – Dilute water (O.O.O)
              • InChI=1S/3H2O/h3*1H2
              • JLFVIEQMRKMAIT-UHFFFAOYSA-N
      • Goodman’s Hypothesis: How many InChI-Keys?
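The duplication on slide 13 is visible in the formula layer, where a leading integer gives the number of copies of each component. As a sketch of one conceivable normalisation (not something InChI itself performs), the component counts can be read from the formula layer and divided by their greatest common divisor. The function names are hypothetical.

```python
import re
from functools import reduce
from math import gcd


def component_counts(inchi):
    """Map each formula-layer component to its multiplicity."""
    formula = inchi.split("/")[1]
    counts = {}
    for token in formula.split("."):
        match = re.fullmatch(r"(\d*)([A-Z].*)", token)
        if match is None:
            raise ValueError(f"unexpected formula token: {token!r}")
        copies = int(match.group(1) or 1)
        counts[match.group(2)] = counts.get(match.group(2), 0) + copies
    return counts


def reduced(counts):
    """Divide all multiplicities by their greatest common divisor."""
    divisor = reduce(gcd, counts.values())
    return {name: n // divisor for name, n in counts.items()}


for inchi in ("InChI=1S/H2O/h1H2",
              "InChI=1S/2H2O/h2*1H2",
              "InChI=1S/3H2O/h3*1H2"):
    print(component_counts(inchi), "->", reduced(component_counts(inchi)))
```

After reduction all three water examples collapse to a single H2O component. Whether such a reduction is chemically appropriate is exactly the question the slide raises, since it discards the stated number of copies.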
  • 14. Metallic oxides (and friends)
      • One problem area for chemical representation is compounds that have no discrete chemical structure but are defined by the ratios of their elements.
      • One of John Dalton’s case studies (in his 1808 “A New System of Chemical Philosophy”) was on tin oxides, SnO and SnO2, often represented as O=[Sn], O=[Sn]=O
  • 15. Representations of ratios
  • 16. Representations of ratios
      • And these are just the neutral binary metal oxides; there are even more permutations for ions (permanganates, perchlorates) and halides (aluminium chloride) and so on.
      • Fortunately, a defining feature of a substance is that it has zero net charge.
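The zero-net-charge constraint on slide 16 is what fixes the element ratio of a simple binary salt or oxide. As a worked sketch, assuming one kind of cation and one kind of anion with known integer charges, the smallest whole-number ratio follows from the greatest common divisor of the two charge magnitudes. `neutral_ratio` is a hypothetical helper for illustration.

```python
from math import gcd


def neutral_ratio(cation_charge, anion_charge):
    """Smallest (n_cations, n_anions) giving zero net charge."""
    if cation_charge <= 0 or anion_charge >= 0:
        raise ValueError("expected a positive cation and a negative anion")
    pos, neg = cation_charge, -anion_charge
    divisor = gcd(pos, neg)
    # n_cations * pos must equal n_anions * neg.
    return neg // divisor, pos // divisor


print("Sn(2+) with O(2-):", neutral_ratio(+2, -2))  # tin(II) oxide, SnO
print("Sn(4+) with O(2-):", neutral_ratio(+4, -2))  # tin(IV) oxide, SnO2
print("Al(3+) with Cl(-):", neutral_ratio(+3, -1))  # aluminium chloride, AlCl3
```

The two tin oxides from the previous slide come out as 1:1 and 1:2, and aluminium chloride as 1:3. The sketch says nothing about connectivity, which is the part these compounds do not define.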
  • 17. Dalton SMILES and Dalton InChI
      – A molecular representation that correctly captures the ratio of elements, but not necessarily connectivity.
          • O=[Si]=O Silicon Dioxide (cf. Wikipedia history)
          • O=P(=O)OP(=O)=O Phosphorus pentoxide
          • [C] Diamond, Graphene, Fullerenes
          • [C]=[C] Graphene, Fullerene
      – Extension to mixtures, where each component is listed, but not necessarily the relative amounts of each.
          • Cl.O Hydrochloric acid
          • Cu1OS(=O)(=O)O1.O Copper(II) sulfate hydrate
          • [Fe].[Cl] Iron chloride
  • 18. Example CSD names from CCDC
      • catena(Tetra-aqua-tetrakis(mu!2$-formato-O,O')-bis(formato-O)-di-barium-copper)
      • (mu!2$-2,5-bis((Phenylimino)methyl)benzene-1,4-diyl-C,C',N,N')-bis(eta$5!-pentamethylcyclopentadienyl)-dichloro-di-iridium
  • 19. Identifiers vs. representations
      • Compounds are composed of atoms in defined whole-number ratios, where all atoms of an element are identical.
      • It is this statement that allows us to claim that two compounds (or crystals) are identical, and can be captured by a canonical form or universal identifier.
      • Without it, substances or mixtures of arbitrary composition are each unique, and one can only talk of similarity and equivalence, not of equality.
  • 20. Metal alloys
      • AdmiraltyBrass
          – Cu 69 %
          – Zn 30 %
          – Sn 1 %
      • RollsRoyceTurbineAlloy1
          – Ni 29.2-37 %mass
          – Co 29.2-37 %mass
          – Cr 10-16 %mass
          – Al 4-6 %mass
          – Zr 0.04-0.07 %mass
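An alloy specified by ranges, like the second example on slide 20, is a membership test and not a single structure. The sketch below stores the listed ranges and checks whether a measured composition falls inside them. The two sample compositions are invented purely for illustration, and `out_of_spec` is a hypothetical helper.

```python
# Ranges (% mass) as listed on the slide for RollsRoyceTurbineAlloy1.
SPEC = {
    "Ni": (29.2, 37.0),
    "Co": (29.2, 37.0),
    "Cr": (10.0, 16.0),
    "Al": (4.0, 6.0),
    "Zr": (0.04, 0.07),
}


def out_of_spec(sample, spec=SPEC):
    """Return the elements whose amount falls outside the listed range."""
    return [element for element, (low, high) in spec.items()
            if not low <= sample.get(element, 0.0) <= high]


# Two hypothetical samples (values invented for illustration only).
sample_a = {"Ni": 33.0, "Co": 33.0, "Cr": 12.0, "Al": 5.0, "Zr": 0.05}
sample_b = {"Ni": 36.0, "Co": 30.0, "Cr": 15.0, "Al": 4.5, "Zr": 0.06}

print(out_of_spec(sample_a), out_of_spec(sample_b), sample_a == sample_b)
```

Both samples satisfy the specification while having different compositions, which illustrates why such materials support statements of equivalence or similarity but not of equality.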
  • 21. Sea water composition
      • SeaWater
          – Water 1 liter
          – Salts 41.953 g
              • NaCl 58.490 %
              • MgCl2.6H2O 26.460 %
              • Na2SO4 9.750 %
              • CaCl2 2.765 %
              • KCl 1.645 %
              • NaHCO3 0.477 %
              • KBr 0.238 %
              • H3BO3 0.071 %
              • SrCl2.6H2O 0.095 %
              • NaF 0.007 %
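The nested recipe on slide 21 (a total salt mass, then a percentage breakdown) can be held in a small data structure and flattened into grams per liter. The sketch assumes the percentages are mass fractions of the 41.953 g salt total, which the slide does not state explicitly.

```python
SALTS_G_PER_LITER = 41.953

# Percentages as listed on the slide, assumed to be % of the total salt mass.
SALT_PERCENT = {
    "NaCl": 58.490,
    "MgCl2.6H2O": 26.460,
    "Na2SO4": 9.750,
    "CaCl2": 2.765,
    "KCl": 1.645,
    "NaHCO3": 0.477,
    "KBr": 0.238,
    "H3BO3": 0.071,
    "SrCl2.6H2O": 0.095,
    "NaF": 0.007,
}


def grams_per_liter(percent=SALT_PERCENT, total=SALTS_G_PER_LITER):
    """Convert the percentage breakdown into grams of each salt per liter."""
    return {salt: total * share / 100.0 for salt, share in percent.items()}


print(f"listed percentages sum to {sum(SALT_PERCENT.values()):.3f} %")
for salt, grams in grams_per_liter().items():
    print(f"{salt:12s} {grams:8.4f} g")
```

The listed percentages sum to 99.998 % and not to exactly 100 %, a reminder that recipes like this are rounded measurements and not exact whole-number ratios.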
  • 22. Atmospheric composition
      • Air
          – Nitrogen 78.084 %v
          – Oxygen 20.964 %v
          – Argon 0.9340 %v
          – Carbon dioxide 0.04 %v
          – Neon 0.001818 %v
          – Helium 0.000524 %v
          – Methane 0.00018 %v
          – Krypton 0.000114 %v
          – Hydrogen 0.000055 %v
      • Martian Atmosphere
          – Carbon dioxide 95.97 %v
          – Argon 1.93 %v
          – Nitrogen 1.89 %v
          – Oxygen 0.146 %v
          – Carbon monoxide 0.0557 %v
  • 23. IBM’s bad similarity
      • Vidal05 and Grant06 described a fast chemical similarity measure based upon SMILES strings.
      • Typically, similar molecules have similar SMILES.
      • In US20080002810, IBM claim a patent for “System and Method for Identifying Similar Molecules” that also uses n-gram similarity but on InChI strings.
      • Unfortunately, similar molecules don’t have similar InChI strings (that is, similar runs of characters)!
      • Benchmarking shows the IBM method to be one of the worst chemical similarity measures to date.
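The general idea behind n-gram string similarity, as discussed on slide 23, can be sketched in a few lines. This is a plain multiset Tanimoto coefficient over overlapping substrings, written for illustration; it is not the exact scoring formula of the cited papers or of the patent, and the default n-gram length of 4 is an assumption.

```python
from collections import Counter


def ngrams(text, n=4):
    """Count every overlapping substring of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_tanimoto(a, b, n=4):
    """Multiset Tanimoto: shared n-gram count divided by combined count."""
    counts_a, counts_b = ngrams(a, n), ngrams(b, n)
    shared = sum((counts_a & counts_b).values())    # element-wise minimum
    combined = sum((counts_a | counts_b).values())  # element-wise maximum
    return shared / combined if combined else 0.0


# Ethanol and 1-propanol, as SMILES and as standard InChI.
print(ngram_tanimoto("CCO", "CCCO", n=2))
print(ngram_tanimoto("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
                     "InChI=1S/C3H8O/c1-2-3-4/h4H,2-3H2,1H3"))
```

Applied to InChI strings, much of the text being compared is the fixed prefix, the formula and the atom numbering, so a small structural change can alter many n-grams at once. That is consistent with the slide's observation that similar molecules need not have similar InChI strings.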
  • 24. Conclusions
      • It’s fantastic to be here and be part of the ongoing process (which has a long legacy).
      • Even when flawed, chemical line notations help frame our understanding of chemistry.
      • Extending discrete compounds to continuous compounds is the current state-of-the-art.
      • But for mixtures and beyond, the future is in both “similarity” and canonical forms.
  • 25. Acknowledgements
      • The team at NextMove Software
          – John Mayfield
          – Noel O’Boyle
      • And the Cheminformatics Community (including)
          – Daniel Lowe
          – Leah McEwen
          – Philip Skinner
          – Evan Bolton
          – Greg Landrum
          – Ian Bruno
          – Jonathan Goodman
          – Andrew Dalke
      • And many thanks for your time!
  • 27. Definitions
      • A substance has constant chemical composition.
      • A mixture is two or more different substances.
      • A chemical compound has two or more atoms, with a fixed proportion/ratio of its elements.
      • Non-stoichiometric compounds = mixtures.
      • Intermetallic alloys = compounds.
  • 28. InChI vs. IUPAC
      • Physically impossible charge
          – [C+7] InChI=1S/C/q+7
      • Physically impossible isotope
          – [5C] InChI=1S/C/i1-7
      • Mononuclidic elements
          – [23Na][19F] InChI=1S/FH.Na/h1H;/q;+1/p-1/i2*1+0
          – [Na]F InChI=1S/FH.Na/h1H;/q;+1/p-1
      • Formula Unit
          – [Na]Cl InChI=1S/ClH.Na/h1H;/q;+1/p-1
          – [Na]Cl.[Na]Cl InChI=1S/2ClH.2Na/h2*1H;;/q;;2*+1/p-2
          – [Na]1Cl[Na]Cl1 InChI=1S/2ClH.2Na/h2*1H;;
  • 29. InChI issues
      • Some important molecules are indistinguishable
          – CCC=[S+]/[O-]
          – CCC=S=O
          – CCC=[S+][O-]
      InChI=1S/C3H6OS/c1-2-3-5-4/h3H,2H2,1H3
  • 30. InChI issues
      • Some important molecules are indistinguishable
          – [NH3][Pt]([NH3])(Cl)Cl
          – [NH3][Pt]([NH3])(Cl)Cl
          – [NH3+][Pt-2]([NH3+])(Cl)Cl
          – [NH3+][Pt-2]([NH3+])(Cl)Cl
          – [NH3][Pt@SP1]([NH3])(Cl)Cl
          – [NH3][Pt@SP2]([NH3])(Cl)Cl
      InChI=1S/2ClH.2H3N.Pt/h2*1H;2*1H3;/q;;;;+2/p-2
  • 31. Mol file standardization
      • RDKit’s rdkit/Docs/Book/data/actives_5ht3.sdf
          – Contains 180 connection tables
      • RDKit outputs 180 molecules, with no warnings.
      • OEChem outputs 38 molecules, with 142 warnings.
      • ChemAxon outputs 10 molecules, with 1 warning.
      • OpenBabel outputs 180 molecules, with 142 warnings.
      • InChI outputs 180 molecules, with 23 warnings.
          – Counts line of an offending record: “ 21 24999 V2000”
          – Hence, aaa is “ 21”, bbb is “ 24”, lll is “999”, ccc (chiral flag) is “000” (i.e. not chiral).
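The field assignment on slide 31 follows from reading the V2000 counts line as fixed-width columns: eleven three-character fields followed by a six-character version field. A minimal sketch of that strict column-based reading is shown below; `split_counts_line` is a hypothetical helper, and real toolkits differ in how leniently they parse this line.

```python
# Three-character fields of the V2000 counts line, in order.
FIELDS = ["aaa", "bbb", "lll", "fff", "ccc", "sss",
          "xxx", "rrr", "ppp", "iii", "mmm"]


def split_counts_line(line):
    """Slice a counts line into its fixed-width fields, without validation."""
    fields = {name: line[i * 3:(i + 1) * 3] for i, name in enumerate(FIELDS)}
    fields["vvvvvv"] = line[33:39]  # version field, e.g. " V2000"
    return fields


# A conventionally laid-out counts line, built field by field.
well_formed = " 21 24" + "  0" * 8 + "999 V2000"
# The offending counts line quoted on the slide.
offending = " 21 24999 V2000"

for line in (well_formed, offending):
    print({name: text for name, text in split_counts_line(line).items() if text})
```

Read strictly by column, the offending line yields 21 atoms and 24 bonds, then “999” as the atom-list count, “ V2” in the obsolete fff field and “000” as the chiral flag, which matches the assignment on the slide. A whitespace-splitting parser would see the same line quite differently, which may explain part of the disagreement between toolkits.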
  • 32. MDL molfile-ageddon
      • Biovia 2017 changes the interpretation of MDL files.
      • This affects over 213097 CIDs in PubChem!
  • 33. SMILES standardization
      • CHEMBL 423544
          – ChEMBL uses Biovia Pipeline Pilot for its SMILES
      • CCc1n[c]#[c]n1CC2CC(C(=O)O2)(c3ccccc3)c4ccccc4
      • CCc1nc#cn1CC2CC(C(=O)O2)(c3ccccc3)c4ccccc4
      • CCC1=NC#CN1CC2CC(C(=O)O2)(c3ccccc3)c4ccccc4
  • 34. John Dalton’s legacy
      • In 1808, John Dalton published “A New System of Chemical Philosophy”, in which he described his atomic theory, based upon the law of multiple proportions, which revolutionized/defined molecular chemistry.
      • Compounds are composed of atoms in defined whole-number ratios, where all atoms of an element are identical.
      • Interestingly, 209 years later, boundary cases of this rule define the frontiers of cheminformatics.