Chemical structure representation in PubChem

252nd ACS National Meeting Philadelphia Fall 2016
Roger Sayle


  1. Chemical structure representation in PubChem. Roger Sayle, NextMove Software, Cambridge, UK. 252nd ACS National Meeting, Philadelphia, PA, Tuesday 23rd August 2016.
  2. Selected PubChem publications
     • Sunghwan Kim, Paul A. Thiessen, Evan E. Bolton, Jie Chen, Gang Fu, Asta Gindulyte, Lianyi Han, Jane He, Siqian He, Benjamin A. Shoemaker, Jiyao Wang, Bo Yu, Jian Zhang and Stephen H. Bryant, “PubChem Substance and Compound Databases”, Nucleic Acids Research, 2015.
     • Volker D. Hähnke, Evan E. Bolton and Stephen H. Bryant, “PubChem atom environments”, Journal of Cheminformatics, 7:41, 2015.
     • Evan E. Bolton, Yanli Wang, Paul A. Thiessen, Stephen H. Bryant, “PubChem: Integrated Platform of Small Molecules and Biological Activities”, Annual Reports in Computational Chemistry, Volume 4, Chapter 12, pp. 217-241, 2008.
  3. Substance and compound
     • A unique and invaluable feature of PubChem’s architecture is the distinction between the deposited structures (substances) and the normalized structures (compounds), and the retention of both.
     • PubChem Substance contains ~209.6M structures.
     • PubChem Compound contains ~91.7M structures.
  4. Molecular identity
     • When are two chemical structures the same?
       – Alternate chemical representations.
       – Aromaticity and conjugation.
       – Protonation states and tautomerism.
       – Errors and typographical mistakes.
  5. PubChem standardization service
     https://pubchem.ncbi.nlm.nih.gov/standardize/standardize.cgi
  6. Example 1: ethanol
     • PubChem CID 702 has been deposited 1569 times with six different explicit atom counts.
       – 1311 have 9 atoms and 8 bonds.
       – 249 have 3 atoms and 2 bonds.
       – 4 have 0 atoms and 0 bonds.
       – 2 have 4 atoms and 3 bonds.
       – 2 have 5 atoms and 4 bonds.
       – 1 has 7 atoms and 6 bonds.
     • All have the same SMILES (“CCO”) and InChI.
  7. Explicit vs. implicit hydrogens
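The gap between the 3-atom and 9-atom ethanol depositions in Example 1 can be reproduced with a minimal sketch. It assumes neutral C, N and O atoms with their usual valences and no aromatic atoms; the function name and the input layout are hypothetical, chosen only for illustration.

```python
# Usual valences for neutral atoms (assumption: no charges, no aromaticity).
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}


def with_explicit_hydrogens(atoms, bonds):
    """Return (atom_count, bond_count) once implicit hydrogens are made explicit.

    atoms: element symbols of the heavy atoms.
    bonds: (i, j, order) tuples between heavy-atom indices.
    """
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    # Each unused valence is filled by one hydrogen, which adds one atom and one bond.
    hydrogens = sum(DEFAULT_VALENCE[sym] - used[k] for k, sym in enumerate(atoms))
    return len(atoms) + hydrogens, len(bonds) + hydrogens


# Ethanol, "CCO": three heavy atoms joined by two single bonds.
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1, 1), (1, 2, 1)]
print(len(ethanol_atoms), len(ethanol_bonds))                  # 3 2
print(with_explicit_hydrogens(ethanol_atoms, ethanol_bonds))   # (9, 8)
```

Both counts describe the same molecule, which is why every deposition maps to the same SMILES and InChI.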
  8. Example 2: nitrobenzene
     • PubChem CID 7416 has been deposited as 164 distinct substance depositions (2 without structures).
  9. MDL molfile-ageddon
     • BIOVIA 2017 changed the interpretation of CT files.
     • This affects 342,689 SIDs and 213,097 CIDs.
  10. Hydrogens: easy come/easy go?
      • PubChem is inconsistent on protonation/hydrogens.
      • Common organic element radicals are hydrogenated:
        – [C] → C, [Cl] → Cl, [P] → P, [S] → S, [H] → [HH]
        – [Li], [Be], [B], [Si], [As], [Se], [At], etc. remain unchanged.
      • Some groups get deprotonated:
        – c1ccccc1[N+](=O)O → c1ccccc1[N+](=O)[O-]
      • But generally protonation state is preserved:
        – CC(=O)O, CC(=O)[O-], [NH4+], [NH3+]CC(=O)[O-]
        – C[N+](C)(C)O
  11. Example 3: o-xylene (CID 7237)
      • A major challenge in chemical databases is aromaticity: recognizing that two structures that differ only in their Kekulé forms are the same molecule.
  12. PubChem canonical Kekulé SMILES
      • A significant innovation in cheminformatics was Evan Bolton’s development of a “canonical” Kekulé SMILES form of a molecule.
      • Different chemistry toolkits (and chemists!) differ in opinion on which ring systems are aromatic and which are not, hence PubChem’s wish to remain “neutral” by only providing non-aromatic SMILES.
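To make the idea of a "canonical" Kekulé form concrete, here is a toy sketch for a single fully alternating ring. It is not Bolton's algorithm, whose steps are not given in this transcript; it only shows that once ring atoms have a fixed (canonical) order, one of the two alternating bond assignments can be chosen deterministically. The function name and the tuple encoding are assumptions made for illustration.

```python
def canonical_kekule_ring(bond_orders):
    """Pick one Kekulé form of a fully alternating ring deterministically.

    bond_orders[i] is the order (1 or 2) of the bond between ring atoms
    i and (i + 1) % n, with atoms already listed in a fixed canonical order.
    Toy rule (an assumption, not PubChem's): keep the lexicographically
    smaller of the two alternating assignments.
    """
    n = len(bond_orders)
    alternating = all(
        bond_orders[i] in (1, 2) and bond_orders[i] != bond_orders[(i + 1) % n]
        for i in range(n)
    )
    if n % 2 or not alternating:
        raise ValueError("expected a fully alternating even-membered ring")
    flipped = tuple(3 - order for order in bond_orders)  # swap single <-> double
    return min(tuple(bond_orders), flipped)


# The two Kekulé forms of a benzene-like ring, e.g. the ring of o-xylene.
form_a = (2, 1, 2, 1, 2, 1)
form_b = (1, 2, 1, 2, 1, 2)
print(canonical_kekule_ring(form_a) == canonical_kekule_ring(form_b))  # True
```

Both depictions collapse to the same bond assignment, so a SMILES written from it is identical regardless of which Kekulé form was deposited.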
  13. Bolton’s algorithm
      • Steps of Bolton’s Canonical Kekulé Form Algorithm:
  14. Tricky case: 10b,10c-dihydropyrene
      • An important aspect is to aromatize all conjugated cycles, not just those in the SSSR (smallest set of smallest rings).
      • Unfortunately, this computationally demanding requirement is a source of pain at the NCBI.
  15. Conjugated ring systems
      • Does it make sense to distinguish 4n+2 Hückel aromaticity from conjugated ring systems?
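For reference, the 4n+2 test in the question above is a one-line check on the π-electron count of a conjugated cycle. The sketch below assumes the count is already known; the function name is hypothetical and the example counts are standard textbook values, not taken from the slides.

```python
def satisfies_huckel(pi_electrons):
    """True when the count has the form 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


print(satisfies_huckel(6))    # benzene: True
print(satisfies_huckel(4))    # cyclobutadiene: False
print(satisfies_huckel(8))    # cyclooctatetraene: False
print(satisfies_huckel(14))   # 14-electron periphery, as in a [14]annulene: True
```

A fully conjugated ring that fails this test still has alternating bonds, which is the kind of case the question on the slide is about.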
  16. Resonance forms
      • CCN(=O)=O → CC[N+](=O)[O-]
      • CCN=N#N → CCN=[N+]=[N-]
      • CC[O+]=C=[N-] → CCOC#N
      • C[P+](C)(C)[O-] → CP(=O)(C)C
      • CC(=[NH2+])[O-] → CC(=O)N
      • CS(=[OH+])(=O)[O-]
      • C[S+2]([O-])([O-])C
  17. Tautomers are normalized
      • CC(=N)O → CC(=O)N
      • CC(=[NH2+])[O-] → CC(=O)N
      • n1ccccc1O → [nH]1ccccc1=O
      • n1ccc(O)cc1 → [nH]1ccc(=O)cc1
  18. Classic tautomerism: Laar 1886
      InChI=1S/C16H12N2O/c19-16-11-10-15(13-8-4-5-9-14(13)16)18-17-12-6-2-1-3-7-12/h1-11,19H
      InChI=1S/C16H12N2O/c19-16-11-10-15(13-8-4-5-9-14(13)16)18-17-12-6-2-1-3-7-12/h1-11,17H
      CID 5355205 (CAS 3651-02-3); 5 SIDs, 13 SIDs
  19. But things could be improved...
  20. Bonds to metals
      • PubChem follows InChI in breaking bonds to metals.
        – Table salt
          • [Na]Cl → [Na+].[Cl-]
          • [Na].[Cl] → [Na].Cl
        – Zirconium(IV) ethoxide
          • CCO[Zr](OCC)(OCC)OCC → [Zr].CCO.CCO.CCO.CCO
          • [Zr+4].CC[O-].CC[O-].CC[O-].CC[O-]
        – Grignard reagents
          • c1ccccc1[Mg]Br → c1cccc[c-]1.[Mg+2].[Br-]
          • c1ccccc1[Mg+].[Br-] → c1cccc[c-]1.[Mg+].[Br-]
  21. Periodic table (circa 1997-2003)
      • PubChem currently handles 109 of the 118 elements in the periodic table [to be ratified in 2016].
      • Hence “Mt” is the heaviest element at the moment.
      • “Ds”, “Rg”, “Cn”, “Fl”, “Lv” already ratified.
      • “Nh”, “Mc”, “Ts” and “Og” expected soon.
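The cut-off described above amounts to a check on atomic number. The sketch below tabulates only the nine elements beyond meitnerium and uses a hypothetical helper name; it mirrors the state of affairs reported on the slide, not any PubChem API.

```python
# Atomic numbers of the elements heavier than Mt (109).
ABOVE_MT = {
    "Ds": 110, "Rg": 111, "Cn": 112, "Nh": 113, "Fl": 114,
    "Mc": 115, "Lv": 116, "Ts": 117, "Og": 118,
}
HEAVIEST_HANDLED = 109  # "Mt", per the slide


def is_handled(symbol, atomic_number=None):
    """True if the element falls within the 109 elements reported as handled.

    Symbols not listed in ABOVE_MT need an explicit atomic_number.
    """
    z = ABOVE_MT.get(symbol, atomic_number)
    if z is None:
        raise ValueError("atomic number required for " + symbol)
    return z <= HEAVIEST_HANDLED


print(is_handled("Mt", 109))  # True
print(is_handled("Ds"))       # False
print(len(ABOVE_MT))          # 9 elements, i.e. 118 - 109
```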
  22. PubChem isotopes
      • PubChem registration confirms that any specified isotope has been observed experimentally.
      • Hence [7CH4] is rejected, but [8CH4] is allowed.
      • Interestingly, the [8CH4] of CID 11635947 has a half-life of only two zeptoseconds (2×10⁻²¹ seconds).
      • Another quirk is that PubChem doesn’t normalize mononuclidic isotopes. Hence [19F]C (CID 58338844) is the same molecule as FC (CID 11638), yet has its own CID.
  23. Disavowed by the government
      • There are a number of species PubChem rejects:
        – Chlorine dioxide: O=[Cl]=O
        – Carbide anions: [C-]#[C-] and [C-4]
      • But there is hope…
        – Disulfur dioxide: O=[S][S]=O → O=S=S=O
  24. Related compounds/substances
      • CID → CID
        – Same Connectivity, Same Stereochemistry, Same Isotopes
        – Same Parent Connectivity, Same Exact Parent
        – Mixtures, Components and Neutralized Forms
        – Unique Components
        – Similar Compounds (90% Tanimoto), Similar Conformers
      • CID → SID
        – All, Same Structure, Mixture
      • SID → SID
        – Same Connectivity, Same Exact
      • SID → CID
        – PubChem SID
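The "90% Tanimoto" threshold for Similar Compounds refers to the Tanimoto coefficient between two fingerprints: the number of bits set in both divided by the number set in either. A minimal sketch follows, with fingerprints modelled as sets of bit positions; the bit sets are made up for illustration and are not real PubChem fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of set-bit positions."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # convention assumed here for two empty fingerprints
    return len(a & b) / len(a | b)


def is_similar(bits_a, bits_b, threshold=0.90):
    """Apply the 90% cut-off quoted on the slide."""
    return tanimoto(bits_a, bits_b) >= threshold


# Hypothetical fingerprints: 19 shared bits out of 20 in the union.
fp1 = set(range(20))
fp2 = set(range(19))
print(tanimoto(fp1, fp2))     # 19 / 20
print(is_similar(fp1, fp2))   # True

# 9 shared bits out of 11 in the union falls below the threshold.
fp3 = set(range(10))
fp4 = set(range(9)) | {10}
print(is_similar(fp3, fp4))   # False
```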
  25. PubChem bond encoding
      • PubChem allows depositors to specify advanced representations of molecular structures such as inorganics and organometallics via SD tags.
      • PUBCHEM_NONSTANDARDBOND
        – 4 = Quadruple bond, 5 = Dative bond, 6 = Complex bond, 7 = Ionic bond.
      • PUBCHEM_BONDANNOTATIONS
        – 2 = Hydrogen bond, 9 = Resonance bond, 10 = Bold bond, 11 = Fischer bond, 12 = Close contact.
      • Relatively few depositors make use of these.
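The codes listed above fit naturally into lookup tables. The sketch below covers only the values named on the slide and makes no assumption about how the tag values are laid out inside an SD file; the helper name is hypothetical.

```python
# Codes exactly as listed on the slide; other values may exist but are not shown there.
BOND_CODES = {
    "PUBCHEM_NONSTANDARDBOND": {
        4: "quadruple bond",
        5: "dative bond",
        6: "complex bond",
        7: "ionic bond",
    },
    "PUBCHEM_BONDANNOTATIONS": {
        2: "hydrogen bond",
        9: "resonance bond",
        10: "bold bond",
        11: "Fischer bond",
        12: "close contact",
    },
}


def describe_bond_code(tag, code):
    """Translate a numeric code for the given SD tag into a readable label."""
    if tag not in BOND_CODES:
        raise KeyError("unknown SD tag: " + tag)
    return BOND_CODES[tag].get(code, "code not listed on the slide")


print(describe_bond_code("PUBCHEM_NONSTANDARDBOND", 5))   # dative bond
print(describe_bond_code("PUBCHEM_BONDANNOTATIONS", 12))  # close contact
```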
  26. Final thoughts: abstract
      For all of the grief that I give Evan, often over corner cases of chemical semantics that only one or two people care about, it is fair to say that PubChem represents the current state of the art in chemical structure representation. Nobody does it better. Under the surface, unseen by most users, are a large number of technical and scientific innovations that have enabled PubChem to scale over the past decade and a half to now contain approaching 100 million compounds. From simple design decisions such as the substance vs. compound distinction [that allows PubChem to avoid the early mistakes of CAS] to breakthroughs such as canonical Kekulé SMILES [to avoid the early mistakes of Daylight Chemical Information Systems], the architecture of PubChem contains a treasure trove of cheminformatics innovations, covering normalization, tautomers, mixtures, 2D fingerprints and similarity, substructure search, biopolymers, text mining and much more. During this presentation I hope to share some of the cool insights that the remarkable staff at the NCBI often forget to mention or are too modest to point out. Congratulations Evan and Steve.
  27. Acknowledgements
      • Evan Bolton, Steve Bryant, Paul Thiessen, Volker Hähnke, David Lipman and the PubChem team at the NCBI.
      • John May, at NextMove Software, for the analysis of PubChem atom types affected by Biovia changes.
      • The rest of the team at NextMove Software.
      • George Vacek and the team at OpenEye Scientific Software.
