Connecting chemistry-to-biology

Progress in drug discovery and chemical biology is hugely enabled by curated document-assay-result-compound-target relationships (D-A-R-C-P) in open databases from resources such as the Guide to Pharmacology and ChEMBL. These are synergistically integrated into PubChem, which pre-computes chemical similarity and connectivity between over 95 million structures and 5.6 million BioAssay results. It also links chemistry to documents via various additional routes, including MeSH and large-scale submissions from publishers. However, these efforts are patchy and very few journals facilitate such connectivity. There thus remains a massive shortfall in public D-A-R-C-P capture from decades of papers and patents. This presentation will cover these aspects and discuss their partial amelioration by options such as author-driven depositions and open lab-book approaches as used by Open Source Malaria.
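The D-A-R-C-P chain described above can be pictured as a single structured record. What follows is a minimal Python sketch under stated assumptions, not a schema used by PubChem, ChEMBL or the Guide to Pharmacology: the class name `DarcpRecord`, its field names and every value in the example are hypothetical placeholders chosen for illustration only.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DarcpRecord:
    """One document-assay-result-compound-target link (hypothetical layout)."""

    document: str        # D: where the result was reported, e.g. a DOI or PubMed ID
    assay: str           # A: short description of the assay
    result_type: str     # R: activity parameter, e.g. "IC50", "Ki" or "Kd"
    result_value: float  # R: numeric value of that parameter
    result_units: str    # R: units of the value, e.g. "nM"
    compound: str        # C: structure identifier, e.g. an InChIKey
    target: str          # P: target identifier, e.g. a protein accession


def to_json(record: DarcpRecord) -> str:
    """Serialise a record to JSON with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def from_json(text: str) -> DarcpRecord:
    """Rebuild a record from the JSON produced by to_json."""
    return DarcpRecord(**json.loads(text))


# Every value below is a made-up placeholder, not real data.
example = DarcpRecord(
    document="PLACEHOLDER-DOCUMENT-ID",
    assay="placeholder assay description",
    result_type="IC50",
    result_value=50.0,
    result_units="nM",
    compound="PLACEHOLDER-INCHIKEY",
    target="PLACEHOLDER-TARGET-ID",
)

round_tripped = from_json(to_json(example))
```

Keeping all five parts in one flat record is only one possible design; it makes the point that the chain is lost as soon as any one of the five fields is left inside a PDF.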

Published in: Science

Connecting chemistry-to-biology

  1. Why is connecting chemistry-to-biology in open sources more difficult than it should be? Presented at UCL School of Pharmacy, London, 13 June 2019. Hosted by Professor Matthew Todd. Christopher Southan
  2. Abstract: Progress in drug discovery and chemical biology is hugely enabled by curated document-assay-result-compound-target relationships (D-A-R-C-P) in open databases from resources such as the Guide to Pharmacology and ChEMBL. These are synergistically integrated into PubChem, which pre-computes chemical similarity and connectivity between over 95 million structures and 5.6 million BioAssay results. It also links chemistry to documents via various additional routes, including MeSH and large-scale submissions from publishers. However, these efforts are patchy and very few journals facilitate such connectivity. There thus remains a massive shortfall in public D-A-R-C-P capture from decades of papers and patents. This presentation will cover these aspects and discuss their partial amelioration by options such as author-driven depositions and open lab-book approaches as used by Open Source Malaria.
  3. Outline • D-A-R-C-P • Chemistry space • Biocuration challenges • Biocuration sources • Chemistry-to-document • OSM engagement • Conclusions
  4. The core of the problem: "We have spent millions putting chemistry into PDFs but now we are spending more millions taking it back out" (Anon)
  5. The chemistry <-> biology join • Chemistry that does something significant in vitro, in cellulo, in vivo or in the clinic • Major bioactivity domains from drug discovery, chemical biology and ecology • Some cases are not adequately covered by this simple relationship chain (e.g. heparin as an indirect inhibitor of thrombin, or where P could be a bacterium or protozoan) • The majority of data is still primarily archived in papers and patent documents • Upper-limit statistics for quality publications are essentially unknown • Relationship chain: D-A-R-C-P
  6. So how much disinterred chemistry is out there?
  7. But getting D-A-R-C-P out of text is hard
  8. Unsung heroes: expert extraction of D-A-R-C-P by biocurators is hard for many reasons, including: • Poor continuity of funding and career support • Entity disambiguation challenges • Unintentional obfuscation, ambiguity and errors by authors (and occasionally deliberate obfuscation by patent applicants) • Difficulty capturing the nuances and complexities of molecular mechanisms of action (e.g. prodrugs or no molecular target) • Even primary activity parameters (IC50, Ki, Kd) show ~10-fold variation between publications for nominally the same assays • Judging the quality and potential reproducibility of the publications selected for extraction • Publisher guidelines are only slowly beginning to address the above • Author engagement with assay and target ontologies is limited
  9. Disinterment from the PDF tomb (I): image extraction > structure • Real chemists sketch images in a jiffy • The rest of us can use OSRA: Optical Structure Recognition Application
  10. Chemistry disinterment from PDF tombs (II): IUPAC name > structure
  11. Commercial biocuration of D-A-R-C-P: Excelra (formerly GVKBIO) GOSTAR stats from 2015 • 1.3 million compounds from 112K papers (~15 per paper) • 3.5 million compounds from 70K patents (~50 per patent) • 3,882 human targets
  12. Open biocuration of D-A-R-C-P (I)
  13. Open biocuration of D-A-R-C-P (II)
  14. Biocuration and BioAssay merging into PubChem
  15. Chemistry < > document as a proxy for full D-A-R-C-P
  16. Key paper on PubChem < > PubMed
  17. Recent large-scale chem < > doc PubChem submissions • Generally a good thing but with caveats • Difficult to automate filtration to identify the “aboutness” of key compounds • Issues with indexing of non-PubMed, DOI-only journal papers • Quality of CNER (chemical named entity recognition) chemistry extraction • Introduces a parallel document < > structure mapping system into PubChem
  18. Reciprocal links > virtuous circles (I) • GtoPdb users can navigate “out” via PubChem or PubMed • NCBI users can navigate “in” via PubChem or PubMed
  19. Reciprocal links > virtuous circles (II) • GtoMPdb users can navigate “out” via PubChem or PubMed • NCBI users can navigate “in” via PubChem or PubMed
  20. Grappling with Open Source Malaria (in a good way :)
  21. O_S_M
  22. OSM-S-363 data links (I)
  23. OSM-S-363 (II)
  24. OSM-S-363 (III)
  25. OSM open data sheet: the next slide shows the results of uploading 782 InChIKeys to PubChem
  26. Statistics of OSM PubChem matches
  27. Rounding off
  28. Emerging capture challenges for bioactivity
  29. Conclusions • The bioscience community (including big data miners) still has its collective feet nailed to the floor by the 5-decade backlog of scientifically valuable bioactive chemistry relationships entombed in PDF papers and patents • Biocuration of D-A-R-C-P makes a crucial contribution but at limited scale • Automated entity extraction is advancing but is way behind the specificity of mechanistic biocuration and is publisher-constrained • The existence of several parallel document < > chemistry systems (e.g. MeSH, IBM, ChEMBL, EPMC, Springer Nature, Thieme, Wikidata) is enabling but also confusing • The spread of Open Science ELNs is good to see, but findability, searchability and database submissions still need to be optimised • The need remains to facilitate a flow of published (including preprinted) author-specified bioactive chemistry direct to databases (even if the papers are FAIR)
  30. Proposed core of the solution: “Mandating authors to explicitly connect chemical structures to their experimental bioactivity results in a form (extrinsic to PDF) that is FAIR, structured, includes metadata, machine readable, ontologised, transferable to open database records and reciprocally linked to their publications” (Southan 2019) • This is, of course, a counsel of perfection • In essence, authors should become biocurators • Currently only a few papers with data sets submitted to PubChem BioAssay by authors would conform • This has been technically feasible for at least a decade • The impediments are thus sociological and tied to publishing models
  31. Further info
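Slides 25 and 26 describe uploading 782 InChIKeys to PubChem and counting the matches. The slides do not say which PubChem interface was used, so the following Python sketch is only one possible approach: it checks that each string has the standard InChIKey layout, builds a PubChem PUG REST lookup URL for it, and groups keys by their first 14-character block, which encodes the connectivity skeleton. The function names are hypothetical, and the one real key used in the usage note is the well-known InChIKey for aspirin, not an OSM compound.

```python
import re
from collections import defaultdict
from urllib.parse import quote

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

# Base address of the PubChem PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def is_inchikey(text: str) -> bool:
    """Return True if the stripped text has the standard InChIKey layout."""
    return INCHIKEY_PATTERN.fullmatch(text.strip()) is not None


def cid_lookup_url(inchikey: str) -> str:
    """Build the PUG REST URL that asks PubChem for CIDs matching one key."""
    key = inchikey.strip()
    if not is_inchikey(key):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return f"{PUG_REST_BASE}/compound/inchikey/{quote(key)}/cids/JSON"


def group_by_skeleton(inchikeys):
    """Group well-formed keys by first block; malformed entries are skipped."""
    groups = defaultdict(list)
    for raw in inchikeys:
        key = raw.strip()
        if is_inchikey(key):
            groups[key[:14]].append(key)
    return dict(groups)
```

Calling `cid_lookup_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")` returns a URL that can be fetched with any HTTP client. No network call is made in the sketch itself, so rate limiting and the handling of keys with no PubChem match are left to the caller.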
