The document summarizes the implications of the large influx of patent chemistry data into PubChem from various sources performing chemical named entity recognition (CNER) on patent texts. Over 30 million structures have been added from these sources. While this "Big Bang" greatly expands the available chemistry, there are also caveats to consider like fragmentation of structures, inclusion of mixtures and virtual structures, and the fact that most added structures lack associated bioactivity data. The opportunities for data mining are significant but care must be taken to understand the limitations and artifacts of the automated extraction methods.
The open patent chemistry “big bang”: Implications, opportunities and caveats
1. www.guidetopharmacology.org
The Open Patent Chemistry “Big Bang”:
Implications, Opportunities and Caveats
Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY,
Centre for Integrative Physiology, University of Edinburgh
http://www.slideshare.net/cdsouthan/the-open-patent-chemistry-big-bang-implications-opportunities-and-caveats
Prepared for
2. Outline
• Big Bang in PubChem
• Balancing IP against bioactivity mining
• Relative source coverage
• Comparing Mwts
• Activity gap
• Unique content
• Mixtures
• CWUs
• Virtuals of various types
• Orthogonal paper
• Conclusions
• References
3. History of patent chemistry feeds into PubChem
• 2006 - Thomson (Reuters) Pharma (TRP), manual extractions from patents and papers (now 4.3 mil, ~40% patents)
• 2011 - IBM phase 1, chemical named entity recognition (CNER), 2.5 mil
  - SLING Consortium EPO extraction, 0.1 mil
• 2012 - SCRIPDB, CNER plus Complex Work Units (CWU), 4.0 mil
• 2013 - SureChem, CNER + image, 9.0 mil
• 2014 - BindingDB USPTO assay extraction (CWU), 0.07 mil
• 2015 - (CNER + images + CWU)
  • SureChEMBL, 13.0 mil
  • IBM phase 2, 7.0 mil
  • NextMove Software, 1.4 mil, synthesis mapping
4. “Big Bang” of CNER PubChem source submissions (SIDs)
(Chart labels: IBM II + SureChEMBL + NM; IBM I; SCRIPDB)
5. Current PubChem patent chemistry
• 31.7 mil patent-extracted structures (Oct 2015)
• = 20% of 158 mil total Substance Identifiers (SIDs)
• CIDs with patent SIDs = 17.8 mil from a total of 60.8 mil = ~30%
• 2.8 million patent document numbers indexed
• * TRP estimated and “half-open” (i.e. structures and dates are open, but document links require a Cortellis subscription)
SID counts in mil
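The percentages on this slide can be checked from the quoted counts. The following is a minimal sketch in Python; the variable names and the `share` helper are illustrative assumptions, and the inputs are simply the figures from the slide (in millions, Oct 2015).

```python
# Counts in millions, copied from the slide (Oct 2015 snapshot).
patent_sids = 31.7   # patent-extracted substance records (SIDs)
total_sids = 158.0   # all SIDs in PubChem
patent_cids = 17.8   # compounds (CIDs) with at least one patent SID
total_cids = 60.8    # all CIDs in PubChem


def share(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


sid_share = share(patent_sids, total_sids)
cid_share = share(patent_cids, total_cids)

print(f"patent SIDs: {sid_share:.1f}% of all SIDs")  # 20.1%
print(f"patent CIDs: {cid_share:.1f}% of all CIDs")  # 29.3%
```

The CID share comes out slightly under the rounded 30% quoted on the slide, which is consistent with rounding in the source figures.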
6. Opportunities from the Big Bang:
balancing the IP vs SAR utility split
IP assessment
• De facto crucial prior art
• Differential coverage as an adjunct to
commercial sources
• Facilitates IP mining for those who
cannot afford commercial offerings
• PubChem content is chemistry from
patents, not patented chemistry
• CNER is brainless compared to expert
IP-relevance selection
• Claim extraction generally poor
• CNER-extracted chemistry artefacts can
confound assessments (e.g. virtuals)
• Dense image tables still a coverage gap
• Major sources currently static in
PubChem (except SureChEMBL & TRP)
• Asian chemistry shortfall
• The “common chemistry” problem
Bioactivity data-mining
• Circa 5x more SAR than the literature
• Chemistry > data via PubChem patent
number indexing > free full-text
• Patent families collapse to < 100K
C07D primary documents
• Advanced query options in
SureChEMBL including SciBite
bioentity mark-up
• Challenge of judging scientific quality
• Synthesis extraction (NextMove)
• Valuable intersects with papers and
targets via ChEMBL
• Easy intersecting with DIY chemistry
extraction from any document
• Only ~ 5 mil structures potentially
linkable to bioactivity data
• Thus ~ 12 million have marginal utility
• Drug structure multiplexing problem
7. Major PubChem CNER patent sources at the compound level:
structural corroboration but also divergence
SCRIPDB = 4.0 (SID:CID 1.5); IBM = 7.9 (SID:CID 1.2); SureChEMBL = 14.6 (SID:CID 1.0)
Venn diagram region counts (extraction order; region assignment not preserved): 0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95
Counts are Compound Identifiers (CIDs) in millions, with a union of 17.8
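As a consistency check on the figures for this slide, the seven region counts should sum to the stated union, and the three per-source totals should exceed the union by the amount of shared content. A minimal sketch in Python, assuming the seven numbers are the seven regions of a three-set diagram (the transcript does not say which number belongs to which region, so none is assigned here):

```python
# Seven region counts (millions of CIDs), in the order they appear in the transcript.
regions = [0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95]

# Per-source CID totals and the union, as quoted on the slide.
source_totals = {"SCRIPDB": 4.0, "IBM": 7.9, "SureChEMBL": 14.6}
union_reported = 17.8

# Each CID falls in exactly one region, so the regions sum to the union.
union_from_regions = sum(regions)

# Summing per-source totals counts a CID once per source that holds it;
# the excess over the union is the number of extra (redundant) memberships.
overcount = sum(source_totals.values()) - union_from_regions

print(f"union from regions: {union_from_regions:.2f} (reported {union_reported})")
print(f"extra source memberships: {overcount:.2f} mil")
```

The region sum (17.75) agrees with the reported union of 17.8 to within rounding.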
8. Patent CNER vs manual bioactivity sources in PubChem:
structural corroboration but also divergence
SCRIPDB + IBM + SureChEMBL = 17.8; Thomson (Reuters) Pharma = 4.3; ChEMBL = 1.4
Venn diagram region counts (extraction order; region assignment not preserved): 16.13, 0.18, 0.12, 0.90, 1.35, 0.26, 2.55
Counts are CIDs in millions
10. The bioactivity-gap:
majority of patent chemistry has no linked data
1.8 mil CNER CIDs
Compare with a bioactivity-focussed source, e.g. Guide to PHARMACOLOGY (GtoPdb): 6037 CIDs
15. Mixture extractions: more problematic than useful
N.b. PubChem ameliorates the issue by splitting all SID/CID mixtures to
component CIDs while maintaining the back-mapping
18. Virtuals II: stereo enumerations from US 20080085923
260 CIDs > 581 SIDs from IBM, SureChEMBL, SCRIPDB, Thomson Pharma and Discovery Gate
19. Virtuals III: deuterated enumerations from US20080045558
986 deuterated CIDs > 2818 SIDs from IBM, SureChEMBL and SCRIPDB
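The CID and SID counts quoted for the two enumeration examples imply how many source submissions each virtual structure received on average. A minimal sketch in Python, using only the numbers from the slides; the dictionary keys are illustrative labels, not PubChem identifiers:

```python
# (CIDs, SIDs) as quoted on the two "Virtuals" slides.
examples = {
    "US 20080085923 (stereo)": (260, 581),
    "US20080045558 (deuterated)": (986, 2818),
}

# Average number of substance submissions (SIDs) per unique compound (CID).
ratios = {name: sids / cids for name, (cids, sids) in examples.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} SIDs per CID")
```

Both ratios are above 2, i.e. on average each enumerated structure was submitted by more than two sources.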
20. Very virtual: d100 dalbavancin
Submitted to PubChem by Thomson Pharma (only) on 16th of March 2009
21. Recent orthogonal analysis of Big Bang impact
• Compares SureChEMBL and IBM with SciFinder and Reaxys for a small
patent set (i.e. open vs commercial)
• Concludes: “50–66 % of the relevant content from the latter was also found
in the former”
• Equivalent comparisons executed in PubChem, along the lines presented
here, would record a higher overlap
• This would be via contributions from the other three open sources and
mixture splitting
• Note the update schedule for SureChEMBL in PubChem will be quarterly, but
new patent chemistry surfaces in SureChEMBL at the EBI within 2-4 days and
is refreshed in the EBI UniChem resource ~ monthly
Managing expectations: assessment of chemistry databases generated by
automated extraction of chemical structures from patents, Senger, et al. J.
Cheminf. 2015, 7:49 doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL)
22. Conclusions
• The “Big Bang” value massively outweighs the caveats
• All sources contributing to open patent chemistry are to be congratulated,
and PubChem for wrangling them
• PubChem slice-and-dice functionality is informative for comparing sources
• Bioactivity mining is extensively enabled but still challenging
• IP assessment also not straightforward but playing field has levelled
• But we do need to look the gift horse in the mouth
• Important to resolve and understand quirks, artefacts and pitfalls
• PubChem filters can partially ameliorate some of these
• Between open and commercial we are approaching the best of both worlds
• It will be interesting to see where we go from here
23. References and questions please
http://cdsouthan.blogspot.com/ 19 posts have the tag “patents”
http://www.ncbi.nlm.nih.gov/pubmed/26194581 http://www.ncbi.nlm.nih.gov/pubmed/23506624
(with PubMed Commons data link)
N.b. from the aspect of reproducibility, anyone needing technical tips to reproduce or
extend the PubChem queries used for these slides is welcome to contact me
www.ncbi.nlm.nih.gov/pubmed/25415348
ACS “Deuterogate” slides http://www.slideshare.net/cdsouthan/causes-and-consequences-of-automated-extraction-of-patentspecified-virtual-deuterated-drugs
http://nar.oxfordjournals.org/content/early/2015/10/11/nar.gkv1037