The open patent chemistry “big bang”: Implications, opportunities and caveats

www.guidetopharmacology.org
Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY,
Centre for Integrative Physiology, University of Edinburgh
http://www.slideshare.net/cdsouthan/the-open-patent-chemistry-big-bang-implications-opportunities-and-caveats

Outline
• Big Bang in PubChem
• Balancing IP against bioactivity mining
• Relative source coverage
• Comparing Mwts
• Activity gap
• Unique content
• Mixtures
• CWUs
• Virtuals of various types
• Orthogonal paper
• Conclusions
• References

History of patent chemistry feeds into PubChem
• 2006 - Thomson (Reuters) Pharma (TRP) manual extractions from patents and papers (now 4.3 mil, ~40% patents)
• 2011 - IBM phase 1 chemical named entity recognition (CNER) 2.5 mil
- SLING Consortium EPO extraction 0.1 mil
• 2012 - SCRIPDB, CNER plus Complex Work Units (CWU) 4.0 mil
• 2013 - SureChem, CNER + image, 9.0 mil
• 2014 - BindingDB USPTO assay extraction (CWU) 0.07 mil
• 2015 - (CNER + images + CWU)
- SureChEMBL 13.0 mil
- IBM phase 2, 7.0 mil
- NextMove Software 1.4 mil synthesis mapping

“Big Bang” of CNER PubChem source submissions (SIDs)
[Figure: SID submissions, with labelled series IBM I; SCRIPDB; IBM II + SureChEMBL + NM]

Current PubChem patent chemistry
• 31.7 mil patent-extracted structures (Oct 2015)
• = 20% of 158 mil total Substance Identifiers (SIDs)
• CIDs with patent SIDs = 17.8 mil from a total of 60.8 mil = ~30%
• 2.8 million patent document numbers indexed
• * TRP estimated and “half-open” (i.e. structures and dates are open, but document links require a Cortellis subscription)
SID counts in mil
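
The two percentages on this slide follow directly from the quoted counts. A minimal Python check, using only the figures stated above (the variable and function names are illustrative, not from the slides):

```python
# Counts in millions, as quoted on the slide (Oct 2015).
patent_sids = 31.7   # patent-extracted structures (SIDs)
total_sids = 158.0   # all PubChem SIDs
patent_cids = 17.8   # CIDs with at least one patent SID
total_cids = 60.8    # all PubChem CIDs


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


sid_share = share(patent_sids, total_sids)
cid_share = share(patent_cids, total_cids)
print(f"Patent share of SIDs: {sid_share:.1f}%")
print(f"Patent share of CIDs: {cid_share:.1f}%")
```

The CID share comes out slightly under the rounded 30% quoted on the slide.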

Opportunities from the Big Bang:
balancing the IP vs SAR utility split
IP assessment
• De facto crucial prior art
• Differential coverage as an adjunct to commercial sources
• Facilitates IP mining for those who cannot afford commercial offerings
• PubChem content is chemistry from patents, not patented chemistry
• CNER is brainless compared to expert IP-relevance selection
• Claim extraction generally poor
• CNER-extracted chemistry artefacts can confound assessments (e.g. virtuals)
• Dense image tables still a coverage gap
• Major sources currently static in PubChem (except SureChEMBL & TRP)
• Asian chemistry shortfall
• The “common chemistry” problem
Bioactivity data-mining
• Circa 5x more SAR than the literature
• Chemistry > data via PubChem patent number indexing > free full-text
• Patent families collapse to < 100K C07D primary documents
• Advanced query options in SureChEMBL, including SciBite bioentity mark-up
• Challenge of judging scientific quality
• Synthesis extraction (NextMove)
• Valuable intersects with papers and targets via ChEMBL
• Easy intersecting with DIY chemistry extraction from any document
• Only ~5 mil structures potentially linkable to bioactivity data
• Thus ~12 mil have marginal utility
• Drug structure multiplexing problem

Major PubChem CNER patent sources at the compound level:
structural corroboration but also divergence
[Venn diagram; counts are Compound Identifiers (CIDs) in millions, with a union of 17.8]
• SCRIPDB = 4.0 (SID:CID 1.5)
• IBM = 7.9 (SID:CID 1.2)
• SureChEMBL = 14.6 (SID:CID 1.0)
• Segment counts: 0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95
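
As a consistency check on this Venn diagram, the seven disjoint segment counts should add up to the quoted union. The sketch below uses only the numbers printed on the slide. Which segment sits where in the diagram is not assumed, except that a segment can only lie inside sets at least as large as itself.

```python
# Venn segment counts (CIDs, millions) as printed on the slide.
segments = [0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95]
set_totals = {"SCRIPDB": 4.0, "IBM": 7.9, "SureChEMBL": 14.6}
quoted_union = 17.8

# The seven disjoint segments of a three-set Venn diagram sum to the union.
union = sum(segments)
print(f"Sum of segments: {union:.2f} (quoted union: {quoted_union})")

# A segment cannot be larger than any set that contains it, so the largest
# segment can only belong to sets whose total is at least that large.
largest = max(segments)
owners = [name for name, total in set_totals.items() if total >= largest]
print(f"Sets able to contain the {largest} segment: {owners}")

unique_share = 100 * largest / set_totals["SureChEMBL"]
print(f"Share of SureChEMBL CIDs in that segment: {unique_share:.0f}%")
```

Only SureChEMBL is large enough to hold the 8.56 segment, so well over half of its CIDs are not corroborated by the other two CNER sources.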

Patent CNER vs manual bioactivity sources in PubChem:
structural corroboration but also divergence
[Venn diagram; counts are CIDs in millions]
• SCRIPDB + IBM + SureChEMBL = 17.8
• Thomson (Reuters) Pharma = 4.3
• ChEMBL = 1.4
• Segment counts: 16.13, 0.18, 0.12, 0.90, 1.35, 0.26, 2.55
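
The divergence between the automated and the manually curated sources can be put in rough numbers. The sketch below assumes only that the 16.13 segment is CNER-only, which follows from it being larger than either manual source in total. The variable names are illustrative.

```python
# Counts in millions of CIDs, as printed on the slide.
cner_total = 17.8  # SCRIPDB + IBM + SureChEMBL
manual_totals = {"Thomson (Reuters) Pharma": 4.3, "ChEMBL": 1.4}
segments = [16.13, 0.18, 0.12, 0.90, 1.35, 0.26, 2.55]

# The largest segment exceeds both manual-source totals, so it cannot lie
# inside either of them: it must be the CNER-only segment.
cner_only = max(segments)
fits_manual = [name for name, total in manual_totals.items() if total >= cner_only]

# Whatever remains of the CNER set is shared with at least one manual source.
corroborated = cner_total - cner_only
corroborated_share = 100 * corroborated / cner_total
print(f"CNER-only CIDs: {cner_only:.2f} mil")
print(f"CNER CIDs also in a manual source: ~{corroborated:.2f} mil "
      f"(~{corroborated_share:.0f}% of the CNER set)")
```

On these figures, fewer than one in ten CNER-extracted CIDs is also present in TRP or ChEMBL, subject to the rounding of the slide values.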

Mw plots indicate the CNER fragmentation problem

The bioactivity-gap:
majority of patent chemistry has no linked data
1.8 mil CNER CIDs
Compare with a bioactivity-focussed source, e.g. Guide to PHARMACOLOGY (GtoPdb): 6037 CIDs

Patent-unique structures: a mixed blessing

Patent-picking: vendors listing probable non-stock structures
This has been reduced since the recent deprecation of 20 million Angene SIDs

CNER whitespace problem: mixtures from WO2010053438

US6589997: missing punctuation > CNER fails and mixtures
NextMove
SureChEMBL (have now fixed this document)

Mixture extractions: more problematic than useful
N.b. PubChem ameliorates the issue by splitting all SID/CID mixtures to component CIDs while maintaining the back-mapping
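
The idea of splitting mixtures while keeping a back-mapping can be sketched in a few lines of Python. This is not PubChem's actual pipeline. It assumes mixtures are written as dot-disconnected SMILES, uses hypothetical record IDs, and skips the structure canonicalisation that a real workflow would need from a cheminformatics toolkit.

```python
from collections import defaultdict
from typing import Dict, List, Set


def split_mixture(smiles: str) -> List[str]:
    """Split a dot-disconnected SMILES into its component SMILES strings."""
    return [part for part in smiles.split(".") if part]


def build_back_mapping(records: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each component SMILES to the (hypothetical) record IDs containing it."""
    back_map: Dict[str, Set[str]] = defaultdict(set)
    for record_id, smiles in records.items():
        for component in split_mixture(smiles):
            back_map[component].add(record_id)
    return dict(back_map)


# Hypothetical records: two mixtures and one single-component entry.
records = {
    "rec1": "CCO.O",        # ethanol + water
    "rec2": "CC(=O)O.CCO",  # acetic acid + ethanol
    "rec3": "CCO",          # ethanol alone
}
back_map = build_back_mapping(records)
print(sorted(back_map["CCO"]))
```

Each component can then be searched on its own, while the back-mapping still shows which extracted mixture records it came from.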

CWU chemistry: from the sublime…

To the ridiculous… “Chessbordane” CWU virtuals
C362H422
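
To put the formula above in perspective, its molecular weight can be estimated from standard atomic weights. The sketch below handles only simple element-count formulae. The 1000 cut-off is a hypothetical value chosen for illustration, not a threshold from the slides.

```python
import re

# Standard atomic weights for the two elements needed here.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008}


def molecular_weight(formula: str) -> float:
    """Sum atomic weights for a simple formula such as 'C362H422'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[element] * (int(count) if count else 1)
    return total


mw = molecular_weight("C362H422")
print(f"C362H422: {mw:.1f} g/mol")

# Hypothetical cut-off for illustration only; not a value from the slides.
MAX_MW = 1000.0
print("Flag as implausibly large:", mw > MAX_MW)
```

A simple molecular weight filter of this kind is one way such CWU virtuals could be screened out before analysis.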

Virtuals II: stereo enumerations from US20080085923
260 CIDs > 581 SIDs from IBM, SureChEMBL, SCRIPDB, Thomson Pharma and Discovery Gate

Virtuals III: deuterated enumerations from US20080045558
986 deuterated CIDs > 2818 SIDs from IBM, SureChEMBL and SCRIPDB
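
Enumerations like this grow quickly because each hydrogen position can independently be H or D. The sketch below counts the possible deuteration patterns for a molecule with a given number of hydrogen sites. It assumes every site is distinct (symmetry is ignored), the site counts are hypothetical, and it does not reproduce how the patent itself enumerated its structures.

```python
from itertools import combinations


def deuteration_patterns(n_sites: int):
    """Yield every non-empty subset of hydrogen positions that could carry deuterium."""
    sites = range(n_sites)
    for k in range(1, n_sites + 1):
        yield from combinations(sites, k)


def count_patterns(n_sites: int) -> int:
    """Closed form for the number of non-empty subsets of n sites: 2**n - 1."""
    return 2 ** n_sites - 1


# Hypothetical site counts, chosen only to show the growth rate.
for n in (3, 5, 10):
    enumerated = sum(1 for _ in deuteration_patterns(n))
    print(f"{n} sites -> {enumerated} patterns (closed form: {count_patterns(n)})")
```

Even ten exchangeable positions allow over a thousand patterns, which is why a single patent can seed hundreds of virtual deuterated CIDs.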

Very virtual: d100 dalbavancin
Submitted to PubChem by Thomson Pharma (only) on 16 March 2009

Recent orthogonal analysis of Big Bang impact
• Compares SureChEMBL and IBM with SciFinder and Reaxys for a small patent set (i.e. open vs commercial)
• Concludes: “50–66 % of the relevant content from the latter was also found in the former”
• Equivalent comparisons executed in PubChem, along the lines presented here, would record a higher overlap
• This would be via contributions from the other three open sources and mixture splitting
• Note the update schedule for SureChEMBL in PubChem will be quarterly, but new patent chemistry surfaces in SureChEMBL at the EBI within 2-4 days and is refreshed in the EBI UniChem resource ~monthly
Senger, et al. Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents. J. Cheminf. 2015, 7:49, doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL)

Conclusions
• The “Big Bang” value massively outweighs the caveats
• All sources contributing to open patent chemistry are to be congratulated, as is PubChem for wrangling them
• PubChem slice-and-dice functionality is informative for comparing sources
• Bioactivity mining is extensively enabled but still challenging
• IP assessment is also not straightforward, but the playing field has levelled
• But we do need to look the gift horse in the mouth
• Important to resolve and understand quirks, artefacts and pitfalls
• PubChem filters can partially ameliorate some of these
• Between open and commercial we are approaching the best of both worlds
• It will be interesting to see where we go from here

References and questions please
http://cdsouthan.blogspot.com/ 19 posts have the tag “patents”
http://www.ncbi.nlm.nih.gov/pubmed/26194581 http://www.ncbi.nlm.nih.gov/pubmed/23506624
(with PubMed Commons data link)
N.b. For reproducibility, anyone needing technical tips to reproduce or extend the PubChem queries used for these slides is welcome to contact me
www.ncbi.nlm.nih.gov/pubmed/25415348
ACS “Deuterogate” slides http://www.slideshare.net/cdsouthan/causes-and-consequences-of-automated-extraction-of-patentspecified-virtual-deuterated-drugs
http://nar.oxfordjournals.org/content/early/2015/10/11/nar.gkv1037
