1. www.guidetopharmacology.org
The open patent chemistry “big bang”:
large opportunities for small enterprises
Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY,
Centre for Integrative Physiology, University of Edinburgh
ACS Mon, Mar 14 CINF: Division of Chemical Information, 79
SESSION: Chemical Information for Small Businesses & Startups
1:00 PM - 4:55 PM, Room 24C; talk 4:25 PM - 4:50 PM
http://www.slideshare.net/cdsouthan/patent-chemisty-big-bang-utilities-for-smes
2. Abstract (will be skipped for presentation)
In 2012, after the first IBM open deposition of 2.5 million structures, few would have
predicted that PubChem compounds that include patent-extracted submissions would
approach 20 million by 2015 (PMID 26194581). The current major open patent
chemistry feeds (in ascending size order) are NextMove, SCRIPDB, Thomson Pharma,
IBM and SureChEMBL. Comparative statistics for these sources will be presented,
along with the argument that the probability of coverage for lead-compound
prior-art structures is now very high. As a consequence, the academic community
and small companies can now patent-mine extensively in PubChem and SureChEMBL,
possibly even without needing commercial sources to support their own filings.
Other recent major enabling aspects for small institutions include a) the open
availability of patent full-text for querying, b) a range of free tools for DIY
chemistry extraction (PMID 23618056) and c) automatic bioentity mark-up in patent
text (e.g. protein names) from the SureChEMBL/SciBite collaboration. Examples of
DIY analysis of newly published patents will be shown. Even for small enterprises
not filing directly, open patent chemistry presents a big expansion in accessible
SAR space, and aspects of mining this will be exemplified. However, open chemistry
extraction does bring in a variety of artefacts that add confounding structural
“noise”. These include a) permutations of mixtures and chiral exemplifications,
b) virtual structures, c) the inability of document extractions to indicate IP
status directly and d) “common chemistry” swamping.
These problems and some partial solutions using PubChem filters will be discussed.
4. Outline
• Balancing IP against bioactivity mining
• Source coverage for patent extraction
• Caveats with automated extraction
• The example of US9056843
• Source extraction comparisons
• DIY extraction
• Questions on open searching
• Conclusions
• References
5. IP vs SAR from open patent mining
IP assessment
• Essential source of prior art chemistry
• De facto adjunct to commercial sources
• Improved portals (EPO, WIPO, FPOL)
• SureChEMBL, TRP & BindingDB active
• PubChem content is chemistry from
patents, not patented chemistry
• CNER brainless compared to expert IP-
relevance selection
• Claim section extraction often weak
• Extracted artefacts confounding (e.g.
mixtures & virtuals)
• Dense image tables still a coverage gap
• IBM and SCRIPDB static in PubChem
• Asian chemistry shortfall
• The “common chemistry” problem
• Patent blitzing for drug candidates
Bioactivity data mining
• Circa 5x more SAR than literature
• Patent families collapse to < 100K
C07D primary documents
• Advanced query options in
SureChEMBL
• Bulk synthesis extraction (NextMove)
• Valuable intersects with papers,
authors and targets via ChEMBL
• Easy intersecting with DIY chemistry
extraction from any document
• Obfuscation of the example > assay data mapping
• Challenge of judging scientific quality
• Only ~5 million structures potentially
linkable to bioactivity data
• Thus ~15 million have marginal utility
• CNER > structural multiplexing
6. Big chemistry: prior art statistics
March 2016 snapshots
• GDB-13: 907 million virtual structures (similarity search)
• Google InChIKey: 120+? million (exact match search)
• EBI UniChem: 110.7 million, 27 sources (exact match search)
• CAS: 109 million substances (commercial, similarity search)
• PubChem: 89 million, 390 sources (similarity search)
• ChemSpider: 43 million, 510 sources (similarity search)
• SureChEMBL: 16.8 million (similarity search)
• GVKBio: 6.2 million (commercial bioactivity capture from patents and
papers, similarity search)
7. History of patent chemistry feeds into PubChem
• 2006 - Thomson (Reuters) Pharma (TRP) manual extractions from
patents and papers (now 4.3 mil, ~40% patents)
• 2011 - IBM phase 1 chemical named entity recognition (CNER), 2.5 mil
- SLING Consortium EPO extraction, 0.1 mil
• 2012 - SCRIPDB, CNER plus Complex Work Units (CWU), 4.0 mil
• 2013 - SureChem, CNER + image, 9.0 mil
• 2014 - BindingDB USPTO assay extraction (now 0.08 mil)
• 2015 - (CNER + images + CWU)
- SureChEMBL, 13.0 mil
- IBM phase 2, 7.0 mil
- NextMove Software, 1.4 mil synthesis mapping
• 2016 - SureChEMBL, 15.8 mil
- CIDs from CNER extractions, 19.1 mil (from 88.8 mil, 4th March)
- Total patent chemistry with estimate from TRP, ~20.5 mil
8. CNER patent sources vs. patent and paper curation:
corroboration and divergence
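[Venn diagram: PubChem Compound Identifier (CID) counts, in millions, for three source groups]
• CNER patent sources (IBM + SCRIPDB + SureChEMBL + NextMove) = 19.01
• ChEMBL20 = 1.45
• Thomson Pharma = 4.3
• Segment values shown in the diagram: 17.3, 0.18, 1.4, 2.5, 0.12, 0.25, 0.9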
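The extracted slide text gives the three set totals and seven segment values, but not which Venn region each value belongs to. As a rough consistency check, the sketch below tests one possible assignment. The assignment is an assumption inferred from the arithmetic, not something stated on the slide, and the names `totals`, `segments` and `implied_total` are made up for illustration.

```python
# Set totals as given on the slide (millions of PubChem CIDs).
totals = {"CNER": 19.01, "ChEMBL20": 1.45, "TRP": 4.3}

# ASSUMED assignment of the seven segment values to Venn regions.
# Each key is the set of sources that share the CIDs in that region.
segments = {
    frozenset({"CNER"}): 17.3,
    frozenset({"CNER", "ChEMBL20"}): 0.18,
    frozenset({"CNER", "TRP"}): 1.4,
    frozenset({"TRP"}): 2.5,
    frozenset({"CNER", "ChEMBL20", "TRP"}): 0.12,
    frozenset({"ChEMBL20", "TRP"}): 0.25,
    frozenset({"ChEMBL20"}): 0.9,
}


def implied_total(source):
    """Sum every region that contains the given source."""
    return round(sum(v for k, v in segments.items() if source in k), 2)


implied = {source: implied_total(source) for source in totals}

for source, stated in totals.items():
    print(f"{source}: stated {stated}, implied {implied[source]}")
```

Under this assumed assignment the implied totals agree with the stated ones to within a few hundredths of a million, which is about what rounding of the segment values would allow. Other assignments can be tested by swapping values between regions.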
9. CNER caveats (I) fragmentation: Mw plots
Can be partially ameliorated by using Mw ranking as a filter
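A minimal sketch of what Mw ranking as a filter could look like in practice. The records, the field names and the 200 Da cut-off are all hypothetical and chosen only for illustration; the slide does not give a threshold.

```python
def rank_by_mw(records, min_mw=None):
    """Sort records by molecular weight (descending).

    If min_mw is given, records below it are dropped, on the assumption
    that very low Mw entries are more likely to be extraction fragments.
    """
    ranked = sorted(records, key=lambda r: r["mw"], reverse=True)
    if min_mw is not None:
        ranked = [r for r in ranked if r["mw"] >= min_mw]
    return ranked


# Hypothetical extracted structures with their molecular weights.
records = [
    {"name": "hypothetical fragment A", "mw": 86.1},
    {"name": "hypothetical lead-like compound", "mw": 412.5},
    {"name": "hypothetical fragment B", "mw": 129.2},
]

# The 200.0 cut-off is an arbitrary illustrative value.
kept = rank_by_mw(records, min_mw=200.0)
print([r["name"] for r in kept])
```

A hard cut-off will also discard genuine small molecules, which is one reason the amelioration is only partial.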
10. CNER caveats (II) the bioactivity-gap:
majority of patent chemistry has no linked assay data
11. CNER caveats (III): strange patent-unique structures
• Weird stuff is generally non-biological chemistry (i.e. not A61)
• For the record, C07D = 10.9, A61K = 0.9, (C07D + A61K) = 0.81 mil CIDs
12. CNER caveats (IV): mixture extractions (a mixed blessing)
• Mostly TFA or HCl salts
• Includes combination claims and reactant mixtures
• Causes sources to appear more divergent by exact match statistics
• PubChem splits to component CIDs while maintaining the back-mapping
• Can normalise with “CovalentUnitCount =1” filter
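The covalent-unit idea can be approximated locally when only SMILES strings are to hand. In SMILES, a "." separates disconnected components, so counting the dot-separated parts gives a rough stand-in for a covalent unit count. This is a sketch under that assumption, not PubChem's own computation of CovalentUnitCount, and the function name and example strings are illustrative.

```python
def covalent_unit_count(smiles: str) -> int:
    """Count dot-separated components in a SMILES string.

    Assumption: each "."-separated part is one covalently bonded unit.
    """
    return len([part for part in smiles.split(".") if part])


# Illustrative examples: ethylamine as free base and as two salt forms.
examples = {
    "free base": "CCN",
    "HCl salt": "CCN.Cl",
    "TFA salt": "CCN.OC(=O)C(F)(F)F",
}

# Keep only single-component entries, mimicking a CovalentUnitCount = 1 filter.
single_unit = [
    label for label, smi in examples.items() if covalent_unit_count(smi) == 1
]
print(single_unit)
```

Filtering this way removes the salt and mixture records entirely. It does not map them back to their parent structures, which is what the PubChem component split does.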
16. Extraction splits by source, date and isomeric connectivity:
(it can get complicated….)
Different sources (SIDs) for same
structure (CID)
Different CID isomers with same core
connectivity
18. Extraction source selectivity
• 151 BindingDB CIDs direct from PubChem
• 93 Thomson Pharma CIDs (within the 151 above)
• 296 SDFs from SciFinder > 269 CIDs
• 648 SureChEMBL IDs > 511 CIDs
• Numbers are not absolute because of “round tripping” mapping issues
but they illustrate the selectivity and extent of open coverage
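Comparisons like these reduce to set operations on CID lists. The sketch below shows the arithmetic: the 93-of-151 ratio uses the slide's own counts, while the CID sets and the `coverage` helper are hypothetical and stand in for real per-source downloads.

```python
def coverage(reference, candidate):
    """Fraction of the reference CIDs that also occur in the candidate set."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference set is empty")
    return len(reference & set(candidate)) / len(reference)


# From the slide: 93 Thomson Pharma CIDs fall within the 151 BindingDB CIDs.
thomson_fraction = 93 / 151
print(f"Thomson Pharma covers {thomson_fraction:.1%} of the BindingDB CIDs")

# Hypothetical CID sets, for illustration only.
bindingdb = {101, 102, 103, 104}
surechembl = {101, 102, 103, 201, 202}

print(coverage(bindingdb, surechembl))  # share of BindingDB CIDs also in SureChEMBL
print(sorted(surechembl - bindingdb))   # CIDs extracted only by SureChEMBL
```

Because of the round-tripping issues noted above, fractions computed this way are indicative rather than exact.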
23. Recent comparative analysis
• Compared SureChEMBL and IBM with SciFinder and Reaxys for a small
patent set (i.e. open vs commercial)
• Concluded: “50–66 % of the relevant content from the latter was also
found in the former”
• Equivalent comparisons executed in the latest PubChem with all patent
sources would probably record a higher overlap
Managing expectations: assessment of chemistry databases generated by
automated extraction of chemical structures from patents, Senger, et al. J.
Cheminf. 2015, 7:49 doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL)
http://www.ncbi.nlm.nih.gov/pubmed/26457120
24. First 64K$ Q:
can you search your novel chemistry in open dbs?
• The InChIKey connectivity layer already facilitates blinded exact match
(isomer-agnostic) searching anywhere, including Google
• PubChem and SureChEMBL default to https; so searching is secure
• There is no (and never will be?) patent case law where novelty was challenged
in court based on structures intercepted from public servers
• Without metadata (e.g. target & disease) interception per se is not much use
• As for sequence data, hard evidence of serious competitive damage via
query interception remains zero (after 20+ years)
• Commercial dbs cannot capture all prior art, so need open check anyway
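The isomer-agnostic matching mentioned in the first bullet relies on the first 14-character block of the InChIKey, which hashes the connectivity only. A minimal sketch follows, with hypothetical helper names. The example keys are believed to be ibuprofen (non-stereo and one stereo-specific form) and aspirin, and should be treated as illustrative.

```python
from collections import defaultdict


def connectivity_block(inchikey: str) -> str:
    """Return the first (connectivity) block of a standard InChIKey."""
    block = inchikey.strip().upper().split("-")[0]
    if len(block) != 14 or not block.isalpha():
        raise ValueError(f"not a valid InChIKey first block: {block!r}")
    return block


def group_by_connectivity(inchikeys):
    """Group full InChIKeys that share the same connectivity block."""
    groups = defaultdict(list)
    for key in inchikeys:
        groups[connectivity_block(key)].append(key)
    return dict(groups)


keys = [
    "HEFNNWSXXWATRW-UHFFFAOYSA-N",  # non-stereo form
    "HEFNNWSXXWATRW-JTQLQIEISA-N",  # stereo-specific form, same connectivity
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # unrelated structure
]

groups = group_by_connectivity(keys)
for block, members in groups.items():
    print(block, len(members))

# The block alone can be pasted into a search box as an exact-match query.
query = connectivity_block(keys[1])
```

Searching with the block alone is "blinded" in the sense that the hash is not meant to be reversible to a structure, so anyone seeing the query would need to already hold the same structure to recognise it.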
25. Second 64K$ Q:
Can you file based on open-only diligence?
If convinced your novel series could become a billion$ drug, maybe not - but consider:
• Chances of completely missing an overlapping chemical series in
open sources from a competing patent are diminishing
• Prior art is confounded anyway by the 18-month publication shadow
and Markush enumeration
• Filing a 12-month provisional is a low-cost option
• Portal queries allow you to find relevant patents (e.g. by target name)
even if open chemistry extraction was limited
• The searches that really count are the ones the patent examiner does
for you (on payment) using all their sources (including PubChem)
• However, attorney costs for drafting applications need balancing
against savings on commercial patent resources
26. Conclusions
• The “Big Bang” of open chemistry and full text from patents now makes these
an essential part of IP and bioactivity assessments for SMEs
• The combination of SureChEMBL and other sources within PubChem
provides over 20 million patent-extracted structures and powerful analysis
options
• The gap between open and commercial has narrowed to the point you can at
least consider doing without the latter
• Note also the former has functionality absent from the latter
• Bioactivity identification, mining and target mapping are still challenging but
becoming easier
• It is important to understand patent chemistry automated extraction quirks,
artefacts, and pitfalls so you can filter these
27. References and questions
http://cdsouthan.blogspot.com/ 19 posts have the tag “patents”
http://www.ncbi.nlm.nih.gov/pubmed/26194581 http://www.ncbi.nlm.nih.gov/pubmed/23506624
(with PubMed Commons data link)
http://www.ncbi.nlm.nih.gov/pubmed/25415348 http://www.ncbi.nlm.nih.gov/pubmed/23399051
http://www.ncbi.nlm.nih.gov/pubmed/23618056