www.guidetopharmacology.org
The open patent chemistry “big bang”:
large opportunities for small enterprises
Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY,
Centre for Integrative Physiology, University of Edinburgh
ACS Mon, Mar 14 CINF: Division of Chemical Information, 79
SESSION: Chemical Information for Small Businesses & Startups
1:00 PM - 4:55 PM, Room 24C, 4:25 PM - 4:50 PM
1
http://www.slideshare.net/cdsouthan/patent-chemisty-big-bang-utilities-for-smes
Abstract (will be skipped for presentation)
2
In 2012, after the first IBM open deposition of 2.5 million structures, few would have
predicted that PubChem compounds that include patent-extracted submissions would
approach 20 million by 2015 (PMID 26194581). The current major open patent
chemistry feeds (in ascending size order) are NextMove, SCRIPDB, Thomson Pharma, IBM and
SureChEMBL. Comparative statistics for these sources will be presented, along with the
argument that the probability of covering prior-art structures for lead compounds is now
very high. The consequences are that the academic community and small companies
can now patent-mine extensively in PubChem and SureChEMBL, possibly even
without needing commercial sources to support their own filings. Other recent major
enabling aspects for small institutions include a) the open availability of patent full-text
for querying, b) a range of free tools for DIY chemistry extraction (PMID 23618056)
and c) automatic bioentity mark-up in patent text (e.g. protein names) from the
SureChEMBL/SciBite collaboration. Examples of DIY analysis of newly published
patents will be shown. Even for small enterprises not filing directly, open patent
chemistry presents a big expansion in accessible SAR space, and aspects of mining
this will be exemplified. However, open chemistry extraction does bring in a variety of
artefacts that add confounding structural “noise”. These include a) permutations of
mixtures and chiral exemplifications, b) virtual structures, c) the inability of extractions
from documents to indicate IP status directly and d) “common chemistry” swamping.
These problems and some partial solutions using PubChem filters will be discussed.
Encouraging preface
3
Outline
• Balancing IP against bioactivity mining
• Source coverage for patent extraction
• Caveats with automated extraction
• The example of US9056843
• Source extraction comparisons
• DIY extraction
• Questions on open searching
• Conclusions
• References
4
IP vs SAR from open patent mining
IP assessment
• Essential source of prior art chemistry
• De facto adjunct to commercial sources
• Improved portals (EPO, WIPO, FPOL)
• SureChEMBL, TRP & BindingDB active
• PubChem content is chemistry from patents, not patented chemistry
• CNER (chemical named entity recognition) is brainless compared to expert IP-relevance selection
• Claim section extraction often weak
• Extracted artefacts confounding (e.g. mixtures & virtuals)
• Dense image tables still a coverage gap
• IBM and SCRIPDB static in PubChem
• Asian chemistry shortfall
• The “common chemistry” problem
• Patent blitzing for drug candidates
Bioactivity data mining
• Circa 5x more SAR than literature
• Patent families collapse to < 100K C07D primary documents
• Advanced query options in SureChEMBL
• Bulk synthesis extraction (NextMove)
• Valuable intersects with papers, authors and targets via ChEMBL
• Easy intersecting with DIY chemistry extraction from any document
• Obfuscation in example > assay data
• Challenge of judging scientific quality
• Only ~5 million structures potentially linkable to bioactivity data
• Thus ~15 million have marginal utility
• CNER > structural multiplexing
5
Big chemistry: prior art statistics
March 2016 snapshots
• GDB-13: 907 million virtual structures (similarity search)
• Google InChIKey: 120+? million (exact match search)
• EBI UniChem: 110.7 million, 27 sources (exact match search)
• CAS: 109 million substances (commercial, similarity search)
• PubChem: 89 million, 390 sources (similarity search)
• ChemSpider: 43 million, 510 sources (similarity search)
• SureChEMBL: 16.8 million (similarity search)
• GVKBio: 6.2 million (commercial bioactivity capture from patents and papers, similarity search)
6
History of patent chemistry feeds into PubChem
• 2006 - Thomson (Reuters) Pharma (TRP) manual extractions from patents and papers (now 4.3 mil, ~40% patents)
• 2011 - IBM phase 1 chemical named entity recognition (CNER), 2.5 mil
  - SLING Consortium EPO extraction, 0.1 mil
• 2012 - SCRIPDB, CNER plus Complex Work Units (CWU), 4.0 mil
• 2013 - SureChem, CNER + image, 9.0 mil
• 2014 - BindingDB USPTO assay extraction (now 0.08 mil)
• 2015 - (CNER + images + CWU)
  - SureChEMBL, 13.0 mil
  - IBM phase 2, 7.0 mil
  - NextMove Software, 1.4 mil, synthesis mapping
• 2016 - SureChEMBL, 15.8 mil
  - CIDs from CNER extractions: 19.1 mil (out of 88.8 mil CIDs, 4th March)
  - Total patent chemistry, with estimate from TRP: ~20.5 mil
7
CNER patent sources vs. patent and paper curation:
corroboration and divergence
8
[Three-set Venn diagram. Counts are PubChem Compound Identifiers (CIDs) in millions.
Set totals: IBM + SCRIPDB + SureChEMBL + NextMove = 19.01; ChEMBL20 = 1.45; Thomson Pharma = 4.3.
Region counts as extracted: 17.3, 0.18, 1.4, 2.5, 0.12, 0.25, 0.9]
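The seven region counts above lost their positions when the diagram was flattened to text. As a rough cross-check, the sketch below tests one possible assignment of those counts to Venn regions against the three set totals quoted on the slide. The assignment is an assumption inferred from the arithmetic, not read from the original figure, and the short names `CNER` (for the four combined CNER sources) and `TRP` (Thomson Pharma) are labels chosen here for illustration.

```python
# Set totals quoted on the slide, in millions of PubChem CIDs.
SLIDE_TOTALS = {"CNER": 19.01, "ChEMBL20": 1.45, "TRP": 4.3}

# ASSUMPTION: this mapping of the seven extracted counts to Venn regions is
# inferred from the arithmetic only; the original figure was not available.
REGIONS = {
    frozenset({"CNER"}): 17.3,
    frozenset({"TRP"}): 2.5,
    frozenset({"ChEMBL20"}): 0.9,
    frozenset({"CNER", "TRP"}): 1.4,
    frozenset({"CNER", "ChEMBL20"}): 0.18,
    frozenset({"TRP", "ChEMBL20"}): 0.25,
    frozenset({"CNER", "TRP", "ChEMBL20"}): 0.12,
}


def totals_from_regions(regions):
    """Sum the region counts into one total per source set."""
    totals = {}
    for members, count in regions.items():
        for source in members:
            totals[source] = totals.get(source, 0.0) + count
    return totals


def max_discrepancy(regions, slide_totals):
    """Largest absolute gap between a computed total and the slide total."""
    computed = totals_from_regions(regions)
    return max(abs(computed[name] - value) for name, value in slide_totals.items())
```

Under this assignment the computed totals come to about 19.00, 1.45 and 4.27, each within 0.03 of the slide values, which is compatible with rounding. Other assignments fit less closely but cannot be ruled out from the text alone.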
CNER caveats (I) fragmentation: Mw plots
9
Can be partially ameliorated by using Mw ranking as a filter
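A minimal sketch of what Mw ranking as a filter could look like on a local list of extracted structures. The record type, the field names and the 250 g/mol cut-off are all hypothetical choices for illustration; the slide does not state a threshold.

```python
from typing import List, NamedTuple


class Extracted(NamedTuple):
    cid: int           # hypothetical identifier, for illustration only
    mol_weight: float  # molecular weight in g/mol


def rank_and_filter_by_mw(records: List[Extracted], min_mw: float = 250.0) -> List[Extracted]:
    """Rank records by descending molecular weight and drop likely fragments.

    min_mw is an assumed cut-off, not a value taken from the slides.
    """
    ranked = sorted(records, key=lambda r: r.mol_weight, reverse=True)
    return [r for r in ranked if r.mol_weight >= min_mw]
```

Ranking first and cutting second keeps the larger, more lead-like structures at the top of the list, so the low-Mw fragments that CNER tends to produce fall off the bottom.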
CNER caveats (II) the bioactivity-gap:
majority of patent chemistry has no linked assay data
10
CNER caveats (III): strange patent-unique structures
11
• Weird stuff is generally non-biological chemistry (i.e. not A61)
• For the record: C07D = 10.9, A61K = 0.9, (C07D + A61K) = 0.81 mil CIDs
CNER caveats (IV): mixture extractions (a mixed blessing)
12
• Mostly TFA or HCl salts
• Includes combination claims and reactant mixtures
• Causes sources to appear more divergent by exact match statistics
• PubChem splits to component CIDs while maintaining the back-mapping
• Can normalise with “CovalentUnitCount =1” filter
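The “CovalentUnitCount = 1” idea can be approximated offline on SMILES strings, where a "." separates disconnected components. The sketch below is an assumption-laden stand-in for the PubChem filter, not its implementation: it counts dot-separated parts, and it uses string length as a crude proxy for picking the parent structure out of a salt. SMILES that bond across a dot through shared ring-closure digits would be miscounted.

```python
from typing import List


def covalent_units(smiles: str) -> List[str]:
    """Split a SMILES string into its dot-separated components."""
    return [part for part in smiles.split(".") if part]


def covalent_unit_count(smiles: str) -> int:
    """Number of disconnected components (1 for a single molecule)."""
    return len(covalent_units(smiles))


def keep_single_unit(smiles_list: List[str]) -> List[str]:
    """Mimic a 'covalent unit count = 1' filter on a local list."""
    return [s for s in smiles_list if covalent_unit_count(s) == 1]


def largest_unit(smiles: str) -> str:
    """Crude salt stripping: keep the longest component string.

    String length is only a rough proxy for molecular size.
    """
    return max(covalent_units(smiles), key=len)
```

For example, `covalent_unit_count("CCN.Cl")` treats ethylamine hydrochloride as two units, so `keep_single_unit` would drop it while keeping the free base `"CCN"`.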
An example
13
“Trifluoromethyl-oxadiazole derivatives and their use in the treatment of disease” (Novartis)
PCT for the patent family: WO2013008162, 2013-01-17
SAR table
14
All three data sets extracted and example-numbered in BindingDB
PubChem retrieval by patent number -> series cluster
15
Extraction splits by source, date and isomeric connectivity
(it can get complicated...)
16
Different sources (SIDs) for same structure (CID)
Different CID isomers with same core connectivity
Impressive SureChEMBL family extraction
17
4830 rows 648 IDs mapped to 511 PubChem CIDs
Extraction source selectivity
• 151 BindingDB CIDs direct from PubChem
• 93 Thomson Pharma CIDs (within the 151 above)
• 296 SDFs from SciFinder > 269 CIDs
• 648 SureChEMBL IDs > 511 CIDs
• Numbers are not absolute because of “round tripping” mapping issues but they illustrate the selectivity and extent of open coverage
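Counts such as 648 source IDs collapsing to 511 CIDs arise because several source records can map to one CID and some may not map at all. The sketch below shows that bookkeeping on a plain dictionary. The function names and the example identifiers in the usage note are hypothetical, and no real mapping data is reproduced.

```python
from typing import Dict, List, Optional, Set, Tuple


def collapse_to_cids(id_to_cid: Dict[str, Optional[int]]) -> Tuple[Set[int], List[str]]:
    """Return the distinct CIDs and the source IDs that failed to map.

    A value of None marks a source ID with no CID (a 'round tripping' loss).
    """
    cids = {cid for cid in id_to_cid.values() if cid is not None}
    unmapped = sorted(src for src, cid in id_to_cid.items() if cid is None)
    return cids, unmapped


def is_contained(smaller: Set[int], larger: Set[int]) -> bool:
    """True if every CID of one source is also present in the other."""
    return smaller <= larger
```

With a hypothetical mapping `{"ID_A": 101, "ID_B": 101, "ID_C": 102, "ID_D": None}`, four source IDs collapse to two CIDs with one unmapped ID, the same pattern as the slide's counts on a smaller scale. `is_contained` expresses the kind of relationship stated for the Thomson Pharma CIDs sitting within the BindingDB set.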
18
Orthogonal entity mark-up (I): Ferret (Chrome plug-in)
19
Orthogonal entity mark-up (II): SciBite’s Termite (within SureChEMBL)
20
Roll-your-own extraction (I): OSRA
21
Roll-your-own extraction (II): ChemAxon chemicalize.org
22
Recent comparative analysis
• Compared SureChEMBL and IBM with SciFinder and Reaxys for a small patent set (i.e. open vs commercial)
• Concluded: “50–66 % of the relevant content from the latter was also found in the former”
• Equivalent comparisons executed in the latest PubChem with all patent sources would probably record a higher overlap
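An overlap figure of this kind is, in essence, the fraction of a reference set that is also found in a candidate set by exact identifier match. The sketch below computes that fraction for any hashable structure identifiers (InChIKeys would be a natural choice). It is an illustration of the arithmetic only and makes no claim to reproduce the method used in the cited study.

```python
from typing import Iterable


def coverage(reference_ids: Iterable[str], candidate_ids: Iterable[str]) -> float:
    """Fraction of reference identifiers that also occur in the candidate set."""
    reference = set(reference_ids)
    if not reference:
        raise ValueError("reference set is empty")
    found = reference & set(candidate_ids)
    return len(found) / len(reference)
```

Coverage is deliberately asymmetric: `coverage(commercial, open_source)` answers how much of the commercial content the open source recovers, which is the direction of the quoted comparison.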
23
Senger et al., “Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents”, J. Cheminf. 2015, 7:49, doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL)
http://www.ncbi.nlm.nih.gov/pubmed/26457120
First 64K$ Q:
can you search your novel chemistry in open dbs?
• The InChIKey connectivity layer already facilitates blinded exact match (isomer-agnostic) searching anywhere, including Google
• PubChem and SureChEMBL default to https, so searching is secure
• There is no (and never will be?) patent case law where novelty was challenged in court based on structures intercepted from public servers
• Without metadata (e.g. target & disease), interception per se is not much use
• As for sequence data, hard evidence of serious competitive damage via query interception remains zero (after 20+ years)
• Commercial dbs cannot capture all prior art, so you need an open check anyway
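The isomer-agnostic search in the first bullet relies on the layout of a standard InChIKey: the first 14-letter block is a hash of the connectivity, so keys that share it describe the same skeleton regardless of stereochemistry. A minimal sketch follows, with helper names chosen here for illustration. The alanine keys in the usage note are quoted from memory and should be checked against PubChem before being relied on.

```python
def connectivity_layer(inchikey: str) -> str:
    """Return the first (connectivity) block of a standard InChIKey."""
    block = inchikey.strip().upper().split("-")[0]
    if len(block) != 14 or not block.isalpha():
        raise ValueError(f"not a standard InChIKey: {inchikey!r}")
    return block


def same_connectivity(key_a: str, key_b: str) -> bool:
    """True if two InChIKeys share the same connectivity block."""
    return connectivity_layer(key_a) == connectivity_layer(key_b)
```

For instance, `same_connectivity("QNAYBMKLOCPYGJ-REOHGBFRSA-N", "QNAYBMKLOCPYGJ-UWTATZPHSA-N")` compares the keys recalled for L- and D-alanine and reports a shared skeleton. Pasting only the 14-letter block into a search engine is the blinded query the slide describes, since the hash cannot be reversed into a structure by inspection.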
24
Second 64K$ Q:
Can you file based on open-only diligence?
If you are convinced your novel series will lead to a billion-dollar drug, maybe not, but consider:
• Chances of completely missing an overlapping chemical series in open sources from a competing patent are diminishing
• Prior art is confounded anyway by the 18-month publication shadow and Markush enumeration
• Filing a 12-month provisional is a low-cost option
• Portal queries allow you to find relevant patents (e.g. by target name) even if open chemistry extraction was limited
• The searches that really count are the ones the patent examiner does for you (on payment) using all their sources (including PubChem)
• However, attorney costs for drafting applications need balancing against savings on commercial patent resources
25
Conclusions
• The “Big Bang” of open chemistry and full text from patents now makes these an essential part of IP and bioactivity assessments for SMEs
• The combination of SureChEMBL and other sources within PubChem provides over 20 million patent-extracted structures and powerful analysis options
• The gap between open and commercial has narrowed to the point where you can at least consider doing without the latter
• Note also that the former has functionality absent from the latter
• Bioactivity identification, mining and target mapping are still challenging but becoming easier
• It is important to understand the quirks, artefacts and pitfalls of automated patent chemistry extraction so you can filter these
26
References and questions
27
http://cdsouthan.blogspot.com/ (19 posts have the tag “patents”)
http://www.ncbi.nlm.nih.gov/pubmed/26194581
http://www.ncbi.nlm.nih.gov/pubmed/23506624
(with PubMed Commons data link)
http://www.ncbi.nlm.nih.gov/pubmed/25415348
http://www.ncbi.nlm.nih.gov/pubmed/23399051
http://www.ncbi.nlm.nih.gov/pubmed/23618056

Looking at chemistry - protein - papers connectivity in ELIXIR
 
Guide to Immunopharmacology update
Guide to Immunopharmacology updateGuide to Immunopharmacology update
Guide to Immunopharmacology update
 
Druggable Proteome sources in UniProt
Druggable Proteome sources in UniProtDruggable Proteome sources in UniProt
Druggable Proteome sources in UniProt
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdb
 
Pub Med to PubChem Connectivity
Pub Med to PubChem ConnectivityPub Med to PubChem Connectivity
Pub Med to PubChem Connectivity
 

Dernier

GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptxryanrooker
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learninglevieagacer
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...chandars293
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusNazaninKarimi6
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Serviceshivanisharma5244
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLkantirani197
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Silpa
 
Introduction to Viruses
Introduction to VirusesIntroduction to Viruses
Introduction to VirusesAreesha Ahmad
 
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
chemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfchemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfTukamushabaBismark
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 

Dernier (20)

GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
Introduction to Viruses
Introduction to VirusesIntroduction to Viruses
Introduction to Viruses
 
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Model Escorts | 100% verified
 
chemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfchemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdf
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 

Patent chemistry big bang: utilities for SMEs

  • 1. The open patent chemistry “big bang”: large opportunities for small enterprises. Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Integrative Physiology, University of Edinburgh (www.guidetopharmacology.org). ACS, Mon, Mar 14. CINF: Division of Chemical Information, 79. SESSION: Chemical Information for Small Businesses & Startups, 1:00 PM - 4:55 PM, Room 24C, 4:25pm - 4:50pm. http://www.slideshare.net/cdsouthan/patent-chemisty-big-bang-utilities-for-smes
  • 2. Abstract (will be skipped for presentation): In 2012, after the first IBM open deposition of 2.5 million structures, few would have predicted that PubChem compounds that include patent-extracted submissions would approach 20 million by 2015 (PMID 26194581). The current major open patent chemistry feeds (in ascending size order) are NextMove, SCRIPDB, Thomson Pharma, IBM and SureChEMBL. The comparative statistics of sources, and the arguments that the coverage probability of lead compound prior-art structures is now very high, will be presented. The consequences are that the academic community and small companies can now patent-mine extensively in PubChem and SureChEMBL, possibly even without needing commercial sources to support their own filings. Other recent major enabling aspects for small institutions include a) the open availability of patent full-text for querying, b) a range of free tools for DIY chemistry extraction (PMID 23618056) and c) automatic bioentity mark-up in patent text (e.g. protein names) from the SureChEMBL/SciBite collaboration. Examples of DIY analysis of newly published patents will be shown. Even for small enterprises not filing directly, open patent chemistry presents a big expansion in accessible SAR space, and aspects of mining this will be exemplified. However, open chemistry extraction does bring in a variety of artefacts that add confounding structural “noise”. These include a) permutations of mixtures and chiral exemplifications, b) virtual structures, c) the fact that extractions from documents cannot directly indicate IP status and d) “common chemistry” swamping. These problems and some partial solutions using PubChem filters will be discussed.
  • 4. Outline
      • Balancing IP against bioactivity mining
      • Source coverage for patent extraction
      • Caveats with automated extraction
      • The example of US9056843
      • Source extraction comparisons
      • DIY extraction
      • Questions on open searching
      • Conclusions
      • References
  • 5. IP vs SAR from open patent mining
    IP assessment
      • Essential source of prior art chemistry
      • De facto adjunct to commercial sources
      • Improved portals (EPO, WIPO, FPOL)
      • SureChEMBL, TRP & BindingDB active
      • PubChem content is chemistry from patents, not patented chemistry
      • CNER brainless compared to expert IP-relevance selection
      • Claim section extraction often weak
      • Extracted artefacts confounding (e.g. mixtures & virtuals)
      • Dense image tables still a coverage gap
      • IBM and SCRIPDB static in PubChem
      • Asian chemistry shortfall
      • The “common chemistry” problem
      • Patent blitzing for drug candidates
    Bioactivity data mining
      • Circa 5x more SAR than literature
      • Patent families collapse to < 100K C07D primary documents
      • Advanced query options in SureChEMBL
      • Bulk synthesis extraction (NextMove)
      • Valuable intersects with papers, authors and targets via ChEMBL
      • Easy intersecting with DIY chemistry extraction from any document
      • Obfuscation in example > assay data
      • Challenge of judging scientific quality
      • Only ~ 5 mil structures potentially linkable to bioactivity data
      • Thus ~ 15 million have marginal utility
      • CNER > structural multiplexing
  • 6. Big chemistry: prior art statistics (March 2016 snapshots)
      • GDB-13: 907 million virtual structures (similarity search)
      • Google InChIKey: 120+? million (exact match search)
      • EBI UniChem: 110.7 million, 27 sources (exact match search)
      • CAS: 109 million substances (commercial, similarity search)
      • PubChem: 89 million, 390 sources (similarity search)
      • ChemSpider: 43 million, 510 sources (similarity search)
      • SureChEMBL: 16.8 million (similarity search)
      • GVKBio: 6.2 million (commercial bioactivity capture from patents and papers, similarity search)
  • 7. History of patent chemistry feeds into PubChem
      • 2006 - Thomson (Reuters) Pharma (TRP) manual extractions from patents and papers (now 4.3 mil, ~40% patents)
      • 2011 - IBM phase 1 chemical named entity recognition (CNER) 2.5 mil; SLING Consortium EPO extraction 0.1 mil
      • 2012 - SCRIPDB, CNER plus Complex Work Units (CWU) 4.0 mil
      • 2013 - SureChem, CNER + image, 9.0 mil
      • 2014 - BindingDB USPTO assay extraction (now 0.08 mil)
      • 2015 - (CNER + images + CWU)
          • SureChEMBL 13.0 mil
          • IBM phase 2, 7.0 mil
          • NextMove Software 1.4 mil synthesis mapping
      • 2016 - SureChEMBL 15.8 mil
      • CIDs from CNER extractions 19.1 mil (from 88.8 mil, 4th March)
      • Total patent chemistry with estimate from TRP ~ 20.5 mil
  • 8. CNER patent sources vs. patent and paper curation: corroboration and divergence
    [Venn diagram; counts are PubChem Compound Identifiers (CIDs) in millions]
      • IBM + SCRIPDB + SureChEMBL + NextMove = 19.01
      • ChEMBL20 = 1.45
      • Thomson Pharma = 4.3
      • Segment counts (labels lost in extraction): 17.3, 0.18, 1.4, 2.5, 0.12, 0.25, 0.9
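The seven unlabelled numbers on slide 8 look like the segment counts of a three-set Venn diagram whose labels were lost in extraction. As a hedged sketch, the brute-force check below tries every assignment of the seven counts to the seven Venn regions and keeps those that reproduce the three stated totals to within 0.05 million. The region names, the tolerance and the function name are assumptions made for illustration; none of them come from the slide.

```python
from itertools import permutations

# Segment counts (millions of CIDs) as extracted from the slide, labels lost.
SEGMENTS = [17.3, 0.18, 1.4, 2.5, 0.12, 0.25, 0.9]

# Stated totals per source group (millions of CIDs).
TOTALS = {"CNER": 19.01, "ChEMBL": 1.45, "TRP": 4.3}

# The seven regions of a three-set Venn diagram, named by the sets they lie in.
REGIONS = [
    ("CNER",), ("ChEMBL",), ("TRP",),
    ("CNER", "ChEMBL"), ("CNER", "TRP"), ("ChEMBL", "TRP"),
    ("CNER", "ChEMBL", "TRP"),
]

def consistent_assignments(segments, totals, tol=0.05):
    """Return every region -> count mapping whose per-set sums match the totals."""
    found = []
    for perm in permutations(segments):
        mapping = dict(zip(REGIONS, perm))
        ok = all(
            abs(sum(v for region, v in mapping.items() if name in region) - total) <= tol
            for name, total in totals.items()
        )
        if ok:
            found.append(mapping)
    return found

solutions = consistent_assignments(SEGMENTS, TOTALS)
for sol in solutions:
    print({"+".join(region): count for region, count in sol.items()})
```

Under these assumptions the totals alone fix most regions (17.3 as CNER-only, 2.5 as TRP-only) but leave 0.18 and 0.12 interchangeable between the CNER+ChEMBL overlap and the three-way overlap, so the original figure is still needed to settle those two.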
  • 9. CNER caveats (I) fragmentation: Mw plots. Can be partially ameliorated by using Mw ranking as a filter
  • 10. CNER caveats (II) the bioactivity-gap: majority of patent chemistry has no linked assay data
  • 11. CNER caveats (III): strange patent-unique structures
      • Weird stuff is generally non-biological chemistry (i.e. not A61)
      • For the record, C07D = 10.9, A61K = 0.9, (C07D + A61K) = 0.81 mil CIDs
  • 12. CNER caveats (IV): mixture extractions (a mixed blessing)
      • Mostly TFA or HCl salts
      • Includes combination claims and reactant mixtures
      • Causes sources to appear more divergent by exact match statistics
      • PubChem splits to component CIDs while maintaining the back-mapping
      • Can normalise with “CovalentUnitCount =1” filter
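The single-covalent-unit idea can be approximated locally on a DIY extraction. A minimal sketch, assuming structures are held as SMILES strings, where a `.` separates disconnected components; the function names and example strings are illustrative, and this is a stand-in for the PubChem filter rather than a reimplementation of it.

```python
def covalent_unit_count(smiles: str) -> int:
    """Count dot-separated components in a SMILES string.

    Assumption: no ring-closure digits are used to bond across a '.',
    which SMILES permits but extraction tools rarely emit.
    """
    return len([part for part in smiles.split(".") if part])

def keep_single_units(smiles_list):
    """Keep only structures made of one covalently bonded unit."""
    return [s for s in smiles_list if covalent_unit_count(s) == 1]

examples = [
    "CC(=O)Oc1ccccc1C(=O)O",   # aspirin, one unit
    "CN.Cl",                   # methylamine hydrochloride, two units
    "CCN.OC(=O)C(F)(F)F",      # ethylamine TFA salt, two units
]
print(keep_single_units(examples))
```

Dropping the salt forms outright is the bluntest option; keeping the largest component of each mixture instead would preserve the parent structure, at the cost of losing the record that it was exemplified as a salt.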
  • 13. An example: “Trifluoromethyl-oxadiazole derivatives and their use in the treatment of disease” (Novartis). PCT for the patent family: WO2013008162, 2013-01-17
  • 14. SAR table: all three data sets extracted and example-numbered in BindingDB
  • 15. PubChem retrieval by patent number -> series cluster
  • 16. Extraction splits by source, date and isomeric connectivity (it can get complicated...)
      • Different sources (SIDs) for same structure (CID)
      • Different CID isomers with same core connectivity
  • 17. Impressive SureChEMBL family extraction: 4830 rows; 648 IDs mapped to 511 PubChem CIDs
  • 18. Extraction source selectivity
      • 151 BindingDB CIDs direct from PubChem
      • 93 Thomson Pharma CIDs (within the 151 above)
      • 296 SDFs from SciFinder > 269 CIDs
      • 648 SureChEMBL IDs > 511 CIDs
      • Numbers are not absolute because of “round tripping” mapping issues but they illustrate the selectivity and extent of open coverage
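Source selectivity of this kind comes down to set comparisons on CIDs. The sketch below uses small made-up CID sets (not the real lists behind the slide) to show how overlap with a reference set, and what each source adds beyond it, might be tabulated; `coverage_report` and all of the IDs are hypothetical.

```python
def coverage_report(reference, sources):
    """For each source, count reference CIDs it contains and CIDs it adds."""
    report = {}
    for name, cids in sources.items():
        shared = reference & cids
        report[name] = {
            "shared": len(shared),
            "fraction_of_reference": len(shared) / len(reference),
            "unique_to_source": len(cids - reference),
        }
    return report

# Toy CID sets, invented for illustration only.
bindingdb = {101, 102, 103, 104, 105}
sources = {
    "Thomson Pharma": {101, 102, 103},                    # wholly inside the reference
    "SureChEMBL": {101, 102, 103, 104, 201, 202, 203},    # overlaps and extends it
}
for name, row in coverage_report(bindingdb, sources).items():
    print(name, row)
```

Because exact-match sets are sensitive to salt forms and isomer splits, counts produced this way carry the same “round tripping” caveat as the slide.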
  • 19. Orthogonal entity mark-up (I): Ferret (Chrome plug-in)
  • 20. Orthogonal entity mark-up (II): SciBite’s Termite (within SureChEMBL)
  • 22. Roll-your-own extraction (I): ChemAxon chemicalize.org
  • 23. Recent comparative analysis
      • Compared SureChEMBL and IBM with SciFinder and Reaxys for a small patent set (i.e. open vs commercial)
      • Concluded: “50–66 % of the relevant content from the latter was also found in the former”
      • Equivalent comparisons executed in the latest PubChem with all patent sources would probably record a higher overlap
    Reference: Senger et al., Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents, J. Cheminf. 2015, 7:49, doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL), http://www.ncbi.nlm.nih.gov/pubmed/26457120
  • 24. First 64K$ Q: can you search your novel chemistry in open dbs?
      • The InChIKey connectivity layer already facilitates blinded exact match (isomer-agnostic) searching anywhere, including Google
      • PubChem and SureChEMBL default to https, so searching is secure
      • There is no (and never will be?) patent case law where novelty was challenged in court based on structures intercepted from public servers
      • Without metadata (e.g. target & disease), interception per se is not much use
      • As for sequence data, hard evidence of serious competitive damage via query interception remains zero (after 20+ years)
      • Commercial dbs cannot capture all prior art, so an open check is needed anyway
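The isomer-agnostic match mentioned on slide 24 relies on the layout of a standard InChIKey: three hyphen-separated blocks of 14, 10 and 1 characters, with the first block hashing the connectivity. A minimal sketch of comparing keys on that first block only; the function names are illustrative, and the example keys (alanine enantiomers and aspirin) are quoted as illustrations and should be checked against PubChem before being relied on.

```python
def connectivity_block(inchikey: str) -> str:
    """Return the first 14-character block of a standard InChIKey."""
    first, second, third = inchikey.strip().upper().split("-")
    if len(first) != 14 or len(second) != 10 or len(third) != 1:
        raise ValueError(f"not a 14-10-1 InChIKey: {inchikey!r}")
    return first

def same_connectivity(key_a: str, key_b: str) -> bool:
    """True when two keys share the connectivity block, whatever the stereo layer."""
    return connectivity_block(key_a) == connectivity_block(key_b)

# Example keys, quoted for illustration (verify before use).
L_ALA = "QNAYBMKLOCPYGJ-REOHCLBHSA-N"
D_ALA = "QNAYBMKLOCPYGJ-UWTATZPHSA-N"
ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"

print(same_connectivity(L_ALA, D_ALA))    # enantiomers share the first block
print(same_connectivity(L_ALA, ASPIRIN))  # different skeletons do not
```

Only the 14-character block needs to leave your machine for a search of this kind, which is what makes the query “blinded”: the hash cannot be converted back into a structure.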
  • 25. Second 64K$ Q: Can you file based on open-only diligence?
    If convinced your novel series < billion$ drug, maybe not - but consider
      • Chances of completely missing an overlapping chemical series in open sources from a competing patent are diminishing
      • Prior art is confounded anyway by the 18-month publication shadow and Markush enumeration
      • Filing a 12-month provisional is a low-cost option
      • Portal queries allow you to find relevant patents (e.g. by target name) even if open chemistry extraction was limited
      • The searches that really count are the ones the patent examiner does for you (on payment) using all their sources (including PubChem)
      • However, attorney costs for drafting applications need balancing against savings on commercial patent resources
  • 26. Conclusions
      • The “Big Bang” of open chemistry and full text from patents now makes these an essential part of IP and bioactivity assessments for SMEs
      • The combination of SureChEMBL and other sources within PubChem provides over 20 million patent-extracted structures and powerful analysis options
      • The gap between open and commercial has narrowed to the point where you can at least consider doing without the latter
      • Note also that the former has functionality absent from the latter
      • Bioactivity identification, mining and target mapping are still challenging but becoming easier
      • It is important to understand the quirks, artefacts and pitfalls of automated patent chemistry extraction so you can filter these
  • 27. References and questions
      • http://cdsouthan.blogspot.com/ (19 posts have the tag “patents”)
      • http://www.ncbi.nlm.nih.gov/pubmed/26194581
      • http://www.ncbi.nlm.nih.gov/pubmed/23506624 (with PubMed Commons data link)
      • http://www.ncbi.nlm.nih.gov/pubmed/25415348
      • http://www.ncbi.nlm.nih.gov/pubmed/23399051
      • http://www.ncbi.nlm.nih.gov/pubmed/23618056